문c 블로그

RCU(Read Copy Update) -2- (Callback process)

2016-12-022021-04-13 문영일 2 Comments

RCU(Read Copy Update) -2- (Callback process)

RCU_BH 및 RCU_SCHED state 제거

커널 v4.20-rc1 부터 rcu_preempt, rcu_bh 및 rcu_sched와 같이 3 종류의 state를 사용하여 왔는데 rcu_preempt 하나로 통합되었다.

call_rcu_bh() 및 call_rcu_sched() 함수는 제거되었고, 대신 call_rcu() 만을 사용한다.

다음은 커널 v4.19에서 사용하는 rcu 스레드를 보여준다.

$ ps -ef | grep rcu
root         3     2  0 Jan15 ?        00:00:00 [rcu_gp]
root         4     2  0 Jan15 ?        00:00:00 [rcu_par_gp]
root        10     2  0 Jan15 ?        00:27:01 [rcu_preempt]
root        11     2  0 Jan15 ?        00:00:10 [rcu_sched]
root        12     2  0 Jan15 ?        00:00:00 [rcu_bh]
root        32     2  0 Jan15 ?        00:00:00 [rcu_tasks_kthre]

다음은 커널 v5.4에서 사용하는 rcu 스레드를 보여준다.

rcu_sched 및 rcu_bh가 없는 것을 확인할 수 있다.

$ ps -ef | grep rcu
root         3     2  0 Mar10 ?        00:00:00 [rcu_gp]
root         4     2  0 Mar10 ?        00:00:00 [rcu_par_gp]
root        10     2  0 Mar10 ?        00:00:06 [rcu_preempt]
root        20     2  0 Mar10 ?        00:00:00 [rcu_tasks_kthre]

RCU 동기 Update

synchronize_rcu()

kernel/rcu/tree.c

/**
 * synchronize_rcu - wait until a grace period has elapsed.
 *
 * Control will return to the caller some time after a full grace
 * period has elapsed, in other words after all currently executing RCU
 * read-side critical sections have completed.  Note, however, that
 * upon return from synchronize_rcu(), the caller might well be executing
 * concurrently with new RCU read-side critical sections that began while
 * synchronize_rcu() was waiting.  RCU read-side critical sections are
 * delimited by rcu_read_lock() and rcu_read_unlock(), and may be nested.
 * In addition, regions of code across which interrupts, preemption, or
 * softirqs have been disabled also serve as RCU read-side critical
 * sections.  This includes hardware interrupt handlers, softirq handlers,
 * and NMI handlers.
 *
 * Note that this guarantee implies further memory-ordering guarantees.
 * On systems with more than one CPU, when synchronize_rcu() returns,
 * each CPU is guaranteed to have executed a full memory barrier since
 * the end of its last RCU read-side critical section whose beginning
 * preceded the call to synchronize_rcu().  In addition, each CPU having
 * an RCU read-side critical section that extends beyond the return from
 * synchronize_rcu() is guaranteed to have executed a full memory barrier
 * after the beginning of synchronize_rcu() and before the beginning of
 * that RCU read-side critical section.  Note that these guarantees include
 * CPUs that are offline, idle, or executing in user mode, as well as CPUs
 * that are executing in the kernel.
 *
 * Furthermore, if CPU A invoked synchronize_rcu(), which returned
 * to its caller on CPU B, then both CPU A and CPU B are guaranteed
 * to have executed a full memory barrier during the execution of
 * synchronize_rcu() -- even if CPU A and CPU B are the same CPU (but
 * again only if the system has more than one CPU).
 */

void synchronize_rcu(void)
{
        RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) ||
                         lock_is_held(&rcu_lock_map) ||
                         lock_is_held(&rcu_sched_lock_map),
                         "Illegal synchronize_rcu() in RCU read-side critical section");
        if (rcu_blocking_is_gp())
                return;
        if (rcu_gp_is_expedited())
                synchronize_rcu_expedited();
        else
                wait_rcu_gp(call_rcu);
}
EXPORT_SYMBOL_GPL(synchronize_rcu);

grace period가 지날때까지 기다린다(sleep).

코드 라인 7~8에서 아직 rcu의 gp 사용이 블러킹상태인 경우엔 gp 대기 없이 곧바로 함수를 빠져나간다.
- preemptible 커널에서 아직 rcu 스케쥴러가 동작하지 않는 경우
- 부팅 중인 경우 또는 1개의 online cpu만을 사용하는 경우
코드 라인 9~12에서 gp를 대기할 때 조건에 따라 다음 두 가지중 하나를 선택하여 동작한다.
- 더 신속한 Brute-force RCU grace period 방법
- 일반 RCU grace period 방법

rcu_blocking_is_gp()

kernel/rcu/tree.c

/*
 * During early boot, any blocking grace-period wait automatically
 * implies a grace period.  Later on, this is never the case for PREEMPT.
 *
 * Howevr, because a context switch is a grace period for !PREEMPT, any
 * blocking grace-period wait automatically implies a grace period if
 * there is only one CPU online at any point time during execution of
 * either synchronize_rcu() or synchronize_rcu_expedited().  It is OK to
 * occasionally incorrectly indicate that there are multiple CPUs online
 * when there was in fact only one the whole time, as this just adds some
 * overhead: RCU still operates correctly.
 */

static int rcu_blocking_is_gp(void)
{
        int ret;

        if (IS_ENABLED(CONFIG_PREEMPTION))
                return rcu_scheduler_active == RCU_SCHEDULER_INACTIVE;
        might_sleep();  /* Check for RCU read-side critical section. */
        preempt_disable();
        ret = num_online_cpus() <= 1;
        preempt_enable();
        return ret;
}

rcu의 gp 사용이 블러킹된 상태인지 여부를 반환한다.

코드 라인 5~6에서 preemptible 커널인 경우 rcu 스케쥴러가 비활성화 여부에 따라 반환한다.
코드 라인 7~11에서 online cpu수를 카운트하여 1개 이하인 경우 blocking 상태로 반환한다. 1개의 cpu인 경우 might_sleep() 이후에 preempt_disable() 및 preempt_enable()을 반복하였으므로 필요한 GP는 완료된 것으로 간주한다.

다음 그림은 동기화 rcu 요청 함수인 synchronize_rcu() 및 비동기 rcu 요청 함수인 call_rcu() 함수 두 가지에 대해 각각 호출 경로를 보여준다.

wait_rcu_gp()

include/linux/rcupdate_wait.h

#define wait_rcu_gp(...) _wait_rcu_gp(false, __VA_ARGS__)

Grace Period가 완료될 때까지 대기한다. 인자로 gp 완료를 대기하는 다음 두 함수 중 하나가 주어진다.

call_rcu()
call_rcu_tasks()

_wait_rcu_gp()

include/linux/rcupdate_wait.h

#define _wait_rcu_gp(checktiny, ...) \
do {                                                                    \
        call_rcu_func_t __crcu_array[] = { __VA_ARGS__ };               \
        struct rcu_synchronize __rs_array[ARRAY_SIZE(__crcu_array)];    \
        __wait_rcu_gp(checktiny, ARRAY_SIZE(__crcu_array),              \
                        __crcu_array, __rs_array);                      \
} while (0)

가변 인자를 받아 처리할 수 있게 하였으나, 현제 실제 커널 코드에서는 1 개의 인자만 받아 처리하고 있다.

__wait_rcu_gp()

kernel/rcu/update.c

void __wait_rcu_gp(bool checktiny, int n, call_rcu_func_t *crcu_array,
                   struct rcu_synchronize *rs_array)
{
        int i;
        int j;

        /* Initialize and register callbacks for each crcu_array element. */
        for (i = 0; i < n; i++) {
                if (checktiny &&
                    (crcu_array[i] == call_rcu)) {
                        might_sleep();
                        continue;
                }
                init_rcu_head_on_stack(&rs_array[i].head);
                init_completion(&rs_array[i].completion);
                for (j = 0; j < i; j++)
                        if (crcu_array[j] == crcu_array[i])
                                break;
                if (j == i)
                        (crcu_array[i])(&rs_array[i].head, wakeme_after_rcu);
        }

        /* Wait for all callbacks to be invoked. */
        for (i = 0; i < n; i++) {
                if (checktiny &&
                    (crcu_array[i] == call_rcu))
                        continue;
                for (j = 0; j < i; j++)
                        if (crcu_array[j] == crcu_array[i])
                                break;
                if (j == i)
                        wait_for_completion(&rs_array[i].completion);
                destroy_rcu_head_on_stack(&rs_array[i].head);
        }
}
EXPORT_SYMBOL_GPL(__wait_rcu_gp);

Grace Period가 완료될 때까지 대기한다. (@checktiny는 tiny/true 모델을 구분한다. 두 번째 인자는 전달 되는 array 크기를 갖고, 세 번째 인자는 호출할 비동기 콜백 함수(call_rcu() 또는 call_rcu_tasks())가 주어지며 마지막으로 네 번째 인자에는 gp 대기를 위한 rcu_synchronize 구조체 배열이 전달된다)

코드 라인 8~13에서 인자 수 @n 만큼 순회하며 첫 번째 인자 @checktiny 가 설정되었고, 세 번째 인자로 call_rcu 함수가 지정된 경우 preemption pointer를 실행한 후 skip 한다.
- 현재 커널 코드들에서는 @checktiny=0으로 호출되고 있다.
코드 라인 14~15에서 마지막 인자로 제공된 @rs_array의 rcu head를 스택에서 초기화하고, 또한 completion도 초기화한다.
- @rs_array에는 gp 비동기 처리를 위한 다음 두 함수가 사용되고 있다.
코드 라인 16~20에서 중복된 함수 호출이 없으면 세 번째 인자로 전달 받은 다음의 gp 비동기 처리 함수 중 하나를 호출하고, 인자로 rcuhead와 wakeme_after_rcu() 콜백 함수를지정한다.
- call_rcu()
- call_rcu_tasks()
코드 라인 24~27에서 인자 수 @n 만큼 순회하며 인자로 call_rcu 함수가 지정된 경우 skip 한다.
코드 라인 28~32에서중복된 함수 호출이 없으면 순회 중인 인덱스에 해당하는 콜백 함수가 처리 완료될 때까지 대기한다.
코드 라인 33에서 스택에 위치한 rcu head를 제거한다.

wakeme_after_rcu()

kernel/rcu/update.c

/**
 * wakeme_after_rcu() - Callback function to awaken a task after grace period
 * @head: Pointer to rcu_head member within rcu_synchronize structure
 *
 * Awaken the corresponding task now that a grace period has elapsed.
 */

void wakeme_after_rcu(struct rcu_head *head)
{
        struct rcu_synchronize *rcu;

        rcu = container_of(head, struct rcu_synchronize, head);
        complete(&rcu->completion);
}
EXPORT_SYMBOL_GPL(wakeme_after_rcu);

gp가 완료되었음을 알리도록 complete 처리를 한다. (콜백 함수)

RCU 비동기 Update

call_rcu()

/**
 * call_rcu() - Queue an RCU callback for invocation after a grace period.
 * @head: structure to be used for queueing the RCU updates.
 * @func: actual callback function to be invoked after the grace period
 *
 * The callback function will be invoked some time after a full grace
 * period elapses, in other words after all pre-existing RCU read-side
 * critical sections have completed.  However, the callback function
 * might well execute concurrently with RCU read-side critical sections
 * that started after call_rcu() was invoked.  RCU read-side critical
 * sections are delimited by rcu_read_lock() and rcu_read_unlock(), and
 * may be nested.  In addition, regions of code across which interrupts,
 * preemption, or softirqs have been disabled also serve as RCU read-side
 * critical sections.  This includes hardware interrupt handlers, softirq
 * handlers, and NMI handlers.
 *
 * Note that all CPUs must agree that the grace period extended beyond
 * all pre-existing RCU read-side critical section.  On systems with more
 * than one CPU, this means that when "func()" is invoked, each CPU is
 * guaranteed to have executed a full memory barrier since the end of its
 * last RCU read-side critical section whose beginning preceded the call
 * to call_rcu().  It also means that each CPU executing an RCU read-side
 * critical section that continues beyond the start of "func()" must have
 * executed a memory barrier after the call_rcu() but before the beginning
 * of that RCU read-side critical section.  Note that these guarantees
 * include CPUs that are offline, idle, or executing in user mode, as
 * well as CPUs that are executing in the kernel.
 *
 * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the
 * resulting RCU callback function "func()", then both CPU A and CPU B are
 * guaranteed to execute a full memory barrier during the time interval
 * between the call to call_rcu() and the invocation of "func()" -- even
 * if CPU A and CPU B are the same CPU (but again only if the system has
 * more than one CPU).
 */

void call_rcu(struct rcu_head *head, rcu_callback_t func)
{
        __call_rcu(head, func, 0);
}
EXPORT_SYMBOL_GPL(call_rcu);

rcu 콜백 함수 @func을 등록한다. 이 rcu 콜백 함수는 GP(Grace Period) 완료 후 호출된다.

__call_rcu()

kernel/rcu/tree.c

/*
 * Helper function for call_rcu() and friends.  The cpu argument will
 * normally be -1, indicating "currently running CPU".  It may specify
 * a CPU only if that CPU is a no-CBs CPU.  Currently, only rcu_barrier()
 * is expected to specify a CPU.
 */

static void
__call_rcu(struct rcu_head *head, rcu_callback_t func, bool lazy)
{
        unsigned long flags;
        struct rcu_data *rdp;
        bool was_alldone;

        /* Misaligned rcu_head! */
        WARN_ON_ONCE((unsigned long)head & (sizeof(void *) - 1));

        if (debug_rcu_head_queue(head)) {
                /*
                 * Probable double call_rcu(), so leak the callback.
                 * Use rcu:rcu_callback trace event to find the previous
                 * time callback was passed to __call_rcu().
                 */
                WARN_ONCE(1, "__call_rcu(): Double-freed CB %p->%pS()!!!\n",
                          head, head->func);
                WRITE_ONCE(head->func, rcu_leak_callback);
                return;
        }
        head->func = func;
        head->next = NULL;
        local_irq_save(flags);
        rdp = this_cpu_ptr(&rcu_data);

        /* Add the callback to our list. */
        if (unlikely(!rcu_segcblist_is_enabled(&rdp->cblist))) {
                // This can trigger due to call_rcu() from offline CPU:
                WARN_ON_ONCE(rcu_scheduler_active != RCU_SCHEDULER_INACTIVE);
                WARN_ON_ONCE(!rcu_is_watching());
                // Very early boot, before rcu_init().  Initialize if needed
                // and then drop through to queue the callback.
                if (rcu_segcblist_empty(&rdp->cblist))
                        rcu_segcblist_init(&rdp->cblist);
        }
        if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags))
                return; // Enqueued onto ->nocb_bypass, so just leave.
        /* If we get here, rcu_nocb_try_bypass() acquired ->nocb_lock. */
        rcu_segcblist_enqueue(&rdp->cblist, head, lazy);
        if (__is_kfree_rcu_offset((unsigned long)func))
                trace_rcu_kfree_callback(rcu_state.name, head,
                                         (unsigned long)func,
                                         rcu_segcblist_n_lazy_cbs(&rdp->cblist),
                                         rcu_segcblist_n_cbs(&rdp->cblist));
        else
                trace_rcu_callback(rcu_state.name, head,
                                   rcu_segcblist_n_lazy_cbs(&rdp->cblist),
                                   rcu_segcblist_n_cbs(&rdp->cblist));

        /* Go handle any RCU core processing required. */
        if (IS_ENABLED(CONFIG_RCU_NOCB_CPU) &&
            unlikely(rcu_segcblist_is_offloaded(&rdp->cblist))) {
                __call_rcu_nocb_wake(rdp, was_alldone, flags); /* unlocks */
        } else {
                __call_rcu_core(rdp, head, flags);
                local_irq_restore(flags);
        }
}

call_rcu의 헬퍼 함수로 grace period가 지난 후에 RCU 콜백을 호출할 수 있도록 큐에 추가한다. @lazy가 true인 경우 kfree_call_rcu() 함수에서 호출하는 경우이며 @func에는 콜백 함수 대신 offset이 담긴다.

코드 라인 11~21에서 CONFIG_DEBUG_OBJECTS_RCU_HEAD 커널 옵션을 사용하여 디버깅을 하는 동안 __call_rcu() 함수가 이중 호출되는 경우 경고 메시지를 출력하고 함수를 빠져나간다.
코드 라인 22~23에서 rcu 인스턴스에 인수로 전달받은 콜백 함수를 대입한다. rcu는 리스트로 연결되는데 가장 마지막에 추가되므로 마지막을 의미하는 null을 대입한다.
코드 라인 25에서 cpu별 rcu 데이터를 표현한 자료 구조에서 현재 cpu에 대한 rcu 데이터를 rdp에 알아온다.
코드 라인 28~36에서 낮은 확률로 early 부트 중이라 disable 되었고, rcu 콜백 리스트가 준비되지 않은 경우 콜백 리스트를 초기화하고 enable 한다.
코드 라인 37~38에서 낮은 확률로 no-cb용 커널 스레드에서 처리할 콜백이 인입된 경우를 위한 처리를 수행한다.
코드 라인 40에서 rcu 콜백 함수를 콜백 리스트에 엔큐한다.
코드 라인 41~49에서 kfree 타입 또는 일반 타입의 rcu 요청에 맞게 trace 로그를 출력한다.
코드 라인 52~58에서 no-callback 처리용 스레드를 깨우거나 __call_rcu_core() 함수를 호출하여 rcu 코어 처리를 수행한다.

다음 그림은 __call_rcu() 함수를 호출하여 콜백을 추가하는 과정과 이를 처리하는 과정 두 가지를 모두 보여준다.

__call_rcu_core()

kernel/rcu/tree.c

/*
 * Handle any core-RCU processing required by a call_rcu() invocation.
 */

static void __call_rcu_core(struct rcu_data *rdp, struct rcu_head *head,
                            unsigned long flags)
{
        /*
         * If called from an extended quiescent state, invoke the RCU
         * core in order to force a re-evaluation of RCU's idleness.
         */
        if (!rcu_is_watching())
                invoke_rcu_core();

        /* If interrupts were disabled or CPU offline, don't invoke RCU core. */
        if (irqs_disabled_flags(flags) || cpu_is_offline(smp_processor_id()))
                return;

        /*
         * Force the grace period if too many callbacks or too long waiting.
         * Enforce hysteresis, and don't invoke rcu_force_quiescent_state()
         * if some other CPU has recently done so.  Also, don't bother
         * invoking rcu_force_quiescent_state() if the newly enqueued callback
         * is the only one waiting for a grace period to complete.
         */
        if (unlikely(rcu_segcblist_n_cbs(&rdp->cblist) >
                     rdp->qlen_last_fqs_check + qhimark)) {

                /* Are we ignoring a completed grace period? */
                note_gp_changes(rdp);

                /* Start a new grace period if one not already started. */
                if (!rcu_gp_in_progress()) {
                        rcu_accelerate_cbs_unlocked(rdp->mynode, rdp);
                } else {
                        /* Give the grace period a kick. */
                        rdp->blimit = DEFAULT_MAX_RCU_BLIMIT;
                        if (rcu_state.n_force_qs == rdp->n_force_qs_snap &&
                            rcu_segcblist_first_pend_cb(&rdp->cblist) != head)
                                rcu_force_quiescent_state();
                        rdp->n_force_qs_snap = rcu_state.n_force_qs;
                        rdp->qlen_last_fqs_check = rcu_segcblist_n_cbs(&rdp->cblist);
                }
        }
}

rcu 처리(새 콜백, qs, gp 등의 변화) 를 위해 rcu core를 호출한다.

코드 라인 8~9에서 현재 cpu가 extended-qs(idle 또는 idle 진입) 상태에서 호출된 경우 rcu core 처리를 위해 rcu softirq 또는 rcu core 스레드를 호출한다.
- 스케줄 틱에서도 rcu에 대한 softirq 호출을 하지만 더 빨리 처리하기 위함이다.
코드 라인 12~13에서 인터럽트가 disable된 상태인 경우이거나 offline 상태이면 함수를 빠져나간다.
코드 라인 22~26에서 낮은 확률로 rcu 콜백이 너무 많이 대기 중이면 force_qs 조건을 만족하는지 확인하기 전에 먼저 gp 상태 변경을 확인한다.
코드 라인 29~30에서 gp가 진행 중이 아니면 신규 콜백들을 묶어 가능한 경우 앞으로 accelerate 처리한다.
코드 라인 31~39에서 gp가 이미 진행 중이면 blimit 제한을 max(디폴트 10000개) 값으로 풀고 fqs(force quiescent state)를 진행한다.
- rcu_state.n_force_qs == rdp->n_force_qs_snap
  - 현재 cpu가 force 처리한 qs 수가 글로벌에 저장한 수와 일치하면
  - 다른 cpu가 force 하는 경우 글로벌 값이 변경되므로 이 경우에는 force qs 처리하지 않을 계획이다.

rcu_is_watching()

kernel/rcu/tree.c

/**
 * rcu_is_watching - see if RCU thinks that the current CPU is not idle
 *
 * Return true if RCU is watching the running CPU, which means that this
 * CPU can safely enter RCU read-side critical sections.  In other words,
 * if the current CPU is not in its idle loop or is in an interrupt or
 * NMI handler, return true.
 */

bool notrace rcu_is_watching(void)
{
        bool ret;

        preempt_disable_notrace();
        ret = !rcu_dynticks_curr_cpu_in_eqs();
        preempt_enable_notrace();
        return ret;
}
EXPORT_SYMBOL_GPL(rcu_is_watching);

rcu가 동작중인 cpu를 감시하는지 여부를 반환한다. (true=watching, nohz idle 상태가 아니거나 인터럽트 진행중일 때 안전하게 read-side 크리티컬 섹션에 진입 가능한 상태이다.)

invoke_rcu_core()

kernel/rcu/tree.c

/*
 * Wake up this CPU's rcuc kthread to do RCU core processing.
 */

static void invoke_rcu_core(void)
{
        if (!cpu_online(smp_processor_id()))
                return;
        if (use_softirq)
                raise_softirq(RCU_SOFTIRQ);
        else
                invoke_rcu_core_kthread();
}

rcu softirq가 enable(디폴트)된 경우 softirq를 호출하고, 그렇지 않은 경우 rcu core 스레드를 깨운다.

use_softirq
- “module/rcutree/parameters/use_softirq” 파라미터 값에 의해 결정된다. (디폴트=1)

rcu의 3단계 자료 관리

3단계 구조

rcu의 상태 관리를 위해 접근 비용 별로 3 단계의 구조로 관리한다.

rcu_state
- 글로벌에 하나만 존재하며 락을 사용하여 접근한다.
rcu_node
- 노드별로 구성되며, 하이라키 구조(노드 관리는 64개씩, leaf 노드는 16개)로 관리한다. (NUMA의 노드가 아니라 관리목적의 cpu들의 집합이다)
- 노드 락을 사용하여 접근한다.
rcu_data
- per-cpu 별로 구성된다.
- 로컬 cpu에 대한 접근 시에 사용하므로 락없이 사용된다.

약칭

rsp
- rcu_state 구조체 포인터
rnp
- rcu_node 구조체 포인터
rdp
- rcu_data 구조체 포인터

다음 그림과 같이 rcu 데이터 구조체들이 3 단계로 운영되고 있음을 보여준다.

다음 그림은 qs와 급행 qs의 보고에 사용되는 각 멤버들을 보여준다.

qs 보고 체계
- gp 시작 후
  - qs 체크가 필요한 경우 core_need_qs 및 cpu_no_qs를 1로 클리어한다.
  - qsmaskinitnext는 핫플러그 online cpu 상태가 변경될 때마다 이의 변화를 수용하고, qsmakinit에 이를 복사하여 사용한다. 그리고 이를 qsmaskinit에 복사한 후 매 gp가 시작될 때마다 qsmaskinit -> qsmask로 복사하여 qs를 체크할 준비를 한다.
- qs 보고
  - cpu에서 qs가 체크된 경우 core_need_qs와 cpu_no_qs를 모두 클리어한다.
  - 노드에 보고하여 해당 cpu에 대한 qsmask를 클리어한다. 하위 노드들의 qs가 모두 체크되면 상위 노드도 동일하게 전달하여 처리한다.
급행 qs 보고 체계
- gp 시작 후
  - 급행 qs 체크가 필요한 경우 core_need_qs 및 cpu_no_qs를 1로 클리어한다.
  - expmaskinitnext는 핫플러그 online cpu 상태가 변경될 때마다 이의 변화를 수용하고, expmakinit에 이를 복사하여 사용한다. 그리고 이를 매 gp가 시작될 때마다 expmaskinit -> expmask로 복사하여 급행 qs를 체크할 준비를 한다.
- 급행 qs 보고
  - cpu에서 급행 qs가 체크된 경우 core_need_qs와 cpu_no_qs를 모두 클리어한다.
  - 노드에 보고하여 해당 cpu에 대한 expmask를 클리어한다. 하위 노드들의 급행 qs가 모두 체크되면 상위 노드도 동일하게 전달하여 처리한다.
  - exp_deferred_qs의 경우는 disable(irq, bh, preemption) 상태에서 rcu_read_unlock()이 수행될 때에도 qs를 완료 시키면 안된다. 이 때 qs의 완료를 지연시키기 위해 사용된다.

gp 시퀀스 관리

기존 gpnum이 삭제되고, 새로운 gp 시퀀스 번호(gp_seq)가 커널 v4.19-rc1에서 소개되었다.

참고: rcu: Introduce grace-period sequence numbers

gp 시퀀스는 다음과 같이 두 개의 파트로 나누어 관리한다.

gp 카운터
- 새로운 gp가 시작될 때마다 gp 카운터가 증가된다.
- overflow에 대한 빠른 버그를 찾기위해 jiffies가 -300초에 해당하는 틱부터 시작하였다시피 gp 카운터도 -300부터 시작한다.
gp 상태
- idle(0)
  - gp가 동작하지 않는 idle 상태이다.
- start(1)
  - gp가 시작되어 동작하는 상태이다.
- 그 외의 번호는 srcu에서만 사용되므로 생략한다.

다음 그림과 같이 gp 시퀀스는 두 개의 파트로 나누어 관리한다.

다음 그림과 같이 새로운 gp가 시작하고 끝날때 마다 gp 시퀀스가 증가되는 모습을 보여준다.

위 3가지 구조체에 공통적으로 gp_seq가 담겨 있다. rcu 관리에서 최대한 락 사용을 억제하기 위해 per-cpu를 사용한 rcu_data는 로컬 cpu 데이터를 취급한다. 로컬 데이터는 해당 cpu가 필요한 시점에서만 갱신한다. 따라서 이로 인해 로컬 gp_seq는 글로벌 gp_seq 번호에 비해 최대 1 만큼 늦어질 수도 있다. 루트 노드의 값은 항상 rcu_state의 값들과 동시에 변경되지만 그 외의 노드들 값은 노드 락을 거는 시간이 필요하므로 약간씩 지연된다. (모든 값의 차이는 최대 1을 초과할 수 없다.)

gp 시퀀스 번호 관련 함수들

/*
 * Grace-period counter management.
 */

#define RCU_SEQ_CTR_SHIFT       2
#define RCU_SEQ_STATE_MASK      ((1 << RCU_SEQ_CTR_SHIFT) - 1)

rcu_seq_ctr()

kernel/rcu/rcu.h

/*
 * Return the counter portion of a sequence number previously returned
 * by rcu_seq_snap() or rcu_seq_current().
 */

static inline unsigned long rcu_seq_ctr(unsigned long s)
{
        return s >> RCU_SEQ_CTR_SHIFT;
}

gp 시퀀스의 카운터 부분만을 반환한다.

rcu_seq_state()

kernel/rcu/rcu.h

/*
 * Return the state portion of a sequence number previously returned
 * by rcu_seq_snap() or rcu_seq_current().
 */

static inline int rcu_seq_state(unsigned long s)
{
        return s & RCU_SEQ_STATE_MASK;
}

gp 시퀀스의 상태 값만을 반환한다. (0~3)

rcu_seq_set_state()

kernel/rcu/rcu.h

/*
 * Set the state portion of the pointed-to sequence number.
 * The caller is responsible for preventing conflicting updates.
 */

static inline void rcu_seq_set_state(unsigned long *sp, int newstate)
{
        WARN_ON_ONCE(newstate & ~RCU_SEQ_STATE_MASK);
        WRITE_ONCE(*sp, (*sp & ~RCU_SEQ_STATE_MASK) + newstate);
}

gp 시퀀스의 상태 부분을 @newstate로 갱신한다.

rcu_seq_start()

kernel/rcu/rcu.h

/* Adjust sequence number for start of update-side operation. */
static inline void rcu_seq_start(unsigned long *sp)
{
        WRITE_ONCE(*sp, *sp + 1);
        smp_mb(); /* Ensure update-side operation after counter increment. */
        WARN_ON_ONCE(rcu_seq_state(*sp) != 1);
}

gp 시작을 위해 gp 시퀀스를 1 증가시킨다. (rcu_seq_end() 함수가 호출된 후 gp idle 상태인대 이후에 gp 시작을 처리한다)

rcu_seq_endval()

kernel/rcu/rcu.h

/* Compute the end-of-grace-period value for the specified sequence number. */
static inline unsigned long rcu_seq_endval(unsigned long *sp)
{
        return (*sp | RCU_SEQ_STATE_MASK) + 1;
}

gp 종료를 위해 gp 시퀀스 @sp의 카운터 부분을 1 증가시키고, 상태 부분은 0(gp idle)인 값을 반환한다.

rcu_seq_end()

kernel/rcu/rcu.h

/* Adjust sequence number for end of update-side operation. */
static inline void rcu_seq_end(unsigned long *sp)
{
        smp_mb(); /* Ensure update-side operation before counter increment. */
        WARN_ON_ONCE(!rcu_seq_state(*sp));
        WRITE_ONCE(*sp, rcu_seq_endval(sp));
}

gp 종료를 위해 gp 시퀀스 @sp의 카운터 부분을 1 증가시키고, 상태 부분은 0(gp idle)으로 변경한다.

rcu_seq_snap()

kernel/rcu/rcu.h

/*
 * rcu_seq_snap - Take a snapshot of the update side's sequence number.
 *
 * This function returns the earliest value of the grace-period sequence number
 * that will indicate that a full grace period has elapsed since the current
 * time.  Once the grace-period sequence number has reached this value, it will
 * be safe to invoke all callbacks that have been registered prior to the
 * current time. This value is the current grace-period number plus two to the
 * power of the number of low-order bits reserved for state, then rounded up to
 * the next value in which the state bits are all zero.
 */

static inline unsigned long rcu_seq_snap(unsigned long *sp)
{
        unsigned long s;

        s = (READ_ONCE(*sp) + 2 * RCU_SEQ_STATE_MASK + 1) & ~RCU_SEQ_STATE_MASK;
        smp_mb(); /* Above access must not bleed into critical section. */
        return s;
}

현재까지 등록된 콜백이 안전하게 처리될 수 있는 가장 빠른 update side의 gp 시퀀스 번호를 알아온다. (gp가 진행 중인 경우 안전하게 두 단계 뒤의 gp 시퀀스를 반환하고, gp가 idle 상태인 경우 다음 단계 뒤의 시퀀스 번호를 반환한다. 반환 되는 gp 시퀀스의 상태는 idle 이다.)

예) sp=12 (gp idle)
- 16
예) sp=9 (gp start)
- 16

다음 그림은 gp 시퀀스의 스냅샷 값을 알아오는 모습을 보여준다.

rcu_seq_current()

kernel/rcu/rcu.h

/* Return the current value the update side's sequence number, no ordering. */

static inline unsigned long rcu_seq_current(unsigned long *sp)
{
        return READ_ONCE(*sp);
}

update side의 현재 gp 시퀀스 값을 반환한다.

rcu_seq_started()

kernel/rcu/rcu.h

/*
 * Given a snapshot from rcu_seq_snap(), determine whether or not the
 * corresponding update-side operation has started.
 */

static inline bool rcu_seq_started(unsigned long *sp, unsigned long s)
{
        return ULONG_CMP_LT((s - 1) & ~RCU_SEQ_STATE_MASK, READ_ONCE(*sp));
}

스냅샷 @s를 통해 gp가 시작되었는지 여부를 반환한다.

참고로 스냅샷 @s의 하위 2비트는 0으로 항상 gp_idle 상태 값을 반환한다.
rounddown(@s-1, 4) < @sp

다음 그림은 rcu_seq_started() 함수를 통해 스냅샷 기준으로 gp가 이미 시작되었는지 여부를 알아온다.

gp 시퀀스가 9 또는 12에 있을 때 스냅샷을 발급하면 16이다. 이후 gp 시퀀스가 13이 되는 순간 gp가 시작되어 이 함수가 true를 반환한다.

rcu_seq_done()

kernel/rcu/rcu.h

/*
 * Given a snapshot from rcu_seq_snap(), determine whether or not a
 * full update-side operation has occurred.
 */

static inline bool rcu_seq_done(unsigned long *sp, unsigned long s)
{
        return ULONG_CMP_GE(READ_ONCE(*sp), s);
}

스냅샷 @s를 통해 gp가 완료되었는지 여부를 반환한다.

이 함수는 nocb 운용 시 rcu_segcblist_nextgp() 함수에서 반환된 wait 구간의 gp 시퀀스 값을 스냅샷 @s 값으로 사용하며, gp 시퀀스가 이 값에 도달하는 순간 true를 반환한다.
- 참고로 세그먼트 콜백리스트 각 구간의 gp 시퀀스 값은 스냅샷 값을 사용하므로 항상 gp idle 상태이다.
@sp >= @s

다음 그림은 rcu_seq_done() 함수를 통해 스냅샷 기준으로 gp가 완료되었는지 여부를 알아온다.

gp 시퀀스가 9 또는 12에 있을 때 스냅샷을 발급하면 16이다. 이후 gp 시퀀스가 13이 되는 순간 gp가 시작되어 이 함수가 true를 반환한다.

rcu_seq_completed_gp()

kernel/rcu/rcu.h

/*
 * Has a grace period completed since the time the old gp_seq was collected?
 */

static inline bool rcu_seq_completed_gp(unsigned long old, unsigned long new)
{
        return ULONG_CMP_LT(old, new & ~RCU_SEQ_STATE_MASK);
}

노드의 gp 시퀀스 @new가 기존 cpu가 진행하던 gp 시퀀스 @old를 초과하는 것으로 gp를 완료하면 true를 반환한다.

@old < rounddown(@new, 4)
예) old=5, new=9
- 5 < 8 = true

다음 그림은 rcu_seq_completed_gp() 함수를 통해 new gp 기준으로 old gp가 완료되었는지 여부를 알아온다.

노드의 gp 시퀀스가 기존 cpu가 진행하던 gp를 완료하면 true를 반환한다.
- 현재 cpu의 gp 시퀀스가 #3 구간의 gp를 진행하고 있을 때(12, 13), 노드의 gp 시퀀스가 기존 #3 구간을 끝내고 새 구간으로 진입하면 이 함수가 true를 반환한다.

rcu_seq_new_gp()

kernel/rcu/rcu.h

/*
 * Has a grace period started since the time the old gp_seq was collected?
 */

static inline bool rcu_seq_new_gp(unsigned long old, unsigned long new)
{
        return ULONG_CMP_LT((old + RCU_SEQ_STATE_MASK) & ~RCU_SEQ_STATE_MASK,
                            new);
}

gp 시퀀스 old 이후로 새로운 gp가 시작되었는지 여부를 반환한다.

roundup(@old, 4) < @new
예) old=5, new=9
- 8 < 9 = true

다음 그림은 rcu_seq_new_gp() 함수를 통해 old gp 이후로 new gp가 시작되었는지 여부를 알아온다.

현재 cpu의 gp 시퀀스 구간보다 노드의 gp 시퀀스가 새로운 gp를 시작하게되면 true를 반환한다.
- 현재 cpu의 gp 시퀀스가 #2 구간의 gp를 진행하고 있거나(9) #3 구간에서 gp가 idle 중일 때, 노드의 gp 시퀀스가 새 구간 #3의 gp를 시작(13)하면 이 함수가 true를 반환한다.

rcu_seq_diff()

kernel/rcu/rcu.h

/*
 * Roughly how many full grace periods have elapsed between the collection
 * of the two specified grace periods?
 */

static inline unsigned long rcu_seq_diff(unsigned long new, unsigned long old)
{
        unsigned long rnd_diff;

        if (old == new)
                return 0;
        /*
         * Compute the number of grace periods (still shifted up), plus
         * one if either of new and old is not an exact grace period.
         */
        rnd_diff = (new & ~RCU_SEQ_STATE_MASK) -
                   ((old + RCU_SEQ_STATE_MASK) & ~RCU_SEQ_STATE_MASK) +
                   ((new & RCU_SEQ_STATE_MASK) || (old & RCU_SEQ_STATE_MASK));
        if (ULONG_CMP_GE(RCU_SEQ_STATE_MASK, rnd_diff))
                return 1; /* Definitely no grace period has elapsed. */
        return ((rnd_diff - RCU_SEQ_STATE_MASK - 1) >> RCU_SEQ_CTR_SHIFT) + 2;
}

두 개의 gp 시퀀스 간에 소요된 gp 수를 반환한다.

코드 라인 5~6에서 두 값이 동일한 경우 0을 반환한다.
코드 라인 11~13에서 rnd_diff 값을 다음과 같이 구한다.
- = 내림 @new – 올림 @old + @new 상태 || @old 상태
코드 라인 14~15에서 rnd_diff 값이 3 미만인 경우 1을 반환한다.
코드 라인 16에서 다음 값을 반환한다.
- (rnd_diff – 4) >> 2 + 2

RCU CB 처리 (softirq)

rcu_core_si()

kernel/rcu/tree.c

static void rcu_core_si(struct softirq_action *h)
{
        rcu_core();
}

완료된 rcu 콜백들을 호출하여 처리한다.

rcu_core()

kernel/rcu/tree.c

/* Perform RCU core processing work for the current CPU.  */

static __latent_entropy void rcu_core(void)
{
        unsigned long flags;
        struct rcu_data *rdp = raw_cpu_ptr(&rcu_data);
        struct rcu_node *rnp = rdp->mynode;
        const bool offloaded = IS_ENABLED(CONFIG_RCU_NOCB_CPU) &&
                               rcu_segcblist_is_offloaded(&rdp->cblist);

        if (cpu_is_offline(smp_processor_id()))
                return;
        trace_rcu_utilization(TPS("Start RCU core"));
        WARN_ON_ONCE(!rdp->beenonline);

        /* Report any deferred quiescent states if preemption enabled. */
        if (!(preempt_count() & PREEMPT_MASK)) {
                rcu_preempt_deferred_qs(current);
        } else if (rcu_preempt_need_deferred_qs(current)) {
                set_tsk_need_resched(current);
                set_preempt_need_resched();
        }

        /* Update RCU state based on any recent quiescent states. */
        rcu_check_quiescent_state(rdp);

        /* No grace period and unregistered callbacks? */
        if (!rcu_gp_in_progress() &&
            rcu_segcblist_is_enabled(&rdp->cblist) && !offloaded) {
                local_irq_save(flags);
                if (!rcu_segcblist_restempty(&rdp->cblist, RCU_NEXT_READY_TAIL))
                        rcu_accelerate_cbs_unlocked(rnp, rdp);
                local_irq_restore(flags);
        }

        rcu_check_gp_start_stall(rnp, rdp, rcu_jiffies_till_stall_check());

        /* If there are callbacks ready, invoke them. */
        if (!offloaded && rcu_segcblist_ready_cbs(&rdp->cblist) &&
            likely(READ_ONCE(rcu_scheduler_fully_active)))
                rcu_do_batch(rdp);

        /* Do any needed deferred wakeups of rcuo kthreads. */
        do_nocb_deferred_wakeup(rdp);
        trace_rcu_utilization(TPS("End RCU core"));
}

완료된 rcu 콜백들을 호출하여 처리한다.

코드 라인 9~10에서 cpu가 offline 상태인 경우 함수를 빠져나간다.
코드 라인 15~20에서 preempt 가능한 상태인 경우(preempt 카운터=0)인 경우 deferred qs를 처리한다. 그렇지 않고 deferred qs가 pending 상태인 경우 리스케줄 요청을 수행한다.
- deferred qs 상태인 경우 deferred qs를 해제하고, blocked 상태인 경우 blocked 해제 후 qs를 보고한다.
코드 라인 23에서 현재 cpu에 대해 새 gp가 시작되었는지 체크한다. 또한 qs 상태를 체크하고 패스된 경우 rdp에 기록하여 상위 노드로 보고하게 한다.
코드 라인 26~32에서 gp가 idle 상태이면서 새로운 콜백이 존재하고, 필요 시 acceleration을 수행한다.
코드 라인 34에서 gp 요청을 체크한다.
코드 라인 37~39에서 완료된 콜백이 있는 경우 rcu 콜백들을 호출한다.
코드 라인 42에서 rcu no-callback을 위한 처리를 한다

rcu_do_batch()

kernel/rcu/tree.c -1/2-

/*
 * Invoke any RCU callbacks that have made it to the end of their grace
 * period.  Thottle as specified by rdp->blimit.
 */

static void rcu_do_batch(struct rcu_data *rdp)
{
        unsigned long flags;
        const bool offloaded = IS_ENABLED(CONFIG_RCU_NOCB_CPU) &&
                               rcu_segcblist_is_offloaded(&rdp->cblist);
        struct rcu_head *rhp;
        struct rcu_cblist rcl = RCU_CBLIST_INITIALIZER(rcl);
        long bl, count;
        long pending, tlimit = 0;

        /* If no callbacks are ready, just return. */
        if (!rcu_segcblist_ready_cbs(&rdp->cblist)) {
                trace_rcu_batch_start(rcu_state.name,
                                      rcu_segcblist_n_lazy_cbs(&rdp->cblist),
                                      rcu_segcblist_n_cbs(&rdp->cblist), 0);
                trace_rcu_batch_end(rcu_state.name, 0,
                                    !rcu_segcblist_empty(&rdp->cblist),
                                    need_resched(), is_idle_task(current),
                                    rcu_is_callbacks_kthread());
                return;
        }

        /*
         * Extract the list of ready callbacks, disabling to prevent
         * races with call_rcu() from interrupt handlers.  Leave the
         * callback counts, as rcu_barrier() needs to be conservative.
         */
        local_irq_save(flags);
        rcu_nocb_lock(rdp);
        WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));
        pending = rcu_segcblist_n_cbs(&rdp->cblist);
        bl = max(rdp->blimit, pending >> rcu_divisor);
        if (unlikely(bl > 100))
                tlimit = local_clock() + rcu_resched_ns;
        trace_rcu_batch_start(rcu_state.name,
                              rcu_segcblist_n_lazy_cbs(&rdp->cblist),
                              rcu_segcblist_n_cbs(&rdp->cblist), bl);
        rcu_segcblist_extract_done_cbs(&rdp->cblist, &rcl);
        if (offloaded)
                rdp->qlen_last_fqs_check = rcu_segcblist_n_cbs(&rdp->cblist);
        rcu_nocb_unlock_irqrestore(rdp, flags);

seg 콜백리스트의 완료 구간에서 대기중인 rcu 콜백들을 호출한다.

코드 라인 4~5에서 set 콜백리스트를 no-cb offload 처리하는지 여부를 알아온다.
- softirq에서 콜백을 처리하지 않고, no-cb 스레드에 떠넘겨 처리하는 경우인지를 알아온다.
코드 라인 12~21에서 seg 콜백 리스트의 완료 구간에서 대기중인 rcu 콜백들이 하나도 없는 경우 그냥 함수를 빠져나간다.
코드 라인 31~34에서 blimit 값을 구하고, 이 값이 100을 초과하는 경우 현재 시각보다 3ms 후로 tlimit 제한 시각을 구한다.
코드 라인 38에서 rcu seg 콜백리스트의 done 구간의 콜백들을 extract하여 rcl 리스트로 옮긴다.
코드 라인 39~40에서 콜백을 no-cb 스레드에서 처리해야 하는 경우 rdp->qlen_last_fqs_check에 콜백 수를 대입한다.

kernel/rcu/tree.c -2/2-

        /* Invoke callbacks. */
        rhp = rcu_cblist_dequeue(&rcl);
        for (; rhp; rhp = rcu_cblist_dequeue(&rcl)) {
                debug_rcu_head_unqueue(rhp);
                if (__rcu_reclaim(rcu_state.name, rhp))
                        rcu_cblist_dequeued_lazy(&rcl);
                /*
                 * Stop only if limit reached and CPU has something to do.
                 * Note: The rcl structure counts down from zero.
                 */
                if (-rcl.len >= bl && !offloaded &&
                    (need_resched() ||
                     (!is_idle_task(current) && !rcu_is_callbacks_kthread())))
                        break;
                if (unlikely(tlimit)) {
                        /* only call local_clock() every 32 callbacks */
                        if (likely((-rcl.len & 31) || local_clock() < tlimit))
                                continue;
                        /* Exceeded the time limit, so leave. */
                        break;
                }
                if (offloaded) {
                        WARN_ON_ONCE(in_serving_softirq());
                        local_bh_enable();
                        lockdep_assert_irqs_enabled();
                        cond_resched_tasks_rcu_qs();
                        lockdep_assert_irqs_enabled();
                        local_bh_disable();
                }
        }

        local_irq_save(flags);
        rcu_nocb_lock(rdp);
        count = -rcl.len;
        trace_rcu_batch_end(rcu_state.name, count, !!rcl.head, need_resched(),
                            is_idle_task(current), rcu_is_callbacks_kthread());

        /* Update counts and requeue any remaining callbacks. */
        rcu_segcblist_insert_done_cbs(&rdp->cblist, &rcl);
        smp_mb(); /* List handling before counting for rcu_barrier(). */
        rcu_segcblist_insert_count(&rdp->cblist, &rcl);

        /* Reinstate batch limit if we have worked down the excess. */
        count = rcu_segcblist_n_cbs(&rdp->cblist);
        if (rdp->blimit >= DEFAULT_MAX_RCU_BLIMIT && count <= qlowmark)
                rdp->blimit = blimit;

        /* Reset ->qlen_last_fqs_check trigger if enough CBs have drained. */
        if (count == 0 && rdp->qlen_last_fqs_check != 0) {
                rdp->qlen_last_fqs_check = 0;
                rdp->n_force_qs_snap = rcu_state.n_force_qs;
        } else if (count < rdp->qlen_last_fqs_check - qhimark)
                rdp->qlen_last_fqs_check = count;

        /*
         * The following usually indicates a double call_rcu().  To track
         * this down, try building with CONFIG_DEBUG_OBJECTS_RCU_HEAD=y.
         */
        WARN_ON_ONCE(count == 0 && !rcu_segcblist_empty(&rdp->cblist));
        WARN_ON_ONCE(!IS_ENABLED(CONFIG_RCU_NOCB_CPU) &&
                     count != 0 && rcu_segcblist_empty(&rdp->cblist));

        rcu_nocb_unlock_irqrestore(rdp, flags);

        /* Re-invoke RCU core processing if there are callbacks remaining. */
        if (!offloaded && rcu_segcblist_ready_cbs(&rdp->cblist))
                invoke_rcu_core();
}

코드 라인 2~3에서 seg 콜백 리스트를 순회하며 콜백을 하나씩 디큐해온다.
코드 라인 5~6에서 rcu 콜백을 recliaim 호출하여 처리하고, kfree용 콜백인 경우 lazy 카운터를 감소시킨다.
코드 라인 11~14에서 호출된 콜백 수가 blimit 이상이면서 offload되지 않으며 다음 조건 중 하나라도 만족하는 경우 콜백을 그만 처리하기 위해 루프를 벗어난다.
- 조건
  - 리스케줄 요청이 있는 경우
  - 현재 태스크가 idle 태스크도 아니고 no-cb 커널 스레드도 아닌 경우
- 임시로 사용되는 로컬 콜백리스트의 len 멤버는 0부터 시작하여 디큐되어 호출될때마다 1씩 감소한다. 따라서 이 값에는 콜백 호출된 수가 음수 값으로 담겨 있게 된다.
코드 라인 15~21에서 낮은 확률로 tlimit 제한 시각(3ms)이 설정된 경우 매 32개의 콜백을 처리할 때 마다 제한 시간을 초과한 경우 그만 처리하기 위해 루프를 벗어난다.
코드 라인 22~29에서 no-cb 커널 스레드에서 콜백들을 처리하도록 offload된 경우 현재 태스크의 rcu_tasks_holdout 멤버를 false로 변경한다.
코드 라인 34에서 호출한 콜백 수를 count 변수로 알아온다.
코드 라인 39~41에서 처리하지 않고 남은 콜백들을 다음에 처리하기 위해 다시 seg 콜백 리스트의 done 구간에 추가하고, 콜백 수도 추가한다.
코드 라인 44~46에서 rdp->blimit가 10000개 이상이고, seg 콜백 리스트에 있는 콜백 수가 qlowmark(디폴트=100) 이하인 경우 rdp->blimit을 blimit 값으로 재설정한다.
코드 라인 49~53에서 로컬 콜백 리스트의 콜백들을 모두 처리하여 count가 0이고, rdp->qlen_last_fqs_check가 0이 아닌 경우 이 값을 0으로 리셋한다. 만일 다 처리하지 않고 남은 수가 rdp->qlen_last_fqs_check – qhimark 보다 작은 경우 rdp->qlen_last_fqs_check 값을 남은 count 값으로 대입한다.
코드 라인 66~67에서 offloaded되지 않고 done 구간에 남은 콜백들이 여전히 남아 있는 경우 softirq를 호출하여 계속 처리하게 한다.

__rcu_reclaim()

kernel/rcu/rcu.h

/*
 * Reclaim the specified callback, either by invoking it (non-lazy case)
 * or freeing it directly (lazy case).  Return true if lazy, false otherwise.
 */

static inline bool __rcu_reclaim(const char *rn, struct rcu_head *head)
{
        rcu_callback_t f;
        unsigned long offset = (unsigned long)head->func;

        rcu_lock_acquire(&rcu_callback_map);
        if (__is_kfree_rcu_offset(offset)) {
                trace_rcu_invoke_kfree_callback(rn, head, offset);
                kfree((void *)head - offset);
                rcu_lock_release(&rcu_callback_map);
                return true;
        } else {
                trace_rcu_invoke_callback(rn, head);
                f = head->func;
                WRITE_ONCE(head->func, (rcu_callback_t)0L);
                f(head);
                rcu_lock_release(&rcu_callback_map);
                return false;
        }
}

rcu 콜백을 recliaim 처리한다. (kfree용 콜백인 경우 kfree 후 true를 반환하고, 그 외의 경우 해당 콜백을 호출한 후 false를 반환한다.)

코드 라인 7~11에서 rcu 콜백에 함수가 아닌 kfree용 rcu offset이 담긴 경우 이를 통해 kfree를 수행하고 true를 반환한다.
코드 라인 12~19에서 rcu 콜백에 함수가 담긴 경우 이를 호출하고 false를 반환한다.

구조체

rcu_state 구조체

kernel/rcu/tree.h

/*
 * RCU global state, including node hierarchy.  This hierarchy is
 * represented in "heap" form in a dense array.  The root (first level)
 * of the hierarchy is in ->node[0] (referenced by ->level[0]), the second
 * level in ->node[1] through ->node[m] (->node[1] referenced by ->level[1]),
 * and the third level in ->node[m+1] and following (->node[m+1] referenced
 * by ->level[2]).  The number of levels is determined by the number of
 * CPUs and by CONFIG_RCU_FANOUT.  Small systems will have a "hierarchy"
 * consisting of a single rcu_node.
 */

struct rcu_state {
        struct rcu_node node[NUM_RCU_NODES];    /* Hierarchy. */
        struct rcu_node *level[RCU_NUM_LVLS + 1];
                                                /* Hierarchy levels (+1 to */
                                                /*  shut bogus gcc warning) */
        int ncpus;                              /* # CPUs seen so far. */

        /* The following fields are guarded by the root rcu_node's lock. */

        u8      boost ____cacheline_internodealigned_in_smp;
                                                /* Subject to priority boost. */
        unsigned long gp_seq;                   /* Grace-period sequence #. */
        struct task_struct *gp_kthread;         /* Task for grace periods. */
        struct swait_queue_head gp_wq;          /* Where GP task waits. */
        short gp_flags;                         /* Commands for GP task. */
        short gp_state;                         /* GP kthread sleep state. */
        unsigned long gp_wake_time;             /* Last GP kthread wake. */
        unsigned long gp_wake_seq;              /* ->gp_seq at ^^^. */

        /* End of fields guarded by root rcu_node's lock. */

        struct mutex barrier_mutex;             /* Guards barrier fields. */
        atomic_t barrier_cpu_count;             /* # CPUs waiting on. */
        struct completion barrier_completion;   /* Wake at barrier end. */
        unsigned long barrier_sequence;         /* ++ at start and end of */
                                                /*  rcu_barrier(). */
        /* End of fields guarded by barrier_mutex. */

        struct mutex exp_mutex;                 /* Serialize expedited GP. */
        struct mutex exp_wake_mutex;            /* Serialize wakeup. */
        unsigned long expedited_sequence;       /* Take a ticket. */
        atomic_t expedited_need_qs;             /* # CPUs left to check in. */
        struct swait_queue_head expedited_wq;   /* Wait for check-ins. */
        int ncpus_snap;                         /* # CPUs seen last time. */

        unsigned long jiffies_force_qs;         /* Time at which to invoke */
                                                /*  force_quiescent_state(). */
        unsigned long jiffies_kick_kthreads;    /* Time at which to kick */
                                                /*  kthreads, if configured. */
        unsigned long n_force_qs;               /* Number of calls to */
                                                /*  force_quiescent_state(). */
        unsigned long gp_start;                 /* Time at which GP started, */
                                                /*  but in jiffies. */
        unsigned long gp_end;                   /* Time last GP ended, again */
                                                /*  in jiffies. */
        unsigned long gp_activity;              /* Time of last GP kthread */
                                                /*  activity in jiffies. */
        unsigned long gp_req_activity;          /* Time of last GP request */
                                                /*  in jiffies. */
        unsigned long jiffies_stall;            /* Time at which to check */
                                                /*  for CPU stalls. */
        unsigned long jiffies_resched;          /* Time at which to resched */
                                                /*  a reluctant CPU. */
        unsigned long n_force_qs_gpstart;       /* Snapshot of n_force_qs at */
                                                /*  GP start. */
        unsigned long gp_max;                   /* Maximum GP duration in */
                                                /*  jiffies. */
        const char *name;                       /* Name of structure. */
        char abbr;                              /* Abbreviated name. */

        raw_spinlock_t ofl_lock ____cacheline_internodealigned_in_smp;
                                                /* Synchronize offline with */
                                                /*  GP pre-initialization. */
};

rcu 글로벌 상태를 관리하는 구조체이다.

node[]
- 컴파일 타임에 산출된 NUM_RCU_NODES 수 만큼 rcu_node 들이 배열에 배치된다.
*level[]
- 각 레벨의 (0(top) 레벨부터 최대 3레벨까지) 첫 rcu_node를 가리킨다.
- 최소 노드가 1개 이상 존재하므로 level[0]는 항상 node[0]를 가리킨다.
- 노드가 2개 이상되면 level[1]은 node[1]을 가리킨다.
ncpus
- cpu 수
gp_seq
- 현재 grace period 번호
- overflow에 관련된 에러가 발생하는 것에 대해 빠르게 감지하기 위해 -300UL부터 시작한다.
*gp_kthread
- grace period를 관리하는 커널 스레드이다.
- “rcu_preempt” 라는 이름을 사용한다.
gp_wq
- gp 커널 스레드가 대기하는 wait queue이다.
gp_flags
- gp 커널 스레드에 대한 명령이다.
- 2개의 gp 플래그를 사용한다.
  - RCU_GP_FLAG_INIT(0x1) -> gp 시작 필요
  - RCU_GP_FLAG_FQS(0x2) -> fqs 필요
gp_state
- gp 커널 스레드의 상태이다.
  - RCU_GP_IDLE(0) – gp가 동작하지 않는 상태
  - RCU_GP_WAIT_GPS(1) – gp 시작 대기
  - RCU_GP_DONE_GPS(2) – gp 시작을 위해 완료 대기
  - RCU_GP_ONOFF(3) – gp 초기화 핫플러그
  - RCU_GP_INIT(4) – gp 초기화
  - RCU_GP_WAIT_FQS(5) – fqs 시간 대기
  - RCU_GP_DOING_FQS(6) – fqs 시간 완료 대기
  - RCU_GP_CLEANUP(7) – gp 클린업 시작
  - RCU_GP_CLEANED(8) – gp 클린업 완료
gp_wake_time
- 마지막 gp kthread 깨어난 시각
gp_wake_seq
- gp kthread 깨어났을 때의 gp_seq
barrier_mutex
- rcu_barrier() 함수에서 사용하는 베리어 뮤텍스
barrier_cpu_count
- 베리어 대기중인 cpu 수
barrier_completion
- 베리어 완료를 기다리기 위해 사용되는 completion
barrier_sequence
- rcu_barrier()에서 사용하는 베리어용 gp 시퀀스
exp_mutex
- expedited gp를 순서대로 처리하기 위한 뮤텍스 락
exp_wake_mutex
- 순서대로 wakeup하기 위한 뮤텍스 락
expedited_sequence
- 급행 grace period 시퀀스 번호
expedited_need_qs
- expedited qs를 위해 남은 cpu 수
sexpedited_wq
- 체크인 대기 워크큐
ncpus_snap
- 지난 스캔에서 사용했던 cpu 수가 담긴다.
jiffies_force_qs
- force_quiescent_state() -> gp kthread -> rcu_gp_fqs() 함수를 호출하여 fqs를 해야 할 시각(jiffies)
- gp 커널 스레드에서 gp가 완료되지 않아 강제로 fqs를 해야할 때까지 기다릴 시각이 담긴다.
jiffies_kick_kthreads
- 현재 gp에서 이 값으로 지정된 시간이 흘러 stall된 커널 스레드를 깨우기 위한 시각이 담긴다.
- 2 * jiffies_till_first_fqs 시간을 사용한다.
n_force_qs
- force_quiescent_state() -> gp kthread -> rcu_gp_fqs() 함수를 호출하여 fqs를 수행한 횟수
- 각 cpu(rcu_data)에서 수행한 값이 글로벌(rcu_state)에 갱신되며, 해당 cpu에 fqs를 수행했던 이후로 변경된 적이 없는지 확인하기 위해 사용된다.
gp_start
- gp 시작된 시각(틱)
gp_activity
- gp kthread가 동작했던 마지막 시각(틱)
gp_req_activity
- gp 시작 및 종료 요청 시각(틱)이 담기며, gp stall 경과 시각을 체크할 때 사용한다.
jiffies_stall
- cpu stall을 체크할 시각(틱)
jiffies_resched
- cpu stall을 체크하는 시간의 절반의 시각(jiffies)으로 설정되고 필요에 따라 5 틱씩 증가
- 이 시각이 되면 리스케줄한다.
n_force_qs_gpstart
- gp가 시작될 때마다 fqs를 수행한 횟수(n_force_qs)로 복사(snapshot)된다.
gp_max
- 최대 gp 기간(jiffies)
*name
- rcu 명
- “rcu_sched”, “rcu_bh”, “rcu_preempt”
abbr
- 축약된 1 자리 rcu 명
- ‘s’, ‘b’, ‘p’

rcu_node 구조체

kernel/rcu/tree.h – 1/2

/*
 * Definition for node within the RCU grace-period-detection hierarchy.
 */

struct rcu_node {
        raw_spinlock_t __private lock;  /* Root rcu_node's lock protects */
                                        /*  some rcu_state fields as well as */
                                        /*  following. */
        unsigned long gp_seq;   /* Track rsp->rcu_gp_seq. */
        unsigned long gp_seq_needed; /* Track furthest future GP request. */
        unsigned long completedqs; /* All QSes done for this node. */
        unsigned long qsmask;   /* CPUs or groups that need to switch in */
                                /*  order for current grace period to proceed.*/
                                /*  In leaf rcu_node, each bit corresponds to */
                                /*  an rcu_data structure, otherwise, each */
                                /*  bit corresponds to a child rcu_node */
                                /*  structure. */
        unsigned long rcu_gp_init_mask; /* Mask of offline CPUs at GP init. */
        unsigned long qsmaskinit;
                                /* Per-GP initial value for qsmask. */
                                /*  Initialized from ->qsmaskinitnext at the */
                                /*  beginning of each grace period. */
        unsigned long qsmaskinitnext;
                                /* Online CPUs for next grace period. */
        unsigned long expmask;  /* CPUs or groups that need to check in */
                                /*  to allow the current expedited GP */
                                /*  to complete. */
        unsigned long expmaskinit;
                                /* Per-GP initial values for expmask. */
                                /*  Initialized from ->expmaskinitnext at the */
                                /*  beginning of each expedited GP. */
        unsigned long expmaskinitnext;
                                /* Online CPUs for next expedited GP. */
                                /*  Any CPU that has ever been online will */
                                /*  have its bit set. */
        unsigned long ffmask;   /* Fully functional CPUs. */
        unsigned long grpmask;  /* Mask to apply to parent qsmask. */
                                /*  Only one bit will be set in this mask. */
        int     grplo;          /* lowest-numbered CPU or group here. */
        int     grphi;          /* highest-numbered CPU or group here. */
        u8      grpnum;         /* CPU/group number for next level up. */
        u8      level;          /* root is at level 0. */
        bool    wait_blkd_tasks;/* Necessary to wait for blocked tasks to */
                                /*  exit RCU read-side critical sections */
                                /*  before propagating offline up the */
                                /*  rcu_node tree? */
        struct rcu_node *parent;
        struct list_head blkd_tasks;
                                /* Tasks blocked in RCU read-side critical */
                                /*  section.  Tasks are placed at the head */
                                /*  of this list and age towards the tail. */
        struct list_head *gp_tasks;
                                /* Pointer to the first task blocking the */
                                /*  current grace period, or NULL if there */
                                /*  is no such task. */
        struct list_head *exp_tasks;
                                /* Pointer to the first task blocking the */
                                /*  current expedited grace period, or NULL */
                                /*  if there is no such task.  If there */
                                /*  is no current expedited grace period, */
                                /*  then there can cannot be any such task. */

gp 감지를 포함하는 rcu 노드 구조체이다.

lock
- rcu 노드 락
gp_seq
- 이 노드에 대한 gp 번호
gp_seq_needed
- 이 노드에 요청한 새 gp 번호
completedqs
- 이 노드가 qs 되었을때의 completed 번호
qsmask
- leaf 노드의 경우 qs를 보고해야 할 rcu_data에 대응하는 비트가 1로 설정된다.
- leaf 노드가 아닌 경우 qs를 보고해야 할 child 노드에 대응하는 비트가 1로 설정된다.
rcu_gp_init_mask
- gp 초기화시의 offline cpumask
qsmaskinit
- gp가 시작할 때마다 적용되는 초기값
- qsmask & expmask
qsmaskinitnext
- 다음 gp를 위한 online cpumask
expmask
- preemptible-rcu에서만 사용되며, 빠른(expedited) grace period가 진행중인 경우 설정된다..
- leaf 노드가 아닌 노드들의 초기값은 qsmaskinit으로한다. (snapshot)
expmaskinit
- expmask의 per-gp 초기 값
expmaskinitnext
- 다음 pepedited gp를 위한 online cpumask
ffmask
- fully functional cpus
grpmask
- 이 노드가 부모 노드의 qsmask에 대응하는 비트 값이다.
- 예) 512개의 cpu를 위해 루트노드에 32개의 leaf 노드가 있을 때 각 leaf 노드의 grpmask 값은 각각 0x1, 0x2, … 0x8000_0000이다.
grplo
- 이 노드가 관리하는 시작 cpu 번호
- 예) grplo=48, grphi=63
grphi
- 이 노드가 관리하는 끝 cpu 번호
- 예) grplo=48, grphi=63
grpnum
- 상위 노드에서 볼 때 이 노드에 해당하는 그룹번호(32bits: 0~31, 64bits: 0~63)
- grpmask에 설정된 비트 번호와 같다.
  - 예) grpmask=0x8000_0000인 경우 bit31이 설정되어 있다. 이러한 경우 grpnum=31이다.
level
- 이 노드에 해당하는 노드 레벨
- 루트 노드는 0이다.
- 최대 값은 3이다. (최대 4 단계의 레벨 구성이 가능하다)
wait_blkd_tasks
- read side critcal section에서 preemption된 블럭드 태스크가 있는지 여부
*parent
- 부모 노드를 가리킨다.
blkd_tasks
- preemptible 커널의 read side critical section에서 preempt된 경우 해당 태스크를 이 리스트에 추가된다.
*gp_tasks
- 현재 일반 gp에서 블럭된 첫 번째 태스크를 가리킨다.
*exp_tasks
- 현재 급행 gp에서 블럭된 첫 번째 태스크를 가리킨다.

kernel/rcu/tree.h – 2/2

        struct list_head *boost_tasks;
                                /* Pointer to first task that needs to be */
                                /*  priority boosted, or NULL if no priority */
                                /*  boosting is needed for this rcu_node */
                                /*  structure.  If there are no tasks */
                                /*  queued on this rcu_node structure that */
                                /*  are blocking the current grace period, */
                                /*  there can be no such task. */
        struct rt_mutex boost_mtx;
                                /* Used only for the priority-boosting */
                                /*  side effect, not as a lock. */
        unsigned long boost_time;
                                /* When to start boosting (jiffies). */
        struct task_struct *boost_kthread_task;
                                /* kthread that takes care of priority */
                                /*  boosting for this rcu_node structure. */
        unsigned int boost_kthread_status;
                                /* State of boost_kthread_task for tracing. */
#ifdef CONFIG_RCU_NOCB_CPU
        struct swait_queue_head nocb_gp_wq[2];
                                /* Place for rcu_nocb_kthread() to wait GP. */
#endif /* #ifdef CONFIG_RCU_NOCB_CPU */
        raw_spinlock_t fqslock ____cacheline_internodealigned_in_smp;

        spinlock_t exp_lock ____cacheline_internodealigned_in_smp;
        unsigned long exp_seq_rq;
        wait_queue_head_t exp_wq[4];
        struct rcu_exp_work rew;
        bool exp_need_flush;    /* Need to flush workitem? */
} ____cacheline_internodealigned_in_smp;

*boost_tasks
- priority 부스트가 필요한 첫 태스크
boost_mtx
- priority 부스트에 사용되는 락
*boost_kthread_task
- 이 노드에서 priority 부스트를 수행하는 boost 커널 스레드
boost_kthread_status
- boost_kthread_task trace를 위한 상태
nocb_gp_wq[]
- 2 개의 no-cb용 커널 스레드의 대기큐
exp_seq_rq
- 급행 grace period 시퀀스 번호
exp_wq[]
- synchronize_rcu_expedited() 호출한 태스크들이 급행 gp를 대기하게 되는데, 이 호출한 태스크들이 대기하는 곳이다.
- 4개의 해시 리스트 형태로 운영된다. (급행 gp 시퀀스 번호에서 하위 2비트를 우측 시프트하여 버린 후 하위 2비트로 해시 운영한다)
rew
- rcu_exp_work 구조체가 임베드되며, 내부에선 wait_rcu_exp_gp() 함수를 호출하는 워크큐가 사용된다.
exp_need_flush
- sync_rcu_exp_select_cpus() 함수내부에서 각 leaf 노드들에서 위의 워크큐가 사용되는 경우 여부를 관리할 때 사용된다.

rcu_data 구조체

kernel/rcu/tree.h – 1/2

/* Per-CPU data for read-copy update. */

struct rcu_data {
        /* 1) quiescent-state and grace-period handling : */
        unsigned long   gp_seq;         /* Track rsp->rcu_gp_seq counter. */
        unsigned long   gp_seq_needed;  /* Track furthest future GP request. */
        union rcu_noqs  cpu_no_qs;      /* No QSes yet for this CPU. */
        bool            core_needs_qs;  /* Core waits for quiesc state. */
        bool            beenonline;     /* CPU online at least once. */
        bool            gpwrap;         /* Possible ->gp_seq wrap. */
        bool            exp_deferred_qs; /* This CPU awaiting a deferred QS? */
        struct rcu_node *mynode;        /* This CPU's leaf of hierarchy */
        unsigned long grpmask;          /* Mask to apply to leaf qsmask. */
        unsigned long   ticks_this_gp;  /* The number of scheduling-clock */
                                        /*  ticks this CPU has handled */
                                        /*  during and after the last grace */
                                        /* period it is aware of. */
        struct irq_work defer_qs_iw;    /* Obtain later scheduler attention. */
        bool defer_qs_iw_pending;       /* Scheduler attention pending? */

        /* 2) batch handling */
        struct rcu_segcblist cblist;    /* Segmented callback list, with */
                                        /* different callbacks waiting for */
                                        /* different grace periods. */
        long            qlen_last_fqs_check;
                                        /* qlen at last check for QS forcing */
        unsigned long   n_force_qs_snap;
                                        /* did other CPU force QS recently? */
        long            blimit;         /* Upper limit on a processed batch */

        /* 3) dynticks interface. */
        int dynticks_snap;              /* Per-GP tracking for dynticks. */
        long dynticks_nesting;          /* Track process nesting level. */
        long dynticks_nmi_nesting;      /* Track irq/NMI nesting level. */
        atomic_t dynticks;              /* Even value for idle, else odd. */
        bool rcu_need_heavy_qs;         /* GP old, so heavy quiescent state! */
        bool rcu_urgent_qs;             /* GP old need light quiescent state. */
#ifdef CONFIG_RCU_FAST_NO_HZ
        bool all_lazy;                  /* All CPU's CBs lazy at idle start? */
        unsigned long last_accelerate;  /* Last jiffy CBs were accelerated. */
        unsigned long last_advance_all; /* Last jiffy CBs were all advanced. */
        int tick_nohz_enabled_snap;     /* Previously seen value from sysfs. */
#endif /* #ifdef CONFIG_RCU_FAST_NO_HZ */

        /* 4) rcu_barrier(), OOM callbacks, and expediting. */
        struct rcu_head barrier_head;
        int exp_dynticks_snap;          /* Double-check need for IPI. */

1) qs와 gp 핸들링 관련

gp_seq
- gp 시퀀스 번호
gp_seq_needed
- 요청한 gp 시퀀스 번호
cpu_no_qs
- gp 시작 후 설정되며 cpu에서 qs가 체크되면 클리어된다.
- 그 후 해당 노드 및 최상위 노드까지 보고하며, 각 노드의 rnp->qsmask의 비트 중 하위 rnp 또는 rdp의 ->grpmask에 해당하는 비트를 클리어한다.
core_need_qs
- cpu가 qs를 체크해야 하는지 여부
beenonline
- 한 번이라도 online 되었었던 경우 1로 설정된다.
gpwrap
- nohz 진입한 cpu는 스케줄 틱을 한동안 갱신하지 못하는데, 이로 인해 gp 시퀀스 역시 장시간 갱신 못할 수도 있다. cpu의 gp 시퀀스가 노드용 gp 시퀀스에 비해 ulong 값의 1/4을 초과하도록 갱신을 못한 경우 gp 시퀀스를 오버플로우로 판정하여 이 값을 true로 변경한다.
exp_deferred_qs
- 현재 cpu가 deferred qs를 대기중인지 여부
*mynode
- 이 cpu를 관리하는 노드
grpmask
- 이 cpu가 해당 leaf 노드의 qsmask에 대응하는 비트 값으로 qs 패스되면 leaf 노드의 qsmask에 반영한다.
- 예) 노드 당 16개의 cpu가 사용될 수 있으며 각각의 cpu에 대해 0x1, 0x2, … 0x1_0000 값이 배치된다.
ticks_this_gp
- 마지막 gp 이후 구동된 스케줄 틱 수
- CONFIG_RCU_CPU_STALL_INFO 커널 옵션을 사용한 경우 cpu stall 정보의 출력을 위해 사용된다.
defer_qs_iw
defer_qs_iw_pending

2) 배치 핸들링

cblist
- 콜백들이 추가되는 segmented 콜백 리스트이다.
qlen_last_fqs_check
- fqs를 위해 마지막 체크시 사용할 qlen 값
n_force_qs_snap
- n_force_qs가 복사된 값으로 최근에 fqs가 수행되었는지 확인하기 위해 사용된다.
blimit
- 배치 처리할 콜백 제한 수
- 빠른 인터럽트 latency를 보장하게 하기 위해 콜백들이 많은 경우 한 번에 blimit 이하의 콜백들만 처리하게 한다.

3) dynticks(nohz) 인터페이스

dynticks_snap
- dynticks->dynticks 값을 복사해두고 카운터 값이 변동이 있는지 확인하기 위해 사용된다.
dynticks_nesting
- 초기 값은 1로 시작되며, eqs 진입 시 1 감소 시키며, 퇴출 시 1 증가 시킨다.
dynticks_nmi_nesting
- 초기 값은 DYNTICK_IRQ_NONIDLE(long_max / 2 + 1)로 시작되며, irq/nmi 진출시 1(eqs)~2 증가되고, 퇴출시 1(eqs)~2 감소된다.
dynticks
- per-cpu로 구성된 전역 rcu_dynticks에 연결된다.
- no-hz에서 qs상태를 관리하기 위해 사용한다.
rcu_need_heavy_qs
- 2 * jiffies_to_sched_qs 시간(틱)동안 gp 연장
rcu_urgent_qs
- jiffies_to_sched_qs 시간(틱)동안 gp 연장
all_lazy
- 대기 중인 모든 콜백이 lazy(kfree) 타입 콜백인 경우 true가 된다.
last_accelerate
- 최근 accelerate 콜백 처리한 시각(틱)
last_advance_all
- 최근 advance 콜백 처리한 시각(틱)
tick_nozh_enabled_snap
- nohz active 여부를 snap 저장하여 변경 여부를 체크하기 위해 사용한다.

4) rcu 배리어, OOM 콜백과 expediting

barrier_head
- 베리어 역할로 사용하는 콜백
exp_dynticks_snap
- bit0 클리어된 rdp->dynticks 값을 snap 저장하여 사용한다.

kernel/rcu/tree.h – 2/2

        /* 5) Callback offloading. */
#ifdef CONFIG_RCU_NOCB_CPU
        struct swait_queue_head nocb_cb_wq; /* For nocb kthreads to sleep on. */
        struct task_struct *nocb_gp_kthread;
        raw_spinlock_t nocb_lock;       /* Guard following pair of fields. */
        atomic_t nocb_lock_contended;   /* Contention experienced. */
        int nocb_defer_wakeup;          /* Defer wakeup of nocb_kthread. */
        struct timer_list nocb_timer;   /* Enforce finite deferral. */
        unsigned long nocb_gp_adv_time; /* Last call_rcu() CB adv (jiffies). */

        /* The following fields are used by call_rcu, hence own cacheline. */
        raw_spinlock_t nocb_bypass_lock ____cacheline_internodealigned_in_smp;
        struct rcu_cblist nocb_bypass;  /* Lock-contention-bypass CB list. */
        unsigned long nocb_bypass_first; /* Time (jiffies) of first enqueue. */
        unsigned long nocb_nobypass_last; /* Last ->cblist enqueue (jiffies). */
        int nocb_nobypass_count;        /* # ->cblist enqueues at ^^^ time. */

        /* The following fields are used by GP kthread, hence own cacheline. */
        raw_spinlock_t nocb_gp_lock ____cacheline_internodealigned_in_smp;
        struct timer_list nocb_bypass_timer; /* Force nocb_bypass flush. */
        u8 nocb_gp_sleep;               /* Is the nocb GP thread asleep? */
        u8 nocb_gp_bypass;              /* Found a bypass on last scan? */
        u8 nocb_gp_gp;                  /* GP to wait for on last scan? */
        unsigned long nocb_gp_seq;      /*  If so, ->gp_seq to wait for. */
        unsigned long nocb_gp_loops;    /* # passes through wait code. */
        struct swait_queue_head nocb_gp_wq; /* For nocb kthreads to sleep on. */
        bool nocb_cb_sleep;             /* Is the nocb CB thread asleep? */
        struct task_struct *nocb_cb_kthread;
        struct rcu_data *nocb_next_cb_rdp;
                                        /* Next rcu_data in wakeup chain. */

        /* The following fields are used by CB kthread, hence new cacheline. */
        struct rcu_data *nocb_gp_rdp ____cacheline_internodealigned_in_smp;
                                        /* GP rdp takes GP-end wakeups. */
#endif /* #ifdef CONFIG_RCU_NOCB_CPU */

        /* 6) RCU priority boosting. */
        struct task_struct *rcu_cpu_kthread_task;
                                        /* rcuc per-CPU kthread or NULL. */
        unsigned int rcu_cpu_kthread_status;
        char rcu_cpu_has_work;

        /* 7) Diagnostic data, including RCU CPU stall warnings. */
        unsigned int softirq_snap;      /* Snapshot of softirq activity. */
        /* ->rcu_iw* fields protected by leaf rcu_node ->lock. */
        struct irq_work rcu_iw;         /* Check for non-irq activity. */
        bool rcu_iw_pending;            /* Is ->rcu_iw pending? */
        unsigned long rcu_iw_gp_seq;    /* ->gp_seq associated with ->rcu_iw. */
        unsigned long rcu_ofl_gp_seq;   /* ->gp_seq at last offline. */
        short rcu_ofl_gp_flags;         /* ->gp_flags at last offline. */
        unsigned long rcu_onl_gp_seq;   /* ->gp_seq at last online. */
        short rcu_onl_gp_flags;         /* ->gp_flags at last online. */
        unsigned long last_fqs_resched; /* Time of last rcu_resched(). */

        int cpu;
};

5) callback offloading (no-cb)

nocb_cb_wq
- nocb 커널 스레드가 잠들때 대기하는 리스트
*nocb_gp_kthread
- no-cb용 gp 커널 스레드
nocb_lock
- no-cb용 스핀락
nocb_lock_contended
- no-cb용 lock contention 유무를 관리하기 위해 사용되는 카운터이다.
nocb_defer_wakeup
- no-cb용 커널 스레드를 깨우는 것에 대한 유예 상태
  - RCU_NOGP_WAKE_NOT(0)
  - RCU_NOGP_WAKE(1)
  - RCU_NOGP_WAKE_FORCE(2)
nocb_timer
- no-cb용 타이머
- no-cb용 gp 커널 스레드를 1틱 deferred wakeup을 할 때 사용한다.
nocb_gp_adv_time
- no-cb용 gp 커널 스레드에서 콜백들을 flush 처리하고 advance 처리를 하는데, 1 틱이내에 반복하지 않기 위해 사용한다.
nocb_bypass_lock
- no-cb bypass 갱신 시 사용하는 스핀락
nocb_bypass
- no-cb용 bypass 콜백 리스트
nocb_bypass_first
- no-cb용 bypass 에 처음 콜백이 추가될 때의 시각을 기록한다.
- no-cb용 bypass 리스트에 있는 콜백들을 flush하여 처리한 경우에도 해당 시각을 기록한다.
- 1ms 이내에 진입하는 콜백들이 일정량(16개)을 초과하여 진입할 때에만 no-cb용 bypass에 콜백을 추가하고 nocb_nobypass_count 카운터를 증가시키는데 이 카운터(는 이 시각이 바뀔 때마다 리셋된다.
nocb_nobypass_last
- no-cb용 bypass 리스트에 콜백을 추가 시도할 때 갱신되는 시각(틱)이다.
nocb_nobypass_count
- no-cb용 bypass 리스트를 사용하여 콜백을 추가하기 위해 카운팅을 한다.
- 이 값이 매 틱마다 nocb_nobypass_lim_per_jiffy(디폴트 1ms 당 16개) 갯 수를 초과할 때에만 no-cb용 bypass 리스트에 콜백을 추가한다.
nocb_gp_lock
- no-cb용 gp 커널스레드가 사용하는 락
nocb_bypass_timer
- no-cb용 bypass 타이머로 새 gp 요청을 기다리기 위해 슬립 전에 no-cb용 콜백들을 모두 flush 처리하기 위해 2틱을 설정하여 사용한다.
nocb_gp_sleep
- no-cb용 gp 커널 스레드가 슬립 상태인지 여부를 담는다. gp 변화를 위해 외부에서 이 값을 false로 바꾸고 깨워 사용한다.
nocb_gp_bypass
- no-cb용 gp 커널 스레드가 지난 스캔에서 bypass 모드로 동작하였는지 여부를 담는다.
nocb_gp_gp
- no-cb용 gp 커널 스레드가 지난 스캔에서 wait 중인지 여부를 담는다.
nocb_gp_seq
- no-cb용 gp 커널 스레드에서 gp idle 상태일때 -1 값을 갖고, 진행 중일 때 ->gp_seq 값을 담고 있다.
nocb_gp_loops
- no-cb용 gp 커널 스레드의 루프 횟수를 카운팅하기 위해 사용된다. (gp state 출력용)
nocb_gp_wq
- no-cb용 gp 커널 스레드가 슬립하는 곳이다.
nocb_cb_sleep
- no-cb용 cb 커널 스레드가 슬립 상태인지 여부를 담는다. 콜백 처리를 위해 이 값을 false로 바꾸고 깨워 사용한다.
*nocb_cb_kthread
- no-cb용 cb 커널 스레드를 가리킨다.
*nocb_next_cb_rdp
- wakeup 체인에서 다음에 처리할 rdp를 가리킨다.
*nocb_gp_rdp
- no-cb용 gp 커널 스레드가 있는 rdp(cpu)를 가리킨다.

6) rcu priority boosting

*rcu_cpu_kthread_task
- rcuc 커널 스레드
rcu_cpu_kthread_status
- rcu 커널 스레드 상태
  - RCU_KTHREAD_STOPPED 0
  - RCU_KTHREAD_RUNNING 1
  - RCU_KTHREAD_WAITING 2
  - RCU_KTHREAD_OFFCPU 3
  - RCU_KTHREAD_YIELDING 4
rcu_cpu_has_work
- cpu에 처리할 콜백이 있어 cb용 콜백 처리 커널 스레드를 깨워야 할 때 사용된다.

7) diagnostic data / rcu cpu stall warning

softirq_snap
- cpu stall 정보를 출력할 때 사용하기 위해 rcu softirq 카운터 값 복사(snap-shot)
- CONFIG_RCU_CPU_STALL_INFO 커널 옵션 필요
rcu_iw
- rcu_iw_handler() 함수가 담긴 rcu irq work
rcu_iw_pending
- rcu irq work 호출 전에 true가 담긴다.
rcu_iw_gp_seq
- rcu irq work 호출 전에 rnp->gp_seq가 담긴다.
rcu_ofl_gp_seq
- cpu offline 변경 시 gp_seq가 담긴다.
rcu_ofl_gp_flags
- cpu offline 변경 시 gp_flags가 담긴다.
rcu_onl_gp_seq
- cpu online 변경 시 gp_seq가 담긴다.
rcu_onl_gp_flags
- cpu online 변경 시 gp_flags가 담긴다.
last_fqs_resched
- 최근 fqs 리스케줄 시각(틱)이 담긴다.
- fqs 진행될 때 3 * jiffies_to_sched_qs 시간이 지난 경우 리스케줄 요청을 한다.
cpu
- cpu 번호

참고

RCU(Read Copy Update) -1- (Basic) | 문c
RCU(Read Copy Update) -2- (Callback process) | 문c – 현재글
RCU(Read Copy Update) -3- (RCU threads) | 문c
RCU(Read Copy Update) -4- (NOCB process) | 문c
RCU(Read Copy Update) -5- (Callback list) | 문c
RCU(Read Copy Update) -6- (Expedited GP) | 문c
RCU(Read Copy Update) -7- (Preemptible RCU) | 문c
rcu_init() | 문c
wait_for_completion() | 문c

numa_default_policy()

2016-12-022016-12-02 문영일 Leave a comment

numa_default_policy()

mm/mempolicy.c

/* Reset policy of current process to default */
void numa_default_policy(void)
{
        do_set_mempolicy(MPOL_DEFAULT, 0, NULL);
}

현재 태스크를 기본 메모리 정책으로 설정한다.

참고

numa_policy_init() | 문c 블로그

numa_policy_init()

2016-12-022016-12-02 문영일 Leave a comment

numa_policy_init()

mm/mempolicy.c

void __init numa_policy_init(void)
{
        nodemask_t interleave_nodes; 
        unsigned long largest = 0;
        int nid, prefer = 0;

        policy_cache = kmem_cache_create("numa_policy",
                                         sizeof(struct mempolicy),
                                         0, SLAB_PANIC, NULL);

        sn_cache = kmem_cache_create("shared_policy_node",
                                     sizeof(struct sp_node),
                                     0, SLAB_PANIC, NULL);

        for_each_node(nid) {
                preferred_node_policy[nid] = (struct mempolicy) {
                        .refcnt = ATOMIC_INIT(1),
                        .mode = MPOL_PREFERRED,
                        .flags = MPOL_F_MOF | MPOL_F_MORON,
                        .v = { .preferred_node = nid, },
                };
        }
                
        /*
         * Set interleaving policy for system init. Interleaving is only
         * enabled across suitably sized nodes (default is >= 16MB), or
         * fall back to the largest node if they're all smaller.
         */
        nodes_clear(interleave_nodes);
        for_each_node_state(nid, N_MEMORY) {
                unsigned long total_pages = node_present_pages(nid);

                /* Preserve the largest node */
                if (largest < total_pages) {
                        largest = total_pages;
                        prefer = nid;
                }

                /* Interleave this node? */
                if ((total_pages << PAGE_SHIFT) >= (16 << 20))
                        node_set(nid, interleave_nodes);
        }

        /* All too small, use the largest */
        if (unlikely(nodes_empty(interleave_nodes)))
                node_set(prefer, interleave_nodes);

        if (do_set_mempolicy(MPOL_INTERLEAVE, 0, &interleave_nodes))
                pr_err("%s: interleaving failed\n", __func__);

        check_numabalancing_enable();
}

코드 라인 07~13에서 mempolicy 구조체 타입으로 전역 policy_cache kmem 캐시와 sp_node 구조체 타입으로 전역 sn_cache kmem 캐시를 구성한다.
코드 라인 15~22에서 전체 노드에 대해 preferred_node_policy[] 배열을 초기화한다.
코드 라인 29~41에서 interleave 노드를 초기화한다. 16M 이상의 모든 메모리 노드를 interleave 노드에 참여 시킨다.
코드 라인 44~45에서 만일 참여한 노드가 하나도 없는 경우 가장 큰 노드를 interleave 노드에 참여시킨다.
코드 라인 47~48에서 interleave 메모리 정책을 설정한다.
코드 라인 50에서 NUMA 밸런싱을 설정한다.

mm/mempolicy.c

static struct mempolicy preferred_node_policy[MAX_NUMNODES];

do_set_mempolicy()

mm/mempolicy.c

/* Set the process memory policy */
static long do_set_mempolicy(unsigned short mode, unsigned short flags,
                             nodemask_t *nodes)
{
        struct mempolicy *new, *old;
        NODEMASK_SCRATCH(scratch);
        int ret;

        if (!scratch)
                return -ENOMEM;

        new = mpol_new(mode, flags, nodes);
        if (IS_ERR(new)) {
                ret = PTR_ERR(new);
                goto out;
        }

        task_lock(current);
        ret = mpol_set_nodemask(new, nodes, scratch);
        if (ret) {
                task_unlock(current);
                mpol_put(new);
                goto out;
        }
        old = current->mempolicy;
        current->mempolicy = new;
        if (new && new->mode == MPOL_INTERLEAVE &&
            nodes_weight(new->v.nodes))
                current->il_next = first_node(new->v.nodes);
        task_unlock(current);
        mpol_put(old);
        ret = 0;
out:
        NODEMASK_SCRATCH_FREE(scratch);
        return ret;
}

요청한 메모리 정책 모드를 새로 만들고 설정한 후 현재 태스크의 메모리 정책에 대입시킨다.

mpol_new()

mm/mempolicy.c

/*
 * This function just creates a new policy, does some check and simple
 * initialization. You must invoke mpol_set_nodemask() to set nodes.
 */
static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
                                  nodemask_t *nodes)
{
        struct mempolicy *policy;

        pr_debug("setting mode %d flags %d nodes[0] %lx\n",
                 mode, flags, nodes ? nodes_addr(*nodes)[0] : NUMA_NO_NODE);

        if (mode == MPOL_DEFAULT) {
                if (nodes && !nodes_empty(*nodes))
                        return ERR_PTR(-EINVAL);
                return NULL;
        }
        VM_BUG_ON(!nodes);

        /*
         * MPOL_PREFERRED cannot be used with MPOL_F_STATIC_NODES or
         * MPOL_F_RELATIVE_NODES if the nodemask is empty (local allocation).
         * All other modes require a valid pointer to a non-empty nodemask.
         */
        if (mode == MPOL_PREFERRED) {
                if (nodes_empty(*nodes)) {
                        if (((flags & MPOL_F_STATIC_NODES) ||
                             (flags & MPOL_F_RELATIVE_NODES)))
                                return ERR_PTR(-EINVAL);
                }
        } else if (mode == MPOL_LOCAL) {
                if (!nodes_empty(*nodes))
                        return ERR_PTR(-EINVAL);
                mode = MPOL_PREFERRED;
        } else if (nodes_empty(*nodes))
                return ERR_PTR(-EINVAL);
        policy = kmem_cache_alloc(policy_cache, GFP_KERNEL);
        if (!policy)
                return ERR_PTR(-ENOMEM);
        atomic_set(&policy->refcnt, 1);
        policy->mode = mode;
        policy->flags = flags;

        return policy;
}

새로운 메모리 정책을 생성한다.

mpol_set_nodemask()

mm/mempolicy.c

/*
 * mpol_set_nodemask is called after mpol_new() to set up the nodemask, if
 * any, for the new policy.  mpol_new() has already validated the nodes
 * parameter with respect to the policy mode and flags.  But, we need to
 * handle an empty nodemask with MPOL_PREFERRED here.
 *
 * Must be called holding task's alloc_lock to protect task's mems_allowed
 * and mempolicy.  May also be called holding the mmap_semaphore for write.
 */
static int mpol_set_nodemask(struct mempolicy *pol,
                     const nodemask_t *nodes, struct nodemask_scratch *nsc)
{
        int ret;

        /* if mode is MPOL_DEFAULT, pol is NULL. This is right. */
        if (pol == NULL)
                return 0;
        /* Check N_MEMORY */
        nodes_and(nsc->mask1,
                  cpuset_current_mems_allowed, node_states[N_MEMORY]);

        VM_BUG_ON(!nodes);
        if (pol->mode == MPOL_PREFERRED && nodes_empty(*nodes))
                nodes = NULL;   /* explicit local allocation */
        else {
                if (pol->flags & MPOL_F_RELATIVE_NODES)
                        mpol_relative_nodemask(&nsc->mask2, nodes, &nsc->mask1);
                else
                        nodes_and(nsc->mask2, *nodes, nsc->mask1);

                if (mpol_store_user_nodemask(pol))
                        pol->w.user_nodemask = *nodes;
                else
                        pol->w.cpuset_mems_allowed =
                                                cpuset_current_mems_allowed;
        }

        if (nodes)
                ret = mpol_ops[pol->mode].create(pol, &nsc->mask2);
        else
                ret = mpol_ops[pol->mode].create(pol, NULL);
        return ret;
}

mpol_relative_nodemask()

mm/mempolicy.c

static void mpol_relative_nodemask(nodemask_t *ret, const nodemask_t *orig,
                                   const nodemask_t *rel)
{
        nodemask_t tmp;
        nodes_fold(tmp, *orig, nodes_weight(*rel));
        nodes_onto(*ret, tmp, *rel);
}

mpol_op[] 구조체 배열

static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
        [MPOL_DEFAULT] = {
                .rebind = mpol_rebind_default,
        },
        [MPOL_INTERLEAVE] = {
                .create = mpol_new_interleave,
                .rebind = mpol_rebind_nodemask,
        },
        [MPOL_PREFERRED] = {
                .create = mpol_new_preferred,
                .rebind = mpol_rebind_preferred,
        },
        [MPOL_BIND] = {
                .create = mpol_new_bind,
                .rebind = mpol_rebind_nodemask,
        },
};

check_numabalancing_enable()

mm/mempolicy.c

#ifdef CONFIG_NUMA_BALANCING
static int __initdata numabalancing_override;

static void __init check_numabalancing_enable(void)
{
        bool numabalancing_default = false;

        if (IS_ENABLED(CONFIG_NUMA_BALANCING_DEFAULT_ENABLED))
                numabalancing_default = true;

        /* Parsed by setup_numabalancing. override == 1 enables, -1 disables */
        if (numabalancing_override)
                set_numabalancing_state(numabalancing_override == 1);

        if (num_online_nodes() > 1 && !numabalancing_override) {
                pr_info("%s automatic NUMA balancing. "
                        "Configure with numa_balancing= or the "
                        "kernel.numa_balancing sysctl",
                        numabalancing_default ? "Enabling" : "Disabling");
                set_numabalancing_state(numabalancing_default);
        }
} 
#endif

NUMA 밸런싱을 설정한다.

코드 라인 12~13에서 “numa_balancing=enable” 커널 파라메터가 설정된 경우 전역 numabalancing_enabled를 true로 설정한다.
코드 라인 15~21에서 CONFIG_NUMA_BALANCING_DEFAULT_ENABLED 커널 옵션이 사용된 커널에서 온라인 노드가 2개 이상인 경우도 전역 numabalancing_enabled를 true로 설정한다.

구조체

mempolicy 구조체

/*
 * Describe a memory policy.
 *
 * A mempolicy can be either associated with a process or with a VMA.
 * For VMA related allocations the VMA policy is preferred, otherwise
 * the process policy is used. Interrupts ignore the memory policy
 * of the current process.
 *
 * Locking policy for interlave:
 * In process context there is no locking because only the process accesses
 * its own state. All vma manipulation is somewhat protected by a down_read on
 * mmap_sem.
 *
 * Freeing policy:
 * Mempolicy objects are reference counted.  A mempolicy will be freed when
 * mpol_put() decrements the reference count to zero.
 *
 * Duplicating policy objects:
 * mpol_dup() allocates a new mempolicy and copies the specified mempolicy
 * to the new storage.  The reference count of the new object is initialized
 * to 1, representing the caller of mpol_dup().
 */
struct mempolicy {
        atomic_t refcnt;
        unsigned short mode;    /* See MPOL_* above */
        unsigned short flags;   /* See set_mempolicy() MPOL_F_* above */
        union {
                short            preferred_node; /* preferred */
                nodemask_t       nodes;         /* interleave/bind */
                /* undefined for default */
        } v;
        union {
                nodemask_t cpuset_mems_allowed; /* relative to these nodes */
                nodemask_t user_nodemask;       /* nodemask passed by user */
        } w; 
};

mempolicy_operations 구조체

mm/mempolicy.c

static const struct mempolicy_operations {
        int (*create)(struct mempolicy *pol, const nodemask_t *nodes);
        /*
         * If read-side task has no lock to protect task->mempolicy, write-side
         * task will rebind the task->mempolicy by two step. The first step is
         * setting all the newly nodes, and the second step is cleaning all the
         * disallowed nodes. In this way, we can avoid finding no node to alloc
         * page.
         * If we have a lock to protect task->mempolicy in read-side, we do
         * rebind directly.
         *
         * step:
         *      MPOL_REBIND_ONCE - do rebind work at once
         *      MPOL_REBIND_STEP1 - set all the newly nodes
         *      MPOL_REBIND_STEP2 - clean all the disallowed nodes
         */
        void (*rebind)(struct mempolicy *pol, const nodemask_t *nodes,
                        enum mpol_rebind_step step);
} mpol_ops[MPOL_MAX];

Policy 관련 enum

/*
 * Both the MPOL_* mempolicy mode and the MPOL_F_* optional mode flags are
 * passed by the user to either set_mempolicy() or mbind() in an 'int' actual.
 * The MPOL_MODE_FLAGS macro determines the legal set of optional mode flags.
 */

/* Policies */
enum {
        MPOL_DEFAULT,
        MPOL_PREFERRED,
        MPOL_BIND, 
        MPOL_INTERLEAVE,
        MPOL_LOCAL,
        MPOL_MAX,       /* always last member of enum */
};

참고

NUMA with Linux | Lunatine

Zoned Allocator -14- (Kswapd & Kcompactd)

2016-12-012021-10-14 문영일 13 Comments

Kswapd & Kcompactd

노드마다 kswapd와 kcompactd가 동작하며 free 메모리가 일정량 이상 충분할 때에는 잠들어(sleep) 있다. 그런데 페이지 할당자가 order 페이지 할당을 시도하다 free 페이지가 부족해 low 워터마크 기준을 충족하지 못하는 순간 kswapd 및 kcompactd를 깨운다. kswapd는 자신의 노드에서 페이지 회수를 진행하고, kcompactd는 compaction을 진행하는데 모든 노드에 대해 밸런스하게 high 워터마크 기준을 충족하게 되면 스스로 sleep 한다.

kswapd 초기화

kswapd_init()

mm/vmscan.c

static int __init kswapd_init(void)
{
        int nid, ret;

        swap_setup();
        for_each_node_state(nid, N_MEMORY)
                kswapd_run(nid);
        ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
                                        "mm/vmscan:online", kswapd_cpu_online,
                                        NULL);
        WARN_ON(ret < 0);
        return 0;
}

module_init(kswapd_init)

kswapd를 사용하기 위해 초기화한다.

코드 라인 5에서 kswapd 실행 전에 준비한다.
코드 라인 6~7에서 모든 메모리 노드에 대해 kswapd를 실행시킨다.
코드 라인 8~10에서 cpu가 hot-plug를 통해 CPUHP_AP_ONLINE_DYN 상태로 변경될 때 kswapd_cpu_online() 함수가 호출될 수 있도록 등록한다.

swap_setup()

mm/swap.c

/*
 * Perform any setup for the swap system
 */

void __init swap_setup(void)
{
        unsigned long megs = totalram_pages >> (20 - PAGE_SHIFT); 

        /* Use a smaller cluster for small-memory machines */
        if (megs < 16)                          
                page_cluster = 2;
        else
                page_cluster = 3;
        /* 
         * Right now other parts of the system means that we
         * _really_ don't want to cluster much more
         */
}

kswapd 실행 전에 준비한다.

코드 라인 3~9에서 전역 total 램이 16M 이하이면 page_cluster에 2를 대입하고 그렇지 않으면 3을 대입한다.

kswapd_run()

mm/vmscan.c

/*
 * This kswapd start function will be called by init and node-hot-add.
 * On node-hot-add, kswapd will moved to proper cpus if cpus are hot-added.
 */

int kswapd_run(int nid)
{
        pg_data_t *pgdat = NODE_DATA(nid);
        int ret = 0;

        if (pgdat->kswapd)
                return 0;

        pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
        if (IS_ERR(pgdat->kswapd)) {
                /* failure at boot is fatal */
                BUG_ON(system_state == SYSTEM_BOOTING);
                pr_err("Failed to start kswapd on node %d\n", nid);
                ret = PTR_ERR(pgdat->kswapd);
                pgdat->kswapd = NULL;
        }
        return ret;
}

kswapd 스레드를 동작시킨다.

코드 라인 3~7에서 @nid 노드의 kswapd 스레드가 이미 실행 중인 경우 skip하기 위해 0을 반환한다.
코드 라인 9~16에서 kswapd 스레드를 동작시킨다.

kswapd 동작

kswapd()

mm/vmscan.c

/*
 * The background pageout daemon, started as a kernel thread
 * from the init process.
 *
 * This basically trickles out pages so that we have _some_
 * free memory available even if there is no other activity
 * that frees anything up. This is needed for things like routing
 * etc, where we otherwise might have all activity going on in
 * asynchronous contexts that cannot page things out.
 *
 * If there are applications that are active memory-allocators
 * (most normal use), this basically shouldn't matter.
 */

static int kswapd(void *p)
{
        unsigned int alloc_order, reclaim_order;
        unsigned int classzone_idx = MAX_NR_ZONES - 1;
        pg_data_t *pgdat = (pg_data_t*)p;
        struct task_struct *tsk = current;

        struct reclaim_state reclaim_state = {
                .reclaimed_slab = 0,
        };
        const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);

        if (!cpumask_empty(cpumask))
                set_cpus_allowed_ptr(tsk, cpumask);
        current->reclaim_state = &reclaim_state;

        /*
         * Tell the memory management that we're a "memory allocator",
         * and that if we need more memory we should get access to it
         * regardless (see "__alloc_pages()"). "kswapd" should
         * never get caught in the normal page freeing logic.
         *
         * (Kswapd normally doesn't need memory anyway, but sometimes
         * you need a small amount of memory in order to be able to
         * page out something else, and this flag essentially protects
         * us from recursively trying to free more memory as we're
         * trying to free the first piece of memory in the first place).
         */
        tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
        set_freezable();

        pgdat->kswapd_order = 0;
        pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
        for ( ; ; ) {
                bool ret;

                alloc_order = reclaim_order = pgdat->kswapd_order;
                classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);

kswapd_try_sleep:
                kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
                                        classzone_idx);

                /* Read the new order and classzone_idx */
                alloc_order = reclaim_order = pgdat->kswapd_order;
                classzone_idx = kswapd_classzone_idx(pgdat, 0);
                pgdat->kswapd_order = 0;
                pgdat->kswapd_classzone_idx = MAX_NR_ZONES;

                ret = try_to_freeze();
                if (kthread_should_stop())
                        break;

                /*
                 * We can speed up thawing tasks if we don't call balance_pgdat
                 * after returning from the refrigerator
                 */
                if (ret)
                        continue;

                /*
                 * Reclaim begins at the requested order but if a high-order
                 * reclaim fails then kswapd falls back to reclaiming for
                 * order-0. If that happens, kswapd will consider sleeping
                 * for the order it finished reclaiming at (reclaim_order)
                 * but kcompactd is woken to compact for the original
                 * request (alloc_order).
                 */
                trace_mm_vmscan_kswapd_wake(pgdat->node_id, classzone_idx,
                                                alloc_order);
                reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
                if (reclaim_order < alloc_order)
                        goto kswapd_try_sleep;
        }

        tsk->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD);
        current->reclaim_state = NULL;

        return 0;
}

각 노드에서 동작되는 kswapd 스레드는 각 노드에서 동작하는 zone에 대해 free 페이지가 low 워터마크 이하로 내려가는 경우 백그라운드에서 페이지 회수를 진행하고 high 워터마크 이상이 되는 경우 페이지 회수를 멈춘다.

코드 라인 5~14에서 요청 노드에서 동작하는 온라인 cpumask를 읽어와서 현재 태스크에 설정한다.
- 요청 노드의 cpumask를 현재 task에 cpus_allowed 등을 설정한다.
코드 라인 15에서 현재 태스크의 reclaim_state가 초기화된 reclaim_state 구조체를 가리키게 한다.
코드 라인 29에서 현재 태스크의 플래그에 PF_MEMALLOC, PF_SWAPWRITE 및 PF_KSWAPD를 설정한다.
- PF_MEMALLOC
  - 메모리 reclaim을 위한 태스크로 워터 마크 제한 없이 할당할 수 있도록 한다.
- PF_SWAPWRITE
  - anon 메모리에 대해 swap 기록 요청을 한다.
- PF_KSWAPD
  - kswapd task를 의미한다.
코드 라인 30에서 현재 태스크를 freeze할 수 있도록 PF_NOFREEZE 플래그를 제거한다.
- 참고: freeze | 문c
코드 라인 32~33에서 kswapd_order를 0부터 시작하게 하고 kswapd_classzone_idx는 가장 상위부터 할 수 있도록 MAX_NR_ZONES를 대입해둔다.
코드 라인 34~38에서 alloc_order와 reclaim_order를 노드의 kswapd가 진행하는 order를 사용한다. 그리고 대상 존인 classzone_idx를 가져온다.
코드 라인 40~42에서 try_sleep: 레이블이다. kswapd가 슬립하도록 시도한다.
코드 라인 45~46에서 alloc_order와 reclaim_order를 외부에서 요청한 order와 zone을 적용시킨다.
코드 라인 47~48에서 kwapd_order를 0으로 리셋하고, kswapd_classzone_idx는 가장 상위부터 할 수 있도록 MAX_NR_ZONES를 대입해둔다.
코드 라인 50에서 현재 태스크 kswapd에 대해 freeze 요청이 있는 경우 freeze 시도한다.
코드 라인 51~52에서 현재 태스크의 KTHREAD_SHOULD_STOP 플래그 비트가 설정된 경우 루프를 탈출하고 스레드 종료 처리한다.
코드 라인 58~59에서 freeze 된 적이 있으면 빠른 처리를 위해 노드 밸런스를 동작시키지 않고 계속 진행한다.
코드 라인 71에서 freeze 한 적이 없었던 경우이다. order 페이지와 존을 대상으로 페이지 회수를 진행한다.
코드 라인 72~73에서 요청한 order에서 회수가 실패한 경우 order 0 및 해당 존에서 다시 시도하기 위해 try_sleep: 레이블로 이동한다.
코드 라인 74에서 요청한 order에서 회수가 성공한 경우에는 루프를 계속 반복한다.
코드 라인 76~79에서 kswapd 스레드의 처리를 완료시킨다.

kswapd_try_to_sleep()

mm/vmscan.c

static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_order,
                                unsigned int classzone_idx)
{
        long remaining = 0;
        DEFINE_WAIT(wait);

        if (freezing(current) || kthread_should_stop())
                return;

        prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);

        /*
         * Try to sleep for a short interval. Note that kcompactd will only be
         * woken if it is possible to sleep for a short interval. This is
         * deliberate on the assumption that if reclaim cannot keep an
         * eligible zone balanced that it's also unlikely that compaction will
         * succeed.
         */
        if (prepare_kswapd_sleep(pgdat, reclaim_order, classzone_idx)) {
                /*
                 * Compaction records what page blocks it recently failed to
                 * isolate pages from and skips them in the future scanning.
                 * When kswapd is going to sleep, it is reasonable to assume
                 * that pages and compaction may succeed so reset the cache.
                 */
                reset_isolation_suitable(pgdat);

                /*
                 * We have freed the memory, now we should compact it to make
                 * allocation of the requested order possible.
                 */
                wakeup_kcompactd(pgdat, alloc_order, classzone_idx);

                remaining = schedule_timeout(HZ/10);

                /*
                 * If woken prematurely then reset kswapd_classzone_idx and
                 * order. The values will either be from a wakeup request or
                 * the previous request that slept prematurely.
                 */
                if (remaining) {
                        pgdat->kswapd_classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
                        pgdat->kswapd_order = max(pgdat->kswapd_order, reclaim_order);
                }

                finish_wait(&pgdat->kswapd_wait, &wait);
                prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
        }

        /*
         * After a short sleep, check if it was a premature sleep. If not, then
         * go fully to sleep until explicitly woken up.
         */
        if (!remaining &&
            prepare_kswapd_sleep(pgdat, reclaim_order, classzone_idx)) {
                trace_mm_vmscan_kswapd_sleep(pgdat->node_id);

                /*
                 * vmstat counters are not perfectly accurate and the estimated
                 * value for counters such as NR_FREE_PAGES can deviate from the
                 * true value by nr_online_cpus * threshold. To avoid the zone
                 * watermarks being breached while under pressure, we reduce the
                 * per-cpu vmstat threshold while kswapd is awake and restore
                 * them before going back to sleep.
                 */
                set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);

                if (!kthread_should_stop())
                        schedule();

                set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold);
        } else {
                if (remaining)
                        count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
                else
                        count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
        }
        finish_wait(&pgdat->kswapd_wait, &wait);
}

노드에 대해 요청 order 및 zone 까지 free 페이지가 high 워터마크 기준으로 밸런스하게 할당할 수 있는 상태라면 sleep 한다.

코드 라인 7~8에서 freeze 요청이 있는 경우 함수를 빠져나간다.
코드 라인 10에서 현재 태스크를 kswapd_wait에 추가하여 sleep할 준비를 한다.
코드 섹션 19~48에서 요청 zone 까지 그리고 요청 order에 대해 free 페이지가 밸런스된 high 워터마크 기준을 충족하는 경우의 처리이다.
- 최근에 direct-compaction이 완료되어 해당 존에서 compaction을 다시 처음부터 시작할 수 있도록 존에 compact_blockskip_flush이 설정된다. 이렇게 설정된 존들에 대해 skip 블럭 비트를 모두 클리어한다.
- compactd 스레드를 깨운다.
- 0.1초를 sleep 한다. 만일 중간에 깬 경우 노드에 kswapd가 처리중인 zone과 order를 기록해둔다.
- 다시 슬립할 준비를 한다.
코드 섹션 54~71에서 중간에 깨어나지 않고 0.1초를 완전히 슬립하였고, 여전히 요청 zone 까지 그리고 요청 order에 대해 free 페이지가 확보되어 밸런스된 high 워터마크 기준을 충족하면 다음과 같이 처리한다.
- NR_FREE_PAGES 등의 vmstat을 정밀하게 계산할 필요 여부를 per-cpu 스레졸드라고 하는데, 이를 일반적인 기준의 스레졸드로 지정하도록 노드에 포함된 각 zone에 대해 normal한 스레졸드 값을 지정한다.
- 스레드 종료 요청이 아닌 경우 sleep한다.
- kswapd가 깨어났다는 이유는 메모리 압박 상황이 되었다라는 의미이다. 따라서 이번에는 per-cpu 스레졸드 값으로 pressure한 스레졸드 값을 사용하기 지정한다.
코드 섹션 72~77에서 메모리 부족 상황이 빠르게 온 경우이다. 슬립하자마자 깨어났는데 조건에 따라 관련 카운터를 다음과 같이 처리한다.
- 0.1초간 잠시 sleep 하는 와중에 다시 메모리 부족을 이유로 현재 스레드인 kswapd가 깨어난 경우에는 KSWAPD_LOW_WMARK_HIT_QUICKLY 카운터를 증가시킨다.
- 0.1초 슬립한 후에도 high 워터마크 기준을 충족할 만큼 메모리가 확보되지 못해 슬립하지 못하는 상황이다. 이러한 경우 KSWAPD_HIGH_WMARK_HIT_QUICKLY 카운터를 증가시킨다.
코드 섹션 78에서 kswapd_wait 에서 현재 태스크를 제거한다.

다음 그림은 kswapd_try_to_sleep() 함수를 통해 kswapd가 high 워터마크 기준을 충족하면 슬립하는 과정을 보여준다.

밸런스될 때까지 페이지 회수

balance_pgdat()

mm/vmscan.c -1/3-

/*
 * For kswapd, balance_pgdat() will reclaim pages across a node from zones
 * that are eligible for use by the caller until at least one zone is
 * balanced.
 *
 * Returns the order kswapd finished reclaiming at.
 *
 * kswapd scans the zones in the highmem->normal->dma direction.  It skips
 * zones which have free_pages > high_wmark_pages(zone), but once a zone is
 * found to have free_pages <= high_wmark_pages(zone), any page is that zone
 * or lower is eligible for reclaim until at least one usable zone is
 * balanced.
 */

static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
{
        int i;
        unsigned long nr_soft_reclaimed;
        unsigned long nr_soft_scanned;
        unsigned long pflags;
        unsigned long nr_boost_reclaim;
        unsigned long zone_boosts[MAX_NR_ZONES] = { 0, };
        bool boosted;
        struct zone *zone;
        struct scan_control sc = {
                .gfp_mask = GFP_KERNEL,
                .order = order,
                .may_unmap = 1,
        };

        psi_memstall_enter(&pflags);
        __fs_reclaim_acquire();

        count_vm_event(PAGEOUTRUN);

        /*
         * Account for the reclaim boost. Note that the zone boost is left in
         * place so that parallel allocations that are near the watermark will
         * stall or direct reclaim until kswapd is finished.
         */
        nr_boost_reclaim = 0;
        for (i = 0; i <= classzone_idx; i++) {
                zone = pgdat->node_zones + i;
                if (!managed_zone(zone))
                        continue;

                nr_boost_reclaim += zone->watermark_boost;
                zone_boosts[i] = zone->watermark_boost;
        }
        boosted = nr_boost_reclaim;

restart:
        sc.priority = DEF_PRIORITY;
        do {
                unsigned long nr_reclaimed = sc.nr_reclaimed;
                bool raise_priority = true;
                bool balanced;
                bool ret;

                sc.reclaim_idx = classzone_idx;

                /*
                 * If the number of buffer_heads exceeds the maximum allowed
                 * then consider reclaiming from all zones. This has a dual
                 * purpose -- on 64-bit systems it is expected that
                 * buffer_heads are stripped during active rotation. On 32-bit
                 * systems, highmem pages can pin lowmem memory and shrinking
                 * buffers can relieve lowmem pressure. Reclaim may still not
                 * go ahead if all eligible zones for the original allocation
                 * request are balanced to avoid excessive reclaim from kswapd.
                 */
                if (buffer_heads_over_limit) {
                        for (i = MAX_NR_ZONES - 1; i >= 0; i--) {
                                zone = pgdat->node_zones + i;
                                if (!managed_zone(zone))
                                        continue;

                                sc.reclaim_idx = i;
                                break;
                        }
                }

우선 순위를 12부터 1까지 높여가며 페이지 회수 및 compaction을 진행하여 free 페이지가 요청 order 및 zone까지 밸런스하게 high 워터마크 기준을 충족할 때까지 진행한다.

코드 라인 11~15에서 준비한 scan_control 구조체에 매핑된 페이지를 언맵할 수 있게 may_unmap=1로 설정한다.
코드 라인 17에서 메모리 부족으로 인한 현재 태스크의 psi 산출을 시작하는 지점이다.
- 참고: PSI – Pressure Stall Information | kernel.org
코드 라인 20에서 PAGEOUTRUN 카운터를 증가시킨다.
코드 라인 27~36에서 요청한 classzone_idx 이하의 존들을 순회하며 워터마크 부스트 값을 합산하여 boosted 및 nr_boost_reclaim에 대입한다. 그리고 존별 부스트 값도 zone_boosts[]에 대입한다.
코드 라인 38~40에서 restart: 레이블이다. 우선 순위를 초가값(12)으로 한 후 다시 시도한다.
코드 라인 46~67에서 reclaim할 존 인덱스 값으로 classzone_idx를 사용하되, buffer_heads_over_limit가 설정된 경우 가용한 최상위 존을 대입한다.

mm/vmscan.c -2/3-

.               /*
                 * If the pgdat is imbalanced then ignore boosting and preserve
                 * the watermarks for a later time and restart. Note that the
                 * zone watermarks will be still reset at the end of balancing
                 * on the grounds that the normal reclaim should be enough to
                 * re-evaluate if boosting is required when kswapd next wakes.
                 */
                balanced = pgdat_balanced(pgdat, sc.order, classzone_idx);
                if (!balanced && nr_boost_reclaim) {
                        nr_boost_reclaim = 0;
                        goto restart;
                }

                /*
                 * If boosting is not active then only reclaim if there are no
                 * eligible zones. Note that sc.reclaim_idx is not used as
                 * buffer_heads_over_limit may have adjusted it.
                 */
                if (!nr_boost_reclaim && balanced)
                        goto out;

                /* Limit the priority of boosting to avoid reclaim writeback */
                if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
                        raise_priority = false;

                /*
                 * Do not writeback or swap pages for boosted reclaim. The
                 * intent is to relieve pressure not issue sub-optimal IO
                 * from reclaim context. If no pages are reclaimed, the
                 * reclaim will be aborted.
                 */
                sc.may_writepage = !laptop_mode && !nr_boost_reclaim;
                sc.may_swap = !nr_boost_reclaim;
                sc.may_shrinkslab = !nr_boost_reclaim;

                /*
                 * Do some background aging of the anon list, to give
                 * pages a chance to be referenced before reclaiming. All
                 * pages are rotated regardless of classzone as this is
                 * about consistent aging.
                 */
                age_active_anon(pgdat, &sc);

                /*
                 * If we're getting trouble reclaiming, start doing writepage
                 * even in laptop mode.
                 */
                if (sc.priority < DEF_PRIORITY - 2)
                        sc.may_writepage = 1;

                /* Call soft limit reclaim before calling shrink_node. */
                sc.nr_scanned = 0;
                nr_soft_scanned = 0;
                nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(pgdat, sc.order,
                                                sc.gfp_mask, &nr_soft_scanned);
                sc.nr_reclaimed += nr_soft_reclaimed;

                /*
                 * There should be no need to raise the scanning priority if
                 * enough pages are already being scanned that that high
                 * watermark would be met at 100% efficiency.
                 */
                if (kswapd_shrink_node(pgdat, &sc))
                        raise_priority = false;

코드 라인 8~12에서 노드가 밸런스 상태가 아니고, 부스트 중이면 nr_boost_reclaim을 리셋한 후 restart: 레이블로 이동하여 다시 시작한다.
코드 라인 19~20에서 노드가 이미 밸런스 상태이고 부스트 중이 아니면 더이상 페이지 확보를 할 필요 없으므로 out 레이블로 이동한다.
코드 라인 23~24에서 부스트 중에는 priority가 낮은 순위(12~10)는 상관없지만 높은 순위(9~1)부터는 더 이상 우선 순위가 높아지지 않도록 raise_priority에 false를 대입한다.
- 가능하면 낮은 우선 순위에서는 writeback을 허용하지 않는다.
코드 라인 32~34에서 랩톱(절전 지원) 모드가 아니고 부스트 중이 아니면 may_writepage를 1로 설정하여 writeback을 허용한다. 그리고 부스트 중이 아니면 may_swap 및 may_shrinkslab을 1로 설정하여 swap 및 슬랩 shrink를 지원한다.
코드 라인 42에서 swap이 활성화된 경우 inactive anon이 active anon보다 작을 경우 active 리스트에 대해 shrink를 수행하여 active와 inactive간의 밸런스를 다시 잡아준다.
코드 라인 48~49에서 높은 우선 순위(9~1)에서는 may_writepage를 1로 설정하여 writeback을 허용한다.
코드 라인 52~56에서 노드를 shrink하기 전에 memcg soft limit reclaim을 수행한다. 스캔한 수는 nr_soft_scanned에 대입되고, nr_reclaimed에는 soft reclaim된 페이지 수가 추가된다.
코드 라인 63~64에서 free 페이지가 high 워터마크 기준을 충족할 만큼 노드를 shrink 한다. 만일 shrink가 성공한 경우 순위를 증가시키지 않도록 raise_priority를 false로 설정한다.

mm/vmscan.c -3/3-

                /*
                 * If the low watermark is met there is no need for processes
                 * to be throttled on pfmemalloc_wait as they should not be
                 * able to safely make forward progress. Wake them
                 */
                if (waitqueue_active(&pgdat->pfmemalloc_wait) &&
                                allow_direct_reclaim(pgdat))
                        wake_up_all(&pgdat->pfmemalloc_wait);

                /* Check if kswapd should be suspending */
                __fs_reclaim_release();
                ret = try_to_freeze();
                __fs_reclaim_acquire();
                if (ret || kthread_should_stop())
                        break;

                /*
                 * Raise priority if scanning rate is too low or there was no
                 * progress in reclaiming pages
                 */
                nr_reclaimed = sc.nr_reclaimed - nr_reclaimed;
                nr_boost_reclaim -= min(nr_boost_reclaim, nr_reclaimed);

                /*
                 * If reclaim made no progress for a boost, stop reclaim as
                 * IO cannot be queued and it could be an infinite loop in
                 * extreme circumstances.
                 */
                if (nr_boost_reclaim && !nr_reclaimed)
                        break;

                if (raise_priority || !nr_reclaimed)
                        sc.priority--;
        } while (sc.priority >= 1);

        if (!sc.nr_reclaimed)
                pgdat->kswapd_failures++;

out:
        /* If reclaim was boosted, account for the reclaim done in this pass */
        if (boosted) {
                unsigned long flags;

                for (i = 0; i <= classzone_idx; i++) {
                        if (!zone_boosts[i])
                                continue;

                        /* Increments are under the zone lock */
                        zone = pgdat->node_zones + i;
                        spin_lock_irqsave(&zone->lock, flags);
                        zone->watermark_boost -= min(zone->watermark_boost, zone_boosts[i]);
                        spin_unlock_irqrestore(&zone->lock, flags);
                }

                /*
                 * As there is now likely space, wakeup kcompact to defragment
                 * pageblocks.
                 */
                wakeup_kcompactd(pgdat, pageblock_order, classzone_idx);
        }

        snapshot_refaults(NULL, pgdat);
        __fs_reclaim_release();
        psi_memstall_leave(&pflags);
        /*
         * Return the order kswapd stopped reclaiming at as
         * prepare_kswapd_sleep() takes it into account. If another caller
         * entered the allocator slow path while kswapd was awake, order will
         * remain at the higher level.
         */
        return sc.order;
}

코드 라인 6~8에서 페이지 할당 중 메모리가 부족하여 direct reclaim 시도 중 대기하고 있는 태스크들이 pfmemalloc_wait 리스트에 존재하고, 노드가 direct reclaim을 해도 된다고 판단하면 대기 중인 태스크들을 모두 깨운다.
- allow_direct_reclaim(): normal 존 이하의 free 페이지 합이 min 워터마크를 더한 페이지의 절반보다 큰 경우 true.
코드 라인 12~15에서 freeze 하였다가 깨어났었던 경우 또는 kswapd 스레드 정지 요청이 있는 경우 루프를 빠져나간다.
코드 라인 21~30에서 reclaimed 페이지와 nr_boost_reclaim을 산출한 후 부스트 중이 아니면서 회수된 페이지가 없으면 루프를 벗어난다.
코드 라인 32~34에서 우선 순위를 높이면서 최고 우선 순위까지 루프를 반복한다. 만일 회수된 페이지가 없거나 우선 순위 상승을 원하지 않는 경우에는 우선 순위 증가없이 루프를 반복한다.
코드 라인 36~37에서 루프 완료 후까지 회수된 페이지가 없으면 kswapd_failures 카운터를 증가시킨다.
코드 라인 39~60에서 out: 레이블이다. 처음 시도 시 부스트된 적이 있었으면 kcompactd를 깨운다. 또한 워터마크 부스트 값을 갱신한다.
코드 라인 64에서 메모리 부족으로 인한 현재 태스크의 psi 산출을 종료하는 지점이다.

아래 그림은 task A에서 direct 페이지 회수를 진행 중에 pfmemalloc 워터마크 기준 이하로 떨어진 경우 kswapd에 의해 페이지 회수가 될 때까지스로틀링 즉, direct 페이지 회수를 잠시 쉬게 하여 cpu 부하를 줄인다.

Kswapd 깨우기

wake_all_kswapds()

mm/page_alloc.c

static void wake_all_kswapds(unsigned int order, gfp_t gfp_mask,
                             const struct alloc_context *ac)
{
        struct zoneref *z;
        struct zone *zone;
        pg_data_t *last_pgdat = NULL;
        enum zone_type high_zoneidx = ac->high_zoneidx;

        for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, high_zoneidx,
                                        ac->nodemask) {
                if (last_pgdat != zone->zone_pgdat)
                        wakeup_kswapd(zone, gfp_mask, order, high_zoneidx);
                last_pgdat = zone->zone_pgdat;
        }
}

alloc context가 가리키는 zonelist 중 관련 노드의 kswpad를 깨운다.

wakeup_kswapd()

mm/vmscan.c

/*
 * A zone is low on free memory or too fragmented for high-order memory.  If
 * kswapd should reclaim (direct reclaim is deferred), wake it up for the zone's
 * pgdat.  It will wake up kcompactd after reclaiming memory.  If kswapd reclaim
 * has failed or is not needed, still wake up kcompactd if only compaction is
 * needed.
 */

void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order,
                   enum zone_type classzone_idx)
{
        pg_data_t *pgdat;

        if (!managed_zone(zone))
                return;

        if (!cpuset_zone_allowed(zone, gfp_flags))
                return;
        pgdat = zone->zone_pgdat;
        pgdat->kswapd_classzone_idx = kswapd_classzone_idx(pgdat,
                                                           classzone_idx);
        pgdat->kswapd_order = max(pgdat->kswapd_order, order);
        if (!waitqueue_active(&pgdat->kswapd_wait))
                return;

        /* Hopeless node, leave it to direct reclaim if possible */
        if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ||
            (pgdat_balanced(pgdat, order, classzone_idx) &&
             !pgdat_watermark_boosted(pgdat, classzone_idx))) {
                /*
                 * There may be plenty of free memory available, but it's too
                 * fragmented for high-order allocations.  Wake up kcompactd
                 * and rely on compaction_suitable() to determine if it's
                 * needed.  If it fails, it will defer subsequent attempts to
                 * ratelimit its work.
                 */
                if (!(gfp_flags & __GFP_DIRECT_RECLAIM))
                        wakeup_kcompactd(pgdat, order, classzone_idx);
                return;
        }

        trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, classzone_idx, order,
                                      gfp_flags);
        wake_up_interruptible(&pgdat->kswapd_wait);
}

지정된 zone에서 order 페이지를 할당하려다 메모리가 부족해지면 kswapd 태스크를 깨운다.

코드 라인 6~7에서 유효한 존이 아닌 경우 처리할 페이지가 없으므로 함수를 빠져나간다.
코드 라인 9~10에서 요청한 zone의 노드가 cgroup cpuset을 통해 허가되지 않은 경우 처리를 포기하고 빠져나간다.
코드 라인 11~14에서 kswapd에 존과 order를 지정하여 요청한다.
코드 라인 15~16에서 kswapd가 이미 동작 중이면 함수를 빠져나간다.
코드 라인 19~32에서 다음 조건을 만족하고 direct-reclaim을 허용하지 않는 경우에 한해 kcompactd만 깨우고 함수를 빠져나간다.
- kswad를 통한 페이지 회수 실패가 MAX_RECLAIM_RETRIES(16)번 이상인 경우
- 노드가 이미 밸런스 상태이고 부스트 중이 아닌 경우
코드 라인 36에서 kswapd 태스크를 깨운다.

current_is_kswapd()

include/linux/swap.h

static inline int current_is_kswapd(void)
{
        return current->flags & PF_KSWAPD;
}

현재 태스크가 kswapd 인 경우 true를 반환한다.

kcompactd

kcompactd 초기화

kcompactd_init()

static int __init kcompactd_init(void)
{
        int nid;
        int ret;

        ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
                                        "mm/compaction:online",
                                        kcompactd_cpu_online, NULL);
        if (ret < 0) {
                pr_err("kcompactd: failed to register hotplug callbacks.\n");
                return ret;
        }

        for_each_node_state(nid, N_MEMORY)
                kcompactd_run(nid);
        return 0;
}
subsys_initcall(kcompactd_init)

kcompactd를 사용하기 위해 초기화한다.

코드 라인 6~12에서 cpu가 hot-plug를 통해 CPUHP_AP_ONLINE_DYN 상태로 변경될 때 kcompactd_cpu_online() 함수가 호출될 수 있도록 등록한다.
코드 라인 14~15에서 모든 메모리 노드에 대해 kcompactd를 실행시킨다.

kcompactd_run()

mm/compaction.c

/*
 * This kcompactd start function will be called by init and node-hot-add.
 * On node-hot-add, kcompactd will moved to proper cpus if cpus are hot-added.
 */

int kcompactd_run(int nid)
{
        pg_data_t *pgdat = NODE_DATA(nid);
        int ret = 0;

        if (pgdat->kcompactd)
                return 0;

        pgdat->kcompactd = kthread_run(kcompactd, pgdat, "kcompactd%d", nid);
        if (IS_ERR(pgdat->kcompactd)) {
                pr_err("Failed to start kcompactd on node %d\n", nid);
                ret = PTR_ERR(pgdat->kcompactd);
                pgdat->kcompactd = NULL;
        }
        return ret;
}

kcompactd 스레드를 동작시킨다.

코드 라인 3~7에서 @nid 노드의 kcompactd 스레드가 이미 실행 중인 경우 skip 하기 위해 0을 반환한다.
코드 라인 9~14에서 kcompactd 스레드를 동작시킨다.

kcompactd 동작

kcompactd()

mm/compaction.c

/*
 * The background compaction daemon, started as a kernel thread
 * from the init process.
 */

static int kcompactd(void *p)
{
        pg_data_t *pgdat = (pg_data_t*)p;
        struct task_struct *tsk = current;

        const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);

        if (!cpumask_empty(cpumask))
                set_cpus_allowed_ptr(tsk, cpumask);

        set_freezable();

        pgdat->kcompactd_max_order = 0;
        pgdat->kcompactd_classzone_idx = pgdat->nr_zones - 1;

        while (!kthread_should_stop()) {
                unsigned long pflags;

                trace_mm_compaction_kcompactd_sleep(pgdat->node_id);
                wait_event_freezable(pgdat->kcompactd_wait,
                                kcompactd_work_requested(pgdat));

                psi_memstall_enter(&pflags);
                kcompactd_do_work(pgdat);
                psi_memstall_leave(&pflags);
        }

        return 0;
}

각 메모리 노드에서 동작되는 kcompactd 스레드는 슬립한 상태에 있다가 kswapd가 페이지 회수를 진행한 후 호출되어 깨어나면 백그라운드에서 compaction을 진행하고 다시 슬립한다.

코드 라인 3~9에서 현재 kcompactd 스레드를 요청 노드에 포함된 cpu들에서만 동작할 수 있도록 cpu 비트마스크를 지정한다.
코드 라인11에서 태스크를 freeze할 수 있도록 PF_NOFREEZE 플래그를 제거한다.
- 참고: freeze | 문c
코드 라인 13~14에서 kcompactd의 최대 order와 존을 리셋한다.
코드 라인 16~26에서 종료 요청이 없는 한 계속 루프를 돌며 슬립한 후 외부 요청에 의해 깨어나면 compaction을 수행한다.

다음 그림은 kcompactd 스레드가 동작하는 과정을 보여준다.

kcompactd_do_work()

mm/compaction.c

static void kcompactd_do_work(pg_data_t *pgdat)
{
        /*
         * With no special task, compact all zones so that a page of requested
         * order is allocatable.
         */
        int zoneid;
        struct zone *zone;
        struct compact_control cc = {
                .order = pgdat->kcompactd_max_order,
                .total_migrate_scanned = 0,
                .total_free_scanned = 0,
                .classzone_idx = pgdat->kcompactd_classzone_idx,
                .mode = MIGRATE_SYNC_LIGHT,
                .ignore_skip_hint = false,
                .gfp_mask = GFP_KERNEL,
        };
        trace_mm_compaction_kcompactd_wake(pgdat->node_id, cc.order,
                                                        cc.classzone_idx);
        count_compact_event(KCOMPACTD_WAKE);

        for (zoneid = 0; zoneid <= cc.classzone_idx; zoneid++) {
                int status;

                zone = &pgdat->node_zones[zoneid];
                if (!populated_zone(zone))
                        continue;

                if (compaction_deferred(zone, cc.order))
                        continue;

                if (compaction_suitable(zone, cc.order, 0, zoneid) !=
                                                        COMPACT_CONTINUE)
                        continue;

                cc.nr_freepages = 0;
                cc.nr_migratepages = 0;
                cc.total_migrate_scanned = 0;
                cc.total_free_scanned = 0;
                cc.zone = zone;
                INIT_LIST_HEAD(&cc.freepages);
                INIT_LIST_HEAD(&cc.migratepages);

                if (kthread_should_stop())
                        return;
                status = compact_zone(zone, &cc);

                if (status == COMPACT_SUCCESS) {
                        compaction_defer_reset(zone, cc.order, false);
                } else if (status == COMPACT_PARTIAL_SKIPPED || status == COMPACT_COMPLETE) {
                        /*
                         * Buddy pages may become stranded on pcps that could
                         * otherwise coalesce on the zone's free area for
                         * order >= cc.order.  This is ratelimited by the
                         * upcoming deferral.
                         */
                        drain_all_pages(zone);

                        /*
                         * We use sync migration mode here, so we defer like
                         * sync direct compaction does.
                         */
                        defer_compaction(zone, cc.order);
                }

                count_compact_events(KCOMPACTD_MIGRATE_SCANNED,
                                     cc.total_migrate_scanned);
                count_compact_events(KCOMPACTD_FREE_SCANNED,
                                     cc.total_free_scanned);

                VM_BUG_ON(!list_empty(&cc.freepages));
                VM_BUG_ON(!list_empty(&cc.migratepages));
        }

        /*
         * Regardless of success, we are done until woken up next. But remember
         * the requested order/classzone_idx in case it was higher/tighter than
         * our current ones
         */
        if (pgdat->kcompactd_max_order <= cc.order)
                pgdat->kcompactd_max_order = 0;
        if (pgdat->kcompactd_classzone_idx >= cc.classzone_idx)
                pgdat->kcompactd_classzone_idx = pgdat->nr_zones - 1;
}

노드에 지정된 kcompactd_classzone_idx 존까지 kcompactd_max_order로 compaction을 수행한다.

코드 라인 9~17에서 kcompactd에서 사용할 compact_control을 다음과 같이 준비한다.
- .order에 외부에서 요청한 오더를 지정한다.
- .classzone_idx에 외부에서 요청한 존 인덱스를 지정한다.
- .mode에 MIGRATE_SYNC_LIGHT를 사용한다.
- skip 힌트를 사용하도록 한다.
코드 라인 20에서 KCOMPACTD_WAKE 카운터를 증가시킨다.
코드 라인 22~27에서 가장 낮은 존부터 요청한 존까지 순회하며 유효하지 않은 존은 스킵한다.
코드 라인 29~30에서 compaction 유예 조건인 존의 경우 스킵한다.
코드 라인 32~34에서 존이 compaction 하기 적절하지 않은 경우 스킵한다.
코드 라인 36~42에서 compaction을 하기 위해 compaction_control 결과를 담을 멤버들을 초기화한다.
코드 라인 44~45에서 스레드 종료 요청인 경우 함수를 빠져나간다.
코드 라인 46에서 존에 대해 compaction을 수행하고 결과를 알아온다.
코드 라인 48~49에서 만일 compaction이 성공한 경우 유예 카운터를 리셋한다.
코드 라인 50~64에서 만일 compaction이 완료될 때까지 원하는 order가 없는 경우 per-cpu 캐시를 회수하고, 유예 한도를 증가시킨다.
코드 라인 66~69에서 KCOMPACTD_MIGRATE_SCANNED 카운터 및 KCOMPACTD_FREE_SCANNED 카운터를 갱신한다.
코드 라인 80~81에서 진행 order보다 외부 요청 order가 작거나 동일하면 다음에 wakeup 하지 않도록 max_order를 0으로 리셋한다.
코드 라인 82~83에서 진행 존보다 외부 요청 존이 더 크거나 동일하면 다음에 시작할 존을 가장 높은 존으로 리셋한다.

Kcompatd 깨우기

wakeup_kcompactd()

mm/compaction.c

void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx)
{
        if (!order)
                return;

        if (pgdat->kcompactd_max_order < order)
                pgdat->kcompactd_max_order = order;

        if (pgdat->kcompactd_classzone_idx > classzone_idx)
                pgdat->kcompactd_classzone_idx = classzone_idx;

        /*
         * Pairs with implicit barrier in wait_event_freezable()
         * such that wakeups are not missed.
         */
        if (!wq_has_sleeper(&pgdat->kcompactd_wait))
                return;

        if (!kcompactd_node_suitable(pgdat))
                return;

        trace_mm_compaction_wakeup_kcompactd(pgdat->node_id, order,
                                                        classzone_idx);
        wake_up_interruptible(&pgdat->kcompactd_wait);
}

kcompactd 스레드를 깨운다.

코드 라인 3~4에서 order 값이 0인 경우 kcompactd를 깨우지 않고 함수를 빠져나간다.
코드 라인 6~7에서 @order가 kcompactd_max_order 보다 큰 경우 kcompactd_max_order를 갱신한다.
코드 라인 9~10에서 @classzone_idx가 kcompactd_classzone_idx보다 작은 경우 kcompactd_classzone_idx를 갱신한다.
코드 라인 16~17에서 kcompactd가 이미 깨어있으면 함수를 빠져나간다.
코드 라인 19~20에서 compaction을 진행해도 효과가 없을만한 노드의 경우 함수를 빠져나간다.
코드 라인 24에서 kcompactd를 깨운다.

kcompactd_node_suitable()

mm/compaction.c

static bool kcompactd_node_suitable(pg_data_t *pgdat)
{
        int zoneid;
        struct zone *zone;
        enum zone_type classzone_idx = pgdat->kcompactd_classzone_idx;

        for (zoneid = 0; zoneid <= classzone_idx; zoneid++) {
                zone = &pgdat->node_zones[zoneid];

                if (!populated_zone(zone))
                        continue;

                if (compaction_suitable(zone, pgdat->kcompactd_max_order, 0,
                                        classzone_idx) == COMPACT_CONTINUE)
                        return true;
        }

        return false;
}

kcompactd를 수행하기 위해 요청한 노드의 가용한 존들에 대해 하나라도 compaction 효과가 있을만한 존이 있는지 여부를 반환한다.

코드 라인 5~11에서 요청한 노드의 kcompactd_classzone_idx 까지 가용한 존들에 대해 순회한다.
코드 라인 13~15에서 순회 중인 존들 중 하나라도 kcompactd_max_order 값을 사용하여 compaction 효과가 있다고 판단하면 true를 반환한다.
코드 라인 18에서 해당 노드의 모든 존들에 대해 compaction 효과를 볼 수 없어 false를 반환한다.

기타

swapper_spaces[] 배열

mm/swap_state.c

struct address_space swapper_spaces[MAX_SWAPFILES];

swap_aops

mm/swap_state.c

/*                                      
 * swapper_space is a fiction, retained to simplify the path through
 * vmscan's shrink_page_list.
 */
static const struct address_space_operations swap_aops = {
        .writepage      = swap_writepage,
        .set_page_dirty = swap_set_page_dirty,
#ifdef CONFIG_MIGRATION
        .migratepage    = migrate_page,
#endif 
};

address_space_operations 구조체

include/linux/fs.h

struct address_space_operations {
        int (*writepage)(struct page *page, struct writeback_control *wbc);
        int (*readpage)(struct file *, struct page *);

        /* Write back some dirty pages from this mapping. */
        int (*writepages)(struct address_space *, struct writeback_control *);

        /* Set a page dirty.  Return true if this dirtied it */
        int (*set_page_dirty)(struct page *page);

        /*
         * Reads in the requested pages. Unlike ->readpage(), this is
         * PURELY used for read-ahead!.
         */
        int (*readpages)(struct file *filp, struct address_space *mapping,
                        struct list_head *pages, unsigned nr_pages);

        int (*write_begin)(struct file *, struct address_space *mapping,
                                loff_t pos, unsigned len, unsigned flags,
                                struct page **pagep, void **fsdata);
        int (*write_end)(struct file *, struct address_space *mapping,
                                loff_t pos, unsigned len, unsigned copied,
                                struct page *page, void *fsdata);

        /* Unfortunately this kludge is needed for FIBMAP. Don't use it */
        sector_t (*bmap)(struct address_space *, sector_t);
        void (*invalidatepage) (struct page *, unsigned int, unsigned int);
        int (*releasepage) (struct page *, gfp_t);
        void (*freepage)(struct page *);
        ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
        /*
         * migrate the contents of a page to the specified target. If
         * migrate_mode is MIGRATE_ASYNC, it must not block.
         */
        int (*migratepage) (struct address_space *,
                        struct page *, struct page *, enum migrate_mode);
        bool (*isolate_page)(struct page *, isolate_mode_t);
        void (*putback_page)(struct page *);
        int (*launder_page) (struct page *);
        int (*is_partially_uptodate) (struct page *, unsigned long,
                                        unsigned long);
        void (*is_dirty_writeback) (struct page *, bool *, bool *);
        int (*error_remove_page)(struct address_space *, struct page *);

        /* swapfile support */
        int (*swap_activate)(struct swap_info_struct *sis, struct file *file,
                                sector_t *span);
        void (*swap_deactivate)(struct file *file);
};

참고

Zoned Allocator -1- (물리 페이지 할당-Fastpath) | 문c
Zoned Allocator -2- (물리 페이지 할당-Slowpath) | 문c
Zoned Allocator -3- (Buddy 페이지 할당) | 문c
Zoned Allocator -4- (Buddy 페이지 해지) | 문c
Zoned Allocator -5- (Per-CPU Page Frame Cache) | 문c
Zoned Allocator -6- (Watermark) | 문c
Zoned Allocator -7- (Direct Compact) | 문c
Zoned Allocator -8- (Direct Compact-Isolation) | 문c
Zoned Allocator -9- (Direct Compact-Migration) | 문c
Zoned Allocator -10- (LRU & pagevec) | 문c
Zoned Allocator -11- (Direct Reclaim) | 문c
Zoned Allocator -12- (Direct Reclaim-Shrink-1) | 문c
Zoned Allocator -13- (Direct Reclaim-Shrink-2) | 문c
Zoned Allocator -14- (Kswapd & Kcompactd) | 문c – 현재 글

Freeze (hibernation/suspend)

2016-11-302016-11-30 문영일 Leave a comment

리눅스 시스템 차원의 서스펜드 또는 하이버네이션 기능으로 현재 동작중인 모든 유저 스레드와 몇 개의 커널 스레드들을 얼려 정지시킨다.

유저 스레드들은 시그널 핸들링 코드에 의해 try_to_freeze() 호출이 되면서 freeze가 된다.
커널 스레드들은 명확하게 적절한 위치에서 freeze 요청을 확인하고 호출해줘야 한다.
freeze 불가능한 작업큐에 있는 태스크들이 완료될 때 까지 대기하는데 딜레이 된 태스크들에 대해서는 전부 active 시킨 후 대기한다.

Freeze

freezing()

include/linux/freezer.h

/*
 * Check if there is a request to freeze a process
 */
static inline bool freezing(struct task_struct *p) 
{
        if (likely(!atomic_read(&system_freezing_cnt)))
                return false;
        return freezing_slow_path(p);
}

현재 태스크가 freeze 상태로 들어갈 수 있는지 여부를 확인하여 반환한다. (true=freeze 가능, false=freeze 불가능)

코드 라인 06~07에서 높은 확률로 전역 system_freezing_cnt가 0인 경우 freeze 요청이 없으므로 false를 반환한다.
코드 라인 08에서 현재 태스크가 freeze 상태로 들어갈 수 있는지 여부를 확인하여 반환한다.

freezing_slow_path()

kernel/freezer.c

/**
 * freezing_slow_path - slow path for testing whether a task needs to be frozen
 * @p: task to be tested
 *
 * This function is called by freezing() if system_freezing_cnt isn't zero
 * and tests whether @p needs to enter and stay in frozen state.  Can be
 * called under any context.  The freezers are responsible for ensuring the
 * target tasks see the updated state.
 */
bool freezing_slow_path(struct task_struct *p)
{
        if (p->flags & (PF_NOFREEZE | PF_SUSPEND_TASK))
                return false;

        if (test_thread_flag(TIF_MEMDIE))
                return false;

        if (pm_nosig_freezing || cgroup_freezing(p))
                return true;

        if (pm_freezing && !(p->flags & PF_KTHREAD))
                return true;

        return false;
}
EXPORT_SYMBOL(freezing_slow_path);

현재 태스크가 freeze 상태로 들어갈 수 있는지 여부를 확인하여 반환한다.

코드 라인 12~13에서 현재 태스크에 PF_NOFREEZE 또는 PF_SUSPEND_TASK 플래그가 설정된 경우 freeze 하지 못하게 false를 반환한다.
코드 라인 15~16에서 현재 태스크에 TIF_MEMDIE 플래그가 있는 경우 freeze 하지 못하게 false를 반환한다.
코드 라인18~19에서 전역 pm_nosig_freezing이 true이거나 현재 태스크에 대해 cgroup에 freezing이 설정된 경우 freezing을 할 수 있게 true를 반환한다.
코드 라인 21~22에서 전역 pm_freezing이 true이면서 현재 태스크가 커널 스레드가 아닌 경우 freezing을 할 수 있게 true를 반환한다.

try_to_freeze_tasks()

kernel/power/process.c

static int try_to_freeze_tasks(bool user_only)
{
        struct task_struct *g, *p;
        unsigned long end_time;
        unsigned int todo;
        bool wq_busy = false;
        struct timeval start, end;
        u64 elapsed_msecs64;
        unsigned int elapsed_msecs;
        bool wakeup = false;
        int sleep_usecs = USEC_PER_MSEC;

        do_gettimeofday(&start);

        end_time = jiffies + msecs_to_jiffies(freeze_timeout_msecs);

        if (!user_only)
                freeze_workqueues_begin();

        while (true) {
                todo = 0;
                read_lock(&tasklist_lock);
                for_each_process_thread(g, p) {
                        if (p == current || !freeze_task(p))
                                continue;

                        if (!freezer_should_skip(p))
                                todo++;
                }
                read_unlock(&tasklist_lock);

                if (!user_only) {
                        wq_busy = freeze_workqueues_busy();
                        todo += wq_busy;
                }

                if (!todo || time_after(jiffies, end_time))
                        break;

                if (pm_wakeup_pending()) {
                        wakeup = true;
                        break;
                }

                /*
                 * We need to retry, but first give the freezing tasks some
                 * time to enter the refrigerator.  Start with an initial
                 * 1 ms sleep followed by exponential backoff until 8 ms.
                 */
                usleep_range(sleep_usecs / 2, sleep_usecs);
                if (sleep_usecs < 8 * USEC_PER_MSEC)
                        sleep_usecs *= 2;
        }

        do_gettimeofday(&end);
        elapsed_msecs64 = timeval_to_ns(&end) - timeval_to_ns(&start);
        do_div(elapsed_msecs64, NSEC_PER_MSEC);
        elapsed_msecs = elapsed_msecs64;

        if (todo) {
                pr_cont("\n");
                pr_err("Freezing of tasks %s after %d.%03d seconds "
                       "(%d tasks refusing to freeze, wq_busy=%d):\n",
                       wakeup ? "aborted" : "failed",
                       elapsed_msecs / 1000, elapsed_msecs % 1000,
                       todo - wq_busy, wq_busy);

                if (!wakeup) {
                        read_lock(&tasklist_lock);
                        for_each_process_thread(g, p) {
                                if (p != current && !freezer_should_skip(p)
                                    && freezing(p) && !frozen(p))
                                        sched_show_task(p);
                        }
                        read_unlock(&tasklist_lock);
                }
        } else {
                pr_cont("(elapsed %d.%03d seconds) ", elapsed_msecs / 1000,
                        elapsed_msecs % 1000);
        }

        return todo ? -EBUSY : 0;
}

유저 스레드 및 커널 스레드들을 freeze 시도한다. 정상적으로 freeze한 경우 0을 반환하고 그렇지 않은 경우 -EBUSY를 반환한다. 인수로 user_only가 false인 경우 freeze 불가능한 작업큐에서 동작하는 태스크와 지연된 태스크들에 대해 모두 처리가 완료될 때까지 최대 20초를 기다린다.

코드 라인 15에서 현재 시간으로 부터 20초를 타임아웃으로 설정한다.
- freeze_timeout_msecs
  - 20 * MSEC_PER_SEC;
코드 라인 17~18에서 user_only가 false인 경우 freeze 할 수 없는 워크 큐에 있는 지연된 태스크들을 모두 pool 워크큐로 옮기고 activate 시킨다.
코드 라인 20~30에서 freeze되지 않고 아직 남아 있는 스레드 수를 todo로 산출한다.
- 모든 스레드들에서 현재 태스크와 freeze되지 않은 태스크들에 대해 skip 한다.
- freeze된 스레드들에서 freezer가 skip 해야 하는 경우가 아니었다면 todo를 증가시킨다.
코드 라인 32~35에서 pool 워크큐에서 여전히 activate 된 수를 todo에 더한다.
- user_only가 false인 경우 pool 워크큐가 여전히 active 된 태스크들이 존재하면 todo를 증가시킨다.
코드 라인 37~38에서 처리할 항목이 없거나 타임 오버된 경우 루프를 탈출한다.
코드 라인 40~43에서 suspend(freeze)가 포기된 경우 wakeup 준비한다.
코드 라인 50~53에서 sleep_usecs의 절반 ~ sleep_usecs 범위에서 sleep 한 후 sleep_usces가 8ms 미만인 경우 sleep_usecs를 2배로 키운다.
- 처음 sleep_usecs 값은 1000으로 1ms이다.
코드 라인 55~58에서 소요 시간을 산출하여 elapsed_msecs에 저장한다.
코드 라인 60~76에서 freeze 되지 않은 항목이 남아 있는 경우 이에 대한 에러 출력을 하고 해당 스레드에 대한 정보를 자세히 출력한다.

기타

다음은 freeze와 관련된 함수이다.

freeze_processes()
freeze_kernel_threads()
thaw_processes()

참고

Documentation/power/freezing-of-tasks.txt