문c Blog

Scheduler -16- (Load Balance 2)

2017-08-31 / 2023-06-24 문영일

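New Idle Balancing

When the runqueue of the current cpu has no task left to run, the cpu is about to enter the idle state. Before it does, newidle_balance() can pull a task from another cpu and run it here.

The caller acts on the return value of newidle_balance(). A positive value means a cfs task is available, so pick_next_task_fair() is retried. The other cases are:

- Negative: RETRY_TASK is returned, and a task to run is searched for again in the order stop -> dl -> rt.
- 0: NULL is returned, pick_next_task_idle() is called, and the cpu enters idle.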
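The return-value handling above can be sketched as a small mapping. This is only an illustration, not kernel code: the enum pick_action and both function names are hypothetical and exist only for this example.

```c
#include <assert.h>

/* Hypothetical labels for what the CFS pick path does next. */
enum pick_action {
        PICK_RETRY_CFS,    /* result > 0: retry pick_next_task_fair() */
        PICK_RETRY_HIGHER, /* result < 0: return RETRY_TASK */
        PICK_GO_IDLE,      /* result == 0: return NULL, go idle */
};

/* Map the newidle_balance() result to the next step. */
enum pick_action action_after_newidle(int result)
{
        if (result > 0)
                return PICK_RETRY_CFS;
        if (result < 0)
                return PICK_RETRY_HIGHER;
        return PICK_GO_IDLE;
}

/* Usage example with illustrative result values. */
void action_after_newidle_example(void)
{
        assert(action_after_newidle(2) == PICK_RETRY_CFS);
        assert(action_after_newidle(-1) == PICK_RETRY_HIGHER);
        assert(action_after_newidle(0) == PICK_GO_IDLE);
}
```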

newidle_balance()

kernel/sched/fair.c – 1/2

/*
 * idle_balance is called by schedule() if this_cpu is about to become
 * idle. Attempts to pull tasks from other CPUs.
 */

int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
{
        unsigned long next_balance = jiffies + HZ;
        int this_cpu = this_rq->cpu;
        struct sched_domain *sd;
        int pulled_task = 0;
        u64 curr_cost = 0;

        update_misfit_status(NULL, this_rq);
        /*
         * We must set idle_stamp _before_ calling idle_balance(), such that we
         * measure the duration of idle_balance() as idle time.
         */
        this_rq->idle_stamp = rq_clock(this_rq);

        /*
         * Do not pull tasks towards !active CPUs...
         */
        if (!cpu_active(this_cpu))
                return 0;

        /*
         * This is OK, because current is on_cpu, which avoids it being picked
         * for load-balance and preemption/IRQs are still disabled avoiding
         * further scheduler activity on it and we're being very careful to
         * re-start the picking loop.
         */
        rq_unpin_lock(this_rq, rf);

        if (this_rq->avg_idle < sysctl_sched_migration_cost ||
            !READ_ONCE(this_rq->rd->overload)) {

                rcu_read_lock();
                sd = rcu_dereference_check_sched_domain(this_rq->sd);
                if (sd)
                        update_next_balance(sd, &next_balance);
                rcu_read_unlock();

                nohz_newidle_balance(this_rq);

                goto out;
        }

        raw_spin_unlock(&this_rq->lock);

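Attempts newidle balancing when the cpu is about to enter idle. (Result: 0 = no task to run, -1 = an rt or dl task exists, positive = a cfs task exists.)

In code line 9, updates the misfit status.
In code line 14, records the time of idle entry.
In code lines 19~20, returns 0 if the cpu is not in the active state.
In code lines 30~42, if the average idle time is too short or the root domain is not overloaded, updates the next balancing time, runs nohz_newidle_balance(), and jumps to the out label, skipping the newidle pull.
- The default value of "/proc/sys/kernel/sched_migration_cost_ns" is 500,000 (ns).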
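The skip test in code lines 30~42 can be written as a small predicate. A minimal sketch, assuming the default migration cost; skip_newidle_pull() and MIGRATION_COST_NS are hypothetical names, and the real code reads this_rq->avg_idle and this_rq->rd->overload.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Default of /proc/sys/kernel/sched_migration_cost_ns, in nanoseconds. */
#define MIGRATION_COST_NS UINT64_C(500000)

/*
 * Skip the newidle pull when the cpu is expected to stay idle for less
 * than the migration cost, or when the root domain is not overloaded.
 */
bool skip_newidle_pull(uint64_t avg_idle_ns, bool rd_overload)
{
        return avg_idle_ns < MIGRATION_COST_NS || !rd_overload;
}

/* Usage example with illustrative values. */
void skip_newidle_pull_example(void)
{
        assert(skip_newidle_pull(100000, true));   /* idle too short */
        assert(skip_newidle_pull(900000, false));  /* nothing overloaded */
        assert(!skip_newidle_pull(900000, true));  /* worth trying a pull */
}
```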

kernel/sched/fair.c – 2/2

        update_blocked_averages(this_cpu);
        rcu_read_lock();
        for_each_domain(this_cpu, sd) {
                int continue_balancing = 1;
                u64 t0, domain_cost;

                if (!(sd->flags & SD_LOAD_BALANCE))
                        continue;

                if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
                        update_next_balance(sd, &next_balance);
                        break;
                }

                if (sd->flags & SD_BALANCE_NEWIDLE) {
                        t0 = sched_clock_cpu(this_cpu);

                        pulled_task = load_balance(this_cpu, this_rq,
                                                   sd, CPU_NEWLY_IDLE,
                                                   &continue_balancing);

                        domain_cost = sched_clock_cpu(this_cpu) - t0;
                        if (domain_cost > sd->max_newidle_lb_cost)
                                sd->max_newidle_lb_cost = domain_cost;

                        curr_cost += domain_cost;
                }

                update_next_balance(sd, &next_balance);

                /*
                 * Stop searching for tasks to pull if there are
                 * now runnable tasks on this rq.
                 */
                if (pulled_task || this_rq->nr_running > 0)
                        break;
        }
        rcu_read_unlock();

        raw_spin_lock(&this_rq->lock);

        if (curr_cost > this_rq->max_idle_balance_cost)
                this_rq->max_idle_balance_cost = curr_cost;

out:
        /*
         * While browsing the domains, we released the rq lock, a task could
         * have been enqueued in the meantime. Since we're not going idle,
         * pretend we pulled a task.
         */
        if (this_rq->cfs.h_nr_running && !pulled_task)
                pulled_task = 1;

        /* Move the next balance forward */
        if (time_after(this_rq->next_balance, next_balance))
                this_rq->next_balance = next_balance;

        /* Is there a task of a high priority class? */
        if (this_rq->nr_running != this_rq->cfs.h_nr_running)
                pulled_task = -1;

        if (pulled_task)
                this_rq->idle_stamp = 0;

        rq_repin_lock(this_rq, rf);

        return pulled_task;
}

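In code line 1, updates the blocked load averages.
In code lines 3~8, iterates from the lowest to the highest scheduling domain of the given cpu, skipping domains that do not allow load balancing.
In code lines 10~13, if the average idle time (avg_idle) is less than the balancing time accumulated from the lower domains plus the maximum balancing time of the current domain, balancing is not attempted, to avoid spending more on balancing than the idle period can cover. The next balancing time (1 tick up to a maximum of 0.1 second) is updated and the loop ends.
- curr_cost
  - idle balancing time accumulated from the lower domains
- domain_cost
  - idle balancing time spent on this domain
- sd->max_newidle_lb_cost
  - maximum idle balancing time of this domain
  - decays by 1% per second through rebalance_domains(), which runs at every scheduler tick.
In code lines 15~27, if the scheduling domain being visited allows newidle balancing, it is handled as follows.
- Performs idle load balancing and obtains the number of migrated tasks.
- Adds the time spent on idle load balancing in this domain to curr_cost. If that time is larger than the domain's max_newidle_lb_cost, max_newidle_lb_cost is updated as well.
In code line 29, updates the next balancing time.
In code lines 35~36, leaves the loop if a task was migrated or this runqueue now has one or more runnable tasks.
- Work has appeared on this cpu's runqueue, so it no longer needs to enter idle.
In code lines 42~43, if the time spent on idle balancing exceeds the runqueue's max_idle_balance_cost, updates it.
In code lines 45~52, the out: label. If idle balancing migrated no task but a new task was enqueued on the cfs runqueue in the meantime, there is no need to enter idle, so pulled_task is set to 1.
In code lines 55~56, if the runqueue's next balancing time is later than the next_balance computed here, it is moved forward to that earlier time.
In code lines 59~60, if the runqueue holds a task of a higher-priority class than cfs (stop, rt, dl), pulled_task is set to -1 so that the cpu does not enter idle.
In code lines 62~63, if the current cpu has a task to run and does not need to enter idle, resets idle_stamp to 0.
In code line 67, returns the number of migrated tasks. 0 = no task to run, -1 = an rt or dl task exists, positive = a cfs task exists.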
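The cost budget used by the domain loop can be modeled in a few lines. This is a simplified sketch under stated assumptions: struct toy_domain and toy_newidle_walk() are hypothetical, the balancing pass is replaced by a pre-filled measured_cost and pulled value, and locking and the SD flag checks are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-domain fields used by the walk. */
struct toy_domain {
        uint64_t max_newidle_lb_cost; /* worst cost seen so far (ns) */
        uint64_t measured_cost;       /* cost of this balance pass (ns) */
        int      pulled;              /* tasks this pass would pull */
};

/*
 * Walk domains from lowest to highest. Stop when the expected idle time
 * cannot cover the accumulated cost plus this domain's worst-case cost,
 * or as soon as a pass pulls a task. Returns the number of domains
 * actually balanced and stores the accumulated cost in *curr_cost.
 */
size_t toy_newidle_walk(struct toy_domain *sd, size_t n,
                        uint64_t avg_idle, uint64_t *curr_cost)
{
        size_t balanced = 0;

        assert(curr_cost != NULL);
        *curr_cost = 0;
        for (size_t i = 0; i < n; i++) {
                if (avg_idle < *curr_cost + sd[i].max_newidle_lb_cost)
                        break;

                /* Record a new worst case, then charge the cost. */
                if (sd[i].measured_cost > sd[i].max_newidle_lb_cost)
                        sd[i].max_newidle_lb_cost = sd[i].measured_cost;
                *curr_cost += sd[i].measured_cost;
                balanced++;

                if (sd[i].pulled)
                        break;
        }
        return balanced;
}
```

With an avg_idle of 800 and three domains whose worst-case costs are 100, 300, and 1000, the walk balances the first two domains and stops before the third, because the remaining budget no longer covers it.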

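The following figure shows the processing flow of newidle_balance().

nohz Balancing

The following flags are related to nohz balancing.

NOHZ_STATS_KICK
- Allows the nohz-related stats to be updated.
NOHZ_BALANCE_KICK
- Allows nohz idle balancing to be performed.

The following figure shows how nohz balancing is carried out.

Starting nohz idle balancing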

nohz_balancer_kick()

kernel/sched/fair.c -1/2-

/*
 * Current decision point for kicking the idle load balancer in the presence
 * of idle CPUs in the system.
 */

static void nohz_balancer_kick(struct rq *rq)
{
        unsigned long now = jiffies;
        struct sched_domain_shared *sds;
        struct sched_domain *sd;
        int nr_busy, i, cpu = rq->cpu;
        unsigned int flags = 0;

        if (unlikely(rq->idle_balance))
                return;

        /*
         * We may be recently in ticked or tickless idle mode. At the first
         * busy tick after returning from idle, we will update the busy stats.
         */
        nohz_balance_exit_idle(rq);

        /*
         * None are in tickless mode and hence no need for NOHZ idle load
         * balancing.
         */
        if (likely(!atomic_read(&nohz.nr_cpus)))
                return;

        if (READ_ONCE(nohz.has_blocked) &&
            time_after(now, READ_ONCE(nohz.next_blocked)))
                flags = NOHZ_STATS_KICK;

        if (time_before(now, nohz.next_balance))
                goto out;

        if (rq->nr_running >= 2) {
                flags = NOHZ_KICK_MASK;
                goto out;
        }

        rcu_read_lock();

        sd = rcu_dereference(rq->sd);
        if (sd) {
                /*
                 * If there's a CFS task and the current CPU has reduced
                 * capacity; kick the ILB to see if there's a better CPU to run
                 * on.
                 */
                if (rq->cfs.h_nr_running >= 1 && check_cpu_capacity(rq, sd)) {
                        flags = NOHZ_KICK_MASK;
                        goto unlock;
                }
        }

        sd = rcu_dereference(per_cpu(sd_asym_packing, cpu));
        if (sd) {
                /*
                 * When ASYM_PACKING; see if there's a more preferred CPU
                 * currently idle; in which case, kick the ILB to move tasks
                 * around.
                 */
                for_each_cpu_and(i, sched_domain_span(sd), nohz.idle_cpus_mask) {
                        if (sched_asym_prefer(i, cpu)) {
                                flags = NOHZ_KICK_MASK;
                                goto unlock;
                        }
                }
        }

Finds a cpu that is in nohz idle and has it run nohz balancing. The cpu is woken up through an IPI.

In code lines 9~10, returns if this runqueue is already doing idle balancing.
In code line 16, the current cpu has become busy. If this is the first busy tick, the current cpu is switched from nohz idle to busy.
In code lines 22~23, returns if no cpu is in nohz idle.
In code lines 25~27, if a nohz idle cpu has blocked load and the nohz.next_blocked time has passed, the NOHZ_STATS_KICK flag is set.
In code lines 29~30, jumps to the out label if the current time has not yet reached nohz.next_balance.
In code lines 32~35, if 2 or more tasks are running on the current cpu's runqueue, sets NOHZ_KICK_MASK and jumps to the out label.
In code lines 39~50, if 1 or more cfs tasks are running and the cfs capacity has been reduced by a certain amount because external load such as rt, dl, and irq is high, sets NOHZ_KICK_MASK and jumps to the unlock label.
In code lines 52~65, iterates over the nohz idle cpus of a domain with asym packing, such as POWER7 or x86 ITMT. If a visited cpu has a higher priority than the current cpu, sets NOHZ_KICK_MASK and jumps to the unlock label.

kernel/sched/fair.c -2/2-

        sd = rcu_dereference(per_cpu(sd_asym_cpucapacity, cpu));
        if (sd) {
                /*
                 * When ASYM_CPUCAPACITY; see if there's a higher capacity CPU
                 * to run the misfit task on.
                 */
                if (check_misfit_status(rq, sd)) {
                        flags = NOHZ_KICK_MASK;
                        goto unlock;
                }

                /*
                 * For asymmetric systems, we do not want to nicely balance
                 * cache use, instead we want to embrace asymmetry and only
                 * ensure tasks have enough CPU capacity.
                 *
                 * Skip the LLC logic because it's not relevant in that case.
                 */
                goto unlock;
        }

        sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
        if (sds) {
                /*
                 * If there is an imbalance between LLC domains (IOW we could
                 * increase the overall cache use), we need some less-loaded LLC
                 * domain to pull some load. Likewise, we may need to spread
                 * load within the current LLC domain (e.g. packed SMT cores but
                 * other CPUs are idle). We can't really know from here how busy
                 * the others are - so just get a nohz balance going if it looks
                 * like this LLC domain has tasks we could move.
                 */
                nr_busy = atomic_read(&sds->nr_busy_cpus);
                if (nr_busy > 1) {
                        flags = NOHZ_KICK_MASK;
                        goto unlock;
                }
        }
unlock:
        rcu_read_unlock();
out:
        if (flags)
                kick_ilb(flags);
}

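In code lines 1~20, in a domain with asymmetric cpu capacity such as big.LITTLE, if the current runqueue is in the misfit state and either the cpu has lower capacity than another cpu in the root domain or its cfs capacity is under pressure from rt/dl/irq utilization, sets NOHZ_KICK_MASK and jumps to the unlock label. Otherwise it also jumps to the unlock label, skipping the LLC check below.
In code lines 22~38, if 2 or more cpus are busy in the cache-sharing (LLC) domain, sets NOHZ_KICK_MASK and jumps to the unlock label.
In code lines 39~43, the unlock: and out: labels. If any flag was set, kick_ilb() finds a cpu in nohz idle and wakes it so that it runs nohz balancing.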
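The order of the checks in nohz_balancer_kick() can be summarized as a pure decision function. This is a rough model, not the kernel implementation: struct kick_input and toy_kick_flags() are hypothetical, each boolean stands for a condition the real code evaluates itself, the flag bit values are assumed for illustration, and the two early returns (already idle balancing, no nohz idle cpu) are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Bit values assumed for illustration. */
#define NOHZ_BALANCE_KICK 0x1u
#define NOHZ_STATS_KICK   0x2u
#define NOHZ_KICK_MASK    (NOHZ_BALANCE_KICK | NOHZ_STATS_KICK)

/* Hypothetical summary of the conditions the function looks at. */
struct kick_input {
        bool stats_due;         /* has_blocked set and next_blocked passed */
        bool balance_due;       /* nohz.next_balance has been reached */
        unsigned int nr_running;/* tasks on this runqueue */
        bool reduced_capacity;  /* cfs task present, capacity reduced */
        bool asym_prefer_idle;  /* a preferred cpu is nohz idle */
        bool has_asym_capacity; /* an asym cpu capacity domain exists */
        bool misfit;            /* misfit status check is true */
        unsigned int llc_busy;  /* busy cpus sharing the LLC */
};

/* Return the flags that would be passed to kick_ilb(), or 0 for none. */
unsigned int toy_kick_flags(const struct kick_input *in)
{
        unsigned int flags;

        assert(in != NULL);
        flags = in->stats_due ? NOHZ_STATS_KICK : 0;

        if (!in->balance_due)
                return flags;
        if (in->nr_running >= 2 || in->reduced_capacity ||
            in->asym_prefer_idle)
                return NOHZ_KICK_MASK;
        /* Asymmetric capacity: only misfit matters, LLC logic is skipped. */
        if (in->has_asym_capacity)
                return in->misfit ? NOHZ_KICK_MASK : flags;
        if (in->llc_busy > 1)
                return NOHZ_KICK_MASK;
        return flags;
}
```

Note that a pending stats update survives every path that does not upgrade the request to the full mask, which is why flags is initialized before the balance checks.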

kick_ilb()

kernel/sched/fair.c

/*
 * Kick a CPU to do the nohz balancing, if it is time for it. We pick any
 * idle CPU in the HK_FLAG_MISC housekeeping set (if there is one).
 */

static void kick_ilb(unsigned int flags)
{
        int ilb_cpu;

        nohz.next_balance++;

        ilb_cpu = find_new_ilb();

        if (ilb_cpu >= nr_cpu_ids)
                return;

        flags = atomic_fetch_or(flags, nohz_flags(ilb_cpu));
        if (flags & NOHZ_KICK_MASK)
                return;

        /*
         * Use smp_send_reschedule() instead of resched_cpu().
         * This way we generate a sched IPI on the target CPU which
         * is idle. And the softirq performing nohz idle load balance
         * will be run before returning from the IPI.
         */
        smp_send_reschedule(ilb_cpu);
}

Finds a cpu in nohz idle and wakes it so that it runs nohz balancing.

In code lines 5~10, finds a cpu in nohz idle. Returns if none is found.
In code lines 12~22, adds flags to the flags of that cpu's runqueue. If no flag in NOHZ_KICK_MASK was set before this, the cpu is woken up.
- Only the first kick sends the IPI.
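The "only the first kick sends the IPI" rule follows from the value returned by the atomic fetch-or. A minimal user-space sketch with C11 atomics, assuming illustrative bit values; toy_request_kick() is a hypothetical name, and the kernel's own atomic_fetch_or() takes its arguments in the opposite order (value first).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Bit values assumed for illustration. */
#define NOHZ_BALANCE_KICK 0x1u
#define NOHZ_STATS_KICK   0x2u
#define NOHZ_KICK_MASK    (NOHZ_BALANCE_KICK | NOHZ_STATS_KICK)

/*
 * Merge new kick flags into the target cpu's flag word. Returns true
 * only when no kick bit was set before, that is, when this caller is
 * the one that has to send the IPI.
 */
bool toy_request_kick(atomic_uint *cpu_flags, unsigned int flags)
{
        unsigned int old;

        /* Only kick bits may be requested. */
        assert((flags & ~NOHZ_KICK_MASK) == 0);

        /* C11 argument order is (object, value). */
        old = atomic_fetch_or(cpu_flags, flags);
        return !(old & NOHZ_KICK_MASK);
}
```

A second caller still gets its flags merged into the word, but it sees a non-zero old value and therefore does not send another IPI.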

find_new_ilb()

kernel/sched/fair.c

static inline int find_new_ilb(void)
{
        int ilb;

        for_each_cpu_and(ilb, nohz.idle_cpus_mask,
                              housekeeping_cpumask(HK_FLAG_MISC)) {
                if (idle_cpu(ilb))
                        return ilb;
        }

        return nr_cpu_ids;
}

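Returns the number of an idle cpu among the cpus that are both nohz idle and in the HK_FLAG_MISC housekeeping mask. Returns nr_cpu_ids if none is found.

The cpus given to nohz_full= are excluded from the HK_FLAG_MISC housekeeping mask, so they are not chosen as the ilb cpu.
- e.g. "nohz_full=5-8"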
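The selection can be pictured with plain bitmasks. A minimal sketch, assuming at most 64 cpus so that one uint64_t holds a mask; toy_find_new_ilb() and TOY_NR_CPUS are hypothetical, and the third mask stands in for the per-cpu idle_cpu() test.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NR_CPUS 64 /* hypothetical: one bit per cpu in a uint64_t */

/*
 * Return the lowest-numbered cpu set in all three masks, or TOY_NR_CPUS
 * when there is none (mirroring the nr_cpu_ids "not found" convention).
 */
int toy_find_new_ilb(uint64_t nohz_idle, uint64_t housekeeping,
                     uint64_t currently_idle)
{
        uint64_t candidates = nohz_idle & housekeeping & currently_idle;

        for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
                if (candidates & (UINT64_C(1) << cpu))
                        return cpu;
        }
        return TOY_NR_CPUS;
}

/* Usage example: cpus 2, 3, 5, 6 are nohz idle, cpus 0-4 do housekeeping. */
void toy_find_new_ilb_example(void)
{
        /* cpu 2 just became busy, so cpu 3 is picked. */
        assert(toy_find_new_ilb(0x6C, 0x1F, ~UINT64_C(4)) == 3);
        /* Only non-housekeeping cpus are idle: nothing to kick. */
        assert(toy_find_new_ilb(0x60, 0x1F, ~UINT64_C(0)) == TOY_NR_CPUS);
}
```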

nohz idle balance

nohz_idle_balance()

kernel/sched/fair.c

/*
 * In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
 * rebalancing for all the cpus for whom scheduler ticks are stopped.
 */

static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
{
        int this_cpu = this_rq->cpu;
        unsigned int flags;

        if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_KICK_MASK))
                return false;

        if (idle != CPU_IDLE) {
                atomic_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
                return false;
        }

        /* could be _relaxed() */
        flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
        if (!(flags & NOHZ_KICK_MASK))
                return false;

        _nohz_idle_balance(this_rq, flags, idle);

        return true;
}

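Runs nohz idle balancing. If the current cpu is idle and has been kicked, it balances on behalf of the cpus whose scheduler tick is stopped, migrating tasks from other busy cpus.

In code lines 6~17, returns false if the current cpu has not entered the idle state or has no flag in NOHZ_KICK_MASK. The flags in NOHZ_KICK_MASK are cleared.
- #define NOHZ_KICK_MASK (NOHZ_BALANCE_KICK | NOHZ_STATS_KICK)
In code lines 19~21, performs idle balancing and returns true.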
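The test-and-clear sequence on the kicked cpu can be sketched with C11 atomics. This is an illustration under assumptions, not the kernel code: toy_claim_kick() is a hypothetical name, the bit values are assumed, and atomic_fetch_and() with an inverted mask stands in for the kernel's atomic_fetch_andnot().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Bit values assumed for illustration. */
#define NOHZ_BALANCE_KICK 0x1u
#define NOHZ_STATS_KICK   0x2u
#define NOHZ_KICK_MASK    (NOHZ_BALANCE_KICK | NOHZ_STATS_KICK)

/*
 * Clear any pending kick flags and report whether this cpu should run
 * the nohz balance. On success the flags that were pending are stored
 * in *claimed.
 */
bool toy_claim_kick(atomic_uint *cpu_flags, bool cpu_is_idle,
                    unsigned int *claimed)
{
        unsigned int old;

        assert(claimed != NULL);

        /* Cheap check first: nothing pending, nothing to clear. */
        if (!(atomic_load(cpu_flags) & NOHZ_KICK_MASK))
                return false;

        /* Clear the kick bits and learn which ones were set. */
        old = atomic_fetch_and(cpu_flags, ~NOHZ_KICK_MASK);
        if (!cpu_is_idle || !(old & NOHZ_KICK_MASK))
                return false;

        *claimed = old & NOHZ_KICK_MASK;
        return true;
}
```

The flags are cleared even when the cpu turns out to be busy, so a stale kick does not stay pending.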

nohz newidle balance

nohz_newidle_balance()

kernel/sched/fair.c

static void nohz_newidle_balance(struct rq *this_rq)
{
        int this_cpu = this_rq->cpu;

        /*
         * This CPU doesn't want to be disturbed by scheduler
         * housekeeping
         */
        if (!housekeeping_cpu(this_cpu, HK_FLAG_SCHED))
                return;

        /* Will wake up very soon. No time for doing anything else*/
        if (this_rq->avg_idle < sysctl_sched_migration_cost)
                return;

        /* Don't need to update blocked load of idle CPUs*/
        if (!READ_ONCE(nohz.has_blocked) ||
            time_before(jiffies, READ_ONCE(nohz.next_blocked)))
                return;

        raw_spin_unlock(&this_rq->lock);
        /*
         * This CPU is going to be idle and blocked load of idle CPUs
         * need to be updated. Run the ilb locally as it is a good
         * candidate for ilb instead of waking up another idle CPU.
         * Kick an normal ilb if we failed to do the update.
         */
        if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
                kick_ilb(NOHZ_STATS_KICK);
        raw_spin_lock(&this_rq->lock);
}

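newidle_balance() normally performs newidle load balancing, but when it skips that work (for example, because nothing is overloaded) it calls this function instead. This function only updates the nohz blocked load and gives up on pull migration. If the update fails, another nohz cpu is asked through an IPI to update the nohz blocked load.

In code lines 9~10, returns if the current cpu is not a housekeeping cpu for HK_FLAG_SCHED.
- Kept for future use in relation to nohz and isolcpus.
In code lines 13~14, returns if the average idle time is shorter than sysctl_sched_migration_cost (default 500 us).
- This is the situation where wakeup / sleep happens frequently.
In code lines 17~19, returns if nohz.has_blocked is not set (no nohz idle cpu has blocked load to update) or the next_blocked time, which has a 32 ms period, has not arrived yet.
In code lines 28~29, updates the nohz blocked load. If the update fails, one of the nohz cpus is selected and asked through an IPI call to update the nohz blocked load.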

_nohz_idle_balance()

kernel/sched/fair.c -1/2-

/*
 * Internal function that runs load balance for all idle cpus. The load balance
 * can be a simple update of blocked load or a complete load balance with
 * tasks movement depending of flags.
 * The function returns false if the loop has stopped before running
 * through all idle CPUs.
 */

static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
                               enum cpu_idle_type idle)
{
        /* Earliest time when we have to do rebalance again */
        unsigned long now = jiffies;
        unsigned long next_balance = now + 60*HZ;
        bool has_blocked_load = false;
        int update_next_balance = 0;
        int this_cpu = this_rq->cpu;
        int balance_cpu;
        int ret = false;
        struct rq *rq;

        SCHED_WARN_ON((flags & NOHZ_KICK_MASK) == NOHZ_BALANCE_KICK);

        /*
         * We assume there will be no idle load after this update and clear
         * the has_blocked flag. If a cpu enters idle in the mean time, it will
         * set the has_blocked flag and trig another update of idle load.
         * Because a cpu that becomes idle, is added to idle_cpus_mask before
         * setting the flag, we are sure to not clear the state and not
         * check the load of an idle cpu.
         */
        WRITE_ONCE(nohz.has_blocked, 0);

        /*
         * Ensures that if we miss the CPU, we must see the has_blocked
         * store from nohz_balance_enter_idle().
         */
        smp_mb();

        for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
                if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
                        continue;

                /*
                 * If this CPU gets work to do, stop the load balancing
                 * work being done for other CPUs. Next load
                 * balancing owner will pick it up.
                 */
                if (need_resched()) {
                        has_blocked_load = true;
                        goto abort;
                }

                rq = cpu_rq(balance_cpu);

                has_blocked_load |= update_nohz_stats(rq, true);

                /*
                 * If time for next balance is due,
                 * do the balance.
                 */
                if (time_after_eq(jiffies, rq->next_balance)) {
                        struct rq_flags rf;

                        rq_lock_irqsave(rq, &rf);
                        update_rq_clock(rq);
                        rq_unlock_irqrestore(rq, &rf);

                        if (flags & NOHZ_BALANCE_KICK)
                                rebalance_domains(rq, CPU_IDLE);
                }

                if (time_after(next_balance, rq->next_balance)) {
                        next_balance = rq->next_balance;
                        update_next_balance = 1;
                }
        }

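Performs nohz idle (or newidle) balancing.

In code line 6, the next balancing time starts at the current time + 60 seconds. This is an upper bound and is lowered later.
In code lines 32~34, iterates over the cpus in nohz idle, skipping the current cpu and busy cpus.
In code lines 41~44, if a reschedule has been requested, leaves the loop and jumps to the abort label.
In code lines 46~48, finds out whether the runqueue of the visited cpu has blocked load.
In code lines 54~63, if the balancing time has been reached, updates the runqueue clock and then, when NOHZ_BALANCE_KICK is set, attempts balancing.
In code lines 65~68, lowers next_balance, initially set 60 seconds ahead, to the runqueue's next_balance when that is earlier.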
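The next_balance handling relies on a wraparound-safe comparison of tick counters. A minimal sketch, assuming unsigned long tick values as with jiffies; toy_time_after() mirrors the usual signed-difference idiom, and toy_earliest_balance() is a hypothetical helper that models only the "lower the upper bound" part of the loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Wraparound-safe "a is later than b" for a free-running tick counter. */
bool toy_time_after(unsigned long a, unsigned long b)
{
        return (long)(b - a) < 0;
}

/*
 * Start from an upper bound and lower it to the earliest per-runqueue
 * deadline. Returns true when the bound was lowered at least once.
 */
bool toy_earliest_balance(const unsigned long *rq_next, size_t n,
                          unsigned long upper_bound,
                          unsigned long *earliest)
{
        bool updated = false;

        assert(earliest != NULL);
        *earliest = upper_bound;
        for (size_t i = 0; i < n; i++) {
                if (toy_time_after(*earliest, rq_next[i])) {
                        *earliest = rq_next[i];
                        updated = true;
                }
        }
        return updated;
}
```

Comparing the raw values with < would give the wrong answer once the counter wraps, which is why the difference is interpreted as a signed number instead.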

kernel/sched/fair.c -2/2-

        /* Newly idle CPU doesn't need an update */
        if (idle != CPU_NEWLY_IDLE) {
                update_blocked_averages(this_cpu);
                has_blocked_load |= this_rq->has_blocked_load;
        }

        if (flags & NOHZ_BALANCE_KICK)
                rebalance_domains(this_rq, CPU_IDLE);

        WRITE_ONCE(nohz.next_blocked,
                now + msecs_to_jiffies(LOAD_AVG_PERIOD));

        /* The full idle balance loop has been done */
        ret = true;

abort:
        /* There is still blocked load, enable periodic update */
        if (has_blocked_load)
                WRITE_ONCE(nohz.has_blocked, 1);

        /*
         * next_balance will be updated only when there is a need.
         * When the CPU is attached to null domain for ex, it will not be
         * updated.
         */
        if (likely(update_next_balance))
                nohz.next_balance = next_balance;

        return ret;
}

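In code lines 2~5, unless this is newidle balancing, updates the blocked averages and finds out whether this runqueue has blocked load.
In code lines 7~8, performs nohz balancing for this runqueue when NOHZ_BALANCE_KICK is set.
In code lines 10~14, sets the next blocked load update time (nohz.next_blocked) to the current time + 32 ms and sets ret to true.
In code lines 16~19, the abort: label. If blocked load still remains, nohz.has_blocked is set to 1 so that it keeps being updated.
In code lines 26~27, updates nohz.next_balance if an earlier time was found.

Blocked Load Update

The main members related to nohz are described below.

rq->has_blocked_load
- Set to 1 by nohz_balance_enter_idle() when the cpu enters the nohz state.
- Cleared to 0 through update_blocked_averages() -> update_blocked_load_status() when no load or utilization of cfs, dl, rt, and irq remains at all.
nohz.has_blocked
- Set to 1 by nohz_balance_enter_idle() when the cpu enters the nohz state.
- When _nohz_idle_balance() runs, this value is first cleared to 0 and is set to 1 again if blocked load still remains. For the current cpu this is checked through update_blocked_averages() only in the nohz idle balancing case.
- nohz idle balancing is performed only when this value is 1.
- When this value is 1, newidle balancing sets the LBF_NOHZ_STATS flag so that the nohz-related stats are updated.
- When this value is 1 and the nohz.next_blocked time has passed, the NOHZ_STATS_KICK flag is set so that the nohz-related stats are updated.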

update_blocked_averages()

kernel/sched/fair.c

static void update_blocked_averages(int cpu)
{
        struct rq *rq = cpu_rq(cpu);
        struct cfs_rq *cfs_rq, *pos;
        const struct sched_class *curr_class;
        struct rq_flags rf;
        bool done = true;

        rq_lock_irqsave(rq, &rf);
        update_rq_clock(rq);

        /*
         * update_cfs_rq_load_avg() can call cpufreq_update_util(). Make sure
         * that RT, DL and IRQ signals have been updated before updating CFS.
         */
        curr_class = rq->curr->sched_class;
        update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
        update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
        update_irq_load_avg(rq, 0);

        /* Don't need periodic decay once load/util_avg are null */
        if (others_have_blocked(rq))
                done = false;

        /*
         * Iterates the task_group tree in a bottom up fashion, see
         * list_add_leaf_cfs_rq() for details.
         */
        for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
                struct sched_entity *se;

                if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq))
                        update_tg_load_avg(cfs_rq, 0);

                /* Propagate pending load changes to the parent, if any: */
                se = cfs_rq->tg->se[cpu];
                if (se && !skip_blocked_update(se))
                        update_load_avg(cfs_rq_of(se), se, 0);

                /*
                 * There can be a lot of idle CPU cgroups.  Don't let fully
                 * decayed cfs_rqs linger on the list.
                 */
                if (cfs_rq_is_decayed(cfs_rq))
                        list_del_leaf_cfs_rq(cfs_rq);

                /* Don't need periodic decay once load/util_avg are null */
                if (cfs_rq_has_blocked(cfs_rq))
                        done = false;
        }

        update_blocked_load_status(rq, !done);
        rq_unlock_irqrestore(rq, &rf);
}

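Updates whether the runqueue has blocked load.

In code lines 10~19, updates the runqueue clock and, before updating the cfs load averages, first updates the rt, dl, and irq load averages.
In code lines 22~23, if any dl, rt, or irq utilization remains, sets done to false.
In code lines 29~38, updates the task group and cfs runqueue load averages for every leaf cfs runqueue attached to the runqueue.
- A leaf cfs runqueue is a cfs runqueue that has tasks attached; each one appears in the list only once.
In code lines 44~45, if a cfs runqueue has decayed and has no load left, removes it from the leaf cfs runqueue list.
In code lines 48~49, if the cfs runqueue still has load, sets done to false.
In code line 52, updates the blocked load status.
- rq->has_blocked_load is cleared to 0 only when done is still true, that is, when nothing has load left. If even one cfs runqueue still has load, the flag is left unchanged.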

others_have_blocked()

kernel/sched/fair.c

static inline bool others_have_blocked(struct rq *rq)
{
        if (READ_ONCE(rq->avg_rt.util_avg))
                return true;

        if (READ_ONCE(rq->avg_dl.util_avg))
                return true;

#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
        if (READ_ONCE(rq->avg_irq.util_avg))
                return true;
#endif

        return false;
}

Returns whether any utilization average other than cfs (rt, dl, or irq) remains, however small. (1 = utilization average exists, 0 = no utilization average)

skip_blocked_update()

kernel/sched/fair.c

/*
 * Check if we need to update the load and the utilization of a blocked
 * group_entity:
 */

static inline bool skip_blocked_update(struct sched_entity *se)
{
        struct cfs_rq *gcfs_rq = group_cfs_rq(se);

        /*
         * If sched_entity still have not zero load or utilization, we have to
         * decay it:
         */
        if (se->avg.load_avg || se->avg.util_avg)
                return false;

        /*
         * If there is a pending propagation, we have to update the load and
         * the utilization of the sched_entity:
         */
        if (gcfs_rq->propagate)
                return false;

        /*
         * Otherwise, the load and the utilization of the sched_entity is
         * already zero and there is no pending propagation, so it will be a
         * waste of time to try to decay it:
         */
        return true;
}

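Returns whether the update of a group entity can be skipped because it has no utilization or load average left and no propagation is pending. (1 = skip, 0 = cannot skip because load average or utilization remains)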

cfs_rq_is_decayed()

kernel/sched/fair.c

static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
{
        if (cfs_rq->load.weight)
                return false;

        if (cfs_rq->avg.load_sum)
                return false;

        if (cfs_rq->avg.util_sum)
                return false;

        if (cfs_rq->avg.runnable_load_sum)
                return false;

        return true;
}

Returns whether the cfs runqueue has fully decayed, with no runnable load, load, or utilization left. (1 = nothing left, 0 = something remains)

cfs_rq_has_blocked()

kernel/sched/fair.c

static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
{
        if (cfs_rq->avg.load_avg)
                return true;

        if (cfs_rq->avg.util_avg)
                return true;

        return false;
}

Returns whether the cfs runqueue has blocked load. (1 = load or utilization exists, 0 = no load and no utilization)

update_blocked_load_status()

kernel/sched/fair.c

static inline void update_blocked_load_status(struct rq *rq, bool has_blocked)
{
        rq->last_blocked_load_update_tick = jiffies;

        if (!has_blocked)
                rq->has_blocked_load = 0;
}

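Updates the blocked load status of the runqueue.

has_blocked_load
- 0 = no blocked load or utilization remains. <- This function only sets this state.
- 1 = blocked load or utilization may remain. <- Set to 1 on entering nohz idle.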
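The life cycle of has_blocked_load can be modeled with a flattened runqueue. A minimal sketch under stated assumptions: struct toy_rq and both functions are hypothetical, the PELT decay itself is not modeled, and a single pair of cfs fields stands in for the whole leaf cfs runqueue list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, flattened view of the averages that can hold blocked load. */
struct toy_rq {
        unsigned long rt_util, dl_util, irq_util; /* non-cfs util_avg */
        unsigned long cfs_load, cfs_util;         /* cfs load_avg, util_avg */
        int has_blocked_load;
};

/* Entering nohz idle: assume blocked load may remain. */
void toy_enter_nohz_idle(struct toy_rq *rq)
{
        assert(rq != NULL);
        rq->has_blocked_load = 1;
}

/*
 * After the averages were decayed, clear the flag only when nothing is
 * left; otherwise leave it untouched. Returns the resulting flag.
 */
int toy_update_blocked_status(struct toy_rq *rq)
{
        bool done = true;

        assert(rq != NULL);
        if (rq->rt_util || rq->dl_util || rq->irq_util)
                done = false;
        if (rq->cfs_load || rq->cfs_util)
                done = false;

        if (done)
                rq->has_blocked_load = 0;
        return rq->has_blocked_load;
}
```

The update step never sets the flag back to 1; only a new entry into nohz idle does that.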

Fork Balancing

wake_up_new_task()

kernel/sched/core.c

/*
 * wake_up_new_task - wake up a newly created task for the first time.
 *
 * This function will do some initial scheduler statistics housekeeping
 * that must be done for every newly created context, then puts the task
 * on the runqueue and wakes it.
 */

void wake_up_new_task(struct task_struct *p)
{
        struct rq_flags rf;
        struct rq *rq;

        raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
        p->state = TASK_RUNNING;
#ifdef CONFIG_SMP
        /*
         * Fork balancing, do it here and not earlier because:
         *  - cpus_ptr can change in the fork path
         *  - any previously selected CPU might disappear through hotplug
         *
         * Use __set_task_cpu() to avoid calling sched_class::migrate_task_rq,
         * as we're not fully set-up yet.
         */
        p->recent_used_cpu = task_cpu(p);
        __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
#endif
        rq = __task_rq_lock(p, &rf);
        update_rq_clock(rq);
        post_init_entity_util_avg(p);

        activate_task(rq, p, ENQUEUE_NOCLOCK);
        trace_sched_wakeup_new(p);
        check_preempt_curr(rq, p, WF_FORK);
#ifdef CONFIG_SMP
        if (p->sched_class->task_woken) {
                /*
                 * Nothing relies on rq->lock after this, so its fine to
                 * drop it.
                 */
                rq_unpin_lock(rq, &rf);
                p->sched_class->task_woken(rq, p);
                rq_repin_lock(rq, &rf);
        }
#endif
        task_rq_unlock(rq, p, &rf);
}

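Performs balancing for a new task. The new task is placed on an idle cpu where possible.

In code line 7, changes the task to the TASK_RUNNING state.
In code line 17, stores the most recently used cpu number in p->recent_used_cpu. The stored cpu number is used later by select_idle_sibling().
In code line 18, uses the SD_BALANCE_FORK flag to find the most suitable cpu for the task to run on and sets it on the task.
In code lines 21~22, updates the runqueue clock and determines the initial value of the task's utilization average from the load average of the cfs runqueue the new task will run on.
In code line 24, enqueues the task on the runqueue and activates it.
In code line 26, checks whether preemption is needed.
In code lines 28~36, calls the function registered in the (*task_woken) hook of the task's scheduling class. The dl or rt scheduler's function may run here: if the new task cannot run right away because of its priority, the runqueue is marked dl or rt overloaded so that push balancing to another cpu is performed.

Exec Balancing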

sched_exec()

kernel/sched/core.c

/*
 * sched_exec - execve() is a valuable balancing opportunity, because at
 * this point the task has the smallest effective memory and cache footprint.
 */

void sched_exec(void)
{
        struct task_struct *p = current;
        unsigned long flags;
        int dest_cpu;

        raw_spin_lock_irqsave(&p->pi_lock, flags);
        dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0);
        if (dest_cpu == smp_processor_id())
                goto unlock;

        if (likely(cpu_active(dest_cpu))) {
                struct migration_arg arg = { p, dest_cpu };

                raw_spin_unlock_irqrestore(&p->pi_lock, flags);
                stop_one_cpu(task_cpu(p), migration_cpu_stop, &arg);
                return;
        }
unlock:
        raw_spin_unlock_irqrestore(&p->pi_lock, flags);
}

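Performs balancing for a task that is about to exec.

In code lines 8~10, selects the cpu the task will run on through exec balancing. If the selected cpu is the current cpu, returns without migrating.
In code lines 12~18, if the selected cpu is active, asks the stopper thread of the task's cpu through stop_one_cpu() to run migration_cpu_stop(), which migrates the current task to the selected cpu.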

migration_cpu_stop()

kernel/sched/core.c

/*
 * migration_cpu_stop - this will be executed by a highprio stopper thread
 * and performs thread migration by bumping thread off CPU then
 * 'pushing' onto another runqueue.
 */

static int migration_cpu_stop(void *data)
{
        struct migration_arg *arg = data;
        struct task_struct *p = arg->task;
        struct rq *rq = this_rq();
        struct rq_flags rf;

        /*
         * The original target CPU might have gone down and we might
         * be on another CPU but it doesn't matter.
         */
        local_irq_disable();
        /*
         * We need to explicitly wake pending tasks before running
         * __migrate_task() such that we will not miss enforcing cpus_ptr
         * during wakeups, see set_cpus_allowed_ptr()'s TASK_WAKING test.
         */
        sched_ttwu_pending();

        raw_spin_lock(&p->pi_lock);
        rq_lock(rq, &rf);
        /*
         * If task_rq(p) != rq, it cannot be migrated here, because we're
         * holding rq->lock, if p->on_rq == 0 it cannot get enqueued because
         * we're holding p->pi_lock.
         */
        if (task_rq(p) == rq) {
                if (task_on_rq_queued(p))
                        rq = __migrate_task(rq, &rf, p, arg->dest_cpu);
                else
                        p->wake_cpu = arg->dest_cpu;
        }
        rq_unlock(rq, &rf);
        raw_spin_unlock(&p->pi_lock);

        local_irq_enable();
        return 0;
}

Activates all tasks whose wakeup is pending on the runqueue, then migrates the requested task to the runqueue of the dest cpu.

__migrate_task()

kernel/sched/core.c

/*
 * Move (not current) task off this CPU, onto the destination CPU. We're doing
 * this because either it can't run here any more (set_cpus_allowed()
 * away from this CPU, or CPU going down), or because we're
 * attempting to rebalance this task on exec (sched_exec).
 *
 * So we race with normal scheduler movements, but that's OK, as long
 * as the task is no longer on this CPU.
 */

static struct rq *__migrate_task(struct rq *rq, struct rq_flags *rf,
                                 struct task_struct *p, int dest_cpu)
{
        /* Affinity changed (again). */
        if (!is_cpu_allowed(p, dest_cpu))
                return rq;

        update_rq_clock(rq);
        rq = move_queued_task(rq, rf, p, dest_cpu);

        return rq;
}

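Migrates the requested task to the runqueue of @dest_cpu. If the task is not allowed to run on @dest_cpu, no migration happens and the original @rq is returned.

In code lines 5~6, if the task is not allowed to run on @dest_cpu, the original @rq is returned.
In code lines 8~11, the runqueue clock is updated, the task is migrated to @dest_cpu, and the runqueue of @dest_cpu is returned.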

move_queued_task()

kernel/sched/core.c

/*
 * This is how migration works:
 *
 * 1) we invoke migration_cpu_stop() on the target CPU using
 *    stop_one_cpu().
 * 2) stopper starts to run (implicitly forcing the migrated thread
 *    off the CPU)
 * 3) it checks whether the migrated task is still in the wrong runqueue.
 * 4) if it's in the wrong runqueue then the migration thread removes
 *    it and puts it into the right queue.
 * 5) stopper completes and stop_one_cpu() returns and the migration
 *    is done.
 */

/*
 * move_queued_task - move a queued task to new rq.
 *
 * Returns (locked) new rq. Old rq's lock is released.
 */

static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
                                   struct task_struct *p, int new_cpu)
{
        lockdep_assert_held(&rq->lock);

        WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
        dequeue_task(rq, p, DEQUEUE_NOCLOCK);
        set_task_cpu(p, new_cpu);
        rq_unlock(rq, rf);

        rq = cpu_rq(new_cpu);

        rq_lock(rq, rf);
        BUG_ON(task_cpu(p) != new_cpu);
        enqueue_task(rq, p, 0);
        p->on_rq = TASK_ON_RQ_QUEUED;
        check_preempt_curr(rq, p, 0);

        return rq;
}

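Migrates the requested task to the new cpu.

In code lines 6~8, the task's on_rq is changed to the migrating state and the task is dequeued from the runqueue. Then @new_cpu is assigned to the task.
In code lines 11~16, the task is enqueued on the runqueue of @new_cpu and its on_rq state is changed back to queued.
In code line 17, a check is made so that a reschedule request is set if preemption is needed.
In code line 19, the runqueue of @new_cpu is returned.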

Wake Balancing

When try_to_wake_up() calls select_task_rq(), it passes the SD_BALANCE_WAKE flag to perform wake balancing.

See: Scheduler -5- (Scheduler Core) | 문c

Common to FORK, EXEC and WAKE Balancing

Selecting an appropriate cpu for a task to run on

The following figure shows the call relationships below the select_task_rq() function.

select_task_rq()

kernel/sched/core.c

/*
 * The caller (fork, wakeup) owns p->pi_lock, ->cpus_ptr is stable.
 */

static inline
int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
{
        lockdep_assert_held(&p->pi_lock);

        if (p->nr_cpus_allowed > 1)
                cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
        else
                cpu = cpumask_any(p->cpus_ptr);

        /*
         * In order not to call set_task_cpu() on a blocking task we need
         * to rely on ttwu() to place the task on a valid ->cpus_ptr
         * CPU.
         *
         * Since this is common to all placement strategies, this lives here.
         *
         * [ this allows ->select_task() to simply return task_cpu(p) and
         *   not worry about this generic constraint ]
         */
        if (unlikely(!is_cpu_allowed(p, cpu)))
                cpu = select_fallback_rq(task_cpu(p), p);

        return cpu;
}

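Finds and selects an appropriate cpu for the task to run on, according to the @sd_flags request.

In code lines 6~9, if the task can run on two or more cpus, the (*select_task_rq) hook function registered by the task's scheduling class is used to find an appropriate cpu. If the task is pinned to a single cpu, that cpu is taken from the task's allowed mask.
- For a dl task, select_task_rq_dl() is called, and only for wake balancing does it select the cpu with the most deadline headroom.
- For an rt task, select_task_rq_rt() is called, and only for wake or fork balancing does it search, from the lowest priority up to the priority of the requested task, for a cpu on which the task can run.
- For a cfs task, select_task_rq_fair() is called to select an appropriate cpu.
In code lines 21~22, in the unlikely case that the selected cpu is not allowed for the task, a fallback cpu is searched for.
In code line 24, the cpu that was found is returned.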

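Selecting a fallback cpu

select_fallback_rq()

Finds and returns a fallback cpu for the requested cpu and task. The search proceeds in the following order.

1) Find an online (active) cpu that belongs to the node of the requested cpu and is allowed by the task.
2) Find an online (active) cpu allowed by the task, regardless of node.
3) Set the task's allowed cpus to the effective_cpus of its cpuset and find an online cpu among them.
4) Set the task's allowed cpus to the possible cpus and find an online cpu among them.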

kernel/sched/core.c

/*
 * ->cpus_ptr is protected by both rq->lock and p->pi_lock
 *
 * A few notes on cpu_active vs cpu_online:
 *
 *  - cpu_active must be a subset of cpu_online
 *
 *  - on CPU-up we allow per-CPU kthreads on the online && !active CPU,
 *    see __set_cpus_allowed_ptr(). At this point the newly online
 *    CPU isn't yet part of the sched domains, and balancing will not
 *    see it.
 *
 *  - on CPU-down we clear cpu_active() to mask the sched domains and
 *    avoid the load balancer to place new tasks on the to be removed
 *    CPU. Existing tasks will remain running there and will be taken
 *    off.
 *
 * This means that fallback selection must not select !active CPUs.
 * And can assume that any active CPU must be online. Conversely
 * select_task_rq() below may allow selection of !active CPUs in order
 * to satisfy the above rules.
 */

static int select_fallback_rq(int cpu, struct task_struct *p)
{
        int nid = cpu_to_node(cpu);
        const struct cpumask *nodemask = NULL;
        enum { cpuset, possible, fail } state = cpuset;
        int dest_cpu;

        /*
         * If the node that the CPU is on has been offlined, cpu_to_node()
         * will return -1. There is no CPU on the node, and we should
         * select the CPU on the other node.
         */
        if (nid != -1) {
                nodemask = cpumask_of_node(nid);

                /* Look for allowed, online CPU in same node. */
                for_each_cpu(dest_cpu, nodemask) {
                        if (!cpu_active(dest_cpu))
                                continue;
                        if (cpumask_test_cpu(dest_cpu, p->cpus_ptr))
                                return dest_cpu;
                }
        }

        for (;;) {
                /* Any allowed, online CPU? */
                for_each_cpu(dest_cpu, p->cpus_ptr) {
                        if (!is_cpu_allowed(p, dest_cpu))
                                continue;

                        goto out;
                }

                /* No more Mr. Nice Guy. */
                switch (state) {
                case cpuset:
                        if (IS_ENABLED(CONFIG_CPUSETS)) {
                                cpuset_cpus_allowed_fallback(p);
                                state = possible;
                                break;
                        }
                        /* Fall-through */
                case possible:
                        do_set_cpus_allowed(p, cpu_possible_mask);
                        state = fail;
                        break;

                case fail:
                        BUG();
                        break;
                }
        }

out:
        if (state != cpuset) {
                /*
                 * Don't tell them about moving exiting tasks or
                 * kernel threads (both mm NULL), since they never
                 * leave kernel.
                 */
                if (p->mm && printk_ratelimit()) {
                        printk_deferred("process %d (%s) no longer affine to cpu%d\n",
                                        task_pid_nr(p), p->comm, cpu);
                }
        }

        return dest_cpu;
}

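In code lines 13~23, the cpus of the node that contains the cpu are traversed, and the first active cpu that the task allows is returned.
In code lines 25~32, an active cpu is searched for among the cpus allowed for the task.
In code lines 35~41, the task's allowed cpus are replaced with the cpus configured in its cpuset, and the search is retried in the possible stage.
In code lines 43~46, the task's allowed cpus are replaced with the possible cpus, and the search is retried in the fail stage.
In code lines 48~51, if even the fail stage fails, BUG() is called.
In code lines 54~65, at the out: label, a rate-limited warning message is printed for a user task if a retry was needed.
In code line 67, the finally selected cpu is returned.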
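
The widening order of the fallback search can be sketched in user space as follows. This is only a toy model under stated assumptions: cpus are bits in a 32-bit mask, and struct toy_task, toy_first_cpu() and toy_select_fallback() are hypothetical names that do not exist in the kernel. The kthread handling of is_cpu_allowed() and the warning message are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy model: one bit per cpu, at most 32 cpus. */
struct toy_task {
        uint32_t allowed;     /* stands in for p->cpus_ptr */
        uint32_t cpuset_mask; /* stands in for the cpuset's effective cpus */
};

/* Lowest cpu number set in mask, or -1 if the mask is empty. */
static int toy_first_cpu(uint32_t mask)
{
        for (int cpu = 0; cpu < 32; cpu++)
                if (mask & (UINT32_C(1) << cpu))
                        return cpu;
        return -1;
}

/*
 * Toy version of the fallback order. Returns a cpu number, or -1 at the
 * point where the kernel would hit BUG().
 */
static int toy_select_fallback(struct toy_task *p, uint32_t node_mask,
                               uint32_t active, uint32_t possible)
{
        int cpu;

        assert(p != NULL);

        /* 1) allowed and active cpu in the same node */
        cpu = toy_first_cpu(node_mask & active & p->allowed);
        if (cpu >= 0)
                return cpu;

        /* 2) any allowed and active cpu */
        cpu = toy_first_cpu(p->allowed & active);
        if (cpu >= 0)
                return cpu;

        /* 3) widen the affinity to the cpuset and retry */
        p->allowed = p->cpuset_mask;
        cpu = toy_first_cpu(p->allowed & active);
        if (cpu >= 0)
                return cpu;

        /* 4) widen the affinity to all possible cpus and retry */
        p->allowed = possible;
        return toy_first_cpu(p->allowed & active);
}
```

Note that steps 3 and 4 overwrite the task's affinity as a side effect, which mirrors why the kernel prints the "no longer affine" message only after a retry.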

Selecting an appropriate cpu for a CFS task to run on

The following figure shows the call relationships below the select_task_rq_fair() function.

select_task_rq_fair()

kernel/sched/fair.c

/*
 * select_task_rq_fair: Select target runqueue for the waking task in domains
 * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
 * SD_BALANCE_FORK, or SD_BALANCE_EXEC.
 *
 * Balances load by selecting the idlest CPU in the idlest group, or under
 * certain conditions an idle sibling CPU if the domain has SD_WAKE_AFFINE set.
 *
 * Returns the target CPU number.
 *
 * preempt must be disabled.
 */

static int
select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
{
        struct sched_domain *tmp, *sd = NULL;
        int cpu = smp_processor_id();
        int new_cpu = prev_cpu;
        int want_affine = 0;
        int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);

        if (sd_flag & SD_BALANCE_WAKE) {
                record_wakee(p);

                if (sched_energy_enabled()) {
                        new_cpu = find_energy_efficient_cpu(p, prev_cpu);
                        if (new_cpu >= 0)
                                return new_cpu;
                        new_cpu = prev_cpu;
                }

                want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu) &&
                              cpumask_test_cpu(cpu, p->cpus_ptr);
        }

        rcu_read_lock();
        for_each_domain(cpu, tmp) {
                if (!(tmp->flags & SD_LOAD_BALANCE))
                        break;

                /*
                 * If both 'cpu' and 'prev_cpu' are part of this domain,
                 * cpu is a valid SD_WAKE_AFFINE target.
                 */
                if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
                    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
                        if (cpu != prev_cpu)
                                new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync);

                        sd = NULL; /* Prefer wake_affine over balance flags */
                        break;
                }

                if (tmp->flags & sd_flag)
                        sd = tmp;
                else if (!want_affine)
                        break;
        }

        if (unlikely(sd)) {
                /* Slow path */
                new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
        } else if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? */
                /* Fast path */

                new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);

                if (want_affine)
                        current->recent_used_cpu = cpu;
        }
        rcu_read_unlock();

        return new_cpu;
}

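This function is entered with SD_BALANCE_WAKE, SD_BALANCE_FORK or SD_BALANCE_EXEC in @sd_flag. Through wake, fork or exec balancing it finds and returns the cpu that is most effective in terms of energy and performance.

In code line 8, sync is set to whether the WF_SYNC flag is used in @wake_flags, provided the current task (the waker) is not exiting.
- The WF_SYNC flag is used to synchronize the waker and wakee so that, where possible, the wakee runs on a single cpu, namely the cpu of the waker, to avoid performance loss from cache bouncing between the two tasks.
In code lines 10~18, for wake balancing, the wake switch is recorded for the waker. If an EM (energy model) is used, as on mobile devices, the EAS (Energy Aware Scheduler) is used to find an energy- and performance-efficient cpu, which is returned if found.
In code lines 20~21, if wake balancing could not decide a new cpu through EAS, want_affine is set to whether a cache-affine wakeup should be attempted between the current cpu and the previous cpu on which the task ran before it last slept. This requires that the wake switching counts of wakee @p and the waker are similar and that the current cpu is included in the cpus allowed for the task.
- Only when wake_wide() and wake_cap() are both false is wake_affine() used to decide the cpu for the task, choosing the more cache-affine of the current cpu and the previous cpu of the task.
In code lines 25~27, the domains of the current cpu are traversed from the lowest to the highest, and the loop breaks at a domain that does not allow balancing.
In code lines 33~40, if want_affine is set and the domain has the SD_WAKE_AFFINE flag and includes @prev_cpu, sd is set to NULL and the loop is exited. If the current cpu is not the cpu the task ran on before the wakeup, wake_affine() is called first to obtain the new cpu.
- Domains with the SD_WAKE_AFFINE flag
  - Such a domain allows a woken task to select an idle sibling cpu within the domain.
  - To keep a woken task from being balanced to a remote NUMA node whose NUMA distance is 30 or more, this flag must not be used.
In code lines 42~45, if the domain being traversed has sd_flag set, it is assigned to sd. Otherwise the traversal continues if want_affine is set, and the loop is exited if it is 0.
In code lines 48~50, in the unlikely case that a domain was decided, the slow path finds the idlest cpu in the domain.
In code lines 51~58, for wake balancing, the fast path finds an idle sibling cpu.
In code line 61, the decided cpu number is returned.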

Cache-affine wakeup

Recording the number of wake switches

record_wakee()

kernel/sched/fair.c

static void record_wakee(struct task_struct *p)
{
        /*
         * Only decay a single time; tasks that have less then 1 wakeup per
         * jiffy will not have built up many flips.
         */
        if (time_after(jiffies, current->wakee_flip_decay_ts + HZ)) {
                current->wakee_flips >>= 1;
                current->wakee_flip_decay_ts = jiffies;
        }

        if (current->last_wakee != p) {
                current->last_wakee = p;
                current->wakee_flips++;
        }
}

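Records the wakee task @p to be woken as last_wakee of the currently running (current) waker task, and increments its wake switch count (wakee_flips).

In code lines 7~10, the wakee_flips value of the current task is decayed by half, at most once per second.
In code lines 12~15, if @p differs from the last wakee, last_wakee of the current task is updated to @p and wakee_flips is incremented.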
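
The flip counting and decay can be tried out with a small user-space sketch. The assumptions are: struct toy_waker, toy_record_wakee() and TOY_HZ are hypothetical names, wakees are identified by an integer id instead of a task pointer, and a plain comparison replaces time_after(), so jiffies wraparound is not modelled.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_HZ 250UL /* assumed tick rate, only for this example */

/* Hypothetical stand-in for the waker-side fields of task_struct. */
struct toy_waker {
        int last_wakee;              /* id of the task woken last */
        unsigned int wakee_flips;    /* number of wakee changes */
        unsigned long flip_decay_ts; /* tick of the last decay */
};

static void toy_record_wakee(struct toy_waker *w, int wakee_id,
                             unsigned long now)
{
        assert(w != NULL);

        /* Halve the counter once more than one second has passed. */
        if (now > w->flip_decay_ts + TOY_HZ) {
                w->wakee_flips >>= 1;
                w->flip_decay_ts = now;
        }

        /* A flip is counted only when the wakee changes. */
        if (w->last_wakee != wakee_id) {
                w->last_wakee = wakee_id;
                w->wakee_flips++;
        }
}
```

A waker that keeps waking the same task therefore stays at a low count, while a dispatcher that cycles through many wakees accumulates flips quickly.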

Detecting frequent wake switching between waker and wakee

wake_wide()

kernel/sched/fair.c

/*
 * Detect M:N waker/wakee relationships via a switching-frequency heuristic.
 *
 * A waker of many should wake a different task than the one last awakened
 * at a frequency roughly N times higher than one of its wakees.
 *
 * In order to determine whether we should let the load spread vs consolidating
 * to shared cache, we look for a minimum 'flip' frequency of llc_size in one
 * partner, and a factor of lls_size higher frequency in the other.
 *
 * With both conditions met, we can be relatively sure that the relationship is
 * non-monogamous, with partner count exceeding socket size.
 *
 * Waker/wakee being client/server, worker/dispatcher, interrupt source or
 * whatever is irrelevant, spread criteria is apparent partner count exceeds
 * socket size.
 */

static int wake_wide(struct task_struct *p)
{
        unsigned int master = current->wakee_flips;
        unsigned int slave = p->wakee_flips;
        int factor = this_cpu_read(sd_llc_size);

        if (master < slave)
                swap(master, slave);
        if (slave < factor || master < slave * factor)
                return 0;
        return 1;
}

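Returns 1 when, between the current waker task and wakee @p, wake switching is markedly more frequent on one side. (The pair is then considered not cache affine.)

In code lines 3~4, the wake switch counts of the waker task and wakee task @p are read into master and slave.
In code line 5, the number of cpus sharing cache within the package is read into factor.
- This is the number of cpus in the scheduling domain that has the SD_SHARE_PKG_RESOURCES flag (the upper domain if there are several).
In code lines 7~11, 1 is returned if both counts are at least factor and the larger count is at least factor times the smaller.
- If slave, the wake switch count (wakee_flips) of task @p, is larger than master, the count of the current task, the two values are swapped so that master is always the larger.
- If the smaller wake switch count (wakee_flips) is less than factor, or the larger count is less than the smaller count multiplied by factor, 0 is returned; otherwise 1 is returned.
- Example) 4 cache-sharing cpus, the current task has switched 20 times and the task being woken 4 times
  - 4 >= 4 and 20 >= (4 x 4), so the result is 1
  - With 9 and 2 switches instead, the smaller count 2 is below factor 4, so the result is 0
- A Facebook developer replaced the logic with a faster one, so the result of wake_wide() changed slightly from the earlier version.
  - Before: sched: Implement smarter wake-affine logic (2013, v3.12-rc1)
  - After: sched/fair: Beef up wake_wide() (2017, v4.15-rc1)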
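
The two conditions of wake_wide() can be checked with a minimal user-space sketch. The function name toy_wake_wide() and passing factor as a parameter (instead of reading sd_llc_size) are assumptions made for illustration.

```c
#include <assert.h>

/*
 * Toy version of the flip-frequency heuristic.
 * Returns 1 when the load should spread (not cache affine), 0 otherwise.
 */
static int toy_wake_wide(unsigned int waker_flips, unsigned int wakee_flips,
                         unsigned int factor)
{
        unsigned int master = waker_flips;
        unsigned int slave = wakee_flips;

        assert(factor > 0);

        /* Keep the larger count in master. */
        if (master < slave) {
                unsigned int tmp = master;

                master = slave;
                slave = tmp;
        }

        /* Both "slave >= factor" and "master >= slave * factor" must hold. */
        if (slave < factor || master < slave * factor)
                return 0;
        return 1;
}
```

With factor 4, counts of 20 and 4 yield 1 regardless of which side is the waker, while 15 and 4 yield 0 because 15 is below 4 x 4.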

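The following figure shows how it is determined whether the wake switching ratio between the waker task and the wakee task is at least a given multiple (factor).

In a dispatcher/worker thread model, the dispatcher wakes several worker threads once each, so the wake switch count of the dispatcher is higher. When the wake switching ratio is large like this, the pair is judged not to be cache affine.

The following figure shows how wake_wide() compares the wake switching ratio to decide cache affinity.

It is assumed that the dispatcher and the workers each do their work and then wake the other side.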

wake_cap()

kernel/sched/fair.c

/*
 * Disable WAKE_AFFINE in the case where task @p doesn't fit in the
 * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
 *
 * In that case WAKE_AFFINE doesn't make sense and we'll let
 * BALANCE_WAKE sort things out.
 */

static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
{
        long min_cap, max_cap;

        if (!static_branch_unlikely(&sched_asym_cpucapacity))
                return 0;

        min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu));
        max_cap = cpu_rq(cpu)->rd->max_cpu_capacity;

        /* Minimum capacity is close to max, no need to abort wake_affine */
        if (max_cap - min_cap < max_cap >> 3)
                return 0;

        /* Bring task utilization in sync with prev_cpu */
        sync_entity_load_avg(&p->se);

        return !task_fits_capacity(p, min_cap);
}

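On a system with asymmetric cpu capacity such as big/little, compares the capacity of the previous cpu of task @p with that of the requested @cpu. If the smaller of the two is far from the maximum capacity and the task does not fit in it, a non-zero value is returned and wake affine is disabled.

In kernel v5.7 this logic was judged unnecessary and removed, since select_idle_sibling() takes over its role.
- See: sched/fair: Remove wake_cap() (2020, v5.7-rc1)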
Selecting a cpu at wakeup considering idle state, cache and load

wake_affine()

kernel/sched/fair.c

static int wake_affine(struct sched_domain *sd, struct task_struct *p,
                       int this_cpu, int prev_cpu, int sync)
{
        int target = nr_cpumask_bits;

        if (sched_feat(WA_IDLE))
                target = wake_affine_idle(this_cpu, prev_cpu, sync);

        if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
                target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);

        schedstat_inc(p->se.statistics.nr_wakeups_affine_attempts);
        if (target == nr_cpumask_bits)
                return prev_cpu;

        schedstat_inc(sd->ttwu_move_affine);
        schedstat_inc(p->se.statistics.nr_wakeups_affine);
        return target;
}

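Decides which of @this_cpu and @prev_cpu makes better use of the cache (cache hot) when the task runs, and returns that cpu.

In code line 4, target is initialized to the state where no cpu has been selected.
In code lines 6~7, if the WA_IDLE feature is used, the cache-affine cpu is chosen between the two cpus based on their idle state.
- If @this_cpu is idle and the two cpus share cache, @this_cpu is selected, but if @prev_cpu is idle as well, migration is given up and @prev_cpu is selected.
- This is a cache-affine cpu feature of wake balancing. If the task's previous cpu is cache affine and idle, balancing is avoided to gain a little performance.
- See: sched/core: Fix wake_affine() performance regression (2017, v4.14-rc5)
In code lines 9~10, if the WA_WEIGHT feature is used and no target cpu has been selected yet, @this_cpu is selected when its load is lower than that of @prev_cpu.
- This is a cache-affine cpu feature of wake balancing. Balancing goes to whichever of the task's previous cpu and the current cpu has the smaller runnable load, which gains a little performance.
- See: sched/core: Address more wake_affine() regressions (2017, v4.14-rc5)
In code line 12, since a cache-affine wakeup is attempted, the nr_wakeups_affine_attempts counter is incremented.
In code lines 13~14, if no target cpu was decided, @prev_cpu is returned.
In code lines 16~17, since a cache-affine wakeup is performed, the ttwu_move_affine and nr_wakeups_affine counters are each incremented by 1.
In code line 18, the selected target cpu is returned.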

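The following figure shows how the decision is made, considering cache affinity and load, whether to run the task on the current cpu or on its previous cpu.

The WA_IDLE, WA_WEIGHT and WA_BIAS features are true by default.

Selecting a cpu considering idle state and cache affinity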

wake_affine_idle()

kernel/sched/fair.c

/*
 * The purpose of wake_affine() is to quickly determine on which CPU we can run
 * soonest. For the purpose of speed we only consider the waking and previous
 * CPU.
 *
 * wake_affine_idle() - only considers 'now', it check if the waking CPU is
 *                      cache-affine and is (or will be) idle.
 *
 * wake_affine_weight() - considers the weight to reflect the average
 *                        scheduling latency of the CPUs. This seems to work
 *                        for the overloaded case.
 */

static int
wake_affine_idle(int this_cpu, int prev_cpu, int sync)
{
        /*
         * If this_cpu is idle, it implies the wakeup is from interrupt
         * context. Only allow the move if cache is shared. Otherwise an
         * interrupt intensive workload could force all tasks onto one
         * node depending on the IO topology or IRQ affinity settings.
         *
         * If the prev_cpu is idle and cache affine then avoid a migration.
         * There is no guarantee that the cache hot data from an interrupt
         * is more important than cache hot data on the prev_cpu and from
         * a cpufreq perspective, it's better to have higher utilisation
         * on one CPU.
         */
        if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
                return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;

        if (sync && cpu_rq(this_cpu)->nr_running == 1)
                return this_cpu;

        return nr_cpumask_bits;
}

Selects @this_cpu if it is idle and shares cache with @prev_cpu. However, if @prev_cpu is also idle, @prev_cpu, on which the task ran before, is selected. Otherwise, if @sync is set and only one task is running on @this_cpu, @this_cpu is selected. In all remaining cases nr_cpumask_bits is returned as the failure value.

In code lines 16~17, if @this_cpu is idle and shares the same cache with @prev_cpu, @prev_cpu is returned when it is idle too, so that the previous cpu is kept. If @prev_cpu is not idle, @this_cpu is returned.
In code lines 19~20, if @sync is given and only one task is running on @this_cpu, @this_cpu is returned.
In code line 22, if no cpu could be decided, nr_cpumask_bits is returned.
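
The decision table of wake_affine_idle() can be sketched as a small user-space model. All names here are hypothetical: struct toy_wa_cpu holds the three facts the function looks at, and TOY_NO_TARGET stands in for nr_cpumask_bits.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NO_TARGET (-1) /* stands in for nr_cpumask_bits */

/* Hypothetical per-cpu state used only by this sketch. */
struct toy_wa_cpu {
        int idle;                /* models available_idle_cpu() */
        int llc_id;              /* models sd_llc_id */
        unsigned int nr_running; /* models rq->nr_running */
};

static int toy_wake_affine_idle(const struct toy_wa_cpu *cpus,
                                int this_cpu, int prev_cpu, int sync)
{
        assert(cpus != NULL);

        /* this_cpu idle and cache shared: prefer an idle prev_cpu. */
        if (cpus[this_cpu].idle &&
            cpus[this_cpu].llc_id == cpus[prev_cpu].llc_id)
                return cpus[prev_cpu].idle ? prev_cpu : this_cpu;

        /* Sync wakeup and the waker is the only task on this_cpu. */
        if (sync && cpus[this_cpu].nr_running == 1)
                return this_cpu;

        return TOY_NO_TARGET;
}
```

When the sketch returns TOY_NO_TARGET, the real wake_affine() would go on to try wake_affine_weight() if WA_WEIGHT is enabled.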

available_idle_cpu()

kernel/sched/core.c

/**
 * available_idle_cpu - is a given CPU idle for enqueuing work.
 * @cpu: the CPU in question.
 *
 * Return: 1 if the CPU is currently idle. 0 otherwise.
 */

int available_idle_cpu(int cpu)
{
        if (!idle_cpu(cpu))
                return 0;

        if (vcpu_is_preempted(cpu))
                return 0;

        return 1;
}

Returns whether @cpu is idle and available, that is, idle and not a preempted vcpu.

cpus_share_cache()

kernel/sched/core.c

bool cpus_share_cache(int this_cpu, int that_cpu)
{
        return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
}

Returns whether @this_cpu and @that_cpu share a cache.

Selecting this cpu when its load is lower, taking the task move into account

wake_affine_weight()

kernel/sched/fair.c

static int
wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
                   int this_cpu, int prev_cpu, int sync)
{
        s64 this_eff_load, prev_eff_load;
        unsigned long task_load;

        this_eff_load = cpu_runnable_load(cpu_rq(this_cpu));

        if (sync) {
                unsigned long current_load = task_h_load(current);

                if (current_load > this_eff_load)
                        return this_cpu;

                this_eff_load -= current_load;
        }

        task_load = task_h_load(p);

        this_eff_load += task_load;
        if (sched_feat(WA_BIAS))
                this_eff_load *= 100;
        this_eff_load *= capacity_of(prev_cpu);

        prev_eff_load = cpu_runnable_load(cpu_rq(prev_cpu));
        prev_eff_load -= task_load;
        if (sched_feat(WA_BIAS))
                prev_eff_load *= 100 + (sd->imbalance_pct - 100) / 2;
        prev_eff_load *= capacity_of(this_cpu);

        /*
         * If sync, adjust the weight of prev_eff_load such that if
         * prev_eff == this_eff that select_idle_sibling() will consider
         * stacking the wakee on top of the waker if no other CPU is
         * idle.
         */
        if (sync)
                prev_eff_load += 1;

        return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
}

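Returns @this_cpu if, after scaling by capacity, the runnable load of @this_cpu is the lower of the two cpus. On failure nr_cpumask_bits is returned.

In code line 8, the runnable load average of @this_cpu is read into this_eff_load.
In code lines 10~17, when @sync is set, @this_cpu is selected if the load average of the currently running task (waker) is larger than this_eff_load. Otherwise this_eff_load is reduced by the load of the running task.
In code lines 19~24, the load average of task @p is added to this_eff_load, and to compare the two cpus in proportion to their capacity the value is multiplied by the capacity of the opposite @prev_cpu. If the WA_BIAS feature is enabled, it is also multiplied by 100 (100%).
In code lines 26~30, the runnable load average of @prev_cpu is read, the load average of task @p is subtracted, and to compare the two cpus in proportion to their capacity the value is multiplied by the capacity of the opposite @this_cpu. If the WA_BIAS feature is enabled, it is also multiplied by 100 plus half of the amount by which imbalance_pct exceeds 100.
- This is a cache-affine cpu feature of wake balancing. When the WA_WEIGHT feature above is used, a small bias (half of the part of sd->imbalance_pct above 100%) is added to the load of the task's previous cpu so that the choice is slightly more favorable to this cpu.
In code lines 38~41, the two load averages are compared; @this_cpu is returned if its load is lower, otherwise nr_cpumask_bits is returned so that no migration happens.
- In the normal case where @sync is not given, not migrating to this_cpu is considered better when the two load averages are equal.
  - See: Do not migrate on wake_affine_weight() if weights are equal (2018, v4.17-rc1)
- When @sync is given, however, 1 is added to prev_eff_load to penalize it when the loads compare equal, so that the task migrates to this_cpu.
  - When cache bouncing between the waker and wakee tasks is expected to hurt performance, the wakeup is synchronized so that the wakee is woken on this_cpu, where the waker is, if possible. In this case prev_cpu is given the smallest possible disadvantage (of 1).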
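
The arithmetic of wake_affine_weight() is easier to follow with concrete numbers. The sketch below assumes WA_BIAS is enabled, takes all loads and capacities as plain parameters, and returns 1 for "pick this_cpu" and 0 for "no decision". The name toy_wake_affine_weight() and the value 117 used for imbalance_pct in the usage note are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Toy version of the effective-load comparison, with WA_BIAS assumed on. */
static int toy_wake_affine_weight(int64_t this_load, int64_t prev_load,
                                  int64_t task_load, int64_t waker_load,
                                  int64_t this_cap, int64_t prev_cap,
                                  int imbalance_pct, int sync)
{
        int64_t this_eff = this_load;
        int64_t prev_eff;

        assert(this_cap > 0 && prev_cap > 0);

        if (sync) {
                /* The waker is about to sleep: discount its load. */
                if (waker_load > this_eff)
                        return 1;
                this_eff -= waker_load;
        }

        /* Load of this_cpu after the task arrives, cross-multiplied. */
        this_eff += task_load;
        this_eff *= 100;
        this_eff *= prev_cap;

        /* Load of prev_cpu after the task leaves, with the bias applied. */
        prev_eff = prev_load - task_load;
        prev_eff *= 100 + (imbalance_pct - 100) / 2;
        prev_eff *= this_cap;

        /* On a sync wakeup a tie goes to this_cpu. */
        if (sync)
                prev_eff += 1;

        return this_eff < prev_eff;
}
```

With equal capacities of 1024, this_load 200, prev_load 400, task_load 100 and imbalance_pct 117, the comparison is 300 x 100 against 300 x 108, so this_cpu wins only because of the bias. With imbalance_pct 100 the two sides are equal and this_cpu is chosen only for a sync wakeup.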

Fast Path of Wake Balancing

select_idle_sibling()

kernel/sched/fair.c

/*
 * Try and locate an idle core/thread in the LLC cache domain.
 */

static int select_idle_sibling(struct task_struct *p, int prev, int target)
{
        struct sched_domain *sd;
        int i, recent_used_cpu;

        if (available_idle_cpu(target) || sched_idle_cpu(target))
                return target;

        /*
         * If the previous CPU is cache affine and idle, don't be stupid:
         */
        if (prev != target && cpus_share_cache(prev, target) &&
            (available_idle_cpu(prev) || sched_idle_cpu(prev)))
                return prev;

        /* Check a recently used CPU as a potential idle candidate: */
        recent_used_cpu = p->recent_used_cpu;
        if (recent_used_cpu != prev &&
            recent_used_cpu != target &&
            cpus_share_cache(recent_used_cpu, target) &&
            (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
            cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr)) {
                /*
                 * Replace recent_used_cpu with prev as it is a potential
                 * candidate for the next wake:
                 */
                p->recent_used_cpu = prev;
                return recent_used_cpu;
        }

        sd = rcu_dereference(per_cpu(sd_llc, target));
        if (!sd)
                return target;

        i = select_idle_core(p, sd, target);
        if ((unsigned)i < nr_cpumask_bits)
                return i;

        i = select_idle_cpu(p, sd, target);
        if ((unsigned)i < nr_cpumask_bits)
                return i;

        i = select_idle_smt(p, target);
        if ((unsigned)i < nr_cpumask_bits)
                return i;

        return target;
}

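Selects an idle cpu from the current cpu on which the task is being woken and the previous cpu on which the task went to sleep. If neither is idle, an idle cpu is searched for within the LLC domain. The search proceeds in the following order.

1) The @target cpu is idle
2) The @prev cpu shares cache and is idle
3) The cpu recently used by the task shares cache and is idle
4) An idle core/cpu/smt in the cache-sharing domain
- Starting from the @target cpu, select a core whose hw threads are all idle
- Starting from the @target cpu, select an idle cpu within a limited scan
- Select an idle hw thread among the hw threads of the @target cpu
5) Give up on cache affinity and simply use @target
See: sched/core: Rewrite and improve select_idle_siblings() (2016, v4.9-rc1)

In code lines 6~7, if the @target cpu is idle or runs only SCHED_IDLE policy tasks, the @target cpu is returned.
In code lines 12~14, if the @prev cpu is idle or runs only SCHED_IDLE policy tasks, and shares cache with the @target cpu, the @prev cpu is naturally returned.
- This prevents a needless migration to the @target cpu.
In code lines 17~29, if the cpu on which the task recently ran meets the same conditions as in code lines 12~14, that recently used cpu is returned.
In code lines 31~45, if a cache-sharing scheduling domain exists, idle core, idle cpu and idle smt are tried in that order, and as soon as a cpu is selected it is returned.
In code line 47, if none of the conditions is met, the @target cpu is simply returned regardless of cache affinity.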
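
The order of the checks can be condensed into a toy model. The assumptions are: 8 cpus, a single "idle" flag per cpu that stands for available_idle_cpu() || sched_idle_cpu(), no affinity mask, and step 4 simplified to a plain wrapping scan of the LLC instead of the separate core/cpu/smt helpers. All names (struct toy_sis_cpu, toy_select_idle_sibling()) are hypothetical.

```c
#include <assert.h>

#define TOY_NR_CPUS 8

/* Hypothetical per-cpu state used only by this sketch. */
struct toy_sis_cpu {
        int idle;   /* idle, or running only SCHED_IDLE tasks */
        int llc_id; /* cpus with the same id share the last level cache */
};

/* Is cpu idle and in the same LLC as target? */
static int toy_idle_in_llc(const struct toy_sis_cpu *c, int cpu, int target)
{
        return c[cpu].idle && c[cpu].llc_id == c[target].llc_id;
}

static int toy_select_idle_sibling(const struct toy_sis_cpu *c, int prev,
                                   int target, int *recent_used_cpu)
{
        int recent = *recent_used_cpu;

        assert(target >= 0 && target < TOY_NR_CPUS);

        /* 1) target itself is idle */
        if (c[target].idle)
                return target;

        /* 2) prev shares cache with target and is idle */
        if (prev != target && toy_idle_in_llc(c, prev, target))
                return prev;

        /* 3) the recently used cpu shares cache and is idle */
        if (recent != prev && recent != target &&
            toy_idle_in_llc(c, recent, target)) {
                *recent_used_cpu = prev; /* candidate for the next wakeup */
                return recent;
        }

        /* 4) simplified: first idle cpu in the LLC, wrapping from target */
        for (int i = 1; i < TOY_NR_CPUS; i++) {
                int cpu = (target + i) % TOY_NR_CPUS;

                if (toy_idle_in_llc(c, cpu, target))
                        return cpu;
        }

        /* 5) nothing idle: fall back to target */
        return target;
}
```

The cheap checks on target, prev and the recently used cpu come first so that the LLC scan, which is the expensive part, runs only when all three are busy.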

The following figure shows how select_idle_sibling() finds an idle cpu in the Fast Path of wake balancing.

Selecting an idle SMT core

select_idle_core()

kernel/sched/fair.c

/*
 * Scan the entire LLC domain for idle cores; this dynamically switches off if
 * there are no idle cores left in the system; tracked through
 * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
 */

static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
{
        struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
        int core, cpu;

        if (!static_branch_likely(&sched_smt_present))
                return -1;

        if (!test_idle_cores(target, false))
                return -1;

        cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

        for_each_cpu_wrap(core, cpus, target) {
                bool idle = true;

                for_each_cpu(cpu, cpu_smt_mask(core)) {
                        __cpumask_clear_cpu(cpu, cpus);
                        if (!available_idle_cpu(cpu))
                                idle = false;
                }

                if (idle)
                        return core;
        }

        /*
         * Failed to find an idle core; stop looking for one.
         */
        set_idle_cores(target, 0);

        return -1;
}

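Among the SMT cores of the cache-sharing domain, finds and returns a core whose cpus sharing the L1 cache are all idle. The search starts from the @target core.

In code lines 6~7, -1 is returned on a !SMT system that has no hw threads.
In code lines 9~10, -1 is returned if the cache-sharing (LLC) domain of the @target cpu is recorded as having no idle core (has_idle_cores).
In code lines 12~14, the cpus that are both in the domain and allowed by the task are traversed starting from @target.
In code lines 15~24, it is checked whether any of the smt threads sharing the L1 cache with the cpu being traversed is busy. If all of them are idle, the core number being traversed is returned.
In code lines 30~32, if no idle core was found, it is recorded that the cache-sharing domain of the @target cpu has no idle core, and -1 is returned.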
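
The core scan can be modelled with bitmasks. The sketch assumes a hypothetical topology of 8 cpus with 2 hw threads per core, where cpus 2n and 2n+1 are siblings. toy_smt_mask() and toy_select_idle_core() are illustrative names, and the has_idle_cores hint is passed in as a plain int pointer.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_CPUS 8

/* Sibling mask of a cpu in the assumed 2-threads-per-core topology. */
static uint32_t toy_smt_mask(int cpu)
{
        return UINT32_C(3) << (cpu & ~1);
}

/*
 * Toy version of the idle core scan. idle, llc_span and allowed are cpu
 * bitmasks. Returns a cpu of a fully idle core, or -1.
 */
static int toy_select_idle_core(uint32_t idle, uint32_t llc_span,
                                uint32_t allowed, int target,
                                int *has_idle_cores)
{
        uint32_t cpus = llc_span & allowed;

        assert(target >= 0 && target < TOY_CPUS);

        /* The hint says there is no idle core: skip the scan. */
        if (!*has_idle_cores)
                return -1;

        for (int i = 0; i < TOY_CPUS; i++) {
                int core = (target + i) % TOY_CPUS;
                uint32_t smt;

                if (!(cpus & (UINT32_C(1) << core)))
                        continue;

                smt = toy_smt_mask(core);
                cpus &= ~smt; /* each core is examined only once */

                if ((idle & smt) == smt)
                        return core;
        }

        /* No idle core found: stop looking until the hint is set again. */
        *has_idle_cores = 0;
        return -1;
}
```

Clearing the hint after a failed scan is what makes later wakeups cheap: they return at the first check until some core becomes fully idle again and the kernel sets the flag back.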

test_idle_cores()

kernel/sched/fair.c

static inline bool test_idle_cores(int cpu, bool def)
{
        struct sched_domain_shared *sds;

        sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
        if (sds)
                return READ_ONCE(sds->has_idle_cores);

        return def;
}

Returns whether the cache-sharing (LLC) domain of @cpu is recorded as having an idle core (has_idle_cores). If there is no cache-sharing domain, the @def value is returned.

set_idle_cores()

kernel/sched/fair.c

static inline void set_idle_cores(int cpu, int val)
{
        struct sched_domain_shared *sds;

        sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
        if (sds)
                WRITE_ONCE(sds->has_idle_cores, val);
}

Records @val as the idle core state of the cache-sharing (LLC) domain of @cpu. (@val: 0=no idle core is left in the domain, 1=the domain has a core whose hw threads are all idle.)

Selecting an idle cpu

select_idle_cpu()

kernel/sched/fair.c

/*
 * Scan the LLC domain for idle CPUs; this is dynamically regulated by
 * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
 * average idle time for this rq (as found in rq->avg_idle).
 */

static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
{
        struct sched_domain *this_sd;
        u64 avg_cost, avg_idle;
        u64 time, cost;
        s64 delta;
        int this = smp_processor_id();
        int cpu, nr = INT_MAX, si_cpu = -1;

        this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
        if (!this_sd)
                return -1;

        /*
         * Due to large variance we need a large fuzz factor; hackbench in
         * particularly is sensitive here.
         */
        avg_idle = this_rq()->avg_idle / 512;
        avg_cost = this_sd->avg_scan_cost + 1;

        if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
                return -1;

        if (sched_feat(SIS_PROP)) {
                u64 span_avg = sd->span_weight * avg_idle;
                if (span_avg > 4*avg_cost)
                        nr = div_u64(span_avg, avg_cost);
                else
                        nr = 4;
        }

        time = cpu_clock(this);

        for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
                if (!--nr)
                        return si_cpu;
                if (!cpumask_test_cpu(cpu, p->cpus_ptr))
                        continue;
                if (available_idle_cpu(cpu))
                        break;
                if (si_cpu == -1 && sched_idle_cpu(cpu))
                        si_cpu = cpu;
        }

        time = cpu_clock(this) - time;
        cost = this_sd->avg_scan_cost;
        delta = (s64)(time - cost) / 8;
        this_sd->avg_scan_cost += delta;

        return cpu;
}

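Traverses the cpus in the domain within a computed number of attempts and returns an idle cpu. If no idle cpu is found within that number, a cpu that runs only idle policy tasks is returned as an alternative.

See: sched/core: Implement new approach to scale select_idle_cpu() (2017, v4.13-rc1)

In code line 8, the number of cpus to scan is initialized to an unlimited value.
In code lines 10~12, -1 is returned if there is no cache-sharing domain.
In code lines 18~22, if the SIS_AVG_CPU feature (default false) is used, -1 is returned when the average idle time of the cpu is shorter than the average scan cost.
- This is a cache-affine cpu feature of wake balancing. In wake balancing, balancing is avoided if the average idle time of the cpu (rq->avg_idle) is too short compared with the wakeup cost of the scheduling domain (sd->avg_scan_cost).
- See: sched/fair: Make select_idle_cpu() more aggressive (2017, v4.11)
In code lines 24~30, if the SIS_PROP feature (default true) is used, the number of cpus to scan is decided as follows, with a minimum of 4.
- This is a cache-affine cpu feature of wake balancing. It limits the number of sibling cpus scanned in wake balancing.
  - span_avg (avg_idle * number of cpus in the domain) / avg_cost
- See: sched/core: Implement new approach to scale select_idle_cpu() (2017, v4.13-rc1)
In code lines 34~43, the cpus in the domain are traversed starting from @target, and the loop is exited when an idle cpu is found. A cpu running only idle policy tasks is remembered in si_cpu, and if no idle cpu is found after looping for the scan count, this si_cpu is used as the alternative.
In code lines 45~48, 1/8 of the difference between the time spent scanning for an idle cpu and the average scan cost is accumulated into the average scan cost of the domain.
In code line 50, the decided cpu is returned.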
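
The two pieces of arithmetic in select_idle_cpu(), the SIS_PROP scan limit and the running average of the scan cost, can be worked through with a short sketch. toy_scan_limit() and toy_update_scan_cost() are hypothetical helper names, and SIS_PROP is assumed to be enabled.

```c
#include <assert.h>
#include <stdint.h>

/* Number of cpus to scan, following the SIS_PROP formula. */
static int toy_scan_limit(uint64_t rq_avg_idle, uint64_t avg_scan_cost,
                          unsigned int span_weight)
{
        uint64_t avg_idle = rq_avg_idle / 512; /* large fuzz factor */
        uint64_t avg_cost = avg_scan_cost + 1; /* avoids division by zero */
        uint64_t span_avg;

        assert(span_weight > 0);

        span_avg = (uint64_t)span_weight * avg_idle;
        if (span_avg > 4 * avg_cost)
                return (int)(span_avg / avg_cost);
        return 4;
}

/* Move the average scan cost 1/8 of the way towards the new sample. */
static uint64_t toy_update_scan_cost(uint64_t avg_scan_cost,
                                     uint64_t scan_time)
{
        int64_t delta = (int64_t)(scan_time - avg_scan_cost) / 8;

        return avg_scan_cost + (uint64_t)delta;
}
```

For a domain of 8 cpus with rq->avg_idle of 512000 ns and an average scan cost of 499 ns, the limit is 8 x 1000 / 500 = 16. If the idle time drops to 51200 ns, the quotient would fall below the floor and the limit becomes 4.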

Selecting an idle hw thread

select_idle_smt()

kernel/sched/fair.c

/*
 * Scan the local SMT mask for idle CPUs.
 */

static int select_idle_smt(struct task_struct *p, int target)
{
        int cpu, si_cpu = -1;

        if (!static_branch_likely(&sched_smt_present))
                return -1;

        for_each_cpu(cpu, cpu_smt_mask(target)) {
                if (!cpumask_test_cpu(cpu, p->cpus_ptr))
                        continue;
                if (available_idle_cpu(cpu))
                        return cpu;
                if (si_cpu == -1 && sched_idle_cpu(cpu))
                        si_cpu = cpu;
        }

        return si_cpu;
}

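Returns an idle cpu among the hw threads of the @target cpu. If there is no idle cpu, it falls back to a cpu that runs only idle-policy tasks. If there is none of those either, it returns -1.

In code lines 5~6, -1 is returned on !SMT systems, which have no hw threads.
In code lines 8~15, the hw threads that share the L1 cache with the @target cpu are traversed. Cpus the task is not allowed on are skipped, and an idle cpu is returned as soon as one is found. A cpu running only idle-policy tasks is remembered as the second choice.
In code line 17, when no idle cpu was found, the second-choice cpu is returned.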

Slow Path of Wake Balancing

Finding the idlest cpu

find_idlest_cpu()

kernel/sched/fair.c

static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p,
                                  int cpu, int prev_cpu, int sd_flag)
{
        int new_cpu = cpu;

        if (!cpumask_intersects(sched_domain_span(sd), p->cpus_ptr))
                return prev_cpu;

        /*
         * We need task's util for capacity_spare_without, sync it up to
         * prev_cpu's last_update_time.
         */
        if (!(sd_flag & SD_BALANCE_FORK))
                sync_entity_load_avg(&p->se);

        while (sd) {
                struct sched_group *group;
                struct sched_domain *tmp;
                int weight;

                if (!(sd->flags & sd_flag)) {
                        sd = sd->child;
                        continue;
                }

                group = find_idlest_group(sd, p, cpu, sd_flag);
                if (!group) {
                        sd = sd->child;
                        continue;
                }

                new_cpu = find_idlest_group_cpu(group, p, cpu);
                if (new_cpu == cpu) {
                        /* Now try balancing at a lower domain level of 'cpu': */
                        sd = sd->child;
                        continue;
                }

                /* Now try balancing at a lower domain level of 'new_cpu': */
                cpu = new_cpu;
                weight = sd->span_weight;
                sd = NULL;
                for_each_domain(cpu, tmp) {
                        if (weight <= tmp->span_weight)
                                break;
                        if (tmp->flags & sd_flag)
                                sd = tmp;
                }
        }

        return new_cpu;
}

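In code lines 6~7, @prev_cpu is returned if none of the cpus in the domain is among the cpus allowed for the task.
In code lines 13~14, unless this is fork balancing, the entity's load average is updated and synchronized.
In code lines 16~24, starting at the requested domain and descending through the child domains, domains that do not support @sd_flag are skipped.
In code lines 26~30, the idlest group is looked up; if there is none, the domain is skipped.
In code lines 32~37, the idlest cpu is looked up in the idlest group; if it is the same as @cpu, the domain is skipped.
In code lines 40~48, the domains of the idlest cpu just found are traversed again from the lowest one, looking for the highest domain that supports @sd_flag. The loop exits once a traversed domain spans as many cpus as the domain just searched, or more, because there is no need to go further.
- To spread tasks over the idle cpus of the whole hierarchy, the cpu found is not used as-is. The search is retried one scheduling domain lower, and conceptually this repeats until the cpu found is the same as the one found before.
In code line 51, the idlest cpu is returned. If none was found, @cpu is returned.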

Finding the idlest group

find_idlest_group()

kernel/sched/fair.c -1/2-

/*
 * find_idlest_group finds and returns the least busy CPU group within the
 * domain.
 *
 * Assumes p is allowed on at least one CPU in sd.
 */

static struct sched_group *
find_idlest_group(struct sched_domain *sd, struct task_struct *p,
                  int this_cpu, int sd_flag)
{
        struct sched_group *idlest = NULL, *group = sd->groups;
        struct sched_group *most_spare_sg = NULL;
        unsigned long min_runnable_load = ULONG_MAX;
        unsigned long this_runnable_load = ULONG_MAX;
        unsigned long min_avg_load = ULONG_MAX, this_avg_load = ULONG_MAX;
        unsigned long most_spare = 0, this_spare = 0;
        int imbalance_scale = 100 + (sd->imbalance_pct-100)/2;
        unsigned long imbalance = scale_load_down(NICE_0_LOAD) *
                                (sd->imbalance_pct-100) / 100;

        do {
                unsigned long load, avg_load, runnable_load;
                unsigned long spare_cap, max_spare_cap;
                int local_group;
                int i;

                /* Skip over this group if it has no CPUs allowed */
                if (!cpumask_intersects(sched_group_span(group),
                                        p->cpus_ptr))
                        continue;

                local_group = cpumask_test_cpu(this_cpu,
                                               sched_group_span(group));

                /*
                 * Tally up the load of all CPUs in the group and find
                 * the group containing the CPU with most spare capacity.
                 */
                avg_load = 0;
                runnable_load = 0;
                max_spare_cap = 0;

                for_each_cpu(i, sched_group_span(group)) {
                        load = cpu_runnable_load(cpu_rq(i));
                        runnable_load += load;

                        avg_load += cfs_rq_load_avg(&cpu_rq(i)->cfs);

                        spare_cap = capacity_spare_without(i, p);

                        if (spare_cap > max_spare_cap)
                                max_spare_cap = spare_cap;
                }

                /* Adjust by relative CPU capacity of the group */
                avg_load = (avg_load * SCHED_CAPACITY_SCALE) /
                                        group->sgc->capacity;
                runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) /
                                        group->sgc->capacity;

                if (local_group) {
                        this_runnable_load = runnable_load;
                        this_avg_load = avg_load;
                        this_spare = max_spare_cap;
                } else {
                        if (min_runnable_load > (runnable_load + imbalance)) {
                                /*
                                 * The runnable load is significantly smaller
                                 * so we can pick this new CPU:
                                 */
                                min_runnable_load = runnable_load;
                                min_avg_load = avg_load;
                                idlest = group;
                        } else if ((runnable_load < (min_runnable_load + imbalance)) &&
                                   (100*min_avg_load > imbalance_scale*avg_load)) {
                                /*
                                 * The runnable loads are close so take the
                                 * blocked load into account through avg_load:
                                 */
                                min_avg_load = avg_load;
                                idlest = group;
                        }

                        if (most_spare < max_spare_cap) {
                                most_spare = max_spare_cap;
                                most_spare_sg = group;
                        }
                }
        } while (group = group->next, group != sd->groups);

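Finds and returns the idlest scheduling group, the one with the lowest cpu load, within the scheduling domain.

If the idlest group's cpu load, after applying only half of the imbalance_pct excess over 100%, turns out larger than the local group's cpu load, null is returned.

In code line 11, imbalance_scale is set to 100 plus half of the requested domain's imbalance_pct excess over 100%.
- e.g. for imbalance_pct=117, half of the excess 17 is 8, which added to 100 gives imbalance_scale=108.
In code lines 12~13, the requested domain's imbalance_pct excess over 100% is multiplied by the load weight of a nice-0 task and divided by 100, and the result is assigned to imbalance. Unlike imbalance_scale, the excess is not halved here.
- e.g. for imbalance_pct=117, the nice-0 load weight 1024 * excess 17 / 100 gives imbalance=174.
In code lines 15~24, the groups are traversed, and a group is skipped if none of its cpus is allowed for the task.
In code lines 26~27, it is determined whether the group being traversed is the local group containing @this_cpu.
In code lines 33~47, the cpus of the group being traversed are iterated in turn to compute the following.
- The sum of the runnable loads is assigned to runnable_load
- The sum of the cfs runqueue load averages is assigned to avg_load
- The largest spare capacity among the traversed cpus is assigned to max_spare_cap
In code lines 50~51, the average load scaled by the group capacity is computed.
In code lines 52~53, the runnable load scaled by the group capacity is computed.
In code lines 55~58, the values are saved if the group being traversed is the local group.
In code lines 59~82, for non-local groups, the idlest group (the one with the smallest runnable load) is tracked and updated.
- The first case is when the runnable load plus imbalance is still smaller than min_runnable_load, i.e. the runnable load is significantly smaller.
  - min_runnable_load is updated as well in this case.
- The second case is when the runnable load is smaller than min_runnable_load plus imbalance, and the average load scaled by imbalance_scale is smaller than min_avg_load.
  - min_runnable_load is not updated in this case.
- The group with the most spare capacity is assigned to most_spare_sg, and its capacity to most_spare.
In code line 83, the loop continues until the scheduling groups, which are linked as a singly linked circular list, have been walked once around.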
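
The two thresholds computed at the top of find_idlest_group() can be checked with a small worked example. This is a sketch following the two initializer lines in the listing; the helper names are hypothetical, and 1024 is assumed as the value of scale_load_down(NICE_0_LOAD).

```c
#include <assert.h>

/* assumed value of scale_load_down(NICE_0_LOAD) */
#define NICE_0_LOAD_SCALED 1024UL

/* Percentage threshold: 100 plus half of the excess over 100. */
int imbalance_scale_of(int imbalance_pct)
{
        return 100 + (imbalance_pct - 100) / 2;
}

/* Absolute load margin: the full excess applied to a nice-0 load weight. */
unsigned long imbalance_of(int imbalance_pct)
{
        assert(imbalance_pct >= 100);   /* assumption of this sketch */
        return NICE_0_LOAD_SCALED * (unsigned long)(imbalance_pct - 100) / 100;
}
```

With imbalance_pct=117 this gives imbalance_scale=108 and imbalance=174, both truncated by integer division.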

kernel/sched/fair.c -2/2-

        /*
         * The cross-over point between using spare capacity or least load
         * is too conservative for high utilization tasks on partially
         * utilized systems if we require spare_capacity > task_util(p),
         * so we allow for some task stuffing by using
         * spare_capacity > task_util(p)/2.
         *
         * Spare capacity can't be used for fork because the utilization has
         * not been set yet, we must first select a rq to compute the initial
         * utilization.
         */
        if (sd_flag & SD_BALANCE_FORK)
                goto skip_spare;

        if (this_spare > task_util(p) / 2 &&
            imbalance_scale*this_spare > 100*most_spare)
                return NULL;

        if (most_spare > task_util(p) / 2)
                return most_spare_sg;

skip_spare:
        if (!idlest)
                return NULL;

        /*
         * When comparing groups across NUMA domains, it's possible for the
         * local domain to be very lightly loaded relative to the remote
         * domains but "imbalance" skews the comparison making remote CPUs
         * look much more favourable. When considering cross-domain, add
         * imbalance to the runnable load on the remote node and consider
         * staying local.
         */
        if ((sd->flags & SD_NUMA) &&
            min_runnable_load + imbalance >= this_runnable_load)
                return NULL;

        if (min_runnable_load > (this_runnable_load + imbalance))
                return NULL;

        if ((this_runnable_load < (min_runnable_load + imbalance)) &&
             (100*this_avg_load < imbalance_scale*min_avg_load))
                return NULL;

        return idlest;
}

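In code lines 12~13, when entered for fork balancing, it jumps to the skip_spare label.
In code lines 15~17, if the local spare capacity is larger than half of the task's utilization, and the local spare capacity scaled by imbalance_scale is also larger than most_spare, no migration is needed and null is returned.
In code lines 19~20, if the largest spare capacity among the groups is larger than half of the task's utilization, the group with the most spare capacity is returned.
In code lines 22~24 is the skip_spare: label. If no idlest group was found, null is returned.
In code lines 34~43, null is returned because no migration is needed if any of the following three conditions holds.
- In a NUMA domain, the minimum runnable load plus imbalance is greater than or equal to the local group's runnable load
- The minimum runnable load is greater than the local group's runnable load plus imbalance
- The minimum runnable load plus imbalance is greater than the local group's runnable load, and the local group's average load is smaller than the minimum average load scaled by imbalance_scale
In code line 45, the chosen idlest group is returned.

The following figure shows how find_idlest_group(), which runs as the Slow Path of wake/fork/exec balancing, finds the group with the most spare capacity or the group with the least load.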
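
The spare-capacity checks that run before the load comparison can be isolated as two predicates. The sketch below mirrors the two conditions in the listing; the function names are hypothetical and the inputs are plain numbers rather than kernel structures.

```c
#include <assert.h>
#include <stdbool.h>

/* True when the local group has enough spare capacity to keep the task. */
bool local_spare_wins(unsigned long this_spare, unsigned long most_spare,
                      unsigned long task_util, int imbalance_scale)
{
        assert(imbalance_scale >= 100); /* assumption of this sketch */
        return this_spare > task_util / 2 &&
               (unsigned long)imbalance_scale * this_spare > 100 * most_spare;
}

/* True when the remote group with the most spare capacity should be picked. */
bool remote_spare_wins(unsigned long most_spare, unsigned long task_util)
{
        return most_spare > task_util / 2;
}
```

With imbalance_scale=108, a local spare capacity of 500 still wins against a remote 520 (54000 > 52000), but loses against a remote 600, in which case the remote group is returned as long as 600 exceeds half of the task's utilization.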

Finding the idlest cpu within the idlest group

find_idlest_group_cpu()

kernel/sched/fair.c

/*
 * find_idlest_group_cpu - find the idlest CPU among the CPUs in the group.
 */

static int
find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
{
        unsigned long load, min_load = ULONG_MAX;
        unsigned int min_exit_latency = UINT_MAX;
        u64 latest_idle_timestamp = 0;
        int least_loaded_cpu = this_cpu;
        int shallowest_idle_cpu = -1, si_cpu = -1;
        int i;

        /* Check if we have any choice: */
        if (group->group_weight == 1)
                return cpumask_first(sched_group_span(group));

        /* Traverse only the allowed CPUs */
        for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
                if (available_idle_cpu(i)) {
                        struct rq *rq = cpu_rq(i);
                        struct cpuidle_state *idle = idle_get_state(rq);
                        if (idle && idle->exit_latency < min_exit_latency) {
                                /*
                                 * We give priority to a CPU whose idle state
                                 * has the smallest exit latency irrespective
                                 * of any idle timestamp.
                                 */
                                min_exit_latency = idle->exit_latency;
                                latest_idle_timestamp = rq->idle_stamp;
                                shallowest_idle_cpu = i;
                        } else if ((!idle || idle->exit_latency == min_exit_latency) &&
                                   rq->idle_stamp > latest_idle_timestamp) {
                                /*
                                 * If equal or no active idle state, then
                                 * the most recently idled CPU might have
                                 * a warmer cache.
                                 */
                                latest_idle_timestamp = rq->idle_stamp;
                                shallowest_idle_cpu = i;
                        }
                } else if (shallowest_idle_cpu == -1 && si_cpu == -1) {
                        if (sched_idle_cpu(i)) {
                                si_cpu = i;
                                continue;
                        }

                        load = cpu_runnable_load(cpu_rq(i));
                        if (load < min_load) {
                                min_load = load;
                                least_loaded_cpu = i;
                        }
                }
        }

        if (shallowest_idle_cpu != -1)
                return shallowest_idle_cpu;
        if (si_cpu != -1)
                return si_cpu;
        return least_loaded_cpu;
}

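Finds and returns the idlest cpu in the group, in the following order. If the group contains only one cpu, that cpu is returned.

1) The idle cpu with the shortest exit latency (on a tie, the cpu that entered idle most recently)
2) As the second choice, a cpu running only idle-policy tasks
3) Finally, the cpu with the lowest load

In code lines 12~13, if the group has only one cpu, that cpu is returned.
In code line 16, the cpus in the group that the task is allowed on are traversed.
In code lines 17~38, if the cpu being traversed is idle, the cpuidle_state attached to its runqueue is fetched. The cpu that takes the least time to wake from idle (exit_latency) is assigned to shallowest_idle_cpu, and the exit_latency specified by cpu idle PM is assigned to min_exit_latency.
- cpuidle_state
  - Manages the cpu idle state and connects to the generic framework for cpu idle PM, which in turn is used together with a cpu idle driver.
- idle->exit_latency (in us)
  - Most idle drivers used on arm manage roughly two levels of state: cpu idle and cluster idle. For cluster idle, the cycle from entering idle until waking up again takes long, and this time is recorded in exit_latency.
  - In early kernels this value was hard-coded for each system; more recently it can also be taken from the device tree. It typically ranges from a few us to tens of ms, depending on how deep the power management goes.
  - Choosing a cpu that takes less time to wake from idle yields better performance.
  - When the exit_latency values are equal, the cpu that entered idle most recently is chosen.
    - A cpu that went idle recently is more likely to still have data in the shared caches at L2 and above (for better performance).
In code lines 39~50, if the cpu being traversed is not idle and neither shallowest_idle_cpu nor si_cpu has been set yet, a cpu running only idle-policy tasks is assigned to si_cpu and skipped. Otherwise the minimum runnable load is updated and the cpu is assigned to least_loaded_cpu.
In code lines 53~54, if an idle cpu with the shortest latency was found, it is returned.
In code lines 55~56, as the second choice, a cpu running only idle-policy tasks is returned if one was found.
In code line 57, finally, the cpu with the lowest load is returned.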
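
The three-level preference described above can be tried out on a plain array. The following is a user-space sketch that follows the control flow of the listing; struct cpu_info and pick_idlest_cpu() are hypothetical names, and has_state stands in for a non-NULL cpuidle_state pointer.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

struct cpu_info {                       /* hypothetical per-cpu snapshot */
        bool idle;                      /* cpu is idle and available */
        bool sched_idle;                /* cpu runs only idle-policy tasks */
        bool has_state;                 /* an idle state is recorded */
        unsigned int exit_latency;      /* wake-up latency in us */
        uint64_t idle_stamp;            /* time the cpu entered idle */
        unsigned long load;             /* runnable load */
};

int pick_idlest_cpu(const struct cpu_info *c, int n, int this_cpu)
{
        unsigned long min_load = ULONG_MAX;
        unsigned int min_exit_latency = UINT_MAX;
        uint64_t latest_idle_stamp = 0;
        int shallowest = -1, si_cpu = -1, least_loaded = this_cpu;

        assert(n > 0);
        for (int i = 0; i < n; i++) {
                if (c[i].idle) {
                        if (c[i].has_state &&
                            c[i].exit_latency < min_exit_latency) {
                                /* smallest exit latency has priority */
                                min_exit_latency = c[i].exit_latency;
                                latest_idle_stamp = c[i].idle_stamp;
                                shallowest = i;
                        } else if ((!c[i].has_state ||
                                    c[i].exit_latency == min_exit_latency) &&
                                   c[i].idle_stamp > latest_idle_stamp) {
                                /* tie: prefer the most recently idled cpu */
                                latest_idle_stamp = c[i].idle_stamp;
                                shallowest = i;
                        }
                } else if (shallowest == -1 && si_cpu == -1) {
                        if (c[i].sched_idle) {
                                si_cpu = i;
                                continue;
                        }
                        if (c[i].load < min_load) {
                                min_load = c[i].load;
                                least_loaded = i;
                        }
                }
        }
        if (shallowest != -1)
                return shallowest;
        return si_cpu != -1 ? si_cpu : least_loaded;
}
```

Note that once a cpu running only idle-policy tasks has been recorded, later busy cpus are no longer compared by load, which matches the condition in the listing.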

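The following 4 figures show the process of finding, within a scheduling group, the idlest cpu among those the task is allowed on.

exit_latency is a small value such as 1 for a cpu in shallow sleep via WFI, and a large value for a cpu in deep sleep where the cluster power is turned off.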

References

Scheduler -1- (Basic) | 문c
Scheduler -2- (Global Cpu Load) | 문c
Scheduler -3- (PELT) | 문c
Scheduler -4- (Group Scheduling) | 문c
Scheduler -5- (Scheduler Core) | 문c
Scheduler -6- (CFS Scheduler) | 문c
Scheduler -7- (Preemption & Context Switch) | 문c
Scheduler -8- (CFS Bandwidth) | 문c
Scheduler -9- (RT Scheduler) | 문c
Scheduler -10- (Deadline Scheduler) | 문c
Scheduler -11- (Stop Scheduler) | 문c
Scheduler -12- (Idle Scheduler) | 문c
Scheduler -13- (Scheduling Domain 1) | 문c
Scheduler -14- (Scheduling Domain 2) | 문c
Scheduler -15- (Load Balance 1) | 문c
Scheduler -16- (Load Balance 2) | 문c – this post
Scheduler -17- (Load Balance 3 NUMA) | 문c
Scheduler -18- (Load Balance 4 EAS) | 문c
Scheduler -19- (Initialization) | 문c
PID 관리 | 문c
do_fork() | 문c
cpu_startup_entry() | 문c
Runqueue load average (cpu_load[]) – v4.0 | 문c
PELT(Per-Entity Load Tracking) – v4.0 | 문c

Scheduler -15- (Load Balance 1)

2017-08-20 / 2021-01-11 문영일

Load Balance

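The components involved in CFS load balancing are as follows.

There are five ways to enter load balancing, as listed below. Passive balancing is driven by changes in task state. Periodic balancing is invoked according to a balancing interval that changes dynamically. Fork, Exec and Wake balancing look for the idlest cpu in order to decide the target cpu of the task that is about to run. The remaining two, idle balancing and periodic balancing, look for the busiest cpu and pull a task from its runqueue (pull migration) to run it.

Passive Balancing
- Fork Balancing
  - On task creation, decides whether the task runs on the cpu where the parent task was running or is migrated to another cpu.
  - A cache-affine cpu or an idle cpu is chosen where possible; otherwise the task is migrated to the runqueue of a lightly loaded cpu.
  - Called from wake_up_new_task() with the SD_BALANCE_FORK flag.
- Exec Balancing
  - On exec, decides whether the task keeps running on its current cpu or is migrated to another cpu.
  - A cache-affine cpu or an idle cpu is chosen where possible; otherwise the task is migrated to the runqueue of a lightly loaded cpu.
  - The migrate thread is used when migrating to another cpu.
  - sched_exec() uses the SD_BALANCE_EXEC flag.
- Wake Balancing
  - When a sleeping task wakes up, decides whether it runs on the cpu where it was woken or is migrated to another idle cpu.
  - try_to_wake_up() uses the SD_BALANCE_WAKE flag.
- Idle Balancing
  - When a cpu is about to enter idle, decides whether to pull a task from the busiest cpu of the busiest scheduling group.
  - idle_balance() (newidle_balance() in newer kernels) uses the SD_BALANCE_NEWIDLE flag.
Periodic Balancing
- Through the periodic scheduler tick, checks at every balancing interval whether rebalancing is needed.
  - The load balance interval changes dynamically from 1 tick up to max_load_balance_interval (initial value 0.1 s).
- Finds overloaded tasks in the scheduling groups of domains that have the SD_LOAD_BALANCE flag and pull-migrates them to the current cpu to spread the load.
- The call sequence is: scheduler tick -> raise softirq -> run_rebalance_domains() -> rebalance_domains().
- active load balancing
  - Rebalancing is normally decided through the periodic scheduler tick, but when it fails several times in certain situations, it switches to this method.
    - Active load balancing is needed to migrate a task that is already running on the busiest cpu.
  - Wakes the cpu stopper thread of the target cpu (it uses the stop scheduler and therefore has the highest priority) and push-migrates one task, other than the cpu stopper thread itself, from that cpu's runqueue to the dest runqueue.

The following figure shows the main function flow of CFS load balancing.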

SCHED softirq

Raising the sched softirq for load balancing

trigger_load_balance()

kernel/sched/fair.c

/*
 * Trigger the SCHED_SOFTIRQ if it is time to do periodic load balancing.
 */

void trigger_load_balance(struct rq *rq)
{
        /* Don't need to rebalance while attached to NULL domain */
        if (unlikely(on_null_domain(rq)))
                return;

        if (time_after_eq(jiffies, rq->next_balance))
                raise_softirq(SCHED_SOFTIRQ);

        nohz_balancer_kick(rq);
}

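If the current time has passed the time at which balancing should be checked, the sched softirq is raised. In addition, when nohz idle is supported, a nohz kick is performed if one is needed.

In code lines 4~5, the function returns if no scheduling domain has been attached to the runqueue yet.
In code lines 7~8, the sched softirq is raised if the time to check balancing has passed.
In code line 10, the nohz balancer kick is performed whenever the nohz balancing interval is due.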
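
The comparison against rq->next_balance has to keep working when the tick counter wraps around. The sketch below shows the usual signed-difference technique behind such a check; tick_after_eq() is a hypothetical stand-in written for this example, and it assumes the two values are never more than half the counter range apart.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Wrap-safe "a is at or after b" for a free-running tick counter. */
bool tick_after_eq(unsigned long a, unsigned long b)
{
        /* the unsigned difference, reinterpreted as signed, keeps its
         * meaning across a wrap as long as |a - b| is below half the range */
        return (long)(a - b) >= 0;
}

/* Decide whether the periodic balance softirq should be raised. */
bool balance_due(unsigned long now, unsigned long next_balance)
{
        return tick_after_eq(now, next_balance);
}
```

A naive `now >= next_balance` would report "not due" for now=5 and next_balance=ULONG_MAX-5, although the counter has merely wrapped and 11 ticks have passed.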

sched softirq routine

The following figure shows the flow of the functions called when active load balancing is performed on each scheduler tick.

run_rebalance_domains()

kernel/sched/fair.c

/*
 * run_rebalance_domains is triggered when needed from the scheduler tick.
 * Also triggered for nohz idle balancing (with nohz_balancing_kick set).
 */

static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
{
        struct rq *this_rq = this_rq();
        enum cpu_idle_type idle = this_rq->idle_balance ?
                                                CPU_IDLE : CPU_NOT_IDLE;

        /*
         * If this CPU has a pending nohz_balance_kick, then do the
         * balancing on behalf of the other idle CPUs whose ticks are
         * stopped. Do nohz_idle_balance *before* rebalance_domains to
         * give the idle CPUs a chance to load balance. Else we may
         * load balance only within the local sched_domain hierarchy
         * and abort nohz_idle_balance altogether if we pull some load.
         */
        if (nohz_idle_balance(this_rq, idle))
                return;

        /* normal load balance */
        update_blocked_averages(this_rq->cpu);
        rebalance_domains(this_rq, idle);
}

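This function is called through the CFS load balance softirq. Load balancing is performed only for the scheduling domains whose balancing interval is due.

In code lines 4~5, if the current runqueue's idle_balance is set, i.e. the cpu is idle, the CPU_IDLE type is selected; otherwise the CPU_NOT_IDLE type is selected.
In code lines 15~16, the nohz balancing requested by another cpu is attempted.
- The request is made in trigger_load_balance() -> nohz_balancer_kick() -> kick_ilb(): when a nohz idle cpu is woken through an IPI, the NOHZ_KICK_MASK flags are set on the runqueue of the cpu to be woken.
In code lines 19~20, after the blocked averages are updated, the regular (non-nohz) load balancing is invoked.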

rebalance_domains()

kernel/sched/fair.c -1/2-

/*
 * It checks each scheduling domain to see if it is due to be balanced,
 * and initiates a balancing operation if so.
 *
 * Balancing parameters are set up in init_sched_domains.
 */

static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
{
        int continue_balancing = 1;
        int cpu = rq->cpu;
        unsigned long interval;
        struct sched_domain *sd;
        /* Earliest time when we have to do rebalance again */
        unsigned long next_balance = jiffies + 60*HZ;
        int update_next_balance = 0;
        int need_serialize, need_decay = 0;
        u64 max_cost = 0;

        rcu_read_lock();
        for_each_domain(cpu, sd) {
                /*
                 * Decay the newidle max times here because this is a regular
                 * visit to all the domains. Decay ~1% per second.
                 */
                if (time_after(jiffies, sd->next_decay_max_lb_cost)) {
                        sd->max_newidle_lb_cost =
                                (sd->max_newidle_lb_cost * 253) / 256;
                        sd->next_decay_max_lb_cost = jiffies + HZ;
                        need_decay = 1;
                }
                max_cost += sd->max_newidle_lb_cost;

                if (!(sd->flags & SD_LOAD_BALANCE))
                        continue;

                /*
                 * Stop the load balance at this level. There is another
                 * CPU in our sched group which is doing load balancing more
                 * actively.
                 */
                if (!continue_balancing) {
                        if (need_decay)
                                continue;
                        break;
                }

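In code line 8, next_balance is initialized to 60 seconds ahead, which serves as the maximum.
In code line 14, the domains are traversed up to the topmost scheduling domain.
In code lines 19~25, sd->max_newidle_lb_cost is decayed by about 1% (253/256) once per second. sd->next_decay_max_lb_cost holds the time of the next decay.
In code lines 27~28, scheduling domains without the SD_LOAD_BALANCE flag are skipped.
In code lines 35~39, if continue_balancing (initial value 1) has been cleared, which load_balance() does when should_we_balance() decides that this cpu should not do the balancing, the domain is skipped when need_decay is set; otherwise the loop exits.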
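
The effect of the 253/256 factor is easier to judge with numbers. The following sketch applies the same integer arithmetic as the listing once per simulated second; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* One decay step: keep 253/256 of the cost (integer arithmetic). */
uint64_t decay_newidle_cost(uint64_t cost)
{
        return cost * 253 / 256;
}

/* Apply the step once per second for the given number of seconds. */
uint64_t decay_for_seconds(uint64_t cost, unsigned int seconds)
{
        while (seconds--)
                cost = decay_newidle_cost(cost);
        return cost;
}
```

A single step removes roughly 1.2% of the value (256000 becomes 253000), and repeating it for a minute leaves a little under half of the original cost.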

kernel/sched/fair.c -2/2-

                interval = get_sd_balance_interval(sd, idle != CPU_IDLE);

                need_serialize = sd->flags & SD_SERIALIZE;
                if (need_serialize) {
                        if (!spin_trylock(&balancing))
                                goto out;
                }

                if (time_after_eq(jiffies, sd->last_balance + interval)) {
                        if (load_balance(cpu, rq, sd, idle, &continue_balancing)) {
                                /*
                                 * The LBF_DST_PINNED logic could have changed
                                 * env->dst_cpu, so we can't know our idle
                                 * state even if we migrated tasks. Update it.
                                 */
                                idle = idle_cpu(cpu) ? CPU_IDLE : CPU_NOT_IDLE;
                        }
                        sd->last_balance = jiffies;
                        interval = get_sd_balance_interval(sd, idle != CPU_IDLE);
                }
                if (need_serialize)
                        spin_unlock(&balancing);
out:
                if (time_after(next_balance, sd->last_balance + interval)) {
                        next_balance = sd->last_balance + interval;
                        update_next_balance = 1;
                }
        }
        if (need_decay) {
                /*
                 * Ensure the rq-wide value also decays but keep it at a
                 * reasonable floor to avoid funnies with rq->avg_idle.
                 */
                rq->max_idle_balance_cost =
                        max((u64)sysctl_sched_migration_cost, max_cost);
        }
        rcu_read_unlock();

        /*
         * next_balance will be updated only when there is a need.
         * When the cpu is attached to null domain for ex, it will not be
         * updated.
         */
        if (likely(update_next_balance)) {
                rq->next_balance = next_balance;

#ifdef CONFIG_NO_HZ_COMMON
                /*
                 * If this CPU has been elected to perform the nohz idle
                 * balance. Other idle CPUs have already rebalanced with
                 * nohz_idle_balance() and nohz.next_balance has been
                 * updated accordingly. This CPU is now running the idle load
                 * balance for itself and we need to update the
                 * nohz.next_balance accordingly.
                 */
                if ((idle == CPU_IDLE) && time_after(nohz.next_balance, rq->next_balance))
                        nohz.next_balance = rq->next_balance;
#endif
        }
}

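In code line 1, the balance interval of the scheduling domain (in jiffies) is obtained.
- rebalance_domains() is entered with either the CPU_IDLE or the CPU_NOT_IDLE type.
- In the CPU_NOT_IDLE state, the domain's balance interval is multiplied by the busy_factor of 32 (a slower balancing interval).
In code lines 3~7, when the domain requires the NUMA balancing requests coming from all cpus to be handled serially, the lock is acquired. If that fails, the domain is skipped.
- NUMA domains have the SD_SERIALIZE flag. While one cpu is balancing in such a domain, other cpus that try to enter skip it to avoid contention and try again after the next balancing interval.
In code lines 9~20, load balancing is performed if the current time has passed the balancing interval of the domain being traversed. The balance interval is then refreshed.
In code lines 24~27, next_balance is updated to the minimum of last_balance + interval over the domains.
In code lines 29~36, max_idle_balance_cost is updated if need_decay is set.
In code lines 44~59, the runqueue's next_balance is set to the minimum next_balance computed above. When entered as CPU_IDLE, if nohz.next_balance lies further in the future than the runqueue's next_balance, nohz.next_balance is set to the same value.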

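The following figure shows load balancing being performed for the runqueue of a particular cpu, from the lowest scheduling domain up to the highest one.

In each scheduling domain, the scheduling group containing cpu#2 is the local group, and it is compared against the other groups for balancing.

Balancing interval of a domain

The balancing interval of a domain (sd->balance_interval) changes as follows.

The balancing interval is kept in ms, is converted to ticks (jiffies) when it is used, and starts from the minimum interval.
The minimum balancing interval (sd->min_interval) equals the number of cpus in the domain.
- Therefore the balancing interval gets longer at higher domain levels.
The maximum balancing interval (sd->max_interval) is twice the minimum interval.

The next balancing time (rq->next_balance) is determined as follows.

To keep the balancing overhead low, the balancing interval is multiplied by 32 while the cpu is not idle, and is used as-is while the cpu is idle.
When active balancing was performed, the next balancing time uses twice the balancing interval that was decided.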

get_sd_balance_interval()

kernel/sched/fair.c

static inline unsigned long
get_sd_balance_interval(struct sched_domain *sd, int cpu_busy)
{
        unsigned long interval = sd->balance_interval;

        if (cpu_busy)
                interval *= sd->busy_factor;

        /* scale ms to jiffies */
        interval = msecs_to_jiffies(interval);
        interval = clamp(interval, 1UL, max_load_balance_interval);

        return interval;
}

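Returns the balance interval (in jiffies) of the requested scheduling domain. When cpu_busy is set, busy_factor is multiplied in so that load balancing runs less often.

In code lines 4~7, the balance interval of the scheduling domain is read and, if the cpu_busy argument is set, multiplied by busy_factor (default: 32).
- Because interval starts out as the number of cpus in the domain, multiplying it by 32 yields a very large value on systems with many cpus. The value was therefore reduced to 16 in kernel v5.10-rc1.
  - Reference: sched/fair: Reduce busy load balance interval (2020, v5.10-rc1)
In code lines 10~13, the balance interval in ms is converted to jiffies, clamped to 1 ~ max_load_balance_interval (initial value 0.1 s), and returned.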
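
The interval calculation can be reproduced in user space to see how quickly the clamp takes effect. This is a sketch under stated assumptions: a tick rate of 250 Hz (SKETCH_HZ), a simplified round-up ms-to-jiffies conversion that is only valid when the tick rate divides 1000, and hypothetical function names.

```c
#include <assert.h>

/* assumed tick rate for this sketch; real kernels fix HZ at build time */
#define SKETCH_HZ 250UL

/* Round-up ms -> jiffies conversion, valid when SKETCH_HZ divides 1000. */
unsigned long ms_to_jiffies(unsigned long ms)
{
        unsigned long ms_per_jiffy = 1000UL / SKETCH_HZ;

        return (ms + ms_per_jiffy - 1) / ms_per_jiffy;
}

unsigned long balance_interval_jiffies(unsigned long balance_interval_ms,
                                       int cpu_busy, unsigned long busy_factor,
                                       unsigned long max_interval)
{
        unsigned long interval = balance_interval_ms;

        assert(max_interval >= 1);
        if (cpu_busy)
                interval *= busy_factor;        /* balance less often when busy */

        interval = ms_to_jiffies(interval);
        if (interval < 1UL)                     /* clamp to [1, max_interval] */
                interval = 1UL;
        if (interval > max_interval)
                interval = max_interval;
        return interval;
}
```

For a 4-cpu domain (4 ms) on a busy cpu with busy_factor=32, the raw interval is 128 ms, i.e. 32 jiffies at 250 Hz, which is then clamped to 25 jiffies when the maximum is 0.1 s.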

Load Balance

load_balance()

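load_balance() compares the load of the current cpu with that of the busiest cpu in the scheduling domain. If they are imbalanced, it migrates tasks from the busiest cpu to the current cpu to even out the load. (Pull migration) If the task on the busy cpu is running and cannot be pulled, the cpu stopper on the busy cpu is woken to migrate one task toward the requesting cpu. (Push migration)

load_balance() is entered with one of the following three cpu_idle_type values.

CPU_IDLE
- Entered from periodic load balancing driven by the scheduler tick, while the runqueue of the current cpu is idle.
- Loops over the scheduling domains that have the SD_LOAD_BALANCE flag.
  - { env.dst = current idle cpu <- env.src = busiest cpu found }
CPU_NOT_IDLE
- Entered from periodic load balancing driven by the scheduler tick, while some task is running on the runqueue of the current cpu.
- Loops over the scheduling domains that have the SD_LOAD_BALANCE flag.
  - { env.dst = current busy cpu <- env.src = busiest cpu found }
CPU_NEWLY_IDLE
- A passive load balance case: the last task running on the runqueue was dequeued, and the function is entered right before going idle.
- Loops over the scheduling domains that have the SD_LOAD_BALANCE & SD_BALANCE_NEWIDLE flags, and stops when a migration succeeds or when one or more tasks are running on the runqueue.
  - { env.dst = current new idle cpu <- env.src = busiest cpu found }

The following figure shows the entry routes for the 3 cpu_idle_type values.

In other words, under the conditions above, if a load imbalance is found between the cpus of the scheduling domain passed as an argument and the current cpu, load balancing is done by pulling tasks from the busiest cpu to the current cpu.

The algorithm for finding the busiest cpu in a domain is as follows.

First, find_busiest_group() finds the busiest group in the domain.
- Groups are compared by their group load divided by their group cpu capacity.
- e.g. the following high-performance and low-performance groups have the same relative load.
  - group A) group load=1035, group capacity=2070
  - group B) group load=430, group capacity=860
Second, find_busiest_queue() finds the busiest cpu in the busiest group.
- Cpus are compared by their cpu load divided by their cpu capacity.
- e.g. the following high-performance and low-performance cpus have the same relative load.
  - cpu A) cpu load=1535, cpu capacity=1535
  - cpu B) cpu load=430, cpu capacity=430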
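
Comparing "load divided by capacity" does not require an actual division: the two ratios can be compared by cross-multiplication, which avoids rounding. The sketch below illustrates this on the examples above; compare_relative_load() is a hypothetical helper, and the inputs are assumed small enough that the products do not overflow 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compare load_a/cap_a with load_b/cap_b without dividing.
 * Returns 1 if A is relatively busier, -1 if B is, 0 if they are equal.
 */
int compare_relative_load(uint64_t load_a, uint64_t cap_a,
                          uint64_t load_b, uint64_t cap_b)
{
        uint64_t lhs, rhs;

        assert(cap_a > 0 && cap_b > 0);
        lhs = load_a * cap_b;   /* load_a/cap_a > load_b/cap_b  <=>  lhs > rhs */
        rhs = load_b * cap_a;
        return (lhs > rhs) - (lhs < rhs);
}
```

Both example pairs compare as equal (1035/2070 = 430/860 and 1535/1535 = 430/430), whereas a cpu with load 900 on capacity 1024 is relatively busier than one with load 430 on capacity 860.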

kernel/sched/fair.c -1/5-

/*
 * Check this_cpu to ensure it is balanced within domain. Attempt to move
 * tasks if there is an imbalance.
 */

static int load_balance(int this_cpu, struct rq *this_rq,
                        struct sched_domain *sd, enum cpu_idle_type idle,
                        int *continue_balancing)
{
        int ld_moved, cur_ld_moved, active_balance = 0;
        struct sched_domain *sd_parent = sd->parent;
        struct sched_group *group;
        struct rq *busiest;
        struct rq_flags rf;
        struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);

        struct lb_env env = {
                .sd             = sd,
                .dst_cpu        = this_cpu,
                .dst_rq         = this_rq,
                .dst_grpmask    = sched_group_span(sd->groups),
                .idle           = idle,
                .loop_break     = sched_nr_migrate_break,
                .cpus           = cpus,
                .fbq_type       = all,
                .tasks          = LIST_HEAD_INIT(env.tasks),
        };

        cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);

        schedstat_inc(sd->lb_count[idle]);

redo:
        if (!should_we_balance(&env)) {
                *continue_balancing = 0;
                goto out_balanced;
        }

        group = find_busiest_group(&env);
        if (!group) {
                schedstat_inc(sd->lb_nobusyg[idle]);
                goto out_balanced;
        }

        busiest = find_busiest_queue(&env, group);
        if (!busiest) {
                schedstat_inc(sd->lb_nobusyq[idle]);
                goto out_balanced;
        }

        BUG_ON(busiest == env.dst_rq);

        schedstat_add(sd->lb_imbalance[idle], env.imbalance);

        env.src_cpu = busiest->cpu;
        env.src_rq = busiest;

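In code line 10, a pointer to the per-cpu load balance cpu mask is obtained.
In code lines 12~22, env, which holds the load balance environment, is prepared.
- Each field of the lb_env structure used for load balancing is described at the end of this post.
In code line 24, the cpus that belong to scheduling domain @sd and are also active (cpu_active_mask) are collected.
In code line 26, the lb_count[] counter of the scheduling domain for the idle type is incremented by 1.
In code lines 28~32 is the redo: label. If should_we_balance() decides that this cpu does not need to balance, continue_balancing is cleared and it jumps to the out_balanced label to leave the function.
In code lines 34~38, if there is no busiest group, load balancing is not needed, so the lb_nobusyg[] counter of the scheduling domain for the idle type is incremented by 1 and it jumps to the out_balanced label to leave the function.
In code lines 40~44, if there is no busiest runqueue in the group, load balancing is not needed either, so the lb_nobusyq[] counter of the scheduling domain for the idle type is incremented by 1 and it jumps to the out_balanced label to leave the function.
In code line 48, env.imbalance is added to the lb_imbalance[idle] stat of the scheduling domain.
In code lines 50~51, the busiest cpu and the busiest runqueue are assigned to src_cpu and src_rq.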

kernel/sched/fair.c -2/5-

        ld_moved = 0;
        if (busiest->nr_running > 1) {
                /*
                 * Attempt to move tasks. If find_busiest_group has found
                 * an imbalance but busiest->nr_running <= 1, the group is
                 * still unbalanced. ld_moved simply stays zero, so it is
                 * correctly treated as an imbalance.
                 */
                env.flags |= LBF_ALL_PINNED;
                env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);

more_balance:
                rq_lock_irqsave(busiest, &rf);
                update_rq_clock(busiest);

                /*
                 * cur_ld_moved - load moved in current iteration
                 * ld_moved     - cumulative load moved across iterations
                 */
                cur_ld_moved = detach_tasks(&env);

                /*
                 * We've detached some tasks from busiest_rq. Every
                 * task is masked "TASK_ON_RQ_MIGRATING", so we can safely
                 * unlock busiest->lock, and we are able to be sure
                 * that nobody can manipulate the tasks in parallel.
                 * See task_rq_lock() family for the details.
                 */

                rq_unlock(busiest, &rf);

                if (cur_ld_moved) {
                        attach_tasks(&env);
                        ld_moved += cur_ld_moved;
                }

                local_irq_restore(rf.flags);

                if (env.flags & LBF_NEED_BREAK) {
                        env.flags &= ~LBF_NEED_BREAK;
                        goto more_balance;
                }

In code lines 2~10, if the busiest runqueue has two or more running tasks, set the maximum loop count to the number of running tasks on busiest, capped by sysctl_sched_nr_migrate (default 32). Before detaching tasks, set LBF_ALL_PINNED as the initial value, meaning that no task can be moved. The flag is cleared as soon as at least one task turns out to be migratable.
- If the busiest runqueue has only one task, that task is currently running and cannot be pulled. ld_moved then stays 0, and the later code attempts a push migration through active balancing.
In code lines 12~14, this is the more_balance: label. Take the busiest runqueue lock and update the runqueue clock.
In code lines 20~30, detach the tasks to migrate from env->src_rq, store their count in cur_ld_moved, and release the runqueue lock.
- cur_ld_moved
  - Holds the number of tasks detached for migration in the current iteration.
- ld_moved
  - Accumulates the number of tasks migrated so far.
In code lines 32~35, attach the detached tasks to env->dst_rq and add their count to ld_moved.
In code lines 39~42, if LBF_NEED_BREAK is set, clear it and jump back to more_balance to continue.
- loop
  - Holds the number of migration attempts made so far.
  - Attempts continue up to loop_max. When loop passes loop_break along the way, loop_break is advanced by one more step and the attempt resumes.
- loop_max
  - Set to the number of running tasks on the busiest runqueue, capped at sysctl_sched_nr_migrate (default=32).
  - “/proc/sys/kernel/sched_nr_migrate” limits the maximum number of tasks migrated in one balancing call.
- loop_break
  - Starts at sched_nr_migrate_break (default 32) and grows in steps of 32.
  - When more than 32 tasks are involved, this is used to open and close interrupts once in a while, so that migration does not keep interrupts blocked for too long (keeping interrupt latency short).
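The relationship between loop, loop_max and loop_break can be sketched in user space as follows. This is a simplified model under stated assumptions, not the kernel's detach_tasks(): struct lb_loop, lb_loop_init(), detach_batch() and balance_passes() are hypothetical names, every task is assumed to be migratable, and the break step is fixed at the default of 32.

```c
#include <assert.h>

#define NR_MIGRATE_BREAK 32u  /* models sched_nr_migrate_break (default 32) */

/* Hypothetical stand-in for the loop-related fields of struct lb_env. */
struct lb_loop {
        unsigned int loop;        /* iterations used so far */
        unsigned int loop_max;    /* hard limit for one balancing call */
        unsigned int loop_break;  /* next point at which to take a breather */
        int need_break;           /* models the LBF_NEED_BREAK flag */
};

static void lb_loop_init(struct lb_loop *l, unsigned int nr_migrate,
                         unsigned int nr_running)
{
        assert(l);
        l->loop = 0;
        /* loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running) */
        l->loop_max = nr_migrate < nr_running ? nr_migrate : nr_running;
        l->loop_break = NR_MIGRATE_BREAK;
        l->need_break = 0;
}

/* One pass under the (modelled) runqueue lock; returns tasks detached. */
static unsigned int detach_batch(struct lb_loop *l, unsigned int *remaining)
{
        unsigned int detached = 0;

        while (*remaining > 0) {
                l->loop++;
                if (l->loop > l->loop_max)      /* seen enough tasks */
                        break;
                if (l->loop > l->loop_break) {  /* take a breather */
                        l->loop_break += NR_MIGRATE_BREAK;
                        l->need_break = 1;
                        break;
                }
                (*remaining)--;
                detached++;
        }
        return detached;
}

/* Repeats passes while a break was requested; returns total tasks moved. */
static unsigned int balance_passes(unsigned int nr_migrate,
                                   unsigned int nr_running,
                                   unsigned int *passes)
{
        struct lb_loop l;
        unsigned int remaining = nr_running, moved = 0;

        assert(passes);
        lb_loop_init(&l, nr_migrate, nr_running);
        *passes = 0;
        do {
                l.need_break = 0;
                moved += detach_batch(&l, &remaining);
                (*passes)++;  /* the lock would be dropped here between passes */
        } while (l.need_break);

        return moved;
}
```

In this model, with the default limit of 32, a runqueue holding 100 tasks is handled in a single pass that moves at most 32 of them. Only when the limit is raised above 32 is the work split into several passes, and each break consumes one loop iteration without moving a task.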

kernel/sched/fair.c -3/5-

                /*
                 * Revisit (affine) tasks on src_cpu that couldn't be moved to
                 * us and move them to an alternate dst_cpu in our sched_group
                 * where they can run. The upper limit on how many times we
                 * iterate on same src_cpu is dependent on number of cpus in our
                 * sched_group.
                 *
                 * This changes load balance semantics a bit on who can move
                 * load to a given_cpu. In addition to the given_cpu itself
                 * (or a ilb_cpu acting on its behalf where given_cpu is
                 * nohz-idle), we now have balance_cpu in a position to move
                 * load to given_cpu. In rare situations, this may cause
                 * conflicts (balance_cpu and given_cpu/ilb_cpu deciding
                 * _independently_ and at _same_ time to move some load to
                 * given_cpu) causing exceess load to be moved to given_cpu.
                 * This however should not happen so much in practice and
                 * moreover subsequent load balance cycles should correct the
                 * excess load moved.
                 */
                if ((env.flags & LBF_DST_PINNED) && env.imbalance > 0) {

                        /* Prevent to re-select dst_cpu via env's cpus */
                        cpumask_clear_cpu(env.dst_cpu, env.cpus);

                        env.dst_rq       = cpu_rq(env.new_dst_cpu);
                        env.dst_cpu      = env.new_dst_cpu;
                        env.flags       &= ~LBF_DST_PINNED;
                        env.loop         = 0;
                        env.loop_break   = sched_nr_migrate_break;

                        /*
                         * Go back to "more_balance" rather than "redo" since we
                         * need to continue with same src_cpu.
                         */
                        goto more_balance;
                }

                /*
                 * We failed to reach balance because of affinity.
                 */
                if (sd_parent) {
                        int *group_imbalance = &sd_parent->groups->sgc->imbalance;

                        if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0)
                                *group_imbalance = 1;
                }

                /* All tasks on this runqueue were pinned by CPU affinity */
                if (unlikely(env.flags & LBF_ALL_PINNED)) {
                        cpumask_clear_cpu(cpu_of(busiest), cpus);
                        /*
                         * Attempting to continue load balancing at the current
                         * sched_domain level only makes sense if there are
                         * active CPUs remaining as possible busiest CPUs to
                         * pull load from which are not contained within the
                         * destination group that is receiving any migrated
                         * load.
                         */
                        if (!cpumask_subset(cpus, env.dst_grpmask)) {
                                env.loop = 0;
                                env.loop_break = sched_nr_migrate_break;
                                goto redo;
                        }
                        goto out_all_pinned;
                }
        }

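In code lines 20~36, the domain is still imbalanced and LBF_DST_PINNED is set, which means one or more tasks could not be migrated because their affinity does not allow the destination (dest) cpu. The current dst cpu is therefore removed from env.cpus so that it is not selected again, the alternate cpu (new_dst_cpu) becomes the dst cpu, and LBF_DST_PINNED is cleared. The loop counter is then reset and execution jumps to more_balance to retry with the same src cpu.
In code lines 41~46, a parent domain exists, imbalance remains, and LBF_SOME_PINNED is set because some tasks could not be migrated. Balancing failed because of cpu affinity, so to let the next balancing attempt at the parent scheduling domain know about it, the imbalance value of the parent domain's first scheduling group is set to 1.
In code lines 49~65, in the unlikely case that LBF_ALL_PINNED is still set, none of the tasks on the busiest runqueue could be migrated. The busiest cpu is removed from the candidate cpus. If cpus outside the destination group remain as possible busiest cpus, the loop counter is reset and execution jumps to redo to keep going. If no such cpu remains, it jumps to out_all_pinned and leaves the function.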

kernel/sched/fair.c -4/5-

        if (!ld_moved) {
                schedstat_inc(sd->lb_failed[idle]);
                /*
                 * Increment the failure counter only on periodic balance.
                 * We do not want newidle balance, which can be very
                 * frequent, pollute the failure counter causing
                 * excessive cache_hot migrations and active balances.
                 */
                if (idle != CPU_NEWLY_IDLE)
                        sd->nr_balance_failed++;

                if (need_active_balance(&env)) {
                        unsigned long flags;

                        raw_spin_lock_irqsave(&busiest->lock, flags);

                        /* don't kick the active_load_balance_cpu_stop,
                         * if the curr task on busiest cpu can't be
                         * moved to this_cpu
                         */
                        if (!cpumask_test_cpu(this_cpu, busiest->curr->cpus_ptr)) {
                                raw_spin_unlock_irqrestore(&busiest->lock,
                                                            flags);
                                env.flags |= LBF_ALL_PINNED;
                                goto out_one_pinned;
                        }

                        /*
                         * ->active_balance synchronizes accesses to
                         * ->active_balance_work.  Once set, it's cleared
                         * only after active load balance is finished.
                         */
                        if (!busiest->active_balance) {
                                busiest->active_balance = 1;
                                busiest->push_cpu = this_cpu;
                                active_balance = 1;
                        }
                        raw_spin_unlock_irqrestore(&busiest->lock, flags);

                        if (active_balance) {
                                stop_one_cpu_nowait(cpu_of(busiest),
                                        active_load_balance_cpu_stop, busiest,
                                        &busiest->active_balance_work);
                        }

                        /*
                         * We've kicked active balancing, reset the failure
                         * counter.
                         */
                        sd->nr_balance_failed = sd->cache_nice_tries+1;
                }
        } else
                sd->nr_balance_failed = 0;

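In code lines 1~2, if no task was migrated, increment the scheduling domain's lb_failed[idle] counter.
In code lines 9~10, unless the idle type is CPU_NEWLY_IDLE, increment the scheduling domain's nr_balance_failed counter.
In code lines 12~15, if the conditions for active load balancing are met, take the busiest runqueue spinlock.
- For example, need_active_balance() returns true when nr_balance_failed > cache_nice_tries+2.
In code lines 21~26, if the current cpu is not among the cpus allowed for the task currently running on busiest, that task cannot be moved here, so LBF_ALL_PINNED is set and execution jumps to out_one_pinned to leave the function.
In code lines 33~44, set active_balance of the busiest runqueue to 1, set this_cpu (the dest cpu) as the cpu to move the task to, and ask the busiest cpu to perform a push migration.
- The cpu stopper thread (the “migration%d” kernel thread), which runs in the stop scheduling class, calls active_load_balance_cpu_stop(). That function picks one of the tasks on the busiest cpu other than the cpu stopper thread itself and migrates it toward rq->push_cpu.
In code line 50, set the scheduling domain's nr_balance_failed to cache_nice_tries+1.
In code lines 52~53, if load balancing moved any task, reset the scheduling domain's nr_balance_failed counter to 0.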
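The way nr_balance_failed drives active balancing can be illustrated with a small user-space model. It is only a sketch: struct sd_model and both functions are hypothetical names, only the failure-count condition of need_active_balance() is modelled, and the affinity check and the busiest->active_balance handshake are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of struct sched_domain. */
struct sd_model {
        unsigned int nr_balance_failed;
        unsigned int cache_nice_tries;
};

/* Models only the failure-count condition of need_active_balance(). */
static bool need_active_balance_model(const struct sd_model *sd)
{
        assert(sd);
        return sd->nr_balance_failed > sd->cache_nice_tries + 2;
}

/*
 * Accounts for the result of one balancing attempt.
 * Returns true when the model would kick active balancing.
 */
static bool account_balance_result(struct sd_model *sd, unsigned int ld_moved,
                                   bool newly_idle)
{
        assert(sd);

        if (ld_moved) {
                /* Something was pulled: forget earlier failures. */
                sd->nr_balance_failed = 0;
                return false;
        }

        /* Newly-idle attempts are frequent, so they do not count. */
        if (!newly_idle)
                sd->nr_balance_failed++;

        if (!need_active_balance_model(sd))
                return false;

        /* Active balancing kicked: step the counter back. */
        sd->nr_balance_failed = sd->cache_nice_tries + 1;
        return true;
}
```

With cache_nice_tries = 1, the fourth consecutive periodic failure is the first one that kicks active balancing in this model, after which the counter restarts from 2 rather than 0.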

kernel/sched/fair.c -5/5-

        if (likely(!active_balance) || voluntary_active_balance(&env)) {
                /* We were unbalanced, so reset the balancing interval */
                sd->balance_interval = sd->min_interval;
        } else {
                /*
                 * If we've begun active balancing, start to back off. This
                 * case may not be covered by the all_pinned logic if there
                 * is only 1 task on the busy runqueue (because we don't call
                 * detach_tasks).
                 */
                if (sd->balance_interval < sd->max_interval)
                        sd->balance_interval *= 2;
        }

        goto out;

out_balanced:
        /*
         * We reach balance although we may have faced some affinity
         * constraints. Clear the imbalance flag only if other tasks got
         * a chance to move and fix the imbalance.
         */
        if (sd_parent && !(env.flags & LBF_ALL_PINNED)) {
                int *group_imbalance = &sd_parent->groups->sgc->imbalance;

                if (*group_imbalance)
                        *group_imbalance = 0;
        }

out_all_pinned:
        /*
         * We reach balance because all tasks are pinned at this level so
         * we can't migrate them. Let the imbalance flag set so parent level
         * can try to migrate them.
         */
        schedstat_inc(sd->lb_balanced[idle]);

        sd->nr_balance_failed = 0;

out_one_pinned:
        ld_moved = 0;

        /*
         * newidle_balance() disregards balance intervals, so we could
         * repeatedly reach this code, which would lead to balance_interval
         * skyrocketting in a short amount of time. Skip the balance_interval
         * increase logic to avoid that.
         */
        if (env.idle == CPU_NEWLY_IDLE)
                goto out;

        /* tune up the balancing interval */
        if (((env.flags & LBF_ALL_PINNED) &&
              sd->balance_interval < MAX_PINNED_INTERVAL) ||
             sd->balance_interval < sd->max_interval)
                sd->balance_interval *= 2;
out:
        return ld_moved;
}

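In code lines 1~15, adjust the balance interval and jump to the out label. In the likely case that active balancing was not started (or it was a voluntary active balance), the scheduling domain's balance interval is reset to the minimum interval. Otherwise the balance interval is doubled, but only while it is still below the maximum interval.
In code lines 17~28, this is the out_balanced: label, reached when the domain is already balanced. Unless LBF_ALL_PINNED is set, the imbalance value of the parent scheduling domain's first scheduling group is reset to 0.
In code lines 30~38, this is the out_all_pinned: label. Increment the scheduling domain's lb_balanced[idle] counter and reset nr_balance_failed to 0.
In code lines 40~50, this is the out_one_pinned: label. Reset ld_moved to 0, and if this was a newly-idle balance (CPU_NEWLY_IDLE), jump to the out label and leave the function.
In code lines 53~56, double the balance interval if LBF_ALL_PINNED is set and the interval is below MAX_PINNED_INTERVAL (512), or if the interval is below max_interval.
In code lines 57~58, this is the out: label. Return the number of tasks migrated by load balancing.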
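The two places where balance_interval is tuned can be modelled as below. This is a hedged sketch rather than kernel code: struct sd_interval, tune_after_attempt() and tune_when_balanced() are hypothetical names, and the flags are passed in as plain booleans.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PINNED_INTERVAL 512UL

/* Hypothetical subset of struct sched_domain. */
struct sd_interval {
        unsigned long balance_interval;
        unsigned long min_interval;
        unsigned long max_interval;
};

/* Path taken after an actual balancing attempt (before "goto out"). */
static void tune_after_attempt(struct sd_interval *sd, bool active_balance,
                               bool voluntary)
{
        assert(sd);
        if (!active_balance || voluntary)
                sd->balance_interval = sd->min_interval;  /* was unbalanced */
        else if (sd->balance_interval < sd->max_interval)
                sd->balance_interval *= 2;                /* back off */
}

/* Path taken from out_balanced / out_all_pinned / out_one_pinned. */
static void tune_when_balanced(struct sd_interval *sd, bool newly_idle,
                               bool all_pinned)
{
        assert(sd);
        if (newly_idle)
                return;  /* newly-idle balancing ignores the interval */

        if ((all_pinned && sd->balance_interval < MAX_PINNED_INTERVAL) ||
            sd->balance_interval < sd->max_interval)
                sd->balance_interval *= 2;
}
```

Starting from an interval of 8 with max_interval 64, repeated balanced results stop doubling at 64, while the all-pinned case keeps doubling up to 512.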

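The following figure shows how, in the DIE domain, the local group containing cpu#2 is compared with the other groups to find the busiest group, and how the busiest queue is then found among the cpus in that group.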

Checking Whether to Balance

should_we_balance()

kernel/sched/fair.c

static int should_we_balance(struct lb_env *env)
{
        struct sched_group *sg = env->sd->groups;
        int cpu, balance_cpu = -1;

        /*
         * Ensure the balancing environment is consistent; can happen
         * when the softirq triggers 'during' hotplug.
         */
        if (!cpumask_test_cpu(env->dst_cpu, env->cpus))
                return 0;

        /*
         * In the newly idle case, we will allow all the CPUs
         * to do the newly idle load balance.
         */
        if (env->idle == CPU_NEWLY_IDLE)
                return 1;

        /* Try to find first idle CPU */
        for_each_cpu_and(cpu, group_balance_mask(sg), env->cpus) {
                if (!idle_cpu(cpu))
                        continue;

                balance_cpu = cpu;
                break;
        }

        if (balance_cpu == -1)
                balance_cpu = group_balance_cpu(sg);

        /*
         * First idle CPU or the first CPU(busiest) in this sched group
         * is eligible for doing load balancing at this and above domains.
         */
        return balance_cpu == env->dst_cpu;
}

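Returns whether this cpu is allowed to perform load balancing.

In code lines 10~11, if dst_cpu is not in env->cpus, return 0 so that no load balancing is done.
In code lines 17~18, if the cpu has just become idle (CPU_NEWLY_IDLE), always return true (1) so that load balancing is attempted.
- When the cpu attempting the pull migration enters from the newly idle state, balancing is always allowed.
In code lines 21~30, iterate over the cpus that are in both the balance mask of the first scheduling group and env->cpus, and find the first idle cpu. If none is found, take the first cpu of the scheduling group's balance mask.
In code line 36, return whether the cpu found is env->dst_cpu.
- When the cpu attempting the pull migration enters in the busy state, balancing is allowed if the dst cpu is the first idle cpu in the first group's balance mask or, when no cpu is idle, the first cpu of that mask.
- Note that on the first balancing attempt this cpu is the dst cpu, but if a task does not allow this cpu, the dst cpu changes to another cpu on the second attempt.

The following figure shows the cpus that may attempt balancing in the busy state.

There are three cases, but the figure shows only cases 1) and 3).
- 1) In the newly idle state, any cpu may attempt balancing.
- 2) For a busy cpu, balancing is attempted only when the dst cpu is the first idle cpu in the first group's balance mask.
- 3) If no idle cpu can be found in case 2), only the first cpu in the first group's balance mask may attempt balancing.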
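The three cases can be reproduced with plain bitmasks. The sketch below is a user-space approximation, assuming at most 32 cpus: should_we_balance_model() and first_cpu() are hypothetical names, and idle_mask stands in for calling idle_cpu() on each cpu.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns the lowest cpu number set in mask, or -1 if the mask is empty. */
static int first_cpu(uint32_t mask)
{
        for (int cpu = 0; cpu < 32; cpu++)
                if (mask & (UINT32_C(1) << cpu))
                        return cpu;
        return -1;
}

/*
 * env_cpus:           cpus taking part in this balancing pass (env->cpus)
 * group_balance_mask: balance mask of the domain's first group
 * idle_mask:          cpus that are currently idle
 */
static bool should_we_balance_model(int dst_cpu, bool newly_idle,
                                    uint32_t env_cpus,
                                    uint32_t group_balance_mask,
                                    uint32_t idle_mask)
{
        int balance_cpu;

        assert(dst_cpu >= 0 && dst_cpu < 32);

        if (!(env_cpus & (UINT32_C(1) << dst_cpu)))
                return false;            /* inconsistent environment */

        if (newly_idle)
                return true;             /* case 1) */

        /* case 2): first idle cpu in the balance mask */
        balance_cpu = first_cpu(group_balance_mask & env_cpus & idle_mask);

        /* case 3): no idle cpu, fall back to the first cpu of the mask */
        if (balance_cpu < 0)
                balance_cpu = first_cpu(group_balance_mask);

        return balance_cpu == dst_cpu;
}
```

For a first group covering cpu0 and cpu1 where only cpu1 is idle, the model lets cpu1 balance and refuses cpu0; when neither is idle, cpu0 is the only one allowed.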

Finding the Busiest Group and CPU in a Domain

Finding the Busiest Group in a Domain

find_busiest_group()

kernel/sched/fair.c -1/2-

/******* find_busiest_group() helpers end here *********************/

/**
 * find_busiest_group - Returns the busiest group within the sched_domain
 * if there is an imbalance.
 *
 * Also calculates the amount of runnable load which should be moved
 * to restore balance.
 *
 * @env: The load balancing environment.
 *
 * Return:      - The busiest group if imbalance exists.
 */

static struct sched_group *find_busiest_group(struct lb_env *env)
{
        struct sg_lb_stats *local, *busiest;
        struct sd_lb_stats sds;

        init_sd_lb_stats(&sds);

        /*
         * Compute the various statistics relavent for load balancing at
         * this level.
         */
        update_sd_lb_stats(env, &sds);

        if (sched_energy_enabled()) {
                struct root_domain *rd = env->dst_rq->rd;

                if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized))
                        goto out_balanced;
        }

        local = &sds.local_stat;
        busiest = &sds.busiest_stat;

        /* ASYM feature bypasses nice load balance check */
        if (check_asym_packing(env, &sds))
                return sds.busiest;

        /* There is no busy sibling group to pull tasks from */
        if (!sds.busiest || busiest->sum_nr_running == 0)
                goto out_balanced;

        /* XXX broken for overlapping NUMA groups */
        sds.avg_load = (SCHED_CAPACITY_SCALE * sds.total_load)
                                                / sds.total_capacity;

        /*
         * If the busiest group is imbalanced the below checks don't
         * work because they assume all things are equal, which typically
         * isn't true due to cpus_ptr constraints and the like.
         */
        if (busiest->group_type == group_imbalanced)
                goto force_balance;

        /*
         * When dst_cpu is idle, prevent SMP nice and/or asymmetric group
         * capacities from resulting in underutilization due to avg_load.
         */
        if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
            busiest->group_no_capacity)
                goto force_balance;

        /* Misfit tasks should be dealt with regardless of the avg load */
        if (busiest->group_type == group_misfit_task)
                goto force_balance;

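Finds the busiest scheduling group to pull tasks from, using the given load-balance environment.

In code line 6, initialize sds, the scheduling domain statistics used for load balancing. sds.busiest_stat.group_type is initialized to group_other.
In code line 12, update the scheduling domain statistics sds for load balancing.
In code lines 14~19, when EAS (Energy Aware Scheduling) is enabled, a performance domain is in use, and the root domain is not overutilized, balancing is unnecessary, so jump to out_balanced.
- With EAS, balancing is attempted only when at least one cpu is overutilized.
In code lines 21~22, set local and busiest to point at the statistics of the local and busiest groups.
In code lines 25~26, for a cpu in an asym packing domain, migrating to the faster cpu is considered better, so the sds.busiest group is returned.
- For example, on a POWER7 chip using SMT, hw thread 0 is faster than hw thread 1, so when hw thread 0 becomes idle it is more efficient to move the task running on hw thread 1 to hw thread 0.
In code lines 29~30, if there is no busiest group to pull from, or no cfs task is running in the busiest group, jump to out_balanced.
In code lines 33~34, compute the domain's average load by dividing the domain's total load (scaled by SCHED_CAPACITY_SCALE) by its total capacity.
In code lines 41~42, tasks are restricted to particular cpus, so the usual comparison of average loads between groups cannot be used. When the busiest group is classified as group_imbalanced, jump to force_balance, where calculate_imbalance() computes the imbalance for the already selected busiest group.
- If some tasks could not be migrated during balancing (LBF_SOME_PINNED), the imbalance value in the group capacity (sgc) of the parent domain's first group is set to 1 so that the group imbalance can be detected.
In code lines 48~50, if the call came from the idle or newly idle state, the local group has spare capacity, and the busiest group lacks capacity, balancing is required unconditionally, so jump to force_balance.
In code lines 53~54, if the busiest group is classified as group_misfit_task, balancing is likewise required unconditionally, so jump to force_balance.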

kernel/sched/fair.c -2/2-

        /*
         * If the local group is busier than the selected busiest group
         * don't try and pull any tasks.
         */
        if (local->avg_load >= busiest->avg_load)
                goto out_balanced;

        /*
         * Don't pull any tasks if this group is already above the domain
         * average load.
         */
        if (local->avg_load >= sds.avg_load)
                goto out_balanced;

        if (env->idle == CPU_IDLE) {
                /*
                 * This CPU is idle. If the busiest group is not overloaded
                 * and there is no imbalance between this and busiest group
                 * wrt idle CPUs, it is balanced. The imbalance becomes
                 * significant if the diff is greater than 1 otherwise we
                 * might end up to just move the imbalance on another group
                 */
                if ((busiest->group_type != group_overloaded) &&
                                (local->idle_cpus <= (busiest->idle_cpus + 1)))
                        goto out_balanced;
        } else {
                /*
                 * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use
                 * imbalance_pct to be conservative.
                 */
                if (100 * busiest->avg_load <=
                                env->sd->imbalance_pct * local->avg_load)
                        goto out_balanced;
        }

force_balance:
        /* Looks like there is an imbalance. Compute it */
        env->src_grp_type = busiest->group_type;
        calculate_imbalance(env, &sds);
        return env->imbalance ? sds.busiest : NULL;

out_balanced:
        env->imbalance = 0;
        return NULL;
}

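In code lines 5~34, if any of the following conditions holds, balancing is unnecessary and is abandoned by jumping to out_balanced.
- The local group's average load is greater than or equal to that of the selected busiest group.
- The local group's average load is greater than or equal to the domain's average load.
- The call came from the idle state (CPU_IDLE), the busiest group is not overloaded, and the local group does not have at least two more idle cpus than the busiest group.
- The call came from the not-idle or newly idle state, and the busiest group's average load does not exceed the local group's average load scaled by imbalance_pct.
  - Balancing has overhead, so the local value is weighted by imbalance_pct to suppress balancing when the difference is small.
  - imbalance_pct is 110% for SMT, 117% for MC, and 125% for the other domains. The larger the value, the more weight the local group gets and the more balancing is suppressed.
  - In kernel v5.10-rc1 the default imbalance_pct was reduced to 117%, so DIE and NUMA domains use this value as well.
    - See: sched/fair: Reduce minimal imbalance threshold (2020, v5.10-rc1)
In code lines 36~40, this is the force_balance label. Compute the imbalance, then return the selected busiest scheduling group (or NULL if the computed imbalance is 0).
In code lines 42~44, no busiest group was found. Return NULL so that balancing is abandoned.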
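The average-load comparisons for the not-idle and newly idle cases can be written as a small stand-alone check. This is a sketch under assumptions: scaled_avg_load() and looks_balanced() are hypothetical helpers, the CPU_IDLE branch and the force_balance shortcuts are not modelled, and the loads are made-up numbers.

```c
#include <assert.h>
#include <stdbool.h>

#define SCHED_CAPACITY_SCALE 1024UL

/* avg_load = load * SCHED_CAPACITY_SCALE / capacity */
static unsigned long scaled_avg_load(unsigned long load, unsigned long capacity)
{
        assert(capacity > 0);
        return load * SCHED_CAPACITY_SCALE / capacity;
}

/*
 * Returns true when no pull should be attempted
 * (CPU_NOT_IDLE / CPU_NEWLY_IDLE case only).
 */
static bool looks_balanced(unsigned long local_avg, unsigned long busiest_avg,
                           unsigned long domain_avg, unsigned long imbalance_pct)
{
        if (local_avg >= busiest_avg)
                return true;   /* local group is at least as busy */

        if (local_avg >= domain_avg)
                return true;   /* local group is already above average */

        /* busiest must exceed local by more than imbalance_pct percent */
        return 100 * busiest_avg <= imbalance_pct * local_avg;
}
```

For example, a local average of 1000 against a busiest average of 1200 counts as balanced with imbalance_pct = 125 but not with 117, which is why a lower percentage makes balancing more eager.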

Finding the Busiest CPU in a Group

find_busiest_queue()

kernel/sched/fair.c -1/2-

/*
 * find_busiest_queue - find the busiest runqueue among the CPUs in the group.
 */

static struct rq *find_busiest_queue(struct lb_env *env,
                                     struct sched_group *group)
{
        struct rq *busiest = NULL, *rq;
        unsigned long busiest_load = 0, busiest_capacity = 1;
        int i;

        for_each_cpu_and(i, sched_group_span(group), env->cpus) {
                unsigned long capacity, load;
                enum fbq_type rt;

                rq = cpu_rq(i);
                rt = fbq_classify_rq(rq);

                /*
                 * We classify groups/runqueues into three groups:
                 *  - regular: there are !numa tasks
                 *  - remote:  there are numa tasks that run on the 'wrong' node
                 *  - all:     there is no distinction
                 *
                 * In order to avoid migrating ideally placed numa tasks,
                 * ignore those when there's better options.
                 *
                 * If we ignore the actual busiest queue to migrate another
                 * task, the next balance pass can still reduce the busiest
                 * queue by moving tasks around inside the node.
                 *
                 * If we cannot move enough load due to this classification
                 * the next pass will adjust the group classification and
                 * allow migration of more tasks.
                 *
                 * Both cases only affect the total convergence complexity.
                 */
                if (rt > env->fbq_type)
                        continue;

                /*
                 * For ASYM_CPUCAPACITY domains with misfit tasks we simply
                 * seek the "biggest" misfit task.
                 */
                if (env->src_grp_type == group_misfit_task) {
                        if (rq->misfit_task_load > busiest_load) {
                                busiest_load = rq->misfit_task_load;
                                busiest = rq;
                        }

                        continue;
                }

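Returns the cpu runqueue with the busiest workload (runnable load average / cpu capacity) in the scheduling group.

In code line 8, iterate over the cpus that are in both the scheduling group and env->cpus.
In code lines 34~35, on a system using NUMA balancing, skip the runqueue if its fbq type is greater than env->fbq_type.
- Without NUMA balancing, the runqueue's fbq type is always regular (0), so nothing is skipped.
- fbq_type is updated only on NUMA systems, after update_sd_lb_stats() has computed the domain statistics.
In code lines 41~48, on an asym domain with big/little clusters, misfit task load is tracked. If the source group type is group_misfit_task and the runqueue's misfit_task_load exceeds busiest_load, this runqueue becomes the busiest candidate. The loop then continues with the next cpu.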

kernel/sched/fair.c -2/2-

                capacity = capacity_of(i);

                /*
                 * For ASYM_CPUCAPACITY domains, don't pick a CPU that could
                 * eventually lead to active_balancing high->low capacity.
                 * Higher per-CPU capacity is considered better than balancing
                 * average load.
                 */
                if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
                    capacity_of(env->dst_cpu) < capacity &&
                    rq->nr_running == 1)
                        continue;

                load = cpu_runnable_load(rq);

                /*
                 * When comparing with imbalance, use cpu_runnable_load()
                 * which is not scaled with the CPU capacity.
                 */

                if (rq->nr_running == 1 && load > env->imbalance &&
                    !check_cpu_capacity(rq, env->sd))
                        continue;

                /*
                 * For the load comparisons with the other CPU's, consider
                 * the cpu_runnable_load() scaled with the CPU capacity, so
                 * that the load can be moved away from the CPU that is
                 * potentially running at a lower capacity.
                 *
                 * Thus we're looking for max(load_i / capacity_i), crosswise
                 * multiplication to rid ourselves of the division works out
                 * to: load_i * capacity_j > load_j * capacity_i;  where j is
                 * our previous maximum.
                 */
                if (load * busiest_capacity > busiest_load * capacity) {
                        busiest_load = load;
                        busiest_capacity = capacity;
                        busiest = rq;
                }
        }

        return busiest;
}

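In code lines 1~12, skip the cpu if the domain is a big.LITTLE type domain with the SD_ASYM_CPUCAPACITY flag, the cpu's capacity is greater than the dst cpu's capacity, and the runqueue has only one task running.
- When a single task is running on a big cpu, it is better to leave it there than to move it to a little cpu through active balancing just to even out the average load, so no balancing is done.
In code lines 14~23, skip the cpu when a single cfs task with a runnable load higher than env->imbalance is running on its runqueue without interference from rt/dl/irq.
- That is, skip when only one task is running, the runnable load exceeds env->imbalance, and the cfs capacity has not been reduced by rt/dl/irq utilization.
In code lines 36~40, finally compare the capacity-scaled runnable loads and update the busiest runqueue.
- The load of the cpu being visited and the load of the current busiest cpu are compared with cpu capacity scaling applied, and busiest is updated.
- Naturally, the first cpu with a nonzero load always updates it.
In code line 43, return the busiest cpu runqueue.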

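Comparing Capacity-Scaled CPU Loads

Example) Two cpus have capacities A=1024 and B=480 and the same load of 200. Which one is the busier cpu?

19.5% (A: 200/1024) < 41.7% (B: 200/480), so B is the busier cpu.

Example) Two cpus have capacities A=1024 and B=480 and loads of A=400 and B=200. Which one is the busier cpu?

39.1% (A: 400/1024) < 41.7% (B: 200/480), so B is still the busier cpu.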
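The same comparison can be done without division, using the cross-multiplication from find_busiest_queue(). The sketch below is a user-space illustration: struct cpu_sample and busiest_index() are hypothetical names, and the skip conditions (fbq type, misfit, single running task) are left out.

```c
#include <assert.h>

/* Hypothetical per-cpu sample: runnable load and cpu capacity. */
struct cpu_sample {
        unsigned long load;
        unsigned long capacity;
};

/*
 * Returns the index of the cpu with the largest load/capacity ratio,
 * or -1 when no cpu has any load.
 */
static int busiest_index(const struct cpu_sample *cpus, int n)
{
        unsigned long busiest_load = 0, busiest_capacity = 1;
        int busiest = -1;

        assert(cpus);
        for (int i = 0; i < n; i++) {
                /*
                 * load_i / capacity_i > load_j / capacity_j rewritten as
                 * load_i * capacity_j > load_j * capacity_i.
                 */
                if (cpus[i].load * busiest_capacity >
                    busiest_load * cpus[i].capacity) {
                        busiest_load = cpus[i].load;
                        busiest_capacity = cpus[i].capacity;
                        busiest = i;
                }
        }
        return busiest;
}
```

For the first example, 200 * 1024 = 204800 is greater than 200 * 480 = 96000, so B replaces A as the busiest cpu. For the second, 200 * 1024 = 204800 is still greater than 400 * 480 = 192000, so B wins again.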

Load Balance Statistics

Updating Scheduling Domain Load Balance Statistics

update_sd_lb_stats()

kernel/sched/fair.c -1/2-

/**
 * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
 * @env: The load balancing environment.
 * @sds: variable to hold the statistics for this sched_domain.
 */

static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
{
        struct sched_domain *child = env->sd->child;
        struct sched_group *sg = env->sd->groups;
        struct sg_lb_stats *local = &sds->local_stat;
        struct sg_lb_stats tmp_sgs;
        bool prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
        int sg_status = 0;

#ifdef CONFIG_NO_HZ_COMMON
        if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
                env->flags |= LBF_NOHZ_STATS;
#endif

        do {
                struct sg_lb_stats *sgs = &tmp_sgs;
                int local_group;

                local_group = cpumask_test_cpu(env->dst_cpu, sched_group_span(sg));
                if (local_group) {
                        sds->local = sg;
                        sgs = local;

                        if (env->idle != CPU_NEWLY_IDLE ||
                            time_after_eq(jiffies, sg->sgc->next_update))
                                update_group_capacity(env->sd, env->dst_cpu);
                }

                update_sg_lb_stats(env, sg, sgs, &sg_status);

                if (local_group)
                        goto next_group;

                /*
                 * In case the child domain prefers tasks go to siblings
                 * first, lower the sg capacity so that we'll try
                 * and move all the excess tasks away. We lower the capacity
                 * of a group only if the local group has the capacity to fit
                 * these excess tasks. The extra check prevents the case where
                 * you always pull from the heaviest group when it is already
                 * under-utilized (possible with a large weight task outweighs
                 * the tasks on the system).
                 */
                if (prefer_sibling && sds->local &&
                    group_has_capacity(env, local) &&
                    (sgs->sum_nr_running > local->sum_nr_running + 1)) {
                        sgs->group_no_capacity = 1;
                        sgs->group_type = group_classify(sg, sgs);
                }

                if (update_sd_pick_busiest(env, sds, sg, sgs)) {
                        sds->busiest = sg;
                        sds->busiest_stat = *sgs;
                }

next_group:
                /* Now, start updating sd_lb_stats */
                sds->total_running += sgs->sum_nr_running;
                sds->total_load += sgs->group_load;
                sds->total_capacity += sgs->group_capacity;

                sg = sg->next;
        } while (sg != env->sd->groups);

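Updates the scheduling domain statistics sd_lb_stats used for load balancing.

In code lines 11~12, with CONFIG_NO_HZ_COMMON, if the call came from the newly idle state and nohz.has_blocked is set, add the LBF_NOHZ_STATS flag.
In code lines 15~32, iterate over the scheduling groups and update each group's statistics. The scheduling group that contains the dst cpu is designated the local group, and after its statistics are updated execution jumps to the next_group label. For the local group, the group capacity is also updated if the call did not come from the newly idle state or the update time (next_update) has been reached.
- For update_group_capacity(), see: Scheduler -14- (Scheduling Domain 2) | 문c
In code lines 44~49, if the child domain has the SD_PREFER_SIBLING flag, the local group has spare capacity, and the group being visited has at least two more tasks than the local group, mark the group as having no capacity and reclassify it, which changes its group type to group_overloaded.
In code lines 51~54, if the group being visited is busier than the busiest group found so far, select it as the busiest group and update the busiest statistics.
In code lines 56~63, this is the next_group label. Accumulate the group's task count, load, and capacity into the domain totals and move on to the next group.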
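The group walk itself is a do/while loop over a circular singly linked list. A minimal user-space sketch of just that accumulation, assuming hypothetical struct group_model and struct sd_totals types in place of sched_group and sd_lb_stats, could look like this:

```c
#include <assert.h>

/* Hypothetical stand-in for a sched_group plus its sg_lb_stats. */
struct group_model {
        unsigned long load;
        unsigned long capacity;
        unsigned int nr_running;
        struct group_model *next;  /* circular: the last group points to the first */
};

/* Hypothetical subset of struct sd_lb_stats. */
struct sd_totals {
        unsigned long total_load;
        unsigned long total_capacity;
        unsigned int total_running;
};

/* Visits every group in the ring exactly once, starting from 'first'. */
static void sum_groups(struct group_model *first, struct sd_totals *t)
{
        struct group_model *sg = first;

        assert(first && t);
        t->total_load = 0;
        t->total_capacity = 0;
        t->total_running = 0;

        do {
                t->total_running += sg->nr_running;
                t->total_load += sg->load;
                t->total_capacity += sg->capacity;
                sg = sg->next;
        } while (sg != first);
}
```

Because the list is circular, the totals do not depend on which group the walk starts from. In the kernel the walk starts at env->sd->groups.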

kernel/sched/fair.c -2/2-

#ifdef CONFIG_NO_HZ_COMMON
        if ((env->flags & LBF_NOHZ_AGAIN) &&
            cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {

                WRITE_ONCE(nohz.next_blocked,
                           jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));
        }
#endif

        if (env->sd->flags & SD_NUMA)
                env->fbq_type = fbq_classify_group(&sds->busiest_stat);

        if (!env->sd->parent) {
                struct root_domain *rd = env->dst_rq->rd;

                /* update overload indicator if we are at root domain */
                WRITE_ONCE(rd->overload, sg_status & SG_OVERLOAD);

                /* Update over-utilization (tipping point, U >= 0) indicator */
                WRITE_ONCE(rd->overutilized, sg_status & SG_OVERUTILIZED);
                trace_sched_overutilized_tp(rd, sg_status & SG_OVERUTILIZED);
        } else if (sg_status & SG_OVERUTILIZED) {
                struct root_domain *rd = env->dst_rq->rd;

                WRITE_ONCE(rd->overutilized, SG_OVERUTILIZED);
                trace_sched_overutilized_tp(rd, SG_OVERUTILIZED);
        }
}

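In code lines 2~7, if LBF_NOHZ_AGAIN is set and the domain contains all nohz idle cpus, set nohz.next_blocked to 32 ms from now.
In code lines 10~11, for a NUMA scheduling domain, derive the fbq type from the busiest group's statistics and store it.
In code lines 13~21, when processing the top-level domain (one with no parent), update the overload and overutilized state in the root domain of the dst runqueue.
In code lines 22~27, when this is not the top-level domain and the status is overutilized, set only the root domain's overutilized field to SG_OVERUTILIZED (2). overload is left unchanged.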
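How the sg_status bits end up in the root domain can be shown with a short model. It is a sketch only: struct rd_model and update_root_flags() are hypothetical names, SG_OVERLOAD is assumed to be 0x1 alongside the SG_OVERUTILIZED value of 2 mentioned above, and WRITE_ONCE() and the tracepoint are omitted.

```c
#include <assert.h>
#include <stdbool.h>

#define SG_OVERLOAD     0x1  /* assumed value */
#define SG_OVERUTILIZED 0x2

/* Hypothetical subset of struct root_domain. */
struct rd_model {
        int overload;
        int overutilized;
};

static void update_root_flags(struct rd_model *rd, int sg_status,
                              bool has_parent)
{
        assert(rd);

        if (!has_parent) {
                /* Top-level domain: both indicators follow sg_status. */
                rd->overload = sg_status & SG_OVERLOAD;
                rd->overutilized = sg_status & SG_OVERUTILIZED;
        } else if (sg_status & SG_OVERUTILIZED) {
                /* Lower domain: overutilized can only be set, never cleared. */
                rd->overutilized = SG_OVERUTILIZED;
        }
}
```

In this model a lower-level domain can raise overutilized but cannot clear it, and it never touches overload. Only a pass over the top-level domain can reset both.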

The following figure shows how the domain statistics and the statistics of the local and busiest groups are computed into the sd_lb_stats structure.

Updating Scheduling Group Load Balance Statistics

update_sg_lb_stats()

kernel/sched/fair.c

/**
 * update_sg_lb_stats - Update sched_group's statistics for load balancing.
 * @env: The load balancing environment.
 * @group: sched_group whose statistics are to be updated.
 * @sgs: variable to hold the statistics for this group.
 * @sg_status: Holds flag indicating the status of the sched_group
 */

static inline void update_sg_lb_stats(struct lb_env *env,
                                      struct sched_group *group,
                                      struct sg_lb_stats *sgs,
                                      int *sg_status)
{
        int i, nr_running;

        memset(sgs, 0, sizeof(*sgs));

        for_each_cpu_and(i, sched_group_span(group), env->cpus) {
                struct rq *rq = cpu_rq(i);

                if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, false))
                        env->flags |= LBF_NOHZ_AGAIN;

                sgs->group_load += cpu_runnable_load(rq);
                sgs->group_util += cpu_util(i);
                sgs->sum_nr_running += rq->cfs.h_nr_running;

                nr_running = rq->nr_running;
                if (nr_running > 1)
                        *sg_status |= SG_OVERLOAD;

                if (cpu_overutilized(i))
                        *sg_status |= SG_OVERUTILIZED;

#ifdef CONFIG_NUMA_BALANCING
                sgs->nr_numa_running += rq->nr_numa_running;
                sgs->nr_preferred_running += rq->nr_preferred_running;
#endif
                /*
                 * No need to call idle_cpu() if nr_running is not 0
                 */
                if (!nr_running && idle_cpu(i))
                        sgs->idle_cpus++;

                if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
                    sgs->group_misfit_task_load < rq->misfit_task_load) {
                        sgs->group_misfit_task_load = rq->misfit_task_load;
                        *sg_status |= SG_OVERLOAD;
                }
        }

        /* Adjust by relative CPU capacity of the group */
        sgs->group_capacity = group->sgc->capacity;
        sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;

        if (sgs->sum_nr_running)
                sgs->load_per_task = sgs->group_load / sgs->sum_nr_running;

        sgs->group_weight = group->group_weight;

        sgs->group_no_capacity = group_is_overloaded(env, sgs);
        sgs->group_type = group_classify(group, sgs);
}

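Updates sg_lb_stats, the scheduling group statistics used for load balancing.

In code line 8, the scheduling group statistics @sgs, passed as an output argument, are first cleared to 0.
In code line 10, iterates over the cpus that belong to both the scheduling group and the requested env->cpus.
In code lines 13~14, if the LBF_NOHZ_STATS flag is requested, the blocked-load statistics of a nohz runqueue are updated, and the LBF_NOHZ_AGAIN flag is added if blocked load still remains.
In code lines 16~18, the runnable load, the utilization and the number of cfs tasks of each cpu in the group are accumulated into the group statistics.
In code lines 20~22, if two or more tasks are running on the runqueue, the SG_OVERLOAD flag is added to the output argument @sg_status. (rq->nr_running counts tasks of every scheduling class, not only cfs tasks.)
In code lines 24~25, if the utilization of the cpu being visited exceeds its capacity, the SG_OVERUTILIZED flag is added to the output argument @sg_status.
In code lines 28~29, the NUMA-related task counts of the cpu being visited are accumulated into the group statistics.
- The number of NUMA tasks and the number of tasks running on their preferred NUMA node
In code lines 34~35, if the cpu being visited has no running task and is idle, the idle_cpus counter, which holds the number of idle cpus in the group, is incremented by 1.
In code lines 37~41, in a domain whose groups have different cpu capacities (SD_ASYM_CPUCAPACITY), such as the DIE domain of a big/little cluster system, if the misfit_task_load of the cpu being visited is larger than group_misfit_task_load, group_misfit_task_load is updated to that value and the SG_OVERLOAD flag is added to the output argument @sg_status.
In code lines 45~46, the capacity of the scheduling group is first assigned to sgs->group_capacity. Then sgs->avg_load is set to group load * 1024 / group capacity.
In code lines 48~49, if tasks are running in the group, the load per task is computed.
In code line 51, the weight of the scheduling group is assigned to sgs->group_weight.
In code lines 53~54, determines whether the group is overloaded and classifies the group type.
- If the group is overloaded, the group type becomes group_overloaded; otherwise one of the other group types is determined.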
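
The closing arithmetic of update_sg_lb_stats() can be tried out in isolation. The following is a minimal user-space sketch, not kernel code: struct sg_stats_sketch and finish_group_stats() are hypothetical names that mirror only the fields used in code lines 45~49, and any input values are made up for illustration.

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024UL

/* Hypothetical, trimmed-down stand-in for struct sg_lb_stats. */
struct sg_stats_sketch {
        unsigned long group_load;       /* sum of runnable load of the cpus */
        unsigned long group_capacity;   /* sum of cpu capacities in the group */
        unsigned long sum_nr_running;   /* number of cfs tasks in the group */
        unsigned long avg_load;         /* output: capacity-scaled load */
        unsigned long load_per_task;    /* output: load per running task */
};

/* Mirrors the avg_load and load_per_task computation only. */
static void finish_group_stats(struct sg_stats_sketch *sgs)
{
        assert(sgs->group_capacity > 0);

        /* Scale the load by the group's capacity relative to 1024. */
        sgs->avg_load = sgs->group_load * SCHED_CAPACITY_SCALE /
                        sgs->group_capacity;

        /* Computed only when tasks are running, as in the kernel code. */
        sgs->load_per_task = 0;
        if (sgs->sum_nr_running)
                sgs->load_per_task = sgs->group_load / sgs->sum_nr_running;
}
```

For a group with group_load 1536, group_capacity 2048 and 3 running tasks, this gives avg_load 768 and load_per_task 512. A group with twice the capacity of a single full-speed cpu therefore reports half the load it would report at capacity 1024.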

update_nohz_stats()

kernel/sched/fair.c

static bool update_nohz_stats(struct rq *rq, bool force)
{
#ifdef CONFIG_NO_HZ_COMMON
        unsigned int cpu = rq->cpu;

        if (!rq->has_blocked_load)
                return false;

        if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
                return false;

        if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
                return true;

        update_blocked_averages(cpu);

        return rq->has_blocked_load;
#else
        return false;
#endif
}

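For a nohz runqueue that has blocked load, updates the blocked-load statistics. When @force is 0 the statistics are updated only if the tick has changed; when @force is 1 they are always updated.

In code lines 6~7, returns false if the runqueue has no blocked load.
In code lines 9~10, returns false if the cpu of the requested runqueue is not a nohz idle cpu.
In code lines 12~13, if @force is not requested and the current tick is not later than the tick at which the blocked load was last updated, returns true without updating, to avoid a duplicate update.
In code lines 15~17, updates the blocked load averages and returns whether blocked load remains.

Checking for the busiest group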

update_sd_pick_busiest()

kernel/sched/fair.c

/**
 * update_sd_pick_busiest - return 1 on busiest group
 * @env: The load balancing environment.
 * @sds: sched_domain statistics
 * @sg: sched_group candidate to be checked for being the busiest
 * @sgs: sched_group statistics
 *
 * Determine if @sg is a busier group than the previously selected
 * busiest group.
 *
 * Return: %true if @sg is a busier group than the previously selected
 * busiest group. %false otherwise.
 */

static bool update_sd_pick_busiest(struct lb_env *env,
                                   struct sd_lb_stats *sds,
                                   struct sched_group *sg,
                                   struct sg_lb_stats *sgs)
{
        struct sg_lb_stats *busiest = &sds->busiest_stat;

        /*
         * Don't try to pull misfit tasks we can't help.
         * We can use max_capacity here as reduction in capacity on some
         * CPUs in the group should either be possible to resolve
         * internally or be covered by avg_load imbalance (eventually).
         */
        if (sgs->group_type == group_misfit_task &&
            (!group_smaller_max_cpu_capacity(sg, sds->local) ||
             !group_has_capacity(env, &sds->local_stat)))
                return false;

        if (sgs->group_type > busiest->group_type)
                return true;

        if (sgs->group_type < busiest->group_type)
                return false;

        if (sgs->avg_load <= busiest->avg_load)
                return false;

        if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
                goto asym_packing;

        /*
         * Candidate sg has no more than one task per CPU and
         * has higher per-CPU capacity. Migrating tasks to less
         * capable CPUs may harm throughput. Maximize throughput,
         * power/energy consequences are not considered.
         */
        if (sgs->sum_nr_running <= sgs->group_weight &&
            group_smaller_min_cpu_capacity(sds->local, sg))
                return false;

        /*
         * If we have more than one misfit sg go with the biggest misfit.
         */
        if (sgs->group_type == group_misfit_task &&
            sgs->group_misfit_task_load < busiest->group_misfit_task_load)
                return false;

asym_packing:
        /* This is the busiest node in its class. */
        if (!(env->sd->flags & SD_ASYM_PACKING))
                return true;

        /* No ASYM_PACKING if target CPU is already busy */
        if (env->idle == CPU_NOT_IDLE)
                return true;
        /*
         * ASYM_PACKING needs to move all the work to the highest
         * prority CPUs in the group, therefore mark all groups
         * of lower priority than ourself as busy.
         */
        if (sgs->sum_nr_running &&
            sched_asym_prefer(env->dst_cpu, sg->asym_prefer_cpu)) {
                if (!sds->busiest)
                        return true;

                /* Prefer to move from lowest priority CPU's work */
                if (sched_asym_prefer(sds->busiest->asym_prefer_cpu,
                                      sg->asym_prefer_cpu))
                        return true;
        }

        return false;
}

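Returns whether the requested scheduling group is busier than the previously selected busiest scheduling group.

1) misfit: rejected when the local group cannot help (the candidate's max cpu capacity is not smaller than the local group's, or the local group has no spare capacity)
2) accepted when the group type is higher, rejected when it is lower
3) rejected when the group type is equal and the average load is not larger
4) asym cpu capacity: rejected when the candidate has no more than one task per cpu and a higher per-cpu capacity than the local group, or when its misfit task load is smaller
5) accepted when the domain does not use asym packing, or when the dst cpu entered busy
6) asym packing: accepted when the dst cpu has a higher priority than the candidate's preferred cpu and the candidate is the lowest-priority choice so far; rejected otherwise

In code line 6, fetches the statistics of the busiest scheduling group for comparison.
In code lines 14~17, for a candidate of type group_misfit_task, returns false if either of the following holds, because pulling the misfit task would not help.
- The candidate group's max cpu capacity is not smaller than that of the local group.
- The local group does not have enough spare capacity.
In code lines 19~20, if the group type of the candidate is higher than that of the busiest group, the candidate is considered busier and true(1) is returned.
- There are 4 group types: group_other(0), group_misfit_task(1), group_imbalanced(2) and group_overloaded(3).
In code lines 22~23, if the group type of the candidate is lower than that of the busiest group, the candidate is considered not busier and false(0) is returned.
In code lines 25~26, when the group types are equal, the average loads are compared, and false(0) is returned if the candidate's average load is less than or equal to that of the current busiest group.
In code lines 28~29, if the scheduling domain does not have the SD_ASYM_CPUCAPACITY flag (used for big/little systems), jumps to the asym_packing label.
In code lines 37~39, if the number of tasks running in the candidate group is no more than the number of cpus in the group, and the local group has a smaller min cpu capacity than the candidate group, returns false.
In code lines 44~46, if the candidate group is of type group_misfit_task and its group_misfit_task_load is smaller than that of the busiest group, returns false.
Code lines 48~51 are the asym_packing: label. If the domain does not use the SD_ASYM_PACKING flag (currently used by the powerpc architecture and by x86 with ITMT support), returns true.
In code lines 54~55, if the dst cpu entered while busy (CPU_NOT_IDLE), returns true.
In code lines 61~72, if the candidate scheduling group has running tasks and the dst cpu has a higher asym priority than the candidate group's preferred cpu, returns true when one of the following holds, and false otherwise.
- The busiest group has not been chosen yet.
- The candidate group's preferred cpu has a lower priority than the preferred cpu of the current busiest group, so that work is pulled from the lowest-priority cpu first.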
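
The core ordering used by update_sd_pick_busiest(), group type first and average load second, can be sketched on its own. This is a simplified illustration under stated assumptions: struct sg_cmp_sketch and busier_than() are hypothetical names, and the misfit, asym cpu capacity and asym packing checks are deliberately left out.

```c
#include <stdbool.h>

/* Same ordering as the group types described in the text. */
enum group_type {
        group_other = 0,
        group_misfit_task,
        group_imbalanced,
        group_overloaded,
};

/* Hypothetical subset of struct sg_lb_stats used for the comparison. */
struct sg_cmp_sketch {
        enum group_type group_type;
        unsigned long avg_load;
};

/* True if the candidate is busier than the current busiest group. */
static bool busier_than(const struct sg_cmp_sketch *sgs,
                        const struct sg_cmp_sketch *busiest)
{
        /* A higher group type always wins, whatever the load is. */
        if (sgs->group_type > busiest->group_type)
                return true;
        if (sgs->group_type < busiest->group_type)
                return false;

        /* Same type: only a strictly larger average load wins. */
        return sgs->avg_load > busiest->avg_load;
}
```

With this ordering, a group_overloaded group with avg_load 100 is still picked over a group_other group with avg_load 900, because the load is only compared when the types are equal.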

The following figure shows how the scheduling groups in a domain are visited to select and update the busiest group.

Overload & Overutilized & Misfit-Task-Load

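When load balancing runs across scheduling domains that use asymmetric cpu capacity, such as big/little architectures, a task load that ran without problems on a big cpu can run short of cpu capacity once it moves to a little cpu. The following states are distinguished to detect this.

overloaded
- Overloaded state: the load is exceeded.
- group_no_capacity is set for the scheduling group.
- The SG_OVERLOAD flag is recorded in the root domain.
overutilized
- Over-utilized state: the utilization exceeds the cpu capacity, for example after moving to a lower-performance cpu.
- The SG_OVERUTILIZED flag is recorded in the root domain.
misfit_task_load
- On a system with big/little clusters, a task with very high utilization is meant to run on the big cluster so that it reaches maximum performance.
- When such a task moves to a lower-performance cpu, such as one in the little cluster, the runqueue enters the misfit_task state and rq->misfit_task_load holds the load of the task. For the scheduling group containing that cpu, the largest rq->misfit_task_load in the group is stored in sgs->group_misfit_task_load.
- It is 0 when the runqueue is not in the misfit state.

Updating the misfit status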

update_misfit_status()

kernel/sched/fair.c

static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
{
        if (!static_branch_unlikely(&sched_asym_cpucapacity))
                return;

        if (!p) {
                rq->misfit_task_load = 0;
                return;
        }

        if (task_fits_capacity(p, capacity_of(cpu_of(rq)))) {
                rq->misfit_task_load = 0;
                return;
        }

        rq->misfit_task_load = task_h_load(p);
}

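Updates the misfit load according to whether the cpu capacity of the runqueue is enough for the utilization of the requested task. (In the misfit state the task load is assigned to rq->misfit_task_load; otherwise it is cleared to 0.)

In code lines 3~4, returns immediately on a system that does not use asym cpu capacity.
In code lines 6~9, if no task is given, clears the runqueue's misfit_task_load to 0 and returns.
In code lines 11~14, if the task fits within the capacity of the cpu the runqueue runs on, clears the runqueue's misfit_task_load to 0 and returns.
In code line 16, assigns the task load to the runqueue's misfit_task_load to mark that the cpu capacity is insufficient.

The following figure shows how the largest load among the misfit tasks in a group is stored in group_misfit_task_load.

See update_sg_lb_stats()

The following figure shows the case where, when groups are compared for the busiest group, the group capacity is too small to hold the utilization.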

task_fits_capacity()

kernel/sched/fair.c

static inline int task_fits_capacity(struct task_struct *p, long capacity)
{
        return fits_capacity(task_util_est(p), capacity);
}

Returns whether the utilization of the requested task (with the 125% margin applied) fits within @capacity. (1=fits, 0=does not fit)

check_misfit_status()

kernel/sched/fair.c

/*
 * Check whether a rq has a misfit task and if it looks like we can actually
 * help that task: we can migrate the task to a CPU of higher capacity, or
 * the task's current CPU is heavily pressured.
 */

static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
{
        return rq->misfit_task_load &&
                (rq->cpu_capacity_orig < rq->rd->max_cpu_capacity ||
                 check_cpu_capacity(rq, sd));
}

Returns whether the runqueue has a misfit task that load balancing can actually help. (1=misfit and can be helped, 0=otherwise)

Returns 1 when the runqueue is in the misfit state and either its cpu has a smaller original capacity than the largest cpu capacity in the root domain, or the cpu is under pressure (the cpu capacity left for cfs has been reduced by rt/dl/irq).

check_cpu_capacity()

kernel/sched/fair.c

/*
 * Check whether the capacity of the rq has been noticeably reduced by side
 * activity. The imbalance_pct is used for the threshold.
 * Return true is the capacity is reduced
 */

static inline int
check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
{
        return ((rq->cpu_capacity * sd->imbalance_pct) <
                                (rq->cpu_capacity_orig * 100));
}

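Finds out whether the cfs capacity of the cpu has been reduced by rt/dl/irq and similar utilization.

Applies the domain's imbalance_pct as a threshold to the cpu capacity of the requested runqueue, which represents its cfs performance, and finds out whether the result is smaller than the original capacity of the cpu. 1=reduced by the threshold or more, 0=reduced by less than the threshold
What is side activity?
- The utilization load of rt, dl, irq and so on

The following figure shows how it is decided whether the cfs capacity of the cpu has been reduced.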
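
The threshold test in check_cpu_capacity() is easy to check with numbers. The following is a small user-space sketch; capacity_reduced() is a hypothetical name, and the imbalance_pct of 117 mentioned afterwards is only an assumed example value, not something stated in this article.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-alone version of the comparison:
 * true when cpu_capacity * imbalance_pct < cpu_capacity_orig * 100,
 * i.e. the capacity left for cfs dropped noticeably below the original.
 */
static bool capacity_reduced(unsigned long cpu_capacity,
                             unsigned long cpu_capacity_orig,
                             unsigned int imbalance_pct)
{
        return cpu_capacity * imbalance_pct < cpu_capacity_orig * 100;
}
```

With an original capacity of 1024 and an imbalance_pct of 117, a remaining cfs capacity of 875 or less counts as reduced (875 * 117 = 102375 < 102400), while 876 does not (876 * 117 = 102492).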

group_smaller_max_cpu_capacity()

kernel/sched/fair.c

/*
 * group_smaller_max_cpu_capacity: Returns true if sched_group sg has smaller
 * per-CPU capacity_orig than sched_group ref.
 */

static inline bool
group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
{
        return fits_capacity(sg->sgc->max_capacity, ref->sgc->max_capacity);
}

Returns whether the max_capacity of scheduling group @sg is smaller than that of scheduling group @ref. (1=@sg is smaller)

sg->sgc->max_capacity * 125% < ref->sgc->max_capacity

group_smaller_min_cpu_capacity()

kernel/sched/fair.c

/*
 * group_smaller_min_cpu_capacity: Returns true if sched_group sg has smaller
 * per-CPU capacity than sched_group ref.
 */

static inline bool
group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
{
        return fits_capacity(sg->sgc->min_capacity, ref->sgc->min_capacity);
}

Returns whether the min_capacity of scheduling group @sg is smaller than that of scheduling group @ref. (1=@sg is smaller)

sg->sgc->min_capacity * 125% < ref->sgc->min_capacity

fits_capacity()

kernel/sched/fair.c

/*
 * The margin used when comparing utilization with CPU capacity.
 *
 * (default: ~20%)
 */

#define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)

Returns whether the requested @cap with 125% applied stays below @max. (1=fits, 0=capacity exceeded)

Updating the overutilized status
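
The 125% factor means that a utilization fits only while it stays below roughly 80% of the capacity. The sketch below reuses the fits_capacity() macro shown above in a user-space helper; util_fits() is a hypothetical wrapper added only for illustration.

```c
#include <stdbool.h>

/* Same macro as in kernel/sched/fair.c shown above. */
#define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)

/* Hypothetical wrapper: does this utilization fit on this capacity? */
static bool util_fits(unsigned long util, unsigned long capacity)
{
        return fits_capacity(util, capacity);
}
```

For a capacity of 1024 the boundary lies between 819 and 820: 819 * 1280 = 1048320 is below 1024 * 1024 = 1048576, while 820 * 1280 = 1049600 is not. The same test, negated, is what cpu_overutilized() uses.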

update_overutilized_status()

kernel/sched/fair.c

static inline void update_overutilized_status(struct rq *rq)
{
        if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu)) {
                WRITE_ONCE(rq->rd->overutilized, SG_OVERUTILIZED);
                trace_sched_overutilized_tp(rq->rd, SG_OVERUTILIZED);
        }
}

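If the runqueue is in the over-utilized state, where its utilization exceeds the cpu capacity, records the over-utilized state in the root domain that the runqueue points to.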

cpu_overutilized()

kernel/sched/fair.c

static inline bool cpu_overutilized(int cpu)
{
        return !fits_capacity(cpu_util(cpu), capacity_of(cpu));
}

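Returns whether the utilization exceeds the cpu capacity. (1=exceeded)

The cpu is over-utilized when its utilization * 125% reaches or exceeds its capacity.

The following figure shows, with 3 examples, whether the cpu utilization with the 125% threshold applied exceeds the capacity.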

group_is_overloaded()

kernel/sched/fair.c

/*
 *  group_is_overloaded returns true if the group has more tasks than it can
 *  handle.
 *  group_is_overloaded is not equals to !group_has_capacity because a group
 *  with the exact right number of tasks, has no more spare capacity but is not
 *  overloaded so both group_has_capacity and group_is_overloaded return
 *  false.
 */

static inline bool
group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
{
        if (sgs->sum_nr_running <= sgs->group_weight)
                return false;

        if ((sgs->group_capacity * 100) <
                        (sgs->group_util * env->sd->imbalance_pct))
                return true;

        return false;
}

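Returns whether the group is overloaded. (1=overloaded)

In code lines 4~5, if the number of tasks running in the group is less than or equal to the number of cpus in the group, the group is not overloaded and false is returned.
In code lines 7~11, if the group utilization with imbalance_pct applied exceeds the group capacity, the group is overloaded and true is returned. Otherwise false is returned.

The following figure shows a group that became the group_overloaded type because the utilization of the cpu runqueues in the group * the threshold exceeds the group capacity.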
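
The two conditions can be exercised with a small user-space sketch. struct sg_load_sketch and group_is_overloaded_sketch() are hypothetical names that keep only the fields the function reads, and the imbalance_pct of 125 used in the note below is an assumed example value.

```c
#include <stdbool.h>

/* Hypothetical subset of struct sg_lb_stats. */
struct sg_load_sketch {
        unsigned int sum_nr_running;    /* cfs tasks running in the group */
        unsigned int group_weight;      /* number of cpus in the group */
        unsigned long group_capacity;
        unsigned long group_util;
};

static bool group_is_overloaded_sketch(unsigned int imbalance_pct,
                                       const struct sg_load_sketch *sgs)
{
        /* No more tasks than cpus: never overloaded. */
        if (sgs->sum_nr_running <= sgs->group_weight)
                return false;

        /* Overloaded when util * imbalance_pct exceeds capacity * 100. */
        return sgs->group_capacity * 100 <
               sgs->group_util * imbalance_pct;
}
```

For a 4-cpu group with capacity 4096 running 6 tasks, a utilization of 3500 is overloaded at imbalance_pct 125 (3500 * 125 = 437500 > 409600), while 3000 is not. With only 4 tasks the first check already returns false, whatever the utilization is.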

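Group and fbq types

Group types

Group types are divided into the following 4 kinds, and a larger number means a higher priority. For load balancing, groups are compared with each other: the group types are compared first, and only when the group types are equal are the group load values compared.

group_other(0)
- The group is in a normal state.
group_misfit_task(1)
- The group has a misfit task with high utilization. (It should run on the big cluster for maximum performance.)
group_imbalanced(2)
- Tasks have cpu restrictions that limit migration, so the group has to be balanced while in an imbalanced state.
- This prevents ordinary group-level balancing from unintentionally overloading a particular group because of the cpu restrictions of tasks.
group_overloaded(3)
- The load of the group is exceeded.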

group_classify()

kernel/sched/fair.c

static inline enum
group_type group_classify(struct sched_group *group,
                          struct sg_lb_stats *sgs)
{
        if (sgs->group_no_capacity)
                return group_overloaded;

        if (sg_imbalanced(group))
                return group_imbalanced;

        if (sgs->group_misfit_task_load)
                return group_misfit_task;

        return group_other;
}

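Classifies the state of the group and returns it.

In code lines 5~6, returns the group_overloaded state if the group statistics indicate a lack of capacity.
In code lines 8~9, some tasks do not allow certain cpus, so ordinary balancing between groups would actually cause problems. When balancing has to be done in this separate group imbalance state, returns the group_imbalanced state.
In code lines 11~12, returns the group_misfit_task state if group_misfit_task_load is set.
In code line 14, returns the group_other state in all other cases.

Group imbalance

Suppose 4 tasks run across two groups and p->cpus_ptr allows only the following 4 cpus. Balancing from the group point of view then runs two tasks in each group, and the load is balanced.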

cpu { 0 1 2 3 } { 4 5 6 7 }
cpus_ptr * * * *

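Now change the condition so that p->cpus_ptr allows only the following 4 cpus. Balancing from the group point of view overloads cpu 3, which has to run 2 tasks, while one of cpus 4, 5 and 6 goes idle, which is an unintended result. A way to recognize this group imbalance state and to balance under it was therefore needed.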

cpu      { 0 1 2 3 } { 4 5 6 7 }
cpus_ptr         *     * * *

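Detecting group imbalance

When migration of some tasks fails in a lower domain because of cpu affinity (LBF_SOME_PINNED), the group imbalance state is recorded in the first group of the parent domain. In the parent domain, a group carrying this signal is treated as a candidate for the busiest group, and calculate_imbalance() and find_busiest_group() skip some of the usual balance conditions so that an effective group imbalance can be created.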

sg_imbalanced()

kernel/sched/fair.c

/*
 * Group imbalance indicates (and tries to solve) the problem where balancing
 * groups is inadequate due to ->cpus_ptr constraints.
 *
 * Imagine a situation of two groups of 4 CPUs each and 4 tasks each with a
 * cpumask covering 1 CPU of the first group and 3 CPUs of the second group.
 * Something like:
 *
 *      { 0 1 2 3 } { 4 5 6 7 }
 *              *     * * *
 *
 * If we were to balance group-wise we'd place two tasks in the first group and
 * two tasks in the second group. Clearly this is undesired as it will overload
 * cpu 3 and leave one of the CPUs in the second group unused.
 *
 * The current solution to this issue is detecting the skew in the first group
 * by noticing the lower domain failed to reach balance and had difficulty
 * moving tasks due to affinity constraints.
 *
 * When this is so detected; this group becomes a candidate for busiest; see
 * update_sd_pick_busiest(). And calculate_imbalance() and
 * find_busiest_group() avoid some of the usual balance conditions to allow it
 * to create an effective group imbalance.
 *
 * This is a somewhat tricky proposition since the next run might not find the
 * group imbalance and decide the groups need to be balanced again. A most
 * subtle and fragile situation.
 */

static inline int sg_imbalanced(struct sched_group *group)
{
        return group->sgc->imbalance;
}

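Returns whether the group is in the group imbalance state. (1=the group is imbalanced because tasks are restricted from moving to certain cpus, and balancing has to take this into account. 0=the group is not imbalanced, so ordinary balancing between groups is performed.)

fbq(find busiest queue) types

The fbq type is used for NUMA balancing, and runqueues and groups are classified into 3 types.

regular(0)
- There are tasks that are not NUMA tasks.
remote(1)
- All tasks are NUMA tasks, and some of them run on a wrong node instead of their preferred node.
- Such runqueues can be excluded when the busiest queue is searched (a runqueue whose type is higher than env->fbq_type is skipped).
all(2)
- All tasks are NUMA tasks running on their preferred node.
- No distinction is made between regular and remote.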

fbq_classify_group()

kernel/sched/fair.c

#ifdef CONFIG_NUMA_BALANCING
static inline enum fbq_type fbq_classify_group(struct sg_lb_stats *sgs)
{
        if (sgs->sum_nr_running > sgs->nr_numa_running)
                return regular;
        if (sgs->sum_nr_running > sgs->nr_preferred_running)
                return remote;
        return all;
}
#else
static inline enum fbq_type fbq_classify_group(struct sg_lb_stats *sgs)
{
        return all;
}
#endif

Returns the fbq type of the group, one of regular(0), remote(1) and all(2). On a UMA system (without CONFIG_NUMA_BALANCING) only all(2) is returned.

fbq_classify_rq()

kernel/sched/fair.c

#ifdef CONFIG_NUMA_BALANCING
static inline enum fbq_type fbq_classify_rq(struct rq *rq)
{
        if (rq->nr_running > rq->nr_numa_running)
                return regular;
        if (rq->nr_running > rq->nr_preferred_running)
                return remote;
        return all;
}
#else
static inline enum fbq_type fbq_classify_rq(struct rq *rq)
{
        return regular;
}
#endif

Returns the fbq type of the runqueue, one of regular(0), remote(1) and all(2). On a UMA system (without CONFIG_NUMA_BALANCING) only regular(0) is returned.

Checking whether ASYM packing migration is needed
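
Both classifiers apply the same two comparisons to three counters. The sketch below shows that shared logic in user space; fbq_classify() is a hypothetical helper that takes the counters directly instead of a struct rq or struct sg_lb_stats.

```c
enum fbq_type { regular, remote, all };

/*
 * Hypothetical helper with the same comparisons as the
 * CONFIG_NUMA_BALANCING variants of fbq_classify_rq() and
 * fbq_classify_group().
 */
static enum fbq_type fbq_classify(unsigned int nr_running,
                                  unsigned int nr_numa_running,
                                  unsigned int nr_preferred_running)
{
        /* Some tasks are not NUMA tasks at all. */
        if (nr_running > nr_numa_running)
                return regular;

        /* All are NUMA tasks, but some run off their preferred node. */
        if (nr_running > nr_preferred_running)
                return remote;

        /* Every task is a NUMA task on its preferred node. */
        return all;
}
```

With 4 running tasks, the counters (3 NUMA, 3 preferred) classify as regular, (4 NUMA, 2 preferred) as remote, and (4 NUMA, 4 preferred) as all.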

check_asym_packing()

kernel/sched/fair.c

/**
 * check_asym_packing - Check to see if the group is packed into the
 *                      sched domain.
 *
 * This is primarily intended to used at the sibling level.  Some
 * cores like POWER7 prefer to use lower numbered SMT threads.  In the
 * case of POWER7, it can move to lower SMT modes only when higher
 * threads are idle.  When in lower SMT modes, the threads will
 * perform better since they share less core resources.  Hence when we
 * have idle threads, we want them to be the higher ones.
 *
 * This packing function is run on idle threads.  It checks to see if
 * the busiest CPU in this domain (core in the P7 case) has a higher
 * CPU number than the packing function is being run on.  Here we are
 * assuming lower CPU number will be equivalent to lower a SMT thread
 * number.
 *
 * Return: 1 when packing is required and a task should be moved to
 * this CPU.  The amount of the imbalance is returned in env->imbalance.
 *
 * @env: The load balancing environment.
 * @sds: Statistics of the sched_domain which is to be packed
 */

static int check_asym_packing(struct lb_env *env, struct sd_lb_stats *sds)
{
        int busiest_cpu;

        if (!(env->sd->flags & SD_ASYM_PACKING))
                return 0;

        if (env->idle == CPU_NOT_IDLE)
                return 0;

        if (!sds->busiest)
                return 0;

        busiest_cpu = sds->busiest->asym_prefer_cpu;
        if (sched_asym_prefer(busiest_cpu, env->dst_cpu))
                return 0;

        env->imbalance = sds->busiest_stat.group_load;

        return 1;
}

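Returns whether migration to a faster core is needed in an asym packing domain. On a POWER7 chip using SMT, hw thread 0 is faster than hw thread 1, so at the moment hw thread 0 becomes idle it is more efficient to move the task running on hw thread 1 to hw thread 0. Recently x86 also uses a technology that boosts specific cores.

In code lines 5~6, returns 0 for a scheduling domain that does not use the SD_ASYM_PACKING flag.
- The SD_ASYM_PACKING flag is used only by POWER7 (powerpc) and by some x86 architectures that support ITMT (Intel Turbo Boost Max Technology 3.0).
- On the POWER7 architecture it is recommended to use the lower-numbered SMT threads. Using the lower-numbered threads shares fewer core resources and gives higher performance.
In code lines 8~9, returns 0 if entered from the busy state. Asym packing migration is done only from the idle state.
In code lines 11~12, returns 0 if there is no busiest scheduling group.
In code lines 14~16, returns 0 if the preferred cpu of the busiest group has a higher priority than the dst cpu.
In code lines 18~20, assigns the group load to env->imbalance and returns 1 for asym balancing.

As the following figure shows, in a domain that uses asym packing, tasks are moved to a faster idle cpu to get maximum performance.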

sched_asym_prefer()

kernel/sched/sched.h

static inline bool sched_asym_prefer(int a, int b)
{
        return arch_asym_cpu_priority(a) > arch_asym_cpu_priority(b);
}

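Finds out whether cpu @a has a higher priority than cpu @b for asym packing.

In an SMT domain that supports ASYM PACKING there is a priority order between cpus.
- On POWERPC, the lower the cpu number, the higher the priority.
- On x86 architectures that support ITMT, cores can be boosted, so the priority changes at run time.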

arch_asym_cpu_priority() – Generic

kernel/sched/fair.c

int __weak arch_asym_cpu_priority(int cpu)
{
        return -cpu;
}

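Returns the priority of the given cpu. The lower the cpu number, the higher the priority.

Every architecture except x86 with ITMT support uses this generic code.

The following figure shows that on the powerpc architecture, which uses hw threads, the lower-numbered hw thread is always faster in the SMT domain.

The following figure shows that on an x86 architecture with ITMT, the boosted cpu is faster in the MC and SMT domains.

Calculating the imbalance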
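
The generic priority and the comparison built on it fit in a few lines. The following user-space sketch copies the two function bodies shown above under hypothetical _sketch names, and adds packing_needed(), a hypothetical helper that reproduces only the last check of check_asym_packing().

```c
#include <stdbool.h>

/* Generic priority: a lower cpu number means a higher priority. */
static int arch_asym_cpu_priority_sketch(int cpu)
{
        return -cpu;
}

/* True if cpu a is preferred over cpu b. */
static bool sched_asym_prefer_sketch(int a, int b)
{
        return arch_asym_cpu_priority_sketch(a) >
               arch_asym_cpu_priority_sketch(b);
}

/*
 * Hypothetical helper: packing towards dst_cpu is skipped only when
 * the busiest group's preferred cpu outranks dst_cpu.
 */
static bool packing_needed(int busiest_cpu, int dst_cpu)
{
        return !sched_asym_prefer_sketch(busiest_cpu, dst_cpu);
}
```

With the generic priority, an idle cpu 0 pulls work from a group whose preferred cpu is 1, while an idle cpu 1 does not pull from a group whose preferred cpu is 0.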

calculate_imbalance()

kernel/sched/fair.c

/**
 * calculate_imbalance - Calculate the amount of imbalance present within the
 *                       groups of a given sched_domain during load balance.
 * @env: load balance environment
 * @sds: statistics of the sched_domain whose imbalance is to be calculated.
 */

static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
{
        unsigned long max_pull, load_above_capacity = ~0UL;
        struct sg_lb_stats *local, *busiest;

        local = &sds->local_stat;
        busiest = &sds->busiest_stat;

        if (busiest->group_type == group_imbalanced) {
                /*
                 * In the group_imb case we cannot rely on group-wide averages
                 * to ensure CPU-load equilibrium, look at wider averages. XXX
                 */
                busiest->load_per_task =
                        min(busiest->load_per_task, sds->avg_load);
        }

        /*
         * Avg load of busiest sg can be less and avg load of local sg can
         * be greater than avg load across all sgs of sd because avg load
         * factors in sg capacity and sgs with smaller group_type are
         * skipped when updating the busiest sg:
         */
        if (busiest->group_type != group_misfit_task &&
            (busiest->avg_load <= sds->avg_load ||
             local->avg_load >= sds->avg_load)) {
                env->imbalance = 0;
                return fix_small_imbalance(env, sds);
        }

        /*
         * If there aren't any idle CPUs, avoid creating some.
         */
        if (busiest->group_type == group_overloaded &&
            local->group_type   == group_overloaded) {
                load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE;
                if (load_above_capacity > busiest->group_capacity) {
                        load_above_capacity -= busiest->group_capacity;
                        load_above_capacity *= scale_load_down(NICE_0_LOAD);
                        load_above_capacity /= busiest->group_capacity;
                } else
                        load_above_capacity = ~0UL;
        }

        /*
         * We're trying to get all the CPUs to the average_load, so we don't
         * want to push ourselves above the average load, nor do we wish to
         * reduce the max loaded CPU below the average load. At the same time,
         * we also don't want to reduce the group load below the group
         * capacity. Thus we look for the minimum possible imbalance.
         */
        max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);

        /* How much load to actually move to equalise the imbalance */
        env->imbalance = min(
                max_pull * busiest->group_capacity,
                (sds->avg_load - local->avg_load) * local->group_capacity
        ) / SCHED_CAPACITY_SCALE;

        /* Boost imbalance to allow misfit task to be balanced. */
        if (busiest->group_type == group_misfit_task) {
                env->imbalance = max_t(long, env->imbalance,
                                       busiest->group_misfit_task_load);
        }

        /*
         * if *imbalance is less than the average load per runnable task
         * there is no guarantee that any tasks will be moved so we'll have
         * a think about bumping its value to force at least one task to be
         * moved
         */
        if (env->imbalance < busiest->load_per_task)
                return fix_small_imbalance(env, sds);
}

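This function is entered because the busiest group is imbalanced, and it calculates the exact imbalance value.

In code lines 9~16, if the busiest group is of the group_imbalanced(2) type, the cpu restrictions of tasks mean that balancing must not be judged by the ordinary per-group average loads alone. The load per task of the busiest group, load_per_task, is capped at the domain average load sds->avg_load.
In code lines 24~29, if the busiest group is not of the group_misfit_task type and one of the following two conditions holds, imbalance is first cleared to 0 and then a minor imbalance value is recalculated through fix_small_imbalance().
- The average load of the busiest group is less than or equal to the average load of the domain.
- The average load of the local group is greater than or equal to the average load of the domain.
In code lines 34~43, if both the busiest and the local group are of the group_overloaded(3) type, that is, there are no idle cpus, load_above_capacity is prepared as follows.
- The amount by which the number of running tasks in the busiest group * 1024 exceeds the capacity of the busiest group is scaled by (nice-0 weight(1024) / group capacity). If there is no excess, ~0UL is assigned.
In code line 52, max_pull is computed as the smaller of two values: the amount by which the average load of the busiest group exceeds the average load of the domain, and load_above_capacity.
In code lines 55~58, the imbalance value, which tells how large the imbalance is, is the smaller of the following two values.
- max_pull * capacity of the busiest group / 1024
- (average load of the domain - average load of the local group) * capacity of the local group / 1024
In code lines 61~63, if the group type is group_misfit_task and busiest->group_misfit_task_load is larger than the computed imbalance, imbalance is raised to that value.
In code lines 72~73, if the load per task of the busiest group is larger than the final imbalance value, a minor imbalance value is recalculated through fix_small_imbalance().

Calculating a small imbalance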
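
The central formula of code lines 52~58 can be checked with concrete numbers. The following is a user-space sketch under stated assumptions: imbalance_sketch() and min_ul() are hypothetical helpers, the group_imbalanced, overloaded and misfit adjustments are left out, and the caller passes load_above_capacity directly (~0UL when it does not apply).

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024UL

static unsigned long min_ul(unsigned long a, unsigned long b)
{
        return a < b ? a : b;
}

/* Hypothetical helper covering only the max_pull/imbalance formula. */
static unsigned long imbalance_sketch(unsigned long busiest_avg,
                                      unsigned long local_avg,
                                      unsigned long domain_avg,
                                      unsigned long busiest_capacity,
                                      unsigned long local_capacity,
                                      unsigned long load_above_capacity)
{
        unsigned long max_pull;

        /*
         * Assumes the non-misfit path, where the function has already
         * returned unless busiest > domain average > local.
         */
        assert(busiest_avg > domain_avg && domain_avg > local_avg);

        /* Do not pull the busiest group below the domain average. */
        max_pull = min_ul(busiest_avg - domain_avg, load_above_capacity);

        /* Do not push the local group above the domain average. */
        return min_ul(max_pull * busiest_capacity,
                      (domain_avg - local_avg) * local_capacity) /
               SCHED_CAPACITY_SCALE;
}
```

With average loads of 900 (busiest), 300 (local) and 600 (domain) and both capacities at 2048, the imbalance is 600. If the busiest group's capacity is only 1024, the first term becomes the smaller one and the imbalance drops to 300.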

fix_small_imbalance()

kernel/sched/fair.c

/**
 * fix_small_imbalance - Calculate the minor imbalance that exists
 *                      amongst the groups of a sched_domain, during
 *                      load balancing.
 * @env: The load balancing environment.
 * @sds: Statistics of the sched_domain whose imbalance is to be calculated.
 */

static inline
void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
{
        unsigned long tmp, capa_now = 0, capa_move = 0;
        unsigned int imbn = 2;
        unsigned long scaled_busy_load_per_task;
        struct sg_lb_stats *local, *busiest;

        local = &sds->local_stat;
        busiest = &sds->busiest_stat;

        if (!local->sum_nr_running)
                local->load_per_task = cpu_avg_load_per_task(env->dst_cpu);
        else if (busiest->load_per_task > local->load_per_task)
                imbn = 1;

        scaled_busy_load_per_task =
                (busiest->load_per_task * SCHED_CAPACITY_SCALE) /
                busiest->group_capacity;

        if (busiest->avg_load + scaled_busy_load_per_task >=
            local->avg_load + (scaled_busy_load_per_task * imbn)) {
                env->imbalance = busiest->load_per_task;
                return;
        }

        /*
         * OK, we don't have enough imbalance to justify moving tasks,
         * however we may be able to increase total CPU capacity used by
         * moving them.
         */

        capa_now += busiest->group_capacity *
                        min(busiest->load_per_task, busiest->avg_load);
        capa_now += local->group_capacity *
                        min(local->load_per_task, local->avg_load);
        capa_now /= SCHED_CAPACITY_SCALE;

        /* Amount of load we'd subtract */
        if (busiest->avg_load > scaled_busy_load_per_task) {
                capa_move += busiest->group_capacity *
                            min(busiest->load_per_task,
                                busiest->avg_load - scaled_busy_load_per_task);
        }

        /* Amount of load we'd add */
        if (busiest->avg_load * busiest->group_capacity <
            busiest->load_per_task * SCHED_CAPACITY_SCALE) {
                tmp = (busiest->avg_load * busiest->group_capacity) /
                      local->group_capacity;
        } else {
                tmp = (busiest->load_per_task * SCHED_CAPACITY_SCALE) /
                      local->group_capacity;
        }
        capa_move += local->group_capacity *
                    min(local->load_per_task, local->avg_load + tmp);
        capa_move /= SCHED_CAPACITY_SCALE;

        /* Move if we gain throughput */
        if (capa_move > capa_now)
                env->imbalance = busiest->load_per_task;
}

local 그룹과 busiest 그룹의 로드 비교가 쉽지 않은 상화에서 이 함수가 호출되었다. busiest 그룹의 태스크 하나에 해당하는 로드를 local 그룹으로 마이그레이션한 상황을 가정하여 마이그레이션 후 성능이 더 올라가는 경우 imbalnce 값으로 busiest 그룹의 태스크 하나에 해당하는 로드 값을 지정한다.

Case 1) 평균 로드를 비교 (busiest > 로컬 + 1개의 scaled busiest 태스크 로드 추가(1개의 local 태스크 로드보다 작거나 같은 경우에만))

코드 라인 12~13에서 local 그룹의 sum_nr_running 값이 주어지지 않은 경우 로컬 그룹의 태스크당 로드 값으로 dst cpu의 태스크 당 로드 평균 값을 산출해와서 사용한다.
- dst cpu 로드 * (1/n 개 태스크)를 산출해온다. 태스크가 없으면 0을 반환한다.
코드 라인 14~15에서 태스크당 로드가 busiest 그룹 > local 그룹인 경우 로컬 그룹에 busiest 그룹의 태스크 하나를 migration하는 조건에 사용할 배율(imbn)이 반영되지 못하게 2 배에서 1 배로 떨어뜨린다.
코드 라인 17~19에서 busiest 그룹의 태스크 하나에 해당하는 로드를 스케일 적용한 값을 scaled_busy_load_per_task에 담는다.
- busiest 그룹의 태스크 당 로드 값 * 1024 / group capacity 비율
코드 라인 21~25에서 busiest 그룹 평균 로드 > local 그룹 평균 로드 + scaled_busy_load_per_task(imbn이 2인 경우에만)인 경우 busiest의 태스크당 로드 값을 imbalance에 대입하고 함수를 빠져나간다.

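Case 2) Better throughput when the load of one busiest task is assumed to have moved to local (that is, the combined load of both groups is larger after the move than before)

In code lines 33~37, the imbalance is not large enough to move a task. However, moving a task may still pay off if it increases the total cpu capacity in use. The code below computes capa_now and capa_move and compares them. capa_now is the sum of the following two values.
- capacity of the busiest group * min(per-task load, average load) / 1024
- capacity of the local group * min(per-task load, average load) / 1024
In code lines 40~44, if the average load of the busiest group is greater than its scaled per-task load, the following is added to capa_move.
- capacity of the busiest group * min(per-task load, average load - scaled per-task load) / 1024
In code lines 47~54, tmp is computed depending on whether busiest average load * capacity < per-task load * 1024.
- true: tmp = busiest average load * (busiest group capacity / local group capacity)
- false: tmp = busiest per-task load * (1024 / local group capacity)
In code lines 55~56, the following value is added to capa_move.
- local group capacity * min(local per-task load, local average load + tmp) / 1024
In code lines 60~61, if moving the task is more beneficial, the per-task load of the busiest group is assigned to imbalance.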

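The following figure shows case 1), where the average load of busiest is greater than that of local.

The per-task loads of the two groups are compared, and when busiest's is equal or smaller, as in the figure on the right, the load of one task is added to the local group's average load before comparing.
In the lower-right example the imbalance cannot be decided, so evaluation continues with case 2).

The following figure shows case 2), where moving the scaled load of one busiest task to the local group does not increase the combined load of the two groups, so no imbalance is decided.

The smaller of the one-task load and the average load is always used.

The following figure shows case 2), where moving the scaled load of one busiest task to the local group increases the combined load of the two groups, so imbalance is set to the load of one task of the busiest group.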

Detaching & Attaching Tasks

Detaching Tasks

detach_tasks()

kernel/sched/fair.c -1/2-

/*
 * detach_tasks() -- tries to detach up to imbalance weighted load from
 * busiest_rq, as part of a balancing operation within domain "sd".
 *
 * Returns number of detached tasks if successful and 0 otherwise.
 */

static int detach_tasks(struct lb_env *env)
{
        struct list_head *tasks = &env->src_rq->cfs_tasks;
        struct task_struct *p;
        unsigned long load;
        int detached = 0;

        lockdep_assert_held(&env->src_rq->lock);

        if (env->imbalance <= 0)
                return 0;

        while (!list_empty(tasks)) {
                /*
                 * We don't want to steal all, otherwise we may be treated likewise,
                 * which could at worst lead to a livelock crash.
                 */
                if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1)
                        break;

                p = list_first_entry(tasks, struct task_struct, se.group_node);

                env->loop++;
                /* We've more or less seen every task there is, call it quits */
                if (env->loop > env->loop_max)
                        break;

                /* take a breather every nr_migrate tasks */
                if (env->loop > env->loop_break) {
                        env->loop_break += sched_nr_migrate_break;
                        env->flags |= LBF_NEED_BREAK;
                        break;
                }

                if (!can_migrate_task(p, env))
                        goto next;

                load = task_h_load(p);

                if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
                        goto next;

                if ((load / 2) > env->imbalance)
                        goto next;

                detach_task(p, env);
                list_add(&p->se.group_node, &env->tasks);

                detached++;
                env->imbalance -= load;

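Detaches tasks from the cfs_tasks list of the source runqueue and puts them on the env->tasks list. The return value is the number of detached tasks.

In code lines 10~11, if env->imbalance is 0 or less there is nothing to balance, so 0 is returned.
In code lines 13~21, the loop runs until the cfs_tasks list of the src runqueue has no task left to process, taking one task at a time from the front. However, when balancing was entered from an idle state and the source runqueue has one task or fewer, the loop exits.
In code lines 23~26, the loop counter is incremented and the loop exits if it exceeds loop_max.
- loop_max is the number of tasks running on the busiest cpu, capped at sysctl_sched_nr_migrate (default: 32).
In code lines 29~33, if the loop counter exceeds loop_break, the LBF_NEED_BREAK flag is set and the loop exits.
- In this case control returns to load_balance(), the caller, which unlocks, takes the lock again and retries. If loop_max is very large, processing everything in one go with the lock held takes too long, so the work is split into units of loop_break.
In code lines 35~36, if the task cannot be migrated, jump to the next label, move the task to the tail of the list and continue looping.
In code line 38, the task's load average contribution is obtained.
In code lines 40~41, if the LB_MIN feature is enabled, the load is less than 16 and balancing has never failed on the domain, jump to the next label, move the task to the tail of the list and continue looping. This is used to avoid migrating loads that are too small.
- The LB_MIN feature is disabled by default.
In code lines 43~44, if half of the load is greater than imbalance, jump to the next label, move the task to the tail of the list and continue looping. In other words, a task whose load is more than twice the remaining imbalance is not migrated.
In code lines 46~50, the task is detached and added to the env->tasks list. The load is subtracted from env->imbalance.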

kernel/sched/fair.c -2/2-

#ifdef CONFIG_PREEMPT
                /*
                 * NEWIDLE balancing is a source of latency, so preemptible
                 * kernels will stop after the first task is detached to minimize
                 * the critical section.
                 */
                if (env->idle == CPU_NEWLY_IDLE)
                        break;
#endif

                /*
                 * We only want to steal up to the prescribed amount of
                 * weighted load.
                 */
                if (env->imbalance <= 0)
                        break;

                continue;
next:
                list_move_tail(&p->se.group_node, tasks);
        }

        /*
         * Right now, this is one of only two places we collect this stat
         * so we can safely collect detach_one_task() stats here rather
         * than inside detach_one_task().
         */
        schedstat_add(env->sd, lb_gained[env->idle], detached);

        return detached;
}

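In code lines 1~9, if the preempt kernel option is used and NEWIDLE balancing is in progress, only one task is processed and the loop exits.
In code lines 15~16, if imbalance is 0 or less the loop exits. That is, processing stops when there is no imbalance left from which load can be subtracted.
In code line 18, the iteration continues.
In code lines 19~21, on reaching the next label the task is moved to the tail of the cfs_tasks list of the source runqueue.
In code lines 28~30, the lb_gained[] statistic is updated and the number of detached tasks is returned.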
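
The interplay of the "half of the load" test and the shrinking imbalance can be sketched in user space. This is only an illustration under simplifying assumptions: count_detached() and the loads[] array are hypothetical stand-ins for the runqueue list and task_h_load(), and loop_max, loop_break, can_migrate_task(), LB_MIN and the NEWIDLE early exit are all left out.

```c
#include <stddef.h>

/*
 * Hypothetical model of the selection loop: loads[] holds the load of
 * each candidate task in list order, imbalance is the load to be moved.
 * Returns how many tasks would be detached.
 */
static int count_detached(const unsigned long *loads, size_t n, long imbalance)
{
        int detached = 0;
        size_t i;

        if (imbalance <= 0)
                return 0;

        for (i = 0; i < n; i++) {
                /* Skip a task whose half load exceeds the remaining imbalance. */
                if ((long)(loads[i] / 2) > imbalance)
                        continue;

                detached++;
                imbalance -= (long)loads[i];

                /* Stop once the prescribed amount of load has been taken. */
                if (imbalance <= 0)
                        break;
        }

        return detached;
}
```

For loads of 300, 100 and 50 with an imbalance of 120, only the task with load 100 is taken: the first task is too heavy, and after the subtraction the remaining imbalance of 20 is too small for the last one.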

detach_task()

kernel/sched/fair.c

/*
 * detach_task() -- detach the task for the migration specified in env
 */

static void detach_task(struct task_struct *p, struct lb_env *env)
{
        lockdep_assert_held(&env->src_rq->lock);

        deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK);
        set_task_cpu(p, env->dst_cpu);
}

Detaches the task from the source runqueue and sets dst_cpu as the cpu of the detached task.

task_h_load()

kernel/sched/fair.c

#ifdef CONFIG_FAIR_GROUP_SCHED
static unsigned long task_h_load(struct task_struct *p)
{
        struct cfs_rq *cfs_rq = task_cfs_rq(p);

        update_cfs_rq_h_load(cfs_rq);
        return div64_ul(p->se.avg.load_avg_contrib * cfs_rq->h_load,
                        cfs_rq_load_avg(cfs_rq) + 1);
}
#else
static unsigned long task_h_load(struct task_struct *p)
{
        return p->se.avg.load_avg_contrib;
}
#endif

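Returns the task's load average contribution. When group scheduling is used, the task's load average contribution is multiplied by the h_load of its cfs runqueue and divided by the runnable load average of that runqueue (plus 1).

In code lines 4~6, h_load is updated from the cfs runqueue of the requested task up to the top-level cfs runqueue.
In code lines 7~8, the task's load average contribution is multiplied by the h_load of the cfs runqueue, divided by the runnable load average (plus 1), and returned.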
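
The arithmetic of the group scheduling branch can be tried out with a minimal sketch. The function name task_h_load_model() and its plain integer parameters are hypothetical; they only mirror the three values used in the division above.

```c
#include <stdint.h>

/*
 * Hypothetical model of the hierarchical task load:
 * task_load * h_load / (cfs_rq_load + 1), where the +1 avoids a
 * division by zero when the cfs runqueue load is 0.
 */
static uint64_t task_h_load_model(uint64_t task_load, uint64_t h_load,
                                  uint64_t cfs_rq_load)
{
        return task_load * h_load / (cfs_rq_load + 1);
}
```

For example, a task contributing 512 on a cfs runqueue whose load is 1023 and whose h_load is 300 yields 512 * 300 / 1024 = 150.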

The following figure shows how the load average contribution of the requested task is computed.

update_cfs_rq_h_load()

kernel/sched/fair.c

/*
 * Compute the hierarchical load factor for cfs_rq and all its ascendants.
 * This needs to be done in a top-down fashion because the load of a child
 * group is a fraction of its parents load.
 */

static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
{
        struct rq *rq = rq_of(cfs_rq);
        struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)];
        unsigned long now = jiffies;
        unsigned long load;

        if (cfs_rq->last_h_load_update == now)
                return;

        WRITE_ONCE(cfs_rq->h_load_next, NULL);
        for_each_sched_entity(se) {
                cfs_rq = cfs_rq_of(se);
                WRITE_ONCE(cfs_rq->h_load_next, se);
                if (cfs_rq->last_h_load_update == now)
                        break;
        }

        if (!se) {
                cfs_rq->h_load = cfs_rq_load_avg(cfs_rq);
                cfs_rq->last_h_load_update = now;
        }

        while ((se = cfs_rq->h_load_next) != NULL) {
                load = cfs_rq->h_load;
                load = div64_ul(load * se->avg.load_avg,
                                cfs_rq_load_avg(cfs_rq) + 1);
                cfs_rq = group_cfs_rq(se);
                cfs_rq->h_load = load;
                cfs_rq->last_h_load_update = now;
        }
}

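Updates h_load from the requested cfs runqueue up to the top level.

In code lines 8~9, if h_load was already updated at the current time (jiffies), the function returns.
In code line 11, NULL is assigned to h_load_next of the requested cfs runqueue. h_load_next is a variable used temporarily to link schedule entities top down.
In code lines 12~17, iterate from the entity representing the requested cfs runqueue up to the top-level entity, assigning each schedule entity to h_load_next of the cfs runqueue it belongs to. If a cfs runqueue whose h_load is already up to date is found, the iteration stops.
In code lines 19~22, if the iteration ran to the end without stopping, or the request was for the top-level cfs runqueue, the runnable load average is assigned to h_load of that cfs runqueue and the current time is recorded as the h_load update time.
In code lines 24~31, the entities are walked top down. The parent's h_load is multiplied by the entity's load average and divided by the runnable load of the parent cfs runqueue (plus 1) to obtain the child's h_load. The current time is then recorded as the update time.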
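
The top-down pass can be sketched as a loop over an array that lists the hierarchy from the root downwards. This is a user-space model under stated assumptions: struct level and h_load_model() are hypothetical, and the jiffies-based caching and the h_load_next linking are omitted.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one level of the hierarchy, root first. */
struct level {
        uint64_t cfs_rq_load;   /* load average of the cfs runqueue at this level */
        uint64_t se_load;       /* load average of the group entity leading to the child */
};

/* Returns the h_load of the cfs runqueue below the last level. */
static uint64_t h_load_model(const struct level *lv, size_t n)
{
        uint64_t h_load;
        size_t i;

        if (n == 0)
                return 0;

        /* The root cfs runqueue uses its own load average as h_load. */
        h_load = lv[0].cfs_rq_load;

        /* Each child gets the parent's h_load scaled by its entity's share. */
        for (i = 0; i < n; i++)
                h_load = h_load * lv[i].se_load / (lv[i].cfs_rq_load + 1);

        return h_load;
}
```

With a root load of 2047 and a group entity of 1024, the child h_load is 2047 * 1024 / 2048 = 1023; one level further down, with a runqueue load of 1023 and an entity of 512, it becomes 1023 * 512 / 1024 = 511.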

The following figure shows how the h_load value is computed.

Attaching Tasks

attach_tasks()

kernel/sched/fair.c

/*
 * attach_tasks() -- attaches all tasks detached by detach_tasks() to their
 * new rq.
 */

static void attach_tasks(struct lb_env *env)
{
        struct list_head *tasks = &env->tasks;
        struct task_struct *p;
        struct rq_flags rf;

        rq_lock(env->dst_rq, &rf);
        update_rq_clock(env->dst_rq);

        while (!list_empty(tasks)) {
                p = list_first_entry(tasks, struct task_struct, se.group_node);
                list_del_init(&p->se.group_node);

                attach_task(env->dst_rq, p);
        }

        rq_unlock(env->dst_rq, &rf);
}

Attaches all detached tasks to the dst runqueue.

attach_task()

kernel/sched/fair.c

/*
 * attach_task() -- attach the task detached by detach_task() to its new rq.
 */
static void attach_task(struct rq *rq, struct task_struct *p)
{
        lockdep_assert_held(&rq->lock);

        BUG_ON(task_rq(p) != rq);
        p->on_rq = TASK_ON_RQ_QUEUED;
        activate_task(rq, p, 0);
        check_preempt_curr(rq, p, 0);
}

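Attaches the task to the requested runqueue. Then, if the preemption condition is met, the reschedule request flag is set.

Active Balance

In load_balance(), pull migration could not take some tasks. A task that is currently running on the busiest runqueue cannot be migrated that way; active balancing makes load balancing of such a running task possible.

The following figure shows how active balancing is invoked when no task at all was migrated.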

need_active_balance()

kernel/sched/fair.c

static int need_active_balance(struct lb_env *env)
{
        struct sched_domain *sd = env->sd;

        if (voluntary_active_balance(env))
                return 1;

        return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
}

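Returns whether active load balancing is needed. (1 when nr_balance_failed > cache_nice_tries+2)

In code lines 5~6, returns 1 if one of the special cases that require active balancing applies.
In code line 8, returns 1 in the unlikely case that the number of failed load balance attempts is greater than cache_nice_tries+2.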
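
The failure threshold is easy to misread by one, so here is a minimal sketch of the decision. The function name need_active_balance_model() and the boolean voluntary parameter (standing in for the result of voluntary_active_balance()) are hypothetical.

```c
#include <stdbool.h>

/*
 * Hypothetical model: active balancing is needed when a special case
 * applies, or when pull balancing has failed more than
 * cache_nice_tries + 2 times.
 */
static bool need_active_balance_model(bool voluntary,
                                      unsigned int nr_balance_failed,
                                      unsigned int cache_nice_tries)
{
        if (voluntary)
                return true;

        return nr_balance_failed > cache_nice_tries + 2;
}
```

With cache_nice_tries set to 1, three failures are not yet enough; the fourth failure triggers active balancing.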

voluntary_active_balance()

kernel/sched/fair.c

static inline bool
voluntary_active_balance(struct lb_env *env)
{
        struct sched_domain *sd = env->sd;

        if (asym_active_balance(env))
                return 1;

        /*
         * The dst_cpu is idle and the src_cpu CPU has only 1 CFS task.
         * It's worth migrating the task if the src_cpu's capacity is reduced
         * because of other sched_class or IRQs if more capacity stays
         * available on dst_cpu.
         */
        if ((env->idle != CPU_NOT_IDLE) &&
            (env->src_rq->cfs.h_nr_running == 1)) {
                if ((check_cpu_capacity(env->src_rq, sd)) &&
                    (capacity_of(env->src_cpu)*sd->imbalance_pct < capacity_of(env->dst_cpu)*100))
                        return 1;
        }

        if (env->src_grp_type == group_misfit_task)
                return 1;

        return 0;
}

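Returns 1 if one of the special cases that require active balancing applies.

In code lines 6~7, returns 1 if, in an asym packing domain, dst_cpu has higher priority than the src cpu.
In code lines 15~20, when balancing was entered from the idle or new idle state and only one cfs task is running on the source runqueue, returns 1 under the following condition.
- The cfs capacity of the source runqueue's cpu has been reduced by rt, dl or irq utilization, and the capacity of the dst cpu is still higher even after the imbalance_pct ratio is applied to the source cpu.
In code lines 22~25, returns 1 if the source group is group_misfit_task, and 0 otherwise.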
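
The capacity comparison in the second case can be checked with a few numbers. The helper below is a hypothetical extract of that single comparison; the name dst_has_more_capacity() and its parameters are not kernel identifiers.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the comparison
 * src capacity * imbalance_pct < dst capacity * 100.
 * imbalance_pct is a percentage, e.g. 117 means the dst cpu must
 * offer more than 17% extra capacity.
 */
static bool dst_has_more_capacity(unsigned long src_cap,
                                  unsigned long dst_cap,
                                  unsigned int imbalance_pct)
{
        return src_cap * imbalance_pct < dst_cap * 100;
}
```

Assuming an imbalance_pct of 117, a source cpu reduced to a capacity of 600 loses against a dst cpu of 1024 (70200 < 102400), while a source cpu at 900 does not (105300 >= 102400).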

asym_active_balance()

kernel/sched/fair.c

static inline bool
asym_active_balance(struct lb_env *env)
{
        /*
         * ASYM_PACKING needs to force migrate tasks from busy but
         * lower priority CPUs in order to pack all tasks in the
         * highest priority CPUs.
         */
        return env->idle != CPU_NOT_IDLE && (env->sd->flags & SD_ASYM_PACKING) &&
               sched_asym_prefer(env->dst_cpu, env->src_cpu);
}

Returns whether balancing was entered from a cpu that is not busy and, in an asym packing domain, dst_cpu has higher priority than the src cpu.

stop_one_cpu_nowait()

kernel/stop_machine.c

/**
 * stop_one_cpu_nowait - stop a cpu but don't wait for completion
 * @cpu: cpu to stop
 * @fn: function to execute
 * @arg: argument to @fn
 * @work_buf: pointer to cpu_stop_work structure
 *
 * Similar to stop_one_cpu() but doesn't wait for completion.  The
 * caller is responsible for ensuring @work_buf is currently unused
 * and will remain untouched until stopper starts executing @fn.
 *
 * CONTEXT:
 * Don't care.
 *
 * RETURNS:
 * true if cpu_stop_work was queued successfully and @fn will be called,
 * false otherwise.
 */

bool stop_one_cpu_nowait(unsigned int cpu, cpu_stop_fn_t fn, void *arg,
                        struct cpu_stop_work *work_buf)
{
        *work_buf = (struct cpu_stop_work){ .fn = fn, .arg = arg, };
        return cpu_stop_queue_work(cpu, work_buf);
}

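Stops @cpu and makes it run the @fn function passed as an argument. The request is made without waiting for completion.

In the current kernel source, it is used with active_load_balance_cpu_stop() as the @fn argument, and it is also called from watchdog_timer_fn().
Through this function, a task of the busiest cpu is moved to push_cpu.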

active_load_balance_cpu_stop()

kernel/sched/fair.c -1/2-

/*
 * active_load_balance_cpu_stop is run by the CPU stopper. It pushes
 * running tasks off the busiest CPU onto idle CPUs. It requires at
 * least 1 task to be running on each physical CPU where possible, and
 * avoids physical / logical imbalances.
 */

static int active_load_balance_cpu_stop(void *data)
{
        struct rq *busiest_rq = data;
        int busiest_cpu = cpu_of(busiest_rq);
        int target_cpu = busiest_rq->push_cpu;
        struct rq *target_rq = cpu_rq(target_cpu);
        struct sched_domain *sd;
        struct task_struct *p = NULL;
        struct rq_flags rf;

        rq_lock_irq(busiest_rq, &rf);
        /*
         * Between queueing the stop-work and running it is a hole in which
         * CPUs can become inactive. We should not move tasks from or to
         * inactive CPUs.
         */
        if (!cpu_active(busiest_cpu) || !cpu_active(target_cpu))
                goto out_unlock;

        /* Make sure the requested CPU hasn't gone down in the meantime: */
        if (unlikely(busiest_cpu != smp_processor_id() ||
                     !busiest_rq->active_balance))
                goto out_unlock;

        /* Is there any task to move? */
        if (busiest_rq->nr_running <= 1)
                goto out_unlock;

        /*
         * This condition is "impossible", if it occurs
         * we need to fix it. Originally reported by
         * Bjorn Helgaas on a 128-CPU setup.
         */
        BUG_ON(busiest_rq == target_rq);

        /* Search for an sd spanning us and the target CPU. */
        rcu_read_lock();
        for_each_domain(target_cpu, sd) {
                if ((sd->flags & SD_LOAD_BALANCE) &&
                    cpumask_test_cpu(busiest_cpu, sched_domain_span(sd)))
                                break;
        }

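Moves one task of the busiest cpu to an idle cpu.

In code lines 17~18, if the busiest cpu or the target cpu is not active, jump to the out_unlock label.
In code lines 21~23, in the unlikely case that the busiest cpu is not the current cpu or active_balance of the busiest runqueue has been cleared, jump to the out_unlock label and give up the migration.
In code lines 26~27, if one task or fewer is running on the busiest runqueue, jump to the out_unlock label and give up the migration.
In code lines 38~42, the schedule domains of the target cpu are walked, and when a domain that allows load balancing and contains the busiest cpu is found, the loop exits so that balancing is done through that domain.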

kernel/sched/fair.c -2/2-

        if (likely(sd)) {
                struct lb_env env = {
                        .sd             = sd,
                        .dst_cpu        = target_cpu,
                        .dst_rq         = target_rq,
                        .src_cpu        = busiest_rq->cpu,
                        .src_rq         = busiest_rq,
                        .idle           = CPU_IDLE,
                        /*
                         * can_migrate_task() doesn't need to compute new_dst_cpu
                         * for active balancing. Since we have CPU_IDLE, but no
                         * @dst_grpmask we need to make that test go away with lying
                         * about DST_PINNED.
                         */
                        .flags          = LBF_DST_PINNED,
                };

                schedstat_inc(sd->alb_count);
                update_rq_clock(busiest_rq);

                p = detach_one_task(&env);
                if (p) {
                        schedstat_inc(sd->alb_pushed);
                        /* Active balancing done, reset the failure counter. */
                        sd->nr_balance_failed = 0;
                } else {
                        schedstat_inc(sd->alb_failed);
                }
        }
        rcu_read_unlock();
out_unlock:
        busiest_rq->active_balance = 0;
        rq_unlock(busiest_rq, &rf);

        if (p)
                attach_one_task(target_rq, p);

        local_irq_enable();

        return 0;
}

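In code lines 1~29, in the likely case that a domain usable for balancing was found, the alb_count counter is incremented, the clock of the busiest runqueue is updated, and a task is detached from the busiest cpu. If a task was detached, the alb_pushed counter is incremented and nr_balance_failed is reset to 0; otherwise the alb_failed counter is incremented.
In code lines 31~40, this is the out_unlock: label. 0 is assigned to active_balance of the busiest runqueue, the runqueue lock is released, the task detached earlier (if any) is attached to the target cpu, and the function ends with return value 0.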

detach_one_task()

kernel/sched/fair.c

/*
 * detach_one_task() -- tries to dequeue exactly one task from env->src_rq, as
 * part of active balancing operations within "domain".
 *
 * Returns a task if successful and NULL otherwise.
 */

static struct task_struct *detach_one_task(struct lb_env *env)
{
        struct task_struct *p, *n;

        lockdep_assert_held(&env->src_rq->lock);

        list_for_each_entry_reverse(p,
                        &env->src_rq->cfs_tasks, se.group_node) {
                if (!can_migrate_task(p, env))
                        continue;

                detach_task(p, env);

                /*
                 * Right now, this is only the second place where
                 * lb_gained[env->idle] is updated (other is detach_tasks)
                 * so we can safely collect stats here rather than
                 * inside detach_tasks().
                 */
                schedstat_inc(env->sd, lb_gained[env->idle]);
                return p;
        }
        return NULL;
}

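Detaches and returns one migratable task among the tasks running on the src runqueue.

In code lines 7~10, the tasks on the src runqueue are walked in reverse order, skipping tasks that cannot be migrated.
In code lines 12~21, the task is detached, the lb_gained[] stat counter is incremented, and the task is returned.
In code line 23, NULL is returned if no task could be detached.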

attach_one_task()

kernel/sched/fair.c

/*
 * attach_one_task() -- attaches the task returned from detach_one_task() to
 * its new rq.
 */

static void attach_one_task(struct rq *rq, struct task_struct *p)
{
        struct rq_flags rf;

        rq_lock(rq, &rf);
        update_rq_clock(rq);
        attach_task(rq, p);
        rq_unlock(rq, &rf);
}

Attaches the requested task. (enqueues it on the runqueue)

Checking Whether Migration Is Possible

can_migrate_task()

kernel/sched/fair.c

/*
 * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
 */

static
int can_migrate_task(struct task_struct *p, struct lb_env *env)
{
        int tsk_cache_hot;

        lockdep_assert_held(&env->src_rq->lock);

        /*
         * We do not migrate tasks that are:
         * 1) throttled_lb_pair, or
         * 2) cannot be migrated to this CPU due to cpus_allowed, or
         * 3) running (obviously), or
         * 4) are cache-hot on their current CPU.
         */
        if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
                return 0;

        if (!cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p))) {
                int cpu;

                schedstat_inc(p, se.statistics.nr_failed_migrations_affine);

                env->flags |= LBF_SOME_PINNED;

                /*
                 * Remember if this task can be migrated to any other cpu in
                 * our sched_group. We may want to revisit it if we couldn't
                 * meet load balance goals by pulling other tasks on src_cpu.
                 *
                 * Also avoid computing new_dst_cpu if we have already computed
                 * one in current iteration.
                 */
                if (!env->dst_grpmask || (env->flags & LBF_DST_PINNED))
                        return 0;

                /* Prevent to re-select dst_cpu via env's cpus */
                for_each_cpu_and(cpu, env->dst_grpmask, env->cpus) {
                        if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p))) {
                                env->flags |= LBF_DST_PINNED;
                                env->new_dst_cpu = cpu;
                                break;
                        }
                }

                return 0;
        }

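Returns whether the task may be migrated.

In code lines 15~16, returns 0 if the cfs runqueue of the task's task group on the src or dest cpu is throttled.
In code lines 18~46, if the dst cpu is not among the cpus allowed for the task, the nr_failed_migrations_affine stat is incremented, the LBF_SOME_PINNED flag is set and 0 is returned. Before returning, if dst_grpmask is set and the LBF_DST_PINNED flag is not yet set, the cpus that are in both dst_grpmask and env->cpus are scanned; the first cpu the task is allowed on is stored in new_dst_cpu and the LBF_DST_PINNED flag is set.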

        /* Record that we found atleast one task that could run on dst_cpu */
        env->flags &= ~LBF_ALL_PINNED;

        if (task_running(env->src_rq, p)) {
                schedstat_inc(p, se.statistics.nr_failed_migrations_running);
                return 0;
        }

        /*
         * Aggressive migration if:
         * 1) destination numa is preferred
         * 2) task is cache cold, or
         * 3) too many balance attempts have failed.
         */
        tsk_cache_hot = migrate_degrades_locality(p, env);
        if (tsk_cache_hot == -1)
                tsk_cache_hot = task_hot(p, env);

        if (tsk_cache_hot <= 0 ||
            env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
                if (tsk_cache_hot == 1) {
                        schedstat_inc(env->sd, lb_hot_gained[env->idle]);
                        schedstat_inc(p, se.statistics.nr_forced_migrations);
                }
                return 1;
        }

        schedstat_inc(p, se.statistics.nr_failed_migrations_hot);
        return 0;
}

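In code line 2, the LBF_ALL_PINNED flag is cleared.
In code lines 4~7, if the task is running on the src runqueue, the nr_failed_migrations_running stat is incremented and 0 is returned.
In code lines 15~17, checks whether locality degrades under numa balancing and, if locality is not affected (-1), whether the task is cache hot.
- A task that is still cache-hot is not migrated.
- When numa balancing is not used, the result of migrate_degrades_locality() is always -1.
In code lines 19~26, if locality improves under numa balancing, or the task is not cache hot, or balancing has failed more often than cache_nice_tries, 1 is returned to indicate that migration is allowed. If the task is cache hot, the lb_hot_gained[] and nr_forced_migrations stats are incremented.
In code lines 28~29, the nr_failed_migrations_hot stat is incremented and 0 is returned.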
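
The final decision combines a three-valued locality result with the cache-hot check, which can be sketched as follows. The function may_migrate() and its parameters are hypothetical: locality stands for the result of migrate_degrades_locality() (-1, 0 or 1) and hot for the result of task_hot() (0 or 1). The throttle, affinity and running checks as well as the statistics are left out.

```c
/*
 * Hypothetical model of the last part of the decision.
 * Returns 1 when migration is allowed, 0 otherwise.
 */
static int may_migrate(int locality, int hot,
                       unsigned int nr_balance_failed,
                       unsigned int cache_nice_tries)
{
        int tsk_cache_hot = locality;

        /* Locality has no opinion: fall back to the cache-hot check. */
        if (tsk_cache_hot == -1)
                tsk_cache_hot = hot;

        /* Not hot, or too many failed attempts: migrate anyway. */
        if (tsk_cache_hot <= 0 || nr_balance_failed > cache_nice_tries)
                return 1;

        return 0;
}
```

Note that a locality result of 0 (migration preferred) wins over a cache-hot task, and that a hot task is still migrated once nr_balance_failed exceeds cache_nice_tries.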

Throttle State of the Source and Destination cpu

throttled_lb_pair()

kernel/sched/fair.c

/*
 * Ensure that neither of the group entities corresponding to src_cpu or
 * dest_cpu are members of a throttled hierarchy when performing group
 * load-balance operations.
 */

static inline int throttled_lb_pair(struct task_group *tg,
                                    int src_cpu, int dest_cpu)
{
        struct cfs_rq *src_cfs_rq, *dest_cfs_rq;

        src_cfs_rq = tg->cfs_rq[src_cpu];
        dest_cfs_rq = tg->cfs_rq[dest_cpu];

        return throttled_hierarchy(src_cfs_rq) ||
               throttled_hierarchy(dest_cfs_rq);
}

Returns whether the cfs runqueue of the task group on the src or dest cpu is throttled.

throttled_hierarchy()

kernel/sched/fair.c

/* check whether cfs_rq, or any parent, is throttled */
static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
{
        return cfs_bandwidth_used() && cfs_rq->throttle_count;
}

Returns whether the requested cfs runqueue, or any of its parents, is throttled.

Checking Whether Locality Degrades After Migration

migrate_degrades_locality()

kernel/sched/fair.c

/*
 * Returns 1, if task migration degrades locality
 * Returns 0, if task migration improves locality i.e migration preferred.
 * Returns -1, if task migration is not affected by locality.
 */

static int migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
{
        struct numa_group *numa_group = rcu_dereference(p->numa_group);
        unsigned long src_weight, dst_weight;
        int src_nid, dst_nid, dist;

        if (!static_branch_likely(&sched_numa_balancing))
                return -1;

        if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
                return -1;

        src_nid = cpu_to_node(env->src_cpu);
        dst_nid = cpu_to_node(env->dst_cpu);

        if (src_nid == dst_nid)
                return -1;

        /* Migrating away from the preferred node is always bad. */
        if (src_nid == p->numa_preferred_nid) {
                if (env->src_rq->nr_running > env->src_rq->nr_preferred_running)
                        return 1;
                else
                        return -1;
        }

        /* Encourage migration to the preferred node. */
        if (dst_nid == p->numa_preferred_nid)
                return 0;

        /* Leaving a core idle is often worse than degrading locality. */
        if (env->idle == CPU_IDLE)
                return -1;

        dist = node_distance(src_nid, dst_nid);
        if (numa_group) {
                src_weight = group_weight(p, src_nid, dist);
                dst_weight = group_weight(p, dst_nid, dist);
        } else {
                src_weight = task_weight(p, src_nid, dist);
                dst_weight = task_weight(p, dst_nid, dist);
        }

        return dst_weight < src_weight;
}

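When NUMA balancing is used, returns whether migrating the task degrades locality. 1 means locality degrades, so migration is not recommended. 0 means locality improves, so migration is recommended. -1 means there is no effect or it is unknown. (When NUMA balancing is not used the result is always -1.)

In code lines 7~8, returns -1 if NUMA balancing is not in use.
- /proc/sys/kernel/numa_balancing
In code lines 10~11, returns -1 if the task has no numa faults or the domain is not a NUMA domain.
In code lines 13~17, returns -1 if the src node and the dst node are the same.
In code lines 20~25, this is the case where the task's preferred node is the source node. If the source runqueue has more running tasks than tasks running on their preferred node (that is, some of its tasks prefer a remote node), locality gets worse and 1 is returned; otherwise -1 is returned because locality is not affected.
In code lines 28~29, if the task's preferred node is the dst node, locality improves and 0 is returned.
In code lines 32~33, returns -1 if balancing was entered from the idle state (CPU_IDLE).
In code lines 35~44, returns whether the task's fault weight on the dst node is smaller than on the src node.
- dst < src: 1 (locality degrades)
- dst >= src: 0 (locality improves or is unchanged)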

Whether a Task Is Cache Hot

task_hot()

kernel/sched/fair.c

/*
 * Is this task likely cache-hot:
 */

static int task_hot(struct task_struct *p, struct lb_env *env)
{
        s64 delta;

        lockdep_assert_held(&env->src_rq->lock);

        if (p->sched_class != &fair_sched_class)
                return 0;

        if (unlikely(task_has_idle_policy(p)))
                return 0;

        /*
         * Buddy candidates are cache hot:
         */
        if (sched_feat(CACHE_HOT_BUDDY) && env->dst_rq->nr_running &&
                        (&p->se == cfs_rq_of(&p->se)->next ||
                         &p->se == cfs_rq_of(&p->se)->last))
                return 1;

        if (sysctl_sched_migration_cost == -1)
                return 1;
        if (sysctl_sched_migration_cost == 0)
                return 0;

        delta = rq_clock_task(env->src_rq) - p->se.exec_start;

        return delta < (s64)sysctl_sched_migration_cost;
}

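Returns whether the task is cache-hot. (A hot task is kept where it is and not migrated if possible.)

In code lines 7~8, returns 0 if the task is not a cfs task.
In code lines 10~11, returns 0 if the task uses the SCHED_IDLE policy.
In code lines 15~19, returns 1 if the CACHE_HOT_BUDDY feature is enabled, the dst runqueue has running tasks, and the task is set as the next or last buddy of its cfs runqueue.
- The CACHE_HOT_BUDDY feature is true by default.
- The task is running well on its own and should not be disturbed. Migrating it would only waste the cache.
In code lines 21~22, if sysctl_sched_migration_cost is set to -1, 1 is returned so that the task is not migrated.
- The default value of "/proc/sys/kernel/sched_migration_cost_ns" is 500,000 (ns).
In code lines 23~24, if sysctl_sched_migration_cost is set to 0, 0 is returned so that the task is always migrated.
In code lines 26~28, returns whether the time elapsed since the task last started executing is less than sysctl_sched_migration_cost.
- Migrating a task that ran only a very short time ago just wastes the cache, so 1 is returned to prevent migration.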
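
The two special values and the time comparison can be sketched in user space. The function task_hot_model() and its nanosecond parameters are hypothetical; the scheduling class, idle policy and buddy checks are omitted.

```c
#include <stdint.h>

/*
 * Hypothetical model of the time-based part of the cache-hot check.
 * now_ns and exec_start_ns stand in for rq_clock_task() and
 * p->se.exec_start; migration_cost_ns for sysctl_sched_migration_cost.
 */
static int task_hot_model(int64_t now_ns, int64_t exec_start_ns,
                          int64_t migration_cost_ns)
{
        if (migration_cost_ns == -1)
                return 1;       /* always hot: never migrate */
        if (migration_cost_ns == 0)
                return 0;       /* never hot: always migrate */

        return (now_ns - exec_start_ns) < migration_cost_ns;
}
```

With the default cost of 500,000 ns, a task that started executing 100,000 ns ago counts as hot, while one that started 600,000 ns ago does not.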

Schedule Features

# cat /sys/kernel/debug/sched_features 
GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY CACHE_HOT_BUDDY WAKEUP_PREEMPTION ARCH_CAPACITY NO_HRTICK NO_DOUBLE_TICK LB_BIAS NONTASK_CAPACITY TTWU_QUEUE NO_FORCE_SD_OVERLAP RT_RUNTIME_SHARE NO_LB_MIN

How to set a feature
- echo HRTICK > sched_features
How to clear a feature
- echo NO_HRTICK > sched_features

Structures

lb_env structure

kernel/sched/fair.c

struct lb_env {
        struct sched_domain     *sd;

        struct rq               *src_rq;
        int                     src_cpu;

        int                     dst_cpu;
        struct rq               *dst_rq;

        struct cpumask          *dst_grpmask;
        int                     new_dst_cpu;
        enum cpu_idle_type      idle;
        long                    imbalance;
        /* The set of CPUs under consideration for load-balancing */
        struct cpumask          *cpus;

        unsigned int            flags;

        unsigned int            loop;
        unsigned int            loop_break;
        unsigned int            loop_max;

        enum fbq_type           fbq_type;
        enum group_type         src_grp_type;
        struct list_head        tasks;
};

*sd
- the scheduling domain in which load balancing is performed
*src_rq
- source runqueue
src_cpu
- source cpu
dst_cpu
- dest cpu
*dst_rq
- dest runqueue
*dst_grpmask
- cpu mask of the dest group
new_dst_cpu
- When a task cannot be migrated to the dst cpu because of its cpu affinity, another cpu in the dest group is chosen; this value is then assigned to dst_cpu for the retry.
idle
- idle type
  - CPU_IDLE
    - balancing attempted from the regular tick while the cpu is idle
  - CPU_NOT_IDLE
    - balancing attempted from the regular tick while the cpu is not idle
  - CPU_NEWLY_IDLE
    - balancing attempted when the cpu has just become idle
imbalance
- Set to a value that reflects how strongly load balancing is needed.
- The larger the value, the more unbalanced the state, so load balancing is more likely to take place.
*cpus
- cpu mask of the cpus considered for load balancing
flags
- LBF_ALL_PINNED(0x01)
  - No task at all can be migrated.
- LBF_NEED_BREAK(0x02)
  - The migration loop is left briefly and then retried, to split the work into units of loop_break.
- LBF_DST_PINNED(0x04)
  - Migration to the destination cpu is not possible.
- LBF_SOME_PINNED(0x08)
  - Some tasks cannot be migrated.
  - Tasks such as the one running on the src cpu's runqueue cannot be pull-migrated. In that case the push migration of active balancing is used.
loop
- internal counter of the migration in progress
- incremented each time the migration of one task is attempted
loop_break
- When loop reaches this value, the loop pauses and is retried.
loop_max
- When loop reaches this value, the loop stops.
- Set to the number of tasks running on the target cpu, capped at sysctl_sched_nr_migrate (default 32).
fbq_type
- find busiest queue type; described in the main text.
  - regular(0)
  - remote(1)
  - all(2)
src_grp_type
- source group type; described in the main text.
  - group_other(0)
  - group_misfit_task(1)
  - group_imbalanced(2)
  - group_overloaded(3)
tasks
- list of tasks to be load balanced
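
How the pinned flags evolve while tasks are examined can be sketched with the flag values listed above. This is a reduced model: note_task() is a hypothetical helper that mirrors only the two flag updates seen in can_migrate_task(), and it assumes the caller starts the scan with LBF_ALL_PINNED set.

```c
#define LBF_ALL_PINNED  0x01
#define LBF_NEED_BREAK  0x02
#define LBF_DST_PINNED  0x04
#define LBF_SOME_PINNED 0x08

/*
 * Hypothetical helper: update the flags after looking at one task.
 * A task not allowed on the dst cpu marks SOME_PINNED; a task that is
 * allowed clears ALL_PINNED.
 */
static unsigned int note_task(unsigned int flags, int allowed_on_dst)
{
        if (!allowed_on_dst)
                return flags | LBF_SOME_PINNED;

        return flags & ~LBF_ALL_PINNED;
}
```

If every examined task is pinned away from the dst cpu, LBF_ALL_PINNED is still set at the end of the scan; a single allowed task is enough to clear it.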

sd_lb_stats structure

kernel/sched/fair.c

/*
 * sd_lb_stats - Structure to store the statistics of a sched_domain
 *               during load balancing.
 */

struct sd_lb_stats {
        struct sched_group *busiest;    /* Busiest group in this sd */
        struct sched_group *local;      /* Local group in this sd */
        unsigned long total_running;
        unsigned long total_load;       /* Total load of all groups in sd */
        unsigned long total_capacity;   /* Total capacity of all groups in sd */
        unsigned long avg_load; /* Average load across all groups in sd */

        struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
        struct sg_lb_stats local_stat;  /* Statistics of the local group */
};

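Used during load balancing; represents the load balance statistics of a particular domain.

*busiest
- points to the busiest schedule group in the domain
*local
- points to the local schedule group in the domain that is currently being compared
total_running
- number of tasks running in the domain
total_load
- sum of the loads of all groups in the domain
total_capacity
- sum of the capacities of all groups in the domain
avg_load
- average load of all groups in the domain
busiest_stat
- statistics of the busiest group
local_stat
- statistics of the local group

sg_lb_stats structure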

kernel/sched/fair.c

/*
 * sg_lb_stats - stats of a sched_group required for load_balancing
 */

struct sg_lb_stats {
        unsigned long avg_load; /*Avg load across the CPUs of the group */
        unsigned long group_load; /* Total load over the CPUs of the group */
        unsigned long load_per_task;
        unsigned long group_capacity;
        unsigned long group_util; /* Total utilization of the group */
        unsigned int sum_nr_running; /* Nr tasks running in the group */
        unsigned int idle_cpus;
        unsigned int group_weight;
        enum group_type group_type;
        int group_no_capacity;
        unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
#ifdef CONFIG_NUMA_BALANCING
        unsigned int nr_numa_running;
        unsigned int nr_preferred_running;
#endif
};

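Used during load balancing, embedded in the sd_lb_stats structure; it holds the load-balance statistics for a particular group.

avg_load
- Average load of the group
group_load
- Sum of the loads of the cpus in the group
load_per_task
- Load per task, computed as follows.
  - group_load / sum_nr_running
group_capacity
- Sum of the capacities of the cpus in the group
group_util
- Sum of the utilization of the cpus in the group
sum_nr_running
- Total number of tasks running in the group
idle_cpus
- Number of idle cpus in the group
group_weight
- Number of cpus in the group
group_type
- The group type; it is described in the main text.
  - group_other
  - group_misfit_task
  - group_imbalanced
  - group_overloaded
group_no_capacity
- Whether load balancing is needed because capacity is insufficient. (1 = insufficient capacity)
group_misfit_task_load
- The largest misfit_task_load among the cpus in the group
nr_numa_running
- Number of tasks in the group that have a preferred NUMA node set
nr_preferred_running
- Number of tasks in the group that are running on their preferred NUMA node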
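
The load_per_task formula above can be sketched as follows. The struct is a reduced, illustrative subset of the statistics, and update_load_per_task() is a hypothetical helper; the zero-task guard is an assumption added so the sketch never divides by zero.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced, illustrative subset of the per-group statistics. */
struct group_stats {
        unsigned long group_load;       /* sum of the cpu loads in the group */
        unsigned int sum_nr_running;    /* tasks running in the group */
        unsigned long load_per_task;    /* derived: group_load / sum_nr_running */
};

/* Fill in load_per_task; an empty group keeps load_per_task at 0. */
static void update_load_per_task(struct group_stats *gs)
{
        assert(gs != NULL);
        gs->load_per_task = gs->sum_nr_running ?
                            gs->group_load / gs->sum_nr_running : 0;
}
```

For a group with a load of 3072 spread over 3 running tasks, load_per_task comes out as 1024.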

References

Scheduler -1- (Basic) | 문c
Scheduler -2- (Global Cpu Load) | 문c
Scheduler -3- (PELT) | 문c
Scheduler -4- (Group Scheduling) | 문c
Scheduler -5- (Scheduler Core) | 문c
Scheduler -6- (CFS Scheduler) | 문c
Scheduler -7- (Preemption & Context Switch) | 문c
Scheduler -8- (CFS Bandwidth) | 문c
Scheduler -9- (RT Scheduler) | 문c
Scheduler -10- (Deadline Scheduler) | 문c
Scheduler -11- (Stop Scheduler) | 문c
Scheduler -12- (Idle Scheduler) | 문c
Scheduler -13- (Scheduling Domain 1) | 문c
Scheduler -14- (Scheduling Domain 2) | 문c
Scheduler -15- (Load Balance 1) | 문c – this post
Scheduler -16- (Load Balance 2) | 문c
Scheduler -17- (Load Balance 3 NUMA) | 문c
Scheduler -18- (Load Balance 4 EAS) | 문c
Scheduler -19- (Initialization) | 문c
PID Management | 문c
do_fork() | 문c
cpu_startup_entry() | 문c
Runqueue Load Average (cpu_load[]) – v4.0 | 문c
PELT(Per-Entity Load Tracking) – v4.0 | 문c

Scheduler -13- (Scheduling Domain 1)

2017-07-20 (updated 2023-04-15) 문영일

cpu topology

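The information assembled as the cpu topology is used by load balancing through scheduling domains and by the PM (Power Management) subsystem. The cpu topology part that used to live in core.c has been split out into topology.c.

Code locations
- kernel/sched/topology.c
- include/linux/sched/topology.h

cpu topology

The following three architectures support cpu topology. Where it is not supported, all cpus are assumed to have the same performance. The cpu topology is updated as cpus go online/offline.

ARM32
- On the armv7 architecture, cpu topology is supported when the CONFIG_ARM_CPU_TOPOLOGY kernel option is used.
- cpu_topology[] is built by reading the 3 affinity levels from the MPIDR register.
- Some recent systems use the device tree.
ARM64
- cpu topology is always built and used.
- At boot time the “cpu-map” node of the Device Tree is read to add the cluster information; if the topology cannot be built from the device tree, cpu_topology[] is built by reading the 4 affinity levels from the MPIDR register.
RISC-V
- cpu topology is always built and used.
- At boot time the “cpu-map” node of the Device Tree is read to add the cluster information.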

arch specific cpu topology & cpu capacity

The duplicated cpu topology and cpu capacity code that was specific to the arm and arm64 architectures has been consolidated in the following location.
- Common code location
  - drivers/base/arch_topology.c
  - include/linux/arch_topology.h
- See:
  - arm,arm64,drivers: move externs in a new header file (2017, v4.13-rc1)
  - arm, arm64: factorize common cpu capacity default code (2017, v4.13-rc1)
Device tree related
- On RISC-V and ARM64 systems the cpu topology is built by reading the “cpu-map” node of the device tree, so the related duplicated code was moved to the common location. (Some of the arm code is implemented differently and remains unmerged.)
  - See: cpu-topology: Move cpu topology code to common code (2019, v5.4-rc1)
- ARM32 systems that support the device tree use their own device tree parsing source in addition to the common code above.

cpu_topology[]

include/linux/arch_topology.h

struct cpu_topology {
        int thread_id;
        int core_id;
        int package_id;
        int llc_id;
        cpumask_t thread_sibling;
        cpumask_t core_sibling;
        cpumask_t llc_sibling;
};

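The cpu topology structure has been unified into one definition shared by the architectures. It is populated by reading system registers and, where the device tree is supported, also from the device tree.

thread_id
- Identifies a h/w thread. As of this writing arm does not support hardware threads, so it always holds -1 and is not used.
- It is used together with the CONFIG_SCHED_SMT kernel option on systems that support multi-threading (hw threads), such as powerpc, s390, mips and x86 hyper-threading.
core_id
- Identifies a core (cpu).
package_id
- Identifies a package (cluster).
- Recent ARM64 SoCs tend to be split into big/medium/little clusters that can all run at the same time.
llc_id
- Identifies a last-level cache.
- Currently supported only on ARM64 with ACPI.
thread_sibling
- Bitmask of the h/w threads that make up a core (cpu).
- hw-thread support was planned for arm64, but after delays its release has been put on hold.
core_sibling
- Bitmask of the cores (cpus) that make up a package (cluster).
- Example) 0b11110000 -> the cluster consists of cpu#4 ~ cpu#7.
llc_sibling
- Bitmask of the cpus that share the last-level cache.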
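
The core_sibling example above (0b11110000 covering cpu#4 ~ cpu#7) can be checked with a small sketch. It assumes a plain 64-bit integer in place of cpumask_t, and mask_weight(), mask_first() and mask_last() are hypothetical helpers, not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Number of cpus set in a sibling mask (illustrative 64-bit mask). */
static int mask_weight(uint64_t mask)
{
        int n = 0;

        for (; mask; mask &= mask - 1)  /* clear the lowest set bit each round */
                n++;
        return n;
}

/* Lowest cpu number set in the mask; the mask must not be empty. */
static int mask_first(uint64_t mask)
{
        int cpu = 0;

        assert(mask != 0);
        while (!(mask & 1)) {
                mask >>= 1;
                cpu++;
        }
        return cpu;
}

/* Highest cpu number set in the mask; the mask must not be empty. */
static int mask_last(uint64_t mask)
{
        int cpu = -1;

        assert(mask != 0);
        while (mask) {
                mask >>= 1;
                cpu++;
        }
        return cpu;
}
```

For the mask 0xf0 these helpers report 4 cpus, starting at cpu#4 and ending at cpu#7.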

The following shows an example of a system made up of four clusters in total, using two levels of clusters.

See:
- CPU topology binding description | Documentation/devicetree/bindings/cpu/cpu-topology.txt
- ARM CPUs bindings | Documentation/devicetree/bindings/arm/cpus.txt

cpus {
        #size-cells = <0>;
        #address-cells = <2>;

        cpu-map {
                cluster0 {
                        cluster0 {
                                core0 {
                                        thread0 {
                                                cpu = <&CPU0>;
                                        };
                                        thread1 {
                                                cpu = <&CPU1>;
                                        };
                                };

                                core1 {
                                        thread0 {
                                                cpu = <&CPU2>;
                                        };
                                        thread1 {
                                                cpu = <&CPU3>;
                                        };
                                };
                        };

                        cluster1 {
                                core0 {
                                        thread0 {
                                                cpu = <&CPU4>;
                                        };
                                        thread1 {
                                                cpu = <&CPU5>;
                                        };
                                 };

                                core1 {
                                        thread0 {
                                                cpu = <&CPU6>;
                                        };
                                        thread1 {
                                                cpu = <&CPU7>;
                                        };
                                };
                        };
                };
                cluster1 {
                        cluster0 {
                                core0 {
                                        thread0 {
                                                cpu = <&CPU8>;
                                        };
                                        thread1 {
                                                cpu = <&CPU9>;
                                        };
                                };
                                core1 {
                                        thread0 {
                                                cpu = <&CPU10>;
                                        };
                                        thread1 {
                                                cpu = <&CPU11>;
                                        };
                                };
                        };

                        cluster1 {
                                core0 {
                                        thread0 {
                                                cpu = <&CPU12>;
                                        };
                                        thread1 {
                                                cpu = <&CPU13>;
                                        };
                                };
                                core1 {
                                        thread0 {
                                                cpu = <&CPU14>;
                                        };
                                        thread1 {
                                                cpu = <&CPU15>;
                                        };
                                };
                        };
                };
        };

        CPU0: cpu@0 {
                device_type = "cpu";
                compatible = "arm,cortex-a53";
                reg = <0x0 0x0>;                 <- core_id
                enable-method = "spin-table";
                cpu-release-addr = <0 0x20000000>;
                capacity-dmips-mhz = <485>;      <- raw cpu capacity
        };

        ...
};

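CPU Capacity Scale Management

For systems whose cores differ in capability, a relative capacity is stored per core and used by load balancing.

Recent architectures support heterogeneous designs in which big/medium/little clusters each have their own performance level, and the kernel runs them at different capacities. The feature was first supported on ARM32 systems with fixed-frequency big/little clusters (cortex-a7/a15), and more recently ARM64 and RISC-V systems support it with a variety of clusters. On ARM64 the cpu capacity is managed without the frequency component, and frequency is managed separately.

Computing the Scaled CPU Capacity – Generic

It is computed from the following value read from the device tree.

“capacity-dmips-mhz” property value
- Stored in raw_capacity[].
- The maximum value is stored in capacity_scale.
The computation is simple:
- cpu_scale[] = (raw_capacity[] << SCHED_CAPACITY_SHIFT) / capacity_scale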
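
The scaling step can be tried out in user space with the sketch below. It is an illustration under assumptions, not the kernel function: normalize_capacity() and CAPACITY_SHIFT are hypothetical names, with CAPACITY_SHIFT standing in for SCHED_CAPACITY_SHIFT (10), and the largest raw value plays the role of capacity_scale.

```c
#include <assert.h>
#include <stdint.h>

#define CAPACITY_SHIFT 10   /* stands in for SCHED_CAPACITY_SHIFT: scale is 1 << 10 = 1024 */

/* Scale each raw capacity so that the largest one maps to 1024. */
static void normalize_capacity(const uint32_t *raw, unsigned long *scaled, int nr_cpus)
{
        uint32_t max_raw = 0;   /* plays the role of capacity_scale */
        int cpu;

        for (cpu = 0; cpu < nr_cpus; cpu++)
                if (raw[cpu] > max_raw)
                        max_raw = raw[cpu];
        assert(max_raw > 0);

        for (cpu = 0; cpu < nr_cpus; cpu++)
                scaled[cpu] = (unsigned long)(((uint64_t)raw[cpu] << CAPACITY_SHIFT) / max_raw);
}
```

With raw values of 250 for two little cores and 500 for two big cores, the little cores end up at 512 and the big cores at 1024.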

cpu topology initialization

The following figure shows how the cpu topology is initialized. store_cpu_topology() is called at kernel boot-up and each time one of the remaining cpus comes online.

init_cpu_topology() – Generic

drivers/base/arch_topology.c

void __init init_cpu_topology(void)
{
        reset_cpu_topology();

        /*
         * Discard anything that was parsed if we hit an error so we
         * don't use partial information.
         */
        if (parse_acpi_topology())
                reset_cpu_topology();
        else if (of_have_populated_dt() && parse_dt_topology())
                reset_cpu_topology();
}

Resets cpu_topology[] and then populates it.

In code line 3, cpu_topology[] is reset for every possible cpu.
In code lines 9~10, ACPI is parsed to build the cpu topology. On failure it is reset again.
In code lines 11~12, the cpus node of the device tree is parsed to build the cpu topology. On failure it is reset again.

reset_cpu_topology()

drivers/base/arch_topology.c

void __init reset_cpu_topology(void)
{
        unsigned int cpu;

        for_each_possible_cpu(cpu) {
                struct cpu_topology *cpu_topo = &cpu_topology[cpu];

                cpu_topo->thread_id = -1;
                cpu_topo->core_id = -1;
                cpu_topo->package_id = -1;
                cpu_topo->llc_id = -1;

                clear_cpu_topology(cpu);
        }
}

Resets cpu_topology[] for every possible cpu, setting each id to -1.

Topology for scheduling domains

kernel/sched/topology.c

struct sched_domain_topology_level *sched_domain_topology = default_topology;

default_topology[] – Generic

kernel/sched/topology.c

/*
 * Topology list, bottom-up.
 */

static struct sched_domain_topology_level default_topology[] = {
#ifdef CONFIG_SCHED_SMT
        { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
#endif
#ifdef CONFIG_SCHED_MC
        { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
#endif
        { cpu_cpu_mask, SD_INIT_NAME(DIE) },
        { NULL, },
};

The default topology can have up to three levels: SMT -> MC -> DIE.

On ARM64 and RISC-V systems, NUMA levels are additionally built after the DIE level, one per NUMA distance step, using the device tree.
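
How such a bottom-up, terminator-ended table is walked can be sketched as follows. This is a simplified stand-in: struct topo_level, example_topology[] and count_levels() are hypothetical, and the sketch ends the table with a NULL name, whereas the table shown above ends with an entry whose first member is NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for one topology level: only a name is kept here. */
struct topo_level {
        const char *name;       /* NULL marks the end of the table */
};

/* Bottom-up table, as if both SMT and MC were configured. */
static const struct topo_level example_topology[] = {
        { "SMT" },
        { "MC" },
        { "DIE" },
        { NULL },
};

/* Count the levels by walking until the NULL terminator. */
static int count_levels(const struct topo_level *tl)
{
        int n = 0;

        assert(tl != NULL);
        while (tl[n].name)
                n++;
        return n;
}
```

For the example table the walk stops after three levels, matching the SMT -> MC -> DIE maximum.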

arm_topology[] – ARM32

arch/arm/kernel/topology.c

static struct sched_domain_topology_level arm_topology[] = {
#ifdef CONFIG_SCHED_MC
        { cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
        { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
#endif
        { cpu_cpu_mask, SD_INIT_NAME(DIE) },
        { NULL, },
};

ARM32 adds an extra level called GMC, which distinguishes core power domains.

cpu_coregroup_mask()

drivers/base/arch_topology.c

const struct cpumask *cpu_coregroup_mask(int cpu)
{
        const cpumask_t *core_mask = cpumask_of_node(cpu_to_node(cpu));

        /* Find the smaller of NUMA, core or LLC siblings */
        if (cpumask_subset(&cpu_topology[cpu].core_sibling, core_mask)) {
                /* not numa in package, lets use the package siblings */
                core_mask = &cpu_topology[cpu].core_sibling;
        }
        if (cpu_topology[cpu].llc_id != -1) {
                if (cpumask_subset(&cpu_topology[cpu].llc_sibling, core_mask))
                        core_mask = &cpu_topology[cpu].llc_sibling;
        }

        return core_mask;
}

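Returns the cpumask for the MC domain level of @cpu. If last-level-cache information exists, the bitmap of the cores sharing that cache is returned. Otherwise the bitmap of the cores in the requested cpu's node and package is returned. If that information is missing as well, the bitmap of the cpu cores in the node that the cpu belongs to is returned.

In code line 3, the bitmap of the cpu cores in the node that the cpu belongs to is obtained.
In code lines 6~9, if the cpu's core siblings are a subset of the bitmap obtained above, the core sibling bitmap is used instead.
In code lines 10~13, if LLC information is available and the LLC siblings are a subset of the current bitmap, the LLC sibling bitmap is used instead.
In code line 15, the resulting bitmap is returned.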
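
The narrowing logic walked through above can be sketched with plain 64-bit masks. This is a simplified model under assumptions: struct topo, mask_subset() and coregroup_mask() are hypothetical names, and the node mask is passed in directly instead of being looked up.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative per-cpu topology record using 64-bit masks instead of cpumask_t. */
struct topo {
        int llc_id;             /* -1 when no last-level-cache information exists */
        uint64_t core_sibling;  /* cpus in the same package */
        uint64_t llc_sibling;   /* cpus sharing the last-level cache */
};

/* True when every bit of a is also set in b. */
static int mask_subset(uint64_t a, uint64_t b)
{
        return (a & ~b) == 0;
}

/* Narrow the node mask to the package mask, then to the LLC mask, where they fit. */
static uint64_t coregroup_mask(const struct topo *t, uint64_t node_mask)
{
        uint64_t mask = node_mask;

        assert(t != 0);
        if (mask_subset(t->core_sibling, mask))
                mask = t->core_sibling;
        if (t->llc_id != -1 && mask_subset(t->llc_sibling, mask))
                mask = t->llc_sibling;
        return mask;
}
```

With a node mask of 0xff and a package mask of 0x0f the result is 0x0f; if an LLC mask of 0x03 is also known, the result narrows further to 0x03.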

cpu_core_flags()

include/linux/topology.h

#ifdef CONFIG_SCHED_MC
static inline int cpu_core_flags(void)
{
        return SD_SHARE_PKG_RESOURCES;
}
#endif

Returns the SD_SHARE_PKG_RESOURCES flag. (MC domain level)

cpu_cpu_mask()

include/linux/topology.h

static inline const struct cpumask *cpu_cpu_mask(int cpu)
{
        return cpumask_of_node(cpu_to_node(cpu));
}

Returns the cpumask of the node that @cpu belongs to. (Used at the DIE level.)

Example) On a system with 2 NUMA nodes, each with 4 clusters of 4 cpus (32 cpus in total), for requested cpu=31
- 0xffff_0000
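
The example value can be reproduced with the sketch below. It assumes that cpus are numbered node by node (cpu 0~15 in node 0, cpu 16~31 in node 1); example_cpu_to_node() and example_cpu_cpu_mask() are hypothetical helpers that only model this one layout.

```c
#include <assert.h>
#include <stdint.h>

#define NR_NODES          2
#define CLUSTERS_PER_NODE 4
#define CPUS_PER_CLUSTER  4
#define CPUS_PER_NODE     (CLUSTERS_PER_NODE * CPUS_PER_CLUSTER)

/* Node that a cpu belongs to, assuming cpus are numbered node by node. */
static int example_cpu_to_node(int cpu)
{
        assert(cpu >= 0 && cpu < NR_NODES * CPUS_PER_NODE);
        return cpu / CPUS_PER_NODE;
}

/* Mask of all cpus in the node of the given cpu. */
static uint32_t example_cpu_cpu_mask(int cpu)
{
        uint32_t node_bits = (1u << CPUS_PER_NODE) - 1;   /* 0xffff */

        return node_bits << (example_cpu_to_node(cpu) * CPUS_PER_NODE);
}
```

cpu 31 falls into node 1, so its mask is 0xffff0000, while any cpu of node 0 yields 0x0000ffff.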

Device tree parsing

parse_dt_topology() – Generic (ARM32 has its own)

drivers/base/arch_topology.c

static int __init parse_dt_topology(void)
{
        struct device_node *cn, *map;
        int ret = 0;
        int cpu;

        cn = of_find_node_by_path("/cpus");
        if (!cn) {
                pr_err("No CPU information found in DT\n");
                return 0;
        }

        /*
         * When topology is provided cpu-map is essentially a root
         * cluster with restricted subnodes.
         */
        map = of_get_child_by_name(cn, "cpu-map");
        if (!map)
                goto out;

        ret = parse_cluster(map, 0);
        if (ret != 0)
                goto out_map;

        topology_normalize_cpu_scale();

        /*
         * Check that all cores are in the topology; the SMP code will
         * only mark cores described in the DT as possible.
         */
        for_each_possible_cpu(cpu)
                if (cpu_topology[cpu].package_id == -1)
                        ret = -EINVAL;

out_map:
        of_node_put(map);
out:
        of_node_put(cn);
        return ret;
}

Builds the cpu topology from the device tree.

In code lines 7~11, the “/cpus” node is looked up.
In code lines 17~19, the “cpu-map” node is looked up under the “/cpus” node.
In code lines 21~23, the “cluster%d” nodes under the “cpu-map” node are parsed.
In code line 25, the scaled cpu capacity is computed.
In code lines 31~33, if any entry of cpu_topology[] is left without a package_id, the -EINVAL error is returned.

parse_cluster()

drivers/base/arch_topology.c

static int __init parse_cluster(struct device_node *cluster, int depth)
{
        char name[10];
        bool leaf = true;
        bool has_cores = false;
        struct device_node *c;
        static int package_id __initdata;
        int core_id = 0;
        int i, ret;

        /*
         * First check for child clusters; we currently ignore any
         * information about the nesting of clusters and present the
         * scheduler with a flat list of them.
         */
        i = 0;
        do {
                snprintf(name, sizeof(name), "cluster%d", i);
                c = of_get_child_by_name(cluster, name);
                if (c) {
                        leaf = false;
                        ret = parse_cluster(c, depth + 1);
                        of_node_put(c);
                        if (ret != 0)
                                return ret;
                }
                i++;
        } while (c);

        /* Now check for cores */
        i = 0;
        do {
                snprintf(name, sizeof(name), "core%d", i);
                c = of_get_child_by_name(cluster, name);
                if (c) {
                        has_cores = true;

                        if (depth == 0) {
                                pr_err("%pOF: cpu-map children should be clusters\n",
                                       c);
                                of_node_put(c);
                                return -EINVAL;
                        }

                        if (leaf) {
                                ret = parse_core(c, package_id, core_id++);
                        } else {
                                pr_err("%pOF: Non-leaf cluster with core %s\n",
                                       cluster, name);
                                ret = -EINVAL;
                        }

                        of_node_put(c);
                        if (ret != 0)
                                return ret;
                }
                i++;
        } while (c);

        if (leaf && !has_cores)
                pr_warn("%pOF: empty cluster\n", cluster);

        if (leaf)
                package_id++;

        return 0;
}

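Parses a “cluster%d” node.

In code lines 16~28, child cluster nodes are probed starting from index 0; each one found is parsed by a recursive call, which in turn handles the child clusters below it.
In code lines 31~58, the “core%d” nodes are probed. Cores are valid only in a leaf cluster (one with no child clusters), and each one found there is parsed with parse_core().
In code lines 60~61, a warning is printed if a leaf cluster has no core nodes.
In code lines 63~64, the static variable package_id is incremented, but only for a leaf cluster.
In code line 66, 0 is returned because the cluster has been parsed.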
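
The way nested clusters are flattened into consecutive package ids can be sketched with an in-memory tree. This is a simplified model under assumptions: struct cluster and walk_cluster() are hypothetical, cores are reduced to a count, and the depth check on the root node is left out.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4

/* Illustrative in-memory cluster node; the kernel walks device tree nodes instead. */
struct cluster {
        const struct cluster *child[MAX_CHILDREN];  /* nested clusters, NULL-terminated */
        int nr_cores;                               /* cores directly under this cluster */
};

/*
 * Walk the tree depth-first. Every leaf cluster consumes the next package id,
 * so the nesting is flattened. Returns 0 on success, -1 if a non-leaf has cores.
 */
static int walk_cluster(const struct cluster *c, int *next_package_id)
{
        int leaf = 1;
        int i;

        assert(c != NULL && next_package_id != NULL);
        for (i = 0; i < MAX_CHILDREN && c->child[i]; i++) {
                leaf = 0;
                if (walk_cluster(c->child[i], next_package_id))
                        return -1;
        }
        if (!leaf && c->nr_cores > 0)
                return -1;
        if (leaf)
                (*next_package_id)++;
        return 0;
}
```

For a tree shaped like the two-level device tree example earlier in this post (two outer clusters, each holding two leaf clusters), the counter ends at 4, so the leaf clusters receive package ids 0 to 3.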

parse_core()

drivers/base/arch_topology.c

static int __init parse_core(struct device_node *core, int package_id,
                             int core_id)
{
        char name[10];
        bool leaf = true;
        int i = 0;
        int cpu;
        struct device_node *t;

        do {
                snprintf(name, sizeof(name), "thread%d", i);
                t = of_get_child_by_name(core, name);
                if (t) {
                        leaf = false;
                        cpu = get_cpu_for_node(t);
                        if (cpu >= 0) {
                                cpu_topology[cpu].package_id = package_id;
                                cpu_topology[cpu].core_id = core_id;
                                cpu_topology[cpu].thread_id = i;
                        } else {
                                pr_err("%pOF: Can't get CPU for thread\n",
                                       t);
                                of_node_put(t);
                                return -EINVAL;
                        }
                        of_node_put(t);
                }
                i++;
        } while (t);

        cpu = get_cpu_for_node(core);
        if (cpu >= 0) {
                if (!leaf) {
                        pr_err("%pOF: Core has both threads and CPU\n",
                               core);
                        return -EINVAL;
                }

                cpu_topology[cpu].package_id = package_id;
                cpu_topology[cpu].core_id = core_id;
        } else if (leaf) {
                pr_err("%pOF: Can't get CPU for leaf core\n", core);
                return -EINVAL;
        }

        return 0;
}

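Parses a “core%d” node.

In code lines 10~29, the “thread%d” child nodes are probed. For each one found, the cpu node referenced by the phandle in the thread node's “cpu” property is parsed, its “capacity-dmips-mhz” property value is read and stored as the cpu capacity, and a 3-level cpu_topology using package_id, core_id and thread_id is built.
In code lines 31~44, the “cpu” property is looked up on the core node itself. Only if a cpu node is linked by the phandle, its “capacity-dmips-mhz” property value is read and stored as the cpu capacity, and a 2-level cpu_topology using only package_id and core_id is built.
In code line 46, 0 is returned because the core node was parsed successfully.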

get_cpu_for_node()

drivers/base/arch_topology.c

static int __init get_cpu_for_node(struct device_node *node)
{
        struct device_node *cpu_node;
        int cpu;

        cpu_node = of_parse_phandle(node, "cpu", 0);
        if (!cpu_node)
                return -1;

        cpu = of_cpu_node_to_id(cpu_node);
        if (cpu >= 0)
                topology_parse_cpu_capacity(cpu_node, cpu);
        else
                pr_crit("Unable to find CPU node for %pOF\n", cpu_node);

        of_node_put(cpu_node);
        return cpu;
}

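Reads and stores the “capacity-dmips-mhz” value of the cpu node linked by the phandle in the “cpu” property, and returns the logical cpu number that corresponds to that cpu node.

In code lines 6~8, the cpu node linked by the phandle read from the “cpu” property is obtained.
In code lines 10~14, the cpu node is mapped to its cpu number with of_cpu_node_to_id() and, if that succeeds, its “capacity-dmips-mhz” value is read and stored.
In code line 17, the cpu number of the cpu node is returned (a negative value if none was found).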

CPU Capacity

topology_parse_cpu_capacity()

drivers/base/arch_topology.c

bool __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu)
{
        static bool cap_parsing_failed;
        int ret;
        u32 cpu_capacity;

        if (cap_parsing_failed)
                return false;

        ret = of_property_read_u32(cpu_node, "capacity-dmips-mhz",
                                   &cpu_capacity);
        if (!ret) {
                if (!raw_capacity) {
                        raw_capacity = kcalloc(num_possible_cpus(),
                                               sizeof(*raw_capacity),
                                               GFP_KERNEL);
                        if (!raw_capacity) {
                                cap_parsing_failed = true;
                                return false;
                        }
                }
                capacity_scale = max(cpu_capacity, capacity_scale);
                raw_capacity[cpu] = cpu_capacity;
                pr_debug("cpu_capacity: %pOF cpu_capacity=%u (raw)\n",
                        cpu_node, raw_capacity[cpu]);
        } else {
                if (raw_capacity) {
                        pr_err("cpu_capacity: missing %pOF raw capacity\n",
                                cpu_node);
                        pr_err("cpu_capacity: partial information: fallback to 1024 for all CPUs\n");
                }
                cap_parsing_failed = true;
                free_raw_capacity();
        }

        return !ret;
}

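Reads the “capacity-dmips-mhz” value from @cpu_node and stores it. Returns true on success.

In code lines 7~8, false is returned if cap_parsing_failed has ever been set.
In code lines 10~34, the “capacity-dmips-mhz” value is read and stored in raw_capacity[]. capacity_scale is updated to the largest value seen so far.
In code line 36, whether the cpu capacity value was stored is returned. (true = stored)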

topology_normalize_cpu_scale()

drivers/base/arch_topology.c

void topology_normalize_cpu_scale(void)
{
        u64 capacity;
        int cpu;

        if (!raw_capacity)
                return;

        pr_debug("cpu_capacity: capacity_scale=%u\n", capacity_scale);
        for_each_possible_cpu(cpu) {
                pr_debug("cpu_capacity: cpu=%d raw_capacity=%u\n",
                         cpu, raw_capacity[cpu]);
                capacity = (raw_capacity[cpu] << SCHED_CAPACITY_SHIFT)
                        / capacity_scale;
                topology_set_cpu_scale(cpu, capacity);
                pr_debug("cpu_capacity: CPU%d cpu_capacity=%lu\n",
                        cpu, topology_get_cpu_scale(cpu));
        }
}

For every possible cpu, the raw cpu capacity that was read is scaled and stored in cpu_scale.

cpu_scale

drivers/base/arch_topology.c

DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE;

The global per-cpu variable cpu_scale holds the cpu capacity. When cpu topology is not used it holds the default value 1024.

topology_set_cpu_scale()

drivers/base/arch_topology.c

void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity)
{
        per_cpu(cpu_scale, cpu) = capacity;
}

Stores the requested scaled cpu capacity in the global per-cpu variable cpu_scale.

Applying the topology of a booted cpu

store_cpu_topology() – ARM64

arch/arm64/kernel/topology.c

void store_cpu_topology(unsigned int cpuid)
{
        struct cpu_topology *cpuid_topo = &cpu_topology[cpuid];
        u64 mpidr;

        if (cpuid_topo->package_id != -1)
                goto topology_populated;

        mpidr = read_cpuid_mpidr();

        /* Uniprocessor systems can rely on default topology values */
        if (mpidr & MPIDR_UP_BITMASK)
                return;

        /* Create cpu topology mapping based on MPIDR. */
        if (mpidr & MPIDR_MT_BITMASK) {
                /* Multiprocessor system : Multi-threads per core */
                cpuid_topo->thread_id  = MPIDR_AFFINITY_LEVEL(mpidr, 0);
                cpuid_topo->core_id    = MPIDR_AFFINITY_LEVEL(mpidr, 1);
                cpuid_topo->package_id = MPIDR_AFFINITY_LEVEL(mpidr, 2) |
                                         MPIDR_AFFINITY_LEVEL(mpidr, 3) << 8;
        } else {
                /* Multiprocessor system : Single-thread per core */
                cpuid_topo->thread_id  = -1;
                cpuid_topo->core_id    = MPIDR_AFFINITY_LEVEL(mpidr, 0);
                cpuid_topo->package_id = MPIDR_AFFINITY_LEVEL(mpidr, 1) |
                                         MPIDR_AFFINITY_LEVEL(mpidr, 2) << 8 |
                                         MPIDR_AFFINITY_LEVEL(mpidr, 3) << 16;
        }

        pr_debug("CPU%u: cluster %d core %d thread %d mpidr %#016llx\n",
                 cpuid, cpuid_topo->package_id, cpuid_topo->core_id,
                 cpuid_topo->thread_id, mpidr);

topology_populated:
        update_siblings_masks(cpuid);
}

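Only when the cpu topology could not be built from the device tree, the mpidr system register is read to build cpu_topology.

In code lines 6~7, if package_id is already set, execution jumps to the topology_populated: label so that only the sibling cpu masks are updated before leaving the function.
In code line 9, the mpidr register is read to determine the cpu affinity levels.
In code lines 12~13, the function returns on a uniprocessor system.
In code lines 16~21, if the mpidr value indicates a system with hw threads, all 4 affinity levels are applied: level 0 becomes thread_id, level 1 becomes core_id, and levels 2~3 are combined into package_id.
- As of this writing, arm and arm64 have not adopted h/w multithreading. (Not even cortex-a72, 73, 75.)
In code lines 22~29, otherwise thread_id is set to -1, affinity level 0 becomes core_id, and the remaining levels are combined into package_id.
In code lines 31~33, the debug log is printed.
In code lines 35~36, at the topology_populated: label, each sibling cpumask for the requested cpu is updated.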
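
The two decoding branches can be tried out in user space with the sketch below. It is a model under assumptions, not the kernel code: struct ids, affinity_level() and decode_mpidr() are hypothetical names, the field positions follow the arm64 MPIDR_EL1 layout (Aff0 to Aff2 in bits [23:0], Aff3 in bits [39:32], MT in bit 24), and the uniprocessor check is left out.

```c
#include <assert.h>
#include <stdint.h>

#define MT_BIT (1ULL << 24)   /* MPIDR_EL1.MT: affinity level 0 is a hardware thread */

struct ids {
        int thread_id;
        int core_id;
        int package_id;
};

/* Affinity field for level 0..3: bits [7:0], [15:8], [23:16] and [39:32]. */
static unsigned int affinity_level(uint64_t mpidr, int level)
{
        static const int shift[] = { 0, 8, 16, 32 };

        assert(level >= 0 && level <= 3);
        return (unsigned int)((mpidr >> shift[level]) & 0xff);
}

/* Map an MPIDR value to thread, core and package ids. */
static struct ids decode_mpidr(uint64_t mpidr)
{
        struct ids t;

        if (mpidr & MT_BIT) {
                /* multiple threads per core */
                t.thread_id  = (int)affinity_level(mpidr, 0);
                t.core_id    = (int)affinity_level(mpidr, 1);
                t.package_id = (int)(affinity_level(mpidr, 2) |
                                     affinity_level(mpidr, 3) << 8);
        } else {
                /* single thread per core */
                t.thread_id  = -1;
                t.core_id    = (int)affinity_level(mpidr, 0);
                t.package_id = (int)(affinity_level(mpidr, 1) |
                                     affinity_level(mpidr, 2) << 8 |
                                     affinity_level(mpidr, 3) << 16);
        }
        return t;
}
```

For example, the value 0x80000103 (MT bit clear) decodes to core 3 in package 1 with no thread id, while 0x81000201 (MT bit set) decodes to thread 1 of core 2 in package 0.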

The following figure shows the mpidr value being read to update core_id and package_id, and the related sibling cpu masks being updated.

update_siblings_masks()

drivers/base/arch_topology.c

static void update_siblings_masks(unsigned int cpuid)
{
        struct cpu_topology *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
        int cpu;

        /* update core and thread sibling masks */
        for_each_possible_cpu(cpu) {
                cpu_topo = &cpu_topology[cpu];

                if (cpuid_topo->llc_id == cpu_topo->llc_id) {
                        cpumask_set_cpu(cpu, &cpuid_topo->llc_sibling);
                        cpumask_set_cpu(cpuid, &cpu_topo->llc_sibling);
                }

                if (cpuid_topo->package_id != cpu_topo->package_id)
                        continue;

                cpumask_set_cpu(cpuid, &cpu_topo->core_sibling);
                cpumask_set_cpu(cpu, &cpuid_topo->core_sibling);

                if (cpuid_topo->core_id != cpu_topo->core_id)
                        continue;

                cpumask_set_cpu(cpuid, &cpu_topo->thread_sibling);
                cpumask_set_cpu(cpu, &cpuid_topo->thread_sibling);
        }
}

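Updates the sibling cpu masks for the requested cpu.

In code lines 7~13, iterating over the possible cpus, if the llc_id of the cpu being visited equals the llc_id of the requested cpu, the llc_sibling bit is set in both cpus.
In code lines 15~16, the cpu being visited is skipped if its package_id differs from that of the requested cpu.
In code lines 18~19, the core_sibling bits are set for the visited cpu and the requested cpu.
In code lines 21~22, the cpu being visited is skipped if its core_id differs from that of the requested cpu.
In code lines 24~25, the thread_sibling bits are set for the visited cpu and the requested cpu.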
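
The symmetric mask update can be sketched with a small fixed table. This is a simplified model under assumptions: struct topo, cpu_topo[] and update_masks() are hypothetical, 64-bit integers replace cpumask_t, and the llc_sibling handling is left out.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 4

struct topo {
        int core_id;
        int package_id;
        uint64_t core_sibling;
        uint64_t thread_sibling;
};

/* Two packages with two single-thread cores each (illustrative values). */
static struct topo cpu_topo[NR_CPUS] = {
        { 0, 0, 0, 0 }, { 1, 0, 0, 0 }, { 0, 1, 0, 0 }, { 1, 1, 0, 0 },
};

/* Record cpuid in the masks of its siblings, and the siblings in its own masks. */
static void update_masks(int cpuid)
{
        struct topo *me, *other;
        int cpu;

        assert(cpuid >= 0 && cpuid < NR_CPUS);
        me = &cpu_topo[cpuid];
        for (cpu = 0; cpu < NR_CPUS; cpu++) {
                other = &cpu_topo[cpu];
                if (me->package_id != other->package_id)
                        continue;       /* different package: not a sibling */
                other->core_sibling |= 1ULL << cpuid;
                me->core_sibling |= 1ULL << cpu;
                if (me->core_id != other->core_id)
                        continue;       /* same package, different core */
                other->thread_sibling |= 1ULL << cpuid;
                me->thread_sibling |= 1ULL << cpu;
        }
}
```

After calling update_masks() once for each of the four cpus, cpu 0 and cpu 1 share the core_sibling mask 0x3, cpu 2 and cpu 3 share 0xc, and every thread_sibling mask contains only the cpu itself.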

References

Scheduler -1- (Basic) | 문c
Scheduler -2- (Global Cpu Load) | 문c
Scheduler -3- (PELT) | 문c
Scheduler -4- (Group Scheduling) | 문c
Scheduler -5- (Scheduler Core) | 문c
Scheduler -6- (CFS Scheduler) | 문c
Scheduler -7- (Preemption & Context Switch) | 문c
Scheduler -8- (CFS Bandwidth) | 문c
Scheduler -9- (RT Scheduler) | 문c
Scheduler -10- (Deadline Scheduler) | 문c
Scheduler -11- (Stop Scheduler) | 문c
Scheduler -12- (Idle Scheduler) | 문c
Scheduler -13- (Scheduling Domain 1) | 문c – this post
Scheduler -14- (Scheduling Domain 2) | 문c
Scheduler -15- (Load Balance 1) | 문c
Scheduler -16- (Load Balance 2) | 문c
Scheduler -17- (Load Balance 3 NUMA) | 문c
Scheduler -18- (Load Balance 4 EAS) | 문c
Scheduler -19- (Initialization) | 문c
PID Management | 문c
do_fork() | 문c
cpu_startup_entry() | 문c
Runqueue Load Average (cpu_load[]) – v4.0 | 문c
PELT(Per-Entity Load Tracking) – v4.0 | 문c

Scheduling Domains | LWN.net

call_function_init()

2017-06-22 (updated 2020-03-11) 문영일

Moved to: Interrupts -6- (IPI Cross-call) | 문c

Interrupts -8- (Workqueue 2)

2017-06-21 (updated 2020-07-11) 문영일

Worker

Worker creation

create_worker()

kernel/workqueue.c

/**
 * create_worker - create a new workqueue worker
 * @pool: pool the new worker will belong to
 *
 * Create and start a new worker which is attached to @pool.
 *
 * CONTEXT:
 * Might sleep.  Does GFP_KERNEL allocations.
 *
 * Return:
 * Pointer to the newly created worker.
 */

static struct worker *create_worker(struct worker_pool *pool)
{
        struct worker *worker = NULL;
        int id = -1;
        char id_buf[16];

        /* ID is needed to determine kthread name */
        id = ida_simple_get(&pool->worker_ida, 0, 0, GFP_KERNEL);
        if (id < 0)
                goto fail;

        worker = alloc_worker(pool->node);
        if (!worker)
                goto fail;

        worker->pool = pool;
        worker->id = id;

        if (pool->cpu >= 0)
                snprintf(id_buf, sizeof(id_buf), "%d:%d%s", pool->cpu, id,
                         pool->attrs->nice < 0  ? "H" : "");
        else
                snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);

        worker->task = kthread_create_on_node(worker_thread, worker, pool->node,
                                              "kworker/%s", id_buf);
        if (IS_ERR(worker->task))
                goto fail;

        set_user_nice(worker->task, pool->attrs->nice);
        kthread_bind_mask(worker->task, pool->attrs->cpumask);

        /* successful, attach the worker to the pool */
        worker_attach_to_pool(worker, pool);

        /* start the newly created worker */
        spin_lock_irq(&pool->lock);
        worker->pool->nr_workers++;
        worker_enter_idle(worker);
        wake_up_process(worker->task);
        spin_unlock_irq(&pool->lock);

        return worker;

fail:
        if (id >= 0)
                ida_simple_remove(&pool->worker_ida, id);
        kfree(worker);
        return NULL;
}

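Creates a worker thread in the requested worker pool.

In code lines 8~10, an id is obtained for the new worker to be created in the worker pool.
- pool->worker_ida manages the allocation of worker ids and is based on the IDR radix tree.
In code lines 12~17, a worker object is allocated and its worker pool and id are set.
In code lines 19~23, the name string for the worker thread is built.
- For a worker pool bound to a cpu, it is formed according to the nice value as follows.
  - “<cpu>:<worker id>H” – the pool's nice value is below 0, so the thread has a higher priority than the default nice.
  - “<cpu>:<worker id>” – the thread's priority is equal to or lower than the default nice.
- For an unbound worker pool it is formed as “u<pool id>:<worker id>”.
In code lines 25~28, a kernel thread that runs worker_thread() is created with kthread and assigned to the worker's task.
In code line 30, the worker thread's static_prio is set to the priority converted from the nice value in the worker pool attributes.
In code line 31, the cpu mask is set on the worker thread. The PF_NO_SETAFFINITY flag is also added so that user space cannot change its cpu affinity afterwards.
In code line 34, the worker thread is attached to the worker pool.
In code line 38, nr_workers, the number of workers in the pool, is incremented by 1.
In code lines 39~40, the worker is put into the idle state and its thread is then woken up.
In code line 43, the created worker is returned.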
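
The naming rules from code lines 19~28 can be reproduced in user space with the sketch below. worker_name() is a hypothetical helper, and the intermediate buffer is made larger than the 16 bytes used in the listing so that no combination of integers can be truncated.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Build the thread name of a worker. cpu < 0 means an unbound pool.
 * Returns the number of characters snprintf() would have written.
 */
static int worker_name(char *buf, size_t len, int cpu, int pool_id, int id, int nice)
{
        char id_buf[32];

        assert(buf != NULL && len > 0);
        if (cpu >= 0)
                snprintf(id_buf, sizeof(id_buf), "%d:%d%s", cpu, id,
                         nice < 0 ? "H" : "");
        else
                snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool_id, id);
        return snprintf(buf, len, "kworker/%s", id_buf);
}
```

A worker with id 2 in the high-priority pool of cpu 1 is named kworker/1:2H, and a worker with id 0 in unbound pool 4 is named kworker/u4:0.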

alloc_worker()

kernel/workqueue.c

static struct worker *alloc_worker(int node)
{
        struct worker *worker;

        worker = kzalloc_node(sizeof(*worker), GFP_KERNEL, node);
        if (worker) {
                INIT_LIST_HEAD(&worker->entry);
                INIT_LIST_HEAD(&worker->scheduled);
                INIT_LIST_HEAD(&worker->node);
                /* on creation a worker is in !idle && prep state */
                worker->flags = WORKER_PREP;
        }
        return worker;
}

Allocates a worker structure and initializes the lists it manages internally.

Attaching a worker to a worker pool

worker_attach_to_pool()

kernel/workqueue.c

/**
 * worker_attach_to_pool() - attach a worker to a pool
 * @worker: worker to be attached
 * @pool: the target pool
 *
 * Attach @worker to @pool.  Once attached, the %WORKER_UNBOUND flag and
 * cpu-binding of @worker are kept coordinated with the pool across
 * cpu-[un]hotplugs.
 */

static void worker_attach_to_pool(struct worker *worker,
                                   struct worker_pool *pool)
{
        mutex_lock(&pool->attach_mutex);

        /*
         * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
         * online CPUs.  It'll be re-applied when any of the CPUs come up.
         */
        set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);

        /*
         * The pool->attach_mutex ensures %POOL_DISASSOCIATED remains
         * stable across this function.  See the comments above the
         * flag definition for details.
         */
        if (pool->flags & POOL_DISASSOCIATED)
                worker->flags |= WORKER_UNBOUND;

        list_add_tail(&worker->node, &pool->workers);

        mutex_unlock(&pool->attach_mutex);
}

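Attaches a worker to a worker pool.

Code line 10: applies the cpumask from the worker pool attributes to the worker thread.
- A bound worker pool has only the mask bit of its own cpu set.
- An unbound worker pool typically has the mask bits of all online cpus set.
Code lines 17~19: if the worker pool is disassociated from its cpu (POOL_DISASSOCIATED), sets the WORKER_UNBOUND flag on the worker.
Code line 20: adds the worker to the workers list of the worker pool.

Detaching a worker from a worker pool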

worker_detach_from_pool()

kernel/workqueue.c

/**
 * worker_detach_from_pool() - detach a worker from its pool
 * @worker: worker which is attached to its pool
 *
 * Undo the attaching which had been done in worker_attach_to_pool().  The
 * caller worker shouldn't access to the pool after detached except it has
 * other reference to the pool.
 */

static void worker_detach_from_pool(struct worker *worker)
{
        struct worker_pool *pool = worker->pool;
        struct completion *detach_completion = NULL;

        mutex_lock(&pool->attach_mutex);

        list_del(&worker->node);
        worker->pool = NULL;

        if (list_empty(&pool->workers))
                detach_completion = pool->detach_completion;
        mutex_unlock(&pool->attach_mutex);

        /* clear leftover flags without pool->lock after it is detached */
        worker->flags &= ~(WORKER_UNBOUND | WORKER_REBOUND);

        if (detach_completion)
                complete(detach_completion);
}

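Detaches a worker from its worker pool.

Code lines 8~9: removes the worker from the workers list of the worker pool.
Code lines 11~12: if the workers list has become empty, fetches the pool's detach_completion.
Code line 16: clears the WORKER_UNBOUND and WORKER_REBOUND flags of the worker.
Code lines 18~19: if a completion was fetched, calls complete() to notify the waiter that the last worker has been detached.

Worker entering idle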

worker_enter_idle()

kernel/workqueue.c

/**
 * worker_enter_idle - enter idle state
 * @worker: worker which is entering idle state
 *
 * @worker is entering idle state.  Update stats and idle timer if
 * necessary.
 *
 * LOCKING:
 * spin_lock_irq(pool->lock).
 */

static void worker_enter_idle(struct worker *worker)
{
        struct worker_pool *pool = worker->pool;

        if (WARN_ON_ONCE(worker->flags & WORKER_IDLE) ||
            WARN_ON_ONCE(!list_empty(&worker->entry) &&
                         (worker->hentry.next || worker->hentry.pprev)))
                return;

        /* can't use worker_set_flags(), also called from create_worker() */
        worker->flags |= WORKER_IDLE;
        pool->nr_idle++;
        worker->last_active = jiffies;

        /* idle_list is LIFO */
        list_add(&worker->entry, &pool->idle_list);

        if (too_many_workers(pool) && !timer_pending(&pool->idle_timer))
                mod_timer(&pool->idle_timer, jiffies + IDLE_WORKER_TIMEOUT);

        /*
         * Sanity check nr_running.  Because unbind_workers() releases
         * pool->lock between setting %WORKER_UNBOUND and zapping
         * nr_running, the warning may trigger spuriously.  Check iff
         * unbind is not in progress.
         */
        WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
                     pool->nr_workers == pool->nr_idle &&
                     atomic_read(&pool->nr_running));
}

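Marks the given worker thread as idle and increments the idle counter. Arms the idle timer if necessary.

Code lines 5~8: if the worker is already idle, or if it is still linked on a list through its entry while also being hashed in busy_hash[], prints a warning once and returns.
Code lines 11~13: sets the WORKER_IDLE flag on the worker and increments the idle counter by 1. Updates last_active to the current time.
Code line 16: adds the worker to the idle list of the worker pool.
Code lines 18~19: if too many workers in the pool are idle and the idle timer is not already pending, arms the idle timer to fire IDLE_WORKER_TIMEOUT (5 minutes) from now.
Code lines 27~29: if the worker pool is associated with its cpu and all of its workers are idle, yet nr_running is non-zero, prints a warning.
- When nr_workers (workers registered in the pool) equals nr_idle (idle workers), no worker can be running, so nr_running must be 0.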

too_many_workers()

kernel/workqueue.c

/* Do we have too many workers and should some go away? */
static bool too_many_workers(struct worker_pool *pool)
{
        bool managing = mutex_is_locked(&pool->manager_arb);
        int nr_idle = pool->nr_idle + managing; /* manager is considered idle */
        int nr_busy = pool->nr_workers - nr_idle;

        return nr_idle > 2 && (nr_idle - 2) * MAX_IDLE_WORKERS_RATIO >= nr_busy;
}

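Returns whether the worker pool has too many idle workers, that is, whether the idle workers reach a certain ratio of the busy workers.

Code lines 4~6: counts the idle workers (a worker acting as manager is also counted as idle) and obtains the busy count by subtracting them from the workers registered in the pool.
Code line 8: returns true when, after setting aside the 2 idle workers that are always kept, the remaining idle workers amount to at least a fixed ratio (1/4 by default) of the busy workers.
- Condition: (idle workers - 2) * MAX_IDLE_WORKERS_RATIO(4) >= busy workers
- Examples where this function returns true:
  - idle=3, busy=0~4
  - idle=4, busy=0~8
  - idle=5, busy=0~12
  - …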
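The thresholds listed above can be checked with a small user-space sketch. It only reproduces the arithmetic of too_many_workers(). The helper names too_many_idle() and max_busy_for() are made up for this illustration, the manager is not modelled, and plain counts are passed instead of a struct worker_pool.

```c
#include <assert.h>
#include <stdbool.h>

/* Same ratio as in the listing above: 1/4 of the busy workers may be idle. */
#define MAX_IDLE_WORKERS_RATIO 4

/*
 * Hypothetical helper: same comparison as too_many_workers(), but on
 * plain counts. The "manager counts as idle" adjustment is left out.
 */
static bool too_many_idle(int nr_idle, int nr_busy)
{
        assert(nr_idle >= 0 && nr_busy >= 0);
        return nr_idle > 2 && (nr_idle - 2) * MAX_IDLE_WORKERS_RATIO >= nr_busy;
}

/*
 * Hypothetical helper: largest busy count for which nr_idle idle workers
 * are still considered too many, or -1 if nr_idle can never be too many.
 */
static int max_busy_for(int nr_idle)
{
        return nr_idle > 2 ? (nr_idle - 2) * MAX_IDLE_WORKERS_RATIO : -1;
}
```

With these helpers, idle=3 is too many for busy=0~4 but not for busy=5, which matches the first row of the list.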

The following figure shows how it is decided whether there are too many idle workers.

Idle worker timeout -> reducing the number of idle workers

idle_worker_timeout()

kernel/workqueue.c

static void idle_worker_timeout(struct timer_list *t)
{
        struct worker_pool *pool = from_timer(pool, t, idle_timer);

        spin_lock_irq(&pool->lock);

        while (too_many_workers(pool)) {
                struct worker *worker;
                unsigned long expires;

                /* idle_list is kept in LIFO order, check the last one */
                worker = list_entry(pool->idle_list.prev, struct worker, entry);
                expires = worker->last_active + IDLE_WORKER_TIMEOUT;

                if (time_before(jiffies, expires)) {
                        mod_timer(&pool->idle_timer, expires);
                        break;
                }

                destroy_worker(worker);
        }

        spin_unlock_irq(&pool->lock);
}

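Called when the idle timer expires. While the pool has too many idle workers relative to busy workers (ratio 1/4 by default), destroys the idle workers whose idle time has exceeded the timeout.

Code line 7: repeats only while the idle workers in the pool reach the given ratio of busy workers, that is, while there are too many idle workers.
Code lines 12~21: takes the last worker of idle_list, which is the one that has been idle the longest. If IDLE_WORKER_TIMEOUT (5 minutes) has not yet passed since the worker was last active, re-arms the timer for that expiry time and leaves the loop. If the timeout has already passed, destroys the worker and continues the loop.

Destroying a worker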
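The expiry test relies on a wrap-safe comparison of jiffies values. The sketch below shows the idea in user space. tick_before() mirrors the signed-difference trick behind the kernel's time_before(), while idle_expired() is a hypothetical helper written only for this example.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Wrap-safe "a is before b": the unsigned difference is reinterpreted as
 * signed, so the result stays correct across counter wraparound as long
 * as a and b are less than half the counter range apart.
 */
static bool tick_before(unsigned long a, unsigned long b)
{
        return (long)(a - b) < 0;
}

/*
 * Hypothetical helper: has a worker that was last active at last_active
 * been idle for at least timeout ticks at time now?
 */
static bool idle_expired(unsigned long now, unsigned long last_active,
                         unsigned long timeout)
{
        unsigned long expires = last_active + timeout; /* may wrap around */

        assert(timeout <= (~0UL >> 1));
        return !tick_before(now, expires);
}
```

A plain `now < expires` would give the wrong answer when last_active + timeout wraps past the maximum counter value, which is why the signed difference is used.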

destroy_worker()

kernel/workqueue.c

/**
 * destroy_worker - destroy a workqueue worker
 * @worker: worker to be destroyed
 *
 * Destroy @worker and adjust @pool stats accordingly.  The worker should
 * be idle.
 *
 * CONTEXT:
 * spin_lock_irq(pool->lock).
 */

static void destroy_worker(struct worker *worker)
{
        struct worker_pool *pool = worker->pool;

        lockdep_assert_held(&pool->lock);

        /* sanity check frenzy */
        if (WARN_ON(worker->current_work) ||
            WARN_ON(!list_empty(&worker->scheduled)) ||
            WARN_ON(!(worker->flags & WORKER_IDLE)))
                return;

        pool->nr_workers--;
        pool->nr_idle--;

        list_del_init(&worker->entry);
        worker->flags |= WORKER_DIE;
        wake_up_process(worker->task);
}

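Destroys a worker, that is, terminates the worker thread.

Code lines 8~11: if the worker is executing a work, has works on its scheduled list, or does not have the idle flag, prints a warning and returns.
Code line 13: decrements nr_workers, the number of workers registered in the worker pool, by 1.
Code line 14: decrements nr_idle, the number of idle workers in the worker pool, by 1.
Code line 16: removes the worker from the idle_list of the worker pool.
Code lines 17~18: sets the WORKER_DIE flag on the worker and wakes up its task so that it terminates itself.
- The woken worker thread returns its worker id to the worker pool, detaches from the pool and frees the worker.
- While it is being torn down, the task name is changed to “kworker/dying”.

Finding the worker that is executing a work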

find_worker_executing_work()

kernel/workqueue.c

/**
 * find_worker_executing_work - find worker which is executing a work
 * @pool: pool of interest
 * @work: work to find worker for
 *
 * Find a worker which is executing @work on @pool by searching
 * @pool->busy_hash which is keyed by the address of @work.  For a worker
 * to match, its current execution should match the address of @work and
 * its work function.  This is to avoid unwanted dependency between
 * unrelated work executions through a work item being recycled while still
 * being executed.
 *
 * This is a bit tricky.  A work item may be freed once its execution
 * starts and nothing prevents the freed area from being recycled for
 * another work item.  If the same work item address ends up being reused
 * before the original execution finishes, workqueue will identify the
 * recycled work item as currently executing and make it wait until the
 * current execution finishes, introducing an unwanted dependency.
 *
 * This function checks the work item address and work function to avoid
 * false positives.  Note that this isn't complete as one may construct a
 * work function which can introduce dependency onto itself through a
 * recycled work item.  Well, if somebody wants to shoot oneself in the
 * foot that badly, there's only so much we can do, and if such deadlock
 * actually occurs, it should be easy to locate the culprit work function.
 *
 * CONTEXT:
 * spin_lock_irq(pool->lock).
 *
 * Return:
 * Pointer to worker which is executing @work if found, %NULL
 * otherwise.
 */

static struct worker *find_worker_executing_work(struct worker_pool *pool,
                                                 struct work_struct *work)
{
        struct worker *worker;

        hash_for_each_possible(pool->busy_hash, worker, hentry,
                               (unsigned long)work)
                if (worker->current_work == work &&
                    worker->current_func == work->func)
                        return worker;

        return NULL;
}

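Finds, among the busy workers, the worker that is currently executing the given work. Returns NULL if there is none.

Walks the busy_hash bucket selected by the work's address and looks for a worker whose current work and current function both match the requested work.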
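Why both the address and the function are compared can be illustrated with a user-space sketch. The types below are cut-down stand-ins invented for this example, and a linear scan over an array replaces the kernel's busy_hash lookup.

```c
#include <assert.h>
#include <stddef.h>

struct work;                               /* hypothetical stand-in for work_struct */
typedef void (*work_fn)(struct work *);

struct work {
        work_fn func;
};

struct busy_worker {                       /* hypothetical stand-in for struct worker */
        struct work *current_work;
        work_fn current_func;
};

/* Linear scan used here instead of the hash bucket walk of the kernel. */
static struct busy_worker *find_executing(struct busy_worker *busy, size_t n,
                                          const struct work *work)
{
        assert(busy != NULL && work != NULL);
        for (size_t i = 0; i < n; i++) {
                /* Address alone is not enough: the memory may have been reused. */
                if (busy[i].current_work == work &&
                    busy[i].current_func == work->func)
                        return &busy[i];
        }
        return NULL;
}

static void fn_a(struct work *w) { (void)w; }
static void fn_b(struct work *w) { (void)w; }
```

If the memory of a work item is freed and reused for a work with a different function while the old execution is still running, the address still matches but the function does not, so the lookup reports no collision.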

need_more_worker()

kernel/workqueue.c

/*
 * Need to wake up a worker?  Called from anything but currently
 * running workers.
 *
 * Note that, because unbound workers never contribute to nr_running, this
 * function will always return %true for unbound pools as long as the
 * worklist isn't empty.
 */

static bool need_more_worker(struct worker_pool *pool)
{
        return !list_empty(&pool->worklist) && __need_more_worker(pool);
}

Returns whether more workers are needed.

Returns true if works are pending on the worker pool's worklist and no worker is currently running.

__need_more_worker()

kernel/workqueue.c

/*
 * Policy functions.  These define the policies on how the global worker
 * pools are managed.  Unless noted otherwise, these functions assume that
 * they're being called with pool->lock held.
 */

static bool __need_more_worker(struct worker_pool *pool)
{
        return !atomic_read(&pool->nr_running);
}

Returns whether the worker pool has no running worker. (true = no worker is currently running.)

wake_up_worker()

kernel/workqueue.c

/**
 * wake_up_worker - wake up an idle worker
 * @pool: worker pool to wake worker from
 *
 * Wake up the first idle worker of @pool.
 *
 * CONTEXT:
 * spin_lock_irq(pool->lock).
 */

static void wake_up_worker(struct worker_pool *pool)
{
        struct worker *worker = first_idle_worker(pool);

        if (likely(worker))
                wake_up_process(worker->task);
}

Wakes up the first idle worker of the given worker pool.

first_idle_worker()

kernel/workqueue.c

/* Return the first idle worker.  Safe with preemption disabled */
static struct worker *first_idle_worker(struct worker_pool *pool)
{
        if (unlikely(list_empty(&pool->idle_list)))
                return NULL;

        return list_first_entry(&pool->idle_list, struct worker, entry);
}

Returns the first idle worker of the given worker pool.

Returns the first worker on idle_list, or NULL if the list is empty.

Worker thread operation
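Because idle_list is kept in LIFO order, the first entry is the worker that went idle most recently and the last entry is the one that has been idle the longest. The array-backed sketch below illustrates that ordering. struct idle_list and its helpers are hypothetical and only model worker ids, not the kernel's linked list.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_IDLE 8

/* Hypothetical array-backed idle list: index 0 is the head of the list. */
struct idle_list {
        int ids[MAX_IDLE];
        size_t len;
};

/* Like list_add() in worker_enter_idle(): the new entry becomes the head. */
static void idle_push(struct idle_list *l, int worker_id)
{
        assert(l != NULL && l->len < MAX_IDLE);
        for (size_t i = l->len; i > 0; i--)
                l->ids[i] = l->ids[i - 1];
        l->ids[0] = worker_id;
        l->len++;
}

/* Like first_idle_worker(): most recently idled worker, or -1 if none. */
static int idle_first(const struct idle_list *l)
{
        return l->len ? l->ids[0] : -1;
}

/* Like idle_list.prev in idle_worker_timeout(): longest-idle worker, or -1. */
static int idle_last(const struct idle_list *l)
{
        return l->len ? l->ids[l->len - 1] : -1;
}
```

Waking the head keeps recently used workers busy, while the timeout path inspects the tail, where the workers that have been idle the longest accumulate.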

worker_thread()

kernel/workqueue.c -1/2-

/**
 * worker_thread - the worker thread function
 * @__worker: self
 *
 * The worker thread function.  All workers belong to a worker_pool -
 * either a per-cpu one or dynamic unbound one.  These workers process all
 * work items regardless of their specific target workqueue.  The only
 * exception is work items which belong to workqueues with a rescuer which
 * will be explained in rescuer_thread().
 *
 * Return: 0
 */

static int worker_thread(void *__worker)
{
        struct worker *worker = __worker;
        struct worker_pool *pool = worker->pool;

        /* tell the scheduler that this is a workqueue worker */
        set_pf_worker(true);
woke_up:
        spin_lock_irq(&pool->lock);

        /* am I supposed to die? */
        if (unlikely(worker->flags & WORKER_DIE)) {
                spin_unlock_irq(&pool->lock);
                WARN_ON_ONCE(!list_empty(&worker->entry));
                set_pf_worker(false);

                set_task_comm(worker->task, "kworker/dying");
                ida_simple_remove(&pool->worker_ida, worker->id);
                worker_detach_from_pool(worker);
                kfree(worker);
                return 0;
        }

        worker_leave_idle(worker);
recheck:
        /* no more worker necessary? */
        if (!need_more_worker(pool))
                goto sleep;

        /* do we need to manage? */
        if (unlikely(!may_start_working(pool)) && manage_workers(worker))
                goto recheck;

        /*
         * ->scheduled list can only be filled while a worker is
         * preparing to process a work or actually processing it.
         * Make sure nobody diddled with it while I was sleeping.
         */
        WARN_ON_ONCE(!list_empty(&worker->scheduled));

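Code line 7: sets the PF_WQ_WORKER flag so that the scheduler can tell that this task is a workqueue worker thread.
Code lines 12~22: if the die flag is set on the worker thread, detaches the worker from the worker pool, frees it and exits.
Code line 24: takes the worker out of the idle state.
Code lines 27~28: if no additional worker is needed, jumps to sleep.
- No more workers are needed when the worklist is empty or when some worker is already running.
Code lines 31~32: if the worker pool has no idle worker, takes the manager role and creates workers. If management was actually performed, the conditions checked earlier may no longer hold, so jumps back to the recheck label.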

kernel/workqueue.c -2/2-

        /*
         * Finish PREP stage.  We're guaranteed to have at least one idle
         * worker or that someone else has already assumed the manager
         * role.  This is where @worker starts participating in concurrency
         * management if applicable and concurrency management is restored
         * after being rebound.  See rebind_workers() for details.
         */
        worker_clr_flags(worker, WORKER_PREP | WORKER_REBOUND);

        do {
                struct work_struct *work =
                        list_first_entry(&pool->worklist,
                                         struct work_struct, entry);

                pool->watchdog_ts = jiffies;

                if (likely(!(*work_data_bits(work) & WORK_STRUCT_LINKED))) {
                        /* optimization path, not strictly necessary */
                        process_one_work(worker, work);
                        if (unlikely(!list_empty(&worker->scheduled)))
                                process_scheduled_works(worker);
                } else {
                        move_linked_works(work, &worker->scheduled, NULL);
                        process_scheduled_works(worker);
                }
        } while (keep_working(pool));

        worker_set_flags(worker, WORKER_PREP);
sleep:
        /*
         * pool->lock is held and there's no work to process and no need to
         * manage, sleep.  Workers are woken up only while holding
         * pool->lock or from local cpu, so setting the current state
         * before releasing pool->lock is enough to prevent losing any
         * event.
         */
        worker_enter_idle(worker);
        __set_current_state(TASK_IDLE);
        spin_unlock_irq(&pool->lock);
        schedule();
        goto woke_up;
}

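Code line 8: clears the WORKER_PREP and WORKER_REBOUND flags of the worker.
Code lines 10~13: fetches the first work to process from the worklist of the worker pool.
Code line 15: records the current time (jiffies) in watchdog_ts.
Code lines 17~21: in the likely case that the work does not have the linked flag, processes that single work. Then, in the unlikely case that works have appeared on the worker's scheduled list, processes all of them as well.
Code lines 22~25: if the work has the linked flag, moves it together with the works linked after it to the scheduled list and processes them all.
Code line 26: repeats the loop while keep_working() is true, which normally means until the worklist of the worker pool has been drained.
Code line 28: puts the worker back into the prep state.
Code lines 37~38: marks the worker idle and sets the task state to TASK_IDLE, an uninterruptible sleep that does not contribute to the load average, so that it can be woken up later.
Code lines 40~41: sleeps through schedule(). Once woken up, jumps to the woke_up label to start over from the top.

The following figure shows how a worker thread processes works.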

manage_workers()

kernel/workqueue.c

/**
 * manage_workers - manage worker pool
 * @worker: self
 *
 * Assume the manager role and manage the worker pool @worker belongs
 * to.  At any given time, there can be only zero or one manager per
 * pool.  The exclusion is handled automatically by this function.
 *
 * The caller can safely start processing works on false return.  On
 * true return, it's guaranteed that need_to_create_worker() is false
 * and may_start_working() is true.
 *
 * CONTEXT:             
 * spin_lock_irq(pool->lock) which may be released and regrabbed
 * multiple times.  Does GFP_KERNEL allocations.
 *
 * Return:
 * %false if the pool doesn't need management and the caller can safely
 * start processing works, %true if management function was performed and
 * the conditions that the caller verified before calling the function may
 * no longer be true.
 */

static bool manage_workers(struct worker *worker)
{
        struct worker_pool *pool = worker->pool;

        if (pool->flags & POOL_MANAGER_ACTIVE)
                return false;

        pool->flags |= POOL_MANAGER_ACTIVE;
        pool->manager = worker;

        maybe_create_worker(pool);

        pool->manager = NULL;
        pool->flags &= ~POOL_MANAGER_ACTIVE;
        wake_up(&wq_manager_wait);
        return true;
}

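Takes the manager role and creates as many workers as needed in the worker pool. If another worker is already acting as the manager of the pool (POOL_MANAGER_ACTIVE), returns false without doing anything.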

maybe_create_worker()

kernel/workqueue.c

/**
 * maybe_create_worker - create a new worker if necessary
 * @pool: pool to create a new worker for
 *
 * Create a new worker for @pool if necessary.  @pool is guaranteed to
 * have at least one idle worker on return from this function.  If
 * creating a new worker takes longer than MAYDAY_INTERVAL, mayday is
 * sent to all rescuers with works scheduled on @pool to resolve
 * possible allocation deadlock.
 *
 * On return, need_to_create_worker() is guaranteed to be %false and
 * may_start_working() %true.
 *
 * LOCKING:
 * spin_lock_irq(pool->lock) which may be released and regrabbed
 * multiple times.  Does GFP_KERNEL allocations.  Called only from
 * manager.
 */

static void maybe_create_worker(struct worker_pool *pool)
__releases(&pool->lock)
__acquires(&pool->lock)
{
restart:
        spin_unlock_irq(&pool->lock);

        /* if we don't make progress in MAYDAY_INITIAL_TIMEOUT, call for help */
        mod_timer(&pool->mayday_timer, jiffies + MAYDAY_INITIAL_TIMEOUT);

        while (true) {
                if (create_worker(pool) || !need_to_create_worker(pool))
                        break;

                schedule_timeout_interruptible(CREATE_COOLDOWN);

                if (!need_to_create_worker(pool))
                        break;
        }

        del_timer_sync(&pool->mayday_timer);
        spin_lock_irq(&pool->lock);
        /*
         * This is necessary even after a new worker was just successfully
         * created as @pool->lock was dropped and the new worker might have
         * already become busy.
         */
        if (need_to_create_worker(pool))
                goto restart;
}

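Creates as many workers as needed in the given worker pool.

Code line 9: re-arms the mayday timer to expire MAYDAY_INITIAL_TIMEOUT from now.
Code lines 11~19: loops until a worker has been created or is no longer needed. When creation fails, sleeps for CREATE_COOLDOWN before retrying.
Code line 21: deletes the mayday timer.
Code lines 28~29: checks once more whether a worker has to be created and restarts if so.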

need_to_create_worker()

kernel/workqueue.c

/* Do we need a new worker?  Called from manager. */
static bool need_to_create_worker(struct worker_pool *pool)
{
        return need_more_worker(pool) && !may_start_working(pool);
}

Returns whether a worker needs to be created: the worker pool needs more workers, but no idle worker is ready to start working right away.

Processing linked (scheduled) works

move_linked_works()

kernel/workqueue.c

/**                               
 * move_linked_works - move linked works to a list
 * @work: start of series of works to be scheduled
 * @head: target list to append @work to
 * @nextp: out paramter for nested worklist walking
 *
 * Schedule linked works starting from @work to @head.  Work series to
 * be scheduled starts at @work and includes any consecutive work with
 * WORK_STRUCT_LINKED set in its predecessor.
 *      
 * If @nextp is not NULL, it's updated to point to the next work of
 * the last scheduled work.  This allows move_linked_works() to be
 * nested inside outer list_for_each_entry_safe().
 *
 * CONTEXT:
 * spin_lock_irq(pool->lock).
 */

static void move_linked_works(struct work_struct *work, struct list_head *head,
                              struct work_struct **nextp)
{       
        struct work_struct *n;

        /*
         * Linked worklist will always end before the end of the list,
         * use NULL for list head.
         */
        list_for_each_entry_safe_from(work, n, NULL, entry) {
                list_move_tail(&work->entry, head);
                if (!(*work_data_bits(work) & WORK_STRUCT_LINKED))
                        break;
        }

        /*
         * If we're already inside safe list traversal and have moved
         * multiple works to the scheduled queue, the next position
         * needs to be updated.
         */
        if (nextp)
                *nextp = n;
}

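Moves linked works to the list given as the second argument, head.

Code lines 10~14: walks the works starting from work and moves each one to the tail of head. Leaves the loop after moving a work that does not have the linked flag.
Code lines 21~22: if the output argument nextp is given, stores in it the work that follows the last moved one.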
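The rule "move the first work and every following work whose predecessor is linked" can be sketched with arrays. This is only a user-space illustration: struct item and take_linked_run() are hypothetical names, and items are copied into dst instead of being unlinked from a kernel list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct item {                 /* hypothetical stand-in for work_struct */
        int id;
        bool linked;          /* models WORK_STRUCT_LINKED: the next item belongs to this run */
};

/*
 * Copy the run that starts at src[start] to the end of dst: the first item
 * plus every following item whose predecessor has the linked flag set.
 * Returns the index of the first item that was not taken, which plays the
 * role of *nextp. The caller must provide enough room in dst.
 */
static size_t take_linked_run(const struct item *src, size_t n, size_t start,
                              struct item *dst, size_t *dst_len)
{
        size_t i = start;

        assert(src != NULL && dst != NULL && dst_len != NULL);
        while (i < n) {
                dst[(*dst_len)++] = src[i];
                if (!src[i++].linked)
                        break;
        }
        return i;
}
```

For a queue whose first two items are linked, a call starting at index 0 takes three items and returns index 3 as the next position.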

process_scheduled_works()

kernel/workqueue.c

/**
 * process_scheduled_works - process scheduled works
 * @worker: self
 *
 * Process all scheduled works.  Please note that the scheduled list
 * may change while processing a work, so this function repeatedly
 * fetches a work from the top and executes it.
 *
 * CONTEXT:
 * spin_lock_irq(pool->lock) which may be released and regrabbed
 * multiple times.
 */

static void process_scheduled_works(struct worker *worker)
{
        while (!list_empty(&worker->scheduled)) {
                struct work_struct *work = list_first_entry(&worker->scheduled,
                                                struct work_struct, entry);
                process_one_work(worker, work);
        }
}

Processes all scheduled works.

Repeatedly takes the first work on the worker's scheduled list and processes it until the list is empty.

Processing a single work

process_one_work()

Processes one work on a worker.

kernel/workqueue.c – 1/4

/**
 * process_one_work - process single work
 * @worker: self
 * @work: work to process
 *
 * Process @work.  This function contains all the logics necessary to
 * process a single work including synchronization against and
 * interaction with other workers on the same cpu, queueing and
 * flushing.  As long as context requirement is met, any worker can
 * call this function to process a work.
 *
 * CONTEXT:
 * spin_lock_irq(pool->lock) which is released and regrabbed.
 */

static void process_one_work(struct worker *worker, struct work_struct *work)
__releases(&pool->lock)
__acquires(&pool->lock)
{
        struct pool_workqueue *pwq = get_work_pwq(work);
        struct worker_pool *pool = worker->pool;
        bool cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE;
        int work_color;
        struct worker *collision;
#ifdef CONFIG_LOCKDEP
        /*
         * It is permissible to free the struct work_struct from
         * inside the function that is called from it, this we need to
         * take into account for lockdep too.  To avoid bogus "held
         * lock freed" warnings as well as problems when looking into
         * work->lockdep_map, make a copy and use that here.
         */
        struct lockdep_map lockdep_map;

        lockdep_copy_map(&lockdep_map, &work->lockdep_map);
#endif
        /* ensure we're on the correct CPU */
        WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
                     raw_smp_processor_id() != pool->cpu);

Code line 7: finds out whether the workqueue was created with the cpu intensive flag.

kernel/workqueue.c – 2/4

        /*
         * A single work shouldn't be executed concurrently by
         * multiple workers on a single cpu.  Check whether anyone is
         * already processing the work.  If so, defer the work to the
         * currently executing one.
         */
        collision = find_worker_executing_work(pool, work);
        if (unlikely(collision)) {
                move_linked_works(work, &collision->scheduled, NULL);
                return;
        }

        /* claim and dequeue */
        debug_work_deactivate(work);
        hash_add(pool->busy_hash, &worker->hentry, (unsigned long)work);
        worker->current_work = work;
        worker->current_func = work->func;
        worker->current_pwq = pwq;
        work_color = get_work_color(work);

        /*
         * Record wq name for cmdline and debug reporting, may get
         * overridden through set_worker_desc().
         */
        strscpy(worker->desc, pwq->wq->name, WORKER_DESC_LEN);

        list_del_init(&work->entry);

        /*
         * CPU intensive works don't participate in concurrency management.
         * They're the scheduler's responsibility.  This takes @worker out
         * of concurrency management and the next code block will chain
         * execution of the pending work items.
         */
        if (unlikely(cpu_intensive))
                worker_set_flags(worker, WORKER_CPU_INTENSIVE);

        /*
         * Wake up another worker if necessary.  The condition is always
         * false for normal per-cpu workers since nr_running would always
         * be >= 1 at this point.  This is used to chain execution of the
         * pending work items for WORKER_NOT_RUNNING workers such as the
         * UNBOUND and CPU_INTENSIVE ones.
         */
        if (need_more_worker(pool))
                wake_up_worker(pool);

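Code lines 7~11: if the requested work is already being executed in the worker pool, appends it to the scheduled list of the worker executing it and returns.
- This prevents the same work from running concurrently on several workers of one cpu.
Code lines 15~18: adds the worker to busy_hash, using the work's address as the key. The worker also records the work, its function and the pool workqueue being processed.
Code line 19: obtains the work color of the current work.
Code line 25: records the workqueue name in the worker's description (desc), which is used for cmdline and debug reporting.
Code line 27: removes the work from the list it was on.
Code lines 35~36: if a cpu intensive workqueue is used, sets the cpu intensive flag on the worker.
Code lines 45~46: if more workers are needed, wakes up the first idle worker.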

kernel/workqueue.c – 3/4

        /*
         * Record the last pool and clear PENDING which should be the last
         * update to @work.  Also, do this inside @pool->lock so that
         * PENDING and queued state changes happen together while IRQ is
         * disabled.
         */
        set_work_pool_and_clear_pending(work, pool->id);

        spin_unlock_irq(&pool->lock);

        lock_map_acquire_read(&pwq->wq->lockdep_map);
        lock_map_acquire(&lockdep_map);
        /*
         * Strictly speaking we should mark the invariant state without holding
         * any locks, that is, before these two lock_map_acquire()'s.
         *
         * However, that would result in:
         *
         *   A(W1)
         *   WFC(C)
         *              A(W1)
         *              C(C)
         *
         * Which would create W1->C->W1 dependencies, even though there is no
         * actual deadlock possible. There are two solutions, using a
         * read-recursive acquire on the work(queue) 'locks', but this will then
         * hit the lockdep limitation on recursive locks, or simply discard
         * these locks.
         *
         * AFAICT there is no possible deadlock scenario between the
         * flush_work() and complete() primitives (except for single-threaded
         * workqueues), so hiding them isn't a problem.
         */
        lockdep_invariant_state(true);
        trace_workqueue_execute_start(work);
        worker->current_func(work);
        /*
         * While we must be careful to not use "work" after this, the trace
         * point will only record its address.
         */
        trace_workqueue_execute_end(work);
        lock_map_release(&lockdep_map);
        lock_map_release(&pwq->wq->lockdep_map);

        if (unlikely(in_atomic() || lockdep_depth(current) > 0)) {
                pr_err("BUG: workqueue leaked lock or atomic: %s/0x%08x/%d\n"
                       "     last function: %pf\n",
                       current->comm, preempt_count(), task_pid_nr(current),
                       worker->current_func);
                debug_show_held_locks(current);
                dump_stack();
        }

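Code line 7: records the worker pool id in the work and clears the remaining flags, including the pending bit.
Code line 36: calls the function recorded in the worker (the function assigned to the work).
Code lines 45~52: if the work function returned while still in atomic context (for example with preemption disabled) or while still holding a lock, prints an error message and dumps the stack.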

kernel/workqueue.c – 4/4

        /*
         * The following prevents a kworker from hogging CPU on !PREEMPT
         * kernels, where a requeueing work item waiting for something to
         * happen could deadlock with stop_machine as such work item could
         * indefinitely requeue itself while all other CPUs are trapped in
         * stop_machine. At the same time, report a quiescent RCU state so
         * the same condition doesn't freeze RCU.
         */
        cond_resched();

        spin_lock_irq(&pool->lock);

        /* clear cpu intensive status */
        if (unlikely(cpu_intensive))
                worker_clr_flags(worker, WORKER_CPU_INTENSIVE);

        /* tag the worker for identification in schedule() */
        worker->last_func = worker->current_func;

        /* we're done with it, release */
        hash_del(&worker->hentry);
        worker->current_work = NULL;
        worker->current_func = NULL;
        worker->current_pwq = NULL;
        worker->desc_valid = false;
        pwq_dec_nr_in_flight(pwq, work_color);
}

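Code line 9: yields the cpu if a reschedule is pending.
Code lines 14~15: if cpu_intensive was set, clears the cpu intensive flag of the worker.
Code line 18: records the function that was executed last.
Code line 21: removes the worker from busy_hash.
Code lines 22~25: resets the work information that was set on the worker.
Code line 26: decrements by 1 the number of in-flight works for this work color.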

Work

Creating a static work

Compile-time work creation and initialization macros

DECLARE_WORK()
DECLARE_DELAYED_WORK()
DECLARE_DEFERRABLE_WORK()

DECLARE_WORK()

include/linux/workqueue.h

#define DECLARE_WORK(n, f)                                              \
        struct work_struct n = __WORK_INITIALIZER(n, f)

Declares a work statically at compile time and assigns its work function.

include/linux/workqueue.h

#define __WORK_INITIALIZER(n, f) {                                      \
        .data = WORK_DATA_STATIC_INIT(),                                \
        .entry  = { &(n).entry, &(n).entry },                           \
        .func = (f),                                                    \
        __WORK_INIT_LOCKDEP_MAP(#n, &(n))                               \
        }

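Sets and initializes the main members of the work structure.

The data member gets an initial value indicating that no worker pool has been assigned yet and that the work was created statically at compile time.
The entry member is initialized to point to itself because the work is not yet queued on any worker pool.
The func member is set to the function to execute.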

include/linux/workqueue.h

#define WORK_DATA_STATIC_INIT() \
        ATOMIC_LONG_INIT(WORK_STRUCT_NO_POOL | WORK_STRUCT_STATIC)

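Initial value of the work's data member, indicating that no worker pool has been assigned yet and that the work was created statically at compile time.

In the WORK_STRUCT_NO_POOL value, all bits of the pool id field are set to 1, excluding the few low-order bits that hold flags.
The WORK_STRUCT_STATIC flag corresponds to bit 4 and is set to 1 only when the CONFIG_DEBUG_OBJECTS_WORK kernel option is used.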
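The "all pool id bits set means no pool" encoding can be sketched as follows. The layout is illustrative only: the DEMO_ constants, in particular the 5 flag bits, are assumptions made for this example, and the real shifts and widths depend on the kernel version and configuration.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative layout only: the number of flag bits is an assumption. */
#define DEMO_FLAG_BITS   5
#define DEMO_FLAG_MASK   ((1UL << DEMO_FLAG_BITS) - 1)
#define DEMO_POOL_NONE   (~0UL >> DEMO_FLAG_BITS)          /* every pool id bit set */
#define DEMO_NO_POOL     (DEMO_POOL_NONE << DEMO_FLAG_BITS)

/* Extract the pool id field, ignoring the low flag bits. */
static unsigned long demo_pool_id(unsigned long data)
{
        return data >> DEMO_FLAG_BITS;
}

/* A data word refers to a pool unless its id field is the reserved value. */
static bool demo_has_pool(unsigned long data)
{
        return demo_pool_id(data) != DEMO_POOL_NONE;
}

/* Build a data word that records pool_id with all flag bits cleared. */
static unsigned long demo_set_pool(unsigned long pool_id)
{
        assert(pool_id < DEMO_POOL_NONE);
        return pool_id << DEMO_FLAG_BITS;
}
```

Setting a flag such as bit 4 does not disturb the pool id field, so a statically initialized work still reads as having no pool.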

Creating a dynamic work

Used at run time to initialize an existing work structure.

INIT_WORK()
INIT_WORK_ONSTACK()
INIT_DELAYED_WORK()
INIT_DELAYED_WORK_ONSTACK()
INIT_DEFERRABLE_WORK()
INIT_DEFERRABLE_WORK_ONSTACK()

INIT_WORK()

include/linux/workqueue.h

#define INIT_WORK(_work, _func)                                         \
        __INIT_WORK((_work), (_func), 0)

Assigns the work function to the given work and initializes it. (Used at run time.)

include/linux/workqueue.h

#define __INIT_WORK(_work, _func, _onstack)                             \
        do {                                                            \
                __init_work((_work), _onstack);                         \
                (_work)->data = (atomic_long_t) WORK_DATA_INIT();       \
                INIT_LIST_HEAD(&(_work)->entry);                        \
                (_work)->func = (_func);                                \
        } while (0)
#endif

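Assigns the function to the given work and initializes it at run time.

The data member gets an initial value indicating that no worker pool has been assigned yet.
The entry member is initialized to point to itself because the work is not yet queued on any worker pool.
The func member is set to the function to execute.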

include/linux/workqueue.h

#define WORK_DATA_INIT()        ATOMIC_LONG_INIT(WORK_STRUCT_NO_POOL)

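Initial value of the work's data member: the pool id field is set to WORK_STRUCT_NO_POOL to indicate that no worker pool has been assigned yet.

In the WORK_STRUCT_NO_POOL value, all bits of the pool id field are set to 1, excluding the few low-order bits that hold flags.

Enqueueing works

A work can be enqueued on the global (system) workqueue or on a specified workqueue.

schedule_work()
- Enqueues a work on the global (system) workqueue.
queue_work()
- Enqueues a work on the specified workqueue.

A delayed work can be enqueued on the global (system) workqueue or on a specified workqueue. The delay is given in ticks (jiffies).

schedule_delayed_work()
- Enqueues a delayed work on the global (system) workqueue.
queue_delayed_work()
- Enqueues a delayed work on the specified workqueue.

Appending _on to the names of the four APIs above lets the caller specify the cpu that the work is bound to.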

schedule_work_on()
queue_work_on()
schedule_delayed_work_on()
queue_delayed_work_on()

Enqueueing a work on the global workqueue

schedule_work()

include/linux/workqueue.h

/**
 * schedule_work - put work task in global workqueue
 * @work: job to be done
 *
 * Returns %false if @work was already on the kernel-global workqueue and
 * %true otherwise.
 *
 * This puts a job in the kernel-global workqueue if it was not already
 * queued and leaves it in the same position on the kernel-global
 * workqueue otherwise.
 */

static inline bool schedule_work(struct work_struct *work)
{
        return queue_work(system_wq, work);
}

Enqueues the work on the system workqueue.

Enqueuing Work on a Specified Workqueue

queue_work()

include/linux/workqueue.h

/**
 * queue_work - queue work on a workqueue
 * @wq: workqueue to use
 * @work: work to queue
 *
 * Returns %false if @work was already on a queue, %true otherwise.
 *
 * We queue the work to the CPU on which it was submitted, but if the CPU dies
 * it can be processed by another CPU.
 */

static inline bool queue_work(struct workqueue_struct *wq,
                              struct work_struct *work)
{
        return queue_work_on(WORK_CPU_UNBOUND, wq, work);
}

Enqueues the work on workqueue @wq, requesting that it run on the current CPU when possible. Returns false if the work is already queued.

queue_work_on()

kernel/workqueue.c

/**
 * queue_work_on - queue work on specific cpu
 * @cpu: CPU number to execute work on
 * @wq: workqueue to use
 * @work: work to queue
 *
 * We queue the work to a specific CPU, the caller must ensure it
 * can't go away.
 *
 * Return: %false if @work was already on a queue, %true otherwise.
 */

bool queue_work_on(int cpu, struct workqueue_struct *wq,
                   struct work_struct *work)
{
        bool ret = false;
        unsigned long flags;

        local_irq_save(flags);

        if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
                __queue_work(cpu, wq, work);
                ret = true;   
        }
        
        local_irq_restore(flags);
        return ret;
}
EXPORT_SYMBOL(queue_work_on);

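If the work is not already queued, enqueues it on the workqueue and requests that it run on the CPU given by @cpu. Returns false if the work is already queued.

The pending flag on the work tells whether the work has already been enqueued on a workqueue.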
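The "set PENDING atomically, and only the caller that flipped it queues the work" pattern can be sketched in userspace with C11 atomics. This is a simplified model under stated assumptions: `struct model_work`, `MODEL_PENDING` and the helpers are hypothetical, and interrupt disabling is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_PENDING 0x1UL  /* stand-in for the PENDING flag bit */

struct model_work {
        atomic_ulong data;
        int queued_count;    /* how many times the work was really queued */
};

/* Returns true if this call queued the work, false if it was already pending. */
static bool model_queue_work(struct model_work *w)
{
        /* Atomically set the bit and learn its previous value. */
        unsigned long old = atomic_fetch_or(&w->data, MODEL_PENDING);

        if (old & MODEL_PENDING)
                return false;   /* someone else already owns the queueing */

        w->queued_count++;      /* stands in for __queue_work() */
        return true;
}

/* Clearing PENDING models the point where a worker picks the work up. */
static void model_start_execution(struct model_work *w)
{
        atomic_fetch_and(&w->data, ~MODEL_PENDING);
}
```

Queueing twice in a row succeeds only the first time; after the work has been picked up for execution, it can be queued again.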

__queue_work()

Enqueues the work on the workqueue.

kernel/workqueue.c – 1/2

static void __queue_work(int cpu, struct workqueue_struct *wq,
                         struct work_struct *work)
{
        struct pool_workqueue *pwq;
        struct worker_pool *last_pool;
        struct list_head *worklist;
        unsigned int work_flags;
        unsigned int req_cpu = cpu;

        /*
         * While a work item is PENDING && off queue, a task trying to
         * steal the PENDING will busy-loop waiting for it to either get
         * queued or lose PENDING.  Grabbing PENDING and queueing should
         * happen with IRQ disabled.
         */
        lockdep_assert_irqs_disabled();

        debug_work_activate(work);

        /* if draining, only works from the same workqueue are allowed */
        if (unlikely(wq->flags & __WQ_DRAINING) &&
            WARN_ON_ONCE(!is_chained_work(wq)))
                return;
        rcu_read_lock();
retry:
        if (req_cpu == WORK_CPU_UNBOUND)
                cpu = wq_select_unbound_cpu(raw_smp_processor_id());

        /* pwq which will be used unless @work is executing elsewhere */
        if (!(wq->flags & WQ_UNBOUND))
                pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
        else
                pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));

        /*
         * If @work was previously on a different pool, it might still be
         * running there, in which case the work needs to be queued on that
         * pool to guarantee non-reentrancy.
         */
        last_pool = get_work_pool(work);
        if (last_pool && last_pool != pwq->pool) {
                struct worker *worker;

                spin_lock(&last_pool->lock);

                worker = find_worker_executing_work(last_pool, work);

                if (worker && worker->current_pwq->wq == wq) {
                        pwq = worker->current_pwq;
                } else {
                        /* meh... not running there, queue here */
                        spin_unlock(&last_pool->lock);
                        spin_lock(&pwq->pool->lock);
                }
        } else {
                spin_lock(&pwq->pool->lock);
        }

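In code lines 21~23, if the workqueue has the draining flag set and the request is not chained work (work queued by a worker that is currently executing a work item of this same workqueue), a warning is printed and the function returns.

In code lines 26~27, if the request was made with WORK_CPU_UNBOUND, a CPU is chosen through wq_select_unbound_cpu(), starting from the current CPU.

In code lines 30~33, for a workqueue that is not unbound, the pool workqueue for the requested CPU is looked up. For an unbound workqueue, the pool workqueue for that CPU's node is looked up.

wq->cpu_pwqs:
- used per CPU for CPU-bound workqueues
wq->numa_pwq_tbl[node]:
- used per node for unbound workqueues

In code lines 40~46, if the worker pool on which the work last ran differs from the worker pool of the pool workqueue just looked up, the last pool is searched for a worker that is currently executing the work.

In code lines 48~54, if such a worker is found and its current pool workqueue belongs to this workqueue, that worker's current pool workqueue is used as-is. Otherwise the last pool's lock is released and the newly selected pool is locked instead.

To guarantee that a work item is never re-entered, the pool workqueue of the worker that is executing the work is reused.
If the work is already running, it is sent to the worker pool of the worker executing it rather than to a worker pool on another CPU, so the same work item is never processed concurrently.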
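The pool-selection rule described above can be condensed into a small userspace model. This is only a sketch under simplifying assumptions: `struct model_pool` tracks a single running work, locking is omitted, and the kernel's extra check that the worker's pool workqueue belongs to the same workqueue is left out. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct model_pool {
        const void *running_work;  /* work currently executing here, or NULL */
};

/*
 * Pick the pool to queue on: stay on @last if @work is still executing
 * there, otherwise use the pool chosen for the requested CPU (@target).
 */
static struct model_pool *model_pick_pool(struct model_pool *target,
                                          struct model_pool *last,
                                          const void *work)
{
        if (last && last != target && last->running_work == work)
                return last;    /* keep the work on the pool running it */
        return target;          /* not running elsewhere, queue here */
}
```

A work item that has never run, or that ran on another pool but has since finished, goes to the target pool; only a work item still executing on its previous pool is steered back there.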

kernel/workqueue.c – 2/2

        /*
         * pwq is determined and locked.  For unbound pools, we could have
         * raced with pwq release and it could already be dead.  If its
         * refcnt is zero, repeat pwq selection.  Note that pwqs never die
         * without another pwq replacing it in the numa_pwq_tbl or while
         * work items are executing on it, so the retrying is guaranteed to
         * make forward-progress.
         */
        if (unlikely(!pwq->refcnt)) {
                if (wq->flags & WQ_UNBOUND) {
                        spin_unlock(&pwq->pool->lock);
                        cpu_relax();
                        goto retry;
                }
                /* oops */
                WARN_ONCE(true, "workqueue: per-cpu pwq for %s on cpu%d has 0 refcnt",
                          wq->name, cpu);
        }

        /* pwq determined, queue */
        trace_workqueue_queue_work(req_cpu, pwq, work);

        if (WARN_ON(!list_empty(&work->entry)))
                goto out;

        pwq->nr_in_flight[pwq->work_color]++;
        work_flags = work_color_to_flags(pwq->work_color);

        if (likely(pwq->nr_active < pwq->max_active)) {
                trace_workqueue_activate_work(work);
                pwq->nr_active++;
                worklist = &pwq->pool->worklist;
                if (list_empty(worklist))
                        pwq->pool->watchdog_ts = jiffies;
        } else {
                work_flags |= WORK_STRUCT_DELAYED;
                worklist = &pwq->delayed_works;
        }

        insert_work(pwq, work, worklist, work_flags);

out:
        spin_unlock(&pwq->pool->lock);
        rcu_read_unlock();
}

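In code lines 9~18, if the pool workqueue's reference count is 0, a warning is printed. For an unbound workqueue, however, the selection is retried instead.

The pool workqueue may have been released in a race. In that case the pool workqueue is looked up again.

In code lines 23~24, if the work's entry is already linked on some list, a warning is printed and the function exits.

In code lines 26~27, the in-flight work count for the current work color is incremented and the work color value is converted to flag bits.

In code lines 29~40, if the number of active works on the pool workqueue has not reached the maximum, nr_active (the active work counter) is incremented by 1 and the work is added to the worklist of the worker pool that the pool workqueue points to. If the maximum has been reached, the delayed flag is added to the work and it is added to the pool workqueue's delayed_works list.

The upper limit is 512; for an unbound workqueue the larger of 512 and cpu*4 is used.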
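The active/delayed decision and the upper limit quoted above can be modeled as follows. This is a userspace sketch with hypothetical names (`struct model_pwq`, `model_enqueue()`, `model_max_active_limit()`); counters replace the real lists.

```c
#include <assert.h>

struct model_pwq {
        int nr_active;   /* works handed to the worker pool */
        int max_active;  /* concurrency limit */
        int nr_delayed;  /* works parked on the delayed list */
};

/* Returns 1 if the work became active, 0 if it was parked as delayed. */
static int model_enqueue(struct model_pwq *pwq)
{
        if (pwq->nr_active < pwq->max_active) {
                pwq->nr_active++;
                return 1;
        }
        pwq->nr_delayed++;
        return 0;
}

/* Upper bound from the text: 512, or max(512, 4 * cpus) when unbound. */
static int model_max_active_limit(int unbound, int nr_cpus)
{
        int limit = 512;

        if (unbound && 4 * nr_cpus > limit)
                limit = 4 * nr_cpus;
        return limit;
}
```

With max_active set to 2, the third enqueue is parked as delayed. On an assumed 256-CPU machine the unbound limit becomes 1024, while the per-CPU limit stays at 512.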

insert_work()

kernel/workqueue.c

/**
 * insert_work - insert a work into a pool
 * @pwq: pwq @work belongs to
 * @work: work to insert
 * @head: insertion point
 * @extra_flags: extra WORK_STRUCT_* flags to set
 *
 * Insert @work which belongs to @pwq after @head.  @extra_flags is or'd to
 * work_struct flags.
 *
 * CONTEXT:
 * spin_lock_irq(pool->lock).
 */

static void insert_work(struct pool_workqueue *pwq, struct work_struct *work,
                        struct list_head *head, unsigned int extra_flags)
{
        struct worker_pool *pool = pwq->pool;

        /* we own @work, set data and link */
        set_work_pwq(work, pwq, extra_flags);
        list_add_tail(&work->entry, head);
        get_pwq(pwq);

        /*
         * Ensure either wq_worker_sleeping() sees the above
         * list_add_tail() or we see zero nr_running to avoid workers lying
         * around lazily while there are works to be processed.
         */
        smp_mb();

        if (__need_more_worker(pool))
                wake_up_worker(pool);
}

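Adds the work to the requested list.

In code line 7, the work's data is made to point to the pool workqueue and the requested flags are added.

In code line 8, the work is added to the tail of the requested list.

In code line 9, the pool workqueue's reference count is incremented by 1.

In code lines 18~19, if no worker is currently running in the worker pool, an idle worker is woken up.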
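The wake-up condition at the end of insert_work() can be pictured with a tiny userspace model. This is a sketch only: `struct model_wpool` and `model_insert_work()` are hypothetical, plain counters replace the list and the atomic nr_running, and the memory barrier is omitted.

```c
#include <assert.h>

struct model_wpool {
        int nr_running;  /* workers currently running work */
        int nr_queued;   /* works on the worklist */
        int wakeups;     /* how many times an idle worker was woken */
};

static void model_insert_work(struct model_wpool *pool)
{
        pool->nr_queued++;            /* stands in for list_add_tail() */

        if (pool->nr_running == 0)    /* no worker is running in this pool */
                pool->wakeups++;      /* stands in for wake_up_worker() */
}
```

Inserting into a pool with no running worker triggers a wake-up; inserting while a worker is already running does not, because that worker will pick up the new work itself.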

set_work_pwq()

kernel/workqueue.c

static void set_work_pwq(struct work_struct *work, struct pool_workqueue *pwq,
                         unsigned long extra_flags)
{
        set_work_data(work, (unsigned long)pwq,
                      WORK_STRUCT_PENDING | WORK_STRUCT_PWQ | extra_flags);
}

Assigns the pool workqueue to the work and, in addition to the requested flags, always adds WORK_STRUCT_PENDING and WORK_STRUCT_PWQ.

set_work_data()

kernel/workqueue.c

/*
 * While queued, %WORK_STRUCT_PWQ is set and non flag bits of a work's data
 * contain the pointer to the queued pwq.  Once execution starts, the flag
 * is cleared and the high bits contain OFFQ flags and pool ID.
 *
 * set_work_pwq(), set_work_pool_and_clear_pending(), mark_work_canceling()
 * and clear_work_data() can be used to set the pwq, pool or clear
 * work->data.  These functions should only be called while the work is
 * owned - ie. while the PENDING bit is set.
 *
 * get_work_pool() and get_work_pwq() can be used to obtain the pool or pwq
 * corresponding to a work.  Pool is available once the work has been
 * queued anywhere after initialization until it is sync canceled.  pwq is
 * available only while the work item is queued.
 *
 * %WORK_OFFQ_CANCELING is used to mark a work item which is being
 * canceled.  While being canceled, a work item may have its PENDING set
 * but stay off timer and worklist for arbitrarily long and nobody should
 * try to steal the PENDING bit.
 */

static inline void set_work_data(struct work_struct *work, unsigned long data,
                                 unsigned long flags)
{
        WARN_ON_ONCE(!work_pending(work));
        atomic_long_set(&work->data, data | flags | work_static(work));
}

Sets the work's data word to data (a pool workqueue pointer or a pool id) combined with the flags.
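Storing a pointer and flags in one word works only because the pointed-to structure is aligned so that its low address bits are always zero. The userspace sketch below illustrates the idea; `MODEL_FLAG_BITS` is an assumed width, and `struct model_apwq` and the `model_*` helpers are hypothetical.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

#define MODEL_FLAG_BITS 8  /* assumed number of low bits used for flags */
#define MODEL_FLAG_MASK ((uintptr_t)((1u << MODEL_FLAG_BITS) - 1))

/* Aligned so that the low MODEL_FLAG_BITS bits of its address are zero. */
struct model_apwq {
        alignas(1 << MODEL_FLAG_BITS) int refcnt;
};

/* Combine the pointer with flag bits in a single word. */
static uintptr_t model_pack(const struct model_apwq *pwq, uintptr_t flags)
{
        return (uintptr_t)pwq | (flags & MODEL_FLAG_MASK);
}

/* Recover the pointer by masking the flag bits off. */
static struct model_apwq *model_unpack_pwq(uintptr_t data)
{
        return (struct model_apwq *)(data & ~MODEL_FLAG_MASK);
}

/* Recover the flag bits. */
static uintptr_t model_unpack_flags(uintptr_t data)
{
        return data & MODEL_FLAG_MASK;
}
```

Packing a pointer with flags 0x3 and unpacking it again yields the original pointer and the same flags, since the two never share bits.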

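Other APIs

flush_workqueue()
- Processes all works enqueued on the workqueue until it is empty.
flush_scheduled_work()
- Processes all works enqueued on the global (system) workqueue until it is empty.
cancel_work_sync()
- Cancels a work enqueued on a workqueue and waits until it has completed.
cancel_delayed_work()
- Cancels a delayed work enqueued on a workqueue.
cancel_delayed_work_sync()
- Cancels a delayed work enqueued on a workqueue and waits until it has completed.
destroy_workqueue()
- Destroys a workqueue that was created earlier.

Structures

workqueue_struct Structure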

kernel/workqueue.c

/*
 * The externally visible workqueue.  It relays the issued work items to
 * the appropriate worker_pool through its pool_workqueues.
 */

struct workqueue_struct {
        struct list_head        pwqs;           /* WR: all pwqs of this wq */
        struct list_head        list;           /* PL: list of all workqueues */

        struct mutex            mutex;          /* protects this wq */
        int                     work_color;     /* WQ: current work color */
        int                     flush_color;    /* WQ: current flush color */
        atomic_t                nr_pwqs_to_flush; /* flush in progress */
        struct wq_flusher       *first_flusher; /* WQ: first flusher */
        struct list_head        flusher_queue;  /* WQ: flush waiters */
        struct list_head        flusher_overflow; /* WQ: flush overflow list */

        struct list_head        maydays;        /* MD: pwqs requesting rescue */
        struct worker           *rescuer;       /* I: rescue worker */

        int                     nr_drainers;    /* WQ: drain in progress */
        int                     saved_max_active; /* WQ: saved pwq max_active */

        struct workqueue_attrs  *unbound_attrs; /* PW: only for unbound wqs */
        struct pool_workqueue   *dfl_pwq;       /* PW: only for unbound wqs */

#ifdef CONFIG_SYSFS
        struct wq_device        *wq_dev;        /* I: for sysfs interface */
#endif
#ifdef CONFIG_LOCKDEP
        char                    *lock_name;
        struct lock_class_key   key;
        struct lockdep_map      lockdep_map;
#endif
        char                    name[WQ_NAME_LEN]; /* I: workqueue name */
        /*
         * Destruction of workqueue_struct is RCU protected to allow walking
         * the workqueues list without grabbing wq_pool_mutex.
         * This is used to dump all workqueues from sysrq.
         */
        struct rcu_head         rcu;

        /* hot fields used during command issue, aligned to cacheline */
        unsigned int            flags ____cacheline_aligned; /* WQ: WQ_* flags */
        struct pool_workqueue __percpu *cpu_pwqs; /* I: per-cpu pwqs */
        struct pool_workqueue __rcu *numa_pwq_tbl[]; /* PWR: unbound pwqs indexed by node */
};

pwqs

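Pool workqueues belonging to this workqueue

list

List node used to link every workqueue onto the global workqueues list.

work_color

Current work color

flush_color

Current flush color

nr_pwqs_to_flush

Number of pool workqueues still to be flushed; used while a flush is in progress.

*first_flusher

First flusher (flush request)

flusher_queue

Flush requests accumulate on this flusher list and are moved to first_flusher one at a time to be handled.
When flushing completes, first_flusher becomes NULL and the flusher_queue list is empty as well.

flusher_overflow

Flush overflow list; when the flush color space runs out, flush requests are added to this list.

maydays

List of pool workqueues that have requested rescue

nr_drainers

Number of drainers in progress; incremented by 1 when a drain of the workqueue is requested and draining starts, and decremented by 1 when it completes.

saved_max_active

Maximum number of active works that can be registered; works beyond this number are not dispatched to the worker pool but wait on the pool workqueue's delayed_works list.

*unbound_attrs

Unbound workqueue attributes

*dfl_pwq

Points to the default unbound pool workqueue, used temporarily as a fall-back when numa_pwq_tbl[] is hard to use, such as during CPU on/off.

wq_dev

sysfs interface

name[]

Workqueue name

rcu

When the workqueue is removed, its destruction is protected by RCU.

flags

Flags

*cpu_pwqs

Per-CPU pool workqueues

*numa_pwq_tbl[]

Per-node unbound pool workqueues

workqueue_attrs Structure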

include/linux/workqueue.h

/**
 * struct workqueue_attrs - A struct for workqueue attributes.
 *
 * This can be used to change attributes of an unbound workqueue.
 */

struct workqueue_attrs {
        /**
         * @nice: nice level
         */
        int nice;

        /**
         * @cpumask: allowed CPUs
         */
        cpumask_var_t cpumask;

        /**
         * @no_numa: disable NUMA affinity
         *
         * Unlike other fields, ``no_numa`` isn't a property of a worker_pool. It
         * only modifies how :c:func:`apply_workqueue_attrs` select pools and thus
         * doesn't participate in pool hash calculations or equality comparisons.
         */
        bool no_numa;
};

nice

nice priority

cpumask

Allowed CPUs

no_numa

Disables NUMA affinity

wq_flusher Structure

kernel/workqueue.c

/*
 * Structure used to wait for workqueue flush.
 */

struct wq_flusher {
        struct list_head        list;           /* WQ: list of flushers */
        int                     flush_color;    /* WQ: flush color waiting for */
        struct completion       done;           /* flush completion */
};

list

Link node used when adding a flusher (flush request) to the flusher list

flush_color

Flush color

done

Used to wait for flush completion

pool_workqueue Structure

kernel/workqueue.c

/*
 * The per-pool workqueue.  While queued, the lower WORK_STRUCT_FLAG_BITS
 * of work_struct->data are used for flags and the remaining high bits
 * point to the pwq; thus, pwqs need to be aligned at two's power of the
 * number of flag bits.
 */

struct pool_workqueue {
        struct worker_pool      *pool;          /* I: the associated pool */
        struct workqueue_struct *wq;            /* I: the owning workqueue */
        int                     work_color;     /* L: current color */
        int                     flush_color;    /* L: flushing color */
        int                     refcnt;         /* L: reference count */
        int                     nr_in_flight[WORK_NR_COLORS];
                                                /* L: nr of in_flight works */
        int                     nr_active;      /* L: nr of active works */
        int                     max_active;     /* L: max active works */
        struct list_head        delayed_works;  /* L: delayed works */
        struct list_head        pwqs_node;      /* WR: node on wq->pwqs */
        struct list_head        mayday_node;    /* MD: node on wq->maydays */

        /*
         * Release of unbound pwq is punted to system_wq.  See put_pwq()
         * and pwq_unbound_release_workfn() for details.  pool_workqueue
         * itself is also RCU protected so that the first pwq can be
         * determined without grabbing wq->mutex.
         */
        struct work_struct      unbound_release_work;
        struct rcu_head         rcu;
} __aligned(1 << WORK_STRUCT_FLAG_BITS);

*pool

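Associated worker pool

*wq

Workqueue

work_color

Work color; every newly queued work is given this color.
When a flush request occurs, the next work color is selected. (It cycles upward within 0 ~ 14.)

flush_color

Flush color; when a flush is requested, it is set to the work color in use at that moment, and the flush then waits until all works of that color have finished.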
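The interplay of work color, flush color and the per-color in-flight counters can be sketched in userspace. This is a heavily simplified model, assuming 15 colors (0 ~ 14) as stated above; `struct model_cpwq` and the `model_*` helpers are hypothetical, and the real flush logic involves more state than shown here.

```c
#include <assert.h>

#define MODEL_NR_COLORS 15  /* colors 0 ~ 14, as described in the text */

struct model_cpwq {
        int work_color;                       /* color given to new works */
        int nr_in_flight[MODEL_NR_COLORS];    /* unfinished works per color */
};

/* Colors cycle upward and wrap around. */
static int model_next_color(int color)
{
        return (color + 1) % MODEL_NR_COLORS;
}

/* Queue one work; returns the color it was tagged with. */
static int model_queue(struct model_cpwq *p)
{
        p->nr_in_flight[p->work_color]++;
        return p->work_color;
}

/* A work of @color has finished. */
static void model_complete(struct model_cpwq *p, int color)
{
        p->nr_in_flight[color]--;
}

/* Start a flush: remember the current color, move new works to the next one. */
static int model_begin_flush(struct model_cpwq *p)
{
        int flush_color = p->work_color;

        p->work_color = model_next_color(p->work_color);
        return flush_color;
}

/* The flush is done once no work of the flush color remains. */
static int model_flush_done(const struct model_cpwq *p, int flush_color)
{
        return p->nr_in_flight[flush_color] == 0;
}
```

Works queued after the flush starts get the next color, so they do not hold up the flush; only works tagged with the flush color have to finish.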

refcnt

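Reference count

nr_in_flight[]

Holds, per color, the number of works in flight (queued and not yet completed).

nr_active

Number of active works
Delayed works are not included.

max_active

Concurrency limit (maximum number of active works)

delayed_works

List where work requests exceeding the concurrency limit (max_active) wait.
While the PM suspend feature is in effect, all incoming works also wait here.

pwqs_node

List node used to link this pool workqueue onto the workqueue's pwqs list

mayday_node

List node used when this pool workqueue is added to the workqueue's maydays list

unbound_release_work

A work item embedded in the pool workqueue, used when releasing an unbound pool workqueue.
It runs pwq_unbound_release_workfn() to release the pool workqueue along with its worker pool reference; a workqueue whose pool workqueues have all been released is freed as well.

rcu

RCU node used to free the pool workqueue structure back to the pool workqueue slab cache via RCU.

worker_pool Structure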

kernel/workqueue.c

/*
 * Structure fields follow one of the following exclusion rules.
 *
 * I: Modifiable by initialization/destruction paths and read-only for
 *    everyone else.
 *
 * P: Preemption protected.  Disabling preemption is enough and should
 *    only be modified and accessed from the local cpu.
 *
 * L: pool->lock protected.  Access with pool->lock held.
 *
 * X: During normal operation, modification requires pool->lock and should
 *    be done only from local cpu.  Either disabling preemption on local
 *    cpu or grabbing pool->lock is enough for read access.  If
 *    POOL_DISASSOCIATED is set, it's identical to L.
 *
 * A: pool->attach_mutex protected.
 *
 * PL: wq_pool_mutex protected.
 *
 * PR: wq_pool_mutex protected for writes.  RCU protected for reads.
 *
 * PW: wq_pool_mutex and wq->mutex protected for writes.  Either for reads.
 *
 * PWR: wq_pool_mutex and wq->mutex protected for writes.  Either or
 *      RCU for reads.
 *
 * WQ: wq->mutex protected.
 *
 * WR: wq->mutex protected for writes.  RCU protected for reads.
 *
 * MD: wq_mayday_lock protected.
 */

/* struct worker is defined in workqueue_internal.h */

struct worker_pool {
        spinlock_t              lock;           /* the pool lock */
        int                     cpu;            /* I: the associated cpu */
        int                     node;           /* I: the associated node ID */
        int                     id;             /* I: pool ID */
        unsigned int            flags;          /* X: flags */

        unsigned long           watchdog_ts;    /* L: watchdog timestamp */

        struct list_head        worklist;       /* L: list of pending works */

        int                     nr_workers;     /* L: total number of workers */
        int                     nr_idle;        /* L: currently idle workers */

        struct list_head        idle_list;      /* X: list of idle workers */
        struct timer_list       idle_timer;     /* L: worker idle timeout */
        struct timer_list       mayday_timer;   /* L: SOS timer for workers */

        /* a workers is either on busy_hash or idle_list, or the manager */
        DECLARE_HASHTABLE(busy_hash, BUSY_WORKER_HASH_ORDER);
                                                /* L: hash of busy workers */

        struct worker           *manager;       /* L: purely informational */
        struct list_head        workers;        /* A: attached workers */
        struct completion       *detach_completion; /* all workers detached */

        struct ida              worker_ida;     /* worker IDs for task name */

        struct workqueue_attrs  *attrs;         /* I: worker attributes */
        struct hlist_node       hash_node;      /* PL: unbound_pool_hash node */
        int                     refcnt;         /* PL: refcnt for unbound pools */

        /*
         * The current concurrency level.  As it's likely to be accessed
         * from other CPUs during try_to_wake_up(), put it in a separate
         * cacheline.
         */
        atomic_t                nr_running ____cacheline_aligned_in_smp;

        /*
         * Destruction of pool is RCU protected to allow dereferences
         * from get_work_pool().
         */
        struct rcu_head         rcu;
} ____cacheline_aligned_in_smp;

cpu

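CPU number

node

Node number

id

Worker pool ID

flags

Worker pool flags

watchdog_ts

Watchdog timestamp

worklist

List holding the works to be processed

nr_workers

Number of workers registered in the worker pool

nr_idle

Number of idle workers in the worker pool

idle_list

List of idle workers

idle_timer

Idle timer
At 5-minute intervals, destroys workers when too many are idle.

mayday_timer

Mayday timer
At 0.1-second intervals, checks for deadlock situations among works and wakes up rescuer_thread.

busy_hash[]

Hash list holding the busy workers

*manager

The worker currently acting as manager (purely informational, per the source comment)

workers

List of attached workers

*detach_completion

Used when detaching all workers

worker_ida

IDA tree that issues worker IDs

*attrs

Workqueue attributes (nice and cpumask)
Unbound workqueues that use the same attributes share a worker pool.

hash_node

Unbound pool hash node

refcnt

Reference count

nr_running

Number of workers currently processing work
It is accessed atomically from other CPUs through try_to_wake_up(), so it is aligned to the cache line size.

rcu

RCU is used when destroying the worker pool.

work_struct Structure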

include/linux/workqueue.h

struct work_struct {
        atomic_long_t data;
        struct list_head entry;
        work_func_t func;
#ifdef CONFIG_LOCKDEP
        struct lockdep_map lockdep_map;
#endif
};

data

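Stores either the pool workqueue address or the pool id, with flags in the low bits. (See the main text.)

entry

List node used when the work is linked on a list

func

Function called to do the work

delayed_work Structure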

include/linux/workqueue.h

struct delayed_work {
        struct work_struct work;
        struct timer_list timer;

        /* target workqueue and CPU ->timer uses to queue ->work */
        struct workqueue_struct *wq; 
        int cpu;                    
};

work

The embedded work structure

timer

Delay timer

*wq

Points to the workqueue on which the work will be queued when the timer expires.

cpu

CPU on which the work is to run

References

Interrupts -1- (Interrupt Controller) | 문c
Interrupts -2- (irq chip) | 문c
Interrupts -3- (irq domain) | 문c
Interrupts -4- (Top-Half & Bottom-Half) | 문c
Interrupts -5- (Softirq) | 문c
Interrupts -6- (IPI Cross-call) | 문c
Interrupts -7- (Workqueue 1) | 문c
Interrupts -8- (Workqueue 2) | 문c – current post
Interrupts -9- (GIC v3 Driver) | 문c
Interrupts -10- (irq partition) | 문c
Interrupts -11- (RPI2 IC Driver) | 문c
Interrupts -12- (irq desc) | 문c

Driver porting: the workqueue interface. | LWN.net
[Linux] concurrency managed workqueue (cmwq) | F/OSS
지연 가능 함수, 커널 태스크릿 및 작업 큐 (BOTTOM HALF) | 신불사
Multitasking in the Linux Kernel. Workqueues | Vita Loginova
Details of the workqueue interface (2002) | LWN.net