문c Blog

Scheduler -9- (RT Scheduler)

2017-05-24 / 2023-02-18 문영일

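RT Scheduler

RT Scheduler Priority

RT tasks always have higher priority than CFS tasks and therefore get to run first. However, they have lower priority than the stop task, which the stop scheduler uses when taking a cpu offline, and than the deadline tasks handled by the deadline scheduler. The priority order among the schedulers is: stop > deadline > rt > cfs > idle.

RT Task Priority

This section looks at the scheduling order when RT tasks compete with each other. Inside the kernel, RT task priorities range from 0 (highest priority) to 99 (lowest priority), also written as RT0 ~ RT99. When an RT50 task and an RT60 task compete at the same time, RT50 has the higher priority and runs first.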
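
The RT0 ~ RT99 numbering above is the kernel-internal prio, where a smaller number wins. As a minimal sketch, assuming the usual relation between the user-space sched_priority (1..99, larger wins) and the internal prio, the conversion can be written as below. The function names are hypothetical and the relation itself is not shown in this post, so treat it as an assumption.

```c
#include <assert.h>

#define MAX_RT_PRIO 100  /* same value as the kernel constant */

/*
 * Convert a user-space RT priority (sched_priority, 1..99, larger = more
 * urgent) into a kernel-style prio (smaller = more urgent).
 * Assumed relation: prio = MAX_RT_PRIO - 1 - sched_priority.
 */
int rt_user_to_kernel_prio(int sched_priority)
{
        assert(sched_priority >= 1 && sched_priority < MAX_RT_PRIO);
        return MAX_RT_PRIO - 1 - sched_priority;
}

/* Return nonzero if kernel prio a is scheduled ahead of kernel prio b. */
int rt_prio_runs_first(int a, int b)
{
        return a < b;
}
```

With this relation, `chrt -f 50` would correspond to internal prio 49, and rt_prio_runs_first(50, 60) is true, matching the RT50 versus RT60 example.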

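RT Runqueue

One rt runqueue is created per cpu. When group scheduling is used (the cpu subsystem of cgroup), an additional set of per-cpu rt runqueues is created for every subgroup; this is covered shortly. When rt tasks are queued to the rt scheduler, they go into the queue named active in the rt runqueue, which is managed as an array of 100 lists (queue[]), one per priority. When two or more rt tasks are queued, the rt scheduler runs the rt task with the highest priority first.

The figure below shows four rt tasks queued on the rt runqueue of cpu#0. They run in order from task A to task D: each time an rt task is dequeued, the task with the next highest priority runs.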
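
The idea of "100 lists plus a marker of which lists are non-empty" can be sketched in user space as follows. This is a simplified model, not the kernel's struct rt_prio_array: the toy_* names are hypothetical, a byte per priority stands in for the kernel's real bitmap, and singly linked FIFO lists stand in for list_head.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RT_PRIO 100

struct toy_task {
        const char *name;
        int prio;                /* kernel-style prio, 0 = highest */
        struct toy_task *next;   /* next task queued at the same prio */
};

struct toy_prio_array {
        unsigned char bitmap[MAX_RT_PRIO];    /* 1 if the list is non-empty */
        struct toy_task *head[MAX_RT_PRIO];
        struct toy_task *tail[MAX_RT_PRIO];
};

/* Append a task to the tail of the list for its priority. */
void toy_enqueue(struct toy_prio_array *a, struct toy_task *t)
{
        assert(t->prio >= 0 && t->prio < MAX_RT_PRIO);
        t->next = NULL;
        if (a->tail[t->prio])
                a->tail[t->prio]->next = t;
        else
                a->head[t->prio] = t;
        a->tail[t->prio] = t;
        a->bitmap[t->prio] = 1;
}

/* Remove and return the first task of the highest-priority non-empty list. */
struct toy_task *toy_pick_and_dequeue(struct toy_prio_array *a)
{
        for (int prio = 0; prio < MAX_RT_PRIO; prio++) {
                struct toy_task *t;

                if (!a->bitmap[prio])
                        continue;
                t = a->head[prio];
                a->head[prio] = t->next;
                if (!a->head[prio]) {
                        a->tail[prio] = NULL;
                        a->bitmap[prio] = 0;   /* list became empty */
                }
                return t;
        }
        return NULL;   /* nothing queued */
}
```

The linear scan over the byte map is only for readability; a real bitmap allows the first set bit to be found with a handful of word operations.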

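RT Task Execution Time

Once an RT task starts running, it stops or is switched out for another task only under the following conditions.

A scheduler with higher priority than the rt scheduler has work to run
- a stop or deadline task runs
The rt task goes to sleep or gives up the cpu by itself
- schedule(), yield(), msleep(), etc.
On a preemptible kernel, an rt task with higher priority becomes runnable
Between tasks of equal priority that use the RR (Round Robin) policy, the running task is rotated
The rt runqueue is throttled by rt bandwidth
The rt task exits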

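Scheduling Policies for RT Tasks

Two RT scheduling policies are supported. They differ in how tasks of the same priority are ordered.

SCHED_FIFO
- The task that started first keeps running until it finishes, sleeps, or yields.
SCHED_RR
- Tasks of the same priority take turns, rotating every time slice configured in the kernel (default 100 ms).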

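RT Task Preemption

When an RT task with higher priority than the currently running RT task is enqueued on the RT runqueue, the higher-priority RT task naturally runs. However, if the current task is a kernel task created inside the kernel, then depending on the kernel's preemption option the switch may not take place, may be slightly delayed, or may take effect immediately.

Reference: Scheduler -3- (Preemption & Context Switch) | 문c

RT Group Scheduling

When group scheduling is used, it is managed as shown in the figure below. In terms of priority alone, scheduling behaves the same whether or not RT group scheduling is used. What changes with RT group scheduling is how RT bandwidth is applied. RT bandwidth works almost the same way as CFS bandwidth: a group is given a period and a runtime, and when the runtime is used up within a period the rt runqueue is throttled.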

RT Scheduler ops

kernel/sched/rt.c

const struct sched_class rt_sched_class = {
        .next                   = &fair_sched_class,
        .enqueue_task           = enqueue_task_rt,
        .dequeue_task           = dequeue_task_rt,
        .yield_task             = yield_task_rt,

        .check_preempt_curr     = check_preempt_curr_rt,

        .pick_next_task         = pick_next_task_rt,
        .put_prev_task          = put_prev_task_rt,
        .set_next_task          = set_next_task_rt,

#ifdef CONFIG_SMP
        .balance                = balance_rt,
        .select_task_rq         = select_task_rq_rt,
        .set_cpus_allowed       = set_cpus_allowed_common,
        .rq_online              = rq_online_rt,
        .rq_offline             = rq_offline_rt,
        .task_woken             = task_woken_rt,
        .switched_from          = switched_from_rt,
#endif

        .task_tick              = task_tick_rt,

        .get_rr_interval        = get_rr_interval_rt,

        .prio_changed           = prio_changed_rt,
        .switched_to            = switched_to_rt,

        .update_curr            = update_curr_rt,

#ifdef CONFIG_UCLAMP_TASK
        .uclamp_enabled         = 1,
#endif
};

RT Scheduler Tick

task_tick_rt()

kernel/sched/rt.c

/*
 * scheduler tick hitting a task of our scheduling class.
 *
 * NOTE: This function can be called remotely by the tick offload that
 * goes along full dynticks. Therefore no local assumption can be made
 * and everything must be accessed through the @rq and @curr passed in
 * parameters.
 */

static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
{
        struct sched_rt_entity *rt_se = &p->rt;

        update_curr_rt(rq);
        update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 1);

        watchdog(rq, p);

        /*
         * RR tasks need a special form of timeslice management.
         * FIFO tasks have no timeslices.
         */
        if (p->policy != SCHED_RR)
                return;

        if (--p->rt.time_slice)
                return;

        p->rt.time_slice = sched_rr_timeslice;

        /*
         * Requeue to the end of queue if we (and all of our ancestors) are not
         * the only element on the queue
         */
        for_each_sched_rt_entity(rt_se) {
                if (rt_se->run_list.prev != rt_se->run_list.next) {
                        requeue_task_rt(rq, p, 0);
                        resched_curr(rq);
                        return;
                }
        }
}

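On every scheduler tick the RT scheduler does the following.

Updates the rt load average and the runtime
If the run time of the requested rt task exceeds the configured rt task time limit, marks cpu time expiry
If the requested task uses the round robin policy and there are several tasks of the same priority, the task yields and round robin is performed

Code line 5 updates the runtime and related values of the currently running rt task.
Code line 6 updates the load average and related values of the rt runqueue.
Code line 8 checks cpu time expiry of the task when a time limit (RLIMIT_RTTIME) is set for a user rt task.
Code lines 14~15 return from the function if the task's scheduling policy is not round robin (SCHED_RR).
- The scheduling policies used by rt tasks are SCHED_RR and SCHED_FIFO.
Code lines 17~18 handle the round robin policy. If the time slice has not run out yet, the function returns.
- sched_rr_timeslice
  - RR tick counter corresponding to the default 100 ms
Code line 20 reloads the rt task's time slice with the round robin time slice (default 100 ms).
Code lines 26~32 walk up to the topmost rt scheduling entity, and if an entity shares its list with other scheduling entities, perform round robin and set the reschedule request flag.
- If the list for this entity's priority in the rt runqueue array holds more than one rt scheduling entity, the entity is requeued to the tail of that list to yield its turn, and the reschedule request flag is set.
- As a result, rt tasks of the same priority that use the round robin policy run in turn. The check runs on every scheduler tick, but the rotation itself happens only when a time slice expires.

The following figure shows the call relationships below the task_tick_rt() function.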
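
The tick-driven time slice handling can be modeled in a few lines. The sketch below is a simplified user-space model with hypothetical toy_* names: a single priority level is represented as an array whose first element is the running task, and one shared counter stands in for the per-task time_slice field.

```c
#include <assert.h>
#include <string.h>

#define TOY_RR_TIMESLICE 3   /* ticks; the kernel default corresponds to 100 ms */
#define TOY_MAX_TASKS    8

struct toy_rr_queue {
        int task[TOY_MAX_TASKS];   /* task ids; task[0] is the running task */
        int nr;                    /* number of queued tasks */
        int time_slice;            /* remaining ticks of task[0] */
};

/* One scheduler tick. Returns 1 if the running task was moved to the tail. */
int toy_rr_tick(struct toy_rr_queue *q)
{
        int cur;

        assert(q->nr > 0 && q->nr <= TOY_MAX_TASKS);

        if (--q->time_slice)
                return 0;                    /* slice not used up yet */

        q->time_slice = TOY_RR_TIMESLICE;    /* reload the slice */
        if (q->nr == 1)
                return 0;                    /* alone on the list: keep running */

        /* Requeue the running task at the tail; the next task moves up. */
        cur = q->task[0];
        memmove(&q->task[0], &q->task[1],
                (size_t)(q->nr - 1) * sizeof(q->task[0]));
        q->task[q->nr - 1] = cur;
        return 1;
}
```

With three tasks and a slice of 3 ticks, the first two ticks return 0 and the third rotates the queue from {1, 2, 3} to {2, 3, 1}.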

Round Robin

requeue_task_rt()

kernel/sched/rt.c

static void requeue_task_rt(struct rq *rq, struct task_struct *p, int head)
{
        struct sched_rt_entity *rt_se = &p->rt;
        struct rt_rq *rt_rq;

        for_each_sched_rt_entity(rt_se) {
                rt_rq = rt_rq_of_se(rt_se);
                requeue_rt_entity(rt_rq, rt_se, head);
        }
}

Performs round robin on an RT task. When @head=1 the task is moved to the head of the list; when 0, to the tail.

The following figure shows RT tasks of the same priority (R1, A1, A2) doing round robin.

The cycle A1 -> R1 -> A2 -> R1 repeats.

requeue_rt_entity()

kernel/sched/rt.c

/*
 * Put task to the head or the end of the run list without the overhead of
 * dequeue followed by enqueue.
 */

static void
requeue_rt_entity(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se, int head)
{
        if (on_rt_rq(rt_se)) {
                struct rt_prio_array *array = &rt_rq->active;
                struct list_head *queue = array->queue + rt_se_prio(rt_se);

                if (head)
                        list_move(&rt_se->run_list, queue);
                else
                        list_move_tail(&rt_se->run_list, queue);
        }
}

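Performs round robin on an RT scheduling entity. To avoid the overhead of a dequeue followed by an enqueue, only the scheduling entity is moved within the list.

Code lines 4~6: if the requested rt scheduling entity is already on the runqueue, look up the list for its priority among the array of 100 lists.
Code lines 8~11 move the rt scheduling entity to the head or tail of the list according to the head argument.

The following figure shows round robin of the requested rt entity.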

Load and Runtime Update

update_curr_rt()

kernel/sched/rt.c

/*
 * Update the current task's runtime statistics. Skip current tasks that
 * are not in our scheduling class.
 */

static void update_curr_rt(struct rq *rq)
{
        struct task_struct *curr = rq->curr;
        struct sched_rt_entity *rt_se = &curr->rt;
        u64 delta_exec;
        u64 now;

        if (curr->sched_class != &rt_sched_class)
                return;

        now = rq_clock_task(rq);
        delta_exec = rq_clock_task(rq) - curr->se.exec_start;
        if (unlikely((s64)delta_exec <= 0))
                return;

        schedstat_set(curr->se.statistics.exec_max,
                      max(curr->se.statistics.exec_max, delta_exec));
        
        curr->se.sum_exec_runtime += delta_exec;
        account_group_exec_runtime(curr, delta_exec);
        
        curr->se.exec_start = now;
        cpuacct_charge(curr, delta_exec);
        
        if (!rt_bandwidth_enabled())
                return;

        for_each_sched_rt_entity(rt_se) {
                struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
                
                if (sched_rt_runtime(rt_rq) != RUNTIME_INF) {
                        raw_spin_lock(&rt_rq->rt_runtime_lock);
                        rt_rq->rt_time += delta_exec;
                        if (sched_rt_runtime_exceeded(rt_rq))
                                resched_curr(rq);
                        raw_spin_unlock(&rt_rq->rt_runtime_lock);
                }
        }
}

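Updates the runtime of the currently running rt task. If the rt runtime has been exceeded, a reschedule is requested.

Code lines 8~9 return from the function if the currently running task is not an rt task.
Code lines 11~14 compute the delta execution time by subtracting the time of the last update from the current time. If the result is 0 or less, the function returns.
Code lines 16~17 update the task's maximum delta execution time for scheduler statistics.
Code line 19 updates the total execution time of the current task.
Code line 20 adds the execution time to the thread group total, which is maintained while a cpu timer is running for the thread group.
- When the posix timer expires, a signal is raised.
Code line 22 records the current time so that the next update can compute the delta execution time.
Code line 23 updates the cputime of the task and the cputime for the cpu cgroup.
- cputime: accumulated kernel time and user time
Code lines 25~26 return from the function if global rt bandwidth is not enabled.
- It is set to 0.95 seconds by default.
  - sysctl_sched_rt_runtime (950,000 us = 0.95 s)
  - "/proc/sys/kernel/sched_rt_runtime_us"
Code lines 28~33 walk up to the topmost rt entity and accumulate the execution time into rt_time of each rt runqueue.
Code lines 34~35 throttle the rt runqueue and set the reschedule request flag if the rt runtime has been exceeded.

As in the following example, the time used by the kernel and the time used by user space are shown per task group, in ticks.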

$ cat /sys/fs/cgroup/cpu/A/cpuacct.stat
user 47289
system 5

RT Watchdog

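When a user rt task runs for longer than a given time (rlimit) without sleeping, a signal is delivered to it. If no signal handler is installed, the task is terminated.

The rlimit min (soft) and max (hard) values are set through the RLIMIT_RTTIME parameter. (us)
When the min (soft) time is exceeded, the SIGXCPU signal is delivered.
When the max (hard) time is exceeded, the SIGKILL signal is delivered.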

watchdog()

kernel/sched/rt.c

static void watchdog(struct rq *rq, struct task_struct *p)
{
        unsigned long soft, hard;

        /* max may change after cur was read, this will be fixed next tick */
        soft = task_rlimit(p, RLIMIT_RTTIME);
        hard = task_rlimit_max(p, RLIMIT_RTTIME);

        if (soft != RLIM_INFINITY) {
                unsigned long next;

                if (p->rt.watchdog_stamp != jiffies) {
                        p->rt.timeout++;
                        p->rt.watchdog_stamp = jiffies;
                }

                next = DIV_ROUND_UP(min(soft, hard), USEC_PER_SEC/HZ);
                if (p->rt.timeout > next)
                        posix_cputimers_rt_watchdog(&p->posix_cputimers,
                                                    p->se.sum_exec_runtime);
        }
}

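Checks cpu time expiry of the task when a time limit (RLIMIT_RTTIME) is set for a user rt task.

Code lines 6~7 read the current (soft) time limit (us) and the maximum (hard) time limit (us) of the rt task.
Code lines 9~15: if RLIMIT_RTTIME is set for the rt task, increment the run time of the rt task (p->rt.timeout is used as a tick counter) and store the current time (jiffies) in the watchdog stamp.
Code lines 17~20: if the run time counted in ticks exceeds the soft or hard rlimit (us) converted to ticks, record the total execution time in the POSIX cpu timer so that the posix cpu timer handler can choose and send the appropriate signal.
- Note that timeout is reset to 0 when the rt task sleeps and wakes up again.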
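
The unit conversion in watchdog() is easy to get wrong by a tick, so here is a small stand-alone sketch of the same arithmetic. The function names are hypothetical, and HZ is passed as a parameter because it depends on the kernel configuration.

```c
#include <assert.h>

#define USEC_PER_SEC 1000000UL
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Number of ticks that corresponds to the smaller of the soft and hard
 * RLIMIT_RTTIME values (both in us), rounded up to whole ticks.
 */
unsigned long rttime_limit_ticks(unsigned long soft_us, unsigned long hard_us,
                                 unsigned long hz)
{
        unsigned long limit_us = soft_us < hard_us ? soft_us : hard_us;

        assert(hz > 0 && hz <= USEC_PER_SEC);
        return DIV_ROUND_UP(limit_us, USEC_PER_SEC / hz);
}

/* The watchdog reports expiry only once the tick counter passes the limit. */
int rttime_exceeded(unsigned long timeout_ticks, unsigned long limit_ticks)
{
        return timeout_ticks > limit_ticks;
}
```

For example, with a soft limit of 2,000,000 us and HZ=250 one tick is 4,000 us, so the limit is 500 ticks and expiry is reported on tick 501.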

RT Task Run Time Limit Sample

test.c

#include <sys/resource.h>

int main(void)
{
        volatile long long n = 0;
        struct rlimit rlim;

        rlim.rlim_cur = 2000000; /* soft limit, us */
        rlim.rlim_max = 3000000; /* hard limit, us */
        if (setrlimit(RLIMIT_RTTIME, &rlim) == -1)
                return 1;

        /* Spin without sleeping so that the soft limit is reached. */
        while (1)
                n++;
}

run.sh

gcc test.c -o test
date +"%Y-%m-%d %H:%M:%S.%N"
chrt -f 50 ./test
date +"%Y-%m-%d %H:%M:%S.%N"

If a user rt task keeps running without sleeping, the SIGXCPU signal is raised, the following message is printed, and the task is terminated.

$ ./run.sh
2020-10-21 20:36:48.014463532
./run.sh: line 3:  8697 CPU time limit exceeded chrt -f 50 ./test
2020-10-21 20:36:50.025738816

RT Bandwidth

Global RT Bandwidth

Unlike in the CFS scheduler, the RT bandwidth feature is always enabled by default, even when RT group scheduling is not used. The default settings are as follows.

sysctl_sched_rt_runtime
- Default value: 950,000 us (0.95 s)
- "/proc/sys/kernel/sched_rt_runtime_us"
sysctl_sched_rt_period
- Default value: 1,000,000 us (1 s)
- "/proc/sys/kernel/sched_rt_period_us"

Group RT Bandwidth

When the kernel uses cgroup and RT group scheduling is enabled with the CONFIG_RT_GROUP_SCHED kernel option, bandwidth can be configured for each task group.

rt_runtime_us
- Default value for child task groups: 0 us (disable)
- Default value for the root task group: 950,000 us (0.95 s)
- "/sys/fs/cgroup/cpu/<task group>/cpu.rt_runtime_us"
rt_period_us
- Default value: 1,000,000 us (1 s)
- "/sys/fs/cgroup/cpu/<task group>/cpu.rt_period_us"

With the default settings, rt tasks can use 0.95 s of runtime within each 1 s period. Assuming a system with a single cpu, this lets the rt scheduler occupy at most 95% of the cpu.

RT tasks usually run only for very short stretches, so it is very rare for the RT runtime to exceed 95% of a 1 s period and trigger throttling.

RT Bandwidth Initialization

init_rt_bandwidth()

kernel/sched/rt.c

void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime)
{
        rt_b->rt_period = ns_to_ktime(period);
        rt_b->rt_runtime = runtime;
        
        raw_spin_lock_init(&rt_b->rt_runtime_lock);

        hrtimer_init(&rt_b->rt_period_timer,
                        CLOCK_MONOTONIC, HRTIMER_MODE_REL);
        rt_b->rt_period_timer.function = sched_rt_period_timer;
}

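Initializes the rt bandwidth with the rt period and runtime values.

Code line 3 converts the period value (ns) passed as an argument to ktime and stores it in rt_period.
Code line 4 stores the runtime value (ns) passed as an argument in rt_runtime as-is.
Code lines 8~10 initialize the hrtimer and set the function to call on expiry.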

Setting the Group RT runtime

sched_group_set_rt_runtime()

kernel/sched/core.c

static int sched_group_set_rt_runtime(struct task_group *tg, long rt_runtime_us)
{
        u64 rt_runtime, rt_period;

        rt_period = ktime_to_ns(tg->rt_bandwidth.rt_period);
        rt_runtime = (u64)rt_runtime_us * NSEC_PER_USEC;
        if (rt_runtime_us < 0)
                rt_runtime = RUNTIME_INF;
        else if ((u64)rt_runtime_us > U64_MAX / NSEC_PER_USEC)
                return -EINVAL;

        return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
}

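Converts the requested rt runtime (us) to nanoseconds and sets it on the task group.

Code line 5 reads the period configured in the rt bandwidth, in nanoseconds.
Code line 6 converts the rt_runtime_us argument to nanoseconds.
Code lines 7~10: if the rt runtime value is less than 0, it is set to unlimited (RUNTIME_INF, -1) so that rt bandwidth does not operate. If the value would overflow when converted to nanoseconds, -EINVAL is returned.
Code line 12 sets the period (ns) and runtime (ns) of the rt bandwidth on the requested task group.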
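
The conversion rule in sched_group_set_rt_runtime() (negative means unlimited, overflow is rejected) can be isolated into a small user-space helper. This is a sketch: the function name is hypothetical and it returns -1 where the kernel returns -EINVAL.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_USEC 1000ULL
#define RUNTIME_INF   UINT64_MAX   /* all bits set, i.e. (u64)-1 */

/*
 * Convert an rt runtime given in us to ns.
 * Returns 0 and stores the result in *out_ns, or -1 if the value would
 * overflow 64 bits. A negative input means "no limit".
 */
int rt_runtime_us_to_ns(int64_t runtime_us, uint64_t *out_ns)
{
        assert(out_ns);

        if (runtime_us < 0) {
                *out_ns = RUNTIME_INF;
                return 0;
        }
        if ((uint64_t)runtime_us > UINT64_MAX / NSEC_PER_USEC)
                return -1;

        *out_ns = (uint64_t)runtime_us * NSEC_PER_USEC;
        return 0;
}
```

Writing -1 to the runtime setting therefore disables the limit, while the default 950,000 us becomes 950,000,000 ns.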

Setting the Group RT period

sched_group_set_rt_period()

kernel/sched/core.c

static int sched_group_set_rt_period(struct task_group *tg, long rt_period_us)
{
        u64 rt_runtime, rt_period;

        if (tg->rt_bandwidth.rt_runtime == RUNTIME_INF)
                return -1;

        rt_period = (u64)rt_period_us * NSEC_PER_USEC;
        rt_runtime = tg->rt_bandwidth.rt_runtime;

        return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
}

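Converts the requested rt period (us) to nanoseconds and sets it on the task group.

Code lines 5~6 give up setting the period if no rt runtime limit is set (RUNTIME_INF).
Code line 8 converts the rt_period_us argument to nanoseconds.
Code line 9 reads the runtime (ns) configured in the rt bandwidth.
Code line 11 sets the period (ns) and runtime (ns) of the rt bandwidth on the requested task group.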

Common Setting of the Group RT runtime & period

tg_set_rt_bandwidth()

kernel/sched/core.c

static int tg_set_rt_bandwidth(struct task_group *tg,
                u64 rt_period, u64 rt_runtime)
{
        int i, err = 0;

        /*
         * Disallowing the root group RT runtime is BAD, it would disallow the
         * kernel creating (and or operating) RT threads.
         */
        if (tg == &root_task_group && rt_runtime == 0)
                return -EINVAL;

        /* No period doesn't make any sense. */
        if (rt_period == 0)
                return -EINVAL;

        mutex_lock(&rt_constraints_mutex);
        read_lock(&tasklist_lock);
        err = __rt_schedulable(tg, rt_period, rt_runtime);
        if (err)
                goto unlock;

        raw_spin_lock_irq(&tg->rt_bandwidth.rt_runtime_lock);
        tg->rt_bandwidth.rt_period = ns_to_ktime(rt_period);
        tg->rt_bandwidth.rt_runtime = rt_runtime;

        for_each_possible_cpu(i) {
                struct rt_rq *rt_rq = tg->rt_rq[i];

                raw_spin_lock(&rt_rq->rt_runtime_lock);
                rt_rq->rt_runtime = rt_runtime;
                raw_spin_unlock(&rt_rq->rt_runtime_lock);
        }
        raw_spin_unlock_irq(&tg->rt_bandwidth.rt_runtime_lock);
unlock:
        read_unlock(&tasklist_lock);
        mutex_unlock(&rt_constraints_mutex);

        return err;
}

Checking Whether the RT Runtime Is Exceeded

sched_rt_runtime_exceeded()

kernel/sched/rt.c

static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
{
        u64 runtime = sched_rt_runtime(rt_rq);

        if (rt_rq->rt_throttled)
                return rt_rq_throttled(rt_rq);

        if (runtime >= sched_rt_period(rt_rq))
                return 0;

        balance_runtime(rt_rq);
        runtime = sched_rt_runtime(rt_rq);
        if (runtime == RUNTIME_INF)
                return 0;

        if (rt_rq->rt_time > runtime) {
                struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);

                /*
                 * Don't actually throttle groups that have no runtime assigned
                 * but accrue some time due to boosting.
                 */
                if (likely(rt_b->rt_runtime)) {
                        rt_rq->rt_throttled = 1;
                        printk_deferred_once("sched: RT throttling activated\n");
                } else {
                        /*
                         * In case we did anyway, make it go away,
                         * replenishment is a joke, since it will replenish us
                         * with exactly 0 ns.
                         */
                        rt_rq->rt_time = 0;
                }

                if (rt_rq_throttled(rt_rq)) {
                        sched_rt_rq_dequeue(rt_rq);
                        return 1;
                }
        }

        return 0;
}

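If the runtime consumed on the local rt runqueue exceeds the allocated runtime, runtime balancing is performed. Returns 1 if throttling is needed.

Code line 3 reads the local rt runtime.
Code lines 5~6 return the result of rt_rq_throttled() if the rt runqueue is already throttled.
- When the runqueue is pi boosted, 0 is returned even though it is throttled, so that the boosted task can proceed quickly without a reschedule.
Code lines 8~9 return 0 if the rt bandwidth runtime of the task group is greater than or equal to its period, because there is no need to throttle.
Code line 11 performs runtime balancing if the RT_RUNTIME_SHARE feature (default=false) is in use and the time consumed on the local rt runqueue exceeds its allocated runtime.
- Runtime balancing borrows the missing runtime from other cpus.
- A UP system has only one cpu, so there is no other cpu to borrow spare runtime from and balancing does nothing.
Code lines 12~14 run after runtime balancing. They read the local rt runtime again, since it may have been replenished. If the local runtime is unlimited (RUNTIME_INF), 0 is returned so that no throttling takes place.
Code lines 16~33 handle the case where the time consumed on the local rt runqueue still exceeds the local rt runtime after balancing. If an rt runtime is configured for the group, which is the likely case, rt_throttled is set to 1 to mark the runqueue as throttled. Otherwise the consumed local rt time is reset to 0.
Code lines 35~38: if the rt runqueue is throttled and not pi boosted, the rt runqueue is dequeued through sched_rt_rq_dequeue(). For the top-level rt runqueue this reduces rq->nr_running by the number of running rt entities rather than dequeuing a group entity. Then 1 is returned so that throttling takes place.
- Reference: sched/rt: Do not throttle when PI boosting (2012, v3.4-rc1)
Code line 41 returns 0 so that no throttling takes place.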
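
Stripped of locking, balancing, pi boosting and group hierarchy, the throttle decision reduces to a few comparisons. The following user-space sketch uses hypothetical toy_* names and folds the accounting step of update_curr_rt() and the check of sched_rt_runtime_exceeded() into one function; it is a simplification under those assumptions, not the kernel logic in full.

```c
#include <assert.h>
#include <stdint.h>

#define RUNTIME_INF UINT64_MAX

struct toy_rt_rq {
        uint64_t rt_time;       /* ns consumed in the current period */
        uint64_t rt_runtime;    /* ns allowed per period, or RUNTIME_INF */
        uint64_t rt_period;     /* ns */
        int      rt_throttled;  /* 1 once the runqueue is throttled */
};

/* Account delta_ns of execution and return 1 if the runqueue is throttled. */
int toy_account_and_check(struct toy_rt_rq *rq, uint64_t delta_ns)
{
        assert(rq->rt_period > 0);

        if (rq->rt_runtime == RUNTIME_INF)
                return 0;                   /* bandwidth control is off */

        rq->rt_time += delta_ns;

        if (rq->rt_throttled)
                return 1;                   /* already throttled */
        if (rq->rt_runtime >= rq->rt_period)
                return 0;                   /* runtime covers the whole period */

        if (rq->rt_time > rq->rt_runtime) {
                if (rq->rt_runtime == 0) {
                        rq->rt_time = 0;    /* no runtime assigned: do not throttle */
                        return 0;
                }
                rq->rt_throttled = 1;
                return 1;
        }
        return 0;
}
```

With the default 0.95 s runtime in a 1 s period and 4 ms ticks, the 238th tick (952 ms) is the first one that exceeds the runtime, so that is where the model reports throttling.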

RT_RUNTIME_SHARE feature

The default setting is false.
When this feature is turned on, the runtime that one cpu has overrun is covered by sharing runtime from the other cpus, so the runtime configured by the user is enforced across all cpus as a whole.
- e.g.) runtime=18ms is configured, and on a 4 cpu system cpu#0 exceeds its runtime and runs for 20ms; 1ms of runtime is collected from each of the other cpus and lent to it.
  - cpu#0: runtime=20ms, cpu#1: runtime=19ms, cpu#2: runtime=19ms, cpu#3: runtime=19ms
When this feature is not used, even if one cpu overruns its runtime, the other cpus ignore this and each uses only its own configured runtime.
- e.g.) runtime=18ms is configured, and on a 4 cpu system cpu#0 exceeds its runtime and runs for 20ms; no runtime is borrowed from the other cpus and each cpu operates on its own.
  - cpu#0: runtime=20ms, cpu#1: runtime=20ms, cpu#2: runtime=20ms, cpu#3: runtime=20ms

The following figure shows how a runtime overrun is handled on a UP system.

Dequeue and Enqueue of the RT Runqueue

sched_rt_rq_dequeue()

kernel/sched/rt.c

static void sched_rt_rq_dequeue(struct rt_rq *rt_rq)
{
        struct sched_rt_entity *rt_se;
        int cpu = cpu_of(rq_of_rt_rq(rt_rq));

        rt_se = rt_rq->tg->rt_se[cpu];

        if (!rt_se) {
                dequeue_top_rt_rq(rt_rq);
                /* Kick cpufreq (see the comment in kernel/sched/sched.h). */
                cpufreq_update_util(rq_of_rt_rq(rt_rq), 0);
        }
        else if (on_rt_rq(rt_se))
                dequeue_rt_entity(rt_se, 0);
}

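Dequeues the rt runqueue.

Code line 6 looks up the group entity that corresponds to the rt runqueue.
Code lines 8~12: the topmost root has no rt group entity. In that case the top-level rt runqueue is marked as dequeued and util is updated.
Code lines 13~14 dequeue the rt group entity.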

sched_rt_rq_enqueue()

kernel/sched/rt.c

static void sched_rt_rq_enqueue(struct rt_rq *rt_rq)
{
        struct task_struct *curr = rq_of_rt_rq(rt_rq)->curr;
        struct rq *rq = rq_of_rt_rq(rt_rq);
        struct sched_rt_entity *rt_se;

        int cpu = cpu_of(rq);

        rt_se = rt_rq->tg->rt_se[cpu];

        if (rt_rq->rt_nr_running) {
                if (!rt_se)
                        enqueue_top_rt_rq(rt_rq);
                else if (!on_rt_rq(rt_se))
                        enqueue_rt_entity(rt_se, 0);

                if (rt_rq->highest_prio.curr < curr->prio)
                        resched_curr(rq);
        }
}

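Enqueues the rt runqueue.

Code line 9 looks up the group entity that corresponds to the rt runqueue.
Code line 11: the enqueue is done only if the rt runqueue has running entities.
Code lines 12~13: the topmost root has no rt group entity. In that case the top-level rt runqueue is marked as enqueued.
Code lines 14~15 enqueue the rt group entity.
Code lines 17~18 request a reschedule if the rt runqueue now holds a higher priority than the current task.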

Dequeue and Enqueue of the Top-Level RT Runqueue

dequeue_top_rt_rq()

kernel/sched/rt.c

static void 
dequeue_top_rt_rq(struct rt_rq *rt_rq)
{
        struct rq *rq = rq_of_rt_rq(rt_rq);

        BUG_ON(&rq->rt != rt_rq);

        if (!rt_rq->rt_queued)
                return;

        BUG_ON(!rq->nr_running);

        sub_nr_running(rq, rt_rq->rt_nr_running);
        rt_rq->rt_queued = 0;
}

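Marks the top-level rt runqueue as dequeued and subtracts the number of tasks that were running on it from the runqueue. (updates rq->nr_running)

Code lines 8~9 return from the function if the local rt runqueue is already dequeued.
- rt_rq->rt_queued
  - Indicates whether the rt runqueue is active on the runqueue. (1=enqueued, 0=dequeued)
Code line 13 subtracts the number of entities running on the rt runqueue from the runqueue count.
- rq->nr_running -= rt_rq->rt_nr_running
Code line 14 sets the rt runqueue to the dequeued state.

The following figure shows how the rt runqueue is dequeued and enqueued.

Having as many as 18 rt tasks running at the same time almost never happens in practice; the number is used only to aid understanding.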

enqueue_top_rt_rq()

kernel/sched/rt.c

static void
enqueue_top_rt_rq(struct rt_rq *rt_rq)
{
        struct rq *rq = rq_of_rt_rq(rt_rq);

        BUG_ON(&rq->rt != rt_rq);

        if (rt_rq->rt_queued)
                return;

        if (rt_rq_throttled(rt_rq))
                return;

        if (rt_rq->rt_nr_running) {
                add_nr_running(rq, rt_rq->rt_nr_running);
                rt_rq->rt_queued = 1;
        }

        /* Kick cpufreq (see the comment in kernel/sched/sched.h). */
        cpufreq_update_util(rq, 0);
}

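Marks the top-level rt runqueue as enqueued and adds the number of tasks on it to the runqueue. (updates rq->nr_running)

Code lines 8~9 return from the function if the top-level local rt runqueue is already enqueued.
Code lines 11~12 return from the function if the top-level local rt runqueue is throttled and not pi boosted.
Code lines 14~17: if the rt runqueue has runnable entities, it is switched to the enqueued state and the number of running tasks is added.
- rq->nr_running += rt_rq->rt_nr_running
Code line 20 updates the runqueue util.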

RT Runtime Balancing

balance_runtime()

kernel/sched/rt.c

static void balance_runtime(struct rt_rq *rt_rq)
{
        if (!sched_feat(RT_RUNTIME_SHARE))
                return;

        if (rt_rq->rt_time > rt_rq->rt_runtime) {
                raw_spin_unlock(&rt_rq->rt_runtime_lock);
                do_balance_runtime(rt_rq);
                raw_spin_lock(&rt_rq->rt_runtime_lock);
        }
}

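If the requested rt runqueue has used up all of its allocated runtime, balancing borrows runtime from the other local rt pools and extends the allocation, up to rt_period at most.

Code lines 3~4 return from the function if the RT_RUNTIME_SHARE feature is not in use.
- Because borrowed runtime can starve cfs tasks, the default was changed to disabled in kernel v5.10-rc1.
  - Reference: sched/rt: Disable RT_RUNTIME_SHARE by default (2020)
Code lines 6~10: if the time consumed on the requested rt runqueue exceeds its allocated runtime, the rt runqueue's rt_runtime_lock is released, do_balance_runtime() borrows runtime from the other local rt pools, and the lock is taken again.

The following figure shows how the rt bandwidth of a particular task group behaves when the RT_RUNTIME_SHARE feature is not used.

The following figure shows how the rt bandwidth of a particular task group behaves when the RT_RUNTIME_SHARE feature is used.

When the local runtime runs short, an SMP system performs runtime balancing, borrowing runtime from other cpus.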

do_balance_runtime()

kernel/sched/rt.c

/*
 * We ran out of runtime, see if we can borrow some from our neighbours.
 */

static void do_balance_runtime(struct rt_rq *rt_rq)
{
        struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
        struct root_domain *rd = rq_of_rt_rq(rt_rq)->rd;
        int i, weight;
        u64 rt_period;

        weight = cpumask_weight(rd->span);

        raw_spin_lock(&rt_b->rt_runtime_lock);
        rt_period = ktime_to_ns(rt_b->rt_period);
        for_each_cpu(i, rd->span) {
                struct rt_rq *iter = sched_rt_period_rt_rq(rt_b, i);
                s64 diff;

                if (iter == rt_rq)
                        continue;

                raw_spin_lock(&iter->rt_runtime_lock);
                /*
                 * Either all rqs have inf runtime and there's nothing to steal
                 * or __disable_runtime() below sets a specific rq to inf to
                 * indicate its been disabled and disalow stealing.
                 */
                if (iter->rt_runtime == RUNTIME_INF)
                        goto next;

                /*
                 * From runqueues with spare time, take 1/n part of their
                 * spare time, but no more than our period.
                 */
                diff = iter->rt_runtime - iter->rt_time;
                if (diff > 0) {
                        diff = div_u64((u64)diff, weight);
                        if (rt_rq->rt_runtime + diff > rt_period)
                                diff = rt_period - rt_rq->rt_runtime;
                        iter->rt_runtime -= diff;
                        rt_rq->rt_runtime += diff;
                        if (rt_rq->rt_runtime == rt_period) {
                                raw_spin_unlock(&iter->rt_runtime_lock);
                                break;
                        }
                }
next:
                raw_spin_unlock(&iter->rt_runtime_lock);
        }
        raw_spin_unlock(&rt_b->rt_runtime_lock);
}

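Borrows runtime for the requested local rt runqueue from what the other local rt runqueues in the root domain have left unused.

Code lines 3~4 look up the local rt pool (rt bandwidth) and the root domain for the local rt runqueue.
Code line 8 reads the number of cpus available in the root domain.
Code line 11 reads the group's rt period setting, converted to nanoseconds.
Code lines 12~13 iterate over the cpus available in the root domain and assign the local rt runqueue attached to the task group to iter.
Code lines 16~17 skip the iteration if the local rt runqueue being visited is the one passed as the argument.
- The requested local rt runqueue has to obtain runtime from other local rt runqueues, so it skips itself.
Code lines 25~26: if the visited local rt runqueue has unlimited runtime (RUNTIME_INF), rt bandwidth is not configured for it, so jump to next and skip it.
Code line 32 assigns to diff the allocated runtime of the visited local rt runqueue minus its consumed rt runtime, which is the runtime not yet used.
Code lines 33~34: if the visited local rt runqueue has unused runtime, divide it by the number of cpus in the root domain.
Code lines 35~36 adjust diff so that the allocated runtime of the requested local rt runqueue plus the borrowed diff does not exceed rt_period.
Code lines 37~38 subtract diff from the runtime allocation of the visited local rt runqueue and add it to that of the requested one.
Code lines 39~42 leave the loop if the replenished runtime allocation equals rt_period, because nothing more needs to be borrowed.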
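
To see what the borrowing rule does with concrete numbers, the loop can be replayed in user space. The sketch below keeps the arithmetic of do_balance_runtime() (take 1/n of each neighbour's spare time, capped at the period) but drops all locking and uses a plain array instead of the root domain; the toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define RUNTIME_INF UINT64_MAX

struct toy_cpu_rt {
        uint64_t rt_runtime;   /* allocated runtime */
        uint64_t rt_time;      /* consumed runtime */
};

/*
 * Borrow runtime for cpus[self] from the other entries.
 * Returns the total amount borrowed.
 */
uint64_t toy_balance_runtime(struct toy_cpu_rt *cpus, int ncpus, int self,
                             uint64_t rt_period)
{
        uint64_t borrowed = 0;

        assert(ncpus > 0 && self >= 0 && self < ncpus);
        assert(cpus[self].rt_runtime <= rt_period);

        for (int i = 0; i < ncpus; i++) {
                struct toy_cpu_rt *iter = &cpus[i];
                int64_t diff;

                if (i == self || iter->rt_runtime == RUNTIME_INF)
                        continue;

                /* spare time of this neighbour */
                diff = (int64_t)(iter->rt_runtime - iter->rt_time);
                if (diff <= 0)
                        continue;

                diff /= ncpus;   /* take only 1/n of the spare time */
                if (cpus[self].rt_runtime + (uint64_t)diff > rt_period)
                        diff = (int64_t)(rt_period - cpus[self].rt_runtime);

                iter->rt_runtime -= (uint64_t)diff;
                cpus[self].rt_runtime += (uint64_t)diff;
                borrowed += (uint64_t)diff;

                if (cpus[self].rt_runtime == rt_period)
                        break;   /* cannot hold more than one period */
        }
        return borrowed;
}
```

For four cpus with runtime 20 and period 100, where the three neighbours have each consumed 8, every neighbour gives up (20 - 8) / 4 = 3, so the requesting cpu ends with 29 and the others with 17.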

Enqueue & Dequeue of the RT Entity

The following figure is a flow chart of the calls between the enqueue_rt_entity() and dequeue_rt_entity() functions.

enqueue_rt_entity()

kernel/sched/rt.c

static void enqueue_rt_entity(struct sched_rt_entity *rt_se, bool head)
{
        struct rq *rq = rq_of_rt_se(rt_se);

        dequeue_rt_stack(rt_se);
        for_each_sched_rt_entity(rt_se)
                __enqueue_rt_entity(rt_se, head);
        enqueue_top_rt_rq(&rq->rt);
}

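Enqueues the rt entity.

Code line 5 dequeues rt entities top-down, from the topmost rt entity to the requested rt entity.
- Anything that was already enqueued is dequeued first.
Code lines 6~7 enqueue again from the requested rt entity up to the topmost rt entity.
Code line 8 marks the top-level rt runqueue as enqueued and adds the number of tasks on it. (updates rq->nr_running)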

dequeue_rt_entity()

kernel/sched/rt.c

static void dequeue_rt_entity(struct sched_rt_entity *rt_se)
{
        struct rq *rq = rq_of_rt_se(rt_se);

        dequeue_rt_stack(rt_se); 

        for_each_sched_rt_entity(rt_se) {
                struct rt_rq *rt_rq = group_rt_rq(rt_se);

                if (rt_rq && rt_rq->rt_nr_running)
                        __enqueue_rt_entity(rt_se, false);
        }
        enqueue_top_rt_rq(&rq->rt);
}

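Dequeues the rt entity.

Code line 5 dequeues rt entities top-down, from the topmost rt entity to the requested rt entity.
Code lines 7~12: if the rt entity being visited represents a group and other tasks are still running in that group, the entity is enqueued again.
Code line 13 marks the top-level rt runqueue as enqueued and adds the number of tasks on it. (updates rq->nr_running)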

dequeue_rt_stack()

kernel/sched/rt.c

/*
 * Because the prio of an upper entry depends on the lower
 * entries, we must remove entries top - down.
 */

static void dequeue_rt_stack(struct sched_rt_entity *rt_se)
{
        struct sched_rt_entity *back = NULL;

        for_each_sched_rt_entity(rt_se) {
                rt_se->back = back;
                back = rt_se;
        }

        dequeue_top_rt_rq(rt_rq_of_se(back));

        for (rt_se = back; rt_se; rt_se = rt_se->back) {
                if (on_rt_rq(rt_se))
                        __dequeue_rt_entity(rt_se);
        }
}

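Dequeues the rt entities top-down, from the topmost rt entity to the requested rt entity.

Code lines 5~8 build a reversed chain (through the back pointers) of the rt entity hierarchy that leads towards the root.
Code line 10 marks the top-level rt runqueue as dequeued and subtracts the number of tasks on it. (updates rq->nr_running)
Code lines 12~15 walk from the topmost entity to the requested rt entity and dequeue each rt entity that is queued on its rt runqueue.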
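
The back-pointer trick used by dequeue_rt_stack() is a general way to walk a parent-linked chain in the opposite direction without recursion. A minimal user-space sketch, with hypothetical toy_* names and a plain parent pointer standing in for the entity hierarchy:

```c
#include <assert.h>
#include <stddef.h>

struct toy_se {
        const char *name;
        struct toy_se *parent;   /* towards the root; NULL at the top */
        struct toy_se *back;     /* towards the leaf; set by the first pass */
};

/* First pass (bottom-up): record the reverse links, return the top entity. */
struct toy_se *toy_build_back_chain(struct toy_se *se)
{
        struct toy_se *back = NULL;

        assert(se);
        for (; se; se = se->parent) {
                se->back = back;
                back = se;
        }
        return back;
}

/* Second pass (top-down): store the visiting order in out[], return count. */
int toy_visit_top_down(struct toy_se *se, struct toy_se **out, int max)
{
        int n = 0;

        for (se = toy_build_back_chain(se); se && n < max; se = se->back)
                out[n++] = se;
        return n;
}
```

For a task inside group B inside group A, the second pass visits A, then B, then the task, which is the order in which the entities have to be removed.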

__enqueue_rt_entity()

kernel/sched/rt.c

static void __enqueue_rt_entity(struct sched_rt_entity *rt_se, bool head)
{
        struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
        struct rt_prio_array *array = &rt_rq->active;
        struct rt_rq *group_rq = group_rt_rq(rt_se);
        struct list_head *queue = array->queue + rt_se_prio(rt_se);

        /*
         * Don't enqueue the group if its throttled, or when empty.
         * The latter is a consequence of the former when a child group
         * get throttled and the current group doesn't have any other
         * active members.
         */
        if (group_rq && (rt_rq_throttled(group_rq) || !group_rq->rt_nr_running))
                return;

        if (head)
                list_add(&rt_se->run_list, queue);
        else
                list_add_tail(&rt_se->run_list, queue);
        __set_bit(rt_se_prio(rt_se), array->bitmap);

        inc_rt_tasks(rt_se, rt_rq);
}

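Enqueues the rt entity on the rt runqueue.

Code line 3 gets the rt runqueue responsible for scheduling the rt entity.
Code line 5 gets the group rt runqueue of the rt entity.
Code line 6 looks up the queue list for the priority of the rt entity.
Code lines 14~15 return from the function if the rt entity represents a task group and that group is throttled or has no enqueued rt task.
- This is the case where a task group is being enqueued but it contains no enqueued rt task.
Code lines 17~20 add the rt entity to the head or tail of the queue list according to the head argument.
Code line 21 sets the bit for that priority's list queue in the bitmap.
Code line 23 performs follow-up work for the enqueued rt task.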

__dequeue_rt_entity()

kernel/sched/rt.c

static void __dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
{
        struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
        struct rt_prio_array *array = &rt_rq->active;

        if (move_entity(flags)) {
                WARN_ON_ONCE(!rt_se->on_list);
                __delist_rt_entity(rt_se, array);
        }
        rt_se->on_rq = 0;
        dec_rt_tasks(rt_se, rt_rq);
}

Dequeues the rt entity from the rt runqueue.

Code lines 6~9 remove the rt entity from the list. The bitmap is updated as well.
Code line 10 marks the rt entity as dequeued.
Code line 11 performs follow-up work for the dequeued rt task.

inc_rt_tasks()

kernel/sched/rt.c

static inline
void inc_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
{
        int prio = rt_se_prio(rt_se);

        WARN_ON(!rt_prio(prio));
        rt_rq->rt_nr_running += rt_se_nr_running(rt_se);
        rt_rq->rr_nr_running += rt_se_rr_nr_running(rt_se);

        inc_rt_prio(rt_rq, prio);
        inc_rt_migration(rt_se, rt_rq);
        inc_rt_group(rt_se, rt_rq);
}

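Performs follow-up work for the enqueued rt entity.

Code line 7 updates the number of rt tasks running at or below the rt runqueue.
- If the rt entity is a task, the count is increased by 1; if it is a group, it is increased by the number of rt tasks running at or below the group.
Code line 8 increases the number of rt tasks with the round robin policy running at or below the rt runqueue.
Code line 10 updates the highest priority if the enqueued rt entity changed it, and also updates cpupri.
Code line 11: if the added rt entity is a task, increases the overload counters of the runqueue and updates whether the runqueue is overloaded.
- rt_nr_total++
- If the task is allowed on 2 or more cpus, rt_nr_migratory++
- If the task is allowed on 2 or more cpus and 2 or more rt tasks are running, the current runqueue is marked overloaded
Code line 12 performs the work for the added rt group.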

dec_rt_tasks()

kernel/sched/rt.c

static inline
void dec_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
{
        WARN_ON(!rt_prio(rt_se_prio(rt_se)));
        WARN_ON(!rt_rq->rt_nr_running);
        rt_rq->rt_nr_running -= rt_se_nr_running(rt_se);
        rt_rq->rr_nr_running -= rt_se_rr_nr_running(rt_se);

        dec_rt_prio(rt_rq, rt_se_prio(rt_se));
        dec_rt_migration(rt_se, rt_rq);
        dec_rt_group(rt_se, rt_rq);
}

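Performs follow-up work for the dequeued rt entity.

Code line 6 updates the number of rt tasks running at or below the rt runqueue.
- If the rt entity is a task, the count is decreased by 1; if it is a group, it is decreased by the number of rt tasks running at or below the group.
Code line 7 decreases the number of rt tasks with the round robin policy running at or below the rt runqueue.
Code line 9 updates the highest priority if the dequeued rt entity changed it, and also updates cpupri.
Code line 10: if the removed rt entity is a task, decreases the overload counters of the runqueue and updates whether the runqueue is overloaded.
- rt_nr_total--
- If the task is allowed on 2 or more cpus, rt_nr_migratory--
- If the task is allowed on only 1 cpu or at most 1 rt task remains, the runqueue cannot be overloaded, so the overload state is cleared
Code line 11 performs the work for the removed rt group.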

rt_se_nr_running()

kernel/sched/rt.c

static inline
unsigned int rt_se_nr_running(struct sched_rt_entity *rt_se)
{
        struct rt_rq *group_rq = group_rt_rq(rt_se);

        if (group_rq)
                return group_rq->rt_nr_running;
        else
                return 1;
}

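Returns the number of tasks associated with the rt entity. If the rt entity is a task, 1 is returned; if it represents a task group, the number of rt tasks running at or below that group's rt runqueue is returned.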

rt_se_rr_nr_running()

kernel/sched/rt.c

static inline
unsigned int rt_se_rr_nr_running(struct sched_rt_entity *rt_se)
{
        struct rt_rq *group_rq = group_rt_rq(rt_se);
        struct task_struct *tsk;

        if (group_rq)
                return group_rq->rr_nr_running;

        tsk = rt_task_of(rt_se);

        return (tsk->policy == SCHED_RR) ? 1 : 0;
}

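Returns the number of round robin policy tasks associated with the rt entity. If the rt entity is an rr task, 1 is returned (0 for any other task); if it represents a task group, the number of rr tasks running at or below that group's rt runqueue is returned.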

CPU Priority Management with highest priority

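A total of 102 priority levels are managed, as follows.

100 RT priorities
idle
cfs normal

These 102 levels are kept in two tables so that lookups are immediate in both directions.

cpu -> priority
priority -> cpu

The following figure shows the arrays used to convert between a cpu and a priority.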

inc_rt_prio()

kernel/sched/rt.c

static void
inc_rt_prio(struct rt_rq *rt_rq, int prio)
{
        int prev_prio = rt_rq->highest_prio.curr;

        if (prio < prev_prio)
                rt_rq->highest_prio.curr = prio;

        inc_rt_prio_smp(rt_rq, prio, prev_prio);
}

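If the enqueued rt entity changes the highest priority, the new value is recorded and cpupri is updated as well.

Code lines 4~7: if the requested priority is higher than the current highest priority of the rt runqueue (a lower prio number means a higher priority), the highest priority is updated.
Code line 9 reflects the cpu and priority of the requested rt runqueue in cpupri.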

dec_rt_prio()

kernel/sched/rt.c

static void
dec_rt_prio(struct rt_rq *rt_rq, int prio)
{
        int prev_prio = rt_rq->highest_prio.curr;

        if (rt_rq->rt_nr_running) {

                WARN_ON(prio < prev_prio);

                /*
                 * This may have been our highest task, and therefore
                 * we may have some recomputation to do
                 */
                if (prio == prev_prio) {
                        struct rt_prio_array *array = &rt_rq->active;

                        rt_rq->highest_prio.curr =
                                sched_find_first_bit(array->bitmap);
                }

        } else
                rt_rq->highest_prio.curr = MAX_RT_PRIO;

        dec_rt_prio_smp(rt_rq, prio, prev_prio);
}

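If the dequeued rt entity changes the highest priority, the new value is recorded and cpupri is updated as well.

Code lines 6~19: if rt tasks are still running on the rt runqueue and the requested priority was the highest one, the highest priority is recomputed from the bitmap, so the next highest priority takes its place.
Code lines 21~22: if no rt task is running on the rt runqueue any more, the highest priority is reset to the empty state. (set to 100, MAX_RT_PRIO)
Code line 24 reflects the cpu and priority of the requested rt runqueue in cpupri.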
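The bookkeeping done by inc_rt_prio() and dec_rt_prio() can be modelled in user space. This is only a minimal sketch: the struct and function names are hypothetical, a per-priority counter stands in for the kernel's active bitmap, and the cpupri update is left out.

```c
#include <assert.h>

#define SKETCH_MAX_RT_PRIO 100   /* assumed equal to the kernel's MAX_RT_PRIO */

/* Hypothetical user-space model of the fields touched by inc/dec_rt_prio(). */
struct rt_rq_sketch {
    unsigned int nr_queued[SKETCH_MAX_RT_PRIO]; /* stands in for active.bitmap */
    unsigned int rt_nr_running;
    int highest_prio_curr;                      /* SKETCH_MAX_RT_PRIO = empty */
};

void rt_rq_sketch_init(struct rt_rq_sketch *rq)
{
    for (int i = 0; i < SKETCH_MAX_RT_PRIO; i++)
        rq->nr_queued[i] = 0;
    rq->rt_nr_running = 0;
    rq->highest_prio_curr = SKETCH_MAX_RT_PRIO;
}

/* Linear scan that plays the role of sched_find_first_bit(). */
static int first_queued_prio(const struct rt_rq_sketch *rq)
{
    for (int i = 0; i < SKETCH_MAX_RT_PRIO; i++)
        if (rq->nr_queued[i])
            return i;
    return SKETCH_MAX_RT_PRIO;
}

void sketch_enqueue(struct rt_rq_sketch *rq, int prio)
{
    assert(prio >= 0 && prio < SKETCH_MAX_RT_PRIO);
    rq->nr_queued[prio]++;
    rq->rt_nr_running++;
    if (prio < rq->highest_prio_curr)   /* lower number = higher priority */
        rq->highest_prio_curr = prio;
}

void sketch_dequeue(struct rt_rq_sketch *rq, int prio)
{
    assert(prio >= 0 && prio < SKETCH_MAX_RT_PRIO && rq->nr_queued[prio] > 0);
    rq->nr_queued[prio]--;
    rq->rt_nr_running--;
    if (!rq->rt_nr_running)
        rq->highest_prio_curr = SKETCH_MAX_RT_PRIO;    /* queue is now empty */
    else if (prio == rq->highest_prio_curr)
        rq->highest_prio_curr = first_queued_prio(rq); /* recompute */
}
```

Note that removing one of two tasks queued at the highest priority leaves highest_prio_curr unchanged, because the scan finds the same level again.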

inc_rt_prio_smp()

kernel/sched/rt.c

static void
inc_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prev_prio)
{
        struct rq *rq = rq_of_rt_rq(rt_rq);

#ifdef CONFIG_RT_GROUP_SCHED
        /*
         * Change rq's cpupri only if rt_rq is the top queue.
         */
        if (&rq->rt != rt_rq)
                return;
#endif
        if (rq->online && prio < prev_prio)
                cpupri_set(&rq->rd->cpupri, rq->cpu, prio);
}

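If the requested priority becomes the highest priority of the rt runqueue, the cpu and that priority are set in cpupri.

Code lines 6~12: when group scheduling is used, the function returns unless this is the top-level rt runqueue.
Code lines 13~14: if the runqueue is online and the highest priority was raised, the runqueue's cpu and the requested priority are set in cpupri.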

dec_rt_prio_smp()

kernel/sched/rt.c

static void
dec_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prev_prio)
{
        struct rq *rq = rq_of_rt_rq(rt_rq);

#ifdef CONFIG_RT_GROUP_SCHED
        /*
         * Change rq's cpupri only if rt_rq is the top queue.
         */
        if (&rq->rt != rt_rq)
                return;
#endif
        if (rq->online && rt_rq->highest_prio.curr != prev_prio)
                cpupri_set(&rq->rd->cpupri, rq->cpu, rt_rq->highest_prio.curr);
}

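If the requested priority was the highest priority of the rt runqueue, the cpu and the newly computed highest priority (the next one in line) are set in cpupri.

Code lines 10~11: on a kernel with rt group scheduling, the function returns unless this is the top-level rt runqueue.
Code lines 13~14: if the runqueue is online and the dequeue changed its highest priority, the cpu and the newly computed highest priority are set in cpupri.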

Updating the Highest RT Priority

cpupri_set()

Records in cpupri the given cpu together with the highest priority currently runnable on it.

kernel/sched/cpupri.c – 1/2

/**
 * cpupri_set - update the cpu priority setting
 * @cp: The cpupri context
 * @cpu: The target cpu
 * @newpri: The priority (INVALID-RT99) to assign to this CPU
 *
 * Note: Assumes cpu_rq(cpu)->lock is locked
 *
 * Returns: (void)
 */

void cpupri_set(struct cpupri *cp, int cpu, int newpri)
{
        int *currpri = &cp->cpu_to_pri[cpu];
        int oldpri = *currpri;
        int do_mb = 0;

        newpri = convert_prio(newpri);

        BUG_ON(newpri >= CPUPRI_NR_PRIORITIES);

        if (newpri == oldpri)
                return;

        /*
         * If the cpu was currently mapped to a different value, we
         * need to map it to the new value then remove the old value.
         * Note, we must add the new value first, otherwise we risk the
         * cpu being missed by the priority loop in cpupri_find.
         */
        if (likely(newpri != CPUPRI_INVALID)) {
                struct cpupri_vec *vec = &cp->pri_to_cpu[newpri];

                cpumask_set_cpu(cpu, vec->mask);
                /*
                 * When adding a new vector, we update the mask first,
                 * do a write memory barrier, and then update the count, to
                 * make sure the vector is visible when count is set.
                 */
                smp_mb__before_atomic();
                atomic_inc(&(vec)->count);
                do_mb = 1;
        }

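Code line 7 converts the priority passed as an argument into a cpupri.
Code lines 11~13 return if the cpu is already at the same priority.
Code lines 20~23 set the requested cpu's bit in the cpumask of the vector that corresponds to the new priority.
Code lines 29~31 increment the cpu counter of the vector for the new priority. Because a counter decrement follows and needs a memory barrier at that point, do_mb is set to 1.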

kernel/sched/cpupri.c – 2/2

        if (likely(oldpri != CPUPRI_INVALID)) {
                struct cpupri_vec *vec  = &cp->pri_to_cpu[oldpri];

                /*
                 * Because the order of modification of the vec->count
                 * is important, we must make sure that the update
                 * of the new prio is seen before we decrement the
                 * old prio. This makes sure that the loop sees
                 * one or the other when we raise the priority of
                 * the run queue. We don't care about when we lower the
                 * priority, as that will trigger an rt pull anyway.
                 *
                 * We only need to do a memory barrier if we updated
                 * the new priority vec.
                 */
                if (do_mb)
                        smp_mb__after_atomic();

                /*
                 * When removing from the vector, we decrement the counter first
                 * do a memory barrier and then clear the mask.
                 */
                atomic_dec(&(vec)->count);
                smp_mb__after_atomic();
                cpumask_clear_cpu(cpu, vec->mask);
        }

        *currpri = newpri;
}

Code line 1 handles the case where the cpu already had a valid cpupri (oldpri is not CPUPRI_INVALID).
Code lines 16~17 issue the memory barrier if one is required.
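- On architectures whose atomic operations are already fully ordered (x86, for example) this costs no more than a compiler barrier; on SMP ARM it expands to a full smp_mb().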
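Code line 23 decrements the cpu counter of the vector for the old priority.
Code line 25 clears the requested cpu's bit in the cpumask of the vector that corresponds to the old priority.
Code line 28 records newpri in the cpu's slot of cpu_to_pri.

The following figure shows the 102 cpupri vectors inside the root domain's cpupri, and cpu_to_pri, being updated.

Four cpus are running, in cpu-number order, at RT0, NORMAL, IDLE and RT99; the last cpu is then set to prio=120, the default nice 0 priority.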
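The two-way table update can be tried out with a small single-threaded model. This is a sketch under simplifying assumptions: the names are hypothetical, each vector is reduced to a plain bitmask (bit n = cpu#n), and the atomic counter and memory barriers of the real cpupri_set() are omitted. The priority passed in is already a cpupri level.

```c
#include <assert.h>

#define SKETCH_NR_CPUS        4
#define SKETCH_NR_PRIORITIES  102   /* idle, normal and 100 rt levels */
#define SKETCH_INVALID        (-1)

/* Hypothetical single-threaded model: no atomics, no memory barriers. */
struct cpupri_sketch {
    unsigned int pri_to_cpu[SKETCH_NR_PRIORITIES]; /* one cpu bitmask per level */
    int cpu_to_pri[SKETCH_NR_CPUS];
};

void cpupri_sketch_init(struct cpupri_sketch *cp)
{
    for (int i = 0; i < SKETCH_NR_PRIORITIES; i++)
        cp->pri_to_cpu[i] = 0;
    for (int i = 0; i < SKETCH_NR_CPUS; i++)
        cp->cpu_to_pri[i] = SKETCH_INVALID;
}

/* newpri is a cpupri level (0..101) or SKETCH_INVALID. */
void cpupri_sketch_set(struct cpupri_sketch *cp, int cpu, int newpri)
{
    int oldpri;

    assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
    assert(newpri >= SKETCH_INVALID && newpri < SKETCH_NR_PRIORITIES);

    oldpri = cp->cpu_to_pri[cpu];
    if (newpri == oldpri)
        return;

    /* Add the cpu to the new level first, then drop it from the old one. */
    if (newpri != SKETCH_INVALID)
        cp->pri_to_cpu[newpri] |= 1u << cpu;
    if (oldpri != SKETCH_INVALID)
        cp->pri_to_cpu[oldpri] &= ~(1u << cpu);

    cp->cpu_to_pri[cpu] = newpri;
}
```

Replaying the scenario above (cpu#3 moving from RT99, cpupri 2, to NORMAL, cpupri 1) leaves vector 2 empty and vector 1 holding cpu#1 and cpu#3.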

convert_prio()

kernel/sched/cpupri.c

/* Convert between a 140 based task->prio, and our 102 based cpupri */
static int convert_prio(int prio)
{
        int cpupri;

        if (prio == CPUPRI_INVALID)
                cpupri = CPUPRI_INVALID;
        else if (prio == MAX_PRIO)
                cpupri = CPUPRI_IDLE;
        else if (prio >= MAX_RT_PRIO)
                cpupri = CPUPRI_NORMAL;
        else
                cpupri = MAX_RT_PRIO - prio + 1;

        return cpupri;
}

Converts the task-based 140-level priority into the 102-level cpupri and returns it.

Code lines 6~7: if prio=CPUPRI_INVALID(-1), that value is returned unchanged.
Code lines 8~9: if prio=140, the priority of the idle task, CPUPRI_IDLE(0) is returned.
Code lines 10~11: if prio>=100, a normal (cfs) task priority, CPUPRI_NORMAL(1) is returned.
Code lines 12~13: if prio<100, an rt task priority, the value is inverted so that RT0 ~ RT99 map to 101 ~ 2.
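The mapping is easy to check with a stand-alone copy of the conversion. The function and macro names below are hypothetical, and the constants are assumed to match the kernel version discussed here (MAX_RT_PRIO=100, MAX_PRIO=140).

```c
#include <assert.h>

/* Values assumed to match the kernel version discussed in the text. */
#define SKETCH_MAX_RT_PRIO    100
#define SKETCH_MAX_PRIO       140
#define SKETCH_CPUPRI_INVALID (-1)
#define SKETCH_CPUPRI_IDLE    0
#define SKETCH_CPUPRI_NORMAL  1

/* User-space copy of the 140-level -> 102-level conversion. */
int convert_prio_sketch(int prio)
{
    assert(prio >= SKETCH_CPUPRI_INVALID && prio <= SKETCH_MAX_PRIO);

    if (prio == SKETCH_CPUPRI_INVALID)
        return SKETCH_CPUPRI_INVALID;
    if (prio == SKETCH_MAX_PRIO)
        return SKETCH_CPUPRI_IDLE;          /* idle task */
    if (prio >= SKETCH_MAX_RT_PRIO)
        return SKETCH_CPUPRI_NORMAL;        /* any cfs task, nice -20..19 */
    return SKETCH_MAX_RT_PRIO - prio + 1;   /* RT0..RT99 -> 101..2 */
}
```

All cfs priorities (100 ~ 139) collapse into the single NORMAL level, which is why cpupri needs only 102 vectors.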

The following figure shows the task-based 140-level priority being converted into the 102-level cpupri.

cpupri_find()

kernel/sched/cpupri.c

/**
 * cpupri_find - find the best (lowest-pri) CPU in the system
 * @cp: The cpupri context
 * @p: The task
 * @lowest_mask: A mask to fill in with selected CPUs (or NULL)
 *
 * Note: This function returns the recommended CPUs as calculated during the
 * current invocation.  By the time the call returns, the CPUs may have in
 * fact changed priorities any number of times.  While not ideal, it is not
 * an issue of correctness since the normal rebalancer logic will correct
 * any discrepancies created by racing against the uncertainty of the current
 * priority configuration.
 *
 * Return: (int)bool - CPUs were found
 */

int cpupri_find(struct cpupri *cp, struct task_struct *p,
                struct cpumask *lowest_mask)
{
        int idx = 0; 
        int task_pri = convert_prio(p->prio);

        BUG_ON(task_pri >= CPUPRI_NR_PRIORITIES);

        for (idx = 0; idx < task_pri; idx++) {
                struct cpupri_vec *vec  = &cp->pri_to_cpu[idx];
                int skip = 0;

                if (!atomic_read(&(vec)->count))
                        skip = 1;
                /*
                 * When looking at the vector, we need to read the counter,
                 * do a memory barrier, then read the mask.
                 *
                 * Note: This is still all racey, but we can deal with it.
                 *  Ideally, we only want to look at masks that are set.
                 *
                 *  If a mask is not set, then the only thing wrong is that we
                 *  did a little more work than necessary.
                 *
                 *  If we read a zero count but the mask is set, because of the
                 *  memory barriers, that can only happen when the highest prio
                 *  task for a run queue has left the run queue, in which case,
                 *  it will be followed by a pull. If the task we are processing
                 *  fails to find a proper place to go, that pull request will
                 *  pull this task if the run queue is running at a lower
                 *  priority.
                 */
                smp_rmb();

                /* Need to do the rmb for every iteration */
                if (skip)
                        continue;

                if (cpumask_any_and(&p->cpus_allowed, vec->mask) >= nr_cpu_ids)
                        continue;

                if (lowest_mask) {
                        cpumask_and(lowest_mask, &p->cpus_allowed, vec->mask);

                        /*
                         * We have to ensure that we have at least one bit
                         * still set in the array, since the map could have
                         * been concurrently emptied between the first and
                         * second reads of vec->mask.  If we hit this
                         * condition, simply act as though we never hit this
                         * priority level and continue on.
                         */
                        if (cpumask_any(lowest_mask) >= nr_cpu_ids)
                                continue;
                }

                return 1;
        }

        return 0;
}

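Searches the 102 cpupri levels from the lowest priority upward, stopping just below the requested task's own level, for a cpu the task is allowed to run on. Returns 1 if such cpus are found. In addition, the output argument lowest_mask receives the cpumask of the usable cpus at the best (lowest) priority level found.

Code line 5 converts the task's 140-level priority into the 102-level cpupri and stores it in task_pri.
Code lines 9~10 iterate idx from 0 up to, but not including, the task's cpupri and obtain the cpupri vector for each index.
Code lines 13~14 set skip=1 if no cpu is currently at the cpupri of this index.
Code lines 36~37: after the read memory barrier, move on to the next index if skip is set.
Code lines 39~40: if the cpus in this cpupri vector and the cpus allowed for the task have no cpu in common (cpumask_any_and() returns nr_cpu_ids or more), skip this level.
- cpus_allowed holds the cpus a task may run on; it is restricted through the cpuset subsystem of cgroup or through sched_setaffinity().
Code lines 42~43: if the output argument lowest_mask is given, the intersection of the cpus in this cpupri vector and the cpus allowed for the task is stored in it.
Code lines 53~54: if lowest_mask turned out empty because the vector changed concurrently (cpumask_any() returns nr_cpu_ids or more), skip this level.
Code lines 57~60 return 1 when cpus were found, or 0 if the loop ends without finding any.

The following figure shows the 102 cpupri vectors being searched from the lowest priority 0 up to the requested task's priority to find the cpus at the best (lowest) priority.

The task is restricted to cpu#0 and cpu#1. (cgroup -> cpuset)
In cpupri vector 0 (idle) only cpu#2 is present, so it is skipped.
In cpupri vector 1 (normal) cpu#1 and cpu#3 are present, so only cpu#1 is returned in the output cpumask.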
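The search itself reduces to a loop over bitmasks. The following is a rough single-threaded sketch with hypothetical names: each vector is a plain bitmask (bit n = cpu#n), so the count check, the read barrier and the second emptiness check of the real cpupri_find() are not needed.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NR_PRIORITIES 102

/*
 * Hypothetical model: pri_to_cpu[level] is a bitmask of the cpus currently
 * running at that cpupri level. task_pri is the task's cpupri level.
 * Returns 1 and fills *lowest_mask (if non-NULL) when a usable level exists.
 */
int cpupri_find_sketch(const unsigned int pri_to_cpu[SKETCH_NR_PRIORITIES],
                       int task_pri, unsigned int cpus_allowed,
                       unsigned int *lowest_mask)
{
    assert(task_pri >= 0 && task_pri < SKETCH_NR_PRIORITIES);

    /* Only levels strictly below the task's own level are candidates. */
    for (int idx = 0; idx < task_pri; idx++) {
        unsigned int candidates = pri_to_cpu[idx] & cpus_allowed;

        if (!candidates)
            continue;   /* no cpu at this level, or none the task may use */
        if (lowest_mask)
            *lowest_mask = candidates;
        return 1;
    }
    return 0;
}
```

With the situation in the figure (idle = {cpu#2}, normal = {cpu#1, cpu#3}, task allowed on cpu#0 and cpu#1) the idle level is skipped and the returned mask contains only cpu#1.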

RT Migration

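Each time an rt task is enqueued, the top-level rt runqueue is re-evaluated: if it holds two or more rt tasks and at least one of them is allowed on two or more cpus, the runqueue is considered overloaded.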

inc_rt_migration()

kernel/sched/rt.c

static void inc_rt_migration(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
{
        struct task_struct *p;

        if (!rt_entity_is_task(rt_se))
                return;

        p = rt_task_of(rt_se);
        rt_rq = &rq_of_rt_rq(rt_rq)->rt;

        rt_rq->rt_nr_total++;
        if (p->nr_cpus_allowed > 1)
                rt_rq->rt_nr_migratory++;

        update_rt_migration(rt_rq);
}

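Increments the number of rt tasks. If the rt task is migratable, the number of migratable rt tasks is incremented as well.

Code lines 5~6 return if the entity is not a task.
Code lines 8~9 obtain the task and the top-level root rt runqueue.
Code line 11 increments the number of rt tasks on the top-level root rt runqueue.
Code lines 12~13: if the task is allowed on two or more cpus, the number of migratable tasks on the top-level root rt runqueue is incremented.
Code line 15 updates the migration-related state of the top-level root rt runqueue.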

dec_rt_migration()

kernel/sched/rt.c

static void dec_rt_migration(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
{
        struct task_struct *p;

        if (!rt_entity_is_task(rt_se))
                return;

        p = rt_task_of(rt_se);
        rt_rq = &rq_of_rt_rq(rt_rq)->rt;

        rt_rq->rt_nr_total--;
        if (p->nr_cpus_allowed > 1)
                rt_rq->rt_nr_migratory--;

        update_rt_migration(rt_rq);
}

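Decrements the number of rt tasks. If the rt task is migratable, the number of migratable rt tasks is decremented as well.

Code lines 5~6 return if the entity is not a task.
Code lines 8~9 obtain the task and the top-level root rt runqueue.
Code line 11 decrements the number of rt tasks on the top-level root rt runqueue.
Code lines 12~13: if the task is allowed on two or more cpus, the number of migratable tasks on the top-level root rt runqueue is decremented.
Code line 15 updates the migration-related state of the top-level root rt runqueue.

Migration-related members

top-level root rt_rq->rt_nr_total
- number of rt tasks
top-level root rt_rq->rt_nr_migratory
- number of migratable rt tasks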

update_rt_migration()

kernel/sched/rt.c

static void update_rt_migration(struct rt_rq *rt_rq)
{
        if (rt_rq->rt_nr_migratory && rt_rq->rt_nr_total > 1) {
                if (!rt_rq->overloaded) {
                        rt_set_overload(rq_of_rt_rq(rt_rq));
                        rt_rq->overloaded = 1;
                }
        } else if (rt_rq->overloaded) {
                rt_clear_overload(rq_of_rt_rq(rt_rq));
                rt_rq->overloaded = 0;
        }
}

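Updates the migration-related state of the requested RT runqueue. If the RT runqueue holds two or more rt tasks and at least one of them is migratable, it is put into the overloaded state. Otherwise, if it is currently marked overloaded, the state is cleared.

Code line 3 checks that the RT runqueue has at least one migratable task (rt_nr_migratory > 0) and that two or more rt tasks are enqueued.
Code lines 4~7: in that case, the runqueue is marked as overloaded unless it already is.
Code lines 8~11: otherwise, if the RT runqueue is marked as overloaded, the overload state is cleared.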
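The overload rule can be condensed into a few lines. This is a simplified sketch with hypothetical names: it keeps only the two counters and the overloaded flag, and leaves out the root-domain update done by rt_set_overload() and rt_clear_overload().

```c
#include <assert.h>

/* Hypothetical model of the migration counters of a top-level rt runqueue. */
struct rt_rq_migration_sketch {
    unsigned int rt_nr_total;      /* all queued rt tasks */
    unsigned int rt_nr_migratory;  /* tasks allowed on more than one cpu */
    int overloaded;
};

/* Overloaded = at least one migratable task and two or more rt tasks. */
static void update_migration_sketch(struct rt_rq_migration_sketch *rt_rq)
{
    rt_rq->overloaded = rt_rq->rt_nr_migratory && rt_rq->rt_nr_total > 1;
}

void migration_sketch_enqueue(struct rt_rq_migration_sketch *rt_rq,
                              int nr_cpus_allowed)
{
    assert(nr_cpus_allowed >= 1);
    rt_rq->rt_nr_total++;
    if (nr_cpus_allowed > 1)
        rt_rq->rt_nr_migratory++;
    update_migration_sketch(rt_rq);
}

void migration_sketch_dequeue(struct rt_rq_migration_sketch *rt_rq,
                              int nr_cpus_allowed)
{
    assert(nr_cpus_allowed >= 1 && rt_rq->rt_nr_total > 0);
    rt_rq->rt_nr_total--;
    if (nr_cpus_allowed > 1) {
        assert(rt_rq->rt_nr_migratory > 0);
        rt_rq->rt_nr_migratory--;
    }
    update_migration_sketch(rt_rq);
}
```

Two tasks that are both pinned to a single cpu do not make the runqueue overloaded, since neither could be pushed elsewhere.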

RT Overload

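When two or more rt tasks are enqueued on an RT runqueue and at least one of them can migrate, the runqueue is called overloaded. The following members track this state.

rq->rd->rto_count
- number of rt-overloaded runqueues in the root domain
rq->rd->rto_mask
- cpus that are rt-overloaded in the root domain
rt_rq->overloaded
- whether the rt runqueue is overloaded

See: sched: add rt-overload tracking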

rt_set_overload()

kernel/sched/rt.c

static inline void rt_set_overload(struct rq *rq)
{
        if (!rq->online)
                return;

        cpumask_set_cpu(rq->cpu, rq->rd->rto_mask);
        /*
         * Make sure the mask is visible before we set
         * the overload count. That is checked to determine
         * if we should look at the mask. It would be a shame
         * if we looked at the mask, but the mask was not
         * updated yet.
         *
         * Matched by the barrier in pull_rt_task().
         */
        smp_wmb();
        atomic_inc(&rq->rd->rto_count);
}

For an online runqueue, sets the cpu's bit in the rt overload mask and then increments the rt overload counter.

rt_clear_overload()

kernel/sched/rt.c

static inline void rt_clear_overload(struct rq *rq)
{
        if (!rq->online)
                return;

        /* the order here really doesn't matter */
        atomic_dec(&rq->rd->rto_count);
        cpumask_clear_cpu(rq->cpu, rq->rd->rto_mask);
}

For an online runqueue, decrements the rt overload counter and clears the cpu's bit in the rt overload mask.

RT Group and Starting the Timer

inc_rt_group()

kernel/sched/rt.c

static void
inc_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
{
        if (rt_se_boosted(rt_se))
                rt_rq->rt_nr_boosted++;

        if (rt_rq->tg)
                start_rt_bandwidth(&rt_rq->tg->rt_bandwidth);
}

If the requested rt entity is boosted, the boosted counter of the rt runqueue is incremented. The rt period timer of the group is then started.

dec_rt_group()

kernel/sched/rt.c

static void
dec_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
{
        if (rt_se_boosted(rt_se))
                rt_rq->rt_nr_boosted--;

        WARN_ON(!rt_rq->rt_nr_running && rt_rq->rt_nr_boosted);
}

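If the requested rt entity is boosted, the boosted counter of the rt runqueue is decremented.

RT Period Timer

The following figure shows how the rt period timer is started and the call flow between the functions involved.

Starting the RT Period Timer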

start_rt_bandwidth()

kernel/sched/rt.c

static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
{
        if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
                return;

        raw_spin_lock(&rt_b->rt_runtime_lock);
        if (!rt_b->rt_period_active) {
                rt_b->rt_period_active = 1;
                /*
                 * SCHED_DEADLINE updates the bandwidth, as a run away
                 * RT task with a DL task could hog a CPU. But DL does
                 * not reset the period. If a deadline task was running
                 * without an RT task running, it can cause RT tasks to
                 * throttle when they start up. Kick the timer right away
                 * to update the period.
                 */
                hrtimer_forward_now(&rt_b->rt_period_timer, ns_to_ktime(0));
                hrtimer_start_expires(&rt_b->rt_period_timer,
                                      HRTIMER_MODE_ABS_PINNED_HARD);
        }
        raw_spin_unlock(&rt_b->rt_runtime_lock);
}

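Starts the period timer used for rt bandwidth.

Code lines 3~4 return if rt bandwidth is disabled globally or if the group's rt runtime is unlimited (RUNTIME_INF).
Code lines 6~21 start the timer only if the rt period timer is not already active.
- The rt period timer is set up to run in hardirq context.

When the RT Period Timer Expires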

sched_rt_period_timer()

kernel/sched/rt.c

static enum hrtimer_restart sched_rt_period_timer(struct hrtimer *timer)
{
        struct rt_bandwidth *rt_b =
                container_of(timer, struct rt_bandwidth, rt_period_timer);
        int idle = 0;
        int overrun;

        raw_spin_lock(&rt_b->rt_runtime_lock);
        for (;;) {
                overrun = hrtimer_forward_now(timer, rt_b->rt_period);
                if (!overrun)
                        break;

                raw_spin_unlock(&rt_b->rt_runtime_lock);
                idle = do_sched_rt_period_timer(rt_b, overrun);
                raw_spin_lock(&rt_b->rt_runtime_lock);
        }
        if (idle)
                rt_b->rt_period_active = 0;
        raw_spin_unlock(&rt_b->rt_runtime_lock);

        return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
}

Code lines 3~4 obtain the rt_bandwidth structure that embeds the expired timer.
Code lines 9~17 forward the timer. If there is no overrun, nothing is left to do and the loop ends; if an overrun occurred, the period expiry is processed.
Code lines 18~19: if the result is idle, rt_period_active is set to 0 to indicate that the period timer is no longer running.
Code line 22: depending on the idle result, HRTIMER_NORESTART is returned so that the hrtimer is not re-armed when idle; otherwise HRTIMER_RESTART is returned so that it is re-armed.

do_sched_rt_period_timer()

kernel/sched/rt.c -1/2-

static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
{
        int i, idle = 1, throttled = 0;
        const struct cpumask *span;

        span = sched_rt_period_mask();
#ifdef CONFIG_RT_GROUP_SCHED
        /*
         * FIXME: isolated CPUs should really leave the root task group,
         * whether they are isolcpus or were isolated via cpusets, lest
         * the timer run on a CPU which does not service all runqueues,
         * potentially leaving other CPUs indefinitely throttled.  If
         * isolation is really required, the user will turn the throttle
         * off to kill the perturbations it causes anyway.  Meanwhile,
         * this maintains functionality for boot and/or troubleshooting.
         */
        if (rt_b == &root_task_group.rt_bandwidth)
                span = cpu_online_mask;
#endif
        for_each_cpu(i, span) {
                int enqueue = 0;
                struct rt_rq *rt_rq = sched_rt_period_rt_rq(rt_b, i);
                struct rq *rq = rq_of_rt_rq(rt_rq);
                int skip;

                /*
                 * When span == cpu_online_mask, taking each rq->lock
                 * can be time-consuming. Try to avoid it when possible.
                 */
                raw_spin_lock(&rt_rq->rt_runtime_lock);
                if (!sched_feat(RT_RUNTIME_SHARE) && rt_rq->rt_runtime != RUNTIME_INF)
                        rt_rq->rt_runtime = rt_b->rt_runtime;
                skip = !rt_rq->rt_time && !rt_rq->rt_nr_running;
                raw_spin_unlock(&rt_rq->rt_runtime_lock);
                if (skip)
                        continue;

                raw_spin_lock(&rq->lock);
                update_rq_clock(rq);

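Performs the work required when the rt period expires.

After an rt entity has been enqueued and the rt period timer started, each expiry replenishes the runtime, borrowing runtime from other cpus if necessary. If runtime is left over after that, the rt runqueue no longer needs to be throttled, and an rt runqueue that had been throttled is enqueued again.

Code line 3: the idle variable is the value returned to decide whether the period timer should stop. (1=stop, 0=continue)
Code line 6 obtains the cpu bitmask spanned by the root domain of the current cpu's runqueue.
Code lines 17~18: for the root task group, all online cpus are used instead.
Code lines 20~22 iterate over the cpus in that mask and obtain the rt runqueue for each.
Code lines 31~32: if the RT_RUNTIME_SHARE feature is off and the rt runqueue has a finite runtime (not RUNTIME_INF), the local runtime is reset to the group's configured runtime every period.
Code lines 33~36 skip the cpu if it has no accumulated rt execution time and no rt task to run.
Code line 39 updates the runqueue clock.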

kernel/sched/rt.c -2/2-

                if (rt_rq->rt_time) {
                        u64 runtime;

                        raw_spin_lock(&rt_rq->rt_runtime_lock);
                        if (rt_rq->rt_throttled)
                                balance_runtime(rt_rq);
                        runtime = rt_rq->rt_runtime;
                        rt_rq->rt_time -= min(rt_rq->rt_time, overrun*runtime);
                        if (rt_rq->rt_throttled && rt_rq->rt_time < runtime) {
                                rt_rq->rt_throttled = 0;
                                enqueue = 1;

                                /*
                                 * When we're idle and a woken (rt) task is
                                 * throttled check_preempt_curr() will set
                                 * skip_update and the time between the wakeup
                                 * and this unthrottle will get accounted as
                                 * 'runtime'.
                                 */
                                if (rt_rq->rt_nr_running && rq->curr == rq->idle)
                                        rq_clock_cancel_skipupdate(rq);
                        }
                        if (rt_rq->rt_time || rt_rq->rt_nr_running)
                                idle = 0;
                        raw_spin_unlock(&rt_rq->rt_runtime_lock);
                } else if (rt_rq->rt_nr_running) {
                        idle = 0;
                        if (!rt_rq_throttled(rt_rq))
                                enqueue = 1;
                }
                if (rt_rq->rt_throttled)
                        throttled = 1;

                if (enqueue)
                        sched_rt_rq_enqueue(rt_rq);
                raw_spin_unlock(&rq->lock);
        }

        if (!throttled && (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF))
                return 1;

        return idle;
}

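Code line 1 handles the case where the rt runqueue has accumulated rt execution time (rt_time is greater than 0).
Code lines 5~7: if the local rt runqueue is currently throttled, balance_runtime() is called to borrow runtime from other local rt pools. The local runtime is then read again.
Code line 8 subtracts overrun * rt runtime from the rt runqueue's rt execution time, clamped so that rt_time never drops below 0.
Code lines 9~24 decide whether the throttle can be lifted and, if so, request that the rt runqueue be enqueued.
- If the rt runqueue is throttled and runtime has become available again, rt_throttled is set to 0 to lift the throttle.
- If an rt task woke up while the cpu was idle, the pending clock skip request is cancelled.
- If execution time or running rt tasks remain, idle is set to 0.
Code lines 26~35: when rt_time is 0 but rt tasks are queued, idle is set to 0 and, if the rt runqueue is not throttled, an enqueue is requested. The requested enqueue is carried out in lines 34~35.
Code lines 39~40 return 1 when rt bandwidth no longer needs to run.
- No rt runqueue is throttled, and rt bandwidth is either disabled or unlimited (RUNTIME_INF).
Code line 42 returns the idle state, which decides whether the period timer stops. (1=stop, 0=continue)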
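The arithmetic of the replenish step is easier to follow in isolation. The sketch below uses hypothetical names and keeps only rt_time, rt_runtime and rt_throttled; locking, balance_runtime() and the enqueue are left out. The numbers used in the note afterwards are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the per-runqueue bandwidth state. */
struct rt_budget_sketch {
    uint64_t rt_time;     /* rt execution time consumed so far, in ns */
    uint64_t rt_runtime;  /* budget granted per period, in ns */
    int rt_throttled;
};

/*
 * Apply `overrun` expired periods. Returns 1 if this call lifted the
 * throttle (the point where the kernel would enqueue the rt runqueue).
 */
int replenish_sketch(struct rt_budget_sketch *b, uint64_t overrun)
{
    uint64_t refill;

    assert(overrun > 0);
    refill = overrun * b->rt_runtime;

    /* rt_time -= min(rt_time, overrun * runtime): never goes below 0 */
    b->rt_time -= (b->rt_time < refill) ? b->rt_time : refill;

    if (b->rt_throttled && b->rt_time < b->rt_runtime) {
        b->rt_throttled = 0;
        return 1;
    }
    return 0;
}
```

For example, with a runtime of 950 ms and 2000 ms already consumed, the first expiry leaves 1050 ms, which is still not below the budget, so the runqueue stays throttled. The second expiry leaves 100 ms and lifts the throttle.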

Check Preempt

check_preempt_curr_rt()

kernel/sched/rt.c

/*
 * Preempt the current task with a newly woken task if needed:
 */
static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p, int flags)
{
        if (p->prio < rq->curr->prio) {
                resched_curr(rq);
                return;
        }

#ifdef CONFIG_SMP
        /*
         * If:
         *
         * - the newly woken task is of equal priority to the current task
         * - the newly woken task is non-migratable while current is migratable
         * - current will be preempted on the next reschedule
         *
         * we should check to see if current can readily move to a different
         * cpu.  If so, we will reschedule to allow the push logic to try
         * to move current somewhere else, making room for our non-migratable
         * task.
         */
        if (p->prio == rq->curr->prio && !test_tsk_need_resched(rq->curr))
                check_preempt_equal_prio(rq, p);
#endif
}

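Checks whether the current task must give way to a task of higher or equal priority, and sets the reschedule request flag if needed.

Code lines 6~9: if the priority of the requested task is higher than that of the task currently running on the runqueue, the reschedule request flag is set.
Code lines 24~25: on an SMP system, if the requested task has the same priority as the current task and the current task has no reschedule request pending, check_preempt_equal_prio() decides whether the requested task should be moved to the head of its queue and a reschedule requested.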

check_preempt_equal_prio()

kernel/sched/rt.c

static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
{
        /*
         * Current can't be migrated, useless to reschedule,
         * let's hope p can move out.
         */
        if (rq->curr->nr_cpus_allowed == 1 ||
            !cpupri_find(&rq->rd->cpupri, rq->curr, NULL))
                return;

        /*
         * p is migratable, so let's not schedule it and
         * see if it is pushed or pulled somewhere else.
         */
        if (p->nr_cpus_allowed != 1
            && cpupri_find(&rq->rd->cpupri, p, NULL))
                return;

        /*
         * There appears to be other cpus that can accept
         * current and none to run 'p', so lets reschedule
         * to try and push current away:
         */
        requeue_task_rt(rq, p, 1);
        resched_curr(rq);
}

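Decides, for a task of equal priority, whether the requested task should be moved to the head of its queue and a reschedule requested so that the current task can be pushed to another cpu.

Code lines 7~9: if the current task can use only one cpu, or if the runqueue's root domain has no cpu running at a cpupri level lower than the current task's priority, the function returns because the current task cannot be moved.
Code lines 15~17: if the requested task is not pinned to a single cpu and the root domain has a cpu running at a level lower than the requested task's priority, the function returns, since the requested task can be pushed or pulled elsewhere.
Code lines 24~25 requeue the requested task at the head of its priority list and then set the reschedule request flag.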
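The two early returns form a small decision table. As a rough sketch, it can be written as a pure function in which the results of the two cpupri_find() calls are passed in as flags; the function name and its parameters are hypothetical.

```c
#include <assert.h>

/*
 * Hypothetical pure-function version of the equal-priority decision.
 * curr_has_target / p_has_target stand for the result of cpupri_find()
 * for the current task and for the newly woken task p.
 * Returns 1 when p should go to the head of its queue and the current
 * task should be rescheduled so it can be pushed away.
 */
int should_push_curr_sketch(int curr_cpus_allowed, int curr_has_target,
                            int p_cpus_allowed, int p_has_target)
{
    assert(curr_cpus_allowed >= 1 && p_cpus_allowed >= 1);

    /* The current task cannot move anywhere: rescheduling is useless. */
    if (curr_cpus_allowed == 1 || !curr_has_target)
        return 0;

    /* p itself can be pushed or pulled elsewhere: leave things alone. */
    if (p_cpus_allowed != 1 && p_has_target)
        return 0;

    /* The current task can move and p cannot: make room for p. */
    return 1;
}
```

The typical case that returns 1 is a newly woken task pinned to this cpu while the current task of equal priority is free to run somewhere else.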

Enqueue & Dequeue of RT Tasks

enqueue_task_rt()

kernel/sched/rt.c

/*
 * Adding/removing a task to/from a priority array:
 */

static void
enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
{
        struct sched_rt_entity *rt_se = &p->rt;

        if (flags & ENQUEUE_WAKEUP)
                rt_se->timeout = 0;

        enqueue_rt_entity(rt_se, flags);

        if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
                enqueue_pushable_task(rq, p);
}

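Enqueues the rt task on the runqueue.

Code lines 6~7: when an rt task has just woken up and is being put back on the runqueue, the timeout used by the rt watchdog is cleared.
Code line 9 enqueues the rt entity on the rt runqueue.
Code lines 11~12: if the rt task is not the one currently running on this runqueue (so it has to wait) and is allowed on more than one cpu, it is added to the pushable task list so that it can be migrated to another cpu.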

dequeue_task_rt()

kernel/sched/rt.c

static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
{
        struct sched_rt_entity *rt_se = &p->rt;

        update_curr_rt(rq);
        dequeue_rt_entity(rt_se, flags);

        dequeue_pushable_task(rq, p);
}

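Dequeues the rt task from the runqueue.

Code line 5 updates the execution time and related statistics of the currently running rt task.
Code line 6 dequeues the rt entity from the rt runqueue.
Code line 8 removes the rt task from the pushable task list.

Adding to and Removing from the pushable Task List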

enqueue_pushable_task()

kernel/sched/rt.c

static void enqueue_pushable_task(struct rq *rq, struct task_struct *p)
{
        plist_del(&p->pushable_tasks, &rq->rt.pushable_tasks);
        plist_node_init(&p->pushable_tasks, p->prio);
        plist_add(&p->pushable_tasks, &rq->rt.pushable_tasks);

        /* Update the highest prio pushable task */
        if (p->prio < rq->rt.highest_prio.next)
                rq->rt.highest_prio.next = p->prio;
}

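Adds the requested task to the pushable_tasks list of the top-level rt runqueue and updates that runqueue's next-highest priority.

Code line 3 removes the requested task from the pushable_tasks list of the top-level rt runqueue.
Code lines 4~5 store the task's priority in its pushable_tasks node and add the node back to the pushable_tasks list of the top-level rt runqueue.
- The node with the highest priority (lowest number) is sorted to the front of the pushable_tasks list.
Code lines 8~9: if the requested task's priority is higher than the runqueue's next-highest priority, that value is updated.

The following figure shows the next-highest priority (highest_prio.next) being updated when a task is added to the pushable task list.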

dequeue_pushable_task()

kernel/sched/rt.c

static void dequeue_pushable_task(struct rq *rq, struct task_struct *p)
{
        plist_del(&p->pushable_tasks, &rq->rt.pushable_tasks);

        /* Update the new highest prio pushable task */
        if (has_pushable_tasks(rq)) {
                p = plist_first_entry(&rq->rt.pushable_tasks,
                                      struct task_struct, pushable_tasks);
                rq->rt.highest_prio.next = p->prio;
        } else
                rq->rt.highest_prio.next = MAX_RT_PRIO;
}

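Removes the requested task from the pushable_tasks list of the top-level rt runqueue and updates that runqueue's next-highest priority.

Code line 3 removes the requested task from the pushable_tasks list of the top-level rt runqueue.
Code lines 6~11: if the runqueue still has pushable tasks, the highest priority among them becomes the runqueue's next-highest priority. If no pushable task is left, the next-highest priority is reset to the empty value (MAX_RT_PRIO).

The following figure shows the next-highest priority (highest_prio.next) being updated when a task is removed from the pushable task list.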

pick_highest_pushable_task()

kernel/sched/rt.c

/*
 * Return the highest pushable rq's task, which is suitable to be executed
 * on the cpu, NULL otherwise
 */
static struct task_struct *pick_highest_pushable_task(struct rq *rq, int cpu)
{
        struct plist_head *head = &rq->rt.pushable_tasks;
        struct task_struct *p;

        if (!has_pushable_tasks(rq))
                return NULL;

        plist_for_each_entry(p, head, pushable_tasks) {
                if (pick_rt_task(rq, p, cpu))
                        return p;
        }

        return NULL;
}

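Walks the tasks linked on the pushable tasks list and returns the first one that is not currently running and is allowed to run on the requested cpu. (The list is walked from high priority to low priority.)

Code lines 10~11 return null if the pushable task list of the requested runqueue is empty.
Code lines 13~16 walk the tasks linked on the pushable tasks list and return the first one that is not currently running and may run on the requested cpu.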

has_pushable_tasks()

kernel/sched/rt.c

static inline int has_pushable_tasks(struct rq *rq)
{
        return !plist_head_empty(&rq->rt.pushable_tasks);
}

Returns whether any task exists on the pushable task list.

pick_rt_task()

kernel/sched/rt.c

static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
{
        if (!task_running(rq, p) &&
            cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
                return 1;
        return 0;
}

Returns whether the task is not currently running and is allowed to run on the requested cpu.

pick_next_pushable_task()

kernel/sched/rt.c

static struct task_struct *pick_next_pushable_task(struct rq *rq)
{
        struct task_struct *p;

        if (!has_pushable_tasks(rq))
                return NULL;

        p = plist_first_entry(&rq->rt.pushable_tasks,
                              struct task_struct, pushable_tasks);

        BUG_ON(rq->cpu != task_cpu(p));
        BUG_ON(task_current(rq, p));
        BUG_ON(p->nr_cpus_allowed <= 1);

        BUG_ON(!task_on_rq_queued(p));
        BUG_ON(!rt_task(p));

        return p;
}

Returns the first rt task waiting on the pushable tasks list of the requested runqueue.

plist (Descending-priority-sorted double-linked list)

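A doubly linked list kept sorted by priority. Because the priority used as the key is limited to 100 values (0~99), it is more efficient here than an RB tree. In the RT scheduler, tasks on an overloaded runqueue are added to the pushable_tasks plist, and the structure is well suited to keeping them sorted.

To find the highest-priority task in pushable_tasks: use the task at the head.
To add a task to pushable_tasks: of the two doubly linked lists, searching through prio_list skips nodes with duplicate priorities, so the insertion point is found faster. (In the worst case 99 steps are still needed.)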
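The ordering that plist maintains can be illustrated with a much simpler structure. The following is a simplified sketch with hypothetical names: it uses a single singly linked list and therefore has no prio_list shortcut for skipping equal priorities, but it keeps the same order (ascending prio value, FIFO among equal priorities).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a task on the pushable list. */
struct pushable_sketch {
    int prio;                       /* lower value = higher priority */
    struct pushable_sketch *next;
};

/* Insert so the list stays sorted by ascending prio, FIFO among equals. */
void pushable_sketch_add(struct pushable_sketch **head,
                         struct pushable_sketch *node)
{
    assert(head && node);
    while (*head && (*head)->prio <= node->prio)
        head = &(*head)->next;
    node->next = *head;
    *head = node;
}

/*
 * Remove a node if present. Returns the prio of the new first node,
 * or 100 (the value of MAX_RT_PRIO) when the list is empty, which is
 * what highest_prio.next would be set to.
 */
int pushable_sketch_del(struct pushable_sketch **head,
                        struct pushable_sketch *node)
{
    struct pushable_sketch **pos = head;

    assert(head && node);
    while (*pos && *pos != node)
        pos = &(*pos)->next;
    if (*pos)
        *pos = node->next;
    return *head ? (*head)->prio : 100;
}
```

Because the head is always the highest-priority entry, reading the next-highest priority after a removal is a single pointer dereference.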

다음 그림은 plist의 각 노드가 연결되어 있는 모습을 보여준다.

Picking the Next Task

pick_next_task_rt()

kernel/sched/rt.c

static struct task_struct *
pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
        struct task_struct *p;

        WARN_ON_ONCE(prev || rf);

        if (!sched_rt_runnable(rq))
                return NULL;

        p = _pick_next_task_rt(rq);
        set_next_task_rt(rq, p);
        return p;
}

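Returns the highest-priority rt task to schedule next.

Code lines 8~9: If there is no runnable rt task on the runqueue, NULL is returned.
Code line 11: Finds the rt task to run from the runqueue.
Code lines 12~13: Sets the chosen rt task as the next task and returns it.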

_pick_next_task_rt()

kernel/sched/rt.c

static struct task_struct *_pick_next_task_rt(struct rq *rq)
{
        struct sched_rt_entity *rt_se;
        struct rt_rq *rt_rq  = &rq->rt;

        do {
                rt_se = pick_next_rt_entity(rq, rt_rq);
                BUG_ON(!rt_se);
                rt_rq = group_rt_rq(rt_se);
        } while (rt_rq);

        return rt_task_of(rt_se);
}

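Finds and returns the highest-priority rt task from the rt array lists of the rt runqueue.

Code line 7: Finds the highest-priority rt entity in the rt array lists of the rt runqueue.
Code lines 9~10: If the rt entity represents a group, moves down to that group's rt runqueue and repeats until an rt entity that is a task is found.
Code line 12: Returns the rt task that was found.

The following figure shows how the highest-priority rt task is found on the runqueue.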

pick_next_rt_entity()

kernel/sched/rt.c

static struct sched_rt_entity *pick_next_rt_entity(struct rq *rq,
                                                   struct rt_rq *rt_rq)
{
        struct rt_prio_array *array = &rt_rq->active;
        struct sched_rt_entity *next = NULL;
        struct list_head *queue;
        int idx;

        idx = sched_find_first_bit(array->bitmap);
        BUG_ON(idx >= MAX_RT_PRIO);

        queue = array->queue + idx;
        next = list_entry(queue->next, struct sched_rt_entity, run_list);

        return next;
}

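Finds and returns the highest-priority rt entity from the rt array lists of the rt runqueue.

Code line 9: Finds the index of the highest-priority list that contains an entity, within the per-priority rt list array.
- bit(0) = priority 0, the highest priority
Code lines 12~13: Takes the first rt entity on that list, which is then returned.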
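The bitmap lookup above can be sketched in userspace as follows. This is a minimal model, assuming 100 priorities as in the text: model_prio_bitmap and the model_pb_* helpers are hypothetical names, and the portable bit loop stands in for the kernel's sched_find_first_bit().

```c
#include <stdint.h>

#define MODEL_MAX_RT_PRIO 100   /* assumption: mirrors the 100 rt priorities */

/* Hypothetical model: one bit per priority, bit 0 = highest priority. */
struct model_prio_bitmap {
        uint64_t word[2];       /* 128 bits, enough for 100 priorities */
};

static void model_pb_set(struct model_prio_bitmap *b, int prio)
{
        b->word[prio / 64] |= (uint64_t)1 << (prio % 64);
}

static void model_pb_clear(struct model_prio_bitmap *b, int prio)
{
        b->word[prio / 64] &= ~((uint64_t)1 << (prio % 64));
}

/*
 * Return the index of the lowest set bit, i.e. the highest priority that
 * has at least one queued entity, or MODEL_MAX_RT_PRIO if nothing is set.
 */
static int model_pb_find_first(const struct model_prio_bitmap *b)
{
        for (int w = 0; w < 2; w++) {
                uint64_t v = b->word[w];

                if (v) {
                        int bit = 0;

                        while (!(v & 1)) {
                                v >>= 1;
                                bit++;
                        }
                        return w * 64 + bit;
                }
        }
        return MODEL_MAX_RT_PRIO;
}
```

With bits 50 and 60 set, the lookup returns 50, which matches the RT50 versus RT60 example at the top of this post. The per-priority queue at that index would then supply the first entity.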

Setting the rt Task to Run

set_next_task_rt()

kernel/sched/rt.c

static inline void set_next_task_rt(struct rq *rq, struct task_struct *p)
{
        p->se.exec_start = rq_clock_task(rq);

        /* The running task is never eligible for pushing */
        dequeue_pushable_task(rq, p);

        /*
         * If prev task was rt, put_prev_task() has already updated the
         * utilization. We only care of the case where we start to schedule a
         * rt task
         */
        if (rq->curr->sched_class != &rt_sched_class)
                update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);

        rt_queue_push_tasks(rq);
}

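Sets the requested task as the task to run now on the rt runqueue.

Code line 3: Records the start time in the rt task.
Code line 6: Since the task is about to run on the current cpu, it is removed from the pushable tasks list so that it is not migrated.
Code lines 13~14: If the currently running task is not an rt task, the rt load average is updated.
Code line 16: If the runqueue still has pushable tasks, push_rt_tasks() is queued as a balance callback.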

Finishing the Previous Task

put_prev_task_rt()

kernel/sched/rt.c

static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
{
        update_curr_rt(rq);

        /*
         * The previous task needs to be made eligible for pushing
         * if it is still active
         */
        if (on_rt_rq(&p->rt) && p->nr_cpus_allowed > 1)
                enqueue_pushable_task(rq, p);
}

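Handles completion of the previous rt task's run on the rt runqueue.

Code line 3: Updates the runtime accounting of the current task on the rt runqueue.
Code lines 9~10: If the task is still on the rt runqueue and may run on two or more cpus, it is added to the pushable_tasks list of the top-level rt runqueue.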

Balance

When two or more rt tasks have to run on one cpu, the following are maintained so that all but one of them can be migrated.

pushable tasks list
- An rt task that should be migrated is added to the pushable list of that cpu's runqueue.
  - rq->rt.pushable_tasks
overload mask
- The overload mask is set for a cpu that needs migration.
  - rto_count++, rto_mask is set

Migration of an rt task to another cpu is carried out through the function calls below. In addition, to find a destination cpu, each cpu's priority is tracked as one of 102 levels, and these are used to pick out the cpus with the lowest priority.

balance_rt() – (*balance)
- Calls pull_rt_task().
- Depending on whether the RT_PUSH_IPI feature is used, either IPI-driven push migration or direct pull migration is performed.
  - With RT_PUSH_IPI, an IPI is sent to an overloaded cpu. The cpu that receives the IPI then push-migrates its pushable tasks one at a time, starting from the highest priority, to cpus running at a lower priority.
  - Without RT_PUSH_IPI, the current cpu itself pull-migrates, from the runqueues of the overloaded cpus, pushable tasks whose priority is higher than the current cpu's priority.
task_woken_rt() – (*task_woken)
- Calls push_rt_tasks().
- If a task is already running on the runqueue, and that task is pinned to the current cpu or has a higher priority than the rt task being woken, the cpu itself push-migrates the woken rt task.
balance_callback() – (*balance_callback)
- At the very end of __schedule(), as post-processing, the current cpu directly performs push or pull migration by calling one of the following functions.
  - push_rt_tasks()
  - pull_rt_task()

balance_rt()

kernel/sched/rt.c

static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
{
        if (!on_rt_rq(&p->rt) && need_pull_rt_task(rq, p)) {
                /*
                 * This is OK, because current is on_cpu, which avoids it being
                 * picked for load-balance and preemption/IRQs are still
                 * disabled avoiding further scheduler activity on it and we've
                 * not yet started the picking loop.
                 */
                rq_unpin_lock(rq, rf);
                pull_rt_task(rq);
                rq_repin_lock(rq, rf);
        }

        return sched_stop_runnable(rq) || sched_dl_runnable(rq) || sched_rt_runnable(rq);
}

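Checks whether load balancing of rt tasks is needed and, if so, performs it.

Code line 3: The case where the requested task is no longer on the rt runqueue and its priority is higher than the highest priority queued on the rt runqueue.
Code line 11: Pulls (migrates) rt tasks from other cpus.
Code line 15: Returns 1 if a stop, dl or rt task is runnable.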

need_pull_rt_task()

kernel/sched/rt.c

static inline bool need_pull_rt_task(struct rq *rq, struct task_struct *prev)
{
        /* Try to pull RT tasks here if we lower this rq's prio */
        return rq->rt.highest_prio.curr > prev->prio;
}

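Returns true if the priority of the requested task is higher than the highest priority on the top-level rt runqueue. The > comparison is strict, so an equal priority returns false.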

Push Tasks

rt_queue_push_tasks()

kernel/sched/rt.c

static inline void rt_queue_push_tasks(struct rq *rq)
{
        if (!has_pushable_tasks(rq))
                return;

        queue_balance_callback(rq, &per_cpu(rt_push_head, rq->cpu), push_rt_tasks);
}

If the runqueue has pushable tasks, push_rt_tasks() is registered as the callback function.

This callback is invoked as post-processing at the end of __schedule(), after the scheduling work has completed.

queue_balance_callback()

kernel/sched/sched.h

static inline void
queue_balance_callback(struct rq *rq,
                       struct callback_head *head,
                       void (*func)(struct rq *rq))
{
        lockdep_assert_held(&rq->lock);

        if (unlikely(head->next))
                return;

        head->func = (void (*)(struct callback_head *))func;
        head->next = rq->balance_callback;
        rq->balance_callback = head;
}

Adds a callback function for balancing to the balance_callback list of the runqueue.
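The queue-then-run pattern can be sketched in userspace as follows. This is a minimal model, not kernel code: model_rq, model_cb_head, the int context argument and the two sample callbacks are hypothetical, and no locking is modelled. The run function reflects the assumption that the scheduler detaches the whole list and calls each entry once.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct callback_head. */
struct model_cb_head {
        struct model_cb_head *next;
        void (*func)(int *ctx);
};

/* Hypothetical stand-in for the runqueue; ctx is just demo state. */
struct model_rq {
        struct model_cb_head *balance_callback;
        int ctx;
};

/* Push head onto the front of the list unless it is already linked. */
static void model_queue_callback(struct model_rq *rq,
                                 struct model_cb_head *head,
                                 void (*func)(int *ctx))
{
        if (head->next)
                return;

        head->func = func;
        head->next = rq->balance_callback;
        rq->balance_callback = head;
}

/* Detach the list, then call every queued callback once. */
static void model_run_callbacks(struct model_rq *rq)
{
        struct model_cb_head *head = rq->balance_callback;

        rq->balance_callback = NULL;
        while (head) {
                struct model_cb_head *next = head->next;

                head->next = NULL;
                head->func(&rq->ctx);
                head = next;
        }
}

/* Sample callbacks used only to make the call order observable. */
static void model_cb_add_one(int *ctx) { *ctx += 1; }
static void model_cb_double(int *ctx)  { *ctx *= 2; }
```

Because entries are pushed at the front, the most recently queued callback runs first. Starting from ctx = 1, queuing model_cb_add_one and then model_cb_double and running the list gives (1 * 2) + 1 = 3.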

push_rt_tasks()

kernel/sched/rt.c

static void push_rt_tasks(struct rq *rq)
{
        /* push_rt_task will return true if it moved an RT */
        while (push_rt_task(rq))
                ;
}

Keeps migrating overloaded tasks on the runqueue to other cpus, one at a time, until push_rt_task() can no longer move one.

push_rt_task()

kernel/sched/rt.c

/*
 * If the current CPU has more than one RT task, see if the non
 * running task can migrate over to a CPU that is running a task
 * of lesser priority.
 */

static int push_rt_task(struct rq *rq)
{
        struct task_struct *next_task;
        struct rq *lowest_rq;
        int ret = 0;

        if (!rq->rt.overloaded)
                return 0;

        next_task = pick_next_pushable_task(rq);
        if (!next_task)
                return 0;

retry:
        if (WARN_ON(next_task == rq->curr))
                return 0;

        /*
         * It's possible that the next_task slipped in of
         * higher priority than current. If that's the case
         * just reschedule current.
         */
        if (unlikely(next_task->prio < rq->curr->prio)) {
                resched_curr(rq);
                return 0;
        }

        /* We might release rq lock */
        get_task_struct(next_task);

        /* find_lock_lowest_rq locks the rq if found */
        lowest_rq = find_lock_lowest_rq(next_task, rq);

        if (!lowest_rq) {
                struct task_struct *task;
                /*
                 * find_lock_lowest_rq releases rq->lock
                 * so it is possible that next_task has migrated.
                 *
                 * We need to make sure that the task is still on the same
                 * run-queue and is also still the next task eligible for
                 * pushing.
                 */
                task = pick_next_pushable_task(rq);
                if (task == next_task) {
                        /*
                         * The task hasn't migrated, and is still the next
                         * eligible task, but we failed to find a run-queue
                         * to push it to.  Do not retry in this case, since
                         * other CPUs will pull from us when ready.
                         */
                        goto out;
                }

                if (!task)
                        /* No more tasks, just exit */
                        goto out;

                /*
                 * Something has shifted, try again.
                 */
                put_task_struct(next_task);
                next_task = task;
                goto retry;
        }

        deactivate_task(rq, next_task, 0);
        set_task_cpu(next_task, lowest_rq->cpu);
        activate_task(lowest_rq, next_task, 0);
        ret = 1;

        resched_curr(lowest_rq);

        double_unlock_balance(rq, lowest_rq);

out:
        put_task_struct(next_task);

        return ret;
}

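Migrates one overloaded task on the runqueue to another cpu.

Code lines 7~8: If the runqueue is not overloaded, the function returns.
Code lines 10~12: Gets the highest-priority task from the pushable tasks list.
Code lines 14~16: The retry: label. If the selected next task is the task already running on the runqueue, the function returns.
Code lines 23~26: If the selected next task has a higher priority than the task currently running on the runqueue, a reschedule is requested and the function returns.
Code line 32: Selects the runqueue of the cpu with the lowest priority. On success, both this runqueue and the found cpu's runqueue are locked.
Code lines 34~65: If no suitable cpu was found, the next pushable task is picked again. If it is still the same task, or no task is left, the function exits. If a different task is now at the head, execution jumps to the retry label with that task.
Code lines 67~69: To migrate, next_task is taken off the current runqueue and put onto the found cpu's runqueue. The destination cpu number is recorded in the task.
Code line 70: Sets the return value to 1 so that the caller attempts further migrations.
Code line 72: Requests a reschedule on the found cpu's runqueue.
Code line 74: Releases both runqueue locks.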

find_lock_lowest_rq()

kernel/sched/rt.c

/* Will lock the rq it finds */
static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
{
        struct rq *lowest_rq = NULL;
        int tries;
        int cpu;

        for (tries = 0; tries < RT_MAX_TRIES; tries++) {
                cpu = find_lowest_rq(task);

                if ((cpu == -1) || (cpu == rq->cpu))
                        break;

                lowest_rq = cpu_rq(cpu);

                if (lowest_rq->rt.highest_prio.curr <= task->prio) {
                        /*
                         * Target rq has tasks of equal or higher priority,
                         * retrying does not release any lock and is unlikely
                         * to yield a different result.
                         */
                        lowest_rq = NULL;
                        break;
                }

                /* if the prio of this runqueue changed, try again */
                if (double_lock_balance(rq, lowest_rq)) {
                        /*
                         * We had to unlock the run queue. In
                         * the mean time, task could have
                         * migrated already or had its affinity changed.
                         * Also make sure that it wasn't scheduled on its rq.
                         */
                        if (unlikely(task_rq(task) != rq ||
                                     !cpumask_test_cpu(lowest_rq->cpu, task->cpus_ptr) ||
                                     task_running(rq, task) ||
                                     !rt_task(task) ||
                                     !task_on_rq_queued(task))) {

                                double_unlock_balance(rq, lowest_rq);
                                lowest_rq = NULL;
                                break;
                        }
                }

                /* If this rq is still suitable use it. */
                if (lowest_rq->rt.highest_prio.curr > task->prio)
                        break;

                /* try again */
                double_unlock_balance(rq, lowest_rq);
                lowest_rq = NULL;
        }

        return lowest_rq;
}

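Selects the runqueue of the cpu with the lowest priority. On success, it returns with both the requested runqueue and the found cpu's runqueue locked.

Code lines 8~9: Loops at most 3 times (RT_MAX_TRIES), finding the lowest runqueue, that is, the runqueue of the cpu running the lowest-priority tasks.
Code lines 11~12: If no cpu was found, or the found cpu is the current cpu, the loop stops.
Code lines 14~24: If the highest priority on the found lowest runqueue is the same as or higher than the priority of the requested task, the found runqueue is given up and NULL is returned.
Code lines 27~44: Acquires both runqueue locks.
Code lines 34~43: If the runqueue lock had to be dropped while taking both locks, the lowest runqueue is given up and NULL is returned in any of the following cases.
- The requested task is no longer on this runqueue
- The found cpu is outside the cpus allowed for the task
  - e.g. taskset -c 0-3 -> only cpus 0~3 are used
- The task is already running
- The task is no longer an rt task
- The task has been dequeued from the runqueue
Code lines 47~48: If the requested task still has a higher priority than the highest priority on the selected lowest runqueue, the loop stops and that runqueue is used.
Code lines 51~52: Otherwise both locks are released and the search is retried, up to 3 times in total.
Code line 55: Returns the lowest runqueue that was found (NULL on failure).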

Pull Tasks

rt_queue_pull_task()

kernel/sched/rt.c

static inline void rt_queue_pull_task(struct rq *rq)
{
        queue_balance_callback(rq, &per_cpu(rt_pull_head, rq->cpu), pull_rt_task);
}

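Registers pull_rt_task() as the callback function. Unlike rt_queue_push_tasks(), no check for pushable tasks is made here.

This callback is invoked as post-processing at the end of __schedule(), after the scheduling work has completed.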

pull_rt_task()

kernel/sched/rt.c -1/2-

static void pull_rt_task(struct rq *this_rq)
{
        int this_cpu = this_rq->cpu, cpu;
        bool resched = false;
        struct task_struct *p;
        struct rq *src_rq;
        int rt_overload_count = rt_overloaded(this_rq);

        if (likely(!rt_overload_count))
                return;

        /*
         * Match the barrier from rt_set_overloaded; this guarantees that if we
         * see overloaded we must also see the rto_mask bit.
         */
        smp_rmb();

        /* If we are the only overloaded CPU do nothing */
        if (rt_overload_count == 1 &&
            cpumask_test_cpu(this_rq->cpu, this_rq->rd->rto_mask))
                return;

#ifdef HAVE_RT_PUSH_IPI
        if (sched_feat(RT_PUSH_IPI)) {
                tell_cpu_to_push(this_rq);
                return;
        }
#endif

        for_each_cpu(cpu, this_rq->rd->rto_mask) {
                if (this_cpu == cpu)
                        continue;

                src_rq = cpu_rq(cpu);

                /*
                 * Don't bother taking the src_rq->lock if the next highest
                 * task is known to be lower-priority than our current task.
                 * This may look racy, but if this value is about to go
                 * logically higher, the src_rq will push this task away.
                 * And if its going logically lower, we do not care
                 */
                if (src_rq->rt.highest_prio.next >=
                    this_rq->rt.highest_prio.curr)
                        continue;

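For each overloaded runqueue, if it has a task with a higher priority than the task about to run on the current runqueue, that task is pulled over. However, when the RT_PUSH_IPI feature (enabled by default) is in use, an IPI is sent to the cpu of the overloaded runqueue so that it pushes by itself.

Code lines 7~10: If there is no overload in the root domain of the requested runqueue, the function returns.
Code lines 19~21: If there is exactly one overloaded cpu and it is the current cpu, there is nothing to pull, so the function returns.
Code lines 23~28: If the RT_PUSH_IPI feature is in use, an IPI is sent to the cpu of the overloaded runqueue so that that cpu pushes by itself.
Code lines 30~32: Iterates over the overloaded cpus, skipping the current (requested) cpu.
Code line 34: Gets the runqueue of the cpu being visited.
Code lines 43~45: If the priority of the requested runqueue is higher than or equal to the next-highest priority on the top-level rt runqueue of the cpu being visited, that cpu is skipped.
- See: sched: use highest_prio.next to optimize pull operations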

kernel/sched/rt.c -2/2-

                /*
                 * We can potentially drop this_rq's lock in
                 * double_lock_balance, and another CPU could
                 * alter this_rq
                 */
                double_lock_balance(this_rq, src_rq);

                /*
                 * We can pull only a task, which is pushable
                 * on its rq, and no others.
                 */
                p = pick_highest_pushable_task(src_rq, this_cpu);

                /*
                 * Do we have an RT task that preempts
                 * the to-be-scheduled task?
                 */
                if (p && (p->prio < this_rq->rt.highest_prio.curr)) {
                        WARN_ON(p == src_rq->curr);
                        WARN_ON(!task_on_rq_queued(p));

                        /*
                         * There's a chance that p is higher in priority
                         * than what's currently running on its CPU.
                         * This is just that p is wakeing up and hasn't
                         * had a chance to schedule. We only pull
                         * p if it is lower in priority than the
                         * current task on the run queue
                         */
                        if (p->prio < src_rq->curr->prio)
                                goto skip;

                        resched = true;

                        deactivate_task(src_rq, p, 0);
                        set_task_cpu(p, this_cpu);
                        activate_task(this_rq, p, 0);
                        /*
                         * We continue with the search, just in
                         * case there's an even higher prio task
                         * in another runqueue. (low likelihood
                         * but possible)
                         */
                }
skip:
                double_unlock_balance(this_rq, src_rq);
        }

        if (resched)
                resched_curr(this_rq);
}

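Code line 6: Safely takes the double lock on the two runqueues.
Code line 12: Iterates over the tasks on the pushable tasks list and gets the first task that is not running and can run on the current cpu. (Iteration goes from high priority to low priority.)
Code lines 18~31: If the task has a higher priority than the task currently running on the visited cpu's runqueue, it is skipped.
- It will run on that runqueue shortly, so it is not pulled.
Code lines 33~37: The task is deactivated on the visited cpu's runqueue, its cpu is set to the current cpu, and it is activated again on the requested runqueue. The resched flag is set so that a reschedule is requested when the function exits.
Code lines 45~46: The skip: label. Releases the double lock on the two runqueues.
Code lines 49~50: If the resched flag is set, a reschedule is requested.

The following figure shows how a task with a higher priority than the task about to run on the current runqueue is pulled from the runqueue of another, overloaded cpu.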
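The priority comparisons that gate a pull can be sketched as two small predicates. This is a minimal model of the checks walked through above, with hypothetical function names and plain int priorities (lower value = higher priority). Locking, the cpu affinity test and the migration itself are left out.

```c
#include <stdbool.h>

/*
 * Cheap pre-check made before taking the source runqueue lock:
 * only bother if the source's next-highest priority beats what
 * this runqueue is about to run.
 */
static bool model_worth_locking(int src_next_prio, int this_curr_prio)
{
        return src_next_prio < this_curr_prio;
}

/*
 * Decision made under the lock for the candidate task p:
 *  - p must preempt the task this runqueue is about to run, and
 *  - p must not outrank the task running on its own cpu, because in
 *    that case it is about to be scheduled there anyway.
 */
static bool model_should_pull(int p_prio, int this_curr_prio, int src_curr_prio)
{
        if (p_prio >= this_curr_prio)
                return false;
        if (p_prio < src_curr_prio)
                return false;
        return true;
}
```

For example, a waiting RT30 task on a cpu that is running RT20 is pulled by a cpu about to run RT50, but it is left alone if its own cpu is only running RT40.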

Structures

sched_rt_entity structure

include/linux/sched.h

struct sched_rt_entity {
        struct list_head                run_list;
        unsigned long                   timeout;
        unsigned long                   watchdog_stamp;
        unsigned int                    time_slice;
        unsigned short                  on_rq;
        unsigned short                  on_list;

        struct sched_rt_entity          *back;
#ifdef CONFIG_RT_GROUP_SCHED
        struct sched_rt_entity          *parent;
        /* rq on which this entity is (to be) queued: */
        struct rt_rq                    *rt_rq;
        /* rq "owned" by this entity/group: */
        struct rt_rq                    *my_q;
#endif
} __randomize_layout;

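run_list
- The entry node used when the entity is enqueued on one of the 100 queue lists.
timeout
- The execution time limit set on the task
watchdog_stamp
- The time at which the watchdog was last updated
time_slice
- The rt time slice for round robin: the number of ticks corresponding to 100ms
- Keeps rt tasks within one priority list from being rotated (round robin) before this period has elapsed.
- Refilled each time a round-robin rotation happens.
on_rq
- Indicates whether the rt entity is enqueued on the runqueue. (1 = enqueued)
on_list
- Indicates whether the rt entity is linked into the rt runqueue's list data structure. (1 = on a list)
*back
- Used temporarily when enqueuing or dequeuing an rt entity on the rt runqueue, to walk back down the path that was climbed bottom-up.
*parent
- NULL when group scheduling is not used
- When group scheduling is used, points to the parent rt group entity. NULL at the top level.
*rt_rq
- Points to the rt runqueue on which the rt entity is (to be) queued.
*my_q
- If the rt entity represents a group, points to the rt runqueue owned by that group.

rt_bandwidth structure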

kernel/sched/sched.h

struct rt_bandwidth {
        /* nests inside the rq lock: */
        raw_spinlock_t          rt_runtime_lock;
        ktime_t                 rt_period;
        u64                     rt_runtime;
        struct hrtimer          rt_period_timer;
};

rt_runtime_lock
- Spinlock
rt_period
- rt period (ns)
- The default is 1 second for both the global rt period and a task group's rt period.
rt_runtime
- rt runtime (ns)
- The default is 0.95 seconds for the global rt runtime and the root task group, and 0 for child task groups.
rt_period_timer
- rt period timer

rt_rq structure

kernel/sched/sched.h

/* Real-Time classes' related field in a runqueue: */
struct rt_rq {
        struct rt_prio_array    active;
        unsigned int            rt_nr_running;
        unsigned int            rr_nr_running;
#if defined CONFIG_SMP || defined CONFIG_RT_GROUP_SCHED
        struct {
                int             curr; /* highest queued rt task prio */
#ifdef CONFIG_SMP
                int             next; /* next highest */
#endif
        } highest_prio;
#endif
#ifdef CONFIG_SMP
        unsigned long           rt_nr_migratory;
        unsigned long           rt_nr_total;
        int                     overloaded;
        struct plist_head       pushable_tasks;

#endif /* CONFIG_SMP */
        int                     rt_queued;

        int                     rt_throttled;
        u64                     rt_time;
        u64                     rt_runtime;
        /* Nests inside the rq lock: */
        raw_spinlock_t          rt_runtime_lock;

#ifdef CONFIG_RT_GROUP_SCHED
        unsigned long           rt_nr_boosted;

        struct rq               *rq;
        struct task_group       *tg;
#endif
};

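active
- An rt_prio_array structure that holds an array of 100 queue lists.
rt_nr_running
- The number of runnable rt tasks in this rt runqueue and in all child groups below it in the hierarchy
rr_nr_running
- The number of runnable rt tasks using the round robin policy in this rt runqueue and in all child groups below it
highest_prio.curr
- The highest priority among the rt tasks queued on this rt runqueue
highest_prio.next
- The next-highest priority among the queued rt tasks
- If two tasks both have the highest priority, curr and next are the same.
rt_nr_migratory
- The number of rt tasks that can be migrated. It is incremented/decremented on enqueue/dequeue unless the task is pinned to a single cpu.
- The number of tasks that can be pushed to another cpu
- See: sched: add RT-balance cpu-weight
rt_nr_total
- The number of rt tasks, updated only on the top-level rt runqueue. It is incremented/decremented on enqueue/dequeue and is used for the overload check.
- overload is set when this value is 2 or more and rt_nr_migratory is 1 or more.
- See: sched_rt: Fix overload bug on rt group scheduling
overloaded
- 1 when two or more rt tasks want to run on the runqueue
- When overloaded, other cpus are allowed to take tasks away where possible. (for push operation)
pushable_tasks
- Overloaded tasks are kept on this list, sorted from the highest priority, so that they can be pushed. (for push operation)
- On SMP systems a cpu running at a lower priority can access this list and pull tasks from it. (for pull operation)
- See: sched: create “pushable_tasks” list to limit pushing to one attempt
rt_queued
- 1 when the rt runqueue is already enqueued on the runqueue
rt_throttled
- 1 when throttled
- See: sched: rt time limit
rt_time
- Accumulates the runtime consumed by RT tasks.
- Throttling occurs when it exceeds the runtime assigned to the local pool (rt_runtime) below.
rt_runtime
- The runtime assigned to the rt local pool
  - Depending on whether the RT_RUNTIME_SHARE feature is used:
    - When not used, the rt runtime configured on the task group is used every period.
    - When used, each cpu starts with the rt runtime configured on the task group, and cpus lend runtime to and borrow runtime from one another. That is, runtime from other cpus can be concentrated where more is needed.
rt_nr_boosted
- Used to deal with the priority inversion problem.
  - See: sched: rt-group: deal with PI
*rq
- Points to the runqueue.
*tg
- When group scheduling is used through cgroup, points to the task group.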
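How rt_nr_total and rt_nr_migratory combine into the overloaded flag can be sketched as follows. This is a minimal userspace model of the rule stated in the field descriptions (total of 2 or more and at least 1 migratable task). The struct and function names are hypothetical, and the rto_mask and rto_count updates that accompany the flag in the kernel are left out.

```c
/* Hypothetical model of the overload-related counters of an rt runqueue. */
struct model_rt_rq {
        unsigned long rt_nr_migratory;  /* tasks allowed on more than one cpu */
        unsigned long rt_nr_total;      /* all rt tasks on this runqueue */
        int overloaded;
};

/* Overloaded: at least two rt tasks, and at least one of them can move. */
static void model_update_overload(struct model_rt_rq *rt_rq)
{
        rt_rq->overloaded = (rt_rq->rt_nr_total >= 2 &&
                             rt_rq->rt_nr_migratory >= 1);
}

/* Account for a task being enqueued; nr_cpus_allowed is its affinity size. */
static void model_enqueue(struct model_rt_rq *rt_rq, int nr_cpus_allowed)
{
        rt_rq->rt_nr_total++;
        if (nr_cpus_allowed > 1)
                rt_rq->rt_nr_migratory++;
        model_update_overload(rt_rq);
}

/* Account for a task being dequeued. */
static void model_dequeue(struct model_rt_rq *rt_rq, int nr_cpus_allowed)
{
        rt_rq->rt_nr_total--;
        if (nr_cpus_allowed > 1)
                rt_rq->rt_nr_migratory--;
        model_update_overload(rt_rq);
}
```

Two tasks pinned to a single cpu do not make the runqueue overloaded, since neither can be pushed or pulled. The flag is set only once a task that may run on more than one cpu joins them.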

References

Scheduler -1- (Basic) | 문c
Scheduler -2- (Global Cpu Load) | 문c
Scheduler -3- (PELT) | 문c
Scheduler -4- (Group Scheduling) | 문c
Scheduler -5- (Scheduler Core) | 문c
Scheduler -6- (CFS Scheduler) | 문c
Scheduler -7- (Preemption & Context Switch) | 문c
Scheduler -8- (CFS Bandwidth) | 문c
Scheduler -9- (RT Scheduler) | 문c – this post
Scheduler -10- (Deadline Scheduler) | 문c
Scheduler -11- (Stop Scheduler) | 문c
Scheduler -12- (Idle Scheduler) | 문c
Scheduler -13- (Scheduling Domain 1) | 문c
Scheduler -14- (Scheduling Domain 2) | 문c
Scheduler -15- (Load Balance 1) | 문c
Scheduler -16- (Load Balance 2) | 문c
Scheduler -17- (Load Balance 3 NUMA) | 문c
Scheduler -18- (Load Balance 4 EAS) | 문c
Scheduler -19- (Initialization) | 문c
PID Management | 문c
do_fork() | 문c
cpu_startup_entry() | 문c
Runqueue Load Average (cpu_load[]) – v4.0 | 문c
PELT(Per-Entity Load Tracking) – v4.0 | 문c

Real-Time group scheduling | kernel.org

Scheduler -8- (CFS Bandwidth)

2017-05-23, 2021-01-11 문영일

CFS Bandwidth

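The scheduling share of cfs tasks could be adjusted by setting a shares value per task group. This section introduces cfs bandwidth, another way to control the scheduling share of cfs tasks.

Every cfs_period_us, a task group is given cfs_quota_us worth of runtime to use. Once that is consumed and the remaining runtime drops to 0 or below, the group is throttled for the rest of the time until the next period begins.

cfs_period_us
- bandwidth period (us)
cfs_quota_us
- bandwidth quota (us)
- The default is -1 (unlimited), in which case cfs bandwidth is not active.

cfs throttle

When a cfs runqueue is throttled, it behaves as follows.

It yields its time to other task groups.
- e.g. With two task groups A and B under the root group, if cfs bandwidth is applied to group A, group B runs while group A is throttled.
If there is no other task group to yield to, the cpu idles.

The following shows the cfs_period_us and cfs_quota_us values set on the root task group. cfs_quota_us is -1 by default, which shows that cfs bandwidth is not active.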

$ cd /sys/fs/cgroup/cpu
$ ls
cgroup.clone_children  cpu.cfs_period_us  cpu.stat       cpuacct.usage_percpu  system.slice
cgroup.procs           cpu.cfs_quota_us   cpuacct.stat   notify_on_release     tasks
cgroup.sane_behavior   cpu.shares         cpuacct.usage  release_agent         user.slice
$ cat cpu.cfs_period_us 
100000
$ cat cpu.cfs_quota_us 
-1

The following terms appear frequently, so they are summarized first.

cfs runtime
- The time for which tasks have been scheduled and run on the cfs runqueue
quota integer ratio (normalized cfs quota)
- The ratio of the quota to the period, converted to an integer.
- 100% = 1M (1,048,576).
- Example 1) period=10ms, quota=5ms is 50%, which is 524,288 as a quota integer ratio.
- Example 2) period=10ms, quota=20ms is 200%, which is 2,097,152 as a quota integer ratio.
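The two examples above can be reproduced with a short fixed-point calculation. This is a minimal sketch, assuming a 20-bit shift so that 100% maps to 1,048,576. MODEL_BW_SHIFT and model_quota_ratio() are hypothetical names used for illustration only.

```c
#include <stdint.h>

#define MODEL_BW_SHIFT 20       /* assumption: 100% == 1 << 20 == 1,048,576 */

/*
 * Quota integer ratio = (quota / period) scaled by 2^20.
 * period_ns is assumed to be non-zero, and quota_ns small enough
 * that the shift does not overflow 64 bits.
 */
static uint64_t model_quota_ratio(uint64_t period_ns, uint64_t quota_ns)
{
        return (quota_ns << MODEL_BW_SHIFT) / period_ns;
}
```

With period=10ms and quota=5ms the function returns 524,288, and with quota=20ms it returns 2,097,152, matching examples 1 and 2.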

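Bandwidth Examples

Let's look at three cases in which bandwidth is applied.

Case 1) The following figure shows cfs scheduling of a 10ms quota in every 20ms period. The remaining time is throttled.

Normally the first interval simply repeats. The figure, however, packs in as many of the situations that can occur in an interval as possible.
Legend
- cfs running
  - The interval of runtime used by the cfs tasks belonging to the task group.
- cfs throttle
  - When the task group is dequeued and throttled, cfs tasks of other task groups can run in this interval.
  - If there is no cfs task to run at all, the cpu idles.
- other schedulers
  - The interval in which stop, dl and rt tasks, which have higher priority than cfs, run.

The following figure shows various situations that can occur in case 1).

In the 1st period, the group ran first, used up its whole quota of runtime, and was throttled, handing scheduling over to other tasks.
In the 2nd period, the runtime was refilled by the quota amount, used up again, and the group was throttled again.
In the 3rd period, the other (stop, rt, dl) schedulers were scheduled first, and the group's cfs tasks ran once they finished.
In the 9th period, the other (stop, rt, dl) schedulers were scheduled first and left no time for the cfs tasks to run.

Case 2) The following figure shows cfs scheduling of a total quota of 20ms across 2 cpus in every 20ms period.

When the period and quota are equal and 2 cpus are available, the 2 cpus typically take turns exhausting the runtime each period.
With two cpus, half of the capacity is still spare even though the period and quota are equal.
Where possible, runtime is assigned first to the cpu that was throttled, so the throttling alternates between the cpus.

The following figure shows various situations that can occur in case 2).

In the 8th period, the other (stop, rt, dl) schedulers were scheduled first and left no time for the cfs tasks to run.

Case 3) The following figure shows cfs scheduling of a total quota of 30ms across 2 cpus in every 20ms period.

The task group receives at most 75% of the cfs runtime.

The following figure shows various situations that can occur in case 3).

The following figure compares running without cfs bandwidth against running with a bandwidth of period=25, quota=25 set on task group G1, one of two groups.

It shows the time during which the tasks in group G1 are throttled.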

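Key Global Variables

sysctl_sched_cfs_bandwidth_slice
- Default is 5000 (us) = 5 (ms)
- Set through the “/proc/sys/kernel/sched_cfs_bandwidth_slice_us” file
- The amount of runtime taken from the global pool (tg) and assigned to a local (per cfs_rq) pool each time the local pool requests it
min_cfs_rq_runtime
- Default is 1,000,000 (ns) = 1 (ms)
- The minimum runtime to be allocated to a local pool
min_bandwidth_expiration
- Default is 2,000,000 (ns) = 2 (ms)
- The minimum time left until the period expires. Within this time the slack timer is not started.
cfs_bandwidth_slack_period
- Default is 5,000,000 (ns) = 5 (ms)
- slack timer period

CFS Runtime

The implementation of CFS runtime, used to compute the remaining quota (runtime) for throttling when CFS bandwidth is used in a group, has evolved as follows.

1) cfs hard limits: the method implemented when cfs bandwidth was first introduced
2) hybrid global pool: the method implemented in current kernels, introduced in cfs bandwidth v4.

Hybrid global pool

An implementation with a global runtime only has the weakness that, on systems with many cpus, lock contention between the cfs runqueues running on each cpu becomes very heavy. An implementation with local cpu runtime only requires complex interaction between cfs runqueues to check the remaining quota, so it is suitable only for small smp systems. For performance, a hybrid version that implements both local and global pools is therefore used today.

global runtime pool
- A global runtime pool is created per task group.
- Also called the global pool in cfs bandwidth; its members are in the cfs_bandwidth structure.
- This is where tracking happens. The period timer refills (refreshes) the runtime by the quota amount every period.
local cpu runtime
- A local cpu runtime exists for each cpu of a task group.
- Also called the local pool in cfs bandwidth; its cfs bandwidth members are in the cfs_rq structure.
- Consumption happens from the local runtime, in the cfs runqueue of each local cpu, with the advantage that no lock is needed.
- When the local runtime has been used up at period expiry, runtime is assigned mainly to the local pools that were throttled in the previous period. When nothing can be assigned, the pool is throttled.
- When the local runtime has been used up, a suitable amount (slice) can be borrowed from the global runtime pool.

The local runtime is replenished in the following cases. Details are covered with each function.

Case A) When the period timer expires, the runtime shortfall is distributed first to the throttled local pools
- sched_cfs_period_timer() -> distribute_cfs_runtime()
Case B) On every tick, runtime is assigned to an empty local pool
- update_curr() -> account_cfs_rq_runtime() -> assign_cfs_rq_runtime()
Case C) When a schedule entity is dequeued, the remaining local runtime is returned to the global pool. Depending on conditions, the slack timer is started to distribute the runtime shortfall first to the throttled local pools
- dequeue_entity() -> return_cfs_rq_runtime()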
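Case B, where an empty local pool borrows a slice from the global pool, can be sketched in userspace as follows. This is a simplified model, not the kernel's assign_cfs_rq_runtime(): the struct and function names are hypothetical, locking and period expiry are ignored, and the rule that the local pool is topped up to one slice (covering any deficit) is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the per-task-group global pool. */
struct model_global_pool {
        uint64_t runtime;               /* ns left in this period */
};

/* Hypothetical model of the per-cpu local pool. */
struct model_local_pool {
        int64_t runtime_remaining;      /* ns; can go negative on overrun */
};

/*
 * Refill an exhausted local pool from the global pool.
 * Returns true if the local pool has runtime afterwards,
 * false if it would have to be throttled.
 */
static bool model_assign_runtime(struct model_global_pool *g,
                                 struct model_local_pool *l,
                                 uint64_t slice_ns)
{
        uint64_t want, amount;

        if (l->runtime_remaining > 0)
                return true;            /* nothing to refill */

        /* Top up to one slice, which also repays any deficit. */
        want = (uint64_t)((int64_t)slice_ns - l->runtime_remaining);
        amount = want < g->runtime ? want : g->runtime;

        g->runtime -= amount;
        l->runtime_remaining += (int64_t)amount;

        return l->runtime_remaining > 0;
}
```

With a 10ms global pool and a 5ms slice, a local pool that is 1ms in deficit takes 6ms and ends up with 5ms. A second empty request takes the remaining 4ms, and a third gets nothing, which is the point at which the cfs runqueue would be throttled until the next refill.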

CFS Bandwidth Initialization

init_cfs_bandwidth()

kernel/sched/fair.c

void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) 
{
        raw_spin_lock_init(&cfs_b->lock);
        cfs_b->runtime = 0;
        cfs_b->quota = RUNTIME_INF;
        cfs_b->period = ns_to_ktime(default_cfs_period());

        INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
        hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
        cfs_b->period_timer.function = sched_cfs_period_timer;
        hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
        cfs_b->slack_timer.function = sched_cfs_slack_timer;
        cfs_b->distribute_running = 0;
        cfs_b->slack_started = false;
}

Initializes cfs bandwidth.

Code lines 4~6: Sets the global runtime to 0, the quota to the unlimited value RUNTIME_INF (0xffffffff_ffffffff = -1), and the period to the default cfs period (100,000,000ns = 0.1s).
Code line 8: Initializes the throttled list.
Code lines 9~10: Initializes the period hrtimer and sets the function called on expiry.
Code lines 11~12: Initializes the slack hrtimer and sets the function called on expiry.
Code lines 13~14: Initializes distribute_running, which indicates that global runtime is being distributed to throttled cpus, to 0, and slack_started, which indicates that the slack timer is in progress, to false.

Configuring CFS Bandwidth

Setting the CFS quota

tg_set_cfs_quota()

kernel/sched/core.c

int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
{
        u64 quota, period;

        period = ktime_to_ns(tg->cfs_bandwidth.period);
        if (cfs_quota_us < 0)
                quota = RUNTIME_INF;
        else if ((u64)cfs_quota_us <= U64_MAX / NSEC_PER_USEC)
                quota = (u64)cfs_quota_us * NSEC_PER_USEC;
        else
                return -EINVAL;

        return tg_set_cfs_bandwidth(tg, period, quota);
}

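Converts the cfs quota value (us) to nanoseconds and sets it on the requested task group. The quota must be at least 1ms, and a negative value means unlimited.

Code line 5: Gets the period configured in cfs bandwidth, converted to nanoseconds.
Code lines 6~11: If the quota (us) passed in is less than 0, it is set to unlimited so that no throttling occurs. Otherwise it is converted to nanoseconds, and -EINVAL is returned if the result would not fit in 64 bits.
Code line 13: Sets the period (ns) and quota (ns) on the requested task group.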
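The microsecond-to-nanosecond conversion with its overflow guard can be tried out in userspace with the sketch below. It mirrors the three branches of tg_set_cfs_quota() shown above, but the MODEL_* constants and model_quota_us_to_ns() are hypothetical names, and -1 stands in for -EINVAL.

```c
#include <stdint.h>

#define MODEL_RUNTIME_INF   UINT64_MAX  /* stand-in for RUNTIME_INF */
#define MODEL_NSEC_PER_USEC 1000ULL

/*
 * Convert a quota in microseconds to nanoseconds.
 * Negative input means "unlimited". Returns 0 on success and
 * -1 (standing in for -EINVAL) if the result would overflow 64 bits.
 */
static int model_quota_us_to_ns(long quota_us, uint64_t *quota_ns)
{
        if (quota_us < 0) {
                *quota_ns = MODEL_RUNTIME_INF;
                return 0;
        }

        if ((uint64_t)quota_us > UINT64_MAX / MODEL_NSEC_PER_USEC)
                return -1;

        *quota_ns = (uint64_t)quota_us * MODEL_NSEC_PER_USEC;
        return 0;
}
```

Writing -1 to cpu.cfs_quota_us corresponds to the first branch, and a value such as 5000 becomes 5,000,000ns.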

Setting the CFS period

tg_set_cfs_period()

kernel/sched/core.c

int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
{
        u64 quota, period;

        if ((u64)cfs_period_us > U64_MAX / NSEC_PER_USEC)
                return -EINVAL;

        period = (u64)cfs_period_us * NSEC_PER_USEC;
        quota = tg->cfs_bandwidth.quota;

        return tg_set_cfs_bandwidth(tg, period, quota);
}

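Converts the cfs period value (us) to nanoseconds and sets it on the requested task group. The allowed range is 1ms ~ 1s.

Code lines 5~6: If the value is too large to be held as 64-bit nanoseconds, an error is returned.
Code line 8: Converts the period (us) passed in to nanoseconds.
Code line 9: Gets the quota (ns) configured in cfs bandwidth.
Code line 11: Sets the period (ns) and quota (ns) on the requested task group.

Common quota and period configuration

Maximum and minimum cfs quota period limits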

kernel/sched/core.c

const u64 max_cfs_quota_period = 1 * NSEC_PER_SEC; /* 1s */
const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */

The cfs period can be set within the range 1ms ~ 1s. The cfs quota can be 1ms or more.
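The range rule can be expressed as a single predicate. This is a minimal sketch using the two limits shown above. model_cfs_bandwidth_valid() is a hypothetical name, and the root task group check and the hierarchy feasibility check are not modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Limits taken from the constants above: 1ms minimum, 1s maximum. */
#define MODEL_MIN_CFS_QUOTA_PERIOD 1000000ULL     /* 1ms in ns */
#define MODEL_MAX_CFS_QUOTA_PERIOD 1000000000ULL  /* 1s in ns */

/*
 * period must be within 1ms ~ 1s; quota only has a lower bound of 1ms,
 * so an "unlimited" quota (all bits set) also passes.
 */
static bool model_cfs_bandwidth_valid(uint64_t period_ns, uint64_t quota_ns)
{
        if (quota_ns < MODEL_MIN_CFS_QUOTA_PERIOD ||
            period_ns < MODEL_MIN_CFS_QUOTA_PERIOD)
                return false;

        if (period_ns > MODEL_MAX_CFS_QUOTA_PERIOD)
                return false;

        return true;
}
```

A 100ms period with a 50ms quota passes, a 0.5ms quota is rejected, and a 2s period is rejected regardless of the quota.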

tg_set_cfs_bandwidth()

kernel/sched/core.c -1/2-

static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
{
        int i, ret = 0, runtime_enabled, runtime_was_enabled;
        struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;

        if (tg == &root_task_group)
                return -EINVAL;

        /*
         * Ensure we have at some amount of bandwidth every period.  This is
         * to prevent reaching a state of large arrears when throttled via
         * entity_tick() resulting in prolonged exit starvation.
         */
        if (quota < min_cfs_quota_period || period < min_cfs_quota_period)
                return -EINVAL;

        /*
         * Likewise, bound things on the otherside by preventing insane quota
         * periods.  This also allows us to normalize in computing quota
         * feasibility.
         */
        if (period > max_cfs_quota_period)
                return -EINVAL;

        /*
         * Prevent race between setting of cfs_rq->runtime_enabled and
         * unthrottle_offline_cfs_rqs().
         */
        get_online_cpus();
        mutex_lock(&cfs_constraints_mutex);
        ret = __cfs_schedulable(tg, period, quota);
        if (ret)
                goto out_unlock;

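Sets whether the bandwidth feature is used for the requested task group. The following items are handled.

The setting is applied only if the period (ns) requested for the task group is within 1 ms ~ 1 s
The setting is applied only if the quota (ns) requested for the task group is at least 1 ms
The integer quota ratios of all task groups are recomputed
Depending on the quota setting, the cfs bandwidth feature is enabled, disabled, or left in its current state

Code lines 6~7: if the requested task group is the root task group, the period and quota bandwidth cannot be set, so -EINVAL is returned.
Code lines 14~15: returns -EINVAL if the requested quota or period (ns) is below the minimum (1 ms).
Code lines 22~23: returns -EINVAL if the requested period (ns) exceeds the maximum (1 s).
Code lines 31~33: while walking all task groups from the top-level root task group, sets the integer quota ratio of each task group in top-down order. On failure, jumps to the out_unlock label.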

kernel/sched/core.c -2/2-

        runtime_enabled = quota != RUNTIME_INF;
        runtime_was_enabled = cfs_b->quota != RUNTIME_INF;
        /*
         * If we need to toggle cfs_bandwidth_used, off->on must occur
         * before making related changes, and on->off must occur afterwards
         */
        if (runtime_enabled && !runtime_was_enabled)
                cfs_bandwidth_usage_inc();
        raw_spin_lock_irq(&cfs_b->lock);
        cfs_b->period = ns_to_ktime(period);
        cfs_b->quota = quota;

        __refill_cfs_bandwidth_runtime(cfs_b);

        /* restart the period timer (if active) to handle new period expiry */
        if (runtime_enabled)
                start_cfs_bandwidth(cfs_b);

        raw_spin_unlock_irq(&cfs_b->lock);

        for_each_online_cpu(i) {
                struct cfs_rq *cfs_rq = tg->cfs_rq[i];
                struct rq *rq = cfs_rq->rq;
                struct rq_flags rf;

                rq_lock_irq(rq, &rf);
                cfs_rq->runtime_enabled = runtime_enabled;
                cfs_rq->runtime_remaining = 0;

                if (cfs_rq->throttled)
                        unthrottle_cfs_rq(cfs_rq);
                rq_unlock_irq(rq, &rf);
        }
        if (runtime_was_enabled && !runtime_enabled)
                cfs_bandwidth_usage_dec();
out_unlock:
        mutex_unlock(&cfs_constraints_mutex);
        put_online_cpus();

        return ret;
}

Code line 1: runtime_enabled becomes true if the requested quota is not unlimited.
Code line 2: runtime_was_enabled becomes true if the existing quota is not unlimited.
Code lines 7~8: if the quota was unlimited and is now set, enables the cfs bandwidth feature.
Code lines 10~11: stores the requested values in the cfs bandwidth period and quota. (in ns)
Code line 13: refills (refreshes) the cfs bandwidth runtime.
Code lines 16~17: if the cfs bandwidth feature is enabled, starts the cfs bandwidth period timer.
Code lines 21~33: loops over the online cpus, setting runtime_enabled on each cfs runqueue and resetting runtime_remaining to 0. If a cfs runqueue is already throttled, it is unthrottled.
Code lines 34~35: if the quota was set and is now unlimited, disables the cfs bandwidth feature.
Code lines 36~40: the out_unlock: label. Unlocks the mutex, releases the online cpu (hotplug) read lock, and returns ret.

__cfs_schedulable()

kernel/sched/core.c

static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota)
{               
        int ret;
        struct cfs_schedulable_data data = {
                .tg = tg,
                .period = period,
                .quota = quota,
        };
 
        if (quota != RUNTIME_INF) { 
                do_div(data.period, NSEC_PER_USEC);
                do_div(data.quota, NSEC_PER_USEC);
        }
        
        rcu_read_lock();
        ret = walk_tg_tree(tg_cfs_schedulable_down, tg_nop, &data);
        rcu_read_unlock();

        return ret;
}

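While walking all task groups from the top-level root task group, sets the integer quota ratio of each group in top-down order. Returns 0 on success.

Code lines 4~8: fills the cfs schedulable data structure with the task group and the period and quota values in ns.
Code lines 10~13: unless the quota is unlimited, converts the period and quota values to us.
Code lines 15~17: while walking all task groups from the top-level root task group, sets the integer quota ratio of each group in top-down order.

Setting the integer quota ratio through a task group tree walk-down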

tg_cfs_schedulable_down()

kernel/sched/core.c

static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
{
        struct cfs_schedulable_data *d = data;
        struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
        s64 quota = 0, parent_quota = -1;

        if (!tg->parent) {
                quota = RUNTIME_INF;
        } else {
                struct cfs_bandwidth *parent_b = &tg->parent->cfs_bandwidth;

                quota = normalize_cfs_quota(tg, d);
                parent_quota = parent_b->hierarchical_quota;

                /*
                 * ensure max(child_quota) <= parent_quota, inherit when no
                 * limit is set
                 */
                if (cgroup_subsys_on_dfl(cpu_cgrp_subsys)) {
                        quota = min(quota, parent_quota);
                } else {
                        if (quota == RUNTIME_INF)
                                quota = parent_quota;
                        else if (parent_quota != RUNTIME_INF && quota > parent_quota)
                                return -EINVAL;
                }
        }
        cfs_b->hierarchical_quota = quota;

        return 0;
}

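Sets the integer ratio of quota to period for the requested task group. Returns 0 if there is no error.

Code line 3: takes from the data argument a pointer to the structure holding the period and quota in us.
Code lines 7~8: for the top-level task group, which has no parent, sets the quota to unlimited so that no throttling occurs.
Code line 12: computes the integer ratio of quota to period. (e.g. integer 1M=100%, 256K=25%)
Code line 13: reads the parent's integer quota ratio.
Code lines 19~20: on cgroup v2, uses the smaller of the group's quota ratio and the parent's quota ratio.
Code lines 21~23: otherwise (cgroup v1), if the computed quota ratio is unlimited, uses the parent's quota ratio.
Code lines 24~25: if the parent's quota ratio is not unlimited and the computed quota ratio is larger than the parent's, returns -EINVAL.
Code lines 28~30: stores the quota ratio of the requested task group and returns success (0).
- The integer quota ratio of hierarchically managed task groups is stored in hierarchical_quota.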
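The parent/child rule described above can be sketched in userspace as follows. This is a simplified illustration, not the kernel function: `effective_ratio()` and `RATIO_INF` are made-up names, `-1` stands in for `-EINVAL`, and the ratios are treated as unsigned so that "unlimited" is simply the largest value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RATIO_INF UINT64_MAX   /* stand-in for RUNTIME_INF */

/*
 * Stores the effective ratio of a child group in *out and returns 0,
 * or returns -1 when a cgroup v1 child asks for more than its parent.
 */
static int effective_ratio(bool cgroup_v2, uint64_t parent, uint64_t child,
                           uint64_t *out)
{
        assert(out != NULL);
        if (cgroup_v2) {
                /* v2: silently clamp the child to the parent's ratio */
                *out = child < parent ? child : parent;
                return 0;
        }
        if (child == RATIO_INF)
                child = parent;            /* v1: no limit set, inherit */
        else if (parent != RATIO_INF && child > parent)
                return -1;                 /* v1: reject an oversized child */
        *out = child;
        return 0;
}
```

With a parent at 50% (524288), a v1 child asking for 100% (1048576) is rejected, while the same request under v2 is clamped to 524288.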

Computing the integer CFS quota ratio

normalize_cfs_quota()

kernel/sched/core.c

/*
 * normalize group quota/period to be quota/max_period
 * note: units are usecs
 */

static u64 normalize_cfs_quota(struct task_group *tg,
                               struct cfs_schedulable_data *d)
{
        u64 quota, period;

        if (tg == d->tg) {
                period = d->period;
                quota = d->quota;
        } else {
                period = tg_get_cfs_period(tg);
                quota = tg_get_cfs_quota(tg);
        }

        /* note: these should typically be equivalent */
        if (quota == RUNTIME_INF || quota == -1)
                return RUNTIME_INF;

        return to_ratio(period, quota);
}

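Returns the ratio of quota to period as an integer. (e.g. integer 1M=100%, 256K=25%)

Code lines 6~8: if the requested task group is the same as the task group in the schedulable data, uses the period and quota values (us) from the schedulable data.
Code lines 9~12: otherwise, reads the task group's own period and quota values, converted to us.
Code lines 15~16: if the quota is unlimited, returns the unlimited value (0xffffffff_ffffffff).
Code line 18: returns quota * 1M (1 << 20) / period.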
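The fixed-point arithmetic of the last step can be checked with a few lines of userspace C. This sketch only mirrors the formula quota * (1 << 20) / period given above; `quota_ratio()` and `RATIO_INF` are names invented for the example.

```c
#include <assert.h>
#include <stdint.h>

#define BW_SHIFT  20            /* 1 << 20 (1M) represents 100% */
#define RATIO_INF UINT64_MAX    /* stand-in for RUNTIME_INF */

/* quota/period as a fixed-point integer; unlimited stays unlimited. */
static uint64_t quota_ratio(uint64_t period_us, uint64_t quota_us)
{
        if (quota_us == RATIO_INF)
                return RATIO_INF;
        assert(period_us > 0);
        return (quota_us << BW_SHIFT) / period_us;
}
```

For a 100000 us period, a 25000 us quota gives 262144 (256K, 25%) and a 100000 us quota gives 1048576 (1M, 100%). A quota of twice the period gives 2097152, that is, two cpus' worth.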

tg_get_cfs_quota()

kernel/sched/core.c

long tg_get_cfs_quota(struct task_group *tg)
{
        u64 quota_us;

        if (tg->cfs_bandwidth.quota == RUNTIME_INF)
                return -1;

        quota_us = tg->cfs_bandwidth.quota;
        do_div(quota_us, NSEC_PER_USEC);

        return quota_us;
}

Returns the cfs quota of the task group in us. Returns -1 if the quota is unlimited.

tg_get_cfs_period()

kernel/sched/core.c

long tg_get_cfs_period(struct task_group *tg)
{
        u64 cfs_period_us;

        cfs_period_us = ktime_to_ns(tg->cfs_bandwidth.period);
        do_div(cfs_period_us, NSEC_PER_USEC);

        return cfs_period_us;
}

Returns the cfs period of the task group in us.

Task group tree walk

walk_tg_tree()

kernel/sched/sched.h

/*
 * Iterate the full tree, calling @down when first entering a node and @up when
 * leaving it for the final time.
 *      
 * Caller must hold rcu_lock or sufficient equivalent.
 */

static inline int walk_tg_tree(tg_visitor down, tg_visitor up, void *data)
{       
        return walk_tg_tree_from(&root_task_group, down, up, data);
}

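Walks all task groups in the task group tree starting from the top-level root task group, calling the down function when descending into a group and the up function when leaving it. If a called function returns an error partway through, that value is returned and the walk stops.

The following figure shows tg_cfs_schedulable_down() being called each time the walk descends into a task group when __cfs_schedulable() is called.

The calls occur in numbered order; listing only the downward calls, the order is 1-D -> 2-D -> 4-D -> 5-D -> 7-D -> 10-D.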

walk_tg_tree_from()

kernel/sched/core.c

/*
 * Iterate task_group tree rooted at *from, calling @down when first entering a
 * node and @up when leaving it for the final time.
 *
 * Caller must hold rcu_lock or sufficient equivalent.
 */

int walk_tg_tree_from(struct task_group *from,
                             tg_visitor down, tg_visitor up, void *data)
{
        struct task_group *parent, *child;
        int ret;

        parent = from;

down:
        ret = (*down)(parent, data);
        if (ret)
                goto out;
        list_for_each_entry_rcu(child, &parent->children, siblings) {
                parent = child;
                goto down;

up:
                continue;
        }
        ret = (*up)(parent, data);
        if (ret || parent == from)
                goto out;

        child = parent;
        parent = parent->parent;
        if (parent)
                goto up;
out:
        return ret;
}

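Walks the task groups at and below the requested task group, calling the down function when descending into a group and the up function when leaving it. If a called function returns an error partway through, that value is returned and the walk stops. Returns 0 if there is no error.

Code lines 16~18: calls the down() function passed as an argument, starting from the upper task group.
- throttle_cfs_rq() -> calls tg_throttle_down().
- unthrottle_cfs_rq() -> calls tg_nop(), which does nothing.
Code line 19: loops over the children of parent from left to right. If there are no children, the loop is left.
Code lines 20~21: moves to the down label with the selected child.
Code lines 23~25: continues with the remaining children.
Code lines 26~28: calls the up() function passed as an argument, starting from the lower task group.
- throttle_cfs_rq() -> calls tg_nop(), which does nothing.
- unthrottle_cfs_rq() -> calls tg_unthrottle_up().
Code lines 30~33: selects the parent of parent and, if there is one, moves to the up label.

The following figure shows walk_tg_tree_from() traversing the tree from the 1st down call through the 12th up call.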
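The visiting order can be reproduced with a small userspace sketch. It uses plain recursion rather than the goto-based iteration in the kernel listing, and `struct node`, `walk_from()`, `struct trace`, `note_down()` and `note_up()` are all hypothetical names for this example; only the down-on-entry, up-on-exit, stop-on-error behaviour is meant to match.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4

struct node {                        /* stand-in for struct task_group */
        int id;
        struct node *child[MAX_CHILDREN];
        int nr_children;
};

typedef int (*visitor)(struct node *n, void *data);

/* Call down() on entry and up() on final exit; stop at the first error. */
static int walk_from(struct node *n, visitor down, visitor up, void *data)
{
        int ret, i;

        assert(n != NULL);
        ret = down(n, data);
        if (ret)
                return ret;
        for (i = 0; i < n->nr_children; i++) {
                ret = walk_from(n->child[i], down, up, data);
                if (ret)
                        return ret;
        }
        return up(n, data);
}

struct trace { int ids[16]; int len; };

/* Record +id on the way down and -id on the way up. */
static int note_down(struct node *n, void *data)
{
        struct trace *t = data;

        assert(t->len < 16);
        t->ids[t->len++] = n->id;
        return 0;
}

static int note_up(struct node *n, void *data)
{
        struct trace *t = data;

        assert(t->len < 16);
        t->ids[t->len++] = -n->id;
        return 0;
}
```

For a root 1 with children 2 and 3, where 2 has a child 4, the recorded sequence is 1, 2, 4, -4, -2, 3, -3, -1.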

CFS Throttling

Checking whether throttling is needed

check_enqueue_throttle()

kernel/sched/fair.c

/*
 * When a group wakes up we want to make sure that its quota is not already
 * expired/exceeded, otherwise it may be allowed to steal additional ticks of
 * runtime as update_curr() throttling can not not trigger until it's on-rq.
 */

static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
{
        if (!cfs_bandwidth_used())
                return;

        /* an active group must be handled by the update_curr()->put() path */
        if (!cfs_rq->runtime_enabled || cfs_rq->curr)
                return;

        /* ensure the group is not already throttled */
        if (cfs_rq_throttled(cfs_rq))
                return;

        /* update runtime allocation */
        account_cfs_rq_runtime(cfs_rq, 0);
        if (cfs_rq->runtime_remaining <= 0)
                throttle_cfs_rq(cfs_rq);
}

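If the cfs runqueue of the current group has used up its quota and has no runtime left, it is throttled.

Code lines 3~4: returns if cfs bandwidth is not in use.
Code lines 7~8: returns if the quota is unlimited or a task is currently running on the cfs runqueue.
Code lines 11~12: returns if the cfs runqueue is already throttled.
Code lines 15~17: updates the runtime allocation of the cfs runqueue and throttles it if no runtime remains.

The following figure shows the call relationships below check_enqueue_throttle().

cfs runqueue throttle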

throttle_cfs_rq()

kernel/sched/fair.c

static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
{
        struct rq *rq = rq_of(cfs_rq);
        struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
        struct sched_entity *se;
        long task_delta, dequeue = 1;

        se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];

        /* freeze hierarchy runnable averages while throttled */
        rcu_read_lock();
        walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
        rcu_read_unlock();

        task_delta = cfs_rq->h_nr_running;
        for_each_sched_entity(se) {
                struct cfs_rq *qcfs_rq = cfs_rq_of(se);
                /* throttled entity or throttle-on-deactivate */
                if (!se->on_rq)
                        break;

                if (dequeue)
                        dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);
                qcfs_rq->h_nr_running -= task_delta;

                if (qcfs_rq->load.weight)
                        dequeue = 0;
        }

        if (!se)
                sub_nr_running(rq, task_delta);

        cfs_rq->throttled = 1;
        cfs_rq->throttled_clock = rq_clock(rq);
        raw_spin_lock(&cfs_b->lock);
        /*
         * Add to the _head_ of the list, so that an already-started
         * distribute_cfs_runtime will not see us
         */
        list_add_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
        if (!cfs_b->timer_active)
                __start_cfs_bandwidth(cfs_b, false);
        raw_spin_unlock(&cfs_b->lock);
}

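Throttles the requested cfs runqueue.

Code lines 3~4: gets the runqueue corresponding to the requested cfs runqueue and the cfs bandwidth of its task group.
Code line 8: gets the task group's scheduling entity corresponding to the requested cfs runqueue.
Code lines 11~13: walks the task group of the requested cfs runqueue and all task groups below it, freezing the hierarchical runnable averages while throttled.
- Increments the throttle counter of each task group's cfs runqueue and, on the first throttle, stores the runqueue's clock_task in the cfs runqueue's throttled_clock_task.
Code line 15: reads the number of active tasks running at and below the requested cfs runqueue.
Code lines 16~20: iterates from the scheduling entity of the requested cfs runqueue up to the top-level scheduling entity, stopping if the entity is not on a runqueue.
Code lines 22~24: if dequeue is still set, dequeues the current scheduling entity so that it sleeps. Then decreases the cfs runqueue's h_nr_running by task_delta.
Code lines 26~27: if the load weight of the cfs runqueue holding the current scheduling entity is not 0, sets dequeue to 0 so that parent scheduling entities are not dequeued for the rest of the iteration.
Code lines 30~31: if the iteration was never cut short, decreases the runqueue's active task count by task_delta.
- sub_nr_running()
  - rq->nr_running -= count
Code lines 33~34: marks the cfs runqueue as throttled and records the time at which it was throttled.
Code line 40: adds the throttled cfs runqueue to the throttled_cfs_rq list of the cfs bandwidth.
Code lines 41~42: if the timer is not active, starts it so that the cfs bandwidth feature operates.

The following figure shows what is processed when the requested cfs runqueue is throttled.

Throttling through a task group walk-down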

tg_throttle_down()

kernel/sched/fair.c

static int tg_throttle_down(struct task_group *tg, void *data)
{
        struct rq *rq = data;
        struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];

        /* group is entering throttled state, stop time */
        if (!cfs_rq->throttle_count) {
                cfs_rq->throttled_clock_task = rq_clock_task(rq);
                list_del_leaf_cfs_rq(cfs_rq);
        }
        cfs_rq->throttle_count++;

        return 0;
}

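Increments the throttle counter of the cfs runqueue for the requested task group and cpu, putting it in the throttled state. On the first throttle, the runqueue's clock_task is stored in the cfs runqueue's throttled_clock_task.

Code lines 3~4: takes the cpu number of the runqueue passed as the second argument and looks up the requested task group's cfs runqueue for that cpu.
Code lines 7~10: when entering throttling for the first time, records the throttle start time and removes this cfs runqueue from the leaf_cfs_rq list if it is on it.
Code line 11: increments the throttle counter of the cfs runqueue by 1.
Code line 13: returns 0 for success.

cfs runqueue unthrottle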

unthrottle_cfs_rq()

kernel/sched/fair.c

void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
{
        struct rq *rq = rq_of(cfs_rq);
        struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
        struct sched_entity *se;
        int enqueue = 1;
        long task_delta, idle_task_delta;

        se = cfs_rq->tg->se[cpu_of(rq)];

        cfs_rq->throttled = 0;

        update_rq_clock(rq);

        raw_spin_lock(&cfs_b->lock);
        cfs_b->throttled_time += rq_clock(rq) - cfs_rq->throttled_clock;
        list_del_rcu(&cfs_rq->throttled_list);
        raw_spin_unlock(&cfs_b->lock);

        /* update hierarchical throttle state */
        walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq);

        if (!cfs_rq->load.weight)
                return;

        task_delta = cfs_rq->h_nr_running;
        idle_task_delta = cfs_rq->idle_h_nr_running;
        for_each_sched_entity(se) {
                if (se->on_rq)
                        enqueue = 0;

                cfs_rq = cfs_rq_of(se);
                if (enqueue)
                        enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
                cfs_rq->h_nr_running += task_delta;
                cfs_rq->idle_h_nr_running += idle_task_delta;

                if (cfs_rq_throttled(cfs_rq))
                        break;
        }

        assert_list_leaf_cfs_rq(rq);

        if (!se)
                add_nr_running(rq, task_delta);

        /* determine whether we need to wake up potentially idle cpu */
        if (rq->curr == rq->idle && rq->cfs.nr_running)
                resched_curr(rq);
}

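Unthrottles the requested cfs runqueue.

Code lines 3~4: gets the runqueue corresponding to the requested cfs runqueue and the cfs bandwidth of its task group.
Code line 9: gets the entity that represents the cfs runqueue.
Code line 11: sets throttled of the cfs runqueue to 0 to mark throttling as released.
Code lines 13~18: updates the runqueue clock and, holding the cfs bandwidth lock, updates the accumulated throttled time. Then removes the cfs runqueue from the throttled list.
Code line 21: walks the task group and the groups below it, asking each local pool to be unthrottled in the bottom-up direction.
Code lines 23~24: if the load weight of the current local pool is 0, the parent is not affected, so no further processing is done and the function returns.
Code lines 26~27: task_delta receives the number of active tasks running at and below the requested cfs runqueue. idle_task_delta receives the number of cfs tasks using the idle policy.
Code lines 28~30: iterates from the scheduling entity of the requested cfs runqueue up to the top-level scheduling entity; if the entity is already on a runqueue, sets enqueue to 0 so that no further enqueue is done.
Code lines 32~34: if enqueue is still needed, enqueues the entity.
Code lines 35~36: adds the active task count and the idle task count to each cfs runqueue visited.
Code lines 38~39: if the cfs runqueue is throttled, leaves the loop.
Code lines 44~45: if the loop ran all the way to the top-level scheduling entity (the one attached to the root task group), also adds the active task count to the runqueue's nr_running.
Code lines 48~49: if the current task is the idle task and there are runnable scheduling entities on the top-level cfs runqueue, sets the reschedule request flag.

The following figure shows how the various clocks behave.

Throttled time is likewise tracked in two forms: time accumulated against rq->clock, and task_time accumulated against rq->clock_task.
rq->clock_task is rq->clock with the irq processing time excluded.
However, if the CONFIG_IRQ_TIME_ACCOUNTING kernel option is not used, irq time is not measured, and in that case rq->clock and rq->clock_task are identical.

Unthrottling through a task group walk-up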

tg_unthrottle_up()

kernel/sched/fair.c

static int tg_unthrottle_up(struct task_group *tg, void *data)
{
        struct rq *rq = data; 
        struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];

        cfs_rq->throttle_count--;
        if (!cfs_rq->throttle_count) {
                /* adjust cfs_rq_clock_task() */
                cfs_rq->throttled_clock_task_time += rq_clock_task(rq) -
                                             cfs_rq->throttled_clock_task;

                /* Add cfs_rq with already running entity in the list */
                if (cfs_rq->nr_running >= 1)
                        list_add_leaf_cfs_rq(cfs_rq);
        }

        return 0;
}

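Decrements the throttle counter of the requested task group's cfs runqueue. When the counter reaches 0, the cfs runqueue leaves the throttled state.

When the counter reaches 0, the time spent throttled (the runqueue's current clock_task minus the cfs runqueue's throttled_clock_task) is added to throttled_clock_task_time.

Code lines 3~4: takes the cpu number of the runqueue passed as the second argument and looks up the requested task group's cfs runqueue for that cpu.
Code line 6: decrements the throttle counter of the cfs runqueue by 1.
Code lines 7~15: when the throttle counter reaches 0, updates the total throttled time. In addition, if the cfs runqueue has one or more runnable entities, it is added to the leaf cfs runqueue list.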
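The interplay of the throttle counter and the two clock fields can be sketched in userspace as below. The struct and function names are hypothetical, and the leaf list handling is left out; the sketch only shows that nested throttles are counted and that the throttled time is accumulated once, when the counter drops back to 0.

```c
#include <assert.h>
#include <stdint.h>

struct throttle_state {              /* stand-in for the cfs_rq fields used */
        int      throttle_count;
        uint64_t throttled_clock_task;       /* when throttling began */
        uint64_t throttled_clock_task_time;  /* total time spent throttled */
};

/* Walk-down side: remember the start time only on the first throttle. */
static void throttle_down(struct throttle_state *s, uint64_t now)
{
        if (!s->throttle_count)
                s->throttled_clock_task = now;
        s->throttle_count++;
}

/* Walk-up side: account the throttled span when the last throttle ends. */
static void unthrottle_up(struct throttle_state *s, uint64_t now)
{
        assert(s->throttle_count > 0);
        s->throttle_count--;
        if (!s->throttle_count)
                s->throttled_clock_task_time += now - s->throttled_clock_task;
}
```

If a queue is throttled at time 100, throttled again by an ancestor at 150, and the two are released at 200 and 300, nothing is accounted at 200 and the full span of 200 is accounted at 300.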

CFS runtime: minimum slice allocation

account_cfs_rq_runtime()

kernel/sched/fair.c

static __always_inline
void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
{
        if (!cfs_bandwidth_used() || !cfs_rq->runtime_enabled)
                return;

        __account_cfs_rq_runtime(cfs_rq, delta_exec);
}

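When the local runtime has been fully consumed, borrows the minimum slice (default=5 ms) plus the overrun runtime from the global runtime, so that the local runtime ends up at one slice. If enough local runtime could not be assigned, the reschedule request flag is set.

Reference: sched: Accumulate per-cfs_rq cpu usage and charge against bandwidth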

__account_cfs_rq_runtime()

kernel/sched/fair.c

static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
{
        /* dock delta_exec before expiring quota (as it could span periods) */
        cfs_rq->runtime_remaining -= delta_exec;

        if (likely(cfs_rq->runtime_remaining > 0))
                return;

        if (cfs_rq->throttled)
                return;
        /*
         * if we're unable to extend our runtime we resched so that the active
         * hierarchy can be throttled
         */
        if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
                resched_curr(rq_of(cfs_rq));
}

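When the local runtime has been fully consumed, borrows the minimum slice (default=5 ms) plus the overrun runtime from the global runtime, so that the local runtime ends up at one slice. If enough local runtime could not be assigned, the reschedule request flag is set.

Code line 4: this routine is reached through update_curr(), for example on every scheduler tick, and deducts the time that was executed from the local runtime.
Code lines 6~7: returns if local runtime still remains.
Code lines 9~10: returns if the cfs runqueue is already throttled.
Code lines 15~16: if local runtime could not be assigned and, as is most likely, a task is running on the cfs runqueue, sets the reschedule request flag so that cfs throttling begins.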

assign_cfs_rq_runtime()

kernel/sched/fair.c

/* returns 0 on failure to allocate runtime */
static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
{
        struct task_group *tg = cfs_rq->tg;
        struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
        u64 amount = 0, min_amount;

        /* note: this is a positive sum as runtime_remaining <= 0 */
        min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;

        raw_spin_lock(&cfs_b->lock);
        if (cfs_b->quota == RUNTIME_INF)
                amount = min_amount;
        else {
                start_cfs_bandwidth(cfs_b);

                if (cfs_b->runtime > 0) {
                        amount = min(cfs_b->runtime, min_amount);
                        cfs_b->runtime -= amount;
                        cfs_b->idle = 0;
                }
        }
        raw_spin_unlock(&cfs_b->lock);

        cfs_rq->runtime_remaining += amount;

        return cfs_rq->runtime_remaining > 0;
}

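When the local runtime has been fully consumed, borrows the minimum slice (default=5 ms) plus the overrun runtime from the global runtime and adds it to the local runtime. Returns 1 if the local runtime was replenished and 0 if it is still insufficient.

Code line 9: with the local runtime exhausted, determines the amount of runtime to take from the global runtime pool.
- This function is entered only when the remaining local runtime is exhausted (0) or overrun (negative).
- The runtime to borrow from the global runtime is the slice (default 5 ms) plus the overrun amount.
  - runtime to borrow = 5 ms - runtime_remaining = 5 ms + overrun (since runtime_remaining <= 0)
Code lines 12~13: if the quota is unlimited, the amount to borrow is the value computed above, as is.
Code lines 14~22: if a quota is set, starts the cfs bandwidth period timer. Then the runtime computed above is deducted from the global runtime. The amount is capped so that the global runtime does not fall below 0.
Code line 25: adds the amount taken from the global runtime to the remaining local runtime.
Code line 27: returns whether the local pool has runtime remaining.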
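The borrowing arithmetic can be tried out with the following userspace sketch. `struct pool` and `borrow_runtime()` are hypothetical stand-ins, locking and the timer are left out, and the unlimited-quota branch is not modelled.

```c
#include <assert.h>
#include <stdint.h>

#define SLICE_NS 5000000LL                 /* default slice: 5 ms */

struct pool { int64_t global_runtime; };   /* stand-in for cfs_bandwidth */

/* Tops up *local from the global pool; returns 1 if *local ends up > 0. */
static int borrow_runtime(struct pool *p, int64_t *local)
{
        int64_t want, amount;

        assert(*local <= 0);               /* called only once local is used up */
        want = SLICE_NS - *local;          /* one slice plus the overrun */
        amount = p->global_runtime < want ? p->global_runtime : want;
        if (amount < 0)
                amount = 0;                /* never take from an empty pool */
        p->global_runtime -= amount;
        *local += amount;
        return *local > 0;
}
```

With 20 ms in the global pool and a local overrun of 1 ms, 6 ms are borrowed and the local runtime ends at exactly 5 ms. With only 0.5 ms left globally, the local runtime stays negative and the function reports failure, which is the case where the caller requests a reschedule.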

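The following figure shows a scheduler tick deducting the delta execution time from the local runtime pool and, when no local runtime is left to deduct, borrowing a slice's worth of runtime from the global runtime.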

sched_cfs_bandwidth_slice()

kernel/sched/fair.c

static inline u64 sched_cfs_bandwidth_slice(void)
{
        return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
}

Returns the cfs bandwidth slice value in nanoseconds.

/*
 * Amount of runtime to allocate from global (tg) to local (per-cfs_rq) pool
 * each time a cfs_rq requests quota.
 *
 * Note: in the case that the slice exceeds the runtime remaining (either due
 * to consumption or the quota being specified to be smaller than the slice)
 * we will always only issue the remaining available time.
 *
 * default: 5 msec, units: microseconds
  */
unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;

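The amount of runtime that can be handed from the task group's global pool to a local cfs runqueue pool each time a cfs runqueue requests quota.
“/proc/sys/kernel/sched_cfs_bandwidth_slice_us” -> the default value is 5000 (us)

Two CFS bandwidth timers

Main functions of the period timer
- Expires and is called every period.
- Refills the global runtime.
- Distributes runtime first to throttled local cfs runqueues to cover what they overran.
Main functions of the slack timer
- Expires and is called 5 ms after a task dequeue.
- Leftover local runtime that was returned to the global runtime is distributed first to throttled local cfs runqueues to cover what they overran.

The following figure shows the function call relationships of the two cfs bandwidth timers.

Distribution at each regular period

Through the period timer, at every period the global runtime is refilled, and then the runtime shortfall of the throttled local cfs runqueues is deducted from it and distributed to them first.

CFS Period Timer – (1) Activation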

start_cfs_bandwidth()

kernel/sched/fair.c

void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
{
        lockdep_assert_held(&cfs_b->lock);

        if (cfs_b->period_active)
                return;

        cfs_b->period_active = 1;
        hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
        hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
}

Starts the period timer of the global pool.

Code lines 5~6: returns if the period timer is already running.
Code line 8: marks the period timer as active.
Code lines 9~10: starts the period timer.

CFS Period Timer – (2) Called on expiry

sched_cfs_period_timer()

kernel/sched/fair.c

static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
{
        struct cfs_bandwidth *cfs_b =
                container_of(timer, struct cfs_bandwidth, period_timer);
        unsigned long flags;
        int overrun;
        int idle = 0;
        int count = 0;

        raw_spin_lock_irqsave(&cfs_b->lock, flags);
        for (;;) {
                overrun = hrtimer_forward_now(timer, cfs_b->period);
                if (!overrun)
                        break;

                if (++count > 3) {
                        u64 new, old = ktime_to_ns(cfs_b->period);

                        /*
                         * Grow period by a factor of 2 to avoid losing precision.
                         * Precision loss in the quota/period ratio can cause __cfs_schedulable
                         * to fail.
                         */
                        new = old * 2;
                        if (new < max_cfs_quota_period) {
                                cfs_b->period = ns_to_ktime(new);
                                cfs_b->quota *= 2;

                                pr_warn_ratelimited(
        "cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us = %lld, cfs_quota_us = %lld)\n",
                                        smp_processor_id(),
                                        div_u64(new, NSEC_PER_USEC),
                                        div_u64(cfs_b->quota, NSEC_PER_USEC));
                        } else {
                                pr_warn_ratelimited(
        "cfs_period_timer[cpu%d]: period too short, but cannot scale up without losing precision (cfs_period_us = %lld, cfs_quota_us = %lld)\n",
                                        smp_processor_id(),
                                        div_u64(old, NSEC_PER_USEC),
                                        div_u64(cfs_b->quota, NSEC_PER_USEC));
                        }

                        /* reset count so we don't come right back in here */
                        count = 0;
                }

                idle = do_sched_cfs_period_timer(cfs_b, overrun, flags);
        }
        if (idle)
                cfs_b->period_active = 0;
        raw_spin_unlock_irqrestore(&cfs_b->lock, flags);

        return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
}

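Called when the period timer expires. Refills the global runtime with the quota of the task group linked to the timer and performs the additional work that is needed.

Code lines 3~4: gets the cfs bandwidth of the task group.
Code lines 11~14: forwards the period timer by one interval. If there was no overrun, that is, nothing to forward, the loop is left.
- Reference: Timer -2- (HRTimer) | 문c
Code lines 16~44: if the period and quota are so small that overruns keep occurring, then each time the loop has iterated more than 3 times the period and quota are doubled, as long as the new period stays below 1 s, and a warning message is printed.
Code line 46: performs the work for the period timer expiry and obtains whether the period timer should stop (idle=1).
- This refills the global runtime with the task group's quota, distributes runtime to throttled cfs runqueues, and so on.
Code lines 48~49: if the idle state is reached, marks the period timer as inactive.
Code line 52: returns whether the period timer is to be restarted.
- HRTIMER_RESTART=1
- HRTIMER_NORESTART=0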
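The doubling step applied when the period is too short can be sketched as follows. `scale_up_period()` is a hypothetical helper covering only the arithmetic of the listing; the warning messages and the loop counter are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PERIOD_NS 1000000000ULL   /* 1 s, mirrors max_cfs_quota_period */

/*
 * Doubles period and quota together so that the quota/period ratio is
 * preserved. Returns 1 if scaled, 0 if doubling would reach the 1 s limit.
 */
static int scale_up_period(uint64_t *period_ns, uint64_t *quota_ns)
{
        uint64_t doubled;

        assert(period_ns != NULL && quota_ns != NULL);
        doubled = *period_ns * 2;
        if (doubled >= MAX_PERIOD_NS)
                return 0;
        *period_ns = doubled;
        *quota_ns *= 2;
        return 1;
}
```

A 1 ms period with a 0.5 ms quota becomes 2 ms and 1 ms. A 500 ms period is left untouched, because doubling it would give exactly 1 s, which is not below the limit.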

do_sched_cfs_period_timer()

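Refills the global runtime with the task group's quota. Then, for the cfs runqueues that were throttled in the previous period, runtime is deducted from the global runtime and distributed to them first, and they are unthrottled.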

kernel/sched/fair.c

/*
 * Responsible for refilling a task_group's bandwidth and unthrottling its
 * cfs_rqs as appropriate. If there has been no activity within the last
 * period the timer is deactivated until scheduling resumes; cfs_b->idle is
 * used to track this state.
 */

static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
{
        u64 runtime;
        int throttled;

        /* no need to continue the timer with no bandwidth constraint */
        if (cfs_b->quota == RUNTIME_INF)
                goto out_deactivate;

        throttled = !list_empty(&cfs_b->throttled_cfs_rq);
        cfs_b->nr_periods += overrun;

        /*
         * idle depends on !throttled (for the case of a large deficit), and if
         * we're going inactive then everything else can be deferred
         */
        if (cfs_b->idle && !throttled)
                goto out_deactivate;

        __refill_cfs_bandwidth_runtime(cfs_b);

        if (!throttled) {
                /* mark as potentially idle for the upcoming period */
                cfs_b->idle = 1;
                return 0;
        }

        /* account preceding periods in which throttling occurred */
        cfs_b->nr_throttled += overrun;

        /*
         * This check is repeated as we are holding onto the new bandwidth while
         * we unthrottle. This can potentially race with an unthrottled group
         * trying to acquire new bandwidth from the global pool. This can result
         * in us over-using our runtime if it is all used during this loop, but
         * only by limited amounts in that extreme case.
         */
        while (throttled && cfs_b->runtime > 0 && !cfs_b->distribute_running) {
                runtime = cfs_b->runtime;
                cfs_b->distribute_running = 1;
                raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
                /* we can't nest cfs_b->lock while distributing bandwidth */
                runtime = distribute_cfs_runtime(cfs_b, runtime);
                raw_spin_lock_irqsave(&cfs_b->lock, flags);

                cfs_b->distribute_running = 0;
                throttled = !list_empty(&cfs_b->throttled_cfs_rq);

                lsub_positive(&cfs_b->runtime, runtime);
        }

        /*
         * While we are ensured activity in the period following an
         * unthrottle, this also covers the case in which the new bandwidth is
         * insufficient to cover the existing bandwidth deficit.  (Forcing the
         * timer to remain active while there are any throttled entities.)
         */
        cfs_b->idle = 0;

        return 0;

out_deactivate:
        return 1;
}

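Code lines 7~8: if no cfs quota is set on the task group, cfs bandwidth is not configured, so jumps to the out_deactivate label.
Code line 10: stores in throttled whether the task group has any throttled cfs runqueue.
Code line 11: increases nr_periods by the overrun count.
Code lines 17~18: if the pool is idle and there is no throttled cfs runqueue, jumps to the out_deactivate label.
Code line 20: refills the global runtime to the configured quota.
Code lines 22~26: if there is no throttled cfs runqueue, sets idle=1 and returns 0 so that the period timer is restarted.
Code line 29: increases the throttled count (nr_throttled) by overrun.
Code lines 38~50: repeats while there are throttled cfs runqueues, global runtime remains, and no distribution is running elsewhere, distributing as follows.
- Distributes the missing runtime to the throttled cfs runqueues in order and unthrottles them.
- The distributed runtime is deducted from the remaining global runtime.
Code lines 58~60: sets idle of the global pool to 0 and returns 0 so that the period timer is restarted.
Code lines 62~63: the out_deactivate: label. Returns 1 so that the period timer is not restarted.

The following figure shows the process of deducting the runtime shortfall of throttled cfs runqueues from the global runtime and distributing it to them first.

Global runtime refill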

__refill_cfs_bandwidth_runtime()

kernel/sched/fair.c

/*
 * Replenish runtime according to assigned quota. We use sched_clock_cpu
 * directly instead of rq->clock to avoid adding additional synchronization
 * around rq->lock.
 *
 * requires cfs_b->lock
 */

void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
{
        if (cfs_b->quota != RUNTIME_INF)
                cfs_b->runtime = cfs_b->quota;
}

Refills the runtime of the global pool to the quota.

As the following figure shows, this function is called each time the period timer expires.

Distributing overrun runtime to throttled cfs runqueues

distribute_cfs_runtime()

kernel/sched/fair.c

static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
{
        struct cfs_rq *cfs_rq;
        u64 runtime;
        u64 starting_runtime = remaining;

        rcu_read_lock();
        list_for_each_entry_rcu(cfs_rq, &cfs_b->throttled_cfs_rq,
                                throttled_list) {
                struct rq *rq = rq_of(cfs_rq);
                struct rq_flags rf;

                rq_lock_irqsave(rq, &rf);
                if (!cfs_rq_throttled(cfs_rq))
                        goto next;

                /* By the above check, this should never be true */
                SCHED_WARN_ON(cfs_rq->runtime_remaining > 0);

                runtime = -cfs_rq->runtime_remaining + 1;
                if (runtime > remaining)
                        runtime = remaining;
                remaining -= runtime;

                cfs_rq->runtime_remaining += runtime;

                /* we check whether we're throttled above */
                if (cfs_rq->runtime_remaining > 0)
                        unthrottle_cfs_rq(cfs_rq);

next:
                rq_unlock_irqrestore(rq, &rf);

                if (!remaining)
                        break;
        }
        rcu_read_unlock();

        return starting_runtime - remaining;
}

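As long as global runtime remains, distributes the overrun runtime to the throttled cfs runqueues in order and unthrottles them.

Code lines 8~15: iterates over the task group's list of throttled cfs runqueues; if a cfs runqueue that is not throttled is found, jumps to the next label to skip it.
Code lines 20~25: deducts the runtime shortfall of the throttled cfs runqueue + 1 from the global runtime and distributes it. The amount is capped so that the global runtime does not fall below 0.
- A throttled cfs runqueue is one whose remaining runtime has reached 0 or, because it overran, is negative.
Code lines 28~29: if the remaining runtime of the cfs runqueue is greater than 0, unthrottles it.
Code lines 31~35: the next: label. If remaining is 0, the loop is left.
Code line 39: returns the amount of runtime that was distributed.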

다음 그림은 distribute_cfs_runtime() 함수의 동작 시 글로벌 런타임을 기존 스로틀된 cfs 런큐의 초과 소모한 런타임 만큼을 우선 분배하고 언스로틀하는 과정을 보여준다.
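The distribution loop can be tried out in user space. The following is a minimal sketch, not kernel code: struct toy_cfs_rq and toy_distribute() are hypothetical names, locking and RCU are left out, and clearing the throttled flag stands in for unthrottle_cfs_rq().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a throttled cfs_rq; only the fields we need. */
struct toy_cfs_rq {
    int64_t runtime_remaining; /* 0 or negative while throttled */
    int throttled;
};

/*
 * Give each throttled runqueue its deficit + 1 ns, in list order, until
 * `remaining` runs out. Returns how much runtime was handed out.
 */
static uint64_t toy_distribute(struct toy_cfs_rq *rqs, size_t n,
                               uint64_t remaining)
{
    uint64_t starting = remaining;

    for (size_t i = 0; i < n && remaining; i++) {
        struct toy_cfs_rq *rq = &rqs[i];
        uint64_t want;

        if (!rq->throttled)
            continue; /* same effect as "goto next" in the kernel loop */

        want = (uint64_t)(-rq->runtime_remaining) + 1;
        if (want > remaining)
            want = remaining; /* never hand out more than is left */
        remaining -= want;
        rq->runtime_remaining += (int64_t)want;

        if (rq->runtime_remaining > 0)
            rq->throttled = 0; /* models unthrottle_cfs_rq() */
    }
    return starting - remaining;
}
```

With 8 ns to distribute and deficits of 3 ns and 10 ns, the first runqueue is fully covered and unthrottled, while the second receives only the 4 ns that are left and stays throttled.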

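Returning leftover runtime when an entity is dequeued

When an entity is dequeued, all of its unused local runtime except 1 ms is returned to the global runtime. If at least 7 ms remain before the period timer expires, a 5 ms slack timer is started so that throttled cfs runqueues can be woken up by a distribution pass. When the slack timer expires, the returned runtime is distributed to the local cfs runqueues that are still throttled.

Why 7 ms is required
- If the time left until the period timer expires is shorter than the sum of the two intervals below, the period timer is about to refill the global runtime anyway, and every local pool will be able to draw a fresh allocation. Close to the period timer's expiry there is therefore nothing to gain from redistributing the returned local runtime.
  - 5 ms for the slack timer
  - a 2 ms margin before the period timer expires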
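The 7 ms rule comes down to a single comparison. The sketch below assumes the default values quoted in the text (5 ms slack period, 2 ms margin); toy_should_start_slack() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_MSEC        1000000ULL
#define TOY_SLACK_PERIOD_NS  (5 * NSEC_PER_MSEC) /* slack timer delay */
#define TOY_MIN_EXPIRE_NS    (2 * NSEC_PER_MSEC) /* margin before the refill */

/* True when enough time is left before the period timer for slack to pay off. */
static bool toy_should_start_slack(uint64_t ns_until_refill)
{
    uint64_t min_left = TOY_SLACK_PERIOD_NS + TOY_MIN_EXPIRE_NS; /* 7 ms */

    return ns_until_refill >= min_left;
}
```

For example, with 50 ms left until the refill the slack timer is worth starting, while with 6 ms left it is skipped because the period timer will refill the pools first.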

return_cfs_rq_runtime()

kernel/sched/fair.c

static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
{
        if (!cfs_bandwidth_used())
                return;

        if (!cfs_rq->runtime_enabled || cfs_rq->nr_running)
                return;

        __return_cfs_rq_runtime(cfs_rq);
}

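When the cfs scheduler dequeues a schedule entity and the cfs runqueue has no runnable entity left, this function collects the remaining local runtime and returns it to the global pool. It then starts the 5 ms slack timer so that the returned runtime can be handed to other throttled cfs runqueues.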

__return_cfs_rq_runtime()

kernel/sched/fair.c

/* we know any runtime found here is valid as update_curr() precedes return */
static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
{
        struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
        s64 slack_runtime = cfs_rq->runtime_remaining - min_cfs_rq_runtime;

        if (slack_runtime <= 0)
                return;

        raw_spin_lock(&cfs_b->lock);
        if (cfs_b->quota != RUNTIME_INF) {
                cfs_b->runtime += slack_runtime;

                /* we are under rq->lock, defer unthrottling using a timer */
                if (cfs_b->runtime > sched_cfs_bandwidth_slice() &&
                    !list_empty(&cfs_b->throttled_cfs_rq))
                        start_cfs_slack_bandwidth(cfs_b);
        }
        raw_spin_unlock(&cfs_b->lock);

        /* even if it's not valid for return we don't want to try again */
        cfs_rq->runtime_remaining -= slack_runtime;
}

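In code lines 5~8, computes how much local runtime can be returned to the global pool. If the amount is 0 or less, the function returns.
- runtime to return = remaining runtime – minimum runtime (1 ms)
In code lines 11~12, if a quota is set, adds the returned runtime to the global pool.
In code lines 15~17, starts the slack timer if the global pool's runtime is larger than one slice (5 ms by default) and some local pool is throttled.
In code line 22, subtracts the returned amount from the local remaining runtime.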
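The arithmetic of the return step can be sketched in a few lines. This is an illustrative user-space model that assumes the 1 ms minimum mentioned above; toy_return_runtime() is a hypothetical name, and the lock and the slack timer are left out.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MIN_CFS_RQ_RUNTIME_NS 1000000LL /* 1 ms kept in the local pool */

/*
 * Move everything above the 1 ms minimum from the local pool to the
 * global pool. Returns the amount moved (0 if there is nothing to return).
 */
static int64_t toy_return_runtime(int64_t *local_remaining,
                                  uint64_t *global_runtime)
{
    int64_t slack = *local_remaining - TOY_MIN_CFS_RQ_RUNTIME_NS;

    if (slack <= 0)
        return 0;

    *global_runtime += (uint64_t)slack;
    *local_remaining -= slack;
    return slack;
}
```

A local pool holding 4 ms returns 3 ms and keeps 1 ms; a local pool holding 0.5 ms returns nothing.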

CFS Slack Timer – (1) Activation

start_cfs_slack_bandwidth()

kernel/sched/fair.c

static void start_cfs_slack_bandwidth(struct cfs_bandwidth *cfs_b)
{
        u64 min_left = cfs_bandwidth_slack_period + min_bandwidth_expiration;

        /* if there's a quota refresh soon don't bother with slack */
        if (runtime_refresh_within(cfs_b, min_left))
                return;

        /* don't push forwards an existing deferred unthrottle */
        if (cfs_b->slack_started)
                return;
        cfs_b->slack_started = true;
        hrtimer_start(&cfs_b->slack_timer,
                        ns_to_ktime(cfs_bandwidth_slack_period),
                        HRTIMER_MODE_REL);
}

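Starts the slack timer with the slack period (5 ms by default). The timer is not started if the period timer will expire within the window in which slack is pointless (7 ms by default).

In code lines 3~7, returns without arming the slack timer if the runtime refill (refresh) is about to happen.
- The time left until the refill must be at least min_left, the sum of the slack period and the minimum expiration margin (default 5 + 2 = 7 ms).
In code lines 10~11, returns if the slack timer has already been started.
In code line 12, records that the slack timer has been started.
In code lines 13~15, arms the slack timer to fire after cfs_bandwidth_slack_period (5 ms by default).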

runtime_refresh_within()

kernel/sched/fair.c

/*
 * Are we near the end of the current quota period?
 *
 * Requires cfs_b->lock for hrtimer_expires_remaining to be safe against the
 * hrtimer base being cleared by __hrtimer_start_range_ns. In the case of
 * migrate_hrtimers, base is never cleared, so we are fine.
 */

static int runtime_refresh_within(struct cfs_bandwidth *cfs_b, u64 min_expire)
{
        struct hrtimer *refresh_timer = &cfs_b->period_timer;
        u64 remaining;

        /* if the call-back is running a quota refresh is already occurring */
        if (hrtimer_callback_running(refresh_timer))
                return 1;

        /* is a quota refresh about to occur? */
        remaining = ktime_to_ns(hrtimer_expires_remaining(refresh_timer));
        if (remaining < min_expire)
                return 1;

        return 0;
}

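Checks whether the global runtime refresh is about to happen.

In code lines 3~8, returns 1 if the period hrtimer has expired and its callback is currently running. A refresh is already in progress, so there is no point in starting the slack timer.
In code lines 11~15, returns 1 if the time left until expiry is shorter than the min_expire argument, because the refresh is imminent. Otherwise returns 0 so that the slack timer can run.

CFS Slack Timer – (2) Called on expiry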

sched_cfs_slack_timer()

kernel/sched/fair.c

static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
{
        struct cfs_bandwidth *cfs_b =
                container_of(timer, struct cfs_bandwidth, slack_timer);
        do_sched_cfs_slack_timer(cfs_b);

        return HRTIMER_NORESTART;
}

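When the slack timer expires, runtime from the global pool is distributed first to cover the overrun of the throttled local pools. (The slack timer is 5 ms by default.)

Runtime that dequeued entities returned to the global pool is handed, via the slack timer, to local pools that could not get an allocation and were throttled. Each receives enough to cover its overrun and is then unthrottled.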

do_sched_cfs_slack_timer()

kernel/sched/fair.c

/*
 * This is done with a timer (instead of inline with bandwidth return) since
 * it's necessary to juggle rq->locks to unthrottle their respective cfs_rqs.
 */

static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
{
        u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
        unsigned long flags;

        /* confirm we're still not at a refresh boundary */
        raw_spin_lock_irqsave(&cfs_b->lock, flags);
        cfs_b->slack_started = false;
        if (cfs_b->distribute_running) {
                raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
                return;
        }

        if (runtime_refresh_within(cfs_b, min_bandwidth_expiration)) {
                raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
                return;
        }

        if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
                runtime = cfs_b->runtime;

        if (runtime)
                cfs_b->distribute_running = 1;

        raw_spin_unlock_irqrestore(&cfs_b->lock, flags);

        if (!runtime)
                return;

        runtime = distribute_cfs_runtime(cfs_b, runtime);

        raw_spin_lock_irqsave(&cfs_b->lock, flags);
        lsub_positive(&cfs_b->runtime, runtime);
        cfs_b->distribute_running = 0;
        raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
}

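When the slack timer expires, runtime from the global pool is distributed first to cover the overrun of the throttled local pools. (The slack timer is 5 ms by default.)

In code line 3, gets the cfs bandwidth slice.
- slice: the unit of runtime a local pool borrows from the global pool (default = 5 ms)
In code lines 7~12, with the cfs bandwidth lock held, sets slack_started to false to show that the slack timer is no longer pending. If a distribution is already in progress elsewhere (for example in the period timer on another cpu), releases the cfs bandwidth lock and returns.
In code lines 14~17, if the period timer will expire within the minimum expiration margin (2 ms by default), does nothing further, releases the cfs bandwidth lock and returns.
In code lines 19~20, if a quota is set and the global pool's runtime is larger than a slice, uses the whole global runtime as the amount to distribute.
In code lines 22~23, if there is runtime to distribute, sets distribute_running to 1 to show that a distribution is in progress.
In code line 25, releases the cfs bandwidth lock.
In code lines 27~28, returns if there is no runtime to distribute.
In code line 30, as long as global runtime remains, walks the throttled cfs runqueues in list order, covers the runtime each one has overrun, and unthrottles it.
In code lines 32~35, subtracts the distributed runtime from the global pool's runtime, clamping the result at 0 so that it never goes negative, and clears distribute_running.

DL Bandwidth initialization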

init_dl_bandwidth()

kernel/sched/deadline.c

void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime)
{
        raw_spin_lock_init(&dl_b->dl_runtime_lock);
        dl_b->dl_period = period;
        dl_b->dl_runtime = runtime;
}

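Initializes the dl bandwidth with the given period and runtime values.

In code line 4, stores the period argument in dl_period as is. The caller, sched_init(), passes global_rt_period(), which has already converted the us value to nanoseconds.
In code line 5, stores the runtime argument in dl_runtime as is. The caller passes global_rt_runtime(), which is likewise already in nanoseconds.

Structures

cfs_bandwidth structure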

kernel/sched/sched.h

struct cfs_bandwidth {
#ifdef CONFIG_CFS_BANDWIDTH
        raw_spinlock_t          lock;
        ktime_t                 period;
        u64                     quota;
        u64                     runtime;
        s64                     hierarchical_quota;

        u8                      idle;
        u8                      period_active;
        u8                      distribute_running;
        u8                      slack_started;
        struct hrtimer          period_timer;
        struct hrtimer          slack_timer;
        struct list_head        throttled_cfs_rq;

        /* Statistics: */
        int                     nr_periods;
        int                     nr_throttled;
        u64                     throttled_time;
#endif
};

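lock
- spin lock
period
- The period used to control the task group's cpu usage, stored in ns.
- Configurable from 1 ms to 1 s; the default is 100 ms.
- Set through “/sys/fs/cgroup/cpu/<task group>/cpu.cfs_period_us” and converted to ns when stored.
quota
- The runtime the task group may consume per period, stored in ns.
- Configurable from 1 ms upward.
- 0xffffffff_ffffffff or -1 means unlimited (no bandwidth limit).
- Set through “/sys/fs/cgroup/cpu/<task group>/cpu.cfs_quota_us” and converted to ns when stored.
runtime
- Global runtime (ns)
- Refilled (refreshed) to the quota on every period timer tick.
- Shrinks as local pools draw runtime from it in slices of 5 ms (default).
  - Can also grow when dequeued entities return runtime.
  - Also shrinks when the period timer or the slack timer distributes runtime to throttled local pools.
hierarchical_quota
- The quota-to-period ratio of the task group, kept as an integer and managed hierarchically.
- An integer value of 1M (1 << 20) is 100%, and 512K corresponds to 50%.
idle
- When idle (1), local pools are treated as not needing a runtime allocation, with the intent of letting throttling take effect in the next period.
- Cleared (0) when runtime is allocated to a local pool or a throttle occurs.
period_active
- Whether the period timer is running
distribute_running
- Set to 1 while runtime is being distributed to throttled local cfs runqueues.
slack_timer
- Slack timer (5 ms by default)
- When a task is dequeued with local runtime left over, that runtime is returned and the slack timer is started.
- When the slack timer fires, runtime is distributed to throttled local cfs runqueues.
throttled_cfs_rq
- List of throttled cfs runqueues
- See: sched: Add support for throttling group entities
nr_periods
- Number of periods that have elapsed
nr_throttled
- Number of times throttling occurred
throttled_time
- Total throttled time (including task and irq processing time)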
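The fixed-point ratio described for hierarchical_quota (1 << 20 = 100%) can be reproduced as follows. This is a sketch under that assumption only; toy_to_ratio() and the TOY_ constants are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_BW_SHIFT    20
#define TOY_BW_UNIT     (1ULL << TOY_BW_SHIFT) /* 1M == 100% */
#define TOY_RUNTIME_INF UINT64_MAX             /* "no limit" marker */

/* quota/period as a 20-bit fixed-point ratio. */
static uint64_t toy_to_ratio(uint64_t period_ns, uint64_t quota_ns)
{
    if (quota_ns == TOY_RUNTIME_INF)
        return TOY_BW_UNIT; /* unlimited counts as 100% */
    if (period_ns == 0)
        return 0;
    return (quota_ns << TOY_BW_SHIFT) / period_ns;
}
```

A 50 ms quota on a 100 ms period gives 524288 (512K), which is the 50% case mentioned in the field description.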

cfs_rq structure (bandwidth members only)

kernel/sched/sched.

struct cfs_rq {

        (...생략...)

#ifdef CONFIG_CFS_BANDWIDTH
        int runtime_enabled;
        s64 runtime_remaining;

        u64 throttled_clock;
        u64 throttled_clock_task;
        u64 throttled_clock_task_time;
        int throttled;
        int throttle_count;
        struct list_head throttled_list;
#endif /* CONFIG_CFS_BANDWIDTH */
};

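runtime_enabled
- Whether cfs bandwidth control is enabled for this cfs runqueue (that is, whether a quota is set)
runtime_remaining
- Local remaining runtime
- Set by drawing as much as needed from the global pool.
throttled_clock
- Time at which throttling began, taken from rq->clock, which includes irq processing time.
throttled_clock_task
- Time at which throttling began, taken from rq->clock_task, which counts task execution time only and excludes irq processing time.
throttled_clock_task_time
- Total throttled time (only the time tasks were throttled is accumulated, excluding irq processing time)
throttled
- Whether the cfs runqueue is currently throttled (1 = throttled)
throttle_count
- Hierarchical throttle count: how many task groups among this group and its ancestors are throttled
throttled_list
- List node used to add the cfs runqueue to the throttled_cfs_rq list of the task group's cfs_bandwidth

References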

Scheduler -1- (Basic) | 문c
Scheduler -2- (Global Cpu Load) | 문c
Scheduler -3- (PELT) | 문c
Scheduler -4- (Group Scheduling) | 문c
Scheduler -5- (Scheduler Core) | 문c
Scheduler -6- (CFS Scheduler) | 문c
Scheduler -7- (Preemption & Context Switch) | 문c
Scheduler -8- (CFS Bandwidth) | 문c – 현재 글
Scheduler -9- (RT Scheduler) | 문c
Scheduler -10- (Deadline Scheduler) | 문c
Scheduler -11- (Stop Scheduler) | 문c
Scheduler -12- (Idle Scheduler) | 문c
Scheduler -13- (Scheduling Domain 1) | 문c
Scheduler -14- (Scheduling Domain 2) | 문c
Scheduler -15- (Load Balance 1) | 문c
Scheduler -16- (Load Balance 2) | 문c
Scheduler -17- (Load Balance 3 NUMA) | 문c
Scheduler -18- (Load Balance 4 EAS) | 문c
Scheduler -19- (초기화) | 문c
PID 관리 | 문c
do_fork() | 문c
cpu_startup_entry() | 문c
런큐 로드 평균(cpu_load[]) – v4.0 | 문c
PELT(Per-Entity Load Tracking) – v4.0 | 문c

Scheduler -4- (Group Scheduling)

2017-05-18 2021-01-11 문영일

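Group Scheduling management

Group scheduling is implemented with the cpu subsystem of cgroup, and each group is managed as a task group (struct task_group).

Note that the similar-sounding term scheduling group (sched_group), as opposed to group scheduling (schedule group), belongs to load balancing.

The following figure shows the hierarchy among task groups.

Hierarchy of cgroup directories
Hierarchy in task_group

The following figure shows a task group that contains tasks and schedule entities.

The following figure shows that there is one cfs runqueue per cpu.

The following figure shows the parent relationships of schedule entities.

The following figure shows which cfs_rq a schedule entity's cfs_rq and my_q each point to.

Creating a task group – (1)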

cpu_cgroup_css_alloc()

kernel/sched/core.c

static struct cgroup_subsys_state *
cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
{
        struct task_group *parent = css_tg(parent_css);
        struct task_group *tg;

        if (!parent) {
                /* This is early initialization for the top cgroup */
                return &root_task_group.css;
        }

        tg = sched_create_group(parent);
        if (IS_ERR(tg))
                return ERR_PTR(-ENOMEM);

        return &tg->css;
}

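Creates a new task group under the task group attached to the requested cpu cgroup.

In code line 4, finds the task group attached to the requested cpu cgroup.
In code lines 7~10, returns the root task group if that task group is null.
In code lines 12~14, creates a new task group as a child of the task group and initializes the cfs and rt schedule groups inside it.

On a kernel built with the cpu cgroup subsystem, the root task group is initialized by cgroup_init_subsys() during kernel initialization. A new task group under it is created simply by creating a directory, as follows.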

/$ cd /sys/fs/cgroup/cpu
/sys/fs/cgroup/cpu$ sudo mkdir A
/sys/fs/cgroup/cpu$ ls
cgroup.clone_children  cpu.rt_runtime_us  cpuacct.stat              cpuacct.usage_percpu_user
cgroup.procs           cpu.shares         cpuacct.usage             cpuacct.usage_sys
cpu.cfs_period_us      cpu.stat           cpuacct.usage_all         cpuacct.usage_user
cpu.cfs_quota_us       cpu.uclamp.max     cpuacct.usage_percpu      notify_on_release
cpu.rt_period_us       cpu.uclamp.min     cpuacct.usage_percpu_sys  tasks

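The main settings above are described below. Entries common to all cgroups are omitted.

cpu.cfs_period_us
- Sets the period of the cfs bandwidth.
- The default is 100,000 us and the maximum is 1,000,000 us (1 second).
cpu.cfs_quota_us
- Sets the quota of the cfs bandwidth.
- The default is -1, which means bandwidth control is inactive.
cpu.rt_period_us
- Sets the period of the rt bandwidth.
- The default is 1,000,000 us (1 second), inherited from the global default rt period.
cpu.rt_runtime_us
- Sets the runtime of the rt bandwidth.
- The default is 0, which means it is inactive.
cpu.shares
- The shares ratio used by the cfs scheduler.
- The default is 1024, the load weight of nice 0. The larger the value, the larger the share of cpu time the group can get.
cpu.stat
- Shows the following cfs bandwidth statistics.
  - nr_periods, nr_throttled, throttled_time
cpu.uclamp.max
- The uclamp maximum, which caps the cpu utilization value from above.
- Used on systems with heterogeneous cores. However high the cpu utilization of the tasks in this task group gets, it cannot exceed this limit, so they are more likely to run on low-performance cpus.
- The default is max (100%).
cpu.uclamp.min
- The uclamp minimum, which bounds the cpu utilization value from below.
- Used on systems with heterogeneous cores. Tasks in this task group are treated as at least this busy even when they do almost no work, so they are more likely to run on high-performance cpus.
- The higher this value, the more the tasks of this task group run on high-performance cpus.
- The default is 0.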
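To make the relationship between cpu.cfs_quota_us and cpu.cfs_period_us concrete, the following sketch turns the two values into a cpu cap expressed in thousandths of a cpu. toy_cfs_limit_millicpu() is a hypothetical helper written for this example and is not part of any kernel or cgroup API.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Returns the cap in thousandths of a cpu, or -1 when the quota is
 * negative (the default -1 means no limit) or the period is invalid.
 */
static int64_t toy_cfs_limit_millicpu(int64_t quota_us, uint64_t period_us)
{
    if (quota_us < 0 || period_us == 0)
        return -1;
    return (quota_us * 1000) / (int64_t)period_us;
}
```

With the default 100,000 us period, a quota of 50,000 us caps the group at half a cpu (500), and a quota of 200,000 us allows the equivalent of two cpus (2000).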

The following figure shows the call relationships of sched_create_group().

sched_create_group()

kernel/sched/core.c

/* allocate runqueue etc for a new task group */
struct task_group *sched_create_group(struct task_group *parent)
{
        struct task_group *tg;

        tg = kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
        if (!tg)
                return ERR_PTR(-ENOMEM);

        if (!alloc_fair_sched_group(tg, parent))
                goto err;

        if (!alloc_rt_sched_group(tg, parent))
                goto err;

        alloc_uclamp_sched_group(tg, parent);

        return tg;

err:
        sched_free_group(tg);
        return ERR_PTR(-ENOMEM);
}

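Creates a task group as a child of the requested cpu cgroup's task group, then allocates and initializes the cfs and rt schedule groups for it.

In code lines 6~8, allocates the task group structure.
In code lines 10~11, allocates and initializes the cfs schedule group of the task group.
In code lines 13~14, allocates and initializes the rt schedule group of the task group.
In code line 16, assigns the uclamp default values to the task group.
In code line 18, returns the task group that was created and initialized.

The following figure shows the cfs runqueues, schedule entities, rt runqueues and rt schedule entities that are attached to a task group when sched_create_group() creates it.

Creating a task group – (2) Allocating the CFS schedule group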

alloc_fair_sched_group()

kernel/sched/fair.c

int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
{
        struct cfs_rq *cfs_rq;
        struct sched_entity *se;
        int i;

        tg->cfs_rq = kzalloc(sizeof(cfs_rq) * nr_cpu_ids, GFP_KERNEL);
        if (!tg->cfs_rq)
                goto err;
        tg->se = kzalloc(sizeof(se) * nr_cpu_ids, GFP_KERNEL);
        if (!tg->se)
                goto err;

        tg->shares = NICE_0_LOAD;

        init_cfs_bandwidth(tg_cfs_bandwidth(tg));

        for_each_possible_cpu(i) {
                cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
                                      GFP_KERNEL, cpu_to_node(i));
                if (!cfs_rq)
                        goto err;

                se = kzalloc_node(sizeof(struct sched_entity),
                                  GFP_KERNEL, cpu_to_node(i));
                if (!se)
                        goto err_free_rq;

                init_cfs_rq(cfs_rq);
                init_tg_cfs_entry(tg, cfs_rq, se, i, parent->se[i]);
                init_entity_runnable_average(se);
        }

        return 1;

err_free_rq:
        kfree(cfs_rq);
err:
        return 0;
}

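Allocates and initializes the cfs schedule group of a task group. Returns 1 on success.

In code lines 7~9, allocates for tg->cfs_rq an array of pointers to cfs runqueues, one per cpu.
In code lines 10~12, allocates for tg->se an array of pointers to schedule entities, one per cpu.
In code line 14, sets shares to NICE_0_LOAD (1024), the load weight of nice 0.
- On 64-bit systems an additional 2^10 scale is applied for higher precision, so NICE_0_LOAD is 1K (1024) * 1K (1024) = 1M (1048576).
In code line 16, initializes the cfs bandwidth.
In code lines 18~27, loops over every possible cpu and allocates a cfs runqueue and a schedule entity.
In code line 29, initializes the allocated cfs runqueue.
In code line 30, attaches the allocated cfs runqueue and cfs schedule entity to the task group and initializes the cfs entries.
In code line 31, initializes the entity's runnable average.
In code line 34, returns 1 for success.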
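The extra 2^10 scaling of load weights on 64-bit systems can be illustrated with two shift helpers. This is a simplified sketch under the assumption stated in the note above (a 10-bit fixed-point shift); the toy_ names are hypothetical and the helpers do not reproduce every detail of the kernel macros.

```c
#include <assert.h>

#define TOY_FIXEDPOINT_SHIFT 10

/* User-visible weight (e.g. 1024) -> high-resolution internal weight. */
static unsigned long long toy_scale_load(unsigned long long w)
{
    return w << TOY_FIXEDPOINT_SHIFT;
}

/* High-resolution internal weight -> user-visible weight. */
static unsigned long long toy_scale_load_down(unsigned long long w)
{
    return w >> TOY_FIXEDPOINT_SHIFT;
}
```

Scaling the nice 0 weight of 1024 up gives 1048576 (1M), and scaling it back down returns 1024.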

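The following figure shows how alloc_fair_sched_group() attaches the cfs runqueues and schedule entities it creates to the task group and initializes them.

The following figure shows the relationships between CFS runqueue <-> task group <-> schedule entity.

A child task group corresponds to one schedule entity and is queued on a cfs runqueue on equal terms with the schedule entities of other tasks.

Initializing the cfs schedule entity of a task group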

init_tg_cfs_entry()

kernel/sched/fair.c

void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
                        struct sched_entity *se, int cpu,
                        struct sched_entity *parent)
{       
        struct rq *rq = cpu_rq(cpu);

        cfs_rq->tg = tg;
        cfs_rq->rq = rq;
        init_cfs_rq_runtime(cfs_rq);
                
        tg->cfs_rq[cpu] = cfs_rq;
        tg->se[cpu] = se;
 
        /* se could be NULL for root_task_group */
        if (!se)
                return;
                
        if (!parent) {
                se->cfs_rq = &rq->cfs;
                se->depth = 0;
        } else {
                se->cfs_rq = parent->my_q;
                se->depth = parent->depth + 1;
        }       
                
        se->my_q = cfs_rq;
        /* guarantee group entities always have weight */
        update_load_set(&se->load, NICE_0_LOAD);
        se->parent = parent;
}

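Attaches a cfs runqueue and a cfs schedule entity to a task group and initializes the cfs entries.

In code lines 7~8, sets the task group and the runqueue of the cfs runqueue.
In code line 9, initializes the throttling runtime of the cfs runqueue.
In code lines 11~12, sets the cfs runqueue and the scheduling entity of the task group.
In code lines 15~16, returns if no scheduling entity was given.
- When called to initialize the default (root) group, the scheduling entity is null.
In code lines 18~20, if no parent was given, the scheduling entity's cfs runqueue is the cfs runqueue of the runqueue, and depth is set to 0.
In code lines 21~24, if a parent was given, the scheduling entity's cfs runqueue is the parent's my_q, and depth is the parent's depth plus 1.
In code line 26, assigns the cfs runqueue to the scheduling entity's my_q.
In code line 28, sets the scheduling entity's load to the nice 0 load weight for now.
In code line 29, sets the parent of the scheduling entity.

The following figure shows how the entries are linked as init_tg_cfs_entry() is called for each task group, from the root task group down to the child task groups.

Although unrelated to init_tg_cfs_entry(), the figure also shows how the schedule entity embedded in a task is linked to its task group.
init_tg_cfs_entry() calls the following two functions internally, but they are not drawn in the figure.
- init_cfs_rq_runtime() -> sets runtime_enabled=0 and initializes throttled_list
- update_load_set() -> initializes load.weight to the nice 0 load weight and load.inv_weight=0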
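The linking rules for cfs_rq, my_q, parent and depth can be modeled with two small structures. The following is a user-space sketch, not the kernel code: the toy_ types and toy_init_entry() are hypothetical, and root_q stands in for the cpu runqueue's own cfs runqueue (&rq->cfs).

```c
#include <assert.h>
#include <stddef.h>

struct toy_cfs_rq { int id; };

struct toy_se {
    struct toy_cfs_rq *cfs_rq; /* queue this entity is enqueued on */
    struct toy_cfs_rq *my_q;   /* queue this group entity owns */
    struct toy_se *parent;
    int depth;
};

/* Link a group entity below `parent` (or below the root queue if NULL). */
static void toy_init_entry(struct toy_se *se, struct toy_cfs_rq *own_q,
                           struct toy_se *parent, struct toy_cfs_rq *root_q)
{
    if (!parent) {
        se->cfs_rq = root_q;       /* first-level group: queued on the root */
        se->depth = 0;
    } else {
        se->cfs_rq = parent->my_q; /* nested group: queued on the parent's queue */
        se->depth = parent->depth + 1;
    }
    se->my_q = own_q;
    se->parent = parent;
}
```

Initializing group A below the root and group B below A leaves A queued on the root queue with depth 0, and B queued on A's own queue with depth 1.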

init_cfs_rq_runtime()

kernel/sched/fair.c

static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) 
{
        cfs_rq->runtime_enabled = 0;
        INIT_LIST_HEAD(&cfs_rq->throttled_list);
}

Initializes the throttling runtime of a cfs runqueue.

In code line 3, sets runtime_enabled of the cfs runqueue to 0, so that runtime accounting for the cfs bandwidth starts out disabled.
In code line 4, initializes throttled_list.

update_load_set()

kernel/sched/fair.c

static inline void update_load_set(struct load_weight *lw, unsigned long w)
{
        lw->weight = w;
        lw->inv_weight = 0;
}

Sets the load weight value. inv_weight is initialized to 0 for now.

Creating a task group – (3) Allocating the RT schedule group

alloc_rt_sched_group()

kernel/sched/rt.c

int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)
{
        struct rt_rq *rt_rq;
        struct sched_rt_entity *rt_se;
        int i;  
                
        tg->rt_rq = kzalloc(sizeof(rt_rq) * nr_cpu_ids, GFP_KERNEL);
        if (!tg->rt_rq)
                goto err;
        tg->rt_se = kzalloc(sizeof(rt_se) * nr_cpu_ids, GFP_KERNEL);
        if (!tg->rt_se)
                goto err;
                
        init_rt_bandwidth(&tg->rt_bandwidth,
                        ktime_to_ns(def_rt_bandwidth.rt_period), 0);

        for_each_possible_cpu(i) {
                rt_rq = kzalloc_node(sizeof(struct rt_rq),
                                     GFP_KERNEL, cpu_to_node(i));
                if (!rt_rq)
                        goto err;

                rt_se = kzalloc_node(sizeof(struct sched_rt_entity),
                                     GFP_KERNEL, cpu_to_node(i));
                if (!rt_se)
                        goto err_free_rq;

                init_rt_rq(rt_rq, cpu_rq(i));
                rt_rq->rt_runtime = tg->rt_bandwidth.rt_runtime;
                init_tg_rt_entry(tg, rt_rq, rt_se, i, parent->rt_se[i]);
        }

        return 1;

err_free_rq:
        kfree(rt_rq);
err:
        return 0;
}

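Allocates and initializes the rt schedule group of a task group. Returns 1 on success.

In code lines 7~9, allocates for tg->rt_rq an array of pointers to rt runqueues, one per cpu.
In code lines 10~12, allocates for tg->rt_se an array of pointers to rt schedule entities, one per cpu.
In code lines 14~15, initializes the rt bandwidth.
In code lines 17~26, loops over every possible cpu and allocates an rt runqueue and an rt schedule entity.
In code lines 28~29, initializes the allocated rt runqueue and sets its rt_runtime from the task group's rt bandwidth.
In code line 30, initializes the allocated rt schedule entity.
In code line 33, returns 1 for success.

The following figure shows how alloc_rt_sched_group() attaches the rt runqueues and rt schedule entities it creates to the task group and initializes them.

Initializing the rt schedule entity of a task group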

init_tg_rt_entry()

kernel/sched/rt.c

void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
                struct sched_rt_entity *rt_se, int cpu,
                struct sched_rt_entity *parent)
{
        struct rq *rq = cpu_rq(cpu);

        rt_rq->highest_prio.curr = MAX_RT_PRIO;
        rt_rq->rt_nr_boosted = 0;
        rt_rq->rq = rq;
        rt_rq->tg = tg; 

        tg->rt_rq[cpu] = rt_rq;
        tg->rt_se[cpu] = rt_se;

        if (!rt_se)
                return;

        if (!parent)
                rt_se->rt_rq = &rq->rt;
        else    
                rt_se->rt_rq = parent->my_q;

        rt_se->my_q = rt_rq; 
        rt_se->parent = parent;
        INIT_LIST_HEAD(&rt_se->run_list);
}

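Attaches an rt runqueue and an rt schedule entity to a task group and initializes the rt entries.

In code line 7, initializes highest_prio.curr of the rt runqueue to MAX_RT_PRIO (100).
In code line 8, initializes rt_nr_boosted to 0.
In code lines 9~10, sets the runqueue and the task group of the rt runqueue.
In code lines 12~13, sets the rt runqueue and the rt scheduling entity of the task group.
In code lines 15~16, returns if no rt scheduling entity was given.
- When called to initialize the default (root) group, the rt scheduling entity is null.
In code lines 18~21, if no parent was given, the rt scheduling entity's rt runqueue is the rt runqueue of the runqueue; if a parent was given, the parent's my_q is used.
In code line 23, assigns the rt runqueue to the rt scheduling entity's my_q.
In code line 24, sets the parent of the rt scheduling entity.
In code line 25, initializes run_list of the rt scheduling entity.

The following figure shows how the entries are linked as init_tg_rt_entry() is called for each task group, from the root task group down to the child task groups.

Although unrelated to init_tg_rt_entry(), the figure also shows how the schedule entity embedded in a task is linked to its task group.

Setting CFS shares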

sched_group_set_shares()

kernel/sched/fair.c

int sched_group_set_shares(struct task_group *tg, unsigned long shares)
{
        int i;

        /*
         * We can't change the weight of the root cgroup.
         */
        if (!tg->se[0])
                return -EINVAL;

        shares = clamp(shares, scale_load(MIN_SHARES), scale_load(MAX_SHARES));

        mutex_lock(&shares_mutex);
        if (tg->shares == shares)
                goto done;

        tg->shares = shares;
        for_each_possible_cpu(i) {
                struct rq *rq = cpu_rq(i);
                struct sched_entity *se = tg->se[i];
                struct rq_flags rf;

                /* Propagate contribution to hierarchy */
                rq_lock_irqsave(rq, &rf);
                update_rq_clock(rq);
                for_each_sched_entity(se) {
                        update_load_avg(cfs_rq_of(se), se, UPDATE_TG);
                        update_cfs_group(se);
                }
                rq_unlock_irqrestore(rq, &rf);
        }

done:
        mutex_unlock(&shares_mutex);
        return 0;
}

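Sets the cfs shares value of the requested task group.

In code lines 8~9, returns -EINVAL if the first schedule entity of the task group is null.
- The schedule entity pointers of the root task group are set to null.
In code lines 11~17, clamps shares into the range MIN_SHARES (2) ~ MAX_SHARES (256K) and stores it in the task group. If the value is the same as the current shares, returns success (0) without changing anything.
In code lines 18~25, loops over every possible cpu, looks up the runqueue and the task group's schedule entity for that cpu, takes the runqueue lock and updates the runqueue clock.
In code lines 26~29, walks the schedule entity hierarchy from the current schedule entity up to the topmost one and has the load average and shares values updated.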

kernel/sched/sched.h

/*
 * A weight of 0 or 1 can cause arithmetics problems.
 * A weight of a cfs_rq is the sum of weights of which entities
 * are queued on this cfs_rq, so a weight of a entity should not be
 * too large, so as the shares value of a task group.
 * (The default weight is 1024 - so there's no practical
 *  limitation from this.)
 */
#define MIN_SHARES      (1UL <<  1)
#define MAX_SHARES      (1UL << 18)

Range of CFS shares values (2 ~ 256K)
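The clamping into the MIN_SHARES ~ MAX_SHARES range looks like this when written out. A minimal sketch: toy_clamp_shares() is a hypothetical helper, and the additional scale_load() applied to both bounds in sched_group_set_shares() is deliberately left out.

```c
#include <assert.h>

#define TOY_MIN_SHARES (1UL << 1)  /* 2 */
#define TOY_MAX_SHARES (1UL << 18) /* 256K */

/* Force a requested shares value into [TOY_MIN_SHARES, TOY_MAX_SHARES]. */
static unsigned long toy_clamp_shares(unsigned long shares)
{
    if (shares < TOY_MIN_SHARES)
        return TOY_MIN_SHARES;
    if (shares > TOY_MAX_SHARES)
        return TOY_MAX_SHARES;
    return shares;
}
```

A request of 0 becomes 2, the default 1024 is kept, and a request of 1M is cut down to 262144 (256K).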

The following figure shows how the shares value set on a task group is propagated to the load weights of the schedule entity, the cfs runqueue and the runqueue.

Structures

task_group structure

kernel/sched/sched.h

/* Task group related information */
struct task_group {
        struct cgroup_subsys_state css;

#ifdef CONFIG_FAIR_GROUP_SCHED
        /* schedulable entities of this group on each CPU */
        struct sched_entity     **se;
        /* runqueue "owned" by this group on each CPU */
        struct cfs_rq           **cfs_rq;
        unsigned long           shares;

#ifdef  CONFIG_SMP
        /*
         * load_avg can be heavily contended at clock tick time, so put
         * it in its own cacheline separated from the fields above which
         * will also be accessed at each tick.
         */
        atomic_long_t           load_avg ____cacheline_aligned;
#endif
#endif

#ifdef CONFIG_RT_GROUP_SCHED
        struct sched_rt_entity  **rt_se;
        struct rt_rq            **rt_rq;

        struct rt_bandwidth     rt_bandwidth;
#endif

        struct rcu_head         rcu;
        struct list_head        list;

        struct task_group       *parent;
        struct list_head        siblings;
        struct list_head        children;

#ifdef CONFIG_SCHED_AUTOGROUP
        struct autogroup        *autogroup;
#endif

        struct cfs_bandwidth    cfs_bandwidth;

#ifdef CONFIG_UCLAMP_TASK_GROUP
        /* The two decimal precision [%] value requested from user-space */
        unsigned int            uclamp_pct[UCLAMP_CNT];
        /* Clamp values requested for a task group */
        struct uclamp_se        uclamp_req[UCLAMP_CNT];
        /* Effective clamp values used for a task group */
        struct uclamp_se        uclamp[UCLAMP_CNT];
#endif

};

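css
- cgroup interface
**se
- Scheduling entities, one per cpu.
- On each cpu, the scheduling entity that represents this task group.
**cfs_rq
- cfs runqueues, one per cpu.
- On each cpu, the cfs runqueue of this task group.
shares
- Sets the proportion of cpu load that the cfs tasks in this task group receive relative to other groups.
- The shares value of the root task group cannot be changed. (“/sys/fs/cgroup/cpu/cpu.shares”)
- Set the shares value on child task groups instead.
load_avg
- Load average
**rt_se
- rt scheduling entities, one per cpu.
- On each cpu, the scheduling entity that represents this task group.
**rt_rq
- rt runqueues, one per cpu.
- On each cpu, the rt runqueue of this task group.
rt_bandwidth
- rt bandwidth
- By default, rt tasks may use at most 95% of cpu time.
*parent
- Points to the parent task group.
- null in the root task group.
siblings
- Holds the sibling task groups.
children
- Holds the child task groups.
*autogroup
- Used when task groups are created automatically for user shells logged in on a tty.
cfs_bandwidth
- Determines the cpu occupancy of the cfs tasks in the task group (throttling and so on).
uclamp_pct[]
- Stores the clamp min and max percentages requested by the user.
uclamp_req[]
- The clamp values requested for the task group.
uclamp[]
- The effective clamp min and max values.

References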

Scheduler -1- (Basic) | 문c
Scheduler -2- (Global Cpu Load) | 문c
Scheduler -3- (PELT) | 문c
Scheduler -4- (Group Scheduling) | 문c – 현재 글
Scheduler -5- (Scheduler Core) | 문c
Scheduler -6- (CFS Scheduler) | 문c
Scheduler -7- (Preemption & Context Switch) | 문c
Scheduler -8- (CFS Bandwidth) | 문c
Scheduler -9- (RT Scheduler) | 문c
Scheduler -10- (Deadline Scheduler) | 문c
Scheduler -11- (Stop Scheduler) | 문c
Scheduler -12- (Idle Scheduler) | 문c
Scheduler -13- (Scheduling Domain 1) | 문c
Scheduler -14- (Scheduling Domain 2) | 문c
Scheduler -15- (Load Balance 1) | 문c
Scheduler -16- (Load Balance 2) | 문c
Scheduler -17- (Load Balance 3 NUMA) | 문c
Scheduler -18- (Load Balance 4 EAS) | 문c
Scheduler -19- (초기화) | 문c
PID 관리 | 문c
do_fork() | 문c
cpu_startup_entry() | 문c
런큐 로드 평균(cpu_load[]) – v4.0 | 문c
PELT(Per-Entity Load Tracking) – v4.0 | 문c

Scheduler -19- (초기화)

2017-05-12 2021-01-11 문영일

Scheduler initialization

The following figure is a simplified flow chart of sched_init().

The following figure shows the call relationships between sched_init() and the functions it calls.

sched_init()

The source of this function is long, so it is split into five parts.

kernel/sched/core.c – (1/5)

void __init sched_init(void)
{
        int i, j; 
        unsigned long alloc_size = 0, ptr;

#ifdef CONFIG_FAIR_GROUP_SCHED
        alloc_size += 2 * nr_cpu_ids * sizeof(void **);
#endif
#ifdef CONFIG_RT_GROUP_SCHED
        alloc_size += 2 * nr_cpu_ids * sizeof(void **);
#endif 
        if (alloc_size) {
                ptr = (unsigned long)kzalloc(alloc_size, GFP_NOWAIT);

#ifdef CONFIG_FAIR_GROUP_SCHED
                root_task_group.se = (struct sched_entity **)ptr;
                ptr += nr_cpu_ids * sizeof(void **);
        
                root_task_group.cfs_rq = (struct cfs_rq **)ptr;
                ptr += nr_cpu_ids * sizeof(void **);

#endif /* CONFIG_FAIR_GROUP_SCHED */
#ifdef CONFIG_RT_GROUP_SCHED
                root_task_group.rt_se = (struct sched_rt_entity **)ptr;
                ptr += nr_cpu_ids * sizeof(void **);

                root_task_group.rt_rq = (struct rt_rq **)ptr;
                ptr += nr_cpu_ids * sizeof(void **);

#endif /* CONFIG_RT_GROUP_SCHED */
        }
#ifdef CONFIG_CPUMASK_OFFSTACK
        for_each_possible_cpu(i) {
                per_cpu(load_balance_mask, i) = (cpumask_var_t)kzalloc_node(
                        cpumask_size(), GFP_KERNEL, cpu_to_node(i));
        }
#endif /* CONFIG_CPUMASK_OFFSTACK */

        init_rt_bandwidth(&def_rt_bandwidth,
                        global_rt_period(), global_rt_runtime());
        init_dl_bandwidth(&def_dl_bandwidth,
                        global_rt_period(), global_rt_runtime());

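For cfs group scheduling, allocates the per-cpu pointer arrays for the root task group's schedule entities and cfs runqueues. The same is prepared for rt group scheduling. The default rt bandwidth and the default dl bandwidth are then initialized.

In code lines 6~13, for cfs group scheduling, allocates an array of sched_entity pointers and an array of cfs_rq pointers, each with one slot per cpu. The same is prepared for rt group scheduling.
- CONFIG_FAIR_GROUP_SCHED
  - Kernel option that enables cfs group scheduling through cgroup
- CONFIG_RT_GROUP_SCHED
  - Kernel option that enables rt group scheduling through cgroup
In code lines 15~22, for cfs group scheduling, points root_task_group.se and root_task_group.cfs_rq at the per-cpu pointer arrays allocated above.
In code lines 23~30, for rt group scheduling, points root_task_group.rt_se and root_task_group.rt_rq at the per-cpu pointer arrays allocated above.
In code lines 32~37, for cpumask offstack, allocates one cpu bitmap per cpu and stores it in load_balance_mask.
- CONFIG_CPUMASK_OFFSTACK
  - On systems with many cpus, cpu bitmaps are allocated from dynamic memory instead of the stack to prevent stack overflow.
- DEFINE_PER_CPU(cpumask_var_t, load_balance_mask);
In code lines 39~42, initializes the default rt bandwidth and the default dl bandwidth.
- Both the rt bandwidth and the dl bandwidth are 95%, so the rt scheduler and the dl scheduler can occupy at most 95% of cpu time.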
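The 95% figure follows from the default sysctl values sched_rt_period_us = 1,000,000 and sched_rt_runtime_us = 950,000. The sketch below models the us-to-ns conversion and the resulting percentage; the toy_ helpers are hypothetical names for this example, and a negative runtime is taken to mean no limit.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_USEC   1000ULL
#define TOY_RUNTIME_INF UINT64_MAX /* "no limit" marker */

/* Period in us -> ns. */
static uint64_t toy_rt_period_ns(unsigned int period_us)
{
    return (uint64_t)period_us * NSEC_PER_USEC;
}

/* Runtime in us -> ns; a negative value means unlimited. */
static uint64_t toy_rt_runtime_ns(int runtime_us)
{
    if (runtime_us < 0)
        return TOY_RUNTIME_INF;
    return (uint64_t)runtime_us * NSEC_PER_USEC;
}

/* Share of each period that rt tasks may consume, in percent. */
static unsigned int toy_rt_percent(uint64_t period_ns, uint64_t runtime_ns)
{
    if (runtime_ns == TOY_RUNTIME_INF || period_ns == 0)
        return 100;
    return (unsigned int)(runtime_ns * 100 / period_ns);
}
```

With the defaults, 950 ms out of every 1 s period is available to rt tasks, which is where the 95% comes from.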

The following figure shows how the root task group, the default rt bandwidth and the default dl bandwidth are initialized.

kernel/sched/core.c – (2/5)

#ifdef CONFIG_SMP
        init_defrootdomain();
#endif

#ifdef CONFIG_RT_GROUP_SCHED
        init_rt_bandwidth(&root_task_group.rt_bandwidth,
                        global_rt_period(), global_rt_runtime());
#endif /* CONFIG_RT_GROUP_SCHED */

#ifdef CONFIG_CGROUP_SCHED
        list_add(&root_task_group.list, &task_groups);
        INIT_LIST_HEAD(&root_task_group.children);
        INIT_LIST_HEAD(&root_task_group.siblings);
        autogroup_init(&init_task);
#endif /* CONFIG_CGROUP_SCHED */

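Initializes the root domain. For rt group scheduling, the rt bandwidth of the root task group is initialized as well. The root task group and the autogroup are also initialized.

In code lines 1~3, initializes the root domain on smp systems.
In code lines 5~8, for rt group scheduling, initializes the rt bandwidth of the root task group.
In code lines 10~15, for cgroup scheduling, adds the root task group to the global task_groups list and initializes its children and siblings lists. Finally, initializes the autogroup.

The following figure shows how the default root domain and the autogroup are initialized.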

kernel/sched/core.c – (3/5)

        for_each_possible_cpu(i) {
                struct rq *rq;

                rq = cpu_rq(i);
                raw_spin_lock_init(&rq->lock);
                rq->nr_running = 0;
                rq->calc_load_active = 0;
                rq->calc_load_update = jiffies + LOAD_FREQ;
                init_cfs_rq(&rq->cfs);
                init_rt_rq(&rq->rt, rq);
                init_dl_rq(&rq->dl, rq);

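Looping over every possible cpu, the cfs, rt, and dl runqueues are initialized.

In code lines 1~5, the loop visits every possible cpu, selects that cpu's runqueue, and initializes its spinlock.
In code line 6, nr_running is set to 0, so the count of runnable tasks on the runqueue starts at 0.
In code line 7, calc_load_active is set to 0 to reset the active load value.
In code line 8, the next load update time is set to the current time plus LOAD_FREQ (about 5 seconds).
In code lines 9~11, the cfs, rt, and dl runqueues are initialized.

The following figure shows each runqueue, and the cfs, rt, and dl runqueues inside it, being initialized in the per-cpu loop.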

kernel/sched/core.c – (4/5)

#ifdef CONFIG_FAIR_GROUP_SCHED
                root_task_group.shares = ROOT_TASK_GROUP_LOAD;
                INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
                /*
                 * How much cpu bandwidth does root_task_group get?
                 *
                 * In case of task-groups formed thr' the cgroup filesystem, it
                 * gets 100% of the cpu resources in the system. This overall
                 * system cpu resource is divided among the tasks of
                 * root_task_group and its child task-groups in a fair manner,
                 * based on each entity's (task or task-group's) weight
                 * (se->load.weight).
                 *
                 * In other words, if root_task_group has 10 tasks of weight
                 * 1024) and two child groups A0 and A1 (of weight 1024 each),
                 * then A0's share of the cpu resource is:
                 *
                 *      A0's bandwidth = 1024 / (10*1024 + 1024 + 1024) = 8.33%
                 *
                 * We achieve this by letting root_task_group's tasks sit
                 * directly in rq->cfs (i.e root_task_group->se[] = NULL).
                 */
                init_cfs_bandwidth(&root_task_group.cfs_bandwidth);
                init_tg_cfs_entry(&root_task_group, &rq->cfs, NULL, i, NULL);
#endif /* CONFIG_FAIR_GROUP_SCHED */

                rq->rt.rt_runtime = def_rt_bandwidth.rt_runtime;
#ifdef CONFIG_RT_GROUP_SCHED
                init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL);
#endif

                for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
                        rq->cpu_load[j] = 0;

                rq->last_load_update_tick = jiffies;

#ifdef CONFIG_SMP
                rq->sd = NULL;
                rq->rd = NULL;
                rq->cpu_capacity = SCHED_CAPACITY_SCALE;
                rq->post_schedule = 0;
                rq->active_balance = 0;
                rq->next_balance = jiffies;
                rq->push_cpu = 0;
                rq->cpu = i;
                rq->online = 0;
                rq->idle_stamp = 0;
                rq->avg_idle = 2*sysctl_sched_migration_cost;
                rq->max_idle_balance_cost = sysctl_sched_migration_cost;

                INIT_LIST_HEAD(&rq->cfs_tasks);

                rq_attach_root(rq, &def_root_domain);
#ifdef CONFIG_NO_HZ_COMMON
                rq->nohz_flags = 0;
#endif
#ifdef CONFIG_NO_HZ_FULL
                rq->last_sched_tick = 0;
#endif
#endif
                init_rq_hrtick(rq);
                atomic_set(&rq->nr_iowait, 0);
        }

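Still inside the per-cpu loop, the runqueue and the following items are initialized.

cfs bandwidth
cfs entry and rt entry of the root task group
Various cpu load, state, and statistics fields
The default root domain as the runqueue's root domain
hrtick, when it is available

In code lines 1~3, for cfs group scheduling, the shares value of the root task group is initialized to the nice 0 load value (1024 on 32-bit systems), and leaf_cfs_rq_list is initialized.
- #define ROOT_TASK_GROUP_LOAD NICE_0_LOAD
- #define NICE_0_LOAD SCHED_LOAD_SCALE
- #define SCHED_LOAD_SCALE (1L << SCHED_LOAD_SHIFT)
- #define SCHED_LOAD_SHIFT (10 + SCHED_LOAD_RESOLUTION)
- SCHED_LOAD_RESOLUTION is 0 on 32-bit systems and is defined as 10 for 64-bit systems; in the v4.0-era source, however, the 64-bit branch is compiled out with #if 0, so the effective value is 0 there as well.
In code line 23, the cfs bandwidth of the root task group is initialized.
In code line 24, the root task group is set as the task group of the cfs runqueue, and the following items are initialized.
- The cfs runqueue and scheduling entity of the task group for this cpu
- runtime_enabled of the cfs runqueue is set to 0, so the cfs runqueue is not throttled by default.
In code line 27, the runtime of the default rt bandwidth is copied into rt_runtime of the rt runqueue.
In code lines 28~30, for rt group scheduling, the root task group is set as the task group of the rt runqueue, and the following item is initialized.
- The rt runqueue and scheduling entity of the task group for this cpu
In code lines 32~33, the cpu_load[] array of CPU_LOAD_IDX_MAX (5) entries is cleared to 0.
- Each entry will later hold a cpu load average computed over a power-of-two number of ticks.
- cpu_load[0] = current cpu load (1 tick)
- cpu_load[1] = cpu load over 2 ticks
- cpu_load[2] = cpu load over 4 ticks
- cpu_load[3] = cpu load over 8 ticks
- cpu_load[4] = cpu load over 16 ticks
In code line 35, the time (jiffies) at which cpu_load[] was last updated is recorded.
In code lines 37~39, on smp systems, the scheduling domain and root domain of the runqueue are set to NULL.
In code line 40, cpu_capacity of the runqueue is set to SCHED_CAPACITY_SCALE (1024).
- #define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT)
- #define SCHED_CAPACITY_SHIFT 10
In code lines 42~44, the members used for load balancing are initialized to 0 and the next balance time is set to the current time (jiffies).
In code lines 45~46, the cpu number of the runqueue is recorded and the runqueue is marked as not online.
In code lines 47~49, idle_stamp is set to 0, max_idle_balance_cost is set to the scheduler migration cost, and avg_idle is set to twice that value.
- Default: sysctl_sched_migration_cost = 500000UL;
In code line 51, the cfs_tasks list of the runqueue is initialized.
In code line 53, the default root domain is attached as the root domain of the runqueue.
In code lines 54~56, on systems with nohz support, nohz_flags is initialized to 0.
In code lines 57~59, on systems with nohz full support, the last scheduler tick value is set to 0.
In code line 61, on systems with hrtick support, the hrtick of the runqueue is prepared.
In code line 62, the iowait counter of the runqueue is initialized to 0.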
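
To make the meaning of the cpu_load[] indexes more concrete, here is a minimal userspace sketch of how one load sample could be folded into the five averages. It is modeled on the per-tick averaging used by kernels of this era but ignores missed ticks; the function name update_cpu_load_sketch() is hypothetical and is not a kernel API.

```c
#include <stddef.h>

#define CPU_LOAD_IDX_MAX 5

/* Hypothetical helper: fold one tick's load sample into cpu_load[]. */
static void update_cpu_load_sketch(unsigned long cpu_load[CPU_LOAD_IDX_MAX],
                                   unsigned long this_load)
{
        unsigned long scale;
        size_t i;

        cpu_load[0] = this_load;        /* index 0 tracks the latest sample */

        for (i = 1, scale = 2; i < CPU_LOAD_IDX_MAX; i++, scale += scale) {
                unsigned long old_load = cpu_load[i];
                unsigned long new_load = this_load;

                /* round up while the load is rising so the average can reach it */
                if (new_load > old_load)
                        new_load += scale - 1;

                /* scale == 1 << i, so ">> i" divides by scale */
                cpu_load[i] = (old_load * (scale - 1) + new_load) >> i;
        }
}
```

Starting from an all-zero array, a single sample of 1024 leaves 1024, 512, 256, 128, and 64 in indexes 0 to 4, which shows how higher indexes react more slowly to a change in load.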

The following figure shows the per-cpu loop continuing to initialize each runqueue along with the cfs bandwidth, the task group cfs entry, the task group rt entry, and the default root domain.

kernel/sched/core.c – (5/5)

        set_load_weight(&init_task);

#ifdef CONFIG_PREEMPT_NOTIFIERS
        INIT_HLIST_HEAD(&init_task.preempt_notifiers);
#endif

        /*
         * The boot idle thread does lazy MMU switching as well:
         */
        atomic_inc(&init_mm.mm_count);
        enter_lazy_tlb(&init_mm, current);

        /*
         * During early bootup we pretend to be a normal task:
         */
        current->sched_class = &fair_sched_class;

        /*
         * Make us the idle thread. Technically, schedule() should not be
         * called from this thread, however somewhere below it might be,
         * but because we are the idle thread, we just pick up running again
         * when this runqueue becomes "idle".
         */
        init_idle(current, smp_processor_id());

        calc_load_update = jiffies + LOAD_FREQ;

#ifdef CONFIG_SMP
        zalloc_cpumask_var(&sched_domains_tmpmask, GFP_NOWAIT);
        /* May be allocated at isolcpus cmdline parse time */
        if (cpu_isolated_map == NULL)
                zalloc_cpumask_var(&cpu_isolated_map, GFP_NOWAIT);
        idle_thread_set_boot_cpu();
        set_cpu_rq_start_time();
#endif
        init_sched_fair_class();

        scheduler_running = 1;
}

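The current boot-up task is initialized, and the scheduler becomes ready once the cfs scheduler has finally been prepared.

In code line 1, the load weight of init_task, the current boot-up task, is initialized.
- The static_prio value is converted to a nice-level index and the matching weight is stored.
- If the task's scheduling policy is SCHED_IDLE, the weight is set to the lowest value, WEIGHT_IDLEPRIO (3).
In code lines 3~5, the preempt_notifiers list of init_task is initialized.
In code lines 10~11, the boot idle thread increments mm_count of init_mm by 1 and enters lazy MMU switching.
- arm and arm64 kernels currently do not support lazy tlb.
In code line 16, the scheduling class of the current task, which is still booting, is set to the cfs scheduler.
In code line 24, the boot-up task is made the idle thread of the current cpu and is assigned the idle scheduler.
In code line 26, the global calc_load_update is set to jiffies + LOAD_FREQ (about 5 seconds).
In code lines 28~29, on smp systems, a zeroed cpu bitmap is allocated for the temporary scheduling domain mask.
In code lines 31~32, if cpu_isolated_map has not been set up yet, a zeroed cpu bitmap is allocated for it.
In code line 33, the current task is stored in the global per-cpu idle_threads slot of the current cpu.
- per_cpu(idle_threads, smp_processor_id()) = current;
In code line 34, the current scheduler clock is stored in age_stamp of the current runqueue.
In code lines 36~38, the cfs scheduler is initialized and the scheduler is then flagged as running.

Default Root Domain

Initializing the Default Root Domain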

init_defrootdomain()

kernel/sched/core.c

static void init_defrootdomain(void)
{
        init_rootdomain(&def_root_domain);

        atomic_set(&def_root_domain.refcount, 1);
}

Initializes the default root domain and sets its reference counter to 1.

init_rootdomain()

kernel/sched/core.c

static int init_rootdomain(struct root_domain *rd)
{
        memset(rd, 0, sizeof(*rd));

        if (!alloc_cpumask_var(&rd->span, GFP_KERNEL))
                goto out;
        if (!alloc_cpumask_var(&rd->online, GFP_KERNEL))
                goto free_span;
        if (!alloc_cpumask_var(&rd->dlo_mask, GFP_KERNEL))
                goto free_online;
        if (!alloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
                goto free_dlo_mask;

        init_dl_bw(&rd->dl_bw);
        if (cpudl_init(&rd->cpudl) != 0)
                goto free_dlo_mask;

        if (cpupri_init(&rd->cpupri) != 0)
                goto free_rto_mask;
        return 0;

free_rto_mask:
        free_cpumask_var(rd->rto_mask);
free_dlo_mask:
        free_cpumask_var(rd->dlo_mask);
free_online:
        free_cpumask_var(rd->online);
free_span:
        free_cpumask_var(rd->span);
out:
        return -ENOMEM;
}

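Initializes the given root domain.

In code line 3, the whole root domain structure is cleared to 0.
In code lines 5~12, the cpumask members of the root domain, span, online, dlo_mask, and rto_mask, are allocated.
In code line 14, dl_bw of the root domain is initialized.
- With the default settings, the dl scheduler's cpu bandwidth ratio is capped at 95%.
- Limiting the share of cpu that dl tasks can occupy leaves at least some room for the rt and cfs schedulers to run.
- If the global rt runtime is set to unlimited, bw becomes -1 and no limit is applied.
In code lines 15~16, cpudl of the root domain is initialized; the dl scheduler uses it for load balancing.
In code lines 18~19, cpupri of the root domain is initialized; the rt scheduler uses it for load balancing.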
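
The allocation steps above rely on goto-based unwinding: each failure jumps to a label that frees only what was already allocated. The following userspace sketch illustrates the pattern with calloc()/free() in place of the cpumask helpers. All names here (masks_sketch, alloc_mask_sketch(), init_masks_sketch(), the fail_at parameter) are hypothetical and exist only for this illustration.

```c
#include <stdlib.h>

struct masks_sketch {
        unsigned long *span, *online, *dlo_mask, *rto_mask;
};

/* Hypothetical allocator that fails on call number fail_at (0 = never). */
static unsigned long *alloc_mask_sketch(int *calls, int fail_at)
{
        if (++(*calls) == fail_at)
                return NULL;
        return calloc(1, sizeof(unsigned long));
}

static int init_masks_sketch(struct masks_sketch *m, int fail_at)
{
        int calls = 0;

        m->span = alloc_mask_sketch(&calls, fail_at);
        if (!m->span)
                goto out;
        m->online = alloc_mask_sketch(&calls, fail_at);
        if (!m->online)
                goto free_span;
        m->dlo_mask = alloc_mask_sketch(&calls, fail_at);
        if (!m->dlo_mask)
                goto free_online;
        m->rto_mask = alloc_mask_sketch(&calls, fail_at);
        if (!m->rto_mask)
                goto free_dlo_mask;
        return 0;

        /* labels are ordered so that falling through frees in reverse order */
free_dlo_mask:
        free(m->dlo_mask);
free_online:
        free(m->online);
free_span:
        free(m->span);
out:
        return -1;
}
```

Because the labels are stacked in reverse allocation order, a failure at any step releases exactly the resources acquired before it.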

init_dl_bw()

kernel/sched/deadline.c

void init_dl_bw(struct dl_bw *dl_b)
{
        raw_spin_lock_init(&dl_b->lock);
        raw_spin_lock(&def_dl_bandwidth.dl_runtime_lock);
        if (global_rt_runtime() == RUNTIME_INF)
                dl_b->bw = -1; 
        else
                dl_b->bw = to_ratio(global_rt_period(), global_rt_runtime());
        raw_spin_unlock(&def_dl_bandwidth.dl_runtime_lock);
        dl_b->total_bw = 0;
}

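Sets the default dl bandwidth ratio so that dl scheduler requests are not granted 100% of the cpu bandwidth. (The integer 1M (1 << 20) represents 100%, and the default ratio is 95%.) On SMP systems this structure lives in the root domain; on UP systems it lives in the dl runqueue.

In code lines 5~6, if the global rt runtime is the infinite value RUNTIME_INF (0xffffffff_ffffffff), bw is set to -1.
In code lines 7~8, otherwise, bw is set to (global rt runtime << 20) / global rt period.
- e.g. with the global values period=1,000,000,000 and runtime=950,000,000, the result is 996,147, and 996,147 / 1M corresponds to 95%.
In code line 10, the total bandwidth is initialized to 0.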

global_rt_runtime()

kernel/sched/sched.h

static inline u64 global_rt_runtime(void)
{
        if (sysctl_sched_rt_runtime < 0)
                return RUNTIME_INF;

        return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;
}

Returns the global rt runtime in nanoseconds.

In code lines 3~4, if sysctl_sched_rt_runtime is set below 0, the infinite value RUNTIME_INF (0xffffffff_ffffffff) is returned.
- The default value of sysctl_sched_rt_runtime is 950,000 (us).
In code line 6, sysctl_sched_rt_runtime, which is in microseconds, is converted to nanoseconds and returned.

global_rt_period()

kernel/sched/sched.h

static inline u64 global_rt_period(void)
{
        return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
}

Returns the global rt period in nanoseconds.

sysctl_sched_rt_period, which is in microseconds, is converted to nanoseconds and returned.
- The default value of sysctl_sched_rt_period is 1,000,000 (us).
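
As a quick check of the unit conversion, the two helpers can be mirrored in userspace. This is only a sketch: the function names rt_runtime_ns_sketch() and rt_period_ns_sketch() are hypothetical, the sysctl values are passed in as parameters, and UINT64_MAX stands in for the kernel's RUNTIME_INF.

```c
#include <stdint.h>

#define NSEC_PER_USEC 1000ULL
#define RUNTIME_INF   UINT64_MAX        /* stand-in for the kernel's infinite value */

/* Microseconds to nanoseconds; a negative runtime means "no limit". */
static uint64_t rt_runtime_ns_sketch(int runtime_us)
{
        if (runtime_us < 0)
                return RUNTIME_INF;
        return (uint64_t)runtime_us * NSEC_PER_USEC;
}

/* The period has no "infinite" case, so it is a plain conversion. */
static uint64_t rt_period_ns_sketch(unsigned int period_us)
{
        return (uint64_t)period_us * NSEC_PER_USEC;
}
```

With the defaults quoted above, 950,000 us becomes 950,000,000 ns and 1,000,000 us becomes 1,000,000,000 ns, while a runtime of -1 maps to the infinite value.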

to_ratio()

kernel/sched/core.c

unsigned long to_ratio(u64 period, u64 runtime)
{
        if (runtime == RUNTIME_INF)
                return 1ULL << 20;

        /*
         * Doing this here saves a lot of checks in all
         * the calling paths, and returning zero seems
         * safe for them anyway.
         */
        if (period == 0)
                return 0;

        return div64_u64(runtime << 20, period);
}

Returns the runtime / period ratio as an integer scaled by 1M (1 << 20).

In code lines 3~4, if runtime is the infinite value RUNTIME_INF (0xffffffff_ffffffff), 1M is returned. (100%)
In code lines 11~12, if period is 0, 0 is returned. (0%)
In code line 14, (runtime << 20) / period is returned.
- e.g. with the global values period=1,000,000,000 and runtime=950,000,000, the result is 996,147, and 996,147 / 1M corresponds to 95%.
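
The fixed-point arithmetic can be reproduced outside the kernel. The sketch below assumes plain 64-bit division in place of div64_u64() and uses UINT64_MAX for RUNTIME_INF; the name to_ratio_sketch() is hypothetical.

```c
#include <stdint.h>

#define RUNTIME_INF UINT64_MAX  /* stand-in for the kernel's infinite value */

/* runtime / period as a fixed-point value where 1 << 20 means 100%. */
static uint64_t to_ratio_sketch(uint64_t period, uint64_t runtime)
{
        if (runtime == RUNTIME_INF)
                return 1ULL << 20;      /* unlimited runtime: 100% */

        if (period == 0)
                return 0;               /* avoid dividing by zero */

        return (runtime << 20) / period;
}
```

For period=1,000,000,000 and runtime=950,000,000 the function returns 996,147, matching the worked example above. The shift happens before the division so that the fractional part of the ratio is not lost to integer truncation.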

cpudl_init()

kernel/sched/cpudeadline.c

/*              
 * cpudl_init - initialize the cpudl structure
 * @cp: the cpudl max-heap context
 */
int cpudl_init(struct cpudl *cp)
{
        int i;
                
        memset(cp, 0, sizeof(*cp));
        raw_spin_lock_init(&cp->lock);
        cp->size = 0;
        
        cp->elements = kcalloc(nr_cpu_ids,
                               sizeof(struct cpudl_item),
                               GFP_KERNEL);
        if (!cp->elements)
                return -ENOMEM;

        if (!zalloc_cpumask_var(&cp->free_cpus, GFP_KERNEL)) {
                kfree(cp->elements);
                return -ENOMEM;
        }

        for_each_possible_cpu(i)
                cp->elements[i].idx = IDX_INVALID;

        return 0;
}

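Initializes the cpudl structure, which the dl scheduler uses for load balancing.

In code line 9, the cpudl structure is cleared to 0.
In code line 11, member size is set to 0.
In code lines 13~17, one cpudl_item structure per cpu is allocated and assigned to member elements.
In code lines 19~22, a cpu bitmap is allocated for member free_cpus.
In code lines 24~25, elements[].idx is initialized to IDX_INVALID (-1) for every possible cpu.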

cpupri_init()

kernel/sched/cpupri.c

/**
 * cpupri_init - initialize the cpupri structure
 * @cp: The cpupri context
 *      
 * Return: -ENOMEM on memory allocation failure.
 */     
int cpupri_init(struct cpupri *cp) 
{  
        int i;
 
        memset(cp, 0, sizeof(*cp));
        
        for (i = 0; i < CPUPRI_NR_PRIORITIES; i++) {
                struct cpupri_vec *vec = &cp->pri_to_cpu[i];
        
                atomic_set(&vec->count, 0);
                if (!zalloc_cpumask_var(&vec->mask, GFP_KERNEL))
                        goto cleanup;
        }

        cp->cpu_to_pri = kcalloc(nr_cpu_ids, sizeof(int), GFP_KERNEL);
        if (!cp->cpu_to_pri)
                goto cleanup;

        for_each_possible_cpu(i)
                cp->cpu_to_pri[i] = CPUPRI_INVALID;

        return 0;

cleanup:
        for (i--; i >= 0; i--)
                free_cpumask_var(cp->pri_to_cpu[i].mask);
        return -ENOMEM;
}

Initializes the cpupri structure, which the rt scheduler uses for load balancing.

In code line 11, the cpupri structure is cleared to 0.
In code lines 13~19, looping CPUPRI_NR_PRIORITIES (102) times, pri_to_cpu[].count is set to 0 and a cpumask is allocated for pri_to_cpu[].mask.
In code lines 21~26, an int array with one entry per cpu is allocated for member cpu_to_pri and every entry is set to CPUPRI_INVALID (-1).

Attaching a Runqueue to the Default Root Domain

rq_attach_root()

kernel/sched/core.c

static void rq_attach_root(struct rq *rq, struct root_domain *rd)
{
        struct root_domain *old_rd = NULL;
        unsigned long flags;

        raw_spin_lock_irqsave(&rq->lock, flags);

        if (rq->rd) {
                old_rd = rq->rd;

                if (cpumask_test_cpu(rq->cpu, old_rd->online))
                        set_rq_offline(rq);

                cpumask_clear_cpu(rq->cpu, old_rd->span);

                /*
                 * If we dont want to free the old_rd yet then
                 * set old_rd to NULL to skip the freeing later
                 * in this function:
                 */
                if (!atomic_dec_and_test(&old_rd->refcount))
                        old_rd = NULL;
        }

        atomic_inc(&rd->refcount);
        rq->rd = rd;

        cpumask_set_cpu(rq->cpu, rd->span);
        if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
                set_rq_online(rq);

        raw_spin_unlock_irqrestore(&rq->lock, flags);

        if (old_rd)
                call_rcu_sched(&old_rd->rcu, free_rootdomain);
}

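Attaches a root domain to the runqueue.

In code lines 8~9, if the runqueue already has a root domain, it is saved in old_rd.
In code lines 11~14, if the runqueue's cpu was online in the old root domain, the runqueue is taken offline by calling the offline hook of each scheduler class, which also clears the cpu's bit in online. The cpu's bit in span is then cleared.
In code lines 21~22, the reference counter of the old root domain is decremented. If it has not dropped to 0, the old root domain is still referenced elsewhere, so old_rd is set to NULL to skip the free at the end of the function.
In code lines 25~26, the reference counter of the requested root domain is incremented by 1 and it is set as the runqueue's root domain.
In code lines 28~30, the bit for the runqueue's cpu is set in the span cpumask of the root domain. If that cpu's bit is set in cpu_active_mask, the runqueue is brought online by calling the online hook of each scheduler class, which also sets the cpu's bit in online.
In code lines 34~35, if old_rd is still set, the old root domain is freed through RCU.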
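
The reference-count step is easy to read backwards, so here is a small userspace sketch of the same decision using C11 atomics. dec_and_test_sketch() is a hypothetical stand-in for the kernel's atomic_dec_and_test(), which returns true only when the counter reaches 0; detach_sketch() is likewise a made-up name for this illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Decrement the counter and report whether it reached 0. */
static bool dec_and_test_sketch(atomic_int *refcount)
{
        /* atomic_fetch_sub returns the value held before the subtraction */
        return atomic_fetch_sub(refcount, 1) == 1;
}

/* Returns true when the caller is the last user and should free the old domain. */
static bool detach_sketch(atomic_int *old_refcount)
{
        bool must_free = true;

        if (!dec_and_test_sketch(old_refcount))
                must_free = false;      /* still referenced: skip the free */

        return must_free;
}
```

With a counter of 2, the first detach reports that nothing should be freed, and only the second detach, which drops the counter to 0, reports that the old domain can be released.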

The following figure shows the root domain of a runqueue being replaced.

Freeing a Root Domain

free_rootdomain()

kernel/sched/core.c

static void free_rootdomain(struct rcu_head *rcu)
{
        struct root_domain *rd = container_of(rcu, struct root_domain, rcu);

        cpupri_cleanup(&rd->cpupri);
        cpudl_cleanup(&rd->cpudl);
        free_cpumask_var(rd->dlo_mask);
        free_cpumask_var(rd->rto_mask);
        free_cpumask_var(rd->online);
        free_cpumask_var(rd->span);
        kfree(rd);
}

Frees the root domain that was queued through its rcu head.

cpupri_cleanup()

kernel/sched/cpupri.c

/**
 * cpupri_cleanup - clean up the cpupri structure
 * @cp: The cpupri context
 */
void cpupri_cleanup(struct cpupri *cp)
{
        int i;

        kfree(cp->cpu_to_pri);
        for (i = 0; i < CPUPRI_NR_PRIORITIES; i++)
                free_cpumask_var(cp->pri_to_cpu[i].mask);
}

Frees the memory allocated for the cpupri member cpu_to_pri, then loops 102 times to free the cpu mask allocated for each pri_to_cpu[].mask.

cpudl_cleanup()

kernel/sched/cpudeadline.c

/*
 * cpudl_cleanup - clean up the cpudl structure
 * @cp: the cpudl max-heap context
 */             
void cpudl_cleanup(struct cpudl *cp)
{       
        free_cpumask_var(cp->free_cpus);
        kfree(cp->elements);
}

Frees the cpu mask allocated for the cpudl member free_cpus and also frees the memory allocated for elements.

Initializing the Autogroup

autogroup_init()

kernel/sched/auto_group.c

void __init autogroup_init(struct task_struct *init_task)
{
        autogroup_default.tg = &root_task_group;
        kref_init(&autogroup_default.kref);
        init_rwsem(&autogroup_default.lock);
        init_task->signal->autogroup = &autogroup_default;
}

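On kernels with automatic group scheduling support, the root task group is set as the task group of the default autogroup.

In code line 3, the root task group is set as the task group of the default autogroup.
In code line 6, the default autogroup is set as the autogroup of init_task.

Initializing the cfs, rt, and dl Runqueues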

init_cfs_rq()

kernel/sched/fair.c

void init_cfs_rq(struct cfs_rq *cfs_rq)
{               
        cfs_rq->tasks_timeline = RB_ROOT;
        cfs_rq->min_vruntime = (u64)(-(1LL << 20));
#ifndef CONFIG_64BIT
        cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
#endif          
#ifdef CONFIG_SMP
        atomic64_set(&cfs_rq->decay_counter, 1);
        atomic_long_set(&cfs_rq->removed_load, 0);
#endif
}

Initializes the cfs runqueue structure.

In code line 3, tasks_timeline, the RB tree in which scheduling entities are sorted by vruntime, is set to RB_ROOT.
In code line 4, min_vruntime is set to 0xffff_ffff_fff0_0000 (-1M).
In code lines 8~11, on smp systems, decay_counter is set to 1 and removed_load is set to 0.
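
Starting min_vruntime 1M below zero means the unsigned counter wraps around shortly after it starts advancing. The sketch below shows the initial value and a wraparound-safe ordering test based on the signed difference. It assumes, as a hedged reading of the design, that vruntime comparisons are done on signed differences; the function names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* Same expression as in init_cfs_rq(): -(1 << 20) stored in an unsigned 64-bit value. */
static uint64_t initial_min_vruntime_sketch(void)
{
        return (uint64_t)(-(1LL << 20));
}

/* Wraparound-safe "a is before b": look at the sign of the difference. */
static bool vruntime_before_sketch(uint64_t a, uint64_t b)
{
        return (int64_t)(a - b) < 0;
}
```

Advancing the initial value by 2M wraps it to 1M, which is numerically smaller as an unsigned number, yet vruntime_before_sketch() still orders the initial value first.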

init_rt_rq()

kernel/sched/rt.c

void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
{
        struct rt_prio_array *array;
        int i;
                
        array = &rt_rq->active;
        for (i = 0; i < MAX_RT_PRIO; i++) {
                INIT_LIST_HEAD(array->queue + i); 
                __clear_bit(i, array->bitmap);
        }
        /* delimiter for bitsearch: */
        __set_bit(MAX_RT_PRIO, array->bitmap);

#if defined CONFIG_SMP
        rt_rq->highest_prio.curr = MAX_RT_PRIO;
        rt_rq->highest_prio.next = MAX_RT_PRIO;
        rt_rq->rt_nr_migratory = 0;
        rt_rq->overloaded = 0;
        plist_head_init(&rt_rq->pushable_tasks);
#endif
        /* We start is dequeued state, because no RT tasks are queued */
        rt_rq->rt_queued = 0;

        rt_rq->rt_time = 0;
        rt_rq->rt_throttled = 0;
        rt_rq->rt_runtime = 0;
        raw_spin_lock_init(&rt_rq->rt_runtime_lock);
}

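Initializes the rt runqueue structure.

In code lines 6~10, the priority queue of the rt runqueue is initialized: for each of the 100 priorities, the active->queue[] list is initialized and its bitmap bit is cleared.
In code line 12, the 101st bitmap bit (bit 100) is set to 1 to serve as a delimiter for bit searches.
In code lines 15~16, highest_prio.curr and next are set to the priority value MAX_RT_PRIO (100).
In code lines 17~18, rt_nr_migratory and overloaded are set to 0.
In code line 19, the pushable_tasks list is initialized.
In code line 22, rt_queued is set to 0, indicating that no task is currently queued.
In code lines 24~26, rt_time, rt_throttled, and rt_runtime are initialized to 0.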
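
The role of the delimiter bit is easier to see with a small model of the priority bitmap. The sketch below keeps only the bitmap (no lists) and uses a simple linear scan where a real implementation would use a find-first-bit primitive; the struct and function names are hypothetical.

```c
#include <limits.h>
#include <string.h>

#define MAX_RT_PRIO   100
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_WORDS  ((MAX_RT_PRIO + 1 + BITS_PER_WORD - 1) / BITS_PER_WORD)

struct rt_bitmap_sketch {
        unsigned long bits[BITMAP_WORDS];       /* bits 0..99 plus delimiter bit 100 */
};

static void rt_bitmap_init(struct rt_bitmap_sketch *b)
{
        memset(b->bits, 0, sizeof(b->bits));
        /* delimiter: a search always terminates at bit MAX_RT_PRIO */
        b->bits[MAX_RT_PRIO / BITS_PER_WORD] |= 1UL << (MAX_RT_PRIO % BITS_PER_WORD);
}

static void rt_bitmap_set(struct rt_bitmap_sketch *b, int prio)
{
        b->bits[prio / BITS_PER_WORD] |= 1UL << (prio % BITS_PER_WORD);
}

/* Lowest set bit = highest rt priority; MAX_RT_PRIO means no task is queued. */
static int rt_bitmap_first(const struct rt_bitmap_sketch *b)
{
        int i;

        for (i = 0; i < MAX_RT_PRIO; i++)
                if (b->bits[i / BITS_PER_WORD] & (1UL << (i % BITS_PER_WORD)))
                        return i;
        return MAX_RT_PRIO;
}
```

On a freshly initialized bitmap the search returns 100, the same value init_rt_rq() stores in highest_prio.curr. After priorities 60 and 50 are marked, the search returns 50, the higher of the two priorities.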

init_dl_rq()

kernel/sched/deadline.c

void init_dl_rq(struct dl_rq *dl_rq, struct rq *rq)
{
        dl_rq->rb_root = RB_ROOT;

#ifdef CONFIG_SMP
        /* zero means no -deadline tasks */
        dl_rq->earliest_dl.curr = dl_rq->earliest_dl.next = 0;

        dl_rq->dl_nr_migratory = 0;
        dl_rq->overloaded = 0;
        dl_rq->pushable_dl_tasks_root = RB_ROOT;
#else
        init_dl_bw(&dl_rq->dl_bw);
#endif
}

Initializes the dl runqueue structure.

In code line 3, rb_root, the RB tree in which dl scheduling entities are sorted by deadline, is set to RB_ROOT.
In code line 7, earliest_dl.curr and next are set to 0, meaning no dl task is queued.
In code lines 9~10, dl_nr_migratory and overloaded are set to 0.
In code line 11, the pushable_dl_tasks_root RB tree is initialized.
In code line 13, on UP systems, the dl bandwidth ratio is set up.

Initializing hrtick

init_rq_hrtick()

kernel/sched/core.c

static void init_rq_hrtick(struct rq *rq)
{
#ifdef CONFIG_SMP
        rq->hrtick_csd_pending = 0;

        rq->hrtick_csd.flags = 0;
        rq->hrtick_csd.func = __hrtick_start;
        rq->hrtick_csd.info = rq;
#endif

        hrtimer_init(&rq->hrtick_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
        rq->hrtick_timer.function = hrtick;
}

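Initializes the hrtick of the runqueue.

In code lines 3~9, on smp systems, the function to be called for hrtick through an IPI is set up, with the runqueue as its argument.
In code lines 11~12, hrtick_timer is initialized and the function to call when the hrtick expires is set.

Initializing the Idle Thread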

init_idle()

kernel/sched/core.c

/**
 * init_idle - set up an idle thread for a given CPU
 * @idle: task in question
 * @cpu: cpu the idle task belongs to
 *
 * NOTE: this function does not set the idle thread's NEED_RESCHED
 * flag, to make booting more robust.
 */
void init_idle(struct task_struct *idle, int cpu)
{
        struct rq *rq = cpu_rq(cpu);
        unsigned long flags;

        raw_spin_lock_irqsave(&rq->lock, flags);

        __sched_fork(0, idle);
        idle->state = TASK_RUNNING;
        idle->se.exec_start = sched_clock();

        do_set_cpus_allowed(idle, cpumask_of(cpu));
        /*
         * We're having a chicken and egg problem, even though we are
         * holding rq->lock, the cpu isn't yet set to this cpu so the
         * lockdep check in task_group() will fail.
         *
         * Similar case to sched_fork(). / Alternatively we could
         * use task_rq_lock() here and obtain the other rq->lock.
         *
         * Silence PROVE_RCU
         */
        rcu_read_lock();
        __set_task_cpu(idle, cpu);
        rcu_read_unlock();

        rq->curr = rq->idle = idle;
        idle->on_rq = TASK_ON_RQ_QUEUED;
#if defined(CONFIG_SMP)
        idle->on_cpu = 1;
#endif
        raw_spin_unlock_irqrestore(&rq->lock, flags);

        /* Set the preempt count _outside_ the spinlocks! */
        init_idle_preempt_count(idle, cpu);

        /*
         * The idle tasks have their own, simple scheduling class:
         */
        idle->sched_class = &idle_sched_class;
        ftrace_graph_init_idle_task(idle, cpu);
        vtime_init_idle(idle, cpu);
#if defined(CONFIG_SMP)
        sprintf(idle->comm, "%s/%d", INIT_TASK_COMM, cpu);
#endif
}

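Sets up the idle thread for the given cpu and registers it with the idle scheduler.

In code line 16, the members of the idle task's cfs, dl, and rt scheduling entities and its numa balancing fields are initialized.
In code lines 17~18, the state of the idle task is set to TASK_RUNNING and the execution start time of its scheduling entity is initialized to the current time.
In code line 20, the idle task is restricted to run only on the given cpu.
In code line 32, the cfs_rq and parent entity of the idle task are set.
In code line 35, the given task is set as both the currently running task and the idle task of the runqueue.
In code line 36, on_rq of the idle task is set to TASK_ON_RQ_QUEUED (1), meaning it is on the runqueue.
In code line 37, on_cpu of the idle task is set to 1.
In code line 43, the preempt counter is set to 0 so that preemption is possible.
In code line 48, the idle task is set to use the idle scheduler.
In code line 50, vtime is initialized for the idle task so that cpu time can be measured under full dynticks.

The following figure shows how init_idle() is processed.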

__sched_fork()

kernel/sched/core.c

/*
 * Perform scheduler related setup for a newly forked process p.
 * p is forked by current.
 *
 * __sched_fork() is basic setup used by init_idle() too:
 */
static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
{
        p->on_rq                        = 0;

        p->se.on_rq                     = 0;
        p->se.exec_start                = 0;
        p->se.sum_exec_runtime          = 0;
        p->se.prev_sum_exec_runtime     = 0;
        p->se.nr_migrations             = 0;
        p->se.vruntime                  = 0;
#ifdef CONFIG_SMP
        p->se.avg.decay_count           = 0;
#endif
        INIT_LIST_HEAD(&p->se.group_node);

#ifdef CONFIG_SCHEDSTATS
        memset(&p->se.statistics, 0, sizeof(p->se.statistics));
#endif

        RB_CLEAR_NODE(&p->dl.rb_node);
        init_dl_task_timer(&p->dl);
        __dl_clear_params(p);

        INIT_LIST_HEAD(&p->rt.run_list);

#ifdef CONFIG_PREEMPT_NOTIFIERS
        INIT_HLIST_HEAD(&p->preempt_notifiers);
#endif

#ifdef CONFIG_NUMA_BALANCING
        if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
                p->mm->numa_next_scan = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
                p->mm->numa_scan_seq = 0;
        }

        if (clone_flags & CLONE_VM)
                p->numa_preferred_nid = current->numa_preferred_nid;
        else
                p->numa_preferred_nid = -1;

        p->node_stamp = 0ULL;
        p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
        p->numa_scan_period = sysctl_numa_balancing_scan_delay;
        p->numa_work.next = &p->numa_work;
        p->numa_faults = NULL;
        p->last_task_numa_placement = 0;
        p->last_sum_exec_runtime = 0;

        p->numa_group = NULL;
#endif /* CONFIG_NUMA_BALANCING */
}

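Initializes the members of the cfs, dl, and rt scheduling entities of a forked task, along with its numa balancing fields. The task has just been forked from the current task, and this function is called from the following two places.

kernel/fork.c – copy_process() -> sched_fork()
kernel/sched/core.c – init_idle()

In code line 9, on_rq is set to 0, meaning the task is not on a runqueue.
In code lines 11~20, the cfs scheduling entity fields are cleared to 0 and the group_node list is initialized.
In code line 26, rb_node of the dl scheduling entity is cleared, marking the entity as not queued on the dl scheduler's RB tree.
In code line 27, the dl task timer is initialized.
In code line 28, the parameters of the dl scheduling entity are initialized.
In code line 30, run_list of the rt scheduling entity is initialized.
In code lines 32~34, the preempt_notifiers list is initialized; functions registered on it are called when the task is preempted.
In code lines 36~56, numa balancing is set up; its description is omitted here.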

init_dl_task_timer()

kernel/sched/deadline.c

void init_dl_task_timer(struct sched_dl_entity *dl_se)
{
        struct hrtimer *timer = &dl_se->dl_timer;

        hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
        timer->function = dl_task_timer;
}

Initializes the hrtimer of the dl scheduling entity and sets the function to be called on expiry.

__dl_clear_params()

kernel/sched/core.c

void __dl_clear_params(struct task_struct *p)
{
        struct sched_dl_entity *dl_se = &p->dl;

        dl_se->dl_runtime = 0;
        dl_se->dl_deadline = 0;
        dl_se->dl_period = 0;
        dl_se->flags = 0;
        dl_se->dl_bw = 0;

        dl_se->dl_throttled = 0;
        dl_se->dl_new = 1;
        dl_se->dl_yielded = 0;
}

Initializes the parameters of the dl scheduling entity.

do_set_cpus_allowed()

kernel/sched/core.c

void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
{
        if (p->sched_class->set_cpus_allowed)
                p->sched_class->set_cpus_allowed(p, new_mask);

        cpumask_copy(&p->cpus_allowed, new_mask);
        p->nr_cpus_allowed = cpumask_weight(new_mask);
}

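Sets the cpus on which the given task may run. The task's scheduling class is also notified through its hook, if it implements one.

In code lines 3~4, the (*set_cpus_allowed) hook is called if the task's scheduling class implements it.
- Currently the rt scheduler implements set_cpus_allowed_rt() and the dl scheduler implements set_cpus_allowed_dl().
In code line 6, new_mask is copied into cpus_allowed of the task.
In code line 7, the number of cpus set in new_mask is stored in nr_cpus_allowed of the task.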

__set_task_cpu()

kernel/sched/sched.h

static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
{
        set_task_rq(p, cpu);
#ifdef CONFIG_SMP
        /*
         * After ->cpu is set up to a new value, task_rq_lock(p, ...) can be
         * successfuly executed on another CPU. We must ensure that updates of
         * per-task data have been completed by this moment.
         */
        smp_wmb();
        task_thread_info(p)->cpu = cpu;
        p->wake_cpu = cpu;
#endif
}

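Sets the task's runqueue and parent entity for the given cpu and records the cpu number in the task.

In code line 3, the cfs_rq and parent entity of the task are set.
- During boot, this makes init_task point to the cfs runqueue managed by its task group.
In code lines 4~13, on smp systems, the requested cpu number is stored in the cpu member of the task's thread_info and in the wake_cpu member of the task.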

set_task_rq()

kernel/sched/sched.h

/* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
{
#if defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED)
        struct task_group *tg = task_group(p);
#endif

#ifdef CONFIG_FAIR_GROUP_SCHED
        p->se.cfs_rq = tg->cfs_rq[cpu];
        p->se.parent = tg->se[cpu];
#endif

#ifdef CONFIG_RT_GROUP_SCHED
        p->rt.rt_rq  = tg->rt_rq[cpu];
        p->rt.parent = tg->rt_se[cpu];
#endif
}

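Sets the cfs_rq and rt_rq of the task along with their parent entities.

In code lines 4~6, if cfs group scheduling or rt group scheduling is supported, the task group of the task is looked up.
In code lines 8~11, if cfs group scheduling is supported, the cfs runqueue and parent of the task's scheduling entity are set to the cfs runqueue and scheduling entity of the task group for this cpu.
In code lines 13~16, if rt group scheduling is supported, the rt runqueue and parent of the task's rt scheduling entity are set to the rt runqueue and rt scheduling entity of the task group for this cpu.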

init_idle_preempt_count()

include/asm-generic/preempt.h

#define init_idle_preempt_count(p, cpu) do { \
        task_thread_info(p)->preempt_count = PREEMPT_ENABLED; \
} while (0)

Sets the preempt counter to PREEMPT_ENABLED (0) so that preemption is possible.

Other Initialization

set_load_weight()

kernel/sched/core.c

static void set_load_weight(struct task_struct *p)
{
        int prio = p->static_prio - MAX_RT_PRIO;
        struct load_weight *load = &p->se.load;

        /*
         * SCHED_IDLE tasks get minimal weight:
         */
        if (p->policy == SCHED_IDLE) {
                load->weight = scale_load(WEIGHT_IDLEPRIO);
                load->inv_weight = WMULT_IDLEPRIO;
                return;
        }

        load->weight = scale_load(prio_to_weight[prio]);
        load->inv_weight = prio_to_wmult[prio];
}

Sets the load weight from the task's static priority. (A task with the SCHED_IDLE policy gets the lowest load weight, 3.)

In code line 3, to index the weight table of the 40 nice priorities, the task's static priority, which lies in 100~139, minus MAX_RT_PRIO (100) is stored in prio.
In code lines 9~13, if the task uses the SCHED_IDLE policy, WEIGHT_IDLEPRIO (3) is stored as the load weight of the cfs scheduling entity, the matching WMULT_IDLEPRIO (1431655765) is stored as inv_weight, and the function returns.
In code lines 15~16, the weight and inv_weight for prio, which ranges from 0 to 39, are stored in the load of the cfs scheduling entity.
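
The mapping from static priority to weight can be tried out in userspace. The sketch below reproduces the 40-entry nice-to-weight table as it appears in kernels of this era and computes the inverse weight as roughly 2^32 / weight. The function names are hypothetical, scale_load() is treated as a no-op, and the kernel's precomputed inverse table may differ from this division by rounding.

```c
#include <stdint.h>

#define MAX_RT_PRIO 100

/* nice -20 .. 19; each step changes the weight by roughly 25% */
static const int prio_to_weight_sketch[40] = {
        88761, 71755, 56483, 46273, 36291,
        29154, 23254, 18705, 14949, 11916,
         9548,  7620,  6100,  4904,  3906,
         3121,  2501,  1991,  1586,  1277,
         1024,   820,   655,   526,   423,
          335,   272,   215,   172,   137,
          110,    87,    70,    56,    45,
           36,    29,    23,    18,    15,
};

/* static_prio 100..139 maps to table index 0..39 */
static int weight_for_static_prio(int static_prio)
{
        return prio_to_weight_sketch[static_prio - MAX_RT_PRIO];
}

/* inverse weight, so that a division can later be replaced by multiply and shift */
static uint32_t inv_weight_sketch(uint32_t weight)
{
        return (uint32_t)((1ULL << 32) / weight);
}
```

A static priority of 120 (nice 0) yields a weight of 1024, and the inverse of the SCHED_IDLE weight 3 evaluates to 1431655765, the WMULT_IDLEPRIO value quoted above.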

set_cpu_rq_start_time()

kernel/sched/core.c

static void __cpuinit set_cpu_rq_start_time(void)
{
        int cpu = smp_processor_id();
        struct rq *rq = cpu_rq(cpu);
        rq->age_stamp = sched_clock_cpu(cpu);
}

Stores the current scheduler clock in age_stamp of the current runqueue.

init_sched_fair_class()

kernel/sched/fair.c

__init void init_sched_fair_class(void)
{
#ifdef CONFIG_SMP
        open_softirq(SCHED_SOFTIRQ, run_rebalance_domains);

#ifdef CONFIG_NO_HZ_COMMON
        nohz.next_balance = jiffies;
        zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
        cpu_notifier(sched_ilb_notifier, 0);
#endif
#endif /* SMP */

}

Initializes the cfs scheduler on smp systems.

In code line 4, run_rebalance_domains() is registered as the handler called when SCHED_SOFTIRQ is raised.
In code lines 6~10, when nohz idle is supported, nohz.next_balance is set to the current time (jiffies) and a cpumask is allocated for idle_cpus_mask. Finally, sched_ilb_notifier() is registered as a cpu notifier.

sched_ilb_notifier()

kernel/sched/fair.c

static int sched_ilb_notifier(struct notifier_block *nfb,
                                        unsigned long action, void *hcpu)
{
        switch (action & ~CPU_TASKS_FROZEN) {
        case CPU_DYING:
                nohz_balance_exit_idle(smp_processor_id());
                return NOTIFY_OK;
        default:
                return NOTIFY_DONE;
        }
}

When the cpu enters the dying state (with the frozen flag masked off), nohz_balance_exit_idle() is called so that the cpu leaves the nohz idle state.
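
Masking with ~CPU_TASKS_FROZEN lets one case label handle both the normal and the frozen (suspend/resume) variant of a hotplug event. The sketch below illustrates this; the numeric values are assumed to mirror the cpu notifier constants of kernels from this era and should be treated as illustrative, and is_dying_sketch() is a hypothetical name.

```c
#include <stdbool.h>

/* Assumed values for illustration; check the kernel headers of the version in use. */
#define CPU_ONLINE        0x0002
#define CPU_DYING         0x0008
#define CPU_TASKS_FROZEN  0x0010
#define CPU_DYING_FROZEN  (CPU_DYING | CPU_TASKS_FROZEN)

/* True for CPU_DYING regardless of whether the frozen bit is set. */
static bool is_dying_sketch(unsigned long action)
{
        return (action & ~CPU_TASKS_FROZEN) == CPU_DYING;
}
```

Both CPU_DYING and CPU_DYING_FROZEN pass the test, while an unrelated action such as CPU_ONLINE does not.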

nohz_balance_exit_idle()

kernel/sched/fair.c

static inline void nohz_balance_exit_idle(int cpu)
{
        if (unlikely(test_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu)))) {
                /*
                 * Completely isolated CPUs don't ever set, so we must test.
                 */
                if (likely(cpumask_test_cpu(cpu, nohz.idle_cpus_mask))) {
                        cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
                        atomic_dec(&nohz.nr_cpus);
                }
                clear_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu));
        }
}

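Takes the given cpu out of the nohz idle state.

In code lines 3 and 11, if (with low probability) the NOHZ_TICK_STOPPED flag is set in the nohz_flags of the cpu's runqueue, clears that flag.
In code lines 7~10, if the given cpu is set in nohz.idle_cpus_mask, clears its bit and decrements nohz.nr_cpus by 1. Completely isolated cpus never set this bit, which is why the test is needed.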

References

Scheduler -1- (Basic) | 문c
Scheduler -2- (Global Cpu Load) | 문c
Scheduler -3- (PELT) | 문c
Scheduler -4- (Group Scheduling) | 문c
Scheduler -5- (Scheduler Core) | 문c
Scheduler -6- (CFS Scheduler) | 문c
Scheduler -7- (Preemption & Context Switch) | 문c
Scheduler -8- (CFS Bandwidth) | 문c
Scheduler -9- (RT Scheduler) | 문c
Scheduler -10- (Deadline Scheduler) | 문c
Scheduler -11- (Stop Scheduler) | 문c
Scheduler -12- (Idle Scheduler) | 문c
Scheduler -13- (Scheduling Domain 1) | 문c
Scheduler -14- (Scheduling Domain 2) | 문c
Scheduler -15- (Load Balance 1) | 문c
Scheduler -16- (Load Balance 2) | 문c
Scheduler -17- (Load Balance 3 NUMA) | 문c
Scheduler -18- (Load Balance 4 EAS) | 문c
Scheduler -19- (Initialization) | 문c - this post
PID 관리 | 문c
do_fork() | 문c
cpu_startup_entry() | 문c
Runqueue load average (cpu_load[]) - v4.0 | 문c
PELT(Per-Entity Load Tracking) – v4.0 | 문c

Scheduler -6- (CFS Scheduler)

2017-05-12, 2021-01-11 문영일

Runtime calculation

Time slice calculation

sched_slice()

kernel/sched/fair.c

/*
 * We calculate the wall-time slice from the period by taking a part
 * proportional to the weight.
 *
 * s = p*P[w/rw]
 */

static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
        u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq);

        for_each_sched_entity(se) {
                struct load_weight *load;
                struct load_weight lw;

                cfs_rq = cfs_rq_of(se);
                load = &cfs_rq->load;

                if (unlikely(!se->on_rq)) {
                        lw = cfs_rq->load;

                        update_load_add(&lw, se->load.weight);
                        load = &lw;
                }
                slice = __calc_delta(slice, se->load.weight, load);
        }
        return slice;
}

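Computes the time slice by applying the entity's weight ratio at each level, up to the top-level group entity.

In code line 3, obtains the scheduling latency (period) from the number of tasks running on the cfs runqueue, plus 1 if the entity is not yet on the runqueue.
In code line 5, iterates up to the top-level group entity.
In code lines 9~10, makes the load pointer refer to the load of the cfs runqueue that the entity belongs to.
In code lines 12~17, if (with low probability) the entity is not on the cfs runqueue, computes load as the cfs runqueue load plus this entity's load.
In code line 18, multiplies the time slice by the ratio of the entity's weight to the cfs runqueue's weight and stores the result in slice.
In code line 20, returns the final time slice.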

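The following figure shows how the time slice is calculated for an entity that is about to be added, without group scheduling.

The load weights of 5 schedule entities are already summed in the runqueue, and the schedule entity that is not yet on the runqueue is added to that sum before the time slice is calculated.
On 64-bit systems, the load weights of entities and cfs runqueues are additionally scaled up (by 10 bits).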
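The flat (no group scheduling) case described above can be sketched in plain C. This is only an illustration under stated assumptions: `slice_ns()` is a hypothetical helper, it uses an ordinary division instead of the kernel's inv_weight multiply-and-shift, and the weights are the scaled-down values (nice 0 = 1024).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical flat model of sched_slice(): one level only, plain division.
 * period_ns : scheduling period returned by __sched_period()
 * se_weight : load weight of the entity
 * rq_weight : summed load weight already on the cfs runqueue
 * se_on_rq  : non-zero if the entity's weight is already part of rq_weight
 */
static uint64_t slice_ns(uint64_t period_ns, uint64_t se_weight,
                         uint64_t rq_weight, int se_on_rq)
{
    /* An entity that is not yet enqueued adds its own weight first. */
    uint64_t total = se_on_rq ? rq_weight : rq_weight + se_weight;

    assert(total > 0);
    return period_ns * se_weight / total;
}
```

With 5 nice-0 entities queued (5 * 1024) and a sixth nice-0 entity being added, an 18ms period gives each of the six entities a 3ms slice.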

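The following figure shows how the time slice is calculated for an entity that is about to be added when child groups A and B are used for group scheduling.

The following figure shows in detail how time slices are distributed among tasks when group scheduling is used.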

__sched_period()

kernel/sched/fair.c

/*
 * The idea is to set a period in which each task runs once.
 *
 * When there are too many tasks (sched_nr_latency) we have to stretch
 * this period because otherwise the slices get too small.
 *
 * p = (nr <= nl) ? l : l*nr/nl
 */

static u64 __sched_period(unsigned long nr_running)
{
        if (unlikely(nr_running > sched_nr_latency))
                return nr_running * sysctl_sched_min_granularity;
        else
                return sysctl_sched_latency;
}

Computes the scheduling latency used for task scheduling.

In code lines 3~4, when the number of tasks @nr_running exceeds sched_nr_latency, the scheduling latency is stretched as follows.
- nr_running * minimum task scheduling granularity (0.75ms * factor)
- default factor = 1 + ilog2(number of cpus)
- sched_nr_latency
  - sysctl_sched_latency / sysctl_sched_min_granularity
  - The default value is 8.
In code lines 5~6, when the task count is within that limit, returns the scheduling latency unchanged.
- default scheduling latency = 6,000,000(ns) * factor
- e.g. rpi2, 3 and 4 have 4 cpus, so 6,000,000(ns) * factor(3) = 18,000,000(ns)

The following are the default settings related to the scheduling latency calculation on rpi2, 3 and 4 systems.

$ cat /proc/sys/kernel/sched_latency_ns
18000000

$ cat /proc/sys/kernel/sched_min_granularity_ns
2250000
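As a rough check of the numbers above, the period rule can be written as a small standalone function. `sched_period_ns()` is a hypothetical name; the latency and granularity are passed in as parameters here rather than read from the sysctl variables, and the task limit is rounded up the same way sched_proc_update_handler() computes sched_nr_latency.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the __sched_period() rule with explicit tunables (assumption). */
static uint64_t sched_period_ns(unsigned long nr_running,
                                uint64_t latency_ns, uint64_t min_gran_ns)
{
    uint64_t nr_latency;

    assert(min_gran_ns > 0);
    /* Number of tasks that fit into one latency period, rounded up. */
    nr_latency = (latency_ns + min_gran_ns - 1) / min_gran_ns;

    if (nr_running > nr_latency)
        return (uint64_t)nr_running * min_gran_ns;  /* stretch the period */
    return latency_ns;
}
```

With the rpi values shown above (18,000,000 and 2,250,000), up to 8 tasks share the 18ms period, and from the 9th task on the period grows by 2.25ms per task.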

update_load_add()

kernel/sched/fair.c

static inline void update_load_add(struct load_weight *lw, unsigned long inc)
{
        lw->weight += inc;
        lw->inv_weight = 0;
}

Adds inc to the load weight and resets inv_weight to 0, so that __update_inv_weight() recomputes it the next time it is needed.

__calc_delta()

kernel/sched/fair.c

/*
 * delta_exec * weight / lw.weight
 *   OR
 * (delta_exec * (weight * lw->inv_weight)) >> WMULT_SHIFT
 *
 * Either weight := NICE_0_LOAD and lw \e prio_to_wmult[], in which case
 * we're guaranteed shift stays positive because inv_weight is guaranteed to
 * fit 32 bits, and NICE_0_LOAD gives another 10 bits; therefore shift >= 22.
 *
 * Or, weight =< lw.weight (because lw.weight is the runqueue weight), thus
 * weight/lw.weight <= 1, and therefore our shift will also be positive.
 */

static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
{
        u64 fact = scale_load_down(weight);
        int shift = WMULT_SHIFT;

        __update_inv_weight(lw);

        if (unlikely(fact >> 32)) {
                while (fact >> 32) {
                        fact >>= 1;
                        shift--;
                }
        }

        /* hint to use a 32x32->64 mul */
        fact = (u64)(u32)fact * lw->inv_weight;

        while (fact >> 32) {
                fact >>= 1;
                shift--;
        }

        return mul_u64_u32_shr(delta_exec, fact, shift);
}

Computes and returns the weighted share of a time period. (delta_exec * weight / lw.weight)

In code line 3, scales the weight down and assigns it to fact. 32-bit architectures do not scale down.
- Since kernel v4.7-rc1, 64-bit architectures can use a higher resolution for weight.
In code line 6, assigns 0xffff_ffff / lw->weight to lw->inv_weight.
In code lines 8~13, if (with low probability) fact does not fit in 32 bits, shifts fact right until it fits and subtracts the number of shifts from the initial shift value (32).
- On 32-bit architectures fact never exceeds 32 bits, so this routine is skipped.
- e.g. 0x3f_ffff_ffff -> shift=32-6=26
In code line 16, multiplies fact by the precomputed inv_weight in order to avoid a division.
- delta_exec * weight / lw.weight = (delta_exec * weight * lw.inv_weight) >> 32
In code lines 18~21, works out how many of the 32 shift bits remain, shifting fact right once per loop iteration.
In code line 23, returns (delta_exec * fact) >> shift.
- In the end this is equivalent, up to rounding, to (delta_exec * weight * lw.inv_weight) >> 32.
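The multiply-and-shift trick can be tried outside the kernel with a simplified model. The sketch below is not the kernel code: `inv_weight_of()` and `calc_delta()` are hypothetical names, weight is limited to 32 bits (so the first narrowing loop is not needed), and delta is limited to 32 bits so that a plain 64-bit multiply cannot overflow. The kernel uses mul_u64_u32_shr() to stay exact for larger values.

```c
#include <assert.h>
#include <stdint.h>

#define WMULT_CONST 0xffffffffu
#define WMULT_SHIFT 32

/* Reciprocal of the scaled-down queue weight, following __update_inv_weight(). */
static uint32_t inv_weight_of(uint64_t w)
{
    if (w >= WMULT_CONST)
        return 1;
    if (w == 0)
        return WMULT_CONST;
    return (uint32_t)(WMULT_CONST / w);
}

/* Approximates delta * weight / lw_weight without dividing by lw_weight. */
static uint64_t calc_delta(uint64_t delta, uint32_t weight, uint64_t lw_weight)
{
    uint64_t fact = (uint64_t)weight * inv_weight_of(lw_weight);
    int shift = WMULT_SHIFT;

    /* Simplification of this sketch: keep the final product within 64 bits. */
    assert(delta < (1ULL << 32));

    /* Keep fact within 32 bits; each dropped bit is one less final shift. */
    while (fact >> 32) {
        fact >>= 1;
        shift--;
    }
    return (delta * fact) >> shift;
}
```

Because inv_weight is a truncated reciprocal, the result can come out a few nanoseconds below the exact quotient; for an 18ms period, weight 1024 and queue weight 6144 it lands just under 3,000,000.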

kernel/sched/fair.c

#define WMULT_CONST     (~0U)
#define WMULT_SHIFT     32

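The following figure shows the time slice computed for each of 5 scheduled tasks.

The formula in the green box uses division. (mathematical model)
The formula in the red box uses only multiplication and shifts, without division, for faster processing. (implementation model)
The weight values are scaled down. (64-bit system)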

__update_inv_weight()

kernel/sched/fair.c

static void __update_inv_weight(struct load_weight *lw)
{
        unsigned long w;

        if (likely(lw->inv_weight))
                return;

        w = scale_load_down(lw->weight);

        if (BITS_PER_LONG > 32 && unlikely(w >= WMULT_CONST))
                lw->inv_weight = 1;
        else if (unlikely(!w))
                lw->inv_weight = WMULT_CONST;
        else
                lw->inv_weight = WMULT_CONST / w;
}

Assigns 0xffff_ffff / lw->weight to lw->inv_weight.

The computed inv_weight is an inverted weight: instead of dividing by weight, a value is multiplied by inv_weight and then shifted right by 32 bits.
When a division such as ‘delta_exec / weight’ is needed, ‘delta_exec * wmult >> 32’ can be used in its place.
- wmult is made from ‘2^32 / weight’.

In code lines 5~6, returns if lw->inv_weight is already set.
In code line 8, scales lw->weight down into w. 32-bit architectures do not scale the load, so there lw->weight is used as is.
In code lines 10~15, updates lw->inv_weight to one of the following, depending on w.
- 1 if the system is 64-bit and (with low probability) w is 0xffff_ffff or more
- 0xffff_ffff if (with low probability) w is 0
- 0xffff_ffff / w otherwise

mul_u64_u32_shr()

include/linux/math64.h

#if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
#ifndef mul_u64_u32_shr
static inline u64 mul_u64_u32_shr(u64 a, u32 mul, unsigned int shift)
{
        return (u64)(((unsigned __int128)a * mul) >> shift);
}
#endif /* mul_u64_u32_shr */
#else   
#ifndef mul_u64_u32_shr
static inline u64 mul_u64_u32_shr(u64 a, u32 mul, unsigned int shift)
{
        u32 ah, al;
        u64 ret;

        al = a;
        ah = a >> 32;

        ret = ((u64)al * mul) >> shift;
        if (ah)
                ret += ((u64)ah * mul) << (32 - shift);

        return ret;
}
#endif /* mul_u64_u32_shr */
#endif

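Returns the result of a * mul >> shift.

The intermediate a * mul product needs more than 64 bits, so a 128-bit type is used for it. The function has two variants, one for architectures that support 128-bit integer arithmetic and one for those that do not.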
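The fallback variant works because the upper half of a contributes a multiple of 2^32, which is always divisible by 2^shift when shift is at most 32. A minimal standalone sketch, with the hypothetical name `mul_shr_split()` and with that shift <= 32 precondition made explicit:

```c
#include <assert.h>
#include <stdint.h>

/* Portable a * mul >> shift, splitting a into 32-bit halves (shift <= 32). */
static uint64_t mul_shr_split(uint64_t a, uint32_t mul, unsigned int shift)
{
    uint32_t al = (uint32_t)a;          /* low 32 bits  */
    uint32_t ah = (uint32_t)(a >> 32);  /* high 32 bits */
    uint64_t ret;

    assert(shift <= 32);

    ret = ((uint64_t)al * mul) >> shift;
    if (ah)
        ret += ((uint64_t)ah * mul) << (32 - shift);
    return ret;
}
```

As in the kernel version, the caller is expected to pick operands whose shifted result still fits in 64 bits.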

Scheduling latency and scaling policy

The following figure shows the call paths that lead to update_sysctl().

sched_proc_update_handler()

kernel/sched/fair.c

int sched_proc_update_handler(struct ctl_table *table, int write,
                void __user *buffer, size_t *lenp,
                loff_t *ppos)
{
        int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
        int factor = get_update_sysctl_factor();

        if (ret || !write)
                return ret;

        sched_nr_latency = DIV_ROUND_UP(sysctl_sched_latency,
                                        sysctl_sched_min_granularity);

#define WRT_SYSCTL(name) \
        (normalized_sysctl_##name = sysctl_##name / (factor))
        WRT_SYSCTL(sched_min_granularity);
        WRT_SYSCTL(sched_latency);
        WRT_SYSCTL(sched_wakeup_granularity);
#undef WRT_SYSCTL

        return 0;
}

This handler is called when one of the following 4 scheduling options is changed, and it affects how the CFS scheduler computes the scheduling latency.

“/proc/sys/kernel/sched_min_granularity_ns”
- min: 100000, max: 1000000000 (range: 100us ~ 1s)
“/proc/sys/kernel/sched_latency_ns”
- min: 100000, max: 1000000000 (range: 100us ~ 1s)
“/proc/sys/kernel/sched_wakeup_granularity_ns”
- min: 0, max: 1000000000 (range: 0 ~ 1s)
“/proc/sys/kernel/sched_tunable_scaling”
- min:0, max: 2 (range: 0 ~ 2)

In code line 5, reads or writes the value of the relevant proc file within the min/max range set for that entry in the kern_table[] array.
In code lines 6~9, obtains the factor to apply to the scheduling latency. If the proc access failed or the request was not a write, returns from the function.
In code lines 11~12, dividing the scheduling latency by the minimum task scheduling granularity gives the maximum number of tasks that one scheduling latency period can cover.
In code lines 14~15, defines the WRT_SYSCTL() macro.
In code line 16, assigns normalized_sysctl_sched_min_granularity <- sysctl_sched_min_granularity / factor.
In code line 17, assigns normalized_sysctl_sched_latency <- sysctl_sched_latency / factor.
In code line 18, assigns normalized_sysctl_sched_wakeup_granularity <- sysctl_sched_wakeup_granularity / factor.

get_update_sysctl_factor()

kernel/sched/fair.c

/*
 * Increase the granularity value when there are more CPUs,
 * because with more CPUs the 'effective latency' as visible
 * to users decreases. But the relationship is not linear,
 * so pick a second-best guess by going with the log2 of the
 * number of CPUs.
 *
 * This idea comes from the SD scheduler of Con Kolivas:
 */

static int get_update_sysctl_factor(void)
{
        unsigned int cpus = min_t(int, num_online_cpus(), 8);
        unsigned int factor;

        switch (sysctl_sched_tunable_scaling) {
        case SCHED_TUNABLESCALING_NONE:
                factor = 1;
                break;
        case SCHED_TUNABLESCALING_LINEAR:
                factor = cpus;
                break;
        case SCHED_TUNABLESCALING_LOG:
        default:
                factor = 1 + ilog2(cpus);
                break;
        }

        return factor;
}

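Returns the ratio that corresponds to the scaling policy. Depending on the policy in use, this value changes how the scheduling latency scales with the number of cpus. In every policy the cpu count is first capped at 8 by min_t().

The scaling policy can be changed through “/proc/sys/kernel/sched_tunable_scaling”.
- SCHED_TUNABLESCALING_NONE(0)
  - Always uses 1.
- SCHED_TUNABLESCALING_LOG(1)
  - Varies with the number of cpus as 1 + ilog2(cpus).
  - rpi2: there are 4 cpus and this option is used, so factor=3.
- SCHED_TUNABLESCALING_LINEAR(2)
  - Uses the number of online cpus, capped at 8. (1, 2, 3, …, 8)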
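The three policies can be compared side by side with a small userspace model. The names below (`enum scaling`, `ilog2_u()`, `sysctl_factor()`) are made up for this sketch; only the arithmetic follows get_update_sysctl_factor() as listed above.

```c
#include <assert.h>

/* Stand-ins for SCHED_TUNABLESCALING_NONE/LOG/LINEAR (values 0, 1, 2). */
enum scaling { SCALING_NONE = 0, SCALING_LOG = 1, SCALING_LINEAR = 2 };

/* Integer log2 for v >= 1, a simple replacement for the kernel's ilog2(). */
static unsigned int ilog2_u(unsigned int v)
{
    unsigned int r = 0;

    while (v >>= 1)
        r++;
    return r;
}

static unsigned int sysctl_factor(enum scaling mode, unsigned int online_cpus)
{
    unsigned int cpus;

    assert(online_cpus >= 1);
    cpus = online_cpus < 8 ? online_cpus : 8;   /* cap at 8 cpus */

    switch (mode) {
    case SCALING_NONE:
        return 1;
    case SCALING_LINEAR:
        return cpus;
    case SCALING_LOG:
    default:
        return 1 + ilog2_u(cpus);
    }
}
```

Because of the cap, the logarithmic policy never returns more than 4 and the linear policy never more than 8, however many cpus are online.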

update_sysctl()

kernel/sched/fair.c

static void update_sysctl(void)
{
        unsigned int factor = get_update_sysctl_factor();

#define SET_SYSCTL(name) \
        (sysctl_##name = (factor) * normalized_sysctl_##name)
        SET_SYSCTL(sched_min_granularity);
        SET_SYSCTL(sched_latency);
        SET_SYSCTL(sched_wakeup_granularity);
#undef SET_SYSCTL
}

Updates the following 3 values, but only when the scheduling domains are initialized or reconfigured, or when a cpu goes on/off.

sysctl_sched_min_granularity = factor * normalized_sysctl_sched_min_granularity
sysctl_sched_latency = factor * normalized_sysctl_sched_latency
sysctl_sched_wakeup_granularity = factor * normalized_sysctl_sched_wakeup_granularity

vruntime calculation

sched_vslice()

kernel/sched/fair.c

/*
 * We calculate the vruntime slice of a to-be-inserted task.
 *
 * vs = s/w
 */

static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
        return calc_delta_fair(sched_slice(cfs_rq, se), se);
}

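Computes the time slice that corresponds to the schedule entity's load and derives the vruntime slice from it.

First, sched_slice() gives the time slice, that is, the period that corresponds to the schedule entity's load.
- e.g. rpi2: within a scheduling period of 18ms with 6 tasks, the current task's share by load weight is 3ms.
Then calc_delta_fair() returns the time slice scaled in inverse proportion to the schedule entity's load weight.
- e.g. when 2 tasks are given load weights of 1024 and 3072 respectively
  - With a scheduling period of 6ms, they receive time slices of 1.5ms and 4.5ms according to their load weights.
  - Passing these through calc_delta_fair() gives ‘1.5ms * (1024/1024)’ and ‘4.5ms * (1024/3072)’, both of which come out to the same 1.5ms.
  - The resulting value is accumulated into the existing vruntime.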
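The two-task example can be reproduced numerically. `vslice_ns()` below is a hypothetical helper that uses exact divisions rather than the kernel's inv_weight arithmetic, and it assumes a single level without group scheduling.

```c
#include <assert.h>
#include <stdint.h>

#define NICE_0_WEIGHT 1024ULL   /* scaled-down weight of a nice-0 task */

/* Wall-clock slice by weight share, then normalized to the nice-0 weight. */
static uint64_t vslice_ns(uint64_t period_ns, uint64_t se_weight,
                          uint64_t rq_weight)
{
    uint64_t slice;

    assert(se_weight > 0 && rq_weight >= se_weight);

    slice = period_ns * se_weight / rq_weight;   /* as in sched_slice()     */
    return slice * NICE_0_WEIGHT / se_weight;    /* as in calc_delta_fair() */
}
```

For weights 1024 and 3072 on a 6ms period, both calls return the same 1.5ms, which is why heavier tasks run longer in wall-clock time while their vruntime advances at the same pace.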

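The following figure shows vruntime accumulating for 4 tasks.

With HRTICK, which uses a high-resolution timer, a task is preempted after it has consumed its whole allotted time, as in the figure below. Without hrtick, however, the allotted time and vruntime are compared on every regular tick to make the scheduling decision. In that case a task may be preempted by a waiting task with a smaller vruntime before it has used up its allotted time.
The vruntime values below are arbitrary and differ from reality; it is hard to produce such figures in practice.
The scheduling entities enqueued in the RB tree have the property that their vruntime range, from the leftmost entity (smallest vruntime) to the rightmost entity (largest vruntime), converges to within the vruntime span computed as below.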

calc_delta_fair()

kernel/sched/fair.c

/*
 * delta /= w
 */

static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
{
        if (unlikely(se->load.weight != NICE_0_LOAD))
                delta = __calc_delta(delta, NICE_0_LOAD, &se->load);

        return delta;
}

Computes the vruntime value that corresponds to a time slice.

vruntime = time slice * weight-0 (NICE_0_LOAD) / weight

CFS task tick - (*task_tick)

The following figure shows the flow in which the cfs scheduler handles a tick, called from both the scheduler tick and hrtick.

Scheduler tick handling in the CFS scheduler

task_tick_fair()

kernel/sched/fair.c

/*
 * scheduler tick hitting a task of our scheduling class.
 *
 * NOTE: This function can be called remotely by the tick offload that
 * goes along full dynticks. Therefore no local assumption can be made
 * and everything must be accessed through the @rq and @curr passed in
 * parameters.
 */

static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
        struct cfs_rq *cfs_rq;
        struct sched_entity *se = &curr->se;

        for_each_sched_entity(se) {
                cfs_rq = cfs_rq_of(se);
                entity_tick(cfs_rq, se, queued);
        }

        if (static_branch_unlikely(&sched_numa_balancing))
                task_tick_numa(rq, curr);

        update_misfit_status(curr, rq);
        update_overutilized_status(task_rq(curr));
}

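Handles the work that has to be done on every CFS scheduler tick.

In code lines 6~9, for each scheduling entity from the given task up to the top-level task group, calls the entity tick handler, which does the following.
- Updates the execution time and requests a reschedule on expiry
- Updates the entity load average
- Updates the blocked load
- Updates the task group's shares
- Checks whether a preemption request is needed and sets it
In code lines 11~12, if numa balancing is enabled, performs the numa balancing work.
In code line 14, updates the misfit status on systems whose cpus have different (asymmetric) cpu capacities, such as ARM big/little.
- See: sched/fair: Add ‘group_misfit_task’ load-balance type (2018, v4.20-rc1)
In code line 15, updates the overutilized status of the runqueue's root domain.
- See: sched/fair: Add over-utilization/tipping point indicator (2018, v5.0-rc1)

The following figure shows the function flow of entity_tick().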

entity_tick()

kernel/sched/fair.c

static void
entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
        /*
         * Update run-time statistics of the 'current'.
         */
        update_curr(cfs_rq);

        /*
         * Ensure that runnable average is periodically updated.
         */
        update_load_avg(cfs_rq, curr, UPDATE_TG);
        update_cfs_group(curr);

#ifdef CONFIG_SCHED_HRTICK
        /*
         * queued ticks are scheduled to match the slice, so don't bother
         * validating it and just reschedule.
         */
        if (queued) {
                resched_curr(rq_of(cfs_rq));
                return;
        }
        /*
         * don't let the period tick interfere with the hrtick preemption
         */
        if (!sched_feat(DOUBLE_TICK) &&
                        hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
                return;
#endif

        if (cfs_rq->nr_running > 1)
                check_preempt_tick(cfs_rq, curr);
}

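On an entity tick, the CFS scheduler does the following.

In code line 7, updates the runtime of the current schedule entity. If the runtime has already been used up, a reschedule is requested so that the next task can be scheduled.
In code line 12, computes and updates the entity load average and related values.
In code line 13, updates the group entity's weight from the task group's shares.
In code lines 20~23, on entry through the scheduler's hrtick(), the argument @queued is 1. The timer was set to expire after exactly the runtime, so a reschedule is requested for the next task.
In code lines 27~29, entry came through the regular scheduler tick and the DOUBLE_TICK feature is not in use. If hrtick is active, returns right away so as not to interfere with it.
In code lines 32~33, if 2 or more tasks are running on the cfs runqueue, checks whether a task switch (preemption) is needed because the runtime has been used up. If so, the reschedule request flag is set.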

Periodic task preemption check

check_preempt_tick()

kernel/sched/fair.c

/*
 * Preempt the current task with a newly woken task if needed:
 */

static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
        unsigned long ideal_runtime, delta_exec;
        struct sched_entity *se;
        s64 delta;

        ideal_runtime = sched_slice(cfs_rq, curr);
        delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
        if (delta_exec > ideal_runtime) {
                resched_curr(rq_of(cfs_rq));
                /*
                 * The current task ran long enough, ensure it doesn't get
                 * re-elected due to buddy favours.
                 */
                clear_buddies(cfs_rq, curr);
                return;
        }

        /*
         * Ensure that a task that missed wakeup preemption by a
         * narrow margin doesn't have to wait for a full slice.
         * This also mitigates buddy induced latencies under load.
         */
        if (delta_exec < sysctl_sched_min_granularity)
                return;

        se = __pick_first_entity(cfs_rq);
        delta = curr->vruntime - se->vruntime;

        if (delta < 0)
                return;

        if (delta > ideal_runtime)
                resched_curr(rq_of(cfs_rq));
}

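Sets the reschedule request flag in the following situations.

The execution time has been used up
There is execution time left, but a task enqueued on the cfs runqueue should run sooner
- This is when the current task's vruntime is larger than that of the next task to run and the difference exceeds the current task's time slice.

In code line 8, obtains the time slice that the current task should run for and assigns it to ideal_runtime.
In code line 9, subtracts the previous total execution time from the total execution time to get the delta execution time.
In code lines 10~18, if the delta execution time exceeds ideal_runtime, the task has used up its allotted time. In that case, sets the reschedule request flag, clears the curr entity from the 3 buddies (skip, last, next) if it is set there, and returns.
In code lines 25~26, if the delta execution time is smaller than the minimum time required before preemption (0.75ms by default, before scaling by factor), returns so that the current task keeps running.
In code lines 28~29, takes the first entity that will run next on the cfs runqueue and computes delta, the current scheduling entity's vruntime minus that entity's vruntime.
In code lines 31~32, if delta is less than 0, which is the usual case where the next scheduling entity's vruntime is larger than the current one's, returns.
In code lines 34~35, if delta is larger than ideal_runtime, sets the reschedule request flag.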
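The decision sequence above can be condensed into a pure function, which makes the order of the checks easier to see. `tick_needs_resched()` is a hypothetical name and all inputs are passed as plain nanosecond values; buddy clearing and the actual resched_curr() call are left out of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true where check_preempt_tick() would request a reschedule. */
static bool tick_needs_resched(uint64_t ideal_ns, uint64_t ran_ns,
                               uint64_t min_gran_ns,
                               uint64_t curr_vruntime,
                               uint64_t leftmost_vruntime)
{
    int64_t delta;

    if (ran_ns > ideal_ns)          /* slice used up */
        return true;
    if (ran_ns < min_gran_ns)       /* too early to preempt */
        return false;

    /* Signed difference keeps the comparison valid across wraparound. */
    delta = (int64_t)(curr_vruntime - leftmost_vruntime);
    if (delta < 0)                  /* current entity is still ahead */
        return false;

    return (uint64_t)delta > ideal_ns;
}
```

Note that the last comparison mixes a vruntime difference with a wall-clock slice, exactly as the kernel function does.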

Updating the runtime, PELT and statistics of the current task - (*update_curr)

update_curr_fair()

kernel/sched/fair.c

static void update_curr_fair(struct rq *rq)
{
        update_curr(cfs_rq_of(&rq->curr->se));
}

Updates the runtime, PELT and statistics of the current task. If the current task has used up its execution time, requests a reschedule.

update_curr()

kernel/sched/fair.c

/*
 * Update the current task's runtime statistics.
 */

static void update_curr(struct cfs_rq *cfs_rq)
{
        struct sched_entity *curr = cfs_rq->curr;
        u64 now = rq_clock_task(rq_of(cfs_rq));
        u64 delta_exec;

        if (unlikely(!curr))
                return;

        delta_exec = now - curr->exec_start;
        if (unlikely((s64)delta_exec <= 0)) 
               return; 

        curr->exec_start = now;

        schedstat_set(curr->statistics.exec_max,
                      max(delta_exec, curr->statistics.exec_max));

        curr->sum_exec_runtime += delta_exec;
        schedstat_add(cfs_rq->exec_clock, delta_exec);

        curr->vruntime += calc_delta_fair(delta_exec, curr);
        update_min_vruntime(cfs_rq);

        if (entity_is_task(curr)) {
                struct task_struct *curtask = task_of(curr);

                trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
                cgroup_account_cputime(curtask, delta_exec);
                account_group_exec_runtime(curtask, delta_exec);
        }

        account_cfs_rq_runtime(cfs_rq, delta_exec);
}

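Updates the runtime, PELT and statistics of the current task. If the current task has used up its execution time, requests a reschedule.

In code lines 3~8, returns if there is no running entity on the cfs runqueue.
In code lines 10~12, assigns to delta_exec the time that the entity has run since the last update, and returns if it is 0 or less.
- A negative value can appear temporarily, for example with an unstable clock on x86.
In code line 14, sets exec_start to the current time in advance, for the next runtime calculation.
In code lines 16~17, updates the maximum runtime seen in a single update.
In code line 19, updates the entity's total runtime.
In code line 20, updates the cfs runqueue's total runtime.
In code line 22, converts the executed runtime into vruntime and adds it to the entity's vruntime.
- curr->vruntime += delta_exec * weight-0 (NICE_0_LOAD) / curr->load.weight
In code line 23, updates the cfs runqueue's min_vruntime.
In code lines 25~31, if the current entity is a task, adds the executed runtime to the cgroup and cputimer statistics.
In code line 33, when cfs bandwidth is in use, charges the executed runtime to the cfs runqueue; once its remaining runtime is used up, the cfs runqueue is to be throttled.

The following figure shows vruntime and min_vruntime being updated from the runtime of the entity running on the cfs runqueue and, when cfs bandwidth is in use, the reschedule that follows once the cfs runqueue's remaining runtime is used up.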

Updating min_vruntime

update_min_vruntime()

kernel/sched/fair.c

static void update_min_vruntime(struct cfs_rq *cfs_rq)
{
        struct sched_entity *curr = cfs_rq->curr;
        struct rb_node *leftmost = rb_first_cached(&cfs_rq->tasks_timeline);

        u64 vruntime = cfs_rq->min_vruntime;

        if (curr) {
                if (curr->on_rq)
                        vruntime = curr->vruntime;
                else
                        curr = NULL;
        }

        if (leftmost) { /* non-empty tree */
                struct sched_entity *se;
                se = rb_entry(leftmost, struct sched_entity, run_node);

                if (!curr)
                        vruntime = se->vruntime;
                else
                        vruntime = min_vruntime(vruntime, se->vruntime);
        }

        /* ensure we never gain time by being placed backwards. */
        cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
#ifndef CONFIG_64BIT
        smp_wmb();
        cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
#endif
}

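Updates the cfs runqueue's min_vruntime. The smaller vruntime of the current task and the first waiting task is taken, and min_vruntime is updated only if that value is larger than the current min_vruntime.

In code line 6, stores the cfs runqueue's min_vruntime in vruntime.
In code lines 8~13, if the currently running entity is enqueued on the cfs runqueue, replaces vruntime with this entity's vruntime.
In code lines 15~23, if entities are waiting on the cfs runqueue, compares with the vruntime of the entity that will run first and takes the smaller value.
In code line 26, updates only if the smallest vruntime computed above is larger than the cfs runqueue's min_vruntime.
- As a result, min_vruntime never moves backwards.
In code lines 27~30, on 32-bit systems only, also copies min_vruntime to min_vruntime_copy.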
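A compact model of the selection logic helps to see why min_vruntime is monotonic. In the sketch below, `vr_max()` and `vr_min()` mirror the signed-difference comparison used by the kernel's max_vruntime() and min_vruntime() helpers, while `next_min_vruntime()` and its boolean parameters (standing in for the NULL checks on curr and leftmost) are assumptions of this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe max/min: compare through the signed difference. */
static uint64_t vr_max(uint64_t max, uint64_t v)
{
    return (int64_t)(v - max) > 0 ? v : max;
}

static uint64_t vr_min(uint64_t min, uint64_t v)
{
    return (int64_t)(v - min) < 0 ? v : min;
}

/* New min_vruntime from the running entity and the leftmost queued entity. */
static uint64_t next_min_vruntime(uint64_t old_min,
                                  bool has_curr, uint64_t curr_vr,
                                  bool has_leftmost, uint64_t left_vr)
{
    uint64_t v = old_min;

    if (has_curr)
        v = curr_vr;
    if (has_leftmost)
        v = has_curr ? vr_min(v, left_vr) : left_vr;

    /* Never move min_vruntime backwards. */
    return vr_max(old_min, v);
}
```

The signed comparison is what keeps the result correct even after the 64-bit vruntime counter wraps around.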

The following figure shows how update_min_vruntime() works.

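CFS buddies

The following buddies exist so that the order of a particular task can be changed at reschedule time.

skip buddy
- At reschedule, the current task is set as the skip buddy so that, where possible, it is not picked again as the next task.
last buddy
- To improve cache locality, when a wakeup preemption succeeds, the task that ran just before the preemption is preferred when picking again.
next buddy
- To improve cache locality, runs the woken task right away at the next schedule.
- At reschedule, the designated task is set as the next buddy so that it is picked as the next task.

Clearing all buddies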

clear_buddies()

kernel/sched/fair.c

static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
        if (cfs_rq->last == se)
                __clear_buddies_last(se);

        if (cfs_rq->next == se)
                __clear_buddies_next(se);

        if (cfs_rq->skip == se)
                __clear_buddies_skip(se);
}

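Hierarchically clears the buddy pointers so that this scheduling entity is not picked again next time.

In code lines 3~4, if the cfs runqueue's last equals the given scheduling entity, hierarchically clears last.
In code lines 6~7, if the cfs runqueue's next equals the given scheduling entity, hierarchically clears next.
In code lines 9~10, if the cfs runqueue's skip equals the given scheduling entity, hierarchically clears skip.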

__clear_buddies_last()

kernel/sched/fair.c

static void __clear_buddies_last(struct sched_entity *se)
{
        for_each_sched_entity(se) {
                struct cfs_rq *cfs_rq = cfs_rq_of(se);
                if (cfs_rq->last != se)
                        break;

                cfs_rq->last = NULL;
        }
}

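Loops up towards the top-level task group and, as long as the cfs runqueue's last is the given scheduling entity, clears it by assigning NULL. The loop stops at the first level where it does not match.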

__clear_buddies_next()

kernel/sched/fair.c

static void __clear_buddies_next(struct sched_entity *se)
{
        for_each_sched_entity(se) {
                struct cfs_rq *cfs_rq = cfs_rq_of(se);
                if (cfs_rq->next != se)
                        break;

                cfs_rq->next = NULL;
        }
}

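Loops up towards the top-level task group and, as long as the cfs runqueue's next is the given scheduling entity, clears it by assigning NULL. The loop stops at the first level where it does not match.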

__clear_buddies_skip()

kernel/sched/fair.c

static void __clear_buddies_skip(struct sched_entity *se)
{
        for_each_sched_entity(se) {
                struct cfs_rq *cfs_rq = cfs_rq_of(se);
                if (cfs_rq->skip != se)
                        break;

                cfs_rq->skip = NULL;
        }
}

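Loops up towards the top-level task group and, as long as the cfs runqueue's skip is the given scheduling entity, clears it by assigning NULL. The loop stops at the first level where it does not match.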

Task fork (*task_fork)

task_fork_fair()

kernel/sched/fair.c

/*
 * called on fork with the child task as argument from the parent's context
 *  - child not yet on the tasklist
 *  - preemption disabled
 */

static void task_fork_fair(struct task_struct *p)
{
        struct cfs_rq *cfs_rq;
        struct sched_entity *se = &p->se, *curr;
        struct rq *rq = this_rq();
        struct rq_flags rf;

        rq_lock(rq, &rf);
        update_rq_clock(rq);

        cfs_rq = task_cfs_rq(current);
        curr = cfs_rq->curr;
        if (curr) {
                update_curr(cfs_rq);
                se->vruntime = curr->vruntime;
        }
        place_entity(cfs_rq, se, 1);

        if (sysctl_sched_child_runs_first && curr && entity_before(curr, se)) {
                /*
                 * Upon rescheduling, sched_class::put_prev_task() will place
                 * 'current' within the tree based on its new key value.
                 */
                swap(curr->vruntime, se->vruntime);
                resched_curr(rq);
        }

        se->vruntime -= cfs_rq->min_vruntime;
        rq_unlock(rq, &rf);
}

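Determines the vruntime of a new child task forked from the parent's context. (This determines its position in the cfs runqueue's RB tree.)

In code lines 8~9, acquires the runqueue lock and then updates the runqueue clock.
In code lines 11~16, updates the runtime, PELT, stats and so on for the cfs runqueue of the currently running cfs task, and uses the current entity's vruntime for the new entity as well.
In code line 17, places the entity within the cfs runqueue.
- Sets the new entity's vruntime. (This determines where it will go in the cfs runqueue's RB tree.)
In code lines 19~26, if sysctl_sched_child_runs_first is true, the new child has to run first.
- If the currently running scheduling entity is ahead of the new scheduling entity by vruntime, swaps the two vruntime values to exchange their positions, then requests a reschedule.
- “/proc/sys/kernel/sched_child_runs_first” defaults to 0; when it is set to 1, the new child runs first.
In code line 28, the new entity is not yet enqueued, so min_vruntime is subtracted from its vruntime, leaving a value relative to the cfs runqueue.
In code line 29, releases the runqueue lock.

Placing a schedule entity in the RB tree (vruntime)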

place_entity()

kernel/sched/fair.c

static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
        u64 vruntime = cfs_rq->min_vruntime;

        /*
         * The 'current' period is already promised to the current tasks,
         * however the extra weight of the new task will slow them down a
         * little, place the new task so that it fits in the slot that
         * stays open at the end.
         */
        if (initial && sched_feat(START_DEBIT))
                vruntime += sched_vslice(cfs_rq, se);

        /* sleeps up to a single latency don't count. */
        if (!initial) {
                unsigned long thresh = sysctl_sched_latency;

                /*
                 * Halve their sleep time's effect, to allow
                 * for a gentler effect of sleepers:
                 */
                if (sched_feat(GENTLE_FAIR_SLEEPERS))
                        thresh >>= 1;

                vruntime -= thresh;
        }

        /* ensure we never gain time by being placed backwards. */
        se->vruntime = max_vruntime(se->vruntime, vruntime);
}

Updates the schedule entity's vruntime. (Within the RB tree the vruntime can only be moved later, never earlier.)

In code line 4, reads the cfs runqueue's min_vruntime.
- A task is enqueued on a cfs runqueue in the following situations.
  - A newly forked task
  - A task that moved in from another task group
  - A task or group entity that wakes up after sleeping or being throttled
  - A task that moved in from another class
- Cases in which place_entity() is called
  - Newly forked task -> task_fork_fair() -> place_entity(initial=1)
  - Task or group entity that wakes up after sleeping or being throttled -> unthrottle_cfs_rq(ENQUEUE_WAKEUP) -> enqueue_entity() -> place_entity(initial=0)
  - Task that is moving to another class -> switched_from_fair() -> detach_task_cfs_rq() -> place_entity(initial=0)
  - Task that moved in from another task group -> task_move_group_fair() -> detach_task_cfs_rq() -> place_entity(initial=0)
- When a task comes back onto a cfs runqueue, it starts from min_vruntime.
- A cfs runqueue exists in the global runqueue and, when task groups are configured, in each task group as well.
In code lines 12~13, for a newly created task with the START_DEBIT feature in use, steers the task to run one slice later in vruntime terms.
- initial
  - Requested as 1 when a newly forked task is scheduled.
    - Path: copy_process() -> sched_fork() -> sched_class->task_fork(p) -> task_fork_fair() -> place_entity()
- START_DEBIT & GENTLE_FAIR_SLEEPERS
  - See: Scheduler -15- (Core) | 문c
- “/proc/sys/kernel/sched_child_runs_first”
  - When this is used, after this function returns, the forked child's vruntime is swapped with that of the current task so that the child runs before the currently running task.
In code lines 16~27, if the call is not for initial task creation, reduces vruntime by the scheduling latency. With the GENTLE_FAIR_SLEEPERS feature, only half of it is subtracted.
In code line 30, updates the schedule entity's vruntime only if the computed vruntime is larger.
- se->vruntime = max(se->vruntime, vruntime)
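The placement rules can be exercised with a small pure function. `place_vruntime()` is a hypothetical helper: the vslice and latency are passed in as arguments rather than computed, and the two scheduler features are modeled as boolean flags.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns the vruntime that place_entity() would leave in se->vruntime. */
static uint64_t place_vruntime(uint64_t se_vr, uint64_t min_vr, bool initial,
                               uint64_t vslice_ns, uint64_t latency_ns,
                               bool start_debit, bool gentle_sleepers)
{
    uint64_t v = min_vr;

    /* New task: start one vslice behind min_vruntime. */
    if (initial && start_debit)
        v += vslice_ns;

    /* Woken task: credit up to one latency (or half of it). */
    if (!initial) {
        uint64_t thresh = latency_ns;

        if (gentle_sleepers)
            thresh >>= 1;
        v -= thresh;
    }

    /* Never place the entity earlier than its own vruntime. */
    return (int64_t)(v - se_vr) > 0 ? v : se_vr;
}
```

A task that slept for a long time is pulled up to just below min_vruntime, while a task that slept only briefly keeps its own, larger vruntime.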

Task enqueue - (*enqueue_task)

enqueue_task_fair()

kernel/sched/fair.c -1/2-

/*
 * The enqueue_task method is called before nr_running is
 * increased. Here we update the fair scheduling stats and
 * then put the task into the rbtree:
 */

static void
enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
        struct cfs_rq *cfs_rq;
        struct sched_entity *se = &p->se;
        int idle_h_nr_running = task_has_idle_policy(p);

        /*
         * The code below (indirectly) updates schedutil which looks at
         * the cfs_rq utilization to select a frequency.
         * Let's add the task's estimated utilization to the cfs_rq's
         * estimated utilization, before we update schedutil.
         */
        util_est_enqueue(&rq->cfs, p);

        /*
         * If in_iowait is set, the code below may not trigger any cpufreq
         * utilization updates, so do it here explicitly with the IOWAIT flag
         * passed.
         */
        if (p->in_iowait)
                cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);

        for_each_sched_entity(se) {
                if (se->on_rq)
                        break;
                cfs_rq = cfs_rq_of(se);
                enqueue_entity(cfs_rq, se, flags);

                /*
                 * end evaluation on encountering a throttled cfs_rq
                 *
                 * note: in the case of encountering a throttled cfs_rq we will
                 * post the final h_nr_running increment below.
                 */
                if (cfs_rq_throttled(cfs_rq))
                        break;
                cfs_rq->h_nr_running++;
                cfs_rq->idle_h_nr_running += idle_h_nr_running;

                flags = ENQUEUE_WAKEUP;
        }

        for_each_sched_entity(se) {
                cfs_rq = cfs_rq_of(se);
                cfs_rq->h_nr_running++;
                cfs_rq->idle_h_nr_running += idle_h_nr_running;

                if (cfs_rq_throttled(cfs_rq))
                        break;

                update_load_avg(cfs_rq, se, UPDATE_TG);
                update_cfs_group(se);
        }

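Adds the task to the cfs runqueue and updates the load weight and runtime statistics. When task (scheduling) groups are configured through cgroup, each schedule entity is queued on the cfs runqueue of its task group, and the load weight and runtime statistics are updated there as well.

In code line 14, updates util_est for the enqueue.
- See: sched/fair: Add util_est on top of PELT (2018, v4.17-rc1)
In code lines 21~22, for a task in in_iowait, calls in so that the cpu frequency can be adjusted.
In code lines 24~26, loops over the entities of the given task and leaves the loop if the current entity is already on a runqueue.
In code lines 27~28, enqueues the entity on the cfs runqueue.
In code lines 36~37, leaves the loop if the cfs runqueue is throttled.
In code line 38, increments h_nr_running, the number of running cfs tasks at and below this cfs runqueue.
In code line 39, increments idle_h_nr_running, the number of running cfs tasks with the idle policy at and below this cfs runqueue, only if the task is an idle-policy cfs task.
In code line 41, sets flags to ENQUEUE_WAKEUP.
In code line 44, if the loop above stopped part-way, continues looping from that point.
In code lines 45~47, increments h_nr_running, and also idle_h_nr_running if the task has the idle policy.
In code lines 49~50, leaves the loop if the cfs runqueue is throttled.
In code lines 52~53, updates the load average of the cfs runqueue and the weight of the group entity.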

kernel/sched/fair.c -2/2-

        if (!se) {
                add_nr_running(rq, 1);
                /*
                 * Since new tasks are assigned an initial util_avg equal to
                 * half of the spare capacity of their CPU, tiny tasks have the
                 * ability to cross the overutilized threshold, which will
                 * result in the load balancer ruining all the task placement
                 * done by EAS. As a way to mitigate that effect, do not account
                 * for the first enqueue operation of new tasks during the
                 * overutilized flag detection.
                 *
                 * A better way of solving this problem would be to wait for
                 * the PELT signals of tasks to converge before taking them
                 * into account, but that is not straightforward to implement,
                 * and the following generally works well enough in practice.
                 */
                if (flags & ENQUEUE_WAKEUP)
                        update_overutilized_status(rq);

        }

        if (cfs_bandwidth_used()) {
                /*
                 * When bandwidth control is enabled; the cfs_rq_throttled()
                 * breaks in the above iteration can result in incomplete
                 * leaf list maintenance, resulting in triggering the assertion
                 * below.
                 */
                for_each_sched_entity(se) {
                        cfs_rq = cfs_rq_of(se);

                        if (list_add_leaf_cfs_rq(cfs_rq))
                                break;
                }
        }

        assert_list_leaf_cfs_rq(rq);

        hrtick_update(rq);
}

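In code lines 1~20, if both loops above reached the top of the hierarchy without breaking on a throttled cfs runqueue (se is NULL), the task count of the runqueue is incremented. If flags contains ENQUEUE_WAKEUP, either passed in by the caller or set in the first loop after an entity was enqueued, the overutilized status of the root domain is updated.
In code lines 22~35, if cfs bandwidth is in use, the entities are walked and their cfs runqueues are added to the leaf_cfs_rq_list of the runqueue.
In code line 39, if hrtick is in use, the timer is armed for the remaining runtime, provided some runtime remains.

The following figure shows how enqueue_task_fair() queues entities on the cfs runqueue of each task group when a task is added.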

for_each_sched_entity() macro

kernel/sched/fair.c

/* Walk up scheduling entities hierarchy */
#define for_each_sched_entity(se) \
                for (; se; se = se->parent)

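Loops up to the topmost parent of the scheduling entity.

When task (scheduling) groups are organized hierarchically through cgroup, scheduling entities are linked through parent pointers from the lowest scheduling entity, the one that holds the task, up to the root task group.
When task (scheduling) groups are not used, a task has only the single scheduling entity embedded in its task structure, so the loop body runs exactly once.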
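The upward walk can be tried in user space. The following is a minimal sketch, assuming a made-up `toy_entity` type that keeps only the `parent` link of `struct sched_entity`; `for_each_toy_entity()` and `count_levels()` are hypothetical names used for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for struct sched_entity. */
struct toy_entity {
        const char *name;
        struct toy_entity *parent;      /* NULL at the top of the hierarchy */
};

/* Same shape as the kernel macro: follow parent links until none is left. */
#define for_each_toy_entity(se) \
                for (; se; se = se->parent)

/* Count how many entities are visited when starting from @se. */
static int count_levels(struct toy_entity *se)
{
        int levels = 0;

        for_each_toy_entity(se)
                levels++;

        return levels;
}
```

A task entity whose parent is a single group entity gives two iterations, while an entity without a parent gives exactly one, which matches the non-group case described above.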

Enqueuing an entity on the cfs runqueue

enqueue_entity()

kernel/sched/fair.c

/*
 * MIGRATION
 *
 *      dequeue
 *        update_curr()
 *          update_min_vruntime()
 *        vruntime -= min_vruntime
 *
 *      enqueue
 *        update_curr()
 *          update_min_vruntime()
 *        vruntime += min_vruntime
 *
 * this way the vruntime transition between RQs is done when both
 * min_vruntime are up-to-date.
 *
 * WAKEUP (remote)
 *
 *      ->migrate_task_rq_fair() (p->state == TASK_WAKING)
 *        vruntime -= min_vruntime
 *
 *      enqueue
 *        update_curr()
 *          update_min_vruntime()
 *        vruntime += min_vruntime
 *
 * this way we don't have the most up-to-date min_vruntime on the originating
 * CPU and an up-to-date min_vruntime on the destination CPU.
 */

static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
        bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATED);
        bool curr = cfs_rq->curr == se;

        /*
         * If we're the current task, we must renormalise before calling
         * update_curr().
         */
        if (renorm && curr)
                se->vruntime += cfs_rq->min_vruntime;

        update_curr(cfs_rq);

        /*
         * Otherwise, renormalise after, such that we're placed at the current
         * moment in time, instead of some random moment in the past. Being
         * placed in the past could significantly boost this task to the
         * fairness detriment of existing tasks.
         */
        if (renorm && !curr)
                se->vruntime += cfs_rq->min_vruntime;

        /*
         * When enqueuing a sched_entity, we must:
         *   - Update loads to have both entity and cfs_rq synced with now.
         *   - Add its load to cfs_rq->runnable_avg
         *   - For group_entity, update its weight to reflect the new share of
         *     its group cfs_rq
         *   - Add its new weight to cfs_rq->load.weight
         */
        update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
        update_cfs_group(se);
        enqueue_runnable_load_avg(cfs_rq, se);
        account_entity_enqueue(cfs_rq, se);

        if (flags & ENQUEUE_WAKEUP)
                place_entity(cfs_rq, se, 0);

        check_schedstat_required();
        update_stats_enqueue(cfs_rq, se, flags);
        check_spread(cfs_rq, se);
        if (!curr)
                __enqueue_entity(cfs_rq, se);
        se->on_rq = 1;

        if (cfs_rq->nr_running == 1) {
                list_add_leaf_cfs_rq(cfs_rq);
                check_enqueue_throttle(cfs_rq);
        }
}

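Queues the scheduling entity on the RB tree and updates the runtime statistics.

In code line 4, determines whether vruntime needs renormalization.
- Renormalization of vruntime is needed when the wakeup flag is not set or when the migrate flag is set.
In code lines 11~23, if the @se entity needs renormalization, min_vruntime is added to the vruntime of the entity. The order depends on whether @se is the currently running entity.
- If @se==curr, min_vruntime is added first, and then the runtime, PELT and stats of the cfs runqueue are updated.
- If @se!=curr, the runtime, PELT and stats of the cfs runqueue are updated first, and then min_vruntime is added.
In code lines 22~23, if the @se entity needs renormalization and is not the currently running entity, min_vruntime is added to its vruntime.
- min_vruntime is added only after update_curr(), so the entity is placed relative to an up-to-date min_vruntime.
In code line 33, the load of the entity is updated.
- The DO_ATTACH flag is passed because the newly added entity has to be taken into account, and the UPDATE_TG flag is passed so that the task group is updated as well.
In code line 34, the weight of the group entity is recomputed to reflect its share of the group cfs runqueue.
In code line 35, the runnable load average of the cfs runqueue is updated.
In code line 36, the accounting for the entity enqueue is updated.
- The load weight of the cfs runqueue is increased, and nr_running and related counters are incremented.
In code lines 38~39, if the ENQUEUE_WAKEUP flag was requested, the position of the scheduling entity is updated.
In code line 42, if the enqueued scheduling entity is not the one currently running on the cfs runqueue, wait_start is set to the current runqueue clock.
In code line 43, nr_spread_over is updated for debugging.
In code lines 44~45, the scheduling entity is added to the RB tree unless it is currently running on the cfs runqueue.
- When the curr entity is the only one runnable on a cfs runqueue, the RB tree is empty.
- In other words, while a curr entity is running, one entity in the RB tree means that two cfs entities are runnable.
In code line 46, on_rq is set to 1 to mark the entity as queued on the cfs runqueue.
In code lines 48~51, if the cfs runqueue now holds exactly one active entity, the cfs runqueue is added to the leaf cfs list of the runqueue and is throttled if necessary.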
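The renormalization step can be modeled outside the kernel. The sketch below assumes made-up `toy_cfs_rq` and `toy_se` types that keep only the fields involved, and follows the MIGRATION case from the comment above enqueue_entity(): vruntime becomes relative to min_vruntime on dequeue and absolute again on enqueue. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reduced types: only the fields used for renormalization. */
struct toy_cfs_rq {
        uint64_t min_vruntime;
};

struct toy_se {
        uint64_t vruntime;
};

/* On dequeue for migration, keep only the offset from the source queue. */
static void toy_dequeue_migrating(const struct toy_cfs_rq *src,
                                  struct toy_se *se)
{
        se->vruntime -= src->min_vruntime;
}

/* On enqueue, rebase that offset onto the destination's min_vruntime. */
static void toy_enqueue_migrated(const struct toy_cfs_rq *dst,
                                 struct toy_se *se)
{
        se->vruntime += dst->min_vruntime;
}
```

With a source min_vruntime of 1000 and a destination min_vruntime of 50000, an entity at vruntime 1300 keeps its lead of 300 and ends up at 50300, so it is neither boosted nor penalized by the different clocks of the two runqueues.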

__enqueue_entity()

kernel/sched/fair.c

/*
 * Enqueue an entity into the rb-tree:
 */

static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
        struct rb_node **link = &cfs_rq->tasks_timeline.rb_root.rb_node;
        struct rb_node *parent = NULL;
        struct sched_entity *entry;
        bool leftmost = true;

        /*
         * Find the right place in the rbtree:
         */
        while (*link) {
                parent = *link;
                entry = rb_entry(parent, struct sched_entity, run_node);
                /*
                 * We dont care about collisions. Nodes with
                 * the same key stay together.
                 */
                if (entity_before(se, entry)) {
                        link = &parent->rb_left;
                } else {
                        link = &parent->rb_right;
                        leftmost = false;
                }
        }

        rb_link_node(&se->run_node, parent, link);
        rb_insert_color_cached(&se->run_node,
                               &cfs_rq->tasks_timeline, leftmost);
}

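Adds the entity @se to the RB tree of the cfs runqueue.

In code line 3, the root node of the RB tree is chosen as the first node to compare against.
In code line 11, the loop runs for as long as there is a node to compare. (It ends when the link becomes null.)
In code lines 12~13, the node being compared is kept in parent, and the scheduling entity that contains this node is assigned to entry.
In code lines 18~23, if the vruntime of the scheduling entity is smaller than the vruntime of entry, the walk goes to the left child; otherwise it goes to the right child.
- If the walk never went right inside the loop, the vruntime of the requested scheduling entity is the smallest in the RB tree. In that case rb_leftmost of the cfs runqueue is made to point to this scheduling entity, so it is chosen first the next time a scheduling entity is picked from the cfs runqueue.
In code lines 26~28, the node is linked into the RB tree and the colors are updated.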
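The leftmost tracking can be shown with an ordinary binary search tree. The sketch below is a simplification under stated assumptions: `toy_node`, `toy_timeline` and `toy_enqueue()` are hypothetical, and the red-black rebalancing done by rb_insert_color_cached() is left out entirely. Only the descent and the cached leftmost pointer are modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node: a plain BST node keyed by vruntime (no colors). */
struct toy_node {
        uint64_t vruntime;
        struct toy_node *left, *right;
};

struct toy_timeline {
        struct toy_node *root;
        struct toy_node *leftmost;      /* cached smallest-vruntime node */
};

static void toy_enqueue(struct toy_timeline *tl, struct toy_node *n)
{
        struct toy_node **link = &tl->root;
        bool leftmost = true;

        n->left = n->right = NULL;

        while (*link) {
                struct toy_node *entry = *link;

                /* Signed difference, in the style of entity_before(). */
                if ((int64_t)(n->vruntime - entry->vruntime) < 0) {
                        link = &entry->left;
                } else {
                        /* Equal keys also go right and stay together. */
                        link = &entry->right;
                        leftmost = false;
                }
        }

        *link = n;
        if (leftmost)
                tl->leftmost = n;       /* never turned right: new minimum */
}
```

Inserting vruntimes 300, 100, 200 and then another 100 leaves the cached leftmost pointer on the first node with vruntime 100, because the second one takes a right turn at the equal key.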

The following figure shows a scheduling entity being queued on the cfs runqueue.

list_add_leaf_cfs_rq()

kernel/sched/fair.c

static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
{
        struct rq *rq = rq_of(cfs_rq);
        int cpu = cpu_of(rq);

        if (cfs_rq->on_list)
                return rq->tmp_alone_branch == &rq->leaf_cfs_rq_list;

        cfs_rq->on_list = 1;

        /*
         * Ensure we either appear before our parent (if already
         * enqueued) or force our parent to appear after us when it is
         * enqueued. The fact that we always enqueue bottom-up
         * reduces this to two cases and a special case for the root
         * cfs_rq. Furthermore, it also means that we will always reset
         * tmp_alone_branch either when the branch is connected
         * to a tree or when we reach the top of the tree
         */
        if (cfs_rq->tg->parent &&
            cfs_rq->tg->parent->cfs_rq[cpu]->on_list) {
                /*
                 * If parent is already on the list, we add the child
                 * just before. Thanks to circular linked property of
                 * the list, this means to put the child at the tail
                 * of the list that starts by parent.
                 */
                list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
                        &(cfs_rq->tg->parent->cfs_rq[cpu]->leaf_cfs_rq_list));
                /*
                 * The branch is now connected to its tree so we can
                 * reset tmp_alone_branch to the beginning of the
                 * list.
                 */
                rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
                return true;
        }

        if (!cfs_rq->tg->parent) {
                /*
                 * cfs rq without parent should be put
                 * at the tail of the list.
                 */
                list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
                        &rq->leaf_cfs_rq_list);
                /*
                 * We have reach the top of a tree so we can reset
                 * tmp_alone_branch to the beginning of the list.
                 */
                rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
                return true;
        }

        /*
         * The parent has not already been added so we want to
         * make sure that it will be put after us.
         * tmp_alone_branch points to the begin of the branch
         * where we will add parent.
         */
        list_add_rcu(&cfs_rq->leaf_cfs_rq_list, rq->tmp_alone_branch);
        /*
         * update tmp_alone_branch to points to the new begin
         * of the branch
         */
        rq->tmp_alone_branch = &cfs_rq->leaf_cfs_rq_list;
        return false;
}

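Adds the requested cfs runqueue to the leaf list. (The head of the leaf list must always hold the lowest cfs runqueue, the one that contains the scheduling entity owning the task.)

In code lines 6~7, if the requested cfs runqueue is already on the leaf list, the result of the following comparison is returned.
- rq->tmp_alone_branch == &rq->leaf_cfs_rq_list
- on_list
  - True when the cfs runqueue has already been added to leaf_cfs_rq_list.
In code lines 20~37, if the parent cfs runqueue is already on the leaf list, the current cfs runqueue is added to the list just before its parent and true is returned.
In code lines 39~52, if there is no parent group, the current cfs runqueue is added to the tail of the leaf cfs runqueue list of the runqueue and true is returned.
In code lines 60~66, the current cfs runqueue is added at rq->tmp_alone_branch, tmp_alone_branch is updated to point to it, and false is returned.

The following figure shows the behavior of list_add_leaf_cfs_rq() before and after the patch.

Reference: sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list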

add_nr_running()

kernel/sched/sched.h

static inline void add_nr_running(struct rq *rq, unsigned count)
{
        unsigned prev_nr = rq->nr_running;

        rq->nr_running = prev_nr + count;

#ifdef CONFIG_SMP
        if (prev_nr < 2 && rq->nr_running >= 2) {
                if (!READ_ONCE(rq->rd->overload))
                        WRITE_ONCE(rq->rd->overload, 1);
        }
#endif

        sched_update_tick_dependency(rq);
}

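Adds count to the number of active tasks running on the runqueue.

In code lines 3~5, count is added to the number of active tasks running on the runqueue.
In code lines 8~11, on an smp system, if there were fewer than two active tasks before and there are two or more now that the new tasks are included, overload in the root domain of the runqueue is set to true.
In code line 14, on a kernel that supports nohz full, if the runqueue is currently in nohz full state, it leaves that state so that the tick fires again.

hrtick using hrtimer

Updating hrtick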

hrtick_update()

kernel/sched/fair.c

/*
 * called from enqueue/dequeue and updates the hrtick when the
 * current task is from our class and nr_running is low enough
 * to matter.
 */

static void hrtick_update(struct rq *rq)
{
        struct task_struct *curr = rq->curr;

        if (!hrtick_enabled(rq) || curr->sched_class != &fair_sched_class)
                return;

        if (cfs_rq_of(&curr->se)->nr_running < sched_nr_latency)
                hrtick_start_fair(rq, curr);
}

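Updates the next hrtick expiry time. This function is called every time a scheduling entity is enqueued or dequeued.

In code lines 5~6, the function returns if hrtick, which relies on an hrtimer in high-resolution mode, is not in use, or if the current task does not belong to the cfs scheduler.
In code lines 8~9, if the number of active tasks on the cfs runqueue of the current task's scheduling entity is below sched_nr_latency (default 8), the next hrtick expiry time is set.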

hrtick_enabled()

kernel/sched/sched.h

/*
 * Use hrtick when:
 *  - enabled by features
 *  - hrtimer is actually high res
 */

static inline int hrtick_enabled(struct rq *rq)
{
        if (!sched_feat(HRTICK))
                return 0;
        if (!cpu_active(cpu_of(rq)))
                return 0;
        return hrtimer_is_hres_active(&rq->hrtick_timer);
}

Returns whether hrtick, which relies on an hrtimer in high-resolution mode, is active.

hrtimer_is_hres_active()
- return timer->base->cpu_base->hres_active

hrtick_start_fair()

kernel/sched/fair.c

static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
{
        struct sched_entity *se = &p->se;
        struct cfs_rq *cfs_rq = cfs_rq_of(se);

        WARN_ON(task_rq(p) != rq);

        if (cfs_rq->nr_running > 1) {
                u64 slice = sched_slice(cfs_rq, se);
                u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
                s64 delta = slice - ran;

                if (delta < 0) {
                        if (rq->curr == p)
                                resched_curr(rq);
                        return;
                }
                hrtick_start(rq, delta);
        }
}

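Computes the time slice from the weight given to the task and arms hrtick with it. If the time slice is already used up, the function returns without arming the timer.

In code lines 8~9, if more than one entity is running on the cfs runqueue of the task, the time slice to assign to the requested task is computed.
In code lines 10~11, the time the task has already run in the current slice is subtracted from the computed time slice.
In code lines 13~17, if no time slice is left, the function returns. Before returning, a reschedule is requested if curr of the runqueue is this task.
In code line 18, the hrtick timer of the runqueue is started with delta as the expiry time.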
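The decision made by hrtick_start_fair() can be written as a small pure function. This is only a sketch under assumptions: `toy_hrtick_decide()`, `struct toy_decision` and the `TOY_*` values are hypothetical, the slice is passed in instead of being computed by sched_slice(), and no timer is actually programmed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type describing what the kernel function would do. */
enum toy_hrtick_action { TOY_NONE, TOY_RESCHED, TOY_ARM_TIMER };

struct toy_decision {
        enum toy_hrtick_action action;
        int64_t delay_ns;               /* valid only for TOY_ARM_TIMER */
};

static struct toy_decision toy_hrtick_decide(unsigned int nr_running,
                                             uint64_t slice_ns,
                                             uint64_t sum_exec_runtime,
                                             uint64_t prev_sum_exec_runtime,
                                             int task_is_curr)
{
        struct toy_decision d = { TOY_NONE, 0 };

        /* A single runnable entity needs no slice timer. */
        if (nr_running <= 1)
                return d;

        /* Time already consumed in the current slice. */
        uint64_t ran = sum_exec_runtime - prev_sum_exec_runtime;
        int64_t delta = (int64_t)slice_ns - (int64_t)ran;

        if (delta < 0) {
                /* Slice exhausted: reschedule only if the task is running. */
                if (task_is_curr)
                        d.action = TOY_RESCHED;
                return d;
        }

        d.action = TOY_ARM_TIMER;
        d.delay_ns = delta;
        return d;
}
```

For example, with two runnable entities, a 6 ms slice and 2 ms already consumed, the sketch asks for a timer 4 ms ahead; once 7 ms have been consumed it asks for a reschedule instead.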

The following figure shows the hrtick expiry time being set.

Next task selection: (*pick_next_task)

Selects and returns the task to run next. If hrtick is in use, the corresponding time slice is computed and the timer is started.

pick_next_task_fair()

kernel/sched/fair.c -1/2-

static struct task_struct *
pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
        struct cfs_rq *cfs_rq = &rq->cfs;
        struct sched_entity *se;
        struct task_struct *p;
        int new_tasks;

again:
        if (!sched_fair_runnable(rq))
                goto idle;

#ifdef CONFIG_FAIR_GROUP_SCHED
        if (!prev || prev->sched_class != &fair_sched_class)
                goto simple;

        /*
         * Because of the set_next_buddy() in dequeue_task_fair() it is rather
         * likely that a next task is from the same cgroup as the current.
         *
         * Therefore attempt to avoid putting and setting the entire cgroup
         * hierarchy, only change the part that actually changes.
         */

        do {
                struct sched_entity *curr = cfs_rq->curr;

                /*
                 * Since we got here without doing put_prev_entity() we also
                 * have to consider cfs_rq->curr. If it is still a runnable
                 * entity, update_curr() will update its vruntime, otherwise
                 * forget we've ever seen it.
                 */
                if (curr) {
                        if (curr->on_rq)
                                update_curr(cfs_rq);
                        else
                                curr = NULL;

                        /*
                         * This call to check_cfs_rq_runtime() will do the
                         * throttle and dequeue its entity in the parent(s).
                         * Therefore the nr_running test will indeed
                         * be correct.
                         */
                        if (unlikely(check_cfs_rq_runtime(cfs_rq))) {
                                cfs_rq = &rq->cfs;

                                if (!cfs_rq->nr_running)
                                        goto idle;

                                goto simple;
                        }
                }

                se = pick_next_entity(cfs_rq, curr);
                cfs_rq = group_cfs_rq(se);
        } while (cfs_rq);

        p = task_of(se);

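In code lines 9~11, this is the again: label. If there is no cfs task to run, jumps to the idle label.
In code lines 14~15, if there is no previous task or it did not run under the cfs scheduler, jumps to the simple label.
In code lines 25~38, walks from the top-level cfs runqueue downward and updates the runtime, PELT and stats of the curr entity of each cfs runqueue.
In code lines 46~53, if the current group entity has to be throttled, jumps to the simple label. If no entity is running, jumps to the idle label instead.
In code lines 56~58, the next entity is picked and the walk descends into the cfs runqueue it owns.
In code line 60, once the loop ends, the entity is a task entity, one that actually consumes runtime.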

kernel/sched/fair.c -2/2-

        /*
         * Since we haven't yet done put_prev_entity and if the selected task
         * is a different task than we started out with, try and touch the
         * least amount of cfs_rqs.
         */
        if (prev != p) {
                struct sched_entity *pse = &prev->se;

                while (!(cfs_rq = is_same_group(se, pse))) {
                        int se_depth = se->depth;
                        int pse_depth = pse->depth;

                        if (se_depth <= pse_depth) {
                                put_prev_entity(cfs_rq_of(pse), pse);
                                pse = parent_entity(pse);
                        }
                        if (se_depth >= pse_depth) {
                                set_next_entity(cfs_rq_of(se), se);
                                se = parent_entity(se);
                        }
                }

                put_prev_entity(cfs_rq, pse);
                set_next_entity(cfs_rq, se);
        }

        goto done;
simple:
#endif
        if (prev)
                put_prev_task(rq, prev);

        do {
                se = pick_next_entity(cfs_rq, NULL);
                set_next_entity(cfs_rq, se);
                cfs_rq = group_cfs_rq(se);
        } while (cfs_rq);

        p = task_of(se);

done: __maybe_unused;
#ifdef CONFIG_SMP
        /*
         * Move the next running task to the front of
         * the list, so our cfs_tasks list becomes MRU
         * one.
         */
        list_move(&p->se.group_node, &rq->cfs_tasks);
#endif

        if (hrtick_enabled(rq))
                hrtick_start_fair(rq, p);

        update_misfit_status(p, rq);

        return p;

idle:
        if (!rf)
                return NULL;

        new_tasks = newidle_balance(rq, rf);

        /*
         * Because newidle_balance() releases (and re-acquires) rq->lock, it is
         * possible for any higher priority task to appear. In that case we
         * must re-start the pick_next_entity() loop.
         */
        if (new_tasks < 0)
                return RETRY_TASK;

        if (new_tasks > 0)
                goto again;

        /*
         * rq is about to be idle, check if we need to update the
         * lost_idle_time of clock_pelt
         */
        update_idle_rq_clock_pelt(rq);

        return NULL;
}

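In code lines 6~7, this is the case where the task to run has changed.
In code lines 9~21, the entity of the previously running task and the entity of the task to run now are walked upward until they reach the same group, and each step is handled as follows.
- If the new scheduling entity is at the same level as or above the previous scheduling entity (depth 0 is the top), the previous scheduling entity is put back on its cfs runqueue and replaced by its parent scheduling entity.
- If the new scheduling entity is at the same level as or below the previous scheduling entity, the new scheduling entity is set as curr of its cfs runqueue and replaced by its parent scheduling entity.
In code lines 23~24, once both entities are in the same group, the previous scheduling entity is put back on that cfs runqueue and the new scheduling entity is set as curr.
In code line 27, jumps to the done label.
In code lines 28~31, this is the simple: label. If there was a prev task, put_prev_task() is called for it.
In code lines 33~37, walks downward from the top-level cfs runqueue through the cfs runqueues that hold the task to run.
In code line 39, the final task to run next has been determined at this point.
In code lines 41~48, this is the done: label. The task about to run is moved to the front of the cfs_tasks list of the runqueue.
In code lines 51~52, if hrtick is in use, the hrtick timer is set to expire after the remaining time slice of the scheduling entity.
In code lines 54~56, the misfit status is updated and the task to run is returned.
In code lines 58~60, this is the idle: label. If rf (the runqueue flags) is null, null is returned.
In code line 62, a task to run is fetched through idle balancing.
- Since the current cpu has no load, tasks are pulled from other, more heavily loaded cpus.
In code lines 69~70, if the result is less than 0, RETRY_TASK is returned.
In code lines 72~73, if tasks were pulled, jumps to the again label.
In code lines 79~81, if no task was pulled, lost_idle_time of the runqueue is updated and null is returned.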
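The depth-based walk of the two entities toward their common group can be sketched as follows. The types are hypothetical: `toy_se` keeps only `depth`, `parent` and an integer `cfs_rq_id` that stands in for the cfs_rq pointer compared by is_same_group(), and `toy_switch()` merely counts how many put and set operations the walk would perform.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduced entity; cfs_rq_id names the queue it sits on. */
struct toy_se {
        int depth;                      /* 0 = queued on the root cfs_rq */
        int cfs_rq_id;
        struct toy_se *parent;
};

struct toy_walk {
        int puts;                       /* put_prev_entity() calls */
        int sets;                       /* set_next_entity() calls */
        int common_cfs_rq;              /* queue where both sides met */
};

static struct toy_walk toy_switch(const struct toy_se *se,
                                  const struct toy_se *pse)
{
        struct toy_walk w = { 0, 0, -1 };

        while (se->cfs_rq_id != pse->cfs_rq_id) {
                int se_depth = se->depth;
                int pse_depth = pse->depth;

                /* The deeper side climbs; at equal depth both climb. */
                if (se_depth <= pse_depth) {
                        w.puts++;
                        pse = pse->parent;
                }
                if (se_depth >= pse_depth) {
                        w.sets++;
                        se = se->parent;
                }
        }

        /* Final put/set on the shared cfs runqueue. */
        w.puts++;
        w.sets++;
        w.common_cfs_rq = se->cfs_rq_id;
        return w;
}
```

Two tasks in the same group need one put and one set. Tasks in sibling groups need two of each, because the group entities on the root cfs runqueue change as well. Only the part of the hierarchy that differs is touched, which is the point of this code path.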

pick_next_entity()

kernel/sched/fair.c

/*
 * Pick the next process, keeping these things in mind, in this order:
 * 1) keep things fair between processes/task groups
 * 2) pick the "next" process, since someone really wants that to run
 * 3) pick the "last" process, for cache locality
 * 4) do not run the "skip" process, if something else is available
 */

static struct sched_entity *
pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
        struct sched_entity *left = __pick_first_entity(cfs_rq);
        struct sched_entity *se;

        /*
         * If curr is set we have to see if its left of the leftmost entity
         * still in the tree, provided there was anything in the tree at all.
         */
        if (!left || (curr && entity_before(curr, left)))
                left = curr;

        se = left; /* ideally we run the leftmost entity */

        /*
         * Avoid running the skip buddy, if running something else can
         * be done without getting too unfair.
         */
        if (cfs_rq->skip == se) {
                struct sched_entity *second;

                if (se == curr) {
                        second = __pick_first_entity(cfs_rq);
                } else {
                        second = __pick_next_entity(se);
                        if (!second || (curr && entity_before(curr, second)))
                                second = curr;
                }

                if (second && wakeup_preempt_entity(second, left) < 1)
                        se = second;
        }

        /*
         * Prefer last buddy, try to return the CPU to a preempted task.
         */
        if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
                se = cfs_rq->last;

        /*
         * Someone really wants this to run. If it's not unfair, run it.
         */
        if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
                se = cfs_rq->next;

        clear_buddies(cfs_rq, se);

        return se;
}

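Selects and returns the next scheduling entity on the cfs runqueue, taking into account the three buddy entities set on the cfs runqueue.

In code line 4, the first (leftmost) entity among those waiting in the rb tree of the cfs runqueue is looked up.
In code lines 11~14, if no entity was found or its vruntime is larger than that of curr, curr is selected as the entity.
In code lines 20~29, if the selected entity is the skip buddy of the cfs runqueue, the second entity with the next priority is looked up. If none is found or its vruntime is larger than that of curr, curr is selected as the second entity.
In code lines 31~32, if the vruntime of the second entity is smaller than that of the left entity selected first, or exceeds it by no more than gran, the second entity is selected as the scheduling entity.
- gran:
  - The sysctl_sched_wakeup_granularity (default 1000000 ns) period converted to vruntime according to the weight of the scheduling entity.
  - A candidate whose vruntime lies beyond this range is not selected.
In code lines 38~39, if the cfs runqueue has a last buddy and the vruntime of the last entity is smaller than that of the left entity, or exceeds it by no more than gran, the last entity is selected as the scheduling entity.
In code lines 44~45, if the cfs runqueue has a next buddy and the vruntime of the next entity is smaller than that of the left entity, or exceeds it by no more than gran, the next entity is selected as the scheduling entity.
In code line 47, any of the three buddy settings (skip, next, last) of the cfs runqueue that point to the selected entity are cleared.
In code line 49, the finally selected scheduling entity is returned.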
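The preference order among left, last and next can be reduced to a few lines. The sketch below is a simplification with assumptions stated up front: `toy_se`, `toy_fair_enough()` and `toy_pick()` are hypothetical, skip buddy handling is left out, and gran is passed as a fixed value instead of being scaled by the weight of the left entity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduced entity: only vruntime matters here. */
struct toy_se {
        uint64_t vruntime;
};

/* Nonzero when running @buddy instead of @left stays within @gran. */
static int toy_fair_enough(const struct toy_se *buddy,
                           const struct toy_se *left, int64_t gran)
{
        int64_t vdiff = (int64_t)(buddy->vruntime - left->vruntime);

        return vdiff <= gran;   /* same as wakeup_preempt_entity() < 1 */
}

static const struct toy_se *toy_pick(const struct toy_se *left,
                                     const struct toy_se *last,
                                     const struct toy_se *next,
                                     int64_t gran)
{
        const struct toy_se *se = left; /* ideally run the leftmost entity */

        if (last && toy_fair_enough(last, left, gran))
                se = last;
        if (next && toy_fair_enough(next, left, gran))
                se = next;      /* next wins over last when both qualify */

        return se;
}
```

Both buddies are compared against left, not against each other, so a next buddy within gran of left is chosen even if a last buddy also qualified.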

wakeup_preempt_entity()

kernel/sched/fair.c

/*
 * Should 'se' preempt 'curr'.
 *
 *             |s1
 *        |s2
 *   |s3
 *         g
 *      |<--->|c
 *
 *  w(c, s1) = -1
 *  w(c, s2) =  0
 *  w(c, s3) =  1
 *
 */

static int
wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
        s64 gran, vdiff = curr->vruntime - se->vruntime;

        if (vdiff <= 0)
                return -1;

        gran = wakeup_gran(curr, se);
        if (vdiff > gran)
                return 1;

        return 0;
}

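Determines whether the se entity may preempt the curr entity. 1=yes, 0=no because the difference is within gran, -1=no

In code lines 4~7, if the vruntime of curr <= the vruntime of se, -1 is returned so that no preemption takes place.
In code lines 9~11, if the vruntime of se + gran < the vruntime of curr, 1 is returned so that se can preempt.
In code line 13, 0 is returned so that no preemption takes place. This is the case where the time difference is so small that the order would flip again right after switching.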
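The three cases s1, s2 and s3 from the diagram in the comment can be reproduced with plain numbers. This is a minimal sketch, assuming a hypothetical `toy_wakeup_preempt()` that takes the two vruntime values and an already computed gran directly instead of scheduling entities.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for wakeup_preempt_entity(): the vruntimes and
 * gran are passed as plain numbers (ns of virtual time).
 */
static int toy_wakeup_preempt(uint64_t curr_vruntime, uint64_t se_vruntime,
                              int64_t gran)
{
        /* Unsigned subtraction, then a signed view of the difference. */
        int64_t vdiff = (int64_t)(curr_vruntime - se_vruntime);

        if (vdiff <= 0)
                return -1;      /* se is not behind curr at all */

        if (vdiff > gran)
                return 1;       /* se is behind by more than gran */

        return 0;               /* behind, but within gran */
}
```

With curr at 10000000 and gran at 1000000, an se at 10500000 corresponds to s1 and gives -1, an se at 9500000 corresponds to s2 and gives 0, and an se at 8000000 corresponds to s3 and gives 1.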

wakeup_gran()

kernel/sched/fair.c

static unsigned long
wakeup_gran(struct sched_entity *curr, struct sched_entity *se)
{
        unsigned long gran = sysctl_sched_wakeup_granularity;

        /*
         * Since its curr running now, convert the gran from real-time
         * to virtual-time in his units.
         *
         * By using 'se' instead of 'curr' we penalize light tasks, so
         * they get preempted easier. That is, if 'se' < 'curr' then
         * the resulting gran will be larger, therefore penalizing the
         * lighter, if otoh 'se' > 'curr' then the resulting gran will
         * be smaller, again penalizing the lighter task.
         *
         * This is especially important for buddies when the leftmost
         * task is higher priority than the buddy.
         */
        return calc_delta_fair(gran, se);
}

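Returns the minimum vruntime difference (gran) that a waking scheduling entity needs in order to preempt.

The sysctl_sched_wakeup_granularity (default 1000000 ns) period is converted to the vruntime of the scheduling entity and returned.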
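The conversion can be approximated with a plain division. The sketch below is not the kernel's implementation: calc_delta_fair() works with a precomputed inverse weight and shifts, whereas the hypothetical `toy_wakeup_gran()` here simply evaluates gran * 1024 / weight with unscaled weights, where 1024 is the weight of nice 0.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NICE_0_LOAD 1024ULL         /* weight of a nice 0 task */

/*
 * Hypothetical approximation of wakeup_gran(): convert a wall-clock
 * granularity into the virtual time of an entity with the given weight.
 */
static uint64_t toy_wakeup_gran(uint64_t gran_ns, unsigned long weight)
{
        assert(weight > 0);

        /* A nice 0 entity maps real time to virtual time one to one. */
        if (weight == TOY_NICE_0_LOAD)
                return gran_ns;

        return gran_ns * TOY_NICE_0_LOAD / weight;
}
```

For a nice 0 entity the default 1000000 ns stays 1000000. An entity with twice the weight gets half the gran and is therefore easier to run early, while a lighter entity such as nice 5 (weight 335) gets a larger gran, which matches the "penalize light tasks" remark in the kernel comment.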

set_next_entity()

kernel/sched/fair.c

static void
set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
        /* 'current' is not kept within the tree. */
        if (se->on_rq) {
                /*
                 * Any task has to be enqueued before it get to execute on
                 * a CPU. So account for the time it spent waiting on the
                 * runqueue.
                 */
                update_stats_wait_end(cfs_rq, se);
                __dequeue_entity(cfs_rq, se);
                update_load_avg(cfs_rq, se, UPDATE_TG);
        }

        update_stats_curr_start(cfs_rq, se);
        cfs_rq->curr = se;

        /*
         * Track our maximum slice length, if the CPU's load is at
         * least twice that of our own weight (i.e. dont track it
         * when there are only lesser-weight tasks around):
         */
        if (schedstat_enabled() &&
            rq_of(cfs_rq)->cfs.load.weight >= 2*se->load.weight) {
                schedstat_set(se->statistics.slice_max,
                        max((u64)schedstat_val(se->statistics.slice_max),
                            se->sum_exec_runtime - se->prev_sum_exec_runtime));
        }

        se->prev_sum_exec_runtime = se->sum_exec_runtime;
}

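Sets the requested scheduling entity to run on the cfs runqueue. (stored in curr)

In code lines 5~14, before the entity is placed in curr, it is removed from the RB tree and its load average is updated.
In code lines 16~17, the statistics related to the start of execution are set and the entity is stored in curr.
In code lines 24~29, the maximum slice time is updated.
In code line 31, the previous total execution time is updated.

Structures

load_weight structure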

include/linux/sched.h

struct load_weight {
        unsigned long weight;
        u32 inv_weight;
};

weight
- Task load value used by the CFS scheduler, derived from the nice priority given to a user task
- weight=1024 for the default nice value 0
inv_weight
- The 2^32/weight value for the nice value, precomputed in the prio_to_wmult[] array
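The reason for storing 2^32/weight is that a division by weight can then be replaced by a multiplication and a 32-bit shift. The following sketch shows the idea with hypothetical helpers `toy_inv_weight()` and `toy_div_by_weight()`; the kernel's own routine additionally guards against overflow and handles edge cases that are ignored here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the inverse that would be stored for @weight. */
static uint32_t toy_inv_weight(uint32_t weight)
{
        assert(weight >= 2);    /* weight 1 would not fit in 32 bits */

        return (uint32_t)((UINT64_C(1) << 32) / weight);
}

/*
 * Hypothetical helper: approximate value / weight as
 * (value * inv_weight) >> 32. The caller must keep the product
 * within 64 bits.
 */
static uint64_t toy_div_by_weight(uint64_t value, uint32_t inv_weight)
{
        return (value * inv_weight) >> 32;
}
```

For the nice 0 weight of 1024 the inverse is 4194304, and 1024000 multiplied by it and shifted right by 32 gives 1000, the same as 1024000 / 1024. For weights that are not powers of two the truncated inverse can make the result slightly smaller than the exact quotient.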

sched_entity structure

include/linux/sched.h

struct sched_entity {
        /* For load-balancing: */
        struct load_weight              load;
        unsigned long                   runnable_weight;
        struct rb_node                  run_node;
        struct list_head                group_node;
        unsigned int                    on_rq;

        u64                             exec_start;
        u64                             sum_exec_runtime;
        u64                             vruntime;
        u64                             prev_sum_exec_runtime;

        u64                             nr_migrations;

        struct sched_statistics         statistics;

#ifdef CONFIG_FAIR_GROUP_SCHED
        int                             depth;
        struct sched_entity             *parent;
        /* rq on which this entity is (to be) queued: */
        struct cfs_rq                   *cfs_rq;
        /* rq "owned" by this entity/group: */
        struct cfs_rq                   *my_q;
#endif

#ifdef CONFIG_SMP
        /*
         * Per entity load average tracking.
         *
         * Put into separate cache line so it does not
         * collide with read-mostly values above.
         */
        struct sched_avg                avg;
#endif
};

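load
- Load weight of the entity
runnable_weight
- Runnable load weight of the entity
run_node
- Node used to link the entity into the RB tree of the cfs runqueue
group_node
- Node used to link the entity into the cfs_tasks list of the runqueue
on_rq
- Whether the entity is enqueued on the cfs runqueue
- The curr entity is kept outside the RB tree, but its on_rq stays 1 while it is runnable.
exec_start
- Time at which the entity started executing (ns)
sum_exec_runtime
- Total time the entity has executed (ns)
vruntime
- Virtual runtime; entities are sorted in the RB tree of the cfs runqueue by this value.
prev_sum_exec_runtime
- Total time the entity had executed as of the previous update (ns)
nr_migrations
- Number of migrations
statistics
- Scheduling statistics
depth
- Depth of the entity within the task group hierarchy; the root group is 0.
*parent
- Parent entity
- When task groups are configured, entities form a hierarchy.
*cfs_rq
- cfs runqueue the entity belongs to
*my_q
- cfs runqueue managed by the group entity
avg
- Load average of the scheduling entity (PELT)

cfs_rq structure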

kernel/sched/sched.h

/* CFS-related fields in a runqueue */
struct cfs_rq {
        struct load_weight      load;
        unsigned long           runnable_weight;
        unsigned int            nr_running;
        unsigned int            h_nr_running;      /* SCHED_{NORMAL,BATCH,IDLE} */
        unsigned int            idle_h_nr_running; /* SCHED_IDLE */

        u64                     exec_clock;
        u64                     min_vruntime;
#ifndef CONFIG_64BIT
        u64                     min_vruntime_copy;
#endif

        struct rb_root_cached   tasks_timeline;

        /*
         * 'curr' points to currently running entity on this cfs_rq.
         * It is set to NULL otherwise (i.e when none are currently running).
         */
        struct sched_entity     *curr;
        struct sched_entity     *next;
        struct sched_entity     *last;
        struct sched_entity     *skip;

#ifdef  CONFIG_SCHED_DEBUG
        unsigned int            nr_spread_over;
#endif

#ifdef CONFIG_SMP
        /*
         * CFS load tracking
         */
        struct sched_avg        avg;
#ifndef CONFIG_64BIT
        u64                     load_last_update_time_copy;
#endif
        struct {
                raw_spinlock_t  lock ____cacheline_aligned;
                int             nr;
                unsigned long   load_avg;
                unsigned long   util_avg;
                unsigned long   runnable_sum;
        } removed;

#ifdef CONFIG_FAIR_GROUP_SCHED
        unsigned long           tg_load_avg_contrib;
        long                    propagate;
        long                    prop_runnable_sum;

        /*
         *   h_load = weight * f(tg)
         *
         * Where f(tg) is the recursive weight fraction assigned to
         * this group.
         */
        unsigned long           h_load;
        u64                     last_h_load_update;
        struct sched_entity     *h_load_next;
#endif /* CONFIG_FAIR_GROUP_SCHED */
#endif /* CONFIG_SMP */

#ifdef CONFIG_FAIR_GROUP_SCHED
        struct rq               *rq;    /* CPU runqueue to which this cfs_rq is attached */

        /*
         * leaf cfs_rqs are those that hold tasks (lowest schedulable entity in
         * a hierarchy). Non-leaf lrqs hold other higher schedulable entities
         * (like users, containers etc.)
         *
         * leaf_cfs_rq_list ties together list of leaf cfs_rq's in a CPU.
         * This list is used during load balance.
         */
        int                     on_list;
        struct list_head        leaf_cfs_rq_list;
        struct task_group       *tg;    /* group that "owns" this runqueue */

#ifdef CONFIG_CFS_BANDWIDTH
        int                     runtime_enabled;
        s64                     runtime_remaining;

        u64                     throttled_clock;
        u64                     throttled_clock_task;
        u64                     throttled_clock_task_time;
        int                     throttled;
        int                     throttle_count;
        struct list_head        throttled_list;
#endif /* CONFIG_CFS_BANDWIDTH */
#endif /* CONFIG_FAIR_GROUP_SCHED */
};

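load
- Load weight of the cfs runqueue.
runnable_weight
- Runnable load weight of the cfs runqueue.
nr_running
- Number of entities running on this cfs runqueue (throttled group entities are excluded).
- This is the number of entities that take part in the time slice calculation.
h_nr_running
- Number of active tasks on this cfs runqueue and on all cfs runqueues below it (all throttled child groups are excluded).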
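exec_clock
- Accumulated execution time.
min_vruntime
- Minimum virtual runtime.
- The vruntime of a schedule entity enqueued in the RB tree is used relative to this value, that is, with min_vruntime subtracted from it.
min_vruntime_copy
- Used on 32-bit systems.
tasks_timeline
- Root of the cfs runqueue's RB tree.
*rb_leftmost
- Points to the leftmost node in the cfs runqueue's RB tree.
- This is the schedule entity node with the smallest vruntime in the RB tree.
*curr
- The schedule entity currently running on the cfs runqueue.
- It is not kept in the RB tree; it stays outside the tree while it runs.
*next
- next buddy
*last
- last buddy
*skip
- skip buddy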
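avg
- PELT-based load average.
h_load
last_h_load_update
*h_load_next
*rq
- Points to the runqueue that this cfs runqueue belongs to.
on_list
- Whether this cfs runqueue is on the leaf cfs runqueue list.
leaf_cfs_rq_list
- cfs_rqs that contain tasks are added to this list.
*tg
- Task group that this cfs runqueue belongs to.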
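runtime_enabled
- Indicates that cfs bandwidth is configured for this cfs runqueue.
- Used when a quota is set.
runtime_remaining
- Remaining runtime.
throttled_clock
- Time at which throttling started.
throttled_clock_task
- Task-clock time at which throttling started (task time excluding irq processing time).
throttled_clock_task_time
- Total throttled time (throttled task time, excluding irq processing time).
throttled
- true when the cfs runqueue is throttled.
throttle_count
- Throttle counter of the cfs runqueue. Across the hierarchy of schedule entities, the counter is incremented, starting from the child, whenever there is a throttled cfs runqueue.
throttled_list
- List node used when adding this cfs runqueue to the throttled_cfs_rq list of cfs_bandwidth.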
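
The difference between throttled and throttle_count above can be sketched in user-space C. This is only a minimal sketch under the assumption that throttle_count counts the throttled runqueues at or above a given runqueue in the group hierarchy. The demo_group_rq type and every demo_* function are hypothetical names made up for illustration; they are not kernel APIs.

```c
#include <stddef.h>

#define DEMO_MAX_CHILDREN 4

/* Hypothetical, heavily simplified stand-in for a group cfs runqueue. */
struct demo_group_rq {
        int throttled;          /* this runqueue itself was throttled */
        int throttle_count;     /* throttled runqueues at or above this one */
        struct demo_group_rq *children[DEMO_MAX_CHILDREN];
        size_t nr_children;
};

/* Apply delta to throttle_count of rq and of every runqueue below it. */
static inline void demo_walk(struct demo_group_rq *rq, int delta)
{
        rq->throttle_count += delta;
        for (size_t i = 0; i < rq->nr_children; i++)
                demo_walk(rq->children[i], delta);
}

/* Throttle rq: set its own flag and bump the counter down the hierarchy. */
static inline void demo_throttle(struct demo_group_rq *rq)
{
        rq->throttled = 1;
        demo_walk(rq, +1);
}

/* Unthrottle rq: clear its own flag and drop the counter down the hierarchy. */
static inline void demo_unthrottle(struct demo_group_rq *rq)
{
        rq->throttled = 0;
        demo_walk(rq, -1);
}

/* Nonzero while rq or any runqueue above it is still throttled. */
static inline int demo_throttled_hierarchy(const struct demo_group_rq *rq)
{
        return rq->throttle_count > 0;
}
```

With a root, a middle group and a leaf, throttling the middle group leaves the root's counter at 0 and raises the middle and leaf counters to 1, while only the middle group has throttled set. A leaf can therefore be blocked by a throttled ancestor even though its own throttled flag is clear.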

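The min_vruntime entry above says that an entity's vruntime is used relative to min_vruntime. A minimal user-space sketch of that idea follows, assuming 64-bit unsigned vruntime values that may wrap around. The demo_cfs_rq and demo_entity types and the demo_* helpers are hypothetical simplifications, not the kernel's definitions; the signed-difference comparison is the idiom this sketch is meant to show.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct demo_cfs_rq {
        uint64_t min_vruntime;
};

struct demo_entity {
        uint64_t vruntime;
};

/*
 * Key of an entity relative to the runqueue's min_vruntime.
 * The unsigned subtraction wraps modulo 2^64, and the cast to a signed
 * type (two's complement on mainstream compilers) turns "slightly behind
 * min_vruntime" into a small negative key.
 */
static inline int64_t demo_entity_key(const struct demo_cfs_rq *cfs_rq,
                                      const struct demo_entity *se)
{
        return (int64_t)(se->vruntime - cfs_rq->min_vruntime);
}

/* Nonzero when a should be placed to the left of b in the RB tree. */
static inline int demo_entity_before(const struct demo_entity *a,
                                     const struct demo_entity *b)
{
        return (int64_t)(a->vruntime - b->vruntime) < 0;
}
```

Comparing through a signed difference keeps the ordering correct even after vruntime wraps past the 64-bit maximum, as long as the compared values stay within half of the 64-bit range of each other.
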
References

Scheduler -1- (Basic) | 문c
Scheduler -2- (Global Cpu Load) | 문c
Scheduler -3- (PELT) | 문c
Scheduler -4- (Group Scheduling) | 문c
Scheduler -5- (Scheduler Core) | 문c
Scheduler -6- (CFS Scheduler) | 문c – current post
Scheduler -7- (Preemption & Context Switch) | 문c
Scheduler -8- (CFS Bandwidth) | 문c
Scheduler -9- (RT Scheduler) | 문c
Scheduler -10- (Deadline Scheduler) | 문c
Scheduler -11- (Stop Scheduler) | 문c
Scheduler -12- (Idle Scheduler) | 문c
Scheduler -13- (Scheduling Domain 1) | 문c
Scheduler -14- (Scheduling Domain 2) | 문c
Scheduler -15- (Load Balance 1) | 문c
Scheduler -16- (Load Balance 2) | 문c
Scheduler -17- (Load Balance 3 NUMA) | 문c
Scheduler -18- (Load Balance 4 EAS) | 문c
Scheduler -19- (Initialization) | 문c
PID Management | 문c
do_fork() | 문c
cpu_startup_entry() | 문c
Runqueue Load Average (cpu_load[]) – v4.0 | 문c
PELT(Per-Entity Load Tracking) – v4.0 | 문c

CPU bandwidth control for CFS | Google Paul Turner, IBM Bharata B Rao, Google Nikhil Rao – download pdf
CFS bandwidth control | LWN.net – Korean translation | KERNELMSG
CFS Bandwidth Control | kernel.org
cfs scheduling 1 & cfs group scheduling (1) | 솔개가하늘을가르네
Linux 3.2 – CFS CPU bandwidth (english version) | Christophe Blaess
integrating the scheduler and cpufreq | Linaro connect – download pdf
CFS Source Code Analysis | 어린아이