문c Blog

Scheduler -7- (Preemption & Context Switch)

2017-04-28, 2022-06-06 문영일

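Multitasking

On the ARM architecture, each CPU core runs one thread at a time. (Some architectures support two or more hardware threads per core, e.g. Intel Hyper-Threading.) To process several user-requested tasks concurrently, the scheduler divides time into periods and gives each task a slice of every period.

The following figure shows an example of time-sliced scheduling. (Default scheduling latency on a 1-core system: 6 ms)

Case A) Running 3 tasks:
- Tasks A, B and C are each scheduled once per 6 ms scheduling latency period, in proportion to their load weights. This is how multitasking is achieved.
Case B) Running 1 task:
- With no competing task, the single task keeps running. While it sleeps, the idle task runs in its place.
Case C) Running 10 tasks:
- With more than 8 tasks the default scheduling latency is no longer enough, so the scheduling latency is recomputed as the minimum granularity (0.75 ms) x the number of tasks.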
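
The period calculation behind the three cases can be sketched in user-space C. This is a simplified model rather than the kernel's code: the constants are the 1-core defaults quoted above, and the names sched_period_ns() and sched_slice_ns() are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed defaults for a 1-core system, taken from the text above. */
#define SCHED_LATENCY_NS    6000000ULL  /* 6 ms */
#define MIN_GRANULARITY_NS   750000ULL  /* 0.75 ms */
#define SCHED_NR_LATENCY    (SCHED_LATENCY_NS / MIN_GRANULARITY_NS)  /* 8 */

/* Length of one scheduling period for nr_running runnable tasks. */
static uint64_t sched_period_ns(unsigned int nr_running)
{
        if (nr_running > SCHED_NR_LATENCY)
                return (uint64_t)nr_running * MIN_GRANULARITY_NS;
        return SCHED_LATENCY_NS;
}

/* Share of the period that a task with load weight `weight` receives. */
static uint64_t sched_slice_ns(unsigned int nr_running,
                               uint64_t weight, uint64_t total_weight)
{
        assert(total_weight > 0 && weight <= total_weight);
        return sched_period_ns(nr_running) * weight / total_weight;
}
```

Under these assumptions, 3 equally weighted tasks each get 2 ms of a 6 ms period, while 10 tasks stretch the period to 10 x 0.75 ms = 7.5 ms.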

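User Mode vs Kernel Mode

The following figure separates tasks running in user mode from those running in kernel mode.

The exact names of kernel mode and user mode differ slightly between architectures.
- e.g. On ARM32 and ARM64 AArch32, kernel mode is PL1 (Privilege Level 1). In some cases the hypervisor's PL2 is also included. User mode runs at PL0.
- e.g. On ARM64 AArch64, kernel mode is EL1 (Exception Level 1). In some cases the hypervisor's EL2 is also included. User mode runs at EL0 (Exception Level 0).
dl, rt and cfs tasks can run either in user mode or in kernel mode.
- e.g. kthreadd is a cfs task that runs in kernel mode, and /sbin/init is a cfs task that runs in user mode.

As the following figure shows, interrupt and exception handling code runs in kernel mode. Syscalls requested from user space also run in kernel mode.

Across all of the regions below, it is important to identify which ones are preemptible and which are not under each Linux kernel preemption model.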

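Preemption

To keep a single task from monopolizing CPU time, the kernel switches to another task at the following moments.

1) Yield

As the following figure shows, regardless of user or kernel task and in every preemption model, calling yield() or schedule() selects the next task to run.

2) User preemption

As the following figure shows, when a preemption request is raised by a syscall, exception or interrupt while a user task is running, the reschedule takes place on return to user mode.

The request is raised when the timer interrupt finds that the current task has used up its runtime, or when another device interrupt has a task that should run first.

3) Kernel preemption

CONFIG_PREEMPT_NONE
- A kernel thread is not preempted unless it yields on its own. When you write a kernel task, you should normally design it to yield (yield or sleep) at suitable points.
CONFIG_PREEMPT_VOLUNTARY
- As above, a kernel thread is not preempted unless it yields. However, preemption points are placed throughout the kernel APIs; each one checks the reschedule request flag and yields directly.
CONFIG_PREEMPT & CONFIG_PREEMPT_RT
- Preemption can happen at any time, as with user preemption. In the following cases, however, preemption is not immediate and is slightly delayed.
  - While an interrupt is being handled
    - Preemption is allowed when the interrupt handler finishes, so a small latency occurs.
  - While preempt_disable() is in effect
    - Preemption stays disallowed until preempt_enable() is called, so a small latency occurs.

The following figure shows how kernel preemption is handled under each of the four preemption models.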

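Scheduling on Runtime Exhaustion (User Preemption Example)

A timer interrupt is used to check whether a user task has used up the runtime computed from its load weight ratio. One or both of the following two kinds of tick are used.

hrtick
- On a system that uses hrtick, a timer interrupt is programmed to match the computed runtime and runtime exhaustion is checked each time it fires, so tasks are switched precisely on time.
- Both the CONFIG_SCHED_HRTICK kernel option and the hrtick feature must be enabled.
  - Default: CONFIG_SCHED_HRTICK is enabled.
  - Default: the hrtick feature is disabled.
Regular periodic tick
- On a system that uses the regular periodic tick, runtime exhaustion is checked on every tick. Any runtime consumed beyond the allotted amount is deducted later.
- On a periodic tick, the scheduler switches strictly on runtime exhaustion, or to the task with the smallest vruntime.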
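
The difference between the two ticks can be illustrated with a small user-space model. It is only a sketch under stated assumptions: the task is assumed to start running exactly on a tick boundary, and the function names are hypothetical, not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Time (ns after the task started running) at which runtime exhaustion
 * is noticed when only the periodic tick checks it: the first tick
 * boundary at or after the end of the slice.
 */
static uint64_t detect_periodic_ns(uint64_t slice_ns, uint64_t tick_ns)
{
        assert(tick_ns > 0);
        return (slice_ns + tick_ns - 1) / tick_ns * tick_ns;
}

/* With hrtick, a timer is programmed for the exact end of the slice. */
static uint64_t detect_hrtick_ns(uint64_t slice_ns)
{
        return slice_ns;
}

/* Runtime used beyond the slice; per the text it is deducted later. */
static uint64_t overrun_ns(uint64_t detected_ns, uint64_t slice_ns)
{
        assert(detected_ns >= slice_ns);
        return detected_ns - slice_ns;
}
```

For example, with a 2.5 ms slice and a 4 ms tick (HZ=250), the periodic tick notices the exhaustion at 4 ms, a 1.5 ms overrun, whereas hrtick fires at 2.5 ms with no overrun.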

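The following figure shows two examples for user tasks, one using hrtick and one using the regular periodic tick.

Both ticks can also be used together for the preemption check. (double tick feature)

With the CONFIG_PREEMPT or CONFIG_PREEMPT_RT kernel option, a kernel task that triggers a preemption request through runtime exhaustion behaves the same as the user task case above. With the other options, note that preemption relies on explicit yield code and preemption points.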

When Must Preemption Be Prevented?

Preemption brings synchronization problems with it.
Preemption must not occur inside a region protected as a critical section, so preempt_disable() is used in such cases.
preempt_disable() simply increments the preempt counter.
When an interrupt occurs and the preempt counter is not 0, scheduling is blocked, i.e. no context switch to another thread takes place.
Of course, preemption also cannot occur while interrupts are disabled altogether.
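
The counter behavior described above can be modeled in plain C. This is a minimal sketch that assumes a fully preemptible kernel; every pm_* name is a hypothetical stand-in, not a kernel symbol.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU state for this model (not kernel structures). */
static int  pm_preempt_count;   /* nesting depth of preempt_disable() */
static bool pm_need_resched;    /* set by an "interrupt" */
static int  pm_switches;        /* number of context switches taken */

static void pm_schedule(void)
{
        pm_need_resched = false;
        pm_switches++;
}

static void pm_preempt_disable(void)
{
        pm_preempt_count++;
}

static void pm_preempt_enable(void)
{
        assert(pm_preempt_count > 0);   /* catches an unbalanced enable */
        if (--pm_preempt_count == 0 && pm_need_resched)
                pm_schedule();
}

/* An interrupt requests a reschedule; it only happens now if allowed. */
static void pm_irq_requests_resched(void)
{
        pm_need_resched = true;
        if (pm_preempt_count == 0)
                pm_schedule();
}
```

With two nested pm_preempt_disable() calls, a reschedule request raised in between is held back until the second pm_preempt_enable() brings the counter back to 0.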

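Characteristics of the Linux Preemption Models

In user mode, a task can be preempted by another task and taken off the CPU even before it has used up its time slice. In kernel threads and in kernel mode, however, behavior depends on which of the following four preemption models is selected.

PREEMPT_NONE:

No Forced Preemption (Server)
This model gives up as much responsiveness (latency) as possible in favor of batch work; it is optimized for throughput and suited to servers.
Uses a low timer frequency of 100 Hz to 250 Hz.
Minimizes context switching.

PREEMPT_VOLUNTARY:

Voluntary Kernel Preemption (Desktop)
Preemption points are placed in APIs or drivers that may run for a long time, so that a task that requested a reschedule can run first. (The schedule is changed at these preemption points where needed.)
This model improves responsiveness (latency) enough for keyboard, mouse and multimedia work while still preserving a good part of the throughput, which makes it suited to desktops.
preemption points
- Explicit preemption points were added to many code paths that run in kernel mode. They check from time to time whether a reschedule is needed, which raises the number of preemption opportunities and reduces preemption latency.
- With the help of these preemption points, the latency for starting an urgent task was reduced to within 100 us.
- A preemption point is usually added to a routine that takes 1 ms or longer.
- The kernel currently contains close to a thousand preemption points.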

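PREEMPT:

Preemptible Kernel (Low-Latency Desktop)
Preemption is always allowed in kernel mode too, except in a few situations (the same applies to PREEMPT_RT), so a higher-priority task gets to run first nearly all of the time. Preemption points are therefore unnecessary with this option and are replaced with empty code.
This model suits embedded systems such as network devices that need fast responsiveness, with latency on the order of milliseconds.

PREEMPT_RT:

Fully Preemptible Kernel (Real-Time)
It was officially added to mainline in kernel v5.3-rc1.
- sched/rt, Kconfig: Introduce CONFIG_PREEMPT_RT (2019, v5.3-rc1)
Kernel mainline developers had long been concerned about a fully preemptible kernel. The concern was performance degradation, which is why it could not be merged into mainline earlier. (Reference: Optimizing preemption | LWN.net)
Most timer and device interrupt handling is moved to the bottom half by default, so the time blocked by hardirq handling becomes shorter. This model is therefore suited to systems that need real-time interrupt handling.
hard and soft mode flags were added to hrtimer, and soft mode is the default. In soft mode the timer function is handled in the bottom half and runs in an rt thread; in hard mode it is handled in hardirq context.
Only systems with architecture support can use it.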

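Other Features for Reducing Kernel IRQ Latency

Even under the preempt and preempt_rt models, syscalls and irqs are currently not handled in a nested way; the hardirq service runs with irqs disabled, which is the main cause of increased irq latency. Nesting can happen within an extremely short window, but full irq nesting is not supported.
Besides the hardirq service path, much kernel code uses local_irq_disable() for synchronization, which gets in the way of fast interrupt handling. This also contributes to irq latency.
On ARM systems, fiq can be used to get around the hardirq limitation above. For performance reasons fiq cannot use the irq domain subsystem, and its handler is left empty. It can therefore be handled in a custom way on a specific system, with some restrictions.
There is also the Pseudo-NMI method, which is more powerful than fiq: like the x86 NMI, it can handle selected critical interrupts even under local_irq_disable(). It is only available on ARM64 systems that use GIC v3.0 or later and on which the ARM64_HAS_IRQ_PRIO_MASKING capability is active.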

Other Preemption-Related Kernel Options

CONFIG_PREEMPTION

When the CONFIG_PREEMPT_RT code was introduced, the existing CONFIG_PREEMPT was renamed to CONFIG_PREEMPT_LL. This change broke oldconfig builds, so it was rolled back and CONFIG_PREEMPTION was newly created instead. This option is enabled when either CONFIG_PREEMPT or CONFIG_PREEMPT_RT is selected.
- Reference: sched/rt, Kconfig: Unbreak def/oldconfig with CONFIG_PREEMPT=y (2019, v5.3-rc2)

CONFIG_PREEMPT_COUNT

Split out from the preempt option. Thanks to this split, the preempt_none and preempt_voluntary models can also block preemption by incrementing preempt_count with preempt_disable().
- Reference: sched: Isolate preempt counting in its own config option (2011, v3.1-rc1)
In the preempt_none and preempt_voluntary models, where CONFIG_PREEMPTION is not used, it can be selected as an option. (default=n)
In the preempt and preempt_rt models, where CONFIG_PREEMPTION is used, it is always enabled.

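Preemption Support by Kernel Version

Up to kernel version 2.4, only user mode was preemptible.
- A process that entered kernel mode from user mode through a system call API could not be preempted while running there.
From version 2.6, kernel mode became preemptible as well.
- Preemption by another task became possible even while running a driver or while in kernel mode through a system call API.
- Except for the CONFIG_PREEMPT_NONE preemption model.
- Although kernel mode became preemptible, Linux was not originally designed as a real-time OS, and because of the big kernel lock (nested double and triple critical sections, etc.) interrupt response was not fast when it mattered. The big kernel lock has since been removed completely.
- To reduce interrupt latency, the interrupt handler was split into two parts, the top half and the bottom half. (Two part interrupt handler | 문c)
RT-Preempt Linux kernel (model with the RT patch applied)
- Preemption is supported even inside critical sections and during interrupt handling.
- In an SMP environment, critical sections guarded by spinlocks become a problem because they hurt the efficiency of concurrent access by CPUs. To solve this, spinlocks use RT mutexes so that they no longer disable preemption. Routines that really do require preemption to be disabled were moved over to raw_spin_lock.
Mainline integration of PREEMPT_RT
- Many features of the RT-Preempt Linux kernel had been added to mainline little by little over the years.
- Until the PREEMPT_RT preemption model joined mainline in v5.3, a separately patched RT-Preempt Linux kernel had to be used.
- The current version is a limited RT-Preempt kernel: preemption during interrupt handling and while holding a spinlock is not possible. Instead, to reduce irq latency during hardirq handling, timer and device irq handling is done in the bottom half by default. Additional code is still needed to allow preemption while holding a spinlock; this was not enabled because some drivers are incompatible.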

RT Mutex based PI(Priority Inheritance) Protocol

Introduced in kernel version 2.6.18
kernel space locking
user space locking is also supported with fast locks through “futex”
When preemption occurs, priority inversion is another problem that arises in critical sections.
The priority inheritance protocol is used to eliminate priority inversion.
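
The core idea of priority inheritance fits in a few lines of C. The sketch below is deliberately simplified (one lock, no waiter queue, no chained locks) and does not reflect the kernel's rtmutex implementation; all pi_* types and functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model types; a smaller prio value means higher priority. */
struct pi_task {
        int normal_prio;   /* priority the task was given */
        int prio;          /* effective priority after inheritance */
};

struct pi_lock {
        struct pi_task *owner;
};

/* A waiter blocks on the lock: the owner inherits a higher priority. */
static void pi_block_on(struct pi_lock *lock, struct pi_task *waiter)
{
        assert(lock->owner != NULL && lock->owner != waiter);
        if (waiter->prio < lock->owner->prio)
                lock->owner->prio = waiter->prio;
}

/* The owner releases the lock and drops back to its own priority. */
static void pi_unlock(struct pi_lock *lock)
{
        assert(lock->owner != NULL);
        lock->owner->prio = lock->owner->normal_prio;
        lock->owner = NULL;
}
```

While the high-priority waiter is blocked, the low-priority owner runs at the waiter's priority, so a medium-priority task cannot delay the release of the lock.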

Preemption Point 구현

might_sleep()

include/linux/kernel.h

# define might_sleep() do { might_resched(); } while (0)

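Acts as a preemption point on a CONFIG_PREEMPT_VOLUNTARY kernel. If a reschedule has been requested and the preemption counter is 0, so that preemption is allowed, it schedules and gives up the CPU.

The current task gives up the CPU, and the next task to run is picked from the runqueue and scheduled.
For kernel code that runs for 1 ms or longer, it provides a point where the task voluntarily yields to a higher-priority task and can be preempted, for kernels that use the voluntary preemption model.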

might_resched()

include/linux/kernel.h

#ifdef CONFIG_PREEMPT_VOLUNTARY
# define might_resched() _cond_resched()
#else
# define might_resched() do { } while (0)
#endif

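Acts as a preemption point on a CONFIG_PREEMPT_VOLUNTARY kernel: if a reschedule has been requested and rescheduling is possible, it schedules.

The following figure shows how the schedule function is reached on a voluntary kernel when might_sleep() and cond_resched() are called.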

cond_resched()

include/linux/sched.h

#define cond_resched() ({                       \
        ___might_sleep(__FILE__, __LINE__, 0);  \
        _cond_resched();                        \
})

If a reschedule has been requested and rescheduling is possible (preempt count=0), it schedules. Returns 1 if it rescheduled.

Available on CONFIG_PREEMPT_NONE and CONFIG_PREEMPT_VOLUNTARY kernels.

_cond_resched()

kernel/sched/core.c

int __sched _cond_resched(void)
{
        if (should_resched(0)) {
                preempt_schedule_common();
                return 1;
        }
        rcu_all_qs();
        return 0;
}
EXPORT_SYMBOL(_cond_resched);

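If a reschedule has been requested and rescheduling is possible (preempt count=0), it schedules. Returns 1 if it rescheduled.

Code lines 3~6: if preemption is possible, schedule and return 1.
Code line 7: report a quiescent state (qs) on the current cpu for all RCU flavors.
Code line 8: no reschedule took place, so return 0.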
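
How such a preemption point behaves inside a long-running loop can be sketched in user space. The model below only mirrors the control flow of _cond_resched(); the cr_* names and the "request arrives at iteration N" parameter are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for per-task state (not kernel structures). */
static int  cr_preempt_count;
static bool cr_need_resched;
static int  cr_schedule_calls;

static bool cr_should_resched(int preempt_offset)
{
        return cr_preempt_count == preempt_offset && cr_need_resched;
}

/* Mirrors the shape of _cond_resched(): 1 if we rescheduled, else 0. */
static int cr_cond_resched(void)
{
        if (cr_should_resched(0)) {
                cr_need_resched = false;  /* the scheduler clears the request */
                cr_schedule_calls++;
                return 1;
        }
        return 0;
}

/* A long-running loop with a preemption point in every iteration. */
static int cr_process_items(int nr_items, int resched_at)
{
        int yielded = 0;

        assert(nr_items >= 0);
        for (int i = 0; i < nr_items; i++) {
                if (i == resched_at)
                        cr_need_resched = true;  /* e.g. set from a timer tick */
                yielded += cr_cond_resched();
        }
        return yielded;
}
```

The loop yields exactly once, at the first preemption point reached after the request is raised; with a non-zero preempt count the same request would be ignored by cr_cond_resched().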

should_resched() – Generic

include/asm-generic/preempt.h

/*              
 * Returns true when we need to resched and can (barring IRQ state).
 */

static __always_inline bool should_resched(int preempt_offset)
{
        return unlikely(preempt_count() == preempt_offset &&
                        tif_need_resched());
}

Returns true if a reschedule has been requested and the preempt counter equals @preempt_offset.

should_resched() – ARM64

arch/arm64/include/asm/preempt.h

static inline bool should_resched(int preempt_offset)
{
        u64 pc = READ_ONCE(current_thread_info()->preempt_count);
        return pc == preempt_offset;
}

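Returns true if a reschedule has been requested and the preempt counter equals @preempt_offset. The 64-bit preempt_count is read in one access; because its need_resched half is 0 while a reschedule is requested, a single comparison covers both conditions.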

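Reschedule Request – TIF_NEED_RESCHED Flag

On 32-bit architectures, preempt_count and the TIF_NEED_RESCHED request (kept in flags) had to be stored separately, and on architectures that use LL/SC there was the problem that an interrupt could cause the two members to be read without having been updated together. On 64-bit architectures such as ARM64, preempt_count was widened to 64 bits: one half holds the counter for its original purpose and the other half holds the reschedule request (need_resched). Both can now be accessed at once, which solves the problem.

Reference: arm64: preempt: Provide our own implementation of asm/preempt.h (2018, v5.0-rc1)

The 32-bit and 64-bit code below shows that the reschedule request is used in a slightly modified form on 64-bit.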

arch/arm/include/asm/thread_info.h – ARM32

struct thread_info {
        unsigned long           flags;          /* TIF_NEED_RESCHED flag is used */
        int                     preempt_count;
(...omitted...)

arch/arm64/include/asm/thread_info.h – ARM64

struct thread_info {
        unsigned long           flags;          /* TIF_NEED_RESCHED flag is still used */
(...omitted...)
        union {
                u64             preempt_count;
                struct {
#ifdef CONFIG_CPU_BIG_ENDIAN
                        u32     need_resched;   /* field order used on big-endian */
                        u32     count;
#else
                        u32     count;          /* same as the former 32-bit preempt_count */
                        u32     need_resched;   /* same purpose as TIF_NEED_RESCHED */
                                                /* note: 0 = reschedule requested, initial value = 1 */
#endif
                } preempt;
        };
};
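
Why the inverted need_resched half is convenient can be seen in a small user-space model. It is a sketch only: instead of the union it uses shifts on a uint64_t (low 32 bits = count, bit 32 = need_resched) so that it does not depend on host endianness, and the pc64_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* need_resched lives in the upper half; 0 means "reschedule requested". */
#define PC64_NEED_RESCHED  (1ULL << 32)

static uint64_t pc64 = PC64_NEED_RESCHED;  /* count = 0, no request pending */

static void pc64_set_need_resched(void)   { pc64 &= ~PC64_NEED_RESCHED; }
static void pc64_clear_need_resched(void) { pc64 |= PC64_NEED_RESCHED; }

/* Update only the count half, leaving need_resched unchanged. */
static void pc64_count_add(uint32_t val)
{
        uint32_t count = (uint32_t)pc64 + val;

        pc64 = (pc64 & PC64_NEED_RESCHED) | count;
}

/* One comparison answers "count == offset AND reschedule requested". */
static bool pc64_should_resched(uint32_t preempt_offset)
{
        return pc64 == preempt_offset;
}

/* Decrement count; true when the whole 64-bit value became zero. */
static bool pc64_count_dec_and_test(void)
{
        uint32_t count = (uint32_t)pc64;

        assert(count > 0);
        pc64 = (pc64 & PC64_NEED_RESCHED) | (count - 1);
        return pc64 == 0;
}
```

Because a pending request is encoded as 0, "counter is zero and a reschedule is requested" is simply "the whole 64-bit word is zero".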

need_resched()

include/linux/sched.h

static __always_inline bool need_resched(void)
{
        return unlikely(tif_need_resched());
}

Returns whether a reschedule request is recorded for the current thread. true(1) = reschedule requested

tif_need_resched()

include/linux/thread_info.h

#define tif_need_resched() test_thread_flag(TIF_NEED_RESCHED)

Returns whether a reschedule request is recorded for the current thread. true(1) = reschedule requested

test_thread_flag()

include/linux/thread_info.h

#define test_thread_flag(flag) \
        test_ti_thread_flag(current_thread_info(), flag)

Returns true(1) if the requested flag bit is set on the current thread.

Reschedule Now

resched_curr()

kernel/sched/core.c

/*
 * resched_curr - mark rq's current task 'to be rescheduled now'.
 *
 * On UP this means the setting of the need_resched flag, on SMP it
 * might also involve a cross-CPU call to trigger the scheduler on
 * the target CPU.
 */

void resched_curr(struct rq *rq)
{
        struct task_struct *curr = rq->curr;
        int cpu;

        lockdep_assert_held(&rq->lock);

        if (test_tsk_need_resched(curr))
                return;

        cpu = cpu_of(rq);

        if (cpu == smp_processor_id()) {
                set_tsk_need_resched(curr);
                set_preempt_need_resched();
                return;
        }

        if (set_nr_and_not_polling(curr))
                smp_send_reschedule(cpu);
        else
                trace_sched_wake_idle_without_ipi(cpu);
}

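Sets the reschedule request flag on the runqueue's current task. If the runqueue's cpu is not the current cpu, a reschedule IPI is sent.

Code lines 8~9: if the reschedule request flag is already set on the task currently running on the runqueue, return.
Code lines 11~17: if the runqueue's cpu is the current cpu, set the reschedule request flag on the task and return.
- On ARM32 kernels (asm-generic implementation) the set_preempt_need_resched() function does nothing.
Code lines 19~20: the task to reschedule runs on another cpu. Set the reschedule request flag and, if TIF_POLLING_NRFLAG is not set on that task, send a reschedule IPI.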

test_tsk_need_resched()

include/linux/sched.h

static inline int test_tsk_need_resched(struct task_struct *tsk)
{
        return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED));
}

Returns whether the reschedule request flag is set on the given task. unlikely() only hints to the compiler that the flag is rarely set.

set_tsk_need_resched()

include/linux/sched.h

static inline void set_tsk_need_resched(struct task_struct *tsk)
{
        set_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
}

Sets the reschedule request flag on the given task.

set_nr_and_not_polling()

kernel/sched/core.c

#if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG)
/*
 * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG,
 * this avoids any races wrt polling state changes and thereby avoids
 * spurious IPIs.
 */

static bool set_nr_and_not_polling(struct task_struct *p)
{
        struct thread_info *ti = task_thread_info(p);
        return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
}
#else
static bool set_nr_and_not_polling(struct task_struct *p)
{       
        set_tsk_need_resched(p);
        return true;
}
#endif

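Depending on whether the kernel is SMP and supports TIF_POLLING_NRFLAG, the function does the following to decide whether an IPI is needed.

If supported, it atomically sets the reschedule request flag in the flags of the task's thread_info. If _TIF_POLLING_NRFLAG was set in those flags, the target is polling the flag and will notice the request by itself, so false is returned to avoid a spurious IPI caused by racing with polling state changes.
If not supported, it sets the reschedule request flag on the task and returns true.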
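
The "set the flag and test for polling in one atomic step" idea can be tried out with C11 atomics. This is a user-space sketch: the bit positions and the xp_ names are assumptions for illustration, and atomic_fetch_or() stands in for the kernel's fetch_or().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit positions for this model; real values are per-arch. */
#define XP_TIF_NEED_RESCHED    (1UL << 1)
#define XP_TIF_POLLING_NRFLAG  (1UL << 2)

/*
 * Set NEED_RESCHED and learn, in the same atomic step, whether the
 * target was polling. Returns true when an IPI is still required.
 */
static bool xp_set_nr_and_not_polling(atomic_ulong *flags)
{
        unsigned long old;

        assert(flags != NULL);
        old = atomic_fetch_or(flags, XP_TIF_NEED_RESCHED);
        return !(old & XP_TIF_POLLING_NRFLAG);
}
```

Doing the OR and the test on the same returned value is what closes the race: checking the polling bit in a separate load could observe a state that changed after the flag was set.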

smp_send_reschedule()

arch/arm64/kernel/smp.c

void smp_send_reschedule(int cpu)
{
        smp_cross_call(cpumask_of(cpu), IPI_RESCHEDULE);
}

Sends a reschedule IPI to the given cpu.

Preempt Count

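Whether the current thread may be preempted is tracked with a nestable counter that is incremented and decremented; preemption becomes possible when the value reaches 0.

Reference: Four short stories about preempt_count() (2020) | LWN.net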

preempt_count() – Generic

include/asm-generic/preempt.h

static __always_inline int preempt_count(void)
{
        return READ_ONCE(current_thread_info()->preempt_count);
}

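Returns the preempt counter of the current task.

Preemption becomes possible when the 32-bit preemption counter reaches 0. The bits of the counter are grouped into fields, each of which masks preemption for a different reason. If any field has a bit set, preemption does not occur.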

preempt_count() – ARM64

arch/arm64/include/asm/preempt.h

static inline int preempt_count(void)
{
        return READ_ONCE(current_thread_info()->preempt.count);
}

Returns the preempt counter of the current task. (32-bit preempt.count)

include/linux/preempt.h

/*
 * We put the hardirq and softirq counter into the preemption
 * counter. The bitmask has the following meaning:
 *
 * - bits 0-7 are the preemption count (max preemption depth: 256)
 * - bits 8-15 are the softirq count (max # of softirqs: 256)
 *
 * The hardirq count could in theory be the same as the number of
 * interrupts in the system, but we run all interrupt handlers with
 * interrupts disabled, so we cannot have nesting interrupts. Though
 * there are a few palaeontologic drivers which reenable interrupts in
 * the handler, so we need more than one bit here.
 *
 *         PREEMPT_MASK:        0x000000ff
 *         SOFTIRQ_MASK:        0x0000ff00
 *         HARDIRQ_MASK:        0x000f0000
 *             NMI_MASK:        0x00100000
 * PREEMPT_NEED_RESCHED:        0x80000000 <- note: 0x1_00000000 on ARM64.
 */

#define PREEMPT_BITS    8
#define SOFTIRQ_BITS    8
#define HARDIRQ_BITS    4
#define NMI_BITS        1

#define PREEMPT_SHIFT   0
#define SOFTIRQ_SHIFT   (PREEMPT_SHIFT + PREEMPT_BITS)
#define HARDIRQ_SHIFT   (SOFTIRQ_SHIFT + SOFTIRQ_BITS)
#define NMI_SHIFT       (HARDIRQ_SHIFT + HARDIRQ_BITS)

#define __IRQ_MASK(x)   ((1UL << (x))-1)

#define PREEMPT_MASK    (__IRQ_MASK(PREEMPT_BITS) << PREEMPT_SHIFT)
#define SOFTIRQ_MASK    (__IRQ_MASK(SOFTIRQ_BITS) << SOFTIRQ_SHIFT)
#define HARDIRQ_MASK    (__IRQ_MASK(HARDIRQ_BITS) << HARDIRQ_SHIFT)
#define NMI_MASK        (__IRQ_MASK(NMI_BITS)     << NMI_SHIFT)

#define PREEMPT_OFFSET  (1UL << PREEMPT_SHIFT)
#define SOFTIRQ_OFFSET  (1UL << SOFTIRQ_SHIFT)
#define HARDIRQ_OFFSET  (1UL << HARDIRQ_SHIFT)
#define NMI_OFFSET      (1UL << NMI_SHIFT)

#define SOFTIRQ_DISABLE_OFFSET  (2 * SOFTIRQ_OFFSET)

#define PREEMPT_DISABLED        (PREEMPT_DISABLE_OFFSET + PREEMPT_ENABLED)

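The following figure shows each bit of the preempt counter. (ARM64 layout)

Whether the PREEMPT_NEED_RESCHED flag exists, and where, depends on the architecture.
- Note: this bit is 0 when a reschedule has been requested.

The following figure shows the functions that manipulate preempt_count.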

preempt_count Query Functions

include/linux/preempt.h

#define hardirq_count() (preempt_count() & HARDIRQ_MASK)
#define softirq_count() (preempt_count() & SOFTIRQ_MASK)
#define irq_count()     (preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK \
                                 | NMI_MASK))

The following figure shows the query functions related to preempt_count.

preempt_count Condition Test Functions

include/linux/preempt.h

/*
 * Are we doing bottom half or hardware interrupt processing?
 *
 * in_irq()       - We're in (hard) IRQ context
 * in_softirq()   - We have BH disabled, or are processing softirqs
 * in_interrupt() - We're in NMI,IRQ,SoftIRQ context or have BH disabled
 * in_serving_softirq() - We're in softirq context
 * in_nmi()       - We're in NMI context
 * in_task()      - We're in task context
 *
 * Note: due to the BH disabled confusion: in_softirq(),in_interrupt() really
 *       should not be used in new code.
 */

#define in_irq()                (hardirq_count())
#define in_softirq()            (softirq_count())
#define in_interrupt()          (irq_count())
#define in_serving_softirq()    (softirq_count() & SOFTIRQ_OFFSET)
#define in_nmi()                (preempt_count() & NMI_MASK)
#define in_task()               (!(preempt_count() & \
                                   (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET)))

The following figure shows the condition test functions related to preempt_count.
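
To see how the fields interact, the masks listed above can be applied to a plain integer in user space. This is a sketch that reuses the mask values from the header comment; the bf_* helpers are hypothetical and take the counter value as a parameter instead of reading preempt_count().

```c
#include <assert.h>
#include <stdbool.h>

/* Field layout copied from the masks listed above. */
#define BF_SOFTIRQ_MASK    0x0000ff00UL
#define BF_HARDIRQ_MASK    0x000f0000UL
#define BF_NMI_MASK        0x00100000UL
#define BF_SOFTIRQ_OFFSET  0x00000100UL
#define BF_HARDIRQ_OFFSET  0x00010000UL

static bool bf_in_irq(unsigned long pc)
{
        return (pc & BF_HARDIRQ_MASK) != 0;
}

static bool bf_in_softirq(unsigned long pc)
{
        return (pc & BF_SOFTIRQ_MASK) != 0;
}

static bool bf_in_serving_softirq(unsigned long pc)
{
        return (pc & BF_SOFTIRQ_OFFSET) != 0;
}

static bool bf_in_task(unsigned long pc)
{
        return !(pc & (BF_NMI_MASK | BF_HARDIRQ_MASK | BF_SOFTIRQ_OFFSET));
}

/* Enter one more hardirq nesting level, as irq entry would. */
static unsigned long bf_hardirq_enter(unsigned long pc)
{
        assert((pc & BF_HARDIRQ_MASK) != BF_HARDIRQ_MASK);  /* field full */
        return pc + BF_HARDIRQ_OFFSET;
}
```

Note that a value of 2 * BF_SOFTIRQ_OFFSET (bottom halves disabled) makes bf_in_softirq() true while bf_in_task() stays true, which is the "BH disabled confusion" mentioned in the header comment.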

Preempt Disable & Enable

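These are the basic APIs for temporarily blocking rescheduling (preemption) in order to synchronize data structures.

preempt_disable()
- Increments the preempt counter so that no reschedule takes place. While the preempt counter is greater than 0, rescheduling (preemption) is prevented.
preempt_enable()
- Decrements the preempt counter that was incremented to prevent rescheduling. When it reaches 0 and a reschedule has been requested, it reschedules (the current task is scheduled out and the next task is scheduled in).

In addition, the following API variants are available.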

preempt_disable_notrace()
preempt_enable_notrace()
preempt_enable_no_resched()
preempt_enable_no_resched_notrace()
sched_preempt_enable_no_resched()

notrace option

The notrace variants produce no ftrace output even when the CONFIG_DEBUG_PREEMPT or CONFIG_TRACE_PREEMPT_TOGGLE kernel option is enabled. They are mainly used internally, e.g. by timers.

Reference: ftrace: add preempt_enable/disable notrace macros (2008, 2.6.27-rc1)

no_resched option

The no_resched variants do not reschedule after preempt_enable, even if the preempt counter reaches 0 and a reschedule has been requested.

Use in modules

The no_resched variants above are not available inside modules.

Reference: sched/preempt: Take away preempt_enable_no_resched() from modules (2014, v3.14-rc1)

preempt_disable()

include/linux/preempt.h

#ifdef CONFIG_PREEMPT_COUNT
#define preempt_disable() \
do { \
        preempt_count_inc(); \
        barrier(); \
} while (0)
#else
/*
 * Even if we don't have any preemption, we need preempt disable/enable
 * to be barriers, so that we don't have things like get_user/put_user
 * that can cause faults and scheduling migrate into our preempt-protected
 * region.
 */
#define preempt_disable()                       barrier()
#endif

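Increments the preempt counter so that no reschedule takes place. While the preempt counter is greater than 0, rescheduling (preemption) is prevented. Behavior depends on the preempt counter option and the preemption model as follows.

With CONFIG_PREEMPT_COUNT
- Increments the preemption counter.
- From then on preemption is not allowed, regardless of the preemption model.
Without CONFIG_PREEMPT_COUNT
- The kernel is not preempted in kernel mode, so there is no need to touch the preempt counter; only a compiler barrier is executed.

The following figure shows the path preempt_disable() takes depending on whether the preempt counter is used.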

preempt_count_inc()

include/linux/preempt.h

#define preempt_count_inc() preempt_count_add(1)

Increments the preemption counter by 1.

preempt_count_add()

include/linux/preempt.h

#define preempt_count_add(val)  __preempt_count_add(val)

Increments the preemption counter by val.

__preempt_count_add()

include/asm-generic/preempt.h

static __always_inline void __preempt_count_add(int val)
{
        *preempt_count_ptr() += val;
}

Increments the preemption counter by val.

preempt_enable()

include/linux/preempt.h

#ifdef CONFIG_PREEMPT_COUNT
#ifdef CONFIG_PREEMPTION
#define preempt_enable() \
do { \
        barrier(); \
        if (unlikely(preempt_count_dec_and_test())) \
                __preempt_schedule(); \
} while (0)
#else
#define preempt_enable() \
do { \
        barrier(); \
        preempt_count_dec(); \
} while (0)
#endif
#else
#define preempt_enable()                        barrier()
#endif

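Decrements the preempt counter that was incremented to prevent rescheduling. When it reaches 0 and a reschedule has been requested, it reschedules (the current task is scheduled out and the next task is scheduled in). Behavior depends on the preemption model as follows.

preempt & preempt_rt
- Decrements the preemption counter; if the value is 0 and a reschedule has been requested, it reschedules.
(none and voluntary) with CONFIG_PREEMPT_COUNT
- Decrements the preemption counter.
(none and voluntary) without CONFIG_PREEMPT_COUNT
- Only the compiler barrier is executed.

The following figure shows the path preempt_enable() takes depending on the preemption model.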

preempt_count_dec_and_test()

include/linux/preempt.h

#define preempt_count_dec_and_test() __preempt_count_dec_and_test()

Decrements the preemption counter and returns true if the value reached 0, so that preemption is possible, and a reschedule has been requested.

__preempt_count_dec_and_test() – Generic

include/asm-generic/preempt.h

static __always_inline bool __preempt_count_dec_and_test(void)
{
        /*
         * Because of load-store architectures cannot do per-cpu atomic
         * operations; we cannot use PREEMPT_NEED_RESCHED because it might get
         * lost.
         */
        return !--*preempt_count_ptr() && tif_need_resched();
}

Decrements the preemption counter and returns true if the value is 0 and a reschedule has been requested.

__preempt_count_dec_and_test() – ARM64

arch/arm64/include/asm/preempt.h

static inline bool __preempt_count_dec_and_test(void)
{
        struct thread_info *ti = current_thread_info();
        u64 pc = READ_ONCE(ti->preempt_count);

        /* Update only the count field, leaving need_resched unchanged */
        WRITE_ONCE(ti->preempt.count, --pc);

        /*
         * If we wrote back all zeroes, then we're preemptible and in
         * need of a reschedule. Otherwise, we need to reload the
         * preempt_count in case the need_resched flag was cleared by an
         * interrupt occurring between the non-atomic READ_ONCE/WRITE_ONCE
         * pair.
         */
        return !pc || !READ_ONCE(ti->preempt_count);
}

Decrements the preemption counter and returns true if the value is 0 and a reschedule has been requested.

ARM64 uses a 64-bit preempt_count and manages it as two 32-bit halves, count and need_resched.
- Note: need_resched is 0 when a reschedule has been requested.
- Reference: arm64: preempt: Provide our own implementation of asm/preempt.h (2018, v5.0-rc1)

preempt_count_dec()

include/linux/preempt.h

#define preempt_count_dec() preempt_count_sub(1)

Decrements the preemption counter by 1.

preempt_count_sub()

include/linux/preempt.h

#define preempt_count_sub(val)  __preempt_count_sub(val)

Decrements the preemption counter by val.

__preempt_count_sub()

include/asm-generic/preempt.h

static __always_inline void __preempt_count_sub(int val)
{
        *preempt_count_ptr() -= val;
}

Decrements the preemption counter by val.

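Schedule

There are several forms of schedule functions, as follows.

schedule()
- Switches to the next selected task. If the current task has set its state to something other than TASK_RUNNING, it is dequeued from the runqueue and sleeps.
- A task that went to sleep this way must be woken with wake_up_process() or similar before it runs again.
preempt_schedule()
- Reschedules the current task in the running state, but only if it is currently preemptible.
- The current task stays on the runqueue; only the execution order changes.
preempt_schedule_notrace()
- Same as preempt_schedule(), but without ftrace output.
schedule_user()
- Collects information for user context debugging; otherwise identical to schedule().
schedule_idle()
- Used by the idle scheduler to schedule out the init task that serves as the idle task.
- It does not call the schedule() API because schedule() includes sched_submit_work() and other calls, some of them rcu-related; these are skipped and only __schedule(false) is called.
schedule_preempt_disabled()
- Used in the mutex implementation; returns after schedule with preemption disabled.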

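Schedule with Timeout

The task sleeps with a timer armed, and is woken when the timer expires.

schedule_timeout()
- Puts the current task to sleep for the given number of ticks (jiffies).
schedule_timeout_uninterruptible()
- Changes the current task to the uninterruptible state and sleeps for the given number of ticks (jiffies).
schedule_timeout_interruptible()
- Changes the current task to the interruptible state and sleeps for the given number of ticks (jiffies).
schedule_timeout_killable()
- Changes the current task to the killable (uninterruptible and wakekill) state and sleeps for the given number of ticks (jiffies).
schedule_timeout_idle()
- Changes the current task to the idle (uninterruptible and noload) state and sleeps for the given number of ticks (jiffies).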

Task Sleep

msleep()

kernel/time/timer.c

/**
 * msleep - sleep safely even with waitqueue interruptions
 * @msecs: Time in milliseconds to sleep for
 */

void msleep(unsigned int msecs)
{
        unsigned long timeout = msecs_to_jiffies(msecs) + 1;

        while (timeout)
                timeout = schedule_timeout_uninterruptible(timeout);
}

Sleeps for the requested milliseconds converted to jiffies, plus 1 jiffy.

Sets the current task to the TASK_UNINTERRUPTIBLE state, arms a lowres timer for the requested time, and schedules away from the task.
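
The effect of the extra jiffy is easy to check with a small calculation. The sketch below assumes HZ=250 and a rounding-up conversion that is valid when 1000 is a multiple of HZ; the ms_* names are hypothetical and this is not the kernel's msecs_to_jiffies() implementation.

```c
#include <assert.h>

/* Assumed tick rate for this sketch; the kernel's HZ is a build option. */
#define MS_HZ 250UL

/* Round milliseconds up to whole jiffies (assumes 1000 % MS_HZ == 0). */
static unsigned long ms_msecs_to_jiffies(unsigned long msecs)
{
        unsigned long ms_per_jiffy = 1000UL / MS_HZ;

        assert(1000UL % MS_HZ == 0);
        return (msecs + ms_per_jiffy - 1) / ms_per_jiffy;
}

/* Jiffies that msleep() passes to schedule_timeout_uninterruptible(). */
static unsigned long ms_msleep_jiffies(unsigned long msecs)
{
        return ms_msecs_to_jiffies(msecs) + 1;
}

/* Upper bound of the resulting sleep, converted back to milliseconds. */
static unsigned long ms_msleep_max_ms(unsigned long msecs)
{
        return ms_msleep_jiffies(msecs) * (1000UL / MS_HZ);
}
```

With these assumptions msleep(10) asks for 3 + 1 = 4 jiffies, i.e. up to 16 ms, which is why msleep() is a poor fit for short, precise delays on a low-HZ kernel.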

schedule_timeout_uninterruptible()

kernel/time/timer.c

signed long __sched schedule_timeout_uninterruptible(signed long timeout)
{
        __set_current_state(TASK_UNINTERRUPTIBLE);
        return schedule_timeout(timeout);
}
EXPORT_SYMBOL(schedule_timeout_uninterruptible);

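Sets the current task to the TASK_UNINTERRUPTIBLE state, arms a lowres timer for the requested number of jiffies, and schedules away from the task.

schedule_timeout() does not return before the given time has passed, unless the task is explicitly woken (e.g. by wake_up_process()).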

schedule_timeout_interruptible()

kernel/time/timer.c

signed long __sched schedule_timeout_interruptible(signed long timeout)
{
        __set_current_state(TASK_INTERRUPTIBLE);
        return schedule_timeout(timeout);
}
EXPORT_SYMBOL(schedule_timeout_interruptible);

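Sets the current task to the TASK_INTERRUPTIBLE state, arms a lowres timer for the requested number of jiffies, and schedules away from the task.

schedule_timeout() may return before the given time has passed, e.g. when a signal is delivered or the task is explicitly woken.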

schedule_timeout()

kernel/time/timer.c – 1/2

/**
 * schedule_timeout - sleep until timeout
 * @timeout: timeout value in jiffies
 *
 * Make the current task sleep until @timeout jiffies have
 * elapsed. The routine will return immediately unless
 * the current task state has been set (see set_current_state()).
 *
 * You can set the task state as follows -
 *
 * %TASK_UNINTERRUPTIBLE - at least @timeout jiffies are guaranteed to
 * pass before the routine returns unless the current task is explicitly
 * woken up, (e.g. by wake_up_process())".
 *
 * %TASK_INTERRUPTIBLE - the routine may return early if a signal is
 * delivered to the current task or the current task is explicitly woken
 * up.
 *
 * The current task state is guaranteed to be TASK_RUNNING when this
 * routine returns.
 *
 * Specifying a @timeout value of %MAX_SCHEDULE_TIMEOUT will schedule
 * the CPU away without a bound on the timeout. In this case the return
 * value will be %MAX_SCHEDULE_TIMEOUT.
 *
 * Returns 0 when the timer has expired otherwise the remaining time in
 * jiffies will be returned.  In all cases the return value is guaranteed
 * to be non-negative.
 */

signed long __sched schedule_timeout(signed long timeout)
{
        struct process_timer timer;
        unsigned long expire;

        switch (timeout)
        {
        case MAX_SCHEDULE_TIMEOUT:
                /*
                 * These two special cases are useful to be comfortable
                 * in the caller. Nothing more. We could take
                 * MAX_SCHEDULE_TIMEOUT from one of the negative value
                 * but I' d like to return a valid offset (>=0) to allow
                 * the caller to do everything it want with the retval.
                 */
                schedule();
                goto out;
        default:
                /*
                 * Another bit of PARANOID. Note that the retval will be
                 * 0 since no piece of kernel is supposed to do a check
                 * for a negative retval of schedule_timeout() (since it
                 * should never happens anyway). You just have the printk()
                 * that will tell you if something is gone wrong and where.
                 */
                if (timeout < 0) {
                        printk(KERN_ERR "schedule_timeout: wrong timeout "
                                "value %lx\n", timeout);
                        dump_stack();
                        current->state = TASK_RUNNING;
                        goto out;
                }
        }

        expire = timeout + jiffies;

        timer.task = current;
        timer_setup_on_stack(&timer.timer, process_timeout, 0);
        __mod_timer(&timer.timer, expire, 0);
        schedule();
        del_singleshot_timer_sync(&timer.timer);

        /* Remove the timer from the object tracker */
        destroy_timer_on_stack(&timer.timer);

        timeout = expire - jiffies;

 out:
        return timeout < 0 ? 0 : timeout;
}
EXPORT_SYMBOL(schedule_timeout);

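Arms a lowres timer with @timeout (in jiffies ticks), sleeps, and wakes up when the timer expires. The task is guaranteed to be in the TASK_RUNNING state when the function returns.

Code lines 6~17: if MAX_SCHEDULE_TIMEOUT is requested, sleep without arming a timer.
- The task is woken when something else, such as wake_up_process() or try_to_wake_up(), enqueues it on the runqueue again; it then jumps to the out label and returns.
Code lines 18~32: if a negative timeout is given, print an error message, set the task to TASK_RUNNING and return.
Code lines 35~40: set the timer expiry time and sleep. process_timeout() is registered as the timer callback so that the process is woken when the timer expires.
Code line 41: remove the timer in case it has not expired yet.
Code line 44: remove the object tracked for timer debugging.
Code lines 46~49: return the time remaining until timer expiry, in jiffies. If it is below 0, return 0.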
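
The return-value handling at the end of schedule_timeout(), and the way a caller such as msleep() loops on it, can be modeled in user space. This is a sketch with hypothetical rt_* names; the array of wake-up times is an assumption that stands in for early explicit wake-ups.

```c
#include <assert.h>

/*
 * Model of the tail of schedule_timeout(): given the expiry time and the
 * jiffies value at wake-up, return the jiffies still left (never negative).
 * Unsigned subtraction followed by a signed cast keeps this correct across
 * a jiffies wrap-around on the usual two's-complement targets.
 */
static long rt_remaining(unsigned long expire, unsigned long now)
{
        long left = (long)(expire - now);

        return left < 0 ? 0 : left;
}

/* msleep()-style loop: keep sleeping until nothing is left. */
static int rt_sleep_rounds(unsigned long timeout,
                           const unsigned long *wake_after, int n)
{
        unsigned long now = 0;
        int rounds = 0;

        while (timeout) {
                unsigned long expire = now + timeout;

                assert(rounds < n);           /* one wake-up per round */
                now += wake_after[rounds++];  /* jiffies slept this round */
                timeout = (unsigned long)rt_remaining(expire, now);
        }
        return rounds;
}
```

Each early wake-up returns the remaining jiffies, and the loop sleeps again for exactly that remainder, so the total sleep never ends before the original expiry.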

process_timeout()

kernel/time/timer.c

static void process_timeout(struct timer_list *t)
{
        struct process_timer *timeout = from_timer(timeout, t, timer);

        wake_up_process(timeout->task);
}

Wakes up the task.

Called by the timer; it wakes the task in the interruptible or uninterruptible state and enqueues it on the runqueue again.

schedule()

kernel/sched/core.c

asmlinkage __visible void __sched schedule(void)
{
        struct task_struct *tsk = current;

        sched_submit_work(tsk);
        do {
                preempt_disable();
                __schedule(false);
                sched_preempt_enable_no_resched();
        } while (need_resched());
        sched_update_worker(tsk);
}
EXPORT_SYMBOL(schedule);

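Schedules the next task. If the current task is not in the TASK_RUNNING state, it is dequeued from the runqueue and sleeps; to run again it must be woken with wake_up_process() or similar.

Code line 5: do the work required before rescheduling.
- If this is a worker thread, notify that it is going to sleep.
- Flush any plugged IO queue being processed to avoid deadlock.
Code lines 7~9: with preemption disabled, dequeue the currently running task from the runqueue, put it to sleep, and schedule the next task.
- Passing false means this is not a preemption, so the current task may be dequeued from the runqueue.
Code line 10: repeat if a reschedule has been requested.
Code line 11: if this is a worker thread, notify that it is running again.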

__preempt_schedule()

include/asm-generic/preempt.h

#define __preempt_schedule() preempt_schedule()

Reschedules the current task in the running state, but only if it is currently preemptible.

Only available with CONFIG_PREEMPTION (preempt and preempt_rt kernels).

preempt_schedule()

kernel/sched/core.c

/*
 * this is the entry point to schedule() from in-kernel preemption
 * off of preempt_enable. Kernel preemptions off return from interrupt
 * occur there and call schedule directly.
 */

asmlinkage __visible void __sched notrace preempt_schedule(void)
{
        /*
         * If there is a non-zero preempt_count or interrupts are disabled,
         * we do not want to preempt the current task. Just return..
         */
        if (likely(!preemptible()))
                return;

        preempt_schedule_common();
}
NOKPROBE_SYMBOL(preempt_schedule);
EXPORT_SYMBOL(preempt_schedule);

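Reschedules, leaving the current task in the running state, only when the current task is preemptible.

Available on preempt and preempt_rt kernels.
The preempt counter must be 0 and interrupts must be enabled.
The current task stays on the runqueue; only its position in the run order changes.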

preemptible()

include/linux/preempt_mask.h

#ifdef CONFIG_PREEMPT_COUNT
# define preemptible()  (preempt_count() == 0 && !irqs_disabled())
#else
# define preemptible()  0
#endif

Returns whether preemption is currently possible. true(1)=preemptible

Without the CONFIG_PREEMPT_COUNT kernel option, the macro is always 0, so preemption is never possible.
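
A minimal user-space sketch of the same condition, assuming a hypothetical `struct model_cpu` that stands in for the kernel's preempt counter and irq state (none of the `model_*` names exist in the kernel):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU state used only for this illustration. */
struct model_cpu {
        unsigned int preempt_count;     /* nesting depth of preempt_disable() */
        bool irqs_disabled;             /* true while local irqs are masked */
};

/* Same shape as the macro: counter must be 0 and irqs must be enabled. */
static bool model_preemptible(const struct model_cpu *c)
{
        return c->preempt_count == 0 && !c->irqs_disabled;
}

static void model_preempt_disable(struct model_cpu *c) { c->preempt_count++; }
static void model_preempt_enable(struct model_cpu *c)  { c->preempt_count--; }
```

Nested disable/enable pairs keep the task non-preemptible until the outermost enable brings the counter back to 0.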

preempt_schedule_common()

kernel/sched/core.c

static void __sched notrace preempt_schedule_common(void)
{
        do {
                /*
                 * Because the function tracer can trace preempt_count_sub()
                 * and it also uses preempt_enable/disable_notrace(), if
                 * NEED_RESCHED is set, the preempt_enable_notrace() called
                 * by the function tracer will call this function again and
                 * cause infinite recursion.
                 *
                 * Preemption must be disabled here before the function
                 * tracer can trace. Break up preempt_disable() into two
                 * calls. One to disable preemption without fear of being
                 * traced. The other to still record the preemption latency,
                 * which can also be traced by the function tracer.
                 */
                preempt_disable_notrace();
                preempt_latency_start(1);
                __schedule(true);
                preempt_latency_stop(1);
                preempt_enable_no_resched_notrace();

                /*
                 * Check again in case we missed a preemption opportunity
                 * between schedule and now.
                 */
        } while (need_resched());
}

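Loops and reschedules while a reschedule request is pending.

In code line 17, before rescheduling, the preempt counter is incremented without ftrace tracing in order to block preemption.
In code line 18, the schedule-out of the current task is reported to ftrace (the preempt-off latency measurement starts).
In code line 19, the current task is left on the runqueue in the running state, and a reschedule picks the next task to run and executes it.
In code line 20, the schedule-in of the picked task is reported to ftrace (the preempt-off latency measurement stops).
In code line 21, the preempt counter is decremented to reopen the preemption that was blocked before rescheduling. The no_resched variant is used so that this does not trigger yet another reschedule.
In code line 27, the loop continues while a reschedule request remains.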
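
The reason for the do/while loop can be sketched with a toy model in which wakeups arrive while the scheduler core is running. All names here (`struct model_cpu`, `model_schedule_core()`, `late_wakeups`) are hypothetical and only illustrate the control flow.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-CPU model; these fields do not exist in the kernel. */
struct model_cpu {
        int preempt_count;      /* models the preempt counter */
        bool need_resched;      /* models the reschedule request flag */
        int late_wakeups;       /* wakeups that arrive while scheduling */
        int schedule_calls;     /* how often the scheduler core ran */
};

/* Stand-in for __schedule(true): clears the request, may see a new one. */
static void model_schedule_core(struct model_cpu *c)
{
        c->schedule_calls++;
        c->need_resched = false;
        if (c->late_wakeups > 0) {
                c->late_wakeups--;
                c->need_resched = true; /* a wakeup raced with us */
        }
}

static void model_preempt_schedule_common(struct model_cpu *c)
{
        do {
                c->preempt_count++;     /* block preemption */
                model_schedule_core(c);
                c->preempt_count--;     /* reopen without rescheduling */
        } while (c->need_resched);      /* catch requests that came late */
}
```

With two late wakeups the core runs three times, and the counter is balanced on exit.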

Main Scheduler Function

__schedule()

kernel/sched/core.c

/*
 * __schedule() is the main scheduler function.
 *
 * The main means of driving the scheduler and thus entering this function are:
 *
 *   1. Explicit blocking: mutex, semaphore, waitqueue, etc.
 *
 *   2. TIF_NEED_RESCHED flag is checked on interrupt and userspace return
 *      paths. For example, see arch/x86/entry_64.S.
 *
 *      To drive preemption between tasks, the scheduler sets the flag in timer
 *      interrupt handler scheduler_tick().
 *
 *   3. Wakeups don't really cause entry into schedule(). They add a
 *      task to the run-queue and that's it.
 *
 *      Now, if the new task added to the run-queue preempts the current
 *      task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets
 *      called on the nearest possible occasion:
 *
 *       - If the kernel is preemptible (CONFIG_PREEMPT=y):
 *
 *         - in syscall or exception context, at the next outmost
 *           preempt_enable(). (this might be as soon as the wake_up()'s
 *           spin_unlock()!)
 *
 *         - in IRQ context, return from interrupt-handler to
 *           preemptible context
 *
 *       - If the kernel is not preemptible (CONFIG_PREEMPT is not set)
 *         then at the next:
 *
 *          - cond_resched() call
 *          - explicit schedule() call
 *          - return from syscall or exception to user-space
 *          - return from interrupt-handler to user-space
 *
 * WARNING: all callers must re-check need_resched() afterward and reschedule
 * accordingly in case an event triggered the need for rescheduling (such as
 * an interrupt waking up a task) while preemption was disabled in __schedule().
 */

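On a preemptible kernel built with the CONFIG_PREEMPT or CONFIG_PREEMPT_RT kernel option, preemption is possible in kernel mode too, in most places where preempt is enabled. A kernel built without those options cannot be preempted in kernel mode. A task switch is still possible in the following cases only.

When cond_resched() is called with the CONFIG_PREEMPT_VOLUNTARY kernel option
When schedule() is called explicitly
When returning to user space after a syscall
- Preemption is always possible in user mode. If the kernel finished the requested service, returned to user mode, and was preempted right there, the only result would be one more round trip between user mode and kernel mode, which is pure overhead. Handling the preemption before returning to user space is therefore faster.
When returning to user space after an interrupt handler has run
- As in the syscall case above, handling the preemption before returning to user space is faster.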

kernel/sched/core.c -1/2-

static void __sched notrace __schedule(bool preempt)
{
        struct task_struct *prev, *next;
        unsigned long *switch_count;
        struct rq_flags rf;
        struct rq *rq;
        int cpu;

        cpu = smp_processor_id();
        rq = cpu_rq(cpu);
        prev = rq->curr;

        schedule_debug(prev, preempt);

        if (sched_feat(HRTICK))
                hrtick_clear(rq);

        local_irq_disable();
        rcu_note_context_switch(preempt);

        /*
         * Make sure that signal_pending_state()->signal_pending() below
         * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
         * done by the caller to avoid the race with signal_wake_up().
         *
         * The membarrier system call requires a full memory barrier
         * after coming from user-space, before storing to rq->curr.
         */
        rq_lock(rq, &rf);
        smp_mb__after_spinlock();

        /* Promote REQ to ACT */
        rq->clock_update_flags <<= 1;
        update_rq_clock(rq);

        switch_count = &prev->nivcsw;
        if (!preempt && prev->state) {
                if (signal_pending_state(prev->state, prev)) {
                        prev->state = TASK_RUNNING;
                } else {
                        deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);

                        if (prev->in_iowait) {
                                atomic_inc(&rq->nr_iowait);
                                delayacct_blkio_start();
                        }
                }
                switch_count = &prev->nvcsw;
        }

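Depending on the @preempt argument, the current task is either dequeued from the runqueue to sleep (preempt=false) or left on the runqueue in the running state, and a reschedule is performed.

In code lines 9~11, the task currently running on this cpu's runqueue is fetched into prev.
In code line 13, schedule-time checks and statistics are performed.
In code lines 15~16, if the HRTICK feature is in use, the hrtick for the current task is no longer needed and is cleared.
In code line 18, local irqs are disabled.
In code line 19, the rcu processing required before the current task is scheduled out is performed.
In code lines 29~30, the runqueue lock is taken and an smp memory barrier is executed.
- To synchronize with code that does not take the runqueue lock, smp_mb__after_spinlock() must follow the spin_lock based runqueue lock.
In code lines 33~34, the runqueue's clock_update_flags is promoted from the RQCF_REQ_SKIP(1) stage to the RQCF_ACT_SKIP(2) stage, and then the runqueue clock of the current cpu is updated.
In code line 36, the involuntary context switch counter (nivcsw) of the previous task is selected as the default counter.
In code lines 37~49, when @preempt=false and the task state is not TASK_RUNNING(0), the current task is dequeued from the runqueue so that it can sleep. If a signal is pending, however, it is left in the running state. In both cases the voluntary context switch counter (nvcsw) is selected instead.
- The DEQUEUE_SLEEP flag indicates that the task is not leaving the cfs runqueue for good; it is dequeued in order to sleep and will be enqueued again later.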
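
The dequeue decision and the choice between the two switch counters can be sketched as follows. This is a user-space model under simplifying assumptions: `struct model_task` and `model_schedule_prologue()` are hypothetical, and a single boolean stands in for signal_pending_state().

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_TASK_RUNNING 0    /* the text gives TASK_RUNNING as 0 */

/* Hypothetical task model for this illustration only. */
struct model_task {
        long state;             /* 0 = runnable, non-zero = wants to sleep */
        bool signal_wakes_it;   /* stands in for signal_pending_state() */
        bool on_rq;             /* still queued on the runqueue? */
        unsigned long nvcsw;    /* voluntary context switches */
        unsigned long nivcsw;   /* involuntary context switches */
};

/* Returns the counter that a following task switch should increment. */
static unsigned long *model_schedule_prologue(struct model_task *prev,
                                              bool preempt)
{
        unsigned long *switch_count = &prev->nivcsw;

        if (!preempt && prev->state != MODEL_TASK_RUNNING) {
                if (prev->signal_wakes_it)
                        prev->state = MODEL_TASK_RUNNING; /* stay queued */
                else
                        prev->on_rq = false;              /* dequeue to sleep */
                switch_count = &prev->nvcsw;
        }
        return switch_count;
}
```

A preempted task therefore always stays on the runqueue and is counted as an involuntary switch, whatever its state field says.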

kernel/sched/core.c -2/2-

        next = pick_next_task(rq, prev, &rf);
        clear_tsk_need_resched(prev);
        clear_preempt_need_resched();

        if (likely(prev != next)) {
                rq->nr_switches++;
                /*
                 * RCU users of rcu_dereference(rq->curr) may not see
                 * changes to task_struct made by pick_next_task().
                 */
                RCU_INIT_POINTER(rq->curr, next);
                /*
                 * The membarrier system call requires each architecture
                 * to have a full memory barrier after updating
                 * rq->curr, before returning to user-space.
                 *
                 * Here are the schemes providing that barrier on the
                 * various architectures:
                 * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
                 *   switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
                 * - finish_lock_switch() for weakly-ordered
                 *   architectures where spin_unlock is a full barrier,
                 * - switch_to() for arm64 (weakly-ordered, spin_unlock
                 *   is a RELEASE barrier),
                 */
                ++*switch_count;

                trace_sched_switch(preempt, prev, next);

                /* Also unlocks the rq: */
                rq = context_switch(rq, prev, next, &rf);
        } else {
                rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
                rq_unlock_irq(rq, &rf);
        }

        balance_callback(rq);
}

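In code lines 1~3, the next task to run is picked, and the reschedule request (TIF_NEED_RESCHED flag) of the previous task is cleared. On 64-bit systems that use the LLSC scheme, ti->preempt.need_resched is additionally set to 1 to clear the reschedule request (note: 0=reschedule requested, 1=cleared).
- On arm64, ARMv8.0 uses the LLSC scheme and ARMv8.1 and later use the CAS scheme, depending on the architecture version. The preempt-related code, however, is simply handled the LLSC way.
In code lines 5~31, in the likely case that the next task differs from the previous one, the task is switched.
In code lines 32~35, when no task switch takes place, the runqueue's clock_update_flags is put back into the cleared stage, and the runqueue is unlocked.
In code line 37, load-balancing work that was queued for after scheduling is processed, if there is any.
- For dl and rt tasks, when two or more of them run on the same cpu, there is processing to push them to another cpu.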
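
The REQ to ACT promotion and the later clearing of the skip bits can be illustrated with plain bit operations. The flag values follow the ones quoted in the walkthrough (REQ=1, ACT=2); the `model_*` helpers are hypothetical, and the rule that an update is skipped while ACT is set is stated here as an assumption about update_rq_clock().

```c
#include <assert.h>

/* Values as quoted in the walkthrough. */
#define MODEL_RQCF_REQ_SKIP 0x01u
#define MODEL_RQCF_ACT_SKIP 0x02u

/* On entry to the scheduler: "Promote REQ to ACT" with one left shift. */
static unsigned int model_promote(unsigned int flags)
{
        return flags << 1;
}

/* After the pick: drop both skip bits so later updates run normally. */
static unsigned int model_clear_skip(unsigned int flags)
{
        return flags & ~(MODEL_RQCF_ACT_SKIP | MODEL_RQCF_REQ_SKIP);
}

/* Assumption: a clock update is skipped only while the ACT bit is set. */
static int model_update_is_skipped(unsigned int flags)
{
        return (flags & MODEL_RQCF_ACT_SKIP) != 0;
}
```

A skip request made before scheduling thus takes effect for exactly one pass through the scheduler.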

schedule_debug()

kernel/sched/core.c

/*
 * Various schedule()-time debugging checks and statistics:
 */

static inline void schedule_debug(struct task_struct *prev, bool preempt)
{
#ifdef CONFIG_SCHED_STACK_END_CHECK
        if (task_stack_end_corrupted(prev))
                panic("corrupted stack end detected inside scheduler\n");
#endif

#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
        if (!preempt && prev->state && prev->non_block_count) {
                printk(KERN_ERR "BUG: scheduling in a non-blocking section: %s/%d/%i\n",
                        prev->comm, prev->pid, prev->non_block_count);
                dump_stack();
                add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
        }
#endif

        if (unlikely(in_atomic_preempt_off())) {
                __schedule_bug(prev);
                preempt_count_set(PREEMPT_DISABLED);
        }
        rcu_sleep_check();

        profile_hit(SCHED_PROFILING, __builtin_return_address(0));

        schedstat_inc(this_rq()->sched_count);
}

Performs schedule-time checks and statistics.

In code lines 3~6, the stack is checked for corruption.
In code lines 8~15, debug code catches a sleep inside a non-blocking section.
In code lines 17~20, if scheduling is attempted in atomic context with preemption disabled, an error message is printed.
In code line 21, for the following three kinds of rcu, a sleep (context switch) inside a read-side critical section is checked.
- “Illegal context switch in RCU read-side critical section”
  - With the CONFIG_PREEMPT_RCU kernel option, rcu preemption is supported and sleeping is allowed, so this case is not checked.
- “Illegal context switch in RCU-bh read-side critical section”
- “Illegal context switch in RCU-sched read-side critical section”
In code line 23, a profiling hit is recorded when SCHED_PROFILING is enabled.
In code line 25, rq->sched_count is incremented by 1.

task_stack_end_corrupted()

include/linux/sched.h

#define task_stack_end_corrupted(task) \
                (*(end_of_stack(task)) != STACK_END_MAGIC)

Returns whether the magic number (0x57AC6E9D) written at the last boundary of the stack has been overwritten, which means the stack is corrupted. true(1)=stack corrupted
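
The idea of a canary word at the stack end can be sketched in user space. The array-based `struct model_stack` and the helper names are hypothetical; only the magic value comes from the text, and the stack is assumed to grow downward so that word 0 is the last usable word.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_STACK_END_MAGIC 0x57AC6E9DUL      /* value quoted in the text */
#define MODEL_STACK_WORDS 64

/* Hypothetical stack: grows down, so words[0] is the stack end. */
struct model_stack {
        unsigned long words[MODEL_STACK_WORDS];
};

/* Written once when the stack is set up. */
static void model_set_stack_end_magic(struct model_stack *s)
{
        s->words[0] = MODEL_STACK_END_MAGIC;
}

/* Checked at schedule time: any other value means an overflow hit the end. */
static bool model_stack_end_corrupted(const struct model_stack *s)
{
        return s->words[0] != MODEL_STACK_END_MAGIC;
}
```

The check is cheap but only detects overflows that actually overwrite the last word.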

hrtick_clear()

kernel/sched/core.c

/*
 * Use HR-timers to deliver accurate preemption points.
 */

static void hrtick_clear(struct rq *rq)
{
        if (hrtimer_active(&rq->hrtick_timer))
                hrtimer_cancel(&rq->hrtick_timer);
}

Cancels the hrtick.

The hrtick becomes active when the CONFIG_SCHED_HRTICK kernel option and the HRTICK feature are both used.

signal_pending_state()

include/linux/sched.h

static inline int signal_pending_state(long state, struct task_struct *p)
{       
        if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
                return 0;
        if (!signal_pending(p))
                return 0;
        
        return (state & TASK_INTERRUPTIBLE) || __fatal_signal_pending(p);
}

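Returns true when a task in the TASK_INTERRUPTIBLE or TASK_WAKEKILL state has a pending signal that should wake it: any signal for TASK_INTERRUPTIBLE, or SIGKILL for TASK_WAKEKILL.

In code lines 3~4, if the task is neither in the TASK_INTERRUPTIBLE state nor in the TASK_WAKEKILL state, false is returned.
In code lines 5~6, if the task has no signal pending, false is returned.
In code line 8, true is returned if the state is interruptible or a fatal (SIGKILL) signal has been requested for the task.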
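
The three checks can be exercised in a small user-space model. The state bit values below are illustrative assumptions (the real definitions live in include/linux/sched.h and may differ between kernel versions), and two booleans replace the signal_pending() and __fatal_signal_pending() lookups.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values, assumed for this sketch only. */
#define MODEL_TASK_INTERRUPTIBLE   0x0001L
#define MODEL_TASK_UNINTERRUPTIBLE 0x0002L
#define MODEL_TASK_WAKEKILL        0x0100L

/* Same decision order as the function walked through above. */
static int model_signal_pending_state(long state, bool sig_pending,
                                      bool sigkill_pending)
{
        if (!(state & (MODEL_TASK_INTERRUPTIBLE | MODEL_TASK_WAKEKILL)))
                return 0;       /* plain uninterruptible sleep */
        if (!sig_pending)
                return 0;       /* nothing to deliver */

        return (state & MODEL_TASK_INTERRUPTIBLE) || sigkill_pending;
}
```

A killable sleep (WAKEKILL combined with UNINTERRUPTIBLE) ignores ordinary signals and reacts only to SIGKILL.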

signal_pending()

include/linux/sched.h

static inline int signal_pending(struct task_struct *p)
{       
        return unlikely(test_tsk_thread_flag(p,TIF_SIGPENDING));
}

Returns whether the TIF_SIGPENDING flag is set on the given task.

fatal_signal_pending()

include/linux/sched.h

static inline int fatal_signal_pending(struct task_struct *p)
{
        return signal_pending(p) && __fatal_signal_pending(p);
}

Returns whether a fatal (SIGKILL) signal has been requested for the given task.

__fatal_signal_pending()

include/linux/sched.h

static inline int __fatal_signal_pending(struct task_struct *p)
{
        return unlikely(sigismember(&p->pending.signal, SIGKILL));
}

Returns whether a SIGKILL signal has been requested for the given task.

SIGKILL signal
- Requested in order to kill a task.

task_on_rq_queued()

kernel/sched/sched.h

static inline int task_on_rq_queued(struct task_struct *p)
{
        return p->on_rq == TASK_ON_RQ_QUEUED;
}

Returns whether the task is currently queued on a runqueue.

clear_tsk_need_resched()

include/linux/sched.h

static inline void clear_tsk_need_resched(struct task_struct *tsk)
{
        clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
}

Clears the reschedule request flag on the given task.

Context Switch

A task switch goes by several names, as follows. (Interrupt Context Switch is not covered here.)

Process Context Switch
Thread Context Switch
Task Context Switch
CPU Context Switch
- Note: in the narrow sense, it can mean only the context switch that saves and restores the cpu registers, excluding the task's mm switching.

context_switch()

kernel/sched/core.c

/*
 * context_switch - switch to the new MM and the new thread's register state.
 */

static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
               struct task_struct *next, struct rq_flags *rf)
{
        prepare_task_switch(rq, prev, next);

        /*
         * For paravirt, this is coupled with an exit in switch_to to
         * combine the page table reload and the switch backend into
         * one hypercall.
         */
        arch_start_context_switch(prev);

        /*
         * kernel -> kernel   lazy + transfer active
         *   user -> kernel   lazy + mmgrab() active
         *
         * kernel ->   user   switch + mmdrop() active
         *   user ->   user   switch
         */
        if (!next->mm) {                                // to kernel
                enter_lazy_tlb(prev->active_mm, next);

                next->active_mm = prev->active_mm;
                if (prev->mm)                           // from user
                        mmgrab(prev->active_mm);
                else
                        prev->active_mm = NULL;
        } else {                                        // to user
                membarrier_switch_mm(rq, prev->active_mm, next->mm);
                /*
                 * sys_membarrier() requires an smp_mb() between setting
                 * rq->curr / membarrier_switch_mm() and returning to userspace.
                 *
                 * The below provides this either through switch_mm(), or in
                 * case 'prev->active_mm == next->mm' through
                 * finish_task_switch()'s mmdrop().
                 */
                switch_mm_irqs_off(prev->active_mm, next->mm, next);

                if (!prev->mm) {                        // from kernel
                        /* will mmdrop() in finish_task_switch(). */
                        rq->prev_mm = prev->active_mm;
                        prev->active_mm = NULL;
                }
        }

        rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);

        prepare_lock_switch(rq, next, rf);

        /* Here we just switch the register state and the stack. */
        switch_to(prev, next, prev);
        barrier();

        return finish_task_switch(prev);
}

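Context-switches to the new task. If the new task is a user task, mm switching is also done to change the virtual address space.

In code line 5, the preparations needed before context-switching to the next task are made.
In code line 12, architecture-specific work just before the context switch is performed.
- Currently only the x86 architecture does anything here.
In code lines 21~24, if the next task is a kernel task (mm=null), it uses the previous task's active_mm, and lazy tlb mode is entered.
- The arm architecture does not use lazy_tlb mode.
- On arm64, architectures that emulate the PAN feature in software using ttbr0 set ti->ttbr0 to the zero page. Architectures with the HW PAN feature do nothing.
In code lines 25~28, the behavior depends on whether the previous task is a user task.
- User task: the previous user task's virtual address space will continue to be used, so the mm reference counter is incremented by 1.
- Kernel task: null is assigned to prev->active_mm.
In code lines 29~39, mm switching is performed so that the virtual address space of the next user task is used.
- Memory-barrier-related work is done before mm switching.
- For mm switching, the TTBR0 register is set from the new task's mm.
- If the ASID generation matches or the ASID can be carried over, the switch is fast with no TLB flush; if no ASID slot is available, a costly TLB flush must also be performed.
In code lines 41~45, if the previous task is a kernel task, the previous task's active_mm is saved in rq->prev_mm and null is assigned to prev->active_mm.
In code line 48, both RQCF_ACT_SKIP and RQCF_REQ_SKIP are cleared from the runqueue's clock update flags.
In code line 50, the runqueue lock is handed over before the task switch.
In code line 53, the context switch to the next task is performed.
In code line 56, the post-context-switch work for the previous task is performed.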

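Why active_mm is kept separately

A user process has an mm that describes its user virtual address space (user threads share the mm of their process). A kernel thread has no reason to use a user process's address space, so it has no mm describing a user virtual address space and the value is null. So what is active_mm used for?

Context switching includes mm switching, and every mm switch is very costly. When a kernel thread is scheduled, the mm switch can be skipped.
So while a kernel thread is running, its active_mm is the mm handed over from the previous user process.
In other words, whether or not the current task (process or thread) has its own mm, active_mm points at the mm environment that is currently active, just as the name says.

The two fields are summarized below.

mm
- For a user process, it points to its own mm.
- For a user thread, it points to the mm of its parent process.
- For a kernel thread, null
active_mm
- For a user process, it points to its own mm.
- For a user thread, it points to the mm of its parent process.
- For a kernel thread, the mm of the previous user process is handed over and used.

Assume that six initial tasks have been created, as in the following figure.

The following figure shows how active_mm changes as the kernel tasks and user tasks are scheduled, starting from the initial task init_task, and that mm switching happens at each star.

Note that, apart from the very first init_task, the virtual address environment is switched (mm switching) only when the next task is a user task, and the mm field is used for that.
When switching to a kernel task, the previously used user virtual address space is handed over through active_mm.
- Except for the initial init_mm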
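
The hand-over of active_mm across a user, kernel, user sequence can be traced with a toy model. Everything below is hypothetical (`struct model_mm`, `model_context_switch()`, `mm_switches`), and the model counts an mm switch only when the address space really changes, which simplifies what the architecture code decides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical structures; only the field names mirror the text. */
struct model_mm { int mm_count; };
struct model_task { struct model_mm *mm; struct model_mm *active_mm; };
struct model_rq { struct model_mm *prev_mm; int mm_switches; };

static void model_context_switch(struct model_rq *rq, struct model_task *prev,
                                 struct model_task *next)
{
        if (!next->mm) {                                /* to kernel */
                next->active_mm = prev->active_mm;      /* borrow the mm */
                if (prev->mm)                           /* from user */
                        prev->active_mm->mm_count++;    /* like mmgrab() */
                else                                    /* from kernel */
                        prev->active_mm = NULL;         /* pass the reference on */
        } else {                                        /* to user */
                if (prev->active_mm != next->mm)
                        rq->mm_switches++;              /* address space changes */
                if (!prev->mm) {                        /* from kernel */
                        rq->prev_mm = prev->active_mm;  /* dropped after switch */
                        prev->active_mm = NULL;
                }
        }
}

/* Post-switch step: release the mm a kernel task had borrowed. */
static void model_finish_switch(struct model_rq *rq)
{
        if (rq->prev_mm) {
                rq->prev_mm->mm_count--;                /* like mmdrop() */
                rq->prev_mm = NULL;
        }
}
```

Switching from user task A to a kernel thread and back to A costs no mm switch in this model; only a switch to a different user task does.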

Work to prepare before a context switch and work to do after it

prepare_task_switch()

kernel/sched/core.c

/**
 * prepare_task_switch - prepare to switch tasks
 * @rq: the runqueue preparing to switch
 * @prev: the current task that is being switched out
 * @next: the task we are going to switch to.
 *
 * This is called with the rq lock held and interrupts off. It must
 * be paired with a subsequent finish_task_switch after the context
 * switch.
 *
 * prepare_task_switch sets up locking and calls architecture specific
 * hooks.
 */

static inline void 
prepare_task_switch(struct rq *rq, struct task_struct *prev,
                    struct task_struct *next)
{
        kcov_prepare_switch(prev);
        sched_info_switch(rq, prev, next);
        perf_event_task_sched_out(prev, next);
        rseq_preempt(prev);
        fire_sched_out_preempt_notifiers(prev, next);
        prepare_task(next);
        prepare_arch_switch(next);
}

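Makes the preparations needed before context-switching to the next task.

In code line 5, information about the task being scheduled out is recorded for kcov (the kernel code coverage tool used for coverage-guided random testing).
- Reference: kernel: add kcov code coverage (2016) | LWN.net
In code line 6, scheduler statistics are collected.
In code line 7, perf event information for the task being scheduled out is emitted.
In code line 8, the task being scheduled out is marked as preempted for restartable sequences (rseq).
In code line 9, the callback functions registered in the preempt_notifier of the task being scheduled out are called.
In code line 10, the next task is marked as running on a cpu. (next->on_cpu = 1)
In code line 11, if the architecture supports it, its pre-context-switch work is performed.
- Not applicable to the arm and arm64 architectures.

Next, the perf tool was used to analyze the scheduling latency of tasks.

The load100 task, which repeats a simple computation, was run while a kernel build was running in the background.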

$ perf sched record ./load100
^C
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 10.035 MB perf.data (88881 samples) ]

$ perf sched latency
 -----------------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
 -----------------------------------------------------------------------------------------------------------------
  rm:(5)                |     19.912 ms |        7 | avg:    5.079 ms | max:   13.957 ms | max at: 255594.701006 s
  sh:(45)               |    176.899 ms |       71 | avg:    4.153 ms | max:   20.318 ms | max at: 255593.822963 s
  recordmcount:(5)      |     32.081 ms |        7 | avg:    3.589 ms | max:   14.293 ms | max at: 255592.467954 s
  fixdep:(6)            |    312.116 ms |       36 | avg:    3.232 ms | max:   12.683 ms | max at: 255594.611009 s
  cc1:(11)              |  24455.216 ms |     4283 | avg:    1.071 ms | max:   20.005 ms | max at: 255594.251001 s
  gcc:(10)              |     60.107 ms |       41 | avg:    0.938 ms | max:   11.545 ms | max at: 255593.861169 s
  make:(8)              |    150.007 ms |       64 | avg:    0.802 ms | max:   10.033 ms | max at: 255595.552049 s
  load100:9169          |   5440.106 ms |       40 | avg:    0.706 ms | max:    4.055 ms | max at: 255590.765890 s
  lxpanel:992           |      8.397 ms |       17 | avg:    0.601 ms | max:    3.316 ms | max at: 255594.839685 s
  perf_4.9:9154         |     44.080 ms |        2 | avg:    0.256 ms | max:    0.483 ms | max at: 255590.709552 s
  as:(6)                |   1609.065 ms |     2505 | avg:    0.214 ms | max:   20.093 ms | max at: 255594.555090 s
  ...
 -----------------------------------------------------------------------------------------------------------------
  TOTAL:                |  32729.581 ms |    15537 |
 ---------------------------------------------------

finish_task_switch()

kernel/sched/core.c

/**
 * finish_task_switch - clean up after a task-switch
 * @prev: the thread we just switched away from.
 *
 * finish_task_switch must be called after the context switch, paired
 * with a prepare_task_switch call before the context switch.
 * finish_task_switch will reconcile locking set up by prepare_task_switch,
 * and do any other architecture-specific cleanup actions.
 *
 * Note that we may have delayed dropping an mm in context_switch(). If
 * so, we finish that here outside of the runqueue lock. (Doing it
 * with the lock held can cause deadlocks; see schedule() for
 * details.)
 *
 * The context switch have flipped the stack from under us and restored the
 * local variables which were saved when this task called schedule() in the
 * past. prev == current is still correct but we need to recalculate this_rq
 * because prev may have moved to another CPU.
 */

static struct rq *finish_task_switch(struct task_struct *prev)
        __releases(rq->lock)
{
        struct rq *rq = this_rq();
        struct mm_struct *mm = rq->prev_mm;
        long prev_state;

        /*
         * The previous task will have left us with a preempt_count of 2
         * because it left us after:
         *
         *      schedule()
         *        preempt_disable();                    // 1
         *        __schedule()
         *          raw_spin_lock_irq(&rq->lock)        // 2
         *
         * Also, see FORK_PREEMPT_COUNT.
         */
        if (WARN_ONCE(preempt_count() != 2*PREEMPT_DISABLE_OFFSET,
                      "corrupted preempt_count: %s/%d/0x%x\n",
                      current->comm, current->pid, preempt_count()))
                preempt_count_set(FORK_PREEMPT_COUNT);

        rq->prev_mm = NULL;

        /*
         * A task struct has one reference for the use as "current".
         * If a task dies, then it sets TASK_DEAD in tsk->state and calls
         * schedule one last time. The schedule call will never return, and
         * the scheduled task must drop that reference.
         *
         * We must observe prev->state before clearing prev->on_cpu (in
         * finish_task), otherwise a concurrent wakeup can get prev
         * running on another CPU and we could rave with its RUNNING -> DEAD
         * transition, resulting in a double drop.
         */
        prev_state = prev->state;
        vtime_task_switch(prev);
        perf_event_task_sched_in(prev, current);
        finish_task(prev);
        finish_lock_switch(rq);
        finish_arch_post_lock_switch();
        kcov_finish_switch(current);

        fire_sched_in_preempt_notifiers(current);
        /*
         * When switching through a kernel thread, the loop in
         * membarrier_{private,global}_expedited() may have observed that
         * kernel thread and not issued an IPI. It is therefore possible to
         * schedule between user->kernel->user threads without passing though
         * switch_mm(). Membarrier requires a barrier after storing to
         * rq->curr, before returning to userspace, so provide them here:
         *
         * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
         *   provided by mmdrop(),
         * - a sync_core for SYNC_CORE.
         */
        if (mm) {
                membarrier_mm_sync_core_before_usermode(mm);
                mmdrop(mm);
        }
        if (unlikely(prev_state == TASK_DEAD)) {
                if (prev->sched_class->task_dead)
                        prev->sched_class->task_dead(prev);

                /*
                 * Remove function-return probe instances associated with this
                 * task and put them back on the free list.
                 */
                kprobe_flush_task(prev);

                /* Task is done with its stack. */
                put_task_stack(prev);

                put_task_struct_rcu_user(prev);
        }

        tick_nohz_task_switch();
        return rq;
}

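Performs the post-context-switch work for the previous task @prev. (It pairs with prepare_task_switch().)

In code lines 4~24, the runqueue's prev_mm is fetched into mm and then cleared by assigning null.
In code line 37, the state of the previous task is read.
In code line 38, when the CONFIG_VIRT_CPU_ACCOUNTING kernel option is used, the vtime of the previous task is updated, split into idle and system time.
In code line 39, perf event information for the task being scheduled in is emitted.
In code lines 40~42, the work needed on completion of the previous task's switch, the runqueue lock hand-over, and architecture-specific work are performed.
In code line 43, information about the task being scheduled in is recorded for kcov (the kernel code coverage tool used for coverage-guided random testing).
In code line 45, the notifier callbacks registered in the preempt_notifiers chain list of the task being scheduled in are called.
In code lines 58~61, if the previous task was a kernel task that had borrowed an mm through active_mm (saved in rq->prev_mm), that mm is no longer used by it, so the mm reference counter is decremented. (When the reference counter reaches 0, the mm is freed.)
In code lines 62~76, in the unlikely case that the previous task's state is TASK_DEAD, the previous task is no longer used and its references are dropped. (When the reference counter reaches 0, it is freed.) If the scheduler class of the previous task provides (*task_dead), it is called.
- With the dl scheduler, task_dead_dl() is called to subtract dl_bw from total_bw and stop the dl timer.
In code line 78, when the system runs in nohz full mode, it is decided whether nohz can continue; if it cannot, tick scheduling is resumed.
In code line 79, the runqueue is returned.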

fire_sched_in_preempt_notifiers()

kernel/sched/core.c

static __always_inline void fire_sched_in_preempt_notifiers(struct task_struct *curr)
{
        if (static_branch_unlikely(&preempt_notifier_key))
                __fire_sched_in_preempt_notifiers(curr);
}

When a task is scheduled in, the notifier functions registered in the preempt_notifiers chain list of the current task are called.

__fire_sched_in_preempt_notifiers()

kernel/sched/core.c

static void __fire_sched_in_preempt_notifiers(struct task_struct *curr)
{
        struct preempt_notifier *notifier;

        hlist_for_each_entry(notifier, &curr->preempt_notifiers, link)
                notifier->ops->sched_in(notifier, raw_smp_processor_id());
}

The following figure shows a notifier that vcpu_load() in virt/kvm/kvm_main.c registered in the current task's preempt_notifiers chain list being called.
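
The notifier walk can be sketched in user space with a plain singly linked list. The kernel uses an hlist and its own ops structure; every `model_*` name below is hypothetical and only shows the callback pattern.

```c
#include <assert.h>
#include <stddef.h>

struct model_notifier;

/* Hypothetical ops table with a single sched_in callback. */
struct model_notifier_ops {
        void (*sched_in)(struct model_notifier *n, int cpu);
};

struct model_notifier {
        const struct model_notifier_ops *ops;
        struct model_notifier *next;    /* simple list instead of an hlist */
        int last_cpu;                   /* cpu seen by the last callback */
        int calls;                      /* number of callbacks received */
};

/* Example callback: remember where the task was scheduled in. */
static void model_record_sched_in(struct model_notifier *n, int cpu)
{
        n->last_cpu = cpu;
        n->calls++;
}

static const struct model_notifier_ops model_ops = { model_record_sched_in };

/* Call every registered notifier, in list order, with the current cpu. */
static void model_fire_sched_in(struct model_notifier *head, int cpu)
{
        struct model_notifier *n;

        for (n = head; n != NULL; n = n->next)
                n->ops->sched_in(n, cpu);
}
```

Each subscriber gets the notifier itself as the first argument, so it can recover its own enclosing object from it.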

Removing the virtual memory management (mm)

mmdrop()

include/linux/sched.h

/* mmdrop drops the mm and the page tables */
static inline void mmdrop(struct mm_struct * mm)
{
        if (unlikely(atomic_dec_and_test(&mm->mm_count)))
                __mmdrop(mm);
}

Decrements the reference counter of the memory descriptor mm and frees the mm when the counter reaches 0.
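
The decrement-and-test pattern can be sketched with C11 atomics. `struct model_mm` and its helpers are hypothetical; a `freed` flag replaces the real release path so the effect can be observed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical descriptor: a counter plus a flag standing in for the free. */
struct model_mm {
        atomic_int mm_count;
        bool freed;
};

static void model_mm_init(struct model_mm *mm)
{
        atomic_init(&mm->mm_count, 1);  /* the creator holds one reference */
        mm->freed = false;
}

static void model_mmgrab(struct model_mm *mm)
{
        atomic_fetch_add(&mm->mm_count, 1);
}

static void model_mmdrop(struct model_mm *mm)
{
        /* fetch_sub returns the old value: old == 1 means we reached 0 */
        if (atomic_fetch_sub(&mm->mm_count, 1) == 1)
                mm->freed = true;
}
```

Only the caller that takes the counter from 1 to 0 performs the release, so two concurrent droppers cannot both free the object.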

__mmdrop()

kernel/fork.c

/*
 * Called when the last reference to the mm
 * is dropped: either by a lazy thread or by 
 * mmput. Free the page directory and the mm.
 */

void __mmdrop(struct mm_struct *mm)
{
        BUG_ON(mm == &init_mm);
        WARN_ON_ONCE(mm == current->mm);
        WARN_ON_ONCE(mm == current->active_mm);
        mm_free_pgd(mm);
        destroy_context(mm);
        mmu_notifier_mm_destroy(mm);
        check_mm(mm);
        put_user_ns(mm->user_ns);
        free_mm(mm);
}
EXPORT_SYMBOL_GPL(__mmdrop);

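Frees the memory descriptor mm.

In code line 6, the page table attached to the mm is freed.
In code line 7, the context information of the mm is released.
- The arm architecture does nothing here.
In code line 8, mm->mmu_notifier_mm is freed.
In code line 9, the mm is checked for problems, and an alert message is printed if any are found.
In code line 10, the reference counter of the user namespace is decremented by 1. (When it reaches 0, the user namespace is removed.)
In code line 11, the mm is freed.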

mm_free_pgd()

kernel/fork.c

static inline void mm_free_pgd(struct mm_struct *mm)
{
        pgd_free(mm, mm->pgd); 
}

check_mm()

kernel/fork.c

static void check_mm(struct mm_struct *mm)
{
        int i;

        BUILD_BUG_ON_MSG(ARRAY_SIZE(resident_page_types) != NR_MM_COUNTERS,
                         "Please make sure 'struct resident_page_types[]' is updated as well");

        for (i = 0; i < NR_MM_COUNTERS; i++) {
                long x = atomic_long_read(&mm->rss_stat.count[i]);

                if (unlikely(x))
                        pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld\n",
                                 mm, resident_page_types[i], x);
        }

        if (mm_pgtables_bytes(mm))
                pr_alert("BUG: non-zero pgtables_bytes on freeing mm: %ld\n",
                                mm_pgtables_bytes(mm));

#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
        VM_BUG_ON_MM(mm->pmd_huge_pte, mm);
#endif
}

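Checks the mm for problems and prints an alert message if any are found.

In code lines 5~6, a build-time check verifies that the number of resident_page_types entries matches the number of mm counters.
In code lines 8~14, a loop over all NR_MM_COUNTERS mm counters prints an alert message for every rss_stat counter whose value is not 0.
In code lines 16~18, an alert message is printed if the number of page table bytes used by the mm is still not 0.
In code lines 20~22, if mm->pmd_huge_pte is still set, the mm is dumped and a bug is raised.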

mm counters

include/linux/mm_types.h

enum {
        MM_FILEPAGES,
        MM_ANONPAGES,
        MM_SWAPENTS,
        NR_MM_COUNTERS
};

free_mm()

kernel/fork.c

#define free_mm(mm)     (kmem_cache_free(mm_cachep, (mm)))

Frees the memory descriptor mm back to the mm slab cache.

Switching the virtual memory management (mm) scheme

ARM32 mm switching

switch_mm() – ARM32

arch/arm/include/asm/mmu_context.h

/*
 * This is the actual mm switch as far as the scheduler
 * is concerned.  No registers are touched.  We avoid
 * calling the CPU specific function when the mm hasn't
 * actually changed.
 */

static inline void
switch_mm(struct mm_struct *prev, struct mm_struct *next,
          struct task_struct *tsk)
{
#ifdef CONFIG_MMU
        unsigned int cpu = smp_processor_id();

        /*
         * __sync_icache_dcache doesn't broadcast the I-cache invalidation,
         * so check for possible thread migration and invalidate the I-cache
         * if we're new to this CPU.
         */
        if (cache_ops_need_broadcast() &&
            !cpumask_empty(mm_cpumask(next)) &&
            !cpumask_test_cpu(cpu, mm_cpumask(next)))
                __flush_icache_all();

        if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next)) || prev != next) {
                check_and_switch_context(next, tsk);
                if (cache_is_vivt())
                        cpumask_clear_cpu(cpu, mm_cpumask(prev));
        }
#endif
}

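Performs mm switching to move to the virtual memory management scheme of the given task.

In code lines 13~16, on architectures where cache operations need a broadcast, if the cpu bitmap of the next mm is not empty and the bit of the current cpu is clear in it, the entire instruction cache is flushed.
- Not applicable on arm UP systems or on armv7 and later.
In code lines 18~22, mm switching is performed if the current cpu's bit was clear in the cpu bitmap of the next mm, or if the mm to be used by the next task differs from the previous task's mm. If the cache is of the vivt type, the current cpu's bit is cleared in the cpu bitmap of the previous mm.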
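
The switch condition can be sketched with one machine word standing in for the cpu bitmap. `struct model_mm` and `model_needs_switch()` are hypothetical, and the model covers only the test-and-set decision, not the cache handling.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mm with one word standing in for its cpu bitmap. */
struct model_mm {
        unsigned long cpumask;
};

/*
 * Sets this cpu's bit in the next mm and reports whether a real
 * context switch is needed: first use on this cpu, or a different mm.
 */
static bool model_needs_switch(int cpu, const struct model_mm *prev,
                               struct model_mm *next)
{
        unsigned long bit = 1UL << cpu;
        bool was_set = (next->cpumask & bit) != 0;

        next->cpumask |= bit;           /* test-and-set in one step */
        return !was_set || prev != next;
}
```

Returning to the same mm on a cpu that has already used it is the only case that skips the switch.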

cache_ops_need_broadcast() – ARM32

arch/arm/include/asm/smp_plat.h

#if !defined(CONFIG_SMP) || __LINUX_ARM_ARCH__ >= 7
#define cache_ops_need_broadcast()      0
#else
static inline int cache_ops_need_broadcast(void)
{
        if (!is_smp())
                return 0;

        return ((read_cpuid_ext(CPUID_EXT_MMFR3) >> 12) & 0xf) < 1;
}
#endif

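Returns whether this is an architecture where cache operations need a broadcast. true(1)=broadcast needed

Some SMP architectures at armv6 and below need the broadcast, as follows.
The values of the MMFR3.Maintenance broadcast field are as follows.
- 0b0000: cache, TLB and BP operations all apply to the local cpu only.
- 0b0001: cache and BP operations apply to the shareable cpus depending on the instruction, but TLB operations apply to the local cpu only.
- 0b0010: cache, TLB and BP operations all apply to the shareable cpus depending on the instruction.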
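
The field extraction can be tried out on plain integers. The helper names are hypothetical; the shift, mask, and comparison mirror the expression in cache_ops_need_broadcast().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Extracts bits [15:12] of an MMFR3 value: the Maintenance broadcast field. */
static unsigned int model_maintenance_broadcast(uint32_t mmfr3)
{
        return (mmfr3 >> 12) & 0xf;
}

/* Software broadcast is needed only when the hardware does none (field 0). */
static bool model_cache_ops_need_broadcast(uint32_t mmfr3)
{
        return model_maintenance_broadcast(mmfr3) < 1;
}
```

Any non-zero field value means the hardware already propagates cache maintenance, so the extra I-cache flush in switch_mm() is skipped.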

cpu_set_reserved_ttbr0() – ARM32

arch/arm/mm/context.c

static void cpu_set_reserved_ttbr0(void)
{
        u32 ttb;
        /*
         * Copy TTBR1 into TTBR0.
         * This points at swapper_pg_dir, which contains only global
         * entries so any speculative walks are perfectly safe.
         */
        asm volatile(
        "       mrc     p15, 0, %0, c2, c0, 1           @ read TTBR1\n"
        "       mcr     p15, 0, %0, c2, c0, 0           @ set TTBR0\n"
        : "=r" (ttb));
        isb();
}

Reads the TTBR1 register, which holds the physical address of the kernel page table (pgd), and copies it to TTBR0.

cpu_switch_mm() – ARM32

arch/arm/include/asm/proc-fns.h

#define cpu_switch_mm(pgd,mm) cpu_do_switch_mm(virt_to_phys(pgd),mm)

cpu_do_switch_mm() – ARM32

arch/arm/include/asm/proc-fns.h & arch/arm/include/asm/glue-proc.h

#ifdef MULTI_CPU
#define cpu_do_switch_mm                processor.switch_mm
#else
#define cpu_do_switch_mm                __glue(CPU_NAME,_switch_mm)
#endif

The ARMv7 architecture uses MULTI_CPU, and the processor.switch_mm hook points to the cpu_v7_switch_mm() function.

cpu_v7_switch_mm() – ARM32

arch/arm/mm/proc-v7-2level.S

/*
 *      cpu_v7_switch_mm(pgd_phys, tsk)
 *
 *      Set the translation table base pointer to be pgd_phys
 *
 *      - pgd_phys - physical address of new TTB
 *
 *      It is assumed that:
 *      - we are not using split page tables
 *
 *      Note that we always need to flush BTAC/BTB if IBE is set
 *      even on Cortex-A8 revisions not affected by 430973.
 *      If IBE is not set, the flush BTAC/BTB won't do anything.
 */

ENTRY(cpu_v7_switch_mm)
#ifdef CONFIG_MMU
        mmid    r1, r1                          @ get mm->context.id
        ALT_SMP(orr     r0, r0, #TTB_FLAGS_SMP)
        ALT_UP(orr      r0, r0, #TTB_FLAGS_UP)
#ifdef CONFIG_PID_IN_CONTEXTIDR
        mrc     p15, 0, r2, c13, c0, 1          @ read current context ID
        lsr     r2, r2, #8                      @ extract the PID 
        bfi     r1, r2, #8, #24                 @ insert into new context ID
#endif
#ifdef CONFIG_ARM_ERRATA_754322
        dsb     
#endif
        mcr     p15, 0, r1, c13, c0, 1          @ set context ID
        isb
        mcr     p15, 0, r0, c2, c0, 0           @ set TTB 0
        isb
#endif
        bx      lr
ENDPROC(cpu_v7_switch_mm)

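For mm switching, CONTEXTIDR.ASID is updated with mm->context.id, and pgd_phys plus the TTB attribute flags are written to the TTBR0 register.

In code line 3, the mm->context.id value is loaded into r1.
In code lines 4~5, the TTB flags are ORed into r0, which holds the pgd physical address.
In code lines 6~10, when CONFIG_PID_IN_CONTEXTIDR is enabled, the current CONTEXTIDR value shifted right by 8 bits gives the PROCID, and this value is inserted into the upper 24 bits of r1.
- r1 bits[31:8] <- CONTEXTIDR.PROCID
In code lines 11~13, on ARMv7 parts affected by erratum 754322 an ASID switch can result in an incorrect MMU translation, so a dsb instruction is executed as the workaround.
In code lines 14~15, r1, in which the ASID has been replaced, is written to CONTEXTIDR, and the isb instruction flushes the instruction pipeline.
In code lines 16~17, r0, which holds the pgd address and the TTB flags, is written to the TTBR0 register, and the isb instruction flushes the instruction pipeline.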
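
The lsr/bfi pair in code lines 6~10 can be expressed in C as below. This is a sketch of the bit manipulation only, for the CONFIG_PID_IN_CONTEXTIDR case; `next_contextidr()` is a hypothetical name, and the register values are passed in as ordinary integers.

```c
#include <assert.h>
#include <stdint.h>

/* CONTEXTIDR layout assumed here: PROCID in bits[31:8], ASID in bits[7:0]. */
static uint32_t next_contextidr(uint32_t cur_contextidr, uint32_t context_id)
{
    uint32_t procid = cur_contextidr >> 8;          /* lsr r2, r2, #8      */

    /* bfi r1, r2, #8, #24: keep the low 8 bits of context_id (the ASID)
     * and overwrite bits[31:8] with the current PROCID. */
    return (context_id & 0xffu) | (procid << 8);
}
```

For example, with a current CONTEXTIDR of 0x00123405 and a context.id whose low 32 bits are 0x00000342, the value written back keeps PROCID 0x001234 and takes ASID 0x42.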

arch/arm/mm/proc-v7-2level.S – ARM32

/* PTWs cacheable, inner WB not shareable, outer WB not shareable */
#define TTB_FLAGS_UP    TTB_IRGN_WB|TTB_RGN_OC_WB
          
/* PTWs cacheable, inner WBWA shareable, outer WBWA not shareable */
#define TTB_FLAGS_SMP   TTB_IRGN_WBWA|TTB_S|TTB_NOS|TTB_RGN_OC_WBWA

When the pgd physical address is written to the TTBR0 register, it is combined with the flags above.

#define TTB_S           (1 << 1)
#define TTB_RGN_NC      (0 << 3)
#define TTB_RGN_OC_WBWA (1 << 3)
#define TTB_RGN_OC_WT   (2 << 3)
#define TTB_RGN_OC_WB   (3 << 3)
#define TTB_NOS         (1 << 5)
#define TTB_IRGN_NC     ((0 << 0) | (0 << 6))
#define TTB_IRGN_WBWA   ((0 << 0) | (1 << 6))
#define TTB_IRGN_WT     ((1 << 0) | (0 << 6))
#define TTB_IRGN_WB     ((1 << 0) | (1 << 6))
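
To see what the two flag sets evaluate to, the definitions above can be compiled in user space. A small sketch; `make_ttbr0()` is a hypothetical helper and the pgd address in the usage note is made up.

```c
#include <assert.h>
#include <stdint.h>

#define TTB_S           (1 << 1)
#define TTB_RGN_OC_WBWA (1 << 3)
#define TTB_RGN_OC_WB   (3 << 3)
#define TTB_NOS         (1 << 5)
#define TTB_IRGN_WBWA   ((0 << 0) | (1 << 6))
#define TTB_IRGN_WB     ((1 << 0) | (1 << 6))

/* Parenthesised here so the macros are safe inside larger expressions. */
#define TTB_FLAGS_UP    (TTB_IRGN_WB | TTB_RGN_OC_WB)
#define TTB_FLAGS_SMP   (TTB_IRGN_WBWA | TTB_S | TTB_NOS | TTB_RGN_OC_WBWA)

/* Value written to TTBR0: pgd physical address ORed with the flags. */
static uint32_t make_ttbr0(uint32_t pgd_phys, int smp)
{
    return pgd_phys | (uint32_t)(smp ? TTB_FLAGS_SMP : TTB_FLAGS_UP);
}
```

With these definitions TTB_FLAGS_UP is 0x59 and TTB_FLAGS_SMP is 0x6a, so a pgd at the illustrative physical address 0x80004000 gives a TTBR0 value of 0x8000406a on SMP.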

The following figure shows how the CONTEXTIDR and TTBR0 registers change on a context switch on the armv7 architecture.

mmid macro – ARM32

arch/arm/mm/proc-macros.S

/*
 * mmid - get context id from mm pointer (mm->context.id)
 * note, this field is 64bit, so in big-endian the two words are swapped too.
 */

        .macro  mmid, rd, rn
#ifdef __ARMEB__
        ldr     \rd, [\rn, #MM_CONTEXT_ID + 4 ]
#else
        ldr     \rd, [\rn, #MM_CONTEXT_ID]
#endif
        .endm

Loads mm->context.id into rd. (rn holds the mm pointer.)

ARM64 mm switching

switch_mm() – ARM64

arch/arm64/include/asm/mmu_context.h

static inline void
switch_mm(struct mm_struct *prev, struct mm_struct *next,
          struct task_struct *tsk)
{
        if (prev != next)
                __switch_mm(next);

        /*
         * Update the saved TTBR0_EL1 of the scheduled-in task as the previous
         * value may have not been initialised yet (activate_mm caller) or the
         * ASID has changed since the last run (following the context switch
         * of another thread of the same process).
         */
        update_saved_ttbr0(tsk, next);
}

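Performs mm switching to move to the virtual memory management context of the specified task.

In code lines 5~6, if the @prev mm and the @next mm differ, mm switching is performed.
In code line 14, the ttbr value that covers the user virtual address space is stored in ti->ttbr0. The ttbr value is composed as follows.
- 16 bits of ASID
- 48 bits of physical address of the pgd table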

__switch_mm() – ARM64

arch/arm64/include/asm/mmu_context.h

static inline void __switch_mm(struct mm_struct *next)
{
        unsigned int cpu = smp_processor_id();

        /*
         * init_mm.pgd does not contain any user mappings and it is always
         * active for kernel addresses in TTBR1. Just set the reserved TTBR0.
         */
        if (next == &init_mm) {
                cpu_set_reserved_ttbr0();
                return;
        }

        check_and_switch_context(next, cpu);
}

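Performs mm switching to the @next mm.

In code lines 9~12, when switching to the kernel mm (init_mm), ttbr0 is pointed at the zero page so that user virtual addresses cannot be accessed, and the function returns. On ARM64 systems the kernel address space is always active through ttbr1.
In code line 14, the asid is checked and mm switching is performed without a TLB flush. The TLB is flushed only when the asids have run out.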

cpu_set_reserved_ttbr0() – ARM64

arch/arm64/include/asm/mmu_context.h

/*
 * Set TTBR0 to empty_zero_page. No translations will be possible via TTBR0.
 */

static inline void cpu_set_reserved_ttbr0(void)
{
        unsigned long ttbr = phys_to_ttbr(__pa_symbol(empty_zero_page));

        write_sysreg(ttbr, ttbr0_el1);
        isb();
}

Points ttbr0, which normally refers to the user page table, at the zero page so that the user virtual address range cannot be accessed.

cpu_switch_mm() – ARM64

arch/arm64/include/asm/mmu_context.h

static inline void cpu_switch_mm(pgd_t *pgd, struct mm_struct *mm)
{
        BUG_ON(pgd == swapper_pg_dir);
        cpu_set_reserved_ttbr0();
        cpu_do_switch_mm(virt_to_phys(pgd),mm);
}

Performs mm switching by writing @mm->context.id and the physical address of @pgd to TTBR0.

cpu_do_switch_mm() – ARM64

arch/arm64/mm/proc.S

/*
 *      cpu_do_switch_mm(pgd_phys, tsk)
 *
 *      Set the translation table base pointer to be pgd_phys.
 *
 *      - pgd_phys - physical address of new TTB
 */

ENTRY(cpu_do_switch_mm)
        mmid    x1, x1                          // get mm->context.id
        bfi     x0, x1, #48, #16                // set the ASID
        msr     ttbr0_el1, x0                   // set TTBR0
        isb
alternative_if_not ARM64_WORKAROUND_CAVIUM_27456
        ret
        nop
        nop
        nop
alternative_else
        ic      iallu
        dsb     nsh
        isb
        ret
alternative_endif
ENDPROC(cpu_do_switch_mm)

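The physical address of the page table (pgd_phys) is written to the 64-bit ttbr0_el1 register, with the asid placed in the top 16 bits. (16-bit asid + 48-bit pgd_phys)

In code line 2, x1 holds the mm pointer, and mm->context.id is loaded through it back into x1.
In code line 3, the mm->context.id value is inserted into the ASID position, bits[63:48], of x0, which holds the pgd physical address.
In code line 4, the value is written to ttbr0_el1.
In code lines 5~16, an instruction barrier is executed and the function returns.
- On systems affected by Cavium erratum 27456, the workaround additionally invalidates the instruction cache and executes a dsb barrier and one more instruction barrier.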
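
The bfi instruction in code line 3 corresponds to the following C expression. A minimal sketch with a hypothetical helper name; the addresses in the usage note are illustrative values, not taken from a real system.

```c
#include <assert.h>
#include <stdint.h>

/* Model of "bfi x0, x1, #48, #16": copy the low 16 bits of context_id
 * into bits[63:48] of the pgd physical address and leave the rest alone. */
static uint64_t make_ttbr0_el1(uint64_t pgd_phys, uint64_t context_id)
{
    const uint64_t asid_field = 0xffffULL << 48;

    return (pgd_phys & ~asid_field) | ((context_id & 0xffffULL) << 48);
}
```

For a pgd at 0x81234000 and a context.id of 0x30042, only the low 16 bits (0x0042) end up in the register, giving 0x0042000081234000. The generation part of context.id never reaches the hardware.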

mmid macro

include/asm/assembler.h

/*
 * mmid - get context id from mm pointer (mm->context.id)
 */

        .macro  mmid, rd, rn
        ldr     \rd, [\rn, #MM_CONTEXT_ID]
        .endm

Loads the @rn->context.id value into the @rd register, where @rn is a pointer to struct mm_struct.

The following figure shows mm switching by writing the pgd page table address and the ASID value.

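ASID management

Flushing the TLB and the instruction cache after every mm switch is costly. From the ARMv7 architecture onward, TLB entries are tagged with an ASID, which allows the same virtual address to be present in the TLB for different address spaces. Using this, an ASID is issued so that the architecture can uniquely identify each task's address space. However, the ASID is only 8 bits wide on ARM32, and 8 or 16 bits wide on ARM64. For this reason the pid that the Linux kernel uses to identify tasks cannot be used, and ASID allocation is managed separately.

ASID operation on the ARM architecture

The following figure shows entries in the TLB being distinguished by VMID + ASID + address.

Supported on ARMv7 and later.

The following figure shows the register used on ARM64 systems to find out how many ASID bits the architecture supports.

ASID generation range

With respect to the ASID generation range, mm switching between tasks behaves as follows.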

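Fastpath mm switching
- When switching to the next task's mm, the value of context.id excluding its lower 8 bits equals the current asid_generation value.
- No spinlock is taken, and the costly TLB flush is not performed.
Slowpath mm switching
- When switching to the next task's mm, the value of context.id excluding its lower 8 bits differs from the current asid_generation value.
- A spinlock is taken and the mm->context.id value is changed.
- If every asid in the range has been used, the following is done.
  - The asid generation is incremented.
  - The asid_map bitmap is cleared all at once.
  - The TLB of every cpu is flushed.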

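asid_map bitmap

A bitmap of 2^asid_bits bits that marks the asid values currently in use.

Even when an mm is no longer needed and is destroyed, the TLB is not flushed at that point, so an asid that has been marked once stays marked as if it were still in use.
When the asids run out, the ASID generation is incremented and the bitmap is cleared all at once. A TLB flush is then scheduled on every cpu for its next mm switch.

Incrementing the ASID generation

When every asid value has been used and no more can be issued, the following is done.

The asid_generation value is increased by 2^asid_bits, moving to a new range.
The TLB is flushed. The asid_map bitmap is also cleared all at once at this point.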

check_and_switch_context() – ARM32

arch/arm/mm/context.c

void check_and_switch_context(struct mm_struct *mm, struct task_struct *tsk)
{                                          
        unsigned long flags;
        unsigned int cpu = smp_processor_id();
        u64 asid;
                
        if (unlikely(mm->context.vmalloc_seq != init_mm.context.vmalloc_seq))
                __check_vmalloc_seq(mm);

        /*
         * We cannot update the pgd and the ASID atomicly with classic
         * MMU, so switch exclusively to global mappings to avoid
         * speculative page table walking with the wrong TTBR.
         */
        cpu_set_reserved_ttbr0();

        asid = atomic64_read(&mm->context.id);
        if (!((asid ^ atomic64_read(&asid_generation)) >> ASID_BITS)
            && atomic64_xchg(&per_cpu(active_asids, cpu), asid))
                goto switch_mm_fastpath;

        raw_spin_lock_irqsave(&cpu_asid_lock, flags);
        /* Check that our ASID belongs to the current generation. */
        asid = atomic64_read(&mm->context.id);
        if ((asid ^ atomic64_read(&asid_generation)) >> ASID_BITS) {
                asid = new_context(mm, cpu);
                atomic64_set(&mm->context.id, asid);
        }
                
        if (cpumask_test_and_clear_cpu(cpu, &tlb_flush_pending)) {
                local_flush_bp_all();
                local_flush_tlb_all();
        }
 
        atomic64_set(&per_cpu(active_asids, cpu), asid);
        cpumask_set_cpu(cpu, mm_cpumask(mm));
        raw_spin_unlock_irqrestore(&cpu_asid_lock, flags);

switch_mm_fastpath:
        cpu_switch_mm(mm->pgd, mm);
}

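Checks the asid and performs mm switching without a TLB flush. The TLB is flushed only when the asids have run out.

In code lines 7~8, if the kernel mappings have been updated, the changed part of the kernel page table is copied into the kernel area of the user page table to bring the kernel mappings up to date.
- If the vmalloc information of init_mm has been updated and its sequence differs from the vmalloc sequence of this mm, only the init_mm page table entries that cover the vmalloc address space are copied into the corresponding entries of this mm.
In code line 15, with the classic MMU the user page table and the ASID cannot be updated atomically. TTBR0 is therefore switched to the kernel page table first. The kernel page table contains only global mappings, which do not depend on the ASID, so speculative page table walks with a mismatched TTBR and ASID are avoided.
In code lines 17~20, the 64-bit context.id of the mm being switched to is read. If it belongs to the current asid generation (generations advance in units of 256: 0, 0x100, 0x200, 0x300, …) and the previous active_asids value of this cpu was non-zero, execution jumps to the switch_mm_fastpath label.
- If the atomically read asid, excluding its lowest 8 bits, equals the most recently issued generation, the fastpath is taken.
- e.g.) asid_generation=0x300, mm->context.id=0x340
  - Both values are equal after a right shift by 8 bits, so fastpath
- e.g.) asid_generation=0x400, mm->context.id=0x3f0
  - The values differ after a right shift by 8 bits, so slowpath
In code lines 22~28, the spinlock is taken and the asid is read once more. If its value excluding the lowest 8 bits differs from the most recently issued generation, a new asid is issued and written to mm->context.id.
- e.g.) With asid_generation=0x400 and mm->context.id=0x3f0, mm->context.id normally becomes 0x4f0.
  - If 0xf0 is already in use, a free number is issued instead.
In code lines 30~33, if a tlb flush is pending for the current cpu, the branch predictor and the TLB are flushed locally.
In code lines 35~37, the asid value is stored in active_asids of the current cpu, the current cpu's bit is set in mm->cpu_vm_mask_var, and the spinlock is released.
In code line 40, mm switching is performed.
- The pgd is written to the TTBR0 register, and only the 8-bit asid value is changed in CONTEXTIDR.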
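
The generation comparison in code lines 17~20 is a single XOR and shift, which can be checked against the two examples above. A minimal sketch; the helper name is hypothetical and the per-cpu active_asids exchange is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ASID_BITS 8

/* True when context_id and generation agree in every bit above the
 * low ASID_BITS bits, i.e. the asid was issued in the current generation. */
static bool asid_in_current_generation(uint64_t context_id, uint64_t generation)
{
    return !((context_id ^ generation) >> ASID_BITS);
}
```

A context.id of 0 (an mm that has never been given an asid) never matches, because the generation itself is never 0 once the allocator is initialised.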

check_and_switch_context() – ARM64

arch/arm64/mm/context.c

void check_and_switch_context(struct mm_struct *mm, unsigned int cpu)
{
        unsigned long flags;
        u64 asid, old_active_asid;

        if (system_supports_cnp())
                cpu_set_reserved_ttbr0();

        asid = atomic64_read(&mm->context.id);

        /*
         * The memory ordering here is subtle.
         * If our active_asids is non-zero and the ASID matches the current
         * generation, then we update the active_asids entry with a relaxed
         * cmpxchg. Racing with a concurrent rollover means that either:
         *
         * - We get a zero back from the cmpxchg and end up waiting on the
         *   lock. Taking the lock synchronises with the rollover and so
         *   we are forced to see the updated generation.
         *
         * - We get a valid ASID back from the cmpxchg, which means the
         *   relaxed xchg in flush_context will treat us as reserved
         *   because atomic RmWs are totally ordered for a given location.
         */
        old_active_asid = atomic64_read(&per_cpu(active_asids, cpu));
        if (old_active_asid &&
            !((asid ^ atomic64_read(&asid_generation)) >> asid_bits) &&
            atomic64_cmpxchg_relaxed(&per_cpu(active_asids, cpu),
                                     old_active_asid, asid))
                goto switch_mm_fastpath;

        raw_spin_lock_irqsave(&cpu_asid_lock, flags);
        /* Check that our ASID belongs to the current generation. */
        asid = atomic64_read(&mm->context.id);
        if ((asid ^ atomic64_read(&asid_generation)) >> asid_bits) {
                asid = new_context(mm);
                atomic64_set(&mm->context.id, asid);
        }

        if (cpumask_test_and_clear_cpu(cpu, &tlb_flush_pending))
                local_flush_tlb_all();

        atomic64_set(&per_cpu(active_asids, cpu), asid);
        raw_spin_unlock_irqrestore(&cpu_asid_lock, flags);

switch_mm_fastpath:

        arm64_apply_bp_hardening();

        /*
         * Defer TTBR0_EL1 setting for user threads to uaccess_enable() when
         * emulating PAN.
         */
        if (!system_uses_ttbr0_pan())
                cpu_switch_mm(mm->pgd, mm);
}

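Checks the asid and performs mm switching without a TLB flush. The TLB is flushed only when the asids have run out.

In code lines 6~7, CnP is supported on ARMv8.2 and later; in that case the zero page is installed in TTBR0_EL1 so that user virtual addresses cannot be accessed.
- With CnP, translation table walks have the same effect on the other cpus as well. The reserved value is written with the CnP bit clear, which disables CnP so that the other cpus are not affected.
- Reference: arm64: mm: Support Common Not Private translations (2018, v4.20-rc1)
In code line 9, mm->context.id is read as the asid value.
In code lines 25~30, if old_active_asid is not 0 and the asid that was read is within the current asid_generation range, the asid is written atomically to active_asids and execution jumps to the switch_mm_fastpath label.
- The slowpath is taken in the following situations.
  - active_asids is 0
  - the asid is outside the asid_generation range
  - the atomic write fails because of a race
- active_asids is used only to detect race conditions through atomic operations.
In code lines 32~38, with the spinlock held, the asid is read again; if it is outside the asid_generation range, a new asid within the new generation range is issued and written to mm->context.id.
In code lines 40~41, if the bit requesting a tlb flush is set for the current cpu, a local flush is performed.
- This request is set when the asids have been exhausted, so that the tlb is flushed at the next mm switch.
In code lines 43~44, the asid that will finally be used is written to active_asids and the spinlock is released.
Code lines 46~48 are the switch_mm_fastpath: label. arm64_apply_bp_hardening() applies the workaround on systems that have the ARM64_HARDEN_BRANCH_PREDICTOR capability and need it.
In code lines 54~55, if the system does not use software-emulated PAN, mm switching is performed.
- On systems that use software-emulated PAN, TTBR0 is written in uaccess_enable(), so the mm switch is deferred until then.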
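
The three slowpath conditions listed for code lines 25~30 can be reproduced with C11 atomics. This is a user-space sketch rather than the kernel code: `try_fastpath()` is a hypothetical name, a single atomic variable stands in for the per-cpu active_asids, and relaxed ordering is used to mirror atomic64_cmpxchg_relaxed().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true when the fastpath may be taken. On success *active_asid
 * has been replaced with asid; otherwise the caller takes the lock. */
static bool try_fastpath(_Atomic uint64_t *active_asid, uint64_t asid,
                         uint64_t generation, unsigned int asid_bits)
{
    uint64_t old = atomic_load_explicit(active_asid, memory_order_relaxed);

    if (old == 0)                               /* cleared by a rollover   */
        return false;
    if ((asid ^ generation) >> asid_bits)       /* stale generation        */
        return false;

    /* On failure, old is overwritten with the value actually observed.
     * A racing rollover stores 0, so old ends up 0 and we fall back. */
    atomic_compare_exchange_strong_explicit(active_asid, &old, asid,
                                            memory_order_relaxed,
                                            memory_order_relaxed);
    return old != 0;
}
```

The return value is based on the observed old value rather than on the success flag of the compare-exchange, which follows the shape of the condition in the listing.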

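The following figure shows, as three cases, what changes when a task's mm is switched inside and outside the asid_generation range.

Issuing a new ASID number

The following figure shows an existing context.id being changed to a free asid number in the asid_generation range that has just been moved to.

The asids are tracked as bits in asid_map, indexed by the lower asid_bits bits of context.id. If the index is already taken, a free number is used instead.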

new_context() – ARM32

arch/arm/mm/context.c

static u64 new_context(struct mm_struct *mm, unsigned int cpu)
{
        static u32 cur_idx = 1;
        u64 asid = atomic64_read(&mm->context.id);
        u64 generation = atomic64_read(&asid_generation);

        if (asid != 0) {
                u64 newasid = generation | (asid & ~ASID_MASK);
                /*
                 * If our current ASID was active during a rollover, we
                 * can continue to use it and this was just a false alarm.
                 */
                if (check_update_reserved_asid(asid, newasid))
                        return newasid;

                /*
                 * We had a valid ASID in a previous life, so try to re-use
                 * it if possible.,
                 */
                asid &= ~ASID_MASK;
                if (!__test_and_set_bit(asid, asid_map))
                        return newasid;
        }

        /*
         * Allocate a free ASID. If we can't find one, take a note of the
         * currently active ASIDs and mark the TLBs as requiring flushes.
         * We always count from ASID #1, as we reserve ASID #0 to switch
         * via TTBR0 and to avoid speculative page table walks from hitting
         * in any partial walk caches, which could be populated from
         * overlapping level-1 descriptors used to map both the module
         * area and the userspace stack.
         */
        asid = find_next_zero_bit(asid_map, NUM_USER_ASIDS, cur_idx);
        if (asid == NUM_USER_ASIDS) {
                generation = atomic64_add_return(ASID_FIRST_VERSION,
                                                 &asid_generation);
                flush_context(cpu);
                asid = find_next_zero_bit(asid_map, NUM_USER_ASIDS, 1);
        }

        __set_bit(asid, asid_map);
        cur_idx = asid;
        cpumask_clear(mm_cpumask(mm));
        return asid | generation;
}

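Finds and returns a new asid for the task from the current asid generation. If all 256 have been used, a tlb flush is scheduled and the asid is issued from a new generation. Generations advance in units of 256.

In code lines 4~23, whether each asid is in use is tracked in asid_map, a bitmap of 256 bits. The asid read from mm->context.id, and newasid, built by combining it with the asid_generation value, are handled as follows.
- If the asid that was read matches an entry in reserved_asids, newasid is returned.
  - A reserved asid already has its bit set in asid_map, so the bit does not need to be set again.
- If the asid_map bit at the index given by the lower 8 bits of asid is clear, that bit is set and newasid is returned.
In code line 34, if asid is 0 or its bit is already taken in asid_map, a free index is searched for starting from the previously saved cur_idx and used as the asid.
In code lines 35~40, if every index is taken, asid_generation is increased by ASID_FIRST_VERSION. Then all bits of asid_map are cleared and a flush is scheduled on every cpu for its next mm switch. Finally, a free index is searched for again starting from index 1.
In code lines 42~43, the bit for asid is set in asid_map, and the asid value is also saved in cur_idx for the next search.
In code line 44, the cpu bitmap of the mm (mm_cpumask(mm)) is cleared.
In code line 45, the asid index and the generation are combined and returned.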
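
The allocation and rollover behaviour described above can be simulated in user space. This is a deliberately simplified sketch, not the kernel implementation: it assumes 8-bit asids, uses one byte per asid instead of a real bitmap, and omits the reserved_asids handling and the per-cpu state. All names (`struct asid_alloc`, `asid_new_context()` and so on) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ASID_BITS           8
#define NUM_ASIDS           (1u << ASID_BITS)
#define ASID_FIRST_VERSION  (1ULL << ASID_BITS)
#define ASID_MASK           (~0ULL << ASID_BITS)

struct asid_alloc {
    uint64_t generation;        /* current generation, multiple of 256   */
    uint8_t  map[NUM_ASIDS];    /* 1 = asid index in use                 */
    unsigned int cur_idx;       /* where the next search starts          */
    unsigned int flushes;       /* counts rollovers (TLB flush requests) */
};

static void asid_alloc_init(struct asid_alloc *a)
{
    memset(a, 0, sizeof(*a));
    a->generation = ASID_FIRST_VERSION;
    a->cur_idx = 1;             /* index 0 is never handed out */
}

static unsigned int find_free(const struct asid_alloc *a, unsigned int start)
{
    for (unsigned int i = start; i < NUM_ASIDS; i++)
        if (!a->map[i])
            return i;
    return NUM_ASIDS;           /* nothing free */
}

static uint64_t asid_new_context(struct asid_alloc *a, uint64_t old_id)
{
    unsigned int idx;

    if (old_id != 0) {
        /* Try to keep the same index in the current generation. */
        idx = (unsigned int)(old_id & ~ASID_MASK);
        if (!a->map[idx]) {
            a->map[idx] = 1;
            return a->generation | idx;
        }
    }

    idx = find_free(a, a->cur_idx);
    if (idx == NUM_ASIDS) {
        /* Out of asids: new generation, clear the map, request a flush. */
        a->generation += ASID_FIRST_VERSION;
        memset(a->map, 0, sizeof(a->map));
        a->flushes++;
        idx = find_free(a, 1);
    }

    a->map[idx] = 1;
    a->cur_idx = idx;
    return a->generation | idx;
}
```

Issuing 255 asids from a fresh allocator uses up indexes 1 to 255 of generation 0x100; the next request rolls over to generation 0x200, and an mm that previously held 0x1f0 then gets 0x2f0 as long as index 0xf0 is still free.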

new_context() – ARM64

arch/arm64/mm/context.c

static u64 new_context(struct mm_struct *mm)
{
        static u32 cur_idx = 1;
        u64 asid = atomic64_read(&mm->context.id);
        u64 generation = atomic64_read(&asid_generation);

        if (asid != 0) {
                u64 newasid = generation | (asid & ~ASID_MASK);

                /*
                 * If our current ASID was active during a rollover, we
                 * can continue to use it and this was just a false alarm.
                 */
                if (check_update_reserved_asid(asid, newasid))
                        return newasid;

                /*
                 * We had a valid ASID in a previous life, so try to re-use
                 * it if possible.
                 */
                if (!__test_and_set_bit(asid2idx(asid), asid_map))
                        return newasid;
        }

        /*
         * Allocate a free ASID. If we can't find one, take a note of the
         * currently active ASIDs and mark the TLBs as requiring flushes.  We
         * always count from ASID #2 (index 1), as we use ASID #0 when setting
         * a reserved TTBR0 for the init_mm and we allocate ASIDs in even/odd
         * pairs.
         */
        asid = find_next_zero_bit(asid_map, NUM_USER_ASIDS, cur_idx);
        if (asid != NUM_USER_ASIDS)
                goto set_asid;

        /* We're out of ASIDs, so increment the global generation count */
        generation = atomic64_add_return_relaxed(ASID_FIRST_VERSION,
                                                 &asid_generation);
        flush_context();

        /* We have more ASIDs than CPUs, so this will always succeed */
        asid = find_next_zero_bit(asid_map, NUM_USER_ASIDS, 1);

set_asid:
        __set_bit(asid, asid_map);
        cur_idx = asid;
        return idx2asid(asid) | generation;
}

Finds and returns a new asid for the task from the current asid generation. If every asid of the generation has been used, a tlb flush is scheduled and the asid is issued from a new generation. Generations advance in units of 2^asid_bits. (asid_bits=8 or 16 depending on the system)

In code lines 4~23, whether each asid is in use is tracked in asid_map, which has 2^asid_bits bits. The asid read from mm->context.id, and newasid, built by combining it with the asid_generation value, are handled as follows.
- If the asid that was read matches an entry in reserved_asids, newasid is returned.
  - A reserved asid already has its bit set in asid_map, so the bit does not need to be set again.
- If the asid_map bit at the index given by the lower asid_bits bits of asid is clear, that bit is set and newasid is returned.
In code lines 32~34, if asid is 0 or its bit is already taken in asid_map, a free index is searched for starting from the previously saved cur_idx; if one is found, execution jumps to the set_asid label.
- For performance, the search starts from cur_idx rather than from bit 1.
In code lines 37~38, if every index is taken, asid_generation is increased by ASID_FIRST_VERSION.
In code lines 39~42, all bits of asid_map are cleared and a tlb flush is scheduled on every cpu for its next mm switch. Finally, a free index is searched for again starting from index 1.
Code lines 44~46 are the set_asid label. The bit for asid is set in asid_map, and the asid value is also saved in cur_idx for the next search.
In code line 47, the asid index and the generation are combined and returned.

flush_context()

arch/arm64/mm/context.c

static void flush_context(void)
{
        int i;
        u64 asid;

        /* Update the list of reserved ASIDs and the ASID bitmap. */
        bitmap_clear(asid_map, 0, NUM_USER_ASIDS);

        for_each_possible_cpu(i) {
                asid = atomic64_xchg_relaxed(&per_cpu(active_asids, i), 0);
                /*
                 * If this CPU has already been through a
                 * rollover, but hasn't run another task in
                 * the meantime, we must preserve its reserved
                 * ASID, as this is the only trace we have of
                 * the process it is still running.
                 */
                if (asid == 0)
                        asid = per_cpu(reserved_asids, i);
                __set_bit(asid2idx(asid), asid_map);
                per_cpu(reserved_asids, i) = asid;
        }

        /*
         * Queue a TLB invalidation for each CPU to perform on next
         * context-switch
         */
        cpumask_setall(&tlb_flush_pending);
}

Schedules a TLB flush.

In code line 7, all bits of asid_map are cleared.
In code lines 9~10, every possible cpu is visited; its active_asids value is read into asid and cleared to 0.
In code lines 18~19, if the asid that was read is 0, it is read from reserved_asids instead.
In code line 20, the bit for asid is set in asid_map.
In code line 21, asid is written to reserved_asids.
In code line 28, all bits of tlb_flush_pending are set to 1 so that each cpu performs a TLB flush at its next context switch.
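
The way flush_context() carries the asids of running tasks over into the new generation can be followed with a small model. A sketch under stated assumptions: 4 cpus, 8-bit asids, plain arrays instead of per-cpu variables, no atomics, and hypothetical names throughout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NR_CPUS_SKETCH  4
#define ASID_BITS       8
#define NUM_ASIDS       (1u << ASID_BITS)

struct asid_state {
    uint64_t active[NR_CPUS_SKETCH];    /* asid currently running per cpu */
    uint64_t reserved[NR_CPUS_SKETCH];  /* asid preserved across rollover */
    uint8_t  map[NUM_ASIDS];            /* 1 = asid index in use          */
    uint32_t tlb_flush_pending;         /* one bit per cpu                */
};

static void flush_context_sketch(struct asid_state *s)
{
    memset(s->map, 0, sizeof(s->map));

    for (int i = 0; i < NR_CPUS_SKETCH; i++) {
        uint64_t asid = s->active[i];

        s->active[i] = 0;
        /* A cpu that has not switched since the last rollover has no
         * active asid, so its reserved one is kept instead. */
        if (asid == 0)
            asid = s->reserved[i];
        s->map[asid & (NUM_ASIDS - 1)] = 1;
        s->reserved[i] = asid;
    }

    s->tlb_flush_pending = (1u << NR_CPUS_SKETCH) - 1;
}
```

After the call, only the asids that were active or reserved on some cpu remain marked in the map; every other index is free for the new generation.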

vmalloc replication before mm switching – ARM32

__check_vmalloc_seq()

arch/arm/mm/ioremap.c

void __check_vmalloc_seq(struct mm_struct *mm) 
{
        unsigned int seq;

        do {
                seq = init_mm.context.vmalloc_seq;
                memcpy(pgd_offset(mm, VMALLOC_START),
                       pgd_offset_k(VMALLOC_START),
                       sizeof(pgd_t) * (pgd_index(VMALLOC_END) -
                                        pgd_index(VMALLOC_START)));
                mm->context.vmalloc_seq = seq;
        } while (seq != init_mm.context.vmalloc_seq);
}

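Because the vmalloc information of init_mm has been updated, the vmalloc entries of init_mm->pgd are copied into mm->pgd before switching to the next task.

In code line 6, the vmalloc sequence number of init_mm is read.
In code lines 7~10, the init_mm page table entries that cover the vmalloc address space are copied into the page table of mm.
In code lines 11~12, the vmalloc sequence number is updated as well. If another change happens while this update is in progress, the loop runs again.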
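
The read-sequence, copy, re-check pattern of this loop can be written out in plain C. This sketch is single-threaded and uses hypothetical types (`struct vm_mm` with an 8-entry array standing in for the vmalloc part of the pgd), so the loop body only ever runs once here; in the kernel the re-check matters because init_mm can change concurrently.

```c
#include <assert.h>
#include <string.h>

#define NR_VMALLOC_PGD 8    /* illustrative number of vmalloc pgd entries */

struct vm_mm {
    unsigned long pgd[NR_VMALLOC_PGD];
    unsigned int vmalloc_seq;
};

/* Copies the vmalloc entries from init to mm and returns the number
 * of passes that were needed. */
static int sync_vmalloc_entries(struct vm_mm *mm, const struct vm_mm *init)
{
    unsigned int seq;
    int passes = 0;

    do {
        seq = init->vmalloc_seq;                    /* snapshot first    */
        memcpy(mm->pgd, init->pgd, sizeof(mm->pgd));
        mm->vmalloc_seq = seq;
        passes++;
    } while (seq != init->vmalloc_seq);             /* changed meanwhile? */

    return passes;
}
```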

The following figure shows the updated kernel vmalloc entries being copied into the page table of the task.

Task switching

switch_to()

include/asm-generic/switch_to.h

#define switch_to(prev, next, last)                                     \
        do {                                                            \
                ((last) = __switch_to((prev), (next)));                 \
        } while (0)

Switches to the next task by performing a task context switch.

The following figure shows the context-related components that are allocated when a task is created.

Task switching – ARM32

__switch_to() – ARM32

arch/arm/kernel/entry-armv.S – THUMB code removed

/*
 * Register switch for ARMv3 and ARMv4 processors
 * r0 = previous task_struct, r1 = previous thread_info, r2 = next thread_info
 * previous and next are guaranteed not to be the same.
 */

ENTRY(__switch_to)
 UNWIND(.fnstart        )
 UNWIND(.cantunwind     )
        add     ip, r1, #TI_CPU_SAVE
 ARM(   stmia   ip!, {r4 - sl, fp, sp, lr} )    @ Store most regs on stack
        ldr     r4, [r2, #TI_TP_VALUE]
        ldr     r5, [r2, #TI_TP_VALUE + 4]
#ifdef CONFIG_CPU_USE_DOMAINS
        mrc     p15, 0, r6, c3, c0, 0           @ Get domain register
        str     r6, [r1, #TI_CPU_DOMAIN]        @ Save old domain register
        ldr     r6, [r2, #TI_CPU_DOMAIN]
#endif
        switch_tls r1, r4, r5, r3, r7
#if defined(CONFIG_STACKPROTECTOR) && !defined(CONFIG_SMP)
        ldr     r7, [r2, #TI_TASK]
        ldr     r8, =__stack_chk_guard
        .if (TSK_STACK_CANARY > IMM12_MASK)
        add     r7, r7, #TSK_STACK_CANARY & ~IMM12_MASK
        .endif
        ldr     r7, [r7, #TSK_STACK_CANARY & IMM12_MASK]
#endif
#ifdef CONFIG_CPU_USE_DOMAINS
        mcr     p15, 0, r6, c3, c0, 0           @ Set domain register
#endif
        mov     r5, r0
        add     r4, r2, #TI_CPU_SAVE
        ldr     r0, =thread_notify_head
        mov     r1, #THREAD_NOTIFY_SWITCH
        bl      atomic_notifier_call_chain
#if defined(CONFIG_STACKPROTECTOR) && !defined(CONFIG_SMP)
        str     r7, [r8]
#endif
        mov     r0, r5
 ARM(   ldmia   r4, {r4 - sl, fp, sp, pc}  )    @ Load all regs saved previously
 UNWIND(.fnend          )
ENDPROC(__switch_to)

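Switches to the next task by performing a task context switch.

In code lines 4~5, registers r4 ~ sl, fp, sp and lr are saved into the previous ti->cpu_context.
- struct thread_info -> ti
In code lines 6~12, tp_value[0] and tp_value[1] of the next thread_info are loaded into r4 and r5. When CONFIG_CPU_USE_DOMAINS is enabled, the current domain register is saved into the previous ti->cpu_domain and the next cpu_domain value is loaded into r6.
In code line 13, TLS is switched as part of the process context switch.
- r1: &prev->thread_info
- r4: next tp_value[0]
- r5: next tp_value[1]
- r3: tmp1 register
- r7: tmp2 register
In code lines 14~21, on non-SMP systems with the CONFIG_STACKPROTECTOR kernel option, the stack_canary value of the next task is loaded into r7 and the address of __stack_chk_guard is loaded into r8.
In code lines 22~24, the cpu_domain value is written to the domain register.
In code line 25, r0, which points to the previous task, is saved temporarily in r5.
In code line 26, the address of the next ti->cpu_context is placed in r4.
In code lines 27~29, every function of the notify blocks registered on the thread_notify_head list is called. The THREAD_NOTIFY_SWITCH command and the pointer to the next struct thread_info are passed to the call.
In code lines 30~32, the stack_canary value of the next task is stored in the __stack_chk_guard global variable.
In code line 33, the previous task pointer saved in r5 is put back into r0.
In code line 34, registers r4 ~ sl, fp, sp and pc are restored through r4, which points to the next ti->cpu_context.
- Execution jumps to the pc value that is loaded last.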

switch_tls macro – ARM32

arch/arm/include/asm/tls.h

#define switch_tls      switch_tls_v6k

ARMv6K and ARMv7 or later cpus have TLS registers, and the switch_tls_v6k macro is used for them.

switch_tls_v6k macro – ARM32

arch/arm/include/asm/tls.h

        .macro switch_tls_v6k, base, tp, tpuser, tmp1, tmp2
        ldr     \tmp1, =elf_hwcap
        ldr     \tmp1, [\tmp1, #0]
        mov     \tmp2, #0xffff0fff
        tst     \tmp1, #HWCAP_TLS               @ hardware TLS available?
        streq   \tp, [\tmp2, #-15]              @ set TLS value at 0xffff0ff0
        mrcne   p15, 0, \tmp2, c13, c0, 2       @ get the user r/w register
        mcrne   p15, 0, \tp, c13, c0, 3         @ yes, set TLS register
        mcrne   p15, 0, \tpuser, c13, c0, 2     @ set user r/w register
        strne   \tmp2, [\base, #TI_TP_VALUE + 4] @ save it
        mrc     p15, 0, \tmp2, c13, c0, 2       @ get the user r/w register
        mcr     p15, 0, \tp, c13, c0, 3         @ set TLS register
        mcr     p15, 0, \tpuser, c13, c0, 2     @ and the user r/w register
        str     \tmp2, [\base, #TI_TP_VALUE + 4] @ save it
        .endm

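Switches TLS as part of the process context switch.

The Thread ID register is read and saved into the previous thread_info.tp_value[1]; then tp is written to the TLS register and tpuser to the Thread ID register.
In code lines 2~3, the elf_hwcap value is read.
In code lines 4~6, if the hardware TLS feature is not available (HWCAP_TLS is clear), tp is stored at address 0xffff_0ff0.
In code lines 7~10, if the hardware TLS feature is available, the following is done.
- The TPIDRURW (user read/write) register, used as the Thread ID register, is read into tmp2. (tmp2 <- ThreadID)
- tp is written to the TPIDRURO (user read-only) register, used as the TLS register. (TLS <- tp)
- tpuser is written to the TPIDRURW (user read/write) register, used as the Thread ID register. (ThreadID <- tpuser)
- The previous Thread ID value is saved into thread_info->tp_value[1].
In code lines 11~14, the same four operations are performed again without a condition.

The following figure shows TLS being switched as part of the process context switch.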

include/asm/mmu.h

#ifdef CONFIG_CPU_HAS_ASID
#define ASID_BITS       8
#define ASID_MASK       ((~0ULL) << ASID_BITS)
#define ASID(mm)        ((unsigned int)((mm)->context.id.counter & ~ASID_MASK))
#else   
#define ASID(mm)        (0)
#endif

ASID_MASK=0xffff_ffff_ffff_ff00

Task switching – ARM64

__switch_to() – ARM64

arch/arm64/kernel/process.c

/*
 * Thread switching.
 */

__notrace_funcgraph struct task_struct *__switch_to(struct task_struct *prev,
                                struct task_struct *next)
{
        struct task_struct *last;

        fpsimd_thread_switch(next);
        tls_thread_switch(next);
        hw_breakpoint_thread_switch(next);
        contextidr_thread_switch(next);
        entry_task_switch(next);
        uao_thread_switch(next);
        ptrauth_thread_switch(next);
        ssbs_thread_switch(next);

        /*
         * Complete any pending TLB or cache maintenance on this CPU in case
         * the thread migrates to a different CPU.
         * This full barrier is also required by the membarrier system
         * call.
         */
        dsb(ish);

        /* the actual thread switch */
        last = cpu_switch_to(prev, next);

        return last;
}

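Switches to the next task by performing a task context switch.

In code lines 6~13, the following items are context switched as well.
- floating point unit and SIMD state
- tls register
- breakpoint information for the hw debugger
- contextidr_el1 update (the pid of the next task is written when CONFIG_PID_IN_CONTEXTIDR is enabled)
- recording of the @next task in __entry_task
- uao flag in PSTATE (1 if the next task is a kernel task, 0 if it is a user task)
- pointer authentication keys
- security-related ssbs flag handling when the next task is a user task
In code line 21, the function waits for any TLB and cache maintenance that has not completed yet.
In code line 24, the actual context switch to the @next task is performed.
In code line 26, last, the task that was running before the switch, is returned.
- last == @prev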

cpu_switch_to() – ARM64

arch/arm64/kernel/entry.S

/*
 * Register switch for AArch64. The callee-saved registers need to be saved
 * and restored. On entry:
 *   x0 = previous task_struct (must be preserved across the switch)
 *   x1 = next task_struct
 * Previous and next are guaranteed not to be the same.
 *
 */

ENTRY(cpu_switch_to)
        mov     x10, #THREAD_CPU_CONTEXT
        add     x8, x0, x10
        mov     x9, sp
        stp     x19, x20, [x8], #16             // store callee-saved registers
        stp     x21, x22, [x8], #16
        stp     x23, x24, [x8], #16
        stp     x25, x26, [x8], #16
        stp     x27, x28, [x8], #16
        stp     x29, x9, [x8], #16
        str     lr, [x8]
        add     x8, x1, x10
        ldp     x19, x20, [x8], #16             // restore callee-saved registers
        ldp     x21, x22, [x8], #16
        ldp     x23, x24, [x8], #16
        ldp     x25, x26, [x8], #16
        ldp     x27, x28, [x8], #16
        ldp     x29, x9, [x8], #16
        ldr     lr, [x8]
        mov     sp, x9
        msr     sp_el0, x1
        ret
ENDPROC(cpu_switch_to)
NOKPROBE(cpu_switch_to)

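Performs the context switch from task @x0 to task @x1.

In code lines 2~11, registers x19~x29, x9 (holding sp) and lr are saved into &prev->thread.cpu_context.
In code lines 12~19, registers x19~x29, x9 and lr are loaded from &next->thread.cpu_context.
In code lines 20~22, sp is restored from x9, where it was held temporarily, the address of the @x1(next) task is written to sp_el0, which the kernel uses to hold the current task pointer while in kernel mode, and the function returns.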
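
The save and restore order of cpu_switch_to() can be modelled in C by treating the register file as a struct. This is an illustration only: the struct and function names are hypothetical, the field order follows the stp/ldp sequence in the listing (x19..x28, fp, sp, pc), and no real registers are touched.

```c
#include <assert.h>
#include <stdint.h>

/* Saved context, in the order the listing stores it. */
struct cpu_context_sketch {
    uint64_t x[10];     /* x19..x28 */
    uint64_t fp;        /* x29      */
    uint64_t sp;
    uint64_t pc;        /* saved lr */
};

/* Hypothetical model of the live register state of one cpu. */
struct cpu_regs_sketch {
    uint64_t x[10];
    uint64_t fp;
    uint64_t sp;
    uint64_t lr;
    uint64_t sp_el0;    /* current task pointer while in kernel mode */
};

struct task_sketch {
    struct cpu_context_sketch ctx;
};

static struct task_sketch *cpu_switch_to_sketch(struct cpu_regs_sketch *cpu,
                                                struct task_sketch *prev,
                                                struct task_sketch *next)
{
    /* Save the callee-saved registers of the outgoing task. */
    for (int i = 0; i < 10; i++)
        prev->ctx.x[i] = cpu->x[i];
    prev->ctx.fp = cpu->fp;
    prev->ctx.sp = cpu->sp;
    prev->ctx.pc = cpu->lr;

    /* Load the callee-saved registers of the incoming task. */
    for (int i = 0; i < 10; i++)
        cpu->x[i] = next->ctx.x[i];
    cpu->fp = next->ctx.fp;
    cpu->sp = next->ctx.sp;
    cpu->lr = next->ctx.pc;     /* "ret" continues where next left off */

    cpu->sp_el0 = (uint64_t)(uintptr_t)next;
    return prev;                /* x0 is preserved across the switch   */
}
```

Only the callee-saved set is handled here. The caller-saved registers do not need saving because the switch happens inside a function call, where the calling convention already treats them as clobbered.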

다음 그림은 스케줄 틱이 발생했을 때 스케줄링을 통해 유저 태스크간의 전환에 interrupt 및 process의 두 가지 context switch가 진행되는 모습을 보여준다.

다음 그림은 32bit ARM 시스템에서 interrupt 및 process의 두 가지 context switch가 발생할 때 cpu 레지스터들이 백업/복구되는 과정을 보여준다.

유저 태스크에서 사용하던 모든 레지스터들은 인터럽트 발생 시 모두 태스크 A의 커널 스택에 모두 백업한다.
커널의 __switch_to() 함수에서 사용중인 일부 레지스터들을 태스크 A의 cpu_context_save 구조체에 백업한다.
r0~r3, ip 레지스터는 scratch 레지스터로 함수로 부터 리턴될 때 내용이 깨져도 상관이 없다.
- AAPCS 참고: Procedure Call Standard for the ARM® Architecture | ARM – 다운로드 pdf

다음 그림은 ARM64 시스템에서 interrupt 및 process의 두 가지 context switch가 발생할 때 cpu 레지스터들이 백업/복구되는 과정을 보여준다.

유저 태스크에서 사용하던 모든 레지스터들은 인터럽트 발생 시 모두 태스크 A의 커널 스택에 모두 백업한다.
커널의 cpu_switch_to() 함수에서 사용중인 일부 레지스터들을 태스크 A의 cpu_context 구조체에 백업한다.
Callee Saved 레지스터는 r19~r28이며 이 레지스터들은 호출된(callee) 함수가 스스로 레지스터를 보호해야하는 규약을 가지고 있으므로 변경하여 사용한 후에는 반드시 복귀전에 이를 복원해야 한다.
- AAPCS64 참고: Procedure Call Standard for the Arm® 64-bit Architecture | ARM
ARM64 시스템에서는 전용 IRQ 스택을 지원하며, 커널 스택의 전환 사이에 인터럽트 처리를 위해 irq 핸들러에서 이를 중간에 이용한다.

Thread Notifier

atomic_notifier_call_chain()

kernel/notifier.c

int atomic_notifier_call_chain(struct atomic_notifier_head *nh,
                               unsigned long val, void *v)
{
        return __atomic_notifier_call_chain(nh, val, v, -1, NULL);
}
EXPORT_SYMBOL_GPL(atomic_notifier_call_chain);
NOKPROBE_SYMBOL(atomic_notifier_call_chain);

Calls every function of the notify blocks registered on the notify chain list.

__atomic_notifier_call_chain()

/**
 *      __atomic_notifier_call_chain - Call functions in an atomic notifier chain
 *      @nh: Pointer to head of the atomic notifier chain
 *      @val: Value passed unmodified to notifier function
 *      @v: Pointer passed unmodified to notifier function
 *      @nr_to_call: See the comment for notifier_call_chain.
 *      @nr_calls: See the comment for notifier_call_chain.
 *
 *      Calls each function in a notifier chain in turn.  The functions
 *      run in an atomic context, so they must not block.
 *      This routine uses RCU to synchronize with changes to the chain.
 *
 *      If the return value of the notifier can be and'ed
 *      with %NOTIFY_STOP_MASK then atomic_notifier_call_chain()
 *      will return immediately, with the return value of
 *      the notifier function which halted execution.
 *      Otherwise the return value is the return value
 *      of the last notifier function called.
 */

int __atomic_notifier_call_chain(struct atomic_notifier_head *nh,
                                 unsigned long val, void *v,
                                 int nr_to_call, int *nr_calls) 
{
        int ret;

        rcu_read_lock();
        ret = notifier_call_chain(&nh->head, val, v, nr_to_call, nr_calls);
        rcu_read_unlock();
        return ret;
}
EXPORT_SYMBOL_GPL(__atomic_notifier_call_chain);
NOKPROBE_SYMBOL(__atomic_notifier_call_chain);

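Under RCU protection, calls up to nr_to_call functions of the notify blocks registered on the notify chain list, and stores the number of calls made in the output argument nr_calls.

The val value and the v pointer are passed to the notifier_call function.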

contextidr_notifier_init() – ARM32

arch/arm/mm/context.c

static int __init contextidr_notifier_init(void)
{
        return thread_register_notifier(&contextidr_notifier_block);
}
arch_initcall(contextidr_notifier_init);

Registers a notify block so that on every context switch the pid value is written to the PROCID field of the CONTEXTIDR register.

thread_register_notifier() – ARM32

arch/arm/include/asm/thread_notify.h

static inline int thread_register_notifier(struct notifier_block *n)
{
        extern struct atomic_notifier_head thread_notify_head;
        return atomic_notifier_chain_register(&thread_notify_head, n);
}

Registers a notify block on the thread notify chain.

The following kernel code registers and uses notify blocks.
- mm/context.c – contextidr_notifier_init() <- THREAD_NOTIFY_SWITCH command
- arch/arm/nwfpe/fpmodule.c – fpe_init() <- THREAD_NOTIFY_FLUSH command
- vfp/vfpmodule.c – vfp_init() <- THREAD_NOTIFY_SWITCH, FLUSH, EXIT, COPY commands
- It is also used by the xscale architecture and thumbee.

arch/arm/mm/context.c

static struct notifier_block contextidr_notifier_block = {
        .notifier_call = contextidr_notifier,
};

contextidr_notifier() – ARM32

arch/arm/mm/context.c

#ifdef CONFIG_PID_IN_CONTEXTIDR
static int contextidr_notifier(struct notifier_block *unused, unsigned long cmd,
                               void *t)
{
        u32 contextidr;
        pid_t pid;
        struct thread_info *thread = t;

        if (cmd != THREAD_NOTIFY_SWITCH)
                return NOTIFY_DONE;

        pid = task_pid_nr(thread->task) << ASID_BITS;
        asm volatile(
        "       mrc     p15, 0, %0, c13, c0, 1\n"
        "       and     %0, %0, %2\n"
        "       orr     %0, %0, %1\n"
        "       mcr     p15, 0, %0, c13, c0, 1\n"
        : "=r" (contextidr), "+r" (pid)
        : "I" (~ASID_MASK)); 
        isb();

        return NOTIFY_OK;
}
#endif

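Used when hardware trace tools are used with the kernel; writes the pid value to the PROCID field of the CONTEXTIDR register.

Code lines 9~10: if the command is not THREAD_NOTIFY_SWITCH, this routine does not apply, so NOTIFY_DONE is returned.
Code line 12 uses the t value received as the third argument to read thread_info->task->pid and shifts it left by ASID_BITS.
Code lines 13~19 read the CONTEXTIDR register, keep only the ASID field (bits[7:0]), OR in the shifted pid value and write the result back.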
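
The bit manipulation done by the inline assembly can be sketched in plain C. This is only an illustration: contextidr_with_pid() and ASID_FIELD_MASK are hypothetical names, and the sketch assumes an 8-bit ASID field in bits[7:0] with the pid placed in the bits above it.

```c
#include <stdint.h>

#define ASID_BITS 8                                   /* assumed width of the ASID field */
#define ASID_FIELD_MASK ((1u << ASID_BITS) - 1)       /* bits[7:0] */

/* Hypothetical helper: computes the new CONTEXTIDR value for a given pid. */
static uint32_t contextidr_with_pid(uint32_t contextidr, uint32_t pid)
{
    uint32_t procid = pid << ASID_BITS;   /* pid goes into the bits above the ASID */

    contextidr &= ASID_FIELD_MASK;        /* "and": keep only the ASID field */
    contextidr |= procid;                 /* "orr": merge in the shifted pid */
    return contextidr;
}
```

For example, an old register value of 0x12345642 combined with pid 7 yields 0x00000742: the ASID 0x42 survives and the upper bits now carry the pid.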

arch/arm/include/asm/thread_notify.h

/*      
 * These are the reason codes for the thread notifier.
 */
#define THREAD_NOTIFY_FLUSH     0
#define THREAD_NOTIFY_EXIT      1
#define THREAD_NOTIFY_SWITCH    2
#define THREAD_NOTIFY_COPY      3

task_pid_nr()

include/linux/sched.h

static inline pid_t task_pid_nr(struct task_struct *tsk)
{
        return tsk->pid;
}

Returns the pid value of the task.

Structures

Task (process or cpu) context switch related

thread_info structure – ARM64

arch/arm64/include/asm/thread_info.h

/*
 * low level task data that entry.S needs immediate access to.
 */

struct thread_info {
        unsigned long           flags;          /* low level flags */
        mm_segment_t            addr_limit;     /* address limit */
#ifdef CONFIG_ARM64_SW_TTBR0_PAN
        u64                     ttbr0;          /* saved TTBR0_EL1 */
#endif
        union {
                u64             preempt_count;  /* 0 => preemptible, <0 => bug */
                struct {
#ifdef CONFIG_CPU_BIG_ENDIAN
                        u32     need_resched;
                        u32     count;
#else
                        u32     count;
                        u32     need_resched;
#endif
                } preempt;
        };
};

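flags
- flags
addr_limit
- address limit
ttbr0
- Used to back up the ttbr0 value for the software-emulated PAN feature.
preempt_count
- The preemption counter; preemption is possible only when this value is 0.
- Used as a union with the preempt value below.
preempt.count
- One 32-bit half of the 64-bit value; same as the traditional preempt_count.
preempt.need_resched
- The other 32-bit half, serving the same purpose as TIF_NEED_RESCHED. Note that the sense is inverted: 0 means a reschedule is requested, and the initial value is 1.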
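
The point of the inverted need_resched flag is that a single 64-bit comparison answers two questions at once. The following is a minimal sketch, not the kernel's code: preempt_sketch and can_preempt_now() are hypothetical names, and the little-endian field order from the struct above is assumed.

```c
#include <stdint.h>
#include <stdbool.h>

/* Simplified stand-in for the union inside thread_info (little-endian order). */
union preempt_sketch {
    uint64_t preempt_count;          /* both halves viewed as one value */
    struct {
        uint32_t count;              /* nesting depth of preempt-disable */
        uint32_t need_resched;       /* 0 = reschedule requested, 1 = not requested */
    } preempt;
};

/* True only when preemption is enabled AND a reschedule has been requested. */
static bool can_preempt_now(const union preempt_sketch *p)
{
    return p->preempt_count == 0;    /* one compare covers both fields */
}
```

Because a pending request is encoded as 0, the whole 64-bit value is 0 exactly when count is 0 and a reschedule is wanted.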

thread_info structure – ARM32

arch/arm/include/asm/thread_info.h

/*
 * low level task data that entry.S needs immediate access to.
 * __switch_to() assumes cpu_context follows immediately after cpu_domain.
 */

struct thread_info {
        unsigned long           flags;          /* low level flags */
        int                     preempt_count;  /* 0 => preemptable, <0 => bug */
        mm_segment_t            addr_limit;     /* address limit */
        struct task_struct      *task;          /* main task structure */
        __u32                   cpu;            /* cpu */
        __u32                   cpu_domain;     /* cpu domain */
#ifdef CONFIG_STACKPROTECTOR_PER_TASK
        unsigned long           stack_canary;
#endif
        struct cpu_context_save cpu_context;    /* cpu context */
        __u32                   syscall;        /* syscall number */
        __u8                    used_cp[16];    /* thread used copro */
        unsigned long           tp_value[2];    /* TLS registers */
#ifdef CONFIG_CRUNCH
        struct crunch_state     crunchstate;
#endif
        union fp_state          fpstate __attribute__((aligned(8)));
        union vfp_state         vfpstate;
#ifdef CONFIG_ARM_THUMBEE
        unsigned long           thumbee_state;  /* ThumbEE Handler Base register */
#endif
};

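flags
- flags
preempt_count
- The preemption counter; preemption is possible only when this value is 0.
addr_limit
- address limit
task
- the current task
exec_domain
- execution domain (not a member of the struct shown above)
cpu
- number of the cpu the task is running on
cpu_domain
- cpu domain
- See the DACR register
- 0=no access, 1=user, 3=manager
stack_canary
- Value stored to detect stack overflow
cpu_context
- Storage for the cpu registers used on a cpu context switch
- It holds architecture-specific information, so it differs per architecture.
syscall
- syscall number used on a system call (swi)
used_cp[]
- Used when checking coprocessor instructions
tp_value[]
- Stores the TLS addresses
fpstate
- State of the fp (floating-point unit) used when handling undefined instructions
- This value is used to enter the USR start address of the FP module.
vfpstate
- State of the vfp (vector floating-point unit) used when handling undefined instructions
- This value is used to enter the USR start address of the VFP module.

thread_struct structure – ARM64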

arch/arm64/include/asm/processor.h

struct thread_struct {
        struct cpu_context      cpu_context;    /* cpu context */

        /*
         * Whitelisted fields for hardened usercopy:
         * Maintainers must ensure manually that this contains no
         * implicit padding.
         */
        struct {
                unsigned long   tp_value;       /* TLS register */
                unsigned long   tp2_value;
                struct user_fpsimd_state fpsimd_state;
        } uw;

        unsigned int            fpsimd_cpu;
        void                    *sve_state;     /* SVE registers, if any */
        unsigned int            sve_vl;         /* SVE vector length */
        unsigned int            sve_vl_onexec;  /* SVE vl after next exec */
        unsigned long           fault_address;  /* fault info */
        unsigned long           fault_code;     /* ESR_EL1 value */
        struct debug_info       debug;          /* debugging */
#ifdef CONFIG_ARM64_PTR_AUTH
        struct ptrauth_keys_user        keys_user;
        struct ptrauth_keys_kernel      keys_kernel;
#endif
#ifdef CONFIG_ARM64_MTE
        u64                     sctlr_tcf0;
        u64                     gcr_user_incl;
#endif
};

thread_struct structure – ARM32

arch/arm/include/asm/processor.h

struct thread_struct {
                                                        /* fault info     */
        unsigned long           address;
        unsigned long           trap_no;
        unsigned long           error_code;
                                                        /* debugging      */
        struct debug_info       debug;
};

cpu_context structure – ARM64

arch/arm64/include/asm/processor.h

struct cpu_context {
        unsigned long x19;
        unsigned long x20;
        unsigned long x21;
        unsigned long x22;
        unsigned long x23;
        unsigned long x24;
        unsigned long x25;
        unsigned long x26;
        unsigned long x27;
        unsigned long x28;
        unsigned long fp;
        unsigned long sp;
        unsigned long pc;
};

Where a subset of the registers is saved on a cpu context switch

cpu_context_save structure – ARM32

arch/arm/include/asm/thread_info.h

struct cpu_context_save {
        __u32   r4;
        __u32   r5;
        __u32   r6;
        __u32   r7;
        __u32   r8;
        __u32   r9;
        __u32   sl;
        __u32   fp;
        __u32   sp;
        __u32   pc;
        __u32   extra[2];               /* Xscale 'acc' register, etc */
};

Where a subset of the registers is saved on a cpu context switch

Interrupt context switch related

pt_regs structure – ARM64

arch/arm64/include/asm/ptrace.h

/*
 * This struct defines the way the registers are stored on the stack during an
 * exception. Note that sizeof(struct pt_regs) has to be a multiple of 16 (for
 * stack alignment). struct user_pt_regs must form a prefix of struct pt_regs.
 */

struct pt_regs {
        union {
                struct user_pt_regs user_regs;
                struct {
                        u64 regs[31];
                        u64 sp;
                        u64 pc;
                        u64 pstate;
                };
        };
        u64 orig_x0;
#ifdef __AARCH64EB__
        u32 unused2;
        s32 syscallno;
#else
        s32 syscallno;
        u32 unused2;
#endif

        u64 orig_addr_limit;
        /* Only valid when ARM64_HAS_IRQ_PRIO_MASKING is enabled. */
        u64 pmr_save;
        u64 stackframe[2];

        /* Only valid for some EL1 exceptions. */
        u64 lockdep_hardirqs;
        u64 exit_rcu;
};

Registers and other information saved on an interrupt context switch

svc_pt_regs structure – ARM32

arch/arm/include/asm/ptrace.h

struct svc_pt_regs {
        struct pt_regs regs;
        u32 dacr;
        u32 addr_limit;
};

Registers and other information saved on an interrupt context switch

pt_regs structure – ARM32

arch/arm/include/asm/ptrace.h

struct pt_regs {
        unsigned long uregs[18];
};

user_pt_regs structure – ARM64

arch/arm64/include/uapi/asm/ptrace.h

/*
 * User structures for general purpose, floating point and debug registers.
 */

struct user_pt_regs {
        __u64           regs[31];
        __u64           sp;
        __u64           pc;
        __u64           pstate;
};

참고

Scheduler -1- (Basic) | 문c
Scheduler -2- (Global Cpu Load) | 문c
Scheduler -3- (PELT) | 문c
Scheduler -4- (Group Scheduling) | 문c
Scheduler -5- (Scheduler Core) | 문c
Scheduler -6- (CFS Scheduler) | 문c
Scheduler -7- (Preemption & Context Switch) | 문c – 현재 글
Scheduler -8- (CFS Bandwidth) | 문c
Scheduler -9- (RT Scheduler) | 문c
Scheduler -10- (Deadline Scheduler) | 문c
Scheduler -11- (Stop Scheduler) | 문c
Scheduler -12- (Idle Scheduler) | 문c
Scheduler -13- (Scheduling Domain 1) | 문c
Scheduler -14- (Scheduling Domain 2) | 문c
Scheduler -15- (Load Balance 1) | 문c
Scheduler -16- (Load Balance 2) | 문c
Scheduler -17- (Load Balance 3 NUMA) | 문c
Scheduler -18- (Load Balance 4 EAS) | 문c
Scheduler -19- (초기화) | 문c
PID 관리 | 문c
do_fork() | 문c
cpu_startup_entry() | 문c
런큐 로드 평균(cpu_load[]) – v4.0 | 문c
PELT(Per-Entity Load Tracking) – v4.0 | 문c

The Evolution of Real-Time Linux – download pdf
A realtime preemption overview | LWN.net
Optimizing preemption | LWN.net

Scheduler -1- (Basic)

2017-04-24 / 2021-01-11 문영일

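What is a task?

Process
- The usual definition is a program in execution.
- Once the OS kernel has booted and is ready, it is the smallest unit for which an application can be loaded from disk at user level, given resources (memory, files, paging, signals, stack, …) and run.
- It is requested from user level, and prepared and created by the OS kernel.
- As a special case, the OS kernel itself can be regarded as a single kernel process.
  - The kernel process is created at boot time, is not destroyed until just before shutdown, and is not a scheduling unit, so its existence is not obvious.
Thread
- The usual definition is the unit of execution separated out from a process.
  - For this reason it is also called a light-weight process (LWP).
  - See: 메인스트림 OS로 레벨업「리눅스 커널2.6」| ZDnet Korea
- In architecture documents, a thread means a hardware thread: a core or virtual core, the smallest unit of hardware that can execute instructions.
  - e.g.) A 4-core processor with 2 hyper-threads (virtual cores on x86) per core has 8 hardware threads in total.
- At the OS level, a thread means a smaller software thread created inside a process.
  - In kernel code, thread can mean either a hardware thread or a software thread depending on where it is used, so it must be distinguished from context.
- Since Linux 2.6, threads are created and managed entirely by the kernel, and this article is written on that basis.
  - NPTL (Native POSIX Thread Library), developed by the Red Hat team, was the implementation finally adopted.
  - Note that even so-called user threads are created and managed by the kernel. Before Linux kernel 2.6, threads were made by cloning a process or through a user library, and those were true user threads.
  - See: Native POSIX Thread Library | Wikipedia
- Threads share the resources allocated to the process, except that the stack and TLS (Thread Local Storage) areas are not shared and are created separately for each thread.
- A user thread is created by a user process or by a parent thread.
- As a special case, kernel threads can only be created at the OS kernel level, not at user level.
Task
- A task is the smallest unit by which Linux can schedule a process or thread.
- To adjust the scheduling share of a user task, the nice value is used to compute each task's time slice.
  - In the internal data structures, task_struct represents the task and embeds schedule entity structures, which are used to compute the time slice.
    - A task has three schedule entities, one per scheduler: sched_entity, sched_rt_entity and sched_dl_entity.
  - With group scheduling using cgroup, the schedule entity is attached to a task group and the time slice is computed through it.
- Unlike Unix, Linux by default uses the same task representation for processes and threads, and both get the same scheduling share.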

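Task states

Each task has flags that describe its state. Two kinds of flags are used, as follows.

task->state
- TASK_RUNNING
  - Covers the following two cases.
    - The task is running on a cpu. (running on cpu)
    - The task is on the runqueue, ready to run. (ready to run)
- TASK_INTERRUPTIBLE
  - The task is sleeping and can accept signals.
- TASK_UNINTERRUPTIBLE
  - The task is sleeping and is not disturbed by signals.
    - When combined with the TASK_WAKEKILL flag, it can still receive the fatal signal SIGKILL.
  - Used to wait in the sleep state for something important without being disturbed by signals.
    - e.g.) Waiting for block device data to finish being transferred into a buffer.
- TASK_STOPPED
  - The task has been stopped after receiving one of the following signals.
    - A SIGSTOP signal sent by the user. The task can be resumed by a SIGCONT signal sent by the user.
    - A SIGTSTP (^Z) signal in the foreground.
    - A SIGTTIN signal in the background.
    - A SIGTTOU signal in the background.
- TASK_TRACED
  - The task has been stopped at the request of a debugger.
- TASK_KILLABLE
  - The combination of the two states TASK_WAKEKILL and TASK_UNINTERRUPTIBLE.
  - An uninterruptible state in which only the fatal signal SIGKILL can be received.
- TASK_WAKEKILL
  - A flag that allows the fatal signal SIGKILL to be received; it is not used on its own.
- TASK_PARKED
  - A special management state that lets per-cpu kernel threads created with kthread_create() be switched into a sleep state.
  - The thread sleeps with kthread_park() and wakes up with kthread_unpark().
task->exit_state
- EXIT_ZOMBIE
  - The first of the two states used when a task exits.
  - The task stays in this state until just before the parent task collects the child's information through the wait*() family of functions.
- EXIT_DEAD
  - The last of the two states used when a task exits.
  - The parent task has finished collecting the child's information through the wait*() family of functions, so the child task is being destroyed.

The following figure is the task state transition diagram.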

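Multitasking (time slice, virtual runtime)

When two tasks run at the same time on one CPU, in theory both should run in parallel, each with 50% of the CPU's power. Real CPU hardware can only execute one task at a time, so time is divided (time slice) and the tasks take turns running for their share of virtual time (virtual runtime).

Priorities

Priority

The priority classification used inside the kernel has 140 levels, 0~139.
- The smaller the priority value, the higher the priority and the larger the time slice.
The 100 levels 0~99 are used by the RT scheduler.
The 40 levels 100~139 are used by the CFS scheduler.
- This is the range that maps to nice values.
Deadline tasks, unlike RT and CFS tasks, have no priority and are always handled ahead of RT and CFS tasks.

Nice

Specified by the POSIX standard. There are 40 levels, -20 ~ +19, that can be assigned to ordinary tasks.
- The smallest nice value, -20, has the highest priority and receives the largest time slice.
- When converted to a priority value, it maps to priorities 100 ~ 139.
  - These are larger than the priority values 0 ~ 99 used by the RT scheduler, so such tasks always have lower priority than RT tasks.
- When a process or thread is created by a non-root user, the nice value of the task, the scheduling unit, starts at 0.
In 2003 the time slice for the +19 level was set equal to 1ms (1 jiffy), but it was later changed to 5ms because the time slice was too small on HZ=1000 systems (desktops).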
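
The mapping between the nice range and priorities 100 ~ 139 is a fixed offset. A minimal sketch follows; the helper names nice_to_prio() and prio_to_nice() are chosen for illustration, and the constants simply restate the ranges given above.

```c
#define MAX_RT_PRIO   100                              /* priorities 0..99 belong to RT */
#define NICE_WIDTH    40                               /* nice -20..+19 */
#define DEFAULT_PRIO  (MAX_RT_PRIO + NICE_WIDTH / 2)   /* 120 corresponds to nice 0 */

/* Illustrative helpers: convert between nice values and kernel priorities. */
static int nice_to_prio(int nice)
{
    return nice + DEFAULT_PRIO;
}

static int prio_to_nice(int prio)
{
    return prio - DEFAULT_PRIO;
}
```

With these definitions nice -20 maps to priority 100, nice 0 to 120 and nice +19 to 139.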

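Schedulers

The kernel provides five schedulers by default, and a user task selects a scheduler by choosing a scheduling policy. The stop scheduler and the idle-task scheduler are used only by the kernel itself, so users cannot select them. The schedulers are served in priority order as listed below. However, the RT scheduler yields 5% (default) of the time to the CFS scheduler so that CFS tasks are not starved by RT tasks.

Stop scheduler
Deadline scheduler
RT scheduler
CFS scheduler
Idle Task scheduler

Scheduler competition

When tasks are registered with the individual schedulers, the Stop scheduler is handled first, and the remaining schedulers follow in the priority order listed above.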

Scheduling policies

SCHED_DEADLINE
- Runs the task in the deadline scheduler.
SCHED_RR
- Runs tasks of the same priority in the RT scheduler using round-robin with a default unit of 0.1 seconds.
SCHED_FIFO
- Runs the task in the RT scheduler in FIFO (first-in first-out) fashion.
SCHED_NORMAL
- Runs the task in the CFS scheduler.
SCHED_BATCH
- Runs the task in the CFS scheduler. Unlike SCHED_NORMAL, it avoids yielding so that the current task can get as much work done as possible.
SCHED_IDLE
- Runs the task in the CFS scheduler at the lowest priority. (It is slower than nice 19. There is no nice=20, but it can be thought of that way.)
  - Note: using this policy does not place the task in the idle scheduler.

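CFS scheduler

Tasks with a nice (-20 ~ +19) priority are assigned a weight according to their nice value, receive a timeslice in proportion to their weight, and run in turn (context switching).
It has been in use since linux v2.6.23.
Each task is given a virtual runtime, which is managed in an RB tree.
- Sorted with p->se.vruntime as the key.
- The active and expired arrays used by the scheduler before CFS are no longer used by the cfs scheduler.
Used by three scheduling policies.
- SCHED_NORMAL
- SCHED_BATCH
- SCHED_IDLE
The O(1) scheduler used immediately before cfs depended on jiffies and the hz constant and worked in ms units, whereas the cfs scheduler was designed around ns units.
It no longer uses statistical methods; instead it manages tasks by the deadline of their allotted runtime.
The CFS bandwidth feature can be used.
e.g.) ksoftirqd and kworker tasks that run as workers of high-priority workqueues run in the CFS scheduler at nice -20 (the highest nice priority).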

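Real-time (RT) scheduler

Each RT task is assigned a priority (0~99, 0 is served first). When RT tasks compete, the RT task with the highest priority is handled first, and only after it completes does the RT task with the next priority run.
- When priorities 51 and 52 compete, there is no task context switch to the priority 52 task while the priority 51 task is running. In other words, a task does not yield to RT tasks of lower priority than itself. However, so that the cfs scheduler can run, 5% of the time (1 second – sched_rt_runtime_us) is yielded to cfs tasks by default.
sched_rt_period_us
- Determines the period (initial value: 1000000 = 1 second)
- If this value is set very small, the system cannot keep up and becomes unresponsive.
  - e.g. when it is smaller than the hrtimer
sched_rt_runtime_us
- Determines the run time (initial value: 950000 = 0.95 seconds)
- Taking the whole scheduling period as 100%, the RT scheduler can use 95% of it.
For RT tasks of equal priority, one of two scheduling policies determines whether task context switching takes place.
- SCHED_FIFO
  - There is no notion of the weight value or time slice used by the cfs scheduler. The RT task that started first keeps running until it sleeps voluntarily or exits.
- SCHED_RR
  - When tasks of the same priority exist, they run in round-robin fashion (task context switching) in units of 0.1 seconds (default).
Managed as an array of 100 priority lists, numbered 0 through 99.
The RT bandwidth feature can be used.
e.g.) The migration, threaded irq and watchdog tasks run in the rt scheduler.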

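Deadline scheduler

Of the schedulers a user can select, this one has the highest priority. It was prepared for work that must run at regular intervals, such as audio and video frame playback and pwm drivers, but it is not yet widely used.

Used by one scheduling policy.
- SCHED_DEADLINE
Deadlines are managed in an RB tree.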

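Runqueue

One runqueue is used per CPU.
The schedulers operate on the runqueue.
- Under the runqueue there are a cfs runqueue, an rt runqueue and a deadline runqueue.
Tasks to be assigned to the CPU are enqueued on the runqueue in the form of schedule entities.
- Tasks other than the current task are enqueued.
- nr_running = the current task + the number of enqueued tasks.
When a task runs for the first time it is, where possible, enqueued on the runqueue where its parent task resides.
- Being placed on the same cpu gives better cache affinity and therefore better performance.

The following figure shows the main data structures that each scheduler in the runqueue uses to manage tasks.

The stop scheduler and the idle scheduler are used by the kernel for internal management and are not included in the figure below.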

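The kernel and floating point

For performance-critical routines that run repeatedly, the Linux kernel uses techniques such as the following to speed up real-number arithmetic.

Real values are converted to binary fixed-point integers so that integer (int) instructions can be used instead of floating-point (float) instructions.
Complex calculations are precomputed and stored in tables.
- For a precomputed table to be practical, the input range has to be limited to some extent.
- e.g.) A table that applies some formula (factor) when the input n is limited to 0~3
  - x[0] = 1024
  - x[1] = 820
  - x[2] = 655
A multiply (mult) followed by a shift is also used in place of slow integer (int) division.
inline functions are used to save the time spent passing arguments. In exchange, the code size grows in proportion to the number of call sites.

Floating-point instructions in the Linux kernel

Kernel code that uses floating point is handled as follows.

When the compiler builds such kernel code, it is told not to generate assembly code that uses the floating-point coprocessor.
- For floating-point processing the compiler uses the floating-point library that it ships as a library.
- This library is implemented only with general-purpose registers and instructions that operate on integers.
- In addition, when a user task executes floating-point code that the hardware cannot run, an instruction fault is raised, and the kernel includes a floating-point library to emulate it.
- As an exception, drivers for FPUs, GPUs and the like may use floating-point instructions directly.
Most early computer systems had no floating-point coprocessor unit, or were used with the socket left empty.
Even when a coprocessor was fitted, performance varied greatly from system to system. For simple multiplication and division, floating-point instructions always took more cycles than integer instructions and were slower, and this is still the case. For these reasons the Linux kernel, from its original core design, has required code that does not use the floating-point coprocessor.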

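Binary fixed-point integers

When a real number is turned into an integer, it is scaled by a factor chosen for the required precision, so that the digits below the decimal point are not lost.

e.g.) Decimal fixed-point integer
- In decimal, to handle the real number 1 with a precision of 1.000, multiply it by 1000 (10^3).
e.g.) Binary fixed-point integer
- In binary, to handle the real number 1 with a comparable precision, multiply it by 0b10000000000 (2^10).

Mult & Shift

This section shows how multiply (mult) and shift instructions can replace the slow integer (int) division instruction.

Formula
- X / Y  ->  (X * mult) >> shift
e.g.) Dividing X by Y (820)
- Keep the inverse of Y (820), that is 1/Y (1/820), as a binary fixed-point integer.
- Once the precision (shift=32) is chosen, the binary fixed-point value (mult) is determined as follows.
  - mult = (2^32) * 1/820 = (2^32) / 820 = 5,237,765
- The division can then be replaced by (X * mult(5,237,765)) >> shift(32).
This technique is used in many kernel routines.
- e.g.: Timer -10- (Timekeeping) | 문c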
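
The example above can be tried out with a few lines of C. This is a sketch under stated assumptions rather than kernel code: inverse_mult() and div_by_mult_shift() are hypothetical helpers, the divisor must be greater than 1, and because mult is rounded the result can be off by one for very large X.

```c
#include <stdint.h>

#define SHIFT 32   /* precision chosen in the example above */

/* Hypothetical helper: rounded fixed-point inverse 2^32 / y (requires y > 1). */
static uint32_t inverse_mult(uint32_t y)
{
    return (uint32_t)(((1ULL << SHIFT) + y / 2) / y);
}

/* Approximates x / y using the precomputed mult instead of a division. */
static uint32_t div_by_mult_shift(uint32_t x, uint32_t mult)
{
    return (uint32_t)(((uint64_t)x * mult) >> SHIFT);
}
```

inverse_mult(820) gives 5237765, matching the value worked out above, and div_by_mult_shift(8200, 5237765) gives 10.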

CFS Runtime(Time Slice) & Virtual Runtime

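Computing the runtime of a CFS task

Weight per nice level

To compute the time slice, each cfs task is given a weight according to its nice priority value.

The weight of a cfs task with nice 0 is defined as the real number 1.0.
The implementation does not use floating-point arithmetic for this 1.0; it uses a binary fixed-point integer instead.
- On 32-bit systems the shift value for the precision is 10, so the value is 1024 (2^10).
- On 64-bit systems the value above is multiplied by scale (1024) for more precision, giving 1,048,576 (2^20).

First, look at the weight table precomputed for each nice priority.

A task with the middle priority, nice 0, has a weight of 1024, and a task with the slowest value, nice 19, has a weight of 15.
Although it does not appear in the table below, tasks using the idle policy (SCHED_IDLE) use a weight of 3, which receives the smallest time slice. This is even lower than 15, the weight of the slowest nice level.
The table is designed so that a change of 1 in nice priority changes the weight by 25%; the reason is explained right below.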

kernel/sched/core.c – sched_prio_to_weight[]

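Weight change

Taking two running tasks as the baseline, the weight increases by 25% each time the nice value decreases by 1, so that the task receives about 10% more time slice. When three or more tasks are compared, the time slice increases by 7.6% rather than 10%.

The figure below shows how much the actual cpu share rises when the weight is increased by 25%.

With 2 tasks: 10% increase; with 3 tasks: 7.6% increase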
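
As a rough check of these figures, assume every task starts with the same weight $w$ and one task is raised to $1.25w$ (a simplified reading of the scenario above, not a statement about the exact table values):

```latex
\text{2 tasks: } \frac{1.25w}{1.25w + w} = \frac{1.25}{2.25} \approx 55.6\%,
\qquad \frac{w}{2.25w} \approx 44.4\%

\text{3 tasks: } \frac{1.25w}{1.25w + 2w} = \frac{1.25}{3.25} \approx 38.5\%,
\qquad \frac{w}{3.25w} \approx 30.8\%
```

With two tasks the shares move from 50/50 to about 55/45, which is the roughly 10% gap quoted above. With three tasks the boosted task ends up about 7.7 percentage points ahead of each of the others, in line with the 7.6% figure.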

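The following figure shows the weight and its ratio computed from the nice value of each of 5 tasks.

Inverse weight of CFS tasks

The following table holds inverse weight mult values precomputed from the nice values. It was built with 32-bit precision.

For performance, when the kernel needs an expression of the form x / weight = y, it avoids the division and computes y as x * wmult >> 32 using the values in the table below.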

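CFS runqueue

Scheduling latency and scaling policy

To make the tasks on a runqueue appear to run simultaneously in real time, they are cycled through over a fixed period. The variables involved in computing periods (ns), the length of one full cycle, are listed below. The details follow from the scheduling latency formula onward.

rq->nr_running
- Number of tasks running on the runqueue
- The current task + the number of tasks enqueued on the runqueue
factor
- The ratio given by the scaling policy. The policy determines the factor that is applied according to the number of cpus. (rpi2: factor=3)
- It affects the periods computation according to the number of cpus. It specifies whether the value grows in proportion to the number of cpus, grows more gradually using log2(), or is independent of the number of cpus.
- Three policies are available; they are described in a separate item below.
sched_latency_ns
- The minimum scheduling period (periods): default value (6000000) * factor (3) = 18000000 (ns)
- “/proc/sys/kernel/sched_latency_ns”
sched_min_granularity_ns
- The minimum scheduling period for a task: default value (750000) * factor (3) = 2250000 (ns)
- Frequent task context switching reduces efficiency, so a task is never given a time slice smaller than this value.
- “/proc/sys/kernel/sched_min_granularity_ns”
sched_wakeup_granularity_ns
- The wakeup period of a task: default value (1000000) * factor (3) = 3000000 (ns)
- “/proc/sys/kernel/sched_wakeup_granularity_ns”
sched_nr_latency
- The number of tasks for which sched_latency_ns can be used
- This value is derived automatically as sched_latency_ns / sched_min_granularity_ns.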

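Scheduling latency formula

The result is one of the following two, depending on how nr_running compares with sched_nr_latency.

sysctl_sched_latency(18ms) <- condition: nr_running <= sched_nr_latency(8)
sysctl_sched_min_granularity(2.25ms) * nr_running <- condition: nr_running > sched_nr_latency(8)

The following figure shows how the scheduling period is computed on an rpi2 (cpu:4) system when 5 tasks and when 10 tasks run on one cpu.

When the number of tasks is at or below the threshold, the computed scheduling latency is used as periods unchanged.
When the number of tasks exceeds the threshold, placing all tasks within the scheduling latency would no longer guarantee each task the minimum scheduling period (sched_min_granularity), so periods is computed as the minimum scheduling period (sched_min_granularity) * number of tasks.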
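
The two cases can be written as a small function. This is a minimal sketch: sched_period_ns() is a hypothetical name, and the latency and minimum granularity are passed in as parameters (already multiplied by the factor) rather than read from sysctl variables.

```c
#include <stdint.h>

/* Hypothetical helper: length of one scheduling period in ns. */
static uint64_t sched_period_ns(unsigned int nr_running,
                                uint64_t latency_ns,
                                uint64_t min_gran_ns)
{
    /* Number of tasks that still fit into the default latency. */
    uint64_t nr_latency = latency_ns / min_gran_ns;

    if (nr_running > nr_latency)
        return (uint64_t)nr_running * min_gran_ns;  /* stretch the period */

    return latency_ns;                              /* default latency is enough */
}
```

With the rpi2 values used above (18 ms and 2.25 ms), 5 tasks give a period of 18 ms and 10 tasks give 22.5 ms.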

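Scaling policy by number of cpus

This is the ratio given by the scaling policy; it changes the scheduling latency according to the number of cpus.

SCHED_TUNABLESCALING_NONE(0)
- Always uses 1, so the scale does not change.
SCHED_TUNABLESCALING_LOG(1)
- Varies with the number of cpus as 1 + ilog2(cpus). (default)
- rpi2: there are 4 cpus and this option is used, so factor=3 applies and a 3x scale is used.
SCHED_TUNABLESCALING_LINEAR(2)
- Scales linearly with the number of online cpus, using the cpu count itself as the factor, capped at 8.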
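
The NONE and LOG cases are easy to reproduce. The sketch below is illustrative only: the enum and function names are hypothetical, the LINEAR policy is left out, and any upper bound the kernel applies to the cpu count is ignored.

```c
/* Hypothetical names for two of the scaling policies described above. */
enum scaling_policy { SCALING_NONE = 0, SCALING_LOG = 1 };

/* Integer log2: position of the highest set bit (v must be >= 1). */
static unsigned int ilog2_u(unsigned int v)
{
    unsigned int r = 0;

    while (v >>= 1)
        r++;
    return r;
}

/* Factor by which the latency tunables are multiplied. */
static unsigned int scaling_factor(enum scaling_policy policy, unsigned int cpus)
{
    if (policy == SCALING_NONE)
        return 1;                    /* independent of the cpu count */
    return 1 + ilog2_u(cpus);        /* grows slowly as cpus are added */
}
```

For a 4-cpu system such as the rpi2 the LOG policy yields 1 + 2 = 3, the factor used throughout the examples above.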

Runtime (time slice) per task

Computes the runtime assigned to an individual task.

One scheduling latency period of the runqueue is divided among the tasks in proportion to their weights.
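
A minimal sketch of this proportional split, assuming the period and the sum of all task weights are already known (task_slice_ns() is a hypothetical helper, and plain integer division stands in for whatever arithmetic the kernel uses):

```c
#include <stdint.h>

/* Hypothetical helper: share of one period that a task with `weight` receives. */
static uint64_t task_slice_ns(uint64_t period_ns,
                              uint32_t weight,
                              uint32_t total_weight)
{
    return period_ns * weight / total_weight;
}
```

Two nice 0 tasks (weight 1024 each) sharing a 6 ms period each receive 3 ms; a single nice 0 task receives the whole 6 ms.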

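VRuntime per task

A task's vruntime (virtual runtime) is its accumulated execution time with the load weight applied inversely, so that it can be placed in the runqueue's RB tree. (The higher the priority, the smaller the amount added to vruntime for the same execution time, so the task is picked earlier from the RB tree.)

The vruntime increment of a nice 0 task equals the runtime it is given.
- e.g.) When one nice 0 task runs in a 6ms period, the runtime given to the task is 6ms and the vruntime is also 6ms.
- e.g.) When two nice 0 tasks run in a 6ms period, the runtime given to each task is 3ms, half of the 6ms period, and the vruntime is also 3ms.
When the current task is stopped and a task context switch takes place, the current task is moved back by its vruntime. After that, the next task to run is the one with the smallest vruntime in the RB tree of the CFS runqueue. In other words, vruntime is the criterion for choosing the next task.
The same vruntime slice is computed for every task, and when the current task has used up its runtime its vruntime is updated as follows.
- curr->vruntime += computed vruntime
- The task's vruntime is moved back by the computed vruntime.

The virtual runtime slice added at each accumulation is computed as follows.

vruntime slice = time slice * weight of nice 0 / weight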
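
The formula translates directly into code. A sketch, with vruntime_delta_ns() as a hypothetical helper, the nice 0 weight taken as 1024, and a plain division used where the kernel is described above as using the inverse-weight multiply and shift:

```c
#include <stdint.h>

#define NICE_0_WEIGHT 1024   /* weight of a nice 0 task */

/* Hypothetical helper: vruntime added for `delta_exec_ns` of real execution. */
static uint64_t vruntime_delta_ns(uint64_t delta_exec_ns, uint32_t weight)
{
    return delta_exec_ns * NICE_0_WEIGHT / weight;
}
```

A nice 0 task that runs for 3 ms advances its vruntime by 3 ms, a task with twice the weight advances by only 1.5 ms, and a task with half the weight advances by 6 ms.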

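The following figure shows the vruntime computed for 4 tasks with different nice values.

For Task B, which has nice 0, the runtime and the vruntime are the same.

The following figure shows the 4 tasks running continuously without idling, with each task's vruntime accumulating by its computed vruntime slice.

Note: the figure is meant to aid understanding of a situation in which every task gets an equal opportunity, assuming HRTICK (default=no) is active and no RT tasks request preemption. When HRTICK is not active, the behavior differs from the figure below: Task D, which has a small runtime, runs for longer than its own runtime on the regular scheduler tick, so Task A, which has a large runtime, does not get the same opportunity.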

cfs_rq->min_vruntime

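If a new task, or a task that wakes up from sleep, were enqueued on a cfs runqueue with a vruntime starting from 0, its vruntime would be the lowest, execution would concentrate on it, and the other tasks would get no cpu time. (starvation problem) To prevent this, the min vruntime value is used as the baseline when a task newly enters a cfs runqueue.

cfs_rq->min_vruntime is updated on scheduler ticks and whenever needed, to the lowest vruntime among the current task and the tasks waiting on the cfs runqueue.
On entry to a cfs runqueue the min_vruntime value is added, and on leaving the cfs runqueue the min_vruntime value is subtracted.
- When a task sleeps or is moved to another cpu's runqueue, it is dequeued from this cpu's runqueue, and only the delta obtained by subtracting this runqueue's min_vruntime is kept.
- When a task wakes up or is enqueued on this cpu's runqueue from another cpu's runqueue, the current runqueue's min_vruntime is added to the task's saved delta vruntime.
A few rules apply to vruntime on entry to a cfs runqueue.
- A new task enters the cfs runqueue
  - Uses the min_vruntime value + the vruntime slice
  - When the option to run the forked child first is used, the vruntime is swapped so that the child is handled ahead of the current task.
- An idle task wakes up and enters the cfs runqueue
  - Uses the min_vruntime value
- Entry to the cfs runqueue for various other reasons (moving within a task group, cpu migration, scheduler change)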
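
The subtract-on-dequeue and add-on-enqueue rule described above can be sketched as follows. The struct and function names are hypothetical simplifications; real runqueues and entities carry many more fields.

```c
#include <stdint.h>

/* Hypothetical, stripped-down runqueue and schedule entity. */
struct cfs_rq_sketch { uint64_t min_vruntime; };
struct entity_sketch { uint64_t vruntime; };

/* Leaving a runqueue: keep only the offset relative to its min_vruntime. */
static void dequeue_normalize(struct entity_sketch *se,
                              const struct cfs_rq_sketch *rq)
{
    se->vruntime -= rq->min_vruntime;
}

/* Entering a runqueue: rebase the saved offset on the new min_vruntime. */
static void enqueue_renormalize(struct entity_sketch *se,
                                const struct cfs_rq_sketch *rq)
{
    se->vruntime += rq->min_vruntime;
}
```

A task with vruntime 1200 leaving a runqueue whose min_vruntime is 1000 carries a delta of 200; entering a runqueue whose min_vruntime is 5000, it continues at 5200, keeping its relative position.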

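The following figure shows min_vruntime being updated to the vruntime of the curr task, which has the lowest vruntime.

It also shows that a task's vruntime is based on the min_vruntime of the runqueue concerned when the task is dequeued and enqueued.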

CFS Group Scheduling

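It was introduced in kernel v2.6.24 and uses the cpu subsystem of cgroup.
Hierarchical scheduling groups (TG: Task Group) can be set up to control the utilization of user processes.
- Schedule entities are used to manage the load weight of tasks; they are divided into schedule entities for tasks and schedule entities for task groups (tg->se[]).
With the CFS bandwidth feature of a scheduling group, even the utilization within the allotted time can be controlled.

The following figure compares scheduling with and without group scheduling.

On the left, as the cfs scheduler was designed, tasks with the same priority are given runtime fairly.
On the right, cgroup is applied so that users have the same priority, and runtime is given fairly to each user.
On the right, if a root group is created with two scheduling groups under it, by default the two child groups each take 50% of the utilization.
When task groups are used, each task group has as many cfs runqueues as there are cpus, and each task or task group belonging to the group is treated as a schedule entity.
- On the right, assuming that all 5 task schedule entities have a weight of 1024, the root group decides the distribution between the 2 group schedule entities, and each group decides the distribution among the task schedule entities that belong to it.
- On the right, if 2 more tasks run in the root group, the root group distributes among 4 schedule entities in total. (2 task group schedule entities + 2 task schedule entities)
  - Each task in group-1 is allocated 6.25%.
  - The task in group-2 is allocated 25%.
  - Although not shown in the figure, task 6 and task 7, newly added to the root group, are each allocated 25%.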
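
These percentages follow from dividing each level's share equally among its entities. The split of the 5 tasks into 4 tasks in group-1 and 1 task in group-2 is inferred from the quoted numbers rather than stated in the text:

```latex
\text{root level: } \frac{100\%}{4 \text{ entities}} = 25\% \text{ each (group-1, group-2, task 6, task 7)}

\text{group-1: } \frac{25\%}{4 \text{ tasks}} = 6.25\% \text{ per task}
\qquad
\text{group-2: } \frac{25\%}{1 \text{ task}} = 25\%

4 \times 6.25\% + 25\% + 2 \times 25\% = 100\%
```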

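The following figure shows how the time slice ratios change when group scheduling is applied.

When two task groups are created under the root group and given share values corresponding to 20% and 80%, the utilization of the tasks running under those groups can be seen as follows.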

CPU Bandwidth for CFS

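A task group is made to use only a certain proportion of the time slice allocated to it rather than all of it. The interval in which it does not occupy the cpu is called cfs throttling, and during it the time slice is yielded to other task groups. If there are no tasks to yield to, the cpu idles.

The following figure shows a setup in which the cfs tasks belonging to one task group are throttled for half of the time. (The example uses a 1-core system.)

CFS Scheduling Domain

When a scheduling group (sched_group) comes under heavy cpu load, load balancing is carried out through the scheduling domains.

The hierarchical structure is used to find a core that suits the cpu affinity.

For scheduling domains, the cpu topology is updated with reference to the device tree and the MPIDR register whenever a cpu goes on or off, and the scheduling domains are built from it.

device tree + MPIDR register -> cpu on/off -> cpu topology change -> scheduling domains

The cpu topology levels are as follows.

cluster
core
thread

The scheduling domain levels are as follows.

DIE domain
- cpu die (package)
MC domain
- multi-core
SMT domain
- multi-thread (H/W)

The following figure shows scheduling domains and scheduling groups together. (See the arm documentation.)

Note that the scheduling group (sched_group) in the figure below and group scheduling (task_group) are different features.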

참고

Scheduler -1- (Basic) | 문c
Scheduler -2- (Global Cpu Load) | 문c
Scheduler -3- (PELT) | 문c
Scheduler -4- (Group Scheduling) | 문c
Scheduler -5- (Scheduler Core) | 문c
Scheduler -6- (CFS Scheduler) | 문c
Scheduler -7- (Preemption & Context Switch) | 문c
Scheduler -8- (CFS Bandwidth) | 문c
Scheduler -9- (RT Scheduler) | 문c
Scheduler -10- (Deadline Scheduler) | 문c
Scheduler -11- (Stop Scheduler) | 문c
Scheduler -12- (Idle Scheduler) | 문c
Scheduler -13- (Scheduling Domain 1) | 문c
Scheduler -14- (Scheduling Domain 2) | 문c
Scheduler -15- (Load Balance 1) | 문c
Scheduler -16- (Load Balance 2) | 문c
Scheduler -17- (Load Balance 3 NUMA) | 문c
Scheduler -18- (Load Balance 4 EAS) | 문c
Scheduler -19- (초기화) | 문c
PID 관리 | 문c
do_fork() | 문c
cpu_startup_entry() | 문c
런큐 로드 평균(cpu_load[]) – v4.0 | 문c
PELT(Per-Entity Load Tracking) – v4.0 | 문c

CPU bandwidth control for CFS | Google Paul Turner, IBM Bharata B Rao, Google Nikhil Rao – 다운로드 pdf
CFS bandwidth control | LWN.net – 한글 번역 | KERNELMSG
CFS Bandwidth Control | kernel.org
cfs scheduling 1 & cfs group scheduling (1) | 솔개가하늘을가르네
Linux 3.2 – CFS CPU bandwidth (english version) | Christophe Blaess
integrating the scheduler and cpufreq | Linaro connect – 다운로드 pdf
CFS 소스 코드 분석 | 어린아이

linux kernel 강좌 1 – CFS ( Completely Fair Scheduler ) 스케쥴러 | 토리
리눅스 커널 내부구조 스터디[ chapter 3] | 이즈리얼
[CFS] cfs 정리 | NZCV
[Linux] Completely Fair Scheduler | F/OSS
cfs 스케줄러 | 솔개가하늘을가르네
Load balance in Linux source level bottom-up analysis | Mad for Simplicity
CPU bandwidth control for CFS | P. Turner, B. B. Rao, N. Rao – 다운로드 pdf
Evaluation of CPU cgroup | Fujitsu – 다운로드 pdf
Task Management | 백승재 – 다운로드 pdf

Scheduler -3- (PELT-1)

2017-04-20 / 2023-05-08 문영일

The Linux kernel applies different schedulers and load tracking mechanisms depending on the system environment.

SMP (Symmetric Multi-Processing) systems
- Use the built-in kernel scheduler.
- Use PELT (Per-Entity Load Tracking).
AMP (Asymmetric Multi-Processing) or HMP (Heterogeneous Multi-Processing) systems
- Chipset vendors in the Android camp, who build mobile devices, have experimented with several models; the combination that currently performs best is as follows.
- IKS (In-Kernel Switcher), GTS (Global Task Scheduling), and EAS (Energy Aware Scheduler) have all been used as the scheduler, but recently only EAS is used.
- EAS was merged into the mainline kernel in v5.0 and uses the CONFIG_ENERGY_MODEL kernel option.
- References:
  - PM: Introduce an Energy Model management framework (2018, v5.0-rc1)
  - EAS Development for Mainline Linux | ARM
  - Scheduling for Android devices (2016) | LWN.net

WALT (Window Assisted Load Tracking)

WALT, introduced for Android mobile systems, can be used together with PELT or on its own.
Compared with PELT, a task's load ramps up about 4 times faster and ramps down about 8 times faster.
It is normally used together with EAS (Energy Aware Scheduler) + schedutil.
References:
- sched: Introduce Window Assisted Load Tracking (2016) | LWN.net
- WALT vs PELT | Redux

Big/Little SoC

The following H/W models are used for big/little migration.

Cluster migration model
- Only one of the clusters is used at a time; this model was used on older big/little SoCs.
CPU migration model
- IKS (In-Kernel Switcher) selects a big or little cpu for each cpu and runs it.
HMP (Heterogeneous Multi-Processing) model
- The most powerful model: all processors are used simultaneously. This is the approach used by recent SoCs.

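PELT (Per-Entity Load Tracking)

Uses
- Load balancing
- Group scheduling
- Frequency selection in the schedutil governor
- CPU selection for a task in EAS (Energy Aware Scheduler)
Since kernel v3.8, the following load averages are computed so that load can be tracked per entity.
- Load sum and average (load sum & average)
  - The value used to compare per-entity loads when load balancing decides whether to move a task to another cpu.
  - runnable% * load weight
- Runnable sum and average (runnable sum & average)
  - Used to track how long work waits.
  - The average time the entity spent runnable (curr + enqueued) on the cpu
  - The old runnable load sum & avg was removed in kernel v5.7-rc1
    - Reference: sched/pelt: Remove unused runnable load average (2020, v5.7-rc1)
  - A new runnable sum & avg was added in kernel v5.7-rc1.
    - Reference: sched/pelt: Add a new runnable average signal (2020, v5.7-rc1)
- Util sum and average (util load sum & average)
  - The average time the entity spent running on the cpu
  - running% * 1024
- Estimated util sum and average (util_est load sum & average)
  - When a big task sleeps for a fairly long time and then wakes up, its load has decayed for the length of the sleep and is very low, so when it is enqueued again its load is markedly small and it starts out with a low cpu frequency. To address this, the util load at the moment the task sleeps and is dequeued is remembered; when the task runs again, that value is compared with the recomputed util load and the larger of the two is used. This is the util_est load.
Updated on every schedule tick and whenever the task's entity changes (enqueue, dequeue, migration, and so on).
The load value decays once per period of 2^20 ns (1024 × 1024 ns, about 1 ms) and is reduced to exactly half at the 32nd period.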

The exponential smoothing coefficient the kernel uses for PELT is as follows.

Exponential model k = a * (b ^ n)
- a=1, b=0.5^(1/n), n=32, or equivalently a=1, b=2^(-1/n), n=32
k=0.5^(1/32), or equivalently k=2^(-1/32)
- k halves a value after 32 decays. (k^32=0.5)
- k = 0.97857206…
Using the smoothing coefficient k, the load value is computed as follows.
- cpu load = old cpu load * k + new cpu load * (1 – k)
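
The half-life property of k can be checked with a small fixed-point sketch. This is not the kernel's own decay routine; it simply multiplies by a Q32 approximation of k once per period. The names PELT_K_Q32, pelt_decay() and ewma_step() are hypothetical, introduced only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Q32 fixed-point approximation of k = 2^(-1/32) ~= 0.97857206 (assumed constant). */
#define PELT_K_Q32 0xfa83b2daULL

/* Decay val by k once per elapsed period; val must fit in 32 bits. */
static uint64_t pelt_decay(uint64_t val, unsigned int periods)
{
        assert(val <= UINT32_MAX);      /* keeps val * k within 64 bits */
        while (periods--)
                val = (val * PELT_K_Q32) >> 32;
        return val;
}

/* One smoothing step: old * k + sample * (1 - k), in Q32 fixed point. */
static uint64_t ewma_step(uint64_t old, uint64_t sample)
{
        assert(old <= UINT32_MAX && sample <= UINT32_MAX);
        return (old * PELT_K_Q32 + sample * ((1ULL << 32) - PELT_K_Q32)) >> 32;
}
```

Calling pelt_decay(1 << 20, 32) should land just below half of the input, the small shortfall coming from truncation at each step. ewma_step() leaves a value unchanged when the new sample equals the old value.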

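Scanning every entity each time the load sums and averages are needed would be very costly, so that approach is not used. Instead, the sums and averages are computed at the following times.

Schedule tick
- On each schedule tick, the load sum and average of the currently running entity are updated.
Entity changes
- On each entity event such as attach/detach/migration, the load sum and average of that entity are updated.

After an entity's load sum and average have been computed, they are accumulated and reported upward in the following order.

Step 1: entity accumulate
- Accumulate the entities' loads into their cfs runqueue
Step 2: propagation to the parent cfs runqueue
- Propagate from the lowest cfs runqueue up to the top-level cfs runqueue
- At each level this is done in two stages: the load of the group entity that represents the cfs runqueue is updated first, and then the change is propagated to the parent cfs runqueue.
Step 3: task group contribution
- Contribute the change in the per-cpu cfs runqueue's load to the task group's global load value (atomic add)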

Load average tracking

Tracking a task's load average requires its pid. Below, we find the pid of the bash task and then check that it is included in tasks of the cgroup user session.

# ps
  PID TTY          TIME CMD
  471 ttyAMA0  00:00:00 login
  760 ttyAMA0  00:00:00 bash
  791 ttyAMA0  00:00:00 ps

# cat /sys/fs/cgroup/cpu/user.slice/user-0.slice/session-1.scope/tasks
471
760
949

Next, enter the /proc/<pid> directory and check the task's scheduling statistics.

# cd /proc/760

# cat schedstat
503102081 13905504 293
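
The three numbers printed by schedstat are, in order, the time spent running on a cpu (ns), the time spent waiting on a runqueue (ns), and the number of timeslices run. A minimal sketch of parsing that line follows; the struct and function names are hypothetical.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* The three fields of /proc/<pid>/schedstat (struct name is hypothetical). */
struct schedstat {
        uint64_t run_ns;     /* time spent running on a cpu, in ns */
        uint64_t wait_ns;    /* time spent waiting on a runqueue, in ns */
        uint64_t timeslices; /* number of timeslices run */
};

/* Parse one schedstat line; returns 0 on success, -1 on malformed input. */
static int parse_schedstat(const char *line, struct schedstat *out)
{
        int n = sscanf(line, "%" SCNu64 " %" SCNu64 " %" SCNu64,
                       &out->run_ns, &out->wait_ns, &out->timeslices);
        return n == 3 ? 0 : -1;
}
```

For the line shown above, run_ns would be 503102081, roughly 503 ms of cpu time.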

# cat /proc/760/sched
full_load (760, #threads: 1)
-------------------------------------------------------------------
se.exec_start                                :       3989409.638984
se.vruntime                                  :         17502.410171
se.sum_exec_runtime                          :           783.958340
se.nr_migrations                             :                    1
nr_switches                                  :                 1016
nr_voluntary_switches                        :                  935
nr_involuntary_switches                      :                   81
se.load.weight                               :              1048576
se.runnable_weight                           :              1048576
se.avg.load_sum                              :                 7971
se.avg.runnable_load_sum                     :                 7971
se.avg.util_sum                              :              8163624
se.avg.load_avg                              :                  167
se.avg.runnable_load_avg                     :                  167
se.avg.util_avg                              :                  167
se.avg.last_update_time                      :        3989409638400
se.avg.util_est.ewma                         :                   53
se.avg.util_est.enqueued                     :                  167
policy                                       :                    0
prio                                         :                  120
clock-delta                                  :                  292
mm->numa_scan_seq                            :                    0
numa_pages_migrated                          :                    0
numa_preferred_nid                           :                   -1
total_numa_faults                            :                    0
current_node=0, numa_group_id=0
numa_faults node=0 task_private=0 task_shared=0 group_private=0 group_shared=0

The following command shows the scheduling statistics of all tasks at once. (The part before the decimal point is in milliseconds.)

# cat /proc/sched_debug 

(...omitted...)

runnable tasks:
 S           task   PID         tree-key  switches  prio     wait-time             sum-exec        sum-sleep
-----------------------------------------------------------------------------------------------------------
 S        systemd     1       128.556417      1435   120         0.000000      2532.278466         0.000000 0 0 /init.scope
 S       kthreadd     2      9007.043581       110   120         0.000000        38.206758         0.000000 0 0 /
 I         rcu_gp     3         7.075165         3   100         0.000000         0.017792         0.000000 0 0 /
 I     rcu_par_gp     4         8.587139         3   100         0.000000         0.023918         0.000000 0 0 /
 I   kworker/0:0H     6       372.048163         8   100         0.000000         6.427459         0.000000 0 0 /
 I   mm_percpu_wq     8        14.188365         3   100         0.000000         0.024208         0.000000 0 0 /
 S    ksoftirqd/0     9      9072.934585       312   120         0.000000        30.653868         0.000000 0 0 /
 S    migration/0    11         0.000000        73     0         0.000000        30.741958         0.000000 0 0 /
 S        cpuhp/0    12       337.602512        13   120         0.000000         1.789666         0.000000 0 0 /
 I    kworker/0:1    21      9068.975793       653   120         0.000000       173.940206         0.000000 0 0 /
 S        kauditd    23        64.560835         2   120         0.000000         0.137958         0.000000 0 0 /
 S     oom_reaper    24        70.621792         2   120         0.000000         0.060960         0.000000 0 0 /
 I      writeback    25        74.666295         2   100         0.000000         0.059208         0.000000 0 0 /
 S     kcompactd0    26        74.680295         2   120         0.000000         0.063125         0.000000 0 0 /
 I    kintegrityd    77       173.140036         2   100         0.000000         0.056583         0.000000 0 0 /
 I blkcg_punt_bio    79       179.192415         2   100         0.000000         0.067375         0.000000 0 0 /
 I     tpm_dev_wq    80       179.207188         3   100         0.000000         0.095500         0.000000 0 0 /
 I        ata_sff    81       179.208301         3   100         0.000000         0.158208         0.000000 0 0 /
 I     devfreq_wq    83       179.225743         3   100         0.000000         0.162542         0.000000 0 0 /
 I   kworker/u5:0    86       218.072428         3   100         0.000000         0.827083         0.000000 0 0 /
 I        xprtiod    87       218.102147         2   100         0.000000         0.223708         0.000000 0 0 /
 S        kswapd0   146       243.631141         3   120         0.000000         0.045208         0.000000 0 0 /
 I         nfsiod   147       231.757201         3   100         0.000000         0.177042         0.000000 0 0 /
 I       xfsalloc   148       237.636886         3   100         0.000000         0.097125         0.000000 0 0 /
 I   kworker/0:1H   155      9042.378580       127   100         0.000000        89.291426         0.000000 0 0 /
 I    kworker/0:3   187      5836.456951        50   120         0.000000         0.949960         0.000000 0 0 /
 S  systemd-udevd   207       676.417929       890   120         0.000000       762.019973         0.000000 0 0 /autogroup-13
 Ssystemd-network   240        10.659505        54   120         0.000000        89.746452         0.000000 0 0 /autogroup-21
 S          acpid   257         3.489799         8   120         0.000000         8.382792         0.000000 0 0 /autogroup-22
 S          gdbus   271       115.518683        38   120         0.000000         9.789205         0.000000 0 0 /autogroup-24
 S       rsyslogd   272        22.666138        29   120         0.000000        51.687122         0.000000 0 0 /autogroup-27
 S   rtkit-daemon   273        16.551777        19   121         0.000000        17.856413         0.000000 0 0 /autogroup-28
 S   rtkit-daemon   275        56.491989       225   120         0.000000        34.801951         0.000000 0 0 /autogroup-28
 S   rtkit-daemon   276         0.000000       111     0         0.000000        16.475375         0.000000 0 0 /autogroup-28
 S systemd-logind   279         3.639090       143   120         0.000000        73.653134         0.000000 0 0 /autogroup-32
 S NetworkManager   283       207.003587       235   120         0.000000       343.688449         0.000000 0 0 /autogroup-35
 S          gmain   328       222.841799       236   120         0.000000        58.477883         0.000000 0 0 /autogroup-35
 S        polkitd   301        34.343594        84   120         0.000000        34.283501         0.000000 0 0 /autogroup-37
 S          gmain   306        18.460635         5   120         0.000000         0.291958         0.000000 0 0 /autogroup-37
 S           bash   760      2480.890766       941   120         0.000000      2141.409558         0.000000 0 0 /user.slice/user-0.slice/session-1.scope
 S           ntpd   771       128.569616       776   120         0.000000       300.594067         0.000000 0 0 /autogroup-53
 I   kworker/u4:0   800      9013.026954       461   120         0.000000        24.093421         0.000000 0 0 /
 I   kworker/u4:1   847      9073.130002        56   120         0.000000         8.608543         0.000000 0 0 /
cpu#0
  .nr_running                    : 0
  .nr_switches                   : 33262
  .nr_load_updates               : 0
  .nr_uninterruptible            : 3
  .next_balance                  : 4295.902295
  .curr->pid                     : 0
  .clock                         : 4040095.934379
  .clock_task                    : 4021572.724115
  .avg_idle                      : 8881345
  .max_idle_balance_cost         : 5154074

cfs_rq[0]:/user.slice/user-0.slice/session-1.scope
  .exec_clock                    : 0.000000
  .MIN_vruntime                  : 0.000001
  .min_vruntime                  : 17534.518299
  .max_vruntime                  : 0.000001
  .spread                        : 0.000000
  .spread0                       : -15804.041501
  .nr_spread_over                : 0
  .nr_running                    : 0
  .load                          : 0
  .runnable_weight               : 0
  .load_sum                      : 2658005
  .load_avg                      : 56
  .runnable_load_sum             : 415
  .runnable_load_avg             : 0
  .util_sum                      : 2647676
  .util_avg                      : 55
  .util_est_enqueued             : 0
  .removed.load_avg              : 0
  .removed.util_avg              : 0
  .removed.runnable_sum          : 0
  .tg_load_avg_contrib           : 56
  .tg_load_avg                   : 1080
  .se->exec_start                : 4021516.525783
  .se->vruntime                  : 20945.789828
  .se->sum_exec_runtime          : 17480.214425
  .se->load.weight               : 156022
  .se->runnable_weight           : 2
  .se->avg.load_sum              : 2592
  .se->avg.load_avg              : 8
  .se->avg.util_sum              : 2647676
  .se->avg.util_avg              : 55
  .se->avg.runnable_load_sum     : 2592
  .se->avg.runnable_load_avg     : 0

(...omitted...)

Schedule tick

scheduler_tick()

kernel/sched/core.c

/*
 * This function gets called by the timer code, with HZ frequency.
 * We call it with interrupts disabled.
 */

void scheduler_tick(void)
{
        int cpu = smp_processor_id();
        struct rq *rq = cpu_rq(cpu);
        struct task_struct *curr = rq->curr;
        struct rq_flags rf;

        sched_clock_tick();

        rq_lock(rq, &rf);

        update_rq_clock(rq);
        curr->sched_class->task_tick(rq, curr, 0);
(...omitted...)
}

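On every schedule tick, runtime accounting and load average calculation are performed, and when the load balance interval has arrived, the SCHED softirq is raised.

In code lines 3~5, get the runqueue of the current cpu and the current task.
In code line 8, if an unstable clock is in use, as on x86 and ia64 architectures, update the per-cpu sched_clock_data to the current time.
In code line 10, take the runqueue lock in order to modify runqueue data.
In code line 12, update the runqueue clocks.
- Updates rq->clock, rq->clock_task, rq->clock_pelt, and so on.
In code line 13, call the (*task_tick) callback registered in the current task's scheduling class.
- task_tick_fair(), task_tick_rt(), task_tick_dl()

Tick handling per scheduling class

Real-Time scheduling class
- task_tick_rt()
Deadline scheduling class
- task_tick_dl()
Completely Fair scheduling class
- task_tick_fair()
Idle-Task scheduling class
- task_tick_idle() – empty function
Stop-Task scheduling class
- task_tick_stop() – empty function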

task_tick_fair()

kernel/sched/fair.c

/*
 * scheduler tick hitting a task of our scheduling class.
 *
 * NOTE: This function can be called remotely by the tick offload that
 * goes along full dynticks. Therefore no local assumption can be made
 * and everything must be accessed through the @rq and @curr passed in
 * parameters.
 */

static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
        struct cfs_rq *cfs_rq;
        struct sched_entity *se = &curr->se;

        for_each_sched_entity(se) {
                cfs_rq = cfs_rq_of(se);
                entity_tick(cfs_rq, se, queued);
        }
(...omitted...)
}

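Performs schedule tick processing for the current task.

In code lines 6~9, perform the tick-time calculations, such as load averages, for each entity up to the top-level group entity.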

static void
entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
        /*
         * Update run-time statistics of the 'current'.
         */
        update_curr(cfs_rq);

        /*
         * Ensure that runnable average is periodically updated.
         */
        update_load_avg(cfs_rq, curr, UPDATE_TG);
        update_cfs_group(curr);
(...omitted...)
}

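Updates the load average and related values of the current entity on a schedule tick.

In code line 7, handle runtime accounting.
In code line 12, update the load average of the given entity.
In code line 13, recompute the group entity's weight (shares) from the current state of its cfs runqueue.

Runqueue clocks

A runqueue uses the following three clocks.

rq->clock
- On each update, accumulates elapsed time exactly as measured by the absolute schedule clock.
rq->clock_task
- Accumulates only the runtime the current cpu actually spent on tasks, excluding time spent on irq handling and on vCPU processing.
- Runs slightly slower than rq->clock.
rq->clock_pelt
- Runtime accumulated by applying the cpu capacity to the time accumulated in clock_task.
  - On a big.LITTLE system where big cpus use 1024 (1.0), little cpus reflect a smaller capacity.
  - The performance gap due to the cpu frequency mode is also reflected.
- Whenever the runqueue is idle, it is synchronized to rq->clock_task.

The three runqueue clocks are updated in the following order.

update_rq_clock() -> update_rq_clock_task() -> update_rq_clock_pelt()

The following figure shows the differences among the three runqueue clocks.

rq->clock_task reflects the irq time + task time of the previous interval.

Reading the runqueue clocks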

rq_clock()

kernel/sched/sched.h

static inline u64 rq_clock(struct rq *rq)
{
        lockdep_assert_held(&rq->lock);
        assert_clock_updated(rq);

        return rq->clock;
}

Returns the runqueue's clock.

rq_clock_task()

kernel/sched/sched.h

static inline u64 rq_clock_task(struct rq *rq) 
{
        lockdep_assert_held(&rq->lock);
        return rq->clock_task;
}

Returns the runqueue's clock_task.

rq_clock_pelt()

kernel/sched/pelt.h

static inline u64 rq_clock_pelt(struct rq *rq)
{
        lockdep_assert_held(&rq->lock);
        assert_clock_updated(rq);

        return rq->clock_pelt - rq->lost_idle_time;
}

Returns the runqueue's clock_pelt minus lost_idle_time.

cfs_rq_clock_pelt()

kernel/sched/pelt.h

/* rq->task_clock normalized against any time this cfs_rq has spent throttled */
static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
{
        if (unlikely(cfs_rq->throttle_count))
                return cfs_rq->throttled_clock_task - cfs_rq->throttled_clock_task_time;

        return rq_clock_pelt(rq_of(cfs_rq)) - cfs_rq->throttled_clock_task_time;
}

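Returns the runqueue's clock_pelt minus the time the cfs runqueue has spent throttled. This clock is used when computing load averages.

The following variables relate to throttle time when cfs bandwidth is configured.
- When throttling starts, cfs_rq->throttled_clock_task is synchronized to rq->clock_task.
- When throttling ends, the time spent throttled is added to cfs_rq->throttled_clock_task_time.

Updating the runqueue clocks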

update_rq_clock()

kernel/sched/core.c

void update_rq_clock(struct rq *rq)
{
        s64 delta;

        lockdep_assert_held(&rq->lock);

        if (rq->clock_update_flags & RQCF_ACT_SKIP)
                return;

#ifdef CONFIG_SCHED_DEBUG
        if (sched_feat(WARN_DOUBLE_CLOCK))
                SCHED_WARN_ON(rq->clock_update_flags & RQCF_UPDATED);
        rq->clock_update_flags |= RQCF_UPDATED;
#endif

        delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
        if (delta < 0)
                return;
        rq->clock += delta;
        update_rq_clock_task(rq, delta);
}

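Updates the runqueue clock, which runs in nanoseconds. (Called on every schedule tick and whenever a task (or entity) changes.)

In code lines 7~8, a skip request has been accepted. In this case the function returns without updating the runqueue clock.
In code lines 16~18, get delta, the difference between the schedule clock and the time the runqueue clock was last updated. If delta is negative, return without updating the clock.
In code line 19, add delta to the runqueue clock.
In code line 20, pass @delta on to be reflected in rq->clock_task. (Some of the time is excluded.)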

update_rq_clock_task()

kernel/sched/core.c

static void update_rq_clock_task(struct rq *rq, s64 delta)
{
/*
 * In theory, the compile should just see 0 here, and optimize out the call
 * to sched_rt_avg_update. But I don't trust it...
 */
        s64 __maybe_unused steal = 0, irq_delta = 0;

#ifdef CONFIG_IRQ_TIME_ACCOUNTING
        irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;

        /*
         * Since irq_time is only updated on {soft,}irq_exit, we might run into
         * this case when a previous update_rq_clock() happened inside a
         * {soft,}irq region.
         *
         * When this happens, we stop ->clock_task and only update the
         * prev_irq_time stamp to account for the part that fit, so that a next
         * update will consume the rest. This ensures ->clock_task is
         * monotonic.
         *
         * It does however cause some slight miss-attribution of {soft,}irq
         * time, a more accurate solution would be to update the irq_time using
         * the current rq->clock timestamp, except that would require using
         * atomic ops.
         */
        if (irq_delta > delta)
                irq_delta = delta;

        rq->prev_irq_time += irq_delta;
        delta -= irq_delta;
#endif
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
        if (static_key_false((&paravirt_steal_rq_enabled))) {
                steal = paravirt_steal_clock(cpu_of(rq));
                steal -= rq->prev_steal_time_rq;

                if (unlikely(steal > delta))
                        steal = delta;

                rq->prev_steal_time_rq += steal;
                delta -= steal;
        }
#endif

        rq->clock_task += delta;

#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
        if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
                update_irq_load_avg(rq, irq_delta + steal);
#endif
        update_rq_clock_pelt(rq, delta);
}

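Updates rq->clock_task, which runs in nanoseconds. Depending on which kernel options are enabled, it may be identical to rq->clock, or it may accumulate only the task time actually run on the current cpu, excluding time spent on irq handling and on vCPUs.

In code lines 9~32, when the CONFIG_IRQ_TIME_ACCOUNTING kernel option, which measures time spent on irq handling, is enabled, the @delta passed in is not used as is; the time spent on irq handling is subtracted from @delta so that only pure task runtime is applied.
In code lines 33~44, when the CONFIG_PARAVIRT_TIME_ACCOUNTING kernel option is enabled, the time consumed by the vCPU is subtracted so that only the time used on the current cpu is applied.
In code line 46, add the adjusted (slightly reduced) delta to rq->clock_task.
In code lines 48~51, update the load average with the time spent on irq handling.
- The CONFIG_HAVE_SCHED_AVG_IRQ kernel option is enabled on SMP systems where at least one of the two kernel options above is set.
In code line 52, pass @delta on to be reflected in rq->clock_pelt. (The @delta value is scaled according to cpu capacity and frequency.)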
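
The irq clamping in update_rq_clock_task() can be isolated in a small user-space sketch. It models only the CONFIG_IRQ_TIME_ACCOUNTING branch and leaves out steal time; struct rq_clock_model and clock_task_advance() are hypothetical names, not kernel APIs.

```c
#include <stdint.h>

/* Hypothetical, simplified model of the clock_task bookkeeping. */
struct rq_clock_model {
        uint64_t clock_task;     /* task-only runtime, ns */
        uint64_t prev_irq_time;  /* irq time already accounted, ns */
};

/*
 * Account delta ns of elapsed time, given the cpu's cumulative irq time
 * irq_time_now. Returns the ns actually added to clock_task.
 */
static int64_t clock_task_advance(struct rq_clock_model *rq, int64_t delta,
                                  uint64_t irq_time_now)
{
        int64_t irq_delta = (int64_t)(irq_time_now - rq->prev_irq_time);

        if (irq_delta > delta)  /* clamp so clock_task never goes backwards */
                irq_delta = delta;

        rq->prev_irq_time += irq_delta;
        delta -= irq_delta;
        rq->clock_task += delta;
        return delta;
}
```

When more irq time has accumulated than the elapsed delta, only the part that fits is consumed and the remainder is picked up by the next call, which is what keeps clock_task monotonic.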

update_rq_clock_pelt()

kernel/sched/pelt.h

/*
 * The clock_pelt scales the time to reflect the effective amount of
 * computation done during the running delta time but then sync back to
 * clock_task when rq is idle.
 *
 *
 * absolute time   | 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|16
 * @ max capacity  ------******---------------******---------------
 * @ half capacity ------************---------************---------
 * clock pelt      | 1| 2|    3|    4| 7| 8| 9|   10|   11|14|15|16
 *
 */

static inline void update_rq_clock_pelt(struct rq *rq, s64 delta)
{
        if (unlikely(is_idle_task(rq->curr))) {
                /* The rq is idle, we can sync to clock_task */
                rq->clock_pelt  = rq_clock_task(rq);
                return;
        }

        /*
         * When a rq runs at a lower compute capacity, it will need
         * more time to do the same amount of work than at max
         * capacity. In order to be invariant, we scale the delta to
         * reflect how much work has been really done.
         * Running longer results in stealing idle time that will
         * disturb the load signal compared to max capacity. This
         * stolen idle time will be automatically reflected when the
         * rq will be idle and the clock will be synced with
         * rq_clock_task.
         */

        /*
         * Scale the elapsed time to reflect the real amount of
         * computation
         */
        delta = cap_scale(delta, arch_scale_cpu_capacity(cpu_of(rq)));
        delta = cap_scale(delta, arch_scale_freq_capacity(cpu_of(rq)));

        rq->clock_pelt += delta;
}

clock_pelt advances by the running @delta time scaled down according to cpu performance. When the runqueue is idle, it is synchronized to the clock_task time.

In code lines 3~7, if the runqueue is idle, rq->clock_pelt is synchronized to rq->clock_task.
In code line 25, scale @delta down by the cpu capacity.
- At full capacity it is not reduced.
In code line 26, scale it down once more by the frequency capacity.
- When running at full frequency it is not reduced.
In code line 28, add the delta reduced according to cpu performance (capacity and frequency) to rq->clock_pelt.
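
The two scaling steps can be tried out with a short sketch. It assumes that a capacity of 1024 means "full" and that scaling is a multiply followed by a right shift by 10; scale_by_capacity() and pelt_delta() are hypothetical helper names.

```c
#include <stdint.h>

#define CAPACITY_SHIFT 10   /* full capacity is 1 << 10 = 1024 */

/* Scale v by s/1024. */
static uint64_t scale_by_capacity(uint64_t v, uint64_t s)
{
        return (v * s) >> CAPACITY_SHIFT;
}

/* ns to add to clock_pelt for delta ns of clock_task time. */
static uint64_t pelt_delta(uint64_t delta, uint64_t cpu_cap, uint64_t freq_cap)
{
        delta = scale_by_capacity(delta, cpu_cap);
        return scale_by_capacity(delta, freq_cap);
}
```

For example, 1000 ns on a cpu with capacity 512 running at full frequency advances clock_pelt by 500 ns, and at half frequency as well by only 250 ns.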

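The following figure shows how time scaled by different cpu capacities is reflected in rq->clock_pelt.

Updating the entity load average

update_load_avg(), which computes the entity load average, is called from the following functions.

entity_tick()
- On every schedule tick
enqueue_task_fair()
- When a task is enqueued on a cfs runqueue
dequeue_task_fair()
- When a task is dequeued from a cfs runqueue
enqueue_entity()
- When an entity is enqueued on a cfs runqueue
dequeue_entity()
- When an entity is dequeued from a cfs runqueue
set_next_entity()
- When the next entity of a cfs runqueue is selected
put_prev_entity()
- When the currently running entity is put back to wait on the runqueue
update_blocked_averages()
- When blocked averages are updated
propagate_entity_cfs_rq()
- When propagation to the cfs runqueue is needed
detach_entity_cfs_rq()
- When an entity is detached from a cfs runqueue
attach_entity_cfs_rq()
- When an entity is attached to a cfs runqueue
sched_group_set_shares()
- When shares are set on a group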

Entity load weight and runnable load weight

On 64-bit systems, the entity's load weight and runnable load weight are scaled up (<< 10) to gain 10 more bits of precision.

32-bit systems
- se->load.weight of a nice 0 entity = 1,024
64-bit systems
- se->load.weight of a nice 0 entity = 1,024 << 10 = 1,048,576
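
The scale-up and scale-down on 64-bit systems amount to a pair of shifts. The sketch below assumes a plain 10-bit shift with no extra clamping; the kernel's own macros may differ in detail, and the helper names here are hypothetical.

```c
#include <stdint.h>

#define FIXEDPOINT_SHIFT 10     /* extra precision bits on 64-bit systems */

/* Scale a user-visible weight up to the internal high-resolution weight. */
static uint64_t weight_scale_up(uint64_t w)
{
        return w << FIXEDPOINT_SHIFT;
}

/* Scale an internal high-resolution weight back down. */
static uint64_t weight_scale_down(uint64_t w)
{
        return w >> FIXEDPOINT_SHIFT;
}
```

With these, the nice 0 weight of 1,024 becomes 1,048,576 internally, which matches the se.load.weight value shown in the /proc/760/sched output earlier.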

se_weight()

kernel/sched/sched.h

static inline long se_weight(struct sched_entity *se)
{
        return scale_load_down(se->load.weight);
}

Returns the entity's load weight, scaled down.

se_runnable()

kernel/sched/sched.h

static inline long se_runnable(struct sched_entity *se)
{
        return scale_load_down(se->runnable_weight);
}

Returns the entity's runnable load weight, scaled down.

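PELT update and preemption request check for the schedule entity on every tick

scheduler_tick(), called on every schedule tick, and hrtick(), called on every HR tick, both call the (*task_tick) hook of the scheduler of the task currently running as curr. For example, if the runqueue's curr is a cfs task, task_tick_fair() is called; it calls entity_tick() for each schedule entity, from the one belonging to the task up to the top-level one, and entity_tick() in turn calls the PELT-related functions. When called from the schedule tick, it checks whether preemption should be requested before rescheduling. When called from the HR tick, which an hrtimer fires exactly when the running time is used up, it always requests a reschedule.

Example call paths)
- Schedule tick: scheduler_tick() -> task_tick_fair() -> entity_tick()
  - entity_tick() is called with the argument queued=0.
- HR tick: hrtick() -> task_tick_fair() -> entity_tick()
  - entity_tick() is called with the argument queued=1.
Reference: Scheduler -6- (CFS Scheduler) | 문c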
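
The difference between the two call paths can be summarized as a decision function. This is a simplified model rather than kernel code: runtime_exhausted() is a hypothetical stand-in for the regular-tick preemption check, and the nr_running test is an assumption about when that check is worth doing.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the preemption check done on a regular tick. */
static bool runtime_exhausted(unsigned long ran_ns, unsigned long slice_ns)
{
        return ran_ns > slice_ns;
}

/* Returns true when a reschedule should be requested for the current entity. */
static bool tick_wants_resched(int queued, unsigned int nr_running,
                               unsigned long ran_ns, unsigned long slice_ns)
{
        if (queued)             /* HR tick: the slice has expired by construction */
                return true;
        if (nr_running <= 1)    /* assumed: nothing else to switch to */
                return false;
        return runtime_exhausted(ran_ns, slice_ns);
}
```

With queued=1 the function requests a reschedule unconditionally, while with queued=0 the outcome depends on the check.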

Updating the entity's load average

The following figure shows the call flow of update_load_avg().

update_load_avg()

kernel/sched/fair.c

/* Update task and its cfs_rq load average */
static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
        u64 now = cfs_rq_clock_pelt(cfs_rq);
        int decayed;

        /*
         * Track task load average for carrying it to new CPU after migrated, and
         * track group sched_entity load average for task_h_load calc in migration
         */
        if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
                __update_load_avg_se(now, cfs_rq, se);

        decayed  = update_cfs_rq_load_avg(now, cfs_rq);
        decayed |= propagate_entity_load_avg(se);

        if (!se->avg.last_update_time && (flags & DO_ATTACH)) {

                /*
                 * DO_ATTACH means we're here from enqueue_entity().
                 * !last_update_time means we've passed through
                 * migrate_task_rq_fair() indicating we migrated.
                 *
                 * IOW we're enqueueing a task on a new CPU.
                 */
                attach_entity_load_avg(cfs_rq, se, SCHED_CPUFREQ_MIGRATION);
                update_tg_load_avg(cfs_rq, 0);

        } else if (decayed && (flags & UPDATE_TG))
                update_tg_load_avg(cfs_rq, 0);
}

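Updates the entity's load average and related values.

In code line 4, get the runqueue's pelt time, which has cpu performance applied.
In code lines 11~12, if the entity has a non-zero last_update_time and the SKIP_AGE_LOAD flag is not set, update the entity's load average.
In code line 14, update the cfs runqueue's load/util averages and find out whether decay occurred.
In code line 15, when a task is attached to or detached from a cfs runqueue, a propagate mark is set; on the next update, the cfs runqueue's load and related values are copied into the entity so that the change propagates to the upper level (cfs runqueue and entity). Whether decay occurred is also obtained.
- Reference: sched/fair: Propagate load during synchronous attach/detach (2016, v4.10-rc1)
In code lines 17~27, if the DO_ATTACH flag is set and this is the entity's first update after the task was migrated, attach its load average and update the task group load average.
- Right after migration, last_update_time has been cleared to 0.
In code lines 29~30, update the task group load average as well, but only if decay occurred and the UPDATE_TG flag is set.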

Computing the load sum and average of a schedule entity

__update_load_avg_se()

kernel/sched/pelt.c

int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
{
        if (___update_load_sum(now, &se->avg, !!se->on_rq, !!se->on_rq,
                                cfs_rq->curr == se)) {

                ___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
                cfs_se_util_change(&se->avg);
                trace_pelt_se_tp(se);
                return 1;
        }

        return 0;
}

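Adds the newly accrued load to the entity's load sum, and if a decay happened in the process, recomputes the load average as well. The return value indicates whether the load average was recomputed. (When the load sum has been decayed, the load average is recomputed too, so 1 is returned.)

In code lines 3~4, add the newly accrued load to the entity's load sum; if a decay happened in the process, continue as follows.
In code line 6, recompute the load average.
In code lines 7~9, if the UTIL_AVG_UNCHANGED flag is set in avg->util_est.enqueued, clear it so that the recomputation of the load average is recognized. Then return 1, meaning that decay occurred and the load average was recomputed.
In code line 12, return 0, meaning that no decay occurred and the load average was not recomputed.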

cfs_se_util_change()

kernel/sched/pelt.h

/*
 * When a task is dequeued, its estimated utilization should not be update if
 * its util_avg has not been updated at least once.
 * This flag is used to synchronize util_avg updates with util_est updates.
 * We map this information into the LSB bit of the utilization saved at
 * dequeue time (i.e. util_est.dequeued).
 */
#define UTIL_AVG_UNCHANGED 0x1

static inline void cfs_se_util_change(struct sched_avg *avg)
{
        unsigned int enqueued;

        if (!sched_feat(UTIL_EST))
                return;

        /* Avoid store if the flag has been already set */
        enqueued = avg->util_est.enqueued;
        if (!(enqueued & UTIL_AVG_UNCHANGED))
                return;

        /* Reset flag to report util_avg has been updated */
        enqueued &= ~UTIL_AVG_UNCHANGED;
        WRITE_ONCE(avg->util_est.enqueued, enqueued);
}

If the UTIL_AVG_UNCHANGED flag is set in avg->util_est.enqueued, clear it so that the recomputation of the load average is recognized.

Computing load sums and averages

Computing the load sum

___update_load_sum()

kernel/sched/pelt.c

/*
 * We can represent the historical contribution to runnable average as the
 * coefficients of a geometric series.  To do this we sub-divide our runnable
 * history into segments of approximately 1ms (1024us); label the segment that
 * occurred N-ms ago p_N, with p_0 corresponding to the current period, e.g.
 *
 * [<- 1024us ->|<- 1024us ->|<- 1024us ->| ...
 *      p0            p1           p2
 *     (now)       (~1ms ago)  (~2ms ago)
 *
 * Let u_i denote the fraction of p_i that the entity was runnable.
 *
 * We then designate the fractions u_i as our co-efficients, yielding the
 * following representation of historical load:
 *   u_0 + u_1*y + u_2*y^2 + u_3*y^3 + ...
 *
 * We choose y based on the with of a reasonably scheduling period, fixing:
 *   y^32 = 0.5
 *
 * This means that the contribution to load ~32ms ago (u_32) will be weighted
 * approximately half as much as the contribution to load within the last ms
 * (u_0).
 *
 * When a period "rolls over" and we have new u_0`, multiplying the previous
 * sum again by y is sufficient to update:
 *   load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
 *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
 */

static __always_inline int
___update_load_sum(u64 now, struct sched_avg *sa,
                  unsigned long load, unsigned long runnable, int running)
{
        u64 delta;

        delta = now - sa->last_update_time;
        /*
         * This should only happen when time goes backwards, which it
         * unfortunately does during sched clock init when we swap over to TSC.
         */
        if ((s64)delta < 0) {
                sa->last_update_time = now;
                return 0;
        }

        /*
         * Use 1024ns as the unit of measurement since it's a reasonable
         * approximation of 1us and fast to compute.
         */
        delta >>= 10;
        if (!delta)
                return 0;

        sa->last_update_time += delta << 10;

        /*
         * running is a subset of runnable (weight) so running can't be set if
         * runnable is clear. But there are some corner cases where the current
         * se has been already dequeued but cfs_rq->curr still points to it.
         * This means that weight will be 0 but not running for a sched_entity
         * but also for a cfs_rq if the latter becomes idle. As an example,
         * this happens during idle_balance() which calls
         * update_blocked_averages()
         */
        if (!load)
                runnable = running = 0;

        /*
         * Now we know we crossed measurement unit boundaries. The *_avg
         * accrues by two steps:
         *
         * Step 1: accumulate *_sum since last_update_time. If we haven't
         * crossed period boundaries, finish.
         */
        if (!accumulate_sum(delta, sa, load, runnable, running))
                return 0;

        return 1;
}

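Computes the load sums and returns whether decay occurred.

In code line 7, get delta, the difference between the current time and last_update_time.
In code lines 12~15, on architectures that use an unstable TSC clock and the like, delta can be less than 0 early on. In that case only last_update_time is set to the current time and 0 is returned.
In code lines 21~23, convert delta from nanoseconds to the microsecond units (1024 ns) that PELT uses for load calculation; if the result is 0, just return 0.
In code line 25, advance last_update_time by delta converted back to nanoseconds, that is, with the sub-microsecond part truncated. PELT's load calculation always leaves out the sub-microsecond nanoseconds and defers them.
In code lines 36~37, if @load is 0, set runnable and running to 0.
In code lines 46~47, fold in the new load for the delta interval and compute all of the load sums. If the existing load sums were not decayed while doing so, return 0.
In code line 49, return 1 because decay was applied in the load calculation. (Once decay has happened, the load averages must be recomputed afterwards from the resulting load sums.)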
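
The unit conversion and the deferred sub-microsecond remainder can be seen in isolation in the following sketch. pelt_elapsed_units() is a hypothetical helper that mirrors only the first part of ___update_load_sum(); it does no load accounting.

```c
#include <stdint.h>

/*
 * Returns the number of whole 1024 ns units between *last_update and now,
 * advancing *last_update by exactly that many units. The sub-unit
 * remainder stays pending for the next call.
 */
static uint64_t pelt_elapsed_units(uint64_t now, uint64_t *last_update)
{
        uint64_t delta = now - *last_update;

        if ((int64_t)delta < 0) {       /* clock went backwards: resync */
                *last_update = now;
                return 0;
        }
        delta >>= 10;                   /* ns -> units of 1024 ns */
        *last_update += delta << 10;    /* drop the remainder for now */
        return delta;
}
```

Starting from last_update = 0, a call with now = 2500 returns 2 and leaves last_update at 2048; the remaining 452 ns are counted once later calls push the total past the next 1024 ns boundary.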

period

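The PELT time units have the following characteristics.

PELT clock values are in ns.
The time unit used in PELT calculations is aligned to 1024 ns, which is about 1 us.
A period is aligned to 1024 * 1024 ns, which is about 1 ms.
- Decay is normally applied in Period-aligned units.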

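The following figure visualizes the time units related to PELT.

The following figure shows how the decay of the old load is computed in Step 1 of ___update_load_sum().

Decay is applied once per 1024 us period. When the decay factor of each period is applied, periods closer to the current time contribute a larger share of their load.
In Step 1, which computes the old load, the truncated part of the last period (the one containing now) is excluded, and the old load is decayed by only 3 periods.

The following figure shows how the decay of the new load is computed in Step 2 of ___update_load_sum().

In Step 2, which computes the new load, the whole interval after last_update_time is decayed.

The example in the following figure lets you pick the intervals yourself and work out the result.

It decays the old load in Step 1,
decays the new load in Step 2,
and adds the two values to obtain the load.

Example) Compute the final load with old load sum = 20000 and new load value = 2048.

Step 1) decay_load(old load sum, periods)
- = decay_load(20000, 4)
Step 2) new load * __accumulate_pelt_segments(periods, d1, d3)
- 2048 * __accumulate_pelt_segments(4, 921, 200)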

accumulate_sum()

kernel/sched/pelt.c

/*
 * Accumulate the three separate parts of the sum; d1 the remainder
 * of the last (incomplete) period, d2 the span of full periods and d3
 * the remainder of the (incomplete) current period.
 *
 *           d1          d2           d3
 *           ^           ^            ^
 *           |           |            |
 *         |<->|<----------------->|<--->|
 * ... |---x---|------| ... |------|-----x (now)
 *
 *                           p-1
 * u' = (u + d1) y^p + 1024 \Sum y^n + d3 y^0
 *                           n=1
 *
 *    = u y^p +                                 (Step 1)
 *
 *                     p-1
 *      d1 y^p + 1024 \Sum y^n + d3 y^0         (Step 2)
 *                     n=1
 */

static __always_inline u32
accumulate_sum(u64 delta, struct sched_avg *sa,
               unsigned long load, unsigned long runnable, int running)
{
        u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
        u64 periods;

        delta += sa->period_contrib;
        periods = delta / 1024; /* A period is 1024us (~1ms) */

        /*
         * Step 1: decay old *_sum if we crossed period boundaries.
         */
        if (periods) {
                sa->load_sum = decay_load(sa->load_sum, periods);
                sa->runnable_load_sum =
                        decay_load(sa->runnable_load_sum, periods);
                sa->util_sum = decay_load((u64)(sa->util_sum), periods);

                /*
                 * Step 2
                 */
                delta %= 1024;
                contrib = __accumulate_pelt_segments(periods,
                                1024 - sa->period_contrib, delta);
        }
        sa->period_contrib = delta;

        if (load)
                sa->load_sum += load * contrib;
        if (runnable)
                sa->runnable_load_sum += runnable * contrib;
        if (running)
                sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;

        return periods;
}

Computes the load sums, folding in the new @load for a @delta interval given in microseconds. The return value is the number of periods used for the decay.

In code line 5, contrib is preset to delta. This value is used as it is only when no period boundary is crossed (delta below 1024 us).
In code line 8, sa->period_contrib, the microsecond remainder that was below 1024 us at the previous update and was therefore not decayed, is added to @delta.
In code line 9, delta is divided by 1024 to get the number of 1024 us periods to decay.
- In the kernel one period is exactly 1024 us. For people it is fine to think of it as about 1 ms, error included.
In code lines 14~18, the Step 1 routine decays the past values first: the existing load_sum, runnable_load_sum, and util_sum are decayed by periods.
- The old load is decayed in 1024 us aligned units, so the remainder is not decayed.
In code lines 23~25, the contribution (contrib) is computed. The new Step 2 load is multiplied by it before being added to the load decayed in Step 1.
- new load value = old load * decay + new load * contrib
  - decay_load(old load, number of 1024 us periods) + new load * __accumulate_pelt_segments()
In code line 27, the microsecond remainder of delta below 1024 us, which could not be decayed, is stored in sa->period_contrib. It is added to delta at the next update.
In code lines 29~32, load * contrib and runnable * contrib are added to the decayed load sums.
In code lines 33~34, util_sum is increased by 1024 * contrib.
In code line 36, the number of 1024 us periods used for the decay (periods) is returned.
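
The way a delta is split into periods, d1, and d3 can be checked with a small sketch. `struct pelt_split` and `pelt_split_delta()` are hypothetical names introduced for illustration; the arithmetic follows the accumulate_sum() listing above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type, for illustration only. */
struct pelt_split {
	uint64_t periods;	/* number of 1024 us boundaries crossed */
	uint32_t d1;		/* time that completes the old, unfinished period */
	uint32_t d3;		/* time inside the current, unfinished period */
};

static struct pelt_split pelt_split_delta(uint32_t period_contrib, uint64_t delta)
{
	struct pelt_split s;

	assert(period_contrib < 1024);

	delta += period_contrib;		/* add the carried-over remainder */
	s.periods = delta / 1024;
	/* d1 only matters when a boundary was crossed. */
	s.d1 = s.periods ? 1024 - period_contrib : 0;
	s.d3 = (uint32_t)(delta % 1024);	/* becomes the new period_contrib */
	return s;
}
```

With period_contrib = 103 and delta = 4193 us this yields periods = 4, d1 = 921, and d3 = 200, the arguments used in the earlier example.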

Contribution (contrib)

The contribution cannot exceed LOAD_AVG_MAX (47742), the maximum load span.

avg.load_sum += load * contrib
- Example) contrib = LOAD_AVG_MAX * 10% = 4774
avg.runnable_load_sum += runnable * contrib
avg.util_sum += contrib * 1024

Load decay

decay_load()

kernel/sched/pelt.c

/*
 * Approximate:
 *   val * y^n,    where y^32 ~= 0.5 (~1 scheduling period)
 */

static u64 decay_load(u64 val, u64 n)
{
        unsigned int local_n;

        if (unlikely(n > LOAD_AVG_PERIOD * 63))
                return 0;

        /* after bounds checking we can collapse to 32-bit */
        local_n = n;

        /*
         * As y^PERIOD = 1/2, we can combine
         *    y^n = 1/2^(n/PERIOD) * y^(n%PERIOD)
         * With a look-up table which covers y^n (n<PERIOD)
         *
         * To achieve constant time decay_load.
         */
        if (unlikely(local_n >= LOAD_AVG_PERIOD)) {
                val >>= local_n / LOAD_AVG_PERIOD;
                local_n %= LOAD_AVG_PERIOD;
        }

        val = mul_u64_u32_shr(val, runnable_avg_yN_inv[local_n], 32);
        return val;
}

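Reduces the load value @val by the decay ratio corresponding to @n periods (1024 us each). n=0 gives a factor of 1.0 (no decay) and n=32 gives a factor of 0.5 (halved).

In code lines 5~6, if @n exceeds 2016 (32 * 63), the result would be 0 anyway after that many periods, so 0 is returned right away without computing anything.
In code line 9, @n is converted to a 32-bit type.
In code lines 18~21, the value halves every 32 periods. So when @n is 32 or more, val is shifted right by n/32 and local_n keeps the remainder of dividing by 32.
In code lines 23~24, the multiplier at index local_n of the runnable_avg_yN_inv[] table, which holds precomputed decay factors in inverse form, is multiplied into val. The result is shifted right by the 32 bits of precision and returned.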
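
The steps above can be tried outside the kernel with a reduced sketch. It is an assumption-laden simplification: `decay_load32()` is a hypothetical name, it accepts only 32-bit values so that the product fits in 64 bits (the kernel uses mul_u64_u32_shr() for full 64-bit inputs), and the table is copied from the listing quoted in this article.

```c
#include <assert.h>
#include <stdint.h>

#define LOAD_AVG_PERIOD 32

/* y^n scaled to 32 bits for n = 0..31, copied from runnable_avg_yN_inv[]. */
static const uint32_t yN_inv[LOAD_AVG_PERIOD] = {
	0xffffffff, 0xfa83b2da, 0xf5257d14, 0xefe4b99a, 0xeac0c6e6, 0xe5b906e6,
	0xe0ccdeeb, 0xdbfbb796, 0xd744fcc9, 0xd2a81d91, 0xce248c14, 0xc9b9bd85,
	0xc5672a10, 0xc12c4cc9, 0xbd08a39e, 0xb8fbaf46, 0xb504f333, 0xb123f581,
	0xad583ee9, 0xa9a15ab4, 0xa5fed6a9, 0xa2704302, 0x9ef5325f, 0x9b8d39b9,
	0x9837f050, 0x94f4efa8, 0x91c3d373, 0x8ea4398a, 0x8b95c1e3, 0x88980e80,
	0x85aac367, 0x82cd8698,
};

/* Approximates val * y^n with y^32 = 0.5, for 32-bit val only. */
static uint32_t decay_load32(uint32_t val, uint64_t n)
{
	uint64_t v = val;

	if (n > LOAD_AVG_PERIOD * 63)
		return 0;			/* fully decayed */

	if (n >= LOAD_AVG_PERIOD) {
		v >>= n / LOAD_AVG_PERIOD;	/* one halving per 32 periods */
		n %= LOAD_AVG_PERIOD;
	}

	assert(n < LOAD_AVG_PERIOD);
	return (uint32_t)((v * yN_inv[n]) >> 32);
}
```

Note that entry 0 is 0xffffffff rather than 2^32, so even n = 0 loses one unit on values such as 1024.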

Precomputed decay factors for PELT

kernel/sched/fair.c

static const u32 runnable_avg_yN_inv[] __maybe_unused = {
        0xffffffff, 0xfa83b2da, 0xf5257d14, 0xefe4b99a, 0xeac0c6e6, 0xe5b906e6,
        0xe0ccdeeb, 0xdbfbb796, 0xd744fcc9, 0xd2a81d91, 0xce248c14, 0xc9b9bd85,
        0xc5672a10, 0xc12c4cc9, 0xbd08a39e, 0xb8fbaf46, 0xb504f333, 0xb123f581,
        0xad583ee9, 0xa9a15ab4, 0xa5fed6a9, 0xa2704302, 0x9ef5325f, 0x9b8d39b9,
        0x9837f050, 0x94f4efa8, 0x91c3d373, 0x8ea4398a, 0x8b95c1e3, 0x88980e80,
        0x85aac367, 0x82cd8698,
};

#define LOAD_AVG_PERIOD 32
#define LOAD_AVG_MAX 47742

The runnable_avg_yN_inv[] table holds the precomputed decay factors for up to about 32 ms (1024 us * 32). To avoid division, each decay factor is stored as a multiplier produced by the following formula.

Formula for the n-th factor (with a 32-bit shift):
- 'y^n' is always a real number of at most 1.0. It is converted to a fixed-point integer with 32 bits of precision.
  - The formula 'y^n * 2^32' is used. (where y^32 = 0.5)
- First, the value of y:
  - y = 0.5^(1/32) = 0.97857206…
- Results for index n from 0 to 32 (the table holds 0~31):
  - n = 0: y^0 * 2^32 = (0.5^(1/32))^0 * 2^32 = 1.0 << 32 = 0x1_0000_0000 (the 32-bit table stores 0xffff_ffff)
  - n = 1: y^1 * 2^32 = (0.5^(1/32))^1 * 2^32 = 0.97857206 << 32 ≈ 0xfa83_b2db (the table stores 0xfa83b2da)
  - n = 2: y^2 * 2^32 = (0.5^(1/32))^2 * 2^32 = 0.957603281 << 32 ≈ 0xf525_7d15 (the table stores 0xf5257d14)
  - …
  - n = 31: y^31 * 2^32 = (0.5^(1/32))^31 * 2^32 = 0.510948574 << 32 = 0x82cd_8698
  - n = 32: y^32 * 2^32 = (0.5^(1/32))^32 * 2^32 = 0.5 << 32 = 0x8000_0000 (0x7fff_ffff under the table's convention; not stored, because decay_load() handles every 32 periods with a shift)

__accumulate_pelt_segments()

kernel/sched/pelt.c

static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3)
{
        u32 c1, c2, c3 = d3; /* y^0 == 1 */

        /*
         * c1 = d1 y^p
         */
        c1 = decay_load((u64)d1, periods);

        /*
         *            p-1
         * c2 = 1024 \Sum y^n
         *            n=1
         *
         *              inf        inf
         *    = 1024 ( \Sum y^n - \Sum y^n - y^0 )
         *              n=0        n=p
         */
        c2 = LOAD_AVG_MAX - decay_load(LOAD_AVG_MAX, periods) - 1024;

        return c1 + c2 + c3;
}

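Returns the contribution (contrib) by which the new load is multiplied. @periods is the number of 1024 us periods the new load is decayed over, @d1 is the leading segment shorter than 1024 us that is not aligned to a period, and @d3 is the trailing segment shorter than 1024 us.

In code line 3, the d3 segment falls in the y^0 period, so d3 * y^0 = d3. The value of d3 is therefore assigned to c3 and used in the contribution as it is.
In code line 8, the d1 segment is decayed by @periods to obtain c1.
In code line 19, c2 is obtained with a rearranged formula for decaying the d2 segment.
In code line 21, c1 + c2 + c3 is returned. This value is multiplied into the new load as its contribution (contrib).

Example) new load value = 3072 (3.0), periods = 32, d1 = 512, d3 = 768: what is the contribution (contrib)?

y = 0.5^(1/32) = 0.978572…
c1 = d1 * y^periods = 512 * (0.5^(1/32))^32 = 256
c2 = 1024 * (y^1 + … + y^(periods-1)) ≈ 22870
- = 1024 * ((y^0 + … + y^inf) – (y^periods + … + y^inf) – (y^0))
- = 1024 * (y^0 + … + y^inf) – 1024 * (y^periods + … + y^inf) – 1024 * y^0
- = 47742 – decay_load(47742, 32) – 1024
- = 47742 – 23870 – 1024 = 22848 (the kernel's integer result; it is slightly below the exact 22870 because LOAD_AVG_MAX is itself a truncated sum)
c3 = d3 = 768
return = 256 + 22870 + 768 = 23894 with exact arithmetic, or 255 + 22848 + 768 = 23871 with the kernel's integer arithmetic, where decay_load(512, 32) = 255
- Either way, about 50% of the full load span LOAD_AVG_MAX (47742) is returned as the contribution (contrib).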
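
The integer result of this example can be reproduced with a standalone sketch. The names `decay32()` and `pelt_segments()` are hypothetical; the table is copied from runnable_avg_yN_inv[] as quoted in this article, and inputs are limited to 32 bits so that plain 64-bit multiplication is enough.

```c
#include <assert.h>
#include <stdint.h>

#define LOAD_AVG_PERIOD	32
#define LOAD_AVG_MAX	47742

/* y^n scaled to 32 bits for n = 0..31, copied from runnable_avg_yN_inv[]. */
static const uint32_t yN_inv[LOAD_AVG_PERIOD] = {
	0xffffffff, 0xfa83b2da, 0xf5257d14, 0xefe4b99a, 0xeac0c6e6, 0xe5b906e6,
	0xe0ccdeeb, 0xdbfbb796, 0xd744fcc9, 0xd2a81d91, 0xce248c14, 0xc9b9bd85,
	0xc5672a10, 0xc12c4cc9, 0xbd08a39e, 0xb8fbaf46, 0xb504f333, 0xb123f581,
	0xad583ee9, 0xa9a15ab4, 0xa5fed6a9, 0xa2704302, 0x9ef5325f, 0x9b8d39b9,
	0x9837f050, 0x94f4efa8, 0x91c3d373, 0x8ea4398a, 0x8b95c1e3, 0x88980e80,
	0x85aac367, 0x82cd8698,
};

/* val * y^n for 32-bit val, same steps as decay_load(). */
static uint32_t decay32(uint32_t val, uint64_t n)
{
	uint64_t v = val;

	if (n > LOAD_AVG_PERIOD * 63)
		return 0;
	if (n >= LOAD_AVG_PERIOD) {
		v >>= n / LOAD_AVG_PERIOD;
		n %= LOAD_AVG_PERIOD;
	}
	return (uint32_t)((v * yN_inv[n]) >> 32);
}

/* contrib = d1*y^p + 1024*(y^1 + ... + y^(p-1)) + d3, for p >= 1. */
static uint32_t pelt_segments(uint64_t periods, uint32_t d1, uint32_t d3)
{
	uint32_t c1, c2, c3 = d3;	/* y^0 == 1 */

	assert(periods >= 1 && d1 <= 1024 && d3 < 1024);

	c1 = decay32(d1, periods);
	c2 = LOAD_AVG_MAX - decay32(LOAD_AVG_MAX, periods) - 1024;
	return c1 + c2 + c3;
}
```

For periods = 32, d1 = 512, d3 = 768 the parts are c1 = 255, c2 = 22848, c3 = 768, which add up to 23871, exactly half of LOAD_AVG_MAX.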

The following figure shows the segments used when computing the contribution (contrib) that is applied to a new load.

Computing the load average

___update_load_avg()

kernel/sched/pelt.c

static __always_inline void
___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runnable)
{
        u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;

        /*
         * Step 2: update *_avg.
         */
        sa->load_avg = div_u64(load * sa->load_sum, divider);
        sa->runnable_load_avg = div_u64(runnable * sa->runnable_load_sum, divider);
        WRITE_ONCE(sa->util_avg, sa->util_sum / divider);
}

Recomputes the load, runnable load, and util averages of @sa.

In code line 4, divider must be the maximum span once decay is taken into account, minus the time that cannot take part in this calculation. LOAD_AVG_MAX is the largest value that the sum of all decayed loads can reach, so it serves as the maximum span, and a load sum / divider gives the ratio to apply. However, the part of the current period below 1024 us that has not been accumulated must not be counted, so it is subtracted from the maximum span. The amount to exclude is as follows.
- 1024 (us) – the value not decayed in the previous calculation (sa->period_contrib) = the value to exclude in the new calculation
  - In the previous calculation, of 3993 us (1024 + 1024 + 1024 + 921), 3072 us covering 3 periods were used for the decay and 921 us were not. In this calculation that leftover is used for the decay, and the remaining 103 us is the time excluded from it.
In code lines 9~10, @load and @runnable are multiplied by the ratio of the load sums, computed before entering this function, to divider.
- When @load is 0 the load average is reset to 0, and when @runnable is 0 the runnable load average is likewise reset to 0.
- Reference: sched/cfs: Make util/load_avg more stable (2017, v4.13-rc1)
In code line 11, the util average is updated to the util sum divided by divider.

sa->period_contrib

period_contrib stores the time below one period (1024 us, that is 1024 * 1024 ns) that was not decayed last time.

The following figure shows exactly where period_contrib lies.

LOAD_AVG_MAX

LOAD_AVG_MAX assumes that a fully loaded 1024 us unit of one period is worth 1024 (1.0 as a 10-bit fixed-point integer), decays that value over an unbounded number of periods, and adds everything up. The result is the largest value a load sum can reach, and it is used as the maximum span when computing load averages.

n = ∞ (infinity)
With a smoothing factor y that reaches 0.5 after 32 decays, y is derived as follows.
- y^32 = 0.5
- y = 0.5^(1/32)
- = about 0.97857
- Excel formula: =POWER(0.5, 1/32)

Example) decay total = y^0 + y^1 + y^2 + y^3 + … + y^n = 1 + 0.978572 + 0.957603 + 0.937084 + … + 0.0000 = about 46.668046

What is the sum when the 1 ms load value 1024 is decayed n times over? (Using Excel with fractions held as 40-bit fixed-point values)
- = 1024 * y^0 + 1024 * y^1 + 1024 * y^2 + 1024 * y^3 + … + 1024 * y^n
- = 1024 * (y^0 + y^1 + y^2 + y^3 + … + y^n)
- = 1024 + 1002 + 980 + 959 + … + 0
- = 47579
The kernel uses a different result, 47742. The following source shows how it is obtained.
- To trace the value 47742: y is held as a double for precision, the integer arithmetic uses a 32-bit fixed-point value, and the result is a long.
- While decaying 346 times (n = 0 ~ 345), 1024 for the new load is added each time.
- Reference: Documentation/scheduler/sched-pelt.c
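
The convergence to 47742 can be reproduced with integer arithmetic alone. The following is a minimal sketch in the spirit of that generator, not a copy of it: `pelt_converged_max()` is a hypothetical name, and the fixed-point value of y is taken from entry 1 of runnable_avg_yN_inv[] (0xfa83b2da) instead of being computed with pow().

```c
#include <assert.h>
#include <stdint.h>

/* y scaled to 32 bits, with y^32 = 0.5 (entry 1 of runnable_avg_yN_inv[]). */
#define PELT_Y_INV 0xfa83b2daULL

/*
 * Keep decaying the running sum by y and adding a fresh 1024 until the
 * value stops changing. The fixed point is the largest reachable sum.
 */
static uint32_t pelt_converged_max(void)
{
	uint64_t max = 1024, last = 0;
	unsigned int guard = 0;

	while (last != max) {
		last = max;
		max = ((max * PELT_Y_INV) >> 32) + 1024;
		guard++;
		assert(guard < 10000);	/* the loop ends after a few hundred steps */
	}
	return (uint32_t)max;
}
```

The truncation in each step is why the converged value is 47742 and not the exact geometric sum 1024 / (1 - y), which is about 47788.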

The following figure shows how the value of LOAD_AVG_MAX is derived.

The following accumulates the load value 1024 multiplied by y^n for n from 1 to 32. (sum = 23,382)

The value actually used instead of LOAD_AVG_MAX when computing the load average

In the simplest form, load_avg would be computed from load_sum as follows.

load_avg = load_sum / LOAD_AVG_MAX

However, PELT cuts time into 1024 us periods, so the moment of calculation rarely falls exactly on a 1024 us boundary. Only the time that actually went into the load sum should be counted, so the part of the P0 period that still lies in the future, 1024 – sa->period_contrib, is subtracted from LOAD_AVG_MAX.

divider = LOAD_AVG_MAX – 1024 + sa->period_contrib
load_avg = load_sum / divider
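
As a quick numeric check of the divider, here is a small sketch. `pelt_divider()` and `pelt_load_avg()` are hypothetical helper names; the arithmetic is the same as in the ___update_load_avg() listing.

```c
#include <assert.h>
#include <stdint.h>

#define LOAD_AVG_MAX 47742

/* Maximum span minus the not-yet-elapsed part of the current period. */
static uint32_t pelt_divider(uint32_t period_contrib)
{
	assert(period_contrib < 1024);
	return LOAD_AVG_MAX - 1024 + period_contrib;
}

/* load_avg = load * load_sum / divider, using integer division. */
static uint64_t pelt_load_avg(uint64_t load, uint64_t load_sum,
			      uint32_t period_contrib)
{
	return load * load_sum / pelt_divider(period_contrib);
}
```

With period_contrib = 921 the divider is 47639, so an entity whose load_sum has reached 47639 gets a load_avg equal to its full weight.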

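The following figure shows how divider is computed.

The following figure shows only the load-average part of what ___update_load_avg() does.

The following figure shows load averages computed for a task entity.

It shows the load sums and averages after 10 ticks at 1000 HZ.

The following figure shows the load sums of the task entity under the same conditions, run up to tick 456.

From tick 201, running is set to 0 to bring the util load down.
- util_sum is scaled up (by 1024), so it is not drawn in the graph.
From tick 301, runnable is set to 0 to bring the load and runnable load down.
Red is load_sum and runnable_sum

Updating the cfs runqueue load average

Once the load average of an entity has been computed, the load is gathered into the cfs runqueue and the cfs runqueue load average is computed from it.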

update_cfs_rq_load_avg()

kernel/sched/fair.c

/**
 * update_cfs_rq_load_avg - update the cfs_rq's load/util averages
 * @now: current time, as per cfs_rq_clock_pelt()
 * @cfs_rq: cfs_rq to update
 *
 * The cfs_rq avg is the direct sum of all its entities (blocked and runnable)
 * avg. The immediate corollary is that all (fair) tasks must be attached, see
 * post_init_entity_util_avg().
 *
 * cfs_rq->avg is used for task_h_load() and update_cfs_share() for example.
 *
 * Returns true if the load decayed or we removed load.
 *
 * Since both these conditions indicate a changed cfs_rq->avg.load we should
 * call update_tg_load_avg() when this function returns true.
 */

static inline int
update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
{
        unsigned long removed_load = 0, removed_util = 0, removed_runnable_sum = 0;
        struct sched_avg *sa = &cfs_rq->avg;
        int decayed = 0;

        if (cfs_rq->removed.nr) {
                unsigned long r;
                u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;

                raw_spin_lock(&cfs_rq->removed.lock);
                swap(cfs_rq->removed.util_avg, removed_util);
                swap(cfs_rq->removed.load_avg, removed_load);
                swap(cfs_rq->removed.runnable_sum, removed_runnable_sum);
                cfs_rq->removed.nr = 0;
                raw_spin_unlock(&cfs_rq->removed.lock);

                r = removed_load;
                sub_positive(&sa->load_avg, r);
                sub_positive(&sa->load_sum, r * divider);

                r = removed_util;
                sub_positive(&sa->util_avg, r);
                sub_positive(&sa->util_sum, r * divider);

                add_tg_cfs_propagate(cfs_rq, -(long)removed_runnable_sum);

                decayed = 1;
        }

        decayed |= __update_load_avg_cfs_rq(now, cfs_rq);

#ifndef CONFIG_64BIT
        smp_wmb();
        cfs_rq->load_last_update_time_copy = sa->last_update_time;
#endif

        if (decayed)
                cfs_rq_util_change(cfs_rq, 0);

        return decayed;
}

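Updates the load/util averages of the cfs runqueue.

In code lines 7~29, when an entity was detached from the cfs runqueue, the load sum and average were not subtracted from the cfs runqueue directly. Instead, a request to subtract the load/util averages and sums was left in cfs_rq->removed. For performance, the detach path avoids taking the runqueue lock; the values in removed are protected by their own cfs_rq->removed.lock. When the cfs load average is updated, the request is handled as follows.
- The values are read from removed of the cfs runqueue into local variables and replaced with 0.
- They are subtracted from the load/util sums and averages of the cfs runqueue.
- The removed runnable sum is queued for propagation on the cfs runqueue.
In code line 31, the load average of the cfs runqueue is updated.
In code lines 38~39, if anything decayed, cfs_rq_util_change() is called so that on systems with cpu frequency scaling the frequency can follow the changed util value.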
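
The clamped subtraction used for the removed load can be sketched as follows. This is a simplified illustration: `sub_positive_u64()`, `struct avg_sketch`, and `remove_load()` are hypothetical names, and the kernel's sub_positive() is a macro that additionally uses READ_ONCE()/WRITE_ONCE(), which is left out here.

```c
#include <assert.h>
#include <stdint.h>

/* Subtract val from *ptr, clamping at 0 instead of wrapping around. */
static void sub_positive_u64(uint64_t *ptr, uint64_t val)
{
	uint64_t var = *ptr;
	uint64_t res = var - val;

	if (res > var)		/* unsigned underflow wrapped around */
		res = 0;
	*ptr = res;
}

/* Hypothetical cut-down view of the averages kept in a cfs runqueue. */
struct avg_sketch {
	uint64_t load_avg;
	uint64_t load_sum;
};

/* Apply a deferred removal: the average directly, the sum scaled by divider. */
static void remove_load(struct avg_sketch *sa, uint64_t removed_load,
			uint64_t divider)
{
	assert(sa && divider > 0);
	sub_positive_u64(&sa->load_avg, removed_load);
	sub_positive_u64(&sa->load_sum, removed_load * divider);
}
```

Clamping matters because the removed values were sampled earlier; by the time they are applied, the runqueue sums may already have decayed below them.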

__update_load_avg_cfs_rq()

kernel/sched/pelt.c

int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
{
        if (___update_load_sum(now, &cfs_rq->avg,
                                scale_load_down(cfs_rq->load.weight),
                                scale_load_down(cfs_rq->runnable_weight),
                                cfs_rq->curr != NULL)) {

                ___update_load_avg(&cfs_rq->avg, 1, 1);
                trace_pelt_cfs_tp(cfs_rq);
                return 1;
        }

        return 0;
}

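Updates the load sums and averages of the cfs runqueue. The return value tells whether the cfs runqueue load average was recomputed.

In code lines 3~6, the load sums of the cfs runqueue are updated.
In code lines 8~10, if a decay took place during that update, the load average of the cfs runqueue is updated and 1 is returned.
In code line 13, 0 is returned, meaning there was no decay and the cfs runqueue load average was not recomputed.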

References

Scheduler -1- (Basic) | 문c
Scheduler -2- (Global Cpu Load) | 문c
Scheduler -3- (PELT) | 문c – this post
Scheduler -3a- (PELT-2) | 문c
Scheduler -4- (Group Scheduling) | 문c
Scheduler -5- (Scheduler Core) | 문c
Scheduler -6- (CFS Scheduler) | 문c
Scheduler -7- (Preemption & Context Switch) | 문c
Scheduler -8- (CFS Bandwidth) | 문c
Scheduler -9- (RT Scheduler) | 문c
Scheduler -10- (Deadline Scheduler) | 문c
Scheduler -11- (Stop Scheduler) | 문c
Scheduler -12- (Idle Scheduler) | 문c
Scheduler -13- (Scheduling Domain 1) | 문c
Scheduler -14- (Scheduling Domain 2) | 문c
Scheduler -15- (Load Balance 1) | 문c
Scheduler -16- (Load Balance 2) | 문c
Scheduler -17- (Load Balance 3 NUMA) | 문c
Scheduler -18- (Load Balance 4 EAS) | 문c
Scheduler -19- (Initialization) | 문c
PID management | 문c
do_fork() | 문c
cpu_startup_entry() | 문c
Runqueue load average (cpu_load[]) – v4.0 | 문c
PELT(Per-Entity Load Tracking) – v4.0 | 문c

Update on big.LITTLE scheduling experiments | ARM – download pdf
big.LITTLE technology – download ppt
Per-entity load tracking | LWN.net
Load tracking in the scheduler | LWN.net
Scheduler Load tracking update and improvement | Vincent Guittot, Linaro connect 2017 – download pdf

softirq_init()

2017-03-22 (updated 2020-03-11) 문영일

Moved to: Interrupts -5- (Softirq) | 문c

Interrupts -5- (Softirq)

2017-03-22 (updated 2020-03-16) 문영일

Softirq

Characteristics

Softirq is the largest of the interrupt bottom-half mechanisms in the Linux kernel.
A ksoftirqd thread runs on each cpu to process softirqs.
Many existing drivers used tasklets as their interrupt bottom-half mechanism. The tasklet interface now runs as one part of softirq, so the drivers that used tasklets were absorbed as they were.

Two contexts

Softirq switches between two contexts depending on circumstances. However, with the CONFIG_IRQ_FORCED_THREADING kernel option and the “threadirqs” kernel parameter it can be made to run in process context only.

irq context
- Also called hardirq.
- On the first call, the handler functions are invoked directly in irq context.
process context
- Also called task context or thread context.
- If processing in irq context takes 2 ms or longer, the ksoftirqd thread, which runs in process context, is woken up and asked to take over softirq processing.
- Once ksoftirqd has woken up, it handles subsequent pending softirqs until it goes back to sleep.
  - When all pending softirq requests are done, ksoftirqd sleeps again. Later softirq requests are handled in irq context again.
- Handing the work to a task such as ksoftirqd instead of continuing in irq context prevents starvation, where heavy interrupt processing in irq context monopolizes the cpu and ordinary tasks cannot run.

The following figure compares the two contexts in which softirq runs.

Softirq action handlers

Softirq action handlers are hard-coded and maintained by mainline committers, and only for a limited set of interrupts that must be handled very quickly.

include/linux/interrupt.h

/* PLEASE, avoid to allocate new softirqs, if you need not _really_ high
   frequency threaded job scheduling. For almost all the purposes
   tasklets are more than enough. F.e. all serial device BHs et
   al. should be converted to tasklets, not to softirqs.
 */

enum
{
        HI_SOFTIRQ=0,
        TIMER_SOFTIRQ,
        NET_TX_SOFTIRQ,
        NET_RX_SOFTIRQ,
        BLOCK_SOFTIRQ,
        IRQ_POLL_SOFTIRQ,
        TASKLET_SOFTIRQ,
        SCHED_SOFTIRQ,
        HRTIMER_SOFTIRQ, /* Unused, but kept as tools rely on the
                            numbering. Sigh! */
        RCU_SOFTIRQ,    /* Preferable RCU should always be the last softirq */

        NR_SOFTIRQS
};

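The following figure shows the processing priority of softirqs and each softirq action handler.

softirq stack

When irq context processing completes, irq_exit() is called, and pending softirqs may then continue to be processed in irq context. Only in this case does the architecture decide whether to use a dedicated softirq stack or simply keep using the irq stack.

With the HAVE_IRQ_EXIT_ON_IRQ_STACK kernel option, the irq stack is used as it is.
- x86, powerpc, sparc, s390, and similar systems use a dedicated softirq stack.
- Other systems such as arm and arm64 keep using the irq stack through the fallback routine even without this kernel option.

The following figure compares the stacks used when softirq service continues in irq context.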

Creating the ksoftirqd threads

spawn_ksoftirqd()

kernel/softirq.c

static __init int spawn_ksoftirqd(void)
{
        cpuhp_setup_state_nocalls(CPUHP_SOFTIRQ_DEAD, "softirq:dead", NULL,
                                  takeover_tasklets);
        BUG_ON(smpboot_register_percpu_thread(&softirq_threads));

        return 0;
}
early_initcall(spawn_ksoftirqd);

Starts ksoftirqd on each cpu.

In code lines 3~4, takeover_tasklets() is registered to be called whenever a cpu goes offline and enters the CPUHP_SOFTIRQ_DEAD state.
In code line 5, ksoftirqd is started on each cpu.

softirq_threads

kernel/softirq.c

static struct smp_hotplug_thread softirq_threads = {
        .store                  = &ksoftirqd,
        .thread_should_run      = ksoftirqd_should_run,
        .thread_fn              = run_ksoftirqd,
        .thread_comm            = "ksoftirqd/%u",
};

The following figure shows how ksoftirqd is started for each cpu.

Tasklet migration on cpu off

takeover_tasklets()

kernel/softirq.c

static int takeover_tasklets(unsigned int cpu)
{
        /* CPU is dead, so no lock needed. */
        local_irq_disable();

        /* Find end, append list for that CPU. */
        if (&per_cpu(tasklet_vec, cpu).head != per_cpu(tasklet_vec, cpu).tail) {
                *__this_cpu_read(tasklet_vec.tail) = per_cpu(tasklet_vec, cpu).head;
                __this_cpu_write(tasklet_vec.tail, per_cpu(tasklet_vec, cpu).tail);
                per_cpu(tasklet_vec, cpu).head = NULL;
                per_cpu(tasklet_vec, cpu).tail = &per_cpu(tasklet_vec, cpu).head;
        }
        raise_softirq_irqoff(TASKLET_SOFTIRQ);

        if (&per_cpu(tasklet_hi_vec, cpu).head != per_cpu(tasklet_hi_vec, cpu).tail) {
                *__this_cpu_read(tasklet_hi_vec.tail) = per_cpu(tasklet_hi_vec, cpu).head;
                __this_cpu_write(tasklet_hi_vec.tail, per_cpu(tasklet_hi_vec, cpu).tail);
                per_cpu(tasklet_hi_vec, cpu).head = NULL;
                per_cpu(tasklet_hi_vec, cpu).tail = &per_cpu(tasklet_hi_vec, cpu).head;
        }
        raise_softirq_irqoff(HI_SOFTIRQ);

        local_irq_enable();
        return 0;
}

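When @cpu goes offline, the tasklets of that @cpu are moved to the local cpu that is currently running.

In code lines 7~13, the entries of the tasklet_vec list of @cpu are moved to the current local cpu and the tasklet softirq is raised.
In code lines 15~21, the entries of the tasklet_hi_vec list of @cpu are moved to the current local cpu and the hi softirq is raised.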
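
The list move relies on each list keeping a tail pointer that points at the last next field (or at head when the list is empty). A minimal single-threaded sketch of that splice, with hypothetical names (`struct node`, `struct vec`, `vec_takeover()`) and no per-cpu or irq handling:

```c
#include <assert.h>
#include <stddef.h>

struct node {
	struct node *next;
	int id;
};

/* head/tail pair in the style of tasklet_vec: tail points at the last link. */
struct vec {
	struct node *head;
	struct node **tail;
};

static void vec_init(struct vec *v)
{
	v->head = NULL;
	v->tail = &v->head;	/* empty: tail points back at head */
}

static void vec_add(struct vec *v, struct node *n)
{
	n->next = NULL;
	*v->tail = n;
	v->tail = &n->next;
}

/* Append every entry of src to dst in O(1) and leave src empty. */
static void vec_takeover(struct vec *dst, struct vec *src)
{
	assert(dst != src);

	if (&src->head != src->tail) {	/* src is not empty */
		*dst->tail = src->head;
		dst->tail = src->tail;
		vec_init(src);
	}
}
```

No list walk is needed: the first assignment links the dead cpu's chain behind the local one, and the second adopts its tail.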

Registering SMP hotplug threads

Registers kernel threads that run on every online cpu. The following threads use the smp hotplug registration facility.

run_ksoftirqd – “ksoftirqd/<cpu>”
cpu_stopper_thread() – “migration/<cpu>”
rcu_cpu_kthread() – “rcuc/<cpu>”
cpuhp_thread_fun() – “cpuhp/<cpu>”
idle_inject_fn() – “idle_inject/<cpu>”

smpboot_register_percpu_thread()

kernel/smpboot.c

/**
 * smpboot_register_percpu_thread - Register a per_cpu thread related
 *                                          to hotplug
 * @plug_thread:        Hotplug thread descriptor
 *
 * Creates and starts the threads on all online cpus.
 */

int smpboot_register_percpu_thread(struct smp_hotplug_thread *plug_thread)
{
        unsigned int cpu;               
        int ret = 0;

        get_online_cpus();
        mutex_lock(&smpboot_threads_lock);
        for_each_online_cpu(cpu) {
                ret = __smpboot_create_thread(plug_thread, cpu);
                if (ret) {
                        smpboot_destroy_threads(plug_thread);
                        goto out;       
                }
                smpboot_unpark_thread(plug_thread, cpu);
        }
        list_add(&plug_thread->list, &hotplug_threads);
out:
        mutex_unlock(&smpboot_threads_lock);
        put_online_cpus();
        return ret;
}
EXPORT_SYMBOL_GPL(smpboot_register_percpu_thread);

Makes the hotplug thread described by @plug_thread run on every online cpu.

In code lines 8~15, for each online cpu, the thread function in @plug_thread is forked and then unparked.
In code line 16, the hotplug thread descriptor is added to the global hotplug_threads list.

__smpboot_create_thread()

kernel/smpboot.c

static int
__smpboot_create_thread(struct smp_hotplug_thread *ht, unsigned int cpu)
{
        struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);
        struct smpboot_thread_data *td;

        if (tsk)
                return 0;

        td = kzalloc_node(sizeof(*td), GFP_KERNEL, cpu_to_node(cpu));
        if (!td)
                return -ENOMEM;
        td->cpu = cpu;
        td->ht = ht;

        tsk = kthread_create_on_cpu(smpboot_thread_fn, td, cpu,
                                    ht->thread_comm);
        if (IS_ERR(tsk)) {
                kfree(td);
                return PTR_ERR(tsk);
        }
        /*
         * Park the thread so that it could start right on the CPU
         * when it is available.
         */
        kthread_park(tsk);
        get_task_struct(tsk);
        *per_cpu_ptr(ht->store, cpu) = tsk;
        if (ht->create) {
                /*
                 * Make sure that the task has actually scheduled out
                 * into park position, before calling the create
                 * callback. At least the migration thread callback
                 * requires that the task is off the runqueue.
                 */
                if (!wait_task_inactive(tsk, TASK_PARKED))
                        WARN_ON(1);
                else
                        ht->create(cpu);
        }
        return 0;
}

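Forks the hotplug thread described by @ht so that it runs on @cpu. The created task is named from @ht->thread_comm. Returns 0 on success.

In code lines 4~8, the @ht->store value for the requested @cpu is read into tsk. If it is already set, the function returns.
In code lines 10~14, a smpboot_thread_data structure is allocated and filled with the @cpu and @ht arguments.
In code lines 16~21, a thread is created on @cpu with the thread data @td. @ht->thread_comm is used as the task name.
- smpboot_thread_fn() is always the function that is started first, and it manages the lifetime of the thread.
In code line 26, the thread is initially put into the park state.
In code lines 27~28, the created task is stored in @ht->store.
In code lines 29~40, if a function is registered in @ht->create, it is called.
- The migration kernel thread registers and uses cpu_stop_create().
In code line 41, the success value 0 is returned.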

smpboot_thread_fn()

kernel/smpboot.c

/**
 * smpboot_thread_fn - percpu hotplug thread loop function
 * @data:       thread data pointer
 *
 * Checks for thread stop and park conditions. Calls the necessary
 * setup, cleanup, park and unpark functions for the registered
 * thread.
 *
 * Returns 1 when the thread should exit, 0 otherwise.
 */

static int smpboot_thread_fn(void *data)
{
        struct smpboot_thread_data *td = data;
        struct smp_hotplug_thread *ht = td->ht;

        while (1) {
                set_current_state(TASK_INTERRUPTIBLE);
                preempt_disable();
                if (kthread_should_stop()) {
                        __set_current_state(TASK_RUNNING);
                        preempt_enable();
                        if (ht->cleanup)
                                ht->cleanup(td->cpu, cpu_online(td->cpu));
                        kfree(td);
                        return 0;
                }

                if (kthread_should_park()) {
                        __set_current_state(TASK_RUNNING);
                        preempt_enable();
                        if (ht->park && td->status == HP_THREAD_ACTIVE) {
                                BUG_ON(td->cpu != smp_processor_id());
                                ht->park(td->cpu);
                                td->status = HP_THREAD_PARKED;
                        }
                        kthread_parkme();
                        /* We might have been woken for stop */
                        continue;
                }

                BUG_ON(td->cpu != smp_processor_id());

                /* Check for state change setup */
                switch (td->status) {
                case HP_THREAD_NONE:
                        __set_current_state(TASK_RUNNING);
                        preempt_enable();
                        if (ht->setup)
                                ht->setup(td->cpu);
                        td->status = HP_THREAD_ACTIVE;
                        continue;

                case HP_THREAD_PARKED:
                        __set_current_state(TASK_RUNNING);
                        preempt_enable();
                        if (ht->unpark)
                                ht->unpark(td->cpu);
                        td->status = HP_THREAD_ACTIVE;
                        continue;
                }

                if (!ht->thread_should_run(td->cpu)) {
                        preempt_enable_no_resched();
                        schedule();
                } else {
                        __set_current_state(TASK_RUNNING);
                        preempt_enable();
                        ht->thread_fn(td->cpu);
                }
        }
}

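The per-cpu hotplug thread loop function. It loops forever and calls the registered functions whenever requested.

In code lines 7~8, the current task is put into the TASK_INTERRUPTIBLE state and kernel preemption is disabled.
In code lines 9~16, if the KTHREAD_SHOULD_STOP flag is set on the current task, the thread is shut down.
In code lines 18~29, if the KTHREAD_SHOULD_PARK flag is set on the current task, the thread is put into the park state and sleeps. When it wakes up, the loop continues.
In code lines 34~41, if the smp hotplug thread status is HP_THREAD_NONE, the task is set to the running state, kernel preemption is enabled, the ht->setup function is run, and the status is changed to active.
In code lines 43~50, if the smp hotplug thread status is HP_THREAD_PARKED, the task is set to the running state, kernel preemption is enabled, the ht->unpark function is run, and the status is changed to active.
In code lines 52~54, ht->thread_should_run is called, and if it returns false the thread reschedules. (sleep)
- For the ksoftirqd kernel thread, ksoftirqd_should_run() is called to check whether there are softirqs to handle.
In code lines 55~59, the task is set to TASK_RUNNING, preemption is enabled, and the work function of the thread is called.
- For the ksoftirqd kernel thread, run_ksoftirqd() is called.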
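
The decisions made in one pass of this loop can be written as a pure function. This is a simplified sketch under stated assumptions: `hp_step()`, `enum hp_action`, and `struct hp_input` are hypothetical, the optional callbacks (setup, park, unpark, cleanup) are assumed to be present, and sleeping, preemption, and task states are not modeled.

```c
#include <assert.h>
#include <stdbool.h>

enum hp_status { HP_THREAD_NONE = 0, HP_THREAD_ACTIVE, HP_THREAD_PARKED };

/* What one loop iteration ends up doing (hypothetical labels). */
enum hp_action { ACT_EXIT, ACT_PARK, ACT_SETUP, ACT_UNPARK, ACT_SLEEP, ACT_RUN };

struct hp_input {
	bool should_stop;	/* kthread_should_stop() */
	bool should_park;	/* kthread_should_park() */
	bool should_run;	/* ht->thread_should_run() */
};

static enum hp_action hp_step(enum hp_status *status, const struct hp_input *in)
{
	assert(status && in);

	if (in->should_stop)
		return ACT_EXIT;		/* cleanup and leave the loop */

	if (in->should_park) {
		if (*status == HP_THREAD_ACTIVE)
			*status = HP_THREAD_PARKED;
		return ACT_PARK;
	}

	switch (*status) {
	case HP_THREAD_NONE:			/* first run on this cpu */
		*status = HP_THREAD_ACTIVE;
		return ACT_SETUP;
	case HP_THREAD_PARKED:			/* coming back from park */
		*status = HP_THREAD_ACTIVE;
		return ACT_UNPARK;
	default:
		break;
	}

	return in->should_run ? ACT_RUN : ACT_SLEEP;
}
```

Stop is checked before park, and the status transitions come before the work function, which matches the order of the checks in the listing.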

kernel/smpboot.c

enum {
        HP_THREAD_NONE = 0,
        HP_THREAD_ACTIVE,
        HP_THREAD_PARKED,
};

The following figure shows the processing flow for each SMP hotplug thread status.

ksoftirqd_should_run()

kernel/softirq.c

static int ksoftirqd_should_run(unsigned int cpu)
{
        return local_softirq_pending();
}

Returns whether there are softirqs to handle.

ksoftirqd thread

The following figure shows how a raised softirq branches to the handler routine of each softirq. To raise a specific softirq, raise_softirq() is used.

run_ksoftirqd()

kernel/softirq.c

static void run_ksoftirqd(unsigned int cpu)
{
        local_irq_disable();
        if (local_softirq_pending()) {
                /*
                 * We can safely run softirq on inline stack, as we are not deep
                 * in the task stack here.
                 */
                __do_softirq();
                local_irq_enable();
                cond_resched_rcu_qs();
                return;
        }
        local_irq_enable();
}

With local interrupts disabled, checks for pending softirqs and, if there are any, calls __do_softirq() to run their handlers.

local_softirq_pending()

include/linux/irq_cpustat.h

#define local_softirq_pending() (__this_cpu_read(local_softirq_pending_ref))

Returns whether the current cpu has softirqs to handle.

irq_stat

kernel/softirq.c

#ifndef __ARCH_IRQ_STAT
DEFINE_PER_CPU_ALIGNED(irq_cpustat_t, irq_stat);
EXPORT_SYMBOL(irq_stat);
#endif

do_softirq()

kernel/softirq.c

asmlinkage __visible void do_softirq(void)
{
        __u32 pending;
        unsigned long flags;

        if (in_interrupt())
                return;

        local_irq_save(flags);

        pending = local_softirq_pending();

        if (pending && !ksoftirqd_running(pending))
                do_softirq_own_stack();

        local_irq_restore(flags);
}

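Returns immediately when called in interrupt context. Otherwise, with local interrupts disabled, it checks for pending softirqs and, unless ksoftirqd is already running for them, runs their handlers through do_softirq_own_stack().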

do_softirq_own_stack()

include/linux/interrupt.h

static inline void do_softirq_own_stack(void)
{
        __do_softirq();
}

Calls the softirq handler functions for the pending softirqs.

The code behind this function differs by architecture. On arm and arm64 it behaves as in the code above.

__do_softirq()

kernel/softirq.c

asmlinkage __visible void __softirq_entry __do_softirq(void)
{
        unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
        unsigned long old_flags = current->flags;
        int max_restart = MAX_SOFTIRQ_RESTART;
        struct softirq_action *h;
        bool in_hardirq;
        __u32 pending;
        int softirq_bit;

        /*
         * Mask out PF_MEMALLOC s current task context is borrowed for the
         * softirq. A softirq handled such as network RX might set PF_MEMALLOC
         * again if the socket is related to swap
         */
        current->flags &= ~PF_MEMALLOC;

        pending = local_softirq_pending();
        account_irq_enter_time(current);
                
        __local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
        in_hardirq = lockdep_softirq_start();

restart:
        /* Reset the pending bitmask before enabling irqs */
        set_softirq_pending(0);

        local_irq_enable();

        h = softirq_vec;

        while ((softirq_bit = ffs(pending))) {
                unsigned int vec_nr;
                int prev_count;

                h += softirq_bit - 1;

                vec_nr = h - softirq_vec;
                prev_count = preempt_count();

                kstat_incr_softirqs_this_cpu(vec_nr);

                trace_softirq_entry(vec_nr);
                h->action(h);
                trace_softirq_exit(vec_nr);
                if (unlikely(prev_count != preempt_count())) {
                        pr_err("huh, entered softirq %u %s %p with preempt_count %08x, exited with %08x?\n",
                               vec_nr, softirq_to_name[vec_nr], h->action,
                               prev_count, preempt_count());
                        preempt_count_set(prev_count);
                }
                h++;
                pending >>= softirq_bit;
        }

        if (__this_cpu_read(ksoftirqd) == current)
                rcu_bh_qs();
        local_irq_disable();

        pending = local_softirq_pending();
        if (pending) {
                if (time_before(jiffies, end) && !need_resched() &&
                    --max_restart)
                        goto restart;

                wakeup_softirqd();
        }

        lockdep_softirq_end(in_hardirq);
        account_irq_exit_time(current);
        __local_bh_enable(SOFTIRQ_OFFSET);
        WARN_ON_ONCE(in_interrupt());
        tsk_restore_flags(current, old_flags, PF_MEMALLOC);
}

If there are softirqs to process, the corresponding softirq handler functions are called. This routine can be called from either irq context or process context.

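In code line 3, the softirq processing deadline is computed as the jiffies value 2 ms from now.
In code line 4, the current task's flags are saved.
In code line 5, the maximum number of restarts is limited to 10.
In code line 16, PF_MEMALLOC is cleared from the current task's flags so that softirq handlers cannot use the emergency memory reserves. The exception is NET_RX_SOFTIRQ when swap over the network is in use: it must allocate from slab to build packets even when memory is low, so that handler sets PF_MEMALLOC again itself.
In code line 18, the pending softirqs are read.
In code line 19, when the CONFIG_IRQ_TIME_ACCOUNTING kernel option is enabled, the irq entry time is recorded for irq time accounting.
In code line 21, SOFTIRQ_OFFSET is added to preempt_count so that the kernel is not preempted.
In code lines 26~28, the __softirq_pending flags are cleared before irqs are enabled, and then irqs are enabled.
In code lines 32~38, the vector number of the highest-priority pending softirq is obtained. ffs() returns the lowest set bit, and a lower vector number means a higher priority.
- vec_nr=0 <- softirq_vec[HI_SOFTIRQ]
In code line 41, the statistics counter for the softirq being processed is incremented by 1.
In code line 44, the softirq handler function is called.
In code lines 46~51, if preempt_count differs from the value read before the handler ran, an error message is printed and preempt_count is restored to that earlier value.
In code lines 52~54, the loop prepares to process the softirq that is next in priority.
In code lines 56~57, if the currently running task is ksoftirqd, a quiescent state is reported for RCU-bh.
In code line 58, irqs are disabled.
In code lines 60~64, the pending softirqs are read again; if any remain and all of the following conditions hold, execution jumps back to the restart: label to process them.
- Restart count: within 10
- Processing time: within 2 ms
- No reschedule has been requested.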
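In code line 66, if softirqs are still pending but the restart conditions above are not met, ksoftirqd is woken so that it handles the remaining pending softirqs in process context. (When forced irq threading is active, that is, the CONFIG_IRQ_FORCED_THREADING kernel option together with the "threadirqs" kernel parameter, invoke_softirq() wakes ksoftirqd directly instead of calling this function.)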
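In code line 70, when the CONFIG_IRQ_TIME_ACCOUNTING kernel option is enabled, the irq exit time is recorded for irq time accounting.
In code line 71, SOFTIRQ_OFFSET is subtracted from preempt_count so that the kernel preemption blocked during softirq processing is possible again.
In code line 72, a warning is printed if the cpu is still in interrupt context.
In code line 73, the PF_MEMALLOC flag of the current task is restored to its saved state.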
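
The dispatch loop and the restart bound described above can be sketched in user space. This is only an illustration under stated assumptions: every sim_ name is hypothetical, the 2 ms deadline and the need_resched() check are left out, and a small portable helper stands in for ffs().

```c
#include <assert.h>
#include <stdint.h>

#define SIM_NR_SOFTIRQS 10      /* hypothetical vector count */
#define SIM_MAX_RESTART 10      /* same bound as MAX_SOFTIRQ_RESTART */

typedef void (*sim_action_t)(unsigned int vec_nr);

static sim_action_t sim_vec[SIM_NR_SOFTIRQS];   /* stand-in for softirq_vec[] */
static uint32_t sim_pending;                    /* stand-in for the pending mask */
static unsigned int sim_order[64];              /* order in which handlers ran */
static unsigned int sim_order_len;

/* Stand-in for ffs(): 1-based index of the lowest set bit, 0 if none. */
static int sim_ffs(uint32_t x)
{
        int i;

        for (i = 1; x; i++, x >>= 1)
                if (x & 1u)
                        return i;
        return 0;
}

/* Handler that only records which vector ran. */
static void sim_record(unsigned int vec_nr)
{
        if (sim_order_len < 64)
                sim_order[sim_order_len++] = vec_nr;
}

/* Handler that raises itself again, to exercise the restart limit. */
static void sim_reraise(unsigned int vec_nr)
{
        sim_pending |= 1u << vec_nr;
}

/* One pass over a snapshot of the pending mask, lowest bit first. */
static void sim_run_pending(uint32_t pending)
{
        sim_action_t *h = sim_vec;
        int bit;

        assert((pending >> SIM_NR_SOFTIRQS) == 0);
        while ((bit = sim_ffs(pending))) {
                h += bit - 1;           /* advance to the pending vector */
                if (*h)
                        (*h)((unsigned int)(h - sim_vec));
                h++;
                pending >>= bit;        /* drop the bits already visited */
        }
}

/* Returns 1 when work is left over (where the kernel wakes ksoftirqd). */
static int sim_do_softirq(void)
{
        int max_restart = SIM_MAX_RESTART;
        uint32_t pending = sim_pending;

        for (;;) {
                sim_pending = 0;        /* clear before handlers run */
                sim_run_pending(pending);
                pending = sim_pending;  /* handlers may have raised more */
                if (!pending)
                        return 0;
                if (!--max_restart)
                        return 1;
        }
}
```

With vectors 0, 3 and 6 pending, the handlers run in that order regardless of the order in which the bits were set. A handler that keeps raising itself makes sim_do_softirq() give up after ten passes with the bit still pending, which is the point where the kernel hands the rest to ksoftirqd.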

softirq execution

The following figure shows how a softirq is raised and then executed.

When the interrupt handler of a device wants to defer work to a softirq, it calls raise_softirq() to set the softirq pending flag.
irq_exit(), which is called when leaving irq context, checks whether any softirq is pending and, if so, calls invoke_softirq().

Raising a softirq

raise_softirq()

kernel/softirq.c

void raise_softirq(unsigned int nr)
{
        unsigned long flags;

        local_irq_save(flags);
        raise_softirq_irqoff(nr);
        local_irq_restore(flags);
}

Raises softirq @nr: with local irqs disabled, it sets the pending bit flag that corresponds to softirq number @nr, and then restores the previous irq state.
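
Why raise_softirq() saves and restores the irq state, rather than unconditionally disabling and enabling irqs, can be shown with a small user-space sketch. The sim_ names are hypothetical, and a plain integer stands in for the cpu's interrupt mask.

```c
#include <assert.h>
#include <stdint.h>

static int sim_irqs_enabled = 1;        /* stand-in for the cpu interrupt mask */
static uint32_t sim_pending;            /* stand-in for the pending mask */

/* Disable "irqs" and return the previous state, like local_irq_save(). */
static unsigned long sim_irq_save(void)
{
        unsigned long flags = (unsigned long)sim_irqs_enabled;

        sim_irqs_enabled = 0;
        return flags;
}

/* Put back whatever state the caller had, like local_irq_restore(). */
static void sim_irq_restore(unsigned long flags)
{
        sim_irqs_enabled = (int)flags;
}

static void sim_raise_softirq(unsigned int nr)
{
        unsigned long flags;

        assert(nr < 32);
        flags = sim_irq_save();
        sim_pending |= 1u << nr;        /* done with "irqs" off */
        sim_irq_restore(flags);
}
```

A caller that already runs with irqs disabled stays that way after the call, which an unconditional enable would break.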

raise_softirq_irqoff()

kernel/softirq.c

/*
 * This function must run with irqs disabled!
 */

inline void raise_softirq_irqoff(unsigned int nr)
{
        __raise_softirq_irqoff(nr);

        /*
         * If we're in an interrupt or softirq, we're done
         * (this also catches softirq-disabled code). We will
         * actually run the softirq once we return from
         * the irq or softirq.
         *
         * Otherwise we wake up ksoftirqd to make sure we
         * schedule the softirq soon.
         */
        if (!in_interrupt())
                wakeup_softirqd();
}

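Raises the softirq with the requested number. It must be called with irqs disabled.

In code line 3, the pending bit flag for the requested softirq number is set.
In code lines 14~15, the ksoftirqd thread is woken only when the caller is not in interrupt context (in_interrupt() is also true in softirq context and in softirq-disabled code).
- As soon as ksoftirqd wakes up, it checks the pending flags and runs the softirqs itself.
- When called from irq context, the softirq will run shortly afterwards, still in irq context: irq_exit(), which runs when the irq context ends, calls invoke_softirq().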
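
The in_interrupt() decision can be sketched as a mask test on preempt_count. The bit layout below (8 bits of softirq count starting at bit 8, 4 bits of hardirq count starting at bit 16) is an assumption made for this sketch, so check include/linux/preempt.h for the kernel version in use. The NMI bits are omitted and all sim_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed preempt_count layout for this sketch only. */
#define SIM_SOFTIRQ_SHIFT  8
#define SIM_HARDIRQ_SHIFT  16
#define SIM_SOFTIRQ_OFFSET (1u << SIM_SOFTIRQ_SHIFT)
#define SIM_HARDIRQ_OFFSET (1u << SIM_HARDIRQ_SHIFT)
#define SIM_SOFTIRQ_MASK   (0xffu << SIM_SOFTIRQ_SHIFT)
#define SIM_HARDIRQ_MASK   (0x0fu << SIM_HARDIRQ_SHIFT)

static uint32_t sim_preempt_count;      /* stand-in for preempt_count() */
static uint32_t sim_pending;            /* stand-in for the pending mask */
static int sim_wakeups;                 /* counts simulated ksoftirqd wakeups */

/* Non-zero while any hardirq or softirq count is held. */
static int sim_in_interrupt(void)
{
        return (sim_preempt_count &
                (SIM_HARDIRQ_MASK | SIM_SOFTIRQ_MASK)) != 0;
}

static void sim_raise_softirq_irqoff(unsigned int nr)
{
        assert(nr < 32);
        sim_pending |= 1u << nr;
        if (!sim_in_interrupt())
                sim_wakeups++;          /* stands in for wakeup_softirqd() */
}
```

Only the raise made from plain process context (count 0) triggers a wakeup; raises made while a hardirq or softirq count is held just leave the bit pending for the exit path to pick up.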

__raise_softirq_irqoff()

kernel/softirq.c

void __raise_softirq_irqoff(unsigned int nr)
{
        trace_softirq_raise(nr);
        or_softirq_pending(1UL << nr);
}

Sets the pending bit flag that corresponds to the requested softirq number.

or_softirq_pending

include/linux/interrupt.h

#define or_softirq_pending(x)  (__this_cpu_or(local_softirq_pending_ref, (x)))
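
The macro ORs bits into the pending word of the current cpu only. A minimal user-space sketch, assuming a hypothetical fixed cpu count and passing the cpu index explicitly (the kernel uses per-cpu data instead):

```c
#include <assert.h>
#include <stdint.h>

#define SIM_NR_CPUS 4                   /* hypothetical cpu count */

/* One pending word per cpu. */
static uint32_t sim_softirq_pending[SIM_NR_CPUS];

/* Set bits in one cpu's word without touching the others. */
static void sim_or_softirq_pending(unsigned int cpu, uint32_t x)
{
        assert(cpu < SIM_NR_CPUS);
        sim_softirq_pending[cpu] |= x;
}
```

Because the operation is an OR, raising the same softirq twice before it runs leaves a single pending bit, so the handler runs once for both requests.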

Invoking softirq

invoke_softirq()

kernel/softirq.c

static inline void invoke_softirq(void)
{
        if (ksoftirqd_running(local_softirq_pending()))
                return;

        if (!force_irqthreads) {
#ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
                /*
                 * We can safely execute softirq on the current stack if
                 * it is the irq stack, because it should be near empty
                 * at this stage.
                 */
                __do_softirq();
#else
                /*
                 * Otherwise, irq_exit() is called on the task stack that can
                 * be potentially deep already. So call softirq in its own stack
                 * to prevent from any overrun.
                 */
                do_softirq_own_stack();
#endif
        } else {
                wakeup_softirqd();
        }
}

Invokes the pending softirqs to service them. irq_exit() checks whether any softirq is pending and, if so, calls this function.

The pending softirqs are processed here; if processing them takes 2 ms or longer, ksoftirqd, which runs in process context, is woken to continue the work.

In code lines 3~4, if ksoftirqd, which runs in process context, is already running, it will call and process the pending softirqs by itself, so the function simply returns at this point.
In code lines 6~24, if the "threadirqs" kernel parameter is set, ksoftirqd is woken so that all pending softirqs are handled by ksoftirqd in process context. Otherwise the softirq service is called directly in the current irq context.
- On arm and arm64 systems, the irq stack continues to be used even without the CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK kernel option.
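
The three possible outcomes of invoke_softirq() can be summarized as a small decision function. This is a sketch with hypothetical names; the two inputs stand in for the result of ksoftirqd_running() and for the force_irqthreads flag.

```c
#include <assert.h>

enum sim_outcome {
        SIM_LEAVE_TO_KSOFTIRQD, /* ksoftirqd already running: just return */
        SIM_RUN_IN_IRQ_CONTEXT, /* call __do_softirq() directly */
        SIM_WAKE_KSOFTIRQD      /* forced threading: defer to ksoftirqd */
};

static enum sim_outcome sim_invoke_softirq(int ksoftirqd_running,
                                           int force_irqthreads)
{
        if (ksoftirqd_running)
                return SIM_LEAVE_TO_KSOFTIRQD;
        if (!force_irqthreads)
                return SIM_RUN_IN_IRQ_CONTEXT;
        return SIM_WAKE_KSOFTIRQD;
}
```

The ksoftirqd check comes first, so once ksoftirqd is running the threading setting no longer matters.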

do_softirq_own_stack()

include/linux/interrupt.h

#ifdef __ARCH_HAS_DO_SOFTIRQ
void do_softirq_own_stack(void);
#else
static inline void do_softirq_own_stack(void)
{
        __do_softirq();
}
#endif

arm and arm64 systems, which do not use the __ARCH_HAS_DO_SOFTIRQ option, have no dedicated softirq stack handling: softirqs are processed without switching stacks, on the irq stack that the current irq context is already using.

Waking up ksoftirqd

wakeup_softirqd()

kernel/softirq.c

/*
 * we cannot loop indefinitely here to avoid userspace starvation,
 * but we also don't want to introduce a worst case 1/HZ latency
 * to the pending events, so lets the scheduler to balance
 * the softirq load for us.
 */

static void wakeup_softirqd(void)
{
        /* Interrupts are disabled: no need to stop preemption */
        struct task_struct *tsk = __this_cpu_read(ksoftirqd);

        if (tsk && tsk->state != TASK_RUNNING)
                wake_up_process(tsk);
}

Wakes the ksoftirqd thread of the current cpu if it is not already in the TASK_RUNNING state.

softirq initialization

softirq_init()

kernel/softirq.c

void __init softirq_init(void)
{
        int cpu;

        for_each_possible_cpu(cpu) {
                per_cpu(tasklet_vec, cpu).tail =
                        &per_cpu(tasklet_vec, cpu).head;
                per_cpu(tasklet_hi_vec, cpu).tail =
                        &per_cpu(tasklet_hi_vec, cpu).head;
        }

        open_softirq(TASKLET_SOFTIRQ, tasklet_action);
        open_softirq(HI_SOFTIRQ, tasklet_hi_action);
}

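Performs the initialization needed before softirqs can be used.

In code lines 5~10, the tasklet_vec and tasklet_hi_vec lists are initialized for every possible cpu.
In code line 12, the handler function for TASKLET_SOFTIRQ is registered.
In code line 13, the handler function for HI_SOFTIRQ is registered.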
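
The list initialization sets tail to the address of head. A minimal sketch of that head/tail arrangement, with hypothetical sim_ names, shows why: the tail always points at the slot the next element must be stored in, so appending needs no special case for an empty list.

```c
#include <assert.h>
#include <stddef.h>

struct sim_tasklet {
        struct sim_tasklet *next;
        int id;
};

struct sim_tasklet_head {
        struct sim_tasklet *head;
        struct sim_tasklet **tail;      /* slot to fill on the next append */
};

static void sim_list_init(struct sim_tasklet_head *l)
{
        l->head = NULL;
        l->tail = &l->head;             /* empty list: tail points at head */
}

/* O(1) append with no empty-list branch. */
static void sim_list_append(struct sim_tasklet_head *l, struct sim_tasklet *t)
{
        t->next = NULL;
        *l->tail = t;                   /* fills head or the last next field */
        l->tail = &t->next;
}
```

Appending a to an empty list writes through tail into head; appending b afterwards writes into a.next.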

Registering a specific softirq handler

open_softirq()

kernel/softirq.c

void open_softirq(int nr, void (*action)(struct softirq_action *))
{
        softirq_vec[nr].action = action;
}

Assigns the action handler function to the requested softirq vector number.
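
A minimal sketch of the registration table, assuming a hypothetical vector count and sim_ names. Because the handler receives a pointer to its own table entry, it can recover its vector number by pointer subtraction, the same way __do_softirq() computes vec_nr.

```c
#include <assert.h>

#define SIM_NR_SOFTIRQS 10              /* hypothetical vector count */

struct sim_softirq_action {
        void (*action)(struct sim_softirq_action *);
};

static struct sim_softirq_action sim_softirq_vec[SIM_NR_SOFTIRQS];
static int sim_last_nr = -1;            /* vector number seen by the handler */

static void sim_open_softirq(int nr,
                             void (*action)(struct sim_softirq_action *))
{
        assert(nr >= 0 && nr < SIM_NR_SOFTIRQS);
        sim_softirq_vec[nr].action = action;
}

/* Example handler: recovers its vector number from its table entry. */
static void sim_handler(struct sim_softirq_action *h)
{
        sim_last_nr = (int)(h - sim_softirq_vec);
}
```

After sim_open_softirq(6, sim_handler), calling sim_softirq_vec[6].action(&sim_softirq_vec[6]) leaves sim_last_nr at 6.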

Timer Softirq

run_timer_softirq()

kernel/time/timer.c

/*
 * This function runs timers and the timer-tq in bottom half context.
 */

static void run_timer_softirq(struct softirq_action *h)
{
        struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);

        __run_timers(base);
        if (IS_ENABLED(CONFIG_NO_HZ_COMMON))
                __run_timers(this_cpu_ptr(&timer_bases[BASE_DEF]));
}

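This is the timer softirq handler.

In code lines 3~5, if the lowres timer wheel has expired timers, their registered callback functions are called.
In code lines 6~7, when nohz is supported (CONFIG_NO_HZ_COMMON), expired timers in the separate lowres timer wheel used with nohz (BASE_DEF, which holds deferrable timers) are handled in the same way.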
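
The two-base structure can be sketched in user space as follows. This is not how __run_timers() works internally: a linear scan over a small array is swapped in for the real timer wheel, and all sim_ names and the capacity are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_BASE_STD   0
#define SIM_BASE_DEF   1
#define SIM_MAX_TIMERS 8                /* hypothetical capacity per base */

struct sim_timer {
        unsigned long expires;          /* expiry time in ticks */
        int pending;                    /* 1 while queued */
        void (*fn)(struct sim_timer *);
};

struct sim_timer_base {
        struct sim_timer *timers[SIM_MAX_TIMERS];
        size_t nr;
};

static unsigned long sim_jiffies;       /* stand-in for jiffies */
static int sim_fired;                   /* number of callbacks run */

static void sim_count(struct sim_timer *t)
{
        (void)t;
        sim_fired++;
}

/* Linear scan instead of the real wheel: run every expired timer once. */
static void sim_run_timers(struct sim_timer_base *base)
{
        size_t i;

        for (i = 0; i < base->nr; i++) {
                struct sim_timer *t = base->timers[i];

                if (t->pending && (long)(sim_jiffies - t->expires) >= 0) {
                        t->pending = 0;
                        t->fn(t);
                }
        }
}

/* The standard base always runs; the second base only with nohz. */
static void sim_run_timer_softirq(struct sim_timer_base *bases, int nohz_common)
{
        sim_run_timers(&bases[SIM_BASE_STD]);
        if (nohz_common)
                sim_run_timers(&bases[SIM_BASE_DEF]);
}
```

With nohz_common set to 0, only the standard base is scanned, so an expired timer queued on the second base stays pending.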

References

Interrupts -1- (Interrupt Controller) | 문c
Interrupts -2- (irq chip) | 문c
Interrupts -3- (irq domain) | 문c
Interrupts -4- (Top-Half & Bottom-Half) | 문c
Interrupts -5- (Softirq) | 문c – this post
Interrupts -6- (IPI Cross-call) | 문c
Interrupts -7- (Workqueue 1) | 문c
Interrupts -8- (Workqueue 2) | 문c
Interrupts -9- (GIC v3 Driver) | 문c
Interrupts -10- (irq partition) | 문c
Interrupts -11- (RPI2 IC Driver) | 문c
Interrupts -12- (irq desc) | 문c