문c Blog

Static Keys -2- (Initialization)

2016-04-27, 2019-08-17 문영일

To lower the branch miss ratio of conditionals in the kernel and in modules, and so improve performance, the jump label API built on static keys is used. The kernel locates every such jump label site and initializes it.

The routines involved in static key initialization are as follows.

Kernel boot-up: the jump_label_init() function
Registering the callback invoked on module load: jump_label_init_module()
Invoked each time a module is loaded: jump_label_module_notify()

Static Key initialization at kernel boot-up

jump_label_init()

kernel/jump_label.c

void __init jump_label_init(void)
{
        struct jump_entry *iter_start = __start___jump_table;
        struct jump_entry *iter_stop = __stop___jump_table;
        struct static_key *key = NULL;
        struct jump_entry *iter;

        /*
         * Since we are initializing the static_key.enabled field with
         * with the 'raw' int values (to avoid pulling in atomic.h) in
         * jump_label.h, let's make sure that is safe. There are only two
         * cases to check since we initialize to 0 or 1.
         */
        BUILD_BUG_ON((int)ATOMIC_INIT(0) != 0);
        BUILD_BUG_ON((int)ATOMIC_INIT(1) != 1);

        if (static_key_initialized)
                return;

        cpus_read_lock();
        jump_label_lock();
        jump_label_sort_entries(iter_start, iter_stop);

        for (iter = iter_start; iter < iter_stop; iter++) {
                struct static_key *iterk;

                /* rewrite NOPs */
                if (jump_label_type(iter) == JUMP_LABEL_NOP)
                        arch_jump_label_transform_static(iter, JUMP_LABEL_NOP);

                if (init_section_contains((void *)jump_entry_code(iter), 1))
                        jump_entry_set_init(iter);

                iterk = jump_entry_key(iter);
                if (iterk == key)
                        continue;

                key = iterk;
                static_key_set_entries(key, iter);
        }
        static_key_initialized = true;
        jump_label_unlock();
        cpus_read_unlock();
}

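Initializes jump labels for use. Every jump entry in __jump_table is read and sorted by key; where the jump label type of an entry is nop, one word of the corresponding kernel code (the jump label site) is rewritten to a nop. Each static key is then made to point at the first jump entry that uses it.

In code lines 17~18, static key initialization is restricted to run only once.
In code line 22, the jump entries in __jump_table are heap sorted by key address.
In code lines 24~29, __jump_table is walked and, where the type of a jump label entry is nop, a nop instruction is rewritten at the code address of the jump_label API site that uses the static key.
- On ARM64 the site is already initialized with a nop instruction at compile time, so no nop needs to be rewritten.
In code lines 31~32, if the code of a jump entry lies inside the init section, bit 1 of the entry's key field is set to mark the entry as located in the init section. Once boot-up completes, all code in the init section is freed, and the mark keeps jump entries in the init section from being updated later at runtime.
In code lines 34~36, because the jump label entries are sorted by key address, an entry that uses the same static key as the previous one is skipped.
In code lines 38~39, the jump entry uses a new key. The first jump entry is assigned to the static key, and the existing key type is preserved by the assignment.
In code line 41, the global flag that signals that all static keys have been initialized is set.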
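
The two central steps above, sorting by key address and then letting each key point at the first entry of its run, can be sketched in userspace as follows. This is only an illustration under assumptions, not the kernel code: struct demo_key, struct demo_entry, demo_cmp() and demo_init() are hypothetical names, and the standard qsort() stands in for the kernel's sort().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct demo_entry;

/* Hypothetical stand-in for struct static_key. */
struct demo_key {
        const struct demo_entry *entries;   /* first entry that uses this key */
};

/* Hypothetical stand-in for struct jump_entry. */
struct demo_entry {
        int code;                /* stands in for the patch-site address */
        struct demo_key *key;    /* key this entry belongs to */
};

/* Order entries by key address, in the spirit of jump_label_cmp(). */
static int demo_cmp(const void *a, const void *b)
{
        const struct demo_entry *ea = a;
        const struct demo_entry *eb = b;

        if ((uintptr_t)ea->key < (uintptr_t)eb->key)
                return -1;
        if ((uintptr_t)ea->key > (uintptr_t)eb->key)
                return 1;
        return 0;
}

/*
 * Sort the table, then make every key point at the first entry of its
 * run. Returns the number of distinct keys found.
 */
static size_t demo_init(struct demo_entry *tab, size_t n)
{
        struct demo_key *key = NULL;
        size_t i, nkeys = 0;

        assert(tab != NULL);
        qsort(tab, n, sizeof(*tab), demo_cmp);

        for (i = 0; i < n; i++) {
                if (tab[i].key == key)
                        continue;        /* same key as the previous entry */
                key = tab[i].key;
                key->entries = &tab[i];
                nkeys++;
        }
        return nkeys;
}
```

Because equal keys become adjacent after sorting, a single pass with one remembered key is enough to find the head of every run.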

The figure below shows how the jump_label_init() function operates.

Sorting the jump label entries

jump_label_sort_entries()

kernel/jump_label.c

static void
jump_label_sort_entries(struct jump_entry *start, struct jump_entry *stop)
{
        unsigned long size;
        void *swapfn = NULL;

        if (IS_ENABLED(CONFIG_HAVE_ARCH_JUMP_LABEL_RELATIVE))
                swapfn = jump_label_swap;

        size = (((unsigned long)stop - (unsigned long)start)
                                        / sizeof(struct jump_entry));
        sort(start, size, sizeof(struct jump_entry), jump_label_cmp, swapfn);
}

Sorts the entries of __jump_table by key address using the heap sort algorithm.

jump_label_cmp()

kernel/jump_label.c

static int jump_label_cmp(const void *a, const void *b)
{
        const struct jump_entry *jea = a;
        const struct jump_entry *jeb = b;

        if (jump_entry_key(jea) < jump_entry_key(jeb))
                return -1;

        if (jump_entry_key(jea) > jump_entry_key(jeb))
                return 1;

        return 0;
}

Compares the keys of two jump entries and returns -1 if A's key is smaller, 1 if A's key is larger, and 0 if they are equal.

jump_label_swap()

kernel/jump_label.c

static void jump_label_swap(void *a, void *b, int size)
{
        long delta = (unsigned long)a - (unsigned long)b;
        struct jump_entry *jea = a;
        struct jump_entry *jeb = b;
        struct jump_entry tmp = *jea;

        jea->code       = jeb->code - delta;
        jea->target     = jeb->target - delta;
        jea->key        = jeb->key - delta;

        jeb->code       = tmp.code + delta;
        jeb->target     = tmp.target + delta;
        jeb->key        = tmp.key + delta;
}

Swaps two jump entries. ARM64 uses the CONFIG_HAVE_ARCH_JUMP_LABEL_RELATIVE kernel option, so the entry members hold relative offsets, and delta, the distance between the two jump entries, is applied to every member of both entries.
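
Why the delta correction is needed can be seen in a small userspace sketch. It assumes a simplified, hypothetical struct rel_entry whose fields store "target address minus the field's own address" in uintptr_t (the kernel uses narrower signed fields for code and target). All names here are illustrative and not kernel API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical self-relative entry: each field is an offset from itself. */
struct rel_entry {
        uintptr_t code;
        uintptr_t key;
};

/* Resolve the absolute value a field refers to. */
static uintptr_t rel_code(const struct rel_entry *e)
{
        return (uintptr_t)&e->code + e->code;
}

static uintptr_t rel_key(const struct rel_entry *e)
{
        return (uintptr_t)&e->key + e->key;
}

/* Store absolute values as offsets relative to each field's address. */
static void rel_set(struct rel_entry *e, uintptr_t code, uintptr_t key)
{
        e->code = code - (uintptr_t)&e->code;
        e->key = key - (uintptr_t)&e->key;
        assert(rel_code(e) == code && rel_key(e) == key);
}

/*
 * Swap two entries. A plain copy would keep the old offsets but apply
 * them from a new location, so each offset is corrected by the distance
 * between the two slots.
 */
static void rel_swap(struct rel_entry *a, struct rel_entry *b)
{
        uintptr_t delta = (uintptr_t)a - (uintptr_t)b;
        struct rel_entry tmp = *a;

        a->code = b->code - delta;
        a->key = b->key - delta;

        b->code = tmp.code + delta;
        b->key = tmp.key + delta;
}
```

Unsigned arithmetic is used so that the wrap-around is well defined; the resolved addresses are exchanged exactly even though the raw field values are not simply copied.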

jump_label_type()

kernel/jump_label.c

static enum jump_label_type jump_label_type(struct jump_entry *entry)
{
        struct static_key *key = jump_entry_key(entry);
        bool enabled = static_key_enabled(key);
        bool branch = jump_entry_is_branch(entry);

        /* See the comment in linux/jump_label.h */
        return enabled ^ branch;
}

Gets the jump label type of a jump entry. This is the type to be written into the code. (0=JUMP_LABEL_NOP, 1=JUMP_LABEL_JMP)

In code line 3, gets the static key that the jump entry points to.
In code line 4, gets whether the static key is enabled.
In code line 5, gets the branch bit of the jump entry (0 = unlikely, 1 = likely).
In code line 8, derives the nop/branch type that is finally applied to the code.

The figure below uses the old API to show the nop (JUMP_LABEL_DISABLE) and branch (JUMP_LABEL_ENABLE) behavior for two states (the initial state and the enabled state).

A key initialized with STATIC_KEY_INIT_FALSE or STATIC_KEY_INIT_TRUE always executes the nop at first, and the branch code takes effect through static_key_slow_inc() or static_key_slow_dec().

The figure below uses the new API to show the nop (JUMP_LABEL_DISABLE) and branch (JUMP_LABEL_ENABLE) behavior for three states (the initial state, the enabled state, and the conditional API used).

A key initialized with DEFINE_STATIC_KEY_FALSE or DEFINE_STATIC_KEY_TRUE has its initial nop/branch decided by the combination of static_branch_unlikely() or static_branch_likely() and the enabled state, and the condition is later inverted by the static_branch_enable() and static_branch_disable() APIs.

jump_label_init_type()

kernel/jump_label.c

static enum jump_label_type jump_label_init_type(struct jump_entry *entry)
{
        struct static_key *key = jump_entry_key(entry);
        bool type = static_key_type(key);
        bool branch = jump_entry_is_branch(entry);

        /* See the comment in linux/jump_label.h */
        return type ^ branch;
}

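Gets the initial jump label type of a jump entry. (0=JUMP_LABEL_NOP, 1=JUMP_LABEL_JMP)

In code line 3, gets the static key that the jump entry points to.
In code line 4, gets the type (true/false) of the static key.
In code line 5, gets the branch bit of the jump entry (0 = unlikely, 1 = likely).
In code line 8, computes and returns the nop/branch type that was applied to the initial code.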
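
The two XOR rules used by jump_label_type() and jump_label_init_type() can be tried out with a small userspace sketch. The names demo_type(), demo_init_type(), demo_needs_patch() and the DEMO_ constants are hypothetical; only the formulas (enabled ^ branch, type ^ branch) come from the listings above.

```c
#include <assert.h>
#include <stdbool.h>

enum demo_insn { DEMO_NOP = 0, DEMO_JMP = 1 };

/* dynamic: the instruction the site must hold right now */
static enum demo_insn demo_type(bool enabled, bool branch)
{
        return (enum demo_insn)(enabled ^ branch);
}

/* static: the instruction that was emitted at build time */
static enum demo_insn demo_init_type(bool type, bool branch)
{
        return (enum demo_insn)(type ^ branch);
}

/* A site only needs patching when the two disagree. */
static bool demo_needs_patch(bool enabled, bool type, bool branch)
{
        return demo_type(enabled, branch) != demo_init_type(type, branch);
}

/* Check the four enabled/branch combinations of the dynamic rule. */
static void demo_check_table(void)
{
        assert(demo_type(false, false) == DEMO_NOP);
        assert(demo_type(false, true) == DEMO_JMP);
        assert(demo_type(true, false) == DEMO_JMP);
        assert(demo_type(true, true) == DEMO_NOP);
}
```

Since both rules XOR with the same branch bit, demo_needs_patch() reduces to "enabled differs from type", which is the condition jump_label_add_module() checks before updating a freshly loaded module.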

static_key_enabled()

include/linux/jump_label.h

#define static_key_enabled(x)                                                   \
({                                                                              \
        if (!__builtin_types_compatible_p(typeof(*x), struct static_key) &&     \
            !__builtin_types_compatible_p(typeof(*x), struct static_key_true) &&\
            !__builtin_types_compatible_p(typeof(*x), struct static_key_false)) \
                ____wrong_branch_error();                                       \
        static_key_count((struct static_key *)x) > 0;                           \
})

Returns whether the static key is enabled.

static_key_count()

include/linux/jump_label.h

static inline int static_key_count(struct static_key *key)
{
        return atomic_read(&key->enabled);
}

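Atomically reads the value of key->enabled.

Jump label API with static keys used in modules

The following figure shows the jump label initialization routine for modules registering its notifier block.

Registering the initialization function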

kernel/jump_label.c

early_initcall(jump_label_init_module);

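Registers the jump_label_init_module() function in the .initcall section.

Registered initcall functions are invoked from the kernel_init thread: early initcalls such as this one from do_pre_smp_initcalls(), and the remaining levels from do_initcalls().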

early_initcall()

include/linux/init.h

/*      
 * Early initcalls run before initializing SMP.
 *
 * Only for built-in code, not modules.
 */
#define early_initcall(fn)              __define_initcall(fn, early)

__define_initcall()

include/linux/init.h

/* initcalls are now grouped by functionality into separate 
 * subsections. Ordering inside the subsections is determined
 * by link order. 
 * For backwards compatibility, initcall() puts the call in 
 * the device init subsection.
 *   
 * The `id' arg to __define_initcall() is needed so that multiple initcalls
 * can point at the same handler without causing duplicate-symbol build errors.
 */  
        
#define __define_initcall(fn, id) \
        static initcall_t __initcall_##fn##id __used \
        __attribute__((__section__(".initcall" #id ".init"))) = fn; \
        LTO_REFERENCE_INITCALL(__initcall_##fn##id)

Registers the initcall function in the .initcallearly.init section.

Example) __initcall_jump_label_init_moduleearly = jump_label_init_module;

Module load/unload

jump_label_init_module()

kernel/jump_label.c

static __init int jump_label_init_module(void)
{
        return register_module_notifier(&jump_label_module_nb);
}

Registers the jump_label_module_nb structure on the module notifier chain.

jump_label_module_nb structure

kernel/jump_label.c

struct notifier_block jump_label_module_nb = {
        .notifier_call = jump_label_module_notify,
        .priority = 1, /* higher than tracepoints */
};

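The jump_label_module_notify function is registered as the callback.
The notifier chain is kept sorted by priority with higher values called first, and this block has priority 1, which places it ahead of the tracepoint notifier as the code comment indicates.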
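
How a priority-ordered chain behaves can be sketched in userspace. This is a simplified model under assumptions, not the kernel's notifier implementation: struct demo_nb, demo_register(), demo_call_chain(), demo_record() and struct demo_log are hypothetical names, and the tracepoint priority of 0 used in the usage note is assumed for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct notifier_block. */
struct demo_nb {
        int (*call)(struct demo_nb *self, unsigned long val, void *data);
        struct demo_nb *next;
        int priority;
};

/* Small log used by the example callback below. */
struct demo_log {
        int prio[8];
        int n;
};

/* Insert so that blocks with a higher priority come first. */
static void demo_register(struct demo_nb **head, struct demo_nb *n)
{
        while (*head && n->priority <= (*head)->priority)
                head = &(*head)->next;
        n->next = *head;
        *head = n;
}

/* Call every block in chain order; returns how many were called. */
static int demo_call_chain(struct demo_nb *head, unsigned long val, void *data)
{
        int calls = 0;

        for (; head; head = head->next) {
                head->call(head, val, data);
                calls++;
        }
        return calls;
}

/* Example callback: records the priority of the block being called. */
static int demo_record(struct demo_nb *self, unsigned long val, void *data)
{
        struct demo_log *log = data;

        (void)val;
        assert(log->n < 8);
        log->prio[log->n++] = self->priority;
        return 0;
}
```

Registering a priority 0 block first and a priority 1 block second still results in the priority 1 block being called first, which is the ordering the "higher than tracepoints" comment relies on.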

Calling the jump label function on module load/unload

The following figure shows jump_label_module_notify() being called through the registered notifier block each time a module is loaded or unloaded.

jump_label_module_notify()

kernel/jump_label.c

static int
jump_label_module_notify(struct notifier_block *self, unsigned long val,
                         void *data)
{
        struct module *mod = data;
        int ret = 0;

        cpus_read_lock();
        jump_label_lock();

        switch (val) {
        case MODULE_STATE_COMING:
                ret = jump_label_add_module(mod);
                if (ret) {
                        WARN(1, "Failed to allocate memory: jump_label may not work properly.\n");
                        jump_label_del_module(mod);
                }
                break;
        case MODULE_STATE_GOING:
                jump_label_del_module(mod);
                break;
        }

        jump_label_unlock();
        cpus_read_unlock();

        return notifier_from_errno(ret);
}

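Called for each module when it is loaded, unloaded, or initialized. Of these events, the following two are handled.

MODULE_STATE_COMING
- The event delivered when a module has been loaded; static_key_mod objects are allocated and linked to the static keys.
- The notifier chain is called from the load_module() path.
MODULE_STATE_GOING
- The event delivered when a module is being unloaded; among the static_key_mod objects linked to a static key, the one that belongs to this module is found and deleted.
- The notifier chain is called from the delete_module() path.

Jump label initialization on module load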

jump_label_add_module()

kernel/jump_label.c

static int jump_label_add_module(struct module *mod)
{
        struct jump_entry *iter_start = mod->jump_entries;
        struct jump_entry *iter_stop = iter_start + mod->num_jump_entries;
        struct jump_entry *iter;
        struct static_key *key = NULL;
        struct static_key_mod *jlm, *jlm2;

        /* if the module doesn't have jump label entries, just return */
        if (iter_start == iter_stop)
                return 0;

        jump_label_sort_entries(iter_start, iter_stop);

        for (iter = iter_start; iter < iter_stop; iter++) {
                struct static_key *iterk;

                if (within_module_init(jump_entry_code(iter), mod))
                        jump_entry_set_init(iter);

                iterk = jump_entry_key(iter);
                if (iterk == key)
                        continue;

                key = iterk;
                if (within_module((unsigned long)key, mod)) {
                        static_key_set_entries(key, iter);
                        continue;
                }
                jlm = kzalloc(sizeof(struct static_key_mod), GFP_KERNEL);
                if (!jlm)
                        return -ENOMEM;
                if (!static_key_linked(key)) {
                        jlm2 = kzalloc(sizeof(struct static_key_mod),
                                       GFP_KERNEL);
                        if (!jlm2) {
                                kfree(jlm);
                                return -ENOMEM;
                        }
                        preempt_disable();
                        jlm2->mod = __module_address((unsigned long)key);
                        preempt_enable();
                        jlm2->entries = static_key_entries(key);
                        jlm2->next = NULL;
                        static_key_set_mod(key, jlm2);
                        static_key_set_linked(key);
                }
                jlm->mod = mod;
                jlm->entries = iter;
                jlm->next = static_key_mod(key);
                static_key_set_mod(key, jlm);
                static_key_set_linked(key);

                /* Only update if we've changed from our initial state */
                if (jump_label_type(iter) != jump_label_init_type(iter))
                        __jump_label_update(key, iter, iter_stop, true);
        }

        return 0;
}

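Sorts (heap sort) the jump label entries used by the module. If a static key is defined inside the module itself, key->entries is made to point at the entry; otherwise the global static key lives in the kernel, so a static_key_mod object is allocated and linked to the key. Then, if the current jump label type differs from the initial type, the instructions at the code addresses of all entries that use the same key are updated.

In code lines 10~11, the function simply returns if the module does not use jump labels.
In code line 13, the jump label entries in the module are sorted (heap sort).
In code lines 15~19, the jump label entries of the module are walked, and if a jump label is used in the module's init section, bit 1 of the entry's key field is set to identify it.
In code lines 21~23, jump label entries linked to a static key that has already been handled are skipped.
In code lines 25~29, this is the first jump label entry for a static key among the sorted entries. If the static key is defined in the module, the static key is linked to this first jump label entry.
In code lines 30~32, the jump label uses a static key defined in the kernel outside the module. A static_key_mod structure is allocated to link this module to the kernel's static key.
In code lines 33~47, if the kernel's static key has never been linked to a module, one more static_key_mod structure is created and linked first.
In code lines 48~52, the static_key_mod structure is filled in and linked into the static_key_mod list used by the kernel's global static key.
In code lines 55~56, only when the initial value of the jump label differs from the current jump label type, all the code sites that the corresponding jump labels point to are updated.

The following figure shows module-B being loaded and its jump labels that use static keys being initialized.

A static_key_mod object is added; its mod member points to the module and its entries member points to the first jump label entry in that module that uses the static key.
Each static key defined inside the module points to the first jump label entry that uses it.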

Jump label cleanup on module unload

jump_label_del_module()

kernel/jump_label.c

static void jump_label_del_module(struct module *mod)
{
        struct jump_entry *iter_start = mod->jump_entries;
        struct jump_entry *iter_stop = iter_start + mod->num_jump_entries;
        struct jump_entry *iter;
        struct static_key *key = NULL;
        struct static_key_mod *jlm, **prev;

        for (iter = iter_start; iter < iter_stop; iter++) {
                if (jump_entry_key(iter) == key)
                        continue;

                key = jump_entry_key(iter);

                if (within_module((unsigned long)key, mod))
                        continue;

                /* No memory during module load */
                if (WARN_ON(!static_key_linked(key)))
                        continue;

                prev = &key->next;
                jlm = static_key_mod(key);

                while (jlm && jlm->mod != mod) {
                        prev = &jlm->next;
                        jlm = jlm->next;
                }

                /* No memory during module load */
                if (WARN_ON(!jlm))
                        continue;

                if (prev == &key->next)
                        static_key_set_mod(key, jlm->next);
                else
                        *prev = jlm->next;

                kfree(jlm);

                jlm = static_key_mod(key);
                /* if only one etry is left, fold it back into the static_key */
                if (jlm->next == NULL) {
                        static_key_set_entries(key, jlm->entries);
                        static_key_clear_linked(key);
                        kfree(jlm);
                }
        }
}

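Finds the external global static keys used by the module, then finds the static_key_mod object linked to each of them, unlinks it, and frees its memory.

In code lines 9~11, loops over the module's jump label entries, which are sorted by key, and skips entries that use the same static key as the previous one so that they are not processed again.
In code lines 13~16, skips the entry if it uses a static key defined inside the module.
In code lines 19~20, prints a warning and skips if the global static key has never been linked.
In code lines 22~32, searches sequentially for the static_key_mod structure linked to the module being unloaded.
In code lines 34~39, unlinks that static_key_mod structure and frees its memory.
In code lines 41~47, if only one static_key_mod object remains linked to the static key, it is removed and freed as well. The global jump label entries linked to that last object are linked back to the static key, and the linked flag is cleared.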
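
The link and unlink steps described for module load and unload can be modelled with a small userspace list. This is a simplified sketch under assumptions: struct demo_mod_link and struct demo_key are hypothetical, an int owner stands in for struct module * (0 meaning the core kernel), and a bool stands in for the linked flag that the kernel keeps in the low bits of the key.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct static_key_mod. */
struct demo_mod_link {
        struct demo_mod_link *next;
        int owner;               /* 0 = core kernel, otherwise a module id */
        int first_entry;         /* stands in for the owner's first jump entry */
};

/* Hypothetical stand-in for struct static_key. */
struct demo_key {
        bool linked;                   /* stands in for the linked flag */
        int first_entry;               /* meaningful while !linked */
        struct demo_mod_link *mods;    /* meaningful while linked */
};

/* Module load: link a module's entries to a key owned by someone else. */
static int demo_add(struct demo_key *key, int owner, int first_entry)
{
        struct demo_mod_link *jlm, *jlm2;

        assert(owner != 0);            /* 0 is reserved for the core kernel */
        jlm = calloc(1, sizeof(*jlm));
        if (!jlm)
                return -1;
        if (!key->linked) {
                /* First module: move the key's own entries into a node. */
                jlm2 = calloc(1, sizeof(*jlm2));
                if (!jlm2) {
                        free(jlm);
                        return -1;
                }
                jlm2->owner = 0;
                jlm2->first_entry = key->first_entry;
                key->mods = jlm2;
                key->linked = true;
        }
        jlm->owner = owner;
        jlm->first_entry = first_entry;
        jlm->next = key->mods;         /* push at the head of the list */
        key->mods = jlm;
        return 0;
}

/* Module unload: drop the module's node, fold back if one node is left. */
static void demo_del(struct demo_key *key, int owner)
{
        struct demo_mod_link **prev = &key->mods, *jlm;

        if (!key->linked)
                return;
        while (*prev && (*prev)->owner != owner)
                prev = &(*prev)->next;
        jlm = *prev;
        if (!jlm)
                return;
        *prev = jlm->next;
        free(jlm);

        jlm = key->mods;
        if (jlm && jlm->next == NULL) {
                key->first_entry = jlm->first_entry;
                key->mods = NULL;
                key->linked = false;
                free(jlm);
        }
}
```

The sketch walks the list through a pointer-to-pointer so that removing the head needs no special case; the kernel listing handles the head separately because its head pointer shares storage with flag bits.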

The following figure shows the static_key_mod objects, which hold the static key's module link information, being destroyed when module-B is unloaded.

References

Static Keys -1- (Core API) | 문c
Static Keys -2- (Initialization) | 문c (this post)

Static Keys | kernel.org
jump label: introduce static_branch() interface | LWN.net
locking/static_keys: Add a new static_key interface

Static Keys -1- (Core API)

2016-04-27, 2019-08-19 문영일

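Static keys are a fast-path solution that combines a GCC feature with kernel code patching so that a condition can be evaluated quickly, and they slightly reduce the probability of branch misses. Static keys are used when a conditional is executed so frequently that high performance matters. Changing the condition patches kernel and module code and requires machine-wide synchronization across all cores (such as instruction cache maintenance), which is very costly, so they should be used only when the condition changes infrequently.

Introduction to the New Interface

Kernel 4.3-rc1 provides a new interface API.

Reference: locking/static_keys: Add a new static_key interface

Deprecated Method

The following usages remain available alongside the new method for some time and are planned to be retired at some later point.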

struct static_key false = STATIC_KEY_INIT_FALSE;
struct static_key true = STATIC_KEY_INIT_TRUE;
static_key_true()
static_key_false()

Old usage example)

        struct static_key key = STATIC_KEY_INIT_FALSE;

        ...

        if (static_key_false(&key))
                /* do unlikely code */;
        else
                /* do likely code */;

New Method

DEFINE_STATIC_KEY_TRUE(key);
DEFINE_STATIC_KEY_FALSE(key);
static_branch_likely()
static_branch_unlikely()

Improved usage example)

        DEFINE_STATIC_KEY_FALSE(key);

        ...

        if (static_branch_unlikely(&key))
                /* do unlikely code */;
        else
                /* do likely code */;

Static Key – Old API

Static Key declaration

STATIC_KEY_INIT

include/linux/jump_label.h

#define STATIC_KEY_INIT STATIC_KEY_INIT_FALSE

#define STATIC_KEY_INIT_TRUE ((struct static_key) \
                { .enabled = ATOMIC_INIT(1) })
#define STATIC_KEY_INIT_FALSE ((struct static_key) \
                { .enabled = ATOMIC_INIT(0) })

Static Key branch

static_key_false()

include/linux/jump_label.h

static __always_inline bool static_key_false(struct static_key *key)
{
        return arch_static_branch(key);
}

static_key_true()

include/linux/jump_label.h

static __always_inline bool static_key_true(struct static_key *key)
{
        return !static_key_false(key);
}

Static Key – New API

Static Key declaration

DEFINE_STATIC_KEY_TRUE() & DEFINE_STATIC_KEY_FALSE()

include/linux/jump_label.h

#define DEFINE_STATIC_KEY_TRUE(name)    \
        struct static_key_true name = STATIC_KEY_TRUE_INIT
#define DEFINE_STATIC_KEY_FALSE(name)   \
        struct static_key_false name = STATIC_KEY_FALSE_INIT

Defines and initializes a true static key or false static key structure with the given name (@name).

include/linux/jump_label.h

#define STATIC_KEY_TRUE_INIT  (struct static_key_true) { .key = STATIC_KEY_INIT_TRUE,  }
#define STATIC_KEY_FALSE_INIT (struct static_key_false){ .key = STATIC_KEY_INIT_FALSE, }

Initializers for the true static key and false static key structures.

include/linux/jump_label.h

#define STATIC_KEY_INIT_TRUE                                    \
        { .enabled = { 1 },                                     \
          { .entries = (void *)JUMP_TYPE_TRUE } }
#define STATIC_KEY_INIT_FALSE                                   \
        { .enabled = { 0 },                                     \
          { .entries = (void *)JUMP_TYPE_FALSE } }

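The least significant bit of entries is used to record the initial setting, true or false.

When a static key is set to true or false at compile time, enabled and the type bit of entries are both set to 1 or both set to 0.
The type bit in entries is fixed at compile time and does not change afterwards.
enabled changes at runtime.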
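
Keeping a flag in the low bits of a pointer works because the pointed-to entries are aligned, which leaves those bits free. The following userspace sketch assumes two reserved low bits and uses hypothetical names (DEMO_TYPE_TRUE, DEMO_TYPE_MASK, struct demo_key and the demo_ helpers); it illustrates the idea rather than reproducing the kernel helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_TYPE_TRUE  ((uintptr_t)1)   /* initial value of the key */
#define DEMO_TYPE_MASK  ((uintptr_t)3)   /* two low bits reserved for flags */

/* Hypothetical key: pointer bits and flag bits share one word. */
struct demo_key {
        uintptr_t entries;
};

/* Read the compile-time type (true/false) from the lowest bit. */
static bool demo_key_type(const struct demo_key *key)
{
        return (key->entries & DEMO_TYPE_TRUE) != 0;
}

/* Point the key at its first entry while preserving the flag bits. */
static void demo_set_entries(struct demo_key *key, void *entries)
{
        uintptr_t type = key->entries & DEMO_TYPE_MASK;

        /* the pointer must be at least 4-byte aligned */
        assert(((uintptr_t)entries & DEMO_TYPE_MASK) == 0);
        key->entries = (uintptr_t)entries | type;
}

/* Recover the pointer by masking the flag bits off. */
static void *demo_entries(const struct demo_key *key)
{
        return (void *)(key->entries & ~DEMO_TYPE_MASK);
}
```

This is why assigning the first jump entry to a key can keep the existing key type: the setter re-applies the saved low bits after storing the new pointer.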

include/linux/jump_label.h

struct static_key_true {
        struct static_key key;
};

struct static_key_false {
        struct static_key key;
};

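The static_key_true or static_key_false structure contains a single static_key structure.

Static Key branch

likely and unlikely change where the compiler places the generated code.

For the static_branch_likely() function
- The code that runs when the condition holds is compiled to sit in line with the statement.
For the static_branch_unlikely() function
- The code that runs when the condition holds is compiled to sit far away from the statement.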

static_branch_likely()

include/linux/jump_label.h

/*
 * Combine the right initial value (type) with the right branch order
 * to generate the desired result.
 *
 *
 * type\branch| likely (1)            | unlikely (0)
 * -----------+-----------------------+------------------
 *            |                       |
 *  true (1)  |    ...                |    ...
 *            |    NOP                |    JMP L
 *            |    <br-stmts>         | 1: ...
 *            | L: ...                |
 *            |                       |
 *            |                       | L: <br-stmts>
 *            |                       |    jmp 1b
 *            |                       |
 * -----------+-----------------------+------------------
 *            |                       |
 *  false (0) |    ...                |    ...
 *            |    JMP L              |    NOP
 *            |    <br-stmts>         | 1: ...
 *            | L: ...                |
 *            |                       |
 *            |                       | L: <br-stmts>
 *            |                       |    jmp 1b
 *            |                       |
 * -----------+-----------------------+------------------
 *
 * The initial value is encoded in the LSB of static_key::entries,
 * type: 0 = false, 1 = true.
 *
 * The branch type is encoded in the LSB of jump_entry::key,
 * branch: 0 = unlikely, 1 = likely.
 *
 * This gives the following logic table:
 *
 *      enabled type    branch    instuction
 * -----------------------------+-----------
 *      0       0       0       | NOP
 *      0       0       1       | JMP
 *      0       1       0       | NOP
 *      0       1       1       | JMP
 *
 *      1       0       0       | JMP
 *      1       0       1       | NOP
 *      1       1       0       | JMP
 *      1       1       1       | NOP
 *
 * Which gives the following functions:
 *
 *   dynamic: instruction = enabled ^ branch
 *   static:  instruction = type ^ branch
 *
 * See jump_label_type() / jump_label_init_type().
 */

#define static_branch_likely(x)                                                 \
({                                                                              \
        bool branch;                                                            \
        if (__builtin_types_compatible_p(typeof(*x), struct static_key_true))   \
                branch = !arch_static_branch(&(x)->key, true);                  \
        else if (__builtin_types_compatible_p(typeof(*x), struct static_key_false)) \
                branch = !arch_static_branch_jump(&(x)->key, true);             \
        else                                                                    \
                branch = ____wrong_branch_error();                              \
        likely(branch);                                                         \
})

As with a likely statement, when the condition is very likely to hold, the instructions that execute on that condition sit right next to the patch site (the nop or jmp).

static_branch_unlikely()

include/linux/jump_label.h

#define static_branch_unlikely(x)                                               \
({                                                                              \
        bool branch;                                                            \
        if (__builtin_types_compatible_p(typeof(*x), struct static_key_true))   \
                branch = arch_static_branch_jump(&(x)->key, false);             \
        else if (__builtin_types_compatible_p(typeof(*x), struct static_key_false)) \
                branch = arch_static_branch(&(x)->key, false);                  \
        else                                                                    \
                branch = ____wrong_branch_error();                              \
        unlikely(branch);                                                       \
})

As with an unlikely statement, when the condition is unlikely to hold, the instructions that execute on that condition are placed far away.
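
The layout hint itself can be tried in userspace with the GCC/Clang builtin __builtin_expect(), which is what likely()/unlikely() style macros are commonly built on. The macro and function names below (demo_likely, demo_unlikely, demo_fast_path, demo_selftest) are hypothetical, and the sketch shows only the hint, not the run-time code patching that static keys add on top.

```c
#include <assert.h>

/* Hypothetical userspace equivalents of likely()/unlikely(). */
#define demo_likely(x)   __builtin_expect(!!(x), 1)
#define demo_unlikely(x) __builtin_expect(!!(x), 0)

/*
 * The hint does not change the result, only which path the compiler
 * is encouraged to lay out as the straight-line (fall-through) code.
 */
static int demo_fast_path(int tracing_enabled, int value)
{
        if (demo_unlikely(tracing_enabled))
                return -value;   /* cold path, may be moved out of line */
        return value + 1;        /* hot path, kept in line */
}

/* Both paths still compute the expected values. */
static void demo_selftest(void)
{
        assert(demo_fast_path(0, 41) == 42);
        assert(demo_fast_path(1, 41) == -41);
}
```

With an ordinary variable the condition is still tested on every call; a static key replaces that test with a single nop or jmp that is patched when the key changes.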

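The following figure shows the code placement and the type (nop/jmp) recorded for the jump entry, depending on the likely and unlikely conditions, when a static key is used.

The following figure shows the code placement, the type (nop/jmp) recorded for the jump entry, and how the code changes at runtime, depending on the likely and unlikely conditions, when a static key is used.

Using jump labels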

arch_static_branch() – ARM32

arch/arm/include/asm/jump_label.h

static __always_inline bool arch_static_branch(struct static_key *key)
{
        asm_volatile_goto("1:\n\t"
                 JUMP_LABEL_NOP "\n\t"
                 ".pushsection __jump_table,  \"aw\"\n\t"
                 ".word 1b, %l[l_yes], %c0\n\t"
                 ".popsection\n\t"
                 : :  "i" (key) :  : l_yes);

        return false;
l_yes:
        return true;
}

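A jump label whose default instruction is a nop. This listing is the single-argument form that the old API static_key_false() shown above calls.

A nop is placed at the call site and three words are pushed into the __jump_table section. The pushed items have the same layout as the jump_entry structure.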

arch_static_branch_jump() – ARM32

arch/arm/include/asm/jump_label.h

static __always_inline bool arch_static_branch_jump(struct static_key *key, bool branch)
{
        asm_volatile_goto("1:\n\t"
                 WASM(b) " %l[l_yes]\n\t"
                 ".pushsection __jump_table,  \"aw\"\n\t"
                 ".word 1b, %l[l_yes], %c0\n\t"
                 ".popsection\n\t"
                 : :  "i" (&((char *)key)[branch]) :  : l_yes);

        return false;
l_yes:
        return true;
}

A jump label whose default instruction is a jmp.

The figure below likewise shows the registration performed by the new static key API.

arch_static_branch() – ARM64

arch/arm64/include/asm/jump_label.h

static __always_inline bool arch_static_branch(struct static_key *key,
                                               bool branch)
{
        asm_volatile_goto(
                "1:     nop                                     \n\t"
                 "      .pushsection    __jump_table, \"aw\"    \n\t"
                 "      .align          3                       \n\t"
                 "      .long           1b - ., %l[l_yes] - .   \n\t"
                 "      .quad           %c0 - .                 \n\t"
                 "      .popsection                             \n\t"
                 :  :  "i"(&((char *)key)[branch]) :  : l_yes);

        return false;
l_yes:
        return true;
}

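A jump label whose default instruction is a nop.

The two 4-byte long values store the code and target addresses as offsets relative to the current location (.).
Finally, the 8-byte quad value stores the key address as an offset relative to the current location (.).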

arch_static_branch_jump() – ARM64

arch/arm64/include/asm/jump_label.h

static __always_inline bool arch_static_branch_jump(struct static_key *key,
                                                    bool branch)
{
        asm_volatile_goto(
                "1:     b               %l[l_yes]               \n\t"
                 "      .pushsection    __jump_table, \"aw\"    \n\t"
                 "      .align          3                       \n\t"
                 "      .long           1b - ., %l[l_yes] - .   \n\t"
                 "      .quad           %c0 - .                 \n\t"
                 "      .popsection                             \n\t"
                 :  :  "i"(&((char *)key)[branch]) :  : l_yes);

        return false;
l_yes:
        return true;
}

A jump label whose default instruction is a jmp.
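
The operand &((char *)key)[branch] stores the key address with the branch bit in its lowest bit, and the first post describes bit 1 of the same field as the init-section mark. A userspace sketch of that encoding, using hypothetical names (struct demo_key and the demo_ helpers) and assuming the key is at least 4-byte aligned:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical key; its alignment leaves the two low address bits free. */
struct demo_key {
        int enabled;
        void *entries;
};

/* What the inline asm operand records: key address plus the branch bit. */
static uintptr_t demo_encode(struct demo_key *key, bool branch)
{
        assert(((uintptr_t)key & 3) == 0);
        return (uintptr_t)&((char *)key)[branch];
}

/* Mark the entry as living in an init section (bit 1). */
static uintptr_t demo_set_init(uintptr_t raw)
{
        return raw | 2;
}

/* Decode the three pieces of information from the single word. */
static struct demo_key *demo_entry_key(uintptr_t raw)
{
        return (struct demo_key *)(raw & ~(uintptr_t)3);
}

static bool demo_entry_is_branch(uintptr_t raw)
{
        return (raw & 1) != 0;
}

static bool demo_entry_is_init(uintptr_t raw)
{
        return (raw & 2) != 0;
}
```

Indexing the key as a char array by branch adds 0 or 1 to its address, which is how the likely/unlikely choice travels from the call site into the jump table without an extra field.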

static_key_enabled()

include/linux/jump_label.h

#define static_key_enabled(x)                                                   \
({                                                                              \
        if (!__builtin_types_compatible_p(typeof(*x), struct static_key) &&     \
            !__builtin_types_compatible_p(typeof(*x), struct static_key_true) &&\
            !__builtin_types_compatible_p(typeof(*x), struct static_key_false)) \
                ____wrong_branch_error();                                       \
        static_key_count((struct static_key *)x) > 0;                           \
})

Returns whether the static key is in the enabled state.

Returns true (1) when the count is 1 or more.

static_key_count()

include/linux/jump_label.h

static inline int static_key_count(struct static_key *key)
{
        return atomic_read(&key->enabled);
}

Returns the static key count.

APIs for changing the key condition at runtime

1) Normal API

static_branch_enable() & static_branch_disable()

include/linux/jump_label.h

/*
 * Normal usage; boolean enable/disable.
 */

#define static_branch_enable(x)                 static_key_enable(&(x)->key)
#define static_branch_disable(x)                static_key_disable(&(x)->key)

Enables (1) or disables (0) the static key, then updates every jump label that uses the changed static key.

static_key_enable()

kernel/jump_label.c

void static_key_enable(struct static_key *key)
{
        cpus_read_lock();
        static_key_enable_cpuslocked(key);
        cpus_read_unlock();
}
EXPORT_SYMBOL_GPL(static_key_enable);

Enables the static key and updates every jump label that uses it.

static_key_enable_cpuslocked()

kernel/jump_label.c

void static_key_enable_cpuslocked(struct static_key *key)
{
        STATIC_KEY_CHECK_USE(key);
        lockdep_assert_cpus_held();

        if (atomic_read(&key->enabled) > 0) {
                WARN_ON_ONCE(atomic_read(&key->enabled) != 1);
                return;
        }

        jump_label_lock();
        if (atomic_read(&key->enabled) == 0) {
                atomic_set(&key->enabled, -1);
                jump_label_update(key);
                /*
                 * See static_key_slow_inc().
                 */
                atomic_set_release(&key->enabled, 1);
        }
        jump_label_unlock();
}
EXPORT_SYMBOL_GPL(static_key_enable_cpuslocked);

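In code lines 6~9, if the key is already enabled, for example because another CPU requested it first, the function simply returns.
In code lines 11~20, the enabled state is checked once more with the lock held. If no other CPU won the race in the meantime, enabled is first set to -1, the jump labels are updated in that state, and once everything is done enabled is set to 1.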

static_key_disable()

kernel/jump_label.c

void static_key_disable(struct static_key *key)
{
        cpus_read_lock();
        static_key_disable_cpuslocked(key);
        cpus_read_unlock();
}
EXPORT_SYMBOL_GPL(static_key_disable);

Disables (0) the static key and updates every jump label that uses it.

static_key_disable_cpuslocked()

kernel/jump_label.c

void static_key_disable_cpuslocked(struct static_key *key)
{
        STATIC_KEY_CHECK_USE(key);
        lockdep_assert_cpus_held();

        if (atomic_read(&key->enabled) != 1) {
                WARN_ON_ONCE(atomic_read(&key->enabled) != 0);
                return;
        }

        jump_label_lock();
        if (atomic_cmpxchg(&key->enabled, 1, 0))
                jump_label_update(key);
        jump_label_unlock();
}
EXPORT_SYMBOL_GPL(static_key_disable_cpuslocked);

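In code lines 6~9, if the key is already disabled, for example because another CPU requested it first, the function simply returns.
In code lines 11~14, the key is switched to the disabled state with the lock held. If the switch from 1 to 0 succeeded without losing a race to another CPU, the jump labels are updated.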
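
The 0, -1, 1 sequence on enable and the 1 to 0 compare-and-exchange on disable can be modelled in userspace with C11 atomics. This is a single-threaded sketch under assumptions: struct demo_key and the demo_ functions are hypothetical, demo_update() stands in for jump_label_update(), the kernel's jump_label_lock() is omitted, and assert() stands in for WARN_ON_ONCE().

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical key with bookkeeping so the sequence can be observed. */
struct demo_key {
        atomic_int enabled;
        int updates;               /* how often the sites were "patched" */
        int seen_during_update;    /* enabled value seen while patching */
};

static void demo_key_init(struct demo_key *key)
{
        atomic_init(&key->enabled, 0);
        key->updates = 0;
        key->seen_during_update = 0;
}

/* Stands in for jump_label_update(). */
static void demo_update(struct demo_key *key)
{
        key->seen_during_update = atomic_load(&key->enabled);
        key->updates++;
}

static void demo_enable(struct demo_key *key)
{
        if (atomic_load(&key->enabled) > 0) {
                assert(atomic_load(&key->enabled) == 1);
                return;                        /* already enabled */
        }
        /* the kernel takes jump_label_lock() around this part */
        if (atomic_load(&key->enabled) == 0) {
                /* non-zero, so the update step treats the key as enabled */
                atomic_store(&key->enabled, -1);
                demo_update(key);
                atomic_store_explicit(&key->enabled, 1, memory_order_release);
        }
}

static void demo_disable(struct demo_key *key)
{
        int expected = 1;

        if (atomic_load(&key->enabled) != 1) {
                assert(atomic_load(&key->enabled) == 0);
                return;                        /* already disabled */
        }
        /* only the caller that really moves 1 -> 0 patches the sites */
        if (atomic_compare_exchange_strong(&key->enabled, &expected, 0))
                demo_update(key);
}
```

The intermediate -1 matters because the update step decides between nop and jmp from the enabled state, so the key has to look enabled while patching is still in progress.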

2) Advanced API

static_branch_inc() & static_branch_dec()

include/linux/jump_label.h

/*
 * Advanced usage; refcount, branch is enabled when: count != 0
 */

#define static_branch_inc(x)            static_key_slow_inc(&(x)->key)
#define static_branch_dec(x)            static_key_slow_dec(&(x)->key)

Increments or decrements the static key counter. When an increment makes it 1 for the first time, or a decrement brings it back to 0, every jump label that uses the static key is updated.

static_key_slow_inc()

kernel/jump_label.c

void static_key_slow_inc(struct static_key *key)
{
        cpus_read_lock();
        static_key_slow_inc_cpuslocked(key);
        cpus_read_unlock();
}
EXPORT_SYMBOL_GPL(static_key_slow_inc);

Increments the static key counter. When it becomes 1 for the first time, every jump label that uses the static key is updated.

static_key_slow_inc_cpuslocked()

kernel/jump_label.c

void static_key_slow_inc_cpuslocked(struct static_key *key)
{
        int v, v1;

        STATIC_KEY_CHECK_USE(key);
        lockdep_assert_cpus_held();

        /*
         * Careful if we get concurrent static_key_slow_inc() calls;
         * later calls must wait for the first one to _finish_ the
         * jump_label_update() process.  At the same time, however,
         * the jump_label_update() call below wants to see
         * static_key_enabled(&key) for jumps to be updated properly.
         *
         * So give a special meaning to negative key->enabled: it sends
         * static_key_slow_inc() down the slow path, and it is non-zero
         * so it counts as "enabled" in jump_label_update().  Note that
         * atomic_inc_unless_negative() checks >= 0, so roll our own.
         */
        for (v = atomic_read(&key->enabled); v > 0; v = v1) {
                v1 = atomic_cmpxchg(&key->enabled, v, v + 1);
                if (likely(v1 == v))
                        return;
        }

        jump_label_lock();
        if (atomic_read(&key->enabled) == 0) {
                atomic_set(&key->enabled, -1);
                jump_label_update(key);
                /*
                 * Ensure that if the above cmpxchg loop observes our positive
                 * value, it must also observe all the text changes.
                 */
                atomic_set_release(&key->enabled, 1);
        } else {
                atomic_inc(&key->enabled);
        }
        jump_label_unlock();
}

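Increments the static key counter. When the counter becomes 1 for the first time, all jump labels that use the static key are updated.

In code lines 20~24, the counter is already greater than 0, so this is not the first increment but a second or later one. The counter is incremented atomically, and if there was no race (the cmpxchg succeeded) the function returns.
In code lines 26~34, with the lock held, if this increment takes the counter up from 0, the counter is first set to -1, the jump labels are updated in that state, and once everything is complete the counter is set to 1.
In code lines 35~37, another CPU has already raised the counter to 1. In this case only the counter is incremented.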
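The increment protocol described above can be modelled in userspace with C11 atomics. This is only a rough sketch: `struct model_key`, `model_lock()`, `model_update()` and `model_key_inc()` are hypothetical names, a spinning `atomic_flag` stands in for `jump_label_mutex`, and "patching" is reduced to a counter.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical userspace model of a static key; not the kernel's struct. */
struct model_key {
    atomic_int enabled;   /* refcount; -1 means "first enable in progress" */
    atomic_flag lock;     /* stands in for jump_label_mutex */
    int updates;          /* how many times the "text patching" ran */
};

static void model_lock(struct model_key *k)
{
    while (atomic_flag_test_and_set_explicit(&k->lock, memory_order_acquire))
        ; /* spin until the flag was previously clear */
}

static void model_unlock(struct model_key *k)
{
    atomic_flag_clear_explicit(&k->lock, memory_order_release);
}

/* Placeholder for jump_label_update(): only counts invocations. */
static void model_update(struct model_key *k)
{
    k->updates++;
}

static void model_key_inc(struct model_key *k)
{
    int v = atomic_load(&k->enabled);

    /* Fast path: already enabled (v > 0), just bump the count. */
    while (v > 0) {
        if (atomic_compare_exchange_weak(&k->enabled, &v, v + 1))
            return;
        /* On failure v holds the freshly observed value; retry or fall out. */
    }

    /* Slow path: first enable, or a first enable is running elsewhere. */
    model_lock(k);
    if (atomic_load(&k->enabled) == 0) {
        atomic_store(&k->enabled, -1);     /* non-zero, so it reads as enabled */
        model_update(k);
        /* Publish 1 only after the update is complete. */
        atomic_store_explicit(&k->enabled, 1, memory_order_release);
    } else {
        /* -1 is only visible while the lock is held, so this must be > 0. */
        assert(atomic_load(&k->enabled) > 0);
        atomic_fetch_add(&k->enabled, 1);
    }
    model_unlock(k);
}
```

Only the 0 to 1 transition pays for the update; later increments stay on the lock-free path as long as no first enable is in flight.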

static_key_slow_dec()

kernel/jump_label.c

void static_key_slow_dec(struct static_key *key)
{
        STATIC_KEY_CHECK_USE(key);
        __static_key_slow_dec(key, 0, NULL);
}
EXPORT_SYMBOL_GPL(static_key_slow_dec);

Decrements the static key counter. When the counter reaches 0, all jump labels that use the static key are updated.

__static_key_slow_dec()

kernel/jump_label.c

static void __static_key_slow_dec(struct static_key *key,
                                  unsigned long rate_limit,
                                  struct delayed_work *work)
{
        cpus_read_lock();
        __static_key_slow_dec_cpuslocked(key, rate_limit, work);
        cpus_read_unlock();
}

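Decrements the static key counter. When the counter reaches 0, all jump labels that use the static key are updated.

If rate_limit is given, the update is handed to the delayed work, which is scheduled to run after a delay of rate_limit.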

__static_key_slow_dec_cpuslocked()

kernel/jump_label.c

static void __static_key_slow_dec_cpuslocked(struct static_key *key,
                                           unsigned long rate_limit,
                                           struct delayed_work *work)
{
        lockdep_assert_cpus_held();

        /*
         * The negative count check is valid even when a negative
         * key->enabled is in use by static_key_slow_inc(); a
         * __static_key_slow_dec() before the first static_key_slow_inc()
         * returns is unbalanced, because all other static_key_slow_inc()
         * instances block while the update is in progress.
         */
        if (!atomic_dec_and_mutex_lock(&key->enabled, &jump_label_mutex)) {
                WARN(atomic_read(&key->enabled) < 0,
                     "jump label: negative count!\n");
                return;
        }

        if (rate_limit) {
                atomic_inc(&key->enabled);
                schedule_delayed_work(work, rate_limit);
        } else {
                jump_label_update(key);
        }
        jump_label_unlock();
}

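Decrements the static key counter. When the counter reaches 0, all jump labels that use the static key are updated.

In code lines 14~18, the counter is decremented. If it has not reached 0 the function returns, printing a warning if the counter has become negative.
In code lines 20~22, if the request came through the jump_label_rate_limit() API, the counter is restored and the work is scheduled after a delay of rate_limit, so the jump labels are updated later by that work.
- This is used within x86 kvm.
In code lines 23~25, all jump labels that use the static key are updated.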
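As a rough userspace illustration of the decrement path, the sketch below models the "decrement, and take the lock only when the count hits zero" behaviour that atomic_dec_and_mutex_lock() provides. All `model_*` names are hypothetical, the lock is a spinning `atomic_flag`, and the rate-limit branch and the negative-count warning are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model; not the kernel's struct static_key. */
struct model_key {
    atomic_int enabled;   /* refcount */
    atomic_flag lock;     /* stands in for jump_label_mutex */
    int updates;          /* counts simulated jump_label_update() calls */
};

static void model_lock(struct model_key *k)
{
    while (atomic_flag_test_and_set_explicit(&k->lock, memory_order_acquire))
        ; /* spin */
}

static void model_unlock(struct model_key *k)
{
    atomic_flag_clear_explicit(&k->lock, memory_order_release);
}

/*
 * Decrement the counter. Returns true, with the lock held, only when
 * this call dropped the counter to 0.
 */
static bool model_dec_and_lock(struct model_key *k)
{
    int v = atomic_load(&k->enabled);

    /* Lock-free as long as the result stays non-zero (v != 1). */
    while (v != 1) {
        if (atomic_compare_exchange_weak(&k->enabled, &v, v - 1))
            return false;
    }

    model_lock(k);
    if (atomic_fetch_sub(&k->enabled, 1) == 1)
        return true;                /* reached 0; caller must unlock */
    model_unlock(k);
    return false;
}

static void model_key_dec(struct model_key *k)
{
    if (!model_dec_and_lock(k))
        return;                     /* still in use, nothing to patch */
    k->updates++;                   /* placeholder for jump_label_update() */
    model_unlock(k);
}
```

Taking the lock only on the final decrement keeps balanced inc/dec pairs cheap while still serialising the one transition that rewrites code.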

Modifying Jump Labels

jump_label_update()

kernel/jump_label.c

static void jump_label_update(struct static_key *key)
{
        struct jump_entry *stop = __stop___jump_table;
        struct jump_entry *entry;
#ifdef CONFIG_MODULES
        struct module *mod;

        if (static_key_linked(key)) {
                __jump_label_mod_update(key);
                return;
        }

        preempt_disable();
        mod = __module_address((unsigned long)key);
        if (mod)
                stop = mod->jump_entries + mod->num_jump_entries;
        preempt_enable();
#endif
        entry = static_key_entries(key);
        /* if there are no users, entry can be NULL */
        if (entry)
                __jump_label_update(key, entry, stop,
                                    system_state < SYSTEM_RUNNING);
}

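When a static key changes, updates the jump entries of the module or kernel in which the static key resides, skipping entries located in an init section. If the request is made during boot-up, entries are updated regardless of the init-section flag.

In code lines 8~11, if the static key is linked to modules, the jump entries of each user are updated through __jump_label_mod_update(), skipping init-section entries. While a module is still being loaded, its entries are updated regardless of the init-section flag.
In code lines 13~17, if the static key itself resides in a module's address range, the update range ends at the last jump entry of that module.
In code lines 19~23, if the static key has a jump entry, the entries outside init sections are updated. If the request is made during boot-up, entries are updated regardless of the init-section flag.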

__jump_label_mod_update()

kernel/jump_label.c

static void __jump_label_mod_update(struct static_key *key)
{
        struct static_key_mod *mod;

        for (mod = static_key_mod(key); mod; mod = mod->next) {
                struct jump_entry *stop;
                struct module *m;

                /*
                 * NULL if the static_key is defined in a module
                 * that does not use it
                 */
                if (!mod->entries)
                        continue;

                m = mod->mod;
                if (!m)
                        stop = __stop___jump_table;
                else
                        stop = m->jump_entries + m->num_jump_entries;
                __jump_label_update(key, mod->entries, stop,
                                    m && m->state == MODULE_STATE_COMING);
        }
}

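Iterates over the modules that use the requested static key and updates their jump entries.

In code lines 5~14, iterates over the modules that use the requested static key. If entries is NULL the key is not used there, so it is skipped.
In code lines 16~22, if no module is specified the range extends to the end of the kernel's jump table, and if a module is specified it extends to the end of that module's jump entries. Jump label entries outside init sections are updated. If the request arrives while the module is being loaded, entries are updated regardless of the init-section flag.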

__jump_label_update()

kernel/jump_label.c

static void __jump_label_update(struct static_key *key,
                                struct jump_entry *entry,
                                struct jump_entry *stop,
                                bool init)
{
        for (; (entry < stop) && (jump_entry_key(entry) == key); entry++) {
                /*
                 * An entry->code of 0 indicates an entry which has been
                 * disabled because it was in an init text area.
                 */
                if (init || !jump_entry_is_init(entry)) {
                        if (kernel_text_address(jump_entry_code(entry)))
                                arch_jump_label_transform(entry, jump_label_type(entry));
                        else
                                WARN_ONCE(1, "can't patch jump_label at %pS",
                                          (void *)jump_entry_code(entry));
                }
        }
}

For the jump entries from @entry up to @stop that belong to @key, patches each entry that is not in an init section, or every entry when the @init argument is true.
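The walk relies on the jump table being sorted by key, so all entries of one key form a contiguous run. A minimal sketch of that loop, with a hypothetical `struct model_entry` in which an integer id replaces the key pointer and a counter replaces the actual code patching:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct jump_entry. */
struct model_entry {
    int key_id;     /* which key this entry belongs to */
    bool is_init;   /* entry sits in an init section */
    int patched;    /* number of times this entry was "patched" */
};

/*
 * Walk the run of entries that belong to key_id, starting at entry and
 * stopping at stop or at the first entry of another key. Returns the
 * number of entries patched.
 */
static int model_update_run(struct model_entry *entry, struct model_entry *stop,
                            int key_id, bool init)
{
    int n = 0;

    assert(entry != NULL && entry <= stop);
    for (; entry < stop && entry->key_id == key_id; entry++) {
        /* Init-section entries are skipped unless the caller forces them. */
        if (init || !entry->is_init) {
            entry->patched++;
            n++;
        }
    }
    return n;
}
```

Because the loop stops at the first foreign key, the caller only needs a pointer to the first entry of the run, which is what the static key stores.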

1) Jump label modification for ARM32

arch_jump_label_transform() – ARM32

void arch_jump_label_transform(struct jump_entry *entry,
                               enum jump_label_type type)
{
        __arch_jump_label_transform(entry, type, false);
}

Patches a single word of code according to the requested type (nop or branch).

arch_jump_label_transform_static() – ARM32

arch/arm/kernel/jump_label.c

void arch_jump_label_transform_static(struct jump_entry *entry,
                                      enum jump_label_type type)
{
        __arch_jump_label_transform(entry, type, true);
}

Early-patches a single word of code according to the requested type (nop or branch).

__arch_jump_label_transform()

arch/arm/kernel/jump_label.c

static void __arch_jump_label_transform(struct jump_entry *entry,
                                        enum jump_label_type type,
                                        bool is_static)
{
        void *addr = (void *)entry->code;
        unsigned int insn;

        if (type == JUMP_LABEL_JMP)
                insn = arm_gen_branch(entry->code, entry->target);
        else
                insn = arm_gen_nop();

        if (is_static)
                __patch_text_early(addr, insn);
        else
                patch_text(addr, insn);
}

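Patches a single word of code according to the requested type, in one of the following two ways.

When called with is_static set to true, the code is being changed for the first time, so it is patched directly. When called with false, the code is already live, so all CPUs are stopped, the kernel text is temporarily mapped through the fixmap area, and then it is patched.

In code lines 8~11, if the type is enable a branch instruction is built, and if it is disable a nop instruction is built.
In code lines 13~16, if an early request was made the early patch code runs; otherwise the regular patch code runs.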

arm_gen_branch()

arch/arm/include/asm/insn.h

static inline unsigned long
arm_gen_branch(unsigned long pc, unsigned long addr)
{
        return __arm_gen_branch(pc, addr, false);
}

Generates an ARM or Thumb branch instruction.

__arm_gen_branch()

arch/arm/kernel/insn.c

unsigned long
__arm_gen_branch(unsigned long pc, unsigned long addr, bool link)
{
        if (IS_ENABLED(CONFIG_THUMB2_KERNEL))
                return __arm_gen_branch_thumb2(pc, addr, link);
        else
                return __arm_gen_branch_arm(pc, addr, link);
}

Generates an ARM or Thumb branch instruction, depending on CONFIG_THUMB2_KERNEL.

__arm_gen_branch_arm()

arch/arm/kernel/insn.c

static unsigned long
__arm_gen_branch_arm(unsigned long pc, unsigned long addr, bool link)
{
        unsigned long opcode = 0xea000000;
        long offset;

        if (link)
                opcode |= 1 << 24;

        offset = (long)addr - (long)(pc + 8);
        if (unlikely(offset < -33554432 || offset > 33554428)) {
                WARN_ON_ONCE(1);
                return 0;
        }

        offset = (offset >> 2) & 0x00ffffff;

        return opcode | offset;
}

Builds an ARM branch instruction and returns it as a word.
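To make the encoding concrete, here is a userspace re-derivation of the same arithmetic. `model_gen_branch_arm()` is a hypothetical name, it assumes word-aligned 32-bit addresses, and it divides by 4 rather than shifting so that negative offsets do not depend on the compiler's right-shift behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns the A32 B/BL encoding, or 0 when the target is out of range. */
static uint32_t model_gen_branch_arm(uint32_t pc, uint32_t addr, bool link)
{
    uint32_t opcode = 0xea000000u;      /* B with condition "always" */
    int64_t offset;

    assert(pc % 4 == 0 && addr % 4 == 0);

    if (link)
        opcode |= 1u << 24;             /* turns B into BL */

    /* In ARM state the PC reads as the instruction address + 8. */
    offset = (int64_t)addr - ((int64_t)pc + 8);
    if (offset < -33554432 || offset > 33554428)
        return 0;                       /* outside the +/-32 MiB reach */

    /* imm24 is the word offset, kept as 24-bit two's complement. */
    return opcode | ((uint32_t)(offset / 4) & 0x00ffffffu);
}
```

For example, a branch from 0x8000 to 0x8010 has a byte offset of 0x8010 - 0x8008 = 8, i.e. a word offset of 2, which gives 0xea000002. A branch from 0x8010 back to 0x8000 has a word offset of -6, which gives 0xeafffffa.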

arm_gen_nop()

arch/arm/include/asm/insn.h

static inline unsigned long
arm_gen_nop(void)
{
#ifdef CONFIG_THUMB2_KERNEL
        return 0xf3af8000; /* nop.w */
#else   
        return 0xe1a00000; /* mov r0, r0 */
#endif
}

Returns the word for the nop instruction: mov r0, r0 for an ARM kernel, or nop.w for a Thumb-2 kernel.

__patch_text_early()

arch/arm/include/asm/patch.h

static inline void __patch_text_early(void *addr, unsigned int insn)
{
        __patch_text_real(addr, insn, false);
}

Patches the instruction at the given address immediately.

This early function is the one used during kernel initialization.

patch_text()

arch/arm/kernel/patch.c

void __kprobes patch_text(void *addr, unsigned int insn)
{
        struct patch patch = {
                .addr = addr,
                .insn = insn,
        };

        stop_machine(patch_text_stop_machine, &patch, NULL);
}

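Stops all CPUs, then patches the instruction at the given address.

While the kernel is running, kernel code is read-only, so one page of the fixmap area is used to map it temporarily for modification.
Note that during kernel initialization this function is not used; __patch_text_early() is used instead.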

patch_text_stop_machine()

arch/arm/kernel/patch.c

static int __kprobes patch_text_stop_machine(void *data)
{
        struct patch *patch = data;

        __patch_text(patch->addr, patch->insn);

        return 0;
}

Writes (patches) the instruction at the given address.

__patch_text()

arch/arm/include/asm/patch.h

static inline void __patch_text(void *addr, unsigned int insn)
{
        __patch_text_real(addr, insn, true);
}

Writes (patches) the instruction at the given address, remapping the page for the write.

__patch_text_real()

arch/arm/kernel/patch.c

void __kprobes __patch_text_real(void *addr, unsigned int insn, bool remap)
{
        bool thumb2 = IS_ENABLED(CONFIG_THUMB2_KERNEL);
        unsigned int uintaddr = (uintptr_t) addr;
        bool twopage = false;
        unsigned long flags;
        void *waddr = addr;
        int size;

        if (remap)
                waddr = patch_map(addr, FIX_TEXT_POKE0, &flags);
        else
                __acquire(&patch_lock);

        if (thumb2 && __opcode_is_thumb16(insn)) {
                *(u16 *)waddr = __opcode_to_mem_thumb16(insn);
                size = sizeof(u16);
        } else if (thumb2 && (uintaddr & 2)) {
                u16 first = __opcode_thumb32_first(insn);
                u16 second = __opcode_thumb32_second(insn);
                u16 *addrh0 = waddr;
                u16 *addrh1 = waddr + 2;

                twopage = (uintaddr & ~PAGE_MASK) == PAGE_SIZE - 2;
                if (twopage && remap)
                        addrh1 = patch_map(addr + 2, FIX_TEXT_POKE1, NULL);

                *addrh0 = __opcode_to_mem_thumb16(first);
                *addrh1 = __opcode_to_mem_thumb16(second);

                if (twopage && addrh1 != addr + 2) {
                        flush_kernel_vmap_range(addrh1, 2);
                        patch_unmap(FIX_TEXT_POKE1, NULL);
                }

                size = sizeof(u32);
        } else {
                if (thumb2)
                        insn = __opcode_to_mem_thumb32(insn);
                else
                        insn = __opcode_to_mem_arm(insn);

                *(u32 *)waddr = insn;
                size = sizeof(u32);
        }

        if (waddr != addr) {
                flush_kernel_vmap_range(waddr, twopage ? size / 2 : size);
                patch_unmap(FIX_TEXT_POKE0, &flags);
        } else
                __release(&patch_lock);

        flush_icache_range((uintptr_t)(addr),
                           (uintptr_t)(addr) + size);
}

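When ARM code is patched without remapping, only the 4 bytes are changed and the i-cache for that 4-byte range is flushed. When remapping is needed, the page is mapped through the fixmap, the 4 bytes of the instruction are changed, the d-cache for that 4-byte range is flushed if the CPU's cache type requires it, the fixmap is unmapped, and then the i-cache is flushed for the same range.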

patch_map()

arch/arm/kernel/patch.c

static void __kprobes *patch_map(void *addr, int fixmap, unsigned long *flags)
        __acquires(&patch_lock)
{
        unsigned int uintaddr = (uintptr_t) addr;
        bool module = !core_kernel_text(uintaddr);
        struct page *page;

        if (module && IS_ENABLED(CONFIG_STRICT_MODULE_RWX))
                page = vmalloc_to_page(addr);
        else if (!module && IS_ENABLED(CONFIG_STRICT_KERNEL_RWX))
                page = virt_to_page(addr);
        else
                return addr;

        if (flags)
                spin_lock_irqsave(&patch_lock, *flags);
        else
                __acquire(&patch_lock);

        set_fixmap(fixmap, page_to_phys(page));

        return (void *) (__fix_to_virt(fixmap) + (uintaddr & ~PAGE_MASK));
}

Maps the page that contains the address through the fixmap.

flush_kernel_vmap_range()

arch/arm/include/asm/cacheflush.h

static inline void flush_kernel_vmap_range(void *addr, int size)
{
        if ((cache_is_vivt() || cache_is_vipt_aliasing()))
          __cpuc_flush_dcache_area(addr, (size_t)size);
}

If the cache is VIVT or aliasing VIPT, flushes the given range from the d-cache.

patch_unmap()

arch/arm/kernel/patch.c

static void __kprobes patch_unmap(int fixmap, unsigned long *flags)
        __releases(&patch_lock)
{
        clear_fixmap(fixmap);

        if (flags)
                spin_unlock_irqrestore(&patch_lock, *flags);
        else
                __release(&patch_lock);
}

Releases the mapping at the given fixmap index.

2) Jump label modification for ARM64

arch_jump_label_transform() – ARM64

arch/arm64/kernel/jump_label.c

void arch_jump_label_transform(struct jump_entry *entry,
                               enum jump_label_type type)
{
        void *addr = (void *)jump_entry_code(entry);
        u32 insn;

        if (type == JUMP_LABEL_JMP) {
                insn = aarch64_insn_gen_branch_imm(jump_entry_code(entry),
                                                   jump_entry_target(entry),
                                                   AARCH64_INSN_BRANCH_NOLINK);
        } else {
                insn = aarch64_insn_gen_nop();
        }

        aarch64_insn_patch_text_nosync(addr, insn);
}

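Writes a jmp instruction or a nop instruction depending on the type of the jump entry.

The analysis of aarch64_insn_gen_branch_imm(), which builds the branch instruction, and aarch64_insn_gen_nop(), which builds the nop instruction, is omitted.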

aarch64_insn_patch_text_nosync() – ARM64

arch/arm64/kernel/insn.c

int __kprobes aarch64_insn_patch_text_nosync(void *addr, u32 insn)
{
        u32 *tp = addr;
        int ret;

        /* A64 instructions must be word aligned */
        if ((uintptr_t)tp & 0x3)
                return -EINVAL;

        ret = aarch64_insn_write(tp, insn);
        if (ret == 0)
                __flush_icache_range((uintptr_t)tp,
                                     (uintptr_t)tp + AARCH64_INSN_SIZE);

        return ret;
}

Writes the instruction @insn at the address @addr, then flushes the instruction cache for that address.

aarch64_insn_write()

int __kprobes aarch64_insn_write(void *addr, u32 insn)
{
        return __aarch64_insn_write(addr, cpu_to_le32(insn));
}

Writes the instruction @insn at the address @addr, after converting it to little-endian byte order.

__aarch64_insn_write()

static int __kprobes __aarch64_insn_write(void *addr, __le32 insn)
{
        void *waddr = addr;
        unsigned long flags = 0;
        int ret;

        raw_spin_lock_irqsave(&patch_lock, flags);
        waddr = patch_map(addr, FIX_TEXT_POKE0);

        ret = probe_kernel_write(waddr, &insn, AARCH64_INSN_SIZE);

        patch_unmap(FIX_TEXT_POKE0);
        raw_spin_unlock_irqrestore(&patch_lock, flags);

        return ret;
}

Writes the instruction @insn at the address @addr.

Interrupts are disabled around the write, and the TEXT_POKE0 fixmap slot is used to map and unmap the page.

patch_map() – ARM64

arch/arm64/kernel/insn.c

static void __kprobes *patch_map(void *addr, int fixmap)
{
        unsigned long uintaddr = (uintptr_t) addr;
        bool module = !core_kernel_text(uintaddr);
        struct page *page;

        if (module && IS_ENABLED(CONFIG_STRICT_MODULE_RWX))
                page = vmalloc_to_page(addr);
        else if (!module)
                page = phys_to_page(__pa_symbol(addr));
        else
                return addr;

        BUG_ON(!page);
        return (void *)set_fixmap_offset(fixmap, page_to_phys(page) +
                        (uintaddr & ~PAGE_MASK));
}

Maps the physical page that backs @addr into the @fixmap slot.

patch_unmap() – ARM64

arch/arm64/kernel/insn.c

static void __kprobes patch_unmap(int fixmap)
{
        clear_fixmap(fixmap);
}

Unmaps the page mapped in the @fixmap slot.

probe_kernel_write()

mm/maccess.c

/**
 * probe_kernel_write(): safely attempt to write to a location
 * @dst: address to write to
 * @src: pointer to the data that shall be written
 * @size: size of the data chunk
 *
 * Safely write to address @dst from the buffer at @src.  If a kernel fault
 * happens, handle that and return -EFAULT.
 */

long __weak probe_kernel_write(void *dst, const void *src, size_t size)
    __attribute__((alias("__probe_kernel_write")));

long __probe_kernel_write(void *dst, const void *src, size_t size)
{
        long ret;
        mm_segment_t old_fs = get_fs();

        set_fs(KERNEL_DS);
        pagefault_disable();
        ret = __copy_to_user_inatomic((__force void __user *)dst, src, size);
        pagefault_enable();
        set_fs(old_fs);

        return ret ? -EFAULT : 0;
}
EXPORT_SYMBOL_GPL(probe_kernel_write);

Structures

static_key structure

include/linux/jump_label.h

struct static_key {
        atomic_t enabled;
/*
 * Note:
 *   To make anonymous unions work with old compilers, the static
 *   initialization of them requires brackets. This creates a dependency
 *   on the order of the struct with the initializers. If any fields
 *   are added, STATIC_KEY_INIT_TRUE and STATIC_KEY_INIT_FALSE may need
 *   to be modified.
 *
 * bit 0 => 1 if key is initially true
 *          0 if initially false
 * bit 1 => 1 if points to struct static_key_mod
 *          0 if points to struct jump_entry
 */
        union {
                unsigned long type;
                struct jump_entry *entries;
                struct static_key_mod *next;
        };
};

Apart from enabled, the remaining three members are grouped into a union to reduce the size.

enabled
- A value that changes at runtime; its initial compile-time value matches the jump type indicated by bit 0 of entries below.
- The counter value changes at runtime through static_key_slow_inc() and static_key_slow_dec().
- For reference, the jump label type is determined at compile time and at runtime as follows.
  - Compile time
    - type ^ branch
  - Runtime
    - !!enabled ^ branch
type
- The two least significant bits determine the type.
  - JUMP_TYPE_FALSE(0)
  - JUMP_TYPE_TRUE(1)
  - JUMP_TYPE_LINKED(2)
*entries
- Points to the first jump entry after sorting by key.
- Holds a jump entry pointer whose lower 2 bits carry the static key's default value.
  - JUMP_TYPE_FALSE(0) and JUMP_TYPE_TRUE(1) are used.
*next
- Holds a static_key_mod pointer with JUMP_TYPE_LINKED(2) added in the lower 2 bits.
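The runtime rule `!!enabled ^ branch` can be written out as a small truth table. The sketch below uses hypothetical `model_*` names and mirrors the two values of jump_label_type.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors JUMP_LABEL_NOP = 0 and JUMP_LABEL_JMP = 1. */
enum model_label_type { MODEL_LABEL_NOP = 0, MODEL_LABEL_JMP = 1 };

/*
 * enabled: the key's counter (any non-zero value counts as enabled,
 *          including the transient -1 used during the first enable).
 * branch:  the branch flag recorded in bit 0 of the jump entry's key.
 */
static enum model_label_type model_label_type(int enabled, bool branch)
{
    return (enum model_label_type)(!!enabled ^ (int)branch);
}
```

With branch = 0 the site holds a nop while the key is disabled and a jmp once it is enabled; with branch = 1 the mapping is inverted.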

jump_entry structure – ARM32

include/linux/jump_label.h

struct jump_entry {
        jump_label_t code;
        jump_label_t target;
        jump_label_t key;
};

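All three members below hold addresses. On ARM32 each holds the corresponding absolute address. On ARM64 each holds a relative address, obtained by subtracting the address at which the member itself is stored in the jump entry from the corresponding address.

code
- Address of the code that uses a static branch such as static_branch_likely()
- On ARM64 only 32 bits are used: the address of the static_branch_*() code minus the address of this jump entry's code member.
target
- Address to branch to
- On ARM64 only 32 bits are used: the branch target address minus the address of this jump entry's target member.
key
- Holds the address of the static key structure together with two flags.
  - bit0 holds the branch flag of the jump entry, which is XORed with the key's enabled state to choose jmp or nop.
  - bit1 indicates whether the code is located in an init section. 1=located in an init section, 0=located elsewhere
    - Code in init sections is freed after boot-up, so the jump entries at these locations do not need to be modified at runtime.
- On ARM64 64 bits are used: the address of the declared static key minus the address of this jump entry's key member.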
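Because a static key is at least 4-byte aligned, the two low bits of its address are free to carry the flags. A minimal sketch of that packing for the absolute-address case, using hypothetical `model_*` helpers rather than the kernel's accessors:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical key type; the explicit alignment keeps bits 0 and 1 clear. */
struct model_key {
    _Alignas(4) int enabled;
};

/* Pack the key address with the branch flag (bit 0) and init flag (bit 1). */
static uintptr_t model_pack_key(struct model_key *key, bool branch, bool init)
{
    uintptr_t addr = (uintptr_t)key;

    assert((addr & 3u) == 0);
    return addr | (branch ? 1u : 0u) | (init ? 2u : 0u);
}

/* Recover the key pointer by masking off both flag bits. */
static struct model_key *model_entry_key(uintptr_t packed)
{
    return (struct model_key *)(packed & ~(uintptr_t)3);
}

static bool model_entry_is_branch(uintptr_t packed)
{
    return (packed & 1u) != 0;
}

static bool model_entry_is_init(uintptr_t packed)
{
    return (packed & 2u) != 0;
}
```

Setting the init flag later only requires OR-ing bit 1 into the stored value; the recovered key pointer is unaffected.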

typedef u32 jump_label_t;

jump_entry structure – ARM64

#ifdef CONFIG_HAVE_ARCH_JUMP_LABEL_RELATIVE
struct jump_entry {
        s32 code;
        s32 target;
        long key;       // key may be far away from the core kernel under KASLR
};

ARM64 uses the CONFIG_HAVE_ARCH_JUMP_LABEL_RELATIVE kernel option to reduce the size. In this case code and target are 4-byte offsets relative to the address of the member itself, and only key uses 8 bytes.
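The self-relative scheme can be illustrated in userspace. In this sketch, which uses hypothetical `model_*` names, the entry and a fake text buffer live in one object so that the offsets are guaranteed to fit in 32 bits; decoding adds the stored offset back to the address of the member that holds it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical relative entry: each field is an offset from its own address. */
struct model_rel_entry {
    int32_t code;     /* patch site address minus &entry->code */
    int32_t target;   /* branch target address minus &entry->target */
};

/* Entry and fake "text" share one object, so offsets stay small. */
struct model_image {
    struct model_rel_entry entry;
    unsigned char text[64];
};

static void model_set(struct model_image *img, unsigned code_off,
                      unsigned target_off)
{
    assert(code_off < sizeof(img->text) && target_off < sizeof(img->text));
    img->entry.code = (int32_t)((intptr_t)&img->text[code_off] -
                                (intptr_t)&img->entry.code);
    img->entry.target = (int32_t)((intptr_t)&img->text[target_off] -
                                  (intptr_t)&img->entry.target);
}

/* Decode: member address plus the signed offset stored in the member. */
static void *model_entry_code(const struct model_rel_entry *e)
{
    return (void *)((uintptr_t)&e->code + (uintptr_t)(intptr_t)e->code);
}

static void *model_entry_target(const struct model_rel_entry *e)
{
    return (void *)((uintptr_t)&e->target + (uintptr_t)(intptr_t)e->target);
}
```

A side effect of this layout is that the entries stay valid when the whole image is moved as a unit, since only distances between locations are stored.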

static_key_mod structure

kernel/jump_label.c

struct static_key_mod {
        struct static_key_mod *next;
        struct jump_entry *entries;
        struct module *mod;
};

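Used when a module uses a global static key declared by the kernel. (It is unrelated to local static keys declared inside the module.)

*next
- Links to the next module that uses the global static key.
*entries
- Points to the first jump entry used by the module.
- The last node of the list points to the first jump entry used globally.
*mod
- Points to the module.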

jump_label_type

include/linux/jump_label.h

enum jump_label_type { 
        JUMP_LABEL_NOP = 0,
        JUMP_LABEL_JMP,
};

JUMP_LABEL_NOP
- Changes the jump_label code to a NOP instruction.
JUMP_LABEL_JMP
- Changes the jump_label code to a JMP instruction.

References

Static Keys -1- (Core API) | 문c – current post
Static Keys -2- (Initialization) | 문c

Static Keys | kernel.org
jump label: introduce static_branch() interface | LWN.net
locking/static_keys: Add a new static_key interface

parse_args()

2016-04-26 2020-02-27 문영일

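For each parameter received on the cmdline, calls the setup function linked to it in the old-style or new-style kernel parameter block. Module (modprobe) parameters are ignored in this routine, and unmatched unknown parameters are added to the envp_init[] array if they have a value, or to the argv_init[] array if they do not.

Middle part of start_kernel()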

init/main.c

        after_dashes = parse_args("Booting kernel",
                                  static_command_line, __start___param,
                                  __stop___param - __start___param,
                                  -1, -1, &unknown_bootoption);
        if (!IS_ERR_OR_NULL(after_dashes))
                parse_args("Setting init args", after_dashes, NULL, 0, -1, -1,
                           set_init_arg);

after_dashes = parse_args("Booting kernel", static_command_line, __start___param, __stop___param - __start___param, -1, -1, &unknown_bootoption);
- Parses static_command_line into param=value pairs, calls the function that matches each kernel parameter in the new-style kernel parameter block (__start___param ~ __stop___param), and calls unknown_bootoption() when nothing matches.
- If a "--" argument is encountered, parsing stops and after_dashes receives the string that follows "--".
if (!IS_ERR_OR_NULL(after_dashes))
- If "--" was found while parsing
parse_args("Setting init args", after_dashes, NULL, 0, -1, -1, set_init_arg);
- Adds the parameters after "--" to the argv_init[] array.

parse_args()

kernel/params.c

char *parse_args(const char *doing,
                 char *args,    
                 const struct kernel_param *params,
                 unsigned num,
                 s16 min_level,
                 s16 max_level, 
                 int (*unknown)(char *param, char *val, const char *doing))

Parses the cmdline, looks up each kernel parameter in the params block, calls the function of the matching parameter, and calls unknown when nothing matches. If a "--" argument is encountered, the string after "--" is returned.

Cases in which parse_args() is called
- parse_args("early_options", , , , , do_early_param);
  - init/main.c – parse_early_options() function
  - See: parse_early_param() | 문c
- parse_args("Booting kernel", static_command_line, __start___param, __stop___param - __start___param, -1, -1, &unknown_bootoption);
  - init/main.c – start_kernel() function
- parse_args("Setting init args", after_dashes, NULL, 0, -1, -1, set_init_arg);
  - init/main.c – start_kernel() function
- parse_args("dyndbg params", cmdline, NULL, 0, 0, 0, &ddebug_dyndbg_boot_param_cb);
  - lib/dynamic_debug.c – dynamic_debug_init() function
- parse_args(initcall_level_names[level], initcall_command_line, __start___param, __stop___param - __start___param, level, level, &repair_env_string);
  - init/main.c – do_initcall_level() function

unknown_bootoption()

init/main.c

/*
 * Unknown boot options get handed to init, unless they look like
 * unused parameters (modprobe will find them in /proc/cmdline).
 */
static int __init unknown_bootoption(char *param, char *val, const char *unused)
{
        repair_env_string(param, val, unused);

        /* Handle obsolete-style parameters */
        if (obsolete_checksetup(param))
                return 0;

        /* Unused module parameter. */
        if (strchr(param, '.') && (!val || strchr(param, '.') < val))
                return 0;

        if (panic_later)
                return 0;

        if (val) {
                /* Environment option */
                unsigned int i;
                for (i = 0; envp_init[i]; i++) {
                        if (i == MAX_INIT_ENVS) {
                                panic_later = "env";
                                panic_param = param;
                        }
                        if (!strncmp(param, envp_init[i], val - param))
                                break;
                }
                envp_init[i] = param;
        } else {
                /* Command line option */
                unsigned int i;
                for (i = 0; argv_init[i]; i++) {
                        if (i == MAX_INIT_ARGS) {
                                panic_later = "init";
                                panic_param = param;
                        }
                }
                argv_init[i] = param;
        }
        return 0;
}

Restores the cmdline parameter to param=value form and, among the parameters matched in the old-style kernel parameter block (__setup_start ~ __setup_end), calls the setup function of those that are not early. Module (modprobe) parameters are ignored for now. Any other unmatched unknown parameter is added to the envp_init[] array if it has a value, or to the argv_init[] array if it does not.

repair_env_string(param, val, unused);
- For forms such as param=val and param="val", the '=' has been replaced with a null to separate the parameter name. This puts the '=' character back and, when quotes were used, removes the quote so that the result has the form param=val.
- Example of the result on rpi2)
  - dma.dmachans=0x7f35 bcm2708_fb.fbwidth=592 bcm2708_fb.fbheight=448 bcm2709.boardrev=0xa01041 bcm2709.serial=0x670ebdbf smsc95xx.macaddr=B8:27:EB:0E:BD:BF bcm2708_fb.fbswap=1 bcm2709.disk_led_gpio=47 bcm2709.disk_led_active_low=0 sdhci-bcm2708.emmc_clock_freq=250000000 vc_mem.mem_base=0x3dc00000 vc_mem.mem_size=0x3f000000 dwc_otg.lpm_enable=0 console=ttyAMA0,115200 console=tty1 root=/dev/mmcblk0p6 rootfstype=ext4 elevator=deadline rootwait
if (obsolete_checksetup(param)) return 0;
- If a parameter matches in the old-style parameter block, calls the setup function linked to it and returns.
if (strchr(param, '.') && (!val || strchr(param, '.') < val)) return 0;
- If it is a module parameter (a '.' appears in the parameter name), returns.
  - rpi2 example) dma.dmachans=0x7f35
- See: param: don't complain about unused module parameters
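The module-parameter test only looks for a '.' that appears before the value. A small userspace sketch of the same condition, where `model_is_module_param()` is a hypothetical helper name:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * param: the repaired "name=value" string, or a bare "name".
 * val:   pointer to the value inside param, or NULL when there is none.
 * Returns true when a '.' occurs in the name part, which is how a
 * "module.parameter" style option is recognised.
 */
static bool model_is_module_param(const char *param, const char *val)
{
    const char *dot;

    assert(param != NULL);
    dot = strchr(param, '.');
    return dot != NULL && (val == NULL || dot < val);
}
```

Comparing the position of the dot with val is what keeps an option such as ip=192.168.0.1 from being mistaken for a module parameter: its dots are all inside the value.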

If val is present, the parameter is registered as an environment option.

if (val) {
- If val is present
for (i = 0; envp_init[i]; i++) {
- Loops over the entries registered in the envp_init[] array.
if (i == MAX_INIT_ENVS) { panic_later = "env"; panic_param = param; }
- When i reaches MAX_INIT_ENVS, sets the panic-related variables.
if (!strncmp(param, envp_init[i], val - param)) break;
- If the parameter name matches a string already registered in envp_init[], breaks out of the loop so that the existing entry is overwritten.
envp_init[i] = param;
- Adds param to the envp_init[] array.

If val is absent, the parameter is registered as a command line option.

for (i = 0; argv_init[i]; i++) {
- Loops over the entries registered in the argv_init[] array.
if (i == MAX_INIT_ARGS) { panic_later = "init"; panic_param = param; }
- When i reaches MAX_INIT_ARGS, sets the panic-related variables.
argv_init[i] = param;
- Adds param to the argv_init[] array.

repair_env_string()

init/main.c

static int __init repair_env_string(char *param, char *val, const char *unused)
{
        if (val) {
                /* param=val or param="val"? */
                if (val == param+strlen(param)+1)
                        val[-1] = '=';
                else if (val == param+strlen(param)+2) {
                        val[-2] = '=';
                        memmove(val-1, val, strlen(val)+1);
                        val--;
                } else
                        BUG();
        }
        return 0;
}

For forms such as param=val and param="val", the '=' has been replaced with a null to separate the parameter name. This puts the '=' character back and, when quotes were used, removes the quote so that the result has the form param=val.

if (val == param+strlen(param)+1) val[-1] = '=';
- For param<null>val, where no quotes were used, assigns '=' to val[-1].
else if (val == param+strlen(param)+2) { val[-2] = '='; memmove(val-1, val, strlen(val)+1); val--;
- For param<null>"val", where quotes were used, assigns '=' to val[-2] and moves the val string forward by one character.
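The two cases can be tried out in userspace on hand-built buffers. The sketch below copies the logic into a hypothetical `model_repair_env_string()`; it assumes the buffer layouts described above (a null where the '=' was, and a leftover opening quote in the quoted case), uses abort() in place of BUG(), and returns the adjusted value pointer for convenience.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Userspace copy of the repair logic; returns the adjusted value pointer. */
static char *model_repair_env_string(char *param, char *val)
{
    assert(param != NULL);
    if (!val)
        return NULL;

    if (val == param + strlen(param) + 1) {
        /* Layout "param\0val": restore the '='. */
        val[-1] = '=';
    } else if (val == param + strlen(param) + 2) {
        /* Layout "param\0\"val": restore '=' and close the one-byte gap. */
        val[-2] = '=';
        memmove(val - 1, val, strlen(val) + 1);
        val--;
    } else {
        abort();    /* stands in for BUG() */
    }
    return val;
}
```

In both cases the buffer ends up as a single param=val string, which is the form that envp_init[] entries need.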

obsolete_checksetup()

init/main.c

static int __init obsolete_checksetup(char *line)
{
        const struct obs_kernel_param *p;
        int had_early_param = 0;

        p = __setup_start;
        do {
                int n = strlen(p->str);
                if (parameqn(line, p->str, n)) {
                        if (p->early) {
                                /* Already done in parse_early_param?
                                 * (Needs exact match on param part).
                                 * Keep iterating, as we can have early
                                 * params and __setups of same names 8( */
                                if (line[n] == '\0' || line[n] == '=')
                                        had_early_param = 1;
                        } else if (!p->setup_func) {
                                pr_warn("Parameter %s is obsolete, ignored\n",
                                        p->str);
                                return 1;
                        } else if (p->setup_func(line + n))
                                return 1;
                }
                p++;
        } while (p < __setup_end);

        return had_early_param;
}

Among the parameters matched in the old-style kernel parameter block from __setup_start to __setup_end, calls the setup function linked to those that are not early.

if (parameqn(line, p->str, n)) {
- If the argument string and the string in the parameter block are equal over n bytes
if (p->early) { if (line[n] == '\0' || line[n] == '=') had_early_param = 1;
- For an early parameter whose name matches exactly, assigns 1 to had_early_param.
} else if (!p->setup_func) { pr_warn("Parameter %s is obsolete, ignored\n", p->str); return 1;
- If the matched parameter has no setup_func registered, prints a warning message and returns 1.
} else if (p->setup_func(line + n)) return 1;
- Calls the matched parameter's setup_func() with the rest of the line after the parameter name, and returns 1 if it returns non-zero.
} while (p < __setup_end);
- Loops to the end of the parameter block.

init/main.c

const char *envp_init[MAX_INIT_ENVS+2] = { "HOME=/", "TERM=linux", NULL, };

init/main.c

static const char *argv_init[MAX_INIT_ARGS+2] = { "init", NULL, };

set_init_arg()

init/main.c

/* Anything after -- gets handed straight to init. */
static int __init set_init_arg(char *param, char *val, const char *unused)
{
        unsigned int i;

        if (panic_later)
                return 0;

        repair_env_string(param, val, unused);

        for (i = 0; argv_init[i]; i++) {
                if (i == MAX_INIT_ARGS) {
                        panic_later = "init";
                        panic_param = param;
                        return 0;
                }
        }
        argv_init[i] = param;
        return 0;
}

if (panic_later) return 0;
- If panic_later is set, returns from the function.
repair_env_string(param, val, unused);
- For forms such as param=val and param="val", the '=' has been replaced with a null to separate the parameter name. This puts the '=' character back and, when quotes were used, removes the quote so that the result has the form param=val.
for (i = 0; argv_init[i]; i++) {
- Loops over the entries registered in the argv_init[] array.
if (i == MAX_INIT_ARGS) { panic_later = "init"; panic_param = param; return 0; }
- When i reaches MAX_INIT_ARGS, sets the panic-related variables and returns.
argv_init[i] = param;
- Adds param to the argv_init[] array.

Registering module (kernel) parameters

Kernel (module) parameters are registered with the following macros.

__setup()
- Uses the old-style (obs_kernel_param) kernel parameter
- See: parse_early_param() | 문c
core_param()
module_param()

Module parameters can be passed in one of the following two ways.

Kernel cmdline
- e.g. usbcore.blinkenlights=1
modprobe command
- e.g. $ modprobe usbcore blinkenlights=1

module_param()

include/linux/moduleparam.h

/**
 * module_param - typesafe helper for a module/cmdline parameter
 * @value: the variable to alter, and exposed parameter name.
 * @type: the type of the parameter
 * @perm: visibility in sysfs.
 *
 * @value becomes the module parameter, or (prefixed by KBUILD_MODNAME and a
 * ".") the kernel commandline parameter.  Note that - is changed to _, so
 * the user can use "foo-bar=1" even for variable "foo_bar".
 *
 * @perm is 0 if the the variable is not to appear in sysfs, or 0444
 * for world-readable, 0644 for root-writable, etc.  Note that if it
 * is writable, you may need to use kparam_block_sysfs_write() around
 * accesses (esp. charp, which can be kfreed when it changes).
 *
 * The @type is simply pasted to refer to a param_ops_##type and a
 * param_check_##type: for convenience many standard types are provided but
 * you can create your own by defining those variables.
 *
 * Standard types are:
 *      byte, short, ushort, int, uint, long, ulong
 *      charp: a character pointer
 *      bool: a bool, values 0/1, y/n, Y/N.
 *      invbool: the above, only sense-reversed (N = true).
 */
#define module_param(name, type, perm)                          \
        module_param_named(name, name, type, perm)

This macro is used to register a module (kernel) parameter.

module_param_unsafe() is used instead when setting the parameter can be dangerous to the kernel.

module_param_named()

include/linux/moduleparam.h

/**
 * module_param_named - typesafe helper for a renamed module/cmdline parameter
 * @name: a valid C identifier which is the parameter name.
 * @value: the actual lvalue to alter.
 * @type: the type of the parameter
 * @perm: visibility in sysfs.
 *
 * Usually it's a good idea to have variable names and user-exposed names the
 * same, but that's harder if the variable must be non-static or is inside a
 * structure.  This allows exposure under a different name.
 */
#define module_param_named(name, value, type, perm)                        \
        param_check_##type(name, &(value));                                \
        module_param_cb(name, &param_ops_##type, &value, perm);            \
        __MODULE_PARM_TYPE(name, #type)

After the type check, module_param_cb() is called.

module_param_call()

include/linux/moduleparam.h

/* Obsolete - use module_param_cb() */
#define module_param_call(name, set, get, arg, perm)                    \
        static struct kernel_param_ops __param_ops_##name =             \
                { .flags = 0, (void *)set, (void *)get };               \
        __module_param_call(MODULE_PARAM_PREFIX,                        \
                            name, &__param_ops_##name, arg,             \
                            (perm) + sizeof(__check_old_set_param(set))*0, -1, 0)

Creates a __param_ops_XXX object that wires up the set and get handlers, then calls __module_param_call().

e.g. module_param_call(policy, pcie_aspm_set_policy, pcie_aspm_get_policy, NULL, 0644);
- static struct kernel_param_ops __param_ops_policy = { .flags = 0, (void *) pcie_aspm_set_policy, (void *) pcie_aspm_get_policy };
- __module_param_call("pcie_aspm.", policy, &__param_ops_policy, NULL, 0644 + sizeof(__check_old_set_param(pcie_aspm_set_policy))*0, -1, 0)

module_param_cb()

include/linux/moduleparam.h

/**
 * module_param_cb - general callback for a module/cmdline parameter
 * @name: a valid C identifier which is the parameter name.
 * @ops: the set & get operations for this parameter.
 * @perm: visibility in sysfs.
 *
 * The ops can have NULL set or get functions.
 */
#define module_param_cb(name, ops, arg, perm)                                 \
        __module_param_call(MODULE_PARAM_PREFIX, name, ops, arg, perm, -1, 0)

The name of the module in which module_param_cb() is used is built into MODULE_PARAM_PREFIX, and __module_param_call() is invoked with it.

e.g. module_param_cb(skip_txen_test, &param_ops_uint, &skip_txen_test, 0644);
- __module_param_call("8250_core.", skip_txen_test, &param_ops_uint, &skip_txen_test, 0644, -1, 0)

__module_param_call()

include/linux/moduleparam.h

/* This is the fundamental function for registering boot/module
   parameters. */
#define __module_param_call(prefix, name, ops, arg, perm, level, flags) \
        /* Default value instead of permissions? */                     \
        static const char __param_str_##name[] = prefix #name; \
        static struct kernel_param __moduleparam_const __param_##name   \
        __used                                                          \
    __attribute__ ((unused,__section__ ("__param"),aligned(sizeof(void *)))) \
        = { __param_str_##name, ops, VERIFY_OCTAL_PERMISSIONS(perm),    \
            level, flags, { arg } }

The parameter and its handlers are registered in the module (kernel) parameter block.

e.g. __module_param_call("8250_core.", skip_txen_test, &param_ops_uint, &skip_txen_test, 0644, -1, 0)
- static const char __param_str_skip_txen_test[] = "8250_core." "skip_txen_test";
- static struct kernel_param const __param_skip_txen_test = { __param_str_skip_txen_test, &param_ops_uint, 0644, -1, 0, { &skip_txen_test } }
__moduleparam_const
- On ALPHA, IA64 and PPC64 it expands to nothing; on all other architectures it is const.
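The parameter name is built at compile time by stringifying the identifier (#name) and relying on adjacent string literal concatenation with the prefix. The following user-space sketch isolates just that step. DEMO_PARAM_STR and demo_param_name() are hypothetical names, and the real macro additionally emits the struct kernel_param into the __param section.

```c
/* Simplified, hypothetical version of the name-building part of
 * __module_param_call(): prefix and #name are adjacent string
 * literals, so the compiler joins them into one string. */
#define DEMO_PARAM_STR(prefix, name) \
        static const char demo_param_str_##name[] = prefix #name

DEMO_PARAM_STR("8250_core.", skip_txen_test);

/* Returns the full command-line name of the demo parameter. */
static const char *demo_param_name(void)
{
        return demo_param_str_skip_txen_test;
}
```

Because the prefix already ends in a dot, the resulting string is the exact token a user would type on the kernel command line.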

Structures

kernel_param structure

struct kernel_param {
        const char *name;
        const struct kernel_param_ops *ops;
        u16 perm;
        s8 level;
        u8 flags;
        union {
                void *arg;
                const struct kparam_string *str;
                const struct kparam_array *arr;
        };
};

flags
- KERNEL_PARAM_FL_UNSAFE(1) marks a dangerous parameter that can cause problems in the kernel.

kernel_param_ops structure

include/linux/moduleparam.h

struct kernel_param_ops {
        /* How the ops should behave */
        unsigned int flags;
        /* Returns 0, or -errno.  arg is in kp->arg. */
        int (*set)(const char *val, const struct kernel_param *kp);
        /* Returns length written or -errno.  Buffer is 4k (ie. be short!) */
        int (*get)(char *buffer, const struct kernel_param *kp);
        /* Optional function to free kp->arg when module unloaded. */
        void (*free)(void *arg);
};

flags
- With KERNEL_PARAM_OPS_FL_NOARG(1), the parameter may be given by name alone, without a value.

kparam_string structure

include/linux/moduleparam.h

/* Special one for strings we want to copy into */
struct kparam_string {
        unsigned int maxlen;
        char *string;
};

kparam_array structure

include/linux/moduleparam.h

/* Special one for arrays */
struct kparam_array
{
        unsigned int max;
        unsigned int elemsize;
        unsigned int *num;
        const struct kernel_param_ops *ops;
        void *elem;
};

References

Earlycon & Earlyprintk
parse_early_param() | 문c
parse_args() | 문c (this article)

Kernel Parameters | kernel.org

page_alloc_init()

2016-04-26 (updated 2019-06-07) 문영일

page_alloc_init()

mm/page_alloc.c

void __init page_alloc_init(void)
{
        int ret;

        ret = cpuhp_setup_state_nocalls(CPUHP_PAGE_ALLOC_DEAD,
                                        "mm/page_alloc:dead", NULL,
                                        page_alloc_cpu_dead);
        WARN_ON(ret < 0);
}

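Registers a callback so that, when a CPU goes down, the per-cpu caches used by the page allocator (pagevec, pcp) and the memory used for vm statistics are reclaimed.

In code lines 5~7, cpuhp_setup_state_nocalls() specifies the functions to be called when a CPU starts up or shuts down as its hot-plug state changes.
- This used to be done with a hotcpu notifier and was changed in kernel v4.10-rc.
  - New method, see: mm/page_alloc: Convert to hotplug state machine
  - Old method, see: hotcpu_notifier() | 문c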

page_alloc_cpu_dead()

mm/page_alloc.c

static int page_alloc_cpu_dead(unsigned int cpu)
{
        lru_add_drain_cpu(cpu);
        drain_pages(cpu);

        /*
         * Spill the event counters of the dead processor
         * into the current processors event counters.
         * This artificially elevates the count of the current
         * processor.
         */
        vm_events_fold_cpu(cpu);

        /*
         * Zero the differential counters of the dead processor
         * so that the vm statistics are consistent.
         *
         * This is only okay since the processor is dead and cannot
         * race with what we are doing.
         */
        cpu_vm_stats_fold(cpu);
        return 0;
}

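Reclaims the page-allocator-related memory (pagevec, pcp) that was used for the given CPU and updates the event counters and vm counters.

In code line 3, pages are drained from the per-cpu caches that the downed @cpu used in front of the lruvec (the page allocator's reclaim mechanism) and moved to the lruvec of the corresponding zone (or the memory cgroup's zone).
In code line 4, the pages held in the Per-Cpu Page Frame Cache of the downed @cpu, the buddy system's cache dedicated to order-0 page allocations, are freed.
In code line 12, the event counters of the downed @cpu are added to the event counters of the current CPU, and the folded CPU's event counters are then all cleared.
In code line 21, all pageset counters of the downed @cpu are moved to the zone and global counters and then cleared.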

vm_events_fold_cpu()

mm/vmstat.c

/*
 * Fold the foreign cpu events into our own.
 *
 * This is adding to the events on one processor
 * but keeps the global counts constant. 
 */
void vm_events_fold_cpu(int cpu)
{
        struct vm_event_state *fold_state = &per_cpu(vm_event_states, cpu);
        int i;

        for (i = 0; i < NR_VM_EVENT_ITEMS; i++) {
                count_vm_events(i, fold_state->event[i]);
                fold_state->event[i] = 0;
        }
}

The event[] counters of the CPU being folded are added to the current CPU's event[] counters, after which the folded CPU's event[] counters are all cleared.
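The key property of this fold is that the system-wide sum of each event does not change; the counts simply move from the dead CPU to the current one. The following user-space sketch illustrates that with a plain 2D array. events_demo, fold_events() and total_events() are hypothetical names, and real per-cpu accessors are replaced by array indexing.

```c
#define NR_CPUS_DEMO   2
#define NR_EVENTS_DEMO 3

/* Hypothetical stand-in for the per-cpu vm_event_states. */
static unsigned long events_demo[NR_CPUS_DEMO][NR_EVENTS_DEMO];

/* Move every event count of dead_cpu onto this_cpu and zero the source. */
static void fold_events(int this_cpu, int dead_cpu)
{
        int i;

        for (i = 0; i < NR_EVENTS_DEMO; i++) {
                events_demo[this_cpu][i] += events_demo[dead_cpu][i];
                events_demo[dead_cpu][i] = 0;
        }
}

/* Sum one event item over all CPUs, as a global reader would. */
static unsigned long total_events(int item)
{
        unsigned long sum = 0;
        int cpu;

        for (cpu = 0; cpu < NR_CPUS_DEMO; cpu++)
                sum += events_demo[cpu][item];
        return sum;
}
```

Calling total_events() before and after fold_events() gives the same value, which is what the source comment means by keeping the global counts constant.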

cpu_vm_stats_fold()

mm/vmstat.c

/*
 * Fold the data for an offline cpu into the global array.
 * There cannot be any access by the offline cpu and therefore
 * synchronization is simplified.
 */
void cpu_vm_stats_fold(int cpu)
{
        struct zone *zone;
        int i;
        int global_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };

        for_each_populated_zone(zone) {
                struct per_cpu_pageset *p;

                p = per_cpu_ptr(zone->pageset, cpu);
        
                for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
                        if (p->vm_stat_diff[i]) {
                                int v;

                                v = p->vm_stat_diff[i];
                                p->vm_stat_diff[i] = 0;
                                atomic_long_add(v, &zone->vm_stat[i]);
                                global_diff[i] += v;
                        }
        }
                        
        fold_diff(global_diff);
}

The vm_stat_diff[] counters of the CPU being folded are added to the zone->vm_stat[] counters and also to the global vm_stat[], after which the folded CPU's counters are all cleared.

fold_diff()

mm/vmstat.c

/*
 * Fold a differential into the global counters.
 * Returns the number of counters updated.
 */
static int fold_diff(int *diff)
{
        int i;
        int changes = 0;

        for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
                if (diff[i]) {
                        atomic_long_add(diff[i], &vm_stat[i]);
                        changes++;
        }
        return changes;
}

Adds the vm_stat differences passed in as the argument to the global vm_stat[] and returns the number of items that changed.

References

hotcpu_notifier() | 문c
Per-CPU Page Frame Cache (zone->pageset) | 문c
[Linux] pageflags로 살펴본 메모리의 일생 | 문c
From mm-summi | Fujitsu (download)
LRU Lists & pagevecs | 문c
Buddy Memory Allocator (free) | 문c
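The two-level fold performed by cpu_vm_stats_fold() and fold_diff() can be sketched in user space as follows. This is a simplified model under stated assumptions: plain long counters replace atomic_long_t, the array names and fold_cpu_diffs() are hypothetical, and only a single offline CPU is modelled.

```c
#define NR_ZONES_DEMO 2
#define NR_ITEMS_DEMO 2

static long zone_stat_demo[NR_ZONES_DEMO][NR_ITEMS_DEMO];  /* per-zone counters */
static long global_stat_demo[NR_ITEMS_DEMO];               /* global counters */
static int  cpu_diff_demo[NR_ZONES_DEMO][NR_ITEMS_DEMO];   /* offline CPU's pending diffs */

/* Fold the offline CPU's diffs into the zone counters, accumulate them
 * per item, then fold the accumulated values into the global counters.
 * Returns the number of global items that changed. */
static int fold_cpu_diffs(void)
{
        int global_diff[NR_ITEMS_DEMO] = { 0 };
        int z, i, changes = 0;

        for (z = 0; z < NR_ZONES_DEMO; z++) {
                for (i = 0; i < NR_ITEMS_DEMO; i++) {
                        int v = cpu_diff_demo[z][i];

                        if (!v)
                                continue;
                        cpu_diff_demo[z][i] = 0;
                        zone_stat_demo[z][i] += v;
                        global_diff[i] += v;
                }
        }

        for (i = 0; i < NR_ITEMS_DEMO; i++) {
                if (global_diff[i]) {
                        global_stat_demo[i] += global_diff[i];
                        changes++;
                }
        }
        return changes;
}
```

Accumulating into global_diff[] first means each global counter is touched at most once per fold, however many zones contributed to it.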

Zoned Allocator -10- (LRU & pagevecs)

2016-04-26 (updated 2023-06-22) 문영일

Memory Reclaiming

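When memory runs short, a process that frees pages runs periodically to reclaim pages for reuse. There are several page replacement policies; Linux uses the LRU algorithm.

The buddy system, the core of the kernel's page management, manages free pages per order and, within each order, per migratetype (6 of them). When memory runs short while allocating from it, certain kinds of allocated pages can be reclaimed and reused. These pages are described below.

file pages (aka page cache)
- File pages that a user has read into memory are copies; the originals are already stored on the backing storage system (a disk, for example), so the in-memory pages can be freed and reclaimed immediately. If needed later, only that part is loaded again.
- If an in-memory page has been modified (dirty state), it is written to disk before being reclaimed.
- Managed through the LRU lists.
anon pages
- For memory requested by a user through malloc(), stack memory and the like, the memory itself is the original, so the page is first saved temporarily on the swap backing storage system (a disk, for example) and can then be freed from memory and reclaimed.
- Managed through the LRU lists.
reclaimable slab caches
- Memory allocated by the kernel cannot be moved and is designed so that neither compaction nor reclaim is possible. However, slab caches created with the GFP_RECLAIMABLE option are designed to be reclaimable even though they are used as kernel memory.
- Not managed through the LRU lists.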

LRU (Least Recently Used)

Pages subject to reclaim are managed through LRU lists, which are described here.

This scheme reclaims the pages that were used least recently.
In the actual implementation, a page to be managed for reclaim is added at the head of the list, and pages to be reclaimed are taken from the tail of the list.
File pages and anon pages are the targets.

Reclaim targets anon pages (user-allocated memory) and the page cache.
- Page cache pages classified as unevictable (ramfs, shm, mlock) are excluded.
Pages managed on the LRU lists are movable pages that can be migrated through compaction.
Before kernel v2.6.28-rc1, only 2 LRU lists (active_list and inactive_list) were managed per zone. Since then they have been split again into anon and file, and an unevictable list holding pages excluded from reclaim was added, for a total of 5 lists.
- Old
  - zone->active_list
  - zone->inactive_list
- New: LRU lists extended to 5
  - zone->lruvec.lists[LRU_INACTIVE_ANON]
  - zone->lruvec.lists[LRU_ACTIVE_ANON]
  - zone->lruvec.lists[LRU_INACTIVE_FILE]
  - zone->lruvec.lists[LRU_ACTIVE_FILE]
  - zone->lruvec.lists[LRU_UNEVICTABLE]
- See:
  - vmscan: Use an indexed array for LRU variables
  - vmscan: split LRU lists into anon & file sets
Since kernel v4.8-rc1 the lists are managed per node rather than per zone.
- See: mm, vmscan: move LRU lists to node (2016, v4.8-rc1)
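The head-insert, tail-reclaim discipline described above can be sketched with a small circular doubly linked list. This is an illustration only: struct demo_page, struct demo_lru and the lru_* helpers are hypothetical and stand in for the kernel's list_head based lists.

```c
#include <stddef.h>

struct demo_page {
        int id;
        struct demo_page *prev, *next;
};

/* Sentinel node: head.next is the hottest page, head.prev the coldest. */
struct demo_lru {
        struct demo_page head;
};

static void lru_init(struct demo_lru *lru)
{
        lru->head.prev = lru->head.next = &lru->head;
}

/* Newly managed pages enter at the head (hot end). */
static void lru_add_head(struct demo_lru *lru, struct demo_page *p)
{
        p->next = lru->head.next;
        p->prev = &lru->head;
        lru->head.next->prev = p;
        lru->head.next = p;
}

/* Reclaim takes the page at the tail (cold end); NULL if the list is empty. */
static struct demo_page *lru_reclaim_tail(struct demo_lru *lru)
{
        struct demo_page *p = lru->head.prev;

        if (p == &lru->head)
                return NULL;
        p->prev->next = &lru->head;
        lru->head.prev = p->prev;
        p->prev = p->next = NULL;
        return p;
}
```

Pages added in the order 1, 2, 3 are reclaimed in the same order 1, 2, 3, because the oldest insertion has drifted furthest toward the tail.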

memcg/node lruvecs

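As the following figure shows, the memory controller built on cgroups is called memcg, and each memcg manages an lruvec per node.

The lruvec structure contains the 5 LRU lists above.
An lruvec is managed per memcg (the cgroup memory controller) and per node.
In other words, a given user page is managed in exactly one of the many lruvecs.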

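LRU list types

An LRU list is a doubly linked list; pages at the head are hot and pages at the tail are cold.

ANON
- Pages used by mapping anonymous user memory into the VM.
- When memory is short they are moved to the swap area, and pages that have been fully written out are reclaimed.
  - For performance reasons, the Linux kernel currently defaults to a swap size of 0.
  - Recently torvalds has shown interest in using swap again with SSD-type disks.
    - See: Reconsidering swapping | LWN.net
FILE
- Pages used by mapping a file into the VM, read in from a regular file.
- When memory is short, clean pages are simply reclaimed, and dirty pages are written to the file (backing store) before being reclaimed.
ACTIVE
- Newly allocated pages are added at the head (hot) of the inactive list.
- The ratio of active to inactive is compared periodically; pages that are no longer being referenced are moved to the inactive list, and referenced pages are moved back to the head of the active list (rotate).
INACTIVE
- When the reclaim mechanism runs, it tries to reclaim from the tail (cold) of the inactive list.
  - ANON: moved to the swap area.
  - FILE: clean pages can be reclaimed right away. A dirty page is switched to writeback, written asynchronously to its original file, and moved to the head of the inactive list (rotate).
    - Rotating defers immediate handling; when the page's turn comes again and writeback has completed, it is reclaimed.
UNEVICTABLE
- Pages that the memory reclaim mechanism is not allowed to use, in the following cases.
  - ramfs
  - SHM_LOCK'd shared memory regions
  - VM_LOCKED VMAs
- Migration through isolation is allowed in the following 3 cases.
  - Memory fragmentation management
  - Workload management
  - Memory hotplug
- The per-cpu LRU pagevec mechanism is not used.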

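Movement between LRU lists

anon pages
- A page newly allocated because of a fault enters at the head of the inactive list.
  - With the Workingset Detection feature added in kernel v5.9-rc1, anon pages faulted in for the first time are also added at the head of the inactive list, like file pages.
    - See: mm/vmscan: protect the workingset on anonymous LRU (2020, v5.9-rc1)
- If a page scanned at the tail of the inactive list shows signs of having been accessed 2 or more times by an application or the kernel, it is moved to the head of the active list.
  - This is called activation or promotion of the page.
- A page scanned at the tail of the inactive list is swapped out and then returned to the buddy system.
- A page pushed off the tail of the active list to maintain the active/inactive ratio is moved to the head of the inactive list.
  - This is called deactivation or demotion of the page.
file pages
- A page newly allocated because of a fault enters at the head of the inactive list.
- The deactivate/demotion and activate/promotion processes are the same as for anon pages, except for the following.
  - An executable file page is promoted even after a single access.
- For a page scanned at the tail of the inactive list, a dirty page is written to its original backing storage before reclaim. Once writeback completes, the page is returned to the buddy system.

The following figure shows the LRU lists used during page reclaim.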

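LRU-related page flags

The following page flags are used for LRU management.

PG_lru
- Set while the page is managed on an LRU list. While the page sits on a pagevec list (the LRU cache), PG_lru is clear.
PG_active
- Set while the page is managed on the active list and cleared when it moves to the inactive list.
PG_swapbacked
- When an anon page is first created, this flag is set to indicate that the page is swappable, regardless of whether swap backing storage is configured.
- With the MADV_FREE option of the madvise() API, a page can be put into the lazy free state when it is released; this flag is then cleared, temporarily making it a clean anon page.
PG_referenced
- Used to check whether the page was recently referenced 2 or more times so that it can be activated and moved to the active list. It is used together with the AF (Access Flag) of the PTE (page table entry).
- Explained in more detail in the section below (Page references).
PG_writeback
- Set only while an anon or file page is being written to backing storage.
PG_dirty
- Set when a file page that was loaded into memory by open() followed by read() has its memory modified by write().
PG_reclaim
- Set on a page selected for reclaim and cleared before the reclaimed free page is returned to the buddy system.
PG_workingset
- Set when a page managed on the active list is moved to the inactive list. It is used to tell whether a page is refaulting, so that frequent faults do not degrade performance through repeated disk I/O. Once the page has been written out to backing storage, the state of this flag still has to be remembered for a while; it is kept in the shadow entry that takes the page's place in the cache used for reading from and writing to backing storage (page cache or swap cache) once that slot is no longer in use.
- e.g. When a file is loaded it is held in the page cache, and this information is kept in an xarray data structure. When the page cache is emptied because memory is short, the page cache information is also removed from the xarray, and the slot is then used to record shadow information instead. The shadow information includes data about the evicted page (such as whether it was part of the workingset).

It is a somewhat old article, but there is an excellent write-up that explains the page flags well, so refer to the following document. The explanation here focuses on changes to page flags that it does not cover.

See: [Linux] pageflags로 살펴본 메모리의 일생 | F/OSS Study
Note: A few flags such as PG_buddy were split off from the page's _mapcount again and moved into the newly added page_type.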

Page flag state changes

The following describes how the flags change when these kinds of pages are reclaimed.

anon pages
- An anon page newly allocated because of a fault has PG_swapbacked set and starts on the inactive list.
  - Note: an anon page without PG_swapbacked is called a clean anon page, meaning a page in the lazy free state.
- A page on the inactive list is either moved to the inactive list again or promoted to the active list; when it is promoted, PG_active is set.
- When a page scanned at the tail of the inactive list is swapped out, PG_writeback and PG_reclaim are set.
- Swapping starts through add_to_swap(). When the swap-out completes, PG_writeback is cleared, and then PG_reclaim is also cleared so that the page can be returned to the buddy system.
- On demotion from the active list to the inactive list, PG_workingset is set and PG_active is cleared.
- For a refaulted page that is activated, PG_active is set, the PG_workingset flag from the time of eviction is retained, and the page starts on the active list.
file pages
- A file page newly allocated because of a fault starts on the inactive list.
- A page on the inactive list is either moved to the inactive list again or promoted to the active list; when it is promoted, PG_active is set.
- When the scan selects the page for reclaim, PG_reclaim is set.
- Reclaim starts through pageout(). A page modified in memory by the user already has PG_dirty set, so it must first be written to its original backing storage before it can be reclaimed; PG_writeback is set during this period.
- When writeback completes, PG_writeback and PG_dirty are cleared and the page becomes a clean file page. PG_reclaim is then also cleared so that the page can be returned to the buddy system.
- On demotion from the active list to the inactive list, PG_workingset is set and PG_active is cleared.
- For a refaulted page that is activated, PG_active is set, the PG_workingset flag from the time of eviction is retained, and the page starts on the active list.

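Page references

The kernel checks whether a page has been referenced 2 or more times by an application or the kernel. If so, the page is judged to be active and moved to the head of the active list.

If the page was on the active list, it is rotated to the head of that list.
If the page was on the inactive list, it is promoted to the head of the active list.

1. AF (Access Flag) of the PTE (Page Table Entry)

To detect the first reference to a page, the kernel handles the fault events raised by the hardware architecture. Faults occur for many reasons, but those related to page references can be summarized in the following 3 categories.

PTE fault
- Occurs when an application accesses a page and attempts to read from an unmapped part of the virtual address space.
  - do_anonymous_page, do_fault, do_swap_page, do_numa_page
Permission fault
- Occurs when an application attempts to write to a part of the virtual address space that is mapped read-only.
  - do_wp_page
Access Flag fault
- Occurs on an attempt to access a page whose PTE does not have the AF bit set.
- Since the page has been accessed, the kernel sets the AF bit of the PTE itself in software. e.g. ARMv8.0 and earlier
  - pte_mkyoung - ptep_set_access_flags
- On recent architectures the hardware sets the AF bit of the PTE directly. e.g. ARMv8.1 and later

2. PG_referenced

A page that is newly loaded or created by the fault of its first reference has just been accessed for the first time, so the kernel (or the hardware automatically, where supported) sets the AF bit of the PTE.

Depending on whether the page is a file page or an anon page, it starts at the head of the inactive list (file) or of the active list (anon). Note that, as described earlier, since v5.9 anon pages also start on the inactive list.
Over time the page slides toward the tail of the list it started on. At the tail it becomes a scan target for reclaim, and when its references are examined (page_check_references), the AF bit of each mapped PTE is read and the PG_referenced flag is examined as well. The AF bit of the PTE is cleared at this point in preparation for the next check.

The following describes how a page scanned at the tail of the inactive list is handled.

In the following cases PG_referenced is set for activation and the page is moved straight to the head of the active list. This is called promote or activate.
- 2 or more references were found, regardless of whether PG_referenced was set
- 1 reference to an executable file page was found, regardless of whether PG_referenced was set
- PG_referenced was set and 1 reference was found
If PG_referenced was not set and 1 reference was found, PG_referenced is set and the page is rotated to the head of the list it was on (active or inactive) so that it is kept for now.
If no reference was found at all, PG_reclaim is set to evict the page, again regardless of whether PG_referenced was set. The subsequent writeback of dirty file pages and swapping of anon pages do not depend on PG_referenced and are omitted here.

The following describes how a page scanned at the tail of the active list is handled.

If 1 or more references to an executable file page were found, regardless of PG_referenced, the page is rotated to the head of the active list so that it is kept for now.
Otherwise the page is judged to be inactive: PG_referenced is cleared and the page is moved to the head of the inactive list. This is called demote or deactivate.

The following figure shows how the reference-related flags of a file page change.

The following figure shows the vm counter values involved while page reclaim through the LRU is in progress.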
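The rules for a page scanned at the tail of the inactive list can be written as a small decision function. The sketch below follows the rules as described in this article rather than the exact kernel source: enum demo_action and check_references_demo() are hypothetical names, and leaving the flag untouched when no reference is found is a simplification made here.

```c
#include <stdbool.h>

enum demo_action { DEMO_RECLAIM, DEMO_KEEP, DEMO_ACTIVATE };

/*
 * referenced_ptes: number of mapped PTEs found with the AF bit set
 * pg_referenced:   in/out, models the PG_referenced flag of the page
 * exec_file:       true if the page belongs to an executable file mapping
 */
static enum demo_action check_references_demo(int referenced_ptes,
                                              bool *pg_referenced,
                                              bool exec_file)
{
        bool was_referenced = *pg_referenced;

        /* No reference at all: evict, whatever the flag says. */
        if (referenced_ptes == 0)
                return DEMO_RECLAIM;

        /* Any reference marks the page as referenced. */
        *pg_referenced = true;

        /* Two or more references, an executable file page, or a second
         * sighting of a single reference: promote to the active list. */
        if (referenced_ptes > 1 || exec_file || was_referenced)
                return DEMO_ACTIVATE;

        /* First single reference: keep on the current list for now. */
        return DEMO_KEEP;
}
```

Calling the function twice on the same page with one reference each time returns DEMO_KEEP and then DEMO_ACTIVATE, which is the "referenced twice" behaviour the text describes.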

Anon pages

Anon pages are created through the following paths.

Pages requested from the kernel as anonymous allocations because a user application's heap or stack grew.
Pages copied by the fault handler through COW (Copy On Write) when an opened, shared file is modified.
Pages shared by the KSM (Kernel Samepage Merging) feature are also anon pages.

Whether an anon page can use the swap area is indicated by the PG_swapbacked flag.

normal anon page
- An anon page backed by the swap area, with PG_swapbacked set.
clean anon page
- An anon page without a swap area, with PG_swapbacked not set.
- A MADV_FREE page in the lazy-free state.
  - See
    - mm: support madvise(MADV_FREE) (2014) | LWN.net
    - Volatile ranges and MADV_FREE (2014) | LWN.net

pagevecs

pagevecs are LRU caches. The page reclaim mechanism isolates a chunk of pages from the LRU lists and processes them as a batch. Where batching is not possible, taking the lock for every single request would hurt performance through lock contention, so a separate LRU cache is implemented. There are 5 per-cpu pagevecs, each of which can hold 14 pages.

lru_add_pvec
lru_rotate_pvecs
lru_deactivate_file_pvecs
lru_lazyfree_pvecs
- See: mm: move MADV_FREE pages into LRU_INACTIVE_FILE list (2017, v4.12-rc1)
activate_page_pvecs

The following figure shows the call relationships of the functions that use the pagevecs LRU cache.

Each call adds the page to the pagevecs LRU cache; once the limit of 14 entries is reached, the cached pages are moved to the LRU list.
To move the pages held in the pagevecs LRU cache to the LRU lists, call lru_add_drain_cpu().
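The benefit of the pagevec cache is that the LRU lock is taken once per batch instead of once per page. The following user-space sketch counts lock acquisitions to make that visible. struct demo_pagevec, pagevec_add_demo() and pagevec_drain_demo() are hypothetical, the LRU list is reduced to a length counter, and draining when the 14th page is added is an assumption about the exact trigger point.

```c
#define PAGEVEC_SIZE_DEMO 14

struct demo_pagevec {
        unsigned int nr;
        int pages[PAGEVEC_SIZE_DEMO];
};

static unsigned int lru_len_demo;     /* pages that reached the LRU list */
static unsigned int lock_count_demo;  /* how often the LRU lock was taken */

/* Move all cached pages to the LRU under a single lock acquisition. */
static void pagevec_drain_demo(struct demo_pagevec *pvec)
{
        if (!pvec->nr)
                return;
        lock_count_demo++;            /* stands in for taking lru_lock */
        lru_len_demo += pvec->nr;
        pvec->nr = 0;
}

/* Cache one page; drain as soon as the pagevec becomes full. */
static void pagevec_add_demo(struct demo_pagevec *pvec, int page)
{
        pvec->pages[pvec->nr++] = page;
        if (pvec->nr == PAGEVEC_SIZE_DEMO)
                pagevec_drain_demo(pvec);
}
```

Adding 30 pages this way takes the lock only twice, with 2 pages still cached; an explicit drain, which is the role of lru_add_drain_cpu(), is what flushes the remainder.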

Workingset Detection

Workingset Detection algorithms were introduced to keep disk I/O cost from growing when pages are repeatedly reclaimed and then repeatedly refaulted.

Information is recorded in the PG_workingset flag and in the shadow entries of the swap cache and page cache, and an anon/file cost is calculated per lruvec.
See:
- mm: balance LRU lists based on relative thrashing (2020, v5.8-rc1)
- mm: workingset: tell cache transitions from workingset thrashing (2018, v4.20-rc1)
- mm: workingset: eviction buckets for bigmem/lowbit machines (2016, v4.6-rc1)
- mm: thrash detection-based file cache sizing (2014, v3.15-rc1)

The following figure shows how the shadow entry of the XArray that manages the cache is used: when a page is evicted, part of the information it carried, such as PG_workingset and its age, is recorded there, so that when the page is later refaulted and loaded again it can be marked as having belonged to the workingset.

The following figure shows the values stored in an XArray shadow entry.

On eviction the lruvec->nonresident_age value is recorded. When system memory is large, or on a 32-bit system, there may not be enough spare bits to store it, so it is shifted right by bucket_order before being stored and shifted left by bucket_order when it is read back on refault.
Because the value is shifted, its low bits are trimmed to zero and it becomes coarse, so comparisons can only be approximate.

The following figure shows how the value stored in an XArray shadow entry is built on eviction (pack) and extracted on refault (unpack).
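The loss of precision caused by bucket_order is easy to see in isolation. The sketch below covers only the age part of the shadow entry; the other fields stored alongside it are left out, and pack_age() and unpack_age() are hypothetical helper names.

```c
/* Store the eviction age with its low bucket_order bits dropped. */
static unsigned long pack_age(unsigned long age, unsigned int bucket_order)
{
        return age >> bucket_order;
}

/* Recover an approximate age on refault; the dropped bits come back as 0. */
static unsigned long unpack_age(unsigned long stored, unsigned int bucket_order)
{
        return stored << bucket_order;
}
```

For example, with bucket_order = 4 an age of 1000 is stored as 62 and read back as 992, so any two evictions within the same 16-wide bucket become indistinguishable.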

Workingset Detection for file pages

Deciding whether to activate a refault page

The following factors are used to decide whether a refault page should be activated.

NR_inactive
- Number of pages managed on the inactive LRU list of the given LRU type
NR_active
- Number of pages managed on the active LRU list of the given LRU type
lruvec->nonresident_age
- A cumulative count of pages that have been activated or evicted. It behaves like a timestamp.
- Activations are counted on two paths.
  - A page promoted from the inactive list
  - A refault page that goes straight to the active list
PG_workingset
- Flag that tells whether the page belongs to the workingset

Deciding whether to activate a refault file page

Using the factors above, the following formulas are applied (when Workingset Detection was first introduced, only the file cache was supported).

R
- lru->nonresident_age value at the moment of refault
E
- lru->nonresident_age value at the moment of eviction
refault distance
- = (R - E)
- The interval between eviction and refault
complete minimum access distance
- = NR_inactive + (R - E)
activation decision
- = refault distance + NR_inactive <= NR_active + NR_inactive
- = refault distance <= NR_active

The following figure shows how a refault file page is added to the active list.

If a page refaults within a short time (refault distance <= NR_active), it is activated.
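The activation test for the original file-only scheme reduces to one subtraction and one comparison. The following sketch assumes the ages are plain unsigned long counters; should_activate_file() is a hypothetical name, not a kernel function.

```c
#include <stdbool.h>

/*
 * refault_age:  nonresident_age sampled at refault time (R)
 * eviction_age: nonresident_age recorded at eviction time (E)
 * nr_active:    current size of the active file list
 */
static bool should_activate_file(unsigned long refault_age,
                                 unsigned long eviction_age,
                                 unsigned long nr_active)
{
        /* Unsigned subtraction stays correct even if the counter wrapped. */
        unsigned long refault_distance = refault_age - eviction_age;

        return refault_distance <= nr_active;
}
```

With E = 100 and NR_active = 80, a refault at R = 150 (distance 50) is activated, while a refault at R = 300 (distance 200) is not and re-enters through the inactive list.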

Workingset Detection for anon/file pages

For anon pages too, on refault the refault distance and workingset_size (which now includes anon instead of only NR_active_file) are calculated and compared to decide whether to activate.

anon page formula:
- = refault distance + NR_inactive_anon <= NR_active_anon + NR_inactive_anon + NR_active_file + NR_inactive_file
- = refault distance <= NR_active_anon + NR_active_file + NR_inactive_file
- = refault distance <= workingset_size
  - workingset_size = NR_active_anon + NR_active_file + NR_inactive_file
file page formula:
- = refault distance + NR_inactive_file <= NR_active_anon + NR_inactive_anon + NR_active_file + NR_inactive_file
- = refault distance <= NR_active_anon + NR_inactive_anon + NR_active_file
- = refault distance <= workingset_size
  - workingset_size = NR_active_anon + NR_inactive_anon + NR_active_file
If there is no swap space, the anon-related counts are not included.
See:
- mm/swap: implement workingset detection for anonymous LRU (2020, v5.9-rc1)
- mm/workingset: prepare the workingset detection infrastructure for anon LRU (2020, v5.9-rc1)
- mm: workingset: age nonresident information alongside anonymous pages (2020, v5.8-rc3)
- mm: workingset: let cache workingset challenge anon (2020, v5.8-rc1)

The following figure shows how a refault page is activated when the new Workingset Detection for file/anon pages is supported.

The following figure shows how the activation decision is made depending on whether the refault distance of the refault page is small (short time) or large (long time).
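The workingset_size side of the comparison can be computed as below. This is a sketch of the formulas in this section: in each case the page's own inactive list is left out, and the anon counts are dropped when there is no swap space. struct demo_lru_counts and workingset_size_demo() are hypothetical names.

```c
#include <stdbool.h>

struct demo_lru_counts {
        unsigned long active_anon, inactive_anon;
        unsigned long active_file, inactive_file;
};

/* Size of the set a refaulting page competes against. */
static unsigned long workingset_size_demo(const struct demo_lru_counts *c,
                                          bool is_file, bool have_swap)
{
        unsigned long size = c->active_file;

        /* An anon page also competes with the inactive file list. */
        if (!is_file)
                size += c->inactive_file;

        /* Anon lists only count when anon pages can actually be swapped. */
        if (have_swap) {
                size += c->active_anon;
                if (is_file)
                        size += c->inactive_anon;
        }
        return size;
}
```

A refault page is then activated when its refault distance is less than or equal to the returned value.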

Draining the per-cpu LRU cache (pagevec)

The following figure shows the call relationships of lru_add_drain_cpu().

lru_add_drain_cpu()

mm/swap.c

/*
 * Drain pages out of the cpu's pagevecs.
 * Either "cpu" is the current CPU, and preemption has already been
 * disabled; or "cpu" is being hot-unplugged, and is already dead.
 */

void lru_add_drain_cpu(int cpu)
{
        struct pagevec *pvec = &per_cpu(lru_add_pvec, cpu);

        if (pagevec_count(pvec))
                __pagevec_lru_add(pvec);

        pvec = &per_cpu(lru_rotate_pvecs, cpu);
        if (pagevec_count(pvec)) {
                unsigned long flags;

                /* No harm done if a racing interrupt already did this */
                local_irq_save(flags);
                pagevec_move_tail(pvec);
                local_irq_restore(flags);
        }

        pvec = &per_cpu(lru_deactivate_file_pvecs, cpu);
        if (pagevec_count(pvec))
                pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);

        pvec = &per_cpu(lru_lazyfree_pvecs, cpu);
        if (pagevec_count(pvec))
                pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);

        activate_page_drain(cpu);
}

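Drains the 5 per-cpu pagevec caches that the given @cpu used in front of the lruvec (the page allocator's reclaim mechanism) and moves their pages to the lruvec of the corresponding zone (or the memcg's zone).

In code lines 3~6, the pages registered in the @cpu cache lru_add_pvec are moved to the lruvec of each page's zone and the cache is emptied.
In code lines 8~16, the pages registered in the @cpu cache lru_rotate_pvecs are moved to the tail position of the lruvec of each page's zone and the cache is emptied.
In code lines 18~20, the pages registered in the @cpu cache lru_deactivate_file_pvecs are moved to the lruvec of each page's zone and the cache is emptied.
In code lines 22~24, the pages registered in the @cpu cache lru_lazyfree_pvecs are moved to the lruvec of each page's zone and the cache is emptied.
In code line 26, the pages registered in the @cpu cache activate_page_pvecs are moved to the lruvec of each page's zone and the cache is emptied.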

__pagevec_lru_add()

mm/swap.c

/*
 * Add the passed pages to the LRU, then drop the caller's refcount
 * on them.  Reinitialises the caller's pagevec.
 */

void __pagevec_lru_add(struct pagevec *pvec)
{
        pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, NULL);
}
EXPORT_SYMBOL(__pagevec_lru_add);

Moves the pages registered in the per-cpu pagevec cache to the lruvec of each page's zone (or the memory cgroup's zone), then empties and reinitializes the pagevec.

pagevec_move_tail()

mm/swap.c

/*
 * pagevec_move_tail() must be called with IRQ disabled.
 * Otherwise this may cause nasty races.
 */

static void pagevec_move_tail(struct pagevec *pvec)
{
        int pgmoved = 0;

        pagevec_lru_move_fn(pvec, pagevec_move_tail_fn, &pgmoved);
        __count_vm_events(PGROTATED, pgmoved);
}

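Adds the pages registered in the pagevec to the tail of the per-type LRU list of each page's memory control group, then empties and reinitializes the pagevec. The number of pages moved (pgmoved) is added to the PGROTATED vm_events item.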

activate_page_drain()

mm/swap.c

static void activate_page_drain(int cpu)
{
        struct pagevec *pvec = &per_cpu(activate_page_pvecs, cpu);

        if (pagevec_count(pvec))
                pagevec_lru_move_fn(pvec, __activate_page, NULL);
}

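Pages registered in the per-cpu cache list activate_page_pvecs are removed from their current per-type LRU list in the page's memory control group and re-added at the head (hot) of the list for the LRU type + active. The active flag is set, the PGACTIVATE vm_events item is incremented, and the reclaim statistics are incremented as well. The pagevec is then emptied and reinitialized.

The 5 pagevec move functions

1) Common move function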

pagevec_lru_move_fn()

mm/swap.c

static void pagevec_lru_move_fn(struct pagevec *pvec,
        void (*move_fn)(struct page *page, struct lruvec *lruvec, void *arg),
        void *arg)
{
        int i;
        struct pglist_data *pgdat = NULL;
        struct lruvec *lruvec;
        unsigned long flags = 0;

        for (i = 0; i < pagevec_count(pvec); i++) {
                struct page *page = pvec->pages[i];
                struct pglist_data *pagepgdat = page_pgdat(page);

                if (pagepgdat != pgdat) {
                        if (pgdat)
                                spin_unlock_irqrestore(&pgdat->lru_lock, flags);
                        pgdat = pagepgdat;
                        spin_lock_irqsave(&pgdat->lru_lock, flags);
                }

                lruvec = mem_cgroup_page_lruvec(page, pgdat);
                (*move_fn)(page, lruvec, arg);
        }
        if (pgdat)
                spin_unlock_irqrestore(&pgdat->lru_lock, flags);
        release_pages(pvec->pages, pvec->nr);
        pagevec_reinit(pvec);
}

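Moves the pages registered in the pagevec to the lruvec of each page's memory control group, then empties and reinitializes the pagevec.

In code lines 10~19, loops over the pages registered in the pagevec, releasing the spin lock and re-acquiring it for the new node whenever the node changes. This keeps any one lock from being held for a long time.
In code lines 21~22, the page is moved to the lruvec list of the memcg it belongs to. If there is no memcg, the lruvec list of the node is used.
- The function given in the move_fn argument is called.
- e.g. __pagevec_lru_add_fn()
  - Adds the pagevec's page to the lruvec.
In code line 25, the pages of the pagevec are released.
In code line 26, the pagevec is reinitialized.

2) The 5 move functions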
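The locking pattern in pagevec_lru_move_fn() takes a node's LRU lock once per run of consecutive pages from the same node rather than once per page. The following sketch counts acquisitions for a given sequence of node ids. count_lock_acquisitions() is a hypothetical helper, and the real function of course also moves the pages.

```c
/* page_node[i] is the node id of the i-th page in the pagevec. */
static unsigned int count_lock_acquisitions(const int *page_node, int n)
{
        int held = -1;              /* node whose lock is held; -1 = none */
        unsigned int acquisitions = 0;
        int i;

        for (i = 0; i < n; i++) {
                if (page_node[i] != held) {
                        /* Unlock the old node (if any), lock the new one. */
                        held = page_node[i];
                        acquisitions++;
                }
                /* move_fn(page, lruvec, arg) would run here, under the lock. */
        }
        return acquisitions;
}
```

For pages on nodes 0, 0, 1, 1, 0 the lock is taken 3 times instead of 5; on a single-node system a full pagevec needs only one acquisition.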

__pagevec_lru_add_fn()

mm/swap.c

static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
                                 void *arg)
{
        enum lru_list lru;
        int was_unevictable = TestClearPageUnevictable(page);

        VM_BUG_ON_PAGE(PageLRU(page), page);

        SetPageLRU(page);
        /*
         * Page becomes evictable in two ways:
         * 1) Within LRU lock [munlock_vma_pages() and __munlock_pagevec()].
         * 2) Before acquiring LRU lock to put the page to correct LRU and then
         *   a) do PageLRU check with lock [check_move_unevictable_pages]
         *   b) do PageLRU check before lock [clear_page_mlock]
         *
         * (1) & (2a) are ok as LRU lock will serialize them. For (2b), we need
         * following strict ordering:
         *
         * #0: __pagevec_lru_add_fn             #1: clear_page_mlock
         *
         * SetPageLRU()                         TestClearPageMlocked()
         * smp_mb() // explicit ordering        // above provides strict
         *                                      // ordering
         * PageMlocked()                        PageLRU()
         *
         *
         * if '#1' does not observe setting of PG_lru by '#0' and fails
         * isolation, the explicit barrier will make sure that page_evictable
         * check will put the page in correct LRU. Without smp_mb(), SetPageLRU
         * can be reordered after PageMlocked check and can make '#1' to fail
         * the isolation of the page whose Mlocked bit is cleared (#0 is also
         * looking at the same page) and the evictable page will be stranded
         * in an unevictable LRU.
         */
        smp_mb();

        if (page_evictable(page)) {
                lru = page_lru(page);
                update_page_reclaim_stat(lruvec, page_is_file_cache(page),
                                         PageActive(page));
                if (was_unevictable)
                        count_vm_event(UNEVICTABLE_PGRESCUED);
        } else {
                lru = LRU_UNEVICTABLE;
                ClearPageActive(page);
                SetPageUnevictable(page);
                if (!was_unevictable)
                        count_vm_event(UNEVICTABLE_PGCULLED);
        }

        add_page_to_lru_list(page, lruvec, lru);
        trace_mm_lru_insertion(page, lru);
}

지정된 @lruvec의 적절한 타입(inactive_anon, active_anon, inactive_file, active_file, unevictable)의 리스트에 page를 추가한다. 페이지에는 lru 리스트에 소속되었다는 표식을 위해 LRU 플래그 비트가 설정된다.

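In code line 5, checks whether the page was previously on the unevictable list and clears that flag.
In code line 9, marks the page as belonging to an lru list.
In code line 36, see the comment above for the case in which memory access ordering must be made explicit.
In code lines 38~43, if the page is evictable, selects the lru list and updates the reclaim statistics: recent_scanned[] is always incremented, and recent_rotated[] is incremented only when the page is active. If the page was previously unevictable, increments the UNEVICTABLE_PGRESCUED counter.
In code lines 44~50, if the page is not evictable, selects the unevictable lru list, clears the active flag, and sets the unevictable flag. If the page was previously evictable, increments the UNEVICTABLE_PGCULLED counter.
In code line 52, adds the page to the lruvec.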
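
The list selection and counter updates described above can be modeled in user space. The following is a minimal sketch, not kernel code: `struct page_model`, `struct counters`, and `choose_lru()` are hypothetical names introduced only for illustration, and the memory barrier and the actual list insertion are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Same index order as the kernel's enum lru_list. */
enum lru_list {
        LRU_INACTIVE_ANON,
        LRU_ACTIVE_ANON,
        LRU_INACTIVE_FILE,
        LRU_ACTIVE_FILE,
        LRU_UNEVICTABLE,
        NR_LRU_LISTS
};

/* Hypothetical stand-in for the page state the function inspects. */
struct page_model {
        bool evictable;         /* what page_evictable() would return */
        bool file;              /* what page_is_file_cache() would return */
        bool active;            /* PG_active */
        bool unevictable;       /* PG_unevictable */
};

/* Hypothetical stand-ins for the two vm event counters. */
struct counters {
        int rescued;            /* UNEVICTABLE_PGRESCUED */
        int culled;             /* UNEVICTABLE_PGCULLED */
};

/* Picks the target list, fixes up the flags, and counts the transition. */
enum lru_list choose_lru(struct page_model *p, bool was_unevictable,
                         struct counters *c)
{
        int lru;

        assert(p && c);
        if (p->evictable) {
                lru = p->file ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON;
                if (p->active)
                        lru += 1;       /* LRU_ACTIVE */
                if (was_unevictable)
                        c->rescued++;
        } else {
                lru = LRU_UNEVICTABLE;
                p->active = false;
                p->unevictable = true;
                if (!was_unevictable)
                        c->culled++;
        }
        return (enum lru_list)lru;
}
```

A page that was unevictable and is now evictable is counted as rescued, and the opposite transition is counted as culled; a page whose state did not change touches neither counter.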

pagevec_move_tail_fn()

mm/swap.c

static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec,
                                 void *arg)
{
        int *pgmoved = arg;

        if (PageLRU(page) && !PageUnevictable(page)) {
                del_page_from_lru_list(page, lruvec, page_lru(page));
                ClearPageActive(page);
                add_page_to_lru_list_tail(page, lruvec, page_lru(page));
                (*pgmoved)++;
        }
}

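If the page is on an lru list and is not unevictable, moves it to the tail (cold end) of its list and clears the active flag.

In code lines 6~7, if the page has the LRU flag set and the unevictable flag clear, removes the page from its current lru list.
In code lines 8~9, clears the page's active flag and then adds the page to the tail of the list for its lru type.
In code line 10, increments the counter passed in through the last argument.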
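
The rotation to the tail can be tried out with a small circular doubly linked list written in the style of the kernel's list_head. This is a sketch under simplifying assumptions: `struct page_model` and `move_tail()` are hypothetical, and a single list stands in for the per-type lruvec lists.

```c
#include <assert.h>

/* Minimal circular doubly linked list in the style of the kernel's list_head. */
struct list_head {
        struct list_head *prev, *next;
};

void list_init(struct list_head *h)
{
        h->prev = h->next = h;
}

/* Links n between two neighbouring nodes. */
void list_insert(struct list_head *n, struct list_head *prev,
                 struct list_head *next)
{
        next->prev = n;
        n->next = next;
        n->prev = prev;
        prev->next = n;
}

/* Head insertion: the entry becomes the first one after the list head. */
void list_add(struct list_head *n, struct list_head *h)
{
        list_insert(n, h, h->next);
}

/* Tail insertion: the entry becomes the last one before the list head. */
void list_add_tail(struct list_head *n, struct list_head *h)
{
        list_insert(n, h->prev, h);
}

void list_del(struct list_head *n)
{
        n->prev->next = n->next;
        n->next->prev = n->prev;
        n->prev = n->next = n;
}

/* Hypothetical page: only the list linkage and the active flag. */
struct page_model {
        struct list_head lru;
        int active;
};

/* Model of the rotation: unlink, drop the active flag, re-add at the tail. */
void move_tail(struct page_model *p, struct list_head *lru_list, int *pgmoved)
{
        assert(p && lru_list && pgmoved);
        list_del(&p->lru);
        p->active = 0;
        list_add_tail(&p->lru, lru_list);
        (*pgmoved)++;
}
```

After `move_tail()` the page is reachable through `lru_list->prev`, which is the end of the list that is treated as cold.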

lru_deactivate_file_fn()

mm/swap.c

/*
 * If the page can not be invalidated, it is moved to the
 * inactive list to speed up its reclaim.  It is moved to the
 * head of the list, rather than the tail, to give the flusher
 * threads some time to write it out, as this is much more
 * effective than the single-page writeout from reclaim.
 *
 * If the page isn't page_mapped and dirty/writeback, the page
 * could reclaim asap using PG_reclaim.
 *
 * 1. active, mapped page -> none
 * 2. active, dirty/writeback page -> inactive, head, PG_reclaim
 * 3. inactive, mapped page -> none
 * 4. inactive, dirty/writeback page -> inactive, head, PG_reclaim
 * 5. inactive, clean -> inactive, tail
 * 6. Others -> none
 *
 * In 4, why it moves inactive's head, the VM expects the page would
 * be write it out by flusher threads as this is much more effective
 * than the single-page writeout from reclaim.
 */

static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
                              void *arg)
{
        int lru, file;
        bool active;

        if (!PageLRU(page))
                return;

        if (PageUnevictable(page))
                return;

        /* Some processes are using the page */
        if (page_mapped(page))
                return;

        active = PageActive(page);
        file = page_is_file_cache(page);
        lru = page_lru_base_type(page);

        del_page_from_lru_list(page, lruvec, lru + active);
        ClearPageActive(page);
        ClearPageReferenced(page);
        add_page_to_lru_list(page, lruvec, lru);

        if (PageWriteback(page) || PageDirty(page)) {
                /*
                 * PG_reclaim could be raced with end_page_writeback
                 * It can make readahead confusing.  But race window
                 * is _really_ small and  it's non-critical problem.
                 */
                SetPageReclaim(page);
        } else {
                /*
                 * The page's writeback ends up during pagevec
                 * We moves tha page into tail of inactive.
                 */
                list_move_tail(&page->lru, &lruvec->lists[lru]);
                __count_vm_event(PGROTATED);
        }

        if (active)
                __count_vm_event(PGDEACTIVATE);
        update_page_reclaim_stat(lruvec, file, 0);
}

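If the page is on an lru list, is not unevictable, and is not mapped, removes it from the list for its lru type and adds it to the head of the list for its lru base type (inactive). The active and referenced flags are cleared. If the page is dirty or under writeback, the reclaim flag is set; otherwise the page is moved to the tail of the list.

In code lines 7~8, returns without further processing if the LRU flag is not set on the page.
In code lines 10~11, returns without further processing if the Unevictable flag is set on the page.
In code lines 14~15, returns without further processing if the page is already mapped and in use by a process.
In code line 17, reads whether the page has the active flag set.
In code line 18, reads whether the page is cached from a file.
In code line 19, gets the lru base type from the page.
- Returns LRU_INACTIVE_FILE or LRU_INACTIVE_ANON.
In code line 21, removes the page from the lru list at index lru + active.
In code lines 22~24, clears the Active and Referenced flags on the page and then adds it to the lru list at the lru base type index.
In code lines 26~32, if Writeback or Dirty is set on the page, sets the Reclaim flag.
In code lines 33~40, otherwise moves the page to the tail of the lru list at that index, and then increments the PGROTATED counter.
- A page at the tail is treated as cold, that is, least recently used, so it is the first candidate for reclaim.
In code lines 42~43, if the page was active, increments the PGDEACTIVATE vm_event.
In code line 44, increments recent_scanned[] in the reclaim statistics. recent_rotated[] is left unchanged because rotated is passed as 0.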
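
The branches walked through above reduce to a small decision function. The sketch below follows the code paths of the listing rather than the numbered cases in its header comment; `struct page_model`, `enum outcome`, and `deactivate_outcome()` are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Where the page ends up after the deactivate step. */
enum outcome {
        NONE,                           /* page is left untouched */
        INACTIVE_HEAD_RECLAIM,          /* inactive list head, PG_reclaim set */
        INACTIVE_TAIL                   /* inactive list tail */
};

/* Hypothetical stand-in for the flags the function tests. */
struct page_model {
        bool on_lru;
        bool unevictable;
        bool mapped;
        bool dirty_or_writeback;
};

enum outcome deactivate_outcome(const struct page_model *p)
{
        assert(p);
        /* The three early returns of the listing. */
        if (!p->on_lru || p->unevictable || p->mapped)
                return NONE;
        /* Dirty or writeback pages wait at the head for the flusher threads. */
        if (p->dirty_or_writeback)
                return INACTIVE_HEAD_RECLAIM;
        /* Clean pages go to the tail so that they are reclaimed soon. */
        return INACTIVE_TAIL;
}
```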

lru_lazyfree_fn()

mm/swap.c

static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
                            void *arg)
{
        if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
            !PageSwapCache(page) && !PageUnevictable(page)) {
                bool active = PageActive(page);

                del_page_from_lru_list(page, lruvec,
                                       LRU_INACTIVE_ANON + active);
                ClearPageActive(page);
                ClearPageReferenced(page);
                /*
                 * lazyfree pages are clean anonymous pages. They have
                 * SwapBacked flag cleared to distinguish normal anonymous
                 * pages
                 */
                ClearPageSwapBacked(page);
                add_page_to_lru_list(page, lruvec, LRU_INACTIVE_FILE);

                __count_vm_events(PGLAZYFREE, hpage_nr_pages(page));
                count_memcg_page_event(page, PGLAZYFREE);
                update_page_reclaim_stat(lruvec, 1, 0);
        }
}

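Turns a normal anon page that is swap-backed into a clean anon page that is not swap-backed, and adds it to the head (hot end) of the inactive file lru list.

In code lines 4~9, if the page is a normal swap-backed anon page that is on an lru list, is not in the swap cache, and is not unevictable, removes it from its lruvec list.
In code lines 10~18, clears the Active, Referenced, and SwapBacked flags on the page and then adds it to the inactive file lru list.
In code line 20, increases the PGLAZYFREE vm counter by the number of pages.
In code line 21, increments the PGLAZYFREE counter in the memcg.
In code line 22, increments recent_scanned[1] (file) in the reclaim statistics. recent_rotated[] is left unchanged because rotated is passed as 0.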
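
Why a lazyfree anon page lands on a file list becomes clearer when the flag handling is modeled with plain bits. This is a rough sketch: the `F_*` bit values and the three helper functions are hypothetical, and in the kernel PageAnon() is derived from the page's mapping rather than from a flag bit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits; the values are chosen for this sketch only. */
enum {
        F_LRU         = 1 << 0,
        F_ANON        = 1 << 1,
        F_SWAPBACKED  = 1 << 2,
        F_SWAPCACHE   = 1 << 3,
        F_UNEVICTABLE = 1 << 4,
        F_ACTIVE      = 1 << 5,
        F_REFERENCED  = 1 << 6
};

/* Same condition as the if statement at the top of the function. */
bool lazyfree_eligible(unsigned int flags)
{
        unsigned int need = F_LRU | F_ANON | F_SWAPBACKED;
        unsigned int deny = F_SWAPCACHE | F_UNEVICTABLE;

        return (flags & need) == need && !(flags & deny);
}

/* Clears Active, Referenced, and SwapBacked, as the function does. */
unsigned int make_lazyfree(unsigned int flags)
{
        return flags & ~(unsigned int)(F_ACTIVE | F_REFERENCED | F_SWAPBACKED);
}

/* Model of page_is_file_cache(): true when the page is not swap-backed. */
bool is_file_cache(unsigned int flags)
{
        return !(flags & F_SWAPBACKED);
}
```

Once SwapBacked is cleared, the file-cache test reports the page as belonging on a file lru, which matches the inactive file list chosen in the listing.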

__activate_page()

mm/swap.c

static void __activate_page(struct page *page, struct lruvec *lruvec,
                            void *arg)
{
        if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
                int file = page_is_file_cache(page);
                int lru = page_lru_base_type(page);

                del_page_from_lru_list(page, lruvec, lru);
                SetPageActive(page);
                lru += LRU_ACTIVE;
                add_page_to_lru_list(page, lruvec, lru);
                trace_mm_lru_activate(page);

                __count_vm_event(PGACTIVATE);
                update_page_reclaim_stat(lruvec, file, 1);
        }
}

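Removes the page from lruvec->lists[base type], sets the active flag, and adds it to the head (hot end) of lruvec->lists[lru + active].

In code lines 4~8, if the page has the LRU flag set, is inactive, and does not have the unevictable flag set, removes it from the lru list for its lru type.
In code lines 9~11, sets the page active and adds it to the head of the active lru list for its type (file or anon).
In code line 14, increments the PGACTIVATE vm_event counter.
In code line 15, increments both recent_scanned[] and recent_rotated[] in the reclaim statistics, because rotated is passed as 1.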

Miscellaneous

page_evictable()

mm/vmscan.c

/*
 * page_evictable - test whether a page is evictable
 * @page: the page to test
 *
 * Test whether page is evictable--i.e., should be placed on active/inactive
 * lists vs unevictable list.
 *
 * Reasons page might not be evictable:
 * (1) page's mapping marked unevictable
 * (2) page is part of an mlocked VMA
 *
 */

int page_evictable(struct page *page)
{
        int ret;

        /* Prevent address_space of inode and swap cache from being freed */
        rcu_read_lock();
        ret = !mapping_unevictable(page_mapping(page)) && !PageMlocked(page);
        rcu_read_unlock();
        return ret;
}

Returns whether the page is evictable.

A page is evictable when its mapping (address_space) is not marked unevictable and the page is not mlocked.

page_is_file_cache()

include/linux/mm_inline.h

/**
 * page_is_file_cache - should the page be on a file LRU or anon LRU?
 * @page: the page to test
 *
 * Returns 1 if @page is page cache page backed by a regular filesystem,
 * or 0 if @page is anonymous, tmpfs or otherwise ram or swap backed.
 * Used by functions that manipulate the LRU lists, to sort a page
 * onto the right LRU list.
 *
 * We would like to get this info without a page flag, but the state
 * needs to survive until the page is last deleted from the LRU, which
 * could be as far down as __page_cache_release.
 */

static inline int page_is_file_cache(struct page *page)
{
        return !PageSwapBacked(page);
}

Returns whether the page belongs on a file lru or an anon lru.

1: belongs on a file lru.
- A file cache page, or a clean anon page that is not swap-backed
0: belongs on an anon lru.
- A normal swap-backed anon page, or tmpfs

page_lru()

include/linux/mm_inline.h

/**     
 * page_lru - which LRU list should a page be on?
 * @page: the page to test
 *      
 * Returns the LRU list a page should be on, as an index
 * into the array of LRU lists.
 */

static __always_inline enum lru_list page_lru(struct page *page)
{
        enum lru_list lru;

        if (PageUnevictable(page))
                lru = LRU_UNEVICTABLE;
        else {
                lru = page_lru_base_type(page);
                if (PageActive(page))
                        lru += LRU_ACTIVE;
        }
        return lru;
}

Gets the lru value (one of five states) for the page.

In code lines 5~6, returns LRU_UNEVICTABLE(4) if the page has the unevictable flag.
In code lines 7~8, gets LRU_INACTIVE_FILE(2) if the page is a file cache page, or LRU_INACTIVE_ANON(0) otherwise.
In code lines 9~10, if the page is active, adds LRU_ACTIVE(1) to lru. No page flag is modified here.
- LRU_INACTIVE_FILE(2) -> LRU_ACTIVE_FILE(3)
- LRU_INACTIVE_ANON(0) -> LRU_ACTIVE_ANON(1)
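
The index arithmetic can be checked in isolation. The sketch below reuses the LRU_BASE, LRU_ACTIVE, and LRU_FILE values from the kernel's enum lru_list; `lru_index()` is a hypothetical helper that takes plain booleans in place of a struct page.

```c
#include <assert.h>
#include <stdbool.h>

#define LRU_BASE   0
#define LRU_ACTIVE 1
#define LRU_FILE   2

enum lru_list {
        LRU_INACTIVE_ANON = LRU_BASE,
        LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
        LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
        LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
        LRU_UNEVICTABLE,
        NR_LRU_LISTS
};

/* Hypothetical helper: the same arithmetic as page_lru(), driven by booleans. */
enum lru_list lru_index(bool unevictable, bool file, bool active)
{
        int lru;

        if (unevictable)
                return LRU_UNEVICTABLE;

        /* Base type first, then the active offset on top of it. */
        lru = file ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON;
        if (active)
                lru += LRU_ACTIVE;

        assert(lru >= LRU_BASE && lru < LRU_UNEVICTABLE);
        return (enum lru_list)lru;
}
```

Because each active list sits exactly LRU_ACTIVE above its inactive counterpart, code such as `lru + active` can switch between the two without a lookup table.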

add_page_to_lru_list()

include/linux/mm_inline.h

static __always_inline void add_page_to_lru_list(struct page *page,
                                struct lruvec *lruvec, enum lru_list lru)
{
        update_lru_size(lruvec, lru, page_zonenum(page), hpage_nr_pages(page));
        list_add(&page->lru, &lruvec->lists[lru]);
}

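Adds the page to an lru list.

In code line 4, updates the lru statistics.
- For a huge page, the count is the number of base pages it contains; otherwise it is 1.
  - For a 2MB huge page -> 512 (with 4KB base pages)
In code line 5, adds the page to the head of the list for the lru type. A page at the head is treated as a hot page, that is, recently used.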
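
The figure of 512 follows from dividing the huge page size by the base page size. A small worked sketch, assuming power-of-two sizes; `order_for()` and `nr_base_pages()` are hypothetical helpers, not kernel functions.

```c
#include <assert.h>

/* Number of base pages covered by a compound page of the given order. */
unsigned long nr_base_pages(unsigned int order)
{
        return 1UL << order;
}

/* Smallest order whose span reaches huge_bytes, for power-of-two sizes. */
unsigned int order_for(unsigned long huge_bytes, unsigned long base_bytes)
{
        unsigned int order = 0;

        assert(base_bytes > 0);
        while ((base_bytes << order) < huge_bytes)
                order++;
        return order;
}
```

With 4KB base pages, a 2MB huge page has order 9, and `nr_base_pages(9)` is the page count that gets added to the lru statistics in one step.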

update_lru_size()

include/linux/mm_inline.h

static __always_inline void update_lru_size(struct lruvec *lruvec,
                                enum lru_list lru, enum zone_type zid,
                                int nr_pages)
{
        __update_lru_size(lruvec, lru, zid, nr_pages);
#ifdef CONFIG_MEMCG
        mem_cgroup_update_lru_size(lruvec, lru, zid, nr_pages);
#endif
}

In code line 5, adds the page count to the node and zone vm counters that correspond to the page's lru type.
In code line 7, adds the page count to lru_zone_size[zid][lru] of the memory cgroup.

__update_lru_size()

include/linux/mm_inline.h

static __always_inline void __update_lru_size(struct lruvec *lruvec,
                                enum lru_list lru, enum zone_type zid,
                                int nr_pages)
{
        struct pglist_data *pgdat = lruvec_pgdat(lruvec);

        __mod_node_page_state(pgdat, NR_LRU_BASE + lru, nr_pages);
        __mod_zone_page_state(&pgdat->node_zones[zid],
                                NR_ZONE_LRU_BASE + lru, nr_pages);
}

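Adds the page count to the node and zone vm counters that correspond to the page's lru type.

In code line 7, adds the page count to the node vm counter that corresponds to the page's lru type.
In code lines 8~9, adds the page count to the zone vm counter that corresponds to the page's lru type.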

mem_cgroup_update_lru_size()

mm/memcontrol.c

/**
 * mem_cgroup_update_lru_size - account for adding or removing an lru page
 * @lruvec: mem_cgroup per zone lru vector
 * @lru: index of lru list the page is sitting on
 * @zid: zone id of the accounted pages
 * @nr_pages: positive when adding or negative when removing
 *
 * This function must be called under lru_lock, just before a page is added
 * to or just after a page is removed from an lru list (that ordering being
 * so as to allow it to check that lru_size 0 is consistent with list_empty).
 */

void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
                                int zid, int nr_pages)
{
        struct mem_cgroup_per_node *mz;
        unsigned long *lru_size;
        long size;

        if (mem_cgroup_disabled())
                return;

        mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
        lru_size = &mz->lru_zone_size[zid][lru];

        if (nr_pages < 0)
                *lru_size += nr_pages;

        size = *lru_size;
        if (WARN_ONCE(size < 0,
                "%s(%p, %d, %d): lru_size %ld\n",
                __func__, lruvec, lru, nr_pages, size)) {
                VM_BUG_ON(1);
                *lru_size = 0;
        }

        if (nr_pages > 0)
                *lru_size += nr_pages;
}

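Adds the page count to lru_zone_size[zid][lru], which the memory cgroup keeps per node. A negative nr_pages is applied before the underflow check and a positive nr_pages after it.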
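
The order of the two additions around the sanity check matters: removals are subtracted before the check and additions are applied after it, so the check always sees the size that should match the list contents at that moment. A user-space sketch of that ordering, assuming a signed counter for simplicity; `update_lru_size_model()` is a hypothetical name.

```c
#include <assert.h>

/*
 * Applies nr_pages to *lru_size in the same order as the kernel function.
 * Returns 1 if an underflow was detected and the size was reset to 0.
 */
int update_lru_size_model(long *lru_size, int nr_pages)
{
        int underflow = 0;

        assert(lru_size);

        /* Removal: account for it first, then validate the result. */
        if (nr_pages < 0)
                *lru_size += nr_pages;

        if (*lru_size < 0) {
                underflow = 1;
                *lru_size = 0;
        }

        /* Addition: validate the old size first, then account for it. */
        if (nr_pages > 0)
                *lru_size += nr_pages;

        return underflow;
}
```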

update_page_reclaim_stat()

mm/swap.c

static void update_page_reclaim_stat(struct lruvec *lruvec,
                                     int file, int rotated)
{
        struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;

        reclaim_stat->recent_scanned[file]++;
        if (rotated)
                reclaim_stat->recent_rotated[file]++;
}

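Updates the reclaim statistics: recent_scanned[] is always incremented, and recent_rotated[] is incremented only when rotated is nonzero. Each of the two is a two-element array indexed as follows.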

[0]: anon LRU stat
[1]: file LRU stat
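
The two counters are meant to be read as a ratio: the higher rotated/scanned is, the more valuable that cache is considered. The sketch below models the update and a percentage read-out; `struct reclaim_stat_model`, `update_stat_model()`, and `rotated_percent()` are hypothetical, and the percentage helper is only an illustration of the ratio, not how vmscan.c computes its scan balance.

```c
#include <assert.h>

/* Same layout idea as zone_reclaim_stat: [0] is anon, [1] is file. */
struct reclaim_stat_model {
        unsigned long recent_rotated[2];
        unsigned long recent_scanned[2];
};

/* Mirrors update_page_reclaim_stat(): always scanned, rotated on request. */
void update_stat_model(struct reclaim_stat_model *s, int file, int rotated)
{
        assert(s && (file == 0 || file == 1));
        s->recent_scanned[file]++;
        if (rotated)
                s->recent_rotated[file]++;
}

/* rotated/scanned as a percentage; 0 when nothing has been scanned yet. */
unsigned long rotated_percent(const struct reclaim_stat_model *s, int file)
{
        assert(s && (file == 0 || file == 1));
        if (s->recent_scanned[file] == 0)
                return 0;
        return s->recent_rotated[file] * 100 / s->recent_scanned[file];
}
```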

Returning to the LRU List

putback_movable_pages()

mm/migrate.c

/*
 * Put previously isolated pages back onto the appropriate lists
 * from where they were once taken off for compaction/migration.
 *
 * This function shall be used whenever the isolated pageset has been
 * built from lru, balloon, hugetlbfs page. See isolate_migratepages_range()
 * and isolate_huge_page().
 */
void putback_movable_pages(struct list_head *l)
{
        struct page *page;
        struct page *page2;

        list_for_each_entry_safe(page, page2, l, lru) {
                if (unlikely(PageHuge(page))) {
                        putback_active_hugepage(page);
                        continue;
                }
                list_del(&page->lru);
                dec_zone_page_state(page, NR_ISOLATED_ANON +
                                page_is_file_cache(page));
                if (unlikely(isolated_balloon_page(page)))
                        balloon_page_putback(page);
                else
                        putback_lru_page(page);
        }
}

Puts previously isolated pages back where they came from.

list_for_each_entry_safe(page, page2, l, lru) {
- Loops over every page on the list.
if (unlikely(PageHuge(page))) { putback_active_hugepage(page); continue; }
- In the unlikely case of a huge page, moves it to the tail of hstate[].hugepage_activelist and skips the rest of the loop body.
  - Huge pages are managed in hstate[].
dec_zone_page_state(page, NR_ISOLATED_ANON + page_is_file_cache(page));
- Decrements the NR_ISOLATED_ANON or NR_ISOLATED_FILE stat depending on the page type.
if (unlikely(isolated_balloon_page(page))) balloon_page_putback(page);
- In the unlikely case of a balloon page, puts it back on the pages list of its balloon_dev_info.
  - Balloon pages are managed by the balloon device.
else putback_lru_page(page);
- Puts the page back on lruvec.lists[].

putback_lru_page()

mm/vmscan.c

/**
 * putback_lru_page - put previously isolated page onto appropriate LRU list
 * @page: page to be put back to appropriate lru list
 *
 * Add previously isolated @page to appropriate LRU list.
 * Page may still be unevictable for other reasons.
 *
 * lru_lock must not be held, interrupts must be enabled.
 */
void putback_lru_page(struct page *page)
{
        bool is_unevictable;
        int was_unevictable = PageUnevictable(page);

        VM_BUG_ON_PAGE(PageLRU(page), page);

redo:
        ClearPageUnevictable(page);

        if (page_evictable(page)) {
                /*
                 * For evictable pages, we can use the cache.
                 * In event of a race, worst case is we end up with an
                 * unevictable page on [in]active list.
                 * We know how to handle that.
                 */
                is_unevictable = false;
                lru_cache_add(page);
        } else {
                /*
                 * Put unevictable pages directly on zone's unevictable
                 * list.
                 */
                is_unevictable = true;
                add_page_to_unevictable_list(page);
                /*
                 * When racing with an mlock or AS_UNEVICTABLE clearing
                 * (page is unlocked) make sure that if the other thread
                 * does not observe our setting of PG_lru and fails
                 * isolation/check_move_unevictable_pages,
                 * we see PG_mlocked/AS_UNEVICTABLE cleared below and move
                 * the page back to the evictable list.
                 *
                 * The other side is TestClearPageMlocked() or shmem_lock().
                 */
                smp_mb();
        }

        /*
         * page's status can change while we move it among lru. If an evictable
         * page is on unevictable list, it never be freed. To avoid that,
         * check after we added it to the list, again.
         */
        if (is_unevictable && page_evictable(page)) {
                if (!isolate_lru_page(page)) {
                        put_page(page);
                        goto redo;
                }
                /* This means someone else dropped this page from LRU
                 * So, it will be freed or putback to LRU again. There is
                 * nothing to do here.
                 */
        }

        if (was_unevictable && !is_unevictable)
                count_vm_event(UNEVICTABLE_PGRESCUED);
        else if (!was_unevictable && is_unevictable)
                count_vm_event(UNEVICTABLE_PGCULLED);

        put_page(page);         /* drop ref from isolate */
}

Puts a previously isolated page back on the lruvec.

int was_unevictable = PageUnevictable(page);
- Reads whether the page is in the unevictable state.
ClearPageUnevictable(page);
- Clears the PG_unevictable flag on the page.
if (page_evictable(page)) { is_unevictable = false; lru_cache_add(page);
- If the page is evictable judging by its mapping state, stores false in is_unevictable and adds the page to the lru_add_pvec cache.
} else { is_unevictable = true; add_page_to_unevictable_list(page); smp_mb(); }
- Otherwise, stores true in is_unevictable and adds the page to lruvec.lists[LRU_UNEVICTABLE].
if (is_unevictable && page_evictable(page)) { if (!isolate_lru_page(page)) { put_page(page); goto redo; } }
- If a page that was just added to lruvec.lists[LRU_UNEVICTABLE] has become evictable in the meantime, it would never be freed. To avoid this, the page is isolated once more and the check is repeated.
if (was_unevictable && !is_unevictable) count_vm_event(UNEVICTABLE_PGRESCUED);
- If the page was unevictable and no longer is, increments the UNEVICTABLE_PGRESCUED stat.
else if (!was_unevictable && is_unevictable) count_vm_event(UNEVICTABLE_PGCULLED);
- If the page was not unevictable and now is, increments the UNEVICTABLE_PGCULLED stat.
put_page(page);
- Drops the reference taken at isolation. Only when this is the last reference is the page actually released: its LRU flag is cleared, it is removed from the lru list, and it is freed to the buddy system as a hot page.
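
The redo loop exists because evictability can change between the first check and the moment the page sits on the unevictable list. The following toy model illustrates only that control flow; it assumes the re-isolation always succeeds, and `struct page_model`, `evictable_now()`, and `putback_model()` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical page whose evictability flips after a number of checks. */
struct page_model {
        int clears_at;                  /* evictable once checks >= clears_at */
        int checks;                     /* number of evictability checks so far */
        bool on_unevictable_list;
        bool on_evictable_list;
};

/* Stand-in for page_evictable(): the answer may change between calls. */
bool evictable_now(struct page_model *p)
{
        return p->checks++ >= p->clears_at;
}

/* Returns the number of passes through the redo label. */
int putback_model(struct page_model *p)
{
        int rounds = 0;

        assert(p);
        for (;;) {
                rounds++;
                if (evictable_now(p)) {
                        p->on_evictable_list = true;
                        break;
                }
                p->on_unevictable_list = true;

                /* Re-check after the page became visible on the list. */
                if (!evictable_now(p))
                        break;

                /* Became evictable meanwhile: isolate again and redo. */
                p->on_unevictable_list = false;
        }
        return rounds;
}
```

A page that turns evictable right after it was placed on the unevictable list takes two passes and ends up on an evictable list instead of being stranded.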

Huge Page & Huge TLB

Available only on architectures that support Huge TLB.
- Usable on x86, ia64, arm with LPAE, sparc64, s390, and others.
- See: hugetlbpage.txt | kernel.org
With Huge TLB, a large page is loaded through a single TLB entry, which reduces mapping overhead and keeps access fast.
With Huge TLB, page blocks can be managed in HugeTLB-sized units instead of MAX_ORDER-1 page units, to improve TLB hardware performance.
The global hstate[] is an array, so several huge page (TLB entry) sizes can be configured and used side by side.
- See: hugetlb: multiple hstates for multiple page sizes
Kernel parameters reserve a region of the specified size for use.
- e.g. “default_hugepagesz=1G hugepagesz=1G”
Changing the setting at runtime
- Use “/proc/sys/vm/nr_hugepages”; on NUMA systems, set “/sys/devices/system/node/node_id/hugepages/hugepages”.
When shared memory is created, the SHM_HUGETLB option makes it use huge TLB.
- e.g. shmid = shmget(2, LENGTH, SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W);
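
The runtime setting can be read back from user space before a program tries to use huge pages. A minimal sketch, assuming the file holds a single decimal number; `read_counter_file()` is a hypothetical helper, and it returns -1 on systems where the file is missing or unreadable.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Reads a single non-negative integer from a procfs or sysfs style file.
 * Returns -1 if the file cannot be opened or does not start with a number.
 */
long read_counter_file(const char *path)
{
        FILE *fp;
        long value = -1;

        assert(path != NULL);
        fp = fopen(path, "r");
        if (!fp)
                return -1;
        if (fscanf(fp, "%ld", &value) != 1 || value < 0)
                value = -1;
        fclose(fp);
        return value;
}
```

Calling `read_counter_file("/proc/sys/vm/nr_hugepages")` gives the size of the persistent huge page pool, so a result of 0 or -1 suggests that a SHM_HUGETLB request is unlikely to succeed.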

HugeTLBFS

It behaves like a file system, so it is mounted before use.
- e.g. mount -t hugetlbfs -o uid=<value>,gid=<value>,mode=<value>,size=<value>,nr_inodes=<value> none /mnt/huge
Files created inside the mounted directory (/mnt/huge) are mapped using huge TLB.

putback_active_hugepage()

mm/hugetlb.c

void putback_active_hugepage(struct page *page)
{                                       
        VM_BUG_ON_PAGE(!PageHead(page), page);
        spin_lock(&hugetlb_lock);
        list_move_tail(&page->lru, &(page_hstate(page))->hugepage_activelist);
        spin_unlock(&hugetlb_lock);
        put_page(page);
}

Puts a previously isolated page back at the tail of hugepage_activelist in the global hstate[].

Drops the reference count that was taken at isolation.

Balloon Page Management

Linux provides Balloon device drivers for virtual machines such as KVM and XEN.
Balloon memory compaction is supported to prevent memory fragmentation.

balloon_page_putback()

mm/balloon_compaction.c

/* putback_lru_page() counterpart for a ballooned page */
void balloon_page_putback(struct page *page)
{
        /*
         * 'lock_page()' stabilizes the page and prevents races against
         * concurrent isolation threads attempting to re-isolate it.
         */
        lock_page(page);

        if (__is_movable_balloon_page(page)) {
                __putback_balloon_page(page);
                /* drop the extra ref count taken for page isolation */
                put_page(page);
        } else {
                WARN_ON(1);
                dump_page(page, "not movable balloon page");
        }
        unlock_page(page);
}

If the previously isolated page is a balloon page, puts it back on the pages list of the balloon device recorded in the page.

Drops the reference count that was taken at isolation.

__is_movable_balloon_page()

include/linux/balloon_compaction.h

/*
 * __is_movable_balloon_page - helper to perform @page PageBalloon tests
 */             
static inline bool __is_movable_balloon_page(struct page *page)
{
        return PageBalloon(page);
}

Returns whether the page is a balloon page.

__putback_balloon_page()

mm/balloon_compaction.c

static inline void __putback_balloon_page(struct page *page)
{
        struct balloon_dev_info *b_dev_info = balloon_page_device(page);
        unsigned long flags;

        spin_lock_irqsave(&b_dev_info->pages_lock, flags);
        SetPagePrivate(page);
        list_add(&page->lru, &b_dev_info->pages);
        b_dev_info->isolated_pages--;
        spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
}

Sets the PG_private flag on the page, puts it back on the pages list of the balloon device recorded in the page, and decrements that device's isolated_pages count.

balloon_page_device()

include/linux/balloon_compaction.h

/*
 * balloon_page_device - get the b_dev_info descriptor for the balloon device
 *                       that enqueues the given page.
 */
static inline struct balloon_dev_info *balloon_page_device(struct page *page)
{
        return (struct balloon_dev_info *)page_private(page);
}

Gets the balloon device of the page.

Structures

pagevec structure

struct pagevec {                        
        unsigned long nr;
        bool percpu_pvec_drained;
        struct page *pages[PAGEVEC_SIZE];
};

nr
- Number of pages currently held in the pagevec
percpu_pvec_drained
- Whether the per-cpu pagevecs have been drained
*pages[]
- The pages held in the pagevec (up to 15)
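
A pagevec is a small fixed-size batch: pages are appended until no slot is left, and the caller then processes the whole batch at once. The sketch below models that usage pattern in user space; `struct pagevec_model`, `pvec_add()`, and `pvec_reinit()` are hypothetical names, and PAGEVEC_SIZE is set to 15 to match the limit noted above.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGEVEC_SIZE 15

struct page;    /* opaque here; only pointers are stored */

/* User-space model with the same fields as struct pagevec. */
struct pagevec_model {
        unsigned long nr;
        bool percpu_pvec_drained;
        struct page *pages[PAGEVEC_SIZE];
};

/*
 * Appends a page and returns the number of free slots that remain.
 * A return value of 0 tells the caller to flush the batch now.
 */
unsigned int pvec_add(struct pagevec_model *pvec, struct page *page)
{
        assert(pvec && pvec->nr < PAGEVEC_SIZE);
        pvec->pages[pvec->nr++] = page;
        return (unsigned int)(PAGEVEC_SIZE - pvec->nr);
}

/* Empties the batch after its pages have been processed. */
void pvec_reinit(struct pagevec_model *pvec)
{
        assert(pvec);
        pvec->nr = 0;
}
```

Batching in this way lets the lru lock be taken once per batch instead of once per page.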

lruvec structure

include/linux/mmzone.h

struct lruvec {
        struct list_head                lists[NR_LRU_LISTS];
        struct zone_reclaim_stat        reclaim_stat;
        /* Evictions & activations on the inactive file list */
        atomic_long_t                   inactive_age;
        /* Refaults at the time of last reclaim cycle */
        unsigned long                   refaults;
#ifdef CONFIG_MEMCG
        struct pglist_data *pgdat;
#endif
};

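lists[]
- The five lru lists.
reclaim_stat
- Reclaim-related statistics
inactive_age
- Counts evictions and activations on the inactive file list.
refaults
- The refault count at the time of the last reclaim cycle.
*pgdat
- Points to the node.
- When the memory control cgroup is used, lruvecs are managed per node.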

zone_reclaim_stat structure

include/linux/mmzone.h

struct zone_reclaim_stat {
        /*
         * The pageout code in vmscan.c keeps track of how many of the
         * mem/swap backed and file backed pages are referenced.
         * The higher the rotated/scanned ratio, the more valuable
         * that cache is.
         *
         * The anon LRU stats live in [0], file LRU stats in [1]
         */
        unsigned long           recent_rotated[2];
        unsigned long           recent_scanned[2];
};

lru_list

include/linux/mmzone.h

/*
 * We do arithmetic on the LRU lists in various places in the code,
 * so it is important to keep the active lists LRU_ACTIVE higher in
 * the array than the corresponding inactive lists, and to keep
 * the *_FILE lists LRU_FILE higher than the corresponding _ANON lists.
 *
 * This has to be kept in sync with the statistics in zone_stat_item
 * above and the descriptions in vmstat_text in mm/vmstat.c
 */
#define LRU_BASE 0
#define LRU_ACTIVE 1
#define LRU_FILE 2

enum lru_list {
        LRU_INACTIVE_ANON = LRU_BASE,
        LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
        LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
        LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
        LRU_UNEVICTABLE,
        NR_LRU_LISTS
};

Global (per-CPU) pagevec caches

mm/swap.c

static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
static DEFINE_PER_CPU(struct pagevec, lru_lazyfree_pvecs);
#ifdef CONFIG_SMP
static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
#endif

References

Zoned Allocator -1- (Physical Page Allocation-Fastpath) | 문c
Zoned Allocator -2- (Physical Page Allocation-Slowpath) | 문c
Zoned Allocator -3- (Buddy Page Allocation) | 문c
Zoned Allocator -4- (Buddy Page Free) | 문c
Zoned Allocator -5- (Per-CPU Page Frame Cache) | 문c
Zoned Allocator -6- (Watermark) | 문c
Zoned Allocator -7- (Direct Compact) | 문c
Zoned Allocator -8- (Direct Compact-Isolation) | 문c
Zoned Allocator -9- (Direct Compact-Migration) | 문c
Zoned Allocator -10- (LRU & pagevec) | 문c (this post)
Zoned Allocator -11- (Direct Reclaim) | 문c
Zoned Allocator -12- (Direct Reclaim-Shrink-1) | 문c
Zoned Allocator -13- (Direct Reclaim-Shrink-2) | 문c
Zoned Allocator -14- (Kswapd) | 문c

[Linux] The Life of Memory as Seen Through pageflags | F/OSS
Linux Memory Allocation | Columbia Edu. (PDF download)
PageReplacementDesign | linux-mm.org
UNEVICTABLE LRU INFRASTRUCTURE | kernel.org
Overview of Memory Reclaim in the Current Upstream Kernel (2021) | SUSE (PDF download)