문c Blog

Slub Memory Allocator -3- (Cache Creation)

2016-06-03, 2019-11-23 문영일

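Slab Cache Creation

To supply fixed-size objects continuously and quickly, the kernel creates and prepares slab caches for kernel memory.

Hardened usercopy whitelisting

For security, a restriction was added on which part of a slab cache object may be exchanged between the kernel and user space. The kmem_cache_create_usercopy() function was added so that, when a slab cache is created, only usersize bytes starting at offset useroffset within each object are allowed to be copied. A slab cache created for kernel-only use does not allow copying its data to user space. The kmalloc caches, however, are created so that the entire object size is allowed.

Reference: usercopy: Prepare for usercopy whitelisting

Alias slab caches

When a slab cache is requested whose object size matches that of an existing slab cache and whose flag settings are similar, the existing cache is shared instead of creating a new one. A slab cache produced this way is called an alias slab cache.

The following figure shows the function call relationships for creating a slab cache.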

kmem_cache_create()

mm/slab_common.c

/**
 * kmem_cache_create - Create a cache.
 * @name: A string which is used in /proc/slabinfo to identify this cache.
 * @size: The size of objects to be created in this cache.
 * @align: The required alignment for the objects.
 * @flags: SLAB flags
 * @ctor: A constructor for the objects.
 *
 * Cannot be called within a interrupt, but can be interrupted.
 * The @ctor is run when new pages are allocated by the cache.
 *
 * The flags are
 *
 * %SLAB_POISON - Poison the slab with a known test pattern (a5a5a5a5)
 * to catch references to uninitialised memory.
 *
 * %SLAB_RED_ZONE - Insert `Red` zones around the allocated memory to check
 * for buffer overruns.
 *
 * %SLAB_HWCACHE_ALIGN - Align the objects in this cache to a hardware
 * cacheline.  This can be beneficial if you're counting cycles as closely
 * as davem.
 *
 * Return: a pointer to the cache on success, NULL on failure.
 */

struct kmem_cache *
kmem_cache_create(const char *name, unsigned int size, unsigned int align,
                slab_flags_t flags, void (*ctor)(void *))
{
        return kmem_cache_create_usercopy(name, size, align, flags, 0, 0,
                                          ctor);
}
EXPORT_SYMBOL(kmem_cache_create);

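Creates a slab cache named @name with the requested @size and @align. If a slab cache with a similar size and compatible flags already exists, no separate cache is created and the request is registered as an alias cache. Returns NULL on failure.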

kmem_cache_create_usercopy()

mm/slab_common.c

/**
 * kmem_cache_create_usercopy - Create a cache with a region suitable
 * for copying to userspace
 * @name: A string which is used in /proc/slabinfo to identify this cache.
 * @size: The size of objects to be created in this cache.
 * @align: The required alignment for the objects.
 * @flags: SLAB flags
 * @useroffset: Usercopy region offset
 * @usersize: Usercopy region size
 * @ctor: A constructor for the objects.
 *
 * Cannot be called within a interrupt, but can be interrupted.
 * The @ctor is run when new pages are allocated by the cache.
 *
 * The flags are
 *
 * %SLAB_POISON - Poison the slab with a known test pattern (a5a5a5a5)
 * to catch references to uninitialised memory.
 *
 * %SLAB_RED_ZONE - Insert `Red` zones around the allocated memory to check
 * for buffer overruns.
 *
 * %SLAB_HWCACHE_ALIGN - Align the objects in this cache to a hardware
 * cacheline.  This can be beneficial if you're counting cycles as closely
 * as davem.
 *
 * Return: a pointer to the cache on success, NULL on failure.
 */

struct kmem_cache *
kmem_cache_create_usercopy(const char *name,
                  unsigned int size, unsigned int align,
                  slab_flags_t flags,
                  unsigned int useroffset, unsigned int usersize,
                  void (*ctor)(void *))
{
        struct kmem_cache *s = NULL;
        const char *cache_name;
        int err;

        get_online_cpus();
        get_online_mems();
        memcg_get_cache_ids();

        mutex_lock(&slab_mutex);

        err = kmem_cache_sanity_check(name, size);
        if (err) {
                goto out_unlock;
        }

        /* Refuse requests with allocator specific flags */
        if (flags & ~SLAB_FLAGS_PERMITTED) {
                err = -EINVAL;
                goto out_unlock;
        }

        /*
         * Some allocators will constraint the set of valid flags to a subset
         * of all flags. We expect them to define CACHE_CREATE_MASK in this
         * case, and we'll just provide them with a sanitized version of the
         * passed flags.
         */
        flags &= CACHE_CREATE_MASK;

        /* Fail closed on bad usersize of useroffset values. */
        if (WARN_ON(!usersize && useroffset) ||
            WARN_ON(size < usersize || size - usersize < useroffset))
                usersize = useroffset = 0;

        if (!usersize)
                s = __kmem_cache_alias(name, size, align, flags, ctor);
        if (s)
                goto out_unlock;

        cache_name = kstrdup_const(name, GFP_KERNEL);
        if (!cache_name) {
                err = -ENOMEM;
                goto out_unlock;
        }

        s = create_cache(cache_name, size,
                         calculate_alignment(flags, align, size),
                         flags, useroffset, usersize, ctor, NULL, NULL);
        if (IS_ERR(s)) {
                err = PTR_ERR(s);
                kfree_const(cache_name);
        }

out_unlock:
        mutex_unlock(&slab_mutex);

        memcg_put_cache_ids();
        put_online_mems();
        put_online_cpus();

        if (err) {
                if (flags & SLAB_PANIC)
                        panic("kmem_cache_create: Failed to create slab '%s'. Error %d\n",
                                name, err);
                else {
                        pr_warn("kmem_cache_create(%s) failed with error %d\n",
                                name, err);
                        dump_stack();
                }
                return NULL;
        }
        return s;
}
EXPORT_SYMBOL(kmem_cache_create_usercopy);

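At code line 12, get_online_cpus() increments cpu_hotplug.refcount (only when the CONFIG_HOTPLUG_CPU kernel option is enabled) so that CPU removal is synchronized (delayed) during cache creation.
At code line 13, get_online_mems() increments mem_hotplug.refcount (only when the CONFIG_MEMORY_HOTPLUG kernel option is enabled) so that memory removal is synchronized (delayed) during cache creation.
At code line 14, when the CONFIG_MEMCG_KMEM kernel option is enabled, a read semaphore lock is taken for the purpose of controlling slab cache usage.
At code lines 18~21, name and size are checked briefly, but only when the CONFIG_DEBUG_VM kernel option is enabled. Without that option the check always returns 0 (no error).
At code lines 24~27, a request that carries a flag outside the permitted set is rejected with an -EINVAL error.
At code line 35, only the flags that are valid for creating a kmem_cache are passed through.
At code lines 38~40, if useroffset and usersize, which describe the region that must be accessible from user space, fall outside the object size, both are set to 0.
At code lines 42~45, if usersize is 0 and a mergeable cache is found among the existing slab caches, creation of a new cache is skipped and the request is registered as an alias cache.
At code lines 47~51, name is duplicated and the copy is stored in cache_name. If name resides in the .rodata section, name itself is used without copying.
At code lines 53~57, the new cache is created.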
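The "fail closed" test on useroffset and usersize can be tried out in user space. The following is a minimal sketch, not kernel code: struct usercopy_region and sanitize_usercopy() are hypothetical names introduced here, and only the comparison itself is taken from the listing above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the usercopy window of a cache. */
struct usercopy_region {
    unsigned int useroffset; /* start of the whitelisted window */
    unsigned int usersize;   /* length of the window; 0 = nothing allowed */
};

/*
 * Same comparison as in the listing: a window is rejected when it has an
 * offset but no size, or when it does not fit inside an object of `size`.
 * A rejected window is reset to 0/0, so no user copy is allowed at all.
 */
struct usercopy_region
sanitize_usercopy(unsigned int size, unsigned int useroffset,
                  unsigned int usersize)
{
    struct usercopy_region r = { useroffset, usersize };
    bool bad = (!usersize && useroffset) ||
               size < usersize || size - usersize < useroffset;

    if (bad)
        r.useroffset = r.usersize = 0;
    return r;
}
```

Writing the second test as size - usersize < useroffset, after size < usersize has been ruled out, avoids the unsigned overflow that useroffset + usersize > size could run into.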

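The following figure lists the macro constants for the various flags. (The single letter after each flag is the character that the slabinfo utility prints for that flag attribute.)

Compile tools/vm/slabinfo.c in the kernel source tree to use the utility.
Reference: Slub Memory Allocator (slubinfo) | 문c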

kmem_cache_sanity_check()

mm/slab_common.c

static int kmem_cache_sanity_check(const char *name, unsigned int size)
{
        if (!name || in_interrupt() || size < sizeof(void *) ||
                size > KMALLOC_MAX_SIZE) {
                pr_err("kmem_cache_create(%s) integrity check failed\n", name);
                return -EINVAL;
        }

        WARN_ON(strchr(name, ' '));     /* It confuses parsers */
        return 0;
}

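When the CONFIG_DEBUG_VM kernel option is enabled, performs simple checks before a slab cache is created. Returns 0 on success.

Returns -EINVAL if called in interrupt context.
Returns -EINVAL if the size is smaller than 4 bytes on a 32-bit system or 8 bytes on a 64-bit system.
A NULL name or a size larger than KMALLOC_MAX_SIZE is rejected with -EINVAL as well.

Alias Cache Creation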

__kmem_cache_alias()

mm/slub.c

struct kmem_cache *
__kmem_cache_alias(const char *name, unsigned int size, unsigned int align,
                   slab_flags_t flags, void (*ctor)(void *))
{
        struct kmem_cache *s, *c;

        s = find_mergeable(size, align, flags, name, ctor);
        if (s) {
                s->refcount++;

                /*
                 * Adjust the object sizes so that we clear
                 * the complete object on kzalloc.
                 */
                s->object_size = max(s->object_size, size);
                s->inuse = max(s->inuse, ALIGN(size, sizeof(void *)));

                for_each_memcg_cache(c, s) {
                        c->object_size = s->object_size;
                        c->inuse = max(c->inuse, ALIGN(size, sizeof(void *)));
                }

                if (sysfs_slab_alias(s, name)) {
                        s->refcount--;
                        s = NULL;
                }
        }

        return s;
}

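If a slab cache with a similar size and compatible flags exists, no separate cache is created and the request is registered as an alias cache. Returns NULL when the request cannot be aliased and a separate cache has to be created.

At code line 7, looks up a mergeable cache. find_mergeable() returns NULL if nothing can be merged.
- A real cache gets a directory named after the cache under /sys/kernel/slab, whereas an alias cache gets a link that points to the directory of the cache it was merged into.
At code lines 8~16, because the requested cache is registered as an alias and the merge target is what is actually used, the target's reference counter is incremented. If the requested @size is larger than the target's object_size, object_size is updated. inuse is updated in the same way if the word-aligned @size is larger.
- s->object_size: holds the requested @size.
  - The size of the space where the actual object data is stored, excluding metadata.
- s->size
  - The size including metadata.
- s->inuse
  - The offset up to the metadata (REDZONE is not included). Without metadata it equals the size aligned to the word size.
At code lines 18~21, iterates over all memcg caches of the merge target and sets each memcg cache's object size to that of the merge target. Also updates the memcg cache's inuse when @size is larger.
At code lines 23~26, when the CONFIG_SYSFS kernel option is enabled, creates the sysfs link for the alias. If the kernel is still booting, link creation is postponed until the slab_sysfs driver has started. A nonzero error is returned when the link cannot be created.
- On error, the refcount is decremented to give up using the merge target, and NULL is returned.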
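The bookkeeping done when a request is attached to a merge target can be sketched as follows. This is a user-space illustration under stated assumptions: struct cache_sketch, align_up() and alias_attach() are hypothetical names, and only the three fields discussed above are modelled.

```c
#include <assert.h>

/* Hypothetical, trimmed-down stand-in for struct kmem_cache. */
struct cache_sketch {
    unsigned int object_size; /* payload size without metadata */
    unsigned int inuse;       /* offset at which metadata starts */
    int refcount;             /* users of the cache: itself plus aliases */
};

/* Round x up to a multiple of a (a must be a power of two). */
unsigned int align_up(unsigned int x, unsigned int a)
{
    return (x + a - 1) & ~(a - 1);
}

/* Register one more alias on s for a request of `size` bytes. */
void alias_attach(struct cache_sketch *s, unsigned int size)
{
    unsigned int word = (unsigned int)sizeof(void *);
    unsigned int aligned = align_up(size, word);

    s->refcount++;
    if (size > s->object_size)
        s->object_size = size; /* so a zeroing allocation clears it all */
    if (aligned > s->inuse)
        s->inuse = aligned;
}
```

Both fields only ever grow, so an alias with a smaller @size leaves the merge target unchanged apart from the reference count.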

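The following figure shows the case in which a merge target is found for the requested cache, so no new cache is registered and the request is simply added to the alias cache list.

Searching for a Mergeable Cache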

find_mergeable()

mm/slab_common.c

struct kmem_cache *find_mergeable(unsigned int size, unsigned int align,
                slab_flags_t flags, const char *name, void (*ctor)(void *))
{
        struct kmem_cache *s;

        if (slab_nomerge)
                return NULL;

        if (ctor)
                return NULL;

        size = ALIGN(size, sizeof(void *));
        align = calculate_alignment(flags, align, size);
        size = ALIGN(size, align);
        flags = kmem_cache_flags(size, flags, name, NULL);

        if (flags & SLAB_NEVER_MERGE)
                return NULL;

        list_for_each_entry_reverse(s, &slab_root_caches, root_caches_node) {
                if (slab_unmergeable(s))
                        continue;

                if (size > s->size)
                        continue;

                if ((flags & SLAB_MERGE_SAME) != (s->flags & SLAB_MERGE_SAME))
                        continue;
                /*
                 * Check if alignment is compatible.
                 * Courtesy of Adrian Drzewiecki
                 */
                if ((s->size & ~(align - 1)) != s->size)
                        continue;

                if (s->size - size >= sizeof(void *))
                        continue;

                if (IS_ENABLED(CONFIG_SLAB) && align &&
                        (align > s->align || s->align % align))
                        continue;

                return s;
        }
        return NULL;
}

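Searches for a mergeable cache that satisfies the strict conditions below and returns it. Returns NULL if none is found.

The "slub_nomerge" kernel parameter must not be in use.
The requested flags must not contain any of the SLAB_NEVER_MERGE flags (mostly debug flags).
The requested cache must not use an object constructor.
Every existing cache is then checked in a loop against the following conditions.
- The cache's flags must not contain any of the SLAB_NEVER_MERGE flags.
- The cache must be a root cache.
- The cache must not have an object constructor.
- The cache must not have a usersize set.
- The reference counter must not be negative (a negative value marks a cache still being initialized).
- The size must be close to the existing cache's size: equal to it, or smaller by less than one word.
- The SLAB_MERGE_SAME flags in the request must be identical to those of the cache.
- The cache's size must be aligned to the recalculated requested alignment.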
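The size, alignment and flag comparisons inside the loop can be isolated into a small predicate. The sketch below is an illustration only: merge_candidate_ok() is a hypothetical name, the must-match mask is passed in as a parameter instead of using the kernel's SLAB_MERGE_SAME value, and the slab_unmergeable() checks are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Size/alignment/flag part of the merge test only.
 * Assumptions: req_size is already rounded up to `align`,
 * and `align` is a power of two.
 */
bool merge_candidate_ok(size_t req_size, size_t align,
                        unsigned long req_flags,
                        size_t cache_size, unsigned long cache_flags,
                        unsigned long merge_same_mask)
{
    if (req_size > cache_size)
        return false; /* request does not fit */
    if ((req_flags & merge_same_mask) != (cache_flags & merge_same_mask))
        return false; /* must-match flags differ */
    if ((cache_size & ~(align - 1)) != cache_size)
        return false; /* cache size is not a multiple of align */
    if (cache_size - req_size >= sizeof(void *))
        return false; /* a word or more would be wasted per object */
    return true;
}
```

The last test is what keeps merging "strict": a 32-byte request is not merged into a 64-byte cache even though it would fit.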

mm/slab_common.c

/*
 * Set of flags that will prevent slab merging
 */
#define SLAB_NEVER_MERGE (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
                SLAB_TRACE | SLAB_DESTROY_BY_RCU | SLAB_NOLEAKTRACE | \
                SLAB_FAILSLAB | SLAB_KASAN)

A cache that uses any of these flags (mostly debug options) cannot be merged.

#define SLAB_MERGE_SAME (SLAB_RECLAIM_ACCOUNT | SLAB_CACHE_DMA | \
                SLAB_ACCOUNT)

For a merge to happen, the three flags above must have identical values in the target cache's flags and in the requested cache's flags.

kmem_cache_flags()

/*
 * kmem_cache_flags - apply debugging options to the cache
 * @object_size:        the size of an object without meta data
 * @flags:              flags to set
 * @name:               name of the cache
 * @ctor:               constructor function
 *
 * Debug option(s) are applied to @flags. In addition to the debug
 * option(s), if a slab name (or multiple) is specified i.e.
 * slub_debug=<Debug-Options>,<slab name1>,<slab name2> ...
 * then only the select slabs will receive the debug option(s).
 */

slab_flags_t kmem_cache_flags(unsigned int object_size,
        slab_flags_t flags, const char *name,
        void (*ctor)(void *))
{
        char *iter;
        size_t len;

        /* If slub_debug = 0, it folds into the if conditional. */
        if (!slub_debug_slabs)
                return flags | slub_debug;

        len = strlen(name);
        iter = slub_debug_slabs;
        while (*iter) {
                char *end, *glob;
                size_t cmplen;

                end = strchr(iter, ',');
                if (!end)
                        end = iter + strlen(iter);

                glob = strnchr(iter, end - iter, '*');
                if (glob)
                        cmplen = glob - iter;
                else
                        cmplen = max_t(size_t, len, (end - iter));

                if (!strncmp(name, iter, cmplen)) {
                        flags |= slub_debug;
                        break;
                }

                if (!*end)
                        break;
                iter = end + 1;
        }

        return flags;
}

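When the CONFIG_SLUB_DEBUG kernel option is enabled, adds the flags stored in the global variable slub_debug. If slub_debug_slabs is set, the flags are added only to caches whose name matches one of its comma-separated entries.

Selecting debug options through the "slub_debug=" kernel parameter determines the slub_debug value (SLAB_DEBUG_FREE, SLAB_RED_ZONE, SLAB_POISON, SLAB_STORE_USER, SLAB_TRACE, SLAB_FAILSLAB), and the string after the "," is stored in slub_debug_slabs.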
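The name matching against slub_debug_slabs can be reproduced in user space. The following sketch follows the loop in the listing above but is not the kernel function: debug_name_selected() is a hypothetical name, and the standard memchr() is used in place of the kernel's strnchr().

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true when `name` matches one entry of the comma-separated
 * `list`. If an entry contains '*', only the text before the first '*'
 * is compared (prefix match). Without '*', the whole name must match.
 */
bool debug_name_selected(const char *name, const char *list)
{
    size_t len = strlen(name);
    const char *iter = list;

    while (*iter) {
        const char *end = strchr(iter, ',');
        const char *glob;
        size_t entry_len, cmplen;

        if (!end)
            end = iter + strlen(iter);
        entry_len = (size_t)(end - iter);

        glob = memchr(iter, '*', entry_len);
        if (glob)
            cmplen = (size_t)(glob - iter);
        else
            cmplen = len > entry_len ? len : entry_len;

        if (strncmp(name, iter, cmplen) == 0)
            return true;

        if (!*end)
            break;
        iter = end + 1;
    }
    return false;
}
```

Comparing over the longer of the two lengths is what makes a non-glob entry an exact match: the comparison runs into the ',' or the terminating NUL of the shorter string and fails.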

Conditions That Prevent Merging

slab_unmergeable()

mm/slab_common.c

/*
 * Find a mergeable slab cache
 */

int slab_unmergeable(struct kmem_cache *s)
{       
        if (slab_nomerge || (s->flags & SLAB_NEVER_MERGE))
                return 1;
        
        if (!is_root_cache(s))
                return 1;
        
        if (s->ctor)
                return 1;

        if (s->usersize)
                return 1;
        /*
         * We may have set a slab to be unmergeable during bootstrap.
         */     
        if (s->refcount < 0)
                return 1;
                        
        return 0;
}

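Returns true (1) if the cache cannot be merged. Merging is impossible under the following conditions.

The "slub_nomerge" kernel parameter is in use, or the cache's flags contain one of the SLAB_NEVER_MERGE flags.
The cache is not a root cache.
An object constructor is set.
usersize is set.
refcount is negative.
- The cache was created during the bootup process and is not yet operational.

sysfs Link Creation for Alias Caches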

sysfs_slab_alias()

mm/slub.c

static int sysfs_slab_alias(struct kmem_cache *s, const char *name)
{
        struct saved_alias *al;

        if (slab_state == FULL) {
                /*
                 * If we have a leftover link then remove it.
                 */
                sysfs_remove_link(&slab_kset->kobj, name);
                return sysfs_create_link(&slab_kset->kobj, &s->kobj, name);
        }

        al = kmalloc(sizeof(struct saved_alias), GFP_KERNEL);
        if (!al)
                return -ENOMEM;

        al->s = s;
        al->name = name;
        al->next = alias_list;
        alias_list = al;
        return 0;
}

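When the CONFIG_SYSFS kernel option is enabled, creates the sysfs link for the alias cache. If the kernel is still booting, link creation is postponed until the slab_sysfs driver has started. A nonzero error is returned when the link cannot be created.

At code lines 5~11, if the slub allocator is fully operational, removes any leftover link, creates the link again under name, and returns.
At code lines 13~21, allocates and initializes a saved_alias structure, then adds the initialized alias information to the global alias_list.
- Entries added to alias_list are handled later. When the kernel runs its initcalls, slab_sysfs_init(), registered through __initcall(), sets slab_state to FULL and calls sysfs_slab_alias() again for every alias in alias_list, which this time creates the links through sysfs.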
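The "remember now, create later" pattern described above can be sketched in user space. Everything in this block is hypothetical illustration code: the names carry a _sketch suffix or are invented, malloc() stands in for kmalloc(), and a counter stands in for the sysfs link.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a pending alias entry. */
struct saved_alias_sketch {
    const char *name;
    struct saved_alias_sketch *next;
};

struct saved_alias_sketch *pending_aliases; /* head of the deferred list  */
int allocator_full;                         /* 0 until late init has run  */
int links_created;                          /* stands in for sysfs links  */

/* Create the link now if possible, otherwise remember it for later. */
int alias_register(const char *name)
{
    struct saved_alias_sketch *al;

    if (allocator_full) {
        links_created++;
        return 0;
    }
    al = malloc(sizeof(*al));
    if (!al)
        return -1;
    al->name = name;
    al->next = pending_aliases;
    pending_aliases = al;
    return 0;
}

/* Late init: switch to the full state and replay every saved alias. */
void alias_replay(void)
{
    allocator_full = 1;
    while (pending_aliases) {
        struct saved_alias_sketch *al = pending_aliases;

        pending_aliases = al->next;
        alias_register(al->name); /* takes the "create now" branch */
        free(al);
    }
}
```

Because the same function serves both phases, callers do not need to know whether sysfs is available yet.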

Regular Slab Cache Creation

create_cache()

mm/slab_common.c

static struct kmem_cache *create_cache(const char *name,
                unsigned int object_size, unsigned int align,
                slab_flags_t flags, unsigned int useroffset,
                unsigned int usersize, void (*ctor)(void *),
                struct mem_cgroup *memcg, struct kmem_cache *root_cache)
{
        struct kmem_cache *s;
        int err;

        if (WARN_ON(useroffset + usersize > object_size))
                useroffset = usersize = 0;

        err = -ENOMEM;
        s = kmem_cache_zalloc(kmem_cache, GFP_KERNEL);
        if (!s)
                goto out;

        s->name = name;
        s->size = s->object_size = object_size;
        s->align = align;
        s->ctor = ctor;
        s->useroffset = useroffset;
        s->usersize = usersize;

        err = init_memcg_params(s, memcg, root_cache);
        if (err)
                goto out_free_cache;

        err = __kmem_cache_create(s, flags);
        if (err)
                goto out_free_cache;

        s->refcount = 1;
        list_add(&s->list, &slab_caches);
        memcg_link_cache(s);
out:
        if (err)
                return ERR_PTR(err);
        return s;

out_free_cache:
        destroy_memcg_params(s);
        kmem_cache_free(kmem_cache, s);
        goto out;
}

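Creates a regular slab cache.

At code lines 10~11, if the region to be copied to user space falls outside the object, useroffset and usersize are set to 0.
At code lines 12~22, a slub object is allocated from the kmem_cache cache and initialized in order to manage the new cache.
At code lines 23~25, when the CONFIG_MEMCG_KMEM kernel option is enabled, the parameters that memcg needs to manage kernel memory are initialized.
At code lines 27~29, the requested cache is created.
At code lines 30~31, the cache reference counter is set to 1 and the cache is added to the global slab_caches list.
- Each time the real cache or an alias of it is created, the real cache's reference counter is incremented. Each time the cache or one of its aliases is destroyed, the counter is decremented. When it reaches 0, the real cache can be destroyed.
At code line 32, the slab cache link is added to memcg.
At code lines 33~36, the out: label. The result is returned.

The following shows the anon_vma slab cache as exposed through sysfs on a running system. (Location: /sys/kernel/slab/<cache name>)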

$ cd /sys/kernel/slab/anon_vma
KVM /sys/kernel/slab/anon_vma$ ls
aliases      destroy_by_rcu  objects_partial  red_zone                  slabs_cpu_partial
align        free_calls      objs_per_slab    remote_node_defrag_ratio  store_user
alloc_calls  hwcache_align   order            sanity_checks             total_objects
cpu_partial  min_partial     partial          shrink                    trace
cpu_slabs    object_size     poison           slab_size                 usersize
ctor         objects         reclaim_account  slabs                     validate

memcg_link_cache()

mm/slab_common.c

void memcg_link_cache(struct kmem_cache *s)
{
        if (is_root_cache(s)) {
                list_add(&s->root_caches_node, &slab_root_caches);
        } else {
                list_add(&s->memcg_params.children_node,
                         &s->memcg_params.root_cache->memcg_params.children);
                list_add(&s->memcg_params.kmem_caches_node,
                         &s->memcg_params.memcg->kmem_caches);
        }
}

If the cache is a root cache, it is added to the global slab_root_caches list. Otherwise it is added to the memcg lists.

__kmem_cache_create()

mm/slub.c

int __kmem_cache_create(struct kmem_cache *s, unsigned long flags)
{
        int err;

        err = kmem_cache_open(s, flags);
        if (err)
                return err;

        /* Mutex is not taken during early boot */
        if (slab_state <= UP)
                return 0;

        memcg_propagate_slab_attrs(s);
        err = sysfs_slab_add(s);
        if (err)
                kmem_cache_close(s);

        return err;
}

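Creates the cache and, unless the kernel is still booting, reads and applies the memcg attributes and then creates the sysfs entries for the new cache.

At code lines 5~7, the cache is created.
At code lines 10~11, before the slab system is fully up, that is, while early bootup is still in progress (slab_state <= UP), the function returns here.
At code line 13, when the CONFIG_MEMCG_KMEM kernel option is enabled, all attributes of the new cache's root cache are read and applied again.
At code lines 14~16, the sysfs entries are created so that the cache's attributes can be viewed or set through the file system.

Slab Allocator State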

mm/slab.h

/*              
 * State of the slab allocator.
 *
 * This is used to describe the states of the allocator during bootup.
 * Allocators use this to gradually bootstrap themselves. Most allocators
 * have the problem that the structures used for managing slab caches are
 * allocated from slab caches themselves.
 */

enum slab_state {
        DOWN,                   /* No slab functionality yet */
        PARTIAL,                /* SLUB: kmem_cache_node available */
        PARTIAL_NODE,           /* SLAB: kmalloc size for node struct available */
        UP,                     /* Slab caches usable but not all extras yet */
        FULL                    /* Everything is working */
};

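These are the operating states of the slab (slub) memory allocator. For the slub memory system they mean the following. (PARTIAL_NODE is not used by slub.)

DOWN
- The kernel is still in its bootup process and the slub memory system has not been set up yet.
PARTIAL
- Slab caches cannot be created yet, but slab objects can be allocated from slab caches that already exist.
- The kmem_cache_node slab cache exists.
  - Creating any cache requires kmem_cache_node structures internally, so this is the first cache built during bootup in order to supply them.
UP
- The kmem_cache system works, but the remaining extras are not active yet.
FULL
- The bootup process of the slub memory allocator is complete and everything in slub works.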
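The listings in this post test slab_state in three places, and the three decisions can be put side by side. The sketch below is a summary under stated assumptions: the enum is a renamed copy with the same ordering, and the three helper names are hypothetical, each one mirroring a single comparison from the corresponding listing.

```c
#include <assert.h>
#include <stdbool.h>

/* Same ordering as enum slab_state; renamed to mark it as a sketch. */
enum slab_state_sketch {
    ST_DOWN, ST_PARTIAL, ST_PARTIAL_NODE, ST_UP, ST_FULL
};

/* __kmem_cache_create(): returns before sysfs work while state <= UP. */
bool sysfs_registration_deferred(enum slab_state_sketch st)
{
    return st <= ST_UP;
}

/* sysfs_slab_alias(): the link is created immediately only when FULL. */
bool alias_link_immediate(enum slab_state_sketch st)
{
    return st == ST_FULL;
}

/* init_kmem_cache_nodes(): the hand-built early path is used only when DOWN. */
bool needs_early_node_alloc(enum slab_state_sketch st)
{
    return st == ST_DOWN;
}
```

Since the states are ordered, a single comparison such as st <= ST_UP covers every boot phase before FULL.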

kmem_cache_open()

mm/slub.c

static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
{
        s->flags = kmem_cache_flags(s->size, flags, s->name, s->ctor);
#ifdef CONFIG_SLAB_FREELIST_HARDENED
        s->random = get_random_long();
#endif

        if (!calculate_sizes(s, -1))
                goto error;
        if (disable_higher_order_debug) {
                /*
                 * Disable debugging flags that store metadata if the min slab
                 * order increased.
                 */
                if (get_order(s->size) > get_order(s->object_size)) {
                        s->flags &= ~DEBUG_METADATA_FLAGS;
                        s->offset = 0;
                        if (!calculate_sizes(s, -1))
                                goto error;
                }
        }

#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
    defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
        if (system_has_cmpxchg_double() && (s->flags & SLAB_NO_CMPXCHG) == 0)
                /* Enable fast mode */
                s->flags |= __CMPXCHG_DOUBLE;
#endif

        /*
         * The larger the object size is, the more pages we want on the partial
         * list to avoid pounding the page allocator excessively.
         */
        set_min_partial(s, ilog2(s->size) / 2);

        set_cpu_partial(s);

#ifdef CONFIG_NUMA
        s->remote_node_defrag_ratio = 1000;
#endif

        /* Initialize the pre-computed randomized freelist if slab is up */
        if (slab_state >= UP) {
                if (init_cache_random_seq(s))
                        goto error;
        }

        if (!init_kmem_cache_nodes(s))
                goto error;

        if (alloc_kmem_cache_cpus(s))
                return 0;

        free_kmem_cache_nodes(s);
error:
        if (flags & SLAB_PANIC)
                panic("Cannot create slab %s size=%u realsize=%u order=%u offset=%u flags=%lx\n",
                      s->name, s->size, s->size,
                      oo_order(s->oo), s->offset, (unsigned long)flags);
        return -EINVAL;
}

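Creates the cache.

At code line 3, if debug options were requested through the "slub_debug=" kernel parameter, the corresponding slub debug flags are added so that they are applied when objects are set up.
At code lines 8~9, if the order and the number of objects for the object size cannot be calculated, execution jumps to the error handling.
- Reference: Slub Memory Allocator (Order calculation) | 문c
At code lines 10~21, when the "slub_debug=O" kernel parameter is used, debugging is disabled for any cache whose order would increase because of the debug features. This is the case when the order needed to allocate s->size is larger than the order needed to allocate s->object_size, that is, when metadata added for debugging or similar purposes requires more pages. In that case the flags that add metadata are cleared, offset is set to 0, and the sizes are recalculated, so metadata that would raise the order is not added.

At code lines 25~27, if the system supports cmpxchg on double-word data and no slab debug flag is in use, __CMPXCHG_DOUBLE is added to the flags.
- Supported on x86 and 64-bit arm architectures among others. Not supported on 32-bit arm.
At code line 34, half of ilog2(s->size), the number of bits needed to express the object size, is stored in min_partial, adjusted to stay within the range 5~10.
- Examples:
  - size 4K -> bits needed = 12 -> min_partial = 6
  - size 1M -> bits needed = 20 -> min_partial = 10
At code line 36, the cpu_partial count appropriate for the size is calculated.
At code line 39, on a NUMA system, 1000 is assigned to remote_node_defrag_ratio.
- When the local node's partial list runs short, the partial lists of remote nodes may be used. This sets the allowance rate for doing so to 1000.
  - A value from 0 to 100 entered by the user is multiplied by 10. A value above 100 is accepted as is.
  - A setting of 1000 gives a success rate of about 98%.
    - The attempt is allowed (succeeds) only when the hardware cycle counter modulo 1024 is less than or equal to this value.
    - A setting of 1024 or more allows (succeeds) 100% of the time.
At code lines 43~46, once the slab is up, the pre-computed random sequence used to randomize the freelist is initialized.
At code lines 48~49, if the per-node initialization fails, execution jumps to error.
At code lines 51~52, if the per-cpu allocation succeeds, the function returns 0.
At code line 54, the per-cpu allocation has failed. The per-node structures are freed and execution falls through to the error path.
At code lines 55~59, the error: label. If the SLAB_PANIC flag is set, the panic message is printed and the kernel panics.
At code line 60, -EINVAL is returned.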
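The "about 98%" figure for remote_node_defrag_ratio = 1000 follows from counting remainders. The sketch below is a worked check, not kernel code: remote_node_allowed() and accepted_out_of_1024() are hypothetical names, the cycle counter is passed in as a plain number, and treating a ratio of 0 as "never" is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether a remote node may be tried, given the configured ratio
 * and a cycle-counter sample. How `cycles` is read from hardware is
 * outside this sketch.
 */
bool remote_node_allowed(unsigned int ratio, unsigned long cycles)
{
    if (ratio == 0)
        return false; /* assumption: 0 disables remote attempts */
    return cycles % 1024 <= ratio;
}

/* Count how many of the 1024 possible remainders are accepted. */
unsigned int accepted_out_of_1024(unsigned int ratio)
{
    unsigned int hits = 0;
    unsigned long c;

    for (c = 0; c < 1024; c++)
        if (remote_node_allowed(ratio, c))
            hits++;
    return hits;
}
```

With ratio = 1000 the remainders 0 through 1000 are accepted, which is 1001 of 1024 values, or roughly 97.8%.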

set_min_partial()

mm/slub.c

static void set_min_partial(struct kmem_cache *s, unsigned long min)
{
        if (min < MIN_PARTIAL)
                min = MIN_PARTIAL;
        else if (min > MAX_PARTIAL)
                min = MAX_PARTIAL;
        s->min_partial = min;
}

Sets the min_partial value of the given cache. A min value outside the range MIN_PARTIAL(5) ~ MAX_PARTIAL(10) is adjusted to fit that range.
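The combination of ilog2(size) / 2 and the clamp can be checked with a few sizes. This is a user-space sketch: ilog2_sketch() is a portable stand-in for the kernel's ilog2(), and min_partial_for() and the two *_SKETCH constants are hypothetical names that reuse the values 5 and 10 quoted above.

```c
#include <assert.h>

#define MIN_PARTIAL_SKETCH 5
#define MAX_PARTIAL_SKETCH 10

/* floor(log2(v)) for v > 0. */
unsigned int ilog2_sketch(unsigned long v)
{
    unsigned int bits = 0;

    while ((v >>= 1) != 0)
        bits++;
    return bits;
}

/* Half of ilog2(size), clamped to the 5..10 range. */
unsigned long min_partial_for(unsigned long size)
{
    unsigned long min = ilog2_sketch(size) / 2;

    if (min < MIN_PARTIAL_SKETCH)
        min = MIN_PARTIAL_SKETCH;
    else if (min > MAX_PARTIAL_SKETCH)
        min = MAX_PARTIAL_SKETCH;
    return min;
}
```

Small objects all end up at the lower bound: a 64-byte object gives ilog2 = 6, half of that is 3, and the clamp raises it to 5.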

set_cpu_partial()

mm/slub.c

static void set_cpu_partial(struct kmem_cache *s)
{
#ifdef CONFIG_SLUB_CPU_PARTIAL
        /*
         * cpu_partial determined the maximum number of objects kept in the
         * per cpu partial lists of a processor.
         *
         * Per cpu partial lists mainly contain slabs that just have one
         * object freed. If they are used for allocation then they can be
         * filled up again with minimal effort. The slab will never hit the
         * per node partial lists and therefore no locking will be required.
         *
         * This setting also determines
         *
         * A) The number of objects from per cpu partial slabs dumped to the
         *    per node list when we reach the limit.
         * B) The number of objects in cpu partial slabs to extract from the
         *    per node list when we run out of per cpu objects. We only fetch
         *    50% to keep some capacity around for frees.
         */
        if (!kmem_cache_has_cpu_partial(s))
                s->cpu_partial = 0;
        else if (s->size >= PAGE_SIZE)
                s->cpu_partial = 2;
        else if (s->size >= 1024)
                s->cpu_partial = 6;
        else if (s->size >= 256)
                s->cpu_partial = 13;
        else
                s->cpu_partial = 30;
#endif
}

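Depending on size, sets the maximum number of slab objects (s->cpu_partial) that may be kept on the per-cpu partial list. A cache that does not support cpu partial lists gets 0.

One page or more -> 2
1024 or more -> 6
256 or more -> 13
Less than 256 -> 30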
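The thresholds can be written as a pure function for quick experiments. In this sketch cpu_partial_for() is a hypothetical name, and the page size and the "supports cpu partial" decision are passed in as parameters rather than read from the kernel configuration.

```c
#include <assert.h>
#include <stdbool.h>

/* Map an object size to the per-cpu partial limit listed above. */
unsigned int cpu_partial_for(unsigned int size, unsigned int page_size,
                             bool has_cpu_partial)
{
    if (!has_cpu_partial)
        return 0;  /* e.g. a cache with debug flags set */
    if (size >= page_size)
        return 2;
    if (size >= 1024)
        return 6;
    if (size >= 256)
        return 13;
    return 30;
}
```

The limit shrinks as objects grow, so large objects pin fewer pages on each CPU.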

kmem_cache_has_cpu_partial()

mm/slub.c

static inline bool kmem_cache_has_cpu_partial(struct kmem_cache *s)
{
#ifdef CONFIG_SLUB_CPU_PARTIAL
        return !kmem_cache_debug(s);
#else
        return false;
#endif
}

When the CONFIG_SLUB_CPU_PARTIAL kernel option is enabled, cpu partial lists are supported for the given cache as long as no debug flag is set on it. Otherwise false is returned.

true = the given cache supports cpu partial lists.
false = the given cache does not support cpu partial lists.

kmem_cache_debug()

mm/slub.c

static inline int kmem_cache_debug(struct kmem_cache *s)
{
#ifdef CONFIG_SLUB_DEBUG
        return unlikely(s->flags & SLAB_DEBUG_FLAGS);
#else
        return 0;
#endif
}

When the CONFIG_SLUB_DEBUG kernel option is enabled, returns whether any of the SLAB_DEBUG_FLAGS is set on the given cache. Otherwise returns 0.

1 = a debug flag is set on the given cache
0 = no debug flag is set on the given cache

kmem_cache_node Initialization

init_kmem_cache_nodes()

mm/slub.c

static int init_kmem_cache_nodes(struct kmem_cache *s)
{
        int node;

        for_each_node_state(node, N_NORMAL_MEMORY) {
                struct kmem_cache_node *n;

                if (slab_state == DOWN) {
                        early_kmem_cache_node_alloc(node);
                        continue;
                }
                n = kmem_cache_alloc_node(kmem_cache_node,
                                                GFP_KERNEL, node);

                if (!n) {
                        free_kmem_cache_nodes(s);
                        return 0;
                }

                init_kmem_cache_node(n); 
                s->node[node] = n;
        }
        return 1;
}

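For every node that has memory, one of the following two actions is taken depending on the state of the slab system.

If no part of the slab system is running yet, slab objects cannot be allocated. To allocate the kmem_cache_node structure, a page is therefore obtained from the buddy system rather than from a slab cache, the page is split into kmem_cache_node-sized units, and the slab objects are set up by hand. The first object is then taken and filled in as the kmem_cache_node structure.
Once the slab system runs at all, the first slab cache that was prepared is the kmem_cache_node slab cache. An object is allocated from it and the initialization routine is run.

kmem_cache_node Early Initialization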

early_kmem_cache_node_alloc()

mm/slub.c

/*
 * No kmalloc_node yet so do it by hand. We know that this is the first
 * slab on the node for this slabcache. There are no concurrent accesses
 * possible.
 *
 * Note that this function only works on the kmem_cache_node
 * when allocating for the kmem_cache_node. This is used for bootstrapping
 * memory on a fresh node that has no slab structures yet.
 */

static void early_kmem_cache_node_alloc(int node)
{
        struct page *page;
        struct kmem_cache_node *n;

        BUG_ON(kmem_cache_node->size < sizeof(struct kmem_cache_node));

        page = new_slab(kmem_cache_node, GFP_NOWAIT, node);

        BUG_ON(!page);
        if (page_to_nid(page) != node) {
                pr_err("SLUB: Unable to allocate memory from node %d\n", node);
                pr_err("SLUB: Allocating a useless per node structure in order to be able to continue\n");
        }

        n = page->freelist;
        BUG_ON(!n);
#ifdef CONFIG_SLUB_DEBUG
        init_object(kmem_cache_node, n, SLUB_RED_ACTIVE);
        init_tracking(kmem_cache_node, n);
#endif
        n = kasan_kmalloc(kmem_cache_node, n, sizeof(struct kmem_cache_node),
                      GFP_KERNEL);
        page->freelist = get_freepointer(kmem_cache_node, n);
        page->inuse = 1;
        page->frozen = 0;
        kmem_cache_node->node[node] = n;
        init_kmem_cache_node(n);
        inc_slabs_node(kmem_cache_node, node, page->objects);

        /*
         * No locks need to be taken here as it has just been
         * initialized and there is no concurrent access.
         */
        __add_partial(n, page, DEACTIVATE_TO_HEAD);
}

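While no part of the slab system is running, functions that allocate slab objects, such as kmem_cache_alloc(), cannot be used. new_slab(), which obtains a page for slab use from the buddy system, is used instead. This path does not guard against concurrent access and may only be used to set up the first kmem_cache_node needed to build slab caches. (The very first cache created to run the kernel memory cache system is named kmem_cache_node, and kmem_cache_node objects must be supplied from this cache.)

At code line 8, the slab page to be used for kmem_cache_node structures is allocated.
- Because the slub system is not running yet and slub objects cannot be allocated, a page is obtained from the buddy system and set up as a slab page.
At code lines 11~15, an error message is printed if the allocated slab page belongs to a remote node.
- “SLUB: Unable to allocate memory from node %d\n”
- “SLUB: Allocating a useless per node structure in order to be able to continue\n”
At code lines 17~29, the slab page allocated for kmem_cache_node is initialized and its first object is attached as the per-node structure.
At code lines 30~31, the kmem_cache_node is initialized and the slab counter is incremented.
At code line 37, the page is added to the node's partial list.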
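Taking the first object "by hand" amounts to popping the head of a freelist that lives inside the page itself. The following user-space sketch illustrates the idea under stated assumptions: struct slab_page_sketch and all function names are hypothetical, the free pointer is stored at offset 0 of each free object, and obj_size is assumed to be at least sizeof(void *).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the bookkeeping kept in struct page. */
struct slab_page_sketch {
    void *freelist;       /* first free object, or NULL */
    unsigned int inuse;   /* objects handed out */
    unsigned int objects; /* total objects in the slab */
};

/* Read the free pointer stored at offset 0 of a free object. */
void *get_fp(void *obj)
{
    void *next;

    memcpy(&next, obj, sizeof(next));
    return next;
}

/* Write the free pointer stored at offset 0 of a free object. */
void set_fp(void *obj, void *next)
{
    memcpy(obj, &next, sizeof(next));
}

/* Carve `len` bytes at `buf` into objects of `obj_size` and chain them. */
void slab_init(struct slab_page_sketch *pg, void *buf,
               size_t len, size_t obj_size)
{
    unsigned char *p = buf;
    size_t n = len / obj_size, i;

    for (i = 0; i < n; i++)
        set_fp(p + i * obj_size,
               i + 1 < n ? p + (i + 1) * obj_size : NULL);
    pg->freelist = n ? p : NULL;
    pg->inuse = 0;
    pg->objects = (unsigned int)n;
}

/* Take the first object directly, without any allocator fast path. */
void *slab_take_first(struct slab_page_sketch *pg)
{
    void *obj = pg->freelist;

    if (!obj)
        return NULL;
    pg->freelist = get_fp(obj); /* freelist now starts at object 2 */
    pg->inuse = 1;
    return obj;
}
```

After the pop, the page still has free objects, which is why the real code puts it on the node's partial list afterwards.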

init_kmem_cache_node()

mm/slub.c

static void
init_kmem_cache_node(struct kmem_cache_node *n)
{
        n->nr_partial = 0;
        spin_lock_init(&n->list_lock);
        INIT_LIST_HEAD(&n->partial);
#ifdef CONFIG_SLUB_DEBUG
        atomic_long_set(&n->nr_slabs, 0);
        atomic_long_set(&n->total_objects, 0);
        INIT_LIST_HEAD(&n->full);
#endif
}

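Initializes the per-node structure.

Sets the number of slub pages registered on the per-node partial list (nr_partial) to 0.
Initializes list_lock and the per-node partial list.
With CONFIG_SLUB_DEBUG, also resets nr_slabs and total_objects and initializes the full list.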

inc_slabs_node()

static inline void inc_slabs_node(struct kmem_cache *s, int node, int objects)
{
        struct kmem_cache_node *n = get_node(s, node);

        /*
         * May be called early in order to allocate a slab for the
         * kmem_cache_node structure. Solve the chicken-egg
         * dilemma by deferring the increment of the count during
         * bootstrap (see early_kmem_cache_node_alloc).
         */
        if (likely(n)) {
                atomic_long_inc(&n->nr_slabs);
                atomic_long_add(objects, &n->total_objects);
        }
}

Active only when the CONFIG_SLUB_DEBUG kernel option is enabled. Increments the slab count of the given node and adds the slab's object count to total_objects.

__add_partial()

/*
 * Management of partially allocated slabs.
 */

static inline void
__add_partial(struct kmem_cache_node *n, struct page *page, int tail)
{
        n->nr_partial++;
        if (tail == DEACTIVATE_TO_TAIL)
                list_add_tail(&page->lru, &n->partial);
        else
                list_add(&page->lru, &n->partial);
}

Adds a slub page to the head or the tail of the given node's partial list, as selected by the tail argument, and increments the node's partial count.

References

Slab Memory Allocator -1- (Structure) | 문c
Slab Memory Allocator -2- (Cache Initialization) | 문c
Slub Memory Allocator -3- (Cache Creation) | 문c – this post
Slub Memory Allocator -4- (Order Calculation) | 문c
Slub Memory Allocator -5- (Slub Allocation) | 문c
Slub Memory Allocator -6- (Object Allocation) | 문c
Slub Memory Allocator -7- (Object Free) | 문c
Slub Memory Allocator -8- (Drain/Flash Cache) | 문c
Slub Memory Allocator -9- (Cache Shrink) | 문c
Slub Memory Allocator -10- (Slub Free) | 문c
Slub Memory Allocator -11- (Cache Destroy) | 문c
Slub Memory Allocator -12- (Slub Debugging) | 문c
Slub Memory Allocator -13- (slabinfo) | 문c

Slab Memory Allocator -1- (Structure)

2016-06-03 / 2022-06-18 문영일

Slab Memory Allocator

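A slab (Slab, Slub, Slob) object is the smallest unit of regular memory allocation used by the kernel. The kernel is built with exactly one of the following three allocators, and the differences between them are described below. Where no distinction is needed, the generic term "slab" is used. All structure and source analysis covers the Slub implementation only.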

Slab

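The core of kernel memory management, and the default until 2007~8.
Array-based queues of Slab objects are used per CPU and per node.
Metadata fragments are placed at the front, which makes object alignment difficult, and cleaning the caches under memory shortage requires very complex processing.
A newly created slab is managed on the free list. Once at least one object is in use it moves to the partial list, and once all objects are in use it moves to the full list.
To improve performance, slabs are managed both per node and per cpu.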

Slub

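Since 2007~8 it has been used not only on embedded systems with sufficient memory but also on PCs and servers, and it is now the default.
Unlike Slab, it does not use Slab object queues; to reduce memory overhead it simply designates a single slub page.
Unlike Slab, it uses no separate space for freelist management.
Unlike Slab, a newly created slub starts with zero objects in use and is managed on a partial list. When all of its objects are in use, the slub is removed from the partial list and is no longer tracked. As soon as one of its objects is freed, it is added back to a partial list.
Like Slab, Slub is managed per node and per cpu to improve performance.
Compared with Slab, Slub reduced the number of slab caches in the system (by about 50%), improved the locality of the slab allocator, and reduced fragmentation of slab memory.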

Slob

Chosen for embedded Linux systems that need a low memory footprint.
It is the slowest of the three but consumes the least memory.

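Slab Structure

Slab objects are placed in one or more page frames, packed one after another at the specified object size. Metadata may be added to each object for debugging.

Object Layout within a Slab Page

A page of the order computed for the slab object size is allocated from the buddy system and forms one slab page. The slab page is filled entirely with slab objects of the same size. By default the computed order is in the range 0~3.

The following figure shows a slab page built from an order-N page of the buddy system, with the allocatable slab objects laid out in it.

Slab Cache

As in the following figure, one or more slab pages together make up a slab cache. In other words, every object in a slab cache has the same size.

As in the following figure, when different object sizes are needed, a separate slab cache can be created for each object size. When the kernel uses a particular structure heavily, a slab cache for it is registered in advance like this.

e.g. page, anon_vma, vm_area_struct, task_struct, dentry, skb, …

per-node and per-cpu Management

per-node
- Memory access speed differs between nodes, so slab pages are managed separately per node.
per-cpu
- For fast, lock-less slab allocation, management is split per cpu. Each per-cpu slab cache has a partial list and one designated page.

The following figure shows slab pages managed per node and per cpu.

Each cpu has a designated slab page for allocation and free; the rest are managed on partial lists.

Freelist

Points to the first free object in the per-cpu slab page. Each free object points to the next free object, so together they form a single list of free objects. Only the owning cpu may allocate from this freelist, while any cpu may free objects to it.

The following figure shows a slab page designated for each cpu, with the freelist pointing to a free object in that slab page.

The following figure shows the freelist pointing to the first free slab object, with the free objects linked to one another in order.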
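
The chaining described above can be sketched in userspace C. This is only an illustration under stated assumptions: `struct mini_slab` and the `mini_*`, `get_fp`, and `set_fp` helpers are hypothetical names, not kernel APIs, and the sketch ignores per-cpu ownership, locking, and freelist hardening. It shows only how the FP stored inside each free object forms a singly linked list.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace stand-in for one slab page; not a kernel structure. */
struct mini_slab {
	unsigned char *base;	/* start of the object area */
	size_t size;		/* distance between consecutive objects */
	size_t offset;		/* where the FP lives inside each free object */
	void *freelist;		/* first free object, NULL when exhausted */
	unsigned int inuse;	/* number of allocated objects */
};

/* Read the FP stored inside a free object (memcpy avoids alignment traps). */
static inline void *get_fp(const struct mini_slab *s, const void *obj)
{
	void *next;

	memcpy(&next, (const unsigned char *)obj + s->offset, sizeof(next));
	return next;
}

/* Store the address of the next free object inside a free object. */
static inline void set_fp(const struct mini_slab *s, void *obj, void *next)
{
	memcpy((unsigned char *)obj + s->offset, &next, sizeof(next));
}

/* Thread every object onto the freelist, lowest address first. */
static inline void mini_slab_init(struct mini_slab *s, void *mem, size_t size,
				  size_t offset, unsigned int count)
{
	unsigned int i;

	assert(offset + sizeof(void *) <= size);
	s->base = mem;
	s->size = size;
	s->offset = offset;
	s->freelist = NULL;
	s->inuse = 0;
	for (i = count; i-- > 0;) {
		void *obj = s->base + (size_t)i * size;

		set_fp(s, obj, s->freelist);
		s->freelist = obj;
	}
}

/* Pop the first free object off the freelist. */
static inline void *mini_alloc(struct mini_slab *s)
{
	void *obj = s->freelist;

	if (!obj)
		return NULL;
	s->freelist = get_fp(s, obj);
	s->inuse++;
	return obj;
}

/* Push an object back; it becomes the new head of the freelist. */
static inline void mini_free(struct mini_slab *s, void *obj)
{
	set_fp(s, obj, s->freelist);
	s->freelist = obj;
	s->inuse--;
}
```

With offset 0 the FP overlays the start of the object, so no extra space is needed for list management, which matches the "no separate freelist space" property of Slub described earlier.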

Slab Object Internals

A slab object consists of the following items.

Object Area

The minimum slab object size is 4 bytes on 32-bit systems and 8 bytes on 64-bit systems.

FP(Free Pointer)

While an object in a page frame is free, the FP (Free Pointer) at the start of the object points to the next free object. When object debugging is enabled, all data of an unused object is filled with poison data, which is monitored to detect corruption. In that case the FP is moved behind the object. The owner track and red zone fields are then added after it and used for debugging.
In a free object the offset is normally 0, so the FP sits at the very start of the object and holds the address of the next free object. However, if the SLAB_TYPESAFE_BY_RCU (formerly SLAB_DESTROY_BY_RCU) or SLAB_POISON flag is set, or a constructor is used, offset is set to the word-aligned object size (inuse), which moves the FP behind the object.
For better security, the CONFIG_SLAB_FREELIST_HARDENED kernel option can be used to obfuscate the free pointer values.

Poison

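Used to detect out-of-bounds writes and similar memory corruption.
After a slab object is freed, poison values are written to the object_size bytes of the object. When the object is allocated again, any change to those values is detected and reported as an error.
The poison values are as follows.
- When a slab object is in use but not yet initialized, it is filled with 0x5a ('Z').
- When a slab object is not in use, it is filled with 0x6b ('k'), except that the last byte is set to 0xa5.
The “slub_debug=P” kernel parameter enables poison debugging.
Debugging references: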
- slub_debug: Detect Kernel heap memory corruption | TechVolve
- Short users guide for SLUB (2015, Documentation/vm/slub.txt)| Kernel.org
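
The poison pattern for a free object can be illustrated with a small userspace sketch. The byte values come from the text above; the function and macro names are hypothetical and this is not the kernel's checking code, only a minimal model of "fill on free, verify on the next allocation".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte values as listed above; the _SK names are local to this sketch. */
#define POISON_INUSE_SK	0x5a	/* allocated but not yet initialized */
#define POISON_FREE_SK	0x6b	/* free object body */
#define POISON_END_SK	0xa5	/* last byte of a free object */

/* Fill a freed object: 0x6b everywhere, 0xa5 in the final byte. */
static inline void poison_free_object(unsigned char *obj, size_t object_size)
{
	assert(object_size > 0);
	memset(obj, POISON_FREE_SK, object_size - 1);
	obj[object_size - 1] = POISON_END_SK;
}

/*
 * Verify the pattern before the object is handed out again.
 * Returns the index of the first corrupted byte, or -1 if intact.
 */
static inline long check_free_poison(const unsigned char *obj,
				     size_t object_size)
{
	size_t i;

	assert(object_size > 0);
	for (i = 0; i + 1 < object_size; i++)
		if (obj[i] != POISON_FREE_SK)
			return (long)i;
	if (obj[object_size - 1] != POISON_END_SK)
		return (long)(object_size - 1);
	return -1;
}
```

A write through a stale pointer after the free changes at least one byte of the pattern, which is what the check reports.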

Red-Zone

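Used to detect out-of-bounds writes and similar memory corruption.
Red-zone values are written on both sides of the object_size area. Any change to them is detected when the slab object is allocated or freed and is reported as an error.
The red zone values are as follows.
- Filled with 0xbb while the object is inactive.
- Filled with 0xcc while the object is active.
The “slub_debug=Z” kernel parameter enables red-zone debugging.

Owner(User) Track

Records and prints the calling functions, up to 16 each, for the allocation and for the free of a slab object.
The “slub_debug=U” kernel parameter enables user tracking.

Padding

Padding added for alignment is filled with 0x5a ('Z').

Slab Object Layout

Note: the FP position shown in the figures below has since been moved to the middle of the object, aligned to the pointer size.

References: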
- slub: relocate freelist pointer to middle of object (2020, v5.7-rc1)
- mm/slub: fix redzoning for small allocations (2021, v5.13-rc7)
- mm/slub: actually fix freelist pointer vs redzoning (2021, v5.13-rc7)

1) slub object without metadata

The total size is the actual object size plus the padding needed to align it to the align unit.
- e.g. object_size=22, align=8
  - inuse=24, size=24
- e.g. object_size=22, align=64
  - inuse=24, size=64
The offset of the FP is 0.
The minimum alignment is one word (32bit=4, 64bit=8).
- e.g. object_size=22, align=0
  - size=24
When the SLAB_HWCACHE_ALIGN flag is used and the object is smaller than the L1 cache line size, the align unit is halved repeatedly, down to the smallest power of two that still holds the object, in order to minimize cache line bouncing.
- e.g. …, 64, 32, 16, 8
e.g. object_size=22, align=22, flags=SLAB_HWCACHE_ALIGN
- size=32
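
The examples above can be reproduced with a short userspace sketch. It assumes a 64-bit word (8 bytes) and a 64-byte L1 cache line, and the helper names are hypothetical; it models only the no-metadata case and is not the kernel's own routine.

```c
#include <assert.h>

#define WORD_SK		8u	/* assumed pointer size (64-bit system) */
#define CACHE_LINE_SK	64u	/* assumed L1 cache line size */

/* Round x up to a multiple of a (a > 0). */
static inline unsigned int round_up_sk(unsigned int x, unsigned int a)
{
	assert(a > 0);
	return (x + a - 1) / a * a;
}

/* Alignment actually applied to the cache for the requested align. */
static inline unsigned int effective_align(unsigned int align,
					   unsigned int object_size,
					   int hwcache_align)
{
	assert(object_size > 0);
	if (hwcache_align) {
		unsigned int ralign = CACHE_LINE_SK;

		/* Halve the unit while the object still fits in half of it. */
		while (object_size <= ralign / 2)
			ralign /= 2;
		if (align < ralign)
			align = ralign;
	}
	if (align < WORD_SK)
		align = WORD_SK;
	return round_up_sk(align, WORD_SK);
}

/* Per-object size for a cache without debug metadata. */
static inline unsigned int plain_object_size(unsigned int object_size,
					     unsigned int align,
					     int hwcache_align)
{
	unsigned int inuse = round_up_sk(object_size, WORD_SK);

	return round_up_sk(inuse,
			   effective_align(align, object_size, hwcache_align));
}
```

For object_size=22 the sketch gives 24 with align=8 or align=0, 64 with align=64, and 32 with align=22 plus the hardware cache alignment option, matching the examples.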

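2) slub object with red-zone information

A red_left_pad area is added so that an object stays protected even if the preceding object writes past its end.
Immediately to the right of the object comes a red zone of at least 1 byte and at most one word (32bit=4, 64bit=8).
- If the object size is already word-aligned and leaves no room for the red zone, one word is added for it.
- In addition, the yellow area, i.e. the start of the user-visible object, is aligned to align, and so is the total object size.

3) slub object with relocated FP (poison, rcu, ctor)

Slab caches that use poison debugging, RCU-based freeing of objects, or a constructor must keep the contents of the object area intact while the object is free, so the FP (Free Pointer) has to be moved to just after the object. In this case the offset of the FP equals inuse.
The three cases that require moving the FP are as follows.
- With the SLAB_POISON flag, poison data is written in the following two states.
  - the object is free
  - the object is allocated but not yet initialized
- With the SLAB_TYPESAFE_BY_RCU flag, readers may still access the object after it is freed, so its contents must not be overwritten by the FP.
- With a constructor, the state set up by the constructor must be preserved while the object is free.

4) slub object with owner track information

Owner track information is added to trace who allocates and frees the object.
- With the SLAB_STORE_USER flag, two track structures are used for the owner track information.

5) slub object with red-zone + relocated FP + owner track information

Although not shown in the figure, when KASAN debugging is enabled, KASAN-related information is added after the owner track.

Structures

kmem_cache structure (slub)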

include/linux/slub_def.h

/*
 * Slab cache management.
 */

struct kmem_cache {
        struct kmem_cache_cpu __percpu *cpu_slab;
        /* Used for retriving partial slabs etc */
        slab_flags_t flags;
        unsigned long min_partial;
        unsigned int size;      /* The size of an object including meta data */
        unsigned int object_size;/* The size of an object without meta data */
        unsigned int offset;    /* Free pointer offset. */
#ifdef CONFIG_SLUB_CPU_PARTIAL
        /* Number of per cpu partial objects to keep around */
        unsigned int cpu_partial;
#endif
        struct kmem_cache_order_objects oo;

        /* Allocation and freeing of slabs */
        struct kmem_cache_order_objects max;
        struct kmem_cache_order_objects min;
        gfp_t allocflags;       /* gfp flags to use on each alloc */
        int refcount;           /* Refcount for slab cache destroy */
        void (*ctor)(void *);
        unsigned int inuse;             /* Offset to metadata */
        unsigned int align;             /* Alignment */
        unsigned int red_left_pad;      /* Left redzone padding size */
        const char *name;       /* Name (only for display!) */
        struct list_head list;  /* List of slab caches */
#ifdef CONFIG_SYSFS
        struct kobject kobj;    /* For sysfs */
        struct work_struct kobj_remove_work;
#endif
#ifdef CONFIG_MEMCG
        struct memcg_cache_params memcg_params;
        /* for propagation, maximum size of a stored attr */
        unsigned int max_attr_size;
#ifdef CONFIG_SYSFS
        struct kset *memcg_kset;
#endif
#endif

#ifdef CONFIG_SLAB_FREELIST_HARDENED
        unsigned long random;
#endif

#ifdef CONFIG_NUMA
        /*
         * Defragmentation by allocating from a remote node.
         */
        unsigned int remote_node_defrag_ratio;
#endif

#ifdef CONFIG_SLAB_FREELIST_RANDOM
        unsigned int *random_seq;
#endif

#ifdef CONFIG_KASAN
        struct kasan_cache kasan_info;
#endif

        unsigned int useroffset;        /* Usercopy region offset */
        unsigned int usersize;          /* Usercopy region size */

        struct kmem_cache_node *node[MAX_NUMNODES];
};

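Manages a slab cache.

cpu_slab
- per-cpu cache
flags
- flag options applied when the cache was created
min_partial
- minimum number of partial slab pages to keep in the cache
size
- aligned size of an object including metadata
object_size
- size of the object excluding metadata
offset
- offset of the FP (Free Pointer) within the object
- 0 when neither SLAB_POISON, SLAB_TYPESAFE_BY_RCU (formerly SLAB_DESTROY_BY_RCU), nor a constructor is used
cpu_partial
- maximum number of slab objects that may be kept on the per-cpu partial list
oo
- recommended order to apply when creating a slab page
- slab pages are allocated with the recommended order unless memory is short
max
- maximum order when creating a slab page
min
- minimum order when creating a slab page
- under memory shortage, slab pages are allocated with the min order
allocflags
- GFP flags to use on each allocation
refcount
- reference counter used when destroying the slab cache; it is incremented each time an alias cache is created
(*ctor)
- object constructor
inuse
- actual size, excluding the space added for metadata
- equal to size before alignment padding when no metadata (SLAB_POISON, SLAB_TYPESAFE_BY_RCU, SLAB_STORE_USER, KASAN) is used
align
- number of bytes to align to
red_left_pad
- size of the left red-zone padding
reserved
- number of bytes to reserve at the end of a slub (this member does not appear in the struct listing above)
*name
- name used for display only
list
- list node that links the slab caches together
kobj
- directory information used when creating the sysfs entry
kobj_remove_work
- work item that removes the sysfs directory named after the slab cache when the cache is destroyed
memcg_params
- parameters used by the memory cgroup
max_attr_size
- maximum size of a stored attribute
*memcg_kset
- kset used by the memory cgroup
random
- with the CONFIG_SLAB_FREELIST_HARDENED kernel option, a random value used to obfuscate the FP (Free Pointer) for security so that it cannot be read directly
remote_node_defrag_ratio
- probability of trying a remote node's partial list when the local node's partial list cannot supply a slub object
- 0~1023 (100 = about a 10% chance of trying a remote node when no slub object can be allocated from the local node's partial list)
*random_seq
- with the CONFIG_SLAB_FREELIST_RANDOM kernel option, an array allocated to shuffle the order of free objects, hardening against heap overflow attacks
kasan_info
- with the CONFIG_KASAN kernel option, information for the KASAN (Kernel Address SANitizer) runtime debugger
useroffset
- usercopy region offset
usersize
- usercopy region size
*node
- array of kmem_cache_node pointers that manage the per-node partial lists

kmem_cache_cpu structure (slub)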

include/linux/slub_def.h

struct kmem_cache_cpu {
        void **freelist;        /* Pointer to next available object */
        unsigned long tid;      /* Globally unique transaction id */
        struct page *page;      /* The slab from which we are allocating */
#ifdef CONFIG_SLUB_CPU_PARTIAL
        struct page *partial;   /* Partially allocated frozen slabs */
#endif
#ifdef CONFIG_SLUB_STATS
        unsigned stat[NR_SLUB_STAT_ITEMS];
#endif
};

Slab cache state managed per cpu.

**freelist
- pointer to an allocatable free object in the page member below
tid
- globally unique transaction id
*page
- slab page currently used for allocation/free
*partial
- list of frozen slab pages in which some objects are in use
stat
- slab cache statistics

kmem_cache_node structure (slab, slub)

mm/slab.h

#ifndef CONFIG_SLOB
/*      
 * The slab lists for all objects.
 */

struct kmem_cache_node {
        spinlock_t list_lock;
#ifdef CONFIG_SLUB
        unsigned long nr_partial;
        struct list_head partial;
#ifdef CONFIG_SLUB_DEBUG
        atomic_long_t nr_slabs;
        atomic_long_t total_objects;
        struct list_head full;
#endif
#endif
};

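Slab cache state managed per node.

list_lock
- spin lock
nr_partial
- number of slab pages on the partial list
partial
- list of partial slab pages on this node
- unlike slab, however, slub does not specifically distinguish the partial state
nr_slabs
- number of slab pages (debug only)
total_objects
- total number of slab objects (debug only)
full
- list of slub pages whose objects are all in use (debug only)

Members of struct page used by slub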

include/linux/mm_types.h

struct page {
...
                struct {        /* slab, slob and slub */
                        union {
                                struct list_head slab_list;     /* uses lru */
                                struct {        /* Partial pages */
                                        struct page *next;
#ifdef CONFIG_64BIT
                                        int pages;      /* Nr of pages left */
                                        int pobjects;   /* Approximate count */
#else
                                        short int pages;
                                        short int pobjects;
#endif
                                };
                        };
                        struct kmem_cache *slab_cache; /* not slob */
                        /* Double-word boundary */
                        void *freelist;         /* first free object */
                        union {
                                void *s_mem;    /* slab: first object */
                                unsigned long counters;         /* SLUB */
                                struct {                        /* SLUB */
                                        unsigned inuse:16;
                                        unsigned objects:15;
                                        unsigned frozen:1;
                                };
                        };
                };
...

The page descriptor as used for a slab page.

slab_list
- list node used to link the page into a per-cpu cache or a per-node partial list
*next
- points to the next partial page
pages
- number of partial pages left
pobjects
- approximate number of objects left
*slab_cache
- the slab cache
*freelist
- pointer to the first free object
counters
- holds the following information.
  - inuse:16
    - number of objects in use
    - counters[15:0]
  - objects:15
    - total number of objects
    - counters[30:16]
  - frozen:1
    - whether the page is frozen
    - counters[31]

Frozen State of a Slab Page

A slab page is frozen when it is reserved for exclusive use by a particular cpu.
- Slab pages linked at c->page or c->partial are frozen.
Slab pages managed on the per-node partial lists are un-frozen.
- Slab pages linked at node[]->partial are un-frozen.
The owning cpu can search for, allocate, and free slab objects on the freelist of a frozen page.
Other cpus cannot search or allocate slab objects from that freelist; they are only allowed to free slab objects.

kmem_cache_order_objects structure

include/linux/slub_def.h

/*
 * Word size structure that can be atomically updated or read and that
 * contains both the order and the number of objects that a slab of the
 * given order would contain.
 */

struct kmem_cache_order_objects {
        unsigned long x;
};

x[15:0]
- maximum number of objects in a slub page
x[31:16]
- order used when creating a slub page

References

Slab Memory Allocator -1- (Structure) | 문c – this post
Slab Memory Allocator -2- (Cache Initialization) | 문c
Slub Memory Allocator -3- (Cache Creation) | 문c
Slub Memory Allocator -4- (Order Calculation) | 문c
Slub Memory Allocator -5- (Slub Allocation) | 문c
Slub Memory Allocator -6- (Object Allocation) | 문c
Slub Memory Allocator -7- (Object Free) | 문c
Slub Memory Allocator -8- (Drain/Flash Cache) | 문c
Slub Memory Allocator -9- (Cache Shrink) | 문c
Slub Memory Allocator -10- (Slub Free) | 문c
Slub Memory Allocator -11- (Cache Destroy) | 문c
Slub Memory Allocator -12- (Slub Debugging) | 문c
Slub Memory Allocator -13- (slabinfo) | 문c
Kmalloc | 문c

Slab allocators in the Linux Kernel: SLAB, SLOB, SLUB (Rev. 2014) | Christoph Lameter – pdf download
A Heap of Trouble (Exploiting the Linux Kernel SLOB Allocator) | Dan Rosenberg – pdf download
How does the SLUB allocator work | 김준수 – pdf download
[Linux] 동적 메모리 할당자 : slab, slub, slob | F/OSS
The SLUB Allocator | LWN.net
Kernel dynamic memory analysis | elinux.org
SLUB: Support for statistics to help analyze allocator behavior | LWN.net
Cramming more into struct page | LWN.net
debug with Linux slub allocator | thinkiii
slub_debug: Detect Kernel heap memory corruption | TechVolve
Short users guide for SLUB (2015, Documentation/vm/slub.txt)| Kernel.org

Slub Memory Allocator -4- (Order Calculation)

2016-06-02 / 2019-11-11 문영일

Slab Cache Size and Order Calculation

calculate_sizes()

mm/slub.c -1/2-

/*
 * calculate_sizes() determines the order and the distribution of data within
 * a slab object.
 */

static int calculate_sizes(struct kmem_cache *s, int forced_order)
{
        slab_flags_t flags = s->flags;
        unsigned int size = s->object_size;
        unsigned int order;

        /*
         * Round up object size to the next word boundary. We can only
         * place the free pointer at word boundaries and this determines
         * the possible location of the free pointer.
         */
        size = ALIGN(size, sizeof(void *));

#ifdef CONFIG_SLUB_DEBUG
        /*
         * Determine if we can poison the object itself. If the user of
         * the slab may touch the object after free or before allocation
         * then we should never poison the object itself.
         */
        if ((flags & SLAB_POISON) && !(flags & SLAB_TYPESAFE_BY_RCU) &&
                        !s->ctor)
                s->flags |= __OBJECT_POISON;
        else
                s->flags &= ~__OBJECT_POISON;

        /*
         * If we are Redzoning then check if there is some space between the
         * end of the object and the free pointer. If not then add an
         * additional word to have some bytes to store Redzone information.
         */
        if ((flags & SLAB_RED_ZONE) && size == s->object_size)
                size += sizeof(void *);
#endif

        /*
         * With that we have determined the number of bytes in actual use
         * by the object. This is the potential offset to the free pointer.
         */
        s->inuse = size;

        if (((flags & (SLAB_TYPESAFE_BY_RCU | SLAB_POISON)) ||
                s->ctor)) {
                /*
                 * Relocate free pointer after the object if it is not
                 * permitted to overwrite the first word of the object on
                 * kmem_cache_free.
                 *
                 * This is the case if we do RCU, have a constructor or
                 * destructor or are poisoning the objects.
                 */
                s->offset = size;
                size += sizeof(void *);
        }

Sets the minimum object count (s->min), the preferred object count (s->oo), and the maximum object count (s->max) for a slab, using either the given forced_order or an order derived from the object size (s->size), and returns 1. Returns 0 as an error if the object count is 0.

Code line 12 rounds size up to a multiple of the pointer size.
Code lines 20~24 add the __OBJECT_POISON flag only when poison debugging is requested. Slab caches that use a constructor or RCU are excluded, so the object itself is not poisoned for them. With RCU-based freeing the actual release of the object is delayed, so the original object data must not be destroyed by poison values.
Code lines 31~32: with red-zone debugging, if size equals object_size there is no padding between the object and the FP (Free Pointer) to serve as a red zone, so one pointer-sized padding area is added for it.
Code line 39 stores the size computed so far, i.e. the actual size used by the object, in s->inuse.
Code lines 41~53: the FP (Free Pointer) is normally stored in the object area itself, but with RCU-based object freeing, poison debugging, or a constructor the object area must keep its own contents. The FP is therefore appended after the object, and s->offset holds its position.
- With RCU the release is delayed, so the original object data must not be overwritten by the FP value.

mm/slub.c -2/2-

#ifdef CONFIG_SLUB_DEBUG
        if (flags & SLAB_STORE_USER)
                /*
                 * Need to store information about allocs and frees after
                 * the object.
                 */
                size += 2 * sizeof(struct track);
#endif

        kasan_cache_create(s, &size, &s->flags);
#ifdef CONFIG_SLUB_DEBUG
        if (flags & SLAB_RED_ZONE) {
                /*
                 * Add some empty padding so that we can catch
                 * overwrites from earlier objects rather than let
                 * tracking information or the free pointer be
                 * corrupted if a user writes before the start
                 * of the object.
                 */
                size += sizeof(void *);

                s->red_left_pad = sizeof(void *);
                s->red_left_pad = ALIGN(s->red_left_pad, s->align);
                size += s->red_left_pad;
        }
#endif

        /*
         * SLUB stores one object immediately after another beginning from
         * offset 0. In order to align the objects we have to simply size
         * each object to conform to the alignment.
         */
        size = ALIGN(size, s->align);
        s->size = size;
        if (forced_order >= 0)
                order = forced_order;
        else
                order = calculate_order(size);

        if ((int)order < 0)
                return 0;

        s->allocflags = 0;
        if (order)
                s->allocflags |= __GFP_COMP;

        if (s->flags & SLAB_CACHE_DMA)
                s->allocflags |= GFP_DMA;

        if (s->flags & SLAB_RECLAIM_ACCOUNT)
                s->allocflags |= __GFP_RECLAIMABLE;

        /*
         * Determine the number of objects per slab
         */
        s->oo = oo_make(order, size);
        s->min = oo_make(get_order(size), size);
        if (oo_objects(s->oo) > oo_objects(s->max))
                s->max = s->oo;

        return !!oo_objects(s->oo);
}

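Code lines 2~7: if the SLAB_STORE_USER flag is set, the sizes of two track structures, one for alloc and one for free, are added so that the owners calling alloc/free can be tracked.
Code line 10 adds KASAN-related information when the KASAN runtime debugger is used.
Code lines 12~25: with red-zone debugging, red_left_pad is added on the left and one pointer-sized area is added at the end for red-zone information.
Code lines 33~34 apply align to the size computed so far.
Code lines 35~41 compute a suitable order for size unless forced_order is given. (0~3 where possible)
Code lines 43~51 add allocation flags for the following kinds of slab cache.
- If order is 1 or higher, the __GFP_COMP flag is added to create a compound page.
- If a dma slab cache is requested, the GFP_DMA flag is added.
- If a reclaimable slab cache is requested, the __GFP_RECLAIMABLE flag is added.
  - The migrate type becomes reclaimable.
Code lines 56~59 compute the following three order/object-count pairs.
- s->oo receives the computed order and object count.
- s->min receives the minimum order and object count.
- s->max is raised to s->oo whenever s->oo holds more objects.
Code line 61 returns whether a usable order was found. (1 on success)

The following figure shows the minimum order, the preferred order, and the maximum order computed for a slab.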
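
The size accounting walked through above can be condensed into a userspace sketch. It assumes an 8-byte pointer, leaves out the KASAN step, and takes the size of the track structure as a parameter because that size depends on the kernel configuration. The `F_*` flags, `struct layout_sk`, and `sketch_sizes()` are hypothetical names for illustration only.

```c
#include <assert.h>

#define WORD_SK 8u	/* assumed pointer size (64-bit system) */

/* Hypothetical stand-ins for the SLAB_* flags discussed above. */
enum {
	F_POISON	= 1 << 0,
	F_RCU		= 1 << 1,
	F_RED_ZONE	= 1 << 2,
	F_STORE_USER	= 1 << 3,
};

struct layout_sk {
	unsigned int inuse;		/* bytes actually used by the object */
	unsigned int offset;		/* FP offset inside the object */
	unsigned int red_left_pad;	/* left red-zone padding */
	unsigned int size;		/* final per-object size */
};

/* Round x up to a power-of-two multiple a. */
static inline unsigned int align_up_sk(unsigned int x, unsigned int a)
{
	assert(a && !(a & (a - 1)));
	return (x + a - 1) & ~(a - 1);
}

static inline struct layout_sk sketch_sizes(unsigned int object_size,
					     unsigned int align,
					     unsigned int flags, int has_ctor,
					     unsigned int track_size)
{
	struct layout_sk l = { 0, 0, 0, 0 };
	unsigned int size = align_up_sk(object_size, WORD_SK);

	/* No gap after the object: add one word to hold the red zone. */
	if ((flags & F_RED_ZONE) && size == object_size)
		size += WORD_SK;
	l.inuse = size;

	/* Object contents must survive free: move the FP behind the object. */
	if ((flags & (F_RCU | F_POISON)) || has_ctor) {
		l.offset = size;
		size += WORD_SK;
	}

	/* Two track records: one for alloc, one for free. */
	if (flags & F_STORE_USER)
		size += 2 * track_size;

	/* Trailing red-zone word plus aligned padding on the left. */
	if (flags & F_RED_ZONE) {
		size += WORD_SK;
		l.red_left_pad = align_up_sk(WORD_SK, align);
		size += l.red_left_pad;
	}

	l.size = align_up_sk(size, align);
	return l;
}
```

For example, a 24-byte object with align=8 and both red-zone and poison debugging grows to 56 bytes in this model: 32 bytes in use, 8 for the relocated FP, 8 for the trailing red-zone word, and 8 for red_left_pad.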

Order Calculation

calculate_order()

mm/slub.c

static inline int calculate_order(unsigned int size)
{
        unsigned int order;
        unsigned int min_objects;
        unsigned int max_objects;

        /*
         * Attempt to find best configuration for a slab. This
         * works by first attempting to generate a layout with
         * the best configuration and backing off gradually.
         *
         * First we increase the acceptable waste in a slab. Then
         * we reduce the minimum objects required in a slab.
         */
        min_objects = slub_min_objects;
        if (!min_objects)
                min_objects = 4 * (fls(nr_cpu_ids) + 1);
        max_objects = order_objects(slub_max_order, size);
        min_objects = min(min_objects, max_objects);

        while (min_objects > 1) {
                unsigned int fraction;

                fraction = 16;
                while (fraction >= 4) {
                        order = slab_order(size, min_objects,
                                        slub_max_order, fraction);
                        if (order <= slub_max_order)
                                return order;
                        fraction /= 2;
                }
                min_objects--;
        }

        /*
         * We were unable to place multiple objects in a slab. Now
         * lets see if we can place a single object there.
         */
        order = slab_order(size, 1, slub_max_order, 1);
        if (order <= slub_max_order)
                return order;

        /*
         * Doh this slab cannot be placed using slub_max_order.
         */
        order = slab_order(size, 1, MAX_ORDER, 1);
        if (order < MAX_ORDER)
                return order;
        return -ENOSYS;
}

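Computes a suitable order for @size for slub allocation. Where possible, the order is chosen within the range given by the kernel parameters “slub_min_order=” (default=0) ~ “slub_max_order=” (default=3). If size is larger than a slub_max_order page, the order is chosen within the buddy system's maximum order.

Code lines 15~17 choose the minimum number of objects per slab page from one of the following.
- The value set by the “slub_min_objects=” kernel parameter (default=0).
- A number proportional to the cpu count.
  - Formula: 4 * (floor(log2(number of cpus)) + 1 + 1)
  - rpi2 example) 4 * (2 + 1 + 1) = 16
Code line 18 computes the maximum number of objects that fit in a page of “slub_max_order=” (default=3) size.
Code line 19 updates min_objects to the smaller of the two values.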
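
The default min_objects formula can be checked with a few lines of userspace C. `fls_sk()` is a plain loop that stands in for the kernel's fls(), and `default_min_objects()` is a hypothetical helper name used only for this illustration.

```c
#include <assert.h>

/* Position of the highest set bit, counting from 1; 0 for x == 0. */
static inline unsigned int fls_sk(unsigned int x)
{
	unsigned int bits = 0;

	while (x) {
		bits++;
		x >>= 1;
	}
	return bits;
}

/* Value used when the slub_min_objects parameter is left at 0. */
static inline unsigned int default_min_objects(unsigned int nr_cpu_ids)
{
	assert(nr_cpu_ids > 0);
	return 4 * (fls_sk(nr_cpu_ids) + 1);
}
```

With 4 cpus, fls_sk(4) is 3, so the result is 4 * (3 + 1) = 16, the same value as the rpi2 example.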

1st phase: find a suitable order within the slab order range

Starting from the smallest order that can hold the min_objects computed above, the orders in the range “slub_min_order=” (default=0) ~ “slub_max_order=” (default=3) are tried in ascending order. Objects of @size are laid out, and the first order whose leftover space is small enough is returned. If the leftover space exceeds 1/16 of the page size of the order being tried, it is considered wasteful and the next order is tried. If 1/16 cannot be met, 1/8 is tried, and finally 1/4. If even that fails, min_objects is decreased by 1 and the whole procedure is repeated, down to min_objects = 2.

Code line 21 loops only while min_objects is 2 or more.
- e.g. if min_objects was computed as 16, the loop runs for 16, 15, 14, … 2.
Code lines 24~25 try waste limits of 1/16, 1/8, and 1/4 of the page size of the order being tried.
Code lines 26~29 iterate from the order that holds min_objects through the “slub_min_order=” (default=0) ~ “slub_max_order=” (default=3) range and return the first order whose waste stays within the limit given by fraction (1/16, 1/8, 1/4).
Code lines 30~31 halve fraction (16 → 8 → 4), which doubles the allowed waste, and repeat.
Code lines 32~33 decrement min_objects by 1 and repeat.

2nd phase: find an order within the slab order range that holds one object

Code lines 39~41 iterate up to the “slub_max_order=” (default=3) order, ignoring the waste limit, and compute an order that can hold at least one object.

3rd phase: find an order that holds one object (no order limit)

Code lines 46~48 iterate up to the buddy system's maximum order, ignoring the waste limit, and compute an order that can hold at least one object.

The following figure shows a suitable order being computed for slab allocation.

order_objects()

mm/slub.c

static inline unsigned int order_objects(unsigned int order, unsigned int size)
{
        return ((unsigned int)PAGE_SIZE << order) / size;
}

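Computes the maximum number of objects that can be created in a page of the requested order.

Returns the size of the 2^order pages divided by size.

The following figure shows how the maximum number of objects that fit in a page of the given order is obtained.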

slab_order()

mm/slub.c

/*
 * Calculate the order of allocation given an slab object size.
 *
 * The order of allocation has significant impact on performance and other
 * system components. Generally order 0 allocations should be preferred since
 * order 0 does not cause fragmentation in the page allocator. Larger objects
 * be problematic to put into order 0 slabs because there may be too much
 * unused space left. We go to a higher order if more than 1/16th of the slab
 * would be wasted.
 *
 * In order to reach satisfactory performance we must ensure that a minimum
 * number of objects is in one slab. Otherwise we may generate too much
 * activity on the partial lists which requires taking the list_lock. This is
 * less a concern for large slabs though which are rarely used.
 *
 * slub_max_order specifies the order where we begin to stop considering the
 * number of objects in a slab as critical. If we reach slub_max_order then
 * we try to keep the page order as low as possible. So we accept more waste
 * of space in favor of a small page order.
 *
 * Higher order allocations also allow the placement of more objects in a
 * slab and thereby reduce object handling overhead. If the user has
 * requested a higher mininum order then we start with that one instead of
 * the smallest order which will fit the object.
 */

static inline unsigned int slab_order(unsigned int size,
                unsigned int min_objects, unsigned int max_order,
                unsigned int fract_leftover)
{
        unsigned int min_order = slub_min_order;
        unsigned int order;

        if (order_objects(min_order, size) > MAX_OBJS_PER_PAGE)
                return get_order(size * MAX_OBJS_PER_PAGE) - 1;

        for (order = max(min_order, (unsigned int)get_order(min_objects * size));
                        order <= max_order; order++) {

                unsigned int slab_size = (unsigned int)PAGE_SIZE << order;
                unsigned int rem;

                rem = slab_size % size;

                if (rem <= slab_size / fract_leftover)
                        break;
        }

        return order;
}

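Computes the order needed to create a slab page. At least @min_objects objects of @size must fit, and within the range up to @max_order the space left over after laying out the objects must not exceed 1/fract_leftover of the slab page.

Code line 5 assigns the “slub_min_order=” (default=0) value to min_order.
Code lines 8~9: if the number of objects that fit in a min_order page exceeds MAX_OBJS_PER_PAGE (32767), the order that can hold size * MAX_OBJS_PER_PAGE (32767), minus 1, is returned.
Code lines 11~12 iterate from the larger of min_order and the order that holds @min_objects * @size up to @max_order.
Code lines 14~20: if the space left over after laying out objects in a page of the current order is no more than 1/@fract_leftover of the slab page, the waste is considered small, the loop exits, and that order is returned.

The following figure shows the computation, within pages of at most order 3, of an order at which objects of size 1032 are laid out, at least 4 of them, with leftover space no larger than 1/16 of the slab page.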
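
The three phases and the 1032-byte example can be replayed in userspace with the sketch below. It assumes 4 KB pages and a buddy maximum order of 11, and passes min_order, max_order, and min_objects as parameters instead of reading the kernel's tunables. All `_sk` names are hypothetical; the control flow follows the listings above, but this is an illustration rather than the kernel code itself.

```c
#include <assert.h>

#define PAGE_SHIFT_SK		12u	/* assumed 4 KB pages */
#define PAGE_SIZE_SK		(1u << PAGE_SHIFT_SK)
#define MAX_ORDER_SK		11u	/* assumed buddy maximum order */
#define MAX_OBJS_PER_PAGE_SK	32767u

/* Smallest order whose pages can hold size bytes. */
static inline unsigned int get_order_sk(unsigned long size)
{
	unsigned int order = 0;

	assert(size > 0);
	size = (size - 1) >> PAGE_SHIFT_SK;
	while (size) {
		order++;
		size >>= 1;
	}
	return order;
}

/* First order in range whose leftover is <= slab_size / fract_leftover. */
static inline unsigned int slab_order_sk(unsigned int size,
					 unsigned int min_objects,
					 unsigned int max_order,
					 unsigned int fract_leftover,
					 unsigned int min_order)
{
	unsigned int order;

	if ((PAGE_SIZE_SK << min_order) / size > MAX_OBJS_PER_PAGE_SK)
		return get_order_sk((unsigned long)size *
				    MAX_OBJS_PER_PAGE_SK) - 1;

	order = get_order_sk((unsigned long)min_objects * size);
	if (order < min_order)
		order = min_order;

	for (; order <= max_order; order++) {
		unsigned int slab_size = PAGE_SIZE_SK << order;

		if (slab_size % size <= slab_size / fract_leftover)
			break;
	}
	return order;	/* max_order + 1 when nothing fit */
}

/* Returns the chosen order, or -1 if even one object cannot be placed. */
static inline int calculate_order_sk(unsigned int size,
				     unsigned int min_objects,
				     unsigned int min_order,
				     unsigned int max_order)
{
	unsigned int order, fraction;
	unsigned int max_objects = (PAGE_SIZE_SK << max_order) / size;

	if (min_objects > max_objects)
		min_objects = max_objects;

	/* 1st phase: relax the waste limit, then the object count. */
	while (min_objects > 1) {
		for (fraction = 16; fraction >= 4; fraction /= 2) {
			order = slab_order_sk(size, min_objects, max_order,
					      fraction, min_order);
			if (order <= max_order)
				return (int)order;
		}
		min_objects--;
	}

	/* 2nd phase: a single object within max_order. */
	order = slab_order_sk(size, 1, max_order, 1, min_order);
	if (order <= max_order)
		return (int)order;

	/* 3rd phase: a single object within the buddy maximum order. */
	order = slab_order_sk(size, 1, MAX_ORDER_SK, 1, min_order);
	if (order < MAX_ORDER_SK)
		return (int)order;
	return -1;
}
```

For size=1032 and min_objects=4, order 1 leaves 8192 % 1032 = 968 bytes, which is more than 8192 / 16 = 512, so it is rejected. Order 2 leaves 16384 % 1032 = 904 bytes, which is within 16384 / 16 = 1024, so order 2 is chosen.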

get_order()

include/asm-generic/getorder.h

/**
 * get_order - Determine the allocation order of a memory size
 * @size: The size for which to get the order
 *
 * Determine the allocation order of a particular sized block of memory.  This
 * is on a logarithmic scale, where:
 *
 *      0 -> 2^0 * PAGE_SIZE and below
 *      1 -> 2^1 * PAGE_SIZE to 2^0 * PAGE_SIZE + 1
 *      2 -> 2^2 * PAGE_SIZE to 2^1 * PAGE_SIZE + 1
 *      3 -> 2^3 * PAGE_SIZE to 2^2 * PAGE_SIZE + 1
 *      4 -> 2^4 * PAGE_SIZE to 2^3 * PAGE_SIZE + 1
 *      ...
 *
 * The order returned is used to find the smallest allocation granule required
 * to hold an object of the specified size. 
 *
 * The result is undefined if the size is 0.
 *
 * This function may be used to initialise variables with compile time
 * evaluations of constants.
 */

#define get_order(n)                                            \
(                                                               \
        __builtin_constant_p(n) ? (                             \
                ((n) == 0UL) ? BITS_PER_LONG - PAGE_SHIFT :     \
                (((n) < (1UL << PAGE_SHIFT)) ? 0 :              \
                 ilog2((n) - 1) - PAGE_SHIFT + 1)               \
        ) :                                                     \
        __get_order(n)                                          \
)

Computes the order for a size of @n.

e.g. 4K pages, n=4097 ~ 8192
- -> 1

__get_order()

include/asm-generic/getorder.h

/*
 * Runtime evaluation of get_order()
 */

static inline __attribute_const__
int __get_order(unsigned long size)
{
        int order;

        size--;
        size >>= PAGE_SHIFT;
#if BITS_PER_LONG == 32
        order = fls(size);
#else
        order = fls64(size);
#endif
        return order;
}

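Computes the buddy system order that can hold @size.

Subtracts 1 from size, converts the result to a page count, and computes the number of bits needed to represent it.
- fls()
  - searching from lsb to msb, the bit number of the last 1 bit found, plus 1
    - e.g. 0xf000_0000 -> 32
e.g. size=0x10_0000 (1MB), 4K pages
- -> 8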
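
A userspace sketch of the same computation makes the boundaries easy to check. It assumes 4 KB pages (a shift of 12) and uses a simple loop in place of fls()/fls64(); the `_sk` names are hypothetical.

```c
#include <assert.h>

#define PAGE_SHIFT_SK 12	/* assumed 4 KB pages */

/* Number of bits needed to represent x; 0 for x == 0. */
static inline int fls_long_sk(unsigned long x)
{
	int bits = 0;

	while (x) {
		bits++;
		x >>= 1;
	}
	return bits;
}

/* Runtime order calculation; like the original, undefined for size 0. */
static inline int get_order_sk(unsigned long size)
{
	assert(size > 0);
	size--;
	size >>= PAGE_SHIFT_SK;
	return fls_long_sk(size);
}
```

Subtracting 1 first is what keeps exact powers of two in the lower order: 4096 bytes still fit in order 0, while 4097 bytes need order 1.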

oo_make()

mm/slub.c

static inline struct kmem_cache_order_objects oo_make(unsigned int order,
                unsigned int size)
{
        struct kmem_cache_order_objects x = {
                (order << OO_SHIFT) + order_objects(order, size)
        };

        return x;
}

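Packs @order and the object count derived from @size into a kmem_cache_order_objects structure and returns it.

The structure holds a single unsigned integer value: the low 16 bits store the object count and the remaining bits store the order.
OO_SHIFT=16

The figure below shows the oo_make() inline function packing the order and the computed object count into the internal kmem_cache_order_objects structure and returning it.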

oo_objects()

mm/slub.c

static inline unsigned int oo_objects(struct kmem_cache_order_objects x)
{
        return x.x & OO_MASK;
}

Returns only the object count from @x.

Returns the low 16 bits of x.

The following figure shows only the object count being extracted from a kmem_cache_order_objects structure.
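
The packing and unpacking can be tried out in userspace as follows. The sketch assumes 4 KB pages and the OO_SHIFT value of 16 given above; `struct oo_sk` and the `_sk` helpers are hypothetical names that mirror the listings for illustration only.

```c
#include <assert.h>

#define PAGE_SIZE_SK	4096u	/* assumed 4 KB pages */
#define OO_SHIFT_SK	16
#define OO_MASK_SK	((1u << OO_SHIFT_SK) - 1)

struct oo_sk {
	unsigned int x;	/* order in the high bits, object count in the low 16 */
};

/* Pack the order together with the number of objects that fit in it. */
static inline struct oo_sk oo_make_sk(unsigned int order, unsigned int size)
{
	struct oo_sk x;

	assert(size > 0);
	x.x = (order << OO_SHIFT_SK) + (PAGE_SIZE_SK << order) / size;
	return x;
}

/* Extract the order from the high bits. */
static inline unsigned int oo_order_sk(struct oo_sk x)
{
	return x.x >> OO_SHIFT_SK;
}

/* Extract the object count from the low 16 bits. */
static inline unsigned int oo_objects_sk(struct oo_sk x)
{
	return x.x & OO_MASK_SK;
}
```

For order 2 and size 1032, 16384 / 1032 = 15 objects fit, so the packed value is (2 << 16) + 15. Keeping both numbers in one word is what allows them to be read or updated together.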

References

Slab Memory Allocator -1- (Structure) | 문c
Slab Memory Allocator -2- (Cache Initialization) | 문c
Slub Memory Allocator -3- (Cache Creation) | 문c
Slub Memory Allocator -4- (Order Calculation) | 문c – this post
Slub Memory Allocator -5- (Slub Allocation) | 문c
Slub Memory Allocator -6- (Object Allocation) | 문c
Slub Memory Allocator -7- (Object Free) | 문c
Slub Memory Allocator -8- (Drain/Flash Cache) | 문c
Slub Memory Allocator -9- (Cache Shrink) | 문c
Slub Memory Allocator -10- (Slub Free) | 문c
Slub Memory Allocator -11- (Cache Destroy) | 문c
Slub Memory Allocator -12- (Slub Debugging) | 문c
Slub Memory Allocator -13- (slabinfo) | 문c

Slub Memory Allocator -13- (slabinfo)

2016-06-01 / 2019-11-25 문영일

/proc/slabinfo

$ cat /proc/slabinfo
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
nf_conntrack_1      1035   1035    344   23    2 : tunables    0    0    0 : slabdata     45     45      0
ext4_groupinfo_4k   2112   2112    168   24    1 : tunables    0    0    0 : slabdata     88     88      0
ip6-frags              0      0    248   16    1 : tunables    0    0    0 : slabdata      0      0      0
ip6_dst_cache        126    126    384   21    2 : tunables    0    0    0 : slabdata      6      6      0
RAWv6                104    104   1216   26    8 : tunables    0    0    0 : slabdata      4      4      0
UDPLITEv6              0      0   1216   26    8 : tunables    0    0    0 : slabdata      0      0      0
UDPv6                156    156   1216   26    8 : tunables    0    0    0 : slabdata      6      6      0
tw_sock_TCPv6          0      0    272   30    2 : tunables    0    0    0 : slabdata      0      0      0
request_sock_TCPv6     0      0    328   24    2 : tunables    0    0    0 : slabdata      0      0      0
TCPv6                 56     56   2304   14    8 : tunables    0    0    0 : slabdata      4      4      0
cfq_io_cq            216    216    112   36    1 : tunables    0    0    0 : slabdata      6      6      0
bsg_cmd                0      0    312   26    2 : tunables    0    0    0 : slabdata      0      0      0
xfs_icr                0      0    144   28    1 : tunables    0    0    0 : slabdata      0      0      0
xfs_ili                0      0    152   26    1 : tunables    0    0    0 : slabdata      0      0      0
xfs_inode              0      0   1216   26    8 : tunables    0    0    0 : slabdata      0      0      0
xfs_efd_item           0      0    400   20    2 : tunables    0    0    0 : slabdata      0      0      0
xfs_log_item_desc    128    128     32  128    1 : tunables    0    0    0 : slabdata      1      1      0
xfs_da_state           0      0    480   17    2 : tunables    0    0    0 : slabdata      0      0      0
xfs_btree_cur          0      0    208   19    1 : tunables    0    0    0 : slabdata      0      0      0
xfs_log_ticket         0      0    184   22    1 : tunables    0    0    0 : slabdata      0      0      0

The following information is printed for each slab cache.

name
- Slab name
- s->name
<active_objs>
- Number of objects in use (free objects held per-cpu are also counted as in use)
- Sum of n->total_objects over all nodes, minus the sum of (page->objects - page->inuse) over the slab pages on the per-node partial lists
<num_objs>
- Number of all objects in the allocated slab pages (free + inuse)
- Sum of n->total_objects over all nodes
<objsize>
- Slab object size (including metadata)
- s->size
<objperslab>
- Number of objects that fit in each slab page
- The object count recorded in s->oo
<pagesperslab>
- Number of pages in each slab page
- 2 ^ (the order recorded in s->oo)
<limit>
- Maximum number of cached objects; unused in slub (always printed as 0)
<batchcount>
- Number of objects that can be refilled at once; unused in slub (always printed as 0)
<sharedfactor>
- Unused in slub (always printed as 0)
<active_slabs>
- Number of slabs in use; identical to the total number of slabs
- Sum of n->nr_slabs over all nodes
<num_slabs>
- Total number of slabs; identical to the number of slabs in use
- Sum of n->nr_slabs over all nodes
<sharedavail>
- Unused in slub (always printed as 0)
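To make the column layout concrete, here is a minimal user-space sketch that parses one data row of the version 2.1 format shown above. `struct slabinfo_row` and `parse_slabinfo_row()` are hypothetical names for illustration. The three tunables and the trailing sharedavail field are skipped because slub always reports them as 0.

```c
#include <stdio.h>

struct slabinfo_row {
        char name[64];
        unsigned long active_objs;
        unsigned long num_objs;
        unsigned int objsize;
        unsigned int objperslab;
        unsigned int pagesperslab;
        unsigned long active_slabs;
        unsigned long num_slabs;
};

/* Parse one data row of /proc/slabinfo (version 2.1). Returns 1 on success. */
static int parse_slabinfo_row(const char *line, struct slabinfo_row *r)
{
        int n = sscanf(line,
                       "%63s %lu %lu %u %u %u : tunables %*u %*u %*u"
                       " : slabdata %lu %lu",
                       r->name, &r->active_objs, &r->num_objs,
                       &r->objsize, &r->objperslab, &r->pagesperslab,
                       &r->active_slabs, &r->num_slabs);

        return n == 8;
}
```

In the sample rows above, <num_objs> equals <objperslab> * <num_slabs>. For nf_conntrack_1 that is 23 * 45 = 1035.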

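/sys/kernel/slab directory

Slab caches are managed under the /sys/kernel/slab directory and fall into the following categories.

Unmergeable slab caches: a directory named after the slab cache is created.
- e.g. the TCP directory
Mergeable slab caches
- A slab cache created automatically under a unique name, which alias caches point to; its directory name starts with the ":" character.
- e.g. the :A-0001088 directory
Merged alias slab caches: a link file pointing to the original cache is created.
- e.g. lrwxrwxrwx 1 root root 0 Nov 19 15:42 UDP -> :A-0001088

The naming rule for mergeable slab caches is as follows.

format
- ":" character + [[d][D][a][F][A]-] + unique 7-digit number

The meaning of each option character is as follows.

d
- Slab cache that uses DMA
- SLAB_CACHE_DMA flag set
D
- Slab cache that uses DMA32
- SLAB_CACHE_DMA32 flag set (added in kernel v5.1-rc3)
a
- Reclaimable slab cache
- SLAB_RECLAIM_ACCOUNT flag set
F
- Slab cache with consistency checks enabled
- SLAB_CONSISTENCY_CHECKS flag set
A
- Slab cache under memcg accounting
- SLAB_ACCOUNT flag set
t
- SLAB_NOTRACK flag set (removed in kernel v4.15-rc1)
- See: kmemcheck: remove whats left of NOTRACK flags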
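The naming rule can be sketched as a small formatter. This is an illustration only: the `SK_*` flag constants are hypothetical placeholders for the real SLAB_* flags, and the 7-digit number is assumed to be the slab size zero-padded to seven digits, which is consistent with the directory listings in this post.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical flag bits standing in for the kernel's SLAB_* flags. */
enum {
        SK_CACHE_DMA          = 1 << 0,
        SK_CACHE_DMA32        = 1 << 1,
        SK_RECLAIM_ACCOUNT    = 1 << 2,
        SK_CONSISTENCY_CHECKS = 1 << 3,
        SK_ACCOUNT            = 1 << 4,
};

/* Build ":" + option characters + "-" (only if any option is set) + 7 digits. */
static void make_unique_id_sketch(char *buf, size_t len,
                                  unsigned int flags, unsigned int size)
{
        char opts[8];
        size_t n = 0;

        if (flags & SK_CACHE_DMA)
                opts[n++] = 'd';
        if (flags & SK_CACHE_DMA32)
                opts[n++] = 'D';
        if (flags & SK_RECLAIM_ACCOUNT)
                opts[n++] = 'a';
        if (flags & SK_CONSISTENCY_CHECKS)
                opts[n++] = 'F';
        if (flags & SK_ACCOUNT)
                opts[n++] = 'A';
        if (n)
                opts[n++] = '-';
        opts[n] = '\0';

        snprintf(buf, len, ":%s%07u", opts, size);
}
```

With SK_ACCOUNT and size 1088 this yields ":A-0001088", and with no flags and size 192 it yields ":0000192", matching the link targets shown for UDP and aio_kiocb.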

As shown below, merged slab caches point to the directory of the mergeable original cache.

$ ls /sys/kernel/slab -la
...
lrwxrwxrwx   1 root root 0 Nov 19 15:32 PING -> :A-0000960
drwxr-xr-x   2 root root 0 Nov 19 15:32 RAW
drwxr-xr-x   2 root root 0 Nov 19 15:32 TCP
lrwxrwxrwx   1 root root 0 Nov 19 15:32 UDP -> :A-0001088
lrwxrwxrwx   1 root root 0 Nov 19 15:32 UDP-Lite -> :A-0001088
lrwxrwxrwx   1 root root 0 Nov 19 15:32 UNIX -> :A-0001024
lrwxrwxrwx   1 root root 0 Nov 19 15:32 aio_kiocb -> :0000192
...

The following shows all slab caches.

$ ls /sys/kernel/slab
:0000024    :a-0000256                files_cache              nfs_inode_cache
:0000032    :a-0000360                filp                     nfs_page
:0000040    PING                      fs_cache                 nfs_read_data
:0000048    RAW                       fsnotify_mark            nfs_write_data
:0000056    TCP                       fsnotify_mark_connector  nsproxy
:0000064    UDP                       hugetlbfs_inode_cache    numa_policy
:0000080    UDP-Lite                  iint_cache               p9_req_t
:0000088    UNIX                      inet_peer_cache          pde_opener
:0000104    aio_kiocb                 inode_cache              pid
:0000128    anon_vma                  inotify_inode_mark       pid_namespace
:0000192    anon_vma_chain            iommu_iova               pool_workqueue
:0000208    asd_sas_event             ip4-frags                posix_timers_cache
:0000216    audit_buffer              ip_dst_cache             proc_dir_entry
:0000240    audit_tree_mark           ip_fib_alias             proc_inode_cache
:0000256    bdev_cache                ip_fib_trie              radix_tree_node
:0000320    bio-0                     isp1760_qh               request_queue
:0000344    bio-1                     isp1760_qtd              request_sock_TCP
:0000384    bio_integrity_payload     isp1760_urb_listitem     rpc_buffers
:0000448    biovec-128                jbd2_inode               rpc_inode_cache
:0000464    biovec-16                 jbd2_journal_handle      rpc_tasks
:0000512    biovec-64                 jbd2_journal_head        sas_task
:0000640    biovec-max                jbd2_revoke_record_s     scsi_data_buffer
:0000704    blkdev_ioc                jbd2_revoke_table_s      sd_ext_cdb
:0000768    buffer_head               jbd2_transaction_s       seq_file
:0000896    configfs_dir_cache        kernfs_node_cache        sgpool-128
:0001024    cred_jar                  key_jar                  sgpool-16
:0001088    debug_objects_cache       khugepaged_mm_slot       sgpool-32
:0001984    dentry                    kioctx                   sgpool-64
:0002048    dio                       kmalloc-128              sgpool-8
:0002112    dmaengine-unmap-128       kmalloc-1k               shared_policy_node
:0004096    dmaengine-unmap-16        kmalloc-256              shmem_inode_cache
:A-0000032  dmaengine-unmap-2         kmalloc-2k               sighand_cache
:A-0000040  dmaengine-unmap-256       kmalloc-4k               signal_cache
:A-0000064  dnotify_mark              kmalloc-512              sigqueue
:A-0000072  dnotify_struct            kmalloc-8k               skbuff_ext_cache
:A-0000080  dquot                     kmalloc-rcl-128          skbuff_fclone_cache
:A-0000128  eventpoll_epi             kmalloc-rcl-1k           skbuff_head_cache
:A-0000192  eventpoll_pwq             kmalloc-rcl-256          sock_inode_cache
:A-0000256  ext2_inode_cache          kmalloc-rcl-2k           squashfs_inode_cache
:A-0000704  ext4_allocation_context   kmalloc-rcl-4k           task_delay_info
:A-0000960  ext4_extent_status        kmalloc-rcl-512          task_group
:A-0001024  ext4_free_data            kmalloc-rcl-8k           task_struct
:A-0001088  ext4_groupinfo_4k         kmem_cache               taskstats
:A-0005120  ext4_inode_cache          kmem_cache_node          tcp_bind_bucket
:a-0000016  ext4_io_end               ksm_mm_slot              tw_sock_TCP
:a-0000024  ext4_pending_reservation  ksm_rmap_item            uid_cache
:a-0000032  ext4_prealloc_space       ksm_stable_node          user_namespace
:a-0000040  ext4_system_zone          mbcache                  uts_namespace
:a-0000048  fanotify_event_info       mm_struct                v9fs_inode_cache
:a-0000056  fanotify_perm_event_info  mnt_cache                vm_area_struct
:a-0000064  fasync_cache              mqueue_inode_cache       xfrm_dst_cache
:a-0000072  fat_cache                 names_cache              xfrm_state
:a-0000104  fat_inode_cache           net_namespace
:a-0000128  file_lock_cache           nfs_commit_data
:a-0000144  file_lock_ctx             nfs_direct_cache

Slab cache attributes

$ ls /sys/kernel/slab/TCP
aliases      destroy_by_rcu  objects_partial  red_zone                  slabs_cpu_partial
align        free_calls      objs_per_slab    remote_node_defrag_ratio  store_user
alloc_calls  hwcache_align   order            sanity_checks             total_objects
cpu_partial  min_partial     partial          shrink                    trace
cpu_slabs    object_size     poison           slab_size                 usersize
ctor         objects         reclaim_account  slabs                     validate

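aliases
- Number of merged alias caches
align
- Alignment used for slab objects
- s->align
alloc_calls
- Prints the cache's allocation history, collected through alloc user (owner) tracking.
- e.g. 1 0xffff000008b8a068 age=2609 pid=3481
cache_dma
- Shows whether the slab cache uses the DMA zone. (1=slab cache using the DMA zone, 0=slab cache using the normal zone)
- Whether the SLAB_CACHE_DMA flag is set
cpu_partial
- Maximum number of slab objects managed per-cpu
- The default is one of 2, 6, 13, or 30, chosen according to the size (s->size).
- When debugging is enabled, per-cpu management is not used and the value is 0.
cpu_slabs
- Shows the total number of slab pages managed per-cpu. (c->page + number of c->partial pages)
- Additionally prints N[nid]=<number of per-cpu slab pages> for each node.
- e.g. 21 N0=21
ctor
- Shows the constructor function name for slab caches that have a constructor.
- e.g. init_once+0x0/0x78
destroy_by_rcu
- Shows whether slab objects are freed using RCU.
- Whether the RCU method, which frees quickly through lock-less access, is used (1=used, 0=not used)
free_calls
- Prints the cache's free history, collected through free user (owner) tracking.
- e.g. 2 <not-available> age=4295439025 pid=0
hwcache_align
- Shows whether objects are aligned to the L1 hardware cache line. (1=used, 0=not used)
- Whether the SLAB_HWCACHE_ALIGN flag is set
min_partial
- Shows the minimum number of slab pages kept on the per-node partial list.
- The default (s->min_partial) is a value in the range 5~10 that grows with the size, and it applies per node.
- /proc/sys/kernel/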
object_size
- Shows the slab object size excluding metadata.
- s->object_size
objects
- Shows the total number of slab objects in use. (Note: free objects managed per-cpu are also counted as in use.)
- Additionally prints N[nid]=<number of slab objects> for each node.
- e.g. 1288 N0=1288
objects_partial
- Shows the number of slab objects in use on the per-node partial lists.
- Additionally prints N[nid]=<number of slab objects> for each node.
- e.g. 2 N0=2
objs_per_slab
- Shows the number of objects in a slab page.
- This is the number of objects that fit in the order-sized page recorded in s->oo.
order
- The order used when allocating a slab page. (s->oo)
- This value was computed from the size when the slab cache was created.
- When memory is short, a slab page may be allocated with the minimum order (s->min) instead of the order above.
partial
- Shows the total number of slab pages managed on the node partial lists.
  - Sum of n->nr_partial
- Additionally prints N[nid]=<number of partial slab pages> for each node.
- e.g. 1 N0=1
poison
- Shows whether poison debugging is enabled. (1=used, 0=not used)
- Whether the SLAB_POISON flag is set
- "slub_debug=FP,<slab cache name>"
reclaim_account
- Shows whether the slab cache is reclaimable. (1=reclaimable cache, 0=ordinary unreclaimable cache)
- Slab caches that support a shrinker are created with the SLAB_RECLAIM_ACCOUNT flag.
red_zone
- Shows whether red-zone debugging is enabled. (1=used, 0=not used)
- Whether the SLAB_RED_ZONE flag is set
- "slub_debug=FZ,<slab cache name>"
remote_node_defrag_ratio
- When the local node runs short of slab pages, memory from remote nodes is allowed at the percentage given by this value.
- The default is 100 and values from 0 to 100 are accepted; with 0, slab pages on remote nodes are not used.
sanity_checks
- Shows whether sanity-check debugging is enabled. (1=used, 0=not used)
- Whether the SLAB_CONSISTENCY_CHECKS flag is set
- "slub_debug=F,<slab cache name>"
shrink
- Reclaims memory from a reclaimable slab cache.
- e.g. echo 1 > /sys/kernel/slab/ext4_inode_cache/shrink
slab_size
- Shows the slab object size including metadata.
- s->size
slabs
- Shows the total number of slab pages.
- Additionally prints N[nid]=<number of slab pages> for each node.
- e.g. 28 N0=28
slabs_cpu_partial
- Shows the number of free slab objects and the number of slab pages managed on the per-cpu partial lists. (c->page excluded)
- Additionally prints C[cpu]=<number of cpu partial free slab objects>(<number of cpu partial slab pages>) for each cpu.
- e.g. 28(28) C0=6(6) C1=3(3) C2=18(18) C3=1(1)
store_user
- Shows whether user tracking debugging is enabled. (1=used, 0=not used)
- Whether the SLAB_STORE_USER flag is set
- "slub_debug=FU,<slab cache name>"
total_objects
- Shows the total number of slab objects.
- Additionally prints N[nid]=<number of slab objects> for each node.
- e.g. 1288 N0=1288
trace
- Shows whether trace debugging is enabled. (1=used, 0=not used)
- Whether the SLAB_TRACE flag is set
- "slub_debug=T,<slab cache name>"
usersize
- Shows the user size available for copy to/from user.
- s->usersize
validate
- Runs a validity check on the slab cache. (forced debug check)
- e.g. echo 1 > /sys/kernel/slab/anon_vma/validate
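The cpu_partial and min_partial defaults described above depend only on the slab size, so they can be sketched as two small functions. The thresholds (PAGE_SIZE, 1024, 256) and the ilog2(size) / 2 rule are assumptions based on my reading of mm/slub.c around v5.x and are not stated in this post. All `_sketch` names are hypothetical.

```c
#define SK_PAGE_SIZE   4096u    /* assumption: 4 KB pages */
#define SK_MIN_PARTIAL 5u       /* lower bound from the 5~10 range */
#define SK_MAX_PARTIAL 10u      /* upper bound from the 5~10 range */

/* Integer log2: position of the highest set bit (0 for size 1). */
static unsigned int ilog2_sketch(unsigned int v)
{
        unsigned int r = 0;

        while (v >>= 1)
                r++;
        return r;
}

/* Default cpu_partial: larger objects get a smaller per-cpu budget. */
static unsigned int default_cpu_partial_sketch(unsigned int size, int debug)
{
        if (debug)
                return 0;       /* debug caches skip per-cpu partial lists */
        if (size >= SK_PAGE_SIZE)
                return 2;
        if (size >= 1024)
                return 6;
        if (size >= 256)
                return 13;
        return 30;
}

/* Default min_partial: grows with size, clamped to the 5~10 range. */
static unsigned int default_min_partial_sketch(unsigned int size)
{
        unsigned int min = ilog2_sketch(size) / 2;

        if (min < SK_MIN_PARTIAL)
                min = SK_MIN_PARTIAL;
        else if (min > SK_MAX_PARTIAL)
                min = SK_MAX_PARTIAL;
        return min;
}
```

Under these assumptions a 192-byte cache gets cpu_partial 30 and min_partial 5, while a 2048-byte cache gets cpu_partial 6.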

slabinfo utility

Building the debugging tool

$ gcc -o slabinfo tools/vm/slabinfo.c

Usage

$ sudo ./slabinfo -h
slabinfo 4/15/2011. (c) 2007 sgi/(c) 2011 Linux Foundation.

slabinfo [-ahnpvtsz] [-d debugopts] [slab-regexp]
-a|--aliases           Show aliases
-A|--activity          Most active slabs first
-d<options>|--debug=<options> Set/Clear Debug options
-D|--display-active    Switch line format to activity
-e|--empty             Show empty slabs
-f|--first-alias       Show first alias
-h|--help              Show usage information
-i|--inverted          Inverted list
-l|--slabs             Show slabs
-n|--numa              Show NUMA information
-o|--ops		Show kmem_cache_ops
-s|--shrink            Shrink slabs
-r|--report		Detailed report on single slabs
-S|--Size              Sort by size
-t|--tracking          Show alloc/free information
-T|--Totals            Show summary information
-v|--validate          Validate slabs
-z|--zero              Include empty slabs
-1|--1ref              Single reference

Valid debug options (FZPUT may be combined)
a / A          Switch on all debug options (=FZUP)
-              Switch off all debug options
f / F          Sanity Checks (SLAB_DEBUG_FREE)
z / Z          Redzoning
p / P          Poisoning
u / U          Tracking
t / T          Tracing

Slab cache list

$ sudo ./slabinfo
Name                   Objects Objsize    Space Slabs/Part/Cpu  O/S O %Fr %Ef Flg
:at-0000016                256      16     4.0K          0/0/1  256 0   0 100 *a
:at-0000032               3968      32   126.9K         22/0/9  128 0   0 100 *a
:at-0000040                408      40    16.3K          0/0/4  102 0   0  99 *a
:at-0000064              32128      64     2.0M       454/0/48   64 0   0 100 *a
:at-0000104                156     104    16.3K          0/0/4   39 0   0  99 *a
:t-0000024                 680      24    16.3K          0/0/4  170 0   0  99 *
:t-0000032                9472      32   303.1K        12/0/62  128 0   0 100 *
:t-0000040                 612      40    24.5K          0/0/6  102 0   0  99 *
:t-0000064               12483      64   802.8K       128/1/68   64 0   0  99 *
:t-0000088                2714      88   241.6K        15/0/44   46 0   0  98 *
:t-0000096                 168      96    16.3K          0/0/4   42 0   0  98 *
:t-0000104                6552     104   688.1K       158/0/10   39 0   0  99 *
:t-0000128                2240     128   286.7K        15/0/55   32 0   0 100 *
:t-0000192                4305     192   839.6K       152/0/53   21 0   0  98 *
:t-0000256                 192     256    49.1K          3/0/9   16 0   0 100 *
:t-0000320                 954     320   335.8K        12/5/29   25 1  12  90 *A
:t-0000384                  84     384    32.7K          0/0/4   21 1   0  98 *A
:t-0000512                 720     512   368.6K        26/0/19   16 1   0 100 *
:t-0000640                  50     640    32.7K          1/0/1   25 2   0  97 *A
:t-0000960                 187     936   180.2K         1/0/10   17 2   0  97 *A
:t-0001024                 176    1024   180.2K          4/0/7   16 2   0 100 *
:t-0002048                 176    2048   360.4K          2/0/9   16 3   0 100 *
:t-0004032                 153    4032   655.3K         5/2/15    8 3  10  94 *
:t-0004096                  64    4096   262.1K          0/0/8    8 3   0 100 *
anon_vma                  2124     104   241.6K        10/0/49   36 0   0  91
bdev_cache                  72     848    65.5K          0/0/4   18 2   0  93 Aa
biovec-128                  84    1536   131.0K          0/0/4   21 3   0  98 A
biovec-256                  10    3072    32.7K          0/0/1   10 3   0  93 A
biovec-64                   84     768    65.5K          0/0/4   21 2   0  98 A
blkdev_queue                34    1824    65.5K          0/0/2   17 3   0  94
blkdev_requests            204     232    49.1K         0/0/12   17 0   0  96
dentry                   20500     200     4.1M      1012/0/13   20 0   0  97 a
ext4_groupinfo_4k          253     172    45.0K         10/0/1   23 0   0  96 a
ext4_inode_cache         10686    1232    13.4M       400/0/11   26 3   0  97 a
fat_cache                  170      20     4.0K          0/0/1  170 0   0  83 a
fat_inode_cache             60     776    49.1K          1/0/2   20 2   0  94 a
file_lock_cache            100     160    16.3K          0/0/4   25 0   0  97
fscache_cookie_jar          32     124     4.0K          0/0/1   32 0   0  96
ftrace_event_file          595      48    28.6K          6/0/1   85 0   0  99
idr_layer_cache            270    1068   294.9K          5/0/4   30 3   0  97
inode_cache               5589     584     3.3M       191/0/16   27 2   0  96 a
jbd2_journal_handle        292      56    16.3K          0/0/4   73 0   0  99 a
jbd2_transaction_s         189     176    36.8K          0/0/9   21 0   0  90 Aa
kmalloc-8192                24    8192   196.6K          1/0/5    4 3   0 100
kmem_cache                 128     116    16.3K          1/0/3   32 0   0  90 A
kmem_cache_node            128      68    16.3K          1/0/3   32 0   0  53 A
mm_struct                  112     536    65.5K          0/0/4   28 2   0  91 A
mqueue_inode_cache          18     840    16.3K          0/0/1   18 2   0  92 A
nfs_commit_data             18     448     8.1K          0/0/1   18 1   0  98 A
posix_timers_cache          18     216     4.0K          0/0/1   18 0   0  94
proc_inode_cache           546     616   344.0K         4/0/17   26 2   0  97 a
radix_tree_node           2106     304   663.5K        71/0/10   26 1   0  96 a
shmem_inode_cache          644     696   458.7K         20/0/8   23 2   0  97
sighand_cache              184    1372   262.1K          0/0/8   23 3   0  96 A
sigqueue                   112     144    16.3K          0/0/4   28 0   0  98
sock_inode_cache           100     616    65.5K          0/0/4   25 2   0  93 Aa
taskstats                   24     328     8.1K          0/0/1   24 1   0  96
TCP                         68    1816   131.0K          0/0/4   17 3   0  94 A
UDP                         64     960    65.5K          0/0/4   16 2   0  93 A

Name
- Slab name
Objects
- Number of in-use objects
Objsize
- Object size excluding metadata (s->object_size)
Space
- Bytes used by all slab pages (the unit suffixes are multiples of 1000, not 1024)
Slabs/Part/Cpu
- Slabs
  - Total number of slab pages - Cpu
    - = number of full slab pages + Part
- Part
  - Number of slab pages managed on the per-node partial lists
- Cpu
  - Number of slab pages managed per-cpu (c->page + number of c->partial pages)
O/S
- Number of objects per slab page
- s->objs_per_slab
O
- Order value
%Fr
- Percentage of slab pages that are on the node partial lists
%Ef
- Percentage of objects in use
Flg
- The flag values are as follows.
  - * - alias cache
  - d - dma
  - A - L1 hardware cache alignment (hwcache_align)
  - p - poison
  - a - reclaimable slab cache
  - Z - red-zone
  - F - sanity check
  - U - user (owner) tracking
  - T - trace
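The derived columns can be checked against a sample row. The sketch below uses formulas inferred from the listing above rather than taken from the tool's source, and it assumes 4 KB pages. %Ef is taken as the in-use payload bytes divided by Space. `struct slab_counts` and the helper names are hypothetical.

```c
#define SK_PAGE_BYTES 4096ul    /* assumption: 4 KB pages */

struct slab_counts {
        unsigned long objects;      /* in-use objects (Objects column) */
        unsigned long object_size;  /* payload size without metadata (Objsize) */
        unsigned long slabs;        /* all slab pages, per-cpu ones included */
        unsigned long partial;      /* pages on the per-node partial lists */
        unsigned long cpu_slabs;    /* pages held per-cpu */
        unsigned int order;         /* O column */
};

/* Space: bytes occupied by all slab pages of the cache. */
static unsigned long space_bytes(const struct slab_counts *s)
{
        return s->slabs * (SK_PAGE_BYTES << s->order);
}

/* Slabs column: total slab pages minus the per-cpu ones. */
static unsigned long slabs_column(const struct slab_counts *s)
{
        return s->slabs - s->cpu_slabs;
}

/* %Fr: share of slab pages that sit on the node partial lists. */
static unsigned long pct_fr(const struct slab_counts *s)
{
        return s->slabs ? s->partial * 100 / s->slabs : 100;
}

/* %Ef: in-use payload bytes as a share of the occupied space. */
static unsigned long pct_ef(const struct slab_counts *s)
{
        unsigned long space = space_bytes(s);

        return space ? s->objects * s->object_size * 100 / space : 100;
}
```

For the :t-0000320 row (954 objects of 320 bytes, 12/5/29, order 1) the total is 12 + 29 = 41 slab pages, which gives Space 335872 bytes (335.8K), %Fr 12 and %Ef 90, matching the listing.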

Sorted by loss (-L)

$ sudo ./slabinfo -L
Name                   Objects Objsize            Loss Slabs/Part/Cpu  O/S O %Fr %Ef Flg
task_struct                 92    3456          206.3K         11/9/5    9 3  56  60
kmalloc-512                624     512          147.4K        52/31/5   16 1  54  68
:A-0000192                2317     192          140.8K      112/58/31   21 0  40  75 *A
kernfs_node_cache        13230     128          112.8K        432/0/9   30 0   0  93
proc_inode_cache          1114     648           97.3K        96/18/4   12 1  18  88 a
dentry                    7417     192           70.9K      341/56/24   21 0  15  95 a

Activity display (-D)

$ sudo ./slabinfo -D
Name                   Objects      Alloc       Free   %Fast Fallb O CmpX   UL
:0000024                   170          0          0   0   0     0 0    0    0
:0000040                   102          0          0   0   0     0 0    0    0
:0000048                    85          0          0   0   0     0 0    0    0
:0000056                    73          0          0   0   0     0 0    0    0
:0000064                    64          0          0   0   0     0 0    0    0
:0000080                    51          0          0   0   0     0 0    0    0
:0000128                    32          0          0   0   0     0 0    0    0
:0000192                    21          0          0   0   0     0 0    0    0
:0000256                   192          0          0   0   0     0 0    0    0
:0000448                    36          0          0   0   0     0 1    0    0
:0000896                    36          0          0   0   0     0 2    0    0
:0001024                    16          0          0   0   0     0 2    0    0
:0002048                    16          0          0   0   0     0 3    0    0
:0004096                    40          0          0   0   0     0 3    0    0
:a-0000032                 128          0          0   0   0     0 0    0    0
:a-0000048                  85          0          0   0   0     0 0    0    0
:a-0000056                  73          0          0   0   0     0 0    0    0
:A-0000064                1340          0          0   0   0     0 0    0    0
:A-0000072                 168          0          0   0   0     0 0    0    0
:A-0000080                 255          0          0   0   0     0 0    0    0
:a-0000104                1482          0          0   0   0     0 0    0    0
:A-0000128                 320          0          0   0   0     0 0    0    0
:A-0000192                2317          0          0   0   0     0 0    0    0
:a-0000256                  16          0          0   0   0     0 0    0    0
:A-0001088                 130          0          0   0   0     0 2    0    0
anon_vma                   654          0          0   0   0     0 0    0    0
bdev_cache                  19          0          0   0   0     0 2    0    0
blkdev_ioc                  39          0          0   0   0     0 0    0    0
configfs_dir_cache          46          0          0   0   0     0 0    0    0
...

alias slab cache list (-a)

$ sudo ./slabinfo -a

:at-0000016  <- discard_entry f2fs_inode_entry jbd2_revoke_table_s inmem_page_entry free_nid f2fs_ino_entry sit_entry_set
:at-0000024  <- nat_entry nat_entry_set
:at-0000032  <- ext4_extent_status jbd2_revoke_record_s
:at-0000040  <- ext4_free_data ext4_io_end
:at-0000064  <- mmcblk0p6 jbd2_journal_head buffer_head
:at-0000104  <- ext4_allocation_context ext4_prealloc_space
:t-0000024   <- ip_fib_alias dnotify_struct jbd2_inode nsproxy scsi_data_buffer
:t-0000032   <- ftrace_event_field fanotify_event_info dmaengine-unmap-2 secpath_cache anon_vma_chain sd_ext_cdb ip_fib_trie tcp_bind_bucket ext4_system_zone
:t-0000040   <- eventpoll_pwq page->ptl
:t-0000064   <- nfs_page pid kmalloc-64 kiocb uid_cache fasync_cache file_lock_ctx cfq_io_cq
:t-0000088   <- flow_cache vm_area_struct
:t-0000096   <- dnotify_mark fsnotify_mark inotify_inode_mark
:t-0000104   <- task_delay_info kernfs_node_cache
:t-0000128   <- ip_mrt_cache blkdev_ioc sgpool-8 pid_namespace fs_cache inet_peer_cache kmalloc-128 ip_dst_cache eventpoll_epi cred_jar
:t-0000192   <- bio-0 key_jar skbuff_head_cache ip4-frags request_sock_TCP rpc_tasks kmalloc-192 biovec-16
:t-0000256   <- mnt_cache sgpool-16 pool_workqueue kmalloc-256 files_cache
:t-0000320   <- xfrm_dst_cache filp
:t-0000384   <- skbuff_fclone_cache dio
:t-0000512   <- sgpool-32 kmalloc-512
:t-0000640   <- nfs_write_data kioctx nfs_read_data
:t-0000960   <- RAW PING signal_cache
:t-0001024   <- UNIX kmalloc-1024 sgpool-64
:t-0002048   <- kmalloc-2048 sgpool-128 rpc_buffers
:t-0004032   <- task_struct net_namespace
:t-0004096   <- names_cache kmalloc-4096

Slab cache summary (-T)

$ sudo ./slabinfo -T
Slabcache Totals
----------------
Slabcaches :  71      Aliases  : 118->49  Active:  59
Memory used:  32.5M   # Loss   : 741.4K   MRatio:     2%
# Objects  : 125.2K   # PartObj:      9   ORatio:     0%

Per Cache    Average         Min         Max       Total
---------------------------------------------------------
#Objects        2.1K          10       32.1K      125.2K
#Slabs            58           1        1.0K        3.4K
#PartSlab          0           0           2           2
%PartSlab         0%          0%         10%          0%
PartObjs           0           0           9           9
% PartObj         0%          0%          5%          0%
Memory        550.9K        4.0K       13.4M       32.5M
Used          538.3K        3.4K       13.1M       31.7M
Loss           12.5K           0      302.4K      741.4K

Per Object   Average         Min         Max
---------------------------------------------
Memory           255          16        8.1K
User             253          16        8.1K
Loss               1           0          64

Slab cache detail report (-r)

$ slabinfo -r jake

Slabcache: jake             Aliases:  0 Order :  1 Objects: 2

Sizes (bytes)     Slabs              Debug                Memory
------------------------------------------------------------------------
Object :      30  Total  :       1   Sanity Checks : On   Total:    8192
SlabObj:     392  Full   :       0   Redzoning     : On   Used :      60
SlabSiz:    8192  Partial:       1   Poisoning     : On   Loss :    8132
Loss   :     362  CpuSlab:       0   Tracking      : On   Lalig:     724
Align  :      56  Objects:      20   Tracing       : On   Lpadd:     352

jake has no kmem_cache operations

jake: Kernel object allocation
-----------------------------------------------------------------------
      1 0xffff000008b8a068 age=1174850 pid=3481
      1 0xffff000008b8a078 age=1174837 pid=3481

jake: Kernel object freeing
------------------------------------------------------------------------
      2 <not-available> age=4296123146 pid=0

jake: No NUMA information available.

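Object
- Object size excluding metadata
SlabObj
- Object size including metadata
SlabSiz
- Size of one slab page
Loss
- Unusable space in one slab page (remainder)
Align
- Alignment unit
Total (Slabs column)
- Total number of slab pages
Full
- Number of slab pages whose objects are all in use
Partial
- Number of slab pages on the per-node partial lists
CpuSlab
- Number of per-cpu slab pages
Objects
- Number of objects per slab page
Sanity Checks
- Whether sanity-check debugging is enabled
Redzoning
- Whether red-zone debugging is enabled
Poisoning
- Whether poison debugging is enabled
Tracking
- Whether user tracking debugging is enabled
Tracing
- Whether trace debugging is enabled
Total (Memory column)
- Memory used by all slab pages
Used
- Memory occupied by in-use slab objects, excluding metadata
Loss (Memory column)
- Total - Used
Lalig
- (object size including metadata - object size excluding metadata) * number of objects in use
- (SlabObj - Object) * number of objects in use
Lpadd
- Total memory of the leftover (remainder) areas of the slab pages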
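The Memory column of the report can be reproduced from the other fields. The sketch below follows the definitions listed above. `struct slab_report` and the helper names are hypothetical, and the per-page remainder used for Lpadd is assumed to be SlabSiz minus (objects per slab * SlabObj).

```c
struct slab_report {
        unsigned long object;         /* Object: size without metadata */
        unsigned long slabobj;        /* SlabObj: size with metadata */
        unsigned long slabsiz;        /* SlabSiz: bytes in one slab page */
        unsigned long objs_per_slab;  /* Objects in the Slabs column */
        unsigned long slabs;          /* Total in the Slabs column */
        unsigned long inuse;          /* Objects in the header line */
};

/* Total: memory taken by all slab pages. */
static unsigned long mem_total(const struct slab_report *r)
{
        return r->slabs * r->slabsiz;
}

/* Used: payload bytes of the in-use objects. */
static unsigned long mem_used(const struct slab_report *r)
{
        return r->inuse * r->object;
}

/* Loss: Total - Used. */
static unsigned long mem_loss(const struct slab_report *r)
{
        return mem_total(r) - mem_used(r);
}

/* Lalig: per-object metadata and alignment overhead of in-use objects. */
static unsigned long mem_lalig(const struct slab_report *r)
{
        return (r->slabobj - r->object) * r->inuse;
}

/* Lpadd: leftover tail of each slab page, summed over all slab pages. */
static unsigned long mem_lpadd(const struct slab_report *r)
{
        return r->slabs * (r->slabsiz - r->objs_per_slab * r->slabobj);
}
```

Plugging in the jake report (Object 30, SlabObj 392, SlabSiz 8192, 20 objects per slab, 1 slab, 2 objects in use) gives Total 8192, Used 60, Loss 8132, Lalig 724 and Lpadd 352, the same values the tool printed.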

slabtop utility

$ slabtop
 Active / Total Objects (% used)    : 514911 / 563284 (91.4%)
 Active / Total Slabs (% used)      : 30238 / 30238 (100.0%)
 Active / Total Caches (% used)     : 89 / 121 (73.6%)
 Active / Total Size (% used)       : 198611.59K / 205849.01K (96.5%)
 Minimum / Average / Maximum Object : 0.02K / 0.37K / 12.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
114156 110759   0%    0.19K   5436       21     21744K dentry
 86895  84904   0%    1.04K  13611       30    435552K ext4_inode_cache
 56589  45405   0%    0.10K   1451       39      5804K buffer_head
 46512  19299   0%    0.04K    456      102      1824K ext4_extent_status
 44832  43854   0%    0.12K   1401       32      5604K kmalloc-128
 41300  39940   0%    0.57K   1475       28     23600K radix_tree_node
 25664  25664 100%    0.06K    401       64      1604K anon_vma_chain
 23070  22810   0%    0.13K    769       30      3076K kernfs_node_cache
 18774  18489   0%    0.57K    783       28     12528K inode_cache
 14344  14234   0%    0.18K    652       22      2608K vm_area_struct
 13110  13110 100%    0.09K    285       46      1140K anon_vma
  8816   8696   0%    0.25K    551       16      2204K filp
  4752   4057   0%    0.64K    198       24      3168K proc_inode_cache
  4386   4386 100%    0.04K     43      102       172K pde_opener
  3825   3825 100%    0.05K     45       85       180K ftrace_event_field
  3616   3616 100%    0.12K    113       32       452K pid
  3549   3549 100%    0.19K    169       21       676K cred_jar
  3125   3125 100%    0.62K    125       25      2000K squashfs_inode_cache
  2896   2877   0%    0.50K    181       16      1448K kmalloc-512
  2496   2496 100%    0.06K     39       64       156K dmaengine-unmap-2
  2408   2408 100%    0.14K     86       28       344K ext4_groupinfo_4k
  2304   2292   0%    1.00K    144       16      2304K kmalloc-1024
  2192   2140   0%    0.25K    137       16       548K kmalloc-256
  2075   2075 100%    0.62K     83       25      1328K sock_inode_cache
...

References

Slab Memory Allocator -1- (Structure) | 문c
Slab Memory Allocator -2- (Cache Initialization) | 문c
Slub Memory Allocator -3- (Cache Creation) | 문c
Slub Memory Allocator -4- (Order Calculation) | 문c
Slub Memory Allocator -5- (Slub Allocation) | 문c
Slub Memory Allocator -6- (Object Allocation) | 문c
Slub Memory Allocator -7- (Object Free) | 문c
Slub Memory Allocator -8- (Drain/Flash Cache) | 문c
Slub Memory Allocator -9- (Cache Shrink) | 문c
Slub Memory Allocator -10- (Slub Free) | 문c
Slub Memory Allocator -11- (Cache Destruction) | 문c
Slub Memory Allocator -12- (Slub Debugging) | 문c
Slub Memory Allocator -13- (slabinfo) | 문c (this post)

Per-cpu -4- (atomic operations)

2016-06-01, 2019-05-08 문영일

this_cpu_cmpxchg_double()

include/linux/percpu-defs.h

#define this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \
        __pcpu_double_call_return_bool(this_cpu_cmpxchg_double_, pcp1, pcp2, oval1, oval2, nval1, nval2)

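If the per-cpu values pcp (pcp1, pcp2) equal the old values (oval1, oval2), the new values (nval1, nval2) are stored into pcp atomically.

How it differs from cmpxchg_double():
- Because it replaces per-cpu values, which do not have to contend with other cpus, a faster atomic operation can be expected.
  - On the arm architecture, local irqs are disabled only for the duration of the atomic operation.
  - On the arm64 architecture, preemption is disabled only for the duration of the atomic operation.

The figure below shows how this_cpu_cmpxchg_double() is processed.

(Figure: this_cpu_cmpxchg_double-1)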

__pcpu_double_call_return_bool()

include/linux/percpu-defs.h

/*
 * Special handling for cmpxchg_double.  cmpxchg_double is passed two
 * percpu variables.  The first has to be aligned to a double word
 * boundary and the second has to follow directly thereafter.
 * We enforce this on all architectures even if they don't support
 * a double cmpxchg instruction, since it's a cheap requirement, and it
 * avoids breaking the requirement for architectures with the instruction.
 */
#define __pcpu_double_call_return_bool(stem, pcp1, pcp2, ...)           \
({                                                                      \
        bool pdcrb_ret__;                                               \
        __verify_pcpu_ptr(&(pcp1));                                     \
        BUILD_BUG_ON(sizeof(pcp1) != sizeof(pcp2));                     \
        VM_BUG_ON((unsigned long)(&(pcp1)) % (2 * sizeof(pcp1)));       \
        VM_BUG_ON((unsigned long)(&(pcp2)) !=                           \
                  (unsigned long)(&(pcp1)) + sizeof(pcp1));             \
        switch(sizeof(pcp1)) {                                          \
        case 1: pdcrb_ret__ = stem##1(pcp1, pcp2, __VA_ARGS__); break;  \
        case 2: pdcrb_ret__ = stem##2(pcp1, pcp2, __VA_ARGS__); break;  \
        case 4: pdcrb_ret__ = stem##4(pcp1, pcp2, __VA_ARGS__); break;  \
        case 8: pdcrb_ret__ = stem##8(pcp1, pcp2, __VA_ARGS__); break;  \
        default:                                                        \
                __bad_size_call_parameter(); break;                     \
        }                                                               \
        pdcrb_ret__;                                                    \
})

If pcp (pcp1, pcp2) equals the old values (oval1, oval2), the new values (nval1, nval2) are stored into pcp.

Depending on the data size, one of stem##1 through stem##8 is called.
- e.g. stem=this_cpu_cmpxchg_double_
  - this_cpu_cmpxchg_double_1, this_cpu_cmpxchg_double_2, this_cpu_cmpxchg_double_4, this_cpu_cmpxchg_double_8
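The two VM_BUG_ON() conditions in the macro amount to a layout rule: the first word sits on a double-word boundary and the second follows it directly. A minimal user-space sketch of that rule, with the hypothetical names `double_word_layout_ok()` and `struct pair_sketch`:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Mirrors the two runtime checks: p1 must be aligned to twice the word
 * size, and p2 must start exactly one word after p1.
 */
static bool double_word_layout_ok(const void *p1, const void *p2,
                                  size_t word_size)
{
        uintptr_t a1 = (uintptr_t)p1;
        uintptr_t a2 = (uintptr_t)p2;

        return (a1 % (2 * word_size)) == 0 && a2 == a1 + word_size;
}

/* A pair laid out so that it satisfies the rule (C11 _Alignas). */
struct pair_sketch {
        _Alignas(2 * sizeof(unsigned long)) unsigned long first;
        unsigned long second;
};
```

Passing `&p.first, &p.second` for a `struct pair_sketch p` satisfies the check, while swapping the two arguments fails it because the second member is not on a double-word boundary.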

this_cpu_cmpxchg_double_1()

include/asm-generic/percpu.h

#ifndef this_cpu_cmpxchg_double_1
#define this_cpu_cmpxchg_double_1(pcp1, pcp2, oval1, oval2, nval1, nval2) \
        this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
#endif
#ifndef this_cpu_cmpxchg_double_2
#define this_cpu_cmpxchg_double_2(pcp1, pcp2, oval1, oval2, nval1, nval2) \
        this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
#endif
#ifndef this_cpu_cmpxchg_double_4
#define this_cpu_cmpxchg_double_4(pcp1, pcp2, oval1, oval2, nval1, nval2) \
        this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
#endif
#ifndef this_cpu_cmpxchg_double_8
#define this_cpu_cmpxchg_double_8(pcp1, pcp2, oval1, oval2, nval1, nval2) \
        this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
#endif

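If the pcp values (pcp1, pcp2) equal the old values (oval1, oval2), the new values (nval1, nval2) are stored into pcp.

32-bit ARM does not provide an architecture-specific operation that handles the double word atomically here, so the generic function is called.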

this_cpu_generic_cmpxchg_double()

include/asm-generic/percpu.h

#define this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \
({                                                                      \
        int __ret;                                                      \
        unsigned long __flags;                                          \
        raw_local_irq_save(__flags);                                    \
        __ret = raw_cpu_generic_cmpxchg_double(pcp1, pcp2,              \
                        oval1, oval2, nval1, nval2);                    \
        raw_local_irq_restore(__flags);                                 \
        __ret;                                                          \
})

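With interrupts temporarily disabled, if the pcp values (pcp1, pcp2) equal the old values (oval1, oval2), the new values (nval1, nval2) are stored into pcp; the interrupt state is then restored to what it was before the call.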
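The save/restore pairing matters because the caller may already be running with interrupts off, in which case they must stay off afterwards. A minimal user-space sketch of that pattern follows. Everything here is an assumption made for illustration: `mock_irqs_enabled`, `mock_irq_save()` and `mock_irq_restore()` merely imitate the interrupt flag with a plain variable, and `guarded_cmpxchg_double()` is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the CPU interrupt-enable flag (assumption, not kernel state). */
static bool mock_irqs_enabled = true;

/* Remember the current state, then disable. */
static unsigned long mock_irq_save(void)
{
        unsigned long flags = mock_irqs_enabled;

        mock_irqs_enabled = false;
        return flags;
}

/* Put back whatever state was saved, rather than enabling blindly. */
static void mock_irq_restore(unsigned long flags)
{
        mock_irqs_enabled = (flags != 0);
}

static int guarded_cmpxchg_double(long *p1, long *p2,
                                  long o1, long o2, long n1, long n2)
{
        unsigned long flags = mock_irq_save();
        int ret = 0;

        assert(!mock_irqs_enabled);     /* the update runs with "irqs" off */
        if (*p1 == o1 && *p2 == o2) {
                *p1 = n1;
                *p2 = n2;
                ret = 1;
        }
        mock_irq_restore(flags);
        return ret;
}
```

If the mock flag is already false on entry, it is still false on return, which mirrors what restoring the saved flags achieves in the macro above.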

raw_cpu_cmpxchg_double()

include/linux/percpu-defs.h

#define raw_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \
        __pcpu_double_call_return_bool(raw_cpu_cmpxchg_double_, pcp1, pcp2, oval1, oval2, nval1, nval2)

If the pcp values (pcp1, pcp2) equal the old values (oval1, oval2), the new values (nval1, nval2) are stored into pcp, and true is returned on success.

raw_cpu_cmpxchg_double_1()

include/asm-generic/percpu.h

#ifndef raw_cpu_cmpxchg_double_1
#define raw_cpu_cmpxchg_double_1(pcp1, pcp2, oval1, oval2, nval1, nval2) \
        raw_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
#endif
#ifndef raw_cpu_cmpxchg_double_2
#define raw_cpu_cmpxchg_double_2(pcp1, pcp2, oval1, oval2, nval1, nval2) \
        raw_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
#endif
#ifndef raw_cpu_cmpxchg_double_4
#define raw_cpu_cmpxchg_double_4(pcp1, pcp2, oval1, oval2, nval1, nval2) \
        raw_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
#endif
#ifndef raw_cpu_cmpxchg_double_8
#define raw_cpu_cmpxchg_double_8(pcp1, pcp2, oval1, oval2, nval1, nval2) \
        raw_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
#endif

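If the pcp values (pcp1, pcp2) equal the old values (oval1, oval2), the new values (nval1, nval2) are stored into pcp, and true is returned on success.

32-bit ARM does not provide an architecture-specific operation that handles the double word atomically here, so the generic function is called.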

raw_cpu_generic_cmpxchg_double()

include/asm-generic/percpu.h

#define raw_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \
({                                                                      \
        int __ret = 0;                                                  \
        if (raw_cpu_read(pcp1) == (oval1) &&                            \
                         raw_cpu_read(pcp2)  == (oval2)) {              \
                raw_cpu_write(pcp1, nval1);                             \
                raw_cpu_write(pcp2, nval2);                             \
                __ret = 1;                                              \
        }                                                               \
        (__ret);                                                        \
})

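If the pcp values (pcp1, pcp2) equal the old values (oval1, oval2), the new values (nval1, nval2) are stored into pcp, and true is returned on success. If either value differs, nothing is written and false is returned.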
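The all-or-nothing behavior is easiest to see with a pointer paired with a counter. The sketch below is a plain-C illustration, assuming a hypothetical free list whose head is paired with a transaction id; `list_state`, `cmpxchg_double_sketch()` and `pop_node()` are made-up names, and the code is single-threaded, so it shows the logic only, not real atomicity.

```c
#include <assert.h>
#include <stddef.h>

struct node {
        struct node *next;
};

/* Hypothetical pair: list head plus an id bumped on every update. */
struct list_state {
        struct node *head;
        unsigned long tid;
};

/* Plain-C analogue of the generic helper: write both only if both match. */
static int cmpxchg_double_sketch(struct list_state *s,
                                 struct node *old_head, unsigned long old_tid,
                                 struct node *new_head, unsigned long new_tid)
{
        assert(s != NULL);
        if (s->head == old_head && s->tid == old_tid) {
                s->head = new_head;
                s->tid = new_tid;
                return 1;
        }
        return 0;
}

/* Pop one node; retry if the pair changed between the read and the update. */
static struct node *pop_node(struct list_state *s)
{
        struct node *head;
        unsigned long tid;

        do {
                head = s->head;
                tid = s->tid;
                if (!head)
                        return NULL;
        } while (!cmpxchg_double_sketch(s, head, tid, head->next, tid + 1));

        return head;
}
```

A stale id makes the update fail even when the head pointer still matches, which is why the two words are compared and written as one unit.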

raw_cpu_read()

include/linux/percpu-defs.h

#define raw_cpu_read(pcp)               __pcpu_size_call_return(raw_cpu_read_, pcp)

Returns the value of pcp.

raw_cpu_write()

include/linux/percpu-defs.h

#define raw_cpu_write(pcp, val)         __pcpu_size_call(raw_cpu_write_, pcp, val)

Stores val into pcp.

__pcpu_size_call_return()

include/linux/percpu-defs.h

#define __pcpu_size_call_return(stem, variable)                         \
({                                                                      \
        typeof(variable) pscr_ret__;                                    \
        __verify_pcpu_ptr(&(variable));                                 \
        switch(sizeof(variable)) {                                      \
        case 1: pscr_ret__ = stem##1(variable); break;                  \
        case 2: pscr_ret__ = stem##2(variable); break;                  \
        case 4: pscr_ret__ = stem##4(variable); break;                  \
        case 8: pscr_ret__ = stem##8(variable); break;                  \
        default:                                                        \
                __bad_size_call_parameter(); break;                     \
        }                                                               \
        pscr_ret__;                                                     \
})

Depending on the size of the variable's type, the macro calls stem1, stem2, stem4 or stem8 (the stem argument with the size appended) and returns the pcp value.

e.g. stem=raw_cpu_read_
- raw_cpu_read_1, raw_cpu_read_2, raw_cpu_read_4, raw_cpu_read_8

__pcpu_size_call()

include/linux/percpu-defs.h

#define __pcpu_size_call(stem, variable, ...)                           \
do {                                                                    \
        __verify_pcpu_ptr(&(variable));                                 \
        switch(sizeof(variable)) {                                      \
                case 1: stem##1(variable, __VA_ARGS__);break;           \
                case 2: stem##2(variable, __VA_ARGS__);break;           \
                case 4: stem##4(variable, __VA_ARGS__);break;           \
                case 8: stem##8(variable, __VA_ARGS__);break;           \
                default:                                                \
                        __bad_size_call_parameter();break;              \
        }                                                               \
} while (0)

Depending on the size of the variable's type, the macro calls stem1, stem2, stem4 or stem8 (the stem argument with the size appended) to store val into pcp.

e.g. stem=raw_cpu_write_
- raw_cpu_write_1, raw_cpu_write_2, raw_cpu_write_4, raw_cpu_write_8
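The size dispatch relies on token pasting: `stem##4` turns a stem such as `raw_cpu_write_` into `raw_cpu_write_4`. The user-space sketch below reproduces the mechanism with hypothetical names (`size_call`, `demo_store_*`, `hits`); it omits `__verify_pcpu_ptr()` and operates on ordinary variables rather than per-cpu ones.

```c
#include <assert.h>
#include <stdint.h>

static int hits[4];     /* how often each width-specific handler ran */

/* Hypothetical width-specific handlers. They are macros, so every
 * switch arm compiles for any operand type, as in the kernel headers. */
#define demo_store_1(var, val) do { (var) = (val); hits[0]++; } while (0)
#define demo_store_2(var, val) do { (var) = (val); hits[1]++; } while (0)
#define demo_store_4(var, val) do { (var) = (val); hits[2]++; } while (0)
#define demo_store_8(var, val) do { (var) = (val); hits[3]++; } while (0)

/* Paste the operand width onto the stem and call the result. */
#define size_call(stem, variable, ...)                                  \
do {                                                                    \
        switch (sizeof(variable)) {                                     \
        case 1: stem##1(variable, __VA_ARGS__); break;                  \
        case 2: stem##2(variable, __VA_ARGS__); break;                  \
        case 4: stem##4(variable, __VA_ARGS__); break;                  \
        case 8: stem##8(variable, __VA_ARGS__); break;                  \
        default: break; /* kernel: __bad_size_call_parameter() */       \
        }                                                               \
} while (0)

#define demo_store(var, val) size_call(demo_store_, var, val)

/* Usage: the operand's width alone picks the handler. */
static void demo_usage(void)
{
        uint16_t half = 0;

        demo_store(half, 513);
        assert(half == 513);
}
```

Because `sizeof(variable)` is a compile-time constant, a compiler can discard the arms that do not apply, so the dispatch typically costs nothing at run time.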

raw_cpu_read_1()

include/asm-generic/percpu.h

#ifndef raw_cpu_read_1
#define raw_cpu_read_1(pcp) (*raw_cpu_ptr(&(pcp)))
#endif
#ifndef raw_cpu_read_2
#define raw_cpu_read_2(pcp) (*raw_cpu_ptr(&(pcp)))
#endif
#ifndef raw_cpu_read_4
#define raw_cpu_read_4(pcp) (*raw_cpu_ptr(&(pcp)))
#endif
#ifndef raw_cpu_read_8
#define raw_cpu_read_8(pcp) (*raw_cpu_ptr(&(pcp)))
#endif

Reads the value of pcp through the pointer to the current CPU's copy.

raw_cpu_write_1()

include/asm-generic/percpu.h

#ifndef raw_cpu_write_1
#define raw_cpu_write_1(pcp, val)       raw_cpu_generic_to_op(pcp, val, =)
#endif
#ifndef raw_cpu_write_2
#define raw_cpu_write_2(pcp, val)       raw_cpu_generic_to_op(pcp, val, =)
#endif
#ifndef raw_cpu_write_4
#define raw_cpu_write_4(pcp, val)       raw_cpu_generic_to_op(pcp, val, =)
#endif
#ifndef raw_cpu_write_8
#define raw_cpu_write_8(pcp, val)       raw_cpu_generic_to_op(pcp, val, =)
#endif

Stores val into pcp.

raw_cpu_generic_to_op()

include/asm-generic/percpu.h

#define raw_cpu_generic_to_op(pcp, val, op)                             \
do {                                                                    \
        *raw_cpu_ptr(&(pcp)) op val;                                    \
} while (0)

Applies the op operator to the current CPU's copy of pcp with val as the right-hand operand.

e.g. raw_cpu_generic_to_op(pcp, val, +=)
- pcp += val
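Passing the operator itself as a macro argument lets one helper serve `=`, `+=`, `&=` and so on. The sketch below imitates this in user space. It assumes a simplified model in which a per-cpu variable is an array indexed by a hypothetical `current_cpu`; real per-cpu variables are located through per-CPU offsets, not arrays, and `demo_cpu_ptr()`, `demo_to_op` and `counter` are made-up names.

```c
#include <assert.h>

#define NR_DEMO_CPUS 4

static int current_cpu;                 /* hypothetical "CPU we run on" */
static long counter[NR_DEMO_CPUS];      /* one slot per pretend CPU */

/* Resolve the slot that belongs to the current pretend CPU. */
static long *demo_cpu_ptr(long *base)
{
        assert(current_cpu >= 0 && current_cpu < NR_DEMO_CPUS);
        return &base[current_cpu];
}

/* The operator token is pasted between the lvalue and the value. */
#define demo_to_op(var, val, op)                \
do {                                            \
        *demo_cpu_ptr(var) op (val);            \
} while (0)
```

With `current_cpu` set to 1, `demo_to_op(counter, 5, +=)` expands to `*demo_cpu_ptr(counter) += (5);` and touches only slot 1.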

References

per-cpu -1- (Basic) | 문c
per-cpu -2- (Initialization) | 문c
per-cpu -3- (Dynamic Allocation) | 문c
Per-cpu -4- (atomic operations) | 문c – this post