문c 블로그

numa_policy_init()

2016-12-022016-12-02 문영일 Leave a comment

numa_policy_init()

mm/mempolicy.c

void __init numa_policy_init(void)
{
        nodemask_t interleave_nodes; 
        unsigned long largest = 0;
        int nid, prefer = 0;

        policy_cache = kmem_cache_create("numa_policy",
                                         sizeof(struct mempolicy),
                                         0, SLAB_PANIC, NULL);

        sn_cache = kmem_cache_create("shared_policy_node",
                                     sizeof(struct sp_node),
                                     0, SLAB_PANIC, NULL);

        for_each_node(nid) {
                preferred_node_policy[nid] = (struct mempolicy) {
                        .refcnt = ATOMIC_INIT(1),
                        .mode = MPOL_PREFERRED,
                        .flags = MPOL_F_MOF | MPOL_F_MORON,
                        .v = { .preferred_node = nid, },
                };
        }
                
        /*
         * Set interleaving policy for system init. Interleaving is only
         * enabled across suitably sized nodes (default is >= 16MB), or
         * fall back to the largest node if they're all smaller.
         */
        nodes_clear(interleave_nodes);
        for_each_node_state(nid, N_MEMORY) {
                unsigned long total_pages = node_present_pages(nid);

                /* Preserve the largest node */
                if (largest < total_pages) {
                        largest = total_pages;
                        prefer = nid;
                }

                /* Interleave this node? */
                if ((total_pages << PAGE_SHIFT) >= (16 << 20))
                        node_set(nid, interleave_nodes);
        }

        /* All too small, use the largest */
        if (unlikely(nodes_empty(interleave_nodes)))
                node_set(prefer, interleave_nodes);

        if (do_set_mempolicy(MPOL_INTERLEAVE, 0, &interleave_nodes))
                pr_err("%s: interleaving failed\n", __func__);

        check_numabalancing_enable();
}

코드 라인 07~13에서 mempolicy 구조체 타입으로 전역 policy_cache kmem 캐시와 sp_node 구조체 타입으로 전역 sn_cache kmem 캐시를 구성한다.
코드 라인 15~22에서 전체 노드에 대해 preferred_node_policy[] 배열을 초기화한다.
코드 라인 29~41에서 interleave 노드를 초기화한다. 16M 이상의 모든 메모리 노드를 interleave 노드에 참여 시킨다.
코드 라인 44~45에서 만일 참여한 노드가 하나도 없는 경우 가장 큰 노드를 interleave 노드에 참여시킨다.
코드 라인 47~48에서 interleave 메모리 정책을 설정한다.
코드 라인 50에서 NUMA 밸런싱을 설정한다.

mm/mempolicy.c

static struct mempolicy preferred_node_policy[MAX_NUMNODES];

do_set_mempolicy()

mm/mempolicy.c

/* Set the process memory policy */
static long do_set_mempolicy(unsigned short mode, unsigned short flags,
                             nodemask_t *nodes)
{
        struct mempolicy *new, *old;
        NODEMASK_SCRATCH(scratch);
        int ret;

        if (!scratch)
                return -ENOMEM;

        new = mpol_new(mode, flags, nodes);
        if (IS_ERR(new)) {
                ret = PTR_ERR(new);
                goto out;
        }

        task_lock(current);
        ret = mpol_set_nodemask(new, nodes, scratch);
        if (ret) {
                task_unlock(current);
                mpol_put(new);
                goto out;
        }
        old = current->mempolicy;
        current->mempolicy = new;
        if (new && new->mode == MPOL_INTERLEAVE &&
            nodes_weight(new->v.nodes))
                current->il_next = first_node(new->v.nodes);
        task_unlock(current);
        mpol_put(old);
        ret = 0;
out:
        NODEMASK_SCRATCH_FREE(scratch);
        return ret;
}

요청한 메모리 정책 모드를 새로 만들고 설정한 후 현재 태스크의 메모리 정책에 대입시킨다.

mpol_new()

mm/mempolicy.c

/*
 * This function just creates a new policy, does some check and simple
 * initialization. You must invoke mpol_set_nodemask() to set nodes.
 */
static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
                                  nodemask_t *nodes)
{
        struct mempolicy *policy;

        pr_debug("setting mode %d flags %d nodes[0] %lx\n",
                 mode, flags, nodes ? nodes_addr(*nodes)[0] : NUMA_NO_NODE);

        if (mode == MPOL_DEFAULT) {
                if (nodes && !nodes_empty(*nodes))
                        return ERR_PTR(-EINVAL);
                return NULL;
        }
        VM_BUG_ON(!nodes);

        /*
         * MPOL_PREFERRED cannot be used with MPOL_F_STATIC_NODES or
         * MPOL_F_RELATIVE_NODES if the nodemask is empty (local allocation).
         * All other modes require a valid pointer to a non-empty nodemask.
         */
        if (mode == MPOL_PREFERRED) {
                if (nodes_empty(*nodes)) {
                        if (((flags & MPOL_F_STATIC_NODES) ||
                             (flags & MPOL_F_RELATIVE_NODES)))
                                return ERR_PTR(-EINVAL);
                }
        } else if (mode == MPOL_LOCAL) {
                if (!nodes_empty(*nodes))
                        return ERR_PTR(-EINVAL);
                mode = MPOL_PREFERRED;
        } else if (nodes_empty(*nodes))
                return ERR_PTR(-EINVAL);
        policy = kmem_cache_alloc(policy_cache, GFP_KERNEL);
        if (!policy)
                return ERR_PTR(-ENOMEM);
        atomic_set(&policy->refcnt, 1);
        policy->mode = mode;
        policy->flags = flags;

        return policy;
}

새로운 메모리 정책을 생성한다.

mpol_set_nodemask()

mm/mempolicy.c

/*
 * mpol_set_nodemask is called after mpol_new() to set up the nodemask, if
 * any, for the new policy.  mpol_new() has already validated the nodes
 * parameter with respect to the policy mode and flags.  But, we need to
 * handle an empty nodemask with MPOL_PREFERRED here.
 *
 * Must be called holding task's alloc_lock to protect task's mems_allowed
 * and mempolicy.  May also be called holding the mmap_semaphore for write.
 */
static int mpol_set_nodemask(struct mempolicy *pol,
                     const nodemask_t *nodes, struct nodemask_scratch *nsc)
{
        int ret;

        /* if mode is MPOL_DEFAULT, pol is NULL. This is right. */
        if (pol == NULL)
                return 0;
        /* Check N_MEMORY */
        nodes_and(nsc->mask1,
                  cpuset_current_mems_allowed, node_states[N_MEMORY]);

        VM_BUG_ON(!nodes);
        if (pol->mode == MPOL_PREFERRED && nodes_empty(*nodes))
                nodes = NULL;   /* explicit local allocation */
        else {
                if (pol->flags & MPOL_F_RELATIVE_NODES)
                        mpol_relative_nodemask(&nsc->mask2, nodes, &nsc->mask1);
                else
                        nodes_and(nsc->mask2, *nodes, nsc->mask1);

                if (mpol_store_user_nodemask(pol))
                        pol->w.user_nodemask = *nodes;
                else
                        pol->w.cpuset_mems_allowed =
                                                cpuset_current_mems_allowed;
        }

        if (nodes)
                ret = mpol_ops[pol->mode].create(pol, &nsc->mask2);
        else
                ret = mpol_ops[pol->mode].create(pol, NULL);
        return ret;
}

mpol_relative_nodemask()

mm/mempolicy.c

static void mpol_relative_nodemask(nodemask_t *ret, const nodemask_t *orig,
                                   const nodemask_t *rel)
{
        nodemask_t tmp;
        nodes_fold(tmp, *orig, nodes_weight(*rel));
        nodes_onto(*ret, tmp, *rel);
}

mpol_op[] 구조체 배열

static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
        [MPOL_DEFAULT] = {
                .rebind = mpol_rebind_default,
        },
        [MPOL_INTERLEAVE] = {
                .create = mpol_new_interleave,
                .rebind = mpol_rebind_nodemask,
        },
        [MPOL_PREFERRED] = {
                .create = mpol_new_preferred,
                .rebind = mpol_rebind_preferred,
        },
        [MPOL_BIND] = {
                .create = mpol_new_bind,
                .rebind = mpol_rebind_nodemask,
        },
};

check_numabalancing_enable()

mm/mempolicy.c

#ifdef CONFIG_NUMA_BALANCING
static int __initdata numabalancing_override;

static void __init check_numabalancing_enable(void)
{
        bool numabalancing_default = false;

        if (IS_ENABLED(CONFIG_NUMA_BALANCING_DEFAULT_ENABLED))
                numabalancing_default = true;

        /* Parsed by setup_numabalancing. override == 1 enables, -1 disables */
        if (numabalancing_override)
                set_numabalancing_state(numabalancing_override == 1);

        if (num_online_nodes() > 1 && !numabalancing_override) {
                pr_info("%s automatic NUMA balancing. "
                        "Configure with numa_balancing= or the "
                        "kernel.numa_balancing sysctl",
                        numabalancing_default ? "Enabling" : "Disabling");
                set_numabalancing_state(numabalancing_default);
        }
} 
#endif

NUMA 밸런싱을 설정한다.

코드 라인 12~13에서 “numa_balancing=enable” 커널 파라메터가 설정된 경우 전역 numabalancing_enabled를 true로 설정한다.
코드 라인 15~21에서 CONFIG_NUMA_BALANCING_DEFAULT_ENABLED 커널 옵션이 사용된 커널에서 온라인 노드가 2개 이상인 경우도 전역 numabalancing_enabled를 true로 설정한다.

구조체

mempolicy 구조체

/*
 * Describe a memory policy.
 *
 * A mempolicy can be either associated with a process or with a VMA.
 * For VMA related allocations the VMA policy is preferred, otherwise
 * the process policy is used. Interrupts ignore the memory policy
 * of the current process.
 *
 * Locking policy for interlave:
 * In process context there is no locking because only the process accesses
 * its own state. All vma manipulation is somewhat protected by a down_read on
 * mmap_sem.
 *
 * Freeing policy:
 * Mempolicy objects are reference counted.  A mempolicy will be freed when
 * mpol_put() decrements the reference count to zero.
 *
 * Duplicating policy objects:
 * mpol_dup() allocates a new mempolicy and copies the specified mempolicy
 * to the new storage.  The reference count of the new object is initialized
 * to 1, representing the caller of mpol_dup().
 */
struct mempolicy {
        atomic_t refcnt;
        unsigned short mode;    /* See MPOL_* above */
        unsigned short flags;   /* See set_mempolicy() MPOL_F_* above */
        union {
                short            preferred_node; /* preferred */
                nodemask_t       nodes;         /* interleave/bind */
                /* undefined for default */
        } v;
        union {
                nodemask_t cpuset_mems_allowed; /* relative to these nodes */
                nodemask_t user_nodemask;       /* nodemask passed by user */
        } w; 
};

mempolicy_operations 구조체

mm/mempolicy.c

static const struct mempolicy_operations {
        int (*create)(struct mempolicy *pol, const nodemask_t *nodes);
        /*
         * If read-side task has no lock to protect task->mempolicy, write-side
         * task will rebind the task->mempolicy by two step. The first step is
         * setting all the newly nodes, and the second step is cleaning all the
         * disallowed nodes. In this way, we can avoid finding no node to alloc
         * page.
         * If we have a lock to protect task->mempolicy in read-side, we do
         * rebind directly.
         *
         * step:
         *      MPOL_REBIND_ONCE - do rebind work at once
         *      MPOL_REBIND_STEP1 - set all the newly nodes
         *      MPOL_REBIND_STEP2 - clean all the disallowed nodes
         */
        void (*rebind)(struct mempolicy *pol, const nodemask_t *nodes,
                        enum mpol_rebind_step step);
} mpol_ops[MPOL_MAX];

Policy 관련 enum

/*
 * Both the MPOL_* mempolicy mode and the MPOL_F_* optional mode flags are
 * passed by the user to either set_mempolicy() or mbind() in an 'int' actual.
 * The MPOL_MODE_FLAGS macro determines the legal set of optional mode flags.
 */

/* Policies */
enum {
        MPOL_DEFAULT,
        MPOL_PREFERRED,
        MPOL_BIND, 
        MPOL_INTERLEAVE,
        MPOL_LOCAL,
        MPOL_MAX,       /* always last member of enum */
};

참고

NUMA with Linux | Lunatine

Zoned Allocator -14- (Kswapd & Kcompactd)

2016-12-012021-10-14 문영일 13 Comments

Kswapd & Kcompactd

노드마다 kswapd와 kcompactd가 동작하며 free 메모리가 일정량 이상 충분할 때에는 잠들어(sleep) 있다. 그런데 페이지 할당자가 order 페이지 할당을 시도하다 free 페이지가 부족해 low 워터마크 기준을 충족하지 못하는 순간 kswapd 및 kcompactd를 깨운다. kswapd는 자신의 노드에서 페이지 회수를 진행하고, kcompactd는 compaction을 진행하는데 모든 노드에 대해 밸런스하게 high 워터마크 기준을 충족하게 되면 스스로 sleep 한다.

kswapd 초기화

kswapd_init()

mm/vmscan.c

static int __init kswapd_init(void)
{
        int nid, ret;

        swap_setup();
        for_each_node_state(nid, N_MEMORY)
                kswapd_run(nid);
        ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
                                        "mm/vmscan:online", kswapd_cpu_online,
                                        NULL);
        WARN_ON(ret < 0);
        return 0;
}

module_init(kswapd_init)

kswapd를 사용하기 위해 초기화한다.

코드 라인 5에서 kswapd 실행 전에 준비한다.
코드 라인 6~7에서 모든 메모리 노드에 대해 kswapd를 실행시킨다.
코드 라인 8~10에서 cpu가 hot-plug를 통해 CPUHP_AP_ONLINE_DYN 상태로 변경될 때 kswapd_cpu_online() 함수가 호출될 수 있도록 등록한다.

swap_setup()

mm/swap.c

/*
 * Perform any setup for the swap system
 */

void __init swap_setup(void)
{
        unsigned long megs = totalram_pages >> (20 - PAGE_SHIFT); 

        /* Use a smaller cluster for small-memory machines */
        if (megs < 16)                          
                page_cluster = 2;
        else
                page_cluster = 3;
        /* 
         * Right now other parts of the system means that we
         * _really_ don't want to cluster much more
         */
}

kswapd 실행 전에 준비한다.

코드 라인 3~9에서 전역 total 램이 16M 이하이면 page_cluster에 2를 대입하고 그렇지 않으면 3을 대입한다.

kswapd_run()

mm/vmscan.c

/*
 * This kswapd start function will be called by init and node-hot-add.
 * On node-hot-add, kswapd will moved to proper cpus if cpus are hot-added.
 */

int kswapd_run(int nid)
{
        pg_data_t *pgdat = NODE_DATA(nid);
        int ret = 0;

        if (pgdat->kswapd)
                return 0;

        pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
        if (IS_ERR(pgdat->kswapd)) {
                /* failure at boot is fatal */
                BUG_ON(system_state == SYSTEM_BOOTING);
                pr_err("Failed to start kswapd on node %d\n", nid);
                ret = PTR_ERR(pgdat->kswapd);
                pgdat->kswapd = NULL;
        }
        return ret;
}

kswapd 스레드를 동작시킨다.

코드 라인 3~7에서 @nid 노드의 kswapd 스레드가 이미 실행 중인 경우 skip하기 위해 0을 반환한다.
코드 라인 9~16에서 kswapd 스레드를 동작시킨다.

kswapd 동작

kswapd()

mm/vmscan.c

/*
 * The background pageout daemon, started as a kernel thread
 * from the init process.
 *
 * This basically trickles out pages so that we have _some_
 * free memory available even if there is no other activity
 * that frees anything up. This is needed for things like routing
 * etc, where we otherwise might have all activity going on in
 * asynchronous contexts that cannot page things out.
 *
 * If there are applications that are active memory-allocators
 * (most normal use), this basically shouldn't matter.
 */

static int kswapd(void *p)
{
        unsigned int alloc_order, reclaim_order;
        unsigned int classzone_idx = MAX_NR_ZONES - 1;
        pg_data_t *pgdat = (pg_data_t*)p;
        struct task_struct *tsk = current;

        struct reclaim_state reclaim_state = {
                .reclaimed_slab = 0,
        };
        const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);

        if (!cpumask_empty(cpumask))
                set_cpus_allowed_ptr(tsk, cpumask);
        current->reclaim_state = &reclaim_state;

        /*
         * Tell the memory management that we're a "memory allocator",
         * and that if we need more memory we should get access to it
         * regardless (see "__alloc_pages()"). "kswapd" should
         * never get caught in the normal page freeing logic.
         *
         * (Kswapd normally doesn't need memory anyway, but sometimes
         * you need a small amount of memory in order to be able to
         * page out something else, and this flag essentially protects
         * us from recursively trying to free more memory as we're
         * trying to free the first piece of memory in the first place).
         */
        tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
        set_freezable();

        pgdat->kswapd_order = 0;
        pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
        for ( ; ; ) {
                bool ret;

                alloc_order = reclaim_order = pgdat->kswapd_order;
                classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);

kswapd_try_sleep:
                kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
                                        classzone_idx);

                /* Read the new order and classzone_idx */
                alloc_order = reclaim_order = pgdat->kswapd_order;
                classzone_idx = kswapd_classzone_idx(pgdat, 0);
                pgdat->kswapd_order = 0;
                pgdat->kswapd_classzone_idx = MAX_NR_ZONES;

                ret = try_to_freeze();
                if (kthread_should_stop())
                        break;

                /*
                 * We can speed up thawing tasks if we don't call balance_pgdat
                 * after returning from the refrigerator
                 */
                if (ret)
                        continue;

                /*
                 * Reclaim begins at the requested order but if a high-order
                 * reclaim fails then kswapd falls back to reclaiming for
                 * order-0. If that happens, kswapd will consider sleeping
                 * for the order it finished reclaiming at (reclaim_order)
                 * but kcompactd is woken to compact for the original
                 * request (alloc_order).
                 */
                trace_mm_vmscan_kswapd_wake(pgdat->node_id, classzone_idx,
                                                alloc_order);
                reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
                if (reclaim_order < alloc_order)
                        goto kswapd_try_sleep;
        }

        tsk->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD);
        current->reclaim_state = NULL;

        return 0;
}

각 노드에서 동작되는 kswapd 스레드는 각 노드에서 동작하는 zone에 대해 free 페이지가 low 워터마크 이하로 내려가는 경우 백그라운드에서 페이지 회수를 진행하고 high 워터마크 이상이 되는 경우 페이지 회수를 멈춘다.

코드 라인 5~14에서 요청 노드에서 동작하는 온라인 cpumask를 읽어와서 현재 태스크에 설정한다.
- 요청 노드의 cpumask를 현재 task에 cpus_allowed 등을 설정한다.
코드 라인 15에서 현재 태스크의 reclaim_state가 초기화된 reclaim_state 구조체를 가리키게 한다.
코드 라인 29에서 현재 태스크의 플래그에 PF_MEMALLOC, PF_SWAPWRITE 및 PF_KSWAPD를 설정한다.
- PF_MEMALLOC
  - 메모리 reclaim을 위한 태스크로 워터 마크 제한 없이 할당할 수 있도록 한다.
- PF_SWAPWRITE
  - anon 메모리에 대해 swap 기록 요청을 한다.
- PF_KSWAPD
  - kswapd task를 의미한다.
코드 라인 30에서 현재 태스크를 freeze할 수 있도록 PF_NOFREEZE 플래그를 제거한다.
- 참고: freeze | 문c
코드 라인 32~33에서 kswapd_order를 0부터 시작하게 하고 kswapd_classzone_idx는 가장 상위부터 할 수 있도록 MAX_NR_ZONES를 대입해둔다.
코드 라인 34~38에서 alloc_order와 reclaim_order를 노드의 kswapd가 진행하는 order를 사용한다. 그리고 대상 존인 classzone_idx를 가져온다.
코드 라인 40~42에서 try_sleep: 레이블이다. kswapd가 슬립하도록 시도한다.
코드 라인 45~46에서 alloc_order와 reclaim_order를 외부에서 요청한 order와 zone을 적용시킨다.
코드 라인 47~48에서 kwapd_order를 0으로 리셋하고, kswapd_classzone_idx는 가장 상위부터 할 수 있도록 MAX_NR_ZONES를 대입해둔다.
코드 라인 50에서 현재 태스크 kswapd에 대해 freeze 요청이 있는 경우 freeze 시도한다.
코드 라인 51~52에서 현재 태스크의 KTHREAD_SHOULD_STOP 플래그 비트가 설정된 경우 루프를 탈출하고 스레드 종료 처리한다.
코드 라인 58~59에서 freeze 된 적이 있으면 빠른 처리를 위해 노드 밸런스를 동작시키지 않고 계속 진행한다.
코드 라인 71에서 freeze 한 적이 없었던 경우이다. order 페이지와 존을 대상으로 페이지 회수를 진행한다.
코드 라인 72~73에서 요청한 order에서 회수가 실패한 경우 order 0 및 해당 존에서 다시 시도하기 위해 try_sleep: 레이블로 이동한다.
코드 라인 74에서 요청한 order에서 회수가 성공한 경우에는 루프를 계속 반복한다.
코드 라인 76~79에서 kswapd 스레드의 처리를 완료시킨다.

kswapd_try_to_sleep()

mm/vmscan.c

static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_order,
                                unsigned int classzone_idx)
{
        long remaining = 0;
        DEFINE_WAIT(wait);

        if (freezing(current) || kthread_should_stop())
                return;

        prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);

        /*
         * Try to sleep for a short interval. Note that kcompactd will only be
         * woken if it is possible to sleep for a short interval. This is
         * deliberate on the assumption that if reclaim cannot keep an
         * eligible zone balanced that it's also unlikely that compaction will
         * succeed.
         */
        if (prepare_kswapd_sleep(pgdat, reclaim_order, classzone_idx)) {
                /*
                 * Compaction records what page blocks it recently failed to
                 * isolate pages from and skips them in the future scanning.
                 * When kswapd is going to sleep, it is reasonable to assume
                 * that pages and compaction may succeed so reset the cache.
                 */
                reset_isolation_suitable(pgdat);

                /*
                 * We have freed the memory, now we should compact it to make
                 * allocation of the requested order possible.
                 */
                wakeup_kcompactd(pgdat, alloc_order, classzone_idx);

                remaining = schedule_timeout(HZ/10);

                /*
                 * If woken prematurely then reset kswapd_classzone_idx and
                 * order. The values will either be from a wakeup request or
                 * the previous request that slept prematurely.
                 */
                if (remaining) {
                        pgdat->kswapd_classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
                        pgdat->kswapd_order = max(pgdat->kswapd_order, reclaim_order);
                }

                finish_wait(&pgdat->kswapd_wait, &wait);
                prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
        }

        /*
         * After a short sleep, check if it was a premature sleep. If not, then
         * go fully to sleep until explicitly woken up.
         */
        if (!remaining &&
            prepare_kswapd_sleep(pgdat, reclaim_order, classzone_idx)) {
                trace_mm_vmscan_kswapd_sleep(pgdat->node_id);

                /*
                 * vmstat counters are not perfectly accurate and the estimated
                 * value for counters such as NR_FREE_PAGES can deviate from the
                 * true value by nr_online_cpus * threshold. To avoid the zone
                 * watermarks being breached while under pressure, we reduce the
                 * per-cpu vmstat threshold while kswapd is awake and restore
                 * them before going back to sleep.
                 */
                set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);

                if (!kthread_should_stop())
                        schedule();

                set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold);
        } else {
                if (remaining)
                        count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
                else
                        count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
        }
        finish_wait(&pgdat->kswapd_wait, &wait);
}

노드에 대해 요청 order 및 zone 까지 free 페이지가 high 워터마크 기준으로 밸런스하게 할당할 수 있는 상태라면 sleep 한다.

코드 라인 7~8에서 freeze 요청이 있는 경우 함수를 빠져나간다.
코드 라인 10에서 현재 태스크를 kswapd_wait에 추가하여 sleep할 준비를 한다.
코드 섹션 19~48에서 요청 zone 까지 그리고 요청 order에 대해 free 페이지가 밸런스된 high 워터마크 기준을 충족하는 경우의 처리이다.
- 최근에 direct-compaction이 완료되어 해당 존에서 compaction을 다시 처음부터 시작할 수 있도록 존에 compact_blockskip_flush이 설정된다. 이렇게 설정된 존들에 대해 skip 블럭 비트를 모두 클리어한다.
- compactd 스레드를 깨운다.
- 0.1초를 sleep 한다. 만일 중간에 깬 경우 노드에 kswapd가 처리중인 zone과 order를 기록해둔다.
- 다시 슬립할 준비를 한다.
코드 섹션 54~71에서 중간에 깨어나지 않고 0.1초를 완전히 슬립하였고, 여전히 요청 zone 까지 그리고 요청 order에 대해 free 페이지가 확보되어 밸런스된 high 워터마크 기준을 충족하면 다음과 같이 처리한다.
- NR_FREE_PAGES 등의 vmstat을 정밀하게 계산할 필요 여부를 per-cpu 스레졸드라고 하는데, 이를 일반적인 기준의 스레졸드로 지정하도록 노드에 포함된 각 zone에 대해 normal한 스레졸드 값을 지정한다.
- 스레드 종료 요청이 아닌 경우 sleep한다.
- kswapd가 깨어났다는 이유는 메모리 압박 상황이 되었다라는 의미이다. 따라서 이번에는 per-cpu 스레졸드 값으로 pressure한 스레졸드 값을 사용하기 지정한다.
코드 섹션 72~77에서 메모리 부족 상황이 빠르게 온 경우이다. 슬립하자마자 깨어났는데 조건에 따라 관련 카운터를 다음과 같이 처리한다.
- 0.1초간 잠시 sleep 하는 와중에 다시 메모리 부족을 이유로 현재 스레드인 kswapd가 깨어난 경우에는 KSWAPD_LOW_WMARK_HIT_QUICKLY 카운터를 증가시킨다.
- 0.1초 슬립한 후에도 high 워터마크 기준을 충족할 만큼 메모리가 확보되지 못해 슬립하지 못하는 상황이다. 이러한 경우 KSWAPD_HIGH_WMARK_HIT_QUICKLY 카운터를 증가시킨다.
코드 섹션 78에서 kswapd_wait 에서 현재 태스크를 제거한다.

다음 그림은 kswapd_try_to_sleep() 함수를 통해 kswapd가 high 워터마크 기준을 충족하면 슬립하는 과정을 보여준다.

밸런스될 때까지 페이지 회수

balance_pgdat()

mm/vmscan.c -1/3-

/*
 * For kswapd, balance_pgdat() will reclaim pages across a node from zones
 * that are eligible for use by the caller until at least one zone is
 * balanced.
 *
 * Returns the order kswapd finished reclaiming at.
 *
 * kswapd scans the zones in the highmem->normal->dma direction.  It skips
 * zones which have free_pages > high_wmark_pages(zone), but once a zone is
 * found to have free_pages <= high_wmark_pages(zone), any page is that zone
 * or lower is eligible for reclaim until at least one usable zone is
 * balanced.
 */

static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
{
        int i;
        unsigned long nr_soft_reclaimed;
        unsigned long nr_soft_scanned;
        unsigned long pflags;
        unsigned long nr_boost_reclaim;
        unsigned long zone_boosts[MAX_NR_ZONES] = { 0, };
        bool boosted;
        struct zone *zone;
        struct scan_control sc = {
                .gfp_mask = GFP_KERNEL,
                .order = order,
                .may_unmap = 1,
        };

        psi_memstall_enter(&pflags);
        __fs_reclaim_acquire();

        count_vm_event(PAGEOUTRUN);

        /*
         * Account for the reclaim boost. Note that the zone boost is left in
         * place so that parallel allocations that are near the watermark will
         * stall or direct reclaim until kswapd is finished.
         */
        nr_boost_reclaim = 0;
        for (i = 0; i <= classzone_idx; i++) {
                zone = pgdat->node_zones + i;
                if (!managed_zone(zone))
                        continue;

                nr_boost_reclaim += zone->watermark_boost;
                zone_boosts[i] = zone->watermark_boost;
        }
        boosted = nr_boost_reclaim;

restart:
        sc.priority = DEF_PRIORITY;
        do {
                unsigned long nr_reclaimed = sc.nr_reclaimed;
                bool raise_priority = true;
                bool balanced;
                bool ret;

                sc.reclaim_idx = classzone_idx;

                /*
                 * If the number of buffer_heads exceeds the maximum allowed
                 * then consider reclaiming from all zones. This has a dual
                 * purpose -- on 64-bit systems it is expected that
                 * buffer_heads are stripped during active rotation. On 32-bit
                 * systems, highmem pages can pin lowmem memory and shrinking
                 * buffers can relieve lowmem pressure. Reclaim may still not
                 * go ahead if all eligible zones for the original allocation
                 * request are balanced to avoid excessive reclaim from kswapd.
                 */
                if (buffer_heads_over_limit) {
                        for (i = MAX_NR_ZONES - 1; i >= 0; i--) {
                                zone = pgdat->node_zones + i;
                                if (!managed_zone(zone))
                                        continue;

                                sc.reclaim_idx = i;
                                break;
                        }
                }

우선 순위를 12부터 1까지 높여가며 페이지 회수 및 compaction을 진행하여 free 페이지가 요청 order 및 zone까지 밸런스하게 high 워터마크 기준을 충족할 때까지 진행한다.

코드 라인 11~15에서 준비한 scan_control 구조체에 매핑된 페이지를 언맵할 수 있게 may_unmap=1로 설정한다.
코드 라인 17에서 메모리 부족으로 인한 현재 태스크의 psi 산출을 시작하는 지점이다.
- 참고: PSI – Pressure Stall Information | kernel.org
코드 라인 20에서 PAGEOUTRUN 카운터를 증가시킨다.
코드 라인 27~36에서 요청한 classzone_idx 이하의 존들을 순회하며 워터마크 부스트 값을 합산하여 boosted 및 nr_boost_reclaim에 대입한다. 그리고 존별 부스트 값도 zone_boosts[]에 대입한다.
코드 라인 38~40에서 restart: 레이블이다. 우선 순위를 초가값(12)으로 한 후 다시 시도한다.
코드 라인 46~67에서 reclaim할 존 인덱스 값으로 classzone_idx를 사용하되, buffer_heads_over_limit가 설정된 경우 가용한 최상위 존을 대입한다.

mm/vmscan.c -2/3-

.               /*
                 * If the pgdat is imbalanced then ignore boosting and preserve
                 * the watermarks for a later time and restart. Note that the
                 * zone watermarks will be still reset at the end of balancing
                 * on the grounds that the normal reclaim should be enough to
                 * re-evaluate if boosting is required when kswapd next wakes.
                 */
                balanced = pgdat_balanced(pgdat, sc.order, classzone_idx);
                if (!balanced && nr_boost_reclaim) {
                        nr_boost_reclaim = 0;
                        goto restart;
                }

                /*
                 * If boosting is not active then only reclaim if there are no
                 * eligible zones. Note that sc.reclaim_idx is not used as
                 * buffer_heads_over_limit may have adjusted it.
                 */
                if (!nr_boost_reclaim && balanced)
                        goto out;

                /* Limit the priority of boosting to avoid reclaim writeback */
                if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
                        raise_priority = false;

                /*
                 * Do not writeback or swap pages for boosted reclaim. The
                 * intent is to relieve pressure not issue sub-optimal IO
                 * from reclaim context. If no pages are reclaimed, the
                 * reclaim will be aborted.
                 */
                sc.may_writepage = !laptop_mode && !nr_boost_reclaim;
                sc.may_swap = !nr_boost_reclaim;
                sc.may_shrinkslab = !nr_boost_reclaim;

                /*
                 * Do some background aging of the anon list, to give
                 * pages a chance to be referenced before reclaiming. All
                 * pages are rotated regardless of classzone as this is
                 * about consistent aging.
                 */
                age_active_anon(pgdat, &sc);

                /*
                 * If we're getting trouble reclaiming, start doing writepage
                 * even in laptop mode.
                 */
                if (sc.priority < DEF_PRIORITY - 2)
                        sc.may_writepage = 1;

                /* Call soft limit reclaim before calling shrink_node. */
                sc.nr_scanned = 0;
                nr_soft_scanned = 0;
                nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(pgdat, sc.order,
                                                sc.gfp_mask, &nr_soft_scanned);
                sc.nr_reclaimed += nr_soft_reclaimed;

                /*
                 * There should be no need to raise the scanning priority if
                 * enough pages are already being scanned that that high
                 * watermark would be met at 100% efficiency.
                 */
                if (kswapd_shrink_node(pgdat, &sc))
                        raise_priority = false;

코드 라인 8~12에서 노드가 밸런스 상태가 아니고, 부스트 중이면 nr_boost_reclaim을 리셋한 후 restart: 레이블로 이동하여 다시 시작한다.
코드 라인 19~20에서 노드가 이미 밸런스 상태이고 부스트 중이 아니면 더이상 페이지 확보를 할 필요 없으므로 out 레이블로 이동한다.
코드 라인 23~24에서 부스트 중에는 priority가 낮은 순위(12~10)는 상관없지만 높은 순위(9~1)부터는 더 이상 우선 순위가 높아지지 않도록 raise_priority에 false를 대입한다.
- 가능하면 낮은 우선 순위에서는 writeback을 허용하지 않는다.
코드 라인 32~34에서 랩톱(절전 지원) 모드가 아니고 부스트 중이 아니면 may_writepage를 1로 설정하여 writeback을 허용한다. 그리고 부스트 중이 아니면 may_swap 및 may_shrinkslab을 1로 설정하여 swap 및 슬랩 shrink를 지원한다.
코드 라인 42에서 swap이 활성화된 경우 inactive anon이 active anon보다 작을 경우 active 리스트에 대해 shrink를 수행하여 active와 inactive간의 밸런스를 다시 잡아준다.
코드 라인 48~49에서 높은 우선 순위(9~1)에서는 may_writepage를 1로 설정하여 writeback을 허용한다.
코드 라인 52~56에서 노드를 shrink하기 전에 memcg soft limit reclaim을 수행한다. 스캔한 수는 nr_soft_scanned에 대입되고, nr_reclaimed에는 soft reclaim된 페이지 수가 추가된다.
코드 라인 63~64에서 free 페이지가 high 워터마크 기준을 충족할 만큼 노드를 shrink 한다. 만일 shrink가 성공한 경우 순위를 증가시키지 않도록 raise_priority를 false로 설정한다.

mm/vmscan.c -3/3-

                /*
                 * If the low watermark is met there is no need for processes
                 * to be throttled on pfmemalloc_wait as they should not be
                 * able to safely make forward progress. Wake them
                 */
                if (waitqueue_active(&pgdat->pfmemalloc_wait) &&
                                allow_direct_reclaim(pgdat))
                        wake_up_all(&pgdat->pfmemalloc_wait);

                /* Check if kswapd should be suspending */
                __fs_reclaim_release();
                ret = try_to_freeze();
                __fs_reclaim_acquire();
                if (ret || kthread_should_stop())
                        break;

                /*
                 * Raise priority if scanning rate is too low or there was no
                 * progress in reclaiming pages
                 */
                nr_reclaimed = sc.nr_reclaimed - nr_reclaimed;
                nr_boost_reclaim -= min(nr_boost_reclaim, nr_reclaimed);

                /*
                 * If reclaim made no progress for a boost, stop reclaim as
                 * IO cannot be queued and it could be an infinite loop in
                 * extreme circumstances.
                 */
                if (nr_boost_reclaim && !nr_reclaimed)
                        break;

                if (raise_priority || !nr_reclaimed)
                        sc.priority--;
        } while (sc.priority >= 1);

        if (!sc.nr_reclaimed)
                pgdat->kswapd_failures++;

out:
        /* If reclaim was boosted, account for the reclaim done in this pass */
        if (boosted) {
                unsigned long flags;

                for (i = 0; i <= classzone_idx; i++) {
                        if (!zone_boosts[i])
                                continue;

                        /* Increments are under the zone lock */
                        zone = pgdat->node_zones + i;
                        spin_lock_irqsave(&zone->lock, flags);
                        zone->watermark_boost -= min(zone->watermark_boost, zone_boosts[i]);
                        spin_unlock_irqrestore(&zone->lock, flags);
                }

                /*
                 * As there is now likely space, wakeup kcompact to defragment
                 * pageblocks.
                 */
                wakeup_kcompactd(pgdat, pageblock_order, classzone_idx);
        }

        snapshot_refaults(NULL, pgdat);
        __fs_reclaim_release();
        psi_memstall_leave(&pflags);
        /*
         * Return the order kswapd stopped reclaiming at as
         * prepare_kswapd_sleep() takes it into account. If another caller
         * entered the allocator slow path while kswapd was awake, order will
         * remain at the higher level.
         */
        return sc.order;
}

코드 라인 6~8에서 페이지 할당 중 메모리가 부족하여 direct reclaim 시도 중 대기하고 있는 태스크들이 pfmemalloc_wait 리스트에 존재하고, 노드가 direct reclaim을 해도 된다고 판단하면 대기 중인 태스크들을 모두 깨운다.
- allow_direct_reclaim(): normal 존 이하의 free 페이지 합이 min 워터마크를 더한 페이지의 절반보다 큰 경우 true.
코드 라인 12~15에서 freeze 하였다가 깨어났었던 경우 또는 kswapd 스레드 정지 요청이 있는 경우 루프를 빠져나간다.
코드 라인 21~30에서 reclaimed 페이지와 nr_boost_reclaim을 산출한 후 부스트 중이 아니면서 회수된 페이지가 없으면 루프를 벗어난다.
코드 라인 32~34에서 우선 순위를 높이면서 최고 우선 순위까지 루프를 반복한다. 만일 회수된 페이지가 없거나 우선 순위 상승을 원하지 않는 경우에는 우선 순위 증가없이 루프를 반복한다.
코드 라인 36~37에서 루프 완료 후까지 회수된 페이지가 없으면 kswapd_failures 카운터를 증가시킨다.
코드 라인 39~60에서 out: 레이블이다. 처음 시도 시 부스트된 적이 있었으면 kcompactd를 깨운다. 또한 워터마크 부스트 값을 갱신한다.
코드 라인 64에서 메모리 부족으로 인한 현재 태스크의 psi 산출을 종료하는 지점이다.

아래 그림은 task A에서 direct 페이지 회수를 진행 중에 pfmemalloc 워터마크 기준 이하로 떨어진 경우 kswapd에 의해 페이지 회수가 될 때까지스로틀링 즉, direct 페이지 회수를 잠시 쉬게 하여 cpu 부하를 줄인다.

Kswapd 깨우기

wake_all_kswapds()

mm/page_alloc.c

static void wake_all_kswapds(unsigned int order, gfp_t gfp_mask,
                             const struct alloc_context *ac)
{
        struct zoneref *z;
        struct zone *zone;
        pg_data_t *last_pgdat = NULL;
        enum zone_type high_zoneidx = ac->high_zoneidx;

        for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, high_zoneidx,
                                        ac->nodemask) {
                if (last_pgdat != zone->zone_pgdat)
                        wakeup_kswapd(zone, gfp_mask, order, high_zoneidx);
                last_pgdat = zone->zone_pgdat;
        }
}

alloc context가 가리키는 zonelist 중 관련 노드의 kswpad를 깨운다.

wakeup_kswapd()

mm/vmscan.c

/*
 * A zone is low on free memory or too fragmented for high-order memory.  If
 * kswapd should reclaim (direct reclaim is deferred), wake it up for the zone's
 * pgdat.  It will wake up kcompactd after reclaiming memory.  If kswapd reclaim
 * has failed or is not needed, still wake up kcompactd if only compaction is
 * needed.
 */

void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order,
                   enum zone_type classzone_idx)
{
        pg_data_t *pgdat;

        if (!managed_zone(zone))
                return;

        if (!cpuset_zone_allowed(zone, gfp_flags))
                return;
        pgdat = zone->zone_pgdat;
        pgdat->kswapd_classzone_idx = kswapd_classzone_idx(pgdat,
                                                           classzone_idx);
        pgdat->kswapd_order = max(pgdat->kswapd_order, order);
        if (!waitqueue_active(&pgdat->kswapd_wait))
                return;

        /* Hopeless node, leave it to direct reclaim if possible */
        if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ||
            (pgdat_balanced(pgdat, order, classzone_idx) &&
             !pgdat_watermark_boosted(pgdat, classzone_idx))) {
                /*
                 * There may be plenty of free memory available, but it's too
                 * fragmented for high-order allocations.  Wake up kcompactd
                 * and rely on compaction_suitable() to determine if it's
                 * needed.  If it fails, it will defer subsequent attempts to
                 * ratelimit its work.
                 */
                if (!(gfp_flags & __GFP_DIRECT_RECLAIM))
                        wakeup_kcompactd(pgdat, order, classzone_idx);
                return;
        }

        trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, classzone_idx, order,
                                      gfp_flags);
        wake_up_interruptible(&pgdat->kswapd_wait);
}

지정된 zone에서 order 페이지를 할당하려다 메모리가 부족해지면 kswapd 태스크를 깨운다.

코드 라인 6~7에서 유효한 존이 아닌 경우 처리할 페이지가 없으므로 함수를 빠져나간다.
코드 라인 9~10에서 요청한 zone의 노드가 cgroup cpuset을 통해 허가되지 않은 경우 처리를 포기하고 빠져나간다.
코드 라인 11~14에서 kswapd에 존과 order를 지정하여 요청한다.
코드 라인 15~16에서 kswapd가 이미 동작 중이면 함수를 빠져나간다.
코드 라인 19~32에서 다음 조건을 만족하고 direct-reclaim을 허용하지 않는 경우에 한해 kcompactd만 깨우고 함수를 빠져나간다.
- kswad를 통한 페이지 회수 실패가 MAX_RECLAIM_RETRIES(16)번 이상인 경우
- 노드가 이미 밸런스 상태이고 부스트 중이 아닌 경우
코드 라인 36에서 kswapd 태스크를 깨운다.

current_is_kswapd()

include/linux/swap.h

static inline int current_is_kswapd(void)
{
        return current->flags & PF_KSWAPD;
}

현재 태스크가 kswapd 인 경우 true를 반환한다.

kcompactd

kcompactd 초기화

kcompactd_init()

static int __init kcompactd_init(void)
{
        int nid;
        int ret;

        ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
                                        "mm/compaction:online",
                                        kcompactd_cpu_online, NULL);
        if (ret < 0) {
                pr_err("kcompactd: failed to register hotplug callbacks.\n");
                return ret;
        }

        for_each_node_state(nid, N_MEMORY)
                kcompactd_run(nid);
        return 0;
}
subsys_initcall(kcompactd_init)

kcompactd를 사용하기 위해 초기화한다.

코드 라인 6~12에서 cpu가 hot-plug를 통해 CPUHP_AP_ONLINE_DYN 상태로 변경될 때 kcompactd_cpu_online() 함수가 호출될 수 있도록 등록한다.
코드 라인 14~15에서 모든 메모리 노드에 대해 kcompactd를 실행시킨다.

kcompactd_run()

mm/compaction.c

/*
 * This kcompactd start function will be called by init and node-hot-add.
 * On node-hot-add, kcompactd will moved to proper cpus if cpus are hot-added.
 */

int kcompactd_run(int nid)
{
        pg_data_t *pgdat = NODE_DATA(nid);
        int ret = 0;

        if (pgdat->kcompactd)
                return 0;

        pgdat->kcompactd = kthread_run(kcompactd, pgdat, "kcompactd%d", nid);
        if (IS_ERR(pgdat->kcompactd)) {
                pr_err("Failed to start kcompactd on node %d\n", nid);
                ret = PTR_ERR(pgdat->kcompactd);
                pgdat->kcompactd = NULL;
        }
        return ret;
}

kcompactd 스레드를 동작시킨다.

코드 라인 3~7에서 @nid 노드의 kcompactd 스레드가 이미 실행 중인 경우 skip 하기 위해 0을 반환한다.
코드 라인 9~14에서 kcompactd 스레드를 동작시킨다.

kcompactd 동작

kcompactd()

mm/compaction.c

/*
 * The background compaction daemon, started as a kernel thread
 * from the init process.
 */

static int kcompactd(void *p)
{
        pg_data_t *pgdat = (pg_data_t*)p;
        struct task_struct *tsk = current;

        const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);

        if (!cpumask_empty(cpumask))
                set_cpus_allowed_ptr(tsk, cpumask);

        set_freezable();

        pgdat->kcompactd_max_order = 0;
        pgdat->kcompactd_classzone_idx = pgdat->nr_zones - 1;

        while (!kthread_should_stop()) {
                unsigned long pflags;

                trace_mm_compaction_kcompactd_sleep(pgdat->node_id);
                wait_event_freezable(pgdat->kcompactd_wait,
                                kcompactd_work_requested(pgdat));

                psi_memstall_enter(&pflags);
                kcompactd_do_work(pgdat);
                psi_memstall_leave(&pflags);
        }

        return 0;
}

각 메모리 노드에서 동작되는 kcompactd 스레드는 슬립한 상태에 있다가 kswapd가 페이지 회수를 진행한 후 호출되어 깨어나면 백그라운드에서 compaction을 진행하고 다시 슬립한다.

코드 라인 3~9에서 현재 kcompactd 스레드를 요청 노드에 포함된 cpu들에서만 동작할 수 있도록 cpu 비트마스크를 지정한다.
코드 라인11에서 태스크를 freeze할 수 있도록 PF_NOFREEZE 플래그를 제거한다.
- 참고: freeze | 문c
코드 라인 13~14에서 kcompactd의 최대 order와 존을 리셋한다.
코드 라인 16~26에서 종료 요청이 없는 한 계속 루프를 돌며 슬립한 후 외부 요청에 의해 깨어나면 compaction을 수행한다.

다음 그림은 kcompactd 스레드가 동작하는 과정을 보여준다.

kcompactd_do_work()

mm/compaction.c

static void kcompactd_do_work(pg_data_t *pgdat)
{
        /*
         * With no special task, compact all zones so that a page of requested
         * order is allocatable.
         */
        int zoneid;
        struct zone *zone;
        struct compact_control cc = {
                .order = pgdat->kcompactd_max_order,
                .total_migrate_scanned = 0,
                .total_free_scanned = 0,
                .classzone_idx = pgdat->kcompactd_classzone_idx,
                .mode = MIGRATE_SYNC_LIGHT,
                .ignore_skip_hint = false,
                .gfp_mask = GFP_KERNEL,
        };
        trace_mm_compaction_kcompactd_wake(pgdat->node_id, cc.order,
                                                        cc.classzone_idx);
        count_compact_event(KCOMPACTD_WAKE);

        for (zoneid = 0; zoneid <= cc.classzone_idx; zoneid++) {
                int status;

                zone = &pgdat->node_zones[zoneid];
                if (!populated_zone(zone))
                        continue;

                if (compaction_deferred(zone, cc.order))
                        continue;

                if (compaction_suitable(zone, cc.order, 0, zoneid) !=
                                                        COMPACT_CONTINUE)
                        continue;

                cc.nr_freepages = 0;
                cc.nr_migratepages = 0;
                cc.total_migrate_scanned = 0;
                cc.total_free_scanned = 0;
                cc.zone = zone;
                INIT_LIST_HEAD(&cc.freepages);
                INIT_LIST_HEAD(&cc.migratepages);

                if (kthread_should_stop())
                        return;
                status = compact_zone(zone, &cc);

                if (status == COMPACT_SUCCESS) {
                        compaction_defer_reset(zone, cc.order, false);
                } else if (status == COMPACT_PARTIAL_SKIPPED || status == COMPACT_COMPLETE) {
                        /*
                         * Buddy pages may become stranded on pcps that could
                         * otherwise coalesce on the zone's free area for
                         * order >= cc.order.  This is ratelimited by the
                         * upcoming deferral.
                         */
                        drain_all_pages(zone);

                        /*
                         * We use sync migration mode here, so we defer like
                         * sync direct compaction does.
                         */
                        defer_compaction(zone, cc.order);
                }

                count_compact_events(KCOMPACTD_MIGRATE_SCANNED,
                                     cc.total_migrate_scanned);
                count_compact_events(KCOMPACTD_FREE_SCANNED,
                                     cc.total_free_scanned);

                VM_BUG_ON(!list_empty(&cc.freepages));
                VM_BUG_ON(!list_empty(&cc.migratepages));
        }

        /*
         * Regardless of success, we are done until woken up next. But remember
         * the requested order/classzone_idx in case it was higher/tighter than
         * our current ones
         */
        if (pgdat->kcompactd_max_order <= cc.order)
                pgdat->kcompactd_max_order = 0;
        if (pgdat->kcompactd_classzone_idx >= cc.classzone_idx)
                pgdat->kcompactd_classzone_idx = pgdat->nr_zones - 1;
}

노드에 지정된 kcompactd_classzone_idx 존까지 kcompactd_max_order로 compaction을 수행한다.

코드 라인 9~17에서 kcompactd에서 사용할 compact_control을 다음과 같이 준비한다.
- .order에 외부에서 요청한 오더를 지정한다.
- .classzone_idx에 외부에서 요청한 존 인덱스를 지정한다.
- .mode에 MIGRATE_SYNC_LIGHT를 사용한다.
- skip 힌트를 사용하도록 한다.
코드 라인 20에서 KCOMPACTD_WAKE 카운터를 증가시킨다.
코드 라인 22~27에서 가장 낮은 존부터 요청한 존까지 순회하며 유효하지 않은 존은 스킵한다.
코드 라인 29~30에서 compaction 유예 조건인 존의 경우 스킵한다.
코드 라인 32~34에서 존이 compaction 하기 적절하지 않은 경우 스킵한다.
코드 라인 36~42에서 compaction을 하기 위해 compaction_control 결과를 담을 멤버들을 초기화한다.
코드 라인 44~45에서 스레드 종료 요청인 경우 함수를 빠져나간다.
코드 라인 46에서 존에 대해 compaction을 수행하고 결과를 알아온다.
코드 라인 48~49에서 만일 compaction이 성공한 경우 유예 카운터를 리셋한다.
코드 라인 50~64에서 만일 compaction이 완료될 때까지 원하는 order가 없는 경우 per-cpu 캐시를 회수하고, 유예 한도를 증가시킨다.
코드 라인 66~69에서 KCOMPACTD_MIGRATE_SCANNED 카운터 및 KCOMPACTD_FREE_SCANNED 카운터를 갱신한다.
코드 라인 80~81에서 진행 order보다 외부 요청 order가 작거나 동일하면 다음에 wakeup 하지 않도록 max_order를 0으로 리셋한다.
코드 라인 82~83에서 진행 존보다 외부 요청 존이 더 크거나 동일하면 다음에 시작할 존을 가장 높은 존으로 리셋한다.

Kcompatd 깨우기

wakeup_kcompactd()

mm/compaction.c

void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx)
{
        if (!order)
                return;

        if (pgdat->kcompactd_max_order < order)
                pgdat->kcompactd_max_order = order;

        if (pgdat->kcompactd_classzone_idx > classzone_idx)
                pgdat->kcompactd_classzone_idx = classzone_idx;

        /*
         * Pairs with implicit barrier in wait_event_freezable()
         * such that wakeups are not missed.
         */
        if (!wq_has_sleeper(&pgdat->kcompactd_wait))
                return;

        if (!kcompactd_node_suitable(pgdat))
                return;

        trace_mm_compaction_wakeup_kcompactd(pgdat->node_id, order,
                                                        classzone_idx);
        wake_up_interruptible(&pgdat->kcompactd_wait);
}

kcompactd 스레드를 깨운다.

코드 라인 3~4에서 order 값이 0인 경우 kcompactd를 깨우지 않고 함수를 빠져나간다.
코드 라인 6~7에서 @order가 kcompactd_max_order 보다 큰 경우 kcompactd_max_order를 갱신한다.
코드 라인 9~10에서 @classzone_idx가 kcompactd_classzone_idx보다 작은 경우 kcompactd_classzone_idx를 갱신한다.
코드 라인 16~17에서 kcompactd가 이미 깨어있으면 함수를 빠져나간다.
코드 라인 19~20에서 compaction을 진행해도 효과가 없을만한 노드의 경우 함수를 빠져나간다.
코드 라인 24에서 kcompactd를 깨운다.

kcompactd_node_suitable()

mm/compaction.c

static bool kcompactd_node_suitable(pg_data_t *pgdat)
{
        int zoneid;
        struct zone *zone;
        enum zone_type classzone_idx = pgdat->kcompactd_classzone_idx;

        for (zoneid = 0; zoneid <= classzone_idx; zoneid++) {
                zone = &pgdat->node_zones[zoneid];

                if (!populated_zone(zone))
                        continue;

                if (compaction_suitable(zone, pgdat->kcompactd_max_order, 0,
                                        classzone_idx) == COMPACT_CONTINUE)
                        return true;
        }

        return false;
}

kcompactd를 수행하기 위해 요청한 노드의 가용한 존들에 대해 하나라도 compaction 효과가 있을만한 존이 있는지 여부를 반환한다.

코드 라인 5~11에서 요청한 노드의 kcompactd_classzone_idx 까지 가용한 존들에 대해 순회한다.
코드 라인 13~15에서 순회 중인 존들 중 하나라도 kcompactd_max_order 값을 사용하여 compaction 효과가 있다고 판단하면 true를 반환한다.
코드 라인 18에서 해당 노드의 모든 존들에 대해 compaction 효과를 볼 수 없어 false를 반환한다.

기타

swapper_spaces[] 배열

mm/swap_state.c

struct address_space swapper_spaces[MAX_SWAPFILES];

swap_aops

mm/swap_state.c

/*                                      
 * swapper_space is a fiction, retained to simplify the path through
 * vmscan's shrink_page_list.
 */
static const struct address_space_operations swap_aops = {
        .writepage      = swap_writepage,
        .set_page_dirty = swap_set_page_dirty,
#ifdef CONFIG_MIGRATION
        .migratepage    = migrate_page,
#endif 
};

address_space_operations 구조체

include/linux/fs.h

struct address_space_operations {
        int (*writepage)(struct page *page, struct writeback_control *wbc);
        int (*readpage)(struct file *, struct page *);

        /* Write back some dirty pages from this mapping. */
        int (*writepages)(struct address_space *, struct writeback_control *);

        /* Set a page dirty.  Return true if this dirtied it */
        int (*set_page_dirty)(struct page *page);

        /*
         * Reads in the requested pages. Unlike ->readpage(), this is
         * PURELY used for read-ahead!.
         */
        int (*readpages)(struct file *filp, struct address_space *mapping,
                        struct list_head *pages, unsigned nr_pages);

        int (*write_begin)(struct file *, struct address_space *mapping,
                                loff_t pos, unsigned len, unsigned flags,
                                struct page **pagep, void **fsdata);
        int (*write_end)(struct file *, struct address_space *mapping,
                                loff_t pos, unsigned len, unsigned copied,
                                struct page *page, void *fsdata);

        /* Unfortunately this kludge is needed for FIBMAP. Don't use it */
        sector_t (*bmap)(struct address_space *, sector_t);
        void (*invalidatepage) (struct page *, unsigned int, unsigned int);
        int (*releasepage) (struct page *, gfp_t);
        void (*freepage)(struct page *);
        ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
        /*
         * migrate the contents of a page to the specified target. If
         * migrate_mode is MIGRATE_ASYNC, it must not block.
         */
        int (*migratepage) (struct address_space *,
                        struct page *, struct page *, enum migrate_mode);
        bool (*isolate_page)(struct page *, isolate_mode_t);
        void (*putback_page)(struct page *);
        int (*launder_page) (struct page *);
        int (*is_partially_uptodate) (struct page *, unsigned long,
                                        unsigned long);
        void (*is_dirty_writeback) (struct page *, bool *, bool *);
        int (*error_remove_page)(struct address_space *, struct page *);

        /* swapfile support */
        int (*swap_activate)(struct swap_info_struct *sis, struct file *file,
                                sector_t *span);
        void (*swap_deactivate)(struct file *file);
};

참고

Zoned Allocator -1- (물리 페이지 할당-Fastpath) | 문c
Zoned Allocator -2- (물리 페이지 할당-Slowpath) | 문c
Zoned Allocator -3- (Buddy 페이지 할당) | 문c
Zoned Allocator -4- (Buddy 페이지 해지) | 문c
Zoned Allocator -5- (Per-CPU Page Frame Cache) | 문c
Zoned Allocator -6- (Watermark) | 문c
Zoned Allocator -7- (Direct Compact) | 문c
Zoned Allocator -8- (Direct Compact-Isolation) | 문c
Zoned Allocator -9- (Direct Compact-Migration) | 문c
Zoned Allocator -10- (LRU & pagevec) | 문c
Zoned Allocator -11- (Direct Reclaim) | 문c
Zoned Allocator -12- (Direct Reclaim-Shrink-1) | 문c
Zoned Allocator -13- (Direct Reclaim-Shrink-2) | 문c
Zoned Allocator -14- (Kswapd & Kcompactd) | 문c – 현재 글

Freeze (hibernation/suspend)

2016-11-302016-11-30 문영일 Leave a comment

리눅스 시스템 차원의 서스펜드 또는 하이버네이션 기능으로 현재 동작중인 모든 유저 스레드와 몇 개의 커널 스레드들을 얼려 정지시킨다.

유저 스레드들은 시그널 핸들링 코드에 의해 try_to_freeze() 호출이 되면서 freeze가 된다.
커널 스레드들은 명확하게 적절한 위치에서 freeze 요청을 확인하고 호출해줘야 한다.
freeze 불가능한 작업큐에 있는 태스크들이 완료될 때 까지 대기하는데 딜레이 된 태스크들에 대해서는 전부 active 시킨 후 대기한다.

Freeze

freezing()

include/linux/freezer.h

/*
 * Check if there is a request to freeze a process
 */
static inline bool freezing(struct task_struct *p) 
{
        if (likely(!atomic_read(&system_freezing_cnt)))
                return false;
        return freezing_slow_path(p);
}

현재 태스크가 freeze 상태로 들어갈 수 있는지 여부를 확인하여 반환한다. (true=freeze 가능, false=freeze 불가능)

코드 라인 06~07에서 높은 확률로 전역 system_freezing_cnt가 0인 경우 freeze 요청이 없으므로 false를 반환한다.
코드 라인 08에서 현재 태스크가 freeze 상태로 들어갈 수 있는지 여부를 확인하여 반환한다.

freezing_slow_path()

kernel/freezer.c

/**
 * freezing_slow_path - slow path for testing whether a task needs to be frozen
 * @p: task to be tested
 *
 * This function is called by freezing() if system_freezing_cnt isn't zero
 * and tests whether @p needs to enter and stay in frozen state.  Can be
 * called under any context.  The freezers are responsible for ensuring the
 * target tasks see the updated state.
 */
bool freezing_slow_path(struct task_struct *p)
{
        if (p->flags & (PF_NOFREEZE | PF_SUSPEND_TASK))
                return false;

        if (test_thread_flag(TIF_MEMDIE))
                return false;

        if (pm_nosig_freezing || cgroup_freezing(p))
                return true;

        if (pm_freezing && !(p->flags & PF_KTHREAD))
                return true;

        return false;
}
EXPORT_SYMBOL(freezing_slow_path);

현재 태스크가 freeze 상태로 들어갈 수 있는지 여부를 확인하여 반환한다.

코드 라인 12~13에서 현재 태스크에 PF_NOFREEZE 또는 PF_SUSPEND_TASK 플래그가 설정된 경우 freeze 하지 못하게 false를 반환한다.
코드 라인 15~16에서 현재 태스크에 TIF_MEMDIE 플래그가 있는 경우 freeze 하지 못하게 false를 반환한다.
코드 라인18~19에서 전역 pm_nosig_freezing이 true이거나 현재 태스크에 대해 cgroup에 freezing이 설정된 경우 freezing을 할 수 있게 true를 반환한다.
코드 라인 21~22에서 전역 pm_freezing이 true이면서 현재 태스크가 커널 스레드가 아닌 경우 freezing을 할 수 있게 true를 반환한다.

try_to_freeze_tasks()

kernel/power/process.c

static int try_to_freeze_tasks(bool user_only)
{
        struct task_struct *g, *p;
        unsigned long end_time;
        unsigned int todo;
        bool wq_busy = false;
        struct timeval start, end;
        u64 elapsed_msecs64;
        unsigned int elapsed_msecs;
        bool wakeup = false;
        int sleep_usecs = USEC_PER_MSEC;

        do_gettimeofday(&start);

        end_time = jiffies + msecs_to_jiffies(freeze_timeout_msecs);

        if (!user_only)
                freeze_workqueues_begin();

        while (true) {
                todo = 0;
                read_lock(&tasklist_lock);
                for_each_process_thread(g, p) {
                        if (p == current || !freeze_task(p))
                                continue;

                        if (!freezer_should_skip(p))
                                todo++;
                }
                read_unlock(&tasklist_lock);

                if (!user_only) {
                        wq_busy = freeze_workqueues_busy();
                        todo += wq_busy;
                }

                if (!todo || time_after(jiffies, end_time))
                        break;

                if (pm_wakeup_pending()) {
                        wakeup = true;
                        break;
                }

                /*
                 * We need to retry, but first give the freezing tasks some
                 * time to enter the refrigerator.  Start with an initial
                 * 1 ms sleep followed by exponential backoff until 8 ms.
                 */
                usleep_range(sleep_usecs / 2, sleep_usecs);
                if (sleep_usecs < 8 * USEC_PER_MSEC)
                        sleep_usecs *= 2;
        }

        do_gettimeofday(&end);
        elapsed_msecs64 = timeval_to_ns(&end) - timeval_to_ns(&start);
        do_div(elapsed_msecs64, NSEC_PER_MSEC);
        elapsed_msecs = elapsed_msecs64;

        if (todo) {
                pr_cont("\n");
                pr_err("Freezing of tasks %s after %d.%03d seconds "
                       "(%d tasks refusing to freeze, wq_busy=%d):\n",
                       wakeup ? "aborted" : "failed",
                       elapsed_msecs / 1000, elapsed_msecs % 1000,
                       todo - wq_busy, wq_busy);

                if (!wakeup) {
                        read_lock(&tasklist_lock);
                        for_each_process_thread(g, p) {
                                if (p != current && !freezer_should_skip(p)
                                    && freezing(p) && !frozen(p))
                                        sched_show_task(p);
                        }
                        read_unlock(&tasklist_lock);
                }
        } else {
                pr_cont("(elapsed %d.%03d seconds) ", elapsed_msecs / 1000,
                        elapsed_msecs % 1000);
        }

        return todo ? -EBUSY : 0;
}

유저 스레드 및 커널 스레드들을 freeze 시도한다. 정상적으로 freeze한 경우 0을 반환하고 그렇지 않은 경우 -EBUSY를 반환한다. 인수로 user_only가 false인 경우 freeze 불가능한 작업큐에서 동작하는 태스크와 지연된 태스크들에 대해 모두 처리가 완료될 때까지 최대 20초를 기다린다.

코드 라인 15에서 현재 시간으로 부터 20초를 타임아웃으로 설정한다.
- freeze_timeout_msecs
  - 20 * MSEC_PER_SEC;
코드 라인 17~18에서 user_only가 false인 경우 freeze 할 수 없는 워크 큐에 있는 지연된 태스크들을 모두 pool 워크큐로 옮기고 activate 시킨다.
코드 라인 20~30에서 freeze되지 않고 아직 남아 있는 스레드 수를 todo로 산출한다.
- 모든 스레드들에서 현재 태스크와 freeze되지 않은 태스크들에 대해 skip 한다.
- freeze된 스레드들에서 freezer가 skip 해야 하는 경우가 아니었다면 todo를 증가시킨다.
코드 라인 32~35에서 pool 워크큐에서 여전히 activate 된 수를 todo에 더한다.
- user_only가 false인 경우 pool 워크큐가 여전히 active 된 태스크들이 존재하면 todo를 증가시킨다.
코드 라인 37~38에서 처리할 항목이 없거나 타임 오버된 경우 루프를 탈출한다.
코드 라인 40~43에서 suspend(freeze)가 포기된 경우 wakeup 준비한다.
코드 라인 50~53에서 sleep_usecs의 절반 ~ sleep_usecs 범위에서 sleep 한 후 sleep_usces가 8ms 미만인 경우 sleep_usecs를 2배로 키운다.
- 처음 sleep_usecs 값은 1000으로 1ms이다.
코드 라인 55~58에서 소요 시간을 산출하여 elapsed_msecs에 저장한다.
코드 라인 60~76에서 freeze 되지 않은 항목이 남아 있는 경우 이에 대한 에러 출력을 하고 해당 스레드에 대한 정보를 자세히 출력한다.

기타

다음은 freeze와 관련된 함수이다.

freeze_processes()
freeze_kernel_threads()
thaw_processes()

참고

Documentation/power/freezing-of-tasks.txt

Per-cpu -3- (동적 할당)

2016-11-112019-08-04 문영일 Leave a comment

Per-cpu -3- (동적 할당)

alloc_percpu()

include/linux/percpu.h

#define alloc_percpu(type)                                              \
        (typeof(type) __percpu *)__alloc_percpu(sizeof(type),           \
                                                __alignof__(type))

요청 타입의 per-cpu 메모리를 할당한다.

__alloc_percpu()

mm/percpu.c

/**
 * __alloc_percpu - allocate dynamic percpu area
 * @size: size of area to allocate in bytes
 * @align: alignment of area (max PAGE_SIZE)
 *
 * Equivalent to __alloc_percpu_gfp(size, align, %GFP_KERNEL).
 */

void __percpu *__alloc_percpu(size_t size, size_t align)
{
        return pcpu_alloc(size, align, false, GFP_KERNEL);
}
EXPORT_SYMBOL_GPL(__alloc_percpu);

요청 @size 및 @align 값으로 per-cpu 메모리를 할당한다.

alloc_percpu_gfp()

include/linux/percpu.h

#define alloc_percpu_gfp(type, gfp)                                     \
        (typeof(type) __percpu *)__alloc_percpu_gfp(sizeof(type),       \
                                                __alignof__(type), gfp)

요청 타입 및 gfp 플래그를 사용하여 per-cpu 메모리를 할당한다.

__alloc_percpu_gfp()

mm/percpu.c

/**
 * __alloc_percpu_gfp - allocate dynamic percpu area
 * @size: size of area to allocate in bytes
 * @align: alignment of area (max PAGE_SIZE)
 * @gfp: allocation flags
 *
 * Allocate zero-filled percpu area of @size bytes aligned at @align.  If
 * @gfp doesn't contain %GFP_KERNEL, the allocation doesn't block and can
 * be called from any context but is a lot more likely to fail. If @gfp
 * has __GFP_NOWARN then no warning will be triggered on invalid or failed
 * allocation requests.
 *
 * RETURNS:
 * Percpu pointer to the allocated area on success, NULL on failure.
 */

void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp)
{
        return pcpu_alloc(size, align, false, gfp);
}
EXPORT_SYMBOL_GPL(__alloc_percpu_gfp);

요청 size, align 및 gfp 플래그 값으로 per-cpu 메모리를 할당한다.

__alloc_reserved_percpu()

mm/percpu.c

/**
 * __alloc_reserved_percpu - allocate reserved percpu area
 * @size: size of area to allocate in bytes
 * @align: alignment of area (max PAGE_SIZE)
 *
 * Allocate zero-filled percpu area of @size bytes aligned at @align
 * from reserved percpu area if arch has set it up; otherwise,
 * allocation is served from the same dynamic area.  Might sleep.
 * Might trigger writeouts.
 *
 * CONTEXT:
 * Does GFP_KERNEL allocation.
 *
 * RETURNS:
 * Percpu pointer to the allocated area on success, NULL on failure.
 */

void __percpu *__alloc_reserved_percpu(size_t size, size_t align)
{
        return pcpu_alloc(size, align, true, GFP_KERNEL);
}

컴파일 타임에 모듈에서 사용된 static per-cpu 데이터 선언 영역들은 곧바로 사용될 수 있는 데이터 공간이 아니다. 이들은 런타임에 모듈이 로드될 때 이 함수가 호출되어 first chunk의 reserved 영역 범위내에서 할당한다.

pcpu 동적 할당 메인

pcpu_alloc()

mm/percpu.c -1/3-

/**
 * pcpu_alloc - the percpu allocator
 * @size: size of area to allocate in bytes
 * @align: alignment of area (max PAGE_SIZE)
 * @reserved: allocate from the reserved chunk if available
 * @gfp: allocation flags
 *
 * Allocate percpu area of @size bytes aligned at @align.  If @gfp doesn't
 * contain %GFP_KERNEL, the allocation is atomic. If @gfp has __GFP_NOWARN
 * then no warning will be triggered on invalid or failed allocation
 * requests.
 *
 * RETURNS:
 * Percpu pointer to the allocated area on success, NULL on failure.
 */

static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
                                 gfp_t gfp)
{
        /* whitelisted flags that can be passed to the backing allocators */
        gfp_t pcpu_gfp = gfp & (GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
        bool is_atomic = (gfp & GFP_KERNEL) != GFP_KERNEL;
        bool do_warn = !(gfp & __GFP_NOWARN);
        static int warn_limit = 10;
        struct pcpu_chunk *chunk;
        const char *err;
        int slot, off, cpu, ret;
        unsigned long flags;
        void __percpu *ptr;
        size_t bits, bit_align;

        /*
         * There is now a minimum allocation size of PCPU_MIN_ALLOC_SIZE,
         * therefore alignment must be a minimum of that many bytes.
         * An allocation may have internal fragmentation from rounding up
         * of up to PCPU_MIN_ALLOC_SIZE - 1 bytes.
         */
        if (unlikely(align < PCPU_MIN_ALLOC_SIZE))
                align = PCPU_MIN_ALLOC_SIZE;

        size = ALIGN(size, PCPU_MIN_ALLOC_SIZE);
        bits = size >> PCPU_MIN_ALLOC_SHIFT;
        bit_align = align >> PCPU_MIN_ALLOC_SHIFT;

        if (unlikely(!size || size > PCPU_MIN_UNIT_SIZE || align > PAGE_SIZE ||
                     !is_power_of_2(align))) {
                WARN(do_warn, "illegal size (%zu) or align (%zu) for percpu allocation\n",
                     size, align);
                return NULL;
        }

        if (!is_atomic) {
                /*
                 * pcpu_balance_workfn() allocates memory under this mutex,
                 * and it may wait for memory reclaim. Allow current task
                 * to become OOM victim, in case of memory pressure.
                 */
                if (gfp & __GFP_NOFAIL)
                        mutex_lock(&pcpu_alloc_mutex);
                else if (mutex_lock_killable(&pcpu_alloc_mutex))
                        return NULL;
        }

        spin_lock_irqsave(&pcpu_lock, flags);

        /* serve reserved allocations from the reserved chunk if available */
        if (reserved && pcpu_reserved_chunk) {
                chunk = pcpu_reserved_chunk;

                off = pcpu_find_block_fit(chunk, bits, bit_align, is_atomic);
                if (off < 0) {
                        err = "alloc from reserved chunk failed";
                        goto fail_unlock;
                }

                off = pcpu_alloc_area(chunk, bits, bit_align, off);
                if (off >= 0)
                        goto area_found;

                err = "alloc from reserved chunk failed";
                goto fail_unlock;
        }

요청 size와 align 값으로 per-cpu 메모리를 동적으로 할당한다. 모듈에서 호출하는 경우 reserved를 true로 호출하여 reserved per-cpu 영역에서 할당하게 한다. 할당받을 공간이 부족한 경우 chunk를 새로 추가하는데, 만일 어토믹 요청인 경우에는 확장하지 않고 실패 처리한다.

코드 라인 6에서 어토믹 요청 여부를 파악한다. GFP_KERNEL 옵션을 사용하지 않으면 어토믹 요청이 온 것이다. alloc_percpu( ) 함수 등을 사용하여 호출하는 경우 항상 GFP_KERNEL 옵션을 사용하므로 어토믹 조건을 사용하지 않는다.
- 현재 커널에서 alloc_percpu_gfp( ) 함수를 사용하는 경우에는 gfp 옵션을 바꿀 수 있는데, 실제 적용된 코드에는 아직까지 GFP_KERNEL 옵션 이외의 gfp 옵션을 사용한 경우가 없다. 향후 어토믹 조건을 사용하려고 미리 준비해둔 함수다.
- 어토믹 조건으로 이 함수를 동작시키는 경우 populate된 페이지에서만 per-cpu 데이터를 할당할 수 있게 제한한다. 어토믹 조건이 아닌 경우는 chunk가 부족한 경우 chunk도 생성할 수 있고 unpopulate된 페이지들을 population 과정을 통해 사용할 수 있게 한다.
코드 라인 22~23에서 per-cpu 할당의 최소 정렬 단위를 최소 4바이트로 제한한다.
코드 라인 25~27에서 할당 사이즈는 per-cpu 최소 정렬 단위로 정렬한다. 그리고 산출된 사이즈 및 정렬 단위로 필요 비트 수를 구한다.
- 예) size=32, align=4
  - bits=8, bit_align=1
코드 라인 29~34에서 사이즈가 0이거나 유닛 사이즈를 초과하거나 align이 페이지 단위를 초과하거나 2의 제곱승 단위를 사용하지 않는 경우 경고 메시지를 출력하고 null을 반환한다.
코드 라인 36~46에서 어토믹 할당 요청이 아닌 경우 OOM 상황에서 per-cpu 할당을 위한 lock을 획득하지 않고 중간에 포기하고 null을 반환할 수 있게 한다.
코드 라인 48에서 할당 준비를 하는동안 interrupt를 막는다.
코드 라인 51~66에서 모듈을 위해 사용된 static per-cpu 할당은 first chunk의 reserved 영역을 사용하여 관리한다. 이 chunk에서 할당 가능한지 공간을 확인한 후 할당을 시도한다. 만일 적당한 공간을 찾은 경우 area_found: 레이블로 이동하고, 적절한 공간을 찾지 못한 경우 할당 실패 사유를 출력하고 함수를 빠져나가기 위해 faile_unlock: 레이블로 이동한다.

mm/percpu.c -2/3-

restart:
        /* search through normal chunks */
        for (slot = pcpu_size_to_slot(size); slot < pcpu_nr_slots; slot++) {
                list_for_each_entry(chunk, &pcpu_slot[slot], list) {
                        off = pcpu_find_block_fit(chunk, bits, bit_align,
                                                  is_atomic);
                        if (off < 0)
                                continue;

                        off = pcpu_alloc_area(chunk, bits, bit_align, off);
                        if (off >= 0)
                                goto area_found;

                }
        }

        spin_unlock_irqrestore(&pcpu_lock, flags);

        /*
         * No space left.  Create a new chunk.  We don't want multiple
         * tasks to create chunks simultaneously.  Serialize and create iff
         * there's still no empty chunk after grabbing the mutex.
         */
        if (is_atomic) {
                err = "atomic alloc failed, no space left";
                goto fail;
        }

        if (list_empty(&pcpu_slot[pcpu_nr_slots - 1])) {
                chunk = pcpu_create_chunk(pcpu_gfp);
                if (!chunk) {
                        err = "failed to allocate new chunk";
                        goto fail;
                }

                spin_lock_irqsave(&pcpu_lock, flags);
                pcpu_chunk_relocate(chunk, -1);
        } else {
                spin_lock_irqsave(&pcpu_lock, flags);
        }

        goto restart;

restart: 레이블에서는 dynamic per-cpu 할당에 대한 처리를 수행한다.

코드 라인 3에서 적절한 chunk를 먼저 찾기 위해 할당할 사이즈에 해당하는 슬롯 부터 최상위 슬롯까지 순회한다.
- pcpu_size_to_slot()은 size 범위에 해당되는 슬롯 번호를 리턴한다.
  - 예) size가 44K이라면 13번 슬롯을 리턴한다.
코드 라인 4~8에서 해당 슬롯의 chunk 리스트를 순회하며 할당할 사이즈보다 큰 free 공간이 없으면 skip 한다.
코드 라인 10~12에서 할당 요청한 사이즈 만큼 할당된 경우 area_found; 레이블로 이동한다.
코드 라인 24~27에서 atomic 요청된 경우 재시도를 하지 않고 fail: 레이블로 이동한다.
코드 라인 29~42에서 최상위 슬롯에는 항상 빈 chunk가 있어야 한다. 만일 없는 경우에는 빈 chunk를 생성하고, 최상위 슬롯에 위치시키고 재시도를 하기 위해 restart: 레이블로 이동한다.

mm/percpu.c -3/3-

area_found:
        pcpu_stats_area_alloc(chunk, size);
        spin_unlock_irqrestore(&pcpu_lock, flags);

        /* populate if not all pages are already there */
        if (!is_atomic) {
                int page_start, page_end, rs, re;

                page_start = PFN_DOWN(off);
                page_end = PFN_UP(off + size);

                pcpu_for_each_unpop_region(chunk->populated, rs, re,
                                           page_start, page_end) {
                        WARN_ON(chunk->immutable);

                        ret = pcpu_populate_chunk(chunk, rs, re, pcpu_gfp);

                        spin_lock_irqsave(&pcpu_lock, flags);
                        if (ret) {
                                pcpu_free_area(chunk, off);
                                err = "failed to populate";
                                goto fail_unlock;
                        }
                        pcpu_chunk_populated(chunk, rs, re, true);
                        spin_unlock_irqrestore(&pcpu_lock, flags);
                }

                mutex_unlock(&pcpu_alloc_mutex);
        }

        if (pcpu_nr_empty_pop_pages < PCPU_EMPTY_POP_PAGES_LOW)
                pcpu_schedule_balance_work();

        /* clear the areas and return address relative to base address */
        for_each_possible_cpu(cpu)
                memset((void *)pcpu_chunk_addr(chunk, cpu, 0) + off, 0, size);

        ptr = __addr_to_pcpu_ptr(chunk->base_addr + off);
        kmemleak_alloc_percpu(ptr, size, gfp);

        trace_percpu_alloc_percpu(reserved, is_atomic, size, align,
                        chunk->base_addr, off, ptr);

        return ptr;

fail_unlock:
        spin_unlock_irqrestore(&pcpu_lock, flags);
fail:
        trace_percpu_alloc_percpu_fail(reserved, is_atomic, size, align);

        if (!is_atomic && do_warn && warn_limit) {
                pr_warn("allocation failed, size=%zu align=%zu atomic=%d, %s\n",
                        size, align, is_atomic, err);
                dump_stack();
                if (!--warn_limit)
                        pr_info("limit reached, disable warning\n");
        }
        if (is_atomic) {
                /* see the flag handling in pcpu_blance_workfn() */
                pcpu_atomic_alloc_failed = true;
                pcpu_schedule_balance_work();
        } else {
                mutex_unlock(&pcpu_alloc_mutex);
        }
        return NULL;
}

area_found: 레이블에서는 per-cpu 할당이 성공한 경우 후속 처리를 위한 루틴이 있고, fail: 레이블에는 할당이 실패한 경우 원인에 대한 에러 출력을 수행한 후 null을 반환한다.

코드 라인 2에서 per-cpu 할당에 대한 stat들을 증가 및 갱신한다.
코드 라인 6~29에서 어토믹 처리 요청 중이 아니면 스케줄러를 이용할 필요가 없으므로 즉시 활성화 처리를 한다. 할당받은 페이지의 시작 pfn에서 끝 pfn까지 chunk 내 un-populated 영역의 시작(rs) 주소와 끝(re) 주소들을 알아와서 해당 chunk의 지정된 영역을 populate 한다. 이때 할당된 실제 페이지를 지정된 vmalloc 영역에 매핑한다. 페이지 번호들은 모두 페이지 기준의 PFN이 아니고 chunk의 첫 유닛을 기준으로 한다. 그리고 pcpu_chunk_populated() 함수를 통해 chunk에 해당 영역이 populate되었다는 정보를 비트맵 방식으로 기록한다.
코드 라인 31~32에서 빈 populated 페이지 수가 2개 미만이면 populate를 하기 위해 백그라운드에서 워크큐를 통해 pcpu_balance_workfn( )을 호출하게 하여 하나의 빈 chunk에 어토믹 할당이 가능하도록 populated free pages를 PCPU_EMPTY_POP_PAGES_LOW(2)~HIGH(4)까지 확보한다. 어토믹 연산을 위해 미리 populate된 페이지를 확보해둔다.
코드 라인 35~36에서 할당된 사이즈의 영역을 깨끗이 0으로 청소한다.
코드 라인 38에서 주어진 주소로 per-cpu 포인터 주소로의 변환을 하여 리턴한다. per-cpu 포인터 주소는 유닛 0에 해당하는 실제 데이터의 주소가 아니라 그 주소에서 delta를 뺀 주소를 가리킨다. 실제 사용 시에는 이 값에 해당 cpu가 저장하고 있는 TPIDRPRW 값을 더해 사용한다.
- TPIDRPRW에는 first chunk의 처음 설정 시 cpu에 해당하는 유닛 offset + delta 값이 보관되어 있으며, 이 값은 향후 바뀌지 않는다.
- delta 값은 first chunk를 처음 설정할 때 계산된 가장 낮은 노드(그룹)의 base offset에서 per-cpu 섹션의 시작 주소를 뺀 가상의 주소다.
- 실제 static 변수의 경우 per-cpu 섹션에 컴파일 타임에 만들어진 주소를 가지며, 동적 할당에서는 그 delta 값을 고려하여 미리 감소시킨 값을 가리킨다. 따라서 어떠한 경우에도 per-cpu 포인터 값을 액세스하면 안 된다. static per-cpu 데이터, 동적으로 할당되어 사용하는 per-cpu 데이터 상관없이 동일한 this_cpu_ptr( ) 등의 API 함수를 사용하게 하기 위해 고려되었다.
코드 라인 39~44에서 메모리 누수 감시를 위해 객체를 등록하고 성공했으므로 함수를 빠져나간다.
코드 라인 46~57에서 할당에 실패한 경우 이 루틴으로 진입한다. 어토믹 할당 요청을 받은 경우가 아니면 경고를 출력한다.
코드 라인 58~65에서 어토믹 할당 요청을 받은 경우 pcpu_schedule_balance_work( ) 루틴을 호출하여 워크큐에서 별도로 스케줄 할당을 받아 populated된 free 페이지의 할당을 준비한다.

chunk 내 필요 free 공간 검색

pcpu_find_block_fit()

mm/percpu.c

/**
 * pcpu_find_block_fit - finds the block index to start searching
 * @chunk: chunk of interest
 * @alloc_bits: size of request in allocation units
 * @align: alignment of area (max PAGE_SIZE bytes)
 * @pop_only: use populated regions only
 *
 * Given a chunk and an allocation spec, find the offset to begin searching
 * for a free region.  This iterates over the bitmap metadata blocks to
 * find an offset that will be guaranteed to fit the requirements.  It is
 * not quite first fit as if the allocation does not fit in the contig hint
 * of a block or chunk, it is skipped.  This errs on the side of caution
 * to prevent excess iteration.  Poor alignment can cause the allocator to
 * skip over blocks and chunks that have valid free areas.
 *
 * RETURNS:
 * The offset in the bitmap to begin searching.
 * -1 if no offset is found.
 */

static int pcpu_find_block_fit(struct pcpu_chunk *chunk, int alloc_bits,
                               size_t align, bool pop_only)
{
        int bit_off, bits, next_off;

        /*
         * Check to see if the allocation can fit in the chunk's contig hint.
         * This is an optimization to prevent scanning by assuming if it
         * cannot fit in the global hint, there is memory pressure and creating
         * a new chunk would happen soon.
         */
        bit_off = ALIGN(chunk->contig_bits_start, align) -
                  chunk->contig_bits_start;
        if (bit_off + alloc_bits > chunk->contig_bits)
                return -1;

        bit_off = chunk->first_bit;
        bits = 0;
        pcpu_for_each_fit_region(chunk, alloc_bits, align, bit_off, bits) {
                if (!pop_only || pcpu_is_populated(chunk, bit_off, bits,
                                                   &next_off))
                        break;

                bit_off = next_off;
                bits = 0;
        }

        if (bit_off == pcpu_chunk_map_bits(chunk))
                return -1;

        return bit_off;
}

@chunk 내에서 적절한 빈 공간을 찾아 chunk 기준 bit_off 값을 반환한다. @pop_only가 1로 주어진 경우 populate 페이지 들에서만 검색한다.

코드 라인 12~15에서 최대 연속된 free 공간을 보고 @align 정렬 단위로 @alloc_bits 만큼 할당할 공간이 없으면 -1을 반환한다.
코드 라인 17~19에서 bit_off 위치부터 시작하여 bits 단위로 순회하며 align 조건이 만족하는 빈 공간을 찾아 bit_off를 구한다.
코드 라인 20~22에서 @pop_only가 0인 경우 처음 찾은 위치가 확정된다. 그렇지 않고 @pop_only가 1로 설정된 경우 @bit_off 부터 bits 만큼의 공간이 populate된 공간인지를 확인하여 확정한다. 만일 populate 공간이 아니면 next_off에 다음 populate 페이지의 시작 비트를 가리키는 값을 반환한다.
코드라인24~26에서 다음 populate 페이지의 시작 비트를 next_off로 가져왔으므로 이를 bit_off에 넣고 계속 루프를 반복한다.
코드 라인 28~31에서 할당이 실패한 경우 -1을 반환하고, 성공한 경우 bit_off를 반환한다.

pcpu_for_each_fit_region()

mm/percpu.c

#define pcpu_for_each_fit_region(chunk, alloc_bits, align, bit_off, bits)     \
        for (pcpu_next_fit_region((chunk), (alloc_bits), (align), &(bit_off), \
                                  &(bits));                                   \
             (bit_off) < pcpu_chunk_map_bits((chunk));                        \
             (bit_off) += (bits),                                             \
             pcpu_next_fit_region((chunk), (alloc_bits), (align), &(bit_off), \
                                  &(bits)))

@chunk 내에서 입출력인자 @bit_off부터 시작하여 @align 단위로 @alloc_bits 만큼 free 공간이 확보 가능한 영역을 찾아 입출력 인자 @bit_off에 비트 오프셋 위치와, 출력 인자 @bits에 free 영역의 사이즈를 반환한다. 참고로 각 비트는 4바이트 단위의 할당 상태를 표시한다.

pcpu_next_fit_region()

mm/percpu.c

/**
 * pcpu_next_fit_region - finds fit areas for a given allocation request
 * @chunk: chunk of interest
 * @alloc_bits: size of allocation
 * @align: alignment of area (max PAGE_SIZE)
 * @bit_off: chunk offset
 * @bits: size of free area
 *
 * Finds the next free region that is viable for use with a given size and
 * alignment.  This only returns if there is a valid area to be used for this
 * allocation.  block->first_free is returned if the allocation request fits
 * within the block to see if the request can be fulfilled prior to the contig
 * hint.
 */

static void pcpu_next_fit_region(struct pcpu_chunk *chunk, int alloc_bits,
                                 int align, int *bit_off, int *bits)
{
        int i = pcpu_off_to_block_index(*bit_off);
        int block_off = pcpu_off_to_block_off(*bit_off);
        struct pcpu_block_md *block;

        *bits = 0;
        for (block = chunk->md_blocks + i; i < pcpu_chunk_nr_blocks(chunk);
             block++, i++) {
                /* handles contig area across blocks */
                if (*bits) {
                        *bits += block->left_free;
                        if (*bits >= alloc_bits)
                                return;
                        if (block->left_free == PCPU_BITMAP_BLOCK_BITS)
                                continue;
                }

                /* check block->contig_hint */
                *bits = ALIGN(block->contig_hint_start, align) -
                        block->contig_hint_start;
                /*
                 * This uses the block offset to determine if this has been
                 * checked in the prior iteration.
                 */
                if (block->contig_hint &&
                    block->contig_hint_start >= block_off &&
                    block->contig_hint >= *bits + alloc_bits) {
                        *bits += alloc_bits + block->contig_hint_start -
                                 block->first_free;
                        *bit_off = pcpu_block_off_to_off(i, block->first_free);
                        return;
                }
                /* reset to satisfy the second predicate above */
                block_off = 0;

                *bit_off = ALIGN(PCPU_BITMAP_BLOCK_BITS - block->right_free,
                                 align);
                *bits = PCPU_BITMAP_BLOCK_BITS - *bit_off;
                *bit_off = pcpu_block_off_to_off(i, *bit_off);
                if (*bits >= alloc_bits)
                        return;
        }

        /* no valid offsets were found - fail condition */
        *bit_off = pcpu_chunk_map_bits(chunk);
}

@chunk 내에서 입출력 인자 @bit_off부터 시작하여 @align 단위로 @alloc_bits 만큼 free 공간이 확보 가능한 영역을 찾는다. 그런 후 입출력 인자 @bit_off에 찾은 free 영역의 시작 비트 오프셋 위치와, 출력 인자 @bits에 free 영역의 사이즈를 반환한다. 참고로 각 비트는 4바이트 단위의 할당 상태를 표시한다.

코드 라인 9~18에서 청크에서 pcpu 블럭(페이지) 수 만큼 순회하며 처음 호출되어 @bits가 0인 경우 가장 첫 free 공간에서 할당 가능하면 @bits를 갱신하여 함수를 빠져나오고, 그렇지 않고 free 공간의 중간 블럭인 경우 skip 한다.
코드 라인 21~34에서 free 공간에서 할당 가능하면 그 위치와 사이즈를 산출하여 반환한다.
코드 라인 36~43에서 가장 우측 free 공간에서 할당 가능한 경우 그 위치와 사이즈를 산출하여 반환한다.
코드 라인 47에서 더 이상 찾지 못한 경우 함수 밖에 있는 루프를 끝내기 위해 @bit_off에 마지막 값을 담는다.

할당한 영역을 비트맵에 표기

pcpu_alloc_area()

mm/percpu.c

/**
 * pcpu_alloc_area - allocates an area from a pcpu_chunk
 * @chunk: chunk of interest
 * @alloc_bits: size of request in allocation units
 * @align: alignment of area (max PAGE_SIZE)
 * @start: bit_off to start searching
 *
 * This function takes in a @start offset to begin searching to fit an
 * allocation of @alloc_bits with alignment @align.  It needs to scan
 * the allocation map because if it fits within the block's contig hint,
 * @start will be block->first_free. This is an attempt to fill the
 * allocation prior to breaking the contig hint.  The allocation and
 * boundary maps are updated accordingly if it confirms a valid
 * free area.
 *
 * RETURNS:
 * Allocated addr offset in @chunk on success.
 * -1 if no matching area is found.
 */

static int pcpu_alloc_area(struct pcpu_chunk *chunk, int alloc_bits,
                           size_t align, int start)
{
        size_t align_mask = (align) ? (align - 1) : 0;
        int bit_off, end, oslot;

        lockdep_assert_held(&pcpu_lock);

        oslot = pcpu_chunk_slot(chunk);

        /*
         * Search to find a fit.
         */
        end = start + alloc_bits + PCPU_BITMAP_BLOCK_BITS;
        bit_off = bitmap_find_next_zero_area(chunk->alloc_map, end, start,
                                             alloc_bits, align_mask);
        if (bit_off >= end)
                return -1;

        /* update alloc map */
        bitmap_set(chunk->alloc_map, bit_off, alloc_bits);

        /* update boundary map */
        set_bit(bit_off, chunk->bound_map);
        bitmap_clear(chunk->bound_map, bit_off + 1, alloc_bits - 1);
        set_bit(bit_off + alloc_bits, chunk->bound_map);

        chunk->free_bytes -= alloc_bits * PCPU_MIN_ALLOC_SIZE;

        /* update first free bit */
        if (bit_off == chunk->first_bit)
                chunk->first_bit = find_next_zero_bit(
                                        chunk->alloc_map,
                                        pcpu_chunk_map_bits(chunk),
                                        bit_off + alloc_bits);

        pcpu_block_update_hint_alloc(chunk, bit_off, alloc_bits);

        pcpu_chunk_relocate(chunk, oslot);

        return bit_off * PCPU_MIN_ALLOC_SIZE;
}

per-cpu 할당 영역에 대한 비트맵, 블럭 메타데이터들 및 각종 관련 정보들을 갱신한다.

코드 라인 9에서 할당 영역 갱신으로 인해 슬롯 이동이 될 수 있으므로 현재 슬롯 번호를 알아온다.
코드 라인 14~18에서 범위내 @align 정렬 단위로 @alloc_bits 만큼의 free 영역을 찾는다. 적합한 free 영역이 없는 경우 -1을 반환한다.
코드 라인 21에서 할당 범위의 할당맵을 모두 1로 채운다.
코드 라인 24~26에서 할당 범위의 경계 맵을 모두 클리어하고 시작과 끝+1 비트만 1로 설정한다.
코드 라인 28에서 남은 free 바이트를 갱신한다.
코드 라인 31~35에서 만일 할당한 영역이 첫 번째 free 공간인 경우 첫 번째 free 공간 비트 위치를 갱신한다.
코드 라인 37에서 chunk내의 per-cpu 블럭 메타데이터들을 갱신한다.
코드 라인 39에서 슬롯의 이동이 필요한 경우 갱신한다.
코드 라인 41에서 할당이 성공한 경우이다. 할당된 영역의 비트 offset을 반환한다.

할당 페이지 범위 활성화

pcpu_populate_chunk()

mm/percpu-vm.c

/**
 * pcpu_populate_chunk - populate and map an area of a pcpu_chunk
 * @chunk: chunk of interest
 * @page_start: the start page
 * @page_end: the end page
 * For each cpu, populate and map pages [@page_start,@page_end) into
 *
 * For each cpu, populate and map pages [@page_start,@page_end) into
 * @chunk.
 *
 * CONTEXT:
 * pcpu_alloc_mutex, does GFP_KERNEL allocation.
 */

static int pcpu_populate_chunk(struct pcpu_chunk *chunk,
                               int page_start, int page_end, gfp_t gfp)
{
        struct page **pages;

        pages = pcpu_get_pages();
        if (!pages)
                return -ENOMEM;

        if (pcpu_alloc_pages(chunk, pages, page_start, page_end, gfp))
                return -ENOMEM;

        if (pcpu_map_pages(chunk, pages, page_start, page_end)) {
                pcpu_free_pages(chunk, pages, page_start, page_end);
                return -ENOMEM;
        }
        pcpu_post_map_flush(chunk, page_start, page_end);

        return 0;
}

chunk의 요청 페이지 범위에 대해 활성화(population)한다.

코드 라인 6~8에서 필요 page descriptor 만큼 할당을 받는다. 할당 실패시 -ENOMEM으로 반환한다.
코드 라인 10~11에서 필요 페이지 범위를 cpu 수 만큼 할당 받는다. 할당 실패시 -ENOMEM으로 반환한다.
코드 라인 13~16에서 할당받은 영역 페이지들을 vmalloc 공간에 매핑시킨다.
코드 라인 17에서 매핑이 완료되면 TLB 캐시를 flush 한다.

pcpu_get_pages()

mm/percpu-vm.c

/**
 * pcpu_get_pages - get temp pages array
 * @chunk: chunk of interest
 *
 * Returns pointer to array of pointers to struct page which can be indexed
 * with pcpu_page_idx().  Note that there is only one array and accesses
 * should be serialized by pcpu_alloc_mutex.
 *
 * RETURNS:
 * Pointer to temp pages array on success.
 */

static struct page **pcpu_get_pages(void)
{
        static struct page **pages;
        size_t pages_size = pcpu_nr_units * pcpu_unit_pages * sizeof(pages[0]);

        lockdep_assert_held(&pcpu_alloc_mutex);

        if (!pages)
                pages = pcpu_mem_zalloc(pages_size, GFP_KERNEL);
        return pages;
}

chunk 할당을 위해 전체 per-cpu 유닛에 필요한 page descriptor 사이즈 만큼 메모리 할당을 받아온다.

pcpu_map_pages()

mm/percpu-vm.c

/**
 * pcpu_map_pages - map pages into a pcpu_chunk
 * @chunk: chunk of interest
 * @pages: pages array containing pages to be mapped
 * @page_start: page index of the first page to map
 * @page_end: page index of the last page to map + 1
 *
 * For each cpu, map pages [@page_start,@page_end) into @chunk.  The
 * caller is responsible for calling pcpu_post_map_flush() after all
 * mappings are complete.
 *
 * This function is responsible for setting up whatever is necessary for
 * reverse lookup (addr -> chunk).
 */

static int pcpu_map_pages(struct pcpu_chunk *chunk,
                          struct page **pages, int page_start, int page_end)
{
        unsigned int cpu, tcpu;
        int i, err;

        for_each_possible_cpu(cpu) {
                err = __pcpu_map_pages(pcpu_chunk_addr(chunk, cpu, page_start),
                                       &pages[pcpu_page_idx(cpu, page_start)],
                                       page_end - page_start);
                if (err < 0)
                        goto err;

                for (i = page_start; i < page_end; i++)
                        pcpu_set_page_chunk(pages[pcpu_page_idx(cpu, i)],
                                            chunk);
        }
        return 0;
err:
        for_each_possible_cpu(tcpu) {
                if (tcpu == cpu)
                        break;
                __pcpu_unmap_pages(pcpu_chunk_addr(chunk, tcpu, page_start),
                                   page_end - page_start);
        }
        pcpu_post_unmap_tlb_flush(chunk, page_start, page_end);
        return err;
}

할당받은 영역 페이지들을 vmalloc 공간에 매핑시킨다

코드 라인 7~12에서 possible cpu 수 만큼 루프를 돌며 per-cpu chunk를 vmalloc 공간에 매핑한다.
코드 라인 14~16에서 각 페이지(page->index)들이 pcpu_chunk를 가리키도록 설정한다.

__pcpu_map_pages()

mm/percpu-vm.c

static int __pcpu_map_pages(unsigned long addr, struct page **pages,
                            int nr_pages)
{
        return map_kernel_range_noflush(addr, nr_pages << PAGE_SHIFT,
                                        PAGE_KERNEL, pages);
}

할당받은 영역 페이지들을 요청 vmalloc 가상 주소 공간에 매핑시킨다

map_kernel_range_noflush()

mm/vmalloc.c

/**
 * map_kernel_range_noflush - map kernel VM area with the specified pages
 * @addr: start of the VM area to map
 * @size: size of the VM area to map
 * @prot: page protection flags to use
 * @pages: pages to map
 *
 * Map PFN_UP(@size) pages at @addr.  The VM area @addr and @size
 * specify should have been allocated using get_vm_area() and its
 * friends.
 *                          
 * NOTE:                                
 * This function does NOT do any cache flushing.  The caller is
 * responsible for calling flush_cache_vmap() on to-be-mapped areas
 * before calling this function.
 *
 * RETURNS:
 * The number of pages mapped on success, -errno on failure.
 */

int map_kernel_range_noflush(unsigned long addr, unsigned long size,
                             pgprot_t prot, struct page **pages)
{
        return vmap_page_range_noflush(addr, addr + size, prot, pages);
}

할당받은 영역 페이지들을 요청 vmalloc 가상 주소 공간에 vmap 매핑시킨다

범위의 활성화 여부

pcpu_is_populated()

mm/percpu.c

/**
 * pcpu_is_populated - determines if the region is populated
 * @chunk: chunk of interest
 * @bit_off: chunk offset
 * @bits: size of area
 * @next_off: return value for the next offset to start searching
 *
 * For atomic allocations, check if the backing pages are populated.
 *
 * RETURNS:
 * Bool if the backing pages are populated.
 * next_index is to skip over unpopulated blocks in pcpu_find_block_fit.
 */

static bool pcpu_is_populated(struct pcpu_chunk *chunk, int bit_off, int bits,
                              int *next_off)
{
        int page_start, page_end, rs, re;

        page_start = PFN_DOWN(bit_off * PCPU_MIN_ALLOC_SIZE);
        page_end = PFN_UP((bit_off + bits) * PCPU_MIN_ALLOC_SIZE);

        rs = page_start;
        pcpu_next_unpop(chunk->populated, &rs, &re, page_end);
        if (rs >= page_end)
                return true;

        *next_off = re * PAGE_SIZE / PCPU_MIN_ALLOC_SIZE;
        return false;
}

@chunk에서 @bit_off부터 @bits 까지의 공간이 활성화(populate)된 상태인지 여부를 반환한다. 출력 인자 @next_off에는 다음 검색을 시작할 offset 위치가 담긴다.

pcpu_next_unpop()

mm/percpu.c

static void pcpu_next_unpop(unsigned long *bitmap, int *rs, int *re, int end)
{
        *rs = find_next_zero_bit(bitmap, end, *rs);
        *re = find_next_bit(bitmap, end, *rs + 1);
}

@end 까지의 per-cpu 페이지 중 활성화되지 않은 페이지의 시작 @rs과 끝 @re를 산출한다.

참고

Per-cpu -1- (Basic) | 문c
Per-cpu -2- (초기화) | 문c
Per-cpu -3- (동적 할당) | 문c – 현재 글
Per-cpu -4- (atomic operations) | 문c

kmalloc vs vmalloc

2016-11-102019-11-22 문영일 Leave a comment

kmalloc vs vmalloc 비교

커널에서 메모리를 할당할 때 요구 메모리 사이즈와 성능 사이에서 고려할 API가 다음 두 개가 있고 특징에 따라 구분하여 사용하여야 한다.

할당 크기 등은 slab, slub 및 slob의 단위가 모두 다르며 최근에 가장 많이 사용하는 slub을 기준으로 나타낸다.

kmalloc의 특징

연속된 물리 메모리를 연속된 가상 공간에 매핑하여 할당 받는다.

요구 메모리가 kmalloc 캐시 최대 크기 이하인 경우 kmalloc 캐시에서 slub object를 할당받는다.
- 예) 4K 페이지: kmalloc 캐시(slub) 최대 크기=8K (2개 페이지)
요구 메모리가 kmalloc 최대 크기보다 큰 경우 버디 시스템을 사용하여 할당한다.
연속된 물리 메모리를 연속된 가상 주소 공간에 매핑하여 할당 받는다. (DMA 디바이스에서 사용 가능)
이미 사전에 1:1 매핑된 lowmem(ZONE_DMA 및 ZONE_NORMAL) 공간을 사용하므로 vmalloc()에 비해 성능이 빠르다.
단점으로는 물리적으로 연속된 공간을 할당해야 하므로 페이지 관리 측면에서 단편화 관리에 어려움이 발생한다
GFP_ATOMIC 플래그를 사용하는 경우 슬립되지 않으므로 인터럽트 핸들러에서도 사용될 수 있다.

vmalloc의 특징

단편화되어 연속되지 않는 여러 개의 페이지들을 모아 연속된 가상 메모리 공간에 매핑하여 할당받는다.

kmalloc()과 다르게 여러 개의 조각난 연속된 물리 메모리를 모아서 관리하고 이들을 별도의 공간(커널은 vmalloc address space, userland는 user space address)에 매핑하여 사용하므로 관리 요소가 증가되고 각 cpu의 tlb 플러쉬가 필요하므로 cpu core가 많은 경우 사용에 어려움이 있다.
- 커널은 VMALLOC address space를 통해 연속된 가상 주소 메모리를 제공하는데 이 공간은 아키텍처에 따라 다르다.
  - arm64: kernel 매핑 공간의 약 절반 (커널 빌드 옵션마다 위치가 달라진다.)
  - arm: 240M (0xf000_0000 ~ 0xff00_0000)
- 매핑 시 vmalloc address space 공간을 사용하는 vmap() 매핑 함수를 사용한다.
대용량의 커널 메모리가 필요하고 lowmem 부족으로 인한 압박이 예상되면서 속도는 약간 느려도 무방한 경우 페이지의 단편화를 방지하기 위해 vmalloc()을 사용하는 것이 좋다.
연속되지 않은 물리 주소 블록들을 연속된 가상 주소로 매핑하기 위해 기존 커널은 리스트에 각각의 vma_area(한 블럭의 매핑 장보)를 추가하여 관리했었다. 비록 커널이 많은 수의 vmap_area를 사용하지 않았지만 커다란 application에서 수백 또는 수천 개의 vmp_area를 검색하여 성능이 저하되는 것을 막기 위해 커널 2.6에서 레드 블랙 트리를 사용하여 관리하도록 바뀌었다.
슬립될 수 있으므로 인터럽트 핸들러에서 사용되면 안된다.

위와 같은 이유로 대부분의 커널 코드에서는 kmalloc()을 더 많이 사용하고 다음과 같은 사용 목적으로 vmalloc()을 사용한다.

swap area용 자료구조
모듈 공간 할당
일부 디바이스 드라이버 등

kzalloc() vs vzalloc()

kzalloc() 함수는 kmalloc() 함수로 할당 받은 메모리를 0으로 초기화한다.
vzalloc() 함수는 vmalloc() 함수로 할당 받은 메모리를 0으로 초기화한다.

참고

Slab Memory Allocator -1- (구조) | 문c
Kmalloc vs Vmalloc | 문c – 현재 글
Kmalloc | 문c
Vmalloc | 문c
Vmap() | 문c
gfp 플래그 | 문c