문c Blog

Zoned Allocator -11- (Direct Reclaim)

2016-07-16, 2022-04-23 문영일

Deciding whether to reclaim

should_continue_reclaim()

mm/vmscan.c

/*
 * Reclaim/compaction is used for high-order allocation requests. It reclaims
 * order-0 pages before compacting the zone. should_continue_reclaim() returns
 * true if more pages should be reclaimed such that when the page allocator
 * calls try_to_compact_zone() that it will have enough free pages to succeed.
 * It will give up earlier than that if there is difficulty reclaiming pages.
 */

static inline bool should_continue_reclaim(struct pglist_data *pgdat,
                                        unsigned long nr_reclaimed,
                                        unsigned long nr_scanned,
                                        struct scan_control *sc)
{
        unsigned long pages_for_compaction;
        unsigned long inactive_lru_pages;
        int z;

        /* If not in reclaim/compaction mode, stop */
        if (!in_reclaim_compaction(sc))
                return false;

        /* Consider stopping depending on scan and reclaim activity */
        if (sc->gfp_mask & __GFP_RETRY_MAYFAIL) {
                /*
                 * For __GFP_RETRY_MAYFAIL allocations, stop reclaiming if the
                 * full LRU list has been scanned and we are still failing
                 * to reclaim pages. This full LRU scan is potentially
                 * expensive but a __GFP_RETRY_MAYFAIL caller really wants to succeed
                 */
                if (!nr_reclaimed && !nr_scanned)
                        return false;
        } else {
                /*
                 * For non-__GFP_RETRY_MAYFAIL allocations which can presumably
                 * fail without consequence, stop if we failed to reclaim
                 * any pages from the last SWAP_CLUSTER_MAX number of
                 * pages that were scanned. This will return to the
                 * caller faster at the risk reclaim/compaction and
                 * the resulting allocation attempt fails
                 */
                if (!nr_reclaimed)
                        return false;
        }

        /*
         * If we have not reclaimed enough pages for compaction and the
         * inactive lists are large enough, continue reclaiming
         */
        pages_for_compaction = compact_gap(sc->order);
        inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
        if (get_nr_swap_pages() > 0)
                inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
        if (sc->nr_reclaimed < pages_for_compaction &&
                        inactive_lru_pages > pages_for_compaction)
                return true;

        /* If compaction would go ahead or the allocation would succeed, stop */
        for (z = 0; z <= sc->reclaim_idx; z++) {
                struct zone *zone = &pgdat->node_zones[z];
                if (!managed_zone(zone))
                        continue;

                switch (compaction_suitable(zone, sc->order, 0, sc->reclaim_idx)) {
                case COMPACT_SUCCESS:
                case COMPACT_CONTINUE:
                        return false;
                default:
                        /* check next zone */
                        ;
                }
        }
        return true;
}

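Returns true if reclaim/compaction should continue in order to satisfy a high-order page request.

Code lines 11~12: if not in reclaim/compaction mode, stops and returns false.
Code lines 15~35: if the __GFP_RETRY_MAYFAIL flag is set, returns false only when no pages were reclaimed and no pages were scanned. Without the flag, returns false as soon as no pages were reclaimed.
Code lines 41~47: returns true if fewer pages have been reclaimed so far than compaction needs (compact_gap(), twice the number of pages in the requested order) while the inactive lru still holds more pages than that.
Code lines 50~63: iterates over the zones up to reclaim_idx and returns false if, in any zone, the allocation would already succeed (COMPACT_SUCCESS) or compaction can go ahead (COMPACT_CONTINUE).
Code line 64: if none of the visited zones is ready for compaction, returns true to signal that reclaim should continue.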
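
The inactive-list test in the middle of the function can be tried out in user space. The following is a minimal sketch, not kernel code: compact_gap() is written the way the kernel helper is commonly defined (2UL << order), while need_more_reclaim() and its plain integer parameters are hypothetical stand-ins for sc->nr_reclaimed and the node's inactive lru counters.

```c
#include <stdbool.h>

/* Twice the number of pages in the requested order (assumed to match the kernel helper). */
static unsigned long compact_gap(unsigned int order)
{
        return 2UL << order;
}

/*
 * Hypothetical helper: true when too little has been reclaimed for
 * compaction and the inactive lists are still larger than that target.
 */
static bool need_more_reclaim(unsigned int order,
                              unsigned long nr_reclaimed,
                              unsigned long inactive_lru_pages)
{
        unsigned long gap = compact_gap(order);

        return nr_reclaimed < gap && inactive_lru_pages > gap;
}
```

For an order-4 request the target is 32 pages, so reclaim continues on this test only while fewer than 32 pages have been reclaimed and more than 32 inactive pages remain.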

in_reclaim_compaction()

mm/vmscan.c

/* Use reclaim/compaction for costly allocs or under memory pressure */
static bool in_reclaim_compaction(struct scan_control *sc)
{
        if (IS_ENABLED(CONFIG_COMPACTION) && sc->order &&
                        (sc->order > PAGE_ALLOC_COSTLY_ORDER ||
                         sc->priority < DEF_PRIORITY - 2))
                return true;

        return false;
}

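Returns true when in reclaim/compaction mode.

Excluding order-0 requests, returns true when CONFIG_COMPACTION is enabled and either of the following conditions holds.
- Reclaim is being repeated with the priority raised more than twice, i.e. sc->priority is below DEF_PRIORITY - 2. (A lower priority number means a higher priority.)
- The request is a costly-order request. (order 4 and above)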
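
The condition can be checked with a few concrete values. This is a sketch under assumptions: the constants 3 and 12 are the values PAGE_ALLOC_COSTLY_ORDER and DEF_PRIORITY are assumed to have here, and reclaim_compaction_mode() with its compaction_enabled parameter is a hypothetical user-space stand-in for the IS_ENABLED(CONFIG_COMPACTION) test.

```c
#include <stdbool.h>

#define PAGE_ALLOC_COSTLY_ORDER 3   /* assumed kernel value */
#define DEF_PRIORITY            12  /* assumed kernel value */

/* Hypothetical stand-in for in_reclaim_compaction(). */
static bool reclaim_compaction_mode(bool compaction_enabled, int order, int priority)
{
        /* Order-0 requests never use reclaim/compaction. */
        if (!compaction_enabled || order == 0)
                return false;

        /* Either a costly order, or reclaim has already escalated more than twice. */
        return order > PAGE_ALLOC_COSTLY_ORDER || priority < DEF_PRIORITY - 2;
}
```

With these values an order-2 request enters the mode only once priority has dropped to 9 or lower, whereas an order-4 request is in the mode from the first pass.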

Performing direct reclaim

__alloc_pages_direct_reclaim()

mm/page_alloc.c

/* The really slow allocator path where we enter direct reclaim */
static inline struct page *
__alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
                int alloc_flags, const struct alloc_context *ac,
                unsigned long *did_some_progress)
{
        struct page *page = NULL;
        bool drained = false;

        *did_some_progress = __perform_reclaim(gfp_mask, order, ac);
        if (unlikely(!(*did_some_progress)))
                return NULL;

retry:
        page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);

        /*
         * If an allocation failed after direct reclaim, it could be because
         * pages are pinned on the per-cpu lists. Drain them and try again
         */
        if (!page && !drained) {
                unreserve_highatomic_pageblock(ac, false);
                drain_all_pages(NULL);
                drained = true;
                goto retry;
        }

        return page;
}

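Reclaims pages and then attempts the page allocation. If the first attempt fails, it drains the pcp caches to return free pages to the buddy system and retries once.

Code lines 10~12: reclaims pages; in the unlikely case that nothing was reclaimed, returns NULL.
Code lines 14~15: at the retry: label, attempts to allocate a page of the requested order.
Code lines 21~26: if the allocation failed and this is the first failure, releases the reserved highatomic pageblocks, drains the pcp caches to return free pages to the buddy system, and retries.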
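
The "try, drain once, try again" control flow can be isolated in a toy model. Everything below is hypothetical (struct toy_pool, toy_alloc(), toy_drain()); it only mirrors the shape of the retry logic, with pages parked in per-CPU caches being invisible to the allocator until they are drained.

```c
#include <stdbool.h>

/* Hypothetical pool: pages on the free list vs. pages parked in per-CPU caches. */
struct toy_pool {
        int free_pages;
        int pcp_pages;
};

/* Take one page from the free list if there is one. */
static bool toy_alloc(struct toy_pool *p)
{
        if (p->free_pages == 0)
                return false;
        p->free_pages--;
        return true;
}

/* Move every cached page back to the free list. */
static void toy_drain(struct toy_pool *p)
{
        p->free_pages += p->pcp_pages;
        p->pcp_pages = 0;
}

/* Try once; on failure drain the caches and try exactly one more time. */
static bool toy_alloc_after_reclaim(struct toy_pool *p)
{
        bool drained = false;
        bool ok;

retry:
        ok = toy_alloc(p);
        if (!ok && !drained) {
                toy_drain(p);
                drained = true;
                goto retry;
        }
        return ok;
}
```

The drained flag is what bounds the retry to a single extra attempt, so a pool that is truly empty fails after two tries instead of looping.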

The following figure shows the process of reclaiming pages through direct reclaim.

__perform_reclaim()

mm/page_alloc.c

/* Perform direct synchronous page reclaim */
static int
__perform_reclaim(gfp_t gfp_mask, unsigned int order,
                                        const struct alloc_context *ac)
{
        struct reclaim_state reclaim_state;
        int progress;
        unsigned int noreclaim_flag;
        unsigned long pflags;

        cond_resched();

        /* We now go into synchronous reclaim */
        cpuset_memory_pressure_bump();
        psi_memstall_enter(&pflags);
        fs_reclaim_acquire(gfp_mask);
        noreclaim_flag = memalloc_noreclaim_save();
        reclaim_state.reclaimed_slab = 0;
        current->reclaim_state = &reclaim_state;

        progress = try_to_free_pages(ac->zonelist, order, gfp_mask,
                                                                ac->nodemask);

        current->reclaim_state = NULL;
        memalloc_noreclaim_restore(noreclaim_flag);
        fs_reclaim_release(gfp_mask);
        psi_memstall_leave(&pflags);

        cond_resched();

        return progress;
}

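Reclaims pages. The return value is the number of pages reclaimed.

Code line 14: if the global cpuset_memory_pressure_enabled is set, updates the frequency meter of the current task's cpuset.
- It is enabled by writing 1 to the memory_pressure_enabled file in the root cpuset.
Code line 15: notifies psi that a memory stall has begun.
Code line 17: page reclaim itself may briefly need to allocate pages. To keep the reclaim routine from being entered recursively at that point, PF_MEMALLOC is set in the current task's flags for the duration of reclaim so that such allocations can proceed without the watermark checks.
Code lines 18~19: resets the reclaimed_slab counter to 0 and attaches the reclaim_state to the current task.
Code lines 21~22: reclaims pages and obtains the number of pages reclaimed.
Code line 24: detaches the reclaim_state from the task.
Code line 25: removes the PF_MEMALLOC that was temporarily set in the current task's flags during reclaim.
Code line 27: notifies psi that the memory stall has ended.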
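
The save/restore pair around reclaim follows a common pattern: remember whether the bit was already set, force it on, and later put back exactly the previous state. The sketch below shows that pattern on a toy task structure; struct toy_task, the toy_ function names, and the flag value 0x0800 are illustrative assumptions, not the kernel's definitions.

```c
#define TOY_PF_MEMALLOC 0x0800u  /* illustrative flag bit */

/* Hypothetical stand-in for the current task. */
struct toy_task {
        unsigned int flags;
};

/* Set the flag and return its previous state (either 0 or the flag bit). */
static unsigned int toy_noreclaim_save(struct toy_task *t)
{
        unsigned int old = t->flags & TOY_PF_MEMALLOC;

        t->flags |= TOY_PF_MEMALLOC;
        return old;
}

/* Put the flag back to the state captured by toy_noreclaim_save(). */
static void toy_noreclaim_restore(struct toy_task *t, unsigned int old)
{
        t->flags = (t->flags & ~TOY_PF_MEMALLOC) | old;
}
```

Restoring the saved state rather than unconditionally clearing the bit keeps nesting safe: a task that already had the flag set before entering reclaim still has it afterwards.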

Scan Control

The following routines use scan control.

reclaim_clean_pages_from_list()
try_to_free_pages()
mem_cgroup_shrink_node()
try_to_free_mem_cgroup_pages()
balance_pgdat()
shrink_all_memory()
__node_reclaim()

Attempting to secure free pages through page reclaim

try_to_free_pages()

mm/vmscan.c

unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
                                gfp_t gfp_mask, nodemask_t *nodemask)
{
        unsigned long nr_reclaimed;
        struct scan_control sc = {
                .nr_to_reclaim = SWAP_CLUSTER_MAX,
                .gfp_mask = current_gfp_context(gfp_mask),
                .reclaim_idx = gfp_zone(gfp_mask),
                .order = order,
                .nodemask = nodemask,
                .priority = DEF_PRIORITY,
                .may_writepage = !laptop_mode,
                .may_unmap = 1,
                .may_swap = 1,
                .may_shrinkslab = 1,
        };

        /*
         * scan_control uses s8 fields for order, priority, and reclaim_idx.
         * Confirm they are large enough for max values.
         */
        BUILD_BUG_ON(MAX_ORDER > S8_MAX);
        BUILD_BUG_ON(DEF_PRIORITY > S8_MAX);
        BUILD_BUG_ON(MAX_NR_ZONES > S8_MAX);

        /*
         * Do not enter reclaim if fatal signal was delivered while throttled.
         * 1 is returned so that the page allocator does not OOM kill at this
         * point.
         */
        if (throttle_direct_reclaim(sc.gfp_mask, zonelist, nodemask))
                return 1;

        trace_mm_vmscan_direct_reclaim_begin(order,
                                sc.may_writepage,
                                sc.gfp_mask,
                                sc.reclaim_idx);

        nr_reclaimed = do_try_to_free_pages(zonelist, &sc);

        trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);

        return nr_reclaimed;
}

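Attempts page reclaim and returns the number of pages reclaimed. For a user request, the task may be throttled (put to sleep) until the free pages in the ZONE_NORMAL-and-below zones rise above half of the min watermark.

Code lines 5~16: prepares the scan_control structure for page reclaim.
Code lines 31~32: if a fatal signal was delivered while the task was being throttled ahead of direct reclaim, leaves the routine immediately. It returns 1 so that the page allocator does not go on to the OOM kill path.
Code line 39: attempts page reclaim.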

Throttling user requests

throttle_direct_reclaim()

mm/vmscan.c

/*
 * Throttle direct reclaimers if backing storage is backed by the network
 * and the PFMEMALLOC reserve for the preferred node is getting dangerously
 * depleted. kswapd will continue to make progress and wake the processes
 * when the low watermark is reached.
 *
 * Returns true if a fatal signal was delivered during throttling. If this
 * happens, the page allocator should not consider triggering the OOM killer.
 */

static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
                                        nodemask_t *nodemask)
{
        struct zoneref *z;
        struct zone *zone;
        pg_data_t *pgdat = NULL;

        /*
         * Kernel threads should not be throttled as they may be indirectly
         * responsible for cleaning pages necessary for reclaim to make forward
         * progress. kjournald for example may enter direct reclaim while
         * committing a transaction where throttling it could forcing other
         * processes to block on log_wait_commit().
         */
        if (current->flags & PF_KTHREAD)
                goto out;

        /*
         * If a fatal signal is pending, this process should not throttle.
         * It should return quickly so it can exit and free its memory
         */
        if (fatal_signal_pending(current))
                goto out;

        /*
         * Check if the pfmemalloc reserves are ok by finding the first node
         * with a usable ZONE_NORMAL or lower zone. The expectation is that
         * GFP_KERNEL will be required for allocating network buffers when
         * swapping over the network so ZONE_HIGHMEM is unusable.
         *
         * Throttling is based on the first usable node and throttled processes
         * wait on a queue until kswapd makes progress and wakes them. There
         * is an affinity then between processes waking up and where reclaim
         * progress has been made assuming the process wakes on the same node.
         * More importantly, processes running on remote nodes will not compete
         * for remote pfmemalloc reserves and processes on different nodes
         * should make reasonable progress.
         */
        for_each_zone_zonelist_nodemask(zone, z, zonelist,
                                        gfp_zone(gfp_mask), nodemask) {
                if (zone_idx(zone) > ZONE_NORMAL)
                        continue;

                /* Throttle based on the first usable node */
                pgdat = zone->zone_pgdat;
                if (allow_direct_reclaim(pgdat))
                        goto out;
                break;
        }

        /* If no zone was usable by the allocation flags then do not throttle */
        if (!pgdat)
                goto out;

        /* Account for the throttling */
        count_vm_event(PGSCAN_DIRECT_THROTTLE);

        /*
         * If the caller cannot enter the filesystem, it's possible that it
         * is due to the caller holding an FS lock or performing a journal
         * transaction in the case of a filesystem like ext[3|4]. In this case,
         * it is not safe to block on pfmemalloc_wait as kswapd could be
         * blocked waiting on the same lock. Instead, throttle for up to a
         * second before continuing.
         */
        if (!(gfp_mask & __GFP_FS)) {
                wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
                        allow_direct_reclaim(pgdat), HZ);

                goto check_pending;
        }

        /* Throttle until kswapd wakes the process */
        wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
                allow_direct_reclaim(pgdat));

check_pending:
        if (fatal_signal_pending(current))
                return true;

out:
        return false;
}

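Throttles a direct-reclaim request from a user task for as long as necessary. For a direct-reclaim request that may not use the filesystem (nofs), throttling is limited to one second. Returns whether a fatal signal (SIGKILL) was received during throttling.

Code lines 15~16: if the request comes from a kernel thread, skips throttling and returns false.
Code lines 22~23: likewise, if a SIGKILL is already pending for the task, skips throttling and returns false.
Code lines 39~49: finds the first usable lowmem zone (ZONE_NORMAL or lower) for the request; if the free pages of that node's lowmem zones are above the level at which direct reclaim is allowed, skips throttling.
Code lines 52~53: if no usable zone was found, skips throttling.
Code line 56: throttling starts here. Increments the PGSCAN_DIRECT_THROTTLE stat.
Code lines 66~71: for a direct-reclaim request that may not use the filesystem, throttles for at most one second or until direct reclaim is allowed, then jumps to the check_pending label.
Code lines 74~75: for a direct-reclaim request that may use the filesystem, sleeps until direct reclaim is allowed, while kswapd is woken to secure free pages.
Code lines 77~82: returns true if a SIGKILL is pending for the current task, false otherwise.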
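
The order of the checks decides which kind of wait, if any, the task ends up in. The sketch below condenses that decision into a pure function; struct toy_request, its fields, and the enum names are hypothetical, and the actual sleeping, statistics, and signal re-check are left out.

```c
#include <stdbool.h>

enum throttle_action {
        THROTTLE_NONE,         /* proceed to reclaim immediately */
        THROTTLE_TIMED,        /* no __GFP_FS: wait for at most one second */
        THROTTLE_UNTIL_WOKEN,  /* wait until kswapd wakes the task */
};

/* Hypothetical summary of the inputs the function looks at. */
struct toy_request {
        bool is_kthread;            /* PF_KTHREAD set */
        bool fatal_signal_pending;  /* SIGKILL already pending */
        bool has_lowmem_zone;       /* a ZONE_NORMAL-or-lower zone was found */
        bool reserves_ok;           /* what allow_direct_reclaim() would report */
        bool may_enter_fs;          /* __GFP_FS set */
};

static enum throttle_action throttle_decision(const struct toy_request *r)
{
        /* Kernel threads and dying tasks are never throttled. */
        if (r->is_kthread || r->fatal_signal_pending)
                return THROTTLE_NONE;

        /* No usable lowmem zone, or reserves are fine: nothing to wait for. */
        if (!r->has_lowmem_zone || r->reserves_ok)
                return THROTTLE_NONE;

        return r->may_enter_fs ? THROTTLE_UNTIL_WOKEN : THROTTLE_TIMED;
}
```

The timed variant exists because a caller without __GFP_FS may hold a filesystem lock that kswapd itself needs, so waiting indefinitely on kswapd could deadlock.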

The following figure shows how a user-requested direct reclaim is throttled before it may proceed, depending on whether the filesystem may be used.

Whether direct reclaim is allowed

allow_direct_reclaim()

mm/vmscan.c

static bool allow_direct_reclaim(pg_data_t *pgdat)
{
        struct zone *zone;
        unsigned long pfmemalloc_reserve = 0;
        unsigned long free_pages = 0;
        int i;
        bool wmark_ok;

        if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
                return true;

        for (i = 0; i <= ZONE_NORMAL; i++) {
                zone = &pgdat->node_zones[i];
                if (!managed_zone(zone))
                        continue;

                if (!zone_reclaimable_pages(zone))
                        continue;

                pfmemalloc_reserve += min_wmark_pages(zone);
                free_pages += zone_page_state(zone, NR_FREE_PAGES);
        }

        /* If there are no reserves (unexpected config) then do not throttle */
        if (!pfmemalloc_reserve)
                return true;

        wmark_ok = free_pages > pfmemalloc_reserve / 2;

        /* kswapd must be awake if processes are being throttled */
        if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
                pgdat->kswapd_classzone_idx = min(pgdat->kswapd_classzone_idx,
                                                (enum zone_type)ZONE_NORMAL);
                wake_up_interruptible(&pgdat->kswapd_wait);
        }

        return wmark_ok;
}

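Returns whether direct reclaim is allowed on the requested node. If the free pages of the lowmem zones are at or below 50% of the min watermark, it wakes kswapd (when kswapd is sleeping) and returns false, which leaves the calling task throttled. Above that level it judges that direct reclaim may be attempted and returns true.

Code lines 9~10: if kswapd's reclaim failure count has reached MAX_RECLAIM_RETRIES (16), returns true immediately so that no throttling takes place.
Code lines 12~22: computes pfmemalloc_reserve, the sum of the min watermarks of the lowmem zones, and the sum of their free pages.
Code lines 25~26: if pfmemalloc_reserve is 0, returns true immediately so that no throttling takes place.
Code lines 28~37: if the summed free pages are at or below 50% of the summed min watermarks of the lowmem zones, wakes kswapd and returns false; otherwise returns true.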
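
The watermark arithmetic is easy to verify with numbers. The following sketch sums over a hypothetical array of struct toy_zone entries (the struct and function names are made up for illustration) and applies the same "free pages above half of the summed min watermarks" test; the kswapd wake-up and the kswapd_failures shortcut are omitted.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-zone summary for the lowmem zones of one node. */
struct toy_zone {
        bool managed;               /* zone has managed pages */
        unsigned long reclaimable;  /* reclaimable pages in the zone */
        unsigned long min_wmark;    /* min watermark, in pages */
        unsigned long free;         /* free pages */
};

static bool toy_allow_direct_reclaim(const struct toy_zone *zones, size_t n)
{
        unsigned long reserve = 0, free_pages = 0;

        for (size_t i = 0; i < n; i++) {
                /* Zones with nothing managed or nothing reclaimable do not count. */
                if (!zones[i].managed || !zones[i].reclaimable)
                        continue;
                reserve += zones[i].min_wmark;
                free_pages += zones[i].free;
        }

        /* No reserves at all: never throttle. */
        if (!reserve)
                return true;

        return free_pages > reserve / 2;
}
```

With two zones whose min watermarks are 1000 pages each, the threshold is 1000 free pages in total: 900 free pages keeps the task throttled, 1100 lets it proceed.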

The following figure shows how it is determined whether direct reclaim is allowed.

do_try_to_free_pages()

mm/vmscan.c

/*
 * This is the main entry point to direct page reclaim.
 *
 * If a full scan of the inactive list fails to free enough memory then we
 * are "out of memory" and something needs to be killed.
 *
 * If the caller is !__GFP_FS then the probability of a failure is reasonably
 * high - the zone may be full of dirty or under-writeback pages, which this
 * caller can't do much about.  We kick the writeback threads and take explicit
 * naps in the hope that some of these pages can be written.  But if the
 * allocating task holds filesystem locks which prevent writeout this might not
 * work, and the allocation attempt will fail.
 *
 * returns:     0, if no pages reclaimed
 *              else, the number of pages reclaimed
 */

static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
                                          struct scan_control *sc)
{
        int initial_priority = sc->priority;
        pg_data_t *last_pgdat;
        struct zoneref *z;
        struct zone *zone;
retry:
        delayacct_freepages_start();

        if (global_reclaim(sc))
                __count_zid_vm_events(ALLOCSTALL, sc->reclaim_idx, 1);

        do {
                vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
                                sc->priority);
                sc->nr_scanned = 0;
                shrink_zones(zonelist, sc);

                if (sc->nr_reclaimed >= sc->nr_to_reclaim)
                        break;

                if (sc->compaction_ready)
                        break;

                /*
                 * If we're getting trouble reclaiming, start doing
                 * writepage even in laptop mode.
                 */
                if (sc->priority < DEF_PRIORITY - 2)
                        sc->may_writepage = 1;
        } while (--sc->priority >= 0);

        last_pgdat = NULL;
        for_each_zone_zonelist_nodemask(zone, z, zonelist, sc->reclaim_idx,
                                        sc->nodemask) {
                if (zone->zone_pgdat == last_pgdat)
                        continue;
                last_pgdat = zone->zone_pgdat;
                snapshot_refaults(sc->target_mem_cgroup, zone->zone_pgdat);
                set_memcg_congestion(last_pgdat, sc->target_mem_cgroup, false);
        }

        delayacct_freepages_end();

        if (sc->nr_reclaimed)
                return sc->nr_reclaimed;

        /* Aborted reclaim to try compaction? don't OOM, then */
        if (sc->compaction_ready)
                return 1;

        /* Untapped cgroup reserves?  Don't OOM, retry. */
        if (sc->memcg_low_skipped) {
                sc->priority = initial_priority;
                sc->memcg_low_reclaim = 1;
                sc->memcg_low_skipped = 0;
                goto retry;
        }

        return 0;
}

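Attempts to secure free pages by reclaiming pages on behalf of a direct-reclaim request.

Code lines 8~9: the retry: label. Starts measuring the time spent on page reclaim.
- See: delayacct_init() | 문c
Code lines 11~12: for global reclaim, increments the ALLOCSTALL stat.
Code lines 14~16: inside the loop, updates the vmpressure information as the priority rises and the scan depth deepens.
- See: vmpressure | 문c
Code lines 17~18: resets the scanned count, reclaims pages, and obtains the number reclaimed.
Code lines 20~21: if the number reclaimed has reached the number to reclaim, leaves the loop.
Code lines 23~24: if compaction is ready, leaves the loop.
Code lines 30~31: once the priority has been raised more than two steps (sc->priority < DEF_PRIORITY - 2), enables writepage.
Code line 32: loops while raising the priority up to the highest level (it gets higher toward 0).
Code lines 34~42: walks the zonelist and, once per node, snapshots the refaults of the node or memcg lru and resets the memcg node's congested state to false.
Code line 44: finishes measuring the time spent on page reclaim.
Code lines 46~47: if any pages were reclaimed, returns that count.
Code lines 50~51: if compaction is ready, returns 1.
Code lines 54~59: if sc->memcg_low_skipped is set, restores priority to the originally requested priority and retries, for the first retry only.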
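
The priority loop can be modelled with a callback in place of shrink_zones(). This is a sketch under assumptions: struct toy_sc, the shrink_fn callback type, and stingy_shrink() are hypothetical, DEF_PRIORITY is assumed to be 12, and the compaction, memcg, and accounting steps are left out.

```c
#include <stdbool.h>

#define DEF_PRIORITY 12  /* assumed kernel value */

/* Hypothetical subset of scan_control. */
struct toy_sc {
        int priority;
        bool may_writepage;
        unsigned long nr_reclaimed;
        unsigned long nr_to_reclaim;
};

/* Hypothetical reclaim pass: reports how many pages one pass freed. */
typedef unsigned long (*shrink_fn)(int priority);

static unsigned long toy_reclaim_loop(struct toy_sc *sc, shrink_fn shrink)
{
        do {
                sc->nr_reclaimed += shrink(sc->priority);
                if (sc->nr_reclaimed >= sc->nr_to_reclaim)
                        break;
                /* Struggling: allow writing pages out from now on. */
                if (sc->priority < DEF_PRIORITY - 2)
                        sc->may_writepage = true;
        } while (--sc->priority >= 0);

        return sc->nr_reclaimed;
}

/* Example pass: frees nothing until priority has dropped to 8 or below. */
static unsigned long stingy_shrink(int priority)
{
        return priority <= 8 ? 32 : 0;
}
```

Starting at priority 12 with a target of 32 pages, the loop makes four empty passes, turns on may_writepage at priority 9, and meets the target on the pass at priority 8.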

global_reclaim()

mm/vmscan.c

#ifdef CONFIG_MEMCG
static bool global_reclaim(struct scan_control *sc)
{
        return !sc->target_mem_cgroup;
}
#else
static bool global_reclaim(struct scan_control *sc)
{
        return true;
}
#endif

When Memory Control Groups are in use via the CONFIG_MEMCG kernel option, returns false if the scan_control's target_mem_cgroup is set; otherwise returns true for global reclaim. Without CONFIG_MEMCG it is always true.

Memory Pressure (per-cpuset reclaims)

cpuset_memory_pressure_bump()

include/linux/cpuset.h

#define cpuset_memory_pressure_bump()                           \
        do {                                                    \
                if (cpuset_memory_pressure_enabled)             \
                        __cpuset_memory_pressure_bump();        \
        } while (0)

Updates the frequency meter of the current task's cpuset.

__cpuset_memory_pressure_bump()

kernel/cpuset.c

/**
 * cpuset_memory_pressure_bump - keep stats of per-cpuset reclaims.
 *
 * Keep a running average of the rate of synchronous (direct)
 * page reclaim efforts initiated by tasks in each cpuset.
 *
 * This represents the rate at which some task in the cpuset
 * ran low on memory on all nodes it was allowed to use, and
 * had to enter the kernels page reclaim code in an effort to
 * create more free memory by tossing clean pages or swapping
 * or writing dirty pages.
 *
 * Display to user space in the per-cpuset read-only file
 * "memory_pressure".  Value displayed is an integer
 * representing the recent rate of entry into the synchronous
 * (direct) page reclaim by any task attached to the cpuset.
 **/

void __cpuset_memory_pressure_bump(void)
{
        rcu_read_lock();
        fmeter_markevent(&task_cs(current)->fmeter);
        rcu_read_unlock();
}

Updates the frequency meter of the current task's cpuset.

fmeter_markevent()

kernel/cpuset.c

/* Process any previous ticks, then bump cnt by one (times scale). */
static void fmeter_markevent(struct fmeter *fmp)
{
        spin_lock(&fmp->lock);
        fmeter_update(fmp);
        fmp->cnt = min(FM_MAXCNT, fmp->cnt + FM_SCALE);
        spin_unlock(&fmp->lock);
}

Updates the given frequency meter, then adds 1,000 (FM_SCALE) to the event count for the next calculation, capped at a maximum of 1,000,000 (FM_MAXCNT).

fmeter_update()

kernel/cpuset.c

/* Internal meter update - process cnt events and update value */
static void fmeter_update(struct fmeter *fmp)
{
        time_t now = get_seconds();
        time_t ticks = now - fmp->time;

        if (ticks == 0)
                return;

        ticks = min(FM_MAXTICKS, ticks);
        while (ticks-- > 0)
                fmp->val = (FM_COEF * fmp->val) / FM_SCALE;
        fmp->time = now;

        fmp->val += ((FM_SCALE - FM_COEF) * fmp->cnt) / FM_SCALE;
        fmp->cnt = 0;
}

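Computes val for the given frequency meter and resets the event count to 0.

Code lines 4~8: obtains the number of seconds elapsed since the second recorded in the fmeter; if none have elapsed, returns.
Code lines 10~12: limits ticks to at most 99 and repeats fmp->val *= 93.3% once per tick.
Code line 13: updates the recorded time to the current second for the next calculation.
Code lines 15~16: adds fmp->cnt x 6.7% to fmp->val, then resets the event count to 0.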
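
The decay arithmetic is easier to follow with concrete numbers. The sketch below re-implements the two functions in user space using the constants implied by the text above (93.3% decay per tick, scale 1,000, at most 99 ticks, count capped at 1,000,000). The struct toy_fmeter type, the toy_ function names, and the explicit now parameter (in place of reading the clock) are hypothetical, and locking is omitted.

```c
#define FM_COEF     933      /* decay per tick: 933/1000 = 93.3% */
#define FM_MAXTICKS 99       /* no point decaying over more ticks than this */
#define FM_MAXCNT   1000000  /* cap on the pending event count */
#define FM_SCALE    1000     /* fixed-point scale */

/* Hypothetical user-space copy of the meter, without the spinlock. */
struct toy_fmeter {
        int cnt;    /* unprocessed events, scaled by FM_SCALE */
        int val;    /* most recent output value */
        long time;  /* second at which val was computed */
};

/* Decay val once per elapsed second, then fold in the pending events. */
static void toy_fmeter_update(struct toy_fmeter *f, long now)
{
        long ticks = now - f->time;

        if (ticks == 0)
                return;
        if (ticks > FM_MAXTICKS)
                ticks = FM_MAXTICKS;
        while (ticks-- > 0)
                f->val = (FM_COEF * f->val) / FM_SCALE;
        f->time = now;

        f->val += ((FM_SCALE - FM_COEF) * f->cnt) / FM_SCALE;
        f->cnt = 0;
}

/* Process elapsed ticks, then record one event (scaled), up to the cap. */
static void toy_fmeter_markevent(struct toy_fmeter *f, long now)
{
        toy_fmeter_update(f, now);
        f->cnt += FM_SCALE;
        if (f->cnt > FM_MAXCNT)
                f->cnt = FM_MAXCNT;
}
```

Starting from an empty meter, one event at second 0 leaves cnt at 1000; an update at second 1 turns that into val = 67 (6.7% of 1000), and an update at second 2 with no new events decays it to 62 (933 * 67 / 1000, truncated).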

struct fmeter

kernel/cgroup/cpuset.c

struct fmeter {
        int cnt;                /* unprocessed events count */
        int val;                /* most recent output value */
        time_t time;            /* clock (secs) when val computed */
        spinlock_t lock;        /* guards read or write of above */
};

cnt
- Number of events not yet processed
val
- Value computed at the most recent fmeter update
time
- Clock (in seconds) at which val was computed
References

Zoned Allocator -1- (Physical Page Allocation-Fastpath) | 문c
Zoned Allocator -2- (Physical Page Allocation-Slowpath) | 문c
Zoned Allocator -3- (Buddy Page Allocation) | 문c
Zoned Allocator -4- (Buddy Page Free) | 문c
Zoned Allocator -5- (Per-CPU Page Frame Cache) | 문c
Zoned Allocator -6- (Watermark) | 문c
Zoned Allocator -7- (Direct Compact) | 문c
Zoned Allocator -8- (Direct Compact-Isolation) | 문c
Zoned Allocator -9- (Direct Compact-Migration) | 문c
Zoned Allocator -10- (LRU & pagevec) | 문c
Zoned Allocator -11- (Direct Reclaim) | 문c (this post)
Zoned Allocator -12- (Direct Reclaim-Shrink-1) | 문c
Zoned Allocator -13- (Direct Reclaim-Shrink-2) | 문c
Zoned Allocator -14- (Kswapd) | 문c

Overview of Memory Reclaim in the Current Upstream Kernel (2021) | SUSE (PDF download)
Optimizing Linux Memory Management for Low-latency / High-throughput Databases

Zoned Allocator -13- (Direct Reclaim-Shrink-2)

2016-07-16, 2021-09-11 문영일

shrink_page_list()

mm/vmscan.c -1/6-

/*
 * shrink_page_list() returns the number of reclaimed pages
 */

static unsigned long shrink_page_list(struct list_head *page_list,
                                      struct pglist_data *pgdat,
                                      struct scan_control *sc,
                                      enum ttu_flags ttu_flags,
                                      struct reclaim_stat *stat,
                                      bool force_reclaim)
{
        LIST_HEAD(ret_pages);
        LIST_HEAD(free_pages);
        int pgactivate = 0;
        unsigned nr_unqueued_dirty = 0;
        unsigned nr_dirty = 0;
        unsigned nr_congested = 0;
        unsigned nr_reclaimed = 0;
        unsigned nr_writeback = 0;
        unsigned nr_immediate = 0;
        unsigned nr_ref_keep = 0;
        unsigned nr_unmap_fail = 0;

        cond_resched();

        while (!list_empty(page_list)) {
                struct address_space *mapping;
                struct page *page;
                int may_enter_fs;
                enum page_references references = PAGEREF_RECLAIM_CLEAN;
                bool dirty, writeback;

                cond_resched();

                page = lru_to_page(page_list);
                list_del(&page->lru);

                if (!trylock_page(page))
                        goto keep;

                VM_BUG_ON_PAGE(PageActive(page), page);

                sc->nr_scanned++;

                if (unlikely(!page_evictable(page)))
                        goto activate_locked;

                if (!sc->may_unmap && page_mapped(page))
                        goto keep_locked;

                /* Double the slab pressure for mapped and swapcache pages */
                if ((page_mapped(page) || PageSwapCache(page)) &&
                    !(PageAnon(page) && !PageSwapBacked(page)))
                        sc->nr_scanned++;

                may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
                        (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));

                /*
                 * The number of dirty pages determines if a node is marked
                 * reclaim_congested which affects wait_iff_congested. kswapd
                 * will stall and start writing pages if the tail of the LRU
                 * is all dirty unqueued pages.
                 */
                page_check_dirty_writeback(page, &dirty, &writeback);
                if (dirty || writeback)
                        nr_dirty++;

                if (dirty && !writeback)
                        nr_unqueued_dirty++;

                /*
                 * Treat this page as congested if the underlying BDI is or if
                 * pages are cycling through the LRU so quickly that the
                 * pages marked for immediate reclaim are making it to the
                 * end of the LRU a second time.
                 */
                mapping = page_mapping(page);
                if (((dirty || writeback) && mapping &&
                     inode_write_congested(mapping->host)) ||
                    (writeback && PageReclaim(page)))
                        nr_congested++;

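Performs shrink on the pages of the @page_list received after isolation and returns the number of pages reclaimed.

Code line 8: initializes the ret_pages list, used temporarily to hold the pages that were not reclaimed.
Code line 9: initializes the free_pages list, used temporarily to hold the pages to be freed.
Code lines 22~32: loops until page_list is empty, taking one page off the list each time.
Code lines 34~35: if the page lock cannot be acquired, jumps to the keep label to return the page to the lru for later processing.
Code lines 41~42: in the unlikely case that the page is not evictable, jumps to the activate_locked label to return it to the active lru.
Code lines 44~45: if unmapping is not permitted (!sc->may_unmap), a mapped page is not processed; jumps to the keep_locked label to return it to the lru.
Code lines 48~50: for a pte-mapped page or a swap cache page, increments nr_scanned once more. Clean anon pages that have no swap backing are excluded.
Code lines 52~53: determines whether the filesystem may be used while processing this page. It may if fs is allowed (__GFP_FS), or if the page is in the swap cache and IO is allowed (__GFP_IO).
Code lines 61~63: checks whether the page is dirty or under writeback and, if either, increments the nr_dirty counter.
Code lines 65~66: for a dirty page that has not yet been queued for writeback, increments the nr_unqueued_dirty counter.
Code lines 74~78: if writes to the backing device are congested, or the page is under writeback and already marked for reclaim, increments nr_congested.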
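
The three counters updated at the end of this part depend only on a handful of per-page booleans, so the classification can be sketched as a small pure function. The types and names below (struct toy_page_state, struct toy_counters, toy_account()) are hypothetical; backing_congested stands in for "the page has a mapping and inode_write_congested() reports congestion".

```c
#include <stdbool.h>

/* Hypothetical per-page summary. */
struct toy_page_state {
        bool dirty;
        bool writeback;
        bool marked_reclaim;     /* stands in for PageReclaim() */
        bool backing_congested;  /* mapping exists and its device is congested */
};

struct toy_counters {
        unsigned nr_dirty;
        unsigned nr_unqueued_dirty;
        unsigned nr_congested;
};

static void toy_account(const struct toy_page_state *p, struct toy_counters *c)
{
        /* Dirty or already being written back. */
        if (p->dirty || p->writeback)
                c->nr_dirty++;

        /* Dirty but nobody has queued it for writeback yet. */
        if (p->dirty && !p->writeback)
                c->nr_unqueued_dirty++;

        /* Congested device, or a reclaim-marked page that cycled back mid-writeback. */
        if (((p->dirty || p->writeback) && p->backing_congested) ||
            (p->writeback && p->marked_reclaim))
                c->nr_congested++;
}
```

Note that a clean page on a congested device touches none of the counters, because the congestion test only applies to pages that are dirty or under writeback.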

mm/vmscan.c -2/6-

                /*
                 * If a page at the tail of the LRU is under writeback, there
                 * are three cases to consider.
                 *
                 * 1) If reclaim is encountering an excessive number of pages
                 *    under writeback and this page is both under writeback and
                 *    PageReclaim then it indicates that pages are being queued
                 *    for IO but are being recycled through the LRU before the
                 *    IO can complete. Waiting on the page itself risks an
                 *    indefinite stall if it is impossible to writeback the
                 *    page due to IO error or disconnected storage so instead
                 *    note that the LRU is being scanned too quickly and the
                 *    caller can stall after page list has been processed.
                 *
                 * 2) Global or new memcg reclaim encounters a page that is
                 *    not marked for immediate reclaim, or the caller does not
                 *    have __GFP_FS (or __GFP_IO if it's simply going to swap,
                 *    not to fs). In this case mark the page for immediate
                 *    reclaim and continue scanning.
                 *
                 *    Require may_enter_fs because we would wait on fs, which
                 *    may not have submitted IO yet. And the loop driver might
                 *    enter reclaim, and deadlock if it waits on a page for
                 *    which it is needed to do the write (loop masks off
                 *    __GFP_IO|__GFP_FS for this reason); but more thought
                 *    would probably show more reasons.
                 *
                 * 3) Legacy memcg encounters a page that is already marked
                 *    PageReclaim. memcg does not have any dirty pages
                 *    throttling so we could easily OOM just because too many
                 *    pages are in writeback and there is nothing else to
                 *    reclaim. Wait for the writeback to complete.
                 *
                 * In cases 1) and 2) we activate the pages to get them out of
                 * the way while we continue scanning for clean pages on the
                 * inactive list and refilling from the active list. The
                 * observation here is that waiting for disk writes is more
                 * expensive than potentially causing reloads down the line.
                 * Since they're marked for immediate reclaim, they won't put
                 * memory pressure on the cache working set any longer than it
                 * takes to write them to disk.
                 */
                if (PageWriteback(page)) {
                        /* Case 1 above */
                        if (current_is_kswapd() &&
                            PageReclaim(page) &&
                            test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
                                nr_immediate++;
                                goto activate_locked;

                        /* Case 2 above */
                        } else if (sane_reclaim(sc) ||
                            !PageReclaim(page) || !may_enter_fs) {
                                /*
                                 * This is slightly racy - end_page_writeback()
                                 * might have just cleared PageReclaim, then
                                 * setting PageReclaim here end up interpreted
                                 * as PageReadahead - but that does not matter
                                 * enough to care.  What we do want is for this
                                 * page to have PageReclaim set next time memcg
                                 * reclaim reaches the tests above, so it will
                                 * then wait_on_page_writeback() to avoid OOM;
                                 * and it's also appropriate in global reclaim.
                                 */
                                SetPageReclaim(page);
                                nr_writeback++;
                                goto activate_locked;

                        /* Case 3 above */
                        } else {
                                unlock_page(page);
                                wait_on_page_writeback(page);
                                /* then go back and try same page again */
                                list_add_tail(&page->lru, page_list);
                                continue;
                        }
                }

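Code line 43 handles a page that is under writeback.
In code lines 45~49, the first writeback case: kswapd meets a page that was already marked for reclaim (PageReclaim) and has come around again while the node has the PGDAT_WRITEBACK flag set. To give the writeback more time, the nr_immediate counter is incremented and the code jumps to the activate_locked label so that the page is activated and returned to the lru.
In code lines 52~67, the second writeback case: this is global reclaim or memcg reclaim that has dirty page throttling (sane_reclaim()), or the page is not yet marked for reclaim, or the fs may not be entered. The reclaim flag is set so that the page is reclaimed as soon as its writeback ends, and the nr_writeback counter is incremented. The code then jumps to the activate_locked label so that the page is activated and returned to the lru.
In code lines 70~76, the third writeback case (legacy memcg): the page is unlocked, the code waits until its writeback completes, and then the page is added to the tail of page_list so that the same page is tried again.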
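The order in which the three writeback cases are tested matters, so it can help to see the branch as a pure decision function. The following is a minimal userspace sketch, not kernel code: `struct wb_ctx`, `enum wb_action` and `classify_writeback()` are hypothetical names, and each boolean field stands in for the kernel predicate named in its comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the PageWriteback() branch; not kernel API. */
enum wb_action {
	WB_ACTIVATE_IMMEDIATE,	/* case 1: nr_immediate++, goto activate_locked */
	WB_MARK_AND_ACTIVATE,	/* case 2: SetPageReclaim(), nr_writeback++ */
	WB_WAIT_AND_RETRY,	/* case 3: wait_on_page_writeback(), retry page */
};

struct wb_ctx {
	bool is_kswapd;		/* current_is_kswapd() */
	bool page_reclaim;	/* PageReclaim(page) */
	bool pgdat_writeback;	/* test_bit(PGDAT_WRITEBACK, &pgdat->flags) */
	bool sane_reclaim;	/* sane_reclaim(sc) */
	bool may_enter_fs;	/* may_enter_fs */
};

static enum wb_action classify_writeback(const struct wb_ctx *c)
{
	assert(c != NULL);

	/* Case 1 is checked first, so kswapd never reaches the waiting case
	 * for a recycled page on a node flagged PGDAT_WRITEBACK. */
	if (c->is_kswapd && c->page_reclaim && c->pgdat_writeback)
		return WB_ACTIVATE_IMMEDIATE;

	/* Case 2: any one of the three conditions is enough. */
	if (c->sane_reclaim || !c->page_reclaim || !c->may_enter_fs)
		return WB_MARK_AND_ACTIVATE;

	/* Case 3: legacy memcg, page already marked, fs may be entered. */
	return WB_WAIT_AND_RETRY;
}
```

In this model only a legacy memcg reclaimer that may enter the fs and sees an already marked page ends up waiting, which matches case 3 of the comment in the listing.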

mm/vmscan.c -3/6-

                if (!force_reclaim)
                        references = page_check_references(page, sc);

                switch (references) {
                case PAGEREF_ACTIVATE:
                        goto activate_locked;
                case PAGEREF_KEEP:
                        nr_ref_keep++;
                        goto keep_locked;
                case PAGEREF_RECLAIM:
                case PAGEREF_RECLAIM_CLEAN:
                        ; /* try to reclaim the page below */
                }

                /*
                 * Anonymous process memory has backing store?
                 * Try to allocate it some swap space here.
                 * Lazyfree page could be freed directly
                 */
                if (PageAnon(page) && PageSwapBacked(page)) {
                        if (!PageSwapCache(page)) {
                                if (!(sc->gfp_mask & __GFP_IO))
                                        goto keep_locked;
                                if (PageTransHuge(page)) {
                                        /* cannot split THP, skip it */
                                        if (!can_split_huge_page(page, NULL))
                                                goto activate_locked;
                                        /*
                                         * Split pages without a PMD map right
                                         * away. Chances are some or all of the
                                         * tail pages can be freed without IO.
                                         */
                                        if (!compound_mapcount(page) &&
                                            split_huge_page_to_list(page,
                                                                    page_list))
                                                goto activate_locked;
                                }
                                if (!add_to_swap(page)) {
                                        if (!PageTransHuge(page))
                                                goto activate_locked;
                                        /* Fallback to swap normal pages */
                                        if (split_huge_page_to_list(page,
                                                                    page_list))
                                                goto activate_locked;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
                                        count_vm_event(THP_SWPOUT_FALLBACK);
#endif
                                        if (!add_to_swap(page))
                                                goto activate_locked;
                                }

                                may_enter_fs = 1;

                                /* Adding to swap updated mapping */
                                mapping = page_mapping(page);
                        }
                } else if (unlikely(PageTransHuge(page))) {
                        /* Split file THP */
                        if (split_huge_page_to_list(page, page_list))
                                goto keep_locked;
                }

                /*
                 * The page is mapped into the page tables of one or more
                 * processes. Try to unmap it here.
                 */
                if (page_mapped(page)) {
                        enum ttu_flags flags = ttu_flags | TTU_BATCH_FLUSH;

                        if (unlikely(PageTransHuge(page)))
                                flags |= TTU_SPLIT_HUGE_PMD;
                        if (!try_to_unmap(page, flags)) {
                                nr_unmap_fail++;
                                goto activate_locked;
                        }
                }

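In code lines 1~13, if @force_reclaim is true, reclaim is forced. Otherwise one of the following paths is taken according to the result of the page reference check.
- For a referenced swap-backed page, a page referenced two or more times, or a referenced executable file page, the code jumps to the activate_locked label to return the page to the active lru.
- For any other referenced page, the code jumps to the keep_locked label to return the page to the inactive lru.
- Otherwise the code continues below in order to reclaim the page.
In code lines 20~56, a normal anon page that can use the swap area is handled. If it is not yet in the swap cache, the following is done.
- If io is not allowed, jump to the keep_locked label.
- For a thp, jump to the activate_locked label if it cannot be split. A thp without a PMD mapping is split right away into order 0 pages that are added to page_list, and if that split fails the code jumps to the activate_locked label.
- The page is added to swap. If this fails for a normal page, jump to the activate_locked label. If it fails for a thp, the thp is split into order 0 pages that are added to page_list and add_to_swap() is tried once more. A failure of either step also leads to activate_locked.
In code lines 57~61, a thp that is not a swap-backed anon page (a file thp) is split into order 0 pages that are added to page_list. If the split fails, jump to the keep_locked label.
In code lines 67~76, a mapped page is unmapped. If unmapping fails, nr_unmap_fail is incremented and the code jumps to activate_locked to return the page to the active lru.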

mm/vmscan.c -4/6-

                if (PageDirty(page)) {
                        /*
                         * Only kswapd can writeback filesystem pages
                         * to avoid risk of stack overflow. But avoid
                         * injecting inefficient single-page IO into
                         * flusher writeback as much as possible: only
                         * write pages when we've encountered many
                         * dirty pages, and when we've already scanned
                         * the rest of the LRU for clean pages and see
                         * the same dirty pages again (PageReclaim).
                         */
                        if (page_is_file_cache(page) &&
                            (!current_is_kswapd() || !PageReclaim(page) ||
                             !test_bit(PGDAT_DIRTY, &pgdat->flags))) {
                                /*
                                 * Immediately reclaim when written back.
                                 * Similar in principal to deactivate_page()
                                 * except we already have the page isolated
                                 * and know it's dirty
                                 */
                                inc_node_page_state(page, NR_VMSCAN_IMMEDIATE);
                                SetPageReclaim(page);

                                goto activate_locked;
                        }

                        if (references == PAGEREF_RECLAIM_CLEAN)
                                goto keep_locked;
                        if (!may_enter_fs)
                                goto keep_locked;
                        if (!sc->may_writepage)
                                goto keep_locked;

                        /*
                         * Page is dirty. Flush the TLB if a writable entry
                         * potentially exists to avoid CPU writes after IO
                         * starts and then write it out here.
                         */
                        try_to_unmap_flush_dirty();
                        switch (pageout(page, mapping, sc)) {
                        case PAGE_KEEP:
                                goto keep_locked;
                        case PAGE_ACTIVATE:
                                goto activate_locked;
                        case PAGE_SUCCESS:
                                if (PageWriteback(page))
                                        goto keep;
                                if (PageDirty(page))
                                        goto keep;

                                /*
                                 * A synchronous write - probably a ramdisk.  Go
                                 * ahead and try to reclaim the page.
                                 */
                                if (!trylock_page(page))
                                        goto keep;
                                if (PageDirty(page) || PageWriteback(page))
                                        goto keep_locked;
                                mapping = page_mapping(page);
                        case PAGE_CLEAN:
                                ; /* try to free the page below */
                        }
                }

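Code line 1 starts the handling of a dirty page.
In code lines 12~25, a dirty file cache page is written out through pageout() only by kswapd, and only when the page has already been marked PageReclaim and the node has the PGDAT_DIRTY flag set. In every other case a file cache page has NR_VMSCAN_IMMEDIATE counted and the reclaim flag set, and the code jumps to the activate_locked label to return it to the active lru.
In code lines 27~32, if the page reference check returned PAGEREF_RECLAIM_CLEAN, the fs may not be entered, or writing pages is not allowed (sc->may_writepage), the code jumps to the keep_locked label to return the page to the lru.
In code line 39, the TLB is flushed in case a writable entry may still exist, so that no cpu writes to the page after the IO has started.
In code lines 40~62, pageout() is called to request that the dirty page be written to the file system. On a PAGE_CLEAN result the code continues below to free the page. On PAGE_SUCCESS the code continues only if the write has already completed synchronously. In the remaining cases the page is returned to the active or inactive lru depending on its state.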
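The conditions that decide whether a dirty page ever reaches pageout() can be condensed into one function. This is a userspace sketch under stated assumptions: `struct dirty_ctx`, `enum dirty_action` and `classify_dirty()` are hypothetical names, and the boolean fields stand in for the kernel predicates named in the comments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the PageDirty() gate; not kernel API. */
enum dirty_action {
	DIRTY_DEFER_TO_FLUSHER,	/* SetPageReclaim(), goto activate_locked */
	DIRTY_KEEP,		/* goto keep_locked */
	DIRTY_PAGEOUT,		/* try_to_unmap_flush_dirty(), pageout() */
};

struct dirty_ctx {
	bool file_cache;	/* page_is_file_cache(page) */
	bool is_kswapd;		/* current_is_kswapd() */
	bool page_reclaim;	/* PageReclaim(page) */
	bool pgdat_dirty;	/* test_bit(PGDAT_DIRTY, &pgdat->flags) */
	bool reclaim_clean;	/* references == PAGEREF_RECLAIM_CLEAN */
	bool may_enter_fs;	/* may_enter_fs */
	bool may_writepage;	/* sc->may_writepage */
};

static enum dirty_action classify_dirty(const struct dirty_ctx *c)
{
	assert(c != NULL);

	/* File pages are left to the flusher threads unless kswapd has
	 * already seen this page once on a node flagged PGDAT_DIRTY. */
	if (c->file_cache &&
	    (!c->is_kswapd || !c->page_reclaim || !c->pgdat_dirty))
		return DIRTY_DEFER_TO_FLUSHER;

	/* Any of these three conditions forbids writing the page here. */
	if (c->reclaim_clean || !c->may_enter_fs || !c->may_writepage)
		return DIRTY_KEEP;

	return DIRTY_PAGEOUT;
}
```

With these rules a direct reclaimer never writes a dirty file page itself, while a dirty anon page can be written from either context as long as the three keep conditions do not apply.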

mm/vmscan.c -5/6-

                /*
                 * If the page has buffers, try to free the buffer mappings
                 * associated with this page. If we succeed we try to free
                 * the page as well.
                 *
                 * We do this even if the page is PageDirty().
                 * try_to_release_page() does not perform I/O, but it is
                 * possible for a page to have PageDirty set, but it is actually
                 * clean (all its buffers are clean).  This happens if the
                 * buffers were written out directly, with submit_bh(). ext3
                 * will do this, as well as the blockdev mapping.
                 * try_to_release_page() will discover that cleanness and will
                 * drop the buffers and mark the page clean - it can be freed.
                 *
                 * Rarely, pages can have buffers and no ->mapping.  These are
                 * the pages which were not successfully invalidated in
                 * truncate_complete_page().  We try to drop those buffers here
                 * and if that worked, and the page is no longer mapped into
                 * process address space (page_count == 1) it can be freed.
                 * Otherwise, leave the page on the LRU so it is swappable.
                 */
                if (page_has_private(page)) {
                        if (!try_to_release_page(page, sc->gfp_mask))
                                goto activate_locked;
                        if (!mapping && page_count(page) == 1) {
                                unlock_page(page);
                                if (put_page_testzero(page))
                                        goto free_it;
                                else {
                                        /*
                                         * rare race with speculative reference.
                                         * the speculative reference will free
                                         * this page shortly, so we may
                                         * increment nr_reclaimed here (and
                                         * leave it off the LRU).
                                         */
                                        nr_reclaimed++;
                                        continue;
                                }
                        }
                }

                if (PageAnon(page) && !PageSwapBacked(page)) {
                        /* follow __remove_mapping for reference */
                        if (!page_ref_freeze(page, 1))
                                goto keep_locked;
                        if (PageDirty(page)) {
                                page_ref_unfreeze(page, 1);
                                goto keep_locked;
                        }

                        count_vm_event(PGLAZYFREED);
                        count_memcg_page_event(page, PGLAZYFREED);
                } else if (!mapping || !__remove_mapping(mapping, page, true))
                        goto keep_locked;

                unlock_page(page);

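In code lines 22~41, for a page that has private buffers attached by the file system, the buffers are released. If releasing them fails, the code jumps to the activate_locked label to return the page to the active lru. If the page has no mapping and its reference count is 1, the page is unlocked, the last reference is dropped, and the code jumps to the free_it label to reclaim it. In the rare race where a speculative reference holder is about to free the page, the nr_reclaimed count is incremented and the next page is processed.
In code lines 43~53, a lazyfree anon page (an anon page that is not swap-backed) is handled in the following order.
- If the reference count is exactly 1 (no other user), it is frozen to 0. If there is still another user, jump to the keep_locked label.
- If the page is dirty, the reference count is restored to 1 and the code jumps to the keep_locked label to return the page to the lru.
- Finally the PGLAZYFREED counter is incremented. The page is then reclaimed by the code that continues at the free_it: label.
In code lines 54~55, for any page other than a lazyfree anon page, if the page has no mapping or the mapping cannot be removed, the code jumps to the keep_locked label to return the page to the lru.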
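The freeze and unfreeze steps used for the lazyfree page amount to a compare-and-swap on the reference count. The sketch below models that idea with C11 atomics in userspace. `struct fake_page`, `ref_freeze()` and `ref_unfreeze()` are hypothetical names and only approximate what page_ref_freeze() and page_ref_unfreeze() do.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; only the refcount is modeled. */
struct fake_page {
	atomic_int refcount;
};

/*
 * Succeeds only if the count is exactly 'expected', and in that case
 * atomically replaces it with 0 so that no new reference can be taken.
 */
static bool ref_freeze(struct fake_page *p, int expected)
{
	int old = expected;

	assert(p != NULL);
	return atomic_compare_exchange_strong(&p->refcount, &old, 0);
}

/* Gives the count back after a freeze, e.g. when the page turned dirty. */
static void ref_unfreeze(struct fake_page *p, int count)
{
	assert(p != NULL);
	atomic_store_explicit(&p->refcount, count, memory_order_release);
}
```

A caller in the style of the lazyfree branch would try `ref_freeze(page, 1)`, give up if it fails, and call `ref_unfreeze(page, 1)` if the page turns out to be dirty after the freeze.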

mm/vmscan.c -6/6-

free_it:
                nr_reclaimed++;

                /*
                 * Is there need to periodically free_page_list? It would
                 * appear not as the counts should be low
                 */
                if (unlikely(PageTransHuge(page))) {
                        mem_cgroup_uncharge(page);
                        (*get_compound_page_dtor(page))(page);
                } else
                        list_add(&page->lru, &free_pages);
                continue;

activate_locked:
                /* Not a candidate for swapping, so reclaim swap space. */
                if (PageSwapCache(page) && (mem_cgroup_swap_full(page) ||
                                                PageMlocked(page)))
                        try_to_free_swap(page);
                VM_BUG_ON_PAGE(PageActive(page), page);
                if (!PageMlocked(page)) {
                        SetPageActive(page);
                        pgactivate++;
                        count_memcg_page_event(page, PGACTIVATE);
                }
keep_locked:
                unlock_page(page);
keep:
                list_add(&page->lru, &ret_pages);
                VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
        }

        mem_cgroup_uncharge_list(&free_pages);
        try_to_unmap_flush();
        free_unref_page_list(&free_pages);

        list_splice(&ret_pages, page_list);
        count_vm_events(PGACTIVATE, pgactivate);

        if (stat) {
                stat->nr_dirty = nr_dirty;
                stat->nr_congested = nr_congested;
                stat->nr_unqueued_dirty = nr_unqueued_dirty;
                stat->nr_writeback = nr_writeback;
                stat->nr_immediate = nr_immediate;
                stat->nr_activate = pgactivate;
                stat->nr_ref_keep = nr_ref_keep;
                stat->nr_unmap_fail = nr_unmap_fail;
        }
        return nr_reclaimed;
}

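In code lines 1~13, the free_it: label. Here the page to be reclaimed is added to the free_pages list and the loop continues with the next page. A thp is instead uncharged from memcg and freed right away through its compound page destructor (free_transhuge_page()).
In code lines 15~25, the activate_locked: label. Here the page is marked active. If the page is in the swap cache and either the memcg swap space is full or the page is mlocked, its swap space is released. If the page is not mlocked, it is set active, and the code continues into the keep_locked: label below.
In code lines 26~27, the keep_locked: label. The page is unlocked and the code continues into the keep: label below.
In code lines 28~31, the keep: label. Here the page is added to the ret_pages list and the loop continues with the next page.
In code lines 33~35, once the loop has completed, the pages on the free_pages list are uncharged from memcg, the batched TLB flushes are performed, and the pages are freed to the pcp.
In code lines 37~38, the ret_pages list is spliced back onto the head of @page_list (rotate) and the PGACTIVATE counter is increased by the number of activated pages.
In code lines 40~50, the reclaim-related statistics are stored and the number of reclaimed pages is returned.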
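The three return labels are stacked so that each one falls through into the next, which is why a jump to activate_locked also unlocks the page and puts it on ret_pages. A minimal userspace sketch of that layout follows. `struct model_page`, `struct tally`, `enum outcome` and `finish_page()` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; they do not exist in the kernel. */
enum outcome { OUT_FREE, OUT_ACTIVATE, OUT_KEEP_LOCKED, OUT_KEEP };

struct model_page {
	bool locked;
	bool active;
	bool mlocked;
};

struct tally {
	int freed;	/* pages that went to free_pages */
	int activated;	/* pgactivate */
	int returned;	/* pages that went to ret_pages */
};

static void finish_page(struct model_page *pg, enum outcome o, struct tally *t)
{
	assert(pg != NULL && t != NULL);

	switch (o) {
	case OUT_FREE:		goto free_it;
	case OUT_ACTIVATE:	goto activate_locked;
	case OUT_KEEP_LOCKED:	goto keep_locked;
	case OUT_KEEP:		goto keep;
	}
	return;

free_it:
	t->freed++;
	return;			/* 'continue' in the real loop */

activate_locked:
	if (!pg->mlocked) {
		pg->active = true;
		t->activated++;
	}
	/* falls through */
keep_locked:
	pg->locked = false;
	/* falls through */
keep:
	t->returned++;
}
```

In this model an mlocked page that takes the activate path is unlocked and returned without being marked active, which matches the PageMlocked() check in the listing.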

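The following figure shows the flow in which the pages on the scanned and isolated page_list are reclaimed in order to secure free pages.

At the activate arrow, the PG_active flag is set on the page and the page is returned to the lru.
At the keep arrow, the page is returned to the lru.
At the free arrow, the free page is returned to the buddy system.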

Checking the dirty & writeback state of a page

page_check_dirty_writeback()

mm/vmscan.c

/* Check if a page is dirty or under writeback */
static void page_check_dirty_writeback(struct page *page,
                                       bool *dirty, bool *writeback)
{
        struct address_space *mapping;

        /*
         * Anonymous pages are not handled by flushers and must be written
         * from reclaim context. Do not stall reclaim based on them
         */
        if (!page_is_file_cache(page) ||
            (PageAnon(page) && !PageSwapBacked(page))) {
                *dirty = false;
                *writeback = false;
                return;
        }

        /* By default assume that the page flags are accurate */
        *dirty = PageDirty(page);
        *writeback = PageWriteback(page);

        /* Verify dirty/writeback state if the filesystem supports it */
        if (!page_has_private(page))
                return;

        mapping = page_mapping(page);
        if (mapping && mapping->a_ops->is_dirty_writeback)
                mapping->a_ops->is_dirty_writeback(page, dirty, writeback);
}

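Finds out whether a page is dirty and whether it is under writeback.

In code lines 11~16, if the page is not a file cache page, or is an anon page that cannot use the swap area, false is stored in the output arguments dirty and writeback and the function returns.
- In effect this covers anon pages regardless of whether they are swap-backed.
In code lines 19~20, the states of the dirty and writeback flags are stored.
In code lines 23~24, if the page is not a private page with its own buffers, the function returns.
In code lines 26~28, for a page with a mapping, the dirty and writeback state is obtained through the is_dirty_writeback() handler if the file system provides one.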

Checking the reference state of a page

page_check_references()

mm/vmscan.c

static enum page_references page_check_references(struct page *page,
                                                  struct scan_control *sc)
{
        int referenced_ptes, referenced_page;
        unsigned long vm_flags;

        referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup,
                                          &vm_flags);
        referenced_page = TestClearPageReferenced(page);

        /*
         * Mlock lost the isolation race with us.  Let try_to_unmap()
         * move the page to the unevictable list.
         */
        if (vm_flags & VM_LOCKED)
                return PAGEREF_RECLAIM;

        if (referenced_ptes) {
                if (PageSwapBacked(page))
                        return PAGEREF_ACTIVATE;
                /*
                 * All mapped pages start out with page table
                 * references from the instantiating fault, so we need
                 * to look twice if a mapped file page is used more
                 * than once.
                 *
                 * Mark it and spare it for another trip around the
                 * inactive list.  Another page table reference will
                 * lead to its activation.
                 *
                 * Note: the mark is set for activated pages as well
                 * so that recently deactivated but used pages are
                 * quickly recovered.
                 */
                SetPageReferenced(page);

                if (referenced_page || referenced_ptes > 1)
                        return PAGEREF_ACTIVATE;

                /*
                 * Activate file-backed executable pages after first usage.
                 */
                if (vm_flags & VM_EXEC)
                        return PAGEREF_ACTIVATE;

                return PAGEREF_KEEP;
        }

        /* Reclaim if clean, defer dirty pages to writeback */
        if (referenced_page && !PageSwapBacked(page))
                return PAGEREF_RECLAIM_CLEAN;

        return PAGEREF_RECLAIM;
}

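Checks the references to a page and reports one of the following 4 results.

PAGEREF_RECLAIM
- Go ahead with reclaiming the page
PAGEREF_RECLAIM_CLEAN
- Reclaim the page only if it is clean. A dirty page is left for writeback instead of being written out here
PAGEREF_KEEP
- Keep the page on the inactive lru and handle it on a later pass
PAGEREF_ACTIVATE
- The page is in use, so move it to the active lru

In code lines 7~8, the number of ptes through which the page was referenced is obtained.
- See: Rmap -2- (TTU & Rmap Walk) | 문c
In code line 9, the referenced flag of the page is read and cleared.
In code lines 15~16, if a vma that maps the page is VM_LOCKED, PAGEREF_RECLAIM is returned so that reclaim proceeds and try_to_unmap() can move the page to the unevictable list.
In code lines 18~47, for a page referenced through ptes, PAGEREF_KEEP is returned so that the page goes back to its lru. In the following cases, however, PAGEREF_ACTIVATE is returned so that the page goes to the active lru.
- The page is swap-backed (for example an anon page that can use the swap area)
- The referenced flag is set again, and either it had already been set before or the page is referenced through 2 or more ptes
- The page belongs to an executable file mapping
In code lines 50~51, if no pte referenced the page but the referenced flag had been set before and the page is not swap-backed, PAGEREF_RECLAIM_CLEAN is returned so that the page is reclaimed only if it is clean.
In code line 53, in all other cases PAGEREF_RECLAIM is returned so that reclaim of the page proceeds.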
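The decision made by page_check_references() depends on only a handful of inputs, so it can be modeled as a pure function and tried out in userspace. The sketch below is such a model under stated assumptions: `struct ref_input` and `check_references()` are hypothetical, the enum only mirrors the kernel's names, and the SetPageReferenced() side effect of the real function is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mirrors the kernel's result names; defined here only for the model. */
enum pageref {
	PAGEREF_RECLAIM,
	PAGEREF_RECLAIM_CLEAN,
	PAGEREF_KEEP,
	PAGEREF_ACTIVATE,
};

/* Hypothetical bundle of the values the real function reads. */
struct ref_input {
	int referenced_ptes;	/* return value of page_referenced() */
	bool referenced_page;	/* old PG_referenced bit */
	bool vm_locked;		/* vm_flags & VM_LOCKED */
	bool vm_exec;		/* vm_flags & VM_EXEC */
	bool swap_backed;	/* PageSwapBacked(page) */
};

static enum pageref check_references(const struct ref_input *in)
{
	assert(in != NULL);

	if (in->vm_locked)
		return PAGEREF_RECLAIM;

	if (in->referenced_ptes) {
		if (in->swap_backed)
			return PAGEREF_ACTIVATE;
		/* File page: a second observed use is needed to activate. */
		if (in->referenced_page || in->referenced_ptes > 1)
			return PAGEREF_ACTIVATE;
		if (in->vm_exec)
			return PAGEREF_ACTIVATE;
		return PAGEREF_KEEP;
	}

	/* No pte reference, but the page was marked referenced earlier. */
	if (in->referenced_page && !in->swap_backed)
		return PAGEREF_RECLAIM_CLEAN;

	return PAGEREF_RECLAIM;
}
```

A mapped file page touched through a single pte therefore gets PAGEREF_KEEP on the first pass and PAGEREF_ACTIVATE on the next pass if it is touched again, because the referenced flag set on the first pass is seen as `referenced_page` on the second.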

Shrinker

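A shrinker can be set up so that a slab cache is scanned and its unused slab objects are removed in order to secure free pages. Shrinkers are mostly used by subsystems that cache heavily, such as file systems, and they are registered and unregistered through the following api.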

register_shrinker()
unregister_shrinker()

Representative subsystems and drivers that use it are as follows.

zsmalloc
huge_memory
kvm
ubifs
ext4
f2fs
xfs
nfs
gpu
ion
bcache
raid5
virtio_balloon

struct shrinker

include/linux/shrinker.h

/*
 * A callback you can register to apply pressure to ageable caches.
 *
 * @count_objects should return the number of freeable items in the cache. If
 * there are no objects to free, it should return SHRINK_EMPTY, while 0 is
 * returned in cases of the number of freeable items cannot be determined
 * or shrinker should skip this cache for this time (e.g., their number
 * is below shrinkable limit). No deadlock checks should be done during the
 * count callback - the shrinker relies on aggregating scan counts that couldn't
 * be executed due to potential deadlocks to be run at a later call when the
 * deadlock condition is no longer pending.
 *
 * @scan_objects will only be called if @count_objects returned a non-zero
 * value for the number of freeable objects. The callout should scan the cache
 * and attempt to free items from the cache. It should then return the number
 * of objects freed during the scan, or SHRINK_STOP if progress cannot be made
 * due to potential deadlocks. If SHRINK_STOP is returned, then no further
 * attempts to call the @scan_objects will be made from the current reclaim
 * context.
 *
 * @flags determine the shrinker abilities, like numa awareness
 */

struct shrinker {
        unsigned long (*count_objects)(struct shrinker *,
                                       struct shrink_control *sc);
        unsigned long (*scan_objects)(struct shrinker *,
                                      struct shrink_control *sc);

        long batch;     /* reclaim batch size, 0 = default */
        int seeks;      /* seeks to recreate an obj */
        unsigned flags;

        /* These are for internal use */
        struct list_head list;
#ifdef CONFIG_MEMCG_KMEM
        /* ID in shrinker_idr */
        int id;
#endif
        /* objs pending delete, per node */
        atomic_long_t *nr_deferred;
};

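(*count_objects)
- Returns the number of freeable objects in the cache. Returns SHRINK_EMPTY if there are none.
(*scan_objects)
- Performs reclaim on the freeable objects in the cache. Returns the number of objects that were freed, or SHRINK_STOP if no progress can be made.
batch
- Reclaim is performed in units of this batch size.
- If it is not specified, SHRINK_BATCH(128) objects are processed at a time.
seeks
- If it is not specified (0), half of the freeable objects are always reclaimed.
- If it is specified, the amount to reclaim is (freeable objects >> priority) multiplied by 4/seeks.
flags
- The following flags are used.
  - SHRINKER_NUMA_AWARE
    - Allows the shrinker to shrink per node.
  - SHRINKER_MEMCG_AWARE
    - Allows the shrinker to shrink per memcg.
list
- The list node used when the shrinker is registered on shrinker_list.
id
- The id used when the shrinker is registered in shrinker_idr for memcg.
*nr_deferred
- This pointer refers to an array of counters allocated one per node, each of which holds the number of objects whose deletion was deferred on that node.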
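The seeks rule can be made concrete with a small calculation. The following userspace sketch reproduces only the arithmetic described for the seeks field. `scan_delta()` is a hypothetical helper name, and the way the kernel combines this value with nr_deferred and the batch size is not modeled here.

```c
#include <assert.h>

/*
 * Hypothetical helper: number of objects to scan for one shrinker call.
 * freeable: value reported by count_objects()
 * priority: reclaim priority (a larger value means less pressure)
 * seeks:    shrinker->seeks, where 0 means "not specified"
 */
static unsigned long long scan_delta(long freeable, int priority, int seeks)
{
	unsigned long long delta;

	assert(freeable >= 0 && priority >= 0 && seeks >= 0);

	if (seeks) {
		delta = (unsigned long long)freeable >> priority;
		delta *= 4;
		delta /= (unsigned long long)seeks;
	} else {
		/* No seek cost given: always target half of the cache. */
		delta = (unsigned long long)freeable / 2;
	}
	return delta;
}
```

For example, with 4096 freeable objects, priority 12 and seeks 2, the shift leaves 1, which is then multiplied by 4 and divided by 2, giving a target of 2 objects. The same cache at priority 0 would be targeted with 8192, which suggests why the result still has to be limited by the caller.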

Registering a shrinker

register_shrinker()

mm/vmscan.c

int register_shrinker(struct shrinker *shrinker)
{
        int err = prealloc_shrinker(shrinker);

        if (err)
                return err;
        register_shrinker_prepared(shrinker);
        return 0;
}
EXPORT_SYMBOL(register_shrinker);

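Registers a shrinker.

In code lines 3~6, the items that must be allocated before the shrinker is registered are prepared.
In code line 7, the shrinker is registered.

The following figure shows a shrinker being registered through the register_shrinker() function.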

prealloc_shrinker()

mm/vmscan.c

/*
 * Add a shrinker callback to be called from the vm.
 */

int prealloc_shrinker(struct shrinker *shrinker)
{
        size_t size = sizeof(*shrinker->nr_deferred);

        if (shrinker->flags & SHRINKER_NUMA_AWARE)
                size *= nr_node_ids;

        shrinker->nr_deferred = kzalloc(size, GFP_KERNEL);
        if (!shrinker->nr_deferred)
                return -ENOMEM;

        if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
                if (prealloc_memcg_shrinker(shrinker))
                        goto free_deferred;
        }

        return 0;

free_deferred:
        kfree(shrinker->nr_deferred);
        shrinker->nr_deferred = NULL;
        return -ENOMEM;
}

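Prepares the items that must be allocated before a shrinker is registered.

In code lines 3~10, if the SHRINKER_NUMA_AWARE flag is used, an array with one counter per node (nr_node_ids entries) is allocated and connected to shrinker->nr_deferred. Otherwise a single counter is allocated.
In code lines 12~15, if the SHRINKER_MEMCG_AWARE flag is used, an id is prepared in the idr so that the shrinker can be registered for memcg. If this fails, nr_deferred is freed and -ENOMEM is returned.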
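The sizing of nr_deferred and the later choice of index belong together: a shrinker that is not NUMA aware has a single counter and always uses slot 0. The sketch below models this in userspace under stated assumptions. `struct model_shrinker`, `MODEL_NUMA_AWARE`, `model_prealloc()` and `deferred_index()` are hypothetical names, and calloc() stands in for kzalloc().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_NUMA_AWARE 0x1u	/* stand-in for SHRINKER_NUMA_AWARE */

/* Hypothetical, reduced version of struct shrinker. */
struct model_shrinker {
	unsigned flags;
	long *nr_deferred;	/* one counter, or one per node */
};

/* Allocates zeroed counters; returns 0 on success, -1 on failure. */
static int model_prealloc(struct model_shrinker *s, unsigned nr_node_ids)
{
	size_t n;

	assert(s != NULL && nr_node_ids > 0);
	n = (s->flags & MODEL_NUMA_AWARE) ? nr_node_ids : 1;
	s->nr_deferred = calloc(n, sizeof(*s->nr_deferred));
	return s->nr_deferred ? 0 : -1;
}

/* Slot to use for node 'nid'; non NUMA aware shrinkers share slot 0. */
static size_t deferred_index(const struct model_shrinker *s, int nid)
{
	assert(s != NULL && nid >= 0);
	return (s->flags & MODEL_NUMA_AWARE) ? (size_t)nid : 0;
}
```

The index rule corresponds to the `nid = 0` assignment at the start of do_shrink_slab() for shrinkers without SHRINKER_NUMA_AWARE.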

prealloc_memcg_shrinker()

mm/vmscan.c

static int prealloc_memcg_shrinker(struct shrinker *shrinker)
{
        int id, ret = -ENOMEM;

        down_write(&shrinker_rwsem);
        /* This may call shrinker, so it must use down_read_trylock() */
        id = idr_alloc(&shrinker_idr, SHRINKER_REGISTERING, 0, 0, GFP_KERNEL);
        if (id < 0)
                goto unlock;

        if (id >= shrinker_nr_max) {
                if (memcg_expand_shrinker_maps(id)) {
                        idr_remove(&shrinker_idr, id);
                        goto unlock;
                }

                shrinker_nr_max = id + 1;
        }
        shrinker->id = id;
        ret = 0;
unlock:
        up_write(&shrinker_rwsem);
        return ret;
}

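Prepares an idr entry so that the shrinker can be registered for memcg.

In code lines 7~9, an id is allocated from shrinker_idr, with the SHRINKER_REGISTERING placeholder stored in its slot.
In code lines 11~18, if the newly issued id is greater than or equal to shrinker_nr_max, the memcg shrinker bitmaps are expanded and shrinker_nr_max is updated to id + 1.
In code line 19, the id is assigned to the shrinker.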

register_shrinker_prepared()

mm/vmscan.c

void register_shrinker_prepared(struct shrinker *shrinker)
{
        down_write(&shrinker_rwsem);
        list_add_tail(&shrinker->list, &shrinker_list);
#ifdef CONFIG_MEMCG_KMEM
        if (shrinker->flags & SHRINKER_MEMCG_AWARE)
                idr_replace(&shrinker_idr, shrinker, shrinker->id);
#endif
        up_write(&shrinker_rwsem);
}

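Registers the shrinker.

In code line 4, @shrinker is added to the tail of the global shrinker_list.
In code lines 6~7, for a shrinker that uses the SHRINKER_MEMCG_AWARE flag, the @shrinker pointer is stored in the slot of the id that was issued earlier.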
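Registration for memcg is done in two phases: the id is first reserved with a placeholder, and the real pointer is published only in the second phase. A lookup in between must treat the placeholder as "not usable yet". The sketch below models that protocol with a plain array in userspace. `slots`, `MODEL_REGISTERING`, `model_reserve_id()`, `model_publish()` and `model_lookup()` are hypothetical and do not reflect the idr implementation or its locking.

```c
#include <assert.h>
#include <stddef.h>

#define SLOTS 8

/* Unique non-NULL marker, playing the role of SHRINKER_REGISTERING. */
static int registering_marker;
#define MODEL_REGISTERING ((void *)&registering_marker)

static void *slots[SLOTS];	/* stand-in for shrinker_idr */

/* Phase 1: reserve the lowest free id and park the placeholder there. */
static int model_reserve_id(void)
{
	for (int i = 0; i < SLOTS; i++) {
		if (!slots[i]) {
			slots[i] = MODEL_REGISTERING;
			return i;
		}
	}
	return -1;
}

/* Phase 2: replace the placeholder with the real object. */
static void model_publish(int id, void *shrinker)
{
	assert(id >= 0 && id < SLOTS && shrinker != NULL);
	slots[id] = shrinker;
}

/* Lookup that skips ids which are reserved but not yet published. */
static void *model_lookup(int id)
{
	assert(id >= 0 && id < SLOTS);
	return slots[id] == MODEL_REGISTERING ? NULL : slots[id];
}
```

The kernel's shrink_slab_memcg() tells the two "nothing to call" cases apart: it clears the map bit only when the slot is empty and merely skips a slot that still holds the placeholder. This model folds both into a NULL return for brevity.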

Unregistering a shrinker

unregister_shrinker()

mm/vmscan.c

/*
 * Remove one
 */

void unregister_shrinker(struct shrinker *shrinker)
{
        if (!shrinker->nr_deferred)
                return;
        if (shrinker->flags & SHRINKER_MEMCG_AWARE)
                unregister_memcg_shrinker(shrinker);
        down_write(&shrinker_rwsem);
        list_del(&shrinker->list);
        up_write(&shrinker_rwsem);
        kfree(shrinker->nr_deferred);
        shrinker->nr_deferred = NULL;
}
EXPORT_SYMBOL(unregister_shrinker);

Unregisters a shrinker.

Shrinking slab caches through the registered shrinkers

shrink_slab()

mm/vmscan.c

/**
 * shrink_slab - shrink slab caches
 * @gfp_mask: allocation context
 * @nid: node whose slab caches to target
 * @memcg: memory cgroup whose slab caches to target
 * @priority: the reclaim priority
 *
 * Call the shrink functions to age shrinkable caches.
 *
 * @nid is passed along to shrinkers with SHRINKER_NUMA_AWARE set,
 * unaware shrinkers will receive a node id of 0 instead.
 *
 * @memcg specifies the memory cgroup to target. Unaware shrinkers
 * are called only if it is the root cgroup.
 *
 * @priority is sc->priority, we take the number of objects and >> by priority
 * in order to get the scan target.
 *
 * Returns the number of reclaimed slab objects.
 */

static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
                                 struct mem_cgroup *memcg,
                                 int priority)
{
        unsigned long ret, freed = 0;
        struct shrinker *shrinker;

        if (!mem_cgroup_is_root(memcg))
                return shrink_slab_memcg(gfp_mask, nid, memcg, priority);

        if (!down_read_trylock(&shrinker_rwsem))
                goto out;

        list_for_each_entry(shrinker, &shrinker_list, list) {
                struct shrink_control sc = {
                        .gfp_mask = gfp_mask,
                        .nid = nid,
                        .memcg = memcg,
                };

                ret = do_shrink_slab(&sc, shrinker, priority);
                if (ret == SHRINK_EMPTY)
                        ret = 0;
                freed += ret;
                /*
                 * Bail out if someone want to register a new shrinker to
                 * prevent the regsitration from being stalled for long periods
                 * by parallel ongoing shrinking.
                 */
                if (rwsem_is_contended(&shrinker_rwsem)) {
                        freed = freed ? : 1;
                        break;
                }
        }

        up_read(&shrinker_rwsem);
out:
        cond_resched();
        return freed;
}

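Slab caches are shrunk through the registered shrinkers so that free pages can be secured.

In code lines 8~9, if the requested memcg is not the root memcg, the slab caches are shrunk through that memcg's shrinkers in shrink_slab_memcg().
In code lines 14~34, the code loops over the list of registered shrinkers and shrinks the slab caches. If someone is waiting for shrinker_rwsem, for example to register a new shrinker, the loop is left early.
- Shrinkers are registered through the register_shrinker_prepared() function.
In code lines 37~39, the out: label returns the number of slab objects that were freed.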
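Two details of the loop are easy to miss: SHRINK_EMPTY is folded into 0 before it is added up, and an early exit on lock contention never reports 0 (`freed ? : 1` is the GNU C short form of `freed ? freed : 1`). The userspace sketch below models only this accumulation. `model_shrink_slab()` and its parameters are hypothetical, and the per-shrinker results are passed in as an array instead of being produced by do_shrink_slab().

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_SHRINK_EMPTY (~0UL - 1)	/* same value as SHRINK_EMPTY */

/*
 * results:         what each shrinker call would have returned
 * n:               number of shrinkers on the list
 * contended_after: the lock is reported as contended after this many
 *                  shrinkers have run (0 means never contended)
 */
static unsigned long model_shrink_slab(const unsigned long *results,
				       size_t n, size_t contended_after)
{
	unsigned long freed = 0;

	assert(results != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		unsigned long ret = results[i];

		if (ret == MODEL_SHRINK_EMPTY)
			ret = 0;
		freed += ret;

		if (i + 1 == contended_after) {
			/* Standard C spelling of "freed = freed ? : 1". */
			freed = freed ? freed : 1;
			break;
		}
	}
	return freed;
}
```

Returning at least 1 on the early exit presumably keeps the caller from reading the interrupted pass as "nothing could be reclaimed". That reading is an interpretation; the comment in the listing only states why the loop bails out.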

shrink_slab_memcg()

mm/vmscan.c

static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
                        struct mem_cgroup *memcg, int priority)
{
        struct memcg_shrinker_map *map;
        unsigned long ret, freed = 0;
        int i;

        if (!memcg_kmem_enabled() || !mem_cgroup_online(memcg))
                return 0;

        if (!down_read_trylock(&shrinker_rwsem))
                return 0;

        map = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_map,
                                        true);
        if (unlikely(!map))
                goto unlock;

        for_each_set_bit(i, map->map, shrinker_nr_max) {
                struct shrink_control sc = {
                        .gfp_mask = gfp_mask,
                        .nid = nid,
                        .memcg = memcg,
                };
                struct shrinker *shrinker;

                shrinker = idr_find(&shrinker_idr, i);
                if (unlikely(!shrinker || shrinker == SHRINKER_REGISTERING)) {
                        if (!shrinker)
                                clear_bit(i, map->map);
                        continue;
                }

                ret = do_shrink_slab(&sc, shrinker, priority);
                if (ret == SHRINK_EMPTY) {
                        clear_bit(i, map->map);
                        /*
                         * After the shrinker reported that it had no objects to
                         * free, but before we cleared the corresponding bit in
                         * the memcg shrinker map, a new object might have been
                         * added. To make sure, we have the bit set in this
                         * case, we invoke the shrinker one more time and reset
                         * the bit if it reports that it is not empty anymore.
                         * The memory barrier here pairs with the barrier in
                         * memcg_set_shrinker_bit():
                         *
                         * list_lru_add()     shrink_slab_memcg()
                         *   list_add_tail()    clear_bit()
                         *   <MB>               <MB>
                         *   set_bit()          do_shrink_slab()
                         */
                        smp_mb__after_atomic();
                        ret = do_shrink_slab(&sc, shrinker, priority);
                        if (ret == SHRINK_EMPTY)
                                ret = 0;
                        else
                                memcg_set_shrinker_bit(memcg, nid, i);
                }
                freed += ret;

                if (rwsem_is_contended(&shrinker_rwsem)) {
                        freed = freed ? : 1;
                        break;
                }
        }
unlock:
        up_read(&shrinker_rwsem);
        return freed;
}

Shrinks the slab caches of the shrinkers registered for the memcg in order to free pages.

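In code lines 8~9, if the requested memcg is not online, give up and return 0.
In code lines 14~17, fetch the node's shrinker_map with rcu_dereference_protected(). If there is no map, jump to the unlock: label.
In code lines 19~34, iterate over every bit set in the shrinker_map bitmap, use the bit index as the key to look up the registered shrinker in shrinker_idr, and shrink its slab cache.
In code lines 35~58, if the shrink result is SHRINK_EMPTY, clear the corresponding bit in shrinker_map. Then shrink the slab cache once more, and if the second result is not empty, set the bit in shrinker_map again.
In code lines 59~64, add the number of shrunk slab cache objects to freed and continue the loop. If shrinker_rwsem is contended, leave the loop early.
In code lines 66~68, the unlock: label releases the shrinker_rwsem read lock and returns the number of freed slab objects.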
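
The clear-then-recheck step is easier to follow in isolation. The following is a minimal user-space sketch, not kernel code: it assumes a single-word bitmap held in a C11 atomic, and the names sketch_shrink_one(), SKETCH_EMPTY and the two demo callbacks are hypothetical stand-ins for do_shrink_slab(), SHRINK_EMPTY and a real shrinker.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the kernel's SHRINK_EMPTY return value. */
#define SKETCH_EMPTY (~0UL - 1)

/* Hypothetical callback type playing the role of do_shrink_slab(). */
typedef unsigned long (*sketch_shrink_fn)(void *ctx);

/*
 * Run one shrinker. If it reports "empty", clear its bit, issue a full
 * barrier, and run it once more so that an object added in between is
 * not lost; set the bit again if the second run finds something.
 */
static unsigned long sketch_shrink_one(atomic_ulong *map, unsigned int bit,
                                       sketch_shrink_fn shrink, void *ctx)
{
        unsigned long ret = shrink(ctx);

        if (ret == SKETCH_EMPTY) {
                atomic_fetch_and(map, ~(1UL << bit));
                atomic_thread_fence(memory_order_seq_cst);
                ret = shrink(ctx);
                if (ret == SKETCH_EMPTY)
                        ret = 0;
                else
                        atomic_fetch_or(map, 1UL << bit);
        }
        return ret;
}

/* Demo callback: empty on the first call, 3 freed objects afterwards. */
static unsigned long sketch_empty_then_three(void *ctx)
{
        int *calls = ctx;

        return (*calls)++ == 0 ? SKETCH_EMPTY : 3;
}

/* Demo callback: always reports that there is nothing to free. */
static unsigned long sketch_always_empty(void *ctx)
{
        (void)ctx;
        return SKETCH_EMPTY;
}
```

With sketch_empty_then_three the bit ends up set again and 3 is returned; with sketch_always_empty the bit stays cleared and 0 is returned, which mirrors the two branches of the kernel code.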

do_shrink_slab()

mm/vmscan.c -1/2-

static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
                                    struct shrinker *shrinker, int priority)
{
        unsigned long freed = 0;
        unsigned long long delta;
        long total_scan;
        long freeable;
        long nr;
        long new_nr;
        int nid = shrinkctl->nid;
        long batch_size = shrinker->batch ? shrinker->batch
                                          : SHRINK_BATCH;
        long scanned = 0, next_deferred;

        if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
                nid = 0;

        freeable = shrinker->count_objects(shrinker, shrinkctl);
        if (freeable == 0 || freeable == SHRINK_EMPTY)
                return freeable;

        /*
         * copy the current shrinker scan count into a local variable
         * and zero it so that other concurrent shrinker invocations
         * don't also do this scanning work.
         */
        nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);

        total_scan = nr;
        if (shrinker->seeks) {
                delta = freeable >> priority;
                delta *= 4;
                do_div(delta, shrinker->seeks);
        } else {
                /*
                 * These objects don't require any IO to create. Trim
                 * them aggressively under memory pressure to keep
                 * them from causing refetches in the IO caches.
                 */
                delta = freeable / 2;
        }

        total_scan += delta;
        if (total_scan < 0) {
                pr_err("shrink_slab: %pF negative objects to delete nr=%ld\n",
                       shrinker->scan_objects, total_scan);
                total_scan = freeable;
                next_deferred = nr;
        } else
                next_deferred = total_scan;

        /*
         * We need to avoid excessive windup on filesystem shrinkers
         * due to large numbers of GFP_NOFS allocations causing the
         * shrinkers to return -1 all the time. This results in a large
         * nr being built up so when a shrink that can do some work
         * comes along it empties the entire cache due to nr >>>
         * freeable. This is bad for sustaining a working set in
         * memory.
         *
         * Hence only allow the shrinker to scan the entire cache when
         * a large delta change is calculated directly.
         */
        if (delta < freeable / 4)
                total_scan = min(total_scan, freeable / 2);

        /*
         * Avoid risking looping forever due to too large nr value:
         * never try to free more than twice the estimate number of
         * freeable entries.
         */
        if (total_scan > freeable * 2)
                total_scan = freeable * 2;

        trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
                                   freeable, delta, total_scan, priority);

Shrinks the slab caches of the shrinkers registered globally or for the memcg requested through shrink_control, in order to free pages.

In code lines 11~12, if the shrinker does not specify how many slab objects to process at a time (batch), use SHRINK_BATCH (128).
In code lines 15~16, unless the SHRINKER_NUMA_AWARE flag is set, fix nid to 0.
In code lines 18~20, call (*count_objects) to find out how many objects the shrinker can free and store the result in freeable. If there are no freeable objects, return.
In code lines 27~29, take the number of deferred objects (nr_deferred) into total_scan so that they are processed this time, and atomically reset the stored value to 0 so that concurrent shrinker invocations do not repeat the same scanning work.
In code lines 30~50, add delta, computed from freeable using seeks and priority, to total_scan. If total_scan is negative, use the whole freeable value instead.
- If seeks is set on the shrinker
  - delta = (freeable >> priority) * 4 / shrinker->seeks
- If seeks is not set on the shrinker
  - delta = freeable / 2
In code lines 64~65, if delta is less than 25% of freeable, limit total_scan so that it does not exceed half of freeable.
In code lines 72~73, limit total_scan so that it does not exceed twice freeable.
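
To get a feel for the numbers, the arithmetic above can be reproduced outside the kernel. This is a minimal sketch under stated assumptions: sketch_total_scan() is a hypothetical helper, it takes nr_deferred as a plain argument instead of reading an atomic, and it omits the error message and tracing.

```c
#include <assert.h>

#define SKETCH_MIN(a, b) ((a) < (b) ? (a) : (b))

/*
 * Hypothetical user-space rendition of the total_scan calculation in
 * do_shrink_slab(). freeable must be positive, as in the kernel, where
 * the function returns early when nothing is freeable.
 */
static long sketch_total_scan(long freeable, long nr_deferred,
                              int priority, int seeks)
{
        unsigned long long delta;
        long total_scan = nr_deferred;

        assert(freeable > 0);

        if (seeks)
                delta = ((unsigned long long)(freeable >> priority) * 4) / seeks;
        else
                delta = freeable / 2;   /* cheap to recreate: trim aggressively */

        total_scan += delta;
        if (total_scan < 0)
                total_scan = freeable;

        /* Small delta: do not let deferred work empty the whole cache. */
        if (delta < (unsigned long long)freeable / 4)
                total_scan = SKETCH_MIN(total_scan, freeable / 2);

        /* Never scan more than twice the freeable estimate. */
        if (total_scan > freeable * 2)
                total_scan = freeable * 2;

        return total_scan;
}
```

For example, with freeable=1000, seeks=2 and priority=4, delta is 124. Even if 5000 objects were deferred, total_scan is capped at 500 because delta is below freeable / 4.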

mm/vmscan.c -2/2-

        /*
         * Normally, we should not scan less than batch_size objects in one
         * pass to avoid too frequent shrinker calls, but if the slab has less
         * than batch_size objects in total and we are really tight on memory,
         * we will try to reclaim all available objects, otherwise we can end
         * up failing allocations although there are plenty of reclaimable
         * objects spread over several slabs with usage less than the
         * batch_size.
         *
         * We detect the "tight on memory" situations by looking at the total
         * number of objects we want to scan (total_scan). If it is greater
         * than the total number of objects on slab (freeable), we must be
         * scanning at high prio and therefore should try to reclaim as much as
         * possible.
         */
        while (total_scan >= batch_size ||
               total_scan >= freeable) {
                unsigned long ret;
                unsigned long nr_to_scan = min(batch_size, total_scan);

                shrinkctl->nr_to_scan = nr_to_scan;
                shrinkctl->nr_scanned = nr_to_scan;
                ret = shrinker->scan_objects(shrinker, shrinkctl);
                if (ret == SHRINK_STOP)
                        break;
                freed += ret;

                count_vm_events(SLABS_SCANNED, shrinkctl->nr_scanned);
                total_scan -= shrinkctl->nr_scanned;
                scanned += shrinkctl->nr_scanned;

                cond_resched();
        }

        if (next_deferred >= scanned)
                next_deferred -= scanned;
        else
                next_deferred = 0;
        /*
         * move the unused scan count back into the shrinker in a
         * manner that handles concurrent updates. If we exhausted the
         * scan, there is no need to do an update.
         */
        if (next_deferred > 0)
                new_nr = atomic_long_add_return(next_deferred,
                                                &shrinker->nr_deferred[nid]);
        else
                new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);

        trace_mm_shrink_slab_end(shrinker, nid, freed, nr, new_nr, total_scan);
        return freed;
}

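In code lines 16~33, loop while total_scan is at least batch_size or at least freeable, calling the scan_objects() handler registered by the shrinker to reclaim freeable objects. Add the number of freed objects to freed. If the result is SHRINK_STOP, leave the loop.
In code lines 35~48, add the number of objects that were left unscanned back to shrinker->nr_deferred[nid] so that they are handled next time.
In code line 51, return the number of objects actually freed.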
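
The batching loop and the carry-over into nr_deferred can be sketched as follows. This is an illustration only: sketch_scan_loop() and struct sketch_scan_result are hypothetical, and the sketch assumes that every scan callback scans exactly the number of objects it was asked to and never returns SHRINK_STOP.

```c
#include <assert.h>

struct sketch_scan_result {
        long scanned;        /* objects handed to the scan callback */
        long next_deferred;  /* work carried over to the next invocation */
};

/*
 * Hypothetical rendition of the second half of do_shrink_slab().
 * total_scan is the clamped scan count, next_deferred the value
 * computed before clamping.
 */
static struct sketch_scan_result sketch_scan_loop(long total_scan, long freeable,
                                                  long batch_size,
                                                  long next_deferred)
{
        struct sketch_scan_result r = { 0, 0 };

        assert(freeable > 0 && batch_size > 0);

        while (total_scan >= batch_size || total_scan >= freeable) {
                long nr_to_scan = total_scan < batch_size ? total_scan
                                                          : batch_size;

                /* The kernel calls shrinker->scan_objects() at this point. */
                total_scan -= nr_to_scan;
                r.scanned += nr_to_scan;
        }

        /* Whatever was not scanned is deferred, but never a negative amount. */
        r.next_deferred = next_deferred >= r.scanned ?
                          next_deferred - r.scanned : 0;
        return r;
}
```

With total_scan=300, freeable=1000 and batch_size=128, two batches are scanned (256 objects) and the remaining 44 are deferred. With total_scan=50 and freeable=40, the second loop condition applies and all 50 objects are scanned in one short batch.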

The following figure shows how do_shrink_slab() shrinks each shrinker registered for the memcg (or globally if there is none) by its computed total_scan.

References

Zoned Allocator -1- (Physical Page Allocation-Fastpath) | 문c
Zoned Allocator -2- (Physical Page Allocation-Slowpath) | 문c
Zoned Allocator -3- (Buddy Page Allocation) | 문c
Zoned Allocator -4- (Buddy Page Free) | 문c
Zoned Allocator -5- (Per-CPU Page Frame Cache) | 문c
Zoned Allocator -6- (Watermark) | 문c
Zoned Allocator -7- (Direct Compact) | 문c
Zoned Allocator -8- (Direct Compact-Isolation) | 문c
Zoned Allocator -9- (Direct Compact-Migration) | 문c
Zoned Allocator -10- (LRU & pagevec) | 문c
Zoned Allocator -11- (Direct Reclaim) | 문c
Zoned Allocator -12- (Direct Reclaim-Shrink-1) | 문c
Zoned Allocator -13- (Direct Reclaim-Shrink-2) | 문c (this post)
Zoned Allocator -14- (Kswapd) | 문c

Smarter shrinkers (2013) | LWN

Zoned Allocator -12- (Direct Reclaim-Shrink-1)

2016-07-16 / 2022-05-01 문영일

The following figure shows the call relationships of the functions involved when shrink_zones() is called to reclaim pages.

Shrink Zones

shrink_zones()

mm/vmscan.c

/*
 * This is the direct reclaim path, for page-allocating processes.  We only
 * try to reclaim pages from zones which will satisfy the caller's allocation
 * request.
 *
 * If a zone is deemed to be full of pinned pages then just give it a light
 * scan then give up on it.
 */

static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
{
        struct zoneref *z;
        struct zone *zone;
        unsigned long nr_soft_reclaimed;
        unsigned long nr_soft_scanned;
        gfp_t orig_mask;
        pg_data_t *last_pgdat = NULL;

        /*
         * If the number of buffer_heads in the machine exceeds the maximum
         * allowed level, force direct reclaim to scan the highmem zone as
         * highmem pages could be pinning lowmem pages storing buffer_heads
         */
        orig_mask = sc->gfp_mask;
        if (buffer_heads_over_limit) {
                sc->gfp_mask |= __GFP_HIGHMEM;
                sc->reclaim_idx = gfp_zone(sc->gfp_mask);
        }

        for_each_zone_zonelist_nodemask(zone, z, zonelist,
                                        sc->reclaim_idx, sc->nodemask) {
                /*
                 * Take care memory controller reclaiming has small influence
                 * to global LRU.
                 */
                if (global_reclaim(sc)) {
                        if (!cpuset_zone_allowed(zone,
                                                 GFP_KERNEL | __GFP_HARDWALL))
                                continue;

                        /*
                         * If we already have plenty of memory free for
                         * compaction in this zone, don't free any more.
                         * Even though compaction is invoked for any
                         * non-zero order, only frequent costly order
                         * reclamation is disruptive enough to become a
                         * noticeable problem, like transparent huge
                         * page allocations.
                         */
                        if (IS_ENABLED(CONFIG_COMPACTION) &&
                            sc->order > PAGE_ALLOC_COSTLY_ORDER &&
                            compaction_ready(zone, sc)) {
                                sc->compaction_ready = true;
                                continue;
                        }

                        /*
                         * Shrink each node in the zonelist once. If the
                         * zonelist is ordered by zone (not the default) then a
                         * node may be shrunk multiple times but in that case
                         * the user prefers lower zones being preserved.
                         */
                        if (zone->zone_pgdat == last_pgdat)
                                continue;

                        /*
                         * This steals pages from memory cgroups over softlimit
                         * and returns the number of reclaimed pages and
                         * scanned pages. This works for global memory pressure
                         * and balancing, not for a memcg's limit.
                         */
                        nr_soft_scanned = 0;
                        nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone->zone_pgdat,
                                                sc->order, sc->gfp_mask,
                                                &nr_soft_scanned);
                        sc->nr_reclaimed += nr_soft_reclaimed;
                        sc->nr_scanned += nr_soft_scanned;
                        /* need some check for avoid more shrink_zone() */
                }

                /* See comment about same check for global reclaim above */
                if (zone->zone_pgdat == last_pgdat)
                        continue;
                last_pgdat = zone->zone_pgdat;
                shrink_node(zone->zone_pgdat, sc);
        }

        /*
         * Restore to original mask to avoid the impact on the caller if we
         * promoted it to __GFP_HIGHMEM.
         */
        sc->gfp_mask = orig_mask;
}

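Reclaims pages from the zones in the zonelist that need it.

In code line 15, back up sc->gfp_mask.
In code lines 16~19, if the number of buffer heads exceeds the maximum allowed level, include the highmem zone in reclaim scanning.
In code lines 21~22, loop over the zones in the zonelist that are at or below the requested zone (sc->reclaim_idx) and belong to the nodes in sc->nodemask.
In code line 27, this is the case where reclaim targets the global LRU.
In code lines 28~30, skip the zone if cpuset does not allow it for a GFP_KERNEL | __GFP_HARDWALL request.
In code lines 41~46, for a costly-order request, set sc->compaction_ready and skip the zone if compaction_ready() judges that no further reclaim is needed before compaction.
In code lines 54~55, skip if the node has already been processed.
In code lines 63~68, attempt memcg soft limit reclaim and add the numbers of scanned and reclaimed pages to sc.
In code lines 73~75, skip if the node has already been processed.
In code line 76, attempt page reclaim on the node.
In code line 83, restore the gfp_mask that was backed up.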
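
The last_pgdat check only suppresses consecutive zones of the same node, which is why the kernel comment notes that a zone-ordered zonelist may shrink a node more than once. A minimal sketch of that behaviour, assuming zones are represented only by their node id (sketch_count_node_shrinks() is a hypothetical helper):

```c
/*
 * Count how many times shrink_node() would be invoked for a zonelist
 * given as an array of node ids, one per zone, in zonelist order.
 * last_nid plays the role of last_pgdat in shrink_zones().
 */
static int sketch_count_node_shrinks(const int *zone_nid, int nr_zones)
{
        int last_nid = -1;
        int calls = 0;

        for (int i = 0; i < nr_zones; i++) {
                if (zone_nid[i] == last_nid)
                        continue;       /* same node as the previous zone */
                last_nid = zone_nid[i];
                calls++;                /* shrink_node() would run here */
        }
        return calls;
}
```

A node-ordered list such as {0, 0, 0, 1, 1} gives two calls, one per node, while an interleaved list such as {0, 1, 0} gives three.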

The following figure shows the processing flow of the shrink_zones() function.

compaction_ready()

mm/vmscan.c

/*
 * Returns true if compaction should go ahead for a costly-order request, or
 * the allocation would already succeed without compaction. Return false if we
 * should reclaim first.
 */

static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
{
        unsigned long watermark;
        enum compact_result suitable;

        suitable = compaction_suitable(zone, sc->order, 0, sc->reclaim_idx);
        if (suitable == COMPACT_SUCCESS)
                /* Allocation should succeed already. Don't reclaim. */
                return true;
        if (suitable == COMPACT_SKIPPED)
                /* Compaction cannot yet proceed. Do reclaim. */
                return false;

        /*
         * Compaction is already possible, but it takes time to run and there
         * are potentially other callers using the pages just freed. So proceed
         * with reclaim to make a buffer of free pages available to give
         * compaction a reasonable chance of completing and allocating the page.
         * Note that we won't actually reclaim the whole buffer in one attempt
         * as the target watermark in should_continue_reclaim() is lower. But if
         * we are already above the high+gap watermark, don't reclaim at all.
         */
        watermark = high_wmark_pages(zone) + compact_gap(sc->order);

        return zone_watermark_ok_safe(zone, 0, watermark, sc->reclaim_idx);
}

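Returns whether reclaim can be skipped for the zone. (true = allocation would already succeed or compaction can go ahead, false = reclaim is needed first)

In code line 6, find out whether compaction is suitable.
In code lines 7~9, if allocation would already succeed, reclaim is unnecessary, so return true.
In code lines 10~12, compaction cannot proceed yet, so reclaim is needed first; return false.
In code lines 23~25, compaction is possible, but to leave it a buffer of free pages, compare the free pages against the high watermark plus the compact gap (twice the number of pages of the requested order) and return the result.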
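
The threshold used in the final check is simple to compute by hand. The sketch below assumes compact_gap() is 2UL << order, as described above, and replaces zone_watermark_ok_safe() with a plain comparison; the real function also takes lowmem reserves into account, so treat sketch_enough_for_compaction() as a hypothetical simplification.

```c
#include <stdbool.h>

/* Twice the number of pages in a block of the given order. */
static unsigned long sketch_compact_gap(unsigned int order)
{
        return 2UL << order;
}

/*
 * Simplified stand-in for the last two lines of compaction_ready():
 * are there more free pages than the high watermark plus the gap?
 */
static bool sketch_enough_for_compaction(unsigned long free_pages,
                                         unsigned long high_wmark,
                                         unsigned int order)
{
        unsigned long watermark = high_wmark + sketch_compact_gap(order);

        return free_pages > watermark;
}
```

For an order-9 request the gap is 1024 pages, so a zone with a high watermark of 1000 pages needs more than 2024 free pages before reclaim is skipped.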

Shrink Node

shrink_node()

mm/vmscan.c -1/2-

static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
{
        struct reclaim_state *reclaim_state = current->reclaim_state;
        unsigned long nr_reclaimed, nr_scanned;
        bool reclaimable = false;

        do {
                struct mem_cgroup *root = sc->target_mem_cgroup;
                struct mem_cgroup_reclaim_cookie reclaim = {
                        .pgdat = pgdat,
                        .priority = sc->priority,
                };
                unsigned long node_lru_pages = 0;
                struct mem_cgroup *memcg;

                memset(&sc->nr, 0, sizeof(sc->nr));

                nr_reclaimed = sc->nr_reclaimed;
                nr_scanned = sc->nr_scanned;

                memcg = mem_cgroup_iter(root, NULL, &reclaim);
                do {
                        unsigned long lru_pages;
                        unsigned long reclaimed;
                        unsigned long scanned;

                        switch (mem_cgroup_protected(root, memcg)) {
                        case MEMCG_PROT_MIN:
                                /*
                                 * Hard protection.
                                 * If there is no reclaimable memory, OOM.
                                 */
                                continue;
                        case MEMCG_PROT_LOW:
                                /*
                                 * Soft protection.
                                 * Respect the protection only as long as
                                 * there is an unprotected supply
                                 * of reclaimable memory from other cgroups.
                                 */
                                if (!sc->memcg_low_reclaim) {
                                        sc->memcg_low_skipped = 1;
                                        continue;
                                }
                                memcg_memory_event(memcg, MEMCG_LOW);
                                break;
                        case MEMCG_PROT_NONE:
                                break;
                        }

                        reclaimed = sc->nr_reclaimed;
                        scanned = sc->nr_scanned;
                        shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
                        node_lru_pages += lru_pages;

                        if (sc->may_shrinkslab) {
                                shrink_slab(sc->gfp_mask, pgdat->node_id,
                                    memcg, sc->priority);
                        }

                        /* Record the group's reclaim efficiency */
                        vmpressure(sc->gfp_mask, memcg, false,
                                   sc->nr_scanned - scanned,
                                   sc->nr_reclaimed - reclaimed);

                        /*
                         * Direct reclaim and kswapd have to scan all memory
                         * cgroups to fulfill the overall scan target for the
                         * node.
                         *
                         * Limit reclaim, on the other hand, only cares about
                         * nr_to_reclaim pages to be reclaimed and it will
                         * retry with decreasing priority if one round over the
                         * whole hierarchy is not sufficient.
                         */
                        if (!global_reclaim(sc) &&
                                        sc->nr_reclaimed >= sc->nr_to_reclaim) {
                                mem_cgroup_iter_break(root, memcg);
                                break;
                        }
                } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));

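Reclaims pages from the anon and file LRU lists of the requested node. It works on the target memcg and its descendants, and returns whether any pages were reclaimed.

In code line 3, get the reclaim state of the current task.
In code lines 7~19, the loop repeats while reclaim is not complete. The memcgs to reclaim from are sc->target_mem_cgroup and all of its descendants.
In code lines 21~22, iterate over the memcg hierarchy below root. If root is not specified, start from the top-level root memcg.
In code lines 27~49, check the protection of the memcg and either skip it or proceed.
- A memcg under hard protection is skipped.
  - usage < memory.emin
- For a memcg under soft protection, a memcg low event is reported. However, if low reclaim is not allowed (sc->memcg_low_reclaim is unset), the memcg is skipped.
  - usage < memory.elow
- A memcg with no protection proceeds as is.
In code lines 51~53, shrink the memcg.
In code lines 56~59, if slab shrinking was requested, perform it.
In code lines 62~64, check and update the memory pressure (vmpressure) for the memcg.
In code lines 76~80, if this is not global reclaim and the target has been reached, leave the loop.
In code line 81, move on to the next memcg in the hierarchy.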
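
The protection switch boils down to a small decision function. The following sketch uses hypothetical names (enum sketch_prot, struct sketch_sc, sketch_should_reclaim()) in place of mem_cgroup_protected() and scan_control, and leaves out the MEMCG_LOW event notification.

```c
#include <stdbool.h>

/* Hypothetical mirror of MEMCG_PROT_NONE / MEMCG_PROT_LOW / MEMCG_PROT_MIN. */
enum sketch_prot { SKETCH_PROT_NONE, SKETCH_PROT_LOW, SKETCH_PROT_MIN };

/* Only the two scan_control fields that the switch touches. */
struct sketch_sc {
        bool memcg_low_reclaim;  /* caller allows reclaim below memory.low */
        bool memcg_low_skipped;  /* set when a low-protected memcg was skipped */
};

/* Returns true when the memcg should be reclaimed from in this pass. */
static bool sketch_should_reclaim(enum sketch_prot prot, struct sketch_sc *sc)
{
        switch (prot) {
        case SKETCH_PROT_MIN:
                return false;                   /* hard protection: always skip */
        case SKETCH_PROT_LOW:
                if (!sc->memcg_low_reclaim) {
                        sc->memcg_low_skipped = true;
                        return false;           /* soft protection respected */
                }
                return true;                    /* protection overridden */
        case SKETCH_PROT_NONE:
                return true;
        }
        return true;
}
```

The memcg_low_skipped flag is what lets a caller notice that protected memcgs were passed over, so it can retry with memcg_low_reclaim set if nothing else could be reclaimed.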

mm/vmscan.c -2/2-

                if (reclaim_state) {
                        sc->nr_reclaimed += reclaim_state->reclaimed_slab;
                        reclaim_state->reclaimed_slab = 0;
                }

                /* Record the subtree's reclaim efficiency */
                vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
                           sc->nr_scanned - nr_scanned,
                           sc->nr_reclaimed - nr_reclaimed);

                if (sc->nr_reclaimed - nr_reclaimed)
                        reclaimable = true;

                if (current_is_kswapd()) {
                        /*
                         * If reclaim is isolating dirty pages under writeback,
                         * it implies that the long-lived page allocation rate
                         * is exceeding the page laundering rate. Either the
                         * global limits are not being effective at throttling
                         * processes due to the page distribution throughout
                         * zones or there is heavy usage of a slow backing
                         * device. The only option is to throttle from reclaim
                         * context which is not ideal as there is no guarantee
                         * the dirtying process is throttled in the same way
                         * balance_dirty_pages() manages.
                         *
                         * Once a node is flagged PGDAT_WRITEBACK, kswapd will
                         * count the number of pages under pages flagged for
                         * immediate reclaim and stall if any are encountered
                         * in the nr_immediate check below.
                         */
                        if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken)
                                set_bit(PGDAT_WRITEBACK, &pgdat->flags);

                        /*
                         * Tag a node as congested if all the dirty pages
                         * scanned were backed by a congested BDI and
                         * wait_iff_congested will stall.
                         */
                        if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
                                set_bit(PGDAT_CONGESTED, &pgdat->flags);

                        /* Allow kswapd to start writing pages during reclaim.*/
                        if (sc->nr.unqueued_dirty == sc->nr.file_taken)
                                set_bit(PGDAT_DIRTY, &pgdat->flags);

                        /*
                         * If kswapd scans pages marked marked for immediate
                         * reclaim and under writeback (nr_immediate), it
                         * implies that pages are cycling through the LRU
                         * faster than they are written so also forcibly stall.
                         */
                        if (sc->nr.immediate)
                                congestion_wait(BLK_RW_ASYNC, HZ/10);
                }

                /*
                 * Legacy memcg will stall in page writeback so avoid forcibly
                 * stalling in wait_iff_congested().
                 */
                if (!global_reclaim(sc) && sane_reclaim(sc) &&
                    sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
                        set_memcg_congestion(pgdat, root, true);

                /*
                 * Stall direct reclaim for IO completions if underlying BDIs
                 * and node is congested. Allow kswapd to continue until it
                 * starts encountering unqueued dirty pages or cycling through
                 * the LRU too quickly.
                 */
                if (!sc->hibernation_mode && !current_is_kswapd() &&
                   current_may_throttle() && pgdat_memcg_congested(pgdat, root))
                        wait_iff_congested(BLK_RW_ASYNC, HZ/10);

        } while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
                                         sc->nr_scanned - nr_scanned, sc));

        /*
         * Kswapd gives up on balancing particular nodes after too
         * many failures to reclaim anything from them and goes to
         * sleep. On reclaim progress, reset the failure counter. A
         * successful direct reclaim run will revive a dormant kswapd.
         */
        if (reclaimable)
                pgdat->kswapd_failures = 0;

        return reclaimable;
}

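In code lines 1~4, if reclaim_state is not null, add the number of reclaimed slab pages to the number of reclaimed pages.
In code lines 7~9, check the memory pressure for the memcg and notify the vmpressure listeners whose conditions are met.
In code lines 11~12, if any pages were reclaimed during the pass, set reclaimable to true.
In code lines 14~56, this is the case where kswapd is doing the reclaim. Set the relevant flags on the node.
In code lines 62~64, for memcg (non-global) reclaim where sane_reclaim() holds and every dirty page scanned was congested, mark the memcg node as congested.
In code lines 72~74, during direct reclaim, if the memcg node is congested, wait for up to 0.1 seconds.
In code lines 76~77, repeat the loop depending on whether reclaim should continue.
In code line 85, if any pages were reclaimed, reset the kswapd failure count.
In code line 87, return whether any pages were reclaimed.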
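
The three node flags that kswapd may set depend only on the counters collected in sc->nr. A minimal sketch of those conditions, with hypothetical names (struct sketch_nr, the SKETCH_* bits and sketch_kswapd_flags()) standing in for the scan_control counters and the PGDAT_* bits:

```c
/* Subset of the sc->nr counters used by the kswapd branch. */
struct sketch_nr {
        unsigned long dirty;
        unsigned long unqueued_dirty;
        unsigned long congested;
        unsigned long writeback;
        unsigned long file_taken;
        unsigned long taken;
};

/* Hypothetical bit values standing in for the PGDAT_* node flags. */
enum {
        SKETCH_WRITEBACK = 1 << 0,
        SKETCH_CONGESTED = 1 << 1,
        SKETCH_DIRTY     = 1 << 2,
};

/* Returns the set of node flags kswapd would raise for these counters. */
static unsigned int sketch_kswapd_flags(const struct sketch_nr *nr)
{
        unsigned int flags = 0;

        /* Every isolated page was under writeback. */
        if (nr->writeback && nr->writeback == nr->taken)
                flags |= SKETCH_WRITEBACK;

        /* Every dirty page scanned sat on a congested backing device. */
        if (nr->dirty && nr->dirty == nr->congested)
                flags |= SKETCH_CONGESTED;

        /* Every isolated file page was dirty and not yet queued for I/O. */
        if (nr->unqueued_dirty == nr->file_taken)
                flags |= SKETCH_DIRTY;

        return flags;
}
```

Note that the last comparison has no non-zero guard, exactly as in the listing, so it also holds when both counters are 0.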

Shrink Node memcg

shrink_node_memcg()

mm/vmscan.c -1/2-

/*
 * This is a basic per-node page freer.  Used by both kswapd and direct reclaim.
 */

static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memcg,
                              struct scan_control *sc, unsigned long *lru_pages)
{
        struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);
        unsigned long nr[NR_LRU_LISTS];
        unsigned long targets[NR_LRU_LISTS];
        unsigned long nr_to_scan;
        enum lru_list lru;
        unsigned long nr_reclaimed = 0;
        unsigned long nr_to_reclaim = sc->nr_to_reclaim;
        struct blk_plug plug;
        bool scan_adjusted;

        get_scan_count(lruvec, memcg, sc, nr, lru_pages);

        /* Record the original scan target for proportional adjustments later */
        memcpy(targets, nr, sizeof(nr));

        /*
         * Global reclaiming within direct reclaim at DEF_PRIORITY is a normal
         * event that can occur when there is little memory pressure e.g.
         * multiple streaming readers/writers. Hence, we do not abort scanning
         * when the requested number of pages are reclaimed when scanning at
         * DEF_PRIORITY on the assumption that the fact we are direct
         * reclaiming implies that kswapd is not keeping up and it is best to
         * do a batch of work at once. For memcg reclaim one check is made to
         * abort proportional reclaim if either the file or anon lru has already
         * dropped to zero at the first pass.
         */
        scan_adjusted = (global_reclaim(sc) && !current_is_kswapd() &&
                         sc->priority == DEF_PRIORITY);

        blk_start_plug(&plug);
        while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
                                        nr[LRU_INACTIVE_FILE]) {
                unsigned long nr_anon, nr_file, percentage;
                unsigned long nr_scanned;

                for_each_evictable_lru(lru) {
                        if (nr[lru]) {
                                nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX);
                                nr[lru] -= nr_to_scan;

                                nr_reclaimed += shrink_list(lru, nr_to_scan,
                                                            lruvec, memcg, sc);
                        }
                }

                cond_resched();

                if (nr_reclaimed < nr_to_reclaim || scan_adjusted)
                        continue;

                /*
                 * For kswapd and memcg, reclaim at least the number of pages
                 * requested. Ensure that the anon and file LRUs are scanned
                 * proportionally what was requested by get_scan_count(). We
                 * stop reclaiming one LRU and reduce the amount scanning
                 * proportional to the original scan target.
                 */
                nr_file = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE];
                nr_anon = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON];

                /*
                 * It's just vindictive to attack the larger once the smaller
                 * has gone to zero.  And given the way we stop scanning the
                 * smaller below, this makes sure that we only make one nudge
                 * towards proportionality once we've got nr_to_reclaim.
                 */
                if (!nr_file || !nr_anon)
                        break;

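Reclaims pages from the anon and file LRU lists in the per-node lruvec of the memcg.

In code line 4, get the lruvec for the memcg. If no memcg is specified, get the node lruvec.
In code line 14, compute the number of pages to scan on each LRU list of the lruvec, taking the balance between them into account.
In code line 17, back up the computed nr[] array into targets[] for later adjustment.
In code lines 30~31, store in scan_adjusted whether this is global direct reclaim (not kswapd) running at the initial priority (DEF_PRIORITY).
- When scan_adjusted is true, the scan ratio between anon and file pages is not readjusted.
In code line 33, assign blk_plug to the task's plug to signal that batched I/O has started.
In code lines 34~35, loop while the inactive anon, inactive file, or active file count in nr[] is greater than 0.
- active anon is excluded.
In code lines 39~47, try to reclaim from each evictable LRU, scanning at most 32 pages (SWAP_CLUSTER_MAX) at a time.
In code lines 51~52, if more reclaim is needed or scan_adjusted is set, continue the loop.
In code lines 61~62, the reclaimed pages have reached the target. To adjust the scan ratio, first get the remaining file and anon scan counts.
In code lines 70~71, if no file or no anon pages remain to scan, no ratio adjustment is needed, so leave the loop.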

mm/vmscan.c -2/2-

                if (nr_file > nr_anon) {
                        unsigned long scan_target = targets[LRU_INACTIVE_ANON] +
                                                targets[LRU_ACTIVE_ANON] + 1;
                        lru = LRU_BASE;
                        percentage = nr_anon * 100 / scan_target;
                } else {
                        unsigned long scan_target = targets[LRU_INACTIVE_FILE] +
                                                targets[LRU_ACTIVE_FILE] + 1;
                        lru = LRU_FILE;
                        percentage = nr_file * 100 / scan_target;
                }

                /* Stop scanning the smaller of the LRU */
                nr[lru] = 0;
                nr[lru + LRU_ACTIVE] = 0;

                /*
                 * Recalculate the other LRU scan count based on its original
                 * scan target and the percentage scanning already complete
                 */
                lru = (lru == LRU_FILE) ? LRU_BASE : LRU_FILE;
                nr_scanned = targets[lru] - nr[lru];
                nr[lru] = targets[lru] * (100 - percentage) / 100;
                nr[lru] -= min(nr[lru], nr_scanned);

                lru += LRU_ACTIVE;
                nr_scanned = targets[lru] - nr[lru];
                nr[lru] = targets[lru] * (100 - percentage) / 100;
                nr[lru] -= min(nr[lru], nr_scanned);

                scan_adjusted = true;
        }
        blk_finish_plug(&plug);
        sc->nr_reclaimed += nr_reclaimed;

        /*
         * Even if we did not try to evict anon pages at all, we want to
         * rebalance the anon lru active/inactive ratio.
         */
        if (inactive_list_is_low(lruvec, false, memcg, sc, true))
                shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
                                   sc, LRU_ACTIVE_ANON);
}

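In code lines 1~5, if more file pages than anon pages remain, compute the percentage of anon pages remaining relative to the scan target.
- e.g.) anon=200 computed and backed up before the shrink, anon=140 after the shrink
  - scan_target=201, percentage=140 * 100 / 201=69, so about 70% of the anon pages were left unscanned
In code lines 6~11, if more anon pages than file pages remain, compute the percentage of file pages remaining relative to the scan target.
- e.g.) file=200 computed and backed up before the shrink, file=140 after the shrink
  - scan_target=201, percentage=140 * 100 / 201=69, so about 70% of the file pages were left unscanned
In code lines 14~15, stop scanning the chosen (file or anon) LRU, the one with less remaining, by setting its inactive and active scan counts to 0.
In code lines 21~24, select the inactive list of the opposite type (file <-> anon) and set nr[] to its original scan target scaled by the fraction of the chosen LRU that was scanned (100 - percentage), minus the pages already scanned on it.
- The subtraction is clamped so that the result does not go below 0.
In code lines 26~29, do the same for the active list of the opposite type.
- The subtraction is clamped in the same way.
In code line 31, once the scan counts have been adjusted, make sure they are not readjusted again within the loop.
In code line 33, assign null to the task's plug to signal that batched I/O has finished.
In code line 34, update the number of reclaimed pages.
In code lines 40~42, if the inactive anon list is low relative to the active anon list, shrink the active list to rebalance active and inactive.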
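
The proportional adjustment is easiest to verify with concrete numbers. The sketch below lifts the arithmetic out of shrink_node_memcg() into a hypothetical helper, sketch_rebalance(); the SK_* indices are assumptions that follow the usual LRU list order (inactive anon, active anon, inactive file, active file).

```c
enum {
        SK_INACTIVE_ANON,
        SK_ACTIVE_ANON,
        SK_INACTIVE_FILE,
        SK_ACTIVE_FILE,
        SK_NR_LISTS,
};
#define SK_BASE   SK_INACTIVE_ANON  /* first anon list */
#define SK_FILE   SK_INACTIVE_FILE  /* first file list */
#define SK_ACTIVE 1                 /* offset from inactive to active */

/*
 * nr[] holds the pages still to scan, targets[] the original scan
 * targets. Stops the LRU type with less left and rescales the other.
 */
static void sketch_rebalance(unsigned long nr[SK_NR_LISTS],
                             const unsigned long targets[SK_NR_LISTS])
{
        unsigned long nr_file = nr[SK_INACTIVE_FILE] + nr[SK_ACTIVE_FILE];
        unsigned long nr_anon = nr[SK_INACTIVE_ANON] + nr[SK_ACTIVE_ANON];
        unsigned long scan_target, percentage, nr_scanned;
        int lru;

        if (!nr_file || !nr_anon)
                return;         /* one side is already exhausted */

        if (nr_file > nr_anon) {
                scan_target = targets[SK_INACTIVE_ANON] +
                              targets[SK_ACTIVE_ANON] + 1;
                lru = SK_BASE;
                percentage = nr_anon * 100 / scan_target;
        } else {
                scan_target = targets[SK_INACTIVE_FILE] +
                              targets[SK_ACTIVE_FILE] + 1;
                lru = SK_FILE;
                percentage = nr_file * 100 / scan_target;
        }

        /* Stop scanning the type with less left. */
        nr[lru] = 0;
        nr[lru + SK_ACTIVE] = 0;

        /* Rescale the inactive, then the active list of the other type. */
        lru = (lru == SK_FILE) ? SK_BASE : SK_FILE;
        for (int i = 0; i < 2; i++, lru += SK_ACTIVE) {
                nr_scanned = targets[lru] - nr[lru];
                nr[lru] = targets[lru] * (100 - percentage) / 100;
                nr[lru] -= nr[lru] < nr_scanned ? nr[lru] : nr_scanned;
        }
}
```

With targets {200, 0, 400, 200} and remaining counts {140, 0, 340, 170}, percentage is 69, anon scanning stops, and the file lists are cut to 64 and 32 pages: 31% of each original target minus the 60 and 30 pages already scanned.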

The following figure shows the LRU vector lists of the specified memcg being shrunk.

Shrinking the Requested LRU

lru shrink

shrink_list()

mm/vmscan.c

static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
                                 struct lruvec *lruvec, struct mem_cgroup *memcg,
                                 struct scan_control *sc)
{
        if (is_active_lru(lru)) {
                if (inactive_list_is_low(lruvec, is_file_lru(lru),
                                         memcg, sc, true))
                        shrink_active_list(nr_to_scan, lruvec, sc, lru);
                return 0;
        }

        return shrink_inactive_list(nr_to_scan, lruvec, sc, lru);
}

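Reclaims pages from an lru list of the lruvec. For an active lru, this is done only when the inactive lru is low relative to the active lru, as judged by inactive_list_is_low().

When a shrink of the active list is requested but the inactive list is not low, the active list is not shrunk and 0 is returned.
A shrink request for the inactive list is carried out unconditionally.

Shrinking the active lru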

shrink_active_list()

mm/vmscan.c -1/2-

static void shrink_active_list(unsigned long nr_to_scan,
                               struct lruvec *lruvec,
                               struct scan_control *sc,
                               enum lru_list lru)
{
        unsigned long nr_taken;
        unsigned long nr_scanned;
        unsigned long vm_flags;
        LIST_HEAD(l_hold);      /* The pages which were snipped off */
        LIST_HEAD(l_active);
        LIST_HEAD(l_inactive);
        struct page *page;
        struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
        unsigned nr_deactivate, nr_activate;
        unsigned nr_rotated = 0;
        isolate_mode_t isolate_mode = 0;
        int file = is_file_lru(lru);
        struct pglist_data *pgdat = lruvec_pgdat(lruvec);

        lru_add_drain();

        if (!sc->may_unmap)
                isolate_mode |= ISOLATE_UNMAPPED;

        spin_lock_irq(&pgdat->lru_lock);

        nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
                                     &nr_scanned, sc, isolate_mode, lru);

        __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
        reclaim_stat->recent_scanned[file] += nr_taken;

        __count_vm_events(PGREFILL, nr_scanned);
        count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned);

        spin_unlock_irq(&pgdat->lru_lock);

        while (!list_empty(&l_hold)) {
                cond_resched();
                page = lru_to_page(&l_hold);
                list_del(&page->lru);

                if (unlikely(!page_evictable(page))) {
                        putback_lru_page(page);
                        continue;
                }

                if (unlikely(buffer_heads_over_limit)) {
                        if (page_has_private(page) && trylock_page(page)) {
                                if (page_has_private(page))
                                        try_to_release_page(page, 0);
                                unlock_page(page);
                        }
                }

                if (page_referenced(page, 0, sc->target_mem_cgroup,
                                    &vm_flags)) {
                        nr_rotated += hpage_nr_pages(page);
                        /*
                         * Identify referenced, file-backed active pages and
                         * give them one more trip around the active list. So
                         * that executable code get better chances to stay in
                         * memory under moderate memory pressure.  Anon pages
                         * are not likely to be evicted by use-once streaming
                         * IO, plus JVM can create lots of anon VM_EXEC pages,
                         * so we ignore them here.
                         */
                        if ((vm_flags & VM_EXEC) && page_is_file_cache(page)) {
                                list_add(&page->lru, &l_active);
                                continue;
                        }
                }

                ClearPageActive(page);  /* we are de-activating */
                SetPageWorkingset(page);
                list_add(&page->lru, &l_inactive);
        }

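Processes pages from the active lru list of the lruvec. After a batch of pages is isolated from the active lru list, referenced executable file cache pages are rotated back onto the active lru list and the rest are moved to the inactive lru list. Unevictable pages among them are moved to the unevictable lru list. Pages that lost their last user during processing are freed to the buddy system.

In code line 20, drain the per-cpu lru caches back to the lruvec.
In code lines 22~23, if may_unmap was not requested, add ISOLATE_UNMAPPED to the mode so that only unmapped pages are isolated.
In code lines 27~28, try to scan up to nr_to_scan pages from the specified lru list, collect the isolated pages on the l_hold list, and get back the number of isolated pages.
In code line 30, add the number of isolated pages to the NR_ISOLATED_ANON or NR_ISOLATED_FILE counter.
In code line 31, add the number of isolated pages to the recently scanned count, which is used to derive the ratio in the anon/file scan ratio mode.
In code lines 33~34, add the number of scanned pages to the PGREFILL counters.
In code lines 38~41, walk the l_hold list that holds the isolated pages, removing one page at a time.
In code lines 43~46, if the page is not evictable, put it back with putback_lru_page(), which places it on the unevictable lru list.
In code lines 48~54, in the unlikely case that buffer_heads_over_limit is set, if the page has private data and its lock can be taken, try to release the private data (buffer heads) with try_to_release_page() and then unlock the page.
In code lines 56~72, for a referenced page, increase the nr_rotated counter by the number of pages. If the page is also executable file cache, move it to l_active so that it is rotated back onto the active lru.
In code lines 74~76, for all other pages, clear the active flag, set the workingset flag, and move them to l_inactive so that they go to the inactive lru.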
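
The per-page decision in the loop above can be condensed into a small user-space sketch. The struct page_facts fields, the enum dest values and classify_active_page() are hypothetical names; in the kernel the corresponding facts come from page flags, page_referenced() and the returned vm_flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of what the kernel learns about one isolated page. */
struct page_facts {
        bool evictable;   /* page_evictable() */
        bool referenced;  /* page_referenced() returned non-zero */
        bool vm_exec;     /* VM_EXEC was set in the returned vm_flags */
        bool file_cache;  /* page_is_file_cache() */
};

enum dest { DEST_PUTBACK, DEST_ACTIVE, DEST_INACTIVE };

/*
 * Decide where a page taken from the active list goes next and account
 * referenced pages in *nr_rotated, mirroring the order of the checks in
 * the loop: unevictable first, then referenced, then everything else.
 */
static enum dest classify_active_page(const struct page_facts *p,
                                      unsigned long nr_pages,
                                      unsigned long *nr_rotated)
{
        assert(p && nr_rotated);

        if (!p->evictable)
                return DEST_PUTBACK;

        if (p->referenced) {
                /* counted as rotated even if the page is deactivated below */
                *nr_rotated += nr_pages;
                if (p->vm_exec && p->file_cache)
                        return DEST_ACTIVE;
        }
        return DEST_INACTIVE;
}
```

Note that a referenced anon page is counted in nr_rotated but still ends up on the inactive list; only referenced executable file cache stays active.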

mm/vmscan.c -2/2-

        /*
         * Move pages back to the lru list.
         */
        spin_lock_irq(&pgdat->lru_lock);
        /*
         * Count referenced pages from currently used mappings as rotated,
         * even though only some of them are actually re-activated.  This
         * helps balance scan pressure between file and anonymous pages in
         * get_scan_count.
         */
        reclaim_stat->recent_rotated[file] += nr_rotated;

        nr_activate = move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
        nr_deactivate = move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
        __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
        spin_unlock_irq(&pgdat->lru_lock);

        mem_cgroup_uncharge_list(&l_hold);
        free_unref_page_list(&l_hold);
        trace_mm_vmscan_lru_shrink_active(pgdat->node_id, nr_taken, nr_activate,
                        nr_deactivate, nr_rotated, sc->priority, file);
}

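In code line 11, add nr_rotated to the recently rotated counter, which is used to derive the ratio in the anon/file scan ratio mode.
In code line 13, move the pages collected on l_active to the active lru; pages that turn out to have no users left and can be freed are moved to l_hold.
In code line 14, move the pages collected on l_inactive to the inactive lru; pages that turn out to have no users left and can be freed are moved to l_hold.
In code line 15, subtract nr_taken from the NR_ISOLATED_ANON or NR_ISOLATED_FILE counter.
In code line 18, report the uncharge of the l_hold list to memcg.
In code line 19, return all pages on l_hold to the buddy system.

The following figure shows an anon/file active lru list being shrunk.

Shrinking the inactive lru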

shrink_inactive_list()

mm/vmscan.c -1/2-

/*
 * shrink_inactive_list() is a helper for shrink_node().  It returns the number
 * of reclaimed pages
 */

static noinline_for_stack unsigned long
shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
                     struct scan_control *sc, enum lru_list lru)
{
        LIST_HEAD(page_list);
        unsigned long nr_scanned;
        unsigned long nr_reclaimed = 0;
        unsigned long nr_taken;
        struct reclaim_stat stat = {};
        isolate_mode_t isolate_mode = 0;
        int file = is_file_lru(lru);
        struct pglist_data *pgdat = lruvec_pgdat(lruvec);
        struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
        bool stalled = false;

        while (unlikely(too_many_isolated(pgdat, file, sc))) {
                if (stalled)
                        return 0;

                /* wait a bit for the reclaimer. */
                msleep(100);
                stalled = true;

                /* We are about to die and free our memory. Return now. */
                if (fatal_signal_pending(current))
                        return SWAP_CLUSTER_MAX;
        }

        lru_add_drain();

        if (!sc->may_unmap)
                isolate_mode |= ISOLATE_UNMAPPED;

        spin_lock_irq(&pgdat->lru_lock);

        nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
                                     &nr_scanned, sc, isolate_mode, lru);

        __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
        reclaim_stat->recent_scanned[file] += nr_taken;

        if (current_is_kswapd()) {
                if (global_reclaim(sc))
                        __count_vm_events(PGSCAN_KSWAPD, nr_scanned);
                count_memcg_events(lruvec_memcg(lruvec), PGSCAN_KSWAPD,
                                   nr_scanned);
        } else {
                if (global_reclaim(sc))
                        __count_vm_events(PGSCAN_DIRECT, nr_scanned);
                count_memcg_events(lruvec_memcg(lruvec), PGSCAN_DIRECT,
                                   nr_scanned);
        }
        spin_unlock_irq(&pgdat->lru_lock);

        if (nr_taken == 0)
                return 0;

        nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, 0,
                                &stat, false);

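Shrinks a batch of pages from the inactive lru list to obtain free pages. Pages found to be active are returned to the active lru list, and pages whose processing was deferred, for example because of writeback, are rotated to the head of the inactive lru list.

In code lines 16~27, if too many pages are already isolated, sleep once for 0.1 seconds. If too many are still isolated after the sleep, return 0; if a fatal signal is pending for the current task, return SWAP_CLUSTER_MAX.
In code line 29, drain the per-cpu lru caches back to the lruvec.
In code lines 31~32, if may_unmap was not requested, add ISOLATE_UNMAPPED to the mode so that only unmapped pages are isolated.
In code lines 36~37, isolate up to nr_to_scan pages from the lruvec onto page_list according to isolate_mode. The number of scanned pages is stored in nr_scanned, and the number of isolated pages is returned in nr_taken.
In code line 39, add the number of isolated pages to the NR_ISOLATED_ANON or NR_ISOLATED_FILE counter.
In code line 40, add the number of isolated pages to the recently scanned count, which is used to derive the ratio in the anon/file scan ratio mode.
In code lines 42~52, increase the kswapd or direct-reclaim scan counters.
In code lines 55~56, if no pages were isolated, stop processing.
In code line 58, shrink the isolated pages held on page_list and obtain the number of pages reclaimed.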

mm/vmscan.c -2/2-

        spin_lock_irq(&pgdat->lru_lock);

        if (current_is_kswapd()) {
                if (global_reclaim(sc))
                        __count_vm_events(PGSTEAL_KSWAPD, nr_reclaimed);
                count_memcg_events(lruvec_memcg(lruvec), PGSTEAL_KSWAPD,
                                   nr_reclaimed);
        } else {
                if (global_reclaim(sc))
                        __count_vm_events(PGSTEAL_DIRECT, nr_reclaimed);
                count_memcg_events(lruvec_memcg(lruvec), PGSTEAL_DIRECT,
                                   nr_reclaimed);
        }

        putback_inactive_pages(lruvec, &page_list);

        __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);

        spin_unlock_irq(&pgdat->lru_lock);

        mem_cgroup_uncharge_list(&page_list);
        free_unref_page_list(&page_list);

        /*
         * If dirty pages are scanned that are not queued for IO, it
         * implies that flushers are not doing their job. This can
         * happen when memory pressure pushes dirty pages to the end of
         * the LRU before the dirty limits are breached and the dirty
         * data has expired. It can also happen when the proportion of
         * dirty pages grows not through writes but through memory
         * pressure reclaiming all the clean cache. And in some cases,
         * the flushers simply cannot keep up with the allocation
         * rate. Nudge the flusher threads in case they are asleep.
         */
        if (stat.nr_unqueued_dirty == nr_taken)
                wakeup_flusher_threads(WB_REASON_VMSCAN);

        sc->nr.dirty += stat.nr_dirty;
        sc->nr.congested += stat.nr_congested;
        sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
        sc->nr.writeback += stat.nr_writeback;
        sc->nr.immediate += stat.nr_immediate;
        sc->nr.taken += nr_taken;
        if (file)
                sc->nr.file_taken += nr_taken;

        trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
                        nr_scanned, nr_reclaimed, &stat, sc->priority, file);
        return nr_reclaimed;
}

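In code lines 3~13, increase the kswapd or direct-reclaim PGSTEAL counters.
In code line 15, put the pages remaining on page_list back onto the lru (rotate).
In code line 17, decrease the isolated lru count by the number of pages that were taken.
In code line 21, report to memcg the uncharge of the pages that were freed instead of returning to the lru.
In code line 22, return to the buddy system all pages that remain on page_list instead of going back to the lru.
In code lines 35~36, if every isolated page was dirty but not queued for i/o, wake up the flusher threads.
In code lines 38~45, accumulate the per-category page counts in the scan control.
In code line 49, return the number of reclaimed pages.

The following figure shows an inactive lru list of the lruvec being shrunk.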

Isolate lru and Rotate lru

Isolate flags

The following flags can be combined when isolating lru pages.

ISOLATE_UNMAPPED
- Restricts isolation to pages that are not mapped in a page table. (unmapped only)
- Pages that are mapped in a page table are not isolated.
ISOLATE_ASYNC_MIGRATE
- With the async migration mode, which is used when there is no memory pressure, writeback pages and dirty pages on filesystems that do not support migration are not isolated.
ISOLATE_UNEVICTABLE
- Allows unevictable pages to be isolated as well.
- This mode is used during migration to empty a CMA area or a memory area that is being taken off-line.

Isolating lru pages

isolate_lru_pages()

mm/vmscan.c

/*
 * zone_lru_lock is heavily contended.  Some of the functions that
 * shrink the lists perform better by taking out a batch of pages
 * and working on them outside the LRU lock.
 *
 * For pagecache intensive workloads, this function is the hottest
 * spot in the kernel (apart from copy_*_user functions).
 *
 * Appropriate locks must be held before calling this function.
 *
 * @nr_to_scan: The number of eligible pages to look through on the list.
 * @lruvec:     The LRU vector to pull pages from.
 * @dst:        The temp list to put pages on to.
 * @nr_scanned: The number of pages that were scanned.
 * @sc:         The scan_control struct for this reclaim session
 * @mode:       One of the LRU isolation modes
 * @lru:        LRU list id for isolating
 *
 * returns how many pages were moved onto *@dst.
 */

static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
                struct lruvec *lruvec, struct list_head *dst,
                unsigned long *nr_scanned, struct scan_control *sc,
                isolate_mode_t mode, enum lru_list lru)
{
        struct list_head *src = &lruvec->lists[lru];
        unsigned long nr_taken = 0;
        unsigned long nr_zone_taken[MAX_NR_ZONES] = { 0 };
        unsigned long nr_skipped[MAX_NR_ZONES] = { 0, };
        unsigned long skipped = 0;
        unsigned long scan, total_scan, nr_pages;
        LIST_HEAD(pages_skipped);

        scan = 0;
        for (total_scan = 0;
             scan < nr_to_scan && nr_taken < nr_to_scan && !list_empty(src);
             total_scan++) {
                struct page *page;

                page = lru_to_page(src);
                prefetchw_prev_lru_page(page, src, flags);

                VM_BUG_ON_PAGE(!PageLRU(page), page);

                if (page_zonenum(page) > sc->reclaim_idx) {
                        list_move(&page->lru, &pages_skipped);
                        nr_skipped[page_zonenum(page)]++;
                        continue;
                }

                /*
                 * Do not count skipped pages because that makes the function
                 * return with no isolated pages if the LRU mostly contains
                 * ineligible pages.  This causes the VM to not reclaim any
                 * pages, triggering a premature OOM.
                 */
                scan++;
                switch (__isolate_lru_page(page, mode)) {
                case 0:
                        nr_pages = hpage_nr_pages(page);
                        nr_taken += nr_pages;
                        nr_zone_taken[page_zonenum(page)] += nr_pages;
                        list_move(&page->lru, dst);
                        break;

                case -EBUSY:
                        /* else it is being freed elsewhere */
                        list_move(&page->lru, src);
                        continue;

                default:
                        BUG();
                }
        }

        /*
         * Splice any skipped pages to the start of the LRU list. Note that
         * this disrupts the LRU order when reclaiming for lower zones but
         * we cannot splice to the tail. If we did then the SWAP_CLUSTER_MAX
         * scanning would soon rescan the same pages to skip and put the
         * system at risk of premature OOM.
         */
        if (!list_empty(&pages_skipped)) {
                int zid;

                list_splice(&pages_skipped, src);
                for (zid = 0; zid < MAX_NR_ZONES; zid++) {
                        if (!nr_skipped[zid])
                                continue;

                        __count_zid_vm_events(PGSCAN_SKIP, zid, nr_skipped[zid]);
                        skipped += nr_skipped[zid];
                }
        }
        *nr_scanned = total_scan;
        trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan,
                                    total_scan, skipped, nr_taken, mode, lru);
        update_lru_sizes(lruvec, lru, nr_zone_taken);
        return nr_taken;
}

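Tries to scan up to nr_to_scan pages from the given @lruvec, places the isolated pages on the @dst list, and returns the number of pages successfully isolated.

In code line 6, select the lru list to work on.
In code lines 15~29, scan up to nr_to_scan pages from the lru list. If a page's zone is above the requested zone (sc->reclaim_idx), move it to the pages_skipped list so that it is skipped.
In code lines 38~53, isolate one page and move it to the @dst list. If the page cannot be isolated right now (-EBUSY), move it to the head of the originally requested lru list (rotate).
In code lines 63~74, move the pages on the pages_skipped list to the head of the originally requested lru list. (rotate)
In code lines 75~79, store the number of scanned pages in the output argument @nr_scanned, update the lru sizes, and return the number of pages successfully isolated.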
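
The difference between the counters in this loop is easy to miss, so here is a simplified user-space model of it. Everything in it is an assumption made for illustration: struct fake_page, struct isolate_result and isolate_model() are hypothetical, the lru list is a plain array walked from index 0, every page counts as one page, and rotated pages are not revisited as they could be on a real list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page: just its zone index and whether isolation is busy. */
struct fake_page {
        int zone;
        bool busy;   /* stands for __isolate_lru_page() returning -EBUSY */
};

struct isolate_result {
        unsigned long taken;    /* pages moved to the destination list */
        unsigned long scanned;  /* every page looked at (total_scan) */
        unsigned long skipped;  /* pages above reclaim_idx */
        unsigned long busy;     /* pages rotated because of -EBUSY */
};

static struct isolate_result isolate_model(const struct fake_page *pages,
                                           size_t n,
                                           unsigned long nr_to_scan,
                                           int reclaim_idx)
{
        struct isolate_result r = { 0, 0, 0, 0 };
        unsigned long scan = 0;  /* eligible pages only */
        size_t i;

        assert(pages != NULL || n == 0);

        for (i = 0; i < n && scan < nr_to_scan && r.taken < nr_to_scan; i++) {
                r.scanned++;

                /* pages in zones above the request do not use up the budget */
                if (pages[i].zone > reclaim_idx) {
                        r.skipped++;
                        continue;
                }

                scan++;
                if (pages[i].busy) {
                        r.busy++;
                        continue;
                }
                r.taken++;
        }
        return r;
}
```

Because skipped pages do not increase scan, a list that consists mostly of ineligible pages still lets the loop reach pages that can be isolated, which is the point made in the comment inside the kernel loop.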

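The following figure shows that, when isolation is attempted on an lru list, some pages are rotated and the rest are isolated.

Whether a page is isolated depends on the isolation mode and on the state of each page.
When memory pressure is severe, or when an area such as a CMA area must be emptied, the isolation mode flags can be set or cleared so that pages are not rotated.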

__isolate_lru_page()

mm/vmscan.c

/*
 * Attempt to remove the specified page from its LRU.  Only take this page
 * if it is of the appropriate PageActive status.  Pages which are being
 * freed elsewhere are also ignored.
 *
 * page:        page to consider
 * mode:        one of the LRU isolation modes defined above
 *
 * returns 0 on success, -ve errno on failure.
 */

int __isolate_lru_page(struct page *page, isolate_mode_t mode)
{
        int ret = -EINVAL;

        /* Only take pages on the LRU. */
        if (!PageLRU(page))
                return ret;

        /* Compaction should not handle unevictable pages but CMA can do so */
        if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE))
                return ret;

        ret = -EBUSY;

        /*
         * To minimise LRU disruption, the caller can indicate that it only
         * wants to isolate pages it will be able to operate on without
         * blocking - clean pages for the most part.
         *
         * ISOLATE_ASYNC_MIGRATE is used to indicate that it only wants to pages
         * that it is possible to migrate without blocking
         */
        if (mode & ISOLATE_ASYNC_MIGRATE) {
                /* All the caller can do on PageWriteback is block */
                if (PageWriteback(page))
                        return ret;

                if (PageDirty(page)) {
                        struct address_space *mapping;
                        bool migrate_dirty;

                        /*
                         * Only pages without mappings or that have a
                         * ->migratepage callback are possible to migrate
                         * without blocking. However, we can be racing with
                         * truncation so it's necessary to lock the page
                         * to stabilise the mapping as truncation holds
                         * the page lock until after the page is removed
                         * from the page cache.
                         */
                        if (!trylock_page(page))
                                return ret;

                        mapping = page_mapping(page);
                        migrate_dirty = !mapping || mapping->a_ops->migratepage;
                        unlock_page(page);
                        if (!migrate_dirty)
                                return ret;
                }
        }

        if ((mode & ISOLATE_UNMAPPED) && page_mapped(page))
                return ret;

        if (likely(get_page_unless_zero(page))) {
                /*
                 * Be careful not to clear PageLRU until after we're
                 * sure the page is not being freed elsewhere -- the
                 * page release code relies on it.
                 */
                ClearPageLRU(page);
                ret = 0;
        }

        return ret;
}

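Isolates the requested page from its lru list and returns 0 on success. If the page is not eligible for isolation at all, -EINVAL is returned, and if the mode conditions prevent isolating it right now, -EBUSY is returned.

In code lines 6~7, if the page is not an lru page, give up the isolation. (-EINVAL)
In code lines 10~11, if the page is unevictable and the mode does not allow isolating unevictable pages, give up the isolation. (-EINVAL)
In code lines 23~50, handle the async migration mode. A writeback page is not isolated. A dirty page is also not isolated if the attempt to lock the page fails or if its mapping does not provide the (*migratepage) hook. (-EBUSY)
In code lines 52~53, if the mode requests unmapped pages only, a mapped page is not isolated. (-EBUSY)
In code lines 55~63, if the reference count is not 0, increment it by 1, clear the lru flag bit, and return success. If the reference count is already 0, the page is being freed elsewhere and -EBUSY is returned.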
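
The order of these checks determines which error code a page gets, which the following user-space sketch tries to make explicit. The MODEL_* flag values, struct page_state and isolate_decision() are hypothetical and do not match the kernel's isolate_mode_t bits or struct page; the boolean fields stand for the results of the corresponding kernel helpers.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values, chosen only for this model. */
#define MODEL_ISOLATE_UNMAPPED       0x2u
#define MODEL_ISOLATE_ASYNC_MIGRATE  0x4u
#define MODEL_ISOLATE_UNEVICTABLE    0x8u

struct page_state {
        bool on_lru;       /* PageLRU() */
        bool unevictable;  /* PageUnevictable() */
        bool writeback;    /* PageWriteback() */
        bool dirty;        /* PageDirty() */
        bool lockable;     /* trylock_page() would succeed */
        bool has_mapping;  /* page_mapping() != NULL */
        bool can_migrate;  /* the mapping provides ->migratepage */
        bool mapped;       /* page_mapped() */
        int refcount;
};

/* Returns 0 and takes a reference on success, -EINVAL or -EBUSY otherwise. */
static int isolate_decision(struct page_state *p, unsigned int mode)
{
        assert(p != NULL);

        if (!p->on_lru)
                return -EINVAL;
        if (p->unevictable && !(mode & MODEL_ISOLATE_UNEVICTABLE))
                return -EINVAL;

        if (mode & MODEL_ISOLATE_ASYNC_MIGRATE) {
                if (p->writeback)
                        return -EBUSY;
                if (p->dirty) {
                        if (!p->lockable)
                                return -EBUSY;
                        /* only pages without a mapping or with a hook pass */
                        if (p->has_mapping && !p->can_migrate)
                                return -EBUSY;
                }
        }

        if ((mode & MODEL_ISOLATE_UNMAPPED) && p->mapped)
                return -EBUSY;

        /* models get_page_unless_zero(): fails when the page is being freed */
        if (p->refcount == 0)
                return -EBUSY;
        p->refcount++;
        p->on_lru = false;
        return 0;
}
```

A page that was isolated successfully is no longer on the lru, so a second call on the same page in this model returns -EINVAL.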

The following figure shows how the isolation mode and the state and type of each page decide whether the page is isolated.

update_lru_sizes()

mm/vmscan.c

/*
 * Update LRU sizes after isolating pages. The LRU size updates must
 * be complete before mem_cgroup_update_lru_size due to a santity check.
 */

static __always_inline void update_lru_sizes(struct lruvec *lruvec,
                        enum lru_list lru, unsigned long *nr_zone_taken)
{
        int zid;

        for (zid = 0; zid < MAX_NR_ZONES; zid++) {
                if (!nr_zone_taken[zid])
                        continue;

                __update_lru_size(lruvec, lru, zid, -nr_zone_taken[zid]);
#ifdef CONFIG_MEMCG
                mem_cgroup_update_lru_size(lruvec, lru, zid, -nr_zone_taken[zid]);
#endif
        }

}

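Updates the lru sizes by the number of isolated pages. The input argument @nr_zone_taken holds the number of successfully isolated pages for each zone.

In code lines 6~8, iterate over all zones and skip the zones from which no page was isolated.
In code line 10, update the size of the specified lru list of the lruvec.
In code line 12, update the per-zone lru size that memcg keeps for each node.

Rotating to the LRU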

move_active_pages_to_lru()

mm/vmscan.c

/*
 * This moves pages from the active list to the inactive list.
 *
 * We move them the other way if the page is referenced by one or more
 * processes, from rmap.
 *
 * If the pages are mostly unmapped, the processing is fast and it is
 * appropriate to hold zone->lru_lock across the whole operation.  But if
 * the pages are mapped, the processing is slow (page_referenced()) so we
 * should drop zone->lru_lock around each page.  It's impossible to balance
 * this, so instead we remove the pages from the LRU while processing them.
 * It is safe to rely on PG_active against the non-LRU pages in here because
 * nobody will play with that bit on a non-LRU page.
 *
 * The downside is that we have to touch page->_count against each page.
 * But we had to alter page->flags anyway.
 */

static void move_active_pages_to_lru(struct lruvec *lruvec,
                                     struct list_head *list,
                                     struct list_head *pages_to_free,
                                     enum lru_list lru)
{
        struct zone *zone = lruvec_zone(lruvec);
        unsigned long pgmoved = 0;
        struct page *page;
        int nr_pages;

        while (!list_empty(list)) {
                page = lru_to_page(list);
                lruvec = mem_cgroup_page_lruvec(page, zone);

                VM_BUG_ON_PAGE(PageLRU(page), page);
                SetPageLRU(page);

                nr_pages = hpage_nr_pages(page);
                mem_cgroup_update_lru_size(lruvec, lru, nr_pages);
                list_move(&page->lru, &lruvec->lists[lru]);
                pgmoved += nr_pages;

                if (put_page_testzero(page)) {
                        __ClearPageLRU(page);
                        __ClearPageActive(page);
                        del_page_from_lru_list(page, lruvec, lru);

                        if (unlikely(PageCompound(page))) {
                                spin_unlock_irq(&zone->lru_lock);
                                mem_cgroup_uncharge(page);
                                (*get_compound_page_dtor(page))(page);
                                spin_lock_irq(&zone->lru_lock);
                        } else
                                list_add(&page->lru, pages_to_free);
                }
        }
        __mod_zone_page_state(zone, NR_LRU_BASE + lru, pgmoved);
        if (!is_active_lru(lru))
                __count_vm_events(PGDEACTIVATE, pgmoved);
}

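Moves the pages on the list (@list) to an lru list (active or inactive). Pages that turn out to be released already and to have no users left during this process are moved to the @pages_to_free list.

In code lines 11~20, walk the pages on @list and return each one to the lru list of its lruvec. (rotate)
In code lines 23~35, since the work on the page is finished, drop its reference count. If no user remains, the page is removed from the lru again: a compound page is freed right away through its destructor, and any other page is added to @pages_to_free so that the caller can return it to the buddy system.
In code lines 38~39, for an inactive lru, increase the PGDEACTIVATE counter by the number of pages moved.

Computing the per-lru scan counts

Scan balance modes

Four modes are used to compute the per-lru scan counts. As the priority value (sc->priority) decreases, the scan count increases, because the lru size is right-shifted by the priority. However, the number of pages scanned at one time is limited to SWAP_CLUSTER_MAX (32).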

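SCAN_ANON
- Scan only anon pages.
SCAN_FILE
- Scan only file pages.
SCAN_EQUAL
- Scan anon and file pages equally.
SCAN_FRACT
- Apply the following anon/file fractions on top of the scan count determined as in SCAN_EQUAL. Each fraction is normalized by (anon fraction + file fraction + 1).
  - anon fraction = anon priority * recent anon scanned count / recent anon rotated count
  - file fraction = file priority * recent file scanned count / recent file rotated count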
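
The SCAN_FRACT arithmetic can be checked with a few numbers. The sketch below follows the ap/fp and denominator computation in get_scan_count(); struct scan_fractions, compute_fractions() and apply_fraction() are hypothetical names, and the final scaling step (scan * fraction / denominator) is assumed here rather than taken from the code quoted in this section.

```c
#include <assert.h>
#include <stdint.h>

struct scan_fractions {
        uint64_t anon;         /* ap */
        uint64_t file;         /* fp */
        uint64_t denominator;  /* ap + fp + 1 */
};

/*
 * A list whose scanned pages were mostly rotated back (rotated close to
 * scanned) gets a small fraction, so less pressure is put on it.
 * The +1 terms avoid division by zero.
 */
static struct scan_fractions
compute_fractions(unsigned long anon_prio, unsigned long file_prio,
                  unsigned long scanned_anon, unsigned long rotated_anon,
                  unsigned long scanned_file, unsigned long rotated_file)
{
        struct scan_fractions f;

        f.anon = (uint64_t)anon_prio * (scanned_anon + 1) / (rotated_anon + 1);
        f.file = (uint64_t)file_prio * (scanned_file + 1) / (rotated_file + 1);
        f.denominator = f.anon + f.file + 1;
        return f;
}

/* Hypothetical helper: scale a base scan count by one fraction. */
static uint64_t apply_fraction(uint64_t scan, uint64_t fraction,
                               uint64_t denominator)
{
        assert(denominator != 0);
        return scan * fraction / denominator;
}
```

For example, with anon_prio=60, file_prio=140, 99 anon pages scanned of which 49 were rotated, and 199 file pages scanned of which 9 were rotated, ap is 120, fp is 2800 and the denominator is 2921. A base scan count of 1000 then becomes 41 for anon and 958 for file.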

The following figure shows roughly how the per-lru scan counts are decided under the four scan balance modes.

Setting the Scanning Priority

Sets the priority of anon and file pages.

The anon priority equals the swappiness value, ranges from 0 to 100, and defaults to 60.
- In recent kernels the swap area may sit on a fast block device such as an SSD, which improves reclaim performance, so swappiness values of up to 200 are allowed.
  - See: mm: allow swappiness that prefers reclaiming anon over the file workingset (2020, v5.8-rc1)
The file priority is 200 - anon priority.
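
As a quick check of this rule, the two priorities can be derived from swappiness as follows. This is a minimal sketch; scan_priorities() is a hypothetical helper, and the accepted range of 0 to 200 follows the note above about recent kernels.

```c
#include <assert.h>

/* Derive both scanning priorities from a swappiness value (0..200). */
static void scan_priorities(unsigned int swappiness,
                            unsigned long *anon_prio,
                            unsigned long *file_prio)
{
        assert(anon_prio && file_prio);
        assert(swappiness <= 200);

        *anon_prio = swappiness;
        *file_prio = 200 - *anon_prio;
}
```

With the default swappiness of 60 this gives anon_prio=60 and file_prio=140, and with swappiness 100 both priorities are 100, which matches the comment in get_scan_count() that anon and file then have the same priority.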

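The following figure shows how the Scanning Priority (file_prio and anon_prio) is determined from the swappiness value.

First-stage per-lru scan count

The number of pages to scan per lru is obtained by right-shifting the lru size by the priority requested through the scan control (sc->priority). If the result is 0, the lru size capped at 32 is used instead. Just before OOM the highest priority (sc->priority = 0) is reached; at that point SCAN_FRACT, which uses the swappiness ratio, is no longer used, and SCAN_EQUAL, which ignores the swappiness ratio, is used so that as much of every lru as possible is scanned.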
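
The shift can be illustrated with a short function. This is a sketch under stated assumptions: base_scan_count() is a hypothetical helper, MODEL_SWAP_CLUSTER_MAX stands for SWAP_CLUSTER_MAX (32), and the fallback for a zero result follows the description in the text above; the kernel may apply that fallback only under additional conditions that are not modeled here.

```c
#include <assert.h>

#define MODEL_SWAP_CLUSTER_MAX 32UL

/* First-stage scan count for one lru list at the given priority. */
static unsigned long base_scan_count(unsigned long lru_size, int priority)
{
        unsigned long scan;

        assert(priority >= 0 && priority < (int)(sizeof(unsigned long) * 8));

        scan = lru_size >> priority;
        if (!scan)
                scan = lru_size < MODEL_SWAP_CLUSTER_MAX ?
                       lru_size : MODEL_SWAP_CLUSTER_MAX;
        return scan;
}
```

A list of 100000 pages yields 100000 >> 4 = 6250 pages at priority 4 and the whole list at priority 0, which is why a lower priority value means more aggressive scanning.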

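The figure below shows how the per-lru scan counts with the priority applied are computed.

Computing the anon and file ratios

In the SCAN_FRACT mode the anon and file ratios must be computed in addition.

Starting with the recent kernel v5.8-rc1, the anon_cost and file_cost model is used instead of recent_rotated[] and recent_scanned[].
Refaulting pages that had once been active (workingset detection) are charged an additional cost, and this cost is used for scan balancing.
- See:
  - mm: base LRU balancing on an explicit cost model (2020, v5.8-rc1)
  - mm: vmscan: reclaim writepage is IO cost (2020, v5.8-rc1)

The following figure shows how the anon and file ratios used in the ratio-based SCAN_FRACT mode are computed.

Computing the final per-lru scan counts

The following figure shows how the scan counts are obtained for each of the four modes.

The numbers 0~3 denote the lists from the inactive anon lru (0) through the active file lru (3).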

get_scan_count()

mm/vmscan.c -1/3-

/*
 * Determine how aggressively the anon and file LRU lists should be
 * scanned.  The relative value of each set of LRU lists is determined
 * by looking at the fraction of the pages scanned we did rotate back
 * onto the active list instead of evict.
 *
 * nr[0] = anon inactive pages to scan; nr[1] = anon active pages to scan
 * nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
 */

static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
                           struct scan_control *sc, unsigned long *nr,
                           unsigned long *lru_pages)
{
        int swappiness = mem_cgroup_swappiness(memcg);
        struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
        u64 fraction[2];
        u64 denominator = 0;    /* gcc */
        struct pglist_data *pgdat = lruvec_pgdat(lruvec);
        unsigned long anon_prio, file_prio;
        enum scan_balance scan_balance;
        unsigned long anon, file;
        unsigned long ap, fp;
        enum lru_list lru;

        /* If we have no swap space, do not bother scanning anon pages. */
        if (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0) {
                scan_balance = SCAN_FILE;
                goto out;
        }

        /*
         * Global reclaim will swap to prevent OOM even with no
         * swappiness, but memcg users want to use this knob to
         * disable swapping for individual groups completely when
         * using the memory controller's swap limit feature would be
         * too expensive.
         */
        if (!global_reclaim(sc) && !swappiness) {
                scan_balance = SCAN_FILE;
                goto out;
        }

        /*
         * Do not apply any pressure balancing cleverness when the
         * system is close to OOM, scan both anon and file equally
         * (unless the swappiness setting disagrees with swapping).
         */
        if (!sc->priority && swappiness) {
                scan_balance = SCAN_EQUAL;
                goto out;
        }

        /*
         * Prevent the reclaimer from falling into the cache trap: as
         * cache pages start out inactive, every cache fault will tip
         * the scan balance towards the file LRU.  And as the file LRU
         * shrinks, so does the window for rotation from references.
         * This means we have a runaway feedback loop where a tiny
         * thrashing file LRU becomes infinitely more attractive than
         * anon pages.  Try to detect this based on file LRU size.
         */
        if (global_reclaim(sc)) {
                unsigned long pgdatfile;
                unsigned long pgdatfree;
                int z;
                unsigned long total_high_wmark = 0;

                pgdatfree = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
                pgdatfile = node_page_state(pgdat, NR_ACTIVE_FILE) +
                           node_page_state(pgdat, NR_INACTIVE_FILE);

                for (z = 0; z < MAX_NR_ZONES; z++) {
                        struct zone *zone = &pgdat->node_zones[z];
                        if (!managed_zone(zone))
                                continue;

                        total_high_wmark += high_wmark_pages(zone);
                }

                if (unlikely(pgdatfile + pgdatfree <= total_high_wmark)) {
                        /*
                         * Force SCAN_ANON if there are enough inactive
                         * anonymous pages on the LRU in eligible zones.
                         * Otherwise, the small LRU gets thrashed.
                         */
                        if (!inactive_list_is_low(lruvec, false, memcg, sc, false) &&
                            lruvec_lru_size(lruvec, LRU_INACTIVE_ANON, sc->reclaim_idx)
                                        >> sc->priority) {
                                scan_balance = SCAN_ANON;
                                goto out;
                        }
                }
        }

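Decides how many pages should be scanned from the anon & file lru lists. The relative value of each set of lru lists is determined by the fraction of scanned pages that were rotated back onto the active list instead of being evicted.

In code line 5, get the swappiness value for the memcg.
In code lines 17~20, anon pages cannot be swapped out when swapping is not allowed (may_swap is not set) or when there is no swap space. In that case decide to scan only file pages and jump to the out label.
In code lines 29~32, if this is not a global reclaim and the swappiness value is 0, decide to scan only file pages and jump to the out label.
In code lines 39~42, at the highest priority (OOM is close) with a non-zero swappiness value, decide to balance anon and file equally and jump to the out label.
In code lines 53~84, for a global reclaim, get the number of free pages and file pages of the node, and sum the high watermarks of the zones that belong to the node. In the unlikely case that both of the following conditions hold, decide to scan only anon pages and jump to the out label.
- The number of free pages plus file pages is at or below the sum of the high watermarks.
- The inactive anon list is not low relative to the active anon list, and its size right-shifted by the priority is non-zero.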
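
The order in which get_scan_count() picks a mode can be summarized as a decision cascade. The sketch below is a user-space model, not the kernel function: struct balance_input and choose_scan_balance() are hypothetical, the two *_low fields stand for the result of inactive_list_is_low(), and the last two steps (SCAN_FILE and SCAN_FRACT) come from the second part of the function.

```c
#include <assert.h>
#include <stdbool.h>

enum scan_balance { SCAN_EQUAL, SCAN_FRACT, SCAN_ANON, SCAN_FILE };

/* Hypothetical snapshot of the inputs that get_scan_count() looks at. */
struct balance_input {
        bool may_swap;
        long swap_pages;
        bool global_reclaim;
        int swappiness;
        int priority;
        unsigned long file_pages;
        unsigned long free_pages;
        unsigned long total_high_wmark;
        bool inactive_anon_low;
        unsigned long inactive_anon;
        bool inactive_file_low;
        unsigned long inactive_file;
};

static enum scan_balance choose_scan_balance(const struct balance_input *in)
{
        assert(in != 0);
        assert(in->priority >= 0 && in->priority < 32);

        /* 1. no swapping possible: anon cannot be reclaimed */
        if (!in->may_swap || in->swap_pages <= 0)
                return SCAN_FILE;
        /* 2. memcg reclaim with swappiness 0 */
        if (!in->global_reclaim && !in->swappiness)
                return SCAN_FILE;
        /* 3. close to OOM: no balancing cleverness */
        if (!in->priority && in->swappiness)
                return SCAN_EQUAL;
        /* 4. file cache nearly exhausted and enough inactive anon */
        if (in->global_reclaim &&
            in->file_pages + in->free_pages <= in->total_high_wmark &&
            !in->inactive_anon_low &&
            (in->inactive_anon >> in->priority))
                return SCAN_ANON;
        /* 5. enough inactive file cache: leave anon alone */
        if (!in->inactive_file_low && (in->inactive_file >> in->priority))
                return SCAN_FILE;
        /* 6. otherwise balance by the computed fractions */
        return SCAN_FRACT;
}
```

Each early return corresponds to one `goto out` in the kernel code, so the first matching condition wins and SCAN_FRACT is only reached when none of the shortcuts apply.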

mm/vmscan.c -2/3-

        /*
         * If there is enough inactive page cache, i.e. if the size of the
         * inactive list is greater than that of the active list *and* the
         * inactive list actually has some pages to scan on this priority, we
         * do not reclaim anything from the anonymous working set right now.
         * Without the second condition we could end up never scanning an
         * lruvec even if it has plenty of old anonymous pages unless the
         * system is under heavy pressure.
         */
        if (!inactive_list_is_low(lruvec, true, memcg, sc, false) &&
            lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, sc->reclaim_idx) >> sc->priority) {
                scan_balance = SCAN_FILE;
                goto out;
        }

        scan_balance = SCAN_FRACT;

        /*
         * With swappiness at 100, anonymous and file have the same priority.
         * This scanning priority is essentially the inverse of IO cost.
         */
        anon_prio = swappiness;
        file_prio = 200 - anon_prio;

        /*
         * OK, so we have swap space and a fair amount of page cache
         * pages.  We use the recently rotated / recently scanned
         * ratios to determine how valuable each cache is.
         *
         * Because workloads change over time (and to avoid overflow)
         * we keep these statistics as a floating average, which ends
         * up weighing recent references more than old ones.
         *
         * anon in [0], file in [1]
         */

        anon  = lruvec_lru_size(lruvec, LRU_ACTIVE_ANON, MAX_NR_ZONES) +
                lruvec_lru_size(lruvec, LRU_INACTIVE_ANON, MAX_NR_ZONES);
        file  = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES) +
                lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, MAX_NR_ZONES);

        spin_lock_irq(&pgdat->lru_lock);
        if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
                reclaim_stat->recent_scanned[0] /= 2;
                reclaim_stat->recent_rotated[0] /= 2;
        }

        if (unlikely(reclaim_stat->recent_scanned[1] > file / 4)) {
                reclaim_stat->recent_scanned[1] /= 2;
                reclaim_stat->recent_rotated[1] /= 2;
        }

        /*
         * The amount of pressure on anon vs file pages is inversely
         * proportional to the fraction of recently scanned pages on
         * each list that were recently referenced and in active use.
         */
        ap = anon_prio * (reclaim_stat->recent_scanned[0] + 1);
        ap /= reclaim_stat->recent_rotated[0] + 1;

        fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
        fp /= reclaim_stat->recent_rotated[1] + 1;
        spin_unlock_irq(&pgdat->lru_lock);

        fraction[0] = ap;
        fraction[1] = fp;
        denominator = ap + fp + 1;

In code lines 10~14, if the inactive file list is not low (inactive_list_is_low() returns false) and the inactive file size shifted right by sc->priority is non-zero, only file pages are scanned, and control jumps to the out label.
In code line 16, anon and file pages are set to be scanned in a computed proportion (SCAN_FRACT).
In code lines 22~23, first, the scanning priorities for anon and file (anon_prio and file_prio) are set. anon_prio is the swappiness value, which ranges from 0 to 100. file_prio is 200 - anon_prio. Note that with swappiness at 100, anon_prio and file_prio are equal. The value can be changed through the following files.
- /proc/sys/vm/swappiness (for global)
- /sys/fs/cgroup/memory/memory.swappiness (for each memcg.)
- A swappiness of 0 can cause problems in some cases, so take care.
  - See: mm: avoid swapping out with swappiness==0
  - See: Linux Swappiness
In code lines 37~40, the anon page count and the file page count are obtained.
In code lines 43~46, in the unlikely case that the recent anon scan count exceeds 25% of anon, the recent anon scan count and the recent anon rotate count are both halved.
In code lines 48~51, in the unlikely case that the recent file scan count exceeds 25% of file, the recent file scan count and the recent file rotate count are both halved.
In code lines 58~67, second, the anon and file weights used to split the scan count are stored in fraction[]. Their sum plus 1 is stored in denominator as the divisor. (The +1 keeps the later division from dividing by zero.)
- fraction[0] = ap = anon_prio (0~100) * (recent anon scanned + 1) / (recent anon rotated + 1)
- fraction[1] = fp = file_prio (200 - anon_prio) * (recent file scanned + 1) / (recent file rotated + 1)
- denominator = ap + fp + 1
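
The fraction calculation described above can be modeled in user space. The following is a minimal sketch, not kernel code: `struct reclaim_model`, `decay_stats()` and `compute_fractions()` are hypothetical names, plain integers stand in for the lruvec statistics, and no locking is modeled.

```c
#include <stdint.h>

/* Hypothetical stand-in for recent_scanned[]/recent_rotated[];
 * index 0 is anon and index 1 is file, as in the listing above. */
struct reclaim_model {
        uint64_t recent_scanned[2];
        uint64_t recent_rotated[2];
};

/* Age the statistics: once the scanned count exceeds a quarter of the
 * list size, halve both counters of that list type. */
static void decay_stats(struct reclaim_model *m, uint64_t anon, uint64_t file)
{
        uint64_t size[2] = { anon, file };

        for (int i = 0; i < 2; i++) {
                if (m->recent_scanned[i] > size[i] / 4) {
                        m->recent_scanned[i] /= 2;
                        m->recent_rotated[i] /= 2;
                }
        }
}

/* Fill fraction[0] (anon) and fraction[1] (file); return the denominator. */
static uint64_t compute_fractions(const struct reclaim_model *m,
                                  unsigned int swappiness,
                                  uint64_t fraction[2])
{
        uint64_t anon_prio = swappiness;        /* assumed to be 0..100 */
        uint64_t file_prio = 200 - anon_prio;

        fraction[0] = anon_prio * (m->recent_scanned[0] + 1) /
                      (m->recent_rotated[0] + 1);
        fraction[1] = file_prio * (m->recent_scanned[1] + 1) /
                      (m->recent_rotated[1] + 1);

        return fraction[0] + fraction[1] + 1;
}
```

With swappiness 60 and empty statistics, this yields fractions 60 and 140 over a denominator of 201, so file pages receive roughly 70% of the scan pressure. A list whose pages are frequently rotated back (high recent_rotated) ends up with a smaller fraction.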

mm/vmscan.c -3/3-

out:
        *lru_pages = 0;
        for_each_evictable_lru(lru) {
                int file = is_file_lru(lru);
                unsigned long size;
                unsigned long scan;

                size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
                scan = size >> sc->priority;
                /*
                 * If the cgroup's already been deleted, make sure to
                 * scrape out the remaining cache.
                 */
                if (!scan && !mem_cgroup_online(memcg))
                        scan = min(size, SWAP_CLUSTER_MAX);

                switch (scan_balance) {
                case SCAN_EQUAL:
                        /* Scan lists relative to size */
                        break;
                case SCAN_FRACT:
                        /*
                         * Scan types proportional to swappiness and
                         * their relative recent reclaim efficiency.
                         * Make sure we don't miss the last page
                         * because of a round-off error.
                         */
                        scan = DIV64_U64_ROUND_UP(scan * fraction[file],
                                                  denominator);
                        break;
                case SCAN_FILE:
                case SCAN_ANON:
                        /* Scan one type exclusively */
                        if ((scan_balance == SCAN_FILE) != file) {
                                size = 0;
                                scan = 0;
                        }
                        break;
                default:
                        /* Look ma, no brain */
                        BUG();
                }

                *lru_pages += size;
                nr[lru] = scan;
        }
}

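In code lines 1~3, the out: label is where the final scan count for each LRU is computed. The loop iterates over the evictable LRUs.
In code lines 8~9, the scan count is the LRU size shifted right by the priority.
- At the highest priority, sc->priority is 0, so the whole LRU size is used.
In code lines 14~15, if the scan count is 0 and the memcg has already been deleted (is offline), the scan count is set to the LRU size, capped at SWAP_CLUSTER_MAX (32).
In code lines 17~42, the scan count for each LRU is decided according to one of the following four scan balance modes.
- SCAN_EQUAL
  - The computed scan value is used as is.
- SCAN_FRACT
  - The fraction (fraction[]) is applied to the computed scan value, rounding up.
- SCAN_FILE
  - For anon LRUs, the size and the scan count are set to 0.
- SCAN_ANON
  - For file LRUs, the size and the scan count are set to 0.
In code lines 44~45, the resulting size is added to the output argument @lru_pages, and the scan count is stored in the output argument @nr[lru].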
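
The per-LRU calculation at the out: label can be sketched as a pure function. This is an illustrative user-space model under stated assumptions: `scan_target()` and `MODEL_SWAP_CLUSTER_MAX` are hypothetical names, the memcg state is reduced to a flag, and the size accounting into lru_pages is left out.

```c
#include <stdint.h>

enum scan_balance { SCAN_EQUAL, SCAN_FRACT, SCAN_ANON, SCAN_FILE };

#define MODEL_SWAP_CLUSTER_MAX 32ULL    /* mirrors SWAP_CLUSTER_MAX */

/* Hypothetical helper: number of pages to scan on one LRU list. */
static uint64_t scan_target(uint64_t size, int priority,
                            enum scan_balance bal, int is_file,
                            int memcg_online,
                            const uint64_t fraction[2], uint64_t denominator)
{
        uint64_t scan = size >> priority;

        /* Offline memcg: scrape out what is left even if scan rounded to 0. */
        if (!scan && !memcg_online)
                scan = size < MODEL_SWAP_CLUSTER_MAX ? size
                                                     : MODEL_SWAP_CLUSTER_MAX;

        switch (bal) {
        case SCAN_EQUAL:
                /* Scan relative to list size only. */
                break;
        case SCAN_FRACT:
                /* Round up so the last page is not lost to truncation. */
                scan = (scan * fraction[is_file] + denominator - 1) /
                       denominator;
                break;
        case SCAN_FILE:
        case SCAN_ANON:
                /* Scan one type exclusively. */
                if ((bal == SCAN_FILE) != (is_file != 0))
                        scan = 0;
                break;
        }

        return scan;
}
```

For a 1024-page list at priority 2 the base value is 256. With fractions {60, 140} and denominator 201, SCAN_FRACT gives 77 for anon and 179 for file, while SCAN_FILE gives 0 for anon lists and 256 for file lists.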

The following figure shows why get_scan_count() selects each scan mode.

Swappiness

mem_cgroup_swappiness()

include/linux/swap.h

static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
{
        /* Cgroup2 doesn't have per-cgroup swappiness */
        if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
                return vm_swappiness;

        /* root ? */
        if (mem_cgroup_disabled() || !memcg->css.parent)
                return vm_swappiness;

        return memcg->swappiness;
}

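Returns the “memory.swappiness” value of the memcg. It defaults to 60 and ranges from 0 to 100. On cgroup v2 (which has no per-cgroup swappiness), when memcg is disabled, or for the root memcg, vm_swappiness (0..100, default: 60) is returned instead.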
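
The selection order can be restated as a small user-space model. This is a sketch under assumptions: `struct memcg_model` and `effective_swappiness()` are hypothetical, and the three booleans flatten the checks that the kernel function performs on real cgroup state.

```c
#include <stdbool.h>

/* Hypothetical flattened view of what mem_cgroup_swappiness() inspects. */
struct memcg_model {
        bool on_cgroup_v2;      /* memory controller on the default hierarchy */
        bool memcg_disabled;    /* memcg support disabled */
        bool is_root;           /* no parent, i.e. the root memcg */
        int  swappiness;        /* per-memcg memory.swappiness */
};

static int effective_swappiness(const struct memcg_model *m, int vm_swappiness)
{
        /* cgroup v2 has no per-cgroup swappiness. */
        if (m->on_cgroup_v2)
                return vm_swappiness;

        /* Disabled memcg or root memcg: use the global value. */
        if (m->memcg_disabled || m->is_root)
                return vm_swappiness;

        return m->swappiness;
}
```

Only a non-root memcg on the v1 hierarchy with memcg enabled uses its own value; every other case falls back to the global setting.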

The following figure shows how the swappiness value is obtained.

Miscellaneous

zone_reclaimable()

mm/vmscan.c

bool zone_reclaimable(struct zone *zone)
{
        return zone_page_state(zone, NR_PAGES_SCANNED) <
                zone_reclaimable_pages(zone) * 6;
}

The zone is considered reclaimable if the number of pages scanned in the zone is less than 6 times the number of reclaimable pages.

zone_reclaimable_pages()

mm/vmscan.c

static unsigned long zone_reclaimable_pages(struct zone *zone)
{
        int nr;

        nr = zone_page_state(zone, NR_ACTIVE_FILE) +
             zone_page_state(zone, NR_INACTIVE_FILE);

        if (get_nr_swap_pages() > 0)
                nr += zone_page_state(zone, NR_ACTIVE_ANON) +
                      zone_page_state(zone, NR_INACTIVE_ANON);

        return nr;
}

Returns the maximum number of reclaimable pages in the requested zone.

Returns the sum of the active file and inactive file counts. If swap pages are available, the active anon and inactive anon counts are added as well.
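
The two checks above combine as follows. This is a user-space sketch with hypothetical names (`struct zone_model`, `reclaimable_pages()`, `zone_is_reclaimable()`); the counters are plain fields rather than per-zone vmstat values.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the per-zone counters used above. */
struct zone_model {
        unsigned long active_file;
        unsigned long inactive_file;
        unsigned long active_anon;
        unsigned long inactive_anon;
        unsigned long pages_scanned;
};

/* File pages always count; anon pages count only if swap is available. */
static unsigned long reclaimable_pages(const struct zone_model *z,
                                       long nr_swap_pages)
{
        unsigned long nr = z->active_file + z->inactive_file;

        if (nr_swap_pages > 0)
                nr += z->active_anon + z->inactive_anon;

        return nr;
}

/* Reclaimable while fewer than 6x the reclaimable pages have been scanned. */
static bool zone_is_reclaimable(const struct zone_model *z, long nr_swap_pages)
{
        return z->pages_scanned < reclaimable_pages(z, nr_swap_pages) * 6;
}
```

With 150 file pages and no swap, the zone stops being considered reclaimable once 900 pages have been scanned. With swap available and 50 anon pages, the threshold rises to 1200.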

get_nr_swap_pages()

include/linux/swap.h

static inline long get_nr_swap_pages(void)
{
        return atomic_long_read(&nr_swap_pages);
}

Returns the number of available swap pages (nr_swap_pages).

Structures

scan_control structure

mm/vmscan.c

struct scan_control {
        /* How many pages shrink_list() should reclaim */
        unsigned long nr_to_reclaim;

        /*
         * Nodemask of nodes allowed by the caller. If NULL, all nodes
         * are scanned.
         */
        nodemask_t      *nodemask;

        /*
         * The memory cgroup that hit its limit and as a result is the
         * primary target of this reclaim invocation.
         */
        struct mem_cgroup *target_mem_cgroup;

        /* Writepage batching in laptop mode; RECLAIM_WRITE */
        unsigned int may_writepage:1;

        /* Can mapped pages be reclaimed? */
        unsigned int may_unmap:1;

        /* Can pages be swapped as part of reclaim? */
        unsigned int may_swap:1;

        /* e.g. boosted watermark reclaim leaves slabs alone */
        unsigned int may_shrinkslab:1;

        /*
         * Cgroups are not reclaimed below their configured memory.low,
         * unless we threaten to OOM. If any cgroups are skipped due to
         * memory.low and nothing was reclaimed, go back for memory.low.
         */
        unsigned int memcg_low_reclaim:1;
        unsigned int memcg_low_skipped:1;

        unsigned int hibernation_mode:1;

        /* One of the zones is ready for compaction */
        unsigned int compaction_ready:1;

        /* Allocation order */
        s8 order;

        /* Scan (total_size >> priority) pages at once */
        s8 priority;

        /* The highest zone to isolate pages for reclaim from */
        s8 reclaim_idx;

        /* This context's GFP mask */
        gfp_t gfp_mask;

        /* Incremented by the number of inactive pages that were scanned */
        unsigned long nr_scanned;

        /* Number of pages freed so far during a call to shrink_zones() */
        unsigned long nr_reclaimed;

        struct {
                unsigned int dirty;
                unsigned int unqueued_dirty;
                unsigned int congested;
                unsigned int writeback;
                unsigned int immediate;
                unsigned int file_taken;
                unsigned int taken;
        } nr;
};

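nr_to_reclaim
- Number of pages shrink_list() should reclaim
*nodemask
- Nodemask of the nodes to scan. If NULL, all nodes are scanned
*target_mem_cgroup
- When a target memcg is given, reclaim is limited to this memcg and the memcgs below it in the hierarchy.
may_writepage
- Dirty file cache pages may be written out and then reclaimed
may_unmap
- Mapped pages may be unmapped and then reclaimed
may_swap
- Pages may be reclaimed by swapping them out
may_shrinkslab
- Whether slabs may be shrunk (e.g. boosted watermark reclaim leaves slabs alone)
memcg_low_reclaim:1
- Reclaim also from cgroups below their memory.low (used when going back after nothing was reclaimed)
memcg_low_skipped:1
- A cgroup was skipped because of memory.low
hibernation_mode:1
- Hibernation mode
compaction_ready:1
- One of the zones is ready for compaction
order
- Allocation order
priority
- Determines the number of pages to scan at once (total_size >> priority)
reclaim_idx
- Zones up to and including this index are scanned. (Higher zones are excluded.)
gfp_mask
- GFP mask
nr_scanned
- Number of inactive pages scanned
nr_reclaimed
- Number of pages freed so far during a call to shrink_zones()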

References

Zoned Allocator -1- (물리 페이지 할당-Fastpath) | 문c
Zoned Allocator -2- (물리 페이지 할당-Slowpath) | 문c
Zoned Allocator -3- (Buddy 페이지 할당) | 문c
Zoned Allocator -4- (Buddy 페이지 해지) | 문c
Zoned Allocator -5- (Per-CPU Page Frame Cache) | 문c
Zoned Allocator -6- (Watermark) | 문c
Zoned Allocator -7- (Direct Compact) | 문c
Zoned Allocator -8- (Direct Compact-Isolation) | 문c
Zoned Allocator -9- (Direct Compact-Migration) | 문c
Zoned Allocator -10- (LRU & pagevec) | 문c
Zoned Allocator -11- (Direct Reclaim) | 문c
Zoned Allocator -12- (Direct Reclaim-Shrink-1) | 문c – current post
Zoned Allocator -13- (Direct Reclaim-Shrink-2) | 문c
Zoned Allocator -14- (Kswapd) | 문c

Zoned Allocator -9- (Direct Compact-Migration)

2016-07-07 2019-10-12 문영일

Migration

Migration support for non-movable pages

The kernel has traditionally supported migration only for movable LRU pages. Recently, embedded systems such as webOS and Android have been using large numbers of non-movable pages. As the use of non-movable pages grew, problems with high-order allocations were reported. Several efforts were made to address this (for example, compaction algorithm improvements, falling back to order-0 allocation in slub, reserved memory, vmalloc, …), but they were not effective in the long run while many non-movable pages remained in use.

To make such non-movable pages movable, the patch below lets drivers (zram, GPU memory, …) implement hook functions such as (*isolate_page) and (*migratepage). For drivers that provide these hooks, pages can be migrated through the driver even though the kernel classifies them as non-movable.

See: mm: migrate: support non-lru movable page migration

The following figure shows whether isolation and migration are supported for each page type.

migrate_pages()

mm/migrate.c

/*
 * migrate_pages - migrate the pages specified in a list, to the free pages
 *                 supplied as the target for the page migration
 *
 * @from:               The list of pages to be migrated.
 * @get_new_page:       The function used to allocate free pages to be used
 *                      as the target of the page migration.
 * @put_new_page:       The function used to free target pages if migration
 *                      fails, or NULL if no special handling is necessary.
 * @private:            Private data to be passed on to get_new_page()
 * @mode:               The migration mode that specifies the constraints for
 *                      page migration, if any.
 * @reason:             The reason for page migration.
 *
 * The function returns after 10 attempts or if no pages are movable any more
 * because the list has become empty or no retryable pages exist any more.
 * The caller should call putback_movable_pages() to return pages to the LRU
 * or free list only if ret != 0.
 *
 * Returns the number of pages that were not migrated, or an error code.
 */

int migrate_pages(struct list_head *from, new_page_t get_new_page,
                free_page_t put_new_page, unsigned long private,
                enum migrate_mode mode, int reason)
{
        int retry = 1;
        int nr_failed = 0;
        int nr_succeeded = 0;
        int pass = 0;
        struct page *page;
        struct page *page2;
        int swapwrite = current->flags & PF_SWAPWRITE;
        int rc;

        if (!swapwrite)
                current->flags |= PF_SWAPWRITE;

        for(pass = 0; pass < 10 && retry; pass++) {
                retry = 0;

                list_for_each_entry_safe(page, page2, from, lru) {
retry:
                        cond_resched();

                        if (PageHuge(page))
                                rc = unmap_and_move_huge_page(get_new_page,
                                                put_new_page, private, page,
                                                pass > 2, mode, reason);
                        else
                                rc = unmap_and_move(get_new_page, put_new_page,
                                                private, page, pass > 2, mode,
                                                reason);

                        switch(rc) {
                        case -ENOMEM:
                                /*
                                 * THP migration might be unsupported or the
                                 * allocation could've failed so we should
                                 * retry on the same page with the THP split
                                 * to base pages.
                                 *
                                 * Head page is retried immediately and tail
                                 * pages are added to the tail of the list so
                                 * we encounter them after the rest of the list
                                 * is processed.
                                 */
                                if (PageTransHuge(page) && !PageHuge(page)) {
                                        lock_page(page);
                                        rc = split_huge_page_to_list(page, from);
                                        unlock_page(page);
                                        if (!rc) {
                                                list_safe_reset_next(page, page2, lru);
                                                goto retry;
                                        }
                                }
                                nr_failed++;
                                goto out;
                        case -EAGAIN:
                                retry++;
                                break;
                        case MIGRATEPAGE_SUCCESS:
                                nr_succeeded++;
                                break;
                        default:
                                /*
                                 * Permanent failure (-EBUSY, -ENOSYS, etc.):
                                 * unlike -EAGAIN case, the failed page is
                                 * removed from migration page list and not
                                 * retried in the next outer loop.
                                 */
                                nr_failed++;
                                break;
                        }
                }
        }
        nr_failed += retry;
        rc = nr_failed;
out:
        if (nr_succeeded)
                count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
        if (nr_failed)
                count_vm_events(PGMIGRATE_FAIL, nr_failed);
        trace_mm_migrate_pages(nr_succeeded, nr_failed, mode, reason);

        if (!swapwrite)
                current->flags &= ~PF_SWAPWRITE;

        return rc;
}

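Tries up to 10 times to unmap the pages isolated by the migrate scanner and migrate them to the free pages isolated by the free scanner.

In code lines 11~15, if the current task does not have PF_SWAPWRITE set, the flag is set for the duration of the migration so that the task may write to swap.
In code lines 17~18, the number of passes is limited to 10.
In code line 20, the loop walks the pages on the @from list passed in as an argument.
In code lines 24~31, the huge page or normal page is unmapped and moved. From the 4th of the 10 passes onward (pass > 2), force is set to 1, so that in full sync mode pages under writeback are waited on until their writeback completes.
In code lines 33~56, if the migration result is out of memory (-ENOMEM), processing stops. Only for a TransHuge page (not a hugetlb page), the page is split first, and if the split succeeds the same page is retried.
In code lines 57~59, the migration has to be retried (-EAGAIN), so the retry counter is incremented.
In code lines 60~62, the migration succeeded.
In code lines 63~72, the migration failed permanently; the page is not retried in later passes.
In code line 77, the out label is reached either after the passes have finished or when processing stops because of a memory shortage.
In code lines 84~85, the swap write flag of the current task is restored to its original state.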
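
The retry policy can be illustrated with a user-space simulation. This is a sketch only: `enum page_kind`, `struct page_model`, `try_migrate()` and `migrate_all()` are hypothetical, the per-page outcome is scripted rather than real, and the -ENOMEM and THP split paths are not modeled.

```c
#include <errno.h>
#include <stddef.h>

#define MODEL_SUCCESS 0         /* stands in for MIGRATEPAGE_SUCCESS */

/* Hypothetical scripted behaviours, used only by this simulation. */
enum page_kind {
        PK_EASY,        /* migrates on the first attempt */
        PK_NEEDS_FORCE, /* returns -EAGAIN until force is set */
        PK_BUSY,        /* permanent failure (-EBUSY) */
        PK_STUCK,       /* always returns -EAGAIN */
};

struct page_model {
        enum page_kind kind;
        int on_list;    /* 1 while still on the migration list */
};

static int try_migrate(const struct page_model *p, int force)
{
        switch (p->kind) {
        case PK_EASY:           return MODEL_SUCCESS;
        case PK_NEEDS_FORCE:    return force ? MODEL_SUCCESS : -EAGAIN;
        case PK_BUSY:           return -EBUSY;
        case PK_STUCK:          return -EAGAIN;
        }
        return -EBUSY;
}

/* Returns the number of pages that were not migrated. */
static int migrate_all(struct page_model *pages, size_t n, int *nr_succeeded)
{
        int retry = 1;
        int nr_failed = 0;

        *nr_succeeded = 0;
        for (int pass = 0; pass < 10 && retry; pass++) {
                retry = 0;
                for (size_t i = 0; i < n; i++) {
                        if (!pages[i].on_list)
                                continue;

                        /* force is set from the 4th pass on (pass > 2) */
                        int rc = try_migrate(&pages[i], pass > 2);

                        if (rc == -EAGAIN) {
                                retry++;        /* stays listed for next pass */
                                continue;
                        }
                        pages[i].on_list = 0;
                        if (rc == MODEL_SUCCESS)
                                (*nr_succeeded)++;
                        else
                                nr_failed++;
                }
        }

        /* Pages still asking for a retry after the last pass count as failed. */
        return nr_failed + retry;
}
```

For one page of each kind, two pages succeed (the second only once force is set), one fails permanently on the first pass, and one is still pending after the tenth pass, so the function reports 2 pages not migrated.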

Migration of normal pages

unmap_and_move()

mm/migrate.c -1/2-

/*
 * Obtain the lock on page, remove all ptes and migrate the page
 * to the newly allocated page in newpage.
 */

static ICE_noinline int unmap_and_move(new_page_t get_new_page,
                                   free_page_t put_new_page,
                                   unsigned long private, struct page *page,
                                   int force, enum migrate_mode mode,
                                   enum migrate_reason reason)
{
        int rc = MIGRATEPAGE_SUCCESS;
        struct page *newpage;

        if (!thp_migration_supported() && PageTransHuge(page))
                return -ENOMEM;

        newpage = get_new_page(page, private);
        if (!newpage)
                return -ENOMEM;

        if (page_count(page) == 1) {
                /* page was freed from under us. So we are done. */
                ClearPageActive(page);
                ClearPageUnevictable(page);
                if (unlikely(__PageMovable(page))) {
                        lock_page(page);
                        if (!PageMovable(page))
                                __ClearPageIsolated(page);
                        unlock_page(page);
                }
                if (put_new_page)
                        put_new_page(newpage, private);
                else
                        put_page(newpage);
                goto out;
        }

        rc = __unmap_and_move(page, newpage, force, mode);
        if (rc == MIGRATEPAGE_SUCCESS)
                set_page_owner_migrate_reason(newpage, reason);

Unmaps a page isolated by the migrate scanner and migrates it to a free page isolated by the free scanner.

In code lines 10~11, if THP (Transparent Huge Page) migration is not supported, -ENOMEM is returned for a THP.
In code lines 13~15, a free page is obtained through the @get_new_page function.
- During compaction, @get_new_page is compaction_alloc(), which takes the first free page from the list managed by the free scanner.
In code lines 17~32, if the page has already been freed (page_count is 1), the active and unevictable flags of the page are cleared.
In code lines 21~26, in the unlikely case that the page is marked non-lru movable but the driver no longer provides an (*isolate_page) hook for it, the isolated flag is cleared.
In code lines 27~31, if @put_new_page is given, the free page is returned through that function; otherwise the page is returned to the buddy system.
- During compaction, @put_new_page is compaction_free(), which puts the page back on the list managed by the free scanner.
In code lines 34~36, the page is unmapped and then migrated to newpage. If the migration succeeds, the reason is recorded in the page owner information for debugging.

mm/migrate.c -2/2-

out:
        if (rc != -EAGAIN) {
                /*
                 * A page that has been migrated has all references
                 * removed and will be freed. A page that has not been
                 * migrated will have kepts its references and be
                 * restored.
                 */
                list_del(&page->lru);

                /*
                 * Compaction can migrate also non-LRU pages which are
                 * not accounted to NR_ISOLATED_*. They can be recognized
                 * as __PageMovable
                 */
                if (likely(!__PageMovable(page)))
                        mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
                                        page_is_file_cache(page), -hpage_nr_pages(page));
        }

        /*
         * If migration is successful, releases reference grabbed during
         * isolation. Otherwise, restore the page to right list unless
         * we want to retry.
         */
        if (rc == MIGRATEPAGE_SUCCESS) {
                put_page(page);
                if (reason == MR_MEMORY_FAILURE) {
                        /*
                         * Set PG_HWPoison on just freed page
                         * intentionally. Although it's rather weird,
                         * it's how HWPoison flag works at the moment.
                         */
                        if (set_hwpoison_free_buddy_page(page))
                                num_poisoned_pages_inc();
                }
        } else {
                if (rc != -EAGAIN) {
                        if (likely(!__PageMovable(page))) {
                                putback_lru_page(page);
                                goto put_new;
                        }

                        lock_page(page);
                        if (PageMovable(page))
                                putback_movable_page(page);
                        else
                                __ClearPageIsolated(page);
                        unlock_page(page);
                        put_page(page);
                }
put_new:
                if (put_new_page)
                        put_new_page(newpage, private);
                else
                        put_page(newpage);
        }

        return rc;
}

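In code lines 1~19, this is the out: label. If the migration result is not -EAGAIN, the page is removed from the migration list. In the likely case that it is not a non-lru movable page, the nr_isolated_anon or nr_isolated_file counter is decreased.
In code lines 26~36, if the migration succeeded, the reference taken at isolation is dropped so that the old page goes back to the buddy system.
In code lines 37~42, if the migration failed with a result other than -EAGAIN and the page is not a non-lru movable page, it is put back on the LRU list.
In code lines 44~50, the remaining failure case handles a non-lru movable page. If the driver still provides its hooks, the page is put back through putback_movable_page(), which uses the driver's (*putback_page) hook; otherwise only the isolated flag is cleared. The reference is then dropped.
In code lines 52~56, this is the put_new: label. If @put_new_page is given, the free page is returned through that function; otherwise it is returned to the buddy system.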
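
The handling of the old page after a migration attempt can be summarized as a decision function. This is an illustrative sketch: `enum old_page_action` and `old_page_disposition()` are hypothetical names, and 0 stands in for MIGRATEPAGE_SUCCESS.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical labels for what happens to the source page. */
enum old_page_action {
        OLD_KEEP_ON_LIST,       /* -EAGAIN: stays on the list for a retry */
        OLD_DROP_REF,           /* success: drop the isolation reference */
        OLD_PUTBACK_LRU,        /* failed LRU page: back to its LRU list */
        OLD_PUTBACK_MOVABLE,    /* failed non-LRU page: driver putback */
        OLD_CLEAR_ISOLATED,     /* failed non-LRU page the driver released */
};

static enum old_page_action old_page_disposition(int rc, bool is_lru,
                                                 bool still_movable)
{
        if (rc == 0)
                return OLD_DROP_REF;
        if (rc == -EAGAIN)
                return OLD_KEEP_ON_LIST;
        if (is_lru)
                return OLD_PUTBACK_LRU;

        return still_movable ? OLD_PUTBACK_MOVABLE : OLD_CLEAR_ISOLATED;
}
```

In every case except success, the new page is additionally returned through @put_new_page or put_page(), as the put_new: label shows.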

__unmap_and_move()

mm/migrate.c -1/3-

static int __unmap_and_move(struct page *page, struct page *newpage,
                                int force, enum migrate_mode mode)
{
        int rc = -EAGAIN;
        int page_was_mapped = 0;
        struct anon_vma *anon_vma = NULL;
        bool is_lru = !__PageMovable(page);

        if (!trylock_page(page)) {
                if (!force || mode == MIGRATE_ASYNC)
                        goto out;

                /*
                 * It's not safe for direct compaction to call lock_page.
                 * For example, during page readahead pages are added locked
                 * to the LRU. Later, when the IO completes the pages are
                 * marked uptodate and unlocked. However, the queueing
                 * could be merging multiple pages for one bio (e.g.
                 * mpage_readpages). If an allocation happens for the
                 * second or third page, the process can end up locking
                 * the same page twice and deadlocking. Rather than
                 * trying to be clever about what pages can be locked,
                 * avoid the use of lock_page for direct compaction
                 * altogether.
                 */
                if (current->flags & PF_MEMALLOC)
                        goto out;

                lock_page(page);
        }

        if (PageWriteback(page)) {
                /*
                 * Only in the case of a full synchronous migration is it
                 * necessary to wait for PageWriteback. In the async case,
                 * the retry loop is too short and in the sync-light case,
                 * the overhead of stalling is too much
                 */
                switch (mode) {
                case MIGRATE_SYNC:
                case MIGRATE_SYNC_NO_COPY:
                        break;
                default:
                        rc = -EBUSY;
                        goto out_unlock;
                }
                if (!force)
                        goto out_unlock;
                wait_on_page_writeback(page);
        }

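Unmaps page and migrates it to newpage.

In code line 7, whether the page is an lru movable page is stored in is_lru.
- true=lru movable page
- false=non-lru movable page
In code lines 9~30, the lock on page is acquired. If the trylock fails and @force is 0 or the mode is async migration, the function exits with -EAGAIN. It also exits with -EAGAIN when called with PF_MEMALLOC set (for example from direct compaction), because sleeping in lock_page() there could deadlock. Otherwise it waits for the lock.
In code lines 32~50, if page is under writeback to the filesystem, the async and sync_light modes return -EBUSY. In the sync and sync_no_copy modes, the function waits for writeback to complete if @force is set, and returns -EAGAIN otherwise.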
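
The two early-exit decisions can be written out as small functions. This is a sketch with hypothetical names (`enum migrate_mode_model`, `lock_decision()`, `writeback_decision()`); a return value of 0 means the caller may go on to sleep in lock_page() or wait_on_page_writeback().

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical mirror of the four migration modes mentioned above. */
enum migrate_mode_model { M_ASYNC, M_SYNC_LIGHT, M_SYNC, M_SYNC_NO_COPY };

/* Called when trylock_page() failed: may the caller block on the lock? */
static int lock_decision(enum migrate_mode_model mode, bool force,
                         bool pf_memalloc)
{
        if (!force || mode == M_ASYNC)
                return -EAGAIN;
        if (pf_memalloc)        /* e.g. direct compaction: avoid deadlock */
                return -EAGAIN;
        return 0;
}

/* Called when the page is under writeback: may the caller wait for it? */
static int writeback_decision(enum migrate_mode_model mode, bool force)
{
        switch (mode) {
        case M_SYNC:
        case M_SYNC_NO_COPY:
                break;
        default:
                return -EBUSY;
        }

        return force ? 0 : -EAGAIN;
}
```

Only the full sync modes with force set ever wait for writeback, which matches the behaviour of migrate_pages() setting force from its 4th pass on.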

mm/migrate.c -2/3-

        /*
         * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
         * we cannot notice that anon_vma is freed while we migrates a page.
         * This get_anon_vma() delays freeing anon_vma pointer until the end
         * of migration. File cache pages are no problem because of page_lock()
         * File Caches may use write_page() or lock_page() in migration, then,
         * just care Anon page here.
         *
         * Only page_get_anon_vma() understands the subtleties of
         * getting a hold on an anon_vma from outside one of its mms.
         * But if we cannot get anon_vma, then we won't need it anyway,
         * because that implies that the anon page is no longer mapped
         * (and cannot be remapped so long as we hold the page lock).
         */
        if (PageAnon(page) && !PageKsm(page))
                anon_vma = page_get_anon_vma(page);

        /*
         * Block others from accessing the new page when we get around to
         * establishing additional references. We are usually the only one
         * holding a reference to newpage at this point. We used to have a BUG
         * here if trylock_page(newpage) fails, but would like to allow for
         * cases where there might be a race with the previous use of newpage.
         * This is much like races on refcount of oldpage: just don't BUG().
         */
        if (unlikely(!trylock_page(newpage)))
                goto out_unlock;

        if (unlikely(!is_lru)) {
                rc = move_to_new_page(newpage, page, mode);
                goto out_unlock_both;
        }

        /*
         * Corner case handling:
         * 1. When a new swap-cache page is read into, it is added to the LRU
         * and treated as swapcache but it has no rmap yet.
         * Calling try_to_unmap() against a page->mapping==NULL page will
         * trigger a BUG.  So handle it here.
         * 2. An orphaned page (see truncate_complete_page) might have
         * fs-private metadata. The page can be picked up due to memory
         * offlining.  Everywhere else except page reclaim, the page is
         * invisible to the vm, so the page can not be migrated.  So try to
         * free the metadata, so the page can be freed.
         */
        if (!page->mapping) {
                VM_BUG_ON_PAGE(PageAnon(page), page);
                if (page_has_private(page)) {
                        try_to_free_buffers(page);
                        goto out_unlock_both;
                }
        } else if (page_mapped(page)) {
                /* Establish migration ptes */
                VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma,
                                page);
                try_to_unmap(page,
                        TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
                page_was_mapped = 1;
        }

        if (!page_mapped(page))
                rc = move_to_new_page(newpage, page, mode);

        if (page_was_mapped)
                remove_migration_ptes(page,
                        rc == MIGRATEPAGE_SUCCESS ? newpage : page, false);

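In code lines 15~16, for an anon page other than a KSM (Kernel Samepage Merging) page, the anon_vma is obtained.
In code lines 29~32, in the unlikely case of a non-lru movable page, the page is moved to the new page. No unmapping is needed, so control jumps directly to the out_unlock_both: label.
In code lines 46~59, the uncommon corner cases are handled. If the page has no mapping but carries private data (for example filesystem buffers), the buffers are freed and control jumps to out_unlock_both:. Otherwise, if the page is mapped, it is unmapped from every page table that maps it and migration ptes are installed.
In code lines 61~62, if the page is not (or no longer) mapped in any page table, it is moved to the new page.
In code lines 64~66, if the page had been mapped, the migration ptes are removed: the mappings point to the new page on success, or are restored to the original page on failure.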

mm/migrate.c -3/3-

out_unlock_both:
        unlock_page(newpage);
out_unlock:
        /* Drop an anon_vma reference if we took one */
        if (anon_vma)
                put_anon_vma(anon_vma);
        unlock_page(page);
out:
        /*
         * If migration is successful, decrease refcount of the newpage
         * which will not free the page because new page owner increased
         * refcounter. As well, if it is LRU page, add the page to LRU
         * list in here. Use the old state of the isolated source page to
         * determine if we migrated a LRU page. newpage was already unlocked
         * and possibly modified by its owner - don't rely on the page
         * state.
         */
        if (rc == MIGRATEPAGE_SUCCESS) {
                if (unlikely(!is_lru))
                        put_page(newpage);
                else
                        putback_lru_page(newpage);
        }

        return rc;
}

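In code lines 1~2, at the out_unlock_both: label, the lock on the new page is released first.
In code lines 3~7, at the out_unlock: label, the lock on the original page is released. If a reference on an anon_vma was taken, it is no longer needed, so its reference count is dropped.
In code lines 8~23, if the migration succeeded, the reference on newpage is released for a non-lru movable page. For an lru movable page, newpage is put back on the LRU list.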

Checking for non-lru movable pages

PageMovable()

mm/compaction.c

int PageMovable(struct page *page)
{
        struct address_space *mapping;

        VM_BUG_ON_PAGE(!PageLocked(page), page);
        if (!__PageMovable(page))
                return 0;

        mapping = page_mapping(page);
        if (mapping && mapping->a_ops && mapping->a_ops->isolate_page)
                return 1;

        return 0;
}
EXPORT_SYMBOL(PageMovable);

Returns whether the page is a non-lru movable page.

In code lines 6~7, 0 is returned if the page is not a non-lru movable page.
In code lines 9~13, 1 is returned if the driver behind the page's mapping provides the (*isolate_page) hook; otherwise 0 is returned.

__PageMovable()

static __always_inline int __PageMovable(struct page *page)
{
        return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) ==
                                PAGE_MAPPING_MOVABLE;
}

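Returns whether the page is flagged as non-lru movable, that is, whether the flag bits of page->mapping equal PAGE_MAPPING_MOVABLE.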
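
The difference between the flag test and the full check can be modeled in user space. This is a sketch under assumptions: the `*_model` types, `struct fake_page` and `dummy_isolate()` are hypothetical, and the flag values are assumed to be the low two bits of the mapping pointer (0x1 for anon, 0x2 for movable, both set for KSM).

```c
#include <stdint.h>
#include <stddef.h>

/* Assumed flag encoding in the low bits of page->mapping. */
#define PAGE_MAPPING_ANON       0x1UL
#define PAGE_MAPPING_MOVABLE    0x2UL
#define PAGE_MAPPING_FLAGS      (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)

/* Hypothetical stand-ins for address_space_operations and address_space. */
struct ops_model { int (*isolate_page)(void *page); };
struct mapping_model { const struct ops_model *a_ops; };
struct fake_page { uintptr_t mapping; };

/* Hypothetical driver hook, present only so a_ops has something to point at. */
static int dummy_isolate(void *page)
{
        (void)page;
        return 1;
}

/* Flag test only, like __PageMovable(). */
static int model_flag_movable(const struct fake_page *p)
{
        return (p->mapping & PAGE_MAPPING_FLAGS) == PAGE_MAPPING_MOVABLE;
}

/* Flag test plus "driver still provides isolate_page", like PageMovable(). */
static int model_page_movable(const struct fake_page *p)
{
        const struct mapping_model *m;

        if (!model_flag_movable(p))
                return 0;

        /* Strip the flag bits to recover the pointer. */
        m = (const struct mapping_model *)
                (p->mapping & ~(uintptr_t)PAGE_MAPPING_FLAGS);

        return m && m->a_ops && m->a_ops->isolate_page;
}
```

A page can therefore pass the flag test while failing the full check, which is the case the migration code handles by clearing the isolated flag instead of calling into the driver.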

Waiting for writeback to complete

wait_on_page_writeback()

include/linux/pagemap.h

/* 
 * Wait for a page to complete writeback
 */
static inline void wait_on_page_writeback(struct page *page)
{
        if (PageWriteback(page))
                wait_on_page_bit(page, PG_writeback);
}

Waits until writeback of the page completes.

Moving to the new page

move_to_new_page()

mm/migrate.c

/*
 * Move a page to a newly allocated page
 * The page is locked and all ptes have been successfully removed.
 *
 * The new page will have replaced the old page if this function
 * is successful.
 *
 * Return value:
 *   < 0 - error code
 *  MIGRATEPAGE_SUCCESS - success
 */

static int move_to_new_page(struct page *newpage, struct page *page,
                                enum migrate_mode mode)
{
        struct address_space *mapping;
        int rc = -EAGAIN;
        bool is_lru = !__PageMovable(page);

        VM_BUG_ON_PAGE(!PageLocked(page), page);
        VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);

        mapping = page_mapping(page);

        if (likely(is_lru)) {
                if (!mapping)
                        rc = migrate_page(mapping, newpage, page, mode);
                else if (mapping->a_ops->migratepage)
                        /*
                         * Most pages have a mapping and most filesystems
                         * provide a migratepage callback. Anonymous pages
                         * are part of swap space which also has its own
                         * migratepage callback. This is the most common path
                         * for page migration.
                         */
                        rc = mapping->a_ops->migratepage(mapping, newpage,
                                                        page, mode);
                else
                        rc = fallback_migrate_page(mapping, newpage,
                                                        page, mode);
        } else {
                /*
                 * In case of non-lru page, it could be released after
                 * isolation step. In that case, we shouldn't try migration.
                 */
                VM_BUG_ON_PAGE(!PageIsolated(page), page);
                if (!PageMovable(page)) {
                        rc = MIGRATEPAGE_SUCCESS;
                        __ClearPageIsolated(page);
                        goto out;
                }

                rc = mapping->a_ops->migratepage(mapping, newpage,
                                                page, mode);
                WARN_ON_ONCE(rc == MIGRATEPAGE_SUCCESS &&
                        !PageIsolated(page));
        }

        /*
         * When successful, old pagecache page->mapping must be cleared before
         * page is freed; but stats require that PageAnon be left as PageAnon.
         */
        if (rc == MIGRATEPAGE_SUCCESS) {
                if (__PageMovable(page)) {
                        VM_BUG_ON_PAGE(!PageIsolated(page), page);

                        /*
                         * We clear PG_movable under page_lock so any compactor
                         * cannot try to migrate this page.
                         */
                        __ClearPageIsolated(page);
                }

                /*
                 * Anonymous and movable page->mapping will be cleard by
                 * free_pages_prepare so don't reset it here for keeping
                 * the type to work PageAnon, for example.
                 */
                if (!PageMappingFlags(page))
                        page->mapping = NULL;
        }
out:
        return rc;
}

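Migrates the page to the newly allocated page.

In code line 6, determines whether the page is one that is managed on the lru lists.
In code lines 11~28, an lru movable page is migrated. There are three cases:
- A) migration of an anon page that has no mapping
- B) migration of a swap cache or file cache page through its (*migratepage) hook
- C) fallback migration of a swap cache or file cache page whose mapping has no (*migratepage) hook
In code lines 29~45, a non-lru movable page is migrated. There is one case:
- D) non-lru page migration through the driver's (*migratepage) hook. However, if the page is no longer movable (for example, it was released after the isolation step), the isolated flag is cleared and success is returned without migrating.
In code lines 51~69, when the migration succeeded, the old page is about to become a free page. If it was a non-lru movable page, its isolated flag is cleared. If page->mapping carries no mapping flag bits (a page cache page), page->mapping is reset to NULL; for anon and movable pages it is left in place and cleared later by free_pages_prepare().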
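The four cases above boil down to a small decision tree. The following sketch models only that dispatch, not the migration itself; enum migrate_path, struct page_model and choose_migrate_path() are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the cases A) to D) described above. */
enum migrate_path {
        PATH_A_MIGRATE_PAGE,   /* lru page without a mapping: migrate_page() */
        PATH_B_AOPS_HOOK,      /* lru page whose mapping has (*migratepage) */
        PATH_C_FALLBACK,       /* lru page whose mapping lacks the hook */
        PATH_D_DRIVER_HOOK,    /* non-lru movable page: driver's (*migratepage) */
        PATH_D_RELEASED,       /* non-lru page that is no longer movable */
};

/* Hypothetical summary of the properties that move_to_new_page() tests. */
struct page_model {
        bool is_lru;           /* !__PageMovable(page) */
        bool has_mapping;      /* page_mapping(page) != NULL */
        bool has_migratepage;  /* mapping->a_ops->migratepage != NULL */
        bool still_movable;    /* PageMovable(page) */
};

static enum migrate_path choose_migrate_path(const struct page_model *p)
{
        if (p->is_lru) {
                if (!p->has_mapping)
                        return PATH_A_MIGRATE_PAGE;
                if (p->has_migratepage)
                        return PATH_B_AOPS_HOOK;
                return PATH_C_FALLBACK;
        }
        /* Non-lru page: it may have been released after isolation. */
        if (!p->still_movable)
                return PATH_D_RELEASED;
        return PATH_D_DRIVER_HOOK;
}
```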

A) lru movable – migration of an anon page

migrate_page()

mm/migrate.c

/*
 * Common logic to directly migrate a single LRU page suitable for
 * pages that do not use PagePrivate/PagePrivate2.
 *
 * Pages are locked upon entry and exit.
 */

int migrate_page(struct address_space *mapping,
                struct page *newpage, struct page *page,
                enum migrate_mode mode)
{
        int rc;

        BUG_ON(PageWriteback(page));    /* Writeback must be complete */

        rc = migrate_page_move_mapping(mapping, newpage, page, mode, 0);

        if (rc != MIGRATEPAGE_SUCCESS)
                return rc;

        if (mode != MIGRATE_SYNC_NO_COPY)
                migrate_page_copy(newpage, page);
        else
                migrate_page_states(newpage, page);
        return MIGRATEPAGE_SUCCESS;
}
EXPORT_SYMBOL(migrate_page);

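Moves the mapping of an lru movable page to the newly allocated page and then copies the page.

In code lines 9~12, the mapping is moved to the new page; if that fails, the error is returned.
In code lines 14~17, in MIGRATE_SYNC_NO_COPY mode only the page descriptor state is transferred; in every other mode the CPU also copies the page frame.

B) migration of swap cache and file cache pages through (*migratepage)

The following routines implement it.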

mm/shmem.c – migrate_page()
mm/swap_state.c – migrate_page()
fs/block_dev.c – buffer_migrate_page_norefs()
fs/ubifs/file.c – ubifs_migrate_page()
fs/ext2/inode.c – buffer_migrate_page()
fs/btrfs/disk-io.c – btree_migratepage()
fs/f2fs/checkpoint.c – f2fs_migrate_page()
fs/xfs/xfs_aops.c – iomap_migrate_page()
fs/hugetlbfs/inode.c – hugetlbfs_migrate_page()
fs/nfs/file.c – nfs_migrate_page()
…

C) fallback migrate

fallback_migrate_page()

mm/migrate.c

/*
 * Default handling if a filesystem does not provide a migration function.
 */

static int fallback_migrate_page(struct address_space *mapping,
        struct page *newpage, struct page *page, enum migrate_mode mode)
{
        if (PageDirty(page)) {
                /* Only writeback pages in full synchronous migration */
                switch (mode) {
                case MIGRATE_SYNC:
                case MIGRATE_SYNC_NO_COPY:
                        break;
                default:
                        return -EBUSY;
                }
                return writeout(mapping, page);
        }

        /*
         * Buffers may be managed in a filesystem specific way.
         * We must have no buffers or drop them.
         */
        if (page_has_private(page) &&
            !try_to_release_page(page, GFP_KERNEL))
                return -EAGAIN;

        return migrate_page(mapping, newpage, page, mode);
}

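Default handler that migrates the page when the filesystem does not provide a migration function.

In code lines 4~14, for a dirty page in MIGRATE_SYNC or MIGRATE_SYNC_NO_COPY mode, the page is written out to clear its dirty state and the result of writeout() is returned. In any other mode, -EBUSY is returned.
In code lines 20~22, if the page has private data (metadata such as buffers that the filesystem attached to the page), an attempt is made to release it; if that fails, -EAGAIN is returned.
In code line 24, the page is migrated to the new page.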
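The order of the checks matters: the dirty test comes first, so a dirty page never reaches migrate_page() through this path. A minimal sketch of the decision follows; the enum migrate_mode names match the kernel's, while enum fallback_action and fallback_decide() are hypothetical names used only here.

```c
#include <assert.h>
#include <stdbool.h>

/* Same names as the kernel's enum migrate_mode, redefined so the sketch stands alone. */
enum migrate_mode {
        MIGRATE_ASYNC,
        MIGRATE_SYNC_LIGHT,
        MIGRATE_SYNC,
        MIGRATE_SYNC_NO_COPY,
};

/* Hypothetical labels for the possible outcomes of fallback_migrate_page(). */
enum fallback_action {
        FALLBACK_WRITEOUT,   /* return writeout(mapping, page) */
        FALLBACK_EBUSY,      /* dirty page, but not a full sync mode */
        FALLBACK_EAGAIN,     /* private data could not be released */
        FALLBACK_MIGRATE,    /* return migrate_page(...) */
};

static enum fallback_action fallback_decide(bool dirty, enum migrate_mode mode,
                                            bool has_private, bool release_ok)
{
        if (dirty) {
                /* Only the full synchronous modes may write the page out. */
                if (mode == MIGRATE_SYNC || mode == MIGRATE_SYNC_NO_COPY)
                        return FALLBACK_WRITEOUT;
                return FALLBACK_EBUSY;
        }
        if (has_private && !release_ok)
                return FALLBACK_EAGAIN;
        return FALLBACK_MIGRATE;
}
```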

D) non-lru page migration

The following routines implement it.

mm/zsmalloc.c – zs_page_migrate()
mm/balloon_compaction.c – balloon_page_migrate()
drivers/virtio/virtio_balloon.c – virtballoon_migratepage()

Moving the mapping

migrate_page_move_mapping()

This function is called from the following routines.

mm/migrate.c – migrate_page()
mm/migrate.c – __buffer_migrate_page()
fs/iomap.c – iomap_migrate_page()
fs/ubifs/file.c – ubifs_migrate_page()
fs/f2fs/data.c – f2fs_migrate_page()
fs/aio.c – aio_migratepage()

mm/migrate.c -1/2-

/*
 * Replace the page in the mapping.
 *
 * The number of remaining references must be:
 * 1 for anonymous pages without a mapping
 * 2 for pages with a mapping
 * 3 for pages with a mapping and PagePrivate/PagePrivate2 set.
 */

int migrate_page_move_mapping(struct address_space *mapping,
                struct page *newpage, struct page *page, enum migrate_mode mode,
                int extra_count)
{
        XA_STATE(xas, &mapping->i_pages, page_index(page));
        struct zone *oldzone, *newzone;
        int dirty;
        int expected_count = expected_page_refs(page) + extra_count;

        if (!mapping) {
                /* Anonymous page without mapping */
                if (page_count(page) != expected_count)
                        return -EAGAIN;

                /* No turning back from here */
                newpage->index = page->index;
                newpage->mapping = page->mapping;
                if (PageSwapBacked(page))
                        __SetPageSwapBacked(newpage);

                return MIGRATEPAGE_SUCCESS;
        }

        oldzone = page_zone(page);
        newzone = page_zone(newpage);

        xas_lock_irq(&xas);
        if (page_count(page) != expected_count || xas_load(&xas) != page) {
                xas_unlock_irq(&xas);
                return -EAGAIN;
        }

        if (!page_ref_freeze(page, expected_count)) {
                xas_unlock_irq(&xas);
                return -EAGAIN;
        }

        /*
         * Now we know that no one else is looking at the page:
         * no turning back from here.
         */
        newpage->index = page->index;
        newpage->mapping = page->mapping;
        page_ref_add(newpage, hpage_nr_pages(page)); /* add cache reference */
        if (PageSwapBacked(page)) {
                __SetPageSwapBacked(newpage);
                if (PageSwapCache(page)) {
                        SetPageSwapCache(newpage);
                        set_page_private(newpage, page_private(page));
                }
        } else {
                VM_BUG_ON_PAGE(PageSwapCache(page), page);
        }

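Moves the mapping of the page to newpage.

In code lines 10~22, for an anon page without a mapping: if the reference count differs from expected_count, -EAGAIN is returned. Otherwise the index and mapping information are carried over to the new page, the SwapBacked flag is set on the new page if the old page has it, and success is returned.
In code line 27, the xarray lock is taken with interrupts disabled.
- See: The XArray data structure
In code lines 28~31, if the reference count is not expected_count, or the xarray slot no longer holds this page, -EAGAIN is returned.
In code lines 33~36, the reference count is frozen, that is, atomically reset to 0 only if it still equals expected_count. If it does not, -EAGAIN is returned.
In code lines 42~44, for a mapped page, the index and mapping are carried over to the new page, and the new page gains the page cache references.
In code lines 45~53, the SwapBacked flag is carried over. For a swap cache page, the SwapCache flag and the value stored in the private field are carried over as well.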
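The freeze step is essentially a compare-and-swap from expected_count to 0. The following user-space sketch models it with C11 atomics; struct ref_model and the model_* functions are hypothetical, and the expected-count helper deliberately ignores the device private/public page terms that the kernel's expected_page_refs() also adds.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the reference count of struct page. */
struct ref_model {
        atomic_int refcount;
};

/* Simplified expected_page_refs(): 1 for the isolating caller, plus one
 * cache reference per subpage and one for private data if mapped. */
static int model_expected_refs(bool has_mapping, int nr_pages, bool has_private)
{
        int refs = 1;

        if (has_mapping)
                refs += nr_pages + (has_private ? 1 : 0);
        return refs;
}

/* Like page_ref_freeze(): succeed only if the count equals 'expected',
 * and in that case atomically replace it with 0. */
static bool model_ref_freeze(struct ref_model *p, int expected)
{
        int old = expected;

        return atomic_compare_exchange_strong(&p->refcount, &old, 0);
}

/* Like page_ref_unfreeze(): publish a new non-zero count. */
static void model_ref_unfreeze(struct ref_model *p, int count)
{
        atomic_store(&p->refcount, count);
}
```

While the count is 0 nobody else can take a new reference on the old page, which is why the comment in the kernel code says "no turning back from here". Later in the function the old page is unfrozen with expected_count minus the number of cache references, since those references now belong to the new page.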

mm/migrate.c -2/2-

        /* Move dirty while page refs frozen and newpage not yet exposed */
        dirty = PageDirty(page);
        if (dirty) {
                ClearPageDirty(page);
                SetPageDirty(newpage);
        }

        xas_store(&xas, newpage);
        if (PageTransHuge(page)) {
                int i;

                for (i = 1; i < HPAGE_PMD_NR; i++) {
                        xas_next(&xas);
                        xas_store(&xas, newpage + i);
                }
        }

        /*
         * Drop cache reference from old page by unfreezing
         * to one less reference.
         * We know this isn't the last reference.
         */
        page_ref_unfreeze(page, expected_count - hpage_nr_pages(page));

        xas_unlock(&xas);
        /* Leave irq disabled to prevent preemption while updating stats */

        /*
         * If moved to a different zone then also account
         * the page for that zone. Other VM counters will be
         * taken care of when we establish references to the
         * new page and drop references to the old page.
         *
         * Note that anonymous pages are accounted for
         * via NR_FILE_PAGES and NR_ANON_MAPPED if they
         * are mapped to swap space.
         */
        if (newzone != oldzone) {
                __dec_node_state(oldzone->zone_pgdat, NR_FILE_PAGES);
                __inc_node_state(newzone->zone_pgdat, NR_FILE_PAGES);
                if (PageSwapBacked(page) && !PageSwapCache(page)) {
                        __dec_node_state(oldzone->zone_pgdat, NR_SHMEM);
                        __inc_node_state(newzone->zone_pgdat, NR_SHMEM);
                }
                if (dirty && mapping_cap_account_dirty(mapping)) {
                        __dec_node_state(oldzone->zone_pgdat, NR_FILE_DIRTY);
                        __dec_zone_state(oldzone, NR_ZONE_WRITE_PENDING);
                        __inc_node_state(newzone->zone_pgdat, NR_FILE_DIRTY);
                        __inc_zone_state(newzone, NR_ZONE_WRITE_PENDING);
                }
        }
        local_irq_enable();

        return MIGRATEPAGE_SUCCESS;
}
EXPORT_SYMBOL(migrate_page_move_mapping);

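In code lines 2~6, if the Dirty flag is set, it is carried over to the new page and cleared on the old page.
In code line 8, the new page is stored in the xarray. The page cache used to be a radix tree and was converted to the xarray data structure.
In code lines 9~16, for a thp, each subpage is stored in the xarray as well.
In code line 23, the old page is unfrozen with a reference count of expected_count minus the number of pages, which drops the page cache references it held.
In code line 25, the xarray lock is released. Interrupts stay disabled while the statistics are updated.
In code lines 38~51, if the zone changed, the related counters are decremented for the old zone and incremented for the new zone.

(Figure: how the relevant attributes are copied from the old page to the new page, depending on the page type.)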

Page migration through the rmap walk

remove_migration_ptes()

mm/migrate.c

/*
 * Get rid of all migration entries and replace them by
 * references to the indicated page.
 */

void remove_migration_ptes(struct page *old, struct page *new, bool locked)
{
        struct rmap_walk_control rwc = {
                .rmap_one = remove_migration_pte,
                .arg = old,
        };

        if (locked)
                rmap_walk_locked(new, &rwc);
        else
                rmap_walk(new, &rwc);
}

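Moves every mapping that was attached to the @old page over to the @new page.

In code lines 3~6, an rmap walk is prepared so that every vma that referenced the old page can be found.
- See: Rmap -2- (TTU & Rmap Walk) | 문c
In code lines 8~11, the rmap walk replaces the migration entries left in those vmas with mappings to the new page.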

remove_migration_pte()

mm/migrate.c

/*
 * Restore a potential migration pte to a working pte entry
 */

static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
                                 unsigned long addr, void *old)
{
        struct page_vma_mapped_walk pvmw = {
                .page = old,
                .vma = vma,
                .address = addr,
                .flags = PVMW_SYNC | PVMW_MIGRATION,
        };
        struct page *new;
        pte_t pte;
        swp_entry_t entry;

        VM_BUG_ON_PAGE(PageTail(page), page);
        while (page_vma_mapped_walk(&pvmw)) {
                if (PageKsm(page))
                        new = page;
                else
                        new = page - pvmw.page->index +
                                linear_page_index(vma, pvmw.address);

#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
                /* PMD-mapped THP migration entry */
                if (!pvmw.pte) {
                        VM_BUG_ON_PAGE(PageHuge(page) || !PageTransCompound(page), page);
                        remove_migration_pmd(&pvmw, new);
                        continue;
                }
#endif

                get_page(new);
                pte = pte_mkold(mk_pte(new, READ_ONCE(vma->vm_page_prot)));
                if (pte_swp_soft_dirty(*pvmw.pte))
                        pte = pte_mksoft_dirty(pte);

                /*
                 * Recheck VMA as permissions can change since migration started
                 */
                entry = pte_to_swp_entry(*pvmw.pte);
                if (is_write_migration_entry(entry))
                        pte = maybe_mkwrite(pte, vma);

                if (unlikely(is_zone_device_page(new))) {
                        if (is_device_private_page(new)) {
                                entry = make_device_private_entry(new, pte_write(pte));
                                pte = swp_entry_to_pte(entry);
                        } else if (is_device_public_page(new)) {
                                pte = pte_mkdevmap(pte);
                                flush_dcache_page(new);
                        }
                } else
                        flush_dcache_page(new);

#ifdef CONFIG_HUGETLB_PAGE
                if (PageHuge(new)) {
                        pte = pte_mkhuge(pte);
                        pte = arch_make_huge_pte(pte, vma, new, 0);
                        set_huge_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
                        if (PageAnon(new))
                                hugepage_add_anon_rmap(new, vma, pvmw.address);
                        else
                                page_dup_rmap(new, true);
                } else
#endif
                {
                        set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);

                        if (PageAnon(new))
                                page_add_anon_rmap(new, vma, pvmw.address, false);
                        else
                                page_add_file_rmap(new, false);
                }
                if (vma->vm_flags & VM_LOCKED && !PageTransCompound(new))
                        mlock_vma_page(new);

                if (PageTransHuge(page) && PageMlocked(page))
                        clear_page_mlock(page);

                /* No need to invalidate - it was non-present before */
                update_mmu_cache(vma, pvmw.address, pvmw.pte);
        }

        return true;
}

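Restores the migration entries that were left for the @old page in one vma to working pte entries that point to the new page.

In code lines 4~15, the loop visits every migration entry for the old page that is found in the vma.
- See: Rmap -3- (PVMW) | 문c
In code lines 16~20, the new page to map is determined. For a KSM page it is the page itself; otherwise it is the subpage that corresponds to the mapped address.
In code lines 24~28, for a thp migration entry mapped at pmd level, the pmd entry is restored and the loop continues.
In code lines 31~34, a reference is taken on the new page because it is about to be used, and a pte entry for it is prepared. The soft dirty state of the old pte entry is carried over as well.
In code lines 39~41, if the migration entry was a write migration entry and the vma is writable, the write attribute is added to the pte.
- Permissions can change after the migration has started, so the write attribute of the vma is checked again before it is granted.
In code lines 43~52, the data cache for the new page is flushed.
- For a device private page of a zone device, a device private swap entry is built and used as the pte instead.
- For a device public page of a zone device, the pte is marked as devmap and the data cache for the new page is flushed.
In code lines 55~62, if the new page is a hugetlbfs page, the huge pte is installed and the page is added to the rmap.
In code lines 63~72, if it is not a huge page, the regular pte is installed and the page is added to the rmap.
In code lines 73~74, the new page is mlocked only if the vma has the VM_LOCKED flag and the page is neither a thp nor a hugetlbfs page.
In code lines 76~77, if the new thp has the mlocked flag set, it is cleared.
- Note that a thp mapped through pte entries is not kept mlocked.
In code line 80, the architecture hook that runs after a pte update is called; architectures that need a cache flush perform it here.
- On ARMv6 and later it does nothing; before ARMv6 the flush routines for cache coherency run.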
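The subpage selection in code lines 16~20 is plain index arithmetic. The sketch below reproduces it with integers, assuming 4K pages and a non-hugetlb vma (the kernel's linear_page_index() has a separate hugetlb branch that is not modelled); struct vma_model and the model_* names are hypothetical.

```c
#include <assert.h>

#define MODEL_PAGE_SHIFT 12   /* assumption: 4K pages */

/* Hypothetical stand-in for the two vma fields that the computation uses. */
struct vma_model {
        unsigned long vm_start;   /* first virtual address of the vma */
        unsigned long vm_pgoff;   /* file offset of vm_start, in pages */
};

/* Non-hugetlb part of linear_page_index(): page offset of 'address'
 * within the file or anon object that backs the vma. */
static unsigned long model_linear_page_index(const struct vma_model *vma,
                                             unsigned long address)
{
        return ((address - vma->vm_start) >> MODEL_PAGE_SHIFT) + vma->vm_pgoff;
}

/* Offset of the subpage to map, relative to the head page:
 * corresponds to "new = page - pvmw.page->index + linear_page_index(...)". */
static unsigned long model_subpage_offset(unsigned long head_index,
                                          const struct vma_model *vma,
                                          unsigned long address)
{
        return model_linear_page_index(vma, address) - head_index;
}
```

For a single base page the offset is always 0; a non-zero offset only appears when a compound page is mapped by several ptes.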

Copying the page frame

migrate_page_copy()

mm/migrate.c

void migrate_page_copy(struct page *newpage, struct page *page)
{
        if (PageHuge(page) || PageTransHuge(page))
                copy_huge_page(newpage, page);
        else
                copy_highpage(newpage, page);

        migrate_page_states(newpage, page);
}
EXPORT_SYMBOL(migrate_page_copy);

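Copies the old page frame to the new page frame, and then transfers the page descriptor state as well.

In code lines 3~4, a huge page or a thp is copied with the huge page copy routine.
In code lines 5~6, any other (regular) page is copied with the single page copy routine.
In code line 8, the page descriptor state is transferred.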

copy_huge_page()

mm/migrate.c

static void copy_huge_page(struct page *dst, struct page *src)
{
        int i;
        int nr_pages;

        if (PageHuge(src)) {
                /* hugetlbfs page */
                struct hstate *h = page_hstate(src);
                nr_pages = pages_per_huge_page(h);

                if (unlikely(nr_pages > MAX_ORDER_NR_PAGES)) {
                        __copy_gigantic_page(dst, src, nr_pages);
                        return;
                }
        } else {
                /* thp page */
                BUG_ON(!PageTransHuge(src));
                nr_pages = hpage_nr_pages(src);
        }

        for (i = 0; i < nr_pages; i++) {
                cond_resched();
                copy_highpage(dst + i, src + i);
        }
}

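Copies the @src huge page frame to @dst.

In code lines 6~14, for a hugetlbfs page that is larger than the maximum order page count, simply incrementing the struct page pointer is not guaranteed to work. A separate function therefore does the copy and handles the page descriptor pointers correctly.
In code lines 15~19, for a thp, the number of pages is obtained.
In code lines 21~24, the loop runs once per page and copies the page frame of @src+i to @dst+i.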

__copy_gigantic_page()

mm/migrate.c

/*
 * Gigantic pages are so large that we do not guarantee that page++ pointer
 * arithmetic will work across the entire page.  We need something more
 * specialized.
 */

static void __copy_gigantic_page(struct page *dst, struct page *src,
                                int nr_pages)
{
        int i;
        struct page *dst_base = dst;
        struct page *src_base = src;

        for (i = 0; i < nr_pages; ) {
                cond_resched();
                copy_highpage(dst, src);

                i++;
                dst = mem_map_next(dst, dst_base, i);
                src = mem_map_next(src, src_base, i);
        }
}

Copies a @src page frame that is larger than the maximum order page count to @dst.

mem_map_next()

mm/internal.h

/*
 * Iterator over all subpages within the maximally aligned gigantic
 * page 'base'.  Handle any discontiguity in the mem_map.
 */

static inline struct page *mem_map_next(struct page *iter,
                                                struct page *base, int offset)
{
        if (unlikely((offset & (MAX_ORDER_NR_PAGES - 1)) == 0)) {
                unsigned long pfn = page_to_pfn(base) + offset;
                if (!pfn_valid(pfn))
                        return NULL;
                return pfn_to_page(pfn);
        }
        return iter + 1;
}

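Returns the page that follows @iter. For a page larger than the maximum order page count, the mem_map, which is managed per section, may not be contiguous across its boundaries, so a plain pointer increment could yield a wrong address. The page descriptor address is therefore recomputed at every such boundary.

In code lines 4~9, in the unlikely case that offset is aligned to the maximum order page count (default: 1024 pages = 4M with 4K pages), incrementing the page descriptor pointer could cross a mem_map boundary. The pfn is computed first and the page descriptor is then looked up again with pfn_to_page().
In code line 10, the next page descriptor is returned.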
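To get a feel for how rarely the slow path runs, the boundary test can be evaluated on its own. This sketch assumes the default value of 1024 pages quoted above; the model_* names are hypothetical, and the loop mirrors the way __copy_gigantic_page() calls mem_map_next() with offsets 1 through nr_pages.

```c
#include <assert.h>

#define MODEL_MAX_ORDER_NR_PAGES 1024UL   /* assumption: default value */

/* Same test as in mem_map_next(): is 'offset' on a max-order boundary? */
static int model_crosses_boundary(unsigned long offset)
{
        return (offset & (MODEL_MAX_ORDER_NR_PAGES - 1)) == 0;
}

/* Counts how many calls take the pfn_to_page() path while walking a
 * gigantic page of nr_pages base pages (offsets 1..nr_pages). */
static unsigned long model_count_recomputes(unsigned long nr_pages)
{
        unsigned long i, n = 0;

        for (i = 1; i <= nr_pages; i++)
                if (model_crosses_boundary(i))
                        n++;
        return n;
}
```

For a 1G gigantic page made of 262144 base pages, only 256 of the calls recompute the descriptor from the pfn; all the others are a pointer increment.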

copy_highpage()

include/linux/highmem.h

static inline void copy_highpage(struct page *to, struct page *from)
{
        char *vfrom, *vto;

        vfrom = kmap_atomic(from);
        vto = kmap_atomic(to);
        copy_page(vto, vfrom);
        kunmap_atomic(vto);
        kunmap_atomic(vfrom);
}

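Copies the @from page frame to @to.

In code lines 5~6, on a 32-bit system a highmem page is mapped temporarily through the fixmap.
In code line 7, the page frame is copied with the fastest method the architecture supports.
In code lines 8~9, the temporary fixmap mappings are removed.

Copying page descriptor information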

migrate_page_states()

mm/migrate.c

/*
 * Copy the page to its new location
 */

void migrate_page_states(struct page *newpage, struct page *page)
{
        int cpupid;

        if (PageError(page))
                SetPageError(newpage);
        if (PageReferenced(page))
                SetPageReferenced(newpage);
        if (PageUptodate(page))
                SetPageUptodate(newpage);
        if (TestClearPageActive(page)) {
                VM_BUG_ON_PAGE(PageUnevictable(page), page);
                SetPageActive(newpage);
        } else if (TestClearPageUnevictable(page))
                SetPageUnevictable(newpage);
        if (PageWorkingset(page))
                SetPageWorkingset(newpage);
        if (PageChecked(page))
                SetPageChecked(newpage);
        if (PageMappedToDisk(page))
                SetPageMappedToDisk(newpage);

        /* Move dirty on pages not done by migrate_page_move_mapping() */
        if (PageDirty(page))
                SetPageDirty(newpage);

        if (page_is_young(page))
                set_page_young(newpage);
        if (page_is_idle(page))
                set_page_idle(newpage);

        /*
         * Copy NUMA information to the new page, to prevent over-eager
         * future migrations of this same page.
         */
        cpupid = page_cpupid_xchg_last(page, -1);
        page_cpupid_xchg_last(newpage, cpupid);

        ksm_migrate_page(newpage, page);
        /*
         * Please do not reorder this without considering how mm/ksm.c's
         * get_ksm_page() depends upon ksm_migrate_page() and PageSwapCache().
         */
        if (PageSwapCache(page))
                ClearPageSwapCache(page);
        ClearPagePrivate(page);
        set_page_private(page, 0);

        /*
         * If any waiters have accumulated on the new page then
         * wake them up.
         */
        if (PageWriteback(newpage))
                end_page_writeback(newpage);

        copy_page_owner(page, newpage);

        mem_cgroup_migrate(page, newpage);
}
EXPORT_SYMBOL(migrate_page_states);

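Transfers the contents of the old page descriptor to the descriptor of the new page.

In code lines 5~6, the PG_error flag is carried over.
In code lines 7~8, the PG_referenced flag is carried over.
In code lines 9~10, the PG_uptodate flag is carried over.
In code lines 11~15, the PG_active flag or the PG_unevictable flag is carried over and cleared on the old page.
In code lines 16~17, the PG_workingset flag is carried over.
In code lines 18~19, the PG_checked flag is carried over.
In code lines 20~21, the PG_mappedtodisk flag is carried over.
In code lines 24~25, the PG_dirty flag is carried over.
In code lines 27~28, the Young flag is carried over.
- On 32-bit systems it lives in struct page_ext rather than in struct page.
In code lines 29~30, the Idle flag is carried over.
- On 32-bit systems it lives in struct page_ext rather than in struct page.
In code lines 36~37, the cpupid information is carried over and the old page is set to -1.
In code line 39, if the page is a ksm page attached to a stable node, the stable node is pointed at the new page and the stable node mapping is removed from the old page.
- KSM (Kernel Samepage Merging) scans application address spaces for pages with identical contents and merges them into a single page.
- See: How to use the Kernel Samepage Merging feature | kernel.org
In code lines 44~45, the SwapCache flag of the old page is cleared.
In code lines 46~47, the Private flag of the old page is cleared and 0 is stored in page->private.
In code lines 53~54, if the Writeback flag is set on the new page, writeback is ended: the page is rotated to the tail of the lru where applicable and the waiting tasks are woken up so that it can be reclaimed promptly.
In code line 56, the page owner information is carried over.
In code line 58, the memcg information is carried over.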
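Two different transfer styles appear in the list above: most flags are copied and remain set on the old page, whereas PG_active and PG_unevictable are moved with a test-and-clear. The sketch below contrasts the two with plain, non-atomic bit operations; the MODEL_PG_* bit positions, struct flags_model and the model_* functions are hypothetical and unrelated to the kernel's actual flag layout.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit positions, chosen only for this sketch. */
enum {
        MODEL_PG_REFERENCED  = 1u << 0,
        MODEL_PG_ACTIVE      = 1u << 1,
        MODEL_PG_UNEVICTABLE = 1u << 2,
};

struct flags_model {
        unsigned int flags;
};

/* Non-atomic model of TestClearPageXxx(): report the old value, clear the bit. */
static bool model_test_clear(struct flags_model *p, unsigned int bit)
{
        bool was_set = (p->flags & bit) != 0;

        p->flags &= ~bit;
        return was_set;
}

static void model_migrate_states(struct flags_model *newpage,
                                 struct flags_model *page)
{
        /* Copied: the old page keeps the flag. */
        if (page->flags & MODEL_PG_REFERENCED)
                newpage->flags |= MODEL_PG_REFERENCED;

        /* Moved: at most one of the two lru state flags is transferred. */
        if (model_test_clear(page, MODEL_PG_ACTIVE))
                newpage->flags |= MODEL_PG_ACTIVE;
        else if (model_test_clear(page, MODEL_PG_UNEVICTABLE))
                newpage->flags |= MODEL_PG_UNEVICTABLE;
}
```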

References

Zoned Allocator -1- (물리 페이지 할당-Fastpath) | 문c
Zoned Allocator -2- (물리 페이지 할당-Slowpath) | 문c
Zoned Allocator -3- (Buddy 페이지 할당) | 문c
Zoned Allocator -4- (Buddy 페이지 해지) | 문c
Zoned Allocator -5- (Per-CPU Page Frame Cache) | 문c
Zoned Allocator -6- (Watermark) | 문c
Zoned Allocator -7- (Direct Compact) | 문c
Zoned Allocator -8- (Direct Compact-Isolation) | 문c
Zoned Allocator -9- (Direct Compact-Migration) | 문c – this post
Zoned Allocator -10- (LRU & pagevec) | 문c
Zoned Allocator -11- (Direct Reclaim) | 문c
Zoned Allocator -12- (Direct Reclaim-Shrink-1) | 문c
Zoned Allocator -13- (Direct Reclaim-Shrink-2) | 문c
Zoned Allocator -14- (Kswapd) | 문c

Zoned Allocator -8- (Direct Compact-Isolation)

2016-07-05 / 2022-04-07 문영일


Isolation is used for the following purposes.

Compaction
- To secure scarce high-order pages, migratable pages are migrated toward the upper part of the zone.
- Direct-compaction, manual-compaction, kcompactd
Off-line Memory
- All in-use movable pages located in the memory region that is going offline are migrated to another region.
CMA
- To secure contiguous physical space for the requested range inside a CMA area, the movable pages that temporarily reside in the CMA area are migrated out.

(Figure: the main function call flow for isolation, split into two sides. The left side isolates the pages to be moved; the right side isolates or secures free pages.)

Two scanners for compaction

(Figure: the directions in which the migration scanner and the free scanner move while each isolates pages for its own purpose during compaction.)

The migration scanner scans pageblocks upward to collect migratable pages.
- isolate_migratepages_block() isolates the migratable pages of one selected pageblock and adds them to the cc->migratepages list.
The free scanner scans pageblocks downward to collect free pages.
- isolate_freepages_block() isolates the free pages of the requested type from one selected pageblock and adds them to the cc->freepages list.
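The two scanners work toward each other until they land in the same pageblock. The sketch below models that convergence under a strong simplification: each round the migration scanner advances exactly one pageblock and the free scanner retreats exactly one, which the real scanners do not guarantee. The comparison follows the pageblock-granular test used by the kernel's compact_scanners_met(); struct scanner_model and the model_* functions are hypothetical.

```c
#include <assert.h>

#define MODEL_PAGEBLOCK_ORDER    9   /* assumption: 512 pages per pageblock */
#define MODEL_PAGEBLOCK_NR_PAGES (1UL << MODEL_PAGEBLOCK_ORDER)

/* Hypothetical stand-in for the two scanner positions in compact_control. */
struct scanner_model {
        unsigned long migrate_pfn;   /* moves upward */
        unsigned long free_pfn;      /* moves downward */
};

/* The scanners have met once the free scanner's pageblock is not above
 * the migration scanner's pageblock. */
static int model_scanners_met(const struct scanner_model *s)
{
        return (s->free_pfn >> MODEL_PAGEBLOCK_ORDER) <=
               (s->migrate_pfn >> MODEL_PAGEBLOCK_ORDER);
}

/* Simplified compaction loop: one pageblock per scanner per round. */
static unsigned long model_rounds_until_met(struct scanner_model *s)
{
        unsigned long rounds = 0;

        while (!model_scanners_met(s)) {
                s->migrate_pfn += MODEL_PAGEBLOCK_NR_PAGES;
                s->free_pfn -= MODEL_PAGEBLOCK_NR_PAGES;
                rounds++;
        }
        return rounds;
}
```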

Migrate scanner

Isolate migratepages

isolate_migratepages()

mm/compaction.c

/*
 * Isolate all pages that can be migrated from the first suitable block,
 * starting at the block pointed to by the migrate scanner pfn within
 * compact_control.
 */

static isolate_migrate_t isolate_migratepages(struct zone *zone,
                                        struct compact_control *cc)
{
        unsigned long block_start_pfn;
        unsigned long block_end_pfn;
        unsigned long low_pfn;
        struct page *page;
        const isolate_mode_t isolate_mode =
                (sysctl_compact_unevictable_allowed ? ISOLATE_UNEVICTABLE : 0) |
                (cc->mode != MIGRATE_SYNC ? ISOLATE_ASYNC_MIGRATE : 0);

        /*
         * Start at where we last stopped, or beginning of the zone as
         * initialized by compact_zone()
         */
        low_pfn = cc->migrate_pfn;
        block_start_pfn = pageblock_start_pfn(low_pfn);
        if (block_start_pfn < zone->zone_start_pfn)
                block_start_pfn = zone->zone_start_pfn;

        /* Only scan within a pageblock boundary */
        block_end_pfn = pageblock_end_pfn(low_pfn);

        /*
         * Iterate over whole pageblocks until we find the first suitable.
         * Do not cross the free scanner.
         */
        for (; block_end_pfn <= cc->free_pfn;
                        low_pfn = block_end_pfn,
                        block_start_pfn = block_end_pfn,
                        block_end_pfn += pageblock_nr_pages) {

                /*
                 * This can potentially iterate a massively long zone with
                 * many pageblocks unsuitable, so periodically check if we
                 * need to schedule, or even abort async compaction.
                 */
                if (!(low_pfn % (SWAP_CLUSTER_MAX * pageblock_nr_pages))
                                                && compact_should_abort(cc))
                        break;

                page = pageblock_pfn_to_page(block_start_pfn, block_end_pfn,
                                                                        zone);
                if (!page)
                        continue;

                /* If isolation recently failed, do not retry */
                if (!isolation_suitable(cc, page))
                        continue;

                /*
                 * For async compaction, also only scan in MOVABLE blocks.
                 * Async compaction is optimistic to see if the minimum amount
                 * of work satisfies the allocation.
                 */
                if (!suitable_migration_source(cc, page))
                        continue;

                /* Perform the isolation */
                low_pfn = isolate_migratepages_block(cc, low_pfn,
                                                block_end_pfn, isolate_mode);

                if (!low_pfn || cc->contended)
                        return ISOLATE_ABORT;

                /*
                 * Either we isolated something and proceed with migration. Or
                 * we failed and compact_zone should decide if we should
                 * continue or not.
                 */
                break;
        }

        /* Record where migration scanner will be restarted. */
        cc->migrate_pfn = low_pfn;

        return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
}

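Isolates migratable pages from the requested zone.

In code lines 8~10, the isolate mode is determined.
- ISOLATE_UNMAPPED(0x2)
  - Only unmapped pages are isolated; pages that are mapped into a page table are skipped. This flag is not set by this function.
- ISOLATE_ASYNC_MIGRATE(0x4)
  - Isolation and migration are done in a lightweight way. Only pageblocks in which the migrate type accounts for 50% or more of the block are processed, and as soon as one page of the requested order can be isolated the scan stops. It also stops on a preemption request.
- ISOLATE_UNEVICTABLE(0x8)
  - Isolation is performed even for unevictable pages.
In code lines 16~22, the start and end pfn of the block to scan are derived from the pfn at which the migrate scanner, which scans upward, will continue.
- Example: migrate_pfn=0x250, pageblock_order=9
  - block_start_pfn=0x200, block_end_pfn=0x400
In code lines 28~31, the migrate scanner iterates block by block up to the position of the free scanner, which scans downward.
In code lines 38~40, because the range can be large, it is checked periodically whether there is a reason to abort.
- Check interval
  - every SWAP_CLUSTER_MAX(32) pageblocks
- Reason to abort
  - async migration is in progress and a higher-priority task has requested preemption
In code lines 42~45, the first page of the pageblock is fetched, provided that the block lies within the requested zone. If the block extends outside the zone, null is returned and the block is skipped.
In code lines 48~49, if isolation should not be attempted on this block of the zone, the block is skipped.
- When direct-compact runs in async mode and the skip bit, which is set when isolation recently failed in this pageblock, is set, false is returned so that the block is skipped.
In code lines 56~57, in async direct-compact mode the block is skipped if its migrate type is not the same as the requested migrate type. In a synchronous mode (migrate_sync*), or for manual and kcompactd requests, true is always returned so that isolation is attempted on the block unconditionally.
In code lines 60~61, only the migratable pages among the pages of the block are isolated.
In code lines 63~64, if isolation failed fatally or must be aborted midway, the function exits with ISOLATE_ABORT.
In code line 71, the loop is left after one suitable block has been processed, whether or not any page was isolated; compact_zone() decides how to continue.
In code lines 75~77, the current position of the migrate scanner is recorded and the isolation result is returned.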
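The block boundary example and the mode selection above can be checked with a few lines of arithmetic. In this sketch the rounding follows the kernel's block_start_pfn() (round down) and block_end_pfn() (align pfn + 1 upward) macros, and the two flag values are the ones quoted in the list above; the model_* function names and the boolean parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values as quoted in the walkthrough above. */
#define ISOLATE_ASYNC_MIGRATE 0x4u
#define ISOLATE_UNEVICTABLE   0x8u

/* round_down(pfn, 1 << order) */
static unsigned long model_block_start_pfn(unsigned long pfn, unsigned int order)
{
        return pfn & ~((1UL << order) - 1);
}

/* ALIGN(pfn + 1, 1 << order): first pfn of the following block */
static unsigned long model_block_end_pfn(unsigned long pfn, unsigned int order)
{
        unsigned long size = 1UL << order;

        return (pfn + 1 + size - 1) & ~(size - 1);
}

/* Mode selection of code lines 8~10; mode_is_migrate_sync stands for
 * "cc->mode == MIGRATE_SYNC". */
static unsigned int model_isolate_mode(bool unevictable_allowed,
                                       bool mode_is_migrate_sync)
{
        return (unevictable_allowed ? ISOLATE_UNEVICTABLE : 0) |
               (!mode_is_migrate_sync ? ISOLATE_ASYNC_MIGRATE : 0);
}
```

Because the end is computed from pfn + 1, a pfn that already sits on a block boundary still yields the end of its own block rather than the boundary itself.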

(Figure: how the migrate scanner, through isolate_migratepages(), isolates movable lru pages from the pageblocks that meet the conditions.)

Note on the figure: "isolation of suitable movable lru pages" (X) -> "migratable pages"

Isolate migratepages block

The following function is called for two purposes.

Compaction
- It is called from isolate_migratepages(), which may add the following two isolate mode flags.
  - When the compaction mode is not sync, ISOLATE_ASYNC_MIGRATE is added.
  - When sysctl_compact_unevictable_allowed (default=1) is set, ISOLATE_UNEVICTABLE is added.
CMA
- It is called from isolate_migratepages_range() with ISOLATE_UNEVICTABLE mode.

isolate_migratepages_block()

mm/compaction.c -1/4-

/**
 * isolate_migratepages_block() - isolate all migrate-able pages within
 *                                a single pageblock
 * @cc:         Compaction control structure.
 * @low_pfn:    The first PFN to isolate
 * @end_pfn:    The one-past-the-last PFN to isolate, within same pageblock
 * @isolate_mode: Isolation mode to be used.
 *
 * Isolate all pages that can be migrated from the range specified by
 * [low_pfn, end_pfn). The range is expected to be within same pageblock.
 * Returns zero if there is a fatal signal pending, otherwise PFN of the
 * first page that was not scanned (which may be both less, equal to or more
 * than end_pfn).
 *
 * The pages are isolated on cc->migratepages list (not required to be empty),
 * and cc->nr_migratepages is updated accordingly. The cc->migrate_pfn field
 * is neither read nor updated.
 */

static unsigned long isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
                        unsigned long end_pfn, isolate_mode_t isolate_mode)
{
        struct zone *zone = cc->zone;
        unsigned long nr_scanned = 0, nr_isolated = 0;
        struct lruvec *lruvec;
        unsigned long flags = 0;
        bool locked = false;
        struct page *page = NULL, *valid_page = NULL;
        unsigned long start_pfn = low_pfn;
        bool skip_on_failure = false;
        unsigned long next_skip_pfn = 0;

        /*
         * Ensure that there are not too many pages isolated from the LRU
         * list by either parallel reclaimers or compaction. If there are,
         * delay for some time until fewer pages are isolated
         */
        while (unlikely(too_many_isolated(zone))) {
                /* async migration should just abort */
                if (cc->mode == MIGRATE_ASYNC)
                        return 0;

                congestion_wait(BLK_RW_ASYNC, HZ/10);

                if (fatal_signal_pending(current))
                        return 0;
        }

        if (compact_should_abort(cc))
                return 0;

        if (cc->direct_compaction && (cc->mode == MIGRATE_ASYNC)) {
                skip_on_failure = true;
                next_skip_pfn = block_end_pfn(low_pfn, cc->order);
        }

        /* Time to isolate some pages for migration */
        for (; low_pfn < end_pfn; low_pfn++) {

                if (skip_on_failure && low_pfn >= next_skip_pfn) {
                        /*
                         * We have isolated all migration candidates in the
                         * previous order-aligned block, and did not skip it due
                         * to failure. We should migrate the pages now and
                         * hopefully succeed compaction.
                         */
                        if (nr_isolated)
                                break;

                        /*
                         * We failed to isolate in the previous order-aligned
                         * block. Set the new boundary to the end of the
                         * current block. Note we can't simply increase
                         * next_skip_pfn by 1 << order, as low_pfn might have
                         * been incremented by a higher number due to skipping
                         * a compound or a high-order buddy page in the
                         * previous loop iteration.
                         */
                        next_skip_pfn = block_end_pfn(low_pfn, cc->order);
                }

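Isolates the migratable pages within a single pageblock range, in the requested isolate mode.

Code line 5 initializes the scanned-page and isolated-page counters to 0.
Code lines 19~28: in the unlikely case that too many pages are already isolated, the loop delays processing 100 ms at a time. It gives up and returns 0 in the following cases.
- Async migration does not wait and aborts immediately.
- A fatal signal (SIGKILL) is pending for the current task.
- Reference: Avoid the use of congestion_wait under zone pressure
Code lines 30~31: during async processing, the function aborts if a reschedule has been requested, for example by a higher-priority task.
Code lines 33~36: if this is direct compaction running with async migration, skip_on_failure is set to true.
- skip_on_failure
  - To satisfy the requested order quickly with a light-weight pass, pages are scanned in order-aligned blocks rather than in whole pageblocks, and the scan ends as soon as one order-aligned block has been scanned with at least one page isolated.
Code lines 39~61: walks the pages in the range one by one. If skip_on_failure is set and an order-aligned block has been completed, the end pfn of the next order-aligned block is recorded. If any page was isolated in the completed block, the loop exits.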
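
The order-aligned boundary that next_skip_pfn tracks can be illustrated with a small user-space sketch. ALIGN_UP() and order_block_end_pfn() are hypothetical names chosen here; the arithmetic is assumed to correspond to what block_end_pfn(low_pfn, cc->order) computes.

```c
/* Round x up to the next multiple of a, where a is a power of two. */
#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((a) - 1))

/*
 * Exclusive end pfn of the order-aligned block containing pfn.
 * Hypothetical helper for illustration only.
 */
static unsigned long order_block_end_pfn(unsigned long pfn, unsigned int order)
{
    return ALIGN_UP(pfn + 1, 1UL << order);
}
```

For an order-3 request, pfns 0 through 7 share the boundary 8, and pfn 8 starts a new block whose boundary is 16. Because the boundary is recomputed from the current pfn, it stays correct even when the scanner has jumped ahead over a compound or high-order buddy page.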

mm/compaction.c -2/4-

                /*
                 * Periodically drop the lock (if held) regardless of its
                 * contention, to give chance to IRQs. Abort async compaction
                 * if contended.
                 */
                if (!(low_pfn % SWAP_CLUSTER_MAX)
                    && compact_unlock_should_abort(zone_lru_lock(zone), flags,
                                                                &locked, cc))
                        break;

                if (!pfn_valid_within(low_pfn))
                        goto isolate_fail;
                nr_scanned++;

                page = pfn_to_page(low_pfn);

                if (!valid_page)
                        valid_page = page;

                /*
                 * Skip if free. We read page order here without zone lock
                 * which is generally unsafe, but the race window is small and
                 * the worst thing that can happen is that we skip some
                 * potential isolation targets.
                 */
                if (PageBuddy(page)) {
                        unsigned long freepage_order = page_order_unsafe(page);

                        /*
                         * Without lock, we cannot be sure that what we got is
                         * a valid page order. Consider only values in the
                         * valid order range to prevent low_pfn overflow.
                         */
                        if (freepage_order > 0 && freepage_order < MAX_ORDER)
                                low_pfn += (1UL << freepage_order) - 1;
                        continue;
                }

                /*
                 * Regardless of being on LRU, compound pages such as THP and
                 * hugetlbfs are not to be compacted. We can potentially save
                 * a lot of iterations if we skip them at once. The check is
                 * racy, but we can consider only valid values and the only
                 * danger is skipping too much.
                 */
                if (PageCompound(page)) {
                        const unsigned int order = compound_order(page);

                        if (likely(order < MAX_ORDER))
                                low_pfn += (1UL << order) - 1;
                        goto isolate_fail;
                }

                /*
                 * Check may be lockless but that's ok as we recheck later.
                 * It's possible to migrate LRU and non-lru movable pages.
                 * Skip any other type of page
                 */
                if (!PageLRU(page)) {
                        /*
                         * __PageMovable can return false positive so we need
                         * to verify it under page_lock.
                         */
                        if (unlikely(__PageMovable(page)) &&
                                        !PageIsolated(page)) {
                                if (locked) {
                                        spin_unlock_irqrestore(zone_lru_lock(zone),
                                                                        flags);
                                        locked = false;
                                }

                                if (!isolate_movable_page(page, isolate_mode))
                                        goto isolate_success;
                        }

                        goto isolate_fail;
                }

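Code lines 6~9: because the range to process is large, the lock is dropped periodically in this routine to keep IRQ latency low. If an abort condition is present, processing stops and the loop exits.
- Check period
  - every SWAP_CLUSTER_MAX (32) pages
- Abort conditions
  - a fatal signal (SIGKILL) is pending
  - a reschedule is requested, for example by a higher-priority task, during async migration
Code lines 11~18: records the first valid page and increments the scan counter. If the pfn is not valid, jumps to the isolate_fail label.
Code lines 26~37: the migrate scanner migrates only in-use movable pages, so free pages managed by the buddy system are skipped.
Code lines 46~52: compound pages (slab, hugetlbfs, THP) gain nothing from compaction either, so jumps to the isolate_fail label.
Code lines 59~77: a page that is not a user-allocated LRU page jumps to the isolate_fail label. However, in the unlikely case of a non-LRU movable page that has not been isolated yet, the page is isolated, and on success control jumps to the isolate_success label.
- GPU, zram (z3fold, zsmalloc), and balloon drivers support migration of non-LRU movable pages.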
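
The way the scanner steps over a free buddy page can be sketched as below. The order is read without the zone lock, so only values in the valid range are trusted. TOY_MAX_ORDER is an assumed value (it is configuration dependent in the kernel), and next_pfn_after_buddy() is a hypothetical helper.

```c
#define TOY_MAX_ORDER 11   /* assumed value for this sketch */

/*
 * Return the pfn examined next after meeting a free buddy page at pfn
 * whose order (possibly stale) was read without holding the lock.
 */
static unsigned long next_pfn_after_buddy(unsigned long pfn,
                                          unsigned long freepage_order)
{
    /* Ignore implausible orders so that pfn cannot run away. */
    if (freepage_order > 0 && freepage_order < TOY_MAX_ORDER)
        pfn += (1UL << freepage_order) - 1;

    return pfn + 1;        /* corresponds to the loop's low_pfn++ */
}
```

A stale or corrupted order therefore costs at most a few missed candidates; in the worst case the scanner simply advances one page at a time.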

mm/compaction.c -3/4-

                /*
                 * Migration will fail if an anonymous page is pinned in memory,
                 * so avoid taking lru_lock and isolating it unnecessarily in an
                 * admittedly racy check.
                 */
                if (!page_mapping(page) &&
                    page_count(page) > page_mapcount(page))
                        goto isolate_fail;

                /*
                 * Only allow to migrate anonymous pages in GFP_NOFS context
                 * because those do not depend on fs locks.
                 */
                if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
                        goto isolate_fail;

                /* If we already hold the lock, we can skip some rechecking */
                if (!locked) {
                        locked = compact_trylock_irqsave(zone_lru_lock(zone),
                                                                &flags, cc);
                        if (!locked)
                                break;

                        /* Recheck PageLRU and PageCompound under lock */
                        if (!PageLRU(page))
                                goto isolate_fail;

                        /*
                         * Page become compound since the non-locked check,
                         * and it's on LRU. It can only be a THP so the order
                         * is safe to read and it's 0 for tail pages.
                         */
                        if (unlikely(PageCompound(page))) {
                                low_pfn += (1UL << compound_order(page)) - 1;
                                goto isolate_fail;
                        }
                }

                lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);

                /* Try isolate the page */
                if (__isolate_lru_page(page, isolate_mode) != 0)
                        goto isolate_fail;

                VM_BUG_ON_PAGE(PageCompound(page), page);

                /* Successfully isolated */
                del_page_from_lru_list(page, lruvec, page_lru(page));
                inc_node_page_state(page,
                                NR_ISOLATED_ANON + page_is_file_cache(page));

isolate_success:
                list_add(&page->lru, &cc->migratepages);
                cc->nr_migratepages++;
                nr_isolated++;

                /*
                 * Record where we could have freed pages by migration and not
                 * yet flushed them to buddy allocator.
                 * - this is the lowest page that was isolated and likely be
                 * then freed by migration.
                 */
                if (!cc->last_migrated_pfn)
                        cc->last_migrated_pfn = low_pfn;

                /* Avoid isolating too much */
                if (cc->nr_migratepages == COMPACT_CLUSTER_MAX) {
                        ++low_pfn;
                        break;
                }

                continue;

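Code lines 6~8: an anonymous page that appears to be pinned (its reference count exceeds its map count) jumps to isolate_fail, because its migration would fail.
Code lines 14~15: a file-mapped page jumps to the isolate_fail label when the caller used GFP_NOFS and therefore must not depend on filesystem locks.
Code lines 18~37: the lock may have been dropped above at the periodic check. In that case it is taken again (async compaction leaves the loop if the lock cannot be taken immediately) and the page conditions are rechecked. If the page is not an LRU page or is a compound page, jumps to the isolate_fail label.
Code line 39: selects the lruvec, either the node's lruvec or the memcg's per-node lruvec.
Code lines 42~50: removes the page from the LRU list and increments the NR_ISOLATED_ANON or NR_ISOLATED_FILE counter.
Code lines 52~55: the isolate_success: label, reached when isolation is considered successful. The isolated page is added to the migratepages list and the related counters are incremented.
Code lines 63~64: records the pfn in last_migrated_pfn if it has never been set.
Code lines 67~70: stops once the number of pages queued for migration reaches the batch size (32).
Code line 72: continues the loop with the next page.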
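
The two lockless pre-checks at the top of this part can be condensed into a toy predicate. struct toy_page and skip_before_lock() are hypothetical stand-ins for illustration; the real code reads these values through page_mapping(), page_count() and page_mapcount().

```c
#include <stdbool.h>

/* Simplified stand-in for the page fields the checks look at. */
struct toy_page {
    bool has_mapping;   /* page_mapping(page) != NULL */
    int  refcount;      /* page_count(page) */
    int  mapcount;      /* page_mapcount(page) */
};

/* True when the scanner should give up on the page before locking. */
static bool skip_before_lock(const struct toy_page *p, bool gfp_fs_allowed)
{
    /* Anonymous page with references beyond its mappings: likely pinned. */
    if (!p->has_mapping && p->refcount > p->mapcount)
        return true;

    /* File-backed page, but the caller must not rely on fs locks. */
    if (!gfp_fs_allowed && p->has_mapping)
        return true;

    return false;
}
```

Both checks are racy by design: they only avoid taking the LRU lock for pages that would most likely fail later anyway.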

mm/compaction.c -4/4-

isolate_fail:
                if (!skip_on_failure)
                        continue;

                /*
                 * We have isolated some pages, but then failed. Release them
                 * instead of migrating, as we cannot form the cc->order buddy
                 * page anyway.
                 */
                if (nr_isolated) {
                        if (locked) {
                                spin_unlock_irqrestore(zone_lru_lock(zone), flags);
                                locked = false;
                        }
                        putback_movable_pages(&cc->migratepages);
                        cc->nr_migratepages = 0;
                        cc->last_migrated_pfn = 0;
                        nr_isolated = 0;
                }

                if (low_pfn < next_skip_pfn) {
                        low_pfn = next_skip_pfn - 1;
                        /*
                         * The check near the loop beginning would have updated
                         * next_skip_pfn too, but this is a bit simpler.
                         */
                        next_skip_pfn += 1UL << cc->order;
                }
        }

        /*
         * The PageBuddy() check could have potentially brought us outside
         * the range to be scanned.
         */
        if (unlikely(low_pfn > end_pfn))
                low_pfn = end_pfn;

        if (locked)
                spin_unlock_irqrestore(zone_lru_lock(zone), flags);

        /*
         * Update the pageblock-skip information and cached scanner pfn,
         * if the whole pageblock was scanned without isolating any page.
         */
        if (low_pfn == end_pfn)
                update_pageblock_skip(cc, valid_page, nr_isolated, true);

        trace_mm_compaction_isolate_migratepages(start_pfn, low_pfn,
                                                nr_scanned, nr_isolated);

        cc->total_migrate_scanned += nr_scanned;
        if (nr_isolated)
                count_compact_events(COMPACTISOLATED, nr_isolated);

        return low_pfn;
}

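Code lines 1~3: the isolate_fail: label. When isolation fails and skip_on_failure is false, processing continues with the next page.
Code lines 10~19: puts back the pages isolated so far, cancelling their isolation.
- When skip_on_failure is true, a failure while isolating an order-aligned block abandons that block, and isolation is retried in the next order-aligned block.
Code lines 21~28: advances low_pfn to the next order-aligned block.
Code lines 35~36: clamps low_pfn so that it does not go past the end of the range.
Code lines 38~39: releases the lock if it is held.
Code lines 45~46: if the pageblock was scanned to its end and no page was isolated, sets the migrate skip bit of the pageblock containing valid_page so that later scans skip it, and updates the migrate scanner's cached start pfn to that page (the async position always, the sync position only when not in async mode).
Code lines 51~53: updates the scan counter and the isolation counter.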
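
The skip_on_failure flow across the four parts can be approximated with a toy scan over an array of booleans. This is a rough sketch under simplifying assumptions: there are no locks, no compound or buddy skipping, and no batch limit, and scan_order_blocks() is a hypothetical name.

```c
#include <stdbool.h>

/*
 * Toy scan of pfns [start, end). can_isolate[pfn] tells whether the page
 * could be isolated. Pages are collected per order-aligned block; a single
 * failure discards the block and jumps to the next one; the first block
 * that was scanned with pages collected ends the scan.
 * Returns the number of pages isolated; *stop_pfn is where the scan ended.
 */
static unsigned long scan_order_blocks(const bool *can_isolate,
                                       unsigned long start, unsigned long end,
                                       unsigned int order,
                                       unsigned long *stop_pfn)
{
    unsigned long step = 1UL << order;
    unsigned long next_skip = (start + step) & ~(step - 1);
    unsigned long nr_isolated = 0, pfn;

    for (pfn = start; pfn < end; pfn++) {
        if (pfn >= next_skip) {
            if (nr_isolated)
                break;                  /* previous block can be migrated */
            next_skip = (pfn + step) & ~(step - 1);
        }

        if (can_isolate[pfn]) {
            nr_isolated++;
            continue;
        }

        /* isolate_fail: release what was taken, skip the rest of the block */
        nr_isolated = 0;
        pfn = next_skip - 1;            /* the loop increment lands on next_skip */
        next_skip += step;
    }

    *stop_pfn = pfn > end ? end : pfn;
    return nr_isolated;
}
```

With order 2 and a failure at pfn 2, the first block (pfns 0 to 3) is abandoned, the second block (pfns 4 to 7) is collected in full, and the scan stops at pfn 8 with four pages isolated.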

The following figure shows isolate_migratepages_block() isolating migratable pages.

too_many_isolated()

mm/compaction.c

/* Similar to reclaim, but different enough that they don't share logic */
static bool too_many_isolated(struct zone *zone)
{
        unsigned long active, inactive, isolated;

        inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
                                        zone_page_state(zone, NR_INACTIVE_ANON);
        active = zone_page_state(zone, NR_ACTIVE_FILE) +
                                        zone_page_state(zone, NR_ACTIVE_ANON);
        isolated = zone_page_state(zone, NR_ISOLATED_FILE) +
                                        zone_page_state(zone, NR_ISOLATED_ANON);

        return isolated > (inactive + active) / 2;
}

Returns true when too many pages are isolated.

True when the number of isolated pages exceeds half of the LRU pages (active + inactive, anon and file combined).
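
A worked example of the threshold, written as a plain function over counters. toy_too_many_isolated() is a hypothetical name and takes the counts as arguments instead of reading vmstat counters.

```c
#include <stdbool.h>

/* Same comparison as the listing, with the counters passed in directly. */
static bool toy_too_many_isolated(unsigned long inactive,
                                  unsigned long active,
                                  unsigned long isolated)
{
    return isolated > (inactive + active) / 2;
}
```

With 600 inactive and 400 active pages the limit is 500: 500 isolated pages are still acceptable, while 501 make the caller wait or abort.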

compact_should_abort()

mm/compaction.c

/*
 * Aside from avoiding lock contention, compaction also periodically checks
 * need_resched() and either schedules in sync compaction or aborts async
 * compaction. This is similar to what compact_unlock_should_abort() does, but
 * is used where no lock is concerned.
 *
 * Returns false when no scheduling was needed, or sync compaction scheduled.
 * Returns true when async compaction should abort.
 */

static inline bool compact_should_abort(struct compact_control *cc)
{
        /* async compaction aborts if contended */
        if (need_resched()) {
                if (cc->mode == MIGRATE_ASYNC) {
                        cc->contended = true;
                        return true;
                }

                cond_resched();
        }

        return false;
}

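Returns true when async migration is in progress and a reschedule has been requested, for example by a higher-priority task. In sync mode it yields with cond_resched() and returns false.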

compact_unlock_should_abort()

mm/compaction.c

/*
 * Compaction requires the taking of some coarse locks that are potentially
 * very heavily contended. The lock should be periodically unlocked to avoid
 * having disabled IRQs for a long time, even when there is nobody waiting on
 * the lock. It might also be that allowing the IRQs will result in
 * need_resched() becoming true. If scheduling is needed, async compaction
 * aborts. Sync compaction schedules.
 * Either compaction type will also abort if a fatal signal is pending.
 * In either case if the lock was locked, it is dropped and not regained.
 *
 * Returns true if compaction should abort due to fatal signal pending, or
 *              async compaction due to need_resched()
 * Returns false when compaction can continue (sync compaction might have
 *              scheduled)
 */

static bool compact_unlock_should_abort(spinlock_t *lock,
                unsigned long flags, bool *locked, struct compact_control *cc)
{
        if (*locked) {
                spin_unlock_irqrestore(lock, flags);
                *locked = false;
        }

        if (fatal_signal_pending(current)) {
                cc->contended = true;
                return true;
        }

        if (need_resched()) {
                if (cc->mode == MIGRATE_ASYNC) {
                        cc->contended = true;
                        return true;
                }
                cond_resched();
        }

        return false;
}

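Releases the lock if it is held. If an abort condition is present, sets cc->contended to true and returns true.

Abort conditions
- a fatal signal (SIGKILL) is pending
- a reschedule is requested, for example by a higher-priority task, during async migration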
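
The abort decision can be sketched as a pure function. The fatal-signal and reschedule states are passed in as booleans here, which is an assumption of the sketch; the kernel reads them from the current task. All names are hypothetical.

```c
#include <stdbool.h>

enum toy_mode { TOY_ASYNC, TOY_SYNC };

struct toy_cc {
    enum toy_mode mode;
    bool contended;
};

/* Decide whether the scan should abort after the lock has been dropped. */
static bool toy_should_abort(struct toy_cc *cc, bool fatal_signal,
                             bool need_resched)
{
    if (fatal_signal) {                 /* applies to both modes */
        cc->contended = true;
        return true;
    }

    if (need_resched && cc->mode == TOY_ASYNC) {
        cc->contended = true;
        return true;
    }

    /* Sync compaction would yield the CPU here and then carry on. */
    return false;
}
```

Only async compaction treats a pending reschedule as a reason to stop; sync compaction reschedules and continues from where it was.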

compact_trylock_irqsave()

mm/compaction.c

/*
 * Compaction requires the taking of some coarse locks that are potentially
 * very heavily contended. For async compaction, back out if the lock cannot
 * be taken immediately. For sync compaction, spin on the lock if needed.
 *
 * Returns true if the lock is held
 * Returns false if the lock is not held and compaction should abort
 */

static bool compact_trylock_irqsave(spinlock_t *lock, unsigned long *flags,
                                                struct compact_control *cc)
{
        if (cc->mode == MIGRATE_ASYNC) {
                if (!spin_trylock_irqsave(lock, *flags)) {
                        cc->contended = true;
                        return false;
                }
        } else {
                spin_lock_irqsave(lock, *flags);
        }

        return true;
}

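During async migration, if the spinlock used by compaction is contended by another CPU and cannot be taken at the first attempt, cc->contended is set to true (the compaction lock is congested) and false is returned.

For sync migration the lock is always acquired, spinning if necessary.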
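
The difference between the two modes can be demonstrated with a C11 atomic_flag used as a very small spinlock. This sketch ignores IRQ state entirely, and every name in it is hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct toy_lock { atomic_flag held; };

struct toy_cc {
    bool async;
    bool contended;
};

/* One attempt: true if the lock was free and is now ours. */
static bool toy_trylock(struct toy_lock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

static void toy_unlock(struct toy_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Async: give up after one failed attempt. Sync: spin until acquired. */
static bool toy_compact_trylock(struct toy_lock *l, struct toy_cc *cc)
{
    if (cc->async) {
        if (!toy_trylock(l)) {
            cc->contended = true;
            return false;
        }
        return true;
    }

    while (!toy_trylock(l))
        ;   /* busy-wait, as a spinlock would */

    return true;
}
```

The async caller never waits, which keeps latency-sensitive allocations from stalling behind a busy lock at the cost of abandoning the current scan.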

mem_cgroup_page_lruvec()

mm/memcontrol.c

/**               
 * mem_cgroup_page_lruvec - return lruvec for isolating/putting an LRU page
 * @page: the page      
 * @zone: zone of the page
 *
 * This function is only safe when following the LRU page isolation
 * and putback protocol: the LRU lock must be held, and the page must
 * either be PageLRU() or the caller must have isolated/allocated it.
 */

struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgdat)
{
        struct mem_cgroup_per_node *mz;
        struct mem_cgroup *memcg;
        struct lruvec *lruvec;

        if (mem_cgroup_disabled()) {
                lruvec = &pgdat->lruvec;
                goto out;
        }

        memcg = page->mem_cgroup;
        /*
         * Swapcache readahead pages are added to the LRU - and
         * possibly migrated - before they are charged.
         */
        if (!memcg)
                memcg = root_mem_cgroup;

        mz = mem_cgroup_page_nodeinfo(memcg, page);
        lruvec = &mz->lruvec;
out:
        /*
         * Since a node can be onlined after the mem_cgroup was created,
         * we have to be prepared to initialize lruvec->zone here;
         * and if offlined then reonlined, we need to reinitialize it.
         */
        if (unlikely(lruvec->pgdat != pgdat))
                lruvec->pgdat = pgdat;
        return lruvec;
}

When memcg is enabled, returns the lruvec of the page's node within the memcg recorded in the page (the root memcg if none is recorded). When memcg is disabled, returns the lruvec of the node.

Isolating LRU pages

__isolate_lru_page()

mm/vmscan.c

/*
 * Attempt to remove the specified page from its LRU.  Only take this page
 * if it is of the appropriate PageActive status.  Pages which are being
 * freed elsewhere are also ignored.
 *
 * page:        page to consider
 * mode:        one of the LRU isolation modes defined above
 *
 * returns 0 on success, -ve errno on failure.
 */

int __isolate_lru_page(struct page *page, isolate_mode_t mode)
{
        int ret = -EINVAL;

        /* Only take pages on the LRU. */
        if (!PageLRU(page))
                return ret;

        /* Compaction should not handle unevictable pages but CMA can do so */
        if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE))
                return ret;

        ret = -EBUSY;

        /*
         * To minimise LRU disruption, the caller can indicate that it only
         * wants to isolate pages it will be able to operate on without
         * blocking - clean pages for the most part.
         *
         * ISOLATE_ASYNC_MIGRATE is used to indicate that it only wants to pages
         * that it is possible to migrate without blocking
         */
        if (mode & ISOLATE_ASYNC_MIGRATE) {
                /* All the caller can do on PageWriteback is block */
                if (PageWriteback(page))
                        return ret;

                if (PageDirty(page)) {
                        struct address_space *mapping;
                        bool migrate_dirty;

                        /*
                         * Only pages without mappings or that have a
                         * ->migratepage callback are possible to migrate
                         * without blocking. However, we can be racing with
                         * truncation so it's necessary to lock the page
                         * to stabilise the mapping as truncation holds
                         * the page lock until after the page is removed
                         * from the page cache.
                         */
                        if (!trylock_page(page))
                                return ret;

                        mapping = page_mapping(page);
                        migrate_dirty = !mapping || mapping->a_ops->migratepage;
                        unlock_page(page);
                        if (!migrate_dirty)
                                return ret;
                }
        }

        if ((mode & ISOLATE_UNMAPPED) && page_mapped(page))
                return ret;

        if (likely(get_page_unless_zero(page))) {
                /*
                 * Be careful not to clear PageLRU until after we're
                 * sure the page is not being freed elsewhere -- the
                 * page release code relies on it.
                 */
                ClearPageLRU(page);
                ret = 0;
        }

        return ret;
}

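Attempts to remove the page from its LRU. Returns 0 on success, and -EINVAL or -EBUSY on failure.

Code lines 6~7: stops with -EINVAL if the page is not an LRU page.
Code lines 10~11: stops with -EINVAL if the page is unevictable and the isolation mode does not include ISOLATE_UNEVICTABLE.
Code lines 23~26: in ISOLATE_ASYNC_MIGRATE mode, a page under writeback stops with -EBUSY.
Code lines 28~49: in the same mode, a dirty page stops with -EBUSY if its page lock cannot be taken immediately, or if it has a mapping that does not register a migratepage handler.
Code lines 52~53: in ISOLATE_UNMAPPED mode, mapped pages stop with -EBUSY.
Code lines 55~65: in the likely case that the page is still in use (its reference count is not zero), takes a reference, clears the LRU flag, and returns 0.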
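
The order of the checks and the error code each one produces can be summarized with a toy model. The struct fields and flag values are assumptions made for this sketch, and the page-lock step for dirty pages is left out.

```c
#include <errno.h>
#include <stdbool.h>

/* Flag values are assumed for this sketch. */
enum {
    TOY_ISOLATE_UNMAPPED      = 0x2,
    TOY_ISOLATE_ASYNC_MIGRATE = 0x4,
    TOY_ISOLATE_UNEVICTABLE   = 0x8,
};

/* Hypothetical flattened view of the page state the function inspects. */
struct toy_lru_page {
    bool lru, unevictable, writeback, dirty, mapped;
    bool has_mapping;       /* file-backed */
    bool has_migratepage;   /* mapping provides a migratepage handler */
    int  refcount;
};

static int toy_isolate_lru_page(struct toy_lru_page *p, unsigned int mode)
{
    if (!p->lru)
        return -EINVAL;
    if (p->unevictable && !(mode & TOY_ISOLATE_UNEVICTABLE))
        return -EINVAL;

    if (mode & TOY_ISOLATE_ASYNC_MIGRATE) {
        if (p->writeback)
            return -EBUSY;
        if (p->dirty && p->has_mapping && !p->has_migratepage)
            return -EBUSY;
    }

    if ((mode & TOY_ISOLATE_UNMAPPED) && p->mapped)
        return -EBUSY;

    if (p->refcount == 0)
        return -EBUSY;      /* the page is being freed elsewhere */

    p->refcount++;          /* take a reference first ... */
    p->lru = false;         /* ... and only then clear the LRU flag */
    return 0;
}
```

The -EINVAL cases describe pages that this caller can never take, while -EBUSY marks pages that might become isolatable later.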

Isolating movable pages

The function below is used for the following purposes.

Compaction
- Migration of non-LRU movable pages during direct compaction
Memory offlining
- Migration of non-LRU movable pages in order to take memory offline

isolate_movable_page()

mm/migrate.c

int isolate_movable_page(struct page *page, isolate_mode_t mode)
{
        struct address_space *mapping;

        /*
         * Avoid burning cycles with pages that are yet under __free_pages(),
         * or just got freed under us.
         *
         * In case we 'win' a race for a movable page being freed under us and
         * raise its refcount preventing __free_pages() from doing its job
         * the put_page() at the end of this block will take care of
         * release this page, thus avoiding a nasty leakage.
         */
        if (unlikely(!get_page_unless_zero(page)))
                goto out;

        /*
         * Check PageMovable before holding a PG_lock because page's owner
         * assumes anybody doesn't touch PG_lock of newly allocated page
         * so unconditionally grapping the lock ruins page's owner side.
         */
        if (unlikely(!__PageMovable(page)))
                goto out_putpage;
        /*
         * As movable pages are not isolated from LRU lists, concurrent
         * compaction threads can race against page migration functions
         * as well as race against the releasing a page.
         *
         * In order to avoid having an already isolated movable page
         * being (wrongly) re-isolated while it is under migration,
         * or to avoid attempting to isolate pages being released,
         * lets be sure we have the page lock
         * before proceeding with the movable page isolation steps.
         */
        if (unlikely(!trylock_page(page)))
                goto out_putpage;

        if (!PageMovable(page) || PageIsolated(page))
                goto out_no_isolated;

        mapping = page_mapping(page);
        VM_BUG_ON_PAGE(!mapping, page);

        if (!mapping->a_ops->isolate_page(page, mode))
                goto out_no_isolated;

        /* Driver shouldn't use PG_isolated bit of page->flags */
        WARN_ON_ONCE(PageIsolated(page));
        __SetPageIsolated(page);
        unlock_page(page);

        return 0;

out_no_isolated:
        unlock_page(page);
out_putpage:
        put_page(page);
out:
        return -EBUSY;
}

Isolates a non-LRU movable page.

Code lines 14~15: if the page has just been freed (its reference count is zero), exits with -EBUSY.
Code lines 22~23: if the page is not a non-LRU movable page, exits with -EBUSY.
Code lines 35~36: tries to take the page lock. If the lock cannot be taken immediately, exits with -EBUSY.
Code lines 38~42: with the page lock held, checks once more that the page is a non-LRU movable page and has not been isolated already. If either check fails, jumps to the out_no_isolated label and exits with -EBUSY.
- Calls the function registered in the (*isolate_page) hook of the driver that supports non-LRU movable pages (for example z3fold or zsmalloc). If it returns false, exits with -EBUSY.
Code lines 46~49: sets the isolated flag and exits with success (0).
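
The first step, taking a reference only if the page is not already being freed, follows the usual "increment unless zero" pattern. Below is a user-space sketch with C11 atomics; toy_get_unless_zero() is a hypothetical name and not the kernel's implementation of get_page_unless_zero().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Take a reference only if the count has not already dropped to zero. */
static bool toy_get_unless_zero(atomic_int *refcount)
{
    int old = atomic_load(refcount);

    while (old != 0) {
        /* On failure, old is refreshed with the current value. */
        if (atomic_compare_exchange_weak(refcount, &old, old + 1))
            return true;
    }

    return false;   /* the object is already on its way to being freed */
}
```

Because the increment only happens when the observed value is non-zero, a page whose last reference has been dropped can never be revived by the scanner.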

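Isolation suitability

Whether isolation is suitable

Whether to isolate a block

Direct compaction in any mode other than sync-full skips the isolation of a pageblock when it meets a skip bit on that block during scanning, so that compaction stays fast.

When isolation has recently failed in a pageblock, that block is marked with the skip bit.

The following figure shows how the skip bits in the pageblock usemap are used.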

Forcing isolation

Under the following conditions, ignore_skip_hint is set internally so that isolation is attempted on every block regardless of its skip bit.

- direct compaction running in the highest-priority compact mode, sync-full
- manual compaction
- kcompactd

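Pageblock mobility

As the following figure shows, the type of a pageblock is the representative mobility (migratetype) shared by at least half (50%) of the pages in the block.

Async-mode direct compaction, which has to run quickly, compacts only pageblocks whose type matches the request: movable (or CMA) pageblocks for a movable request.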

isolation_suitable()

mm/compaction.c

/* Returns true if the pageblock should be scanned for pages to isolate. */
static inline bool isolation_suitable(struct compact_control *cc,
                                        struct page *page)
{
        if (cc->ignore_skip_hint)
                return true;

        return !get_pageblock_skip(page);
}

Checks whether isolation may be attempted in the given block.

Code lines 5~6: if ignore_skip_hint is set, returns true so that isolation proceeds regardless of the pageblock's skip bit.
Code line 8: if isolation has recently failed in this pageblock, returns false so that the block is skipped.
- The skip bit of each pageblock is stored in the usemap.
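
A toy model of the per-pageblock skip bit lookup follows. The pageblock order, the number of blocks and the array representation are assumptions made for the sketch; the kernel keeps these bits packed in the usemap.

```c
#include <stdbool.h>

#define TOY_PAGEBLOCK_ORDER 9   /* assumed: 512 pages per pageblock */
#define TOY_NR_BLOCKS       8

struct toy_zone {
    bool skip[TOY_NR_BLOCKS];   /* one skip bit per pageblock */
};

/* Index of the pageblock that contains pfn (zone assumed to start at 0). */
static unsigned long toy_block_of(unsigned long pfn)
{
    return pfn >> TOY_PAGEBLOCK_ORDER;
}

static bool toy_isolation_suitable(const struct toy_zone *z,
                                   unsigned long pfn, bool ignore_skip_hint)
{
    if (ignore_skip_hint)
        return true;            /* forced: the hint is not consulted */

    return !z->skip[toy_block_of(pfn)];
}
```

The hint is only advisory: a caller that sets ignore_skip_hint scans every block, trading speed for thoroughness.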

The following figure shows how isolation_suitable() decides whether a block should be isolated.

Suitability as a migration source

suitable_migration_source()

mm/compaction.c

static bool suitable_migration_source(struct compact_control *cc,
                                                        struct page *page)
{
        int block_mt;

        if ((cc->mode != MIGRATE_ASYNC) || !cc->direct_compaction)
                return true;

        block_mt = get_pageblock_migratetype(page);

        if (cc->migratetype == MIGRATE_MOVABLE)
                return is_migrate_movable(block_mt);
        else
                return block_mt == cc->migratetype;
}

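Returns whether the pageblock that contains the page is suitable as a migration source. For async-mode direct compaction, which should stay light and fast, it returns true only when the block's migrate type matches the requested migrate type. In every other case it always returns true.

Code lines 6~7: in the following two cases, returns true unconditionally so that the block is isolated and its migratable pages are processed.
- the request is not MIGRATE_ASYNC
- the request is not direct compaction, that is, it comes from manual compaction or kcompactd
Code lines 9~12: for a movable allocation request, returns whether the block type is movable or CMA.
- Blocks of the unmovable, reclaimable, highatomic, and isolate types cannot be migration sources in this case.
Code lines 13~14: for a request that is not movable, returns whether the block type equals the requested type.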
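
The decision can be written as a small truth function. The enum below lists only a subset of migrate types, and all names are hypothetical.

```c
#include <stdbool.h>

enum toy_mt { TOY_UNMOVABLE, TOY_MOVABLE, TOY_RECLAIMABLE, TOY_CMA };

/* Movable and CMA blocks both count as movable in this sketch. */
static bool toy_is_movable(enum toy_mt mt)
{
    return mt == TOY_MOVABLE || mt == TOY_CMA;
}

static bool toy_suitable_source(bool async, bool direct,
                                enum toy_mt requested, enum toy_mt block)
{
    /* Only async direct compaction is restricted by block type. */
    if (!async || !direct)
        return true;

    if (requested == TOY_MOVABLE)
        return toy_is_movable(block);

    return block == requested;
}
```

For example, an async direct compaction on behalf of a movable allocation accepts a CMA block but rejects an unmovable one, while kcompactd accepts both.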

The following figure shows how suitable_migration_source() decides whether a block is suitable as a migration source.

Suitability as a migration target

suitable_migration_target()

mm/compaction.c

/* Returns true if the page is within a block suitable for migration to */
static bool suitable_migration_target(struct compact_control *cc,
                                                        struct page *page)
{
        /* If the page is a large free page, then disallow migration */
        if (PageBuddy(page)) {
                /*
                 * We are checking page_order without zone->lock taken. But
                 * the only small danger is that we skip a potentially suitable
                 * pageblock, so it's not worth to check order for valid range.
                 */
                if (page_order_unsafe(page) >= pageblock_order)
                        return false;
        }

        if (cc->ignore_block_suitable)
                return true;

        /* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
        if (is_migrate_movable(get_pageblock_migratetype(page)))
                return true;

        /* Otherwise skip the block */
        return false;
}

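Returns whether the pageblock that contains the page is suitable as a migration target. In sync-full direct compaction the block type is ignored, so every block qualifies except one that starts with a free page of pageblock order or larger.

Code lines 6~14: returns false for a buddy-system free page whose order is the pageblock order or larger.
Code lines 16~17: returns true unconditionally so that the block becomes a target regardless of its pageblock type.
- ignore_block_suitable is set only when direct compaction runs in sync-full mode.
Code lines 20~24: returns true only when the pageblock is of the movable (or CMA) type, and false otherwise.

The following figure shows how suitable_migration_target() decides whether a block is suitable as a migration target.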

Marking a pageblock to be skipped

update_pageblock_skip()

mm/compaction.c

/*
 * If no pages were isolated then mark this pageblock to be skipped in the
 * future. The information is later cleared by __reset_isolation_suitable().
 */

static void update_pageblock_skip(struct compact_control *cc,
                        struct page *page, unsigned long nr_isolated,
                        bool migrate_scanner)
{
        struct zone *zone = cc->zone;
        unsigned long pfn;

        if (cc->no_set_skip_hint)
                return;

        if (!page)
                return;

        if (nr_isolated)
                return;

        set_pageblock_skip(page);

        pfn = page_to_pfn(page);

        /* Update where async and sync compaction should restart */
        if (migrate_scanner) {
                if (pfn > zone->compact_cached_migrate_pfn[0])
                        zone->compact_cached_migrate_pfn[0] = pfn;
                if (cc->mode != MIGRATE_ASYNC &&
                    pfn > zone->compact_cached_migrate_pfn[1])
                        zone->compact_cached_migrate_pfn[1] = pfn;
        } else {
                if (pfn < zone->compact_cached_free_pfn)
                        zone->compact_cached_free_pfn = pfn;
        }
}

When no page was isolated, sets the skip bit of the pageblock and updates the cached start pfn of the migrate or free scanner to that page.

Code lines 8~15: in the following three cases the function returns without setting the pageblock's skip bit.
- no_set_skip_hint is set
  - It is set when isolation runs in order to secure a CMA area.
- there is no page
- at least one page was isolated
Code line 17: sets the skip bit of the pageblock.
Code lines 22~27: when called from the migrate scanner, updates the cached migrate scan start pfn.
Code lines 28~31: when called from the free scanner, updates the cached free scan start pfn.
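
The cached-position update moves each scanner's restart point in one direction only. A minimal sketch with a hypothetical struct toy_cache:

```c
#include <stdbool.h>

struct toy_cache {
    unsigned long migrate_pfn[2];   /* [0] async, [1] sync */
    unsigned long free_pfn;
};

/* Advance the restart position of the scanner that reported the failure. */
static void toy_update_cached(struct toy_cache *c, unsigned long pfn,
                              bool migrate_scanner, bool async)
{
    if (migrate_scanner) {
        /* The migrate scanner walks upward: positions only increase. */
        if (pfn > c->migrate_pfn[0])
            c->migrate_pfn[0] = pfn;
        if (!async && pfn > c->migrate_pfn[1])
            c->migrate_pfn[1] = pfn;
    } else if (pfn < c->free_pfn) {
        /* The free scanner walks downward: the position only decreases. */
        c->free_pfn = pfn;
    }
}
```

An async pass therefore never advances the sync restart position, so a later sync pass still revisits the blocks that async compaction gave up on.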

The following figure shows how update_pageblock_skip() sets the skip bit so that isolation of the block is avoided.

Resetting a zone's skip bits

__reset_isolation_suitable()

mm/compaction.c

/*
 * This function is called to clear all cached information on pageblocks that
 * should be skipped for page isolation when the migrate and free page scanner
 * meet.
 */

static void __reset_isolation_suitable(struct zone *zone)
{
        unsigned long start_pfn = zone->zone_start_pfn;
        unsigned long end_pfn = zone_end_pfn(zone);
        unsigned long pfn;

        zone->compact_blockskip_flush = false;

        /* Walk the zone and mark every pageblock as suitable for isolation */
        for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
                struct page *page;

                cond_resched();

                page = pfn_to_online_page(pfn);
                if (!page)
                        continue;
                if (zone != page_zone(page))
                        continue;
                if (pageblock_skip_persistent(page))
                        continue;

                clear_pageblock_skip(page);
        }

        reset_cached_positions(zone);
}

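Resets the zone's migrate and free scan start positions and clears the skip bits of the zone's pageblocks. The skip bit is left untouched for pageblocks covered by a compound page whose order is at least the pageblock order (pageblock_skip_persistent()).

The walk covers the range from the zone's start pfn to its end pfn.
- As the pfn where the migrate (isolate) scanner will start, the zone's start pfn is assigned to compact_cached_migrate_pfn[0..1].
- As the pfn where the free scanner will start, compact_cached_free_pfn is set to the end of the zone (reset_cached_positions() uses the start pfn of the zone's last pageblock).
Location of the pageblock flags (pageblock->flags)
- When the sparse memory model is not used: the usemap pointed to by zone->pageblock_flags
- When the sparse memory model is used: the usemap pointed to by mem_section[].pageblock_flags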
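
The reset can be modelled with a plain array of flags. This is only a sketch under stated assumptions: `struct zone_model`, `MODEL_PAGEBLOCK_NR_PAGES` (512, i.e. an assumed pageblock order of 9) and `reset_isolation_suitable_model()` are hypothetical, the zone start is assumed to be pageblock aligned, and the free scanner position is set to the start of the last pageblock, which is how reset_cached_positions() is understood to behave.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGEBLOCK_NR_PAGES 512UL  /* assumed: pageblock_order = 9 */

/* Hypothetical zone model: one skip flag per pageblock. */
struct zone_model {
        unsigned long start_pfn;        /* assumed pageblock aligned */
        unsigned long end_pfn;          /* exclusive */
        bool *skip;                     /* skip bit per pageblock */
        const bool *persistent;         /* blocks whose skip bit is kept */
        unsigned long cached_migrate_pfn[2];
        unsigned long cached_free_pfn;
};

static void reset_isolation_suitable_model(struct zone_model *z)
{
        unsigned long pfn;
        size_t idx = 0;

        assert(z && z->skip && z->persistent);

        /* Walk the zone one pageblock at a time and clear the skip flags. */
        for (pfn = z->start_pfn; pfn < z->end_pfn;
             pfn += MODEL_PAGEBLOCK_NR_PAGES, idx++) {
                if (z->persistent[idx])
                        continue;       /* models pageblock_skip_persistent() */
                z->skip[idx] = false;
        }

        /* Both scanners restart from the opposite ends of the zone. */
        z->cached_migrate_pfn[0] = z->start_pfn;
        z->cached_migrate_pfn[1] = z->start_pfn;
        z->cached_free_pfn = (z->end_pfn - 1) & ~(MODEL_PAGEBLOCK_NR_PAGES - 1);
}
```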

The following figure shows how __reset_isolation_suitable() clears the skip bits so that every pageblock of the zone may be isolated again, and resets the start positions of the scanners.

Isolation Freepages

The following function is used for compaction by direct-compact, manual-compact and kcompactd when high-order pages are in short supply.

isolate_freepages()

mm/compaction.c

/*
 * Based on information in the current compact_control, find blocks
 * suitable for isolating free pages from and then isolate them.
 */

static void isolate_freepages(struct compact_control *cc)
{
        struct zone *zone = cc->zone;
        struct page *page;
        unsigned long block_start_pfn;  /* start of current pageblock */
        unsigned long isolate_start_pfn; /* exact pfn we start at */
        unsigned long block_end_pfn;    /* end of current pageblock */
        unsigned long low_pfn;       /* lowest pfn scanner is able to scan */
        struct list_head *freelist = &cc->freepages;

        /*
         * Initialise the free scanner. The starting point is where we last
         * successfully isolated from, zone-cached value, or the end of the
         * zone when isolating for the first time. For looping we also need
         * this pfn aligned down to the pageblock boundary, because we do
         * block_start_pfn -= pageblock_nr_pages in the for loop.
         * For ending point, take care when isolating in last pageblock of a
         * zone which ends in the middle of a pageblock.
         * The low boundary is the end of the pageblock the migration scanner
         * is using.
         */
        isolate_start_pfn = cc->free_pfn;
        block_start_pfn = pageblock_start_pfn(cc->free_pfn);
        block_end_pfn = min(block_start_pfn + pageblock_nr_pages,
                                                zone_end_pfn(zone));
        low_pfn = pageblock_end_pfn(cc->migrate_pfn);

        /*
         * Isolate free pages until enough are available to migrate the
         * pages on cc->migratepages. We stop searching if the migrate
         * and free page scanners meet or enough free pages are isolated.
         */
        for (; block_start_pfn >= low_pfn;
                                block_end_pfn = block_start_pfn,
                                block_start_pfn -= pageblock_nr_pages,
                                isolate_start_pfn = block_start_pfn) {
                /*
                 * This can iterate a massively long zone without finding any
                 * suitable migration targets, so periodically check if we need
                 * to schedule, or even abort async compaction.
                 */
                if (!(block_start_pfn % (SWAP_CLUSTER_MAX * pageblock_nr_pages))
                                                && compact_should_abort(cc))
                        break;

                page = pageblock_pfn_to_page(block_start_pfn, block_end_pfn,
                                                                        zone);
                if (!page)
                        continue;

                /* Check the block is suitable for migration */
                if (!suitable_migration_target(cc, page))
                        continue;

                /* If isolation recently failed, do not retry */
                if (!isolation_suitable(cc, page))
                        continue;

                /* Found a block suitable for isolating free pages from. */
                isolate_freepages_block(cc, &isolate_start_pfn, block_end_pfn,
                                        freelist, false);

                /*
                 * If we isolated enough freepages, or aborted due to lock
                 * contention, terminate.
                 */
                if ((cc->nr_freepages >= cc->nr_migratepages)
                                                        || cc->contended) {
                        if (isolate_start_pfn >= block_end_pfn) {
                                /*
                                 * Restart at previous pageblock if more
                                 * freepages can be isolated next time.
                                 */
                                isolate_start_pfn =
                                        block_start_pfn - pageblock_nr_pages;
                        }
                        break;
                } else if (isolate_start_pfn < block_end_pfn) {
                        /*
                         * If isolation failed early, do not continue
                         * needlessly.
                         */
                        break;
                }
        }

        /* __isolate_free_page() does not map the pages */
        map_pages(freelist);

        /*
         * Record where the free scanner will restart next time. Either we
         * broke from the loop and set isolate_start_pfn based on the last
         * call to isolate_freepages_block(), or we met the migration scanner
         * and the loop terminated due to isolate_start_pfn < low_pfn
         */
        cc->free_pfn = isolate_start_pfn;
}

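In code lines 22~25, the free scanner resumes from cc->free_pfn, where it previously stopped, and the start and end pfn of the current pageblock are determined.
- The free scanner moves from pageblock to pageblock toward the bottom of the zone. Within a pageblock, free pages are isolated in the upward direction.
In code line 26, low_pfn is set to the end pfn of the pageblock in which the migrate scanner is working, so that the free scanner does not intrude on that pageblock.
In code lines 33~36, the free scanner iterates downward, one pageblock at a time, toward the position of the migrate scanner.
In code lines 42~44, because a zone can be too large to scan in one go, the loop periodically checks whether compaction should be aborted.
- Check interval
  - Once every SWAP_CLUSTER_MAX (32) pageblocks.
- Abort condition
  - Async compaction is in progress and a higher-priority task requests preemption
In code lines 46~49, the first page of the pageblock is fetched. If the pfn range does not belong to the requested zone, the pageblock is skipped.
- The block is also skipped if its first or last page falls outside the zone, for example when a pageblock straddles a zone boundary.
In code lines 52~53, the pageblock is skipped if it is not suitable as a migration target.
In code lines 56~57, the pageblock is skipped if it is not suitable for isolation.
- isolation_suitable() returns false when isolation in this pageblock recently failed (its skip bit is set).
In code lines 60~61, the free pages of the pageblock are isolated and moved to the cc->freepages list.
In code lines 67~77, the loop stops when the free scanner has isolated at least as many free pages as the migrate scanner has isolated migratable pages, or when lock contention is detected.
In code lines 78~84, the loop also stops when isolate_freepages_block() gave up before reaching the end of the pageblock.
In code line 88, each page on the list of isolated free pages is taken off the list (the pages are linked through page->lru, but they are not on an LRU list), split into order-0 pages, and put back on the list.
In code line 96, the position of the free scanner is saved for the next scan.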
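
To see which pageblocks the free scanner would visit for a given pair of scanner positions, the loop bounds can be reproduced in a small user-space sketch. All names here (`MODEL_PAGEBLOCK_NR_PAGES`, `model_block_start()`, `model_block_end()`, `free_scanner_blocks()`) are hypothetical, a pageblock of 512 pages is assumed, and the suitability checks and the actual isolation are left out.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGEBLOCK_NR_PAGES 512UL  /* assumed pageblock size */

/* Round a pfn down to the start of its pageblock. */
static unsigned long model_block_start(unsigned long pfn)
{
        return pfn & ~(MODEL_PAGEBLOCK_NR_PAGES - 1);
}

/* First pfn after the pageblock that contains pfn. */
static unsigned long model_block_end(unsigned long pfn)
{
        return model_block_start(pfn) + MODEL_PAGEBLOCK_NR_PAGES;
}

/*
 * Record the start pfn of every pageblock the free scanner would visit,
 * highest first. Returns the number of entries written to out[].
 */
static size_t free_scanner_blocks(unsigned long free_pfn,
                                  unsigned long migrate_pfn,
                                  unsigned long *out, size_t max)
{
        unsigned long block_start = model_block_start(free_pfn);
        unsigned long low_pfn = model_block_end(migrate_pfn);
        size_t n = 0;

        assert(out);

        /* Walk down one pageblock at a time until the scanners meet. */
        for (; block_start >= low_pfn && n < max;
             block_start -= MODEL_PAGEBLOCK_NR_PAGES)
                out[n++] = block_start;

        return n;
}
```

With free_pfn = 4000 and migrate_pfn = 1100, the blocks starting at 3584, 3072, 2560, 2048 and 1536 are visited. The block starting at 1024 is never touched because the migrate scanner is working in it.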

Isolate Freepages Block

The following function is called for two purposes.

Compaction
- Called from isolate_freepages()
  - Used by direct-compact, manual-compact and kcompactd for compaction that secures high-order pages.
CMA
- Called from isolate_freepages_range()
  - Used to empty the requested range of a CMA area.

isolate_freepages_block()

mm/compaction.c -1/2-

/*
 * Isolate free pages onto a private freelist. If @strict is true, will abort
 * returning 0 on any invalid PFNs or non-free pages inside of the pageblock
 * (even though it may still end up isolating some pages).
 */

static unsigned long isolate_freepages_block(struct compact_control *cc,
                                unsigned long *start_pfn,
                                unsigned long end_pfn,
                                struct list_head *freelist,
                                bool strict)
{
        int nr_scanned = 0, total_isolated = 0;
        struct page *cursor, *valid_page = NULL;
        unsigned long flags = 0;
        bool locked = false;
        unsigned long blockpfn = *start_pfn;
        unsigned int order;

        cursor = pfn_to_page(blockpfn);

        /* Isolate free pages. */
        for (; blockpfn < end_pfn; blockpfn++, cursor++) {
                int isolated;
                struct page *page = cursor;

                /*
                 * Periodically drop the lock (if held) regardless of its
                 * contention, to give chance to IRQs. Abort if fatal signal
                 * pending or async compaction detects need_resched()
                 */
                if (!(blockpfn % SWAP_CLUSTER_MAX)
                    && compact_unlock_should_abort(&cc->zone->lock, flags,
                                                                &locked, cc))
                        break;

                nr_scanned++;
                if (!pfn_valid_within(blockpfn))
                        goto isolate_fail;

                if (!valid_page)
                        valid_page = page;

                /*
                 * For compound pages such as THP and hugetlbfs, we can save
                 * potentially a lot of iterations if we skip them at once.
                 * The check is racy, but we can consider only valid values
                 * and the only danger is skipping too much.
                 */
                if (PageCompound(page)) {
                        const unsigned int order = compound_order(page);

                        if (likely(order < MAX_ORDER)) {
                                blockpfn += (1UL << order) - 1;
                                cursor += (1UL << order) - 1;
                        }
                        goto isolate_fail;
                }

                if (!PageBuddy(page))
                        goto isolate_fail;

                /*
                 * If we already hold the lock, we can skip some rechecking.
                 * Note that if we hold the lock now, checked_pageblock was
                 * already set in some previous iteration (or strict is true),
                 * so it is correct to skip the suitable migration target
                 * recheck as well.
                 */
                if (!locked) {
                        /*
                         * The zone lock must be held to isolate freepages.
                         * Unfortunately this is a very coarse lock and can be
                         * heavily contended if there are parallel allocations
                         * or parallel compactions. For async compaction do not
                         * spin on the lock and we acquire the lock as late as
                         * possible.
                         */
                        locked = compact_trylock_irqsave(&cc->zone->lock,
                                                                &flags, cc);
                        if (!locked)
                                break;

                        /* Recheck this is a buddy page under lock */
                        if (!PageBuddy(page))
                                goto isolate_fail;
                }

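In code lines 14~19, the pages in the block are visited from the start pfn up to the end pfn.
In code lines 26~29, because scanning one block covers many pages, the zone lock is periodically dropped if it is held. If there is a reason to abort, cc->contended is set to true and the loop exits.
- Check interval
  - Once every SWAP_CLUSTER_MAX (32) pages
- Abort conditions
  - A fatal signal (SIGKILL) is pending
  - A higher-priority task requests preemption during async migration
In code line 31, the scan counter is incremented.
In code lines 32~36, an invalid page jumps to the isolate_fail label. The first valid page is remembered in valid_page.
In code lines 44~52, a compound page jumps to the isolate_fail label, after the scan position has been advanced past the whole compound page when its order is below MAX_ORDER.
In code lines 54~55, a page that is not a free page managed by the buddy system jumps to the isolate_fail label.
In code lines 64~81, if the lock is not yet held it is acquired (the loop exits if the lock cannot be taken), and the page is rechecked under the lock. If it is no longer a free page managed by the buddy system, control jumps to the isolate_fail label.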
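
The saving from jumping over compound pages can be illustrated with a simplified model. This is a sketch only: `struct page_model`, `MODEL_MAX_ORDER` and `count_scan_iterations()` are hypothetical, and the order is stored only on the head entry, whereas the kernel reads it racily via compound_order().

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_ORDER 11U     /* assumed; MAX_ORDER depends on the config */

/* Hypothetical per-page description used only by this sketch. */
struct page_model {
        unsigned int compound_order;    /* > 0 only on a compound head */
};

/*
 * Count the loop iterations needed to scan pages[0..nr) when the head of
 * a compound page lets the scanner jump over all of its tail pages.
 */
static size_t count_scan_iterations(const struct page_model *pages, size_t nr)
{
        size_t i, iterations = 0;

        assert(pages);

        for (i = 0; i < nr; i++) {
                unsigned int order = pages[i].compound_order;

                iterations++;
                /* Same guard as the listing: ignore implausible orders. */
                if (order > 0 && order < MODEL_MAX_ORDER)
                        i += ((size_t)1 << order) - 1;  /* i++ steps past the last tail */
        }

        return iterations;
}
```

For 16 pages containing one order-2 compound page, the scan needs 13 iterations instead of 16. For a THP-sized page the difference is several hundred iterations per page.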

mm/compaction.c -2/2-

                /* Found a free page, will break it into order-0 pages */
                order = page_order(page);
                isolated = __isolate_free_page(page, order);
                if (!isolated)
                        break;
                set_page_private(page, order);

                total_isolated += isolated;
                cc->nr_freepages += isolated;
                list_add_tail(&page->lru, freelist);

                if (!strict && cc->nr_migratepages <= cc->nr_freepages) {
                        blockpfn += isolated;
                        break;
                }
                /* Advance to the end of split page */
                blockpfn += isolated - 1;
                cursor += isolated - 1;
                continue;

isolate_fail:
                if (strict)
                        break;
                else
                        continue;

        }

        if (locked)
                spin_unlock_irqrestore(&cc->zone->lock, flags);

        /*
         * There is a tiny chance that we have read bogus compound_order(),
         * so be careful to not go outside of the pageblock.
         */
        if (unlikely(blockpfn > end_pfn))
                blockpfn = end_pfn;

        trace_mm_compaction_isolate_freepages(*start_pfn, blockpfn,
                                        nr_scanned, total_isolated);

        /* Record how far we have got within the block */
        *start_pfn = blockpfn;

        /*
         * If strict isolation is requested by CMA then check that all the
         * pages requested were isolated. If there were any failures, 0 is
         * returned and CMA will fail.
         */
        if (strict && blockpfn < end_pfn)
                total_isolated = 0;

        /* Update the pageblock-skip if the whole pageblock was scanned */
        if (blockpfn == end_pfn)
                update_pageblock_skip(cc, valid_page, total_isolated, false);

        cc->total_free_scanned += nr_scanned;
        if (total_isolated)
                count_compact_events(COMPACTISOLATED, total_isolated);
        return total_isolated;
}

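In code lines 2~6, the free page is removed from the buddy system's free list and its order is recorded in the page's private field. If the page cannot be isolated, the loop exits.
In code lines 8~9, the isolated-page counter and the count of secured free pages are increased.
In code line 10, the isolated page is added to the free scanner's list.
In code lines 12~15, when @strict is false and the free scanner has secured at least as many pages as the migrate scanner, enough pages are available, so the loop exits. When @strict is true (a CMA request for contiguous pages), this early exit is not taken and the whole range is scanned.
In code lines 17~19, the scan position is advanced to the end of the isolated page and the loop continues.
In code lines 21~25, at the isolate_fail label, the loop exits immediately on failure when @strict is true. Otherwise it continues with the next page.
In code lines 29~30, the lock is released if it is held.
In code lines 36~37, the current pfn is clamped so that it does not exceed the end pfn.
In code line 43, the in/out argument @start_pfn is updated to the current pfn.
In code lines 50~51, when @strict is true and the scan stopped before the end pfn, total_isolated is set to 0 so that 0 is returned to signal failure to CMA.
In code lines 54~55, when the whole pageblock was scanned, update_pageblock_skip() is called. It marks the pageblock to be skipped if nothing was isolated.
In code line 57, the number of scanned pages and the number of isolated pages are added to the statistics, and the number of isolated pages is returned.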
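
The difference between strict (CMA) and non-strict (compaction) scanning shows up clearly in a reduced model. The sketch below is an assumption-laden illustration: `isolate_block_model()` and its array encoding are hypothetical, locking is ignored, and the non-strict early exit on cc->nr_migratepages is left out so that only the failure handling is visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * free_order[i] >= 0 : index i starts a free buddy page of that order
 * free_order[i] <  0 : index i is not free, so isolation fails there
 * Returns the number of isolated pages; *scanned_to receives the index
 * at which the scan stopped.
 */
static unsigned long isolate_block_model(const int *free_order, size_t nr,
                                         bool strict, size_t *scanned_to)
{
        unsigned long total = 0;
        size_t i;

        assert(free_order && scanned_to);

        for (i = 0; i < nr; i++) {
                if (free_order[i] < 0) {
                        if (strict)
                                break;          /* CMA: any failure aborts */
                        continue;               /* compaction: skip the page */
                }
                total += 1UL << free_order[i];
                /* Jump to the last page of the buddy page just isolated. */
                i += ((size_t)1 << free_order[i]) - 1;
        }

        if (i > nr)
                i = nr;                         /* never report past the block */
        *scanned_to = i;

        if (strict && i < nr)
                total = 0;                      /* partial result counts as failure */

        return total;
}
```

For the layout { order 1, (tail), not free, order 0, order 2, ... } over 8 pages, the non-strict scan isolates 7 pages and reaches the end of the block, while the strict scan stops at index 2 and reports 0.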

Isolate Free Page

__isolate_free_page()

mm/page_alloc.c

int __isolate_free_page(struct page *page, unsigned int order)
{
        unsigned long watermark;
        struct zone *zone;
        int mt;

        BUG_ON(!PageBuddy(page));

        zone = page_zone(page);
        mt = get_pageblock_migratetype(page);

        if (!is_migrate_isolate(mt)) {
                /*
                 * Obey watermarks as if the page was being allocated. We can
                 * emulate a high-order watermark check with a raised order-0
                 * watermark, because we already know our high-order page
                 * exists.
                 */
                watermark = min_wmark_pages(zone) + (1UL << order);
                if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA))
                        return 0;

                __mod_zone_freepage_state(zone, -(1UL << order), mt);
        }

        /* Remove page from free list */
        list_del(&page->lru);
        zone->free_area[order].nr_free--;
        rmv_page_order(page);

        /*
         * Set the pageblock if the isolated page is at least half of a
         * pageblock
         */
        if (order >= pageblock_order - 1) {
                struct page *endpage = page + (1 << order) - 1;
                for (; page < endpage; page += pageblock_nr_pages) {
                        int mt = get_pageblock_migratetype(page);
                        if (!is_migrate_isolate(mt) && !is_migrate_cma(mt)
                            && !is_migrate_highatomic(mt))
                                set_pageblock_migratetype(page,
                                                          MIGRATE_MOVABLE);
                }
        }


        return 1UL << order;
}

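Removes a free page of the given order from the buddy system's free list so that the caller can later split it into order-0 pages. On success the number of pages, 2^order, is returned.

In code lines 9~10, the zone and the migrate type of the page are looked up.
In code lines 12~24, if the migrate type of the pageblock is not MIGRATE_ISOLATE, an order-0 watermark check is made against the min watermark + the number of pages in the order page (2^order). If the check does not pass, 0 is returned as failure. If it passes, the free page count is decreased by the number of pages in the order page.
In code lines 27~29, the page is removed from its buddy free list, and the buddy information is removed from the page by clearing the buddy flag and resetting the private value, which held the order, to 0.
In code lines 35~44, when the isolated order page is at least 50% of a pageblock, pageblocks whose type is unmovable or reclaimable can be changed to movable. The isolate, cma and highatomic types are never changed.
- Since one order page may contain several pageblocks, the pageblocks are visited one by one up to the end of the order page, and each pageblock that is none of isolate, cma or highatomic is changed to the movable type.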
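
The raised order-0 watermark check can be worked through with concrete numbers. The following sketch is a simplification: `may_isolate_model()` is a hypothetical helper, and unlike the real zone_watermark_ok() it ignores lowmem reserves and the allocation flags.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified stand-in for the check in __isolate_free_page(): the page
 * may be taken only if the zone stays above min watermark + 2^order.
 */
static bool may_isolate_model(unsigned long free_pages,
                              unsigned long min_wmark,
                              unsigned int order)
{
        unsigned long watermark;

        assert(order < 8 * sizeof(unsigned long));

        /* Emulate a high-order check with a raised order-0 watermark. */
        watermark = min_wmark + (1UL << order);

        return free_pages > watermark;
}
```

With 1000 free pages and a min watermark of 900, an order-3 page (8 pages, raised watermark 908) may be isolated, but an order-7 page (128 pages, raised watermark 1028) may not.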

split_page()

mm/page_alloc.c

/*
 * split_page takes a non-compound higher-order page, and splits it into
 * n (1<<order) sub-pages: page[0..n]
 * Each sub-page must be freed individually.
 *
 * Note: this is probably too low level an operation for use in drivers.
 * Please consult with lkml before using this in your driver.
 */

void split_page(struct page *page, unsigned int order)
{
        int i;

        VM_BUG_ON_PAGE(PageCompound(page), page);
        VM_BUG_ON_PAGE(!page_count(page), page);

        for (i = 1; i < (1 << order); i++)
                set_page_refcounted(page + i);
        split_page_owner(page, order);
}
EXPORT_SYMBOL_GPL(split_page);

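Sets the reference counter to 1 on every sub-page produced by the split (the first page already holds a reference). The page owner information of the original page is then propagated to all of the sub-pages.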
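
The effect on the reference counters can be shown with a toy model. This is a sketch with hypothetical names (`struct split_page_entry`, `split_page_model()`); page owner tracking is not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical page descriptor holding only a reference counter. */
struct split_page_entry {
        int refcount;
};

/*
 * Model of split_page(): the first page already holds a reference,
 * every following sub-page gets its own reference set to 1.
 */
static void split_page_model(struct split_page_entry *page, unsigned int order)
{
        size_t i, nr = (size_t)1 << order;

        assert(page && page[0].refcount > 0);

        for (i = 1; i < nr; i++)
                page[i].refcount = 1;
}
```

After the call each of the 2^order entries can be freed on its own, which matches the note in the kernel comment that each sub-page must be freed individually.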

Get New Page Hook Function

compaction_alloc()

mm/compaction.c

/*
 * This is a migrate-callback that "allocates" freepages by taking pages
 * from the isolated freelists in the block we are migrating to.
 */

static struct page *compaction_alloc(struct page *migratepage,
                                        unsigned long data,
                                        int **result)
{
        struct compact_control *cc = (struct compact_control *)data;
        struct page *freepage;

        /*
         * Isolate free pages if necessary, and if we are not aborting due to
         * contention.
         */
        if (list_empty(&cc->freepages)) {
                if (!cc->contended)
                        isolate_freepages(cc);

                if (list_empty(&cc->freepages))
                        return NULL;
        }

        freepage = list_entry(cc->freepages.next, struct page, lru);
        list_del(&freepage->lru);
        cc->nr_freepages--;

        return freepage;
}

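Returns the free page at the head of the cc->freepages list, which holds the pages isolated by the free scanner. If there is no free page to return, the free scanner is run first, unless compaction is being aborted due to contention. NULL is returned on failure.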

Put New Page Hook Function

compaction_free()

mm/compaction.c

/*
 * This is a migrate-callback that "frees" freepages back to the isolated
 * freelist.  All pages on the freelist are from the same zone, so there is no
 * special handling needed for NUMA.
 */

static void compaction_free(struct page *page, unsigned long data)
{
        struct compact_control *cc = (struct compact_control *)data;

        list_add(&page->lru, &cc->freepages);
        cc->nr_freepages++;
}

Adds the page back to the cc->freepages list.
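
The interplay of the two hooks, taking a target page from the private free list and handing it back when a migration does not use it, can be sketched with a singly linked list. All names below (`struct free_page_model`, `struct cc_model`, `alloc_model()`, `free_model()`, the `refill` hook) are hypothetical; the kernel uses struct list_head and calls isolate_freepages() directly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical free page: only the list linkage is modelled. */
struct free_page_model {
        struct free_page_model *next;
};

/* Hypothetical, reduced compact_control. */
struct cc_model {
        struct free_page_model *freepages;      /* head of the private list */
        unsigned long nr_freepages;
        int contended;
        void (*refill)(struct cc_model *cc);    /* plays isolate_freepages() */
};

/* Model of compaction_alloc(): pop the head, refilling once if empty. */
static struct free_page_model *alloc_model(struct cc_model *cc)
{
        struct free_page_model *page;

        assert(cc);

        if (!cc->freepages) {
                if (!cc->contended && cc->refill)
                        cc->refill(cc);
                if (!cc->freepages)
                        return NULL;
        }

        page = cc->freepages;
        cc->freepages = page->next;
        page->next = NULL;
        cc->nr_freepages--;

        return page;
}

/* Model of compaction_free(): push the unused page back on the list. */
static void free_model(struct cc_model *cc, struct free_page_model *page)
{
        assert(cc && page);

        page->next = cc->freepages;
        cc->freepages = page;
        cc->nr_freepages++;
}
```

Because a returned page goes back to the head of the list, the next allocation reuses it immediately, so an aborted migration does not cost an extra free scanner run.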

References

Zoned Allocator -1- (Physical Page Allocation-Fastpath) | 문c
Zoned Allocator -2- (Physical Page Allocation-Slowpath) | 문c
Zoned Allocator -3- (Buddy Page Allocation) | 문c
Zoned Allocator -4- (Buddy Page Free) | 문c
Zoned Allocator -5- (Per-CPU Page Frame Cache) | 문c
Zoned Allocator -6- (Watermark) | 문c
Zoned Allocator -7- (Direct Compact) | 문c
Zoned Allocator -8- (Direct Compact-Isolation) | 문c – current post
Zoned Allocator -9- (Direct Compact-Migration) | 문c
Zoned Allocator -10- (LRU & pagevec) | 문c
Zoned Allocator -11- (Direct Reclaim) | 문c
Zoned Allocator -12- (Direct Reclaim-Shrink-1) | 문c
Zoned Allocator -13- (Direct Reclaim-Shrink-2) | 문c
Zoned Allocator -14- (Kswapd) | 문c