문c Blog

Zoned Allocator -7- (Direct Compact)

2016-07-01, 2020-12-21 문영일

Compaction

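A zone can hold more free pages than a request of a given order needs and still fail that request, because the free pages are fragmented and no contiguous block of the requested order exists. Compaction resolves this by migrating in-use movable pages elsewhere so that contiguous free pages become available.

Movable pages are memory allocated from user space or pages that back files. Unmovable pages allocated by the kernel cannot be compacted.

The following figure shows compaction being performed for an order-2 page allocation request.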

migration

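The following figure shows how a migratable page used in user space is migrated (copied) to a different physical address.

In some cases the copy is performed by DMA instead of the CPU, to lower the CPU cost of migration.
The migrated page is not mapped into the virtual address space immediately after migration completes. When the page is accessed later, a new mapping is established in which the virtual address is unchanged and only the physical address differs.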

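Migratable pages

A migratable page is an in-use physical page whose contents can be moved to another physical page. The following kinds exist.

lru movable pages
- Movable pages managed on the LRU lists can be migrated directly by the kernel's page management code.
non-lru movable pages
- Pages that are not managed on the LRU lists cannot be migrated by default. Pages that belong to a driver implementing migration can be migrated, however, and these are called non-lru movable pages.
  - e.g. the zram and balloon memory drivers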

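Free scanner & migrate scanner

When compaction starts, the following two scanners begin scanning the pageblocks in the zone.

free scanner
- Searches for free pages from the highest pageblock downward.
migrate scanner
- Searches for in-use migratable pages from the lowest pageblock upward.

The following figure shows in-use migratable pages found by the migrate scanner being migrated into the free pages found by the free scanner.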
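
The interplay of the two scanners can be sketched in user space. The following is a toy model only, assuming one array slot per page and ignoring pageblocks, isolation lists and locking. The names `toy_page` and `toy_compact` are hypothetical and do not exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: each array slot is one page. This is NOT kernel code. */
enum toy_page { TOY_FREE, TOY_MOVABLE, TOY_UNMOVABLE };

/*
 * Move movable pages found by the upward "migrate scanner" into free
 * slots found by the downward "free scanner" until the scanners meet.
 * Returns the number of pages moved.
 */
static size_t toy_compact(enum toy_page *zone, size_t nr_pages)
{
        size_t migrate = 0;     /* scans from the bottom up */
        size_t free_pos;        /* scans from the top down */
        size_t moved = 0;

        assert(zone != NULL);
        if (nr_pages == 0)
                return 0;
        free_pos = nr_pages - 1;

        while (migrate < free_pos) {
                if (zone[migrate] != TOY_MOVABLE) {
                        migrate++;      /* nothing to migrate here */
                        continue;
                }
                if (zone[free_pos] != TOY_FREE) {
                        free_pos--;     /* not a usable target */
                        continue;
                }
                /* "Migrate": the page now lives at the higher slot. */
                zone[free_pos] = TOY_MOVABLE;
                zone[migrate] = TOY_FREE;
                moved++;
                migrate++;
                free_pos--;
        }
        return moved;
}
```

In this model an unmovable page is simply stepped over by both scanners, which is why unmovable pages limit how much contiguous space compaction can create.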

compact_priority

include/linux/compaction.h

/*
 * Determines how hard direct compaction should try to succeed.
 * Lower value means higher priority, analogically to reclaim priority.
 */

enum compact_priority {
        COMPACT_PRIO_SYNC_FULL,
        MIN_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_FULL,
        COMPACT_PRIO_SYNC_LIGHT,
        MIN_COMPACT_COSTLY_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
        DEF_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
        COMPACT_PRIO_ASYNC,
        INIT_COMPACT_PRIORITY = COMPACT_PRIO_ASYNC
};

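These are the three priority levels that determine how hard a compaction attempt tries to succeed.

COMPACT_PRIO_SYNC_FULL(0) & MIN_COMPACT_PRIORITY
- The highest priority. In compact_zone_order() it selects MIGRATE_SYNC_LIGHT and additionally scans the whole zone while ignoring the pageblock skip hints and the pageblock suitability check.
COMPACT_PRIO_SYNC_LIGHT(1) & MIN_COMPACT_COSTLY_PRIORITY & DEF_COMPACT_PRIORITY
- The default, middle priority. Compaction runs synchronously with MIGRATE_SYNC_LIGHT.
COMPACT_PRIO_ASYNC(2) & INIT_COMPACT_PRIORITY
- The initial and lowest priority. Compaction and migration run asynchronously (MIGRATE_ASYNC).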

migrate_mode

include/linux/migrate_mode.h

/*
 * MIGRATE_ASYNC means never block
 * MIGRATE_SYNC_LIGHT in the current implementation means to allow blocking
 *      on most operations but not ->writepage as the potential stall time
 *      is too significant
 * MIGRATE_SYNC will block when migrating pages
 */

enum migrate_mode {
        MIGRATE_ASYNC,
        MIGRATE_SYNC_LIGHT,
        MIGRATE_SYNC,
        MIGRATE_SYNC_NO_COPY
};

The mode used when migrating pages.

MIGRATE_ASYNC
- Asynchronous migration mode; it never blocks.
- Used for async compaction.
MIGRATE_SYNC_LIGHT
- Operates synchronously for most operations, with the exception of writepage.
- Used by kcompactd.
- Used for sync compaction.
MIGRATE_SYNC
- Synchronous migration mode; it blocks.
MIGRATE_SYNC_NO_COPY
- Synchronous migration mode that blocks, but the CPU does not copy the page being migrated; the copy is done with DMA instead.

Compaction operating modes

Compaction can be triggered in the following three ways.

direct-compaction
- When an allocation request of a given order is hard to satisfy because memory is short, compaction is invoked directly from inside the page allocation API.
manual-compaction
- Requested manually, regardless of order, with the following command.
  - "echo 1 > /proc/sys/vm/compact_memory"
kcompactd
- Woken automatically when memory is short and performs compaction in the background.

Manual Compaction

First check the per-order page state as follows.

# cat /proc/pagetypeinfo
Page block order: 10
Pages per block:  1024

Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
Node    0, zone      DMA, type    Unmovable    485    196  50044     12      5      0      0      1      1      1      0
Node    0, zone      DMA, type      Movable     22     69     66     51     46     30     21     11      8      1    386
Node    0, zone      DMA, type  Reclaimable     50     25     11      0      1      0      1      1      1      1      0
Node    0, zone      DMA, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type          CMA   1284    886    567    319    149     81     46     31     11     10     61
Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0

Number of blocks type     Unmovable      Movable  Reclaimable   HighAtomic          CMA      Isolate
Node 0, zone      DMA          403          417            6            0          124            0

To compact the movable pages, run manual compaction with the following command.

echo 1 > /proc/sys/vm/compact_memory
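
The same request can be issued from a C program. The following is a minimal sketch, assuming a hypothetical helper named `request_compaction`. The path is a parameter so the function can be tried on an ordinary file; on a real system the path would be /proc/sys/vm/compact_memory, which requires root and a kernel built with CONFIG_COMPACTION.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write "1" to the given file (hypothetical helper, not a kernel or
 * libc API). Returns 0 on success and -1 on failure.
 */
static int request_compaction(const char *path)
{
        FILE *fp;
        int ok;

        assert(path != NULL);
        fp = fopen(path, "w");
        if (!fp)
                return -1;      /* e.g. no permission or no such file */

        ok = (fputs("1\n", fp) >= 0);
        if (fclose(fp) != 0)    /* the write may only fail at close time */
                ok = 0;
        return ok ? 0 : -1;
}
```

A caller would typically run `request_compaction("/proc/sys/vm/compact_memory")` and then read /proc/pagetypeinfo again to compare the free counts.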

As the following output shows, some of the movable pages have been compacted. The unmovable pages that the kernel was using, however, have not been compacted.

# cat /proc/pagetypeinfo
Page block order: 10
Pages per block:  1024

Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
Node    0, zone      DMA, type    Unmovable    489    196  50044     11      5      0      0      1      1      1      0
Node    0, zone      DMA, type      Movable     22     43     36     32     27     24     18     14      9      1    386
Node    0, zone      DMA, type  Reclaimable     69     26     11      0      1      0      1      1      1      1      0
Node    0, zone      DMA, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type          CMA   1189    814    521    292    134     75     42     30     12      9     63
Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0

Number of blocks type     Unmovable      Movable  Reclaimable   HighAtomic          CMA      Isolate
Node 0, zone      DMA          403          417            6            0          124            0

kcompactd

kcompactd is woken automatically when memory is short and performs compaction in the background. It was introduced in kernel v4.6-rc1.

References
- mm, compaction: introduce kcompactd
- mm, kswapd: replace kswapd compaction with waking up kcompactd

Balance check

Checks whether 2^order pages can be allocated while free pages stay above the high watermark.

pgdat_balanced()

mm/vmscan.c

/*
 * Returns true if there is an eligible zone balanced for the request order
 * and classzone_idx
 */

static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
{
        int i;
        unsigned long mark = -1;
        struct zone *zone;

        /*
         * Check watermarks bottom-up as lower zones are more likely to
         * meet watermarks.
         */
        for (i = 0; i <= classzone_idx; i++) {
                zone = pgdat->node_zones + i;

                if (!managed_zone(zone))
                        continue;

                mark = high_wmark_pages(zone);
                if (zone_watermark_ok_safe(zone, order, mark, classzone_idx))
                        return true;
        }

        /*
         * If a node has no populated zone within classzone_idx, it does not
         * need balancing by definition. This can happen if a zone-restricted
         * allocation tries to wake a remote kswapd.
         */
        if (mark == -1)
                return true;

        return false;
}

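Returns whether the node is balanced for the zones at or below @classzone_idx. A zone is considered balanced when the requested order can be allocated with free pages above the high watermark.

In code lines 11~20, iterates from zone 0 to the @classzone_idx zone and returns true if order pages can be obtained above the high watermark.
In code lines 27~28, if no zone has any managed pages, no balancing is needed, so true is returned.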
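
The two return paths can be modelled in user space. This is a simplified sketch under stated assumptions: `toy_zone` and `toy_pgdat_balanced` are hypothetical names, and the watermark test is reduced to a plain comparison, whereas the real zone_watermark_ok_safe() also takes the order and the lowmem reserves into account.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct zone. */
struct toy_zone {
        unsigned long managed_pages;    /* 0 means the zone is unpopulated */
        unsigned long free_pages;
        unsigned long high_wmark;
};

/*
 * Returns true if any populated zone in [0, classzone_idx] has free pages
 * above its high watermark, or if no zone in that range is populated.
 */
static bool toy_pgdat_balanced(const struct toy_zone *zones, int classzone_idx)
{
        bool seen_populated = false;
        int i;

        assert(zones != NULL && classzone_idx >= 0);
        for (i = 0; i <= classzone_idx; i++) {
                if (zones[i].managed_pages == 0)
                        continue;       /* like !managed_zone(zone) */
                seen_populated = true;
                if (zones[i].free_pages > zones[i].high_wmark)
                        return true;
        }
        /* Nothing populated: balanced by definition. */
        return !seen_populated;
}
```

The `seen_populated` flag plays the role of the `mark == -1` test in the kernel listing.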

The following figure shows true being returned when a zone up to classzone_idx of the requested node is balanced.

Conditions for performing compaction

Checking whether compaction should continue

compaction_suitable()

mm/compaction.c

enum compact_result compaction_suitable(struct zone *zone, int order,
                                        unsigned int alloc_flags,
                                        int classzone_idx)
{
        enum compact_result ret;
        int fragindex;

        ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx,
                                    zone_page_state(zone, NR_FREE_PAGES));
        /*
         * fragmentation index determines if allocation failures are due to
         * low memory or external fragmentation
         *
         * index of -1000 would imply allocations might succeed depending on
         * watermarks, but we already failed the high-order watermark check
         * index towards 0 implies failure is due to lack of memory
         * index towards 1000 implies failure is due to fragmentation
         *
         * Only compact if a failure would be due to fragmentation. Also
         * ignore fragindex for non-costly orders where the alternative to
         * a successful reclaim/compaction is OOM. Fragindex and the
         * vm.extfrag_threshold sysctl is meant as a heuristic to prevent
         * excessive compaction for costly orders, but it should not be at the
         * expense of system stability.
         */
        if (ret == COMPACT_CONTINUE && (order > PAGE_ALLOC_COSTLY_ORDER)) {
                fragindex = fragmentation_index(zone, order);
                if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold)
                        ret = COMPACT_NOT_SUITABLE_ZONE;
        }

        trace_mm_compaction_suitable(zone, order, ret);
        if (ret == COMPACT_NOT_SUITABLE_ZONE)
                ret = COMPACT_SKIPPED;

        return ret;
}

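Returns whether compaction is suitable for allocating 2^order pages in the requested zone.

In code lines 8~9, obtains the result of whether compaction should continue.
In code lines 26~34, when a costly-order request received a continue verdict, checks the fragmentation index and, if compaction is judged unlikely to help, changes the return value to COMPACT_SKIPPED.
- The intent is to skip compaction when the fragmentation index is in the range [0, sysctl_extfrag_threshold].
- sysctl_extfrag_threshold
  - The default value is 500.
  - The value can be changed through the "/proc/sys/vm/extfrag_threshold" file.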
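
The threshold test on its own can be written as a small predicate. This is a sketch rather than kernel code: `skip_for_low_fragindex` is a hypothetical name, and the threshold is passed as a parameter in place of the sysctl variable. PAGE_ALLOC_COSTLY_ORDER is 3 in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_ALLOC_COSTLY_ORDER 3       /* same value as in the kernel */

/*
 * Returns true when a COMPACT_CONTINUE verdict would be downgraded to
 * COMPACT_SKIPPED because the fragmentation index is too low.
 */
static bool skip_for_low_fragindex(int order, int fragindex, int threshold)
{
        assert(threshold >= 0 && threshold <= 1000);

        /* The index is ignored for non-costly orders. */
        if (order <= PAGE_ALLOC_COSTLY_ORDER)
                return false;

        /* -1000 ("allocation might succeed") falls outside the range. */
        return fragindex >= 0 && fragindex <= threshold;
}
```

With the default threshold of 500, an order-4 request with an index of 300 is skipped, while the same request with an index of 800 goes ahead.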

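The following figure shows how the continue-or-not result is returned. When a costly-order request receives a continue verdict, the fragmentation index is additionally checked to confirm that continuing is really appropriate.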

compact_result

include/linux/compaction.h

/* Return values for compact_zone() and try_to_compact_pages() */
/* When adding new states, please adjust include/trace/events/compaction.h */
enum compact_result {
        /* For more detailed tracepoint output - internal to compaction */
        COMPACT_NOT_SUITABLE_ZONE,
        /*
         * compaction didn't start as it was not possible or direct reclaim
         * was more suitable
         */
        COMPACT_SKIPPED,
        /* compaction didn't start as it was deferred due to past failures */
        COMPACT_DEFERRED,

        /* compaction not active last round */
        COMPACT_INACTIVE = COMPACT_DEFERRED,

        /* For more detailed tracepoint output - internal to compaction */
        COMPACT_NO_SUITABLE_PAGE,
        /* compaction should continue to another pageblock */
        COMPACT_CONTINUE,

        /*
         * The full zone was compacted scanned but wasn't successfull to compact
         * suitable pages.
         */
        COMPACT_COMPLETE,
        /*
         * direct compaction has scanned part of the zone but wasn't successfull
         * to compact suitable pages.
         */
        COMPACT_PARTIAL_SKIPPED,

        /* compaction terminated prematurely due to lock contentions */
        COMPACT_CONTENDED,

        /*
         * direct compaction terminated after concluding that the allocation
         * should now succeed
         */
        COMPACT_SUCCESS,
};

The result of the check made before attempting compaction, or the result after performing compaction.

COMPACT_NOT_SUITABLE_ZONE
- Used for trace debug output or internally.
COMPACT_SKIPPED
- Compaction is skipped because it cannot be performed or because direct reclaim is more suitable.
COMPACT_DEFERRED & COMPACT_INACTIVE
- Compaction is skipped this time and deferred because the previous compaction attempt failed.
COMPACT_NO_SUITABLE_PAGE
- Used for trace debug output or internally.
COMPACT_CONTINUE
- Compaction should continue with another pageblock.
- For manual compaction, it continues until every block in the relevant range has been processed.
COMPACT_COMPLETE
- The whole zone was scanned, but compaction did not produce an allocatable page.
COMPACT_PARTIAL_SKIPPED
- Direct compaction scanned part of the zone but has not yet produced an allocatable page.
COMPACT_CONTENDED
- Compaction terminated early because of lock contention.
COMPACT_SUCCESS
- Direct compaction terminated after an allocatable page was secured.

__compaction_suitable()

mm/compaction.c

/*
 * compaction_suitable: Is this suitable to run compaction on this zone now?
 * Returns
 *   COMPACT_SKIPPED  - If there are too few free pages for compaction
 *   COMPACT_SUCCESS  - If the allocation would succeed without compaction
 *   COMPACT_CONTINUE - If compaction should run now
 */

static enum compact_result __compaction_suitable(struct zone *zone, int order,
                                        unsigned int alloc_flags,
                                        int classzone_idx,
                                        unsigned long wmark_target)
{
        unsigned long watermark;

        if (is_via_compact_memory(order))
                return COMPACT_CONTINUE;

        watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
        /*
         * If watermarks for high-order allocation are already met, there
         * should be no need for compaction at all.
         */
        if (zone_watermark_ok(zone, order, watermark, classzone_idx,
                                                                alloc_flags))
                return COMPACT_SUCCESS;

        /*
         * Watermarks for order-0 must be met for compaction to be able to
         * isolate free pages for migration targets. This means that the
         * watermark and alloc_flags have to match, or be more pessimistic than
         * the check in __isolate_free_page(). We don't use the direct
         * compactor's alloc_flags, as they are not relevant for freepage
         * isolation. We however do use the direct compactor's classzone_idx to
         * skip over zones where lowmem reserves would prevent allocation even
         * if compaction succeeds.
         * For costly orders, we require low watermark instead of min for
         * compaction to proceed to increase its chances.
         * ALLOC_CMA is used, as pages in CMA pageblocks are considered
         * suitable migration targets
         */
        watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
                                low_wmark_pages(zone) : min_wmark_pages(zone);
        watermark += compact_gap(order);
        if (!__zone_watermark_ok(zone, 0, watermark, classzone_idx,
                                                ALLOC_CMA, wmark_target))
                return COMPACT_SKIPPED;

        return COMPACT_CONTINUE;
}

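Returns whether compaction should proceed for the requested zone and order.

In code lines 8~9, if the user requested compaction, COMPACT_CONTINUE is returned so that it always proceeds.
- Compaction can be requested with "echo 1 > /proc/sys/vm/compact_memory".
In code line 11, obtains the watermark of the requested zone.
In code lines 16~18, as the first check, if free pages above the watermark are already available, compaction is no longer needed and COMPACT_SUCCESS is returned.
In code lines 34~41, as the second check, compares against the situation assumed after compaction has completed. If memory would still be insufficient, COMPACT_SKIPPED is returned; if there is a chance of securing the pages, COMPACT_CONTINUE is returned.
- For a costly high-order request the low watermark is used as the baseline; for a lower-order request the min watermark is used.
- Because compaction briefly allocates pages to copy into while it runs, twice the number of pages of the requested order is added to the watermark. The comparison is then made at order 0, with the CMA area included.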
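
The threshold used by the second check is plain arithmetic and can be reproduced in isolation. The sketch below assumes hypothetical helper names (`toy_compact_gap`, `toy_suitable_threshold`) and leaves out the lowmem reserve and CMA accounting that __zone_watermark_ok() performs.

```c
#include <assert.h>

#define PAGE_ALLOC_COSTLY_ORDER 3       /* same value as in the kernel */

/* Same arithmetic as the kernel's compact_gap(): twice 2^order pages. */
static unsigned long toy_compact_gap(unsigned int order)
{
        return 2UL << order;
}

/*
 * Number of pages the zone's order-0 free page count is compared against
 * before compaction is allowed to continue.
 */
static unsigned long toy_suitable_threshold(unsigned int order,
                                            unsigned long min_wmark,
                                            unsigned long low_wmark)
{
        unsigned long watermark;

        assert(min_wmark <= low_wmark);
        /* Costly orders use the stricter low watermark. */
        watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ? low_wmark : min_wmark;
        return watermark + toy_compact_gap(order);
}
```

For example, with a min watermark of 1000 pages and a low watermark of 1250 pages, an order-2 request gives 1000 + 8 = 1008 pages, and an order-4 request gives 1250 + 32 = 1282 pages.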

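The following figure shows the process of checking whether compaction may continue.

The second condition assumes the state after compaction and compares against a watermark whose baseline has been changed to order 0.

Computing the fragmentation index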

fragmentation_index()

mm/vmstat.c

/* Same as __fragmentation index but allocs contig_page_info on stack */
int fragmentation_index(struct zone *zone, unsigned int order) 
{         
        struct contig_page_info info;

        fill_contig_page_info(zone, order, &info);
        return __fragmentation_index(order, &info);
}

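Obtains the fragmentation index for the requested zone and order in order to decide whether compaction should run. A return value of -1000 means a suitable free block already exists, so compaction is not needed. Otherwise the value lies in the range 0 ~ 1000, and the intent is to skip compaction when it is at or below sysctl_extfrag_threshold.

In code line 6, collects into info the total number of free blocks, the total number of free pages, and the number of free blocks able to satisfy an allocation of the given order, all taken from the buddy system of the given zone.
In code line 7, computes the fragmentation index from the requested order and the contig_page information.

The following figure shows how the fragmentation index value is computed.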

fill_contig_page_info()

mm/vmstat.c

/*
 * Calculate the number of free pages in a zone, how many contiguous
 * pages are free and how many are large enough to satisfy an allocation of
 * the target size. Note that this function makes no attempt to estimate
 * how many suitable free blocks there *might* be if MOVABLE pages were
 * migrated. Calculating that is possible, but expensive and can be
 * figured out from userspace
 */

static void fill_contig_page_info(struct zone *zone,
                                unsigned int suitable_order,
                                struct contig_page_info *info)
{
        unsigned int order;

        info->free_pages = 0;
        info->free_blocks_total = 0;
        info->free_blocks_suitable = 0;

        for (order = 0; order < MAX_ORDER; order++) {
                unsigned long blocks;

                /* Count number of free blocks */
                blocks = zone->free_area[order].nr_free;
                info->free_blocks_total += blocks;

                /* Count free base pages */
                info->free_pages += blocks << order;

                /* Count the suitable free blocks */
                if (order >= suitable_order)
                        info->free_blocks_suitable += blocks <<
                                                (order - suitable_order);
        }
}

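Fills info, a contig_page_info structure, with the total number of free blocks, the total number of free pages, and the number of free blocks that can satisfy suitable_order, all taken from the buddy system of the given zone.

In code lines 11~16, iterates over the per-order lists of the buddy system managed by the zone and adds up the total number of free blocks.
In code line 19, adds up the number of free pages.
In code lines 22~24, adds up the free blocks at or above the requested order, counted in units of the requested order.

contig_page_info structure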

mm/vmstat.c

#ifdef CONFIG_COMPACTION
struct contig_page_info {
        unsigned long free_pages;
        unsigned long free_blocks_total;
        unsigned long free_blocks_suitable;
};
#endif

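Information used to compute the fragmentation index for the requested order.

free_pages
- The number of all free pages managed by the buddy system
- e.g. with 2 free order-3 pages
  - 16 (2^3 * 2) pages
free_blocks_total
- The number of all free blocks (head pages) managed by the buddy system
- e.g. with 2 free order-3 pages
  - 2
free_blocks_suitable
- The number of free blocks (head pages) that satisfy the requested order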
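
The three fields can be reproduced from a per-order free count array. The following is a user-space sketch mirroring the loop in fill_contig_page_info(). The names `toy_contig_info` and `toy_fill_info` are hypothetical, and `TOY_MAX_ORDER` is assumed to be 11, matching orders 0 to 10 in the pagetypeinfo output earlier.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_ORDER 11        /* assumed: orders 0..10 */

struct toy_contig_info {
        unsigned long free_pages;
        unsigned long free_blocks_total;
        unsigned long free_blocks_suitable;
};

/* nr_free[order] is the number of free blocks of that order. */
static void toy_fill_info(const unsigned long nr_free[TOY_MAX_ORDER],
                          unsigned int suitable_order,
                          struct toy_contig_info *info)
{
        unsigned int order;

        assert(info != NULL && suitable_order < TOY_MAX_ORDER);
        info->free_pages = 0;
        info->free_blocks_total = 0;
        info->free_blocks_suitable = 0;

        for (order = 0; order < TOY_MAX_ORDER; order++) {
                unsigned long blocks = nr_free[order];

                info->free_blocks_total += blocks;
                info->free_pages += blocks << order;
                /* A larger block holds 2^(order - suitable_order) requests. */
                if (order >= suitable_order)
                        info->free_blocks_suitable +=
                                blocks << (order - suitable_order);
        }
}
```

With free counts of 4, 2, 1 and 2 for orders 0 to 3 and a requested order of 2, this gives free_pages = 28, free_blocks_total = 9 and free_blocks_suitable = 5 (one order-2 block plus two order-3 blocks that each hold two order-2 blocks).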

__fragmentation_index()

mm/vmstat.c

/*
 * A fragmentation index only makes sense if an allocation of a requested
 * size would fail. If that is true, the fragmentation index indicates
 * whether external fragmentation or a lack of memory was the problem.
 * The value can be used to determine if page reclaim or compaction
 * should be used
 */

static int __fragmentation_index(unsigned int order, struct contig_page_info *info)
{
        unsigned long requested = 1UL << order;

        if (WARN_ON_ONCE(order >= MAX_ORDER))
                return 0;

        if (!info->free_blocks_total)
                return 0;

        /* Fragmentation index only makes sense when a request would fail */
        if (info->free_blocks_suitable)
                return -1000;

        /*
         * Index is between 0 and 1 so return within 3 decimal places
         *
         * 0 => allocation would fail due to lack of memory
         * 1 => allocation would fail due to fragmentation
         */
        return 1000 - div_u64( (1000+(div_u64(info->free_pages * 1000ULL, requested))), info->free_blocks_total);
}

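Returns the fragmentation index computed from the requested order and the free page and free block information.

Value close to 0 (low fragmentation index)
- The allocation would fail because of a lack of memory.
- Allocation is likely to fail even after compaction.
Value close to 1000 (high fragmentation index)
- The allocation would fail because of fragmentation.
- Allocation is likely to succeed after compaction.
-1000 (allocation possible)
- A block of the requested order exists, so the allocation can succeed.
- Compaction is not needed.

In code lines 5~6, returns 0 when the requested order is at or above the maximum buddy order.
In code lines 8~9, returns 0 when the total number of free blocks is 0, because compaction cannot be performed.
In code lines 12~13, returns -1000 when a free block that can satisfy the requested order exists, because compaction is not needed.
In code line 21, the result is computed as 1000 - (total free pages x 1000 / requested pages + 1000) / total free blocks.
- The closer to 0, the more the failure is caused by a lack of memory, so it is better not to allow compaction.
- The closer to 1000, the more the failure is caused by fragmentation, so compacting is worthwhile.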
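
The formula can be checked with concrete numbers in user space. This is a sketch with a hypothetical function name, `toy_fragmentation_index`. It uses ordinary 64-bit integer division in place of div_u64() and omits the MAX_ORDER warning.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Same integer arithmetic as __fragmentation_index(), taking the three
 * contig_page_info fields as plain arguments.
 */
static int toy_fragmentation_index(unsigned int order,
                                   uint64_t free_pages,
                                   uint64_t free_blocks_total,
                                   uint64_t free_blocks_suitable)
{
        uint64_t requested;

        assert(order < 32);
        requested = (uint64_t)1 << order;

        if (free_blocks_total == 0)
                return 0;
        /* The index only makes sense when the request would fail. */
        if (free_blocks_suitable != 0)
                return -1000;

        return 1000 - (int)((1000 + free_pages * 1000 / requested) /
                            free_blocks_total);
}
```

For an order-4 request (16 pages), 1024 free pages scattered as 1024 order-0 blocks give 1000 - (1000 + 64000) / 1024 = 1000 - 63 = 937, which points to fragmentation. Only 32 free pages held in 4 order-3 blocks give 1000 - 3000 / 4 = 250, which points to a lack of memory and falls under the default threshold of 500.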

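The following figure shows how the fragmentation index value is computed.

Performing compaction

The following figure shows the various paths through which compaction is performed.

Check the compact priority and migrate mode used on each path as well.

The following figure shows the function flow when direct compaction is performed.

Page allocation using direct compaction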

__alloc_pages_direct_compact()

mm/page_alloc.c

/* Try memory compaction for high-order allocations before reclaim */
static struct page *
__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
                unsigned int alloc_flags, const struct alloc_context *ac,
                enum compact_priority prio, enum compact_result *compact_result)
{
        struct page *page;
        unsigned long pflags;
        unsigned int noreclaim_flag;

        if (!order)
                return NULL;

        psi_memstall_enter(&pflags);
        noreclaim_flag = memalloc_noreclaim_save();

        *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
                                                                        prio);

        memalloc_noreclaim_restore(noreclaim_flag);
        psi_memstall_leave(&pflags);

        if (*compact_result <= COMPACT_INACTIVE)
                return NULL;

        /*
         * At least in one zone compaction wasn't deferred or skipped, so let's
         * count a compaction stall
         */
        count_vm_event(COMPACTSTALL);

        page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);

        if (page) {
                struct zone *zone = page_zone(page);

                zone->compact_blockskip_flush = false;
                compaction_defer_reset(zone, order, true);
                count_vm_event(COMPACTSUCCESS);
                return page;
        }

        /*
         * It's bad if compaction run occurs and fails. The most likely reason
         * is that pages exist, but not enough to satisfy watermarks.
         */
        count_vm_event(COMPACTFAIL);

        cond_resched();

        return NULL;
}

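Performs direct compaction and then attempts the page allocation.

In code lines 11~12, an order of 0 cannot be helped by compaction, so nothing is done.
In code line 14, marks the point where the psi measurement of the current task's memory stall begins.
- psi was introduced in kernel v4.20-rc1 in 2018.
In code line 15, direct compaction needs to allocate about twice the memory of the requested order, but memory is currently short, so the pfmemalloc flag is set on the current task to allow allocation without watermark limits.
In code lines 17~18, attempts direct compaction for the requested order and obtains the compaction progress state as the result.
In code line 20, restores the current task's pfmemalloc flag to its previous state.
In code line 21, marks the point where the psi measurement ends.
In code lines 23~24, if the compaction result is at or below inactive, securing a page is unlikely, so null is returned.
In code line 30, increments the COMPACTSTALL counter.
In code lines 32~41, attempts to obtain a page. If a page is obtained, sets the zone's compact_blockskip_flush to false, resets the compaction deferral tracking counters, increments the COMPACTSUCCESS counter and returns the page.
In code lines 47~51, if the page allocation failed, increments the COMPACTFAIL stat, yields the CPU if a reschedule is needed, and leaves the function.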
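
The COMPACTSTALL, COMPACTSUCCESS and COMPACTFAIL events are exported through /proc/vmstat as compact_stall, compact_success and compact_fail. The following sketch reads one such counter. `read_vm_counter` is a hypothetical helper, and the path is a parameter so the function can be tried on any file in the same "name value" format.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up one "name value" line in a vmstat-style file.
 * Returns 0 and stores the value on success, or -1 if the file cannot
 * be opened or the counter is not present.
 */
static int read_vm_counter(const char *path, const char *name,
                           unsigned long long *value)
{
        char key[64];
        unsigned long long v;
        FILE *fp;
        int found = -1;

        assert(path != NULL && name != NULL && value != NULL);
        fp = fopen(path, "r");
        if (!fp)
                return -1;

        while (fscanf(fp, "%63s %llu", key, &v) == 2) {
                if (strcmp(key, name) == 0) {
                        *value = v;
                        found = 0;
                        break;
                }
        }
        fclose(fp);
        return found;
}
```

Reading compact_stall before and after a workload with `read_vm_counter("/proc/vmstat", "compact_stall", &v)` gives a rough count of how often direct compaction actually ran in between.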

The following figure shows the function-by-function flow of direct compaction.

try_to_compact_pages()

mm/compaction.c

/**
 * try_to_compact_pages - Direct compact to satisfy a high-order allocation
 * @gfp_mask: The GFP mask of the current allocation
 * @order: The order of the current allocation
 * @alloc_flags: The allocation flags of the current allocation
 * @ac: The context of current allocation
 * @prio: Determines how hard direct compaction should try to succeed
 *
 * This is the main entry point for direct page compaction.
 */

enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
                unsigned int alloc_flags, const struct alloc_context *ac,
                enum compact_priority prio)
{
        int may_perform_io = gfp_mask & __GFP_IO;
        struct zoneref *z;
        struct zone *zone;
        enum compact_result rc = COMPACT_SKIPPED;

        /*
         * Check if the GFP flags allow compaction - GFP_NOIO is really
         * tricky context because the migration might require IO
         */
        if (!may_perform_io)
                return COMPACT_SKIPPED;

        trace_mm_compaction_try_to_compact_pages(order, gfp_mask, prio);

        /* Compact each zone in the list */
        for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
                                                                ac->nodemask) {
                enum compact_result status;

                if (prio > MIN_COMPACT_PRIORITY
                                        && compaction_deferred(zone, order)) {
                        rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
                        continue;
                }

                status = compact_zone_order(zone, order, gfp_mask, prio,
                                        alloc_flags, ac_classzone_idx(ac));
                rc = max(status, rc);

                /* The allocation should succeed, stop compacting */
                if (status == COMPACT_SUCCESS) {
                        /*
                         * We think the allocation will succeed in this zone,
                         * but it is not certain, hence the false. The caller
                         * will repeat this with true if allocation indeed
                         * succeeds in this zone.
                         */
                        compaction_defer_reset(zone, order, false);

                        break;
                }

                if (prio != COMPACT_PRIO_ASYNC && (status == COMPACT_COMPLETE ||
                                        status == COMPACT_PARTIAL_SKIPPED))
                        /*
                         * We think that allocation won't succeed in this zone
                         * so we defer compaction there. If it ends up
                         * succeeding after all, it will be reset.
                         */
                        defer_compaction(zone, order);

                /*
                 * We might have stopped compacting due to need_resched() in
                 * async compaction, or due to a fatal signal detected. In that
                 * case do not try further zones
                 */
                if ((prio == COMPACT_PRIO_ASYNC && need_resched())
                                        || fatal_signal_pending(current))
                        break;
        }

        return rc;
}

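Attempts compaction for the requested order and returns the compaction progress state.

In code lines 14~15, migration during compaction can cause io. If the request does not allow io, compaction cannot proceed, so COMPACT_SKIPPED is returned.
In code lines 20~28, iterates over the zones of the zonelist that are within the given nodemask and at or below high_zoneidx. Unless the compaction priority is the highest level, a zone where compaction recently failed is skipped: retrying immediately is unlikely to succeed, so the zone is deferred for this attempt.
In code lines 30~45, runs compaction on the current zone. If the result is success, resets the deferral state and stops iterating so that the result is returned.
In code lines 47~54, when compaction is not running in async mode and the result is complete or partial skipped, marks the current zone as deferred.
In code lines 61~63, when async compaction is in progress and another task has requested preemption, or when a fatal signal is pending for the current task, leaves the function with the current result.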
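
The way the per-zone results are folded into one return value relies on the numeric order of enum compact_result. The sketch below isolates that folding. The `toy_` names are hypothetical, the enum copies the order of the kernel enum without its aliases, and the deferral and early-exit checks of the real loop are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Same ordering as enum compact_result, aliases omitted. */
enum toy_result {
        TOY_NOT_SUITABLE_ZONE,
        TOY_SKIPPED,
        TOY_DEFERRED,
        TOY_NO_SUITABLE_PAGE,
        TOY_CONTINUE,
        TOY_COMPLETE,
        TOY_PARTIAL_SKIPPED,
        TOY_CONTENDED,
        TOY_SUCCESS,
};

/*
 * Keep the maximum of the per-zone results and stop at the first
 * success, as the zone loop in try_to_compact_pages() does.
 */
static enum toy_result toy_aggregate(const enum toy_result *status, size_t n)
{
        enum toy_result rc = TOY_SKIPPED;
        size_t i;

        assert(status != NULL || n == 0);
        for (i = 0; i < n; i++) {
                if (status[i] > rc)
                        rc = status[i];
                if (status[i] == TOY_SUCCESS)
                        break;
        }
        return rc;
}
```

Because the caller treats any result at or below the deferred value as "compaction did not run", a single zone that was actually compacted is enough to lift the combined result above that cutoff.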

Zone compaction for an order

compact_zone_order()

mm/compaction.c

static enum compact_result compact_zone_order(struct zone *zone, int order,
                gfp_t gfp_mask, enum compact_priority prio,
                unsigned int alloc_flags, int classzone_idx)
{
        enum compact_result ret;
        struct compact_control cc = {
                .nr_freepages = 0,
                .nr_migratepages = 0,
                .total_migrate_scanned = 0,
                .total_free_scanned = 0,
                .order = order,
                .gfp_mask = gfp_mask,
                .zone = zone,
                .mode = (prio == COMPACT_PRIO_ASYNC) ?
                                        MIGRATE_ASYNC : MIGRATE_SYNC_LIGHT,
                .alloc_flags = alloc_flags,
                .classzone_idx = classzone_idx,
                .direct_compaction = true,
                .whole_zone = (prio == MIN_COMPACT_PRIORITY),
                .ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY),
                .ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY)
        };
        INIT_LIST_HEAD(&cc.freepages);
        INIT_LIST_HEAD(&cc.migratepages);

        ret = compact_zone(zone, &cc);

        VM_BUG_ON(!list_empty(&cc.freepages));
        VM_BUG_ON(!list_empty(&cc.migratepages));

        return ret;
}

Prepares a compact_control structure (cc), then performs compaction with the requested zone, order and migrate mode and returns the result.
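
The priority-dependent fields of the initializer above can be isolated as follows. This is a sketch with local copies of the two enums and a hypothetical `prio_settings` struct that keeps only the fields derived from the priority.

```c
#include <assert.h>
#include <stdbool.h>

/* Local copies of the enums quoted in this article. */
enum compact_priority {
        COMPACT_PRIO_SYNC_FULL,
        MIN_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_FULL,
        COMPACT_PRIO_SYNC_LIGHT,
        COMPACT_PRIO_ASYNC,
};

enum migrate_mode {
        MIGRATE_ASYNC,
        MIGRATE_SYNC_LIGHT,
        MIGRATE_SYNC,
};

/* Hypothetical struct: the subset of compact_control set from prio. */
struct prio_settings {
        enum migrate_mode mode;
        bool whole_zone;
        bool ignore_skip_hint;
        bool ignore_block_suitable;
};

static struct prio_settings settings_for(enum compact_priority prio)
{
        struct prio_settings s;

        assert(prio <= COMPACT_PRIO_ASYNC);
        /* Only the lowest priority migrates asynchronously. */
        s.mode = (prio == COMPACT_PRIO_ASYNC) ?
                        MIGRATE_ASYNC : MIGRATE_SYNC_LIGHT;
        /* Only the highest priority scans everything, ignoring hints. */
        s.whole_zone = (prio == MIN_COMPACT_PRIORITY);
        s.ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY);
        s.ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY);
        return s;
}
```

Read this way, the two sync priorities share the same migrate mode in direct compaction and differ only in how much of the zone they are willing to scan.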

Zone compaction

The following figure shows the processing flow of the compact_zone() function.

compact_zone()

mm/compaction.c -1/3-

static enum compact_result compact_zone(struct zone *zone, struct compact_control *cc)
{
        enum compact_result ret;
        unsigned long start_pfn = zone->zone_start_pfn;
        unsigned long end_pfn = zone_end_pfn(zone);
        const bool sync = cc->mode != MIGRATE_ASYNC;

        cc->migratetype = gfpflags_to_migratetype(cc->gfp_mask);
        ret = compaction_suitable(zone, cc->order, cc->alloc_flags,
                                                        cc->classzone_idx);
        /* Compaction is likely to fail */
        if (ret == COMPACT_SUCCESS || ret == COMPACT_SKIPPED)
                return ret;

        /* huh, compaction_suitable is returning something unexpected */
        VM_BUG_ON(ret != COMPACT_CONTINUE);

        /*
         * Clear pageblock skip if there were failures recently and compaction
         * is about to be retried after being deferred.
         */
        if (compaction_restarting(zone, cc->order))
                __reset_isolation_suitable(zone);

        /*
         * Setup to move all movable pages to the end of the zone. Used cached
         * information on where the scanners should start (unless we explicitly
         * want to compact the whole zone), but check that it is initialised
         * by ensuring the values are within zone boundaries.
         */
        if (cc->whole_zone) {
                cc->migrate_pfn = start_pfn;
                cc->free_pfn = pageblock_start_pfn(end_pfn - 1);
        } else {
                cc->migrate_pfn = zone->compact_cached_migrate_pfn[sync];
                cc->free_pfn = zone->compact_cached_free_pfn;
                if (cc->free_pfn < start_pfn || cc->free_pfn >= end_pfn) {
                        cc->free_pfn = pageblock_start_pfn(end_pfn - 1);
                        zone->compact_cached_free_pfn = cc->free_pfn;
                }
                if (cc->migrate_pfn < start_pfn || cc->migrate_pfn >= end_pfn) {
                        cc->migrate_pfn = start_pfn;
                        zone->compact_cached_migrate_pfn[0] = cc->migrate_pfn;
                        zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn;
                }

                if (cc->migrate_pfn == start_pfn)
                        cc->whole_zone = true;
        }

        cc->last_migrated_pfn = 0;

        trace_mm_compaction_begin(start_pfn, cc->migrate_pfn,
                                cc->free_pfn, end_pfn, sync);

        migrate_prep_local();

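Attempts compaction for the requested order and returns the compaction progress state.

In code lines 4~5, compaction covers the range from the zone's start pfn to its end pfn.
In code line 6, determines whether migration is synchronous.
In code line 8, derives the migrate type from the gfp flags.
- unmovable(0), movable(1), reclaimable(2)
In code lines 9~13, obtains the result of whether compaction should proceed. If a page can already be allocated or the result is skipped, leaves the function without compacting.
In code lines 22~23, if compaction deferral has reached the maximum count (63), clears every PB_migrate_skip bit in the zone's usemap (pageblock_flags) so that compaction starts again from the beginning.
In code lines 31~33, on a fresh start, sets the migrate scanner to begin at the zone's start pfn and the free scanner to begin at the last pageblock of the zone.
In code lines 34~36, when continuing from a previous compaction, lets the migrate scanner and the free scanner resume from the pfn positions they last processed.
In code lines 37~40, if the free scanner's pfn is outside the zone range, moves it back to the page at the start of the zone's last block.
In code lines 41~45, if the migrate scanner's pfn is outside the zone range, moves it back to the page at the start of the zone's first block.
- Two caches remember the migrate pfn position, one for async(0) and one for sync(1).
In code lines 47~48, if the migrate scanner's pfn is at the start position, sets whole_zone to true.
In code line 51, resets the last migrated pfn to 0.
In code line 56, performs the work the local cpu has to do before migration starts.
- Drains the pages held in the pagevec lru caches to the lru lists.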
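
The validation of the cached scanner positions can be tried with concrete pfn values. The sketch below uses hypothetical `toy_` names and assumes a pageblock of 1024 pages, matching "Pages per block: 1024" in the pagetypeinfo output earlier. It covers only the non-whole_zone branch.

```c
#include <assert.h>

#define TOY_PAGEBLOCK_NR_PAGES 1024UL   /* assumed: pageblock order 10 */

/* Round a pfn down to the first pfn of its pageblock. */
static unsigned long toy_pageblock_start_pfn(unsigned long pfn)
{
        return pfn & ~(TOY_PAGEBLOCK_NR_PAGES - 1);
}

struct toy_scanners {
        unsigned long migrate_pfn;
        unsigned long free_pfn;
};

/*
 * Start from the cached positions, falling back to the zone edges when a
 * cached value lies outside [start_pfn, end_pfn).
 */
static struct toy_scanners toy_scanner_start(unsigned long start_pfn,
                                             unsigned long end_pfn,
                                             unsigned long cached_migrate_pfn,
                                             unsigned long cached_free_pfn)
{
        struct toy_scanners s = { cached_migrate_pfn, cached_free_pfn };

        assert(start_pfn < end_pfn);
        if (s.free_pfn < start_pfn || s.free_pfn >= end_pfn)
                s.free_pfn = toy_pageblock_start_pfn(end_pfn - 1);
        if (s.migrate_pfn < start_pfn || s.migrate_pfn >= end_pfn)
                s.migrate_pfn = start_pfn;
        return s;
}
```

For a zone spanning pfn 65536 to 131328 with uninitialised (zero) caches, the migrate scanner starts at 65536 and the free scanner at 131072, the first pfn of the last pageblock.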

mm/compaction.c -2/3-

        while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) {
                int err;

                switch (isolate_migratepages(zone, cc)) {
                case ISOLATE_ABORT:
                        ret = COMPACT_CONTENDED;
                        putback_movable_pages(&cc->migratepages);
                        cc->nr_migratepages = 0;
                        goto out;
                case ISOLATE_NONE:
                        /*
                         * We haven't isolated and migrated anything, but
                         * there might still be unflushed migrations from
                         * previous cc->order aligned block.
                         */
                        goto check_drain;
                case ISOLATE_SUCCESS:
                        ;
                }

                err = migrate_pages(&cc->migratepages, compaction_alloc,
                                compaction_free, (unsigned long)cc, cc->mode,
                                MR_COMPACTION);

                trace_mm_compaction_migratepages(cc->nr_migratepages, err,
                                                        &cc->migratepages);

                /* All pages were either migrated or will be released */
                cc->nr_migratepages = 0;
                if (err) {
                        putback_movable_pages(&cc->migratepages);
                        /*
                         * migrate_pages() may return -ENOMEM when scanners meet
                         * and we want compact_finished() to detect it
                         */
                        if (err == -ENOMEM && !compact_scanners_met(cc)) {
                                ret = COMPACT_CONTENDED;
                                goto out;
                        }
                        /*
                         * We failed to migrate at least one page in the current
                         * order-aligned block, so skip the rest of it.
                         */
                        if (cc->direct_compaction &&
                                                (cc->mode == MIGRATE_ASYNC)) {
                                cc->migrate_pfn = block_end_pfn(
                                                cc->migrate_pfn - 1, cc->order);
                                /* Draining pcplists is useless in this case */
                                cc->last_migrated_pfn = 0;

                        }
                }

In code line 1, the loop runs while compact_finished() returns COMPACT_CONTINUE.
In code lines 4~9, if page isolation returns ISOLATE_ABORT, the compaction result is set to COMPACT_CONTENDED, the isolated pages are put back, the migrate page count is cleared to 0, and the function exits through the out label.
In code lines 10~16, ISOLATE_NONE means that no page was isolated. Control jumps to the check_drain label to drain the cpu caches, and the loop then continues.
In code lines 17~19, on ISOLATE_SUCCESS the routine continues on to migration.
In code lines 21~23, the pages isolated by the migrate scanner are migrated to the free pages found by the free scanner.
In code lines 30~31, migration has failed. The pages that were to be migrated are put back where they were.
In code lines 36~39, if memory ran out (-ENOMEM) before the scanners met, the result is set to COMPACT_CONTENDED and the function exits through the out label.
In code lines 44~51, for an async direct-compaction request, the rest of the order-aligned block currently being processed is skipped.
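
The skip in the last step is plain alignment arithmetic. The sketch below assumes that block_end_pfn() rounds pfn + 1 up to the next multiple of 1 << order, which is how the macro is understood to behave here. ALIGN_UP and skip_failed_block() are hypothetical names introduced for this example.

```c
/* Round x up to a multiple of a; a must be a power of two. */
#define ALIGN_UP(x, a)              (((x) + (a) - 1) & ~((a) - 1))

/* Assumed definition: first pfn after the order-aligned block holding pfn. */
#define block_end_pfn(pfn, order)   ALIGN_UP((pfn) + 1, 1UL << (order))

/*
 * Where the migrate scanner resumes after a failed migration in async
 * direct compaction: the end of the block that contains migrate_pfn - 1.
 */
static unsigned long skip_failed_block(unsigned long migrate_pfn,
                                       unsigned int order)
{
        return block_end_pfn(migrate_pfn - 1, order);
}
```

With order 4, a scanner at pfn 0x1234 moves to 0x1240. A scanner already on a block boundary such as 0x1240 stays there, because migrate_pfn - 1 still belongs to the previous block.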

mm/compaction.c -3/3-

check_drain:
                /*
                 * Has the migration scanner moved away from the previous
                 * cc->order aligned block where we migrated from? If yes,
                 * flush the pages that were freed, so that they can merge and
                 * compact_finished() can detect immediately if allocation
                 * would succeed.
                 */
                if (cc->order > 0 && cc->last_migrated_pfn) {
                        int cpu;
                        unsigned long current_block_start =
                                block_start_pfn(cc->migrate_pfn, cc->order);

                        if (cc->last_migrated_pfn < current_block_start) {
                                cpu = get_cpu();
                                lru_add_drain_cpu(cpu);
                                drain_local_pages(zone);
                                put_cpu();
                                /* No more flushing until we migrate again */
                                cc->last_migrated_pfn = 0;
                        }
                }

        }

out:
        /*
         * Release free pages and update where the free scanner should restart,
         * so we don't leave any returned pages behind in the next attempt.
         */
        if (cc->nr_freepages > 0) {
                unsigned long free_pfn = release_freepages(&cc->freepages);

                cc->nr_freepages = 0;
                VM_BUG_ON(free_pfn == 0);
                /* The cached pfn is always the first in a pageblock */
                free_pfn = pageblock_start_pfn(free_pfn);
                /*
                 * Only go back, not forward. The cached pfn might have been
                 * already reset to zone end in compact_finished()
                 */
                if (free_pfn > zone->compact_cached_free_pfn)
                        zone->compact_cached_free_pfn = free_pfn;
        }

        count_compact_events(COMPACTMIGRATE_SCANNED, cc->total_migrate_scanned);
        count_compact_events(COMPACTFREE_SCANNED, cc->total_free_scanned);

        trace_mm_compaction_end(start_pfn, cc->migrate_pfn,
                                cc->free_pfn, end_pfn, sync, ret);

        return ret;
}

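In code line 1, the check_drain: label decides whether to flush the pagevecs (the lru caches).
In code lines 9~22, if the requested order is not 0 and the last migrated pfn lies below the block currently being scanned, the lru caches (pagevecs) and the local per-cpu page lists are flushed. This makes it more likely that the freed pages merge, and lets compact_finished() detect immediately whether the allocation would succeed.
In code lines 26~44, this is the out: label. The free pages held for the free scanner are released, and the position is remembered in the cache.
In code lines 46~47, the COMPACTMIGRATE_SCANNED and COMPACTFREE_SCANNED counters are updated.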
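
The drain decision at check_drain can be written as a small predicate. This is a sketch under the assumption that block_start_pfn() rounds a pfn down to a multiple of 1 << order. should_drain() is a hypothetical helper, not a kernel function.

```c
#include <stdbool.h>

/* Assumed behaviour: first pfn of the order-aligned block holding pfn. */
static unsigned long block_start_pfn(unsigned long pfn, unsigned int order)
{
        return pfn & ~((1UL << order) - 1);
}

/*
 * True when the migrate scanner has left the order-aligned block that
 * pages were last migrated from, so the per-cpu caches are worth flushing.
 */
static bool should_drain(unsigned int order,
                         unsigned long last_migrated_pfn,
                         unsigned long migrate_pfn)
{
        if (order == 0 || last_migrated_pfn == 0)
                return false;
        return last_migrated_pfn < block_start_pfn(migrate_pfn, order);
}
```

With order 4 and a last migrated pfn of 0x1238, the flush happens once the scanner reaches 0x1240, but not while it is still at 0x123c inside the same block.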

release_freepages()

mm/compaction.c

static unsigned long release_freepages(struct list_head *freelist)
{
        struct page *page, *next;
        unsigned long high_pfn = 0;

        list_for_each_entry_safe(page, next, freelist, lru) {
                unsigned long pfn = page_to_pfn(page);
                list_del(&page->lru);
                __free_page(page);
                if (pfn > high_pfn)
                        high_pfn = pfn;
        }

        return high_pfn;
}

Removes every page from the freelist, frees them all, and returns the highest pfn value.

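Compaction deferral

Determines whether compaction should be deferred for a zone. The deferral counter is incremented on each check, and compaction keeps being deferred until just before the limit is reached. The limit for the deferral counter (compact_considered) rises with each compact_defer_shift step, so the counter can grow to a maximum of 64.

The following figure shows the purpose of the three functions that increment or reset the deferral counter and the deferral shift counter.

When compaction_deferred() returns true, compaction is deferred rather than run immediately.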

compaction_deferred()

mm/compaction.c

/* Returns true if compaction should be skipped this time */
bool compaction_deferred(struct zone *zone, int order)
{
        unsigned long defer_limit = 1UL << zone->compact_defer_shift;

        if (order < zone->compact_order_failed)
                return false;

        /* Avoid possible overflow */
        if (++zone->compact_considered > defer_limit)
                zone->compact_considered = defer_limit;

        if (zone->compact_considered >= defer_limit)
                return false;

        trace_mm_compaction_deferred(zone, order);

        return true;
}

Returns whether compaction is deferred and must be skipped this time. (true=defer compaction, false=run compaction)

In code lines 6~7, if the requested order is smaller than the order that failed in the last compaction, compaction is worth trying again, so false is returned.
In code lines 10~18, the zone's deferral counter (compact_considered) is incremented. Below the deferral limit (1 << compact_defer_shift) the function returns true to defer compaction. At or above the limit it returns false so that compaction is attempted.

defer_compaction()

mm/compaction.c

/*
 * Compaction is deferred when compaction fails to result in a page
 * allocation success. 1 << compact_defer_limit compactions are skipped up
 * to a limit of 1 << COMPACT_MAX_DEFER_SHIFT
 */
void defer_compaction(struct zone *zone, int order)
{
        zone->compact_considered = 0;
        zone->compact_defer_shift++;

        if (order < zone->compact_order_failed)
                zone->compact_order_failed = order;

        if (zone->compact_defer_shift > COMPACT_MAX_DEFER_SHIFT)
                zone->compact_defer_shift = COMPACT_MAX_DEFER_SHIFT;

        trace_mm_compaction_defer_compaction(zone, order);
}

Each time compaction in the requested zone completes without being able to allocate the order page, the deferral counter is reset to 0 and the deferral limit doubles through 1, 2, 4, 8, 16, 32, up to 64.

compaction_defer_reset()

mm/compaction.c

/*
 * Update defer tracking counters after successful compaction of given order,
 * which means an allocation either succeeded (alloc_success == true) or is
 * expected to succeed.
 */

void compaction_defer_reset(struct zone *zone, int order,
                bool alloc_success) 
{
        if (alloc_success) {
                zone->compact_considered = 0;
                zone->compact_defer_shift = 0; 
        }               
        if (order >= zone->compact_order_failed)
                zone->compact_order_failed = order + 1;

        trace_mm_compaction_defer_reset(zone, order);
}

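Called when allocation of the order page is expected to succeed after compaction in the requested zone. When it is called on an actual allocation success, the deferral counter and the deferral shift counter are reset to 0.

In code lines 4~7, if the order page allocation succeeded after compaction in the requested zone, the deferral counter and the deferral shift counter are reset to 0.
In code lines 8~9, if the requested order is at or above the recorded failed order, the failed order is set to the requested order + 1.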

compaction_restarting()

mm/compaction.c

/* Returns true if restarting compaction after many failures */
bool compaction_restarting(struct zone *zone, int order)
{
        if (order < zone->compact_order_failed)
                return false;

        return zone->compact_defer_shift == COMPACT_MAX_DEFER_SHIFT &&
                zone->compact_considered >= 1UL << zone->compact_defer_shift;
}

Returns true when the compaction deferral counter has reached its maximum limit (64).

Returns false if the requested order is smaller than compact_order_failed.
Returns true if compact_defer_shift is at its maximum (6) and compact_considered is 64 or more.
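
The back-off formed by compaction_deferred(), defer_compaction() and compaction_restarting() can be tried out in user space. The sketch below copies the arithmetic of the listings into a hypothetical struct zone_defer. The model_ prefix, skipped_after() and restarting_after() are names made up for this example, and compact_order_failed is assumed to start at 0.

```c
#include <stdbool.h>

#define COMPACT_MAX_DEFER_SHIFT 6

/* Hypothetical stand-in for the three deferral fields of struct zone. */
struct zone_defer {
        unsigned int compact_considered;
        unsigned int compact_defer_shift;
        int compact_order_failed;
};

/* Same arithmetic as compaction_deferred(): true means "skip this time". */
static bool model_deferred(struct zone_defer *z, int order)
{
        unsigned long defer_limit = 1UL << z->compact_defer_shift;

        if (order < z->compact_order_failed)
                return false;
        if (++z->compact_considered > defer_limit)
                z->compact_considered = (unsigned int)defer_limit;
        return z->compact_considered < defer_limit;
}

/* Same arithmetic as defer_compaction(): called after a failed run. */
static void model_defer(struct zone_defer *z, int order)
{
        z->compact_considered = 0;
        z->compact_defer_shift++;
        if (order < z->compact_order_failed)
                z->compact_order_failed = order;
        if (z->compact_defer_shift > COMPACT_MAX_DEFER_SHIFT)
                z->compact_defer_shift = COMPACT_MAX_DEFER_SHIFT;
}

/* Same condition as compaction_restarting(). */
static bool model_restarting(const struct zone_defer *z, int order)
{
        if (order < z->compact_order_failed)
                return false;
        return z->compact_defer_shift == COMPACT_MAX_DEFER_SHIFT &&
                z->compact_considered >= (1UL << z->compact_defer_shift);
}

/* Requests skipped in a row after `failures` failed compaction runs. */
static unsigned int skipped_after(unsigned int failures, int order)
{
        struct zone_defer z = { 0, 0, 0 };
        unsigned int skipped = 0, i;

        for (i = 0; i < failures; i++)
                model_defer(&z, order);
        while (model_deferred(&z, order))
                skipped++;
        return skipped;
}

/* Whether the first request that is let through counts as a restart. */
static bool restarting_after(unsigned int failures, int order)
{
        struct zone_defer z = { 0, 0, 0 };
        unsigned int i;

        for (i = 0; i < failures; i++)
                model_defer(&z, order);
        while (model_deferred(&z, order))
                ;
        return model_restarting(&z, order);
}
```

In this model one failure skips 1 request, three failures skip 7, and six or more failures skip 63. The 64th request is let through, and at that point the restart condition holds.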

Compaction completion check

compact_finished()

mm/compaction.c

static enum compact_result compact_finished(struct zone *zone,
                        struct compact_control *cc)
{
        int ret;

        ret = __compact_finished(zone, cc);
        trace_mm_compaction_finished(zone, cc->order, ret);
        if (ret == COMPACT_NO_SUITABLE_PAGE)
                ret = COMPACT_CONTINUE;

        return ret;
}

Returns the progress status used to decide whether compaction is finished.

The following figure shows how the progress status is obtained to decide whether compaction is finished.

__compact_finished()

mm/compaction.c -1/2-

static enum compact_result __compact_finished(struct zone *zone,
                                                struct compact_control *cc)
{
        unsigned int order;
        const int migratetype = cc->migratetype;

        if (cc->contended || fatal_signal_pending(current))
                return COMPACT_CONTENDED;

        /* Compaction run completes if the migrate and free scanner meet */
        if (compact_scanners_met(cc)) {
                /* Let the next compaction start anew. */
                reset_cached_positions(zone);

                /*
                 * Mark that the PG_migrate_skip information should be cleared
                 * by kswapd when it goes to sleep. kcompactd does not set the
                 * flag itself as the decision to be clear should be directly
                 * based on an allocation request.
                 */
                if (cc->direct_compaction)
                        zone->compact_blockskip_flush = true;

                if (cc->whole_zone)
                        return COMPACT_COMPLETE;
                else
                        return COMPACT_PARTIAL_SKIPPED;
        }

        if (is_via_compact_memory(cc->order))
                return COMPACT_CONTINUE;

        if (cc->finishing_block) {
                /*
                 * We have finished the pageblock, but better check again that
                 * we really succeeded.
                 */
                if (IS_ALIGNED(cc->migrate_pfn, pageblock_nr_pages))
                        cc->finishing_block = false;
                else
                        return COMPACT_CONTINUE;
        }

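In code lines 7~8, if during compaction the current task has an urgent preemption request (contended) or a pending fatal signal, COMPACT_CONTENDED is returned.
In code lines 11~28, the free scanner and the migrate scanner have met, which means compaction is complete. The scan start positions are reset for the next scan. If the whole zone was scanned, COMPACT_COMPLETE is returned. If only part of the zone was scanned, COMPACT_PARTIAL_SKIPPED is returned.
In code lines 30~31, when compaction was triggered by the user, COMPACT_CONTINUE is returned so that every block is compacted unconditionally (forced).
- "echo 1 > /proc/sys/vm/compact_memory"
In code lines 33~42, a pageblock is being finished. The code checks again whether the migrate scanner really reached the end of the pageblock. If it stopped partway, COMPACT_CONTINUE is returned so that compaction carries on.
- See: mm, compaction: finish whole pageblock to reduce fragmentation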

mm/compaction.c -2/2-

        /* Direct compactor: Is a suitable page free? */
        for (order = cc->order; order < MAX_ORDER; order++) {
                struct free_area *area = &zone->free_area[order];
                bool can_steal;

                /* Job done if page is free of the right migratetype */
                if (!list_empty(&area->free_list[migratetype]))
                        return COMPACT_SUCCESS;

#ifdef CONFIG_CMA
                /* MIGRATE_MOVABLE can fallback on MIGRATE_CMA */
                if (migratetype == MIGRATE_MOVABLE &&
                        !list_empty(&area->free_list[MIGRATE_CMA]))
                        return COMPACT_SUCCESS;
#endif
                /*
                 * Job done if allocation would steal freepages from
                 * other migratetype buddy lists.
                 */
                if (find_suitable_fallback(area, order, migratetype,
                                                true, &can_steal) != -1) {

                        /* movable pages are OK in any pageblock */
                        if (migratetype == MIGRATE_MOVABLE)
                                return COMPACT_SUCCESS;

                        /*
                         * We are stealing for a non-movable allocation. Make
                         * sure we finish compacting the current pageblock
                         * first so it is as free as possible and we won't
                         * have to steal another one soon. This only applies
                         * to sync compaction, as async compaction operates
                         * on pageblocks of the same migratetype.
                         */
                        if (cc->mode == MIGRATE_ASYNC ||
                                        IS_ALIGNED(cc->migrate_pfn,
                                                        pageblock_nr_pages)) {
                                return COMPACT_SUCCESS;
                        }

                        cc->finishing_block = true;
                        return COMPACT_CONTINUE;
                }
        }

        return COMPACT_NO_SUITABLE_PAGE;
}

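Checks whether the required free page can be obtained.

In code lines 2~8, the orders from @order up to the last order are iterated. If a free page is found on that order's free list for the requested migrate type, COMPACT_SUCCESS is returned.
In code lines 12~14, for a movable page request, COMPACT_SUCCESS is returned if a free page is found on the cma type list.
In code lines 20~25, if a free page can be taken from another migrate type and the request is movable, COMPACT_SUCCESS is returned.
In code lines 35~39, if compaction is running async or the migrate scanner has just completed a pageblock, COMPACT_SUCCESS is returned.
In code lines 41~42, finishing_block is set to true and COMPACT_CONTINUE is returned, so that compaction carries on without stopping until the current pageblock has been completed.
In code line 46, no suitable page could be obtained, so COMPACT_NO_SUITABLE_PAGE is returned.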
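
The first two checks in the loop can be modelled with a table of free-list lengths. This is a deliberately reduced sketch: the fallback (steal) path through find_suitable_fallback() is left out, MAX_ORDER is assumed to be 11, and enum mt, nr_free[][] and suitable_page_free() are hypothetical names.

```c
#include <stdbool.h>

#define MAX_ORDER 11    /* assumed value; it depends on the configuration */

/* Hypothetical subset of the migrate types used by this sketch. */
enum mt { MT_UNMOVABLE, MT_MOVABLE, MT_RECLAIMABLE, MT_CMA, MT_NR };

/*
 * nr_free[order][type] holds the number of free blocks on each list.
 * Returns true when a block of the requested order or larger is free on
 * the list of the requested type, or on the cma list for movable requests.
 */
static bool suitable_page_free(unsigned int nr_free[MAX_ORDER][MT_NR],
                               unsigned int order, enum mt migratetype)
{
        unsigned int o;

        for (o = order; o < MAX_ORDER; o++) {
                if (nr_free[o][migratetype] > 0)
                        return true;
                if (migratetype == MT_MOVABLE && nr_free[o][MT_CMA] > 0)
                        return true;
        }
        return false;
}
```

A single free order-3 movable block satisfies an order-2 movable request, while a free order-1 block does not. A free cma block helps movable requests only.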


Zoned Allocator -6- (Watermark)

2016-06-29 / 2020-01-15 문영일

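Zone watermarks

When a page allocation is attempted in a zone, the remaining free pages are compared with the per-zone watermarks to decide whether the allocation may proceed. Three watermark levels are used, with the following meanings.

WATERMARK_MIN
- Equal to the lowest threshold for memory shortage.
- Holds the number of pages corresponding to vm.min_free_kbytes.
- When the computed free pages fall below this value, a normal page allocation cannot be served. In that case, if synchronous direct page reclaim is allowed, the allocation proceeds after pages have been reclaimed.
- In special situations (atomic, oom, pfmemalloc), page allocation remains possible even below this threshold.
WATERMARK_LOW
- By default 125% of the min value, in pages. This is the point at which kswapd is woken up automatically.
  - The user can raise this threshold by changing watermark_scale_factor.
WATERMARK_HIGH
- By default 150% of the min value, in pages. This is the point at which kswapd goes back to sleep.
- The user can raise this threshold by changing watermark_scale_factor.

The following figure shows the relationship between allocation and page reclaim according to the zone watermark settings.

The nr_free_pages value and the min, low, and high watermarks can be checked as follows.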

$ cat /proc/zoneinfo
Node 0, zone    DMA32
  per-node stats
(...omitted...)
  pages free     633680
        min      5632             <-----
        low      7040             <-----
        high     8448             <-----
        spanned  786432
        present  786432
        managed  765771           
        protection: (0, 0, 0)
      nr_free_pages 633680        <-----
(...omitted...)

managed_pages

The number of pages the buddy system manages in each zone. The initial values for the dma and normal zones are as follows.

managed_pages = present_pages - number of pages reserved just before the buddy system starts running
managed_pages = spanned_pages - absent_pages - (memmap_pages + dma_reserve + …)
- present_pages = spanned_pages - absent_pages

The managed_pages value of each zone can be checked as follows.

$ cat /proc/zoneinfo
Node 0, zone    DMA32
  per-node stats
(...omitted...)
  pages free     633680
        min      5632             
        low      7040             
        high     8448             
        spanned  786432
        present  786432
        managed  765771           <-----
        protection: (0, 0, 0)
      nr_free_pages 633680        
(...omitted...)

Fast watermark comparison

zone_watermark_fast()

mm/page_alloc.c

static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
                unsigned long mark, int classzone_idx, unsigned int alloc_flags)
{
        long free_pages = zone_page_state(z, NR_FREE_PAGES);
        long cma_pages = 0;

#ifdef CONFIG_CMA
        /* If allocation can't use CMA areas don't use free CMA pages */
        if (!(alloc_flags & ALLOC_CMA))
                cma_pages = zone_page_state(z, NR_FREE_CMA_PAGES);
#endif

        /*
         * Fast check for order-0 only. If this fails then the reserves
         * need to be calculated. There is a corner case where the check
         * passes but only the high-order atomic reserve are free. If
         * the caller is !atomic then it'll uselessly search the free
         * list. That corner case is then slower but it is harmless.
         */
        if (!order && (free_pages - cma_pages) > mark + z->lowmem_reserve[classzone_idx])
                return true;

        return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
                                        free_pages);
}

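Quickly determines whether an @order page can be obtained from the requested zone @z while the free pages stay above the @mark threshold. A dedicated check for order 0 requests has been added. (true=above the watermark, ok; false=at or below the watermark, not ok)

In code line 4, the zone's free page count is read.
In code lines 9~10, a request that is not for movable pages cannot use pages in the cma area, so the number of free cma pages is read in order to exclude them.
In code lines 19~20, an order 0 request can be decided quickly. If the free pages excluding cma pages exceed the watermark plus the lowmem reserve, true is returned.
In code lines 22~23, for requests of order 1 or higher, or for an order 0 request that did not pass the fast check, the function returns the result of the full watermark check.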
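
The order 0 fast path is a single comparison. The sketch below restates it with plain long values. order0_fast_ok() and its parameter names are hypothetical, and the numbers in the usage note are the DMA32 values from the /proc/zoneinfo output shown earlier.

```c
#include <stdbool.h>

/*
 * Fast check for an order 0 request: free pages that the request may
 * actually use must exceed the watermark plus the lowmem reserve.
 * unusable_cma_pages is 0 when the request is allowed to use cma pages.
 */
static bool order0_fast_ok(long free_pages, long unusable_cma_pages,
                           long mark, long lowmem_reserve)
{
        return (free_pages - unusable_cma_pages) > (mark + lowmem_reserve);
}
```

With 633680 free pages, a min watermark of 5632 and no reserve, the check passes. With exactly 5632 usable free pages it fails, because the comparison is strict and the request then goes through the full check.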

The following figure shows how the watermark is used to decide quickly whether an order 0 page allocation can proceed.

Basic watermark comparison

zone_watermark_ok()

mm/page_alloc.c

bool zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
                      int classzone_idx, int alloc_flags)
{
        return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
                                        zone_page_state(z, NR_FREE_PAGES));
}

Checks whether the zone has enough free pages relative to the watermark to serve an order page request.

__zone_watermark_ok()

mm/page_alloc.c

/*
 * Return true if free base pages are above 'mark'. For high-order checks it
 * will return true of the order-0 watermark is reached and there is at least
 * one free page of a suitable size. Checking now avoids taking the zone lock
 * to check in the allocation paths if no pages are free.
 */

bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
                         int classzone_idx, unsigned int alloc_flags,
                         long free_pages)
{
        long min = mark;
        int o;
        const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));

        /* free_pages may go negative - that's OK */
        free_pages -= (1 << order) - 1;

        if (alloc_flags & ALLOC_HIGH)
                min -= min / 2;

        /*
         * If the caller does not have rights to ALLOC_HARDER then subtract
         * the high-atomic reserves. This will over-estimate the size of the
         * atomic reserve but it avoids a search.
         */
        if (likely(!alloc_harder)) {
                free_pages -= z->nr_reserved_highatomic;
        } else {
                /*
                 * OOM victims can try even harder than normal ALLOC_HARDER
                 * users on the grounds that it's definitely going to be in
                 * the exit path shortly and free memory. Any allocation it
                 * makes during the free path will be small and short-lived.
                 */
                if (alloc_flags & ALLOC_OOM)
                        min -= min / 2;
                else
                        min -= min / 4;
        }


#ifdef CONFIG_CMA
        /* If allocation can't use CMA areas don't use free CMA pages */
        if (!(alloc_flags & ALLOC_CMA))
                free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
#endif

        /*
         * Check watermarks for an order-0 allocation request. If these
         * are not met, then a high-order request also cannot go ahead
         * even if a suitable page happened to be free.
         */
        if (free_pages <= min + z->lowmem_reserve[classzone_idx])
                return false;

        /* If this is an order-0 request then the watermark is fine */
        if (!order)
                return true;

        /* For a high-order request, check at least one suitable page is free */
        for (o = order; o < MAX_ORDER; o++) {
                struct free_area *area = &z->free_area[o];
                int mt;

                if (!area->nr_free)
                        continue;

                for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
                        if (!list_empty(&area->free_list[mt]))
                                return true;
                }

#ifdef CONFIG_CMA
                if ((alloc_flags & ALLOC_CMA) &&
                    !list_empty(&area->free_list[MIGRATE_CMA])) {
                        return true;
                }
#endif
                if (alloc_harder &&
                        !list_empty(&area->free_list[MIGRATE_HIGHATOMIC]))
                        return true;
        }
        return false;
}

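Checks whether the zone has enough free pages relative to the watermark to serve an order page request.

In code line 5, the watermark value is first assigned to min.
In code line 7, whether the allocation must be attempted even at a lowered watermark is stored in alloc_harder.
- Allocations for gfp_atomic requests and in oom situations must be made to succeed as far as the system can manage.
In code line 10, the free pages are reduced by the number of requested pages minus 1.
In code lines 12~13, atomic requests set the ALLOC_HIGH allocation flag, and the watermark is halved so that the allocation has the best chance of succeeding.
In code lines 20~21, outside an alloc_harder situation the reserved_highatomic pages are subtracted so that they are not used.
In code lines 22~33, in an alloc_harder situation the watermark is lowered further: by 25% for atomic requests and by 50% in an oom situation.
In code lines 38~39, a request that is not for movable pages cannot use the cma area, so the free cma pages are first excluded from the free page count.
In code lines 47~48, if the free pages are at or below the watermark plus the lowmem reserve, memory is short and false is returned.
In code lines 51~52, true is returned for an order 0 request.
In code lines 55~76, for requests of order 1 or higher the free lists are actually searched to check whether allocation is possible. The free_area entries in the range [requested order, MAX_ORDER) are iterated, and true is returned if any of the following three conditions holds.
- The unmovable, movable, or reclaimable type list has a free page
- The request is a movable page request that may use the cma area, and the cma type list has a free page
- The situation is alloc_harder and the highatomic type list has a free page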
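
The way the flags lower the threshold can be followed with concrete numbers. The sketch below models only the order 0 part of the check (no cma handling and no free list search). The M_ALLOC_* bits are hypothetical values chosen for this example and are not the kernel's ALLOC_* constants.

```c
#include <stdbool.h>

/* Hypothetical flag bits for this model only. */
#define M_ALLOC_HIGH    0x1u
#define M_ALLOC_HARDER  0x2u
#define M_ALLOC_OOM     0x4u

/* Watermark after the reductions granted by the allocation flags. */
static long effective_min(long mark, unsigned int flags)
{
        long min = mark;

        if (flags & M_ALLOC_HIGH)
                min -= min / 2;         /* halve for high priority */
        if (flags & M_ALLOC_OOM)
                min -= min / 2;         /* oom victim: another 50% */
        else if (flags & M_ALLOC_HARDER)
                min -= min / 4;         /* try harder: another 25% */
        return min;
}

/* Base watermark test that precedes the free list search. */
static bool base_watermark_ok(long free_pages, unsigned int order, long mark,
                              long lowmem_reserve, long reserved_highatomic,
                              unsigned int flags)
{
        bool harder = (flags & (M_ALLOC_HARDER | M_ALLOC_OOM)) != 0;

        free_pages -= (1L << order) - 1;
        if (!harder)
                free_pages -= reserved_highatomic;
        return free_pages > effective_min(mark, flags) + lowmem_reserve;
}
```

Starting from a mark of 5632 pages, the threshold becomes 2816 with M_ALLOC_HIGH, 2112 with M_ALLOC_HIGH and M_ALLOC_HARDER, and 1408 with M_ALLOC_HIGH and M_ALLOC_OOM.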

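The following figure shows how the zone watermark settings are used to check whether free pages can be allocated.

As in the left figure, first check whether enough free pages remain.
Then, as in the right figure, search free_area[] to make the final decision on whether allocation is possible.
The zone index used for lowmem_reserve[] is the index of the preferred zone that was originally requested.

Accurate watermark comparison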

zone_watermark_ok_safe()

mm/page_alloc.c

bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
                        unsigned long mark, int classzone_idx)
{
        long free_pages = zone_page_state(z, NR_FREE_PAGES);

        if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
                free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);

        return __zone_watermark_ok(z, order, mark, classzone_idx, 0,
                                                                free_pages);
}

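Checks whether the zone's accurately computed free pages are sufficient relative to the watermark to serve an order page request.

In code line 4, an approximate free page count is read first.
In code lines 6~7, if that free page count is below percpu_drift_mark, the free page count is computed accurately.
- The percpu_drift_mark value is set in refresh_zone_stat_thresholds().
In code lines 9~10, the free page count is compared with the watermark and the function returns whether allocation is possible.

Zone counters (stat)

For performance, each zone statistic is kept in two separate kinds of counter: a zone counter and per-cpu counters. When only an approximate value is needed, the zone counter is returned as is. Only when an accurate value is required are the zone counter and the per-cpu counters added together.

percpu_drift_mark

Many kernel memory management routines judge memory shortage by reading the number of remaining free pages and comparing it with a watermark. Reading an accurate free page count requires reading and adding the zone counter and every per-cpu counter, and doing so every time would hurt performance. Instead, only the zone counter is read and compared with percpu_drift_mark, which is set somewhat above the high watermark. The more accurate computation is performed only when the value is below it.

stat_threshold

When a counter changes, the per-cpu counter is updated first for performance. To keep the zone counter reasonably accurate, the per-cpu value is folded into the zone counter only when it exceeds the stat_threshold.

The following figure shows how a change to a cpu's zone counter is moved to the zone counter once it exceeds stat_threshold.

The stat_threshold value can be checked as follows.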

$ cat /proc/zoneinfo
Node 0, zone    DMA32
  per-node stats
(...omitted...)
  pagesets
    cpu: 0
              count: 143
              high:  378
              batch: 63
  vm stats threshold: 24         <-----
    cpu: 1
              count: 285
              high:  378
              batch: 63
  vm stats threshold: 24         <-----
(...omitted...)

zone_page_state()

include/linux/vmstat.h

static inline unsigned long zone_page_state(struct zone *zone,
                                        enum zone_stat_item item)
{
        long x = atomic_long_read(&zone->vm_stat[item]);
#ifdef CONFIG_SMP
        if (x < 0)
                x = 0;
#endif
        return x;
}

Returns the approximate vm stat value for @item of the zone.

Only the zone's own vm stat value is returned.

zone_page_state_snapshot()

include/linux/vmstat.h

/*
 * More accurate version that also considers the currently pending
 * deltas. For that we need to loop over all cpus to find the current
 * deltas. There is no synchronization so the result cannot be
 * exactly accurate either.
 */

static inline unsigned long zone_page_state_snapshot(struct zone *zone,
                                        enum zone_stat_item item)
{
        long x = atomic_long_read(&zone->vm_stat[item]);

#ifdef CONFIG_SMP
        int cpu;
        for_each_online_cpu(cpu)
                x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];

        if (x < 0)
                x = 0;
#endif
        return x;
}

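Computes the vm stat value for @item of the zone more accurately.

The following formula is used to return the more accurate value.
- zone vm stat value + pending per-cpu vm stat deltas (vm_stat_diff) of all online cpus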
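
The interplay between the zone counter, the per-cpu deltas and stat_threshold can be reproduced with a small model. struct zone_stat, mod_state(), read_approx(), read_snapshot() and example_drift() are hypothetical names. The folding rule follows the description above (fold when the pending delta exceeds the threshold in either direction), and the threshold of 24 is taken from the /proc/zoneinfo sample.

```c
#define MODEL_NR_CPUS 2

/* Hypothetical model of one zone statistic. */
struct zone_stat {
        long global;                    /* zone counter */
        long diff[MODEL_NR_CPUS];       /* pending per-cpu deltas */
        long threshold;                 /* stat_threshold */
};

/* Update on one cpu; fold into the zone counter past the threshold. */
static void mod_state(struct zone_stat *z, int cpu, long delta)
{
        long x = z->diff[cpu] + delta;

        if (x > z->threshold || x < -z->threshold) {
                z->global += x;
                x = 0;
        }
        z->diff[cpu] = x;
}

/* Cheap read: zone counter only, clamped at 0. */
static long read_approx(const struct zone_stat *z)
{
        return z->global < 0 ? 0 : z->global;
}

/* More accurate read: zone counter plus every pending per-cpu delta. */
static long read_snapshot(const struct zone_stat *z)
{
        long x = z->global;
        int cpu;

        for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
                x += z->diff[cpu];
        return x < 0 ? 0 : x;
}

/* Replays three updates and reports how far the cheap read lags. */
static long example_drift(void)
{
        struct zone_stat z = { 0, { 0, 0 }, 24 };

        mod_state(&z, 0, 20);   /* stays pending on cpu 0 */
        mod_state(&z, 1, 30);   /* exceeds 24, folded into the zone counter */
        mod_state(&z, 0, 3);    /* cpu 0 now holds 23, still pending */
        return read_snapshot(&z) - read_approx(&z);
}
```

In this replay the cheap read reports 30 while the snapshot reports 53. The difference of 23 pages is the drift that percpu_drift_mark is meant to guard against.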

The following figure shows the difference between zone_page_state_snapshot() and zone_page_state().

Differences between zone_watermark_ok() and zone_watermark_ok_safe()

zone_watermark_ok()
- Compares the watermark against an approximate free page count.
zone_watermark_ok_safe()
- Compares the watermark against an accurately computed free page count.

The following shows an example use of zone_watermark_ok_safe() together with the zone stat threshold.

Initialization of watermarks and related settings

Watermarks and the related settings are initialized as follows.

Setting the watermark min value

init_per_zone_wmark_min()

mm/page_alloc.c

/*
 * Initialise min_free_kbytes.
 *
 * For small machines we want it small (128k min).  For large machines
 * we want it large (64MB max).  But it is not linear, because network
 * bandwidth does not increase linearly with machine size.  We use
 *
 *      min_free_kbytes = 4 * sqrt(lowmem_kbytes), for better accuracy:
 *      min_free_kbytes = sqrt(lowmem_kbytes * 16)
 *
 * which yields
 *
 * 16MB:        512k
 * 32MB:        724k
 * 64MB:        1024k
 * 128MB:       1448k
 * 256MB:       2048k
 * 512MB:       2896k
 * 1024MB:      4096k
 * 2048MB:      5792k
 * 4096MB:      8192k
 * 8192MB:      11584k
 * 16384MB:     16384k
 */

int __meminit init_per_zone_wmark_min(void)
{
        unsigned long lowmem_kbytes;
        int new_min_free_kbytes;

        lowmem_kbytes = nr_free_buffer_pages() * (PAGE_SIZE >> 10);
        new_min_free_kbytes = int_sqrt(lowmem_kbytes * 16);

        if (new_min_free_kbytes > user_min_free_kbytes) {
                min_free_kbytes = new_min_free_kbytes;
                if (min_free_kbytes < 128)
                        min_free_kbytes = 128;
                if (min_free_kbytes > 65536)
                        min_free_kbytes = 65536;
        } else {
                pr_warn("min_free_kbytes is not updated to %d because user defined value %d is preferred\n",
                                new_min_free_kbytes, user_min_free_kbytes);
        }
        setup_per_zone_wmarks();
        refresh_zone_stat_thresholds();
        setup_per_zone_lowmem_reserve();

#ifdef CONFIG_NUMA
        setup_min_unmapped_ratio();
        setup_min_slab_ratio();
#endif

        return 0;
}
core_initcall(init_per_zone_wmark_min)

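Computes min_free_kbytes, the base value for the watermarks, from the amount of available lowmem. It also computes each zone's watermarks, lowmem_reserve, totalreserve, min unmapped pages, and min slab pages.

In code lines 6~18, min_free_kbytes, the base value for the watermarks, is computed from the number of available lowmem pages.
- Available lowmem pages
  - managed_pages - high watermark pages
- min_free_kbytes
  - sqrt(lowmem_kbytes * 16), where lowmem_kbytes = available lowmem pages * (PAGE_SIZE >> 10)
    - e.g. 512 MB of lowmem is 524288 KB, so sqrt(524288 * 16) = 2896 KB
  - The result is limited to the range 128 ~ 65536 (128K ~ 64M)
- Besides kernel initialization, this initialization routine also runs every time memory is hot added or removed through memory hotplug.
- When the kernel initializes for the first time, the initial watermark values of every zone are all 0.
In code line 19, each zone's watermark values and the totalreserve_pages value are computed.
In code line 20, each zone's stat threshold and percpu_drift_mark are computed.
In code line 21, zone->lowmem_reserve and the totalreserve pages are computed.
In code line 24, the minimum unmapped pages are computed.
In code line 25, the minimum slab pages are computed.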
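
The table in the function's header comment can be reproduced with a few lines of integer arithmetic. This is a sketch: isqrt_ull() is a hand-written integer square root standing in for the kernel's int_sqrt(), calc_min_free_kbytes() is a hypothetical name, and the comparison with user_min_free_kbytes is left out.

```c
/* floor(sqrt(x)), computed bit by bit. */
static unsigned long long isqrt_ull(unsigned long long x)
{
        unsigned long long r = 0;
        unsigned long long bit = 1ULL << 62;

        while (bit > x)
                bit >>= 2;
        while (bit != 0) {
                if (x >= r + bit) {
                        x -= r + bit;
                        r = (r >> 1) + bit;
                } else {
                        r >>= 1;
                }
                bit >>= 2;
        }
        return r;
}

/* min_free_kbytes = sqrt(lowmem_kbytes * 16), limited to 128..65536. */
static unsigned long long calc_min_free_kbytes(unsigned long long lowmem_kbytes)
{
        unsigned long long v = isqrt_ull(lowmem_kbytes * 16);

        if (v < 128)
                v = 128;
        if (v > 65536)
                v = 65536;
        return v;
}
```

For 16 MB, 512 MB and 1024 MB of lowmem the function gives 512, 2896 and 4096, matching the 512k, 2896k and 4096k rows of the comment table.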

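The following figure shows how the global min_free_kbytes value is computed.

The available lowmem (dma, normal) pages (managed - high watermark) are converted to KB, multiplied by 16, and the square root of the result is taken.
- At first boot the high watermark value is 0.
On ARM64, the dma32 zone is used and the highmem zone is not.

Computing the number of available lowmem pages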

nr_free_buffer_pages()

mm/page_alloc.c

/**
 * nr_free_buffer_pages - count number of pages beyond high watermark
 *
 * nr_free_buffer_pages() counts the number of pages which are beyond the high
 * watermark within ZONE_DMA and ZONE_NORMAL.
 */

unsigned long nr_free_buffer_pages(void)
{
        return nr_free_zone_pages(gfp_zone(GFP_USER));
}
EXPORT_SYMBOL_GPL(nr_free_buffer_pages);

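Computes and returns the number of available pages in the zones at or below the normal zone in the current node's zonelist.

Note that when this is first called during kernel setup, the high watermark is 0.
Available pages
- managed_pages - high watermark pages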

nr_free_zone_pages()

mm/page_alloc.c

/**
 * nr_free_zone_pages - count number of pages beyond high watermark
 * @offset: The zone index of the highest zone
 *
 * nr_free_zone_pages() counts the number of counts pages which are beyond the
 * high watermark within all zones at or below a given zone index.  For each
 * zone, the number of pages is calculated as:
 *
 *     nr_free_zone_pages = managed_pages - high_pages
 */

static unsigned long nr_free_zone_pages(int offset)
{
        struct zoneref *z;
        struct zone *zone;

        /* Just pick one node, since fallback list is circular */
        unsigned long sum = 0;

        struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL);

        for_each_zone_zonelist(zone, z, zonelist, offset) {
                unsigned long size = zone_managed_pages(zone);
                unsigned long high = high_wmark_pages(zone);
                if (size > high)
                        sum += size - high;
        }

        return sum;
}

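Returns the sum of the available pages over all zones at or below offset in the current node's zonelist. A zone contributes only when its managed_pages exceeds its high watermark.

Available pages
- managed_pages - high watermark pages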
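The summation can be illustrated with a small standalone sketch. `struct zone_info` and `free_pages_beyond_high()` are hypothetical names for this example only; the kernel walks a zonelist rather than a plain array.

```c
#include <stddef.h>

/* Hypothetical, simplified zone record used only in this sketch. */
struct zone_info {
    unsigned long managed_pages;
    unsigned long high_wmark;
};

/* Sum of (managed_pages - high watermark) over zones[0..highest_idx]. */
static unsigned long free_pages_beyond_high(const struct zone_info *zones,
                                            size_t highest_idx)
{
    unsigned long sum = 0;
    size_t i;

    for (i = 0; i <= highest_idx; i++) {
        /* A zone whose high watermark covers it entirely adds nothing. */
        if (zones[i].managed_pages > zones[i].high_wmark)
            sum += zones[i].managed_pages - zones[i].high_wmark;
    }
    return sum;
}
```

Fed with the Node 0 figures from the x86 /proc/zoneinfo listing later in this article (DMA 3971/7, DMA32 429342/882, Normal 7732469/15910), the sum is 3964 + 428460 + 7716559 = 8148983 pages.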

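Setting the watermark ratio

By default the gap between watermarks is 25% of the min watermark, but the user can change the watermark ratio to improve kswapd efficiency.

watermark_scale_factor

It is set through /proc/sys/vm/watermark_scale_factor and its initial value is 10. The unit is 1/10000 of memory, so the initial value of 10 means 0.1% of memory and the maximum value of 1000 means 10% of memory. In that case the gap between min and low, and between low and high, can each grow up to 10% of memory (managed_pages).

It was added in kernel v4.6-rc1 so that, on systems with large amounts of memory, the low and high watermark thresholds can be raised higher than before to improve kswapd efficiency.

Calculating the min, low, and high watermarks

The per-zone min, low, and high watermarks are calculated from the min_free_kbytes value, which is determined by the amount of memory.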

setup_per_zone_wmarks()

mm/page_alloc.c

/**
 * setup_per_zone_wmarks - called when min_free_kbytes changes
 * or when memory is hot-{added|removed}
 *
 * Ensures that the watermark[min,low,high] values for each zone are set
 * correctly with respect to min_free_kbytes.
 */

void setup_per_zone_wmarks(void)
{
        mutex_lock(&zonelists_mutex);
        __setup_per_zone_wmarks();
        mutex_unlock(&zonelists_mutex);
}

Calculates the min, low, and high watermarks of each zone.

__setup_per_zone_wmarks()

mm/page_alloc.c

static void __setup_per_zone_wmarks(void)
{
        unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
        unsigned long lowmem_pages = 0;
        struct zone *zone;
        unsigned long flags;

        /* Calculate total number of !ZONE_HIGHMEM pages */
        for_each_zone(zone) {
                if (!is_highmem(zone))
                        lowmem_pages += zone_managed_pages(zone);
        }

        for_each_zone(zone) {
                u64 tmp;

                spin_lock_irqsave(&zone->lock, flags);
                tmp = (u64)pages_min * zone_managed_pages(zone);
                do_div(tmp, lowmem_pages);
                if (is_highmem(zone)) {
                        /*
                         * __GFP_HIGH and PF_MEMALLOC allocations usually don't
                         * need highmem pages, so cap pages_min to a small
                         * value here.
                         *
                         * The WMARK_HIGH-WMARK_LOW and (WMARK_LOW-WMARK_MIN)
                         * deltas control asynch page reclaim, and so should
                         * not be capped for highmem.
                         */
                        unsigned long min_pages;

                        min_pages = zone_managed_pages(zone) / 1024;
                        min_pages = clamp(min_pages, SWAP_CLUSTER_MAX, 128UL);
                        zone->_watermark[WMARK_MIN] = min_pages;
                } else {
                        /*
                         * If it's a lowmem zone, reserve a number of pages
                         * proportionate to the zone's size.
                         */
                        zone->_watermark[WMARK_MIN] = tmp;
                }

                /*
                 * Set the kswapd watermarks distance according to the
                 * scale factor in proportion to available memory, but
                 * ensure a minimum size on small systems.
                 */
                tmp = max_t(u64, tmp >> 2,
                            mult_frac(zone_managed_pages(zone),
                                      watermark_scale_factor, 10000));

                zone->_watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
                zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
                zone->watermark_boost = 0;

                spin_unlock_irqrestore(&zone->lock, flags);
        }

        /* update totalreserve_pages */
        calculate_totalreserve_pages();
}

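Calculates the min, low, and high watermarks of each zone and updates the totalreserve pages.

In code line 3, the global min_free_kbytes value is converted to a page count and assigned to pages_min.
In code lines 9~12, the managed_pages of the lowmem zones are summed to obtain lowmem_pages.
In code lines 14~19, each zone is visited and tmp is set to the share of pages_min (min_free_kbytes converted to pages) that is proportional to the current zone's size.
In code lines 20~41, for a highmem zone the min watermark is set to managed_pages divided by 1024, clamped to the range 32~128. For any other zone the min watermark is set to the tmp value calculated above.
In code lines 48~53, the gap between watermarks is set to the larger of the following two values.
- 25% of the min value first calculated (tmp)
- managed_pages * watermark ratio (0.1% by default, up to 10%)
In code line 54, the watermark boost value is reset to 0.
In code line 60, since the high watermark has changed, the totalreserve page count is recalculated and updated.
- The totalreserve value is updated whenever the high watermark or the lowmem reserve pages are recalculated.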
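The steps above can be condensed into a small userspace sketch for a non-highmem zone. `struct wmarks`, `zone_wmarks()`, and the parameter names are assumptions of this example, and `lowmem_pages` must be nonzero.

```c
#include <stdint.h>

struct wmarks {
    uint64_t min;
    uint64_t low;
    uint64_t high;
};

/*
 * Watermarks of a non-highmem zone, following the steps described above.
 * All parameter names are local to this sketch.
 */
static struct wmarks zone_wmarks(uint64_t pages_min, uint64_t zone_managed,
                                 uint64_t lowmem_pages, uint64_t scale_factor)
{
    struct wmarks w;
    /* This zone's share of pages_min, proportional to its size. */
    uint64_t share = pages_min * zone_managed / lowmem_pages;
    uint64_t by_min = share >> 2;                       /* 25% of min */
    uint64_t by_scale = zone_managed * scale_factor / 10000;
    uint64_t gap = by_min > by_scale ? by_min : by_scale;

    w.min = share;
    w.low = w.min + gap;
    w.high = w.min + 2 * gap;
    return w;
}
```

With the DMA32 values of the single-zone ARM64 /proc/zoneinfo listing later in this article (managed 765771 pages, and assuming pages_min equals that zone's min of 5632), the 25% term (1408) beats the scale-factor term (765), which gives low = 7040 and high = 8448, the same values the listing shows.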

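The following figure shows how the min, low, and high watermarks are calculated from the min_free_kbytes value and the available lowmem pages (managed_pages - high watermark), in proportion to the size of each zone.

The following figure shows that the larger of the two values calculated on the right is applied as the watermark gap.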

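Calculating the zone stat thresholds

Calculates the stat thresholds for zones and nodes, along with percpu_drift_mark, which is derived from the high watermark. The calculated items are as follows.

Node stat threshold
Zone stat threshold
percpu_drift_mark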

refresh_zone_stat_thresholds()

mm/vmstat.c

/*
 * Refresh the thresholds for each zone.
 */

void refresh_zone_stat_thresholds(void)
{
        struct pglist_data *pgdat;
        struct zone *zone;
        int cpu;
        int threshold;

        /* Zero current pgdat thresholds */
        for_each_online_pgdat(pgdat) {
                for_each_online_cpu(cpu) {
                        per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->stat_threshold = 0;
                }
        }

        for_each_populated_zone(zone) {
                struct pglist_data *pgdat = zone->zone_pgdat;
                unsigned long max_drift, tolerate_drift;

                threshold = calculate_normal_threshold(zone);

                for_each_online_cpu(cpu) {
                        int pgdat_threshold;

                        per_cpu_ptr(zone->pageset, cpu)->stat_threshold
                                                        = threshold;

                        /* Base nodestat threshold on the largest populated zone. */
                        pgdat_threshold = per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->stat_threshold;
                        per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->stat_threshold
                                = max(threshold, pgdat_threshold);
                }

                /*
                 * Only set percpu_drift_mark if there is a danger that
                 * NR_FREE_PAGES reports the low watermark is ok when in fact
                 * the min watermark could be breached by an allocation
                 */
                tolerate_drift = low_wmark_pages(zone) - min_wmark_pages(zone);
                max_drift = num_online_cpus() * threshold;
                if (max_drift > tolerate_drift)
                        zone->percpu_drift_mark = high_wmark_pages(zone) +
                                        max_drift;
        }
}

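Calculates the node stat threshold, zone stat threshold, and percpu_drift_mark for every node and zone.

In code lines 9~13, every node is visited and its node stat threshold is reset to 0.
In code lines 15~25, the populated zones are visited and the zone stat threshold is calculated for each.
In code lines 28~31, the node stat threshold is updated to the largest zone stat threshold.
In code lines 38~42, the zone member percpu_drift_mark is calculated.
- max_drift = number of online cpus * calculated threshold
- If max_drift is larger than the gap between the min and low watermarks, percpu_drift_mark is set to the high watermark + max_drift.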
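The drift check can be written as a standalone function. This is only a sketch: `drift_mark()` is a made-up name, and returning 0 for "no mark needed" is a convention of this example (the kernel code above simply leaves the field untouched in that case).

```c
/*
 * Returns the percpu_drift_mark for a zone, or 0 when no mark is needed.
 * Parameter names are local to this sketch.
 */
static unsigned long drift_mark(unsigned long min_wmark,
                                unsigned long low_wmark,
                                unsigned long high_wmark,
                                unsigned int online_cpus,
                                unsigned int threshold)
{
    /* Drift that can be tolerated before min could be breached unnoticed. */
    unsigned long tolerate_drift = low_wmark - min_wmark;
    /* Worst case: every cpu holds back a full threshold of updates. */
    unsigned long max_drift = (unsigned long)online_cpus * threshold;

    if (max_drift > tolerate_drift)
        return high_wmark + max_drift;
    return 0;
}
```

With min=1975, low=2468, high=2962, a 64-cpu system at threshold 125 can drift by 8000 pages, far more than the 493-page gap, so the mark becomes 2962 + 8000 = 10962. With 2 cpus at threshold 24 the worst-case drift is 48 pages and no mark is needed.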

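a) Calculating the threshold used when memory is not under pressure

The normal threshold is calculated and set in the following cases.

The refresh_zone_stat_thresholds() function, called at boot and whenever memory changes through hotplug or hotremove
The kswapd_try_to_sleep() function, just before kswapd goes to sleep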

calculate_normal_threshold()

mm/vmstat.c

int calculate_normal_threshold(struct zone *zone)
{
        int threshold;          
        int mem;        /* memory in 128 MB units */

        /*
         * The threshold scales with the number of processors and the amount
         * of memory per zone. More memory means that we can defer updates for
         * longer, more processors could lead to more contention.
         * fls() is used to have a cheap way of logarithmic scaling.
         *
         * Some sample thresholds:
         *
         * Threshold    Processors      (fls)   Zonesize        fls(mem+1)
         * ------------------------------------------------------------------
         * 8            1               1       0.9-1 GB        4
         * 16           2               2       0.9-1 GB        4
         * 20           2               2       1-2 GB          5
         * 24           2               2       2-4 GB          6
         * 28           2               2       4-8 GB          7
         * 32           2               2       8-16 GB         8
         * 4            2               2       <128M           1
         * 30           4               3       2-4 GB          5
         * 48           4               3       8-16 GB         8
         * 32           8               4       1-2 GB          4
         * 32           8               4       0.9-1GB         4
         * 10           16              5       <128M           1
         * 40           16              5       900M            4
         * 70           64              7       2-4 GB          5
         * 84           64              7       4-8 GB          6
         * 108          512             9       4-8 GB          6
         * 125          1024            10      8-16 GB         8
         * 125          1024            10      16-32 GB        9
         */

        mem = zone_managed_pages(zone) >> (27 - PAGE_SHIFT);

        threshold = 2 * fls(num_online_cpus()) * (1 + fls(mem));

        /*
         * Maximum threshold is 125
         */
        threshold = min(125, threshold);

        return threshold;
}

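Calculates the threshold used when the zone is not under memory pressure. It scales with the zone's managed_pages and the number of online cpus, and is limited to at most 125.

Formula
- threshold = 2 * fls(number of online cpus) * (1 + fls(z->managed_pages / 128M))
  - = 2 * (floor(log2(number of online cpus)) + 1) * (1 + floor(log2(z->managed_pages / 128M)) + 1), for zones of at least 128M
- threshold is limited to at most 125
- e.g. 6 cpus and a zone with managed_pages=3.5G
  - fls(6)=3, 3.5G / 128M=28, fls(28)=5
  - threshold=2 x 3 x (1 + 5)=36
- e.g. 64 cpus and a zone with managed_pages=3.5G
  - threshold=2 x 7 x (1 + 5)=84
- Note: the 2-4 GB rows for 4 and 64 processors in the sample table of the code comment (30 and 70) do not follow from the expression in the code, which yields 36 and 84.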
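The expression in calculate_normal_threshold() can be evaluated in userspace with the following sketch. `fls_u()` and `normal_threshold()` are names invented for this example, and `page_shift` (12 for 4 KB pages) is passed in explicitly instead of using the kernel's PAGE_SHIFT.

```c
/* fls(): position of the most significant set bit, 1-based; fls(0) == 0. */
static int fls_u(unsigned long x)
{
    int pos = 0;

    while (x != 0) {
        pos++;
        x >>= 1;
    }
    return pos;
}

/*
 * Userspace rendering of the normal threshold expression.
 * page_shift (12 for 4 KB pages) is a parameter of this sketch.
 */
static int normal_threshold(unsigned int online_cpus,
                            unsigned long managed_pages, int page_shift)
{
    int mem = (int)(managed_pages >> (27 - page_shift)); /* 128 MB units */
    int threshold = 2 * fls_u(online_cpus) * (1 + fls_u((unsigned long)mem));

    /* Maximum threshold is 125. */
    return threshold < 125 ? threshold : 125;
}
```

Evaluated as written in the code, a 3.5 GB zone (917504 pages of 4 KB) gives 36 on 6 cpus and 84 on 64 cpus, while 1024 cpus with a 16 GB zone hit the cap of 125.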

b) Calculating the threshold used under memory pressure

The pressure threshold is used right after kswapd is woken up because memory is short. The related function is as follows.

kswapd_try_to_sleep() function

calculate_pressure_threshold()

mm/vmstat.c

int calculate_pressure_threshold(struct zone *zone)
{
        int threshold;
        int watermark_distance;

        /*
         * As vmstats are not up to date, there is drift between the estimated
         * and real values. For high thresholds and a high number of CPUs, it
         * is possible for the min watermark to be breached while the estimated
         * value looks fine. The pressure threshold is a reduced value such
         * that even the maximum amount of drift will not accidentally breach
         * the min watermark
         */
        watermark_distance = low_wmark_pages(zone) - min_wmark_pages(zone);
        threshold = max(1, (int)(watermark_distance / num_online_cpus()));

        /*
         * Maximum threshold is 125
         */
        threshold = min(125, threshold);

        return threshold;
}

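Calculates the threshold used under memory pressure from the gap between the zone's min and low watermarks divided by the number of online cpus, limited to the range 1~125.

On systems with many cpus, a large threshold means that many updates may still sit in the per-cpu counters without being reflected in the zone counter. Therefore, when memory is short, the following formula is used so that the threshold can be reduced.
Formula
- threshold = (low watermark - min watermark) / number of online cpus
- threshold is limited to at least 1 and at most 125
- e.g. 6 cpus, min=1975, low=2468, high=2962
  - threshold=493 / 6 = 82
- e.g. 64 cpus, min=1975, low=2468, high=2962
  - threshold=493 / 64 = 7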
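The two examples above can be checked with a short userspace sketch. `pressure_threshold()` is a name chosen for this example, and the watermarks are passed in directly rather than read from a zone.

```c
/*
 * Pressure threshold: watermark gap divided by the number of online cpus,
 * kept within 1..125. Parameter names are local to this sketch.
 */
static int pressure_threshold(unsigned long min_wmark,
                              unsigned long low_wmark,
                              unsigned int online_cpus)
{
    int distance = (int)(low_wmark - min_wmark);
    int threshold = distance / (int)online_cpus;

    if (threshold < 1)
        threshold = 1;
    if (threshold > 125)
        threshold = 125;
    return threshold;
}
```

Because the result never drops below 1, a machine with more cpus than pages in the gap (for example 1024 cpus and a 493-page gap) still gets a usable threshold.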

Calculating the lowmem reserve pages

A memory allocation request selects its zone with the following gfp flags. If none of them is used, the NORMAL zone is selected by default.

__GFP_DMA
__GFP_DMA32
__GFP_HIGHMEM
__GFP_MOVABLE

Kernel developers select the zone at allocation time with gfp masks such as the following.

GFP_DMA
- Uses the __GFP_DMA gfp flag and selects the DMA zone.
GFP_DMA32
- Uses the __GFP_DMA32 gfp flag and selects the DMA32 zone.
GFP_KERNEL
- Uses no zone-related gfp flag, so the NORMAL zone is selected.
GFP_HIGHUSER
- Uses the __GFP_HIGHMEM gfp flag and selects the HIGHMEM zone.
GFP_HIGHUSER_MOVABLE
- Uses the __GFP_HIGHMEM and __GFP_MOVABLE gfp flags and selects the MOVABLE zone.

If the allocation cannot be satisfied from the requested zone, it falls back to the next zone through the zonelist in the following order.

MOVABLE -> HIGHMEM -> NORMAL -> DMA32 -> DMA

lowmem_reserve

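The normal and dma/dma32 zones, which make up the lowmem area, are already mapped into the kernel, so kernel memory can be allocated from them quickly. For that reason, requests aimed at the highmem and movable zones, which serve user applications, should preferably not consume lowmem. When an upper zone runs short and the request has to fall back to a lowmem zone, a number of pages is held back in that zone to limit such allocations. These values form a matrix indexed by the originally requested zone (request) and the zone that finally serves the allocation (target).

Consider the lowmem_reserve values of each zone in the example in the following figure.

Assume an ARM32 system with 1G of memory running three zones: dma (16M), normal (784M), and highmem (200M).
The values inside the braces {} follow the order of the zones running on the system, and each value is the number of pages held back from requests that originally targeted that zone.
Following the red line, a highmem request that falls back to the normal zone has to clear a bar 1600 pages above the watermark, which is how the lowmem area is protected.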
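How the reserve raises the bar can be shown with a deliberately simplified check for an order-0 request. `zone_passes()` is a hypothetical helper for this sketch, not the kernel's watermark function, and the free-page count and watermark used in the example below (2500 and 1000) are made-up numbers. Only the 1600-page reserve comes from the text above.

```c
#include <stdbool.h>

/*
 * Simplified form of the allocator's watermark test for an order-0 request:
 * the zone passes only if its free pages exceed the watermark plus the
 * reserve held back for the zone that was originally requested.
 */
static bool zone_passes(unsigned long free_pages, unsigned long watermark,
                        unsigned long reserve_for_request_zone)
{
    return free_pages > watermark + reserve_for_request_zone;
}
```

A normal zone with 2500 free pages and a watermark of 1000 accepts a request that targeted normal in the first place (reserve 0), but rejects a highmem request that fell back to it, because 2500 does not exceed 1000 + 1600.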

The following figure shows the extra allocation limit added on top of the watermark when an allocation fails in the originally requested zone and falls back toward the lowmem area.

lowmem_reserve_ratio

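The ratio used to calculate each zone's lowmem_reserve values from the managed_pages of the zones.

The lowmem_reserve_ratio values can be checked as follows.

The ratios are listed from the lowest zone to the highest zone running on the system.
The example values below are in the order dma, normal, highmem.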

$ cat /proc/sys/vm/lowmem_reserve_ratio
256     32      0

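The following figure shows how the number of pages held back against a fallback is calculated for each requested zone.

The numerator is the sum of the managed_pages of the zones from the requested zone down to the last zone where the allocation failed, that is, every zone above the one that finally serves the allocation.
The denominator is the lowmem_reserve_ratio of the zone that finally serves the allocation after the fallback.

The following figure shows how each zone's lowmem_reserve values are calculated from the managed_pages of the zones and lowmem_reserve_ratio.

Next, the per-node, per-zone lowmem_reserve values are checked on a NUMA x86 system configured with four zones (dma, dma32, normal, movable).

See the protection field below. The variable was named protection when this feature was first developed and was later renamed to lowmem_reserve.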

$ cat /proc/zoneinfo
Node 0, zone      DMA
  pages free     3915
        min      5
        low      6
        high     7
        scanned  0
        spanned  4095
        present  3992
        managed  3971
    nr_free_pages 3915
...
        protection: (0, 1675, 31880, 31880)
...
Node 0, zone    DMA32
  pages free     59594
        min      588
        low      735
        high     882
        scanned  0
        spanned  1044480
        present  491295
        managed  429342
    nr_free_pages 59594
...
        protection: (0, 0, 30204, 30204)
...
Node 0, zone   Normal
  pages free     902456
        min      10607
        low      13258
        high     15910
        scanned  0
        spanned  7864320
        present  7864320
        managed  7732469
    nr_free_pages 902456
...
        protection: (0, 0, 0, 0)
...
Node 1, zone   Normal
  pages free     1133093
        min      11326
        low      14157
        high     16989
        scanned  0
        spanned  8388608
        present  8388608
        managed  8256697
    nr_free_pages 1133093
...
        protection: (0, 0, 0, 0)
...

As shown below, on a system that runs only one zone, the lowmem_reserve values are all 0.

An ARM64 system that supports the dma32 and normal zones runs with a single zone when all of its memory fits in the dma32 zone. In other words, there is no fallback.

$ cat /proc/zoneinfo
Node 0, zone    DMA32
  per-node stats
(...생략...)
  pages free     633680
        min      5632             
        low      7040             
        high     8448             
        spanned  786432
        present  786432
        managed  765771           
        protection: (0, 0, 0)     <-----
      nr_free_pages 633680        
(...생략...)
Node 0, zone   Normal
  pages free     0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        protection: (0, 0, 0)     <-----
(...생략...)

setup_per_zone_lowmem_reserve()

mm/page_alloc.c

/*
 * setup_per_zone_lowmem_reserve - called whenever
 *      sysctl_lower_zone_reserve_ratio changes.  Ensures that each zone
 *      has a correct pages reserved value, so an adequate number of
 *      pages are left in the zone after a successful __alloc_pages().
 */

static void setup_per_zone_lowmem_reserve(void)
{
        struct pglist_data *pgdat;
        enum zone_type j, idx;

        for_each_online_pgdat(pgdat) {
                for (j = 0; j < MAX_NR_ZONES; j++) {
                        struct zone *zone = pgdat->node_zones + j;
                        unsigned long managed_pages = zone_managed_pages(zone);

                        zone->lowmem_reserve[j] = 0;

                        idx = j;
                        while (idx) {
                                struct zone *lower_zone;

                                idx--;
                                lower_zone = pgdat->node_zones + idx;

                                if (sysctl_lowmem_reserve_ratio[idx] < 1) {
                                        sysctl_lowmem_reserve_ratio[idx] = 0;
                                        lower_zone->lowmem_reserve[j] = 0;
                                } else {
                                        lower_zone->lowmem_reserve[j] =
                                                managed_pages / sysctl_lowmem_reserve_ratio[idx];
                                }
                                managed_pages += zone_managed_pages(lower_zone);
                        }
                }
        }

        /* update totalreserve_pages */
        calculate_totalreserve_pages();
}

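Calculates the lowmem_reserve[] page values of each zone from the managed_pages of the zones and the lowmem_reserve_ratio values.

In code lines 6~11, every online node is visited, and within each node every zone is visited and its own lowmem_reserve[] entry is reset to 0.
In code lines 13~30, the loop walks down from the zone just below the current zone (j) to the first zone (idx=0). Each lower zone's lowmem_reserve[j] is set to the managed_pages accumulated over the zones above it, up to zone j, divided by that lower zone's lowmem_reserve_ratio, and then the lower zone's managed_pages is added to the running total. If the ratio is less than 1, the reserve is set to 0.
In code line 33, since the lowmem_reserve values have changed, the totalreserve page count is updated as well.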
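The loop can be replayed in userspace for the three-zone ARM32 example used earlier (dma 16M, normal 784M, highmem 200M, ratios 256, 32, 0). This is a sketch under assumptions: `compute_lowmem_reserve()` and the array layout are invented for the example, zone sizes stand in for managed_pages, and a 4 KB page size is assumed.

```c
#define NR_ZONES 3  /* dma, normal, highmem in this example */

/*
 * Fills reserve[lower][j]: pages held back in zone `lower` against requests
 * that originally targeted zone j. Mirrors the loop described above.
 */
static void compute_lowmem_reserve(const unsigned long managed[NR_ZONES],
                                   const unsigned long ratio[NR_ZONES],
                                   unsigned long reserve[NR_ZONES][NR_ZONES])
{
    int j, idx;

    for (j = 0; j < NR_ZONES; j++)
        for (idx = 0; idx < NR_ZONES; idx++)
            reserve[j][idx] = 0;

    for (j = 0; j < NR_ZONES; j++) {
        /* Running total of managed pages from zone j downward. */
        unsigned long acc = managed[j];

        for (idx = j - 1; idx >= 0; idx--) {
            if (ratio[idx] >= 1)
                reserve[idx][j] = acc / ratio[idx];
            acc += managed[idx];
        }
    }
}
```

With managed = {4096, 200704, 51200} pages, the normal zone holds back 51200 / 32 = 1600 pages from highmem requests, and the dma zone holds back 200704 / 256 = 784 pages from normal requests and (51200 + 200704) / 256 = 984 pages from highmem requests.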

The following figure shows how lowmem_reserve[] is calculated for each zone.

Calculating the totalreserve pages

calculate_totalreserve_pages()

mm/page_alloc.c

/*
 * calculate_totalreserve_pages - called when sysctl_lower_zone_reserve_ratio
 *      or min_free_kbytes changes.
 */

static void calculate_totalreserve_pages(void)
{
        struct pglist_data *pgdat;
        unsigned long reserve_pages = 0;
        enum zone_type i, j;

        for_each_online_pgdat(pgdat) {

                pgdat->totalreserve_pages = 0;

                for (i = 0; i < MAX_NR_ZONES; i++) {
                        struct zone *zone = pgdat->node_zones + i;
                        long max = 0;
                        unsigned long managed_pages = zone_managed_pages(zone);

                        /* Find valid and maximum lowmem_reserve in the zone */
                        for (j = i; j < MAX_NR_ZONES; j++) {
                                if (zone->lowmem_reserve[j] > max)
                                        max = zone->lowmem_reserve[j];
                        }

                        /* we treat the high watermark as reserved pages. */
                        max += high_wmark_pages(zone);

                        if (max > managed_pages)
                                max = managed_pages;

                        pgdat->totalreserve_pages += max;

                        reserve_pages += max;
                }
        }
        totalreserve_pages = reserve_pages;
}

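Calculates the per-node totalreserve_pages values and the global totalreserve_pages value.

In code lines 7~9, every node is visited and its totalreserve_pages is first reset to 0 before being calculated.
In code lines 11~14, every zone is visited and the managed pages of the zone are read.
In code lines 17~28, the maximum lowmem_reserve value set from the current zone up to the last zone is found, the high watermark is added to it, and the result is added to the node's totalreserve_pages. The value added is capped so that it does not exceed the managed_pages of the current zone.
In code lines 30~33, the values added across all nodes and zones are assigned to the global totalreserve_pages.

The following figure shows how totalreserve_pages is calculated by summing the values below over every node and zone.

The maximum lowmem_reserve value of each zone (over itself and the zones above it)
The high watermark of each zone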
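The per-node part of this calculation can be sketched as follows. `struct zone_rsv` and `node_totalreserve()` are hypothetical names, and the high watermark values used in the example below (100, 1000, 128) are assumptions chosen only to make the arithmetic visible.

```c
#define NR_ZONES 3

/* Hypothetical per-zone record for this sketch. */
struct zone_rsv {
    unsigned long managed_pages;
    unsigned long high_wmark;
    unsigned long lowmem_reserve[NR_ZONES];
};

/* Sum over zones of min(managed, max(lowmem_reserve[i..]) + high). */
static unsigned long node_totalreserve(const struct zone_rsv zones[NR_ZONES])
{
    unsigned long total = 0;
    int i, j;

    for (i = 0; i < NR_ZONES; i++) {
        unsigned long max = 0;

        /* Largest reserve this zone keeps against itself or higher zones. */
        for (j = i; j < NR_ZONES; j++) {
            if (zones[i].lowmem_reserve[j] > max)
                max = zones[i].lowmem_reserve[j];
        }
        /* The high watermark is treated as reserved as well. */
        max += zones[i].high_wmark;
        if (max > zones[i].managed_pages)
            max = zones[i].managed_pages;
        total += max;
    }
    return total;
}
```

For zones of 4096, 200704, and 51200 pages with reserves {0, 784, 984}, {0, 0, 1600}, and {0, 0, 0}, the node total is (984 + 100) + (1600 + 1000) + (0 + 128) = 3812 pages.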

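Calculating inactive_ratio

The steps below, which calculate inactive_ratio at boot, were removed in kernel v4.7-rc1. The inactive list ratio is now evaluated automatically against the total memory inside the inactive_list_is_low() function, so this code is no longer used.

Reference: mm: vmscan: reduce size of inactive file list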

setup_per_zone_inactive_ratio()

mm/page_alloc.c

static void __meminit setup_per_zone_inactive_ratio(void)
{
        struct zone *zone;

        for_each_zone(zone)
                calculate_zone_inactive_ratio(zone);
}

Calculates the inactive anon ratio of every zone.

When zone->managed_pages is below 256k pages (1GB), zone->inactive_ratio becomes 1 and the target ratio of active anon to inactive anon is 1:1. When managed_pages is 256k pages (1GB) or more, inactive_ratio becomes 3 or higher, and the inactive anon share drops to 25% or less (active:inactive = 3:1 or more).

The following figure shows how inactive_ratio is calculated from the managed_pages size of each zone.

calculate_zone_inactive_ratio()

mm/page_alloc.c

/*
 * The inactive anon list should be small enough that the VM never has to
 * do too much work, but large enough that each inactive page has a chance
 * to be referenced again before it is swapped out.
 *
 * The inactive_anon ratio is the target ratio of ACTIVE_ANON to
 * INACTIVE_ANON pages on this zone's LRU, maintained by the
 * pageout code. A zone->inactive_ratio of 3 means 3:1 or 25% of
 * the anonymous pages are kept on the inactive list.
 *
 * total     target    max
 * memory    ratio     inactive anon
 * -------------------------------------
 *   10MB       1         5MB
 *  100MB       1        50MB
 *    1GB       3       250MB
 *   10GB      10       0.9GB
 *  100GB      31         3GB
 *    1TB     101        10GB
 *   10TB     320        32GB
 */
static void __meminit calculate_zone_inactive_ratio(struct zone *zone)
{
        unsigned int gb, ratio;

        /* Zone size in gigabytes */
        gb = zone->managed_pages >> (30 - PAGE_SHIFT);
        if (gb)
                ratio = int_sqrt(10 * gb);
        else
                ratio = 1;

        zone->inactive_ratio = ratio;
}

Calculates the inactive anon ratio of the given zone.

When the managed pages are below 1G, ratio is 1.
- inactive anon=1 : active anon=1
When the managed pages are 1G or more, the size in GB is multiplied by 10 and the integer square root of the result becomes ratio. (active : inactive = ratio : 1)
- e.g. managed pages=1G
  - inactive_ratio=3
    - inactive anon=1 : active anon=3
      - inactive anon=250M : active anon=750M
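The ratio rule can be checked against the table in the code comment with a short userspace sketch. `isqrt_small()` and `inactive_ratio()` are names made up for this example, and `page_shift` (12 for 4 KB pages) replaces the kernel's PAGE_SHIFT.

```c
/* Integer square root by simple search; enough for this sketch's inputs. */
static unsigned int isqrt_small(unsigned long x)
{
    unsigned long r = 0;

    while ((r + 1) * (r + 1) <= x)
        r++;
    return (unsigned int)r;
}

/* inactive_ratio for a zone of managed_pages pages of 2^page_shift bytes. */
static unsigned int inactive_ratio(unsigned long managed_pages, int page_shift)
{
    unsigned long gb = managed_pages >> (30 - page_shift);

    return gb ? isqrt_small(10 * gb) : 1;
}
```

With 4 KB pages this gives 1 just below 1 GB, 3 at 1 GB, 10 at 10 GB, 31 at 100 GB, and 101 at 1 TB, which agrees with the target ratio column of the table.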

References

Zoned Allocator -1- (Physical Page Allocation - Fastpath) | 문c
Zoned Allocator -2- (Physical Page Allocation - Slowpath) | 문c
Zoned Allocator -3- (Buddy Page Allocation) | 문c
Zoned Allocator -4- (Buddy Page Free) | 문c
Zoned Allocator -5- (Per-CPU Page Frame Cache) | 문c
Zoned Allocator -6- (Watermark) | 문c (this post)
Zoned Allocator -7- (Direct Compact) | 문c
Zoned Allocator -8- (Direct Compact-Isolation) | 문c
Zoned Allocator -9- (Direct Compact-Migration) | 문c
Zoned Allocator -10- (LRU & pagevec) | 문c
Zoned Allocator -11- (Direct Reclaim) | 문c
Zoned Allocator -12- (Direct Reclaim-Shrink-1) | 문c
Zoned Allocator -13- (Direct Reclaim-Shrink-2) | 문c
Zoned Allocator -14- (Kswapd) | 문c

Zoned Allocator -2- (Physical Page Allocation - Slowpath)

2016-06-29, 2019-10-09 문영일

Slowpath

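When the fastpath allocation from the zonelist chosen by the NUMA memory policy fails, the slowpath stage runs. If the nofail option is used, it repeats until the allocation succeeds. During the slowpath, depending on the request options, the following reclaim operations are performed when free pages are short.

direct-compaction
- When an allocation cannot be satisfied because pages of the requested order are short, compaction is performed right away (direct) to secure pages, and then the allocation is made.
direct-reclaim
- When an allocation cannot be satisfied because pages of the requested order are short, reclaim is performed right away (direct) to secure pages, and then the allocation is made.
- When a reclaim operation is performed ...
OOM killing
- When pages of the requested order are still short, OOM killing terminates a particular task as a last resort, and the allocation is made from the pages this frees.
kswapd
- Runs the page reclaim mechanism in the background to secure free pages by writing back dirty file cache, dropping clean file cache, moving pages out to the swap system, and so on.
kcompactd
- Performs compaction in the background, moving fragmented movable pages so that free pages merge and larger-order free pages become available.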

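OOM(Out of Memory) Killing

When memory is short and page reclaim through both compaction and reclaim has failed so that no further progress is possible, one of the following tasks has to be killed.

The current task, if it is already exiting (priority 0)
A task designated to be handled first in the OOM state (priority 1)
One task, among all tasks, that ranks above the others by a computed score (priority 2)

The following shows the result of forcing an OOM kill.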

$ echo f > /proc/sysrq-trigger
$ dmesg
[460767.036092] sysrq: SysRq : Manual OOM execution
[460767.037248] kworker/0:0 invoked oom-killer: gfp_mask=0x24000c0, order=-1, oom_score_adj=0
[460767.038016] kworker/0:0 cpuset=/ mems_allowed=0
[460767.038468] CPU: 0 PID: 8063 Comm: kworker/0:0 Tainted: G        W       4.4.103-g94108fb3583f-dirty #4
[460767.039307] Hardware name: ROCK960 - 96boards based on Rockchip RK3399 (DT)
[460767.039948] Workqueue: events moom_callback
[460767.040348] Call trace:
[460767.040603] [<ffffff800808806c>] dump_backtrace+0x0/0x21c
[460767.041104] [<ffffff80080882ac>] show_stack+0x24/0x30
[460767.041583] [<ffffff80083b56f4>] dump_stack+0x94/0xbc
[460767.042064] [<ffffff80081bdd4c>] dump_header.isra.5+0x50/0x15c
[460767.042603] [<ffffff800817f240>] oom_kill_process+0x94/0x3dc
[460767.043128] [<ffffff800817f7fc>] out_of_memory+0x1e4/0x2ac
[460767.043639] [<ffffff800845d9ac>] moom_callback+0x48/0x70
[460767.044128] [<ffffff80080cd264>] process_one_work+0x220/0x378
[460767.044663] [<ffffff80080ce124>] worker_thread+0x2e0/0x3a0
[460767.045176] [<ffffff80080d3004>] kthread+0xe0/0xe8
[460767.045627] [<ffffff80080826c0>] ret_from_fork+0x10/0x50
[460767.046235] Mem-Info:
[460767.046484] active_anon:33752 inactive_anon:6597 isolated_anon:0
                 active_file:535284 inactive_file:190256 isolated_file:0
                 unevictable:0 dirty:31 writeback:0 unstable:0
                 slab_reclaimable:48749 slab_unreclaimable:5217
                 mapped:25870 shmem:6685 pagetables:894 bounce:0
                 free:145659 free_pcp:686 free_cma:0
[460767.049564] DMA free:582636kB min:7900kB low:9872kB high:11848kB active_anon:135008kB inactive_anon:26388kB active_file:2141136kB inactive_file:761024kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:4061184kB managed:3903784kB mlocked:0kB dirty:124kB writeback:0kB mapped:103480kB shmem:26740kB slab_reclaimable:194996kB slab_unreclaimable:20868kB kernel_stack:4352kB pagetables:3576kB unstable:0kB bounce:0kB free_pcp:2744kB local_pcp:620kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[460767.053616] lowmem_reserve[]: 0 0 0
[460767.054043] DMA: 997*4kB (UME) 613*8kB (UME) 559*16kB (UME) 482*32kB (UM) 356*64kB (UM) 160*128kB (UME) 77*256kB (UME) 40*512kB (UM) 29*1024kB (ME) 21*2048kB (M) 96*4096kB (M) = 582636kB
[460767.055848] 732229 total pagecache pages
[460767.056242] 0 pages in swap cache
[460767.056559] Swap cache stats: add 0, delete 0, find 0/0
[460767.057065] Free swap  = 1048572kB
[460767.057390] Total swap = 1048572kB
[460767.057714] 1015296 pages RAM
[460767.058027] 0 pages HighMem/MovableOnly
[460767.058388] 39350 pages reserved
[460767.058689] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[460767.059547] [  195]     0   195     7917     1596       9       3        0             0 systemd-journal
[460767.060436] [  229]     0   229     3281      822       9       4        0         -1000 systemd-udevd
[460767.061301] [  257]   102   257     1952      979       8       4        0             0 systemd-network
[460767.062193] [  382]   101   382    20508      936      10       5        0             0 systemd-timesyn
[460767.063081] [  416]     0   416    61035     1826      18       3        0             0 upowerd
[460767.063883] [  419]   106   419     1676      980       7       3        0          -900 dbus-daemon
[460767.064739] [  434]     0   434     2481     1120       7       4        0             0 wpa_supplicant
[460767.065614] [  437]     0   437     1900     1091       9       4        0             0 systemd-logind
[460767.066533] [  441]     0   441    79590     2400      20       4        0             0 udisksd
[460767.067365] [  450]     0   450      446      231       5       4        0             0 acpid
[460767.068170] [  454]     0   454    54453      725      11       3        0             0 rsyslogd
[460767.068973] [  459]     0   459    88404     4283      28       4        0             0 NetworkManager
[460767.069849] [  478]     0   478    58975     2351      18       4        0             0 polkitd
[460767.070677] [  568]   103   568     2102     1141       8       4        0             0 systemd-resolve
[460767.071565] [  585]     0   585    58528     1963      17       4        0             0 lightdm
[460767.072396] [  590]     0   590      623      363       6       4        0             0 agetty
[460767.073213] [  592]     0   592     1841      780       8       4        0             0 login
[460767.073991] [  603]     0   603   274336    17176      90       5        0             0 Xorg
[460767.074794] [  630]   110   630      441       92       5       4        0             0 uml_switch
[460767.075645] [  658]     0   658     2357     1385       8       3        0             0 systemd
[460767.076472] [  668]     0   668     2994      454      11       4        0             0 (sd-pam)
[460767.077298] [  673]     0   673     1675     1097       7       3        0             0 bash
[460767.078102] [  772]     0   772    40489     2052      14       4        0             0 lightdm
[460767.078904] [  780]  1000   780     2476     1328       9       4        0             0 systemd
[460767.079731] [  785]  1000   785     2994      454      11       4        0             0 (sd-pam)
[460767.080559] [  788]  1000   788    61601     3212      23       3        0             0 lxsession
[460767.081398] [  808]  1000   808     1801      382       7       4        0             0 dbus-launch
[460767.082257] [  809]  1000   809     1596      681       7       4        0             0 dbus-daemon
[460767.083134] [  830]  1000   830      984       79       6       4        0             0 ssh-agent
[460767.083948] [  838]  1000   838    58269     1523      14       3        0             0 gvfsd
[460767.084752] [  848]  1000   848    13921     3757      19       3        0             0 openbox
[460767.085579] [  853]  1000   853   196606     6848      44       4        0             0 lxpanel
[460767.086479] [  854]  1000   854    98755     7408      36       3        0             0 pcmanfm
[460767.087311] [  859]  1000   859      984       79       7       4        0             0 ssh-agent
[460767.088153] [  863]  1000   863    92314    13900      46       5        0             0 blueman-applet
[460767.089028] [  868]  1000   868   112028    14267      67       5        0             0 nm-applet
[460767.089842] [  869]  1000   869    43274     2728      20       3        0             0 xfce4-power-man
[460767.090729] [  876]  1000   876     2359     1082       9       4        0             0 xfconfd
[460767.091555] [  885]  1000   885   123932     2468      19       3        0             0 pulseaudio
[460767.092403] [  898]  1000   898    39537     1649      13       3        0             0 menu-cached
[460767.093268] [  905]     0   905     1810      963       7       4        0             0 bluetoothd
[460767.094139] [  915]  1000   915    67422     2887      21       5        0             0 gvfs-udisks2-vo
[460767.095036] [  923]  1000   923    77517     2078      19       5        0             0 gvfsd-trash
[460767.095889] [  933]  1000   933     9809     1559      12       3        0             0 obexd
[460767.096706] [ 5483]     0  5483     3049     1593      11       3        0             0 sshd
[460767.097537] [ 5498]     0  5498     2968     1498      10       3        0             0 sshd
[460767.098351] [ 5511]     0  5511      573      403       5       3        0             0 sftp-server
[460767.099273] [ 5518]     0  5518     1691     1156       7       3        0             0 bash
[460767.100048] [ 5735]     0  5735     3048     1579      11       4        0             0 sshd
[460767.100810] [ 5743]     0  5743     2968     1511      10       4        0             0 sshd
[460767.101580] [ 5766]     0  5766      573      427       6       3        0             0 sftp-server
[460767.102396] [ 5773]     0  5773     1698     1173       7       3        0             0 bash
[460767.103166] [ 5994]     0  5994     3048     1565       9       4        0             0 sshd
[460767.103928] [ 5999]     0  5999     2968     1529      10       4        0             0 sshd
[460767.104697] [ 6021]     0  6021      573      409       5       4        0             0 sftp-server
[460767.105515] [ 6028]     0  6028     1699     1172       7       3        0             0 bash
[460767.106289] [ 7742]     0  7742     2968     1514      10       3        0             0 sshd
[460767.107059] [ 7758]     0  7758     1697     1221       7       3        0             0 bash
[460767.107821] [ 7849]     0  7849     2968     1506      10       4        0             0 sshd
[460767.108591] [ 7863]     0  7863     1699     1201       7       3        0             0 bash
[460767.109360] Out of memory: Kill process 603 (Xorg) score 13 or sacrifice child
[460767.110302] Killed process 603 (Xorg) total-vm:1097344kB, anon-rss:15892kB, file-rss:52812kB

__alloc_pages_slowpath()

The following figure shows the slow-path page allocation flow.

mm/page_alloc.c -1/5-

static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
                                                struct alloc_context *ac)
{
        bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
        const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER;
        struct page *page = NULL;
        unsigned int alloc_flags;
        unsigned long did_some_progress;
        enum compact_priority compact_priority;
        enum compact_result compact_result;
        int compaction_retries;
        int no_progress_loops;
        unsigned int cpuset_mems_cookie;
        int reserve_flags;

        /*
         * We also sanity check to catch abuse of atomic reserves being used by
         * callers that are not in atomic context.
         */
        if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) ==
                                (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
                gfp_mask &= ~__GFP_ATOMIC;

retry_cpuset:
        compaction_retries = 0;
        no_progress_loops = 0;
        compact_priority = DEF_COMPACT_PRIORITY;
        cpuset_mems_cookie = read_mems_allowed_begin();

        /*
         * The fast path uses conservative alloc_flags to succeed only until
         * kswapd needs to be woken up, and to avoid the cost of setting up
         * alloc_flags precisely. So we do that now.
         */
        alloc_flags = gfp_to_alloc_flags(gfp_mask);

        /*
         * We need to recalculate the starting point for the zonelist iterator
         * because we might have used different nodemask in the fast path, or
         * there was a cpuset modification and we are retrying - otherwise we
         * could end up iterating over non-eligible zones endlessly.
         */
        ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
                                        ac->high_zoneidx, ac->nodemask);
        if (!ac->preferred_zoneref->zone)
                goto nopage;

        if (alloc_flags & ALLOC_KSWAPD)
                wake_all_kswapds(order, gfp_mask, ac);

        /*
         * The adjusted alloc_flags might result in immediate success, so try
         * that first
         */
        page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
        if (page)
                goto got_pg;

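Code line 5 checks whether the request permits direct reclaim, which is needed when free pages fall below the watermark during allocation. Requests that permit direct reclaim include the following.
- GFP_KERNEL, GFP_KERNEL_ACCOUNT, GFP_NOIO, and GFP_NOFS, used for kernel memory allocation
- GFP_USER, used for user memory allocation
Code line 6 treats the request as a costly order when the order exceeds PAGE_ALLOC_COSTLY_ORDER (3), that is, order 4 or higher.
Code lines 21~23: if the gfp mask unreasonably requests both atomic and direct reclaim at the same time, the atomic request is ignored by clearing it from the gfp mask.
Code lines 25~29 are the retry_cpuset: label. The retry counters are reset, the default compact priority is set, and the cpuset mems cookie is read.
Code line 36 derives the allocation flags from the gfp mask.
Code lines 44~47 recompute the starting zone from the zonelist and node mask. If the target node mask has no usable zone, execution jumps to the nopage: label.
Code lines 49~50: if kswapd reclaim is permitted, the kswapd of each node that owns a usable zone in the zonelist is woken when that node's free memory is below the threshold or too fragmented.
Code lines 56~58 make the first slow-path allocation attempt with the adjusted allocation flags.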

mm/page_alloc.c -2/5-

        /*
         * For costly allocations, try direct compaction first, as it's likely
         * that we have enough base pages and don't need to reclaim. For non-
         * movable high-order allocations, do that as well, as compaction will
         * try prevent permanent fragmentation by migrating from blocks of the
         * same migratetype.
         * Don't try this for allocations that are allowed to ignore
         * watermarks, as the ALLOC_NO_WATERMARKS attempt didn't yet happen.
         */
        if (can_direct_reclaim &&
                        (costly_order ||
                           (order > 0 && ac->migratetype != MIGRATE_MOVABLE))
                        && !gfp_pfmemalloc_allowed(gfp_mask)) {
                page = __alloc_pages_direct_compact(gfp_mask, order,
                                                alloc_flags, ac,
                                                INIT_COMPACT_PRIORITY,
                                                &compact_result);
                if (page)
                        goto got_pg;

                /*
                 * Checks for costly allocations with __GFP_NORETRY, which
                 * includes THP page fault allocations
                 */
                if (costly_order && (gfp_mask & __GFP_NORETRY)) {
                        /*
                         * If compaction is deferred for high-order allocations,
                         * it is because sync compaction recently failed. If
                         * this is the case and the caller requested a THP
                         * allocation, we do not want to heavily disrupt the
                         * system, so we fail the allocation instead of entering
                         * direct reclaim.
                         */
                        if (compact_result == COMPACT_DEFERRED)
                                goto nopage;

                        /*
                         * Looks like reclaim/compaction is worth trying, but
                         * sync compaction could be very expensive, so keep
                         * using async compaction.
                         */
                        compact_priority = INIT_COMPACT_PRIORITY;
                }
        }

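Code lines 10~19 perform the first direct compaction and try to allocate a page when all three of the following conditions hold.
- The request permits direct reclaim.
- The request is a costly high-order request, or a non-movable request of order 1 or higher.
- The request is an ordinary one that must not use pfmemalloc, which serves temporary allocations made on behalf of page reclaim.
Code lines 25~43 handle the case where allocation still failed after the first direct compaction. For a costly-order request with the noretry option, async compaction is kept for the next attempt. However, if the compaction result is deferred, execution jumps to the nopage: label.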

mm/page_alloc.c -3/5-

retry:
        /* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
        if (alloc_flags & ALLOC_KSWAPD)
                wake_all_kswapds(order, gfp_mask, ac);

        reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
        if (reserve_flags)
                alloc_flags = reserve_flags;

        /*
         * Reset the nodemask and zonelist iterators if memory policies can be
         * ignored. These allocations are high priority and system rather than
         * user oriented.
         */
        if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
                ac->nodemask = NULL;
                ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
                                        ac->high_zoneidx, ac->nodemask);
        }

        /* Attempt with potentially adjusted zonelist and alloc_flags */
        page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
        if (page)
                goto got_pg;

        /* Caller is not willing to reclaim, we can't balance anything */
        if (!can_direct_reclaim)
                goto nopage;

        /* Avoid recursion of direct reclaim */
        if (current->flags & PF_MEMALLOC)
                goto nopage;

        /* Try direct reclaim and then allocating */
        page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
                                                        &did_some_progress);
        if (page)
                goto got_pg;

        /* Try direct compaction and then allocating */
        page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
                                        compact_priority, &compact_result);
        if (page)
                goto got_pg;

        /* Do not loop if specifically requested */
        if (gfp_mask & __GFP_NORETRY)
                goto nopage;

        /*
         * Do not retry costly high order allocations unless they are
         * __GFP_RETRY_MAYFAIL
         */
        if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL))
                goto nopage;

        if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
                                 did_some_progress > 0, &no_progress_loops))
                goto retry;

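Code lines 1~4 are the retry: label, where the allocation is attempted again. If the request permits waking kswapd, kswapd is woken depending on the memory state.
Code lines 6~8: if the request must use pfmemalloc, which serves temporary allocations made on behalf of page reclaim, the allocation flags are replaced with the reserve flags (for example ALLOC_NO_WATERMARKS) so that the watermark criteria are not applied.
Code lines 15~24: if either of the following two conditions holds, the node and zone order being walked under the memory policies is ignored, the iterator is reset to the first node and zone, and the allocation is attempted.
- A kernel allocation request that does not use cpuset
- An allocation request that must bypass the watermark criteria because of pfmemalloc
Code lines 27~28: if the request does not permit direct reclaim, as with atomic requests, no further reclaim is possible, so execution jumps to the nopage: label.
Code lines 31~32: if the current task already has PF_MEMALLOC set, meaning it is allocating on behalf of page reclaim, reclaim is skipped to avoid recursion and execution jumps to the nopage: label.
Code lines 35~38 attempt direct reclaim.
Code lines 41~44 attempt the second direct compaction.
Code lines 47~48 jump to the nopage: label if noretry was requested.
Code lines 54~55: for a costly order, a request without the __GFP_RETRY_MAYFAIL flag is not retried and jumps to the nopage: label.
Code lines 57~59 jump to the retry: label to retry the allocation if reclaim is worth retrying.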

mm/page_alloc.c -4/5-

        /*
         * It doesn't make any sense to retry for the compaction if the order-0
         * reclaim is not able to make any progress because the current
         * implementation of the compaction depends on the sufficient amount
         * of free memory (see __compaction_suitable)
         */
        if (did_some_progress > 0 &&
                        should_compact_retry(ac, order, alloc_flags,
                                compact_result, &compact_priority,
                                &compaction_retries))
                goto retry;


        /* Deal with possible cpuset update races before we start OOM killing */
        if (check_retry_cpuset(cpuset_mems_cookie, ac))
                goto retry_cpuset;

        /* Reclaim has failed us, start killing things */
        page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
        if (page)
                goto got_pg;

        /* Avoid allocations with no watermarks from looping endlessly */
        if (tsk_is_oom_victim(current) &&
            (alloc_flags == ALLOC_OOM ||
             (gfp_mask & __GFP_NOMEMALLOC)))
                goto nopage;

        /* Retry as long as the OOM killer is making progress */
        if (did_some_progress) {
                no_progress_loops = 0;
                goto retry;
        }

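Code lines 7~11 jump to the retry: label if compaction is worth retrying.
Code lines 15~16 jump to the retry_cpuset: label if a cpuset change is detected and a race is suspected.
Code lines 19~21 try the allocation again with pages obtained through OOM killing.
Code lines 24~27: if the current task is itself an OOM victim, and either the allocation flags are exactly ALLOC_OOM or __GFP_NOMEMALLOC is set, execution jumps to the nopage: label so the request does not loop endlessly.
Code lines 30~33: if OOM killing is making progress and pages may be recovered, no_progress_loops is reset to 0 and execution jumps to the retry: label.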

mm/page_alloc.c -5/5-

nopage:
        /* Deal with possible cpuset update races before we fail */
        if (check_retry_cpuset(cpuset_mems_cookie, ac))
                goto retry_cpuset;

        /*
         * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
         * we always retry
         */
        if (gfp_mask & __GFP_NOFAIL) {
                /*
                 * All existing users of the __GFP_NOFAIL are blockable, so warn
                 * of any new users that actually require GFP_NOWAIT
                 */
                if (WARN_ON_ONCE(!can_direct_reclaim))
                        goto fail;

                /*
                 * PF_MEMALLOC request from this context is rather bizarre
                 * because we cannot reclaim anything and only can loop waiting
                 * for somebody to do a work for us
                 */
                WARN_ON_ONCE(current->flags & PF_MEMALLOC);

                /*
                 * non failing costly orders are a hard requirement which we
                 * are not prepared for much so let's warn about these users
                 * so that we can identify them and convert them to something
                 * else.
                 */
                WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);

                /*
                 * Help non-failing allocations by giving them access to memory
                 * reserves but do not use ALLOC_NO_WATERMARKS because this
                 * could deplete whole memory reserves which would just make
                 * the situation worse
                 */
                page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);
                if (page)
                        goto got_pg;

                cond_resched();
                goto retry;
        }
fail:
        warn_alloc(gfp_mask, ac->nodemask,
                        "page allocation failure: order:%u", order);
got_pg:
        return page;
}

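Code lines 1~4 are the nopage: label, reached just before giving up on the allocation. If a cpuset change is detected and a race is suspected, execution jumps to the retry_cpuset: label to retry.
Code lines 10~45: with the nofail option, the allocation is retried until a page is obtained. However, if direct reclaim is not permitted, the request fails.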

GFP mask -> allocation flag conversion

gfp_to_alloc_flags()

mm/page_alloc.c

static inline unsigned int
gfp_to_alloc_flags(gfp_t gfp_mask)
{
        unsigned int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;

        /* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
        BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);

        /*
         * The caller may dip into page reserves a bit more if the caller
         * cannot run direct reclaim, or if the caller has realtime scheduling
         * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
         * set both ALLOC_HARDER (__GFP_ATOMIC) and ALLOC_HIGH (__GFP_HIGH).
         */
        alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);

        if (gfp_mask & __GFP_ATOMIC) {
                /*
                 * Not worth trying to allocate harder for __GFP_NOMEMALLOC even
                 * if it can't schedule.
                 */
                if (!(gfp_mask & __GFP_NOMEMALLOC))
                        alloc_flags |= ALLOC_HARDER;
                /*
                 * Ignore cpuset mems for GFP_ATOMIC rather than fail, see the
                 * comment for __cpuset_node_allowed().
                 */
                alloc_flags &= ~ALLOC_CPUSET;
        } else if (unlikely(rt_task(current)) && !in_interrupt())
                alloc_flags |= ALLOC_HARDER;

        if (gfp_mask & __GFP_KSWAPD_RECLAIM)
                alloc_flags |= ALLOC_KSWAPD;

#ifdef CONFIG_CMA
        if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
                alloc_flags |= ALLOC_CMA;
#endif
        return alloc_flags;
}

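Builds and returns the allocation flags from @gfp_mask. The allocation flags and the conditions under which they are set are as follows.

ALLOC_WMARK_MIN(0)
- Default
ALLOC_NO_WATERMARKS
- For pfmemalloc requests (in the code shown here this flag comes from __gfp_pfmemalloc_flags() in the slow path, not from this function)
ALLOC_CPUSET
- When the request is not atomic
ALLOC_HIGH
- When high is requested
ALLOC_HARDER
- When an rt task makes the request outside interrupt context
- When atomic is requested without nomemalloc
ALLOC_CMA
- When the request is for the movable page type
ALLOC_KSWAPD
- When kswapd_reclaim is requested

Code line 4 starts the allocation flags with the min watermark and cpuset.
Code line 15 adds the high request, if present, to the allocation flags.
Code lines 17~28: for an atomic request, cpuset is removed from the allocation flags. The harder flag is also added unless nomemalloc is requested.
Code lines 29~30 add the harder flag when an rt task makes the request outside interrupt context.
Code lines 32~33 add the kswapd flag when kswapd_reclaim is requested.
Code lines 36~37 add the cma flag when the request is movable.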

gfp flags -> migrate type conversion

gfpflags_to_migratetype()

include/linux/gfp.h

static inline int gfpflags_to_migratetype(const gfp_t gfp_flags)
{
        VM_WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
        BUILD_BUG_ON((1UL << GFP_MOVABLE_SHIFT) != ___GFP_MOVABLE);
        BUILD_BUG_ON((___GFP_MOVABLE >> GFP_MOVABLE_SHIFT) != MIGRATE_MOVABLE);

        if (unlikely(page_group_by_mobility_disabled))
                return MIGRATE_UNMOVABLE;

        /* Group based on mobility */
        return (gfp_flags & GFP_MOVABLE_MASK) >> GFP_MOVABLE_SHIFT;
}

Converts the GFP flags to the corresponding migrate type, which is one of the following.

MIGRATE_UNMOVABLE
MIGRATE_RECLAIMABLE
MIGRATE_MOVABLE

Waking all kswapds

wake_all_kswapds()

mm/page_alloc.c

static void wake_all_kswapds(unsigned int order, gfp_t gfp_mask,
                             const struct alloc_context *ac)
{
        struct zoneref *z;
        struct zone *zone;
        pg_data_t *last_pgdat = NULL;
        enum zone_type high_zoneidx = ac->high_zoneidx;

        for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, high_zoneidx,
                                        ac->nodemask) {
                if (last_pgdat != zone->zone_pgdat)
                        wakeup_kswapd(zone, gfp_mask, order, high_zoneidx);
                last_pgdat = zone->zone_pgdat;
        }
}

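Iterates over the zones in the zonelist that are at or below high_zoneidx and belong to the nodes set in the node mask, and wakes the kswapd of each such node. Consecutive zones that belong to the same node trigger only one wakeup.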

Checking whether reclaim should be retried

should_reclaim_retry()

mm/page_alloc.c

/*
 * Checks whether it makes sense to retry the reclaim to make a forward progress
 * for the given allocation request.
 *
 * We give up when we either have tried MAX_RECLAIM_RETRIES in a row
 * without success, or when we couldn't even meet the watermark if we
 * reclaimed all remaining pages on the LRU lists.
 *
 * Returns true if a retry is viable or false to enter the oom path.
 */

static inline bool
should_reclaim_retry(gfp_t gfp_mask, unsigned order,
                     struct alloc_context *ac, int alloc_flags,
                     bool did_some_progress, int *no_progress_loops)
{
        struct zone *zone;
        struct zoneref *z;
        bool ret = false;

        /*
         * Costly allocations might have made a progress but this doesn't mean
         * their order will become available due to high fragmentation so
         * always increment the no progress counter for them
         */
        if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
                *no_progress_loops = 0;
        else
                (*no_progress_loops)++;

        /*
         * Make sure we converge to OOM if we cannot make any progress
         * several times in the row.
         */
        if (*no_progress_loops > MAX_RECLAIM_RETRIES) {
                /* Before OOM, exhaust highatomic_reserve */
                return unreserve_highatomic_pageblock(ac, true);
        }

        /*
         * Keep reclaiming pages while there is a chance this will lead
         * somewhere.  If none of the target zones can satisfy our allocation
         * request even if all reclaimable pages are considered then we are
         * screwed and have to go OOM.
         */
        for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
                                        ac->nodemask) {
                unsigned long available;
                unsigned long reclaimable;
                unsigned long min_wmark = min_wmark_pages(zone);
                bool wmark;

                available = reclaimable = zone_reclaimable_pages(zone);
                available += zone_page_state_snapshot(zone, NR_FREE_PAGES);

                /*
                 * Would the allocation succeed if we reclaimed all
                 * reclaimable pages?
                 */
                wmark = __zone_watermark_ok(zone, order, min_wmark,
                                ac_classzone_idx(ac), alloc_flags, available);
                trace_reclaim_retry_zone(z, order, reclaimable,
                                available, min_wmark, *no_progress_loops, wmark);
                if (wmark) {
                        /*
                         * If we didn't make any progress and have a lot of
                         * dirty + writeback pages then we should wait for
                         * an IO to complete to slow down the reclaim and
                         * prevent from pre mature OOM
                         */
                        if (!did_some_progress) {
                                unsigned long write_pending;

                                write_pending = zone_page_state_snapshot(zone,
                                                        NR_ZONE_WRITE_PENDING);

                                if (2 * write_pending > reclaimable) {
                                        congestion_wait(BLK_RW_ASYNC, HZ/10);
                                        return true;
                                }
                        }

                        ret = true;
                        goto out;
                }
        }

out:
        /*
         * Memory allocation/reclaim might be called from a WQ context and the
         * current implementation of the WQ concurrency control doesn't
         * recognize that a particular WQ is congested if the worker thread is
         * looping without ever sleeping. Therefore we have to do a short sleep
         * here rather than calling cond_resched().
         */
        if (current->flags & PF_WQ_WORKER)
                schedule_timeout_uninterruptible(1);
        else
                cond_resched();
        return ret;
}

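Checks the reclaimable pages and the remaining free pages to decide whether reclaim should be attempted again (true = continue, false = stop). Retries without progress are allowed up to MAX_RECLAIM_RETRIES (16) times in a row, and requests above the costly order always count as no progress. Once the limit is exceeded, the free pages of the highatomic type, which were reserved for high-order atomic requests, are all converted to the requested page type so that they can be used.

Code lines 15~27: if the direct reclaim performed before this call reclaimed pages and the request is at or below the costly order, the output argument no_progress_loops is reset to 0. Otherwise, that is, when nothing was reclaimed or the request exceeds the costly order, no_progress_loops is incremented by 1. When this value exceeds the maximum number of reclaim retries, the system is on the verge of OOM, so the highatomic free pages reserved for high-order atomic requests are unreserved and converted to the requested page type, and the result of that attempt is returned.
Code lines 35~43 iterate over the zones at or below high_zoneidx in the zonelist that match the node mask, and compute for each zone the number of free pages plus the maximum number of reclaimable pages.
Code lines 49~74: if the available page count passes the min watermark check, execution jumps to the out: label so that true is returned and reclaim can be retried. However, if the previous reclaim made no progress and more than half of the reclaimable pages are pending write, an immediate retry is unlikely to reclaim anything, so the function waits up to 0.1 seconds and then returns true directly.
Code lines 77~89: the out: label decides whether to sleep before leaving the function. When the allocation was requested from a workqueue worker thread, the thread could otherwise loop without sleeping under congestion, so it sleeps for at least 1 tick and yields to other tasks. In all other cases the function yields only if a reschedule (preemption) is pending, and otherwise returns immediately.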

Checking whether compaction should be retried

should_compact_retry()

mm/page_alloc.c

static inline bool
should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
                     enum compact_result compact_result,
                     enum compact_priority *compact_priority,
                     int *compaction_retries)
{
        int max_retries = MAX_COMPACT_RETRIES;
        int min_priority;
        bool ret = false;
        int retries = *compaction_retries;
        enum compact_priority priority = *compact_priority;

        if (!order)
                return false;

        if (compaction_made_progress(compact_result))
                (*compaction_retries)++;

        /*
         * compaction considers all the zone as desperately out of memory
         * so it doesn't really make much sense to retry except when the
         * failure could be caused by insufficient priority
         */
        if (compaction_failed(compact_result))
                goto check_priority;

        /*
         * make sure the compaction wasn't deferred or didn't bail out early
         * due to locks contention before we declare that we should give up.
         * But do not retry if the given zonelist is not suitable for
         * compaction.
         */
        if (compaction_withdrawn(compact_result)) {
                ret = compaction_zonelist_suitable(ac, order, alloc_flags);
                goto out;
        }

        /*
         * !costly requests are much more important than __GFP_RETRY_MAYFAIL
         * costly ones because they are de facto nofail and invoke OOM
         * killer to move on while costly can fail and users are ready
         * to cope with that. 1/4 retries is rather arbitrary but we
         * would need much more detailed feedback from compaction to
         * make a better decision.
         */
        if (order > PAGE_ALLOC_COSTLY_ORDER)
                max_retries /= 4;
        if (*compaction_retries <= max_retries) {
                ret = true;
                goto out;
        }

        /*
         * Make sure there are attempts at the highest priority if we exhausted
         * all retries or failed at the lower priorities.
         */
check_priority:
        min_priority = (order > PAGE_ALLOC_COSTLY_ORDER) ?
                        MIN_COMPACT_COSTLY_PRIORITY : MIN_COMPACT_PRIORITY;

        if (*compact_priority > min_priority) {
                (*compact_priority)--;
                *compaction_retries = 0;
                ret = true;
        }
out:
        trace_compact_retry(order, priority, compact_result, retries, max_retries, ret);
        return ret;
}

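Returns whether compaction should be retried. (true = retry needed, false = retry not needed)

Code lines 13~14 return false for order-0 requests, since compaction is pointless for them.
Code lines 16~17 increment the compaction_retries counter if the previous compaction made progress.
Code lines 24~25 jump to the check_priority: label if the previous compaction failed, that is, it ran to completion without producing a usable page.
Code lines 33~36 handle the case where the previous compaction was withdrawn for some reason, such as being deferred or bailing out early on lock contention. The function returns whether the zonelist is still suitable for another compaction attempt.
Code lines 46~51 return true so that compaction repeats up to the maximum retry count. For requests above the costly order, the 16 retries are cut to a quarter, so only up to 4 retries are allowed.
Code lines 57~65 are the check_priority: label. It gives the request further chances at a higher compaction priority. If the current compact priority value is still above the minimum priority value for the request, the value is decremented by 1 (raising the priority by one step), the retry count is reset to 0, and true is returned.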

Page allocation via OOM killing

__alloc_pages_may_oom()

mm/page_alloc.c

static inline struct page *
__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
        const struct alloc_context *ac, unsigned long *did_some_progress)
{
        struct oom_control oc = {
                .zonelist = ac->zonelist,
                .nodemask = ac->nodemask,
                .memcg = NULL,
                .gfp_mask = gfp_mask,
                .order = order,
        };
        struct page *page;

        *did_some_progress = 0;

        /*
         * Acquire the oom lock.  If that fails, somebody else is
         * making progress for us.
         */
        if (!mutex_trylock(&oom_lock)) {
                *did_some_progress = 1;
                schedule_timeout_uninterruptible(1);
                return NULL;
        }

        /*
         * Go through the zonelist yet one more time, keep very high watermark
         * here, this is only to catch a parallel oom killing, we must fail if
         * we're still under heavy pressure. But make sure that this reclaim
         * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
         * allocation which will never fail due to oom_lock already held.
         */
        page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
                                      ~__GFP_DIRECT_RECLAIM, order,
                                      ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
        if (page)
                goto out;

        /* Coredumps can quickly deplete all memory reserves */
        if (current->flags & PF_DUMPCORE)
                goto out;
        /* The OOM killer will not help higher order allocs */
        if (order > PAGE_ALLOC_COSTLY_ORDER)
                goto out;
        /*
         * We have already exhausted all our reclaim opportunities without any
         * success so it is time to admit defeat. We will skip the OOM killer
         * because it is very likely that the caller has a more reasonable
         * fallback than shooting a random task.
         */
        if (gfp_mask & __GFP_RETRY_MAYFAIL)
                goto out;
        /* The OOM killer does not needlessly kill tasks for lowmem */
        if (ac->high_zoneidx < ZONE_NORMAL)
                goto out;
        if (pm_suspended_storage())
                goto out;
        /*
         * XXX: GFP_NOFS allocations should rather fail than rely on
         * other request to make a forward progress.
         * We are in an unfortunate situation where out_of_memory cannot
         * do much for this context but let's try it to at least get
         * access to memory reserved if the current task is killed (see
         * out_of_memory). Once filesystems are ready to handle allocation
         * failures more gracefully we should just bail out here.
         */

        /* The OOM killer may not free memory on a specific node */
        if (gfp_mask & __GFP_THISNODE)
                goto out;

        /* Exhausted what can be done so it's blame time */
        if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
                *did_some_progress = 1;

                /*
                 * Help non-failing allocations by giving them access to memory
                 * reserves
                 */
                if (gfp_mask & __GFP_NOFAIL)
                        page = __alloc_pages_cpuset_fallback(gfp_mask, order,
                                        ALLOC_NO_WATERMARKS, ac);
        }
out:
        mutex_unlock(&oom_lock);
        return page;
}

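Tries to obtain pages through OOM killing. If pages may become available this way, 1 is written to did_some_progress.

Code lines 20~24: if the oom lock cannot be acquired, someone else is already making progress, so did_some_progress is set to 1, the task sleeps for 1 tick to yield to other tasks, and NULL is returned.
Code lines 33~37 add hardwall and cpuset, exclude direct reclaim, and try the allocation once more against the high watermark.
- A parallel OOM kill may already have relieved the memory pressure.
Code lines 40~41 jump to the out: label if the current task is already dumping core.
Code lines 43~44 jump to the out: label for requests above the costly order, since the OOM killer cannot help them.
Code lines 51~52: for a __GFP_RETRY_MAYFAIL request, the reclaim opportunities are already exhausted and the caller probably has a reasonable fallback, so execution jumps to the out: label to skip OOM killing.
- __GFP_RETRY_MAYFAIL is used when allocating through migrate_pages() -> new_page().
Code lines 54~55 jump to the out: label to skip OOM killing if the request targets zones below ZONE_NORMAL (DMA32 or lower).
Code lines 56~57 jump to the out: label to skip OOM killing if storage is suspended, so that io and fs cannot be used.
Code lines 69~70 jump to the out: label to skip OOM killing if the request is restricted to the local node.
Code lines 73~83: if OOM killing succeeded in freeing memory, or the nofail option is used, 1 is assigned to the output argument did_some_progress. With the nofail option, an allocation through the cpuset fallback is also attempted.
Code lines 84~86 are the out: label, which releases the oom lock and returns the page.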

cpuset fallback after OOM killing

__alloc_pages_cpuset_fallback()

mm/page_alloc.c

static inline struct page *
__alloc_pages_cpuset_fallback(gfp_t gfp_mask, unsigned int order,
                              unsigned int alloc_flags,
                              const struct alloc_context *ac)
{
        struct page *page;

        page = get_page_from_freelist(gfp_mask, order,
                        alloc_flags|ALLOC_CPUSET, ac);
        /*
         * fallback to ignore cpuset restriction if our nodes
         * are depleted
         */
        if (!page)
                page = get_page_from_freelist(gfp_mask, order,
                                alloc_flags, ac);

        return page;
}

First tries the allocation with cpuset applied to the alloc flags. If that fails, it tries again without the cpuset restriction.

References

Zoned Allocator -1- (Physical Page Allocation-Fastpath) | 문c
Zoned Allocator -2- (Physical Page Allocation-Slowpath) | 문c (this post)
Zoned Allocator -3- (Buddy Page Allocation) | 문c
Zoned Allocator -4- (Buddy Page Free) | 문c
Zoned Allocator -5- (Per-CPU Page Frame Cache) | 문c
Zoned Allocator -6- (Watermark) | 문c
Zoned Allocator -7- (Direct Compact) | 문c
Zoned Allocator -8- (Direct Compact-Isolation) | 문c
Zoned Allocator -9- (Direct Compact-Migration) | 문c
Zoned Allocator -10- (LRU & pagevec) | 문c
Zoned Allocator -11- (Direct Reclaim) | 문c
Zoned Allocator -12- (Direct Reclaim-Shrink-1) | 문c
Zoned Allocator -13- (Direct Reclaim-Shrink-2) | 문c
Zoned Allocator -14- (Kswapd) | 문c

Zoned Allocator -1- (Physical Page Allocation-Fastpath)

2016-06-24, 2023-10-04 문영일

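Structure of the buddy system

The buddy page allocator, known as the buddy system, allocates and frees memory in units of pages. The buddy system manages contiguously allocatable pages in power-of-two units. Pages are divided into order slots, from 2^0 = 1 page up to 2^(MAX_ORDER - 1) pages.

Page allocation order

In the buddy system used by the page allocator, page allocations are requested by order. This means requests can only be made in power-of-two units. For example, requesting an order 3 page means requesting 2^3 = 8 pages.

The following figure shows the order pages that the buddy system can hand out.

Nine contiguous free pages exist between 0x1234_5000 ~ 0x1236_2000, but because the buddy system manages pages aligned to their order, the allocatable pages are as follows.
- 1 order-0 page
- 2 order-1 pages
- 1 order-2 page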

MAX_ORDER

The maximum number of pages the buddy system can allocate at once is 2^(MAX_ORDER-1).

Example) With PAGE_SIZE=4K and MAX_ORDER=11, the maximum allocation is 1024 pages, or 4M bytes.
- 2^0, 2^1, 2^2, … 2^10 pages
- 4K, 8K, 16K, …, 4M bytes
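
The arithmetic above can be checked with a few helper functions. This is a minimal sketch assuming PAGE_SIZE=4K and MAX_ORDER=11 as in the example. The helper names (`max_alloc_pages`, `order_to_bytes`, `order_for_pages`) are hypothetical and are not kernel functions.

```c
#define PAGE_SHIFT 12   /* 4K pages, as in the example above */
#define MAX_ORDER  11

/* Largest number of pages a single buddy request can return. */
static unsigned long max_alloc_pages(void)
{
        return 1UL << (MAX_ORDER - 1);
}

/* Size in bytes of one block of the given order. */
static unsigned long order_to_bytes(unsigned int order)
{
        return (1UL << order) << PAGE_SHIFT;
}

/* Smallest order whose block holds at least nr_pages pages. */
static unsigned int order_for_pages(unsigned long nr_pages)
{
        unsigned int order = 0;

        while ((1UL << order) < nr_pages)
                order++;
        return order;
}
```

A request for 5 pages has to be rounded up to order 3 (8 pages), because only power-of-two sizes can be requested.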

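The following figure shows the order slots, from 0 up to MAX_ORDER-1, managing free pages.

Among the page structures of each free memory block, the head page structure is the representative page and is the one linked into the list.

In the default configuration of the ARM32 and ARM64 kernels, MAX_ORDER is defined as 11. It can be changed with the CONFIG_FORCE_MAX_ZONEORDER kernel option. Each order slot is further structured as follows so that it is managed with as little fragmentation as possible.

Each order slot is divided by migration type so that pages with the same mobility attribute stay clustered together where possible. This division improves the efficiency of page reclaim and memory compaction. A ZONE_MOVABLE area consisting only of the MIGRATE_MOVABLE type can also be created.
In the free_list that holds the pages, free pages form pairs (buddies): when two buddies come together they are merged and promoted to the next higher order, and when needed a block is split into two blocks of the next lower order. The bitmap named map is no longer used to track buddies; only the free_list lists and per-page information are used.
The free_list is hot toward the front and cold toward the back. The hot and cold attributes correspond to the head and tail positions of the list. Pages near the front are likely to be allocated again soon, while pages near the back are likely to be merged and promoted to higher orders. This helps prevent fragmentation of free pages and also improves performance through better cache locality.

The management techniques of the buddy system keep evolving and its complexity keeps growing, but the direction is always toward higher efficiency (less fragmentation). Figure 4-41 shows the core of the buddy memory allocator.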

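Page Blocks and the mobility (migrate) Attribute

Memory is divided into page blocks, and a bitmap with 4 bits per page block expresses the mobility attribute. The first 3 bits classify each page block by migration type. A page block manages 2^pageblock_order pages, and the mobility attribute used most by those pages is recorded as the representative mobility attribute of the block. The remaining 1 bit lets compaction skip the block.

migrate Type

Also called the page type. Each order page has a migrate type that expresses its mobility attribute, and pages with the same attribute are kept clustered together where possible to suppress fragmentation of contiguous memory. This was designed into the buddy system with the goal of keeping free memory as large and contiguous as possible. The free_list of the buddy system is managed per migrate type as listed below. However, on systems with very little memory (a few MB to a few tens of MB), where at least one page block per migrate type cannot be secured, all pages are configured as unmovable.

Note that the pcp cache of the buddy system is managed using only the three most frequently used types below (unmovable, movable, reclaimable).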

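MIGRATE_UNMOVABLE
- A type that can be neither moved nor reclaimed.
- Usage
  - Used for pages allocated by the kernel, slab, I/O buffers, kernel stacks, page tables, and so on.
MIGRATE_MOVABLE
- A type whose in-use pages can be moved when a large contiguous memory area is needed, in order to limit fragmentation as much as possible.
- Usage
  - Used for user (file, anon) memory allocations.
MIGRATE_RECLAIMABLE
- A type that cannot be moved but can be reclaimed when memory is short. It is not a frequently used type.
- Usage
  - Used for slab caches created with the __GFP_RECLAIMABLE flag explicitly specified.
MIGRATE_HIGHATOMIC
- To reduce the chance that atomic requests for high-order pages fail, the kernel reserves pages of this type in advance, one block at a time. This type can grow up to 1% of total memory.
- Usage
  - Kernel threads using the RT scheduler, interrupt handlers, and the like request memory with GFP_ATOMIC, which must be served without page reclaim because reclaim may sleep. When such a request is for high-order pages, it can fail with OOM (Out Of Memory) even though enough memory exists, simply because no high-order page is available. Pages of the MIGRATE_HIGHATOMIC type reserved in advance are allocated in this case to get past the crisis.
- In kernel 4.4-rc1 the MIGRATE_RESERVE type was removed and MIGRATE_HIGHATOMIC was added instead to support high-order atomic allocations.
MIGRATE_CMA
- A page type managed separately by the CMA memory allocator. When a CMA area is set up within the buddy system, movable pages may also be allocated from it. If the area runs short on a CMA request, the movable pages are moved elsewhere. Once pages are allocated as CMA pages, their allocation and management are handled separately by the CMA memory allocator.
- Usage
  - Used when the kernel allocates memory for purposes such as DMA.
MIGRATE_ISOLATE
- The kernel temporarily changes a range to this type in order to migrate the in-use movable pages within it elsewhere. The buddy system never allocates from pages of this type.
- Usage
  - Used when the CMA area runs short of memory and movable pages inside the CMA area are moved out of it to secure free pages.
  - Used when in-use movable pages of a region are moved elsewhere for memory hot-remove.

The following figure shows the lists that manage free pages for each of the 6 migrate types in every order slot.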

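pageblock_order

The pageblock_order macro expresses the number of pages that make up a page block as a power-of-two exponent.

The ARM64 kernel supports huge pages and sets it to match the huge page size. When huge pages are not used, it is set to match the largest page size the buddy system uses.

Example) With 4K pages and huge pages enabled, pageblock_order=9
Example) With 4K pages and huge pages disabled, pageblock_order=10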

include/linux/pageblock-flags.h

#define pageblock_order         HUGETLB_PAGE_ORDER

The ARM32 kernel sets it to match the largest page size the buddy system uses.

Example) With 4K pages, pageblock_order=10

include/linux/pageblock-flags.h

#define pageblock_order (MAX_ORDER-1)

The following ARM64 system shows that pageblock_order is set to 9, together with the number of free pages managed by the buddy system per node, zone, order and migrate type. It also shows the number of page blocks per node, zone and migrate type.

$ cat /proc/pagetypeinfo
Page block order: 9
Pages per block:  512

Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
Node    0, zone    DMA32, type    Unmovable      0      0    168    322    108      6      2      0      0      1      0
Node    0, zone    DMA32, type      Movable      0      0     31      9      3      4      0      0      0      0    412
Node    0, zone    DMA32, type  Reclaimable      0      0      1      0      0      1      1      1      1      0      0
Node    0, zone    DMA32, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type          CMA      0      0      0      0      0      0      1      1      1      1      7
Node    0, zone    DMA32, type      Isolate      0      0      0      0      0      0      0      0      0      0      0

Number of blocks type     Unmovable      Movable  Reclaimable   HighAtomic          CMA      Isolate
Node 0, zone    DMA32          232         1250           38            0           16            0

Number of mixed blocks    Unmovable      Movable  Reclaimable   HighAtomic          CMA      Isolate
Node 0, zone    DMA32            0            1            1            0            0            0

The following ARM32 system shows that pageblock_order is set to 10.

$ cat /proc/pagetypeinfo
Page block order: 10
Pages per block:  1024

Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
Node    0, zone   Normal, type    Unmovable     10     13      9      6      2      2      0      1      0      1      0
Node    0, zone   Normal, type  Reclaimable      1      2      2      3      0      0      1      0      1      0      0
Node    0, zone   Normal, type      Movable      6      3      1      1    447    109      4      0      0      1    147
Node    0, zone   Normal, type      Reserve      0      0      0      0      0      0      0      0      0      0      2
Node    0, zone   Normal, type          CMA      1      1      1      2      1      2      1      0      1      1      0
Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0

Number of blocks type     Unmovable  Reclaimable      Movable      Reserve          CMA      Isolate
Node 0, zone   Normal            4            5          223            2            2            0

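On the rpi2, MAX_ORDER is set to 11 and pageblock_order is set to MAX_ORDER-1.
- The MAX_ORDER value differs between kernel versions. Values such as 9 and 10 were used in the past.

The following figure shows one case where the order pages in each pageblock keep a single migrate type and another where several types are mixed.

The kernel tries to keep a single migrate type within a block where possible, but migrate types can get mixed in situations such as memory shortage.
Keeping migrate types unmixed lowers the chance of fragmentation.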

Representative migrate Type per pageblock

Just as order pages are managed by migrate type, each pageblock also has its own migrate type. If order pages of several migrate types are in use within one pageblock, the most used migrate type is designated as the representative migrate type of the block. The 3 bits of the representative migrate type plus a 1-bit skip bit used by compaction are stored in the usemap.

Usage of Each usemap Bit

Representative migrate type (3 bits)
- To avoid fragmentation, page allocation tries to allocate from a pageblock whose representative migrate type matches the requested migrate type.
skip bit (1 bit)
- A bit that lets compaction skip the pageblock when compaction runs under memory shortage.
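
The 4 bits per pageblock (3 type bits plus 1 skip bit) can be pictured as nibbles packed into an array of unsigned long. The following is a minimal user-space sketch of such packing. The names `pb_get`, `pb_set`, `PB_TYPE_MASK` and `PB_SKIP_MASK`, and the bit order within each word, are assumptions made for illustration and do not reproduce the kernel's actual usemap layout.

```c
#include <limits.h>

/* Illustrative values; the names below are hypothetical, not kernel API. */
#define PB_BITS_PER_BLOCK 4UL           /* 3 type bits + 1 skip bit */
#define PB_TYPE_MASK      0x7UL
#define PB_SKIP_MASK      0x8UL
#define PB_BITS_PER_LONG  (sizeof(unsigned long) * CHAR_BIT)

enum toy_migratetype { MT_UNMOVABLE, MT_MOVABLE, MT_RECLAIMABLE };

/* Read the bits selected by mask from the nibble of the given pageblock. */
static unsigned long pb_get(const unsigned long *usemap, unsigned long block,
                            unsigned long mask)
{
        unsigned long bit = block * PB_BITS_PER_BLOCK;

        return (usemap[bit / PB_BITS_PER_LONG] >> (bit % PB_BITS_PER_LONG)) & mask;
}

/* Overwrite only the bits selected by mask, leaving the others untouched. */
static void pb_set(unsigned long *usemap, unsigned long block,
                   unsigned long val, unsigned long mask)
{
        unsigned long bit = block * PB_BITS_PER_BLOCK;
        unsigned long shift = bit % PB_BITS_PER_LONG;
        unsigned long *word = &usemap[bit / PB_BITS_PER_LONG];

        *word = (*word & ~(mask << shift)) | ((val & mask) << shift);
}
```

Because the mask is passed in, changing the representative migrate type with `pb_set(usemap, block, MT_MOVABLE, PB_TYPE_MASK)` does not disturb the skip bit, and the reverse also holds.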

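The following figure shows the representative migrate type of each pageblock as recorded in the usemap.

PF_MEMALLOC

A flag bit used by special kernel threads involved in memory reclaim, or by threads that must briefly allocate pages during page reclaim. When memory falls below the watermark, the page reclaim system runs. During reclaim, some special reclaim requests, such as swap over a network, need to allocate pages temporarily. For example, when swapping over the network under memory shortage, the skb (socket buffers) needed for network processing must be allocated. If such an allocation request triggered page reclaim again, the cycle would repeat. To avoid this recursion, the PF_MEMALLOC flag marks a request as a "temporary allocation made in order to resolve a memory shortage". With this flag set, the allocator behaves as follows.

Memory below the watermark may also be allocated.
No page reclaim of any kind is requested again, so reclaim does not recurse.

Like PF_MEMALLOC, the following two flags were added for temporary allocations made while memory is short.

PF_MEMALLOC_NOIO

Memory below the watermark may also be allocated.
Page reclaim that involves IO requests is not repeated. In other words, page reclaim that works purely in memory without IO requests may still run.

PF_MEMALLOC_NOFS

Memory below the watermark may also be allocated.
Page reclaim that goes through a file system is not repeated. In other words, page reclaim that uses memory or other kinds of IO rather than a file system may still run.
Reference: mm: introduce memalloc_nofs_{save,restore} API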

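Page Allocator Structure

The following figure shows the main components of the page allocator.

Per node
- Nodes and zonelists according to the NUMA memory policy
- Page reclaim mechanism
Per zone
- Buddy core (heart)
- Buddy cache (pcp)

Page Allocation Sequence

The page allocator performs an allocation broadly through the following steps.

First, the NUMA memory policy determines the target node or nodes.
The allocation is made under the control of the Memory Control Group.
The allocation is carried out through the buddy system.
- A request for 1 page (order 0) is served from pcp, the cache of the buddy system.
- Under memory shortage, memory reclaim may accompany the allocation, depending on whether the request comes from interrupt context and on the requested flag options.

GFP Mask (gfp_mask)

Flags used when requesting a page allocation.

Reference: GFP Flags | 문c

Allocation Flags (alloc_flags)

Allocation flags that are used inside the page allocation functions, separately from gfp_mask, for internal purposes.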

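ALLOC_WMARK_MIN
- The threshold that restricts allocation when remaining memory would fall below the min watermark.
- User memory requests (GFP_USER), ordinary kernel memory requests (GFP_KERNEL), and the like are not served below this threshold.
- A GFP_ATOMIC request, however, is allowed to use about half of the reserve kept below this threshold for emergencies.
ALLOC_WMARK_LOW
- Used as the threshold at which page reclaim machinery such as kcompactd and kswapd is started, when remaining memory falls below the low watermark.
ALLOC_WMARK_HIGH
- Used as the threshold at which page reclaim machinery such as kcompactd and kswapd is put back to sleep, when remaining memory is at or above the high watermark.
ALLOC_NO_WATERMARKS
- Allows allocation regardless of the watermarks.
- Tasks that run with the PF_MEMALLOC flag (page reclaim threads such as kswapd and kcompactd) must be able to allocate memory regardless of the watermarks, so this flag is used for them.
ALLOC_HARDER
- This flag is used in the following situations:
  - when memory is requested with the GFP_ATOMIC flag
  - when a kernel thread using the RT scheduler requests memory
- This flag has the following effects:
  - Under memory shortage, it lets the allocation go 25% below the min watermark.
    - With GFP_ATOMIC, the 50% reduction of ALLOC_HIGH (below) is applied first, and then the additional 25%.
  - It allows use of the MIGRATE_HIGHATOMIC freelist, which is kept in reserve so that high-order allocations do not fail.
ALLOC_HIGH
- Used together with ALLOC_HARDER for GFP_ATOMIC requests. Under memory shortage, it lets the allocation go 50% below the min watermark.
ALLOC_CPUSET
- Restricts the memory a task requests by means of the cpuset subsystem of cgroup.
- Memory requested from interrupt context is allocated ignoring the cpuset.
ALLOC_CMA
- Requests for movable pages normally try to allocate without using the cma area. With this flag, the cma area is also used when memory is short.
ALLOC_NOFRAGMENT
- Tries to allocate from within page blocks made up only of the requested migratetype.
- When memory is short, however, this request is unavoidably ignored and a fragmenting allocation is made.
- When kernel memory that uses the normal zone (GFP_KERNEL, GFP_ATOMIC, and so on) is allocated and the node has a dma (or dma32) zone below that normal zone, this flag is used to avoid, as far as possible, pages of several migratetypes being allocated within one page block.
ALLOC_KSWAPD
- Requests such as GFP_KERNEL, GFP_USER and GFP_HIGHUSER, but not GFP_ATOMIC, carry the __GFP_RECLAIM flag (direct + kswapd). This alloc flag is set when the __GFP_KSWAPD_RECLAIM part is present. It means that the kcompactd and kswapd threads are woken immediately when memory is short.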
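
The 50% and 25% reductions described for ALLOC_HIGH and ALLOC_HARDER above can be sketched as a small calculation on the min watermark. This is a simplified illustration: the function name `effective_min` is hypothetical, the flag bit values are chosen only for the example, and the real watermark check in the kernel takes further factors (order, lowmem reserve, CMA pages) into account.

```c
/* Illustrative bit values for the two flags discussed above. */
#define ALLOC_HARDER 0x10u
#define ALLOC_HIGH   0x20u

/*
 * Return the min watermark (in pages) that a request actually has to
 * respect, after applying the reductions granted by the alloc flags.
 */
static long effective_min(long min, unsigned int alloc_flags)
{
        if (alloc_flags & ALLOC_HIGH)
                min -= min / 2;         /* 50% below the min watermark */
        if (alloc_flags & ALLOC_HARDER)
                min -= min / 4;         /* a further 25% of what is left */
        return min;
}
```

With a min watermark of 1000 pages, a plain request must leave 1000 pages free, an ALLOC_HARDER request 750, and a GFP_ATOMIC-style request with both flags only 375, because the 25% is taken from the already halved value.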

ALLOC_FAIR

The fair zone policy using the ALLOC_FAIR flag became unnecessary once the NUMA memory policy was in use, and was removed in kernel v4.8-rc1.

Reference: mm, page_alloc: remove fair zone allocation policy

NUMA Memory Policy

Reference: NUMA -3- (Memory policy) | 문c

Physical Page Allocation (alloc)

The following figure shows the flow of a page allocation.

alloc_pages()

include/linux/gfp.h

#ifdef CONFIG_NUMA
static inline struct page * alloc_pages(gfp_t gfp_mask, unsigned int order)
{
        return alloc_pages_current(gfp_mask, order);
}
#else
#define alloc_pages(gfp_mask, order) \
                alloc_pages_node(numa_node_id(), gfp_mask, order)
#endif

Allocates 2^order contiguous pages through the buddy system. On a NUMA system, the alloc_pages_current() function selects the node so that the NUMA memory policy is reflected, and then calls __alloc_pages_nodemask() internally to allocate the pages.

alloc_pages_current()

mm/mempolicy.c

/**
 *      alloc_pages_current - Allocate pages.
 *
 *      @gfp:
 *              %GFP_USER   user allocation,
 *              %GFP_KERNEL kernel allocation,
 *              %GFP_HIGHMEM highmem allocation,
 *              %GFP_FS     don't call back into a file system.
 *              %GFP_ATOMIC don't sleep.
 *      @order: Power of two of allocation size in pages. 0 is a single page.
 *
 *      Allocate a page from the kernel page pool.  When not in
 *      interrupt context and apply the current process NUMA policy.
 *      Returns NULL when no page can be allocated.
 */

struct page *alloc_pages_current(gfp_t gfp, unsigned order)
{
        struct mempolicy *pol = &default_policy;
        struct page *page;

        if (!in_interrupt() && !(gfp & __GFP_THISNODE))
                pol = get_task_policy(current);

        /*
         * No reference counting needed for current->mempolicy
         * nor system default_policy
         */
        if (pol->mode == MPOL_INTERLEAVE)
                page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
        else
                page = __alloc_pages_nodemask(gfp, order,
                                policy_node(gfp, pol, numa_node_id()),
                                policy_nodemask(gfp, pol));

        return page;
}
EXPORT_SYMBOL(alloc_pages_current);

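On a NUMA system, selects a node according to the memory policy and allocates 2^order contiguous pages through the buddy system.

In code lines 3~7, the default memory policy is selected when running in interrupt context or when the request is restricted to the current node. Otherwise the memory policy given to the current task is selected. If the task has no memory policy set, the preferred memory policy of the node is used when a node is specified, and the default memory policy when none is. The default memory policy allocates from the local node.
- Restricted to the local node with the __GFP_THISNODE flag -> default memory policy
- In interrupt context -> default (local node preferred) memory policy
- Policy of the task
  - If the task has a policy -> the memory policy of the task
  - If a node is specified -> the preferred memory policy of the node
  - If no node is specified -> default (local node preferred) memory policy
In code lines 13~14, with the interleave memory policy, pages are allocated from the nodes in round-robin fashion.
In code lines 15~16, with any other policy, order pages are allocated within the requested node restriction.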

alloc_pages_node()

include/linux/gfp.h

static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
                                                unsigned int order)
{
        /* Unknown node is current node */
        if (nid < 0)
                nid = numa_node_id();

        return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
}

Allocates 2^order contiguous pages from the specified node. If an unknown node is given, allocates from the current node.

__alloc_pages()

include/linux/gfp.h

static inline struct page *
__alloc_pages(gfp_t gfp_mask, unsigned int order,
                struct zonelist *zonelist)
{
        return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
}

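Allocates 2^order pages of contiguous physical memory from the zonelist, which holds the priority order of nodes and zones.

The Heart of the Buddy Allocator

Allocating Pages from the Specified Nodes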

__alloc_pages_nodemask()

mm/page_alloc.c

/*
 * This is the 'heart' of the zoned buddy allocator.
 */

struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
                                                        nodemask_t *nodemask)
{
        struct page *page;
        unsigned int alloc_flags = ALLOC_WMARK_LOW;
        gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
        struct alloc_context ac = { };

        /*
         * There are several places where we assume that the order value is sane
         * so bail out early if the request is out of bound.
         */
        if (unlikely(order >= MAX_ORDER)) {
                WARN_ON_ONCE(!(gfp_mask & __GFP_NOWARN));
                return NULL;
        }

        gfp_mask &= gfp_allowed_mask;
        alloc_mask = gfp_mask;
        if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
                return NULL;

        finalise_ac(gfp_mask, &ac);

        /*
         * Forbid the first pass from falling back to types that fragment
         * memory until all local zones are considered.
         */
        alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone, gfp_mask);

        /* First allocation attempt */
        page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
        if (likely(page))
                goto out;

        /*
         * Apply scoped allocation constraints. This is mainly about GFP_NOFS
         * resp. GFP_NOIO which has to be inherited for all allocation requests
         * from a particular context which has been marked by
         * memalloc_no{fs,io}_{save,restore}.
         */
        alloc_mask = current_gfp_context(gfp_mask);
        ac.spread_dirty_pages = false;

        /*
         * Restore the original nodemask if it was potentially replaced with
         * &cpuset_current_mems_allowed to optimize the fast-path attempt.
         */
        if (unlikely(ac.nodemask != nodemask))
                ac.nodemask = nodemask;

        page = __alloc_pages_slowpath(alloc_mask, order, &ac);

out:
        if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page &&
            unlikely(memcg_kmem_charge(page, gfp_mask, order) != 0)) {
                __free_pages(page, order);
                page = NULL;
        }

        trace_mm_page_alloc(page, order, alloc_mask, ac.migratetype);

        return page;
}
EXPORT_SYMBOL(__alloc_pages_nodemask);

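Selects a node and zone with reference to the given nodemask, zonelist and flags, and allocates 2^order pages of contiguous physical memory.

In code line 6, the initial value of the alloc flags is set to use the low watermark.
In code lines 14~17, an @order of MAX_ORDER or more cannot be used; null is returned in this case.
- The largest number of contiguous pages that can be requested from the buddy system at once is 2^(MAX_ORDER-1).
In code line 19, while the kernel boot process is running, the drivers and file systems needed for IO are not ready, and so the memory reclaim system is not running either. To keep requests made during this period from using those facilities, the __GFP_RECLAIM, __GFP_IO and __GFP_FS bits are removed from the gfp flags.
- During boot, GFP_BOOT_MASK is assigned to the global variable gfp_allowed_mask.
  - GFP_BOOT_MASK holds all bits except __GFP_RECLAIM | __GFP_IO | __GFP_FS.
In code lines 20~23, the alloc context and the required alloc flags are prepared before the allocation is attempted.
In code line 25, the remaining members of the alloc context are prepared.
In code line 31, alloc flags such as nofragment are added according to the zone and the gfp mask.
In code lines 34~36, the first, fast-path page allocation is attempted.
In code lines 44~54, if the fast-path allocation failed, dirty zone balancing is turned off and the slow-path allocation is attempted.
In code lines 56~61, this is the out: label. The allocated page is returned, except that the allocation is given up if it exceeds the limit of the memory control group.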
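
The effect of `gfp_mask &= gfp_allowed_mask` during boot can be illustrated with a small user-space model. All `DEMO_` names and bit values below are assumptions made for this sketch. They mirror the structure of the kernel flags described above but are not the kernel's actual definitions.

```c
typedef unsigned int demo_gfp_t;

/* Hypothetical bit assignments, for illustration only. */
#define DEMO_GFP_IO             0x040u
#define DEMO_GFP_FS             0x080u
#define DEMO_GFP_DIRECT_RECLAIM 0x400u
#define DEMO_GFP_KSWAPD_RECLAIM 0x800u
#define DEMO_GFP_RECLAIM        (DEMO_GFP_DIRECT_RECLAIM | DEMO_GFP_KSWAPD_RECLAIM)
#define DEMO_GFP_KERNEL         (DEMO_GFP_RECLAIM | DEMO_GFP_IO | DEMO_GFP_FS)
#define DEMO_GFP_BITS_MASK      0xffffffu

/* Everything except the reclaim, IO and FS bits is allowed while booting. */
#define DEMO_GFP_BOOT_MASK \
        (DEMO_GFP_BITS_MASK & ~(DEMO_GFP_RECLAIM | DEMO_GFP_IO | DEMO_GFP_FS))

/* Starts out restricted; set to DEMO_GFP_BITS_MASK once boot has finished. */
static demo_gfp_t demo_gfp_allowed_mask = DEMO_GFP_BOOT_MASK;

/* Models the masking step at the top of the allocation function. */
static demo_gfp_t demo_filter_gfp(demo_gfp_t gfp_mask)
{
        return gfp_mask & demo_gfp_allowed_mask;
}
```

While the boot mask is in place, a `DEMO_GFP_KERNEL` request loses all of its reclaim, IO and FS bits, so the allocation cannot trigger facilities that are not ready yet. After the mask is opened up, the same request passes through unchanged.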

Preparing the alloc context

prepare_alloc_pages()

mm/page_alloc.c

static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
                int preferred_nid, nodemask_t *nodemask,
                struct alloc_context *ac, gfp_t *alloc_mask,
                unsigned int *alloc_flags)
{
        ac->high_zoneidx = gfp_zone(gfp_mask);
        ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
        ac->nodemask = nodemask;
        ac->migratetype = gfpflags_to_migratetype(gfp_mask);

        if (cpusets_enabled()) {
                *alloc_mask |= __GFP_HARDWALL;
                if (!ac->nodemask)
                        ac->nodemask = &cpuset_current_mems_allowed;
                else
                        *alloc_flags |= ALLOC_CPUSET;
        }

        fs_reclaim_acquire(gfp_mask);
        fs_reclaim_release(gfp_mask);

        might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);

        if (should_fail_alloc_page(gfp_mask, order))
                return false;

        if (IS_ENABLED(CONFIG_CMA) && ac->migratetype == MIGRATE_MOVABLE)
                *alloc_flags |= ALLOC_CMA;

        return true;
}

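Before the allocation is attempted, gathers the values to be passed to the lower-level functions into the alloc_context structure. Flags are also added to the in/out argument @alloc_flags where needed. It returns false only when the allocation is forced to fail for debugging purposes.

In code line 6, the zone index corresponding to @gfp_mask is obtained.
In code line 7, one of the two zonelists of node @preferred_nid is selected according to @gfp_mask.
- Reference: build_all_zonelists() | 문c
In code lines 8~9, the nodemask is stored and the migration type to allocate is derived from the gfp flags.
In code lines 11~17, when the cpuset of the control group is in use, the hardwall flag is added to alloc_mask so that the allocation honors the restrictions set in the current cpuset directory of the cgroup for the requesting task. This restriction is later lifted for allocations such as GFP_ATOMIC. The handling also depends on whether the caller specified a nodemask:
- hardwall + the nodes in the nodemask requested by the caller
- hardwall + the nodes assigned to the task, if the caller requested no nodemask
In code line 22, a preempt point is executed if direct reclaim is allowed.
- Direct reclaim is allowed for ordinary user and kernel memory requests.
In code lines 24~25, a failure can be injected for debugging purposes.
In code lines 27~28, use of the cma area is permitted for movable pages.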

finalise_ac()

mm/page_alloc.c

/* Determine whether to spread dirty pages and what the first usable zone */
static inline void finalise_ac(gfp_t gfp_mask, struct alloc_context *ac)
{
        /* Dirty zone balancing only done in the fast path */
        ac->spread_dirty_pages = (gfp_mask & __GFP_WRITE);

        /*
         * The preferred zone is used for statistics but crucially it is
         * also used as the starting point for the zonelist iterator. It
         * may get reset for allocations that ignore memory policies.
         */
        ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
                                        ac->high_zoneidx, ac->nodemask);
}

Prepares the remaining members of the alloc context.

In code line 5, if the gfp flags contain __GFP_WRITE, dirty zone balancing is used, in the fast-path page allocation only.
In code lines 12~13, the first usable zone of the current zonelist, limited by high_zoneidx and the nodemask, is stored in preferred_zoneref. This value is used later for statistics and, more importantly, as the starting point of the zonelist iteration.

alloc_flags_nofragment()

mm/page_alloc.c

/*
 * The restriction on ZONE_DMA32 as being a suitable zone to use to avoid
 * fragmentation is subtle. If the preferred zone was HIGHMEM then
 * premature use of a lower zone may cause lowmem pressure problems that
 * are worse than fragmentation. If the next zone is ZONE_DMA then it is
 * probably too small. It only makes sense to spread allocations to avoid
 * fragmentation between the Normal and DMA32 zones.
 */

static inline unsigned int
alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
{
        unsigned int alloc_flags = 0;

        if (gfp_mask & __GFP_KSWAPD_RECLAIM)
                alloc_flags |= ALLOC_KSWAPD;

#ifdef CONFIG_ZONE_DMA32
        if (zone_idx(zone) != ZONE_NORMAL)
                goto out;

        /*
         * If ZONE_DMA32 exists, assume it is the one after ZONE_NORMAL and
         * the pointer is within zone->zone_pgdat->node_zones[]. Also assume
         * on UMA that if Normal is populated then so is DMA32.
         */
        BUILD_BUG_ON(ZONE_NORMAL - ZONE_DMA32 != 1);
        if (nr_online_nodes > 1 && !populated_zone(--zone))
                goto out;

out:
#endif /* CONFIG_ZONE_DMA32 */
        return alloc_flags;
}

Adds alloc flags such as nofragment according to the zone and the gfp mask.

In code lines 6~7, if __GFP_KSWAPD_RECLAIM is requested in the gfp flags, ALLOC_KSWAPD is added to the alloc flags so that kswapd can be woken under memory shortage.
In code lines 9~23, when both the dma32 and normal zones are in operation and the request targets the normal zone, the ALLOC_NOFRAGMENT flag is meant to be added. (The v5.0 code shown here has a bug.)
- Because of the bug the ALLOC_NOFRAGMENT flag is never added; it was fixed in kernel v5.1-rc7.
  - Reference: mm/page_alloc.c: fix never set ALLOC_NOFRAGMENT flag

ALLOC_NOFRAGMENT

When memory of the requested migratetype runs short, pages are stolen from another type (fallback migratetype). With this flag, stealing is limited to units of at least one page block, so that different migratetypes do not get mixed within a page block. This removes a source of fragmentation between migratetypes.

Fastpath Page Allocation

The function below is called from both the Fastpath and the Slowpath of page allocation. When it is called from the Fastpath, __GFP_HARDWALL is added to the gfp_mask argument.

GFP_KERNEL
- (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
- __GFP_RECLAIM consists of the two flags ___GFP_DIRECT_RECLAIM and ___GFP_KSWAPD_RECLAIM.
GFP_USER
- (__GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
- For page allocations requested from user space, __GFP_HARDWALL is set so that pages are not allocated from memory nodes that the cpuset of the current task does not permit.

The following figure separates the functions used in the Fastpath routine from those used in the Slowpath routine.

Note that when get_page_from_freelist() is called from __alloc_pages_slowpath(), it is part of the Slowpath.

get_page_from_freelist()

mm/page_alloc.c -1/2-

/*
 * get_page_from_freelist goes through the zonelist trying to allocate
 * a page.
 */

static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
                                                const struct alloc_context *ac)
{
        struct zoneref *z;
        struct zone *zone;
        struct pglist_data *last_pgdat_dirty_limit = NULL;
        bool no_fallback;

retry:
        /*
         * Scan zonelist, looking for a zone with enough free.
         * See also __cpuset_node_allowed() comment in kernel/cpuset.c.
         */
        no_fallback = alloc_flags & ALLOC_NOFRAGMENT;
        z = ac->preferred_zoneref;
        for_next_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
                                                                ac->nodemask) {
                struct page *page;
                unsigned long mark;

                if (cpusets_enabled() &&
                        (alloc_flags & ALLOC_CPUSET) &&
                        !__cpuset_zone_allowed(zone, gfp_mask))
                                continue;
                /*
                 * When allocating a page cache page for writing, we
                 * want to get it from a node that is within its dirty
                 * limit, such that no single node holds more than its
                 * proportional share of globally allowed dirty pages.
                 * The dirty limits take into account the node's
                 * lowmem reserves and high watermark so that kswapd
                 * should be able to balance it without having to
                 * write pages from its LRU list.
                 *
                 * XXX: For now, allow allocations to potentially
                 * exceed the per-node dirty limit in the slowpath
                 * (spread_dirty_pages unset) before going into reclaim,
                 * which is important when on a NUMA setup the allowed
                 * nodes are together not big enough to reach the
                 * global limit.  The proper fix for these situations
                 * will require awareness of nodes in the
                 * dirty-throttling and the flusher threads.
                 */
                if (ac->spread_dirty_pages) {
                        if (last_pgdat_dirty_limit == zone->zone_pgdat)
                                continue;

                        if (!node_dirty_ok(zone->zone_pgdat)) {
                                last_pgdat_dirty_limit = zone->zone_pgdat;
                                continue;
                        }
                }

                if (no_fallback && nr_online_nodes > 1 &&
                    zone != ac->preferred_zoneref->zone) {
                        int local_nid;

                        /*
                         * If moving to a remote node, retry but allow
                         * fragmenting fallbacks. Locality is more important
                         * than fragmentation avoidance.
                         */
                        local_nid = zone_to_nid(ac->preferred_zoneref->zone);
                        if (zone_to_nid(zone) != local_nid) {
                                alloc_flags &= ~ALLOC_NOFRAGMENT;
                                goto retry;
                        }
                }

Allocates the requested 2^@order pages based on the alloc context. Returns the page on success and null on failure.

In code line 15, no_fallback is set if the alloc flags contain ALLOC_NOFRAGMENT, so that the first attempt does not fragment page blocks.
- Several mobility types can be mixed within one page block.
- To prevent fragmentation, it is better to steer each page block toward a single mobility type while memory is plentiful.
In code lines 16~18, the zones of the requested nodes in the zonelist are visited in order, within the range [preferred zone, high_zoneidx].
In code lines 22~25, zones that the cpuset of the current task does not allow are skipped.
In code lines 45~53, dirty limits are enforced per node. A node that has exceeded its dirty limit is skipped, unless every node has exceeded its limit. In the slowpath, spread_dirty_pages is false, so the dirty limit does not apply.
In code lines 55~69, allocating from the local node, even at the cost of fragmentation, matters more than honoring nofragment on a remote node. On a system with two or more nodes, if a nofragment request would have to be served from another node, the nofragment request is dropped and control jumps to the retry label so that the allocation can be made from the local node.

mm/page_alloc.c -2/2-

                mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
                if (!zone_watermark_fast(zone, order, mark,
                                       ac_classzone_idx(ac), alloc_flags)) {
                        int ret;

#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
                        /*
                         * Watermark failed for this zone, but see if we can
                         * grow this zone if it contains deferred pages.
                         */
                        if (static_branch_unlikely(&deferred_pages)) {
                                if (_deferred_grow_zone(zone, order))
                                        goto try_this_zone;
                        }
#endif
                        /* Checked here to keep the fast path fast */
                        BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
                        if (alloc_flags & ALLOC_NO_WATERMARKS)
                                goto try_this_zone;

                        if (node_reclaim_mode == 0 ||
                            !zone_allows_reclaim(ac->preferred_zoneref->zone, zone))
                                continue;

                        ret = node_reclaim(zone->zone_pgdat, gfp_mask, order);
                        switch (ret) {
                        case NODE_RECLAIM_NOSCAN:
                                /* did not scan */
                                continue;
                        case NODE_RECLAIM_FULL:
                                /* scanned but unreclaimable */
                                continue;
                        default:
                                /* did we reclaim enough */
                                if (zone_watermark_ok(zone, order, mark,
                                                ac_classzone_idx(ac), alloc_flags))
                                        goto try_this_zone;

                                continue;
                        }
                }

try_this_zone:
                page = rmqueue(ac->preferred_zoneref->zone, zone, order,
                                gfp_mask, alloc_flags, ac->migratetype);
                if (page) {
                        prep_new_page(page, order, gfp_mask, alloc_flags);

                        /*
                         * If this is a high-order atomic allocation then check
                         * if the pageblock should be reserved for the future
                         */
                        if (unlikely(order && (alloc_flags & ALLOC_HARDER)))
                                reserve_highatomic_pageblock(page, zone, order);

                        return page;
                } else {
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
                        /* Try again if zone has deferred pages */
                        if (static_branch_unlikely(&deferred_pages)) {
                                if (_deferred_grow_zone(zone, order))
                                        goto try_this_zone;
                        }
#endif
                }
        }

        /*
         * It's possible on a UMA machine to get through all zones that are
         * fragmented. If avoiding fragmentation, reset and try again.
         */
        if (no_fallback) {
                alloc_flags &= ~ALLOC_NOFRAGMENT;
                goto retry;
        }

        return NULL;
}

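In code lines 1~3, the quickly estimated number of remaining free pages is compared with one of the three watermarks (high, low, min) selected by the alloc flags. The block that follows handles the case where the zone is below that watermark, that is, short of memory.
In code lines 11~14, large-memory systems such as x86 defer the initialization of part of memory during boot. If the pages of the current zone are still uninitialized for that reason, this is not a real memory shortage, so the allocation is attempted in this zone.
In code lines 18~19, the allocation is also attempted in this zone when watermarks are to be ignored (ALLOC_NO_WATERMARKS).
In code lines 21~23, the zone is skipped if "/proc/sys/vm/zone_reclaim_mode" is 0 (default) or if the zone is not on the local node or a nearby remote node.
In code lines 25~32, the zone is skipped if the node was not scanned for page reclaim, or if it was already fully scanned and further reclaim would have no effect.
In code lines 33~40, if node reclaim reclaimed some pages or succeeded, the remaining free pages are computed again, this time exactly, and compared with the selected watermark. If the zone is no longer short of memory, the allocation is attempted in this zone; otherwise the zone is skipped.
In code lines 43~45, at the try_this_zone: label, the order pages are actually allocated through the buddy system.
In code lines 46~56, if memory was allocated successfully, the new page structures are prepared and the page is returned.
In code lines 57~65, if initialization is deferred, allocation from this zone is attempted once more.
In code lines 72~75, if the first attempt failed, the nofragment attribute is removed and the allocation is retried so that it can succeed even across different migrate types within a block.
In code line 77, null is returned if the second attempt also failed.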
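
The overall shape of the loop, a first pass that refuses to fragment and a second pass that allows it, can be modelled in user space as follows. This is a toy sketch under stated assumptions: `struct toy_zone`, `toy_get_zone` and `TOY_ALLOC_NOFRAGMENT` are hypothetical names, the watermark test is reduced to a single subtraction, and the `needs_fragmenting` field stands in for the case where the buddy allocator could only serve the request by stealing from another migrate type.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_ALLOC_NOFRAGMENT 0x1u

struct toy_zone {
        long free_pages;        /* free pages currently in the zone */
        long watermark;         /* watermark selected by the alloc flags */
        bool needs_fragmenting; /* request can only be served by mixing types */
};

/* Returns the index of the zone that serves the request, or -1 on failure. */
static int toy_get_zone(const struct toy_zone *zones, size_t nr_zones,
                        long nr_pages, unsigned int alloc_flags)
{
        bool no_fallback;
        size_t i;

retry:
        no_fallback = alloc_flags & TOY_ALLOC_NOFRAGMENT;
        for (i = 0; i < nr_zones; i++) {
                /* Simplified watermark check: stay above the mark afterwards. */
                if (zones[i].free_pages - nr_pages < zones[i].watermark)
                        continue;

                /* First pass: do not mix migrate types inside a pageblock. */
                if (no_fallback && zones[i].needs_fragmenting)
                        continue;

                return (int)i;
        }

        /* Every zone was rejected: drop nofragment and scan once more. */
        if (no_fallback) {
                alloc_flags &= ~TOY_ALLOC_NOFRAGMENT;
                goto retry;
        }

        return -1;
}
```

If the first zone has enough free pages but would need to fragment, and the second zone is below its watermark, the first pass finds nothing and the retry without `TOY_ALLOC_NOFRAGMENT` settles on the first zone.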

다음 그림은 get_page_from_freelist() 함수를 통해 처리되는 과정을 보여준다.

zone_reclaim_mode

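On kernels with NUMA support, this is set through the "/proc/sys/vm/zone_reclaim_mode" file. The value is a bitmask, so the modes can be combined.
- RECLAIM_OFF(0)
  - If the zone is below the watermark, skip to the next zone without reclaiming from the current zone.
- RECLAIM_ZONE(1)
  - Reclaim from the inactive LRU lists.
- RECLAIM_WRITE(2)
  - Reclaim modified file pages by writing them out.
- RECLAIM_UNMAP(4)
  - Reclaim mapped file pages by unmapping them.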
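Because the setting is a bitmask, a value such as 3 enables both zone reclaim and writeout. The following is a minimal user-space sketch of decoding it, assuming the kernel's bit layout for the RECLAIM_* definitions (bits 0, 1 and 2). `reclaim_mode_has()` is a hypothetical helper for illustration, not a kernel function.

```c
#include <stdbool.h>

/* Bit values assumed to match the kernel's RECLAIM_* definitions. */
#define RECLAIM_OFF   0u
#define RECLAIM_ZONE  (1u << 0)  /* reclaim from the inactive LRU lists */
#define RECLAIM_WRITE (1u << 1)  /* also write out dirty file pages */
#define RECLAIM_UNMAP (1u << 2)  /* also unmap mapped file pages */

/* Hypothetical helper: is the given behaviour enabled in this mode value? */
static bool reclaim_mode_has(unsigned int mode, unsigned int bit)
{
        return (mode & bit) != 0;
}
```

For example, `reclaim_mode_has(3, RECLAIM_WRITE)` is true while `reclaim_mode_has(3, RECLAIM_UNMAP)` is false.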

zone_allows_reclaim()

mm/page_alloc.c

static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
{
        return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <
                                RECLAIM_DISTANCE;
}

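Returns true if the distance between the node of local_zone and the node of the requested zone is less than RECLAIM_DISTANCE (30).

This limits page reclaim to zones on the local node or on a nearby remote node.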
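The distance test can be modelled in user space as follows. This is only a sketch: the three-node distance table and the name `node_allows_reclaim()` are made up for illustration, and only the threshold of 30 comes from the text.

```c
#include <stdbool.h>

#define RECLAIM_DISTANCE 30   /* threshold quoted in the text */
#define NR_NODES 3            /* hypothetical three-node machine */

/* Hypothetical distance table; 10 means the local node. */
static const int node_distance_tbl[NR_NODES][NR_NODES] = {
        { 10, 20, 40 },
        { 20, 10, 40 },
        { 40, 40, 10 },
};

/* Reclaim is allowed only when the target node is close enough. */
static bool node_allows_reclaim(int local_nid, int nid)
{
        return node_distance_tbl[local_nid][nid] < RECLAIM_DISTANCE;
}
```

With this table, node 0 may reclaim from itself and from node 1 (distance 20), but not from node 2 (distance 40).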

Initialization after allocating new pages

prep_new_page()

mm/page_alloc.c

static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
                                                        unsigned int alloc_flags)
{
        int i;

        post_alloc_hook(page, order, gfp_flags);

        if (!free_pages_prezeroed() && (gfp_flags & __GFP_ZERO))
                for (i = 0; i < (1 << order); i++)
                        clear_highpage(page + i);

        if (order && (gfp_flags & __GFP_COMP))
                prep_compound_page(page, order);

        /*
         * page is set pfmemalloc when ALLOC_NO_WATERMARKS was necessary to
         * allocate the page. The expectation is that the caller is taking
         * steps that will free more memory. The caller should avoid the page
         * being used for !PFMEMALLOC purposes.
         */
        if (alloc_flags & ALLOC_NO_WATERMARKS)
                set_page_pfmemalloc(page);
        else
                clear_page_pfmemalloc(page);
}

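Initializes all page descriptors covering the allocated 2^order pages of memory.

In code line 6, the first page descriptor of the allocated 2^order pages is initialized.
In code lines 8~10, if zero-initialization was requested, the whole allocated memory is cleared to 0.
- Memory in the highmem area of 32-bit systems is temporarily mapped, cleared to 0, and then unmapped again.
In code lines 12~13, for a compound page, the page descriptors are initialized as a compound page.
- Every tail page points to the head page.
In code lines 21~24, if the allocation was made while ignoring watermarks, -1 is stored in the index member of the page descriptor to mark that the page was allocated in the pfmemalloc state. Otherwise the mark is cleared.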

post_alloc_hook()

mm/page_alloc.c

inline void post_alloc_hook(struct page *page, unsigned int order,
                                gfp_t gfp_flags)
{
        set_page_private(page, 0);
        set_page_refcounted(page);

        arch_alloc_page(page, order);
        kernel_map_pages(page, 1 << order, 1);
        kernel_poison_pages(page, 1 << order, 1);
        kasan_alloc_pages(page, order);
        set_page_owner(page, order, gfp_flags);
}

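Initializes the first page descriptor of the allocated 2^order pages of memory.

In code line 4, the private member of the page descriptor is initialized to 0.
In code line 5, the reference counter is initialized to 1.
In code line 7, the architecture-specific page allocation hook is called.
- ARM and ARM64 do not provide this hook.
In code line 8, the debug page allocation hook is called to map the pages.
In code line 9, poison-based debugging is performed: the poison pattern written to the memory beforehand is checked for corruption.
In code line 10, the KASAN debugging hook is called.
In code line 11, the page owner tracking hook used for debugging is called.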

CPUSET-related

cpuset_zone_allowed()

include/linux/cpuset.h

static inline int cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask)
{
        if (cpusets_enabled())
                return __cpuset_zone_allowed(z, gfp_mask);
        return true;
}

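Returns true if the node of the requested zone may be used by the current task. When cpusets are disabled, the answer is always true. Otherwise true is returned when the call is made in interrupt context, when the node is already allowed for the current task, or when the task is an OOM victim. If none of these apply and __GFP_HARDWALL is set, false is returned so that nodes outside the memory nodes allowed by the current task's cpuset get no chance. Without __GFP_HARDWALL, a task with PF_EXITING set is allowed, and in the remaining case the nearest hardwall ancestor cpuset decides.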

__cpuset_zone_allowed()

include/linux/cpuset.h

static inline int __cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask)
{
        return __cpuset_node_allowed(zone_to_nid(z), gfp_mask);
}

Calls the function below with the node id of the zone.

__cpuset_node_allowed()

kernel/cpuset.c

/**
 * cpuset_node_allowed - Can we allocate on a memory node?
 * @node: is this an allowed node?
 * @gfp_mask: memory allocation flags
 *
 * If we're in interrupt, yes, we can always allocate.  If @node is set in
 * current's mems_allowed, yes.  If it's not a __GFP_HARDWALL request and this
 * node is set in the nearest hardwalled cpuset ancestor to current's cpuset,
 * yes.  If current has access to memory reserves as an oom victim, yes.
 * Otherwise, no.
 *
 * GFP_USER allocations are marked with the __GFP_HARDWALL bit,
 * and do not allow allocations outside the current tasks cpuset
 * unless the task has been OOM killed.
 * GFP_KERNEL allocations are not so marked, so can escape to the
 * nearest enclosing hardwalled ancestor cpuset.
 *
 * Scanning up parent cpusets requires callback_lock.  The
 * __alloc_pages() routine only calls here with __GFP_HARDWALL bit
 * _not_ set if it's a GFP_KERNEL allocation, and all nodes in the
 * current tasks mems_allowed came up empty on the first pass over
 * the zonelist.  So only GFP_KERNEL allocations, if all nodes in the
 * cpuset are short of memory, might require taking the callback_lock.
 *
 * The first call here from mm/page_alloc:get_page_from_freelist()
 * has __GFP_HARDWALL set in gfp_mask, enforcing hardwall cpusets,
 * so no allocation on a node outside the cpuset is allowed (unless
 * in interrupt, of course).
 *
 * The second pass through get_page_from_freelist() doesn't even call
 * here for GFP_ATOMIC calls.  For those calls, the __alloc_pages()
 * variable 'wait' is not set, and the bit ALLOC_CPUSET is not set
 * in alloc_flags.  That logic and the checks below have the combined
 * affect that:
 *      in_interrupt - any node ok (current task context irrelevant)
 *      GFP_ATOMIC   - any node ok
 *      tsk_is_oom_victim   - any node ok
 *      GFP_KERNEL   - any node in enclosing hardwalled cpuset ok
 *      GFP_USER     - only nodes in current tasks mems allowed ok.
 */

bool __cpuset_node_allowed(int node, gfp_t gfp_mask)
{
        struct cpuset *cs;              /* current cpuset ancestors */
        int allowed;                    /* is allocation in zone z allowed? */
        unsigned long flags;

        if (in_interrupt())
                return true;
        if (node_isset(node, current->mems_allowed))
                return true;
        /*
         * Allow tasks that have access to memory reserves because they have
         * been OOM killed to get memory anywhere.
         */
        if (unlikely(tsk_is_oom_victim(current)))
                return true;
        if (gfp_mask & __GFP_HARDWALL)  /* If hardwall request, stop here */
                return false;

        if (current->flags & PF_EXITING) /* Let dying task have memory */
                return true;

        /* Not hardwall and node outside mems_allowed: scan up cpusets */
        spin_lock_irqsave(&callback_lock, flags);

        rcu_read_lock();
        cs = nearest_hardwall_ancestor(task_cs(current));
        allowed = node_isset(node, cs->mems_allowed);
        rcu_read_unlock();

        spin_unlock_irqrestore(&callback_lock, flags);
        return allowed;
}

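Returns true if the requested node may be used by the current task. True is returned when the call is made in interrupt context, when the node is already allowed for the current task, or when the task is an OOM victim. If none of these apply and __GFP_HARDWALL is set, false is returned so that nodes outside the memory nodes allowed by the current task's cpuset get no chance. Without __GFP_HARDWALL, a task with PF_EXITING set is allowed, and in the remaining case the nearest hardwall ancestor cpuset decides.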

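In code lines 7~8, true is returned when called in interrupt context.
In code lines 9~10, true is returned if the node is allowed for the current task (set in its mems_allowed).
In code lines 15~16, true is returned in the unlikely case that the current task is an OOM victim, i.e. it is being killed because of a memory shortage.
In code lines 17~18, false is returned for a hardwall request.
In code lines 20~21, true is returned if the current task is exiting.
In code lines 24~32, the request is not a hardwall request and the node is not allowed for the current task. In this case the nearest hardwall ancestor of the current task's cpuset is looked up, and the function returns whether the node is among the memory nodes allowed by that cpuset.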
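The order of these checks matters, so it can help to see them as a plain decision ladder. The following user-space sketch assumes the inputs have already been evaluated into booleans. `struct alloc_request` and `model_node_allowed()` are hypothetical names used only for this illustration.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the inputs the kernel function looks at. */
struct alloc_request {
        bool in_interrupt;      /* called from interrupt context */
        bool node_in_mems;      /* node is set in current->mems_allowed */
        bool oom_victim;        /* task is an OOM victim */
        bool hardwall;          /* __GFP_HARDWALL set in gfp_mask */
        bool exiting;           /* PF_EXITING set on the task */
        bool node_in_ancestor;  /* node is in the nearest hardwall ancestor */
};

/* Same ordering as the checks described above. */
static bool model_node_allowed(const struct alloc_request *r)
{
        if (r->in_interrupt)
                return true;
        if (r->node_in_mems)
                return true;
        if (r->oom_victim)
                return true;
        if (r->hardwall)        /* a hardwall request stops here */
                return false;
        if (r->exiting)
                return true;
        return r->node_in_ancestor;
}
```

Note that a hardwall request from an exiting task is still refused, because the hardwall check comes before the PF_EXITING check.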

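The __GFP_HARDWALL flag is used in the following three cases.

fastpath page allocation requests
page allocation requests from user tasks
slab page allocation requests

When the kernel allocates memory for itself, it does not use the __GFP_HARDWALL flag, except in special cases such as allocations for slab pages.

If the __GFP_HARDWALL flag is not set, the cgroup hierarchy is consulted.
In the cpuset subsystem of cgroup, starting from the group that contains the current task and walking toward the parents, the nearest ancestor group with hardwall or exclusive set is found and used.
Setting cpuset.mem_exclusive or cpuset.mem_hardwall to 1 in the cpuset subsystem enables exclusive or hardwall for that group, respectively.
Every cgroup subsystem is organized as a tree, and children inherit their parent's values unless a value is set explicitly. Enabling hardwall puts up a wall between the group and its parent group.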

nearest_hardwall_ancestor()

kernel/cpuset.c

/*
 * nearest_hardwall_ancestor() - Returns the nearest mem_exclusive or
 * mem_hardwall ancestor to the specified cpuset.  Call holding
 * callback_lock.  If no ancestor is mem_exclusive or mem_hardwall
 * (an unusual configuration), then returns the root cpuset.
 */
static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
{
        while (!(is_mem_exclusive(cs) || is_mem_hardwall(cs)) && parent_cs(cs))
                cs = parent_cs(cs);
        return cs;
}

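Returns the given cpuset if it uses memory exclusively (mem_exclusive) or is a hardwall (mem_hardwall). Otherwise the parent cpusets are searched upward until one satisfies the condition. If none does, the root cpuset is returned.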
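The upward walk can be tried out in user space with a stripped-down tree. This is a sketch only: `struct toy_cpuset` and `toy_nearest_hardwall()` are hypothetical stand-ins, and the locking that the kernel requires around the walk is left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-in for struct cpuset. */
struct toy_cpuset {
        const char *name;
        bool mem_exclusive;
        bool mem_hardwall;
        struct toy_cpuset *parent;   /* NULL for the root */
};

/* Climb until a mem_exclusive or mem_hardwall cpuset, or the root, is hit. */
static struct toy_cpuset *toy_nearest_hardwall(struct toy_cpuset *cs)
{
        while (!(cs->mem_exclusive || cs->mem_hardwall) && cs->parent)
                cs = cs->parent;
        return cs;
}
```

In a tree root -> a (hardwall) -> b, a lookup starting at b stops at a, while a lookup from a plain child of the root ends at the root.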

read_mems_allowed_begin()

include/linux/cpuset.h

/*
 * read_mems_allowed_begin is required when making decisions involving
 * mems_allowed such as during page allocation. mems_allowed can be updated in
 * parallel and depending on the new value an operation can fail potentially
 * causing process failure. A retry loop with read_mems_allowed_begin and
 * read_mems_allowed_retry prevents these artificial failures.
 */

static inline unsigned int read_mems_allowed_begin(void)
{
        if (!static_branch_unlikely(&cpusets_pre_enable_key))
                return 0;
        return read_seqcount_begin(&current->mems_allowed_seq);
}

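Reads the current value of the current task's mems_allowed_seq sequence counter.

When the cpuset configuration that applies to the current task changes (the settings under the /sys/fs/cpuset directory), the value of current->mems_allowed_seq changes.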
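The begin/retry pairing mentioned in the kernel comment can be illustrated with a single-threaded model. This is a sketch under strong simplifications: the `toy_*` names are hypothetical, and the real read_seqcount_begin() also waits while a write is in progress and uses memory barriers, both of which are omitted here.

```c
#include <stdbool.h>

/* Hypothetical single-threaded model of a sequence counter. */
struct toy_seqcount {
        unsigned int sequence;   /* odd while a writer is mid-update */
};

/* Reader side: remember the sequence value before using mems_allowed. */
static unsigned int toy_read_begin(const struct toy_seqcount *s)
{
        return s->sequence;
}

/* True if a writer ran since toy_read_begin(), so the reader must retry. */
static bool toy_read_retry(const struct toy_seqcount *s, unsigned int start)
{
        return s->sequence != start;
}

/* Writer side: bump the counter around the update. */
static void toy_write(struct toy_seqcount *s, unsigned long *mems,
                      unsigned long new_mems)
{
        s->sequence++;           /* begin: sequence becomes odd */
        *mems = new_mems;
        s->sequence++;           /* end: sequence is even again */
}
```

An allocation that failed between `toy_read_begin()` and a positive `toy_read_retry()` would be retried rather than reported as a failure, since the allowed nodes changed underneath it.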

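Dirty node balancing

A dirty limit is applied to dirty pages (pages that were written to a file but still reside in the file cache and are written back lazily). When the node on which a file write is requested exceeds its dirty limit, the page is allocated on another node so that dirty file pages are spread across nodes.

If every node exceeds its dirty limit and the allocation fails, the dirty limit is simply dropped and the allocation is retried.

The following figure shows dirty page allocation requests being handled so that the 20% dirty limit is not exceeded.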

node_dirty_ok()

mm/page-writeback.c

/**
 * node_dirty_ok - tells whether a node is within its dirty limits
 * @pgdat: the node to check
 *
 * Returns %true when the dirty pages in @pgdat are within the node's
 * dirty limit, %false if the limit is exceeded.
 */

bool node_dirty_ok(struct pglist_data *pgdat)
{
        unsigned long limit = node_dirty_limit(pgdat);
        unsigned long nr_pages = 0;

        nr_pages += node_page_state(pgdat, NR_FILE_DIRTY);
        nr_pages += node_page_state(pgdat, NR_UNSTABLE_NFS);
        nr_pages += node_page_state(pgdat, NR_WRITEBACK);

        return nr_pages <= limit;
}

Returns whether the requested node is within its dirty limit: 1 = within the dirty limit, 0 = over the dirty limit.

Returns true if the node's NR_FILE_DIRTY + NR_UNSTABLE_NFS + NR_WRITEBACK pages do not exceed the node's dirty (write buffer) limit.

The per-node counters can be checked with the "cat /proc/zoneinfo" command, as follows.

The file is named zoneinfo rather than nodeinfo because it originally printed per-zone information rather than per-node information.

$ cat /proc/zoneinfo
Node 0, zone    DMA32
  per-node stats
      nr_inactive_anon 2162
      nr_active_anon 4186
      nr_inactive_file 8415
      nr_active_file 5303
      nr_unevictable 0
      nr_slab_reclaimable 3490
      nr_slab_unreclaimable 6347
      nr_isolated_anon 0
      nr_isolated_file 0
      workingset_nodes 0
      workingset_refault 0
      workingset_activate 0
      workingset_restore 0
      workingset_nodereclaim 0
      nr_anon_pages 4141
      nr_mapped    6894
      nr_file_pages 15924
      nr_dirty     91                       <-----
      nr_writeback 0                        <-----
      nr_writeback_temp 0
      nr_shmem     2207
      nr_shmem_hugepages 0
      nr_shmem_pmdmapped 0
      nr_anon_transparent_hugepages 0
      nr_unstable  0                        <-----
      nr_vmscan_write 0
      nr_vmscan_immediate_reclaim 0
      nr_dirtied   330
      nr_written   239
      nr_kernel_misc_reclaimable 0
  pages free     634180
        min      5632
        low      7040
        high     8448
        spanned  786432
        present  786432
        managed  765785
        protection: (0, 0, 0)
      nr_free_pages 634180
      nr_zone_inactive_anon 2162
      nr_zone_active_anon 4186
      nr_zone_inactive_file 8415
      nr_zone_active_file 5303
      nr_zone_unevictable 0
      nr_zone_write_pending 91
      nr_mlock     0
      nr_page_table_pages 233
      nr_kernel_stack 1472
      nr_bounce    0
      nr_free_cma  8128
      numa_hit     97890
      numa_miss    0
      numa_foreign 0
      numa_interleave 6202
      numa_local   97890
      numa_other   0
  pagesets
    cpu: 0
              count: 320
              high:  378
              batch: 63
  vm stats threshold: 24
    cpu: 1
              count: 275
              high:  378
              batch: 63
  vm stats threshold: 24
  node_unreclaimable:  0
  start_pfn:           262144
Node 0, zone   Normal
  pages free     0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        protection: (0, 0, 0)
Node 0, zone  Movable
  pages free     0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        protection: (0, 0, 0)

node_dirty_limit()

mm/page-writeback.c

/**
 * node_dirty_limit - maximum number of dirty pages allowed in a node
 * @pgdat: the node
 *
 * Returns the maximum number of dirty pages allowed in a node, based
 * on the node's dirtyable memory.
 */

static unsigned long node_dirty_limit(struct pglist_data *pgdat)
{
        unsigned long node_memory = node_dirtyable_memory(pgdat);
        struct task_struct *tsk = current;
        unsigned long dirty;

        if (vm_dirty_bytes)
                dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE) *
                        node_memory / global_dirtyable_memory();
        else
                dirty = vm_dirty_ratio * node_memory / 100;

        if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk))
                dirty += dirty / 4;

        return dirty;
}

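Returns the node's dirty limit in pages, computed as a fixed proportion of the node's dirtyable pages. If the task has PF_LESS_THROTTLE set or has a higher priority than user tasks (a real-time task), 25% is added.

If vm_dirty_bytes is set, it is converted to pages and the node receives a share proportional to its part of the global dirtyable memory.
If vm_dirty_bytes is not set, the vm_dirty_ratio percentage of the node's dirtyable memory is used.
Task priority
- Of the priorities 0~139, rt_task covers the high priorities below 100 (0~99).
- 100~139 are the priorities of user tasks.
- A lower number means a higher priority.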
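A worked example makes the two branches easier to compare. The following user-space sketch mirrors the arithmetic of the kernel function. The `toy_*` names and the 4 KiB page size are assumptions for illustration, and the inputs are passed as parameters instead of being read from kernel state.

```c
/* Hypothetical user-space model of the limit calculation. */
#define TOY_PAGE_SIZE 4096UL   /* assumed 4 KiB pages */

static unsigned long toy_div_round_up(unsigned long n, unsigned long d)
{
        return (n + d - 1) / d;
}

static unsigned long toy_node_dirty_limit(unsigned long node_memory,
                                          unsigned long global_memory,
                                          unsigned long dirty_bytes,
                                          unsigned long dirty_ratio,
                                          int less_throttle_or_rt)
{
        unsigned long dirty;

        if (dirty_bytes)        /* byte limit, shared by node size */
                dirty = toy_div_round_up(dirty_bytes, TOY_PAGE_SIZE) *
                        node_memory / global_memory;
        else                    /* percentage of the node's own memory */
                dirty = dirty_ratio * node_memory / 100;

        if (less_throttle_or_rt)
                dirty += dirty / 4;     /* 25% bonus */

        return dirty;
}
```

With 100000 dirtyable pages on the node and a ratio of 20, the limit is 20000 pages, or 25000 for a real-time task. With a byte limit of 1000 pages and a node that holds a quarter of the global dirtyable memory, the node's limit is 250 pages.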

The figure below shows how the result of node_dirty_limit() differs depending on whether dirty_bytes or dirty_ratio is used.

node_dirtyable_memory()

mm/page-writeback.c

/*
 * In a memory zone, there is a certain amount of pages we consider
 * available for the page cache, which is essentially the number of
 * free and reclaimable pages, minus some zone reserves to protect
 * lowmem and the ability to uphold the zone's watermarks without
 * requiring writeback.
 *
 * This number of dirtyable pages is the base value of which the
 * user-configurable dirty ratio is the effictive number of pages that
 * are allowed to be actually dirtied.  Per individual zone, or
 * globally by using the sum of dirtyable pages over all zones.
 *
 * Because the user is allowed to specify the dirty limit globally as
 * absolute number of bytes, calculating the per-zone dirty limit can
 * require translating the configured limit into a percentage of
 * global dirtyable memory first.
 */

/**
 * node_dirtyable_memory - number of dirtyable pages in a node
 * @pgdat: the node
 *
 * Returns the node's number of pages potentially available for dirty
 * page cache.  This is the base value for the per-node dirty limits.
 */

static unsigned long node_dirtyable_memory(struct pglist_data *pgdat)
{
        unsigned long nr_pages = 0;
        int z;

        for (z = 0; z < MAX_NR_ZONES; z++) {
                struct zone *zone = pgdat->node_zones + z;

                if (!populated_zone(zone))
                        continue;

                nr_pages += zone_page_state(zone, NR_FREE_PAGES);
        }

        /*
         * Pages reserved for the kernel should not be considered
         * dirtyable, to prevent a situation where reclaim has to
         * clean pages in order to balance the zones.
         */
        nr_pages -= min(nr_pages, pgdat->totalreserve_pages);

        nr_pages += node_page_state(pgdat, NR_INACTIVE_FILE);
        nr_pages += node_page_state(pgdat, NR_ACTIVE_FILE);

        return nr_pages;
}

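Returns the number of dirtyable pages of the node (the node's free pages + file cache pages in use - totalreserve_pages).

In code lines 6~13, the free pages of all populated zones of the node are summed.
In code line 20, totalreserve_pages is subtracted from the sum.
In code lines 22~23, all file cache pages of the node (inactive + active) are added.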
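The three steps can be replayed in user space with plain numbers. In this sketch, `struct toy_zone`, `struct toy_node` and `toy_node_dirtyable()` are hypothetical stand-ins for the kernel structures and counters, and all values are in pages.

```c
#include <stddef.h>

/* Hypothetical per-zone and per-node counters, in pages. */
struct toy_zone {
        int populated;
        unsigned long nr_free;
};

struct toy_node {
        const struct toy_zone *zones;
        size_t nr_zones;
        unsigned long totalreserve_pages;
        unsigned long nr_inactive_file;
        unsigned long nr_active_file;
};

static unsigned long toy_node_dirtyable(const struct toy_node *n)
{
        unsigned long nr_pages = 0;
        size_t z;

        /* Step 1: free pages of every populated zone. */
        for (z = 0; z < n->nr_zones; z++)
                if (n->zones[z].populated)
                        nr_pages += n->zones[z].nr_free;

        /* Step 2: drop the reserve, clamped so it cannot underflow. */
        nr_pages -= (nr_pages < n->totalreserve_pages) ?
                        nr_pages : n->totalreserve_pages;

        /* Step 3: add the file cache pages. */
        nr_pages += n->nr_inactive_file;
        nr_pages += n->nr_active_file;
        return nr_pages;
}
```

Taking the free and file counters from the zoneinfo output earlier (634180 free, 8415 inactive file, 5303 active file) and an assumed reserve of 10000 pages gives 637898 dirtyable pages.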

global_dirtyable_memory()

mm/page-writeback.c

/**
 * global_dirtyable_memory - number of globally dirtyable pages
 *
 * Returns the global number of pages potentially available for dirty
 * page cache.  This is the base value for the global dirty limits.
 */

static unsigned long global_dirtyable_memory(void)
{
        unsigned long x;

        x = global_zone_page_state(NR_FREE_PAGES);
        /*
         * Pages reserved for the kernel should not be considered
         * dirtyable, to prevent a situation where reclaim has to
         * clean pages in order to balance the zones.
         */
        x -= min(x, totalreserve_pages);

        x += global_node_page_state(NR_INACTIVE_FILE);
        x += global_node_page_state(NR_ACTIVE_FILE);

        if (!vm_highmem_is_dirtyable)
                x -= highmem_dirtyable_memory(x);

        return x + 1;   /* Ensure that we never return 0 */
}

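Returns the number of dirtyable pages in the system (the system's free pages + file cache pages - totalreserve_pages).

The result is free pages + pages in use as file cache - totalreserve_pages, plus 1 so that 0 is never returned.
If highmem is configured so that it may not hold dirty pages, the dirtyable portion of highmem is excluded.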

highmem_dirtyable_memory()

mm/page-writeback.c

static unsigned long highmem_dirtyable_memory(unsigned long total)
{
#ifdef CONFIG_HIGHMEM
        int node;
        unsigned long x = 0;
        int i;

        for_each_node_state(node, N_HIGH_MEMORY) {
                for (i = ZONE_NORMAL + 1; i < MAX_NR_ZONES; i++) {
                        struct zone *z;
                        unsigned long nr_pages;

                        if (!is_highmem_idx(i))
                                continue;

                        z = &NODE_DATA(node)->node_zones[i];
                        if (!populated_zone(z))
                                continue;

                        nr_pages = zone_page_state(z, NR_FREE_PAGES);
                        /* watch for underflows */
                        nr_pages -= min(nr_pages, high_wmark_pages(z));
                        nr_pages += zone_page_state(z, NR_ZONE_INACTIVE_FILE);
                        nr_pages += zone_page_state(z, NR_ZONE_ACTIVE_FILE);
                        x += nr_pages;
                }
        }

        /*
         * Unreclaimable memory (kernel memory or anonymous memory
         * without swap) can bring down the dirtyable pages below
         * the zone's dirty balance reserve and the above calculation
         * will underflow.  However we still want to add in nodes
         * which are below threshold (negative values) to get a more
         * accurate calculation but make sure that the total never
         * underflows.
         */
        if ((long)x < 0)
                x = 0;

        /*
         * Make sure that the number of highmem pages is never larger
         * than the number of the total dirtyable memory. This can only
         * occur in very strange VM situations but we want to make sure
         * that this does not occur.
         */
        return min(x, total);
#else
        return 0;
#endif
}

Returns the number of dirtyable pages in high memory. (Not used on 64-bit systems.)

Structures

alloc_context structure

mm/internal.h

/*
 * Structure for holding the mostly immutable allocation parameters passed
 * between functions involved in allocations, including the alloc_pages*
 * family of functions.
 *      
 * nodemask, migratetype and high_zoneidx are initialized only once in
 * __alloc_pages_nodemask() and then never change.
 *      
 * zonelist, preferred_zone and classzone_idx are set first in
 * __alloc_pages_nodemask() for the fast path, and might be later changed
 * in __alloc_pages_slowpath(). All other functions pass the whole strucure
 * by a const pointer.
 */

struct alloc_context {
        struct zonelist *zonelist;
        nodemask_t *nodemask;           
        struct zone *preferred_zone;
        int migratetype;
        enum zone_type high_zoneidx;
        bool spread_dirty_pages;
};

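A structure used to pass various parameters among the alloc_pages* family of functions.

zonelist
- The zonelist used for page allocation.
nodemask
- Restricts allocation to the specified nodes among the nodes of the zonelist.
- If not specified, all nodes are candidates.
preferred_zone
- In the fastpath, the zone to allocate from first.
- In the slowpath, the first usable zone of the zonelist.
migratetype
- The migrate (page) type to allocate.
high_zoneidx
- Restricts allocation to the zones at or below the specified high zone among the zones of the zonelist.
spread_dirty_pages
- Whether dirty zone balancing is applied. It is used as follows.
  - Set to 1 for fastpath allocation requests
  - Set to 0 for slowpath allocation requests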


pgtable_init()

2016-06-21 / 2019-11-26 문영일

Preparing slab caches for page tables

pgtable_init()

include/linux/mm.h

static inline void pgtable_init(void)
{
        ptlock_cache_init();
        pgtable_cache_init();
}

Prepares the slab caches used for page tables.

The slab cache used for page->ptl
The slab cache for pgd tables
- An empty function on x86 and arm.

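Preparing the slab cache for page->ptl

When DEBUG_SPINLOCK and DEBUG_LOCK_ALLOC are enabled, the spinlock_t used for page->ptl becomes larger, and it is needed for tens of thousands of pages or more. Allocating it through the kmalloc allocator rounds every request up to one of kmalloc's fixed sizes (mostly powers of two), so the amount of memory that is allocated but never used becomes very large. To minimize this waste in debug configurations, a dedicated slab cache is created for page->ptl.

In that configuration, spinlock_t takes about 72 bytes on x86_64.
See: mm: create a separate slab for page->ptl allocation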
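The size of the waste is easy to estimate. The sketch below assumes a list of general-purpose size classes. That list, and the names `toy_size_class()` and `toy_rounding_waste()`, are assumptions for illustration; the caches actually present depend on the kernel configuration.

```c
#include <stddef.h>

/* Assumed general-purpose size classes; real kmalloc caches vary by config. */
static const size_t toy_size_classes[] = { 8, 16, 32, 64, 96, 128, 192, 256 };

/* Smallest assumed size class that fits the request, or 0 if none does. */
static size_t toy_size_class(size_t size)
{
        size_t i;
        size_t n = sizeof(toy_size_classes) / sizeof(toy_size_classes[0]);

        for (i = 0; i < n; i++)
                if (toy_size_classes[i] >= size)
                        return toy_size_classes[i];
        return 0;
}

/* Bytes lost to rounding when nr_objects of this size are allocated. */
static size_t toy_rounding_waste(size_t size, size_t nr_objects)
{
        size_t cls = toy_size_class(size);

        if (cls == 0)            /* larger than every assumed class */
                return 0;
        return (cls - size) * nr_objects;
}
```

Under these assumptions a 72-byte lock is served from the 96-byte class, so 50000 page table locks lose 24 * 50000 = 1200000 bytes to rounding, which a dedicated cache with 72-byte objects avoids.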

ptlock_cache_init()

mm/memory.c

#if USE_SPLIT_PTE_PTLOCKS && ALLOC_SPLIT_PTLOCKS
static struct kmem_cache *page_ptl_cachep;

void __init ptlock_cache_init(void)
{       
        page_ptl_cachep = kmem_cache_create("page->ptl", sizeof(spinlock_t), 0,
                        SLAB_PANIC, NULL);
}
#endif

Creates the ptlock cache used for the page table spinlocks.

include/linux/mm_types.h

#define USE_SPLIT_PTE_PTLOCKS   (NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)

include/linux/mm_types.h

#define ALLOC_SPLIT_PTLOCKS     (SPINLOCK_SIZE > BITS_PER_LONG/8)
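Read together, the two macros say that the separate cache is only built when split page table locks are in use and the lock is too large to be stored in a single long. The following sketch turns the two compile-time tests into run-time functions so that they can be tried with different values. The `toy_*` names are hypothetical, and the 4-byte non-debug lock size used in the usage note is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>

/* Run-time version of USE_SPLIT_PTE_PTLOCKS. */
static bool toy_use_split_ptlocks(unsigned int nr_cpus,
                                  unsigned int split_ptlock_cpus)
{
        return nr_cpus >= split_ptlock_cpus;
}

/* Run-time version of ALLOC_SPLIT_PTLOCKS: the lock does not fit in a long. */
static bool toy_alloc_split_ptlocks(size_t spinlock_size, size_t bits_per_long)
{
        return spinlock_size > bits_per_long / 8;
}
```

On a 64-bit build with 8 CPUs and a split threshold of 4, a 72-byte debug spinlock_t satisfies both tests, while an assumed 4-byte non-debug lock fails the second one and needs no separate allocation.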

Page table lock allocation/free

ptlock_alloc()

mm/memory.c

bool ptlock_alloc(struct page *page)
{       
        spinlock_t *ptl;

        ptl = kmem_cache_alloc(page_ptl_cachep, GFP_KERNEL);
        if (!ptl)
                return false;
        page->ptl = ptl;
        return true;
}

Allocates a slab object to be used as the spinlock and stores it in page->ptl.

ptlock_free()

mm/memory.c

void ptlock_free(struct page *page)
{
        kmem_cache_free(page_ptl_cachep, page->ptl);
}

Frees the slab object used by the spinlock in page->ptl.

CONFIG_SPLIT_PTLOCK_CPUS kernel option

mm/Kconfig

# Heavily threaded applications may benefit from splitting the mm-wide
# page_table_lock, so that faults on different parts of the user address
# space can be handled with less contention: split it at this NR_CPUS.
# Default to 4 for wider testing, though 8 might be more appropriate.
# ARM's adjust_pte (unused if VIPT) depends on mm-wide page_table_lock.
# PA-RISC 7xxx's spinlock_t would enlarge struct page from 32 to 44 bytes.
# DEBUG_SPINLOCK and DEBUG_LOCK_ALLOC spinlock_t also enlarge struct page.
#
config SPLIT_PTLOCK_CPUS
        int
        default "999999" if !MMU
        default "999999" if ARM && !CPU_CACHE_VIPT
        default "999999" if PARISC && !PA20
        default "4"

Preparing the slab cache for pgd tables

pgtable_cache_init() – ARM64

arch/arm64/include/asm/pgtable.h

#define pgtable_cache_init      pgd_cache_init

pgd_cache_init() – ARM64

arch/arm64/mm/pgd.c

void __init pgd_cache_init(void)
{
        if (PGD_SIZE == PAGE_SIZE)
                return;

#ifdef CONFIG_ARM64_PA_BITS_52
        /*
         * With 52-bit physical addresses, the architecture requires the
         * top-level table to be aligned to at least 64 bytes.
         */
        BUILD_BUG_ON(PGD_SIZE < 64);
#endif

        /*
         * Naturally aligned pgds required by the architecture.
         */
        pgd_cache = kmem_cache_create("pgd_cache", PGD_SIZE, PGD_SIZE,
                                      SLAB_PANIC, NULL);
}

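Prepares the slab cache for pgd tables. No cache is created when a pgd table is exactly one page in size (PGD_SIZE == PAGE_SIZE).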
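Whether PGD_SIZE equals PAGE_SIZE depends on the page size, the virtual address width and the number of table levels. The following sketch estimates the size of the top-level table, assuming the usual arm64 layout in which every table entry is 8 bytes and every level below the top resolves PAGE_SHIFT - 3 bits. `toy_pgd_size()` and `toy_needs_pgd_cache()` are hypothetical helpers, not kernel functions.

```c
#include <stdbool.h>

/* Hypothetical helper: bytes occupied by the top-level (pgd) table. */
static unsigned long toy_pgd_size(unsigned int va_bits, unsigned int page_shift,
                                  unsigned int levels)
{
        /* Entries are 8 bytes, so a full table resolves page_shift - 3 bits. */
        unsigned int bits_per_level = page_shift - 3;
        /* Page offset plus the bits resolved by all levels below the pgd. */
        unsigned int pgdir_shift = page_shift + bits_per_level * (levels - 1);

        return (1UL << (va_bits - pgdir_shift)) * 8;
}

/* The slab cache is only needed when the pgd is not exactly one page. */
static bool toy_needs_pgd_cache(unsigned int va_bits, unsigned int page_shift,
                                unsigned int levels)
{
        return toy_pgd_size(va_bits, page_shift, levels) != (1UL << page_shift);
}
```

Under these assumptions, 4 KiB pages with 48-bit addresses and 4 levels give a 4096-byte pgd, so the page allocator is used directly. 64 KiB pages with 48-bit addresses and 3 levels give a 512-byte pgd, which is the case the slab cache serves.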

pgd table allocation/free

pgd_alloc()

arch/arm64/mm/pgd.c

pgd_t *pgd_alloc(struct mm_struct *mm)
{
        if (PGD_SIZE == PAGE_SIZE)
                return (pgd_t *)__get_free_page(PGALLOC_GFP);
        else
                return kmem_cache_alloc(pgd_cache, PGALLOC_GFP);
}

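Allocates a pgd table from the pgd table slab cache. When a pgd table is exactly one page in size, a page is taken from the page allocator instead.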

pgd_free()

arch/arm64/mm/pgd.c

void pgd_free(struct mm_struct *mm, pgd_t *pgd)
{
        if (PGD_SIZE == PAGE_SIZE)
                free_page((unsigned long)pgd);
        else
                kmem_cache_free(pgd_cache, pgd);
}

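Frees the pgd table back to the pgd table slab cache. When a pgd table is exactly one page in size, the page is returned to the page allocator instead.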