문c 블로그

Swap -2- (Swapin & Swapout)

2019-10-29 문영일

Swapin & Swapout

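When a user process accesses an empty virtual address that was unmapped because its page was swapped out, the page must be loaded back from the swap area. Loading a swapped page from the swap area through the swap cache is called swap-in; the reverse, storing a page to the swap area, is called swap-out.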

Swap readahead

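During swap-in, accessing the unmapped address raises a page fault. The handler takes the swap entry stored in the page table for the faulting virtual address and uses it as the key to load the page from the corresponding swap area. Loading neighboring pages in advance at this point can improve performance. This technique is called readahead, and swap uses the following two readahead methods.

VMA-based swap readahead (SSD only)
cluster-based swap readahead (used for HDD, or when VMA-based readahead is disabled)

Pages that readahead brings into the swap cache get the PG_reclaim flag. (To save page flag bits, during readahead this bit serves as PG_readahead rather than as a reclaim marker.)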

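VMA-based swap readahead

When a user process faults on a swapped-out page, the kernel first looks in the swap cache. On a miss, it reads the faulting page from the swap area together with a few neighboring pages within the same VMA. When one of those neighbors faults later, it is already in the swap cache, so it can be quickly converted to an anon page and mapped.

This method requires the vma_ra_enabled sysfs attribute (default: true) to be set, and it is used only with SSD-type block devices. For swap areas on HDDs, VMA-based swap readahead is no longer used because it degraded performance.
References: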
- mm, swap: don’t use VMA based swap readahead if HDD is used as swap (v4.14-rc1)
- mm, swap: VMA based swap readahead (2017) | LWN.net
- mm, swap: VMA based swap readahead (2017, patch) | LWN.net

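The following figure shows how, when a swapped page is loaded from the swap area, some of the fault page's neighbors within the VMA are loaded into the swap cache in advance.

Cluster-based swap readahead

Unlike the VMA-based method, this reads additional pages that neighbor the faulting page's slot in the swap area, not its neighbors in the virtual address space.

The following figure shows how, when a swapped page is loaded from the swap area, some neighboring pages in the swap area are loaded into the swap cache in advance.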
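The "neighbors in the swap area" can be made concrete with a small user-space sketch. It mirrors the offset arithmetic of swap_cluster_readahead() (round down to an aligned block, skip slot 0, stay inside the swap area). The struct and function names are hypothetical, and the sketch assumes the readahead page count is a power of two.

```c
#include <assert.h>

/* Hypothetical helper type: swap offsets covered by one cluster readahead. */
struct swap_window {
	unsigned long start;	/* first offset to read (inclusive) */
	unsigned long end;	/* last offset to read (inclusive) */
};

/*
 * offset:   swap offset of the faulting entry
 * nr_pages: readahead size, assumed to be a power of two
 * max:      number of slots in the swap area
 */
static struct swap_window cluster_window(unsigned long offset,
					 unsigned long nr_pages,
					 unsigned long max)
{
	struct swap_window w;
	unsigned long mask = nr_pages - 1;

	assert(nr_pages != 0 && (nr_pages & mask) == 0);

	w.start = offset & ~mask;	/* round down to the aligned block */
	w.end = offset | mask;		/* last slot of the same block */
	if (w.start == 0)		/* slot 0 holds the swap header */
		w.start++;
	if (w.end >= max)		/* do not run past the swap area */
		w.end = max - 1;
	return w;
}
```

For example, a fault on offset 13 with an 8-page window covers offsets 8 through 15, regardless of which virtual addresses those slots belong to.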

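Swap-related page flags

PG_swapbacked
- Indicates whether the page is backed by swap.
  - With this flag, the page is a normal anon page that can be swapped.
  - Without this flag, the page is a clean anon page that cannot be swapped.
PG_swapcache
- The page is in the swap cache. When a user accesses a swapped-out virtual address, the fault handler looks in the swap cache before going to the swap area.
  - This flag does not tell whether the page has been written to the swap area; use page_swapped() to find that out.
- During swap-in, neighboring pages are also read from the swap area into the swap cache to improve performance.
- During swap-out, the swap cache page is written to the swap area.
PG_writeback
- Set while the page is being written (sync or async) to the swap area.
- Set in swap_writepage(), the swap writeback hook called from pageout(), and cleared when writeback completes.
PG_reclaim (two uses)
- Used as PageReclaim() during swap-out.
  - Set while the page is being written to the swap area for reclaim; the page can be reclaimed once this flag is cleared.
  - Set in pageout() right before writeback and cleared when writeback completes.
- Used as PageReadahead() during swap-in.
  - Set on swap cache pages that were read in advance by readahead.
PG_dirty
- Set so that the page gets written to the swap area; pageout() is called when this flag is seen.
- Set in add_to_swap() and cleared in pageout() right before writeback.
PG_workingset
- Indicates that the page belongs to the working set.
  - When a file page refaults after leaving the inactive lru list, the refault distance is compared with the memory size. A distance smaller than the memory size is recorded with the workingset flag, which is then used to detect thrashing.
  - This addresses the performance loss that occurs when frequently used pages fault repeatedly and keep being replaced in the cache.
  - Reference: mm: workingset: tell cache transitions from workingset thrashing (2018, v4.20-rc1)
- A file page starts at the head of the inactive lru list on first access and is promoted to the active lru once two accesses are detected. An anon page, however, starts at the head of the active lru list on first access, so it is marked workingset from the start.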
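The dual use of PG_reclaim described above comes down to two names for one bit. The following is a minimal user-space sketch of that idea; the enum, the struct, and the bit positions are illustrative stand-ins, not the kernel's actual definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for the kernel's page flag enum (bit numbers made up). */
enum sketch_pageflags {
	SK_PG_locked,
	SK_PG_dirty,
	SK_PG_writeback,
	SK_PG_reclaim,
	SK_PG_swapbacked,
	/* readahead reuses the reclaim bit instead of taking a new one */
	SK_PG_readahead = SK_PG_reclaim,
};

struct sketch_page {
	unsigned long flags;
};

static void sketch_set_flag(struct sketch_page *page, unsigned int bit)
{
	assert(bit < sizeof(page->flags) * 8);
	page->flags |= 1UL << bit;
}

static bool sketch_test_flag(const struct sketch_page *page, unsigned int bit)
{
	return (page->flags >> bit) & 1UL;
}
```

Because both names refer to the same bit, the meaning depends on context: a page being written out for reclaim and a page that was just read ahead are never in both states at once.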

The following figure shows how the swap-related page flags change.

Swap initialization

swap_init_sysfs()

mm/swap_state.c

static int __init swap_init_sysfs(void)
{
        int err;
        struct kobject *swap_kobj;

        swap_kobj = kobject_create_and_add("swap", mm_kobj);
        if (!swap_kobj) {
                pr_err("failed to create swap kobject\n");
                return -ENOMEM;
        }
        err = sysfs_create_group(swap_kobj, &swap_attr_group);
        if (err) {
                pr_err("failed to register swap group\n");
                goto delete_obj;
        }
        return 0;

delete_obj:
        kobject_put(swap_kobj);
        return err;
}
subsys_initcall(swap_init_sysfs);

Creates the /sys/kernel/mm/swap directory for the swap subsystem and creates the related attribute file (vma_ra_enabled) in it.

vma_ra_enabled attribute

mm/swap_state.c

static ssize_t vma_ra_enabled_show(struct kobject *kobj,
                                     struct kobj_attribute *attr, char *buf)
{
        return sprintf(buf, "%s\n", enable_vma_readahead ? "true" : "false");
}
static ssize_t vma_ra_enabled_store(struct kobject *kobj,
                                      struct kobj_attribute *attr,
                                      const char *buf, size_t count)
{
        if (!strncmp(buf, "true", 4) || !strncmp(buf, "1", 1))
                enable_vma_readahead = true;
        else if (!strncmp(buf, "false", 5) || !strncmp(buf, "0", 1))
                enable_vma_readahead = false;
        else
                return -EINVAL;

        return count;
}
static struct kobj_attribute vma_ra_enabled_attr =
        __ATTR(vma_ra_enabled, 0644, vma_ra_enabled_show,
               vma_ra_enabled_store);

static struct attribute *swap_attrs[] = {
        &vma_ra_enabled_attr.attr,
        NULL,
};

static struct attribute_group swap_attr_group = {
        .attrs = swap_attrs,
};

This attribute file enables VMA-based readahead.

The default value of the "/sys/kernel/mm/swap/vma_ra_enabled" attribute is true.
- The attribute was renamed from swap_vma_readahead to vma_ra_enabled.
The swap_use_vma_readahead() function reads whether this attribute is set.
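The matching rules of vma_ra_enabled_store() are easy to miss: only a prefix of the written string is compared. A user-space sketch of the same comparisons, with a hypothetical function name and -1 standing in for -EINVAL, makes this visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical user-space copy of the comparisons in vma_ra_enabled_store().
 * Returns 0 and updates *val on success, -1 (standing in for -EINVAL) otherwise.
 */
static int parse_vma_ra_enabled(const char *buf, bool *val)
{
	assert(buf != NULL && val != NULL);

	if (!strncmp(buf, "true", 4) || !strncmp(buf, "1", 1))
		*val = true;
	else if (!strncmp(buf, "false", 5) || !strncmp(buf, "0", 1))
		*val = false;
	else
		return -1;

	return 0;
}
```

Since only the leading characters are checked, the trailing newline that echo appends is accepted, and so is any other input that merely starts with "1", "0", "true" or "false".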

Swap-out

A normal anon page is written to the swap area and then freed in the following order.

normal anon page → swapcache → unmap → write out → free page
- add_to_swap()
- try_to_unmap()
- pageout()
- free_unref_page_commit()

The following figure shows the swap-out process.

Adding to the swap area

add_to_swap()

mm/swap_state.c

/**
 * add_to_swap - allocate swap space for a page
 * @page: page we want to move to swap
 *
 * Allocate swap space for the page and add the page to the
 * swap cache.  Caller needs to hold the page lock.
 */

int add_to_swap(struct page *page)
{
        swp_entry_t entry;
        int err;

        VM_BUG_ON_PAGE(!PageLocked(page), page);
        VM_BUG_ON_PAGE(!PageUptodate(page), page);

        entry = get_swap_page(page);
        if (!entry.val)
                return 0;

        /*
         * XArray node allocations from PF_MEMALLOC contexts could
         * completely exhaust the page allocator. __GFP_NOMEMALLOC
         * stops emergency reserves from being allocated.
         *
         * TODO: this could cause a theoretical memory reclaim
         * deadlock in the swap out path.
         */
        /*
         * Add it to the swap cache.
         */
        err = add_to_swap_cache(page, entry,
                        __GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN);
        if (err)
                /*
                 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
                 * clear SWAP_HAS_CACHE flag.
                 */
                goto fail;
        /*
         * Normally the page will be dirtied in unmap because its pte should be
         * dirty. A special case is MADV_FREE page. The page'e pte could have
         * dirty bit cleared but the page's SwapBacked bit is still set because
         * clearing the dirty bit and SwapBacked bit has no lock protected. For
         * such page, unmap will not set dirty bit for it, so page reclaim will
         * not write the page out. This can cause data corruption when the page
         * is swap in later. Always setting the dirty bit for the page solves
         * the problem.
         */
        set_page_dirty(page);

        return 1;

fail:
        put_swap_page(page, entry);
        return 0;
}

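Allocates a swap entry and, using it as the key, adds the anon page to the swap cache so that it can later be written to the swap area. Returns 1 on success.

Code lines 6~7: PG_locked and PG_uptodate must be set before swap-out.
Code lines 9~11: get a swap entry for the anon page to be swapped.
Code lines 24~31: add the anon page to the swap cache, keyed by the swap entry.
Code lines 42~44: mark the page dirty, then return 1 for success.
- For a page mapped to an address_space, the dirty state is set through the driver, and PG_dirty is also set on the page.
- A dirty page is written to the swap area when pageout() is called during reclaim; PG_dirty is cleared right before that writeback starts.
Code lines 46~48: the fail: label. Release the swap entry and return 0 for failure.

Adding to the swap cache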

add_to_swap_cache()

mm/swap_state.c

/*
 * add_to_swap_cache resembles add_to_page_cache_locked on swapper_space,
 * but sets SwapCache flag and private instead of mapping and index.
 */

int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
{
        struct address_space *address_space = swap_address_space(entry);
        pgoff_t idx = swp_offset(entry);
        XA_STATE_ORDER(xas, &address_space->i_pages, idx, compound_order(page));
        unsigned long i, nr = 1UL << compound_order(page);

        VM_BUG_ON_PAGE(!PageLocked(page), page);
        VM_BUG_ON_PAGE(PageSwapCache(page), page);
        VM_BUG_ON_PAGE(!PageSwapBacked(page), page);

        page_ref_add(page, nr);
        SetPageSwapCache(page);

        do {
                xas_lock_irq(&xas);
                xas_create_range(&xas);
                if (xas_error(&xas))
                        goto unlock;
                for (i = 0; i < nr; i++) {
                        VM_BUG_ON_PAGE(xas.xa_index != idx + i, page);
                        set_page_private(page + i, entry.val + i);
                        xas_store(&xas, page + i);
                        xas_next(&xas);
                }
                address_space->nrpages += nr;
                __mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, nr);
                ADD_CACHE_INFO(add_total, nr);
unlock:
                xas_unlock_irq(&xas);
        } while (xas_nomem(&xas, gfp));

        if (!xas_error(&xas))
                return 0;

        ClearPageSwapCache(page);
        page_ref_sub(page, nr);
        return xas_error(&xas);
}

Adds the anon page to the swap cache, keyed by the swap entry. Returns 0 on success.

Code line 3: get the swap address_space pointer from the swap entry.
Code line 4: extract only the offset part of the swap entry into idx.
Code line 5: declare the xarray operation state.
- The xarray to operate on is &address_space->i_pages; the initial index (idx) and the order of the entry are specified.
- The XArray data structure (2018) | LWN.net
Code line 6: assign the number of pages in the compound page to nr.
- This is 1 for a normal page; for a thp it is the number of pages that make up the compound page.
Code lines 8~10: a page to be added to the xarray-managed swap cache must have PG_locked and PG_swapbacked set, and must not have PG_swapcache set.
Code lines 12~13: raise the reference count of the page by nr and set PG_swapcache.
- When thp swap is supported, the head page's reference count is raised by nr.
Code lines 15~19: prepare the xarray in advance so that the range of nr pages starting at idx can be created.
Code lines 20~25: for each page, store the swap entry in page->private and store the page in the xarray.
Code lines 26~28: increase the following counters by nr.
- the total page count managed by the address_space
- the NR_FILE_PAGES counter
- the swap_cache_info->add_total counter
Code lines 29~31: the unlock: label. If an xa_node allocation failed, allocate again and repeat.
Code lines 33~34: return 0 if the pages were stored successfully.
Code lines 36~38: on failure, clear PG_swapcache, drop the reference count by nr again, and return the error code.

Swap-in

A swapped page is restored in the following order.

allocate a new page → turn it into a swapcache page → read from the swap area (file/partition) → map

The following figure shows the swap-in process.

swapin_readahead()

mm/swap_state.c

/**
 * swapin_readahead - swap in pages in hope we need them soon
 * @entry: swap entry of this memory
 * @gfp_mask: memory allocation flags
 * @vmf: fault information
 *
 * Returns the struct page for entry and addr, after queueing swapin.
 *
 * It's a main entry function for swap readahead. By the configuration,
 * it will read ahead blocks by cluster-based(ie, physical disk based)
 * or vma-based(ie, virtual address based on faulty address) readahead.
 */

struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
                                struct vm_fault *vmf)
{
        return swap_use_vma_readahead() ?
                        swap_vma_readahead(entry, gfp_mask, vmf) :
                        swap_cluster_readahead(entry, gfp_mask, vmf);
}

Performs swap-in for the swap entry and reads in the page.

If the "/sys/kernel/mm/swap/vma_ra_enabled" attribute is enabled and no HDD is used as a swap area, VMA-based readahead is used; otherwise cluster-based readahead is used.

Swap-in with VMA-based readahead

For a swap page that faulted within a VMA, this method swaps in the computed number of readahead pages around the faulting virtual address. (SSD only)

The pages to read ahead cannot cross the VMA boundary or the range covered by a single pte table.

swap_vma_readahead()

mm/swap_state.c

static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
                                       struct vm_fault *vmf)
{
        struct blk_plug plug;
        struct vm_area_struct *vma = vmf->vma;
        struct page *page;
        pte_t *pte, pentry;
        swp_entry_t entry;
        unsigned int i;
        bool page_allocated;
        struct vma_swap_readahead ra_info = {0,};

        swap_ra_info(vmf, &ra_info);
        if (ra_info.win == 1)
                goto skip;

        blk_start_plug(&plug);
        for (i = 0, pte = ra_info.ptes; i < ra_info.nr_pte;
             i++, pte++) {
                pentry = *pte;
                if (pte_none(pentry))
                        continue;
                if (pte_present(pentry))
                        continue;
                entry = pte_to_swp_entry(pentry);
                if (unlikely(non_swap_entry(entry)))
                        continue;
                page = __read_swap_cache_async(entry, gfp_mask, vma,
                                               vmf->address, &page_allocated);
                if (!page)
                        continue;
                if (page_allocated) {
                        swap_readpage(page, false);
                        if (i != ra_info.offset) {
                                SetPageReadahead(page);
                                count_vm_event(SWAP_RA);
                        }
                }
                put_page(page);
        }
        blk_finish_plug(&plug);
        lru_add_drain();
skip:
        return read_swap_cache_async(fentry, gfp_mask, vma, vmf->address,
                                     ra_info.win == 1);
}

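Performs VMA-based swap readahead for the swap entry.

Code lines 11~15: build the swap readahead information. If the number of pages to swap in is the minimum, 1, jump straight to the skip label.
Code line 17: initialize blk_plug so that submission to the block device is held back until blk_finish_plug() is called.
Code lines 18~27: iterate over the pte entries and skip any pte that does not hold a swap entry.
Code lines 28~31: look up the page in the swap cache; if it is not there, a new swap cache page is prepared.
Code lines 32~38: if the page was newly allocated, issue an asynchronous bio request to read it from the swap area. If it is not the page at the requested offset, set PG_reclaim (which acts as the readahead flag during swap-in) and increase the SWAP_RA counter.
Code line 41: blk_finish_plug(), the counterpart of blk_start_plug(), lets the held-back requests be submitted to the block device.
Code line 42: drain the per-cpu lru caches to the lru lists.
Code lines 43~45: the skip: label. Look up the page in the swap cache once more. Only when a single page is handled (win=1) is the page read in synchronous mode.

Computing the number of pages for swap-in readahead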

swapin_nr_pages()

mm/swap_state.c

static unsigned long swapin_nr_pages(unsigned long offset)
{
        static unsigned long prev_offset;
        unsigned int hits, pages, max_pages;
        static atomic_t last_readahead_pages;

        max_pages = 1 << READ_ONCE(page_cluster);
        if (max_pages <= 1)
                return 1;

        hits = atomic_xchg(&swapin_readahead_hits, 0);
        pages = __swapin_nr_pages(prev_offset, offset, hits, max_pages,
                                  atomic_read(&last_readahead_pages));
        if (!hits)
                prev_offset = offset;
        atomic_set(&last_readahead_pages, pages);

        return pages;
}

Computes the number of pages to read ahead when swapping in the page at @offset.

Code lines 3~5: prev_offset holds the offset used in the most recent readahead computation, and last_readahead_pages holds the most recently computed number of readahead pages.
Code lines 7~9: set the upper limit to 1 << page_cluster. If that limit is 1, return the smallest value, 1, without further computation.
Code line 11: fetch the number of readahead hits during swap-in and reset the counter to 0.
Code lines 12~13: compute a suitable number of readahead pages from the previous offset (prev_offset), @offset, the readahead hits (hits), the page limit (max_pages), and the previously computed number of readahead pages (last_readahead_pages).
Code lines 14~15: if there were no readahead hits, remember offset in prev_offset.
Code lines 16~18: remember the computed number of readahead pages in last_readahead_pages and return it.

__swapin_nr_pages()

mm/swap_state.c

static unsigned int __swapin_nr_pages(unsigned long prev_offset,
                                      unsigned long offset,
                                      int hits,
                                      int max_pages,
                                      int prev_win)
{
        unsigned int pages, last_ra;

        /*
         * This heuristic has been found to work well on both sequential and
         * random loads, swapping to hard disk or to SSD: please don't ask
         * what the "+ 2" means, it just happens to work well, that's all.
         */
        pages = hits + 2;
        if (pages == 2) {
                /*
                 * We can have no readahead hits to judge by: but must not get
                 * stuck here forever, so check for an adjacent offset instead
                 * (and don't even bother to check whether swap type is same).
                 */
                if (offset != prev_offset + 1 && offset != prev_offset - 1)
                        pages = 1;
        } else {
                unsigned int roundup = 4;
                while (roundup < pages)
                        roundup <<= 1;
                pages = roundup;
        }

        if (pages > max_pages)
                pages = max_pages;

        /* Don't shrink readahead too fast */
        last_ra = prev_win / 2;
        if (pages < last_ra)
                pages = last_ra;

        return pages;
}

Computes and returns a suitable number of readahead pages from the previous offset (@prev_offset), @offset, the readahead hits (@hits), the page limit (@max_pages), and the previously chosen number of readahead pages (@prev_win).

Code lines 14~22: if there were no recent readahead hits, use 2 pages. However, if offset is not within +-1 of the previous offset, use 1 page.
Code lines 23~28: if there were recent readahead hits, choose one of the values that start at 4 and double each time (4, 8, 16, …): the smallest one that is greater than or equal to (hits + 2).
- Example: hits=10
  - pages = 16
Code lines 30~31: cap the computed page count at the maximum page count.
Code lines 34~36: keep the page count from shrinking too fast by not letting it drop below half of the previous readahead window.
Code line 38: return the computed number of readahead pages.
- References:
  - swap: add a simple detector for inappropriate swapin readahead (v3.14-rc2)
  - mm, swap: VMA based swap readahead (v4.14-rc1)
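To try the heuristic with concrete numbers, the function body can be copied into user space. The sketch below keeps the logic of __swapin_nr_pages() as listed above; the name ra_window() and the added assertion are not from the kernel.

```c
#include <assert.h>

/* User-space copy of the window heuristic, under a hypothetical name. */
static unsigned int ra_window(unsigned long prev_offset, unsigned long offset,
			      int hits, int max_pages, int prev_win)
{
	unsigned int pages, last_ra;

	assert(hits >= 0 && max_pages >= 1 && prev_win >= 0);

	pages = hits + 2;
	if (pages == 2) {
		/* no hits: keep 2 pages only for an adjacent offset */
		if (offset != prev_offset + 1 && offset != prev_offset - 1)
			pages = 1;
	} else {
		/* hits: smallest of 4, 8, 16, ... that is >= hits + 2 */
		unsigned int roundup = 4;

		while (roundup < pages)
			roundup <<= 1;
		pages = roundup;
	}

	if (pages > (unsigned int)max_pages)
		pages = max_pages;

	/* do not shrink below half of the previous window */
	last_ra = prev_win / 2;
	if (pages < last_ra)
		pages = last_ra;

	return pages;
}
```

With hits=10 and a limit of 32 the result is 16, matching the example above; with no hits and a non-adjacent offset the window drops to 1 unless the previous window was large enough for the "half of the previous window" floor to apply.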

Reading a page from the swap cache

read_swap_cache_async()

mm/swap_state.c

/*
 * Locate a page of swap in physical memory, reserving swap cache space
 * and reading the disk if it is not already cached.
 * A failure return means that either the page allocation failed or that
 * the swap entry is no longer in use.
 */

struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
                struct vm_area_struct *vma, unsigned long addr, bool do_poll)
{
        bool page_was_allocated;
        struct page *retpage = __read_swap_cache_async(entry, gfp_mask,
                        vma, addr, &page_was_allocated);

        if (page_was_allocated)
                swap_readpage(retpage, do_poll);

        return retpage;
}

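Looks up the page for the swap entry in the swap cache. If it is not found, a swap cache page is allocated and registered, and an asynchronous read from the block device of the swap area is requested. If @do_poll is true, the function waits until the data has been read into the swap cache page. (sync)

Code lines 5~6: look up the page in the swap cache. If it is not found there, a new swap cache page is prepared in advance for the read from the swap area. In that case page_was_allocated comes back as true.
Code lines 8~9: if a new swap cache page was allocated, issue a bio request to read it from the swap area.
Code line 11: return the page.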

__read_swap_cache_async()

mm/swap_state.c

struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
                        struct vm_area_struct *vma, unsigned long addr,
                        bool *new_page_allocated)
{
        struct page *found_page, *new_page = NULL;
        struct address_space *swapper_space = swap_address_space(entry);
        int err;
        *new_page_allocated = false;

        do {
                /*
                 * First check the swap cache.  Since this is normally
                 * called after lookup_swap_cache() failed, re-calling
                 * that would confuse statistics.
                 */
                found_page = find_get_page(swapper_space, swp_offset(entry));
                if (found_page)
                        break;

                /*
                 * Just skip read ahead for unused swap slot.
                 * During swap_off when swap_slot_cache is disabled,
                 * we have to handle the race between putting
                 * swap entry in swap cache and marking swap slot
                 * as SWAP_HAS_CACHE.  That's done in later part of code or
                 * else swap_off will be aborted if we return NULL.
                 */
                if (!__swp_swapcount(entry) && swap_slot_cache_enabled)
                        break;

                /*
                 * Get a new page to read into from swap.
                 */
                if (!new_page) {
                        new_page = alloc_page_vma(gfp_mask, vma, addr);
                        if (!new_page)
                                break;          /* Out of memory */
                }

                /*
                 * Swap entry may have been freed since our caller observed it.
                 */
                err = swapcache_prepare(entry);
                if (err == -EEXIST) {
                        /*
                         * We might race against get_swap_page() and stumble
                         * across a SWAP_HAS_CACHE swap_map entry whose page
                         * has not been brought into the swapcache yet.
                         */
                        cond_resched();
                        continue;
                } else if (err)         /* swp entry is obsolete ? */
                        break;

                /* May fail (-ENOMEM) if XArray node allocation failed. */
                __SetPageLocked(new_page);
                __SetPageSwapBacked(new_page);
                err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
                if (likely(!err)) {
                        /* Initiate read into locked page */
                        SetPageWorkingset(new_page);
                        lru_cache_add_anon(new_page);
                        *new_page_allocated = true;
                        return new_page;
                }
                __ClearPageLocked(new_page);
                /*
                 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
                 * clear SWAP_HAS_CACHE flag.
                 */
                put_swap_page(new_page, entry);
        } while (err != -ENOMEM);

        if (new_page)
                put_page(new_page);
        return found_page;
}

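Looks up the page in the swap cache. If it is not found, allocates and prepares a new swap cache page to read into from the swap area, and returns it. When a new swap cache page was prepared, true is stored in the output argument @new_page_allocated.

Code lines 16~18: look up the swap cache page using the offset of the swap entry.
Code lines 28~29: if the swap entry is no longer in use (its swap count is 0) while the swap slot cache is enabled, leave the loop to return a NULL page.
- Reference: mm/swap: skip readahead only when swap slot cache is enabled (v4.11-rc1)
Code lines 34~38: allocate a new page to hold the data read from the swap area.
Code lines 43~65: reserve the swap entry with swapcache_prepare(), set PG_locked and PG_swapbacked on the new page, and add it to the swap cache. On success, mark the page workingset, add it to the anon lru, and return it.
Code lines 66~72: if adding to the swap cache fails, clear PG_locked, release the swap entry with put_swap_page(), and retry unless the error is -ENOMEM.
Code lines 74~76: drop the reference on the new page if it ended up unused, and return the page found in the swap cache.

Reading from the swap area into the swap cache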

swap_readpage()

mm/page_io.c

int swap_readpage(struct page *page, bool synchronous)
{
        struct bio *bio;
        int ret = 0;
        struct swap_info_struct *sis = page_swap_info(page);
        blk_qc_t qc;
        struct gendisk *disk;

        VM_BUG_ON_PAGE(!PageSwapCache(page) && !synchronous, page);
        VM_BUG_ON_PAGE(!PageLocked(page), page);
        VM_BUG_ON_PAGE(PageUptodate(page), page);
        if (frontswap_load(page) == 0) {
                SetPageUptodate(page);
                unlock_page(page);
                goto out;
        }

        if (sis->flags & SWP_FS) {
                struct file *swap_file = sis->swap_file;
                struct address_space *mapping = swap_file->f_mapping;

                ret = mapping->a_ops->readpage(swap_file, page);
                if (!ret)
                        count_vm_event(PSWPIN);
                return ret;
        }

        ret = bdev_read_page(sis->bdev, swap_page_sector(page), page);
        if (!ret) {
                if (trylock_page(page)) {
                        swap_slot_free_notify(page);
                        unlock_page(page);
                }

                count_vm_event(PSWPIN);
                return 0;
        }

        ret = 0;
        bio = get_swap_bio(GFP_KERNEL, page, end_swap_bio_read);
        if (bio == NULL) {
                unlock_page(page);
                ret = -ENOMEM;
                goto out;
        }
        disk = bio->bi_disk;
        /*
         * Keep this task valid during swap readpage because the oom killer may
         * attempt to access it in the page fault retry time check.
         */
        get_task_struct(current);
        bio->bi_private = current;
        bio_set_op_attrs(bio, REQ_OP_READ, 0);
        if (synchronous)
                bio->bi_opf |= REQ_HIPRI;
        count_vm_event(PSWPIN);
        bio_get(bio);
        qc = submit_bio(bio);
        while (synchronous) {
                set_current_state(TASK_UNINTERRUPTIBLE);
                if (!READ_ONCE(bio->bi_private))
                        break;

                if (!blk_poll(disk->queue, qc, true))
                        io_schedule();
        }
        __set_current_state(TASK_RUNNING);
        bio_put(bio);

out:
        return ret;
}

Issues a bio request to read data from the swap area into the swap cache page. If @synchronous is set, the request is synchronous. Returns 0 on success.

Code line 5: get the swap_info_struct from the swap entry stored in the private member of the swap cache page.
Code lines 9~11: the page must have PG_locked set and PG_uptodate cleared; PG_swapcache must also be set unless the read is synchronous.
Code lines 12~16: if frontswap holds the page, load it from frontswap, set PG_uptodate, and return success.
Code lines 18~26: if the swap area is accessed through a file system, read the page through that file system's (*readpage) hook, increase the PSWPIN counter, and return the result.
Code lines 28~37: if the block device can read the page directly (bdev_read_page() succeeds), increase PSWPIN and return 0.
Code lines 40~56: prepare a bio request to read from the swap area, then increase the PSWPIN counter.
Code lines 57~68: submit the bio. If @synchronous is set, wait until it completes.
Code lines 70~71: the out: label. Return the result immediately.

Building VMA-based swap readahead information

The following figure shows the VMA-based swap readahead process.

vma_swap_readahead structure

include/linux/swap.h

struct vma_swap_readahead {
        unsigned short win;
        unsigned short offset;
        unsigned short nr_pte;
#ifdef CONFIG_64BIT
        pte_t *ptes;
#else
        pte_t ptes[SWAP_RA_PTE_CACHE_SIZE];
#endif
};

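Describes the ptes to read ahead within a pmd-sized range, that is, within a single pte page table.

win
- the computed number of pages to read ahead during swap-in
offset
- the offset of the fault pfn from the start pfn of the readahead range
nr_pte
- the number of pte entries to read ahead during swap-in (may differ from win because of the vma and pmd boundaries)
*ptes
- the address of the pte located offset entries before the fault page's pte, i.e. the first pte of the readahead range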

swap_ra_info()

mm/swap_state.c

static void swap_ra_info(struct vm_fault *vmf,
                        struct vma_swap_readahead *ra_info)
{
        struct vm_area_struct *vma = vmf->vma;
        unsigned long ra_val;
        swp_entry_t entry;
        unsigned long faddr, pfn, fpfn;
        unsigned long start, end;
        pte_t *pte, *orig_pte;
        unsigned int max_win, hits, prev_win, win, left;
#ifndef CONFIG_64BIT
        pte_t *tpte;
#endif

        max_win = 1 << min_t(unsigned int, READ_ONCE(page_cluster),
                             SWAP_RA_ORDER_CEILING);
        if (max_win == 1) {
                ra_info->win = 1;
                return;
        }

        faddr = vmf->address;
        orig_pte = pte = pte_offset_map(vmf->pmd, faddr);
        entry = pte_to_swp_entry(*pte);
        if ((unlikely(non_swap_entry(entry)))) {
                pte_unmap(orig_pte);
                return;
        }

        fpfn = PFN_DOWN(faddr);
        ra_val = GET_SWAP_RA_VAL(vma);
        pfn = PFN_DOWN(SWAP_RA_ADDR(ra_val));
        prev_win = SWAP_RA_WIN(ra_val);
        hits = SWAP_RA_HITS(ra_val);
        ra_info->win = win = __swapin_nr_pages(pfn, fpfn, hits,
                                               max_win, prev_win);
        atomic_long_set(&vma->swap_readahead_info,
                        SWAP_RA_VAL(faddr, win, 0));

        if (win == 1) {
                pte_unmap(orig_pte);
                return;
        }

        /* Copy the PTEs because the page table may be unmapped */
        if (fpfn == pfn + 1)
                swap_ra_clamp_pfn(vma, faddr, fpfn, fpfn + win, &start, &end);
        else if (pfn == fpfn + 1)
                swap_ra_clamp_pfn(vma, faddr, fpfn - win + 1, fpfn + 1,
                                  &start, &end);
        else {
                left = (win - 1) / 2;
                swap_ra_clamp_pfn(vma, faddr, fpfn - left, fpfn + win - left,
                                  &start, &end);
        }
        ra_info->nr_pte = end - start;
        ra_info->offset = fpfn - start;
        pte -= ra_info->offset;
#ifdef CONFIG_64BIT
        ra_info->ptes = pte;
#else
        tpte = ra_info->ptes;
        for (pfn = start; pfn != end; pfn++)
                *tpte++ = *pte++;
#endif
        pte_unmap(orig_pte);
}

Builds the swap readahead information.

Code line 4: use the vma from the vmf passed in by the fault handler.
Code lines 15~20: compute the maximum number of readahead pages and store it in max_win. If the value is 1, no further computation is needed, so set ra_info->win to 1 and return.
- 1 << (the smaller of page_cluster and SWAP_RA_ORDER_CEILING)
  - The initial value of page_cluster is 2 or 3 (2 when memory is under 16 MB, 3 otherwise), and it can be tuned through "/proc/sys/vm/page-cluster".
  - SWAP_RA_ORDER_CEILING is 5 on 64-bit systems and 3 on 32-bit systems.
  - With the arm64 default settings, max_win is 2^3=8.
Code lines 22~28: look up the pte for the fault address in the page table (orig_pte) and derive the swap entry from it. If it is not a swap entry, unmap the pte and return.
Code line 30: compute the pfn that corresponds to the fault address.
Code lines 31~34: read the ra_val stored in the vma and extract pfn, prev_win and hits from it.
- ra_val packs three values.
  - The bits above PAGE_SHIFT hold the pfn.
  - The upper half of the low PAGE_SHIFT bits holds win.
  - The lower half of the low PAGE_SHIFT bits holds hits.
- Example: ra_val=0b111_101010_100001
  - SWAP_RA_ADDR(ra_val)=0b111_000000_000000
    - PFN_DOWN() of this gives 0b111
  - SWAP_RA_WIN(ra_val)=0b101010
  - SWAP_RA_HITS(ra_val)=0b100001
- If nothing has been stored yet, the values start at prev_win=0 and hits=4.
Code lines 35~36: compute the number of pages to read ahead from pfn, the fault pfn, hits, max_win and prev_win.
Code lines 37~38: build a new ra_val from faddr, win and hits=0 and store it in vma->swap_readahead_info.
Code lines 40~43: if win is 1, the pte count and ptes need no update, so unmap the pte mapping and return.
Code lines 46~55: compute the start and end pfn according to the following three cases.
- the fault occurred on the page right after the previously used page
- the fault occurred on the page right before the previously used page
- any other case
Code lines 56~65: store the pte information in the output argument @ra_info.
Code line 66: unmap the original pte.

The following figure shows how the number of readahead pages to load into the swap cache is derived from the positions of the previously used page and the fault page.

Macros for the ra_val value
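The three cases for choosing the readahead range can be sketched in user space as follows. This is a simplified illustration of the branch in swap_ra_info() before clamping; the struct and function names are hypothetical, and the sketch assumes fpfn is at least win so the subtraction cannot wrap.

```c
#include <assert.h>

/* Hypothetical helper type: candidate readahead range [lpfn, rpfn). */
struct ra_range {
	unsigned long lpfn;
	unsigned long rpfn;
};

/*
 * prev_pfn: page number of the previous fault recorded in the vma
 * fpfn:     page number of the current fault
 * win:      readahead window size in pages (at least 1)
 */
static struct ra_range ra_candidate(unsigned long prev_pfn,
				    unsigned long fpfn, unsigned int win)
{
	struct ra_range r;

	assert(win >= 1 && fpfn >= win);

	if (fpfn == prev_pfn + 1) {
		/* fault right after the previous page: read forward */
		r.lpfn = fpfn;
		r.rpfn = fpfn + win;
	} else if (prev_pfn == fpfn + 1) {
		/* fault right before the previous page: read backward */
		r.lpfn = fpfn - win + 1;
		r.rpfn = fpfn + 1;
	} else {
		/* no clear direction: centre the window on the fault */
		unsigned int left = (win - 1) / 2;

		r.lpfn = fpfn - left;
		r.rpfn = fpfn + win - left;
	}
	return r;
}
```

In every case the range contains the fault page and spans win pages; the kernel then clamps it to the vma and pmd boundaries.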

mm/swap_state.c

#define SWAP_RA_WIN_SHIFT       (PAGE_SHIFT / 2)
#define SWAP_RA_HITS_MASK       ((1UL << SWAP_RA_WIN_SHIFT) - 1)
#define SWAP_RA_HITS_MAX        SWAP_RA_HITS_MASK
#define SWAP_RA_WIN_MASK        (~PAGE_MASK & ~SWAP_RA_HITS_MASK)

#define SWAP_RA_HITS(v)         ((v) & SWAP_RA_HITS_MASK)
#define SWAP_RA_WIN(v)          (((v) & SWAP_RA_WIN_MASK) >> SWAP_RA_WIN_SHIFT)
#define SWAP_RA_ADDR(v)         ((v) & PAGE_MASK)

#define SWAP_RA_VAL(addr, win, hits)                            \
        (((addr) & PAGE_MASK) |                                 \
         (((win) << SWAP_RA_WIN_SHIFT) & SWAP_RA_WIN_MASK) |    \
         ((hits) & SWAP_RA_HITS_MASK))

The following figure shows the three macros that extract addr, win and hits from a swap_ra value.

The following figure shows how the SWAP_RA_VAL() macro builds a swap_ra value from the addr, win and hits arguments.
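The packing can be checked in user space with the same bit layout. The sketch below assumes 4 KiB pages (PAGE_SHIFT of 12), which gives 6 bits each for win and hits; the SK_ prefix and the helper function names are made up for this example.

```c
#include <assert.h>

#define SK_PAGE_SHIFT	12	/* assumption: 4 KiB pages */
#define SK_PAGE_MASK	(~((1UL << SK_PAGE_SHIFT) - 1))

#define SK_RA_WIN_SHIFT	(SK_PAGE_SHIFT / 2)
#define SK_RA_HITS_MASK	((1UL << SK_RA_WIN_SHIFT) - 1)
#define SK_RA_WIN_MASK	(~SK_PAGE_MASK & ~SK_RA_HITS_MASK)

/* the win and hits fields must not overlap */
static_assert((SK_RA_WIN_MASK & SK_RA_HITS_MASK) == 0, "fields overlap");

static unsigned long ra_hits(unsigned long v)
{
	return v & SK_RA_HITS_MASK;
}

static unsigned long ra_win(unsigned long v)
{
	return (v & SK_RA_WIN_MASK) >> SK_RA_WIN_SHIFT;
}

static unsigned long ra_addr(unsigned long v)
{
	return v & SK_PAGE_MASK;
}

/* pack a page-aligned address, a window size and a hit count into one word */
static unsigned long ra_val(unsigned long addr, unsigned long win,
			    unsigned long hits)
{
	return (addr & SK_PAGE_MASK) |
	       ((win << SK_RA_WIN_SHIFT) & SK_RA_WIN_MASK) |
	       (hits & SK_RA_HITS_MASK);
}
```

Packing addr=0x7000, win=0b101010 and hits=0b100001 yields 0b111_101010_100001, the example value used in the swap_ra_info() walkthrough. The initial value 4 decodes to win=0 and hits=4.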

swap_ra_clamp_pfn()

mm/swap_state.c

static inline void swap_ra_clamp_pfn(struct vm_area_struct *vma,
                                     unsigned long faddr,
                                     unsigned long lpfn,
                                     unsigned long rpfn,
                                     unsigned long *start,
                                     unsigned long *end)
{
        *start = max3(lpfn, PFN_DOWN(vma->vm_start),
                      PFN_DOWN(faddr & PMD_MASK));
        *end = min3(rpfn, PFN_DOWN(vma->vm_end),
                    PFN_DOWN((faddr & PMD_MASK) + PMD_SIZE));
}

The following figure shows how the start and end of the pte entries to fetch from a single pte page table are determined.
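The clamping can also be tried with plain numbers. The sketch below reproduces the max3()/min3() logic of swap_ra_clamp_pfn() in user space, assuming 4 KiB pages and a 2 MiB pmd range (the usual values for arm64 and x86-64 with 4 KiB pages). It takes vm_start and vm_end directly instead of a vma, and all names are hypothetical.

```c
#include <assert.h>

#define SK_PAGE_SHIFT	12			/* assumption: 4 KiB pages */
#define SK_PMD_SIZE	(1UL << 21)		/* assumption: 2 MiB per pte table */
#define SK_PMD_MASK	(~(SK_PMD_SIZE - 1))
#define SK_PFN_DOWN(a)	((a) >> SK_PAGE_SHIFT)

static unsigned long max3ul(unsigned long a, unsigned long b, unsigned long c)
{
	unsigned long m = a > b ? a : b;

	return m > c ? m : c;
}

static unsigned long min3ul(unsigned long a, unsigned long b, unsigned long c)
{
	unsigned long m = a < b ? a : b;

	return m < c ? m : c;
}

/* Clamp the candidate range [lpfn, rpfn) to the vma and to one pte table. */
static void clamp_pfn(unsigned long vm_start, unsigned long vm_end,
		      unsigned long faddr,
		      unsigned long lpfn, unsigned long rpfn,
		      unsigned long *start, unsigned long *end)
{
	assert(vm_start <= faddr && faddr < vm_end);

	*start = max3ul(lpfn, SK_PFN_DOWN(vm_start),
			SK_PFN_DOWN(faddr & SK_PMD_MASK));
	*end = min3ul(rpfn, SK_PFN_DOWN(vm_end),
		      SK_PFN_DOWN((faddr & SK_PMD_MASK) + SK_PMD_SIZE));
}
```

For a fault on the last page of a pte table, a centred 8-page window is cut at the pmd boundary, which is why nr_pte can be smaller than win.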

Swap-in with cluster-based readahead

swap_cluster_readahead()

mm/swap_state.c

/**
 * swap_cluster_readahead - swap in pages in hope we need them soon
 * @entry: swap entry of this memory
 * @gfp_mask: memory allocation flags
 * @vmf: fault information
 *
 * Returns the struct page for entry and addr, after queueing swapin.
 *
 * Primitive swap readahead code. We simply read an aligned block of
 * (1 << page_cluster) entries in the swap area. This method is chosen
 * because it doesn't cost us any seek time.  We also make sure to queue
 * the 'original' request together with the readahead ones...
 *
 * This has been extended to use the NUMA policies from the mm triggering
 * the readahead.
 *
 * Caller must hold down_read on the vma->vm_mm if vmf->vma is not NULL.
 */

struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
                                struct vm_fault *vmf)
{
        struct page *page;
        unsigned long entry_offset = swp_offset(entry);
        unsigned long offset = entry_offset;
        unsigned long start_offset, end_offset;
        unsigned long mask;
        struct swap_info_struct *si = swp_swap_info(entry);
        struct blk_plug plug;
        bool do_poll = true, page_allocated;
        struct vm_area_struct *vma = vmf->vma;
        unsigned long addr = vmf->address;

        mask = swapin_nr_pages(offset) - 1;
        if (!mask)
                goto skip;

        do_poll = false;
        /* Read a page_cluster sized and aligned cluster around offset. */
        start_offset = offset & ~mask;
        end_offset = offset | mask;
        if (!start_offset)      /* First page is swap header. */
                start_offset++;
        if (end_offset >= si->max)
                end_offset = si->max - 1;

        blk_start_plug(&plug);
        for (offset = start_offset; offset <= end_offset ; offset++) {
                /* Ok, do the async read-ahead now */
                page = __read_swap_cache_async(
                        swp_entry(swp_type(entry), offset),
                        gfp_mask, vma, addr, &page_allocated);
                if (!page)
                        continue;
                if (page_allocated) {
                        swap_readpage(page, false);
                        if (offset != entry_offset) {
                                SetPageReadahead(page);
                                count_vm_event(SWAP_RA);
                        }
                }
                put_page(page);
        }
        blk_finish_plug(&plug);

        lru_add_drain();        /* Push any new pages onto the LRU now */
skip:
        return read_swap_cache_async(entry, gfp_mask, vma, addr, do_poll);
}

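Performs cluster-based swap readahead for a swap entry.

Code line 9 obtains the swap information (swap_info_struct) from the swap entry.
Code line 12 uses the vma of the vmf passed in by the fault handler.
Code lines 15~22 take the swap offset of the faulting entry and compute the start offset and end offset aligned to the number of pages to swap in. If only one page is to be read, readahead is skipped.
- e.g.) offset=0x3, mask=0xf
  - start_offset=0x0
  - end_offset=0xf
- e.g.) offset=0xffff_ffff_ffff_fffd(-3), mask=0xf
  - start_offset=0xffff_ffff_ffff_fff0(-16)
  - end_offset=0xffff_ffff_ffff_ffff(-1)
Code lines 23~26: if the start offset is 0, it is the swap header, so it is incremented to use the next page, and the end offset is limited to stay below the maximum (si->max).
Code line 28 initializes the blk_plug and holds back submission to the block device until blk_finish_plug() is called.
Code lines 29~35 iterate over the swap offsets from the start offset to the end offset and fetch the swap cache page for each swap entry.
Code lines 36~42: for a newly allocated page, an asynchronous bio request is issued to read the page from the swap area. If that page is not the originally requested offset, the PG_reclaim flag (used as the readahead flag during swap-in) is set and the SWAP_RA counter is incremented.
Code line 45 calls the function paired with blk_start_plug() so that submissions to the block device can proceed from this point.
Code line 47 drains the per-cpu lru caches back to the lru.
Code lines 48~49 are the skip: label. The page for the requested entry is fetched from the swap cache once more in async mode.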
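The window arithmetic of code lines 15~26 can be sketched on its own. In the following illustration, cluster_window() and struct ra_window are hypothetical names, nr_pages stands in for the result of swapin_nr_pages() and is assumed to be a power of two, and max stands in for si->max.

```c
#include <assert.h>

struct ra_window {
        unsigned long start;    /* first swap offset to read */
        unsigned long end;      /* last swap offset to read (inclusive) */
};

/* Hypothetical helper mirroring the offset math of the function above. */
static struct ra_window cluster_window(unsigned long offset,
                                       unsigned long nr_pages,
                                       unsigned long max)
{
        unsigned long mask = nr_pages - 1;
        struct ra_window w;

        /* The mask trick only works for power-of-two window sizes. */
        assert(nr_pages && (nr_pages & mask) == 0);

        w.start = offset & ~mask;       /* align down to the window */
        w.end = offset | mask;          /* last offset of the window */
        if (!w.start)                   /* offset 0 is the swap header */
                w.start++;
        if (w.end >= max)               /* stay inside the swap area */
                w.end = max - 1;
        return w;
}
```

For offset=0x3 and a 16-page window this yields offsets 0x1..0xf, because the header page at offset 0 is skipped.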

Finding a swap cache page

When the faulting pte entry holds a swap entry, the fault handler calls do_swap_page(), which looks the page up in the swap cache and maps it as an anon page. This section covers the function that searches the swap cache.

lookup_swap_cache()

mm/swap_state.c

/*
 * Lookup a swap entry in the swap cache. A found page will be returned
 * unlocked and with its refcount incremented - we rely on the kernel
 * lock getting page table operations atomic even if we drop the page
 * lock before returning.
 */

struct page *lookup_swap_cache(swp_entry_t entry, struct vm_area_struct *vma,
                               unsigned long addr)
{
        struct page *page;

        page = find_get_page(swap_address_space(entry), swp_offset(entry));

        INC_CACHE_INFO(find_total);
        if (page) {
                bool vma_ra = swap_use_vma_readahead();
                bool readahead;

                INC_CACHE_INFO(find_success);
                /*
                 * At the moment, we don't support PG_readahead for anon THP
                 * so let's bail out rather than confusing the readahead stat.
                 */
                if (unlikely(PageTransCompound(page)))
                        return page;

                readahead = TestClearPageReadahead(page);
                if (vma && vma_ra) {
                        unsigned long ra_val;
                        int win, hits;

                        ra_val = GET_SWAP_RA_VAL(vma);
                        win = SWAP_RA_WIN(ra_val);
                        hits = SWAP_RA_HITS(ra_val);
                        if (readahead)
                                hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX);
                        atomic_long_set(&vma->swap_readahead_info,
                                        SWAP_RA_VAL(addr, win, hits));
                }

                if (readahead) {
                        count_vm_event(SWAP_RA_HIT);
                        if (!vma || !vma_ra)
                                atomic_inc(&swapin_readahead_hits);
                }
        }

        return page;
}

Looks up the swap cache page for a swap entry value.

Code line 6 finds the page using the address_space mapped to the swap entry and the offset of the swap entry.
Code line 8 increments the find_total counter of the swap cache stats.
Code lines 9~10: when a cached page is found, whether vma-based readahead is enabled is stored in vma_ra.
Code line 13 increments the find_success counter of the swap cache stats.
Code lines 18~19: in the unlikely case of a thp or hugetlbfs page, readahead for anon THP is not supported, so the page is simply returned.
Code line 21 stores the page's PG_reclaim flag (which acts as the readahead flag during swap-in) in readahead and then clears it.
Code lines 22~33: when vma-based readahead is in use, the ra_val value stored in the vma is updated. If the page had the readahead flag, the hits value inside ra_val is incremented.
Code lines 35~39: if the page had the readahead flag, the SWAP_RA_HIT counter is incremented. If no vma is given or vma-based readahead is not in use, the swapin_readahead_hits counter is also incremented.

swap operations

address_space_operations structure

mm/swap_state.c

/*
 * swapper_space is a fiction, retained to simplify the path through
 * vmscan's shrink_page_list.
 */

static const struct address_space_operations swap_aops = {
        .writepage      = swap_writepage,
        .set_page_dirty = swap_set_page_dirty,
#ifdef CONFIG_MIGRATION
        .migratepage    = migrate_page,
#endif
};

Writing to the swap area

An anon page is stored in the swap cache, marked dirty, and then unmapped. After that, pageout() can write the swap cache page to the swap area.

swap_writepage()

mm/page_io.c

/*
 * We may have stale swap cache pages in memory: notice
 * them here and get rid of the unnecessary final write.
 */

int swap_writepage(struct page *page, struct writeback_control *wbc)
{
        int ret = 0;

        if (try_to_free_swap(page)) {
                unlock_page(page);
                goto out;
        }
        if (frontswap_store(page) == 0) {
                set_page_writeback(page);
                unlock_page(page);
                end_page_writeback(page);
                goto out;
        }
        ret = __swap_writepage(page, wbc, end_swap_bio_write);
out:
        return ret;
}

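Writes a dirty swap cache page to the swap area.

Code lines 5~8 remove the page from the swap cache and jump to the out: label, unless one of the following holds.
- It has already been removed from the swap cache. (PG_swapcache flag not set)
- Writeback is in progress. (PG_writeback flag set)
- The swap entry is still in use. (its count in swap_map[] is non-zero)
Code lines 9~14 handle the case where frontswap is supported. If the page is stored in frontswap successfully, no write to the swap area is needed.
Code lines 15~17 request that the page be written to the swap area and return the result. When the sync/async write completes, end_swap_bio_write() is called and updates the page flags to indicate that writeback has finished.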

swap_set_page_dirty()

mm/page_io.c

int swap_set_page_dirty(struct page *page)
{
        struct swap_info_struct *sis = page_swap_info(page);

        if (sis->flags & SWP_FS) {
                struct address_space *mapping = sis->swap_file->f_mapping;

                VM_BUG_ON_PAGE(!PageSwapCache(page), page);
                return mapping->a_ops->set_page_dirty(page);
        } else {
                return __set_page_dirty_no_writeback(page);
        }
}

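Marks a swap page as dirty. Returns 1 if the page has newly become dirty.

Code lines 5~9: for a swap area with SWP_FS set, the set_page_dirty() function provided by the file system of the swap file (mapping->a_ops) is used to mark the page dirty.
- SWP_FS is used by sunrpc, nfs, xfs, btrfs, and so on.
Code lines 10~12: otherwise, for an ordinary swap area, dirty is set on the swap cache page.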

__set_page_dirty_no_writeback()

mm/page-writeback.c

/*
 * For address_spaces which do not use buffers nor write back.
 */

int __set_page_dirty_no_writeback(struct page *page)
{
        if (!PageDirty(page))
                return !TestSetPageDirty(page);
        return 0;
}

Sets the dirty flag on the page. Returns 1 if dirty was newly set.

Swap -1- (Basic, Initialization)

2019-10-29, 2021-03-15 문영일

Swap

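When a user process uses its stack or requests anonymous memory (malloc and the like), the kernel allocates and manages anon pages. When memory runs short, they can be stored in a swap area.

The swap mechanism swaps pages through the following three-stage structure.

Swap cache (swapcache)
Frontswap
Swap area (Swap Backing Store)

The following figure shows the three swap components.

clean anon page

Unlike a swappable anon page, a clean anon page is a special anon page that has no swap backing and therefore cannot be swapped. Such clean anon pages do not need to be swapped.

An anon page without PG_swapbacked is in the clean anon page state.
It is used as a lazy free page so that the mapping is not released immediately, and is used through the madvise API.
- See: mm: support madvise(MADV_FREE)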

FRONTSWAP

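Unlike the swap area, which sits at the back end, this is a swap system placed at the front end.

The swap device used here must be fast enough to store a page being swapped synchronously.
- A back-end swap device usually stores asynchronously.
It uses tmem (Transcendent Memory), which is currently provided by Xen.
- DRAM or persistent memory
See: Frontswap (2012) | Kernel.org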

Transcendent Memory

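One of the newer memory management techniques for coping with insufficient RAM. It is a class of memory that is fast enough for the Linux kernel to access synchronously, is not directly addressable but is reached through indirect addressing, is organized in variable-sized regions, and may be persistent or ephemeral. Data stored in this memory can disappear without warning.

See:
- Transcendent memory (2009) | LWN.net
- Transcendent memory in a nutshell (2011) | LWN.net
- Transcendent Memory and Friends (2011) | Oracle (pdf download)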

Cleancache

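A cache that can hold only clean page cache pages. A swap cache page that has already been written to the swap area is also a clean page cache page, so it can use cleancache as well.

Currently the Linux kernel uses the tmem provided by xen.

Swap area (Swap Backing Store)

A swap area can be placed on a swap file or a swap disk (block device) as follows.

swap file
swap disk

A swap device can also be set up in the following ways.

A swap file located on a file system mounted over the network, such as swap over NFS or sunrpc
zram, which compresses pages into part of local RAM to reduce IO requests
- By default the cpu performs the compression; on servers and similar systems, HW-assisted compression is also possible.

The following figure shows the various ways of specifying a swap area.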

Swap Priority

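As the following figure shows, several swap areas can be specified. When priorities are assigned, a higher priority number means higher precedence and is used first. When no priority is assigned, the areas are used by default in the order in which they were activated with swapon.

Default priority values start at -2 and decrease in order.
The size of a single swap area differs slightly by architecture.
- ARM32
  - Usable within the actual virtual address space (up to 3G), but usually up to 2G is used.
- ARM64
  - Usable within the actual virtual address space (which depends on the kernel configuration).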
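The default priority rule described above can be modeled in a few lines. This is a minimal sketch, not kernel code: assign_swap_prio() is a hypothetical helper, and the counter it decrements is assumed to start at -1 so that the first default priority comes out as -2.

```c
#include <assert.h>

/*
 * Hypothetical model of priority assignment at swapon time.
 * *least holds the lowest default priority handed out so far and is
 * assumed to start at -1. A negative 'requested' means "no priority given".
 */
static int assign_swap_prio(int *least, int requested)
{
        assert(least);
        if (requested >= 0)
                return requested;       /* explicit priority wins */
        return --(*least);              /* -2, -3, ... in swapon order */
}
```

Areas activated earlier therefore receive a higher (less negative) default priority and are used first.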

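Swap cache

This is the state in which the contents of the swap area temporarily exist in RAM while an anon page goes out to (swap-out) or comes in from (swap-in) the swap area. For example, suppose an anon page used by a user process was stored in the swap area because of a memory shortage. When the user process then tries to access the swapped page, a fault occurs, and once the fault handler confirms that the page was swapped, it attempts a swap-in. If the page is found in the swap cache, it is not loaded from the swap area; the swap cache page is converted to an anon page and used directly.

The swap cache state is indicated by the PG_swapcache flag.

The address_space used by each swap cache is managed with an xarray (formerly a radix tree). To reduce contention on the lock used to access it, the size covered by each address_space is limited to 64M.

See: mm/swap: split swap cache into 64MB trunks (4.11-rc1)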
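The 64M split can be illustrated with simple arithmetic. The sketch below assumes 4K pages, so 64M corresponds to 2^14 pages per address_space; the two helper functions are hypothetical names used only here.

```c
#include <assert.h>

/* Assumption: 4K pages, so one address_space covers 2^14 pages = 64M. */
#define SWAP_ADDRESS_SPACE_SHIFT        14
#define SWAP_ADDRESS_SPACE_PAGES        (1UL << SWAP_ADDRESS_SPACE_SHIFT)

/* Hypothetical helper: number of address_space objects for a swap area. */
static unsigned long nr_swap_address_spaces(unsigned long maxpages)
{
        assert(maxpages > 0);
        return (maxpages + SWAP_ADDRESS_SPACE_PAGES - 1) /
               SWAP_ADDRESS_SPACE_PAGES;
}

/* Hypothetical helper: which address_space a swap offset falls into. */
static unsigned long swap_address_space_index(unsigned long offset)
{
        return offset >> SWAP_ADDRESS_SPACE_SHIFT;
}
```

Under these assumptions a 1G swap area (262144 pages) is covered by 16 address spaces, each with its own lock.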

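Use of the XArray data structure

The swap cache was managed with a radix tree in earlier kernels, but since kernel v4.20-rc1 it is managed with the XArray data structure.

address_space->i_pages used to point to a radix tree and now points to an xarray.

The following figure shows how the swap cache, managed in 64M units, operates when several swap areas are in use.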

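Swap slot cache

Swap entries are allocated through the swap_map[] of a swap area, and each allocation requires the lock of that swap area. To avoid the slowdown caused by frequent accesses to the swap area, a per-cpu swap slot cache is placed in front of swap entry allocation so that entries can be allocated and freed quickly without taking the swap area lock.

See: mm/swap: add cache for swap slots allocation (v4.11-rc1)

The following figure shows the swap slot caches, each made up of 64 entries, used for fast allocation and reclaim of swap entries when an anon page is stored in the swap area.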

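Swap entry

A swap entry holds the minimum information needed to locate a swap page.

When an anonymous page in a VMA is swapped, the swap entry information is assembled and recorded in the corresponding pte entry.
On ARM, a swap entry is composed as follows.
- offset
  - swap page offset
- type
  - identifies the swap area
- lower two bits=0b00
  - On the ARM and ARM64 architectures this is the unmapped state, so accessing the page raises a fault.
See: Swap entry | 문c

The following figure shows, from the viewpoint of the swap process, how swapped pages and in-use anonymous pages in the virtual address space of a user process are managed.

When the page following the first header page (offset=1) of a single swap area (type=0) is allocated as a swap entry
- The ARM64 swap entry value 0x100 is written to the page table, and the page is stored in the xarray of the swap cache with the entry value as the key.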
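The 0x100 value can be reproduced with a small encoder. The sketch below assumes an ARM64-style layout in which the type field starts at bit 2 and is 6 bits wide and the offset starts at bit 8; this layout is an assumption chosen to be consistent with the 0x100 example, and the macro and function names are hypothetical.

```c
#include <assert.h>

/* Assumed layout: bits 0..1 = 0b00, bits 2..7 = type, bits 8.. = offset. */
#define SWP_TYPE_SHIFT          2
#define SWP_TYPE_BITS           6
#define SWP_TYPE_MASK           ((1UL << SWP_TYPE_BITS) - 1)
#define SWP_OFFSET_SHIFT        (SWP_TYPE_SHIFT + SWP_TYPE_BITS)

/* Hypothetical helper: build a swap entry value from type and offset. */
static unsigned long swp_encode(unsigned long type, unsigned long offset)
{
        assert(type <= SWP_TYPE_MASK);
        return (type << SWP_TYPE_SHIFT) | (offset << SWP_OFFSET_SHIFT);
}

/* Hypothetical helpers: recover the two fields again. */
static unsigned long swp_decode_type(unsigned long val)
{
        return (val >> SWP_TYPE_SHIFT) & SWP_TYPE_MASK;
}

static unsigned long swp_decode_offset(unsigned long val)
{
        return val >> SWP_OFFSET_SHIFT;
}
```

Because the lower two bits always stay 0, the pte is seen as unmapped by the hardware and any access faults into the swap-in path.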

Swap area management

A swap area is managed through swap_map, a member of swap_info_struct, in units of bytes. Each byte corresponds to one page of the swap area and records the reference count of that page. swap_map is managed in one of the following two ways.

Legacy swap map management
- The general method used for legacy HDDs and swap files. To use the swap_map that tracks allocation and freeing of swap pages on an SMP system, the cpu attempting the swap takes the spin-lock of the swap area (swap_info_struct) and searches swap_map from the beginning for a free slot. Because the spin-lock is acquired for every page being swapped, this method has the drawback of very high lock contention.
Per-cpu cluster-based swap map management
- With the arrival of SSDs and persistent memory, attempts were made to improve swap performance, and a method that processes swap in cluster units per cpu was introduced. By letting each cpu manage pages in units of a fixed-size cluster, it reduces lock contention and improves performance. It was introduced in kernel v3.12 in 2013.
- See
  - Making swapping scalable (2016) | LWN.net
  - Reconsidering swapping (2016) | LWN.net
  - swap: change block allocation algorithm for SSD (v3.12-rc1)
  - swap: make cluster allocation per-cpu (v3.12-rc1)
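The legacy scan can be pictured with a toy model. The following sketch is heavily simplified and hypothetical: it has no locking, no bad-page markers and no cluster handling, and only shows a linear search over a byte-per-page usage map in which 0 means free.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical, lock-free model of a legacy swap_map scan.
 * map[i] holds the usage count of swap page i (0 = free).
 * Returns the allocated offset, or 0 when no slot is free
 * (offset 0 is the swap header and is never handed out).
 */
static size_t swap_map_alloc(unsigned char *map, size_t max)
{
        size_t off;

        assert(map);
        for (off = 1; off < max; off++) {
                if (map[off] == 0) {
                        map[off] = 1;   /* first reference to this slot */
                        return off;
                }
        }
        return 0;
}
```

In the real kernel this search runs under the swap area spin-lock, which is where the contention mentioned above comes from.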

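As the following figure shows, whether each swap page is in use is recorded in the swap_map[] array, with one byte of information per swap page.

The following figure shows swap_map being managed in per-cpu cluster units.

Swap utilities

mkswap

Creates a swap area on a file or partition.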

$ mkswap -h

Usage:
 mkswap [options] device [size]

Set up a Linux swap area.

Options:
 -c, --check               check bad blocks before creating the swap area
 -f, --force               allow swap size area be larger than device
 -p, --pagesize SIZE       specify page size in bytes
 -L, --label LABEL         specify label
 -v, --swapversion NUM     specify swap-space version number
 -U, --uuid UUID           specify the uuid to use
 -V, --version             output version information and exit
 -h, --help                display this help and exit

e.g.) Create a 10M (10240K) file named /abc and set it up as a swap file.

$ fallocate --length 10485760 /abc
- or dd if=/dev/zero of=/abc bs=1M count=10
$ mkswap /abc

e.g.) Create a swap area on the hdc3 partition.

$ mkswap /dev/hdc3

swapon

Enables a swap area on the specified file or partition.

$ swapon -h

Usage:
 swapon [options] [<spec>]

Enable devices and files for paging and swapping.

Options:
 -a, --all                enable all swaps from /etc/fstab
 -d, --discard[=<policy>] enable swap discards, if supported by device
 -e, --ifexists           silently skip devices that do not exist
 -f, --fixpgsz            reinitialize the swap space if necessary
 -o, --options <list>     comma-separated list of swap options
 -p, --priority <prio>    specify the priority of the swap device
 -s, --summary            display summary about used swap devices (DEPRECATED)
     --show[=<columns>]   display summary in definable table
     --noheadings         don't print table heading (with --show)
     --raw                use the raw output format (with --show)
     --bytes              display swap size in bytes in --show output
 -v, --verbose            verbose mode

 -h, --help     display this help and exit
 -V, --version  output version information and exit

The <spec> parameter:
 -L <label>             synonym for LABEL=<label>
 -U <uuid>              synonym for UUID=<uuid>
 LABEL=<label>          specifies device by swap area label
 UUID=<uuid>            specifies device by swap area UUID
 PARTLABEL=<label>      specifies device by partition label
 PARTUUID=<uuid>        specifies device by partition UUID
 <device>               name of device to be used
 <file>                 name of file to be used

Available discard policy types (for --discard):
 once    : only single-time area discards are issued
 pages   : freed pages are discarded before they are reused
If no policy is selected, both discard types are enabled (default).

Available columns (for --show):
 NAME   device file or partition path
 TYPE   type of the device
 SIZE   size of the swap area
 USED   bytes in use
 PRIO   swap priority
 UUID   swap uuid
 LABEL  swap label

For more details see swapon(8).

e.g.) Activate swap on the swap file named /abc.

$ swapon /abc

e.g.) Activate swap on the hdc3 partition.

$ swapon /dev/hdc3

swapoff

Disables swapping on a file or partition that was set up as a swap area.

$ swapoff -h

Usage:
 swapoff [options] [<spec>]

Disable devices and files for paging and swapping.

Options:
 -a, --all              disable all swaps from /proc/swaps
 -v, --verbose          verbose mode

 -h, --help     display this help and exit
 -V, --version  output version information and exit

The <spec> parameter:
 -L <label>             LABEL of device to be used
 -U <uuid>              UUID of device to be used
 LABEL=<label>          LABEL of device to be used
 UUID=<uuid>            UUID of device to be used
 <device>               name of device to be used
 <file>                 name of file to be used

For more details see swapoff(8).

e.g.) Deactivate swap on the swap file named /abc.

$ swapoff /abc

e.g.) Deactivate swap on the hdc3 partition.

$ swapoff /dev/hdc3

Swap Extent

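The data structure used to map a range of pages in a swap area to blocks on the swap device. When the swap area sits directly on a block device, each page of the swap area maps contiguously 1:1 onto the blocks of the device, so a single mapping is enough. When a swap file is used as the swap area, however, the pages of the swap file and the blocks actually used on the block device are not contiguous in the same way, so several mappings are needed. Mapping every page to a block individually would consume too much memory for the mappings alone. Mappings are therefore grouped into ranges, and the structure used for this is swap_extent.

A swap_extent contains a start page number, a page count, and the start block number to map to.
The swap extents, formerly kept in a list, were changed to an rbtree in kernel v5.3-rc1.
- See: mm, swap: use rbtree for swap_extent (2019)

The following figure shows the 3 swap_extents needed when a swap file is laid out on the block device as 3 runs of contiguous blocks.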
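The page-to-block translation through extents can be sketched as follows. This is an illustration only: struct extent and extent_map_page() are hypothetical names, a plain array with a linear search is used instead of the kernel's list or rbtree, and blocks are assumed to be counted in page-sized units.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the three fields described above. */
struct extent {
        unsigned long start_page;       /* first swap page of the range */
        unsigned long nr_pages;         /* number of pages in the range */
        unsigned long long start_block; /* first device block of the range */
};

/*
 * Hypothetical lookup: translate a swap page number into a device block.
 * Returns 1 and stores the block on success, 0 if no extent covers it.
 */
static int extent_map_page(const struct extent *ext, size_t n,
                           unsigned long page, unsigned long long *block)
{
        size_t i;

        assert(ext && block);
        for (i = 0; i < n; i++) {
                if (page >= ext[i].start_page &&
                    page - ext[i].start_page < ext[i].nr_pages) {
                        *block = ext[i].start_block +
                                 (page - ext[i].start_page);
                        return 1;
                }
        }
        return 0;
}
```

A swap file split into three runs on disk needs three such entries, while a swap partition needs only one.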

Swapon

Activates a swap area on the specified file or block device.

sys_swapon()

mm/swapfile.c -1/3-

SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
{
        struct swap_info_struct *p;
        struct filename *name;
        struct file *swap_file = NULL;
        struct address_space *mapping;
        int prio;
        int error;
        union swap_header *swap_header;
        int nr_extents;
        sector_t span;
        unsigned long maxpages;
        unsigned char *swap_map = NULL;
        struct swap_cluster_info *cluster_info = NULL;
        unsigned long *frontswap_map = NULL;
        struct page *page = NULL;
        struct inode *inode = NULL;
        bool inced_nr_rotate_swap = false;

        if (swap_flags & ~SWAP_FLAGS_VALID)
                return -EINVAL;

        if (!capable(CAP_SYS_ADMIN))
                return -EPERM;

        if (!swap_avail_heads)
                return -ENOMEM;

        p = alloc_swap_info();
        if (IS_ERR(p))
                return PTR_ERR(p);

        INIT_WORK(&p->discard_work, swap_discard_work);

        name = getname(specialfile);
        if (IS_ERR(name)) {
                error = PTR_ERR(name);
                name = NULL;
                goto bad_swap;
        }
        swap_file = file_open_name(name, O_RDWR|O_LARGEFILE, 0);
        if (IS_ERR(swap_file)) {
                error = PTR_ERR(swap_file);
                swap_file = NULL;
                goto bad_swap;
        }

        p->swap_file = swap_file;
        mapping = swap_file->f_mapping;
        inode = mapping->host;

        /* If S_ISREG(inode->i_mode) will do inode_lock(inode); */
        error = claim_swapfile(p, inode);
        if (unlikely(error))
                goto bad_swap;

        /*
         * Read the swap header.
         */
        if (!mapping->a_ops->readpage) {
                error = -EINVAL;
                goto bad_swap;
        }
        page = read_mapping_page(mapping, 0, swap_file);
        if (IS_ERR(page)) {
                error = PTR_ERR(page);
                goto bad_swap;
        }
        swap_header = kmap(page);

        maxpages = read_swap_header(p, swap_header, inode);
        if (unlikely(!maxpages)) {
                error = -EINVAL;
                goto bad_swap;
        }

        /* OK, set up the swap map and apply the bad block list */
        swap_map = vzalloc(maxpages);
        if (!swap_map) {
                error = -ENOMEM;
                goto bad_swap;
        }

Activates a block device or file for swap as the swap area numbered @type. Returns 0 on success.

Code lines 20~21 return -EINVAL if an invalid swap flag is given. The allowed flags are as follows.
- SWAP_FLAG_PREFER (0x8000)
- SWAP_FLAG_PRIO_MASK (0x7fff)
- SWAP_FLAG_DISCARD (0x10000)
- SWAP_FLAG_DISCARD_ONCE (0x20000)
- SWAP_FLAG_DISCARD_PAGES (0x40000)
Code lines 23~24 return -EPERM if the caller lacks the CAP_SYS_ADMIN capability.
Code lines 26~27 return -ENOMEM if the swap_avail_heads list has not been initialized.
Code lines 29~31 allocate and initialize a swap_info_struct to set up the swap area.
Code line 33 initializes the work so that a worker thread can call the swap_discard_work function.
- The discard feature can be used on disks backed by an SSD.
Code lines 35~48 open the device or file to swapon and record it in the swap information.
Code lines 49~55 use the address_space and inode of the swap file to do the following.
- For a block device, the block device containing the inode is set in the bdev member of the swap information, and the SWP_BLKDEV flag is added to the flags member.
- For a file, inode->i_sb->s_bdev is set in the bdev member of the swap information. If the file is already in use as a swapfile, -EBUSY is returned.
Code lines 60~63 return -EINVAL if the (*readpage) hook of the swap file's file system is not set.
Code lines 64~68 read the page cache page at index 0 in order to read the header of the swap file.
Code line 69 temporarily maps the page with kmap so that it can be used as the swap_header.
- arm64 does not use highmem, so the page is already mapped.
Code lines 71~75 parse the swap header, record its start and end positions in the swap information, and obtain the number of pages that make up the actual swap area.
Code lines 78~82 allocate one byte per page of the actual swap area and assign the result to swap_map.
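The flag check and the way a priority travels inside swap_flags can be sketched in userspace. The flag values below are the ones listed above; SWAP_FLAG_PRIO_SHIFT = 0 is an assumption, and swap_flags_valid() and swap_flags_prio() are hypothetical helper names.

```c
#include <assert.h>

#define SWAP_FLAG_PREFER        0x8000
#define SWAP_FLAG_PRIO_MASK     0x7fff
#define SWAP_FLAG_PRIO_SHIFT    0       /* assumption for this sketch */
#define SWAP_FLAG_DISCARD       0x10000
#define SWAP_FLAG_DISCARD_ONCE  0x20000
#define SWAP_FLAG_DISCARD_PAGES 0x40000

#define SWAP_FLAGS_VALID        (SWAP_FLAG_PRIO_MASK | SWAP_FLAG_PREFER | \
                                 SWAP_FLAG_DISCARD | SWAP_FLAG_DISCARD_ONCE | \
                                 SWAP_FLAG_DISCARD_PAGES)

/* Hypothetical helper: non-zero when no unknown flag bit is set. */
static int swap_flags_valid(int swap_flags)
{
        return (swap_flags & ~SWAP_FLAGS_VALID) == 0;
}

/* Hypothetical helper: extract the requested priority, or -1 if none. */
static int swap_flags_prio(int swap_flags)
{
        assert(swap_flags_valid(swap_flags));
        if (!(swap_flags & SWAP_FLAG_PREFER))
                return -1;      /* the priority bits are ignored */
        return (swap_flags & SWAP_FLAG_PRIO_MASK) >> SWAP_FLAG_PRIO_SHIFT;
}
```

A caller asking for priority 5 would pass SWAP_FLAG_PREFER | 5; without SWAP_FLAG_PREFER the priority bits have no effect.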

mm/swapfile.c -2/3-

        if (bdi_cap_stable_pages_required(inode_to_bdi(inode)))
                p->flags |= SWP_STABLE_WRITES;

        if (bdi_cap_synchronous_io(inode_to_bdi(inode)))
                p->flags |= SWP_SYNCHRONOUS_IO;

        if (p->bdev && blk_queue_nonrot(bdev_get_queue(p->bdev))) {
                int cpu;
                unsigned long ci, nr_cluster;

                p->flags |= SWP_SOLIDSTATE;
                /*
                 * select a random position to start with to help wear leveling
                 * SSD
                 */
                p->cluster_next = 1 + (prandom_u32() % p->highest_bit);
                nr_cluster = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);

                cluster_info = kvcalloc(nr_cluster, sizeof(*cluster_info),
                                        GFP_KERNEL);
                if (!cluster_info) {
                        error = -ENOMEM;
                        goto bad_swap;
                }

                for (ci = 0; ci < nr_cluster; ci++)
                        spin_lock_init(&((cluster_info + ci)->lock));

                p->percpu_cluster = alloc_percpu(struct percpu_cluster);
                if (!p->percpu_cluster) {
                        error = -ENOMEM;
                        goto bad_swap;
                }
                for_each_possible_cpu(cpu) {
                        struct percpu_cluster *cluster;
                        cluster = per_cpu_ptr(p->percpu_cluster, cpu);
                        cluster_set_null(&cluster->index);
                }
        } else {
                atomic_inc(&nr_rotate_swap);
                inced_nr_rotate_swap = true;
        }

        error = swap_cgroup_swapon(p->type, maxpages);
        if (error)
                goto bad_swap;

        nr_extents = setup_swap_map_and_extents(p, swap_header, swap_map,
                cluster_info, maxpages, &span);
        if (unlikely(nr_extents < 0)) {
                error = nr_extents;
                goto bad_swap;
        }
        /* frontswap enabled? set up bit-per-page map for frontswap */
        if (IS_ENABLED(CONFIG_FRONTSWAP))
                frontswap_map = kvcalloc(BITS_TO_LONGS(maxpages),
                                         sizeof(long),
                                         GFP_KERNEL);

        if (p->bdev &&(swap_flags & SWAP_FLAG_DISCARD) && swap_discardable(p)) {
                /*
                 * When discard is enabled for swap with no particular
                 * policy flagged, we set all swap discard flags here in
                 * order to sustain backward compatibility with older
                 * swapon(8) releases.
                 */
                p->flags |= (SWP_DISCARDABLE | SWP_AREA_DISCARD |
                             SWP_PAGE_DISCARD);

                /*
                 * By flagging sys_swapon, a sysadmin can tell us to
                 * either do single-time area discards only, or to just
                 * perform discards for released swap page-clusters.
                 * Now it's time to adjust the p->flags accordingly.
                 */
                if (swap_flags & SWAP_FLAG_DISCARD_ONCE)
                        p->flags &= ~SWP_PAGE_DISCARD;
                else if (swap_flags & SWAP_FLAG_DISCARD_PAGES)
                        p->flags &= ~SWP_AREA_DISCARD;

                /* issue a swapon-time discard if it's still required */
                if (p->flags & SWP_AREA_DISCARD) {
                        int err = discard_swap(p);
                        if (unlikely(err))
                                pr_err("swapon: discard_swap(%p): %d\n",
                                        p, err);
                }
        }

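Code lines 1~2 add the SWP_STABLE_WRITES flag when the device requires stable pages while a swap write is in flight.
Code lines 4~5: a device with fast swap writes (zram, pmem, etc.) does not need asynchronous handling. In this case the SWP_SYNCHRONOUS_IO flag is added.
Code lines 7~11 add the SWP_SOLIDSTATE flag for a non-rotational block device such as an SSD.
Code line 16 picks the next cluster position to use at random within the swap area.
Code line 17 determines the number of clusters from the number of usable swap pages.
- One cluster manages SWAPFILE_CLUSTER(256) pages.
- Currently only the x86_64 architecture supports THP_SWAP, and in that case HPAGE_PMD_NR pages are managed instead of 256.
Code lines 19~27 allocate and initialize cluster_info for the determined number of clusters.
Code lines 29~38 allocate a per-cpu percpu_cluster structure, assign it to the percpu_cluster member of the swap information, and initialize it.
Code lines 39~42: for a non-SSD device, nr_rotate_swap is incremented and inced_nr_rotate_swap is set to true.
- When nr_rotate_swap is non-zero, vma-based readahead is not used.
Code lines 44~46 allocate and prepare the swap_cgroup arrays for cgroup swapon.
Code lines 48~53 allocate and prepare the swap map and swap_extents.
Code lines 55~58 allocate the frontswap map when the kernel supports frontswap.
- Each bit of the map corresponds to one page.
Code lines 60~88 handle a SWAP_FLAG_DISCARD request.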
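How the discard request is turned into a policy (code lines 60~88) can be modeled as a pure function. In this sketch the SWAP_FLAG_* values are the ones listed for swapon, while the SWP_* bit values are hypothetical placeholders chosen only for the illustration; device_can_discard stands in for the p->bdev and swap_discardable(p) checks.

```c
#include <assert.h>

#define SWAP_FLAG_DISCARD       0x10000
#define SWAP_FLAG_DISCARD_ONCE  0x20000
#define SWAP_FLAG_DISCARD_PAGES 0x40000

/* Hypothetical bit values, used only inside this sketch. */
#define SWP_DISCARDABLE         0x1
#define SWP_AREA_DISCARD        0x2
#define SWP_PAGE_DISCARD        0x4

/* Hypothetical helper: resulting discard-related SWP_* flags. */
static unsigned int discard_policy(int swap_flags, int device_can_discard)
{
        unsigned int flags;

        if (!(swap_flags & SWAP_FLAG_DISCARD) || !device_can_discard)
                return 0;

        /* No explicit policy: enable both discard types. */
        flags = SWP_DISCARDABLE | SWP_AREA_DISCARD | SWP_PAGE_DISCARD;

        if (swap_flags & SWAP_FLAG_DISCARD_ONCE)
                flags &= ~SWP_PAGE_DISCARD;     /* swapon-time discard only */
        else if (swap_flags & SWAP_FLAG_DISCARD_PAGES)
                flags &= ~SWP_AREA_DISCARD;     /* per-page discard only */

        assert(flags & SWP_DISCARDABLE);
        return flags;
}
```

This matches the "once" and "pages" policies shown in the swapon help text, with both enabled when neither is chosen.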

mm/swapfile.c -3/3-

        error = init_swap_address_space(p->type, maxpages);
        if (error)
                goto bad_swap;

        mutex_lock(&swapon_mutex);
        prio = -1;
        if (swap_flags & SWAP_FLAG_PREFER)
                prio =
                  (swap_flags & SWAP_FLAG_PRIO_MASK) >> SWAP_FLAG_PRIO_SHIFT;
        enable_swap_info(p, prio, swap_map, cluster_info, frontswap_map);

        pr_info("Adding %uk swap on %s.  Priority:%d extents:%d across:%lluk %s%s%s%s%s\n",
                p->pages<<(PAGE_SHIFT-10), name->name, p->prio,
                nr_extents, (unsigned long long)span<<(PAGE_SHIFT-10),
                (p->flags & SWP_SOLIDSTATE) ? "SS" : "",
                (p->flags & SWP_DISCARDABLE) ? "D" : "",
                (p->flags & SWP_AREA_DISCARD) ? "s" : "",
                (p->flags & SWP_PAGE_DISCARD) ? "c" : "",
                (frontswap_map) ? "FS" : "");

        mutex_unlock(&swapon_mutex);
        atomic_inc(&proc_poll_event);
        wake_up_interruptible(&proc_poll_wait);

        if (S_ISREG(inode->i_mode))
                inode->i_flags |= S_SWAPFILE;
        error = 0;
        goto out;
bad_swap:
        free_percpu(p->percpu_cluster);
        p->percpu_cluster = NULL;
        if (inode && S_ISBLK(inode->i_mode) && p->bdev) {
                set_blocksize(p->bdev, p->old_block_size);
                blkdev_put(p->bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
        }
        destroy_swap_extents(p);
        swap_cgroup_swapoff(p->type);
        spin_lock(&swap_lock);
        p->swap_file = NULL;
        p->flags = 0;
        spin_unlock(&swap_lock);
        vfree(swap_map);
        kvfree(cluster_info);
        kvfree(frontswap_map);
        if (inced_nr_rotate_swap)
                atomic_dec(&nr_rotate_swap);
        if (swap_file) {
                if (inode && S_ISREG(inode->i_mode)) {
                        inode_unlock(inode);
                        inode = NULL;
                }
                filp_close(swap_file, NULL);
        }
out:
        if (page && !IS_ERR(page)) {
                kunmap(page);
                put_page(page);
        }
        if (name)
                putname(name);
        if (inode && S_ISREG(inode->i_mode))
                inode_unlock(inode);
        if (!error)
                enable_swap_slots_cache();
        return error;
}

Code lines 1~3 initialize the swap cache address spaces for @type.
- An address_space array is allocated for swapper_spaces[type], with one element per SWAP_ADDRESS_SPACE_PAGES (2^14=16K pages=64M) unit of the swap area size.
Code lines 6~9: when the SWAP_FLAG_PREFER flag is requested, the flags also carry a priority value. In that case only the priority value is extracted and assigned to prio. Otherwise prio is -1.
Code lines 10~19 activate the swap area and print a message.
- "Adding <size>k swap on <file/block device name>. Priority:<prio> extents:<number of extent mappings> across:<span size>k [SS][D][s][c][FS]"
  - SS: SSD
  - D: discardable
  - s: swap area discard
  - c: swap page discard
  - FS: FrontSwap map
Code lines 25~26 add the S_SWAPFILE flag to the inode when the swap area uses a swap file.
Code lines 27~28 finish successfully without error and jump to the out label.
Code lines 29~53 are the bad_swap: label. If the swap area could not be activated, the memory that had been allocated is released.
Code lines 54~65 are the out: label. If initialization of the swap area succeeded, the swap slot cache is enabled.

swap_info_struct allocation and initialization

alloc_swap_info()

mm/swapfile.c

static struct swap_info_struct *alloc_swap_info(void)
{
        struct swap_info_struct *p;
        unsigned int type;
        int i;
        int size = sizeof(*p) + nr_node_ids * sizeof(struct plist_node);

        p = kvzalloc(size, GFP_KERNEL);
        if (!p)
                return ERR_PTR(-ENOMEM);

        spin_lock(&swap_lock);
        for (type = 0; type < nr_swapfiles; type++) {
                if (!(swap_info[type]->flags & SWP_USED))
                        break;
        }
        if (type >= MAX_SWAPFILES) {
                spin_unlock(&swap_lock);
                kvfree(p);
                return ERR_PTR(-EPERM);
        }
        if (type >= nr_swapfiles) {
                p->type = type;
                swap_info[type] = p;
                /*
                 * Write swap_info[type] before nr_swapfiles, in case a
                 * racing procfs swap_start() or swap_next() is reading them.
                 * (We never shrink nr_swapfiles, we never free this entry.)
                 */
                smp_wmb();
                nr_swapfiles++;
        } else {
                kvfree(p);
                p = swap_info[type];
                /*
                 * Do not memset this entry: a racing procfs swap_next()
                 * would be relying on p->type to remain valid.
                 */
        }
        INIT_LIST_HEAD(&p->first_swap_extent.list);
        plist_node_init(&p->list, 0);
        for_each_node(i)
                plist_node_init(&p->avail_lists[i], 0);
        p->flags = SWP_USED;
        spin_unlock(&swap_lock);
        spin_lock_init(&p->lock);
        spin_lock_init(&p->cont_lock);

        return p;
}

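Allocates the swap area information. (Returns a pointer to the allocated swap_info_struct.)

Code lines 6~10 allocate the swap_info_struct structure together with an array of plist_node structures, one per node.
Code lines 13~16 search the swap_info[] array, up to the number of swap files, for a free slot.
Code lines 17~21 cancel the allocation and return -EPERM if the number of swap files already created is MAX_SWAPFILES or more.
Code lines 22~31: if there is no free slot in the swap_info[] array, the newly allocated memory is placed at the end and the swap file count is incremented.
Code lines 32~39: if there is a free slot in the swap_info[] array, the newly allocated memory is released and the previously allocated memory is reused.
Code lines 40~47 initialize the relevant members of the swap_info_struct and set the SWP_USED flag.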
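The slot selection in alloc_swap_info() can be reduced to a small model. The sketch below is hypothetical: it tracks only a flags array and a counter instead of real swap_info_struct objects, uses a placeholder SWP_USED value, and returns -1 where the kernel returns -EPERM.

```c
#include <assert.h>

#define SWP_USED        0x1     /* placeholder value for this sketch */

/*
 * Hypothetical model of choosing a swap type number.
 * flags[i] is the flags word of slot i, *nr is the number of slots
 * ever handed out (it never shrinks), max is the slot limit.
 * Returns the chosen type, or -1 when all slots are in use.
 */
static int pick_swap_type(unsigned int *flags, unsigned int *nr,
                          unsigned int max)
{
        unsigned int type;

        assert(flags && nr && *nr <= max);
        /* Reuse the first slot that was released by swapoff, if any. */
        for (type = 0; type < *nr; type++) {
                if (!(flags[type] & SWP_USED))
                        break;
        }
        if (type >= max)
                return -1;
        if (type >= *nr)
                (*nr)++;        /* no free slot: append a new one */
        flags[type] = SWP_USED;
        return (int)type;
}
```

Released slots are reused before the counter grows, which is why the type number of a new swap area is not always the highest one.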

Swap header

swap_header structure

include/linux/swap.h

/*
 * Magic header for a swap area. The first part of the union is
 * what the swap magic looks like for the old (limited to 128MB)
 * swap area format, the second part of the union adds - in the
 * old reserved area - some extra information. Note that the first
 * kilobyte is reserved for boot loader or disk label stuff...
 *
 * Having the magic at the end of the PAGE_SIZE makes detecting swap
 * areas somewhat tricky on machines that support multiple page sizes.
 * For 2.5 we'll probably want to move the magic to just beyond the
 * bootbits...
 */

union swap_header {
        struct {
                char reserved[PAGE_SIZE - 10];
                char magic[10];                 /* SWAP-SPACE or SWAPSPACE2 */
        } magic;
        struct {
                char            bootbits[1024]; /* Space for disklabel etc. */
                __u32           version;
                __u32           last_page;
                __u32           nr_badpages;
                unsigned char   sws_uuid[16];
                unsigned char   sws_volume[16];
                __u32           padding[117];
                __u32           badpages[1];
        } info;
};

The following figure shows the header layout of a swap file.

read_swap_header()

mm/swapfile.c

static unsigned long read_swap_header(struct swap_info_struct *p,
                                        union swap_header *swap_header,
                                        struct inode *inode)
{
        int i;
        unsigned long maxpages;
        unsigned long swapfilepages;
        unsigned long last_page;

        if (memcmp("SWAPSPACE2", swap_header->magic.magic, 10)) {
                pr_err("Unable to find swap-space signature\n");
                return 0;
        }

        /* swap partition endianess hack... */
        if (swab32(swap_header->info.version) == 1) {
                swab32s(&swap_header->info.version);
                swab32s(&swap_header->info.last_page);
                swab32s(&swap_header->info.nr_badpages);
                if (swap_header->info.nr_badpages > MAX_SWAP_BADPAGES)
                        return 0;
                for (i = 0; i < swap_header->info.nr_badpages; i++)
                        swab32s(&swap_header->info.badpages[i]);
        }
        /* Check the swap header's sub-version */
        if (swap_header->info.version != 1) {
                pr_warn("Unable to handle swap header version %d\n",
                        swap_header->info.version);
                return 0;
        }

        p->lowest_bit  = 1;
        p->cluster_next = 1;
        p->cluster_nr = 0;

        maxpages = max_swapfile_size();
        last_page = swap_header->info.last_page;
        if (!last_page) {
                pr_warn("Empty swap-file\n");
                return 0;
        }
        if (last_page > maxpages) {
                pr_warn("Truncating oversized swap area, only using %luk out of %luk\n",
                        maxpages << (PAGE_SHIFT - 10),
                        last_page << (PAGE_SHIFT - 10));
        }
        if (maxpages > last_page) {
                maxpages = last_page + 1;
                /* p->max is an unsigned int: don't overflow it */
                if ((unsigned int)maxpages == 0)
                        maxpages = UINT_MAX;
        }
        p->highest_bit = maxpages - 1;

        if (!maxpages)
                return 0;
        swapfilepages = i_size_read(inode) >> PAGE_SHIFT;
        if (swapfilepages && maxpages > swapfilepages) {
                pr_warn("Swap area shorter than signature indicates\n");
                return 0;
        }
        if (swap_header->info.nr_badpages && S_ISREG(inode->i_mode))
                return 0;
        if (swap_header->info.nr_badpages > MAX_SWAP_BADPAGES)
                return 0;

        return maxpages;
}

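Parses the swap header and records the start and end positions of the swap area in the swap information.

In code lines 10~13, checks for the magic string “SWAPSPACE2” in the last 10 bytes of the page. If it is missing, the following error message is printed and 0 is returned.
- “Unable to find swap-space signature\n”
In code lines 16~30, if the version field reads as 1 only after byte swapping, the header was written with the opposite endianness, so version, last_page, nr_badpages and the badpages[] entries are byte swapped in place. If the version is still not 1 after that, the following warning message is printed and 0 is returned.
- “Unable to handle swap header version %d\n”
In code lines 32~34, cluster_nr is initialized to 0 and cluster_next to 1. The swap start page (lowest_bit) is set to 1, the first page after the header page.
In code line 36, obtains the swap offset page limit supported by the architecture and stores it in maxpages.
In code lines 37~41, if last_page recorded in the swap header is 0, the message “Empty swap-file” is printed and 0 is returned.
In code lines 42~46, if last_page exceeds maxpages, a warning message is printed.
In code lines 47~52, if maxpages is greater than last_page, maxpages is set to last_page + 1. Since p->max is an unsigned int, a value that would wrap to 0 is replaced with UINT_MAX.
In code line 53, the swap end page (highest_bit) is set to maxpages - 1.
In code lines 57~61, obtains the number of pages in the swap file or block device. If it is smaller than maxpages, the following warning message is printed and 0 is returned.
- “Swap area shorter than signature indicates\n”
In code lines 62~63, if bad pages exist and the swap area is a regular file, 0 is returned.
In code lines 64~65, if the number of bad pages exceeds MAX_SWAP_BADPAGES, 0 is returned.
In code line 67, maxpages is returned.

The following figure shows how the swap header is read from a swap file or block device and how the start (lowest_bit) and end (highest_bit) positions are first set in the swap information.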
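
As a rough user-space illustration of the header checks above, the sketch below validates an in-memory copy of the first page of a swap area. The names `parse_swap_header()` and `struct swap_hdr_info` and the fixed 4096-byte page size are assumptions made for this example. The endianness fix-up and the maxpages clamping done by the kernel are omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EX_PAGE_SIZE 4096ul     /* assumption: 4 KiB pages */
#define EX_BOOTBITS  1024ul     /* info fields start after the boot area */
#define EX_MAGIC     "SWAPSPACE2"
#define EX_MAGIC_LEN 10ul

/* Hypothetical result type, not a kernel structure. */
struct swap_hdr_info {
        uint32_t version;
        uint32_t last_page;
        uint32_t nr_badpages;
};

/*
 * Returns 0 on success, -1 if the page does not look like a usable
 * version 1 swap header. 'page' must point to EX_PAGE_SIZE bytes.
 */
static int parse_swap_header(const unsigned char *page,
                             struct swap_hdr_info *out)
{
        assert(page != NULL && out != NULL);

        /* The magic string sits in the last 10 bytes of the page. */
        if (memcmp(page + EX_PAGE_SIZE - EX_MAGIC_LEN, EX_MAGIC,
                   EX_MAGIC_LEN) != 0)
                return -1;

        /* version, last_page and nr_badpages follow the 1 KiB boot area. */
        memcpy(&out->version, page + EX_BOOTBITS, sizeof(uint32_t));
        memcpy(&out->last_page, page + EX_BOOTBITS + 4, sizeof(uint32_t));
        memcpy(&out->nr_badpages, page + EX_BOOTBITS + 8, sizeof(uint32_t));

        if (out->version != 1)          /* native byte order only */
                return -1;
        if (out->last_page == 0)        /* empty swap area */
                return -1;
        return 0;
}
```

A caller would read the first page of the device or file into a buffer and pass it in. On success, last_page + 1 corresponds to the maxpages value discussed above, before any architecture limit is applied.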

Swap for cgroup

swap_cgroup_swapon()

mm/swap_cgroup.c

int swap_cgroup_swapon(int type, unsigned long max_pages)
{
        void *array;
        unsigned long array_size;
        unsigned long length;
        struct swap_cgroup_ctrl *ctrl;

        if (!do_swap_account)
                return 0;

        length = DIV_ROUND_UP(max_pages, SC_PER_PAGE);
        array_size = length * sizeof(void *);

        array = vzalloc(array_size);
        if (!array)
                goto nomem;

        ctrl = &swap_cgroup_ctrl[type];
        mutex_lock(&swap_cgroup_mutex);
        ctrl->length = length;
        ctrl->map = array;
        spin_lock_init(&ctrl->lock);
        if (swap_cgroup_prepare(type)) {
                /* memory shortage */
                ctrl->map = NULL;
                ctrl->length = 0;
                mutex_unlock(&swap_cgroup_mutex);
                vfree(array);
                goto nomem;
        }
        mutex_unlock(&swap_cgroup_mutex);

        return 0;
nomem:
        pr_info("couldn't allocate enough memory for swap_cgroup\n");
        pr_info("swap_cgroup can be disabled by swapaccount=0 boot option\n");
        return -ENOMEM;
}

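Allocates and prepares the swap cgroup. Returns 0 on success.

In code lines 8~9, if the kernel does not support memcg swap accounting, the function returns immediately.
- The CONFIG_MEMCG_SWAP & CONFIG_MEMCG_SWAP_ENABLED kernel options must be used.
In code lines 11~16, computes how many pages are needed to hold one swap_cgroup structure per page of the swap area, and allocates a page pointer array with that many entries.
In code lines 18~22, links the allocated page pointer array to the swap_cgroup_ctrl for @type in the global swap_cgroup_ctrl[] array, and stores the computed page count in length.
In code lines 23~30, allocates the pages that will hold the swap_cgroup structure arrays and links them to the page pointer array prepared in swap_cgroup_ctrl. On failure the map is reset and the pointer array is freed.
In code line 33, returns 0 once the allocation has completed normally.
In code lines 34~37, this is the nomem: label. A message is printed saying that there is not enough memory to allocate the swap cgroup, followed by a message saying that the swap cgroup can be disabled with the kernel option “swapaccount=0”.

The following figure shows the process of allocating and preparing the swap cgroup.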
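
The sizing arithmetic above can be checked with a small sketch. It assumes 4 KiB pages and a 2-byte per-slot record (a single unsigned short id), which is how the sketch models the swap_cgroup structure. All `ex_` and `EX_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define EX_PAGE_SIZE 4096ul     /* assumption: 4 KiB pages */
#define EX_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Assumed per-slot record: one 16-bit cgroup id. */
struct ex_swap_cgroup {
        unsigned short id;
};

/* Records per page, the counterpart of SC_PER_PAGE. */
#define EX_SC_PER_PAGE (EX_PAGE_SIZE / sizeof(struct ex_swap_cgroup))

/* Number of backing pages (and page pointers) needed for max_pages slots. */
static unsigned long ex_swap_cgroup_pages(unsigned long max_pages)
{
        assert(sizeof(struct ex_swap_cgroup) == 2);
        return EX_DIV_ROUND_UP(max_pages, EX_SC_PER_PAGE);
}

/* Size in bytes of the page pointer array itself. */
static size_t ex_swap_cgroup_map_bytes(unsigned long max_pages)
{
        return ex_swap_cgroup_pages(max_pages) * sizeof(void *);
}
```

Under these assumptions a 1 GiB swap area (262144 pages) needs 128 backing pages, so the pointer array stays small even for large swap areas.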

swap_cgroup_prepare()

mm/swap_cgroup.c

/*
 * allocate buffer for swap_cgroup.
 */

static int swap_cgroup_prepare(int type)
{
        struct page *page;
        struct swap_cgroup_ctrl *ctrl;
        unsigned long idx, max;

        ctrl = &swap_cgroup_ctrl[type];

        for (idx = 0; idx < ctrl->length; idx++) {
                page = alloc_page(GFP_KERNEL | __GFP_ZERO);
                if (!page)
                        goto not_enough_page;
                ctrl->map[idx] = page;

                if (!(idx % SWAP_CLUSTER_MAX))
                        cond_resched();
        }
        return 0;
not_enough_page:
        max = idx;
        for (idx = 0; idx < max; idx++)
                __free_page(ctrl->map[idx]);

        return -ENOMEM;
}

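Allocates the pages that will hold the swap_cgroup structure arrays and links them to the page pointer array prepared in swap_cgroup_ctrl.

In code line 7, selects the swap_cgroup_ctrl that corresponds to @type.
In code lines 9~17, iterates ctrl->length times, allocating a zeroed page for the swap_cgroup structure array and linking it to ctrl->map[idx]. Every SWAP_CLUSTER_MAX iterations it calls cond_resched() to give other tasks a chance to run.
In code line 18, returns 0 on success.
In code lines 19~24, this is the not_enough_page: label. On memory shortage, the pages allocated so far are freed and -ENOMEM is returned.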
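
The allocate-then-roll-back pattern used here can be sketched in user space as follows. `ex_prepare()` and its `fail_at` test hook are hypothetical, and `calloc()` stands in for alloc_page() with __GFP_ZERO.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Fills map[0..length-1] with zeroed buffers of 'size' bytes.
 * 'fail_at' is a test hook: the allocation with that index is forced
 * to fail; pass a value >= length to disable it.
 * Returns 0 on success, -1 after releasing every buffer already taken.
 */
static int ex_prepare(void **map, size_t length, size_t size, size_t fail_at)
{
        size_t idx, max;

        assert(map != NULL);

        for (idx = 0; idx < length; idx++) {
                void *buf = (idx == fail_at) ? NULL : calloc(1, size);

                if (!buf)
                        goto not_enough;
                map[idx] = buf;
        }
        return 0;

not_enough:
        max = idx;                      /* entries [0, max) were allocated */
        for (idx = 0; idx < max; idx++) {
                free(map[idx]);
                map[idx] = NULL;
        }
        return -1;
}
```

Recording the failing index in max before reusing idx for the cleanup loop is what keeps the rollback from touching entries that were never filled.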

Swap map initialization

setup_swap_map_and_extents()

mm/swapfile.c

static int setup_swap_map_and_extents(struct swap_info_struct *p,
                                        union swap_header *swap_header,
                                        unsigned char *swap_map,
                                        struct swap_cluster_info *cluster_info,
                                        unsigned long maxpages,
                                        sector_t *span)
{
        unsigned int j, k;
        unsigned int nr_good_pages;
        int nr_extents;
        unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
        unsigned long col = p->cluster_next / SWAPFILE_CLUSTER % SWAP_CLUSTER_COLS;
        unsigned long i, idx;

        nr_good_pages = maxpages - 1;   /* omit header page */

        cluster_list_init(&p->free_clusters);
        cluster_list_init(&p->discard_clusters);

        for (i = 0; i < swap_header->info.nr_badpages; i++) {
                unsigned int page_nr = swap_header->info.badpages[i];
                if (page_nr == 0 || page_nr > swap_header->info.last_page)
                        return -EINVAL;
                if (page_nr < maxpages) {
                        swap_map[page_nr] = SWAP_MAP_BAD;
                        nr_good_pages--;
                        /*
                         * Haven't marked the cluster free yet, no list
                         * operation involved
                         */
                        inc_cluster_info_page(p, cluster_info, page_nr);
                }
        }

        /* Haven't marked the cluster free yet, no list operation involved */
        for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++)
                inc_cluster_info_page(p, cluster_info, i);

        if (nr_good_pages) {
                swap_map[0] = SWAP_MAP_BAD;
                /*
                 * Not mark the cluster free yet, no list
                 * operation involved
                 */
                inc_cluster_info_page(p, cluster_info, 0);
                p->max = maxpages;
                p->pages = nr_good_pages;
                nr_extents = setup_swap_extents(p, span);
                if (nr_extents < 0)
                        return nr_extents;
                nr_good_pages = p->pages;
        }
        if (!nr_good_pages) {
                pr_warn("Empty swap-file\n");
                return -EINVAL;
        }

        if (!cluster_info)
                return nr_extents;

        /*
         * Reduce false cache line sharing between cluster_info and
         * sharing same address space.
         */
        for (k = 0; k < SWAP_CLUSTER_COLS; k++) {
                j = (k + col) % SWAP_CLUSTER_COLS;
                for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
                        idx = i * SWAP_CLUSTER_COLS + j;
                        if (idx >= nr_clusters)
                                continue;
                        if (cluster_count(&cluster_info[idx]))
                                continue;
                        cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE);
                        cluster_list_add_tail(&p->free_clusters, cluster_info,
                                              idx);
                }
        }
        return nr_extents;
}

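Prepares the swap map and the swap extents, and returns the number of swap extents added.

In code line 11, computes the number of clusters needed from the maximum number of swap pages.
In code line 12, derives the starting column (col) from the next cluster position, cluster_next.
- A swap file is handled in units of up to 64M, and the maximum cluster number is …
In code line 15, one page is subtracted from the maximum number of swap pages for the header page. This gives the good page count before any bad pages are subtracted.
In code lines 17~18, the free_clusters and discard_clusters lists are cleared.
In code lines 20~33, iterates over the bad pages recorded in the header page. For each bad page number, the matching swap_map[] entry is marked SWAP_MAP_BAD and the good page count is decremented. The cluster that contains the bad page is also counted as in use.
In code lines 36~37, the swap area is managed in cluster units, so the pages past the end of the area, up to the next cluster boundary, are also counted as in use in their cluster.
In code lines 39~52, if good pages exist, the first page of the swap area is marked SWAP_MAP_BAD and the first cluster is counted as in use. The swap extents are then set up.
In code lines 53~56, if there is not a single good page, a warning message saying that the swap file is empty is printed and -EINVAL is returned.
In code lines 58~59, if @cluster_info was not given, the number of swap extents is returned.
In code lines 65~77, to reduce false cache line sharing, clusters are spread out so that different CPUs touch different parts of cluster_info and different address_space structures. Every cluster that is still unused is added to the free_clusters list and gets the CLUSTER_FLAG_FREE flag in its cluster_info.
In code line 78, the number of swap extents is returned.

The following figure shows how swap_map is built from the information parsed from the header page of the swap area.

The following figure shows how the free cluster list is prepared for cluster management when an SSD is used.

Clusters that contain the header page or a bad page are marked in use and left off the free cluster list.
It also shows that free clusters are not added in sequential order but interleaved with a stride of 64 clusters.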
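
The interleaved insertion order can be reproduced with a short sketch. `ex_free_cluster_order()` is a hypothetical helper that follows the loop structure of the kernel code above. The column count is a parameter here, where the text above uses 64.

```c
#include <assert.h>
#include <stddef.h>

#define EX_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Writes into order[] the sequence in which cluster indexes would be
 * appended to the free list. used[idx] != 0 marks a cluster that
 * already holds a header or bad page and is skipped.
 * Returns the number of indexes written.
 */
static size_t ex_free_cluster_order(const unsigned char *used,
                                    size_t nr_clusters, size_t cols,
                                    size_t col, size_t *order)
{
        size_t i, j, k, idx, n = 0;

        assert(cols > 0);

        for (k = 0; k < cols; k++) {
                j = (k + col) % cols;           /* column visited in this pass */
                for (i = 0; i < EX_DIV_ROUND_UP(nr_clusters, cols); i++) {
                        idx = i * cols + j;     /* row i, column j */
                        if (idx >= nr_clusters || used[idx])
                                continue;
                        order[n++] = idx;
                }
        }
        return n;
}
```

With 10 clusters, 4 columns, a starting column of 0 and cluster 0 in use, the resulting order is 4, 8, 1, 5, 9, 2, 6, 3, 7. Neighbouring list entries are therefore always a full row apart.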

Swap Extents

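Swap extents map ranges of the swap area onto ranges of a block device. Depending on the kind of swap area, swap extents are prepared in one of the following three ways.

When a block device is used, the whole swap area is mapped onto the block device at once with a single swap extent.
A swap area whose swap file sits on a mounted file system that supports (*swap_activate) is also mapped in full at once, so only one swap extent is needed.
- nfs, xfs, btrfs, sunrpc, …
generic swap file
- This is the case where a swap file is used as the swap area. The file's blocks may be scattered (fragmented) across the block device, so several swap_extent entries can be required.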

setup_swap_extents()

mm/swapfile.c

/*
 * A `swap extent' is a simple thing which maps a contiguous range of pages
 * onto a contiguous range of disk blocks.  An ordered list of swap extents
 * is built at swapon time and is then used at swap_writepage/swap_readpage
 * time for locating where on disk a page belongs.
 *
 * If the swapfile is an S_ISBLK block device, a single extent is installed.
 * This is done so that the main operating code can treat S_ISBLK and S_ISREG
 * swap files identically.
 *
 * Whether the swapdev is an S_ISREG file or an S_ISBLK blockdev, the swap
 * extent list operates in PAGE_SIZE disk blocks.  Both S_ISREG and S_ISBLK
 * swapfiles are handled *identically* after swapon time.
 *
 * For S_ISREG swapfiles, setup_swap_extents() will walk all the file's blocks
 * and will parse them into an ordered extent list, in PAGE_SIZE chunks.  If
 * some stray blocks are found which do not fall within the PAGE_SIZE alignment
 * requirements, they are simply tossed out - we will never use those blocks
 * for swapping.
 *
 * For S_ISREG swapfiles we set S_SWAPFILE across the life of the swapon.  This
 * prevents root from shooting her foot off by ftruncating an in-use swapfile,
 * which will scribble on the fs.
 *
 * The amount of disk space which a single swap extent represents varies.
 * Typically it is in the 1-4 megabyte range.  So we can have hundreds of
 * extents in the list.  To avoid much list walking, we cache the previous
 * search location in `curr_swap_extent', and start new searches from there.
 * This is extremely effective.  The average number of iterations in
 * map_swap_page() has been measured at about 0.3 per page.  - akpm.
 */

static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)
{
        struct file *swap_file = sis->swap_file;
        struct address_space *mapping = swap_file->f_mapping;
        struct inode *inode = mapping->host;
        int ret;

        if (S_ISBLK(inode->i_mode)) {
                ret = add_swap_extent(sis, 0, sis->max, 0);
                *span = sis->pages;
                return ret;
        }

        if (mapping->a_ops->swap_activate) {
                ret = mapping->a_ops->swap_activate(sis, swap_file, span);
                if (ret >= 0)
                        sis->flags |= SWP_ACTIVATED;
                if (!ret) {
                        sis->flags |= SWP_FS;
                        ret = add_swap_extent(sis, 0, sis->max, 0);
                        *span = sis->pages;
                }
                return ret;
        }

        return generic_swapfile_activate(sis, swap_file, span);
}

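Prepares the swap extents. The number of pages spanned is stored in the output argument @span. On success the number of extents added is returned, and on failure a negative error code.

In code lines 8~12, if the swap area is a block device, the whole swap area (pages 0 ~ sis->max) can be mapped onto that block device at once. To do so, a single swap_extent that maps everything starting from block 0 is added.
In code lines 14~24, if the (*swap_activate) hook of the mapping's operations is supported, it is called. If the call succeeds (ret >= 0), the SWP_ACTIVATED flag is set. If it returned exactly 0, the swap file goes through the file system, so the SWP_FS flag is set and a single swap_extent that maps the whole swap area (pages 0 ~ sis->max) at once is added.
In code line 26, if the swap area is a generic swap file, one or more swap_extent entries are added for the mapping and the file is activated.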

generic_swapfile_activate()

mm/page_io.c

int generic_swapfile_activate(struct swap_info_struct *sis,
                                struct file *swap_file,
                                sector_t *span)
{
        struct address_space *mapping = swap_file->f_mapping;
        struct inode *inode = mapping->host;
        unsigned blocks_per_page;
        unsigned long page_no;
        unsigned blkbits;
        sector_t probe_block;
        sector_t last_block;
        sector_t lowest_block = -1;
        sector_t highest_block = 0;
        int nr_extents = 0;
        int ret;

        blkbits = inode->i_blkbits;
        blocks_per_page = PAGE_SIZE >> blkbits;

        /*
         * Map all the blocks into the extent list.  This code doesn't try
         * to be very smart.
         */
        probe_block = 0;
        page_no = 0;
        last_block = i_size_read(inode) >> blkbits;
        while ((probe_block + blocks_per_page) <= last_block &&
                        page_no < sis->max) {
                unsigned block_in_page;
                sector_t first_block;

                cond_resched();

                first_block = bmap(inode, probe_block);
                if (first_block == 0)
                        goto bad_bmap;

                /*
                 * It must be PAGE_SIZE aligned on-disk
                 */
                if (first_block & (blocks_per_page - 1)) {
                        probe_block++;
                        goto reprobe;
                }

                for (block_in_page = 1; block_in_page < blocks_per_page;
                                        block_in_page++) {
                        sector_t block;

                        block = bmap(inode, probe_block + block_in_page);
                        if (block == 0)
                                goto bad_bmap;
                        if (block != first_block + block_in_page) {
                                /* Discontiguity */
                                probe_block++;
                                goto reprobe;
                        }
                }

                first_block >>= (PAGE_SHIFT - blkbits);
                if (page_no) {  /* exclude the header page */
                        if (first_block < lowest_block)
                                lowest_block = first_block;
                        if (first_block > highest_block)
                                highest_block = first_block;
                }

                /*
                 * We found a PAGE_SIZE-length, PAGE_SIZE-aligned run of blocks
                 */
                ret = add_swap_extent(sis, page_no, 1, first_block);
                if (ret < 0)
                        goto out;
                nr_extents += ret;
                page_no++;
                probe_block += blocks_per_page;
reprobe:
                continue;
        }
        ret = nr_extents;
        *span = 1 + highest_block - lowest_block;
        if (page_no == 0)
                page_no = 1;    /* force Empty message */
        sis->max = page_no;
        sis->pages = page_no - 1;
        sis->highest_bit = page_no - 1;
out:
        return ret;
bad_bmap:
        pr_err("swapon: swapfile has holes\n");
        ret = -EINVAL;
        goto out;
}

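Activates a swap area backed by a generic swap file. On success it returns the number of extents added.

In code lines 17~18, computes how many file system blocks fit in one page and stores the result in blocks_per_page.
- The sector size is 512 bytes, but the block meant here is the logical block that the file system handles as its I/O unit. When the swap file lives on an ext2, ext3 or ext4 file system, logical block sizes of 1K, 2K, 4K and 8K are supported, with 4K as the default.
  - e.g. PAGE_SIZE(4096) >> inode->i_blkbits(12) = 1 block
In code lines 24~26, prepares to walk the file from its first page (page_no = 0) to its last page (sis->max). probe_block also starts at 0, and last_block is set to the block number at the end of the file.
In code lines 27~28, iterates while a full page worth of blocks starting at probe_block still fits before last_block and page_no is below sis->max.
- When the swap file lives on an ext2, ext3 or ext4 file system with the default block size, one page equals one block, so blocks_per_page is 1.
In code lines 34~36, looks up the disk block number that backs file block probe_block and stores it in first_block. If the lookup returns 0, the file has a hole and control jumps to the bad_bmap label.
In code lines 41~44, if the disk block number (first_block) is not aligned to blocks_per_page, the file block number (probe_block) is advanced by one and control jumps to the reprobe label. This repeats until an aligned block is found.
In code lines 46~58, when a page consists of two or more blocks, walks from the second block of the page to its last block and checks that each file block (probe_block + block_in_page) maps to disk block first_block + block_in_page, that is, that the page is contiguous on disk. If not, probe_block is advanced by one and control jumps to the reprobe label.
In code lines 60~66, converts first_block to page-sized units and, for every page except the header page, updates the smallest block seen (lowest_block) and the largest block seen (highest_block).
In code lines 71~74, maps the single page page_no of the swap file to the disk block number found (first_block). add_swap_extent() returns 1 when a new swap extent was actually added and 0 when the page was merged into an existing swap extent. The returned value is accumulated in nr_extents.
In code lines 75~76, moves on to the next page. probe_block must stay page aligned within the file, so it is advanced by blocks_per_page.
In code lines 77~79, this is the reprobe: label. The while loop continues.
In code line 80, the number of extents added is stored as the return value.
In code line 81, the output argument @span receives the number of blocks from the lowest block to the highest block.
In code lines 82~83, if page_no was never incremented, the file is empty. page_no is set to 1 so that the caller prints the “Empty swap-file” message.
In code lines 84~86, max, pages and highest_bit are updated.
In code lines 87~88, this is the out: label. The number of extents added is returned.
In code lines 89~92, this is the bad_bmap: label. The following error message is printed and -EINVAL is returned.
- “swapon: swapfile has holes\n”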
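
The per-page test performed inside the loop can be isolated in a small sketch. The array `bmap[]` is a hypothetical stand-in for the bmap() lookup (file block to disk block, with 0 meaning a hole), and both function names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Number of file system blocks per page: PAGE_SIZE >> blkbits. */
static unsigned ex_blocks_per_page(unsigned page_shift, unsigned blkbits)
{
        assert(blkbits <= page_shift);
        return (1u << page_shift) >> blkbits;
}

/*
 * bmap[file_block] gives the disk block, 0 for a hole.
 * Returns 1 if the page starting at file block 'probe' is backed by an
 * aligned, contiguous run of disk blocks, 0 if it must be re-probed
 * one block further, and -1 if a hole is found.
 */
static int ex_page_is_usable(const uint64_t *bmap, uint64_t probe,
                             unsigned blocks_per_page)
{
        uint64_t first = bmap[probe];
        unsigned b;

        /* blocks_per_page must be a power of two for the mask test. */
        assert(blocks_per_page && !(blocks_per_page & (blocks_per_page - 1)));

        if (first == 0)
                return -1;
        if (first & (blocks_per_page - 1))      /* not page aligned on disk */
                return 0;
        for (b = 1; b < blocks_per_page; b++) {
                if (bmap[probe + b] == 0)
                        return -1;
                if (bmap[probe + b] != first + b)       /* discontiguous */
                        return 0;
        }
        return 1;
}
```

With 1K blocks on 4K pages (blocks_per_page = 4), only runs that start on a disk block divisible by 4 and continue for four consecutive blocks are accepted. Everything else is skipped one block at a time.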

swap_extent structure

include/linux/swap.h

/*
 * A swap extent maps a range of a swapfile's PAGE_SIZE pages onto a range of
 * disk blocks.  A list of swap extents maps the entire swapfile.  (Where the
 * term `swapfile' refers to either a blockdevice or an IS_REG file.  Apart
 * from setup, they're handled identically.
 *
 * We always assume that blocks are of size PAGE_SIZE.
 */

struct swap_extent {
        struct list_head list;
        pgoff_t start_page;
        pgoff_t nr_pages;
        sector_t start_block;
};

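Used to map a range of the swap area onto a range of a block device.

list
- It is used in one of the following two ways.
  - The list of the swap_extent embedded in swap_info_struct is used as the list head.
  - The list of a swap_extent added to that head is used as the node that links it in.
start_page
- The first page of the swap area covered by this mapping.
nr_pages
- The number of pages mapped contiguously from start_page.
start_block
- The nr_pages pages starting at start_page in the swap area are mapped onto the block device starting at start_block.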
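
To make the three fields concrete, the sketch below translates a swap page offset into a block number by scanning a plain array of extents. `struct ex_extent` and `ex_map_page()` are hypothetical, and the kernel's linked list and cached search position are replaced by a linear scan. Blocks are counted in PAGE_SIZE units, as the structure comment above states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified extent: list linkage is replaced by array position. */
struct ex_extent {
        unsigned long start_page;
        unsigned long nr_pages;
        uint64_t start_block;
};

/*
 * Looks up 'page' in ext[0..n-1]. On success stores the block number
 * in *block and returns 0; returns -1 if no extent covers the page.
 */
static int ex_map_page(const struct ex_extent *ext, size_t n,
                       unsigned long page, uint64_t *block)
{
        size_t i;

        assert(block != NULL);

        for (i = 0; i < n; i++) {
                if (page >= ext[i].start_page &&
                    page - ext[i].start_page < ext[i].nr_pages) {
                        /* Same distance from the start on both sides. */
                        *block = ext[i].start_block +
                                 (page - ext[i].start_page);
                        return 0;
                }
        }
        return -1;
}
```

A page at distance d from start_page always lands on start_block + d, so a fully contiguous swap area needs only one extent.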

add_swap_extent()

mm/swapfile.c

/*
 * Add a block range (and the corresponding page range) into this swapdev's
 * extent list.  The extent list is kept sorted in page order.
 *
 * This function rather assumes that it is called in ascending page order.
 */

int
add_swap_extent(struct swap_info_struct *sis, unsigned long start_page,
                unsigned long nr_pages, sector_t start_block)
{
        struct swap_extent *se;
        struct swap_extent *new_se;
        struct list_head *lh;

        if (start_page == 0) {
                se = &sis->first_swap_extent;
                sis->curr_swap_extent = se;
                se->start_page = 0;
                se->nr_pages = nr_pages;
                se->start_block = start_block;
                return 1;
        } else {
                lh = sis->first_swap_extent.list.prev;  /* Highest extent */
                se = list_entry(lh, struct swap_extent, list);
                BUG_ON(se->start_page + se->nr_pages != start_page);
                if (se->start_block + se->nr_pages == start_block) {
                        /* Merge it */
                        se->nr_pages += nr_pages;
                        return 0;
                }
        }

        /*
         * No merge.  Insert a new extent, preserving ordering.
         */
        new_se = kmalloc(sizeof(*se), GFP_KERNEL);
        if (new_se == NULL)
                return -ENOMEM;
        new_se->start_page = start_page;
        new_se->nr_pages = nr_pages;
        new_se->start_block = start_block;

        list_add_tail(&new_se->list, &sis->first_swap_extent.list);
        return 1;
}
EXPORT_SYMBOL_GPL(add_swap_extent);

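Maps @nr_pages pages of the swap area starting at @start_page onto the block device starting at @start_block. Returns 1 if a new swap extent was allocated.

In code lines 9~15, the first mapping attempt for a swap area starts at page 0, and @nr_pages pages are mapped to @start_block. The swap_extent used for this mapping is first_swap_extent, which is embedded in the swap_info_struct structure. Although the embedded swap extent is used, 1 is returned to indicate that an extent was added.
- When a block device is used as the swap area, only one swap_extent is used, so this branch is taken exactly once for the single mapping.
In code lines 16~25, if the new range continues directly after the last extent on disk, it is merged into that existing mapping. Because it was merged, no swap extent is added and 0 is returned.
In code lines 30~38, if the block number is not contiguous with the mapping made just before, a new swap extent is allocated and the new mapping information is recorded in it. It is then added to sis->first_swap_extent.list and 1 is returned.

The following figure shows a swap file that is laid out contiguously, without fragmentation, on the mounted block device and therefore uses only one swap extent.

The following figure shows a swap file that is scattered (fragmented) into three pieces on the mounted block device and therefore uses three swap extents.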
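
The merge-or-append decision can be sketched with a fixed-size array in place of the kernel's linked list. This is a simplification under stated assumptions: `struct ex_mapping`, `struct ex_extent_list` and `ex_add_extent()` are hypothetical names, and the special handling of the embedded first extent is folded into the general case.

```c
#include <assert.h>
#include <stddef.h>

#define EX_MAX_EXTENTS 16

struct ex_mapping {
        unsigned long start_page;
        unsigned long nr_pages;
        unsigned long long start_block;
};

struct ex_extent_list {
        struct ex_mapping e[EX_MAX_EXTENTS];
        size_t count;
};

/*
 * Returns 1 if a new extent was stored, 0 if the range was merged into
 * the last extent, -1 if the call is out of page order or the list is full.
 */
static int ex_add_extent(struct ex_extent_list *l, unsigned long start_page,
                         unsigned long nr_pages,
                         unsigned long long start_block)
{
        assert(l != NULL && nr_pages > 0);

        if (l->count > 0) {
                struct ex_mapping *last = &l->e[l->count - 1];

                /* Callers must add pages in ascending, gap-free order. */
                if (last->start_page + last->nr_pages != start_page)
                        return -1;
                /* Contiguous on disk as well: extend the last extent. */
                if (last->start_block + last->nr_pages == start_block) {
                        last->nr_pages += nr_pages;
                        return 0;
                }
        }
        if (l->count == EX_MAX_EXTENTS)
                return -1;
        l->e[l->count].start_page = start_page;
        l->e[l->count].nr_pages = nr_pages;
        l->e[l->count].start_block = start_block;
        l->count++;
        return 1;
}
```

Summing the return values over all pages, as generic_swapfile_activate() does, yields the number of fragments of the swap file on disk.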

destroy_swap_extents()

mm/swapfile.c

/*
 * Free all of a swapdev's extent information
 */

static void destroy_swap_extents(struct swap_info_struct *sis)
{
        while (!list_empty(&sis->first_swap_extent.list)) {
                struct swap_extent *se;

                se = list_first_entry(&sis->first_swap_extent.list,
                                struct swap_extent, list);
                list_del(&se->list);
                kfree(se);
        }

        if (sis->flags & SWP_ACTIVATED) {
                struct file *swap_file = sis->swap_file;
                struct address_space *mapping = swap_file->f_mapping;

                sis->flags &= ~SWP_ACTIVATED;
                if (mapping->a_ops->swap_deactivate)
                        mapping->a_ops->swap_deactivate(swap_file);
        }
}

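Removes all swap extents from the list and frees them. If the swap area was activated through (*swap_activate), the SWP_ACTIVATED flag is cleared and the (*swap_deactivate) hook is called when the file system provides one.

address_space management for swap entries

Creation and destruction of the swap address_space

The address_space structures are created and destroyed by the swapon and swapoff commands.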

swapon -> sys_swapon() -> init_swap_address_space()
swapoff -> sys_swapoff() -> exit_swap_address_space()

init_swap_address_space()

mm/swap_state.c

int init_swap_address_space(unsigned int type, unsigned long nr_pages)
{
        struct address_space *spaces, *space;
        unsigned int i, nr;

        nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
        spaces = kvcalloc(nr, sizeof(struct address_space), GFP_KERNEL);
        if (!spaces)
                return -ENOMEM;
        for (i = 0; i < nr; i++) {
                space = spaces + i;
                xa_init_flags(&space->i_pages, XA_FLAGS_LOCK_IRQ);
                atomic_set(&space->i_mmap_writable, 0);
                space->a_ops = &swap_aops;
                /* swap cache doesn't use writeback related tags */
                mapping_set_no_writeback_tags(space);
        }
        nr_swapper_spaces[type] = nr;
        rcu_assign_pointer(swapper_spaces[type], spaces);

        return 0;
}

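Allocates and prepares the address_space structures for swap.

It is called from swapon. It allocates and initializes as many address_space structures as nr_pages divided by SWAP_ADDRESS_SPACE_PAGES(2^14), rounded up, and then stores the array in swapper_spaces[@type].
The swap cache of a swap area used to be managed with a single radix tree. To reduce lock contention and improve performance, it was split so that each address_space that manages the swap cache covers at most 64M. As a result, swapper_spaces is no longer used as a one-dimensional array swapper_spaces[] but as a two-dimensional array swapper_spaces[][].

The following figure shows how the array of address_space structures for swap is created at swapon time.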
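
The number of address_space structures allocated per swap area follows directly from the rounding described above. A minimal sketch, with hypothetical `ex_` names and the shift value taken from the definitions in this article:

```c
#include <assert.h>

#define EX_SWAP_ADDRESS_SPACE_SHIFT 14
#define EX_SWAP_ADDRESS_SPACE_PAGES (1ul << EX_SWAP_ADDRESS_SPACE_SHIFT)
#define EX_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Number of address_space structures needed for nr_pages swap pages. */
static unsigned long ex_nr_address_spaces(unsigned long nr_pages)
{
        assert(nr_pages > 0);
        return EX_DIV_ROUND_UP(nr_pages, EX_SWAP_ADDRESS_SPACE_PAGES);
}

/* Bytes of swap covered by one address_space for a given page shift. */
static unsigned long long ex_bytes_per_space(unsigned page_shift)
{
        return (unsigned long long)EX_SWAP_ADDRESS_SPACE_PAGES << page_shift;
}
```

Assuming 4K pages (page shift 12), one address_space covers 64M, so a 1 GiB swap area of 262144 pages gets 16 of them.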

exit_swap_address_space()

mm/swap_state.c

void exit_swap_address_space(unsigned int type)
{
        struct address_space *spaces;

        spaces = swapper_spaces[type];
        nr_swapper_spaces[type] = 0;
        rcu_assign_pointer(swapper_spaces[type], NULL);
        synchronize_rcu();
        kvfree(spaces);
}

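It is called from swapoff and frees the address_space array stored in swapper_spaces[@type]. It waits in synchronize_rcu() before freeing so that readers still using the old pointer can finish.

Finding the address_space from a swap entry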

swap_address_space()

include/linux/swap.h

#define swap_address_space(entry)                           \
        (&swapper_spaces[swp_type(entry)][swp_offset(entry) \
                >> SWAP_ADDRESS_SPACE_SHIFT])

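Returns a pointer to the address_space that corresponds to the swap entry.

The type of the swap entry selects the per-area array, and its offset shifted right by SWAP_ADDRESS_SPACE_SHIFT selects the address_space within that array.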

/* linux/mm/swap_state.c */
/* One swap address space for each 64M swap space */

#define SWAP_ADDRESS_SPACE_SHIFT        14
#define SWAP_ADDRESS_SPACE_PAGES        (1 << SWAP_ADDRESS_SPACE_SHIFT)

One address_space manages 64M of swap (2^14 pages, assuming 4K pages).

The following figure shows how the swap_address_space() function finds the address_space from a swap entry.
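
The index computation of the macro can be tried out with a simplified entry type. In this sketch `struct ex_swp_entry` carries the type and offset as separate fields, which is an assumption made for readability. The real swp_entry_t packs both into one word in an architecture-specific way. `EX_MAX_SWAPFILES` is likewise a placeholder value.

```c
#include <assert.h>

#define EX_SWAP_ADDRESS_SPACE_SHIFT 14
#define EX_MAX_SWAPFILES 32     /* placeholder upper bound */

/* Hypothetical unpacked swap entry. */
struct ex_swp_entry {
        unsigned type;
        unsigned long offset;
};

/* Position of an address_space in the two-level swapper_spaces layout. */
struct ex_space_ref {
        unsigned type;          /* first index: swap area */
        unsigned long index;    /* second index: 64M chunk in the area */
};

static struct ex_space_ref ex_swap_address_space(struct ex_swp_entry entry)
{
        struct ex_space_ref ref;

        assert(entry.type < EX_MAX_SWAPFILES);
        ref.type = entry.type;
        ref.index = entry.offset >> EX_SWAP_ADDRESS_SPACE_SHIFT;
        return ref;
}
```

Offsets 0 through 16383 share the first address_space of their swap area, and offset 16384 is the first one handled by the second.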

swapper_spaces[]

mm/swap_state.c

struct address_space *swapper_spaces[MAX_SWAPFILES] __read_mostly;

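The array has MAX_SWAPFILES entries (27~32 depending on kernel options, ARM64 default=29), one per swap file, and each entry points to an array of address_space structures.

This array is created statically as a one-dimensional array, but in practice it is used as a two-dimensional array as follows.
- swapper_spaces[type][offset >> SWAP_ADDRESS_SPACE_SHIFT]

The following figure shows the xarray managed by the swap cache's address_space and the swap operations.

Structures

swap_info_struct structure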

/*
 * The in-memory structure used to track swap areas.
 */

struct swap_info_struct {
        unsigned long   flags;          /* SWP_USED etc: see above */
        signed short    prio;           /* swap priority of this type */
        struct plist_node list;         /* entry in swap_active_head */
        signed char     type;           /* strange name for an index */
        unsigned int    max;            /* extent of the swap_map */
        unsigned char *swap_map;        /* vmalloc'ed array of usage counts */
        struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
        struct swap_cluster_list free_clusters; /* free clusters list */
        unsigned int lowest_bit;        /* index of first free in swap_map */
        unsigned int highest_bit;       /* index of last free in swap_map */
        unsigned int pages;             /* total of usable pages of swap */
        unsigned int inuse_pages;       /* number of those currently in use */
        unsigned int cluster_next;      /* likely index for next allocation */
        unsigned int cluster_nr;        /* countdown to next cluster search */
        struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
        struct swap_extent *curr_swap_extent;
        struct swap_extent first_swap_extent;
        struct block_device *bdev;      /* swap device or bdev of swap file */
        struct file *swap_file;         /* seldom referenced */
        unsigned int old_block_size;    /* seldom referenced */
#ifdef CONFIG_FRONTSWAP
        unsigned long *frontswap_map;   /* frontswap in-use, one bit per page */
        atomic_t frontswap_pages;       /* frontswap pages in-use counter */
#endif
        spinlock_t lock;                /*
                                         * protect map scan related fields like
                                         * swap_map, lowest_bit, highest_bit,
                                         * inuse_pages, cluster_next,
                                         * cluster_nr, lowest_alloc,
                                         * highest_alloc, free/discard cluster
                                         * list. other fields are only changed
                                         * at swapon/swapoff, so are protected
                                         * by swap_lock. changing flags need
                                         * hold this lock and swap_lock. If
                                         * both locks need hold, hold swap_lock
                                         * first.
                                         */
        spinlock_t cont_lock;           /*
                                         * protect swap count continuation page
                                         * list.
                                         */
        struct work_struct discard_work; /* discard worker */
        struct swap_cluster_list discard_clusters; /* discard clusters list */
        struct plist_node avail_lists[0]; /*
                                           * entries in swap_avail_heads, one
                                           * entry per node.
                                           * Must be last as the number of the
                                           * array is nr_node_ids, which is not
                                           * a fixed value so have to allocate
                                           * dynamically.
                                           * And it has to be an array so that
                                           * plist_for_each_* can work.
                                           */
};

One instance is used for each swap area.

flags
- Flag values used for the swap area. (See below.)
prio
- Priority value that determines the order in which swap areas are used.
- Default values start at -2 and decrease in the order the areas are created.
- A positive value can be used when the user specifies the priority.
list
- Node used in the swap_active_head list.
type
- Index number of the swap area. (0~)
max
- Total number of pages in the swap area (including bad pages).
*swap_map
- Swap map of the swap area; it uses one byte per page.
- e.g. 0: free, 1~0x3e: in use (usage counter), 0x3f: bad
*cluster_info
- Points to all clusters used in the swap area.
free_clusters
- List holding the available free clusters.
lowest_bit
- Offset of the lowest free page in the swap area.
highest_bit
- Offset of the highest free page in the swap area.
pages
- Total number of usable pages in the swap area (excluding bad pages).
inuse_pages
- Number of pages allocated and in use in the swap area.
cluster_next
- Page offset that will most likely be used for the next allocation.
cluster_nr
- Countdown value used when searching for the next cluster.
*percpu_cluster
- Points to the cluster currently assigned to each cpu.
- If no cluster is assigned to a cpu, the value for that cpu is null.
*curr_swap_extent
- Points to the swap extent currently in use.
first_swap_extent
- The first swap extent, embedded in the swap area.
*bdev
- Block device assigned to the swap area, or the block device of the swap file.
*swap_file
- Points to the swap file.
old_block_size
- Block size of the swap block device, saved at swapon.
*frontswap_map
- Bitmap of the frontswap in-use state; one bit corresponds to one page.
frontswap_pages
- In-use counter of frontswap.
lock
- Lock for the swap area.
cont_lock
- When a usage counter in swap_map[] exceeds 0x3e, the map of the swap area is managed in swap count continuation mode. This lock is taken when accessing the swap count continuation page list used in that mode.
discard_work
- Work item used on SSDs that support discard.
discard_clusters
- List of the clusters to discard.
- After the discard completes, the clusters are moved to the free cluster list.
avail_lists[0]
- Used as the nodes of the per-node swap_avail_heads[] lists.
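
The swap_map encoding described above can be sketched as a small userspace helper. This is only an illustration: the three constants are the values quoted in the list, while `enum slot_state` and `classify_slot()` are hypothetical names that do not exist in the kernel. Values above 0x3f carry extra bits that this section does not describe, so the sketch reports them as "other".

```c
/* Values quoted in the member list above. */
#define SWAP_MAP_FREE 0x00
#define SWAP_MAP_MAX  0x3e   /* largest usage count stored directly in the byte */
#define SWAP_MAP_BAD  0x3f   /* slot marks a bad page */

/* Hypothetical classification used only for this illustration. */
enum slot_state { SLOT_FREE, SLOT_IN_USE, SLOT_BAD, SLOT_OTHER };

/* Classify one swap_map[] byte according to the ranges listed above. */
static enum slot_state classify_slot(unsigned char v)
{
        if (v == SWAP_MAP_FREE)
                return SLOT_FREE;
        if (v <= SWAP_MAP_MAX)
                return SLOT_IN_USE;
        if (v == SWAP_MAP_BAD)
                return SLOT_BAD;
        return SLOT_OTHER;      /* higher bits are not covered in this section */
}
```

For example, `classify_slot(0x3e)` is still an ordinary in-use slot, and one more reference is the point where the count continuation mode described for cont_lock comes into play.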

The values used in the flags member are as follows.

enum {
        SWP_USED        = (1 << 0),     /* is slot in swap_info[] used? */
        SWP_WRITEOK     = (1 << 1),     /* ok to write to this swap?    */
        SWP_DISCARDABLE = (1 << 2),     /* blkdev support discard */
        SWP_DISCARDING  = (1 << 3),     /* now discarding a free cluster */
        SWP_SOLIDSTATE  = (1 << 4),     /* blkdev seeks are cheap */
        SWP_CONTINUED   = (1 << 5),     /* swap_map has count continuation */
        SWP_BLKDEV      = (1 << 6),     /* its a block device */
        SWP_ACTIVATED   = (1 << 7),     /* set after swap_activate success */
        SWP_FS          = (1 << 8),     /* swap file goes through fs */
        SWP_AREA_DISCARD = (1 << 9),    /* single-time swap area discards */
        SWP_PAGE_DISCARD = (1 << 10),   /* freed swap page-cluster discards */
        SWP_STABLE_WRITES = (1 << 11),  /* no overwrite PG_writeback pages */
        SWP_SYNCHRONOUS_IO = (1 << 12), /* synchronous IO is efficient */
                                        /* add others here before... */
        SWP_SCANNING    = (1 << 13),    /* refcount in scan_swap_map */
};

SWP_USED
- The swap area has been enabled with swapon and is in use.
SWP_WRITEOK
- The swap area is writable.
SWP_DISCARDABLE
- The swap area supports discard. (SSD)
SWP_DISCARDING
- A discard operation is in progress on the swap area.
SWP_SOLIDSTATE
- The swap area is on an SSD.
SWP_CONTINUED
- A usage counter in the swap area has exceeded 0x3e.
SWP_BLKDEV
- The swap area is a block device.
SWP_ACTIVATED
- Set when the swap_activate hook of the swap area succeeded.
SWP_FS
- The swap file goes through a file system.
SWP_AREA_DISCARD
- Supports a single-time discard of the swap area.
SWP_PAGE_DISCARD
- Discard can be issued freely in units of clusters.
SWP_STABLE_WRITES
- Assigned to integrity-protected storage that uses checksums and the like.
- SCSI Data Integrity Block Device
SWP_SYNCHRONOUS_IO
- The swap area can be written synchronously in an efficient way. (RAM, etc.)
SWP_SCANNING
- Set while the swap area is being scanned.
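
As a reading aid, the flag bits above can be turned into a readable string in userspace. The sketch below copies the enum values from the listing; the table `swp_names` and the function `swp_flags_to_string()` are hypothetical helpers written for this illustration and are not kernel APIs.

```c
#include <stddef.h>
#include <stdio.h>

/* Bit values copied from the enum shown above. */
enum {
        SWP_USED = 1 << 0, SWP_WRITEOK = 1 << 1, SWP_DISCARDABLE = 1 << 2,
        SWP_DISCARDING = 1 << 3, SWP_SOLIDSTATE = 1 << 4, SWP_CONTINUED = 1 << 5,
        SWP_BLKDEV = 1 << 6, SWP_ACTIVATED = 1 << 7, SWP_FS = 1 << 8,
        SWP_AREA_DISCARD = 1 << 9, SWP_PAGE_DISCARD = 1 << 10,
        SWP_STABLE_WRITES = 1 << 11, SWP_SYNCHRONOUS_IO = 1 << 12,
        SWP_SCANNING = 1 << 13,
};

/* Hypothetical name table, used only by this example. */
static const struct { unsigned long bit; const char *name; } swp_names[] = {
        { SWP_USED, "SWP_USED" }, { SWP_WRITEOK, "SWP_WRITEOK" },
        { SWP_DISCARDABLE, "SWP_DISCARDABLE" }, { SWP_DISCARDING, "SWP_DISCARDING" },
        { SWP_SOLIDSTATE, "SWP_SOLIDSTATE" }, { SWP_CONTINUED, "SWP_CONTINUED" },
        { SWP_BLKDEV, "SWP_BLKDEV" }, { SWP_ACTIVATED, "SWP_ACTIVATED" },
        { SWP_FS, "SWP_FS" }, { SWP_AREA_DISCARD, "SWP_AREA_DISCARD" },
        { SWP_PAGE_DISCARD, "SWP_PAGE_DISCARD" },
        { SWP_STABLE_WRITES, "SWP_STABLE_WRITES" },
        { SWP_SYNCHRONOUS_IO, "SWP_SYNCHRONOUS_IO" },
        { SWP_SCANNING, "SWP_SCANNING" },
};

/* Write the names of all set bits into buf, separated by '|'. Returns buf. */
static char *swp_flags_to_string(unsigned long flags, char *buf, size_t len)
{
        size_t used = 0;
        size_t i;

        if (len == 0)
                return buf;
        buf[0] = '\0';
        for (i = 0; i < sizeof(swp_names) / sizeof(swp_names[0]); i++) {
                int n;

                if (!(flags & swp_names[i].bit))
                        continue;
                n = snprintf(buf + used, len - used, "%s%s",
                             used ? "|" : "", swp_names[i].name);
                if (n < 0 || (size_t)n >= len - used)
                        break;  /* buffer too small: stop with a truncated string */
                used += (size_t)n;
        }
        return buf;
}
```

A flags value of 0x13, for instance, has bits 0, 1 and 4 set, which corresponds to an active, writable swap area on an SSD.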

References

Memory Resource Controller (2009) | Kame (PDF)
[Linux] 블록 장치 I/O 동작 방식 (1) | F/OSS
[Linux] 블록 장치 I/O 동작 방식 (2) | F/OSS
[Linux] 블록 장치 I/O 동작 방식 (3) | F/OSS
SWAP 관리 | DrakeOh
Frontswap (2012) | Kernel.org
Cleancache (2011) | Kernel.org
Automatically bind swap device to numa node | Kernel.org
zsmalloc | Kernel.org
zswap | Kernel.org
Transcendent memory (2009) | LWN.net
Transcendent memory in a nutshell (2011) | LWN.net
Transcendent Memory and Friends (2011) | Oracle (PDF)
우분투에서 RAM이 부족합니까? ZRAM 사용 | Eddy lab

VMPressure

2019-09-17, 2022-04-26 문영일


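Through the Memory Control Group (memcg), the ratio of scanned pages to reclaimed pages is analyzed to compute a memory pressure value, and the vmpressure listeners registered with the memcg can be notified with one of three event levels selected by thresholds. The vmpressure listeners receive these events through an eventfd.

Event levels

low
- Memory pressure on the memcg is light.
medium
- Memory pressure on the memcg is heavy.
critical
- Memory pressure on the memcg is severe and the OOM killer is about to run.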

vmpressure_win

mm/vmpressure.c

/*
 * The window size (vmpressure_win) is the number of scanned pages before
 * we try to analyze scanned/reclaimed ratio. So the window is used as a
 * rate-limit tunable for the "low" level notification, and also for
 * averaging the ratio for medium/critical levels. Using small window
 * sizes can cause lot of false positives, but too big window size will
 * delay the notifications.
 *
 * As the vmscan reclaimer logic works with chunks which are multiple of
 * SWAP_CLUSTER_MAX, it makes sense to use it for the window size as well.
 *
 * TODO: Make the window size depend on machine size, as we do for vmstat
 * thresholds. Currently we set it to 512 pages (2MB for 4KB pages).
 */

static const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;

It is set to SWAP_CLUSTER_MAX(32) * 16 = 512 pages.
This window size is the number of scanned pages to accumulate before the scanned/reclaimed ratio is analyzed.
It is used for the low level notification and also for averaging the ratio for the medium/critical levels.

vmpressure_level_med & vmpressure_level_critical

mm/vmpressure.c

/*
 * These thresholds are used when we account memory pressure through
 * scanned/reclaimed ratio. The current values were chosen empirically. In
 * essence, they are percents: the higher the value, the more number
 * unsuccessful reclaims there were.
 */

static const unsigned int vmpressure_level_med = 60;
static const unsigned int vmpressure_level_critical = 95;

vmpressure_level_med
- Threshold of the medium level, used when memory pressure is accounted through the scanned/reclaimed ratio.
vmpressure_level_critical
- Threshold of the critical level, used when memory pressure is accounted through the scanned/reclaimed ratio.

vmpressure_prio()

mm/vmpressure.c

/**
 * vmpressure_prio() - Account memory pressure through reclaimer priority level
 * @gfp:        reclaimer's gfp mask
 * @memcg:      cgroup memory controller handle
 * @prio:       reclaimer's priority
 *
 * This function should be called from the reclaim path every time when
 * the vmscan's reclaiming priority (scanning depth) changes.
 *
 * This function does not return any value.
 */

void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
{
        /*
         * We only use prio for accounting critical level. For more info
         * see comment for vmpressure_level_critical_prio variable above.
         */
        if (prio > vmpressure_level_critical_prio)
                return;

        /*
         * OK, the prio is below the threshold, updating vmpressure
         * information before shrinker dives into long shrinking of long
         * range vmscan. Passing scanned = vmpressure_win, reclaimed = 0
         * to the vmpressure() basically means that we signal 'critical'
         * level.
         */
        vmpressure(gfp, memcg, true, vmpressure_win, 0);
}

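Updates the vmpressure information when the reclaim priority rises and the scan depth becomes deeper.

In lines 7~8 of the code, the function returns if the requested priority is lower than vmpressure_level_critical_prio(3), that is, if prio is greater than 3.
- The lower the prio value, the higher the priority.
In line 17 of the code, when prio has dropped to the threshold or below, that is, when the priority has risen, the vmpressure information is updated before the shrinker starts a long scan.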

vmpressure()

mm/vmpressure.c

/**
 * vmpressure() - Account memory pressure through scanned/reclaimed ratio
 * @gfp:        reclaimer's gfp mask
 * @memcg:      cgroup memory controller handle
 * @tree:       legacy subtree mode
 * @scanned:    number of pages scanned
 * @reclaimed:  number of pages reclaimed
 *
 * This function should be called from the vmscan reclaim path to account
 * "instantaneous" memory pressure (scanned/reclaimed ratio). The raw
 * pressure index is then further refined and averaged over time.
 *
 * If @tree is set, vmpressure is in traditional userspace reporting
 * mode: @memcg is considered the pressure root and userspace is
 * notified of the entire subtree's reclaim efficiency.
 *
 * If @tree is not set, reclaim efficiency is recorded for @memcg, and
 * only in-kernel users are notified.
 *
 * This function does not return any value.
 */

void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
                unsigned long scanned, unsigned long reclaimed)
{
        struct vmpressure *vmpr = memcg_to_vmpressure(memcg);

        /*
         * Here we only want to account pressure that userland is able to
         * help us with. For example, suppose that DMA zone is under
         * pressure; if we notify userland about that kind of pressure,
         * then it will be mostly a waste as it will trigger unnecessary
         * freeing of memory by userland (since userland is more likely to
         * have HIGHMEM/MOVABLE pages instead of the DMA fallback). That
         * is why we include only movable, highmem and FS/IO pages.
         * Indirect reclaim (kswapd) sets sc->gfp_mask to GFP_KERNEL, so
         * we account it too.
         */
        if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
                return;

        /*
         * If we got here with no pages scanned, then that is an indicator
         * that reclaimer was unable to find any shrinkable LRUs at the
         * current scanning depth. But it does not mean that we should
         * report the critical pressure, yet. If the scanning priority
         * (scanning depth) goes too high (deep), we will be notified
         * through vmpressure_prio(). But so far, keep calm.
         */
        if (!scanned)
                return;

        if (tree) {
                spin_lock(&vmpr->sr_lock);
                scanned = vmpr->tree_scanned += scanned;
                vmpr->tree_reclaimed += reclaimed;
                spin_unlock(&vmpr->sr_lock);

                if (scanned < vmpressure_win)
                        return;
                schedule_work(&vmpr->work);
        } else {
                enum vmpressure_levels level;

                /* For now, no users for root-level efficiency */
                if (!memcg || memcg == root_mem_cgroup)
                        return;

                spin_lock(&vmpr->sr_lock);
                scanned = vmpr->scanned += scanned;
                reclaimed = vmpr->reclaimed += reclaimed;
                if (scanned < vmpressure_win) {
                        spin_unlock(&vmpr->sr_lock);
                        return;
                }
                vmpr->scanned = vmpr->reclaimed = 0;
                spin_unlock(&vmpr->sr_lock);

                level = vmpressure_calc_level(scanned, reclaimed);

                if (level > VMPRESSURE_LOW) {
                        /*
                         * Let the socket buffer allocator know that
                         * we are having trouble reclaiming LRU pages.
                         *
                         * For hysteresis keep the pressure state
                         * asserted for a second in which subsequent
                         * pressure events can occur.
                         */
                        memcg->socket_pressure = jiffies + HZ;
                }
        }
}

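Accounts memory pressure through the scanned/reclaimed ratio.

In line 4 of the code, gets the vmpressure information of the requested memcg.
In lines 17~18 of the code, no pressure is accounted if none of the highmem, movable, FS and IO flags is requested.
In lines 28~29 of the code, the function returns if the scanned argument is 0.
In lines 31~39 of the code, pressure is accounted in the legacy tree mode. tree_scanned and tree_reclaimed are increased by the given amounts and the work registered in vmpr->work is scheduled. If the accumulated tree_scanned is still less than vmpressure_win, the function returns without scheduling the work.
- vmpressure_work_fn()
In lines 40~45 of the code, when @tree is 0, the reclaim efficiency is recorded for @memcg so that in-kernel users can be notified. If memcg is not specified or is the root memcg, the function returns.
In lines 47~55 of the code, scanned and reclaimed are increased by the given amounts, and the function returns if the accumulated scanned is less than vmpressure_win. Otherwise, scanned and reclaimed of vmpr are reset to 0.
In lines 57~69 of the code, if the computed vmpressure level exceeds VMPRESSURE_LOW, socket_pressure of the memcg is set to the tick value one second after the current time.
- The mem_cgroup_under_socket_pressure() function uses this value.
- See: mm: memcontrol: hook up vmpressure to socket pressure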
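
The window logic of the non-tree branch can be modelled in userspace as follows. This is a simplified sketch without locking: `struct window` and `window_account()` are hypothetical names, and the constant 512 is the SWAP_CLUSTER_MAX(32) * 16 value of vmpressure_win quoted earlier.

```c
#include <stdbool.h>

#define VMPRESSURE_WIN 512UL    /* SWAP_CLUSTER_MAX(32) * 16 */

/* Hypothetical stand-in for the scanned/reclaimed counters kept in vmpr. */
struct window {
        unsigned long scanned;
        unsigned long reclaimed;
};

/*
 * Accumulate one reclaim report. Returns true once a full window has been
 * collected; the totals are then handed back and the counters are reset,
 * mirroring the accumulate, compare, reset order of the kernel code above.
 */
static bool window_account(struct window *w,
                           unsigned long scanned, unsigned long reclaimed,
                           unsigned long *out_scanned,
                           unsigned long *out_reclaimed)
{
        if (!scanned)
                return false;           /* nothing scanned: keep calm */

        w->scanned += scanned;
        w->reclaimed += reclaimed;
        if (w->scanned < VMPRESSURE_WIN)
                return false;           /* window not full yet */

        *out_scanned = w->scanned;
        *out_reclaimed = w->reclaimed;
        w->scanned = w->reclaimed = 0;
        return true;
}
```

Because the level is computed only when the window fills, small reclaim batches do not each trigger a level calculation, which is the rate-limiting effect the vmpressure_win comment describes.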

The following figure shows how the vmpressure() function is processed.

Event notification according to vmpressure in the workqueue

vmpressure_work_fn()

mm/vmpressure.c

static void vmpressure_work_fn(struct work_struct *work)
{
        struct vmpressure *vmpr = work_to_vmpressure(work);
        unsigned long scanned;
        unsigned long reclaimed;
        enum vmpressure_levels level;
        bool ancestor = false;
        bool signalled = false;

        spin_lock(&vmpr->sr_lock);
        /*
         * Several contexts might be calling vmpressure(), so it is
         * possible that the work was rescheduled again before the old
         * work context cleared the counters. In that case we will run
         * just after the old work returns, but then scanned might be zero
         * here. No need for any locks here since we don't care if
         * vmpr->reclaimed is in sync.
         */
        scanned = vmpr->tree_scanned;
        if (!scanned) {
                spin_unlock(&vmpr->sr_lock);
                return;
        }

        reclaimed = vmpr->tree_reclaimed;
        vmpr->tree_scanned = 0;
        vmpr->tree_reclaimed = 0;
        spin_unlock(&vmpr->sr_lock);

        level = vmpressure_calc_level(scanned, reclaimed);

        do {
                if (vmpressure_event(vmpr, level, ancestor, signalled))
                        signalled = true;
                ancestor = true;
        } while ((vmpr = vmpressure_parent(vmpr)));
}

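Computes the memory pressure level and sends an event to the vmpressure listeners that satisfy the level and mode conditions.

In lines 10~28 of the code, reads the tree_scanned and tree_reclaimed values and resets them.
In line 30 of the code, computes the level from the scanned and reclaimed values.
In lines 32~36 of the code, walks the vmpressure of the hierarchically organized memcgs up to the top-level root and notifies the vmpressure listeners that satisfy the conditions.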

memcg_to_vmpressure()

mm/memcontrol.c

/* Some nice accessors for the vmpressure. */
struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg)
{
        if (!memcg)
                memcg = root_mem_cgroup;
        return &memcg->vmpressure;   
}

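Returns the vmpressure information of the requested memcg. If no memcg is specified, the vmpressure of the root memcg is returned.

The following figure shows how events are sent to those registered vmpressure listeners that match the conditions.

vmpressure event notification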

vmpressure_event()

mm/vmpressure.c

static bool vmpressure_event(struct vmpressure *vmpr,
                             const enum vmpressure_levels level,
                             bool ancestor, bool signalled)
{
        struct vmpressure_event *ev;
        bool ret = false;

        mutex_lock(&vmpr->events_lock);
        list_for_each_entry(ev, &vmpr->events, node) {
                if (ancestor && ev->mode == VMPRESSURE_LOCAL)
                        continue;
                if (signalled && ev->mode == VMPRESSURE_NO_PASSTHROUGH)
                        continue;
                if (level < ev->level)
                        continue;
                eventfd_signal(ev->efd, 1);
                ret = true;
        }
        mutex_unlock(&vmpr->events_lock);

        return ret;
}

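Among the events registered with the vmpressure, sends an eventfd signal to the vmpressure listener applications that registered at or below the requested @level.

The following cases are excluded from notification.
- When @ancestor=1, listeners in local mode are excluded.
- When @signalled=1, listeners in no_passthrough mode are excluded.
- Listeners registered with a level higher than @level are excluded.

The following figure shows the conditions under which events are notified to the vmpressure listeners registered with a memcg.

vmpressure level calculation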
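
The three exclusion rules can be written as one predicate. The sketch below is a userspace illustration: the enum names mirror those used in the kernel code above but are declared locally here, and `listener_matches()` is a hypothetical helper, not a kernel function.

```c
#include <stdbool.h>

/* Declared locally for the example; ordering follows low < medium < critical. */
enum vmpressure_levels { VMPRESSURE_LOW = 0, VMPRESSURE_MEDIUM, VMPRESSURE_CRITICAL };
enum vmpressure_modes { VMPRESSURE_NO_PASSTHROUGH = 0, VMPRESSURE_HIERARCHY, VMPRESSURE_LOCAL };

/*
 * Returns true if a listener registered with (ev_level, ev_mode) should be
 * signalled for an event of the given level.
 *   ancestor:  the event is being delivered to a parent of the memcg under pressure
 *   signalled: a listener further down the hierarchy was already signalled
 */
static bool listener_matches(enum vmpressure_levels ev_level,
                             enum vmpressure_modes ev_mode,
                             enum vmpressure_levels level,
                             bool ancestor, bool signalled)
{
        if (ancestor && ev_mode == VMPRESSURE_LOCAL)
                return false;   /* local listeners ignore events from children */
        if (signalled && ev_mode == VMPRESSURE_NO_PASSTHROUGH)
                return false;   /* already handled lower in the hierarchy */
        if (level < ev_level)
                return false;   /* event is below the registered level */
        return true;
}
```

A listener registered for medium in hierarchy mode therefore also receives critical events, including those that originate in child memcgs.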

vmpressure_calc_level()

mm/vmpressure.c

static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
                                                    unsigned long reclaimed)
{
        unsigned long scale = scanned + reclaimed;
        unsigned long pressure = 0;

        /*
         * reclaimed can be greater than scanned for things such as reclaimed
         * slab pages. shrink_node() just adds reclaimed pages without a
         * related increment to scanned pages.
         */
        if (reclaimed >= scanned)
                goto out;
        /*
         * We calculate the ratio (in percents) of how many pages were
         * scanned vs. reclaimed in a given time frame (window). Note that
         * time is in VM reclaimer's "ticks", i.e. number of pages
         * scanned. This makes it possible to set desired reaction time
         * and serves as a ratelimit.
         */
        pressure = scale - (reclaimed * scale / scanned);
        pressure = pressure * 100 / scale;

out:
        pr_debug("%s: %3lu  (s: %lu  r: %lu)\n", __func__, pressure,
                 scanned, reclaimed);

        return vmpressure_level(pressure);
}

Computes the pressure value from the scanned/reclaimed ratio and returns the corresponding level.

The following examples show the pressure value and level for several scanned and reclaimed page counts. All divisions are unsigned integer divisions, so each intermediate result is truncated.

scanned=5, reclaimed=0
- pressure=100%, level=critical
scanned=5, reclaimed=1
- pressure=83%, level=medium
scanned=5, reclaimed=2
- pressure=71%, level=medium
scanned=5, reclaimed=3
- pressure=50%, level=low
scanned=5, reclaimed=4
- pressure=22%, level=low
scanned=5, reclaimed=5
- pressure=0%, level=low

The following figure shows how the pressure value is computed from the scanned/reclaimed ratio and how the level is determined from it.
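
To check individual values, the arithmetic of vmpressure_calc_level() can be copied into a userspace function. The sketch below keeps the same unsigned long integer divisions as the kernel code shown above; `calc_pressure()` and `level_name()` are hypothetical names for this illustration, and the thresholds 60 and 95 are the vmpressure_level_med and vmpressure_level_critical defaults quoted earlier.

```c
/* Same arithmetic as vmpressure_calc_level(), without the debug print. */
static unsigned long calc_pressure(unsigned long scanned,
                                   unsigned long reclaimed)
{
        unsigned long scale = scanned + reclaimed;
        unsigned long pressure;

        /* Also covers scanned == 0, so the divisions below are safe. */
        if (reclaimed >= scanned)
                return 0;

        pressure = scale - (reclaimed * scale / scanned);  /* truncating division */
        return pressure * 100 / scale;                     /* truncating division */
}

/* Map a pressure percentage to a level name using the default thresholds. */
static const char *level_name(unsigned long pressure)
{
        if (pressure >= 95)
                return "critical";
        if (pressure >= 60)
                return "medium";
        return "low";
}
```

Calling `level_name(calc_pressure(scanned, reclaimed))` for a few pairs makes the effect of the truncation visible: the intermediate term `reclaimed * scale / scanned` is rounded down, which pushes the resulting pressure slightly up.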

vmpressure_level()

mm/vmpressure.c

static enum vmpressure_levels vmpressure_level(unsigned long pressure)
{
        if (pressure >= vmpressure_level_critical)
                return VMPRESSURE_CRITICAL;
        else if (pressure >= vmpressure_level_med)
                return VMPRESSURE_MEDIUM;
        return VMPRESSURE_LOW;
}

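Returns the level corresponding to @pressure.

critical
- 95% or higher by default
med
- 60% or higher by default
low
- everything else

Event receiver demo program

cgroup_event_listener

Running make in tools/cgroup builds the following source and produces the cgroup_event_listener binary.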

tools/cgroup/cgroup_event_listener.c

#include <assert.h>
#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <libgen.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#include <sys/eventfd.h>

#define USAGE_STR "Usage: cgroup_event_listener <path-to-control-file> <args>"

int main(int argc, char **argv)
{
        int efd = -1;
        int cfd = -1;
        int event_control = -1;
        char event_control_path[PATH_MAX];
        char line[LINE_MAX];
        int ret;

        if (argc != 3)
                errx(1, "%s", USAGE_STR);

        cfd = open(argv[1], O_RDONLY);
        if (cfd == -1)
                err(1, "Cannot open %s", argv[1]);

        ret = snprintf(event_control_path, PATH_MAX, "%s/cgroup.event_control",
                        dirname(argv[1]));
        if (ret >= PATH_MAX)
                errx(1, "Path to cgroup.event_control is too long");

        event_control = open(event_control_path, O_WRONLY);
        if (event_control == -1)
                err(1, "Cannot open %s", event_control_path);

        efd = eventfd(0, 0);
        if (efd == -1)
                err(1, "eventfd() failed");

        ret = snprintf(line, LINE_MAX, "%d %d %s", efd, cfd, argv[2]);
        if (ret >= LINE_MAX)
                errx(1, "Arguments string is too long");

        ret = write(event_control, line, strlen(line) + 1);
        if (ret == -1)
                err(1, "Cannot write to cgroup.event_control");

        while (1) {
                uint64_t result;

                ret = read(efd, &result, sizeof(result));
                if (ret == -1) {
                        if (errno == EINTR)
                                continue;
                        err(1, "Cannot read from eventfd");
                }
                assert(ret == sizeof(result));

                ret = access(event_control_path, W_OK);
                if ((ret == -1) && (errno == ENOENT)) {
                        puts("The cgroup seems to have removed.");
                        break;
                }

                if (ret == -1)
                        err(1, "cgroup.event_control is not accessible any more");

                printf("%s %s: crossed\n", argv[1], argv[2]);
        }

        return 0;
}

Usage

The following commands register a listener that receives an event when the pressure level reaches medium.

See: memcg: Add memory.pressure_level events | LWN.net

# cd /sys/fs/cgroup/memory/
# mkdir foo
# cd foo
# cgroup_event_listener memory.pressure_level medium &
# echo 8000000 > memory.limit_in_bytes
# echo 8000000 > memory.memsw.limit_in_bytes
# echo $$ > tasks
# dd if=/dev/zero | read x

PSI (Pressure Stall Information)

PSI was introduced in kernel v4.20-rc1 in 2018 so that user applications that want to monitor memory pressure (such as Android's lmkd) can catch the pressure level experienced during memory reclaim.
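
As a rough illustration of how a userspace monitor might consume PSI, the sketch below parses one line in the format exposed by /proc/pressure/memory (for example "some avg10=0.00 avg60=0.00 avg300=0.00 total=0"). The struct `psi_line` and the function `parse_psi_line()` are hypothetical helpers written for this example; reading the file and polling for triggers are left out.

```c
#include <stdio.h>

/* Hypothetical container for one parsed PSI line. */
struct psi_line {
        char kind[16];                  /* "some" or "full" */
        float avg10, avg60, avg300;     /* stall percentages over 10s/60s/300s */
        unsigned long long total;       /* accumulated stall time in microseconds */
};

/* Parse a single PSI line. Returns 0 on success, -1 if the format differs. */
static int parse_psi_line(const char *line, struct psi_line *out)
{
        int n = sscanf(line, "%15s avg10=%f avg60=%f avg300=%f total=%llu",
                       out->kind, &out->avg10, &out->avg60, &out->avg300,
                       &out->total);

        return n == 5 ? 0 : -1;
}
```

A monitor could call this on each line read from the file and compare avg10 against its own threshold; the threshold policy is up to the application.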

References

Memory Resource Controller (Documentation/cgroups/memory.txt) | LWN.net
Low Memory Killer Daemon | Android

Rmap -1- (Reverse Mapping)

2019-09-10, 2022-06-06 문영일


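Rmap is a method of reverse mapping from a physical address to virtual addresses. With rmap, rmap_walk finds every VM and virtual address that uses a physical page, and the following APIs request various operations on the mappings (VM, virtual address, pte) found this way.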

try_to_unmap()
page_mkclean()
page_referenced()
try_to_migrate()
page_mlock()
page_make_device_exclusive()
remove_migration_ptes()
page_idle_clear_pte_refs()
try_to_munlock()

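The following figure shows the concept of rmap with forward and reverse mapping.

rmap types

As the following figure shows, four types of rmap can be constructed depending on the kind of user page.

Anonymous mapping
- Used when user stack, user heap and CoW pages are allocated.
- 1) in the figure below
KSM mapping
- In a VM_MERGEABLE area, user anon pages whose physical memory contents are identical can be merged.
- 2) in the figure below
File mapping
- Used when code, libraries, data files, shared memory, devices and so on are loaded and allocated.
- 3) in the figure below
non-lru movable mapping
- Used when pages of a file system that supports non-lru movable pages are mapped; these pages can be migrated.
- 4) in the figure below
- Added in kernel v4.8-rc1 in 2016.
  - See: mm: migrate: support non-lru movable page migration (2016, v4.8-rc1)

The mapping member of the page structure points to one of the structures above, and the type can be distinguished by the flags in its lower 2 bits.

anonymous page
- PAGE_MAPPING_ANON(1)
ksm page
- PAGE_MAPPING_KSM(3) = PAGE_MAPPING_ANON(1) + PAGE_MAPPING_MOVABLE(2)
file page
- Lower bits are all 0
non-lru movable page
- PAGE_MAPPING_MOVABLE(2)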
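
The low-bit encoding above can be illustrated with a small userspace decoder. The PAGE_MAPPING_* values are the ones listed above; `enum mapping_kind`, `mapping_kind_of()` and `mapping_pointer()` are hypothetical names for this sketch, and the mapping value is treated as a plain integer rather than a real kernel pointer.

```c
#include <stdint.h>

/* Values from the list above. */
#define PAGE_MAPPING_ANON     0x1
#define PAGE_MAPPING_MOVABLE  0x2
#define PAGE_MAPPING_KSM      (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
#define PAGE_MAPPING_FLAGS    (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)

/* Hypothetical result type for this example. */
enum mapping_kind { MAPPING_FILE, MAPPING_ANON, MAPPING_MOVABLE, MAPPING_KSM };

/* Classify a page->mapping value by its lower 2 bits. */
static enum mapping_kind mapping_kind_of(uintptr_t mapping)
{
        switch (mapping & PAGE_MAPPING_FLAGS) {
        case PAGE_MAPPING_ANON:
                return MAPPING_ANON;
        case PAGE_MAPPING_MOVABLE:
                return MAPPING_MOVABLE;
        case PAGE_MAPPING_KSM:
                return MAPPING_KSM;
        default:
                return MAPPING_FILE;    /* both bits clear */
        }
}

/* Strip the flag bits to recover the address of the pointed-to structure. */
static uintptr_t mapping_pointer(uintptr_t mapping)
{
        return mapping & ~(uintptr_t)PAGE_MAPPING_FLAGS;
}
```

The encoding relies on the pointed-to structures being at least 4-byte aligned, so the two lowest address bits are always free to carry the type.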

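Fault handler and VMA creation

When a user memory allocation is requested, the kernel creates or extends a VMA to mark the virtual address range as in use, but does not allocate physical memory yet. When the user area is accessed later, a fault occurs, and the fault handler looks up which VMA of the user address space contains the faulting address. If the faulting address is valid, the physical page is allocated only at that point, following a lazy allocation policy. The page table entry is then updated for the forward mapping, and the reverse mapping (rmap) is added as well.

The following figure shows how page allocation and mapping/reverse mapping are performed by the fault handler when memory is actually used after a page allocation request.

The following figure shows how the first VMA is created through the fault handler.

Fault handler and rmap creation

The following figure briefly shows the state of the anon rmap.

Forward mapping uses multiple levels of page tables (pgd -> p4d -> pud -> pmd -> pte).
For reverse mapping, each anon page is represented by being connected to an AV (anon_vma structure).
- In the figure below, VMA and AV are connected through an AVC (anon_vma_chain), but the AVC is omitted from the drawing.
- For the connection, page->mapping holds the address of the anon_vma structure with the PAGE_MAPPING_ANON flag added.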

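Virtual address space management (VMA management)

Each user process has its own virtual address space, and page tables are used to map it. Multiple virtual address areas requested by the user are registered in the user virtual address space; each of them is a VMA, represented by an allocated vm_area_struct structure.

Note that when a process has child threads, the child threads are also managed with task_struct. However, they share the user address space of the process, so all child threads have the same mm_struct.

VMAs are managed with two data structures, an RB tree and a list. Both are used because together they handle searches for free address ranges effectively.

RB tree
- Responds quickly to lookups by start address.
Linear linked list
- The VMAs are kept sorted by address, so the size of the free space between neighboring VMAs can be found quickly.

The following figure shows three virtual address areas (VMAs) registered and managed in a user address space.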
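
Why an address-sorted list makes free-space lookups cheap can be shown with a simplified first-fit search. This is a userspace sketch under stated assumptions: a sorted array stands in for the linked list, `struct range` and `find_gap()` are hypothetical names, and the kernel's actual gap search is more elaborate than this linear walk.

```c
#include <stdbool.h>
#include <stddef.h>

/* One VMA as a half-open address range [start, end). */
struct range {
        unsigned long start;
        unsigned long end;
};

/*
 * Find the lowest gap of at least len bytes inside [lo, hi), given VMAs
 * sorted by start address. Because the entries are sorted, the gap before
 * each VMA is simply its start minus the end of the previous one.
 */
static bool find_gap(const struct range *vmas, size_t n,
                     unsigned long lo, unsigned long hi,
                     unsigned long len, unsigned long *addr)
{
        unsigned long cursor = lo;      /* lowest address not yet ruled out */
        size_t i;

        for (i = 0; i < n; i++) {
                if (vmas[i].start > cursor && vmas[i].start - cursor >= len) {
                        *addr = cursor;
                        return true;
                }
                if (vmas[i].end > cursor)
                        cursor = vmas[i].end;
        }
        if (hi > cursor && hi - cursor >= len) {
                *addr = cursor;         /* gap after the last VMA */
                return true;
        }
        return false;
}
```

With an unsorted collection the same query would need to compare every pair of areas, which is the cost the sorted list avoids.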

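The following shows the VMAs registered in a process (pid 1481). Look for the anon vmas, which are not associated with a file.

The fields are displayed in the following order.
- Start virtual address (vma->vm_start)
- End virtual address (vma->vm_end)
- Flags (VM_READ, VM_WRITE, VM_EXEC, VM_MAYSHARE)
- Offset, if a file is mapped (vma->vm_pgoff << PAGE_SHIFT)
- Major and minor numbers of the device, if a file is mapped
- inode number, if a file is mapped
- File name, if a file is mapped; for an anon vma the field is blank or a name in brackets []
  - The heap, vvar, vdso and stack areas are anon vmas that are specially displayed with brackets [].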

$ cat /proc/1481/maps
00400000-004ec000 r-xp 00000000 b3:05 26                                 /bin/bash
004fb000-004ff000 r--p 000eb000 b3:05 26                                 /bin/bash
004ff000-00508000 rw-p 000ef000 b3:05 26                                 /bin/bash
00508000-00512000 rw-p 00000000 00:00 0
30d7f000-30f1d000 rw-p 00000000 00:00 0                                  [heap]
7f87324000-7f8732d000 r-xp 00000000 b3:05 2173                           /lib/aarch64-linux-gnu/libnss_files-2.24.so
7f8732d000-7f8733c000 ---p 00009000 b3:05 2173                           /lib/aarch64-linux-gnu/libnss_files-2.24.so
7f8733c000-7f8733d000 r--p 00008000 b3:05 2173                           /lib/aarch64-linux-gnu/libnss_files-2.24.so
7f8733d000-7f8733e000 rw-p 00009000 b3:05 2173                           /lib/aarch64-linux-gnu/libnss_files-2.24.so
7f8733e000-7f87344000 rw-p 00000000 00:00 0
7f87344000-7f8734d000 r-xp 00000000 b3:05 2197                           /lib/aarch64-linux-gnu/libnss_nis-2.24.so
7f87370000-7f8737f000 ---p 00012000 b3:05 2243                           /lib/aarch64-linux-gnu/libnsl-2.24.so
(...omitted...)
7f8771b000-7f8771c000 r--p 00000000 b3:05 7362                           /usr/lib/locale/C.UTF-8/LC_IDENTIFICATION
7f8771c000-7f87721000 rw-p 00000000 00:00 0
7f87721000-7f87722000 r--p 00000000 00:00 0                              [vvar]
7f87722000-7f87723000 r-xp 00000000 00:00 0                              [vdso]
7f87723000-7f87724000 r--p 0001c000 b3:05 2253                           /lib/aarch64-linux-gnu/ld-2.24.so
7f87724000-7f87726000 rw-p 0001d000 b3:05 2253                           /lib/aarch64-linux-gnu/ld-2.24.so
7fcdc27000-7fcdc48000 rw-p 00000000 00:00 0                              [stack]
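
The field order described above can be checked by parsing one line of the output with sscanf(). The sketch below is a userspace illustration; `struct maps_entry` and `parse_maps_line()` are hypothetical names, and the 255-byte path limit is an arbitrary choice for the example.

```c
#include <stdio.h>

/* Hypothetical container for one line of /proc/<pid>/maps. */
struct maps_entry {
        unsigned long long start, end;  /* vm_start, vm_end */
        char perms[5];                  /* e.g. "r-xp" */
        unsigned long long offset;      /* file offset */
        unsigned int major, minor;      /* device numbers (hex in the file) */
        unsigned long long inode;
        char path[256];                 /* empty for unnamed anon vmas */
};

/* Parse one maps line. Returns 0 on success, -1 if the line is malformed. */
static int parse_maps_line(const char *line, struct maps_entry *e)
{
        int n;

        e->path[0] = '\0';      /* the path field is optional */
        n = sscanf(line, "%llx-%llx %4s %llx %x:%x %llu %255[^\n]",
                   &e->start, &e->end, e->perms, &e->offset,
                   &e->major, &e->minor, &e->inode, e->path);
        return n >= 7 ? 0 : -1;
}
```

A line that parses with an empty path and inode 0 is an unnamed anon vma, while names such as [heap] or [stack] end up in the path field.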

The following command shows the information above in more detail.

$ cat /proc/1481/smaps
00400000-004ec000 r-xp 00000000 b3:05 26                                 /bin/bash
Size:                944 kB
Rss:                 936 kB
Pss:                 311 kB
Shared_Clean:        936 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:          936 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd ex mr mw me dw
(...omitted...)
7fcdc27000-7fcdc48000 rw-p 00000000 00:00 0                              [stack]
Size:                132 kB
Rss:                  32 kB
Pss:                  32 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:        32 kB
Referenced:           32 kB
Anonymous:            32 kB
AnonHugePages:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd wr mr mw me gd ac

VM flags

The following are the VM flags used in vma->vm_flags.

The two letters shown on the right are the abbreviations printed on the VmFlags line of smaps.

VM_READ         rd
VM_WRITE        wr
VM_EXEC         ex
VM_SHARED       sh
VM_MAYREAD      mr
VM_MAYWRITE     mw
VM_MAYEXEC      me
VM_MAYSHARE     ms
VM_GROWSDOWN    gd
VM_UFFD_MISSING 
VM_PFNMAP       pf
VM_DENYWRITE    dw
VM_UFFD_WP      
VM_LOCKED       lo
VM_IO           io
VM_SEQ_READ     sr
VM_RAND_READ    rr
VM_DONTCOPY     dc
VM_DONTEXPAND   de
VM_LOCKONFAULT  
VM_ACCOUNT      ac
VM_NORESERVE    nr
VM_HUGETLB      ht
VM_SYNC         sf
VM_ARCH_1       ar
VM_WIPEONFORK   wf
VM_DONTDUMP     dd
VM_SOFTDIRTY    sd
VM_MIXEDMAP     mm
VM_HUGEPAGE     hg
VM_NOHUGEPAGE   nh
VM_MERGEABLE    mg

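anonymous type VMA

To manage an anonymous type VMA (vm_area_struct), an AV (anon_vma structure) is used, and the connection between a VMA and an AV is managed with an AVC (anon_vma_chain structure).

An AV is managed with an RB tree so that it can contain many VMAs.
- In kernel v3.7-rc1 (September 2012), the linear linked list used in the AV was replaced with an interval tree built on an RB tree.
- See: mm anon rmap: replace same_anon_vma linked list with an interval tree.
A VMA still uses a linear linked list to hold its AVs.

The following figure shows AVCs being used for the connections between VMAs and AVs.

anon_vma merging

When VMAs are adjacent and have similar attributes, an existing anon_vma can be reused instead of creating a separate one.

Management in a forked child process

The following figure shows the relationship with a forked child process.

The following figure shows the parent VMA cloned and the AVs created and linked after child processes are forked twice.

The following figure shows how the parent relationship of the AVs is represented.

The following figure shows how the root relationship of the AVs, that is, the first AV created, is represented.

The following figure shows how the relationship in which a VMA points directly (1:1) to its AV is represented.

Efficient unmapping with rmap

If the mappings of a shared page were removed using only forward mappings, all of the many user page tables would have to be searched. Instead, the reverse mapping is used to find the VMAs (virtual address areas), and the mapping is removed only from the user page tables used by those VMAs, which is much faster.

The following figure shows how the structures are managed so that VMA #1 and #2 can be found when removing the mapping of a page shared between a parent process and a forked child process.

When child process B is forked, VMA #1 is cloned to create VMA #2.
The pages newly allocated by child process B are created and managed in VMA #2 and VMA #3 respectively.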

anon_vma initialization

anon_vma_init()

mm/rmap.c

void __init anon_vma_init(void)
{
        anon_vma_cachep = kmem_cache_create("anon_vma", sizeof(struct anon_vma),
                        0, SLAB_TYPESAFE_BY_RCU|SLAB_PANIC|SLAB_ACCOUNT,
                        anon_vma_ctor);
        anon_vma_chain_cachep = KMEM_CACHE(anon_vma_chain,
                        SLAB_PANIC|SLAB_ACCOUNT);
}

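Creates the slab caches used to allocate anon_vma and anon_vma_chain structures.

In line 3 of the code, creates a slab cache for allocating anon_vma structures, with anon_vma_ctor() registered as the constructor that is called when objects are initialized.
In line 6 of the code, creates a slab cache for allocating anon_vma_chain structures.

anon_vma allocation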

anon_vma_alloc()

mm/rmap.c

static inline struct anon_vma *anon_vma_alloc(void)
{
        struct anon_vma *anon_vma;

        anon_vma = kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
        if (anon_vma) {
                atomic_set(&anon_vma->refcount, 1);
                anon_vma->degree = 1;   /* Reference for first vma */
                anon_vma->parent = anon_vma;
                /*
                 * Initialise the anon_vma root to point to itself. If called
                 * from fork, the root will be reset to the parents anon_vma.
                 */
                anon_vma->root = anon_vma;
        }

        return anon_vma;
}

Allocates an anon_vma.

Sets refcount of the allocated anon_vma structure to 1, sets the degree member to 1 as the reference for the first vma, and makes parent and root point to the structure itself.

anon_vma free

anon_vma_free()

mm/rmap.c

static inline void anon_vma_free(struct anon_vma *anon_vma)
{
        VM_BUG_ON(atomic_read(&anon_vma->refcount));

        /*
         * Synchronize against page_lock_anon_vma_read() such that
         * we can safely hold the lock without the anon_vma getting
         * freed.
         *
         * Relies on the full mb implied by the atomic_dec_and_test() from
         * put_anon_vma() against the acquire barrier implied by
         * down_read_trylock() from page_lock_anon_vma_read(). This orders:
         *
         * page_lock_anon_vma_read()    VS      put_anon_vma()
         *   down_read_trylock()                  atomic_dec_and_test()
         *   LOCK                                 MB
         *   atomic_read()                        rwsem_is_locked()
         *
         * LOCK should suffice since the actual taking of the lock must
         * happen _before_ what follows.
         */
        might_sleep();
        if (rwsem_is_locked(&anon_vma->root->rwsem)) {
                anon_vma_lock_write(anon_vma);
                anon_vma_unlock_write(anon_vma);
        }

        kmem_cache_free(anon_vma_cachep, anon_vma);
}

Frees an anon_vma.

anon_vma preparation

Many places request that an anon_vma be prepared when a free page is prepared for user use. The main routines are as follows.

Memory allocation at fault time
- do_cow_fault()
- wp_page_copy()
- do_anonymous_page()
Stack expansion
- expand_upwards()
- expand_downwards()
Migration
- migrate_vma_insert_page()

anon_vma_prepare()

include/linux/rmap.h

static inline int anon_vma_prepare(struct vm_area_struct *vma)
{
        if (likely(vma->anon_vma))
                return 0;

        return __anon_vma_prepare(vma);
}

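Prepares an anon_vma for the vma area.

In lines 3~4 of the code, returns success (0) if an anon_vma is already prepared and in use for the vma.
In line 6 of the code, prepares and attaches the anon_vma data structure that manages the anon pages of the vma.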

__anon_vma_prepare()

mm/rmap.c

/**
 * __anon_vma_prepare - attach an anon_vma to a memory region
 * @vma: the memory region in question
 *
 * This makes sure the memory mapping described by 'vma' has
 * an 'anon_vma' attached to it, so that we can associate the
 * anonymous pages mapped into it with that anon_vma.
 *
 * The common case will be that we already have one, which
 * is handled inline by anon_vma_prepare(). But if
 * not we either need to find an adjacent mapping that we
 * can re-use the anon_vma from (very common when the only
 * reason for splitting a vma has been mprotect()), or we
 * allocate a new one.
 *
 * Anon-vma allocations are very subtle, because we may have
 * optimistically looked up an anon_vma in page_lock_anon_vma_read()
 * and that may actually touch the rwsem even in the newly
 * allocated vma (it depends on RCU to make sure that the
 * anon_vma isn't actually destroyed).
 *
 * As a result, we need to do proper anon_vma locking even
 * for the new allocation. At the same time, we do not want
 * to do any locking for the common case of already having
 * an anon_vma.
 *
 * This must be called with the mmap_lock held for reading.
 */

int __anon_vma_prepare(struct vm_area_struct *vma)
{
        struct mm_struct *mm = vma->vm_mm;
        struct anon_vma *anon_vma, *allocated;
        struct anon_vma_chain *avc;

        might_sleep();

        avc = anon_vma_chain_alloc(GFP_KERNEL);
        if (!avc)
                goto out_enomem;

        anon_vma = find_mergeable_anon_vma(vma);
        allocated = NULL;
        if (!anon_vma) {
                anon_vma = anon_vma_alloc();
                if (unlikely(!anon_vma))
                        goto out_enomem_free_avc;
                allocated = anon_vma;
        }

        anon_vma_lock_write(anon_vma);
        /* page_table_lock to protect against threads */
        spin_lock(&mm->page_table_lock);
        if (likely(!vma->anon_vma)) {
                vma->anon_vma = anon_vma;
                anon_vma_chain_link(vma, avc, anon_vma);
                /* vma reference or self-parent link for new root */
                anon_vma->degree++;
                allocated = NULL;
                avc = NULL;
        }
        spin_unlock(&mm->page_table_lock);
        anon_vma_unlock_write(anon_vma);

        if (unlikely(allocated))
                put_anon_vma(allocated);
        if (unlikely(avc))
                anon_vma_chain_free(avc);

        return 0;

 out_enomem_free_avc:
        anon_vma_chain_free(avc);
 out_enomem:
        return -ENOMEM;
}

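Prepares the anon_vma data structure that manages the anon pages of the vma and attaches it. An anon_vma is connected to a vma through an anon_vma_chain.

Code lines 9~11: allocate an anon_vma_chain.
Code lines 13~20: look for a mergeable anon_vma in a neighbouring vma; if none is found, allocate a new anon_vma.
Code lines 22~34: take the locks and, if the vma still has no anon_vma, attach the anon_vma and the anon_vma_chain to the vma.
Code lines 36~39: in the unlikely case that the vma already had an anon_vma by the time the locks were taken (another thread attached one first), free the anon_vma and anon_vma_chain that were allocated but not attached.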
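
The allocate-first, recheck-under-lock flow described above can be modelled in user space. The following is a minimal sketch, not kernel code: toy_vma, toy_av and toy_prepare() are hypothetical names, a pthread mutex stands in for mm->page_table_lock, and the anon_vma_chain and the neighbour lookup are left out.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins for anon_vma and vm_area_struct. */
struct toy_av { int degree; };
struct toy_vma {
    pthread_mutex_t lock;   /* plays the role of mm->page_table_lock */
    struct toy_av *av;      /* plays the role of vma->anon_vma */
};

/* Returns 0 on success, -1 if the allocation fails. */
int toy_prepare(struct toy_vma *vma)
{
    struct toy_av *allocated;

    assert(vma != NULL);
    if (vma->av)                    /* fast path: already attached */
        return 0;

    allocated = calloc(1, sizeof(*allocated));  /* allocate before locking */
    if (!allocated)
        return -1;
    allocated->degree = 1;          /* a new object starts at 1 */

    pthread_mutex_lock(&vma->lock);
    if (!vma->av) {                 /* still unattached: attach ours */
        vma->av = allocated;
        vma->av->degree++;          /* the vma now points to it */
        allocated = NULL;
    }
    pthread_mutex_unlock(&vma->lock);

    free(allocated);                /* lost the race: drop our copy (NULL is a no-op) */
    return 0;
}
```

Calling toy_prepare() twice on the same toy_vma leaves the first toy_av in place with degree 2; the second call returns through the fast path without allocating anything.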

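The following figure shows how an AV (anon_vma) is prepared, either by creating a new one or by using an existing one.

The following figure shows how, on first use, an AV (anon_vma) is allocated and connected to the VMA (vm_area_struct) through an AVC (anon_vma_chain).

Degree and refcount management of the AV (anon_vma)

Lifetime management of the AV (anon_vma)

refcount
- Reference counter; the AV is destroyed when it reaches 0.
- Incremented each time a VMA is connected through an AVC (anon_vma_chain).
degree
- Incremented each time the AV becomes the owner of a VMA (the AV pointed to by vma->anon_vma). anon_vma_fork() also increments the parent AV's degree when it creates a child AV.
- It is 1 when the AV is first created, but the AV is immediately connected to a VMA and becomes its owner, so it effectively starts at 2.
- When this value is 1, the AV can be reused.
  - This is the case where only a single forked child process is still running and the parent process has exited.
  - It has no vma and only one anon_vma child

The following figure shows how the degree and refcount of an AV change.

The following figure shows how the degree and refcount of a merged AV change.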

find_mergeable_anon_vma()

mm/mmap.c

/*
 * find_mergeable_anon_vma is used by anon_vma_prepare, to check
 * neighbouring vmas for a suitable anon_vma, before it goes off
 * to allocate a new anon_vma.  It checks because a repetitive
 * sequence of mprotects and faults may otherwise lead to distinct
 * anon_vmas being allocated, preventing vma merge in subsequent
 * mprotect.
 */

struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
{
        struct anon_vma *anon_vma = NULL;

        /* Try next first. */
        if (vma->vm_next) {
                anon_vma = reusable_anon_vma(vma->vm_next, vma, vma->vm_next);
                if (anon_vma)
                        return anon_vma;
        }

        /* Try prev next. */
        if (vma->vm_prev)
                anon_vma = reusable_anon_vma(vma->vm_prev, vma->vm_prev, vma);

        /*
         * We might reach here with anon_vma == NULL if we can't find
         * any reusable anon_vma.
         * There's no absolute need to look only at touching neighbours:
         * we could search further afield for "compatible" anon_vmas.
         * But it would probably just be a waste of time searching,
         * or lead to too many vmas hanging off the same anon_vma.
         * We're trying to allow mprotect remerging later on,
         * not trying to minimize memory used for anon_vmas.
         */
        return anon_vma;
}

Returns a mergeable anon_vma from a neighbouring vma if one exists, otherwise NULL. (When NULL is returned, the caller allocates a new anon_vma.)

Code lines 6~10: if the vma following @vma can share its anon_vma, get the anon_vma used by that neighbour.
Code lines 13~14: if the vma preceding @vma can share its anon_vma, get the anon_vma used by that neighbour.
Code line 26: return the anon_vma.

The following figure shows how a mergeable anon_vma is returned from the two VMAs adjacent to the vma.

reusable_anon_vma()

mm/mmap.c

/*
 * Do some basic sanity checking to see if we can re-use the anon_vma
 * from 'old'. The 'a'/'b' vma's are in VM order - one of them will be
 * the same as 'old', the other will be the new one that is trying
 * to share the anon_vma.
 *
 * NOTE! This runs with mm_sem held for reading, so it is possible that
 * the anon_vma of 'old' is concurrently in the process of being set up
 * by another page fault trying to merge _that_. But that's ok: if it
 * is being set up, that automatically means that it will be a singleton
 * acceptable for merging, so we can do all of this optimistically. But
 * we do that READ_ONCE() to make sure that we never re-load the pointer.
 *
 * IOW: that the "list_is_singular()" test on the anon_vma_chain only
 * matters for the 'stable anon_vma' case (ie the thing we want to avoid
 * is to return an anon_vma that is "complex" due to having gone through
 * a fork).
 *
 * We also make sure that the two vma's are compatible (adjacent,
 * and with the same memory policies). That's all stable, even with just
 * a read lock on the mm_sem.
 */

static struct anon_vma *reusable_anon_vma(struct vm_area_struct *old,
                struct vm_area_struct *a, struct vm_area_struct *b)
{
        if (anon_vma_compatible(a, b)) {
                struct anon_vma *anon_vma = READ_ONCE(old->anon_vma);

                if (anon_vma && list_is_singular(&old->anon_vma_chain))
                        return anon_vma;
        }
        return NULL;
}

If the @a and @b anon regions are compatible for sharing an anon_vma, returns the anon_vma of the @old region.

Code line 4: check whether the @a and @b anon regions are compatible.
Code lines 5~8: if @old has an anon_vma and exactly one avc is registered on @old, return that anon_vma.

anon_vma_compatible()

mm/mmap.c

/*
 * Rough compatibility check to quickly see if it's even worth looking
 * at sharing an anon_vma.
 *
 * They need to have the same vm_file, and the flags can only differ
 * in things that mprotect may change.
 *
 * NOTE! The fact that we share an anon_vma doesn't _have_ to mean that
 * we can merge the two vma's. For example, we refuse to merge a vma if
 * there is a vm_ops->close() function, because that indicates that the
 * driver is doing some kind of reference counting. But that doesn't
 * really matter for the anon_vma sharing case.
 */

static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct *b)
{
        return a->vm_end == b->vm_start &&
                mpol_equal(vma_policy(a), vma_policy(b)) &&
                a->vm_file == b->vm_file &&
                !((a->vm_flags ^ b->vm_flags) & ~(VM_ACCESS_FLAGS | VM_SOFTDIRTY)) &&
                b->vm_pgoff == a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT);
}

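Returns whether the two anon regions can share an anon_vma. It returns true (1) only if all of the following conditions hold.

Region b immediately follows region a, so the two regions are adjacent.
Both regions have the same vma policy.
Both regions have identical flags, apart from read, write, exec and softdirty.
- #define VM_ACCESS_FLAGS (VM_READ | VM_WRITE | VM_EXEC)
Both regions have the same vm_file, and their vm_pgoff values are contiguous in a-then-b order.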
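
The conditions above can be tried out with a small user-space model. This is only a sketch under stated assumptions: toy_vma, toy_compatible() and the TOY_* constants are hypothetical, 4 KiB pages are assumed, and the memory-policy and softdirty parts of the real check are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGE_SHIFT 12                 /* assumes 4 KiB pages */
#define TOY_VM_READ    0x1UL
#define TOY_VM_WRITE   0x2UL
#define TOY_VM_EXEC    0x4UL
#define TOY_VM_LOCKED  0x2000UL           /* illustrative flag outside the ignored set */
#define TOY_ACCESS_FLAGS (TOY_VM_READ | TOY_VM_WRITE | TOY_VM_EXEC)

struct toy_vma {
    unsigned long start, end;   /* [start, end) in bytes */
    unsigned long flags;
    unsigned long pgoff;        /* offset in pages */
    const void *file;           /* NULL for anonymous memory */
};

/* a must sit directly below b; the policy check is omitted in this sketch. */
bool toy_compatible(const struct toy_vma *a, const struct toy_vma *b)
{
    assert(a && b);
    return a->end == b->start &&
           a->file == b->file &&
           !((a->flags ^ b->flags) & ~TOY_ACCESS_FLAGS) &&
           b->pgoff == a->pgoff + ((b->start - a->start) >> TOY_PAGE_SHIFT);
}
```

Two regions produced by an mprotect() split typically pass: for a = [0x10000, 0x12000) with pgoff 0x10 and b = [0x12000, 0x13000) with pgoff 0x12, the protections differ but every other field lines up.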

The following figure shows how two VMA regions in the virtual address space are checked for compatibility.

Linking with anon_vma_chain

anon_vma_chain_alloc()

mm/rmap.c

static inline struct anon_vma_chain *anon_vma_chain_alloc(gfp_t gfp)
{
        return kmem_cache_alloc(anon_vma_chain_cachep, gfp);
}

Allocates an anon_vma_chain structure from its kmem cache.

anon_vma_chain_free()

mm/rmap.c

static void anon_vma_chain_free(struct anon_vma_chain *anon_vma_chain)
{
        kmem_cache_free(anon_vma_chain_cachep, anon_vma_chain);
}

Frees the anon_vma_chain structure, returning it to the kmem cache.

anon_vma_chain_link()

mm/rmap.c

static void anon_vma_chain_link(struct vm_area_struct *vma,
                                struct anon_vma_chain *avc,
                                struct anon_vma *anon_vma)
{
        avc->vma = vma;
        avc->anon_vma = anon_vma;
        list_add(&avc->same_vma, &vma->anon_vma_chain);
        anon_vma_interval_tree_insert(avc, &anon_vma->rb_root);
}

Links the vma and the anon_vma through the avc.

Code lines 5~6: make @avc point to @vma and @anon_vma.
Code line 7: add @avc to the vma's anon_vma_chain list.
Code line 8: add @avc to the interval tree (RB tree) of @anon_vma.
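
To make the two-sided linkage concrete, here is a simplified user-space sketch. All names (toy_avc, toy_vma, toy_av, toy_chain_link(), toy_count_vmas()) are hypothetical, and plain singly linked lists replace both the kernel list and the interval tree.

```c
#include <assert.h>
#include <stddef.h>

struct toy_vma;
struct toy_av;

/* One link object per (vma, anon_vma) pair, threaded on two lists. */
struct toy_avc {
    struct toy_vma *vma;
    struct toy_av *av;
    struct toy_avc *next_same_vma;  /* next link of the same vma */
    struct toy_avc *next_same_av;   /* next link of the same anon_vma */
};

struct toy_vma { struct toy_avc *chain; };  /* like vma->anon_vma_chain */
struct toy_av  { struct toy_avc *links; };  /* a list instead of the interval tree */

void toy_chain_link(struct toy_vma *vma, struct toy_avc *avc, struct toy_av *av)
{
    assert(vma && avc && av);
    avc->vma = vma;
    avc->av = av;
    avc->next_same_vma = vma->chain;    /* push on the vma side */
    vma->chain = avc;
    avc->next_same_av = av->links;      /* push on the anon_vma side */
    av->links = avc;
}

/* Counts the vmas reachable from one anon_vma, the direction an rmap walk takes. */
int toy_count_vmas(const struct toy_av *av)
{
    int n = 0;

    for (const struct toy_avc *c = av->links; c; c = c->next_same_av)
        n++;
    return n;
}
```

Linking a parent vma and a child vma to the same toy_av makes toy_count_vmas() return 2, which is the property the reverse mapping relies on: starting from the anon_vma, every vma that may map its pages can be reached.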

The following figure shows a vma and an anon_vma linked through an avc.

Adding a mapping to an anon page

page_add_anon_rmap()

mm/rmap.c

/**
 * page_add_anon_rmap - add pte mapping to an anonymous page
 * @page:       the page to add the mapping to
 * @vma:        the vm area in which the mapping is added
 * @address:    the user virtual address mapped
 * @compound:   charge the page as compound or small page
 *
 * The caller needs to hold the pte lock, and the page must be locked in
 * the anon_vma case: to serialize mapping,index checking after setting,
 * and to ensure that PageAnon is not being upgraded racily to PageKsm
 * (but PageKsm is never downgraded to PageAnon).
 */

void page_add_anon_rmap(struct page *page,
        struct vm_area_struct *vma, unsigned long address, bool compound)
{
        do_page_add_anon_rmap(page, vma, address, compound ? RMAP_COMPOUND : 0);
}

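Adds a reverse mapping to a shared anon page.

do_page_add_anon_rmap()

Note: this function is called directly only from do_swap_page(), to add a reverse mapping for an anon page that is exclusively owned by the current process. All other callers should use page_add_anon_rmap() above.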

mm/rmap.c

/*
 * Special version of the above for do_swap_page, which often runs
 * into pages that are exclusively owned by the current process.
 * Everybody else should continue to use page_add_anon_rmap above.
 */

void do_page_add_anon_rmap(struct page *page,
        struct vm_area_struct *vma, unsigned long address, int flags)
{
        bool compound = flags & RMAP_COMPOUND;
        bool first;

        if (unlikely(PageKsm(page)))
                lock_page_memcg(page);
        else
                VM_BUG_ON_PAGE(!PageLocked(page), page);

        if (compound) {
                atomic_t *mapcount;
                VM_BUG_ON_PAGE(!PageLocked(page), page);
                VM_BUG_ON_PAGE(!PageTransHuge(page), page);
                mapcount = compound_mapcount_ptr(page);
                first = atomic_inc_and_test(mapcount);
        } else {
                first = atomic_inc_and_test(&page->_mapcount);
        }

        if (first) {
                int nr = compound ? thp_nr_pages(page) : 1;
                /*
                 * We use the irq-unsafe __{inc|mod}_zone_page_stat because
                 * these counters are not modified in interrupt context, and
                 * pte lock(a spinlock) is held, which implies preemption
                 * disabled.
                 */
                if (compound)
                        __inc_node_page_state(page, NR_ANON_THPS);
                __mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, nr);
        }
        if (unlikely(PageKsm(page))) {
                unlock_page_memcg(page);
                return;
        }

        /* address might be in next vma when migration races vma_adjust */
        if (first)
                __page_set_anon_rmap(page, vma, address,
                                flags & RMAP_EXCLUSIVE);
        else
                __page_check_anon_rmap(page, vma, address);
}

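Adds a reverse mapping to an anon page.

Code line 4: check whether the flags request compound handling.
Code lines 7~10: for a KSM page, take the lock that protects the binding between the page and its memcg.
Code lines 12~20: increment the page's mapcount by 1 and determine whether this is the first mapping.
Code lines 22~33: on the first mapping, increase the nr_anon_mapped counter by the number of pages. For a compound page, also increment the nr_anon_thps counter by 1.
Code lines 34~37: for a KSM page, release the lock taken above and return.
Code lines 40~44: on the first mapping, set up a new anonymous rmap for the page. Otherwise nothing is done apart from debug checks.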
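
The "first mapping" test works because the mapcount starts at -1, so the increment that reaches 0 is the first one. The sketch below reproduces that convention with C11 atomics; toy_page and the toy_* functions are hypothetical names, and atomic_fetch_add()/atomic_fetch_sub() are used here in place of the kernel's atomic_inc_and_test() and atomic_add_negative().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Like page->_mapcount: -1 means "not mapped", 0 means one mapping. */
struct toy_page { atomic_int mapcount; };

void toy_page_init(struct toy_page *p)
{
    atomic_init(&p->mapcount, -1);
}

/* Adds a mapping; true if it was the first one (counter went -1 -> 0). */
bool toy_add_rmap(struct toy_page *p)
{
    return atomic_fetch_add(&p->mapcount, 1) + 1 == 0;
}

/* Drops a mapping; true if it was the last one (counter went 0 -> -1). */
bool toy_remove_rmap(struct toy_page *p)
{
    int old = atomic_fetch_sub(&p->mapcount, 1);

    assert(old >= 0);   /* removing a mapping that does not exist is a bug */
    return old == 0;
}

/* Number of mappings, i.e. the stored value plus one. */
int toy_mapcount(struct toy_page *p)
{
    return atomic_load(&p->mapcount) + 1;
}
```

Only the caller that sees true from toy_add_rmap() would update the statistics and set up the rmap, and only the caller that sees true from toy_remove_rmap() would undo the statistics, so each is done exactly once per page.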

page_add_new_anon_rmap()

mm/rmap.c

/**
 * page_add_new_anon_rmap - add pte mapping to a new anonymous page
 * @page:       the page to add the mapping to
 * @vma:        the vm area in which the mapping is added
 * @address:    the user virtual address mapped
 * @compound:   charge the page as compound or small page
 *
 * Same as page_add_anon_rmap but must only be called on *new* pages.
 * This means the inc-and-test can be bypassed.
 * Page does not have to be locked.
 */

void page_add_new_anon_rmap(struct page *page,
        struct vm_area_struct *vma, unsigned long address, bool compound)
{
        int nr = compound ? thp_nr_pages(page) : 1;

        VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
        __SetPageSwapBacked(page);
        if (compound) {
                VM_BUG_ON_PAGE(!PageTransHuge(page), page);
                /* increment count (starts at -1) */
                atomic_set(compound_mapcount_ptr(page), 0);
                if (hpage_pincount_available(page))
                        atomic_set(compound_pincount_ptr(page), 0);

                __mod_lruvec_page_state(page, NR_ANON_THPS, nr);
        } else {
                /* Anon THP always mapped first with PMD */
                VM_BUG_ON_PAGE(PageTransCompound(page), page);
                /* increment count (starts at -1) */
                atomic_set(&page->_mapcount, 0);
        }
        __mod_lruvec_page_state(page, NR_ANON_MAPPED, nr);
        __page_set_anon_rmap(page, vma, address, 1);
}

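Sets up the rmap for a new anon page that is private to the task.

Code line 7: since this is a newly allocated anon page, set the PG_swapbacked flag.
Code lines 8~15: for a compound page, set the compound mapcount to 0 (it starts at -1, so 0 means one mapping) and increase the NR_ANON_THPS counter.
Code lines 16~21: for a non-compound page, set the mapcount to 0.
Code lines 22~23: increase the NR_ANON_MAPPED counter and set the anon rmap on the page.
- page->mapping = anon_vma
- page->index = linear_page_index(vma, address)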

__page_set_anon_rmap()

mm/rmap.c

/**
 * __page_set_anon_rmap - set up new anonymous rmap
 * @page:       Page or Hugepage to add to rmap
 * @vma:        VM area to add page to.
 * @address:    User virtual address of the mapping
 * @exclusive:  the page is exclusively owned by the current process
 */

static void __page_set_anon_rmap(struct page *page,
        struct vm_area_struct *vma, unsigned long address, int exclusive)
{
        struct anon_vma *anon_vma = vma->anon_vma;

        BUG_ON(!anon_vma);

        if (PageAnon(page))
                return;

        /*
         * If the page isn't exclusively mapped into this vma,
         * we must use the _oldest_ possible anon_vma for the
         * page mapping!
         */
        if (!exclusive)
                anon_vma = anon_vma->root;

        /*
         * page_idle does a lockless/optimistic rmap scan on page->mapping.
         * Make sure the compiler doesn't split the stores of anon_vma and
         * the PAGE_MAPPING_ANON type identifier, otherwise the rmap code
         * could mistake the mapping for a struct address_space and crash.
         */
        anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
        WRITE_ONCE(page->mapping, (struct address_space *) anon_vma);
        page->index = linear_page_index(vma, address);
}

Sets the page up with an anonymous rmap.

Code lines 8~9: if the page is already anon-mapped, return.
Code lines 16~17: if the page is not exclusive to the current task but shared, use the root anon_vma.
Code lines 25~27: set the anon mapping on the page.
- page->mapping is set to the anon_vma pointer with the PAGE_MAPPING_ANON flag added.
- page->index is set to the vma's page offset + (address pfn - vma start pfn).

The following figure shows how an anon page is mapped into the rmap.

For an anon-mapped page, PageAnon(page) returns true.
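
Storing a flag in the low bit of page->mapping works because the pointed-to structure is aligned, so that bit is otherwise always zero. A minimal user-space sketch of the same tagging trick follows; toy_av, the toy_* helpers and the value of TOY_MAPPING_ANON are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAPPING_ANON ((uintptr_t)0x1)   /* low tag bit, value assumed for this sketch */

struct toy_av { int unused; };   /* int alignment keeps bit 0 of its address clear */

/* Combines the pointer and the tag into a single word. */
uintptr_t toy_set_anon(const struct toy_av *av)
{
    assert(((uintptr_t)av & TOY_MAPPING_ANON) == 0);
    return (uintptr_t)av + TOY_MAPPING_ANON;
}

/* Tests the tag without touching the pointer part. */
bool toy_is_anon(uintptr_t mapping)
{
    return (mapping & TOY_MAPPING_ANON) != 0;
}

/* Recovers the original pointer from a tagged word. */
struct toy_av *toy_get_av(uintptr_t mapping)
{
    return (struct toy_av *)(mapping - TOY_MAPPING_ANON);
}
```

Because the pointer and the tag travel in one word, a reader that loads the word once sees either both or neither, which is the reason the kernel code writes page->mapping with a single WRITE_ONCE().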

linear_page_index()

include/linux/pagemap.h

static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
                                        unsigned long address)
{
        pgoff_t pgoff;
        if (unlikely(is_vm_hugetlb_page(vma)))
                return linear_hugepage_index(vma, address);
        pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
        pgoff += vma->vm_pgoff;
        return pgoff;
}

Returns the page offset of the requested address within the vma, plus vma->vm_pgoff, as the index.
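
As a worked example of this arithmetic, the sketch below repeats the non-hugetlb branch in user space. toy_linear_index() and TOY_PAGE_SHIFT are hypothetical names, and 4 KiB pages are assumed.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12   /* assumes 4 KiB pages */

/* Index = pages from the start of the vma + the vma's own page offset. */
unsigned long toy_linear_index(unsigned long vm_start, unsigned long vm_pgoff,
                               unsigned long address)
{
    assert(address >= vm_start);
    return ((address - vm_start) >> TOY_PAGE_SHIFT) + vm_pgoff;
}
```

With vm_start = 0x40000000, vm_pgoff = 0x100 and address = 0x40003abc, the address lies in the fourth page of the vma (0x3abc >> 12 = 3), so the index is 0x103.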

Removing an rmap mapping from a page

page_remove_rmap()

mm/rmap.c

/**
 * page_remove_rmap - take down pte mapping from a page
 * @page:       page to remove mapping from
 * @compound:   uncharge the page as compound or small page
 *
 * The caller needs to hold the pte lock.
 */

void page_remove_rmap(struct page *page, bool compound)
{
        lock_page_memcg(page);

        if (!PageAnon(page)) {
                page_remove_file_rmap(page, compound);
                goto out;
        }

        if (compound) {
                page_remove_anon_compound_rmap(page);
                goto out;
        }

        /* page still mapped by someone else? */
        if (!atomic_add_negative(-1, &page->_mapcount))
                goto out;

        /*
         * We use the irq-unsafe __{inc|mod}_zone_page_stat because
         * these counters are not modified in interrupt context, and
         * pte lock(a spinlock) is held, which implies preemption disabled.
         */
        __dec_lruvec_page_state(page, NR_ANON_MAPPED);

        if (unlikely(PageMlocked(page)))
                clear_page_mlock(page);

        if (PageTransCompound(page))
                deferred_split_huge_page(compound_head(page));

        /*
         * It would be tidy to reset the PageAnon mapping here,
         * but that might overwrite a racing page_add_anon_rmap
         * which increments mapcount after us but sets mapping
         * before us: so leave the reset to free_unref_page,
         * and remember that it's only reliable while mapped.
         * Leaving it set also helps swapoff to reinstate ptes
         * faster for those pages still in swapcache.
         */
out:
        unlock_page_memcg(page);
}

Removes the reverse mapping (rmap) from a page.

page_remove_file_rmap()

mm/rmap.c

static void page_remove_file_rmap(struct page *page, bool compound)
{
        int i, nr = 1;

        VM_BUG_ON_PAGE(compound && !PageHead(page), page);
        lock_page_memcg(page);

        /* Hugepages are not counted in NR_FILE_MAPPED for now. */
        if (unlikely(PageHuge(page))) {
                /* hugetlb pages are always mapped with pmds */
                atomic_dec(compound_mapcount_ptr(page));
                return;
        }

        /* page still mapped by someone else? */
        if (compound && PageTransHuge(page)) {
                int nr_pages = thp_nr_pages(page);

                for (i = 0, nr = 0; i < HPAGE_PMD_NR; i++) {
                        if (atomic_add_negative(-1, &page[i]._mapcount))
                                nr++;
                }
                if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
                        return;
                if (PageSwapBacked(page))
                        __mod_lruvec_page_state(page, NR_SHMEM_PMDMAPPED,
                                                -nr_pages);
                else
                        __mod_lruvec_page_state(page, NR_FILE_PMDMAPPED,
                                                -nr_pages);
        } else {
                if (!atomic_add_negative(-1, &page->_mapcount))
                        return;
        }

        /*
         * We use the irq-unsafe __{inc|mod}_lruvec_page_state because
         * these counters are not modified in interrupt context, and
         * pte lock(a spinlock) is held, which implies preemption disabled.
         */
        __mod_lruvec_page_state(page, NR_FILE_MAPPED, -nr);

        if (unlikely(PageMlocked(page)))
                clear_page_mlock(page);
out:
        unlock_page_memcg(page);
}

Removes the reverse mapping (rmap) of a file page.

Creating and linking anon_vma when a child process is forked

anon_vma_fork()

mm/rmap.c

/*
 * Attach vma to its own anon_vma, as well as to the anon_vmas that
 * the corresponding VMA in the parent process is attached to.
 * Returns 0 on success, non-zero on failure.
 */

int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
{
        struct anon_vma_chain *avc;
        struct anon_vma *anon_vma;
        int error;

        /* Don't bother if the parent process has no anon_vma here. */
        if (!pvma->anon_vma)
                return 0;

        /* Drop inherited anon_vma, we'll reuse existing or allocate new. */
        vma->anon_vma = NULL;

        /*
         * First, attach the new VMA to the parent VMA's anon_vmas,
         * so rmap can find non-COWed pages in child processes.
         */
        error = anon_vma_clone(vma, pvma);
        if (error)
                return error;

        /* An existing anon_vma has been reused, all done then. */
        if (vma->anon_vma)
                return 0;

        /* Then add our own anon_vma. */
        anon_vma = anon_vma_alloc();
        if (!anon_vma)
                goto out_error;
        avc = anon_vma_chain_alloc(GFP_KERNEL);
        if (!avc)
                goto out_error_free_anon_vma;

        /*
         * The root anon_vma's rwsem is the lock actually used when we
         * lock any of the anon_vmas in this anon_vma tree.
         */
        anon_vma->root = pvma->anon_vma->root;
        anon_vma->parent = pvma->anon_vma;
        /*
         * With refcounts, an anon_vma can stay around longer than the
         * process it belongs to. The root anon_vma needs to be pinned until
         * this anon_vma is freed, because the lock lives in the root.
         */
        get_anon_vma(anon_vma->root);
        /* Mark this anon_vma as the one where our new (COWed) pages go. */
        vma->anon_vma = anon_vma;
        anon_vma_lock_write(anon_vma);
        anon_vma_chain_link(vma, avc, anon_vma);
        anon_vma->parent->degree++;
        anon_vma_unlock_write(anon_vma);

        return 0;

 out_error_free_anon_vma:
        put_anon_vma(anon_vma);
 out_error:
        unlink_anon_vmas(vma);
        return -ENOMEM;
}

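For the @vma created in the forked child process, allocates one anon_vma_chain per anon_vma attached to the parent @pvma and links them. It also reuses or allocates the anon_vma that belongs to @vma itself.

Code lines 8~9: if the parent vma has never had an anon_vma assigned, return success (0).
Code line 12: set the anon_vma of the vma inherited from the parent task to NULL so that it can be set up again.
Code lines 18~20: link the vma being cloned to the anon_vmas attached to the parent vma.
Code lines 23~24: if an anon_vma was reused, return success (0).
Code lines 27~32: allocate an anon_vma and an anon_vma_chain.
Code lines 38~39: set the root and parent of the anon_vma.
Code lines 45~53: link the anon_vma to the vm_area_struct through the anon_vma_chain and return success (0).

The following figure shows the result of anon_vma_fork() being called three times while a child process is forked.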

anon_vma_clone()

The main call paths of this function are as follows.

A task is forked and the parent task's vma is cloned
- anon_vma_fork()
An existing VMA is cloned when a VMA is split
- __split_vma()

mm/rmap.c

/*
 * Attach the anon_vmas from src to dst.
 * Returns 0 on success, -ENOMEM on failure.
 *
 * anon_vma_clone() is called by __vma_adjust(), __split_vma(), copy_vma() and
 * anon_vma_fork(). The first three want an exact copy of src, while the last
 * one, anon_vma_fork(), may try to reuse an existing anon_vma to prevent
 * endless growth of anon_vma. Since dst->anon_vma is set to NULL before call,
 * we can identify this case by checking (!dst->anon_vma && src->anon_vma).
 *
 * If (!dst->anon_vma && src->anon_vma) is true, this function tries to find
 * and reuse existing anon_vma which has no vmas and only one child anon_vma.
 * This prevents degradation of anon_vma hierarchy to endless linear chain in
 * case of constantly forking task. On the other hand, an anon_vma with more
 * than one child isn't reused even if there was no alive vma, thus rmap
 * walker has a good chance of avoiding scanning the whole hierarchy when it
 * searches where page is mapped.
 */

int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
{
        struct anon_vma_chain *avc, *pavc;
        struct anon_vma *root = NULL;

        list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) {
                struct anon_vma *anon_vma;

                avc = anon_vma_chain_alloc(GFP_NOWAIT | __GFP_NOWARN);
                if (unlikely(!avc)) {
                        unlock_anon_vma_root(root);
                        root = NULL;
                        avc = anon_vma_chain_alloc(GFP_KERNEL);
                        if (!avc)
                                goto enomem_failure;
                }
                anon_vma = pavc->anon_vma;
                root = lock_anon_vma_root(root, anon_vma);
                anon_vma_chain_link(dst, avc, anon_vma);

                /*
                 * Reuse existing anon_vma if its degree lower than two,
                 * that means it has no vma and only one anon_vma child.
                 *
                 * Do not chose parent anon_vma, otherwise first child
                 * will always reuse it. Root anon_vma is never reused:
                 * it has self-parent reference and at least one child.
                 */
                if (!dst->anon_vma && src->anon_vma &&
                    anon_vma != src->anon_vma && anon_vma->degree < 2) dst->anon_vma = anon_vma;
        }
        if (dst->anon_vma)
                dst->anon_vma->degree++;
        unlock_anon_vma_root(root);
        return 0;

 enomem_failure:
        /*
         * dst->anon_vma is dropped here otherwise its degree can be incorrectly
         * decremented in unlink_anon_vmas().
         * We can safely do this because callers of anon_vma_clone() don't care
         * about dst->anon_vma if anon_vma_clone() failed.
         */
        dst->anon_vma = NULL;
        unlink_anon_vmas(dst);
        return -ENOMEM;
}

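Links the @dst vma to the anon_vmas attached to the @src vma.

Code line 6: iterate over pavc, once per anon_vma attached to the @src vma.
- This walks the avcs on the anon_vma_chain list of the @src vma.
Code lines 9~16: allocate an anon_vma_chain. The first attempt uses GFP_NOWAIT; if it fails, drop the anon_vma root lock and retry once with GFP_KERNEL.
Code line 17: get the anon_vma connected to the pavc being visited.
Code line 18: take the root anon_vma lock.
Code line 19: link the cloned @dst vma and the visited anon_vma using the newly allocated anon_vma_chain.
Code lines 29~31: this comes from the patch that stops the anon_vma hierarchy from growing without bound. An anon_vma at grandparent level or above that owns no vma and has only one child anon_vma is reused.
Code lines 32~33: if an anon_vma is assigned to the @dst vma, increment its degree.

Patch preventing endless growth of anon_vma from a repeated fork hierarchy

References: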

mm: prevent endless growth of anon_vma hierarchy (2015, v3.19-rc4)
mm/rmap.c: don’t reuse anon_vma if we just want a copy (2019, v5.5-rc1)
mm/rmap.c: reuse mergeable anon_vma as parent when fork (2019, v5.5-rc1)
Repeated fork() causes SLAB to grow without bound

The following program forks repeatedly, which used to make anon_vmas grow without bound. The patch fixes this by reusing anon_vmas.

#include <unistd.h>

int main(int argc, char *argv[])
{
        pid_t pid;

        while (1) {
                pid = fork();
                if (pid == -1) {
                        /* error */
                        return 1;
                }
                if (pid) {
                        /* parent */
                        sleep(2);
                        break;
                }
                else {
                        /* child */
                        sleep(1);
                }
        }

        return 0;
}

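The following figure shows how AVs are connected as a grandparent process forks a parent, which in turn forks a child.

To show the fork process in detail, the grandparent process uses only one VMA and one AV.

The following figure shows the state in which an AV can be reused after the parent process exits.

If the forked child process B exits instead, the parent process's AV is freed without being reused.

The following figure shows the child process reusing the grandparent process's AV.

Only AVs used at grandparent level or above, excluding the parent process, are candidates, and their degree must be 1.

References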

The case of the overly anonymous anon_vma (2010) | LWN.net
Virtual Memory II: the return of objrmap (2004) | LWN.net
Reverse mapping anonymous pages – again (2004)| LWN.net
Linux Memory Management | Columbia.edu (PDF download)

Rmap -2- (TTU & Rmap Walk)

2019-09-10 (updated 2021-07-07) 문영일

Rmap Walk

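The following figure shows an rmap walk over a single user page, calling the hook functions stored in the rmap_walk_control structure.

When a single user page is mapped at several virtual addresses, the related VMAs are found and operations such as unmapping and migration can be performed on the mapped pte entries.

The following figure shows the functions that currently use the rmap walk and the hook functions they install.

TTU (Try To Unmap)

When a user page is unmapped, its mappings are removed through an rmap walk.

When a user-space mapping is removed, secondary MMUs attached through the mmu notifier are controlled as well.

TTU flags

These flags control the TTU operation.

TTU_MIGRATION
- Migration mode.
TTU_MUNLOCK
- munlock mode; vmas that are not VM_LOCKED are skipped.
TTU_SPLIT_HUGE_PMD
- Split the page if it is mapped by a huge PMD.
TTU_IGNORE_MLOCK
- Ignore MLOCK.
TTU_IGNORE_ACCESS
- Normally, if a young page has been accessed, the access flag in its pte entry is cleared and the TLB is flushed; this flag skips that check.
TTU_IGNORE_HWPOISON
- Ignore hwpoison and use the page even if it is corrupted.
TTU_BATCH_FLUSH
- Where possible, defer TLB flushes and perform them together at the end.
TTU_RMAP_LOCKED
- The caller already holds the rmap lock.
TTU_SPLIT_FREEZE
- Freeze the ptes when a thp is split.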

try_to_unmap()

mm/rmap.c

/**
 * try_to_unmap - try to remove all page table mappings to a page
 * @page: the page to get unmapped
 * @flags: action and flags
 *
 * Tries to remove all the page table entries which are mapping this
 * page, used in the pageout path.  Caller must hold the page lock.
 *
 * If unmap is successful, return true. Otherwise, false.
 */

bool try_to_unmap(struct page *page, enum ttu_flags flags)
{
        struct rmap_walk_control rwc = {
                .rmap_one = try_to_unmap_one,
                .arg = (void *)flags,
                .done = page_mapcount_is_zero,
                .anon_lock = page_lock_anon_vma_read,
        };

        /*
         * During exec, a temporary VMA is setup and later moved.
         * The VMA is moved under the anon_vma lock but not the
         * page tables leading to a race where migration cannot
         * find the migration ptes. Rather than increasing the
         * locking requirements of exec(), migration skips
         * temporary VMAs until after exec() completes.
         */
        if ((flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))
            && !PageKsm(page) && PageAnon(page))
                rwc.invalid_vma = invalid_migration_vma;

        if (flags & TTU_RMAP_LOCKED)
                rmap_walk_locked(page, &rwc);
        else
                rmap_walk(page, &rwc);

        return !page_mapcount(page) ? true : false;
}

Removes all mappings of the page through an rmap walk.

Code lines 3~8: set the hook functions that will be called through rwc.
Code lines 18~20: when called for thp splitting (TTU_SPLIT_FREEZE) or migration (TTU_MIGRATION) on an anon-mapped page other than KSM, set the (*invalid_vma) hook to invalid_migration_vma().
- See: mm: thp: introduce separate TTU flag for thp freezing
Code lines 22~25: run the rmap walk with rwc to remove all mappings of the page. If the TTU_RMAP_LOCKED flag is set, the caller has already taken the lock.
- See: rmap: introduce rmap_walk_locked()
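
The flag test that decides whether the invalid_vma hook is installed can be isolated as a small predicate. In the sketch below the enum, its bit values and toy_needs_invalid_vma() are hypothetical; only the flag names and the shape of the condition come from try_to_unmap().

```c
#include <assert.h>
#include <stdbool.h>

/* Bit values are illustrative; only the names follow the flag list. */
enum toy_ttu_flags {
    TOY_TTU_MIGRATION    = 1 << 0,
    TOY_TTU_RMAP_LOCKED  = 1 << 1,
    TOY_TTU_SPLIT_FREEZE = 1 << 2,
};

static_assert((TOY_TTU_MIGRATION & TOY_TTU_SPLIT_FREEZE) == 0,
              "flags must occupy distinct bits");

/* Mirrors the condition under which try_to_unmap() sets rwc.invalid_vma. */
bool toy_needs_invalid_vma(unsigned int flags, bool is_ksm, bool is_anon)
{
    return (flags & (TOY_TTU_MIGRATION | TOY_TTU_SPLIT_FREEZE)) &&
           !is_ksm && is_anon;
}
```

Either of the two flags is enough, but the page must also be an anon page that is not a KSM page; TOY_TTU_RMAP_LOCKED alone never triggers the hook.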

The following figure shows the flow of try_to_unmap().

page_mapcount_is_zero()

mm/rmap.c

static int page_mapcount_is_zero(struct page *page)
{
        return !total_mapcount(page);
}

Returns whether the page's mapcount is 0.

It is installed as the (*done) hook of the rmap_walk_control structure so that the rmap walk finishes as soon as the page has been completely unmapped.
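
How a (*done) hook ends a walk early can be seen in a toy walker. This is a sketch under assumptions, not the kernel implementation: toy_walk_control, toy_rmap_walk() and the hooks are hypothetical, the vmas are a plain array, and the walker is assumed to stop when rmap_one returns false or done returns true.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_page { int mapcount; };      /* number of vmas still mapping the page */
struct toy_vma  { bool maps_page; };

/* Hypothetical counterpart of rmap_walk_control. */
struct toy_walk_control {
    void *arg;
    bool (*rmap_one)(struct toy_page *, struct toy_vma *, void *arg); /* false stops */
    bool (*done)(struct toy_page *);                                  /* true stops */
};

/* Visits the vmas in order and returns how many were visited. */
size_t toy_rmap_walk(struct toy_page *page, struct toy_vma *vmas, size_t n,
                     const struct toy_walk_control *rwc)
{
    size_t visited = 0;

    assert(rwc && rwc->rmap_one);
    for (size_t i = 0; i < n; i++) {
        visited++;
        if (!rwc->rmap_one(page, &vmas[i], rwc->arg))
            break;
        if (rwc->done && rwc->done(page))
            break;
    }
    return visited;
}

/* Example hook: drop the mapping held by one vma, if any. */
bool toy_unmap_one(struct toy_page *page, struct toy_vma *vma, void *arg)
{
    (void)arg;
    if (vma->maps_page) {
        vma->maps_page = false;
        page->mapcount--;
    }
    return true;
}

/* Example hook: the walk is finished once nothing maps the page. */
bool toy_mapcount_is_zero(struct toy_page *page)
{
    return page->mapcount == 0;
}
```

With four vmas of which only the first and third map the page, the walk visits three vmas and then stops, because the done hook reports a mapcount of zero before the fourth is reached.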

total_mapcount()

mm/huge_memory.c

int total_mapcount(struct page *page)
{
        int i, compound, ret;

        VM_BUG_ON_PAGE(PageTail(page), page);

        if (likely(!PageCompound(page)))
                return atomic_read(&page->_mapcount) + 1;

        compound = compound_mapcount(page);
        if (PageHuge(page))
                return compound;
        ret = compound;
        for (i = 0; i < HPAGE_PMD_NR; i++)
                ret += atomic_read(&page[i]._mapcount) + 1;
        /* File pages has compound_mapcount included in _mapcount */
        if (!PageAnon(page))
                return ret - compound * HPAGE_PMD_NR;
        if (PageDoubleMap(page))
                ret -= HPAGE_PMD_NR;
        return ret;
}

Returns the page's total mapcount.

page_lock_anon_vma_read()

mm/rmap.c

/*
 * Similar to page_get_anon_vma() except it locks the anon_vma.
 *
 * Its a little more complex as it tries to keep the fast path to a single
 * atomic op -- the trylock. If we fail the trylock, we fall back to getting a
 * reference like with page_get_anon_vma() and then block on the mutex.
 */

struct anon_vma *page_lock_anon_vma_read(struct page *page)
{
        struct anon_vma *anon_vma = NULL;
        struct anon_vma *root_anon_vma;
        unsigned long anon_mapping;

        rcu_read_lock();
        anon_mapping = (unsigned long)READ_ONCE(page->mapping);
        if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
                goto out;
        if (!page_mapped(page))
                goto out;

        anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
        root_anon_vma = READ_ONCE(anon_vma->root);
        if (down_read_trylock(&root_anon_vma->rwsem)) {
                /*
                 * If the page is still mapped, then this anon_vma is still
                 * its anon_vma, and holding the mutex ensures that it will
                 * not go away, see anon_vma_free().
                 */
                if (!page_mapped(page)) {
                        up_read(&root_anon_vma->rwsem);
                        anon_vma = NULL;
                }
                goto out;
        }

        /* trylock failed, we got to sleep */
        if (!atomic_inc_not_zero(&anon_vma->refcount)) {
                anon_vma = NULL;
                goto out;
        }

        if (!page_mapped(page)) {
                rcu_read_unlock();
                put_anon_vma(anon_vma);
                return NULL;
        }

        /* we pinned the anon_vma, its safe to sleep */
        rcu_read_unlock();
        anon_vma_lock_read(anon_vma);

        if (atomic_dec_and_test(&anon_vma->refcount)) {
                /*
                 * Oops, we held the last refcount, release the lock
                 * and bail -- can't simply use put_anon_vma() because
                 * we'll deadlock on the anon_vma_lock_write() recursion.
                 */
                anon_vma_unlock_read(anon_vma);
                __put_anon_vma(anon_vma);
                anon_vma = NULL;
        }

        return anon_vma;

out:
        rcu_read_unlock();
        return anon_vma;
}

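Acquires the read lock of the root anon_vma for an anon page and returns the page's anon_vma. If the page is not an anon page or is no longer mapped, NULL is returned.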
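Two building blocks of this function are the tag bits kept in the low bits of page->mapping and the "take a reference unless it is already zero" step on the slow path. The following is a minimal userspace sketch of both, using C11 atomics. The names `anon_vma_model`, `decode_anon()` and `ref_inc_not_zero()` are hypothetical, and the tag values 0x1 and 0x3 are assumptions chosen to resemble PAGE_MAPPING_ANON and PAGE_MAPPING_FLAGS.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAPPING_ANON   0x1UL   /* assumed tag: mapping is an anon_vma */
#define MAPPING_FLAGS  0x3UL   /* assumed mask of all tag bits */

struct anon_vma_model { atomic_int refcount; };

/* Returns the anon_vma behind a tagged mapping word, or NULL if not anon. */
static struct anon_vma_model *decode_anon(uintptr_t mapping)
{
    if ((mapping & MAPPING_FLAGS) != MAPPING_ANON)
        return NULL;
    return (struct anon_vma_model *)(mapping - MAPPING_ANON);
}

/* Take a reference unless the count has already dropped to zero. */
static bool ref_inc_not_zero(atomic_int *ref)
{
    int old = atomic_load(ref);

    while (old != 0) {
        /* On failure, old is reloaded with the current value. */
        if (atomic_compare_exchange_weak(ref, &old, old + 1))
            return true;
    }
    return false;
}
```

The tag trick works because the structure is at least 4-byte aligned, so the two low bits of its address are always zero and can carry flags.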

rmap walk

rmap_walk()

mm/rmap.c

void rmap_walk(struct page *page, struct rmap_walk_control *rwc)
{
        if (unlikely(PageKsm(page)))
                rmap_walk_ksm(page, rwc);
        else if (PageAnon(page))
                rmap_walk_anon(page, rwc, false);
        else
                rmap_walk_file(page, rwc, false);
}

Walks the vmas that map the page, according to the page type (KSM, anon, or file), and invokes the (*rmap_one) hook of rwc on each of them.

/*
 * rmap_walk_anon - do something to anonymous page using the object-based
 * rmap method
 * @page: the page to be handled
 * @rwc: control variable according to each walk type
 *
 * Find all the mappings of a page using the mapping pointer and the vma chains
 * contained in the anon_vma struct it points to.
 *
 * When called from try_to_munlock(), the mmap_sem of the mm containing the vma
 * where the page was found will be held for write.  So, we won't recheck
 * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
 * LOCKED.
 */

static void rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc,
                bool locked)
{
        struct anon_vma *anon_vma;
        pgoff_t pgoff_start, pgoff_end;
        struct anon_vma_chain *avc;

        if (locked) {
                anon_vma = page_anon_vma(page);
                /* anon_vma disappear under us? */
                VM_BUG_ON_PAGE(!anon_vma, page);
        } else {
                anon_vma = rmap_walk_anon_lock(page, rwc);
        }
        if (!anon_vma)
                return;

        pgoff_start = page_to_pgoff(page);
        pgoff_end = pgoff_start + hpage_nr_pages(page) - 1;
        anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root,
                        pgoff_start, pgoff_end) {
                struct vm_area_struct *vma = avc->vma;
                unsigned long address = vma_address(page, vma);

                cond_resched();

                if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg))
                        continue;

                if (!rwc->rmap_one(page, vma, address, rwc->arg))
                        break;
                if (rwc->done && rwc->done(page))
                        break;
        }

        if (!locked)
                anon_vma_unlock_read(anon_vma);
}

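Walks the anon vmas that map the page and invokes the (*rmap_one) hook of rwc on each of them.

In code lines 8~16, the anon_vma of the page is obtained. If @locked is not set, this function must take the anon_vma lock itself. If no anon_vma is found, the function returns.
In code lines 18~23, the pgoff range covered by the page is computed, and the vmas attached to the anon_vma that overlap pgoff start ~ pgoff end are iterated.
In code lines 27~28, if rwc has an (*invalid_vma) hook and it returns true, the vma is skipped.
In code lines 30~31, the (*rmap_one) hook of rwc is called. If it returns false, the loop breaks.
In code lines 32~33, if rwc has a (*done) hook and it returns true, the loop breaks.
In code lines 36~37, the anon_vma lock is released if it was taken by this function.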
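The order in which the three hooks are consulted can be sketched in userspace as follows. The types `vma_model` and `walk_control` and the example hooks are hypothetical simplifications, and a plain array replaces the interval tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct vma_model { bool skip; bool mapped; };

/* Hypothetical control block, shaped after rmap_walk_control. */
struct walk_control {
    void *arg;
    bool (*rmap_one)(struct vma_model *vma, void *arg);    /* false: stop */
    bool (*done)(void *arg);                               /* true: stop  */
    bool (*invalid_vma)(struct vma_model *vma, void *arg); /* true: skip  */
};

/* Visits vmas in order; returns how many times rmap_one was called. */
static int walk_vmas(struct vma_model *vmas, size_t n, struct walk_control *wc)
{
    int calls = 0;

    for (size_t i = 0; i < n; i++) {
        if (wc->invalid_vma && wc->invalid_vma(&vmas[i], wc->arg))
            continue;
        calls++;
        if (!wc->rmap_one(&vmas[i], wc->arg))
            break;
        if (wc->done && wc->done(wc->arg))
            break;
    }
    return calls;
}

/* Example hooks: unmap one vma, stop once nothing is mapped any more. */
static bool unmap_one(struct vma_model *vma, void *arg)
{
    int *mapcount = arg;

    if (vma->mapped) {
        vma->mapped = false;
        (*mapcount)--;
    }
    return true;
}

static bool not_mapped(void *arg) { return *(int *)arg == 0; }

static bool skip_flagged(struct vma_model *vma, void *arg)
{
    (void)arg;
    return vma->skip;
}
```

With four vmas where the first is skipped and the next two hold the only two mappings, the walk stops after the third vma because the done hook reports that the map count reached zero.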

/*
 * rmap_walk_file - do something to file page using the object-based rmap method
 * @page: the page to be handled
 * @rwc: control variable according to each walk type
 *
 * Find all the mappings of a page using the mapping pointer and the vma chains
 * contained in the address_space struct it points to.
 *
 * When called from try_to_munlock(), the mmap_sem of the mm containing the vma
 * where the page was found will be held for write.  So, we won't recheck
 * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
 * LOCKED.
 */

static void rmap_walk_file(struct page *page, struct rmap_walk_control *rwc,
                bool locked)
{
        struct address_space *mapping = page_mapping(page);
        pgoff_t pgoff_start, pgoff_end;
        struct vm_area_struct *vma;

        /*
         * The page lock not only makes sure that page->mapping cannot
         * suddenly be NULLified by truncation, it makes sure that the
         * structure at mapping cannot be freed and reused yet,
         * so we can safely take mapping->i_mmap_rwsem.
         */
        VM_BUG_ON_PAGE(!PageLocked(page), page);

        if (!mapping)
                return;

        pgoff_start = page_to_pgoff(page);
        pgoff_end = pgoff_start + hpage_nr_pages(page) - 1;
        if (!locked)
                i_mmap_lock_read(mapping);
        vma_interval_tree_foreach(vma, &mapping->i_mmap,
                        pgoff_start, pgoff_end) {
                unsigned long address = vma_address(page, vma);

                cond_resched();

                if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg))
                        continue;

                if (!rwc->rmap_one(page, vma, address, rwc->arg))
                        goto done;
                if (rwc->done && rwc->done(page))
                        goto done;
        }

done:
        if (!locked)
                i_mmap_unlock_read(mapping);
}

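Walks the file vmas that map the page and invokes the (*rmap_one) hook of rwc on each of them.

In code lines 4~17, the address_space mapping of the page is obtained. The page must be locked by the caller. If the page has no mapping, the function returns.
In code lines 19~20, pgoff start and pgoff end of the file page are computed.
In code lines 21~22, if @locked is not set, this function must take the i_mmap lock itself.
In code lines 23~24, the vmas of the file mapping that overlap pgoff start ~ pgoff end are iterated through the interval tree.
In code lines 29~30, if rwc has an (*invalid_vma) hook and it returns true, the vma is skipped.
In code lines 32~33, the (*rmap_one) hook of rwc is called. If it returns false, the loop is left through the done: label.
In code lines 34~35, if rwc has a (*done) hook and it returns true, the loop is left through the done: label.
In code lines 38~40, the done: label. The i_mmap lock is released if it was taken by this function.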
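Both walkers select vmas by page offset and then turn the page offset back into a user virtual address. A minimal sketch of these two calculations, assuming 4 KB pages, is given here. `struct vma_range`, `vma_overlaps()` and `address_in_vma()` are hypothetical names; the kernel performs the overlap query with an interval tree rather than a direct comparison.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_SHIFT 12   /* assumption: 4 KB pages */

struct vma_range {
    unsigned long vm_start;  /* first user virtual address of the vma */
    unsigned long vm_end;    /* one past the last address */
    unsigned long vm_pgoff;  /* page offset of vm_start within the object */
};

/* True if the vma covers any page offset in [first, last]. */
static bool vma_overlaps(const struct vma_range *v,
                         unsigned long first, unsigned long last)
{
    unsigned long nr = (v->vm_end - v->vm_start) >> MODEL_PAGE_SHIFT;
    unsigned long vma_last = v->vm_pgoff + nr - 1;

    return v->vm_pgoff <= last && first <= vma_last;
}

/* User virtual address at which page offset pgoff appears in the vma. */
static unsigned long address_in_vma(const struct vma_range *v,
                                    unsigned long pgoff)
{
    return v->vm_start + ((pgoff - v->vm_pgoff) << MODEL_PAGE_SHIFT);
}
```

For a four-page vma starting at 0x400000 with vm_pgoff 10, page offset 12 lands at 0x402000, and page offset 14 is outside the vma.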

Trying to unmap one mapping

try_to_unmap_one()

mm/rmap.c -1/6-

/*
 * @arg: enum ttu_flags will be passed to this argument
 */

static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
                     unsigned long address, void *arg)
{
        struct mm_struct *mm = vma->vm_mm;
        struct page_vma_mapped_walk pvmw = {
                .page = page,
                .vma = vma,
                .address = address,
        };
        pte_t pteval;
        struct page *subpage;
        bool ret = true;
        struct mmu_notifier_range range;
        enum ttu_flags flags = (enum ttu_flags)arg;

        /* munlock has nothing to gain from examining un-locked vmas */
        if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
                return true;

        if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
            is_zone_device_page(page) && !is_device_private_page(page))
                return true;

        if (flags & TTU_SPLIT_HUGE_PMD) {
                split_huge_pmd_address(vma, address,
                                flags & TTU_SPLIT_FREEZE, page);
        }

        /*
         * For THP, we have to assume the worse case ie pmd for invalidation.
         * For hugetlb, it could be much worse if we need to do pud
         * invalidation in the case of pmd sharing.
         *
         * Note that the page can not be free in this function as call of
         * try_to_unmap() must hold a reference on the page.
         */
        mmu_notifier_range_init(&range, vma->vm_mm, address,
                                min(vma->vm_end, address +
                                    (PAGE_SIZE << compound_order(page))));
        if (PageHuge(page)) {
                /*
                 * If sharing is possible, start and end will be adjusted
                 * accordingly.
                 */
                adjust_range_if_pmd_sharing_possible(vma, &range.start,
                                                     &range.end);
        }
        mmu_notifier_invalidate_range_start(&range);

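In code line 4, the mm that manages the virtual address space of the vma is obtained.
In code lines 5~9, pvmw is prepared to check whether the page is mapped.
In code lines 17~18, if TTU_MUNLOCK is requested on a vma without VM_LOCKED, the vma is skipped by simply returning success.
In code lines 20~22, for a migration request on a zone device page that is not a device private (HMM) page, true is returned to skip it.
- A zone device page that is not a device private page cannot go through the regular mapping/unmapping path.
In code lines 24~27, the huge page is split if TTU_SPLIT_HUGE_PMD is requested.
In code lines 37~39, the mmu_notifier_range is initialized.
In code lines 40~47, the range above is adjusted for a hugetlb page if PMD sharing is possible.
In code line 48, the (*invalidate_range_start) functions registered with the mmu notifier are called to announce that TLB invalidation of the range is about to begin on the secondary MMUs.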
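The range handed to the notifier covers the whole compound page but never reaches past the end of the vma. The following sketch reproduces only that calculation, assuming 4 KB pages. `range_model` and `notifier_range()` are hypothetical names, and the hugetlb PMD-sharing adjustment is not modeled.

```c
#include <assert.h>

#define MODEL_PAGE_SIZE 4096UL   /* assumption: 4 KB pages */

struct range_model { unsigned long start, end; };

/* Range to invalidate for a page of the given compound order. */
static struct range_model notifier_range(unsigned long address,
                                         unsigned long vm_end,
                                         unsigned int order)
{
    unsigned long end = address + (MODEL_PAGE_SIZE << order);
    /* Clamp to the vma, like min(vma->vm_end, ...) in the listing. */
    struct range_model r = { address, end < vm_end ? end : vm_end };

    return r;
}
```

An order-9 page (2 MB with 4 KB base pages) at 0x200000 in a vma ending at 0x300000 therefore yields the range 0x200000 ~ 0x300000 instead of the full 2 MB.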

mm/rmap.c -2/6-

        while (page_vma_mapped_walk(&pvmw)) {
#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
                /* PMD-mapped THP migration entry */
                if (!pvmw.pte && (flags & TTU_MIGRATION)) {
                        VM_BUG_ON_PAGE(PageHuge(page) || !PageTransCompound(page), page);

                        set_pmd_migration_entry(&pvmw, page);
                        continue;
                }
#endif

                /*
                 * If the page is mlock()d, we cannot swap it out.
                 * If it's recently referenced (perhaps page_referenced
                 * skipped over this mm) then we should reactivate it.
                 */
                if (!(flags & TTU_IGNORE_MLOCK)) {
                        if (vma->vm_flags & VM_LOCKED) {
                                /* PTE-mapped THP are never mlocked */
                                if (!PageTransCompound(page)) {
                                        /*
                                         * Holding pte lock, we do *not* need
                                         * mmap_sem here
                                         */
                                        mlock_vma_page(page);
                                }
                                ret = false;
                                page_vma_mapped_walk_done(&pvmw);
                                break;
                        }
                        if (flags & TTU_MUNLOCK)
                                continue;
                }

                /* Unexpected PMD-mapped THP? */
                VM_BUG_ON_PAGE(!pvmw.pte, page);

                subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
                address = pvmw.address;

                if (PageHuge(page)) {
                        if (huge_pmd_unshare(mm, &address, pvmw.pte)) {
                                /*
                                 * huge_pmd_unshare unmapped an entire PMD
                                 * page.  There is no way of knowing exactly
                                 * which PMDs may be cached for this mm, so
                                 * we must flush them all.  start/end were
                                 * already adjusted above to cover this range.
                                 */
                                flush_cache_range(vma, range.start, range.end);
                                flush_tlb_range(vma, range.start, range.end);
                                mmu_notifier_invalidate_range(mm, range.start,
                                                              range.end);

                                /*
                                 * The ref count of the PMD page was dropped
                                 * which is part of the way map counting
                                 * is done for shared PMDs.  Return 'true'
                                 * here.  When there is no other sharing,
                                 * huge_pmd_unshare returns false and we will
                                 * unmap the actual page and drop map count
                                 * to zero.
                                 */
                                page_vma_mapped_walk_done(&pvmw);
                                break;
                        }
                }

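In code line 1, the loop runs for as long as pvmw finds the page mapped at the requested location.
In code lines 4~9, for a PMD-mapped THP with a TTU_MIGRATION request, a migration entry is installed in the pmd entry and the loop continues.
In code lines 17~33, when the request does not carry TTU_IGNORE_MLOCK, an mlocked page cannot be swapped out. For a VM_LOCKED vma, the loop stops and the function reports failure. Otherwise, a TTU_MUNLOCK request simply continues with the next mapping.
In code lines 38~67, the subpage and address of this mapping are determined. For a hugetlb page whose shared PMD was just unshared by huge_pmd_unshare(), the cache and TLB are flushed for the range and TLB invalidation of the range is performed on the secondary MMUs. The loop then stops; ret is left untouched, so the function still returns true.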
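The subpage calculation relies on struct page entries being laid out in pfn order, so the pfn distance between the head page and the pfn stored in the pte is also the distance in the page array. A small sketch of that idea follows. `page_model`, `mem_map` with `BASE_PFN`, and `subpage_of()` are hypothetical, and the expression is reordered to stay inside the array, which plain C requires.

```c
#include <assert.h>
#include <stddef.h>

struct page_model { int id; };

#define BASE_PFN 0x1000UL            /* pfn of mem_map[0] in this model */
static struct page_model mem_map[16];

static unsigned long model_page_to_pfn(const struct page_model *p)
{
    return BASE_PFN + (unsigned long)(p - mem_map);
}

/*
 * head: first page of the compound page; pte_pfn: pfn found in the pte.
 * Equivalent to "page - page_to_pfn(page) + pte_pfn(*pte)", written so
 * that the intermediate pointer never leaves the array.
 */
static struct page_model *subpage_of(struct page_model *head,
                                     unsigned long pte_pfn)
{
    return head + (pte_pfn - model_page_to_pfn(head));
}
```

If the head page sits at index 8 and the pte holds pfn BASE_PFN + 11, the subpage is the entry at index 11.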

mm/rmap.c -3/6-

                if (IS_ENABLED(CONFIG_MIGRATION) &&
                    (flags & TTU_MIGRATION) &&
                    is_zone_device_page(page)) {
                        swp_entry_t entry;
                        pte_t swp_pte;

                        pteval = ptep_get_and_clear(mm, pvmw.address, pvmw.pte);

                        /*
                         * Store the pfn of the page in a special migration
                         * pte. do_swap_page() will wait until the migration
                         * pte is removed and then restart fault handling.
                         */
                        entry = make_migration_entry(page, 0);
                        swp_pte = swp_entry_to_pte(entry);
                        if (pte_soft_dirty(pteval))
                                swp_pte = pte_swp_mksoft_dirty(swp_pte);
                        set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
                        /*
                         * No need to invalidate here it will synchronize on
                         * against the special swap migration pte.
                         */
                        goto discard;
                }

                if (!(flags & TTU_IGNORE_ACCESS)) {
                        if (ptep_clear_flush_young_notify(vma, address,
                                                pvmw.pte)) {
                                ret = false;
                                page_vma_mapped_walk_done(&pvmw);
                                break;
                        }
                }

                /* Nuke the page table entry. */
                flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
                if (should_defer_flush(mm, flags)) {
                        /*
                         * We clear the PTE but do not flush so potentially
                         * a remote CPU could still be writing to the page.
                         * If the entry was previously clean then the
                         * architecture must guarantee that a clear->dirty
                         * transition on a cached TLB entry is written through
                         * and traps if the PTE is unmapped.
                         */
                        pteval = ptep_get_and_clear(mm, address, pvmw.pte);

                        set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
                } else {
                        pteval = ptep_clear_flush(vma, address, pvmw.pte);
                }

                /* Move the dirty bit to the page. Now the pte is gone. */
                if (pte_dirty(pteval))
                        set_page_dirty(page);

                /* Update high watermark before we lower rss */
                update_hiwater_rss(mm);

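In code lines 1~24, the case of a TTU_MIGRATION request on a zone device page. Migration information is packed into a swap entry and mapped into the pte entry.
- The swap entry format is reused to carry migration information. With this mapping in place, a later fault is routed to do_swap_page(), the fault handler for swap faults, which waits until the migration completes.
- The soft dirty feature is currently used only on the x86_64, powerpc_64, and s390 architectures.
- Reference: mm/migrate: support un-addressable ZONE_DEVICE page in migration
In code lines 26~33, when TTU_IGNORE_ACCESS is not requested, the young/accessed flag for address is test-and-cleared in the pte entry and in the secondary MMUs. If the page was accessed, the TLB is flushed and the routine stops with failure.
In code line 36, the cache for the user virtual address is flushed.
- On ARM64 this does nothing.
- Architectures whose cache type is VIVT or aliasing VIPT perform the flush.
In code lines 37~51, the pte entry mapped at the user virtual address is cleared to unmap it, and the TLB is flushed. If TTU_BATCH_FLUSH was requested and deferring is possible, the TLB flush is postponed and batched at the end to improve performance.
In code lines 54~55, if the pte entry was dirty before it was cleared, the page is marked dirty.
In code line 58, the hiwater_rss counter of mm is updated if the current RSS exceeds it.
- RSS = file pages + anon pages + shmem pages of mm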
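The reason the high watermark is updated before the counters are lowered can be seen in a short model: once a counter has been decremented, the previous peak is no longer observable. This sketch uses hypothetical names (`mm_model`, `update_hiwater()`, `unmap_anon_page()`) and plain integers in place of the kernel's per-mm counters.

```c
#include <assert.h>

struct mm_model {
    unsigned long file_pages, anon_pages, shmem_pages;
    unsigned long hiwater_rss;
};

/* RSS as described above: file + anon + shmem pages. */
static unsigned long mm_rss(const struct mm_model *mm)
{
    return mm->file_pages + mm->anon_pages + mm->shmem_pages;
}

/* Record the peak; only ever moves upward. */
static void update_hiwater(struct mm_model *mm)
{
    unsigned long rss = mm_rss(mm);

    if (mm->hiwater_rss < rss)
        mm->hiwater_rss = rss;
}

/* Unmap one anon page: record the peak first, then lower the counter. */
static void unmap_anon_page(struct mm_model *mm)
{
    update_hiwater(mm);
    mm->anon_pages--;
}
```

Starting from 10 file, 5 anon and 1 shmem pages with a recorded peak of 12, one unmap raises the peak to 16 and leaves an RSS of 15.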

mm/rmap.c -4/6-

                if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
                        pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
                        if (PageHuge(page)) {
                                int nr = 1 << compound_order(page);
                                hugetlb_count_sub(nr, mm);
                                set_huge_swap_pte_at(mm, address,
                                                     pvmw.pte, pteval,
                                                     vma_mmu_pagesize(vma));
                        } else {
                                dec_mm_counter(mm, mm_counter(page));
                                set_pte_at(mm, address, pvmw.pte, pteval);
                        }

                } else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
                        /*
                         * The guest indicated that the page content is of no
                         * interest anymore. Simply discard the pte, vmscan
                         * will take care of the rest.
                         * A future reference will then fault in a new zero
                         * page. When userfaultfd is active, we must not drop
                         * this page though, as its main user (postcopy
                         * migration) will not expect userfaults on already
                         * copied pages.
                         */
                        dec_mm_counter(mm, mm_counter(page));
                        /* We have to invalidate as we cleared the pte */
                        mmu_notifier_invalidate_range(mm, address,
                                                      address + PAGE_SIZE);
                } else if (IS_ENABLED(CONFIG_MIGRATION) &&
                                (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
                        swp_entry_t entry;
                        pte_t swp_pte;

                        if (arch_unmap_one(mm, vma, address, pteval) < 0) {
                                set_pte_at(mm, address, pvmw.pte, pteval);
                                ret = false;
                                page_vma_mapped_walk_done(&pvmw);
                                break;
                        }

                        /*
                         * Store the pfn of the page in a special migration
                         * pte. do_swap_page() will wait until the migration
                         * pte is removed and then restart fault handling.
                         */
                        entry = make_migration_entry(subpage,
                                        pte_write(pteval));
                        swp_pte = swp_entry_to_pte(entry);
                        if (pte_soft_dirty(pteval))
                                swp_pte = pte_swp_mksoft_dirty(swp_pte);
                        set_pte_at(mm, address, pvmw.pte, swp_pte);
                        /*
                         * No need to invalidate here it will synchronize on
                         * against the special swap migration pte.
                         */

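In code lines 1~12, the case of a hwpoison page requested without TTU_IGNORE_HWPOISON. For a hugetlb page the hugetlb counter is decreased by the number of pages; otherwise the relevant mm counter (anon, file, shm) is decreased. A hwpoison swap entry is then mapped into the pte entry.
In code lines 14~28, if the pte value is unused and the vma is not armed with userfaultfd, the relevant mm counter (anon, file, shm) is decreased. The secondary MMUs are then told to invalidate the TLB for the virtual address.
In code lines 29~51, the case of a TTU_MIGRATION or TTU_SPLIT_FREEZE request. A swap entry describing the migration is mapped. If the old mapping had soft dirty set, it is carried over into the swap entry. If the architecture-specific unmap fails, the original pte is restored and the routine stops with failure.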
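How a migration entry can ride inside the swap entry format is sketched here: a type field selects "migration, read" or "migration, write", and the offset field carries the pfn instead of a swap slot. The bit layout (`TYPE_SHIFT`) and the type numbers are assumptions made for illustration only and do not claim to match any kernel configuration; all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TYPE_SHIFT       58                 /* hypothetical split point */
#define OFFSET_MASK      ((UINT64_C(1) << TYPE_SHIFT) - 1)
#define TYPE_MIG_READ    UINT64_C(30)       /* hypothetical type numbers */
#define TYPE_MIG_WRITE   UINT64_C(31)

typedef struct { uint64_t val; } entry_t;

/* Pack a type and an offset into one swap-entry-like word. */
static entry_t make_entry(uint64_t type, uint64_t offset)
{
    entry_t e = { (type << TYPE_SHIFT) | (offset & OFFSET_MASK) };

    return e;
}

/* A migration entry stores the pfn where a swap offset would go. */
static entry_t make_migration(uint64_t pfn, bool writable)
{
    return make_entry(writable ? TYPE_MIG_WRITE : TYPE_MIG_READ, pfn);
}

static bool is_migration(entry_t e)
{
    uint64_t type = e.val >> TYPE_SHIFT;

    return type == TYPE_MIG_READ || type == TYPE_MIG_WRITE;
}

static bool is_write_migration(entry_t e)
{
    return (e.val >> TYPE_SHIFT) == TYPE_MIG_WRITE;
}

static uint64_t entry_pfn(entry_t e) { return e.val & OFFSET_MASK; }
```

A fault handler that finds such an entry can tell from the type alone that it must wait for the migration rather than read from a swap device, and the write variant preserves whether the old pte was writable.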

mm/rmap.c -5/6-

                } else if (PageAnon(page)) {
                        swp_entry_t entry = { .val = page_private(subpage) };
                        pte_t swp_pte;
                        /*
                         * Store the swap location in the pte.
                         * See handle_pte_fault() ...
                         */
                        if (unlikely(PageSwapBacked(page) != PageSwapCache(page))) {
                                WARN_ON_ONCE(1);
                                ret = false;
                                /* We have to invalidate as we cleared the pte */
                                mmu_notifier_invalidate_range(mm, address,
                                                        address + PAGE_SIZE);
                                page_vma_mapped_walk_done(&pvmw);
                                break;
                        }

                        /* MADV_FREE page check */
                        if (!PageSwapBacked(page)) {
                                if (!PageDirty(page)) {
                                        /* Invalidate as we cleared the pte */
                                        mmu_notifier_invalidate_range(mm,
                                                address, address + PAGE_SIZE);
                                        dec_mm_counter(mm, MM_ANONPAGES);
                                        goto discard;
                                }

                                /*
                                 * If the page was redirtied, it cannot be
                                 * discarded. Remap the page to page table.
                                 */
                                set_pte_at(mm, address, pvmw.pte, pteval);
                                SetPageSwapBacked(page);
                                ret = false;
                                page_vma_mapped_walk_done(&pvmw);
                                break;
                        }

                        if (swap_duplicate(entry) < 0) {
                                set_pte_at(mm, address, pvmw.pte, pteval);
                                ret = false;
                                page_vma_mapped_walk_done(&pvmw);
                                break;
                        }
                        if (arch_unmap_one(mm, vma, address, pteval) < 0) {
                                set_pte_at(mm, address, pvmw.pte, pteval);
                                ret = false;
                                page_vma_mapped_walk_done(&pvmw);
                                break;
                        }
                        if (list_empty(&mm->mmlist)) {
                                spin_lock(&mmlist_lock);
                                if (list_empty(&mm->mmlist))
                                        list_add(&mm->mmlist, &init_mm.mmlist);
                                spin_unlock(&mmlist_lock);
                        }
                        dec_mm_counter(mm, MM_ANONPAGES);
                        inc_mm_counter(mm, MM_SWAPENTS);
                        swp_pte = swp_entry_to_pte(entry);
                        if (pte_soft_dirty(pteval))
                                swp_pte = pte_swp_mksoft_dirty(swp_pte);
                        set_pte_at(mm, address, pvmw.pte, swp_pte);
                        /* Invalidate as we cleared the pte */
                        mmu_notifier_invalidate_range(mm, address,
                                                      address + PAGE_SIZE);

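In code lines 1~16, handling of an anon page. In the unlikely case that the swapbacked and swapcache flags of the page disagree, the TLB of the secondary MMUs is invalidated and the routine stops with failure.
In code lines 19~37, if the page is not SwapBacked (a MADV_FREE page), it is mapped again, marked SwapBacked, and the routine stops with failure. However, if the page is not dirty, the TLB of the secondary MMUs is invalidated, the anon mm counter is decreased, and control jumps to the discard label to continue.
In code lines 39~44, the reference count of the swap entry is increased by 1. On error, the page is mapped again and the routine stops.
In code lines 45~50, if the architecture-specific unmap fails, the page is mapped again and the routine stops.
- Currently only the sparc_64 architecture implements it.
In code lines 51~56, if the mmlist of the current mm is empty, it is added to the mmlist of init_mm.
In code lines 57~65, the anon mm counter is decreased and the swap mm counter is increased, the swap entry is mapped, and the TLB of the secondary MMUs is invalidated.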
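The branches for an anon page form a small decision tree, which the following sketch condenses into a single function. It models only the flag checks and the counter updates; the enum, the structs and `unmap_anon()` are hypothetical, and the failure paths of swap_duplicate() and arch_unmap_one() are left out.

```c
#include <assert.h>
#include <stdbool.h>

enum anon_action {
    ANON_ABORT,     /* inconsistent flags: stop and report failure */
    ANON_DISCARD,   /* clean lazy-free page: drop the mapping */
    ANON_REMAP,     /* redirtied lazy-free page: restore the pte, fail */
    ANON_SWAP       /* normal case: install a swap entry */
};

struct anon_page_model { bool swapbacked, swapcache, dirty; };
struct counters { long anon_pages, swap_ents; };

static enum anon_action unmap_anon(struct anon_page_model *pg,
                                   struct counters *c)
{
    /* A swap-backed page must be in the swap cache, and vice versa. */
    if (pg->swapbacked != pg->swapcache)
        return ANON_ABORT;

    /* MADV_FREE page check */
    if (!pg->swapbacked) {
        if (!pg->dirty) {
            c->anon_pages--;
            return ANON_DISCARD;
        }
        pg->swapbacked = true;   /* treat as a normal anon page again */
        return ANON_REMAP;
    }

    /* One anon mapping becomes one swap entry. */
    c->anon_pages--;
    c->swap_ents++;
    return ANON_SWAP;
}
```

Only the ANON_SWAP path moves a page from the anon counter to the swap counter; a discarded lazy-free page simply disappears from the anon counter.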

mm/rmap.c -6/6-

                } else {
                        /*
                         * This is a locked file-backed page, thus it cannot
                         * be removed from the page cache and replaced by a new
                         * page before mmu_notifier_invalidate_range_end, so no
                         * concurrent thread might update its page table to
                         * point at new page while a device still is using this
                         * page.
                         *
                         * See Documentation/vm/mmu_notifier.rst
                         */
                        dec_mm_counter(mm, mm_counter_file(page));
                }
discard:
                /*
                 * No need to call mmu_notifier_invalidate_range() it has be
                 * done above for all cases requiring it to happen under page
                 * table lock before mmu_notifier_invalidate_range_end()
                 *
                 * See Documentation/vm/mmu_notifier.rst
                 */
                page_remove_rmap(subpage, PageHuge(page));
                put_page(page);
        }

        mmu_notifier_invalidate_range_end(&range);

        return ret;
}

In code lines 1~13, the remaining case (a file-backed page). The file-related mm counter (file, shm) is decreased.
In code lines 14~23, the discard: label. The rmap mapping of the page is removed and the page reference is dropped.
In code line 26, invalidation of the range on the primary MMU is complete, so the (*invalidate_range_end) hooks registered with the mmu notifier are called for the secondary MMUs.
- Always works as a pair with the (*invalidate_range_start) hook.

ptep_clear_flush_young_notify()

include/linux/mmu_notifier.h

#define ptep_clear_flush_young_notify(__vma, __address, __ptep)         \
({                                                                      \
        int __young;                                                    \
        struct vm_area_struct *___vma = __vma;                          \
        unsigned long ___address = __address;                           \
        __young = ptep_clear_flush_young(___vma, ___address, __ptep);   \
        __young |= mmu_notifier_clear_flush_young(___vma->vm_mm,        \
                                                  ___address,           \
                                                  ___address +          \
                                                        PAGE_SIZE);     \
        __young;                                                        \
})

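If the page has been accessed, clears the access flag and flushes. Returns nonzero if the pte was accessed or if a (*clear_flush_young) hook registered with the mmu notifier reports an access.

In code line 6, the access flag of the pte entry is read and cleared. If the page had been accessed, the TLB is flushed.
In code line 7, the (*clear_flush_young) hooks registered with the mmu notifier are called and their result is ORed in.
- The hook is implemented by the following functions.
  - virt/kvm/kvm_main.c – kvm_mmu_notifier_clear_flush_young()
  - drivers/iommu/amd_iommu_v2.c – mn_clear_flush_young()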
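The key property of the macro is that both sides are always test-and-cleared and the results are combined, so an access seen only through a secondary MMU still counts. A minimal model follows, with a hypothetical `pte_model` type and a single secondary pte standing in for the notifier chain; TLB flushing is not modeled.

```c
#include <assert.h>
#include <stdbool.h>

struct pte_model { bool young; };

/* Test-and-clear: report the old accessed state and reset it. */
static bool test_and_clear_young(struct pte_model *pte)
{
    bool was_young = pte->young;

    pte->young = false;
    return was_young;
}

/* Combined result over the CPU pte and one secondary-MMU pte. */
static bool clear_young_notify(struct pte_model *primary,
                               struct pte_model *secondary)
{
    bool young = test_and_clear_young(primary);

    /* |= does not short-circuit, so the secondary side is cleared too. */
    young |= test_and_clear_young(secondary);
    return young;
}
```

Using `||` instead of `|=` would skip the secondary side whenever the primary pte was young, leaving a stale accessed bit behind for the next aging pass.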

MMU Notifier

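Runs the MMU-related operation functions of the secondary MMUs (KVM, IOMMU, …) that are tied to the physical pages mapped in a virtual address range. Drivers that use the mmu notifier prepare an mmu_notifier structure containing mmu_notifier_ops and register it through mmu_notifier_register().

The following figure shows how a secondary MMU is affected as well when the mapping of a physical page mapped into the user virtual address space changes.

mmu_notifier structure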

include/linux/mmu_notifier.h

/*
 * The notifier chains are protected by mmap_sem and/or the reverse map
 * semaphores. Notifier chains are only changed when all reverse maps and
 * the mmap_sem locks are taken.
 *
 * Therefore notifier chains can only be traversed when either
 *
 * 1. mmap_sem is held.
 * 2. One of the reverse map locks is held (i_mmap_rwsem or anon_vma->rwsem).
 * 3. No other concurrent thread can access the list (release)
 */

struct mmu_notifier {
        struct hlist_node hlist;
        const struct mmu_notifier_ops *ops;
};

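The structure through which a driver, such as an IOMMU driver, registers its prepared mmu_notifier_ops by adding it to the mmu_notifier_mm list of the mm that manages the relevant virtual address space.

hlist
- The node entry used when registering on mmu_notifier_mm->list, a member of mm_struct.
*ops
- Points to the prepared mmu_notifier_ops structure.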
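The relationship between the list node and the ops table can be illustrated with a compact userspace model. It is a sketch only: a singly linked list replaces the hlist, just one optional hook is modeled, and every name in it (`notifier`, `notifier_ops`, `mm_notifiers`, `notifier_register()`, `notify_clear_flush_young()`) is hypothetical rather than the kernel API.

```c
#include <assert.h>
#include <stddef.h>

struct notifier;

struct notifier_ops {
    /* Optional hook; returns nonzero if the range was accessed. */
    int (*clear_flush_young)(struct notifier *n,
                             unsigned long start, unsigned long end);
};

struct notifier {
    struct notifier *next;             /* plays the role of hlist */
    const struct notifier_ops *ops;
};

struct mm_notifiers { struct notifier *head; };

/* Link a prepared notifier onto the per-mm list. */
static void notifier_register(struct mm_notifiers *mm, struct notifier *n)
{
    n->next = mm->head;
    mm->head = n;
}

/* Call every registered hook and OR the results together. */
static int notify_clear_flush_young(struct mm_notifiers *mm,
                                    unsigned long start, unsigned long end)
{
    int young = 0;

    for (struct notifier *n = mm->head; n; n = n->next)
        if (n->ops->clear_flush_young)
            young |= n->ops->clear_flush_young(n, start, end);
    return young;
}

/* Example hook that always reports an access. */
static int always_young(struct notifier *n, unsigned long s, unsigned long e)
{
    (void)n; (void)s; (void)e;
    return 1;
}

static const struct notifier_ops young_ops = { always_young };
static const struct notifier_ops empty_ops = { NULL };
```

Because each hook is optional, a driver fills in only the callbacks its secondary MMU needs, and the dispatcher skips NULL entries.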

mmu_notifier_ops structure

include/linux/mmu_notifier.h -1/2-

struct mmu_notifier_ops {
        /*
         * Called either by mmu_notifier_unregister or when the mm is
         * being destroyed by exit_mmap, always before all pages are
         * freed. This can run concurrently with other mmu notifier
         * methods (the ones invoked outside the mm context) and it
         * should tear down all secondary mmu mappings and freeze the
         * secondary mmu. If this method isn't implemented you've to
         * be sure that nothing could possibly write to the pages
         * through the secondary mmu by the time the last thread with
         * tsk->mm == mm exits.
         *
         * As side note: the pages freed after ->release returns could
         * be immediately reallocated by the gart at an alias physical
         * address with a different cache model, so if ->release isn't
         * implemented because all _software_ driven memory accesses
         * through the secondary mmu are terminated by the time the
         * last thread of this mm quits, you've also to be sure that
         * speculative _hardware_ operations can't allocate dirty
         * cachelines in the cpu that could not be snooped and made
         * coherent with the other read and write operations happening
         * through the gart alias address, so leading to memory
         * corruption.
         */
        void (*release)(struct mmu_notifier *mn,
                        struct mm_struct *mm);

        /*
         * clear_flush_young is called after the VM is
         * test-and-clearing the young/accessed bitflag in the
         * pte. This way the VM will provide proper aging to the
         * accesses to the page through the secondary MMUs and not
         * only to the ones through the Linux pte.
         * Start-end is necessary in case the secondary MMU is mapping the page
         * at a smaller granularity than the primary MMU.
         */
        int (*clear_flush_young)(struct mmu_notifier *mn,
                                 struct mm_struct *mm,
                                 unsigned long start,
                                 unsigned long end);

        /*
         * clear_young is a lightweight version of clear_flush_young. Like the
         * latter, it is supposed to test-and-clear the young/accessed bitflag
         * in the secondary pte, but it may omit flushing the secondary tlb.
         */
        int (*clear_young)(struct mmu_notifier *mn,
                           struct mm_struct *mm,
                           unsigned long start,
                           unsigned long end);

        /*
         * test_young is called to check the young/accessed bitflag in
         * the secondary pte. This is used to know if the page is
         * frequently used without actually clearing the flag or tearing
         * down the secondary mapping on the page.
         */
        int (*test_young)(struct mmu_notifier *mn,
                          struct mm_struct *mm,
                          unsigned long address);

        /*
         * change_pte is called in cases that pte mapping to page is changed:
         * for example, when ksm remaps pte to point to a new shared page.
         */
        void (*change_pte)(struct mmu_notifier *mn,
                           struct mm_struct *mm,
                           unsigned long address,
                           pte_t pte);

(*release)
- The hook called when the mm is destroyed or when mmu_notifier_unregister() is called.
(*clear_flush_young)
- The hook called after the young/accessed bit flag in the pte entry has been test-and-cleared.
- It test-and-clears the young/accessed bit flag for the start ~ end address range in the secondary MMU and then flushes the TLB of the secondary MMU.
(*clear_young)
- A light version of (*clear_flush_young) above that may skip the TLB flush of the secondary MMU.
(*test_young)
- Returns the state of the young/accessed bit flag for the given address in the secondary MMU, without clearing it.
(*change_pte)
- Replaces the pte for the given address in the secondary MMU.

include/linux/mmu_notifier.h -2/2-

        /*
         * invalidate_range_start() and invalidate_range_end() must be
         * paired and are called only when the mmap_sem and/or the
         * locks protecting the reverse maps are held. If the subsystem
         * can't guarantee that no additional references are taken to
         * the pages in the range, it has to implement the
         * invalidate_range() notifier to remove any references taken
         * after invalidate_range_start().
         *
         * Invalidation of multiple concurrent ranges may be
         * optionally permitted by the driver. Either way the
         * establishment of sptes is forbidden in the range passed to
         * invalidate_range_begin/end for the whole duration of the
         * invalidate_range_begin/end critical section.
         *
         * invalidate_range_start() is called when all pages in the
         * range are still mapped and have at least a refcount of one.
         *
         * invalidate_range_end() is called when all pages in the
         * range have been unmapped and the pages have been freed by
         * the VM.
         *
         * The VM will remove the page table entries and potentially
         * the page between invalidate_range_start() and
         * invalidate_range_end(). If the page must not be freed
         * because of pending I/O or other circumstances then the
         * invalidate_range_start() callback (or the initial mapping
         * by the driver) must make sure that the refcount is kept
         * elevated.
         *
         * If the driver increases the refcount when the pages are
         * initially mapped into an address space then either
         * invalidate_range_start() or invalidate_range_end() may
         * decrease the refcount. If the refcount is decreased on
         * invalidate_range_start() then the VM can free pages as page
         * table entries are removed.  If the refcount is only
         * droppped on invalidate_range_end() then the driver itself
         * will drop the last refcount but it must take care to flush
         * any secondary tlb before doing the final free on the
         * page. Pages will no longer be referenced by the linux
         * address space but may still be referenced by sptes until
         * the last refcount is dropped.
         *
         * If blockable argument is set to false then the callback cannot
         * sleep and has to return with -EAGAIN. 0 should be returned
         * otherwise. Please note that if invalidate_range_start approves
         * a non-blocking behavior then the same applies to
         * invalidate_range_end.
         *
         */
        int (*invalidate_range_start)(struct mmu_notifier *mn,
                                      const struct mmu_notifier_range *range);
        void (*invalidate_range_end)(struct mmu_notifier *mn,
                                     const struct mmu_notifier_range *range);

        /*
         * invalidate_range() is either called between
         * invalidate_range_start() and invalidate_range_end() when the
         * VM has to free pages that where unmapped, but before the
         * pages are actually freed, or outside of _start()/_end() when
         * a (remote) TLB is necessary.
         *
         * If invalidate_range() is used to manage a non-CPU TLB with
         * shared page-tables, it not necessary to implement the
         * invalidate_range_start()/end() notifiers, as
         * invalidate_range() alread catches the points in time when an
         * external TLB range needs to be flushed. For more in depth
         * discussion on this see Documentation/vm/mmu_notifier.rst
         *
         * Note that this function might be called with just a sub-range
         * of what was passed to invalidate_range_start()/end(), if
         * called between those functions.
         */
        void (*invalidate_range)(struct mmu_notifier *mn, struct mm_struct *mm,
                                 unsigned long start, unsigned long end);
};

(*invalidate_range_start)
(*invalidate_range_end)
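- These two hooks work as a pair. For users such as VMs, when the TLB is invalidated for a range in the primary MMU, the secondary MMU follows and invalidates its TLB too. The hooks are called before and after the routine that manipulates the whole range at once through an rmap walk.
(*invalidate_range)
- Called for the secondary MMU as well, after the VM has invalidated the given range in the primary MMU.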

The following figure shows where init, start, end, and so on take place for the secondary MMU when an mmu notifier is used.

MM counters

RSS(Resident Set Size)
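- The number of the process's pages that are actually mapped to physical memory and in use.
- Swapped-out pages are excluded; stack, heap, and similar memory are included.
VSZ(Virtual memory SiZe)
- The number of virtual address pages the process is using.
- Larger than RSS, because it also includes swapped-out pages and pages that were allocated in the address space but have never been accessed, so no physical memory is mapped to them yet.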

The RSS and VSZ counters used by the bash (pid=5831) process can be checked as follows.

$ cat /proc/5831/status
Name:   bash
State:  S (sleeping)
Tgid:   5831
Ngid:   0
Pid:    5831
PPid:   5795
TracerPid:      0
Uid:    0       0       0       0
Gid:    0       0       0       0
FDSize: 256
Groups: 0
NStgid: 5831
NSpid:  5831
NSpgid: 5831
NSsid:  5831
VmPeak:     6768 kB     <--- hiwater VSZ
VmSize:     6704 kB     <--- VSZ
VmLck:         0 kB
VmPin:         0 kB
VmHWM:      4552 kB     <--- hiwater RSS (file + anon + shm)
VmRSS:      3868 kB     <--- RSS
VmData:     1652 kB
VmStk:       132 kB
VmExe:       944 kB
VmLib:      1648 kB
VmPTE:        28 kB
VmPMD:        12 kB
VmSwap:        0 kB
Threads:        1
SigQ:   0/15244
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000380004
SigCgt: 000000004b817efb
CapInh: 0000000000000000
CapPrm: 0000003fffffffff
CapEff: 0000003fffffffff
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
Seccomp:        0
Cpus_allowed:   3f
Cpus_allowed_list:      0-5
Mems_allowed:   1
Mems_allowed_list:      0
voluntary_ctxt_switches:        50
nonvoluntary_ctxt_switches:     29

Checking page references

page_referenced()

mm/rmap.c

/**
 * page_referenced - test if the page was referenced
 * @page: the page to test
 * @is_locked: caller holds lock on the page
 * @memcg: target memory cgroup
 * @vm_flags: collect encountered vma->vm_flags who actually referenced the page
 *
 * Quick test_and_clear_referenced for all mappings to a page,
 * returns the number of ptes which referenced the page.
 */

int page_referenced(struct page *page,
                    int is_locked,
                    struct mem_cgroup *memcg,
                    unsigned long *vm_flags)
{
        int we_locked = 0;
        struct page_referenced_arg pra = {
                .mapcount = total_mapcount(page),
                .memcg = memcg,
        };
        struct rmap_walk_control rwc = {
                .rmap_one = page_referenced_one,
                .arg = (void *)&pra,
                .anon_lock = page_lock_anon_vma_read,
        };

        *vm_flags = 0;
        if (!page_mapped(page))
                return 0;

        if (!page_rmapping(page))
                return 0;

        if (!is_locked && (!PageAnon(page) || PageKsm(page))) {
                we_locked = trylock_page(page);
                if (!we_locked)
                        return 1;
        }

        /*
         * If we are reclaiming on behalf of a cgroup, skip
         * counting on behalf of references from different
         * cgroups
         */
        if (memcg) {
                rwc.invalid_vma = invalid_page_referenced_vma;
        }

        rmap_walk(page, &rwc);
        *vm_flags = pra.vm_flags;

        if (we_locked)
                unlock_page(page);

        return pra.referenced;
}

page_referenced_one()

mm/rmap.c

/*
 * arg: page_referenced_arg will be passed
 */

static bool page_referenced_one(struct page *page, struct vm_area_struct *vma,
                        unsigned long address, void *arg)
{
        struct page_referenced_arg *pra = arg;
        struct page_vma_mapped_walk pvmw = {
                .page = page,
                .vma = vma,
                .address = address,
        };
        int referenced = 0;

        while (page_vma_mapped_walk(&pvmw)) {
                address = pvmw.address;

                if (vma->vm_flags & VM_LOCKED) {
                        page_vma_mapped_walk_done(&pvmw);
                        pra->vm_flags |= VM_LOCKED;
                        return false; /* To break the loop */
                }

                if (pvmw.pte) {
                        if (ptep_clear_flush_young_notify(vma, address,
                                                pvmw.pte)) {
                                /*
                                 * Don't treat a reference through
                                 * a sequentially read mapping as such.
                                 * If the page has been used in another mapping,
                                 * we will catch it; if this other mapping is
                                 * already gone, the unmap path will have set
                                 * PG_referenced or activated the page.
                                 */
                                if (likely(!(vma->vm_flags & VM_SEQ_READ)))
                                        referenced++;
                        }
                } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
                        if (pmdp_clear_flush_young_notify(vma, address,
                                                pvmw.pmd))
                                referenced++;
                } else {
                        /* unexpected pmd-mapped page? */
                        WARN_ON_ONCE(1);
                }

                pra->mapcount--;
        }

        if (referenced)
                clear_page_idle(page);
        if (test_and_clear_page_young(page))
                referenced++;

        if (referenced) {
                pra->referenced++;
                pra->vm_flags |= vma->vm_flags;
        }

        if (!pra->mapcount)
                return false; /* To break the loop */

        return true;
}