문c 블로그

VMPressure

2019-09-17 / 2022-04-26 문영일


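vmpressure analyzes the ratio of scanned pages to reclaimed pages per Memory Control Group (memcg) to derive a memory pressure value, and notifies the vmpressure listeners registered on the memcg with one of three event levels selected by thresholds. The listeners receive these events through an eventfd.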

Event levels

low
- Memory pressure in the given memcg is light.
medium
- Memory pressure in the given memcg is considerable.
critical
- Memory pressure in the given memcg is severe, and the OOM killer is about to run.

vmpressure_win

mm/vmpressure.c

/*
 * The window size (vmpressure_win) is the number of scanned pages before
 * we try to analyze scanned/reclaimed ratio. So the window is used as a
 * rate-limit tunable for the "low" level notification, and also for
 * averaging the ratio for medium/critical levels. Using small window
 * sizes can cause lot of false positives, but too big window size will
 * delay the notifications.
 *
 * As the vmscan reclaimer logic works with chunks which are multiple of
 * SWAP_CLUSTER_MAX, it makes sense to use it for the window size as well.
 *
 * TODO: Make the window size depend on machine size, as we do for vmstat
 * thresholds. Currently we set it to 512 pages (2MB for 4KB pages).
 */

static const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;

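It is set to SWAP_CLUSTER_MAX(32) * 16 = 512 pages.
The window size is the number of scanned pages that must accumulate before the scanned/reclaimed ratio is analyzed.
It rate-limits the low level notification and is also the span over which the ratio is averaged for the medium/critical levels.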

vmpressure_level_med & vmpressure_level_critical

mm/vmpressure.c

/*
 * These thresholds are used when we account memory pressure through
 * scanned/reclaimed ratio. The current values were chosen empirically. In
 * essence, they are percents: the higher the value, the more number
 * unsuccessful reclaims there were.
 */

static const unsigned int vmpressure_level_med = 60;
static const unsigned int vmpressure_level_critical = 95;

vmpressure_level_med
- Threshold of the medium level, used when memory pressure is accounted through the scanned/reclaimed ratio.
vmpressure_level_critical
- Threshold of the critical level, used when memory pressure is accounted through the scanned/reclaimed ratio.

vmpressure_prio()

mm/vmpressure.c

/**
 * vmpressure_prio() - Account memory pressure through reclaimer priority level
 * @gfp:        reclaimer's gfp mask
 * @memcg:      cgroup memory controller handle
 * @prio:       reclaimer's priority
 *
 * This function should be called from the reclaim path every time when
 * the vmscan's reclaiming priority (scanning depth) changes.
 *
 * This function does not return any value.
 */

void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
{
        /*
         * We only use prio for accounting critical level. For more info
         * see comment for vmpressure_level_critical_prio variable above.
         */
        if (prio > vmpressure_level_critical_prio)
                return;

        /*
         * OK, the prio is below the threshold, updating vmpressure
         * information before shrinker dives into long shrinking of long
         * range vmscan. Passing scanned = vmpressure_win, reclaimed = 0
         * to the vmpressure() basically means that we signal 'critical'
         * level.
         */
        vmpressure(gfp, memcg, true, vmpressure_win, 0);
}

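Updates the vmpressure information when the reclaim priority rises and the scan depth deepens.

Code lines 7~8: if the requested priority is lower than vmpressure_level_critical_prio(3), that is prio > 3, the function returns.
- The lower the prio value, the higher the priority.
Code line 17: once prio has dropped to the threshold or below, that is the priority has become high, the vmpressure information is updated before the shrinker starts a long scan. Passing scanned = vmpressure_win and reclaimed = 0 amounts to signalling the critical level.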
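
The threshold check can be tried out in userspace with a minimal sketch. It assumes that the kernel defines vmpressure_level_critical_prio as ilog2(100 / 10), a definition that is not part of the excerpt above, and it uses a hypothetical helper floor_log2() in place of the kernel's ilog2().

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's ilog2(): floor(log2(n)), n >= 1. */
unsigned int floor_log2(unsigned long n)
{
        unsigned int log = 0;

        while (n >>= 1)
                log++;
        return log;
}

/* Assumption: mirrors ilog2(100 / 10) in mm/vmpressure.c, which is 3. */
unsigned int critical_prio(void)
{
        return floor_log2(100 / 10);
}

/*
 * True when vmpressure_prio() would not return early, i.e. when it goes
 * on to report a full window with nothing reclaimed.
 */
bool prio_signals_critical(int prio)
{
        return prio <= (int)critical_prio();
}
```

With these helpers, a prio of 3 or less passes the check, while 4 and above returns early.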

vmpressure()

mm/vmpressure.c

/**
 * vmpressure() - Account memory pressure through scanned/reclaimed ratio
 * @gfp:        reclaimer's gfp mask
 * @memcg:      cgroup memory controller handle
 * @tree:       legacy subtree mode
 * @scanned:    number of pages scanned
 * @reclaimed:  number of pages reclaimed
 *
 * This function should be called from the vmscan reclaim path to account
 * "instantaneous" memory pressure (scanned/reclaimed ratio). The raw
 * pressure index is then further refined and averaged over time.
 *
 * If @tree is set, vmpressure is in traditional userspace reporting
 * mode: @memcg is considered the pressure root and userspace is
 * notified of the entire subtree's reclaim efficiency.
 *
 * If @tree is not set, reclaim efficiency is recorded for @memcg, and
 * only in-kernel users are notified.
 *
 * This function does not return any value.
 */

void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
                unsigned long scanned, unsigned long reclaimed)
{
        struct vmpressure *vmpr = memcg_to_vmpressure(memcg);

        /*
         * Here we only want to account pressure that userland is able to
         * help us with. For example, suppose that DMA zone is under
         * pressure; if we notify userland about that kind of pressure,
         * then it will be mostly a waste as it will trigger unnecessary
         * freeing of memory by userland (since userland is more likely to
         * have HIGHMEM/MOVABLE pages instead of the DMA fallback). That
         * is why we include only movable, highmem and FS/IO pages.
         * Indirect reclaim (kswapd) sets sc->gfp_mask to GFP_KERNEL, so
         * we account it too.
         */
        if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
                return;

        /*
         * If we got here with no pages scanned, then that is an indicator
         * that reclaimer was unable to find any shrinkable LRUs at the
         * current scanning depth. But it does not mean that we should
         * report the critical pressure, yet. If the scanning priority
         * (scanning depth) goes too high (deep), we will be notified
         * through vmpressure_prio(). But so far, keep calm.
         */
        if (!scanned)
                return;

        if (tree) {
                spin_lock(&vmpr->sr_lock);
                scanned = vmpr->tree_scanned += scanned;
                vmpr->tree_reclaimed += reclaimed;
                spin_unlock(&vmpr->sr_lock);

                if (scanned < vmpressure_win)
                        return;
                schedule_work(&vmpr->work);
        } else {
                enum vmpressure_levels level;

                /* For now, no users for root-level efficiency */
                if (!memcg || memcg == root_mem_cgroup)
                        return;

                spin_lock(&vmpr->sr_lock);
                scanned = vmpr->scanned += scanned;
                reclaimed = vmpr->reclaimed += reclaimed;
                if (scanned < vmpressure_win) {
                        spin_unlock(&vmpr->sr_lock);
                        return;
                }
                vmpr->scanned = vmpr->reclaimed = 0;
                spin_unlock(&vmpr->sr_lock);

                level = vmpressure_calc_level(scanned, reclaimed);

                if (level > VMPRESSURE_LOW) {
                        /*
                         * Let the socket buffer allocator know that
                         * we are having trouble reclaiming LRU pages.
                         *
                         * For hysteresis keep the pressure state
                         * asserted for a second in which subsequent
                         * pressure events can occur.
                         */
                        memcg->socket_pressure = jiffies + HZ;
                }
        }
}

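Accounts memory pressure from the scanned/reclaimed ratio.

Code line 4: gets the vmpressure information of the requested memcg.
Code lines 17~18: if none of the highmem, movable, FS, or IO flags is set in the request, no pressure is accounted.
Code lines 28~29: if the scanned argument is 0, the function returns.
Code lines 31~39: accounts pressure in the legacy tree mode. tree_scanned and tree_reclaimed are increased by the given amounts. If the accumulated tree_scanned is still smaller than vmpressure_win, the function returns; otherwise the work registered in vmpr->work is scheduled.
- vmpressure_work_fn()
Code lines 40~45: if @tree is 0, the reclaim efficiency is recorded for @memcg so that in-kernel users can be notified. If no memcg is given, or it is the root memcg, the function returns.
Code lines 47~55: scanned and reclaimed are increased by the given amounts, and if the accumulated scanned is smaller than vmpressure_win, the function returns. Otherwise vmpr's scanned and reclaimed are reset to 0.
Code lines 57~69: if the calculated vmpressure level exceeds VMPRESSURE_LOW, memcg's socket_pressure is set to the jiffies value one second after the current time.
- mem_cgroup_under_socket_pressure() uses this value.
- See: mm: memcontrol: hook up vmpressure to socket pressure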
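
The window accumulation in the non-tree path can be modelled in userspace as follows. This is only a sketch: struct window, WINDOW_PAGES, and window_account() are hypothetical names, and the spinlock that protects the counters in the kernel is left out.

```c
#include <stdbool.h>

#define WINDOW_PAGES 512UL      /* SWAP_CLUSTER_MAX (32) * 16 */

/* Hypothetical stand-in for the counters kept in struct vmpressure. */
struct window {
        unsigned long scanned;
        unsigned long reclaimed;
};

/*
 * Account one reclaim report. Returns true once a full window has been
 * collected; the totals are copied out and the counters are reset, which
 * is the point where the kernel calculates the pressure level.
 */
bool window_account(struct window *w, unsigned long scanned,
                    unsigned long reclaimed,
                    unsigned long *total_scanned,
                    unsigned long *total_reclaimed)
{
        if (!scanned)                           /* nothing scanned */
                return false;

        w->scanned += scanned;
        w->reclaimed += reclaimed;
        if (w->scanned < WINDOW_PAGES)          /* window not full yet */
                return false;

        *total_scanned = w->scanned;
        *total_reclaimed = w->reclaimed;
        w->scanned = w->reclaimed = 0;
        return true;
}
```

Small reports are therefore batched, and a level is only evaluated about once per 512 scanned pages.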

The following figure shows how the vmpressure() function is processed.

Event notification from the workqueue according to vmpressure

vmpressure_work_fn()

mm/vmpressure.c

static void vmpressure_work_fn(struct work_struct *work)
{
        struct vmpressure *vmpr = work_to_vmpressure(work);
        unsigned long scanned;
        unsigned long reclaimed;
        enum vmpressure_levels level;
        bool ancestor = false;
        bool signalled = false;

        spin_lock(&vmpr->sr_lock);
        /*
         * Several contexts might be calling vmpressure(), so it is
         * possible that the work was rescheduled again before the old
         * work context cleared the counters. In that case we will run
         * just after the old work returns, but then scanned might be zero
         * here. No need for any locks here since we don't care if
         * vmpr->reclaimed is in sync.
         */
        scanned = vmpr->tree_scanned;
        if (!scanned) {
                spin_unlock(&vmpr->sr_lock);
                return;
        }

        reclaimed = vmpr->tree_reclaimed;
        vmpr->tree_scanned = 0;
        vmpr->tree_reclaimed = 0;
        spin_unlock(&vmpr->sr_lock);

        level = vmpressure_calc_level(scanned, reclaimed);

        do {
                if (vmpressure_event(vmpr, level, ancestor, signalled))
                        signalled = true;
                ancestor = true;
        } while ((vmpr = vmpressure_parent(vmpr)));
}

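Calculates the memory pressure level and sends an event to the vmpressure listeners that satisfy the level and mode conditions.

Code lines 10~28: reads the tree_scanned and tree_reclaimed values and resets them.
Code line 30: calculates the level from the scanned and reclaimed values.
Code lines 32~36: walks the vmpressure of the hierarchically organized memcgs up to the topmost root, and notifies the vmpressure listeners that satisfy the conditions.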

memcg_to_vmpressure()

mm/memcontrol.c

/* Some nice accessors for the vmpressure. */
struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg)
{
        if (!memcg)
                memcg = root_mem_cgroup;
        return &memcg->vmpressure;   
}

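Returns the vmpressure information of the requested memcg. If no memcg is given, the vmpressure of the root memcg is returned.

The following figure shows how events are sent to the registered vmpressure listeners that match the conditions.

vmpressure event notification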

vmpressure_event()

mm/vmpressure.c

static bool vmpressure_event(struct vmpressure *vmpr,
                             const enum vmpressure_levels level,
                             bool ancestor, bool signalled)
{
        struct vmpressure_event *ev;
        bool ret = false;

        mutex_lock(&vmpr->events_lock);
        list_for_each_entry(ev, &vmpr->events, node) {
                if (ancestor && ev->mode == VMPRESSURE_LOCAL)
                        continue;
                if (signalled && ev->mode == VMPRESSURE_NO_PASSTHROUGH)
                        continue;
                if (level < ev->level)
                        continue;
                eventfd_signal(ev->efd, 1);
                ret = true;
        }
        mutex_unlock(&vmpr->events_lock);

        return ret;
}

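Among the events registered in vmpressure, signals the eventfd of each vmpressure listener application that registered for a level at or below the requested @level.

The following listeners are excluded from notification.
- When @ancestor=1, listeners in local mode are excluded.
- When @signalled=1, listeners in no_passthrough mode are excluded.
- Listeners registered for a level higher than @level are excluded.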
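
The three exclusion rules can be written as a single predicate. The sketch below is a userspace model, not kernel code: the enum and struct names are hypothetical, and the existence of a third "hierarchy" mode that is never filtered out is assumed from the kernel's cgroup documentation rather than taken from the code above.

```c
#include <stdbool.h>

/* Ordered so that a higher value means more pressure, as in the kernel. */
enum level { LEVEL_LOW, LEVEL_MEDIUM, LEVEL_CRITICAL };

/* MODE_HIERARCHY is an assumption; the excerpt only shows the other two. */
enum mode { MODE_NO_PASSTHROUGH, MODE_HIERARCHY, MODE_LOCAL };

/* Hypothetical model of one registered listener. */
struct listener {
        enum level level;       /* level the listener registered for */
        enum mode mode;
};

/* Returns true when the listener's eventfd would be signalled. */
bool listener_wants_event(const struct listener *ev, enum level level,
                          bool ancestor, bool signalled)
{
        if (ancestor && ev->mode == MODE_LOCAL)
                return false;   /* local: only events of its own memcg */
        if (signalled && ev->mode == MODE_NO_PASSTHROUGH)
                return false;   /* a lower memcg already handled it */
        if (level < ev->level)
                return false;   /* pressure below the registered level */
        return true;
}
```

A listener registered for medium in local mode, for example, receives a critical event raised in its own memcg but not one propagated from a descendant.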

The following figure shows the conditions under which an event is delivered to the vmpressure listeners registered on a memcg.

Calculating the vmpressure level

vmpressure_calc_level()

mm/vmpressure.c

static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
                                                    unsigned long reclaimed)
{
        unsigned long scale = scanned + reclaimed;
        unsigned long pressure = 0;

        /*
         * reclaimed can be greater than scanned for things such as reclaimed
         * slab pages. shrink_node() just adds reclaimed pages without a
         * related increment to scanned pages.
         */
        if (reclaimed >= scanned)
                goto out;
        /*
         * We calculate the ratio (in percents) of how many pages were
         * scanned vs. reclaimed in a given time frame (window). Note that
         * time is in VM reclaimer's "ticks", i.e. number of pages
         * scanned. This makes it possible to set desired reaction time
         * and serves as a ratelimit.
         */
        pressure = scale - (reclaimed * scale / scanned);
        pressure = pressure * 100 / scale;

out:
        pr_debug("%s: %3lu  (s: %lu  r: %lu)\n", __func__, pressure,
                 scanned, reclaimed);

        return vmpressure_level(pressure);
}

Calculates the pressure value from the scanned/reclaimed ratio and returns the corresponding level.

Let's check the pressure value and level for a few combinations of scanned and reclaimed page counts. Both divisions in the code are unsigned integer divisions, so the fractional part is truncated.

scanned=5, reclaimed=0
- pressure=100%, level=critical
scanned=5, reclaimed=1
- pressure=83%, level=medium
scanned=5, reclaimed=2
- pressure=71%, level=medium
scanned=5, reclaimed=3
- pressure=50%, level=low
scanned=5, reclaimed=4
- pressure=22%, level=low
scanned=5, reclaimed=5
- pressure=0%, level=low
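
To check such values by hand, the arithmetic of vmpressure_calc_level() can be copied into a small userspace function. The name calc_pressure() is hypothetical; the two statements that compute the value are the same as in the kernel code above, including the truncating integer divisions.

```c
/*
 * Pressure in percent for one window, following the arithmetic of
 * vmpressure_calc_level() without the final level lookup.
 */
unsigned long calc_pressure(unsigned long scanned, unsigned long reclaimed)
{
        unsigned long scale = scanned + reclaimed;
        unsigned long pressure;

        /* Everything scanned was reclaimed (or more): no pressure. */
        if (reclaimed >= scanned)
                return 0;

        pressure = scale - (reclaimed * scale / scanned);
        return pressure * 100 / scale;
}
```

Calling it with scanned=5 and reclaimed running from 0 to 5 shows how quickly the value falls as reclaim becomes more successful.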

The following figure shows how the pressure value is calculated from the scanned/reclaimed ratio and how the level is determined from it.

vmpressure_level()

mm/vmpressure.c

static enum vmpressure_levels vmpressure_level(unsigned long pressure)
{
        if (pressure >= vmpressure_level_critical)
                return VMPRESSURE_CRITICAL;
        else if (pressure >= vmpressure_level_med)
                return VMPRESSURE_MEDIUM;
        return VMPRESSURE_LOW;
}

Returns the level that corresponds to @pressure.

critical
- Default value of 95% or more
med
- Default value of 60% or more
low
- Everything else
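
The boundaries are easy to get wrong by one, so here is a minimal userspace sketch of the same comparison. The enum and the function name classify_pressure() are hypothetical; the thresholds 60 and 95 are the defaults quoted earlier.

```c
enum pressure_level { LEVEL_LOW, LEVEL_MEDIUM, LEVEL_CRITICAL };

/* Thresholds as in vmpressure_level_med and vmpressure_level_critical. */
#define LEVEL_MED_PERCENT       60UL
#define LEVEL_CRITICAL_PERCENT  95UL

/* Map a pressure percentage to a level; both bounds are inclusive. */
enum pressure_level classify_pressure(unsigned long pressure)
{
        if (pressure >= LEVEL_CRITICAL_PERCENT)
                return LEVEL_CRITICAL;
        if (pressure >= LEVEL_MED_PERCENT)
                return LEVEL_MEDIUM;
        return LEVEL_LOW;
}
```

Note that exactly 60 already counts as medium and exactly 95 as critical.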

Event receiver demo program

cgroup_event_listener

Running make in tools/cgroup builds the following source and produces the cgroup_event_listener binary.

tools/cgroup/cgroup_event_listener.c

#include <assert.h>
#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <libgen.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#include <sys/eventfd.h>

#define USAGE_STR "Usage: cgroup_event_listener <path-to-control-file> <args>"

int main(int argc, char **argv)
{
        int efd = -1;
        int cfd = -1;
        int event_control = -1;
        char event_control_path[PATH_MAX];
        char line[LINE_MAX];
        int ret;

        if (argc != 3)
                errx(1, "%s", USAGE_STR);

        cfd = open(argv[1], O_RDONLY);
        if (cfd == -1)
                err(1, "Cannot open %s", argv[1]);

        ret = snprintf(event_control_path, PATH_MAX, "%s/cgroup.event_control",
                        dirname(argv[1]));
        if (ret >= PATH_MAX)
                errx(1, "Path to cgroup.event_control is too long");

        event_control = open(event_control_path, O_WRONLY);
        if (event_control == -1)
                err(1, "Cannot open %s", event_control_path);

        efd = eventfd(0, 0);
        if (efd == -1)
                err(1, "eventfd() failed");

        ret = snprintf(line, LINE_MAX, "%d %d %s", efd, cfd, argv[2]);
        if (ret >= LINE_MAX)
                errx(1, "Arguments string is too long");

        ret = write(event_control, line, strlen(line) + 1);
        if (ret == -1)
                err(1, "Cannot write to cgroup.event_control");

        while (1) {
                uint64_t result;

                ret = read(efd, &result, sizeof(result));
                if (ret == -1) {
                        if (errno == EINTR)
                                continue;
                        err(1, "Cannot read from eventfd");
                }
                assert(ret == sizeof(result));

                ret = access(event_control_path, W_OK);
                if ((ret == -1) && (errno == ENOENT)) {
                        puts("The cgroup seems to have removed.");
                        break;
                }

                if (ret == -1)
                        err(1, "cgroup.event_control is not accessible any more");

                printf("%s %s: crossed\n", argv[1], argv[2]);
        }

        return 0;
}

Usage

The following sets up a listener that receives an event when the pressure level is medium. The commands are run as root.

See: memcg: Add memory.pressure_level events | LWN.net

# cd /sys/fs/cgroup/memory/
# mkdir foo
# cd foo
# cgroup_event_listener memory.pressure_level medium &
# echo 8000000 > memory.limit_in_bytes
# echo 8000000 > memory.memsw.limit_in_bytes
# echo $$ > tasks
# dd if=/dev/zero | read x

PSI (Pressure Stall Information)

PSI was introduced in 2018 in kernel v4.20-rc1 so that user applications that want to monitor memory pressure (such as Android's lmkd) can catch the pressure level experienced during memory reclaim.

References

Memory Resource Controller (Documentation/cgroups/memory.txt) | LWN.net
Low Memory Killer Daemon | Android

Rmap -1- (Reverse Mapping)

2019-09-10 / 2022-06-06 문영일


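Rmap is a method of reverse-mapping from a physical address back to the virtual addresses that map it. rmap_walk uses the rmap to find every VM and virtual address that uses a physical page, and the following APIs request various operations on the mappings found this way (VM, virtual address, pte).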

try_to_unmap()
page_mkclean()
page_referenced()
try_to_migrate()
page_mlock()
page_make_device_exclusive()
remove_migration_ptes()
page_idle_clear_pte_refs()
try_to_munlock()

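The following figure shows the concept of rmap, with forward mapping and reverse mapping.

rmap types

As the following figure shows, four types of rmap can be built depending on the kind of user page.

Anonymous mapping
- Used when user stack, user heap, and CoW pages are allocated.
- Item 1) in the figure below
KSM mapping
- In a VM_MERGEABLE region, user anon pages whose physical memory contents are identical can be merged.
- Item 2) in the figure below
File mapping
- Used when code, libraries, data files, shared memory, devices, and so on are loaded and allocated.
- Item 3) in the figure below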
non-lru movable mapping
- Used when pages of a file system that supports non-lru movable pages are mapped; these pages can be migrated.
- Item 4) in the figure below
- Added in 2016 in kernel v4.8-rc1.
  - See: mm: migrate: support non-lru movable page migration (2016, v4.8-rc1)

The mapping member of struct page points to one of the structures above, and the type is distinguished by flags in its lower 2 bits.

anonymous page
- PAGE_MAPPING_ANON(1)
ksm page
- PAGE_MAPPING_KSM(3) = PAGE_MAPPING_ANON(1) + PAGE_MAPPING_MOVABLE(2)
file page
- Both lower bits are 0
non-lru movable page
- PAGE_MAPPING_MOVABLE(2)
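
Decoding these two bits can be sketched in userspace as below. The flag values are the ones listed above; PAGE_MAPPING_FLAGS as the name of the combined mask is assumed from the kernel headers, and enum mapping_kind, decode_mapping_kind(), and mapping_pointer() are hypothetical names for this example.

```c
#include <stdint.h>

#define PAGE_MAPPING_ANON       0x1UL
#define PAGE_MAPPING_MOVABLE    0x2UL
#define PAGE_MAPPING_KSM        (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
#define PAGE_MAPPING_FLAGS      (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)

enum mapping_kind {
        MAPPING_FILE,
        MAPPING_ANON,
        MAPPING_MOVABLE,
        MAPPING_KSM,
};

/* Classify a page->mapping value by its lower 2 bits. */
enum mapping_kind decode_mapping_kind(uintptr_t mapping)
{
        switch (mapping & PAGE_MAPPING_FLAGS) {
        case PAGE_MAPPING_ANON:
                return MAPPING_ANON;
        case PAGE_MAPPING_MOVABLE:
                return MAPPING_MOVABLE;
        case PAGE_MAPPING_KSM:
                return MAPPING_KSM;
        default:
                return MAPPING_FILE;
        }
}

/* Strip the flag bits to recover the address of the pointed-to structure. */
uintptr_t mapping_pointer(uintptr_t mapping)
{
        return mapping & ~(uintptr_t)PAGE_MAPPING_FLAGS;
}
```

The scheme works because the pointed-to structures are aligned to at least 4 bytes, so the lower 2 bits of a real pointer are always 0.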

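Fault handler and VMA creation

When user memory allocation is requested, the kernel creates or extends a VMA to mark that the virtual address range is in use, and does not allocate physical memory yet. A fault then occurs when the user region is first accessed, and the fault handler looks up which VMA in the user address space contains the faulting address. If the address is valid, the physical page is allocated at that point, which is a lazy allocation policy. The page table entry is updated for the forward mapping, and the reverse mapping (rmap) is added as well.

The following figure shows how, after a page allocation request, the fault handler performs page allocation and mapping/reverse mapping when the memory is actually used.

The following figure shows how the first VMA is created through the fault handler.

Fault handler and rmap creation

The following figure briefly shows the state of an anon rmap.

Forward mapping uses several levels of page tables (pgd -> p4d -> pud -> pmd -> pte).
For reverse mapping, each anon page is represented by being connected to an AV (struct anon_vma).
- In the figure below, the VMA and AV are connected through an AVC (anon_vma_chain), but the AVC is not drawn.
- When connected, page->mapping holds the address of the anon_vma structure with the PAGE_MAPPING_ANON flag added.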

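Virtual address space management (VMA management)

Each user process has its own virtual address space, and a page table is used to map it. Multiple virtual address regions requested by the user are registered in the user virtual address space; each of them is a VMA, represented by an allocated vm_area_struct.

Note that when a process has child threads, the child threads are also managed by task_struct. However, they share the user address space of the process, so all child threads have the same mm_struct.

VMAs are managed with two data structures, an RB tree and a list, because using the two together handles searches for free address ranges efficiently.

RB tree
- Allows a fast lookup by start address.
Linear linked list
- The VMAs are kept sorted by address, so the size of the free space between neighboring VMAs can be found quickly.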
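
Why an address-sorted list makes the free space easy to find can be seen in a small userspace model. This is only an illustration of the idea, not the kernel's search algorithm; struct range and find_gap() are hypothetical names, and an array stands in for the linked list.

```c
#include <stddef.h>

/* Hypothetical model of a VMA: the address range [start, end). */
struct range {
        unsigned long start;
        unsigned long end;
};

/*
 * Walk address-sorted, non-overlapping ranges and return the start of the
 * first gap between two neighbors that is at least len bytes long.
 * Returns 0 when no such gap exists.
 */
unsigned long find_gap(const struct range *vmas, size_t n, unsigned long len)
{
        size_t i;

        for (i = 0; i + 1 < n; i++) {
                unsigned long gap = vmas[i + 1].start - vmas[i].end;

                if (gap >= len)
                        return vmas[i].end;
        }
        return 0;
}
```

Because neighbors in the list are also neighbors in the address space, each gap is just the difference between one range's end and the next range's start.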

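The following figure shows three virtual address regions (VMAs) registered and managed in a user address space.

The following shows the VMAs registered in a process (pid 1481). Look for the anon VMAs that are not associated with a file.

The fields, in order, are:
- Start virtual address (vma->vm_start)
- End virtual address (vma->vm_end)
- Permission flags (VM_READ, VM_WRITE, VM_EXEC, and VM_MAYSHARE, shown as r, w, x, and s or p)
- Offset within the file if a file is mapped (vma->vm_pgoff << PAGE_SHIFT)
- Major and minor device numbers if a file is mapped
- inode number if a file is mapped
- File name if a file is mapped; for an anon vma, blank or a name in brackets []
  - The heap, vvar, vdso, and stack regions are anon vmas that are specially marked with brackets [].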

$ cat /proc/1481/maps
00400000-004ec000 r-xp 00000000 b3:05 26                                 /bin/bash
004fb000-004ff000 r--p 000eb000 b3:05 26                                 /bin/bash
004ff000-00508000 rw-p 000ef000 b3:05 26                                 /bin/bash
00508000-00512000 rw-p 00000000 00:00 0
30d7f000-30f1d000 rw-p 00000000 00:00 0                                  [heap]
7f87324000-7f8732d000 r-xp 00000000 b3:05 2173                           /lib/aarch64-linux-gnu/libnss_files-2.24.so
7f8732d000-7f8733c000 ---p 00009000 b3:05 2173                           /lib/aarch64-linux-gnu/libnss_files-2.24.so
7f8733c000-7f8733d000 r--p 00008000 b3:05 2173                           /lib/aarch64-linux-gnu/libnss_files-2.24.so
7f8733d000-7f8733e000 rw-p 00009000 b3:05 2173                           /lib/aarch64-linux-gnu/libnss_files-2.24.so
7f8733e000-7f87344000 rw-p 00000000 00:00 0
7f87344000-7f8734d000 r-xp 00000000 b3:05 2197                           /lib/aarch64-linux-gnu/libnss_nis-2.24.so
7f87370000-7f8737f000 ---p 00012000 b3:05 2243                           /lib/aarch64-linux-gnu/libnsl-2.24.so
(...omitted...)
7f8771b000-7f8771c000 r--p 00000000 b3:05 7362                           /usr/lib/locale/C.UTF-8/LC_IDENTIFICATION
7f8771c000-7f87721000 rw-p 00000000 00:00 0
7f87721000-7f87722000 r--p 00000000 00:00 0                              [vvar]
7f87722000-7f87723000 r-xp 00000000 00:00 0                              [vdso]
7f87723000-7f87724000 r--p 0001c000 b3:05 2253                           /lib/aarch64-linux-gnu/ld-2.24.so
7f87724000-7f87726000 rw-p 0001d000 b3:05 2253                           /lib/aarch64-linux-gnu/ld-2.24.so
7fcdc27000-7fcdc48000 rw-p 00000000 00:00 0                              [stack]
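
A line of this output can be split into the fields listed above with sscanf(). The following is a minimal sketch; struct maps_entry and parse_maps_line() are hypothetical names, and a path that contains spaces is cut at the first space.

```c
#include <stdio.h>

/* Hypothetical container for the fields of one /proc/<pid>/maps line. */
struct maps_entry {
        unsigned long long start;       /* vma->vm_start */
        unsigned long long end;         /* vma->vm_end */
        char perms[5];                  /* e.g. "r-xp" */
        unsigned long long offset;      /* file offset in bytes */
        unsigned int dev_major;
        unsigned int dev_minor;
        unsigned long inode;
        char path[256];                 /* empty for unnamed anon vmas */
};

/* Returns 0 on success, -1 if the line does not look like a maps entry. */
int parse_maps_line(const char *line, struct maps_entry *e)
{
        int n;

        e->path[0] = '\0';
        n = sscanf(line, "%llx-%llx %4s %llx %x:%x %lu %255s",
                   &e->start, &e->end, e->perms, &e->offset,
                   &e->dev_major, &e->dev_minor, &e->inode, e->path);

        /* The path is optional, so 7 converted fields are enough. */
        return n >= 7 ? 0 : -1;
}
```

An entry whose path stays empty and whose inode is 0 is one of the unnamed anon vmas.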

The following command shows the information above in more detail.

$ cat /proc/1481/smaps
00400000-004ec000 r-xp 00000000 b3:05 26                                 /bin/bash
Size:                944 kB
Rss:                 936 kB
Pss:                 311 kB
Shared_Clean:        936 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:          936 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd ex mr mw me dw
(...omitted...)
7fcdc27000-7fcdc48000 rw-p 00000000 00:00 0                              [stack]
Size:                132 kB
Rss:                  32 kB
Pss:                  32 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:        32 kB
Referenced:           32 kB
Anonymous:            32 kB
AnonHugePages:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd wr mr mw me gd ac

VM flags

The following are the VM flags used in vma->vm_flags.

The two letters on the right are the abbreviations printed in the VmFlags line of smaps.

VM_READ         rd
VM_WRITE        wr
VM_EXEC         ex
VM_SHARED       sh
VM_MAYREAD      mr
VM_MAYWRITE     mw
VM_MAYEXEC      me
VM_MAYSHARE     ms
VM_GROWSDOWN    gd
VM_UFFD_MISSING 
VM_PFNMAP       pf
VM_DENYWRITE    dw
VM_UFFD_WP      
VM_LOCKED       lo
VM_IO           io
VM_SEQ_READ     sr
VM_RAND_READ    rr
VM_DONTCOPY     dc
VM_DONTEXPAND   de
VM_LOCKONFAULT  
VM_ACCOUNT      ac
VM_NORESERVE    nr
VM_HUGETLB      ht
VM_SYNC         sf
VM_ARCH_1       ar
VM_WIPEONFORK   wf
VM_DONTDUMP     dd
VM_SOFTDIRTY    sd
VM_MIXEDMAP     mm
VM_HUGEPAGE     hg
VM_NOHUGEPAGE   nh
VM_MERGEABLE    mg

anonymous type VMA

An AV (struct anon_vma) is used to manage anonymous type VMAs (vm_area_struct), and a VMA and an AV are connected through an AVC (struct anon_vma_chain).

An AV uses an RB tree so that it can hold many VMAs.
- In September 2012, kernel v3.7-rc1 replaced the linear linked list used in the AV with an interval tree built on an RB tree.
- See: mm anon rmap: replace same_anon_vma linked list with an interval tree.
A VMA still uses a linear linked list to hold its AVs.

The following figure shows how an AVC is used to connect a VMA and an AV.

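anon_vma merging

When VMAs are adjacent and have similar attributes, the existing anon_vma can be used as is instead of creating a separate one.

Management in a forked child process

The following figure shows the relationship with a forked child process.

The following figure shows child processes being forked twice: the parent VMA is cloned, an AV is created, and the two are linked.

The following figure shows how the parent relationship of AVs is represented.

The following figure shows how the relationship to the first created root AV is represented.

The following figure shows the 1:1 relationship in which a VMA points directly to its AV.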

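Efficient unmapping with rmap

If only forward mappings were used to remove the mappings of a shared page, every one of the many user page tables would have to be searched. Using the reverse mapping to find the VMAs (virtual address regions) and unmapping only in the user page tables used by those VMAs makes the removal fast.

The following figure shows how VMAs are managed so that VMA #1 and #2 can be found when removing the mapping of a page shared between a parent process and a forked child process.

When child process B is forked, VMA #1 is cloned to create VMA #2.
Pages newly allocated by child process B are created and managed in VMA #2 and VMA #3 respectively.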

anon_vma initialization

anon_vma_init()

mm/rmap.c

void __init anon_vma_init(void)
{
        anon_vma_cachep = kmem_cache_create("anon_vma", sizeof(struct anon_vma),
                        0, SLAB_TYPESAFE_BY_RCU|SLAB_PANIC|SLAB_ACCOUNT,
                        anon_vma_ctor);
        anon_vma_chain_cachep = KMEM_CACHE(anon_vma_chain,
                        SLAB_PANIC|SLAB_ACCOUNT);
}

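Creates the slab caches used to allocate the anon_vma and anon_vma_chain structures.

Code line 3: creates a slab cache for allocating anon_vma structures, with anon_vma_ctor() called as the constructor when objects are initialized.
Code line 6: creates a slab cache for allocating anon_vma_chain structures.

anon_vma allocation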

anon_vma_alloc()

mm/rmap.c

static inline struct anon_vma *anon_vma_alloc(void)
{
        struct anon_vma *anon_vma;

        anon_vma = kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
        if (anon_vma) {
                atomic_set(&anon_vma->refcount, 1);
                anon_vma->degree = 1;   /* Reference for first vma */
                anon_vma->parent = anon_vma;
                /*
                 * Initialise the anon_vma root to point to itself. If called
                 * from fork, the root will be reset to the parents anon_vma.
                 */
                anon_vma->root = anon_vma;
        }

        return anon_vma;
}

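Allocates an anon_vma.

The degree member of the allocated anon_vma is set to 1 so that it can be identified as standing at the very front, and parent and root are also made to point to the anon_vma itself.

anon_vma release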

anon_vma_free()

mm/rmap.c

static inline void anon_vma_free(struct anon_vma *anon_vma)
{
        VM_BUG_ON(atomic_read(&anon_vma->refcount));

        /*
         * Synchronize against page_lock_anon_vma_read() such that
         * we can safely hold the lock without the anon_vma getting
         * freed.
         *
         * Relies on the full mb implied by the atomic_dec_and_test() from
         * put_anon_vma() against the acquire barrier implied by
         * down_read_trylock() from page_lock_anon_vma_read(). This orders:
         *
         * page_lock_anon_vma_read()    VS      put_anon_vma()
         *   down_read_trylock()                  atomic_dec_and_test()
         *   LOCK                                 MB
         *   atomic_read()                        rwsem_is_locked()
         *
         * LOCK should suffice since the actual taking of the lock must
         * happen _before_ what follows.
         */
        might_sleep();
        if (rwsem_is_locked(&anon_vma->root->rwsem)) {
                anon_vma_lock_write(anon_vma);
                anon_vma_unlock_write(anon_vma);
        }

        kmem_cache_free(anon_vma_cachep, anon_vma);
}

Frees an anon_vma.

anon_vma preparation

Many places request that an anon_vma be prepared when a free page is prepared for user use. The main routines are as follows.

Memory allocation on fault
- do_cow_fault()
- wp_page_copy()
- do_anonymous_page()
Stack expansion
- expand_upwards()
- expand_downwards()
migration
- migrate_vma_insert_page()

anon_vma_prepare()

include/linux/rmap.h

static inline int anon_vma_prepare(struct vm_area_struct *vma)
{
        if (likely(vma->anon_vma))
                return 0;

        return __anon_vma_prepare(vma);
}

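Prepares an anon_vma for the vma region.

Code lines 3~4: if the vma already has an anon_vma prepared and in use, success (0) is returned.
Code line 6: prepares the anon_vma data structure that manages the anon pages of the vma and attaches it.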

__anon_vma_prepare()

mm/rmap.c

/**
 * __anon_vma_prepare - attach an anon_vma to a memory region
 * @vma: the memory region in question
 *
 * This makes sure the memory mapping described by 'vma' has
 * an 'anon_vma' attached to it, so that we can associate the
 * anonymous pages mapped into it with that anon_vma.
 *
 * The common case will be that we already have one, which
 * is handled inline by anon_vma_prepare(). But if
 * not we either need to find an adjacent mapping that we
 * can re-use the anon_vma from (very common when the only
 * reason for splitting a vma has been mprotect()), or we
 * allocate a new one.
 *
 * Anon-vma allocations are very subtle, because we may have
 * optimistically looked up an anon_vma in page_lock_anon_vma_read()
 * and that may actually touch the rwsem even in the newly
 * allocated vma (it depends on RCU to make sure that the
 * anon_vma isn't actually destroyed).
 *
 * As a result, we need to do proper anon_vma locking even
 * for the new allocation. At the same time, we do not want
 * to do any locking for the common case of already having
 * an anon_vma.
 *
 * This must be called with the mmap_lock held for reading.
 */

int __anon_vma_prepare(struct vm_area_struct *vma)
{
        struct mm_struct *mm = vma->vm_mm;
        struct anon_vma *anon_vma, *allocated;
        struct anon_vma_chain *avc;

        might_sleep();

        avc = anon_vma_chain_alloc(GFP_KERNEL);
        if (!avc)
                goto out_enomem;

        anon_vma = find_mergeable_anon_vma(vma);
        allocated = NULL;
        if (!anon_vma) {
                anon_vma = anon_vma_alloc();
                if (unlikely(!anon_vma))
                        goto out_enomem_free_avc;
                allocated = anon_vma;
        }

        anon_vma_lock_write(anon_vma);
        /* page_table_lock to protect against threads */
        spin_lock(&mm->page_table_lock);
        if (likely(!vma->anon_vma)) {
                vma->anon_vma = anon_vma;
                anon_vma_chain_link(vma, avc, anon_vma);
                /* vma reference or self-parent link for new root */
                anon_vma->degree++;
                allocated = NULL;
                avc = NULL;
        }
        spin_unlock(&mm->page_table_lock);
        anon_vma_unlock_write(anon_vma);

        if (unlikely(allocated))
                put_anon_vma(allocated);
        if (unlikely(avc))
                anon_vma_chain_free(avc);

        return 0;

 out_enomem_free_avc:
        anon_vma_chain_free(avc);
 out_enomem:
        return -ENOMEM;
}

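Prepares the anon_vma data structure that manages the anon pages of the vma and attaches it. The anon_vma is connected to the vma through an anon_vma_chain.

Code lines 9~11: allocates an anon_vma_chain.
Code lines 13~20: finds a mergeable anon_vma from a neighboring vma region, or allocates a new anon_vma if none is found.
Code lines 22~34: after taking the locks, if this is the first anon_vma being connected to the vma, the anon_vma and anon_vma_chain are attached to the vma.
Code lines 36~39: in the unlikely case that the allocated anon_vma and anon_vma_chain could not be attached to the vma, they are freed.

The following figure shows how an AV (anon_vma) is prepared, either by creating a new one or by using an existing AV.

The following figure shows how, on the first use of an anon_vma, an AV (anon_vma) is allocated and connected to the VMA (vm_area_struct) through an AVC (anon_vma_chain).

Managing the degree & refcount of an AV (anon_vma)

Lifetime management of an AV (anon_vma)

refcount
- A reference counter; the AV is destroyed when it reaches 0.
- Increases each time a VMA is connected through an AVC (anon_vma_chain).
degree
- Increases each time the AV is designated as the owner of a VMA (the AV set in vma->anon_vma).
- It is 1 when the AV is first created, but since the AV is immediately connected to a VMA and designated as its owner, it effectively starts at 2.
- When this value becomes 1, the AV can be reused.
  - This is the case where only one forked child process is running and the parent process has exited.
  - It has no vma and only one anon_vma child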
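
The way degree moves from 1 to 2 can be followed with a small userspace model of anon_vma_alloc() and the attach step of __anon_vma_prepare(). Everything here is a simplified sketch: struct av, struct vma, av_init(), and vma_set_owner() are hypothetical names, and locking, the anon_vma_chain, and the slab allocation are left out.

```c
#include <stddef.h>

/* Hypothetical, reduced model of struct anon_vma. */
struct av {
        int refcount;
        unsigned int degree;
        struct av *parent;
        struct av *root;
};

/* Hypothetical, reduced model of struct vm_area_struct. */
struct vma {
        struct av *anon_vma;
};

/* Same initial state as anon_vma_alloc() sets up. */
void av_init(struct av *av)
{
        av->refcount = 1;
        av->degree = 1;         /* reference for the first vma */
        av->parent = av;
        av->root = av;
}

/*
 * Make av the owner of vma, as in the locked section of
 * __anon_vma_prepare(). Returns 1 if attached, 0 if the vma already
 * had an owner (in which case nothing changes).
 */
int vma_set_owner(struct vma *vma, struct av *av)
{
        if (vma->anon_vma)
                return 0;

        vma->anon_vma = av;
        av->degree++;
        return 1;
}
```

In this model a second VMA that reuses the same AV through merging raises degree to 3.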

The following figure shows how the degree and refcount of an AV change.

The following figure shows how the degree and refcount of a merged AV change.

find_mergeable_anon_vma()

mm/mmap.c

/*
 * find_mergeable_anon_vma is used by anon_vma_prepare, to check
 * neighbouring vmas for a suitable anon_vma, before it goes off
 * to allocate a new anon_vma.  It checks because a repetitive
 * sequence of mprotects and faults may otherwise lead to distinct
 * anon_vmas being allocated, preventing vma merge in subsequent
 * mprotect.
 */

struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
{
        struct anon_vma *anon_vma = NULL;

        /* Try next first. */
        if (vma->vm_next) {
                anon_vma = reusable_anon_vma(vma->vm_next, vma, vma->vm_next);
                if (anon_vma)
                        return anon_vma;
        }

        /* Try prev next. */
        if (vma->vm_prev)
                anon_vma = reusable_anon_vma(vma->vm_prev, vma->vm_prev, vma);

        /*
         * We might reach here with anon_vma == NULL if we can't find
         * any reusable anon_vma.
         * There's no absolute need to look only at touching neighbours:
         * we could search further afield for "compatible" anon_vmas.
         * But it would probably just be a waste of time searching,
         * or lead to too many vmas hanging off the same anon_vma.
         * We're trying to allow mprotect remerging later on,
         * not trying to minimize memory used for anon_vmas.
         */
        return anon_vma;
}

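If a neighbouring vma has a mergeable anon_vma, that anon_vma is returned. If there is none, NULL is returned. (When NULL is returned, the caller creates a new anon_vma.)

In code lines 6~10, if the vma area that follows @vma can share its anon_vma, the anon_vma used by that neighbour is obtained.
In code lines 13~14, if the vma area that precedes @vma can share its anon_vma, the anon_vma used by that neighbour is obtained.
In code line 26, the anon_vma is returned.

The following figure shows a mergeable anon_vma being returned from the two VMAs that neighbour the vma.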

reusable_anon_vma()

mm/mmap.c

/*
 * Do some basic sanity checking to see if we can re-use the anon_vma
 * from 'old'. The 'a'/'b' vma's are in VM order - one of them will be
 * the same as 'old', the other will be the new one that is trying
 * to share the anon_vma.
 *
 * NOTE! This runs with mm_sem held for reading, so it is possible that
 * the anon_vma of 'old' is concurrently in the process of being set up
 * by another page fault trying to merge _that_. But that's ok: if it
 * is being set up, that automatically means that it will be a singleton
 * acceptable for merging, so we can do all of this optimistically. But
 * we do that READ_ONCE() to make sure that we never re-load the pointer.
 *
 * IOW: that the "list_is_singular()" test on the anon_vma_chain only
 * matters for the 'stable anon_vma' case (ie the thing we want to avoid
 * is to return an anon_vma that is "complex" due to having gone through
 * a fork).
 *
 * We also make sure that the two vma's are compatible (adjacent,
 * and with the same memory policies). That's all stable, even with just
 * a read lock on the mm_sem.
 */

static struct anon_vma *reusable_anon_vma(struct vm_area_struct *old,
                struct vm_area_struct *a, struct vm_area_struct *b)
{
        if (anon_vma_compatible(a, b)) {
                struct anon_vma *anon_vma = READ_ONCE(old->anon_vma);

                if (anon_vma && list_is_singular(&old->anon_vma_chain))
                        return anon_vma;
        }
        return NULL;
}

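If the anon areas @a and @b can share an anon_vma, the anon_vma of the @old area is looked up and returned.

In code line 4, the case where the anon areas @a and @b are compatible is handled.
In code lines 5~8, if the @old area has an anon_vma and only one avc is registered on the @old area, the anon_vma is returned.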

anon_vma_compatible()

mm/mmap.c

/*
 * Rough compatibility check to quickly see if it's even worth looking
 * at sharing an anon_vma.
 *
 * They need to have the same vm_file, and the flags can only differ
 * in things that mprotect may change.
 *
 * NOTE! The fact that we share an anon_vma doesn't _have_ to mean that
 * we can merge the two vma's. For example, we refuse to merge a vma if
 * there is a vm_ops->close() function, because that indicates that the
 * driver is doing some kind of reference counting. But that doesn't
 * really matter for the anon_vma sharing case.
 */

static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct *b)
{
        return a->vm_end == b->vm_start &&
                mpol_equal(vma_policy(a), vma_policy(b)) &&
                a->vm_file == b->vm_file &&
                !((a->vm_flags ^ b->vm_flags) & ~(VM_ACCESS_FLAGS | VM_SOFTDIRTY)) &&
                b->vm_pgoff == a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT);
}

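Returns whether the two anon areas can be merged and used together. It returns 1 (mergeable) only when all of the following conditions hold.

Area b immediately follows area a, so the two areas are adjacent.
The two areas have the same vma policy.
Apart from read, write, exec and softdirty, the two areas use the same flags.
- #define VM_ACCESS_FLAGS (VM_READ | VM_WRITE | VM_EXEC)
The two areas have the same vm_file, and their vm_pgoff values are consecutive in a-then-b order.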
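
The conditions above can be tried out in user space with a simplified model. This is only a sketch: `struct toy_vma`, the `TOY_*` constants and `toy_compatible()` are hypothetical stand-ins, 4KB pages are assumed, and the memory-policy and softdirty checks are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGE_SHIFT 12               /* assumption: 4KB pages */
#define TOY_VM_READ    0x1UL
#define TOY_VM_WRITE   0x2UL
#define TOY_VM_EXEC    0x4UL
#define TOY_VM_OTHER   0x8UL            /* stands in for any non-access flag */
#define TOY_VM_ACCESS_FLAGS (TOY_VM_READ | TOY_VM_WRITE | TOY_VM_EXEC)

/* Hypothetical, heavily reduced version of vm_area_struct. */
struct toy_vma {
        unsigned long vm_start, vm_end; /* covers [vm_start, vm_end) */
        unsigned long vm_pgoff;         /* offset in pages */
        unsigned long vm_flags;
        const void *vm_file;            /* NULL for anonymous memory */
};

/* Same shape as the checks listed above; mempolicy is not modelled. */
static bool toy_compatible(const struct toy_vma *a, const struct toy_vma *b)
{
        return a->vm_end == b->vm_start &&
               a->vm_file == b->vm_file &&
               !((a->vm_flags ^ b->vm_flags) & ~TOY_VM_ACCESS_FLAGS) &&
               b->vm_pgoff == a->vm_pgoff +
                        ((b->vm_start - a->vm_start) >> TOY_PAGE_SHIFT);
}
```

With a = [0x1000, 0x3000) at pgoff 1 and b = [0x3000, 0x4000), b is compatible only when its pgoff is 3, because a spans two pages.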

The following figure shows how it is determined whether two VMA areas in the virtual address space can be used together.

Linking with anon_vma_chain

anon_vma_chain_alloc()

mm/rmap.c

static inline struct anon_vma_chain *anon_vma_chain_alloc(gfp_t gfp)
{
        return kmem_cache_alloc(anon_vma_chain_cachep, gfp);
}

Allocates an anon_vma_chain structure from the kmem cache prepared for it.

anon_vma_chain_free()

mm/rmap.c

static void anon_vma_chain_free(struct anon_vma_chain *anon_vma_chain)
{
        kmem_cache_free(anon_vma_chain_cachep, anon_vma_chain);
}

Frees the anon_vma_chain structure and returns it to the kmem cache.

anon_vma_chain_link()

mm/rmap.c

static void anon_vma_chain_link(struct vm_area_struct *vma,
                                struct anon_vma_chain *avc,
                                struct anon_vma *anon_vma)
{
        avc->vma = vma;
        avc->anon_vma = anon_vma;
        list_add(&avc->same_vma, &vma->anon_vma_chain);
        anon_vma_interval_tree_insert(avc, &anon_vma->rb_root);
}

Links a vma and an anon_vma through an avc.

In code lines 5~6, @avc is set to point to @vma and @anon_vma.
In code line 7, @avc is added to the anon_vma_chain list of the vma.
In code line 8, @avc is added to the RB tree (interval tree) of @anon_vma.
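
The key idea is that one avc is a member of two containers at once. The sketch below models that in user space under simplifying assumptions: all `toy_*` names are hypothetical, and a plain singly linked list stands in for the kernel's interval tree on the anon_vma side.

```c
#include <assert.h>
#include <stddef.h>

struct toy_avc;

struct toy_vma      { struct toy_avc *chain; };    /* AVCs of this VMA */
struct toy_anon_vma { struct toy_avc *members; };  /* AVCs of this AV (a tree in the kernel) */

struct toy_avc {
        struct toy_vma *vma;
        struct toy_anon_vma *anon_vma;
        struct toy_avc *next_same_vma;       /* link in vma->chain */
        struct toy_avc *next_same_anon_vma;  /* link in anon_vma->members */
};

static void toy_chain_link(struct toy_vma *vma, struct toy_avc *avc,
                           struct toy_anon_vma *av)
{
        avc->vma = vma;
        avc->anon_vma = av;
        avc->next_same_vma = vma->chain;        /* push on the VMA side */
        vma->chain = avc;
        avc->next_same_anon_vma = av->members;  /* push on the AV side */
        av->members = avc;
}

/* Reverse lookup: count the VMAs reachable from one anon_vma. */
static int toy_count_vmas(const struct toy_anon_vma *av)
{
        int n = 0;

        for (const struct toy_avc *c = av->members; c; c = c->next_same_anon_vma)
                n++;
        return n;
}
```

Linking a parent VMA and a child VMA to the same toy anon_vma makes both reachable from it, which is the property the reverse mapping relies on.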

The following figure shows a vma and an anon_vma being linked through an avc.

Adding a mapping to an anon page

page_add_anon_rmap()

mm/rmap.c

/**
 * page_add_anon_rmap - add pte mapping to an anonymous page
 * @page:       the page to add the mapping to
 * @vma:        the vm area in which the mapping is added
 * @address:    the user virtual address mapped
 * @compound:   charge the page as compound or small page
 *
 * The caller needs to hold the pte lock, and the page must be locked in
 * the anon_vma case: to serialize mapping,index checking after setting,
 * and to ensure that PageAnon is not being upgraded racily to PageKsm
 * (but PageKsm is never downgraded to PageAnon).
 */

void page_add_anon_rmap(struct page *page,
        struct vm_area_struct *vma, unsigned long address, bool compound)
{
        do_page_add_anon_rmap(page, vma, address, compound ? RMAP_COMPOUND : 0);
}

Adds a reverse mapping to a shared anon page.

do_page_add_anon_rmap()

Note: this function is called directly only from do_swap_page(), to add a reverse mapping for a non-shared anon page. All other callers should use the page_add_anon_rmap() function above.

mm/rmap.c

/*
 * Special version of the above for do_swap_page, which often runs
 * into pages that are exclusively owned by the current process.
 * Everybody else should continue to use page_add_anon_rmap above.
 */

void do_page_add_anon_rmap(struct page *page,
        struct vm_area_struct *vma, unsigned long address, int flags)
{
        bool compound = flags & RMAP_COMPOUND;
        bool first;

        if (unlikely(PageKsm(page)))
                lock_page_memcg(page);
        else
                VM_BUG_ON_PAGE(!PageLocked(page), page);

        if (compound) {
                atomic_t *mapcount;
                VM_BUG_ON_PAGE(!PageLocked(page), page);
                VM_BUG_ON_PAGE(!PageTransHuge(page), page);
                mapcount = compound_mapcount_ptr(page);
                first = atomic_inc_and_test(mapcount);
        } else {
                first = atomic_inc_and_test(&page->_mapcount);
        }

        if (first) {
                int nr = compound ? thp_nr_pages(page) : 1;
                /*
                 * We use the irq-unsafe __{inc|mod}_zone_page_stat because
                 * these counters are not modified in interrupt context, and
                 * pte lock(a spinlock) is held, which implies preemption
                 * disabled.
                 */
                if (compound)
                        __inc_node_page_state(page, NR_ANON_THPS);
                __mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, nr);
        }
        if (unlikely(PageKsm(page))) {
                unlock_page_memcg(page);
                return;
        }

        /* address might be in next vma when migration races vma_adjust */
        if (first)
                __page_set_anon_rmap(page, vma, address,
                                flags & RMAP_EXCLUSIVE);
        else
                __page_check_anon_rmap(page, vma, address);
}

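Adds a reverse mapping to an anon page.

In code line 4, the flags are checked for a compound request.
In code lines 7~10, for a KSM page, the lock that protects the page-to-memcg binding is taken. Otherwise the page is expected to be locked already.
In code lines 12~20, the map count of the page is incremented by 1 and the function finds out whether this is the first mapping.
In code lines 22~33, on the first mapping the nr_anon_mapped counter is increased by the number of pages. For a compound page the nr_anon_thps counter is also incremented by 1.
In code lines 34~37, for a KSM page, the lock taken above is released and the function returns.
In code lines 40~44, on the first mapping the page is set up with a new anonymous rmap. Otherwise nothing is done apart from debug checks.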
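
The "first mapping" test works because the map count is stored biased by -1: an unmapped page holds -1, so the increment that produces 0 is the first mapping. A minimal user-space sketch with C11 atomics follows; `struct toy_page` and the `toy_*` helpers are hypothetical and only imitate what atomic_inc_and_test() and atomic_add_negative() report.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical page with a biased map count: -1 means "not mapped". */
struct toy_page {
        atomic_int mapcount;
};

static void toy_page_init(struct toy_page *p)
{
        atomic_init(&p->mapcount, -1);
}

/* True when this call created the first mapping (-1 -> 0). */
static bool toy_add_rmap(struct toy_page *p)
{
        return atomic_fetch_add(&p->mapcount, 1) + 1 == 0;
}

/* True when this call removed the last mapping (0 -> -1). */
static bool toy_remove_rmap(struct toy_page *p)
{
        return atomic_fetch_sub(&p->mapcount, 1) - 1 < 0;
}

/* Number of mappings as a caller would think of it. */
static int toy_mapcount(struct toy_page *p)
{
        return atomic_load(&p->mapcount) + 1;
}
```

Only the caller that sees `true` from `toy_add_rmap()` would update statistics and set up the mapping pointer; later callers just bump the counter.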

page_add_new_anon_rmap()

mm/rmap.c

/**
 * page_add_new_anon_rmap - add pte mapping to a new anonymous page
 * @page:       the page to add the mapping to
 * @vma:        the vm area in which the mapping is added
 * @address:    the user virtual address mapped
 * @compound:   charge the page as compound or small page
 *
 * Same as page_add_anon_rmap but must only be called on *new* pages.
 * This means the inc-and-test can be bypassed.
 * Page does not have to be locked.
 */

void page_add_new_anon_rmap(struct page *page,
        struct vm_area_struct *vma, unsigned long address, bool compound)
{
        int nr = compound ? thp_nr_pages(page) : 1;

        VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
        __SetPageSwapBacked(page);
        if (compound) {
                VM_BUG_ON_PAGE(!PageTransHuge(page), page);
                /* increment count (starts at -1) */
                atomic_set(compound_mapcount_ptr(page), 0);
                if (hpage_pincount_available(page))
                        atomic_set(compound_pincount_ptr(page), 0);

                __mod_lruvec_page_state(page, NR_ANON_THPS, nr);
        } else {
                /* Anon THP always mapped first with PMD */
                VM_BUG_ON_PAGE(PageTransCompound(page), page);
                /* increment count (starts at -1) */
                atomic_set(&page->_mapcount, 0);
        }
        __mod_lruvec_page_state(page, NR_ANON_MAPPED, nr);
        __page_set_anon_rmap(page, vma, address, 1);
}

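Sets up the rmap for an anon page that is private to the task.

In code line 7, the PG_swapbacked flag is set because this is a newly allocated anon page.
In code lines 8~15, for a compound page, the compound map count is reset to 0 and the NR_ANON_THPS counter is increased.
In code lines 16~21, for a non-compound page, the map count is reset to 0.
In code lines 22~23, the NR_ANON_MAPPED counter is increased and the anon rmap is set on the page.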
- page->mapping = anon_vma
- page->index = linear_page_index(vma, address)

__page_set_anon_rmap()

mm/rmap.c

/**
 * __page_set_anon_rmap - set up new anonymous rmap
 * @page:       Page or Hugepage to add to rmap
 * @vma:        VM area to add page to.
 * @address:    User virtual address of the mapping
 * @exclusive:  the page is exclusively owned by the current process
 */

static void __page_set_anon_rmap(struct page *page,
        struct vm_area_struct *vma, unsigned long address, int exclusive)
{
        struct anon_vma *anon_vma = vma->anon_vma;

        BUG_ON(!anon_vma);

        if (PageAnon(page))
                return;

        /*
         * If the page isn't exclusively mapped into this vma,
         * we must use the _oldest_ possible anon_vma for the
         * page mapping!
         */
        if (!exclusive)
                anon_vma = anon_vma->root;

        /*
         * page_idle does a lockless/optimistic rmap scan on page->mapping.
         * Make sure the compiler doesn't split the stores of anon_vma and
         * the PAGE_MAPPING_ANON type identifier, otherwise the rmap code
         * could mistake the mapping for a struct address_space and crash.
         */
        anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
        WRITE_ONCE(page->mapping, (struct address_space *) anon_vma);
        page->index = linear_page_index(vma, address);
}

Sets up the page with an anonymous rmap.

In code lines 8~9, if the page is already anon mapped, the function returns.
In code lines 16~17, if the page is not exclusive to the current task (a shared page), the root anon_vma is used.
In code lines 25~27, the anon mapping is set on the page.
- page->mapping is set to the anon_vma pointer with the PAGE_MAPPING_ANON flag added.
- page->index is set to the page offset assigned to the vma + (address pfn - vm start pfn).

The following figure shows an anon page being mapped into the rmap.

For an anon-mapped page, PageAnon(page) returns true.
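
Storing a type tag in the low bit of a pointer can be sketched in user space as follows. The `toy_*` names are hypothetical; the tag value 0x1 mirrors PAGE_MAPPING_ANON, and the sketch assumes the pointed-to object is at least 2-byte aligned so that bit 0 is always free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAPPING_ANON 0x1UL   /* low bit used as the type tag */

struct toy_anon_vma { int dummy; };
struct toy_page { void *mapping; };

static void toy_set_anon(struct toy_page *page, struct toy_anon_vma *av)
{
        /* Alignment guarantees bit 0 of the address is clear. */
        assert(((uintptr_t)av & TOY_MAPPING_ANON) == 0);
        page->mapping = (void *)((uintptr_t)av + TOY_MAPPING_ANON);
}

/* Plays the role of PageAnon(): test the tag bit only. */
static bool toy_page_anon(const struct toy_page *page)
{
        return ((uintptr_t)page->mapping & TOY_MAPPING_ANON) != 0;
}

/* Strip the tag to get the real pointer back. */
static struct toy_anon_vma *toy_page_anon_vma(const struct toy_page *page)
{
        if (!toy_page_anon(page))
                return NULL;
        return (struct toy_anon_vma *)((uintptr_t)page->mapping - TOY_MAPPING_ANON);
}
```

The tagged value must never be dereferenced directly, which is why readers always go through a helper that strips the tag first.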

linear_page_index()

include/linux/pagemap.h

static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
                                        unsigned long address)
{
        pgoff_t pgoff;
        if (unlikely(is_vm_hugetlb_page(vma)))
                return linear_hugepage_index(vma, address);
        pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
        pgoff += vma->vm_pgoff;
        return pgoff;
}

Returns the index obtained by adding vma->vm_pgoff to the page offset of the requested address within the vma area.
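
A small worked example of this arithmetic, assuming 4KB pages and ignoring the hugetlb branch. `struct toy_idx_vma` and `toy_linear_page_index()` are hypothetical names used only for illustration.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12   /* assumption: 4KB pages */

/* Hypothetical subset of vm_area_struct needed for the calculation. */
struct toy_idx_vma {
        unsigned long vm_start;   /* first virtual address of the area */
        unsigned long vm_pgoff;   /* page offset assigned to vm_start */
};

static unsigned long toy_linear_page_index(const struct toy_idx_vma *vma,
                                           unsigned long address)
{
        unsigned long pgoff = (address - vma->vm_start) >> TOY_PAGE_SHIFT;

        return pgoff + vma->vm_pgoff;
}
```

For a vma that starts at 0x10000000 with vm_pgoff 0x100, the address 0x10003000 lies three pages into the area, so the index is 0x103.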

Removing the rmap mapping from a page

page_remove_rmap()

mm/rmap.c

/**
 * page_remove_rmap - take down pte mapping from a page
 * @page:       page to remove mapping from
 * @compound:   uncharge the page as compound or small page
 *
 * The caller needs to hold the pte lock.
 */

void page_remove_rmap(struct page *page, bool compound)
{
        lock_page_memcg(page);

        if (!PageAnon(page)) {
                page_remove_file_rmap(page, compound);
                goto out;
        }

        if (compound) {
                page_remove_anon_compound_rmap(page);
                goto out;
        }

        /* page still mapped by someone else? */
        if (!atomic_add_negative(-1, &page->_mapcount))
                goto out;

        /*
         * We use the irq-unsafe __{inc|mod}_zone_page_stat because
         * these counters are not modified in interrupt context, and
         * pte lock(a spinlock) is held, which implies preemption disabled.
         */
        __dec_lruvec_page_state(page, NR_ANON_MAPPED);

        if (unlikely(PageMlocked(page)))
                clear_page_mlock(page);

        if (PageTransCompound(page))
                deferred_split_huge_page(compound_head(page));

        /*
         * It would be tidy to reset the PageAnon mapping here,
         * but that might overwrite a racing page_add_anon_rmap
         * which increments mapcount after us but sets mapping
         * before us: so leave the reset to free_unref_page,
         * and remember that it's only reliable while mapped.
         * Leaving it set also helps swapoff to reinstate ptes
         * faster for those pages still in swapcache.
         */
out:
        unlock_page_memcg(page);
}

Removes the reverse mapping (rmap) from a page.

page_remove_file_rmap()

mm/rmap.c

static void page_remove_file_rmap(struct page *page, bool compound)
{
        int i, nr = 1;

        VM_BUG_ON_PAGE(compound && !PageHead(page), page);
        /* The caller, page_remove_rmap(), already holds lock_page_memcg(). */

        /* Hugepages are not counted in NR_FILE_MAPPED for now. */
        if (unlikely(PageHuge(page))) {
                /* hugetlb pages are always mapped with pmds */
                atomic_dec(compound_mapcount_ptr(page));
                return;
        }

        /* page still mapped by someone else? */
        if (compound && PageTransHuge(page)) {
                int nr_pages = thp_nr_pages(page);

                for (i = 0, nr = 0; i < HPAGE_PMD_NR; i++) {
                        if (atomic_add_negative(-1, &page[i]._mapcount))
                                nr++;
                }
                if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
                        return;
                if (PageSwapBacked(page))
                        __mod_lruvec_page_state(page, NR_SHMEM_PMDMAPPED,
                                                -nr_pages);
                else
                        __mod_lruvec_page_state(page, NR_FILE_PMDMAPPED,
                                                -nr_pages);
        } else {
                if (!atomic_add_negative(-1, &page->_mapcount))
                        return;
        }

        /*
         * We use the irq-unsafe __{inc|mod}_lruvec_page_state because
         * these counters are not modified in interrupt context, and
         * pte lock(a spinlock) is held, which implies preemption disabled.
         */
        __mod_lruvec_page_state(page, NR_FILE_MAPPED, -nr);

        if (unlikely(PageMlocked(page)))
                clear_page_mlock(page);
}

Removes the reverse mapping (rmap) of a file page.

Creating and linking an anon_vma when a child process is forked

anon_vma_fork()

mm/rmap.c

/*
 * Attach vma to its own anon_vma, as well as to the anon_vmas that
 * the corresponding VMA in the parent process is attached to.
 * Returns 0 on success, non-zero on failure.
 */

int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
{
        struct anon_vma_chain *avc;
        struct anon_vma *anon_vma;
        int error;

        /* Don't bother if the parent process has no anon_vma here. */
        if (!pvma->anon_vma)
                return 0;

        /* Drop inherited anon_vma, we'll reuse existing or allocate new. */
        vma->anon_vma = NULL;

        /*
         * First, attach the new VMA to the parent VMA's anon_vmas,
         * so rmap can find non-COWed pages in child processes.
         */
        error = anon_vma_clone(vma, pvma);
        if (error)
                return error;

        /* An existing anon_vma has been reused, all done then. */
        if (vma->anon_vma)
                return 0;

        /* Then add our own anon_vma. */
        anon_vma = anon_vma_alloc();
        if (!anon_vma)
                goto out_error;
        avc = anon_vma_chain_alloc(GFP_KERNEL);
        if (!avc)
                goto out_error_free_anon_vma;

        /*
         * The root anon_vma's rwsem is the lock actually used when we
         * lock any of the anon_vmas in this anon_vma tree.
         */
        anon_vma->root = pvma->anon_vma->root;
        anon_vma->parent = pvma->anon_vma;
        /*
         * With refcounts, an anon_vma can stay around longer than the
         * process it belongs to. The root anon_vma needs to be pinned until
         * this anon_vma is freed, because the lock lives in the root.
         */
        get_anon_vma(anon_vma->root);
        /* Mark this anon_vma as the one where our new (COWed) pages go. */
        vma->anon_vma = anon_vma;
        anon_vma_lock_write(anon_vma);
        anon_vma_chain_link(vma, avc, anon_vma);
        anon_vma->parent->degree++;
        anon_vma_unlock_write(anon_vma);

        return 0;

 out_error_free_anon_vma:
        put_anon_vma(anon_vma);
 out_error:
        unlink_anon_vmas(vma);
        return -ENOMEM;
}

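Links the @vma created in the forked child process to every anon_vma attached to the parent's @pvma, allocating one anon_vma_chain per anon_vma. It also reuses or allocates an anon_vma for @vma itself.

In code lines 8~9, if the parent vma has never had an anon_vma, success (0) is returned.
In code line 12, the anon_vma of the vma inherited from the parent task is set to NULL so that it can be set up again.
In code lines 18~20, the anon_vmas attached to the parent vma are linked with the newly cloned vma. On failure the error is returned.
In code lines 23~24, if an anon_vma was reused, success (0) is returned.
In code lines 27~32, an anon_vma and an anon_vma_chain structure are allocated.
In code lines 38~39, the root and parent relationship of the anon_vma is set.
In code lines 45~53, the anon_vma is linked to the vm_area_struct through the anon_vma_chain and success (0) is returned.

The following figure shows the result of anon_vma_fork() being called three times while a child process is forked.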

anon_vma_clone()

The main call paths of this function are as follows.

A task is forked and the parent task's vma is cloned
- anon_vma_fork()
An existing VMA is cloned when a VMA is split
- __split_vma()

mm/rmap.c

/*
 * Attach the anon_vmas from src to dst.
 * Returns 0 on success, -ENOMEM on failure.
 *
 * anon_vma_clone() is called by __vma_adjust(), __split_vma(), copy_vma() and
 * anon_vma_fork(). The first three want an exact copy of src, while the last
 * one, anon_vma_fork(), may try to reuse an existing anon_vma to prevent
 * endless growth of anon_vma. Since dst->anon_vma is set to NULL before call,
 * we can identify this case by checking (!dst->anon_vma && src->anon_vma).
 *
 * If (!dst->anon_vma && src->anon_vma) is true, this function tries to find
 * and reuse existing anon_vma which has no vmas and only one child anon_vma.
 * This prevents degradation of anon_vma hierarchy to endless linear chain in
 * case of constantly forking task. On the other hand, an anon_vma with more
 * than one child isn't reused even if there was no alive vma, thus rmap
 * walker has a good chance of avoiding scanning the whole hierarchy when it
 * searches where page is mapped.
 */

int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
{
        struct anon_vma_chain *avc, *pavc;
        struct anon_vma *root = NULL;

        list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) {
                struct anon_vma *anon_vma;

                avc = anon_vma_chain_alloc(GFP_NOWAIT | __GFP_NOWARN);
                if (unlikely(!avc)) {
                        unlock_anon_vma_root(root);
                        root = NULL;
                        avc = anon_vma_chain_alloc(GFP_KERNEL);
                        if (!avc)
                                goto enomem_failure;
                }
                anon_vma = pavc->anon_vma;
                root = lock_anon_vma_root(root, anon_vma);
                anon_vma_chain_link(dst, avc, anon_vma);

                /*
                 * Reuse existing anon_vma if its degree lower than two,
                 * that means it has no vma and only one anon_vma child.
                 *
                 * Do not chose parent anon_vma, otherwise first child
                 * will always reuse it. Root anon_vma is never reused:
                 * it has self-parent reference and at least one child.
                 */
                if (!dst->anon_vma && src->anon_vma &&
                    anon_vma != src->anon_vma && anon_vma->degree < 2)
                        dst->anon_vma = anon_vma;
        }
        if (dst->anon_vma)
                dst->anon_vma->degree++;
        unlock_anon_vma_root(root);
        return 0;

 enomem_failure:
        /*
         * dst->anon_vma is dropped here otherwise its degree can be incorrectly
         * decremented in unlink_anon_vmas().
         * We can safely do this because callers of anon_vma_clone() don't care
         * about dst->anon_vma if anon_vma_clone() failed.
         */
        dst->anon_vma = NULL;
        unlink_anon_vmas(dst);
        return -ENOMEM;
}

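Links the anon_vmas attached to the @src vma with the @dst vma.

In code line 6, pavc iterates over as many entries as there are anon_vmas attached to the @src vma.
- It walks the avcs linked on the anon_vma_chain list of the @src vma.
In code lines 9~16, an anon_vma_chain is allocated. If the allocation fails, the anon_vma root lock is released and the allocation is tried once more with GFP_KERNEL.
In code line 17, the anon_vma linked to the pavc being visited is obtained.
In code line 18, the root anon_vma lock is taken.
In code line 19, the cloned @dst vma and the anon_vma being visited are linked using the newly allocated anon_vma_chain.
In code lines 29~31, this is the patch that keeps the anon_vma hierarchy from growing without end. Among the anon_vmas of the grandparent or older generations, one that owns no vma and has exactly one child anon_vma is reused.
In code lines 32~33, if an anon_vma has been set on the @dst vma, its degree is increased.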
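
The reuse condition in code lines 29~31 can be isolated as a small predicate. This is a sketch only: `struct toy_av` and `toy_can_reuse()` are hypothetical names, and the degree values in the usage note are chosen for illustration rather than taken from a running kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical anon_vma reduced to the one field the test looks at. */
struct toy_av {
        unsigned int degree;
};

/*
 * dst_av:    anon_vma already chosen for the new vma (NULL on the fork path)
 * src_av:    anon_vma owned by the source (parent) vma
 * candidate: anon_vma currently visited while walking the source chain
 */
static bool toy_can_reuse(const struct toy_av *dst_av,
                          const struct toy_av *src_av,
                          const struct toy_av *candidate)
{
        return dst_av == NULL &&        /* nothing chosen yet */
               src_av != NULL &&
               candidate != src_av &&   /* never the parent's own anon_vma */
               candidate->degree < 2;   /* no vma, only one child anon_vma */
}
```

A grandparent anon_vma with degree 1 qualifies, the parent's own anon_vma never does, and a candidate with degree 2 or more is skipped.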

Patch preventing endless growth of anon_vma under a repeated fork hierarchy

See:

mm: prevent endless growth of anon_vma hierarchy (2015, v3.19-rc4)
mm/rmap.c: don’t reuse anon_vma if we just want a copy (2019, v5.5-rc1)
mm/rmap.c: reuse mergeable anon_vma as parent when fork (2019, v5.5-rc1)
Repeated fork() causes SLAB to grow without bound

With a program such as the following, repeated forks used to make anon_vma grow without bound; this was patched by reusing anon_vma.

#include <unistd.h>

int main(int argc, char *argv[])
{
        pid_t pid;

        while (1) {
                pid = fork();
                if (pid == -1) {
                        /* error */
                        return 1;
                }
                if (pid) {
                        /* parent */
                        sleep(2);
                        break;
                }
                else {
                        /* child */
                        sleep(1);
                }
        }

        return 0;
}

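The following figure shows how AVs are linked as forks proceed from the grandparent process to the parent and the child.

To show the fork process in detail, the grandparent process uses only one VMA and one AV.

The following figure shows the state in which an AV can be reused after the parent process exits.

If the forked child process B exits, the parent process's AV is freed without being reused.

The following figure shows the child process reusing the grandparent process's AV.

The parent process is excluded; the candidate must be an AV used by the grandparent or an older generation, and its degree must be 1.

References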

The case of the overly anonymous anon_vma (2010) | LWN.net
Virtual Memory II: the return of objrmap (2004) | LWN.net
Reverse mapping anonymous pages – again (2004) | LWN.net
Linux Memory Management | Columbia.edu (PDF download)

Rmap -2- (TTU & Rmap Walk)

2019-09-10 / 2021-07-07 문영일

Rmap Walk

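The following figure shows an rmap walk being performed on a single user page, calling the hook functions held in the rmap_walk_control struct.

When one user page is mapped at several virtual addresses, the related VMAs can be found and operations such as unmapping and migration can be performed on the mapped pte entries.

The following figure shows the functions that currently use the rmap walk and the hook functions they use.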

TTU(Try To Unmap)

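When a user page is unmapped, its mappings are removed through an rmap walk.

When user-space mappings are removed, any secondary MMU attached through an mmu notifier is controlled as well.

TTU flags

The flags used during a TTU operation are as follows.

TTU_MIGRATION
- Migration mode.
TTU_MUNLOCK
- munlock mode; vmas that are not VM_LOCKED are skipped.
TTU_SPLIT_HUGE_PMD
- If the page is a huge PMD, it is split.
TTU_IGNORE_MLOCK
- Ignores mlock.
TTU_IGNORE_ACCESS
- Normally, if a young page has been accessed, the access flag of the pte entry is cleared and the TLB is flushed; this flag skips that check.
TTU_IGNORE_HWPOISON
- Ignores hwpoison and uses the page even if it is corrupted.
TTU_BATCH_FLUSH
- Where possible, TLB flushes are batched and done once at the end.
TTU_RMAP_LOCKED
- The caller already holds the rmap lock (see try_to_unmap()).
TTU_SPLIT_FREEZE
- Freezes the ptes when a thp is split.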

try_to_unmap()

mm/rmap.c

/**
 * try_to_unmap - try to remove all page table mappings to a page
 * @page: the page to get unmapped
 * @flags: action and flags
 *
 * Tries to remove all the page table entries which are mapping this
 * page, used in the pageout path.  Caller must hold the page lock.
 *
 * If unmap is successful, return true. Otherwise, false.
 */

bool try_to_unmap(struct page *page, enum ttu_flags flags)
{
        struct rmap_walk_control rwc = {
                .rmap_one = try_to_unmap_one,
                .arg = (void *)flags,
                .done = page_mapcount_is_zero,
                .anon_lock = page_lock_anon_vma_read,
        };

        /*
         * During exec, a temporary VMA is setup and later moved.
         * The VMA is moved under the anon_vma lock but not the
         * page tables leading to a race where migration cannot
         * find the migration ptes. Rather than increasing the
         * locking requirements of exec(), migration skips
         * temporary VMAs until after exec() completes.
         */
        if ((flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))
            && !PageKsm(page) && PageAnon(page))
                rwc.invalid_vma = invalid_migration_vma;

        if (flags & TTU_RMAP_LOCKED)
                rmap_walk_locked(page, &rwc);
        else
                rmap_walk(page, &rwc);

        return !page_mapcount(page) ? true : false;
}

Removes all mappings of the page through an rmap walk.

In code lines 3~8, the hook functions to be called through rwc are set.
In code lines 18~20, when the call comes from a thp split (TTU_SPLIT_FREEZE) or from migration (TTU_MIGRATION) and the page is an anon-mapped page other than KSM, invalid_migration_vma() is set as the (*invalid_vma) hook.
- See: mm: thp: introduce separate TTU flag for thp freezing
In code lines 22~25, the rmap walk is performed with rwc so that all mappings of the page are removed. When the TTU_RMAP_LOCKED flag is used, the lock has already been taken by the caller.
- See: rmap: introduce rmap_walk_locked()
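
The two flag tests in try_to_unmap() are ordinary bitmask checks, sketched below. The `TOY_TTU_*` names and their bit values are illustrative assumptions, not the kernel's definitions; see include/linux/rmap.h for the real enum.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; the kernel's enum ttu_flags may differ. */
enum toy_ttu_flags {
        TOY_TTU_MIGRATION    = 1 << 0,
        TOY_TTU_RMAP_LOCKED  = 1 << 1,
        TOY_TTU_SPLIT_FREEZE = 1 << 2,
};

/* Same shape as the condition that installs the invalid_vma hook. */
static bool toy_wants_invalid_vma_hook(unsigned int flags, bool is_ksm,
                                       bool is_anon)
{
        return (flags & (TOY_TTU_MIGRATION | TOY_TTU_SPLIT_FREEZE)) &&
               !is_ksm && is_anon;
}

/* Same shape as the choice between rmap_walk_locked() and rmap_walk(). */
static bool toy_caller_holds_lock(unsigned int flags)
{
        return (flags & TOY_TTU_RMAP_LOCKED) != 0;
}
```

Because the flags are independent bits, a caller can combine them, for example migration with the lock already held, and each test still looks only at its own bits.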

The following figure shows the flow of the try_to_unmap() function.

page_mapcount_is_zero()

mm/rmap.c

static int page_mapcount_is_zero(struct page *page)
{
        return !total_mapcount(page);
}

Returns whether the map count of the page is 0.

When installed as the (*done) hook of the rmap_walk_control struct, it is used to end the rmap walk as soon as the page has been completely unmapped.

total_mapcount()

mm/huge_memory.c

int total_mapcount(struct page *page)
{
        int i, compound, ret;

        VM_BUG_ON_PAGE(PageTail(page), page);

        if (likely(!PageCompound(page)))
                return atomic_read(&page->_mapcount) + 1;

        compound = compound_mapcount(page);
        if (PageHuge(page))
                return compound;
        ret = compound;
        for (i = 0; i < HPAGE_PMD_NR; i++)
                ret += atomic_read(&page[i]._mapcount) + 1;
        /* File pages has compound_mapcount included in _mapcount */
        if (!PageAnon(page))
                return ret - compound * HPAGE_PMD_NR;
        if (PageDoubleMap(page))
                ret -= HPAGE_PMD_NR;
        return ret;
}

Returns the map count of the page.

page_lock_anon_vma_read()

mm/rmap.c

/*
 * Similar to page_get_anon_vma() except it locks the anon_vma.
 *
 * Its a little more complex as it tries to keep the fast path to a single
 * atomic op -- the trylock. If we fail the trylock, we fall back to getting a
 * reference like with page_get_anon_vma() and then block on the mutex.
 */

struct anon_vma *page_lock_anon_vma_read(struct page *page)
{
        struct anon_vma *anon_vma = NULL;
        struct anon_vma *root_anon_vma;
        unsigned long anon_mapping;

        rcu_read_lock();
        anon_mapping = (unsigned long)READ_ONCE(page->mapping);
        if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
                goto out;
        if (!page_mapped(page))
                goto out;

        anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
        root_anon_vma = READ_ONCE(anon_vma->root);
        if (down_read_trylock(&root_anon_vma->rwsem)) {
                /*
                 * If the page is still mapped, then this anon_vma is still
                 * its anon_vma, and holding the mutex ensures that it will
                 * not go away, see anon_vma_free().
                 */
                if (!page_mapped(page)) {
                        up_read(&root_anon_vma->rwsem);
                        anon_vma = NULL;
                }
                goto out;
        }

        /* trylock failed, we got to sleep */
        if (!atomic_inc_not_zero(&anon_vma->refcount)) {
                anon_vma = NULL;
                goto out;
        }

        if (!page_mapped(page)) {
                rcu_read_unlock();
                put_anon_vma(anon_vma);
                return NULL;
        }

        /* we pinned the anon_vma, its safe to sleep */
        rcu_read_unlock();
        anon_vma_lock_read(anon_vma);

        if (atomic_dec_and_test(&anon_vma->refcount)) {
                /*
                 * Oops, we held the last refcount, release the lock
                 * and bail -- can't simply use put_anon_vma() because
                 * we'll deadlock on the anon_vma_lock_write() recursion.
                 */
                anon_vma_unlock_read(anon_vma);
                __put_anon_vma(anon_vma);
                anon_vma = NULL;
        }

        return anon_vma;

out:
        rcu_read_unlock();
        return anon_vma;
}

Takes the root anon_vma lock for an anon page and returns the anon_vma.
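
When the trylock fails, the function pins the anon_vma with atomic_inc_not_zero() before sleeping, so that an object whose refcount has already dropped to 0 is never revived. The sketch below shows that pattern with C11 atomics; `toy_inc_not_zero()` and `toy_dec_and_test()` are hypothetical user-space imitations, not the kernel implementations.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Take a reference only if the object is still alive (count != 0). */
static bool toy_inc_not_zero(atomic_int *refcount)
{
        int old = atomic_load(refcount);

        while (old != 0) {
                /* On failure, old is reloaded with the current value. */
                if (atomic_compare_exchange_weak(refcount, &old, old + 1))
                        return true;
        }
        return false;
}

/* Drop a reference; true means the caller dropped the last one. */
static bool toy_dec_and_test(atomic_int *refcount)
{
        return atomic_fetch_sub(refcount, 1) == 1;
}
```

The caller that sees `true` from `toy_dec_and_test()` is responsible for freeing the object, which corresponds to the "we held the last refcount" branch in the code above.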

rmap walk

rmap_walk()

mm/rmap.c

void rmap_walk(struct page *page, struct rmap_walk_control *rwc)
{
        if (unlikely(PageKsm(page)))
                rmap_walk_ksm(page, rwc);
        else if (PageAnon(page))
                rmap_walk_anon(page, rwc, false);
        else
                rmap_walk_file(page, rwc, false);
}

Depending on the type of the page (ksm, anon or file), walks the vmas it belongs to and runs the (*rmap_one) hook of rwc.

/*
 * rmap_walk_anon - do something to anonymous page using the object-based
 * rmap method
 * @page: the page to be handled
 * @rwc: control variable according to each walk type
 *
 * Find all the mappings of a page using the mapping pointer and the vma chains
 * contained in the anon_vma struct it points to.
 *
 * When called from try_to_munlock(), the mmap_sem of the mm containing the vma
 * where the page was found will be held for write.  So, we won't recheck
 * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
 * LOCKED.
 */

static void rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc,
                bool locked)
{
        struct anon_vma *anon_vma;
        pgoff_t pgoff_start, pgoff_end;
        struct anon_vma_chain *avc;

        if (locked) {
                anon_vma = page_anon_vma(page);
                /* anon_vma disappear under us? */
                VM_BUG_ON_PAGE(!anon_vma, page);
        } else {
                anon_vma = rmap_walk_anon_lock(page, rwc);
        }
        if (!anon_vma)
                return;

        pgoff_start = page_to_pgoff(page);
        pgoff_end = pgoff_start + hpage_nr_pages(page) - 1;
        anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root,
                        pgoff_start, pgoff_end) {
                struct vm_area_struct *vma = avc->vma;
                unsigned long address = vma_address(page, vma);

                cond_resched();

                if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg))
                        continue;

                if (!rwc->rmap_one(page, vma, address, rwc->arg))
                        break;
                if (rwc->done && rwc->done(page))
                        break;
        }

        if (!locked)
                anon_vma_unlock_read(anon_vma);
}

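Walks the anon vmas that map the page and invokes the (*rmap_one) hook of rwc on each vma.

In code lines 8~16, obtains the anon_vma for the page. If @locked is not set, this function itself must take the anon_vma lock.
In code lines 18~23, iterates over the vmas attached to the anon_vma that overlap the range from pgoff_start to pgoff_end.
In code lines 27~28, if rwc provides an (*invalid_vma) hook and it returns true, the vma is skipped.
In code lines 30~31, calls the (*rmap_one) hook of rwc. If it returns false, breaks out of the loop.
In code lines 32~33, if rwc provides a (*done) hook and it returns true, breaks out of the loop.
In code lines 36~37, releases the anon_vma lock if it was taken in this function (@locked not set).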
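
The hook protocol described above (skip on invalid_vma, stop when rmap_one returns false, stop when done returns true) can be sketched in user space. This is only an illustration: the `toy_` types and functions are hypothetical stand-ins, and a plain array replaces the anon_vma interval tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a vma; not the kernel definition. */
struct toy_vma { int id; bool skip; };

struct toy_walk_control {
        void *arg;
        /* Return false to stop the walk early. */
        bool (*rmap_one)(struct toy_vma *vma, void *arg);
        /* Optional: return true when no further vma needs visiting. */
        bool (*done)(void *arg);
        /* Optional: return true to skip this vma. */
        bool (*invalid_vma)(struct toy_vma *vma, void *arg);
};

/* Visit the vmas in order; returns how many were passed to rmap_one. */
static size_t toy_rmap_walk(struct toy_vma *vmas, size_t n,
                            struct toy_walk_control *rwc)
{
        size_t visited = 0;

        assert(rwc->rmap_one != NULL);
        for (size_t i = 0; i < n; i++) {
                if (rwc->invalid_vma && rwc->invalid_vma(&vmas[i], rwc->arg))
                        continue;
                visited++;
                if (!rwc->rmap_one(&vmas[i], rwc->arg))
                        break;
                if (rwc->done && rwc->done(rwc->arg))
                        break;
        }
        return visited;
}

/* Example hooks: count unmapped vmas and stop once a limit is reached. */
struct toy_state { int unmapped; int limit; };

static bool toy_unmap_one(struct toy_vma *vma, void *arg)
{
        struct toy_state *st = arg;

        (void)vma;
        st->unmapped++;
        return true;
}

static bool toy_done(void *arg)
{
        struct toy_state *st = arg;

        return st->unmapped >= st->limit;
}

static bool toy_invalid(struct toy_vma *vma, void *arg)
{
        (void)arg;
        return vma->skip;
}
```

With four vmas of which the second is marked skip, and a limit of 2, the walk stops after the third vma because the done hook reports completion.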

/*
 * rmap_walk_file - do something to file page using the object-based rmap method
 * @page: the page to be handled
 * @rwc: control variable according to each walk type
 *
 * Find all the mappings of a page using the mapping pointer and the vma chains
 * contained in the address_space struct it points to.
 *
 * When called from try_to_munlock(), the mmap_sem of the mm containing the vma
 * where the page was found will be held for write.  So, we won't recheck
 * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
 * LOCKED.
 */

static void rmap_walk_file(struct page *page, struct rmap_walk_control *rwc,
                bool locked)
{
        struct address_space *mapping = page_mapping(page);
        pgoff_t pgoff_start, pgoff_end;
        struct vm_area_struct *vma;

        /*
         * The page lock not only makes sure that page->mapping cannot
         * suddenly be NULLified by truncation, it makes sure that the
         * structure at mapping cannot be freed and reused yet,
         * so we can safely take mapping->i_mmap_rwsem.
         */
        VM_BUG_ON_PAGE(!PageLocked(page), page);

        if (!mapping)
                return;

        pgoff_start = page_to_pgoff(page);
        pgoff_end = pgoff_start + hpage_nr_pages(page) - 1;
        if (!locked)
                i_mmap_lock_read(mapping);
        vma_interval_tree_foreach(vma, &mapping->i_mmap,
                        pgoff_start, pgoff_end) {
                unsigned long address = vma_address(page, vma);

                cond_resched();

                if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg))
                        continue;

                if (!rwc->rmap_one(page, vma, address, rwc->arg))
                        goto done;
                if (rwc->done && rwc->done(page))
                        goto done;
        }

done:
        if (!locked)
                i_mmap_unlock_read(mapping);
}

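Walks the file vmas that map the page and invokes the (*rmap_one) hook of rwc on each vma.

In code lines 4~17, returns if the page has no file mapping (mapping is NULL). The caller must already hold the page lock.
In code lines 19~20, computes pgoff_start and pgoff_end for the file page.
In code lines 21~22, if @locked is not set, this function itself must take the i_mmap lock.
In code lines 23~24, iterates through the interval tree of the file mapping over the vmas that overlap the range from pgoff_start to pgoff_end.
In code lines 29~30, if rwc provides an (*invalid_vma) hook and it returns true, the vma is skipped.
In code lines 32~33, calls the (*rmap_one) hook of rwc. If it returns false, jumps to the done: label.
In code lines 34~35, if rwc provides a (*done) hook and it returns true, jumps to the done: label.
In code lines 38~40, the done: label. Releases the i_mmap lock if it was taken in this function.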
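
The page offset range and the overlap condition that the interval tree lookup answers can be written out directly. The sketch below is a simplified assumption: `toy_file_vma` is a hypothetical structure, and the check is the plain inclusive-interval test rather than the kernel's tree traversal.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mapping: covers file page offsets
 * [vm_pgoff, vm_pgoff + nr_pages - 1]. */
struct toy_file_vma {
        unsigned long vm_pgoff;  /* first file page offset mapped */
        unsigned long nr_pages;  /* mapping length in pages, > 0 */
};

/* Last offset covered by a page starting at pgoff_start that spans
 * nr_pages base pages (1 for a normal page, 512 for a 2MB THP with
 * 4KB base pages). */
static unsigned long toy_pgoff_end(unsigned long pgoff_start,
                                   unsigned long nr_pages)
{
        assert(nr_pages > 0);
        return pgoff_start + nr_pages - 1;
}

/* True when the vma's offsets intersect [start, end], both inclusive. */
static bool toy_vma_overlaps(const struct toy_file_vma *vma,
                             unsigned long start, unsigned long end)
{
        unsigned long vma_last = vma->vm_pgoff + vma->nr_pages - 1;

        return vma->vm_pgoff <= end && start <= vma_last;
}
```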

Trying to unmap one mapping

try_to_unmap_one()

mm/rmap.c -1/6-

/*
 * @arg: enum ttu_flags will be passed to this argument
 */

static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
                     unsigned long address, void *arg)
{
        struct mm_struct *mm = vma->vm_mm;
        struct page_vma_mapped_walk pvmw = {
                .page = page,
                .vma = vma,
                .address = address,
        };
        pte_t pteval;
        struct page *subpage;
        bool ret = true;
        struct mmu_notifier_range range;
        enum ttu_flags flags = (enum ttu_flags)arg;

        /* munlock has nothing to gain from examining un-locked vmas */
        if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
                return true;

        if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
            is_zone_device_page(page) && !is_device_private_page(page))
                return true;

        if (flags & TTU_SPLIT_HUGE_PMD) {
                split_huge_pmd_address(vma, address,
                                flags & TTU_SPLIT_FREEZE, page);
        }

        /*
         * For THP, we have to assume the worse case ie pmd for invalidation.
         * For hugetlb, it could be much worse if we need to do pud
         * invalidation in the case of pmd sharing.
         *
         * Note that the page can not be free in this function as call of
         * try_to_unmap() must hold a reference on the page.
         */
        mmu_notifier_range_init(&range, vma->vm_mm, address,
                                min(vma->vm_end, address +
                                    (PAGE_SIZE << compound_order(page))));
        if (PageHuge(page)) {
                /*
                 * If sharing is possible, start and end will be adjusted
                 * accordingly.
                 */
                adjust_range_if_pmd_sharing_possible(vma, &range.start,
                                                     &range.end);
        }
        mmu_notifier_invalidate_range_start(&range);

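In code line 4, obtains mm, which manages the virtual address space the vma belongs to.
In code lines 5~9, prepares pvmw, which is used to check whether the page is mapped in the vma.
In code lines 17~18, if TTU_MUNLOCK is requested on a vma without VM_LOCKED, simply returns success (true) so that this vma is skipped.
In code lines 20~22, for a migration request, returns true to skip a zone device page that is not a device-private (HMM) page.
- Zone device pages other than device-private ones cannot go through the regular map/unmap operations.
In code lines 24~27, splits the huge pmd if TTU_SPLIT_HUGE_PMD is requested.
In code lines 37~39, initializes the mmu_notifier_range.
In code lines 40~47, for a hugetlb page, adjusts the range above in case pmd sharing is possible.
In code line 48, calls the (*invalidate_range_start) hooks registered with the mmu notifier to tell the secondary MMUs that invalidation of the range is about to begin.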
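
The range passed to mmu_notifier_range_init() starts at address and ends at the smaller of the vma end and address plus the size of the (possibly compound) page. A minimal sketch of that arithmetic, assuming 4KB base pages and using hypothetical `toy_` names:

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096UL  /* assumption: 4KB base pages */

struct toy_range { unsigned long start, end; };

/* start = address, end = min(vm_end, address + (PAGE_SIZE << order)) */
static struct toy_range toy_range_init(unsigned long address,
                                       unsigned long vm_end,
                                       unsigned int order)
{
        struct toy_range r;
        unsigned long span_end = address + (TOY_PAGE_SIZE << order);

        assert(address < vm_end);
        r.start = address;
        r.end = span_end < vm_end ? span_end : vm_end;
        return r;
}
```

For an order-9 page (2MB) the range covers the whole huge page unless the vma ends earlier, in which case it is clipped at vm_end.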

mm/rmap.c -2/6-

        while (page_vma_mapped_walk(&pvmw)) {
#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
                /* PMD-mapped THP migration entry */
                if (!pvmw.pte && (flags & TTU_MIGRATION)) {
                        VM_BUG_ON_PAGE(PageHuge(page) || !PageTransCompound(page), page);

                        set_pmd_migration_entry(&pvmw, page);
                        continue;
                }
#endif

                /*
                 * If the page is mlock()d, we cannot swap it out.
                 * If it's recently referenced (perhaps page_referenced
                 * skipped over this mm) then we should reactivate it.
                 */
                if (!(flags & TTU_IGNORE_MLOCK)) {
                        if (vma->vm_flags & VM_LOCKED) {
                                /* PTE-mapped THP are never mlocked */
                                if (!PageTransCompound(page)) {
                                        /*
                                         * Holding pte lock, we do *not* need
                                         * mmap_sem here
                                         */
                                        mlock_vma_page(page);
                                }
                                ret = false;
                                page_vma_mapped_walk_done(&pvmw);
                                break;
                        }
                        if (flags & TTU_MUNLOCK)
                                continue;
                }

                /* Unexpected PMD-mapped THP? */
                VM_BUG_ON_PAGE(!pvmw.pte, page);

                subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
                address = pvmw.address;

                if (PageHuge(page)) {
                        if (huge_pmd_unshare(mm, &address, pvmw.pte)) {
                                /*
                                 * huge_pmd_unshare unmapped an entire PMD
                                 * page.  There is no way of knowing exactly
                                 * which PMDs may be cached for this mm, so
                                 * we must flush them all.  start/end were
                                 * already adjusted above to cover this range.
                                 */
                                flush_cache_range(vma, range.start, range.end);
                                flush_tlb_range(vma, range.start, range.end);
                                mmu_notifier_invalidate_range(mm, range.start,
                                                              range.end);

                                /*
                                 * The ref count of the PMD page was dropped
                                 * which is part of the way map counting
                                 * is done for shared PMDs.  Return 'true'
                                 * here.  When there is no other sharing,
                                 * huge_pmd_unshare returns false and we will
                                 * unmap the actual page and drop map count
                                 * to zero.
                                 */
                                page_vma_mapped_walk_done(&pvmw);
                                break;
                        }
                }

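In code line 1, loops for as long as page_vma_mapped_walk() confirms through pvmw that the page is still mapped in this vma.
In code lines 4~9, for a PMD-mapped THP with a migration request, replaces the pmd entry with a migration entry and continues the loop.
In code lines 17~33, when TTU_IGNORE_MLOCK is not requested, an mlocked page must not be swapped out. If the vma is VM_LOCKED, sets ret to false, stops the loop, and abandons the unmap. If TTU_MUNLOCK was requested, continues with the next mapping.
In code lines 38~67, computes subpage and address, and for a hugetlb page calls huge_pmd_unshare(). If a shared PMD was unshared by this call, flushes the cache and the TLB for the range, invalidates the range in the secondary MMUs, and then stops the loop.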

mm/rmap.c -3/6-

                if (IS_ENABLED(CONFIG_MIGRATION) &&
                    (flags & TTU_MIGRATION) &&
                    is_zone_device_page(page)) {
                        swp_entry_t entry;
                        pte_t swp_pte;

                        pteval = ptep_get_and_clear(mm, pvmw.address, pvmw.pte);

                        /*
                         * Store the pfn of the page in a special migration
                         * pte. do_swap_page() will wait until the migration
                         * pte is removed and then restart fault handling.
                         */
                        entry = make_migration_entry(page, 0);
                        swp_pte = swp_entry_to_pte(entry);
                        if (pte_soft_dirty(pteval))
                                swp_pte = pte_swp_mksoft_dirty(swp_pte);
                        set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
                        /*
                         * No need to invalidate here it will synchronize on
                         * against the special swap migration pte.
                         */
                        goto discard;
                }

                if (!(flags & TTU_IGNORE_ACCESS)) {
                        if (ptep_clear_flush_young_notify(vma, address,
                                                pvmw.pte)) {
                                ret = false;
                                page_vma_mapped_walk_done(&pvmw);
                                break;
                        }
                }

                /* Nuke the page table entry. */
                flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
                if (should_defer_flush(mm, flags)) {
                        /*
                         * We clear the PTE but do not flush so potentially
                         * a remote CPU could still be writing to the page.
                         * If the entry was previously clean then the
                         * architecture must guarantee that a clear->dirty
                         * transition on a cached TLB entry is written through
                         * and traps if the PTE is unmapped.
                         */
                        pteval = ptep_get_and_clear(mm, address, pvmw.pte);

                        set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
                } else {
                        pteval = ptep_clear_flush(vma, address, pvmw.pte);
                }

                /* Move the dirty bit to the page. Now the pte is gone. */
                if (pte_dirty(pteval))
                        set_page_dirty(page);

                /* Update high watermark before we lower rss */
                update_hiwater_rss(mm);

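In code lines 1~24, handles a zone device page for which TTU_MIGRATION was requested. The migration information is encoded as a swap entry and stored in the pte.
- The swap entry format is reused to carry the migration information. If a fault occurs on this mapping, do_swap_page(), the fault handler responsible for swap faults, waits until the migration completes.
- At the time of writing, the soft dirty feature is used only on the x86_64, powerpc_64, and s390 architectures.
- Reference: mm/migrate: support un-addressable ZONE_DEVICE page in migration
In code lines 26~33, when TTU_IGNORE_ACCESS is not requested, tests and clears the young/accessed flag for address in the pte and in the secondary MMUs. If the page was accessed, the TLB is flushed, ret is set to false, and the routine is abandoned.
In code line 36, flushes the cache for the user virtual address.
- On ARM64 this does nothing.
- Architectures whose cache type is VIVT or aliasing VIPT perform a flush here.
In code lines 37~51, clears the pte mapped at the user virtual address to unmap it and flushes the TLB. If the TTU_BATCH_FLUSH flag was requested, the TLB flush is deferred and batched at the end to improve performance.
In code lines 54~55, if the pte was dirty before it was cleared, marks the page dirty.
In code line 58, updates the hiwater_rss counter of mm if the current RSS exceeds it.
- RSS here is the number of file pages + anon pages + shmem pages of mm.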
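
The high watermark update amounts to a running maximum over the RSS sum. A minimal sketch with a hypothetical `toy_mm` structure (the kernel keeps these counters in mm_struct with per-type accessors):

```c
#include <assert.h>

/* Hypothetical, flattened view of the RSS-related counters of an mm. */
struct toy_mm {
        unsigned long file_pages, anon_pages, shmem_pages;
        unsigned long hiwater_rss;
};

/* RSS = file + anon + shmem pages currently mapped. */
static unsigned long toy_get_rss(const struct toy_mm *mm)
{
        return mm->file_pages + mm->anon_pages + mm->shmem_pages;
}

/* Record the current RSS as the high watermark if it is a new peak.
 * Called before RSS is lowered so that the peak is not lost. */
static void toy_update_hiwater_rss(struct toy_mm *mm)
{
        unsigned long rss = toy_get_rss(mm);

        assert(mm != 0);
        if (mm->hiwater_rss < rss)
                mm->hiwater_rss = rss;
}
```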

mm/rmap.c -4/6-

                if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
                        pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
                        if (PageHuge(page)) {
                                int nr = 1 << compound_order(page);
                                hugetlb_count_sub(nr, mm);
                                set_huge_swap_pte_at(mm, address,
                                                     pvmw.pte, pteval,
                                                     vma_mmu_pagesize(vma));
                        } else {
                                dec_mm_counter(mm, mm_counter(page));
                                set_pte_at(mm, address, pvmw.pte, pteval);
                        }

                } else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
                        /*
                         * The guest indicated that the page content is of no
                         * interest anymore. Simply discard the pte, vmscan
                         * will take care of the rest.
                         * A future reference will then fault in a new zero
                         * page. When userfaultfd is active, we must not drop
                         * this page though, as its main user (postcopy
                         * migration) will not expect userfaults on already
                         * copied pages.
                         */
                        dec_mm_counter(mm, mm_counter(page));
                        /* We have to invalidate as we cleared the pte */
                        mmu_notifier_invalidate_range(mm, address,
                                                      address + PAGE_SIZE);
                } else if (IS_ENABLED(CONFIG_MIGRATION) &&
                                (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
                        swp_entry_t entry;
                        pte_t swp_pte;

                        if (arch_unmap_one(mm, vma, address, pteval) < 0) {
                                set_pte_at(mm, address, pvmw.pte, pteval);
                                ret = false;
                                page_vma_mapped_walk_done(&pvmw);
                                break;
                        }

                        /*
                         * Store the pfn of the page in a special migration
                         * pte. do_swap_page() will wait until the migration
                         * pte is removed and then restart fault handling.
                         */
                        entry = make_migration_entry(subpage,
                                        pte_write(pteval));
                        swp_pte = swp_entry_to_pte(entry);
                        if (pte_soft_dirty(pteval))
                                swp_pte = pte_swp_mksoft_dirty(swp_pte);
                        set_pte_at(mm, address, pvmw.pte, swp_pte);
                        /*
                         * No need to invalidate here it will synchronize on
                         * against the special swap migration pte.
                         */

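In code lines 1~12, handles a hwpoison page when TTU_IGNORE_HWPOISON is not requested. The relevant mm counter (the hugetlb count for a hugetlb page, otherwise anon, file, or shm) is decreased by the number of pages, and a hwpoison swap entry is stored in the pte.
In code lines 14~28, if the pte value is marked unused and the vma is not armed with userfaultfd, the relevant mm counter (anon, file, or shm) is decreased. The secondary MMUs are then asked to invalidate the TLB for the virtual address.
In code lines 29~51, handles a TTU_MIGRATION or TTU_SPLIT_FREEZE request. A migration swap entry is stored in the pte. If the previous mapping had soft dirty set, it is carried over to the swap entry.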
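
The idea of a migration entry, a swap-style entry whose type marks it as a read or write migration entry and whose offset carries the pfn, can be illustrated as follows. The bit layout, constants, and `toy_` names are assumptions made for the example and do not match any particular architecture.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: top bits hold the type, low 58 bits the offset. */
#define TOY_TYPE_SHIFT  58
#define TOY_OFFSET_MASK ((UINT64_C(1) << TOY_TYPE_SHIFT) - 1)

enum toy_swp_type { TOY_MIGRATION_READ = 1, TOY_MIGRATION_WRITE = 2 };

typedef struct { uint64_t val; } toy_swp_entry_t;

/* Encode the pfn and whether the old pte was writable. */
static toy_swp_entry_t toy_make_migration_entry(uint64_t pfn, bool write)
{
        toy_swp_entry_t e;
        uint64_t type = write ? TOY_MIGRATION_WRITE : TOY_MIGRATION_READ;

        assert((pfn & ~TOY_OFFSET_MASK) == 0);  /* pfn must fit the offset */
        e.val = (type << TOY_TYPE_SHIFT) | pfn;
        return e;
}

static uint64_t toy_entry_pfn(toy_swp_entry_t e)
{
        return e.val & TOY_OFFSET_MASK;
}

static bool toy_entry_is_write(toy_swp_entry_t e)
{
        return (e.val >> TOY_TYPE_SHIFT) == TOY_MIGRATION_WRITE;
}
```

Keeping the write permission in the entry lets the original protection be restored when the migration entry is later replaced by a real pte.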

mm/rmap.c -5/6-

                } else if (PageAnon(page)) {
                        swp_entry_t entry = { .val = page_private(subpage) };
                        pte_t swp_pte;
                        /*
                         * Store the swap location in the pte.
                         * See handle_pte_fault() ...
                         */
                        if (unlikely(PageSwapBacked(page) != PageSwapCache(page))) {
                                WARN_ON_ONCE(1);
                                ret = false;
                                /* We have to invalidate as we cleared the pte */
                                mmu_notifier_invalidate_range(mm, address,
                                                        address + PAGE_SIZE);
                                page_vma_mapped_walk_done(&pvmw);
                                break;
                        }

                        /* MADV_FREE page check */
                        if (!PageSwapBacked(page)) {
                                if (!PageDirty(page)) {
                                        /* Invalidate as we cleared the pte */
                                        mmu_notifier_invalidate_range(mm,
                                                address, address + PAGE_SIZE);
                                        dec_mm_counter(mm, MM_ANONPAGES);
                                        goto discard;
                                }

                                /*
                                 * If the page was redirtied, it cannot be
                                 * discarded. Remap the page to page table.
                                 */
                                set_pte_at(mm, address, pvmw.pte, pteval);
                                SetPageSwapBacked(page);
                                ret = false;
                                page_vma_mapped_walk_done(&pvmw);
                                break;
                        }

                        if (swap_duplicate(entry) < 0) {
                                set_pte_at(mm, address, pvmw.pte, pteval);
                                ret = false;
                                page_vma_mapped_walk_done(&pvmw);
                                break;
                        }
                        if (arch_unmap_one(mm, vma, address, pteval) < 0) {
                                set_pte_at(mm, address, pvmw.pte, pteval);
                                ret = false;
                                page_vma_mapped_walk_done(&pvmw);
                                break;
                        }
                        if (list_empty(&mm->mmlist)) {
                                spin_lock(&mmlist_lock);
                                if (list_empty(&mm->mmlist))
                                        list_add(&mm->mmlist, &init_mm.mmlist);
                                spin_unlock(&mmlist_lock);
                        }
                        dec_mm_counter(mm, MM_ANONPAGES);
                        inc_mm_counter(mm, MM_SWAPENTS);
                        swp_pte = swp_entry_to_pte(entry);
                        if (pte_soft_dirty(pteval))
                                swp_pte = pte_swp_mksoft_dirty(swp_pte);
                        set_pte_at(mm, address, pvmw.pte, swp_pte);
                        /* Invalidate as we cleared the pte */
                        mmu_notifier_invalidate_range(mm, address,
                                                      address + PAGE_SIZE);

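In code lines 1~16, handles an anon page. In the unlikely case that the swapbacked and swapcache flags of the page disagree, invalidates the TLB of the secondary MMUs and abandons the routine.
In code lines 19~37, handles a page that is not SwapBacked (a MADV_FREE page). If the page is clean, invalidates the TLB of the secondary MMUs, decreases the anon mm counter, and jumps to the discard label. If the page was dirtied again, remaps the page, sets SwapBacked again, and abandons the routine.
In code lines 39~44, increases the reference count of the swap entry by 1. On error, remaps the page and abandons the routine.
In code lines 45~50, if the architecture-specific unmap fails, remaps the page and abandons the routine.
- Currently only the sparc64 architecture implements this.
In code lines 51~56, if the mmlist of the current mm is empty, adds it to the mmlist of init_mm.
In code lines 57~65, decreases the anon mm counter and increases the swap mm counter, stores the swap entry in the pte, and invalidates the TLB of the secondary MMUs.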
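
The accounting in the last step moves one page from the anon counter to the swap entry counter, so RSS drops by one while the swap usage of the process grows by one. A small sketch with hypothetical names (shmem is left out for brevity):

```c
#include <assert.h>

enum {
        TOY_MM_FILEPAGES,
        TOY_MM_ANONPAGES,
        TOY_MM_SWAPENTS,
        TOY_NR_COUNTERS
};

struct toy_mm_counters { long count[TOY_NR_COUNTERS]; };

/* One anon page leaves RSS and becomes a swap entry. */
static void toy_account_swap_out(struct toy_mm_counters *c)
{
        assert(c->count[TOY_MM_ANONPAGES] > 0);
        c->count[TOY_MM_ANONPAGES]--;
        c->count[TOY_MM_SWAPENTS]++;
}

/* RSS in this sketch: file + anon pages (swap entries are excluded). */
static long toy_rss(const struct toy_mm_counters *c)
{
        return c->count[TOY_MM_FILEPAGES] + c->count[TOY_MM_ANONPAGES];
}
```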

mm/rmap.c -6/6-

                } else {
                        /*
                         * This is a locked file-backed page, thus it cannot
                         * be removed from the page cache and replaced by a new
                         * page before mmu_notifier_invalidate_range_end, so no
                         * concurrent thread might update its page table to
                         * point at new page while a device still is using this
                         * page.
                         *
                         * See Documentation/vm/mmu_notifier.rst
                         */
                        dec_mm_counter(mm, mm_counter_file(page));
                }
discard:
                /*
                 * No need to call mmu_notifier_invalidate_range() it has be
                 * done above for all cases requiring it to happen under page
                 * table lock before mmu_notifier_invalidate_range_end()
                 *
                 * See Documentation/vm/mmu_notifier.rst
                 */
                page_remove_rmap(subpage, PageHuge(page));
                put_page(page);
        }

        mmu_notifier_invalidate_range_end(&range);

        return ret;
}

In code lines 1~13, in the remaining case (a file-backed page), decreases the file-related mm counter (file or shm).
In code lines 14~23, the discard: label. Removes the rmap mapping of the page and drops the page reference.
In code line 26, invalidation of the range on the primary MMU is complete, so the (*invalidate_range_end) hooks registered with the mmu notifier are called for the secondary MMUs.
- This always works as a pair with the (*invalidate_range_start) hook.

ptep_clear_flush_young_notify()

include/linux/mmu_notifier.h

#define ptep_clear_flush_young_notify(__vma, __address, __ptep)         \
({                                                                      \
        int __young;                                                    \
        struct vm_area_struct *___vma = __vma;                          \
        unsigned long ___address = __address;                           \
        __young = ptep_clear_flush_young(___vma, ___address, __ptep);   \
        __young |= mmu_notifier_clear_flush_young(___vma->vm_mm,        \
                                                  ___address,           \
                                                  ___address +          \
                                                        PAGE_SIZE);     \
        __young;                                                        \
})

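If the page has been accessed, clears the access flag and performs a flush. Returns nonzero if the pte was accessed or if any (*clear_flush_young) hook registered with the mmu notifier reports an access.

In code line 6, reads whether the access flag is set in the pte and clears it. If the page was accessed, the TLB is flushed.
In code line 7, ORs in the result of calling the (*clear_flush_young) hooks registered with the mmu notifier.
- The hook is implemented by the following functions.
  - virt/kvm/kvm_main.c – kvm_mmu_notifier_clear_flush_young()
  - drivers/iommu/amd_iommu_v2.c – mn_clear_flush_young()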
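
The combined result can be pictured as a test-and-clear on the primary pte ORed with the same operation on every secondary mapping. The sketch below uses a hypothetical bit position and plain integers instead of real page table entries, and it omits the TLB flushes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PTE_YOUNG 0x1u  /* hypothetical accessed-bit position */

/* Report whether the accessed bit was set, and clear it. */
static bool toy_test_and_clear_young(unsigned int *pte)
{
        bool young = (*pte & TOY_PTE_YOUNG) != 0;

        *pte &= ~TOY_PTE_YOUNG;
        return young;
}

/* Clear the bit in the primary pte and in every secondary pte; the
 * page counts as young if any of them had it set. */
static bool toy_clear_flush_young_notify(unsigned int *primary,
                                         unsigned int *secondary, size_t n)
{
        bool young;

        assert(primary != NULL);
        young = toy_test_and_clear_young(primary);
        for (size_t i = 0; i < n; i++)
                young |= toy_test_and_clear_young(&secondary[i]);
        return young;
}
```

Using `|=` rather than `||` matters here: every secondary mapping must be cleared even when an earlier one already reported an access.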

MMU Notifier

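The MMU notifier runs the mmu-related operations of secondary MMUs (KVM, IOMMU, …) that are tied to the physical pages mapped into a virtual address range. A driver that wants to use the mmu notifier prepares an mmu_notifier structure containing its mmu_notifier_ops and registers it through mmu_notifier_register().

The following figure shows how a secondary MMU is affected when the mapping of a physical page mapped into the user virtual address space changes.

struct mmu_notifier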

include/linux/mmu_notifier.h

/*
 * The notifier chains are protected by mmap_sem and/or the reverse map
 * semaphores. Notifier chains are only changed when all reverse maps and
 * the mmap_sem locks are taken.
 *
 * Therefore notifier chains can only be traversed when either
 *
 * 1. mmap_sem is held.
 * 2. One of the reverse map locks is held (i_mmap_rwsem or anon_vma->rwsem).
 * 3. No other concurrent thread can access the list (release)
 */

struct mmu_notifier {
        struct hlist_node hlist;
        const struct mmu_notifier_ops *ops;
};

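This structure is used to register the mmu_notifier_ops prepared by a driver (for example an IOMMU driver) by adding it to the mmu_notifier_mm list of the mm that manages the relevant virtual address space.

hlist
- The node entry used when registering on mmu_notifier_mm->list, a member of mm_struct.
*ops
- Points to the prepared mmu_notifier_ops structure.

struct mmu_notifier_ops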

include/linux/mmu_notifier.h -1/2-

struct mmu_notifier_ops {
        /*
         * Called either by mmu_notifier_unregister or when the mm is
         * being destroyed by exit_mmap, always before all pages are
         * freed. This can run concurrently with other mmu notifier
         * methods (the ones invoked outside the mm context) and it
         * should tear down all secondary mmu mappings and freeze the
         * secondary mmu. If this method isn't implemented you've to
         * be sure that nothing could possibly write to the pages
         * through the secondary mmu by the time the last thread with
         * tsk->mm == mm exits.
         *
         * As side note: the pages freed after ->release returns could
         * be immediately reallocated by the gart at an alias physical
         * address with a different cache model, so if ->release isn't
         * implemented because all _software_ driven memory accesses
         * through the secondary mmu are terminated by the time the
         * last thread of this mm quits, you've also to be sure that
         * speculative _hardware_ operations can't allocate dirty
         * cachelines in the cpu that could not be snooped and made
         * coherent with the other read and write operations happening
         * through the gart alias address, so leading to memory
         * corruption.
         */
        void (*release)(struct mmu_notifier *mn,
                        struct mm_struct *mm);

        /*
         * clear_flush_young is called after the VM is
         * test-and-clearing the young/accessed bitflag in the
         * pte. This way the VM will provide proper aging to the
         * accesses to the page through the secondary MMUs and not
         * only to the ones through the Linux pte.
         * Start-end is necessary in case the secondary MMU is mapping the page
         * at a smaller granularity than the primary MMU.
         */
        int (*clear_flush_young)(struct mmu_notifier *mn,
                                 struct mm_struct *mm,
                                 unsigned long start,
                                 unsigned long end);

        /*
         * clear_young is a lightweight version of clear_flush_young. Like the
         * latter, it is supposed to test-and-clear the young/accessed bitflag
         * in the secondary pte, but it may omit flushing the secondary tlb.
         */
        int (*clear_young)(struct mmu_notifier *mn,
                           struct mm_struct *mm,
                           unsigned long start,
                           unsigned long end);

        /*
         * test_young is called to check the young/accessed bitflag in
         * the secondary pte. This is used to know if the page is
         * frequently used without actually clearing the flag or tearing
         * down the secondary mapping on the page.
         */
        int (*test_young)(struct mmu_notifier *mn,
                          struct mm_struct *mm,
                          unsigned long address);

        /*
         * change_pte is called in cases that pte mapping to page is changed:
         * for example, when ksm remaps pte to point to a new shared page.
         */
        void (*change_pte)(struct mmu_notifier *mn,
                           struct mm_struct *mm,
                           unsigned long address,
                           pte_t pte);

(*release)
- The hook called when the mm is destroyed or when mmu_notifier_unregister() is called.
(*clear_flush_young)
- The hook called after the young/accessed bit flag in the pte has been tested and cleared.
- The secondary MMU tests and clears the young/accessed bit flag for the address range start ~ end and then flushes the secondary MMU TLB.
(*clear_young)
- A lightweight version of (*clear_flush_young) above that may omit the secondary MMU TLB flush.
(*test_young)
- Returns the state of the young/accessed bit flag for address in the secondary MMU, without clearing it.
(*change_pte)
- Replaces the pte for address in the secondary MMU.
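
The registration and dispatch pattern, a list of notifiers per mm where each notifier points to an ops table whose hooks are optional, can be sketched in user space. All `toy_` names are hypothetical, a singly linked list stands in for the hlist, and no locking is shown.

```c
#include <assert.h>
#include <stddef.h>

struct toy_notifier;

/* Ops table; any hook may be left NULL by a driver. */
struct toy_notifier_ops {
        int (*test_young)(struct toy_notifier *mn, unsigned long address);
        void (*release)(struct toy_notifier *mn);
};

struct toy_notifier {
        struct toy_notifier *next;
        const struct toy_notifier_ops *ops;
};

struct toy_mm { struct toy_notifier *notifiers; };

/* Add a notifier to the head of the mm's list. */
static void toy_notifier_register(struct toy_mm *mm, struct toy_notifier *mn)
{
        assert(mn->ops != NULL);
        mn->next = mm->notifiers;
        mm->notifiers = mn;
}

/* Ask every notifier that implements test_young; OR the answers. */
static int toy_notifier_test_young(struct toy_mm *mm, unsigned long address)
{
        int young = 0;

        for (struct toy_notifier *mn = mm->notifiers; mn; mn = mn->next)
                if (mn->ops->test_young)
                        young |= mn->ops->test_young(mn, address);
        return young;
}

/* Example driver hook: treats addresses below 0x1000 as recently used. */
static int toy_drv_test_young(struct toy_notifier *mn, unsigned long address)
{
        (void)mn;
        return address < 0x1000;
}

static const struct toy_notifier_ops toy_drv_ops = {
        .test_young = toy_drv_test_young,
};
static const struct toy_notifier_ops toy_empty_ops = { NULL, NULL };
```

A notifier whose ops table leaves a hook NULL is simply skipped for that operation, which is why the dispatcher checks the pointer before each call.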

include/linux/mmu_notifier.h -2/2-

        /*
         * invalidate_range_start() and invalidate_range_end() must be
         * paired and are called only when the mmap_sem and/or the
         * locks protecting the reverse maps are held. If the subsystem
         * can't guarantee that no additional references are taken to
         * the pages in the range, it has to implement the
         * invalidate_range() notifier to remove any references taken
         * after invalidate_range_start().
         *
         * Invalidation of multiple concurrent ranges may be
         * optionally permitted by the driver. Either way the
         * establishment of sptes is forbidden in the range passed to
         * invalidate_range_begin/end for the whole duration of the
         * invalidate_range_begin/end critical section.
         *
         * invalidate_range_start() is called when all pages in the
         * range are still mapped and have at least a refcount of one.
         *
         * invalidate_range_end() is called when all pages in the
         * range have been unmapped and the pages have been freed by
         * the VM.
         *
         * The VM will remove the page table entries and potentially
         * the page between invalidate_range_start() and
         * invalidate_range_end(). If the page must not be freed
         * because of pending I/O or other circumstances then the
         * invalidate_range_start() callback (or the initial mapping
         * by the driver) must make sure that the refcount is kept
         * elevated.
         *
         * If the driver increases the refcount when the pages are
         * initially mapped into an address space then either
         * invalidate_range_start() or invalidate_range_end() may
         * decrease the refcount. If the refcount is decreased on
         * invalidate_range_start() then the VM can free pages as page
         * table entries are removed.  If the refcount is only
         * droppped on invalidate_range_end() then the driver itself
         * will drop the last refcount but it must take care to flush
         * any secondary tlb before doing the final free on the
         * page. Pages will no longer be referenced by the linux
         * address space but may still be referenced by sptes until
         * the last refcount is dropped.
         *
         * If blockable argument is set to false then the callback cannot
         * sleep and has to return with -EAGAIN. 0 should be returned
         * otherwise. Please note that if invalidate_range_start approves
         * a non-blocking behavior then the same applies to
         * invalidate_range_end.
         *
         */
        int (*invalidate_range_start)(struct mmu_notifier *mn,
                                      const struct mmu_notifier_range *range);
        void (*invalidate_range_end)(struct mmu_notifier *mn,
                                     const struct mmu_notifier_range *range);

        /*
         * invalidate_range() is either called between
         * invalidate_range_start() and invalidate_range_end() when the
         * VM has to free pages that where unmapped, but before the
         * pages are actually freed, or outside of _start()/_end() when
         * a (remote) TLB is necessary.
         *
         * If invalidate_range() is used to manage a non-CPU TLB with
         * shared page-tables, it is not necessary to implement the
         * invalidate_range_start()/end() notifiers, as
         * invalidate_range() already catches the points in time when an
         * external TLB range needs to be flushed. For more in depth
         * discussion on this see Documentation/vm/mmu_notifier.rst
         *
         * Note that this function might be called with just a sub-range
         * of what was passed to invalidate_range_start()/end(), if
         * called between those functions.
         */
        void (*invalidate_range)(struct mmu_notifier *mn, struct mm_struct *mm,
                                 unsigned long start, unsigned long end);
};

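(*invalidate_range_start)
(*invalidate_range_end)
- These two hooks work as a pair. When the VM invalidates the TLB for a range in the primary MMU, the secondary MMU must invalidate its TLB for that range as well. The hooks are called before and after the routine that changes the mappings in bulk, for example through an rmap walk.
(*invalidate_range)
- Called so that the secondary MMU is invalidated too, after the VM has invalidated the given range in the primary MMU.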
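
The pairing of the two hooks can be illustrated with a small user-space sketch. Everything in it (the `demo_` types, the counters, `demo_update_range()`) is hypothetical and only mimics the shape of the callbacks; it is not the kernel's mmu notifier API.

```c
/* Hypothetical stand-ins for illustration; these are not the kernel's types. */
struct demo_range {
    unsigned long start;
    unsigned long end;
};

struct demo_notifier_ops {
    int  (*invalidate_range_start)(const struct demo_range *range);
    void (*invalidate_range_end)(const struct demo_range *range);
};

/* Bookkeeping a secondary-MMU driver might keep (assumed for the sketch). */
static int in_flight;   /* > 0 while an invalidation is in progress */
static int start_calls;
static int end_calls;

static int demo_start(const struct demo_range *range)
{
    (void)range;
    in_flight++;
    start_calls++;
    return 0;           /* a non-blocking caller could get -EAGAIN here */
}

static void demo_end(const struct demo_range *range)
{
    (void)range;
    in_flight--;
    end_calls++;
}

static const struct demo_notifier_ops demo_ops = {
    .invalidate_range_start = demo_start,
    .invalidate_range_end   = demo_end,
};

/* Bracket a bulk update of the primary page tables with the hook pair. */
static int demo_update_range(const struct demo_notifier_ops *ops,
                             unsigned long start, unsigned long end)
{
    struct demo_range range = { .start = start, .end = end };
    int ret = ops->invalidate_range_start(&range);

    if (ret)
        return ret;     /* start refused: no update, no end call in this sketch */

    /* ... the primary MMU entries for [start, end) would change here ... */

    ops->invalidate_range_end(&range);
    return 0;
}
```

After a successful `demo_update_range(&demo_ops, 0x1000, 0x3000)` each hook has run exactly once and `in_flight` is back to 0, which is the invariant the pairing is meant to keep.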

The following figure shows where init, start, end, and so on take place for the secondary MMU when an mmu notifier is used.

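MM counters

RSS (Resident Set Size)
- The number of pages of the process that are currently mapped to physical memory.
- Swapped-out pages are excluded; stack and heap memory are included.
VSZ (Virtual memory SiZe)
- The number of virtual address pages the process is using.
- It is at least as large as RSS: it also counts swapped-out pages, and pages that were allocated in the address space but never accessed and therefore have no physical memory mapped.

The RSS and VSZ counters used by the bash process (pid=5831) can be checked as follows.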

$ cat /proc/5831/status
Name:   bash
State:  S (sleeping)
Tgid:   5831
Ngid:   0
Pid:    5831
PPid:   5795
TracerPid:      0
Uid:    0       0       0       0
Gid:    0       0       0       0
FDSize: 256
Groups: 0
NStgid: 5831
NSpid:  5831
NSpgid: 5831
NSsid:  5831
VmPeak:     6768 kB     <--- hiwater VSZ
VmSize:     6704 kB     <--- VSZ
VmLck:         0 kB
VmPin:         0 kB
VmHWM:      4552 kB     <--- hiwater RSS (file + anon + shm)
VmRSS:      3868 kB     <--- RSS
VmData:     1652 kB
VmStk:       132 kB
VmExe:       944 kB
VmLib:      1648 kB
VmPTE:        28 kB
VmPMD:        12 kB
VmSwap:        0 kB
Threads:        1
SigQ:   0/15244
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000380004
SigCgt: 000000004b817efb
CapInh: 0000000000000000
CapPrm: 0000003fffffffff
CapEff: 0000003fffffffff
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
Seccomp:        0
Cpus_allowed:   3f
Cpus_allowed_list:      0-5
Mems_allowed:   1
Mems_allowed_list:      0
voluntary_ctxt_switches:        50
nonvoluntary_ctxt_switches:     29
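
The same counters can be picked out of the status text programmatically. The following is a minimal sketch that assumes the "Key: value kB" layout shown above; `status_field_kb()` is a hypothetical helper name, not an existing API.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return the kB value of a "Key:   <n> kB" line found in a buffer that has
 * the layout of /proc/<pid>/status, or -1 if the key is missing.
 */
static long status_field_kb(const char *text, const char *key)
{
    size_t klen = strlen(key);
    const char *line = text;

    while (line && *line) {
        /* The key must be followed directly by ':' so "VmR" does not match "VmRSS". */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long kb;

            if (sscanf(line + klen + 1, "%ld", &kb) == 1)
                return kb;
            return -1;
        }
        line = strchr(line, '\n');
        if (line)
            line++;     /* step past the newline to the next line */
    }
    return -1;
}
```

Feeding it the contents of /proc/self/status (read with fopen() and fread()) and the keys "VmRSS" and "VmSize" gives the RSS and VSZ of the calling process in kB.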

Checking page references

page_referenced()

mm/rmap.c

/**
 * page_referenced - test if the page was referenced
 * @page: the page to test
 * @is_locked: caller holds lock on the page
 * @memcg: target memory cgroup
 * @vm_flags: collect encountered vma->vm_flags who actually referenced the page
 *
 * Quick test_and_clear_referenced for all mappings to a page,
 * returns the number of ptes which referenced the page.
 */

int page_referenced(struct page *page,
                    int is_locked,
                    struct mem_cgroup *memcg,
                    unsigned long *vm_flags)
{
        int we_locked = 0;
        struct page_referenced_arg pra = {
                .mapcount = total_mapcount(page),
                .memcg = memcg,
        };
        struct rmap_walk_control rwc = {
                .rmap_one = page_referenced_one,
                .arg = (void *)&pra,
                .anon_lock = page_lock_anon_vma_read,
        };

        *vm_flags = 0;
        if (!page_mapped(page))
                return 0;

        if (!page_rmapping(page))
                return 0;

        if (!is_locked && (!PageAnon(page) || PageKsm(page))) {
                we_locked = trylock_page(page);
                if (!we_locked)
                        return 1;
        }

        /*
         * If we are reclaiming on behalf of a cgroup, skip
         * counting on behalf of references from different
         * cgroups
         */
        if (memcg) {
                rwc.invalid_vma = invalid_page_referenced_vma;
        }

        rmap_walk(page, &rwc);
        *vm_flags = pra.vm_flags;

        if (we_locked)
                unlock_page(page);

        return pra.referenced;
}

page_referenced_one()

mm/rmap.c

/*
 * arg: page_referenced_arg will be passed
 */

static bool page_referenced_one(struct page *page, struct vm_area_struct *vma,
                        unsigned long address, void *arg)
{
        struct page_referenced_arg *pra = arg;
        struct page_vma_mapped_walk pvmw = {
                .page = page,
                .vma = vma,
                .address = address,
        };
        int referenced = 0;

        while (page_vma_mapped_walk(&pvmw)) {
                address = pvmw.address;

                if (vma->vm_flags & VM_LOCKED) {
                        page_vma_mapped_walk_done(&pvmw);
                        pra->vm_flags |= VM_LOCKED;
                        return false; /* To break the loop */
                }

                if (pvmw.pte) {
                        if (ptep_clear_flush_young_notify(vma, address,
                                                pvmw.pte)) {
                                /*
                                 * Don't treat a reference through
                                 * a sequentially read mapping as such.
                                 * If the page has been used in another mapping,
                                 * we will catch it; if this other mapping is
                                 * already gone, the unmap path will have set
                                 * PG_referenced or activated the page.
                                 */
                                if (likely(!(vma->vm_flags & VM_SEQ_READ)))
                                        referenced++;
                        }
                } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
                        if (pmdp_clear_flush_young_notify(vma, address,
                                                pvmw.pmd))
                                referenced++;
                } else {
                        /* unexpected pmd-mapped page? */
                        WARN_ON_ONCE(1);
                }

                pra->mapcount--;
        }

        if (referenced)
                clear_page_idle(page);
        if (test_and_clear_page_young(page))
                referenced++;

        if (referenced) {
                pra->referenced++;
                pra->vm_flags |= vma->vm_flags;
        }

        if (!pra->mapcount)
                return false; /* To break the loop */

        return true;
}

Rmap -3- (PVMW)

2019-09-10 문영일

PVMW(Page Vma Mapped Walk)

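An interface for checking whether a page is mapped in a VMA.

See: mm: introduce page_vma_mapped_walk()

The following figure shows whether a virtual address is mapped to a physical page or to a swap entry.

Request flags

The flags used with PVMW, and the functions that use them, are as follows.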

PVMW_SYNC
- Used in map_pte(); makes it skip the strict checks.
- page_mkclean_one()
- page_mapped_in_vma()
- remove_migration_pte()
PVMW_MIGRATION
- Makes the walk check for a mapping to a migration entry.
- remove_migration_pte()
No flags
- try_to_unmap_one()
- page_referenced_one()
- __replace_page()
- page_idle_clear_pte_refs_one()
- write_protect_page()
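
As a rough sketch of how a caller combines these request flags: the bit values below are assumed to match include/linux/rmap.h of this kernel generation, and `struct walk_request` and the two helpers are hypothetical names used only for illustration.

```c
#include <stdbool.h>

/* Assumed to mirror include/linux/rmap.h; treat the values as illustrative. */
#define PVMW_SYNC       (1 << 0)    /* skip the strict (racy) checks */
#define PVMW_MIGRATION  (1 << 1)    /* look for migration entries */

/* Hypothetical stand-in for the flags member of struct page_vma_mapped_walk. */
struct walk_request {
    unsigned int flags;
};

static bool wants_migration_entries(const struct walk_request *req)
{
    return (req->flags & PVMW_MIGRATION) != 0;
}

static bool skips_strict_checks(const struct walk_request *req)
{
    return (req->flags & PVMW_SYNC) != 0;
}
```

A caller such as remove_migration_pte(), listed under both flags above, would correspond to `flags = PVMW_SYNC | PVMW_MIGRATION`, while the callers in the "No flags" group leave `flags` at 0.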

Checking whether a page is mapped in a vma

page_vma_mapped_walk()

mm/page_vma_mapped.c -1/2-

/**
 * page_vma_mapped_walk - check if @pvmw->page is mapped in @pvmw->vma at
 * @pvmw->address
 * @pvmw: pointer to struct page_vma_mapped_walk. page, vma, address and flags
 * must be set. pmd, pte and ptl must be NULL.
 *
 * Returns true if the page is mapped in the vma. @pvmw->pmd and @pvmw->pte point
 * to relevant page table entries. @pvmw->ptl is locked. @pvmw->address is
 * adjusted if needed (for PTE-mapped THPs).
 *
 * If @pvmw->pmd is set but @pvmw->pte is not, you have found PMD-mapped page
 * (usually THP). For PTE-mapped THP, you should run page_vma_mapped_walk() in
 * a loop to find all PTEs that map the THP.
 *
 * For HugeTLB pages, @pvmw->pte is set to the relevant page table entry
 * regardless of which page table level the page is mapped at. @pvmw->pmd is
 * NULL.
 *
 * Returns false if there are no more page table entries for the page in
 * the vma. @pvmw->ptl is unlocked and @pvmw->pte is unmapped.
 *
 * If you need to stop the walk before page_vma_mapped_walk() returned false,
 * use page_vma_mapped_walk_done(). It will do the housekeeping.
 */

bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
{
        struct mm_struct *mm = pvmw->vma->vm_mm;
        struct page *page = pvmw->page;
        pgd_t *pgd;
        p4d_t *p4d;
        pud_t *pud;
        pmd_t pmde;

        /* The only possible pmd mapping has been handled on last iteration */
        if (pvmw->pmd && !pvmw->pte)
                return not_found(pvmw);

        if (pvmw->pte)
                goto next_pte;

        if (unlikely(PageHuge(pvmw->page))) {
                /* when pud is not present, pte will be NULL */
                pvmw->pte = huge_pte_offset(mm, pvmw->address,
                                            PAGE_SIZE << compound_order(page));
                if (!pvmw->pte)
                        return false;

                pvmw->ptl = huge_pte_lockptr(page_hstate(page), mm, pvmw->pte);
                spin_lock(pvmw->ptl);
                if (!check_pte(pvmw))
                        return not_found(pvmw);
                return true;
        }
restart:
        pgd = pgd_offset(mm, pvmw->address);
        if (!pgd_present(*pgd))
                return false;
        p4d = p4d_offset(pgd, pvmw->address);
        if (!p4d_present(*p4d))
                return false;
        pud = pud_offset(p4d, pvmw->address);
        if (!pud_present(*pud))
                return false;
        pvmw->pmd = pmd_offset(pud, pvmw->address);
        /*
         * Make sure the pmd value isn't cached in a register by the
         * compiler and used as a stale value after we've observed a
         * subsequent update.
         */
        pmde = READ_ONCE(*pvmw->pmd);
        if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
                pvmw->ptl = pmd_lock(mm, pvmw->pmd);
                if (likely(pmd_trans_huge(*pvmw->pmd))) {
                        if (pvmw->flags & PVMW_MIGRATION)
                                return not_found(pvmw);
                        if (pmd_page(*pvmw->pmd) != page)
                                return not_found(pvmw);
                        return true;
                } else if (!pmd_present(*pvmw->pmd)) {
                        if (thp_migration_supported()) {
                                if (!(pvmw->flags & PVMW_MIGRATION))
                                        return not_found(pvmw);
                                if (is_migration_entry(pmd_to_swp_entry(*pvmw->pmd))) {
                                        swp_entry_t entry = pmd_to_swp_entry(*pvmw->pmd);

                                        if (migration_entry_to_page(entry) != page)
                                                return not_found(pvmw);
                                        return true;
                                }
                        }
                        return not_found(pvmw);
                } else {
                        /* THP pmd was split under us: handle on pte level */
                        spin_unlock(pvmw->ptl);
                        pvmw->ptl = NULL;
                }
        } else if (!pmd_present(pmde)) {
                return false;
        }
        if (!map_pte(pvmw))
                goto next_pte;

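Returns whether @pvmw->page is already mapped in @pvmw->vma.

In code lines 11~12, if only the pmd is set and the pte is not, the single possible pmd mapping was already handled on the previous iteration, so page_vma_mapped_walk finishes and returns false.
In code lines 14~15, if a pte is already set, jump straight to the next_pte label.
In code lines 17~29, in the unlikely case of a HugeTLB page, look up the pte, validate it under the lock, and leave the function.
In code lines 30~40, this is the restart: label. Walk pgd -> p4d -> pud -> pmd. If an entry is not present along the way, return false.
In code lines 46~72, handle a pmd entry that is trans huge or a migration entry. If the walk completes at the pmd level, return true; if the page does not match or the request does not fit, return false. If the THP pmd was split underneath us, release the ptl lock and continue at the pte level.
In code lines 73~75, if the pmd entry is not present, return false.
In code lines 76~77, if map_pte() rejects the pte entry, jump to the next_pte label.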

mm/page_vma_mapped.c -2/2-

        while (1) {
                if (check_pte(pvmw))
                        return true;
next_pte:
                /* Seek to next pte only makes sense for THP */
                if (!PageTransHuge(pvmw->page) || PageHuge(pvmw->page))
                        return not_found(pvmw);
                do {
                        pvmw->address += PAGE_SIZE;
                        if (pvmw->address >= pvmw->vma->vm_end ||
                            pvmw->address >=
                                        __vma_address(pvmw->page, pvmw->vma) +
                                        hpage_nr_pages(pvmw->page) * PAGE_SIZE)
                                return not_found(pvmw);
                        /* Did we cross page table boundary? */
                        if (pvmw->address % PMD_SIZE == 0) {
                                pte_unmap(pvmw->pte);
                                if (pvmw->ptl) {
                                        spin_unlock(pvmw->ptl);
                                        pvmw->ptl = NULL;
                                }
                                goto restart;
                        } else {
                                pvmw->pte++;
                        }
                } while (pte_none(*pvmw->pte));

                if (!pvmw->ptl) {
                        pvmw->ptl = pte_lockptr(mm, pvmw->pmd);
                        spin_lock(pvmw->ptl);
                }
        }
}

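In code lines 1~3, loop over the pte entries; check_pte() strictly verifies that the page is mapped at the current pte, and if so true is returned.
In code lines 4~7, this is the next_pte: label. If the page is not a THP, or it is a HugeTLB page, page_vma_mapped_walk finishes and returns false.
In code lines 8~26, advance one page at a time while the pte entry is empty. If the address leaves the vma or goes past the end of the THP, page_vma_mapped_walk finishes and returns false. If the address crosses a page table (PMD_SIZE) boundary, unmap the pte, drop the ptl lock, and go back to the restart label.
In code lines 28~31, if the ptl lock is not held, take it again and repeat.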
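
The address stepping inside the next_pte loop can be modelled in user space as below. This is only a sketch: the `SKETCH_` constants assume 4 KiB pages with 512 ptes per page table, and `advance()` and `enum step` are hypothetical names, not kernel functions.

```c
/* Assumed geometry: 4 KiB pages, 512 ptes per table (2 MiB per pmd). */
#define SKETCH_PAGE_SIZE 4096UL
#define SKETCH_PMD_SIZE  (512UL * SKETCH_PAGE_SIZE)

enum step {
    STEP_DONE,      /* left the vma or the THP: the walk ends (not_found) */
    STEP_RESTART,   /* crossed a page table boundary: redo the table walk */
    STEP_NEXT_PTE,  /* same page table: just move to the next pte */
};

/*
 * Move *address forward by one page and report what the walk has to do.
 * thp_start is the virtual address of the first subpage in the vma and
 * nr_pages is the number of subpages of the THP.
 */
static enum step advance(unsigned long *address, unsigned long vma_end,
                         unsigned long thp_start, unsigned long nr_pages)
{
    *address += SKETCH_PAGE_SIZE;

    if (*address >= vma_end ||
        *address >= thp_start + nr_pages * SKETCH_PAGE_SIZE)
        return STEP_DONE;

    if (*address % SKETCH_PMD_SIZE == 0)
        return STEP_RESTART;

    return STEP_NEXT_PTE;
}
```

For a PTE-mapped THP whose first subpage sits two pages below a 2 MiB boundary, the first step stays in the same page table and the second step reports a restart, which corresponds to the goto restart in the listing.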

The following figure shows page_vma_mapped_walk() finding out whether the given address is properly mapped.

check_pte()

mm/page_vma_mapped.c

/**
 * check_pte - check if @pvmw->page is mapped at the @pvmw->pte
 *
 * page_vma_mapped_walk() found a place where @pvmw->page is *potentially*
 * mapped. check_pte() has to validate this.
 *
 * @pvmw->pte may point to empty PTE, swap PTE or PTE pointing to arbitrary
 * page.
 *
 * If PVMW_MIGRATION flag is set, returns true if @pvmw->pte contains migration
 * entry that points to @pvmw->page or any subpage in case of THP.
 *
 * If PVMW_MIGRATION flag is not set, returns true if @pvmw->pte points to
 * @pvmw->page or any subpage in case of THP.
 *
 * Otherwise, return false.
 *
 */

static bool check_pte(struct page_vma_mapped_walk *pvmw)
{
        unsigned long pfn;

        if (pvmw->flags & PVMW_MIGRATION) {
                swp_entry_t entry;
                if (!is_swap_pte(*pvmw->pte))
                        return false;
                entry = pte_to_swp_entry(*pvmw->pte);

                if (!is_migration_entry(entry))
                        return false;

                pfn = migration_entry_to_pfn(entry);
        } else if (is_swap_pte(*pvmw->pte)) {
                swp_entry_t entry;

                /* Handle un-addressable ZONE_DEVICE memory */
                entry = pte_to_swp_entry(*pvmw->pte);
                if (!is_device_private_entry(entry))
                        return false;

                pfn = device_private_entry_to_pfn(entry);
        } else {
                if (!pte_present(*pvmw->pte))
                        return false;

                pfn = pte_pfn(*pvmw->pte);
        }

        return pfn_in_hpage(pvmw->page, pfn);
}

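Returns whether the page is mapped (that is, whether the pvmw->pte entry maps pvmw->page; for a migration request, whether it is a swap pte holding a migration entry for the page).

In code lines 5~14, if PVMW_MIGRATION is requested, get the pfn from the migration entry in the swap pte. If the pte is not a swap pte, or the entry is not a migration entry, return false.
In code lines 15~23, if the pte is a swap pte, return false, unless the swap entry is a device private entry, in which case get its pfn.
- This is the un-addressable ZONE_DEVICE (HMM) memory case: the page lives in device memory that the buddy allocator does not manage, and is recorded as a special swap entry.
In code lines 24~29, if the pte entry is a normal present mapping, get the pfn of the mapped page.
In code line 31, return whether the resulting pfn lies within the pfn range of pvmw->page, which may be a normal page or a huge page.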
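
The final range test can be sketched with plain integers. `pfn_in_range()` is a hypothetical user-space analogue of the check done by pfn_in_hpage(); it takes the head pfn and the number of subpages directly instead of a struct page.

```c
#include <stdbool.h>

/*
 * True if pfn falls inside [head_pfn, head_pfn + nr_pages).
 * A normal page has nr_pages == 1; a 2 MiB THP with 4 KiB pages has 512.
 */
static bool pfn_in_range(unsigned long head_pfn, unsigned long nr_pages,
                         unsigned long pfn)
{
    /* Compare first so the subtraction cannot wrap below zero. */
    return pfn >= head_pfn && pfn - head_pfn < nr_pages;
}
```

With a THP whose head pfn is 1024, any pfn from 1024 to 1535 matches, so a pte that points at any subpage counts as a mapping of the page.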

The following figure shows checking whether the pvmw->pte entry for pvmw->address maps pvmw->page.

map_pte()

mm/page_vma_mapped.c

static bool map_pte(struct page_vma_mapped_walk *pvmw)
{
        pvmw->pte = pte_offset_map(pvmw->pmd, pvmw->address);
        if (!(pvmw->flags & PVMW_SYNC)) {
                if (pvmw->flags & PVMW_MIGRATION) {
                        if (!is_swap_pte(*pvmw->pte))
                                return false;
                } else {
                        /*
                         * We get here when we are trying to unmap a private
                         * device page from the process address space. Such
                         * page is not CPU accessible and thus is mapped as
                         * a special swap entry, nonetheless it still does
                         * count as a valid regular mapping for the page (and
                         * is accounted as such in page maps count).
                         *
                         * So handle this special case as if it was a normal
                         * page mapping ie lock CPU page table and returns
                         * true.
                         *
                         * For more details on device private memory see HMM
                         * (include/linux/hmm.h or mm/hmm.c).
                         */
                        if (is_swap_pte(*pvmw->pte)) {
                                swp_entry_t entry;

                                /* Handle un-addressable ZONE_DEVICE memory */
                                entry = pte_to_swp_entry(*pvmw->pte);
                                if (!is_device_private_entry(entry))
                                        return false;
                        } else if (!pte_present(*pvmw->pte))
                                return false;
                }
        }
        pvmw->ptl = pte_lockptr(pvmw->vma->vm_mm, pvmw->pmd);
        spin_lock(pvmw->ptl);
        return true;
}

Returns whether the pte entry looks mapped (the pvmw->pte entry is present; for a migration request, whether it is a swap pte). In sync mode nothing is checked and true is always returned.

In code line 3, get the pte entry from the pvmw->pmd entry and pvmw->address.
In code line 4, if PVMW_SYNC is requested, skip the strict checks that follow and return true.
In code lines 5~7, if PVMW_MIGRATION is requested, return false unless the pte entry is a swap pte.
In code lines 8~33, if the pte is a swap pte, return false unless it is a device private (HMM) entry, that is, memory not managed by the buddy allocator. If it is not a swap pte and not present either, return false as well.
In code lines 35~37, the pte entry passed the checks. Take the ptl lock and return true.

The following figure shows checking whether the pvmw->pte entry for pvmw->address maps pvmw->page.

This is very similar to check_pte(). With map_pte(), true is always returned when sync mode is requested.
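
The decisions map_pte() makes can be laid out as a small user-space model. It is a sketch under assumptions: `enum pte_kind` and `map_pte_accepts()` are hypothetical, the flag values are assumed to match include/linux/rmap.h, and locking is left out entirely.

```c
#include <stdbool.h>

/* Assumed to mirror include/linux/rmap.h; treat the values as illustrative. */
#define PVMW_SYNC       (1 << 0)
#define PVMW_MIGRATION  (1 << 1)

/* Hypothetical classification of what a pte slot may hold. */
enum pte_kind {
    PTE_EMPTY,                  /* pte_none() */
    PTE_IS_PRESENT,             /* normal present mapping */
    PTE_SWAP_MIGRATION,         /* swap pte holding a migration entry */
    PTE_SWAP_DEVICE_PRIVATE,    /* swap pte holding a device private entry */
    PTE_SWAP_OTHER,             /* any other swap pte */
};

static bool kind_is_swap(enum pte_kind kind)
{
    return kind == PTE_SWAP_MIGRATION ||
           kind == PTE_SWAP_DEVICE_PRIVATE ||
           kind == PTE_SWAP_OTHER;
}

/* Same branch structure as map_pte(), minus the pte lookup and the lock. */
static bool map_pte_accepts(enum pte_kind kind, unsigned int flags)
{
    if (!(flags & PVMW_SYNC)) {
        if (flags & PVMW_MIGRATION) {
            if (!kind_is_swap(kind))
                return false;
        } else if (kind_is_swap(kind)) {
            if (kind != PTE_SWAP_DEVICE_PRIVATE)
                return false;
        } else if (kind != PTE_IS_PRESENT) {
            return false;
        }
    }
    return true;
}
```

Note that with PVMW_MIGRATION the model accepts any swap pte, including a non-migration one; in the listings above it is check_pte() that later verifies the entry really is a migration entry for the page.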

struct page_vma_mapped_walk

include/linux/rmap.h

struct page_vma_mapped_walk {
        struct page *page;
        struct vm_area_struct *vma;
        unsigned long address;
        pmd_t *pmd;
        pte_t *pte;
        spinlock_t *ptl;
        unsigned int flags;
};

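*page
- The target page to check for a mapping.
*vma
- The VMA to look in.
address
- The virtual address.
pmd
- The pmd entry for the virtual address; used to pass state inside the walk functions.
pte
- The pte entry for the virtual address; used to pass state inside the walk functions.
*ptl
- The page table lock held by the walk; used inside the walk functions.
flags
- PVMW_SYNC
  - Skip the strict checks.
- PVMW_MIGRATION
  - Look for migration entries.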

Debug Memory -5- (Fault Injection)

2019-08-27 문영일

Page allocation failure injection for debugging

FAIL_PAGE_ALLOC

A debugging feature for alloc_pages(); it adds code to handle fault injection.

The "fail_page_alloc=<interval>,<probability>,<space>,<times>" kernel parameter changes the four FAULT-INJECTION attributes.
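
To make the option format concrete, here is a minimal user-space sketch that splits such a value into its four attributes with sscanf(). `struct fault_opts` and `parse_fault_opts()` are hypothetical names, and the field types are an assumption chosen for the sketch.

```c
#include <stdio.h>

/* Hypothetical container for the four attributes of the option string. */
struct fault_opts {
    unsigned long interval;
    unsigned long probability;  /* percent */
    int space;
    int times;                  /* -1 means no limit */
};

/* Parse "<interval>,<probability>,<space>,<times>"; 0 on success, -1 on error. */
static int parse_fault_opts(const char *str, struct fault_opts *opts)
{
    if (sscanf(str, "%lu,%lu,%d,%d", &opts->interval, &opts->probability,
               &opts->space, &opts->times) != 4)
        return -1;
    return 0;
}
```

For example, the value "10,50,0,-1" would describe an interval of 10, a probability of 50 percent, no space budget, and an unlimited number of failures.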

should_fail_alloc_page()

mm/page_alloc.c

static noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
{
        return __should_fail_alloc_page(gfp_mask, order);
}
ALLOW_ERROR_INJECTION(should_fail_alloc_page, TRUE);

__should_fail_alloc_page()

mm/page_alloc.c

static bool __should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
{
        if (order < fail_page_alloc.min_order)
                return false;
        if (gfp_mask & __GFP_NOFAIL)
                return false;
        if (fail_page_alloc.ignore_gfp_highmem && (gfp_mask & __GFP_HIGHMEM))
                return false;
        if (fail_page_alloc.ignore_gfp_reclaim && 
                        (gfp_mask & __GFP_DIRECT_RECLAIM))
                return false;

        return should_fail(&fail_page_alloc.attr, 1 << order);
}

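With the CONFIG_FAIL_PAGE_ALLOC kernel option this is used for debugging. It returns false when the request falls outside the conditions of the global fail_page_alloc object: the order is below min_order, or the gfp_mask carries a flag that is configured to be ignored (__GFP_HIGHMEM, __GFP_DIRECT_RECLAIM). It also returns false when __GFP_NOFAIL is set.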

should_fail()

lib/fault-inject.c

/*
 * This code is stolen from failmalloc-1.0
 * http://www.nongnu.org/failmalloc/
 */

bool should_fail(struct fault_attr *attr, ssize_t size)
{
        if (in_task()) {
                unsigned int fail_nth = READ_ONCE(current->fail_nth);

                if (fail_nth) {
                        if (!WRITE_ONCE(current->fail_nth, fail_nth - 1))
                                goto fail;

                        return false;
                }
        }

        /* No need to check any other properties if the probability is 0 */
        if (attr->probability == 0)
                return false;

        if (attr->task_filter && !fail_task(attr, current))
                return false;

        if (atomic_read(&attr->times) == 0)
                return false;

        if (atomic_read(&attr->space) > size) {
                atomic_sub(size, &attr->space);
                return false;
        }

        if (attr->interval > 1) {
                attr->count++;
                if (attr->count % attr->interval)
                        return false;
        }

        if (attr->probability <= prandom_u32() % 100)
                return false;

        if (!fail_stacktrace(attr))
                return false;

fail:
        fail_dump(attr);

        if (atomic_read(&attr->times) != -1)
                atomic_dec_not_zero(&attr->times);

        return true;
}
EXPORT_SYMBOL_GPL(should_fail);
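
How the interval, probability, space, and times attributes interact can be followed in a simplified user-space model. This is a sketch, not the kernel function: `struct fault_model` and `model_should_fail()` are hypothetical, the random number is passed in by the caller so the behaviour is reproducible, and the fail_nth, task filter, and stacktrace checks are left out.

```c
#include <stdbool.h>

/* Hypothetical, simplified mirror of the attributes should_fail() consults. */
struct fault_model {
    unsigned long probability;  /* percent; 0 disables injection */
    unsigned long interval;     /* only every interval-th candidate may fail */
    long times;                 /* failures left; -1 means unlimited */
    long space;                 /* budget consumed before failures start */
    unsigned long count;        /* candidates seen so far */
};

/* rnd stands in for the kernel's pseudo-random number. */
static bool model_should_fail(struct fault_model *m, long size, unsigned int rnd)
{
    if (m->probability == 0)
        return false;
    if (m->times == 0)
        return false;
    if (m->space > size) {
        m->space -= size;       /* still inside the budget: consume it */
        return false;
    }
    if (m->interval > 1) {
        m->count++;
        if (m->count % m->interval)
            return false;
    }
    if (m->probability <= rnd % 100)
        return false;

    if (m->times != -1)
        m->times--;             /* one failure used up */
    return true;
}
```

With probability 100, interval 3, and times 2, twelve consecutive calls produce exactly two failures (on the third and sixth call), after which times has reached 0 and injection stops.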

References

Debug Memory -1- (Page Alloc) | 문c
Debug Memory -2- (Page Poisoning) | 문c
Debug Memory -3- (Page Owner tracking) | 문c
Debug Memory -4- (Idle Page tracking) | 문c
Debug Memory -5- (Fault Injection) | 문c (this post)
page_ext_init_flatmem() | 문c
page_ext_init() | 문c