NUMA -3- (Memory policy)

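Memory Policy

The following NUMA memory policy modes are available.

MPOL_DEFAULT
- A mode used only inside the memory policy API. It is stored as a NULL policy, so the lookup falls back internally to the system default memory policy.
MPOL_PREFERRED
- Allocates from a single specified preferred node. When that node runs short of memory, the allocation may fall back to other nodes.
- When combined with the MPOL_F_LOCAL flag, however, the preferred node is ignored and the local node is tried first.
MPOL_BIND
- Allocates only from the specified bind nodes. When they run short of memory, the allocation cannot fall back to other nodes.
MPOL_INTERLEAVE
- Allocates from the specified interleave nodes in round-robin order. When a node runs short of memory, the allocation may fall back to other nodes.
MPOL_LOCAL
- The mode is converted to MPOL_PREFERRED and used together with the MPOL_F_LOCAL flag so that the local node is tried first.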
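
As a rough illustration of the list above, the following user-space sketch models how a "local" request is rewritten as "preferred plus MPOL_F_LOCAL", and which mode refuses to fall back. Every `sketch_` and `SKETCH_` name is hypothetical and exists only for this example; this is not the kernel's code.

```c
#include <assert.h>

/* Illustrative stand-ins for the modes and the internal flag named above. */
enum sketch_mode {
        SKETCH_DEFAULT,
        SKETCH_PREFERRED,
        SKETCH_BIND,
        SKETCH_INTERLEAVE,
        SKETCH_LOCAL,
};
#define SKETCH_F_LOCAL 0x2u     /* same value as MPOL_F_LOCAL(2) in this article */

struct sketch_policy {
        enum sketch_mode mode;
        unsigned int flags;
};

/* Rewrite a "local" request as "preferred + F_LOCAL"; other modes pass through. */
struct sketch_policy sketch_normalize(enum sketch_mode requested)
{
        struct sketch_policy pol = { requested, 0 };

        assert(requested <= SKETCH_LOCAL);
        if (requested == SKETCH_LOCAL) {
                pol.mode = SKETCH_PREFERRED;
                pol.flags |= SKETCH_F_LOCAL;
        }
        return pol;
}

/* Only the bind mode refuses to fall back to nodes outside its set. */
int sketch_may_fall_back(const struct sketch_policy *pol)
{
        return pol->mode != SKETCH_BIND;
}
```

Calling `sketch_normalize(SKETCH_LOCAL)` yields a policy whose mode is `SKETCH_PREFERRED` with `SKETCH_F_LOCAL` set, which is the combination the rest of this article checks for.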

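Memory Policy Flags

The following flags are used with memory policies.

Flags used with set_mempolicy()
- MPOL_F_STATIC_NODES(0x8000)
  - Static nodes: the node mask is used as given and is not remapped when the task's allowed nodes change.
- MPOL_F_RELATIVE_NODES(0x4000)
  - Relative nodes: the node mask is interpreted relative to the task's allowed nodes.
Flags used with get_mempolicy()
- MPOL_F_NODE(1)
  - Return the next interleave (IL) node instead of the node mask
- MPOL_F_ADDR(2)
  - Look up the vma by address
- MPOL_F_MEMS_ALLOWED(4)
  - Return the allowed memory nodes
Flags used with mbind()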
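- MPOL_MF_STRICT(1)
  - Verify existing pages in the mapping
- MPOL_MF_MOVE(2)
  - Move pages owned by this process to conform to the policy
- MPOL_MF_MOVE_ALL(4)
  - Move every page to conform to the policy
- MPOL_MF_LAZY(8)
  - Modifies the move flags: lazy migrate on fault
- MPOL_MF_INTERNAL(16)
  - Internal flags start here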
Internal flags used together with the mode
- MPOL_F_SHARED(1)
  - Shared policy
- MPOL_F_LOCAL(2)
  - Preferred local-node allocation
- MPOL_F_MOF(8)
  - Migrate on fault
- MPOL_F_MORON(16)
  - Migrate On protnone Reference On Node
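
The optional set_mempolicy() mode flags are OR-ed into the mode argument, so the two parts have to be separated with a mask. The sketch below shows one way to do that in user space, using the flag values listed above. The `SKETCH_` names and both functions are hypothetical, and the mask assumes only the two flags shown in this article.

```c
#include <assert.h>

/* Values copied from the list above; the names are illustrative only. */
#define SKETCH_F_STATIC_NODES   0x8000u
#define SKETCH_F_RELATIVE_NODES 0x4000u
#define SKETCH_MODE_FLAGS (SKETCH_F_STATIC_NODES | SKETCH_F_RELATIVE_NODES)

struct sketch_mode_arg {
        unsigned int mode;      /* base mode with the flag bits removed */
        unsigned int flags;     /* only the optional mode flag bits */
};

/* Separate a combined "mode | flags" argument into its two parts. */
struct sketch_mode_arg sketch_split_mode(unsigned int arg)
{
        struct sketch_mode_arg out;

        out.flags = arg & SKETCH_MODE_FLAGS;
        out.mode = arg & ~SKETCH_MODE_FLAGS;
        /* Sanity check: no flag bit may remain in the base mode. */
        assert((out.mode & SKETCH_MODE_FLAGS) == 0);
        return out;
}

/* Static and relative node handling contradict each other. */
int sketch_flags_valid(unsigned int flags)
{
        return (flags & SKETCH_MODE_FLAGS) != SKETCH_MODE_FLAGS;
}
```

Passing a mode value of 2 combined with `SKETCH_F_STATIC_NODES` gives back mode 2 and flags 0x8000, while a value carrying both flags is reported as invalid.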

The following output shows a system with two nodes, each with 20 CPUs.

$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 20 21 22 23 24 25 26 27 28 29
node 0 size: 32654 MB
node 0 free: 18259 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19 30 31 32 33 34 35 36 37 38 39
node 1 size: 32768 MB
node 1 free: 15491 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

The following output shows the NUMA policy set to default.

$ numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
cpubind: 0 1
nodebind: 0 1
membind: 0 1

NUMA Memory Policy

The default_policy global object

mm/mempolicy.c

/*
 * run-time system-wide default policy => local allocation
 */
static struct mempolicy default_policy = {
        .refcnt = ATOMIC_INIT(1), /* never free it */
        .mode = MPOL_PREFERRED,
        .flags = MPOL_F_LOCAL,
};

The system-wide default memory policy is set to prefer the local node.

Getting the memory policy for a task

get_task_policy()

mm/mempolicy.c

struct mempolicy *get_task_policy(struct task_struct *p)
{
        struct mempolicy *pol = p->mempolicy;
        int node;

        if (pol)
                return pol;

        node = numa_node_id();
        if (node != NUMA_NO_NODE) {
                pol = &preferred_node_policy[node];
                /* preferred_node_policy is not initialised early in boot */
                if (pol->mode)
                        return pol;
        }

        return &default_policy;
}

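Returns the memory policy for the given task according to the following cases.

If the task has its own policy -> 1) the memory policy set on the task
Otherwise, if the current CPU's node is valid and preferred_node_policy[] has been initialized for that node -> 2) the preferred-node memory policy for that node
Otherwise -> 3) the system default memory policy (local node preferred)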
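
The three cases can be modeled in user space as follows. This is only a sketch: the `sketch_` types and globals are hypothetical, and the node is passed in as a parameter instead of being read with numa_node_id().

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NR_NODES 2
#define SKETCH_NO_NODE  (-1)

struct sketch_taskpol {
        int mode;       /* 0 means "not initialised yet" in this sketch */
};

struct sketch_task {
        struct sketch_taskpol *mempolicy;       /* NULL when the task has none */
};

/* Per-node preferred policies and the system-wide default (illustrative). */
struct sketch_taskpol sketch_preferred[SKETCH_NR_NODES];
struct sketch_taskpol sketch_default = { 1 };

/* Same three-step lookup as described above, with the node passed in. */
struct sketch_taskpol *sketch_get_task_policy(struct sketch_task *p, int node)
{
        assert(node == SKETCH_NO_NODE || (node >= 0 && node < SKETCH_NR_NODES));

        if (p->mempolicy)
                return p->mempolicy;            /* 1) the task's own policy */
        if (node != SKETCH_NO_NODE && sketch_preferred[node].mode)
                return &sketch_preferred[node]; /* 2) per-node policy */
        return &sketch_default;                 /* 3) system default */
}
```

Because `sketch_preferred[]` starts out zeroed, a lookup made before it is filled in lands on `sketch_default`, which mirrors the early-boot comment in the kernel function.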

policy_node()

mm/mempolicy.c

/* Return the node id preferred by the given mempolicy, or the given id */
static int policy_node(gfp_t gfp, struct mempolicy *policy,
                                                                int nd)
{
        if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL))
                nd = policy->v.preferred_node;
        else {
                /*
                 * __GFP_THISNODE shouldn't even be used with the bind policy
                 * because we might easily break the expectation to stay on the
                 * requested node and not break the policy.
                 */
                WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
        }

        return nd;
}

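If the memory policy is in preferred mode and a preferred node is specified (MPOL_F_LOCAL is not set), the function returns that node id. Otherwise it returns the input argument @nd unchanged.

Even in preferred mode, if the policy asks for the local node (MPOL_F_LOCAL), @nd is returned as is.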
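
To check the three outcomes outside the kernel, the decision can be reduced to a small user-space function. The struct, the constants, and `sketch_policy_node()` are hypothetical names for this sketch; the WARN_ON_ONCE() check for MPOL_BIND is left out.

```c
#include <assert.h>

#define SKETCH_MODE_PREFERRED 1
#define SKETCH_FLAG_LOCAL     0x2u

struct sketch_nodepol {
        int mode;
        unsigned int flags;
        int preferred_node;
};

/* Return the policy's preferred node, or the caller-supplied node nd. */
int sketch_policy_node(const struct sketch_nodepol *pol, int nd)
{
        assert(nd >= 0);
        if (pol->mode == SKETCH_MODE_PREFERRED &&
            !(pol->flags & SKETCH_FLAG_LOCAL))
                return pol->preferred_node;
        return nd;
}
```

With a preferred node of 1 and a caller-supplied node of 0, the function returns 1 for a plain preferred policy and 0 for both a local-preferred policy and any other mode.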

policy_nodemask()

mm/mempolicy.c

/*
 * Return a nodemask representing a mempolicy for filtering nodes for
 * page allocation
 */

static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
{
        /* Lower zones don't get a nodemask applied for MPOL_BIND */
        if (unlikely(policy->mode == MPOL_BIND) &&
                        apply_policy_zone(policy, gfp_zone(gfp)) &&
                        cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
                return &policy->v.nodes;

        return NULL;
}

For an MPOL_BIND policy, the function returns the policy's node mask when the requested zone is one the policy applies to and the mask intersects the memory nodes allowed by the cpuset. In every other case it returns NULL, which means no node mask filter is applied to the allocation.

Interleave memory policy

interleave_nodes()

mm/mempolicy.c

/* Do dynamic interleaving for a process */
static unsigned interleave_nodes(struct mempolicy *policy)
{
        unsigned nid, next;
        struct task_struct *me = current;

        nid = me->il_next;
        next = next_node(nid, policy->v.nodes);
        if (next >= MAX_NUMNODES) 
                next = first_node(policy->v.nodes);
        if (next < MAX_NUMNODES)
                me->il_next = next;
        return nid;
}

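Walks the interleave nodes in turn and returns a node id.

The function returns the node remembered in current->il_next and, before returning, stores the node to hand out next in current->il_next.
When the memory policy is MPOL_INTERLEAVE, il_next holds the node to use, assigned in interleave (round-robin) order.
- next_node()
  - Finds the next node after the given node in the node bitmap. Returns MAX_NUMNODES if none is found.
- first_node()
  - Finds the first node in the node bitmap.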
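
The round-robin walk can be tried out in user space with a plain bitmask standing in for nodemask_t. This is a minimal sketch: `SKETCH_MAX_NODES` and the `sketch_` helpers are hypothetical, and `il_next` is passed by pointer instead of living in the task struct.

```c
#include <assert.h>

#define SKETCH_MAX_NODES 8

/* Next set bit strictly after nid, or SKETCH_MAX_NODES if there is none. */
unsigned sketch_next_node(unsigned nid, unsigned mask)
{
        unsigned n;

        for (n = nid + 1; n < SKETCH_MAX_NODES; n++)
                if (mask & (1u << n))
                        return n;
        return SKETCH_MAX_NODES;
}

/* Lowest set bit, or SKETCH_MAX_NODES if the mask is empty. */
unsigned sketch_first_node(unsigned mask)
{
        unsigned n;

        for (n = 0; n < SKETCH_MAX_NODES; n++)
                if (mask & (1u << n))
                        return n;
        return SKETCH_MAX_NODES;
}

/* Return the remembered node and advance *il_next, wrapping at the end. */
unsigned sketch_interleave(unsigned *il_next, unsigned mask)
{
        unsigned nid = *il_next;
        unsigned next;

        assert(nid < SKETCH_MAX_NODES);
        next = sketch_next_node(nid, mask);
        if (next >= SKETCH_MAX_NODES)
                next = sketch_first_node(mask);
        if (next < SKETCH_MAX_NODES)
                *il_next = next;
        return nid;
}
```

With a mask of 0xD (nodes 0, 2 and 3) and `il_next` starting at 0, four successive calls return 0, 2, 3 and then 0 again.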

alloc_page_interleave()

mm/mempolicy.c

/* Allocate a page in interleaved policy.
   Own path because it needs to do special accounting. */

static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
                                        unsigned nid)
{
        struct page *page;

        page = __alloc_pages(gfp, order, nid);
        /* skip NUMA_INTERLEAVE_HIT counter update if numa stats is disabled */
        if (!static_branch_likely(&vm_numa_stat_key))
                return page;
        if (page && page_to_nid(page) == nid) {
                preempt_disable();
                __inc_numa_state(page_zone(page), NUMA_INTERLEAVE_HIT);
                preempt_enable();
        }
        return page;
}

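Allocates 2^order contiguous physical pages under the interleave memory policy.

In code line 6 (counting from the function signature), 2^order contiguous physical pages are allocated from the requested node with the requested gfp flags.
- Depending on the gfp flags, the allocator uses either node_zonelists[0], the zonelist that covers all nodes, or node_zonelists[1], which holds only the zones of the specified node (selected when __GFP_THISNODE is set).
In code lines 8~9, the counter update is skipped when NUMA statistics are disabled.
In code lines 10~14, the NUMA statistics are updated.
- The NUMA_INTERLEAVE_HIT stat of the page's zone is incremented only when the page was actually allocated from the requested node.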
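
The accounting rule, counting a hit only when the requested node actually served the allocation, can be illustrated with a toy allocator. This is a simplified sketch: the `sketch_numa` struct and its fallback loop are hypothetical, and the counter is kept per node here, whereas the kernel keeps it per zone.

```c
#include <assert.h>

#define SKETCH_NODES 2

/* Free pages left per node and per-node interleave-hit counters. */
struct sketch_numa {
        int free_pages[SKETCH_NODES];
        unsigned long interleave_hit[SKETCH_NODES];
};

/*
 * Try the requested node first, then fall back to any node that still has
 * free pages. Returns the node that served the request, or -1 on failure.
 * The hit counter is bumped only when the requested node was used.
 */
int sketch_alloc_interleave(struct sketch_numa *s, int nid)
{
        int got = -1, n;

        assert(nid >= 0 && nid < SKETCH_NODES);
        if (s->free_pages[nid] > 0) {
                got = nid;
        } else {
                for (n = 0; n < SKETCH_NODES && got < 0; n++)
                        if (s->free_pages[n] > 0)
                                got = n;
        }
        if (got < 0)
                return -1;
        s->free_pages[got]--;
        if (got == nid)
                s->interleave_hit[nid]++;
        return got;
}
```

Starting with one free page on node 0 and two on node 1, a second request for node 0 is served by node 1 and leaves the node 0 hit counter at 1.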

Key structures

The mempolicy structure

include/linux/mempolicy.h

/*
 * Describe a memory policy.
 *
 * A mempolicy can be either associated with a process or with a VMA.
 * For VMA related allocations the VMA policy is preferred, otherwise
 * the process policy is used. Interrupts ignore the memory policy
 * of the current process.
 *
 * Locking policy for interlave:
 * In process context there is no locking because only the process accesses
 * its own state. All vma manipulation is somewhat protected by a down_read on
 * mmap_sem.
 *
 * Freeing policy:
 * Mempolicy objects are reference counted.  A mempolicy will be freed when
 * mpol_put() decrements the reference count to zero.
 *
 * Duplicating policy objects:
 * mpol_dup() allocates a new mempolicy and copies the specified mempolicy
 * to the new storage.  The reference count of the new object is initialized
 * to 1, representing the caller of mpol_dup().
 */

struct mempolicy {
        atomic_t refcnt;
        unsigned short mode;    /* See MPOL_* above */
        unsigned short flags;   /* See set_mempolicy() MPOL_F_* above */
        union {
                short            preferred_node; /* preferred */
                nodemask_t       nodes;         /* interleave/bind */
                /* undefined for default */
        } v;
        union {
                nodemask_t cpuset_mems_allowed; /* relative to these nodes */
                nodemask_t user_nodemask;       /* nodemask passed by user */
        } w;
};
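
The freeing and duplicating rules in the comment above can be sketched with a plain integer reference count. The `sketch_mpol_*` functions are hypothetical user-space stand-ins; the kernel uses atomic_t and its own allocator.

```c
#include <assert.h>
#include <stdlib.h>

struct sketch_mpol {
        int refcnt;             /* plain int here; the kernel uses atomic_t */
        unsigned short mode;
};

/* Allocate a policy that starts with one reference, held by the caller. */
struct sketch_mpol *sketch_mpol_new(unsigned short mode)
{
        struct sketch_mpol *pol = malloc(sizeof(*pol));

        if (!pol)
                return NULL;
        pol->refcnt = 1;
        pol->mode = mode;
        return pol;
}

/* Take an additional reference. */
void sketch_mpol_get(struct sketch_mpol *pol)
{
        pol->refcnt++;
}

/* Drop a reference; returns 1 if the object was freed, 0 otherwise. */
int sketch_mpol_put(struct sketch_mpol *pol)
{
        assert(pol->refcnt > 0);
        if (--pol->refcnt > 0)
                return 0;
        free(pol);
        return 1;
}

/* Duplicate: the copy gets its own reference count of 1. */
struct sketch_mpol *sketch_mpol_dup(const struct sketch_mpol *old)
{
        return sketch_mpol_new(old->mode);
}
```

A duplicate is independent of the original: dropping the last reference to one does not affect the other.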
