문c Blog

mdesc->init_early()

2016-04-15 문영일

Calls the init_early() function defined for each machine.

The machine definition below is the one used by rpi2; its init_early member points to the bcm2709_init_early() function.

arch/arm/mach-bcm2709/bcm2709.c

MACHINE_START(BCM2709, "BCM2709")
    /* Maintainer: Broadcom Europe Ltd. */
#ifdef CONFIG_SMP
        .smp            = smp_ops(bcm2709_smp_ops),
#endif
        .map_io = bcm2709_map_io,
        .init_irq = bcm2709_init_irq,
        .init_time = bcm2709_timer_init,
        .init_machine = bcm2709_init,
        .init_early = bcm2709_init_early,
        .reserve = board_reserve,
        .restart        = bcm2709_restart,
        .dt_compat = bcm2709_compat,
MACHINE_END

bcm2709_init_early()

arch/arm/mach-bcm2709/bcm2709.c

void __init bcm2709_init_early(void)
{
        /*
         * Some devices allocate their coherent buffers from atomic
         * context. Increase size of atomic coherent pool to make sure such
         * the allocations won't fail.
         */
        init_dma_coherent_pool_size(SZ_4M);
}

Increases the size of the atomic coherent DMA pool from the default 256 KiB to 4 MiB.

init_dma_coherent_pool_size()

arch/arm/mm/dma-mapping.c

/*      
 * This can be called during early boot to increase the size of the atomic
 * coherent DMA pool above the default value of 256KiB. It must be called
 * before postcore_initcall. 
 */
void __init init_dma_coherent_pool_size(unsigned long size)
{
        /*
         * Catch any attempt to set the pool size too late.
         */
        BUG_ON(atomic_pool);

        /*
         * Set architecture specific coherent pool size only if
         * it has not been changed by kernel command line parameter.
         */
        if (atomic_pool_size == DEFAULT_DMA_COHERENT_POOL_SIZE)
                atomic_pool_size = size;
}

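If the coherent pool size is still at its default of 256 KiB, it is changed to the requested size. A size already changed through the kernel command line is left alone.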

static size_t atomic_pool_size = DEFAULT_DMA_COHERENT_POOL_SIZE;

#define DEFAULT_DMA_COHERENT_POOL_SIZE  SZ_256K

coherent_pool kernel parameter

early_coherent_pool()

arch/arm/mm/dma-mapping.c

static int __init early_coherent_pool(char *p)
{
        atomic_pool_size = memparse(p, &p);
        return 0;
}
early_param("coherent_pool", early_coherent_pool);

The "coherent_pool=" kernel parameter can be used to change the coherent pool size.
- e.g. "coherent_pool=2M"
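The precedence between the command line and the machine's init_early() hook can be sketched in user space. This is only a model under stated assumptions: struct pool_cfg and the pool_* functions are hypothetical names that stand in for atomic_pool_size, early_coherent_pool() and init_dma_coherent_pool_size().

```c
#include <assert.h>
#include <stddef.h>

#define DEFAULT_POOL_SIZE ((size_t)256 * 1024)  /* mirrors SZ_256K */

/* Hypothetical stand-in for the kernel's atomic_pool_size variable. */
struct pool_cfg {
        size_t size;
};

void pool_init(struct pool_cfg *cfg)
{
        assert(cfg != NULL);
        cfg->size = DEFAULT_POOL_SIZE;
}

/* Models early_coherent_pool(): the command line sets the size directly. */
void pool_set_from_cmdline(struct pool_cfg *cfg, size_t size)
{
        assert(cfg != NULL);
        cfg->size = size;
}

/*
 * Models init_dma_coherent_pool_size(): the machine's request is applied
 * only while the size is still the untouched default.
 */
void pool_set_from_machine(struct pool_cfg *cfg, size_t size)
{
        assert(cfg != NULL);
        if (cfg->size == DEFAULT_POOL_SIZE)
                cfg->size = size;
}
```

One consequence of comparing against the default value rather than tracking a separate "was set" flag: a command line of "coherent_pool=256K" is indistinguishable from no parameter at all, so the machine's 4 MiB request would still win in that case.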

reserve_crashkernel()

2016-04-15 / 2020-03-09 문영일

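When the kernel crashes (panics), a separate capture kernel must already be loaded so that a dump can be written for root-cause analysis. This function reserves the memory region that the capture kernel needs.

The following figure shows the sequence when the kernel panics: the system boots into the capture kernel, writes the dump to a file, and then boots back into the original kernel.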

crashkernel kernel parameter

        crashkernel=size[KMG][@offset[KMG]]
                        [KNL] Using kexec, Linux can switch to a 'crash kernel'
                        upon panic. This parameter reserves the physical
                        memory region [offset, offset + size] for that kernel
                        image. If '@offset' is omitted, then a suitable offset
                        is selected automatically. Check
                        Documentation/kdump/kdump.txt for further details.

        crashkernel=range1:size1[,range2:size2,...][@offset]
                        [KNL] Same as above, but depends on the memory
                        in the running system. The syntax of range is
                        start-[end] where start and end are both
                        a memory unit (amount[KMG]). See also
                        Documentation/kdump/kdump.txt for an example.

        crashkernel=size[KMG],high
                        [KNL, x86_64] range could be above 4G. Allow kernel
                        to allocate physical memory region from top, so could
                        be above 4G if system have more than 4G ram installed.
                        Otherwise memory region will be allocated below 4G, if
                        available.
                        It will be ignored if crashkernel=X is specified.
        crashkernel=size[KMG],low
                        [KNL, x86_64] range under 4G. When crashkernel=X,high
                        is passed, kernel could allocate physical memory region
                        above 4G, that cause second kernel crash on system
                        that require some amount of low memory, e.g. swiotlb
                        requires at least 64M+32K low memory.  Kernel would
                        try to allocate 72M below 4G automatically.
                        This one let user to specify own low range under 4G
                        for second kernel instead.
                        0: to disable low allocation.
                        It will be ignored when crashkernel=X,high is not used
                        or memory reserved is below 4G.

The ",high" and ",low" options are used only on the x86_64 architecture.

Kernel options

In addition to the kernel options below, CONFIG_SYSFS=y and CONFIG_DEBUG_INFO=y are also needed.

CONFIG_KEXEC

arch/arm/Kconfig

config KEXEC
        bool "Kexec system call (EXPERIMENTAL)"
        depends on (!SMP || PM_SLEEP_SMP)
        help
          kexec is a system call that implements the ability to shutdown your
          current kernel, and to start another kernel.  It is like a reboot
          but it is independent of the system firmware.   And like a reboot
          you can start any kernel with it, not just Linux.
        
          It is an ongoing process to be certain the hardware in a machine
          is properly shutdown, so do not be surprised if this code does not
          initially work for you.

CONFIG_CRASH_DUMP

config CRASH_DUMP
        bool "Build kdump crash kernel (EXPERIMENTAL)"
        help
          Generate crash dump after being started by kexec. This should
          be normally only set in special crash dump kernels which are
          loaded in the main kernel with kexec-tools into a specially
          reserved region and then later executed after a crash by
          kdump/kexec. The crash dump kernel must be compiled to a
          memory address not used by the main kernel

          For more details see Documentation/kdump/kdump.txt

CONFIG_PROC_VMCORE

fs/proc/Kconfig

config PROC_VMCORE
        bool "/proc/vmcore support"
        depends on PROC_FS && CRASH_DUMP
        default y
        help
        Exports the dump image of crashed kernel in ELF format.

reserve_crashkernel()

arch/arm/kernel/setup.c

/**
 * reserve_crashkernel() - reserves memory are for crash kernel
 *
 * This function reserves memory area given in "crashkernel=" kernel command
 * line parameter. The memory reserved is used by a dump capture kernel when
 * primary kernel is crashing.
 */
static void __init reserve_crashkernel(void)
{
        unsigned long long crash_size, crash_base;
        unsigned long long total_mem;
        int ret; 

        total_mem = get_total_mem();
        ret = parse_crashkernel(boot_command_line, total_mem,
                                &crash_size, &crash_base);
        if (ret)
                return;

        ret = memblock_reserve(crash_base, crash_size);
        if (ret < 0) { 
                pr_warn("crashkernel reservation failed - memory is in use (0x%lx)\n",
                        (unsigned long)crash_base);
                return;
        }

        pr_info("Reserving %ldMB of memory at %ldMB for crashkernel (System RAM: %ldMB)\n",
                (unsigned long)(crash_size >> 20), 
                (unsigned long)(crash_base >> 20), 
                (unsigned long)(total_mem >> 20));

        crashk_res.start = crash_base;
        crashk_res.end = crash_base + crash_size - 1; 
        insert_resource(&iomem_resource, &crashk_res);
}

total_mem = get_total_mem();
- Gets the lowmem size.
ret = parse_crashkernel(boot_command_line, total_mem, &crash_size, &crash_base);
- Parses the "crashkernel=" kernel parameter to obtain crash_size and crash_base. On a nonzero return value the function returns without reserving anything.
ret = memblock_reserve(crash_base, crash_size);
- Reserves the crash region in memblock.
insert_resource(&iomem_resource, &crashk_res);
- Adds the crashk_res resource to the global iomem_resource.
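The two small pieces of arithmetic in reserve_crashkernel() are easy to get wrong by one, so here is a minimal user-space sketch of them. struct range, crash_range() and to_mib() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Like struct resource, the end address is inclusive. */
struct range {
        uint64_t start;
        uint64_t end;
};

/* Mirrors how crashk_res.start and crashk_res.end are filled in. */
struct range crash_range(uint64_t base, uint64_t size)
{
        struct range r;

        assert(size > 0);
        r.start = base;
        r.end = base + size - 1;   /* last byte of the region, not one past it */
        return r;
}

/* Whole mebibytes, as the pr_info() call computes them with ">> 20". */
uint64_t to_mib(uint64_t bytes)
{
        return bytes >> 20;
}
```

For example, a 64 MiB region at 256 MiB covers 0x10000000 through 0x13ffffff and would be reported as 64MB at 256MB.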

get_total_mem()

arch/arm/kernel/setup.c

static inline unsigned long long get_total_mem(void)
{
        unsigned long total;
        
        total = max_low_pfn - min_low_pfn;
        return total << PAGE_SHIFT;
}

Returns the lowmem size in bytes.
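As a quick worked example of the PFN arithmetic, the sketch below converts a page-frame span to bytes. It assumes 4 KiB pages (PAGE_SHIFT of 12), and lowmem_bytes() is a hypothetical name; the two PFN limits are passed as parameters instead of being read from kernel globals.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12   /* assumption: 4 KiB pages */

/* Hypothetical model of get_total_mem(): lowmem PFN span converted to bytes. */
uint64_t lowmem_bytes(unsigned long min_low_pfn, unsigned long max_low_pfn)
{
        assert(max_low_pfn >= min_low_pfn);
        /* Widen first so the shift is done in 64 bits. */
        return (uint64_t)(max_low_pfn - min_low_pfn) << PAGE_SHIFT;
}
```

With min_low_pfn = 0 and max_low_pfn = 0x30000, the result is 0x30000000 bytes, i.e. 768 MiB of lowmem.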

parse_crashkernel()

kernel/kexec.c

/*
 * That function is the entry point for command line parsing and should be
 * called from the arch-specific code.
 */
int __init parse_crashkernel(char *cmdline,
                             unsigned long long system_ram,
                             unsigned long long *crash_size,
                             unsigned long long *crash_base)
{
        return __parse_crashkernel(cmdline, system_ram, crash_size, crash_base,
                                        "crashkernel=", NULL);
}

Parses the cmdline string to obtain the crash_base and crash_size values.

__parse_crashkernel()

kernel/kexec.c

static int __init __parse_crashkernel(char *cmdline,
                             unsigned long long system_ram,
                             unsigned long long *crash_size,
                             unsigned long long *crash_base,
                             const char *name,
                             const char *suffix)
{
        char    *first_colon, *first_space;
        char    *ck_cmdline;

        BUG_ON(!crash_size || !crash_base);
        *crash_size = 0;
        *crash_base = 0;

        ck_cmdline = get_last_crashkernel(cmdline, name, suffix);

        if (!ck_cmdline)
                return -EINVAL;

        ck_cmdline += strlen(name);

        if (suffix)
                return parse_crashkernel_suffix(ck_cmdline, crash_size,
                                suffix);
        /*
         * if the commandline contains a ':', then that's the extended
         * syntax -- if not, it must be the classic syntax
         */
        first_colon = strchr(ck_cmdline, ':');
        first_space = strchr(ck_cmdline, ' ');
        if (first_colon && (!first_space || first_colon < first_space))
                return parse_crashkernel_mem(ck_cmdline, system_ram,
                                crash_size, crash_base);

        return parse_crashkernel_simple(ck_cmdline, crash_size, crash_base);
}

Obtains crash_size and crash_base, calling a different function depending on which of the following three syntax forms is used.

1) A suffix is given
- "64M,high"
2) A range delimited by ':' is given
- "xxx-yyy:64M@0"
3) Only a size (and optionally a start address) is given
- "64M@0"

ck_cmdline = get_last_crashkernel(cmdline, name, suffix);
- Finds the last occurrence of name in cmdline whose suffix matches.
  - name="crashkernel="
  - suffix=",high", ",low" or NULL
  - e.g. cmdline="console=ttyS0 crashkernel=128M@0", name="crashkernel=", suffix=NULL
    - ck_cmdline="crashkernel=128M@0"
ck_cmdline += strlen(name);
- Advances ck_cmdline to just past the "crashkernel=" string.
if (suffix) return parse_crashkernel_suffix(ck_cmdline, crash_size, suffix);
- When a suffix is given, the "size,suffix" form is parsed.
first_colon = strchr(ck_cmdline, ':');
- Looks for the first ':' character.
first_space = strchr(ck_cmdline, ' ');
- Looks for the first space character.
if (first_colon && (!first_space || first_colon < first_space)) return parse_crashkernel_mem(ck_cmdline, system_ram, crash_size, crash_base);
- When a ':' appears before the first space, the argument has the form "ramsize-range:size[,...][@offset]". The size of the first comma-separated range that contains system_ram is stored in the output argument crash_size, @offset is parsed into crash_base, and 0 is returned on success. If a size is greater than or equal to system_ram, an error is returned.
return parse_crashkernel_simple(ck_cmdline, crash_size, crash_base);
- For the simple form "size[@offset]", size is parsed into crash_size and @offset into crash_base.
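The three-way dispatch described above can be isolated into a small classifier. This is a sketch, not the kernel function: enum ck_syntax and classify_crashkernel() are hypothetical names, and arg is assumed to point just past "crashkernel=".

```c
#include <assert.h>
#include <string.h>

enum ck_syntax { CK_SUFFIX, CK_EXTENDED, CK_SIMPLE };

/*
 * arg points just past "crashkernel="; suffix is ",high", ",low" or NULL.
 * The rest of the command line may follow arg after a space.
 */
enum ck_syntax classify_crashkernel(const char *arg, const char *suffix)
{
        const char *colon, *space;

        assert(arg != NULL);
        if (suffix)
                return CK_SUFFIX;

        colon = strchr(arg, ':');
        space = strchr(arg, ' ');
        /* A ':' only counts if it sits before the next space. */
        if (colon && (!space || colon < space))
                return CK_EXTENDED;

        return CK_SIMPLE;
}
```

The comparison against the first space matters because the string is not cut at the end of the parameter: a ':' that belongs to a later parameter (for instance an NFS root path) must not switch the parser to the extended syntax.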

get_last_crashkernel()

kernel/kexec.c

static __init char *get_last_crashkernel(char *cmdline,
                             const char *name,
                             const char *suffix)
{
        char *p = cmdline, *ck_cmdline = NULL;

        /* find crashkernel and use the last one if there are more */
        p = strstr(p, name);
        while (p) {
                char *end_p = strchr(p, ' ');
                char *q;

                if (!end_p)
                        end_p = p + strlen(p);

                if (!suffix) {
                        int i;

                        /* skip the one with any known suffix */
                        for (i = 0; suffix_tbl[i]; i++) {
                                q = end_p - strlen(suffix_tbl[i]);
                                if (!strncmp(q, suffix_tbl[i],
                                             strlen(suffix_tbl[i])))
                                        goto next;
                        }
                        ck_cmdline = p;
                } else {
                        q = end_p - strlen(suffix);
                        if (!strncmp(q, suffix, strlen(suffix)))
                                ck_cmdline = p;
                }
next:
                p = strstr(p+1, name);
        }

        if (!ck_cmdline)
                return NULL;

        return ck_cmdline;
}

Finds the last "crashkernel=" entry in cmdline that ends with the suffix given as an argument. If no suffix argument is given, entries that end with a known suffix (",high" or ",low") are skipped and the last remaining entry is returned.

p = strstr(p, name);
- Searches the cmdline string for the name string ("crashkernel=").
while (p) { char *end_p = strchr(p, ' ');
- Loops while a match exists. end_p marks the end of the matched entry: the next space, or the end of the string.
for (i = 0; suffix_tbl[i]; i++) {
- When no suffix argument is given, iterates over the entries of the suffix_tbl[] array.
if (!strncmp(q, suffix_tbl[i], strlen(suffix_tbl[i]))) goto next;
- If the entry ends with one of the strings in suffix_tbl, jumps to next.
} else { q = end_p - strlen(suffix);
- When a suffix argument is given
if (!strncmp(q, suffix, strlen(suffix))) ck_cmdline = p;
- If the entry ends with the given suffix, its position is remembered in ck_cmdline for now.
p = strstr(p+1, name);
- Searches for name again starting from the matched position + 1; if nothing is found, the loop condition ends the loop.

kernel/kexec.c

static __initdata char *suffix_tbl[] = {
        [SUFFIX_HIGH] = ",high",
        [SUFFIX_LOW]  = ",low",
        [SUFFIX_NULL] = NULL,
};
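To see the "last one wins, but suffixed entries are filtered" behaviour in isolation, here is a user-space sketch of the same search. The names last_crashkernel(), token_ends_with() and known_suffixes are hypothetical. Unlike the kernel code, the helper also checks that the entry is at least as long as the suffix; that check is an addition for safety in this sketch, not something the listing above does.

```c
#include <assert.h>
#include <string.h>

static const char *const known_suffixes[] = { ",high", ",low", NULL };

/* Returns 1 if the entry [p, end) ends with sfx. */
static int token_ends_with(const char *p, const char *end, const char *sfx)
{
        size_t n = strlen(sfx);

        return (size_t)(end - p) >= n && strncmp(end - n, sfx, n) == 0;
}

/* Returns the last matching "name..." entry in cmdline, or NULL. */
const char *last_crashkernel(const char *cmdline, const char *name,
                             const char *suffix)
{
        const char *found = NULL;
        const char *p;

        assert(cmdline != NULL && name != NULL);
        for (p = strstr(cmdline, name); p; p = strstr(p + 1, name)) {
                const char *end = strchr(p, ' ');
                int i, skip = 0;

                if (!end)
                        end = p + strlen(p);

                if (suffix) {
                        /* Only entries carrying the requested suffix count. */
                        if (token_ends_with(p, end, suffix))
                                found = p;
                        continue;
                }

                /* No suffix requested: ignore entries with any known suffix. */
                for (i = 0; known_suffixes[i]; i++)
                        if (token_ends_with(p, end, known_suffixes[i]))
                                skip = 1;
                if (!skip)
                        found = p;
        }
        return found;
}
```

With "crashkernel=256M,high crashkernel=64M", a NULL suffix selects the second entry, ",high" selects the first, and ",low" finds nothing.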

parse_crashkernel_suffix()

kernel/kexec.c

/*
 * That function parses "suffix"  crashkernel command lines like
 *
 *      crashkernel=size,[high|low]
 *
 * It returns 0 on success and -EINVAL on failure.
 */
static int __init parse_crashkernel_suffix(char *cmdline,
                                           unsigned long long   *crash_size,
                                           const char *suffix)
{
        char *cur = cmdline;

        *crash_size = memparse(cmdline, &cur);
        if (cmdline == cur) {
                pr_warn("crashkernel: memory value expected\n");
                return -EINVAL;
        }

        /* check with suffix */
        if (strncmp(cur, suffix, strlen(suffix))) {
                pr_warn("crashkernel: unrecognized char\n");
                return -EINVAL;
        }
        cur += strlen(suffix);
        if (*cur != ' ' && *cur != '\0') {
                pr_warn("crashkernel: unrecognized char\n");
                return -EINVAL;
        }

        return 0;
}

Parses a cmdline string that ends with the suffix and returns the size through crash_size.

*crash_size = memparse(cmdline, &cur);
- Gets the size; cur is set to the position just past the parsed text.
  - e.g. cmdline="64K,high"
    - crash_size=65536, cur=",high"
if (strncmp(cur, suffix, strlen(suffix))) {
- If cur does not start with the suffix given as an argument, prints a warning message and returns an error.
if (*cur != ' ' && *cur != '\0') {
- If the character after the suffix is neither a space nor the null character, prints a warning message and returns an error.

memparse()

lib/cmdline.c

/**
 *      memparse - parse a string with mem suffixes into a number
 *      @ptr: Where parse begins
 *      @retptr: (output) Optional pointer to next char after parse completes
 *
 *      Parses a string into a number.  The number stored at @ptr is
 *      potentially suffixed with K, M, G, T, P, E.
 */

unsigned long long memparse(const char *ptr, char **retptr)
{
        char *endptr;   /* local pointer to end of parsed string */

        unsigned long long ret = simple_strtoull(ptr, &endptr, 0);

        switch (*endptr) {
        case 'E':
        case 'e':
                ret <<= 10;
        case 'P':
        case 'p':
                ret <<= 10;
        case 'T':
        case 't':
                ret <<= 10;
        case 'G':
        case 'g':
                ret <<= 10;
        case 'M':
        case 'm':
                ret <<= 10;
        case 'K':
        case 'k':
                ret <<= 10;
                endptr++; 
        default:
                break;
        }

        if (retptr)
                *retptr = endptr;

        return ret;
}
EXPORT_SYMBOL(memparse);

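Parses a number with an optional unit suffix and returns its value; retptr is set to point just past the parsed text.

unsigned long long ret = simple_strtoull(ptr, &endptr, 0);
- Converts the ptr string to unsigned long long and returns it. endptr is set to the position just past the characters that were converted.
  - e.g. ptr="10G"
    - ret=10ULL, endptr="G"
switch (*endptr) {
- Compares the unit character; on a match, execution starts at that case and falls through every following case until the break, shifting left by 10 bits at each step.
  - e.g. 'G' falls through the G, M and K cases, so "10G" becomes 10 << 30.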
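The fall-through trick is easiest to follow by running it. Below is a user-space sketch that uses the standard strtoull() in place of the kernel's simple_strtoull(); memparse_sketch() is a hypothetical name, and the behaviour on overflow is not modelled.

```c
#include <assert.h>
#include <stdlib.h>

unsigned long long memparse_sketch(const char *s, char **retptr)
{
        char *end;
        unsigned long long v;

        assert(s != NULL);
        v = strtoull(s, &end, 0);   /* base 0: accepts decimal, 0x hex, 0 octal */

        switch (*end) {
        case 'E': case 'e':
                v <<= 10; /* fall through */
        case 'P': case 'p':
                v <<= 10; /* fall through */
        case 'T': case 't':
                v <<= 10; /* fall through */
        case 'G': case 'g':
                v <<= 10; /* fall through */
        case 'M': case 'm':
                v <<= 10; /* fall through */
        case 'K': case 'k':
                v <<= 10;
                end++;    /* consume the unit character */
                break;
        default:
                break;
        }

        if (retptr)
                *retptr = end;
        return v;
}
```

Callers detect "no number here" by comparing the returned pointer with the input pointer, which is exactly what the cmdline == cur checks in the crashkernel parsers do.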

parse_crashkernel_mem()

kernel/kexec.c

/*
 * parsing the "crashkernel" commandline
 *
 * this code is intended to be called from architecture specific code
 */


/*
 * This function parses command lines in the format
 *
 *   crashkernel=ramsize-range:size[,...][@offset]
 *
 * The function returns 0 on success and -EINVAL on failure.
 */
static int __init parse_crashkernel_mem(char *cmdline,
                                        unsigned long long system_ram,
                                        unsigned long long *crash_size,
                                        unsigned long long *crash_base)
{
        char *cur = cmdline, *tmp;

        /* for each entry of the comma-separated list */
        do {
                unsigned long long start, end = ULLONG_MAX, size;

                /* get the start of the range */
                start = memparse(cur, &tmp);
                if (cur == tmp) {
                        pr_warn("crashkernel: Memory value expected\n");
                        return -EINVAL;
                }
                cur = tmp;
                if (*cur != '-') {
                        pr_warn("crashkernel: '-' expected\n");
                        return -EINVAL;
                }
                cur++;

                /* if no ':' is here, than we read the end */
                if (*cur != ':') {
                        end = memparse(cur, &tmp);
                        if (cur == tmp) {
                                pr_warn("crashkernel: Memory value expected\n");
                                return -EINVAL;
                        }
                        cur = tmp;
                        if (end <= start) {
                                pr_warn("crashkernel: end <= start\n");
                                return -EINVAL;
                        }
                }

                if (*cur != ':') {
                        pr_warn("crashkernel: ':' expected\n");
                        return -EINVAL;
                }
                cur++;

                size = memparse(cur, &tmp);
                if (cur == tmp) {
                        pr_warn("Memory value expected\n");
                        return -EINVAL;
                }
                cur = tmp;
                if (size >= system_ram) {
                        pr_warn("crashkernel: invalid size\n");
                        return -EINVAL;
                }

                /* match ? */
                if (system_ram >= start && system_ram < end) {
                        *crash_size = size;
                        break;
                }
        } while (*cur++ == ',');

        if (*crash_size > 0) {
                while (*cur && *cur != ' ' && *cur != '@')
                        cur++;
                if (*cur == '@') {
                        cur++;
                        *crash_base = memparse(cur, &tmp);
                        if (cur == tmp) {
                                pr_warn("Memory value expected after '@'\n");
                                return -EINVAL;
                        }
                }
        }

        return 0;
}

For a string of the form "ramsize-range:size[,...][@offset]", stores the size of the first comma-separated range that contains system_ram in the output argument crash_size, parses @offset into crash_base, and returns 0 on success. If a size is greater than or equal to system_ram, an error is returned.

start = memparse(cur, &tmp);
- Parses cur to get the start of the RAM-size range into start; tmp points to the text that follows.
if (*cur != '-') {
- If the next character is not '-', prints a warning message and returns an error.
if (*cur != ':') { end = memparse(cur, &tmp);
- If the character is not ':', parses cur to get the end of the range into end; tmp points to the text that follows. When the end is omitted, end keeps its initial value ULLONG_MAX.
if (*cur != ':') {
- If the next character is not ':', prints a warning message and returns an error.
size = memparse(cur, &tmp);
- Parses cur to get the size into size; tmp points to the text that follows.
if (size >= system_ram) {
- If size is greater than or equal to system_ram, prints a warning message and returns an error.
if (system_ram >= start && system_ram < end) { *crash_size = size;
- If system_ram (total lowmem) falls within [start, end), sets crash_size to size and leaves the loop.
} while (*cur++ == ',');
- Continues the loop while the current character is ','.
if (*crash_size > 0) {
- When crash_size is greater than 0
while (*cur && *cur != ' ' && *cur != '@') cur++
- Advances cur until a ' ', an '@' or the end of the string is reached.
if (*cur == '@') { cur++;
- If '@' is found, advances cur past it.
*crash_base = memparse(cur, &tmp);
- Parses the offset from cur and assigns it to crash_base.
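A worked example makes the range selection concrete. The sketch below keeps only the range-matching loop and leaves out the @offset handling and the warning messages; pick_crash_size() and parse_size() are hypothetical names, and parse_size() is a reduced K/M/G-only stand-in for memparse().

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Minimal K/M/G size parser standing in for memparse(). */
static unsigned long long parse_size(const char *s, char **end)
{
        unsigned long long v = strtoull(s, end, 0);

        switch (**end) {
        case 'G': case 'g':
                v <<= 10; /* fall through */
        case 'M': case 'm':
                v <<= 10; /* fall through */
        case 'K': case 'k':
                v <<= 10;
                (*end)++;
                break;
        default:
                break;
        }
        return v;
}

/*
 * Returns the size of the first "start-[end]:size" entry whose range
 * contains system_ram, or 0 if none matches or the input is malformed.
 */
unsigned long long pick_crash_size(const char *arg,
                                   unsigned long long system_ram)
{
        char *cur = (char *)arg, *tmp;

        assert(arg != NULL);
        do {
                unsigned long long start, end = ULLONG_MAX, size;

                start = parse_size(cur, &tmp);
                if (tmp == cur || *tmp != '-')
                        return 0;
                cur = tmp + 1;

                if (*cur != ':') {              /* an explicit end was given */
                        end = parse_size(cur, &tmp);
                        if (tmp == cur || end <= start)
                                return 0;
                        cur = tmp;
                }
                if (*cur != ':')
                        return 0;
                cur++;

                size = parse_size(cur, &tmp);
                if (tmp == cur || size >= system_ram)
                        return 0;
                cur = tmp;

                if (system_ram >= start && system_ram < end)
                        return size;            /* first matching range wins */
        } while (*cur++ == ',');

        return 0;
}
```

With "512M-2G:64M,2G-:128M", a machine with 1 GiB of RAM gets 64 MiB, one with 4 GiB gets 128 MiB, and one with 256 MiB matches no range and gets nothing.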

parse_crashkernel_simple()

kernel/kexec.c

/*
 * That function parses "simple" (old) crashkernel command lines like
 *
 *      crashkernel=size[@offset]
 *
 * It returns 0 on success and -EINVAL on failure.
 */
static int __init parse_crashkernel_simple(char *cmdline,
                                           unsigned long long *crash_size,
                                           unsigned long long *crash_base)
{
        char *cur = cmdline;

        *crash_size = memparse(cmdline, &cur);
        if (cmdline == cur) {
                pr_warn("crashkernel: memory value expected\n");
                return -EINVAL;
        }

        if (*cur == '@')
                *crash_base = memparse(cur+1, &cur);
        else if (*cur != ' ' && *cur != '\0') {
                pr_warn("crashkernel: unrecognized char\n");
                return -EINVAL;
        }

        return 0;
}

When cmdline has the form "size[@offset]", parses it, storing the size part in crash_size and the offset part in crash_base.

*crash_size = memparse(cmdline, &cur);
- Parses size from cmdline into crash_size; cur is set to the position just past the parsed text.
if (*cur == '@') *crash_base = memparse(cur+1, &cur);
- If the next character is '@', parses the text after '@' into crash_base.
else if (*cur != ' ' && *cur != '\0') {
- If the next character is anything other than a space or the null character, prints a warning message and returns an error.

Structures

crashk_res structure

kernel/kexec.c

/* Location of the reserved area for the crash kernel */
struct resource crashk_res = {
        .name  = "Crash kernel",
        .start = 0, 
        .end   = 0, 
        .flags = IORESOURCE_BUSY | IORESOURCE_MEM
};

References

Documentation for Kdump - The kexec-based Crash Dumping Solution | kernel.org
Understanding Linux kdump (Linux kdump에 대한 이해) | OSC
Kdump Analysis
Kernel Crash Dump | Ubuntu

Virtualization support (hyp mode)

2016-04-14 / 2022-09-02 문영일

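A hypervisor is software used to create, manage, and schedule VMs (Virtual Machines).

Supporting virtualization on an ARM system requires the AVE (Architecture Virtualization Extension).

On ARM32, whether it is present differs from SoC to SoC.
On ARM64, it is built in on most SoCs.

ARM32 and ARM64 systems support the following hypervisor operating modes. Most of the explanation below assumes ARM64 by default.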

ARM32
- The hypervisor runs in HYP mode
- The guest OS runs in SVC mode
ARM64
- The hypervisor runs at EL2
- The guest OS runs at EL1

There are two ways to run a hypervisor.

Type 1 (Standalone)
- A dedicated hypervisor runs at EL2 without a host OS, and guest OSes run at EL1.
  - e.g. Xen Server, VMware ESXi, VMware vSphere, Oracle VM Server, Hyper-V
Type 2 (Hosted)
- The hypervisor runs at EL2 together with a host OS, and guest OSes run at EL1.
  - e.g. VMware Workstation, VirtualBox, QEMU/KVM

VHE (Virtualization Host Extensions)

When ARM Linux runs as a Type 2 hypervisor, the host uses EL2 or EL1 depending on whether ARMv8.1 VHE is supported.

Without VHE
- EL2 switching code + EL1 host and guest OS
- The host OS boots at EL2, leaves only the switching stub code (including irq routing) at EL2, and then drops to EL1 to run.
  - e.g. The host Linux kernel boots at EL2 and then switches to EL1.
With VHE
- EL2 host OS + EL1 guest OS
- The host OS boots and runs at EL2, which removes much of the switching between EL2 and EL1 and improves performance.
  - e.g. With VHE, the host Linux kernel boots and keeps running at EL2.
- While VHE is in use, Linux kernel code that accesses EL1 registers can run at EL2 without modification. In other words, the *_EL1 registers accessed by the code actually behave as the corresponding *_EL2 registers.
  - This is available when the HCR_EL2.E2H bit is set.
  - To access the real EL0 and EL1 registers from EL2, the *_EL02 and *_EL12 registers are used.
- References:
  - Virtualization Host Extension support (2015) | LWN.net
  - To EL2, and Beyond! | Christoffer Dall, Shih-Wei Li (PDF)

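The following figure shows how an OS such as the Linux kernel runs under the different hypervisor configurations.

In the left figure, a dedicated hypervisor such as Xen or VMware ESX runs, and the guest operating systems run at EL1.
In the middle and right figures, the host operating system runs at EL2 or at EL1 depending on whether VHE is supported, and the guest operating systems run at EL1.
- With HCR_EL2.E2H==1 (VHE in use), HCR_EL2.TGE is 1 on a cpu running the host OS and 0 on a cpu running a guest OS.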

Checking for hyp mode support

A boot message (visible in dmesg) reports whether or not the kernel was booted in hyp mode.

ARM64 example)

Booted at EL2
- "CPU: All CPU(s) started at EL2"
Booted at EL1
- "CPU: All CPU(s) started at EL1"

ARM32 example)

Booted in HYP mode
- "CPU: All CPU(s) started in HYP mode."
- "CPU: Virtualization extensions available."
Booted in SVC mode
- "CPU: All CPU(s) started in SVC mode."

Secure EL2 support

Secure EL2 is supported starting with the ARMv8.4 architecture.

Full-virtualization vs Para-virtualization

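In Korean these are called 전가상화 (full virtualization) and 반가상화 (para-virtualization). On today's cpus with hardware support for virtualization, guest OSes run quickly under full virtualization without any changes to the guest OS code.

In current practice the two are no longer strictly distinguished.
To get better performance out of a guest OS, most operating systems include para-virtualized drivers (virtio).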

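Stage 2 translation

When hyp mode is used, the physical address (PA) is obtained through an additional translation that uses a Stage 2 table mapped per operating system. The address between the two stages is the intermediate physical address (IPA).

Without hyp mode
- VA ---> (Stage 1 Translation MMU) ---> PA
With hyp mode
- VA ---> (Stage 1 Translation MMU) ---> IPA ---> (Stage 2 Translation MMU) ---> PA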

ASID(Address Space IDentifier)

Stage 1 translation uses the ASID to identify which application a VA belongs to.

VMID(Virtual Machine IDentifier)

Stage 2 translation uses the VMID to identify which operating system an IPA belongs to.

Attribute combining

The Stage 1 and Stage 2 mappings each carry a memory type and attributes, each with its own restrictions. The two are combined so that the more restrictive one applies.

e.g. Device + Memory = Device
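The "more restrictive wins" rule can be written as a minimum over an ordered set of memory types. This is a deliberately reduced sketch: the three-value enum mem_type and combine_attr() are hypothetical, and the real architecture distinguishes more types (several Device variants, write-through, inner and outer cacheability) than are modelled here.

```c
#include <assert.h>

/* Ordered from most to least restrictive, so combining is "take the smaller". */
enum mem_type {
        MT_DEVICE    = 0,
        MT_NORMAL_NC = 1,   /* Normal, Non-cacheable */
        MT_NORMAL_WB = 2    /* Normal, Write-Back cacheable */
};

enum mem_type combine_attr(enum mem_type stage1, enum mem_type stage2)
{
        assert(stage1 <= MT_NORMAL_WB && stage2 <= MT_NORMAL_WB);
        return stage1 < stage2 ? stage1 : stage2;
}
```

So a guest that maps a page as Normal Write-Back in Stage 1 still gets Device behaviour if the hypervisor marked the same page as Device in Stage 2, and the result is the same with the roles swapped.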

Attribute overriding

Register bits control how the memory types and attributes of the Stage 1 and Stage 2 mappings are handled.

HCR_EL2.CD
- Treats all Stage 1 attributes as Non-cacheable.
HCR_EL2.DC
- Treats all Stage 1 attributes as Normal, Write-Back cacheable.
HCR_EL2.FWB
- Instead of combining attributes, Stage 2 overrides the Stage 1 attributes. (from ARMv8.4)

Virtual CPU

vCPU

When a hypervisor is running, the virtual cpu concept lets the operating systems share the cpu cores. In other words, one cpu can be divided into several vcpus.

e.g. On 4 cores, one Linux guest OS can run vcpu0~3 on 4 cores while another Linux guest OS runs vcpu0~1 on 2 cores.

Virtual Exception

vIRQ, vFIQ and vSError

When a hypervisor is running, the virtual exceptions vIRQ, vFIQ and vSError are used to forward a physical interrupt received at EL2 to a vCPU.

Hyp mode check

hyp_mode_check() – ARM64

arch/arm64/kernel/setup.c

static void __init hyp_mode_check(void)
{
        if (is_hyp_mode_available())
                pr_info("CPU: All CPU(s) started at EL2\n");
        else if (is_hyp_mode_mismatched())
                WARN_TAINT(1, TAINT_CPU_OUT_OF_SPEC,
                           "CPU: CPUs started in inconsistent modes");
        else
                pr_info("CPU: All CPU(s) started at EL1\n");
}

Prints a message reporting whether the kernel was booted at EL2 or at EL1.

hyp_mode_check() – ARM32

arch/arm/kernel/setup.c

#ifndef ZIMAGE
void __init hyp_mode_check(void)
{
#ifdef CONFIG_ARM_VIRT_EXT
        sync_boot_mode();

        if (is_hyp_mode_available()) {
                pr_info("CPU: All CPU(s) started in HYP mode.\n");
                pr_info("CPU: Virtualization extensions available.\n");
        } else if (is_hyp_mode_mismatched()) {
                pr_warn("CPU: WARNING: CPU(s) started in wrong/inconsistent modes (primary CPU mode 0x%x)\n",
                        __boot_cpu_mode & MODE_MASK);
                pr_warn("CPU: This may indicate a broken bootloader or firmware.\n");
        } else       
                pr_info("CPU: All CPU(s) started in SVC mode.\n");
#endif               
}
#endif

Prints a message reporting whether the kernel was booted in HYP mode or in SVC mode.

Whether hyp mode is available

is_hyp_mode_available() – ARM64

arch/arm64/include/asm/virt.h

/* Reports the availability of HYP mode */
static inline bool is_hyp_mode_available(void)
{
        return (__boot_cpu_mode[0] == BOOT_CPU_MODE_EL2 &&
                __boot_cpu_mode[1] == BOOT_CPU_MODE_EL2);
}

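Returns whether all cpus were booted at EL2, i.e. whether both words of __boot_cpu_mode[] hold BOOT_CPU_MODE_EL2.

For the routine that stores __boot_cpu_mode[], see the following.
- Reference: Entire head.S (head.S 전체) | 문c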

is_hyp_mode_available() – ARM32

arch/arm/include/asm/virt.h

/* Reports the availability of HYP mode */
static inline bool is_hyp_mode_available(void)
{
        return ((__boot_cpu_mode & MODE_MASK) == HYP_MODE &&
                !(__boot_cpu_mode & BOOT_CPU_MODE_MISMATCH));
}

Returns whether the boot cpu was booted in HYP mode and no mode mismatch among the cpus was recorded.

Miscellaneous

sync_boot_mode()

arch/arm/include/asm/virt.h

/*
 * __boot_cpu_mode records what mode the primary CPU was booted in.
 * A correctly-implemented bootloader must start all CPUs in the same mode:
 * if it fails to do this, the flag BOOT_CPU_MODE_MISMATCH is set to indicate
 * that some CPU(s) were booted in a different mode.
 *
 * This allows the kernel to flag an error when the secondaries have come up.
 */

extern int __boot_cpu_mode;

static inline void sync_boot_mode(void)
{
        /*  
         * As secondaries write to __boot_cpu_mode with caches disabled, we
         * must flush the corresponding cache entries to ensure the visibility
         * of their writes.
         */
        sync_cache_r(&__boot_cpu_mode);
}

Flushes the inner and outer caches for the region occupied by the global variable __boot_cpu_mode.

sync_cache_r()

arch/arm/include/asm/cacheflush.h

#define sync_cache_r(ptr) __sync_cache_range_r(ptr, sizeof *(ptr))

Flushes the inner and outer caches at the location of the global variable __boot_cpu_mode, which is declared as int.
- The size passed is sizeof *(ptr), so for this int the flush covers a 4-byte region.

__sync_cache_range_r()

arch/arm/include/asm/cacheflush.h

/*
 * Ensure preceding writes to *p by other CPUs are visible to
 * subsequent reads by this CPU.  We must be careful not to
 * discard data simultaneously written by another CPU, hence the
 * usage of flush rather than invalidate operations.
 */

static inline void __sync_cache_range_r(volatile void *p, size_t size)
{
        char *_p = (char *)p;

#ifdef CONFIG_OUTER_CACHE
        if (outer_cache.flush_range) {
                /*
                 * Ensure dirty data migrated from other CPUs into our cache
                 * are cleaned out safely before the outer cache is cleaned:
                 */
                __cpuc_clean_dcache_area(_p, size);

                /* Clean and invalidate stale data for *p from outer ... */
                outer_flush_range(__pa(_p), __pa(_p + size));
        }
#endif

        /* ... and inner cache: */
        __cpuc_flush_dcache_area(_p, size);
}

Flushes the inner and outer caches for the given range.

CONFIG_OUTER_CACHE
- Kernel option used on systems that have an outer cache
- rpi2: does not use an outer cache.
if (outer_cache.flush_range) {
- When, on a system that uses an outer cache, a function is hooked to flush_range
__cpuc_clean_dcache_area(_p, size);
- Before the outer cache is flushed for the given range, the cpu cache, i.e. the inner cache, must be cleaned first.
- ARMv7:
  - For the inner d-cache region, flush is implemented in place of a separate clean operation.
outer_flush_range(__pa(_p), __pa(_p + size));
- Flushes (clean & invalidate) the outer cache for the given range.
__cpuc_flush_dcache_area(_p, size);
- Flushes (clean & invalidate) the inner d-cache for the given range.

arch/arm/include/asm/cacheflush.h

/*
 * There is no __cpuc_clean_dcache_area but we use it anyway for
 * code intent clarity, and alias it to __cpuc_flush_dcache_area.
 */

#define __cpuc_clean_dcache_area __cpuc_flush_dcache_area

ARMv7 does not provide a separate clean implementation, so the flush implementation is used instead.

arch/arm/include/asm/cacheflush.h

#define __cpuc_flush_dcache_area        cpu_cache.flush_kern_dcache_area

ARMv7: v7_flush_kern_dcache_area()

outer_flush_range()

arch/arm/include/asm/outercache.h

/**     
 * outer_flush_range - clean and invalidate outer cache lines
 * @start: starting physical address, inclusive
 * @end: end physical address, exclusive
 */

static inline void outer_flush_range(phys_addr_t start, phys_addr_t end)
{
        if (outer_cache.flush_range)
                outer_cache.flush_range(start, end);
}

On machines that implement outer_cache, flushes (clean & invalidate) the outer cache for the given range.

is_hyp_mode_mismatched()

arch/arm/include/asm/virt.h

/* Check if the bootloader has booted CPUs in different modes */
static inline bool is_hyp_mode_mismatched(void)
{
        return !!(__boot_cpu_mode & BOOT_CPU_MODE_MISMATCH);
}

arch/arm/include/asm/virt.h

/*
 * Flag indicating that the kernel was not entered in the same mode on every
 * CPU.  The zImage loader stashes this value in an SPSR, so we need an
 * architecturally defined flag bit here.
 */

#define BOOT_CPU_MODE_MISMATCH  PSR_N_BIT

arch/arm/include/uapi/asm/ptrace.h

/*
 * PSR bits
 * Note on V7M there is no mode contained in the PSR
 */

#define USR26_MODE      0x00000000
#define FIQ26_MODE      0x00000001
#define IRQ26_MODE      0x00000002
#define SVC26_MODE      0x00000003
#if defined(__KERNEL__) && defined(CONFIG_CPU_V7M)
/*
 * Use 0 here to get code right that creates a userspace
 * or kernel space thread.
 */
#define USR_MODE        0x00000000
#define SVC_MODE        0x00000000
#else
#define USR_MODE        0x00000010
#define SVC_MODE        0x00000013
#endif
#define FIQ_MODE        0x00000011
#define IRQ_MODE        0x00000012
#define ABT_MODE        0x00000017
#define HYP_MODE        0x0000001a
#define UND_MODE        0x0000001b
#define SYSTEM_MODE     0x0000001f
#define MODE32_BIT      0x00000010
#define MODE_MASK       0x0000001f
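As a rough illustration of how these flag and mode bits are tested, the sketch below models `__boot_cpu_mode` as an ordinary user-space variable. `PSR_N_BIT` is assumed to be bit 31 (the N condition flag), and the variable `boot_cpu_mode` and both helper names are stand-ins for illustration, not kernel symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PSR_N_BIT              0x80000000u  /* assumed: N flag, bit 31 of the PSR */
#define MODE_MASK              0x0000001fu
#define SVC_MODE               0x00000013u
#define HYP_MODE               0x0000001au
#define BOOT_CPU_MODE_MISMATCH PSR_N_BIT

/* Stand-in for the kernel's __boot_cpu_mode (hypothetical name). */
static uint32_t boot_cpu_mode = SVC_MODE;

/* True when some CPU entered the kernel in a different mode. */
static bool hyp_mode_mismatched(void)
{
        return !!(boot_cpu_mode & BOOT_CPU_MODE_MISMATCH);
}

/* True when the recorded mode is HYP and no mismatch has been flagged. */
static bool hyp_mode_available(void)
{
        return (boot_cpu_mode & MODE_MASK) == HYP_MODE &&
               !(boot_cpu_mode & BOOT_CPU_MODE_MISMATCH);
}
```

Because the mode occupies only the low five bits, a high PSR flag bit can carry the mismatch marker in the same word without disturbing the mode value.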

Global variables

__boot_cpu_mode – ARM64

arch/arm64/include/asm/virt.h

/*
 * __boot_cpu_mode records what mode CPUs were booted in.
 * A correctly-implemented bootloader must start all CPUs in the same mode:
 * In this case, both 32bit halves of __boot_cpu_mode will contain the
 * same value (either 0 if booted in EL1, BOOT_CPU_MODE_EL2 if booted in EL2).
 *
 * Should the bootloader fail to do this, the two values will be different.
 * This allows the kernel to flag an error when the secondaries have come up.
 */

extern u32 __boot_cpu_mode[2];

arch/arm64/kernel/hyp-stub.S

/*
 * We need to find out the CPU boot mode long after boot, so we need to
 * store it in a writable variable.
 *
 * This is not in .bss, because we set it sufficiently early that the boot-time
 * zeroing of .bss would clobber it.
 */

ENTRY(__boot_cpu_mode)
        .long   BOOT_CPU_MODE_EL2
        .long   BOOT_CPU_MODE_EL1
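The two-slot scheme can be modelled in user space as follows. This is only a sketch: it assumes that a CPU entering in EL1 overwrites slot 0 and a CPU entering in EL2 overwrites slot 1, and that the two mode constants are 0xe11 and 0xe12. `record_boot_mode()` and `reset_boot_mode()` are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BOOT_CPU_MODE_EL1 0xe11u  /* assumed value */
#define BOOT_CPU_MODE_EL2 0xe12u  /* assumed value */

/* User-space stand-in for __boot_cpu_mode[2]. */
static uint32_t boot_cpu_mode[2];

/* Restore the initial contents shown in the hyp-stub.S listing. */
static void reset_boot_mode(void)
{
        boot_cpu_mode[0] = BOOT_CPU_MODE_EL2;
        boot_cpu_mode[1] = BOOT_CPU_MODE_EL1;
}

/* Model of one CPU recording the mode it was entered in. */
static void record_boot_mode(uint32_t mode)
{
        if (mode == BOOT_CPU_MODE_EL2)
                boot_cpu_mode[1] = mode;  /* EL2 boot overwrites slot 1 */
        else
                boot_cpu_mode[0] = mode;  /* EL1 boot overwrites slot 0 */
}

/* The two halves differ only if CPUs came up in different modes. */
static bool hyp_mode_mismatched(void)
{
        return boot_cpu_mode[0] != boot_cpu_mode[1];
}
```

With these initial values, the slots become equal only when every CPU reports the same mode, so a mixed boot leaves them different.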

__boot_cpu_mode – ARM32

arch/arm/include/asm/virt.h

/*
 * __boot_cpu_mode records what mode the primary CPU was booted in.
 * A correctly-implemented bootloader must start all CPUs in the same mode:
 * if it fails to do this, the flag BOOT_CPU_MODE_MISMATCH is set to indicate
 * that some CPU(s) were booted in a different mode.
 *
 * This allows the kernel to flag an error when the secondaries have come up.
 */

extern int __boot_cpu_mode;

arch/arm/kernel/hyp-stub.S

/*
 * For the kernel proper, we need to find out the CPU boot mode long after
 * boot, so we need to store it in a writable variable.
 *
 * This is not in .bss, because we set it sufficiently early that the boot-time
 * zeroing of .bss would clobber it.
 */

.data
ENTRY(__boot_cpu_mode)
        .long   0

References

Next steps in KVM enablement on ARM (2015) | ARM
Isolation using virtualization in the Secure world (2018) | ARM (PDF)
Virtualization Host Extension support (2015) | LWN.net
To EL2, and Beyond! | Christoffer Dall, Shih-Wei Li (PDF)
Armv8-A virtualization (2019) | ARM (PDF)

smp_build_mpidr_hash()

2016-04-12 (updated 2019-04-29) 문영일

MPIDR Hash Bits

This section walks through the stages in which the MPIDR hash bits are built and used.

Stage 1: Reading the MPIDR
- a) smp_setup_processor_id() reads the boot cpu's MPIDR value, which encodes the affinity levels, and stores it in __cpu_logical_map[0]. The stored mpidr value for a given @cpu is read back with cpu_logical_map(cpu).
  - The mpidr value holds an id for each affinity level, including the physical cpu id.
  - See: smp_setup_processor_id() | 문c
- b) smp_init_cpus() stores into the __cpu_logical_map[] array the mpidr values read from the “reg” property of each cpu node in the device tree, or from the ACPI tables.
  - See: smp_init_cpus() | 문c
Stage 2: Building the MPIDR hash
- smp_build_mpidr_hash() analyzes the mpidr values read for each cpu, level by level, and builds the global mpidr_hash structure object. This structure holds the number of bits needed and the shift amount for each affinity level, plus the total number of bits.
- For reference, ARM32 manages up to 3 affinity levels and ARM64 up to 4.
Stage 3: Using the MPIDR hash
- The precomputed mpidr_hash is used by the assembly code inside cpu_suspend() and cpu_resume().
  - The mpidr value of the cpu going to sleep or resuming is read, and the bits used at each affinity level are right-shifted by the amounts stored in mpidr_hash to produce the final index.
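The index computation in stage 3 can be sketched in C as below. The kernel does this in assembly; this is a user-space approximation under the assumption that each 8-bit affinity field is masked, shifted right by its `shift_aff[]` entry, and ORed into the result. `mpidr_hash_index()` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Same layout as the ARM32 struct mpidr_hash. */
struct mpidr_hash {
        uint32_t mask;
        uint32_t shift_aff[3];
        uint32_t bits;
};

/* Compress an MPIDR value into a small linear index. */
static uint32_t mpidr_hash_index(const struct mpidr_hash *h, uint32_t mpidr)
{
        uint32_t m = mpidr & h->mask;   /* keep only the bits that vary */
        uint32_t idx;

        idx  = (m & 0x0000ffu) >> h->shift_aff[0];  /* affinity level 0 */
        idx |= (m & 0x00ff00u) >> h->shift_aff[1];  /* affinity level 1 */
        idx |= (m & 0xff0000u) >> h->shift_aff[2];  /* affinity level 2 */
        return idx;
}
```

For example, with mask=0x103 and shifts {0, 6, 13} (a two-cluster, four-cores-per-cluster layout), an MPIDR of 0x80000101 maps to index 5: core 1 contributes 1, and the cluster bit 0x100 shifted right by 6 contributes 4.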

Computing the MPIDR hash

The following figure shows how mpidr_hash is computed.

smp_build_mpidr_hash() – ARM32

arch/arm/kernel/setup.c

/**
 * smp_build_mpidr_hash - Pre-compute shifts required at each affinity
 *                        level in order to build a linear index from an
 *                        MPIDR value. Resulting algorithm is a collision
 *                        free hash carried out through shifting and ORing
 */

static void __init smp_build_mpidr_hash(void)
{
        u32 i, affinity;
        u32 fs[3], bits[3], ls, mask = 0;
        /*
         * Pre-scan the list of MPIDRS and filter out bits that do
         * not contribute to affinity levels, ie they never toggle.
         */
        for_each_possible_cpu(i)
                mask |= (cpu_logical_map(i) ^ cpu_logical_map(0));
        pr_debug("mask of set bits 0x%x\n", mask);
        /*
         * Find and stash the last and first bit set at all affinity levels to
         * check how many bits are required to represent them.
         */
        for (i = 0; i < 3; i++) {
                affinity = MPIDR_AFFINITY_LEVEL(mask, i);
                /*
                 * Find the MSB bit and LSB bits position
                 * to determine how many bits are required
                 * to express the affinity level.
                 */
                ls = fls(affinity);
                fs[i] = affinity ? ffs(affinity) - 1 : 0;
                bits[i] = ls - fs[i];
        }
        /*
         * An index can be created from the MPIDR by isolating the
         * significant bits at each affinity level and by shifting
         * them in order to compress the 24 bits values space to a
         * compressed set of values. This is equivalent to hashing
         * the MPIDR through shifting and ORing. It is a collision free
         * hash though not minimal since some levels might contain a number
         * of CPUs that is not an exact power of 2 and their bit
         * representation might contain holes, eg MPIDR[7:0] = {0x2, 0x80}.
         */
        mpidr_hash.shift_aff[0] = fs[0];
        mpidr_hash.shift_aff[1] = MPIDR_LEVEL_BITS + fs[1] - bits[0];
        mpidr_hash.shift_aff[2] = 2*MPIDR_LEVEL_BITS + fs[2] -
                                                (bits[1] + bits[0]);
        mpidr_hash.mask = mask;
        mpidr_hash.bits = bits[2] + bits[1] + bits[0];
        pr_debug("MPIDR hash: aff0[%u] aff1[%u] aff2[%u] mask[0x%x] bits[%u]\n",
                                mpidr_hash.shift_aff[0],
                                mpidr_hash.shift_aff[1],
                                mpidr_hash.shift_aff[2],
                                mpidr_hash.mask,
                                mpidr_hash.bits);
        /*
         * 4x is an arbitrary value used to warn on a hash table much bigger
         * than expected on most systems.
         */
        if (mpidr_hash_size() > 4 * num_possible_cpus())
                pr_warn("Large number of MPIDR hash buckets detected\n");
        sync_cache_w(&mpidr_hash);
}

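Reads the mpidr value of every logical cpu id, analyzes it across the 3 affinity levels, and builds the global mpidr_hash structure object. The resulting mpidr_hash holds the shift values needed to turn a cpu's mpidr value into a compact per-level index; it is used by __cpu_suspend(), __cpu_resume() and related code.

In code lines 9~11, iterates over all possible cpus, reads the stored mpidr value of each logical cpu, collects only the bits that vary into mask, and prints it as debug output.
- Bits that have the same value on every core are dropped.
In code lines 16~26, iterates over the affinity levels, extracts from mask the bits that vary at each level, and stores the number of hash bits that level needs in bits[].
- For each of the first 3 affinity levels, the level's value (0~255) is extracted from mask.
- From the affinity value, the position of the last set bit (fls) and the position of the first set bit minus 1 (ffs - 1) are obtained.
- e.g.) affinity=0xc
  - ls=4
  - fs=2
  - bits=ls-fs=2
In code lines 37~48, computes the number of bits each affinity level must be shifted by, stores mask and the total number of hash bits, and prints these values as debug output.
In code line 55, cleans the inner & outer caches for the memory occupied by the mpidr_hash object.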

The figure below shows how the global mpidr_hash object is built for a system with 2 clusters, 4 cpu cores, and 4 virtual cores (hypothetical, not real hardware).

4 mpidr hash bits are needed, and the shift amounts for the three levels are 2, 10, and 19.

The figure below shows how the global mpidr_hash object is built for the rpi2 and exynos-5420 systems.

rpi2: 2 mpidr hash bits are needed, and the shift amounts for the three levels are 0, 6, and 14.
exynos-5420: 3 mpidr hash bits are needed, and the shift amounts for the three levels are 0, 6, and 13.
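The rpi2 and exynos-5420 numbers can be reproduced with a user-space re-implementation of the ARM32 algorithm, sketched below. `fls32()` and `ffs32()` are portable stand-ins for the kernel's fls()/ffs(), and `build_mpidr_hash()` takes the logical map as an array argument instead of calling cpu_logical_map(). The rpi2 MPIDR values (0xf00~0xf03) are the ones quoted in this article; the exynos-5420 values used in the example are an assumption (two clusters of four cores).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MPIDR_LEVEL_BITS 8
#define MPIDR_AFFINITY_LEVEL(mpidr, level) \
        (((mpidr) >> (MPIDR_LEVEL_BITS * (level))) & 0xffu)

struct mpidr_hash {
        uint32_t mask;
        uint32_t shift_aff[3];
        uint32_t bits;
};

/* 1-based position of the highest set bit, 0 if x == 0. */
static uint32_t fls32(uint32_t x)
{
        uint32_t n = 0;

        while (x) {
                n++;
                x >>= 1;
        }
        return n;
}

/* 1-based position of the lowest set bit, 0 if x == 0. */
static uint32_t ffs32(uint32_t x)
{
        uint32_t n = 1;

        if (!x)
                return 0;
        while (!(x & 1u)) {
                x >>= 1;
                n++;
        }
        return n;
}

static struct mpidr_hash build_mpidr_hash(const uint32_t *map, size_t ncpus)
{
        struct mpidr_hash h;
        uint32_t fs[3], bits[3], mask = 0;
        size_t i;

        /* Keep only the bits that differ from cpu 0 on at least one cpu. */
        for (i = 0; i < ncpus; i++)
                mask |= map[i] ^ map[0];

        /* Per level: lowest varying bit (fs) and width of the varying span. */
        for (i = 0; i < 3; i++) {
                uint32_t aff = MPIDR_AFFINITY_LEVEL(mask, i);
                uint32_t ls = fls32(aff);

                fs[i] = aff ? ffs32(aff) - 1 : 0;
                bits[i] = ls - fs[i];
        }

        h.shift_aff[0] = fs[0];
        h.shift_aff[1] = MPIDR_LEVEL_BITS + fs[1] - bits[0];
        h.shift_aff[2] = 2 * MPIDR_LEVEL_BITS + fs[2] - (bits[1] + bits[0]);
        h.mask = mask;
        h.bits = bits[2] + bits[1] + bits[0];
        return h;
}

/* Number of hash buckets implied by the bit count. */
static uint32_t hash_size(const struct mpidr_hash *h)
{
        return 1u << h->bits;
}
```

Feeding in {0xf00, 0xf01, 0xf02, 0xf03} gives mask=0x3, shifts {0, 6, 14} and 2 bits; feeding in {0x0~0x3, 0x100~0x103} gives mask=0x103, shifts {0, 6, 13} and 3 bits.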

smp_build_mpidr_hash() – ARM64

arch/arm64/kernel/setup.c

/**
 * smp_build_mpidr_hash - Pre-compute shifts required at each affinity
 *                        level in order to build a linear index from an
 *                        MPIDR value. Resulting algorithm is a collision
 *                        free hash carried out through shifting and ORing
 */

static void __init smp_build_mpidr_hash(void)
{
        u32 i, affinity, fs[4], bits[4], ls;
        u64 mask = 0;
        /*
         * Pre-scan the list of MPIDRS and filter out bits that do
         * not contribute to affinity levels, ie they never toggle.
         */
        for_each_possible_cpu(i)
                mask |= (cpu_logical_map(i) ^ cpu_logical_map(0));
        pr_debug("mask of set bits %#llx\n", mask);
        /*
         * Find and stash the last and first bit set at all affinity levels to
         * check how many bits are required to represent them.
         */
        for (i = 0; i < 4; i++) {
                affinity = MPIDR_AFFINITY_LEVEL(mask, i);
                /*
                 * Find the MSB bit and LSB bits position
                 * to determine how many bits are required
                 * to express the affinity level.
                 */
                ls = fls(affinity);
                fs[i] = affinity ? ffs(affinity) - 1 : 0;
                bits[i] = ls - fs[i];
        }
        /*
         * An index can be created from the MPIDR_EL1 by isolating the
         * significant bits at each affinity level and by shifting
         * them in order to compress the 32 bits values space to a
         * compressed set of values. This is equivalent to hashing
         * the MPIDR_EL1 through shifting and ORing. It is a collision free
         * hash though not minimal since some levels might contain a number
         * of CPUs that is not an exact power of 2 and their bit
         * representation might contain holes, eg MPIDR_EL1[7:0] = {0x2, 0x80}.
         */
        mpidr_hash.shift_aff[0] = MPIDR_LEVEL_SHIFT(0) + fs[0];
        mpidr_hash.shift_aff[1] = MPIDR_LEVEL_SHIFT(1) + fs[1] - bits[0];
        mpidr_hash.shift_aff[2] = MPIDR_LEVEL_SHIFT(2) + fs[2] -
                                                (bits[1] + bits[0]);
        mpidr_hash.shift_aff[3] = MPIDR_LEVEL_SHIFT(3) +
                                  fs[3] - (bits[2] + bits[1] + bits[0]);
        mpidr_hash.mask = mask;
        mpidr_hash.bits = bits[3] + bits[2] + bits[1] + bits[0];
        pr_debug("MPIDR hash: aff0[%u] aff1[%u] aff2[%u] aff3[%u] mask[%#llx] bits[%u]\n",
                mpidr_hash.shift_aff[0],
                mpidr_hash.shift_aff[1],
                mpidr_hash.shift_aff[2],
                mpidr_hash.shift_aff[3],
                mpidr_hash.mask,
                mpidr_hash.bits);
        /*
         * 4x is an arbitrary value used to warn on a hash table much bigger
         * than expected on most systems.
         */
        if (mpidr_hash_size() > 4 * num_possible_cpus())
                pr_warn("Large number of MPIDR hash buckets detected\n");
}

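It uses the same method as ARM32, except that it manages 4 affinity levels instead of 3. In the listing above, mask is also widened to u64 and there is no trailing sync_cache_w() call.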

Cache sync (clean) – ARM32

sync_cache_w() – ARM32

arch/arm/include/asm/cacheflush.h

#define sync_cache_w(ptr) __sync_cache_range_w(ptr, sizeof *(ptr))

Cleans the cache for the object that ptr points to.

__sync_cache_range_w()

arch/arm/include/asm/cacheflush.h

/*
 * Ensure preceding writes to *p by this CPU are visible to
 * subsequent reads by other CPUs:
 */
static inline void __sync_cache_range_w(volatile void *p, size_t size)
{
        char *_p = (char *)p;

        __cpuc_clean_dcache_area(_p, size);
        outer_clean_range(__pa(_p), __pa(_p + size));
}

Cleans the inner cache and the outer cache for size bytes starting at address p.

arch/arm/include/asm/cacheflush.h

/*
 * There is no __cpuc_clean_dcache_area but we use it anyway for
 * code intent clarity, and alias it to __cpuc_flush_dcache_area.
 */
#define __cpuc_clean_dcache_area __cpuc_flush_dcache_area

arch/arm/include/asm/cacheflush.h

#define __cpuc_flush_dcache_area        cpu_cache.flush_kern_dcache_area

When MULTI_CPU is selected, the cache handler functions are called through the cpu_cache structure.
- rpi2 uses this as well.

outer_clean_range()

arch/arm/include/asm/outercache.h

/**
 * outer_clean_range - clean dirty outer cache lines
 * @start: starting physical address, inclusive
 * @end: end physical address, exclusive
 */
static inline void outer_clean_range(phys_addr_t start, phys_addr_t end)
{
        if (outer_cache.clean_range)
                outer_cache.clean_range(start, end);
}

Cleans the outer cache from start up to (but not including) end.

#ifdef CONFIG_OUTER_CACHE
struct outer_cache_fns outer_cache __read_mostly;
EXPORT_SYMBOL(outer_cache);
#endif

The global outer_cache is an outer_cache_fns structure that holds the outer cache handler functions.

Structures

mpidr_hash structure – ARM32

arch/arm/include/asm/smp_plat.h

/*      
 * NOTE ! Assembly code relies on the following
 * structure memory layout in order to carry out load
 * multiple from its base address. For more
 * information check arch/arm/kernel/sleep.S
 */     
struct mpidr_hash {
        u32     mask; /* used by sleep.S */
        u32     shift_aff[3]; /* used by sleep.S */
        u32     bits;
};

mask
- Only the bits that vary across the mpidr values of all logical cpus.
- rpi2 e.g.) 0xf00, 0xf01, 0xf02, 0xf03 -> mask=0x03 (only the two LSBs vary)
shift_aff[3]
- Number of bits by which each affinity level's bits are right-shifted to pack them into the hash index.
- rpi2 e.g.) mpidr hash bits = 2 in total (affinity0=2, affinity1=0, affinity2=0)
  - shift_aff[0]=0, shift_aff[1]=6, shift_aff[2]=14
bits
- Number of mpidr hash bits

mpidr_hash structure – ARM64

arch/arm64/include/asm/smp_plat.h

struct mpidr_hash {
        u64     mask;
        u32     shift_aff[4];
        u32     bits;
};

Similar to ARM32, except that shift_aff[] grows from 3 to 4 entries so that up to 4 affinity levels can be managed, and mask is widened to u64.

outer_cache_fns structure – ARM32 only

arch/arm/include/asm/outercache.h

struct outer_cache_fns {
        void (*inv_range)(unsigned long, unsigned long);
        void (*clean_range)(unsigned long, unsigned long);
        void (*flush_range)(unsigned long, unsigned long);
        void (*flush_all)(void);
        void (*disable)(void);
#ifdef CONFIG_OUTER_CACHE_SYNC
        void (*sync)(void);
#endif
        void (*resume)(void);

        /* This is an ARM L2C thing */
        void (*write_sec)(unsigned long, unsigned);
        void (*configure)(const struct l2x0_regs *);
};

Consists of the outer cache handler functions.
- rpi2: not used.
- A few specific ARM machines use an L2 or L3 cache as the outer cache.
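The pattern behind outer_clean_range() and outer_flush_range() is a table of optional hooks: a wrapper calls the hook only when a driver has installed one, so machines without an outer cache (such as rpi2) pay only for a NULL check. The sketch below illustrates that pattern in user space; every name ending in `_sketch`, and `fake_l2_clean_range()`, is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced version of the hook table: only two of the handlers. */
struct outer_cache_fns_sketch {
        void (*clean_range)(unsigned long start, unsigned long end);
        void (*flush_range)(unsigned long start, unsigned long end);
};

/* Zero-initialized: no hooks installed, as on a machine without an outer cache. */
static struct outer_cache_fns_sketch outer_cache_sketch;

/* Counts calls so the effect of the wrapper can be observed. */
static int clean_calls;

/* Stand-in for an L2 controller driver's clean routine. */
static void fake_l2_clean_range(unsigned long start, unsigned long end)
{
        (void)start;
        (void)end;
        clean_calls++;
}

/* Wrapper: a no-op unless a clean_range hook has been registered. */
static void outer_clean_range_sketch(unsigned long start, unsigned long end)
{
        if (outer_cache_sketch.clean_range)
                outer_cache_sketch.clean_range(start, end);
}
```

A cache controller driver would fill in the hooks once at init time; callers never need to know whether an outer cache exists.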

References

ARM: kernel: build MPIDR hash function data structure

smp_init_cpus()

2016-04-08 (updated 2019-05-08) 문영일

SMP (Symmetric Multi-Processor) Operations

The smp_operations structure collects the SMP-specific handlers; the global smp_ops holds a hook function for each operation.

SMP operations come in three broad types:

SMP operations that use the machine descriptor (ARM32 only)
SMP operations for PSCI
SMP operations that use a spin-table

Determining the boot CPU operations – ARM64

cpu_read_bootcpu_ops() – ARM64

arch/arm64/include/asm/cpu_ops.h

static inline void __init cpu_read_bootcpu_ops(void)
{
        cpu_read_ops(0);
}

Determines the operations the boot cpu will use.

cpu_read_ops() – ARM64

arch/arm64/kernel/cpu_ops.c

/*
 * Read a cpu's enable method and record it in cpu_ops.
 */

int __init cpu_read_ops(int cpu)
{
        const char *enable_method = cpu_read_enable_method(cpu);

        if (!enable_method)
                return -ENODEV;

        cpu_ops[cpu] = cpu_get_ops(enable_method);
        if (!cpu_ops[cpu]) {
                pr_warn("Unsupported enable-method: %s\n", enable_method);
                return -EOPNOTSUPP;
        }

        return 0;
}

Determines the operations to be used by the requested @cpu. The enable-method value is read from the device tree or from ACPI.
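The lookup step (the cpu_get_ops() call in the listing) is not shown in this article. A plausible shape for it is a walk over a NULL-terminated table of supported operations, matching on a name string. The sketch below is an assumption along those lines, not the kernel's code; all identifiers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Reduced stand-in for the per-method operations structure. */
struct cpu_ops_sketch {
        const char *name;
};

static const struct cpu_ops_sketch spin_table_ops_sketch = { .name = "spin-table" };
static const struct cpu_ops_sketch psci_ops_sketch = { .name = "psci" };

/* NULL-terminated list of the methods this sketch supports. */
static const struct cpu_ops_sketch *const supported_ops[] = {
        &spin_table_ops_sketch,
        &psci_ops_sketch,
        NULL,
};

/* Return the ops whose name matches the enable-method string, or NULL. */
static const struct cpu_ops_sketch *get_ops_by_name(const char *name)
{
        const struct cpu_ops_sketch *const *ops = supported_ops;

        if (!name)
                return NULL;

        while (*ops) {
                if (!strcmp(name, (*ops)->name))
                        return *ops;
                ops++;
        }
        return NULL;
}
```

A NULL result corresponds to the "Unsupported enable-method" warning path in cpu_read_ops().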

cpu_read_enable_method() – ARM64

arch/arm64/kernel/cpu_ops.c

static const char *__init cpu_read_enable_method(int cpu)
{
        const char *enable_method;

        if (acpi_disabled) {
                struct device_node *dn = of_get_cpu_node(cpu, NULL);

                if (!dn) {
                        if (!cpu)
                                pr_err("Failed to find device node for boot cpu\n");
                        return NULL;
                }

                enable_method = of_get_property(dn, "enable-method", NULL);
                if (!enable_method) {
                        /*
                         * The boot CPU may not have an enable method (e.g.
                         * when spin-table is used for secondaries).
                         * Don't warn spuriously.
                         */
                        if (cpu != 0)
                                pr_err("%pOF: missing enable-method property\n",
                                        dn);
                }
        } else {
                enable_method = acpi_get_enable_method(cpu);
                if (!enable_method) {
                        /*
                         * In ACPI systems the boot CPU does not require
                         * checking the enable method since for some
                         * boot protocol (ie parking protocol) it need not
                         * be initialized. Don't warn spuriously.
                         */
                        if (cpu != 0)
                                pr_err("Unsupported ACPI enable-method\n");
                }
        }

        return enable_method;
}

Looks up the enable-method for the requested @cpu. Returns NULL if none is found.

In code lines 5~24, reads the “enable-method” property from the cpu node in the device tree.
- Yields “psci” or “spin-table”. The rpi3 system uses the value “brcm,bcm2836-smp”.
In code lines 25~37, when ACPI is in use, reads the enable method from the ACPI tables.
- Yields “psci” or “parking-protocol”.

SMP CPU initialization – ARM64

The following figure shows the initialization of the SMP cpus.

smp_init_cpus() – ARM64

arch/arm64/kernel/smp.c

/*
 * Enumerate the possible CPU set from the device tree or ACPI and build the
 * cpu logical map array containing MPIDR values related to logical
 * cpus. Assumes that cpu_logical_map(0) has already been initialized.
 */

void __init smp_init_cpus(void)
{
        int i;

        if (acpi_disabled)
                of_parse_and_init_cpus();
        else
                acpi_parse_and_init_cpus();

        if (cpu_count > nr_cpu_ids)
                pr_warn("Number of cores (%d) exceeds configured maximum of %u - clipping\n",
                        cpu_count, nr_cpu_ids);

        if (!bootcpu_valid) {
                pr_err("missing boot CPU MPIDR, not enabling secondaries\n");
                return;
        }

        /*
         * We need to set the cpu_logical_map entries before enabling
         * the cpus so that cpu processor description entries (DT cpu nodes
         * and ACPI MADT entries) can be retrieved by matching the cpu hwid
         * with entries in cpu_logical_map while initializing the cpus.
         * If the cpu set-up fails, invalidate the cpu_logical_map entry.
         */
        for (i = 1; i < nr_cpu_ids; i++) {
                if (cpu_logical_map(i) != INVALID_HWID) {
                        if (smp_cpu_setup(i))
                                cpu_logical_map(i) = INVALID_HWID;
                }
        }
}

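Sets up the logical cpu -> physical cpu mapping and the cpu -> node mapping for the SMP cpus, and initializes each cpu.

In code lines 5~6, reads cpu information from the cpu nodes in the device tree and sets up the logical cpu -> physical cpu mapping and the cpu -> node mapping.
In code lines 7~8, reads cpu information from the ACPI tables and sets up the logical cpu -> physical cpu mapping and the cpu -> node mapping.
In code lines 26~31, initializes each secondary cpu (logical id 1 and up); if setup fails, that cpu's cpu_logical_map entry is invalidated.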

of_parse_and_init_cpus() – ARM64

arch/arm64/kernel/smp.c

/*
 * Enumerate the possible CPU set from the device tree and build the
 * cpu logical map array containing MPIDR values related to logical
 * cpus. Assumes that cpu_logical_map(0) has already been initialized.
 */

static void __init of_parse_and_init_cpus(void)
{
        struct device_node *dn;

        for_each_of_cpu_node(dn) {
                u64 hwid = of_get_cpu_mpidr(dn);

                if (hwid == INVALID_HWID)
                        goto next;

                if (is_mpidr_duplicate(cpu_count, hwid)) {
                        pr_err("%pOF: duplicate cpu reg properties in the DT\n",
                                dn);
                        goto next;
                }

                /*
                 * The numbering scheme requires that the boot CPU
                 * must be assigned logical id 0. Record it so that
                 * the logical map built from DT is validated and can
                 * be used.
                 */
                if (hwid == cpu_logical_map(0)) {
                        if (bootcpu_valid) {
                                pr_err("%pOF: duplicate boot cpu reg property in DT\n",
                                        dn);
                                goto next;
                        }

                        bootcpu_valid = true;
                        early_map_cpu_to_node(0, of_node_to_nid(dn));

                        /*
                         * cpu_logical_map has already been
                         * initialized and the boot cpu doesn't need
                         * the enable-method so continue without
                         * incrementing cpu.
                         */
                        continue;
                }

                if (cpu_count >= NR_CPUS)
                        goto next;

                pr_debug("cpu logical map 0x%llx\n", hwid);
                cpu_logical_map(cpu_count) = hwid;

                early_map_cpu_to_node(cpu_count, of_node_to_nid(dn));
next:
                cpu_count++;
        }
}

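Reads cpu information from the cpu nodes in the device tree and sets up the logical cpu -> physical cpu mapping and the cpu -> node mapping.

In code lines 5~9, iterates over the cpu nodes and reads hwid from the reg property; a node with an invalid hwid is skipped.
In code lines 11~15, if the hwid is a duplicate, prints an error message and skips the node.
In code lines 23~40, for the boot cpu, sets up the cpu -> node mapping and checks that only one node describes the boot cpu.
In code lines 42~43, skips the node if the number of cpu nodes read from the device tree has reached the maximum cpu count configured at build time (NR_CPUS).
In code lines 45~46, prints the hwid as debug output and stores it in the logical map to support logical cpu -> physical cpu translation.
In code line 48, sets up the mapping for cpu -> node translation.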
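The control flow of the loop is easier to follow with the device-tree and NUMA parts stripped out. The sketch below replays it in user space over a plain array of hwid values; `NR_CPUS_SKETCH`, `map_reset()`, `is_duplicate()` and `parse_cpu_nodes()` are hypothetical names, and the node mapping step is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NR_CPUS_SKETCH 4
#define INVALID_HWID   UINT64_MAX

static uint64_t logical_map[NR_CPUS_SKETCH];
static unsigned int cpu_count;
static bool bootcpu_valid;

/* Start state: only logical cpu 0 (the boot cpu) is known. */
static void map_reset(uint64_t boot_hwid)
{
        for (unsigned int i = 0; i < NR_CPUS_SKETCH; i++)
                logical_map[i] = INVALID_HWID;
        logical_map[0] = boot_hwid;
        cpu_count = 1;
        bootcpu_valid = false;
}

/* Has this hwid already been stored for a secondary cpu? */
static bool is_duplicate(unsigned int cpu, uint64_t hwid)
{
        for (unsigned int i = 1; i < cpu && i < NR_CPUS_SKETCH; i++)
                if (logical_map[i] == hwid)
                        return true;
        return false;
}

static void parse_cpu_nodes(const uint64_t *hwids, size_t n)
{
        for (size_t k = 0; k < n; k++) {
                uint64_t hwid = hwids[k];

                if (hwid == INVALID_HWID || is_duplicate(cpu_count, hwid))
                        goto next;

                if (hwid == logical_map[0]) {
                        if (bootcpu_valid)
                                goto next;      /* second node for the boot cpu */
                        bootcpu_valid = true;
                        continue;               /* boot cpu keeps logical id 0 */
                }

                if (cpu_count >= NR_CPUS_SKETCH)
                        goto next;

                logical_map[cpu_count] = hwid;
next:
                cpu_count++;    /* skipped nodes still consume a logical id */
        }
}
```

Note that a skipped node still advances cpu_count, so the matching logical_map slot keeps its previous value (INVALID_HWID in this sketch).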

The following shows the cpu nodes used for Broadcom's Northstar2 chip.

The four cores use psci calls (via smc) into the secure firmware.

        cpus {
                #address-cells = <2>;
                #size-cells = <0>;

                A57_0: cpu@0 {
                        device_type = "cpu";
                        compatible = "arm,cortex-a57", "arm,armv8";
                        reg = <0 0>;
                        enable-method = "psci";
                        next-level-cache = <&CLUSTER0_L2>;
                };

                A57_1: cpu@1 {
                        device_type = "cpu";
                        compatible = "arm,cortex-a57", "arm,armv8";
                        reg = <0 1>;
                        enable-method = "psci";
                        next-level-cache = <&CLUSTER0_L2>;
                };

                A57_2: cpu@2 {
                        device_type = "cpu";
                        compatible = "arm,cortex-a57", "arm,armv8";
                        reg = <0 2>;
                        enable-method = "psci";
                        next-level-cache = <&CLUSTER0_L2>;
                };

                A57_3: cpu@3 {
                        device_type = "cpu";
                        compatible = "arm,cortex-a57", "arm,armv8";
                        reg = <0 3>;
                        enable-method = "psci";
                        next-level-cache = <&CLUSTER0_L2>;
                };

                CLUSTER0_L2: l2-cache@0 {
                        compatible = "cache";
                };
        };

        psci {
                compatible = "arm,psci-1.0";
                method = "smc";
        };

early_map_cpu_to_node() – ARM64

arch/arm64/mm/numa.c

void __init early_map_cpu_to_node(unsigned int cpu, int nid)
{
        /* fallback to node 0 */
        if (nid < 0 || nid >= MAX_NUMNODES || numa_off)
                nid = 0;

        cpu_to_node_map[cpu] = nid;

        /*
         * We should set the numa node of cpu0 as soon as possible, because it
         * has already been set up online before. cpu_to_node(0) will soon be
         * called.
         */
        if (!cpu)
                set_cpu_numa_node(cpu, nid);
}

Sets @nid for the requested @cpu so that cpu -> node translation works.

Since cpu 0 is already online, its numa node is additionally set right away.

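Selecting and initializing the SMP CPU operations – ARM32

This happens in the middle of setup_arch(), just before smp_init_cpus() is called.

Which structure the global smp_ops is set from depends on whether PSCI is operational.

When PSCI is operational, smp_ops is set from psci_smp_ops.
When PSCI is not operational and mdesc->smp exists, smp_ops is set from mdesc->smp.

Selecting the SMP cpu operations

Middle of setup_arch() – ARM32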

arch/arm/kernel/setup.c

#ifdef CONFIG_SMP
        if (is_smp()) {
                if (!mdesc->smp_init || !mdesc->smp_init()) {
                        if (psci_smp_available())
                                smp_set_ops(&psci_smp_ops);
                        else if (mdesc->smp)
                                smp_set_ops(mdesc->smp);
                }

In code line 3, on an SMP machine, the inner block is entered only when the smp_init member is null or the machine's smp_init() function returns false.
- A) The smp_init hook runs before the PSCI check so that a machine can select MCPM (Multi-Cluster Power Management) ahead of PSCI by setting smp_ops from the mcpm_smp_ops object. When the hook succeeds, the PSCI and mdesc->smp choices below are skipped.
  - vexpress e.g.)
    - mdesc->smp_init points to vexpress_smp_init_ops(), which is called.
      - When a “cci-400” (cache coherent interface 400 series) device is found in the DT and is available (its “status” property is “ok” or “okay”, or absent), smp_ops is set from the mcpm_smp_ops object.

        mcpm_smp_ops.smp_init_cpus = null
In code lines 4~5, if the CONFIG_ARM_PSCI kernel option is set and PSCI is operational,
- B) the global smp_ops is set from psci_smp_ops so that PSCI (Power State Coordination Interface) can be used.
  - psci_smp_ops.smp_init_cpus = null
In code lines 6~7, C) if neither MCPM nor PSCI is operational, smp_set_ops() assigns mdesc->smp to smp_ops.
- e.g.) rpi2:
  - smp_ops is set from bcm2709_smp_ops.
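The priority order A) -> B) -> C) can be summarised as a small decision function. The sketch below is a user-space model only: `struct machine_sketch`, `choose_smp_ops()` and the enum are hypothetical, and the boolean `has_smp_ops` stands in for `mdesc->smp != NULL`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum smp_choice {
        SMP_NONE,            /* nothing selected */
        SMP_FROM_INIT_HOOK,  /* A) the smp_init hook installed its own ops (e.g. MCPM) */
        SMP_PSCI,            /* B) psci_smp_ops */
        SMP_MDESC,           /* C) mdesc->smp */
};

struct machine_sketch {
        bool (*smp_init)(void);  /* optional hook, may be NULL */
        bool has_smp_ops;        /* stands in for mdesc->smp != NULL */
};

static enum smp_choice choose_smp_ops(const struct machine_sketch *m,
                                      bool psci_available)
{
        /* The hook gets the first chance; success ends the selection. */
        if (m->smp_init && m->smp_init())
                return SMP_FROM_INIT_HOOK;
        if (psci_available)
                return SMP_PSCI;
        if (m->has_smp_ops)
                return SMP_MDESC;
        return SMP_NONE;
}

/* Example hooks: one that succeeds and one that declines. */
static bool hook_ok(void)   { return true; }
static bool hook_fail(void) { return false; }
```

An rpi2-like machine (no hook, PSCI unavailable, machine ops present) ends up with SMP_MDESC, while a vexpress-like machine whose hook finds a usable CCI ends up with SMP_FROM_INIT_HOOK even if PSCI is available.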

Example of a system that uses mdesc->smp and mdesc->smp_init

arch/arm/mach-vexpress/v2m.c

static const char * const v2m_dt_match[] __initconst = { 
        "arm,vexpress",
        NULL,
};

DT_MACHINE_START(VEXPRESS_DT, "ARM-Versatile Express")
        .dt_compat      = v2m_dt_match,
        .l2c_aux_val    = 0x00400000,
        .l2c_aux_mask   = 0xfe0fffff,
        .smp            = smp_ops(vexpress_smp_dt_ops),
        .smp_init       = smp_init_ops(vexpress_smp_init_ops),
MACHINE_END

DT machine definition for the Versatile Express system
- .smp points to the global vexpress_smp_dt_ops object
- .smp_init points to the vexpress_smp_init_ops() function

smp_ops()

#define smp_ops(ops) (&(ops))

vexpress_smp_dt_ops global object

arch/arm/mach-vexpress/platsmp.c

struct smp_operations __initdata vexpress_smp_dt_ops = { 
        .smp_prepare_cpus       = vexpress_smp_dt_prepare_cpus,
        .smp_secondary_init     = versatile_secondary_init,
        .smp_boot_secondary     = versatile_boot_secondary,
#ifdef CONFIG_HOTPLUG_CPU
        .cpu_die                = vexpress_cpu_die,
#endif
};

smp_init_ops()

arch/arm/include/asm/mach/arch.h

#define smp_init_ops(ops) (&(ops))

vexpress_smp_init_ops()

arch/arm/mach-vexpress/platsmp.c

bool __init vexpress_smp_init_ops(void)
{
#ifdef CONFIG_MCPM
        /*
         * The best way to detect a multi-cluster configuration at the moment
         * is to look for the presence of a CCI in the system.
         * Override the default vexpress_smp_ops if so.
         */
        struct device_node *node;
        node = of_find_compatible_node(NULL, NULL, "arm,cci-400");
        if (node && of_device_is_available(node)) {
                mcpm_smp_set_ops();
                return true;
        }
#endif
        return false;
}

CONFIG_MCPM
- Multi-Cluster Power Management: manages power on a per-cluster basis, e.g. for big.LITTLE.
node = of_find_compatible_node(NULL, NULL, "arm,cci-400");
- Finds the node among all DT nodes whose compatible property is “arm,cci-400”.
if (node && of_device_is_available(node)) {
- If the node's device is available,
  - i.e. the node's “status” property is “ok” or “okay”, or the property is absent
mcpm_smp_set_ops();
- Sets the global smp_ops from mcpm_smp_ops.
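The availability rule can be written down as a tiny predicate. The sketch below assumes the usual device-tree convention (a missing “status” property counts as available, and both “okay” and “ok” are accepted); `status_is_available()` is a hypothetical helper that takes the property string directly instead of a device node.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * status: value of the node's "status" property, or NULL when the
 * property is absent.
 */
static bool status_is_available(const char *status)
{
        if (!status)
                return true;    /* no property: treated as available */

        return !strcmp(status, "okay") || !strcmp(status, "ok");
}
```

Any other value, such as "disabled", makes the node unavailable, in which case vexpress_smp_init_ops() returns false and the PSCI or mdesc->smp choice applies.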

vexpress-v2p-ca15_a7.dts defines the “arm,cci-400” Cache Coherent Interface 400 series device as follows.

Two a15 cpus and three a7 cpus are arranged as big.LITTLE clusters.

        cci@2c090000 {
                compatible = "arm,cci-400";
                #address-cells = <1>;
                #size-cells = <1>;
                reg = <0 0x2c090000 0 0x1000>;
                ranges = <0x0 0x0 0x2c090000 0x10000>;

                cci_control1: slave-if@4000 {
                        compatible = "arm,cci-400-ctrl-if";
                        interface-type = "ace";
                        reg = <0x4000 0x1000>;
                };

                cci_control2: slave-if@5000 {
                        compatible = "arm,cci-400-ctrl-if";
                        interface-type = "ace";
                        reg = <0x5000 0x1000>;
                };
        };

psci_smp_available()

arch/arm/kernel/psci_smp.c

bool __init psci_smp_available(void)
{
        /* is cpu_on available at least? */
        return (psci_ops.cpu_on != NULL);
}

If a function is hooked up to psci_ops.cpu_on, PSCI can be considered operational.

mcpm_smp_set_ops()

arch/arm/common/mcpm_platsmp.c

static struct smp_operations __initdata mcpm_smp_ops = {
        .smp_boot_secondary     = mcpm_boot_secondary,
        .smp_secondary_init     = mcpm_secondary_init,
#ifdef CONFIG_HOTPLUG_CPU
        .cpu_kill               = mcpm_cpu_kill,
        .cpu_disable            = mcpm_cpu_disable,
        .cpu_die                = mcpm_cpu_die,
#endif
};

void __init mcpm_smp_set_ops(void)
{
        smp_set_ops(&mcpm_smp_ops);
}

Sets the global smp_ops from the mcpm_smp_ops object.

smp_init_cpus() – ARM32

arch/arm/kernel/smp.c

/* platform specific SMP operations */
void __init smp_init_cpus(void)
{
        if (smp_ops.smp_init_cpus)
                smp_ops.smp_init_cpus();
}

Calls smp_ops.smp_init_cpus if a function is hooked up to it.

The registered function performs the initialization for that SMP machine.
- It typically configures the SCU (Snoop Control Unit) or other cache-coherency hardware, or sets the cpu possible bitmap.
- exynos e.g.)
  - exynos_smp_init_cpus()
- rpi2 e.g.)
  - bcm2709_smp_init_cpus()

The following shows how bcm2709_smp_init_cpus() ends up being called because smp_ops has been set from bcm2709_smp_ops.

arch/arm/mach-bcm2709/bcm2709.c

struct smp_operations  bcm2709_smp_ops __initdata = {
        .smp_init_cpus          = bcm2709_smp_init_cpus,
        .smp_prepare_cpus       = bcm2709_smp_prepare_cpus,
        .smp_secondary_init     = bcm2709_secondary_init,
        .smp_boot_secondary     = bcm2709_boot_secondary,
};
#endif

static const char * const bcm2709_compat[] = {
        "brcm,bcm2709",
        "brcm,bcm2708", /* Could use bcm2708 in a pinch */
        NULL
};

MACHINE_START(BCM2709, "BCM2709")
    /* Maintainer: Broadcom Europe Ltd. */
#ifdef CONFIG_SMP
        .smp            = smp_ops(bcm2709_smp_ops),
#endif
        .map_io = bcm2709_map_io,
        .init_irq = bcm2709_init_irq,
        .init_time = bcm2709_timer_init,
        .init_machine = bcm2709_init,
        .init_early = bcm2709_init_early,
        .reserve = board_reserve,
        .restart        = bcm2709_restart,
        .dt_compat = bcm2709_compat,
MACHINE_END

bcm2709_smp_init_cpus()

arch/arm/mach-bcm2709/bcm2709.c

void __init bcm2709_smp_init_cpus(void)
{       
        void secondary_startup(void);
        unsigned int i, ncores;
        
        ncores = 4; // xxx scu_get_core_count(NULL);
        printk("[%s] enter (%x->%x)\n", __FUNCTION__, (unsigned)virt_to_phys((void *)secondary_startup), (unsigned)__io_address(ST_BASE + 0x10));
        printk("[%s] ncores=%d\n", __FUNCTION__, ncores);
    
        for (i = 0; i < ncores; i++) {
                set_cpu_possible(i, true);
                /* enable IRQ (not FIQ) */
                writel(0x1, __io_address(ARM_LOCAL_MAILBOX_INT_CONTROL0 + 0x4 * i));
                //writel(0xf, __io_address(ARM_LOCAL_TIMER_INT_CONTROL0   + 0x4 * i));
        }
        set_smp_cross_call(bcm2835_send_doorbell);
}

ncores is hard-coded to 4.
The cpu possible bit is set for each core number.
writel(0x1, __io_address(ARM_LOCAL_MAILBOX_INT_CONTROL0 + 0x4 * i));
- Enables the mailbox interrupt as an IRQ (not an FIQ) for each core.
set_smp_cross_call(bcm2835_send_doorbell);
- Makes the file-scope pointer __smp_cross_call point to bcm2835_send_doorbell().
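
The loop above does two things per core: it marks the core as possible and writes 0x1 to a per-core control register located 4 bytes after the previous one. A minimal user-space sketch of that arithmetic follows. The register file is a plain array, and DEMO_MBOX_INT_CONTROL0 is an assumed offset chosen for illustration; the real ARM_LOCAL_MAILBOX_INT_CONTROL0 value comes from the platform headers.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NCORES             4u
/* Assumed offset for illustration only; not taken from the platform headers. */
#define DEMO_MBOX_INT_CONTROL0  0x50u

static uint32_t demo_possible_mask;     /* bit n set => cpu n is possible */
static uint32_t demo_regs[0x100 / 4];   /* fake register file, indexed by byte offset / 4 */

/* Stand-in for writel(): store a 32-bit value at a byte offset. */
static void demo_writel(uint32_t val, uint32_t offset)
{
        assert(offset % 4 == 0 && offset < sizeof(demo_regs));
        demo_regs[offset / 4] = val;
}

static void demo_init_possible_cpus(void)
{
        unsigned int i;

        for (i = 0; i < DEMO_NCORES; i++) {
                demo_possible_mask |= 1u << i;          /* like set_cpu_possible(i, true) */
                demo_writel(0x1, DEMO_MBOX_INT_CONTROL0 + 0x4 * i);
        }
}
```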

set_smp_cross_call()

arch/arm/kernel/smp.c

static void (*__smp_cross_call)(const struct cpumask *, unsigned int);

void __init set_smp_cross_call(void (*fn)(const struct cpumask *, unsigned int))
{
        if (!__smp_cross_call)
                __smp_cross_call = fn; 
}

If __smp_cross_call has not been set yet, it is set to fn. A later call does not override an earlier registration.
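
The "first registration wins" behaviour can be sketched in user-space C as follows. All names are hypothetical, and a plain unsigned int stands in for struct cpumask. The NULL check in demo_send_ipi() is an addition of this sketch, not something the kernel's smp_cross_call() does.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*demo_cross_call_fn)(unsigned int cpu_mask, unsigned int ipinr);

static demo_cross_call_fn demo_cross_call;      /* NULL until someone registers */
static unsigned int demo_last_ipi;              /* records which handler ran */

/* Only the first registration takes effect, like set_smp_cross_call(). */
static void demo_set_cross_call(demo_cross_call_fn fn)
{
        if (!demo_cross_call)
                demo_cross_call = fn;
}

static void demo_doorbell_a(unsigned int cpu_mask, unsigned int ipinr)
{
        (void)cpu_mask;
        demo_last_ipi = 100 + ipinr;
}

static void demo_doorbell_b(unsigned int cpu_mask, unsigned int ipinr)
{
        (void)cpu_mask;
        demo_last_ipi = 200 + ipinr;
}

/* Dispatch through the registered pointer. */
static void demo_send_ipi(unsigned int cpu_mask, unsigned int ipinr)
{
        assert(demo_cross_call != NULL);        /* sketch-only safety check */
        demo_cross_call(cpu_mask, ipinr);
}
```

Registering demo_doorbell_a and then demo_doorbell_b leaves demo_doorbell_a installed, so every later demo_send_ipi() goes to the first handler.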

smp_cross_call()

It is called from the following functions.

arch_send_call_function_ipi_mask()
arch_send_wakeup_ipi_mask()
arch_send_call_function_single_ipi()
arch_irq_work_raise()
tick_broadcast()
smp_send_reschedule()
smp_send_stop()

arch/arm/kernel/smp.c

static void smp_cross_call(const struct cpumask *target, unsigned int ipinr)
{
        trace_ipi_raise(target, ipi_types[ipinr]);
        __smp_cross_call(target, ipinr);
}

bcm2835_send_doorbell()

arch/arm/mach-bcm2709/bcm2709.c

static void bcm2835_send_doorbell(const struct cpumask *mask, unsigned int irq) 
{
        int cpu; 
        /*
         * Ensure that stores to Normal memory are visible to the
         * other CPUs before issuing the IPI.
         */
        dsb();

        /* Convert our logical CPU mask into a physical one. */
        for_each_cpu(cpu, mask)
        {    
                /* submit softirq */
                writel(1<<irq, __io_address(ARM_LOCAL_MAILBOX0_SET0 + 0x10 * MPIDR_AFFINITY_LEVEL(cpu_logical_map(cpu), 0)));
        }
}
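
The doorbell function above writes the bit 1<<irq to a per-core mailbox register, with one core's registers 0x10 bytes apart from the next. A minimal user-space sketch of that addressing follows. It assumes four 32-bit mailboxes per core (hence the 0x10 stride) and assumes that writing a bit to the "set" register ORs it into the mailbox; both are assumptions of this sketch, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NCORES     4u
#define DEMO_NBOXES     4u      /* assumed: 4 mailboxes x 4 bytes = 0x10 stride per core */

static uint32_t demo_mailbox[DEMO_NCORES][DEMO_NBOXES];

/* Byte offset of a core's mailbox 0 relative to a hypothetical SET0 base. */
static uint32_t demo_mailbox0_set_offset(unsigned int core)
{
        assert(core < DEMO_NCORES);
        return 0x10u * core;
}

/* Raise IPI number ipinr on every cpu whose bit is set in cpu_mask. */
static void demo_send_doorbell(uint32_t cpu_mask, unsigned int ipinr)
{
        unsigned int cpu;

        assert(ipinr < 32);
        for (cpu = 0; cpu < DEMO_NCORES; cpu++) {
                if (cpu_mask & (1u << cpu))
                        demo_mailbox[cpu][0] |= 1u << ipinr;    /* assumed write-to-set */
        }
}
```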

writel()

arch/arm/include/asm/io.h

#define writel(v,c)             ({ __iowmb(); writel_relaxed(v,c); })

#define __iowmb()               wmb()

#define writel_relaxed(v,c)     __raw_writel((__force u32) cpu_to_le32(v),c)

__raw_writel()

arch/arm/include/asm/io.h

static inline void __raw_writel(u32 val, volatile void __iomem *addr)
{
        asm volatile("str %1, %0"
                     : "+Qo" (*(volatile u32 __force *)addr)
                     : "r" (val));
}

CPU-related API

get_cpu()

include/linux/smp.h

#define get_cpu()               ({ preempt_disable(); smp_processor_id(); })

Returns the current cpu id. Preemption is disabled so that the task cannot move to another cpu while the id is in use.

After using this function, the matching put_cpu() must always be called to enable preemption again.

put_cpu()

include/linux/smp.h

#define put_cpu()               preempt_enable()

The cpu id is no longer needed, so preemption is enabled again.
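
The get_cpu()/put_cpu() pairing can be sketched in user-space C with a simple counter standing in for the preempt count. This is only a model of the calling convention: demo_get_cpu(), demo_put_cpu() and the fixed cpu number 2 are hypothetical, and nothing here actually disables preemption.

```c
#include <assert.h>

#define DEMO_NR_CPUS    4

static int demo_preempt_count;          /* > 0 models "preemption disabled" */
static int demo_current_cpu = 2;        /* pretend we are running on cpu 2 */
static int demo_per_cpu_counter[DEMO_NR_CPUS];

/* Model of get_cpu(): disable preemption, then read the cpu id. */
static int demo_get_cpu(void)
{
        demo_preempt_count++;
        return demo_current_cpu;
}

/* Model of put_cpu(): re-enable preemption; an unbalanced call is a bug. */
static void demo_put_cpu(void)
{
        assert(demo_preempt_count > 0);
        demo_preempt_count--;
}

/* Typical usage: touch per-cpu data strictly between get and put. */
static int demo_bump_local_counter(void)
{
        int cpu = demo_get_cpu();
        int val = ++demo_per_cpu_counter[cpu];

        demo_put_cpu();
        return val;
}
```

Keeping the per-cpu access between the two calls is the point of the pattern: the cpu id is only guaranteed to stay valid while preemption is off.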

smp_processor_id()

include/linux/smp.h

/*
 * smp_processor_id(): get the current CPU ID.
 *
 * if DEBUG_PREEMPT is enabled then we check whether it is
 * used in a preemption-safe way. (smp_processor_id() is safe
 * if it's used in a preemption-off critical section, or in
 * a thread that is bound to the current CPU.)
 *
 * NOTE: raw_smp_processor_id() is for internal use only
 * (smp_processor_id() is the preferred variant), but in rare
 * instances it might also be used to turn off false positives
 * (i.e. smp_processor_id() use that the debugging code reports but
 * which use for some reason is legal). Don't use this to hack around
 * the warning message, as your code might not work under PREEMPT.
 */
#ifdef CONFIG_DEBUG_PREEMPT
  extern unsigned int debug_smp_processor_id(void);
# define smp_processor_id() debug_smp_processor_id()
#else
# define smp_processor_id() raw_smp_processor_id()
#endif

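When CONFIG_DEBUG_PREEMPT is enabled, debug_smp_processor_id() is used, and it prints a warning if it is called while preemption is still enabled. When the option is disabled, smp_processor_id() expands directly to the raw_smp_processor_id() macro.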
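
The idea behind the debug variant can be sketched in user-space C as follows. This is a heavily simplified model with hypothetical names: it only looks at a preempt counter, whereas the source comment also treats a thread bound to the current cpu as a safe caller.

```c
#include <assert.h>

static int demo_preempt_count;  /* 0 models "preemption enabled" */
static int demo_warnings;       /* number of warnings that would be printed */

/* Model of raw_smp_processor_id(): no checking at all. */
static int demo_raw_processor_id(void)
{
        return 1;               /* pretend we are on cpu 1 */
}

/* Model of the debug variant: warn when called from preemptible context. */
static int demo_debug_processor_id(void)
{
        assert(demo_preempt_count >= 0);
        if (demo_preempt_count == 0)
                demo_warnings++;        /* the kernel would print a warning here */
        return demo_raw_processor_id();
}
```

Both variants return the same id; the debug one only adds the check, which is why it can be compiled out when the option is off.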

raw_smp_processor_id() – ARM32

arch/arm/include/asm/smp.h

#define raw_smp_processor_id() (current_thread_info()->cpu)

Returns the number of the cpu on which the current task is running.

raw_smp_processor_id() – ARM64

arch/arm64/include/asm/smp.h

/*
 * We don't use this_cpu_read(cpu_number) as that has implicit writes to
 * preempt_count, and associated (compiler) barriers, that we'd like to avoid
 * the expense of. If we're preemptible, the value can be stale at use anyway.
 * And we can't use this_cpu_ptr() either, as that winds up recursing back
 * here under CONFIG_DEBUG_PREEMPT=y.
 */
#define raw_smp_processor_id() (*raw_cpu_ptr(&cpu_number))

Reads the cpu id from cpu_number, which is maintained as a per-cpu variable.

References

Linux Kernel Power Management (PM) Framework for ARM 64-bit Processors | arm – download
Multi-cluster power management | LWN.net
Linux support for ARM big.LITTLE | LWN.net