문c 블로그

page_ext_init_flatmem()

2016-05-042019-07-19 문영일 Leave a comment

page extension

memmap extension 또는 extended memap 이라고도 불린다. page 구조체를 변경하지 않고 page_ext라는 별도의 구조체를 만들어 page 확장을 하여 디버깅에 도움을 줄 수 있게 하였다. 이 기능은 2014년 12월 kernel v3.19-rc1에 LG 전자의 김준수님이 추가하였다.

다음과 같은 기능들이 준비되어 있다.

page alloc 디버깅
page poisoning 디버그
page owner 트래킹
idle page 트래킹

다음 그림은 sparse 메모리 모델이 아닌 경우 수행되는 page_ext_init_flatmem() 함수가 동작하여 확장 memmap을 생성하는 과정을 보여준다.

Flat & Discontiguous 메모리 모델에서의 page_ext 초기화

참고로 sparse 메모리 모델에서의 초기화 루틴은 page_ext_init() 함수에서 한다.

page_ext_init_flatmem()

mm/page_ext.c

void __init page_ext_init_flatmem(void)
{

        int nid, fail;

        if (!invoke_need_callbacks())
                return;

        for_each_online_node(nid)  {
                fail = alloc_node_page_ext(nid);
                if (fail)
                        goto fail;
        }
        pr_info("allocated %ld bytes of page_ext\n", total_usage);
        invoke_init_callbacks();
        return;

fail:
        pr_crit("allocation of page_ext failed.\n");
        panic("Out of memory");
}

page_ext가 필요한 경우 각 노드에 page_ext를 할당 받고 초기화한다.

invoke_need_callbacks()

mm/page_ext.c

static bool __init invoke_need_callbacks(void)
{
        int i;
        int entries = ARRAY_SIZE(page_ext_ops);
        bool need = false;

        for (i = 0; i < entries; i++) {
                if (page_ext_ops[i]->need && page_ext_ops[i]->need()) {
                        page_ext_ops[i]->offset = sizeof(struct page_ext) +
                                                extra_mem;
                        extra_mem += page_ext_ops[i]->size;
                        need = true;
                }
        }

        return need;
}

page_ext_ops 엔트리 수 만큼 루프를 돌며 엔트리에 등록된 need 함수를 수행시켜 엔트리 하나라도 초기화가 필요한 경우 true를 반환한다.

코드 라인 7~8에서 컴파일 타임에 결정된 page_ext_ops[] 배열 엔트리 수 만큼 순회하며 해당 오퍼레이션에 등록된 (*need) 후크 함수에서 초기화가 필요한 경우 true를 반환하여 조건을 만족시킨다.
코드 라인 9~10에서 해당 오퍼레이션의 offset에는 page_ext 구조체 + 해당 오퍼레이션이 사용할 extra 메모리에 대한 offset 바이트 수를 대입한다.
- CONFIG_PAGE_OWNER 커널 옵션을 사용하는 경우 extra 메모리로 page_owner 구조체 사이즈를 사용한다.
코드 라인 11에서 오퍼레이션에서 사용한 extra 메모리 사이즈 만큼 전역 변수 extra_mem에 더한다.

alloc_node_page_ext()

mm/page_ext.c

static int __init alloc_node_page_ext(int nid)
{
        struct page_ext *base;
        unsigned long table_size;
        unsigned long nr_pages;

        nr_pages = NODE_DATA(nid)->node_spanned_pages;
        if (!nr_pages)
                return 0;

        /*
         * Need extra space if node range is not aligned with
         * MAX_ORDER_NR_PAGES. When page allocator's buddy algorithm
         * checks buddy's status, range could be out of exact node range.
         */
        if (!IS_ALIGNED(node_start_pfn(nid), MAX_ORDER_NR_PAGES) ||
                !IS_ALIGNED(node_end_pfn(nid), MAX_ORDER_NR_PAGES))
                nr_pages += MAX_ORDER_NR_PAGES;

        table_size = get_entry_size() * nr_pages;

        base = memblock_virt_alloc_try_nid_nopanic(
                        table_size, PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
                        BOOTMEM_ALLOC_ACCESSIBLE, nid);
        if (!base)
                return -ENOMEM;
        NODE_DATA(nid)->node_page_ext = base;
        total_usage += table_size;
        return 0;
}

지정된 노드에 MAX_ORDER_NR_PAGES 단위로 align된 존재하는 페이지 수 만큼의 page_ext 구조체 + extra 메모리 만큼을 memblock에 할당한다.

코드 라인 7~9에서 해당 노드가 관리하는 홀을 포함한 전체 페이지 수를 가져온다.
코드 라인 16~18에서 해당 노드의 시작과 끝 페이지가 최대 버디 오더 페이지 단위로 정렬되지 않은 경우 extra 공간을 더 정렬 단위만큼 추가한다.
- MAX_ORDER_NR_PAGES는 버디 시스템에서 최대 할당 가능한 오더 페이지 수 이다.
코드 라인 20~26에서 page_ext 용도로 메모리를 할당한다.
- memblock의 DMA 주소 ~ lowmem 범위에 가능하면 노드 공간에 page_ext 공간을 할당한다.
- 노드 주소 범위가 DMA 주소 ~ lowmem 범위에 포함되지 않는 경우에는 지정된 노드가 아니더라도 할당을 시도한다.
코드 라인 27에서 할당된 메모리는 노드 멤버 node_page_ext에 지정한다.
코드 라인 28에서 전역 사용량(바이트)에 할당 사이즈 만큼 추가한다.

invoke_init_callbacks()

mm/page_ext.c

static void __init invoke_init_callbacks(void)
{
        int i;
        int entries = ARRAY_SIZE(page_ext_ops);

        for (i = 0; i < entries; i++) {
                if (page_ext_ops[i]->init)
                        page_ext_ops[i]->init();
        }
}

page_ext_ops 엔트리 수 만큼 루프를 돌며 초기화 핸들러 함수를 호출한다.

lookup_page_ext()

mm/page_ext.c

struct page_ext *lookup_page_ext(struct page *page)
{
        unsigned long pfn = page_to_pfn(page);
        unsigned long offset;
        struct page_ext *base;

        base = NODE_DATA(page_to_nid(page))->node_page_ext;
#ifdef CONFIG_DEBUG_VM
        /*
         * The sanity checks the page allocator does upon freeing a
         * page can reach here before the page_ext arrays are
         * allocated when feeding a range of pages to the allocator
         * for the first time during bootup or memory hotplug.
         */
        if (unlikely(!base))
                return NULL;
#endif
        offset = pfn - round_down(node_start_pfn(page_to_nid(page)),
                                        MAX_ORDER_NR_PAGES);
        return base + offset;
}

page로 page_ext 정보츨 찾아 반환한다.

구조체

page_ext_operations 구조체

include/linux/page_ext.h

struct page_ext_operations {
        size_t offset;
        size_t size;
        bool (*need)(void);       
        void (*init)(void);
};

offset
- page_ext 뒤에 추가된 extra 메모리를 가리키는 offset 이다.
size
- 해당 오퍼레이션에서 사용하는 extra 메모리 바이트 수
- CONFIG_PAGE_OWNER 커널 옵션을 사용하는 경우 page_owner 구조체 크기가 사용된다.
need
- 필요 여부 조회 핸들러 함수 등록
init
- 초기화 핸들러 함수 등록

page_ext_ops[] 전역 객체

mm/page_ext.c

/*
 * struct page extension
 *
 * This is the feature to manage memory for extended data per page.
 *
 * Until now, we must modify struct page itself to store extra data per page.
 * This requires rebuilding the kernel and it is really time consuming process.
 * And, sometimes, rebuild is impossible due to third party module dependency.
 * At last, enlarging struct page could cause un-wanted system behaviour change.
 *
 * This feature is intended to overcome above mentioned problems. This feature
 * allocates memory for extended data per page in certain place rather than
 * the struct page itself. This memory can be accessed by the accessor
 * functions provided by this code. During the boot process, it checks whether
 * allocation of huge chunk of memory is needed or not. If not, it avoids
 * allocating memory at all. With this advantage, we can include this feature
 * into the kernel in default and can avoid rebuild and solve related problems.
 *
 * To help these things to work well, there are two callbacks for clients. One
 * is the need callback which is mandatory if user wants to avoid useless
 * memory allocation at boot-time. The other is optional, init callback, which
 * is used to do proper initialization after memory is allocated.
 *
 * The need callback is used to decide whether extended memory allocation is
 * needed or not. Sometimes users want to deactivate some features in this
 * boot and extra memory would be unneccessary. In this case, to avoid
 * allocating huge chunk of memory, each clients represent their need of
 * extra memory through the need callback. If one of the need callbacks
 * returns true, it means that someone needs extra memory so that
 * page extension core should allocates memory for page extension. If
 * none of need callbacks return true, memory isn't needed at all in this boot
 * and page extension core can skip to allocate memory. As result,
 * none of memory is wasted.
 *
 * When need callback returns true, page_ext checks if there is a request for
 * extra memory through size in struct page_ext_operations. If it is non-zero,
 * extra space is allocated for each page_ext entry and offset is returned to
 * user through offset in struct page_ext_operations.
 *
 * The init callback is used to do proper initialization after page extension
 * is completely initialized. In sparse memory system, extra memory is
 * allocated some time later than memmap is allocated. In other words, lifetime
 * of memory for page extension isn't same with memmap for struct page.
 * Therefore, clients can't store extra data until page extension is
 * initialized, even if pages are allocated and used freely. This could
 * cause inadequate state of extra data per page, so, to prevent it, client
 * can utilize this callback to initialize the state of it correctly.
 */

static struct page_ext_operations *page_ext_ops[] = {
#ifdef CONFIG_DEBUG_PAGEALLOC
        &debug_guardpage_ops,
#endif
#ifdef CONFIG_PAGE_OWNER
        &page_owner_ops,
#endif
#if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
        &page_idle_ops,
#endif
};

커널 옵션에 따라 몇 가지 기능의 핸들러 객체들이 등록되어 있다.

debug_guardpage_ops 전역 객체

mm/page_alloc.c

#ifdef CONFIG_DEBUG_PAGEALLOC
struct page_ext_operations debug_guardpage_ops = {
        .need = need_debug_guardpage,
        .init = init_debug_guardpage,
};
#endif

가드 페이지 디버깅에 관련된 핸들러 함수들이 등록되어 있다.

page_owner_ops 전역 객체

mm/page_owner.c

struct page_ext_operations page_owner_ops = {
        .size = sizeof(struct page_owner),
        .need = need_page_owner,
        .init = init_page_owner,
};

페이지 owner 트래킹에 관련된 핸들러 함수들이 등록되어 있다.

디버그용으로 CONFIG_PAGE_OWNER 커널 옵션이 동작하는 경우에만 사용된다.

page_idle_ops 전역 객체

mm/page_idle.c

#ifndef CONFIG_64BIT
struct page_ext_operations page_idle_ops = {
        .need = need_page_idle,
};
#endif

idle 페이지 트래킹에 관련된 핸들러 함수들이 등록되어 있다.

이 오퍼레이션은 32비트 시스템에서만 사용되며, 64비트 시스템에서는 page_ext를 사용하는 이러한 오퍼레이션 없이 page 구조체만을 사용하여 동작한다.

참고

page owner: Tracking about who allocated each page | kernel.org
Resurrect and use struct page extension for some debugging features | LWN.net
Per-CPU Page Frame Cache (zone->pageset) | 문c
Idle Page Tracking | Kernel.org

Interrupts -6- (IPI cross-call)

2016-05-042020-03-16 문영일 Leave a comment

IPI (Inter Processor Interrupts)

인터럽트의 특별한 케이스로 SMP 시스템에서 하나의 cpu에서 다른 cpu로 발생시킨다.

시스템에서 처리하는 IPI 타입은 다음과 같다.

1) ARM64

IPI_RESCHEDULE
- 리스케줄링 IPI
IPI_CALL_FUNC
- call function IPI
IPI_CPU_STOP
- cpu stop IPI
IPI_CPU_CRASH_STOP
- cpu crash stop IPI
IPI_TIMER
- broadcast 타이머 호출 IPI
IPI_IRQ_WORK
- IRQ work IPI
IPI_WAKEUP
- 다른 cpu를 깨우는 IPI

2) ARM32

SGI0~7에 해당하는 IPI들은 다음과 같다. SGI8-15는 secure 펌웨어 의해 예약되어 있다.

IPI_WAKEUP
IPI_TIMER
IPI_RESCHEDULE
IPI_CALL_FUNC
- IPI_CALL_FUNC_SINGLE 기능은 IPI_CALL_FUNC으로만 사용되도록 커널 v4.5-rc1에서 제거된다.
  - 참고: ARM: 8487/1: Remove IPI_CALL_FUNC_SINGLE
IPI_CPU_STOP
IPI_IRQ_WORK
IPI_COMPLETION
- 완료 IPI
IPI_CPU_BACKTRACE
- cpu 백 트레이스 IPI
- IPI_CPU_BACKTRACE 기능이 커널 v4.5-rc1에서 추가되었다.
  - 참고: ARM: 8488/1: Make IPI_CPU_BACKTRACE a “non-secure” SGI

IPI 발생

IPI 함수 초기 설정

각 인터럽트 컨트롤러 초기화 함수에서 IPI 발생시킬 때 사용할 함수를 지정한다.

gic v3 드라이버
- gic_smp_init() 함수에서 gic_raise_softirq() 함수를 지정
rpi2 IC 드라이버
- bcm2709_smp_init_cpus() 함수에서 bcm2835_send_doorbell() 함수를 지정

set_smp_cross_call() – ARM64

arch/arm64/kernel/smp.c

void __init set_smp_cross_call(void (*fn)(const struct cpumask *, unsigned int))
{
        __smp_cross_call = fn;
}

IPI 발생시킬 때 사용할 함수를 지정한다.

set_smp_cross_call() – ARM32

arch/arm/kernel/smp.c

/*
 * Provide a function to raise an IPI cross call on CPUs in callmap.
 */

void __init set_smp_cross_call(void (*fn)(const struct cpumask *, unsigned int))
{
        if (!__smp_cross_call)
                __smp_cross_call = fn;
}

IPI 발생시킬 때 사용할 함수를 지정한다.

IPI 발생 API

smp_cross_call()

arch/arm64/kernel/smp.c & arch/arm/kernel/smp.c

static void smp_cross_call(const struct cpumask *target, unsigned int ipinr)
{
        trace_ipi_raise(target, ipi_types[ipinr]);
        __smp_cross_call(target, ipinr);
}

전역 __smp_cross_call에 등록된 핸들러 함수를 호출하여 @ipinr에 해당하는 IPI를 비트 마스크로 표현된 @target cpu들에 발생시킨다.

IPI 발생 – GIC v3용

아래 함수 명에 사용된 softirq는 커널의 softirq subsystem을 뜻하는 것이 아니다. 외부 디바이스가 발생시키는 인터럽트가 아닌 cpu 내부에서 소프트웨어 명령으로 인해 발생되는 인터럽트라는 의미의 아키텍처 용어이다. ARM GIC에서는 이를 SGI(Software Generate Interrupt)로 표현한다.

다음 그림은 GIC v3 사용 시 최대 16개의 타겟 cpu들에 IPI(SGI)를 발생시키는 과정을 보여준다.

SGI는 0~15번까지 지정하여 인터럽트를 발생시킬 수 있다.

gic_raise_softirq()

drivers/irqchip/irq-gic-v3.c

static void gic_raise_softirq(const struct cpumask *mask, unsigned int irq)
{
        int cpu;

        if (WARN_ON(irq >= 16))
                return;

        /*
         * Ensure that stores to Normal memory are visible to the
         * other CPUs before issuing the IPI.
         */
        wmb();

        for_each_cpu(cpu, mask) {
                u64 cluster_id = MPIDR_TO_SGI_CLUSTER_ID(cpu_logical_map(cpu));
                u16 tlist;

                tlist = gic_compute_target_list(&cpu, mask, cluster_id);
                gic_send_sgi(cluster_id, tlist, irq);
        }

        /* Force the above writes to ICC_SGI1R_EL1 to be executed */
        isb();
}

요청 받은 cpu 비트 @mask에 대해 @irq 번호의 IPI를 발생시킨다.

코드 라인 5~6에서 @irq 번호가 16 이상인 경우 처리를 중단하고 함수를 빠져나간다.
- GIC의 경우 SGI#0 ~ SGI#15까지 발생 가능하다.
코드 라인 12에서 IPI 발생 전에 다른 cpu들에서 저장한 노멀 메모리들의 저장을 확실하게 하는 메모리 베리어를 수행한다.
코드 라인 14~20에서 cpu에 해당하는 mpidr 값에서 클러스터 id만 가져온다.

gic_compute_target_list()

drivers/irqchip/irq-gic-v3.c

static u16 gic_compute_target_list(int *base_cpu, const struct cpumask *mask,
                                   unsigned long cluster_id)
{
        int next_cpu, cpu = *base_cpu;
        unsigned long mpidr = cpu_logical_map(cpu);
        u16 tlist = 0;

        while (cpu < nr_cpu_ids) {
                tlist |= 1 << (mpidr & 0xf);

                next_cpu = cpumask_next(cpu, mask);
                if (next_cpu >= nr_cpu_ids)
                        goto out;
                cpu = next_cpu;

                mpidr = cpu_logical_map(cpu);

                if (cluster_id != MPIDR_TO_SGI_CLUSTER_ID(mpidr)) {
                        cpu--;
                        goto out;
                }
        }
out:
        *base_cpu = cpu;
        return tlist;
}

@cluster_id 들에 소속한 @mask에 포함된 cpu들에 대해 @base_cpu 부터 16비트의 비트마스크 값으로 반환한다.

코드 라인 8~9에서 해당 cpu의 mpidr 값의 4비트가 cpu 번호이며 이 값을 반환하기 위해 사용할 tlist의 해당 cpu 비트를 설정한다.
코드 라인 11~14에서 @mask에 설정된 다음 cpu를 알아온다.
코드 라인 16~21에서 알아온 cpu의 클러스터 id가 @cluster_id와 다른 경우 cpu 번호를 그 전 성공한 번호로 감소시킨 후 함수를 빠져나가기 위해 out 레이블로 이동한다.
코드 라인 23~25에서 인자 @base_cpu에 최종 타겟 cpu를 담고, 클러스터 @cluster_id에 포함되는 16비트 cpu 비트마스크를 반환한다.
- 16비트 값을 반환하는 이유는 SGI가 최대 16개 cpu까지 지원하기 때문이다.

예) cluseter_id#0에 0~3번 cpu, cluster_id#1에 4~7번 cpu들이 배치되었을 때

base_cpu=2, mask=0x00ff, cluseter_id=0인 경우
- 출력 인자 base_cpu=3이고, 반환 값=0x000c이다. (cpu2~3에 해당)

gic_send_sgi()

drivers/irqchip/irq-gic-v3.c

static void gic_send_sgi(u64 cluster_id, u16 tlist, unsigned int irq)
{
        u64 val;

        val = (MPIDR_TO_SGI_AFFINITY(cluster_id, 3)     |
               MPIDR_TO_SGI_AFFINITY(cluster_id, 2)     |
               irq << ICC_SGI1R_SGI_ID_SHIFT            |
               MPIDR_TO_SGI_AFFINITY(cluster_id, 1)     |
               MPIDR_TO_SGI_RS(cluster_id)              |
               tlist << ICC_SGI1R_TARGET_LIST_SHIFT);

        pr_devel("CPU%d: ICC_SGI1R_EL1 %llx\n", smp_processor_id(), val);
        gic_write_sgi1r(val);
}

@irq 번호에 해당하는 SGI(Software Generated Interrupt)를 @cluster_id의 @tlist cpu들에 발생시킨다.

ICC_SGI1R_EL1 레지스터에 @cluster_id로 aff3, aff2, aff1을 지정하고, 타겟 cpu 비트마스크인 @tlist를 대입한다.

IPI 발생 – rpi2 IC용

bcm2835_send_doorbell()

arch/arm/mach-bcm2709/bcm2709.c

static void bcm2835_send_doorbell(const struct cpumask *mask, unsigned int irq)
{
        int cpu;
        /*
         * Ensure that stores to Normal memory are visible to the
         * other CPUs before issuing the IPI.
         */
        dsb();

        /* Convert our logical CPU mask into a physical one. */
        for_each_cpu(cpu, mask)
        {
                /* submit softirq */
                writel(1<<irq, __io_address(ARM_LOCAL_MAILBOX0_SET0 + 0x10 * MPIDR_AFFINITY_LEVEL(cpu_logical_map(cpu), 0)));
        }
}

요청 받은 cpu들에 대해 IPI를 발생시킨다.

Call Function IPI

IPI(Inter Process Interrupts)를 수신 받아 처리하는 핸들러 함수는 handle_IPI()이다. 이 함수에서 사용하는 여러 IPI 항목 중 다음 2개의 항목은 다른 cpu의 요청에 의해 전달받은 call function data를 통해 이에 등록된 함수를 호출한다. 이러한 기능을 지원하기 위해 call_function_init() 함수를 통해 초기화한다.

IPI_CALL_FUNC
- 요청한 함수들을 동작시킨다.
IPI_CALL_FUNC_SINGLE
- 요청한 함수 하나를 동작시킨다.

Call Function 초기화

call_function_init()

kernel/smp.c

void __init call_function_init(void)
{
        int i;

        for_each_possible_cpu(i)
                init_llist_head(&per_cpu(call_single_queue, i));

        smpcfd_prepare_cpu(smp_processor_id());
}

call function에 사용되는 IPI 기능을 초기화한다.

코드 라인 5~6에서 call function이 담기는 call_single_queue를 cpu 수 만큼초기화한다.
코드 라인 8에서 현재 부트 cpu에 대한 call function data를 준비한다.

smpcfd_prepare_cpu()

kernel/smp.c

int smpcfd_prepare_cpu(unsigned int cpu)
{
        struct call_function_data *cfd = &per_cpu(cfd_data, cpu);

        if (!zalloc_cpumask_var_node(&cfd->cpumask, GFP_KERNEL,
                                     cpu_to_node(cpu)))
                return -ENOMEM;
        if (!zalloc_cpumask_var_node(&cfd->cpumask_ipi, GFP_KERNEL,
                                     cpu_to_node(cpu))) {
                free_cpumask_var(cfd->cpumask);
                return -ENOMEM;
        }
        cfd->csd = alloc_percpu(call_single_data_t);
        if (!cfd->csd) {
                free_cpumask_var(cfd->cpumask);
                free_cpumask_var(cfd->cpumask_ipi);
                return -ENOMEM;
        }

        return 0;
}

@cpu에 대한 call function data를 할당하고 초기화한다.

코드 라인 5~7에서 cfd->cpumask를 할당한다.
코드 라인 8~12에서 cfd->cpumask_ipi를 할당한다.
코드 라인 13~18에서 per-cpu 멤버인 cfd->csd를 할당한다.
코드 라인 20에서 성공 값 0을 반환한다.

function call IPI 발생 – many cpu

smp_call_function_many()

kernel/smp.c

/**
 * smp_call_function_many(): Run a function on a set of other CPUs.
 * @mask: The set of cpus to run on (only runs on online subset).
 * @func: The function to run. This must be fast and non-blocking.
 * @info: An arbitrary pointer to pass to the function.
 * @wait: If true, wait (atomically) until function has completed
 *        on other CPUs.
 *
 * If @wait is true, then returns once @func has returned.
 *
 * You must not call this function with disabled interrupts or from a
 * hardware interrupt handler or from a bottom half handler. Preemption
 * must be disabled when calling this function.
 */

void smp_call_function_many(const struct cpumask *mask,
                            smp_call_func_t func, void *info, bool wait)
{
        struct call_function_data *cfd;
        int cpu, next_cpu, this_cpu = smp_processor_id();

        /*
         * Can deadlock when called with interrupts disabled.
         * We allow cpu's that are not yet online though, as no one else can
         * send smp call function interrupt to this cpu and as such deadlocks
         * can't happen.
         */
        WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
                     && !oops_in_progress && !early_boot_irqs_disabled);

        /*
         * When @wait we can deadlock when we interrupt between llist_add() and
         * arch_send_call_function_ipi*(); when !@wait we can deadlock due to
         * csd_lock() on because the interrupt context uses the same csd
         * storage.
         */
        WARN_ON_ONCE(!in_task());
        /* Try to fastpath.  So, what's a CPU they want? Ignoring this one. */
        cpu = cpumask_first_and(mask, cpu_online_mask);
        if (cpu == this_cpu)
                cpu = cpumask_next_and(cpu, mask, cpu_online_mask);

        /* No online cpus?  We're done. */
        if (cpu >= nr_cpu_ids)
                return;

        /* Do we have another CPU which isn't us? */
        next_cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
        if (next_cpu == this_cpu)
                next_cpu = cpumask_next_and(next_cpu, mask, cpu_online_mask);

        /* Fastpath: do that cpu by itself. */
        if (next_cpu >= nr_cpu_ids) {
                smp_call_function_single(cpu, func, info, wait);
                return;
        }

        cfd = this_cpu_ptr(&cfd_data);

        cpumask_and(cfd->cpumask, mask, cpu_online_mask);
        cpumask_clear_cpu(this_cpu, cfd->cpumask);

        /* Some callers race with other cpus changing the passed mask */
        if (unlikely(!cpumask_weight(cfd->cpumask)))
                return;

        cpumask_clear(cfd->cpumask_ipi);
        for_each_cpu(cpu, cfd->cpumask) {
                call_single_data_t *csd = per_cpu_ptr(cfd->csd, cpu);

                csd_lock(csd);
                if (wait)
                        csd->flags |= CSD_FLAG_SYNCHRONOUS;
                csd->func = func;
                csd->info = info;
                if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
                        __cpumask_set_cpu(cpu, cfd->cpumask_ipi);
        }

        /* Send a message to all CPUs in the map */
        arch_send_call_function_ipi_mask(cfd->cpumask_ipi);

        if (wait) {
                for_each_cpu(cpu, cfd->cpumask) {
                        call_single_data_t *csd;

                        csd = per_cpu_ptr(cfd->csd, cpu);
                        csd_lock_wait(csd);
                }
        }
}
EXPORT_SYMBOL(smp_call_function_many);

현재 cpu를 제외한 @mask cpu들에서 @func 함수가 실행되게 한다. @wait 인수를 설정하는 경우 각 cpu에서 함수의 실행이 완료될 때 까지 기다린다.

fastpath: 실행해야 할 타겟 cpu가 1개만 있는 경우

코드 라인 24~30에서 전송할 대상 cpu들 @mask & online cpu들 중 첫 번째 cpu를 알아온다. 단 현재 cpu는 제외한다.
코드 라인 33~35에서 다음 전송할 두 번째 cpu를 알아온다. 단 현재 cpu는 제외한다.
코드 라인 38~41에서 전송 대상이 하나의 cpu인 경우 해당 cpu로 function call IPI를 발생시킨다.

slowpath: 실행해야 할 타겟 cpu가 2개 이상인 경우

코드 라인 43~50에서 cfd_data라는 이름의 전역 per-cpu 객체의 cpumask에 현재 cpu를 제외한 타겟 cpu를 비트마스크 형태로 대입한다. 보낼 타겟 cpu가 하나도 없는 경우 함수를 빠져나간다.
코드 라인 52~63에서 타겟 cpu들을 순회하며 per-cpu 타입인 call_single_data_t 타입 csd에 전송할 정보들을 저장한다.
- csd_lock() 함수에서는 csd->flags에 lock 비트가 설정되어 있는 동안 sleep하고 설정되어 있지 않으면 lock 비트를 설정한다.
- 현재 cpu의 call_single_queue의 선두에 추가한다.
코드 라인 66에서 타겟 cpu들에 대해 function call IPI를 발생시킨다.
코드 라인 68~75에서 인자 @wait이 지정된 경우 다른 cpu에서 func이 모두 실행되어 complete될 때까지 기다린다.

cfd_data

kernel/smp.c

static DEFINE_PER_CPU_ALIGNED(struct call_function_data, cfd_data);

다음 그림은 로컬 cpu를 제외한 타겟 cpu들에 대해 fastpath와 slowpath로 나뉘어 처리되는 과정을 보여준다.

smp_call_function_single()

kernel/smp.c

/*
 * smp_call_function_single - Run a function on a specific CPU
 * @func: The function to run. This must be fast and non-blocking.
 * @info: An arbitrary pointer to pass to the function.
 * @wait: If true, wait until function has completed on other CPUs.
 *
 * Returns 0 on success, else a negative status code.
 */

int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
                             int wait)
{
        int this_cpu;
        int err;

        /*
         * prevent preemption and reschedule on another processor,
         * as well as CPU removal
         */
        this_cpu = get_cpu();

        /*
         * Can deadlock when called with interrupts disabled.
         * We allow cpu's that are not yet online though, as no one else can
         * send smp call function interrupt to this cpu and as such deadlocks
         * can't happen.
         */
        WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
                     && !oops_in_progress);

        /*
         * When @wait we can deadlock when we interrupt between llist_add() and
         * arch_send_call_function_ipi*(); when !@wait we can deadlock due to
         * csd_lock() on because the interrupt context uses the same csd
         * storage.
         */
        WARN_ON_ONCE(!in_task());

        csd = &csd_stack;
        if (!wait) {
                csd = this_cpu_ptr(&csd_data);
                csd_lock(csd);
        }
        err = generic_exec_single(cpu, NULL, func, info, wait);

        if (wait)
                csd_lock_wait(csd);
        put_cpu();

        return err;
}
EXPORT_SYMBOL(smp_call_function_single);

지정한 @cpu에서 @func 함수를 동작시키도록 function call IPI를 발생시킨다. 인자 @wait이 설정된 경우 실행이 완료될 때까지 기다린다.

generic_exec_single()

kernel/smp.c

/*
 * Insert a previously allocated call_single_data element
 * for execution on the given CPU. data must already have
 * ->func, ->info, and ->flags set.
 */

static int generic_exec_single(int cpu, struct call_single_data *csd,
                               smp_call_func_t func, void *info, int wait)
{
        if (cpu == smp_processor_id()) {
                unsigned long flags;

                /*
                 * We can unlock early even for the synchronous on-stack case,
                 * since we're doing this from the same CPU..
                 */
                csd_unlock(csd);
                local_irq_save(flags);
                func(info);
                local_irq_restore(flags);
                return 0;
        }

        if ((unsigned)cpu >= nr_cpu_ids || !cpu_online(cpu)) {
                csd_unlock(csd);
                return -ENXIO;
        }

        if (!csd) {
                csd = &csd_stack;
                if (!wait)
                        csd = this_cpu_ptr(&csd_data);
        }

        csd->func = func;
        csd->info = info;

        /*
         * The list addition should be visible before sending the IPI
         * handler locks the list to pull the entry off it because of
         * normal cache coherency rules implied by spinlocks.
         *
         * If IPIs can go out of order to the cache coherency protocol
         * in an architecture, sufficient synchronisation should be added
         * to arch code to make it appear to obey cache coherency WRT
         * locking and barrier primitives. Generic code isn't really
         * equipped to do the right thing...
         */
        if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
                arch_send_call_function_single_ipi(cpu);

        return 0;
}

지정한 @cpu에서 @func 함수를 동작시키도록 function call IPI를 발생시킨다.

코드 라인 4~16에서 요청 받은 cpu가 현재 cpu인 경우 IPI 발생 없이 그냥 함수를 호출하여 수행한다.
코드 라인 18~21에서 허용 online cpu가 아니면 -ENXIO 에러를 반환한다.
코드 라인 23~27에서 csd가 지정되지 않은 경우 전역 &csd_data를 사용한다.
코드 라인 29~30에서 csd에 호출할 함수와 인수 정보를 대입한다.
코드 라인 43~44에서 csd를 call_single_queue에 추가한 후 IPI를 발생시킨다.
코드 라인 46에서 성공 값 0을 반환한다.

csd_data

kernel/smp.c

static DEFINE_PER_CPU_SHARED_ALIGNED(call_single_data_t, csd_data);

arch_send_call_function_single_ipi()

arch/arm64/kernel/smp.c & arch/arm/kernel/smp.c

void arch_send_call_function_single_ipi(int cpu)
{
        smp_cross_call(cpumask_of(cpu), IPI_CALL_FUNC);
}

지정된 @cpu에 대해 call function IPI를 발생시킨다.

호출된 cpu에서 call_single_queue에 추가된 함수를 수행한다.

IPI 핸들러

아키텍처 및 사용하는 인터럽트 컨트롤러마다 IPI 핸들러를 호출하는 곳이 다르다.

arm
- do_IPI() -> handle_IPI()
arm GIC v3
- gic_handle_irq() -> handle_IPI()

handle_IPI() – ARM64

arch/arm64/kernel/smp.c

/*
 * Main handler for inter-processor interrupts
 */

void handle_IPI(int ipinr, struct pt_regs *regs)
{
        unsigned int cpu = smp_processor_id();
        struct pt_regs *old_regs = set_irq_regs(regs);

        if ((unsigned)ipinr < NR_IPI) {
                trace_ipi_entry_rcuidle(ipi_types[ipinr]);
                __inc_irq_stat(cpu, ipi_irqs[ipinr]);
        }

        switch (ipinr) {
        case IPI_RESCHEDULE:
                scheduler_ipi();
                break;

        case IPI_CALL_FUNC:
                irq_enter();
                generic_smp_call_function_interrupt();
                irq_exit();
                break;

        case IPI_CPU_STOP:
                irq_enter();
                local_cpu_stop();
                irq_exit();
                break;

        case IPI_CPU_CRASH_STOP:
                if (IS_ENABLED(CONFIG_KEXEC_CORE)) {
                        irq_enter();
                        ipi_cpu_crash_stop(cpu, regs);

                        unreachable();
                }
                break;

#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
        case IPI_TIMER:
                irq_enter();
                tick_receive_broadcast();
                irq_exit();
                break;
#endif

#ifdef CONFIG_IRQ_WORK
        case IPI_IRQ_WORK:
                irq_enter();
                irq_work_run();
                irq_exit();
                break;
#endif

#ifdef CONFIG_ARM64_ACPI_PARKING_PROTOCOL
        case IPI_WAKEUP:
                WARN_ONCE(!acpi_parking_protocol_valid(cpu),
                          "CPU%u: Wake-up IPI outside the ACPI parking protocol\n",
                          cpu);
                break;
#endif

        default:
                pr_crit("CPU%u: Unknown IPI message 0x%x\n", cpu, ipinr);
                break;
        }

        if ((unsigned)ipinr < NR_IPI)
                trace_ipi_exit_rcuidle(ipi_types[ipinr]);
        set_irq_regs(old_regs);
}

@ipinr 번호에 해당하는 IPI를 발생시킨다.

코드 라인 4에서 per-cpu 전역 변수인 __irq_regs에 @regs를 기록하고, 그 전 값은 이 루틴이 끝날때까지 old_regs 임시 변수에 저장한다.
코드 라인 6~9에서 @ipinr가 NR_IPI 범위내에 있는 경우 IPI 시작에 대한 trace 출력을 하고 해당 @ipinr에 해당하는 ipi 카운터를 증가시킨다.
코드 라인 11~14에서 IPI_RESCHEDULE 요청을 받은 경우 현재 프로세서에 대해 리스케쥴한다.
코드 라인 16~20에서 IPI_CALL_FUNC 요청을 받은 경우 미리 등록된 call function 함수들을 호출한다.
코드 라인 22~26에서 IPI_CPU_STOP 요청을 받은 경우 현재 cpu가 부팅 중 또는 동작 중인 경우 “CPU%u: stopping” 메시지 출력 및 스택 덤프를 하고 해당 cpu의 irq, fiq를 모두 정지 시키고 offline 상태로 바꾼 후 정지(spin)한다.
코드 라인 28~35에서 IPI_CPU_CRASH_STOP 요청을 받은 경우 CONFIG_KEXEC_CORE 커널 옵션이 있는 경우에 한해 crash 처리를 한 후 cpu를 offline 상태로 바꾸고 정지(spin)시킨다.
코드 라인 38~42에서 IPI_TIMER 요청을 받은 경우 tick 디바이스에 등록된 브로드 캐스트 이벤트 디바이스의 핸들러 함수를 호출한다.
코드 라인 46~50에서 IPI_IRQ_WORK 요청을 받은 경우 현재의 모든 irq 작업들을 즉시 수행하게 한다.
코드 라인 54~58에서 IPI_WAKEUP 요청을 받은 경우 아무것도 처리하지 않는다.
- 이미 깨어나서 돌고 있고 아울러 트래킹을 위해 이미 해당 통계 카운터도 증가시켰다.
- 호출 방법 1: arch_send_wakeup_ipi_mask(cpumask) 함수를 사용하여 cpumask에 해당하는 각 cpu들에 대해 WFI에 의해 잠들어 있는 경우 프로세서를 깨울 수 있다.
- 호출 방법 2: 빅/리틀 hot plug cpu 시스템에서 gic_raise_softirq() 함수를 사용하여 WFI에 의해 대기하고 있는 cpu들을 깨운다. 이 때 인수로 1대신 0을 사용하여 호출하면 불필요한 printk 메시지(“CPU%u: Unknown IPI message 0x00000001”)를 출력하지 않고 트래킹을 위해 해당 카운터 수도 추적할 수 있다.
  - PM으로 전력 기능까지 콘트롤하는 GIC(Generic Interrupt Controller) #0을 사용하는 방법으로 지정한 cpu를 wakeup 시킨다.
- 참고: ARM: 7536/1: smp: Formalize an IPI for wakeup
코드 라인 61~64 정의 되지 않은 @ipinr 번호의 IPI를 요청했을 때 경고 메시지를 출력한다.
코드 라인 66~67에서 @ ipinr이 범위내에 있는 경우 IPI 종료에 대한 trace 출력을 한다.
코드 라인 68에서 백업해 두었던 old_regs를 per-cpu 전역 변수인 __irq_regs에 복원한다.

handle_IPI() – ARM32

arch/arm/kernel/smp.c

void handle_IPI(int ipinr, struct pt_regs *regs)
{
        unsigned int cpu = smp_processor_id();
        struct pt_regs *old_regs = set_irq_regs(regs);

        if ((unsigned)ipinr < NR_IPI) {
                trace_ipi_entry(ipi_types[ipinr]);
                __inc_irq_stat(cpu, ipi_irqs[ipinr]);
        }

        switch (ipinr) {
        case IPI_WAKEUP:
                break;

#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
        case IPI_TIMER:
                irq_enter();
                tick_receive_broadcast();
                irq_exit();
                break;
#endif

        case IPI_RESCHEDULE:
                scheduler_ipi();
                break;

        case IPI_CALL_FUNC:
                irq_enter();
                generic_smp_call_function_interrupt();
                irq_exit();
                break;

        case IPI_CPU_STOP:
                irq_enter();
                ipi_cpu_stop(cpu);
                irq_exit();
                break;

#ifdef CONFIG_IRQ_WORK
        case IPI_IRQ_WORK:
                irq_enter();
                irq_work_run();
                irq_exit();
                break;
#endif

        case IPI_COMPLETION:
                irq_enter();
                ipi_complete(cpu);
                irq_exit();
                break;

        case IPI_CPU_BACKTRACE:
                printk_nmi_enter();
                irq_enter();
                nmi_cpu_backtrace(regs);
                irq_exit();
                printk_nmi_exit();
                break;
        default:
                pr_crit("CPU%u: Unknown IPI message 0x%x\n",
                        cpu, ipinr);
                break;
        }

        if ((unsigned)ipinr < NR_IPI)
                trace_ipi_exit(ipi_types[ipinr]);
        set_irq_regs(old_regs);
}

@ipinr 번호에 해당하는 IPI를 발생시킨다.

코드 라인 4에서 per-cpu 전역 변수인 __irq_regs에 @regs를 기록하고, 그 전 값은 이 루틴이 끝날때까지 old_regs 임시 변수에 저장한다.
코드 라인 6~9에서 @ipinr가 NR_IPI 범위내에 있는 경우 IPI 시작에 대한 trace 출력을 하고 해당 @ipinr에 해당하는 ipi 카운터를 증가시킨다.
코드 라인 11~13에서 IPI_WAKEUP 요청을 받은 경우 아무것도 처리하지 않는다.
- 이미 깨어나서 돌고 있고 아울러 트래킹을 위해 이미 해당 통계 카운터도 증가시켰다.
- 호출 방법 1: arch_send_wakeup_ipi_mask(cpumask) 함수를 사용하여 cpumask에 해당하는 각 cpu들에 대해 WFI에 의해 잠들어 있는 경우 프로세서를 깨울 수 있다.
- 호출 방법 2: 빅/리틀 hot plug cpu 시스템에서 gic_raise_softirq() 함수를 사용하여 WFI에 의해 대기하고 있는 cpu들을 깨운다. 이 때 인수로 1대신 0을 사용하여 호출하면 불필요한 printk 메시지(“CPU%u: Unknown IPI message 0x00000001”)를 출력하지 않고 트래킹을 위해 해당 카운터 수도 추적할 수 있다.
  - PM으로 전력 기능까지 콘트롤하는 GIC(Generic Interrupt Controller) #0을 사용하는 방법으로 지정한 cpu를 wakeup 시킨다.
- 참고: ARM: 7536/1: smp: Formalize an IPI for wakeup
코드 라인 16~20에서 IPI_TIMER 요청을 받은 경우 tick 디바이스에 등록된 브로드 캐스트 이벤트 디바이스의 핸들러 함수를 호출한다.
코드 라인 23~25에서 IPI_RESCHEDULE 요청을 받은 경우 현재 프로세서에 대해 리스케쥴한다.
코드 라인 27~31에서 IPI_CALL_FUNC 요청을 받은 경우 미리 등록된 call function 함수들을 호출한다.
코드 라인 33~37에서 IPI_CPU_STOP 요청을 받은 경우 현재 cpu가 부팅 중 또는 동작 중인 경우 “CPU%u: stopping” 메시지 출력 및 스택 덤프를 하고 해당 cpu의 irq, fiq를 모두 정지 시키고 offline 상태로 바꾼 후 정지(spin)한다.
코드 라인 40~44에서 IPI_IRQ_WORK 요청을 받은 경우 현재의 모든 irq 작업들을 즉시 수행하게 한다.
코드 라인 47~51에서 IPI_COMPLETION 요청을 받은 경우 register_ipi_completion() 함수로 등록해 놓은 per-cpu cpu_completion 에서 wait_for_completion()등으로 대기하고 있는 스레드를 깨운다.
코드 라인 53~59에서 IPI_CPU_BACKTRACE 요청을 받은 경우 nmi backtrace를 지원하는 cpu들의 backtrace를 출력 한다.
코드 라인 60~64 정의 되지 않은 @ipinr 번호의 IPI를 요청했을 때 경고 메시지를 출력한다.
코드 라인 66~67에서 @ ipinr이 범위내에 있는 경우 IPI 종료에 대한 trace 출력을 한다.
코드 라인 68에서 백업해 두었던 old_regs를 per-cpu 전역 변수인 __irq_regs에 복원한다.

IPI_TIMER 처리

tick_receive_broadcast()

kernel/time/tick-broadcast.c

#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
int tick_receive_broadcast(void)
{
        struct tick_device *td = this_cpu_ptr(&tick_cpu_device);
        struct clock_event_device *evt = td->evtdev;

        if (!evt)
                return -ENODEV;

        if (!evt->event_handler)
                return -EINVAL;

        evt->event_handler(evt);
        return 0;
}
#endif

tick 디바이스에 등록된 이벤트 디바이스의 핸들러 함수를 호출한다.

CONFIG_GENERIC_CLOCKEVENTS_BROADCAST 커널 옵션을 사용하는 경우에만 동작한다.
호출 방법: tick_do_broadcast() 함수를 사용하여 해당 cpu들에 tick을 전달한다.
참고: clockevents: Add generic timer broadcast receiver

코드 라인 4~8에서 tick 디바이스에 등록된 clock 이벤트 디바이스를 알아온다. 없으면 -ENODEV 에러를 반환한다.
코드 라인 10~11에서 clock 이벤트 디바이스에 이벤트 핸들러 함수가 등록되어 있지 않으면 -EINVAL 에러를 반환한다.
코드 라인 13에서 등록된 이벤트 핸들러 함수를 호출한다.
코드 라인 14에서 성공 값 0을 반환한다.

IPI_RESCHEDULE 처리

scheduler_ipi()

kernel/sched/core.c

void scheduler_ipi(void)
{
        /*            
         * Fold TIF_NEED_RESCHED into the preempt_count; anybody setting
         * TIF_NEED_RESCHED remotely (for the first time) will also send
         * this IPI.
         */
        preempt_fold_need_resched();

        if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
                return;

        /*
         * Not all reschedule IPI handlers call irq_enter/irq_exit, since
         * traditionally all their work was done from the interrupt return
         * path. Now that we actually do some work, we need to make sure
         * we do call them.
         *
         * Some archs already do call them, luckily irq_enter/exit nest
         * properly.
         *
         * Arguably we should visit all archs and update all handlers,
         * however a fair share of IPIs are still resched only so this would
         * somewhat pessimize the simple resched case.
         */
        irq_enter();
        sched_ttwu_pending();

        /*
         * Check if someone kicked us for doing the nohz idle load balance.
         */
        if (unlikely(got_nohz_idle_kick())) {
                this_rq()->idle_balance = 1;
                raise_softirq_irqoff(SCHED_SOFTIRQ);
        }
        irq_exit();
}

현재 프로세서에 대해 리스케쥴한다.

호출 방법: smp_send_reschedule() 명령을 사용하여 해당 cpu에서 리스케쥴링하도록 요청한다.

코드 라인 8에서 현재 스레드에 리스케쥴 요청이 있는 경우 preempt count에 있는 리스쥴 요청중 비트를 제거한다.
- 현재까지 x86에서만 구현되어 있고 arm을 포함한 다른 아키텍처에서는 동작하지 않는다.
- 참고: sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding
코드 라인 10~11애서 런큐의 wake_list가 비어 있으면서 현재 cpu의 런큐에 NOHZ_BALANCE_KICK 요청이 없거나, 현재 cpu가 idle 상태가 아니거나 리스케쥴 요청이 있는 경우 리스케쥴할 태스크가 없어서 함수를 빠져나간다.
코드 라인 26에서 hard irq preecmption 카운터를 증가시켜 preemption을 막고 irq 소요 시간 및 latency를 측정할 수 있도록 한다.
코드 라인 27에서 런큐의 wake_list에서 모두 꺼내서 다시 enqueue 하여 리스케쥴 한다.
코드 라인 32~35에서 낮은 확률로 현재 cpu의 런큐에 NOHZ_BALANCE_KICK 요청이 있고 현재 cpu가 idle 상태이면서 리스케쥴 요청이 없는 경우 런큐의 idle_balance 를 1로 설정하고 SCHED_SOFTIRQ 를 정지시킨다.
코드 라인 36에서 hard irq preecmption 카운터를 감소시켜 preemption을 다시 열어주고 irq 소요 시간 및 latency를 측정할 수 있도록 후처리 작업을 수행한다.

IPI_CALL_FUNC 처리

generic_smp_call_function_interrupt()

include/linux/smp.h

#define generic_smp_call_function_interrupt \
        generic_smp_call_function_single_interrupt

IPI에 의해 인터럽트 된 후 미리 등록된 함수를 호출한다.

호출 방법: arch_send_call_function_ipi_mask() 명령을 사용하여 해당 cpu들에서 미리 등록된 함수들을 호출한다.

generic_smp_call_function_single_interrupt()

kernel/smp.c

/**
 * generic_smp_call_function_single_interrupt - Execute SMP IPI callbacks
 *
 * Invoked by arch to handle an IPI for call function single.
 * Must be called with interrupts disabled.
 */

void generic_smp_call_function_single_interrupt(void)
{
        flush_smp_call_function_queue(true);
}

IPI에 의해 인터럽트 된 후 미리 등록된 함수를 호출한다.

호출 방법: arch_send_call_function_single_ipi() 명령을 사용하여 요청 cpu에서 미리 등록된 함수를 호출한다.

flush_smp_call_function_queue()

kernel/smp.c

/**
 * flush_smp_call_function_queue - Flush pending smp-call-function callbacks
 *
 * @warn_cpu_offline: If set to 'true', warn if callbacks were queued on an
 *                    offline CPU. Skip this check if set to 'false'.
 *
 * Flush any pending smp-call-function callbacks queued on this CPU. This is
 * invoked by the generic IPI handler, as well as by a CPU about to go offline,
 * to ensure that all pending IPI callbacks are run before it goes completely
 * offline.
 *
 * Loop through the call_single_queue and run all the queued callbacks.
 * Must be called with interrupts disabled.
 */

static void flush_smp_call_function_queue(bool warn_cpu_offline)
{
        struct llist_head *head;
        struct llist_node *entry;
        call_single_data_t *csd, *csd_next;
        static bool warned;

        lockdep_assert_irqs_disabled();

        head = this_cpu_ptr(&call_single_queue);
        entry = llist_del_all(head);
        entry = llist_reverse_order(entry);

        /* There shouldn't be any pending callbacks on an offline CPU. */
        if (unlikely(warn_cpu_offline && !cpu_online(smp_processor_id()) &&
                     !warned && !llist_empty(head))) {
                warned = true; 
                WARN(1, "IPI on offline CPU %d\n", smp_processor_id());

                /*
                 * We don't have to use the _safe() variant here
                 * because we are not invoking the IPI handlers yet.
                 */
                llist_for_each_entry(csd, entry, llist)
                        pr_warn("IPI callback %pS sent to offline CPU\n",
                                csd->func);
        }

        llist_for_each_entry_safe(csd, csd_next, entry, llist) {
                csd->func(csd->info);
                csd_unlock(csd);
        }

        /*
         * Handle irq works queued remotely by irq_work_queue_on().
         * Smp functions above are typically synchronous so they
         * better run first since some other CPUs may be busy waiting
         * for them.
         */
        irq_work_run();
}

call_single_queue에 있는 모든 call function들을 한꺼번에 처리하고 비운다. 또한 남은 irq work도 모두 처리하여 비운다.

코드 라인 10~12에서 per-cpu call_single_queue에 등록된 엔트리들을 모두 제거하고 entry로 가져오는데 가장 처음에 추가한 call_single_data 엔트리가 앞으로 가도록 순서를 거꾸로 바꾼다.
코드 라인 15~27에서 인수 warn_cpu_offline가 설정된 경우 현재 cpu가 offline된 cpu인 경우 한 번만 “IPI on offline CPU %d” 및 “IPI callback %pS sent to offline CPU”라는 경고 메시지를 출력하게 한다.
코드 라인 29~32에서 등록되어 있는 함수들을 모두 호출하여 수행한다.
코드 라인 40에서 현재의 모든 irq 작업들을 즉시 수행하게 한다.

IPI_CPU_STOP 처리

local_cpu_stop() – ARM64

arch/arm64/kernel/smp.c

static void local_cpu_stop(void)
{
        set_cpu_online(smp_processor_id(), false);

        local_daif_mask();
        sdei_mask_local_cpu();
        cpu_park_loop();
}

로컬 cpu를 파킹(stop)한다.

코드 라인 3에서 로컬 cpu를 offline 상태로 변경한다.
코드 라인 5에서 로컬 cpu에 Exception들이 진입하지 않도록 모두 마스크한다.
코드 라인 6에서 SDEI(Software Delegated Exception Interface)를 통해 로컬 cpu로의 인터럽트를 mask한다.
코드 라인 7에서 cpu를 파킹한다.

ipi_cpu_stop() – ARM32

arch/arm/kernel/smp.c

/*
 * ipi_cpu_stop - handle IPI from smp_send_stop()
 */

static void ipi_cpu_stop(unsigned int cpu)
{
        if (system_state == SYSTEM_BOOTING ||
            system_state == SYSTEM_RUNNING) {
                raw_spin_lock(&stop_lock);
                pr_crit("CPU%u: stopping\n", cpu);
                dump_stack();
                raw_spin_unlock(&stop_lock);
        }

        set_cpu_online(cpu, false);

        local_fiq_disable();
        local_irq_disable();

        while (1)
                cpu_relax();
}

현재 cpu가 부팅 중 또는 동작 중인 경우 “CPU%u: stopping” 메시지 출력 및 스택 덤프를 하고 해당 cpu의 irq, fiq를 모두 정지 시키고 offline 상태로 바꾼 후 정지(spin)한다.

호출 방법: smp_send_stop() 함수를 사용하여 현재 cpu를 제외한 online cpu를 stop 시킨다.

IPI_IRQ_WORKIRQ 처리

irq_work_run()

kernel/irq_work.c

/*
 * hotplug calls this through:
 *  hotplug_cfd() -> flush_smp_call_function_queue()
 */

void irq_work_run(void)
{
        irq_work_run_list(this_cpu_ptr(&raised_list));
        irq_work_run_list(this_cpu_ptr(&lazy_list));
}       
EXPORT_SYMBOL_GPL(irq_work_run);

현재의 모든 irq 작업들을 즉시 수행하게 한다.

CONFIG_IRQ_WORK 커널 옵션을 사용하는 경우에만 동작한다.
호출 방법: arch_irq_work_raise() 요청을 받은 경우 현재 자신의 cpu에서 모든 irq 작업들을 즉시 수행하게 한다.
참고: ARM: 7872/1: Support arch_irq_work_raise() via self IPIs

코드 라인 3에서 &raised_list에 있는 모든 irq 작업들을 수행하게 한다.
코드 라인 4에서 &lazy_list에 있는 모든 irq 작업들을 수행하게 한다.

IPI_COMPLETION 처리- ARM32

ipi_complete() – ARM32

arch/arm/kernel/smp.c

static void ipi_complete(unsigned int cpu)
{
        complete(per_cpu(cpu_completion, cpu));
}

register_ipi_completion() 함수로 등록해 놓은 per-cpu cpu_completion 에서 wait_for_completion()등으로 대기하고 있는 스레드를 깨운다.

CPU가 커널 환경에 들어가기 전에 살아 있는지, 즉 초기 어셈블리 코드에서 신호를 보내도록하는 메커니즘이 필요해서 준비를 하였다.
참고: ARM: SMP: basic IPI triggered completion support

IPI_CPU_BACKTRACE 처리 – ARM32

nmi_cpu_backtrace()

lib/nmi_backtrace.c

bool nmi_cpu_backtrace(struct pt_regs *regs)
{
        int cpu = smp_processor_id();

        if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {
                if (regs && cpu_in_idle(instruction_pointer(regs))) {
                        pr_warn("NMI backtrace for cpu %d skipped: idling at %pS\n",
                                cpu, (void *)instruction_pointer(regs));
                } else {
                        pr_warn("NMI backtrace for cpu %d\n", cpu);
                        if (regs)
                                show_regs(regs);
                        else
                                dump_stack();
                }
                cpumask_clear_cpu(cpu, to_cpumask(backtrace_mask));
                return true;
        }

        return false;
}
NOKPROBE_SYMBOL(nmi_cpu_backtrace);

현재 cpu에 대한 레지스터 및 backtrace 로그를 출력한다.

구조체

call_function_data 구조체

kernel/smp.c

struct call_function_data {
        call_single_data_t      __percpu *csd;
        cpumask_var_t           cpumask;
        cpumask_var_t           cpumask_ipi;
};

call function IPI 호출 시 사용할 cfd로 function이 포함된 csd와 타겟 cpu들을 지정한 cpumask들이 담겨있다.

call_single_data_t 타입

include/linux/smp.h

/* Use __aligned() to avoid to use 2 cache lines for 1 csd */
typedef struct __call_single_data call_single_data_t
        __aligned(sizeof(struct __call_single_data));

include/linux/smp.h

struct __call_single_data {
        struct llist_node llist;
        smp_call_func_t func;
        void *info;
        unsigned int flags;
};

call function IPI 호출 시 사용할 csd로 function 정보가 담겨있다.

llist
- csd 리스트
func
- 수행할 함수
*info
- 함수에 인수로 전달할 정보
flags
- 아래 플래그 비트를 저장한다.

kernel/smp.c

enum {
        CSD_FLAG_LOCK           = 0x01,
        CSD_FLAG_WAIT           = 0x02,
};

SMP 시스템에서 call single data를 위한 cpu의 처리 상태

참고

Interrupts -1- (Interrupt Controller) | 문c
Interrupts -2- (irq chip) | 문c
Interrupts -3- (irq domain) | 문c
Interrupts -4- (Top-Half & Bottom-Half) | 문c
Interrupts -5- (Softirq) | 문c
Interrupts -6- (IPI Cross-call) | 문c – 현재 글
Interrupts -7- (Workqueue 1) | 문c
Interrupts -8- (Workqueue 2) | 문c
Interrupts -9- (GIC v3 Driver) | 문c
Interrupts -10- (irq partition) | 문c
Interrupts -11- (RPI2 IC Driver) | 문c
Interrupts -12- (irq desc) | 문c

Software Delegated Exception Interface (SDEI) | ARM – 다운로드 pdf

sort_main_extable()

2016-05-032020-03-20 문영일 Leave a comment

통합: Exception -5- (Extable) | 문c

Exception -5- (Extable)

2016-05-032020-07-15 문영일 Leave a comment

Extable

Exception Table(extable)에는 프로세스 주소 공간(process address space)을 access하는 커널 API를 사용한 코드의 주소와 해당 API의 예외 처리 코드의 주소가 함께 담겨 있다.

커널이 프로세스용으로 할당된 페이지를 읽다 에러가 발생한 경우 페이지 테이블에 아직 로드(lazy load)가 되어 있지 않아 추가적으로 로드를 해야 하는 경우도 있고 그렇지 않고 정말 access 할 수 없는 공간으로 access를 시도하다 발생한 exception인 경우도 있다. 따라서 이에 대한 처리를 하기 위한 코드들이 호출될 수 있도록 설계되었다.

프로세스가 특정 주소를 access 하다가 exception이 발생하는 여러 fault(ARM32에서는 fsr_info[], ARM64에서는 fault_info[])별 처리 핸들러 함수를 실행시킬 때 extable을 검사하여 에러 진원지와 동일한 엔트리를 찾은 경우 해당 fixup 코드를 실행시킨다.

다음은 fsr_info[]에 연결되어 있는 처리함수들이다.
- do_translation_fault()
  - “section translation fault”
- do_page_fault()
  - “page translation fault”
  - “page permission fault”
- do_sect_fault()
  - “section permission fault”

다음 함수들에서 .fixup 섹션과 __ex_table 섹션을 사용한다.

get_user(), strlen_user(), strnlen_user()
- __get_user_asm()
put_user(), clear_user()
- __put_user_asm()
이 외에 영역 검사 및 futex 등 몇 개 더 있다.

초기화 (정렬)

Main Exception Table이 정렬되어 있지 않은 경우 정렬한다.

Exception Table은 다음 두 가지가 있다.
- Main Exception Table
- 각 Driver에서 사용하는 Exception Table
컴파일 타임에 툴에 의해 Exception table을 정렬하고, 툴에 의해 main_extable_sort_needed 변수 값까지 1로 바꿔주면 커널 부트업 처리 시 정렬을 할 필요가 없다.
- CONFIG_BUILDTIME_EXTABLE_SORT 커널 옵션을 사용하여 빌드 타임에 Exception table을 정렬하게 한다.

sort_main_extable()

kernel/extable.c

/* Sort the kernel's built-in exception table */
void __init sort_main_extable(void)
{
        if (main_extable_sort_needed && __stop___ex_table > __start___ex_table) {
                pr_notice("Sorting __ex_table...\n");
                sort_extable(__start___ex_table, __stop___ex_table);
        }      
}

Main 커널에서 사용하는 Exception table을 정렬되어 있지 않은 경우 정렬한다.

kernel/extable.c

/* Cleared by build time tools if the table is already sorted. */
u32 __initdata __visible main_extable_sort_needed = 1;

sort_extable()

kernel/extable.c

void sort_extable(struct exception_table_entry *start,
                  struct exception_table_entry *finish)
{       
        sort(start, finish - start, sizeof(struct exception_table_entry),
             cmp_ex_sort, swap_ex);
}

Exception table의 시작 부터 끝 까지 정렬한다.

swap_ex 매크로는 ARCH_HAS_RELATIVE_EXTABLE이 사용되는 x86, powerpc, s390, arm64 아키텍처 등에서만 generic한 swap_ex() 함수를 사용하고, 그렇지 않은 경우 NULL로 치환된다.

cmp_ex_sort() – generic

lib/extable.c

/*      
 * The exception table needs to be sorted so that the binary
 * search that we use to find entries in it works properly.
 * This is used both for the kernel exception table and for
 * the exception tables of modules that get loaded.
 */

static int cmp_ex(const void *a, const void *b)
{       
        const struct exception_table_entry *x = a, *y = b;
        
        /* avoid overflow */
        if (ex_to_insn(x) > ex_to_insn(y))
                return 1;
        if (ex_to_insn(x) < ex_to_insn(y))
                return -1;
        return 0;
}

a와 b를 비교한다.

sort_extable() 함수가 ARCH_HAS_SORT_EXTABLE 가 사용되지 않는 아키텍처에서 사용할 수 있는 generic 루틴이다.
- sparc 아키텍처만 ARCH_HAS_SORT_EXTABLE을 지원하고, 나머지는 위의 generic 함수를 사용한다.

swap_ex() – generic

lib/extable.c

static void swap_ex(void *a, void *b, int size)
{
        struct exception_table_entry *x = a, *y = b, tmp;
        int delta = b - a;

        tmp = *x;
        x->insn = y->insn + delta;
        y->insn = tmp.insn - delta;

#ifdef swap_ex_entry_fixup
        swap_ex_entry_fixup(x, y, tmp, delta);
#else
        x->fixup = y->fixup + delta;
        y->fixup = tmp.fixup - delta;
#endif
}

a와 b를 치환한다.

__ex_table – ARM

arch/arm/kernel/vmlinux.lds.S

#ifdef CONFIG_DEBUG_RODATA
        . = ALIGN(1<<SECTION_SHIFT);
#endif
        RO_DATA(PAGE_SIZE)

        . = ALIGN(4);
        __ex_table : AT(ADDR(__ex_table) - LOAD_OFFSET) {
                __start___ex_table = .;
#ifdef CONFIG_MMU
                *(__ex_table)
#endif
                __stop___ex_table = .;
        }

__ex_table은 .rodata 섹션 뒤에 위치한다.

__ex_table – ARM64

arch/arm64/kernel/vmlinux.lds.S

        .text : {                       /* Real text segment            */
                _stext = .;             /* Text and read-only data      */
                        __exception_text_start = .;
                        *(.exception.text)
                        __exception_text_end = .;
                        IRQENTRY_TEXT
                        SOFTIRQENTRY_TEXT
                        ENTRY_TEXT
                        TEXT_TEXT
                        SCHED_TEXT
                        CPUIDLE_TEXT
                        LOCK_TEXT
                        KPROBES_TEXT
                        HYPERVISOR_TEXT
                        IDMAP_TEXT
                        HIBERNATE_TEXT
                        TRAMP_TEXT
                        *(.fixup)
                        *(.gnu.warning)
                . = ALIGN(16);
                *(.got)                 /* Global offset table          */
        }

        . = ALIGN(SEGMENT_ALIGN);
        _etext = .;                     /* End of text section */

        RO_DATA(PAGE_SIZE)              /* everything from this point to     */
        EXCEPTION_TABLE(8)              /* __init_begin will be marked RO NX */
        NOTES

__ex_table은 EXCEPTION_TABLE() 매크로로 정의되고, .rodata 섹션 뒤에 위치한다.

include/asm-generic/vmlinux.lds.h

/*
 * Exception table
 */

define EXCEPTION_TABLE(align)                                          \
        . = ALIGN(align);                                               \
        __ex_table : AT(ADDR(__ex_table) - LOAD_OFFSET) {               \
                __start___ex_table = .;                                 \
                KEEP(*(__ex_table))                                     \
                __stop___ex_table = .;                                  \
        }

Exception 처리

다음 그림은 커널 코드가 process space에 있는 한 페이지에 접근하다 exception이 발생한 경우를 보여준다.

페이지를 access할 때 exception이 발생하는 경우 Data Abort 또는 Prefetch Abort 등의 exception이 발생하고 해당 exception vector로 jump를 하여 관련 함수를 수행한다.

search_exception_tables()

kernel/extable.c

/* Given an address, look for it in the exception tables. */
const struct exception_table_entry *search_exception_tables(unsigned long addr)
{
        const struct exception_table_entry *e;

        e = search_kernel_exception_table(addr);
        if (!e)
                e = search_module_extables(addr);
        return e;
}

Main exception table과 전체 모듈에 있는 exception table의 insn 가상주소가 인수 addr와 같은 경우 해당 엔트리를 반환하고 검색되지 않는 경우 null을 리턴한다.

search_kernel_exception_table()

lib/extable.c

/* Given an address, look for it in the kernel exception table */
const
struct exception_table_entry *search_kernel_exception_table(unsigned long addr)
{
        return search_extable(__start___ex_table,
                              __stop___ex_table - __start___ex_table, addr);
}

메인 커널에 위치한 exception table을 대상으로 가상 주소 @addr가 검색되는 경우 해당 엔트리를 반환한다. 그 외의 경우 null을 반환한다.

search_extable() – generic

lib/extable.c

/*
 * Search one exception table for an entry corresponding to the
 * given instruction address, and return the address of the entry,
 * or NULL if none is found.
 * We use a binary search, and thus we assume that the table is
 * already sorted.
 */

const struct exception_table_entry *
search_extable(const struct exception_table_entry *base,
               const size_t num,
               unsigned long value)
{
        return bsearch(&value, base, num,
                       sizeof(struct exception_table_entry), cmp_ex_search);
}

@base에 위치한 exception 테이블에서 가상 주소 @value를 검색하되 최대 @num 엔트리 수만큼 검색한다.

ARCH_HAS_SEARCH_EXTABLE을 사용하는 sparc 등의 아키텍처를 제외한 아키텍처들은 위의 generic 함수를 사용한다.

bsearch()

lib/bsearch.c

/*
 * bsearch - binary search an array of elements
 * @key: pointer to item being searched for
 * @base: pointer to first element to search
 * @num: number of elements
 * @size: size of each element
 * @cmp: pointer to comparison function
 *
 * This function does a binary search on the given array.  The
 * contents of the array should already be in ascending sorted order
 * under the provided comparison function.
 *
 * Note that the key need not have the same type as the elements in
 * the array, e.g. key could be a string and the comparison function
 * could compare the string with the struct's name field.  However, if
 * the key and elements in the array are of the same type, you can use
 * the same comparison function for both sort() and bsearch().
 */

void *bsearch(const void *key, const void *base, size_t num, size_t size,
              int (*cmp)(const void *key, const void *elt))
{
        const char *pivot;
        int result;

        while (num > 0) {
                pivot = base + (num >> 1) * size;
                result = cmp(key, pivot);

                if (result == 0)
                        return (void *)pivot;

                if (result > 0) {
                        base = pivot + size;
                        num--;
                }
                num >>= 1;
        }

        return NULL;
}
EXPORT_SYMBOL(bsearch);
NOKPROBE_SYMBOL(bsearch);

바이너리 검색을 수행한다.

search_module_extables()

kernel/module.c

/* Given an address, look for it in the module exception tables. */
const struct exception_table_entry *search_module_extables(unsigned long addr)
{
        const struct exception_table_entry *e = NULL;
        struct module *mod;

        preempt_disable();
        mod = __module_address(addr);
        if (!mod)
                goto out;

        if (!mod->num_exentries)
                goto out;

        e = search_extable(mod->extable,
                           mod->num_exentries,
                           addr);
out:
        preempt_enable();

        /*
         * Now, if we found one, we are running inside it now, hence
         * we cannot unload the module, hence no refcnt needed.
         */
        return e;
}

전체 모듈을 루프를 돌며 각각의 모듈에 있는 exception table의 insn 가상주소가 인수 addr와 같은 경우 해당 엔트리를 반환하고 검색되지 않는 경우 null을 리턴한다.

구조체 및 전역변수

exception_table_entry 구조체 – generic

include/asm-generic/extable.h

/*
 * The exception table consists of pairs of addresses: the first is the
 * address of an instruction that is allowed to fault, and the second is
 * the address at which the program should continue.  No registers are
 * modified, so it is entirely up to the continuation code to figure out
 * what to do.
 *
 * All the routines below use bits of fixup code that are out of line
 * with the main instruction path.  This means when everything is well,
 * we don't even have to jump over them.  Further, they do not intrude
 * on our cache or tlb entries.
 */

struct exception_table_entry
{
        unsigned long insn, fixup;
};

insn
- process(user) space에 접근하는 커널 API의 가상 주소
fixup
- 해당 API에 대한 예외 처리 코드를 가리키는 가상 주소

exception_table_entry 구조체 – ARM64

arch/arm64/include/asm/extable.h

struct exception_table_entry
{
        int insn, fixup;
};

ARM64의 경우 generic 코드와 다르게 unsigned long 대신 int를 사용하는 것을 알 수 있다.

fsr_info[] – ARM32

arch/arm/mm/fsr-2level.c

static struct fsr_info fsr_info[] = {
        /*
         * The following are the standard ARMv3 and ARMv4 aborts.  ARMv5
         * defines these to be "precise" aborts.
         */
        { do_bad,               SIGSEGV, 0,             "vector exception"                 },
        { do_bad,               SIGBUS,  BUS_ADRALN,    "alignment exception"              },
        { do_bad,               SIGKILL, 0,             "terminal exception"               },
        { do_bad,               SIGBUS,  BUS_ADRALN,    "alignment exception"              },
        { do_bad,               SIGBUS,  0,             "external abort on linefetch"      },
        { do_translation_fault, SIGSEGV, SEGV_MAPERR,   "section translation fault"        },
        { do_bad,               SIGBUS,  0,             "external abort on linefetch"      },
        { do_page_fault,        SIGSEGV, SEGV_MAPERR,   "page translation fault"           },
        { do_bad,               SIGBUS,  0,             "external abort on non-linefetch"  },
        { do_bad,               SIGSEGV, SEGV_ACCERR,   "section domain fault"             },
        { do_bad,               SIGBUS,  0,             "external abort on non-linefetch"  },
        { do_bad,               SIGSEGV, SEGV_ACCERR,   "page domain fault"                },
        { do_bad,               SIGBUS,  0,             "external abort on translation"    },
        { do_sect_fault,        SIGSEGV, SEGV_ACCERR,   "section permission fault"         },
        { do_bad,               SIGBUS,  0,             "external abort on translation"    },
        { do_page_fault,        SIGSEGV, SEGV_ACCERR,   "page permission fault"            },
        /*
         * The following are "imprecise" aborts, which are signalled by bit
         * 10 of the FSR, and may not be recoverable.  These are only
         * supported if the CPU abort handler supports bit 10.
         */
        { do_bad,               SIGBUS,  0,             "unknown 16"                       },
        { do_bad,               SIGBUS,  0,             "unknown 17"                       },
        { do_bad,               SIGBUS,  0,             "unknown 18"                       },
        { do_bad,               SIGBUS,  0,             "unknown 19"                       },
        { do_bad,               SIGBUS,  0,             "lock abort"                       }, /* xscale */
        { do_bad,               SIGBUS,  0,             "unknown 21"                       },
        { do_bad,               SIGBUS,  BUS_OBJERR,    "imprecise external abort"         }, /* xscale */              
        { do_bad,               SIGBUS,  0,             "unknown 23"                       },
        { do_bad,               SIGBUS,  0,             "dcache parity error"              }, /* xscale */              
        { do_bad,               SIGBUS,  0,             "unknown 25"                       },
        { do_bad,               SIGBUS,  0,             "unknown 26"                       },
        { do_bad,               SIGBUS,  0,             "unknown 27"                       },
        { do_bad,               SIGBUS,  0,             "unknown 28"                       },
        { do_bad,               SIGBUS,  0,             "unknown 29"                       },
        { do_bad,               SIGBUS,  0,             "unknown 30"                       },
        { do_bad,               SIGBUS,  0,             "unknown 31"                       },
};

exception이 발생했을 때 fault가 32개로 구분되며 해당 fault 핸들러 함수가 등록된다.

3레벨 페이지 테이블을 사용하는 경우 arch/arm/mm/fsr-3level.c 파일을 참고한다.

fault_info[] – ARM64

arch/arm64/mm/fault.c

static const struct fault_info fault_info[] = {
        { do_bad,               SIGKILL, SI_KERNEL,     "ttbr address size fault"       },
        { do_bad,               SIGKILL, SI_KERNEL,     "level 1 address size fault"    },
        { do_bad,               SIGKILL, SI_KERNEL,     "level 2 address size fault"    },
        { do_bad,               SIGKILL, SI_KERNEL,     "level 3 address size fault"    },
        { do_translation_fault, SIGSEGV, SEGV_MAPERR,   "level 0 translation fault"     },
        { do_translation_fault, SIGSEGV, SEGV_MAPERR,   "level 1 translation fault"     },
        { do_translation_fault, SIGSEGV, SEGV_MAPERR,   "level 2 translation fault"     },
        { do_translation_fault, SIGSEGV, SEGV_MAPERR,   "level 3 translation fault"     },
        { do_bad,               SIGKILL, SI_KERNEL,     "unknown 8"                     },
        { do_page_fault,        SIGSEGV, SEGV_ACCERR,   "level 1 access flag fault"     },
        { do_page_fault,        SIGSEGV, SEGV_ACCERR,   "level 2 access flag fault"     },
        { do_page_fault,        SIGSEGV, SEGV_ACCERR,   "level 3 access flag fault"     },
        { do_bad,               SIGKILL, SI_KERNEL,     "unknown 12"                    },
        { do_page_fault,        SIGSEGV, SEGV_ACCERR,   "level 1 permission fault"      },
        { do_page_fault,        SIGSEGV, SEGV_ACCERR,   "level 2 permission fault"      },
        { do_page_fault,        SIGSEGV, SEGV_ACCERR,   "level 3 permission fault"      },
        { do_sea,               SIGBUS,  BUS_OBJERR,    "synchronous external abort"    },
        { do_bad,               SIGKILL, SI_KERNEL,     "unknown 17"                    },
        { do_bad,               SIGKILL, SI_KERNEL,     "unknown 18"                    },
        { do_bad,               SIGKILL, SI_KERNEL,     "unknown 19"                    },
        { do_sea,               SIGKILL, SI_KERNEL,     "level 0 (translation table walk)"      },
        { do_sea,               SIGKILL, SI_KERNEL,     "level 1 (translation table walk)"      },
        { do_sea,               SIGKILL, SI_KERNEL,     "level 2 (translation table walk)"      },
        { do_sea,               SIGKILL, SI_KERNEL,     "level 3 (translation table walk)"      },
        { do_sea,               SIGBUS,  BUS_OBJERR,    "synchronous parity or ECC error" },    // RR
eserved when RAS is implemented
        { do_bad,               SIGKILL, SI_KERNEL,     "unknown 25"                    },
        { do_bad,               SIGKILL, SI_KERNEL,     "unknown 26"                    },
        { do_bad,               SIGKILL, SI_KERNEL,     "unknown 27"                    },
        { do_sea,               SIGKILL, SI_KERNEL,     "level 0 synchronous parity error (translatii
on table walk)"     },      // Reserved when RAS is implemented
        { do_sea,               SIGKILL, SI_KERNEL,     "level 1 synchronous parity error (translatii
on table walk)"     },      // Reserved when RAS is implemented
        { do_sea,               SIGKILL, SI_KERNEL,     "level 2 synchronous parity error (translatii
on table walk)"     },      // Reserved when RAS is implemented
        { do_sea,               SIGKILL, SI_KERNEL,     "level 3 synchronous parity error (translatii
on table walk)"     },      // Reserved when RAS is implemented
        { do_bad,               SIGKILL, SI_KERNEL,     "unknown 32"                    },
        { do_alignment_fault,   SIGBUS,  BUS_ADRALN,    "alignment fault"               },
        { do_bad,               SIGKILL, SI_KERNEL,     "unknown 34"                    },
        { do_bad,               SIGKILL, SI_KERNEL,     "unknown 35"                    },
        { do_bad,               SIGKILL, SI_KERNEL,     "unknown 36"                    },
        { do_bad,               SIGKILL, SI_KERNEL,     "unknown 37"                    },
        { do_bad,               SIGKILL, SI_KERNEL,     "unknown 38"                    },
        { do_bad,               SIGKILL, SI_KERNEL,     "unknown 39"                    },
        { do_bad,               SIGKILL, SI_KERNEL,     "unknown 40"                    },
        { do_bad,               SIGKILL, SI_KERNEL,     "unknown 41"                    },
        { do_bad,               SIGKILL, SI_KERNEL,     "unknown 42"                    },
        { do_bad,               SIGKILL, SI_KERNEL,     "unknown 43"                    },
        { do_bad,               SIGKILL, SI_KERNEL,     "unknown 44"                    },
        { do_bad,               SIGKILL, SI_KERNEL,     "unknown 45"                    },
        { do_bad,               SIGKILL, SI_KERNEL,     "unknown 46"                    },
        { do_bad,               SIGKILL, SI_KERNEL,     "unknown 47"                    },
        { do_bad,               SIGKILL, SI_KERNEL,     "TLB conflict abort"            },
        { do_bad,               SIGKILL, SI_KERNEL,     "Unsupported atomic hardware update fault"
    },
        { do_bad,               SIGKILL, SI_KERNEL,     "unknown 50"                    },
        { do_bad,               SIGKILL, SI_KERNEL,     "unknown 51"                    },
        { do_bad,               SIGKILL, SI_KERNEL,     "implementation fault (lockdown abort)" },
        { do_bad,               SIGBUS,  BUS_OBJERR,    "implementation fault (unsupported exclusivee
)" },
        { do_bad,               SIGKILL, SI_KERNEL,     "unknown 54"                    },
        { do_bad,               SIGKILL, SI_KERNEL,     "unknown 55"                    },
        { do_bad,               SIGKILL, SI_KERNEL,     "unknown 56"                    },
        { do_bad,               SIGKILL, SI_KERNEL,     "unknown 57"                    },
        { do_bad,               SIGKILL, SI_KERNEL,     "unknown 58"                    },
        { do_bad,               SIGKILL, SI_KERNEL,     "unknown 59"                    },
        { do_bad,               SIGKILL, SI_KERNEL,     "unknown 60"                    },
        { do_bad,               SIGKILL, SI_KERNEL,     "section domain fault"          },
        { do_bad,               SIGKILL, SI_KERNEL,     "page domain fault"             },
        { do_bad,               SIGKILL, SI_KERNEL,     "unknown 63"                    },
};

참고

copy_from_user() | 문c

시스템 콜 | 김태훈 – pptx 다운로드

Exception -1- (ARM32 Vector)

2016-05-032020-03-20 문영일 Leave a comment

ARM32 Exception Vector

non-secure PL1에서 동작하는 리눅스 커널 위주로 벡터테이블을 설명한다.

특징

irq 처리
- irq 처리 중 fiq 처리는 가능하지만 irq 재진입(irq preemption)은 허용하지 않는다.
- ARM32 아키텍처는 irq preemption을 허용하지만 ARM 리눅스 커널이 이를 허용하지 않는 설계로 구현되어 있다.
- ARM64 아키텍처는 특수한 Pesudo-NMI를 사용하는 시스템에서만 nmi 관련 디바이스에 한해 irq preemption을 허용한다.
fiq 처리
- 리눅스 커널의 fiq 핸들러 함수인 handle_fiq_as_nmi()에 처리 코드를 추가하여 사용하여야 한다.
  - 디폴트 처리로 아무것도 하지 않는다. (nop)
  - 예) rpi2의 경우 usb 호스트 드라이버(dwc_otg)에서 사용한다.
- 일반적으로 secure 펌웨어에서 fiq를 처리하고, 리눅스 커널에서는 fiq를 처리하지 않는다.

모드별 Exception 벡터 테이블

다음 그림은 각 모드별 ARM Exception Vector 테이블을 보여준다.

붉은 박스가 하이퍼 모드를 사용하지 않을 때의 리눅스 커널이 동작하는 벡터 테이블을 보여준다.

Exception 벡터 테이블 주소

벡터 테이블의 주소는 arm 모드 및 Security Extension 사용 유무에 따라 각각 다르다.

Security Extension 미사용
- low 벡터 주소
  - SCTLR.V=0
  - 0x0000_0000
- high 벡터 주소
  - SCTLR.V=1
  - 0xffff_0000
Security Extension 사용
- non-secure 및 secure 공통
  - VBAR 지정한 주소
    - SCTLR.V=0
  - high 벡터 주소
    - SCTLR.V=1
- Monitor
  - MVBAR 지정한 주소
Hyper Extensino 사용
- HVBAR 지정한 주소

Exception 시 CPU 모드별 스택

irq, fiq, abt(pabt & babt), und
- ARM32 리눅스 커널은 위의 exception이 발생하는 경우 각각 3 워드 초소형 스택을 사용하여 돌아갈 주소등을 저장하고, SVC 모드로 전환한 후 커널 스택을 사용한다.
svc
- 커널 스택
  - 커널 스레드 동작 중에 exception 발생한 경우 부트업 시 생성한 커널용 스택을 사용한다.
usr
- 유저용 커널 스택
  - 유저 process가 동작 중이었으면 유저 process마다 유저 스택과 커널 스택이 생성되며, exception이 발생하는 경우 동작 중인 유저 process용 커널 스택을 사용한다.

Exception Stub 흐름

ARM에서 Exception이 발생하면 CPU는 지정된 벡터테이블(low/high)로 강제 jump되고 해당 엔트리에는 각 exception을 처리할 수 있는 코드로의 branch 코드가 수행된다.

아래 그림은 8개의 exception vector entry 중 5개 엔트리의 경우 각각 16개의 모드 테이블을 갖고 있고 exception 되기 전의 모드를 인덱스로 하여 해당 루틴으로 이동하는 것을 보여준다.

stub 코드들은 벡터로 부터 1 페이지 아래 위치한 주소에 복사되어 사용된다.

Vector 선언

__vectors_start

arch/arm/kernel/entry-armv.S

        .section .vectors, "ax", %progbits
.L__vectors_start:
        W(b)    vector_rst
        W(b)    vector_und
        W(ldr)  pc, __vectors_start + 0x1000
        W(b)    vector_pabt
        W(b)    vector_dabt
        W(b)    vector_addrexcptn
        W(b)    vector_irq
        W(b)    vector_fiq

8개의 exception vector 엔트리에는 각각을 처리할 수 있는 stub 코드로 이동할 수 있는 branch 코드로 되어 있다.
3번 째 엔트리의 경우 SVC 엔트리로 syscall을 담당한다. 벡터 위치+0x1000에 담겨있는 주소로 이동하는 코드가 있는데 해당 위치는 vector_swi 변수 주소를 담고 있다.

Vector가 저장되는 섹션 위치

arch/arm/kernel/vmlinux.lds.S

#ifdef CONFIG_STRICT_KERNEL_RWX
        . = ALIGN(1<<SECTION_SHIFT);
#else
        . = ALIGN(PAGE_SIZE);
#endif
        __init_begin = .;

        ARM_VECTORS
        INIT_TEXT_SECTION(8)
        .exit.text : {
                ARM_EXIT_KEEP(EXIT_TEXT)
        }
        .init.proc.info : {
                ARM_CPU_DISCARD(PROC_INFO)
        }
        .init.arch.info : {
                __arch_info_begin = .;
                *(.arch.info.init)
                __arch_info_end = .;
        }
        .init.tagtable : {
                __tagtable_begin = .;
                *(.taglist.init)
                __tagtable_end = .;
        }
#ifdef CONFIG_SMP_ON_UP
        .init.smpalt : {
                __smpalt_begin = .;
                *(.alt.smp.init)
                __smpalt_end = .;
        }
#endif
        .init.pv_table : {
                __pv_table_begin = .;
                *(.pv_table)
                __pv_table_end = .;
        }

        INIT_DATA_SECTION(16)

__init_begin 바로 다음에 ARM_VECTORS 매크로가 위치한 것을 확인할 수 있다.

arch/arm/kernel/vmlinux.lds.h

/*
 * The vectors and stubs are relocatable code, and the
 * only thing that matters is their relative offsets
 */

#define ARM_VECTORS                                                     \
        __vectors_start = .;                                            \
        .vectors 0xffff0000 : AT(__vectors_start) {                     \
                *(.vectors)                                             \
        }                                                               \
        . = __vectors_start + SIZEOF(.vectors);                         \
        __vectors_end = .;                                              \
                                                                        \
        __stubs_start = .;                                              \
        .stubs ADDR(.vectors) + 0x1000 : AT(__stubs_start) {            \
                *(.stubs)                                               \
        }                                                               \
        . = __stubs_start + SIZEOF(.stubs);                             \
        __stubs_end = .;                                                \
                                                                        \
        PROVIDE(vector_fiq_offset = vector_fiq - ADDR(.vectors));

0xffff_0000 주소로 시작하는 __vectors_start ~ __vectors_end 사이의 .vectors 섹션에 벡터가 포함됨을 알 수 있다.
벡터 + 0x1000_0000 주소로 시작하는 __stubs_start ~ __stubs_end 사이의 .stubs 섹션에 syscall 관련 코드가 포함된다.

Vector 설치

exception 벡터 및 stub 코드들은 아래의 루틴에서 설치된다.

paging_init()->devicemaps_init()->early_trap_init()

Vector 핸들러

vector_stub 매크로

arch/arm/kernel/entry-armv.S

/*
 * Vector stubs.
 *
 * This code is copied to 0xffff1000 so we can use branches in the
 * vectors, rather than ldr's.  Note that this code must not exceed
 * a page size.
 *
 * Common stub entry macro:
 *   Enter in IRQ mode, spsr = SVC/USR CPSR, lr = SVC/USR PC
 *
 * SP points to a minimal amount of processor-private memory, the address
 * of which is copied into r0 for the mode specific abort handler.
 */

        .macro  vector_stub, name, mode, correction=0
        .align  5

vector_\name:
        .if \correction
        sub     lr, lr, #\correction
        .endif

        @
        @ Save r0, lr_<exception> (parent PC) and spsr_<exception>
        @ (parent CPSR)
        @
        stmia   sp, {r0, lr}            @ save r0, lr
        mrs     lr, spsr
        str     lr, [sp, #8]            @ save spsr

        @
        @ Prepare for SVC32 mode.  IRQs remain disabled.
        @
        mrs     r0, cpsr
        eor     r0, r0, #(\mode ^ SVC_MODE | PSR_ISETSTATE)
        msr     spsr_cxsf, r0

        @
        @ the branch table must immediately follow this code
        @
        and     lr, lr, #0x0f
 THUMB( adr     r0, 1f                  )
 THUMB( ldr     lr, [r0, lr, lsl #2]    )
        mov     r0, sp
 ARM(   ldr     lr, [pc, lr, lsl #2]    )
        movs    pc, lr                  @ branch to handler in SVC mode
ENDPROC(vector_\name)

        .align 2
        @ handler addresses follow this label
1:
        .endm

Exception이 발생되면 8개의 exception vector 엔트리 중 해당하는 exception 위치로 jump 되는데 이 중 5개의 exception 엔트리에서 사용되는 공통 매크로 루틴에서는 exception 벡터로 점프되기 직전의 최대 16개 모드별 테이블에 해당하는 루틴을 찾아 수행한다. 수행 전에 svc 모드로 변환한다.

correction은 루틴이 수행된 후 다시 리턴될 주소를 조정한다. exception이 발생될 때의 파이프라인 위치에 따라 보정 주소가 다르다.
- IRQ, FIQ, Data Abort의 경우 리턴 주소-4로 보정한다.
- Prefetch Abort의 경우 리턴 주소-8로 보정한다.
- Undefined 및 SWI의 경우 보정 없이 리턴 주소를 사용한다.
- Reset은 리턴하지 않으므로 해당사항 없다.
irq, fiq, abt, und 모드에서 사용하는 스택
- cpu_init() 함수에서 3 word 초소형 static 배열을 4개의 스택으로 설정하였다.
- 이 스택에는 다음 3가지 항목을 보관한다.
  - scratch되어 파괴될 r0 레지스터
  - 되돌아갈 주소가 담긴 lr 레지스터(exception 시 pc -> lr_<exception>에 복사되는데 이 값에 correction 값을 보정하여 저장해둔다.)
  - 복원되어야 할 psr 레지스터 (exception 시 cpsr -> spsr_<exception>에 복사되는데 이 값을 저장해둔다.)
ldm/stm에서 사용하는 ^ 접미사 용도
- stm
  - user 모드의 sp, lr 레지스터가 뱅크되어 일반적으로는 다른 모드에서 접근할 수 없어서 user 모드의 sp, lr 레지스터 값을 access하고자할 때 사용
- ldm
  - pc가 포함된 경우 spsr->cpsr로 복사된다
  - 기존 모드로 복귀할 때 spsr(백업받아 두었던) -> cpsr로 복사한다.

.stubs 섹션의 시작

arch/arm/kernel/entry-armv.S

        .section .stubs, "ax", %progbits
__stubs_start:
        @ This must be the first word
        .word   vector_swi

stub 영역은 시스템 콜이 호출되는 vector_swi 부터 시작하여 다음과 같은 순으로 위치한다.

vector_swi
vector_rst
vector_irq
vector_dabt
vector_pabt
vector_und
vector_addrexcptn
vector_fiq

vector_rst (Reset)

arch/arm/kernel/entry-armv.S

vector_rst:
 ARM(   swi     SYS_ERROR0      )
 THUMB( svc     #0              )
 THUMB( nop                     )
        b       vector_und

SYS_ERROR0 소프트웨어 인터럽트를 발생시켜 시스템 콜을 호출하게 한다.

vector_irq (IRQ)

arch/arm/kernel/entry-armv.S

/*
 * Interrupt dispatcher
 */

        vector_stub     irq, IRQ_MODE, 4

        .long   __irq_usr                       @  0  (USR_26 / USR_32)
        .long   __irq_invalid                   @  1  (FIQ_26 / FIQ_32)
        .long   __irq_invalid                   @  2  (IRQ_26 / IRQ_32)
        .long   __irq_svc                       @  3  (SVC_26 / SVC_32)
        .long   __irq_invalid                   @  4
        .long   __irq_invalid                   @  5
        .long   __irq_invalid                   @  6
        .long   __irq_invalid                   @  7
        .long   __irq_invalid                   @  8
        .long   __irq_invalid                   @  9
        .long   __irq_invalid                   @  a
        .long   __irq_invalid                   @  b
        .long   __irq_invalid                   @  c
        .long   __irq_invalid                   @  d
        .long   __irq_invalid                   @  e
        .long   __irq_invalid                   @  f

USR 모드에서 인터럽트가 발생한 경우 __irq_usr 루틴을 호출한다.
FIQ 및 IRQ 모드에서 인터럽트가 발생한 경우 __irq_invalid 루틴을 호출한다.
SVC 모드에서 인터럽트가 발생한 경우 __irq_svc 루틴을 호출한다.

vector_dabt (Data Abort)

arch/arm/kernel/entry-armv.S

/*
 * Data abort dispatcher
 * Enter in ABT mode, spsr = USR CPSR, lr = USR PC
 */

        vector_stub     dabt, ABT_MODE, 8

        .long   __dabt_usr                      @  0  (USR_26 / USR_32)
        .long   __dabt_invalid                  @  1  (FIQ_26 / FIQ_32)
        .long   __dabt_invalid                  @  2  (IRQ_26 / IRQ_32)
        .long   __dabt_svc                      @  3  (SVC_26 / SVC_32)
        .long   __dabt_invalid                  @  4
        .long   __dabt_invalid                  @  5
        .long   __dabt_invalid                  @  6
        .long   __dabt_invalid                  @  7
        .long   __dabt_invalid                  @  8
        .long   __dabt_invalid                  @  9
        .long   __dabt_invalid                  @  a
        .long   __dabt_invalid                  @  b
        .long   __dabt_invalid                  @  c
        .long   __dabt_invalid                  @  d
        .long   __dabt_invalid                  @  e
        .long   __dabt_invalid                  @  f

vector_pabt (Prefetch Abort)

arch/arm/kernel/entry-armv.S

/*
 * Prefetch abort dispatcher
 * Enter in ABT mode, spsr = USR CPSR, lr = USR PC
 */

        vector_stub     pabt, ABT_MODE, 4

        .long   __pabt_usr                      @  0 (USR_26 / USR_32)
        .long   __pabt_invalid                  @  1 (FIQ_26 / FIQ_32)
        .long   __pabt_invalid                  @  2 (IRQ_26 / IRQ_32)
        .long   __pabt_svc                      @  3 (SVC_26 / SVC_32)
        .long   __pabt_invalid                  @  4
        .long   __pabt_invalid                  @  5
        .long   __pabt_invalid                  @  6
        .long   __pabt_invalid                  @  7
        .long   __pabt_invalid                  @  8
        .long   __pabt_invalid                  @  9
        .long   __pabt_invalid                  @  a
        .long   __pabt_invalid                  @  b
        .long   __pabt_invalid                  @  c
        .long   __pabt_invalid                  @  d
        .long   __pabt_invalid                  @  e
        .long   __pabt_invalid                  @  f

vector_und (Undefined Instruction)

arch/arm/kernel/entry-armv.S

/*
 * Undef instr entry dispatcher
 * Enter in UND mode, spsr = SVC/USR CPSR, lr = SVC/USR PC
 */

        vector_stub     und, UND_MODE

        .long   __und_usr                       @  0 (USR_26 / USR_32)
        .long   __und_invalid                   @  1 (FIQ_26 / FIQ_32)
        .long   __und_invalid                   @  2 (IRQ_26 / IRQ_32)
        .long   __und_svc                       @  3 (SVC_26 / SVC_32)
        .long   __und_invalid                   @  4
        .long   __und_invalid                   @  5
        .long   __und_invalid                   @  6
        .long   __und_invalid                   @  7
        .long   __und_invalid                   @  8
        .long   __und_invalid                   @  9
        .long   __und_invalid                   @  a
        .long   __und_invalid                   @  b
        .long   __und_invalid                   @  c
        .long   __und_invalid                   @  d
        .long   __und_invalid                   @  e
        .long   __und_invalid                   @  f

        .align  5

vector_addrexcptn (Address Exception Handler)

arch/arm/kernel/entry-armv.S

/*=============================================================================
 * Address exception handler
 *-----------------------------------------------------------------------------
 * These aren't too critical.
 * (they're not supposed to happen, and won't happen in 32-bit data mode).
 */

vector_addrexcptn:
        b       vector_addrexcptn

vector_fiq (FIQ)

arch/arm/kernel/entry-armv.S

/*=============================================================================
 * FIQ "NMI" handler
 *-----------------------------------------------------------------------------
 * Handle a FIQ using the SVC stack allowing FIQ act like NMI on x86
 * systems.
 */

        vector_stub     fiq, FIQ_MODE, 4

        .long   __fiq_usr                       @  0  (USR_26 / USR_32)
        .long   __fiq_svc                       @  1  (FIQ_26 / FIQ_32)
        .long   __fiq_svc                       @  2  (IRQ_26 / IRQ_32)
        .long   __fiq_svc                       @  3  (SVC_26 / SVC_32)
        .long   __fiq_svc                       @  4
        .long   __fiq_svc                       @  5
        .long   __fiq_svc                       @  6
        .long   __fiq_abt                       @  7
        .long   __fiq_svc                       @  8
        .long   __fiq_svc                       @  9
        .long   __fiq_svc                       @  a
        .long   __fiq_svc                       @  b
        .long   __fiq_svc                       @  c
        .long   __fiq_svc                       @  d
        .long   __fiq_svc                       @  e
        .long   __fiq_svc                       @  f

        .globl  vector_fiq_offset
        .equ    vector_fiq_offset, vector_fiq

참고

Exception -1- (ARM32 Vector) | 문c – 현재 글
Exception -2- (ARM32 Handler 1) | 문c
Exception -3- (ARM32 Handler 2) | 문c
Exception -4- (ARM32 VFP & FPE) | 문c
Exception -5- (Extable) | 문c
Exception -6- (MM Fault Handler) | 문c

ARM Exceptions and Modes | 히언
Pipe line과 Exception 관계 그리고 ^접미사 | 히언