문c 블로그

Early ioremap

2016-12-132021-10-08 문영일 5 Comments

Early IO 매핑

커널 부트업 과정에서 paging_init( ) 함수가 수행된 후에 메모리 및 입출력(I/O) 장치들이 가상 주소 공간에 정식으로 매핑되어 사용될 수 있다. 그런데 이러한 동작이 완료되기 전에 반드시 먼저 매핑되어 사용되어야 하는 경우를 위해 먼저(early) 처리할 수 있는 방법을 준비했다. 이러한 API를 사용하는 경우는 특정 시스템의 바이오스 정보 등에 먼저 접근해서 각종 메모리 및 디바이스의 정보를 가져와야 매핑 처리가 가능한 경우다. 주로 이용하는 경우는 다음과 같다.

ACPI 테이블에 접근해서 각종 디바이스의 설정 정보를 가져와야 하는 경우
EFI 테이블에 접근해서 각종 디바이스의 설정 정보를 가져와야 하는 경우
일부 디바이스에서 특정 설정 정보 등을 읽어와야 하는 경우

이른(early) 시간에 io 매핑이 필요한 경우 256K씩 최대 7개의 fixmap 매핑 공간을 사용하여 매핑을 할 수 있게 한다.

다음 그림은 early_ioremap() 및 early_ioremap_init() 두 함수의 호출 관계를 보여준다.

Early IO 매핑 초기화

early_ioremap_init() – ARM32 & ARM64

ioremap을 사용하기 전에 부트 타임에 early I/O 매핑을 할 수 있도록 준비하는 과정을 자세히 알아본다. 다음 함수는 곧바로 early_ioremap_setup( ) 함수를 호출한다.

ARM32도 커널 v4.5-rc1 부터 early ioremap을 지원한다.
- 참고: ARM: add support for generic early_ioremap/early_memremap

arch/arm64/mm/ioremap.c

/*
 * Must be called after early_fixmap_init
 */
void __init early_ioremap_init(void)
{
        early_ioremap_setup();
}

paging_init()이 완료되지 않아 정규 ioremap() 함수를 사용하지 못하지만 이른(early) 시간에 임시적으로 매핑이 필요한 경우를 위해 준비한다.

fixmap을 이용하므로 early_fixmap_init() 함수가 먼저 호출되어 초기화가 되어야한다.

early_ioremap_setup()

early I/O 매핑을 위해 fixmap 영역에 7개의 256K 가상 주소 영역을 준비하는 과정을 알아본다.

mm/early_ioremap.c

void __init early_ioremap_setup(void)
{
        int i;

        for (i = 0; i < FIX_BTMAPS_SLOTS; i++)
                if (WARN_ON(prev_map[i]))
                        break;

        for (i = 0; i < FIX_BTMAPS_SLOTS; i++)
                slot_virt[i] = __fix_to_virt(FIX_BTMAP_BEGIN - NR_FIX_BTMAPS*i);
}

7개 배열로 이루어진 slot_virt[] 배열에 fixmap에서 early ioremap 용도로 배정된 가상 주소를 배정한다.

코드 라인 5~7에서 FIX_BTMAPS_SLOTS(7) 수만큼 루프를 돌며 전역 prev_map[ ] 배열에 값이 설정된 경우 경고 메시지를 출력한다
- 7개 각 주소 공간은 최대 256K를 사용할 수 있다.
코드 라인 9~10에서 FIX_BTMAPS_SLOTS(7) 수만큼 루프를 돌며 전역 slot_virt[ ] 배열에 fixmap의 BTMAP 가상 주소를 설정한다. 7개의 fixmap 엔트리가 있고, 각 엔트리들은 fixmap의 BTMAP에 해당하는 가상 주소들을 가리키는데 엔트리 간의 간격은 256K이다.
- slot_virt[0] = FIX_BTMAP_BEGIN
- slot_virt[1] = FIX_BTMAP_BEGIN + 256K

다음 그림은 early_ioremap_setup( ) 함수가 호출되어 early IO 매핑에 사용되는 fixmap의 BTMAPS 영역이 7개의 slot이 초기화되는 것을 보여준다.

주요 API

early_ioremap()

mm/early_ioremap.c

/* Remap an IO device */
void __init __iomem *
early_ioremap(resource_size_t phys_addr, unsigned long size)
{
        return __early_ioremap(phys_addr, size, FIXMAP_PAGE_IO);
}

물리 주소 @phys_addr부터 @size에 대응하는 io 디바이스들을 early 매핑한 후 매핑된 가상 주소를 반환한다.

256K 크기를 최대 7개 까지 io 디바이스를 early 매핑한다.
FIXMAP_PAGE_IO
- ARM32
  - L_PTE_YOUNG | L_PTE_PRESENT | L_PTE_XN | L_PTE_DIRTY | L_PTE_MT_DEV_SHARED | L_PTE_SHARED 송성을 가진다.
- ARM64
  - PROT_DEVICE_nGnRE 매핑 속성을 가진다.

early_memremap()

mm/early_ioremap.c

/* Remap memory */ 
void __init *
early_memremap(resource_size_t phys_addr, unsigned long size)
{
        return (__force void *)__early_ioremap(phys_addr, size,
                                               FIXMAP_PAGE_NORMAL);
}

물리 주소 @phys_addr부터 @size에 대응하는 메모리 영역을 early 매핑한 후 매핑된 가상 주소를 반환한다.

256K 크기를 최대 7개 까지 메모리 영역을 early 매핑한다.
FIXMAP_PAGE_NORMAL
- ARM32
  - L_PTE_XN 속성이 추가된 커널 메모리 속성이다.
- ARM64
  - 커널 메모리 속성과 동일하다.

__early_ioremap()

mm/early_ioremap.c

static void __init __iomem *
__early_ioremap(resource_size_t phys_addr, unsigned long size, pgprot_t prot)
{
        unsigned long offset;
        resource_size_t last_addr;
        unsigned int nrpages;
        enum fixed_addresses idx;
        int i, slot;

        WARN_ON(system_state != SYSTEM_RUNNING);

        slot = -1;
        for (i = 0; i < FIX_BTMAPS_SLOTS; i++) {
                if (!prev_map[i]) {
                        slot = i;
                        break;
                }
        }

        if (WARN(slot < 0, "%s(%08llx, %08lx) not found slot\n",
                 __func__, (u64)phys_addr, size))
                return NULL;

        /* Don't allow wraparound or zero size */
        last_addr = phys_addr + size - 1;
        if (WARN_ON(!size || last_addr < phys_addr))
                return NULL;

        prev_size[slot] = size;
        /*
         * Mappings have to be page-aligned
         */
        offset = offset_in_page(phys_addr);
        phys_addr &= PAGE_MASK;
        size = PAGE_ALIGN(last_addr + 1) - phys_addr;

        /*
         * Mappings have to fit in the FIX_BTMAP area.
         */
        nrpages = size >> PAGE_SHIFT;
        if (WARN_ON(nrpages > NR_FIX_BTMAPS))
                return NULL;

        /*
         * Ok, go for it..
         */
        idx = FIX_BTMAP_BEGIN - NR_FIX_BTMAPS*slot;
        while (nrpages > 0) {
                if (after_paging_init)
                        __late_set_fixmap(idx, phys_addr, prot);
                else
                        __early_set_fixmap(idx, phys_addr, prot);
                phys_addr += PAGE_SIZE;
                --idx;
                --nrpages;
        }
        WARN(early_ioremap_debug, "%s(%08llx, %08lx) [%d] => %08lx + %08lx\n",
             __func__, (u64)phys_addr, size, slot, offset, slot_virt[slot]);

        prev_map[slot] = (void __iomem *)(offset + slot_virt[slot]);
        return prev_map[slot];
}

요청한 물리 주소와 사이즈를 fixmap의 BTMAPS 슬롯에 해당하는 가상주소에 prot 속성으로 매핑한다. 실패하는 경우 null을 반환한다. arm64에서 early 하게 io 매핑할 수 있는 최대 수는 FIX_BTMAPS_SLOTS(7)개이다.

fixmap용 BTMAPS 슬롯 관리 정보
- slot_virt[]: 부트업 타임에 초기화된 각 슬롯에 해당하는 fixmap 가상 주소 (256K 단위)
- prev_map[]: fixmap에 매핑된 가상 주소
- prev_size[]: 매핑된 사이즈

코드 라인 12-18에서 사용할 빈 슬롯을 선택한다. FIX_BTMAPS_SLOTS(7) 수만큼 루프를 돌며 전역 prev_map[ ] 값이 설정되지 않았다면 현재 카운터 i 값을 slot에 설정하여 slot을 선택한다.
코드 라인 20~22에서 루프를 도는 동안 빈 슬롯을 못 찾은 경우 경고 메시지를 출력하고 NULL 값을 리턴한다
코드 라인 25~27에서 크기가 0이거나 끝 주소가 시스템 주소 범위를 벗어나는 경우 경고 메시지를 출력하고 NULL 값을 리턴한다.
코드 라인 29~34에서 prev_size[ ]에 요청 크기를 담아둔다. offset은 요청 물리 주소에서 페이지 단위의 나머지 offset만을 계산하고, 물리 주소는 페이지 단위로 내림 정렬한다
코드 라인 35에서 페이지 단위로 정렬된 크기 값을 계산한다. 처음 요청한 물리 시작 주소 + 크기 값을 페이지 단위로 정렬하고, 페이지 단위로 내림 정렬한 물리 주소를 뺀다(최솟값은 페이지 단위가 된다).
코드 라인 40~42에서 크기 값으로 페이지 수를 계산하고 한다. 크기가 256K를 초과한다면 NULL을 리턴한다
코드 라인 48에서 계산된 페이지 수만큼 루프를 돈다.
코드 라인 49~50에서 전역 after_paging_init이 설정된 경우에는 _ _late_clear_fixmap( ) 매크로 함수를 호출한다.
- 예) arm64 및 x86의 경우 별도의 late 처리 함수 없이 그냥 __set_fixmap() 함수를 호출한다.
코드 라인 51~52에서 반대로 after_paging_init이 설정되지 않은 경우에는 _ _early_set_fixmap( ) 매크로 함수를 호출한다.
- 예) arm64의 경우 별도의 early 처리 함수 없이 그냥 __set_fixmap() 함수를 호출한다.
- 예) x64의 경우 별도의 early 처리 함수인 __early_set_fixmap() 함수를 호출한다.
코드 라인 60에서 prev_map[slot]에 슬롯에 해당하는 fixmap 가상 주소 + offset을 더해 설정한다

__early_set_fixmap()

arm64/include/asm/fixmap.h

#define __early_set_fixmap __set_fixmap

paging_init()이 완료되기 전에는 early_ioremap() 함수가 호출되는 경우에는 fixmap에 매핑한다.

__late_set_fixmap()

mm/early_ioremap.c

#define __late_set_fixmap __set_fixmap

paging_init()이 완료된 후 early_ioremap() 함수가 호출되는 경우에는 fixmap에 매핑한다.

커널 v4.1-rc1 부터 부팅 후에도 early_ioremap()을 호출 가능하도록 지원한다. 그 전 까지는 BUG()를 호출하였다.
- 참고: ARM64: allow late use of early_ioremap

early_ioremap() API 제공 만료

early_ioremap_reset()

mm/early_ioremap.c

void __init early_ioremap_reset(void)
{
        early_ioremap_shutdown();
        after_paging_init = 1;
}

정규 페이징이 준비되었으므로 early_ioremap() API를 더 이상 사용하지 말고 정규 ioremap()을 사용해야 하는 시점이다.

아키텍처에 따라 이 함수 진행 후에 early_ioremap() 함수가 호출될 때 BUG() 발생토록 한다.
- x86과 arm64의 경우에는 계속 early_ioremap() 함수의 사용을 허용한다.

기타 API

early_memremap_ro()
- early_memremap()과 동일하나 read only 속성이 추가된다.
early_memremap_prot()
- early_memremap()과 동일하나 @prot_val 속성을 요청하여 추가할 수 있다.

참고

Ioremap | 문c
Fixmap | 문c

Ioremap

2016-12-132017-09-23 문영일 Leave a comment

io 매핑

정규 ioremap 함수는 빈 vmalloc 영역을 찾아 매핑하는데 매핑 종류에 따라 5개의 함수를 사용한다. 아키텍처마다 지원되는 캐시의 종류가 다르므로 특정 모드가 지원되지 않는 경우 하향 호환을 시킨다. ioremap_cache() 함수는 캐시를 사용하지만 아키텍처에 따라 읽기 캐시외에 쓰기 캐시까지도 사용할 수 있다. 쓰기 캐시에 대한 옵션을 가능하면 선택할 수 있도록 하기 위해 특정 아키텍처에서는 ioremap_wc() 함수와 ioremap_wt() 함수를 구분하여 사용하기도 한다. 참고로 다음 표와 같이 arm 및 arm64 에서 원래 함수 의도와는 다르게 약간 상이한 매핑 타입을 확인할 수 있다.

32bit arm ioremap 함수 흐름도

arm64 ioremap 함수 흐름도

ioremap() 등

arch/arm/include/asm/io.h

/*
 * ioremap and friends.
 *
 * ioremap takes a PCI memory address, as specified in
 * Documentation/io-mapping.txt.
 *
 */
#define ioremap(cookie,size)            __arm_ioremap((cookie), (size), MT_DEVICE)
#define ioremap_nocache(cookie,size)    __arm_ioremap((cookie), (size), MT_DEVICE)
#define ioremap_cache(cookie,size)      __arm_ioremap((cookie), (size), MT_DEVICE_CACHED)
#define ioremap_wc(cookie,size)         __arm_ioremap((cookie), (size), MT_DEVICE_WC)

arm64에서 ioremap(), ioremap_nocache(), ioremap_wt() 함수는 PROT_DEVICE_nGnRE 매핑을 사용하여 캐시를 사용하지 않는 디바이스 타입으로 매핑하고, ioremap_wc() 함수는 캐시를 사용하지 않는 것은 동일하지만 디바이스 타입이 아닌 메모리 타입으로 매핑하는 것이 다른 점이다.

__arm_ioremap()

arch/arm/mm/ioremap.c

void __iomem *
__arm_ioremap(phys_addr_t phys_addr, size_t size, unsigned int mtype)
{
        return arch_ioremap_caller(phys_addr, size, mtype,
                __builtin_return_address(0));
}
EXPORT_SYMBOL(__arm_ioremap);

요청 물리주소 및 사이즈만큼 빈 vmalloc 공간을 찾아 mtype 속성으로 매핑하고 가상 주소를 반환한다.

전역 __arm_ioremap_caller 함수 포인터 변수

arch/arm/mm/ioremap.c

void __iomem * (*arch_ioremap_caller)(phys_addr_t, size_t,
                                      unsigned int, void *) =
        __arm_ioremap_caller;

컴파일 타임에 전역 arch_ioremap_caller 함수 포인터 변수에 __arm_ioremap_caller() 함수 주소를 대입한다.

__arm_ioremap_caller()

arch/arm/mm/ioremap.c

void __iomem *__arm_ioremap_caller(phys_addr_t phys_addr, size_t size,
        unsigned int mtype, void *caller)
{
        phys_addr_t last_addr;
        unsigned long offset = phys_addr & ~PAGE_MASK;
        unsigned long pfn = __phys_to_pfn(phys_addr);

        /*
         * Don't allow wraparound or zero size
         */
        last_addr = phys_addr + size - 1;
        if (!size || last_addr < phys_addr)
                return NULL;

        return __arm_ioremap_pfn_caller(pfn, offset, size, mtype,
                        caller);
}

요청 물리주소 및 사이즈만큼 빈 vmalloc 공간을 찾아 mtype 속성으로 매핑하고 가상 주소를 반환한다.

코드 라인 5~6에서 물리 주소에서 페이지 단위로 절삭하여 남은 offset 값과 pfn 값을 구한다.
코드 라인 11에서 페이지 단위로 매핑할 페이지의 끝 주소를 구한다.
- 예) phys_addr=0x1234_5000, size=0x2000
  - last_addr=0x1234_6fff
코드 라인 12~13에서 size가 0이거나 끝 주소가 시스템 범위를 초과하는 경우 null을 반환한다.
코드 라인 15에서 pfn에 대해 size 만큼 빈 vmalloc 공간을 찾아 mtype 속성으로 매핑하고 가상 주소를 반환한다.

__arm_ioremap_pfn_caller()

arch/arm/mm/ioremap.c

void __iomem * __arm_ioremap_pfn_caller(unsigned long pfn,
        unsigned long offset, size_t size, unsigned int mtype, void *caller)
{
        const struct mem_type *type;
        int err;
        unsigned long addr;
        struct vm_struct *area;
        phys_addr_t paddr = __pfn_to_phys(pfn);

#ifndef CONFIG_ARM_LPAE
        /*
         * High mappings must be supersection aligned
         */
        if (pfn >= 0x100000 && (paddr & ~SUPERSECTION_MASK))
                return NULL;
#endif

        type = get_mem_type(mtype);
        if (!type)
                return NULL;

        /*
         * Page align the mapping size, taking account of any offset.
         */
        size = PAGE_ALIGN(offset + size);

        /*
         * Try to reuse one of the static mapping whenever possible.
         */
        if (size && !(sizeof(phys_addr_t) == 4 && pfn >= 0x100000)) {
                struct static_vm *svm;

                svm = find_static_vm_paddr(paddr, size, mtype);
                if (svm) {
                        addr = (unsigned long)svm->vm.addr;
                        addr += paddr - svm->vm.phys_addr;
                        return (void __iomem *) (offset + addr);
                }
        }

        /*
         * Don't allow RAM to be mapped - this causes problems with ARMv6+
         */
        if (WARN_ON(pfn_valid(pfn)))
                return NULL;

        area = get_vm_area_caller(size, VM_IOREMAP, caller);
        if (!area)
                return NULL;
        addr = (unsigned long)area->addr;
        area->phys_addr = paddr;

#if !defined(CONFIG_SMP) && !defined(CONFIG_ARM_LPAE)
        if (DOMAIN_IO == 0 &&
            (((cpu_architecture() >= CPU_ARCH_ARMv6) && (get_cr() & CR_XP)) ||
               cpu_is_xsc3()) && pfn >= 0x100000 &&
               !((paddr | size | addr) & ~SUPERSECTION_MASK)) {
                area->flags |= VM_ARM_SECTION_MAPPING;
                err = remap_area_supersections(addr, pfn, size, type);
        } else if (!((paddr | size | addr) & ~PMD_MASK)) {
                area->flags |= VM_ARM_SECTION_MAPPING;
                err = remap_area_sections(addr, pfn, size, type);
        } else
#endif
                err = ioremap_page_range(addr, addr + size, paddr,
                                         __pgprot(type->prot_pte));

        if (err) {
                vunmap((void *)addr);
                return NULL;
        }

        flush_cache_vmap(addr, addr + size);
        return (void __iomem *) (offset + addr);
}

요청 물리주소 및 사이즈만큼 빈 vmalloc 공간을 찾아 mtype 속성으로 매핑하고 가상 주소를 반환한다.

코드 라인 18~20에서 요청한 mtype에 해당하는 mem_type 구조체를 알아오는데 없는 경우 null을 반환한다.
코드 라인 25에서 매핑에 필요한 페이지 사이즈를 구하기 위해 size에 offset을 더한 값을 페이지 단위로 정렬한다.
- 예) offset=0x678, size=0x1000, PAGE_SIZE=4K
  - size=0x2000 (2개 페이지)
코드 라인 30에서 물리 주소가 4이면서 pfn이 0x10000(4G)를 초과하는 경우만 아니라면
코드 라인 33~38에서 vmalloc 공간에 이미 동일한 타입으로 static 매핑 영역이내에 포함된 경우 매핑 없이 가상 주소만 찾아 반환한다.
코드 라인 47~49에서 VM_IOREMAP 플래그를 사용하여 vmalloc 공간에서 빈 영역을 찾아 영역 정보를 vm_struct 구조체 형태로 알아오고 null인 경우 함수를 빠져나간다.
코드 라인 50~51에서 찾은 시작 가상 주소를 addr에 대입하고 영역의 시작 물리 주소에 paddr을 대입한다.
코드 라인 53~64에서 CONFIG_SMP 커널 옵션을 사용하지 않고 LPAE도 아닌 경우 수퍼 섹션 또는 섹션 매핑을 지원한다.
코드 라인 65~66에서 가상 주소 범위만큼 물리 주소 paddr부터 매핑한다.
코드 라인 68~71 매핑이 실패하는 경우 매핑해제를 시도한다.
코드 라인 73에서 매핑된 가상 주소 영역을 flush한다. 단 armv7 및 armv8 아키텍처와 같이 데이터 캐시가 PIPT 또는 VIPT non-aliasing을 사용하는 경우에는 flush할 필요없다

find_static_vm_paddr()

arch/arm/mm/ioremap.c

static struct static_vm *find_static_vm_paddr(phys_addr_t paddr,
                        size_t size, unsigned int mtype)
{
        struct static_vm *svm;
        struct vm_struct *vm;

        list_for_each_entry(svm, &static_vmlist, list) {
                vm = &svm->vm;
                if (!(vm->flags & VM_ARM_STATIC_MAPPING))
                        continue;
                if ((vm->flags & VM_ARM_MTYPE_MASK) != VM_ARM_MTYPE(mtype))
                        continue;

                if (vm->phys_addr > paddr ||
                        paddr + size - 1 > vm->phys_addr + vm->size - 1)
                        continue;

                return svm;
        }

        return NULL;
}

요청 영역이 vmalloc 공간에 매핑된 영역 엔트리 중 하나에 포함되고 같은 매핑 타입으로 이미 static 매핑된 경우 해당 vm_struct 정보를 반환한다. 그 외의 경우 null을 반환한다.

코드 라인 7에서 전역 static_vmlist의 모든 엔트리를 대상으로 루프를 돈다.
코드 라인 9~10에서 해당 영역이 VM_ARM_STATIC_MAPPING 플래그를 사용하지 않으면 skip 한다.
코드 라인 11~12에서 해당 영역에 매핑된 타입과 요청 매핑 타입이 다른 경우 skip 한다.
코드 라인 14~16에서 기존 영역의 범위내에 요청 범위가 포함되지 않은 경우 skip 한다.

remap_area_sections()

arch/arm/mm/ioremap.c

static int
remap_area_sections(unsigned long virt, unsigned long pfn,
                    size_t size, const struct mem_type *type)
{
        unsigned long addr = virt, end = virt + size;
        pgd_t *pgd;
        pud_t *pud;
        pmd_t *pmd;

        /*
         * Remove and free any PTE-based mapping, and
         * sync the current kernel mapping.
         */
        unmap_area_sections(virt, size);

        pgd = pgd_offset_k(addr);
        pud = pud_offset(pgd, addr);
        pmd = pmd_offset(pud, addr);
        do {
                pmd[0] = __pmd(__pfn_to_phys(pfn) | type->prot_sect);
                pfn += SZ_1M >> PAGE_SHIFT;
                pmd[1] = __pmd(__pfn_to_phys(pfn) | type->prot_sect);
                pfn += SZ_1M >> PAGE_SHIFT;
                flush_pmd_entry(pmd);

                addr += PMD_SIZE;
                pmd += 2;
        } while (addr < end);

        return 0;
}

1M 단위 섹션 매핑이 가능한 경우 해당 pmd 엔트리를 섹션 페이지에 매핑한다.

arm에서 pmd 엔트리는 pmd[0] 및 pmd[1] 2개로 구성되어 있다.

unmap_area_sections()

lib/ioremap.c

/*
 * Section support is unsafe on SMP - If you iounmap and ioremap a region,
 * the other CPUs will not see this change until their next context switch.
 * Meanwhile, (eg) if an interrupt comes in on one of those other CPUs
 * which requires the new ioremap'd region to be referenced, the CPU will
 * reference the _old_ region.
 *
 * Note that get_vm_area_caller() allocates a guard 4K page, so we need to
 * mask the size back to 1MB aligned or we will overflow in the loop below.
 */
static void unmap_area_sections(unsigned long virt, unsigned long size)
{
        unsigned long addr = virt, end = virt + (size & ~(SZ_1M - 1));
        pgd_t *pgd;
        pud_t *pud;
        pmd_t *pmdp;

        flush_cache_vunmap(addr, end);
        pgd = pgd_offset_k(addr);
        pud = pud_offset(pgd, addr);
        pmdp = pmd_offset(pud, addr);
        do {
                pmd_t pmd = *pmdp;

                if (!pmd_none(pmd)) {
                        /*
                         * Clear the PMD from the page table, and
                         * increment the vmalloc sequence so others
                         * notice this change.
                         *
                         * Note: this is still racy on SMP machines.
                         */
                        pmd_clear(pmdp);
                        init_mm.context.vmalloc_seq++;

                        /*
                         * Free the page table, if there was one.
                         */
                        if ((pmd_val(pmd) & PMD_TYPE_MASK) == PMD_TYPE_TABLE)
                                pte_free_kernel(&init_mm, pmd_page_vaddr(pmd));
                }

                addr += PMD_SIZE;
                pmdp += 2;
        } while (addr < end);

        /*
         * Ensure that the active_mm is up to date - we want to
         * catch any use-after-iounmap cases.
         */
        if (current->active_mm->context.vmalloc_seq != init_mm.context.vmalloc_seq)
                __check_vmalloc_seq(current->active_mm);

        flush_tlb_kernel_range(virt, end);
}

vmalloc 공간에서 pmd 매핑된 엔트리를 클리어한다.

코드 라인 18에서 요청 영역을 pmd 섹션 단위만큼 데이터 캐시를 flush한다. 단 armv7 및 armv8 아키텍처와 같이 데이터 캐시가 PIPT 또는 VIPT non-aliasing을 사용하는 경우에는 flush할 필요없다
코드 라인 19~21에서 커널 페이지 테이블을 통해 요청 가상 주소에 해당하는 pgd, pud, pmd 엔트리 주소를 알아온다.
코드 라인 25~33에서 이미 pmd 매핑된 경우 매핑을 클리어한다.
코드 라인 34에서 커널에 메모리 context에서 vmalloc 시퀀스 값을 증가시켜 커널의 vmalloc 등록 정보가 갱신되었음을 나타내게 한다.
코드 라인 39~40에서 기존 pmd 엔트리가 테이블을 가리킨 경우 그 테이블을 할당 해제 시킨다.
코드 라인 43~45에서 끝 주소까지 pmd 단위를 증가시키며 루프를 돈다.
코드 라인 51~52에서 커널의 vmalloc 정보와 현재 태스크의 vmalloc 정보를 비교하여 갱신된 경우 커널의 pgd 매핑된 vmalloc 엔트리들을 현재 태스크 메모리 디스크립터의 pgd 테이블에 복사한다.
코드 라인 54에서 해당 커널 영역에 대한 TLB cache를 flush 한다

ioremap_page_range()

lib/ioremap.c

int ioremap_page_range(unsigned long addr, 
                       unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
{
        pgd_t *pgd;
        unsigned long start;
        unsigned long next;
        int err;

        BUG_ON(addr >= end);

        start = addr;
        phys_addr -= addr;
        pgd = pgd_offset_k(addr);
        do {
                next = pgd_addr_end(addr, end);
                err = ioremap_pud_range(pgd, addr, next, phys_addr+addr, prot);
                if (err)
                        break;
        } while (pgd++, addr = next, addr != end);

        flush_cache_vmap(start, end);

        return err;
}
EXPORT_SYMBOL_GPL(ioremap_page_range);

주어진 가상 주소 영역에 물리 주소를 prot 속성으로 매핑한다.

코드 라인 11~12에서 요청 가상 주소를 start에 잠시 보관하고, 물리 주소에서 요청 가상 주소를 뺀다.
- 예) addr=0x1000_0000, phys_addr=0x1234_5000
  - phys_addr=0x0234_5000
코드 라인 13에서 가상 주소 addr에 해당하는 커널 pgd 엔트리 주소를 알아온다.
코드 라인 15에서 다음 pgd 엔트리가 관리하는 가상 주소를 가져오되 마지막인 경우 end 값을 가져온다.
코드 라인 16~18에서 pgd 엔트리 하나 범위내 addr ~ next 가상 주소 범위의 pud 엔트리들을 매핑하고 에러인 경우 루프를 탈출한다.
코드 라인 19에서 다음 pgd 엔트리 및 다음 가상 주소를 선택하고 end까지 루프를 돌며 반복한다.
코드 라인 21에서 매핑된 가상 주소 영역을 flush한다. 단 armv7 및 armv8 아키텍처와 같이 데이터 캐시가 PIPT 또는 VIPT non-aliasing을 사용하는 경우에는 flush할 필요없다.

ioremap_pud_range()

lib/ioremap.c

static inline int ioremap_pud_range(pgd_t *pgd, unsigned long addr,
                unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
{
        pud_t *pud;
        unsigned long next;

        phys_addr -= addr;
        pud = pud_alloc(&init_mm, pgd, addr);
        if (!pud)
                return -ENOMEM;
        do {
                next = pud_addr_end(addr, end);
                if (ioremap_pmd_range(pud, addr, next, phys_addr + addr, prot))
                        return -ENOMEM;
        } while (pud++, addr = next, addr != end);
        return 0;          
}

pud 단위로 주어진 가상 주소 영역에 물리 주소를 prot 속성으로 매핑한다.

코드 라인 7에서 물리 주소에서 요청 가상 주소를 미리 뺀다.
- 예) addr=0x1000_0000, phys_addr=0x1_1234_5000
  - phys_addr=0x1_0234_5000
코드 라인 8~10에서 pud용 테이블을 하나 할당 받아온다. 만일 실패하는 경우 -ENOMEM 에러로 함수를 빠져나간다.
코드 라인 12에서 다음 pud 엔트리가 관리하는 가상 주소를 가져오되 마지막인 경우 end 값을 가져온다.
코드 라인 13~14에서 pud 엔트리 하나 범위내 addr ~ next 가상 주소 범위의 pmd 엔트리들을 매핑한다. 실패하는 경우 -ENOMEM 결과를 가지고 함수를 빠져나간다.
코드 섹션 15에서 다음 pud 엔트리 및 다음 가상 주소를 선택하고 end까지 루프를 돌며 반복한다.

ioremap_pmd_range()

lib/ioremap.c

static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
                unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
{
        pmd_t *pmd;
        unsigned long next;

        phys_addr -= addr;
        pmd = pmd_alloc(&init_mm, pud, addr);
        if (!pmd)
                return -ENOMEM;
        do {
                next = pmd_addr_end(addr, end);
                if (ioremap_pte_range(pmd, addr, next, phys_addr + addr, prot))
                        return -ENOMEM;
        } while (pmd++, addr = next, addr != end); 
        return 0;              
}

pmd 단위로 주어진 가상 주소 영역에 물리 주소를 prot 속성으로 매핑한다.

코드 라인 7에서 물리 주소에서 요청 가상 주소를 미리 뺀다.
코드 라인 8~10에서 pmd용 테이블을 하나 할당 받아온다. 만일 실패하는 경우 -ENOMEM 에러로 함수를 빠져나간다.
코드 라인 12에서 다음 pmd 엔트리가 관리하는 가상 주소를 가져오되 마지막인 경우 end 값을 가져온다.
코드 라인 13~14에서 pmd 엔트리 하나 범위내 addr ~ next 가상 주소 범위의 pte 엔트리들을 매핑한다. 실패하는 경우 -ENOMEM 결과를 가지고 함수를 빠져나간다.
코드 섹션 15에서 다음 pmd 엔트리 및 다음 가상 주소를 선택하고 end까지 루프를 돌며 반복한다.

ioremap_pte_range()

lib/ioremap.c

static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
                unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
{
        pte_t *pte;
        u64 pfn;

        pfn = phys_addr >> PAGE_SHIFT;
        pte = pte_alloc_kernel(pmd, addr);
        if (!pte)
                return -ENOMEM;
        do {
                BUG_ON(!pte_none(*pte));
                set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot));
                pfn++;
        } while (pte++, addr += PAGE_SIZE, addr != end);
        return 0;
}

pte 단위로 주어진 가상 주소 영역에 물리 주소 페이지를 prot 속성으로 매핑한다.

코드 라인 7에서 pfn을 알아온다.
코드 라인 8~10에서 pte용 테이블을 하나 할당 받아온다. 만일 실패하는 경우 -ENOMEM 에러로 함수를 빠져나간다.
코드 라인 13~14에서 가상 주소 addr에 해당하는 pte 엔트리를 pfn 물리페이지에 prot 타입으로 매핑시키고 pfn을 증가시킨다.
코드 라인 15에서 다음 pte 엔트리를 증가시키고, 가상 주소도 다음 페이지만큼 증가시키며 end까지 루프를 돌며 반복한다.

io 언매핑

iounmap()

arch/arm/include/asm/io.h

#define iounmap                         __arm_iounmap

요청 물리 주소의 io 매핑을 해제한다.

__arm_iounmap()

arch/arm/mm/ioremap.c

void __arm_iounmap(volatile void __iomem *io_addr)
{
        arch_iounmap(io_addr);
}
EXPORT_SYMBOL(__arm_iounmap);

요청 물리 주소의 io 매핑을 해제한다.

arch_iounmap()

arch/arm/mm/ioremap.c

void (*arch_iounmap)(volatile void __iomem *) = __iounmap;

컴파일 타임에 전역 arch_iounmap 함수 포인터 변수에는 __iounmap() 함수의 주소가 담긴다.

__iounmap()

arch/arm/mm/ioremap.c

void __iounmap(volatile void __iomem *io_addr)
{       
        void *addr = (void *)(PAGE_MASK & (unsigned long)io_addr);
        struct static_vm *svm;

        /* If this is a static mapping, we must leave it alone */
        svm = find_static_vm_vaddr(addr);
        if (svm)
                return;

#if !defined(CONFIG_SMP) && !defined(CONFIG_ARM_LPAE)
        {
                struct vm_struct *vm;
                       
                vm = find_vm_area(addr);
        
                /*
                 * If this is a section based mapping we need to handle it
                 * specially as the VM subsystem does not know how to handle
                 * such a beast.
                 */
                if (vm && (vm->flags & VM_ARM_SECTION_MAPPING))
                        unmap_area_sections((unsigned long)vm->addr, vm->size);
        }
#endif          

        vunmap(addr);
}

요청 물리 주소의 io 매핑을 해제한다.

코드 라인 7~9에서 vmalloc 공간에 요청 주소에 대한 static 매핑이 있는 경우 그냥 함수를 빠져나간다.
코드 라인 11~25에서 시스템(빌드된 커널)이 SMP 및 LPAE를 지원하지 않는 경우 해당 주소로 섹션 매핑이 된 경우 해제한다.
코드 라인 27에서 vmalloc 공간에 매핑된 io 물리 주소를 vunmap() 함수를 호출하여 매핑을 해제한다.

vmalloc 공간의 매핑 상태 확인

다음과 같이 어떤 api를 통해 물리주소가 vmalloc 가상 주소 공간에 매핑되었는지 확인할 수 있다.

$ sudo cat /proc/vmallocinfo
0xba800000-0xbb000000 8388608 iotable_init+0x0/0xb8 phys=3a800000 ioremap
0xbb804000-0xbb806000    8192 raw_init+0x50/0x148 pages=1 vmalloc
0xbb806000-0xbb809000   12288 pcpu_mem_zalloc+0x44/0x80 pages=2 vmalloc
0xbb899000-0xbb89c000   12288 pcpu_mem_zalloc+0x44/0x80 pages=2 vmalloc
0xbb89c000-0xbb89e000    8192 dwc_otg_driver_probe+0x650/0x7a8 phys=3f006000 ioremap
0xbb89e000-0xbb8a0000    8192 devm_ioremap_nocache+0x40/0x7c phys=3f300000 ioremap
0xbbc00000-0xbbc83000  536576 bcm2708_fb_set_par+0x108/0x13c phys=3db79000 ioremap
0xbc9f7000-0xbc9ff000   32768 SyS_swapon+0x618/0xf6c pages=7 vmalloc
0xbc9ff000-0xbca01000    8192 SyS_swapon+0x850/0xf6c pages=1 vmalloc
0xbcc39000-0xbcc3b000    8192 SyS_swapon+0xaa0/0xf6c pages=1 vmalloc
0xbce7b000-0xbce7f000   16384 n_tty_open+0x20/0xe4 pages=3 vmalloc
0xbce83000-0xbce87000   16384 n_tty_open+0x20/0xe4 pages=3 vmalloc
0xbce8b000-0xbce8f000   16384 n_tty_open+0x20/0xe4 pages=3 vmalloc
0xbce93000-0xbce97000   16384 n_tty_open+0x20/0xe4 pages=3 vmalloc
0xbce9b000-0xbce9f000   16384 n_tty_open+0x20/0xe4 pages=3 vmalloc
0xbcea7000-0xbceab000   16384 n_tty_open+0x20/0xe4 pages=3 vmalloc
0xbceab000-0xbceaf000   16384 n_tty_open+0x20/0xe4 pages=3 vmalloc
0xbceb7000-0xbceb9000    8192 bpf_prog_alloc+0x44/0xb0 pages=1 vmalloc
0xbf509000-0xbf50d000   16384 n_tty_open+0x20/0xe4 pages=3 vmalloc
0xbf50d000-0xbf511000   16384 n_tty_open+0x20/0xe4 pages=3 vmalloc
0xf3000000-0xf3001000    4096 iotable_init+0x0/0xb8 phys=3f000000 ioremap
0xf3003000-0xf3004000    4096 iotable_init+0x0/0xb8 phys=3f003000 ioremap
0xf3007000-0xf3008000    4096 iotable_init+0x0/0xb8 phys=3f007000 ioremap
0xf300b000-0xf300c000    4096 iotable_init+0x0/0xb8 phys=3f00b000 ioremap
0xf3100000-0xf3101000    4096 iotable_init+0x0/0xb8 phys=3f100000 ioremap
0xf3200000-0xf3201000    4096 iotable_init+0x0/0xb8 phys=3f200000 ioremap
0xf3201000-0xf3202000    4096 iotable_init+0x0/0xb8 phys=3f201000 ioremap
0xf3215000-0xf3216000    4096 iotable_init+0x0/0xb8 phys=3f215000 ioremap
0xf3980000-0xf39a0000  131072 iotable_init+0x0/0xb8 phys=3f980000 ioremap
0xf4000000-0xf4001000    4096 iotable_init+0x0/0xb8 phys=40000000 ioremap
0xfea50000-0xff000000 5963776 pcpu_get_vm_areas+0x0/0x5e0 vmalloc

참고

Early ioremam | 문c
Vmap | 문c

delayacct_init()

2016-12-022016-12-02 문영일 Leave a comment

delayacct_init()

kernel/delayacct.c

void delayacct_init(void)
{
        delayacct_cache = KMEM_CACHE(task_delay_info, SLAB_PANIC);
        delayacct_tsk_init(&init_task);
}

CONFIG_TASK_DELAY_ACCT 커널 옵션을 사용하는 경우 task에 대한 각종 지연 stat을 얻을 수 있다.

cpu, 동기 블록 입출력 완료 및 페이지 스와핑과 같은 시스템 자원을 기다리는 작업에 소요되는 시간에 대한 정보를 수집합니다.
“nodelayacct” 커널 파라메터를 사용하는 경우 delay accounting 기능을 disable한다.

__delayacct_tsk_init()

kernel/delayacct.c

void __delayacct_tsk_init(struct task_struct *tsk)
{
        tsk->delays = kmem_cache_zalloc(delayacct_cache, GFP_KERNEL);
        if (tsk->delays)
                spin_lock_init(&tsk->delays->lock);
}

요청 태스크의 delays 멤버에 delayacct_cache 구조체를 할당받은 후 초기화한다.

delayacct_setup_disable()

kernel/delayacct.c

int delayacct_on __read_mostly = 1;     /* Delay accounting turned on/off */
EXPORT_SYMBOL_GPL(delayacct_on);
struct kmem_cache *delayacct_cache;

static int __init delayacct_setup_disable(char *str)
{
        delayacct_on = 0;
        return 1;
}
__setup("nodelayacct", delayacct_setup_disable);

“nodelayacct” 커널 파라메터를 사용하는 경우 delay accounting 기능을 disable한다.

Per-task Delay Accounting

delayacct_freepages_start()

include/linux/delayacct.h

static inline void delayacct_freepages_start(void)
{
        if (current->delays)
                __delayacct_freepages_start();
}

태스크의 delay accounting에 사용하기 위해 현재 시작 clock을 기록한다.

__delayacct_blkio_start()

kernel/delayacct.c

void __delayacct_freepages_start(void)
{
        current->delays->freepages_start = ktime_get_ns();
}

태스크의 delay accounting에 사용하기 위해 현재 시작 clock을 기록한다.

delayacct_freepages_end()

include/linux/delayacct.h

static inline void delayacct_freepages_end(void)
{
        if (current->delays)
                __delayacct_freepages_end();
}

태스크의 delay accounting에 사용하기 위해 시작 clock에서 현재 clock을 빼서 기록한다.

__delayacct_freepages_end()

kernel/delayacct.c

void __delayacct_freepages_end(void)
{
        delayacct_end(&current->delays->freepages_start,
                        &current->delays->freepages_delay,
                        &current->delays->freepages_count);
}

태스크의 delay accounting에 사용하기 위해 현재 clock에서 freepages_start clock을 빼서 freepages_delay에 delay time을 추가하고 freepages_count를 증가시킨다.

delayacct_end()

kernel/delayacct.c

/*
 * Finish delay accounting for a statistic using its timestamps (@start),
 * accumalator (@total) and @count
 */
static void delayacct_end(u64 *start, u64 *total, u32 *count)
{
        s64 ns = ktime_get_ns() - *start;
        unsigned long flags;

        if (ns > 0) {
                spin_lock_irqsave(&current->delays->lock, flags);
                *total += ns;
                (*count)++;
                spin_unlock_irqrestore(&current->delays->lock, flags);
        }
}

delay accounting에 사용하기 위해 현재 clock에서 인수로 전달받은 start clock을 빼서 total에 추가 하고 count를 증가시킨다.

getdelays 유틸리티 빌드 및 사용

Documentation/accounting/getdelay.c를 컴파일하여 사용한다.
- gcc -I/usr/src/linux/include getdelays.c -o getdelays

./getdelays 
getdelays [-dilv] [-w logfile] [-r bufsize] [-m cpumask] [-t tgid] [-p pid]
  -d: print delayacct stats
  -i: print IO accounting (works only with -p)
  -l: listen forever
  -v: debug on
  -C: container path

참고

Documentation/accounting/delay-accounting.txt | kernel.org
Per-task delay accounting | LWN.net

RCU(Read Copy Update) -2- (Callback process)

2016-12-022021-04-13 문영일 2 Comments

RCU(Read Copy Update) -2- (Callback process)

RCU_BH 및 RCU_SCHED state 제거

커널 v4.20-rc1 부터 rcu_preempt, rcu_bh 및 rcu_sched와 같이 3 종류의 state를 사용하여 왔는데 rcu_preempt 하나로 통합되었다.

call_rcu_bh() 및 call_rcu_sched() 함수는 제거되었고, 대신 call_rcu() 만을 사용한다.

다음은 커널 v4.19에서 사용하는 rcu 스레드를 보여준다.

$ ps -ef | grep rcu
root         3     2  0 Jan15 ?        00:00:00 [rcu_gp]
root         4     2  0 Jan15 ?        00:00:00 [rcu_par_gp]
root        10     2  0 Jan15 ?        00:27:01 [rcu_preempt]
root        11     2  0 Jan15 ?        00:00:10 [rcu_sched]
root        12     2  0 Jan15 ?        00:00:00 [rcu_bh]
root        32     2  0 Jan15 ?        00:00:00 [rcu_tasks_kthre]

다음은 커널 v5.4에서 사용하는 rcu 스레드를 보여준다.

rcu_sched 및 rcu_bh가 없는 것을 확인할 수 있다.

$ ps -ef | grep rcu
root         3     2  0 Mar10 ?        00:00:00 [rcu_gp]
root         4     2  0 Mar10 ?        00:00:00 [rcu_par_gp]
root        10     2  0 Mar10 ?        00:00:06 [rcu_preempt]
root        20     2  0 Mar10 ?        00:00:00 [rcu_tasks_kthre]

RCU 동기 Update

synchronize_rcu()

kernel/rcu/tree.c

/**
 * synchronize_rcu - wait until a grace period has elapsed.
 *
 * Control will return to the caller some time after a full grace
 * period has elapsed, in other words after all currently executing RCU
 * read-side critical sections have completed.  Note, however, that
 * upon return from synchronize_rcu(), the caller might well be executing
 * concurrently with new RCU read-side critical sections that began while
 * synchronize_rcu() was waiting.  RCU read-side critical sections are
 * delimited by rcu_read_lock() and rcu_read_unlock(), and may be nested.
 * In addition, regions of code across which interrupts, preemption, or
 * softirqs have been disabled also serve as RCU read-side critical
 * sections.  This includes hardware interrupt handlers, softirq handlers,
 * and NMI handlers.
 *
 * Note that this guarantee implies further memory-ordering guarantees.
 * On systems with more than one CPU, when synchronize_rcu() returns,
 * each CPU is guaranteed to have executed a full memory barrier since
 * the end of its last RCU read-side critical section whose beginning
 * preceded the call to synchronize_rcu().  In addition, each CPU having
 * an RCU read-side critical section that extends beyond the return from
 * synchronize_rcu() is guaranteed to have executed a full memory barrier
 * after the beginning of synchronize_rcu() and before the beginning of
 * that RCU read-side critical section.  Note that these guarantees include
 * CPUs that are offline, idle, or executing in user mode, as well as CPUs
 * that are executing in the kernel.
 *
 * Furthermore, if CPU A invoked synchronize_rcu(), which returned
 * to its caller on CPU B, then both CPU A and CPU B are guaranteed
 * to have executed a full memory barrier during the execution of
 * synchronize_rcu() -- even if CPU A and CPU B are the same CPU (but
 * again only if the system has more than one CPU).
 */

void synchronize_rcu(void)
{
        RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) ||
                         lock_is_held(&rcu_lock_map) ||
                         lock_is_held(&rcu_sched_lock_map),
                         "Illegal synchronize_rcu() in RCU read-side critical section");
        if (rcu_blocking_is_gp())
                return;
        if (rcu_gp_is_expedited())
                synchronize_rcu_expedited();
        else
                wait_rcu_gp(call_rcu);
}
EXPORT_SYMBOL_GPL(synchronize_rcu);

grace period가 지날때까지 기다린다(sleep).

코드 라인 7~8에서 아직 rcu의 gp 사용이 블러킹상태인 경우엔 gp 대기 없이 곧바로 함수를 빠져나간다.
- preemptible 커널에서 아직 rcu 스케쥴러가 동작하지 않는 경우
- 부팅 중인 경우 또는 1개의 online cpu만을 사용하는 경우
코드 라인 9~12에서 gp를 대기할 때 조건에 따라 다음 두 가지중 하나를 선택하여 동작한다.
- 더 신속한 Brute-force RCU grace period 방법
- 일반 RCU grace period 방법

rcu_blocking_is_gp()

kernel/rcu/tree.c

/*
 * During early boot, any blocking grace-period wait automatically
 * implies a grace period.  Later on, this is never the case for PREEMPT.
 *
 * Howevr, because a context switch is a grace period for !PREEMPT, any
 * blocking grace-period wait automatically implies a grace period if
 * there is only one CPU online at any point time during execution of
 * either synchronize_rcu() or synchronize_rcu_expedited().  It is OK to
 * occasionally incorrectly indicate that there are multiple CPUs online
 * when there was in fact only one the whole time, as this just adds some
 * overhead: RCU still operates correctly.
 */

static int rcu_blocking_is_gp(void)
{
        int ret;

        if (IS_ENABLED(CONFIG_PREEMPTION))
                return rcu_scheduler_active == RCU_SCHEDULER_INACTIVE;
        might_sleep();  /* Check for RCU read-side critical section. */
        preempt_disable();
        ret = num_online_cpus() <= 1;
        preempt_enable();
        return ret;
}

rcu의 gp 사용이 블러킹된 상태인지 여부를 반환한다.

코드 라인 5~6에서 preemptible 커널인 경우 rcu 스케쥴러가 비활성화 여부에 따라 반환한다.
코드 라인 7~11에서 online cpu수를 카운트하여 1개 이하인 경우 blocking 상태로 반환한다. 1개의 cpu인 경우 might_sleep() 이후에 preempt_disable() 및 preempt_enable()을 반복하였으므로 필요한 GP는 완료된 것으로 간주한다.

다음 그림은 동기화 rcu 요청 함수인 synchronize_rcu() 및 비동기 rcu 요청 함수인 call_rcu() 함수 두 가지에 대해 각각 호출 경로를 보여준다.

wait_rcu_gp()

include/linux/rcupdate_wait.h

#define wait_rcu_gp(...) _wait_rcu_gp(false, __VA_ARGS__)

Grace Period가 완료될 때까지 대기한다. 인자로 gp 완료를 대기하는 다음 두 함수 중 하나가 주어진다.

call_rcu()
call_rcu_tasks()

_wait_rcu_gp()

include/linux/rcupdate_wait.h

#define _wait_rcu_gp(checktiny, ...) \
do {                                                                    \
        call_rcu_func_t __crcu_array[] = { __VA_ARGS__ };               \
        struct rcu_synchronize __rs_array[ARRAY_SIZE(__crcu_array)];    \
        __wait_rcu_gp(checktiny, ARRAY_SIZE(__crcu_array),              \
                        __crcu_array, __rs_array);                      \
} while (0)

가변 인자를 받아 처리할 수 있게 하였으나, 현제 실제 커널 코드에서는 1 개의 인자만 받아 처리하고 있다.

__wait_rcu_gp()

kernel/rcu/update.c

void __wait_rcu_gp(bool checktiny, int n, call_rcu_func_t *crcu_array,
                   struct rcu_synchronize *rs_array)
{
        int i;
        int j;

        /* Initialize and register callbacks for each crcu_array element. */
        for (i = 0; i < n; i++) {
                if (checktiny &&
                    (crcu_array[i] == call_rcu)) {
                        might_sleep();
                        continue;
                }
                init_rcu_head_on_stack(&rs_array[i].head);
                init_completion(&rs_array[i].completion);
                for (j = 0; j < i; j++)
                        if (crcu_array[j] == crcu_array[i])
                                break;
                if (j == i)
                        (crcu_array[i])(&rs_array[i].head, wakeme_after_rcu);
        }

        /* Wait for all callbacks to be invoked. */
        for (i = 0; i < n; i++) {
                if (checktiny &&
                    (crcu_array[i] == call_rcu))
                        continue;
                for (j = 0; j < i; j++)
                        if (crcu_array[j] == crcu_array[i])
                                break;
                if (j == i)
                        wait_for_completion(&rs_array[i].completion);
                destroy_rcu_head_on_stack(&rs_array[i].head);
        }
}
EXPORT_SYMBOL_GPL(__wait_rcu_gp);

Grace Period가 완료될 때까지 대기한다. (@checktiny는 tiny/true 모델을 구분한다. 두 번째 인자는 전달 되는 array 크기를 갖고, 세 번째 인자는 호출할 비동기 콜백 함수(call_rcu() 또는 call_rcu_tasks())가 주어지며 마지막으로 네 번째 인자에는 gp 대기를 위한 rcu_synchronize 구조체 배열이 전달된다)

코드 라인 8~13에서 인자 수 @n 만큼 순회하며 첫 번째 인자 @checktiny 가 설정되었고, 세 번째 인자로 call_rcu 함수가 지정된 경우 preemption pointer를 실행한 후 skip 한다.
- 현재 커널 코드들에서는 @checktiny=0으로 호출되고 있다.
코드 라인 14~15에서 마지막 인자로 제공된 @rs_array의 rcu head를 스택에서 초기화하고, 또한 completion도 초기화한다.
- @rs_array에는 gp 비동기 처리를 위한 다음 두 함수가 사용되고 있다.
코드 라인 16~20에서 중복된 함수 호출이 없으면 세 번째 인자로 전달 받은 다음의 gp 비동기 처리 함수 중 하나를 호출하고, 인자로 rcuhead와 wakeme_after_rcu() 콜백 함수를지정한다.
- call_rcu()
- call_rcu_tasks()
코드 라인 24~27에서 인자 수 @n 만큼 순회하며 인자로 call_rcu 함수가 지정된 경우 skip 한다.
코드 라인 28~32에서중복된 함수 호출이 없으면 순회 중인 인덱스에 해당하는 콜백 함수가 처리 완료될 때까지 대기한다.
코드 라인 33에서 스택에 위치한 rcu head를 제거한다.

wakeme_after_rcu()

kernel/rcu/update.c

/**
 * wakeme_after_rcu() - Callback function to awaken a task after grace period
 * @head: Pointer to rcu_head member within rcu_synchronize structure
 *
 * Awaken the corresponding task now that a grace period has elapsed.
 */

void wakeme_after_rcu(struct rcu_head *head)
{
        struct rcu_synchronize *rcu;

        rcu = container_of(head, struct rcu_synchronize, head);
        complete(&rcu->completion);
}
EXPORT_SYMBOL_GPL(wakeme_after_rcu);

gp가 완료되었음을 알리도록 complete 처리를 한다. (콜백 함수)

RCU 비동기 Update

call_rcu()

/**
 * call_rcu() - Queue an RCU callback for invocation after a grace period.
 * @head: structure to be used for queueing the RCU updates.
 * @func: actual callback function to be invoked after the grace period
 *
 * The callback function will be invoked some time after a full grace
 * period elapses, in other words after all pre-existing RCU read-side
 * critical sections have completed.  However, the callback function
 * might well execute concurrently with RCU read-side critical sections
 * that started after call_rcu() was invoked.  RCU read-side critical
 * sections are delimited by rcu_read_lock() and rcu_read_unlock(), and
 * may be nested.  In addition, regions of code across which interrupts,
 * preemption, or softirqs have been disabled also serve as RCU read-side
 * critical sections.  This includes hardware interrupt handlers, softirq
 * handlers, and NMI handlers.
 *
 * Note that all CPUs must agree that the grace period extended beyond
 * all pre-existing RCU read-side critical section.  On systems with more
 * than one CPU, this means that when "func()" is invoked, each CPU is
 * guaranteed to have executed a full memory barrier since the end of its
 * last RCU read-side critical section whose beginning preceded the call
 * to call_rcu().  It also means that each CPU executing an RCU read-side
 * critical section that continues beyond the start of "func()" must have
 * executed a memory barrier after the call_rcu() but before the beginning
 * of that RCU read-side critical section.  Note that these guarantees
 * include CPUs that are offline, idle, or executing in user mode, as
 * well as CPUs that are executing in the kernel.
 *
 * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the
 * resulting RCU callback function "func()", then both CPU A and CPU B are
 * guaranteed to execute a full memory barrier during the time interval
 * between the call to call_rcu() and the invocation of "func()" -- even
 * if CPU A and CPU B are the same CPU (but again only if the system has
 * more than one CPU).
 */

void call_rcu(struct rcu_head *head, rcu_callback_t func)
{
        __call_rcu(head, func, 0);
}
EXPORT_SYMBOL_GPL(call_rcu);

rcu 콜백 함수 @func을 등록한다. 이 rcu 콜백 함수는 GP(Grace Period) 완료 후 호출된다.

__call_rcu()

kernel/rcu/tree.c

/*
 * Helper function for call_rcu() and friends.  The cpu argument will
 * normally be -1, indicating "currently running CPU".  It may specify
 * a CPU only if that CPU is a no-CBs CPU.  Currently, only rcu_barrier()
 * is expected to specify a CPU.
 */

static void
__call_rcu(struct rcu_head *head, rcu_callback_t func, bool lazy)
{
        unsigned long flags;
        struct rcu_data *rdp;
        bool was_alldone;

        /* Misaligned rcu_head! */
        WARN_ON_ONCE((unsigned long)head & (sizeof(void *) - 1));

        if (debug_rcu_head_queue(head)) {
                /*
                 * Probable double call_rcu(), so leak the callback.
                 * Use rcu:rcu_callback trace event to find the previous
                 * time callback was passed to __call_rcu().
                 */
                WARN_ONCE(1, "__call_rcu(): Double-freed CB %p->%pS()!!!\n",
                          head, head->func);
                WRITE_ONCE(head->func, rcu_leak_callback);
                return;
        }
        head->func = func;
        head->next = NULL;
        local_irq_save(flags);
        rdp = this_cpu_ptr(&rcu_data);

        /* Add the callback to our list. */
        if (unlikely(!rcu_segcblist_is_enabled(&rdp->cblist))) {
                // This can trigger due to call_rcu() from offline CPU:
                WARN_ON_ONCE(rcu_scheduler_active != RCU_SCHEDULER_INACTIVE);
                WARN_ON_ONCE(!rcu_is_watching());
                // Very early boot, before rcu_init().  Initialize if needed
                // and then drop through to queue the callback.
                if (rcu_segcblist_empty(&rdp->cblist))
                        rcu_segcblist_init(&rdp->cblist);
        }
        if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags))
                return; // Enqueued onto ->nocb_bypass, so just leave.
        /* If we get here, rcu_nocb_try_bypass() acquired ->nocb_lock. */
        rcu_segcblist_enqueue(&rdp->cblist, head, lazy);
        if (__is_kfree_rcu_offset((unsigned long)func))
                trace_rcu_kfree_callback(rcu_state.name, head,
                                         (unsigned long)func,
                                         rcu_segcblist_n_lazy_cbs(&rdp->cblist),
                                         rcu_segcblist_n_cbs(&rdp->cblist));
        else
                trace_rcu_callback(rcu_state.name, head,
                                   rcu_segcblist_n_lazy_cbs(&rdp->cblist),
                                   rcu_segcblist_n_cbs(&rdp->cblist));

        /* Go handle any RCU core processing required. */
        if (IS_ENABLED(CONFIG_RCU_NOCB_CPU) &&
            unlikely(rcu_segcblist_is_offloaded(&rdp->cblist))) {
                __call_rcu_nocb_wake(rdp, was_alldone, flags); /* unlocks */
        } else {
                __call_rcu_core(rdp, head, flags);
                local_irq_restore(flags);
        }
}

call_rcu의 헬퍼 함수로 grace period가 지난 후에 RCU 콜백을 호출할 수 있도록 큐에 추가한다. @lazy가 true인 경우 kfree_call_rcu() 함수에서 호출하는 경우이며 @func에는 콜백 함수 대신 offset이 담긴다.

코드 라인 11~21에서 CONFIG_DEBUG_OBJECTS_RCU_HEAD 커널 옵션을 사용하여 디버깅을 하는 동안 __call_rcu() 함수가 이중 호출되는 경우 경고 메시지를 출력하고 함수를 빠져나간다.
코드 라인 22~23에서 rcu 인스턴스에 인수로 전달받은 콜백 함수를 대입한다. rcu는 리스트로 연결되는데 가장 마지막에 추가되므로 마지막을 의미하는 null을 대입한다.
코드 라인 25에서 cpu별 rcu 데이터를 표현한 자료 구조에서 현재 cpu에 대한 rcu 데이터를 rdp에 알아온다.
코드 라인 28~36에서 낮은 확률로 early 부트 중이라 disable 되었고, rcu 콜백 리스트가 준비되지 않은 경우 콜백 리스트를 초기화하고 enable 한다.
코드 라인 37~38에서 낮은 확률로 no-cb용 커널 스레드에서 처리할 콜백이 인입된 경우를 위한 처리를 수행한다.
코드 라인 40에서 rcu 콜백 함수를 콜백 리스트에 엔큐한다.
코드 라인 41~49에서 kfree 타입 또는 일반 타입의 rcu 요청에 맞게 trace 로그를 출력한다.
코드 라인 52~58에서 no-callback 처리용 스레드를 깨우거나 __call_rcu_core() 함수를 호출하여 rcu 코어 처리를 수행한다.

다음 그림은 __call_rcu() 함수를 호출하여 콜백을 추가하는 과정과 이를 처리하는 과정 두 가지를 모두 보여준다.

__call_rcu_core()

kernel/rcu/tree.c

/*
 * Handle any core-RCU processing required by a call_rcu() invocation.
 */

static void __call_rcu_core(struct rcu_data *rdp, struct rcu_head *head,
                            unsigned long flags)
{
        /*
         * If called from an extended quiescent state, invoke the RCU
         * core in order to force a re-evaluation of RCU's idleness.
         */
        if (!rcu_is_watching())
                invoke_rcu_core();

        /* If interrupts were disabled or CPU offline, don't invoke RCU core. */
        if (irqs_disabled_flags(flags) || cpu_is_offline(smp_processor_id()))
                return;

        /*
         * Force the grace period if too many callbacks or too long waiting.
         * Enforce hysteresis, and don't invoke rcu_force_quiescent_state()
         * if some other CPU has recently done so.  Also, don't bother
         * invoking rcu_force_quiescent_state() if the newly enqueued callback
         * is the only one waiting for a grace period to complete.
         */
        if (unlikely(rcu_segcblist_n_cbs(&rdp->cblist) >
                     rdp->qlen_last_fqs_check + qhimark)) {

                /* Are we ignoring a completed grace period? */
                note_gp_changes(rdp);

                /* Start a new grace period if one not already started. */
                if (!rcu_gp_in_progress()) {
                        rcu_accelerate_cbs_unlocked(rdp->mynode, rdp);
                } else {
                        /* Give the grace period a kick. */
                        rdp->blimit = DEFAULT_MAX_RCU_BLIMIT;
                        if (rcu_state.n_force_qs == rdp->n_force_qs_snap &&
                            rcu_segcblist_first_pend_cb(&rdp->cblist) != head)
                                rcu_force_quiescent_state();
                        rdp->n_force_qs_snap = rcu_state.n_force_qs;
                        rdp->qlen_last_fqs_check = rcu_segcblist_n_cbs(&rdp->cblist);
                }
        }
}

rcu 처리(새 콜백, qs, gp 등의 변화) 를 위해 rcu core를 호출한다.

코드 라인 8~9에서 현재 cpu가 extended-qs(idle 또는 idle 진입) 상태에서 호출된 경우 rcu core 처리를 위해 rcu softirq 또는 rcu core 스레드를 호출한다.
- 스케줄 틱에서도 rcu에 대한 softirq 호출을 하지만 더 빨리 처리하기 위함이다.
코드 라인 12~13에서 인터럽트가 disable된 상태인 경우이거나 offline 상태이면 함수를 빠져나간다.
코드 라인 22~26에서 낮은 확률로 rcu 콜백이 너무 많이 대기 중이면 force_qs 조건을 만족하는지 확인하기 전에 먼저 gp 상태 변경을 확인한다.
코드 라인 29~30에서 gp가 진행 중이 아니면 신규 콜백들을 묶어 가능한 경우 앞으로 accelerate 처리한다.
코드 라인 31~39에서 gp가 이미 진행 중이면 blimit 제한을 max(디폴트 10000개) 값으로 풀고 fqs(force quiescent state)를 진행한다.
- rcu_state.n_force_qs == rdp->n_force_qs_snap
  - 현재 cpu가 force 처리한 qs 수가 글로벌에 저장한 수와 일치하면
  - 다른 cpu가 force 하는 경우 글로벌 값이 변경되므로 이 경우에는 force qs 처리하지 않을 계획이다.

rcu_is_watching()

kernel/rcu/tree.c

/**
 * rcu_is_watching - see if RCU thinks that the current CPU is not idle
 *
 * Return true if RCU is watching the running CPU, which means that this
 * CPU can safely enter RCU read-side critical sections.  In other words,
 * if the current CPU is not in its idle loop or is in an interrupt or
 * NMI handler, return true.
 */

bool notrace rcu_is_watching(void)
{
        bool ret;

        preempt_disable_notrace();
        ret = !rcu_dynticks_curr_cpu_in_eqs();
        preempt_enable_notrace();
        return ret;
}
EXPORT_SYMBOL_GPL(rcu_is_watching);

rcu가 동작중인 cpu를 감시하는지 여부를 반환한다. (true=watching, nohz idle 상태가 아니거나 인터럽트 진행중일 때 안전하게 read-side 크리티컬 섹션에 진입 가능한 상태이다.)

invoke_rcu_core()

kernel/rcu/tree.c

/*
 * Wake up this CPU's rcuc kthread to do RCU core processing.
 */

static void invoke_rcu_core(void)
{
        if (!cpu_online(smp_processor_id()))
                return;
        if (use_softirq)
                raise_softirq(RCU_SOFTIRQ);
        else
                invoke_rcu_core_kthread();
}

rcu softirq가 enable(디폴트)된 경우 softirq를 호출하고, 그렇지 않은 경우 rcu core 스레드를 깨운다.

use_softirq
- “module/rcutree/parameters/use_softirq” 파라미터 값에 의해 결정된다. (디폴트=1)

rcu의 3단계 자료 관리

3단계 구조

rcu의 상태 관리를 위해 접근 비용 별로 3 단계의 구조로 관리한다.

rcu_state
- 글로벌에 하나만 존재하며 락을 사용하여 접근한다.
rcu_node
- 노드별로 구성되며, 하이라키 구조(노드 관리는 64개씩, leaf 노드는 16개)로 관리한다. (NUMA의 노드가 아니라 관리목적의 cpu들의 집합이다)
- 노드 락을 사용하여 접근한다.
rcu_data
- per-cpu 별로 구성된다.
- 로컬 cpu에 대한 접근 시에 사용하므로 락없이 사용된다.

약칭

rsp
- rcu_state 구조체 포인터
rnp
- rcu_node 구조체 포인터
rdp
- rcu_data 구조체 포인터

다음 그림과 같이 rcu 데이터 구조체들이 3 단계로 운영되고 있음을 보여준다.

다음 그림은 qs와 급행 qs의 보고에 사용되는 각 멤버들을 보여준다.

qs 보고 체계
- gp 시작 후
  - qs 체크가 필요한 경우 core_need_qs 및 cpu_no_qs를 1로 클리어한다.
  - qsmaskinitnext는 핫플러그 online cpu 상태가 변경될 때마다 이의 변화를 수용하고, qsmakinit에 이를 복사하여 사용한다. 그리고 이를 qsmaskinit에 복사한 후 매 gp가 시작될 때마다 qsmaskinit -> qsmask로 복사하여 qs를 체크할 준비를 한다.
- qs 보고
  - cpu에서 qs가 체크된 경우 core_need_qs와 cpu_no_qs를 모두 클리어한다.
  - 노드에 보고하여 해당 cpu에 대한 qsmask를 클리어한다. 하위 노드들의 qs가 모두 체크되면 상위 노드도 동일하게 전달하여 처리한다.
급행 qs 보고 체계
- gp 시작 후
  - 급행 qs 체크가 필요한 경우 core_need_qs 및 cpu_no_qs를 1로 클리어한다.
  - expmaskinitnext는 핫플러그 online cpu 상태가 변경될 때마다 이의 변화를 수용하고, expmakinit에 이를 복사하여 사용한다. 그리고 이를 매 gp가 시작될 때마다 expmaskinit -> expmask로 복사하여 급행 qs를 체크할 준비를 한다.
- 급행 qs 보고
  - cpu에서 급행 qs가 체크된 경우 core_need_qs와 cpu_no_qs를 모두 클리어한다.
  - 노드에 보고하여 해당 cpu에 대한 expmask를 클리어한다. 하위 노드들의 급행 qs가 모두 체크되면 상위 노드도 동일하게 전달하여 처리한다.
  - exp_deferred_qs의 경우는 disable(irq, bh, preemption) 상태에서 rcu_read_unlock()이 수행될 때에도 qs를 완료 시키면 안된다. 이 때 qs의 완료를 지연시키기 위해 사용된다.

gp 시퀀스 관리

기존 gpnum이 삭제되고, 새로운 gp 시퀀스 번호(gp_seq)가 커널 v4.19-rc1에서 소개되었다.

참고: rcu: Introduce grace-period sequence numbers

gp 시퀀스는 다음과 같이 두 개의 파트로 나누어 관리한다.

gp 카운터
- 새로운 gp가 시작될 때마다 gp 카운터가 증가된다.
- overflow에 대한 빠른 버그를 찾기위해 jiffies가 -300초에 해당하는 틱부터 시작하였다시피 gp 카운터도 -300부터 시작한다.
gp 상태
- idle(0)
  - gp가 동작하지 않는 idle 상태이다.
- start(1)
  - gp가 시작되어 동작하는 상태이다.
- 그 외의 번호는 srcu에서만 사용되므로 생략한다.

다음 그림과 같이 gp 시퀀스는 두 개의 파트로 나누어 관리한다.

다음 그림과 같이 새로운 gp가 시작하고 끝날때 마다 gp 시퀀스가 증가되는 모습을 보여준다.

위 3가지 구조체에 공통적으로 gp_seq가 담겨 있다. rcu 관리에서 최대한 락 사용을 억제하기 위해 per-cpu를 사용한 rcu_data는 로컬 cpu 데이터를 취급한다. 로컬 데이터는 해당 cpu가 필요한 시점에서만 갱신한다. 따라서 이로 인해 로컬 gp_seq는 글로벌 gp_seq 번호에 비해 최대 1 만큼 늦어질 수도 있다. 루트 노드의 값은 항상 rcu_state의 값들과 동시에 변경되지만 그 외의 노드들 값은 노드 락을 거는 시간이 필요하므로 약간씩 지연된다. (모든 값의 차이는 최대 1을 초과할 수 없다.)

gp 시퀀스 번호 관련 함수들

/*
 * Grace-period counter management.
 */

#define RCU_SEQ_CTR_SHIFT       2
#define RCU_SEQ_STATE_MASK      ((1 << RCU_SEQ_CTR_SHIFT) - 1)

rcu_seq_ctr()

kernel/rcu/rcu.h

/*
 * Return the counter portion of a sequence number previously returned
 * by rcu_seq_snap() or rcu_seq_current().
 */

static inline unsigned long rcu_seq_ctr(unsigned long s)
{
        return s >> RCU_SEQ_CTR_SHIFT;
}

gp 시퀀스의 카운터 부분만을 반환한다.

rcu_seq_state()

kernel/rcu/rcu.h

/*
 * Return the state portion of a sequence number previously returned
 * by rcu_seq_snap() or rcu_seq_current().
 */

static inline int rcu_seq_state(unsigned long s)
{
        return s & RCU_SEQ_STATE_MASK;
}

gp 시퀀스의 상태 값만을 반환한다. (0~3)

rcu_seq_set_state()

kernel/rcu/rcu.h

/*
 * Set the state portion of the pointed-to sequence number.
 * The caller is responsible for preventing conflicting updates.
 */

static inline void rcu_seq_set_state(unsigned long *sp, int newstate)
{
        WARN_ON_ONCE(newstate & ~RCU_SEQ_STATE_MASK);
        WRITE_ONCE(*sp, (*sp & ~RCU_SEQ_STATE_MASK) + newstate);
}

gp 시퀀스의 상태 부분을 @newstate로 갱신한다.

rcu_seq_start()

kernel/rcu/rcu.h

/* Adjust sequence number for start of update-side operation. */
static inline void rcu_seq_start(unsigned long *sp)
{
        WRITE_ONCE(*sp, *sp + 1);
        smp_mb(); /* Ensure update-side operation after counter increment. */
        WARN_ON_ONCE(rcu_seq_state(*sp) != 1);
}

gp 시작을 위해 gp 시퀀스를 1 증가시킨다. (rcu_seq_end() 함수가 호출된 후 gp idle 상태인대 이후에 gp 시작을 처리한다)

rcu_seq_endval()

kernel/rcu/rcu.h

/* Compute the end-of-grace-period value for the specified sequence number. */
static inline unsigned long rcu_seq_endval(unsigned long *sp)
{
        return (*sp | RCU_SEQ_STATE_MASK) + 1;
}

gp 종료를 위해 gp 시퀀스 @sp의 카운터 부분을 1 증가시키고, 상태 부분은 0(gp idle)인 값을 반환한다.

rcu_seq_end()

kernel/rcu/rcu.h

/* Adjust sequence number for end of update-side operation. */
static inline void rcu_seq_end(unsigned long *sp)
{
        smp_mb(); /* Ensure update-side operation before counter increment. */
        WARN_ON_ONCE(!rcu_seq_state(*sp));
        WRITE_ONCE(*sp, rcu_seq_endval(sp));
}

gp 종료를 위해 gp 시퀀스 @sp의 카운터 부분을 1 증가시키고, 상태 부분은 0(gp idle)으로 변경한다.

rcu_seq_snap()

kernel/rcu/rcu.h

/*
 * rcu_seq_snap - Take a snapshot of the update side's sequence number.
 *
 * This function returns the earliest value of the grace-period sequence number
 * that will indicate that a full grace period has elapsed since the current
 * time.  Once the grace-period sequence number has reached this value, it will
 * be safe to invoke all callbacks that have been registered prior to the
 * current time. This value is the current grace-period number plus two to the
 * power of the number of low-order bits reserved for state, then rounded up to
 * the next value in which the state bits are all zero.
 */

static inline unsigned long rcu_seq_snap(unsigned long *sp)
{
        unsigned long s;

        s = (READ_ONCE(*sp) + 2 * RCU_SEQ_STATE_MASK + 1) & ~RCU_SEQ_STATE_MASK;
        smp_mb(); /* Above access must not bleed into critical section. */
        return s;
}

현재까지 등록된 콜백이 안전하게 처리될 수 있는 가장 빠른 update side의 gp 시퀀스 번호를 알아온다. (gp가 진행 중인 경우 안전하게 두 단계 뒤의 gp 시퀀스를 반환하고, gp가 idle 상태인 경우 다음 단계 뒤의 시퀀스 번호를 반환한다. 반환 되는 gp 시퀀스의 상태는 idle 이다.)

예) sp=12 (gp idle)
- 16
예) sp=9 (gp start)
- 16

다음 그림은 gp 시퀀스의 스냅샷 값을 알아오는 모습을 보여준다.

rcu_seq_current()

kernel/rcu/rcu.h

/* Return the current value the update side's sequence number, no ordering. */

static inline unsigned long rcu_seq_current(unsigned long *sp)
{
        return READ_ONCE(*sp);
}

update side의 현재 gp 시퀀스 값을 반환한다.

rcu_seq_started()

kernel/rcu/rcu.h

/*
 * Given a snapshot from rcu_seq_snap(), determine whether or not the
 * corresponding update-side operation has started.
 */

static inline bool rcu_seq_started(unsigned long *sp, unsigned long s)
{
        return ULONG_CMP_LT((s - 1) & ~RCU_SEQ_STATE_MASK, READ_ONCE(*sp));
}

스냅샷 @s를 통해 gp가 시작되었는지 여부를 반환한다.

참고로 스냅샷 @s의 하위 2비트는 0으로 항상 gp_idle 상태 값을 반환한다.
rounddown(@s-1, 4) < @sp

다음 그림은 rcu_seq_started() 함수를 통해 스냅샷 기준으로 gp가 이미 시작되었는지 여부를 알아온다.

gp 시퀀스가 9 또는 12에 있을 때 스냅샷을 발급하면 16이다. 이후 gp 시퀀스가 13이 되는 순간 gp가 시작되어 이 함수가 true를 반환한다.

rcu_seq_done()

kernel/rcu/rcu.h

/*
 * Given a snapshot from rcu_seq_snap(), determine whether or not a
 * full update-side operation has occurred.
 */

static inline bool rcu_seq_done(unsigned long *sp, unsigned long s)
{
        return ULONG_CMP_GE(READ_ONCE(*sp), s);
}

스냅샷 @s를 통해 gp가 완료되었는지 여부를 반환한다.

이 함수는 nocb 운용 시 rcu_segcblist_nextgp() 함수에서 반환된 wait 구간의 gp 시퀀스 값을 스냅샷 @s 값으로 사용하며, gp 시퀀스가 이 값에 도달하는 순간 true를 반환한다.
- 참고로 세그먼트 콜백리스트 각 구간의 gp 시퀀스 값은 스냅샷 값을 사용하므로 항상 gp idle 상태이다.
@sp >= @s

다음 그림은 rcu_seq_done() 함수를 통해 스냅샷 기준으로 gp가 완료되었는지 여부를 알아온다.

gp 시퀀스가 9 또는 12에 있을 때 스냅샷을 발급하면 16이다. 이후 gp 시퀀스가 13이 되는 순간 gp가 시작되어 이 함수가 true를 반환한다.

rcu_seq_completed_gp()

kernel/rcu/rcu.h

/*
 * Has a grace period completed since the time the old gp_seq was collected?
 */

static inline bool rcu_seq_completed_gp(unsigned long old, unsigned long new)
{
        return ULONG_CMP_LT(old, new & ~RCU_SEQ_STATE_MASK);
}

노드의 gp 시퀀스 @new가 기존 cpu가 진행하던 gp 시퀀스 @old를 초과하는 것으로 gp를 완료하면 true를 반환한다.

@old < rounddown(@new, 4)
예) old=5, new=9
- 5 < 8 = true

다음 그림은 rcu_seq_completed_gp() 함수를 통해 new gp 기준으로 old gp가 완료되었는지 여부를 알아온다.

노드의 gp 시퀀스가 기존 cpu가 진행하던 gp를 완료하면 true를 반환한다.
- 현재 cpu의 gp 시퀀스가 #3 구간의 gp를 진행하고 있을 때(12, 13), 노드의 gp 시퀀스가 기존 #3 구간을 끝내고 새 구간으로 진입하면 이 함수가 true를 반환한다.

rcu_seq_new_gp()

kernel/rcu/rcu.h

/*
 * Has a grace period started since the time the old gp_seq was collected?
 */

static inline bool rcu_seq_new_gp(unsigned long old, unsigned long new)
{
        return ULONG_CMP_LT((old + RCU_SEQ_STATE_MASK) & ~RCU_SEQ_STATE_MASK,
                            new);
}

gp 시퀀스 old 이후로 새로운 gp가 시작되었는지 여부를 반환한다.

roundup(@old, 4) < @new
예) old=5, new=9
- 8 < 9 = true

다음 그림은 rcu_seq_new_gp() 함수를 통해 old gp 이후로 new gp가 시작되었는지 여부를 알아온다.

현재 cpu의 gp 시퀀스 구간보다 노드의 gp 시퀀스가 새로운 gp를 시작하게되면 true를 반환한다.
- 현재 cpu의 gp 시퀀스가 #2 구간의 gp를 진행하고 있거나(9) #3 구간에서 gp가 idle 중일 때, 노드의 gp 시퀀스가 새 구간 #3의 gp를 시작(13)하면 이 함수가 true를 반환한다.

rcu_seq_diff()

kernel/rcu/rcu.h

/*
 * Roughly how many full grace periods have elapsed between the collection
 * of the two specified grace periods?
 */

static inline unsigned long rcu_seq_diff(unsigned long new, unsigned long old)
{
        unsigned long rnd_diff;

        if (old == new)
                return 0;
        /*
         * Compute the number of grace periods (still shifted up), plus
         * one if either of new and old is not an exact grace period.
         */
        rnd_diff = (new & ~RCU_SEQ_STATE_MASK) -
                   ((old + RCU_SEQ_STATE_MASK) & ~RCU_SEQ_STATE_MASK) +
                   ((new & RCU_SEQ_STATE_MASK) || (old & RCU_SEQ_STATE_MASK));
        if (ULONG_CMP_GE(RCU_SEQ_STATE_MASK, rnd_diff))
                return 1; /* Definitely no grace period has elapsed. */
        return ((rnd_diff - RCU_SEQ_STATE_MASK - 1) >> RCU_SEQ_CTR_SHIFT) + 2;
}

두 개의 gp 시퀀스 간에 소요된 gp 수를 반환한다.

코드 라인 5~6에서 두 값이 동일한 경우 0을 반환한다.
코드 라인 11~13에서 rnd_diff 값을 다음과 같이 구한다.
- = 내림 @new – 올림 @old + @new 상태 || @old 상태
코드 라인 14~15에서 rnd_diff 값이 3 미만인 경우 1을 반환한다.
코드 라인 16에서 다음 값을 반환한다.
- (rnd_diff – 4) >> 2 + 2

RCU CB 처리 (softirq)

rcu_core_si()

kernel/rcu/tree.c

static void rcu_core_si(struct softirq_action *h)
{
        rcu_core();
}

완료된 rcu 콜백들을 호출하여 처리한다.

rcu_core()

kernel/rcu/tree.c

/* Perform RCU core processing work for the current CPU.  */

static __latent_entropy void rcu_core(void)
{
        unsigned long flags;
        struct rcu_data *rdp = raw_cpu_ptr(&rcu_data);
        struct rcu_node *rnp = rdp->mynode;
        const bool offloaded = IS_ENABLED(CONFIG_RCU_NOCB_CPU) &&
                               rcu_segcblist_is_offloaded(&rdp->cblist);

        if (cpu_is_offline(smp_processor_id()))
                return;
        trace_rcu_utilization(TPS("Start RCU core"));
        WARN_ON_ONCE(!rdp->beenonline);

        /* Report any deferred quiescent states if preemption enabled. */
        if (!(preempt_count() & PREEMPT_MASK)) {
                rcu_preempt_deferred_qs(current);
        } else if (rcu_preempt_need_deferred_qs(current)) {
                set_tsk_need_resched(current);
                set_preempt_need_resched();
        }

        /* Update RCU state based on any recent quiescent states. */
        rcu_check_quiescent_state(rdp);

        /* No grace period and unregistered callbacks? */
        if (!rcu_gp_in_progress() &&
            rcu_segcblist_is_enabled(&rdp->cblist) && !offloaded) {
                local_irq_save(flags);
                if (!rcu_segcblist_restempty(&rdp->cblist, RCU_NEXT_READY_TAIL))
                        rcu_accelerate_cbs_unlocked(rnp, rdp);
                local_irq_restore(flags);
        }

        rcu_check_gp_start_stall(rnp, rdp, rcu_jiffies_till_stall_check());

        /* If there are callbacks ready, invoke them. */
        if (!offloaded && rcu_segcblist_ready_cbs(&rdp->cblist) &&
            likely(READ_ONCE(rcu_scheduler_fully_active)))
                rcu_do_batch(rdp);

        /* Do any needed deferred wakeups of rcuo kthreads. */
        do_nocb_deferred_wakeup(rdp);
        trace_rcu_utilization(TPS("End RCU core"));
}

완료된 rcu 콜백들을 호출하여 처리한다.

코드 라인 9~10에서 cpu가 offline 상태인 경우 함수를 빠져나간다.
코드 라인 15~20에서 preempt 가능한 상태인 경우(preempt 카운터=0)인 경우 deferred qs를 처리한다. 그렇지 않고 deferred qs가 pending 상태인 경우 리스케줄 요청을 수행한다.
- deferred qs 상태인 경우 deferred qs를 해제하고, blocked 상태인 경우 blocked 해제 후 qs를 보고한다.
코드 라인 23에서 현재 cpu에 대해 새 gp가 시작되었는지 체크한다. 또한 qs 상태를 체크하고 패스된 경우 rdp에 기록하여 상위 노드로 보고하게 한다.
코드 라인 26~32에서 gp가 idle 상태이면서 새로운 콜백이 존재하고, 필요 시 acceleration을 수행한다.
코드 라인 34에서 gp 요청을 체크한다.
코드 라인 37~39에서 완료된 콜백이 있는 경우 rcu 콜백들을 호출한다.
코드 라인 42에서 rcu no-callback을 위한 처리를 한다

rcu_do_batch()

kernel/rcu/tree.c -1/2-

/*
 * Invoke any RCU callbacks that have made it to the end of their grace
 * period.  Thottle as specified by rdp->blimit.
 */

static void rcu_do_batch(struct rcu_data *rdp)
{
        unsigned long flags;
        const bool offloaded = IS_ENABLED(CONFIG_RCU_NOCB_CPU) &&
                               rcu_segcblist_is_offloaded(&rdp->cblist);
        struct rcu_head *rhp;
        struct rcu_cblist rcl = RCU_CBLIST_INITIALIZER(rcl);
        long bl, count;
        long pending, tlimit = 0;

        /* If no callbacks are ready, just return. */
        if (!rcu_segcblist_ready_cbs(&rdp->cblist)) {
                trace_rcu_batch_start(rcu_state.name,
                                      rcu_segcblist_n_lazy_cbs(&rdp->cblist),
                                      rcu_segcblist_n_cbs(&rdp->cblist), 0);
                trace_rcu_batch_end(rcu_state.name, 0,
                                    !rcu_segcblist_empty(&rdp->cblist),
                                    need_resched(), is_idle_task(current),
                                    rcu_is_callbacks_kthread());
                return;
        }

        /*
         * Extract the list of ready callbacks, disabling to prevent
         * races with call_rcu() from interrupt handlers.  Leave the
         * callback counts, as rcu_barrier() needs to be conservative.
         */
        local_irq_save(flags);
        rcu_nocb_lock(rdp);
        WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));
        pending = rcu_segcblist_n_cbs(&rdp->cblist);
        bl = max(rdp->blimit, pending >> rcu_divisor);
        if (unlikely(bl > 100))
                tlimit = local_clock() + rcu_resched_ns;
        trace_rcu_batch_start(rcu_state.name,
                              rcu_segcblist_n_lazy_cbs(&rdp->cblist),
                              rcu_segcblist_n_cbs(&rdp->cblist), bl);
        rcu_segcblist_extract_done_cbs(&rdp->cblist, &rcl);
        if (offloaded)
                rdp->qlen_last_fqs_check = rcu_segcblist_n_cbs(&rdp->cblist);
        rcu_nocb_unlock_irqrestore(rdp, flags);

seg 콜백리스트의 완료 구간에서 대기중인 rcu 콜백들을 호출한다.

코드 라인 4~5에서 set 콜백리스트를 no-cb offload 처리하는지 여부를 알아온다.
- softirq에서 콜백을 처리하지 않고, no-cb 스레드에 떠넘겨 처리하는 경우인지를 알아온다.
코드 라인 12~21에서 seg 콜백 리스트의 완료 구간에서 대기중인 rcu 콜백들이 하나도 없는 경우 그냥 함수를 빠져나간다.
코드 라인 31~34에서 blimit 값을 구하고, 이 값이 100을 초과하는 경우 현재 시각보다 3ms 후로 tlimit 제한 시각을 구한다.
코드 라인 38에서 rcu seg 콜백리스트의 done 구간의 콜백들을 extract하여 rcl 리스트로 옮긴다.
코드 라인 39~40에서 콜백을 no-cb 스레드에서 처리해야 하는 경우 rdp->qlen_last_fqs_check에 콜백 수를 대입한다.

kernel/rcu/tree.c -2/2-

        /* Invoke callbacks. */
        rhp = rcu_cblist_dequeue(&rcl);
        for (; rhp; rhp = rcu_cblist_dequeue(&rcl)) {
                debug_rcu_head_unqueue(rhp);
                if (__rcu_reclaim(rcu_state.name, rhp))
                        rcu_cblist_dequeued_lazy(&rcl);
                /*
                 * Stop only if limit reached and CPU has something to do.
                 * Note: The rcl structure counts down from zero.
                 */
                if (-rcl.len >= bl && !offloaded &&
                    (need_resched() ||
                     (!is_idle_task(current) && !rcu_is_callbacks_kthread())))
                        break;
                if (unlikely(tlimit)) {
                        /* only call local_clock() every 32 callbacks */
                        if (likely((-rcl.len & 31) || local_clock() < tlimit))
                                continue;
                        /* Exceeded the time limit, so leave. */
                        break;
                }
                if (offloaded) {
                        WARN_ON_ONCE(in_serving_softirq());
                        local_bh_enable();
                        lockdep_assert_irqs_enabled();
                        cond_resched_tasks_rcu_qs();
                        lockdep_assert_irqs_enabled();
                        local_bh_disable();
                }
        }

        local_irq_save(flags);
        rcu_nocb_lock(rdp);
        count = -rcl.len;
        trace_rcu_batch_end(rcu_state.name, count, !!rcl.head, need_resched(),
                            is_idle_task(current), rcu_is_callbacks_kthread());

        /* Update counts and requeue any remaining callbacks. */
        rcu_segcblist_insert_done_cbs(&rdp->cblist, &rcl);
        smp_mb(); /* List handling before counting for rcu_barrier(). */
        rcu_segcblist_insert_count(&rdp->cblist, &rcl);

        /* Reinstate batch limit if we have worked down the excess. */
        count = rcu_segcblist_n_cbs(&rdp->cblist);
        if (rdp->blimit >= DEFAULT_MAX_RCU_BLIMIT && count <= qlowmark)
                rdp->blimit = blimit;

        /* Reset ->qlen_last_fqs_check trigger if enough CBs have drained. */
        if (count == 0 && rdp->qlen_last_fqs_check != 0) {
                rdp->qlen_last_fqs_check = 0;
                rdp->n_force_qs_snap = rcu_state.n_force_qs;
        } else if (count < rdp->qlen_last_fqs_check - qhimark)
                rdp->qlen_last_fqs_check = count;

        /*
         * The following usually indicates a double call_rcu().  To track
         * this down, try building with CONFIG_DEBUG_OBJECTS_RCU_HEAD=y.
         */
        WARN_ON_ONCE(count == 0 && !rcu_segcblist_empty(&rdp->cblist));
        WARN_ON_ONCE(!IS_ENABLED(CONFIG_RCU_NOCB_CPU) &&
                     count != 0 && rcu_segcblist_empty(&rdp->cblist));

        rcu_nocb_unlock_irqrestore(rdp, flags);

        /* Re-invoke RCU core processing if there are callbacks remaining. */
        if (!offloaded && rcu_segcblist_ready_cbs(&rdp->cblist))
                invoke_rcu_core();
}

코드 라인 2~3에서 seg 콜백 리스트를 순회하며 콜백을 하나씩 디큐해온다.
코드 라인 5~6에서 rcu 콜백을 recliaim 호출하여 처리하고, kfree용 콜백인 경우 lazy 카운터를 감소시킨다.
코드 라인 11~14에서 호출된 콜백 수가 blimit 이상이면서 offload되지 않으며 다음 조건 중 하나라도 만족하는 경우 콜백을 그만 처리하기 위해 루프를 벗어난다.
- 조건
  - 리스케줄 요청이 있는 경우
  - 현재 태스크가 idle 태스크도 아니고 no-cb 커널 스레드도 아닌 경우
- 임시로 사용되는 로컬 콜백리스트의 len 멤버는 0부터 시작하여 디큐되어 호출될때마다 1씩 감소한다. 따라서 이 값에는 콜백 호출된 수가 음수 값으로 담겨 있게 된다.
코드 라인 15~21에서 낮은 확률로 tlimit 제한 시각(3ms)이 설정된 경우 매 32개의 콜백을 처리할 때 마다 제한 시간을 초과한 경우 그만 처리하기 위해 루프를 벗어난다.
코드 라인 22~29에서 no-cb 커널 스레드에서 콜백들을 처리하도록 offload된 경우 현재 태스크의 rcu_tasks_holdout 멤버를 false로 변경한다.
코드 라인 34에서 호출한 콜백 수를 count 변수로 알아온다.
코드 라인 39~41에서 처리하지 않고 남은 콜백들을 다음에 처리하기 위해 다시 seg 콜백 리스트의 done 구간에 추가하고, 콜백 수도 추가한다.
코드 라인 44~46에서 rdp->blimit가 10000개 이상이고, seg 콜백 리스트에 있는 콜백 수가 qlowmark(디폴트=100) 이하인 경우 rdp->blimit을 blimit 값으로 재설정한다.
코드 라인 49~53에서 로컬 콜백 리스트의 콜백들을 모두 처리하여 count가 0이고, rdp->qlen_last_fqs_check가 0이 아닌 경우 이 값을 0으로 리셋한다. 만일 다 처리하지 않고 남은 수가 rdp->qlen_last_fqs_check – qhimark 보다 작은 경우 rdp->qlen_last_fqs_check 값을 남은 count 값으로 대입한다.
코드 라인 66~67에서 offloaded되지 않고 done 구간에 남은 콜백들이 여전히 남아 있는 경우 softirq를 호출하여 계속 처리하게 한다.

__rcu_reclaim()

kernel/rcu/rcu.h

/*
 * Reclaim the specified callback, either by invoking it (non-lazy case)
 * or freeing it directly (lazy case).  Return true if lazy, false otherwise.
 */

static inline bool __rcu_reclaim(const char *rn, struct rcu_head *head)
{
        rcu_callback_t f;
        unsigned long offset = (unsigned long)head->func;

        rcu_lock_acquire(&rcu_callback_map);
        if (__is_kfree_rcu_offset(offset)) {
                trace_rcu_invoke_kfree_callback(rn, head, offset);
                kfree((void *)head - offset);
                rcu_lock_release(&rcu_callback_map);
                return true;
        } else {
                trace_rcu_invoke_callback(rn, head);
                f = head->func;
                WRITE_ONCE(head->func, (rcu_callback_t)0L);
                f(head);
                rcu_lock_release(&rcu_callback_map);
                return false;
        }
}

rcu 콜백을 recliaim 처리한다. (kfree용 콜백인 경우 kfree 후 true를 반환하고, 그 외의 경우 해당 콜백을 호출한 후 false를 반환한다.)

코드 라인 7~11에서 rcu 콜백에 함수가 아닌 kfree용 rcu offset이 담긴 경우 이를 통해 kfree를 수행하고 true를 반환한다.
코드 라인 12~19에서 rcu 콜백에 함수가 담긴 경우 이를 호출하고 false를 반환한다.

구조체

rcu_state 구조체

kernel/rcu/tree.h

/*
 * RCU global state, including node hierarchy.  This hierarchy is
 * represented in "heap" form in a dense array.  The root (first level)
 * of the hierarchy is in ->node[0] (referenced by ->level[0]), the second
 * level in ->node[1] through ->node[m] (->node[1] referenced by ->level[1]),
 * and the third level in ->node[m+1] and following (->node[m+1] referenced
 * by ->level[2]).  The number of levels is determined by the number of
 * CPUs and by CONFIG_RCU_FANOUT.  Small systems will have a "hierarchy"
 * consisting of a single rcu_node.
 */

struct rcu_state {
        struct rcu_node node[NUM_RCU_NODES];    /* Hierarchy. */
        struct rcu_node *level[RCU_NUM_LVLS + 1];
                                                /* Hierarchy levels (+1 to */
                                                /*  shut bogus gcc warning) */
        int ncpus;                              /* # CPUs seen so far. */

        /* The following fields are guarded by the root rcu_node's lock. */

        u8      boost ____cacheline_internodealigned_in_smp;
                                                /* Subject to priority boost. */
        unsigned long gp_seq;                   /* Grace-period sequence #. */
        struct task_struct *gp_kthread;         /* Task for grace periods. */
        struct swait_queue_head gp_wq;          /* Where GP task waits. */
        short gp_flags;                         /* Commands for GP task. */
        short gp_state;                         /* GP kthread sleep state. */
        unsigned long gp_wake_time;             /* Last GP kthread wake. */
        unsigned long gp_wake_seq;              /* ->gp_seq at ^^^. */

        /* End of fields guarded by root rcu_node's lock. */

        struct mutex barrier_mutex;             /* Guards barrier fields. */
        atomic_t barrier_cpu_count;             /* # CPUs waiting on. */
        struct completion barrier_completion;   /* Wake at barrier end. */
        unsigned long barrier_sequence;         /* ++ at start and end of */
                                                /*  rcu_barrier(). */
        /* End of fields guarded by barrier_mutex. */

        struct mutex exp_mutex;                 /* Serialize expedited GP. */
        struct mutex exp_wake_mutex;            /* Serialize wakeup. */
        unsigned long expedited_sequence;       /* Take a ticket. */
        atomic_t expedited_need_qs;             /* # CPUs left to check in. */
        struct swait_queue_head expedited_wq;   /* Wait for check-ins. */
        int ncpus_snap;                         /* # CPUs seen last time. */

        unsigned long jiffies_force_qs;         /* Time at which to invoke */
                                                /*  force_quiescent_state(). */
        unsigned long jiffies_kick_kthreads;    /* Time at which to kick */
                                                /*  kthreads, if configured. */
        unsigned long n_force_qs;               /* Number of calls to */
                                                /*  force_quiescent_state(). */
        unsigned long gp_start;                 /* Time at which GP started, */
                                                /*  but in jiffies. */
        unsigned long gp_end;                   /* Time last GP ended, again */
                                                /*  in jiffies. */
        unsigned long gp_activity;              /* Time of last GP kthread */
                                                /*  activity in jiffies. */
        unsigned long gp_req_activity;          /* Time of last GP request */
                                                /*  in jiffies. */
        unsigned long jiffies_stall;            /* Time at which to check */
                                                /*  for CPU stalls. */
        unsigned long jiffies_resched;          /* Time at which to resched */
                                                /*  a reluctant CPU. */
        unsigned long n_force_qs_gpstart;       /* Snapshot of n_force_qs at */
                                                /*  GP start. */
        unsigned long gp_max;                   /* Maximum GP duration in */
                                                /*  jiffies. */
        const char *name;                       /* Name of structure. */
        char abbr;                              /* Abbreviated name. */

        raw_spinlock_t ofl_lock ____cacheline_internodealigned_in_smp;
                                                /* Synchronize offline with */
                                                /*  GP pre-initialization. */
};

rcu 글로벌 상태를 관리하는 구조체이다.

node[]
- 컴파일 타임에 산출된 NUM_RCU_NODES 수 만큼 rcu_node 들이 배열에 배치된다.
*level[]
- 각 레벨의 (0(top) 레벨부터 최대 3레벨까지) 첫 rcu_node를 가리킨다.
- 최소 노드가 1개 이상 존재하므로 level[0]는 항상 node[0]를 가리킨다.
- 노드가 2개 이상되면 level[1]은 node[1]을 가리킨다.
ncpus
- cpu 수
gp_seq
- 현재 grace period 번호
- overflow에 관련된 에러가 발생하는 것에 대해 빠르게 감지하기 위해 -300UL부터 시작한다.
*gp_kthread
- grace period를 관리하는 커널 스레드이다.
- “rcu_preempt” 라는 이름을 사용한다.
gp_wq
- gp 커널 스레드가 대기하는 wait queue이다.
gp_flags
- gp 커널 스레드에 대한 명령이다.
- 2개의 gp 플래그를 사용한다.
  - RCU_GP_FLAG_INIT(0x1) -> gp 시작 필요
  - RCU_GP_FLAG_FQS(0x2) -> fqs 필요
gp_state
- gp 커널 스레드의 상태이다.
  - RCU_GP_IDLE(0) – gp가 동작하지 않는 상태
  - RCU_GP_WAIT_GPS(1) – gp 시작 대기
  - RCU_GP_DONE_GPS(2) – gp 시작을 위해 완료 대기
  - RCU_GP_ONOFF(3) – gp 초기화 핫플러그
  - RCU_GP_INIT(4) – gp 초기화
  - RCU_GP_WAIT_FQS(5) – fqs 시간 대기
  - RCU_GP_DOING_FQS(6) – fqs 시간 완료 대기
  - RCU_GP_CLEANUP(7) – gp 클린업 시작
  - RCU_GP_CLEANED(8) – gp 클린업 완료
gp_wake_time
- 마지막 gp kthread 깨어난 시각
gp_wake_seq
- gp kthread 깨어났을 때의 gp_seq
barrier_mutex
- rcu_barrier() 함수에서 사용하는 베리어 뮤텍스
barrier_cpu_count
- 베리어 대기중인 cpu 수
barrier_completion
- 베리어 완료를 기다리기 위해 사용되는 completion
barrier_sequence
- rcu_barrier()에서 사용하는 베리어용 gp 시퀀스
exp_mutex
- expedited gp를 순서대로 처리하기 위한 뮤텍스 락
exp_wake_mutex
- 순서대로 wakeup하기 위한 뮤텍스 락
expedited_sequence
- 급행 grace period 시퀀스 번호
expedited_need_qs
- expedited qs를 위해 남은 cpu 수
sexpedited_wq
- 체크인 대기 워크큐
ncpus_snap
- 지난 스캔에서 사용했던 cpu 수가 담긴다.
jiffies_force_qs
- force_quiescent_state() -> gp kthread -> rcu_gp_fqs() 함수를 호출하여 fqs를 해야 할 시각(jiffies)
- gp 커널 스레드에서 gp가 완료되지 않아 강제로 fqs를 해야할 때까지 기다릴 시각이 담긴다.
jiffies_kick_kthreads
- 현재 gp에서 이 값으로 지정된 시간이 흘러 stall된 커널 스레드를 깨우기 위한 시각이 담긴다.
- 2 * jiffies_till_first_fqs 시간을 사용한다.
n_force_qs
- force_quiescent_state() -> gp kthread -> rcu_gp_fqs() 함수를 호출하여 fqs를 수행한 횟수
- 각 cpu(rcu_data)에서 수행한 값이 글로벌(rcu_state)에 갱신되며, 해당 cpu에 fqs를 수행했던 이후로 변경된 적이 없는지 확인하기 위해 사용된다.
gp_start
- gp 시작된 시각(틱)
gp_activity
- gp kthread가 동작했던 마지막 시각(틱)
gp_req_activity
- gp 시작 및 종료 요청 시각(틱)이 담기며, gp stall 경과 시각을 체크할 때 사용한다.
jiffies_stall
- cpu stall을 체크할 시각(틱)
jiffies_resched
- cpu stall을 체크하는 시간의 절반의 시각(jiffies)으로 설정되고 필요에 따라 5 틱씩 증가
- 이 시각이 되면 리스케줄한다.
n_force_qs_gpstart
- gp가 시작될 때마다 fqs를 수행한 횟수(n_force_qs)로 복사(snapshot)된다.
gp_max
- 최대 gp 기간(jiffies)
*name
- rcu 명
- “rcu_sched”, “rcu_bh”, “rcu_preempt”
abbr
- 축약된 1 자리 rcu 명
- ‘s’, ‘b’, ‘p’

rcu_node 구조체

kernel/rcu/tree.h – 1/2

/*
 * Definition for node within the RCU grace-period-detection hierarchy.
 */

struct rcu_node {
        raw_spinlock_t __private lock;  /* Root rcu_node's lock protects */
                                        /*  some rcu_state fields as well as */
                                        /*  following. */
        unsigned long gp_seq;   /* Track rsp->rcu_gp_seq. */
        unsigned long gp_seq_needed; /* Track furthest future GP request. */
        unsigned long completedqs; /* All QSes done for this node. */
        unsigned long qsmask;   /* CPUs or groups that need to switch in */
                                /*  order for current grace period to proceed.*/
                                /*  In leaf rcu_node, each bit corresponds to */
                                /*  an rcu_data structure, otherwise, each */
                                /*  bit corresponds to a child rcu_node */
                                /*  structure. */
        unsigned long rcu_gp_init_mask; /* Mask of offline CPUs at GP init. */
        unsigned long qsmaskinit;
                                /* Per-GP initial value for qsmask. */
                                /*  Initialized from ->qsmaskinitnext at the */
                                /*  beginning of each grace period. */
        unsigned long qsmaskinitnext;
                                /* Online CPUs for next grace period. */
        unsigned long expmask;  /* CPUs or groups that need to check in */
                                /*  to allow the current expedited GP */
                                /*  to complete. */
        unsigned long expmaskinit;
                                /* Per-GP initial values for expmask. */
                                /*  Initialized from ->expmaskinitnext at the */
                                /*  beginning of each expedited GP. */
        unsigned long expmaskinitnext;
                                /* Online CPUs for next expedited GP. */
                                /*  Any CPU that has ever been online will */
                                /*  have its bit set. */
        unsigned long ffmask;   /* Fully functional CPUs. */
        unsigned long grpmask;  /* Mask to apply to parent qsmask. */
                                /*  Only one bit will be set in this mask. */
        int     grplo;          /* lowest-numbered CPU or group here. */
        int     grphi;          /* highest-numbered CPU or group here. */
        u8      grpnum;         /* CPU/group number for next level up. */
        u8      level;          /* root is at level 0. */
        bool    wait_blkd_tasks;/* Necessary to wait for blocked tasks to */
                                /*  exit RCU read-side critical sections */
                                /*  before propagating offline up the */
                                /*  rcu_node tree? */
        struct rcu_node *parent;
        struct list_head blkd_tasks;
                                /* Tasks blocked in RCU read-side critical */
                                /*  section.  Tasks are placed at the head */
                                /*  of this list and age towards the tail. */
        struct list_head *gp_tasks;
                                /* Pointer to the first task blocking the */
                                /*  current grace period, or NULL if there */
                                /*  is no such task. */
        struct list_head *exp_tasks;
                                /* Pointer to the first task blocking the */
                                /*  current expedited grace period, or NULL */
                                /*  if there is no such task.  If there */
                                /*  is no current expedited grace period, */
                                /*  then there can cannot be any such task. */

gp 감지를 포함하는 rcu 노드 구조체이다.

lock
- rcu 노드 락
gp_seq
- 이 노드에 대한 gp 번호
gp_seq_needed
- 이 노드에 요청한 새 gp 번호
completedqs
- 이 노드가 qs 되었을때의 completed 번호
qsmask
- leaf 노드의 경우 qs를 보고해야 할 rcu_data에 대응하는 비트가 1로 설정된다.
- leaf 노드가 아닌 경우 qs를 보고해야 할 child 노드에 대응하는 비트가 1로 설정된다.
rcu_gp_init_mask
- gp 초기화시의 offline cpumask
qsmaskinit
- gp가 시작할 때마다 적용되는 초기값
- qsmask & expmask
qsmaskinitnext
- 다음 gp를 위한 online cpumask
expmask
- preemptible-rcu에서만 사용되며, 빠른(expedited) grace period가 진행중인 경우 설정된다..
- leaf 노드가 아닌 노드들의 초기값은 qsmaskinit으로한다. (snapshot)
expmaskinit
- expmask의 per-gp 초기 값
expmaskinitnext
- 다음 pepedited gp를 위한 online cpumask
ffmask
- fully functional cpus
grpmask
- 이 노드가 부모 노드의 qsmask에 대응하는 비트 값이다.
- 예) 512개의 cpu를 위해 루트노드에 32개의 leaf 노드가 있을 때 각 leaf 노드의 grpmask 값은 각각 0x1, 0x2, … 0x8000_0000이다.
grplo
- 이 노드가 관리하는 시작 cpu 번호
- 예) grplo=48, grphi=63
grphi
- 이 노드가 관리하는 끝 cpu 번호
- 예) grplo=48, grphi=63
grpnum
- 상위 노드에서 볼 때 이 노드에 해당하는 그룹번호(32bits: 0~31, 64bits: 0~63)
- grpmask에 설정된 비트 번호와 같다.
  - 예) grpmask=0x8000_0000인 경우 bit31이 설정되어 있다. 이러한 경우 grpnum=31이다.
level
- 이 노드에 해당하는 노드 레벨
- 루트 노드는 0이다.
- 최대 값은 3이다. (최대 4 단계의 레벨 구성이 가능하다)
wait_blkd_tasks
- read side critcal section에서 preemption된 블럭드 태스크가 있는지 여부
*parent
- 부모 노드를 가리킨다.
blkd_tasks
- preemptible 커널의 read side critical section에서 preempt된 경우 해당 태스크를 이 리스트에 추가된다.
*gp_tasks
- 현재 일반 gp에서 블럭된 첫 번째 태스크를 가리킨다.
*exp_tasks
- 현재 급행 gp에서 블럭된 첫 번째 태스크를 가리킨다.

kernel/rcu/tree.h – 2/2

        struct list_head *boost_tasks;
                                /* Pointer to first task that needs to be */
                                /*  priority boosted, or NULL if no priority */
                                /*  boosting is needed for this rcu_node */
                                /*  structure.  If there are no tasks */
                                /*  queued on this rcu_node structure that */
                                /*  are blocking the current grace period, */
                                /*  there can be no such task. */
        struct rt_mutex boost_mtx;
                                /* Used only for the priority-boosting */
                                /*  side effect, not as a lock. */
        unsigned long boost_time;
                                /* When to start boosting (jiffies). */
        struct task_struct *boost_kthread_task;
                                /* kthread that takes care of priority */
                                /*  boosting for this rcu_node structure. */
        unsigned int boost_kthread_status;
                                /* State of boost_kthread_task for tracing. */
#ifdef CONFIG_RCU_NOCB_CPU
        struct swait_queue_head nocb_gp_wq[2];
                                /* Place for rcu_nocb_kthread() to wait GP. */
#endif /* #ifdef CONFIG_RCU_NOCB_CPU */
        raw_spinlock_t fqslock ____cacheline_internodealigned_in_smp;

        spinlock_t exp_lock ____cacheline_internodealigned_in_smp;
        unsigned long exp_seq_rq;
        wait_queue_head_t exp_wq[4];
        struct rcu_exp_work rew;
        bool exp_need_flush;    /* Need to flush workitem? */
} ____cacheline_internodealigned_in_smp;

*boost_tasks
- priority 부스트가 필요한 첫 태스크
boost_mtx
- priority 부스트에 사용되는 락
*boost_kthread_task
- 이 노드에서 priority 부스트를 수행하는 boost 커널 스레드
boost_kthread_status
- boost_kthread_task trace를 위한 상태
nocb_gp_wq[]
- 2 개의 no-cb용 커널 스레드의 대기큐
exp_seq_rq
- 급행 grace period 시퀀스 번호
exp_wq[]
- synchronize_rcu_expedited() 호출한 태스크들이 급행 gp를 대기하게 되는데, 이 호출한 태스크들이 대기하는 곳이다.
- 4개의 해시 리스트 형태로 운영된다. (급행 gp 시퀀스 번호에서 하위 2비트를 우측 시프트하여 버린 후 하위 2비트로 해시 운영한다)
rew
- rcu_exp_work 구조체가 임베드되며, 내부에선 wait_rcu_exp_gp() 함수를 호출하는 워크큐가 사용된다.
exp_need_flush
- sync_rcu_exp_select_cpus() 함수내부에서 각 leaf 노드들에서 위의 워크큐가 사용되는 경우 여부를 관리할 때 사용된다.

rcu_data 구조체

kernel/rcu/tree.h – 1/2

/* Per-CPU data for read-copy update. */

struct rcu_data {
        /* 1) quiescent-state and grace-period handling : */
        unsigned long   gp_seq;         /* Track rsp->rcu_gp_seq counter. */
        unsigned long   gp_seq_needed;  /* Track furthest future GP request. */
        union rcu_noqs  cpu_no_qs;      /* No QSes yet for this CPU. */
        bool            core_needs_qs;  /* Core waits for quiesc state. */
        bool            beenonline;     /* CPU online at least once. */
        bool            gpwrap;         /* Possible ->gp_seq wrap. */
        bool            exp_deferred_qs; /* This CPU awaiting a deferred QS? */
        struct rcu_node *mynode;        /* This CPU's leaf of hierarchy */
        unsigned long grpmask;          /* Mask to apply to leaf qsmask. */
        unsigned long   ticks_this_gp;  /* The number of scheduling-clock */
                                        /*  ticks this CPU has handled */
                                        /*  during and after the last grace */
                                        /* period it is aware of. */
        struct irq_work defer_qs_iw;    /* Obtain later scheduler attention. */
        bool defer_qs_iw_pending;       /* Scheduler attention pending? */

        /* 2) batch handling */
        struct rcu_segcblist cblist;    /* Segmented callback list, with */
                                        /* different callbacks waiting for */
                                        /* different grace periods. */
        long            qlen_last_fqs_check;
                                        /* qlen at last check for QS forcing */
        unsigned long   n_force_qs_snap;
                                        /* did other CPU force QS recently? */
        long            blimit;         /* Upper limit on a processed batch */

        /* 3) dynticks interface. */
        int dynticks_snap;              /* Per-GP tracking for dynticks. */
        long dynticks_nesting;          /* Track process nesting level. */
        long dynticks_nmi_nesting;      /* Track irq/NMI nesting level. */
        atomic_t dynticks;              /* Even value for idle, else odd. */
        bool rcu_need_heavy_qs;         /* GP old, so heavy quiescent state! */
        bool rcu_urgent_qs;             /* GP old need light quiescent state. */
#ifdef CONFIG_RCU_FAST_NO_HZ
        bool all_lazy;                  /* All CPU's CBs lazy at idle start? */
        unsigned long last_accelerate;  /* Last jiffy CBs were accelerated. */
        unsigned long last_advance_all; /* Last jiffy CBs were all advanced. */
        int tick_nohz_enabled_snap;     /* Previously seen value from sysfs. */
#endif /* #ifdef CONFIG_RCU_FAST_NO_HZ */

        /* 4) rcu_barrier(), OOM callbacks, and expediting. */
        struct rcu_head barrier_head;
        int exp_dynticks_snap;          /* Double-check need for IPI. */

1) qs와 gp 핸들링 관련

gp_seq
- gp 시퀀스 번호
gp_seq_needed
- 요청한 gp 시퀀스 번호
cpu_no_qs
- gp 시작 후 설정되며 cpu에서 qs가 체크되면 클리어된다.
- 그 후 해당 노드 및 최상위 노드까지 보고하며, 각 노드의 rnp->qsmask의 비트 중 하위 rnp 또는 rdp의 ->grpmask에 해당하는 비트를 클리어한다.
core_need_qs
- cpu가 qs를 체크해야 하는지 여부
beenonline
- 한 번이라도 online 되었었던 경우 1로 설정된다.
gpwrap
- nohz 진입한 cpu는 스케줄 틱을 한동안 갱신하지 못하는데, 이로 인해 gp 시퀀스 역시 장시간 갱신 못할 수도 있다. cpu의 gp 시퀀스가 노드용 gp 시퀀스에 비해 ulong 값의 1/4을 초과하도록 갱신을 못한 경우 gp 시퀀스를 오버플로우로 판정하여 이 값을 true로 변경한다.
exp_deferred_qs
- 현재 cpu가 deferred qs를 대기중인지 여부
*mynode
- 이 cpu를 관리하는 노드
grpmask
- 이 cpu가 해당 leaf 노드의 qsmask에 대응하는 비트 값으로 qs 패스되면 leaf 노드의 qsmask에 반영한다.
- 예) 노드 당 16개의 cpu가 사용될 수 있으며 각각의 cpu에 대해 0x1, 0x2, … 0x1_0000 값이 배치된다.
ticks_this_gp
- 마지막 gp 이후 구동된 스케줄 틱 수
- CONFIG_RCU_CPU_STALL_INFO 커널 옵션을 사용한 경우 cpu stall 정보의 출력을 위해 사용된다.
defer_qs_iw
defer_qs_iw_pending

2) 배치 핸들링

cblist
- 콜백들이 추가되는 segmented 콜백 리스트이다.
qlen_last_fqs_check
- fqs를 위해 마지막 체크시 사용할 qlen 값
n_force_qs_snap
- n_force_qs가 복사된 값으로 최근에 fqs가 수행되었는지 확인하기 위해 사용된다.
blimit
- 배치 처리할 콜백 제한 수
- 빠른 인터럽트 latency를 보장하게 하기 위해 콜백들이 많은 경우 한 번에 blimit 이하의 콜백들만 처리하게 한다.

3) dynticks(nohz) 인터페이스

dynticks_snap
- dynticks->dynticks 값을 복사해두고 카운터 값이 변동이 있는지 확인하기 위해 사용된다.
dynticks_nesting
- 초기 값은 1로 시작되며, eqs 진입 시 1 감소 시키며, 퇴출 시 1 증가 시킨다.
dynticks_nmi_nesting
- 초기 값은 DYNTICK_IRQ_NONIDLE(long_max / 2 + 1)로 시작되며, irq/nmi 진출시 1(eqs)~2 증가되고, 퇴출시 1(eqs)~2 감소된다.
dynticks
- per-cpu로 구성된 전역 rcu_dynticks에 연결된다.
- no-hz에서 qs상태를 관리하기 위해 사용한다.
rcu_need_heavy_qs
- 2 * jiffies_to_sched_qs 시간(틱)동안 gp 연장
rcu_urgent_qs
- jiffies_to_sched_qs 시간(틱)동안 gp 연장
all_lazy
- 대기 중인 모든 콜백이 lazy(kfree) 타입 콜백인 경우 true가 된다.
last_accelerate
- 최근 accelerate 콜백 처리한 시각(틱)
last_advance_all
- 최근 advance 콜백 처리한 시각(틱)
tick_nozh_enabled_snap
- nohz active 여부를 snap 저장하여 변경 여부를 체크하기 위해 사용한다.

4) rcu 배리어, OOM 콜백과 expediting

barrier_head
- 베리어 역할로 사용하는 콜백
exp_dynticks_snap
- bit0 클리어된 rdp->dynticks 값을 snap 저장하여 사용한다.

kernel/rcu/tree.h – 2/2

        /* 5) Callback offloading. */
#ifdef CONFIG_RCU_NOCB_CPU
        struct swait_queue_head nocb_cb_wq; /* For nocb kthreads to sleep on. */
        struct task_struct *nocb_gp_kthread;
        raw_spinlock_t nocb_lock;       /* Guard following pair of fields. */
        atomic_t nocb_lock_contended;   /* Contention experienced. */
        int nocb_defer_wakeup;          /* Defer wakeup of nocb_kthread. */
        struct timer_list nocb_timer;   /* Enforce finite deferral. */
        unsigned long nocb_gp_adv_time; /* Last call_rcu() CB adv (jiffies). */

        /* The following fields are used by call_rcu, hence own cacheline. */
        raw_spinlock_t nocb_bypass_lock ____cacheline_internodealigned_in_smp;
        struct rcu_cblist nocb_bypass;  /* Lock-contention-bypass CB list. */
        unsigned long nocb_bypass_first; /* Time (jiffies) of first enqueue. */
        unsigned long nocb_nobypass_last; /* Last ->cblist enqueue (jiffies). */
        int nocb_nobypass_count;        /* # ->cblist enqueues at ^^^ time. */

        /* The following fields are used by GP kthread, hence own cacheline. */
        raw_spinlock_t nocb_gp_lock ____cacheline_internodealigned_in_smp;
        struct timer_list nocb_bypass_timer; /* Force nocb_bypass flush. */
        u8 nocb_gp_sleep;               /* Is the nocb GP thread asleep? */
        u8 nocb_gp_bypass;              /* Found a bypass on last scan? */
        u8 nocb_gp_gp;                  /* GP to wait for on last scan? */
        unsigned long nocb_gp_seq;      /*  If so, ->gp_seq to wait for. */
        unsigned long nocb_gp_loops;    /* # passes through wait code. */
        struct swait_queue_head nocb_gp_wq; /* For nocb kthreads to sleep on. */
        bool nocb_cb_sleep;             /* Is the nocb CB thread asleep? */
        struct task_struct *nocb_cb_kthread;
        struct rcu_data *nocb_next_cb_rdp;
                                        /* Next rcu_data in wakeup chain. */

        /* The following fields are used by CB kthread, hence new cacheline. */
        struct rcu_data *nocb_gp_rdp ____cacheline_internodealigned_in_smp;
                                        /* GP rdp takes GP-end wakeups. */
#endif /* #ifdef CONFIG_RCU_NOCB_CPU */

        /* 6) RCU priority boosting. */
        struct task_struct *rcu_cpu_kthread_task;
                                        /* rcuc per-CPU kthread or NULL. */
        unsigned int rcu_cpu_kthread_status;
        char rcu_cpu_has_work;

        /* 7) Diagnostic data, including RCU CPU stall warnings. */
        unsigned int softirq_snap;      /* Snapshot of softirq activity. */
        /* ->rcu_iw* fields protected by leaf rcu_node ->lock. */
        struct irq_work rcu_iw;         /* Check for non-irq activity. */
        bool rcu_iw_pending;            /* Is ->rcu_iw pending? */
        unsigned long rcu_iw_gp_seq;    /* ->gp_seq associated with ->rcu_iw. */
        unsigned long rcu_ofl_gp_seq;   /* ->gp_seq at last offline. */
        short rcu_ofl_gp_flags;         /* ->gp_flags at last offline. */
        unsigned long rcu_onl_gp_seq;   /* ->gp_seq at last online. */
        short rcu_onl_gp_flags;         /* ->gp_flags at last online. */
        unsigned long last_fqs_resched; /* Time of last rcu_resched(). */

        int cpu;
};

5) callback offloading (no-cb)

nocb_cb_wq
- nocb 커널 스레드가 잠들때 대기하는 리스트
*nocb_gp_kthread
- no-cb용 gp 커널 스레드
nocb_lock
- no-cb용 스핀락
nocb_lock_contended
- no-cb용 lock contention 유무를 관리하기 위해 사용되는 카운터이다.
nocb_defer_wakeup
- no-cb용 커널 스레드를 깨우는 것에 대한 유예 상태
  - RCU_NOGP_WAKE_NOT(0)
  - RCU_NOGP_WAKE(1)
  - RCU_NOGP_WAKE_FORCE(2)
nocb_timer
- no-cb용 타이머
- no-cb용 gp 커널 스레드를 1틱 deferred wakeup을 할 때 사용한다.
nocb_gp_adv_time
- no-cb용 gp 커널 스레드에서 콜백들을 flush 처리하고 advance 처리를 하는데, 1 틱이내에 반복하지 않기 위해 사용한다.
nocb_bypass_lock
- no-cb bypass 갱신 시 사용하는 스핀락
nocb_bypass
- no-cb용 bypass 콜백 리스트
nocb_bypass_first
- no-cb용 bypass 에 처음 콜백이 추가될 때의 시각을 기록한다.
- no-cb용 bypass 리스트에 있는 콜백들을 flush하여 처리한 경우에도 해당 시각을 기록한다.
- 1ms 이내에 진입하는 콜백들이 일정량(16개)을 초과하여 진입할 때에만 no-cb용 bypass에 콜백을 추가하고 nocb_nobypass_count 카운터를 증가시키는데 이 카운터(는 이 시각이 바뀔 때마다 리셋된다.
nocb_nobypass_last
- no-cb용 bypass 리스트에 콜백을 추가 시도할 때 갱신되는 시각(틱)이다.
nocb_nobypass_count
- no-cb용 bypass 리스트를 사용하여 콜백을 추가하기 위해 카운팅을 한다.
- 이 값이 매 틱마다 nocb_nobypass_lim_per_jiffy(디폴트 1ms 당 16개) 갯 수를 초과할 때에만 no-cb용 bypass 리스트에 콜백을 추가한다.
nocb_gp_lock
- no-cb용 gp 커널스레드가 사용하는 락
nocb_bypass_timer
- no-cb용 bypass 타이머로 새 gp 요청을 기다리기 위해 슬립 전에 no-cb용 콜백들을 모두 flush 처리하기 위해 2틱을 설정하여 사용한다.
nocb_gp_sleep
- no-cb용 gp 커널 스레드가 슬립 상태인지 여부를 담는다. gp 변화를 위해 외부에서 이 값을 false로 바꾸고 깨워 사용한다.
nocb_gp_bypass
- no-cb용 gp 커널 스레드가 지난 스캔에서 bypass 모드로 동작하였는지 여부를 담는다.
nocb_gp_gp
- no-cb용 gp 커널 스레드가 지난 스캔에서 wait 중인지 여부를 담는다.
nocb_gp_seq
- no-cb용 gp 커널 스레드에서 gp idle 상태일때 -1 값을 갖고, 진행 중일 때 ->gp_seq 값을 담고 있다.
nocb_gp_loops
- no-cb용 gp 커널 스레드의 루프 횟수를 카운팅하기 위해 사용된다. (gp state 출력용)
nocb_gp_wq
- no-cb용 gp 커널 스레드가 슬립하는 곳이다.
nocb_cb_sleep
- no-cb용 cb 커널 스레드가 슬립 상태인지 여부를 담는다. 콜백 처리를 위해 이 값을 false로 바꾸고 깨워 사용한다.
*nocb_cb_kthread
- no-cb용 cb 커널 스레드를 가리킨다.
*nocb_next_cb_rdp
- wakeup 체인에서 다음에 처리할 rdp를 가리킨다.
*nocb_gp_rdp
- no-cb용 gp 커널 스레드가 있는 rdp(cpu)를 가리킨다.

6) rcu priority boosting

*rcu_cpu_kthread_task
- rcuc 커널 스레드
rcu_cpu_kthread_status
- rcu 커널 스레드 상태
  - RCU_KTHREAD_STOPPED 0
  - RCU_KTHREAD_RUNNING 1
  - RCU_KTHREAD_WAITING 2
  - RCU_KTHREAD_OFFCPU 3
  - RCU_KTHREAD_YIELDING 4
rcu_cpu_has_work
- cpu에 처리할 콜백이 있어 cb용 콜백 처리 커널 스레드를 깨워야 할 때 사용된다.

7) diagnostic data / rcu cpu stall warning

softirq_snap
- cpu stall 정보를 출력할 때 사용하기 위해 rcu softirq 카운터 값 복사(snap-shot)
- CONFIG_RCU_CPU_STALL_INFO 커널 옵션 필요
rcu_iw
- rcu_iw_handler() 함수가 담긴 rcu irq work
rcu_iw_pending
- rcu irq work 호출 전에 true가 담긴다.
rcu_iw_gp_seq
- rcu irq work 호출 전에 rnp->gp_seq가 담긴다.
rcu_ofl_gp_seq
- cpu offline 변경 시 gp_seq가 담긴다.
rcu_ofl_gp_flags
- cpu offline 변경 시 gp_flags가 담긴다.
rcu_onl_gp_seq
- cpu online 변경 시 gp_seq가 담긴다.
rcu_onl_gp_flags
- cpu online 변경 시 gp_flags가 담긴다.
last_fqs_resched
- 최근 fqs 리스케줄 시각(틱)이 담긴다.
- fqs 진행될 때 3 * jiffies_to_sched_qs 시간이 지난 경우 리스케줄 요청을 한다.
cpu
- cpu 번호

참고

RCU(Read Copy Update) -1- (Basic) | 문c
RCU(Read Copy Update) -2- (Callback process) | 문c – 현재글
RCU(Read Copy Update) -3- (RCU threads) | 문c
RCU(Read Copy Update) -4- (NOCB process) | 문c
RCU(Read Copy Update) -5- (Callback list) | 문c
RCU(Read Copy Update) -6- (Expedited GP) | 문c
RCU(Read Copy Update) -7- (Preemptible RCU) | 문c
rcu_init() | 문c
wait_for_completion() | 문c

numa_default_policy()

2016-12-022016-12-02 문영일 Leave a comment

numa_default_policy()

mm/mempolicy.c

/* Reset policy of current process to default */
void numa_default_policy(void)
{
        do_set_mempolicy(MPOL_DEFAULT, 0, NULL);
}

현재 태스크를 기본 메모리 정책으로 설정한다.

참고

numa_policy_init() | 문c 블로그