문c Blog

arm64_memblock_init()

2019-03-27 / 2021-10-30 문영일

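Memblock Initialization

memblock operates on static arrays prepared at kernel build time, so memblock itself needs no separate initialization. What remains is the step that reserves the basic regions the system uses.

During early kernel boot-up, the following regions are registered in the reserved memblock. The exact set depends on the architecture and machine.

Kernel region
initrd region
DTB region and the regions requested by the DTB reserved-mem node
CMA-DMA region
crash kernel region
ELF core header region

In addition, regions beyond the system memory range are removed from memblock, and if the “linux,usable-memory-range” property of the DTB chosen node restricts the usable memory, the memory outside that range is removed as well.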
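
The effect of restricting usable memory can be pictured with a small user-space sketch. It is not the kernel's implementation: the `struct region` type and the `cap_regions()` helper are hypothetical names made up for illustration, and only the clipping arithmetic is modelled.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one memblock memory region. */
struct region {
    uint64_t base;
    uint64_t size;
};

/*
 * Keep only the parts of regs[0..n) that overlap
 * [cap_base, cap_base + cap_size). The array is rewritten in place
 * and the number of surviving regions is returned.
 */
size_t cap_regions(struct region *regs, size_t n,
                   uint64_t cap_base, uint64_t cap_size)
{
    uint64_t cap_end = cap_base + cap_size;
    size_t out = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t start = regs[i].base;
        uint64_t end = regs[i].base + regs[i].size;

        if (start < cap_base)
            start = cap_base;
        if (end > cap_end)
            end = cap_end;
        if (start >= end)
            continue;               /* nothing of this region is usable */

        regs[out].base = start;
        regs[out].size = end - start;
        out++;
    }
    return out;
}
```

With banks at 0x80000000 and 0xC0000000 (1 GB each) and a usable range of 1 GB starting at 0xA0000000, the sketch keeps the upper half of the first bank and the lower half of the second.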

The following figure shows the memblocks that arm64_memblock_init() reserves.

arm64_memblock_init()

arch/arm64/mm/init.c (1/3)

void __init arm64_memblock_init(void)
{
        const s64 linear_region_size = BIT(vabits_actual - 1);

        /* Handle linux,usable-memory-range property */
        fdt_enforce_memory_region();

        /* Remove memory above our supported physical address size */
        memblock_remove(1ULL << PHYS_MASK_SHIFT, ULLONG_MAX);

        /*
         * Select a suitable value for the base of physical memory.
         */
        memstart_addr = round_down(memblock_start_of_DRAM(),
                                   ARM64_MEMSTART_ALIGN);

        /*
         * Remove the memory that we will not be able to cover with the
         * linear mapping. Take care not to clip the kernel which may be
         * high in memory.
         */
        memblock_remove(max_t(u64, memstart_addr + linear_region_size,
                        __pa_symbol(_end)), ULLONG_MAX);
        if (memstart_addr + linear_region_size < memblock_end_of_DRAM()) {
                /* ensure that memstart_addr remains sufficiently aligned */
                memstart_addr = round_up(memblock_end_of_DRAM() - linear_region_size,
                                         ARM64_MEMSTART_ALIGN);
                memblock_remove(0, memstart_addr);
        }

        /*
         * If we are running with a 52-bit kernel VA config on a system that
         * does not support it, we have to place the available physical
         * memory in the 48-bit addressable part of the linear region, i.e.,
         * we have to move it upward. Since memstart_addr represents the
         * physical address of PAGE_OFFSET, we have to *subtract* from it.
         */
        if (IS_ENABLED(CONFIG_ARM64_VA_BITS_52) && (vabits_actual != 52))
                memstart_addr -= _PAGE_OFFSET(48) - _PAGE_OFFSET(52);

        /*
         * Apply the memory limit if it was set. Since the kernel may be loaded
         * high up in memory, add back the kernel region that must be accessible
         * via the linear mapping.
         */
        if (memory_limit != PHYS_ADDR_MAX) {
                memblock_mem_limit_remove_map(memory_limit);
                memblock_add(__pa_symbol(_text), (u64)(_end - _text));
        }

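In code line 3, linear_region_size holds half the size of the virtual address space used by the 64-bit kernel.
- e.g. With a 256T virtual address space (vabits_actual=48), half of that, 128T, is set aside as one contiguous region.
In code line 6, if the device tree (FDT) restricts the usable memory range, the memblock regions outside that range are removed.
- The “linux,usable-memory-range” property of the chosen node can restrict the usable memory range.
In code line 9, everything above the physical address size the kernel supports (1 << PHYS_MASK_SHIFT) is removed.
In code lines 14~15, the start address of physical memory is obtained and rounded down to ARM64_MEMSTART_ALIGN, which is a 1G unit when 4K pages are used.
In code lines 22~29, physical memory that the lm (linear mapping) virtual address region cannot cover is removed.
- The end of physical memory that exceeds the kernel linear mapping size is removed from the memory memblock. Because the kernel may be loaded near the end of memory, the removal is limited so that the loaded kernel is not clipped.
  - e.g. With VA_BITS = 39 and 1TB of DRAM, the linear mapping region is limited to 256GB, so 768GB of memory cannot be used through the linear mapping.
- If the kernel is loaded in the upper part of a memory larger than the linear mapping size, the lower part of memory is removed instead, which protects the kernel located high in memory.
In code lines 38~39, when a kernel built for 52-bit VAs runs on a system that actually operates with 48-bit VAs, memstart_addr, which is used to convert between lm virtual addresses and physical addresses, must be adjusted.
- memstart_addr -= 0xf_0000_0000_0000
- memstart_addr (=PHYS_OFFSET) holds the physical address that corresponds to the virtual address PAGE_OFFSET.
- See: arm64: mm: use single quantity to represent the PA to VA translation (2020, v5.10-rc1)
In code lines 46~49, if a DRAM memory limit has been set, the DRAM beyond that limit is removed from the memory memblock, and the kernel region is then added back so that it stays reachable through the linear mapping.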
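
The arithmetic in the notes above can be checked with a small user-space sketch. The helper names are hypothetical, a single DRAM bank with the kernel loaded low in memory is assumed, and the alignment is passed in as a parameter because its real value depends on the kernel configuration.

```c
#include <stdint.h>

#define GiB (1ULL << 30)

/* Size of the linear map: half of the kernel virtual address space. */
uint64_t linear_region_size(unsigned int vabits)
{
    return 1ULL << (vabits - 1);
}

/* Round x down to a multiple of align (align must be a power of two). */
uint64_t round_down_pow2(uint64_t x, uint64_t align)
{
    return x & ~(align - 1);
}

/*
 * Bytes of DRAM that lie above the reach of the linear map, assuming one
 * DRAM bank [dram_start, dram_end) and a kernel image below the limit.
 */
uint64_t dram_beyond_linear_map(uint64_t dram_start, uint64_t dram_end,
                                unsigned int vabits, uint64_t memstart_align)
{
    uint64_t memstart = round_down_pow2(dram_start, memstart_align);
    uint64_t limit = memstart + linear_region_size(vabits);

    return dram_end > limit ? dram_end - limit : 0;
}

/*
 * Amount subtracted from memstart_addr when a 52-bit VA kernel falls back
 * to 48-bit VAs: _PAGE_OFFSET(48) - _PAGE_OFFSET(52), where
 * _PAGE_OFFSET(va) is -(1 << va) in 64-bit arithmetic.
 */
uint64_t va52_fallback_delta(void)
{
    uint64_t page_offset_48 = 0 - (1ULL << 48);
    uint64_t page_offset_52 = 0 - (1ULL << 52);

    return page_offset_48 - page_offset_52;
}
```

Calling `dram_beyond_linear_map(0, 1ULL << 40, 39, GiB)` reproduces the 768GB figure from the VA_BITS = 39 example, and `va52_fallback_delta()` gives the 0xf_0000_0000_0000 adjustment.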

arch/arm64/mm/init.c (2/3)

        if (IS_ENABLED(CONFIG_BLK_DEV_INITRD) && phys_initrd_size) {
                /*
                 * Add back the memory we just removed if it results in the
                 * initrd to become inaccessible via the linear mapping.
                 * Otherwise, this is a no-op
                 */
                u64 base = phys_initrd_start & PAGE_MASK;
                u64 size = PAGE_ALIGN(phys_initrd_start + phys_initrd_size) - base;

                /*
                 * We can only add back the initrd memory if we don't end up
                 * with more memory than we can address via the linear mapping.
                 * It is up to the bootloader to position the kernel and the
                 * initrd reasonably close to each other (i.e., within 32 GB of
                 * each other) so that all granule/#levels combinations can
                 * always access both.
                 */
                if (WARN(base < memblock_start_of_DRAM() ||
                         base + size > memblock_start_of_DRAM() +
                                       linear_region_size,
                        "initrd not fully accessible via the linear mapping -- please check your bootloader ...\n")) {
                        phys_initrd_size = 0;
                } else {
                        memblock_remove(base, size); /* clear MEMBLOCK_ flags */
                        memblock_add(base, size);
                        memblock_reserve(base, size);
                }
        }

        if (IS_ENABLED(CONFIG_RANDOMIZE_BASE)) {
                extern u16 memstart_offset_seed;
                u64 range = linear_region_size -
                            (memblock_end_of_DRAM() - memblock_start_of_DRAM());

                /*
                 * If the size of the linear region exceeds, by a sufficient
                 * margin, the size of the region that the available physical
                 * memory spans, randomize the linear region as well.
                 */
                if (memstart_offset_seed > 0 && range >= ARM64_MEMSTART_ALIGN) {
                        range /= ARM64_MEMSTART_ALIGN;
                        memstart_addr -= ARM64_MEMSTART_ALIGN *
                                         ((range * memstart_offset_seed) >> 16);
                }
        }

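In code lines 1~28, the ramdisk (initrd) region is added to the reserved memblock. If the initrd cannot be reached entirely through the linear mapping, a warning is printed and phys_initrd_size is set to 0.
In code lines 30~45, when the CONFIG_RANDOMIZE_BASE kernel option is enabled for security, memstart_addr is lowered by a random multiple of ARM64_MEMSTART_ALIGN so that the linear region is randomized as well.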
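
Both calculations in this part of the function are plain integer arithmetic, so they can be tried out in user space. The sketch below is an illustration under assumptions: 4K pages, a 1G value for the alignment, and hypothetical helper names (`initrd_page_range()`, `memstart_random_offset()`) that do not exist in the kernel.

```c
#include <stdint.h>

#define PAGE_SIZE       4096ULL             /* assumption: 4K pages */
#define PAGE_MASK       (~(PAGE_SIZE - 1))
#define MEMSTART_ALIGN  (1ULL << 30)        /* assumption: 1G alignment */

struct range {
    uint64_t base;
    uint64_t size;
};

/* Expand [start, start + size) outward to page boundaries. */
struct range initrd_page_range(uint64_t start, uint64_t size)
{
    struct range r;
    uint64_t end = (start + size + PAGE_SIZE - 1) & PAGE_MASK;

    r.base = start & PAGE_MASK;
    r.size = end - r.base;
    return r;
}

/*
 * Amount subtracted from memstart_addr: the 16-bit seed selects one of the
 * MEMSTART_ALIGN-sized slots that fit into the unused part of the linear
 * region. Returns 0 when no randomization is applied.
 */
uint64_t memstart_random_offset(uint64_t linear_size, uint64_t dram_span,
                                uint16_t seed)
{
    uint64_t range;

    if (dram_span > linear_size)
        return 0;
    range = linear_size - dram_span;
    if (seed == 0 || range < MEMSTART_ALIGN)
        return 0;

    range /= MEMSTART_ALIGN;
    return MEMSTART_ALIGN * ((range * seed) >> 16);
}
```

For a 256GB linear region, 4GB of DRAM and a seed of 0x8000, the second helper picks slot 126 out of 252, that is, an offset of 126GB.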

arch/arm64/mm/init.c (3/3)

        /*
         * Register the kernel text, kernel data, initrd, and initial
         * pagetables with memblock.
         */
        memblock_reserve(__pa_symbol(_text), _end - _text);
        if (IS_ENABLED(CONFIG_BLK_DEV_INITRD) && phys_initrd_size) {
                /* the generic initrd code expects virtual addresses */
                initrd_start = __phys_to_virt(phys_initrd_start);
                initrd_end = initrd_start + phys_initrd_size;
        }

        early_init_fdt_scan_reserved_mem();

        if (IS_ENABLED(CONFIG_ZONE_DMA)) {
                zone_dma_bits = ARM64_ZONE_DMA_BITS;
                arm64_dma_phys_limit = max_zone_phys(ARM64_ZONE_DMA_BITS);
        }

        if (IS_ENABLED(CONFIG_ZONE_DMA32))
                arm64_dma32_phys_limit = max_zone_phys(32);
        else
                arm64_dma32_phys_limit = PHYS_MASK + 1;

        reserve_crashkernel();

        reserve_elfcorehdr();

        high_memory = __va(memblock_end_of_DRAM() - 1) + 1;

        dma_contiguous_reserve(arm64_dma32_phys_limit);
}

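In code line 5, the kernel region is reserved.
In code lines 6~10, the ramdisk (initrd) addresses are converted to virtual addresses and stored.
In code line 12, the following three DTB-related regions are reserved.
- The DTB itself
- The memory regions read from the memory reserve block (binary) that the off_mem_rsvmap field of the DTB header points to
- The regions requested by the DTB reserved-mem node
In code lines 14~22, the physical address limits of the DMA and DMA32 zones, which device drivers need (dma for coherent / cma for dma), are determined.
In code line 24, the crash kernel region is reserved.
In code line 26, the ELF core header region is reserved.
In code line 28, since ARM64 does not use highmem, high_memory is set to the virtual address just past the end of memory.
In code line 30, the dma region is added to the reserved memblock and also to CMA (Contiguous Memory Allocator). The entry added to the global cma_areas[] array is used when the CMA driver is loaded and initialized. The global dma_mmu_remap[] array, whose entries dma_contiguous_remap() later uses to map the corresponding page table entries with IO attributes, belongs to the 32-bit ARM code rather than to this arm64 path.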

References

Memblock – (1) | 문c
Memblock – (2) | 문c
arm_memblock_init() | 문c
arm64_memblock_init() | 문c – this post

cpu_replace_ttbr1

2019-03-26 / 2021-10-27 문영일

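Setting (changing) the kernel page table

The TTBR1 register points to the page table used by the kernel. Loading a different page table into this register requires special handling, which is explained in detail with the cpu_replace_ttbr1() function that follows.

The following figure shows the flow between functions when the TTBR1 register value is changed.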

cpu_replace_ttbr1()

arch/arm64/include/asm/mmu_context.h

/*
 * Atomically replaces the active TTBR1_EL1 PGD with a new VA-compatible PGD,
 * avoiding the possibility of conflicting TLB entries being allocated.
 */

static inline void cpu_replace_ttbr1(pgd_t *pgdp)
{
        typedef void (ttbr_replace_func)(phys_addr_t);
        extern ttbr_replace_func idmap_cpu_replace_ttbr1;
        ttbr_replace_func *replace_phys;

        /* phys_to_ttbr() zeros lower 2 bits of ttbr with 52-bit PA */
        phys_addr_t ttbr1 = phys_to_ttbr(virt_to_phys(pgdp));

        if (system_supports_cnp() && !WARN_ON(pgdp != lm_alias(swapper_pg_dir))) {
                /*
                 * cpu_replace_ttbr1() is used when there's a boot CPU
                 * up (i.e. cpufeature framework is not up yet) and
                 * latter only when we enable CNP via cpufeature's
                 * enable() callback.
                 * Also we rely on the cpu_hwcap bit being set before
                 * calling the enable() function.
                 */
                ttbr1 |= TTBR_CNP_BIT;
        }

        replace_phys = (void *)__pa_symbol(idmap_cpu_replace_ttbr1);

        cpu_install_idmap();
        replace_phys(ttbr1);
        cpu_uninstall_idmap();
}

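The physical address of the page table is set atomically in TTBR1, the page table register for the kernel. Because kernel code is already running at virtual addresses, TTBR1 cannot be changed directly; the change has to be made atomically through the idmap page table. While the idmap page table is in use, TTBR0, which normally points to the user page table, is borrowed temporarily.

In code line 8, the physical address corresponding to @pgdp is obtained and converted to the TTBR format.
In code lines 10~20, on systems where the CNP (Common Not Private) capability introduced with the ARMv8.2 extension is active, the CNP bit is added.
In code line 22, the physical address of idmap_cpu_replace_ttbr1() is obtained. Under the 1:1 identity mapping this is also the address at which the function is called.
In code lines 24~26, the 1:1 identity mapping is activated, idmap_cpu_replace_ttbr1() is called to set the page table address in TTBR1, and then the 1:1 identity mapping is released.

The following figure shows TTBR1, which designates the kernel page table, being changed atomically.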

idmap_cpu_replace_ttbr1()

arch/arm64/mm/proc.S

/*
 * void idmap_cpu_replace_ttbr1(phys_addr_t ttbr1)
 *
 * This is the low-level counterpart to cpu_replace_ttbr1, and should not be
 * called by anything else. It can only be executed from a TTBR0 mapping.
 */

ENTRY(idmap_cpu_replace_ttbr1)
        save_and_disable_daif flags=x2

        __idmap_cpu_set_reserved_ttbr1 x1, x3

        offset_ttbr1 x0
        msr     ttbr1_el1, x0
        isb

        restore_daif x2

        ret
ENDPROC(idmap_cpu_replace_ttbr1)
        .popsection

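This function lives in the .idmap.text section, whose code head.S has already mapped 1:1 between virtual and physical addresses, so it can run under the identity mapping. It first points TTBR1 at the zero page and performs a TLB flush and isb so that TTBR1 is cleanly emptied. It then sets TTBR1 to the requested page table physical address. The function runs while TTBR0 provides the 1:1 identity mapping.

In code line 2, the D, A, I, F bits of the status register are backed up to the x2 register and then all four bits are set.
In code line 4, ttbr1, which handles the kernel page table, is pointed at the zero page. x1 and x3 are scratch registers: x1 receives the zero page address and x3 the value written to ttbr1.
In code lines 6~8, the offset for the 52-bit address case is applied to the requested value, which is then written to ttbr1 and followed by an isb.
In code lines 10~12, the D, A, I, F bits of the status register are restored and the function returns.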

save_and_disable_daif macro

arch/arm64/include/asm/assembler.h

        .macro save_and_disable_daif, flags
        mrs     \flags, daif
        msr     daifset, #0xf
        .endm

Backs up the D, A, I, F bits of the status register to the @flags register and then sets the D, A, I, F bits.
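
The save/mask/restore pattern can be modelled in user space with an ordinary variable standing in for the register. This is only a simulation: `sim_daif` and both helpers are hypothetical, the bit positions [9:6] for D, A, I, F follow the architectural DAIF layout, and the restore step assumes that restore_daif simply writes the saved value back.

```c
#include <stdint.h>

#define DAIF_SHIFT  6                        /* D, A, I, F sit in bits [9:6] */
#define DAIF_MASK   (0xfULL << DAIF_SHIFT)

/* Simulated register; on real hardware this is the DAIF view of PSTATE. */
uint64_t sim_daif;

/* Model of "mrs flags, daif" followed by "msr daifset, #0xf". */
uint64_t sim_save_and_disable_daif(void)
{
    uint64_t flags = sim_daif;               /* back up the current bits */

    sim_daif |= DAIF_MASK;                   /* mask all four exception types */
    return flags;
}

/* Model of writing the saved flags back (assumed behaviour of restore_daif). */
void sim_restore_daif(uint64_t flags)
{
    sim_daif = flags & DAIF_MASK;
}
```

Starting with only the I bit set (0x80), the save step returns 0x80 and leaves 0x3c0 in the simulated register; the restore step brings it back to 0x80.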

__idmap_cpu_set_reserved_ttbr1 macro

arch/arm64/mm/proc.S

.macro  __idmap_cpu_set_reserved_ttbr1, tmp1, tmp2
        adrp    \tmp1, empty_zero_page
        phys_to_ttbr \tmp2, \tmp1
        offset_ttbr1 \tmp2
        msr     ttbr1_el1, \tmp2
        isb
        tlbi    vmalle1
        dsb     nsh
        isb
.endm

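Points ttbr1, which handles the kernel page table, at the zero page. The address of the zero page is loaded into @tmp1, converted into the value to be written and placed in @tmp2, and then written to ttbr1.

In code line 2, the address of the zero page is loaded into the @tmp1 register.
In code line 3, the physical address in @tmp1 is converted to the ttbr format and stored in @tmp2.
In code line 4, the ttbr1 offset is applied to @tmp2.
In code line 5, @tmp2 is written to the ttbr1 register.
In code lines 6~9, the instruction pipeline is flushed, the TLB is flushed, and once that has completed the instruction pipeline is flushed again.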

phys_to_ttbr macro

arch/arm64/include/asm/assembler.h

/*
 * Arrange a physical address in a TTBR register, taking care of 52-bit
 * addresses.
 *
 *      phys:   physical address, preserved
 *      ttbr:   returns the TTBR value
 */

        .macro  phys_to_ttbr, ttbr, phys
#ifdef CONFIG_ARM64_PA_BITS_52
        orr     \ttbr, \phys, \phys, lsr #46
        and     \ttbr, \ttbr, #TTBR_BADDR_MASK_52
#else
        mov     \ttbr, \phys
#endif
        .endm

Places the physical address held in @phys into @ttbr as the value to use for a ttbr register. On systems that use 52-bit physical addresses, the value is rearranged into the format that supports 52 bits.
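
The two instructions of the 52-bit case translate directly into C. The sketch below mirrors the `orr ... lsr #46` and `and` steps; the function name `phys_to_ttbr_52()` is hypothetical, and the mask value (bits [47:2]) is written out here as an assumption about TTBR_BADDR_MASK_52.

```c
#include <stdint.h>

/* Assumed value of TTBR_BADDR_MASK_52: bits [47:2]. */
#define TTBR_BADDR_MASK_52  (((1ULL << 46) - 1) << 2)

/*
 * Fold physical address bits [51:48] down into bits [5:2] and drop
 * everything outside bits [47:2], as the 52-bit TTBR format expects.
 */
uint64_t phys_to_ttbr_52(uint64_t phys)
{
    return (phys | (phys >> 46)) & TTBR_BADDR_MASK_52;
}
```

A table at physical address 0xA_0000_1234_0000 (bits [51:48] = 0xA) becomes 0x12340028: the low part of the address is kept and 0xA reappears shifted into bits [5:2]. Addresses below 2^48 pass through unchanged as long as they are aligned so that bits [5:0] are zero.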

offset_ttbr1 macro

arch/arm64/include/asm/assembler.h

/*
 * Offset ttbr1 to allow for 48-bit kernel VAs set with 52-bit PTRS_PER_PGD.
 * orr is used as it can cover the immediate value (and is idempotent).
 * In future this may be nop'ed out when dealing with 52-bit kernel VAs.
 *      ttbr: Value of ttbr to set, modified.
 */

        .macro  offset_ttbr1, ttbr
#ifdef CONFIG_ARM64_USER_VA_BITS_52
        orr     \ttbr, \ttbr, #TTBR1_BADDR_4852_OFFSET
#endif
        .endm

For the case where 52-bit addresses are configured, ORs the ttbr1 offset into @ttbr and returns the result in @ttbr.
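
A worked example makes the size of this offset concrete. The sketch assumes 64K pages with PGDIR_SHIFT = 42 and assumes that TTBR1_BADDR_4852_OFFSET is the byte distance between a 52-bit sized pgd (1024 entries) and the 48-bit sized part of it (64 entries), at 8 bytes per entry; these values are not shown in the listing above.

```c
#include <stdint.h>

#define PGDIR_SHIFT 42   /* assumption: 64K pages, 3 translation levels */

/* Assumed definition: (entries for 52 bits - entries for 48 bits) * 8. */
#define TTBR1_BADDR_4852_OFFSET \
    (((1ULL << (52 - PGDIR_SHIFT)) - (1ULL << (48 - PGDIR_SHIFT))) * 8)

/* C equivalent of "orr ttbr, ttbr, #TTBR1_BADDR_4852_OFFSET". */
uint64_t offset_ttbr1(uint64_t ttbr)
{
    return ttbr | TTBR1_BADDR_4852_OFFSET;
}
```

Under these assumptions the offset is (1024 - 64) * 8 = 0x1e00. Because the macro uses orr rather than add, applying it twice gives the same result as applying it once, which is the idempotence the source comment mentions.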

Using the identity mapping table

cpu_install_idmap()

arch/arm64/include/asm/mmu_context.h

static inline void cpu_install_idmap(void)
{
        cpu_set_reserved_ttbr0();
        local_flush_tlb_all();
        cpu_set_idmap_tcr_t0sz();

        cpu_switch_mm(lm_alias(idmap_pg_dir), &init_mm);
}

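To use the identity mapping, the idmap page table is installed in TTBR0. TTBR0 normally serves as the translation register for user page tables, but during ARM64 boot-up it is first used for the idmap page table.

In code line 3, the zero page address is set in the TTBR0 register.
In code line 4, the TLB of the current cpu is flushed.
In code line 5, before the idmap page table is used, the T0SZ field of the TCR register is set to the value the idmap requires.
In code line 7, the virtual address of the idmap page table is converted to a physical address and set in TTBR0.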

cpu_set_reserved_ttbr0()

arch/arm64/include/asm/mmu_context.h

/*
 * Set TTBR0 to empty_zero_page. No translations will be possible via TTBR0.
 */

static inline void cpu_set_reserved_ttbr0(void)
{
        unsigned long ttbr = phys_to_ttbr(__pa_symbol(empty_zero_page));

        write_sysreg(ttbr, ttbr0_el1);
        isb();
}

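Sets TTBR0 to point to the zero page. The special-purpose zero page, used mainly for COW (Copy On Write), is converted to a physical address and written to the TTBR0 register, after which the instruction pipeline is flushed.

write_sysreg() macro function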

arch/arm64/include/asm/sysreg.h

/*
 * The "Z" constraint normally means a zero immediate, but when combined with
 * the "%x0" template means XZR.
 */

#define write_sysreg(v, r) do {                                 \
        u64 __val = (u64)(v);                                   \
        asm volatile("msr " __stringify(r) ", %x0"              \
                     : : "rZ" (__val));                         \
} while (0)

Writes the value @v to the register @r.

local_flush_tlb_all()

arch/arm64/include/asm/tlbflush.h

static inline void local_flush_tlb_all(void)
{
        dsb(nshst);
        __tlbi(vmalle1);
        dsb(nsh);
        isb();
}

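Flushes the TLB for the current cpu. The dsb before the flush makes sure that earlier stores, such as page table updates, have completed; the dsb after it waits for the TLB invalidation to finish; and the final isb flushes the instruction pipeline so that the following instructions are separated from the flush.

cpu_set_idmap_tcr_t0sz() macro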

arch/arm64/include/asm/mmu_context.h

#define cpu_set_idmap_tcr_t0sz()        __cpu_set_tcr_t0sz(idmap_t0sz)

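A macro function that sets the T0SZ field of the TCR register to idmap_t0sz. The global variable idmap_t0sz is initialized at compile time to 64 – VA_BITS, and head.S changes it before kernel entry only when it has to build an extended idmap. The idmap extension is for kernels whose virtual address width is below 48 bits when the physical RAM address is larger than the virtual address range, in which case a 1:1 identity mapping would otherwise be impossible. System configurations that do not run into this case do not use it.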

__cpu_set_tcr_t0sz()

arch/arm64/include/asm/mmu_context.h

/*
 * Set TCR.T0SZ to its default value (based on VA_BITS)
 */

static inline void __cpu_set_tcr_t0sz(unsigned long t0sz)
{
        unsigned long tcr;

        if (!__cpu_uses_extended_idmap())
                return;

        tcr = read_sysreg(tcr_el1);
        tcr &= ~TCR_T0SZ_MASK;
        tcr |= t0sz << TCR_T0SZ_OFFSET;
        write_sysreg(tcr, tcr_el1);
        isb();
}

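Sets TCR.T0SZ to t0sz. The T0SZ field of the value read from the TCR register (the field starting at bit 0) is cleared, t0sz is written into it, and the result is written back to the TCR register. Nothing is done unless the extended idmap is in use.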
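
The read-modify-write on the T0SZ field can be sketched in plain C. The register access is replaced by a function argument, `tcr_with_t0sz()` is a hypothetical name, and the 6-bit field width is stated here as an assumption since TCR_T0SZ_MASK is not shown in the listing.

```c
#include <stdint.h>

#define TCR_T0SZ_OFFSET 0
#define TCR_TxSZ_WIDTH  6    /* assumption: T0SZ is a 6-bit field */
#define TCR_T0SZ_MASK   (((1ULL << TCR_TxSZ_WIDTH) - 1) << TCR_T0SZ_OFFSET)
#define TCR_T0SZ(x)     ((64ULL - (x)) << TCR_T0SZ_OFFSET)

/* Return tcr with its T0SZ field replaced by t0sz; other bits are kept. */
uint64_t tcr_with_t0sz(uint64_t tcr, uint64_t t0sz)
{
    tcr &= ~TCR_T0SZ_MASK;
    tcr |= t0sz << TCR_T0SZ_OFFSET;
    return tcr;
}
```

For example, a TCR value of 0xABCD0019 (T0SZ = 25, i.e. 39-bit VAs) turns into 0xABCD0010 when T0SZ is set for 48-bit VAs, leaving the upper bits untouched.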

__cpu_uses_extended_idmap()

arch/arm64/include/asm/mmu_context.h

static inline bool __cpu_uses_extended_idmap(void)
{
        if (IS_ENABLED(CONFIG_ARM64_USER_VA_BITS_52))
                return false;

        return unlikely(idmap_t0sz != TCR_T0SZ(VA_BITS));
}

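Returns true when the kernel is using the extended idmap. Returns false for kernels configured with 52-bit user virtual addresses (CONFIG_ARM64_USER_VA_BITS_52).

Releasing the identity mapping table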

cpu_uninstall_idmap()

arch/arm64/include/asm/mmu_context.h

/*
 * Remove the idmap from TTBR0_EL1 and install the pgd of the active mm.
 *
 * The idmap lives in the same VA range as userspace, but uses global entries
 * and may use a different TCR_EL1.T0SZ. To avoid issues resulting from
 * speculative TLB fetches, we must temporarily install the reserved page
 * tables while we invalidate the TLBs and set up the correct TCR_EL1.T0SZ.
 *
 * If current is a not a user task, the mm covers the TTBR1_EL1 page tables,
 * which should not be installed in TTBR0_EL1. In this case we can leave the
 * reserved page tables in place.
 */

static inline void cpu_uninstall_idmap(void)
{
        struct mm_struct *mm = current->active_mm;

        cpu_set_reserved_ttbr0();
        local_flush_tlb_all();
        cpu_set_default_tcr_t0sz();

        if (mm != &init_mm && !system_uses_ttbr0_pan())
                cpu_switch_mm(mm->pgd, mm);
}

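When the identity mapping is no longer needed, the page table used by the current task is installed in TTBR0 again.

In code line 5, the zero page address is set in the TTBR0 register.
In code line 6, the TLB of the current cpu is flushed.
In code line 7, now that the idmap page table is no longer used, the kernel's default t0sz value is set in tcr_el1.
In code lines 9~10, the mm is switched. If the mm_struct is not init_mm, which is used during kernel initialization, and software TTBR0 PAN is not in use, TTBR0 is set to the page table of that mm.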

cpu_set_default_tcr_t0sz()

arch/arm64/include/asm/mmu_context.h

#define cpu_set_default_tcr_t0sz()      __cpu_set_tcr_t0sz(TCR_T0SZ(VA_BITS))

Sets the T0SZ field of the TCR register to its default value, ‘64 – number of bits used by virtual addresses’ (e.g. 64 – VA_BITS(39) = 25).

TCR_T0SZ() macro

arch/arm64/include/asm/pgtable-hwdef.h

#define TCR_T0SZ(x)             ((UL(64) - (x)) << TCR_T0SZ_OFFSET)

In the ARM64 kernel this evaluates to 64 – x (TCR_T0SZ_OFFSET = 0).

mm switching

cpu_switch_mm()

arch/arm64/include/asm/mmu_context.h

static inline void cpu_switch_mm(pgd_t *pgd, struct mm_struct *mm)
{
        BUG_ON(pgd == swapper_pg_dir);
        cpu_set_reserved_ttbr0();
        cpu_do_switch_mm(virt_to_phys(pgd),mm);
}

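To switch to the requested user virtual address space, TTBR0 is first pointed at the reserved zero page. The page table virtual address received as an argument is then converted to a physical address and passed to cpu_do_switch_mm(), which sets it in the user page table base register.
mm switching happens mainly during a context switch in the scheduler's __schedule() function, which consists of two parts: mm switching and task switching. Task switching backs up and restores registers and prepares the stack for the next task. mm switching applies only to user tasks and prepares the virtual address environment that the user task uses. Its core is to point the TTBR0 register at the pgd table of that user task.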

cpu_do_switch_mm()

arch/arm64/mm/context.c

void cpu_do_switch_mm(phys_addr_t pgd_phys, struct mm_struct *mm)
{
        unsigned long ttbr1 = read_sysreg(ttbr1_el1);
        unsigned long asid = ASID(mm);
        unsigned long ttbr0 = phys_to_ttbr(pgd_phys);

        /* Skip CNP for the reserved ASID */
        if (system_supports_cnp() && asid)
                ttbr0 |= TTBR_CNP_BIT;

        /* SW PAN needs a copy of the ASID in TTBR0 for entry */
        if (IS_ENABLED(CONFIG_ARM64_SW_TTBR0_PAN))
                ttbr0 |= FIELD_PREP(TTBR_ASID_MASK, asid);

        /* Set ASID in TTBR1 since TCR.A1 is set */
        ttbr1 &= ~TTBR_ASID_MASK;
        ttbr1 |= FIELD_PREP(TTBR_ASID_MASK, asid);

        write_sysreg(ttbr1, ttbr1_el1);
        isb();
        write_sysreg(ttbr0, ttbr0_el1);
        isb();
        post_ttbr_update_workaround();
}

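The lower 16 bits of context.id of @mm, the second argument, are used as the ASID. The page table physical address @pgd_phys, the first argument, is converted to a ttbr0 value, and the ASID is written into bits [63:48] of ttbr1 (and of ttbr0 as well when software PAN is enabled) before both registers are updated. In a TTBR register the upper 16 bits hold the ASID and the remaining lower bits point to the physical address of the page table.

In code line 3, the ttbr1 value is read.
In code line 4, context.id of @mm, the second argument, is read and its ASID is stored in asid.
In code line 5, the pgd physical address @pgd_phys, the first argument, is converted to a ttbr0 value.
In code lines 8~9, on systems where the CNP (Common Not Private) capability introduced with the ARMv8.2 extension is active, the CNP bit is set in ttbr0 if asid is nonzero.
In code lines 12~13, when software PAN emulation is enabled, the ASID field is set in ttbr0 as well.
In code lines 16~17, the ASID field of ttbr1 is updated.
In code lines 19~22, ttbr1 and ttbr0 are updated, each followed by a flush of the instruction pipeline.
In code line 23, any architecture-specific routine that must run after a TTBR update is executed.
- On Cavium SoCs, the instruction cache and pipeline are additionally flushed after the TTBR update.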
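
How the two register values are put together can be followed with a user-space sketch that takes the inputs as plain integers. The `compose_ttbrs()` helper and its flags are hypothetical, system register access is left out, and phys_to_ttbr() is assumed to be the identity (no 52-bit physical addresses).

```c
#include <stdint.h>

#define TTBR_ASID_SHIFT 48
#define TTBR_ASID_MASK  (0xffffULL << TTBR_ASID_SHIFT)
#define TTBR_CNP_BIT    1ULL

struct ttbr_pair {
    uint64_t ttbr0;
    uint64_t ttbr1;
};

/* Same idea as ASID(mm): the lower 16 bits of the context id. */
uint64_t asid_of(uint64_t context_id)
{
    return context_id & 0xffff;
}

/*
 * Build the TTBR0/TTBR1 values for a switch to the page table at pgd_phys.
 * has_cnp and sw_pan stand in for system_supports_cnp() and
 * IS_ENABLED(CONFIG_ARM64_SW_TTBR0_PAN).
 */
struct ttbr_pair compose_ttbrs(uint64_t old_ttbr1, uint64_t pgd_phys,
                               uint64_t context_id, int has_cnp, int sw_pan)
{
    struct ttbr_pair t;
    uint64_t asid = asid_of(context_id);

    t.ttbr0 = pgd_phys;                 /* assumed: phys_to_ttbr() is identity */
    if (has_cnp && asid)
        t.ttbr0 |= TTBR_CNP_BIT;        /* skip CNP for the reserved ASID 0 */
    if (sw_pan)
        t.ttbr0 |= asid << TTBR_ASID_SHIFT;

    /* The ASID always goes into TTBR1 because TCR.A1 is set. */
    t.ttbr1 = (old_ttbr1 & ~TTBR_ASID_MASK) | (asid << TTBR_ASID_SHIFT);
    return t;
}
```

With a context id of 0x10123 the ASID is 0x0123. It replaces whatever ASID ttbr1 held before, and it appears in ttbr0 only when the software PAN flag is passed.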

ASID() macro

arch/arm64/include/asm/mmu.h

/*
 * This macro is only used by the TLBI and low-level switch_mm() code,
 * neither of which can race with an ASID change. We therefore don't
 * need to reload the counter using atomic64_read().
 */

#define ASID(mm)        ((mm)->context.id.counter & 0xffff)

Using the mm_struct pointer @mm, obtains the asid, which is the lower 16 bits of mm->context.id.counter.

post_ttbr_update_workaround()

arch/arm64/mm/context.c

/* Errata workaround post TTBRx_EL1 update. */

asmlinkage void post_ttbr_update_workaround(void)
{
        if (!IS_ENABLED(CONFIG_CAVIUM_ERRATUM_27456))
                return;

        asm(ALTERNATIVE("nop; nop; nop",
                        "ic iallu; dsb nsh; isb",
                        ARM64_WORKAROUND_CAVIUM_27456,
                        CONFIG_CAVIUM_ERRATUM_27456));
}

On Cavium SoCs affected by erratum 27456, the instruction cache and pipeline are additionally flushed after a TTBRx update.

Earlycon & Earlyprintk

2019-03-25 문영일

Early boot command-line parameters

The following items can act as early boot command-line parameters.

Main items

earlycon
earlyprintk
mem
initrd
cma
quiet
loglevel
debug
debug_objects
no_debug_objects

Other early parameters

force_pal_cache_flush, nomca, coherent_pool, cachepolicy, nocache, nowb, ecc
coherentio, nocoherentio, cca, rd_start, rd_size, elfcorehdr, disable_octeon_edac, pmb, sh_mv, nmi_mode, clkin_hz=, memmap, fbmem, vmalloc, nogbpages, gbpages, noexec, nopat, reservetop, highmem, userpte, memtest, numa, kmemcheck, possible_cpus, io_delay, disable_timer_pin_1, memory_corruption_check, memory_corruption_check_period, memory_corruption_check_size, no-kvmapf, no-steal-acc, no-kvmclock-vsyscall, update_mptable, alloc_mptable, gart_fix_e820, nokaslr, disable_mtrr_cleanup, enable_mtrr_cleanup, mtrr_cleanup_debug, mtrr_chunk_size, mtrr_gran_size, mtrr_spare_reg_nr, disable_mtrr_trim, stack, pci, no-kvmclock, vsyscall, reservelow, iommu, idle, xen_emul_unplug, xen_nopv, xen_nopvspin, add_efi_memmap, efi, efi_no_storage_paranoia, nobau, topology_updates, fadump, fadump_reserve_mem, smt-enabled, kvm_cma_resv_ratio, xmon, video, ps3fb, ps3flash, disable_ddw, hvirq, coherent_pool, iefi_debug, nodebugmon, cad, topology, nosmt, smt, possible_cpus, etr, stp, uaccess_primary, l2cache, l2cache_pf, hwthread_map, memdma, hugepagesz, vector, additional_cpus, mem_fclk_21285, balloon3_features, clocksource, tbr, rproc_mem, switches, stram_pool, noallocl2, ktext, pcie_rc_delay, maxmem, maxnodemem, isolnodes, pci_reserve, initramfs_file, disabled_cpus, dataplane, noudn, noidn, noipi, LABC, nointremap, intremap, kgdbdbgp, sysfs.deprecated, noefi, efi, intel_pstate, sim_console, ekgdboc, swiotlb
iapic, nox2apic, disableapic, nolapic, lapic_timer_c2_ok, noapictimer, nolapic_timer, apic, disable_cpu_apicid, x2apic_phys, noapic, acpi_skip_timer_override, acpi_use_timer_override, acpi_sci, acpi_rsdp, acpi_no_static_ssdt, acpi_apic_instance, acpi_force_table_verification

earlyprintk & earlycon

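earlyprintk:
- Introduced in kernel 2.6.12
- Used for kernel debug output before the regular drivers are loaded
- On the ARM architecture the arguments that follow it are not used; instead:
  - It is used together with the “earlycon” boot command-line parameter.
  - Output goes to the device selected by CONFIG_DEBUG_LL.

earlycon:
- Proposed for arm64 in March 2014 and introduced in kernel 3.16 (i386 got it in 2007).
- For debugging purposes, it has to come up earlier than other drivers.
- When it starts working, “bootconsole [uart0] enabled” is printed.
See: Linux earlyprintk/earlycon support on ARM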

Documentation/admin-guide/kernel-parameters.txt

        earlycon=       [KNL] Output early console device and options.

                        [ARM64] The early console is determined by the
                        stdout-path property in device tree's chosen node,
                        or determined by the ACPI SPCR table.

                        [X86] When used with no options the early console is
                        determined by the ACPI SPCR table.

                cdns,<addr>[,options]
                        Start an early, polled-mode console on a Cadence
                        (xuartps) serial port at the specified address. Only
                        supported option is baud rate. If baud rate is not
                        specified, the serial port must already be setup and
                        configured.

                uart[8250],io,<addr>[,options]
                uart[8250],mmio,<addr>[,options]
                uart[8250],mmio32,<addr>[,options]
                uart[8250],mmio32be,<addr>[,options]
                uart[8250],0x<addr>[,options]
                        Start an early, polled-mode console on the 8250/16550
                        UART at the specified I/O port or MMIO address.
                        MMIO inter-register address stride is either 8-bit
                        (mmio) or 32-bit (mmio32 or mmio32be).
                        If none of [io|mmio|mmio32|mmio32be], <addr> is assumed
                        to be equivalent to 'mmio'. 'options' are specified
                        in the same format described for "console=ttyS<n>"; if
                        unspecified, the h/w is not initialized.

                pl011,<addr>
                pl011,mmio32,<addr>
                        Start an early, polled-mode console on a pl011 serial
                        port at the specified address. The pl011 serial port
                        must already be setup and configured. Options are not
                        yet supported.  If 'mmio32' is specified, then only
                        the driver will use only 32-bit accessors to read/write
                        the device registers.

                meson,<addr>
                        Start an early, polled-mode console on a meson serial
                        port at the specified address. The serial port must
                        already be setup and configured. Options are not yet
                        supported.

                msm_serial,<addr>
                        Start an early, polled-mode console on an msm serial
                        port at the specified address. The serial port
                        must already be setup and configured. Options are not
                        yet supported.

                msm_serial_dm,<addr>
                        Start an early, polled-mode console on an msm serial
                        dm port at the specified address. The serial port
                        must already be setup and configured. Options are not
                        yet supported.

                owl,<addr>
                        Start an early, polled-mode console on a serial port
                        of an Actions Semi SoC, such as S500 or S900, at the
                        specified address. The serial port must already be
                        setup and configured. Options are not yet supported.

                rda,<addr>
                        Start an early, polled-mode console on a serial port
                        of an RDA Micro SoC, such as RDA8810PL, at the
                        specified address. The serial port must already be
                        setup and configured. Options are not yet supported.

                smh     Use ARM semihosting calls for early console.

                s3c2410,<addr>
                s3c2412,<addr>
                s3c2440,<addr>
                s3c6400,<addr>
                s5pv210,<addr>
                exynos4210,<addr>
                        Use early console provided by serial driver available
                        on Samsung SoCs, requires selecting proper type and
                        a correct base address of the selected UART port. The
                        serial port must already be setup and configured.
                        Options are not yet supported.

                lantiq,<addr>
                        Start an early, polled-mode console on a lantiq serial
                        (lqasc) port at the specified address. The serial port
                        must already be setup and configured. Options are not
                        yet supported.

                lpuart,<addr>
                lpuart32,<addr>
                        Use early console provided by Freescale LP UART driver
                        found on Freescale Vybrid and QorIQ LS1021A processors.
                        A valid base address must be provided, and the serial
                        port must already be setup and configured.

                ar3700_uart,<addr>
                        Start an early, polled-mode console on the
                        Armada 3700 serial port at the specified
                        address. The serial port must already be setup
                        and configured. Options are not yet supported.

                qcom_geni,<addr>
                        Start an early, polled-mode console on a Qualcomm
                        Generic Interface (GENI) based serial port at the
                        specified address. The serial port must already be
                        setup and configured. Options are not yet supported.

Documentation/admin-guide/kernel-parameters.txt

        earlyprintk=    [X86,SH,ARM,M68k,S390]
                        earlyprintk=vga
                        earlyprintk=efi
                        earlyprintk=sclp
                        earlyprintk=xen
                        earlyprintk=serial[,ttySn[,baudrate]]
                        earlyprintk=serial[,0x...[,baudrate]]
                        earlyprintk=ttySn[,baudrate]
                        earlyprintk=dbgp[debugController#]
                        earlyprintk=pciserial[,force],bus:device.function[,baudrate]
                        earlyprintk=xdbc[xhciController#]

                        earlyprintk is useful when the kernel crashes before
                        the normal console is initialized. It is not enabled by
                        default because it has some cosmetic problems.

                        Append ",keep" to not disable it when the real console
                        takes over.

                        Only one of vga, efi, serial, or usb debug port can
                        be used at a time.

                        Currently only ttyS0 and ttyS1 may be specified by
                        name.  Other I/O ports may be explicitly specified
                        on some architectures (x86 and arm at least) by
                        replacing ttySn with an I/O port address, like this:
                                earlyprintk=serial,0x1008,115200
                        You can find the port for a given device in
                        /proc/tty/driver/serial:
                                2: uart:ST16650V2 port:00001008 irq:18 ...

                        Interaction with the standard serial driver is not
                        very good.

                        The VGA and EFI output is eventually overwritten by
                        the real console.

                        The xen output can only be used by Xen PV guests.

                        The sclp output can only be used on s390.

                        The optional "force" to "pciserial" enables use of a
                        PCI device even when its classcode is not of the
                        UART class.

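Parameter declaration

The setup routine of each parameter given on the command line is called.

An early parameter is generated at compile time: the parameter name, the corresponding setup function, and the early flag are added to the “.init.setup” section.
The initialization routines for earlycon, earlyprintk, and so on are also registered through this macro.

__setup() macro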

#define __setup(str, fn)                                                \
        __setup_param(str, fn, fn, 0)

Registers, at compile time, the function @fn to be called when the command-line parameter named @str is given.

early_param() macro

include/linux/init.h

#define early_param(str, fn)                                            \
        __setup_param(str, fn, fn, 1)

Registers, at compile time, the function to be called when the @str string appears on the command line. The call happens early in boot, before regular memory and similar facilities are usable.

Example: early_param("earlycon", param_setup_earlycon);

__setup_param()

include/linux/init.h

/*
 * Only for really core code.  See moduleparam.h for the normal way.
 *
 * Force the alignment so the compiler doesn't space elements of the
 * obs_kernel_param "array" too far apart in .init.setup.
 */

#define __setup_param(str, unique_id, fn, early)                        \
        static const char __setup_str_##unique_id[] __initconst         \
                __aligned(1) = str;                                     \
        static struct obs_kernel_param __setup_##unique_id              \
                __used __section(.init.setup)                           \
                __attribute__((aligned((sizeof(long)))))                \
                = { __setup_str_##unique_id, fn, early }

Adds an obs_kernel_param structure holding the parameter name and its setup function to the ".init.setup" section.

A parameter added this way is invoked through the command line supplied at boot time.
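
The registration and lookup described above can be sketched in user space. This is only an illustration under stated assumptions: a plain array stands in for the ".init.setup" section, names are compared with an exact strcmp() rather than the kernel's own matching, and every demo_* identifier is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Same shape as the kernel's struct obs_kernel_param. */
struct obs_kernel_param {
        const char *str;
        int (*setup_func)(char *);
        int early;
};

static int earlycon_calls;      /* counters make the effect observable */
static int console_calls;

static int demo_setup_earlycon(char *arg) { (void)arg; earlycon_calls++; return 0; }
static int demo_setup_console(char *arg)  { (void)arg; console_calls++;  return 0; }

/* Assumption: a plain array stands in for the ".init.setup" section. */
static const struct obs_kernel_param demo_setup_table[] = {
        { "earlycon", demo_setup_earlycon, 1 },  /* as if from early_param() */
        { "console",  demo_setup_console,  0 },  /* as if from __setup()     */
};

/*
 * Call the handler registered for `name` if its early flag equals `early`.
 * Returns 1 when a handler ran, 0 otherwise.
 */
static int demo_dispatch_param(const char *name, char *value, int early)
{
        size_t n = sizeof(demo_setup_table) / sizeof(demo_setup_table[0]);
        size_t i;

        for (i = 0; i < n; i++) {
                const struct obs_kernel_param *p = &demo_setup_table[i];

                if (p->early == early && strcmp(p->str, name) == 0) {
                        p->setup_func(value);
                        return 1;
                }
        }
        return 0;
}
```

Calling demo_dispatch_param("earlycon", NULL, 1) runs the early handler, while the same call for "console" does nothing until the pass with early set to 0.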

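Parameter location

The location where parameters are stored is shown separately for kernels before and after v4.6-rc1.

The following figure shows the section where parameters are stored, and the related macros, up to kernel v4.5.

The following figure shows the section where parameters are stored, and the related macros, for kernel v4.6 and later.

struct obs_kernel_param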

include/linux/init.h

struct obs_kernel_param {
        const char *str;
        int (*setup_func)(char *); 
        int early;
};

str
- Name of the boot command-line parameter
- Examples: "console", "earlycon", "earlyprintk"
setup_func
- Function tied to the boot command-line parameter
early
- 1 for an early parameter

Earlycon

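Device drivers are normally selected and loaded during boot only after regular memory initialization has finished. When console output is needed earlier for debugging, earlycon lets a console be specified and used before that point.

earlycon initialization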

param_setup_earlycon()

drivers/tty/serial/earlycon.c

/* early_param wrapper for setup_earlycon() */
static int __init param_setup_earlycon(char *buf)
{
        int err;

        /* Just 'earlycon' is a valid param for devicetree and ACPI SPCR. */
        if (!buf || !buf[0]) {
                if (IS_ENABLED(CONFIG_ACPI_SPCR_TABLE)) {
                        earlycon_acpi_spcr_enable = true;
                        return 0;
                } else if (!buf) {
                        return early_init_dt_scan_chosen_stdout();
                }
        }

        err = setup_earlycon(buf);
        if (err == -ENOENT || err == -EALREADY)
                return 0;
        return err;
}
early_param("earlycon", param_setup_earlycon);

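When "earlycon" is given on the command line, this function prepares the early console device. Only when "earlycon" is given without a device name is the device taken from firmware data instead: the ACPI SPCR table if CONFIG_ACPI_SPCR_TABLE is enabled, otherwise the device tree via early_init_dt_scan_chosen_stdout().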

setup_earlycon()

drivers/tty/serial/earlycon.c

int __init setup_earlycon(char *buf, const char *match,
                          int (*setup)(struct earlycon_device *, const char *)) 
{
        int err;
        size_t len;

        struct uart_port *port = &early_console_dev.port;

        if (!buf || !match || !setup)
                return 0;

        len = strlen(match);

        if (strncmp(buf, match, len))
                return 0;
        if (buf[len] && (buf[len] != ','))
                return 0;

        buf += len + 1;

        err = parse_options(&early_console_dev, buf);
        /* On parsing error, pass the options buf to the setup function */
        if (!err)
                buf = NULL;

        if (port->mapbase)
                port->membase = earlycon_map(port->mapbase, 64);

        early_console_dev.con->data = &early_console_dev;
        err = setup(&early_console_dev, buf);
        if (err < 0)
                return err;
        if (!early_console_dev.con->write)
                return -ENODEV;

        register_console(early_console_dev.con);
        return 0;
}

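struct uart_port *port = &early_console_dev.port;
- Takes the uart_port embedded in the global early_console_dev.
if (strncmp(buf, match, len)) return 0;
- Returns if the boot command-line value buf does not start with the device name match.
if (buf[len] && (buf[len] != ',')) return 0;
- Returns if the device name is followed by any character other than ','.
err = parse_options(&early_console_dev, buf);
- Parses the options: the port address is stored in port->mapbase (port->iobase for "io"), the rest of the string after the address in options, and the UART speed in baud.
if (port->mapbase) port->membase = earlycon_map(port->mapbase, 64);
- Maps the physical address and obtains the corresponding virtual address.
err = setup(&early_console_dev, buf);
- Calls the setup function passed in as an argument.
if (!early_console_dev.con->write) return -ENODEV;
- Returns an error if the setup function did not install a write callback for the early console.
register_console(early_console_dev.con);
- Registers the early console device.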
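
The name check at the start of setup_earlycon() can be tried in isolation. The sketch below assumes a hypothetical helper, demo_earlycon_match(), that returns the option string instead of 0. Unlike the listing, it does not step past the terminating NUL when no ',' follows the name.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return a pointer to the option string that follows `match` in `buf`,
 * or NULL when `buf` does not name this device.
 */
static const char *demo_earlycon_match(const char *buf, const char *match)
{
        size_t len;

        if (!buf || !match)
                return NULL;

        len = strlen(match);
        if (strncmp(buf, match, len) != 0)
                return NULL;            /* different device name */
        if (buf[len] != '\0' && buf[len] != ',')
                return NULL;            /* e.g. "uart8250" tested against "uart" */

        /* Deviation from the listing: stay on the NUL if nothing follows. */
        return buf[len] == ',' ? buf + len + 1 : buf + len;
}
```

With this check, "uart8250,io,0x3f8" matches the name "uart8250" but not "uart", because the character after the four-letter prefix is '8' rather than ',' or the end of the string.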

drivers/tty/serial/earlycon.c

static struct earlycon_device early_console_dev = {
        .con = &early_con,
};

static struct console early_con = {
        .name =         "uart", /* 8250 console switch requires this name */
        .flags =        CON_PRINTBUFFER | CON_BOOT,
        .index =        -1,
};

parse_options()

drivers/tty/serial/earlycon.c

static int __init parse_options(struct earlycon_device *device,
                                char *options)
{
        struct uart_port *port = &device->port;
        int mmio, mmio32, length;
        unsigned long addr;

        if (!options)
                return -ENODEV;

        mmio = !strncmp(options, "mmio,", 5);
        mmio32 = !strncmp(options, "mmio32,", 7);
        if (mmio || mmio32) {
                port->iotype = (mmio ? UPIO_MEM : UPIO_MEM32);
                options += mmio ? 5 : 7;
                addr = simple_strtoul(options, NULL, 0);
                port->mapbase = addr;
                if (mmio32)
                        port->regshift = 2;
        } else if (!strncmp(options, "io,", 3)) {
                port->iotype = UPIO_PORT;
                options += 3;
                addr = simple_strtoul(options, NULL, 0);
                port->iobase = addr;
                mmio = 0;
        } else if (!strncmp(options, "0x", 2)) {
                port->iotype = UPIO_MEM;
                addr = simple_strtoul(options, NULL, 0);
                port->mapbase = addr;
        } else {
                return -EINVAL;
        }

        port->uartclk = BASE_BAUD * 16;

        options = strchr(options, ',');
        if (options) {
                options++;
                device->baud = simple_strtoul(options, NULL, 0);
                length = min(strcspn(options, " ") + 1,
                             (size_t)(sizeof(device->options)));
                strlcpy(device->options, options, length);
        }

        if (port->iotype == UPIO_MEM || port->iotype == UPIO_MEM32)
                pr_info("Early serial console at MMIO%s 0x%llx (options '%s')\n",
                        mmio32 ? "32" : "",
                        (unsigned long long)port->mapbase,
                        device->options);
        else
                pr_info("Early serial console at I/O port 0x%lx (options '%s')\n",
                        port->iobase,
                        device->options);

        return 0;
}

Parses the earlycon option string and sets port->iotype, port->mapbase, port->regshift, port->iobase, port->uartclk, device->baud, and device->options.
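
To see what the parser extracts from a string such as "mmio32,0xff5e0000,115200n8", here is a reduced user-space sketch. The struct, the enum, and demo_parse_options() are hypothetical stand-ins; the sketch keeps a single addr field for both the MMIO and port I/O cases and leaves out uartclk and the copied options string.

```c
#include <stdlib.h>
#include <string.h>

enum demo_iotype { DEMO_IO_PORT, DEMO_IO_MEM, DEMO_IO_MEM32 };

struct demo_earlycon_opts {
        enum demo_iotype iotype;
        unsigned long addr;     /* mapbase for MMIO, iobase for port I/O */
        unsigned int regshift;
        unsigned int baud;      /* 0 when no baud rate was given */
};

/* Parse "<mmio|mmio32|io>,<addr>[,<baud>...]" or "0x<addr>[,<baud>...]". */
static int demo_parse_options(const char *options,
                              struct demo_earlycon_opts *out)
{
        const char *comma;

        if (!options || !out)
                return -1;

        memset(out, 0, sizeof(*out));

        if (!strncmp(options, "mmio32,", 7)) {
                out->iotype = DEMO_IO_MEM32;
                out->regshift = 2;      /* registers are 4 bytes apart */
                options += 7;
        } else if (!strncmp(options, "mmio,", 5)) {
                out->iotype = DEMO_IO_MEM;
                options += 5;
        } else if (!strncmp(options, "io,", 3)) {
                out->iotype = DEMO_IO_PORT;
                options += 3;
        } else if (!strncmp(options, "0x", 2)) {
                out->iotype = DEMO_IO_MEM;      /* bare address means MMIO */
        } else {
                return -1;
        }

        /* Base 0 lets strtoul() accept the "0x" prefix. */
        out->addr = strtoul(options, NULL, 0);

        /* The number after the next ',' is the baud rate ("115200n8"). */
        comma = strchr(options, ',');
        if (comma)
                out->baud = (unsigned int)strtoul(comma + 1, NULL, 0);

        return 0;
}
```

strtoul() stops at the first character that is not part of the number, so "115200n8" yields 115200 and the trailing "n8" is ignored here.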

earlycon device declaration

An earlycon device is created at compile time, added to the earlycon table, and placed in the ".init.data" section.

Note that the .init sections are by default no longer needed after boot, so they are all freed and returned for use as ordinary memory.
Since kernel v4.6-rc1, EARLYCON_DECLARE() and OF_EARLYCON_DECLARE() no longer add entries to separate tables. Both add an earlycon_id structure to a single earlycon table, without duplication.
- Reference: earlycon: Use common framework for earlycon declarations

EARLYCON_DECLARE() macro

include/linux/serial_core.h

#define EARLYCON_DECLARE(_name, fn)     OF_EARLYCON_DECLARE(_name, "", fn)

Declares an earlycon device that can be requested from the command line. @fn is the device setup function called when the device name @_name is given at boot.

Since kernel v4.6-rc1 this macro has been unified so that it invokes the device-tree macro.
Example: EARLYCON_DECLARE(qdf2400_e44, qdf2400_e44_early_console_setup);

OF_EARLYCON_DECLARE() macro

include/linux/serial_core.h

#define OF_EARLYCON_DECLARE(_name, compat, fn)                          \
        _OF_EARLYCON_DECLARE(_name, compat, fn,                         \
                             __UNIQUE_ID(__earlycon_##_name))

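Declares an earlycon device that can be requested from the device tree. @fn is the device setup function called when the compatible string given in the device tree matches this device's compatible name.

Since kernel v4.6-rc1 a device declared this way can also be requested from the command line.
Example: OF_EARLYCON_DECLARE(pl011, "arm,pl011", pl011_early_console_setup);

_OF_EARLYCON_DECLARE() macro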

include/linux/serial_core.h

#define _OF_EARLYCON_DECLARE(_name, compat, fn, unique_id)              \
        static const struct earlycon_id unique_id                       \
             EARLYCON_USED_OR_UNUSED __initconst                        \
                = { .name = __stringify(_name),                         \
                    .compatible = compat,                               \
                    .setup = fn  };                                     \
        static const struct earlycon_id EARLYCON_USED_OR_UNUSED         \
                __section(__earlycon_table)                             \
                * const __PASTE(__p, unique_id) = &unique_id

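A pointer to the earlycon_id structure is added to the __earlycon_table section.

earlycon table location

The location of the earlycon table differs before and after kernel v4.6-rc1.

The following figure shows the section where the earlycon table is stored, and the related macros, up to kernel v4.5.

The following figure shows the section where the earlycon table is stored, and the related macros, for kernel v4.6 and later.

The following figure shows the ".init.setup" section with the parameters registered in it.

earlycon_id structure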

include/linux/serial_core.h

struct earlycon_id {
        char    name[15];
        char    name_term;      /* In case compiler didn't '\0' term name */
        char    compatible[128];
        int     (*setup)(struct earlycon_device *, const char *options);
};

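name
- earlycon device name that is matched when the device is requested from the command line
name_term
- Holds a NUL terminator.
compatible
- Given through the OF_EARLYCON_DECLARE() macro and compared against the name requested in the device tree.
(*setup)
- Hook function called to set up the device when this earlycon device is selected.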
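
The two ways an earlycon_id entry can be selected, by name from the command line or by compatible string from the device tree, can be sketched as follows. Everything prefixed demo_ is hypothetical: the struct is a simplified form of earlycon_id, the array stands in for the table the linker assembles, and the two lookup helpers are illustrative rather than the kernel's own functions.

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for struct earlycon_id. */
struct demo_earlycon_id {
        char name[16];
        char compatible[128];
        int (*setup)(const char *options);
};

static int demo_pl011_setup(const char *options)    { (void)options; return 0; }
static int demo_uart8250_setup(const char *options) { (void)options; return 0; }

/* Assumption: a plain array replaces the linker-built earlycon table. */
static const struct demo_earlycon_id demo_earlycon_table[] = {
        { "pl011",    "arm,pl011", demo_pl011_setup },     /* OF_EARLYCON_DECLARE */
        { "uart8250", "",          demo_uart8250_setup },  /* EARLYCON_DECLARE    */
};

#define DEMO_TABLE_LEN \
        (sizeof(demo_earlycon_table) / sizeof(demo_earlycon_table[0]))

/* Command-line path: match the text before the first ',' against .name. */
static const struct demo_earlycon_id *demo_find_by_name(const char *buf)
{
        size_t len = strcspn(buf, ",");
        size_t i;

        for (i = 0; i < DEMO_TABLE_LEN; i++) {
                const struct demo_earlycon_id *id = &demo_earlycon_table[i];

                if (strlen(id->name) == len && !strncmp(buf, id->name, len))
                        return id;
        }
        return NULL;
}

/* Device-tree path: match a compatible string against .compatible. */
static const struct demo_earlycon_id *demo_find_by_compatible(const char *compat)
{
        size_t i;

        for (i = 0; i < DEMO_TABLE_LEN; i++) {
                const struct demo_earlycon_id *id = &demo_earlycon_table[i];

                /* Entries declared without a compatible never match here. */
                if (id->compatible[0] && !strcmp(id->compatible, compat))
                        return id;
        }
        return NULL;
}
```

An entry declared with an empty compatible string, as EARLYCON_DECLARE() produces, is reachable only through the name lookup.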

Console Driver

1) Standard 8250 UART driver

This driver uses a standard 8250 UART port as the console.

Example 1:
- earlycon=uart8250,io,0x3f8,9600n8
- earlycon=uart8250,mmio,0xff5e0000,115200n8
- earlycon=uart8250,mmio32,0xff5e0000,115200n8
Example 2:
- console=uart8250,io,0x3f8,9600n8
- console=uart8250,mmio,0xff5e0000,115200n8
- console=uart8250,mmio32,0xff5e0000,115200n8

early_serial8250_setup()

drivers/tty/serial/8250/8250_early.c

static int __init early_serial8250_setup(struct earlycon_device *device,
                                         const char *options)
{
        if (!(device->port.membase || device->port.iobase))
                return 0;

        if (!device->baud) {
                device->baud = probe_baud(&device->port);
                snprintf(device->options, sizeof(device->options), "%u",
                         device->baud);
        }

        init_port(device);

        early_device = device;
        device->con->write = early_serial8250_write;
        return 0;
}
EARLYCON_DECLARE(uart8250, early_serial8250_setup);
EARLYCON_DECLARE(uart, early_serial8250_setup);

EARLYCON_DECLARE(uart8250, early_serial8250_setup);
- In the pre-v4.6 form of the macro, this generates a uart8250_setup_earlycon() function that calls setup_earlycon() with the address of early_serial8250_setup() as an argument.
- It then invokes early_param("earlycon", uart8250_setup_earlycon).
  - __setup_param("earlycon", uart8250_setup_earlycon, uart8250_setup_earlycon, 1)
    - Creates the string static const char __setup_str_uart8250_setup_earlycon[] = "earlycon".
    - Creates the structure struct obs_kernel_param __setup_uart8250_setup_earlycon = { __setup_str_uart8250_setup_earlycon, uart8250_setup_earlycon, 1 }.
  - The net result is an obs_kernel_param entry of the form { "earlycon", uart8250_setup_earlycon, 1 }.

serial8250_console global driver object

drivers/tty/serial/8250/8250_core.c

static struct console serial8250_console = {
        .name           = "ttyS",
        .write          = serial8250_console_write,
        .device         = uart_console_device,
        .setup          = serial8250_console_setup,
        .early_setup    = serial8250_console_early_setup,
        .flags          = CON_PRINTBUFFER | CON_ANYTIME,
        .index          = -1,
        .data           = &serial8250_reg,
};

serial8250_console_setup()

drivers/tty/serial/8250/8250_core.c

static int serial8250_console_setup(struct console *co, char *options)
{
        struct uart_port *port;
        int baud = 9600;
        int bits = 8; 
        int parity = 'n'; 
        int flow = 'n'; 

        /*
         * Check whether an invalid uart number has been specified, and
         * if so, search for the first available port that does have
         * console support.
         */
        if (co->index >= nr_uarts)
                co->index = 0; 
        port = &serial8250_ports[co->index].port;
        if (!port->iobase && !port->membase)
                return -ENODEV;

        if (options)
                uart_parse_options(options, &baud, &parity, &bits, &flow);

        return uart_set_options(port, co, baud, parity, bits, flow);
}

serial8250_console_early_setup()

drivers/tty/serial/8250/8250_core.c

static int serial8250_console_early_setup(void)
{
        return serial8250_find_port_for_earlycon();
}

2) AMBA UART driver

This driver uses an AMBA UART port as the console and is available when the CONFIG_SERIAL_AMBA_PL011_CONSOLE option is enabled.

It is the UART driver used on the rpi2.
Example: console=ttyAMA0,115200
- ttyAMA is the UART used as the regular console on the rpi2.
Example: earlycon=pl011,0x3f201000,115200n8
- pl011 is the UART used as the early console on the rpi2.

pl011_early_console_setup()

drivers/tty/serial/amba-pl011.c

static int __init pl011_early_console_setup(struct earlycon_device *device,
                                            const char *opt)
{
        if (!device->port.membase)
                return -ENODEV;

        device->con->write = pl011_early_write;
        return 0;
}
EARLYCON_DECLARE(pl011, pl011_early_console_setup);
OF_EARLYCON_DECLARE(pl011, "arm,pl011", pl011_early_console_setup);

EARLYCON_DECLARE(pl011, pl011_early_console_setup);
- In the pre-v4.6 form of the macro, this generates a pl011_setup_earlycon() function that calls setup_earlycon() with the address of pl011_early_console_setup() as an argument.
- It then invokes early_param("earlycon", pl011_setup_earlycon).
  - __setup_param("earlycon", pl011_setup_earlycon, pl011_setup_earlycon, 1)
    - Creates the string static const char __setup_str_pl011_setup_earlycon[] = "earlycon".
    - Creates the structure struct obs_kernel_param __setup_pl011_setup_earlycon = { __setup_str_pl011_setup_earlycon, pl011_setup_earlycon, 1 }.
  - The net result is an obs_kernel_param entry of the form { "earlycon", pl011_setup_earlycon, 1 }.
- This is the table searched when the command line is parsed.
OF_EARLYCON_DECLARE(pl011, "arm,pl011", pl011_early_console_setup);
- Invokes the _OF_DECLARE(earlycon, pl011, "arm,pl011", pl011_early_console_setup, void *) macro.
  - Creates the structure const struct of_device_id __of_table_pl011 = { .compatible = "arm,pl011", .data = (pl011_early_console_setup == (void *) NULL) ? pl011_early_console_setup : pl011_early_console_setup }.
- This is the table searched when the DTB is scanned.

amba_console global driver object

drivers/tty/serial/amba-pl011.c

static struct uart_driver amba_reg;
static struct console amba_console = {
        .name           = "ttyAMA",
        .write          = pl011_console_write,
        .device         = uart_console_device,
        .setup          = pl011_console_setup,
        .flags          = CON_PRINTBUFFER,
        .index          = -1,
        .data           = &amba_reg,
};

pl011_console_setup()

drivers/tty/serial/amba-pl011.c

static int __init pl011_console_setup(struct console *co, char *options)
{
        struct uart_amba_port *uap;
        int baud = 38400;
        int bits = 8; 
        int parity = 'n'; 
        int flow = 'n'; 
        int ret; 

        /*
         * Check whether an invalid uart number has been specified, and
         * if so, search for the first available port that does have
         * console support.
         */
        if (co->index >= UART_NR)
                co->index = 0; 
        uap = amba_ports[co->index];
        if (!uap)
                return -ENODEV;

        /* Allow pins to be muxed in and configured */
        pinctrl_pm_select_default_state(uap->port.dev);

        ret = clk_prepare(uap->clk);
        if (ret)
                return ret; 

        if (dev_get_platdata(uap->port.dev)) {
                struct amba_pl011_data *plat;

                plat = dev_get_platdata(uap->port.dev);
                if (plat->init)
                        plat->init();
        }

        uap->port.uartclk = clk_get_rate(uap->clk);

        if (options)
                uart_parse_options(options, &baud, &parity, &bits, &flow);
        else
                pl011_console_get_options(uap, &baud, &parity, &bits);

        return uart_set_options(&uap->port, co, baud, parity, bits, flow);
}

Raspberry Pi earlycon device

Below, the serial driver named pl011 on the Raspberry Pi 2 declares its earlycon table entries. The source is from kernel v4.0.

pl011_early_console_setup()

Assigns pl011_early_write() to the console device's write callback.

Example of the DTB table:
- OF_EARLYCON_DECLARE(pl011, "arm,pl011", pl011_early_console_setup);
  - The __of_table_pl011 structure is created in the earlycon_of_table section.

earlyprintk

printk() writes to the regular console. When this option is given, early_printk() can write debug output to the early console device before the regular console is loaded.

earlyprintk initialization

setup_early_printk()

arch/arm/kernel/early_printk.c

static int __init setup_early_printk(char *buf)
{
        early_console = &early_console_dev;
        register_console(&early_console_dev);
        return 0;
}

early_param("earlyprintk", setup_early_printk);

early_param("earlyprintk", setup_early_printk);
- Registers the "earlyprintk" kernel parameter and ties it to setup_early_printk().
- setup_early_printk() is called when "earlyprintk" appears on the boot command line.
setup_early_printk()
- Points early_console at the global early_console_dev and registers it as a console.

early_console_dev structure

arch/arm/kernel/early_printk.c

static struct console early_console_dev = { 
        .name =         "earlycon",
        .write =        early_console_write,
        .flags =        CON_PRINTBUFFER | CON_BOOT,
        .index =        -1,
};

The console is named "earlycon", and output through early_printk() ends up calling early_console_write().

Usage

early_printk()

kernel/printk/printk.c

#ifdef CONFIG_EARLY_PRINTK
struct console *early_console;

asmlinkage __visible void early_printk(const char *fmt, ...) 
{
        va_list ap;
        char buf[512];
        int n;

        if (!early_console)
                return;

        va_start(ap, fmt);
        n = vscnprintf(buf, sizeof(buf), fmt, ap); 
        va_end(ap);

        early_console->write(early_console, buf, n);
}
#endif

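early_printk() produces output only when the CONFIG_EARLY_PRINTK kernel option is enabled, and then behaves in one of the following three ways.
- Highest priority: if the "earlycon" boot command-line parameter is used, output goes to the device it names.
- Next priority: if the CONFIG_DEBUG_LL kernel option is enabled, output goes to the port (UART) that kernel/head.S used.
- If neither is used, nothing is output.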
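
The "no console, no output" behavior and the buffer handling of early_printk() can be reproduced in user space. In this sketch the demo_* names are hypothetical, output is captured in a memory buffer instead of a UART, and standard vsnprintf() replaces the kernel-only vscnprintf(), so the truncation case is clamped by hand.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

struct demo_console {
        void (*write)(struct demo_console *con, const char *s, unsigned int n);
};

static struct demo_console *demo_early_console;  /* NULL until one is set up */

static char demo_out[256];      /* captures output instead of a UART */
static size_t demo_out_len;

/* Write callback that expands '\n' to "\r\n", as early_write() does. */
static void demo_console_write(struct demo_console *con, const char *s,
                               unsigned int n)
{
        (void)con;
        while (n-- > 0 && demo_out_len + 2 < sizeof(demo_out)) {
                if (*s == '\n')
                        demo_out[demo_out_len++] = '\r';
                demo_out[demo_out_len++] = *s++;
        }
        demo_out[demo_out_len] = '\0';
}

static struct demo_console demo_console_dev = { .write = demo_console_write };

/* Format into a stack buffer and hand it to the early console, if any. */
static void demo_early_printk(const char *fmt, ...)
{
        va_list ap;
        char buf[512];
        int n;

        if (!demo_early_console)
                return;                 /* nothing registered: drop the text */

        va_start(ap, fmt);
        n = vsnprintf(buf, sizeof(buf), fmt, ap);
        va_end(ap);

        if (n < 0)
                return;
        if ((size_t)n >= sizeof(buf))
                n = sizeof(buf) - 1;    /* output was truncated */

        demo_early_console->write(demo_early_console, buf, (unsigned int)n);
}
```

Before demo_early_console is assigned, calls are silently discarded. After demo_early_console = &demo_console_dev, each message reaches the write callback with its newline expanded.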

early_console_write()

arch/arm/kernel/early_printk.c

static void early_console_write(struct console *con, const char *s, unsigned n)
{
        early_write(s, n); 
}

early_write()

arch/arm/kernel/early_printk.c

static void early_write(const char *s, unsigned n)
{
        while (n-- > 0) {
                if (*s == '\n')
                        printch('\r');
                printch(*s);
                s++;
        }
}

printch()

arch/arm/kernel/debug.S

ENTRY(printch)
                addruart_current r3, r1, r2
                mov     r1, r0 
                mov     r0, #0
                b       1b
ENDPROC(printch)

addruart_current

arch/arm/kernel/debug.S

                .macro  addruart_current, rx, tmp1, tmp2
                addruart        \tmp1, \tmp2, \rx
                mrc             p15, 0, \rx, c1, c0
                tst             \rx, #1
                moveq           \rx, \tmp1
                movne           \rx, \tmp2
                .endm

rpi2: CONFIG_DEBUG_LL_INCLUDE="mach/debug-macro.S"
- addruart is defined in arch/arm/mach-bcm2709/include/mach/debug-macro.S.

addruart()

Obtains the physical and virtual addresses of the UART base.

rpi2: the two sources below are available, and the rpi2 uses the first addruart.

arch/arm/mach-bcm2709/include/mach/debug-macro.S

#include <mach/platform.h>

                .macro  addruart, rp, rv, tmp 
                ldr     \rp, =UART0_BASE
                ldr     \rv, =IO_ADDRESS(UART0_BASE)
                .endm

#include <debug/pl01x.S>

UART0_BASE
- Physical address 0x3f20_1000
- The 0x7e20_1000 given in the DTB is the bus address.
IO_ADDRESS()
- Converts to a virtual address
- Example: IO_ADDRESS(0x3f20_1000)
  - 0xf320_1000
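
The three addresses above (bus, physical, virtual) can be related with a little arithmetic. The sketch below is an assumption-laden illustration: the bus-to-physical step uses the 0x7e000000 to 0x3f000000 peripheral window that appears in the soc ranges property quoted in this blog, and demo_io_address() is an assumed form of IO_ADDRESS() chosen because it reproduces the value given in the text. Check mach/platform.h of the actual tree before relying on it.

```c
#include <stdint.h>

/* BCM2836 (rpi2) peripheral window, per the soc "ranges" property. */
#define DEMO_PERI_BUS_BASE   0x7e000000u  /* bus address seen in the DTB */
#define DEMO_PERI_PHYS_BASE  0x3f000000u  /* ARM physical address        */

/* Bus address -> physical address inside the 16 MiB peripheral window. */
static uint32_t demo_bus_to_phys(uint32_t bus)
{
        return DEMO_PERI_PHYS_BASE + (bus - DEMO_PERI_BUS_BASE);
}

/*
 * Physical -> static virtual address. ASSUMED form of IO_ADDRESS(): keep
 * the low 24 bits, fold bits 31..28 into bits 27..24, add 0xf0000000.
 */
static uint32_t demo_io_address(uint32_t phys)
{
        return (phys & 0x00ffffffu) + ((phys >> 4) & 0x0f000000u) + 0xf0000000u;
}
```

For UART0, demo_bus_to_phys(0x7e201000) gives 0x3f201000, and demo_io_address(0x3f201000) gives 0xf3201000, matching the values listed above.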

arch/arm/include/debug/pl01x.S

#ifdef CONFIG_DEBUG_UART_PHYS
                .macro  addruart, rp, rv, tmp 
                ldr     \rp, =CONFIG_DEBUG_UART_PHYS
                ldr     \rv, =CONFIG_DEBUG_UART_VIRT
                .endm
#endif

rpi2: the CONFIG_DEBUG_UART_PHYS kernel option is not used.

References

Earlycon & Earlyprintk | 문c (this article)
parse_early_param() | 문c
parse_args() | 문c

Kernel Parameters | kernel.org

DMA -5- (DMA-IOMMU)

2019-03-20, 2021-04-06 문영일


The kernel offers two ways for a device to reach memory by DMA (Direct Memory Access) through an IOMMU.

IOMMU API
- The IOMMU core API is provided in drivers/iommu/iommu.c.
- Most IOMMU API functions use the iommu prefix.
  - iommu_*()
- IOMMU chip drivers are implemented in the drivers/iommu directory.
  - drivers/iommu/arm-smmu.c
  - drivers/iommu/arm-smmu-v3.c
  - drivers/iommu/intel-iommu.c
  - drivers/iommu/amd_iommu.c
  - …
DMA-IOMMU API
- The DMA-IOMMU core API is provided in drivers/iommu/dma-iommu.c.
- Most DMA-IOMMU API functions use the iommu_dma prefix.
  - iommu_dma_*()
- The architecture-specific DMA mapping code is implemented in the following files.
  - arch/arm/mm/dma-mapping.c (two implementations)
  - arch/arm64/mm/dma-mapping.c
  - arch/x86/kernel/amd-gart_64.c
  - …

The following figure shows that a user device attached to an IOMMU can set up IOMMU mappings through either the IOMMU API or the DMA-IOMMU API.

DMA-IOMMU

How a device that reaches the physical addresses of system memory through an IOMMU allocates DMA-coherent memory is implemented differently for each system's IOMMU. This article focuses on the core DMA-IOMMU API and, along the way, looks at the corresponding ARM64 architecture code.

For the IOMMU API, as opposed to the DMA-IOMMU API, see the following.
- IOMMU | 문c

ARM, ARM64 – CoreLink MMU-500 System

The following figure shows the SMMU used on ARM and ARM64.

The following figure shows the two address spaces managed by the SMMU.

The following figure shows mappings made through the SMMU.

Initialization

ARM64 IOMMU-DMA initialization

arch/arm64/mm/dma-mapping.c

static int __init __iommu_dma_init(void)
{
        return iommu_dma_init();
}
arch_initcall(__iommu_dma_init);

Initializes iommu dma.

drivers/iommu/dma-iommu.c

int iommu_dma_init(void)
{
        return iova_cache_get();
}

Creates the iova cache for iommu dma.

iova_cache_get()

drivers/iommu/iova.c

int iova_cache_get(void)
{
        mutex_lock(&iova_cache_mutex);
        if (!iova_cache_users) {
                iova_cache = kmem_cache_create(
                        "iommu_iova", sizeof(struct iova), 0,
                        SLAB_HWCACHE_ALIGN, NULL);
                if (!iova_cache) {
                        mutex_unlock(&iova_cache_mutex);
                        printk(KERN_ERR "Couldn't create iova cache\n");
                        return -ENOMEM;
                }
        }

        iova_cache_users++;
        mutex_unlock(&iova_cache_mutex);

        return 0;
}
EXPORT_SYMBOL_GPL(iova_cache_get);

Creates the iova cache.

In code lines 4-13, the first user of the iova cache creates a kmem cache that can supply iova structures.
In code line 15, the user count is incremented so that the kmem cache is created only once.
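
The create-on-first-use pattern described above can be sketched in user space. This is an illustration only: a pthread mutex and malloc() stand in for the kernel mutex and kmem_cache_create(), all demo_* names are hypothetical, and the release function is added here for symmetry and is not part of the listing.

```c
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t demo_cache_mutex = PTHREAD_MUTEX_INITIALIZER;
static unsigned int demo_cache_users;
static void *demo_cache;         /* stands in for the kmem cache */
static int demo_cache_creations; /* how many times the cache was created */

/* Take a reference; create the shared cache on the first reference only. */
static int demo_cache_get(void)
{
        pthread_mutex_lock(&demo_cache_mutex);
        if (!demo_cache_users) {
                demo_cache = malloc(64);
                if (!demo_cache) {
                        pthread_mutex_unlock(&demo_cache_mutex);
                        return -1;
                }
                demo_cache_creations++;
        }
        demo_cache_users++;
        pthread_mutex_unlock(&demo_cache_mutex);
        return 0;
}

/* Drop a reference; destroy the cache when the last user is gone. */
static void demo_cache_put(void)
{
        pthread_mutex_lock(&demo_cache_mutex);
        if (demo_cache_users && --demo_cache_users == 0) {
                free(demo_cache);
                demo_cache = NULL;
        }
        pthread_mutex_unlock(&demo_cache_mutex);
}
```

Two consecutive demo_cache_get() calls create the cache once and leave the user count at 2. The cache is freed only by the second demo_cache_put().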

DMA setup through the device tree

of_dma_configure()

drivers/of/device.c

/**
 * of_dma_configure - Setup DMA configuration
 * @dev:        Device to apply DMA configuration
 * @np:         Pointer to OF node having DMA configuration
 * @force_dma:  Whether device is to be set up by of_dma_configure() even if
 *              DMA capability is not explicitly described by firmware.
 *
 * Try to get devices's DMA configuration from DT and update it
 * accordingly.
 *
 * If platform code needs to use its own special DMA configuration, it
 * can use a platform bus notifier and handle BUS_NOTIFY_ADD_DEVICE events
 * to fix up DMA configuration.
 */

int of_dma_configure(struct device *dev, struct device_node *np, bool force_dma)
{
        u64 dma_addr, paddr, size = 0;
        int ret;
        bool coherent;
        unsigned long offset;
        const struct iommu_ops *iommu;
        u64 mask;

        ret = of_dma_get_range(np, &dma_addr, &paddr, &size);
        if (ret < 0) {
                /*
                 * For legacy reasons, we have to assume some devices need
                 * DMA configuration regardless of whether "dma-ranges" is
                 * correctly specified or not.
                 */
                if (!force_dma)
                        return ret == -ENODEV ? 0 : ret;

                dma_addr = offset = 0;
        } else {
                offset = PFN_DOWN(paddr - dma_addr);

                /*
                 * Add a work around to treat the size as mask + 1 in case
                 * it is defined in DT as a mask.
                 */
                if (size & 1) {
                        dev_warn(dev, "Invalid size 0x%llx for dma-range\n",
                                 size);
                        size = size + 1;
                }

                if (!size) {
                        dev_err(dev, "Adjusted size 0x%llx invalid\n", size);
                        return -EINVAL;
                }
                dev_dbg(dev, "dma_pfn_offset(%#08lx)\n", offset);
        }

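In code lines 10-20, the "dma-ranges" property is read to obtain the DMA address, physical address, and size. If the property is absent, then for legacy devices used without it (when @force_dma is set) a 1:1 direct mapping is assumed and both the DMA address and the offset are set to 0.
In code lines 21-39, the offset is the physical address minus the DMA address, expressed as a page frame number. As a workaround for device trees that give the size as a mask, an odd size is rounded up to an even value by adding 1.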

        /*
         * If @dev is expected to be DMA-capable then the bus code that created
         * it should have initialised its dma_mask pointer by this point. For
         * now, we'll continue the legacy behaviour of coercing it to the
         * coherent mask if not, but we'll no longer do so quietly.
         */
        if (!dev->dma_mask) {
                dev_warn(dev, "DMA mask not set\n");
                dev->dma_mask = &dev->coherent_dma_mask;
        }

        if (!size && dev->coherent_dma_mask)
                size = max(dev->coherent_dma_mask, dev->coherent_dma_mask + 1);
        else if (!size)
                size = 1ULL << 32;

        dev->dma_pfn_offset = offset;

        /*
         * Limit coherent and dma mask based on size and default mask
         * set by the driver.
         */
        mask = DMA_BIT_MASK(ilog2(dma_addr + size - 1) + 1);
        dev->coherent_dma_mask &= mask;
        *dev->dma_mask &= mask;
        /* ...but only set bus mask if we found valid dma-ranges earlier */
        if (!ret)
                dev->bus_dma_mask = mask;

        coherent = of_dma_is_coherent(np);
        dev_dbg(dev, "device is%sdma coherent\n",
                coherent ? " " : " not ");

        iommu = of_iommu_configure(dev, np);
        if (IS_ERR(iommu) && PTR_ERR(iommu) == -EPROBE_DEFER)
                return -EPROBE_DEFER;

        dev_dbg(dev, "device is%sbehind an iommu\n",
                iommu ? " " : " not ");

        arch_setup_dma_ops(dev, dma_addr, size, iommu, coherent);

        return 0;
}
EXPORT_SYMBOL_GPL(of_dma_configure);

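Reads the DMA mapping values from the device tree and configures the device for DMA mapping support. When the mapping is not a 1:1 direct mapping, the IOMMU configuration is performed as well.

In code lines 7-10, if the device has no DMA address limit set, a warning is printed and the device's coherent address limit is used instead.
In code lines 12-15, when size is 0 (the legacy case without "dma-ranges"), size is set as follows.
- If a coherent address limit is set, size becomes that limit + 1, capped so that the value does not wrap past the boundary.
  - Example: coherent_dma_mask=0xffff_ffff
    - size=0xffff_ffff on a 32-bit system
    - size=0x1_0000_0000 on a 64-bit system
- If no coherent address limit is set, size is 0x1_0000_0000 (4G).
In code line 17, offset (paddr - dma_addr) is stored in the device's dma_pfn_offset.
In code lines 23-25, the device's coherent and DMA address limits are restricted with a mask derived from DMA address + size.
In code lines 27-28, only when a "dma-ranges" property was found, the device's bus address limit is set to the mask derived from DMA address + size.
In code lines 30-32, the "dma-coherent" property is looked up, and whether the device is coherent is printed as a debug message.
In code lines 34-36, the IOMMU is configured for the device.
In code lines 38-39, whether the device operates behind an IOMMU is printed as a debug message.
In code line 41, the DMA operations for the device are set up.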
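
The mask arithmetic in the steps above is easy to check by hand. The sketch below uses hypothetical user-space helpers (demo_ilog2(), demo_dma_bit_mask(), demo_range_mask(), demo_fixup_size()) that are meant to mirror ilog2(), DMA_BIT_MASK(), the mask expression, and the odd-size workaround. They are stand-ins, not the kernel implementations.

```c
#include <stdint.h>

/* Index of the highest set bit; v must be non-zero. */
static unsigned int demo_ilog2(uint64_t v)
{
        unsigned int bit = 0;

        while ((v >>= 1) != 0)
                bit++;
        return bit;
}

/* Same meaning as DMA_BIT_MASK(n): the low n bits set. */
static uint64_t demo_dma_bit_mask(unsigned int n)
{
        return n >= 64 ? ~UINT64_C(0) : (UINT64_C(1) << n) - 1;
}

/* Mask wide enough to reach the last byte of [dma_addr, dma_addr + size). */
static uint64_t demo_range_mask(uint64_t dma_addr, uint64_t size)
{
        return demo_dma_bit_mask(demo_ilog2(dma_addr + size - 1) + 1);
}

/* The "size given as a mask" workaround: an odd size becomes size + 1. */
static uint64_t demo_fixup_size(uint64_t size)
{
        return (size & 1) ? size + 1 : size;
}
```

For a DMA base of 0xc000_0000 and a size of 0x3f00_0000, the last address is 0xfeff_ffff, whose highest set bit is bit 31, so the resulting mask is the 32-bit mask 0xffff_ffff.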

of_dma_get_range()

drivers/of/address.c

/**
 * of_dma_get_range - Get DMA range info
 * @np:         device node to get DMA range info
 * @dma_addr:   pointer to store initial DMA address of DMA range
 * @paddr:      pointer to store initial CPU address of DMA range
 * @size:       pointer to store size of DMA range
 *
 * Look in bottom up direction for the first "dma-ranges" property
 * and parse it.
 *  dma-ranges format:
 *      DMA addr (dma_addr)     : naddr cells
 *      CPU addr (phys_addr_t)  : pna cells
 *      size                    : nsize cells
 *
 * It returns -ENODEV if "dma-ranges" property was not found
 * for this device in DT.
 */

int of_dma_get_range(struct device_node *np, u64 *dma_addr, u64 *paddr, u64 *size)
{
        struct device_node *node = of_node_get(np);
        const __be32 *ranges = NULL;
        int len, naddr, nsize, pna;
        int ret = 0;
        u64 dmaaddr;

        if (!node)
                return -EINVAL;

        while (1) {
                naddr = of_n_addr_cells(node);
                nsize = of_n_size_cells(node);
                node = of_get_next_parent(node);
                if (!node)
                        break;

                ranges = of_get_property(node, "dma-ranges", &len);

                /* Ignore empty ranges, they imply no translation required */
                if (ranges && len > 0)
                        break;

                /*
                 * At least empty ranges has to be defined for parent node if
                 * DMA is supported
                 */
                if (!ranges)
                        break;
        }

        if (!ranges) {
                pr_debug("no dma-ranges found for node(%pOF)\n", np);
                ret = -ENODEV;
                goto out;
        }

        len /= sizeof(u32);

        pna = of_n_addr_cells(node);

        /* dma-ranges format:
         * DMA addr     : naddr cells
         * CPU addr     : pna cells
         * size         : nsize cells
         */
        dmaaddr = of_read_number(ranges, naddr);
        *paddr = of_translate_dma_address(np, ranges);
        if (*paddr == OF_BAD_ADDR) {
                pr_err("translation of DMA address(%pad) to CPU address failed node(%pOF)\n",
                       dma_addr, np);
                ret = -EINVAL;
                goto out;
        }
        *dma_addr = dmaaddr;

        *size = of_read_number(ranges + naddr + pna, nsize);

        pr_debug("dma_addr(%llx) cpu_addr(%llx) size(%llx)\n",
                 *dma_addr, *paddr, *size);

out:
        of_node_put(node);

        return ret;
}

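Reads the "dma-ranges" property for the device-tree node @np and obtains the DMA address, physical address, and size.

In code line 13, the "#address-cells" value is obtained from the parent node.
- For the addresses in "dma-ranges", use 1 for 32-bit address values and 2 for 64-bit address values.
In code line 14, the "#size-cells" value is obtained from the parent node.
In code lines 15-17, the loop exits if there is no parent node.
In code line 19, the "dma-ranges" property is read, and len receives the length of the property.
In code lines 22-23, the loop exits if a non-empty property was read.
In code lines 29-30, the loop exits if the property could not be read.
In code lines 33-37, if no "dma-ranges" property was found, a debug message is printed and the function returns -ENODEV.
In code lines 49-56, the DMA address and physical address are obtained and stored in @dma_addr and @paddr.
In code line 58, the size is obtained and stored in @size.
In code lines 60-61, debug information about the DMA range is printed.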
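
How the cell counts turn a "dma-ranges" entry into three numbers can be sketched as below. The helpers are hypothetical and simplified: cells are taken as host-order integers rather than big-endian __be32 values, and the CPU address is read directly, whereas the listing obtains it through of_translate_dma_address(). The two arrays hold the dma-ranges values from the juno and rpi2/rpi3 device trees quoted in this article.

```c
#include <stdint.h>

/* Combine `cells` 32-bit cells into one number (host byte order assumed). */
static uint64_t demo_read_number(const uint32_t *cell, int cells)
{
        uint64_t v = 0;

        while (cells-- > 0)
                v = (v << 32) | *cell++;
        return v;
}

struct demo_dma_range {
        uint64_t dma_addr;
        uint64_t cpu_addr;
        uint64_t size;
};

/* Decode one <dma_addr cpu_addr size> entry given the three cell counts. */
static struct demo_dma_range demo_parse_dma_range(const uint32_t *ranges,
                                                  int naddr, int pna, int nsize)
{
        struct demo_dma_range r;

        r.dma_addr = demo_read_number(ranges, naddr);
        r.cpu_addr = demo_read_number(ranges + naddr, pna);
        r.size     = demo_read_number(ranges + naddr + pna, nsize);
        return r;
}

/* juno: two cells per field; rpi2/rpi3: one cell per field. */
static const uint32_t demo_juno_ranges[] = { 0, 0, 0, 0, 0x100, 0 };
static const uint32_t demo_rpi_ranges[]  = { 0xc0000000, 0x00000000, 0x3f000000 };
```

With two cells per field, the juno size cells <0x100 0> combine to 0x100_0000_0000. With one cell per field, the rpi entry decodes to a DMA base of 0xc000_0000, a CPU base of 0, and a size of 0x3f00_0000.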

Example: ARM64 juno board

The 1 TiB of DMA addresses 0x0 ~ 0x100_0000_0000 is mapped to the same CPU physical addresses. (inbound mapping)

arch/arm64/boot/dts/arm/juno-base.dtsi

{
        dma-ranges = <0 0 0 0 0x100 0>;
        ...

<dma_addr:msb> <dma_addr:lsb> <cpu_addr:msb> <cpu_addr:lsb> <len:msb> <len:lsb>

Example: ARM rpi2 & rpi3 boards

The 1008M of DMA addresses 0xc000_0000 ~ 0xff00_0000 is mapped to CPU physical addresses 0x0 ~ 0x3f00_0000. (inbound mapping)

boot/dts/bcm2836.dtsi & bcm2837.dtsi

        soc {
                ranges = <0x7e000000 0x3f000000 0x1000000>,
                         <0x40000000 0x40000000 0x00001000>;
                dma-ranges = <0xc0000000 0x00000000 0x3f000000>;
                ...

<dma_addr> <cpu_addr> <len>

of_dma_is_coherent()

drivers/of/address.c

/**
 * of_dma_is_coherent - Check if device is coherent
 * @np: device node
 *
 * It returns true if "dma-coherent" property was found
 * for this device in DT.
 */

bool of_dma_is_coherent(struct device_node *np)
{
        struct device_node *node = of_node_get(np);

        while (node) {
                if (of_property_read_bool(node, "dma-coherent")) {
                        of_node_put(node);
                        return true;
                }
                node = of_get_next_parent(node);
        }
        of_node_put(node);
        return false;
}
EXPORT_SYMBOL_GPL(of_dma_is_coherent);

Returns whether the device is dma coherent, depending on whether the “dma-coherent” property is found in the device tree node @np or in one of its ancestor nodes.

of_iommu_configure()

drivers/iommu/of_iommu.c

const struct iommu_ops *of_iommu_configure(struct device *dev,
                                           struct device_node *master_np)
{
        const struct iommu_ops *ops = NULL;
        struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
        int err = NO_IOMMU;

        if (!master_np)
                return NULL;

        if (fwspec) {
                if (fwspec->ops)
                        return fwspec->ops;

                /* In the deferred case, start again from scratch */
                iommu_fwspec_free(dev);
        }

        /*
         * We don't currently walk up the tree looking for a parent IOMMU.
         * See the `Notes:' section of
         * Documentation/devicetree/bindings/iommu/iommu.txt
         */
        if (dev_is_pci(dev)) {
                struct of_pci_iommu_alias_info info = {
                        .dev = dev,
                        .np = master_np,
                };

                err = pci_for_each_dma_alias(to_pci_dev(dev),
                                             of_pci_iommu_init, &info);
        } else if (dev_is_fsl_mc(dev)) {
                err = of_fsl_mc_iommu_init(to_fsl_mc_device(dev), master_np);
        } else {
                struct of_phandle_args iommu_spec;
                int idx = 0;

                while (!of_parse_phandle_with_args(master_np, "iommus",
                                                   "#iommu-cells",
                                                   idx, &iommu_spec)) {
                        err = of_iommu_xlate(dev, &iommu_spec);
                        of_node_put(iommu_spec.np);
                        idx++;
                        if (err)
                                break;
                }
        }

        /*
         * Two success conditions can be represented by non-negative err here:
         * >0 : there is no IOMMU, or one was unavailable for non-fatal reasons
         *  0 : we found an IOMMU, and dev->fwspec is initialised appropriately
         * <0 : any actual error
         */
        if (!err) {
                /* The fwspec pointer changed, read it again */
                fwspec = dev_iommu_fwspec_get(dev);
                ops    = fwspec->ops;
        }
        /*
         * If we have reason to believe the IOMMU driver missed the initial
         * probe for dev, replay it to get things in order.
         */
        if (!err && dev->bus && !device_iommu_mapped(dev))
                err = iommu_probe_device(dev);

        /* Ignore all other errors apart from EPROBE_DEFER */
        if (err == -EPROBE_DEFER) {
                ops = ERR_PTR(err);
        } else if (err < 0) {
                dev_dbg(dev, "Adding to IOMMU failed: %d\n", err);
                ops = NULL;
        }

        return ops;
}

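Finds the iommu that the device is connected to in the device tree, passes the fwspec to it, and then probes the device with that iommu driver. (The fwspec holds values such as the ids to hand to the iommu driver.)

In code lines 8~9, null is returned if no node was given.
In code lines 11~17, if iommu configuration has already completed for the device, the iommu_ops the device is using are returned.
In code lines 24~31, for a pci device, the “iommu-map” property is read and of_iommu_xlate() is called.
In code lines 32~33, for a device on NXP's QorIQ DPAA2 “fsl-mc” bus, the “iommu-map” property is read and of_iommu_xlate() is called.
In code lines 34~47, for any other device, the “iommus” property is read, as many arguments as that iommu's cell count requires are parsed, and of_iommu_xlate() is called.
In code lines 55~59, if an iommu was found, the iommu ops assigned to the device are fetched.
In code lines 64~73, if the device has not been probed by the iommu yet, a probe is attempted. The probe may return the -EPROBE_DEFER error, which asks for the probe to be retried later.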

of_iommu_xlate()

drivers/iommu/of_iommu.c

static int of_iommu_xlate(struct device *dev,
                          struct of_phandle_args *iommu_spec)
{
        const struct iommu_ops *ops;
        struct fwnode_handle *fwnode = &iommu_spec->np->fwnode;
        int err;

        ops = iommu_ops_from_fwnode(fwnode);
        if ((ops && !ops->of_xlate) ||
            !of_device_is_available(iommu_spec->np))
                return NO_IOMMU;

        err = iommu_fwspec_init(dev, &iommu_spec->np->fwnode, ops);
        if (err)
                return err;
        /*
         * The otherwise-empty fwspec handily serves to indicate the specific
         * IOMMU device we're waiting for, which will be useful if we ever get
         * a proper probe-ordering dependency mechanism in future.
         */
        if (!ops)
                return driver_deferred_probe_check_state(dev);

        return ops->of_xlate(dev, iommu_spec);
}

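Passes the requested phandle arguments in @iommu_spec to the iommu driver through its (*of_xlate) hook function.

Example of a device connected to an smmu)

An sdhci device connected to the smmu passes the arguments 0x6002 and 0x0000. (arm_smmu_of_xlate() places the first value in the low 16 bits and the second value in the high 16 bits, so the resulting id is 0x0000_6002.)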

                sdio0: sdhci@3f1000 {
                        compatible = "brcm,sdhci-iproc";
                        reg = <0x003f1000 0x100>;
                        interrupts = <GIC_SPI 204 IRQ_TYPE_LEVEL_HIGH>;
                        bus-width = <8>;
                        clocks = <&sdio0_clk>;
                        iommus = <&smmu 0x6002 0x0000>;
                        status = "disabled";
                };

Example of an smmu device)

As the #iommu-cells property value shows, it receives 2 arguments.

                smmu: mmu@3000000 {
                        compatible = "arm,mmu-500";
                        reg = <0x03000000 0x80000>;
                        ...
                        #iommu-cells = <2>;
                }

iommu_fwspec_init()

drivers/iommu/iommu.c

int iommu_fwspec_init(struct device *dev, struct fwnode_handle *iommu_fwnode,
                      const struct iommu_ops *ops)
{
        struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);

        if (fwspec)
                return ops == fwspec->ops ? 0 : -EINVAL;

        fwspec = kzalloc(sizeof(*fwspec), GFP_KERNEL);
        if (!fwspec)
                return -ENOMEM;

        of_node_get(to_of_node(iommu_fwnode));
        fwspec->iommu_fwnode = iommu_fwnode;
        fwspec->ops = ops;
        dev_iommu_fwspec_set(dev, fwspec);
        return 0;
}
EXPORT_SYMBOL_GPL(iommu_fwspec_init);

When the fwspec for the iommu is initialized for the first time, a fwspec is allocated and @iommu_fwnode and @ops are set in it. Returns 0 on success.

arm_smmu_of_xlate() – ARM, ARM64

drivers/iommu/arm-smmu.c

static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
{
        u32 mask, fwid = 0;

        if (args->args_count > 0)
                fwid |= (u16)args->args[0];

        if (args->args_count > 1)
                fwid |= (u16)args->args[1] << SMR_MASK_SHIFT;
        else if (!of_property_read_u32(args->np, "stream-match-mask", &mask))
                fwid |= (u16)mask << SMR_MASK_SHIFT;

        return iommu_fwspec_add_ids(dev, &fwid, 1);
}

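Receives the phandle arguments @args for the smmu and registers them on the device.

For an smmu, they are stored in the fwspec as an id value.
The first argument of @args is the low 16 bits and the second argument is the high 16 bits; the id that combines the two is added to the fwspec.
If there is no second argument and the “stream-match-mask” property is specified, that mask value is shifted left by 16 bits and ORed into the id value.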

Device connected to an smmu

The id value 0 specified in the “iommu-map” property (fixed up by u-boot) is passed to the iommu driver.

                fsl_mc: fsl-mc@80c000000 {
                        compatible = "fsl,qoriq-mc";
                        reg = <0x00000008 0x0c000000 0 0x40>,    /* MC portal base */
                              <0x00000000 0x08340000 0 0x40000>; /* MC control reg */
                        msi-parent = <&its>;
                        iommu-map = <0 &smmu 0 0>;      /* This is fixed-up by u-boot */
                        dma-coherent;
                        ...
                  }

smmu device

The smmu driver shifts the “stream-match-mask” property value 0x7c00 left by 16 bits and ORs it into the id value it received.

                smmu: iommu@5000000 {
                        compatible = "arm,mmu-500";
                        reg = <0 0x5000000 0 0x800000>;
                        #global-interrupts = <12>;
                        #iommu-cells = <1>;
                        stream-match-mask = <0x7C00>;
                        dma-coherent;
                        ...
                }
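The way the two examples end up as a single 32-bit id can be sketched as follows. This is a hedged user-space sketch, not the driver code: make_fwid() is a hypothetical helper, and MASK_SHIFT is assumed to equal the driver's SMR_MASK_SHIFT of 16, as the description above states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MASK_SHIFT 16   /* assumed to match SMR_MASK_SHIFT */

/*
 * args[0] is the stream id (low 16 bits), args[1] an optional mask
 * (high 16 bits). The property mask is used only when args[1] is absent.
 */
static uint32_t make_fwid(const uint32_t *args, int args_count,
                          bool has_match_mask, uint32_t match_mask)
{
        uint32_t fwid = 0;

        assert(args_count >= 0 && args_count <= 2);

        if (args_count > 0)
                fwid |= (uint16_t)args[0];
        if (args_count > 1)
                fwid |= (uint32_t)(uint16_t)args[1] << MASK_SHIFT;
        else if (has_match_mask)
                fwid |= (uint32_t)(uint16_t)match_mask << MASK_SHIFT;
        return fwid;
}
```

For the sdhci example (two cells, 0x6002 and 0x0000) this gives 0x0000_6002; for the fsl-mc example (one cell, id 0, stream-match-mask 0x7c00) it gives 0x7c00_0000.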

iommu_fwspec_add_ids()

drivers/iommu/iommu.c

int iommu_fwspec_add_ids(struct device *dev, u32 *ids, int num_ids)
{
        struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
        size_t size;
        int i;

        if (!fwspec)
                return -EINVAL;

        size = offsetof(struct iommu_fwspec, ids[fwspec->num_ids + num_ids]);
        if (size > sizeof(*fwspec)) {
                fwspec = krealloc(fwspec, size, GFP_KERNEL);
                if (!fwspec)
                        return -ENOMEM;

                dev_iommu_fwspec_set(dev, fwspec);
        }

        for (i = 0; i < num_ids; i++)
                fwspec->ids[fwspec->num_ids + i] = ids[i];

        fwspec->num_ids += num_ids;
        return 0;
}
EXPORT_SYMBOL_GPL(iommu_fwspec_add_ids);

Reserves id space in the device's fwspec for @num_ids more entries and copies the ids passed in the ids[] array into it.
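The grow-and-append pattern on a structure that ends in a flexible array can be sketched in user space as below. This is a minimal sketch under stated assumptions: struct id_list and id_list_add() are hypothetical names, realloc() stands in for krealloc(), and unlike the kernel function (which returns -EINVAL when no fwspec exists) the sketch creates the list on first use.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct id_list {
        unsigned int num_ids;
        uint32_t ids[];         /* flexible array member */
};

/* Append n ids, growing the allocation. Returns 0 on success, -1 on failure. */
static int id_list_add(struct id_list **listp, const uint32_t *ids,
                       unsigned int n)
{
        struct id_list *list = *listp;
        unsigned int old = list ? list->num_ids : 0;
        size_t size = sizeof(*list) + ((size_t)old + n) * sizeof(list->ids[0]);

        assert(ids != NULL || n == 0);

        list = realloc(list, size);
        if (!list)
                return -1;      /* the old block in *listp is still valid */

        if (n)
                memcpy(&list->ids[old], ids, n * sizeof(ids[0]));
        list->num_ids = old + n;
        *listp = list;          /* the block may have moved */
        return 0;
}
```

The caller must re-read the pointer after every call, which is the same reason the kernel code calls dev_iommu_fwspec_set() again after krealloc().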

Preparing ARM64 IOMMU Operations

arch_setup_dma_ops()

arch/arm64/mm/dma-mapping.c

void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
                        const struct iommu_ops *iommu, bool coherent)
{
        dev->dma_coherent = coherent;
        __iommu_setup_dma_ops(dev, dma_base, size, iommu);

#ifdef CONFIG_XEN
        if (xen_initial_domain())
                dev->dma_ops = xen_dma_ops;
#endif
}

Prepares the dma operations for arm64. @coherent indicates whether the device is a coherent device.

of_dma_configure() reads the “dma-ranges” property and then enters this function, passing dma_base, size and so on as arguments.

__iommu_setup_dma_ops()

arch/arm64/mm/dma-mapping.c

static void __iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
                                  const struct iommu_ops *ops)
{
        struct iommu_domain *domain;

        if (!ops)
                return;

        /*
         * The IOMMU core code allocates the default DMA domain, which the
         * underlying IOMMU driver needs to support via the dma-iommu layer.
         */
        domain = iommu_get_domain_for_dev(dev);

        if (!domain)
                goto out_err;

        if (domain->type == IOMMU_DOMAIN_DMA) {
                if (iommu_dma_init_domain(domain, dma_base, size, dev))
                        goto out_err;

                dev->dma_ops = &iommu_dma_ops;
        }

        return;

out_err:
         pr_warn("Failed to set up IOMMU for device %s; retaining platform DMA ops\n",
                 dev_name(dev));
}

For a dma device that uses an iommu, the dma mapping domain is initialized and the generic iommu_dma_ops is assigned to the device.

Domain initialization

iommu_dma_init_domain()

drivers/iommu/dma-iommu.c

/**
 * iommu_dma_init_domain - Initialise a DMA mapping domain
 * @domain: IOMMU domain previously prepared by iommu_get_dma_cookie()
 * @base: IOVA at which the mappable address space starts
 * @size: Size of IOVA space
 * @dev: Device the domain is being initialised for
 *
 * @base and @size should be exact multiples of IOMMU page granularity to
 * avoid rounding surprises. If necessary, we reserve the page at address 0
 * to ensure it is an invalid IOVA. It is safe to reinitialise a domain, but
 * any change which could make prior IOVAs invalid will fail.
 */

int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base,
                u64 size, struct device *dev)
{
        struct iommu_dma_cookie *cookie = domain->iova_cookie;
        struct iova_domain *iovad = &cookie->iovad;
        unsigned long order, base_pfn, end_pfn;
        int attr;

        if (!cookie || cookie->type != IOMMU_DMA_IOVA_COOKIE)
                return -EINVAL;

        /* Use the smallest supported page size for IOVA granularity */
        order = __ffs(domain->pgsize_bitmap);
        base_pfn = max_t(unsigned long, 1, base >> order);
        end_pfn = (base + size - 1) >> order;

        /* Check the domain allows at least some access to the device... */
        if (domain->geometry.force_aperture) {
                if (base > domain->geometry.aperture_end ||
                    base + size <= domain->geometry.aperture_start) {
                        pr_warn("specified DMA range outside IOMMU capability\n");
                        return -EFAULT;
                }
                /* ...then finally give it a kicking to make sure it fits */
                base_pfn = max_t(unsigned long, base_pfn,
                                domain->geometry.aperture_start >> order);
        }

        /* start_pfn is always nonzero for an already-initialised domain */
        if (iovad->start_pfn) {
                if (1UL << order != iovad->granule ||
                    base_pfn != iovad->start_pfn) {
                        pr_warn("Incompatible range for DMA domain\n");
                        return -EFAULT;
                }

                return 0;
        }

        init_iova_domain(iovad, 1UL << order, base_pfn);

        if (!cookie->fq_domain && !iommu_domain_get_attr(domain,
                        DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE, &attr) && attr) {
                cookie->fq_domain = domain;
                init_iova_flush_queue(iovad, iommu_dma_flush_iotlb_all, NULL);
        }

        if (!dev)
                return 0;

        return iova_reserve_iommu_regions(dev, domain);
}
EXPORT_SYMBOL(iommu_dma_init_domain);

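Initializes the dma mapping domain.

In code lines 9~10, the -EINVAL error is returned if the domain has no iova cookie or the cookie is not of type IOMMU_DMA_IOVA_COOKIE.
- A cookie of type IOMMU_DMA_MSI_COOKIE is an error.
In code lines 13~15, the order of the smallest page size supported by the domain is taken, and base_pfn and end_pfn are computed from the dma address @base and @size.
- Example) 4K pages, @base=0x1000_0000, @size=0x2000
  - base_pfn=0x1_0000, end_pfn=0x1_0001, which gives a range of 2 pages.
In code lines 18~27, when the domain enforces its aperture and the address range lies outside it, a warning message is printed and the -EFAULT error is returned.
In code lines 30~38, if the iova domain is already initialized and has a start pfn, nothing is done and success (0) is returned, provided the granule and start pfn match. If they do not match, a warning is printed and -EFAULT is returned.
In code line 40, the iova domain is initialized.
In code lines 42~46, if no flush queue domain is set in the cookie yet and the domain has the flush queue attribute enabled, this domain is set as fq_domain and the flush queue is initialized.
In code line 51, the reserved iommu regions of the device are reserved in the iova domain.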
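The pfn arithmetic of code lines 13~15 can be checked with a small user-space sketch. The names lowest_bit() and dma_pfn_range() are hypothetical; lowest_bit() plays the role of __ffs(), and the aperture clamping of the real function is left out.

```c
#include <assert.h>
#include <stdint.h>

struct pfn_range {
        uint64_t base_pfn;
        uint64_t end_pfn;
};

/* Index of the lowest set bit; bitmap must be non-zero. */
static unsigned int lowest_bit(uint64_t bitmap)
{
        unsigned int i = 0;

        assert(bitmap != 0);
        while (!(bitmap & 1)) {
                bitmap >>= 1;
                i++;
        }
        return i;
}

/* Derive the iova pfn range from the smallest supported page size. */
static struct pfn_range dma_pfn_range(uint64_t pgsize_bitmap,
                                      uint64_t base, uint64_t size)
{
        unsigned int order = lowest_bit(pgsize_bitmap);
        struct pfn_range r;

        assert(size != 0);
        r.base_pfn = base >> order;
        if (r.base_pfn < 1)
                r.base_pfn = 1;         /* pfn 0 stays an invalid iova */
        r.end_pfn = (base + size - 1) >> order;
        return r;
}
```

With a 4K/2M/1G bitmap, @base=0x1000_0000 and @size=0x2000 this reproduces the base_pfn=0x1_0000, end_pfn=0x1_0001 example above.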

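DMA preparation

The following figure shows how the operations for allocating and mapping DMA coherent memory are set up.

When a device connected to the AMBA bus, the Platform bus or the PCI bus is probed, the dma configure function of that bus is called. ACPI is left out here, and only the setup path through the device tree is shown.
It also shows that of_dma_configure() is called when reserved-memory is specified in the device tree.
The source of the whole path is not analyzed here. Looking only at the function attached to the (*alloc) hook, the figure shows in red that it is __iommu_alloc_attrs(). Only this source is briefly analyzed below.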

__iommu_alloc_attrs()

arch/arm64/mm/dma-mapping.c – 1/2-

static void *__iommu_alloc_attrs(struct device *dev, size_t size,
                                 dma_addr_t *handle, gfp_t gfp,
                                 unsigned long attrs)
{
        bool coherent = dev_is_dma_coherent(dev);
        int ioprot = dma_info_to_prot(DMA_BIDIRECTIONAL, coherent, attrs);
        size_t iosize = size;
        void *addr;

        if (WARN(!dev, "cannot create IOMMU mapping for unknown device\n"))
                return NULL;

        size = PAGE_ALIGN(size);

        /*
         * Some drivers rely on this, and we probably don't want the
         * possibility of stale kernel data being read by devices anyway.
         */
        gfp |= __GFP_ZERO;

        if (!gfpflags_allow_blocking(gfp)) {
                struct page *page;
                /*
                 * In atomic context we can't remap anything, so we'll only
                 * get the virtually contiguous buffer we need by way of a
                 * physically contiguous allocation.
                 */
                if (coherent) {
                        page = alloc_pages(gfp, get_order(size));
                        addr = page ? page_address(page) : NULL;
                } else {
                        addr = dma_alloc_from_pool(size, &page, gfp);
                }
                if (!addr)
                        return NULL;

                *handle = iommu_dma_map_page(dev, page, 0, iosize, ioprot);
                if (*handle == DMA_MAPPING_ERROR) {
                        if (coherent)
                                __free_pages(page, get_order(size));
                        else
                                dma_free_from_pool(addr, size);
                        addr = NULL;
                }

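Allocates coherent memory for DMA and then maps it.

In code line 6, the iommu page protection flags are derived from the dma direction, whether the device is coherent, and the attributes.
In code line 13, the mapping size is aligned to a page boundary.
In code line 19, the __GFP_ZERO flag is added so that the allocated memory is zeroed.
In code lines 21~35, this is the case of an atomic allocation request, for example from interrupt context. For a coherent device the memory is allocated from system memory; otherwise it is allocated from the atomic pool.
In code lines 37~44, the allocated page is mapped through the iommu. If the mapping fails, the allocated memory is freed.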

iommu page protection flags

IOMMU_READ
- read from RAM
IOMMU_WRITE
- write to RAM
IOMMU_CACHE
- DMA cache coherency
IOMMU_NOEXEC
IOMMU_MMIO
IOMMU_PRIV
- privilege level (kernel level) access
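How a dma direction and the coherency of the device turn into these flags can be sketched as below. This is an illustrative sketch only: the SK_* names and their bit values are made up for this example (the kernel defines the real IOMMU_* values in its own headers), and sk_info_to_prot() merely mirrors the idea of dma_info_to_prot().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values, for illustration only. */
enum {
        SK_READ  = 1 << 0,      /* device may read RAM */
        SK_WRITE = 1 << 1,      /* device may write RAM */
        SK_CACHE = 1 << 2,      /* cache coherent access */
        SK_PRIV  = 1 << 3       /* privileged access */
};

enum sk_dir { SK_BIDIRECTIONAL, SK_TO_DEVICE, SK_FROM_DEVICE };

static int sk_info_to_prot(enum sk_dir dir, bool coherent, bool privileged)
{
        int prot = coherent ? SK_CACHE : 0;

        assert(dir <= SK_FROM_DEVICE);

        if (privileged)
                prot |= SK_PRIV;

        switch (dir) {
        case SK_BIDIRECTIONAL:
                return prot | SK_READ | SK_WRITE;
        case SK_TO_DEVICE:      /* data flows RAM -> device */
                return prot | SK_READ;
        case SK_FROM_DEVICE:    /* data flows device -> RAM */
                return prot | SK_WRITE;
        }
        return 0;
}
```

__iommu_alloc_attrs() always asks for the bidirectional case, so a coherent device ends up with read, write and cache permissions.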

arch/arm64/mm/dma-mapping.c – 2/2-

        } else if (attrs & DMA_ATTR_FORCE_CONTIGUOUS) {
                pgprot_t prot = arch_dma_mmap_pgprot(dev, PAGE_KERNEL, attrs);
                struct page *page;

                page = dma_alloc_from_contiguous(dev, size >> PAGE_SHIFT,
                                        get_order(size), gfp & __GFP_NOWARN);
                if (!page)
                        return NULL;

                *handle = iommu_dma_map_page(dev, page, 0, iosize, ioprot);
                if (*handle == DMA_MAPPING_ERROR) {
                        dma_release_from_contiguous(dev, page,
                                                    size >> PAGE_SHIFT);
                        return NULL;
                }
                addr = dma_common_contiguous_remap(page, size, VM_USERMAP,
                                                   prot,
                                                   __builtin_return_address(0));
                if (addr) {
                        if (!coherent)
                                __dma_flush_area(page_to_virt(page), iosize);
                        memset(addr, 0, size);
                } else {
                        iommu_dma_unmap_page(dev, *handle, iosize, 0, attrs);
                        dma_release_from_contiguous(dev, page,
                                                    size >> PAGE_SHIFT);
                }
        } else {
                pgprot_t prot = arch_dma_mmap_pgprot(dev, PAGE_KERNEL, attrs);
                struct page **pages;

                pages = iommu_dma_alloc(dev, iosize, gfp, attrs, ioprot,
                                        handle, flush_page);
                if (!pages)
                        return NULL;

                addr = dma_common_pages_remap(pages, size, VM_USERMAP, prot,
                                              __builtin_return_address(0));
                if (!addr)
                        iommu_dma_free(dev, pages, iosize, handle);
        }
        return addr;
}

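In code lines 1~27, if the attributes request contiguous memory, contiguous memory pages are allocated and then mapped through the iommu. They are then remapped with the page attributes supported by the architecture.
In code lines 28~41, dma buffer memory is allocated for the iommu device and mapped into the iova space. It is then remapped with the page attributes supported by the architecture.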

IOMMU-DMA memory allocation

iommu_dma_alloc()

drivers/iommu/dma-iommu.c

/**
 * iommu_dma_alloc - Allocate and map a buffer contiguous in IOVA space
 * @dev: Device to allocate memory for. Must be a real device
 *       attached to an iommu_dma_domain
 * @size: Size of buffer in bytes
 * @gfp: Allocation flags
 * @attrs: DMA attributes for this allocation
 * @prot: IOMMU mapping flags
 * @handle: Out argument for allocated DMA handle
 * @flush_page: Arch callback which must ensure PAGE_SIZE bytes from the
 *              given VA/PA are visible to the given non-coherent device.
 *
 * If @size is less than PAGE_SIZE, then a full CPU page will be allocated,
 * but an IOMMU which supports smaller pages might not map the whole thing.
 *
 * Return: Array of struct page pointers describing the buffer,
 *         or NULL on failure.
 */

struct page **iommu_dma_alloc(struct device *dev, size_t size, gfp_t gfp,
                unsigned long attrs, int prot, dma_addr_t *handle,
                void (*flush_page)(struct device *, const void *, phys_addr_t))
{
        struct iommu_domain *domain = iommu_get_dma_domain(dev);
        struct iommu_dma_cookie *cookie = domain->iova_cookie;
        struct iova_domain *iovad = &cookie->iovad;
        struct page **pages;
        struct sg_table sgt;
        dma_addr_t iova;
        unsigned int count, min_size, alloc_sizes = domain->pgsize_bitmap;

        *handle = DMA_MAPPING_ERROR;

        min_size = alloc_sizes & -alloc_sizes;
        if (min_size < PAGE_SIZE) {
                min_size = PAGE_SIZE;
                alloc_sizes |= PAGE_SIZE;
        } else {
                size = ALIGN(size, min_size);
        }
        if (attrs & DMA_ATTR_ALLOC_SINGLE_PAGES)
                alloc_sizes = min_size;

        count = PAGE_ALIGN(size) >> PAGE_SHIFT;
        pages = __iommu_dma_alloc_pages(dev, count, alloc_sizes >> PAGE_SHIFT,
                                        gfp);
        if (!pages)
                return NULL;

        size = iova_align(iovad, size);
        iova = iommu_dma_alloc_iova(domain, size, dev->coherent_dma_mask, dev);
        if (!iova)
                goto out_free_pages;

        if (sg_alloc_table_from_pages(&sgt, pages, count, 0, size, GFP_KERNEL))
                goto out_free_iova;

        if (!(prot & IOMMU_CACHE)) {
                struct sg_mapping_iter miter;
                /*
                 * The CPU-centric flushing implied by SG_MITER_TO_SG isn't
                 * sufficient here, so skip it by using the "wrong" direction.
                 */
                sg_miter_start(&miter, sgt.sgl, sgt.orig_nents, SG_MITER_FROM_SG);
                while (sg_miter_next(&miter))
                        flush_page(dev, miter.addr, page_to_phys(miter.page));
                sg_miter_stop(&miter);
        }

        if (iommu_map_sg(domain, iova, sgt.sgl, sgt.orig_nents, prot)
                        < size)
                goto out_free_sg;

        *handle = iova;
        sg_free_table(&sgt);
        return pages;

out_free_sg:
        sg_free_table(&sgt);
out_free_iova:
        iommu_dma_free_iova(cookie, iova, size);
out_free_pages:
        __iommu_dma_free_pages(pages, count);
        return NULL;
}

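Allocates dma buffer memory for an iommu device and maps it into the iova space.

In code line 15, the size of the smallest supported page is obtained.
- Example) when 4K and 16K are both supported, pgsize_bitmap=0x5000 (16K+4K)
  - min_size=0x1000 (4K)
In code lines 16~21, if the minimum size is smaller than the page size, the minimum size becomes the page size and the page size is added to the allocation sizes. Otherwise the requested @size is aligned to the minimum size.
In code lines 22~23, if single pages were requested, the allocation size becomes the minimum size.
In code lines 25~29, dma buffer memory is allocated for the iommu device. An array of page descriptors pointing to the scattered physical pages is returned.
In code line 31, the requested @size is aligned to the granule supported by the iova space.
- granule
  - the minimum mapping unit of the iommu
In code lines 32~34, an iova region is allocated for the iommu device.
In code lines 36~37, an sg table that is used temporarily inside this function is allocated and built from the page array.
In code lines 39~49, if the mapping is not cache coherent (IOMMU_CACHE is not set in @prot), the pages are flushed.
In code lines 51~53, the sg list is mapped to the iova region.
In code lines 55~57, the iova (the dma address seen by the device) is stored in @handle, the temporarily allocated sg table is freed, and the page array is returned.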
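The size bookkeeping of code lines 15~23 can be sketched in user space. struct alloc_plan and plan_alloc() are hypothetical names, and page_size is passed in as a parameter instead of using the kernel's PAGE_SIZE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct alloc_plan {
        unsigned long min_size;     /* smallest unit to allocate */
        unsigned long alloc_sizes;  /* bitmap of allowed chunk sizes */
        size_t size;                /* possibly re-aligned request size */
};

static struct alloc_plan plan_alloc(unsigned long pgsize_bitmap, size_t size,
                                    unsigned long page_size, bool single_pages)
{
        struct alloc_plan p = { 0, pgsize_bitmap, size };

        assert(pgsize_bitmap != 0);
        assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

        /* x & -x isolates the lowest set bit: the smallest page size */
        p.min_size = p.alloc_sizes & -p.alloc_sizes;
        if (p.min_size < page_size) {
                p.min_size = page_size;
                p.alloc_sizes |= page_size;
        } else {
                /* round size up to a multiple of min_size (a power of two) */
                p.size = (size + p.min_size - 1) & ~((size_t)p.min_size - 1);
        }
        if (single_pages)
                p.alloc_sizes = p.min_size;
        return p;
}
```

For pgsize_bitmap=0x5000 with 4K cpu pages, min_size is 0x1000 and a request of 0x1800 bytes is rounded up to 0x2000.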

__iommu_dma_alloc_pages()

drivers/iommu/dma-iommu.c

static struct page **__iommu_dma_alloc_pages(struct device *dev,
                unsigned int count, unsigned long order_mask, gfp_t gfp)
{
        struct page **pages;
        unsigned int i = 0, nid = dev_to_node(dev);

        order_mask &= (2U << MAX_ORDER) - 1;
        if (!order_mask)
                return NULL;

        pages = kvzalloc(count * sizeof(*pages), GFP_KERNEL);
        if (!pages)
                return NULL;

        /* IOMMU can map any pages, so himem can also be used here */
        gfp |= __GFP_NOWARN | __GFP_HIGHMEM;

        while (count) {
                struct page *page = NULL;
                unsigned int order_size;

                /*
                 * Higher-order allocations are a convenience rather
                 * than a necessity, hence using __GFP_NORETRY until
                 * falling back to minimum-order allocations.
                 */
                for (order_mask &= (2U << __fls(count)) - 1;
                     order_mask; order_mask &= ~order_size) {
                        unsigned int order = __fls(order_mask);
                        gfp_t alloc_flags = gfp;

                        order_size = 1U << order;
                        if (order_mask > order_size)
                                alloc_flags |= __GFP_NORETRY;
                        page = alloc_pages_node(nid, alloc_flags, order);
                        if (!page)
                                continue;
                        if (!order)
                                break;
                        if (!PageCompound(page)) {
                                split_page(page, order);
                                break;
                        } else if (!split_huge_page(page)) {
                                break;
                        }
                        __free_pages(page, order);
                }
                if (!page) {
                        __iommu_dma_free_pages(pages, i);
                        return NULL;
                }
                count -= order_size;
                while (order_size--)
                        pages[i++] = page++;
        }
        return pages;
}

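Allocates @count pages for an iommu dma device and returns them as an array of page struct pointers. Because the physical pages are going to be used in scatter-gather fashion, they may be scattered in units of order pages. @order_mask specifies all the order bits that may be used for the allocation.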

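In code lines 7~9, @order_mask is limited by the mask (2U << MAX_ORDER) - 1 so that it does not go beyond the orders the buddy system can provide.
- Example) 4K, MAX_ORDER=11, @order_mask=0x1f0f
  - @order_mask=0xf0f (the mask is 0xfff, so only bits 0~11 remain)
In code lines 11~13, an array of @count page descriptor pointers is allocated.
In code line 16, the highmem area is included in the allocatable areas. The nowarn flag is also added, because the inner loop retries starting from the higher orders.
In code lines 18~29, among the bits of the allocation unit @order_mask, the largest order is tried first. After that it steps down to lower orders.
- Example) size=4M+56K, count=0x40e, order_mask=0x40e
  - order goes down in the sequence 10, 3, 2, 1.
In code lines 32~34, if lower orders still remain in order_mask below the order being tried, the noretry flag is added.
In code lines 35~37, 2^order pages are allocated.
In code lines 38~39, at the minimum order 0 the for loop stops.
In code lines 40~46, a higher-order page that is not a compound page is split into single pages and the for loop stops. A compound page is split with split_huge_page(); only if that split fails are the pages returned to the buddy system and the loop continued with the next lower order.
In code lines 47~50, if no page could be allocated, all pages allocated so far are freed and null is returned.
In code lines 51~53, @count is decreased by the number of allocated pages. The pages[] array index i and the page pointer are advanced order_size times.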
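The order selection can be traced with a small user-space sketch that assumes every allocation succeeds on the first try. top_bit() stands in for __fls(), and plan_orders() is a hypothetical helper that only records which orders would be requested.

```c
#include <assert.h>
#include <stddef.h>

/* Index of the highest set bit; x must be non-zero. */
static unsigned int top_bit(unsigned long x)
{
        unsigned int i = 0;

        assert(x != 0);
        while (x >>= 1)
                i++;
        return i;
}

/*
 * Record the order of each chunk needed to cover count pages.
 * Returns the number of chunks, or 0 if the mask cannot cover count
 * or more than max chunks would be needed.
 */
static size_t plan_orders(unsigned int count, unsigned long order_mask,
                          unsigned int *orders, size_t max)
{
        size_t n = 0;

        assert(orders != NULL);

        while (count) {
                /* never pick a chunk larger than what is still needed */
                order_mask &= (2UL << top_bit(count)) - 1;
                if (!order_mask || n == max)
                        return 0;
                orders[n] = top_bit(order_mask);
                count -= 1U << orders[n];
                n++;
        }
        return n;
}
```

For count=0x40e (4M+56K in 4K pages) the sketch yields the orders 10, 3, 2, 1, which matches the sequence in the example above.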

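The following figure shows memory for dma use being allocated from scattered physical memory in order-sized units and returned as an array of page descriptors for those pages. (Example: 4M+56K is allocated for ease of understanding)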

iommu_dma_alloc_iova()

drivers/iommu/dma-iommu.c

static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain,
                size_t size, dma_addr_t dma_limit, struct device *dev)
{
        struct iommu_dma_cookie *cookie = domain->iova_cookie;
        struct iova_domain *iovad = &cookie->iovad;
        unsigned long shift, iova_len, iova = 0;

        if (cookie->type == IOMMU_DMA_MSI_COOKIE) {
                cookie->msi_iova += size;
                return cookie->msi_iova - size;
        }

        shift = iova_shift(iovad);
        iova_len = size >> shift;
        /*
         * Freeing non-power-of-two-sized allocations back into the IOVA caches
         * will come back to bite us badly, so we have to waste a bit of space
         * rounding up anything cacheable to make sure that can't happen. The
         * order of the unadjusted size will still match upon freeing.
         */
        if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
                iova_len = roundup_pow_of_two(iova_len);

        if (dev->bus_dma_mask)
                dma_limit &= dev->bus_dma_mask;

        if (domain->geometry.force_aperture)
                dma_limit = min(dma_limit, domain->geometry.aperture_end);

        /* Try to get PCI devices a SAC address */
        if (dma_limit > DMA_BIT_MASK(32) && dev_is_pci(dev))
                iova = alloc_iova_fast(iovad, iova_len,
                                       DMA_BIT_MASK(32) >> shift, false);

        if (!iova)
                iova = alloc_iova_fast(iovad, iova_len, dma_limit >> shift,
                                       true);

        return (dma_addr_t)iova << shift;
}

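Allocates an iova region for an iommu device and returns the iova address.

In code lines 8~11, for a cookie of the MSI type (a dma device that uses pci msi interrupts), @size is added to msi_iova in the iova cookie. The value from just before the addition is returned.
In code lines 13~14, the number of iova granules to map for @size is stored in iova_len.
In code lines 21~22, if the number of granules is less than 32, that is 1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1), iova_len is rounded up to a power of two.
- IOVA_RANGE_CACHE_MAX_SIZE(6)
  - maximum page order supported by the cache
In code lines 24~25, the dma address limit of the device is restricted to the range supported by the bus.
In code lines 27~28, the dma address limit of the device is restricted to the range supported by the domain.
In code lines 31~37, iova space is allocated for iova_len pages. A pci device whose limit is above 32 bits first tries to get an address below 4G.
In code line 39, the iova address is returned.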
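The rounding rule of code lines 13~22 can be sketched as follows. RANGE_CACHE_MAX_SIZE is assumed to equal IOVA_RANGE_CACHE_MAX_SIZE (6) as stated above; roundup_pow2() and iova_len_for() are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

#define RANGE_CACHE_MAX_SIZE 6  /* assumed to match IOVA_RANGE_CACHE_MAX_SIZE */

/* Smallest power of two that is >= x, for x >= 1. */
static unsigned long roundup_pow2(unsigned long x)
{
        unsigned long p = 1;

        while (p < x)
                p <<= 1;
        return p;
}

/* Number of iova granules to request for a granule-aligned size. */
static unsigned long iova_len_for(size_t size, unsigned int shift)
{
        unsigned long len;

        assert(size != 0);
        assert((size & (((size_t)1 << shift) - 1)) == 0);

        len = size >> shift;
        /* only lengths small enough to be cached are rounded up */
        if (len < (1UL << (RANGE_CACHE_MAX_SIZE - 1)))
                len = roundup_pow2(len);
        return len;
}
```

With a 4K granule, 3 pages are rounded up to 4 and 31 pages to 32, while 32 or 33 pages are left unchanged because they are not below the threshold of 32.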

IOMMU-DMA memory page mapping

iommu_dma_map_page()

drivers/iommu/dma-iommu.c

dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,
                unsigned long offset, size_t size, int prot)
{
        return __iommu_dma_map(dev, page_to_phys(page) + offset, size, prot,
                        iommu_get_dma_domain(dev));
}

Maps the physical page + @offset into iova space for the dma device, for @size bytes with the @prot attributes.

__iommu_dma_map()

drivers/iommu/dma-iommu.c

static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys,
                size_t size, int prot, struct iommu_domain *domain)
{
        struct iommu_dma_cookie *cookie = domain->iova_cookie;
        size_t iova_off = 0;
        dma_addr_t iova;

        if (cookie->type == IOMMU_DMA_IOVA_COOKIE) {
                iova_off = iova_offset(&cookie->iovad, phys);
                size = iova_align(&cookie->iovad, size + iova_off);
        }

        iova = iommu_dma_alloc_iova(domain, size, dma_get_mask(dev), dev);
        if (!iova)
                return DMA_MAPPING_ERROR;

        if (iommu_map(domain, iova, phys - iova_off, size, prot)) {
                iommu_dma_free_iova(cookie, iova, size);
                return DMA_MAPPING_ERROR;
        }
        return iova + iova_off;
}

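Maps the physical address @phys for the dma device, for @size bytes with the @prot attributes.

In code lines 8~11, when the iova domain, which is the full allocator, is used, the iova offset and the size are aligned. (used by smmu)
- The other cookie type is the msi kind, which uses a linear page allocator. (used by vfio)
In code lines 13~15, iova space is allocated for @size.
In code lines 17~20, the allocated space is mapped to the address space where the physical memory resides.
- iova (io address seen by the device) <--> phys (physical address of the memory)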
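The offset and alignment handling of code lines 8~11 can be sketched in user space. struct map_plan and plan_map() are hypothetical names, and granule is assumed to be a power of two as in the iova domain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct map_plan {
        uint64_t map_phys;      /* granule-aligned physical start */
        size_t map_size;        /* granule-aligned length to map */
        size_t offset;          /* offset of the buffer inside the first granule */
};

static struct map_plan plan_map(uint64_t phys, size_t size, size_t granule)
{
        struct map_plan p;

        assert(granule != 0 && (granule & (granule - 1)) == 0);

        p.offset = (size_t)(phys & (granule - 1));
        p.map_size = (size + p.offset + granule - 1) & ~(granule - 1);
        p.map_phys = phys - p.offset;
        return p;
}
```

The caller maps map_size bytes starting at map_phys and hands iova + offset back to the device, so a buffer that straddles a granule boundary occupies two granules even when it is shorter than one.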

IOMMU-DMA memory page unmapping

iommu_dma_unmap_page()

drivers/iommu/dma-iommu.c

void iommu_dma_unmap_page(struct device *dev, dma_addr_t handle, size_t size,
                enum dma_data_direction dir, unsigned long attrs)
{
        __iommu_dma_unmap(iommu_get_dma_domain(dev), handle, size);
}

Unmaps the dma physical pages and the iova space used by the dma device, for size bytes.

__iommu_dma_unmap()

drivers/iommu/dma-iommu.c

static void __iommu_dma_unmap(struct iommu_domain *domain, dma_addr_t dma_addr,
                size_t size)
{
        struct iommu_dma_cookie *cookie = domain->iova_cookie;
        struct iova_domain *iovad = &cookie->iovad;
        size_t iova_off = iova_offset(iovad, dma_addr);

        dma_addr -= iova_off;
        size = iova_align(iovad, size + iova_off);

        WARN_ON(iommu_unmap_fast(domain, dma_addr, size) != size);
        if (!cookie->fq_domain)
                iommu_tlb_sync(domain);
        iommu_dma_free_iova(cookie, dma_addr, size);
}

IOVA space management

The iova space is managed with an RB tree and rcaches.

init_iova_domain()

drivers/iommu/iova.c

void
init_iova_domain(struct iova_domain *iovad, unsigned long granule,
        unsigned long start_pfn)
{
        /*
         * IOVA granularity will normally be equal to the smallest
         * supported IOMMU page size; both *must* be capable of
         * representing individual CPU pages exactly.
         */
        BUG_ON((granule > PAGE_SIZE) || !is_power_of_2(granule));

        spin_lock_init(&iovad->iova_rbtree_lock);
        iovad->rbroot = RB_ROOT;
        iovad->cached_node = &iovad->anchor.node;
        iovad->cached32_node = &iovad->anchor.node;
        iovad->granule = granule;
        iovad->start_pfn = start_pfn;
        iovad->dma_32bit_pfn = 1UL << (32 - iova_shift(iovad));
        iovad->max32_alloc_size = iovad->dma_32bit_pfn;
        iovad->flush_cb = NULL;
        iovad->fq = NULL;
        iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR;
        rb_link_node(&iovad->anchor.node, NULL, &iovad->rbroot.rb_node);
        rb_insert_color(&iovad->anchor.node, &iovad->rbroot);
        init_iova_rcaches(iovad);
}
EXPORT_SYMBOL_GPL(init_iova_domain);

Initializes the IOVA domain.

IOVA reserved region management

iova_reserve_iommu_regions()

drivers/iommu/dma-iommu.c

static int iova_reserve_iommu_regions(struct device *dev,
                struct iommu_domain *domain)
{
        struct iommu_dma_cookie *cookie = domain->iova_cookie;
        struct iova_domain *iovad = &cookie->iovad;
        struct iommu_resv_region *region;
        LIST_HEAD(resv_regions);
        int ret = 0;

        if (dev_is_pci(dev))
                iova_reserve_pci_windows(to_pci_dev(dev), iovad);

        iommu_get_resv_regions(dev, &resv_regions);
        list_for_each_entry(region, &resv_regions, list) {
                unsigned long lo, hi;

                /* We ARE the software that manages these! */
                if (region->type == IOMMU_RESV_SW_MSI)
                        continue;

                lo = iova_pfn(iovad, region->start);
                hi = iova_pfn(iovad, region->start + region->length - 1);
                reserve_iova(iovad, lo, hi);

                if (region->type == IOMMU_RESV_MSI)
                        ret = cookie_init_hw_msi_region(cookie, region->start,
                                        region->start + region->length);
                if (ret)
                        break;
        }
        iommu_put_resv_regions(dev, &resv_regions);

        return ret;
}

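Registers the reserved iommu regions in the IOVA domain.

In code lines 10~11, for a pci device, the window regions of the host bridge are reserved in the IOVA domain.
- window
  - a memory BAR region of the pci host bridge
In code line 13, the reserved-region information of the iommu used by the device is temporarily added to the resv_regions list so that it can be used here.
In code lines 14~19, the reserved regions (of the four reserve types) are walked, and regions of the sw_msi type are skipped.
In code lines 21~23, the pfn values of the start and the end of the region are used to register the region in the IOVA domain.
- The pfn ranges are kept in an RB tree under the rbroot member of the IOVA domain.
In code lines 25~29, if the reserve type is msi, the cookie is initialized for the hardware MSI region.
In code line 31, the reserved-region information is removed from the temporary resv_regions list and freed.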
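
The lo/hi computation in code lines 21~23 can be illustrated with a small user-space sketch. The helper names are hypothetical, and a power-of-two granule is assumed; note that hi is derived from the last byte of the region (start + length - 1), so the pfn range is inclusive.

```c
#include <stdint.h>

/* Hypothetical helper: page frame number of addr for the given granule. */
static uint64_t demo_iova_pfn(uint64_t granule, uint64_t addr)
{
        return addr / granule;
}

/* Convert a reserved region (start, length) into an inclusive pfn range. */
static void demo_resv_to_pfn_range(uint64_t granule, uint64_t start,
                                   uint64_t length, uint64_t *lo, uint64_t *hi)
{
        *lo = demo_iova_pfn(granule, start);
        *hi = demo_iova_pfn(granule, start + length - 1);      /* last byte, not one past */
}
```

For a made-up region starting at 0x40000000 with length 0x10000 and a 4096-byte granule, the range handed to the reserve step would be pfn 0x40000 through 0x4000f.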

iommu_get_resv_regions()

drivers/iommu/iommu.c

void iommu_get_resv_regions(struct device *dev, struct list_head *list)
{
        const struct iommu_ops *ops = dev->bus->iommu_ops;

        if (ops && ops->get_resv_regions)
                ops->get_resv_regions(dev, list);
}

The iommu driver used by the device prepares its reserved regions and adds them to @list.

arm_smmu_get_resv_regions() – ARM, ARM64

drivers/iommu/arm-smmu.c

static void arm_smmu_get_resv_regions(struct device *dev,
                                      struct list_head *head)
{
        struct iommu_resv_region *region;
        int prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;

        region = iommu_alloc_resv_region(MSI_IOVA_BASE, MSI_IOVA_LENGTH,
                                         prot, IOMMU_RESV_SW_MSI);
        if (!region)
                return;

        list_add_tail(&region->list, head);

        iommu_dma_get_resv_regions(dev, head);
}

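Allocates the regions reserved for the smmu and adds them to the list.

In code line 5, the write-enabled, no-execute, and mmio mapping flags are selected.
In code lines 7~10, the msi iova region from 0x800_0000 up to 0x810_0000 is allocated as a reserved region.
- MSI_IOVA_BASE(0x800_0000)
- MSI_IOVA_LENGTH(0x10_0000)
In code line 12, this region is added to the @head list passed in as an argument.
In code line 14, on servers that use acpi rather than a device tree, reserved regions are added through acpi.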

IOVA Rcaches

The ranges of an IOVA domain are managed with an RB tree, but rcaches are used on top of it to provide a faster, fast-path allocation.

init_iova_rcaches()

drivers/iommu/iova.c

static void init_iova_rcaches(struct iova_domain *iovad)
{
        struct iova_cpu_rcache *cpu_rcache;
        struct iova_rcache *rcache;
        unsigned int cpu;
        int i;

        for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) {
                rcache = &iovad->rcaches[i];
                spin_lock_init(&rcache->lock);
                rcache->depot_size = 0;
                rcache->cpu_rcaches = __alloc_percpu(sizeof(*cpu_rcache), cache_line_size());
                if (WARN_ON(!rcache->cpu_rcaches))
                        continue;
                for_each_possible_cpu(cpu) {
                        cpu_rcache = per_cpu_ptr(rcache->cpu_rcaches, cpu);
                        spin_lock_init(&cpu_rcache->lock);
                        cpu_rcache->loaded = iova_magazine_alloc(GFP_KERNEL);
                        cpu_rcache->prev = iova_magazine_alloc(GFP_KERNEL);
                }
        }
}

Initializes the rcaches so that IOVA ranges can be allocated through the fast path.

The following figure shows the rcaches after initialization.

alloc_iova_fast()

drivers/iommu/iova.c

/**
 * alloc_iova_fast - allocates an iova from rcache
 * @iovad: - iova domain in question
 * @size: - size of page frames to allocate
 * @limit_pfn: - max limit address
 * @flush_rcache: - set to flush rcache on regular allocation failure
 * This function tries to satisfy an iova allocation from the rcache,
 * and falls back to regular allocation on failure. If regular allocation
 * fails too and the flush_rcache flag is set then the rcache will be flushed.
*/

unsigned long
alloc_iova_fast(struct iova_domain *iovad, unsigned long size,
                unsigned long limit_pfn, bool flush_rcache)
{
        unsigned long iova_pfn;
        struct iova *new_iova;

        iova_pfn = iova_rcache_get(iovad, size, limit_pfn + 1);
        if (iova_pfn)
                return iova_pfn;

retry:
        new_iova = alloc_iova(iovad, size, limit_pfn, true);
        if (!new_iova) {
                unsigned int cpu;

                if (!flush_rcache)
                        return 0;

                /* Try replenishing IOVAs by flushing rcache. */
                flush_rcache = false;
                for_each_online_cpu(cpu)
                        free_cpu_cached_iovas(cpu, iovad);
                goto retry;
        }

        return new_iova->pfn_lo;
}
EXPORT_SYMBOL_GPL(alloc_iova_fast);

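Allocates IOVA space and returns the iova pfn. The fast path (rcache) is tried first; if that fails, the slow path (RB tree) is used.

In code lines 8~10, an iova pfn is requested from the per-cpu rcache and, if one is available, it is returned. (Fast-path)
In code line 13, a range of size pages is allocated within the limit_pfn bound of the IOVA domain. (Slow-path)
In code lines 14~25, if that allocation fails, 0 is returned. If the flush_rcache option is set, however, the IOVAs cached for every online cpu are freed first and the allocation is retried once.

The following figure shows an IOVA range being allocated from the rcache (fast path) first and, on failure, through the regular RB-tree method (slow path).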
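
The control flow of alloc_iova_fast() (fast path, slow path, then one flush-and-retry) can be modelled with a toy user-space sketch. Everything here is an assumption made for illustration: free ranges are only counted rather than tracked, struct demo_pool and demo_alloc_fast() are hypothetical names, and locking is left out.

```c
#include <stdbool.h>

#define DEMO_NR_CPUS 2

/* Hypothetical model: free ranges are only counted, not tracked individually. */
struct demo_pool {
        unsigned int tree_free;                 /* ranges still free in the "RB tree" */
        unsigned int cpu_cached[DEMO_NR_CPUS];  /* ranges parked in each per-cpu cache */
};

enum demo_result {
        DEMO_FAIL,
        DEMO_FROM_RCACHE,
        DEMO_FROM_TREE,
        DEMO_FROM_TREE_AFTER_FLUSH,
};

static enum demo_result demo_alloc_fast(struct demo_pool *p, unsigned int cpu,
                                        bool flush_rcache)
{
        bool flushed = false;
        unsigned int i;

        /* Fast path: take a range from this cpu's cache. */
        if (p->cpu_cached[cpu] > 0) {
                p->cpu_cached[cpu]--;
                return DEMO_FROM_RCACHE;
        }

retry:
        /* Slow path: take a range from the tree. */
        if (p->tree_free > 0) {
                p->tree_free--;
                return flushed ? DEMO_FROM_TREE_AFTER_FLUSH : DEMO_FROM_TREE;
        }

        if (!flush_rcache)
                return DEMO_FAIL;

        /* Return every cpu's cached ranges to the tree, then retry exactly once. */
        flush_rcache = false;
        flushed = true;
        for (i = 0; i < DEMO_NR_CPUS; i++) {
                p->tree_free += p->cpu_cached[i];
                p->cpu_cached[i] = 0;
        }
        goto retry;
}
```

Clearing flush_rcache before the retry is what bounds the loop: a second failure falls straight through to the failure return, just as in the kernel function.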

iova_rcache_insert()

drivers/iommu/iova.c

static bool iova_rcache_insert(struct iova_domain *iovad, unsigned long pfn,
                               unsigned long size)
{
        unsigned int log_size = order_base_2(size);

        if (log_size >= IOVA_RANGE_CACHE_MAX_SIZE)
                return false;

        return __iova_rcache_insert(iovad, &iovad->rcaches[log_size], pfn);
}

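Selects, in the requested IOVA domain, the rcache for the order that corresponds to @size, and adds the iova @pfn to it.

rcaches are managed per order; orders 0 through IOVA_RANGE_CACHE_MAX_SIZE(6) - 1 are used.
- Only ranges of 1, 2, 4, 8, 16, or 32 pages can be cached.

The following figure shows an rcache being selected by @size from the per-order rcaches, and an iova pfn being added to it.

The pfn is first inserted into the loaded magazine of the cpu rcache.
If the loaded magazine of the cpu rcache is full, it is swapped with prev and the pfn is then inserted into loaded.
If prev is full as well, a new magazine is allocated and installed as loaded, and the pfn is inserted there. The previous loaded magazine is moved to the depot or, when the depot is also full, its pfns are freed back to the IOVA domain.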
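
How a size maps to an rcache index can be sketched as below. The demo_* names are hypothetical; demo_order_base_2() is a plain-C stand-in that computes the rounded-up log2 that order_base_2() yields, and the limit of 6 is the value quoted above.

```c
#include <stdbool.h>

#define DEMO_RANGE_CACHE_MAX_SIZE 6     /* value quoted above for IOVA_RANGE_CACHE_MAX_SIZE */

/* ceil(log2(n)) for n >= 1: the smallest order with (1 << order) >= n. */
static unsigned int demo_order_base_2(unsigned long n)
{
        unsigned int order = 0;

        while ((1UL << order) < n)
                order++;
        return order;
}

/* Returns true and stores the rcache index when the size is cacheable. */
static bool demo_rcache_index(unsigned long size, unsigned int *index)
{
        unsigned int log_size = demo_order_base_2(size);

        if (log_size >= DEMO_RANGE_CACHE_MAX_SIZE)
                return false;   /* too large: such a range bypasses the rcache */
        *index = log_size;
        return true;
}
```

Because the order is rounded up, a 3-page request lands in the same rcache as a 4-page one (index 2), while anything above 32 pages is rejected.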

__iova_rcache_insert()

drivers/iommu/iova.c

/*
 * Try inserting IOVA range starting with 'iova_pfn' into 'rcache', and
 * return true on success.  Can fail if rcache is full and we can't free
 * space, and free_iova() (our only caller) will then return the IOVA
 * range to the rbtree instead.
 */

static bool __iova_rcache_insert(struct iova_domain *iovad,
                                 struct iova_rcache *rcache,
                                 unsigned long iova_pfn)
{
        struct iova_magazine *mag_to_free = NULL;
        struct iova_cpu_rcache *cpu_rcache;
        bool can_insert = false;
        unsigned long flags;

        cpu_rcache = raw_cpu_ptr(rcache->cpu_rcaches);
        spin_lock_irqsave(&cpu_rcache->lock, flags);

        if (!iova_magazine_full(cpu_rcache->loaded)) {
                can_insert = true;
        } else if (!iova_magazine_full(cpu_rcache->prev)) {
                swap(cpu_rcache->prev, cpu_rcache->loaded);
                can_insert = true;
        } else {
                struct iova_magazine *new_mag = iova_magazine_alloc(GFP_ATOMIC);

                if (new_mag) {
                        spin_lock(&rcache->lock);
                        if (rcache->depot_size < MAX_GLOBAL_MAGS) {
                                rcache->depot[rcache->depot_size++] =
                                                cpu_rcache->loaded;
                        } else {
                                mag_to_free = cpu_rcache->loaded;
                        }
                        spin_unlock(&rcache->lock);

                        cpu_rcache->loaded = new_mag;
                        can_insert = true;
                }
        }

        if (can_insert)
                iova_magazine_push(cpu_rcache->loaded, iova_pfn);

        spin_unlock_irqrestore(&cpu_rcache->lock, flags);

        if (mag_to_free) {
                iova_magazine_free_pfns(mag_to_free, iovad);
                iova_magazine_free(mag_to_free);
        }

        return can_insert;
}

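Adds the iova @pfn to the given rcache of the requested IOVA domain.

In code lines 10~14, if the magazine in loaded of the per-cpu rcache is not full, can_insert is set to true.
In code lines 15~17, if the magazine in prev of the per-cpu rcache is not full, it is swapped into loaded and can_insert is set to true.
In code lines 18~32, when both are full, a new magazine is allocated. If the depot has room, the current loaded magazine is pushed onto the depot; otherwise it is stored in the temporary variable mag_to_free so that it can be freed later. The new magazine is then assigned to loaded and can_insert is set to true.
In code lines 36~37, if can_insert is true, the pfn is pushed onto the loaded magazine.
In code lines 41~44, if the depot was full, all pfns of the old loaded magazine are returned to the IOVA domain and the magazine is freed.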

iova_rcache_get()

drivers/iommu/iova.c

/*
 * Try to satisfy IOVA allocation range from rcache.  Fail if requested
 * size is too big or the DMA limit we are given isn't satisfied by the
 * top element in the magazine.
 */

static unsigned long iova_rcache_get(struct iova_domain *iovad,
                                     unsigned long size,
                                     unsigned long limit_pfn)
{
        unsigned int log_size = order_base_2(size);

        if (log_size >= IOVA_RANGE_CACHE_MAX_SIZE)
                return 0;

        return __iova_rcache_get(&iovad->rcaches[log_size], limit_pfn - size);
}

Allocates IOVA space within the @limit_pfn bound from the rcache that manages the order of size in the requested IOVA domain, and returns the corresponding iova pfn.

rcaches are managed per order; orders 0 through 5 are used.

__iova_rcache_get()

drivers/iommu/iova.c

/*
 * Caller wants to allocate a new IOVA range from 'rcache'.  If we can
 * satisfy the request, return a matching non-NULL range and remove
 * it from the 'rcache'.
 */

static unsigned long __iova_rcache_get(struct iova_rcache *rcache,
                                       unsigned long limit_pfn)
{
        struct iova_cpu_rcache *cpu_rcache;
        unsigned long iova_pfn = 0;
        bool has_pfn = false;
        unsigned long flags;

        cpu_rcache = raw_cpu_ptr(rcache->cpu_rcaches);
        spin_lock_irqsave(&cpu_rcache->lock, flags);

        if (!iova_magazine_empty(cpu_rcache->loaded)) {
                has_pfn = true;
        } else if (!iova_magazine_empty(cpu_rcache->prev)) {
                swap(cpu_rcache->prev, cpu_rcache->loaded);
                has_pfn = true;
        } else {
                spin_lock(&rcache->lock);
                if (rcache->depot_size > 0) {
                        iova_magazine_free(cpu_rcache->loaded);
                        cpu_rcache->loaded = rcache->depot[--rcache->depot_size];
                        has_pfn = true;
                }
                spin_unlock(&rcache->lock);
        }

        if (has_pfn)
                iova_pfn = iova_magazine_pop(cpu_rcache->loaded, limit_pfn);

        spin_unlock_irqrestore(&cpu_rcache->lock, flags);

        return iova_pfn;
}

Allocates IOVA space from the requested rcache within the @limit_pfn bound and returns the corresponding iova pfn.

In code lines 9~13, if the loaded magazine of the per-cpu rcache is not empty, has_pfn is set to true.
In code lines 14~16, otherwise, if the prev magazine is not empty, prev is swapped into loaded and has_pfn is set to true.
In code lines 17~25, otherwise, the rcache lock is taken and, if depot_size is non-zero, the empty loaded magazine is freed, the last magazine in depot[] is installed as loaded, and has_pfn is set to true.
In code lines 27~28, if has_pfn is true, an iova_pfn is popped from the loaded magazine.
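
The loaded/prev/depot scheme used by the insert and get paths can be tried out with a single-cpu user-space model. This is a simplified sketch, not the kernel code: the demo_* names and the tiny sizes are made up, locking and the limit_pfn check of the pop are left out, and pfns that would be returned to the RB tree are only counted.

```c
#include <stdbool.h>
#include <stdlib.h>

#define DEMO_MAG_SIZE   4       /* small on purpose; not the kernel's value */
#define DEMO_MAX_DEPOT  2       /* stand-in for MAX_GLOBAL_MAGS */

struct demo_mag {
        unsigned long size;
        unsigned long pfns[DEMO_MAG_SIZE];
};

struct demo_rcache {
        struct demo_mag *loaded, *prev;
        struct demo_mag *depot[DEMO_MAX_DEPOT];
        unsigned long depot_size;
        unsigned long returned_to_tree; /* pfns that would go back to the RB tree */
};

static bool demo_rcache_init(struct demo_rcache *rc)
{
        rc->loaded = calloc(1, sizeof(*rc->loaded));
        rc->prev = calloc(1, sizeof(*rc->prev));
        rc->depot_size = 0;
        rc->returned_to_tree = 0;
        return rc->loaded && rc->prev;
}

static void demo_swap(struct demo_rcache *rc)
{
        struct demo_mag *tmp = rc->prev;

        rc->prev = rc->loaded;
        rc->loaded = tmp;
}

/* Insert: loaded, else swap with prev, else retire loaded and start a new one. */
static bool demo_rcache_insert(struct demo_rcache *rc, unsigned long pfn)
{
        if (rc->loaded->size == DEMO_MAG_SIZE) {
                if (rc->prev->size < DEMO_MAG_SIZE) {
                        demo_swap(rc);
                } else {
                        struct demo_mag *new_mag = calloc(1, sizeof(*new_mag));

                        if (!new_mag)
                                return false;
                        if (rc->depot_size < DEMO_MAX_DEPOT) {
                                rc->depot[rc->depot_size++] = rc->loaded;
                        } else {
                                /* depot full: give the old pfns back to the tree */
                                rc->returned_to_tree += rc->loaded->size;
                                free(rc->loaded);
                        }
                        rc->loaded = new_mag;
                }
        }
        rc->loaded->pfns[rc->loaded->size++] = pfn;
        return true;
}

/* Get: loaded, else swap with prev, else refill from the depot; 0 means miss. */
static unsigned long demo_rcache_get(struct demo_rcache *rc)
{
        if (rc->loaded->size == 0) {
                if (rc->prev->size > 0) {
                        demo_swap(rc);
                } else if (rc->depot_size > 0) {
                        free(rc->loaded);
                        rc->loaded = rc->depot[--rc->depot_size];
                } else {
                        return 0;
                }
        }
        return rc->loaded->pfns[--rc->loaded->size];
}
```

Inserting pfns 1 through 9 into a fresh model fills one magazine, swaps to the second, and then parks a full magazine in the depot; subsequent gets drain the loaded magazine, then prev, and only then pull the parked magazine back from the depot.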

References

DMA -1- (Basic) | 문c
DMA -2- (DMA Coherent Memory) | 문c
DMA -3- (DMA Pool) | 문c
DMA -4- (DMA Mapping) | 문c
DMA -5- (IOMMU) | 문c – this post
DMA -6- (DMAEngine Subsystem) | 문c
IOMMU | 문c

ARM® CoreLink™ MMU-500 System Memory Management Unit – Technical Reference Manual – download pdf

IOMMU

2019-03-18 / 2019-03-20 문영일

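IOMMU (Input Output Memory Management Unit)

An IOMMU can do the following.

Translation
- Provides mappings so that device (IO or bus) addresses can be translated to physical addresses.
- A buffer used for DMA normally has to be physically contiguous; with an IOMMU that restriction goes away.
  - Because the IOMMU uses a mapping table separate from the MMU, the pages do not have to be contiguous in system memory.
Isolation
- Provides access control over device accesses to memory.
IO Virtualization
- Supports virtualization, so that a device can use its own DMA virtual address space.
- Just as the cpu has what amounts to a separate MMU for virtualization, the IOMMU works in a similar way.

In the figure below, the left side is a logical diagram of how devices and the CPU access memory, and the right side shows where the ARM SMMU used on the ARM and ARM64 architectures is placed.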

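Memory allocation constraints and performance

Consider the performance of the following two ways of accessing system memory.

Direct DMA
- Consider some cases in which the system can access only part of memory for DMA.
  - A 32-bit system that can use only the limited ZONE_DMA or ZONE_NORMAL region
  - A 64-bit system whose physical address range exceeds 4G but that can use only the limited ZONE_DMA32 (4G) region
- When only a very limited region can be used for DMA, only the DMA buffer is placed in that region, and after the DMA transfer the system has to copy the data to the user's buffer. Using the swiotlb implementation, which relies on such bounce buffers, degrades performance.
- Because physically contiguous memory is required, handling can be delayed when many buffers or large buffers are needed.
IOMMU
- With an IOMMU the device can use the whole range of physical memory, so large buffers can be set up.
- Through the IOMMU mapping table the device can access scattered physical memory.
  - Scattered physical memory is allocated for the DMA buffer, and the device accesses it as one contiguous virtual address range through the IOMMU mapping table.
- Special restriction: on a 64-bit system whose physical address range exceeds 4G, if the IOMMU is used in 32-bit mode, buffers cannot be placed anywhere in system memory. The swiotlb method is used in this case too, so performance degrades here as well. Even this use case is migrating to the 64-bit IOMMU mode, so it should become increasingly rare.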
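
The idea that scattered physical pages appear to the device as one contiguous IOVA range can be illustrated with a toy single-level table. This is a minimal sketch with hypothetical names and a 4 KiB page size; real IOMMUs such as the SMMU use multi-level page tables, and iova_base is assumed to be page aligned.

```c
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_SIZE  (1ULL << DEMO_PAGE_SHIFT)
#define DEMO_NR_ENTRIES 4

/* Hypothetical one-level table: entry i translates IOVA page (iova_base + i). */
struct demo_iommu_table {
        uint64_t iova_base;                     /* first IOVA covered, page aligned */
        uint64_t phys_page[DEMO_NR_ENTRIES];    /* physical address of each page */
};

/* Translate a device-visible IOVA to a physical address; returns 0 on a fault. */
static uint64_t demo_iova_to_phys(const struct demo_iommu_table *t, uint64_t iova)
{
        uint64_t idx;

        if (iova < t->iova_base)
                return 0;
        idx = (iova - t->iova_base) >> DEMO_PAGE_SHIFT;
        if (idx >= DEMO_NR_ENTRIES)
                return 0;
        /* page frame from the table, byte offset from the IOVA */
        return t->phys_page[idx] + (iova & (DEMO_PAGE_SIZE - 1));
}
```

If the four entries point at unrelated physical pages, the device still sees a single 16 KiB buffer starting at iova_base, which is why the physically contiguous allocation requirement of direct DMA disappears.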

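Representative IOMMUs

Intel
- Adopts VT-d (Virtualization Technology for Directed I/O) in the North Bridge to provide an IO hub, and also supports virtualization.
AMD
- Adopts a dual IOMMU to provide an IO hub, and also supports virtualization.
ARM and ARM64
- Several versions of the SMMU are available, and they likewise support virtualization.
PCI-SIG
- Has an internal IOMMU and provides I/O Virtualization (IOV) and Address Translation Services (ATS).
Nvidia graphics cards
- Have an internal IOMMU called GART (Graphics Address Remapping Table) and support virtualization.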

IOMMU-API vs DMA-IOMMU API

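Apart from a few drivers, it is rare to use the IOMMU API directly in order to use an IOMMU. Most drivers work with the IOMMU through the DMA-IOMMU API.

The analysis focuses on the IOMMU core APIs and, as the corresponding driver, the code for ARM's SMMU-500 (compatible="arm,mmu-500").

The driver referenced here supports ARM smmu v1 and v2.
- The ARM smmu v3 driver uses separate code.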

Default Domain

One default domain is allocated for each group of devices.
Initially, all devices belong to the default domain.
The default domain uses the common DMA-API implementation.

Device tree

smmu v1 usage

iommu driver

        /* SMMU with stream matching or stream indexing */
        smmu1: iommu {
                compatible = "arm,smmu-v1";
                reg = <0xba5e0000 0x10000>;
                #global-interrupts = <2>;
                interrupts = <0 32 4>,
                             <0 33 4>,
                             <0 34 4>, /* This is the first context interrupt */
                             <0 35 4>,
                             <0 36 4>,
                             <0 37 4>;
                #iommu-cells = <1>;
        };

iommu device

        /* device with two stream IDs, 0 and 7 */
        master1 {
                iommus = <&smmu1 0>,
                         <&smmu1 7>;
        };

smmu v1 use case

smmu driver

./arm/juno-base.dtsi

        smmu_pcie: iommu@2b500000 {
                compatible = "arm,mmu-401", "arm,smmu-v1";
                reg = <0x0 0x2b500000 0x0 0x10000>;
                interrupts = <GIC_SPI 40 IRQ_TYPE_LEVEL_HIGH>,
                             <GIC_SPI 40 IRQ_TYPE_LEVEL_HIGH>;
                #iommu-cells = <1>;
                #global-interrupts = <1>;
                dma-coherent;
                status = "disabled";
        };

smmu device

./arm/juno-base.dtsi

        pcie_ctlr: pcie@40000000 {
                compatible = "arm,juno-r1-pcie", "plda,xpressrich3-axi", "pci-host-ecam-generic";
                device_type = "pci";
                reg = <0 0x40000000 0 0x10000000>;      /* ECAM config space */
                bus-range = <0 255>;
                linux,pci-domain = <0>;
                #address-cells = <3>;
                #size-cells = <2>;
                dma-coherent;
                ranges = <0x01000000 0x00 0x00000000 0x00 0x5f800000 0x0 0x00800000>,
                         <0x02000000 0x00 0x50000000 0x00 0x50000000 0x0 0x08000000>,
                         <0x42000000 0x40 0x00000000 0x40 0x00000000 0x1 0x00000000>;
                #interrupt-cells = <1>;
                interrupt-map-mask = <0 0 0 7>;
                interrupt-map = <0 0 0 1 &gic 0 0 0 136 4>,
                                <0 0 0 2 &gic 0 0 0 137 4>,
                                <0 0 0 3 &gic 0 0 0 138 4>,
                                <0 0 0 4 &gic 0 0 0 139 4>;
                msi-parent = <&v2m_0>;
                status = "disabled";
                iommu-map-mask = <0x0>; /* RC has no means to output PCI RID */
                iommu-map = <0x0 &smmu_pcie 0x0 0x1>;
        };
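
The iommu-map and iommu-map-mask properties in the node above describe how a PCI requester ID (RID) is turned into the ID presented to the IOMMU: the RID is masked first, and each iommu-map entry <rid-base &iommu iommu-base length> maps the masked range [rid-base, rid-base + length) onto IDs starting at iommu-base. A minimal user-space sketch of that rule follows; the structure and function names are hypothetical and the phandle cell is omitted.

```c
#include <stdbool.h>
#include <stdint.h>

/* One iommu-map entry: <rid-base &iommu iommu-base length> (phandle omitted). */
struct demo_iommu_map_entry {
        uint32_t rid_base;
        uint32_t iommu_base;
        uint32_t length;
};

/* Translate a requester ID to an IOMMU-side ID; returns false if no entry matches. */
static bool demo_rid_to_sid(const struct demo_iommu_map_entry *map,
                            unsigned int nr_entries, uint32_t map_mask,
                            uint32_t rid, uint32_t *sid)
{
        uint32_t masked = rid & map_mask;
        unsigned int i;

        for (i = 0; i < nr_entries; i++) {
                if (masked >= map[i].rid_base &&
                    masked - map[i].rid_base < map[i].length) {
                        *sid = masked - map[i].rid_base + map[i].iommu_base;
                        return true;
                }
        }
        return false;
}
```

With the Juno values (mask 0x0, one entry <0x0 &smmu_pcie 0x0 0x1>), every RID is masked down to 0 and therefore maps to ID 0, which matches the comment that the root complex has no means to output the PCI RID.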

IOMMU core initialization

iommu_init()

static int __init iommu_init(void)
{
        iommu_group_kset = kset_create_and_add("iommu_groups",
                                               NULL, kernel_kobj);
        BUG_ON(!iommu_group_kset);

        iommu_debugfs_setup();

        return 0;
}
core_initcall(iommu_init);

Creates the "/sys/kernel/iommu_groups" directory.

When an iommu group is allocated, a directory named after the id of that iommu group is created under the directory above.

IOMMU Core API

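Only the basic functions, highlighted in blue in the original post, are analyzed at source level.

IOMMU device

iommu_device_register()
- Registers an iommu device with the system.
iommu_device_unregister()
- Unregisters an iommu device from the system.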

IOMMU domain

iommu_domain_alloc()
- Allocates a new iommu domain.
iommu_domain_free()
- Frees an iommu domain.
iommu_domain_window_enable()
iommu_domain_window_disable()
iommu_domain_set_attr()
- Sets the data for a given domain attribute.
  - DOMAIN_ATTR_GEOMETRY
  - DOMAIN_ATTR_PAGING
  - DOMAIN_ATTR_WINDOWS
  - DOMAIN_ATTR_FSL_PAMU_STASH
  - DOMAIN_ATTR_FSL_PAMU_ENABLE
  - DOMAIN_ATTR_FSL_PAMUV1
  - DOMAIN_ATTR_NESTING
  - DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE
iommu_domain_get_attr()
- Gets the data for a given domain attribute.

IOMMU group

iommu_group_alloc()
- Called by an iommu driver to allocate a new iommu group.
iommu_group_release()
- Releases an iommu group.
iommu_group_get_by_id()
- Looks up an iommu group by its id number.
iommu_group_get_iommudata()
- Gets the data stored in an iommu group.
iommu_group_set_iommudata()
- Stores data in an iommu group.
iommu_group_set_name()
- Sets the name of an iommu group.
iommu_group_get()
- Gets the group of a device and increments its reference count.
iommu_group_put()
- Called when the device is done using the iommu group; decrements the reference count.
iommu_group_register_notifier()
- Registers a notifier callback that is invoked whenever a device is added to or removed from the iommu group.
- The callback runs when blocking_notifier_call_chain() is called.
iommu_group_unregister_notifier()
- Unregisters the notifier callback that was invoked whenever a device was added to or removed from the iommu group.
iommu_group_id()
- Returns the id of an iommu group.
iommu_get_group_resv_regions()
- The reserved regions of a group can be checked through the /sys/kernel/iommu_groups/N/reserved_regions file, where N is the group id.

IOMMU device

iommu_group_add_device()
- Adds a device to an iommu group.
iommu_group_remove_device()
- Removes a device from an iommu group.
iommu_group_for_each_dev()
- Iterates over the devices in an iommu group.
iommu_attach_device()
- Attaches the group that contains the device to a domain.
iommu_detach_device()
- Detaches the group that contains the device from a domain.
iommu_get_domain_for_dev()
- Gets the domain of the group the device belongs to.
iommu_attach_group()
- Attaches an iommu group to an iommu domain.
iommu_detach_group()
- Detaches an iommu group from an iommu domain.

IOMMU mapping

iommu_map()
- Maps pages.
iommu_map_sg()
- Maps scattered (scatter/gather) memory.
iommu_unmap()
- Unmaps pages.
iommu_unmap_fast()
- Same as iommu_unmap(), but does not perform the iotlb sync.

Device tree

iommu_fwspec_init()
iommu_fwspec_add_ids()
iommu_fwspec_free()

Miscellaneous

iommu_set_fault_handler()
report_iommu_fault()
iommu_iova_to_phys()

IOMMU device registration/unregistration

iommu_device_register()

drivers/iommu/iommu.c

int iommu_device_register(struct iommu_device *iommu)
{
        spin_lock(&iommu_device_lock);
        list_add_tail(&iommu->list, &iommu_device_list);
        spin_unlock(&iommu_device_lock);

        return 0;
}

Registers an iommu device with the system.

The iommu device is added to the global list iommu_device_list.

iommu_device_unregister()

drivers/iommu/iommu.c

void iommu_device_unregister(struct iommu_device *iommu)
{
        spin_lock(&iommu_device_lock);
        list_del(&iommu->list);
        spin_unlock(&iommu_device_lock);
}

Unregisters an iommu device from the system.

The iommu device is removed from the global list iommu_device_list.

IOMMU Domain

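The domain facility is provided to support (protect) guest VMs.

Devices can be moved between the host and VMs.

iommu domain types

IOMMU_DOMAIN_BLOCKED
- All DMA is blocked, so the domain can be used to isolate a device.
IOMMU_DOMAIN_IDENTITY
- DMA addresses are the same as system physical addresses. (1:1 identity mapping)
IOMMU_DOMAIN_UNMANAGED
- Used for virtual machines; the DMA mappings are managed by the user of the IOMMU-API.
IOMMU_DOMAIN_DMA
- The type used internally to implement the DMA-API.
- This flag allows the IOMMU driver to implement certain optimizations for the domain.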

iommu_domain_alloc()

drivers/iommu/iommu.c

struct iommu_domain *iommu_domain_alloc(struct bus_type *bus)
{
        return __iommu_domain_alloc(bus, IOMMU_DOMAIN_UNMANAGED);
}
EXPORT_SYMBOL_GPL(iommu_domain_alloc);

Allocates an iommu domain for the bus.

A domain created this way can be used with the IOMMU-API.
With a few exceptions, IOMMU-API function names begin with iommu.
- iommu_*()

__iommu_domain_alloc()

drivers/iommu/iommu.c

static struct iommu_domain *__iommu_domain_alloc(struct bus_type *bus,
                                                 unsigned type)
{
        struct iommu_domain *domain;

        if (bus == NULL || bus->iommu_ops == NULL)
                return NULL;

        domain = bus->iommu_ops->domain_alloc(type);
        if (!domain)
                return NULL;

        domain->ops  = bus->iommu_ops;
        domain->type = type;
        /* Assume all sizes by default; the driver may override this later */
        domain->pgsize_bitmap  = bus->iommu_ops->pgsize_bitmap;

        return domain;
}

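Allocates a new iommu domain for the requested bus.

In code lines 6~7, if no iommu ops are set for the bus, the function returns NULL.
In code lines 9~11, the iommu driver allocates the domain.
In code lines 13~16, the bus ops, the pgsize_bitmap, and the type passed as an argument are stored in the allocated domain.

arm_smmu_domain_alloc() – compatible="arm,mmu-500"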

drivers/iommu/arm-smmu.c

static struct iommu_domain *arm_smmu_domain_alloc(unsigned type)
{
        struct arm_smmu_domain *smmu_domain;

        if (type != IOMMU_DOMAIN_UNMANAGED &&
            type != IOMMU_DOMAIN_DMA &&
            type != IOMMU_DOMAIN_IDENTITY)
                return NULL;
        /*
         * Allocate the domain and initialise some of its data structures.
         * We can't really do anything meaningful until we've added a
         * master.
         */
        smmu_domain = kzalloc(sizeof(*smmu_domain), GFP_KERNEL);
        if (!smmu_domain)
                return NULL;

        if (type == IOMMU_DOMAIN_DMA && (using_legacy_binding ||
            iommu_get_dma_cookie(&smmu_domain->domain))) {
                kfree(smmu_domain);
                return NULL;
        }

        mutex_init(&smmu_domain->init_mutex);
        spin_lock_init(&smmu_domain->cb_lock);

        return &smmu_domain->domain;
}

Allocates an ARM smmu domain.

iommu_domain_free()

drivers/iommu/iommu.c

void iommu_domain_free(struct iommu_domain *domain)
{
        domain->ops->domain_free(domain);
}
EXPORT_SYMBOL_GPL(iommu_domain_free);

Frees the iommu domain.

arm_smmu_domain_free() – compatible="arm,mmu-500"

drivers/iommu/arm-smmu.c

static void arm_smmu_domain_free(struct iommu_domain *domain)
{
        struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);

        /*
         * Free the domain resources. We assume that all devices have
         * already been detached.
         */
        iommu_put_dma_cookie(domain);
        arm_smmu_destroy_domain_context(domain);
        kfree(smmu_domain);
}

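Frees the ARM smmu domain.

IOMMU group

Isolation may not be possible for each individual device. In that case access control is provided for the set of devices bundled into one IOMMU group.

Group name

The group_id assigned to each group at creation is a number starting from 0, and it is used directly as the group name.

The following shows the device attached to iommu group 0.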

$ ls /sys/kernel/iommu_groups/0/devices/
ff650000.vpu_service

Grouping implementation

The device grouping functions fall into the following three kinds.

PCI device grouping function
FSL-MC device grouping function
Generic device grouping function

Group allocation

iommu_group_alloc()

drivers/iommu/iommu.c

/**
 * iommu_group_alloc - Allocate a new group
 *
 * This function is called by an iommu driver to allocate a new iommu
 * group.  The iommu group represents the minimum granularity of the iommu.
 * Upon successful return, the caller holds a reference to the supplied
 * group in order to hold the group until devices are added.  Use
 * iommu_group_put() to release this extra reference count, allowing the
 * group to be automatically reclaimed once it has no devices or external
 * references.
 */

struct iommu_group *iommu_group_alloc(void)
{
        struct iommu_group *group;
        int ret;

        group = kzalloc(sizeof(*group), GFP_KERNEL);
        if (!group)
                return ERR_PTR(-ENOMEM);

        group->kobj.kset = iommu_group_kset;
        mutex_init(&group->mutex);
        INIT_LIST_HEAD(&group->devices);
        BLOCKING_INIT_NOTIFIER_HEAD(&group->notifier);

        ret = ida_simple_get(&iommu_group_ida, 0, 0, GFP_KERNEL);
        if (ret < 0) {
                kfree(group);
                return ERR_PTR(ret);
        }
        group->id = ret;

        ret = kobject_init_and_add(&group->kobj, &iommu_group_ktype,
                                   NULL, "%d", group->id);
        if (ret) {
                ida_simple_remove(&iommu_group_ida, group->id);
                kfree(group);
                return ERR_PTR(ret);
        }

        group->devices_kobj = kobject_create_and_add("devices", &group->kobj);
        if (!group->devices_kobj) {
                kobject_put(&group->kobj); /* triggers .release & free */
                return ERR_PTR(-ENOMEM);
        }

        /*
         * The devices_kobj holds a reference on the group kobject, so
         * as long as that exists so will the group.  We can therefore
         * use the devices_kobj for reference counting.
         */
        kobject_put(&group->kobj);

        ret = iommu_group_create_file(group,
                                      &iommu_group_attr_reserved_regions);
        if (ret)
                return ERR_PTR(ret);

        ret = iommu_group_create_file(group, &iommu_group_attr_type);
        if (ret)
                return ERR_PTR(ret);

        pr_debug("Allocated group %d\n", group->id);

        return group;
}
EXPORT_SYMBOL_GPL(iommu_group_alloc);

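Allocates a new iommu group.

In code lines 6~8, an iommu_group structure is allocated.
In code lines 10~13, the allocated group is initialized.
In code lines 15~20, a group id is obtained from the global iommu_group_ida.
In code lines 22~34, a directory named after the assigned group id is created under "/sys/kernel/iommu_groups/", and a "devices" directory is created beneath it.
In code lines 43~46, the "/sys/kernel/iommu_groups/N/reserved_regions" attribute file (N is the group id) is created so that the registered reserved regions can be viewed.
- Added in kernel v4.11-rc1.
- See: iommu: Implement reserved_regions iommu-group sysfs file
In code lines 48~50, the "/sys/kernel/iommu_groups/N/type" attribute file is created so that the type of the default domain the group is registered with can be viewed.
- Added in kernel v4.19-rc1.
- See: iommu: Add sysfs attribyte for domain type
In code lines 52~54, the debug message "Allocated group %d" is printed and the allocated group is returned.

Group removal

A group is freed when the sysfs directory (kobject) that corresponds to the group is removed, which causes iommu_group_release() to be called.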

iommu_group_release()

drivers/iommu/iommu.c

static void iommu_group_release(struct kobject *kobj)
{
        struct iommu_group *group = to_iommu_group(kobj);

        pr_debug("Releasing group %d\n", group->id);

        if (group->iommu_data_release)
                group->iommu_data_release(group->iommu_data);

        ida_simple_remove(&iommu_group_ida, group->id);

        if (group->default_domain)
                iommu_domain_free(group->default_domain);

        kfree(group->name);
        kfree(group);
}

Frees the group identified by the kobject embedded in the iommu group.

Adding devices to and removing devices from a group

iommu_group_add_device()

drivers/iommu/iommu.c

/**
 * iommu_group_add_device - add a device to an iommu group
 * @group: the group into which to add the device (reference should be held)
 * @dev: the device
 *
 * This function is called by an iommu driver to add a device into a
 * group.  Adding a device increments the group reference count.
 */

int iommu_group_add_device(struct iommu_group *group, struct device *dev)
{
        int ret, i = 0;
        struct group_device *device;

        device = kzalloc(sizeof(*device), GFP_KERNEL);
        if (!device)
                return -ENOMEM;

        device->dev = dev;

        ret = sysfs_create_link(&dev->kobj, &group->kobj, "iommu_group");
        if (ret)
                goto err_free_device;

        device->name = kasprintf(GFP_KERNEL, "%s", kobject_name(&dev->kobj));
rename:
        if (!device->name) {
                ret = -ENOMEM;
                goto err_remove_link;
        }

        ret = sysfs_create_link_nowarn(group->devices_kobj,
                                       &dev->kobj, device->name);
        if (ret) {
                if (ret == -EEXIST && i >= 0) {
                        /*
                         * Account for the slim chance of collision
                         * and append an instance to the name.
                         */
                        kfree(device->name);
                        device->name = kasprintf(GFP_KERNEL, "%s.%d",
                                                 kobject_name(&dev->kobj), i++);
                        goto rename;
                }
                goto err_free_name;
        }

        kobject_get(group->devices_kobj);

        dev->iommu_group = group;

        iommu_group_create_direct_mappings(group, dev);

        mutex_lock(&group->mutex);
        list_add_tail(&device->list, &group->devices);
        if (group->domain)
                ret = __iommu_attach_device(group->domain, dev);
        mutex_unlock(&group->mutex);
        if (ret)
                goto err_put_group;

        /* Notify any listeners about change to group. */
        blocking_notifier_call_chain(&group->notifier,
                                     IOMMU_GROUP_NOTIFY_ADD_DEVICE, dev);

        trace_add_device_to_group(group->id, dev);

        pr_info("Adding device %s to group %d\n", dev_name(dev), group->id);

        return 0;

err_put_group:
        mutex_lock(&group->mutex);
        list_del(&device->list);
        mutex_unlock(&group->mutex);
        dev->iommu_group = NULL;
        kobject_put(group->devices_kobj);
err_free_name:
        kfree(device->name);
err_remove_link:
        sysfs_remove_link(&dev->kobj, "iommu_group");
err_free_device:
        kfree(device);
        pr_err("Failed to add device %s to group %d: %d\n", dev_name(dev), group->id, ret);
        return ret;
}

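Adds a device to an iommu group. Returns 0 when the device is added successfully.

In code lines 6~8, a group_device structure is allocated to represent the device within the group.
In code lines 12~14, a link named "iommu_group" is created in the device directory, pointing to the group.
- e.g. "/sys/devices/platform/ff650000.vpu_service/iommu_group"
  - It points to the "/sys/kernel/iommu_groups/0" group.
In code lines 16~37, a link named after the device is created in the group's devices directory, pointing to the device.
- e.g. "/sys/kernel/iommu_groups/0/devices/ff650000.vpu_service"
  - It points to "/sys/devices/platform/ff650000.vpu_service".
- If the name collides, the link is retried with a number appended, as in "device_name.0". The number keeps increasing as needed.
In code line 41, the group is recorded in the device.
In code line 43, the direct mappings of the device are created for the group.
In code lines 45~51, with the group lock held, the device is added to the group's device list and, if the group has a domain, attached to that domain.
In code lines 54~55, the functions registered on the notifier call chain are called to announce that a device was added to the group.
In code lines 59~61, a message that the device was added to the group is printed and the function returns normally.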
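
The rename-on-collision loop of code lines 16~37 can be sketched in user space as follows. The directory is modelled as a plain array of names, and both helper names are hypothetical; the kernel instead relies on sysfs_create_link_nowarn() returning -EEXIST to detect the collision.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical directory model: returns 1 if name is already taken. */
static int demo_name_exists(const char *const *names, size_t n, const char *name)
{
        size_t i;

        for (i = 0; i < n; i++)
                if (strcmp(names[i], name) == 0)
                        return 1;
        return 0;
}

/*
 * Pick "base", or else "base.0", "base.1", ... until a free name is found.
 * Returns 0 on success, -1 if the candidate no longer fits in out.
 */
static int demo_pick_link_name(const char *const *existing, size_t n,
                               const char *base, char *out, size_t outlen)
{
        int i = 0;
        int len = snprintf(out, outlen, "%s", base);

        while (len >= 0 && (size_t)len < outlen) {
                if (!demo_name_exists(existing, n, out))
                        return 0;
                len = snprintf(out, outlen, "%s.%d", base, i++);
        }
        return -1;
}
```

If both "ff650000.vpu_service" and "ff650000.vpu_service.0" already exist, the loop settles on "ff650000.vpu_service.1".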

iommu_group_remove_device()

drivers/iommu/iommu.c

/**
 * iommu_group_remove_device - remove a device from it's current group
 * @dev: device to be removed
 *
 * This function is called by an iommu driver to remove the device from
 * it's current group.  This decrements the iommu group reference count.
 */

void iommu_group_remove_device(struct device *dev)
{
        struct iommu_group *group = dev->iommu_group;
        struct group_device *tmp_device, *device = NULL;

        pr_info("Removing device %s from group %d\n", dev_name(dev), group->id);

        /* Pre-notify listeners that a device is being removed. */
        blocking_notifier_call_chain(&group->notifier,
                                     IOMMU_GROUP_NOTIFY_DEL_DEVICE, dev);

        mutex_lock(&group->mutex);
        list_for_each_entry(tmp_device, &group->devices, list) {
                if (tmp_device->dev == dev) {
                        device = tmp_device;
                        list_del(&device->list);
                        break;
                }
        }
        mutex_unlock(&group->mutex);

        if (!device)
                return;

        sysfs_remove_link(group->devices_kobj, device->name);
        sysfs_remove_link(&dev->kobj, "iommu_group");

        trace_remove_device_from_group(group->id, dev);

        kfree(device->name);
        kfree(device);
        dev->iommu_group = NULL;
        kobject_put(group->devices_kobj);
}
EXPORT_SYMBOL_GPL(iommu_group_remove_device);

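Removes a device from its iommu group.

In code line 6, a message reporting that the device is being removed from the group is printed.
In code lines 9~10, the functions registered on the notifier call chain are called to announce in advance that the device is being removed from the iommu group.
In code lines 12~23, with the group lock held, the device is looked up in the group's device list and unlinked. If it is not found, the function returns.
In code line 25, the link to the device is removed from the iommu group.
- e.g. "/sys/kernel/iommu_groups/0/devices/ff650000.vpu_service"
In code line 26, the "iommu_group" link is removed from the device directory.
- e.g. "/sys/devices/platform/ff650000.vpu_service/iommu_group"
In code lines 30~34, the group_device is freed, the device's group pointer is cleared, and the reference on the group's devices kobject is dropped.

Attaching/detaching the devices in a group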

iommu_attach_group()

drivers/iommu/iommu.c

int iommu_attach_group(struct iommu_domain *domain, struct iommu_group *group)
{
        int ret;

        mutex_lock(&group->mutex);
        ret = __iommu_attach_group(domain, group);
        mutex_unlock(&group->mutex);

        return ret;
}
EXPORT_SYMBOL_GPL(iommu_attach_group);

With the iommu group lock held, attaches the devices in the iommu group to the requested iommu domain.

iommu_attach_device()

drivers/iommu/iommu.c

int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
{
        struct iommu_group *group;
        int ret;

        group = iommu_group_get(dev);
        if (!group)
                return -ENODEV;

        /*
         * Lock the group to make sure the device-count doesn't
         * change while we are attaching
         */
        mutex_lock(&group->mutex);
        ret = -EINVAL;
        if (iommu_group_device_count(group) != 1)
                goto out_unlock;

        ret = __iommu_attach_group(domain, group);

out_unlock:
        mutex_unlock(&group->mutex);
        iommu_group_put(group);

        return ret;
}
EXPORT_SYMBOL_GPL(iommu_attach_device);

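Attaches a device to the requested iommu domain. The attach is carried out on the device's whole group with the iommu group lock held, so it is only allowed when the device is the sole member of its group.

In code lines 16~17, if the iommu group does not contain exactly one device, the function exits with -EINVAL.
In code line 19, the devices in the iommu group are attached to the requested iommu domain.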
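
The device-count check can be modelled with a small user-space sketch. The structure and function names below (`model_group`, `model_attach_device()`) are hypothetical simplifications and not kernel APIs; only the shape of the check is taken from the code above.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for struct iommu_group. */
struct model_group {
        int device_count;   /* number of devices in the group */
        const void *domain; /* domain currently attached, or NULL */
};

/*
 * Mirrors the shape of the check in iommu_attach_device(): a per-device
 * attach is only allowed when the group holds exactly one device.
 */
static int model_attach_device(struct model_group *group, const void *domain)
{
        if (!group)
                return -ENODEV;
        if (group->device_count != 1)
                return -EINVAL;

        group->domain = domain;
        return 0;
}
```

In this model both an empty group and a group with two or more devices are rejected with -EINVAL.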

__iommu_attach_group()

drivers/iommu/iommu.c

static int __iommu_attach_group(struct iommu_domain *domain,
                                struct iommu_group *group)
{
        int ret;

        if (group->default_domain && group->domain != group->default_domain)
                return -EBUSY;

        ret = __iommu_group_for_each_dev(group, domain,
                                         iommu_group_do_attach_device);
        if (ret == 0)
                group->domain = domain;

        return ret;
}

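Attaches the devices in the requested iommu group (@group) to the requested iommu domain (@domain). If the group has a default domain and is currently attached to a different domain, -EBUSY is returned.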

__iommu_group_for_each_dev()

drivers/iommu/iommu.c

/**
 * iommu_group_for_each_dev - iterate over each device in the group
 * @group: the group
 * @data: caller opaque data to be passed to callback function
 * @fn: caller supplied callback function
 *
 * This function is called by group users to iterate over group devices.
 * Callers should hold a reference count to the group during callback.
 * The group->mutex is held across callbacks, which will block calls to
 * iommu_group_add/remove_device.
 */

static int __iommu_group_for_each_dev(struct iommu_group *group, void *data,
                                      int (*fn)(struct device *, void *))
{
        struct group_device *device;
        int ret = 0;

        list_for_each_entry(device, &group->devices, list) {
                ret = fn(device->dev, data);
                if (ret)
                        break;
        }
        return ret;
}

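Calls the function (fn) for each device in the requested iommu group (@group), passing @data as an argument on each call. Iteration stops at the first call that returns a non-zero value, and that value is returned.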
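
The callback iteration pattern can be sketched outside the kernel as below. A plain singly linked list replaces the kernel's `list_head`, and all names (`model_dev`, `model_for_each_dev()`, `count_or_fail()`) are hypothetical and exist only for this illustration.

```c
#include <stddef.h>

/* Hypothetical simplified device list; the kernel uses struct list_head. */
struct model_dev {
        int id;
        struct model_dev *next;
};

/* Call fn on each device; stop at the first non-zero return value. */
static int model_for_each_dev(struct model_dev *head, void *data,
                              int (*fn)(struct model_dev *, void *))
{
        int ret = 0;

        for (struct model_dev *d = head; d; d = d->next) {
                ret = fn(d, data);
                if (ret)
                        break;
        }
        return ret;
}

/* Example callback: count visited devices, fail on a negative id. */
static int count_or_fail(struct model_dev *dev, void *data)
{
        int *visited = data;

        if (dev->id < 0)
                return -1;
        (*visited)++;
        return 0;
}
```

Because the loop breaks on the first error, devices after the failing one are never visited, which is the same behaviour the attach path relies on.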

iommu_group_do_attach_device()

drivers/iommu/iommu.c

/*
 * IOMMU groups are really the natural working unit of the IOMMU, but
 * the IOMMU API works on domains and devices.  Bridge that gap by
 * iterating over the devices in a group.  Ideally we'd have a single
 * device which represents the requestor ID of the group, but we also
 * allow IOMMU drivers to create policy defined minimum sets, where
 * the physical hardware may be able to distiguish members, but we
 * wish to group them at a higher level (ex. untrusted multi-function
 * PCI devices).  Thus we attach each device.
 */

static int iommu_group_do_attach_device(struct device *dev, void *data)
{
        struct iommu_domain *domain = data;

        return __iommu_attach_device(domain, dev);
}

Attaches the device to the domain.

__iommu_attach_device()

drivers/iommu/iommu.c

static int __iommu_attach_device(struct iommu_domain *domain,
                                 struct device *dev)
{
        int ret;
        if ((domain->ops->is_attach_deferred != NULL) &&
            domain->ops->is_attach_deferred(domain, dev))
                return 0;

        if (unlikely(domain->ops->attach_dev == NULL))
                return -ENODEV;

        ret = domain->ops->attach_dev(domain, dev);
        if (!ret)
                trace_attach_device_to_domain(dev);
        return ret;
}

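To attach the device to the domain, the (*attach_dev) hook function registered in the iommu domain's ops is called. If the hook is not provided, -ENODEV is returned.

In the case of amd iommu, the iommu driver's (*is_attach_deferred) function is called first, and when the attach is to be deferred the function simply returns success.

arm_smmu_attach_dev() – compatible="arm,mmu-500"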
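
The order of the two hook checks can be sketched with plain function pointers. Everything here (`model_ops`, `model_domain`, `model_attach()`, and the two example callbacks) is a hypothetical user-space model, not the kernel's `iommu_ops`.

```c
#include <errno.h>
#include <stddef.h>

struct model_domain;

/* Hypothetical subset of the driver callbacks. */
struct model_ops {
        int (*is_attach_deferred)(struct model_domain *d, int dev_id);
        int (*attach_dev)(struct model_domain *d, int dev_id);
};

struct model_domain {
        const struct model_ops *ops;
        int attached; /* number of successful attach_dev calls */
};

static int model_attach(struct model_domain *d, int dev_id)
{
        /* Optional hook: the driver may postpone the attach. */
        if (d->ops->is_attach_deferred &&
            d->ops->is_attach_deferred(d, dev_id))
                return 0;

        /* Mandatory hook: without it nothing can be attached. */
        if (!d->ops->attach_dev)
                return -ENODEV;

        return d->ops->attach_dev(d, dev_id);
}

/* Example driver callbacks used only for this sketch. */
static int always_deferred(struct model_domain *d, int dev_id)
{
        (void)d;
        (void)dev_id;
        return 1;
}

static int do_attach(struct model_domain *d, int dev_id)
{
        (void)dev_id;
        d->attached++;
        return 0;
}
```

Note that a deferred attach reports success without ever calling the attach hook, so the caller cannot tell the two cases apart from the return value alone.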

drivers/iommu/arm-smmu.c

static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
{
        int ret;
        struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
        struct arm_smmu_device *smmu;
        struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);

        if (!fwspec || fwspec->ops != &arm_smmu_ops) {
                dev_err(dev, "cannot attach to SMMU, is it on the same bus?\n");
                return -ENXIO;
        }

        /*
         * FIXME: The arch/arm DMA API code tries to attach devices to its own
         * domains between of_xlate() and add_device() - we have no way to cope
         * with that, so until ARM gets converted to rely on groups and default
         * domains, just say no (but more politely than by dereferencing NULL).
         * This should be at least a WARN_ON once that's sorted.
         */
        if (!fwspec->iommu_priv)
                return -ENODEV;

        smmu = fwspec_smmu(fwspec);

        ret = arm_smmu_rpm_get(smmu);
        if (ret < 0)
                return ret;

        /* Ensure that the domain is finalised */
        ret = arm_smmu_init_domain_context(domain, smmu);
        if (ret < 0)
                goto rpm_put;

        /*
         * Sanity check the domain. We don't support domains across
         * different SMMUs.
         */
        if (smmu_domain->smmu != smmu) {
                dev_err(dev,
                        "cannot attach to SMMU %s whilst already attached to domain on SMMU %s\n",
                        dev_name(smmu_domain->smmu->dev), dev_name(smmu->dev));
                ret = -EINVAL;
                goto rpm_put;
        }

        /* Looks ok, so add the device to the domain */
        ret = arm_smmu_domain_add_master(smmu_domain, fwspec);

rpm_put:
        arm_smmu_rpm_put(smmu);
        return ret;
}

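Attaches a device to an smmu domain.

In code line 4, the fwspec is fetched. It holds the variable arguments that the device passes to the iommu (smmu) driver through the device tree.
In code lines 8~11, if there is no fwspec or its ops do not belong to the SMMU driver, an error is returned.
In code lines 20~21, if there is no smmu master information, an error is returned.
In code line 23, the smmu that the device is connected to is looked up.
In code lines 25~27, the smmu is woken up if it is in a power-saving state.
In code lines 30~32, the domain context is finalised, which sets up the page table of the smmu and enables translation.
In code lines 38~44, if the domain already belongs to a different smmu, an error message is printed and the function exits.
In code line 47, the device is added to the domain.

The following shows the devices attached to each iommu group.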

find /sys/kernel/iommu_groups/ -type l
/sys/kernel/iommu_groups/0/devices/ff650000.vpu_service
/sys/kernel/iommu_groups/1/devices/ff660000.rkvdec
/sys/kernel/iommu_groups/2/devices/ff8f0000.vop
/sys/kernel/iommu_groups/3/devices/ff900000.vop
/sys/kernel/iommu_groups/4/devices/ff910000.cif_isp

IOMMU mapping

iommu_map()

drivers/iommu/iommu.c

int iommu_map(struct iommu_domain *domain, unsigned long iova,
              phys_addr_t paddr, size_t size, int prot)
{
        unsigned long orig_iova = iova;
        unsigned int min_pagesz;
        size_t orig_size = size;
        phys_addr_t orig_paddr = paddr;
        int ret = 0;

        if (unlikely(domain->ops->map == NULL ||
                     domain->pgsize_bitmap == 0UL))
                return -ENODEV;

        if (unlikely(!(domain->type & __IOMMU_DOMAIN_PAGING)))
                return -EINVAL;

        /* find out the minimum page size supported */
        min_pagesz = 1 << __ffs(domain->pgsize_bitmap);

        /*
         * both the virtual address and the physical one, as well as
         * the size of the mapping, must be aligned (at least) to the
         * size of the smallest page supported by the hardware
         */
        if (!IS_ALIGNED(iova | paddr | size, min_pagesz)) {
                pr_err("unaligned: iova 0x%lx pa %pa size 0x%zx min_pagesz 0x%x\n",
                       iova, &paddr, size, min_pagesz);
                return -EINVAL;
        }

        pr_debug("map: iova 0x%lx pa %pa size 0x%zx\n", iova, &paddr, size);

        while (size) {
                size_t pgsize = iommu_pgsize(domain, iova | paddr, size);

                pr_debug("mapping: iova 0x%lx pa %pa pgsize 0x%zx\n",
                         iova, &paddr, pgsize);

                ret = domain->ops->map(domain, iova, paddr, pgsize, prot);
                if (ret)
                        break;

                iova += pgsize;
                paddr += pgsize;
                size -= pgsize;
        }

        /* unroll mapping in case something went wrong */
        if (ret)
                iommu_unmap(domain, orig_iova, orig_size - size);
        else
                trace_map(orig_iova, orig_paddr, orig_size);

        return ret;
}
EXPORT_SYMBOL_GPL(iommu_map);

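Requests a mapping that translates the I/O virtual address @iova to the physical address @paddr for @size bytes. Returns 0 when the mapping succeeds.

In code lines 10~12, if there is no (*map) hook function or the domain advertises no supported page sizes (pgsize_bitmap is 0), the function exits with an error.
In code lines 14~15, if the domain type does not include the paging flag, the function exits with an error.
In code line 18, the smallest of the supported page sizes is obtained.
- e.g. pgsize_bitmap=0x10_1000
  - 1M pages and 4K pages are supported, so the minimum page size is 4K.
In code lines 25~29, if the virtual address, the physical address, or the size is not aligned to the minimum page size, the function exits with an error.
In code line 31, the iommu mapping information is printed at debug level.
In code lines 33~46, the mapping is performed repeatedly, one page-size unit at a time, until @size bytes are mapped.
In code lines 48~54, if the mapping failed, the part mapped so far is unmapped and the error is returned. Otherwise 0 is returned.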
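
The page-size arithmetic can be tried out in user space with the sketch below. It is a simplified model written for this walkthrough: the helper names are hypothetical, portable bit loops replace the kernel's `__ffs()`/`__fls()`, and the driver call is replaced by a counter. The selection rule follows the idea of picking the largest supported page that fits in the remaining size and respects the alignment of `iova | paddr`.

```c
#include <stdint.h>

/* Index of the lowest set bit; x must be non-zero. */
static unsigned int lowest_bit(uint64_t x)
{
        unsigned int i = 0;

        while (!(x & 1)) {
                x >>= 1;
                i++;
        }
        return i;
}

/* Index of the highest set bit; x must be non-zero. */
static unsigned int highest_bit(uint64_t x)
{
        unsigned int i = 0;

        while (x > 1) {
                x >>= 1;
                i++;
        }
        return i;
}

/* Smallest page size advertised in the bitmap. */
static uint64_t min_page_size(uint64_t pgsize_bitmap)
{
        return 1ULL << lowest_bit(pgsize_bitmap);
}

/* Largest supported page that fits in size and in the address alignment. */
static uint64_t pick_page_size(uint64_t pgsize_bitmap, uint64_t addr_merge,
                               uint64_t size)
{
        unsigned int idx = highest_bit(size);
        uint64_t mask;

        if (addr_merge && lowest_bit(addr_merge) < idx)
                idx = lowest_bit(addr_merge);

        mask = (idx >= 63) ? ~0ULL : ((1ULL << (idx + 1)) - 1);
        mask &= pgsize_bitmap;
        return mask ? 1ULL << highest_bit(mask) : 0;
}

/* Number of per-page map calls for one request, or -1 if it is rejected. */
static int count_map_calls(uint64_t pgsize_bitmap, uint64_t iova,
                           uint64_t paddr, uint64_t size)
{
        int calls = 0;

        if (!pgsize_bitmap)
                return -1;
        if ((iova | paddr | size) & (min_page_size(pgsize_bitmap) - 1))
                return -1;

        while (size) {
                uint64_t pgsize = pick_page_size(pgsize_bitmap, iova | paddr, size);

                iova += pgsize;
                paddr += pgsize;
                size -= pgsize;
                calls++;
        }
        return calls;
}
```

With pgsize_bitmap=0x101000, a 2M request whose addresses are 1M-aligned is split into two 1M pages, while a 12K request with 4K-aligned addresses needs three 4K pages.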

arm_smmu_map() – ARM & ARM64

drivers/iommu/arm-smmu.c

static int arm_smmu_map(struct iommu_domain *domain, unsigned long iova,
                        phys_addr_t paddr, size_t size, int prot)
{
        struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
        struct arm_smmu_device *smmu = to_smmu_domain(domain)->smmu;
        int ret;

        if (!ops)
                return -ENODEV;

        arm_smmu_rpm_get(smmu);
        ret = ops->map(ops, iova, paddr, size, prot);
        arm_smmu_rpm_put(smmu);

        return ret;
}

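Maps the virtual address @iova to the physical address @paddr for @size bytes through the page table driver used by the smmu. The smmu is kept awake around the call.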

arm_lpae_map() – ARM & ARM64

drivers/iommu/io-pgtable-arm.c

static int arm_lpae_map(struct io_pgtable_ops *ops, unsigned long iova,
                        phys_addr_t paddr, size_t size, int iommu_prot)
{
        struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
        arm_lpae_iopte *ptep = data->pgd;
        int ret, lvl = ARM_LPAE_START_LVL(data);
        arm_lpae_iopte prot;

        /* If no access, then nothing to do */
        if (!(iommu_prot & (IOMMU_READ | IOMMU_WRITE)))
                return 0;

        if (WARN_ON(iova >= (1ULL << data->iop.cfg.ias) ||
                    paddr >= (1ULL << data->iop.cfg.oas)))
                return -ERANGE;

        prot = arm_lpae_prot_to_pte(data, iommu_prot);
        ret = __arm_lpae_map(data, iova, paddr, size, prot, lvl, ptep);
        /*
         * Synchronise all PTE updates for the new mapping before there's
         * a chance for anything to kick off a table walk for the new iova.
         */
        wmb();

        return ret;
}

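Maps the virtual address @iova to the physical address @paddr for @size bytes in the page table used by the smmu.

In code lines 10~11, if neither the read nor the write attribute is set in the page attributes @iommu_prot, nothing is done and success (0) is returned.
In code lines 13~15, if @iova or @paddr is outside the range the smmu can handle, the -ERANGE error is returned.
In code line 17, @iommu_prot is converted to the attributes to be written into the pte entry of the smmu.
In code line 18, the mapping tables for the smmu are built, starting from the start page table level.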
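
The two early checks can be expressed as a small stand-alone function. This is a sketch with hypothetical names and flag values (`MODEL_READ`, `MODEL_WRITE`, `model_map_precheck()`), and it assumes that the input and output address sizes are given in bits and are smaller than 64.

```c
#include <errno.h>
#include <stdint.h>

#define MODEL_READ  (1 << 0) /* hypothetical flag values for this sketch */
#define MODEL_WRITE (1 << 1)

/*
 * Returns 1 when the request should go on to the page-table walk,
 * 0 when there is nothing to do, or a negative errno value.
 * ias/oas are the input/output address sizes in bits (assumed < 64).
 */
static int model_map_precheck(uint64_t iova, uint64_t paddr,
                              unsigned int ias, unsigned int oas, int prot)
{
        /* No read or write access requested: nothing to map. */
        if (!(prot & (MODEL_READ | MODEL_WRITE)))
                return 0;

        /* Both addresses must fit in their configured address sizes. */
        if (iova >= (1ULL << ias) || paddr >= (1ULL << oas))
                return -ERANGE;

        return 1;
}
```

For example, with a 32-bit input address size, an iova of 0xffffffff passes the check and an iova of 1 << 32 is rejected.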

__arm_lpae_map() – ARM & ARM64

drivers/iommu/io-pgtable-arm.c

static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
                          phys_addr_t paddr, size_t size, arm_lpae_iopte prot,
                          int lvl, arm_lpae_iopte *ptep)
{
        arm_lpae_iopte *cptep, pte;
        size_t block_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
        size_t tblsz = ARM_LPAE_GRANULE(data);
        struct io_pgtable_cfg *cfg = &data->iop.cfg;

        /* Find our entry at the current level */
        ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);

        /* If we can install a leaf entry at this level, then do so */
        if (size == block_size && (size & cfg->pgsize_bitmap))
                return arm_lpae_init_pte(data, iova, paddr, prot, lvl, ptep);

        /* We can't allocate tables at the final level */
        if (WARN_ON(lvl >= ARM_LPAE_MAX_LEVELS - 1))
                return -EINVAL;

        /* Grab a pointer to the next level */
        pte = READ_ONCE(*ptep);
        if (!pte) {
                cptep = __arm_lpae_alloc_pages(tblsz, GFP_ATOMIC, cfg);
                if (!cptep)
                        return -ENOMEM;

                pte = arm_lpae_install_table(cptep, ptep, 0, cfg);
                if (pte)
                        __arm_lpae_free_pages(cptep, tblsz, cfg);
        } else if (!(cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA) &&
                   !(pte & ARM_LPAE_PTE_SW_SYNC)) {
                __arm_lpae_sync_pte(ptep, cfg);
        }

        if (pte && !iopte_leaf(pte, lvl)) {
                cptep = iopte_deref(pte, data);
        } else if (pte) {
                /* We require an unmap first */
                WARN_ON(!selftest_running);
                return -EEXIST;
        }

        /* Rinse, repeat */
        return __arm_lpae_map(data, iova, paddr, size, prot, lvl + 1, cptep);
}

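Walks the page tables used by the smmu from the first level down to the level where the leaf entry is installed, and maps the virtual address @iova to the physical address @paddr for @size bytes.

In code line 6, the block size corresponding to @lvl is obtained.
- When 4K page tables are used, the block size at the last level is 4K.
In code line 7, the size of the table to be allocated is obtained.
In code line 11, the address of the entry that corresponds to @iova in the page table at @lvl is obtained.
In code lines 14~15, if the mapping size equals the block size of this level and that size is supported, the leaf pte entry is written at this level.
In code lines 18~19, this function is called recursively down to the last page table level. If the last level has already been reached, no further table can be allocated and the function exits with -EINVAL.
In code line 22, the value of the entry that links to the next-level page table is read.
In code lines 23~30, if the entry is not populated yet, a page table for the next level is allocated and the entry is modified to point to the allocated table. If another modifier installed a table first, the newly allocated table is freed.
In code lines 31~34, if the NO_DMA quirk is not in use and the pte has no SW sync flag, someone is in the middle of modifying it, so a coherency sync is performed.
In code lines 36~42, if the pte is not a leaf entry, the virtual address of the table that the pte entry points to is obtained. If the pte entry is already mapped as a leaf, the -EEXIST error is returned.
In code line 45, the function calls itself recursively with the next level.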
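
The per-level arithmetic can be sketched for the common case of a 4K granule with four levels (0 to 3) and 9 index bits per level. The macros and functions below are hypothetical names for this illustration, and the sketch ignores the fact that the top-level table may use a different number of index bits depending on the input address size.

```c
#include <stdint.h>

#define MODEL_MAX_LEVELS    4  /* levels 0..3 */
#define MODEL_GRANULE_SHIFT 12 /* 4K granule */
#define MODEL_BITS_PER_LVL  9  /* 512 entries per table */

/* Bit position where the index field of the given level starts. */
static unsigned int model_lvl_shift(int lvl)
{
        return MODEL_GRANULE_SHIFT +
               MODEL_BITS_PER_LVL * (MODEL_MAX_LEVELS - 1 - lvl);
}

/* Size covered by one entry at the given level. */
static uint64_t model_block_size(int lvl)
{
        return 1ULL << model_lvl_shift(lvl);
}

/* Index of the entry for iova within the table at the given level. */
static unsigned int model_lvl_idx(uint64_t iova, int lvl)
{
        return (iova >> model_lvl_shift(lvl)) &
               ((1u << MODEL_BITS_PER_LVL) - 1);
}

/*
 * Recursive descent in the style of __arm_lpae_map(): returns the level
 * at which a leaf of the given size would be installed, or -1 when the
 * last level is reached without a match.
 */
static int model_leaf_level(uint64_t size, uint64_t pgsize_bitmap, int lvl)
{
        if (size == model_block_size(lvl) && (size & pgsize_bitmap))
                return lvl;
        if (lvl >= MODEL_MAX_LEVELS - 1)
                return -1;
        return model_leaf_level(size, pgsize_bitmap, lvl + 1);
}
```

Under these assumptions a 4K mapping descends to level 3, a 2M mapping stops at level 2, and a 1G mapping stops at level 1, provided the sizes are present in the page-size bitmap.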

arm_lpae_install_table() – ARM & ARM64

drivers/iommu/io-pgtable-arm.c

static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
                                             arm_lpae_iopte *ptep,
                                             arm_lpae_iopte curr,
                                             struct io_pgtable_cfg *cfg)
{
        arm_lpae_iopte old, new;

        new = __pa(table) | ARM_LPAE_PTE_TYPE_TABLE;
        if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
                new |= ARM_LPAE_PTE_NSTABLE;

        /*
         * Ensure the table itself is visible before its PTE can be.
         * Whilst we could get away with cmpxchg64_release below, this
         * doesn't have any ordering semantics when !CONFIG_SMP.
         */
        dma_wmb();

        old = cmpxchg64_relaxed(ptep, curr, new);

        if ((cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA) ||
            (old & ARM_LPAE_PTE_SW_SYNC))
                return old;

        /* Even if it's not ours, there's no point waiting; just kick it */
        __arm_lpae_sync_pte(ptep, cfg);
        if (old == curr)
                WRITE_ONCE(*ptep, new | ARM_LPAE_PTE_SW_SYNC);

        return old;
}

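Links the page table entry @ptep of the smmu to the requested page table @table by writing the table's address into the entry.

In code line 8, the value to be written into the entry is built from the physical address of the newly allocated table and the table flag.
In code lines 9~10, if the non-secure quirk flag was requested, the non-secure table flag is added to the value as well.
In code lines 17~19, a barrier is executed first and then the entry is swapped.
In code lines 21~23, if the NO_DMA quirk is set or the sw sync flag was set in the previous value, the previous entry is returned.
- ARM_LPAE_PTE_SW_SYNC
  - Written to prevent a coherency race.
    - If the sw sync flag is present in the previous entry value, the change of the entry has fully completed.
    - If the sw sync flag is absent from the previous entry value, someone is in the middle of changing this entry.
In code line 26, the entry has not been synced yet, so a coherency sync is performed, also on behalf of any other modifier.
In code lines 27~28, if the cmpxchg64 operation changed the entry without losing a race, the sw sync flag is additionally written.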
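
The compare-and-exchange step can be modelled with C11 atomics as in the sketch below. It is not the kernel implementation: `model_install_table()` and `MODEL_PTE_TABLE` are hypothetical, a release fence stands in for `dma_wmb()`, and the SW sync handling is left out.

```c
#include <stdatomic.h>
#include <stdint.h>

#define MODEL_PTE_TABLE 0x3ULL /* hypothetical "table" type bits */

/*
 * Try to publish a new table entry in *ptep, expecting it to hold curr.
 * Returns the value found in *ptep before the attempt, so a return value
 * equal to curr means that this caller won the race.
 */
static uint64_t model_install_table(_Atomic uint64_t *ptep, uint64_t curr,
                                    uint64_t table_pa)
{
        uint64_t expected = curr;
        uint64_t newpte = table_pa | MODEL_PTE_TABLE;

        /* Make the table contents visible before the entry pointing to it. */
        atomic_thread_fence(memory_order_release);

        /* On failure, expected is overwritten with the value actually found. */
        atomic_compare_exchange_strong_explicit(ptep, &expected, newpte,
                                                memory_order_relaxed,
                                                memory_order_relaxed);
        return expected;
}
```

A caller that passes 0 as the expected value and gets a non-zero value back has lost the race, which corresponds to the case where __arm_lpae_map() frees the table it had just allocated.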

iommu_map_sg()

drivers/iommu/iommu.c

size_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
                    struct scatterlist *sg, unsigned int nents, int prot)
{
        size_t len = 0, mapped = 0;
        phys_addr_t start;
        unsigned int i = 0;
        int ret;

        while (i <= nents) {
                phys_addr_t s_phys = sg_phys(sg);

                if (len && s_phys != start + len) {
                        ret = iommu_map(domain, iova + mapped, start, len, prot);
                        if (ret)
                                goto out_err;

                        mapped += len;
                        len = 0;
                }

                if (len) {
                        len += sg->length;
                } else {
                        len = sg->length;
                        start = s_phys;
                }

                if (++i < nents)
                        sg = sg_next(sg);
        }

        return mapped;

out_err:
        /* undo mappings already done */
        iommu_unmap(domain, iova, mapped);

        return 0;

}
EXPORT_SYMBOL_GPL(iommu_map_sg);
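
Reading the loop above, physically contiguous scatterlist entries are merged into one run, and each run is handed to iommu_map() once. The following user-space sketch models only that merging; `model_seg` and `model_count_runs()` are hypothetical names, an array replaces the scatterlist, and the final run is flushed after the loop rather than through an extra loop iteration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified scatterlist entry. */
struct model_seg {
        uint64_t phys; /* physical start address */
        uint64_t len;  /* length in bytes */
};

/*
 * Count how many map calls are needed when physically contiguous
 * neighbours are merged. The total mapped length is stored in *mapped.
 */
static int model_count_runs(const struct model_seg *sg, size_t nents,
                            uint64_t *mapped)
{
        uint64_t start = 0, len = 0;
        int runs = 0;

        *mapped = 0;
        for (size_t i = 0; i < nents; i++) {
                if (len && sg[i].phys != start + len) {
                        /* Contiguity is broken: flush the current run. */
                        *mapped += len;
                        runs++;
                        len = 0;
                }
                if (len) {
                        len += sg[i].len;
                } else {
                        start = sg[i].phys;
                        len = sg[i].len;
                }
        }
        if (len) {
                /* Flush the last run. */
                *mapped += len;
                runs++;
        }
        return runs;
}
```

For three segments where the second starts exactly where the first ends and the third lies elsewhere, this model needs two map calls instead of three.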
