문c 블로그

ARM Cortex-A Series

Cortex A57 (Little A53)
- SoC) Qualcomm Snapdragon 808, 810; Samsung Exynos 7 Octa 5433, 7420
- References:
  - ARM Cortex-A53 | Wiki
Cortex A72
- About 23% faster than Cortex-A57 at the same clock (general compute 16%, cryptography 38%, memory I/O 50%, floating point 26%, integer 16%)
- References:
  - Fast Processing Designed for Multiple Application Areas | ARM
  - ARM Cortex-A72 | Wiki
Cortex A73
- Performance similar to Cortex A72, but with 40% better performance per area and 20% better power efficiency
- Clock up to 2.8 GHz
- SoC) Qualcomm Snapdragon 835
- References:
  - Sustained Performance for Mobile Processing | ARM
  - ARM Cortex-A73 | Wiki
Cortex A75 (Little A55)
- 20% faster than A73
- First to adopt ARMv8.2, DynamIQ, and an L3 cache
- Clock up to 3.0 GHz
- References:
  - First Generation, High-Performance CPU Based on DynamIQ Technology | ARM
  - ARM Cortex-A75 | Wiki
Cortex A76
- Up to 35% faster than Cortex-A75, 40% better power efficiency, 4x machine learning performance, 90% more memory bandwidth
- Clock up to 3.0/3.3 GHz (phones/tablets)
- References:
  - Second-Generation, High-Performance CPU Based on DynamIQ Technology | ARM
  - ARM Cortex-A76 | Wiki
Cortex A77
- 20% faster than A76
- Clock up to 3.0/3.3 GHz (phones/tablets)
- References:
  - Third-Generation, High-Performance CPU Based on DynamIQ Technology | ARM
  - ARM Cortex-A77 | Wiki
Cortex A78 (Little A55, Custom X1)
- 5% faster than A77
- Last ARMv8 architecture core
- Clock up to 3.0/3.3 GHz (phones/tablets)
- References:
  - Fourth-Generation, High-Performance CPU Based on DynamIQ Technology | ARM
  - ARM Cortex-A78 | Wiki
Cortex A710 (Little A510, Custom X2)
- 10% faster than A78, 30% lower power consumption, 2x machine learning performance
- References:
  - First-Generation Armv9 “big” Cortex CPU Based on Arm DynamIQ Technology | ARM

ARMv8 & ARMv9.x Architecture Extensions

ARMv8.1
- Not implemented by ARM Cortex-A32, A35, A53, A57, A72, A73 (these are Armv8.0-A cores)
- Atomic(LSE) memory access instructions (AArch64)
- Limited Order regions (AArch64)
- Increased Virtual Machine Identifier (VMID) size, and Virtualization Host Extensions (AArch64)
- Privileged Access Never (PAN) (AArch32 and AArch64)
ARMv8.2 (52bits, share TLB, RAS)
- ARM Cortex-A55, A75, A76, A77, A78
- Support for 52-bit addresses (AArch64)
- The ability for PEs to share Translation Lookaside Buffer (share TLB) entries (AArch32 and AArch64)
- FP16 data processing instructions (AArch32 and AArch64)
- Statistical profiling (AArch64)
- Reliability Availability Serviceability (RAS) support becomes mandatory (AArch32 and AArch64)
ARMv8.3 (Pointer Authentication)
- Not adopted by any ARM Cortex core
- Pointer authentication (AArch64)
- Nested virtualization (AArch64)
- Advanced Single Instruction Multiple Data (SIMD) complex number support (AArch32 and AArch64)
- Improved JavaScript data type conversion support (AArch32 and AArch64)
- A change to the memory consistency model (AArch64)
- ID mechanism support for larger system-visible caches (AArch32 and AArch64)
ARMv8.4
- Not adopted by any ARM Cortex core; used only by Apple
- Secure virtualization (AArch64)
- Nested virtualization enhancements (AArch64)
- Small translation table support (AArch64)
- Relaxed alignment restrictions (AArch32 and AArch64)
- Memory Partitioning and Monitoring (MPAM) (AArch32 and AArch64)
- Additional crypto support (AArch32 and AArch64)
- Generic counter scaling (AArch32 and AArch64)
- Instructions to accelerate SHA512 and SHA3 (AArch64 only)
ARMv8.5 & ARMv9.0
- ARM Cortex A510, 710
- Memory Tagging (AArch64)
- Branch Target Identification (AArch64)
- Random Number Generator instructions (AArch64)
- Cache Clean to Point of Deep Persistence (AArch64)
ARMv8.6 & ARMv9.1
- General Matrix Multiply (GEMM) instructions (AArch64)
- Fine grained traps for virtualization (AArch64)
- High precision Generic Timer
- Data Gathering Hint (AArch64)
ARMv8.7 & ARMv9.2
- Enhanced support for PCIe hot plug (AArch64)
- Atomic 64-byte load and stores to accelerators (AArch64)
- Wait For Instruction (WFI) and Wait For Event (WFE) with timeout (AArch64)
- Branch-Record recording (Armv9.2 only)

Feature Name

Reference: Feature names for A-profile | ARM

Example) ARM Cortex A57 (

# cat cpuinfo
processor       : 0
BogoMIPS        : 50.00
Features        : fp asimd aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x1
CPU part        : 0xd07
CPU revision    : 3

Example) rpi4 – ARM Cortex A72

$ cat cpuinfo
processor       : 0
BogoMIPS        : 108.00
Features        : fp asimd evtstrm crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xd08
CPU revision    : 3

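Example) Odroid-N2 – ARM Cortex A73 (processor 0 below reports CPU part 0xd03, which is one of the board's Cortex-A53 little cores; a Cortex-A73 core reports 0xd09)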

# cat /proc/cpuinfo
processor       : 0
BogoMIPS        : 48.00
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xd03
CPU revision    : 4
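
The `CPU part` field in the listings above is what identifies the core. As a minimal sketch, the helpers below parse that line and map the part number to a core name. The names `parse_cpu_part()` and `cpu_part_name()` are hypothetical, and the table covers only four part numbers for implementer 0x41 (Arm); it is not a complete list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Small, non-exhaustive table of Arm (implementer 0x41) part numbers. */
struct cpu_part {
    unsigned int part;
    const char *name;
};

static const struct cpu_part arm_parts[] = {
    { 0xd03, "Cortex-A53" },
    { 0xd07, "Cortex-A57" },
    { 0xd08, "Cortex-A72" },
    { 0xd09, "Cortex-A73" },
};

/* Parse a "CPU part        : 0xd07" line; returns 0 for any other line. */
unsigned int parse_cpu_part(const char *line)
{
    static const char key[] = "CPU part";
    const char *colon;

    assert(line != NULL);
    if (strncmp(line, key, sizeof(key) - 1) != 0)
        return 0;
    colon = strchr(line, ':');
    if (colon == NULL)
        return 0;
    /* strtoul skips leading blanks and accepts the "0x" prefix in base 16 */
    return (unsigned int)strtoul(colon + 1, NULL, 16);
}

/* Map a part number to a core name, or "unknown" if it is not in the table. */
const char *cpu_part_name(unsigned int part)
{
    size_t i;

    for (i = 0; i < sizeof(arm_parts) / sizeof(arm_parts[0]); i++)
        if (arm_parts[i].part == part)
            return arm_parts[i].name;
    return "unknown";
}
```

Feeding each line of /proc/cpuinfo to `parse_cpu_part()` and passing any non-zero result to `cpu_part_name()` is enough to tell big and little cores apart on a heterogeneous SoC.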

References

Understanding the Armv8.x extensions | ARM – download pdf
ARMv9: What is the Big Deal? | gitconnected
Arm Announces Mobile Armv9 CPU Microarchitectures: Cortex-X2, Cortex-A710 and Cortex-A510 | ANNDTECH

DMA -6- (DMAEngine Subsystem)

2021-04-06 / 2021-04-13, 문영일

DMAEngine Subsystem

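Apart from PCI and PCIe devices, most devices that use DMA go through the API provided by the DMAEngine subsystem.

DMA on PCI/PCIe

The DMAEngine subsystem is used only in systems where the host drives the DMA. PCI/PCIe devices, where the slave device itself drives the DMA, do not use the DMAEngine subsystem; each driver implements its own DMA handling.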

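DMA Buffers

Before and after a DMA transfer, the DMA mapping API is used on the DMA buffer as needed, depending on whether an IOMMU is present and whether cache sync is required.

DMA mapping
- With an IOMMU, mapping/unmapping is performed.
- When DMA-coherent memory is not used, the cache is invalidated or cleaned & invalidated on every map/unmap.
  - Kernel memory allocated with GFP_KERNEL and the like is cached, so it needs this sync handling.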

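DMA Usage

Async TX transfers
- Used for memory-to-memory transfers, XOR, cryptography, and RAID devices.
Slave transfers
- Slave support was later merged into DMAEngine and is used for DMA with slave devices.
- Simple slave DMA controllers transfer the requested number of bytes in one go.
- More advanced slave DMA controllers let you specify the transfer width (in bits) and a burst size to support repeated transfers.
- Even more advanced slave DMA controllers support scatter-gather transfers, so several non-contiguous buffers can be specified.

The following figure shows how a memory <-> memory DMA transfer differs from a slave device <-> memory DMA transfer.

For a memory <-> memory transfer, the CPU requests DMA in async tx mode.
For a slave device <-> memory transfer, DMA is requested in slave mode.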

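DMA Configuration

The following figure shows a DMA controller attached to the AXI bus of an ARM system. (amba pl330 dma controller)

The following figure zooms in on the amba pl330 dma controller above and shows it in more detail.

DMA Channels and Request Interfaces

So that several devices can use DMA, most DMA controllers provide multiple DMA channels that can run concurrently. The controller hardware also accepts DMA requests from multiple slave devices.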

DMA MUX (DMA-Router)

For DMA MUX details, see the following.

Reference: [STM32H7 tutorial] Chapter 39 STM32H7 DMAMUX basic knowledge (important)

DMA Size

A DMA transfer requested through DMAEngine is requested per descriptor. A descriptor may contain several segments, and each segment is carried out as one or more burst transfers.

Descriptor > Segment > Burst
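
To make the Descriptor > Segment > Burst hierarchy concrete, here is a rough arithmetic sketch. It assumes that one burst moves (bus width in bytes) x (max burst) bytes and that a short tail still costs one burst; real controllers may handle the tail differently. The `model_` types and functions are hypothetical and are not part of the DMAEngine API.

```c
#include <assert.h>
#include <stddef.h>

/* One contiguous chunk of a descriptor (hypothetical model type). */
struct model_segment {
    size_t len;   /* bytes in this segment */
};

/* Bursts needed for one segment: ceil(len / (width_bytes * maxburst)). */
size_t model_segment_bursts(size_t len, size_t width_bytes, size_t maxburst)
{
    size_t burst_bytes = width_bytes * maxburst;

    assert(burst_bytes > 0);
    return (len + burst_bytes - 1) / burst_bytes;
}

/* Total bursts for a descriptor made of nsegs segments. */
size_t model_descriptor_bursts(const struct model_segment *segs, size_t nsegs,
                               size_t width_bytes, size_t maxburst)
{
    size_t i, total = 0;

    for (i = 0; i < nsegs; i++)
        total += model_segment_bursts(segs[i].len, width_bytes, maxburst);
    return total;
}
```

With a 4-byte bus width and a max burst of 8, each burst carries 32 bytes, so a descriptor with a 4096-byte segment and a 100-byte segment needs 128 + 4 bursts under these assumptions.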

DMA Transfer Types

DMA_MEMCPY
- Memory-to-memory copy
DMA_XOR
- The device performs XOR operations on memory for RAID5.
DMA_PQ
- The device performs RAID6 P+Q calculations. (P=XOR, Q=Reed-Solomon algorithm)
DMA_XOR_VAL
- The device performs a parity check of memory buffers using XOR.
DMA_PQ_VAL
- The device performs a parity check of memory buffers using the RAID6 P+Q calculation.
DMA_MEMSET
- Memory-to-memory memset
DMA_MEMSET_SG
- Memory-to-memory memset with scatter-gather
DMA_INTERRUPT
- The device generates an interrupt through a dummy transfer.
DMA_PRIVATE
- Supports slave transfers only; async tx transfers are not supported.
- If this flag is not set, async TX use picks a random channel without going through dma_request_channel().
DMA_ASYNC_TX
- Async transfers (tx) are possible.
DMA_SLAVE
- Performs device-to-memory and memory-to-device transfers. (including scatter-gather)
DMA_CYCLIC
- The device supports cyclic transfers.
- An interrupt reports completion each time a segment (chunk) finishes.
DMA_INTERLEAVE
- Uses the memory-to-memory interleaved transfer method.

Slave DMA Controller

A general DMA controller that supports only basic DMA slave transfers is given the following flags. (e.g. pl330, bcm2835, stm32, …)

DMA_SLAVE
DMA_PRIVATE
DMA_CYCLIC
DMA_MEMCPY

RAID DMA Controller

A DMA controller for RAID devices is given the following flags. (e.g. fsl-raideng, bcm-sba-raid, ioat, …)

DMA_XOR
DMA_PQ
DMA_MEMCPY

Main API

dma_request_chan()
- dma_request_slave_channel()
- dma_request_slave_channel_reason()
dmaengine_slave_config()
dmaengine_prep_*()
- dmaengine_prep_slave_single()
- dmaengine_prep_slave_sg()
- dmaengine_prep_rio_sg()
- dmaengine_prep_dma_cyclic()
- dmaengine_prep_interleaved_dma()
- dmaengine_prep_dma_memset()
- dmaengine_prep_dma_memcpy()
dmaengine_submit()
dma_async_issue_pending()

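DMA Engine – DMA Host Controller Side

Device Operations

A DMA host controller driver hooks its implementation callbacks into the dma_device structure and then registers it with dma_async_device_register().

The following figure shows a driver for Arm's pl330 DMA controller, with all the operation functions it implements.

Registering a DMA Host Controller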

dmaenginem_async_device_register()

drivers/dma/dmaengine.c

/**
 * dmaenginem_async_device_register - registers DMA devices found
 * @device: &dma_device
 *
 * The operation is managed and will be undone on driver detach.
 */

int dmaenginem_async_device_register(struct dma_device *device)
{
        void *p;
        int ret;

        p = devres_alloc(dmam_device_release, sizeof(void *), GFP_KERNEL);
        if (!p)
                return -ENOMEM;

        ret = dma_async_device_register(device);
        if (!ret) {
                *(struct dma_device **)p = device;
                devres_add(device->dev, p);
        } else {
                devres_free(p);
        }

        return ret;
}
EXPORT_SYMBOL(dmaenginem_async_device_register);

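Registers a DMA controller device. (It is registered as a managed resource, so the registration is undone automatically when the driver detaches, for example on module unload.)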

dma_async_device_register()

drivers/dma/dmaengine.c -1/3-

/**
 * dma_async_device_register - registers DMA devices found
 * @device: &dma_device
 */

int dma_async_device_register(struct dma_device *device)
{
        int chancnt = 0, rc;
        struct dma_chan* chan;
        atomic_t *idr_ref;

        if (!device)
                return -ENODEV;

        /* validate device routines */
        if (!device->dev) {
                pr_err("DMAdevice must have dev\n");
                return -EIO;
        }

        if (dma_has_cap(DMA_MEMCPY, device->cap_mask) && !device->device_prep_dma_memcpy) {
                dev_err(device->dev,
                        "Device claims capability %s, but op is not defined\n",
                        "DMA_MEMCPY");
                return -EIO;
        }

        if (dma_has_cap(DMA_XOR, device->cap_mask) && !device->device_prep_dma_xor) {
                dev_err(device->dev,
                        "Device claims capability %s, but op is not defined\n",
                        "DMA_XOR");
                return -EIO;
        }

        if (dma_has_cap(DMA_XOR_VAL, device->cap_mask) && !device->device_prep_dma_xor_val) {
                dev_err(device->dev,
                        "Device claims capability %s, but op is not defined\n",
                        "DMA_XOR_VAL");
                return -EIO;
        }

        if (dma_has_cap(DMA_PQ, device->cap_mask) && !device->device_prep_dma_pq) {
                dev_err(device->dev,
                        "Device claims capability %s, but op is not defined\n",
                        "DMA_PQ");
                return -EIO;
        }

        if (dma_has_cap(DMA_PQ_VAL, device->cap_mask) && !device->device_prep_dma_pq_val) {
                dev_err(device->dev,
                        "Device claims capability %s, but op is not defined\n",
                        "DMA_PQ_VAL");
                return -EIO;
        }

        if (dma_has_cap(DMA_MEMSET, device->cap_mask) && !device->device_prep_dma_memset) {
                dev_err(device->dev,
                        "Device claims capability %s, but op is not defined\n",
                        "DMA_MEMSET");
                return -EIO;
        }

        if (dma_has_cap(DMA_INTERRUPT, device->cap_mask) && !device->device_prep_dma_interrupt) {
                dev_err(device->dev,
                        "Device claims capability %s, but op is not defined\n",
                        "DMA_INTERRUPT");
                return -EIO;
        }

        if (dma_has_cap(DMA_CYCLIC, device->cap_mask) && !device->device_prep_dma_cyclic) {
                dev_err(device->dev,
                        "Device claims capability %s, but op is not defined\n",
                        "DMA_CYCLIC");
                return -EIO;
        }

        if (dma_has_cap(DMA_INTERLEAVE, device->cap_mask) && !device->device_prep_interleaved_dma) {
                dev_err(device->dev,
                        "Device claims capability %s, but op is not defined\n",
                        "DMA_INTERLEAVE");
                return -EIO;
        }


        if (!device->device_tx_status) {
                dev_err(device->dev, "Device tx_status is not defined\n");
                return -EIO;
        }


        if (!device->device_issue_pending) {
                dev_err(device->dev, "Device issue_pending is not defined\n");
                return -EIO;
        }

Registers a DMA controller device.

In code lines 7~8, -ENODEV is returned if the device argument is NULL.
In code lines 11~14, -EIO is returned if device->dev is not set.
In code lines 16~77, -EIO is returned if the DMA controller claims a DMA capability but does not implement the matching callback.
In code lines 80~89, -EIO is returned if the DMA controller does not implement the following basic callbacks.
- (*device_tx_status)
- (*device_issue_pending)
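
The long run of near-identical checks above follows one rule: every claimed capability must come with its prep callback, and two callbacks are always mandatory. The table-driven sketch below is a user-space model of that rule, not the kernel code; all `model_`/`MODEL_` names are hypothetical, only three capabilities are modelled, and `MODEL_EIO` merely stands in for EIO.

```c
#include <assert.h>
#include <stddef.h>

enum model_cap {
    MODEL_DMA_MEMCPY,
    MODEL_DMA_XOR,
    MODEL_DMA_CYCLIC,
    MODEL_CAP_END
};

#define MODEL_EIO 5   /* stand-in for the kernel's EIO */

typedef int (*model_op_t)(void);

struct model_dma_device {
    unsigned long cap_mask;           /* bit n set => capability n is claimed */
    model_op_t prep[MODEL_CAP_END];   /* prep callback for each capability */
    model_op_t tx_status;             /* mandatory */
    model_op_t issue_pending;         /* mandatory */
};

/* Returns 0 if the device is consistent, -MODEL_EIO otherwise. */
int model_validate(const struct model_dma_device *d)
{
    int cap;

    assert(d != NULL);
    for (cap = 0; cap < MODEL_CAP_END; cap++)
        if ((d->cap_mask & (1UL << cap)) && d->prep[cap] == NULL)
            return -MODEL_EIO;   /* capability claimed, op not defined */

    if (d->tx_status == NULL || d->issue_pending == NULL)
        return -MODEL_EIO;

    return 0;
}

/* Trivial callback used to fill in the function pointers. */
int model_noop(void)
{
    return 0;
}
```

A device that sets the MEMCPY bit without a MEMCPY prep callback fails the check, which is the same situation the kernel reports as "Device claims capability ..., but op is not defined".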

drivers/dma/dmaengine.c -2/3-

        /* note: this only matters in the
         * CONFIG_ASYNC_TX_ENABLE_CHANNEL_SWITCH=n case
         */
        if (device_has_all_tx_types(device))
                dma_cap_set(DMA_ASYNC_TX, device->cap_mask);

        idr_ref = kmalloc(sizeof(*idr_ref), GFP_KERNEL);
        if (!idr_ref)
                return -ENOMEM;
        rc = get_dma_id(device);
        if (rc != 0) {
                kfree(idr_ref);
                return rc;
        }

        atomic_set(idr_ref, 0);

        /* represent channels in sysfs. Probably want devs too */
        list_for_each_entry(chan, &device->channels, device_node) {
                rc = -ENOMEM;
                chan->local = alloc_percpu(typeof(*chan->local));
                if (chan->local == NULL)
                        goto err_out;
                chan->dev = kzalloc(sizeof(*chan->dev), GFP_KERNEL);
                if (chan->dev == NULL) {
                        free_percpu(chan->local);
                        chan->local = NULL;
                        goto err_out;
                }

                chan->chan_id = chancnt++;
                chan->dev->device.class = &dma_devclass;
                chan->dev->device.parent = device->dev;
                chan->dev->chan = chan;
                chan->dev->idr_ref = idr_ref;
                chan->dev->dev_id = device->dev_id;
                atomic_inc(idr_ref);
                dev_set_name(&chan->dev->device, "dma%dchan%d",
                             device->dev_id, chan->chan_id);

                rc = device_register(&chan->dev->device);
                if (rc) {
                        free_percpu(chan->local);
                        chan->local = NULL;
                        kfree(chan->dev);
                        atomic_dec(idr_ref);
                        goto err_out;
                }
                chan->client_count = 0;
        }

        if (!chancnt) {
                dev_err(device->dev, "%s: device has no channels!\n", __func__);
                rc = -ENODEV;
                goto err_out;
        }

        device->chancnt = chancnt;

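In code lines 4~5, if the device supports every transfer type that async tx needs (device_has_all_tx_types()), the DMA_ASYNC_TX flag is added to the device's cap_mask. As the comment notes, this only matters when CONFIG_ASYNC_TX_ENABLE_CHANNEL_SWITCH=n.
In code lines 7~16, an idr_ref is allocated for the device and initialized to 0. An ID is also allocated for the DMA controller and stored in device->dev_id.
In code lines 19~56, the channels of the DMA controller are walked; each channel device is initialized and registered as a device. If the controller turns out to have no channels, -ENODEV is returned.
In code line 58, the number of DMA channels is stored.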

drivers/dma/dmaengine.c -3/3-

        mutex_lock(&dma_list_mutex);
        /* take references on public channels */
        if (dmaengine_ref_count && !dma_has_cap(DMA_PRIVATE, device->cap_mask))
                list_for_each_entry(chan, &device->channels, device_node) {
                        /* if clients are already waiting for channels we need
                         * to take references on their behalf
                         */
                        if (dma_chan_get(chan) == -ENODEV) {
                                /* note we can only get here for the first
                                 * channel as the remaining channels are
                                 * guaranteed to get a reference
                                 */
                                rc = -ENODEV;
                                mutex_unlock(&dma_list_mutex);
                                goto err_out;
                        }
                }
        list_add_tail_rcu(&device->global_node, &dma_device_list);
        if (dma_has_cap(DMA_PRIVATE, device->cap_mask))
                device->privatecnt++;   /* Always private */
        dma_channel_rebalance();
        mutex_unlock(&dma_list_mutex);

        return 0;

err_out:
        /* if we never registered a channel just release the idr */
        if (atomic_read(idr_ref) == 0) {
                ida_free(&dma_ida, device->dev_id);
                kfree(idr_ref);
                return rc;
        }

        list_for_each_entry(chan, &device->channels, device_node) {
                if (chan->local == NULL)
                        continue;
                mutex_lock(&dma_list_mutex);
                chan->dev->chan = NULL;
                mutex_unlock(&dma_list_mutex);
                device_unregister(&chan->dev->device);
                free_percpu(chan->local);
        }
        return rc;
}
EXPORT_SYMBOL(dma_async_device_register);

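In code lines 3~17, if DMAEngine clients already exist (dmaengine_ref_count is non-zero) and the controller is not DMA_PRIVATE, every channel of the DMA controller is walked and a reference is taken on each.
In code line 18, the controller is added to the DMA controller list dma_device_list.
In code lines 19~20, privatecnt is incremented by 1 for a private DMA controller, which supports slave transfers only and cannot do async tx.
In code line 21, channels are rebalanced per DMA tx type; where possible, the least-used channel of a DMA controller on the local node is chosen.
In code line 24, the success value 0 is returned.

Rebalancing Channels per tx Type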

dma_channel_rebalance()

drivers/dma/dmaengine.c

/**
 * dma_channel_rebalance - redistribute the available channels
 *
 * Optimize for cpu isolation (each cpu gets a dedicated channel for an
 * operation type) in the SMP case,  and operation isolation (avoid
 * multi-tasking channels) in the non-SMP case.  Must be called under
 * dma_list_mutex.
 */

static void dma_channel_rebalance(void)
{
        struct dma_chan *chan;
        struct dma_device *device;
        int cpu;
        int cap;

        /* undo the last distribution */
        for_each_dma_cap_mask(cap, dma_cap_mask_all)
                for_each_possible_cpu(cpu)
                        per_cpu_ptr(channel_table[cap], cpu)->chan = NULL;

        list_for_each_entry(device, &dma_device_list, global_node) {
                if (dma_has_cap(DMA_PRIVATE, device->cap_mask))
                        continue;
                list_for_each_entry(chan, &device->channels, device_node)
                        chan->table_count = 0;
        }

        /* don't populate the channel_table if no clients are available */
        if (!dmaengine_ref_count)
                return;

        /* redistribute available channels */
        for_each_dma_cap_mask(cap, dma_cap_mask_all)
                for_each_online_cpu(cpu) {
                        chan = min_chan(cap, cpu);
                        per_cpu_ptr(channel_table[cap], cpu)->chan = chan;
                }
}

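Rebalances channels per DMA tx type; where possible, the least-used channel of a DMA controller on the local node is chosen.

In code lines 9~11, every DMA cap is walked, and for each cap the possible CPUs are walked to reset the channel.
In code lines 13~18, the DMA controller list is walked; controllers with the DMA_PRIVATE cap are skipped, and table_count is reset to 0 for every channel of the remaining controllers.
In code lines 21~22, the function returns if the DMA controllers have no users.
In code lines 25~29, every DMA cap is walked, and for each cap the online CPUs are walked; the least-used channel for that cap is found and stored in the channel table.
- The per-CPU channel_table[dma tx type]->chan is set to the least-used channel.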

min_chan()

drivers/dma/dmaengine.c

/**
 * min_chan - returns the channel with min count and in the same numa-node as the cpu
 * @cap: capability to match
 * @cpu: cpu index which the channel should be close to
 *
 * If some channels are close to the given cpu, the one with the lowest
 * reference count is returned. Otherwise, cpu is ignored and only the
 * reference count is taken into account.
 * Must be called under dma_list_mutex.
 */

static struct dma_chan *min_chan(enum dma_transaction_type cap, int cpu)
{
        struct dma_device *device;
        struct dma_chan *chan;
        struct dma_chan *min = NULL;
        struct dma_chan *localmin = NULL;

        list_for_each_entry(device, &dma_device_list, global_node) {
                if (!dma_has_cap(cap, device->cap_mask) ||
                    dma_has_cap(DMA_PRIVATE, device->cap_mask))
                        continue;
                list_for_each_entry(chan, &device->channels, device_node) {
                        if (!chan->client_count)
                                continue;
                        if (!min || chan->table_count < min->table_count)
                                min = chan;

                        if (dma_chan_is_local(chan, cpu))
                                if (!localmin ||
                                    chan->table_count < localmin->table_count)
                                        localmin = chan;
                }
        }

        chan = localmin ? localmin : min;

        if (chan)
                chan->table_count++;

        return chan;
}

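Returns the channel with the lowest table_count for @cap among the DMA controllers on the local node of @cpu. If there is no local channel, channels on all nodes are considered.

In code lines 8~11, all DMA controllers are walked; a controller without @cap or with the DMA_PRIVATE cap is skipped.
In code lines 12~14, all channels of the current DMA controller are walked; a channel with no client attached is skipped.
In code lines 15~16, the channel with the lowest table_count seen so far becomes the min channel.
In code lines 18~21, if the channel's device belongs to the node that contains @cpu and the channel has the lowest table_count seen so far among such channels, it becomes the localmin channel.
In code lines 25~30, localmin is chosen if it is set; otherwise min is chosen. The table_count of the chosen channel is incremented before it is returned.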
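
The selection rule of min_chan() can be tried out in user space with the simplified model below. It is only a sketch: the controllers are flattened into one array of channels, the capability and DMA_PRIVATE filters are left out, and the NUMA node is a plain integer with -1 meaning "no node". All `model_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct model_chan {
    int client_count;   /* channels without clients are skipped */
    int table_count;    /* number of channel_table slots pointing here */
    int node;           /* NUMA node of the owning controller, -1 = none */
};

/* Like dma_chan_is_local(): a controller without a node counts as local. */
int model_chan_is_local(const struct model_chan *c, int cpu_node)
{
    return c->node == -1 || c->node == cpu_node;
}

/* Pick the least-used channel, preferring one local to cpu_node. */
struct model_chan *model_min_chan(struct model_chan *chans, size_t n,
                                  int cpu_node)
{
    struct model_chan *min = NULL, *localmin = NULL, *pick;
    size_t i;

    assert(chans != NULL || n == 0);
    for (i = 0; i < n; i++) {
        struct model_chan *c = &chans[i];

        if (c->client_count == 0)
            continue;
        if (min == NULL || c->table_count < min->table_count)
            min = c;
        if (model_chan_is_local(c, cpu_node) &&
            (localmin == NULL || c->table_count < localmin->table_count))
            localmin = c;
    }

    pick = localmin ? localmin : min;
    if (pick)
        pick->table_count++;   /* the caller stores pick in its table slot */
    return pick;
}
```

Note that a local channel wins even when a remote channel has a lower table_count; the global minimum is used only when no local channel qualifies.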

dma_chan_is_local()

drivers/dma/dmaengine.c

/**
 * dma_chan_is_local - returns true if the channel is in the same numa-node as the cpu
 */

static bool dma_chan_is_local(struct dma_chan *chan, int cpu)
{
        int node = dev_to_node(chan->device->dev);
        return node == NUMA_NO_NODE ||
                cpumask_test_cpu(cpu, cpumask_of_node(node));
}

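Returns whether the channel's device belongs to the local node of @cpu. A device with no NUMA node (NUMA_NO_NODE) is treated as local.

Acquiring a DMA Channel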

dma_get_slave_channel()

drivers/dma/dmaengine.c

/**
 * dma_get_slave_channel - try to get specific channel exclusively
 * @chan: target channel
 */

struct dma_chan *dma_get_slave_channel(struct dma_chan *chan)
{
        int err = -EBUSY;

        /* lock against __dma_request_channel */
        mutex_lock(&dma_list_mutex);

        if (chan->client_count == 0) {
                struct dma_device *device = chan->device;

                dma_cap_set(DMA_PRIVATE, device->cap_mask);
                device->privatecnt++;
                err = dma_chan_get(chan);
                if (err) {
                        dev_dbg(chan->device->dev,
                                "%s: failed to get %s: (%d)\n",
                                __func__, dma_chan_name(chan), err);
                        chan = NULL;
                        if (--device->privatecnt == 0)
                                dma_cap_clear(DMA_PRIVATE, device->cap_mask);
                }
        } else
                chan = NULL;

        mutex_unlock(&dma_list_mutex);


        return chan;
}
EXPORT_SYMBOL_GPL(dma_get_slave_channel);

Acquires the requested DMA channel exclusively. On success the requested DMA channel is returned as is; on failure NULL is returned.
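
The exclusive-acquire rule can be summed up in a small single-threaded model. This is a sketch only: the dma_list_mutex locking is omitted, dma_chan_get() is assumed to succeed, and the DMA_PRIVATE capability bit is reduced to an `is_private` flag. The `model_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct model_device {
    int is_private;   /* stands in for the DMA_PRIVATE capability bit */
    int privatecnt;   /* number of privately held channels */
};

struct model_chan {
    struct model_device *device;
    int client_count;
};

/* Returns chan if it was free and is now held exclusively, else NULL. */
struct model_chan *model_get_slave_channel(struct model_chan *chan)
{
    assert(chan != NULL && chan->device != NULL);

    if (chan->client_count != 0)
        return NULL;                 /* already in use (the -EBUSY path) */

    chan->device->is_private = 1;    /* keep the controller out of async tx */
    chan->device->privatecnt++;
    chan->client_count++;            /* dma_chan_get() assumed to succeed */
    return chan;
}
```

A second request for the same channel fails because client_count is no longer zero, which is what makes the acquisition exclusive.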

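DMA Engine – Slave Device User Side

Order of DMA Use on the Slave Side

The order of DMA API calls is described in three broad parts.

probe function
- Allocate a DMA slave channel
  - dma_request_chan()
  - or dma_request_slave_channel_reason()
  - or dma_request_slave_channel()
- Set slave and controller parameters
  - dmaengine_slave_config()
DMA transfer function
- DMA mapping API (optional)
  - dma_map_*()
- Prepare a descriptor for the transaction
  - dmaengine_prep_*()
- Submit the transaction
  - dmaengine_submit()
- Issue pending requests and wait for the callback notification
  - dma_async_issue_pending()
DMA interrupt handler
- DMA unmapping API (optional)
  - dma_unmap_*()
- dmaengine_terminate_all() – timeout handling

The following figure shows the order in which DMA-related calls are processed when a DMA slave device is first detected and initialized.

The following figure shows the order in which DMA-related calls are processed when a transfer is attempted through a DMA slave device.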

Requesting a DMA Channel

DMA slave device users that need a DMA channel obtain one through dma_request_chan() from the DMA controller defined in DT or ACPI.

Example) master->dma_rx = dma_request_chan(user_slave_device->dev, "rx");

dma_request_slave_channel()

drivers/dma/dmaengine.c

/**
 * dma_request_slave_channel - try to allocate an exclusive slave channel
 * @dev:        pointer to client device structure
 * @name:       slave channel name
 *
 * Returns pointer to appropriate DMA channel on success or NULL.
 */

struct dma_chan *dma_request_slave_channel(struct device *dev,
                                           const char *name)
{
        struct dma_chan *ch = dma_request_chan(dev, name);
        if (IS_ERR(ch))
                return NULL;

        return ch;
}
EXPORT_SYMBOL_GPL(dma_request_slave_channel);

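Tries to allocate a slave channel. On success the DMA channel is returned; on failure NULL is returned (the error code from dma_request_chan() is dropped).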

dma_request_slave_channel_reason()

include/linux/dmaengine.h

#define dma_request_slave_channel_reason(dev, name) dma_request_chan(dev, name)

Tries to allocate a slave channel. On success the DMA channel is returned; on failure an error pointer is returned.

dma_request_chan()

drivers/dma/dmaengine.c

/**
 * dma_request_chan - try to allocate an exclusive slave channel
 * @dev:        pointer to client device structure
 * @name:       slave channel name
 *
 * Returns pointer to appropriate DMA channel on success or an error pointer.
 */

struct dma_chan *dma_request_chan(struct device *dev, const char *name)
{
        struct dma_device *d, *_d;
        struct dma_chan *chan = NULL;

        /* If device-tree is present get slave info from here */
        if (dev->of_node)
                chan = of_dma_request_slave_channel(dev->of_node, name);

        /* If device was enumerated by ACPI get slave info from here */
        if (has_acpi_companion(dev) && !chan)
                chan = acpi_dma_request_slave_chan_by_name(dev, name);

        if (chan) {
                /* Valid channel found or requester needs to be deferred */
                if (!IS_ERR(chan) || PTR_ERR(chan) == -EPROBE_DEFER)
                        return chan;
        }

        /* Try to find the channel via the DMA filter map(s) */
        mutex_lock(&dma_list_mutex);
        list_for_each_entry_safe(d, _d, &dma_device_list, global_node) {
                dma_cap_mask_t mask;
                const struct dma_slave_map *map = dma_filter_match(d, name, dev);

                if (!map)
                        continue;

                dma_cap_zero(mask);
                dma_cap_set(DMA_SLAVE, mask);

                chan = find_candidate(d, &mask, d->filter.fn, map->param);
                if (!IS_ERR(chan))
                        break;
        }
        mutex_unlock(&dma_list_mutex);

        return chan ? chan : ERR_PTR(-EPROBE_DEFER);
}
EXPORT_SYMBOL_GPL(dma_request_chan);

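Tries to allocate a slave channel. On success the DMA channel is returned; on failure an error pointer is returned.

In code lines 7~8, if the DMA slave device was registered through the device tree, the DMA channel of the DMA controller to use is looked up through the device tree.
In code lines 11~12, if nothing was obtained above and the device was enumerated through ACPI, the DMA channel of the DMA controller to use is looked up through ACPI.
In code lines 14~18, the function returns if a DMA channel was found or the result is the -EPROBE_DEFER error.
In code lines 21~36, if no DMA channel was found, the DMA filter maps are searched for a channel while holding the dma_list_mutex lock.
In code line 38, the DMA channel is returned on success; if nothing was found at all, the -EPROBE_DEFER error is returned.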
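
The lookup order and the two failure conventions (error pointer versus NULL) can be modelled in user space as below. This is a sketch under stated assumptions: the three lookups are passed in as already-computed results, where NULL means "this source was not available or matched nothing", and the error-pointer helpers only imitate the kernel's ERR_PTR()/IS_ERR()/PTR_ERR(). All `model_`/`MODEL_` names are hypothetical; the numeric values mirror the kernel's ENODEV (19), EPROBE_DEFER (517) and MAX_ERRNO (4095).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_ERRNO     4095
#define MODEL_ENODEV        19
#define MODEL_EPROBE_DEFER  517

/* Encode a negative errno in a pointer, in the style of ERR_PTR(). */
void *model_err_ptr(long err)
{
    assert(err < 0 && err >= -MODEL_MAX_ERRNO);
    return (void *)(intptr_t)err;
}

/* Decode it again, in the style of PTR_ERR(). */
long model_ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

/* True if p lies in the top MODEL_MAX_ERRNO addresses, like IS_ERR(). */
int model_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MODEL_MAX_ERRNO;
}

/* DT first, then ACPI, then the filter maps; error pointer on failure. */
void *model_request_chan(void *from_dt, void *from_acpi, void *from_filter)
{
    void *chan = from_dt;

    if (chan == NULL)
        chan = from_acpi;

    /* valid channel found, or the requester has to be deferred */
    if (chan != NULL &&
        (!model_is_err(chan) || model_ptr_err(chan) == -MODEL_EPROBE_DEFER))
        return chan;

    if (from_filter != NULL)
        chan = from_filter;

    return chan ? chan : model_err_ptr(-MODEL_EPROBE_DEFER);
}

/* Same lookup, but every failure collapses to NULL. */
void *model_request_slave_channel(void *from_dt, void *from_acpi,
                                  void *from_filter)
{
    void *ch = model_request_chan(from_dt, from_acpi, from_filter);

    return model_is_err(ch) ? NULL : ch;
}
```

The practical consequence is that only the error-pointer variant lets a client driver see -EPROBE_DEFER and retry its probe later; the NULL-returning wrapper hides that distinction.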

DMA Slave Configuration

dmaengine_slave_config()

static inline int dmaengine_slave_config(struct dma_chan *chan,
                                          struct dma_slave_config *config)
{
        if (chan->device->device_config)
                return chan->device->device_config(chan, config);

        return -ENOSYS;
}

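Applies the DMA slave configuration to the requested channel. If the controller does not implement (*device_config), -ENOSYS is returned.

Starting a DMA Transfer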

dmaengine_submit()

include/linux/dmaengine.h

static inline dma_cookie_t dmaengine_submit(struct dma_async_tx_descriptor *desc)
{
        return desc->tx_submit(desc);
}

Submits a DMA request using the prepared async transfer descriptor.

Flushing Remaining Pending Transactions to HW

dma_async_issue_pending()

include/linux/dmaengine.h

/**
 * dma_async_issue_pending - flush pending transactions to HW
 * @chan: target DMA channel
 *
 * This allows drivers to push copies to HW in batches,
 * reducing MMIO writes where possible.
 */

static inline void dma_async_issue_pending(struct dma_chan *chan)
{
        chan->device->device_issue_pending(chan);
}

Requests that all pending transactions on the channel be pushed to the hardware. (flush)
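
The split between dmaengine_submit() and dma_async_issue_pending() is easy to miss: submitting only queues a descriptor and hands back a cookie, and nothing reaches the hardware until pending work is issued. The user-space model below sketches that two-step behaviour. It is not the kernel implementation; the `model_` names are hypothetical and the cookie numbering is simplified to start at 1.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_QUEUE_MAX 8

typedef int model_cookie_t;

struct model_chan {
    model_cookie_t last_cookie;                /* last cookie handed out */
    model_cookie_t pending[MODEL_QUEUE_MAX];   /* submitted, not yet issued */
    int npending;
    int nissued;                               /* pushed to the "hardware" */
};

/* Like tx_submit(): assign a cookie and queue the work; nothing starts. */
model_cookie_t model_submit(struct model_chan *c)
{
    assert(c != NULL && c->npending < MODEL_QUEUE_MAX);
    c->pending[c->npending++] = ++c->last_cookie;
    return c->last_cookie;
}

/* Like device_issue_pending(): hand every queued descriptor to the HW. */
void model_issue_pending(struct model_chan *c)
{
    assert(c != NULL);
    c->nissued += c->npending;
    c->npending = 0;
}
```

Batching several submissions before one issue call is the point of the design: as the kernel comment says, it lets drivers push work to the hardware in batches and reduce MMIO writes.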

dmaengine_terminate_all()

include/linux/dmaengine.h

/**
 * dmaengine_terminate_all() - Terminate all active DMA transfers
 * @chan: The channel for which to terminate the transfers
 *
 * This function is DEPRECATED use either dmaengine_terminate_sync() or
 * dmaengine_terminate_async() instead.
 */

static inline int dmaengine_terminate_all(struct dma_chan *chan)
{
        if (chan->device->device_terminate_all)
                return chan->device->device_terminate_all(chan);

        return -ENOSYS;
}

Terminates all DMA transfers that are active on the DMA channel.

DMA-Related Device Tree

DMA controllers built into ARM64 SoCs (DT compatible names)

mediatek
- "mediatek,mt7622-hsdma"
arm
- "arm,pl330", "arm,primecell"
- Adopted by juno, rockchip, exynos, broadcom (ns2, stingray), and altera SoCs
sprd
- "sprd,sc9860-dma"
qualcomm
- "qcom,bam-v1.7.0"
freescale
- "fsl,imx8mn-sdma", "fsl,imx8mq-sdma"
- "fsl,imx7d-dma-apbh", "fsl,imx28-dma-apbh"
- "fsl,vf610-edma"
renesas
- "renesas,usb-dmac"
- "renesas,dmac-r8a77965"
- "renesas,rcar-dmac"
actions
- "actions,s900-dma"
zte
- "zte,zx296702-dma"
hisilicon
- "hisilicon,k3-dma-1.0"
- "hisilicon,hisi-pcm-asp-dma-1.0"
allwinner
- "allwinner,sun8i-h3-dma"
- "allwinner,sun50i-h6-dma"
- "allwinner,sun50i-a64-dma"
nvidia
- "nvidia,tegra194-adma", "nvidia,tegra186-adma"
broadcom (rpi)
- "brcm,bcm2835-dma"

Example) Two DMA controllers – DT

arch/arm64/boot/dts/rockchip/rk3399.dtsi

        amba {
                compatible = "simple-bus";
                #address-cells = <2>;
                #size-cells = <2>;
                ranges;

                dmac_bus: dma-controller@ff6d0000 {
                        compatible = "arm,pl330", "arm,primecell";
                        reg = <0x0 0xff6d0000 0x0 0x4000>;
                        interrupts = <GIC_SPI 5 IRQ_TYPE_LEVEL_HIGH 0>,
                                     <GIC_SPI 6 IRQ_TYPE_LEVEL_HIGH 0>;
                        #dma-cells = <1>;
                        clocks = <&cru ACLK_DMAC0_PERILP>;
                        clock-names = "apb_pclk";
                };

                dmac_peri: dma-controller@ff6e0000 {
                        compatible = "arm,pl330", "arm,primecell";
                        reg = <0x0 0xff6e0000 0x0 0x4000>;
                        interrupts = <GIC_SPI 7 IRQ_TYPE_LEVEL_HIGH 0>,
                                     <GIC_SPI 8 IRQ_TYPE_LEVEL_HIGH 0>;
                        #dma-cells = <1>;
                        clocks = <&cru ACLK_DMAC1_PERILP>;
                        clock-names = "apb_pclk";
                };
        };

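The rk3399 SoC uses Arm's pl330 DMA controller IP, and through the device tree above the two DMA controllers are registered as platform devices under the amba bus.
- The "simple-bus" compatible on the amba node makes the child nodes below it be recognized as platform devices.

Example) Five SPI controllers (using two DMA controllers) – DT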

arch/arm64/boot/dts/rockchip/rk3399.dtsi

        spi0: spi@ff1c0000 {
                compatible = "rockchip,rk3399-spi", "rockchip,rk3066-spi";
                interrupts = <GIC_SPI 68 IRQ_TYPE_LEVEL_HIGH 0>;
                dmas = <&dmac_peri 10>, <&dmac_peri 11>;
                dma-names = "tx", "rx";
                ...
        };

        spi1: spi@ff1d0000 {
                compatible = "rockchip,rk3399-spi", "rockchip,rk3066-spi";
                interrupts = <GIC_SPI 53 IRQ_TYPE_LEVEL_HIGH 0>;
                dmas = <&dmac_peri 12>, <&dmac_peri 13>;
                dma-names = "tx", "rx";
                ...
        };

        spi2: spi@ff1e0000 {
                compatible = "rockchip,rk3399-spi", "rockchip,rk3066-spi";
                interrupts = <GIC_SPI 52 IRQ_TYPE_LEVEL_HIGH 0>;
                dmas = <&dmac_peri 14>, <&dmac_peri 15>;
                dma-names = "tx", "rx";
                ...
        };

        spi4: spi@ff1f0000 {
                compatible = "rockchip,rk3399-spi", "rockchip,rk3066-spi";
                interrupts = <GIC_SPI 67 IRQ_TYPE_LEVEL_HIGH 0>;
                dmas = <&dmac_peri 18>, <&dmac_peri 19>;
                dma-names = "tx", "rx";
                ...
        };

        spi5: spi@ff200000 {
                compatible = "rockchip,rk3399-spi", "rockchip,rk3066-spi";
                interrupts = <GIC_SPI 132 IRQ_TYPE_LEVEL_HIGH 0>;
                dmas = <&dmac_bus 8>, <&dmac_bus 9>;
                dma-names = "tx", "rx";
        };

The following four SPI controllers use the dmac_peri controller.
- Channels 10 (tx) and 11 (rx), SPI #68 interrupt
- Channels 12 (tx) and 13 (rx), SPI #53 interrupt
- Channels 14 (tx) and 15 (rx), SPI #52 interrupt
- Channels 18 (tx) and 19 (rx), SPI #67 interrupt
The following one SPI controller uses the dmac_bus controller.
- Channels 8 (tx) and 9 (rx), SPI #132 interrupt

DMA Host Controller – DT

DMA Controller Registration

of_dma_controller_register()

drivers/dma/of-dma.c

/**
 * of_dma_controller_register - Register a DMA controller to DT DMA helpers
 * @np:                 device node of DMA controller
 * @of_dma_xlate:       translation function which converts a phandle
 *                      arguments list into a dma_chan structure
 * @data                pointer to controller specific data to be used by
 *                      translation function
 *
 * Returns 0 on success or appropriate errno value on error.
 *
 * Allocated memory should be freed with appropriate of_dma_controller_free()
 * call.
 */

int of_dma_controller_register(struct device_node *np,
                                struct dma_chan *(*of_dma_xlate)
                                (struct of_phandle_args *, struct of_dma *),
                                void *data)
{
        struct of_dma   *ofdma;

        if (!np || !of_dma_xlate) {
                pr_err("%s: not enough information provided\n", __func__);
                return -EINVAL;
        }

        ofdma = kzalloc(sizeof(*ofdma), GFP_KERNEL);
        if (!ofdma)
                return -ENOMEM;

        ofdma->of_node = np;
        ofdma->of_dma_xlate = of_dma_xlate;
        ofdma->of_dma_data = data;

        /* Now queue of_dma controller structure in list */
        mutex_lock(&of_dma_lock);
        list_add_tail(&ofdma->of_dma_controllers, &of_dma_list);
        mutex_unlock(&of_dma_lock);

        return 0;
}
EXPORT_SYMBOL_GPL(of_dma_controller_register);

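Registers the DMA controller of device node @np with the DT DMA helpers by adding an of_dma entry to the global of_dma_list. Returns 0 on success.

The second argument specifies the DMA translation callback (*of_dma_xlate), which returns a DMA channel as its result.

DMA Translation Callback (*of_dma_xlate)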

of_dma_simple_xlate() – simple

drivers/dma/of-dma.c

/**
 * of_dma_simple_xlate - Simple DMA engine translation function
 * @dma_spec:   pointer to DMA specifier as found in the device tree
 * @of_dma:     pointer to DMA controller data
 *
 * A simple translation function for devices that use a 32-bit value for the
 * filter_param when calling the DMA engine dma_request_channel() function.
 * Note that this translation function requires that #dma-cells is equal to 1
 * and the argument of the dma specifier is the 32-bit filter_param. Returns
 * pointer to appropriate dma channel on success or NULL on error.
 */

struct dma_chan *of_dma_simple_xlate(struct of_phandle_args *dma_spec,
                                                struct of_dma *ofdma)
{
        int count = dma_spec->args_count;
        struct of_dma_filter_info *info = ofdma->of_dma_data;

        if (!info || !info->filter_fn)
                return NULL;

        if (count != 1)
                return NULL;

        return __dma_request_channel(&info->dma_cap, info->filter_fn,
                                     &dma_spec->args[0], dma_spec->np);
}
EXPORT_SYMBOL_GPL(of_dma_simple_xlate);

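A simple DMA translation callback that returns the DMA channel matching the given number.

It takes the single number that follows the phandle in the “dmas” property and passes it as the filter parameter when requesting the DMA channel. (#dma-cells must be 1.)
- e.g.) dmas = <&dmac_bus 8>
  - Obtains the DMA channel corresponding to number 8 from the DMA controller that the &dmac_bus label refers to.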

of_dma_pl330_xlate() – for pl330

drivers/dma/pl330.c

static struct dma_chan *of_dma_pl330_xlate(struct of_phandle_args *dma_spec,
                                                struct of_dma *ofdma)
{
        int count = dma_spec->args_count;
        struct pl330_dmac *pl330 = ofdma->of_dma_data;
        unsigned int chan_id;

        if (!pl330)
                return NULL;

        if (count != 1)
                return NULL;

        chan_id = dma_spec->args[0];
        if (chan_id >= pl330->num_peripherals)
                return NULL;

        return dma_get_slave_channel(&pl330->peripherals[chan_id].chan);
}

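For the pl330 DMA controller, this DMA translation callback returns the DMA channel whose index matches the requested number. It returns NULL when the number is out of range (>= num_peripherals).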
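The registration and translation steps described so far can be sketched in user space. This is only a minimal model, not kernel code: every `model_*` name is hypothetical, a singly linked list stands in for of_dma_list, and no locking is modeled.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical user-space stand-ins; none of these are the kernel types. */
struct model_node { const char *name; };     /* plays struct device_node */
struct model_chan { unsigned int id; };      /* plays struct dma_chan */
struct model_spec {                          /* plays struct of_phandle_args */
	const struct model_node *np;
	int args_count;
	unsigned int args[4];
};

struct model_ofdma;
typedef struct model_chan *(*model_xlate_fn)(const struct model_spec *,
					      struct model_ofdma *);

struct model_ofdma {                         /* plays struct of_dma */
	struct model_ofdma *next;
	const struct model_node *of_node;
	model_xlate_fn xlate;
	void *data;
};

struct model_pl330 {                         /* controller private data */
	unsigned int num_peripherals;
	struct model_chan *peripherals;
};

static struct model_ofdma *model_list;       /* plays of_dma_list */

/* Append a controller entry after the same argument checks as the text. */
int model_controller_register(const struct model_node *np,
			      model_xlate_fn xlate, void *data)
{
	struct model_ofdma *ofdma, **tail;

	if (!np || !xlate)
		return -EINVAL;

	ofdma = calloc(1, sizeof(*ofdma));
	if (!ofdma)
		return -ENOMEM;

	ofdma->of_node = np;
	ofdma->xlate = xlate;
	ofdma->data = data;

	/* Walk to the end so entries keep their registration order. */
	for (tail = &model_list; *tail; tail = &(*tail)->next)
		;
	*tail = ofdma;
	return 0;
}

/* pl330-style translation: exactly one cell, used as a bounds-checked index. */
struct model_chan *model_pl330_xlate(const struct model_spec *spec,
				     struct model_ofdma *ofdma)
{
	struct model_pl330 *pl330 = ofdma->data;

	if (!pl330 || spec->args_count != 1)
		return NULL;
	if (spec->args[0] >= pl330->num_peripherals)
		return NULL;
	return &pl330->peripherals[spec->args[0]];
}
```

In this model a specifier such as <&dmac_bus 8> resolves to the ninth element of the controller's peripheral array, while a cell value equal to or above num_peripherals yields NULL.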

DMA Slave – DT

Requesting a Slave Channel

of_dma_request_slave_channel()

drivers/dma/of-dma.c

/**
 * of_dma_request_slave_channel - Get the DMA slave channel
 * @np:         device node to get DMA request from
 * @name:       name of desired channel
 *
 * Returns pointer to appropriate DMA channel on success or an error pointer.
 */

struct dma_chan *of_dma_request_slave_channel(struct device_node *np,
                                              const char *name)
{
        struct of_phandle_args  dma_spec;
        struct of_dma           *ofdma;
        struct dma_chan         *chan;
        int                     count, i, start;
        int                     ret_no_channel = -ENODEV;
        static atomic_t         last_index;

        if (!np || !name) {
                pr_err("%s: not enough information provided\n", __func__);
                return ERR_PTR(-ENODEV);
        }

        /* Silently fail if there is not even the "dmas" property */
        if (!of_find_property(np, "dmas", NULL))
                return ERR_PTR(-ENODEV);

        count = of_property_count_strings(np, "dma-names");
        if (count < 0) {
                pr_err("%s: dma-names property of node '%pOF' missing or empty\n",
                        __func__, np);
                return ERR_PTR(-ENODEV);
        }

        /*
         * approximate an average distribution across multiple
         * entries with the same name
         */
        start = atomic_inc_return(&last_index);
        for (i = 0; i < count; i++) {
                if (of_dma_match_channel(np, name,
                                         (i + start) % count,
                                         &dma_spec))
                        continue;

                mutex_lock(&of_dma_lock);
                ofdma = of_dma_find_controller(&dma_spec);

                if (ofdma) {
                        chan = ofdma->of_dma_xlate(&dma_spec, ofdma);
                } else {
                        ret_no_channel = -EPROBE_DEFER;
                        chan = NULL;
                }

                mutex_unlock(&of_dma_lock);

                of_node_put(dma_spec.np);

                if (chan)
                        return chan;
        }

        return ERR_PTR(ret_no_channel);
}
EXPORT_SYMBOL_GPL(of_dma_request_slave_channel);

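Obtains the DMA channel matching @name from node @np, which holds the information of the DMA slave device that wants to use the channel. On failure an error pointer is returned.

In code lines 11~14, if either argument is missing, an error message is printed and the -ENODEV error is returned.
In code lines 17~18, if the “dmas” property, which refers to the controller and channel number, cannot be found in the DMA slave node, the -ENODEV error is returned.
In code lines 20~25, the number of strings in the “dma-names” property, which names the DMA channels to use, is stored in count. If the property is missing or empty, an error message is printed and the -ENODEV error is returned.
In code line 31, the static variable last_index is incremented and the result is stored in start.
- Starts from 1
In code lines 32~36, the loop runs count times over index (i + start) % count and skips an entry when of_dma_match_channel() fails, that is, when the name at that index of “dma-names” does not match the requested @name or the “dmas” entry cannot be parsed.
In code lines 38~48, with of_dma_lock held, the controller is looked up and its DMA translation callback (*of_dma_xlate) is called to obtain the DMA channel. If the controller has not been registered yet, the error to return is changed to -EPROBE_DEFER.
In code line 50, the reference count of the DMA controller node that the phandle points to is decremented by 1.
- The reference count of the dma_spec.np node was incremented by 1 in of_dma_match_channel().
In code lines 52~53, the channel that was found is returned.
In code line 56, the -ENODEV or -EPROBE_DEFER error is returned.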
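The effect of the rotating start index in code lines 31~36 is easier to see in isolation. The sketch below is a user-space model with a hypothetical helper name; it only reproduces the (i + start) % count arithmetic and nothing else from the kernel function.

```c
#include <stddef.h>

/*
 * Fill order[] with the indices visited by the (i + start) % count loop.
 * Returns the number of entries written (0 when count is not positive).
 */
size_t model_visit_order(int start, int count, int *order)
{
	size_t n = 0;
	int i;

	if (count <= 0)
		return 0;
	for (i = 0; i < count; i++)
		order[n++] = (i + start) % count;
	return n;
}
```

With two “dma-names” entries, a call that sees start = 1 probes index 1 and then index 0, and the next call (start = 2) probes index 0 and then index 1. When several entries share the same name, successive requests therefore begin at different entries, which is the approximate distribution that the kernel comment mentions.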

of_dma_match_channel()

drivers/dma/of-dma.c

/**
 * of_dma_match_channel - Check if a DMA specifier matches name
 * @np:         device node to look for DMA channels
 * @name:       channel name to be matched
 * @index:      index of DMA specifier in list of DMA specifiers
 * @dma_spec:   pointer to DMA specifier as found in the device tree
 *
 * Check if the DMA specifier pointed to by the index in a list of DMA
 * specifiers, matches the name provided. Returns 0 if the name matches and
 * a valid pointer to the DMA specifier is found. Otherwise returns -ENODEV.
 */

static int of_dma_match_channel(struct device_node *np, const char *name,
                                int index, struct of_phandle_args *dma_spec)
{
        const char *s;

        if (of_property_read_string_index(np, "dma-names", index, &s))
                return -ENODEV;

        if (strcmp(name, s))
                return -ENODEV;

        if (of_parse_phandle_with_args(np, "dmas", "#dma-cells", index,
                                       dma_spec))
                return -ENODEV;

        return 0;
}

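If the @index-th string of the “dma-names” property in device node @np matches @name, the phandle node and the argument values at the @index-th entry of the “dmas” property are read into the output argument @dma_spec. Returns 0 on success and the -ENODEV error on failure.

In code lines 6~7, the @index-th value of the “dma-names” property in device node @np is read. If it cannot be read, the -ENODEV error is returned.
In code lines 9~10, if the value read differs from @name, the -ENODEV error is returned.
In code lines 12~14, the phandle node and the argument values at the @index-th entry of the “dmas” property are read into the output argument @dma_spec. If they cannot be read, the -ENODEV error is returned.
In code line 16, the success value 0 is returned.

As the following figure shows, when “rx” is at index 1 (indices start from 0) of the “dma-names” property, the dmac_peri node reached through the phandle at index 1 of the dmas property and the number 11 that follows it, read for as many cells as #dma-cells, are stored in the output argument dma_spec.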
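The index pairing between “dma-names” and “dmas” can also be modeled with two parallel arrays. This is a user-space sketch under the assumption that #dma-cells is 1; the struct and function names are hypothetical, and no device tree parsing is involved.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical flattened view of one "dmas" entry with #dma-cells = <1>. */
struct model_dma_spec {
	const char *controller;   /* label of the referenced controller node */
	unsigned int cell;        /* the single specifier cell */
};

/*
 * Return 0 and fill *out when names[index] equals name, or -1 otherwise.
 * names[] and specs[] stand for "dma-names" and "dmas" and share indices.
 */
int model_match_channel(const char *const *names,
			const struct model_dma_spec *specs, size_t count,
			const char *name, size_t index,
			struct model_dma_spec *out)
{
	if (index >= count)
		return -1;
	if (strcmp(name, names[index]) != 0)
		return -1;
	*out = specs[index];
	return 0;
}
```

For the spi0 node shown earlier, asking for “rx” at index 1 yields the dmac_peri controller with cell value 11, whereas asking for “rx” at index 0 fails because that slot is named “tx”.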

of_dma_find_controller()

drivers/dma/of-dma.c

/**
 * of_dma_find_controller - Get a DMA controller in DT DMA helpers list
 * @dma_spec:   pointer to DMA specifier as found in the device tree
 *
 * Finds a DMA controller with matching device node and number for dma cells
 * in a list of registered DMA controllers. If a match is found a valid pointer
 * to the DMA data stored is retuned. A NULL pointer is returned if no match is
 * found.
 */

static struct of_dma *of_dma_find_controller(struct of_phandle_args *dma_spec)
{
        struct of_dma *ofdma;

        list_for_each_entry(ofdma, &of_dma_list, of_dma_controllers)
                if (ofdma->of_node == dma_spec->np)
                        return ofdma;

        pr_debug("%s: can't find DMA controller %pOF\n", __func__,
                 dma_spec->np);

        return NULL;
}

Searches the DMA controllers registered on the DMA controller list for the one whose device node is the node that @dma_spec points to, and returns its of_dma. Returns NULL on failure.

Structures

dma_device structure

include/linux/dmaengine.h

/**
 * struct dma_device - info on the entity supplying DMA services
 * @chancnt: how many DMA channels are supported
 * @privatecnt: how many DMA channels are requested by dma_request_channel
 * @channels: the list of struct dma_chan
 * @global_node: list_head for global dma_device_list
 * @filter: information for device/slave to filter function/param mapping
 * @cap_mask: one or more dma_capability flags
 * @max_xor: maximum number of xor sources, 0 if no capability
 * @max_pq: maximum number of PQ sources and PQ-continue capability
 * @copy_align: alignment shift for memcpy operations
 * @xor_align: alignment shift for xor operations
 * @pq_align: alignment shift for pq operations
 * @fill_align: alignment shift for memset operations
 * @dev_id: unique device ID
 * @dev: struct device reference for dma mapping api
 * @src_addr_widths: bit mask of src addr widths the device supports
 *      Width is specified in bytes, e.g. for a device supporting
 *      a width of 4 the mask should have BIT(4) set.
 * @dst_addr_widths: bit mask of dst addr widths the device supports
 * @directions: bit mask of slave directions the device supports.
 *      Since the enum dma_transfer_direction is not defined as bit flag for
 *      each type, the dma controller should set BIT(<TYPE>) and same
 *      should be checked by controller as well
 * @max_burst: max burst capability per-transfer
 * @residue_granularity: granularity of the transfer residue reported
 *      by tx_status
 * @device_alloc_chan_resources: allocate resources and return the
 *      number of allocated descriptors
 * @device_free_chan_resources: release DMA channel's resources
 * @device_prep_dma_memcpy: prepares a memcpy operation
 * @device_prep_dma_xor: prepares a xor operation
 * @device_prep_dma_xor_val: prepares a xor validation operation
 * @device_prep_dma_pq: prepares a pq operation
 * @device_prep_dma_pq_val: prepares a pqzero_sum operation
 * @device_prep_dma_memset: prepares a memset operation
 * @device_prep_dma_memset_sg: prepares a memset operation over a scatter list
 * @device_prep_dma_interrupt: prepares an end of chain interrupt operation
 * @device_prep_slave_sg: prepares a slave dma operation
 * @device_prep_dma_cyclic: prepare a cyclic dma operation suitable for audio.
 *      The function takes a buffer of size buf_len. The callback function will
 *      be called after period_len bytes have been transferred.
 * @device_prep_interleaved_dma: Transfer expression in a generic way.
 * @device_prep_dma_imm_data: DMA's 8 byte immediate data to the dst address
 * @device_config: Pushes a new configuration to a channel, return 0 or an error
 *      code
 * @device_pause: Pauses any transfer happening on a channel. Returns
 *      0 or an error code
 * @device_resume: Resumes any transfer on a channel previously
 *      paused. Returns 0 or an error code
 * @device_terminate_all: Aborts all transfers on a channel. Returns 0
 *      or an error code
 * @device_synchronize: Synchronizes the termination of a transfers to the
 *  current context.
 * @device_tx_status: poll for transaction completion, the optional
 *      txstate parameter can be supplied with a pointer to get a
 *      struct with auxiliary transfer status information, otherwise the call
 *      will just return a simple status code
 * @device_issue_pending: push pending transactions to hardware
 * @descriptor_reuse: a submitted transfer can be resubmitted after completion
 */

struct dma_device {

        unsigned int chancnt;
        unsigned int privatecnt;
        struct list_head channels;
        struct list_head global_node;
        struct dma_filter filter;
        dma_cap_mask_t  cap_mask;
        unsigned short max_xor;
        unsigned short max_pq;
        enum dmaengine_alignment copy_align;
        enum dmaengine_alignment xor_align;
        enum dmaengine_alignment pq_align;
        enum dmaengine_alignment fill_align;
        #define DMA_HAS_PQ_CONTINUE (1 << 15)

        int dev_id;
        struct device *dev;

        u32 src_addr_widths;
        u32 dst_addr_widths;
        u32 directions;
        u32 max_burst;
        bool descriptor_reuse;
        enum dma_residue_granularity residue_granularity;

        int (*device_alloc_chan_resources)(struct dma_chan *chan);
        void (*device_free_chan_resources)(struct dma_chan *chan);

        struct dma_async_tx_descriptor *(*device_prep_dma_memcpy)(
                struct dma_chan *chan, dma_addr_t dst, dma_addr_t src,
                size_t len, unsigned long flags);
        struct dma_async_tx_descriptor *(*device_prep_dma_xor)(
                struct dma_chan *chan, dma_addr_t dst, dma_addr_t *src,
                unsigned int src_cnt, size_t len, unsigned long flags);
        struct dma_async_tx_descriptor *(*device_prep_dma_xor_val)(
                struct dma_chan *chan, dma_addr_t *src, unsigned int src_cnt,
                size_t len, enum sum_check_flags *result, unsigned long flags);
        struct dma_async_tx_descriptor *(*device_prep_dma_pq)(
                struct dma_chan *chan, dma_addr_t *dst, dma_addr_t *src,
                unsigned int src_cnt, const unsigned char *scf,
                size_t len, unsigned long flags);
        struct dma_async_tx_descriptor *(*device_prep_dma_pq_val)(
                struct dma_chan *chan, dma_addr_t *pq, dma_addr_t *src,
                unsigned int src_cnt, const unsigned char *scf, size_t len,
                enum sum_check_flags *pqres, unsigned long flags);
        struct dma_async_tx_descriptor *(*device_prep_dma_memset)(
                struct dma_chan *chan, dma_addr_t dest, int value, size_t len,
                unsigned long flags);
        struct dma_async_tx_descriptor *(*device_prep_dma_memset_sg)(
                struct dma_chan *chan, struct scatterlist *sg,
                unsigned int nents, int value, unsigned long flags);
        struct dma_async_tx_descriptor *(*device_prep_dma_interrupt)(
                struct dma_chan *chan, unsigned long flags);

        struct dma_async_tx_descriptor *(*device_prep_slave_sg)(
                struct dma_chan *chan, struct scatterlist *sgl,
                unsigned int sg_len, enum dma_transfer_direction direction,
                unsigned long flags, void *context);
        struct dma_async_tx_descriptor *(*device_prep_dma_cyclic)(
                struct dma_chan *chan, dma_addr_t buf_addr, size_t buf_len,
                size_t period_len, enum dma_transfer_direction direction,
                unsigned long flags);
        struct dma_async_tx_descriptor *(*device_prep_interleaved_dma)(
                struct dma_chan *chan, struct dma_interleaved_template *xt,
                unsigned long flags);
        struct dma_async_tx_descriptor *(*device_prep_dma_imm_data)(
                struct dma_chan *chan, dma_addr_t dst, u64 data,
                unsigned long flags);

        int (*device_config)(struct dma_chan *chan,
                             struct dma_slave_config *config);
        int (*device_pause)(struct dma_chan *chan);
        int (*device_resume)(struct dma_chan *chan);
        int (*device_terminate_all)(struct dma_chan *chan);
        void (*device_synchronize)(struct dma_chan *chan);

        enum dma_status (*device_tx_status)(struct dma_chan *chan,
                                            dma_cookie_t cookie,
                                            struct dma_tx_state *txstate);
        void (*device_issue_pending)(struct dma_chan *chan);
};

A structure that holds the DMA host controller information and operations.

chancnt
- Number of DMA channels supported
privatecnt
- Number of DMA channels requested through dma_request_channel()
channels
- List on which the DMA channels are registered. (dma_chan structures are linked here.)
global_node
- Node used to link this device onto the global list of DMA host controllers.
filter
- filter function/param mapping
cap_mask
- Expresses the dma capability flags.
- pl330 e.g.)
  - BIT(DMA_MEMCPY) |
  - BIT(DMA_SLAVE) |
  - BIT(DMA_CYCLIC) |
  - BIT(DMA_PRIVATE)
max_xor, max_pq
- Maximum number of xor and pq sources (0 if not supported)
copy_align, xor_align, pq_align, fill_align
- Alignment required for memcpy, xor, pq and memset operations. It is stored as a shift (log2 of the byte count) using enum dmaengine_alignment:
  - DMAENGINE_ALIGN_1_BYTE(0)
  - DMAENGINE_ALIGN_2_BYTES(1)
  - DMAENGINE_ALIGN_4_BYTES(2)
  - DMAENGINE_ALIGN_8_BYTES(3)
  - DMAENGINE_ALIGN_16_BYTES(4)
  - DMAENGINE_ALIGN_32_BYTES(5)
  - DMAENGINE_ALIGN_64_BYTES(6)
dev_id
- Unique id of the device
dev
- Device that represents the DMA host controller
src_addr_widths
- Bitmask of the source address widths that are supported.
- pl330 e.g.) 0x117
  - BIT(DMA_SLAVE_BUSWIDTH_UNDEFINED) |
  - BIT(DMA_SLAVE_BUSWIDTH_1_BYTE) |
  - BIT(DMA_SLAVE_BUSWIDTH_2_BYTES) |
  - BIT(DMA_SLAVE_BUSWIDTH_4_BYTES) |
  - BIT(DMA_SLAVE_BUSWIDTH_8_BYTES)
dst_addr_widths
- Bitmask of the destination address widths that are supported.
directions
- Bitmask of the DMA directions that are supported. Each direction is set as BIT(<TYPE>).
  - DMA_MEM_TO_MEM(0)
  - DMA_MEM_TO_DEV(1)
  - DMA_DEV_TO_MEM(2)
  - DMA_DEV_TO_DEV(3)
  - DMA_TRANS_NONE(4)
- pl330 e.g.) BIT(DMA_DEV_TO_MEM) | BIT(DMA_MEM_TO_DEV)
max_burst
- Maximum burst capability per transfer
- pl330 e.g.) 16
  - If the “arm,pl330-broken-no-flushp” property is used in the device tree, the maximum burst is limited to 1.
  - On some Rockchip chips that use pl330 (rk3368), a hardware bug means that only 1 can be used.
descriptor_reuse
- Indicates whether a descriptor used for a transfer can be reused after the transfer.
residue_granularity
- Granularity of the transfer residue reported by tx_status (unit size: descriptor > segment > burst)
  - DMA_RESIDUE_GRANULARITY_DESCRIPTOR(0)
    - Residue reporting is not supported. Only whether the descriptor has completed is known, so dma_tx_state.residue is always 0.
  - DMA_RESIDUE_GRANULARITY_SEGMENT(1)
    - In cyclic transfers, the residue is updated each time a period completes successfully.
    - In scatter-gather transfers, the residue is updated each time a segment completes.
  - DMA_RESIDUE_GRANULARITY_BURST(2)
    - The residue is updated each time a burst transfer completes.
- pl330 e.g.) DMA_RESIDUE_GRANULARITY_BURST(2) is used.
(*device_alloc_chan_resources)
- Callback that allocates the resources of a DMA channel and returns the number of descriptors allocated.
(*device_free_chan_resources)
- Callback that releases the resources of a DMA channel.
(*device_prep_*)
- Callbacks that prepare each kind of DMA transfer operation.
(*device_config)
- Callback that applies a new configuration to a channel.
(*device_pause)
- Callback that pauses a transfer.
(*device_resume)
- Callback that resumes a paused transfer. (sleepable)
(*device_terminate_all)
- Callback that aborts all transfers on one channel. (atomic)
(*device_synchronize)
- Callback that waits until the termination of the current DMA transfers has finished. (sleepable)
(*device_tx_status)
- Callback used to query the completion status of a DMA transaction.
(*device_issue_pending)
- Callback that pushes pending transactions to the h/w.
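The pl330 mask values quoted in the list above can be recomputed by hand. The following is a user-space sketch with hypothetical names; the bus width constants (width in bytes used as the bit number) and the direction constants are written out locally on the assumption that they match include/linux/dmaengine.h.

```c
#include <stdint.h>

#define MODEL_BIT(n) (UINT32_C(1) << (n))

/* Bus widths: the enum value is the width in bytes (assumed, see lead-in). */
enum {
	MODEL_BUSWIDTH_UNDEFINED = 0,
	MODEL_BUSWIDTH_1_BYTE = 1,
	MODEL_BUSWIDTH_2_BYTES = 2,
	MODEL_BUSWIDTH_4_BYTES = 4,
	MODEL_BUSWIDTH_8_BYTES = 8,
};

/* Transfer directions, numbered as in the directions list. */
enum {
	MODEL_MEM_TO_MEM = 0,
	MODEL_MEM_TO_DEV = 1,
	MODEL_DEV_TO_MEM = 2,
};

/* Address width mask listed for pl330. */
uint32_t model_pl330_addr_widths(void)
{
	return MODEL_BIT(MODEL_BUSWIDTH_UNDEFINED) |
	       MODEL_BIT(MODEL_BUSWIDTH_1_BYTE) |
	       MODEL_BIT(MODEL_BUSWIDTH_2_BYTES) |
	       MODEL_BIT(MODEL_BUSWIDTH_4_BYTES) |
	       MODEL_BIT(MODEL_BUSWIDTH_8_BYTES);
}

/* Direction mask listed for pl330. */
uint32_t model_pl330_directions(void)
{
	return MODEL_BIT(MODEL_DEV_TO_MEM) | MODEL_BIT(MODEL_MEM_TO_DEV);
}

/* Check whether a width in bytes is advertised in an address width mask. */
int model_width_supported(uint32_t mask, unsigned int width_bytes)
{
	return width_bytes < 32 && (mask & MODEL_BIT(width_bytes)) != 0;
}
```

Bits 0, 1, 2, 4 and 8 add up to 0x117, which matches the pl330 value above, and the direction mask evaluates to 0x6.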

dma_async_tx_descriptor structure

include/linux/dmaengine.h

/**
 * struct dma_async_tx_descriptor - async transaction descriptor
 * ---dma generic offload fields---
 * @cookie: tracking cookie for this transaction, set to -EBUSY if
 *      this tx is sitting on a dependency list
 * @flags: flags to augment operation preparation, control completion, and
 *      communicate status
 * @phys: physical address of the descriptor
 * @chan: target channel for this operation
 * @tx_submit: accept the descriptor, assign ordered cookie and mark the
 * descriptor pending. To be pushed on .issue_pending() call
 * @callback: routine to call after this operation is complete
 * @callback_param: general parameter to pass to the callback routine
 * ---async_tx api specific fields---
 * @next: at completion submit this descriptor
 * @parent: pointer to the next level up in the dependency chain
 * @lock: protect the parent and next pointers
 */

struct dma_async_tx_descriptor {
        dma_cookie_t cookie;
        enum dma_ctrl_flags flags; /* not a 'long' to pack with cookie */
        dma_addr_t phys;
        struct dma_chan *chan;
        dma_cookie_t (*tx_submit)(struct dma_async_tx_descriptor *tx);
        int (*desc_free)(struct dma_async_tx_descriptor *tx);
        dma_async_tx_callback callback;
        dma_async_tx_callback_result callback_result;
        void *callback_param;
        struct dmaengine_unmap_data *unmap;
#ifdef CONFIG_ASYNC_TX_ENABLE_CHANNEL_SWITCH
        struct dma_async_tx_descriptor *next;
        struct dma_async_tx_descriptor *parent;
        spinlock_t lock;
#endif
};

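Descriptor structure for an asynchronous DMA transaction (tx).

cookie
- Cookie id used to track the transaction
flags
- Flags that augment operation preparation, control completion and communicate status
phys
- Physical address of the descriptor
*chan
- Points to the DMA channel to use.
(*tx_submit)
- Callback that accepts the descriptor, assigns it an ordered cookie and marks it pending. The push to hardware happens on the issue_pending call.
(*desc_free)
- Callback that frees the descriptor after the transfer.
callback
- Callback invoked when the operation completes
callback_result
- Callback invoked when the operation completes. It also receives the result of the transfer.
- When this callback is set, it is called instead of callback above.
*callback_param
- General parameter passed to the callback routine
*unmap
- Points to a dmaengine_unmap_data structure.
*next
- Descriptor to submit when this one completes. (next, parent and lock exist only with CONFIG_ASYNC_TX_ENABLE_CHANNEL_SWITCH.)
*parent
- Points to the parent descriptor in the dependency chain.
lock
- Lock that protects the parent and next pointers.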
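The precedence between callback and callback_result described above can be modeled as follows. This is a user-space sketch, not the kernel's completion path: all `model_*` names are hypothetical and the result struct is only a placeholder for the kernel's result type.

```c
#include <stddef.h>

struct model_result { int status; };   /* placeholder result payload */

typedef void (*model_cb)(void *param);
typedef void (*model_cb_result)(void *param, const struct model_result *res);

struct model_desc {
	model_cb callback;
	model_cb_result callback_result;
	void *callback_param;
};

/*
 * Invoke the completion callback of a descriptor.
 * Returns 2 if callback_result ran, 1 if callback ran, 0 if neither is set.
 */
int model_complete(const struct model_desc *d, const struct model_result *res)
{
	if (d->callback_result) {
		d->callback_result(d->callback_param, res);
		return 2;
	}
	if (d->callback) {
		d->callback(d->callback_param);
		return 1;
	}
	return 0;
}

/* Example callbacks: each bumps one slot of an int[2] passed as param. */
void model_plain_cb(void *param)
{
	((int *)param)[0]++;
}

void model_result_cb(void *param, const struct model_result *res)
{
	(void)res;
	((int *)param)[1]++;
}
```

When both pointers are set, only the result-aware callback runs, so a client that wants the result needs to set just callback_result.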

dma_slave_config structure

include/linux/dmaengine.h

/**
 * struct dma_slave_config - dma slave channel runtime config
 * @direction: whether the data shall go in or out on this slave
 * channel, right now. DMA_MEM_TO_DEV and DMA_DEV_TO_MEM are
 * legal values. DEPRECATED, drivers should use the direction argument
 * to the device_prep_slave_sg and device_prep_dma_cyclic functions or
 * the dir field in the dma_interleaved_template structure.
 * @src_addr: this is the physical address where DMA slave data
 * should be read (RX), if the source is memory this argument is
 * ignored.
 * @dst_addr: this is the physical address where DMA slave data
 * should be written (TX), if the source is memory this argument
 * is ignored.
 * @src_addr_width: this is the width in bytes of the source (RX)
 * register where DMA data shall be read. If the source
 * is memory this may be ignored depending on architecture.
 * Legal values: 1, 2, 3, 4, 8, 16, 32, 64.
 * @dst_addr_width: same as src_addr_width but for destination
 * target (TX) mutatis mutandis.
 * @src_maxburst: the maximum number of words (note: words, as in
 * units of the src_addr_width member, not bytes) that can be sent
 * in one burst to the device. Typically something like half the
 * FIFO depth on I/O peripherals so you don't overflow it. This
 * may or may not be applicable on memory sources.
 * @dst_maxburst: same as src_maxburst but for destination target
 * mutatis mutandis.
 * @src_port_window_size: The length of the register area in words the data need
 * to be accessed on the device side. It is only used for devices which is using
 * an area instead of a single register to receive the data. Typically the DMA
 * loops in this area in order to transfer the data.
 * @dst_port_window_size: same as src_port_window_size but for the destination
 * port.
 * @device_fc: Flow Controller Settings. Only valid for slave channels. Fill
 * with 'true' if peripheral should be flow controller. Direction will be
 * selected at Runtime.
 * @slave_id: Slave requester id. Only valid for slave channels. The dma
 * slave peripheral will have unique id as dma requester which need to be
 * pass as slave config.
 *
 * This struct is passed in as configuration data to a DMA engine
 * in order to set up a certain channel for DMA transport at runtime.
 * The DMA device/engine has to provide support for an additional
 * callback in the dma_device structure, device_config and this struct
 * will then be passed in as an argument to the function.
 *
 * The rationale for adding configuration information to this struct is as
 * follows: if it is likely that more than one DMA slave controllers in
 * the world will support the configuration option, then make it generic.
 * If not: if it is fixed so that it be sent in static from the platform
 * data, then prefer to do that.
 */

struct dma_slave_config {
        enum dma_transfer_direction direction;
        phys_addr_t src_addr;
        phys_addr_t dst_addr;
        enum dma_slave_buswidth src_addr_width;
        enum dma_slave_buswidth dst_addr_width;
        u32 src_maxburst;
        u32 dst_maxburst;
        u32 src_port_window_size;
        u32 dst_port_window_size;
        bool device_fc;
        unsigned int slave_id;
};

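A structure that holds the configuration for DMA slave transfers.

direction
- Direction of the transfer on this slave channel. It is a single enum dma_transfer_direction value, not a bitmask.
  - DMA_MEM_TO_MEM(0)
  - DMA_MEM_TO_DEV(1)
  - DMA_DEV_TO_MEM(2)
  - DMA_DEV_TO_DEV(3)
  - DMA_TRANS_NONE(4)
- Only DMA_MEM_TO_DEV and DMA_DEV_TO_MEM are legal values. The field is deprecated, and drivers should use the direction argument of device_prep_slave_sg and device_prep_dma_cyclic instead.
src_addr
- Physical address from which the DMA slave data is read (RX). Ignored if the source is memory.
dst_addr
- Physical address to which the DMA slave data is written (TX). Ignored if the source is memory.
src_addr_width
- Width of the data read at one time through src_addr.
- e.g.) DMA_SLAVE_BUSWIDTH_4_BYTES
dst_addr_width
- Width of the data written at one time through dst_addr.
src_maxburst
- Maximum number of words read in one burst through src_addr. The unit is src_addr_width, not bytes.
dst_maxburst
- Maximum number of words written in one burst through dst_addr. The unit is dst_addr_width, not bytes.
src_port_window_size
- Size, in words, of the register area to read by DMA. Used only for devices that transfer through an area instead of a single register.
- e.g.) 8
dst_port_window_size
- Size, in words, of the register area to write by DMA. Used only for devices that transfer through an area instead of a single register.
device_fc
- Set to true if the peripheral (slave device) should be the flow controller. Valid only for slave channels.
slave_id
- Slave requester id. Valid only for slave channels.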
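Because the maxburst fields count words rather than bytes, the byte size of one burst depends on the address width. The sketch below uses hypothetical helper names to show that arithmetic, together with the "half the FIFO depth" rule of thumb from the structure comment.

```c
#include <stdint.h>

/* Bytes moved by one burst: width of one word times words per burst. */
uint32_t model_burst_bytes(uint32_t addr_width_bytes, uint32_t maxburst_words)
{
	return addr_width_bytes * maxburst_words;
}

/*
 * Rule-of-thumb maxburst for an I/O peripheral: half the FIFO depth,
 * but never less than one word.
 */
uint32_t model_maxburst_for_fifo(uint32_t fifo_depth_words)
{
	uint32_t half = fifo_depth_words / 2;

	return half ? half : 1;
}
```

For example, a 4-byte wide register with maxburst 8 moves 32 bytes per burst, and a FIFO that is 32 words deep would suggest a maxburst of 16 under this rule.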

dma_chan structure

include/linux/dmaengine.h

/**
 * struct dma_chan - devices supply DMA channels, clients use them
 * @device: ptr to the dma device who supplies this channel, always !%NULL
 * @cookie: last cookie value returned to client
 * @completed_cookie: last completed cookie for this channel
 * @chan_id: channel ID for sysfs
 * @dev: class device for sysfs
 * @device_node: used to add this to the device chan list
 * @local: per-cpu pointer to a struct dma_chan_percpu
 * @client_count: how many clients are using this channel
 * @table_count: number of appearances in the mem-to-mem allocation table
 * @router: pointer to the DMA router structure
 * @route_data: channel specific data for the router
 * @private: private data for certain client-channel associations
 */

struct dma_chan {
        struct dma_device *device;
        dma_cookie_t cookie;
        dma_cookie_t completed_cookie;

        /* sysfs */
        int chan_id;
        struct dma_chan_dev *dev;

        struct list_head device_node;
        struct dma_chan_percpu __percpu *local;
        int client_count;
        int table_count;

        /* DMA router */
        struct dma_router *router;
        void *route_data;

        void *private;
};

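A structure that holds the DMA channel information.

device
- Points to the DMA controller. (dma_device)
cookie
- Last cookie value returned to the client
completed_cookie
- Last completed cookie value for this channel
chan_id
- Channel id for sysfs
dev
- Class device for sysfs
device_node
- Node used to add this channel to the channel list of the device.
*local
- per-cpu pointer to a dma_chan_percpu structure
client_count
- Counter of how many clients are using this channel
table_count
- Number of appearances in the mem-to-mem allocation table
*router
- Pointer to the dma router
*route_data
- Channel specific data for the router
*private
- private data for certain client-channel associations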
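The cookie and completed_cookie fields are what allow a client to tell whether a given transaction has finished. The user-space sketch below models the comparison on the kernel's dma_async_is_complete() helper as recalled here, so treat the exact logic as an assumption; all `model_*` names are hypothetical.

```c
#include <stdint.h>

typedef int32_t model_cookie_t;

enum model_status { MODEL_COMPLETE, MODEL_IN_PROGRESS };

/*
 * last_complete plays dma_chan.completed_cookie and last_used plays
 * dma_chan.cookie. The second branch covers the case where the issued
 * cookie counter has wrapped around while completions have not yet.
 */
enum model_status model_is_complete(model_cookie_t cookie,
				    model_cookie_t last_complete,
				    model_cookie_t last_used)
{
	if (last_complete <= last_used) {
		if (cookie <= last_complete || cookie > last_used)
			return MODEL_COMPLETE;
	} else {
		if (cookie <= last_complete && cookie > last_used)
			return MODEL_COMPLETE;
	}
	return MODEL_IN_PROGRESS;
}
```

With completed_cookie = 5 and cookie = 8, transactions 6 to 8 are still in progress and everything up to 5 counts as complete.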

dma_ctrl_flags enum

include/linux/dmaengine.h

/**
 * enum dma_ctrl_flags - DMA flags to augment operation preparation,
 *  control completion, and communicate status.
 * @DMA_PREP_INTERRUPT - trigger an interrupt (callback) upon completion of
 *  this transaction
 * @DMA_CTRL_ACK - if clear, the descriptor cannot be reused until the client
 *  acknowledges receipt, i.e. has has a chance to establish any dependency
 *  chains
 * @DMA_PREP_PQ_DISABLE_P - prevent generation of P while generating Q
 * @DMA_PREP_PQ_DISABLE_Q - prevent generation of Q while generating P
 * @DMA_PREP_CONTINUE - indicate to a driver that it is reusing buffers as
 *  sources that were the result of a previous operation, in the case of a PQ
 *  operation it continues the calculation with new sources
 * @DMA_PREP_FENCE - tell the driver that subsequent operations depend
 *  on the result of this operation
 * @DMA_CTRL_REUSE: client can reuse the descriptor and submit again till
 *  cleared or freed
 * @DMA_PREP_CMD: tell the driver that the data passed to DMA API is command
 *  data and the descriptor should be in different format from normal
 *  data descriptors.
 */

enum dma_ctrl_flags {
        DMA_PREP_INTERRUPT = (1 << 0),
        DMA_CTRL_ACK = (1 << 1),
        DMA_PREP_PQ_DISABLE_P = (1 << 2),
        DMA_PREP_PQ_DISABLE_Q = (1 << 3),
        DMA_PREP_CONTINUE = (1 << 4),
        DMA_PREP_FENCE = (1 << 5),
        DMA_CTRL_REUSE = (1 << 6),
        DMA_PREP_CMD = (1 << 7),
};

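Flags that are passed to the dma engine and the DMA controller.

DMA_PREP_INTERRUPT
- Triggers an interrupt (callback) when the transaction completes
DMA_CTRL_ACK
- If this flag is clear, the descriptor cannot be reused until the client acknowledges receipt.
- That is, the client gets a chance to establish dependency chains.
DMA_PREP_PQ_DISABLE_P
- Prevents generation of P while generating Q. (rarely used)
DMA_PREP_PQ_DISABLE_Q
- Prevents generation of Q while generating P. (used by ppc4xx and ioat)
DMA_PREP_CONTINUE
- Tells the driver that buffers holding the result of a previous operation are reused as sources. (used by fsl and bcm-sba)
DMA_PREP_FENCE
- Tells the driver that subsequent operations depend on the result of this operation.
DMA_CTRL_REUSE
- Lets the client reuse the descriptor and submit it again until the flag is cleared or the descriptor is freed.
DMA_PREP_CMD
- Tells the driver that command data, not normal data, is being passed. (used by Qualcomm bam-dma)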
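Since each flag occupies its own bit, clients combine them with bitwise OR. The sketch below copies a subset of the bit positions from the enum above into hypothetical MODEL_* constants and shows how a combination is built and tested; which flags a real client passes depends on the driver.

```c
#include <stdbool.h>

/* Bit positions copied from enum dma_ctrl_flags above. */
enum model_ctrl_flags {
	MODEL_PREP_INTERRUPT = 1 << 0,
	MODEL_CTRL_ACK       = 1 << 1,
	MODEL_PREP_FENCE     = 1 << 5,
	MODEL_CTRL_REUSE     = 1 << 6,
};

/* One plausible combination: completion interrupt plus ack. */
unsigned long model_slave_flags(void)
{
	return MODEL_PREP_INTERRUPT | MODEL_CTRL_ACK;
}

/* True when every bit of flag is present in flags. */
bool model_flag_set(unsigned long flags, unsigned long flag)
{
	return (flags & flag) == flag;
}
```

The combination above evaluates to 0x3, and adding the reuse bit gives 0x43.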

of_dma structure

include/linux/of_dma.h

struct of_dma {
        struct list_head        of_dma_controllers;
        struct device_node      *of_node;
        struct dma_chan         *(*of_dma_xlate)
                                (struct of_phandle_args *, struct of_dma *);
        void                    *(*of_dma_route_allocate)
                                (struct of_phandle_args *, struct of_dma *);
        struct dma_router       *dma_router;
        void                    *of_dma_data;
};

Holds the node configuration information of a DMA controller in the device tree.

of_dma_controllers
- Used when this entry is added to the global of_dma_list.
*of_node
- Points to the device node.
(*of_dma_xlate)
- Callback that parses the arguments given when a slave device refers to this controller through a phandle.
(*of_dma_route_allocate)
- Callback used for dma mux allocation.
*dma_router
- Points to the dma router.
of_dma_data
- Private data stored as a void pointer.
- e.g. of_dma_filter_info

References

DMA -1- (Basic) | 문c
DMA -2- (DMA Coherent Memory) | 문c
DMA -3- (DMA Pool) | 문c
DMA -4- (DMA Mapping) | 문c
DMA -5- (IOMMU) | 문c
DMA -6- (DMAEngine Subsystem) | 문c – this post
IOMMU | 문c

DMAEngine documentation | Kernel.org
An Overview of the DMAEngine Subsystem (2015) | Free Electrons -> Bootlin – PDF download
STM32H7 DMA MUX – PDF download
PrimeCell® DMA Controller (PL330) | ARM – PDF download
AXI DMA v7.1 | Xilinx – PDF download
PCI Express DMA Reference Design Using External Memory | Intel – PDF download
MSC8144 PCI Example Software | NXP – PDF download

RCU(Read Copy Update) -7- (Preemptible RCU)

2021-03-19 2021-04-01 문영일

rcu read-side critical section 에서 preemption이 가능한 모델을 사용하려면 CONFIG_PREEMPT_RCU 커널 옵션이 설정되어야 하는데, preemptible 커널이 선택되는 경우 함께 디폴트로 설정된다.

QS 체크, 기록 및 보고

preemptible rcu를 사용하여 read-side critical section 내에서 preemption 되어 블럭된 태스크가 있을 때와 없을 때의 qs 체크 후 qs 기록 및 qs 보고가 어떻게 다른지 알아본다.

qs 체크
- qs 상태인지 확인하는 과정
qs 기록
- qs 상태가 인지되어 qs를 해당 cpu의 rcu_data에 기록
qs 보고
- 해당 cpu의 rcu_data에 기록된 qs를 노드(rcu_node) 및 글로벌(rcu_state)에 보고

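When no blocked task exists

After a GP starts, a QS check is attempted on every scheduler tick. The result depends on whether the tick lands inside or outside an RCU read-side critical section:

Inside: the t->rcu_read_lock_nesting counter has been incremented, so no QS is recorded.
Outside: t->rcu_read_lock_nesting is 0, so a QS is recorded.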
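The inside/outside rule can be sketched in userspace with a plain nesting counter. The tk_* names are hypothetical, and the sketch deliberately ignores the bh/preempt-disable and deferred QS conditions that the real tick handler also checks.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-task and per-CPU state for this sketch. */
struct tk_task { int rcu_read_lock_nesting; };
struct tk_cpu  { bool qs_recorded; };

void tk_read_lock(struct tk_task *t)   { t->rcu_read_lock_nesting++; }
void tk_read_unlock(struct tk_task *t) { t->rcu_read_lock_nesting--; }

/* Tick-time check: record a QS only when the task is outside every reader. */
void tk_tick(const struct tk_task *t, struct tk_cpu *cpu)
{
    if (t->rcu_read_lock_nesting == 0)
        cpu->qs_recorded = true;
}
```

A tick taken between tk_read_lock() and tk_read_unlock() leaves qs_recorded untouched; the first tick after the unlock sets it.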

A recorded QS is then reported in stages, rdp (rcu_data) -> rnp (rcu_node) -> rsp (rcu_state). The report is usually made from the RCU core, which also processes callbacks, through the rcu_check_quiescent_state() function.
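The staged report can be pictured as clearing bits in a small tree of masks. The following is a simplified userspace sketch under that assumption; the rp_* structures are hypothetical and omit the locking, sequence numbers, and blocked-task checks of the real rcu_node tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct rp_node {
    unsigned long qsmask;     /* children that still owe a QS */
    unsigned long grpmask;    /* this node's bit in its parent */
    struct rp_node *parent;   /* NULL for the root */
};

struct rp_rdp {
    bool qs_recorded;         /* set by the QS record step */
    unsigned long grpmask;    /* this CPU's bit in its leaf node */
    struct rp_node *mynode;
};

/* Report a recorded QS upward. Returns true when the root mask becomes
 * empty, meaning every CPU has reported and the GP may end. */
bool rp_report_qs(struct rp_rdp *rdp)
{
    struct rp_node *rnp = rdp->mynode;
    unsigned long mask = rdp->grpmask;

    if (!rdp->qs_recorded)
        return false;         /* nothing to report yet */
    rdp->qs_recorded = false;
    for (;;) {
        rnp->qsmask &= ~mask;
        if (rnp->qsmask != 0)
            return false;     /* siblings are still pending */
        if (rnp->parent == NULL)
            return true;      /* root is clear */
        mask = rnp->grpmask;
        rnp = rnp->parent;
    }
}
```

A report climbs one level only when it clears the last pending bit of the level below, so most reports stop at the leaf node.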

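The following figure shows a scheduler tick checking for a QS. When no task is blocked and the tick lands outside any RCU read-side critical section, the QS is recorded, and the recorded QS is reported later.

When a new GP starts, all earlier QSs are discarded, so only the scheduler ticks after "current gp start" are of interest.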

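When a blocked task exists

If a task is preempted inside an RCU read-side critical section after a GP has started, the QS check for the CPU succeeds, but the task becomes a blocked task. When the task later resumes and calls rcu_read_unlock(), the rcu read unlock special case applies because the task was blocked. If irq/bh/preempt are all enabled at that point, the blocked task is unblocked and the QS is reported.

In the other case, the task becomes blocked and irq/bh/preempt is then disabled before rcu_read_unlock() runs. The rcu read unlock special case again applies because the task was blocked by preemption, but since irq/bh/preempt is disabled, the CPU enters the deferred QS state and the QS report is postponed. When the deferred QS state is later released, the blocked task is unblocked and the QS is reported.

Releasing a deferred QS
- On a scheduler tick, when the task is not nested in a reader and irq/bh/preempt are enabled
- In rcu_read_unlock(), when the outermost nesting level is left and irq/bh/preempt are enabled

The following figure shows, for the case where a blocked task exists, how the QS is recorded by the context switch and how the task is then unblocked and the QS reported in the unlock special path.

The following figure shows, for the case where a blocked task exists, how the QS is recorded by the context switch and how the task is then unblocked and the QS reported at the moment the deferred QS is released.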

preemptible RCU Read-side Critical Section

The following figure shows the relationship between the functions called from the following three places to manage tasks that are preempted inside an RCU read-side critical section under preemptible RCU.

rcu_read_unlock_special(), which handles the special case of rcu_read_unlock()
rcu_sched_clock_irq(), called from the scheduler tick
rcu_note_context_switch(), called on a context switch caused by preemption

rcu_read_lock()

include/linux/rcupdate.h

/**
 * rcu_read_lock() - mark the beginning of an RCU read-side critical section
 *
 * When synchronize_rcu() is invoked on one CPU while other CPUs
 * are within RCU read-side critical sections, then the
 * synchronize_rcu() is guaranteed to block until after all the other
 * CPUs exit their critical sections.  Similarly, if call_rcu() is invoked
 * on one CPU while other CPUs are within RCU read-side critical
 * sections, invocation of the corresponding RCU callback is deferred
 * until after the all the other CPUs exit their critical sections.
 *
 * Note, however, that RCU callbacks are permitted to run concurrently
 * with new RCU read-side critical sections.  One way that this can happen
 * is via the following sequence of events: (1) CPU 0 enters an RCU
 * read-side critical section, (2) CPU 1 invokes call_rcu() to register
 * an RCU callback, (3) CPU 0 exits the RCU read-side critical section,
 * (4) CPU 2 enters a RCU read-side critical section, (5) the RCU
 * callback is invoked.  This is legal, because the RCU read-side critical
 * section that was running concurrently with the call_rcu() (and which
 * therefore might be referencing something that the corresponding RCU
 * callback would free up) has completed before the corresponding
 * RCU callback is invoked.
 *
 * RCU read-side critical sections may be nested.  Any deferred actions
 * will be deferred until the outermost RCU read-side critical section
 * completes.
 *
 * You can avoid reading and understanding the next paragraph by
 * following this rule: don't put anything in an rcu_read_lock() RCU
 * read-side critical section that would block in a !PREEMPT kernel.
 * But if you want the full story, read on!
 *
 * In non-preemptible RCU implementations (TREE_RCU and TINY_RCU),
 * it is illegal to block while in an RCU read-side critical section.
 * In preemptible RCU implementations (PREEMPT_RCU) in CONFIG_PREEMPTION
 * kernel builds, RCU read-side critical sections may be preempted,
 * but explicit blocking is illegal.  Finally, in preemptible RCU
 * implementations in real-time (with -rt patchset) kernel builds, RCU
 * read-side critical sections may be preempted and they may also block, but
 * only when acquiring spinlocks that are subject to priority inheritance.
 */

static __always_inline void rcu_read_lock(void)
{
        __rcu_read_lock();
        __acquire(RCU);
        rcu_lock_acquire(&rcu_lock_map);
        RCU_LOCKDEP_WARN(!rcu_is_watching(),
                         "rcu_read_lock() used illegally while idle");
}

Marks the beginning of an RCU read-side critical section.

__rcu_read_lock()

kernel/rcu/tree_plugin.h

/*
 * Preemptible RCU implementation for rcu_read_lock().
 * Just increment ->rcu_read_lock_nesting, shared state will be updated
 * if we block.
 */

void __rcu_read_lock(void)
{
        current->rcu_read_lock_nesting++;
        if (IS_ENABLED(CONFIG_PROVE_LOCKING))
                WARN_ON_ONCE(current->rcu_read_lock_nesting > RCU_NEST_PMAX);
        barrier();  /* critical section after entry code. */
}
EXPORT_SYMBOL_GPL(__rcu_read_lock);

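Used with preemptible RCU; marks the beginning of an RCU read-side critical section.

Code line 3 increments the current task's rcu_read_lock_nesting counter by 1.
Code lines 4~5 are for lock debugging (CONFIG_PROVE_LOCKING): a warning is printed if the incremented counter exceeds RCU_NEST_PMAX (0x3fff_ffff).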

RCU_NEST_* constants

kernel/rcu/tree_plugin.h

/* Bias and limit values for ->rcu_read_lock_nesting. */
#define RCU_NEST_BIAS INT_MAX
#define RCU_NEST_NMAX (-INT_MAX / 2)
#define RCU_NEST_PMAX (INT_MAX / 2)

The following figure shows the values of the RCU_NEST_* constants.
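The constant values can also be checked with a few lines of userspace C. The three macros are copied from the listing above; nest_suspicious() is a hypothetical helper that combines the two CONFIG_PROVE_LOCKING range checks from __rcu_read_lock() and __rcu_read_unlock(), and the hexadecimal values assume a 32-bit int.

```c
#include <assert.h>
#include <limits.h>

/* Same definitions as in kernel/rcu/tree_plugin.h quoted above. */
#define RCU_NEST_BIAS INT_MAX
#define RCU_NEST_NMAX (-INT_MAX / 2)
#define RCU_NEST_PMAX (INT_MAX / 2)

/* Hypothetical helper: a nesting value is suspicious when it exceeds the
 * positive limit, or lies strictly between the negative limit and zero. */
int nest_suspicious(int nesting)
{
    return nesting > RCU_NEST_PMAX ||
           (nesting < 0 && nesting > RCU_NEST_NMAX);
}
```

With a 32-bit int, RCU_NEST_BIAS is 0x7fff_ffff, RCU_NEST_PMAX is 0x3fff_ffff, and RCU_NEST_NMAX is -1073741823. The transient value -RCU_NEST_BIAS lies below RCU_NEST_NMAX and so does not trigger the warning.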

rcu_read_unlock()

include/linux/rcupdate.h

/**
 * rcu_read_unlock() - marks the end of an RCU read-side critical section.
 *
 * In most situations, rcu_read_unlock() is immune from deadlock.
 * However, in kernels built with CONFIG_RCU_BOOST, rcu_read_unlock()
 * is responsible for deboosting, which it does via rt_mutex_unlock().
 * Unfortunately, this function acquires the scheduler's runqueue and
 * priority-inheritance spinlocks.  This means that deadlock could result
 * if the caller of rcu_read_unlock() already holds one of these locks or
 * any lock that is ever acquired while holding them.
 *
 * That said, RCU readers are never priority boosted unless they were
 * preempted.  Therefore, one way to avoid deadlock is to make sure
 * that preemption never happens within any RCU read-side critical
 * section whose outermost rcu_read_unlock() is called with one of
 * rt_mutex_unlock()'s locks held.  Such preemption can be avoided in
 * a number of ways, for example, by invoking preempt_disable() before
 * critical section's outermost rcu_read_lock().
 *
 * Given that the set of locks acquired by rt_mutex_unlock() might change
 * at any time, a somewhat more future-proofed approach is to make sure
 * that that preemption never happens within any RCU read-side critical
 * section whose outermost rcu_read_unlock() is called with irqs disabled.
 * This approach relies on the fact that rt_mutex_unlock() currently only
 * acquires irq-disabled locks.
 *
 * The second of these two approaches is best in most situations,
 * however, the first approach can also be useful, at least to those
 * developers willing to keep abreast of the set of locks acquired by
 * rt_mutex_unlock().
 *
 * See rcu_read_lock() for more information.
 */

static inline void rcu_read_unlock(void)
{
        RCU_LOCKDEP_WARN(!rcu_is_watching(),
                         "rcu_read_unlock() used illegally while idle");
        __release(RCU);
        __rcu_read_unlock();
        rcu_lock_release(&rcu_lock_map); /* Keep acq info for rls diags. */
}

Marks the end of an RCU read-side critical section.

__rcu_read_unlock()

kernel/rcu/tree_plugin.h

/*
 * Preemptible RCU implementation for rcu_read_unlock().
 * Decrement ->rcu_read_lock_nesting.  If the result is zero (outermost
 * rcu_read_unlock()) and ->rcu_read_unlock_special is non-zero, then
 * invoke rcu_read_unlock_special() to clean up after a context switch
 * in an RCU read-side critical section and other special cases.
 */

void __rcu_read_unlock(void)
{
        struct task_struct *t = current;

        if (t->rcu_read_lock_nesting != 1) {
                --t->rcu_read_lock_nesting;
        } else {
                barrier();  /* critical section before exit code. */
                t->rcu_read_lock_nesting = -RCU_NEST_BIAS;
                barrier();  /* assign before ->rcu_read_unlock_special load */
                if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
                        rcu_read_unlock_special(t);
                barrier();  /* ->rcu_read_unlock_special load before assign */
                t->rcu_read_lock_nesting = 0;
        }
        if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
                int rrln = t->rcu_read_lock_nesting;

                WARN_ON_ONCE(rrln < 0 && rrln > RCU_NEST_NMAX);
        }
}
EXPORT_SYMBOL_GPL(__rcu_read_unlock);

Used with preemptible RCU; marks the end of an RCU read-side critical section.

Code lines 3~6: if the current task's rcu_read_lock_nesting counter is not 1, it is decremented by 1.
- This is the nested case: rcu_read_lock() had already been called and was then called again inside it.
Code lines 7~15: if the counter is 1, it is temporarily set to -RCU_NEST_BIAS, the unlock special case is handled if one is detected, and the counter is then set to 0. The counter is therefore briefly negative.
- -RCU_NEST_BIAS = -0x7fff_ffff = 0x8000_0001 (-2,147,483,647)
- The special case is detected at the last (non-nested) rcu_read_unlock() by examining the value of t->rcu_read_unlock_special.s.
  - Four special bits are used in total: blocked, need_qs, exp_hint, and deferred_qs.
Code lines 16~20 are for lock debugging: a warning is printed if the decremented counter has underflowed, that is, if it is negative but greater than RCU_NEST_NMAX (-1073741822 to -1).
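The nesting logic of the lock/unlock pair can be modelled in userspace as below. This is a sketch only: the nl_* names are hypothetical, the compiler barriers and READ_ONCE() of the original are omitted, and the special path merely records the counter value it observes.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define NL_NEST_BIAS INT_MAX

struct nl_task {
    int nesting;         /* models ->rcu_read_lock_nesting */
    bool special;        /* models a nonzero ->rcu_read_unlock_special.s */
    int special_calls;   /* how many times the special path ran */
    int nesting_seen;    /* counter value observed inside the special path */
};

/* Stand-in for rcu_read_unlock_special(). */
void nl_unlock_special(struct nl_task *t)
{
    t->special_calls++;
    t->nesting_seen = t->nesting;   /* negative while the unlock is in progress */
    t->special = false;
}

void nl_read_lock(struct nl_task *t)
{
    t->nesting++;
}

void nl_read_unlock(struct nl_task *t)
{
    if (t->nesting != 1) {
        --t->nesting;                 /* inner level: just count down */
    } else {
        t->nesting = -NL_NEST_BIAS;   /* outermost: transient negative value */
        if (t->special)
            nl_unlock_special(t);
        t->nesting = 0;
    }
}
```

Only the outermost unlock looks at the special flag, and while it does so the counter holds -NL_NEST_BIAS rather than 1 or 0.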

The special case of rcu_read_unlock()

If irq/bh/preempt is disabled at the moment rcu_read_unlock() leaves the RCU read-side critical section, the GP must not complete immediately, so the CPU's QS is not reported but deferred (deferred QS). The unlock special case exists to handle such conditions, and works as follows.

First, whenever preemption occurs, a QS is recorded for that CPU. If the preemption occurs inside an RCU read-side critical section, however, the task is also added to the blocked tasks, and the rcu_read_unlock_special.b.blocked bit is set to 1 so that the special case of rcu_read_unlock() can recognize this later.
When rcu_read_unlock() hits the unlock special case, for the blocked reason above or another one, it acts as follows depending on whether interrupts, bh (softirq), and preemption are disabled or enabled.
- If any of them is disabled, the QS is made a deferred QS.
- If all are enabled, the deferred QS state is released. If the task is in the blocked state, that is released too and the QS is reported.
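The two outcomes can be condensed into a small decision function. The sketch below is a userspace model with hypothetical us_* names; it captures only the defer-or-report decision and leaves out how the deferred QS is later picked up.

```c
#include <assert.h>
#include <stdbool.h>

enum us_action { US_DEFER_QS, US_REPORT_QS };

/* Hypothetical snapshot of the execution context at unlock time. */
struct us_ctx {
    bool irqs_disabled;
    bool bh_disabled;
    bool preempt_disabled;
};

/* Hypothetical subset of the task's special bits. */
struct us_task {
    bool blocked;
    bool deferred_qs;
};

enum us_action us_unlock_special(struct us_task *t, const struct us_ctx *c)
{
    if (c->irqs_disabled || c->bh_disabled || c->preempt_disabled) {
        t->deferred_qs = true;    /* not safe to report yet */
        return US_DEFER_QS;
    }
    t->deferred_qs = false;       /* release any earlier deferral */
    t->blocked = false;           /* unblock, then report the QS */
    return US_REPORT_QS;
}
```

Note that the blocked bit survives the deferral: it is cleared only on the path that actually reports.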

The following figure shows how the special case in rcu_read_unlock() keeps the GP from ending so that it continues.

rcu_read_unlock_special()

kernel/rcu/tree_plugin.h

/*
 * Handle special cases during rcu_read_unlock(), such as needing to
 * notify RCU core processing or task having blocked during the RCU
 * read-side critical section.
 */

static void rcu_read_unlock_special(struct task_struct *t)
{
        unsigned long flags;
        bool preempt_bh_were_disabled =
                        !!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK));
        bool irqs_were_disabled;

        /* NMI handlers cannot block and cannot safely manipulate state. */
        if (in_nmi())
                return;

        local_irq_save(flags);
        irqs_were_disabled = irqs_disabled_flags(flags);
        if (preempt_bh_were_disabled || irqs_were_disabled) {
                bool exp;
                struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
                struct rcu_node *rnp = rdp->mynode;

                t->rcu_read_unlock_special.b.exp_hint = false;
                exp = (t->rcu_blocked_node && t->rcu_blocked_node->exp_tasks) ||
                      (rdp->grpmask & rnp->expmask) ||
                      tick_nohz_full_cpu(rdp->cpu);
                // Need to defer quiescent state until everything is enabled.
                if (irqs_were_disabled && use_softirq &&
                    (in_interrupt() ||
                     (exp && !t->rcu_read_unlock_special.b.deferred_qs))) {
                        // Using softirq, safe to awaken, and we get
                        // no help from enabling irqs, unlike bh/preempt.
                        raise_softirq_irqoff(RCU_SOFTIRQ);
                } else {
                        // Enabling BH or preempt does reschedule, so...
                        // Also if no expediting or NO_HZ_FULL, slow is OK.
                        set_tsk_need_resched(current);
                        set_preempt_need_resched();
                        if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
                            !rdp->defer_qs_iw_pending && exp) {
                                // Get scheduler to re-evaluate and call hooks.
                                // If !IRQ_WORK, FQS scan will eventually IPI.
                                init_irq_work(&rdp->defer_qs_iw,
                                              rcu_preempt_deferred_qs_handler);
                                rdp->defer_qs_iw_pending = true;
                                irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
                        }
                }
                t->rcu_read_unlock_special.b.deferred_qs = true;
                local_irq_restore(flags);
                return;
        }
        WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
        rcu_preempt_deferred_qs_irqrestore(t, flags);
}

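Called under preemptible RCU when rcu_read_unlock() detects a special case. If any of irq/bh/preempt is disabled, the QS is deferred. If all are enabled, the deferred QS is released and, if the task is in the blocked state, it is unblocked and the QS is reported.

Code lines 4~5 store in preempt_bh_were_disabled whether preemption or softirq (bh) is disabled.
Code lines 9~10 return without doing anything if an NMI is being handled.
Code lines 12~13 disable local irqs and store in irqs_were_disabled whether local irqs were already disabled on entry.
Code lines 14~19: if any of irq/bh/preempt was disabled, the exp_hint bit of the task's rcu_read_unlock_special is cleared.
- The special case itself applies at the outermost rcu_read_unlock() of an RCU read-side critical section when any of the special bits is set.
Code lines 20~22 set exp, which marks that expedited handling is wanted, if any of the following three conditions holds.
- The task was preempted and blocked inside a read-side critical section, and its node has tasks blocking the expedited GP (exp_tasks)
- The current CPU has not yet reported its expedited QS
- The current CPU is in nohz full mode and so has not been able to report a QS
Code lines 24~29: if irqs were disabled, RCU core processing uses softirq (use_softirq), and either of the following two conditions holds, the RCU softirq is raised so that the QS is handled quickly.
- An interrupt is being handled
- Expedited handling is wanted (exp == true) and no deferred QS has been requested yet (the deferred_qs bit of the task's rcu_read_unlock_special is false)
Code lines 30~44: otherwise a reschedule is requested so that the deferred QS is checked later, even if that is somewhat slower. If CONFIG_IRQ_WORK is enabled, irqs were disabled, no irq work is already pending, and exp is true, the QS report could otherwise be delayed for a long time, so an irq work is queued to call rcu_preempt_deferred_qs_handler() on this CPU.
- The irq work forces an interrupt so that the deferred QS is handled soon.
- rdp->defer_qs_iw_pending is set to true when the work is queued; the handler later sets it back to false.
Code lines 45~47: because at least one of irq/bh/preempt is disabled, the task is put into the deferred QS state in order to delay the QS report. The deferred_qs bit of the task's rcu_read_unlock_special is set to true, local irqs are restored, and the function returns.
Code line 49: reached when irq/bh/preempt were all enabled. The exp_hint bit of the task's rcu_read_unlock_special is cleared.
Code line 50 releases the deferred QS. If the task is in the blocked state, it is also unblocked and the QS is reported.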
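The choice between the three wake-up mechanisms on the deferral path can be written as a pure function of the conditions tested in the listing. The sketch below mirrors those conditions in userspace; the wk_* names and the input structure are hypothetical, and the function only classifies the case instead of raising a softirq or queuing irq work.

```c
#include <assert.h>
#include <stdbool.h>

enum wk_kind {
    WK_SOFTIRQ,            /* raise_softirq_irqoff(RCU_SOFTIRQ) */
    WK_RESCHED,            /* set need_resched only */
    WK_RESCHED_IRQ_WORK    /* set need_resched and queue irq work */
};

/* Hypothetical snapshot of the conditions used on the deferral path. */
struct wk_in {
    bool irqs_were_disabled;
    bool use_softirq;
    bool in_interrupt;
    bool exp;              /* expedited handling wanted */
    bool deferred_qs;      /* task already marked deferred */
    bool have_irq_work;    /* models IS_ENABLED(CONFIG_IRQ_WORK) */
    bool iw_pending;       /* models rdp->defer_qs_iw_pending */
};

enum wk_kind wk_choose(const struct wk_in *in)
{
    if (in->irqs_were_disabled && in->use_softirq &&
        (in->in_interrupt || (in->exp && !in->deferred_qs)))
        return WK_SOFTIRQ;
    if (in->have_irq_work && in->irqs_were_disabled &&
        !in->iw_pending && in->exp)
        return WK_RESCHED_IRQ_WORK;
    return WK_RESCHED;
}
```

When only bh or preemption was disabled, the plain reschedule request suffices, since re-enabling either of them gives the scheduler a chance to run.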

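The following figure shows the special case in rcu_read_unlock() in which the blocked task A is unblocked and the QS is then reported.

The following figure shows the special case in rcu_read_unlock() in which the QS is not reported right away after the blocked task A is unblocked; instead the QS becomes a deferred QS and is checked later.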

Scheduler handling when an RCU reader is preempted

rcu_note_context_switch()

kernel/rcu/tree_plugin.h

/*
 * We have entered the scheduler, and the current task might soon be
 * context-switched away from.  If this task is in an RCU read-side
 * critical section, we will no longer be able to rely on the CPU to
 * record that fact, so we enqueue the task on the blkd_tasks list.
 * The task will dequeue itself when it exits the outermost enclosing
 * RCU read-side critical section.  Therefore, the current grace period
 * cannot be permitted to complete until the blkd_tasks list entries
 * predating the current grace period drain, in other words, until
 * rnp->gp_tasks becomes NULL.
 *
 * Caller must disable interrupts.
 */

void rcu_note_context_switch(bool preempt)
{
        struct task_struct *t = current;
        struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
        struct rcu_node *rnp;

        trace_rcu_utilization(TPS("Start context switch"));
        lockdep_assert_irqs_disabled();
        WARN_ON_ONCE(!preempt && t->rcu_read_lock_nesting > 0);
        if (t->rcu_read_lock_nesting > 0 &&
            !t->rcu_read_unlock_special.b.blocked) {

                /* Possibly blocking in an RCU read-side critical section. */
                rnp = rdp->mynode;
                raw_spin_lock_rcu_node(rnp);
                t->rcu_read_unlock_special.b.blocked = true;
                t->rcu_blocked_node = rnp;

                /*
                 * Verify the CPU's sanity, trace the preemption, and
                 * then queue the task as required based on the states
                 * of any ongoing and expedited grace periods.
                 */
                WARN_ON_ONCE((rdp->grpmask & rcu_rnp_online_cpus(rnp)) == 0);
                WARN_ON_ONCE(!list_empty(&t->rcu_node_entry));
                trace_rcu_preempt_task(rcu_state.name,
                                       t->pid,
                                       (rnp->qsmask & rdp->grpmask)
                                       ? rnp->gp_seq
                                       : rcu_seq_snap(&rnp->gp_seq));
                rcu_preempt_ctxt_queue(rnp, rdp);
        } else {
                rcu_preempt_deferred_qs(t);
        }

        /*
         * Either we were not in an RCU read-side critical section to
         * begin with, or we have now recorded that critical section
         * globally.  Either way, we can now note a quiescent state
         * for this CPU.  Again, if we were in an RCU read-side critical
         * section, and if that critical section was blocking the current
         * grace period, then the fact that the task has been enqueued
         * means that we continue to block the current grace period.
         */
        rcu_qs();
        if (rdp->exp_deferred_qs)
                rcu_report_exp_rdp(rdp);
        trace_rcu_utilization(TPS("End context switch"));
}
EXPORT_SYMBOL_GPL(rcu_note_context_switch);

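Performs the RCU-related work before a context switch. When a context switch takes place, a QS is recorded for the CPU. If the current task is being preempted inside an RCU read-side critical section for the first time, it is added to the blocked task list of its rcu_node. Otherwise, if a deferred QS is pending it is released, and if the task is in the blocked state that is cleared and the QS is reported.

Code line 9: a task that is inside rcu_read_lock() may reach this function only through preemption. Otherwise a warning is printed.
- After rcu_read_lock(), a task must not block itself through an API such as schedule().
Code lines 10~31: if the task is inside an RCU read-side critical section and has not yet been marked blocked, the node lock is taken, the task is marked blocked and made to point at this node, and the task is then added to the blocked list.
Code lines 32~34: otherwise, a pending deferred QS is released. If the task can be unblocked, it is unblocked and the QS is reported.
Code line 45 records a QS for the current CPU.
Code lines 46~47: if an expedited QS was deferred on the current CPU, the deferred state is cleared and the QS is reported to the node above.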
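The key point, that the CPU records a QS while the preempted reader keeps the GP open, can be sketched in userspace. The cs_* names are hypothetical, the model tracks a single CPU and a single task, and a boolean stands in for queuing on ->blkd_tasks.

```c
#include <assert.h>
#include <stdbool.h>

struct cs_task {
    int nesting;     /* models ->rcu_read_lock_nesting */
    bool blocked;    /* models ->rcu_read_unlock_special.b.blocked */
    bool queued;     /* stands in for being on the node's ->blkd_tasks list */
};

struct cs_cpu { bool qs_recorded; };

void cs_note_context_switch(struct cs_task *t, struct cs_cpu *cpu)
{
    if (t->nesting > 0 && !t->blocked) {
        t->blocked = true;     /* remembered for the outermost unlock */
        t->queued = true;      /* the task now blocks the GP instead of the CPU */
    }
    cpu->qs_recorded = true;   /* the CPU itself passes through a QS */
}

/* In this one-CPU, one-task model the GP may end only when the CPU has a
 * QS and no preempted reader is still queued. */
bool cs_gp_may_end(const struct cs_task *t, const struct cs_cpu *cpu)
{
    return cpu->qs_recorded && !t->queued;
}
```

A task switched out outside any reader leaves nothing queued, so the recorded QS alone is enough.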

Adding a task to the blocked list

rcu_preempt_ctxt_queue()

kernel/rcu/tree_plugin.h -1/2-

/*
 * Queues a task preempted within an RCU-preempt read-side critical
 * section into the appropriate location within the ->blkd_tasks list,
 * depending on the states of any ongoing normal and expedited grace
 * periods.  The ->gp_tasks pointer indicates which element the normal
 * grace period is waiting on (NULL if none), and the ->exp_tasks pointer
 * indicates which element the expedited grace period is waiting on (again,
 * NULL if none).  If a grace period is waiting on a given element in the
 * ->blkd_tasks list, it also waits on all subsequent elements.  Thus,
 * adding a task to the tail of the list blocks any grace period that is
 * already waiting on one of the elements.  In contrast, adding a task
 * to the head of the list won't block any grace period that is already
 * waiting on one of the elements.
 *
 * This queuing is imprecise, and can sometimes make an ongoing grace
 * period wait for a task that is not strictly speaking blocking it.
 * Given the choice, we needlessly block a normal grace period rather than
 * blocking an expedited grace period.
 *
 * Note that an endless sequence of expedited grace periods still cannot
 * indefinitely postpone a normal grace period.  Eventually, all of the
 * fixed number of preempted tasks blocking the normal grace period that are
 * not also blocking the expedited grace period will resume and complete
 * their RCU read-side critical sections.  At that point, the ->gp_tasks
 * pointer will equal the ->exp_tasks pointer, at which point the end of
 * the corresponding expedited grace period will also be the end of the
 * normal grace period.
 */

static void rcu_preempt_ctxt_queue(struct rcu_node *rnp, struct rcu_data *rdp)
        __releases(rnp->lock) /* But leaves rrupts disabled. */
{
        int blkd_state = (rnp->gp_tasks ? RCU_GP_TASKS : 0) +
                         (rnp->exp_tasks ? RCU_EXP_TASKS : 0) +
                         (rnp->qsmask & rdp->grpmask ? RCU_GP_BLKD : 0) +
                         (rnp->expmask & rdp->grpmask ? RCU_EXP_BLKD : 0);
        struct task_struct *t = current;

        raw_lockdep_assert_held_rcu_node(rnp);
        WARN_ON_ONCE(rdp->mynode != rnp);
        WARN_ON_ONCE(!rcu_is_leaf_node(rnp));
        /* RCU better not be waiting on newly onlined CPUs! */
        WARN_ON_ONCE(rnp->qsmaskinitnext & ~rnp->qsmaskinit & rnp->qsmask &
                     rdp->grpmask);

        /*
         * Decide where to queue the newly blocked task.  In theory,
         * this could be an if-statement.  In practice, when I tried
         * that, it was quite messy.
         */
        switch (blkd_state) {
        case 0:
        case                RCU_EXP_TASKS:
        case                RCU_EXP_TASKS + RCU_GP_BLKD:
        case RCU_GP_TASKS:
        case RCU_GP_TASKS + RCU_EXP_TASKS:

                /*
                 * Blocking neither GP, or first task blocking the normal
                 * GP but not blocking the already-waiting expedited GP.
                 * Queue at the head of the list to avoid unnecessarily
                 * blocking the already-waiting GPs.
                 */
                list_add(&t->rcu_node_entry, &rnp->blkd_tasks);
                break;

        case                                              RCU_EXP_BLKD:
        case                                RCU_GP_BLKD:
        case                                RCU_GP_BLKD + RCU_EXP_BLKD:
        case RCU_GP_TASKS +                               RCU_EXP_BLKD:
        case RCU_GP_TASKS +                 RCU_GP_BLKD + RCU_EXP_BLKD:
        case RCU_GP_TASKS + RCU_EXP_TASKS + RCU_GP_BLKD + RCU_EXP_BLKD:

                /*
                 * First task arriving that blocks either GP, or first task
                 * arriving that blocks the expedited GP (with the normal
                 * GP already waiting), or a task arriving that blocks
                 * both GPs with both GPs already waiting.  Queue at the
                 * tail of the list to avoid any GP waiting on any of the
                 * already queued tasks that are not blocking it.
                 */
                list_add_tail(&t->rcu_node_entry, &rnp->blkd_tasks);
                break;

        case                RCU_EXP_TASKS +               RCU_EXP_BLKD:
        case                RCU_EXP_TASKS + RCU_GP_BLKD + RCU_EXP_BLKD:
        case RCU_GP_TASKS + RCU_EXP_TASKS +               RCU_EXP_BLKD:

                /*
                 * Second or subsequent task blocking the expedited GP.
                 * The task either does not block the normal GP, or is the
                 * first task blocking the normal GP.  Queue just after
                 * the first task blocking the expedited GP.
                 */
                list_add(&t->rcu_node_entry, rnp->exp_tasks);
                break;

        case RCU_GP_TASKS +                 RCU_GP_BLKD:
        case RCU_GP_TASKS + RCU_EXP_TASKS + RCU_GP_BLKD:

                /*
                 * Second or subsequent task blocking the normal GP.
                 * The task does not block the expedited GP. Queue just
                 * after the first task blocking the normal GP.
                 */
                list_add(&t->rcu_node_entry, rnp->gp_tasks);
                break;

        default:

                /* Yet another exercise in excessive paranoia. */
                WARN_ON_ONCE(1);
                break;
        }

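Adds a task that has been preempted inside an RCU read-side critical section, and is blocked for the first time, to the blocked list.

Code lines 4~7 build the blocked state of the node from a combination of four flags.
- RCU_GP_TASKS: whether the normal GP is already waiting on a task (rnp->gp_tasks is set)
- RCU_EXP_TASKS: whether the expedited GP is already waiting on a task (rnp->exp_tasks is set)
- RCU_GP_BLKD: whether this CPU still owes a normal QS, so that the task being added blocks the normal GP
- RCU_EXP_BLKD: whether this CPU still owes an expedited QS, so that the task being added blocks the expedited GP
Code line 8: t is the current task, which is about to be preempted.
- For a switch Task A (current) -> Task B, it corresponds to task A.
Code lines 22~36 add the current task at the head of the blocked task list.
Code lines 38~54 add the current task at the tail of the blocked task list.
Code lines 56~67 add the current task right after the entry that rnp->exp_tasks points to.
Code lines 69~78 add the current task right after the entry that rnp->gp_tasks points to.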
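The switch statement is a 16-entry decision table over the four flags. The userspace sketch below reproduces the same case groups but returns the queue position instead of touching a list; the flag values are those of the kernel header, while q_where and q_decide() are hypothetical names.

```c
#include <assert.h>

/* Flag values as defined in kernel/rcu/tree_plugin.h. */
#define RCU_GP_TASKS    0x8
#define RCU_EXP_TASKS   0x4
#define RCU_GP_BLKD     0x2
#define RCU_EXP_BLKD    0x1

enum q_where { Q_HEAD, Q_TAIL, Q_AFTER_EXP, Q_AFTER_GP, Q_INVALID };

/* Map a blkd_state value to the position chosen by the switch above. */
enum q_where q_decide(int blkd_state)
{
    switch (blkd_state) {
    case 0:
    case RCU_EXP_TASKS:
    case RCU_EXP_TASKS + RCU_GP_BLKD:
    case RCU_GP_TASKS:
    case RCU_GP_TASKS + RCU_EXP_TASKS:
        return Q_HEAD;         /* list_add() on blkd_tasks */
    case RCU_EXP_BLKD:
    case RCU_GP_BLKD:
    case RCU_GP_BLKD + RCU_EXP_BLKD:
    case RCU_GP_TASKS + RCU_EXP_BLKD:
    case RCU_GP_TASKS + RCU_GP_BLKD + RCU_EXP_BLKD:
    case RCU_GP_TASKS + RCU_EXP_TASKS + RCU_GP_BLKD + RCU_EXP_BLKD:
        return Q_TAIL;         /* list_add_tail() on blkd_tasks */
    case RCU_EXP_TASKS + RCU_EXP_BLKD:
    case RCU_EXP_TASKS + RCU_GP_BLKD + RCU_EXP_BLKD:
    case RCU_GP_TASKS + RCU_EXP_TASKS + RCU_EXP_BLKD:
        return Q_AFTER_EXP;    /* list_add() after rnp->exp_tasks */
    case RCU_GP_TASKS + RCU_GP_BLKD:
    case RCU_GP_TASKS + RCU_EXP_TASKS + RCU_GP_BLKD:
        return Q_AFTER_GP;     /* list_add() after rnp->gp_tasks */
    default:
        return Q_INVALID;      /* unreachable for the values 0..15 */
    }
}
```

The four groups contain 5, 6, 3, and 2 states, which together cover all 16 combinations, so the default branch is purely defensive.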

kernel/rcu/tree_plugin.h -2/2-

        /*
         * We have now queued the task.  If it was the first one to
         * block either grace period, update the ->gp_tasks and/or
         * ->exp_tasks pointers, respectively, to reference the newly
         * blocked tasks.
         */
        if (!rnp->gp_tasks && (blkd_state & RCU_GP_BLKD)) {
                rnp->gp_tasks = &t->rcu_node_entry;
                WARN_ON_ONCE(rnp->completedqs == rnp->gp_seq);
        }
        if (!rnp->exp_tasks && (blkd_state & RCU_EXP_BLKD))
                rnp->exp_tasks = &t->rcu_node_entry;
        WARN_ON_ONCE(!(blkd_state & RCU_GP_BLKD) !=
                     !(rnp->qsmask & rdp->grpmask));
        WARN_ON_ONCE(!(blkd_state & RCU_EXP_BLKD) !=
                     !(rnp->expmask & rdp->grpmask));
        raw_spin_unlock_rcu_node(rnp); /* interrupts remain disabled. */

        /*
         * Report the quiescent state for the expedited GP.  This expedited
         * GP should not be able to end until we report, so there should be
         * no need to check for a subsequent expedited GP.  (Though we are
         * still in a quiescent state in any case.)
         */
        if (blkd_state & RCU_EXP_BLKD && rdp->exp_deferred_qs)
                rcu_report_exp_rdp(rdp);
        else
                WARN_ON_ONCE(rdp->exp_deferred_qs);
}

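Code lines 7~10 handle the first task that blocks the normal GP.
- If the normal GP task pointer (rnp->gp_tasks) is empty and the added task blocks the normal GP, gp_tasks is made to point at the current task.
Code lines 11~12 handle the first task that blocks the expedited GP.
- If the expedited GP task pointer (rnp->exp_tasks) is empty and the added task blocks the expedited GP, exp_tasks is made to point at the current task.
Code lines 25~26: if the added task blocks the expedited GP and the current CPU has a deferred expedited QS, the deferred state is cleared and the expedited QS is reported to the node.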
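How a single pointer into one list separates the tasks a GP waits on from those it does not can be tried out in userspace. The sketch below handles the normal GP only and ignores the expedited GP; the bl_* names are hypothetical and the list code is a minimal imitation of the kernel's circular doubly linked list.

```c
#include <assert.h>
#include <stddef.h>

struct bl_list { struct bl_list *next, *prev; };

struct bl_node {
    struct bl_list blkd_tasks;   /* list head */
    struct bl_list *gp_tasks;    /* first entry the normal GP waits on, or NULL */
};

void bl_node_init(struct bl_node *rnp)
{
    rnp->blkd_tasks.next = rnp->blkd_tasks.prev = &rnp->blkd_tasks;
    rnp->gp_tasks = NULL;
}

/* Insert n between two adjacent entries. */
static void bl_insert(struct bl_list *n, struct bl_list *prev, struct bl_list *next)
{
    next->prev = n;
    n->next = next;
    n->prev = prev;
    prev->next = n;
}

void bl_queue(struct bl_node *rnp, struct bl_list *entry, int blocks_gp)
{
    if (!blocks_gp) {
        /* Head: in front of gp_tasks, so the GP does not wait on it. */
        bl_insert(entry, &rnp->blkd_tasks, rnp->blkd_tasks.next);
    } else if (rnp->gp_tasks) {
        /* Second or later blocker: right after the first blocker. */
        bl_insert(entry, rnp->gp_tasks, rnp->gp_tasks->next);
    } else {
        /* First blocker: tail, and gp_tasks starts pointing at it. */
        bl_insert(entry, rnp->blkd_tasks.prev, &rnp->blkd_tasks);
        rnp->gp_tasks = entry;
    }
}

/* The GP waits on gp_tasks and on every entry after it. */
int bl_gp_waiters(const struct bl_node *rnp)
{
    const struct bl_list *p;
    int n = 0;

    if (rnp->gp_tasks == NULL)
        return 0;
    for (p = rnp->gp_tasks; p != &rnp->blkd_tasks; p = p->next)
        n++;
    return n;
}
```

Queuing a non-blocking task at the head never changes the count, which is why late arrivals do not extend a GP that is already waiting.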

kernel/rcu/tree_plugin.h

/* Flags for rcu_preempt_ctxt_queue() decision table. */
#define RCU_GP_TASKS    0x8
#define RCU_EXP_TASKS   0x4
#define RCU_GP_BLKD     0x2
#define RCU_EXP_BLKD    0x1

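The following figure shows rcu_preempt_ctxt_queue() adding a blocked task after the task is preempted inside an RCU read-side critical section.

Task A is the task that keeps the GP from ending.

The following figure shows several CPUs going through the special case in rcu_read_unlock() at the same time, with gp_tasks pointing at task A, the first task to be blocked after the current GP started.

The following figure shows how, as tasks are added to blkd_tasks, the gp_tasks and exp_tasks list pointers are used to manage the list as three sublists.

The numbers in red circles below the code lines correspond to the order of the case groups in the code.
In case (2) the task is added at the tail of the blkd_tasks list, and the conditions that follow the switch statement decide which region the added task falls into.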

QS handling on the scheduler tick

rcu_sched_clock_irq()

kernel/rcu/tree.c

/*
 * This function is invoked from each scheduling-clock interrupt,
 * and checks to see if this CPU is in a non-context-switch quiescent
 * state, for example, user mode or idle loop.  It also schedules RCU
 * core processing.  If the current grace period has gone on too long,
 * it will ask the scheduler to manufacture a context switch for the sole
 * purpose of providing a providing the needed quiescent state.
 */

void rcu_sched_clock_irq(int user)
{
        trace_rcu_utilization(TPS("Start scheduler-tick"));
        raw_cpu_inc(rcu_data.ticks_this_gp);
        /* The load-acquire pairs with the store-release setting to true. */
        if (smp_load_acquire(this_cpu_ptr(&rcu_data.rcu_urgent_qs))) {
                /* Idle and userspace execution already are quiescent states. */
                if (!rcu_is_cpu_rrupt_from_idle() && !user) {
                        set_tsk_need_resched(current);
                        set_preempt_need_resched();
                }
                __this_cpu_write(rcu_data.rcu_urgent_qs, false);
        }
        rcu_flavor_sched_clock_irq(user);
        if (rcu_pending())
                invoke_rcu_core();

        trace_rcu_utilization(TPS("End scheduler-tick"));
}

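Called on every scheduler tick to check the current CPU for a QS and to start RCU core processing, for example for callbacks whose GP has completed and that are waiting to be invoked.

Code line 4 increments the current CPU's ticks_this_gp by 1 each time the function is called.
- This value is reset when a GP starts and is used only by print_cpu_stall_info() when printing a CPU's RCU stall information.
Code lines 6~13: if rcu_urgent_qs is set for the current CPU, it is cleared, and if the tick interrupted kernel execution (neither idle nor user mode), a reschedule is requested.
Code line 14 performs the per-tick QS check.
Code lines 15~16: if RCU has work pending on this CPU (rcu_pending()), such as callbacks that have not been processed yet, RCU core processing is started.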
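The urgent QS handling at the top of the tick function can be sketched as follows. This is a userspace model with hypothetical sc_* names; the load-acquire ordering of the original and the later calls into the flavor-specific check and the RCU core are left out.

```c
#include <assert.h>
#include <stdbool.h>

struct sc_cpu {
    bool urgent_qs;                /* models rcu_data.rcu_urgent_qs */
    bool need_resched;             /* models the reschedule request */
    unsigned long ticks_this_gp;   /* models rcu_data.ticks_this_gp */
};

void sc_tick(struct sc_cpu *cpu, bool user, bool from_idle)
{
    cpu->ticks_this_gp++;
    if (cpu->urgent_qs) {
        /* Idle and user mode already count as quiescent states, so a
         * forced context switch is needed only for kernel execution. */
        if (!from_idle && !user)
            cpu->need_resched = true;
        cpu->urgent_qs = false;
    }
}
```

The flag is cleared in both cases, so one urgent request leads to at most one forced reschedule.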

rcu_flavor_sched_clock_irq()

kernel/rcu/tree_plugin.h

/*
 * Check for a quiescent state from the current CPU, including voluntary
 * context switches for Tasks RCU.  When a task blocks, the task is
 * recorded in the corresponding CPU's rcu_node structure, which is checked
 * elsewhere, hence this function need only check for quiescent states
 * related to the current CPU, not to those related to tasks.
 */

static void rcu_flavor_sched_clock_irq(int user)
{
        struct task_struct *t = current;

        if (user || rcu_is_cpu_rrupt_from_idle()) {
                rcu_note_voluntary_context_switch(current);
        }
        if (t->rcu_read_lock_nesting > 0 ||
            (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
                /* No QS, force context switch if deferred. */
                if (rcu_preempt_need_deferred_qs(t)) {
                        set_tsk_need_resched(t);
                        set_preempt_need_resched();
                }
        } else if (rcu_preempt_need_deferred_qs(t)) {
                rcu_preempt_deferred_qs(t); /* Report deferred QS. */
                return;
        } else if (!t->rcu_read_lock_nesting) {
                rcu_qs(); /* Report immediate QS. */
                return;
        }

        /* If GP is oldish, ask for help from rcu_read_unlock_special(). */
        if (t->rcu_read_lock_nesting > 0 &&
            __this_cpu_read(rcu_data.core_needs_qs) &&
            __this_cpu_read(rcu_data.cpu_no_qs.b.norm) &&
            !t->rcu_read_unlock_special.b.need_qs &&
            time_after(jiffies, rcu_state.gp_start + HZ))
                t->rcu_read_unlock_special.b.need_qs = true;
}

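On each scheduler tick, checks the current CPU for a QS and, depending on the conditions, records a QS, or handles a deferred QS and reports it.

Code lines 5~7: if the tick interrupted a user task or the idle loop, the current task's Tasks RCU holdout state is cleared.
Code lines 8~21 take one of the following four paths for the per-tick QS check.
- The task is inside an RCU reader, or bh/preempt is disabled
  - No QS is possible. A reschedule is requested only if a deferred QS is pending (which implies that the task is not inside a reader), and execution continues below.
- A deferred QS is pending
  - The deferred QS is released. If the task can be unblocked, it is unblocked, the QS is reported, and the function returns.
- The task is outside any RCU reader
  - A QS is recorded for the current CPU and the function returns.
- Otherwise, execution continues below.
Code lines 24~29: if an RCU read-side critical section is in progress, the RCU core still needs a QS from this CPU and none has been noted yet, need_qs of unlock special is not set, and more than 1 second (HZ jiffies) has passed since the GP started, need_qs is set to true so that the unlock special path runs.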
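The four paths of the if/else chain can be classified by a small function. The sketch below is a userspace model: the fl_* names are hypothetical, need_deferred stands for the result of rcu_preempt_need_deferred_qs(), and the need_qs aging logic that follows the chain is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

enum fl_action {
    FL_NONE,               /* inside a reader: nothing to do at this point */
    FL_RESCHED,            /* no QS possible now, force a context switch */
    FL_REPORT_DEFERRED,    /* rcu_preempt_deferred_qs() */
    FL_RECORD_QS           /* rcu_qs() */
};

enum fl_action fl_tick(int nesting, bool preempt_or_bh_disabled,
                       bool need_deferred)
{
    if (nesting > 0 || preempt_or_bh_disabled)
        return need_deferred ? FL_RESCHED : FL_NONE;
    if (need_deferred)
        return FL_REPORT_DEFERRED;
    if (nesting == 0)
        return FL_RECORD_QS;
    return FL_NONE;
}
```

For example, fl_tick(0, true, true) asks for a reschedule because a deferred QS is pending but preemption or bh is still disabled, while fl_tick(0, false, true) can report it on the spot.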

다음 그림은 스케줄 틱마다 현재 cpu의 qs를 체크하고 조건에 따라 qs를 기록 또는 보고하는 과정을 보여준다.

Handling a deferred QS and reporting the QS when a blocked task is released

rcu_preempt_deferred_qs()

kernel/rcu/tree_plugin.h

/*
 * Report a deferred quiescent state if needed and safe to do so.
 * As with rcu_preempt_need_deferred_qs(), "safe" involves only
 * not being in an RCU read-side critical section.  The caller must
 * evaluate safety in terms of interrupt, softirq, and preemption
 * disabling.
 */

static void rcu_preempt_deferred_qs(struct task_struct *t)
{
        unsigned long flags;
        bool couldrecurse = t->rcu_read_lock_nesting >= 0;

        if (!rcu_preempt_need_deferred_qs(t))
                return;
        if (couldrecurse)
                t->rcu_read_lock_nesting -= RCU_NEST_BIAS;
        local_irq_save(flags);
        rcu_preempt_deferred_qs_irqrestore(t, flags);
        if (couldrecurse)
                t->rcu_read_lock_nesting += RCU_NEST_BIAS;
}

If a deferred QS is pending, releases it. If the task can be unblocked, it is also unblocked and the QS is reported.

Code line 4 stores in couldrecurse whether the current task's nesting counter is not already biased by RCU_NEST_BIAS.
- While the bias is applied the counter is negative, so the value is false.
Code lines 6~7 return if no deferred QS needs to be handled.
Code lines 8~9 temporarily subtract RCU_NEST_BIAS from t->rcu_read_lock_nesting, only if couldrecurse is true.
Code lines 10~11 disable local irqs and then release the deferred QS. If the task can be unblocked, it is unblocked and the QS is reported.
Code lines 12~13 restore t->rcu_read_lock_nesting to its original value.

rcu_preempt_need_deferred_qs()

kernel/rcu/tree_plugin.h

/*
 * Is a deferred quiescent-state pending, and are we also not in
 * an RCU read-side critical section?  It is the caller's responsibility
 * to ensure it is otherwise safe to report any deferred quiescent
 * states.  The reason for this is that it is safe to report a
 * quiescent state during context switch even though preemption
 * is disabled.  This function cannot be expected to understand these
 * nuances, so the caller must handle them.
 */

static bool rcu_preempt_need_deferred_qs(struct task_struct *t)
{
        return (__this_cpu_read(rcu_data.exp_deferred_qs) ||
                READ_ONCE(t->rcu_read_unlock_special.s)) &&
               t->rcu_read_lock_nesting <= 0;
}

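Returns whether a deferred QS is pending while the task is not inside an RCU read-side critical section.

The result is true when t->rcu_read_lock_nesting <= 0 (a negative value, as seen while RCU_NEST_BIAS is applied, also qualifies) and either of the following holds:
- An expedited deferred QS (exp_deferred_qs) is set on the current CPU
- rcu_read_unlock_special.s is set on the task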
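The predicate and the temporary RCU_NEST_BIAS adjustment can be tried out in user space. The following is a minimal sketch, not kernel code: `struct task_model`, `need_deferred_qs()`, `could_recurse()` and `NEST_BIAS` are hypothetical names, and `NEST_BIAS` is assumed to equal `INT_MAX`.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the task fields the kernel code reads. */
struct task_model {
    int      nesting;    /* models t->rcu_read_lock_nesting */
    uint32_t special_s;  /* models t->rcu_read_unlock_special.s */
};

#define NEST_BIAS INT_MAX  /* assumption: mirrors RCU_NEST_BIAS */

/* True only if something is pending AND the task is outside any reader. */
bool need_deferred_qs(const struct task_model *t, bool exp_deferred_qs)
{
    return (exp_deferred_qs || t->special_s != 0) && t->nesting <= 0;
}

/* Models "couldrecurse": false while the bias is already applied. */
bool could_recurse(const struct task_model *t)
{
    return t->nesting >= 0;
}
```

Subtracting `NEST_BIAS` from a nesting count of 0 makes it negative, so a nested call sees `could_recurse()` as false and does not apply the bias a second time, while `need_deferred_qs()` still holds because the count is <= 0.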

IRQ work: deferred QS handler

rcu_preempt_deferred_qs_handler()

kernel/rcu/tree_plugin.h

/*
 * Minimal handler to give the scheduler a chance to re-evaluate.
 */

static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
{
        struct rcu_data *rdp;

        rdp = container_of(iwp, struct rcu_data, defer_qs_iw);
        rdp->defer_qs_iw_pending = false;
}

This handler is invoked from irq work. It sets rdp->defer_qs_iw_pending to false for the corresponding CPU.

Clearing the deferred QS and releasing the blocked task while restoring IRQs

rcu_preempt_deferred_qs_irqrestore()

kernel/rcu/tree_plugin.h -1/2-

/*
 * Report deferred quiescent states.  The deferral time can
 * be quite short, for example, in the case of the call from
 * rcu_read_unlock_special().
 */

static void
rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
{
        bool empty_exp;
        bool empty_norm;
        bool empty_exp_now;
        struct list_head *np;
        bool drop_boost_mutex = false;
        struct rcu_data *rdp;
        struct rcu_node *rnp;
        union rcu_special special;

        /*
         * If RCU core is waiting for this CPU to exit its critical section,
         * report the fact that it has exited.  Because irqs are disabled,
         * t->rcu_read_unlock_special cannot change.
         */
        special = t->rcu_read_unlock_special;
        rdp = this_cpu_ptr(&rcu_data);
        if (!special.s && !rdp->exp_deferred_qs) {
                local_irq_restore(flags);
                return;
        }
        t->rcu_read_unlock_special.b.deferred_qs = false;
        if (special.b.need_qs) {
                rcu_qs();
                t->rcu_read_unlock_special.b.need_qs = false;
                if (!t->rcu_read_unlock_special.s && !rdp->exp_deferred_qs) {
                        local_irq_restore(flags);
                        return;
                }
        }

        /*
         * Respond to a request by an expedited grace period for a
         * quiescent state from this CPU.  Note that requests from
         * tasks are handled when removing the task from the
         * blocked-tasks list below.
         */
        if (rdp->exp_deferred_qs) {
                rcu_report_exp_rdp(rdp);
                if (!t->rcu_read_unlock_special.s) {
                        local_irq_restore(flags);
                        return;
                }
        }

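In preemptible RCU, this clears a deferred QS if one is set, releases the blocked task, reports the QS, and restores IRQs on the way out.

In code lines 18-23, if nothing is set in the task's unlock special and no expedited QS is deferred, local IRQs are restored and the function returns.
In code line 24, deferred_qs in the task's unlock special is cleared.
In code lines 25-32, if need_qs is set in the task's unlock special, the current CPU's QS is recorded immediately and need_qs is cleared. If the task's unlock special is then empty and no expedited QS is deferred, local IRQs are restored and the function returns.
- Even when no deferred QS was set after rcu reader nesting, need_qs is set once one second has passed since the GP started, so that the unlock-special path reports the QS.
In code lines 40-46, if an expedited QS is deferred, it is reported. If nothing remains set in the task's unlock special, local IRQs are restored and the function returns.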

kernel/rcu/tree_plugin.h -2/2-

        /* Clean up if blocked during RCU read-side critical section. */
        if (special.b.blocked) {
                t->rcu_read_unlock_special.b.blocked = false;

                /*
                 * Remove this task from the list it blocked on.  The task
                 * now remains queued on the rcu_node corresponding to the
                 * CPU it first blocked on, so there is no longer any need
                 * to loop.  Retain a WARN_ON_ONCE() out of sheer paranoia.
                 */
                rnp = t->rcu_blocked_node;
                raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */
                WARN_ON_ONCE(rnp != t->rcu_blocked_node);
                WARN_ON_ONCE(!rcu_is_leaf_node(rnp));
                empty_norm = !rcu_preempt_blocked_readers_cgp(rnp);
                WARN_ON_ONCE(rnp->completedqs == rnp->gp_seq &&
                             (!empty_norm || rnp->qsmask));
                empty_exp = sync_rcu_preempt_exp_done(rnp);
                smp_mb(); /* ensure expedited fastpath sees end of RCU c-s. */
                np = rcu_next_node_entry(t, rnp);
                list_del_init(&t->rcu_node_entry);
                t->rcu_blocked_node = NULL;
                trace_rcu_unlock_preempted_task(TPS("rcu_preempt"),
                                                rnp->gp_seq, t->pid);
                if (&t->rcu_node_entry == rnp->gp_tasks)
                        rnp->gp_tasks = np;
                if (&t->rcu_node_entry == rnp->exp_tasks)
                        rnp->exp_tasks = np;
                if (IS_ENABLED(CONFIG_RCU_BOOST)) {
                        /* Snapshot ->boost_mtx ownership w/rnp->lock held. */
                        drop_boost_mutex = rt_mutex_owner(&rnp->boost_mtx) == t;
                        if (&t->rcu_node_entry == rnp->boost_tasks)
                                rnp->boost_tasks = np;
                }

                /*
                 * If this was the last task on the current list, and if
                 * we aren't waiting on any CPUs, report the quiescent state.
                 * Note that rcu_report_unblock_qs_rnp() releases rnp->lock,
                 * so we must take a snapshot of the expedited state.
                 */
                empty_exp_now = sync_rcu_preempt_exp_done(rnp);
                if (!empty_norm && !rcu_preempt_blocked_readers_cgp(rnp)) {
                        trace_rcu_quiescent_state_report(TPS("preempt_rcu"),
                                                         rnp->gp_seq,
                                                         0, rnp->qsmask,
                                                         rnp->level,
                                                         rnp->grplo,
                                                         rnp->grphi,
                                                         !!rnp->gp_tasks);
                        rcu_report_unblock_qs_rnp(rnp, flags);
                } else {
                        raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
                }

                /* Unboost if we were boosted. */
                if (IS_ENABLED(CONFIG_RCU_BOOST) && drop_boost_mutex)
                        rt_mutex_futex_unlock(&rnp->boost_mtx);

                /*
                 * If this was the last task on the expedited lists,
                 * then we need to report up the rcu_node hierarchy.
                 */
                if (!empty_exp && empty_exp_now)
                        rcu_report_exp_rnp(rnp, true);
        } else {
                local_irq_restore(flags);
        }
}

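In code lines 2-3, if the task was preempted after entering an RCU read-side critical section under preemptible RCU, so that the blocked bit is set in its unlock special, the bit is cleared.
In code lines 11-12, the spinlock of the node the task blocked on is acquired.
In code line 15, empty_norm records whether the node has no blocked tasks holding up the current GP. (It is compared against a later re-read.)
In code line 18, empty_exp records whether the node's expedited work is already done, that is, no blocked tasks hold up the expedited GP and all expedited QSes have been reported. (It is compared against a later re-read.)
In code lines 20-34, the current task is removed from the node's blocked-task list. Each of the following three pointers is then advanced to the next blocked task if it referenced the removed task.
- gp_tasks pointed at the removed task
- exp_tasks pointed at the removed task
- boost_tasks pointed at the removed task
In code line 42, empty_exp_now records whether the node's expedited work is done now that the task has been removed.
In code lines 43-54, if the node had blocked tasks holding up the GP (empty_norm == false) but has none on re-reading, the last blocker has been released, so the QS is reported. Otherwise the node spinlock is released and local IRQs are restored.
In code lines 57-58, if the task had been boosted, the boost mutex is unlocked.
In code lines 64-65, if the expedited work was not done before the removal but is done now, the expedited QS is reported up the rcu_node hierarchy.
In code lines 66-68, if the task was never preempted inside an RCU read-side critical section and so never blocked, local IRQs are simply restored.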
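The "snapshot before, re-read after" pattern around empty_norm can be illustrated with a small user-space model. This is a sketch under simplifying assumptions: `struct list_node`, `struct node_model` and the three functions are hypothetical, only gp_tasks is modeled (no exp_tasks, boost_tasks or locking), and readers are appended at the tail purely to keep the example short.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modeled on the kernel's list_head. */
struct list_node {
    struct list_node *next, *prev;
};

/* Hypothetical, reduced model of struct rcu_node. */
struct node_model {
    struct list_node blkd_tasks;  /* head of the blocked-reader list */
    struct list_node *gp_tasks;   /* first reader blocking the GP, or NULL */
};

void node_init(struct node_model *n)
{
    n->blkd_tasks.next = n->blkd_tasks.prev = &n->blkd_tasks;
    n->gp_tasks = NULL;
}

/* Append a blocked reader at the tail of the list. */
void node_add_tail(struct node_model *n, struct list_node *e)
{
    e->prev = n->blkd_tasks.prev;
    e->next = &n->blkd_tasks;
    n->blkd_tasks.prev->next = e;
    n->blkd_tasks.prev = e;
}

/* Returns true when removing @e released the last reader blocking the GP. */
bool node_remove_reader(struct node_model *n, struct list_node *e)
{
    bool empty_norm = (n->gp_tasks == NULL);  /* snapshot before removal */
    /* Next entry, or NULL if @e is the last one (models rcu_next_node_entry). */
    struct list_node *np = (e->next == &n->blkd_tasks) ? NULL : e->next;

    /* Unlink @e and make it point to itself (models list_del_init). */
    e->prev->next = e->next;
    e->next->prev = e->prev;
    e->next = e->prev = e;

    if (n->gp_tasks == e)
        n->gp_tasks = np;  /* advance the pointer past the removed reader */

    /* Report only on the transition "had blockers" -> "has none". */
    return !empty_norm && n->gp_tasks == NULL;
}
```

With two readers queued and gp_tasks pointing at the first, removing the first only advances gp_tasks; removing the second returns true, which corresponds to the point where the kernel code reports the QS.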

Dump on blocked-task check failure

rcu_preempt_check_blocked_tasks()

kernel/rcu/tree_plugin.h

/*
 * Check that the list of blocked tasks for the newly completed grace
 * period is in fact empty.  It is a serious bug to complete a grace
 * period that still has RCU readers blocked!  This function must be
 * invoked -before- updating this rnp's ->gp_seq, and the rnp's ->lock
 * must be held by the caller.
 *
 * Also, if there are blocked tasks on the list, they automatically
 * block the newly created grace period, so set up ->gp_tasks accordingly.
 */

static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
{
        struct task_struct *t;

        RCU_LOCKDEP_WARN(preemptible(), "rcu_preempt_check_blocked_tasks() invoked with preemption enabled!!!\n");
        if (WARN_ON_ONCE(rcu_preempt_blocked_readers_cgp(rnp)))
                dump_blkd_tasks(rnp, 10);
        if (rcu_preempt_has_tasks(rnp) &&
            (rnp->qsmaskinit || rnp->wait_blkd_tasks)) {
                rnp->gp_tasks = rnp->blkd_tasks.next;
                t = container_of(rnp->gp_tasks, struct task_struct,
                                 rcu_node_entry);
                trace_rcu_unlock_preempted_task(TPS("rcu_preempt-GPS"),
                                                rnp->gp_seq, t->pid);
        }
        WARN_ON_ONCE(rnp->qsmask);
}

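This function is called when a GP starts and checks whether blocked tasks remain. If the node still has blocked tasks holding up the previous GP, a dump is printed. Tasks that are already on the blocked list automatically block the new GP, so gp_tasks is set to the head of the list.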

dump_blkd_tasks()

kernel/rcu/tree_plugin.h

/*
 * Dump the blocked-tasks state, but limit the list dump to the
 * specified number of elements.
 */

static void
dump_blkd_tasks(struct rcu_node *rnp, int ncheck)
{
        int cpu;
        int i;
        struct list_head *lhp;
        bool onl;
        struct rcu_data *rdp;
        struct rcu_node *rnp1;

        raw_lockdep_assert_held_rcu_node(rnp);
        pr_info("%s: grp: %d-%d level: %d ->gp_seq %ld ->completedqs %ld\n",
                __func__, rnp->grplo, rnp->grphi, rnp->level,
                (long)rnp->gp_seq, (long)rnp->completedqs);
        for (rnp1 = rnp; rnp1; rnp1 = rnp1->parent)
                pr_info("%s: %d:%d ->qsmask %#lx ->qsmaskinit %#lx ->qsmaskinitnext %#lx\n",
                        __func__, rnp1->grplo, rnp1->grphi, rnp1->qsmask, rnp1->qsmaskinit, rnp1->qsmaskinitnext);
        pr_info("%s: ->gp_tasks %p ->boost_tasks %p ->exp_tasks %p\n",
                __func__, rnp->gp_tasks, rnp->boost_tasks, rnp->exp_tasks);
        pr_info("%s: ->blkd_tasks", __func__);
        i = 0;
        list_for_each(lhp, &rnp->blkd_tasks) {
                pr_cont(" %p", lhp);
                if (++i >= ncheck)
                        break;
        }
        pr_cont("\n");
        for (cpu = rnp->grplo; cpu <= rnp->grphi; cpu++) {
                rdp = per_cpu_ptr(&rcu_data, cpu);
                onl = !!(rdp->grpmask & rcu_rnp_online_cpus(rnp));
                pr_info("\t%d: %c online: %ld(%d) offline: %ld(%d)\n",
                        cpu, ".o"[onl],
                        (long)rdp->rcu_onl_gp_seq, rdp->rcu_onl_gp_flags,
                        (long)rdp->rcu_ofl_gp_seq, rdp->rcu_ofl_gp_flags);
        }
}

Called when a node unexpectedly has blocked tasks; it dumps the related state.

Handling RCU callbacks before entering no-hz idle

rcu_prepare_for_idle()

kernel/rcu/tree_plugin.h

/*
 * Prepare a CPU for idle from an RCU perspective.  The first major task
 * is to sense whether nohz mode has been enabled or disabled via sysfs.
 * The second major task is to check to see if a non-lazy callback has
 * arrived at a CPU that previously had only lazy callbacks.  The third
 * major task is to accelerate (that is, assign grace-period numbers to)
 * any recently arrived callbacks.
 *
 * The caller must have disabled interrupts.
 */

static void rcu_prepare_for_idle(void)
{
        bool needwake;
        struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
        struct rcu_node *rnp;
        int tne;

        lockdep_assert_irqs_disabled();
        if (rcu_segcblist_is_offloaded(&rdp->cblist))
                return;

        /* Handle nohz enablement switches conservatively. */
        tne = READ_ONCE(tick_nohz_active);
        if (tne != rdp->tick_nohz_enabled_snap) {
                if (!rcu_segcblist_empty(&rdp->cblist))
                        invoke_rcu_core(); /* force nohz to see update. */
                rdp->tick_nohz_enabled_snap = tne;
                return;
        }
        if (!tne)
                return;

        /*
         * If a non-lazy callback arrived at a CPU having only lazy
         * callbacks, invoke RCU core for the side-effect of recalculating
         * idle duration on re-entry to idle.
         */
        if (rdp->all_lazy && rcu_segcblist_n_nonlazy_cbs(&rdp->cblist)) {
                rdp->all_lazy = false;
                invoke_rcu_core();
                return;
        }

        /*
         * If we have not yet accelerated this jiffy, accelerate all
         * callbacks on this CPU.
         */
        if (rdp->last_accelerate == jiffies)
                return;
        rdp->last_accelerate = jiffies;
        if (rcu_segcblist_pend_cbs(&rdp->cblist)) {
                rnp = rdp->mynode;
                raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */
                needwake = rcu_accelerate_cbs(rnp, rdp);
                raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
                if (needwake)
                        rcu_gp_kthread_wake();
        }
}

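Before the CPU enters idle, this prepares the pending RCU callbacks: it invokes the RCU core when needed and accelerates callbacks that have not yet been assigned a GP number.

In code lines 9-10, the function returns if callbacks are offloaded.
In code lines 13-21, if nohz_active has changed, the RCU core is invoked when callbacks remain, the new value is recorded, and the function returns. If nohz is not active, the function also returns.
- nohz is an option typically used on systems that need power savings, such as mobile devices.
In code lines 28-32, if a non-lazy callback has arrived on a CPU that previously had only lazy callbacks, rdp->all_lazy is set to false, the RCU core is invoked, and the function returns.
- Invoking the RCU core here has the side effect of recalculating the idle duration on re-entry to idle, so non-lazy callbacks are not left waiting.
In code lines 38-40, the function returns if acceleration was already done during the current jiffy.
In code lines 41-48, if there are pending callbacks, they are accelerated, and the GP kthread is woken up if that is required.

rcu_special union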

include/linux/sched.h

union rcu_special {
        struct {
                u8                      blocked;
                u8                      need_qs;
                u8                      exp_hint; /* Hint for performance. */
                u8                      deferred_qs;
        } b; /* Bits. */
        u32 s; /* Set of bits. */
};

A task inside an RCU read-side critical section sometimes has to take the unlock-special path; this union holds the state that controls it.

b.blocked
- Indicates whether the task was preempted, and therefore blocked, during an RCU read-side critical section.
- Set via the __schedule() -> rcu_note_context_switch() call path.
b.need_qs
- On every scheduler tick, the kernel checks whether one second has passed since the GP started while the task is nested in an rcu reader and no deferred QS is set; if so, need_qs is set. The bit then causes the unlock-special path to report the QS.
- Set via the scheduler tick -> update_process_times() -> rcu_sched_clock_irq() -> rcu_flavor_sched_clock_irq() call path.
b.exp_hint
- Set to true inside rcu_exp_handler(), which is invoked by an IPI call to process an expedited QS, when the task is running inside an RCU read-side critical section and the expedited QS has not been reported. rdp->exp_deferred_qs is set to true at the same time.
b.deferred_qs
- Set when an rcu reader that had been preempted calls rcu_read_unlock while any of irq/bh/preempt is disabled.
  - The QS is completed later, when rcu_preempt_deferred_qs_irqrestore() is called and reports the QS while removing the blocked task.
- Set via the rcu_read_unlock() -> rcu_read_unlock_special() call path.
s
- The set of all the bits above, viewed as a single value
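The point of overlaying `s` on the four byte-sized flags is that one load answers "is anything special pending?". A minimal user-space sketch follows; `union special_model` and `special_pending()` are hypothetical names that mirror the layout shown above using `<stdint.h>` types.

```c
#include <stdbool.h>
#include <stdint.h>

/* User-space mirror of the union layout shown above. */
union special_model {
    struct {
        uint8_t blocked;
        uint8_t need_qs;
        uint8_t exp_hint;
        uint8_t deferred_qs;
    } b;         /* individual flags, one byte each */
    uint32_t s;  /* all four flags read as one word */
};

/* The aggregate view only works if both members have the same size. */
_Static_assert(sizeof(union special_model) == sizeof(uint32_t),
               "flag struct must overlay the 32-bit word exactly");

/* One comparison replaces four separate flag tests. */
bool special_pending(const union special_model *sp)
{
    return sp->s != 0;
}
```

Setting any single flag makes `s` nonzero, which is why checks such as `!special.s` in the code above can skip all unlock-special work at once.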

rcu_noqs union

kernel/rcu/tree.h

/*
 * Union to allow "aggregate OR" operation on the need for a quiescent
 * state by the normal and expedited grace periods.
 */

union rcu_noqs {
        struct {
                u8 norm;
                u8 exp;
        } b; /* Bits. */
        u16 s; /* Set of bits, aggregate OR here. */
};

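This union tracks whether a QS has not yet been checked on the CPU.

b.norm
- For a normal GP, a value of 1 means the QS has not yet been checked on the current CPU. It changes to 0 once checked.
b.exp
- For an expedited GP, a value of 1 means the expedited QS has not yet been checked on the current CPU. It changes to 0 once checked.
s
- The set of all the bits above, viewed as a single value

References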

RCU(Read Copy Update) -1- (Basic) | 문c
RCU(Read Copy Update) -2- (Callback process) | 문c
RCU(Read Copy Update) -3- (RCU threads) | 문c
RCU(Read Copy Update) -4- (NOCB process) | 문c
RCU(Read Copy Update) -5- (Callback list) | 문c
RCU(Read Copy Update) -6- (Expedited GP) | 문c
RCU(Read Copy Update) -7- (Preemptible RCU) | 문c (this post)
rcu_init() | 문c
wait_for_completion() | 문c

RCU(Read Copy Update) -6- (Expedited GP)

2021-03-16, 2023-07-29 문영일

Brute-force RCU-sched grace period(Expedited Grace Period)

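With high-speed network devices (10G, 100G), detecting GP completion late delays memory clean-up, which can lead to OOM (Out Of Memory) and bring the system down. For such cases, waiting with an expedited GP completes the GP faster than the default normal GP wait. IPI calls are sent to the online CPUs, excluding idle ones, to force an interrupt context and execute a memory barrier. A CPU that is not inside an RCU read-side critical section reports its QS immediately from the IPI handler; otherwise the QS is reported at the outermost rcu_read_unlock() or with help from the scheduler. Because IPIs are sent to all of these CPUs, the method is costly on systems with a large number of CPUs.

Reference: Expedited-Grace-Periods | Kernel.org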

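sysfs settings

The following two settings relate to the RCU expedited GP.

/sys/kernel/rcu_expedited
- When set to 1, the brute-force RCU grace period method is used instead of the normal RCU grace period method.
/sys/kernel/rcu_normal
- When set to 1, the normal RCU grace period method is used even if the brute-force RCU grace period method is currently in use.

Expedited GP behavior during boot-up

On systems with little memory, callbacks that pile up before the RCU scheduler is ready are processed late, which can cause OOM. To avoid this, the kernel briefly runs in expedited GP mode during boot-up and then returns to normal GP mode, using the following counter. Expedited mode can also be enabled permanently through the settings above.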

rcu_expedited_nesting
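- The initial value is 1 (for the boot CPU).
- It is decremented when kernel boot-up finishes.
- Secondary CPUs
  - On hotplug online, rcu_expedite_gp() is called and the counter is incremented.
  - On hotplug offline, rcu_unexpedite_gp() is called and the counter is decremented.

Leaving expedited GP mode at the end of boot-up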

rcu_end_inkernel_boot()

kernel/rcu/update.c

/*
 * Inform RCU of the end of the in-kernel boot sequence.
 */

void rcu_end_inkernel_boot(void)
{
        rcu_unexpedite_gp();
        if (rcu_normal_after_boot)
                WRITE_ONCE(rcu_normal, 1);
}

Decrements the counter to leave expedited GP mode. If the module parameter “rcu_normal_after_boot” is set, the kernel always runs in normal GP mode once the in-kernel boot sequence has completed.

rcu_expedite_gp()

kernel/rcu/update.c

/**
 * rcu_expedite_gp - Expedite future RCU grace periods
 *
 * After a call to this function, future calls to synchronize_rcu() and
 * friends act as the corresponding synchronize_rcu_expedited() function
 * had instead been called.
 */

void rcu_expedite_gp(void)
{
        atomic_inc(&rcu_expedited_nesting);
}
EXPORT_SYMBOL_GPL(rcu_expedite_gp);

Increments the counter to enable expedited GP mode. (Used when a secondary CPU boots up.)

rcu_unexpedite_gp()

kernel/rcu/update.c

/**
 * rcu_unexpedite_gp - Cancel prior rcu_expedite_gp() invocation
 *
 * Undo a prior call to rcu_expedite_gp().  If all prior calls to
 * rcu_expedite_gp() are undone by a subsequent call to rcu_unexpedite_gp(),
 * and if the rcu_expedited sysfs/boot parameter is not set, then all
 * subsequent calls to synchronize_rcu() and friends will return to
 * their normal non-expedited behavior.
 */

void rcu_unexpedite_gp(void)
{
        atomic_dec(&rcu_expedited_nesting);
}
EXPORT_SYMBOL_GPL(rcu_unexpedite_gp);

Decrements the counter to leave expedited GP mode.

rcu_gp_is_expedited()

kernel/rcu/update.c

/*
 * Should normal grace-period primitives be expedited?  Intended for
 * use within RCU.  Note that this function takes the rcu_expedited
 * sysfs/boot variable and rcu_scheduler_active into account as well
 * as the rcu_expedite_gp() nesting.  So looping on rcu_unexpedite_gp()
 * until rcu_gp_is_expedited() returns false is a -really- bad idea.
 */

bool rcu_gp_is_expedited(void)
{
        return rcu_expedited || atomic_read(&rcu_expedited_nesting);
}
EXPORT_SYMBOL_GPL(rcu_gp_is_expedited);

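Returns whether the expedited method should be used when waiting for an RCU GP.

If /sys/kernel/rcu_expedited is set, the brute-force RCU grace period method is used instead of the normal RCU grace period method.
Expedited GP mode is also used while rcu_expedited_nesting is nonzero, as it is during kernel boot-up.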

rcu_gp_is_normal()

kernel/rcu/update.c

/*
 * Should expedited grace-period primitives always fall back to their
 * non-expedited counterparts?  Intended for use within RCU.  Note
 * that if the user specifies both rcu_expedited and rcu_normal, then
 * rcu_normal wins.  (Except during the time period during boot from
 * when the first task is spawned until the rcu_set_runtime_mode()
 * core_initcall() is invoked, at which point everything is expedited.)
 */

bool rcu_gp_is_normal(void)
{
        return READ_ONCE(rcu_normal) &&
               rcu_scheduler_active != RCU_SCHEDULER_INIT;
}
EXPORT_SYMBOL_GPL(rcu_gp_is_normal);

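Returns whether the normal method should be used when waiting for an RCU GP.

If /sys/kernel/rcu_normal is set, the normal RCU grace period method is used to wait for the GP, even when the expedited GP method is in use. The exception is the period during boot while rcu_scheduler_active is RCU_SCHEDULER_INIT.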
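How the two settings and the nesting counter combine can be modeled in a few lines. The sketch below is a user-space approximation, not kernel code: all names are hypothetical, the RCU_SCHEDULER_INIT exception is deliberately left out, and `waiter_uses_expedited()` is an illustrative helper that encodes "rcu_normal wins".

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Models rcu_expedited_nesting; starts at 1 so that boot is expedited. */
static atomic_int expedited_nesting = 1;
/* Model /sys/kernel/rcu_expedited and /sys/kernel/rcu_normal. */
static int sysfs_expedited;
static int sysfs_normal;

void expedite_gp(void)   { atomic_fetch_add(&expedited_nesting, 1); }
void unexpedite_gp(void) { atomic_fetch_sub(&expedited_nesting, 1); }

/* Expedited if requested via sysfs or if any nesting is outstanding. */
bool gp_is_expedited(void)
{
    return sysfs_expedited || atomic_load(&expedited_nesting) != 0;
}

/* Simplification: ignores the early-boot exception in the kernel code. */
bool gp_is_normal(void)
{
    return sysfs_normal != 0;
}

/* Illustrative helper: when both are requested, normal wins. */
bool waiter_uses_expedited(void)
{
    return gp_is_expedited() && !gp_is_normal();
}
```

In this model the single `unexpedite_gp()` call at the end of boot brings the counter to 0, after which only the sysfs-style flags decide the mode.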

Expedited GP sequence number

rcu_exp_gp_seq_start()

kernel/rcu/tree_exp.h

/*
 * Record the start of an expedited grace period.
 */

static void rcu_exp_gp_seq_start(void)
{
        rcu_seq_start(&rcu_state.expedited_sequence);
}

Increments the expedited GP sequence by 1 to start an expedited GP.

rcu_exp_gp_seq_endval()

kernel/rcu/tree_exp.h

/*
 * Return then value that expedited-grace-period counter will have
 * at the end of the current grace period.
 */

static __maybe_unused unsigned long rcu_exp_gp_seq_endval(void)
{
        return rcu_seq_endval(&rcu_state.expedited_sequence);
}

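Returns the value the expedited GP sequence will have at the end of the current GP: the counter part incremented by 1, with the state part set to 0 (GP idle). The sequence itself is not modified.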

rcu_exp_gp_seq_end()

kernel/rcu/tree_exp.h

/*
 * Record the end of an expedited grace period.
 */

static void rcu_exp_gp_seq_end(void)
{
        rcu_seq_end(&rcu_state.expedited_sequence);
        smp_mb(); /* Ensure that consecutive grace periods serialize. */
}

To end the expedited GP, increments the counter part of the expedited GP sequence by 1 and sets the state to 0 (GP idle).

rcu_exp_gp_seq_snap()

kernel/rcu/tree_exp.h

/*
 * Take a snapshot of the expedited-grace-period counter.
 */

static unsigned long rcu_exp_gp_seq_snap(void)
{
        unsigned long s;

        smp_mb(); /* Caller's modifications seen first by other CPUs. */
        s = rcu_seq_snap(&rcu_state.expedited_sequence);
        trace_rcu_exp_grace_period(rcu_state.name, s, TPS("snap"));
        return s;
}

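Returns the expedited GP sequence value at which a full GP will have elapsed since the snapshot. (If a GP is in progress, the value two steps ahead is returned to be safe; if the GP is idle, the value one step ahead is returned.)

Example) sp=8 (idle)
- 12 (idle, one step ahead)
Example) sp=9 (start)
- 16 (idle, two steps ahead)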

rcu_exp_gp_seq_done()

kernel/rcu/tree_exp.h

/*
 * Given a counter snapshot from rcu_exp_gp_seq_snap(), return true
 * if a full expedited grace period has elapsed since that snapshot
 * was taken.
 */

static bool rcu_exp_gp_seq_done(unsigned long s)
{
        return rcu_seq_done(&rcu_state.expedited_sequence, s);
}

Returns whether the expedited GP has completed, based on snapshot @s.

expedited GP sequence >= snapshot @s
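The sequence helpers above can be reproduced in user space to check the sp=8 and sp=9 examples. This is a sketch under stated assumptions: the low 2 bits are taken to hold the GP state and the remaining bits the counter, the comparison is written to survive wrap-around, and the `seq_*` names are hypothetical stand-ins for the kernel's rcu_seq_*() helpers (memory barriers and READ_ONCE/WRITE_ONCE are omitted).

```c
#include <limits.h>
#include <stdbool.h>

#define SEQ_CTR_SHIFT  2                             /* assumed: 2 state bits */
#define SEQ_STATE_MASK ((1UL << SEQ_CTR_SHIFT) - 1)  /* 0x3 */

/* GP start: state 0 -> 1, counter part unchanged. */
void seq_start(unsigned long *sp) { *sp += 1; }

/* Value at the end of the current GP: next counter, state 0. */
unsigned long seq_endval(const unsigned long *sp)
{
    return (*sp | SEQ_STATE_MASK) + 1;
}

/* GP end: store the end value. */
void seq_end(unsigned long *sp) { *sp = seq_endval(sp); }

/* Snapshot: the value by which a full GP has certainly elapsed. */
unsigned long seq_snap(const unsigned long *sp)
{
    return (*sp + 2 * SEQ_STATE_MASK + 1) & ~SEQ_STATE_MASK;
}

/* Wrap-safe "current sequence >= snapshot". */
bool seq_done(const unsigned long *sp, unsigned long s)
{
    return ULONG_MAX / 2 >= *sp - s;
}
```

Starting from 8 (idle), a snapshot gives 12. After `seq_start()` the value is 9, so a new snapshot gives 16, because the GP already in flight may have started before the caller's updates became visible.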

Synchronous wait for an expedited RCU GP

The following figure shows the call flow of the functions that wait for an expedited GP to complete.

synchronize_rcu_expedited()

kernel/rcu/tree_exp.h

/**
 * synchronize_rcu_expedited - Brute-force RCU grace period
 *
 * Wait for an RCU grace period, but expedite it.  The basic idea is to
 * IPI all non-idle non-nohz online CPUs.  The IPI handler checks whether
 * the CPU is in an RCU critical section, and if so, it sets a flag that
 * causes the outermost rcu_read_unlock() to report the quiescent state
 * for RCU-preempt or asks the scheduler for help for RCU-sched.  On the
 * other hand, if the CPU is not in an RCU read-side critical section,
 * the IPI handler reports the quiescent state immediately.
 *
 * Although this is a great improvement over previous expedited
 * implementations, it is still unfriendly to real-time workloads, so is
 * thus not recommended for any sort of common-case code.  In fact, if
 * you are using synchronize_rcu_expedited() in a loop, please restructure
 * your code to batch your updates, and then Use a single synchronize_rcu()
 * instead.
 *
 * This has the same semantics as (but is more brutal than) synchronize_rcu().
 */

void synchronize_rcu_expedited(void)
{
        bool boottime = (rcu_scheduler_active == RCU_SCHEDULER_INIT);
        struct rcu_exp_work rew;
        struct rcu_node *rnp;
        unsigned long s;

        RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) ||
                         lock_is_held(&rcu_lock_map) ||
                         lock_is_held(&rcu_sched_lock_map),
                         "Illegal synchronize_rcu_expedited() in RCU read-side critical section");

        /* Is the state is such that the call is a grace period? */
        if (rcu_blocking_is_gp())
                return;

        /* If expedited grace periods are prohibited, fall back to normal. */
        if (rcu_gp_is_normal()) {
                wait_rcu_gp(call_rcu);
                return;
        }

        /* Take a snapshot of the sequence number.  */
        s = rcu_exp_gp_seq_snap();
        if (exp_funnel_lock(s))
                return;  /* Someone else did our work for us. */

        /* Ensure that load happens before action based on it. */
        if (unlikely(boottime)) {
                /* Direct call during scheduler init and early_initcalls(). */
                rcu_exp_sel_wait_wake(s);
        } else {
                /* Marshall arguments & schedule the expedited grace period. */
                rew.rew_s = s;
                INIT_WORK_ONSTACK(&rew.rew_work, wait_rcu_exp_gp);
                queue_work(rcu_gp_wq, &rew.rew_work);
        }

        /* Wait for expedited grace period to complete. */
        rnp = rcu_get_root();
        wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
                   sync_exp_work_done(s));
        smp_mb(); /* Workqueue actions happen before return. */

        /* Let the next expedited grace period start. */
        mutex_unlock(&rcu_state.exp_mutex);

        if (likely(!boottime))
                destroy_work_on_stack(&rew.rew_work);
}
EXPORT_SYMBOL_GPL(synchronize_rcu_expedited);

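Waits until the GP completes using the brute-force RCU grace period method. When this method cannot be used, it falls back to waiting for a normal GP.

In code lines 14-15, if merely blocking already counts as a GP (rcu_blocking_is_gp() returns true), the function returns immediately without waiting for a GP.
In code lines 18-21, if the normal method has been forced through sysfs, wait_rcu_gp() is called to wait for a normal GP, and the function returns.
- This is the case when 1 has been written to “/sys/kernel/rcu_normal”. The default value is 0.
In code line 24, a snapshot of the expedited GP sequence is taken.
- When the GP is idle this is the next GP number; when a GP is in progress it is the one after next.
In code lines 25-26, the funnel lock for the expedited GP is acquired. If another task has already completed the expedited GP this caller needs, the function returns.
In code lines 29-37, in the unlikely case of early boot, the current CPU drives the expedited GP directly; otherwise the wait_rcu_exp_gp() work function is queued on a workqueue.
In code lines 40-42, the task waits on the root node until the expedited GP completes. (It sleeps in TASK_UNINTERRUPTIBLE state.)
In code line 46, rcu_state.exp_mutex is released so that the next expedited GP can start.
In code lines 48-50, the on-stack work item is destroyed, unless the call was made at boot time and no work item was used.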
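The wait in code lines 40-42 picks one of four wait queues from the snapshot value. A minimal sketch of that index calculation, assuming a 2-bit state field in the sequence number; `seq_ctr()` and `exp_wq_index()` are hypothetical names:

```c
#define SEQ_CTR_SHIFT 2  /* assumed: low 2 bits of the sequence hold the state */

/* Counter part of a sequence value. */
unsigned long seq_ctr(unsigned long s)
{
    return s >> SEQ_CTR_SHIFT;
}

/* Index into a 4-entry wait-queue array, as in exp_wq[rcu_seq_ctr(s) & 0x3]. */
unsigned int exp_wq_index(unsigned long s)
{
    return (unsigned int)(seq_ctr(s) & 0x3);
}
```

Consecutive snapshots (12, 16, 20, 24) map to indexes 3, 0, 1, 2, so waiters for neighbouring expedited GPs sleep on different queues and a wakeup for one GP does not have to wake the waiters of the next.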

exp_funnel_lock()

kernel/rcu/tree_exp.h

/*
 * Funnel-lock acquisition for expedited grace periods.  Returns true
 * if some other task completed an expedited grace period that this task
 * can piggy-back on, and with no mutex held.  Otherwise, returns false
 * with the mutex held, indicating that the caller must actually do the
 * expedited grace period.
 */

static bool exp_funnel_lock(unsigned long s)
{
        struct rcu_data *rdp = per_cpu_ptr(&rcu_data, raw_smp_processor_id());
        struct rcu_node *rnp = rdp->mynode;
        struct rcu_node *rnp_root = rcu_get_root();

        /* Low-contention fastpath. */
        if (ULONG_CMP_LT(READ_ONCE(rnp->exp_seq_rq), s) &&
            (rnp == rnp_root ||
             ULONG_CMP_LT(READ_ONCE(rnp_root->exp_seq_rq), s)) &&
            mutex_trylock(&rcu_state.exp_mutex))
                goto fastpath;

        /*
         * Each pass through the following loop works its way up
         * the rcu_node tree, returning if others have done the work or
         * otherwise falls through to acquire ->exp_mutex.  The mapping
         * from CPU to rcu_node structure can be inexact, as it is just
         * promoting locality and is not strictly needed for correctness.
         */
        for (; rnp != NULL; rnp = rnp->parent) {
                if (sync_exp_work_done(s))
                        return true;

                /* Work not done, either wait here or go up. */
                spin_lock(&rnp->exp_lock);
                if (ULONG_CMP_GE(rnp->exp_seq_rq, s)) {

                        /* Someone else doing GP, so wait for them. */
                        spin_unlock(&rnp->exp_lock);
                        trace_rcu_exp_funnel_lock(rcu_state.name, rnp->level,
                                                  rnp->grplo, rnp->grphi,
                                                  TPS("wait"));
                        wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
                                   sync_exp_work_done(s));
                        return true;
                }
                rnp->exp_seq_rq = s; /* Followers can wait on us. */
                spin_unlock(&rnp->exp_lock);
                trace_rcu_exp_funnel_lock(rcu_state.name, rnp->level,
                                          rnp->grplo, rnp->grphi, TPS("nxtlvl"));
        }
        mutex_lock(&rcu_state.exp_mutex);
fastpath:
        if (sync_exp_work_done(s)) {
                mutex_unlock(&rcu_state.exp_mutex);
                return true;
        }
        rcu_exp_gp_seq_start();
        trace_rcu_exp_grace_period(rcu_state.name, s, TPS("start"));
        return false;
}

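Acquires the funnel lock in order to wait for expedited GP completion and starts the expedited GP sequence. If the expedited GP has already completed, it returns true.

In code lines 8-12, under the following condition it jumps to the fastpath label for speed.
- The requested expedited GP sequence numbers (exp_seq_rq) of the CPU's node and of the root node are both < snapshot @s, and the global lock (rcu_state.exp_mutex) can be taken with a trylock
In code lines 21-23, while walking from the CPU's node up to the root node, true is returned as soon as the expedited GP is found to be complete.
- Time may have passed since the snapshot was taken, so the expedited GP may already be complete.
  - rcu_state.expedited_sequence >= snapshot @s
In code lines 26-39, with the node's exp_lock held, the node's requested expedited GP sequence number is updated to snapshot @s. If the node's number is already >= snapshot @s, someone else is driving that GP, so the task waits until the expedited GP completes and then returns true.
In code line 43, once the walk has passed the root node without finding anyone else to wait for, the global mutex (rcu_state.exp_mutex) is acquired.
In code lines 44-48, at the fastpath: label, if the expedited GP has already completed, the mutex is released and true is returned.
In code line 49, the expedited GP sequence number is incremented by 1, moving it to the GP-started state.
In code line 51, false is returned; the caller holds the mutex and must drive the expedited GP itself.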
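The funnel idea, where most callers stop at a low tree level and only one reaches the top, can be shown with a single-threaded model. This is a sketch under heavy simplification: `struct fnode`, `seq_ge()` and `funnel_walk()` are hypothetical, and the per-node exp_lock, the sync_exp_work_done() check, the actual sleeping, and the global mutex are all omitted.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced model of an rcu_node in the funnel. */
struct fnode {
    struct fnode *parent;      /* NULL for the root */
    unsigned long exp_seq_rq;  /* latest snapshot requested through this node */
};

/* Wrap-safe "a >= b", the same idea as ULONG_CMP_GE(). */
static bool seq_ge(unsigned long a, unsigned long b)
{
    return ULONG_MAX / 2 >= a - b;
}

/*
 * Walk from @leaf toward the root for snapshot @s.
 * Returns the node at which the caller would wait for another task,
 * or NULL if the caller passed the root and must drive the GP itself.
 */
struct fnode *funnel_walk(struct fnode *leaf, unsigned long s)
{
    for (struct fnode *n = leaf; n != NULL; n = n->parent) {
        if (seq_ge(n->exp_seq_rq, s))
            return n;       /* someone already requested @s here: follow them */
        n->exp_seq_rq = s;  /* mark the node so later callers can wait on us */
    }
    return NULL;            /* no one ahead of us: become the leader */
}
```

With two leaves under one root, the first caller for a given snapshot becomes the leader, a caller from the other leaf stops at the root, and a second caller from the first leaf stops at that leaf, so contention on the top-level lock stays low.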

sync_exp_work_done()

kernel/rcu/tree_exp.h

/* Common code for work-done checking. */

static bool sync_exp_work_done(unsigned long s)
{
        if (rcu_exp_gp_seq_done(s)) {
                trace_rcu_exp_grace_period(rcu_state.name, s, TPS("done"));
                smp_mb(); /* Ensure test happens before caller kfree(). */
                return true;
        }
        return false;
}

Returns whether the expedited GP has completed relative to snapshot @s of the expedited GP sequence.

rcu_state.expedited_sequence >= snapshot @s

Workqueue handler

wait_rcu_exp_gp()

kernel/rcu/tree_exp.h

/*
 * Work-queue handler to drive an expedited grace period forward.
 */

static void wait_rcu_exp_gp(struct work_struct *wp)
{
        struct rcu_exp_work *rewp;

        rewp = container_of(wp, struct rcu_exp_work, rew_work);
        rcu_exp_sel_wait_wake(rewp->rew_s);
}

The workqueue handler that drives an expedited GP. It waits until expedited QS processing completes and then wakes up the tasks blocked waiting for the GP.

rcu_exp_sel_wait_wake()

kernel/rcu/tree_exp.h

/*
 * Common code to drive an expedited grace period forward, used by
 * workqueues and mid-boot-time tasks.
 */

static void rcu_exp_sel_wait_wake(unsigned long s)
{
        /* Initialize the rcu_node tree in preparation for the wait. */
        sync_rcu_exp_select_cpus();

        /* Wait and clean up, including waking everyone. */
        rcu_exp_wait_wake(s);
}

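Waits until expedited QS processing completes and then wakes up the tasks blocked waiting for the GP.

In code line 4, the CPUs to be expedited are selected, and the selection work is awaited.
In code line 7, after waiting for the expedited QSes, the expedited GP sequence is ended and the tasks waiting for the expedited GP are woken up.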

sync_rcu_exp_select_cpus()

kernel/rcu/tree_exp.h

/*
 * Select the nodes that the upcoming expedited grace period needs
 * to wait for.
 */

static void sync_rcu_exp_select_cpus(void)
{
        int cpu;
        struct rcu_node *rnp;

        trace_rcu_exp_grace_period(rcu_state.name, rcu_exp_gp_seq_endval(), TPS("reset"));
        sync_exp_reset_tree();
        trace_rcu_exp_grace_period(rcu_state.name, rcu_exp_gp_seq_endval(), TPS("select"));

        /* Schedule work for each leaf rcu_node structure. */
        rcu_for_each_leaf_node(rnp) {
                rnp->exp_need_flush = false;
                if (!READ_ONCE(rnp->expmask))
                        continue; /* Avoid early boot non-existent wq. */
                if (!READ_ONCE(rcu_par_gp_wq) ||
                    rcu_scheduler_active != RCU_SCHEDULER_RUNNING ||
                    rcu_is_last_leaf_node(rnp)) {
                        /* No workqueues yet or last leaf, do direct call. */
                        sync_rcu_exp_select_node_cpus(&rnp->rew.rew_work);
                        continue;
                }
                INIT_WORK(&rnp->rew.rew_work, sync_rcu_exp_select_node_cpus);
                cpu = find_next_bit(&rnp->ffmask, BITS_PER_LONG, -1);
                /* If all offline, queue the work on an unbound CPU. */
                if (unlikely(cpu > rnp->grphi - rnp->grplo))
                        cpu = WORK_CPU_UNBOUND;
                else
                        cpu += rnp->grplo;
                queue_work_on(cpu, rcu_par_gp_wq, &rnp->rew.rew_work);
                rnp->exp_need_flush = true;
        }

        /* Wait for workqueue jobs (if any) to complete. */
        rcu_for_each_leaf_node(rnp)
                if (rnp->exp_need_flush)
                        flush_work(&rnp->rew.rew_work);
}

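Selects the CPUs that need expedited processing and waits for the per-node selection work to complete.

In code line 7, the expmask of every rcu node is reset.
In code lines 11~14, the leaf nodes are traversed, and nodes that need no processing (expmask is 0) are skipped.
- The node's exp_need_flush variable is used only when a workqueue job is used, to record that the node's workqueue job must be waited for.
In code lines 15~21, if any of the following conditions holds, the current cpu calls sync_rcu_exp_select_node_cpus() directly. It then continues with the next node.
- The workqueue rcu_par_gp_wq is not ready yet
- The rcu scheduler is not running yet
- The node is the last leaf node
In code lines 22~30, the work is queued on the rcu workqueue (rcu_par_gp_wq) so that sync_rcu_exp_select_node_cpus() runs either cpu bound (on a cpu of that node) or cpu unbound. exp_need_flush is then set to true; it is cleared again at the start of the next traversal (code line 12).
In code lines 34~36, for every leaf node whose work was queued on the workqueue (exp_need_flush == true), it waits until that work completes.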
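
The cpu choice in code lines 23~28 can be sketched on its own. The sketch below assumes that ffmask holds one bit per cpu of the node, numbered from grplo, and uses a hypothetical WORK_CPU_UNBOUND_SKETCH constant in place of the kernel's WORK_CPU_UNBOUND; first_set_bit() is a simple stand-in for find_next_bit().

```c
#include <assert.h>
#include <limits.h>

#define WORK_CPU_UNBOUND_SKETCH (-1)    /* hypothetical stand-in for WORK_CPU_UNBOUND */

/* Index of the lowest set bit, or the number of bits in a long if none is set. */
static int first_set_bit(unsigned long mask)
{
        int bits = (int)(sizeof(mask) * CHAR_BIT);

        for (int i = 0; i < bits; i++)
                if (mask & (1UL << i))
                        return i;
        return bits;
}

/* Pick the cpu on which a leaf node's selection work should be queued. */
int pick_work_cpu(unsigned long ffmask, int grplo, int grphi)
{
        int cpu;

        assert(grplo <= grphi);
        cpu = first_set_bit(ffmask);
        if (cpu > grphi - grplo)        /* no usable cpu in this node */
                return WORK_CPU_UNBOUND_SKETCH;
        return cpu + grplo;             /* convert node-relative bit to cpu number */
}
```

For a node covering CPUs 16~31 with only bit 2 set in ffmask, the work would be queued on cpu 18; with an empty mask it falls back to the unbound case.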

Resetting the expmask of all nodes

sync_exp_reset_tree()

kernel/rcu/tree_exp.h

/*
 * Reset the ->expmask values in the rcu_node tree in preparation for
 * a new expedited grace period.
 */

static void __maybe_unused sync_exp_reset_tree(void)
{
        unsigned long flags;
        struct rcu_node *rnp;

        sync_exp_reset_tree_hotplug();
        rcu_for_each_node_breadth_first(rnp) {
                raw_spin_lock_irqsave_rcu_node(rnp, flags);
                WARN_ON_ONCE(rnp->expmask);
                rnp->expmask = rnp->expmaskinit;
                raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
        }
}

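Resets all nodes in preparation for a new expedited gp.

In code line 6, the expmaskinit value of each node is reset to reflect recently onlined CPUs.
In code lines 7~12, all nodes are traversed starting from the root, and expmask is reset to the expmaskinit value.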

sync_exp_reset_tree_hotplug()

kernel/rcu/tree_exp.h

/*
 * Reset the ->expmaskinit values in the rcu_node tree to reflect any
 * recent CPU-online activity.  Note that these masks are not cleared
 * when CPUs go offline, so they reflect the union of all CPUs that have
 * ever been online.  This means that this function normally takes its
 * no-work-to-do fastpath.
 */

static void sync_exp_reset_tree_hotplug(void)
{
        bool done;
        unsigned long flags;
        unsigned long mask;
        unsigned long oldmask;
        int ncpus = smp_load_acquire(&rcu_state.ncpus); /* Order vs. locking. */
        struct rcu_node *rnp;
        struct rcu_node *rnp_up;

        /* If no new CPUs onlined since last time, nothing to do. */
        if (likely(ncpus == rcu_state.ncpus_snap))
                return;
        rcu_state.ncpus_snap = ncpus;

        /*
         * Each pass through the following loop propagates newly onlined
         * CPUs for the current rcu_node structure up the rcu_node tree.
         */
        rcu_for_each_leaf_node(rnp) {
                raw_spin_lock_irqsave_rcu_node(rnp, flags);
                if (rnp->expmaskinit == rnp->expmaskinitnext) {
                        raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
                        continue;  /* No new CPUs, nothing to do. */
                }

                /* Update this node's mask, track old value for propagation. */
                oldmask = rnp->expmaskinit;
                rnp->expmaskinit = rnp->expmaskinitnext;
                raw_spin_unlock_irqrestore_rcu_node(rnp, flags);

                /* If was already nonzero, nothing to propagate. */
                if (oldmask)
                        continue;

                /* Propagate the new CPU up the tree. */
                mask = rnp->grpmask;
                rnp_up = rnp->parent;
                done = false;
                while (rnp_up) {
                        raw_spin_lock_irqsave_rcu_node(rnp_up, flags);
                        if (rnp_up->expmaskinit)
                                done = true;
                        rnp_up->expmaskinit |= mask;
                        raw_spin_unlock_irqrestore_rcu_node(rnp_up, flags);
                        if (done)
                                break;
                        mask = rnp_up->grpmask;
                        rnp_up = rnp_up->parent;
                }
        }
}

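Resets the expmaskinit value of each node to reflect recently onlined CPUs.

In code lines 12~14, if no new cpu has come online since the last call, the function returns. Otherwise the changed cpu count is recorded in rcu_state.ncpus_snap.
In code lines 20~25, the leaf nodes are traversed, and a node with no new cpu is skipped.
In code lines 28~29, the existing expmaskinit bits are saved in oldmask, and the expmaskinitnext bits, which reflect the newly onlined CPUs of the node, are assigned to expmaskinit.
In code lines 33~34, if oldmask is nonzero there is nothing to propagate upward, so the node is skipped.
In code lines 37~50, the lower node's grpmask is added to the parent node's expmaskinit to propagate it upward. If the parent's expmaskinit already had a value, the loop exits without moving further up.

The following figure shows how the expmaskinit value of each leaf node is reset to reflect recently onlined CPUs.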
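
The per-leaf propagation can also be traced with a small userspace sketch. It keeps only the four fields the loop touches and leaves out all locking; struct node and leaf_reset_hotplug() are names invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-in for struct rcu_node: only the fields used below. */
struct node {
        unsigned long expmaskinit;
        unsigned long expmaskinitnext; /* meaningful on leaf nodes */
        unsigned long grpmask;         /* this node's bit in parent->expmaskinit */
        struct node *parent;
};

/* Fold newly onlined CPUs of one leaf into expmaskinit and propagate upward. */
void leaf_reset_hotplug(struct node *leaf)
{
        unsigned long oldmask, mask;
        struct node *up;

        assert(leaf != NULL);
        if (leaf->expmaskinit == leaf->expmaskinitnext)
                return;                 /* no new CPUs on this leaf */
        oldmask = leaf->expmaskinit;
        leaf->expmaskinit = leaf->expmaskinitnext;
        if (oldmask)
                return;                 /* leaf is already represented above */
        mask = leaf->grpmask;
        for (up = leaf->parent; up; up = up->parent) {
                int done = up->expmaskinit != 0;

                up->expmaskinit |= mask;
                if (done)
                        break;          /* ancestors already know this subtree */
                mask = up->grpmask;
        }
}
```

The early exit matters because a parent whose expmaskinit was already nonzero has itself been propagated upward on an earlier pass, so walking further would only repeat work.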

sync_rcu_exp_select_node_cpus()

kernel/rcu/tree_exp.h

/*
 * Select the CPUs within the specified rcu_node that the upcoming
 * expedited grace period needs to wait for.
 */

static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
{
        int cpu;
        unsigned long flags;
        unsigned long mask_ofl_test;
        unsigned long mask_ofl_ipi;
        int ret;
        struct rcu_exp_work *rewp =
                container_of(wp, struct rcu_exp_work, rew_work);
        struct rcu_node *rnp = container_of(rewp, struct rcu_node, rew);

        raw_spin_lock_irqsave_rcu_node(rnp, flags);

        /* Each pass checks a CPU for identity, offline, and idle. */
        mask_ofl_test = 0;
        for_each_leaf_node_cpu_mask(rnp, cpu, rnp->expmask) {
                struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
                unsigned long mask = rdp->grpmask;
                int snap;

                if (raw_smp_processor_id() == cpu ||
                    !(rnp->qsmaskinitnext & mask)) {
                        mask_ofl_test |= mask;
                } else {
                        snap = rcu_dynticks_snap(rdp);
                        if (rcu_dynticks_in_eqs(snap))
                                mask_ofl_test |= mask;
                        else
                                rdp->exp_dynticks_snap = snap;
                }
        }
        mask_ofl_ipi = rnp->expmask & ~mask_ofl_test;

        /*
         * Need to wait for any blocked tasks as well.  Note that
         * additional blocking tasks will also block the expedited GP
         * until such time as the ->expmask bits are cleared.
         */
        if (rcu_preempt_has_tasks(rnp))
                WRITE_ONCE(rnp->exp_tasks, rnp->blkd_tasks.next);
        raw_spin_unlock_irqrestore_rcu_node(rnp, flags);

        /* IPI the remaining CPUs for expedited quiescent state. */
        for_each_leaf_node_cpu_mask(rnp, cpu, mask_ofl_ipi) {
                struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
                unsigned long mask = rdp->grpmask;

retry_ipi:
                if (rcu_dynticks_in_eqs_since(rdp, rdp->exp_dynticks_snap)) {
                        mask_ofl_test |= mask;
                        continue;
                }
                if (get_cpu() == cpu) {
                        put_cpu();
                        continue;
                }
                ret = smp_call_function_single(cpu, rcu_exp_handler, NULL, 0);
                put_cpu();
                /* The CPU will report the QS in response to the IPI. */
                if (!ret)
                        continue;

                /* Failed, raced with CPU hotplug operation. */
                raw_spin_lock_irqsave_rcu_node(rnp, flags);
                if ((rnp->qsmaskinitnext & mask) &&
                    (rnp->expmask & mask)) {
                        /* Online, so delay for a bit and try again. */
                        raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
                        trace_rcu_exp_grace_period(rcu_state.name, rcu_exp_gp_seq_endval(), TPS("selectofl"));
                        schedule_timeout_idle(1);
                        goto retry_ipi;
                }
                /* CPU really is offline, so we must report its QS. */
                if (rnp->expmask & mask)
                        mask_ofl_test |= mask;
                raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
        }
        /* Report quiescent states for those that went offline. */
        if (mask_ofl_test)
                rcu_report_exp_cpu_mult(rnp, mask_ofl_test, false);
}

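Selects, within the leaf node, the CPUs that the expedited gp must wait for, and sends them an IPI that forces them to report an expedited qs. For CPUs that need no IPI, the qs is reported on their behalf.

In code lines 8~10, rcu_exp_work and rcu_node are obtained from the work passed as the argument.
In code lines 15~32, the IPI targets are rnp->expmask minus the mask_ofl_test bits computed here. The CPUs of the rnp leaf node are traversed, and CPUs matching any of the following are added (or) to mask_ofl_test to exclude them from the targets. Note that mask is not the cpu mask for the cpu number itself but is treated as BIT(cpu - grplo).
- The current cpu
- The cpu is not in qsmaskinitnext
- The new dynticks snap value shows an extended qs state
In code lines 39~40, if there are blocked tasks, which arise on a preempt kernel when a task is preempted inside an rcu critical section, exp_tasks is set to point at the first entry of the blkd_tasks list so that the expedited gp also waits for them.
In code line 44, only the CPUs of the leaf node that are included in mask_ofl_ipi are traversed.
In code lines 48~52, this is the retry_ipi: label. If the cpu has already passed through an extended qs according to dynticks, the cpu is added (or) to mask_ofl_test so that its qs is reported on its behalf, and no IPI is sent.
In code lines 53~56, the current cpu is skipped as an IPI target.
In code lines 57~61, an IPI is sent so that rcu_exp_handler() runs on the target cpu. On success that cpu reports its qs itself in response to the IPI, so processing moves on to the next cpu.
In code lines 65~72, this is the case where the IPI call failed. If the cpu is still present in qsmaskinitnext and expmask, the call is retried after a 1-tick delay.
In code lines 74~75, the cpu really is offline; if it is still in expmask, it is added to mask_ofl_test.
In code lines 79~80, if mask_ofl_test is nonzero, the expedited qs of those CPUs is reported to the node.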
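
The first loop is essentially a mask split, which the following userspace sketch reproduces. struct cpu_state and select_ipi_targets() are hypothetical; the two booleans stand in for the qsmaskinitnext bit and the result of rcu_dynticks_in_eqs(), and cpu numbers are taken to be node-relative (grplo = 0).

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-cpu view used only by this sketch. */
struct cpu_state {
        bool online;    /* stand-in for the cpu's bit in qsmaskinitnext */
        bool in_eqs;    /* stand-in for rcu_dynticks_in_eqs(snapshot) */
};

/*
 * Split expmask into CPUs whose qs is reported on their behalf
 * (stored in *ofl_test) and CPUs that need an IPI (the return value).
 */
unsigned long select_ipi_targets(unsigned long expmask,
                                 const struct cpu_state *cpus, int ncpus,
                                 int self, unsigned long *ofl_test)
{
        unsigned long test = 0;

        assert(cpus != NULL && ofl_test != NULL);
        assert(ncpus <= (int)(sizeof(unsigned long) * CHAR_BIT));
        for (int i = 0; i < ncpus; i++) {
                unsigned long bit = 1UL << i;   /* BIT(cpu - grplo) */

                if (!(expmask & bit))
                        continue;               /* not waited for */
                if (i == self || !cpus[i].online || cpus[i].in_eqs)
                        test |= bit;            /* no IPI needed */
        }
        *ofl_test = test;
        return expmask & ~test;                 /* mask_ofl_ipi */
}
```

With four CPUs where cpu 0 is the caller, cpu 1 is idle, and cpu 2 is offline, only cpu 3 ends up in the IPI mask.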

rcu_exp_wait_wake()

kernel/rcu/tree_exp.h

/*
 * Wait for the current expedited grace period to complete, and then
 * wake up everyone who piggybacked on the just-completed expedited
 * grace period.  Also update all the ->exp_seq_rq counters as needed
 * in order to avoid counter-wrap problems.
 */

static void rcu_exp_wait_wake(unsigned long s)
{
        struct rcu_node *rnp;

        synchronize_sched_expedited_wait();
        rcu_exp_gp_seq_end();
        trace_rcu_exp_grace_period(rcu_state.name, s, TPS("end"));

        /*
         * Switch over to wakeup mode, allowing the next GP, but -only- the
         * next GP, to proceed.
         */
        mutex_lock(&rcu_state.exp_wake_mutex);

        rcu_for_each_node_breadth_first(rnp) {
                if (ULONG_CMP_LT(READ_ONCE(rnp->exp_seq_rq), s)) {
                        spin_lock(&rnp->exp_lock);
                        /* Recheck, avoid hang in case someone just arrived. */
                        if (ULONG_CMP_LT(rnp->exp_seq_rq, s))
                                rnp->exp_seq_rq = s;
                        spin_unlock(&rnp->exp_lock);
                }
                smp_mb(); /* All above changes before wakeup. */
                wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
        }
        trace_rcu_exp_grace_period(rcu_state.name, s, TPS("endwake"));
        mutex_unlock(&rcu_state.exp_wake_mutex);
}

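Waits for the current expedited gp to complete and then wakes up the tasks on every node that were waiting for the expedited gp to complete.

In code line 5, it waits, within the stall time, for the expedited qs of all CPUs to complete. If they do not complete in time, kernel messages about the rcu stall are dumped for analysis.
In code line 6, the expedited gp sequence number is marked as completed.
In code lines 15~25, all nodes are traversed starting from the root node, and any tasks waiting for the expedited gp to complete are woken up. If a node's exp_seq_rq is smaller than @s, it is updated to @s.
- This wakes up the tasks sleeping in wait_event() inside synchronize_rcu_expedited() while waiting for the expedited gp to complete.
- Four wait queues are kept per node, indexed by the low 2 bits of the expedited gp counter (rcu_seq_ctr()).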
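
The wait-queue index can be worked out by hand. The sketch below assumes the usual layout in which the low 2 bits of an rcu sequence number hold the gp state and the remaining bits hold the counter; exp_wq_index() is a name made up for this example.

```c
#include <assert.h>

#define RCU_SEQ_CTR_SHIFT 2     /* low bits of the sequence hold the gp state */

/* Counter part of a sequence number, with the state bits shifted out. */
static unsigned long rcu_seq_ctr(unsigned long s)
{
        return s >> RCU_SEQ_CTR_SHIFT;
}

/* Which of the four exp_wq[] queues a given sequence value maps to. */
unsigned int exp_wq_index(unsigned long seq)
{
        return (unsigned int)(rcu_seq_ctr(seq) & 0x3);
}

/* Sequence values that differ only in the state bits share a queue. */
void exp_wq_index_example(void)
{
        assert(exp_wq_index(4) == 1);
        assert(exp_wq_index(7) == 1);
        assert(exp_wq_index(20) == 1);  /* counter 5 wraps onto queue 1 again */
}
```

Because consecutive grace periods land on different queues, waiters for one gp are not woken needlessly by the completion of the gp right before or after it.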

Expedited QS handling and RCU stall output

synchronize_sched_expedited_wait()

kernel/rcu/tree_exp.h

static void synchronize_sched_expedited_wait(void)
{
        int cpu;
        unsigned long jiffies_stall;
        unsigned long jiffies_start;
        unsigned long mask;
        int ndetected;
        struct rcu_node *rnp;
        struct rcu_node *rnp_root = rcu_get_root();
        int ret;

        trace_rcu_exp_grace_period(rcu_state.name, rcu_exp_gp_seq_endval(), TPS("startwait"));
        jiffies_stall = rcu_jiffies_till_stall_check();
        jiffies_start = jiffies;

        for (;;) {
                ret = swait_event_timeout_exclusive(
                                rcu_state.expedited_wq,
                                sync_rcu_preempt_exp_done_unlocked(rnp_root),
                                jiffies_stall);
                if (ret > 0 || sync_rcu_preempt_exp_done_unlocked(rnp_root))
                        return;
                WARN_ON(ret < 0);  /* workqueues should not be signaled. */
                if (rcu_cpu_stall_suppress)
                        continue;
                panic_on_rcu_stall();
                pr_err("INFO: %s detected expedited stalls on CPUs/tasks: {",
                       rcu_state.name);
                ndetected = 0;
                rcu_for_each_leaf_node(rnp) {
                        ndetected += rcu_print_task_exp_stall(rnp);
                        for_each_leaf_node_possible_cpu(rnp, cpu) {
                                struct rcu_data *rdp;

                                mask = leaf_node_cpu_bit(rnp, cpu);
                                if (!(rnp->expmask & mask))
                                        continue;
                                ndetected++;
                                rdp = per_cpu_ptr(&rcu_data, cpu);
                                pr_cont(" %d-%c%c%c", cpu,
                                        "O."[!!cpu_online(cpu)],
                                        "o."[!!(rdp->grpmask & rnp->expmaskinit)],
                                        "N."[!!(rdp->grpmask & rnp->expmaskinitnext)]);
                        }
                }
                pr_cont(" } %lu jiffies s: %lu root: %#lx/%c\n",
                        jiffies - jiffies_start, rcu_state.expedited_sequence,
                        rnp_root->expmask, ".T"[!!rnp_root->exp_tasks]);
                if (ndetected) {
                        pr_err("blocking rcu_node structures:");
                        rcu_for_each_node_breadth_first(rnp) {
                                if (rnp == rnp_root)
                                        continue; /* printed unconditionally */
                                if (sync_rcu_preempt_exp_done_unlocked(rnp))
                                        continue;
                                pr_cont(" l=%u:%d-%d:%#lx/%c",
                                        rnp->level, rnp->grplo, rnp->grphi,
                                        rnp->expmask,
                                        ".T"[!!rnp->exp_tasks]);
                        }
                        pr_cont("\n");
                }
                rcu_for_each_leaf_node(rnp) {
                        for_each_leaf_node_possible_cpu(rnp, cpu) {
                                mask = leaf_node_cpu_bit(rnp, cpu);
                                if (!(rnp->expmask & mask))
                                        continue;
                                dump_cpu_task(cpu);
                        }
                }
                jiffies_stall = 3 * rcu_jiffies_till_stall_check() + 3;
        }
}

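Waits, within the stall time, for the expedited qs of all CPUs to complete. If they do not complete in time, kernel messages about the rcu stall are dumped for analysis, and the wait is retried with a timeout of 3 times the stall time plus 3 jiffies. (The default stall time is 21 seconds, plus 5 seconds when CONFIG_PROVE_RCU is enabled.)

In code line 13, the time corresponding to an rcu stall (jiffies_stall) is obtained. (default 21 seconds, plus 5 seconds when CONFIG_PROVE_RCU is enabled)
In code lines 16~22, it waits up to the rcu stall time. If the expedited gp completes within that time, the function returns.
In code lines 24~25, if the module parameter “/sys/module/rcupdate/parameters/rcu_cpu_stall_suppress” is 1, the loop continues without printing a warning message. (default=0)
In code line 26, if “/proc/sys/kernel/panic_on_rcu_stall” is set to 1, it panics with the message “RCU Stall\n”. (default=0)
In code lines 27~28, if it did not panic, an error message such as “INFO: rcu_preempt detected expedited stalls on CPUs/tasks: {” is printed.
- Reference: Using RCU’s CPU Stall Detector
In code lines 29~31, the leaf nodes are traversed; the pids of tasks that were preempted and blocked on a preempt kernel are printed, and their count is added to ndetected.
In code lines 32~44, the possible CPUs belonging to the node being traversed are walked, and information on each cpu that has not reported a qs is printed.
- “ <cpu>-<O if offline, otherwise .><o if not in expmaskinit, otherwise .><N if not in expmaskinitnext, otherwise .>”
  - e.g. “4-O..” -> cpu 4 is offline, and it is included in both expmaskinit and expmaskinitnext.
In code lines 46~48, the elapsed time since the wait started is then printed in jiffies, followed by the expedited gp sequence number, the root node's expmask, and whether exp_tasks is set, shown as the character “T” or “.”.
- e.g. “} 6002 jiffies s: 173 root: 0x0201/T”
In code lines 49~62, if ndetected is nonzero, all nodes except the root node are traversed, and information on each node that has not completed its qs reporting is printed as follows.
- “ l=<level>:<grplo>-<grphi>:<expmask>/<T or .>”
  - e.g. “blocking rcu_node structures: l=1:0-15:0x1/. l=1:144-159:0x8002/.”
In code lines 63~70, the leaf nodes are traversed, and for each possible cpu of the node that has not reported a qs, the backtrace of its task is dumped.
In code line 71, jiffies_stall is set to 3 times the rcu stall time plus 3 jiffies, and the wait is retried.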
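
The three status characters come from indexing a two-character string literal with a 0/1 value, so the letter is printed when the condition is false and "." when it is true. A minimal sketch of that idiom, with format_stall_cpu() as a made-up helper around snprintf():

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Format one cpu entry the way the stall message does: " <cpu>-<c1><c2><c3>". */
int format_stall_cpu(char *buf, size_t len, int cpu, bool online,
                     bool in_expmaskinit, bool in_expmaskinitnext)
{
        assert(buf != NULL);
        return snprintf(buf, len, " %d-%c%c%c", cpu,
                        "O."[online],               /* 'O' when offline */
                        "o."[in_expmaskinit],       /* 'o' when not in expmaskinit */
                        "N."[in_expmaskinitnext]);  /* 'N' when not in expmaskinitnext */
}
```

Calling it for an offline cpu 4 that is present in both masks yields " 4-O..", which matches the shape of the entries in the stall message.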

rcu_jiffies_till_stall_check()

kernel/rcu/tree_stall.h

/* Limit-check stall timeouts specified at boottime and runtime. */

int rcu_jiffies_till_stall_check(void)
{
        int till_stall_check = READ_ONCE(rcu_cpu_stall_timeout);

        /*
         * Limit check must be consistent with the Kconfig limits
         * for CONFIG_RCU_CPU_STALL_TIMEOUT.
         */
        if (till_stall_check < 3) {
                WRITE_ONCE(rcu_cpu_stall_timeout, 3);
                till_stall_check = 3;
        } else if (till_stall_check > 300) {
                WRITE_ONCE(rcu_cpu_stall_timeout, 300);
                till_stall_check = 300;
        }
        return till_stall_check * HZ + RCU_STALL_DELAY_DELTA;
}
EXPORT_SYMBOL_GPL(rcu_jiffies_till_stall_check);

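Obtains the time required for the rcu stall check as a number of jiffies ticks. (3~300 seconds, default 21 seconds, plus RCU_STALL_DELAY_DELTA, which is 5 seconds when CONFIG_PROVE_RCU is enabled)

In code line 3, the value of the module parameter “/sys/module/rcupdate/parameters/rcu_cpu_stall_timeout” is read. (default 21 seconds)
In code lines 9~11, if the value read is less than 3 seconds, it is limited to 3 seconds.
In code lines 12~15, if the value read exceeds 300 seconds, it is limited to 300 seconds.
In code line 16, the resulting value, converted to jiffies, plus RCU_STALL_DELAY_DELTA is returned.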
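
The arithmetic is easy to check with concrete numbers. In the sketch below HZ and the delay delta are passed in as parameters so that different kernel configurations can be tried; stall_check_jiffies() is a hypothetical userspace rewrite, not the kernel function itself.

```c
#include <assert.h>

/*
 * Clamp a stall timeout given in seconds to 3..300 and convert it to
 * jiffies. "hz" and "delta" stand in for HZ and RCU_STALL_DELAY_DELTA.
 */
long stall_check_jiffies(int timeout_sec, long hz, long delta)
{
        assert(hz > 0);
        if (timeout_sec < 3)
                timeout_sec = 3;
        else if (timeout_sec > 300)
                timeout_sec = 300;
        return timeout_sec * hz + delta;
}
```

With the default of 21 seconds, HZ=250, and no delta this gives 5250 jiffies; the retry timeout in synchronize_sched_expedited_wait() would then be 3 * 5250 + 3 = 15753 jiffies.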

rcu_print_task_exp_stall()

kernel/rcu/tree_exp.h

/*
 * Scan the current list of tasks blocked within RCU read-side critical
 * sections, printing out the tid of each that is blocking the current
 * expedited grace period.
 */

static int rcu_print_task_exp_stall(struct rcu_node *rnp)
{
        struct task_struct *t;
        int ndetected = 0;

        if (!rnp->exp_tasks)
                return 0;
        t = list_entry(rnp->exp_tasks->prev,
                       struct task_struct, rcu_node_entry);
        list_for_each_entry_continue(t, &rnp->blkd_tasks, rcu_node_entry) {
                pr_cont(" P%d", t->pid);
                ndetected++;
        }
        return ndetected;
}

Prints the pids of tasks that were preempted and blocked on a preempt kernel, and returns their count.

RCU Expedited IPI handler

rcu_exp_handler()

kernel/rcu/tree_exp.h

/*
 * Remote handler for smp_call_function_single().  If there is an
 * RCU read-side critical section in effect, request that the
 * next rcu_read_unlock() record the quiescent state up the
 * ->expmask fields in the rcu_node tree.  Otherwise, immediately
 * report the quiescent state.
 */

static void rcu_exp_handler(void *unused)
{
        unsigned long flags;
        struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
        struct rcu_node *rnp = rdp->mynode;
        struct task_struct *t = current;

        /*
         * First, the common case of not being in an RCU read-side
         * critical section.  If also enabled or idle, immediately
         * report the quiescent state, otherwise defer.
         */
        if (!t->rcu_read_lock_nesting) {
                if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
                    rcu_dynticks_curr_cpu_in_eqs()) {
                        rcu_report_exp_rdp(rdp);
                } else {
                        rdp->exp_deferred_qs = true;
                        set_tsk_need_resched(t);
                        set_preempt_need_resched();
                }
                return;
        }

        /*
         * Second, the less-common case of being in an RCU read-side
         * critical section.  In this case we can count on a future
         * rcu_read_unlock().  However, this rcu_read_unlock() might
         * execute on some other CPU, but in that case there will be
         * a future context switch.  Either way, if the expedited
         * grace period is still waiting on this CPU, set ->deferred_qs
         * so that the eventual quiescent state will be reported.
         * Note that there is a large group of race conditions that
         * can have caused this quiescent state to already have been
         * reported, so we really do need to check ->expmask.
         */
        if (t->rcu_read_lock_nesting > 0) {
                raw_spin_lock_irqsave_rcu_node(rnp, flags);
                if (rnp->expmask & rdp->grpmask) {
                        rdp->exp_deferred_qs = true;
                        t->rcu_read_unlock_special.b.exp_hint = true;
                }
                raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
                return;
        }

        /*
         * The final and least likely case is where the interrupted
         * code was just about to or just finished exiting the RCU-preempt
         * read-side critical section, and no, we can't tell which.
         * So either way, set ->deferred_qs to flag later code that
         * a quiescent state is required.
         *
         * If the CPU is fully enabled (or if some buggy RCU-preempt
         * read-side critical section is being used from idle), just
         * invoke rcu_preempt_deferred_qs() to immediately report the
         * quiescent state.  We cannot use rcu_read_unlock_special()
         * because we are in an interrupt handler, which will cause that
         * function to take an early exit without doing anything.
         *
         * Otherwise, force a context switch after the CPU enables everything.
         */
        rdp->exp_deferred_qs = true;
        if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
            WARN_ON_ONCE(rcu_dynticks_curr_cpu_in_eqs())) {
                rcu_preempt_deferred_qs(t);
        } else {
                set_tsk_need_resched(t);
                set_preempt_need_resched();
        }
}

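This function runs in response to a remote IPI call so that the expedited qs is reported quickly. It behaves differently depending on whether the task that was running is inside an rcu read-side critical section. If it is, exp_deferred_qs is set; otherwise the completion of the expedited qs is reported.

In code lines 13~23, first, the common case where the current task is not executing inside an rcu read-side critical section. Depending on whether softirq and preemption are both unmasked, or the cpu is in an extended qs (idle), it behaves as follows and then returns.
- Condition met: the expedited qs is reported immediately.
- Condition not met: the expedited deferred qs is set (rdp->exp_deferred_qs = true) and a reschedule is requested.
In code lines 37~45, second, the case where the current task is executing inside an rcu read-side critical section. No qs can be determined here, so the function simply returns. However, if the node is still waiting for the expedited qs of the current cpu (its bit is set in rnp->expmask), the expedited deferred qs is set (rdp->exp_deferred_qs = true), and the exp_hint bit of unlock special is set so that the deferred qs is handled promptly.
- Reference: rcu: Speed up expedited GPs when interrupting RCU reader (2018, v5.0-rc1)
In code lines 63~70, last, the expedited deferred qs is set (rdp->exp_deferred_qs = true), and then, depending on whether softirq and preemption are both unmasked, or the cpu is in an extended qs (idle), it behaves as follows.
- Condition met: if in the deferred qs state the deferred qs is cleared, and if in the blocked state the task is unblocked, after which the qs is reported.
- Condition not met: only a reschedule is requested.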
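
The three cases reduce to a small decision table, sketched below as a pure function. The enum values and exp_handler_action() are invented for this illustration; the inputs stand in for t->rcu_read_lock_nesting, the preempt_count() mask test, rcu_dynticks_curr_cpu_in_eqs(), and the expmask bit test.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for what the handler ends up doing. */
enum exp_action {
        EXP_REPORT_NOW,         /* report the expedited qs immediately */
        EXP_DEFER_RESCHED,      /* set exp_deferred_qs and request a reschedule */
        EXP_DEFER_HINT,         /* set exp_deferred_qs and the exp_hint bit */
        EXP_NOTHING,            /* qs was already reported, nothing to do */
        EXP_DEFERRED_QS_NOW     /* set exp_deferred_qs, then report right away */
};

enum exp_action exp_handler_action(int nesting, bool preempt_or_bh_disabled,
                                   bool in_eqs, bool node_waiting_on_cpu)
{
        bool can_report = !preempt_or_bh_disabled || in_eqs;

        if (nesting == 0)       /* case 1: not in a read-side critical section */
                return can_report ? EXP_REPORT_NOW : EXP_DEFER_RESCHED;
        if (nesting > 0)        /* case 2: inside a read-side critical section */
                return node_waiting_on_cpu ? EXP_DEFER_HINT : EXP_NOTHING;
        /* case 3: interrupted while entering or leaving the critical section */
        return can_report ? EXP_DEFERRED_QS_NOW : EXP_DEFER_RESCHED;
}

/* The common case: an idle or fully enabled cpu reports at once. */
void exp_handler_action_example(void)
{
        assert(exp_handler_action(0, false, false, true) == EXP_REPORT_NOW);
        assert(exp_handler_action(0, true, true, true) == EXP_REPORT_NOW);
}
```

Only the second case looks at the node's expmask, because that is the only path that neither reports nor forces a context switch and therefore has to know whether anyone is still waiting.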

Expedited QS report – to cpu

rcu_report_exp_rdp()

kernel/rcu/tree_exp.h

/*
 * Report expedited quiescent state for specified rcu_data (CPU).
 */

static void rcu_report_exp_rdp(struct rcu_data *rdp)
{
        WRITE_ONCE(rdp->exp_deferred_qs, false);
        rcu_report_exp_cpu_mult(rdp->mynode, rdp->grpmask, true);
}

Reports the expedited qs of the cpu corresponding to @rdp to its node.

rcu_report_exp_cpu_mult()

kernel/rcu/tree_exp.h

/*
 * Report expedited quiescent state for multiple CPUs, all covered by the
 * specified leaf rcu_node structure.
 */

static void rcu_report_exp_cpu_mult(struct rcu_node *rnp,
                                    unsigned long mask, bool wake)
{
        unsigned long flags;

        raw_spin_lock_irqsave_rcu_node(rnp, flags);
        if (!(rnp->expmask & mask)) {
                raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
                return;
        }
        rnp->expmask &= ~mask;
        __rcu_report_exp_rnp(rnp, wake, flags); /* Releases rnp->lock. */
}

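Reports the expedited qs state of the CPUs included in @mask.

In code lines 6~11, the cpu bits in @mask are cleared from the node's expmask. If none of them is set in expmask, the function returns.
In code line 12, the expedited qs state is reported to the node.

Expedited QS report – to node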

rcu_report_exp_rnp()

kernel/rcu/tree_exp.h

/*
 * Report expedited quiescent state for specified node.  This is a
 * lock-acquisition wrapper function for __rcu_report_exp_rnp().
 */

static void __maybe_unused rcu_report_exp_rnp(struct rcu_node *rnp, bool wake)
{
        unsigned long flags;

        raw_spin_lock_irqsave_rcu_node(rnp, flags);
        __rcu_report_exp_rnp(rnp, wake, flags);
}

Reports the expedited qs state to the node.

__rcu_report_exp_rnp()

kernel/rcu/tree_exp.h

/*
 * Report the exit from RCU read-side critical section for the last task
 * that queued itself during or before the current expedited preemptible-RCU
 * grace period.  This event is reported either to the rcu_node structure on
 * which the task was queued or to one of that rcu_node structure's ancestors,
 * recursively up the tree.  (Calm down, calm down, we do the recursion
 * iteratively!)
 *
 * Caller must hold the specified rcu_node structure's ->lock.
 */

static void __rcu_report_exp_rnp(struct rcu_node *rnp,
                                 bool wake, unsigned long flags)
        __releases(rnp->lock)
{
        unsigned long mask;

        for (;;) {
                if (!sync_rcu_preempt_exp_done(rnp)) {
                        if (!rnp->expmask)
                                rcu_initiate_boost(rnp, flags);
                        else
                                raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
                        break;
                }
                if (rnp->parent == NULL) {
                        raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
                        if (wake) {
                                smp_mb(); /* EGP done before wake_up(). */
                                swake_up_one(&rcu_state.expedited_wq);
                        }
                        break;
                }
                mask = rnp->grpmask;
                raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled */
                rnp = rnp->parent;
                raw_spin_lock_rcu_node(rnp); /* irqs already disabled */
                WARN_ON_ONCE(!(rnp->expmask & mask));
                rnp->expmask &= ~mask;
        }
}

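Reports the expedited qs state to the node.

In code lines 7~14, the loop walks from the requested node toward the root node. If the expedited qs of the node has not fully completed, the loop is exited; in that case, if expmask is already clear and only tasks blocked in the expedited gp remain, the priority boost thread is woken up first.
In code lines 15~22, if there is no parent node (the root node), the lock is released, the task waiting on rcu_state.expedited_wq is woken up when @wake is set, and the function returns.
In code lines 23~28, the loop moves to the parent node and clears the current node's grpmask from the parent's expmask.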
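
Taken together, rcu_report_exp_cpu_mult() and __rcu_report_exp_rnp() clear bits bottom-up until some level still has work outstanding. The userspace sketch below follows that flow without locking, boosting, or wakeups; struct exp_node and report_exp_qs() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Reduced stand-in for struct rcu_node. */
struct exp_node {
        unsigned long expmask;
        unsigned long grpmask;          /* this node's bit in parent->expmask */
        bool has_blocked_tasks;         /* stand-in for exp_tasks != NULL */
        struct exp_node *parent;
};

/*
 * Clear @mask from a leaf and propagate upward.
 * Returns true only when the root node has become fully quiescent.
 */
bool report_exp_qs(struct exp_node *rnp, unsigned long mask)
{
        assert(rnp != NULL);
        if (!(rnp->expmask & mask))
                return false;           /* nothing to report */
        rnp->expmask &= ~mask;
        for (;;) {
                if (rnp->has_blocked_tasks || rnp->expmask)
                        return false;   /* this level is still waiting */
                if (!rnp->parent)
                        return true;    /* root done: the gp can end */
                mask = rnp->grpmask;
                rnp = rnp->parent;
                rnp->expmask &= ~mask;  /* child finished, clear its bit */
        }
}
```

In the kernel the point where this sketch returns true corresponds to the swake_up_one() on rcu_state.expedited_wq, which lets synchronize_sched_expedited_wait() return.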

sync_rcu_preempt_exp_done()

kernel/rcu/tree_exp.h

/*
 * Return non-zero if there is no RCU expedited grace period in progress
 * for the specified rcu_node structure, in other words, if all CPUs and
 * tasks covered by the specified rcu_node structure have done their bit
 * for the current expedited grace period.  Works only for preemptible
 * RCU -- other RCU implementation use other means.
 *
 * Caller must hold the specificed rcu_node structure's ->lock
 */

static bool sync_rcu_preempt_exp_done(struct rcu_node *rnp)
{
        raw_lockdep_assert_held_rcu_node(rnp);

        return rnp->exp_tasks == NULL &&
               READ_ONCE(rnp->expmask) == 0;
}

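Returns whether the expedited qs of the node has fully completed.

That is, there are no tasks blocked on the expedited gp, and
the node's expmask is fully cleared, so the expedited qs of every child (rcu_node or rcu_data) has completed.

References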

RCU(Read Copy Update) -1- (Basic) | 문c
RCU(Read Copy Update) -2- (Callback process) | 문c
RCU(Read Copy Update) -3- (RCU threads) | 문c
RCU(Read Copy Update) -4- (NOCB process) | 문c
RCU(Read Copy Update) -5- (Callback list) | 문c
RCU(Read Copy Update) -6- (Expedited GP) | 문c (this post)
RCU(Read Copy Update) -7- (Preemptible RCU) | 문c
rcu_init() | 문c
wait_for_completion() | 문c

RCU(Read Copy Update) -5- (Callback list)

2021-03-09 2023-07-22 문영일

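RCU callback list

RCU segmented callback list

In kernel v4.12 the way callback lists are managed changed: the callback list is now divided into four segments, and it was renamed the segmented callback list.

The following four segments are examined.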

include/linux/rcu_segcblist.h

/* Complicated segmented callback lists.  ;-) */

/*
 * Index values for segments in rcu_segcblist structure.
 *
 * The segments are as follows:
 *
 * [head, *tails[RCU_DONE_TAIL]):
 *      Callbacks whose grace period has elapsed, and thus can be invoked.
 * [*tails[RCU_DONE_TAIL], *tails[RCU_WAIT_TAIL]):
 *      Callbacks waiting for the current GP from the current CPU's viewpoint.
 * [*tails[RCU_WAIT_TAIL], *tails[RCU_NEXT_READY_TAIL]):
 *      Callbacks that arrived before the next GP started, again from
 *      the current CPU's viewpoint.  These can be handled by the next GP.
 * [*tails[RCU_NEXT_READY_TAIL], *tails[RCU_NEXT_TAIL]):
 *      Callbacks that might have arrived after the next GP started.
 *      There is some uncertainty as to when a given GP starts and
 *      ends, but a CPU knows the exact times if it is the one starting
 *      or ending the GP.  Other CPUs know that the previous GP ends
 *      before the next one starts.
 *
 * Note that RCU_WAIT_TAIL cannot be empty unless RCU_NEXT_READY_TAIL is also
 * empty.
 *
 * The ->gp_seq[] array contains the grace-period number at which the
 * corresponding segment of callbacks will be ready to invoke.  A given
 * element of this array is meaningful only when the corresponding segment
 * is non-empty, and it is never valid for RCU_DONE_TAIL (whose callbacks
 * are already ready to invoke) or for RCU_NEXT_TAIL (whose callbacks have
 * not yet been assigned a grace-period number).
 */

#define RCU_DONE_TAIL           0       /* Also RCU_WAIT head. */
#define RCU_WAIT_TAIL           1       /* Also RCU_NEXT_READY head. */
#define RCU_NEXT_READY_TAIL     2       /* Also RCU_NEXT head. */
#define RCU_NEXT_TAIL           3
#define RCU_CBLIST_NSEGS        4

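The callback list is managed in four segments: the done segment, the wait segment, the next-ready segment, and the next segment.

First, the meanings of tails[0] and *tails[0] must be distinguished.

tails[0]
- The location that tails[0] points to
*tails[0]
- The value stored at tails[0], which points to the next callback.

Bracket notation for ranges

Keep in mind the following bracket notation for ranges of numbers.

[1, 5]
- Real numbers: 1 <= x <= 5
- Integers: 1, 2, 3, 4, 5
(1, 5)
- Real numbers: 1 < x < 5
- Integers: 2, 3, 4
[1, 5)
- Real numbers: 1 <= x < 5
- Integers: 1, 2, 3, 4

[head, *tails[0])

From the head position up to, but not including, the next callback that the value of tails[0] points to
e.g. CB1, CB2, and CB3 are included in the range.
- head points to CB1, tails[0] points to the next field of CB3, and *tails[0] is CB4
- CB1 -> CB2 -> CB3 -> CB4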
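
The difference between tails[0] and *tails[0] becomes concrete when the half-open range is walked in code. This is a minimal sketch with a made-up struct cb and count_range(); it assumes the range is well formed, that is, *tail is reachable from head.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback node; only the link matters here. */
struct cb {
        struct cb *next;
        int id;
};

/*
 * Count the callbacks in the half-open range [head, *tail).
 * "tail" points at a ->next field (or at the head pointer itself),
 * and "*tail" is the first callback that is NOT part of the range.
 */
int count_range(struct cb *head, struct cb **tail)
{
        int n = 0;

        assert(tail != NULL);
        for (struct cb *p = head; p != *tail; p = p->next)
                n++;
        return n;
}
```

For the chain CB1 -> CB2 -> CB3 -> CB4 with tail set to &CB3.next, the walk stops when it reaches CB4 and reports three callbacks. An empty list is represented by tail pointing at the head pointer itself, in which case head == *tail and the count is zero.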

The rcu segmented callback list is managed in the following four segments.

To manage the four segments, the cb list uses one head and four pointers, tails[0~3].

One callback list = rdp->cblist = rsclp
One start: rsclp->head
Pointer array for the four segments: rsclp->tails[]
- 1st, done segment:
  - [head, *tails[0])
  - From the callback linked at head through tails[RCU_DONE_TAIL]
  - The g.p. has completed, so these callbacks can be invoked. Callbacks that could not all be invoked because of the blimit limit may remain in this segment.
  - Callbacks in this segment can be invoked at any time.
- 2nd, wait segment
  - [*tails[0], *tails[1])
  - From the callback following tails[RCU_DONE_TAIL] through tails[RCU_WAIT_TAIL]
  - Callbacks added before the current gp wait here and are invoked after the current gp ends.
- 3rd, next-ready segment (= second wait segment)
  - [*tails[1], *tails[2])
  - From the callback following tails[RCU_WAIT_TAIL] through tails[RCU_NEXT_READY_TAIL]
  - Callbacks added while the current gp is in progress are placed here. They must not be invoked as soon as the current gp completes; they are invoked after the next gp completes.
  - Callbacks in this segment must be processed only after both the current gp and the next gp have completed.
  - Classic rcu had no such segment and used a single wait segment. The segment was added for preemptible rcu to extend the wait by one more gp: so that rcu_read_lock() in preemptible rcu does not need a memory barrier, processing is delayed by one more gp stage.
- 4th, next segment
  - [*tails[2], *tails[3])
  - From the callback following tails[RCU_NEXT_READY_TAIL] through tails[RCU_NEXT_TAIL]
  - A new callback is appended to the end of this segment, and tails[RCU_NEXT_TAIL] must always be updated to point to the next field of the last callback added.
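The four half-open ranges above can be checked with a small user-space sketch. It assumes a simplified layout: struct cb, struct seglist, and the seglist_*() helpers are hypothetical names that mirror only head and tails[] of struct rcu_segcblist, without counters, gp_seq[], or memory ordering.

```c
#include <assert.h>
#include <stddef.h>

#define NSEGS 4         /* mirrors RCU_CBLIST_NSEGS */

struct cb {
        struct cb *next;
};

/* Hypothetical reduced form of struct rcu_segcblist. */
struct seglist {
        struct cb *head;
        struct cb **tails[NSEGS];
};

static void seglist_init(struct seglist *sl)
{
        sl->head = NULL;
        for (int i = 0; i < NSEGS; i++)
                sl->tails[i] = &sl->head;       /* every segment is empty */
}

/* New callbacks always go to the end of the last (next) segment. */
static void seglist_enqueue(struct seglist *sl, struct cb *cb)
{
        cb->next = NULL;
        *sl->tails[NSEGS - 1] = cb;
        sl->tails[NSEGS - 1] = &cb->next;
}

/* Number of callbacks in segment 'seg', i.e. in [start, *tails[seg]). */
static int seglist_seg_len(const struct seglist *sl, int seg)
{
        const struct cb *p, *end;
        int n = 0;

        assert(seg >= 0 && seg < NSEGS);
        p = seg == 0 ? sl->head : *sl->tails[seg - 1];
        end = *sl->tails[seg];
        for (; p != end; p = p->next)
                n++;
        return n;
}
```

A segment is empty exactly when its tail pointer equals the tail pointer of the previous segment (or &head for the done segment), which is why seglist_seg_len() returns 0 for every segment except next right after enqueueing.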

Callbacks normally progress as described above, but with more than 16 CPUs they may lag by one stage, and conversely, when the acceleration conditions are met, they may advance by one stage.

Un-segmented callback list

The rcu un-segmented callback list is not divided into segments; it is managed as a single list.

rcu_cblist_init()

kernel/rcu/rcu_segcblist.c

/* Initialize simple callback list. */
void rcu_cblist_init(struct rcu_cblist *rclp)
{
        rclp->head = NULL;
        rclp->tail = &rclp->head;
        rclp->len = 0;
        rclp->len_lazy = 0;
}

Initializes an rcu callback list. (Note: this is the unsegmented callback list.)

rcu_cblist_enqueue()

kernel/rcu/rcu_segcblist.c

/*
 * Enqueue an rcu_head structure onto the specified callback list.
 * This function assumes that the callback is non-lazy because it
 * is intended for use by no-CBs CPUs, which do not distinguish
 * between lazy and non-lazy RCU callbacks.
 */

void rcu_cblist_enqueue(struct rcu_cblist *rclp, struct rcu_head *rhp)
{
        *rclp->tail = rhp;
        rclp->tail = &rhp->next;
        WRITE_ONCE(rclp->len, rclp->len + 1);
};

Appends an rcu callback to the tail of the rcu callback list.

The following figure shows an rcu callback being enqueued on an rcu callback list.

rcu_cblist_dequeue()

kernel/rcu/rcu_segcblist.c

/*
 * Dequeue the oldest rcu_head structure from the specified callback
 * list.  This function assumes that the callback is non-lazy, but
 * the caller can later invoke rcu_cblist_dequeued_lazy() if it
 * finds otherwise (and if it cares about laziness).  This allows
 * different users to have different ways of determining laziness.
 */

struct rcu_head *rcu_cblist_dequeue(struct rcu_cblist *rclp)
{
        struct rcu_head *rhp;

        rhp = rclp->head;
        if (!rhp)
                return NULL;
        rclp->len--;
        rclp->head = rhp->next;
        if (!rclp->head)
                rclp->tail = &rclp->head;
        return rhp;
};

Dequeues one rcu callback from the front of the rcu callback list.

The following figure shows an rcu callback being dequeued from an rcu callback list.
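The enqueue and dequeue behavior of the unsegmented list can be reproduced in user space with the same head and tail pointer-to-pointer scheme. This is a minimal sketch, assuming hypothetical names (struct cb, struct cblist, cblist_*()) and omitting len_lazy and WRITE_ONCE(). Unlike rcu_cblist_enqueue(), the sketch clears cb->next itself instead of relying on the caller.

```c
#include <assert.h>
#include <stddef.h>

struct cb {
        struct cb *next;
        int id;                 /* illustration only */
};

/* Hypothetical reduced form of struct rcu_cblist. */
struct cblist {
        struct cb *head;
        struct cb **tail;       /* &head when empty, else &last->next */
        long len;
};

static void cblist_init(struct cblist *cl)
{
        cl->head = NULL;
        cl->tail = &cl->head;
        cl->len = 0;
}

/* Append at the tail in O(1): no walk and no empty-list special case. */
static void cblist_enqueue(struct cblist *cl, struct cb *cb)
{
        assert(cb != NULL);
        cb->next = NULL;
        *cl->tail = cb;         /* writes head if empty, else last->next */
        cl->tail = &cb->next;
        cl->len++;
}

/* Remove the oldest callback, or return NULL if the list is empty. */
static struct cb *cblist_dequeue(struct cblist *cl)
{
        struct cb *cb = cl->head;

        if (!cb)
                return NULL;
        cl->len--;
        cl->head = cb->next;
        if (!cl->head)
                cl->tail = &cl->head;   /* list became empty again */
        return cb;
}
```

Because tail always points at the pointer that must be written next, enqueueing onto an empty list and onto a non-empty list use the same two stores.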

rcu_cblist_flush_enqueue()

kernel/rcu/rcu_segcblist.c

/*
 * Flush the second rcu_cblist structure onto the first one, obliterating
 * any contents of the first.  If rhp is non-NULL, enqueue it as the sole
 * element of the second rcu_cblist structure, but ensuring that the second
 * rcu_cblist structure, if initially non-empty, always appears non-empty
 * throughout the process.  If rdp is NULL, the second rcu_cblist structure
 * is instead initialized to empty.
 */

void rcu_cblist_flush_enqueue(struct rcu_cblist *drclp,
                              struct rcu_cblist *srclp,
                              struct rcu_head *rhp)
{
        drclp->head = srclp->head;
        if (drclp->head)
                drclp->tail = srclp->tail;
        else
                drclp->tail = &drclp->head;
        drclp->len = srclp->len;
        drclp->len_lazy = srclp->len_lazy;
        if (!rhp) {
                rcu_cblist_init(srclp);
        } else {
                rhp->next = NULL;
                srclp->head = rhp;
                srclp->tail = &rhp->next;
                WRITE_ONCE(srclp->len, 1);
                srclp->len_lazy = 0;
        }
};

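Moves all callbacks on callback list @srclp to callback list @drclp (any previous contents of @drclp are overwritten), then makes the new callback @rhp the sole element of @srclp. (If @rhp is null, @srclp is initialized to empty.)

The following figure shows the source rcu callback list being flushed to the destination and a new callback being placed on the source.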
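The direction of the flush (source to destination) is easy to get backwards, so here is a minimal user-space sketch of the same steps. The names struct cb, struct cblist, cblist_init(), and cblist_flush_enqueue() are hypothetical, and len_lazy and WRITE_ONCE() are left out.

```c
#include <assert.h>
#include <stddef.h>

struct cb {
        struct cb *next;
};

/* Hypothetical reduced form of struct rcu_cblist. */
struct cblist {
        struct cb *head;
        struct cb **tail;
        long len;
};

static void cblist_init(struct cblist *cl)
{
        cl->head = NULL;
        cl->tail = &cl->head;
        cl->len = 0;
}

/*
 * Overwrite 'dst' with the contents of 'src'.  Then leave 'src' holding
 * only 'cb', or empty when 'cb' is NULL.
 */
static void cblist_flush_enqueue(struct cblist *dst, struct cblist *src,
                                 struct cb *cb)
{
        assert(dst != src);
        dst->head = src->head;
        /* An empty source must not leak its own &src->head into dst. */
        dst->tail = dst->head ? src->tail : &dst->head;
        dst->len = src->len;
        if (!cb) {
                cblist_init(src);
        } else {
                cb->next = NULL;
                src->head = cb;
                src->tail = &cb->next;
                src->len = 1;
        }
}
```

Note that the source never passes through an empty state when cb is non-NULL: its head goes directly from the old first callback to cb.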

rcu_cblist_n_cbs()

kernel/rcu/rcu_segcblist.h

/* Return number of callbacks in the specified callback list. */
static inline long rcu_cblist_n_cbs(struct rcu_cblist *rclp)
{
        return READ_ONCE(rclp->len);
}

Returns the number of callbacks on the callback list.

rcu_cblist_dequeued_lazy()

kernel/rcu/rcu_segcblist.h

/*
 * Account for the fact that a previously dequeued callback turned out
 * to be marked as lazy.
 */

static inline void rcu_cblist_dequeued_lazy(struct rcu_cblist *rclp)
{
        rclp->len_lazy--;
}

Decrements the lazy counter of the callback list.

Segmented callback list

rcu_segcblist_init()

kernel/rcu/rcu_segcblist.c

/*
 * Initialize an rcu_segcblist structure.
 */

void rcu_segcblist_init(struct rcu_segcblist *rsclp)
{
        int i;

        BUILD_BUG_ON(RCU_NEXT_TAIL + 1 != ARRAY_SIZE(rsclp->gp_seq));
        BUILD_BUG_ON(ARRAY_SIZE(rsclp->tails) != ARRAY_SIZE(rsclp->gp_seq));
        rsclp->head = NULL;
        for (i = 0; i < RCU_CBLIST_NSEGS; i++)
                rsclp->tails[i] = &rsclp->head;
        rcu_segcblist_set_len(rsclp, 0);
        rsclp->len_lazy = 0;
        rsclp->enabled = 1;
};

Initializes an rcu segmented callback list.

The following figure shows a segmented callback list being initialized.

rcu_segcblist_set_len()

kernel/rcu/rcu_segcblist.c

/* Set the length of an rcu_segcblist structure. */
void rcu_segcblist_set_len(struct rcu_segcblist *rsclp, long v)
{
#ifdef CONFIG_RCU_NOCB_CPU
        atomic_long_set(&rsclp->len, v);
#else
        WRITE_ONCE(rsclp->len, v);
#endif
}

Sets the callback count of the rcu seg callback list to @v.

rcu_segcblist_add_len()

kernel/rcu/rcu_segcblist.c

/*
 * Increase the numeric length of an rcu_segcblist structure by the
 * specified amount, which can be negative.  This can cause the ->len
 * field to disagree with the actual number of callbacks on the structure.
 * This increase is fully ordered with respect to the callers accesses
 * both before and after.
 */

void rcu_segcblist_add_len(struct rcu_segcblist *rsclp, long v)
{
#ifdef CONFIG_RCU_NOCB_CPU
        smp_mb__before_atomic(); /* Up to the caller! */
        atomic_long_add(v, &rsclp->len);
        smp_mb__after_atomic(); /* Up to the caller! */
#else
        smp_mb(); /* Up to the caller! */
        WRITE_ONCE(rsclp->len, rsclp->len + v);
        smp_mb(); /* Up to the caller! */
#endif
};

Adds @v to the callback count of the rcu seg callback list. (@v may be negative.)

rcu_segcblist_inc_len()

kernel/rcu/rcu_segcblist.c

/*
 * Increase the numeric length of an rcu_segcblist structure by one.
 * This can cause the ->len field to disagree with the actual number of
 * callbacks on the structure.  This increase is fully ordered with respect
 * to the callers accesses both before and after.
 */

void rcu_segcblist_inc_len(struct rcu_segcblist *rsclp)
{
        rcu_segcblist_add_len(rsclp, 1);
};

Increments the callback count of the rcu seg callback list by 1.

rcu_segcblist_xchg_len()

kernel/rcu/rcu_segcblist.c

/*
 * Exchange the numeric length of the specified rcu_segcblist structure
 * with the specified value.  This can cause the ->len field to disagree
 * with the actual number of callbacks on the structure.  This exchange is
 * fully ordered with respect to the callers accesses both before and after.
 */

long rcu_segcblist_xchg_len(struct rcu_segcblist *rsclp, long v)
{
#ifdef CONFIG_RCU_NOCB_CPU
        return atomic_long_xchg(&rsclp->len, v);
#else
        long ret = rsclp->len;

        smp_mb(); /* Up to the caller! */
        WRITE_ONCE(rsclp->len, v);
        smp_mb(); /* Up to the caller! */
        return ret;
#endif
};

Replaces the callback count of the rcu seg callback list with @v and returns the previous value.

rcu_segcblist_disable()

kernel/rcu/rcu_segcblist.c

/*
 * Disable the specified rcu_segcblist structure, so that callbacks can
 * no longer be posted to it.  This structure must be empty.
 */

void rcu_segcblist_disable(struct rcu_segcblist *rsclp)
{
        WARN_ON_ONCE(!rcu_segcblist_empty(rsclp));
        WARN_ON_ONCE(rcu_segcblist_n_cbs(rsclp));
        WARN_ON_ONCE(rcu_segcblist_n_lazy_cbs(rsclp));
        rsclp->enabled = 0;
};

Disables the rcu seg callback list. The list must be empty.

rcu_segcblist_offload()

kernel/rcu/rcu_segcblist.c

/*
 * Mark the specified rcu_segcblist structure as offloaded.  This
 * structure must be empty.
 */

void rcu_segcblist_offload(struct rcu_segcblist *rsclp)
{
        rsclp->offloaded = 1;
};

Marks the rcu seg callback list as offloaded.

rcu_segcblist_ready_cbs()

kernel/rcu/rcu_segcblist.c

/*
 * Does the specified rcu_segcblist structure contain callbacks that
 * are ready to be invoked?
 */

bool rcu_segcblist_ready_cbs(struct rcu_segcblist *rsclp)
{
        return rcu_segcblist_is_enabled(rsclp) &&
               &rsclp->head != rsclp->tails[RCU_DONE_TAIL];
};

Returns whether the rcu seg callback list contains callbacks that are ready to be invoked.

That is, it returns whether the done segment contains any callbacks.

rcu_segcblist_pend_cbs()

kernel/rcu/rcu_segcblist.c

/*
 * Does the specified rcu_segcblist structure contain callbacks that
 * are still pending, that is, not yet ready to be invoked?
 */

bool rcu_segcblist_pend_cbs(struct rcu_segcblist *rsclp)
{
        return rcu_segcblist_is_enabled(rsclp) &&
               !rcu_segcblist_restempty(rsclp, RCU_DONE_TAIL);
};

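Returns whether the rcu seg callback list contains pending callbacks, that is, callbacks after the done segment that are not yet ready to be invoked.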

rcu_segcblist_first_cb()

kernel/rcu/rcu_segcblist.c

/*
 * Return a pointer to the first callback in the specified rcu_segcblist
 * structure.  This is useful for diagnostics.
 */

struct rcu_head *rcu_segcblist_first_cb(struct rcu_segcblist *rsclp)
{
        if (rcu_segcblist_is_enabled(rsclp))
                return rsclp->head;
        return NULL;
};

Returns the first callback of the rcu seg callback list.

rcu_segcblist_first_pend_cb()

kernel/rcu/rcu_segcblist.c

/*
 * Return a pointer to the first pending callback in the specified
 * rcu_segcblist structure.  This is useful just after posting a given
 * callback -- if that callback is the first pending callback, then
 * you cannot rely on someone else having already started up the required
 * grace period.
 */

struct rcu_head *rcu_segcblist_first_pend_cb(struct rcu_segcblist *rsclp)
{
        if (rcu_segcblist_is_enabled(rsclp))
                return *rsclp->tails[RCU_DONE_TAIL];
        return NULL;
};

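Returns the first pending callback of the rcu seg callback list, that is, the first callback after the done segment, which is not yet ready to be invoked.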

rcu_segcblist_nextgp()

kernel/rcu/rcu_segcblist.c

/*
 * Return false if there are no CBs awaiting grace periods, otherwise,
 * return true and store the nearest waited-upon grace period into *lp.
 */

bool rcu_segcblist_nextgp(struct rcu_segcblist *rsclp, unsigned long *lp)
{
        if (!rcu_segcblist_pend_cbs(rsclp))
                return false;
        *lp = rsclp->gp_seq[RCU_WAIT_TAIL];
        return true;
};

Returns whether the rcu seg callback list has pending callbacks. If it does, the gp sequence number that the wait segment is waiting for is stored in the output argument @lp.

rcu_segcblist_enqueue()

kernel/rcu/rcu_segcblist.c

/*
 * Enqueue the specified callback onto the specified rcu_segcblist
 * structure, updating accounting as needed.  Note that the ->len
 * field may be accessed locklessly, hence the WRITE_ONCE().
 * The ->len field is used by rcu_barrier() and friends to determine
 * if it must post a callback on this structure, and it is OK
 * for rcu_barrier() to sometimes post callbacks needlessly, but
 * absolutely not OK for it to ever miss posting a callback.
 */

void rcu_segcblist_enqueue(struct rcu_segcblist *rsclp,
                           struct rcu_head *rhp, bool lazy)
{
        rcu_segcblist_inc_len(rsclp);
        if (lazy)
                rsclp->len_lazy++;
        smp_mb(); /* Ensure counts are updated before callback is enqueued. */
        rhp->next = NULL;
        WRITE_ONCE(*rsclp->tails[RCU_NEXT_TAIL], rhp);
        WRITE_ONCE(rsclp->tails[RCU_NEXT_TAIL], &rhp->next);
};

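Enqueues an rcu callback on the rcu seg callback list.

In code line 4, the rcu cb count is incremented by one.
In code lines 5~6, for a @lazy request, the len_lazy counter is also incremented.
In code line 7, a memory barrier ensures that the counters are fully updated before tails[] is handled. (Readers check tails[] first and then look at the counters.)
In code lines 8~10, the cb is appended after the location that tails[RCU_NEXT_TAIL] points to, that is, at the very end of the list.
- tails[RCU_NEXT_TAIL] always points to the next field of the most recently added cb.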

rcu_segcblist_entrain()

kernel/rcu/rcu_segcblist.c

/*
 * Entrain the specified callback onto the specified rcu_segcblist at
 * the end of the last non-empty segment.  If the entire rcu_segcblist
 * is empty, make no change, but return false.
 *
 * This is intended for use by rcu_barrier()-like primitives, -not-
 * for normal grace-period use.  IMPORTANT:  The callback you enqueue
 * will wait for all prior callbacks, NOT necessarily for a grace
 * period.  You have been warned.
 */

bool rcu_segcblist_entrain(struct rcu_segcblist *rsclp,
                           struct rcu_head *rhp, bool lazy)
{
        int i;

        if (rcu_segcblist_n_cbs(rsclp) == 0)
                return false;
        rcu_segcblist_inc_len(rsclp);
        if (lazy)
                rsclp->len_lazy++;
        smp_mb(); /* Ensure counts are updated before callback is entrained. */
        rhp->next = NULL;
        for (i = RCU_NEXT_TAIL; i > RCU_DONE_TAIL; i--)
                if (rsclp->tails[i] != rsclp->tails[i - 1])
                        break;
        WRITE_ONCE(*rsclp->tails[i], rhp);
        for (; i <= RCU_NEXT_TAIL; i++)
                WRITE_ONCE(rsclp->tails[i], &rhp->next);
        return true;
};

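Adds an rcu callback to the rcu seg callback list. The callback is appended to the end of the last non-empty segment and true is returned. The next, next-ready, and wait segments are checked in that order; if all three are empty, the callback goes to the end of the done segment. If the seg callback list holds no callbacks at all, false is returned.

In code lines 6~7, if the seg callback list is empty, false is returned.
In code line 8, the callback count is incremented by 1.
In code lines 9~10, for a @lazy request, the lazy counter is incremented by 1.
In code line 11, a memory barrier is executed because the counters must be updated before the callback is added.
In code line 12, null is assigned because nothing follows the callback being added.
In code lines 13~15, the last segment that holds callbacks is searched for, excluding done (in the order i=3, 2, 1). If none is found, i ends at 0 (done).
In code line 16, the callback is added to the segment that was found.
In code lines 17~18, the tails[] pointers from the found segment through the last segment are set to point to the newly added callback.
In code line 19, true is returned to indicate success.

The following figure shows a callback being added at the appropriate position in the rcu seg callback list.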
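The segment search in the entrain step can be sketched in user space as follows. This is a simplified illustration, not the kernel function: struct cb, struct seglist, seglist_init(), and seglist_entrain() are hypothetical names, and the lazy counter, memory barrier, and WRITE_ONCE() are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NSEGS 4         /* mirrors RCU_CBLIST_NSEGS */

struct cb {
        struct cb *next;
};

/* Hypothetical reduced form of struct rcu_segcblist. */
struct seglist {
        struct cb *head;
        struct cb **tails[NSEGS];
        long len;
};

static void seglist_init(struct seglist *sl)
{
        sl->head = NULL;
        for (int i = 0; i < NSEGS; i++)
                sl->tails[i] = &sl->head;
        sl->len = 0;
}

/* Append 'cb' to the end of the last non-empty segment. */
static bool seglist_entrain(struct seglist *sl, struct cb *cb)
{
        int i;

        assert(cb != NULL);
        if (sl->len == 0)
                return false;   /* nothing to wait behind */
        sl->len++;
        cb->next = NULL;
        /* A segment is empty when its tail equals the previous tail. */
        for (i = NSEGS - 1; i > 0; i--)
                if (sl->tails[i] != sl->tails[i - 1])
                        break;
        *sl->tails[i] = cb;     /* i == 0 means the done segment */
        for (; i < NSEGS; i++)
                sl->tails[i] = &cb->next;
        return true;
}
```

When only the done segment holds callbacks, the loop runs down to i == 0 and the new callback joins the done segment, which matches the warning in the kernel comment that an entrained callback waits for prior callbacks, not necessarily for a grace period.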

rcu_segcblist_empty()

kernel/rcu/rcu_segcblist.h

/*
 * Is the specified rcu_segcblist structure empty?
 *
 * But careful!  The fact that the ->head field is NULL does not
 * necessarily imply that there are no callbacks associated with
 * this structure.  When callbacks are being invoked, they are
 * removed as a group.  If callback invocation must be preempted,
 * the remaining callbacks will be added back to the list.  Either
 * way, the counts are updated later.
 *
 * So it is often the case that rcu_segcblist_n_cbs() should be used
 * instead.
 */

static inline bool rcu_segcblist_empty(struct rcu_segcblist *rsclp)
{
        return !READ_ONCE(rsclp->head);
}

Returns whether the seg callback list is empty.

rcu_segcblist_n_cbs()

kernel/rcu/rcu_segcblist.h

/* Return number of callbacks in segmented callback list. */
static inline long rcu_segcblist_n_cbs(struct rcu_segcblist *rsclp)
{
#ifdef CONFIG_RCU_NOCB_CPU
        return atomic_long_read(&rsclp->len);
#else
        return READ_ONCE(rsclp->len);
#endif
}

Returns the number of callbacks on the seg callback list.

rcu_segcblist_n_lazy_cbs()

kernel/rcu/rcu_segcblist.h

/* Return number of lazy callbacks in segmented callback list. */
static inline long rcu_segcblist_n_lazy_cbs(struct rcu_segcblist *rsclp)
{
        return rsclp->len_lazy;
}

Returns the number of lazy callbacks on the seg callback list.

rcu_segcblist_n_nonlazy_cbs()

kernel/rcu/rcu_segcblist.h

/* Return number of lazy callbacks in segmented callback list. */
static inline long rcu_segcblist_n_nonlazy_cbs(struct rcu_segcblist *rsclp)
{
        return rcu_segcblist_n_cbs(rsclp) - rsclp->len_lazy;
}

Returns the number of non-lazy callbacks on the seg callback list.

rcu_segcblist_is_enabled()

kernel/rcu/rcu_segcblist.h

/*
 * Is the specified rcu_segcblist enabled, for example, not corresponding
 * to an offline CPU?
 */

static inline bool rcu_segcblist_is_enabled(struct rcu_segcblist *rsclp)
{
        return rsclp->enabled;
}

Returns whether the seg callback list is enabled.

rcu_segcblist_is_offloaded()

kernel/rcu/rcu_segcblist.h

/* Is the specified rcu_segcblist offloaded?  */
static inline bool rcu_segcblist_is_offloaded(struct rcu_segcblist *rsclp)
{
        return rsclp->offloaded;
}

Returns whether the seg callback list is offloaded.

rcu_segcblist_restempty()

kernel/rcu/rcu_segcblist.h

/*
 * Are all segments following the specified segment of the specified
 * rcu_segcblist structure empty of callbacks?  (The specified
 * segment might well contain callbacks.)
 */

static inline bool rcu_segcblist_restempty(struct rcu_segcblist *rsclp, int seg)
{
        return !READ_ONCE(*READ_ONCE(rsclp->tails[seg]));
}

Returns whether all segments after segment @seg of the seg callback list are empty of callbacks.

Moving RCU callbacks

rcu_segcblist_merge()

kernel/rcu/rcu_segcblist.c

/*
 * Merge the source rcu_segcblist structure into the destination
 * rcu_segcblist structure, then initialize the source.  Any pending
 * callbacks from the source get to start over.  It is best to
 * advance and accelerate both the destination and the source
 * before merging.
 */

void rcu_segcblist_merge(struct rcu_segcblist *dst_rsclp,
                         struct rcu_segcblist *src_rsclp)
{
        struct rcu_cblist donecbs;
        struct rcu_cblist pendcbs;

        rcu_cblist_init(&donecbs);
        rcu_cblist_init(&pendcbs);
        rcu_segcblist_extract_count(src_rsclp, &donecbs);
        rcu_segcblist_extract_done_cbs(src_rsclp, &donecbs);
        rcu_segcblist_extract_pend_cbs(src_rsclp, &pendcbs);
        rcu_segcblist_insert_count(dst_rsclp, &donecbs);
        rcu_segcblist_insert_done_cbs(dst_rsclp, &donecbs);
        rcu_segcblist_insert_pend_cbs(dst_rsclp, &pendcbs);
        rcu_segcblist_init(src_rsclp);
}

Moving the done segment -> rcu callback list

rcu_segcblist_extract_done_cbs()

kernel/rcu/rcu_segcblist.c

/*
 * Extract only those callbacks ready to be invoked from the specified
 * rcu_segcblist structure and place them in the specified rcu_cblist
 * structure.
 */

void rcu_segcblist_extract_done_cbs(struct rcu_segcblist *rsclp,
                                    struct rcu_cblist *rclp)
{
        int i;

        if (!rcu_segcblist_ready_cbs(rsclp))
                return; /* Nothing to do. */
        *rclp->tail = rsclp->head;
        WRITE_ONCE(rsclp->head, *rsclp->tails[RCU_DONE_TAIL]);
        WRITE_ONCE(*rsclp->tails[RCU_DONE_TAIL], NULL);
        rclp->tail = rsclp->tails[RCU_DONE_TAIL];
        for (i = RCU_CBLIST_NSEGS - 1; i >= RCU_DONE_TAIL; i--)
                if (rsclp->tails[i] == rsclp->tails[RCU_DONE_TAIL])
                        WRITE_ONCE(rsclp->tails[i], &rsclp->head);
};

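Extracts the callbacks of the done segment from the rcu seg callback list and moves them to the rcu callback list.

The following figure shows the done-segment callbacks of the rcu seg callback list being extracted and moved to the rcu callback list.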
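The pointer updates during done-segment extraction can be traced with a user-space sketch. It is a simplified illustration under assumptions: struct cb, struct seglist, struct cblist, and seglist_extract_done() are hypothetical names, and counters and WRITE_ONCE() are omitted.

```c
#include <assert.h>
#include <stddef.h>

#define NSEGS 4         /* mirrors RCU_CBLIST_NSEGS */

struct cb {
        struct cb *next;
};

struct seglist {                /* hypothetical reduced rcu_segcblist */
        struct cb *head;
        struct cb **tails[NSEGS];
};

struct cblist {                 /* hypothetical reduced rcu_cblist */
        struct cb *head;
        struct cb **tail;
};

/* Move the done segment [head, *tails[0]) of 'sl' to the end of 'cl'. */
static void seglist_extract_done(struct seglist *sl, struct cblist *cl)
{
        assert(cl->tail != NULL);       /* 'cl' must be initialized */
        if (sl->tails[0] == &sl->head)
                return;                 /* done segment is empty */
        *cl->tail = sl->head;           /* splice done callbacks onto cl */
        sl->head = *sl->tails[0];       /* first pending callback, or NULL */
        *sl->tails[0] = NULL;           /* terminate the extracted chain */
        cl->tail = sl->tails[0];
        /*
         * Any tail that pointed at the old done tail belonged to an empty
         * segment, so it must now point at head.  Iterating downward keeps
         * tails[0] intact until it is compared with itself last.
         */
        for (int i = NSEGS - 1; i >= 0; i--)
                if (sl->tails[i] == sl->tails[0])
                        sl->tails[i] = &sl->head;
}
```

After the call, the pending callbacks stay on the segmented list in their original segments, and the extracted chain ends where the old done tail pointed.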

Moving everything except the done segment -> rcu callback list

rcu_segcblist_extract_pend_cbs()

kernel/rcu/rcu_segcblist.c

/*
 * Extract only those callbacks still pending (not yet ready to be
 * invoked) from the specified rcu_segcblist structure and place them in
 * the specified rcu_cblist structure.  Note that this loses information
 * about any callbacks that might have been partway done waiting for
 * their grace period.  Too bad!  They will have to start over.
 */

void rcu_segcblist_extract_pend_cbs(struct rcu_segcblist *rsclp,
                                    struct rcu_cblist *rclp)
{
        int i;

        if (!rcu_segcblist_pend_cbs(rsclp))
                return; /* Nothing to do. */
        *rclp->tail = *rsclp->tails[RCU_DONE_TAIL];
        rclp->tail = rsclp->tails[RCU_NEXT_TAIL];
        WRITE_ONCE(*rsclp->tails[RCU_DONE_TAIL], NULL);
        for (i = RCU_DONE_TAIL + 1; i < RCU_CBLIST_NSEGS; i++)
                WRITE_ONCE(rsclp->tails[i], rsclp->tails[RCU_DONE_TAIL]);
};

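Extracts all callbacks except those in the done segment from the rcu seg callback list and moves them to the rcu callback list.

The following figure shows the callbacks outside the done segment of the rcu seg callback list being extracted and moved to the rcu callback list.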
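A user-space sketch of the pending-callback extraction follows. It is an illustration only: struct cb, struct seglist, struct cblist, and seglist_extract_pend() are hypothetical names, and counters and WRITE_ONCE() are omitted.

```c
#include <assert.h>
#include <stddef.h>

#define NSEGS 4         /* mirrors RCU_CBLIST_NSEGS */

struct cb {
        struct cb *next;
};

struct seglist {                /* hypothetical reduced rcu_segcblist */
        struct cb *head;
        struct cb **tails[NSEGS];
};

struct cblist {                 /* hypothetical reduced rcu_cblist */
        struct cb *head;
        struct cb **tail;
};

/* Move everything after the done segment of 'sl' to the end of 'cl'. */
static void seglist_extract_pend(struct seglist *sl, struct cblist *cl)
{
        assert(cl->tail != NULL);       /* 'cl' must be initialized */
        if (*sl->tails[0] == NULL)
                return;                 /* nothing after the done segment */
        *cl->tail = *sl->tails[0];      /* first pending callback */
        cl->tail = sl->tails[NSEGS - 1];
        *sl->tails[0] = NULL;           /* list now ends with the done segment */
        for (int i = 1; i < NSEGS; i++)
                sl->tails[i] = sl->tails[0];    /* wait..next become empty */
}
```

The wait, next-ready, and next callbacks end up in one flat chain, so the information about which grace period each was waiting for is lost, as the kernel comment points out.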

rcu_segcblist_extract_count()

kernel/rcu/rcu_segcblist.c

/*
 * Extract only the counts from the specified rcu_segcblist structure,
 * and place them in the specified rcu_cblist structure.  This function
 * supports both callback orphaning and invocation, hence the separation
 * of counts and callbacks.  (Callbacks ready for invocation must be
 * orphaned and adopted separately from pending callbacks, but counts
 * apply to all callbacks.  Locking must be used to make sure that
 * both orphaned-callbacks lists are consistent.)
 */

void rcu_segcblist_extract_count(struct rcu_segcblist *rsclp,
                                               struct rcu_cblist *rclp)
{
        rclp->len_lazy += rsclp->len_lazy;
        rsclp->len_lazy = 0;
        rclp->len = rcu_segcblist_xchg_len(rsclp, 0);
};

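Extracts the counters from the rcu seg callback list into the rcu callback list: the lazy counter is accumulated into rclp->len_lazy and the callback count is stored in rclp->len. (The callback count and lazy counter of the rcu seg callback list are cleared to 0.)

Moving an rcu callback list -> done segment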

rcu_segcblist_insert_done_cbs()

kernel/rcu/rcu_segcblist.c

/*
 * Move callbacks from the specified rcu_cblist to the beginning of the
 * done-callbacks segment of the specified rcu_segcblist.
 */

void rcu_segcblist_insert_done_cbs(struct rcu_segcblist *rsclp,
                                   struct rcu_cblist *rclp)
{
        int i;

        if (!rclp->head)
                return; /* No callbacks to move. */
        *rclp->tail = rsclp->head;
        WRITE_ONCE(rsclp->head, rclp->head);
        for (i = RCU_DONE_TAIL; i < RCU_CBLIST_NSEGS; i++)
                if (&rsclp->head == rsclp->tails[i])
                        WRITE_ONCE(rsclp->tails[i], rclp->tail);
                else
                        break;
        rclp->head = NULL;
        rclp->tail = &rclp->head;
};

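Moves the callbacks of the rcu callback list to the done segment of the rcu seg callback list.

The following figure shows the callbacks of the rcu callback list being moved to the done segment of the rcu seg callback list.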
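The insertion at the front of the done segment can be sketched in user space like this. The names struct cb, struct seglist, struct cblist, and seglist_insert_done() are hypothetical, and WRITE_ONCE() and the counters are omitted.

```c
#include <assert.h>
#include <stddef.h>

#define NSEGS 4         /* mirrors RCU_CBLIST_NSEGS */

struct cb {
        struct cb *next;
};

struct seglist {                /* hypothetical reduced rcu_segcblist */
        struct cb *head;
        struct cb **tails[NSEGS];
};

struct cblist {                 /* hypothetical reduced rcu_cblist */
        struct cb *head;
        struct cb **tail;
};

/* Put all callbacks of 'cl' at the beginning of the done segment of 'sl'. */
static void seglist_insert_done(struct seglist *sl, struct cblist *cl)
{
        assert(cl->tail != NULL);       /* 'cl' must be initialized */
        if (!cl->head)
                return;                 /* no callbacks to move */
        *cl->tail = sl->head;           /* old list follows the new chain */
        sl->head = cl->head;
        /*
         * Leading tails that still point at head belonged to empty
         * segments; they must now point past the inserted callbacks.
         */
        for (int i = 0; i < NSEGS; i++) {
                if (sl->tails[i] != &sl->head)
                        break;
                sl->tails[i] = cl->tail;
        }
        cl->head = NULL;
        cl->tail = &cl->head;
}
```

If the done segment was already non-empty, tails[0] does not point at head, the loop stops immediately, and the inserted callbacks simply precede the existing done callbacks.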

Moving an rcu callback list -> next segment

rcu_segcblist_insert_pend_cbs()

kernel/rcu/rcu_segcblist.c

/*
 * Move callbacks from the specified rcu_cblist to the end of the
 * new-callbacks segment of the specified rcu_segcblist.
 */

void rcu_segcblist_insert_pend_cbs(struct rcu_segcblist *rsclp,
                                   struct rcu_cblist *rclp)
{
        if (!rclp->head)
                return; /* Nothing to do. */
        WRITE_ONCE(*rsclp->tails[RCU_NEXT_TAIL], rclp->head);
        WRITE_ONCE(rsclp->tails[RCU_NEXT_TAIL], rclp->tail);
        rclp->head = NULL;
        rclp->tail = &rclp->head;
};

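Moves the callbacks of the rcu callback list to the next segment of the rcu seg callback list.

The following figure shows the callbacks of the rcu callback list being moved to the next segment of the rcu seg callback list.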

rcu_segcblist_insert_count()

kernel/rcu/rcu_segcblist.c

/*
 * Insert counts from the specified rcu_cblist structure in the
 * specified rcu_segcblist structure.
 */

void rcu_segcblist_insert_count(struct rcu_segcblist *rsclp,
                                struct rcu_cblist *rclp)
{
        rsclp->len_lazy += rclp->len_lazy;
        rcu_segcblist_add_len(rsclp, rclp->len);
        rclp->len_lazy = 0;
        rclp->len = 0;
};

Adds the len_lazy counter and callback count of the rcu callback list to the rcu seg callback list, then clears both in the rcu callback list.

RCU cascading

rcu_advance_cbs()

kernel/rcu/tree.c

/*
 * Move any callbacks whose grace period has completed to the
 * RCU_DONE_TAIL sublist, then compact the remaining sublists and
 * assign ->gp_seq numbers to any callbacks in the RCU_NEXT_TAIL
 * sublist.  This function is idempotent, so it does not hurt to
 * invoke it repeatedly.  As long as it is not invoked -too- often...
 * Returns true if the RCU grace-period kthread needs to be awakened.
 *
 * The caller must hold rnp->lock with interrupts disabled.
 */

static bool rcu_advance_cbs(struct rcu_node *rnp, struct rcu_data *rdp)
{
        rcu_lockdep_assert_cblist_protected(rdp);
        raw_lockdep_assert_held_rcu_node(rnp);

        /* If no pending (not yet ready to invoke) callbacks, nothing to do. */
        if (!rcu_segcblist_pend_cbs(&rdp->cblist))
                return false;

        /*
         * Find all callbacks whose ->gp_seq numbers indicate that they
         * are ready to invoke, and put them into the RCU_DONE_TAIL sublist.
         */
        rcu_segcblist_advance(&rdp->cblist, rnp->gp_seq);

        /* Classify any remaining callbacks. */
        return rcu_accelerate_cbs(rnp, rdp);
}

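Performs the cascade that moves callbacks forward.

In code lines 7~8, if there are no pending callbacks, false is returned.
In code line 14, the cascade that moves callbacks forward is performed.
In code line 17, among the remaining callbacks, those that can be accelerated are bundled and moved forward.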

rcu_segcblist_advance()

kernel/rcu/rcu_segcblist.c

/*
 * Advance the callbacks in the specified rcu_segcblist structure based
 * on the current value passed in for the grace-period counter.
 */

void rcu_segcblist_advance(struct rcu_segcblist *rsclp, unsigned long seq)
{
        int i, j;

        WARN_ON_ONCE(!rcu_segcblist_is_enabled(rsclp));
        if (rcu_segcblist_restempty(rsclp, RCU_DONE_TAIL))
                return;

        /*
         * Find all callbacks whose ->gp_seq numbers indicate that they
         * are ready to invoke, and put them into the RCU_DONE_TAIL segment.
         */
        for (i = RCU_WAIT_TAIL; i < RCU_NEXT_TAIL; i++) {
                if (ULONG_CMP_LT(seq, rsclp->gp_seq[i]))
                        break;
                WRITE_ONCE(rsclp->tails[RCU_DONE_TAIL], rsclp->tails[i]);
        }

        /* If no callbacks moved, nothing more need be done. */
        if (i == RCU_WAIT_TAIL)
                return;

        /* Clean up tail pointers that might have been misordered above. */
        for (j = RCU_WAIT_TAIL; j < i; j++)
                WRITE_ONCE(rsclp->tails[j], rsclp->tails[RCU_DONE_TAIL]);

        /*
         * Callbacks moved, so clean up the misordered ->tails[] pointers
         * that now point into the middle of the list of ready-to-invoke
         * callbacks.  The overall effect is to copy down the later pointers
         * into the gap that was created by the now-ready segments.
         */
        for (j = RCU_WAIT_TAIL; i < RCU_NEXT_TAIL; i++, j++) {
                if (rsclp->tails[j] == rsclp->tails[RCU_NEXT_TAIL])
                        break;  /* No more callbacks. */
                WRITE_ONCE(rsclp->tails[j], rsclp->tails[i]);
                rsclp->gp_seq[j] = rsclp->gp_seq[i];
        }
};

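Performs the cascade that moves callbacks forward. Callbacks whose gp has completed are moved to the done segment, and if the wait segment has been emptied, the callbacks of the next-ready segment are moved to the wait segment. This function returns nothing. Merging newly arrived callbacks into a segment (acceleration) and deciding whether the gp kthread must be awakened are done by the caller, rcu_advance_cbs(), through rcu_accelerate_cbs().

In code lines 6~7, if there are no pending callbacks after the done segment, the function returns.
In code lines 13~17, step 1) completion. Callbacks in the wait and next-ready segments whose gp has already completed are moved to the done segment.
- If seq (rnp->gp_seq) < gp_seq[i], the gp has not yet completed, so those callbacks cannot be processed yet.
In code lines 20~21, if no callbacks were moved, the function returns.
In code lines 24~25, if callbacks were moved out of the wait or next-ready segment, the tails of those segments now lag behind the new done tail. They are therefore first set equal to the done tail.
In code lines 33~38, step 2) cascade. Callbacks that remained in the next-ready segment are moved up to the vacated wait segment, together with their gp_seq[] value.

The following figure shows a case in which rcu_segcblist_advance() can perform the cascade.

Callbacks in the wait and next-ready segments that can be completed are moved to the done segment. When only the wait segment was moved, the figure shows the next-ready callbacks moving to the wait segment.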
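The two steps (completion, then cascade) can be traced with a user-space sketch. It assumes simplified, hypothetical names (struct cb, struct seglist, seglist_advance(), SEQ_LT) and omits WRITE_ONCE() and the enabled check. SEQ_LT follows the wrap-safe comparison form of the kernel's ULONG_CMP_LT macro. The sequence values used with it are arbitrary illustration values, not real gp_seq encodings.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define NSEGS 4         /* mirrors RCU_CBLIST_NSEGS */
/* Wrap-safe "a < b" for unsigned long sequence numbers. */
#define SEQ_LT(a, b) (ULONG_MAX / 2 < (a) - (b))

struct cb {
        struct cb *next;
};

struct seglist {                /* hypothetical reduced rcu_segcblist */
        struct cb *head;
        struct cb **tails[NSEGS];
        unsigned long gp_seq[NSEGS];
};

static void seglist_advance(struct seglist *sl, unsigned long seq)
{
        int i, j;

        assert(sl->tails[0] != NULL);   /* list must be initialized */
        if (*sl->tails[0] == NULL)
                return;                 /* no pending callbacks */

        /* Step 1: fold segments whose grace period has ended into done. */
        for (i = 1; i < NSEGS - 1; i++) {
                if (SEQ_LT(seq, sl->gp_seq[i]))
                        break;
                sl->tails[0] = sl->tails[i];
        }
        if (i == 1)
                return;                 /* nothing moved */

        /* The folded segments are now empty: align their tails with done. */
        for (j = 1; j < i; j++)
                sl->tails[j] = sl->tails[0];

        /* Step 2: slide the remaining segments down into the gap. */
        for (j = 1; i < NSEGS - 1; i++, j++) {
                if (sl->tails[j] == sl->tails[NSEGS - 1])
                        break;          /* no more callbacks */
                sl->tails[j] = sl->tails[i];
                sl->gp_seq[j] = sl->gp_seq[i];
        }
}
```

For example, with one callback in each of wait (gp_seq 4), next-ready (gp_seq 8), and next, calling seglist_advance() with seq 4 moves the wait callback to done and the next-ready callback to wait, while the next segment is left for the acceleration step.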

Accelerating new callbacks

rcu_accelerate_cbs_unlocked()

kernel/rcu/tree.c

/*
 * Similar to rcu_accelerate_cbs(), but does not require that the leaf
 * rcu_node structure's ->lock be held.  It consults the cached value
 * of ->gp_seq_needed in the rcu_data structure, and if that indicates
 * that a new grace-period request be made, invokes rcu_accelerate_cbs()
 * while holding the leaf rcu_node structure's ->lock.
 */

static void rcu_accelerate_cbs_unlocked(struct rcu_node *rnp,
                                        struct rcu_data *rdp)
{
        unsigned long c;
        bool needwake;

        rcu_lockdep_assert_cblist_protected(rdp);
        c = rcu_seq_snap(&rcu_state.gp_seq);
        if (!rdp->gpwrap && ULONG_CMP_GE(rdp->gp_seq_needed, c)) {
                /* Old request still live, so mark recent callbacks. */
                (void)rcu_segcblist_accelerate(&rdp->cblist, c);
                return;
        }
        raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */
        needwake = rcu_accelerate_cbs(rnp, rdp);
        raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
        if (needwake)
                rcu_gp_kthread_wake();
}

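Bundles new callbacks and accelerates them forward where possible. (Entered without the leaf node lock.)

In code lines 8~13, if gpwrap is not set and rdp->gp_seq_needed is greater than or equal to the gp sequence snapshot, the earlier gp request is still live, so the new callbacks are accelerated without the node lock and the function returns.
In code lines 14~16, the node lock is taken and the new callbacks are bundled and accelerated forward where possible.
In code lines 17~18, if the result is true, the gp kernel thread is awakened.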

rcu_accelerate_cbs()

kernel/rcu/tree.c

/*
 * If there is room, assign a ->gp_seq number to any callbacks on this
 * CPU that have not already been assigned.  Also accelerate any callbacks
 * that were previously assigned a ->gp_seq number that has since proven
 * to be too conservative, which can happen if callbacks get assigned a
 * ->gp_seq number while RCU is idle, but with reference to a non-root
 * rcu_node structure.  This function is idempotent, so it does not hurt
 * to call it repeatedly.  Returns an flag saying that we should awaken
 * the RCU grace-period kthread.
 *
 * The caller must hold rnp->lock with interrupts disabled.
 */

static bool rcu_accelerate_cbs(struct rcu_node *rnp, struct rcu_data *rdp)
{
        unsigned long gp_seq_req;
        bool ret = false;

        rcu_lockdep_assert_cblist_protected(rdp);
        raw_lockdep_assert_held_rcu_node(rnp);

        /* If no pending (not yet ready to invoke) callbacks, nothing to do. */
        if (!rcu_segcblist_pend_cbs(&rdp->cblist))
                return false;

        /*
         * Callbacks are often registered with incomplete grace-period
         * information.  Something about the fact that getting exact
         * information requires acquiring a global lock...  RCU therefore
         * makes a conservative estimate of the grace period number at which
         * a given callback will become ready to invoke.        The following
         * code checks this estimate and improves it when possible, thus
         * accelerating callback invocation to an earlier grace-period
         * number.
         */
        gp_seq_req = rcu_seq_snap(&rcu_state.gp_seq);
        if (rcu_segcblist_accelerate(&rdp->cblist, gp_seq_req))
                ret = rcu_start_this_gp(rnp, rdp, gp_seq_req);

        /* Trace depending on how much we were able to accelerate. */
        if (rcu_segcblist_restempty(&rdp->cblist, RCU_WAIT_TAIL))
                trace_rcu_grace_period(rcu_state.name, rdp->gp_seq, TPS("AccWaitCB"));
        else
                trace_rcu_grace_period(rcu_state.name, rdp->gp_seq, TPS("AccReadyCB"));
        return ret;
}

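Bundles new callbacks and accelerates them forward where possible. If the result is true, the caller must awaken the gp kernel thread. (It must be entered with the leaf node lock held.)

In code lines 10~11, if there are no pending callbacks, the function returns.
In code lines 23~25, the new callbacks are accelerated according to the snapshotted gp sequence request value. If the acceleration succeeds, a request is made to start the gp with the snapshotted gp sequence number.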

rcu_segcblist_accelerate()

kernel/rcu/rcu_segcblist.c

/*
 * "Accelerate" callbacks based on more-accurate grace-period information.
 * The reason for this is that RCU does not synchronize the beginnings and
 * ends of grace periods, and that callbacks are posted locally.  This in
 * turn means that the callbacks must be labelled conservatively early
 * on, as getting exact information would degrade both performance and
 * scalability.  When more accurate grace-period information becomes
 * available, previously posted callbacks can be "accelerated", marking
 * them to complete at the end of the earlier grace period.
 *
 * This function operates on an rcu_segcblist structure, and also the
 * grace-period sequence number seq at which new callbacks would become
 * ready to invoke.  Returns true if there are callbacks that won't be
 * ready to invoke until seq, false otherwise.
 */

bool rcu_segcblist_accelerate(struct rcu_segcblist *rsclp, unsigned long seq)
{
        int i;

        WARN_ON_ONCE(!rcu_segcblist_is_enabled(rsclp));
        if (rcu_segcblist_restempty(rsclp, RCU_DONE_TAIL))
                return false;

        /*
         * Find the segment preceding the oldest segment of callbacks
         * whose ->gp_seq[] completion is at or after that passed in via
         * "seq", skipping any empty segments.  This oldest segment, along
         * with any later segments, can be merged in with any newly arrived
         * callbacks in the RCU_NEXT_TAIL segment, and assigned "seq"
         * as their ->gp_seq[] grace-period completion sequence number.
         */
        for (i = RCU_NEXT_READY_TAIL; i > RCU_DONE_TAIL; i--)
                if (rsclp->tails[i] != rsclp->tails[i - 1] &&
                    ULONG_CMP_LT(rsclp->gp_seq[i], seq))
                        break;

        /*
         * If all the segments contain callbacks that correspond to
         * earlier grace-period sequence numbers than "seq", leave.
         * Assuming that the rcu_segcblist structure has enough
         * segments in its arrays, this can only happen if some of
         * the non-done segments contain callbacks that really are
         * ready to invoke.  This situation will get straightened
         * out by the next call to rcu_segcblist_advance().
         *
         * Also advance to the oldest segment of callbacks whose
         * ->gp_seq[] completion is at or after that passed in via "seq",
         * skipping any empty segments.
         */
        if (++i >= RCU_NEXT_TAIL)
                return false;

        /*
         * Merge all later callbacks, including newly arrived callbacks,
         * into the segment located by the for-loop above.  Assign "seq"
         * as the ->gp_seq[] value in order to correctly handle the case
         * where there were no pending callbacks in the rcu_segcblist
         * structure other than in the RCU_NEXT_TAIL segment.
         */
        for (; i < RCU_NEXT_TAIL; i++) {
                WRITE_ONCE(rsclp->tails[i], rsclp->tails[RCU_NEXT_TAIL]);
                rsclp->gp_seq[i] = seq;
        }
        return true;
}

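This function accelerates callbacks that have newly arrived in the next segment: when conditions permit, it moves them into the next-ready segment, or further forward into the wait segment, so that they can be processed sooner. It returns true when callbacks were assigned to seq, in which case the caller may need to wake the RCU GP kthread.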

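Code lines 6~7: if no callbacks are waiting after the done segment, return false.
Code lines 17~20: walk the next-ready(2) and wait(1) segments in reverse order. The loop breaks at the first non-empty segment whose assigned gp_seq is earlier than seq.
Code lines 35~36: if the next-ready(2) segment already holds callbacks assigned to an earlier gp_seq, the new callbacks cannot be accelerated, so return false and leave the function.
Code lines 45~48: merge the callbacks of the next(3) segment into the wait(1) or next-ready(2) segment, and set the gp_seq of each merged segment to seq, the grace-period sequence number passed in.
Code line 49: return true for success.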
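
The steps above can be tried out in user space. The following is a minimal sketch, not the kernel implementation: `struct cb`, `struct seglist`, `seq_lt()`, `seglist_init()`, `seglist_enqueue()` and `seglist_accelerate()` are hypothetical stand-ins for `rcu_head`, `rcu_segcblist`, `ULONG_CMP_LT()` and the corresponding kernel helpers, and the sketch assumes single-threaded use (no `READ_ONCE()`/`WRITE_ONCE()`, no `enabled` check).

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; segment indices mirror the kernel's order. */
enum { DONE_TAIL, WAIT_TAIL, NEXT_READY_TAIL, NEXT_TAIL, NSEGS };

struct cb { struct cb *next; };

struct seglist {
        struct cb *head;
        struct cb **tails[NSEGS];
        unsigned long gp_seq[NSEGS];
};

/* Wraparound-safe "a < b", the same idea as the kernel's ULONG_CMP_LT(). */
static bool seq_lt(unsigned long a, unsigned long b)
{
        return ULONG_MAX / 2 < a - b;
}

/* Empty list: every segment tail points at head. */
static void seglist_init(struct seglist *l)
{
        l->head = NULL;
        for (int i = 0; i < NSEGS; i++) {
                l->tails[i] = &l->head;
                l->gp_seq[i] = 0;
        }
}

/* New callbacks always arrive at the end of the next segment. */
static void seglist_enqueue(struct seglist *l, struct cb *c)
{
        c->next = NULL;
        *l->tails[NEXT_TAIL] = c;
        l->tails[NEXT_TAIL] = &c->next;
}

static bool seglist_accelerate(struct seglist *l, unsigned long seq)
{
        int i;

        /* Nothing queued after the done segment. */
        if (*l->tails[DONE_TAIL] == NULL)
                return false;

        /* Find the newest non-empty segment that completes before seq. */
        for (i = NEXT_READY_TAIL; i > DONE_TAIL; i--)
                if (l->tails[i] != l->tails[i - 1] &&
                    seq_lt(l->gp_seq[i], seq))
                        break;

        /* next-ready is already taken by an earlier gp_seq: cannot merge. */
        if (++i >= NEXT_TAIL)
                return false;

        /* Pull everything up to the end of next into segment i. */
        for (; i < NEXT_TAIL; i++) {
                l->tails[i] = l->tails[NEXT_TAIL];
                l->gp_seq[i] = seq;
        }
        return true;
}
```

In this model, a first callback accelerated with seq 8 lands in the wait segment, a second one accelerated with seq 12 lands in next-ready, and a third one accelerated with seq 16 stays in next because both earlier segments are occupied by older grace periods.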

The following figure shows callbacks that have newly arrived in the next segment being accelerated into the next-ready or wait segment so that they can be processed sooner.

Structures

rcu_cblist structure

include/linux/rcu_segcblist.h

/* Simple unsegmented callback lists. */
struct rcu_cblist {
        struct rcu_head *head;
        struct rcu_head **tail;
        long len;
        long len_lazy;
};

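This is the RCU unsegmented callback list structure.

*head
- RCU callbacks are linked from here.
- NULL when the list is empty.
**tail
- Points to the ->next pointer of the last RCU callback.
- Points to head when the list is empty.
len
- Number of callbacks
len_lazy
- Number of lazy callbacks
- Note) number of non-lazy callbacks = len - len_lazy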
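
The head/tail convention described above can be illustrated with a small user-space sketch. The names `struct cb`, `struct cblist`, `cblist_init()`, `cblist_enqueue()` and `cblist_n_nonlazy()` are hypothetical and only mirror the fields of `rcu_cblist`; this is not the kernel's code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct rcu_head: only the link is modeled. */
struct cb { struct cb *next; };

/* Same fields as rcu_cblist, under a different (assumed) name. */
struct cblist {
        struct cb *head;
        struct cb **tail;
        long len;
        long len_lazy;
};

/* Empty list: head is NULL and tail points back at head. */
static void cblist_init(struct cblist *l)
{
        l->head = NULL;
        l->tail = &l->head;
        l->len = 0;
        l->len_lazy = 0;
}

/*
 * Append in O(1): write through tail, then move tail to the new
 * callback's ->next pointer. No special case is needed for an empty
 * list, because tail then points at head itself.
 */
static void cblist_enqueue(struct cblist *l, struct cb *c, int lazy)
{
        c->next = NULL;
        *l->tail = c;
        l->tail = &c->next;
        l->len++;
        if (lazy)
                l->len_lazy++;
}

/* Non-lazy callbacks are derived rather than stored. */
static long cblist_n_nonlazy(const struct cblist *l)
{
        return l->len - l->len_lazy;
}
```

Storing a pointer to the last ->next pointer, instead of a pointer to the last callback, is what lets the append path avoid an "is the list empty?" branch.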

rcu_segcblist structure

include/linux/rcu_segcblist.h

struct rcu_segcblist {
        struct rcu_head *head;
        struct rcu_head **tails[RCU_CBLIST_NSEGS];
        unsigned long gp_seq[RCU_CBLIST_NSEGS];
#ifdef CONFIG_RCU_NOCB_CPU
        atomic_long_t len;
#else
        long len;
#endif
        long len_lazy;
        u8 enabled;
        u8 offloaded;
};

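This is the RCU segmented callback list structure.

*head
- RCU callbacks are linked from here.
- NULL when the list is empty.
**tails[]
- Consists of four segments; each entry points to the ->next pointer of the last RCU callback in its segment.
- When a segment is empty, its entry equals the tail of the previous segment; when all segments are empty, every entry points to head.
gp_seq[]
- Holds the GP sequence number of each segment.
len
- Number of callbacks
len_lazy
- Number of lazy callbacks
- Note) number of non-lazy callbacks = len - len_lazy
enabled
- Holds whether the list is enabled. (1=enabled, 0=disabled)
offloaded
- Added in kernel v5.4-rc1; holds whether the list is offloaded for no-CB processing. (1=offloaded, 0=none)
  - Reference: rcu/nocb: Use separate flag to indicate offloaded ->cblist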
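
The tails[] rule for empty segments can be checked with a short user-space sketch. `struct cb`, `struct seglist`, `seg_start()`, `seg_is_empty()` and `seg_count()` are hypothetical helpers written for illustration; only head and tails[] are modeled, and the kernel's own accessors are not used.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; segment indices mirror the kernel's order. */
enum { DONE_TAIL, WAIT_TAIL, NEXT_READY_TAIL, NEXT_TAIL, NSEGS };

struct cb { struct cb *next; };

struct seglist {
        struct cb *head;
        struct cb **tails[NSEGS];
};

/* Empty list: every segment tail points at head. */
static void seglist_init(struct seglist *l)
{
        l->head = NULL;
        for (int i = 0; i < NSEGS; i++)
                l->tails[i] = &l->head;
}

/* A segment starts where the previous one ends; done starts at head. */
static struct cb **seg_start(struct seglist *l, int seg)
{
        return seg == DONE_TAIL ? &l->head : l->tails[seg - 1];
}

/* Empty segment: its tail equals the tail of the previous segment. */
static bool seg_is_empty(struct seglist *l, int seg)
{
        return l->tails[seg] == seg_start(l, seg);
}

/* Count callbacks by walking ->next pointers up to the segment tail. */
static long seg_count(struct seglist *l, int seg)
{
        long n = 0;

        for (struct cb **pp = seg_start(l, seg); pp != l->tails[seg];
             pp = &(*pp)->next)
                n++;
        return n;
}
```

For example, with two callbacks a and b linked from head, tails[WAIT_TAIL] and tails[NEXT_READY_TAIL] both set to &a.next, and tails[NEXT_TAIL] set to &b.next, the done and next-ready segments are empty while wait and next hold one callback each.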

References

RCU(Read Copy Update) -1- (Basic) | 문c
RCU(Read Copy Update) -2- (Callback process) | 문c
RCU(Read Copy Update) -3- (RCU threads) | 문c
RCU(Read Copy Update) -4- (NOCB process) | 문c
RCU(Read Copy Update) -5- (Callback list) | 문c (this post)
RCU(Read Copy Update) -6- (Expedited GP) | 문c
RCU(Read Copy Update) -7- (Preemptible RCU) | 문c
rcu_init() | 문c
wait_for_completion() | 문c