문c Blog

pidmap_init()

2018-01-10 / 2018-01-30 문영일

pidmap_init()

kernel/pid.c

void __init pidmap_init(void)
{
        /* Veryify no one has done anything silly */
        BUILD_BUG_ON(PID_MAX_LIMIT >= PIDNS_HASH_ADDING);

        /* bump default and minimum pid_max based on number of cpus */
        pid_max = min(pid_max_max, max_t(int, pid_max,
                                PIDS_PER_CPU_DEFAULT * num_possible_cpus()));
        pid_max_min = max_t(int, pid_max_min,
                                PIDS_PER_CPU_MIN * num_possible_cpus());
        pr_info("pid_max: default: %u minimum: %u\n", pid_max, pid_max_min);

        init_pid_ns.pidmap[0].page = kzalloc(PAGE_SIZE, GFP_KERNEL);
        /* Reserve PID 0. We never call free_pidmap(0) */
        set_bit(0, init_pid_ns.pidmap[0].page);
        atomic_dec(&init_pid_ns.pidmap[0].nr_free);

        init_pid_ns.pid_cachep = KMEM_CACHE(pid,
                        SLAB_HWCACHE_ALIGN | SLAB_PANIC);
}

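To manage PIDs, pidmap_init() derives pid_max and pid_max_min from the number of possible CPUs. It then allocates the first page of the pidmap array and marks PID 0 as in use.

In code lines 7-8, pid_max is set to the larger of its current value (32K by default) and 1024 * the number of possible CPUs, capped at pid_max_max (32K on 32-bit, 4M on 64-bit).
- rpi2: pid_max=32K
In code lines 9-10, pid_max_min is set to the larger of its current value (301) and 8 * the number of possible CPUs.
- rpi2: pid_max_min=301
In code line 11, the PID settings are printed.
- rpi2: "pid_max: default: 32768 minimum: 301"
In code line 13, one page is allocated for the first entry of the init_pid_ns pidmap array, because the first page is always needed.
In code lines 15-16, the bit for PID 0 is set and nr_free is decremented by 1.
In code lines 18-19, a kmem slab cache is prepared for allocating pid structures.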
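
The two adjustments above can be checked with a small userspace sketch. The function names are made up for illustration (the kernel uses min() and max_t() inline), and the rpi2 figures assume 4 possible CPUs on a 32-bit kernel.

```c
#define PIDS_PER_CPU_DEFAULT    1024
#define PIDS_PER_CPU_MIN        8

/* Illustrative helpers standing in for the kernel's min()/max_t(). */
static int max_int(int a, int b) { return a > b ? a : b; }
static int min_int(int a, int b) { return a < b ? a : b; }

/* pid_max after the adjustment in code lines 7-8. */
int bumped_pid_max(int pid_max, int pid_max_max, int possible_cpus)
{
        return min_int(pid_max_max,
                       max_int(pid_max, PIDS_PER_CPU_DEFAULT * possible_cpus));
}

/* pid_max_min after the adjustment in code lines 9-10. */
int bumped_pid_max_min(int pid_max_min, int possible_cpus)
{
        return max_int(pid_max_min, PIDS_PER_CPU_MIN * possible_cpus);
}
```

With 4 CPUs and pid_max_max = 32K, bumped_pid_max(0x8000, 0x8000, 4) stays at 32768 and bumped_pid_max_min(301, 4) stays at 301, which matches the rpi2 values. On a 64-bit kernel the defaults only start to grow past 32 CPUs for pid_max and past 37 CPUs for pid_max_min.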

The value of PID_MAX_DEFAULT is as follows. (By default, 32K PIDs are used.)

small-footprint kernel
- 4K
32-bit or 64-bit kernel
- 32K

The value of PID_MAX_LIMIT is as follows. (A 64-bit system can use up to 4M PIDs.)

small-footprint kernel
- PAGE_SIZE(4K) x 8 = 32K
32-bit kernel
- 32K
64-bit kernel
- 4M

include/linux/threads.h

/*
 * This controls the default maximum pid allocated to a process
 */
#define PID_MAX_DEFAULT (CONFIG_BASE_SMALL ? 0x1000 : 0x8000)

/*
 * A maximum of 4 million PIDs should be enough for a while.
 * [NOTE: PID/TIDs are limited to 2^29 ~= 500+ million, see futex.h.]
 */
#define PID_MAX_LIMIT (CONFIG_BASE_SMALL ? PAGE_SIZE * 8 : \
        (sizeof(long) > 4 ? 4 * 1024 * 1024 : PID_MAX_DEFAULT))

The initial values of the PID-related global variables at compile time are as follows.

kernel/pid.c

int pid_max = PID_MAX_DEFAULT;

#define RESERVED_PIDS           300

int pid_max_min = RESERVED_PIDS + 1;
int pid_max_max = PID_MAX_LIMIT;

The default number of PIDs per CPU is 1024, derived from the original pid_max of 32K for 32 CPUs. The minimum number of PIDs per CPU is 8.

include/linux/threads.h

/*
 * Define a minimum number of pids per cpu.  Heuristically based
 * on original pid max of 32k for 32 cpus.  Also, increase the
 * minimum settable value for pid_max on the running system based
 * on similar defaults.  See kernel/pid.c:pidmap_init() for details.
 */
#define PIDS_PER_CPU_DEFAULT    1024
#define PIDS_PER_CPU_MIN        8

Whether each PID number is in use is tracked with 1 bit. One page can track PAGE_SIZE x 8 PIDs. When the maximum number of PIDs is large, several pages are needed, so pidmap is an array in which each entry manages one page. For example, assuming a 4K page size, one page used as a PID map can track 4K x 8 bits = 32K PIDs. The number of pidmap entries is therefore the maximum number of PIDs (PID_MAX_LIMIT) divided by 32K, rounded up.

With 4K pages on a 32-bit system, up to 32K PIDs fit in 1 page, so pidmap[1] is used.
With 4K pages on a 64-bit system, up to 4M PIDs need 128 pages, so pidmap[128] is required.

include/linux/pid_namespace.h

#define BITS_PER_PAGE           (PAGE_SIZE * 8)
#define BITS_PER_PAGE_MASK      (BITS_PER_PAGE-1)
#define PIDMAP_ENTRIES          ((PID_MAX_LIMIT+BITS_PER_PAGE-1)/BITS_PER_PAGE)
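
As a rough illustration of the arithmetic behind these macros, the sketch below computes the number of pidmap entries and splits a PID into a page index and a bit offset. The function names are hypothetical, and the bitmap helpers are byte-based and non-atomic, unlike the kernel's set_bit(), so they only model the idea of reserving PID 0.

```c
#include <stddef.h>

/* Number of one-page bitmaps needed to cover pid_max_limit PIDs. */
size_t pidmap_entries(size_t pid_max_limit, size_t page_size)
{
        size_t bits_per_page = page_size * 8;

        return (pid_max_limit + bits_per_page - 1) / bits_per_page;
}

/* Index of the pidmap entry (page) that tracks pid. */
size_t pid_to_page(size_t pid, size_t page_size)
{
        return pid / (page_size * 8);
}

/* Bit offset of pid inside that page (page_size must be a power of two). */
size_t pid_to_offset(size_t pid, size_t page_size)
{
        return pid & (page_size * 8 - 1);
}

/* Mark one bit as used in a byte-addressed bitmap (not atomic). */
void bitmap_set(unsigned char *map, size_t bit)
{
        map[bit / 8] |= (unsigned char)(1u << (bit % 8));
}

/* Return 1 if the bit is set, 0 otherwise. */
int bitmap_test(const unsigned char *map, size_t bit)
{
        return (map[bit / 8] >> (bit % 8)) & 1;
}
```

For a 4K page, pidmap_entries(32768, 4096) gives 1 and pidmap_entries(4194304, 4096) gives 128, the two array sizes mentioned above. PID 40000 would land in page 1 at offset 7232.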

References

pid management | 문c
pidhash_init() | 문c
pidmap_init() | 문c – this post
alloc_large_system_hash() | 문c

Console & TTY Driver

2017-12-20 / 2017-12-22 문영일

TTY Driver

Linux defines six TTY driver types.

console
serial (subtype: normal)
pty (subtypes: master, slave)
system (subtypes: tty, console, syscons, sysptmx)
scc (not used)
syscons (not used)

TTY(Teletypewriter)

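The following figure compares the seven layers of the OSI model with the TCP model to show which layers TTY corresponds to.

At the TTY physical layer, hardware devices such as UART, serial, dial-up modem, and ISDN modem are attached, and the drivers that control those devices sit here.
The position corresponding to the data link layer holds part of those hardware drivers together with the common TTY driver, and also includes the line discipline driver.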

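TTY Driver

The generic TTY driver is registered as a character device. It connects either directly to the tty driver attached to the hardware (or pseudo terminal), or through a particular line discipline driver in between. tty driver configurations vary widely. Among the drivers built by choosing one of the tty driver types mentioned above, those of the serial type can be described with the layering below.

The figure below is drawn more broadly to show the whole picture, not only the serial-type tty above.

Next, a closer look at the tty drivers that support the serial type in recent kernels. As in the figure below, serial drivers are divided into cases A) to D).

Before kernel 2.6, only case A) was supported for developing a serial-type tty driver, and the tty_register_driver() function was used to register the tty driver.
Starting with kernel 2.6, the serial core (generic) was added, and case B) was also supported to make serial tty drivers simpler to develop. The uart_register_driver() function is used to register the serial driver.
In addition, for 8250/16c550 serial drivers, the serial 8250 core (generic) was added at a lower layer. Case C) was supported for developing serial drivers that use the 8250 (16c550). The serial8250_register_8250_port() function is used to register an 8250 serial driver.
As in case D), the general-purpose of-serial driver for Device Tree can also be used.

The following shows the functions actually called when serial I/O occurs. (based on rpi2)

Also see the figure that explains in detail how a pseudo terminal operates.

This figure redraws the configuration above from the viewpoint of sessions and process groups.

Other device configurations are shown as well.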

TTY Core (Generic)

TTY driver registration

tty_register_driver()

drivers/tty/tty_io.c

/*
 * Called by a tty driver to register itself.
 */
int tty_register_driver(struct tty_driver *driver)
{
        int error;
        int i;
        dev_t dev;
        struct device *d;

        if (!driver->major) {
                error = alloc_chrdev_region(&dev, driver->minor_start,
                                                driver->num, driver->name);
                if (!error) {
                        driver->major = MAJOR(dev);
                        driver->minor_start = MINOR(dev);
                }
        } else {
                dev = MKDEV(driver->major, driver->minor_start);
                error = register_chrdev_region(dev, driver->num, driver->name);
        }
        if (error < 0)
                goto err;

        if (driver->flags & TTY_DRIVER_DYNAMIC_ALLOC) {
                error = tty_cdev_add(driver, dev, 0, driver->num);
                if (error)
                        goto err_unreg_char;
        }

        mutex_lock(&tty_mutex);
        list_add(&driver->tty_drivers, &tty_drivers);
        mutex_unlock(&tty_mutex);

        if (!(driver->flags & TTY_DRIVER_DYNAMIC_DEV)) {
                for (i = 0; i < driver->num; i++) {
                        d = tty_register_device(driver, i, NULL);
                        if (IS_ERR(d)) {
                                error = PTR_ERR(d);
                                goto err_unreg_devs;
                        }
                }
        }
        proc_tty_register_driver(driver);
        driver->flags |= TTY_DRIVER_INSTALLED;
        return 0;

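Registers the tty driver passed as an argument. Returns 0 on success, or an error value on failure.

In code lines 11-23, if the tty driver has no major number assigned, an unused character device major number is searched for, starting from 254 and going downward, and the character device region is registered with that number. If a major number is already assigned, the region is registered with that major number and minor_start.
- A character device number is 32 bits, divided as follows; the MKDEV(major, minor) macro builds the 32-bit value.
  - major: the upper 12 bits, obtained with the MAJOR() macro.
  - minor: the lower 20 bits, obtained with the MINOR() macro.
In code lines 25-29, if the tty driver uses the TTY_DRIVER_DYNAMIC_ALLOC flag, the driver's cdevs[] is registered as a system character device. (Take care that it is not registered twice.)
In code lines 31-33, the requested tty driver is added to the global tty_drivers list.
In code lines 35-43, if the tty driver does not use the TTY_DRIVER_DYNAMIC_DEV flag, one tty device is registered for each of the driver->num entries.
In code line 44, the requested tty driver is added to the proc interface.
In code line 45, the TTY_DRIVER_INSTALLED flag is set to mark the requested tty driver as installed.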
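
The 12-bit/20-bit split of the device number can be reproduced in userspace with a few lines. This is a minimal sketch with lowercase, hypothetical function names chosen so they do not clash with the kernel macros; the shift width of 20 follows the layout described above.

```c
#include <stdint.h>

#define MINORBITS       20
#define MINORMASK       ((1u << MINORBITS) - 1)

/* Pack a major and a minor number into one 32-bit device number. */
uint32_t mkdev(uint32_t ma, uint32_t mi)
{
        return (ma << MINORBITS) | mi;
}

/* Upper 12 bits: the major number. */
uint32_t dev_major(uint32_t dev)
{
        return dev >> MINORBITS;
}

/* Lower 20 bits: the minor number. */
uint32_t dev_minor(uint32_t dev)
{
        return dev & MINORMASK;
}
```

For example, mkdev(5, 1) yields 0x500001, and dev_major() and dev_minor() recover 5 and 1 from it.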

err_unreg_devs:
        for (i--; i >= 0; i--)
                tty_unregister_device(driver, i);

        mutex_lock(&tty_mutex);
        list_del(&driver->tty_drivers);
        mutex_unlock(&tty_mutex);

err_unreg_char:
        unregister_chrdev_region(dev, driver->num);
err:
        return error;
}
EXPORT_SYMBOL(tty_register_driver);

If an error occurs, the devices registered so far are unregistered, and the driver is removed from the global tty_drivers list. Finally, the character device region is released from the system.
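
The reverse-order unwinding in the err_unreg_devs path can be tried out in isolation. The sketch below uses hypothetical stand-ins (fake_register_device() and fake_unregister_device()) in place of the kernel calls, and a fail_at parameter that simulates a failure at a chosen index.

```c
#include <stdbool.h>

#define MAX_DEVS        8

static bool registered[MAX_DEVS];

/* Hypothetical stand-in for tty_register_device(); fails at index fail_at. */
static int fake_register_device(int index, int fail_at)
{
        if (index == fail_at)
                return -1;
        registered[index] = true;
        return 0;
}

/* Hypothetical stand-in for tty_unregister_device(). */
static void fake_unregister_device(int index)
{
        registered[index] = false;
}

/* Register num devices; on failure undo the ones already registered. */
int register_all(int num, int fail_at)
{
        int i, error;

        if (num > MAX_DEVS)
                return -1;

        for (i = 0; i < num; i++) {
                error = fake_register_device(i, fail_at);
                if (error)
                        goto err_unreg_devs;
        }
        return 0;

err_unreg_devs:
        /* i is the index that failed, so undoing starts at i - 1. */
        for (i--; i >= 0; i--)
                fake_unregister_device(i);
        return error;
}

/* Count how many devices are currently marked as registered. */
int count_registered(void)
{
        int i, n = 0;

        for (i = 0; i < MAX_DEVS; i++)
                n += registered[i] ? 1 : 0;
        return n;
}
```

Calling register_all(4, 2) on a fresh table registers indexes 0 and 1, fails at index 2, and leaves nothing registered afterwards, which is the same guarantee the kernel loop gives.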

TTY initialization

The function below initializes tty. It is called at the end of chr_dev_init(), which initializes the “mem” character devices used to read and write memory.

fs_initcall(chr_dev_init) -> tty_init()

tty_init()

drivers/tty/tty_io.c

/*
 * Ok, now we can initialize the rest of the tty devices and can count
 * on memory allocations, interrupts etc..
 */
int __init tty_init(void)
{
        cdev_init(&tty_cdev, &tty_fops);
        if (cdev_add(&tty_cdev, MKDEV(TTYAUX_MAJOR, 0), 1) ||
            register_chrdev_region(MKDEV(TTYAUX_MAJOR, 0), 1, "/dev/tty") < 0)
                panic("Couldn't register /dev/tty driver\n");
        device_create(tty_class, NULL, MKDEV(TTYAUX_MAJOR, 0), NULL, "tty");

        cdev_init(&console_cdev, &console_fops);
        if (cdev_add(&console_cdev, MKDEV(TTYAUX_MAJOR, 1), 1) ||
            register_chrdev_region(MKDEV(TTYAUX_MAJOR, 1), 1, "/dev/console") < 0)
                panic("Couldn't register /dev/console driver\n");
        consdev = device_create(tty_class, NULL, MKDEV(TTYAUX_MAJOR, 1), NULL,
                              "console");
        if (IS_ERR(consdev))
                consdev = NULL;
        else
                WARN_ON(device_create_file(consdev, &dev_attr_active) < 0);

#ifdef CONFIG_VT
        vty_init(&console_fops);
#endif
        return 0;
}

Registers the default tty character device (/dev/tty) and the default console character device (/dev/console) with the system. If the kernel supports virtual terminals, the virtual terminal (/dev/tty0) is initialized using the default console ops.

In code lines 7-11, the default tty character device (/dev/tty) is registered. (major=5, minor=0)
In code lines 13-18, the default console character device (/dev/console) is registered. (major=5, minor=1)
In code lines 19-22, if the console device was created successfully, the active file is created with read permission for all users and groups.
- The active file is created under the /sys/devices/virtual/tty/console directory.
- That directory is linked from /sys/dev/char/<major>:<minor>.
In code lines 24-26, if the kernel supports virtual terminals, they are initialized.
- The /dev/tty0 device (4:0) is created using the default console ops.
- The /dev/vcs (7:0) and /dev/vcsa (7:128) files are created.
- The /dev/vcs1 (7:1) and /dev/vcsa1 (7:129) files are created.
- A console driver is allocated with up to 63 ports, and a console-type tty driver is registered.

Below, the active file for the console has been created read-only, and reading it shows which tty driver the current console is configured with.

# ls -la /sys/dev/char/5:1/active
-r--r--r-- 1 root root 4096 12월 21 23:13 /sys/dev/char/5:1/active
# cat /sys/dev/char/5:1/active
tty0
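
The same lookup can be done from a C program. The helper below is a minimal sketch with a made-up name, read_first_line(); it makes no assumption about the file beyond it being a short text file, so the sysfs path is passed in by the caller.

```c
#include <stdio.h>
#include <string.h>

/*
 * Read the first line of path into buf, without the trailing newline.
 * Returns 0 on success, -1 if the file cannot be opened or is empty.
 */
int read_first_line(const char *path, char *buf, size_t len)
{
        FILE *f;

        if (len == 0)
                return -1;

        f = fopen(path, "r");
        if (!f)
                return -1;

        if (!fgets(buf, (int)len, f)) {
                fclose(f);
                return -1;
        }
        fclose(f);

        buf[strcspn(buf, "\n")] = '\0';
        return 0;
}
```

On the system shown above, read_first_line("/sys/dev/char/5:1/active", buf, sizeof(buf)) would be expected to leave "tty0" in buf.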

tty_driver structure

include/linux/tty_driver.h

struct tty_driver {
        int     magic;          /* magic number for this structure */
        struct kref kref;       /* Reference management */
        struct cdev *cdevs;
        struct module   *owner;
        const char      *driver_name;
        const char      *name;
        int     name_base;      /* offset of printed name */
        int     major;          /* major device number */ 
        int     minor_start;    /* start of minor device number */
        unsigned int    num;    /* number of devices allocated */
        short   type;           /* type of tty driver */
        short   subtype;        /* subtype of tty driver */
        struct ktermios init_termios; /* Initial termios */
        unsigned long   flags;          /* tty driver flags */
        struct proc_dir_entry *proc_entry; /* /proc fs entry */
        struct tty_driver *other; /* only used for the PTY driver */

        /*
         * Pointer to the tty data structures
         */
        struct tty_struct **ttys;
        struct tty_port **ports;
        struct ktermios **termios;
        void *driver_state;

        /*
         * Driver methods
         */

        const struct tty_operations *ops;
        struct list_head tty_drivers;
};

A tty driver has one or more tty_struct, tty_port, and ktermios settings, along with state.

magic
- magic number identifying a tty driver
- TTY_DRIVER_MAGIC(0x5402)
kref
- reference management
*cdevs
- pointer to the character devices
*owner
- points to the module.
*driver_name
- driver name
*name
- device name
name_base
- starting number for the printed name
major
- major number
minor_start
- starting minor number
num
- number of tty devices
type
- tty driver type (6 kinds)
  - TTY_DRIVER_TYPE_SYSTEM(1)
  - TTY_DRIVER_TYPE_CONSOLE(2)
  - TTY_DRIVER_TYPE_SERIAL(3)
  - TTY_DRIVER_TYPE_PTY(4)
  - TTY_DRIVER_TYPE_SCC(5)
  - TTY_DRIVER_TYPE_SYSCONS(6)
subtype
- tty driver subtype
  - system subtypes
    - SYSTEM_TYPE_TTY(1)
    - SYSTEM_TYPE_CONSOLE(2)
    - SYSTEM_TYPE_SYSCONS(3)
    - SYSTEM_TYPE_SYSPTMX(4)
  - pty subtypes
    - PTY_TYPE_MASTER(1)
    - PTY_TYPE_SLAVE(2)
  - serial subtypes
    - SERIAL_TYPE_NORMAL(1)
init_termios
- initial terminal settings (a struct ktermios value, not a pointer or hook function)
flags
- driver flags
  - TTY_DRIVER_INSTALLED(0x0001)
  - TTY_DRIVER_RESET_TERMIOS(0x0002)
  - TTY_DRIVER_REAL_RAW(0x0004)
  - TTY_DRIVER_DYNAMIC_DEV(0x0008)
  - TTY_DRIVER_DEVPTS_MEM(0x0010)
  - TTY_DRIVER_HARDWARE_BREAK(0x0020)
  - TTY_DRIVER_DYNAMIC_ALLOC(0x0040)
  - TTY_DRIVER_UNNUMBERED_NODE(0x0080)
*proc_entry
- starting directory entry in the proc interface
*other
- used only by pty-type tty drivers
**ttys
- points to the array of tty_struct.
**ports
- points to the array of tty_port. (one per tty port)
**termios
- points to the array of ktermios used for terminal control.
*driver_state
- driver-private state
*ops
- operations of the tty driver
tty_drivers
- list node used when adding to the global tty_drivers list
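
To make the flag values above easier to read when debugging, a decoder along the following lines can help. It is a userspace sketch: the flag values are copied from the list above, while tty_flags_to_string() and registers_devices_itself() are hypothetical helpers, not kernel functions.

```c
#include <stdio.h>
#include <string.h>

#define TTY_DRIVER_INSTALLED            0x0001
#define TTY_DRIVER_RESET_TERMIOS        0x0002
#define TTY_DRIVER_REAL_RAW             0x0004
#define TTY_DRIVER_DYNAMIC_DEV          0x0008
#define TTY_DRIVER_DEVPTS_MEM           0x0010
#define TTY_DRIVER_HARDWARE_BREAK       0x0020
#define TTY_DRIVER_DYNAMIC_ALLOC        0x0040
#define TTY_DRIVER_UNNUMBERED_NODE      0x0080

struct flag_name {
        unsigned long bit;
        const char *name;
};

static const struct flag_name tty_flag_names[] = {
        { TTY_DRIVER_INSTALLED,         "INSTALLED" },
        { TTY_DRIVER_RESET_TERMIOS,     "RESET_TERMIOS" },
        { TTY_DRIVER_REAL_RAW,          "REAL_RAW" },
        { TTY_DRIVER_DYNAMIC_DEV,       "DYNAMIC_DEV" },
        { TTY_DRIVER_DEVPTS_MEM,        "DEVPTS_MEM" },
        { TTY_DRIVER_HARDWARE_BREAK,    "HARDWARE_BREAK" },
        { TTY_DRIVER_DYNAMIC_ALLOC,     "DYNAMIC_ALLOC" },
        { TTY_DRIVER_UNNUMBERED_NODE,   "UNNUMBERED_NODE" },
};

/* Write a '|'-separated list of the set flag names into buf; returns buf. */
char *tty_flags_to_string(unsigned long flags, char *buf, size_t len)
{
        size_t i, used = 0;
        int n;

        if (len == 0)
                return buf;
        buf[0] = '\0';

        for (i = 0; i < sizeof(tty_flag_names) / sizeof(tty_flag_names[0]); i++) {
                if (!(flags & tty_flag_names[i].bit))
                        continue;
                n = snprintf(buf + used, len - used, "%s%s",
                             used ? "|" : "", tty_flag_names[i].name);
                if (n < 0 || (size_t)n >= len - used)
                        break;  /* buffer full: output is truncated */
                used += (size_t)n;
        }
        return buf;
}

/* 1 if tty_register_driver() would register the tty devices itself. */
int registers_devices_itself(unsigned long flags)
{
        return !(flags & TTY_DRIVER_DYNAMIC_DEV);
}
```

Passing TTY_DRIVER_REAL_RAW | TTY_DRIVER_DYNAMIC_DEV produces "REAL_RAW|DYNAMIC_DEV", and registers_devices_itself() returns 0 for that combination because the DYNAMIC_DEV bit is set.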

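tty_struct & tty_port structures

tty_struct specifies the line discipline for a tty device and manages it so that functions such as flow control are carried out, playing the data link role that corresponds to OSI layer 2. A tty driver can have one or more tty_port instances, and the tty_port structure holds the information for each of them.

The structure code is omitted.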

TTY Core operations

The following are the ops for tty and console used by the tty core.

drivers/tty/tty_io.c

static const struct file_operations tty_fops = {
        .llseek         = no_llseek,
        .read           = tty_read,
        .write          = tty_write,
        .poll           = tty_poll,
        .unlocked_ioctl = tty_ioctl,
        .compat_ioctl   = tty_compat_ioctl,
        .open           = tty_open,
        .release        = tty_release,
        .fasync         = tty_fasync,
};

static const struct file_operations console_fops = {
        .llseek         = no_llseek,
        .read           = tty_read,
        .write          = redirected_tty_write,
        .poll           = tty_poll,
        .unlocked_ioctl = tty_ioctl,
        .compat_ioctl   = tty_compat_ioctl,
        .open           = tty_open,
        .release        = tty_release,
        .fasync         = tty_fasync,
};

Terminal control settings (TERMIOS)

A tty driver that uses the standard terminal settings can use the settings below.

struct ktermios tty_std_termios = {     /* for the benefit of tty drivers  */
        .c_iflag = ICRNL | IXON,
        .c_oflag = OPOST | ONLCR,
        .c_cflag = B38400 | CS8 | CREAD | HUPCL,
        .c_lflag = ISIG | ICANON | ECHO | ECHOE | ECHOK |
                   ECHOCTL | ECHOKE | IEXTEN,
        .c_cc = INIT_C_CC,
        .c_ispeed = 38400,
        .c_ospeed = 38400
};

EXPORT_SYMBOL(tty_std_termios);

The following shows the terminal settings of a tty device, as reported by the stty command.

$ sudo stty -F /dev/tty0 -a
speed 38400 baud; rows 25; columns 80; line = 0;
intr = ^C; quit = ^\; erase = ^?; kill = ^U; eof = ^D; eol = <undef>; eol2 = <undef>; swtch = <undef>;
start = ^Q; stop = ^S; susp = ^Z; rprnt = ^R; werase = ^W; lnext = ^V; flush = ^O; min = 1; time = 0;
-parenb -parodd -cmspar cs8 -hupcl -cstopb cread -clocal -crtscts
ignbrk -brkint ignpar -parmrk -inpck -istrip -inlcr -igncr -icrnl -ixon -ixoff -iuclc -ixany -imaxbel -iutf8
-opost -olcuc -ocrnl -onlcr -onocr -onlret -ofill -ofdel nl0 cr0 tab0 bs0 vt0 ff0
-isig -icanon -iexten -echo -echoe -echok -echonl -noflsh -xcase -tostop -echoprt -echoctl -echoke

The terminal settings consist of 4 sets of mode flags, a line discipline, 19 control characters, and I/O speed settings, as shown below.

include/uapi/asm-generic/termbits.h

struct ktermios {
        tcflag_t c_iflag;               /* input mode flags */
        tcflag_t c_oflag;               /* output mode flags */
        tcflag_t c_cflag;               /* control mode flags */
        tcflag_t c_lflag;               /* local mode flags */
        cc_t c_line;                    /* line discipline */
        cc_t c_cc[NCCS];                /* control characters */
        speed_t c_ispeed;               /* input speed */
        speed_t c_ospeed;               /* output speed */
};
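
For comparison, the flag sets of tty_std_termios can be written against the userspace struct termios from <termios.h>. This is only a sketch under the assumption of a Linux/glibc environment: fill_std_termios() is a hypothetical helper, the control characters (INIT_C_CC) are left out, and the baud rate is set through cfsetispeed()/cfsetospeed() instead of the kernel-only c_ispeed/c_ospeed fields.

```c
#define _DEFAULT_SOURCE         /* exposes ECHOCTL and ECHOKE on glibc */
#include <string.h>
#include <termios.h>

/* Fill *t with flag sets equivalent to the kernel's tty_std_termios. */
void fill_std_termios(struct termios *t)
{
        memset(t, 0, sizeof(*t));

        t->c_iflag = ICRNL | IXON;
        t->c_oflag = OPOST | ONLCR;
        t->c_cflag = CS8 | CREAD | HUPCL;
        t->c_lflag = ISIG | ICANON | ECHO | ECHOE | ECHOK |
                     ECHOCTL | ECHOKE | IEXTEN;

        /* 38400 baud in both directions. */
        cfsetispeed(t, B38400);
        cfsetospeed(t, B38400);
}
```

A structure filled this way has canonical mode and echo enabled, unlike the /dev/tty0 output above where those flags appear with a leading minus sign.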

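Console

The driver that corresponds to the system's keyboard and screen is called the console. On a PC, the console uses the keyboard for input and a graphics output device, shown on a monitor, for output. Embedded systems typically have no keyboard or monitor, so a UART (serial) port is often used for input and output instead. In that case, the UART (serial) device is used as the console.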

console_init()

drivers/tty/tty_io.c

/*
 * Initialize the console device. This is called *early*, so
 * we can't necessarily depend on lots of kernel help here.
 * Just do some early initializations, and do the complex setup
 * later.
 */             
void __init console_init(void)
{
        initcall_t *call;

        /* Setup the default TTY line discipline. */
        tty_ldisc_begin();

        /*
         * set up the console device so that later boot sequences can
         * inform about problems etc..
         */
        call = __con_initcall_start;
        while (call < __con_initcall_end) {
                (*call)();
                call++;
        }
}

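Registers the default tty line discipline and calls the setup functions of the console device drivers configured and registered in the kernel.

Each system uses a different console device, and dozens of kinds are registered in the kernel. Some of those used on embedded devices are: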
- con_init() – CONFIG_VT_CONSOLE
- serial8250_console_init() – CONFIG_SERIAL_8250_CONSOLE
- bcm63xx_console_init() – CONFIG_SERIAL_BCM63XX_CONSOLE
- s3c24xx_serial_console_init() – CONFIG_SERIAL_SAMSUNG_CONSOLE

The following macro places the start function pointer of a console driver in the “.con_initcall.init” section.

console_initcall()

include/linux/init.h

#define console_initcall(fn) \
        static initcall_t __initcall_##fn \
        __used __section(.con_initcall.init) = fn
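
The mechanism of collecting function pointers in a dedicated section and walking them between two boundary symbols can be imitated in userspace. The sketch below assumes GCC or Clang on an ELF target (for example Linux), where the linker provides __start_ and __stop_ symbols for any section whose name is a valid C identifier. The section name con_initcalls and the two setup functions are invented for the example.

```c
typedef int (*initcall_t)(void);

/* Place a pointer to fn in the ELF section "con_initcalls". */
#define console_initcall(fn) \
        static initcall_t initcall_##fn \
        __attribute__((used, section("con_initcalls"))) = fn

/* Section boundaries, provided by the linker on ELF targets. */
extern initcall_t __start_con_initcalls[];
extern initcall_t __stop_con_initcalls[];

static int calls_made;

/* Hypothetical console setup functions. */
static int serial_console_setup(void) { calls_made++; return 0; }
static int vt_console_setup(void)     { calls_made++; return 0; }

console_initcall(serial_console_setup);
console_initcall(vt_console_setup);

/* Walk the table the way console_init() does; returns the number of calls. */
int run_console_initcalls(void)
{
        initcall_t *call;
        int n = 0;

        for (call = __start_con_initcalls; call < __stop_con_initcalls; call++) {
                (*call)();
                n++;
        }
        return n;
}
```

Each driver only has to add one console_initcall() line; the walking code never needs to know which drivers were compiled in.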

Line Discipline

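A line discipline driver sits between the topmost character device (generic tty driver) and the tty device driver that handles the hardware or pseudo terminal. Between the two, it can apply flow control, special commands, and data transformation to the input and output data. There are many kinds of line discipline drivers.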

tty_ldisc_begin()

drivers/tty/tty_ldisc.c

void tty_ldisc_begin(void)
{
        /* Setup the default TTY line discipline. */
        (void) tty_register_ldisc(N_TTY, &tty_ldisc_N_TTY);
}

Registers the default tty line discipline driver.

tty_register_ldisc()

drivers/tty/tty_ldisc.c

/**
 *      tty_register_ldisc      -       install a line discipline
 *      @disc: ldisc number
 *      @new_ldisc: pointer to the ldisc object
 *
 *      Installs a new line discipline into the kernel. The discipline
 *      is set up as unreferenced and then made available to the kernel
 *      from this point onwards.
 *
 *      Locking:
 *              takes tty_ldiscs_lock to guard against ldisc races
 */

int tty_register_ldisc(int disc, struct tty_ldisc_ops *new_ldisc)
{
        unsigned long flags;
        int ret = 0;

        if (disc < N_TTY || disc >= NR_LDISCS)
                return -EINVAL;

        raw_spin_lock_irqsave(&tty_ldiscs_lock, flags);
        tty_ldiscs[disc] = new_ldisc;
        new_ldisc->num = disc;
        new_ldisc->refcount = 0;
        raw_spin_unlock_irqrestore(&tty_ldiscs_lock, flags);

        return ret;
}
EXPORT_SYMBOL(tty_register_ldisc);

Registers the line discipline driver that corresponds to the requested disc number.

TTY line discipline table

A pointer array is declared that can hold 30 line discipline ops, as shown below.

drivers/tty/tty_ldisc.c

/* Line disc dispatch table */
static struct tty_ldisc_ops *tty_ldiscs[NR_LDISCS];

These are the line disciplines of the devices that can be registered. Starting with the default tty device, a wide variety of protocols can be handled.

include/uapi/linux/tty.h

/*
 * 'tty.h' defines some structures used by tty_io.c and some defines.
 */

#define NR_LDISCS               30

/* line disciplines */
#define N_TTY           0
#define N_SLIP          1
#define N_MOUSE         2
#define N_PPP           3
#define N_STRIP         4
#define N_AX25          5
#define N_X25           6       /* X.25 async */
#define N_6PACK         7
#define N_MASC          8       /* Reserved for Mobitex module <kaz@cafe.net> */
#define N_R3964         9       /* Reserved for Simatic R3964 module */
#define N_PROFIBUS_FDL  10      /* Reserved for Profibus */
#define N_IRDA          11      /* Linux IrDa - http://irda.sourceforge.net/ */
#define N_SMSBLOCK      12      /* SMS block mode - for talking to GSM data */
                                /* cards about SMS messages */
#define N_HDLC          13      /* synchronous HDLC */
#define N_SYNC_PPP      14      /* synchronous PPP */
#define N_HCI           15      /* Bluetooth HCI UART */
#define N_GIGASET_M101  16      /* Siemens Gigaset M101 serial DECT adapter */
#define N_SLCAN         17      /* Serial / USB serial CAN Adaptors */
#define N_PPS           18      /* Pulse per Second */
#define N_V253          19      /* Codec control over voice modem */
#define N_CAIF          20      /* CAIF protocol for talking to modems */
#define N_GSM0710       21      /* GSM 0710 Mux */
#define N_TI_WL         22      /* for TI's WL BT, FM, GPS combo chips */
#define N_TRACESINK     23      /* Trace data routing for MIPI P1149.7 */
#define N_TRACEROUTER   24      /* Trace data routing for MIPI P1149.7 */
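
The table-based registration can be modelled with a small userspace sketch. It keeps the same bounds check as tty_register_ldisc() but omits the spinlock, so it is only valid single-threaded; struct ldisc_ops, register_ldisc(), and lookup_ldisc() are simplified, hypothetical names.

```c
#include <errno.h>
#include <stddef.h>

#define NR_LDISCS       30
#define N_TTY           0

/* Simplified stand-in for struct tty_ldisc_ops. */
struct ldisc_ops {
        const char *name;
        int num;
        int refcount;
};

/* Dispatch table indexed by line discipline number. */
static struct ldisc_ops *ldiscs[NR_LDISCS];

/* Install ops at slot disc; returns 0 or -EINVAL for a bad number. */
int register_ldisc(int disc, struct ldisc_ops *ops)
{
        if (disc < N_TTY || disc >= NR_LDISCS)
                return -EINVAL;

        ldiscs[disc] = ops;
        ops->num = disc;
        ops->refcount = 0;
        return 0;
}

/* Return the ops registered for disc, or NULL if none or out of range. */
struct ldisc_ops *lookup_ldisc(int disc)
{
        if (disc < N_TTY || disc >= NR_LDISCS)
                return NULL;
        return ldiscs[disc];
}
```

Registering at slot N_TTY (0) succeeds and sets num and refcount, while slot 30 is rejected with -EINVAL because valid numbers run from 0 to NR_LDISCS - 1.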

Line Discipline Operations

This is the ops structure for a line discipline; the one below belongs to the default N_TTY.

drivers/tty/n_tty.c

struct tty_ldisc_ops tty_ldisc_N_TTY = {
        .magic           = TTY_LDISC_MAGIC,
        .name            = "n_tty",
        .open            = n_tty_open,
        .close           = n_tty_close,
        .flush_buffer    = n_tty_flush_buffer,
        .chars_in_buffer = n_tty_chars_in_buffer,
        .read            = n_tty_read,
        .write           = n_tty_write,
        .ioctl           = n_tty_ioctl,
        .set_termios     = n_tty_set_termios,
        .poll            = n_tty_poll,
        .receive_buf     = n_tty_receive_buf,
        .write_wakeup    = n_tty_write_wakeup,
        .fasync          = n_tty_fasync,
        .receive_buf2    = n_tty_receive_buf2,
};

Serial Core (Generic)

uart_register_driver()

drivers/tty/serial/serial_core.c

/**
 *      uart_register_driver - register a driver with the uart core layer
 *      @drv: low level driver structure
 *
 *      Register a uart driver with the core driver.  We in turn register
 *      with the tty layer, and initialise the core driver per-port state.
 *
 *      We have a proc file in /proc/tty/driver which is named after the
 *      normal driver.
 *
 *      drv->port should be NULL, and the per-port structures should be
 *      registered using uart_add_one_port after this call has succeeded.
 */
int uart_register_driver(struct uart_driver *drv)
{
        struct tty_driver *normal;
        int i, retval;

        BUG_ON(drv->state);

        /*
         * Maybe we should be using a slab cache for this, especially if
         * we have a large number of ports to handle.
         */
        drv->state = kzalloc(sizeof(struct uart_state) * drv->nr, GFP_KERNEL);
        if (!drv->state)
                goto out;

        normal = alloc_tty_driver(drv->nr);
        if (!normal) 
                goto out_kfree;

        drv->tty_driver = normal;

        normal->driver_name     = drv->driver_name;
        normal->name            = drv->dev_name;
        normal->major           = drv->major;
        normal->minor_start     = drv->minor;
        normal->type            = TTY_DRIVER_TYPE_SERIAL;
        normal->subtype         = SERIAL_TYPE_NORMAL;
        normal->init_termios    = tty_std_termios; 
        normal->init_termios.c_cflag = B9600 | CS8 | CREAD | HUPCL | CLOCAL;
        normal->init_termios.c_ispeed = normal->init_termios.c_ospeed = 9600;
        normal->flags           = TTY_DRIVER_REAL_RAW | TTY_DRIVER_DYNAMIC_DEV;
        normal->driver_state    = drv;
        tty_set_operations(normal, &uart_ops);

        /*
         * Initialise the UART state(s).
         */
        for (i = 0; i < drv->nr; i++) {
                struct uart_state *state = drv->state + i;
                struct tty_port *port = &state->port; 

                tty_port_init(port);
                port->ops = &uart_port_ops;
        }

        retval = tty_register_driver(normal);
        if (retval >= 0)
                return retval;
                
        for (i = 0; i < drv->nr; i++)
                tty_port_destroy(&drv->state[i].port);
        put_tty_driver(normal);
out_kfree:
        kfree(drv->state);
out:    
        return -ENOMEM;
}

uart_driver structure

include/linux/serial_core.h

struct uart_driver {
        struct module           *owner;
        const char              *driver_name;
        const char              *dev_name;
        int                      major;
        int                      minor;
        int                      nr;
        struct console          *cons;

        /*
         * these are private; the low level driver should not
         * touch these; they should be initialised to NULL
         */
        struct uart_state       *state;
        struct tty_driver       *tty_driver;
};

uart_state & uart_port structures

A uart driver consists of state and one or more uart_port instances.

The code is omitted.

uart Operations

include/linux/serial_core.h

/*
 * This structure describes all the operations that can be done on the
 * physical hardware.  See Documentation/serial/driver for details.
 */
struct uart_ops {
        unsigned int    (*tx_empty)(struct uart_port *);
        void            (*set_mctrl)(struct uart_port *, unsigned int mctrl);
        unsigned int    (*get_mctrl)(struct uart_port *);
        void            (*stop_tx)(struct uart_port *);
        void            (*start_tx)(struct uart_port *);
        void            (*throttle)(struct uart_port *);
        void            (*unthrottle)(struct uart_port *);
        void            (*send_xchar)(struct uart_port *, char ch);
        void            (*stop_rx)(struct uart_port *);
        void            (*enable_ms)(struct uart_port *);
        void            (*break_ctl)(struct uart_port *, int ctl);
        int             (*startup)(struct uart_port *);
        void            (*shutdown)(struct uart_port *);
        void            (*flush_buffer)(struct uart_port *);
        void            (*set_termios)(struct uart_port *, struct ktermios *new,
                                       struct ktermios *old);
        void            (*set_ldisc)(struct uart_port *, struct ktermios *);
        void            (*pm)(struct uart_port *, unsigned int state,
                              unsigned int oldstate);

        /*
         * Return a string describing the type of the port
         */
        const char      *(*type)(struct uart_port *);

        /*
         * Release IO and memory resources used by the port.
         * This includes iounmap if necessary.
         */
        void            (*release_port)(struct uart_port *);

        /*
         * Request IO and memory resources used by the port.
         * This includes iomapping the port if necessary.
         */
        int             (*request_port)(struct uart_port *);
        void            (*config_port)(struct uart_port *, int);
        int             (*verify_port)(struct uart_port *, struct serial_struct *);
        int             (*ioctl)(struct uart_port *, unsigned int, unsigned long);
#ifdef CONFIG_CONSOLE_POLL
        int             (*poll_init)(struct uart_port *);
        void            (*poll_put_char)(struct uart_port *, unsigned char);
        int             (*poll_get_char)(struct uart_port *);
#endif
};

Serial 8250 Core (Generic)

Support is in place so that drivers for the following uart chips can be put together quickly.

8250, 16450, 16550, 16550A, 16C950/954
Cirrus
ST16650, ST16650V2, ST16654
TI16750
Startech
XR16850
RSA
NS16550A
XScale
OCTEON
AR7
U6_16550A
Tegra
XR17D15X, XR17V35X
LPC3220
TruManage
CIR port
Altera 16550 FIFO32, Altera 16550 FIFO64, Altera 16550 FIFO128
16550A_FSL64

serial8250_register_8250_port()

drivers/tty/serial/8250/8250_core.c

/**
 *      serial8250_register_8250_port - register a serial port
 *      @up: serial port template
 *
 *      Configure the serial port specified by the request. If the
 *      port exists and is in use, it is hung up and unregistered
 *      first.
 *
 *      The port is then probed and if necessary the IRQ is autodetected
 *      If this fails an error is returned.
 *
 *      On success the port is ready to use and the line number is returned.
 */
int serial8250_register_8250_port(struct uart_8250_port *up)
{
        struct uart_8250_port *uart;
        int ret = -ENOSPC;

        if (up->port.uartclk == 0)
                return -EINVAL;

        mutex_lock(&serial_mutex);

        uart = serial8250_find_match_or_unused(&up->port);
        if (uart && uart->port.type != PORT_8250_CIR) {
                if (uart->port.dev)
                        uart_remove_one_port(&serial8250_reg, &uart->port);

                uart->port.iobase       = up->port.iobase;
                uart->port.membase      = up->port.membase;
                uart->port.irq          = up->port.irq;
                uart->port.irqflags     = up->port.irqflags;
                uart->port.uartclk      = up->port.uartclk;
                uart->port.fifosize     = up->port.fifosize;
                uart->port.regshift     = up->port.regshift;
                uart->port.iotype       = up->port.iotype;
                uart->port.flags        = up->port.flags | UPF_BOOT_AUTOCONF;
                uart->bugs              = up->bugs;
                uart->port.mapbase      = up->port.mapbase;
                uart->port.private_data = up->port.private_data;
                uart->port.fifosize     = up->port.fifosize;
                uart->tx_loadsz         = up->tx_loadsz;
                uart->capabilities      = up->capabilities;
                uart->port.throttle     = up->port.throttle;
                uart->port.unthrottle   = up->port.unthrottle;
                uart->port.rs485_config = up->port.rs485_config;
                uart->port.rs485        = up->port.rs485;

                /* Take tx_loadsz from fifosize if it wasn't set separately */
                if (uart->port.fifosize && !uart->tx_loadsz)
                        uart->tx_loadsz = uart->port.fifosize;

                if (up->port.dev)
                        uart->port.dev = up->port.dev;

                if (up->port.flags & UPF_FIXED_TYPE)
                        serial8250_init_fixed_type_port(uart, up->port.type);

                set_io_from_upio(&uart->port);
                /* Possibly override default I/O functions.  */
                if (up->port.serial_in)
                        uart->port.serial_in = up->port.serial_in;
                if (up->port.serial_out)
                        uart->port.serial_out = up->port.serial_out;
                if (up->port.handle_irq)
                        uart->port.handle_irq = up->port.handle_irq;
                /*  Possibly override set_termios call */
                if (up->port.set_termios)
                        uart->port.set_termios = up->port.set_termios;
                if (up->port.set_mctrl)
                        uart->port.set_mctrl = up->port.set_mctrl;
                if (up->port.startup)
                        uart->port.startup = up->port.startup;
                if (up->port.shutdown)
                        uart->port.shutdown = up->port.shutdown;
                if (up->port.pm)
                        uart->port.pm = up->port.pm;
                if (up->port.handle_break)
                        uart->port.handle_break = up->port.handle_break;
                if (up->dl_read)
                        uart->dl_read = up->dl_read;
                if (up->dl_write)
                        uart->dl_write = up->dl_write;
                if (up->dma) {
                        uart->dma = up->dma;
                        if (!uart->dma->tx_dma)
                                uart->dma->tx_dma = serial8250_tx_dma;
                        if (!uart->dma->rx_dma)
                                uart->dma->rx_dma = serial8250_rx_dma;
                }

                if (serial8250_isa_config != NULL)
                        serial8250_isa_config(0, &uart->port,
                                        &uart->capabilities);

                ret = uart_add_one_port(&serial8250_reg, &uart->port);
                if (ret == 0)
                        ret = uart->port.line;
        }
        mutex_unlock(&serial_mutex);

        return ret;
}

uart_8250_port structure

include/linux/serial_8250.h

struct uart_8250_port {
        struct uart_port        port;
        struct timer_list       timer;          /* "no irq" timer */
        struct list_head        list;           /* ports on this IRQ */
        unsigned short          capabilities;   /* port capabilities */
        unsigned short          bugs;           /* port bugs */
        bool                    fifo_bug;       /* min RX trigger if enabled */
        unsigned int            tx_loadsz;      /* transmit fifo load size */
        unsigned char           acr;
        unsigned char           fcr;
        unsigned char           ier;
        unsigned char           lcr;
        unsigned char           mcr;
        unsigned char           mcr_mask;       /* mask of user bits */
        unsigned char           mcr_force;      /* mask of forced bits */
        unsigned char           cur_iotype;     /* Running I/O type */
        unsigned int            rpm_tx_active;
        unsigned char           canary;         /* non-zero during system sleep
                                                 *   if no_console_suspend
                                                 */

        /*
         * Some bits in registers are cleared on a read, so they must
         * be saved whenever the register is read but the bits will not
         * be immediately processed.
         */
#define LSR_SAVE_FLAGS UART_LSR_BRK_ERROR_BITS
        unsigned char           lsr_saved_flags;
#define MSR_SAVE_FLAGS UART_MSR_ANY_DELTA
        unsigned char           msr_saved_flags;

        struct uart_8250_dma    *dma;

        /* 8250 specific callbacks */
        int                     (*dl_read)(struct uart_8250_port *);
        void                    (*dl_write)(struct uart_8250_port *, int);
};

References

Line Disciplines – www.embeddedlinux.org
Linux Device Drivers, Chapter 18: TTY Drivers – download pdf | Jonathan Corbet, Alessandro Rubini, Greg Kroah-Hartman
Linux kernel serial drivers – download pdf | Free Electrons
TERMIOS | man7.org
stty | man

Kthreadd()

2017-11-30 / 2017-12-04 문영일

This page analyzes the kthread_create() and kthreadd() functions.

kthreadd

Forks the kernel threads whose creation has been requested.

APIs used

kthread_create()
kthread_run()
kthread_bind()
kthread_stop()
kthread_should_stop()
kthread_data()

kthread_create()

include/linux/kthread.h

#define kthread_create(threadfn, data, namefmt, arg...) \
        kthread_create_on_node(threadfn, data, -1, namefmt, ##arg)

Creates a kernel thread that uses the threadfn function as its entry point. If the thread needs to run right away, follow the call with wake_up_process().

kthread_create_on_cpu()

kernel/kthread.c

/**
 * kthread_create_on_cpu - Create a cpu bound kthread
 * @threadfn: the function to run until signal_pending(current).
 * @data: data ptr for @threadfn.
 * @cpu: The cpu on which the thread should be bound,
 * @namefmt: printf-style name for the thread. Format is restricted
 *           to "name.*%u". Code fills in cpu number.
 *
 * Description: This helper function creates and names a kernel thread
 * The thread will be woken and put into park mode.
 */
struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
                                          void *data, unsigned int cpu,
                                          const char *namefmt)
{
        struct task_struct *p;

        p = kthread_create_on_node(threadfn, data, cpu_to_node(cpu), namefmt,
                                   cpu);
        if (IS_ERR(p))
                return p;
        set_bit(KTHREAD_IS_PER_CPU, &to_kthread(p)->flags);
        to_kthread(p)->cpu = cpu;
        /* Park the thread to get it out of TASK_UNINTERRUPTIBLE state */
        kthread_park(p);
        return p; 
}

Creates a kernel thread bound to the specified cpu, with the threadfn function as its entry point. The newly created thread is immediately put into the parked state.

kthread_create_on_node()

kernel/kthread.c

/**
 * kthread_create_on_node - create a kthread.
 * @threadfn: the function to run until signal_pending(current).
 * @data: data ptr for @threadfn.
 * @node: memory node number.
 * @namefmt: printf-style name for the thread.
 *
 * Description: This helper function creates and names a kernel
 * thread.  The thread will be stopped: use wake_up_process() to start
 * it.  See also kthread_run().
 *
 * If thread is going to be bound on a particular cpu, give its node
 * in @node, to get NUMA affinity for kthread stack, or else give -1.
 * When woken, the thread will run @threadfn() with @data as its
 * argument. @threadfn() can either call do_exit() directly if it is a
 * standalone thread for which no one will call kthread_stop(), or
 * return when 'kthread_should_stop()' is true (which means
 * kthread_stop() has been called).  The return value should be zero
 * or a negative error number; it will be passed to kthread_stop().
 *
 * Returns a task_struct or ERR_PTR(-ENOMEM) or ERR_PTR(-EINTR).
 */
struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
                                           void *data, int node,
                                           const char namefmt[],
                                           ...)
{
        DECLARE_COMPLETION_ONSTACK(done);
        struct task_struct *task;
        struct kthread_create_info *create = kmalloc(sizeof(*create),
                                                     GFP_KERNEL);

        if (!create)
                return ERR_PTR(-ENOMEM);
        create->threadfn = threadfn;
        create->data = data;
        create->node = node;
        create->done = &done;

        spin_lock(&kthread_create_lock);
        list_add_tail(&create->list, &kthread_create_list);
        spin_unlock(&kthread_create_lock);

        wake_up_process(kthreadd_task);

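In code lines 30~42, a kthread_create_info structure is allocated, filled in from the request arguments, and added to kthread_create_list.
In code line 44, the top-level kernel thread named “kthreadd” is woken up; it creates kernel threads from the entries placed on kthread_create_list.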

        /*
         * Wait for completion in killable state, for I might be chosen by
         * the OOM killer while kthreadd is trying to allocate memory for
         * new kernel thread.
         */
        if (unlikely(wait_for_completion_killable(&done))) {
                /*
                 * If I was SIGKILLed before kthreadd (or new kernel thread)
                 * calls complete(), leave the cleanup of this structure to
                 * that thread.
                 */
                if (xchg(&create->done, NULL))
                        return ERR_PTR(-EINTR);
                /*
                 * kthreadd (or new kernel thread) will call complete()
                 * shortly.
                 */
                wait_for_completion(&done);
        }
        task = create->result;
        if (!IS_ERR(task)) {
                static const struct sched_param param = { .sched_priority = 0 };
                va_list args;

                va_start(args, namefmt);
                vsnprintf(task->comm, sizeof(task->comm), namefmt, args);
                va_end(args);
                /*
                 * root may have changed our (kthreadd's) priority or CPU mask.
                 * The kernel thread should not inherit these properties.
                 */
                sched_setscheduler_nocheck(task, SCHED_NORMAL, &param);
                set_cpus_allowed_ptr(task, cpu_all_mask);
        }
        kfree(create);
        return task;
}
EXPORT_SYMBOL(kthread_create_on_node);

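In code lines 6~19, the caller waits until the kernel thread it requested has been created by the “kthreadd” kernel task. In the unlikely case that the wait is interrupted by SIGKILL (-ERESTARTSYS), the caller takes create->done with xchg(): if kthreadd (or the new kernel thread) has not taken it yet, the function returns -EINTR and leaves the cleanup to that thread; otherwise complete() is about to be called, so it waits once more.
- See: wait_for_completion() | 문c
In code lines 20~36, once creation has finished, the created kernel task (or an error pointer) is returned. If the kernel task was created successfully, its name is set from namefmt, its scheduling policy is changed to that of a normal task, and it is allowed to run on all cpus.
- When a kernel thread is made a cfs task using the normal policy, sched_priority must always be set to 0.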
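
The ownership handoff around create->done can be illustrated with a small user-space sketch. It uses C11 atomic_exchange() as a stand-in for the kernel's xchg(); the struct and function names (completion_model, create_model, requester_abandons, creator_claims) are hypothetical and exist only for this illustration. Whichever side swaps the pointer out first "wins", and the other side sees NULL.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct completion; only its address matters. */
struct completion_model {
        int done;
};

/* Hypothetical stand-in for struct kthread_create_info. */
struct create_model {
        _Atomic(struct completion_model *) done;
};

/*
 * Requester side, after its wait was interrupted.
 * Returns 1 if it took the pointer first: it may leave with -EINTR and the
 * creator side is then responsible for freeing the request.
 * Returns 0 if the creator already took the pointer: complete() is coming,
 * so the requester must keep waiting.
 */
static int requester_abandons(struct create_model *c)
{
        assert(c != NULL);
        return atomic_exchange(&c->done, (struct completion_model *)NULL) != NULL;
}

/*
 * Creator side (kthreadd or the new thread in the kernel).
 * Returns the completion to signal, or NULL if the requester already gave
 * up, in which case the creator frees the request itself.
 */
static struct completion_model *creator_claims(struct create_model *c)
{
        assert(c != NULL);
        return atomic_exchange(&c->done, (struct completion_model *)NULL);
}
```

Calling requester_abandons() first and creator_claims() second yields 1 and NULL; calling them in the opposite order yields the completion pointer and 0. Exactly one side ever obtains the non-NULL pointer, which is what makes the cleanup responsibility unambiguous.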

kthreadd()

kernel/kthread.c

int kthreadd(void *unused)
{
        struct task_struct *tsk = current;

        /* Setup a clean context for our children to inherit. */
        set_task_comm(tsk, "kthreadd");
        ignore_signals(tsk);
        set_cpus_allowed_ptr(tsk, cpu_all_mask);
        set_mems_allowed(node_states[N_MEMORY]);

        current->flags |= PF_NOFREEZE;

        for (;;) {
                set_current_state(TASK_INTERRUPTIBLE);
                if (list_empty(&kthread_create_list))
                        schedule();
                __set_current_state(TASK_RUNNING);

                spin_lock(&kthread_create_lock);
                while (!list_empty(&kthread_create_list)) {
                        struct kthread_create_info *create;

                        create = list_entry(kthread_create_list.next,
                                            struct kthread_create_info, list);
                        list_del_init(&create->list);
                        spin_unlock(&kthread_create_lock);

                        create_kthread(create);

                        spin_lock(&kthread_create_lock);
                }
                spin_unlock(&kthread_create_lock);
        }

        return 0;
}

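This is the top-level kernel thread used to create kernel threads. It loops forever and handles each kernel thread creation request as it arrives.

In code lines 6~11, the task name is set to “kthreadd” and signals are ignored. The kernel thread is allowed to run on all cpus and all memory nodes, and it is marked so that it is not frozen.
- See: set_mems_allowed() | 문c
In code lines 13~17, inside the infinite loop, the state is changed to interruptible, and if kthread_create_list has nothing to create, the thread sleeps. On wakeup the state is changed back to running.
In code lines 19~32, the entries registered on kthread_create_list are traversed, and a kernel thread is created for each one.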
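
The drain step of the loop can be sketched in user space as a simple FIFO. This is only a model under stated assumptions: the kernel uses a list_head protected by kthread_create_lock and drops the lock around create_kthread(), whereas the hypothetical request_model/queue_model types below are single-threaded and record the handling order instead of forking anything.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified request (the kernel uses kthread_create_info). */
struct request_model {
        int id;
        struct request_model *next;
};

struct queue_model {
        struct request_model *head;
        struct request_model *tail;
};

/* Append at the tail, as list_add_tail() does for kthread_create_list. */
static void queue_add_tail(struct queue_model *q, struct request_model *r)
{
        assert(q != NULL && r != NULL);
        r->next = NULL;
        if (q->tail)
                q->tail->next = r;
        else
                q->head = r;
        q->tail = r;
}

/*
 * Drain the queue from the head and record the order in which requests
 * are handled. Returns the number of requests handled (at most max).
 */
static int drain_queue(struct queue_model *q, int *order, int max)
{
        int n = 0;

        while (q->head && n < max) {
                struct request_model *r = q->head;

                /* Unlink the first entry, like list_del_init() on .next. */
                q->head = r->next;
                if (!q->head)
                        q->tail = NULL;

                /* Stands in for create_kthread(create). */
                order[n++] = r->id;
        }
        return n;
}
```

Requests queued as 1, 2, 3 are handled in that same order, and a second drain on the now-empty queue handles nothing, which corresponds to kthreadd going back to sleep.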

create_kthread()

kernel/kthread.c

static void create_kthread(struct kthread_create_info *create)
{
        int pid;

#ifdef CONFIG_NUMA
        current->pref_node_fork = create->node;
#endif
        /* We want our own signal handler (we take no signals by default). */
        pid = kernel_thread(kthread, create, CLONE_FS | CLONE_FILES | SIGCHLD);
        if (pid < 0) {
                /* If user was SIGKILLed, I release the structure. */
                struct completion *done = xchg(&create->done, NULL);

                if (!done) {
                        kfree(create);
                        return;
                }
                create->result = ERR_PTR(pid);
                complete(done);
        }
}

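Creates a kernel thread. On failure, the completion is signalled with the error as the result.

It forks the kthread() function and passes the kthread_create_info request, which holds the requested function pointer, as its argument.
Inside kthread(), the freshly forked task puts itself to sleep. When it is woken, it immediately calls the function it was handed.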

kernel_thread()

kernel/fork.c

/*                     
 * Create a kernel thread.
 */
pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags) 
{
        return do_fork(flags|CLONE_VM|CLONE_UNTRACED, (unsigned long)fn,
                (unsigned long)arg, NULL, NULL);
}

Forks with the fn function as the entry point to create a kernel thread.

fn functions:
- kthread()
- kernel_init()
- kthreadd()
- wait_for_helper()
- ____call_usermodehelper()

kthread()

kernel/kthread.c

static int kthread(void *_create)
{
        /* Copy data: it's on kthread's stack */
        struct kthread_create_info *create = _create;
        int (*threadfn)(void *data) = create->threadfn;
        void *data = create->data;
        struct completion *done;
        struct kthread self;
        int ret;

        self.flags = 0;
        self.data = data;
        init_completion(&self.exited);
        init_completion(&self.parked);
        current->vfork_done = &self.exited;

        /* If user was SIGKILLed, I release the structure. */
        done = xchg(&create->done, NULL);
        if (!done) {
                kfree(create);
                do_exit(-EINTR);
        }
        /* OK, tell user we're spawned, wait for stop or wakeup */
        __set_current_state(TASK_UNINTERRUPTIBLE);
        create->result = current;
        complete(done);
        schedule();

        ret = -EINTR;

        if (!test_bit(KTHREAD_SHOULD_STOP, &self.flags)) {
                __kthread_parkme(&self);
                ret = threadfn(data);
        }
        /* we can't just return, we must preserve "self" on stack */
        do_exit(ret);
}

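Once forked and running, this function signals the completion to announce that the thread has been created, then sleeps. When it later receives a wakeup request, it calls the function passed in as an argument.

In code lines 18~22, if the thread creation request has been abandoned because of SIGKILL, the current thread exits.
In code lines 24~27, complete() is called so that the thread waiting in wait_for_completion() for this creation request can proceed. The new thread then sleeps.
In code lines 31~34, unless a stop request has been received, the function passed in as an argument is called.
In code line 36, when execution has finished, the thread exits.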
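
The tail of kthread() decides whether threadfn ever runs. The following user-space sketch models only that decision; SHOULD_STOP_MODEL, exit_code_model and sample_threadfn are hypothetical names, and the flag value is not the kernel's KTHREAD_SHOULD_STOP bit number.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical flag bit used only by this model. */
#define SHOULD_STOP_MODEL 0x1UL

/*
 * Returns the value that would be handed to do_exit(): -EINTR if a stop was
 * requested before the thread function ever ran, otherwise whatever the
 * thread function returns.
 */
static int exit_code_model(unsigned long flags, int (*threadfn)(void *),
                           void *data)
{
        int ret = -EINTR;

        assert(threadfn != NULL);
        if (!(flags & SHOULD_STOP_MODEL))
                ret = threadfn(data);
        return ret;
}

/* Example thread function: counts how often it was called, returns 0. */
static int sample_threadfn(void *data)
{
        int *calls = data;

        (*calls)++;
        return 0;
}
```

With flags set to 0 the sample function is called once and 0 is returned. With SHOULD_STOP_MODEL set, the function is never called and the result is -EINTR, which matches the case where kthread_stop() is called on a thread that was created but never woken.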

References

Kernel Thread | Flatinum
Kthreads – download pdf

RCU(Read Copy Update) -3- (RCU threads)

2017-11-29 / 2021-04-01 문영일

The following rcu kernel threads are examined.

Kernel threads related to rcu
- “rcu_preempt”
  - gp management kernel thread for cb
    - rcu_gp_kthread()
- “rcuog/N”
  - gp management kernel thread for no-cb
    - rcu_nocb_gp_kthread()
- “rcuop/N”
  - callback processing kernel thread for no-cb
    - rcu_nocb_cb_kthread()
- “rcub/N”
  - priority boost kernel thread
    - rcu_boost_kthread()
- “rcuc/N”
  - callback processing kernel thread for cb
    - rcu_cpu_kthread()
- “rcu_tasks_kthread”
  - rcu tasks kernel thread that uses the srcu API and operates on a per-task basis (code analysis related to srcu is omitted)
    - rcu_tasks_kthread()
- “rcu_gp”
  - worker thread used for expedited gp synchronization in the synchronize_rcu_expedited() API and by srcu
    - wait_rcu_exp_gp()
    - srcu_invoke_callbacks()
- “rcu_par_gp”
  - worker thread used when waiting for the nodes selected in an expedited gp
    - sync_rcu_exp_select_node_cpus()

In kernel v5.4-rc1, to mitigate OOM related to no-cb processing on systems with many cpus, the leader/follower-group-based no-cb structure was abandoned and the nocb gp kthread was introduced.

See:
- rcu/nocb: Provide separate no-CBs grace-period kthreads
- rcu/nocb: Rename rcu_nocb_leader_stride kernel boot parameter

RCU-Tasks subsystem

rcu-tasks subsystem: apart from a few users (rcutorture tests and bpf), kernel code that uses the call_rcu_tasks() API function is still extremely rare.

See: The RCU-tasks subsystem

Creating the RCU-related threads

Priority of the rcu kernel threads

kernel/rcu/tree.c

/* rcuc/rcub kthread realtime priority */
static int kthread_prio = CONFIG_RCU_KTHREAD_PRIO;
module_param(kthread_prio, int, 0644);

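This is the priority for the RCU kernel threads that use the SCHED_FIFO policy.

Default values
- The lowest values (0 ~ 2) are used. For SCHED_FIFO, a larger sched_priority value means a higher priority; with 0, rcu_spawn_gp_kthread() does not switch the gp kernel thread to SCHED_FIFO at all.
- When the CONFIG_RCU_BOOST kernel option is used, 0 is not used (the minimum becomes 1).
- When the CONFIG_RCU_TORTURE_TEST kernel option is also built in, 1 is not used either (the minimum becomes 2).
User setting
- 0 ~ 99.

The following shows kthread_prio reporting the default value 0.

$ cat /sys/module/rcutree/parameters/kthread_prio
0

The following figure shows the call relationships among the functions that create the rcu-related threads.

(srcu and worker threads are excluded)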

rcu_spawn_gp_kthread()

kernel/rcu/tree.c

/*
 * Spawn the kthreads that handle RCU's grace periods.
 */

static int __init rcu_spawn_gp_kthread(void)
{
        unsigned long flags;
        int kthread_prio_in = kthread_prio;
        struct rcu_node *rnp;
        struct sched_param sp;
        struct task_struct *t;

        /* Force priority into range. */
        if (IS_ENABLED(CONFIG_RCU_BOOST) && kthread_prio < 2
            && IS_BUILTIN(CONFIG_RCU_TORTURE_TEST))
                kthread_prio = 2;
        else if (IS_ENABLED(CONFIG_RCU_BOOST) && kthread_prio < 1)
                kthread_prio = 1;
        else if (kthread_prio < 0)
                kthread_prio = 0;
        else if (kthread_prio > 99)
                kthread_prio = 99;

        if (kthread_prio != kthread_prio_in)
                pr_alert("rcu_spawn_gp_kthread(): Limited prio to %d from %d\n",
                         kthread_prio, kthread_prio_in);

        rcu_scheduler_fully_active = 1;
        t = kthread_create(rcu_gp_kthread, NULL, "%s", rcu_state.name);
        if (WARN_ONCE(IS_ERR(t), "%s: Could not start grace-period kthread, OOM is now expected behavior\n", __func__))
                return 0;
        if (kthread_prio) {
                sp.sched_priority = kthread_prio;
                sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
        }
        rnp = rcu_get_root();
        raw_spin_lock_irqsave_rcu_node(rnp, flags);
        rcu_state.gp_kthread = t;
        raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
        wake_up_process(t);
        rcu_spawn_nocb_kthreads();
        rcu_spawn_boost_kthreads();
        return 0;
}
early_initcall(rcu_spawn_gp_kthread);

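Creates and starts the gp kthread, which manages the start and end of each gp.

In code lines 10~12, on a system where CONFIG_RCU_BOOST is enabled and CONFIG_RCU_TORTURE_TEST is built in, the priority of the rcu kernel threads is kept from dropping below 2.
- rcutorture threads run at priority 1, so the value is not allowed to go below 2. (For SCHED_FIFO, a larger value means a higher priority.)
In code lines 13~18, the priority of the rcu kernel threads is adjusted so that it stays within the 0~99 range. When the rcu boost feature is used, the value is additionally raised to at least 1, so the rcu kernel threads never use 0.
- With 0, the gp kernel thread is not switched to SCHED_FIFO (see the kthread_prio check in code line 28), whereas the boost kernel threads are always switched to SCHED_FIFO with this value, so rcu boost needs at least 1.
In code lines 20~22, if kthread_prio had to be adjusted from the requested value, an alert message reporting the change is printed.
In code line 24, the rcu scheduler is marked as fully active.
In code lines 25~36, rcu_gp_kthread is created and woken up so that it starts running. If kthread_prio is non-zero, the thread becomes an RT thread using the SCHED_FIFO policy with that priority.
- rcu_gp_kthread task name: [rcu_preempt]
In code line 37, the no-cb kernel threads are created and started for the online cpus that are configured as no-cb.
- rcu_nocb_gp_kthread task name: [rcuog/<cpu>]
- e.g. [rcuog/0]
- rcu_nocb_cb_kthread task name: [rcuop/<cpu>]
- e.g. [rcuop/0]
In code line 38, one rcu_boost_kthread per leaf node is created and started. These are created as SCHED_FIFO threads.
- rcu_boost_kthread task name: [rcub/<rcu leaf node index>]
- e.g. [rcub/0], [rcub/1], …
In code line 39, the normal value 0 is returned.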
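
The priority adjustment at the top of rcu_spawn_gp_kthread() can be pulled out into a pure function for experimentation. This is a user-space sketch: the if/else chain mirrors the listing above, but the function name clamp_kthread_prio and the two boolean parameters (standing in for IS_ENABLED(CONFIG_RCU_BOOST) and IS_BUILTIN(CONFIG_RCU_TORTURE_TEST)) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Returns the adjusted priority for a requested kthread_prio value.
 * rcu_boost models IS_ENABLED(CONFIG_RCU_BOOST); torture_builtin models
 * IS_BUILTIN(CONFIG_RCU_TORTURE_TEST).
 */
static int clamp_kthread_prio(int prio, bool rcu_boost, bool torture_builtin)
{
        if (rcu_boost && prio < 2 && torture_builtin)
                prio = 2;       /* boost + built-in rcutorture: minimum 2 */
        else if (rcu_boost && prio < 1)
                prio = 1;       /* boost only: minimum 1 */
        else if (prio < 0)
                prio = 0;       /* lower bound without boost */
        else if (prio > 99)
                prio = 99;      /* upper bound */

        assert(prio >= 0 && prio <= 99);
        return prio;
}
```

For example, a requested value of 0 stays 0 without boost, becomes 1 with boost, and becomes 2 with boost plus built-in rcutorture; a request of 150 is limited to 99 in every configuration.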

rcu_spawn_core_kthreads()

kernel/rcu/tree.c

/*
 * Spawn per-CPU RCU core processing kthreads.
 */

static int __init rcu_spawn_core_kthreads(void)
{
        int cpu;

        for_each_possible_cpu(cpu)
                per_cpu(rcu_data.rcu_cpu_has_work, cpu) = 0;
        if (!IS_ENABLED(CONFIG_RCU_BOOST) && use_softirq)
                return 0;
        WARN_ONCE(smpboot_register_percpu_thread(&rcu_cpu_thread_spec),
                  "%s: Could not start rcuc kthread, OOM is now expected behavior\n", __func__);
        return 0;
}
early_initcall(rcu_spawn_core_kthreads);

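Creates and starts the core processing threads that handle rcu callbacks. However, when the rcu boost feature is not used and use_softirq (default=1) is set, the core processing threads are not created.

In code lines 5~6, all possible cpus are traversed and rcu_data.rcu_cpu_has_work is initialized to 0.
In code lines 7~8, if the rcu boost feature is not used and use_softirq (default=1) is set, the function returns without creating the core processing threads.
In code lines 9~10, the core processing thread is registered as a per-cpu thread so that it is set up whenever a cpu is brought online by hotplug.
- rcu_cpu_kthread task name: [rcuc/<cpu>]
- e.g. [rcuc/0], [rcuc/1], …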
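
Because the early-return test in rcu_spawn_core_kthreads() combines a negation with a logical AND, it is easy to misread. A minimal sketch of the same condition as a predicate follows; rcuc_kthreads_registered is a hypothetical name, and the two parameters stand in for IS_ENABLED(CONFIG_RCU_BOOST) and the use_softirq variable.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Returns true when the rcuc per-cpu threads would be registered, that is,
 * when the early return "!boost && use_softirq" is NOT taken.
 */
static bool rcuc_kthreads_registered(bool rcu_boost, bool use_softirq)
{
        return !(!rcu_boost && use_softirq);
}
```

Only one of the four combinations skips the registration: boost disabled together with use_softirq set. Enabling boost or clearing use_softirq is each sufficient on its own to get the rcuc threads.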

rcu_spawn_nocb_kthreads()

kernel/rcu/tree_plugin.h

/*
 * Once the scheduler is running, spawn rcuo kthreads for all online
 * no-CBs CPUs.  This assumes that the early_initcall()s happen before
 * non-boot CPUs come online -- if this changes, we will need to add
 * some mutual exclusion.
 */

static void __init rcu_spawn_nocb_kthreads(void)
{
        int cpu;

        for_each_online_cpu(cpu)
                rcu_spawn_cpu_nocb_kthread(cpu);
}

Creates the no-cb kernel threads for each online cpu.

rcu_spawn_cpu_nocb_kthread()

kernel/rcu/tree_plugin.h

/*
 * If the specified CPU is a no-CBs CPU that does not already have its
 * rcuo kthreads, spawn them.
 */

static void rcu_spawn_cpu_nocb_kthread(int cpu)
{
        if (rcu_scheduler_fully_active)
                rcu_spawn_one_nocb_kthread(cpu);
}

Creates the no-cb kernel threads for the given cpu:

- the gp kernel thread for no-cb
- the callback processing kernel thread for no-cb

rcu_spawn_one_nocb_kthread()

kernel/rcu/tree_plugin.h

/*
 * If the specified CPU is a no-CBs CPU that does not already have its
 * rcuo CB kthread, spawn it.  Additionally, if the rcuo GP kthread
 * for this CPU's group has not yet been created, spawn it as well.
 */

static void rcu_spawn_one_nocb_kthread(int cpu)
{
        struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
        struct rcu_data *rdp_gp;
        struct task_struct *t;

        /*
         * If this isn't a no-CBs CPU or if it already has an rcuo kthread,
         * then nothing to do.
         */
        if (!rcu_is_nocb_cpu(cpu) || rdp->nocb_cb_kthread)
                return;

        /* If we didn't spawn the GP kthread first, reorganize! */
        rdp_gp = rdp->nocb_gp_rdp;
        if (!rdp_gp->nocb_gp_kthread) {
                t = kthread_run(rcu_nocb_gp_kthread, rdp_gp,
                                "rcuog/%d", rdp_gp->cpu);
                if (WARN_ONCE(IS_ERR(t), "%s: Could not start rcuo GP kthread, OOM is now expected behavior\n", __func__))
                        return;
                WRITE_ONCE(rdp_gp->nocb_gp_kthread, t);
        }

        /* Spawn the kthread for this CPU. */
        t = kthread_run(rcu_nocb_cb_kthread, rdp,
                        "rcuo%c/%d", rcu_state.abbr, cpu);
        if (WARN_ONCE(IS_ERR(t), "%s: Could not start rcuo CB kthread, OOM is now expected behavior\n", __func__))
                return;
        WRITE_ONCE(rdp->nocb_cb_kthread, t);
        WRITE_ONCE(rdp->nocb_gp_kthread, rdp_gp->nocb_gp_kthread);
}

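Creates the no-cb gp kernel thread and the no-cb callback processing kernel thread for the given cpu. The gp kernel thread is created only if the cpu's group does not have one yet.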
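
The "one gp kthread per group, one cb kthread per cpu" behaviour can be modelled in user space. In this sketch, rdp_model and spawn_one_nocb_model are hypothetical; gp_leader stands in for the nocb_gp_rdp pointer, the group layout is chosen by the caller rather than computed as the kernel does, and the 'p' in the cb thread name assumes rcu_state.abbr for preemptible RCU. No threads are created; only the names are recorded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical per-cpu model of the fields used by the spawn logic. */
struct rdp_model {
        int cpu;
        bool is_nocb;           /* models rcu_is_nocb_cpu(cpu) */
        int gp_leader;          /* cpu whose rdp acts as nocb_gp_rdp */
        bool has_gp_kthread;    /* meaningful on the leader's entry */
        bool has_cb_kthread;
        char gp_name[24];
        char cb_name[24];
};

/* Returns how many threads this call would have created (0, 1 or 2). */
static int spawn_one_nocb_model(struct rdp_model *rdps, int cpu)
{
        struct rdp_model *rdp = &rdps[cpu];
        struct rdp_model *rdp_gp;
        int created = 0;

        assert(cpu >= 0);

        /* Not a no-CBs cpu, or it already has its cb kthread. */
        if (!rdp->is_nocb || rdp->has_cb_kthread)
                return 0;

        /* Create the group's gp kthread first if it does not exist yet. */
        rdp_gp = &rdps[rdp->gp_leader];
        if (!rdp_gp->has_gp_kthread) {
                snprintf(rdp_gp->gp_name, sizeof(rdp_gp->gp_name),
                         "rcuog/%d", rdp_gp->cpu);
                rdp_gp->has_gp_kthread = true;
                created++;
        }

        /* Then the cb kthread for this cpu. */
        snprintf(rdp->cb_name, sizeof(rdp->cb_name), "rcuo%c/%d", 'p', cpu);
        rdp->has_cb_kthread = true;
        created++;
        return created;
}
```

With cpus 0 and 1 in a group led by cpu 0, the call for cpu 0 creates two threads (rcuog/0 and rcuop/0) and the call for cpu 1 creates only rcuop/1. Repeating a call for a cpu that already has its cb kthread creates nothing.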

rcu_spawn_boost_kthreads()

kernel/rcu/tree_plugin.h

/*
 * Spawn boost kthreads -- called as soon as the scheduler is running.
 */

static void __init rcu_spawn_boost_kthreads(void)
{
        struct rcu_node *rnp;

        rcu_for_each_leaf_node(rnp) 
                rcu_spawn_one_boost_kthread(rnp);
}

Creates and starts one rcu_boost_kthread per leaf node. These are created as SCHED_FIFO threads.

rcu_spawn_one_boost_kthread()

kernel/rcu/tree_plugin.h

/*
 * Create an RCU-boost kthread for the specified node if one does not
 * already exist.  We only create this kthread for preemptible RCU.
 * Returns zero if all is well, a negated errno otherwise.
 */

static void rcu_spawn_one_boost_kthread(struct rcu_node *rnp)
{
        int rnp_index = rnp - rcu_get_root();
        unsigned long flags;
        struct sched_param sp;
        struct task_struct *t;

        if (!IS_ENABLED(CONFIG_PREEMPT_RCU))
                return;

        if (!rcu_scheduler_fully_active || rcu_rnp_online_cpus(rnp) == 0)
                return;

        rcu_state.boost = 1;

        if (rnp->boost_kthread_task != NULL)
                return;

        t = kthread_create(rcu_boost_kthread, (void *)rnp,
                           "rcub/%d", rnp_index);
        if (WARN_ON_ONCE(IS_ERR(t)))
                return;

        raw_spin_lock_irqsave_rcu_node(rnp, flags);
        rnp->boost_kthread_task = t;
        raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
        sp.sched_priority = kthread_prio;
        sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
        wake_up_process(t); /* get to TASK_INTERRUPTIBLE quickly. */
}

For the rcu boost feature, creates and starts the rcu_boost_kthread of the given leaf node, unless one already exists. The thread is switched to SCHED_FIFO with the kthread_prio priority.

rcu_spawn_tasks_kthread()

kernel/rcu/update.c

/* Spawn rcu_tasks_kthread() at core_initcall() time. */

static int __init rcu_spawn_tasks_kthread(void)
{
        struct task_struct *t;

        t = kthread_run(rcu_tasks_kthread, NULL, "rcu_tasks_kthread");
        if (WARN_ONCE(IS_ERR(t), "%s: Could not start Tasks-RCU grace-period kthread, OOM is now expected behavior\n", __func__))
                return 0;
        smp_mb(); /* Ensure others see full kthread. */
        WRITE_ONCE(rcu_tasks_kthread_ptr, t);
        return 0;
}
core_initcall(rcu_spawn_tasks_kthread);

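Creates and starts the rcu tasks kernel thread, which operates on top of srcu. (“rcu_tasks_kthread”)

GP kernel thread for CB

The gp kernel thread for CB operates as follows.

The following 3 functions are repeated forever.
- rcu_gp_init() -> rcu_gp_fqs_loop() -> rcu_gp_cleanup()
rcu_gp_init()
- Waits for a new gp request; when one arrives, the gp sequence is incremented and a new gp is started.
  - New gp request
    - When there are new callbacks, a new gp is requested by recording the RCU_GP_FLAG_INIT flag in gp_flags and waking the gp kernel thread.
rcu_gp_fqs_loop()
- Waits for qs completion or an fqs request.
- Once a gp has started it must be completed quickly, so every few ticks the loop checks for qs completion and for external fqs requests, and carries out fqs.
  - External fqs request
    - fqs is requested by recording the RCU_GP_FLAG_FQS flag in gp_flags and waking the gp kernel thread.
- The first check uses a period of jiffies_till_first_fqs (default=1~3+@256cpus) ticks, and subsequent iterations use a period of jiffies_till_next_fqs (default=1~3+@256cpus) ticks.
- To keep a gp from hanging for a long time, rcu_gp_fqs() forcibly passes the qs state of all nohz cpus and of offline cpus for which 1 second has elapsed since the gp started, so that the gp can end quickly.
rcu_gp_cleanup()
- Ends the gp (gp idle) and moves on to the next gp sequence.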
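
How the gp sequence number moves across one start/cleanup cycle can be sketched with plain integer arithmetic. The helpers below are modelled loosely on the kernel's rcu_seq_start()/rcu_seq_end() helpers; the names (seq_start, seq_end, seq_state, seq_ctr) and the 2-bit state field are assumptions made for this illustration and are not taken from the listings in this article.

```c
#include <assert.h>

/* Assumed layout: low 2 bits hold the gp state, upper bits the counter. */
#define SEQ_CTR_SHIFT   2
#define SEQ_STATE_MASK  ((1UL << SEQ_CTR_SHIFT) - 1)

/* Non-zero while a gp is in progress, 0 while idle. */
static unsigned long seq_state(unsigned long s)
{
        return s & SEQ_STATE_MASK;
}

/* Number of gps completed so far. */
static unsigned long seq_ctr(unsigned long s)
{
        return s >> SEQ_CTR_SHIFT;
}

/* Start of a gp: add 1, which makes the value odd (state 1). */
static void seq_start(unsigned long *sp)
{
        *sp = *sp + 1;
        assert(seq_state(*sp) == 1);
}

/* End of a gp: round up to the next counter value with state 0. */
static void seq_end(unsigned long *sp)
{
        assert(seq_state(*sp) != 0);
        *sp = (*sp | SEQ_STATE_MASK) + 1;
}
```

Starting from 0, seq_start() gives 1 (gp in progress) and seq_end() gives 4 (idle, one gp completed); the next start gives 5. An odd value therefore always indicates a gp that has started but not yet been cleaned up.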

The following figure shows the gp flag requests and the gp state changes used inside rcu_gp_kthread().

rcu_gp_kthread()

kernel/rcu/tree.c

static int __noreturn rcu_gp_kthread(void *unused)
{
        rcu_bind_gp_kthread();
        for (;;) {

                /* Handle grace-period start. */
                for (;;) {
                        trace_rcu_grace_period(rcu_state.name,
                                               READ_ONCE(rcu_state.gp_seq),
                                               TPS("reqwait"));
                        rcu_state.gp_state = RCU_GP_WAIT_GPS;
                        swait_event_idle_exclusive(rcu_state.gp_wq,
                                         READ_ONCE(rcu_state.gp_flags) &
                                         RCU_GP_FLAG_INIT);
                        rcu_state.gp_state = RCU_GP_DONE_GPS;
                        /* Locking provides needed memory barrier. */
                        if (rcu_gp_init())
                                break;
                        cond_resched_tasks_rcu_qs();
                        WRITE_ONCE(rcu_state.gp_activity, jiffies);
                        WARN_ON(signal_pending(current));
                        trace_rcu_grace_period(rcu_state.name,
                                               READ_ONCE(rcu_state.gp_seq),
                                               TPS("reqwaitsig"));
                }

                /* Handle quiescent-state forcing. */
                rcu_gp_fqs_loop();

                /* Handle grace-period end. */
                rcu_state.gp_state = RCU_GP_CLEANUP;
                rcu_gp_cleanup();
                rcu_state.gp_state = RCU_GP_CLEANED;
        }
}

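This kernel thread manages grace periods for CB. It loops forever, sleeping between steps; each time it is woken from outside it checks for state changes and cycles the gp through in-progress -> idle.

In code line 3, for nohz full, the current gp kernel thread is restricted to run only on the cpus that are able to handle rcu processing.
In code line 4, the infinite loop begins.
In code line 7, this is the inner loop that retries when starting a gp fails.
In code line 11, the gp state is changed to RCU_GP_WAIT_GPS(1).
In code lines 12~14, the thread sleeps while waiting for a new gp request. If the RCU_GP_FLAG_INIT flag is already set, it does not sleep and proceeds.
- Because the thread is asleep, the new gp request is checked when it is woken from outside.
In code line 15, the gp state is changed to RCU_GP_DONE_GPS(2).
In code lines 17~18, if the gp started normally, the inner loop is exited.
In code lines 19~25, if starting the gp failed, the inner loop repeats in order to sleep again. At this point the task's rcu_tasks_holdout flag is cleared, and the time of the gp state change is recorded in gp_activity.
In code line 28, until the qs completes, fqs is processed repeatedly, once per configured tick period or per fqs request.
- jiffies_till_first_fqs (default=1~3 ticks+@256cpus), jiffies_till_next_fqs (default=1~3 ticks+@256cpus)
In code line 31, the gp state is changed to RCU_GP_CLEANUP(7).
In code line 32, the gp is finished and the state becomes idle.
In code lines 33~34, the gp state is changed to RCU_GP_CLEANED(8), and the loop repeats to wait for a new gp request.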

rcu_preempt_blocked_readers_cgp()

kernel/rcu/tree_plugin.h

/*
 * Check for preempted RCU readers blocking the current grace period
 * for the specified rcu_node structure.  If the caller needs a reliable
 * answer, it must hold the rcu_node's ->lock.
 */

static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
{
        return rnp->gp_tasks != NULL;
}

Checks whether preempted rcu readers are blocking the current grace period within the rcu node.

GP start handling for CB

rcu_gp_init()

kernel/rcu/tree.c -1/2-

/*
 * Initialize a new grace period.  Return false if no grace period required.
 */

static bool rcu_gp_init(void)
{
        unsigned long flags;
        unsigned long oldmask;
        unsigned long mask;
        struct rcu_data *rdp;
        struct rcu_node *rnp = rcu_get_root();

        WRITE_ONCE(rcu_state.gp_activity, jiffies);
        raw_spin_lock_irq_rcu_node(rnp);
        if (!READ_ONCE(rcu_state.gp_flags)) {
                /* Spurious wakeup, tell caller to go back to sleep.  */
                raw_spin_unlock_irq_rcu_node(rnp);
                return false;
        }
        WRITE_ONCE(rcu_state.gp_flags, 0); /* Clear all flags: New GP. */

        if (WARN_ON_ONCE(rcu_gp_in_progress())) {
                /*
                 * Grace period already in progress, don't start another.
                 * Not supposed to be able to happen.
                 */
                raw_spin_unlock_irq_rcu_node(rnp);
                return false;
        }

        /* Advance to a new grace period and initialize state. */
        record_gp_stall_check_time();
        /* Record GP times before starting GP, hence rcu_seq_start(). */
        rcu_seq_start(&rcu_state.gp_seq);
        trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
        raw_spin_unlock_irq_rcu_node(rnp);

        /*
         * Apply per-leaf buffered online and offline operations to the
         * rcu_node tree.  Note that this new grace period need not wait
         * for subsequent online CPUs, and that quiescent-state forcing
         * will handle subsequent offline CPUs.
         */
        rcu_state.gp_state = RCU_GP_ONOFF;
        rcu_for_each_leaf_node(rnp) {
                raw_spin_lock(&rcu_state.ofl_lock);
                raw_spin_lock_irq_rcu_node(rnp);
                if (rnp->qsmaskinit == rnp->qsmaskinitnext &&
                    !rnp->wait_blkd_tasks) {
                        /* Nothing to do on this leaf rcu_node structure. */
                        raw_spin_unlock_irq_rcu_node(rnp);
                        raw_spin_unlock(&rcu_state.ofl_lock);
                        continue;
                }

                /* Record old state, apply changes to ->qsmaskinit field. */
                oldmask = rnp->qsmaskinit;
                rnp->qsmaskinit = rnp->qsmaskinitnext;

                /* If zero-ness of ->qsmaskinit changed, propagate up tree. */
                if (!oldmask != !rnp->qsmaskinit) {
                        if (!oldmask) { /* First online CPU for rcu_node. */
                                if (!rnp->wait_blkd_tasks) /* Ever offline? */
                                        rcu_init_new_rnp(rnp);
                        } else if (rcu_preempt_has_tasks(rnp)) {
                                rnp->wait_blkd_tasks = true; /* blocked tasks */
                        } else { /* Last offline CPU and can propagate. */
                                rcu_cleanup_dead_rnp(rnp);
                        }
                }

                /*
                 * If all waited-on tasks from prior grace period are
                 * done, and if all this rcu_node structure's CPUs are
                 * still offline, propagate up the rcu_node tree and
                 * clear ->wait_blkd_tasks.  Otherwise, if one of this
                 * rcu_node structure's CPUs has since come back online,
                 * simply clear ->wait_blkd_tasks.
                 */
                if (rnp->wait_blkd_tasks &&
                    (!rcu_preempt_has_tasks(rnp) || rnp->qsmaskinit)) {
                        rnp->wait_blkd_tasks = false;
                        if (!rnp->qsmaskinit)
                                rcu_cleanup_dead_rnp(rnp);
                }

                raw_spin_unlock_irq_rcu_node(rnp);
                raw_spin_unlock(&rcu_state.ofl_lock);
        }
        rcu_gp_slow(gp_preinit_delay); /* Races with CPU hotplug. */

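Starts a new GP (grace period) for CB.

In code line 9, the time of the gp state change is recorded in gp_activity.
In code lines 10~15, the node lock is acquired; if the gp flags are 0, the node lock is released and false is returned so that the gp thread goes back to sleep.
In code line 16, all flags are cleared for the new gp.
In code lines 18~25, if a gp is already in progress, another one is not started and the function returns false.
In code line 28, the gp start time is recorded for gp stall checking.
- When the CONFIG_RCU_STALL_COMMON kernel option is used, a gp that does not end for a long time (default 21 seconds) is judged to have stalled; the recorded time is used to detect and report this.
In code lines 30~32, the gp sequence is incremented by 1 to start the gp, and the node lock is released. (The value becomes odd.)
In code line 40, the gp state is changed to RCU_GP_ONOFF(3). This is where cpu online/offline changes are applied to the nodes.
In code lines 41~43, the leaf nodes are traversed, and for each one the global ofl_lock and the node lock are acquired.
In code lines 44~50, if the online cpus have not changed and the node is not waiting on blocked tasks, there is nothing to do on this node, so the locks are released and the node is skipped.
In code lines 53~54, to reflect the online cpus, rnp->qsmaskinit <- rnp->qsmaskinitnext is performed, and the previous value is kept in oldmask.
In code lines 57~66, if the node changed between having no online cpu and having at least one, one of the following 3 cases is handled.
- The first online cpu was added and the node is not waiting on blocked tasks: online is propagated up to the top-level node.
- The last cpu went offline but blocked tasks exist on the node: true is assigned to rnp->wait_blkd_tasks.
- The last cpu went offline and there are no blocked tasks: offline is propagated up to the top-level node.
In code lines 76~81, if rnp->wait_blkd_tasks is set and the node either has no blocked tasks or rnp->qsmaskinit is non-zero, rnp->wait_blkd_tasks is changed to false. If rnp->qsmaskinit is 0 at that point, offline is propagated up to the top-level node.
- When a task is preempted inside an rcu read-side critical section, the preempted task is added to the blocked tasks.
In code lines 83~85, the acquired locks are released and the loop repeats.
In code line 86, the thread sleeps for the number of ticks given by the module parameter gp_preinit_delay (default=0).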
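
The three-way decision taken when the zero-ness of ->qsmaskinit changes (the `!oldmask != !rnp->qsmaskinit` test in the listing) can be written as a small classifier. This is a user-space sketch: the enum leaf_action and the function classify_leaf_change are hypothetical names, and only the first decision block is modelled, not the later ->wait_blkd_tasks cleanup.

```c
#include <assert.h>
#include <stdbool.h>

/* What the leaf-node pass would do for one rcu_node (hypothetical enum). */
enum leaf_action {
        LEAF_NO_PROPAGATION,    /* nothing to propagate up the tree */
        LEAF_INIT_NEW,          /* models rcu_init_new_rnp() */
        LEAF_WAIT_BLOCKED,      /* models wait_blkd_tasks = true */
        LEAF_CLEANUP_DEAD       /* models rcu_cleanup_dead_rnp() */
};

static enum leaf_action classify_leaf_change(unsigned long oldmask,
                                             unsigned long newmask,
                                             bool wait_blkd_tasks,
                                             bool has_blocked_tasks)
{
        /* Zero-ness unchanged: both empty or both non-empty. */
        if (!oldmask == !newmask)
                return LEAF_NO_PROPAGATION;

        /* Empty -> non-empty: first cpu of this node came online. */
        if (!oldmask)
                return wait_blkd_tasks ? LEAF_NO_PROPAGATION : LEAF_INIT_NEW;

        /* Non-empty -> empty: last cpu of this node went offline. */
        if (has_blocked_tasks)
                return LEAF_WAIT_BLOCKED;
        return LEAF_CLEANUP_DEAD;
}
```

For instance, going from mask 0x0 to 0x3 with no pending wait yields LEAF_INIT_NEW, while going from 0x3 to 0x1 yields LEAF_NO_PROPAGATION because the node still has an online cpu. Going from 0x3 to 0x0 yields LEAF_WAIT_BLOCKED or LEAF_CLEANUP_DEAD depending on whether preempted readers are still queued on the node.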

kernel/rcu/tree.c -2/2-

        /*
         * Set the quiescent-state-needed bits in all the rcu_node
         * structures for all currently online CPUs in breadth-first
         * order, starting from the root rcu_node structure, relying on the
         * layout of the tree within the rcu_state.node[] array.  Note that
         * other CPUs will access only the leaves of the hierarchy, thus
         * seeing that no grace period is in progress, at least until the
         * corresponding leaf node has been initialized.
         *
         * The grace period cannot complete until the initialization
         * process finishes, because this kthread handles both.
         */
        rcu_state.gp_state = RCU_GP_INIT;
        rcu_for_each_node_breadth_first(rnp) {
                rcu_gp_slow(gp_init_delay);
                raw_spin_lock_irqsave_rcu_node(rnp, flags);
                rdp = this_cpu_ptr(&rcu_data);
                rcu_preempt_check_blocked_tasks(rnp);
                rnp->qsmask = rnp->qsmaskinit;
                WRITE_ONCE(rnp->gp_seq, rcu_state.gp_seq);
                if (rnp == rdp->mynode)
                        (void)__note_gp_changes(rnp, rdp);
                rcu_preempt_boost_start_gp(rnp);
                trace_rcu_grace_period_init(rcu_state.name, rnp->gp_seq,
                                            rnp->level, rnp->grplo,
                                            rnp->grphi, rnp->qsmask);
                /* Quiescent states for tasks on any now-offline CPUs. */
                mask = rnp->qsmask & ~rnp->qsmaskinitnext;
                rnp->rcu_gp_init_mask = mask;
                if ((mask || rnp->wait_blkd_tasks) && rcu_is_leaf_node(rnp))
                        rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
                else
                        raw_spin_unlock_irq_rcu_node(rnp);
                cond_resched_tasks_rcu_qs();
                WRITE_ONCE(rcu_state.gp_activity, jiffies);
        }

        return true;
}

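Code line 13 changes the GP state to RCU_GP_INIT(4).
Code lines 14~15 iterate over all nodes, sleeping for the number of ticks given by the module parameter gp_init_delay (default 0).
Code lines 16~18 acquire the lock of the current node and check whether it has blocked tasks. If blocked tasks exist, a warning dump is printed.
Code lines 19~20 assign qsmaskinit to qsmask and update the node's GP sequence.
Code lines 21~22 check, when the current node is the one that owns the current CPU, whether a GP has ended and a new GP has started.
Code line 23 sets the boost time (500 ms by default) of the current node.
Code lines 28~33 report QSes on behalf of CPUs that have newly gone offline; otherwise the node lock is simply released.
Code line 34 clears the current task's rcu_tasks_holdout flag.
Code lines 35~36 record the current time in gp_activity because the GP state has changed, and then repeat the loop.
Code line 38 returns true.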
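
The per-node mask handling in this loop can be modeled as follows. This is a simplified sketch, not the kernel code: `struct sketch_node` and `sketch_arm_node()` are hypothetical names, locking is omitted, and only the three mask fields that appear in the listing are kept.

```c
#include <assert.h>

/* Hypothetical, trimmed-down stand-in for struct rcu_node. */
struct sketch_node {
        unsigned long qsmaskinit;     /* CPUs online when this GP was set up */
        unsigned long qsmaskinitnext; /* CPUs online right now */
        unsigned long qsmask;         /* CPUs that still owe a QS */
};

/*
 * Arm the node for a new GP and return the bits of CPUs that went offline
 * after qsmaskinit was sampled. Their QS can be reported on their behalf.
 */
static unsigned long sketch_arm_node(struct sketch_node *np)
{
        np->qsmask = np->qsmaskinit;
        return np->qsmask & ~np->qsmaskinitnext;
}
```

With qsmaskinit = 0xF and qsmaskinitnext = 0xB, the function returns 0x4: CPU bit 2 went offline in the meantime, so the GP should not wait for it.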

Waiting for all QSes and FQS handling for CBs

rcu_gp_fqs_loop()

kernel/rcu/tree.c

/*
 * Loop doing repeated quiescent-state forcing until the grace period ends.
 */

static void rcu_gp_fqs_loop(void)
{
        bool first_gp_fqs;
        int gf;
        unsigned long j;
        int ret;
        struct rcu_node *rnp = rcu_get_root();

        first_gp_fqs = true;
        j = READ_ONCE(jiffies_till_first_fqs);
        ret = 0;
        for (;;) {
                if (!ret) {
                        rcu_state.jiffies_force_qs = jiffies + j;
                        WRITE_ONCE(rcu_state.jiffies_kick_kthreads,
                                   jiffies + (j ? 3 * j : 2));
                }
                trace_rcu_grace_period(rcu_state.name,
                                       READ_ONCE(rcu_state.gp_seq),
                                       TPS("fqswait"));
                rcu_state.gp_state = RCU_GP_WAIT_FQS;
                ret = swait_event_idle_timeout_exclusive(
                                rcu_state.gp_wq, rcu_gp_fqs_check_wake(&gf), j);
                rcu_state.gp_state = RCU_GP_DOING_FQS;
                /* Locking provides needed memory barriers. */
                /* If grace period done, leave loop. */
                if (!READ_ONCE(rnp->qsmask) &&
                    !rcu_preempt_blocked_readers_cgp(rnp))
                        break;
                /* If time for quiescent-state forcing, do it. */
                if (ULONG_CMP_GE(jiffies, rcu_state.jiffies_force_qs) ||
                    (gf & RCU_GP_FLAG_FQS)) {
                        trace_rcu_grace_period(rcu_state.name,
                                               READ_ONCE(rcu_state.gp_seq),
                                               TPS("fqsstart"));
                        rcu_gp_fqs(first_gp_fqs);
                        first_gp_fqs = false;
                        trace_rcu_grace_period(rcu_state.name,
                                               READ_ONCE(rcu_state.gp_seq),
                                               TPS("fqsend"));
                        cond_resched_tasks_rcu_qs();
                        WRITE_ONCE(rcu_state.gp_activity, jiffies);
                        ret = 0; /* Force full wait till next FQS. */
                        j = READ_ONCE(jiffies_till_next_fqs);
                } else {
                        /* Deal with stray signal. */
                        cond_resched_tasks_rcu_qs();
                        WRITE_ONCE(rcu_state.gp_activity, jiffies);
                        WARN_ON(signal_pending(current));
                        trace_rcu_grace_period(rcu_state.name,
                                               READ_ONCE(rcu_state.gp_seq),
                                               TPS("fqswaitsig"));
                        ret = 1; /* Keep old FQS timing. */
                        j = jiffies;
                        if (time_after(jiffies, rcu_state.jiffies_force_qs))
                                j = 1;
                        else
                                j = rcu_state.jiffies_force_qs - j;
                }
        }
}

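Waits for the QSes to complete, and performs FQS (forced quiescent state processing) when the FQS timeout expires or an FQS request arrives.

Code line 7 obtains the root node, because completion of all QSes is checked at the root node.
Code line 9 sets first_gp_fqs so that true is passed on the first FQS attempt; it changes to false for later attempts.
Code line 10 reads the FQS wait time (jiffies_till_first_fqs ticks) into j. This value changes later.
- The initial value of jiffies_till_first_fqs is 1~3 ticks (HZ up to 250 = 1 tick, up to 500 = 2 ticks, above 500 = 3 ticks) plus 1 tick for every 256 CPUs.
- Example: 1 tick on an rpi4 at 250 Hz.
Code lines 12~17 start the loop. When ret (initially 0) is 0, the next FQS deadline jiffies_force_qs is set to jiffies + j, and jiffies_kick_kthreads is set to jiffies plus three times j (or plus 2 when j is 0).
Code line 21 changes the GP state to RCU_GP_WAIT_FQS(5).
Code lines 22~23 sleep until the j-tick timeout expires or until a wakeup finds the rcu_gp_fqs_check_wake() condition satisfied, and store the result in ret.
Code line 24 changes the GP state to RCU_GP_DOING_FQS(6).
Code lines 27~29 break out of the loop and leave the function when every QS has been checked in and no preempted RCU reader task remains.
Code lines 31~44 run FQS when the current time has passed the FQS deadline or the FQS flag was received. The current time is also recorded in gp_activity, ret is reset to 0, and j is reloaded from jiffies_till_next_fqs.
Code lines 45~59 handle a stray signal or early wakeup: the current time is recorded in gp_activity, ret is set to 1 to keep the old FQS deadline, j is recomputed as the time remaining until that deadline (1 tick if it has already passed), and the loop continues.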
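
The tick arithmetic described for jiffies_till_first_fqs and jiffies_kick_kthreads can be checked with a small sketch. The function names below are hypothetical; the formulas are taken from the description above and from the `j ? 3 * j : 2` expression in the listing, and any clamping the kernel applies to module parameters is not modeled.

```c
#include <assert.h>

/* Base wait in ticks: 1 up to 250 Hz, 2 up to 500 Hz, 3 above that. */
static unsigned long sketch_base_fqs_ticks(unsigned int hz)
{
        return 1 + (hz > 250) + (hz > 500);
}

/* Initial first-FQS wait: base plus one tick per 256 CPUs. */
static unsigned long sketch_first_fqs_ticks(unsigned int hz,
                                            unsigned int nr_cpus)
{
        return sketch_base_fqs_ticks(hz) + nr_cpus / 256;
}

/* Offset added to jiffies for the kick deadline: 3*j, or 2 when j is 0. */
static unsigned long sketch_kick_offset(unsigned long j)
{
        return j ? 3 * j : 2;
}
```

For a 4-CPU system at 250 Hz this gives a first wait of 1 tick and a kick offset of 3 ticks, which matches the rpi4 example.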

GP end processing for CBs

rcu_gp_cleanup()

kernel/rcu/tree.c – 1/2

/*
 * Clean up after the old grace period.
 */

static void rcu_gp_cleanup(void)
{
        unsigned long gp_duration;
        bool needgp = false;
        unsigned long new_gp_seq;
        bool offloaded;
        struct rcu_data *rdp;
        struct rcu_node *rnp = rcu_get_root();
        struct swait_queue_head *sq;

        WRITE_ONCE(rcu_state.gp_activity, jiffies);
        raw_spin_lock_irq_rcu_node(rnp);
        rcu_state.gp_end = jiffies;
        gp_duration = rcu_state.gp_end - rcu_state.gp_start;
        if (gp_duration > rcu_state.gp_max)
                rcu_state.gp_max = gp_duration;

        /*
         * We know the grace period is complete, but to everyone else
         * it appears to still be ongoing.  But it is also the case
         * that to everyone else it looks like there is nothing that
         * they can do to advance the grace period.  It is therefore
         * safe for us to drop the lock in order to mark the grace
         * period as completed in all of the rcu_node structures.
         */
        raw_spin_unlock_irq_rcu_node(rnp);

        /*
         * Propagate new ->gp_seq value to rcu_node structures so that
         * other CPUs don't have to wait until the start of the next grace
         * period to process their callbacks.  This also avoids some nasty
         * RCU grace-period initialization races by forcing the end of
         * the current grace period to be completely recorded in all of
         * the rcu_node structures before the beginning of the next grace
         * period is recorded in any of the rcu_node structures.
         */
        new_gp_seq = rcu_state.gp_seq;
        rcu_seq_end(&new_gp_seq);
        rcu_for_each_node_breadth_first(rnp) {
                raw_spin_lock_irq_rcu_node(rnp);
                if (WARN_ON_ONCE(rcu_preempt_blocked_readers_cgp(rnp)))
                        dump_blkd_tasks(rnp, 10);
                WARN_ON_ONCE(rnp->qsmask);
                WRITE_ONCE(rnp->gp_seq, new_gp_seq);
                rdp = this_cpu_ptr(&rcu_data);
                if (rnp == rdp->mynode)
                        needgp = __note_gp_changes(rnp, rdp) || needgp;
                /* smp_mb() provided by prior unlock-lock pair. */
                needgp = rcu_future_gp_cleanup(rnp) || needgp;
                sq = rcu_nocb_gp_get(rnp);
                raw_spin_unlock_irq_rcu_node(rnp);
                rcu_nocb_gp_cleanup(sq);
                cond_resched_tasks_rcu_qs();
                WRITE_ONCE(rcu_state.gp_activity, jiffies);
                rcu_gp_slow(gp_cleanup_delay);
        }

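Performs end-of-GP processing for the current GP and applies it to all nodes.

Code line 11 records the current time in gp_activity because the GP state has changed.
Code lines 12~26 record the GP end time in gp_end and update gp_max, the longest GP duration, while holding the root node lock, and then release the lock.
Code lines 37~38 copy the GP sequence into new_gp_seq and end the GP on that copy with rcu_seq_end(), so new_gp_seq holds the value the GP sequence will have once the GP has ended (GP idle state). The global rcu_state.gp_seq is not modified yet.
Code lines 39~42 iterate over all nodes and print a warning dump if the current node still has blocked tasks.
Code line 43 prints a warning for a node whose QSes are not complete.
Code line 44 stores new_gp_seq, the ended GP sequence value, into the node's gp_seq.
Code lines 45~47 check, when the current node is the leaf node that owns the current CPU, whether a GP has ended and a new GP has started.
Code line 49 clears the old GP request and finds out whether a new GP has been requested.
Code lines 50~52 release the node lock and wake up all no-CB GP kthreads waiting on the node for GP cleanup.
Code line 53 clears the current task's rcu_tasks_holdout flag.
Code line 54 records the current time in gp_activity because the GP state has changed.
Code line 55 sleeps for the number of ticks given by the module parameter gp_cleanup_delay (default 0).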

kernel/rcu/tree.c – 2/2

        rnp = rcu_get_root();
        raw_spin_lock_irq_rcu_node(rnp); /* GP before ->gp_seq update. */

        /* Declare grace period done, trace first to use old GP number. */
        trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("end"));
        rcu_seq_end(&rcu_state.gp_seq);
        rcu_state.gp_state = RCU_GP_IDLE;
        /* Check for GP requests since above loop. */
        rdp = this_cpu_ptr(&rcu_data);
        if (!needgp && ULONG_CMP_LT(rnp->gp_seq, rnp->gp_seq_needed)) {
                trace_rcu_this_gp(rnp, rdp, rnp->gp_seq_needed,
                                  TPS("CleanupMore"));
                needgp = true;
        }
        /* Advance CBs to reduce false positives below. */
        offloaded = IS_ENABLED(CONFIG_RCU_NOCB_CPU) &&
                    rcu_segcblist_is_offloaded(&rdp->cblist);
        if ((offloaded || !rcu_accelerate_cbs(rnp, rdp)) && needgp) {
                WRITE_ONCE(rcu_state.gp_flags, RCU_GP_FLAG_INIT);
                rcu_state.gp_req_activity = jiffies;
                trace_rcu_grace_period(rcu_state.name,
                                       READ_ONCE(rcu_state.gp_seq),
                                       TPS("newreq"));
        } else {
                WRITE_ONCE(rcu_state.gp_flags,
                           rcu_state.gp_flags & RCU_GP_FLAG_INIT);
        }
        raw_spin_unlock_irq_rcu_node(rnp);
}

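Code lines 1~2 acquire the root node lock.
Code lines 6~7 end the GP sequence and change the GP state to RCU_GP_IDLE(0).
Code lines 9~14 set needgp to true when the node shows that a new GP has been requested.
Code lines 16~27: when a new GP is needed and either callback processing is offloaded or rcu_accelerate_cbs() returns false, gp_flags is set to RCU_GP_FLAG_INIT(1) and the GP request time is updated to the current time. Otherwise only the RCU_GP_FLAG_INIT(1) bit of gp_flags is kept and all other bits are cleared.
Code line 28 releases the node lock.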
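
The expression `rcu_state.gp_flags & RCU_GP_FLAG_INIT` in the else branch is easy to misread, so here is a small sketch of the two outcomes. The function name and the `request_new_gp` parameter are hypothetical, and the FQS flag value 0x2 is an assumption made for illustration; only the INIT value of 1 comes from the text above.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_GP_FLAG_INIT 0x1u /* a new GP must be initialized */
#define SKETCH_GP_FLAG_FQS  0x2u /* assumed value: FQS was requested */

/*
 * Compute the gp_flags value left behind by cleanup.
 * request_new_gp stands for the whole condition of the if statement.
 */
static unsigned int sketch_cleanup_flags(unsigned int gp_flags,
                                         bool request_new_gp)
{
        if (request_new_gp)
                return SKETCH_GP_FLAG_INIT;    /* overwrite: INIT only */
        return gp_flags & SKETCH_GP_FLAG_INIT; /* keep INIT, drop the rest */
}
```

Either way a stale FQS request from the finished GP does not survive cleanup, while an INIT request that is already pending is preserved.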

rcu_future_gp_cleanup()

kernel/rcu/tree.c

/*
 * Clean up any old requests for the just-ended grace period.  Also return
 * whether any additional grace periods have been requested.
 */

static bool rcu_future_gp_cleanup(struct rcu_node *rnp)
{
        bool needmore;
        struct rcu_data *rdp = this_cpu_ptr(&rcu_data);

        needmore = ULONG_CMP_LT(rnp->gp_seq, rnp->gp_seq_needed);
        if (!needmore)
                rnp->gp_seq_needed = rnp->gp_seq; /* Avoid counter wrap. */
        trace_rcu_this_gp(rnp, rdp, rnp->gp_seq,
                          needmore ? TPS("CleanupMore") : TPS("Cleanup"));
        return needmore;
}

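Clears the node's old GP request (by setting gp_seq_needed to the same value as gp_seq) when no further GP is needed, and returns whether an additional GP has been requested.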
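
A userspace model of this function makes the wraparound-safe comparison visible. This is a sketch under assumptions: `struct sketch_req_node` is a hypothetical two-field stand-in for struct rcu_node, the comparison macro is written the way ULONG_CMP_LT is commonly defined for unsigned long counters, and tracing is omitted.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Wraparound-safe "a < b" for unsigned long counters. */
#define SKETCH_ULONG_CMP_LT(a, b) (ULONG_MAX / 2 < (a) - (b))

/* Hypothetical stand-in holding only the two fields used here. */
struct sketch_req_node {
        unsigned long gp_seq;        /* GP sequence seen by this node */
        unsigned long gp_seq_needed; /* furthest GP requested so far */
};

/* Returns true if a GP beyond the one that just ended is still wanted. */
static bool sketch_future_gp_cleanup(struct sketch_req_node *np)
{
        bool needmore = SKETCH_ULONG_CMP_LT(np->gp_seq, np->gp_seq_needed);

        if (!needmore)
                np->gp_seq_needed = np->gp_seq; /* drop the stale request */
        return needmore;
}
```

Pulling gp_seq_needed up to gp_seq when nothing is pending keeps the two counters close together, which is what the "Avoid counter wrap" comment in the listing refers to.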

Requesting a GP start

rcu_start_this_gp()

kernel/rcu/tree.c

/*
 * rcu_start_this_gp - Request the start of a particular grace period
 * @rnp_start: The leaf node of the CPU from which to start.
 * @rdp: The rcu_data corresponding to the CPU from which to start.
 * @gp_seq_req: The gp_seq of the grace period to start.
 *
 * Start the specified grace period, as needed to handle newly arrived
 * callbacks.  The required future grace periods are recorded in each
 * rcu_node structure's ->gp_seq_needed field.  Returns true if there
 * is reason to awaken the grace-period kthread.
 *
 * The caller must hold the specified rcu_node structure's ->lock, which
 * is why the caller is responsible for waking the grace-period kthread.
 *
 * Returns true if the GP thread needs to be awakened else false.
 */

static bool rcu_start_this_gp(struct rcu_node *rnp_start, struct rcu_data *rdp,
                              unsigned long gp_seq_req)
{
        bool ret = false;
        struct rcu_node *rnp;

        /*
         * Use funnel locking to either acquire the root rcu_node
         * structure's lock or bail out if the need for this grace period
         * has already been recorded -- or if that grace period has in
         * fact already started.  If there is already a grace period in
         * progress in a non-leaf node, no recording is needed because the
         * end of the grace period will scan the leaf rcu_node structures.
         * Note that rnp_start->lock must not be released.
         */
        raw_lockdep_assert_held_rcu_node(rnp_start);
        trace_rcu_this_gp(rnp_start, rdp, gp_seq_req, TPS("Startleaf"));
        for (rnp = rnp_start; 1; rnp = rnp->parent) {
                if (rnp != rnp_start)
                        raw_spin_lock_rcu_node(rnp);
                if (ULONG_CMP_GE(rnp->gp_seq_needed, gp_seq_req) ||
                    rcu_seq_started(&rnp->gp_seq, gp_seq_req) ||
                    (rnp != rnp_start &&
                     rcu_seq_state(rcu_seq_current(&rnp->gp_seq)))) {
                        trace_rcu_this_gp(rnp, rdp, gp_seq_req,
                                          TPS("Prestarted"));
                        goto unlock_out;
                }
                rnp->gp_seq_needed = gp_seq_req;
                if (rcu_seq_state(rcu_seq_current(&rnp->gp_seq))) {
                        /*
                         * We just marked the leaf or internal node, and a
                         * grace period is in progress, which means that
                         * rcu_gp_cleanup() will see the marking.  Bail to
                         * reduce contention.
                         */
                        trace_rcu_this_gp(rnp_start, rdp, gp_seq_req,
                                          TPS("Startedleaf"));
                        goto unlock_out;
                }
                if (rnp != rnp_start && rnp->parent != NULL)
                        raw_spin_unlock_rcu_node(rnp);
                if (!rnp->parent)
                        break;  /* At root, and perhaps also leaf. */
        }

        /* If GP already in progress, just leave, otherwise start one. */
        if (rcu_gp_in_progress()) {
                trace_rcu_this_gp(rnp, rdp, gp_seq_req, TPS("Startedleafroot"));
                goto unlock_out;
        }
        trace_rcu_this_gp(rnp, rdp, gp_seq_req, TPS("Startedroot"));
        WRITE_ONCE(rcu_state.gp_flags, rcu_state.gp_flags | RCU_GP_FLAG_INIT);
        rcu_state.gp_req_activity = jiffies;
        if (!rcu_state.gp_kthread) {
                trace_rcu_this_gp(rnp, rdp, gp_seq_req, TPS("NoGPkthread"));
                goto unlock_out;
        }
        trace_rcu_grace_period(rcu_state.name, READ_ONCE(rcu_state.gp_seq), TPS("newreq"));
        ret = true;  /* Caller must wake GP kthread. */
unlock_out:
        /* Push furthest requested GP to leaf node and rcu_data structure. */
        if (ULONG_CMP_LT(gp_seq_req, rnp->gp_seq_needed)) {
                rnp_start->gp_seq_needed = rnp->gp_seq_needed;
                rdp->gp_seq_needed = rnp->gp_seq_needed;
        }
        if (rnp != rnp_start)
                raw_spin_unlock_rcu_node(rnp);
        return ret;
}

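Requests the start of a new GP. Returns true when the request was recorded up to the root node and the caller must wake the GP kthread; returns false when the GP was already requested or started, or when an existing GP is in progress.

Code lines 18~20 walk from the @rnp_start node up to the root node, acquiring each node's spinlock. The start node passed as an argument is already locked on entry.
Code lines 21~28 jump to the unlock_out: label when the node's GP sequence request (gp_seq_needed) is greater than or equal to @gp_seq_req, when the requested GP has already started, or when a node other than the start node has a GP in progress, because no new request is needed.
Code line 29 updates the node's GP sequence request (gp_seq_needed) to @gp_seq_req.
Code lines 30~40 jump to the unlock_out: label when a GP is already in progress on the current node.
Code lines 41~45 release the spinlock of an intermediate node (neither the start node nor the root) in order to handle the next node, and then continue with the parent node. The loop ends at the root node.
Code lines 48~51 jump to the unlock_out: label when the global GP is already in progress.
Code line 53 adds the RCU_GP_FLAG_INIT(1) flag to gp_flags to request that a new GP be started.
Code line 54 updates the GP request time to the current time.
Code lines 55~58 jump to the unlock_out: label when the GP kthread for callbacks is not running yet.
Code line 60 assigns true to ret in advance so that the successful GP request is returned.
Code lines 61~69 are the unlock_out: label. The GP sequence requests (gp_seq_needed) of the node and the CPU are updated, the node's spinlock is released, and the function returns.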
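
The leaf-to-root walk can be sketched without locks to show when it stops early. This is a deliberately reduced single-threaded model: `struct sketch_fnode` and `sketch_request_gp()` are hypothetical, the rcu_seq_started() test, the push-back at unlock_out:, the gp_kthread check and all locking are left out, and the root's own gp_seq stands in for the global rcu_gp_in_progress() test.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Wraparound-safe "a >= b" for unsigned long counters. */
#define SKETCH_ULONG_CMP_GE(a, b) (ULONG_MAX / 2 >= (a) - (b))
#define SKETCH_SEQ_STATE_MASK 3UL /* assumed: low 2 bits = GP in progress */

struct sketch_fnode {
        unsigned long gp_seq;
        unsigned long gp_seq_needed;
        struct sketch_fnode *parent; /* NULL at the root */
};

/*
 * Record a request for grace period gp_seq_req from the leaf upward.
 * Returns true only if the request reached an idle root, which is the
 * case in which the caller would have to wake the GP kthread.
 */
static bool sketch_request_gp(struct sketch_fnode *leaf,
                              unsigned long gp_seq_req)
{
        struct sketch_fnode *np;

        for (np = leaf; ; np = np->parent) {
                /* Someone already asked for this GP or a later one. */
                if (SKETCH_ULONG_CMP_GE(np->gp_seq_needed, gp_seq_req))
                        return false;
                np->gp_seq_needed = gp_seq_req;
                /* A GP is running: its cleanup will notice the mark. */
                if (np->gp_seq & SKETCH_SEQ_STATE_MASK)
                        return false;
                if (!np->parent)
                        return true;
        }
}
```

The early exits are the point of the funnel: most callers stop at a leaf that is already marked, so only a few ever contend for the root.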

Checking, Recording and Reporting Quiescent States

Several methods are used to determine whether a QS has been passed, and they differ slightly depending on the kernel preemption model. A QS proceeds in the following order.

QS check -> QS record -> QS report

QS check and record

RCU non-preemptible kernel

Context switch
- When a context switch occurs, a QS is recorded for the CPU.
- __schedule() -> rcu_note_context_switch() -> rcu_qs()
Scheduler tick in user mode or idle
- If the tick arrives while a user task is running or the CPU is idle, a QS is recorded for the CPU.
  - update_process_times() -> rcu_sched_clock_irq() -> rcu_flavor_sched_clock_irq() -> rcu_qs()
While ksoftirqd is running
- If the task running on the current CPU is ksoftirqd, a QS is recorded.
- __do_softirq() -> rcu_softirq_qs() -> rcu_qs()
- Reference: rcu: Apply RCU-bh QSes to RCU-sched and RCU-preempt when safe (2018, v4.20-rc1)
Using cond_resched() on a voluntary kernel
- cond_resched_tasks_rcu_qs() -> cond_resched() -> _cond_resched() -> rcu_all_qs() -> rcu_qs()
- Reference: Make cond_resched() provide RCU quiescent state (2017, v4.15-rc1)
Using cond_resched_tasks_rcu_qs() on a voluntary kernel
- A QS is recorded when the GP kthread for callbacks or the no-CB callback processing routine calls this function in a long loop.
- Reference: Provide cond_resched_rcu_qs() to force quiescent states in long loops (2014, v3.18-rc1)

RCU preemptible kernel

Context switch
- When a context switch occurs, a QS is recorded for the CPU.
  - __schedule() -> rcu_note_context_switch() -> rcu_qs()
Scheduler tick in user mode or idle
- If the task has left its outermost RCU read-side critical section (the rcu_read_lock_nesting counter is 0) and preemption, bh and so on are enabled, a QS is recorded for the CPU.
  - update_process_times() -> rcu_sched_clock_irq() -> rcu_flavor_sched_clock_irq() -> rcu_qs()
Scheduler tick with a deferred QS
- For a deferred QS more than 1 second after the GP started, a QS is recorded for the CPU.
  - update_process_times() -> rcu_sched_clock_irq() -> rcu_flavor_sched_clock_irq() -> rcu_preempt_deferred_qs() -> rcu_preempt_deferred_qs_irqrestore() -> rcu_qs()
While ksoftirqd is running
- If the task running on the current CPU is ksoftirqd, a QS is recorded.
- __do_softirq() -> rcu_softirq_qs() -> rcu_qs()
- Reference: rcu: Apply RCU-bh QSes to RCU-sched and RCU-preempt when safe (2018, v4.20-rc1)
Special case of rcu_read_unlock()
- rcu_read_unlock_special() -> rcu_preempt_deferred_qs_irqrestore() -> rcu_qs()

QS report

Recorded QSes are reported from the following places.

Report at the regular check

Regular reporting of a recorded QS through the QS check routine of the RCU core
- rcu_core() -> rcu_check_quiescent_state() -> rcu_report_qs_rdp() -> rcu_report_qs_rnp()

Report at a deferred QS

When irq, bh and preemption are all enabled, the deferred QS is cleared; if the task is in the blocked state, the blocked state is cleared as well and the QS is reported. The call path is as follows.

rcu_preempt_deferred_qs() -> rcu_preempt_deferred_qs_irqrestore() -> rcu_report_unblock_qs_rnp()

Deferred QSes are processed from several places, as follows.

RCU core
- rcu_core() -> rcu_preempt_deferred_qs()
Scheduler tick
- rcu_flavor_sched_clock_irq() -> rcu_preempt_deferred_qs()
Special case of rcu_read_unlock()
- rcu_read_unlock_special() -> rcu_preempt_deferred_qs_irqrestore()
After softirq processing in ksoftirqd
- rcu_softirq_qs() -> rcu_preempt_deferred_qs()
heavy_qs at context switch
- rcu_momentary_dyntick_idle() -> rcu_preempt_deferred_qs()
Using cond_resched() on a voluntary kernel
- rcu_momentary_dyntick_idle() -> rcu_preempt_deferred_qs()
Callback processing by the no-CB callback kthread
- rcu_momentary_dyntick_idle() -> rcu_preempt_deferred_qs()
EQS entry
- rcu_eqs_enter() -> rcu_preempt_deferred_qs()
At CPU offline
- rcu_report_dead() -> rcu_preempt_deferred_qs()
Expedited IPI handler
- rcu_exp_handler() -> rcu_preempt_deferred_qs()
Context switch
- rcu_note_context_switch() -> rcu_preempt_deferred_qs()
At task exit
- exit_rcu() -> rcu_preempt_deferred_qs()

QS record function

rcu_qs()

kernel/rcu/tree_plugin.h

/*
 * Record a preemptible-RCU quiescent state for the specified CPU.
 * Note that this does not necessarily mean that the task currently running
 * on the CPU is in a quiescent state:  Instead, it means that the current
 * grace period need not wait on any RCU read-side critical section that
 * starts later on this CPU.  It also means that if the current task is
 * in an RCU read-side critical section, it has already added itself to
 * some leaf rcu_node structure's ->blkd_tasks list.  In addition to the
 * current task, there might be any number of other tasks blocked while
 * in an RCU read-side critical section.
 *
 * Callers to this function must disable preemption.
 */

static void rcu_qs(void)
{
        RCU_LOCKDEP_WARN(preemptible(), "rcu_qs() invoked with preemption enabled!!!\n");
        if (__this_cpu_read(rcu_data.cpu_no_qs.s)) {
                trace_rcu_grace_period(TPS("rcu_preempt"),
                                       __this_cpu_read(rcu_data.gp_seq),
                                       TPS("cpuqs"));
                __this_cpu_write(rcu_data.cpu_no_qs.b.norm, false);
                barrier(); /* Coordinate with rcu_flavor_sched_clock_irq(). */
                WRITE_ONCE(current->rcu_read_unlock_special.b.need_qs, false);
        }
}

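Records a QS for the current CPU under preemptible RCU.

Code lines 4~11: if the QS of the current CPU has not been checked in yet, the norm bit of cpu_no_qs is cleared, which records the QS for the current CPU. The need_qs bit of the current task's unlock special is cleared as well.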
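
The listing reads `cpu_no_qs.s` but writes `cpu_no_qs.b.norm`, which works because the two names overlay the same storage. The sketch below assumes a union of two one-byte flags and one two-byte aggregate; the `sketch_` names are hypothetical, and tracing, the barrier and the need_qs update are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: two one-byte flags that can also be read as one value. */
union sketch_noqs {
        struct {
                uint8_t norm; /* normal GP still needs a QS from this CPU */
                uint8_t exp;  /* expedited GP still needs a QS from this CPU */
        } b;
        uint16_t s;           /* both flags viewed together */
};

/* Same shape as rcu_qs(): act only if some flag is set, then clear norm. */
static bool sketch_record_qs(union sketch_noqs *nq)
{
        if (!nq->s)
                return false; /* nothing outstanding */
        nq->b.norm = 0;       /* QS recorded for the normal GP */
        return true;
}
```

Reading the aggregate tests both flags with a single load, while clearing only norm leaves a pending expedited request untouched.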

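Regular QS check and report

After a QS has been recorded by a QS check, it is reported in the order shown in the following figure.

local CPU QS check -> local CPU QS record: rdp(rcu_data) -> QS report to the node: rnp(rcu_node) -> QS report to global: rsp(rcu_state) -> GP kthread (ends the current GP)
- Because rnp(rcu_node) is organized as a hierarchy, the QS is reported to rsp(rcu_state) once it has been reported all the way up to the root node.

The following figure shows the state of the member variables involved in rnp(rcu_node) reporting for 36 CPUs.

grpmask: the bit in the parent node that corresponds to the current node
qsmaskinit: the bitmask of online CPUs handled by the current node; this value is copied to qsmask each time a GP starts.
qsmask: the bitmask of online CPUs covered by the node; the bit of a CPU whose QS has passed is cleared to 0.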
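
How qsmask and grpmask cooperate when a report climbs the hierarchy can be sketched as below. This is a simplified model rather than the kernel's rcu_report_qs_rnp(): `struct sketch_rnp` and `sketch_report_qs()` are hypothetical, and locking, the gp_seq snapshot check and blocked-reader handling are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_rnp {
        unsigned long qsmask;      /* children or CPUs that still owe a QS */
        unsigned long grpmask;     /* this node's bit in parent->qsmask */
        struct sketch_rnp *parent; /* NULL at the root */
};

/*
 * Clear mask in rnp and keep walking up while each level becomes empty.
 * Returns true when the root's qsmask reaches zero, meaning every CPU
 * has reported and the grace period may end.
 */
static bool sketch_report_qs(struct sketch_rnp *rnp, unsigned long mask)
{
        for (;;) {
                if (!(rnp->qsmask & mask))
                        return false; /* already reported */
                rnp->qsmask &= ~mask;
                if (rnp->qsmask)
                        return false; /* siblings still pending */
                if (!rnp->parent)
                        return true;  /* root is empty */
                mask = rnp->grpmask;
                rnp = rnp->parent;
        }
}
```

Only the last reporter at each level continues upward, so a report usually touches a single node.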

QS check and report

rcu_check_quiescent_state()

kernel/rcu/tree.c

/*
 * Check to see if there is a new grace period of which this CPU
 * is not yet aware, and if so, set up local rcu_data state for it.
 * Otherwise, see if this CPU has just passed through its first
 * quiescent state for this grace period, and record that fact if so.
 */

static void
rcu_check_quiescent_state(struct rcu_data *rdp)
{
        /* Check for grace-period ends and beginnings. */
        note_gp_changes(rdp);

        /*
         * Does this CPU still need to do its part for current grace period?
         * If no, return and let the other CPUs do their part as well.
         */
        if (!rdp->core_needs_qs)
                return;

        /*
         * Was there a quiescent state since the beginning of the grace
         * period? If no, then exit and wait for the next call.
         */
        if (rdp->cpu_no_qs.b.norm)
                return;

        /*
         * Tell RCU we are done (but rcu_report_qs_rdp() will be the
         * judge of that).
         */
        rcu_report_qs_rdp(rdp->cpu, rdp);
}

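Checks whether a new GP has started for the current CPU. Also checks the QS state and, if a QS has been recorded in rdp, has it reported to the parent node.

Code line 5 checks whether a GP has ended and a new GP has started.
Code lines 11~12 return if the current CPU does not need a QS check.
Code lines 18~19 return if no normal QS has been detected since the GP started.
Code line 25 reports the QS of the current CPU.

Checking for GP changes (start of a new GP)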

note_gp_changes()

kernel/rcu/tree.c

static void note_gp_changes(struct rcu_data *rdp)
{
        unsigned long flags;
        bool needwake;
        struct rcu_node *rnp;

        local_irq_save(flags);
        rnp = rdp->mynode;
        if ((rdp->gp_seq == rcu_seq_current(&rnp->gp_seq) &&
             !unlikely(READ_ONCE(rdp->gpwrap))) || /* w/out lock. */
            !raw_spin_trylock_rcu_node(rnp)) { /* irqs already off, so later. */
                local_irq_restore(flags);
                return;
        }
        needwake = __note_gp_changes(rnp, rdp);
        raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
        if (needwake)
                rcu_gp_kthread_wake();
}

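Checks whether a GP has ended and a new GP has started.

Code lines 9~14 return, deferring to the next opportunity, when the CPU's GP sequence is already up to date with no GP overflow (gpwrap), or when the node lock cannot be acquired.
Code lines 15~18 check whether a GP has ended and a new GP has started, and wake the GP kthread if the result requires it.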

__note_gp_changes()

kernel/rcu/tree.c

/*
 * Update CPU-local rcu_data state to record the beginnings and ends of
 * grace periods.  The caller must hold the ->lock of the leaf rcu_node
 * structure corresponding to the current CPU, and must have irqs disabled.
 * Returns true if the grace-period kthread needs to be awakened.
 */

static bool __note_gp_changes(struct rcu_node *rnp, struct rcu_data *rdp)
{
        bool ret = false;
        bool need_gp;
        const bool offloaded = IS_ENABLED(CONFIG_RCU_NOCB_CPU) &&
                               rcu_segcblist_is_offloaded(&rdp->cblist);

        raw_lockdep_assert_held_rcu_node(rnp);

        if (rdp->gp_seq == rnp->gp_seq)
                return false; /* Nothing to do. */

        /* Handle the ends of any preceding grace periods first. */
        if (rcu_seq_completed_gp(rdp->gp_seq, rnp->gp_seq) ||
            unlikely(READ_ONCE(rdp->gpwrap))) {
                if (!offloaded)
                        ret = rcu_advance_cbs(rnp, rdp); /* Advance CBs. */
                trace_rcu_grace_period(rcu_state.name, rdp->gp_seq, TPS("cpuend"));
        } else {
                if (!offloaded)
                        ret = rcu_accelerate_cbs(rnp, rdp); /* Recent CBs. */
        }

        /* Now handle the beginnings of any new-to-this-CPU grace periods. */
        if (rcu_seq_new_gp(rdp->gp_seq, rnp->gp_seq) ||
            unlikely(READ_ONCE(rdp->gpwrap))) {
                /*
                 * If the current grace period is waiting for this CPU,
                 * set up to detect a quiescent state, otherwise don't
                 * go looking for one.
                 */
                trace_rcu_grace_period(rcu_state.name, rnp->gp_seq, TPS("cpustart"));
                need_gp = !!(rnp->qsmask & rdp->grpmask);
                rdp->cpu_no_qs.b.norm = need_gp;
                rdp->core_needs_qs = need_gp;
                zero_cpu_stall_ticks(rdp);
        }
        rdp->gp_seq = rnp->gp_seq;  /* Remember new grace-period state. */
        if (ULONG_CMP_LT(rdp->gp_seq_needed, rnp->gp_seq_needed) || rdp->gpwrap)
                rdp->gp_seq_needed = rnp->gp_seq_needed;
        WRITE_ONCE(rdp->gpwrap, false);
        rcu_gpnum_ovf(rnp, rdp);
        return ret;
}

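Checks whether a GP has ended and a new GP has started. A result of true means that the GP kthread needs to be awakened (to start a new GP).

Code lines 5~6 find out whether callbacks are offloaded.
Code lines 10~11 return when the CPU's GP sequence is already up to date.
Code lines 14~22: when callbacks are not offloaded, they are advanced (cascaded) with rcu_advance_cbs() if a GP has completed since the CPU's GP sequence was recorded (or gpwrap is set), and accelerated with rcu_accelerate_cbs() otherwise.
Code lines 27~39 update whether the current CPU needs a QS, and its QS state, when a new GP has started from the CPU's point of view or an overflow (gpwrap) has occurred.
Code line 40 updates the CPU's GP sequence.
Code lines 41~42 update the CPU's GP sequence request.
Code line 43 clears the GP sequence overflow flag (gpwrap) to false and then performs the overflow check once more.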
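
The two sequence tests used here, "a GP has completed" and "a new GP has started", can be tried out in userspace. The sketch below is modeled on my understanding of rcu_seq_completed_gp() and rcu_seq_new_gp() in kernel/rcu/rcu.h and assumes a two-bit state field; the `sketch_` names are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define SKETCH_SEQ_STATE_MASK 3UL /* assumed: low 2 bits = GP state */
#define SKETCH_ULONG_CMP_LT(a, b) (ULONG_MAX / 2 < (a) - (b))

/* True if a GP has ended since the CPU took its snapshot snap. */
static bool sketch_seq_completed_gp(unsigned long snap, unsigned long cur)
{
        return SKETCH_ULONG_CMP_LT(snap, cur & ~SKETCH_SEQ_STATE_MASK);
}

/* True if a GP has started since the CPU took its snapshot snap. */
static bool sketch_seq_new_gp(unsigned long snap, unsigned long cur)
{
        return SKETCH_ULONG_CMP_LT((snap + SKETCH_SEQ_STATE_MASK) &
                                   ~SKETCH_SEQ_STATE_MASK, cur);
}
```

With a snapshot of 8 (idle) and a node value of 9, only the "new GP" test fires; with a snapshot of 9 and a node value of 12, only the "completed" test fires; with 9 and 13 both fire, because one GP ended and the next one began.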

GP sequence overflow check

rcu_gpnum_ovf()

kernel/rcu/tree.c

/*
 * We are reporting a quiescent state on behalf of some other CPU, so
 * it is our responsibility to check for and handle potential overflow
 * of the rcu_node ->gp_seq counter with respect to the rcu_data counters.
 * After all, the CPU might be in deep idle state, and thus executing no
 * code whatsoever.
 */

static void rcu_gpnum_ovf(struct rcu_node *rnp, struct rcu_data *rdp)
{
        raw_lockdep_assert_held_rcu_node(rnp);
        if (ULONG_CMP_LT(rcu_seq_current(&rdp->gp_seq) + ULONG_MAX / 4,
                         rnp->gp_seq))
                WRITE_ONCE(rdp->gpwrap, true);
        if (ULONG_CMP_LT(rdp->rcu_iw_gp_seq + ULONG_MAX / 4, rnp->gp_seq))
                rdp->rcu_iw_gp_seq = rnp->gp_seq + ULONG_MAX / 4;
}

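Checks for overflow caused by too wide a gap between the CPU's GP sequence and the node's GP sequence.

Code lines 4~6 set the overflow flag (gpwrap) when the CPU's GP sequence lags the node's GP sequence by more than ULONG_MAX / 4.
- Because of nohz, a CPU's GP sequence may go without an update for a long time. Even so, on 64-bit systems the GP sequence is 64 bits wide, so this should almost never happen.
Code lines 7~8: rcu_iw_gp_seq holds a limit value tied to GP sequence overflow. When the node's GP sequence has moved more than ULONG_MAX / 4 beyond it, it is updated to the node's GP sequence + ULONG_MAX / 4.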
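
The gpwrap test is a modular comparison, which the following sketch isolates. The function name is hypothetical and the macro is written the way ULONG_CMP_LT is commonly defined; the kernel additionally reads the CPU value through rcu_seq_current(), which is not modeled.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Wraparound-safe "a < b" for unsigned long counters. */
#define SKETCH_ULONG_CMP_LT(a, b) (ULONG_MAX / 2 < (a) - (b))

/*
 * True if the CPU's snapshot lags the node's gp_seq by more than
 * ULONG_MAX / 4. Beyond that distance later comparisons could be
 * fooled by counter wraparound, so gpwrap should be set.
 */
static bool sketch_gp_seq_wrapped(unsigned long cpu_gp_seq,
                                  unsigned long node_gp_seq)
{
        return SKETCH_ULONG_CMP_LT(cpu_gp_seq + ULONG_MAX / 4, node_gp_seq);
}
```

Choosing a quarter of the counter range leaves a wide safety margin before the half-range limit at which modular comparisons stop being reliable.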

Reporting a QS to rdp (CPU)

rcu_report_qs_rdp()

kernel/rcu/tree.c

/*
 * Record a quiescent state for the specified CPU to that CPU's rcu_data
 * structure.  This must be called from the specified CPU.
 */

static void
rcu_report_qs_rdp(int cpu, struct rcu_data *rdp)
{
        unsigned long flags;
        unsigned long mask;
        bool needwake = false;
        const bool offloaded = IS_ENABLED(CONFIG_RCU_NOCB_CPU) &&
                               rcu_segcblist_is_offloaded(&rdp->cblist);
        struct rcu_node *rnp;

        rnp = rdp->mynode;
        raw_spin_lock_irqsave_rcu_node(rnp, flags);
        if (rdp->cpu_no_qs.b.norm || rdp->gp_seq != rnp->gp_seq ||
            rdp->gpwrap) {

                /*
                 * The grace period in which this quiescent state was
                 * recorded has ended, so don't report it upwards.
                 * We will instead need a new quiescent state that lies
                 * within the current grace period.
                 */
                rdp->cpu_no_qs.b.norm = true;   /* need qs for new gp. */
                raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
                return;
        }
        mask = rdp->grpmask;
        rdp->core_needs_qs = false;
        if ((rnp->qsmask & mask) == 0) {
                raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
        } else {
                /*
                 * This GP can't end until cpu checks in, so all of our
                 * callbacks can be processed during the next GP.
                 */
                if (!offloaded)
                        needwake = rcu_accelerate_cbs(rnp, rdp);

                rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
                /* ^^^ Released rnp->lock */
                if (needwake)
                        rcu_gp_kthread_wake();
        }
}

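Reports to the node when the local QS state has passed.

Code lines 7~8 find out whether callbacks are offloaded.
Code lines 11~25 acquire the lock of the CPU's node. If any of the following three conditions holds, the QS must still be checked, so the recorded QS is discarded (cpu_no_qs.b.norm is set back to true) and the function returns.
- A normal QS has not been checked in yet.
- The CPU's GP sequence differs from the node's (a new GP has started).
- The GP sequence is in the overflow state.
Code line 27 assigns false to rdp->core_needs_qs, because the CPU's QS check is complete and about to be reported, so that no further QS check and report is done.
Code lines 28~29 release the node lock and return when the CPU's QS has already been reported to the node.
Code lines 30~42 perform acceleration when callbacks are not offloaded and then report the QS to the node. Afterwards the GP kthread is woken if necessary.

Reporting a QS to rnp (node)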

rcu_report_qs_rnp()

kernel/rcu/tree.c

/*
 * Similar to rcu_report_qs_rdp(), for which it is a helper function.
 * Allows quiescent states for a group of CPUs to be reported at one go
 * to the specified rcu_node structure, though all the CPUs in the group
 * must be represented by the same rcu_node structure (which need not be a
 * leaf rcu_node structure, though it often will be).  The gps parameter
 * is the grace-period snapshot, which means that the quiescent states
 * are valid only if rnp->gp_seq is equal to gps.  That structure's lock
 * must be held upon entry, and it is released before return.
 *
 * As a special case, if mask is zero, the bit-already-cleared check is
 * disabled.  This allows propagating quiescent state due to resumed tasks
 * during grace-period initialization.
 */

static void rcu_report_qs_rnp(unsigned long mask, struct rcu_node *rnp,
                              unsigned long gps, unsigned long flags)
        __releases(rnp->lock)
{
        unsigned long oldmask = 0;
        struct rcu_node *rnp_c;

        raw_lockdep_assert_held_rcu_node(rnp);

        /* Walk up the rcu_node hierarchy. */
        for (;;) {
                if ((!(rnp->qsmask & mask) && mask) || rnp->gp_seq != gps) {

                        /*
                         * Our bit has already been cleared, or the
                         * relevant grace period is already over, so done.
                         */
                        raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
                        return;
                }
                WARN_ON_ONCE(oldmask); /* Any child must be all zeroed! */
                WARN_ON_ONCE(!rcu_is_leaf_node(rnp) &&
                             rcu_preempt_blocked_readers_cgp(rnp));
                rnp->qsmask &= ~mask;
                trace_rcu_quiescent_state_report(rcu_state.name, rnp->gp_seq,
                                                 mask, rnp->qsmask, rnp->level,
                                                 rnp->grplo, rnp->grphi,
                                                 !!rnp->gp_tasks);
                if (rnp->qsmask != 0 || rcu_preempt_blocked_readers_cgp(rnp)) {

                        /* Other bits still set at this level, so done. */
                        raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
                        return;
                }
                rnp->completedqs = rnp->gp_seq;
                mask = rnp->grpmask;
                if (rnp->parent == NULL) {

                        /* No more levels.  Exit loop holding root lock. */

                        break;
                }
                raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
                rnp_c = rnp;
                rnp = rnp->parent;
                raw_spin_lock_irqsave_rcu_node(rnp, flags);
                oldmask = rnp_c->qsmask;
        }

        /*
         * Get here if we are the last CPU to pass through a quiescent
         * state for this grace period.  Invoke rcu_report_qs_rsp()
         * to clean up and start the next grace period if one is needed.
         */
        rcu_report_qs_rsp(flags); /* releases rnp->lock. */
}

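The current CPU's qs is complete, so report it from the node the CPU belongs to up to the root node. When the root node sees that every CPU has completed its qs, the gp kernel thread is woken so that a new gp can start.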

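In code lines 11~20, loop upward toward the root node. If the qs has already been recorded on the node or a new gp sequence has started, release the node lock and return.
In code line 24, clear the bits given by @mask from the qsmask of the node being visited.
In code lines 29~34, if bits remain in the node's qsmask after the clear, or if the node still has blocked rcu reader tasks, release the node lock and return from the function.
In code line 35, update the node's completed sequence, rnp->completedqs.
In code line 36, store the current node's group mask, rnp->grpmask, in mask for use at the parent node.
In code lines 37~42, if there is no parent node, leave the loop.
In code lines 43~48, release the lock of the node being visited, select the parent node, and continue the loop.
In code line 55, every qs up to the root node has completed, so report it to the rsp.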
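
The upward walk described above can be tried out in user space. The following is only a minimal sketch under simplifying assumptions: `struct toy_node` and `toy_report_qs()` are hypothetical names, there is no locking, and the `rnp->gp_seq != gps` check, the `mask == 0` special case, and blocked readers are all left out. It shows only how clearing the last bit at one level turns into a report to the parent via `grpmask`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct rcu_node (no locking). */
struct toy_node {
    unsigned long qsmask;     /* children/CPUs that still owe a qs */
    unsigned long grpmask;    /* this node's bit in parent->qsmask */
    struct toy_node *parent;  /* NULL for the root node */
};

/*
 * Clear @mask in @np->qsmask and keep walking up while each level
 * becomes empty. Returns true only when the root's qsmask reaches
 * zero, the point where the kernel code calls rcu_report_qs_rsp().
 */
static bool toy_report_qs(struct toy_node *np, unsigned long mask)
{
    assert(np != NULL);
    for (;;) {
        if (!(np->qsmask & mask))
            return false;       /* bit already cleared, nothing to do */
        np->qsmask &= ~mask;
        if (np->qsmask != 0)
            return false;       /* others at this level still pending */
        if (np->parent == NULL)
            return true;        /* root is empty, the gp can end */
        mask = np->grpmask;     /* report this node to its parent */
        np = np->parent;
    }
}
```

With two leaves under one root, only the report that empties the last leaf returns true; every earlier report stops at the first level that still has bits set.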

Reporting a qs to the rsp

rcu_report_qs_rsp()

kernel/rcu/tree.c

/*
 * Report a full set of quiescent states to the specified rcu_state
 * data structure.  This involves cleaning up after the prior grace
 * period and letting rcu_start_gp() start up the next grace period
 * if one is needed.  Note that the caller must hold rnp->lock, which
 * is released before return.
 */
static void rcu_report_qs_rsp(unsigned long flags)
        __releases(rcu_get_root(rsp)->lock)
{
        raw_lockdep_assert_held_rcu_node(rcu_get_root());
        WARN_ON_ONCE(!rcu_gp_in_progress());
        WRITE_ONCE(rcu_state.gp_flags,
                   READ_ONCE(rcu_state.gp_flags) | RCU_GP_FLAG_FQS);
        raw_spin_unlock_irqrestore_rcu_node(rcu_get_root(), flags);
        rcu_gp_kthread_wake();
}

Every qs has passed, so wake the gp kernel thread to update the gp and start a new one.

After adding RCU_GP_FLAG_FQS(2) to gp_flags, wake the gp kernel thread used for callbacks (cb).

Force Quiescent State

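When too many callbacks are pending (at or above a limit, default=10000) and the gp sequence is not changing, the following situations are forcibly treated as a qs.

CPUs in the eqs (extended qs) state
- CPUs that have entered nohz idle
- CPUs running a nohz full user task
Offline CPUs for which at least 1 second has passed since the gp started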

Requesting fqs

rcu_force_quiescent_state()

kernel/rcu/tree.c

/*
 * Force quiescent states on reluctant CPUs, and also detect which
 * CPUs are in dyntick-idle mode.
 */

void rcu_force_quiescent_state(void)
{
        unsigned long flags;
        bool ret;
        struct rcu_node *rnp;
        struct rcu_node *rnp_old = NULL;

        /* Funnel through hierarchy to reduce memory contention. */
        rnp = __this_cpu_read(rcu_data.mynode);
        for (; rnp != NULL; rnp = rnp->parent) {
                ret = (READ_ONCE(rcu_state.gp_flags) & RCU_GP_FLAG_FQS) ||
                      !raw_spin_trylock(&rnp->fqslock);
                if (rnp_old != NULL)
                        raw_spin_unlock(&rnp_old->fqslock);
                if (ret)
                        return;
                rnp_old = rnp;
        }
        /* rnp_old == rcu_get_root(), rnp == NULL. */

        /* Reached the root of the rcu_node tree, acquire lock. */
        raw_spin_lock_irqsave_rcu_node(rnp_old, flags);
        raw_spin_unlock(&rnp_old->fqslock);
        if (READ_ONCE(rcu_state.gp_flags) & RCU_GP_FLAG_FQS) {
                raw_spin_unlock_irqrestore_rcu_node(rnp_old, flags);
                return;  /* Someone beat us to it. */
        }
        WRITE_ONCE(rcu_state.gp_flags,
                   READ_ONCE(rcu_state.gp_flags) | RCU_GP_FLAG_FQS);
        raw_spin_unlock_irqrestore_rcu_node(rnp_old, flags);
        rcu_gp_kthread_wake();
}
EXPORT_SYMBOL_GPL(rcu_force_quiescent_state);

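Tries to drive a force quiescent state so that the current gp can end and a new gp can start.

In code lines 9~18, climb from the node of the requesting CPU up to the root node, trying to acquire each node's fqs lock and releasing the lock of the node below. If the local CPU fails to acquire a lock because another CPU already holds that node's fqs lock, or if another CPU has already set the RCU_GP_FLAG_FQS flag and is handling the Force Quiescent State, return from the function.
In code lines 22~23, acquire the spinlock of the root node, then release the fqs lock.
In code lines 24~27, as a final re-check, return from the function if the RCU_GP_FLAG_FQS bit is already set.
In code lines 28~31, set the RCU_GP_FLAG_FQS bit to request fqs, release the root node's spinlock, then wake the gp kernel thread so that it carries out the fqs.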
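
The funnel pattern above (take the lock one level up, drop the one below, give up as soon as someone else is ahead) can be sketched without any real locking primitives. This is a single-threaded illustration under stated assumptions: `struct toy_fnode`, `toy_trylock()`, `toy_unlock()` and `toy_force_qs()` are hypothetical names, a plain bool stands in for `rnp->fqslock`, and the root node's own spinlock is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_FLAG_FQS 0x2u  /* same value the text gives for RCU_GP_FLAG_FQS */

/* Hypothetical node: a bool stands in for rnp->fqslock. */
struct toy_fnode {
    bool fqslock_held;
    struct toy_fnode *parent;  /* NULL for the root node */
};

static bool toy_trylock(struct toy_fnode *n)
{
    if (n->fqslock_held)
        return false;
    n->fqslock_held = true;
    return true;
}

static void toy_unlock(struct toy_fnode *n)
{
    n->fqslock_held = false;
}

/* Returns true if this caller is the one that set the FQS flag. */
static bool toy_force_qs(struct toy_fnode *leaf, unsigned int *gp_flags)
{
    struct toy_fnode *n, *old = NULL;

    assert(leaf != NULL);
    for (n = leaf; n != NULL; n = n->parent) {
        /* Give up if fqs is already requested or the lock is contended. */
        bool giveup = (*gp_flags & TOY_FLAG_FQS) || !toy_trylock(n);

        if (old != NULL)
            toy_unlock(old);   /* never hold more than one level */
        if (giveup)
            return false;
        old = n;
    }
    /* old is now the root node. */
    toy_unlock(old);
    if (*gp_flags & TOY_FLAG_FQS)
        return false;          /* someone beat us to it */
    *gp_flags |= TOY_FLAG_FQS; /* the kernel wakes the gp kthread here */
    return true;
}
```

Because each caller holds at most one fqs lock at a time and bails out on the first contended level, many CPUs asking for fqs at once collapse into a single request at the root.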

Running fqs (by the gp kthread for cb)

rcu_gp_fqs()

kernel/rcu/tree.c

/*
 * Do one round of quiescent-state forcing.
 */

static void rcu_gp_fqs(bool first_time)
{
        struct rcu_node *rnp = rcu_get_root();

        WRITE_ONCE(rcu_state.gp_activity, jiffies);
        rcu_state.n_force_qs++;
        if (first_time) {
                /* Collect dyntick-idle snapshots. */
                force_qs_rnp(dyntick_save_progress_counter);
        } else {
                /* Handle dyntick-idle and offline CPUs. */
                force_qs_rnp(rcu_implicit_dynticks_qs);
        }
        /* Clear flag to prevent immediate re-entry. */
        if (READ_ONCE(rcu_state.gp_flags) & RCU_GP_FLAG_FQS) {
                raw_spin_lock_irq_rcu_node(rnp);
                WRITE_ONCE(rcu_state.gp_flags,
                           READ_ONCE(rcu_state.gp_flags) & ~RCU_GP_FLAG_FQS);
                raw_spin_unlock_irq_rcu_node(rnp);
        }
}

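Forcibly treats every outstanding qs that qualifies as passed. The function is used in two ways, selected by @first_time.

In code line 5, record in gp_activity the time at which the gp state changed.
In code line 6, increment the fqs counter n_force_qs by 1.
In code lines 7~9, if the argument @first_time is true, report a qs for each CPU that has not completed its qs and that dyntick_save_progress_counter() finds to be in eqs (extended qs).
In code lines 10~13, otherwise, report a qs for each CPU that has not completed its qs and for which rcu_implicit_dynticks_qs() confirms one of the following conditions.
- the CPU is in the eqs (extended qs) state
- the CPU is offline and at least 1 second has passed since the gp started
In code lines 15~20, remove the RCU_GP_FLAG_FQS flag.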

The following figure shows how fqs is processed through the rcu_gp_fqs() function.

force_qs_rnp()

kernel/rcu/tree.c

/*
 * Scan the leaf rcu_node structures.  For each structure on which all
 * CPUs have reported a quiescent state and on which there are tasks
 * blocking the current grace period, initiate RCU priority boosting.
 * Otherwise, invoke the specified function to check dyntick state for
 * each CPU that has not yet reported a quiescent state.
 */

static void force_qs_rnp(int (*f)(struct rcu_data *rdp))
{
        int cpu;
        unsigned long flags;
        unsigned long mask;
        struct rcu_node *rnp;

        rcu_for_each_leaf_node(rnp) {
                cond_resched_tasks_rcu_qs();
                mask = 0;
                raw_spin_lock_irqsave_rcu_node(rnp, flags);
                if (rnp->qsmask == 0) {
                        if (!IS_ENABLED(CONFIG_PREEMPTION) ||
                            rcu_preempt_blocked_readers_cgp(rnp)) {
                                /*
                                 * No point in scanning bits because they
                                 * are all zero.  But we might need to
                                 * priority-boost blocked readers.
                                 */
                                rcu_initiate_boost(rnp, flags);
                                /* rcu_initiate_boost() releases rnp->lock */
                                continue;
                        }
                        raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
                        continue;
                }
                for_each_leaf_node_possible_cpu(rnp, cpu) {
                        unsigned long bit = leaf_node_cpu_bit(rnp, cpu);
                        if ((rnp->qsmask & bit) != 0) {
                                if (f(per_cpu_ptr(&rcu_data, cpu)))
                                        mask |= bit;
                        }
                }
                if (mask != 0) {
                        /* Idle/offline CPUs, report (releases rnp->lock). */
                        rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
                } else {
                        /* Nothing to do here, so just drop the lock. */
                        raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
                }
        }
}

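For the CPUs on the leaf nodes that have not completed their qs, call the function passed as the argument, and forcibly report the CPUs for which it returns true as having passed a qs. The functions passed as the argument are the following.

dyntick_save_progress_counter()
- Passed on the first call for fqs processing. It returns whether the CPU is idle (eqs).
rcu_implicit_dynticks_qs()
- Passed on the subsequent calls for fqs processing. It returns whether the CPU is idle (eqs) or offline.

In code lines 8~9, iterate over all leaf nodes and clear the current task's rcu_tasks_holdout flag.
First, if no rescheduling has been requested, increment the current CPU's rcu_qs_ctr by 1. Also, for nohz idle, if a qs has passed, let the rcu core know.
In code lines 12~26, skip nodes whose qs have all been handled already. In addition, on a non-preemption kernel model, or if the node being visited has tasks blocked in rcu readers, initiate boosting.
In code lines 27~33, iterate over the possible CPUs of the node being visited, call the function @f passed as the argument, and add the CPU's bit to mask when the result is true.
In code lines 34~40, report to the node the CPUs judged to have completed a qs (@f() returned true).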
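
The inner scan of one leaf node boils down to "for every bit still set in qsmask, ask a predicate, and collect the bits that said yes". Here is a minimal user-space sketch of just that step. Everything in it is an assumption for illustration: `toy_qs_check_t`, `toy_cpu_idle[]`, `toy_cpu_is_idle()` and `toy_scan_leaf()` are hypothetical, the predicate takes a CPU number instead of a `struct rcu_data *`, and locking and boosting are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NCPUS 4

/* Hypothetical predicate type, standing in for int (*f)(struct rcu_data *). */
typedef bool (*toy_qs_check_t)(int cpu);

/* Assumed per-CPU state: true if the CPU is idle (eqs). */
static bool toy_cpu_idle[TOY_NCPUS];

static bool toy_cpu_is_idle(int cpu)
{
    return toy_cpu_idle[cpu];
}

/*
 * Scan one leaf: only CPUs whose bit is still set in @qsmask are
 * passed to @f. The returned mask is what would be reported to the
 * node in a single call.
 */
static unsigned long toy_scan_leaf(unsigned long qsmask, toy_qs_check_t f)
{
    unsigned long mask = 0;

    assert(f != NULL);
    for (int cpu = 0; cpu < TOY_NCPUS; cpu++) {
        unsigned long bit = 1UL << cpu;

        if ((qsmask & bit) != 0 && f(cpu))
            mask |= bit;
    }
    return mask;
}
```

CPUs that have already reported are never passed to the predicate, and all idle CPUs found on one leaf are reported together rather than one by one.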

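rcu handling on idle, user, nmi, and irq entry and exit

rcu handling on idle & user entry and exit

The following figure shows the function call relationships used to handle rcu on entry to and exit from idle and user mode.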

rcu_idle_enter()

kernel/rcu/tree.c

/**
 * rcu_idle_enter - inform RCU that current CPU is entering idle
 *
 * Enter idle mode, in other words, -leave- the mode in which RCU
 * read-side critical sections can occur.  (Though RCU read-side
 * critical sections can occur in irq handlers in idle, a possibility
 * handled by irq_enter() and irq_exit().)
 *
 * If you add or remove a call to rcu_idle_enter(), be sure to test with
 * CONFIG_RCU_EQS_DEBUG=y.
 */

void rcu_idle_enter(void)
{
        lockdep_assert_irqs_disabled();
        rcu_eqs_enter(false);
}

Performs the rcu handling needed when the CPU enters idle mode.

idle scheduler: cpu_startup_entry() -> do_idle()
- -> cpu_idle_poll() -> rcu_idle_enter()
- -> cpuidle_idle_call() -> rcu_idle_enter()

rcu_idle_exit()

kernel/rcu/tree.c

/**
 * rcu_idle_exit - inform RCU that current CPU is leaving idle
 *
 * Exit idle mode, in other words, -enter- the mode in which RCU
 * read-side critical sections can occur.
 *
 * If you add or remove a call to rcu_idle_exit(), be sure to test with
 * CONFIG_RCU_EQS_DEBUG=y.
 */

void rcu_idle_exit(void)
{
        unsigned long flags;

        local_irq_save(flags);
        rcu_eqs_exit(false);
        local_irq_restore(flags);
}

Performs the rcu handling needed when the CPU leaves idle mode.

idle scheduler: cpu_startup_entry() -> do_idle()
- -> cpu_idle_poll() -> rcu_idle_exit()
- -> cpuidle_idle_call() -> rcu_idle_exit()

rcu_user_enter() – nohz full

kernel/rcu/tree.c

/**
 * rcu_user_enter - inform RCU that we are resuming userspace.
 *
 * Enter RCU idle mode right before resuming userspace.  No use of RCU
 * is permitted between this call and rcu_user_exit(). This way the
 * CPU doesn't need to maintain the tick for RCU maintenance purposes
 * when the CPU runs in userspace.
 *
 * If you add or remove a call to rcu_user_enter(), be sure to test with
 * CONFIG_RCU_EQS_DEBUG=y.
 */

void rcu_user_enter(void)
{
        lockdep_assert_irqs_disabled();
        rcu_eqs_enter(true);
}

Performs the rcu handling needed when the CPU enters user mode.

ct_user_enter() -> context_tracking_user_enter() -> user_enter() -> context_tracking_enter() -> __context_tracking_enter() -> rcu_user_enter()

rcu_user_exit() – nohz full

kernel/rcu/tree.c

/**
 * rcu_user_exit - inform RCU that we are exiting userspace.
 *
 * Exit RCU idle mode while entering the kernel because it can
 * run a RCU read side critical section anytime.
 *
 * If you add or remove a call to rcu_user_exit(), be sure to test with
 * CONFIG_RCU_EQS_DEBUG=y.
 */

void rcu_user_exit(void)
{
        rcu_eqs_exit(1);
}

Performs the rcu handling needed when the CPU leaves user mode.

ct_user_exit() -> context_tracking_user_exit() -> user_exit() -> context_tracking_exit() -> __context_tracking_exit() -> rcu_user_exit()

rcu handling in interrupts

The following figure shows the function call relationships used to handle rcu on interrupt entry and exit.

rcu_nmi_enter()

kernel/rcu/tree.c

/**
 * rcu_nmi_enter - inform RCU of entry to NMI context
 */

void rcu_nmi_enter(void)
{
        rcu_nmi_enter_common(false);
}
NOKPROBE_SYMBOL(rcu_nmi_enter);

Handles what rcu has to do when the CPU enters an nmi.

rcu_nmi_exit()

kernel/rcu/tree.c

/**
 * rcu_nmi_exit - inform RCU of exit from NMI context
 *
 * If you add or remove a call to rcu_nmi_exit(), be sure to test
 * with CONFIG_RCU_EQS_DEBUG=y.
 */

void rcu_nmi_exit(void)
{
        rcu_nmi_exit_common(false);
}

Handles what rcu has to do before the CPU returns from an nmi.

rcu_irq_enter()

kernel/rcu/tree.c

/**
 * rcu_irq_enter - inform RCU that current CPU is entering irq away from idle
 *
 * Enter an interrupt handler, which might possibly result in exiting
 * idle mode, in other words, entering the mode in which read-side critical
 * sections can occur.  The caller must have disabled interrupts.
 *
 * Note that the Linux kernel is fully capable of entering an interrupt
 * handler that it never exits, for example when doing upcalls to user mode!
 * This code assumes that the idle loop never does upcalls to user mode.
 * If your architecture's idle loop does do upcalls to user mode (or does
 * anything else that results in unbalanced calls to the irq_enter() and
 * irq_exit() functions), RCU will give you what you deserve, good and hard.
 * But very infrequently and irreproducibly.
 *
 * Use things like work queues to work around this limitation.
 *
 * You have been warned.
 *
 * If you add or remove a call to rcu_irq_enter(), be sure to test with
 * CONFIG_RCU_EQS_DEBUG=y.
 */

void rcu_irq_enter(void)
{
        lockdep_assert_irqs_disabled();
        rcu_nmi_enter_common(true);
}

Handles what rcu has to do when the CPU enters an interrupt.

rcu_irq_exit()

kernel/rcu/tree.c

/**
 * rcu_irq_exit - inform RCU that current CPU is exiting irq towards idle
 *
 * Exit from an interrupt handler, which might possibly result in entering
 * idle mode, in other words, leaving the mode in which read-side critical
 * sections can occur.  The caller must have disabled interrupts.
 *
 * This code assumes that the idle loop never does anything that might
 * result in unbalanced calls to irq_enter() and irq_exit().  If your
 * architecture's idle loop violates this assumption, RCU will give you what
 * you deserve, good and hard.  But very infrequently and irreproducibly.
 *
 * Use things like work queues to work around this limitation.
 *
 * You have been warned.
 *
 * If you add or remove a call to rcu_irq_exit(), be sure to test with
 * CONFIG_RCU_EQS_DEBUG=y.
 */

void rcu_irq_exit(void)
{
        lockdep_assert_irqs_disabled();
        rcu_nmi_exit_common(true);
}

Handles what rcu has to do before the CPU returns from an interrupt.

rcu_nmi_enter_common()

kernel/rcu/tree.c

/**
 * rcu_nmi_enter_common - inform RCU of entry to NMI context
 * @irq: Is this call from rcu_irq_enter?
 *
 * If the CPU was idle from RCU's viewpoint, update rdp->dynticks and
 * rdp->dynticks_nmi_nesting to let the RCU grace-period handling know
 * that the CPU is active.  This implementation permits nested NMIs, as
 * long as the nesting level does not overflow an int.  (You will probably
 * run out of stack space first.)
 *
 * If you add or remove a call to rcu_nmi_enter_common(), be sure to test
 * with CONFIG_RCU_EQS_DEBUG=y.
 */

static __always_inline void rcu_nmi_enter_common(bool irq)
{
        struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
        long incby = 2;

        /* Complain about underflow. */
        WARN_ON_ONCE(rdp->dynticks_nmi_nesting < 0);

        /*
         * If idle from RCU viewpoint, atomically increment ->dynticks
         * to mark non-idle and increment ->dynticks_nmi_nesting by one.
         * Otherwise, increment ->dynticks_nmi_nesting by two.  This means
         * if ->dynticks_nmi_nesting is equal to one, we are guaranteed
         * to be in the outermost NMI handler that interrupted an RCU-idle
         * period (observation due to Andy Lutomirski).
         */
        if (rcu_dynticks_curr_cpu_in_eqs()) {

                if (irq)
                        rcu_dynticks_task_exit();

                rcu_dynticks_eqs_exit();

                if (irq)
                        rcu_cleanup_after_idle();

                incby = 1;
        }
        trace_rcu_dyntick(incby == 1 ? TPS("Endirq") : TPS("++="),
                          rdp->dynticks_nmi_nesting,
                          rdp->dynticks_nmi_nesting + incby, rdp->dynticks);
        WRITE_ONCE(rdp->dynticks_nmi_nesting, /* Prevent store tearing. */
                   rdp->dynticks_nmi_nesting + incby);
        barrier();
}

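Handles what rcu has to do after the CPU has entered an nmi or irq.

In code lines 17~28, if the CPU is in the eqs state, leave eqs. When entered from an irq rather than an nmi, first assign -1 to the current task's rcu_tasks_idle_cpu member to clear the recorded CPU before leaving eqs.
In code lines 32~33, increase rdp->dynticks_nmi_nesting by 1 if the CPU was in the eqs state, and by 2 otherwise.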

rcu_nmi_exit_common()

kernel/rcu/tree.c

/*
 * If we are returning from the outermost NMI handler that interrupted an
 * RCU-idle period, update rdp->dynticks and rdp->dynticks_nmi_nesting
 * to let the RCU grace-period handling know that the CPU is back to
 * being RCU-idle.
 *
 * If you add or remove a call to rcu_nmi_exit_common(), be sure to test
 * with CONFIG_RCU_EQS_DEBUG=y.
 */

static __always_inline void rcu_nmi_exit_common(bool irq)
{
        struct rcu_data *rdp = this_cpu_ptr(&rcu_data);

        /*
         * Check for ->dynticks_nmi_nesting underflow and bad ->dynticks.
         * (We are exiting an NMI handler, so RCU better be paying attention
         * to us!)
         */
        WARN_ON_ONCE(rdp->dynticks_nmi_nesting <= 0);
        WARN_ON_ONCE(rcu_dynticks_curr_cpu_in_eqs());

        /*
         * If the nesting level is not 1, the CPU wasn't RCU-idle, so
         * leave it in non-RCU-idle state.
         */
        if (rdp->dynticks_nmi_nesting != 1) {
                trace_rcu_dyntick(TPS("--="), rdp->dynticks_nmi_nesting, rdp->dynticks_nmi_nesting - 2, rdp->dynticks);
                WRITE_ONCE(rdp->dynticks_nmi_nesting, /* No store tearing. */
                           rdp->dynticks_nmi_nesting - 2);
                return;
        }

        /* This NMI interrupted an RCU-idle CPU, restore RCU-idleness. */
        trace_rcu_dyntick(TPS("Startirq"), rdp->dynticks_nmi_nesting, 0, rdp->dynticks);
        WRITE_ONCE(rdp->dynticks_nmi_nesting, 0); /* Avoid store tearing. */

        if (irq)
                rcu_prepare_for_idle();

        rcu_dynticks_eqs_enter();

        if (irq)
                rcu_dynticks_task_enter();
}

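Handles what rcu has to do before the CPU returns from an nmi or irq.

In code lines 17~22, if rdp->dynticks_nmi_nesting is not 1, decrease it by 2 and return from the function.
In code line 26, reset rdp->dynticks_nmi_nesting to 0.
In code lines 28~29, when returning from an irq, call rcu_prepare_for_idle() so that the non-lazy rcu callbacks still pending are dealt with before the CPU goes idle.
In code line 31, change the CPU to the eqs state.
In code lines 33~34, when returning from an irq, record the current CPU in the current task's rcu_tasks_idle_cpu member.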
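
The "+1 from idle, +2 otherwise" rule is easier to see with concrete numbers. The sketch below is a user-space model under stated assumptions: `struct toy_rdp`, `toy_nmi_enter()` and `toy_nmi_exit()` are hypothetical names, a bool replaces the bit1 test on rdp->dynticks, and tracing, barriers and the irq-only hooks are left out. The point it illustrates is that a nesting value of exactly 1 can only mean "outermost handler that interrupted an eqs period".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified per-CPU state. */
struct toy_rdp {
    bool in_eqs;       /* stand-in for bit1 of rdp->dynticks being 0 */
    long nmi_nesting;  /* stand-in for rdp->dynticks_nmi_nesting */
};

static void toy_nmi_enter(struct toy_rdp *rdp)
{
    long incby = 2;

    assert(rdp->nmi_nesting >= 0);  /* complain about underflow */
    if (rdp->in_eqs) {
        rdp->in_eqs = false;        /* leave eqs on the way in */
        incby = 1;                  /* makes the outermost level odd */
    }
    rdp->nmi_nesting += incby;
}

static void toy_nmi_exit(struct toy_rdp *rdp)
{
    assert(rdp->nmi_nesting > 0 && !rdp->in_eqs);
    if (rdp->nmi_nesting != 1) {
        rdp->nmi_nesting -= 2;      /* nested, or CPU was not idle */
        return;
    }
    rdp->nmi_nesting = 0;           /* outermost handler from idle */
    rdp->in_eqs = true;             /* restore eqs */
}
```

Starting from idle (nesting 0), two nested entries give 1 and then 3; the exits bring it back to 1 and then 0, and only the last exit puts the CPU back into eqs.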

Extended QS

The following figure shows the idle (eqs) sequence value and the lower two bits.

rcu_eqs_enter()

kernel/rcu/tree.c

/*
 * Enter an RCU extended quiescent state, which can be either the
 * idle loop or adaptive-tickless usermode execution.
 *
 * We crowbar the ->dynticks_nmi_nesting field to zero to allow for
 * the possibility of usermode upcalls having messed up our count
 * of interrupt nesting level during the prior busy period.
 */

static void rcu_eqs_enter(bool user)
{
        struct rcu_data *rdp = this_cpu_ptr(&rcu_data);

        WARN_ON_ONCE(rdp->dynticks_nmi_nesting != DYNTICK_IRQ_NONIDLE);
        WRITE_ONCE(rdp->dynticks_nmi_nesting, 0);
        WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
                     rdp->dynticks_nesting == 0);
        if (rdp->dynticks_nesting != 1) {
                rdp->dynticks_nesting--;
                return;
        }

        lockdep_assert_irqs_disabled();
        trace_rcu_dyntick(TPS("Start"), rdp->dynticks_nesting, 0, rdp->dynticks);
        WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current));
        rdp = this_cpu_ptr(&rcu_data);
        do_nocb_deferred_wakeup(rdp);
        rcu_prepare_for_idle();
        rcu_preempt_deferred_qs(current);
        WRITE_ONCE(rdp->dynticks_nesting, 0); /* Avoid irq-access tearing. */
        rcu_dynticks_eqs_enter();
        rcu_dynticks_task_enter();
}

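Enters the extended qs state. If @user is true, this means entering nohz full user mode; if false, it means entering nohz idle mode.

In code line 6, clear rdp->dynticks_nmi_nesting.
In code lines 9~12, if already nested two or more levels deep, decrease rdp->dynticks_nesting by 1 and return from the function.
In code line 18, if rdp->nocb_defer_wakeup is set, wake rcu_nocb_kthread.
In code line 19, call rcu_prepare_for_idle() so that the non-lazy rcu callbacks still pending are dealt with before the CPU goes idle.
In code line 20, handle the deferred qs.
- Release the deferred qs and, if the task is in the blocked state, unblock it and then report the qs.
In code line 21, clear rdp->dynticks_nesting to 0.
In code line 22, record the extended qs state.
In code line 23, record the current CPU in the current task's rcu_tasks_idle_cpu member.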

rcu_eqs_exit()

kernel/rcu/tree.c

/*
 * Exit an RCU extended quiescent state, which can be either the
 * idle loop or adaptive-tickless usermode execution.
 *
 * We crowbar the ->dynticks_nmi_nesting field to DYNTICK_IRQ_NONIDLE to
 * allow for the possibility of usermode upcalls messing up our count of
 * interrupt nesting level during the busy period that is just now starting.
 */

static void rcu_eqs_exit(bool user)
{
        struct rcu_data *rdp;
        long oldval;

        lockdep_assert_irqs_disabled();
        rdp = this_cpu_ptr(&rcu_data);
        oldval = rdp->dynticks_nesting;
        WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && oldval < 0);
        if (oldval) {
                rdp->dynticks_nesting++;
                return;
        }
        rcu_dynticks_task_exit();
        rcu_dynticks_eqs_exit();
        rcu_cleanup_after_idle();
        trace_rcu_dyntick(TPS("End"), rdp->dynticks_nesting, 1, rdp->dynticks);
        WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current));
        WRITE_ONCE(rdp->dynticks_nesting, 1);
        WARN_ON_ONCE(rdp->dynticks_nmi_nesting);
        WRITE_ONCE(rdp->dynticks_nmi_nesting, DYNTICK_IRQ_NONIDLE);
}

Leaves the extended qs state. If @user is true, this means leaving nohz full user mode; if false, it means leaving nohz idle mode.

In code lines 8~13, if already nested, increase rdp->dynticks_nesting by 1 and return from the function.
In code line 14, assign -1 to the current task's rcu_tasks_idle_cpu member to clear the recorded CPU.
In code line 15, clear the extended qs state.
In code line 16, perform the rcu handling needed after leaving idle. This is currently an empty function.
In code line 19, set rdp->dynticks_nesting from 0 to 1.
In code line 21, set rdp->dynticks_nmi_nesting to DYNTICK_IRQ_NONIDLE (long max / 2 + 1).
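
The process-level nesting counter used by rcu_eqs_enter() and rcu_eqs_exit() can be modelled the same way. This is a rough sketch, not the kernel code: `struct toy_eqs`, `toy_eqs_enter()` and `toy_eqs_exit()` are hypothetical names, `TOY_IRQ_NONIDLE` copies the "long max / 2 + 1" value quoted above, and the callback, deferred-qs and tracing steps are dropped.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define TOY_IRQ_NONIDLE ((LONG_MAX / 2) + 1)

/* Hypothetical, simplified per-CPU state. */
struct toy_eqs {
    long nesting;      /* stand-in for rdp->dynticks_nesting */
    long nmi_nesting;  /* stand-in for rdp->dynticks_nmi_nesting */
    bool in_eqs;       /* stand-in for bit1 of rdp->dynticks being 0 */
};

static void toy_eqs_enter(struct toy_eqs *s)
{
    assert(s->nesting > 0);   /* must not already be in eqs */
    s->nmi_nesting = 0;
    if (s->nesting != 1) {
        s->nesting--;         /* still nested: stay out of eqs */
        return;
    }
    s->nesting = 0;
    s->in_eqs = true;
}

static void toy_eqs_exit(struct toy_eqs *s)
{
    if (s->nesting) {
        s->nesting++;         /* already out of eqs: just nest */
        return;
    }
    s->in_eqs = false;
    s->nesting = 1;
    s->nmi_nesting = TOY_IRQ_NONIDLE;
}
```

Only the 1 -> 0 and 0 -> 1 transitions of the nesting counter actually change the eqs state; every other call just adjusts the count.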

rcu_dynticks_eqs_enter()

kernel/rcu/tree.c

/*
 * Record entry into an extended quiescent state.  This is only to be
 * called when not already in an extended quiescent state.
 */

static void rcu_dynticks_eqs_enter(void)
{
        struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
        int seq;

        /*
         * CPUs seeing atomic_add_return() must see prior RCU read-side
         * critical sections, and we also must force ordering with the
         * next idle sojourn.
         */
        seq = atomic_add_return(RCU_DYNTICK_CTRL_CTR, &rdp->dynticks);
        /* Better be in an extended quiescent state! */
        WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
                     (seq & RCU_DYNTICK_CTRL_CTR));
        /* Better not have special action (TLB flush) pending! */
        WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
                     (seq & RCU_DYNTICK_CTRL_MASK));
}

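Records entry into the extended qs state.

In code line 11, perform rdp->dynticks += 2. The sequence part is increased by 1, and bit1, which indicates the idle (eqs) state, is cleared.
In code lines 13~14, bit1 must be clear after the change.
In code lines 16~17, the special bit, bit0, must be clear after the change.

The following figure shows how the sequence and the lower two bits of rdp->dynticks change on entry to and exit from idle (eqs).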

rcu_dynticks_eqs_exit()

kernel/rcu/tree.c

/*
 * Record exit from an extended quiescent state.  This is only to be
 * called from an extended quiescent state.
 */

static void rcu_dynticks_eqs_exit(void)
{
        struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
        int seq;

        /*
         * CPUs seeing atomic_add_return() must see prior idle sojourns,
         * and we also must force ordering with the next RCU read-side
         * critical section.
         */
        seq = atomic_add_return(RCU_DYNTICK_CTRL_CTR, &rdp->dynticks);
        WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
                     !(seq & RCU_DYNTICK_CTRL_CTR));
        if (seq & RCU_DYNTICK_CTRL_MASK) {
                atomic_andnot(RCU_DYNTICK_CTRL_MASK, &rdp->dynticks);
                smp_mb__after_atomic(); /* _exit after clearing mask. */
                /* Prefer duplicate flushes to losing a flush. */
                rcu_eqs_special_exit();
        }
}

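Clears the extended qs state.

In code line 11, perform rdp->dynticks += 2. The sequence part does not change; bit1 is set, which changes the CPU to the non-eqs state.
In code lines 12~13, bit1 must be set after the change.
In code lines 14~19, if the special bit is set, remove the special bit and then call the rcu_eqs_special_exit() function.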
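
The effect of the same "+= 2" on entry and on exit can be checked with plain integer arithmetic. In this minimal sketch the names `TOY_CTRL_MASK`, `TOY_CTRL_CTR`, `toy_dynticks_eqs_enter()`, `toy_dynticks_eqs_exit()`, `toy_in_eqs()` and `toy_seq()` are all hypothetical; the bit values follow the description above (bit0 = special, bit1 = non-eqs, the bits above are the sequence part), and atomics and memory ordering are ignored.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_CTRL_MASK 0x1u  /* bit0: special-action flag */
#define TOY_CTRL_CTR  0x2u  /* bit1: set while the CPU is not in eqs */

static bool toy_in_eqs(unsigned int dynticks)
{
    return !(dynticks & TOY_CTRL_CTR);
}

/* Sequence part: everything above the lower two bits. */
static unsigned int toy_seq(unsigned int dynticks)
{
    return dynticks >> 2;
}

/* Entry: bit1 is 1, so adding 2 clears it and carries into the sequence. */
static unsigned int toy_dynticks_eqs_enter(unsigned int dynticks)
{
    assert(!toy_in_eqs(dynticks));
    return dynticks + TOY_CTRL_CTR;
}

/* Exit: bit1 is 0, so adding 2 only sets bit1; the sequence is unchanged. */
static unsigned int toy_dynticks_eqs_exit(unsigned int dynticks)
{
    assert(toy_in_eqs(dynticks));
    return dynticks + TOY_CTRL_CTR;
}
```

Starting from 2 (non-eqs, sequence 0), an entry gives 4 (eqs, sequence 1) and the following exit gives 6 (non-eqs, sequence 1), so the sequence part counts completed entries into eqs.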

rcu_dynticks_curr_cpu_in_eqs()

kernel/rcu/tree.c

/*
 * Is the current CPU in an extended quiescent state?
 *
 * No ordering, as we are sampling CPU-local information.
 */

bool rcu_dynticks_curr_cpu_in_eqs(void)
{
        struct rcu_data *rdp = this_cpu_ptr(&rcu_data);

        return !(atomic_read(&rdp->dynticks) & RCU_DYNTICK_CTRL_CTR);
}

Returns whether the current CPU is in the eqs state. (true=eqs, false=non-eqs)

Returns true (eqs) when bit1 of rdp->dynticks is 0.

rcu_dynticks_eqs_online()

kernel/rcu/tree.c

/*
 * Reset the current CPU's ->dynticks counter to indicate that the
 * newly onlined CPU is no longer in an extended quiescent state.
 * This will either leave the counter unchanged, or increment it
 * to the next non-quiescent value.
 *
 * The non-atomic test/increment sequence works because the upper bits
 * of the ->dynticks counter are manipulated only by the corresponding CPU,
 * or when the corresponding CPU is offline.
 */

static void rcu_dynticks_eqs_online(void)
{
        struct rcu_data *rdp = this_cpu_ptr(&rcu_data);

        if (atomic_read(&rdp->dynticks) & RCU_DYNTICK_CTRL_CTR)
                return;
        atomic_add(RCU_DYNTICK_CTRL_CTR, &rdp->dynticks);
}

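Resets a newly onlined CPU so that it is not in eqs.

In code lines 5~6, if bit1 of rdp->dynticks is already set, return from the function.
In code line 7, with bit1 of rdp->dynticks clear, perform rdp->dynticks += 2. This sets bit1 (non-eqs); the counter part above it does not change.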

rcu_dynticks_snap()

kernel/rcu/tree.c

/*
 * Snapshot the ->dynticks counter with full ordering so as to allow
 * stable comparison of this counter with past and future snapshots.
 */

int rcu_dynticks_snap(struct rcu_data *rdp)
{
        int snap = atomic_add_return(0, &rdp->dynticks);

        return snap & ~RCU_DYNTICK_CTRL_MASK;
}

Returns a snapshot of the dynticks counter.

Returns the dynticks counter value with the special bit, bit0, cleared.

rcu_dynticks_in_eqs()

kernel/rcu/tree.c

/*
 * Return true if the snapshot returned from rcu_dynticks_snap()
 * indicates that RCU is in an extended quiescent state.
 */

static bool rcu_dynticks_in_eqs(int snap)
{
        return !(snap & RCU_DYNTICK_CTRL_CTR);
}

Returns whether the dynticks snapshot value indicates the eqs state.

The CPU is in the eqs state when bit1 of @snap is 0.

rcu_dynticks_in_eqs_since()

kernel/rcu/tree.c

/*
 * Return true if the CPU corresponding to the specified rcu_data
 * structure has spent some time in an extended quiescent state since
 * rcu_dynticks_snap() returned the specified snapshot.
 */

static bool rcu_dynticks_in_eqs_since(struct rcu_data *rdp, int snap)
{
        return snap != rcu_dynticks_snap(rdp);
}

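Returns whether the eqs state has changed since the snapshot was taken, that is, whether the current snapshot differs from @snap.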
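
The three snapshot helpers fit together as follows. This is a minimal user-space sketch with hypothetical names (`toy_snap()`, `toy_snap_in_eqs()`, `toy_in_eqs_since()`); it takes the counter value as a plain int instead of reading rdp->dynticks atomically, and it assumes the bit layout described earlier (bit0 = special, bit1 = non-eqs).

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_CTRL_MASK 0x1  /* bit0: special-action flag */
#define TOY_CTRL_CTR  0x2  /* bit1: set while the CPU is not in eqs */

/* Snapshot: the counter value with the special bit stripped. */
static int toy_snap(int dynticks)
{
    return dynticks & ~TOY_CTRL_MASK;
}

/* A snapshot with bit1 clear was taken while the CPU was in eqs. */
static bool toy_snap_in_eqs(int snap)
{
    return !(snap & TOY_CTRL_CTR);
}

/*
 * Any difference from an earlier snapshot means the counter moved,
 * i.e. the CPU entered (or passed through) eqs in the meantime.
 */
static bool toy_in_eqs_since(int dynticks_now, int snap)
{
    assert(!(snap & TOY_CTRL_MASK));  /* snapshots never carry bit0 */
    return snap != toy_snap(dynticks_now);
}
```

A CPU that is busy both when the snapshot is taken and when it is checked still counts as having been through eqs if the counter advanced in between, which is why the comparison is on the whole value rather than on bit1 alone.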

rcu_momentary_dyntick_idle()

kernel/rcu/tree.c

/*
 * Let the RCU core know that this CPU has gone through the scheduler,
 * which is a quiescent state.  This is called when the need for a
 * quiescent state is urgent, so we burn an atomic operation and full
 * memory barriers to let the RCU core know about it, regardless of what
 * this CPU might (or might not) do in the near future.
 *
 * We inform the RCU core by emulating a zero-duration dyntick-idle period.
 *
 * The caller must have disabled interrupts and must not be idle.
 */

static void __maybe_unused rcu_momentary_dyntick_idle(void)
{
        int special;

        raw_cpu_write(rcu_data.rcu_need_heavy_qs, false);
        special = atomic_add_return(2 * RCU_DYNTICK_CTRL_CTR,
                                    &this_cpu_ptr(&rcu_data)->dynticks);
        /* It is illegal to call this from idle state. */
        WARN_ON_ONCE(!(special & RCU_DYNTICK_CTRL_CTR));
        rcu_preempt_deferred_qs(current);
}

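Executed when the rcu core urgently needs to learn about a qs.

In code line 5, clear the urgent qs request flag (rdp->rcu_need_heavy_qs).
- This flag is set through force_qs_rnp() -> rcu_implicit_dynticks_qs().
In code lines 6~9, add 4 to the dynticks counter, and print a warning if bit1 of the dynticks counter is 0 (adding 4 does not change bit1).
In code line 10, handle the deferred qs.
- Release the deferred qs and, if the task is in the blocked state, unblock it and then report the qs.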
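
Adding 4 amounts to "enter eqs and leave it again in one step". The short sketch below checks that with plain integers; `toy_momentary_idle()` is a hypothetical name, the bit values are assumed from the description of rdp->dynticks earlier in the text, and the atomic operation and deferred-qs handling are not modelled.

```c
#include <assert.h>

#define TOY_CTRL_MASK 0x1  /* bit0: special-action flag */
#define TOY_CTRL_CTR  0x2  /* bit1: set while the CPU is not in eqs */

/*
 * Emulate a zero-length eqs period: bit1 stays set (the CPU is still
 * busy), but the sequence part above it advances by one.
 */
static int toy_momentary_idle(int dynticks)
{
    assert(dynticks & TOY_CTRL_CTR);  /* must not be called from idle */
    return dynticks + 2 * TOY_CTRL_CTR;
}
```

A snapshot taken before the call no longer matches the counter afterwards, so a later fqs scan sees that this CPU has been through a qs even though it never actually went idle.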

rcu_is_cpu_rrupt_from_idle()

kernel/rcu/tree.c

/**
 * rcu_is_cpu_rrupt_from_idle - see if interrupted from idle
 *
 * If the current CPU is idle and running at a first-level (not nested)
 * interrupt from idle, return true.  The caller must have at least
 * disabled preemption.
 */

static int rcu_is_cpu_rrupt_from_idle(void)
{
        /* Called only from within the scheduling-clock interrupt */
        lockdep_assert_in_irq();

        /* Check for counter underflows */
        RCU_LOCKDEP_WARN(__this_cpu_read(rcu_data.dynticks_nesting) < 0,
                         "RCU dynticks_nesting counter underflow!");
        RCU_LOCKDEP_WARN(__this_cpu_read(rcu_data.dynticks_nmi_nesting) <= 0,
                         "RCU dynticks_nmi_nesting counter underflow/zero!");

        /* Are we at first interrupt nesting level? */
        if (__this_cpu_read(rcu_data.dynticks_nmi_nesting) != 1)
                return false;

        /* Does CPU appear to be idle from an RCU standpoint? */
        return __this_cpu_read(rcu_data.dynticks_nesting) == 0;
}

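Returns whether the interrupt was taken while the CPU was in idle mode.

Callback-processing kernel thread for cb

Where cb callbacks are processed depends on the value of the "rcutree.use_softirq" module parameter.

1
- Called and processed from softirq. (default)
0
- Processed by the callback-processing kernel thread for cb.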

rcu_cpu_kthread()

kernel/rcu/tree.c

/*
 * Per-CPU kernel thread that invokes RCU callbacks.  This replaces
 * the RCU softirq used in configurations of RCU that do not support RCU
 * priority boosting.
 */

static void rcu_cpu_kthread(unsigned int cpu)
{
        unsigned int *statusp = this_cpu_ptr(&rcu_data.rcu_cpu_kthread_status);
        char work, *workp = this_cpu_ptr(&rcu_data.rcu_cpu_has_work);
        int spincnt;

        for (spincnt = 0; spincnt < 10; spincnt++) {
                trace_rcu_utilization(TPS("Start CPU kthread@rcu_wait"));
                local_bh_disable();
                *statusp = RCU_KTHREAD_RUNNING;
                local_irq_disable();
                work = *workp;
                *workp = 0;
                local_irq_enable();
                if (work)
                        rcu_core();
                local_bh_enable();
                if (*workp == 0) {
                        trace_rcu_utilization(TPS("End CPU kthread@rcu_wait"));
                        *statusp = RCU_KTHREAD_WAITING;
                        return;
                }
        }
        *statusp = RCU_KTHREAD_YIELDING;
        trace_rcu_utilization(TPS("Start CPU kthread@rcu_yield"));
        schedule_timeout_interruptible(2);
        trace_rcu_utilization(TPS("End CPU kthread@rcu_yield"));
        *statusp = RCU_KTHREAD_WAITING;
}

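When the callback-processing kernel thread runs, it calls rcu_core() to invoke completed rcu callbacks whenever this CPU's rcu_cpu_has_work flag is set, repeating for at most 10 passes. If work is still pending after the 10th pass, it yields by sleeping for 2 ticks.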
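The bounded loop can be sketched in user space as follows. This is a model under stated assumptions, not kernel code: struct cpu_model, core_stub() and the pending field are hypothetical, and core_stub() merely re-arms the work flag to imitate new callbacks arriving while rcu_core() runs.

```c
#include <assert.h>
#include <stdbool.h>

enum kthread_status { KT_RUNNING = 1, KT_WAITING = 2, KT_YIELDING = 4 };

struct cpu_model {
        int has_work;    /* models rcu_data.rcu_cpu_has_work            */
        int pending;     /* hypothetical: how often work gets re-armed  */
        int core_calls;  /* number of times the rcu_core() stand-in ran */
        enum kthread_status status;
};

/* Stand-in for rcu_core(): does one batch and may re-arm the flag. */
static void core_stub(struct cpu_model *m)
{
        m->core_calls++;
        if (m->pending > 0) {
                m->pending--;
                m->has_work = 1;
        }
}

/* One invocation of the thread function; returns true if it had to yield. */
static bool cpu_kthread_pass(struct cpu_model *m)
{
        for (int spincnt = 0; spincnt < 10; spincnt++) {
                m->status = KT_RUNNING;
                int work = m->has_work;   /* snapshot and clear the flag */
                m->has_work = 0;
                if (work)
                        core_stub(m);
                if (m->has_work == 0) {   /* nothing new arrived: wait */
                        m->status = KT_WAITING;
                        return false;
                }
        }
        /* Still busy after 10 passes; the kernel sleeps 2 ticks here. */
        m->status = KT_YIELDING;
        return true;
}
```

In the kernel the status goes back to RCU_KTHREAD_WAITING after the 2-tick sleep; the sketch stops at the yielding state so the caller can observe it.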

The following figure shows how the callback-processing kernel thread operates and which states it moves through.

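RCU Boost kernel thread

With preemptible rcu, a task can be preempted even inside an rcu read-side critical section. In a case like the one below, where reader thread A has been preempted by thread B, the rcu boost kernel thread works in the background and, on request, boosts the priority of the preempted reader (thread A). The boosted reader gets back onto a CPU and leaves its critical section quickly, which keeps the gp from being delayed. One rcu boost kernel thread runs per rcu leaf node.

Thread A (rcu reader) ---(preempt)---> Thread B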

rcu_boost_kthread()

kernel/rcu/tree_plugin.h

/*
 * Priority-boosting kthread, one per leaf rcu_node.
 */

static int rcu_boost_kthread(void *arg)
{
        struct rcu_node *rnp = (struct rcu_node *)arg;
        int spincnt = 0;
        int more2boost;

        trace_rcu_utilization(TPS("Start boost kthread@init"));
        for (;;) {
                rnp->boost_kthread_status = RCU_KTHREAD_WAITING;
                trace_rcu_utilization(TPS("End boost kthread@rcu_wait"));
                rcu_wait(rnp->boost_tasks || rnp->exp_tasks);
                trace_rcu_utilization(TPS("Start boost kthread@rcu_wait"));
                rnp->boost_kthread_status = RCU_KTHREAD_RUNNING;
                more2boost = rcu_boost(rnp);
                if (more2boost)
                        spincnt++;
                else
                        spincnt = 0;
                if (spincnt > 10) {
                        rnp->boost_kthread_status = RCU_KTHREAD_YIELDING;
                        trace_rcu_utilization(TPS("End boost kthread@rcu_yield"));
                        schedule_timeout_interruptible(2);
                        trace_rcu_utilization(TPS("Start boost kthread@rcu_yield"));
                        spincnt = 0;
                }
        }
        /* NOTREACHED */
        trace_rcu_utilization(TPS("End boost kthread@notreached"));
        return 0;
}

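One rcu boost kernel thread runs per leaf node, and it boosts the task that is made the owner of the node's rt mutex.

Code line 8: loops forever.
Code line 9: sets the boost kernel thread status to RCU_KTHREAD_WAITING(2).
Code line 11: sleeps until the node has an expedited task or a boost task.
Code line 13: sets the boost kernel thread status to RCU_KTHREAD_RUNNING(1).
Code line 14: boosts the priority of the blocked rcu task that is set as the owner of the node's rt mutex, and learns whether more tasks remain to be boosted.
Code lines 15~18: increments the spin counter while there is more to boost. If nothing is left to boost, the counter is reset to 0.
Code lines 19~25: once the counter exceeds 10, sets the boost kernel thread status to RCU_KTHREAD_YIELDING(4), sleeps for 2 ticks, resets the counter, and loops again.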
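The counter logic on code lines 15~25 is easy to misread (the yield happens on the 11th consecutive "more to boost" result, not the 10th). A small user-space replay, where count_boost_yields() is a hypothetical helper and the input array stands in for successive rcu_boost() return values:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Replays a sequence of rcu_boost() results (1 = more tasks to boost,
 * 0 = nothing left) and counts how often the loop would yield.
 */
static int count_boost_yields(const int *more2boost, size_t n)
{
        int spincnt = 0;
        int yields = 0;

        for (size_t i = 0; i < n; i++) {
                if (more2boost[i])
                        spincnt++;
                else
                        spincnt = 0;      /* an idle result resets the streak */
                if (spincnt > 10) {       /* 11 busy results in a row */
                        yields++;         /* kernel: sleep for 2 ticks */
                        spincnt = 0;
                }
        }
        return yields;
}
```

A single result of 0 anywhere in the streak is enough to postpone the yield, because the counter restarts from 0.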

The following figure shows how the rcu boost kernel thread operates and which states it moves through.

rcu_wait()

kernel/rcu/tree.h

#define rcu_wait(cond)                                                  \
do {                                                                    \
        for (;;) {                                                      \
                set_current_state(TASK_INTERRUPTIBLE);                  \
                if (cond)                                               \
                        break;                                          \
                schedule();                                             \
        }                                                               \
        __set_current_state(TASK_RUNNING);                              \
} while (0)

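Sets the task state to sleeping (TASK_INTERRUPTIBLE) and sleeps. Each time the task wakes up it checks @cond; when @cond is true it sets the task state back to running and leaves the loop. Because the state is set before @cond is checked, a wakeup that arrives between the check and schedule() is not lost.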

rcu_boost()

kernel/rcu/tree_plugin.h

/*
 * Carry out RCU priority boosting on the task indicated by ->exp_tasks
 * or ->boost_tasks, advancing the pointer to the next task in the
 * ->blkd_tasks list.
 *
 * Note that irqs must be enabled: boosting the task can block.
 * Returns 1 if there are more tasks needing to be boosted.
 */

static int rcu_boost(struct rcu_node *rnp)
{
        unsigned long flags;
        struct task_struct *t;
        struct list_head *tb;

        if (READ_ONCE(rnp->exp_tasks) == NULL &&
            READ_ONCE(rnp->boost_tasks) == NULL)
                return 0;  /* Nothing left to boost. */

        raw_spin_lock_irqsave_rcu_node(rnp, flags);

        /*
         * Recheck under the lock: all tasks in need of boosting
         * might exit their RCU read-side critical sections on their own.
         */
        if (rnp->exp_tasks == NULL && rnp->boost_tasks == NULL) {
                raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
                return 0;
        }

        /*
         * Preferentially boost tasks blocking expedited grace periods.
         * This cannot starve the normal grace periods because a second
         * expedited grace period must boost all blocked tasks, including
         * those blocking the pre-existing normal grace period.
         */
        if (rnp->exp_tasks != NULL)
                tb = rnp->exp_tasks;
        else
                tb = rnp->boost_tasks;

        /*
         * We boost task t by manufacturing an rt_mutex that appears to
         * be held by task t.  We leave a pointer to that rt_mutex where
         * task t can find it, and task t will release the mutex when it
         * exits its outermost RCU read-side critical section.  Then
         * simply acquiring this artificial rt_mutex will boost task
         * t's priority.  (Thanks to tglx for suggesting this approach!)
         *
         * Note that task t must acquire rnp->lock to remove itself from
         * the ->blkd_tasks list, which it will do from exit() if from
         * nowhere else.  We therefore are guaranteed that task t will
         * stay around at least until we drop rnp->lock.  Note that
         * rnp->lock also resolves races between our priority boosting
         * and task t's exiting its outermost RCU read-side critical
         * section.
         */
        t = container_of(tb, struct task_struct, rcu_node_entry);
        rt_mutex_init_proxy_locked(&rnp->boost_mtx, t);
        raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
        /* Lock only for side effect: boosts task t's priority. */
        rt_mutex_lock(&rnp->boost_mtx);
        rt_mutex_unlock(&rnp->boost_mtx);  /* Then keep lockdep happy. */

        return READ_ONCE(rnp->exp_tasks) != NULL ||
               READ_ONCE(rnp->boost_tasks) != NULL;
}

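Boosts the priority of the rcu task that is set as the owner of the node's rt mutex, by way of rt mutex priority inheritance.

Code lines 7~9: if the node has neither an expedited task nor a boost task, there is nothing to do and 0 is returned.
Code lines 11~20: after taking the node spinlock, checks once more whether the node has neither an expedited task nor a boost task; if so, drops the node spinlock and returns 0.
Code lines 28~31: selects the expedited task. If there is none, selects the boost task.
Code lines 49~50: initializes the node's rt mutex (boost_mtx) and marks the selected task as the owner of the mutex.
- If the rt mutex has waiters, the owner task is stored with the RT_MUTEX_HAS_WAITERS bit set in the low bit.
Code line 51: releases the node spinlock.
Code lines 53~54: acquires the node's rt mutex and then releases it again.
- Blocking on the mutex is what raises the priority of the rcu task that was set as its owner.
Code lines 56~57: returns whether the node still has an expedited task or a boost task.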
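The proxy-lock trick can be illustrated with a toy priority-inheritance model. Everything here is an assumption made for illustration: struct task, struct pi_mutex and the three helpers are hypothetical, and a larger prio value means "more urgent", which is the opposite of the kernel's internal numbering.

```c
#include <assert.h>
#include <stddef.h>

/* Toy task: larger prio = more urgent (sketch convention only). */
struct task {
        int prio;         /* effective priority */
        int normal_prio;  /* priority without any boost */
};

struct pi_mutex {
        struct task *owner;
};

/* Models rt_mutex_init_proxy_locked(): the lock appears held by t. */
static void proxy_lock(struct pi_mutex *m, struct task *t)
{
        m->owner = t;
}

/* Models the booster blocking in rt_mutex_lock(): the owner inherits. */
static void block_on(struct pi_mutex *m, const struct task *waiter)
{
        if (m->owner && m->owner->prio < waiter->prio)
                m->owner->prio = waiter->prio;
}

/* Models the reader releasing the mutex when it leaves its critical section. */
static void owner_unlock(struct pi_mutex *m)
{
        if (m->owner) {
                m->owner->prio = m->owner->normal_prio;
                m->owner = NULL;
        }
}
```

The booster never needs the mutex for mutual exclusion; in this model, as in rcu_boost(), it blocks on the lock only for the side effect on the owner's priority.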

Boost request

rcu_initiate_boost()

kernel/rcu/tree_plugin.h

/*
 * Check to see if it is time to start boosting RCU readers that are
 * blocking the current grace period, and, if so, tell the per-rcu_node
 * kthread to start boosting them.  If there is an expedited grace
 * period in progress, it is always time to boost.
 *
 * The caller must hold rnp->lock, which this function releases.
 * The ->boost_kthread_task is immortal, so we don't need to worry
 * about it going away.
 */

static void rcu_initiate_boost(struct rcu_node *rnp, unsigned long flags)
        __releases(rnp->lock)
{
        raw_lockdep_assert_held_rcu_node(rnp);
        if (!rcu_preempt_blocked_readers_cgp(rnp) && rnp->exp_tasks == NULL) {
                raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
                return;
        }
        if (rnp->exp_tasks != NULL ||
            (rnp->gp_tasks != NULL &&
             rnp->boost_tasks == NULL &&
             rnp->qsmask == 0 &&
             ULONG_CMP_GE(jiffies, rnp->boost_time))) {
                if (rnp->exp_tasks == NULL)
                        rnp->boost_tasks = rnp->gp_tasks;
                raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
                rcu_wake_cond(rnp->boost_kthread_task,
                              rnp->boost_kthread_status);
        } else {
                raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
        }              
}

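Checks whether there are rcu readers that need boosting and, if so, wakes the node's boost kernel thread. With an expedited gp in progress it is always time to boost; otherwise the normal gp must be blocked only by preempted readers (qsmask == 0), no boost may already be under way, and boost_time must have been reached.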
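The decision can be written as a pure function for experimentation. This is a sketch: struct node_model and initiate_boost_model() are hypothetical, locking and the actual wakeup are omitted, and the blocked_readers_cgp flag stands in for rcu_preempt_blocked_readers_cgp(). The ULONG_CMP_GE() definition is the wrap-safe comparison used in the kernel.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Wrap-safe "a >= b" for free-running counters such as jiffies. */
#define ULONG_CMP_GE(a, b) (ULONG_MAX / 2 >= (a) - (b))

struct node_model {
        bool blocked_readers_cgp;  /* readers block the current gp */
        void *exp_tasks;
        void *gp_tasks;
        void *boost_tasks;
        unsigned long qsmask;
        unsigned long boost_time;
};

/* Returns true where the kernel would wake the boost kthread. */
static bool initiate_boost_model(struct node_model *n, unsigned long now)
{
        if (!n->blocked_readers_cgp && n->exp_tasks == NULL)
                return false;                      /* nobody to boost */

        if (n->exp_tasks != NULL ||
            (n->gp_tasks != NULL &&
             n->boost_tasks == NULL &&
             n->qsmask == 0 &&
             ULONG_CMP_GE(now, n->boost_time))) {
                if (n->exp_tasks == NULL)
                        n->boost_tasks = n->gp_tasks;  /* start at first blocker */
                return true;
        }
        return false;
}
```

Note that boost_tasks is only set for the normal-gp case; for an expedited gp, rcu_boost() picks up exp_tasks directly.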

References

RCU(Read Copy Update) -1- (Basic) | 문c
RCU(Read Copy Update) -2- (Callback process) | 문c
RCU(Read Copy Update) -3- (RCU threads) | 문c (this article)
RCU(Read Copy Update) -4- (NOCB process) | 문c
RCU(Read Copy Update) -5- (Callback list) | 문c
RCU(Read Copy Update) -6- (Expedited GP) | 문c
RCU(Read Copy Update) -7- (Preemptible RCU) | 문c
rcu_init() | 문c
wait_for_completion() | 문c

Device Resource Management

2017-09-07 문영일

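When a device is no longer used and is detached, the various resources that were allocated for it must be freed. A mechanism was introduced that records all of these resources so that they can be freed in one go. Tejun Heo, while developing a subsystem for ATA devices, needed an easier way to manage device resources and provided the supporting APIs, which went into kernel v2.6.21 in 2007.

Reference: Device resource management | LWN.net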

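Device resource management

As the following figure shows, if the resources allocated while using a device are recorded on a list, they are reliably freed when the device is later detached. (This prevents resource leaks.)

Device resource add/release APIs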
- devres_alloc()
- devres_free()
- devres_add()
- devres_find()
- devres_get()
- devres_remove()
- devres_destroy()
- devres_release()
- devres_release_all()
- devres_for_each_res()
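The bookkeeping idea behind these APIs can be sketched in user space. This is a simplified model, not the kernel implementation: res_alloc(), res_add(), res_release_all(), struct dev_model and the log_release() callback are hypothetical names, a singly linked list replaces the kernel's list_head, and there is no locking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef void (*release_fn)(void *data);

struct res_node {
        struct res_node *prev;   /* node that was added before this one */
        release_fn release;
        unsigned char data[];    /* payload handed to the caller */
};

struct dev_model {
        struct res_node *last;   /* most recently added node */
};

/* Like devres_alloc(): zeroed node plus payload; returns the payload. */
static void *res_alloc(release_fn release, size_t size)
{
        struct res_node *n = calloc(1, sizeof(*n) + size);

        if (!n)
                return NULL;
        n->release = release;
        return n->data;
}

/* Like devres_add(): recover the node from the payload and link it. */
static void res_add(struct dev_model *dev, void *data)
{
        struct res_node *n = (struct res_node *)
                ((unsigned char *)data - offsetof(struct res_node, data));

        n->prev = dev->last;
        dev->last = n;
}

/* Like devres_release_all(): release newest first, return the count. */
static int res_release_all(struct dev_model *dev)
{
        int cnt = 0;

        while (dev->last) {
                struct res_node *n = dev->last;

                dev->last = n->prev;
                n->release(n->data);
                free(n);
                cnt++;
        }
        return cnt;
}

/* Hypothetical release callback: records the order of releases. */
static int release_log[8], release_count;

static void log_release(void *data)
{
        if (release_count < 8)
                release_log[release_count++] = *(int *)data;
}
```

Releasing in reverse order of registration matters because later resources often depend on earlier ones, for example an IRQ handler that uses a previously mapped register region.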

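Device resource group management

As the following figure shows, the resources a device needs in order to operate can be grouped while they are allocated; if an allocation fails, every resource within that group's range can be released automatically.

As the following figure shows, nested groups are also allowed.

Group-related APIs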
- devres_open_group()
- devres_close_group()
- devres_remove_group()
- devres_release_group()

Managed Resource APIs

Many allocation-related managed APIs keep being added, as listed below.

Custom action APIs
- devm_add_action()
- devm_remove_action()
Managed kmalloc allocation and freeing
- devm_kmalloc()
- devm_kstrdup()
- devm_kvasprintf()
- devm_kasprintf()
- devm_kmemdup()
- devm_kzalloc()
- devm_kmalloc_array()
- devm_kcalloc()
- devm_kfree()
Managed page allocation
- devm_get_free_pages()
- devm_free_pages()
IO remap related
- devm_ioremap()
- devm_ioremap_nocache()
- devm_ioremap_resource()
- devm_iounmap()
IO port map related
- devm_ioport_map()
- devm_ioport_unmap()
I/O or memory region request/release related
- devm_request_resource()
- devm_release_resource()
- devm_request_region()
- devm_request_mem_region()
- devm_release_region()
- devm_release_mem_region()
IRQ related
- devm_request_irq()
- devm_request_threaded_irq()
- devm_request_any_context_irq()
  - Reference: genirq: Add devm_request_any_context_irq()
- devm_free_irq()
- devm_irq_alloc_descs()
  - Reference: irqdesc: Add a resource managed version of irq_alloc_descs()
- devm_irq_alloc_desc()
- devm_irq_alloc_desc_at()
- devm_irq_alloc_desc_from()
- devm_irq_alloc_descs_from()
- devm_irq_alloc_generic_chip()
- devm_irq_setup_generic_chip()
  - Reference: irq/generic-chip: Provide devm_irq_setup_generic_chip()

DMA related
- dmam_alloc_coherent()
- dmam_free_coherent()
- dmam_alloc_noncoherent()
- dmam_free_noncoherent()
- dmam_declare_coherent_memory()
- dmam_release_declared_memory()
DMA pool related
- dmam_pool_create()
- dmam_pool_destroy()
PCI related
- pcim_enable_device()
- pcim_pin_device()
- pcim_release()
PCI iomap related
- pcim_iomap()
- pcim_iounmap()
- pcim_iomap_table()
- pcim_iomap_regions()
- pcim_iomap_regions_request_all()
- pcim_iounmap_regions()

Device resource APIs

Device resource allocation and freeing

devres_alloc()

drivers/base/devres.c

/**
 * devres_alloc - Allocate device resource data
 * @release: Release function devres will be associated with
 * @size: Allocation size
 * @gfp: Allocation flags
 *
 * Allocate devres of @size bytes.  The allocated area is zeroed, then
 * associated with @release.  The returned pointer can be passed to
 * other devres_*() functions.
 *
 * RETURNS:
 * Pointer to allocated devres on success, NULL on failure.
 */
void * devres_alloc(dr_release_t release, size_t size, gfp_t gfp)
{
        struct devres *dr;

        dr = alloc_dr(release, size, gfp | __GFP_ZERO);
        if (unlikely(!dr))
                return NULL;
        return dr->data;
}
EXPORT_SYMBOL_GPL(devres_alloc);

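Allocates device resource data and returns it. Returns NULL on failure.

release: the release function to be invoked when the device is released
size: allocation size of the object
gfp: gfp flags used when allocating the object

Code line 18: allocates the device resource together with its device resource data; the allocated area is zeroed (__GFP_ZERO).
Code line 21: returns the device resource data.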

devres_free()

drivers/base/devres.c

/**
 * devres_free - Free device resource data
 * @res: Pointer to devres data to free
 *
 * Free devres created with devres_alloc().
 */
void devres_free(void *res)
{
        if (res) {
                struct devres *dr = container_of(res, struct devres, data);

                BUG_ON(!list_empty(&dr->node.entry));
                kfree(dr);
        }
}
EXPORT_SYMBOL_GPL(devres_free);

Frees device resource data.

Code line 10: obtains the device resource dr that contains the device resource data.
- res: device resource data
Code line 13: frees the device resource dr.

Adding a device resource

devres_add()

drivers/base/devres.c

/**
 * devres_add - Register device resource
 * @dev: Device to add resource to
 * @res: Resource to register
 *
 * Register devres @res to @dev.  @res should have been allocated
 * using devres_alloc().  On driver detach, the associated release
 * function will be invoked and devres will be freed automatically.
 */
void devres_add(struct device *dev, void *res)
{
        struct devres *dr = container_of(res, struct devres, data);
        unsigned long flags;

        spin_lock_irqsave(&dev->devres_lock, flags);
        add_dr(dev, &dr->node);
        spin_unlock_irqrestore(&dev->devres_lock, flags);
}
EXPORT_SYMBOL_GPL(devres_add);

Registers device resource data with a device.

Code line 12: obtains the device resource dr that contains the device resource data.
Code line 16: adds the device resource dr to the device.

add_dr()

drivers/base/devres.c

static void add_dr(struct device *dev, struct devres_node *node)
{
        devres_log(dev, node, "ADD");
        BUG_ON(!list_empty(&node->entry));
        list_add_tail(&node->entry, &dev->devres_head);
}

Adds the device resource node to the tail of the device's devres list. The node must not already be on a list.

Device resource lookup

devres_find()

drivers/base/devres.c

/**
 * devres_find - Find device resource
 * @dev: Device to lookup resource from
 * @release: Look for resources associated with this release function
 * @match: Match function (optional)
 * @match_data: Data for the match function
 *
 * Find the latest devres of @dev which is associated with @release
 * and for which @match returns 1.  If @match is NULL, it's considered
 * to match all.
 *
 * RETURNS:
 * Pointer to found devres, NULL if not found.
 */
void * devres_find(struct device *dev, dr_release_t release,
                   dr_match_t match, void *match_data)
{
        struct devres *dr;
        unsigned long flags;

        spin_lock_irqsave(&dev->devres_lock, flags);
        dr = find_dr(dev, release, match, match_data);
        spin_unlock_irqrestore(&dev->devres_lock, flags);

        if (dr)
                return dr->data;
        return NULL;
}
EXPORT_SYMBOL_GPL(devres_find);

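Searches the device resources registered on the device for the latest one that uses the same release function and whose device resource data matches, and returns its device resource data. Returns NULL if none is found.

Code line 22: searches the device resources registered on the device for one that uses the same release function and whose device resource data matches.
Code lines 25~26: if one is found, returns its device resource data.
Code line 27: if none is found, returns NULL.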

find_dr()

drivers/base/devres.c

static struct devres *find_dr(struct device *dev, dr_release_t release,
                              dr_match_t match, void *match_data)
{
        struct devres_node *node;

        list_for_each_entry_reverse(node, &dev->devres_head, entry) {
                struct devres *dr = container_of(node, struct devres, node);

                if (node->release != release)
                        continue;
                if (match && !match(dev, dr->data, match_data))
                        continue;
                return dr;
        }

        return NULL;
}

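Searches the device resources registered on the device for one that uses the same release function and whose device resource data matches. Returns NULL if none is found.

Code lines 6~7: walks the device resources registered on the device in reverse order (newest first).
Code lines 9~10: skips the entry if it does not use the release function given as an argument.
Code lines 11~12: if a match function was given, skips the entry when its data does not match.
Code line 13: returns the device resource.
Code line 16: if the loop finishes without finding a suitable device resource, returns NULL.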
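The "latest match wins" lookup can be tried out on a plain array. This is a sketch with hypothetical names (struct entry, find_latest(), rel_a(), rel_b(), match_ptr()); the array index order stands in for the order of registration on the devres list, and the device argument of the kernel's match callback is dropped.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*release_fn)(void *data);
typedef int (*match_fn)(void *data, void *match_data);

struct entry {
        release_fn release;
        void *data;
};

/* Scans newest to oldest, the way find_dr() walks the list in reverse. */
static struct entry *find_latest(struct entry *list, size_t n,
                                 release_fn release,
                                 match_fn match, void *match_data)
{
        for (size_t i = n; i-- > 0; ) {
                if (list[i].release != release)
                        continue;   /* different kind of resource */
                if (match && !match(list[i].data, match_data))
                        continue;   /* same kind, but not the one asked for */
                return &list[i];
        }
        return NULL;
}

/* Hypothetical callbacks used only to tell entries apart. */
static int rel_a_calls, rel_b_calls;
static void rel_a(void *data) { (void)data; rel_a_calls++; }
static void rel_b(void *data) { (void)data; rel_b_calls++; }

static int match_ptr(void *data, void *match_data)
{
        return data == match_data;
}
```

The release function acts as a type tag: two resources are of the same kind exactly when they share a release function, and the optional match callback then narrows the choice to one instance.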

devres_get()

drivers/base/devres.c

/**
 * devres_get - Find devres, if non-existent, add one atomically
 * @dev: Device to lookup or add devres for
 * @new_res: Pointer to new initialized devres to add if not found
 * @match: Match function (optional)
 * @match_data: Data for the match function
 *
 * Find the latest devres of @dev which has the same release function
 * as @new_res and for which @match return 1.  If found, @new_res is
 * freed; otherwise, @new_res is added atomically.
 *
 * RETURNS:
 * Pointer to found or added devres.
 */
void * devres_get(struct device *dev, void *new_res,
                  dr_match_t match, void *match_data)
{
        struct devres *new_dr = container_of(new_res, struct devres, data);
        struct devres *dr;
        unsigned long flags;

        spin_lock_irqsave(&dev->devres_lock, flags);
        dr = find_dr(dev, new_dr->node.release, match, match_data);
        if (!dr) {
                add_dr(dev, &new_dr->node);
                dr = new_dr;
                new_dr = NULL;
        }
        spin_unlock_irqrestore(&dev->devres_lock, flags);
        devres_free(new_dr);

        return dr->data;
}
EXPORT_SYMBOL_GPL(devres_get);

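Searches the device resources registered on the device for one that uses the same release function and whose device resource data matches, and returns its device resource data. If none is found, the requested new device resource is added instead.

Code line 18: obtains the device resource that contains the requested new device resource data and assigns it to new_dr.
Code line 23: searches the device resources registered on the device for one that uses the same release function and whose device resource data matches.
Code lines 24~28: if none is found, adds the new device resource to the device and assigns it to dr, which is used for the return value.
Code line 30: if an existing one was found, frees the new device resource.
Code line 32: returns the device resource data that was found, or that was added because none existed.

Removing a device resource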

devres_remove()

drivers/base/devres.c

/**
 * devres_remove - Find a device resource and remove it
 * @dev: Device to find resource from
 * @release: Look for resources associated with this release function
 * @match: Match function (optional)
 * @match_data: Data for the match function
 *
 * Find the latest devres of @dev associated with @release and for
 * which @match returns 1.  If @match is NULL, it's considered to
 * match all.  If found, the resource is removed atomically and
 * returned.
 *
 * RETURNS:
 * Pointer to removed devres on success, NULL if not found.
 */
void * devres_remove(struct device *dev, dr_release_t release,
                     dr_match_t match, void *match_data)
{
        struct devres *dr;
        unsigned long flags;

        spin_lock_irqsave(&dev->devres_lock, flags);
        dr = find_dr(dev, release, match, match_data);
        if (dr) {
                list_del_init(&dr->node.entry);
                devres_log(dev, &dr->node, "REM");
        }
        spin_unlock_irqrestore(&dev->devres_lock, flags);

        if (dr)
                return dr->data;
        return NULL;
}
EXPORT_SYMBOL_GPL(devres_remove);

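Searches the device resources registered on the device for one that uses the same release function and whose device resource data matches, unregisters it from the device, and returns its data. The resource itself is not freed. Returns NULL if none is found.

Code line 23: searches the device resources registered on the device for one that uses the same release function and whose device resource data matches.
Code lines 24~31: if one is found, unregisters it from the device and returns its device resource data.
Code line 32: if none is found, returns NULL.

Releasing a device resource (including the device resource data)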

devres_release()

drivers/base/devres.c

/**
 * devres_release - Find a device resource and destroy it, calling release
 * @dev: Device to find resource from
 * @release: Look for resources associated with this release function
 * @match: Match function (optional)
 * @match_data: Data for the match function
 *
 * Find the latest devres of @dev associated with @release and for
 * which @match returns 1.  If @match is NULL, it's considered to
 * match all.  If found, the resource is removed atomically, the
 * release function called and the resource freed.
 *
 * RETURNS:
 * 0 if devres is found and freed, -ENOENT if not found.
 */
int devres_release(struct device *dev, dr_release_t release,
                   dr_match_t match, void *match_data)
{
        void *res;

        res = devres_remove(dev, release, match, match_data);
        if (unlikely(!res))
                return -ENOENT;

        (*release)(dev, res);
        devres_free(res);
        return 0;
}
EXPORT_SYMBOL_GPL(devres_release);

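Searches the device resources registered on the device for one that uses the same release function and whose device resource data matches, unregisters it from the device, releases and frees it, and returns 0 for success. If none is found, returns the error code -ENOENT.

Code line 21: searches the device resources registered on the device for one that uses the same release function and whose device resource data matches, and unregisters it.
Code lines 22~23: in the unlikely case that none is found, returns the error code -ENOENT.
Code line 25: calls the release function given as an argument to release the device resource data.
Code lines 26~27: frees the device resource and returns 0 for success.

Releasing all device resources attached to a device (including the device resource data)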

devres_release_all()

drivers/base/devres.c

/**
 * devres_release_all - Release all managed resources
 * @dev: Device to release resources for
 *
 * Release all resources associated with @dev.  This function is
 * called on driver detach.
 */
int devres_release_all(struct device *dev)
{
        unsigned long flags;

        /* Looks like an uninitialized device structure */
        if (WARN_ON(dev->devres_head.next == NULL))
                return -ENODEV;
        spin_lock_irqsave(&dev->devres_lock, flags);
        return release_nodes(dev, dev->devres_head.next, &dev->devres_head,
                             flags);
}

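Releases all device resources registered on the device. This function is called when the device driver is detached.

Code lines 13~14: if the device's devres list head looks uninitialized (its next pointer is NULL), prints a warning and returns the error code -ENODEV.
Code line 16: releases all device resources registered on the device. release_nodes() also drops the devres_lock taken on the previous line.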

release_nodes()

drivers/base/devres.c

static int release_nodes(struct device *dev, struct list_head *first,
                         struct list_head *end, unsigned long flags)
        __releases(&dev->devres_lock)
{
        LIST_HEAD(todo);
        int cnt;
        struct devres *dr, *tmp;

        cnt = remove_nodes(dev, first, end, &todo);

        spin_unlock_irqrestore(&dev->devres_lock, flags);

        /* Release.  Note that both devres and devres_group are
         * handled as devres in the following loop.  This is safe.
         */
        list_for_each_entry_safe_reverse(dr, tmp, &todo, node.entry) {
                devres_log(dev, &dr->node, "REL");
                dr->node.release(dev, dr->data);
                kfree(dr);
        }

        return cnt;
}

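Releases every device resource registered on the device in the range from the first entry up to, but not including, the end entry. Returns the number of device resources removed.

Code line 5: prepares the todo list.
Code line 9: unlinks the device resources in the range from the first entry up to the end entry from the device and moves them to the todo list.
Code lines 16~20: walks the device resources on the todo list in reverse order, releasing the device resource data through each release function and then freeing the device resource itself.
Code line 22: returns the number of device resources removed.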

remove_nodes()

drivers/base/devres.c – 1/2

static int remove_nodes(struct device *dev,
                        struct list_head *first, struct list_head *end,
                        struct list_head *todo)
{
        int cnt = 0, nr_groups = 0;
        struct list_head *cur;

        /* First pass - move normal devres entries to @todo and clear
         * devres_group colors.
         */ 
        cur = first;
        while (cur != end) {
                struct devres_node *node;
                struct devres_group *grp;

                node = list_entry(cur, struct devres_node, entry);
                cur = cur->next;

                grp = node_to_group(node);
                if (grp) {
                        /* clear color of group markers in the first pass */
                        grp->color = 0;
                        nr_groups++;
                } else {
                        /* regular devres entry */
                        if (&node->entry == first)
                                first = first->next; 
                        list_move_tail(&node->entry, todo);
                        cnt++;
                }
        }

        if (!nr_groups)
                return cnt;

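Code lines 11~17: walks the entries from first up to, but not including, end and obtains the node of each.
Code lines 19~23: obtains the group only when the node is a group marker node.
- A node whose release function is group_open_release() or group_close_release() is a group marker node.
- A group node has no device resource data, and its release function is empty, so nothing happens even when it is called.
Code lines 20~23: for a group marker node, resets the group's color to 0 and increments the nr_groups counter.
- nr_groups: number of group nodes found
Code lines 24~30: for a regular device resource node, moves the node to the todo list and increments cnt. If the node that first points to is a regular node, first is advanced to the next node.
- cnt: number of regular (devres) nodes found
Code lines 33~34: if no group node was found, returns the number of regular (devres) nodes.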

drivers/base/devres.c – 2/2

        /* Second pass - Scan groups and color them.  A group gets
         * color value of two iff the group is wholly contained in
         * [cur, end).  That is, for a closed group, both opening and
         * closing markers should be in the range, while just the
         * opening marker is enough for an open group.
         */ 
        cur = first;
        while (cur != end) {
                struct devres_node *node;
                struct devres_group *grp;

                node = list_entry(cur, struct devres_node, entry);
                cur = cur->next;

                grp = node_to_group(node);
                BUG_ON(!grp || list_empty(&grp->node[0].entry));

                grp->color++;
                if (list_empty(&grp->node[1].entry))
                        grp->color++;

                BUG_ON(grp->color <= 0 || grp->color > 2);
                if (grp->color == 2) {
                        /* No need to update cur or end.  The removed
                         * nodes are always before both.
                         */
                        list_move_tail(&grp->node[0].entry, todo);
                        list_del_init(&grp->node[1].entry);
                }
        }

        return cnt;
}

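Code lines 7~12: first now points at the first group node. Walks from the first node up to, but not including, the end node. Since the regular nodes have already been moved away, only group marker nodes are visited.
Code lines 15~18: increments the group's color by 1 for the marker being visited. (color=1)
Code lines 19~20: if the group has not been closed yet, increments the group's color by 1 more. (color=2)
Code line 22: triggers a kernel BUG if color is neither 1 nor 2.
- Groups may be nested, but each individual group must be opened and closed only once.
Code lines 23~29: when color reaches 2, that is, when the group lies wholly inside the range (an open group, or a closed group whose open and close markers were both visited), moves the open node to todo and simply unlinks the close node.
Code line 32: returns the number of regular (devres) nodes.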
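The coloring rule of the second pass can be checked with a small model. This is a sketch under assumptions: struct group_model and color_groups() are hypothetical, the markers array lists the group markers that fall inside [first, end) in list order, and a released flag replaces the list manipulation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct group_model {
        int color;
        bool closed;    /* true if the group's close marker is on the list */
        bool released;  /* set when the group is wholly inside the range   */
};

/* markers[i] is the group whose open or close marker is visited at step i. */
static void color_groups(struct group_model **markers, size_t n)
{
        /* First pass: clear the colors of all groups seen in the range. */
        for (size_t i = 0; i < n; i++)
                markers[i]->color = 0;

        /* Second pass: one point per visited marker, plus one if still open. */
        for (size_t i = 0; i < n; i++) {
                struct group_model *g = markers[i];

                g->color++;
                if (!g->closed)
                        g->color++;
                assert(g->color > 0 && g->color <= 2);
                if (g->color == 2)
                        g->released = true;
        }
}
```

A closed group that only has its close marker inside the range ends with color 1 and is left alone, which is how a partially overlapping group survives a release of the range.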

node_to_group()

drivers/base/devres.c

static struct devres_group * node_to_group(struct devres_node *node)
{
        if (node->release == &group_open_release)
                return container_of(node, struct devres_group, node[0]);
        if (node->release == &group_close_release)
                return container_of(node, struct devres_group, node[1]);
        return NULL;
}

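If the node is a marker that belongs to a device resource group, returns that device resource group; otherwise returns NULL.

A device resource group has two nodes, as follows.
- a node for the open marker
- a node for the close marker

Device resource group APIs

Handling a group's open and close marks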

devres_open_group()

drivers/base/devres.c

/**
 * devres_open_group - Open a new devres group
 * @dev: Device to open devres group for
 * @id: Separator ID
 * @gfp: Allocation flags
 *
 * Open a new devres group for @dev with @id.  For @id, using a
 * pointer to an object which won't be used for another group is
 * recommended.  If @id is NULL, address-wise unique ID is created.
 *
 * RETURNS:
 * ID of the new group, NULL on failure.
 */
void * devres_open_group(struct device *dev, void *id, gfp_t gfp)
{
        struct devres_group *grp;
        unsigned long flags;

        grp = kmalloc(sizeof(*grp), gfp);
        if (unlikely(!grp))
                return NULL;

        grp->node[0].release = &group_open_release;
        grp->node[1].release = &group_close_release;
        INIT_LIST_HEAD(&grp->node[0].entry);
        INIT_LIST_HEAD(&grp->node[1].entry);
        set_node_dbginfo(&grp->node[0], "grp<", 0);
        set_node_dbginfo(&grp->node[1], "grp>", 0);
        grp->id = grp;
        if (id)
                grp->id = id;

        spin_lock_irqsave(&dev->devres_lock, flags);
        add_dr(dev, &grp->node[0]);
        spin_unlock_irqrestore(&dev->devres_lock, flags);
        return grp->id;
}
EXPORT_SYMBOL_GPL(devres_open_group);

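Opens a device resource group on the device. Returns the group's id, or NULL if the allocation fails.

Code lines 19~21: allocates a device resource group, which consists of two nodes.
Code lines 23~28: initializes the device resource group.
- The first node is initialized as the open marker and the second node as the close marker.
- The release function of the first node points to the empty function group_open_release(), which makes the open marker identifiable.
- The release function of the second node points to the empty function group_close_release(), which makes the close marker identifiable.
Code lines 29~31: if an id was given as an argument, assigns it to the group; if id is NULL, the id points to the group itself.
Code line 34: adds only the first node, the open marker, to the device as a device resource.
Code line 36: returns the group's id.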

devres_close_group()

drivers/base/devres.c

/**
 * devres_close_group - Close a devres group
 * @dev: Device to close devres group for
 * @id: ID of target group, can be NULL
 *
 * Close the group identified by @id.  If @id is NULL, the latest open
 * group is selected.
 */
void devres_close_group(struct device *dev, void *id)
{
        struct devres_group *grp;
        unsigned long flags;

        spin_lock_irqsave(&dev->devres_lock, flags);

        grp = find_group(dev, id);
        if (grp)
                add_dr(dev, &grp->node[1]);
        else
                WARN_ON(1);

        spin_unlock_irqrestore(&dev->devres_lock, flags);
}
EXPORT_SYMBOL_GPL(devres_close_group);

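Closes a device resource group on the device.

Code line 16: finds the group that corresponds to id.
Code lines 17~18: if the group exists, adds the group's second node to the device as the close marker.
Code lines 19~20: if the group does not exist, issues a warning.

Removing the requested group (excluding the group's device resources)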

devres_remove_group()

drivers/base/devres.c

/**
 * devres_remove_group - Remove a devres group
 * @dev: Device to remove group for
 * @id: ID of target group, can be NULL
 *
 * Remove the group identified by @id.  If @id is NULL, the latest
 * open group is selected.  Note that removing a group doesn't affect
 * any other resources.
 */
void devres_remove_group(struct device *dev, void *id)
{
        struct devres_group *grp;
        unsigned long flags;

        spin_lock_irqsave(&dev->devres_lock, flags);

        grp = find_group(dev, id);
        if (grp) {
                list_del_init(&grp->node[0].entry);
                list_del_init(&grp->node[1].entry);
                devres_log(dev, &grp->node[0], "REM");
        } else
                WARN_ON(1);

        spin_unlock_irqrestore(&dev->devres_lock, flags);

        kfree(grp);
}
EXPORT_SYMBOL_GPL(devres_remove_group);

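Among the device resources registered on the device, frees only the device resource group (open marker and close marker) that corresponds to the requested id. The resources inside the group stay registered.

Code line 17: finds the group that corresponds to id.
Code lines 18~21: if the group exists, unlinks the group's first node (open marker) and second node (close marker).
Code line 27: frees the group.

Releasing all device resources within the requested group's range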

devres_release_group()

drivers/base/devres.c

/**
 * devres_release_group - Release resources in a devres group
 * @dev: Device to release group for
 * @id: ID of target group, can be NULL
 *
 * Release all resources in the group identified by @id.  If @id is
 * NULL, the latest open group is selected.  The selected group and
 * groups properly nested inside the selected group are removed.
 *
 * RETURNS:
 * The number of released non-group resources.
 */
int devres_release_group(struct device *dev, void *id)
{
        struct devres_group *grp;
        unsigned long flags;
        int cnt = 0;

        spin_lock_irqsave(&dev->devres_lock, flags);

        grp = find_group(dev, id);
        if (grp) {
                struct list_head *first = &grp->node[0].entry;
                struct list_head *end = &dev->devres_head;

                if (!list_empty(&grp->node[1].entry))
                        end = grp->node[1].entry.next;

                cnt = release_nodes(dev, first, end, flags);
        } else {
                WARN_ON(1);
                spin_unlock_irqrestore(&dev->devres_lock, flags);
        }

        return cnt;
}
EXPORT_SYMBOL_GPL(devres_release_group);

디바이스에 등록된 디바이스 리소스들 중 요청한 id에 해당하는 디바이스 리소스 그룹 범위의 모든 디바이스 리소스를 할당 해제 시킨다. 할당 해제한 그룹이 아닌 디바이스 리소스 수를 반환한다.

코드 라인 21에서 id에 해당하는 그룹을 찾아온다.
코드 라인 22~23에서 그룹이 존재하면 그룹에서 open 그룹에 해당하는 첫 번째 노드를 first에 대입한다. 마지막까지 처리하기 위해 리스트의 head를 end에 대입한다.
코드 라인 25~26에서 그룹에서 close 그룹이 존재하는 경우 그룹의 끝까지만 처리하기 위해 end를 close 그룹 다음 노드로 지정한다.
코드 라인 28에서 first ~ end 직전 범위의 모든 그룹을 포함하는 디바이스 리소스를 할당 해제 시킨다.
코드 라인 27에서 할당 해제한 그룹이 아닌 디바이스 리소스 수를 반환한다.

Managed IRQ Resource APIs

Requesting an IRQ for device resource management

devm_request_irq()

include/linux/interrupt.h

static inline int __must_check
devm_request_irq(struct device *dev, unsigned int irq, irq_handler_t handler,
                 unsigned long irqflags, const char *devname, void *dev_id)
{
        return devm_request_threaded_irq(dev, irq, handler, NULL, irqflags,
                                         devname, dev_id);
}

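Allocates an irq line for a managed device. It is a thin wrapper that calls devm_request_threaded_irq() with thread_fn set to NULL. Returns 0 on success and an error code on failure.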

devm_request_threaded_irq()

kernel/irq/devres.c

/**
 *      devm_request_threaded_irq - allocate an interrupt line for a managed device
 *      @dev: device to request interrupt for
 *      @irq: Interrupt line to allocate
 *      @handler: Function to be called when the IRQ occurs
 *      @thread_fn: function to be called in a threaded interrupt context. NULL
 *                  for devices which handle everything in @handler
 *      @irqflags: Interrupt type flags
 *      @devname: An ascii name for the claiming device
 *      @dev_id: A cookie passed back to the handler function
 *
 *      Except for the extra @dev argument, this function takes the
 *      same arguments and performs the same function as
 *      request_threaded_irq().  IRQs requested with this function will be
 *      automatically freed on driver detach.
 *
 *      If an IRQ allocated with this function needs to be freed
 *      separately, devm_free_irq() must be used.
 */
int devm_request_threaded_irq(struct device *dev, unsigned int irq,
                              irq_handler_t handler, irq_handler_t thread_fn,
                              unsigned long irqflags, const char *devname,
                              void *dev_id)
{
        struct irq_devres *dr;
        int rc;

        dr = devres_alloc(devm_irq_release, sizeof(struct irq_devres),
                          GFP_KERNEL);
        if (!dr)
                return -ENOMEM;

        rc = request_threaded_irq(irq, handler, thread_fn, irqflags, devname,
                                  dev_id);
        if (rc) {
                devres_free(dr);
                return rc;
        }

        dr->irq = irq;
        dr->dev_id = dev_id;
        devres_add(dev, dr);

        return 0;
}
EXPORT_SYMBOL(devm_request_threaded_irq);

Allocates a threaded irq line for a managed device. Returns 0 on success and an error code on failure.

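Code lines 28~31 allocate the irq device resource.
- devm_irq_release(), passed as the first (release) argument, is responsible for freeing the irq line.
Code lines 33~38 request the threaded irq line. If the request fails, the irq device resource is freed and the error is returned.
- If thread_fn is NULL, no irq thread is used.
Code lines 40~44: if the request succeeded, store the irq number and dev_id in the irq device resource and register it on the requested device.
- An irq device resource registered this way is freed together with all the other registered device resources when devres_release_all() is called as the device detaches.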
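
The allocate, request, register sequence above, with its rollback on failure, can be modelled in user space. This is a minimal sketch under stated assumptions rather than kernel code: `struct fake_dev`, `struct res`, `fake_request_irq()`, `managed_request_irq()` and `release_all()` are hypothetical stand-ins, and an 8-entry `irq_busy` table plays the role of the interrupt lines.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a device: a stack of managed records. */
struct res {
        struct res *next;
        void (*release)(struct res *r);
        unsigned int irq;
};
struct fake_dev { struct res *head; };

static int irq_busy[8];         /* 1 while the fake irq line is claimed */

/* Stand-in for request_threaded_irq(): fails if the line is taken. */
static int fake_request_irq(unsigned int irq)
{
        if (irq >= 8)
                return -EINVAL;
        if (irq_busy[irq])
                return -EBUSY;
        irq_busy[irq] = 1;
        return 0;
}

/* Release callback: plays the role of devm_irq_release(). */
static void fake_irq_release(struct res *r)
{
        irq_busy[r->irq] = 0;
}

/* Same three steps as the walkthrough: allocate, request, register. */
static int managed_request_irq(struct fake_dev *dev, unsigned int irq)
{
        struct res *r = calloc(1, sizeof(*r));
        int rc;

        if (!r)
                return -ENOMEM;

        rc = fake_request_irq(irq);
        if (rc) {
                free(r);        /* roll back the record; nothing was registered */
                return rc;
        }

        r->irq = irq;
        r->release = fake_irq_release;
        r->next = dev->head;    /* newest first, so teardown runs in reverse order */
        dev->head = r;
        return 0;
}

/* Detach: run every release callback and free the records; returns the count. */
static int release_all(struct fake_dev *dev)
{
        int n = 0;

        while (dev->head) {
                struct res *r = dev->head;

                dev->head = r->next;
                r->release(r);
                free(r);
                n++;
        }
        return n;
}
```

Because the record is allocated before the line is requested, a failed request needs nothing more than freeing the unused record. Once registered, the caller has no teardown to write: `release_all()` undoes every successful request.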

devm_request_any_context_irq()

kernel/irq/devres.c

/**
 *      devm_request_any_context_irq - allocate an interrupt line for a managed device
 *      @dev: device to request interrupt for
 *      @irq: Interrupt line to allocate
 *      @handler: Function to be called when the IRQ occurs
 *      @thread_fn: function to be called in a threaded interrupt context. NULL
 *                  for devices which handle everything in @handler
 *      @irqflags: Interrupt type flags
 *      @devname: An ascii name for the claiming device
 *      @dev_id: A cookie passed back to the handler function
 *
 *      Except for the extra @dev argument, this function takes the
 *      same arguments and performs the same function as
 *      request_any_context_irq().  IRQs requested with this function will be
 *      automatically freed on driver detach.
 *
 *      If an IRQ allocated with this function needs to be freed
 *      separately, devm_free_irq() must be used.
 */
int devm_request_any_context_irq(struct device *dev, unsigned int irq,
                              irq_handler_t handler, unsigned long irqflags,
                              const char *devname, void *dev_id)
{
        struct irq_devres *dr;
        int rc;

        dr = devres_alloc(devm_irq_release, sizeof(struct irq_devres),
                          GFP_KERNEL);
        if (!dr)
                return -ENOMEM;

        rc = request_any_context_irq(irq, handler, irqflags, devname, dev_id);
        if (rc) {
                devres_free(dr);
                return rc;
        }

        dr->irq = irq;
        dr->dev_id = dev_id;
        devres_add(dev, dr);

        return 0;
}
EXPORT_SYMBOL(devm_request_any_context_irq);

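Allocates an irq line for a managed device through request_any_context_irq(), which selects either hardirq or threaded handling for the handler. Returns 0 on success and an error code on failure.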

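Code lines 27~30 allocate the irq device resource.
- devm_irq_release(), passed as the first (release) argument, is responsible for freeing the irq line.
Code lines 32~36 request the irq line with request_any_context_irq(). If the request fails, the irq device resource is freed and the error is returned.
Code lines 38~42: if the request succeeded, store the irq number and dev_id in the irq device resource and register it on the requested device.
- An irq device resource registered this way is freed together with all the other registered device resources when devres_release_all() is called as the device detaches.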

devm_irq_release()

kernel/irq/devres.c

static void devm_irq_release(struct device *dev, void *res)
{
        struct irq_devres *this = res;

        free_irq(this->irq, this->dev_id);
}

Frees the irq line using the irq and dev_id stored in the given irq device resource.

Freeing an IRQ for device resource management

devm_free_irq()

kernel/irq/devres.c

/**
 *      devm_free_irq - free an interrupt
 *      @dev: device to free interrupt for
 *      @irq: Interrupt line to free
 *      @dev_id: Device identity to free
 *
 *      Except for the extra @dev argument, this function takes the
 *      same arguments and performs the same function as free_irq().
 *      This function instead of free_irq() should be used to manually
 *      free IRQs allocated with devm_request_irq().
 */
void devm_free_irq(struct device *dev, unsigned int irq, void *dev_id)
{
        struct irq_devres match_data = { irq, dev_id };

        WARN_ON(devres_destroy(dev, devm_irq_release, devm_irq_match,
                               &match_data));
        free_irq(irq, dev_id);
}
EXPORT_SYMBOL(devm_free_irq);

Finds the irq device resource on the requested device that matches the irq number and dev_id, frees it, and then frees the irq line.

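Code line 14 prepares the irq number and dev_id to match against.
Code lines 16~17 find the irq device resource on the requested device that matches the irq number and dev_id and destroy it. A warning is emitted if no matching resource is found.
Code line 18 frees the irq line. Because devres_destroy() does not invoke the release function, free_irq() is called explicitly here.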

devm_irq_match()

kernel/irq/devres.c

static int devm_irq_match(struct device *dev, void *res, void *data)
{
        struct irq_devres *this = res, *match = data;

        return this->irq == match->irq && this->dev_id == match->dev_id;
}

Returns whether both the irq number and dev_id are equal.
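
Why the match compares both fields is easiest to see with a shared interrupt, where two records carry the same irq number and differ only in dev_id. The following user-space sketch illustrates the find-and-destroy step. `struct irq_rec`, `rec_add()`, `rec_destroy()` and `rec_count()` are hypothetical names, and a plain singly linked list stands in for the device's resource list.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical record mirroring the two fields of struct irq_devres. */
struct irq_rec {
        struct irq_rec *next;
        unsigned int irq;
        void *dev_id;
};

/* Same comparison as devm_irq_match(): both fields must agree. */
static int rec_match(const struct irq_rec *r, unsigned int irq, void *dev_id)
{
        return r->irq == irq && r->dev_id == dev_id;
}

/* Register a record at the head of the list; 0 on success, -1 on failure. */
static int rec_add(struct irq_rec **head, unsigned int irq, void *dev_id)
{
        struct irq_rec *r = malloc(sizeof(*r));

        if (!r)
                return -1;
        r->irq = irq;
        r->dev_id = dev_id;
        r->next = *head;
        *head = r;
        return 0;
}

/* Unlink and free the first matching record without running any release
 * callback; returns 0 on success and -1 if nothing matched. */
static int rec_destroy(struct irq_rec **head, unsigned int irq, void *dev_id)
{
        struct irq_rec **pp;

        for (pp = head; *pp; pp = &(*pp)->next) {
                if (rec_match(*pp, irq, dev_id)) {
                        struct irq_rec *r = *pp;

                        *pp = r->next;
                        free(r);
                        return 0;
                }
        }
        return -1;
}

/* Number of records still registered. */
static int rec_count(const struct irq_rec *head)
{
        int n = 0;

        for (; head; head = head->next)
                n++;
        return n;
}
```

Destroying the record for one dev_id leaves the other sharer's record in place, and a second attempt with the same pair fails, which is the situation the WARN_ON() in devm_free_irq() reports.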

Structures

devres structure

drivers/base/devres.c

struct devres {
        struct devres_node              node;
        /* -- 3 pointers */
        unsigned long long              data[]; /* guarantee ull alignment */
};

node
- device resource node
data[]
- data belonging to the managed resource
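
Since data[] is a flexible array member placed right after the node, a pointer to the payload is enough to get back to the enclosing record, which is what the kernel does with container_of(). The user-space sketch below shows the same layout trick with offsetof(). `struct hdr`, `struct managed`, `managed_alloc()`, `to_managed()` and `managed_free()` are hypothetical names, and the header is reduced to a single release pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified header standing in for struct devres_node. */
struct hdr {
        void (*release)(void *res);
};

/* Header followed directly by the payload, as in struct devres. */
struct managed {
        struct hdr node;
        unsigned long long data[];      /* payload, unsigned long long aligned */
};

/* Allocate header plus payload in one zeroed block and return the payload. */
static void *managed_alloc(void (*release)(void *res), size_t size)
{
        struct managed *m = calloc(1, sizeof(*m) + size);

        if (!m)
                return NULL;
        m->node.release = release;
        return m->data;
}

/* Step back from the payload pointer to the enclosing record. */
static struct managed *to_managed(void *res)
{
        return (struct managed *)((char *)res - offsetof(struct managed, data));
}

/* Free a payload that was never registered anywhere. */
static void managed_free(void *res)
{
        if (res)
                free(to_managed(res));
}
```

The caller only ever sees the payload pointer. The bookkeeping header stays hidden in front of it and is recovered by pointer arithmetic when the resource is freed.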

devres_node structure

drivers/base/devres.c

struct devres_node {
        struct list_head                entry;
        dr_release_t                    release;
#ifdef CONFIG_DEBUG_DEVRES
        const char                      *name;
        size_t                          size;
#endif
};

entry
- node entry to be linked onto the device
release
- the function that frees the managed resource
*name
- device resource name, for debugging
size
- managed resource size, for debugging

devres_group structure

drivers/base/devres.c

struct devres_group {
        struct devres_node              node[2];
        void                            *id;
        int                             color;
        /* -- 8 pointers */
};

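node[2]
- the first is the device resource node for the open group
- the second is the device resource node for the close group
*id
- group identification id
color
- color value used internally while releasing; in normal operation it takes values in the range 0~2.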

irq_devres structure

kernel/irq/devres.c

/*
 * Device resource management aware IRQ request/free implementation.
 */
struct irq_devres {
        unsigned int irq;
        void *dev_id;
};

References

Devres – Managed Device Resource | Kernel.org
Device resource management | LWN.net
The managed resource API | LWN.net
The Right Way: Managed Resource Allocation in Linux Device Drivers | Eli Billauer – download pdf