문c 블로그

IPC between Kernel & User space

2018-04-102018-04-11 문영일 Leave a comment

커널 프로그래밍을 위해 커널에서 직접 제공하는 Kernel API들이 있음을 잘 알고 있을 것이다. 그 외에 유저 영역에서 커널과 인터페이스하기 위한 여러가지 수단들을 알아본다.

IPC between Kernel & User space

Posix Syscall
- Software 인터럽트를 이용한 System Call
- Posix library를 통해 호출
Kernel-provided User Helper
- 고속 처리를 목적으로 syscall을 통하지 않고 커널 코드에 직접 호출하는 특정 API들 (특정 아키텍처에서만 지원)
Usermode Helper
- 커널에서 직접 유저 코드를 로드하여 실행시킬 수 있다.
procfs 인터페이스
- 리눅스 커널에 포함된 가상 /proc 디렉토리를 통해 커널 정보를 간단히 조회할 수 있다.
sysfs 인터페이스
- 리눅스 커널에 포함된 가상 /sys 디렉토리를 통해 다양한 장치 드라이버 및 커널 정보에 접근할 수 있다. (커널 디버깅 및 트레이싱 인터페이스 포함)
cgroup 인터페이스
- CPU 시간, 시스템 메모리, 네트워크 대역폭과 커널 자원들을 사용자 정의 작업 그룹간에 할당할 수 있다.
dev 인터페이스
- 리눅스 커널에서 사용하는 장치 디바이스에 접근할 수 있다.
socket 인터페이스
- 네트워크 통신을 위해 제공하는 소켓 프로그래밍 인터페이스이다.
netlink 소켓 인터페이스
- 표준 소켓 인터페이스를 사용하여 커널과의 양방향 통신을 제공하는 인터페이스이다.

참고

Proc filesystem – 시스템 정보 수집 | 윤상배
Communicating between the kernel and user-space in Linux using Netlink Sockets | Pablo Neira Ayuso – 다운로드 pdf
Sockets in the kernel | Rami Rosen & Haifux – 다운로드 pdf
The Sysfs Virtual Filesystem Exploring the Linux Device Model | Bill Gatliff – 다운로드 pdf
Brief sysfs Tutorial | Ben Marks – 다운로드 pdf
Invoking user-space applications from the kernel | M. Tim Jones
Kernel-provided User Helpers | 문c
FTrace | 문c
Kernel Debugging | 문c
Kernel Tracing | 문c
TRACE_EVENT | 문c
Kprobe | 문c
Control Groups | 문c
Control Group for Memory | 문c

do_fork()

2018-02-012023-04-01 문영일 9 Comments

프로세스 생성

리눅스 유저 application에서 fork() 함수를 사용할 때 공유 라이브러리인 glibc를 통해 더 이상 fork syscall을 호출하지 않고 clone syscall을 호출한다. 때문에 clone과 fork는 동일하게 사용한다. 리눅스 커널은 backward 호환을 이유로 fork에 대한 syscall을 열어둔 상태이긴 하지만 거의 사용되지 않는다고 보면된다. 그 동안 fork와 vfork를 많이 비교하는 글들이 있지만 리눅스 커널 버전이 향상되면서 논쟁이 되었던 글들이 조금씩 핀트를 벗어나고 있다. 커널이 v4.x에 이르렀으므로 이제 조금 정리를 해보고자 한다.

최초 리눅스 구현 시 vfork가 fork에 비해 매우 light한 구동을 보여줬다. vfork로 생성되는 자식 프로세스는 부모 프로세스와 파이프 등을 사용하여 교류를 할 수 없고 단지 부모 프로세스가 자식 프로세스에게 argument만 전달하고 반대로 종료(exit) 결과만 부모 프로세스에 알려줄 수 있다. 이제는 과거지만 그러한 점 때문에 심플하게 프로세스를 구동하고자 할 때에는 vfork를 선호하였었다. 그러나 추후 리눅스가 fork에 COW(Copy On Write) 기능을 사용하면서 fork 역시 light한 동작을 갖게되었다. 이로인해 vfork를 사용해야 할 커다란 이유가 없어졌고, vfork의 사용은 추천되지 않게되었다. 최종적으로 vfork가 fork(clone)와 매우 유사하게 되었는데, 정확히 vfork와 fork(clone)의 차이 점을 구별해 낼 수 없다면 그냥 fork(clone)를 사용하여야 한다.

fork vs vfork

fork와 달리 vfork는부모 프로세스와 파이프를 통한 교류를 할 수 없다.
vfork는 자식 태스크를 생성시 부모 프로세스를 잠시 블럭한다.
vfork는 스레드와 같은 동작을 하도록 CLONE_VM을 사용하여 메모리의 사용량을 대폭 줄였다.
- 부모 프로세스가 사용하는 mm 디스크립터 생성하지 않고 공유한다.
- 부모 프로세스가 사용하는 페이지 테이블(pgd)을 공유한다.
  - fork는 새 페이지 테이블(pgd)을 사용하여 COW(Copy On Write) 동작으로 필요 시 마다 메모리를 할당하여 사용한다.
- 부모 프로세스가 사용하는 vm 영역을 복사하지 않고 공유한다.
- 예) 부모 프로세스가 20M의 메모리를 사용할 때 자식 프로세스가 사용하는 메모리 비교
  - fork: 약 12 ~ 21M
    - 처음 fork를 하였을 때에는 COW 동작에 메모리를 복사하지 않지만 write 동작이 가해지면 메모리를 할당하여 사용한다.
    - 부모 태스크가 사용했었던 vm을 그대로 상속하기 위해 vma 및 페이지 테이블을 새롭게 복사하여 사용한다.
  - vfork: 약 1.5M
    - application에 대해서는 동일한 가상 주소 메모리를 공유하여 사용하므로 자식 프로세스 관리에 필요한 메모리만 조금 추가된다.
참고
- fork() gets slower as parent process uses more memory | famzah‘s blog
- A much faster popen() and system() implementation for Linux | famzah‘s blog

스레드 생성

프로세스보다 가벼운 스레드가 필요한 경우에는 커널 v2.6부터 구현되어 현재까지도 사용되는 NPTL 라이브러리(-lpthread)를 사용한 POSIX thread를 사용한다. pthread_create() 함수를 사용 시 CLONE_VM을 추가한 clone syscall을 사용한다. 이를 통해 light한 메모리 사용과 빠른 스레드 생성을 할 수있다.

참고:
- NPTL: The New Implementation of Threads for Linux | Dr.Dobb’s

리눅스 태스크 생성

다음 그림은 태스크가 fork, vfork 및 pthread_create 될때의 모습을 보여준다. 추가로 exec에 의해 다른 태스크로 전환되는 과정도 보여준다.

VM 공유 관련
- fork의 경우 부모 프로세스의 heap 및 user stack을 포함하는 메모리를 읽을 수 있고, 변경 시 CoW(Copy on Write)에 의해 별도로 복제되어 사용된다.
- 나머지의 경우 부모 프로세스(스레드)와 VM(가상 공간)을 공유하므로 읽기 뿐만 아니라 수정도 가능하다.
vfork 동작 시 부모 process는 잠시 멈추고, 자식 process가 종료되어야 잠시 멈춘 지점부터 동작한다.
exec 동작 시 A 프로세스가 같은 PID를 사용하면서 B 프로세스로 전환되며 A 프로세스로 복귀는 불가능하다. 이 때 VM은 공유하지 않는다.

do_fork()

다음 그림은 유저 모드에서 syscall을 통해 프로세스나 스레드를 생성할 때 대응하는 커널 함수와 CLONE 플래그들을 보여준다.

다음 그림은 태스크를 생성 시 CLONE 플래그에 따라 각 루틴이 하는 일들을 보여준다.

CLONE 플래그에 해당하는 항목이 설정된 경우 함수에서의 동작이 노란 색 항목과 같이 수행된다.

COW(Copy On Write)

fork 및 clone 프로세스의 경우 COW(Copy On Write)을 지원하기 위해 dup_mmap() 함수에서 다음과 같은 함수들을 호출하여 VMA들과 페이지 테이블들을 복사한다.

vm_area_dup()
vma_dup_policy()
anon_vma_fork()
copy_page_range()

kernel/fork.c -1/2-

/*
 *  Ok, this is the main fork-routine.
 *
 * It copies the process, and if successful kick-starts
 * it and waits for it to finish using the VM if required.
 */
long do_fork(unsigned long clone_flags,
              unsigned long stack_start,
              unsigned long stack_size,
              int __user *parent_tidptr,
              int __user *child_tidptr)
{
        struct task_struct *p;
        int trace = 0;
        long nr;

        /*
         * Determine whether and which event to report to ptracer.  When
         * called from kernel_thread or CLONE_UNTRACED is explicitly
         * requested, no event is reported; otherwise, report if the event
         * for the type of forking is enabled.
         */
        if (!(clone_flags & CLONE_UNTRACED)) {
                if (clone_flags & CLONE_VFORK)
                        trace = PTRACE_EVENT_VFORK; 
                else if ((clone_flags & CSIGNAL) != SIGCHLD)
                        trace = PTRACE_EVENT_CLONE;
                else
                        trace = PTRACE_EVENT_FORK;

                if (likely(!ptrace_event_enabled(current, trace)))
                        trace = 0;
        }

        p = copy_process(clone_flags, stack_start, stack_size,
                         child_tidptr, NULL, trace);

부모 태스크를 복사(fork)한 새 태스크를 스케줄러에서 깨운다. CLONE 플래그 요청에 따라 각각의 리소스에 대해 부모 리소스를 같이 공유하여 사용하거나 별도로 새로 생성된 태스크에 독립적으로 리소스를 사용하도록 배정한다.

코드 라인 23~33에서 CLONE_UNTRACED 커널 옵션을 지정하여 요청한 경우 ptreacer로 이벤트를 리포팅하지 않도록 제한시킨다. 또한
- kernel_thread() 함수를 통해 만든 커널 스레드는 항상 CLONE_UNTRACED 커널 옵션을 사용한다.
코드 라인 35~36 copy_process() 함수를 호출하여 각 플래그 요청을 기반으로 자원의 공유 여부를 결정하고 부모 태스크를 기반으로 새 태스크를 상속받아 만든다. (clone)

kernel/fork.c -2/2-

        /*
         * Do this prior waking up the new thread - the thread pointer
         * might get invalid after that point, if the thread exits quickly.
         */
        if (!IS_ERR(p)) {
                struct completion vfork;
                struct pid *pid;

                trace_sched_process_fork(current, p);

                pid = get_task_pid(p, PIDTYPE_PID);
                nr = pid_vnr(pid);

                if (clone_flags & CLONE_PARENT_SETTID)
                        put_user(nr, parent_tidptr);

                if (clone_flags & CLONE_VFORK) {
                        p->vfork_done = &vfork;
                        init_completion(&vfork);
                        get_task_struct(p);
                }

                wake_up_new_task(p);

                /* forking complete and child started to run, tell ptracer */
                if (unlikely(trace))
                        ptrace_event_pid(trace, pid);

                if (clone_flags & CLONE_VFORK) {
                        if (!wait_for_vfork_done(p, &vfork))
                                ptrace_event_pid(PTRACE_EVENT_VFORK_DONE, pid);
                }

                put_pid(pid);
        } else {
                nr = PTR_ERR(p);
        }
        return nr;
}

코드 라인 11~12에서 글로벌 pid 값을 가져온다. 참조 카운터를 1 증가시킨다. 태스크가 소속된 namespace에서 upid 번호를 알아온다.
코드 라인 14~15에서 유저 모드에서 clone syscall을 통해 즉, sys_clone() 함수를 통해 호출된 경우 부모 태스크의 tid에 child 태스크의 upid 번호를 설정한다.
코드 라인 17~21에서 CLONE_VFORK 플래그를 사용하여 요청한 경우 fork 완료 대기를 위해 준비한다.
코드 라인 23에서 생성된 새 태스크를 깨워 동작시킨다.
코드 라인 29~32에서 CLONE_VFORK 플래그를 사용하여 요청한 경우 태스크가 종료될 때 까지 호출한 부모 태스크는 대기한다.
- 리눅스에서 fork 대신 vfork를 사용하면 부모와 자식간의 race가 발생하지 않도록 부모 태스크는 여기서 잠시 block되어 있다가 자식 task가 fork된 후에야 wakeup한다.
코드 라인 34에서 pid의 참조 카운터를 1 감소시킨다.

kernel/fork.c -1/8-

/*
 * This creates a new process as a copy of the old one,
 * but does not actually start it yet.
 *
 * It copies the registers, and all the appropriate
 * parts of the process environment (as per the clone
 * flags). The actual kick-off is left to the caller.
 */
static struct task_struct *copy_process(unsigned long clone_flags,
                                        unsigned long stack_start,
                                        unsigned long stack_size,
                                        int __user *child_tidptr,
                                        struct pid *pid,
                                        int trace)
{
        int retval;
        struct task_struct *p;

        if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
                return ERR_PTR(-EINVAL);

        if ((clone_flags & (CLONE_NEWUSER|CLONE_FS)) == (CLONE_NEWUSER|CLONE_FS))
                return ERR_PTR(-EINVAL);

        /*
         * Thread groups must share signals as well, and detached threads
         * can only be started up within the thread group.
         */
        if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND))
                return ERR_PTR(-EINVAL);

        /*
         * Shared signal handlers imply shared VM. By way of the above,
         * thread groups also imply shared VM. Blocking this case allows
         * for various simplifications in other code.
         */
        if ((clone_flags & CLONE_SIGHAND) && !(clone_flags & CLONE_VM))
                return ERR_PTR(-EINVAL);

        /*
         * Siblings of global init remain as zombies on exit since they are
         * not reaped by their parent (swapper). To solve this and to avoid
         * multi-rooted process trees, prevent global and container-inits
         * from creating siblings.
         */
        if ((clone_flags & CLONE_PARENT) &&
                                current->signal->flags & SIGNAL_UNKILLABLE)
                return ERR_PTR(-EINVAL);

        /*
         * If the new process will be in a different pid or user namespace
         * do not allow it to share a thread group or signal handlers or
         * parent with the forking task.
         */
        if (clone_flags & CLONE_SIGHAND) {
                if ((clone_flags & (CLONE_NEWUSER | CLONE_NEWPID)) ||
                    (task_active_pid_ns(current) !=
                                current->nsproxy->pid_ns_for_children))
                        return ERR_PTR(-EINVAL);
        }

기존 프로세스의 정보들을 사용하여 새로운 프로세스를 생성한다.

코드 라인 19~20에서 새로운 mount namespace로 태스크 생성을 요청했지만 파일 시스템 정보를 공유할 수 없으므로 실패로 함수를 빠져나간다.
코드 라인 22~23에서 새로운 user namespace로 태스크 생성을 요청했지만 파일 시스템 정보를 공유할 수 없으므로 실패로 함수를 빠져나간다.
코드 라인 29~30에서 thread 생성을 요청했지만 시그널 핸들러의 공유가 필요하다. 그렇지 않은 경우 실패로 함수를 빠져나간다.
- 스레드 그룹끼리는 시그널을 공유해야 한다.
코드 라인 37~38에서 시그널 핸들러를 공유할 때 같은 가상 주소(VM)를 사용해야한다. 그렇지 않은 경우 실패로 함수를 빠져나간다.
코드 라인 46~48에서 부모 태스크에 SIGNAL_UNKILLABLE 시그널 플래그가 있는 경우 부모 태스크 클론을 요청한 경우 실패로 함수를 빠져나간다.
코드 라인 55~60에서 새로운 user namespace 또는 pid namespace로 태스크 생성을 요청한 경우 시그널 핸들러를 공유할 수 없다. 이러한 경우 실패로 함수를 빠져나간다.

kernel/fork.c -2/8-

        retval = security_task_create(clone_flags);
        if (retval)
                goto fork_out;

        retval = -ENOMEM;
        p = dup_task_struct(current);
        if (!p)
                goto fork_out;

        ftrace_graph_init_task(p);

        rt_mutex_init_task(p);

#ifdef CONFIG_PROVE_LOCKING
        DEBUG_LOCKS_WARN_ON(!p->hardirqs_enabled);
        DEBUG_LOCKS_WARN_ON(!p->softirqs_enabled);
#endif
        retval = -EAGAIN;
        if (atomic_read(&p->real_cred->user->processes) >=
                        task_rlimit(p, RLIMIT_NPROC)) {
                if (p->real_cred->user != INIT_USER &&
                    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
                        goto bad_fork_free;
        }
        current->flags &= ~PF_NPROC_EXCEEDED;

        retval = copy_creds(p, clone_flags);
        if (retval < 0)
                goto bad_fork_free;

코드 라인 1~3에서 태스크 생성 이전에 태스크 생성 관련 시큐리티 후크 함수를 호출한다. 만일 실패(0이 아닐 때)시 fork_out 레이블로 이동한다.
- CONFIG_SECURITY 커널 옵션을 사용하지 않는 경우 항상 성공(0)이다.
- 참고: LSM(Linux Security Module) -1-
코드 라인 5~8에서 태스크를 생성한다.
- current task 디스크립터(task_struct 및 thread_info 포함)를 복제한다. 이 때 커널 스택도 생성된다.
코드 라인 10에서 ftrace 추적을 위한 초기화를 수행한다.
코드 라인 12에서 priority inheritency 처리 관련 초기화를 수행한다.
코드 라인 18~24에서 최대 프로세스 생성 제한치를 초과하고, 태스크 생성을 루트 유저가 하지 않았으면 bad_fork_free 레이블로 이동한다.
- 단 시스템 리소스와 시스템 관리자에 대한 capability가 설정된 경우에는 제한을 무시한다.
코드 라인 25~29에서 부모 프로세스의 플래그에서 최대 프로세스 생성 제한 플래그를 클리어하고 인증(credentials)을 복사한다.

kernel/fork.c -3/8-

        /*
         * If multiple threads are within copy_process(), then this check
         * triggers too late. This doesn't hurt, the check is only there
         * to stop root fork bombs.
         */
        retval = -EAGAIN;
        if (nr_threads >= max_threads)
                goto bad_fork_cleanup_count;

        if (!try_module_get(task_thread_info(p)->exec_domain->module))
                goto bad_fork_cleanup_count;

        delayacct_tsk_init(p);  /* Must remain after dup_task_struct() */
        p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER);
        p->flags |= PF_FORKNOEXEC;
        INIT_LIST_HEAD(&p->children);
        INIT_LIST_HEAD(&p->sibling);
        rcu_copy_process(p);
        p->vfork_done = NULL;
        spin_lock_init(&p->alloc_lock);

        init_sigpending(&p->pending);

        p->utime = p->stime = p->gtime = 0;
        p->utimescaled = p->stimescaled = 0;
#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
        p->prev_cputime.utime = p->prev_cputime.stime = 0;
#endif
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
        seqlock_init(&p->vtime_seqlock);
        p->vtime_snap = 0;
        p->vtime_snap_whence = VTIME_SLEEPING;
#endif

#if defined(SPLIT_RSS_COUNTING)
        memset(&p->rss_stat, 0, sizeof(p->rss_stat));
#endif

        p->default_timer_slack_ns = current->timer_slack_ns;

        task_io_accounting_init(&p->ioac);
        acct_clear_integrals(p);

        posix_cpu_timers_init(p);

        p->start_time = ktime_get_ns();
        p->real_start_time = ktime_get_boot_ns();
        p->io_context = NULL;
        p->audit_context = NULL;
        if (clone_flags & CLONE_THREAD)
                threadgroup_change_begin(current);
        cgroup_fork(p);

코드 라인 6~8에서 스레드 수가 fork_init()함수에서 초기화한 최대 스레드 수(max_threads) 이상인 경우 실패로 bad_fork_cleanup_count 레이블로 이동한다.
- max_threads 디폴트: mempages / (8 * 커널 스택 사이즈 / 페이지 사이즈)
코드 라인 10~11에서 현재 태스크의 실행 도메인이 디폴트 실행 도메인이 아니면서 모듈 정보가 없는 경우 bad_fork_cleanup_count 레이블로 이동한다.
코드 라인 13에서 delay accounting이 동작 시 초기화를 수행한다.
코드 라인 14~15에서 생성된 태스크의 수퍼 유저 권한 및 워커 스레드 플래그를 제거하고 태스크 생성 후 실행되지 않도록 PF_FORKNOEXEC 플래그를 제거한다.
코드 라인 16~17에서 생성된 태스크의 자식 및 형제 관련한 리스트를 초기화한다.
코드 라인 18에서 생성된 태스크의 rcu 관련 멤버의 초기화를 수행한다.
코드 라인 22에서 시그널 펜딩 리스트를 초기화한다.
코드 라인 24~25 생성된 프로세스의 각종 실행 타임을 0으로 초기화한다.
- utime: 유저 코드 실행 시간(jiffies)
- stime: 시스템 코드 실행 시간(jiffies)
코드 라인 26~33 virtual cpu에 대한 time accounting을 초기화한다.
코드 라인 35~37에서 rss 통계 카운터를 초기화한다.
- MM_FILEPAGES 카운터, MM_ANONPAGES 카운터, MM_SWAPENTS 카운터
코드 라인 39에서 절전을 위해 사용하는 타이머용 slack 나노초를 부모 값을 상속하여 사용한다.
코드 라인 41에서 태스크에 대한 io 통계 카운터를 초기화한다.
- CONFIG_TASK_XACCT 커널 옵션 사용시
  - 읽은 바이트 수
  - 기록한 바이트 수
  - 읽은 syscall 수
  - 기록한 syscall 수
- CONFIG_TASK_IO_ACCOUNTING 커널 옵션 사용 시
  - 디스크로 부터 읽은 바이트 수
  - 디스크에 기록한 비이트 수
  - 디스크에 기록 취소한 바이트 수
코드 라인 42에서 태스크에 대한 mm 관련 accounting 카운터 정보를 초기화한다.
코드 라인 44에서 posix cpu 타이머들을 초기화한다.
코드 라인 46~47에서 monotonic 시간(nsec)과 boot based 시간(nsec)을 구해와 대입한다.
코드 라인 48~49에서 io 및 audit 컨택스트를 초기화한다.
코드 라인 50~51에서 스레드 생성 요청 시 스레드 그룹 변경에 대한 락을 건다.
코드 라인 52에서 cgroup 관련 초기화를 수행한다.

kernel/fork.c -4/8-

#ifdef CONFIG_NUMA
        p->mempolicy = mpol_dup(p->mempolicy);
        if (IS_ERR(p->mempolicy)) {
                retval = PTR_ERR(p->mempolicy);
                p->mempolicy = NULL;
                goto bad_fork_cleanup_threadgroup_lock;
        }
#endif
#ifdef CONFIG_CPUSETS
        p->cpuset_mem_spread_rotor = NUMA_NO_NODE;
        p->cpuset_slab_spread_rotor = NUMA_NO_NODE;
        seqcount_init(&p->mems_allowed_seq);
#endif
#ifdef CONFIG_TRACE_IRQFLAGS
        p->irq_events = 0;
        p->hardirqs_enabled = 0;
        p->hardirq_enable_ip = 0;
        p->hardirq_enable_event = 0;
        p->hardirq_disable_ip = _THIS_IP_;
        p->hardirq_disable_event = 0;
        p->softirqs_enabled = 1;
        p->softirq_enable_ip = _THIS_IP_;
        p->softirq_enable_event = 0;
        p->softirq_disable_ip = 0;
        p->softirq_disable_event = 0;
        p->hardirq_context = 0;
        p->softirq_context = 0;
#endif
#ifdef CONFIG_LOCKDEP
        p->lockdep_depth = 0; /* no locks held yet */
        p->curr_chain_key = 0;
        p->lockdep_recursion = 0;
#endif

#ifdef CONFIG_DEBUG_MUTEXES
        p->blocked_on = NULL; /* not blocked yet */
#endif
#ifdef CONFIG_BCACHE
        p->sequential_io        = 0;
        p->sequential_io_avg    = 0;
#endif

코드 라인 1~8에서 누마 시스템에서 메모리 정책을 복제한다.
코드 라인 9~13에서 cpuset 관련 멤버들을 초기화한다.
코드 라인 14~28에서 irq 시작과 끝에 대한 trace 정보 및 조작 관련 멤버를 초기화한다.
코드 라인 29~33에서 lockdep 디버깅 관련 멤버를 초기화한다.
코드 라인 35~37에서 뮤텍스 블럭 디버깅 멤버를 초기화한다.
코드 라인 38~41에서 블럭 디바이스를 다른 디바이스의 캐시 대용으로 사용할 때 관련된 멤버들을 초기화한다.

kernel/fork.c -5/8-

        /* Perform scheduler related setup. Assign this task to a CPU. */
        retval = sched_fork(clone_flags, p);
        if (retval)
                goto bad_fork_cleanup_policy;

        retval = perf_event_init_task(p);
        if (retval)
                goto bad_fork_cleanup_policy;
        retval = audit_alloc(p);
        if (retval)
                goto bad_fork_cleanup_perf;
        /* copy all the process information */
        shm_init_task(p);
        retval = copy_semundo(clone_flags, p);
        if (retval)
                goto bad_fork_cleanup_audit;
        retval = copy_files(clone_flags, p);
        if (retval)
                goto bad_fork_cleanup_semundo;
        retval = copy_fs(clone_flags, p);
        if (retval)
                goto bad_fork_cleanup_files;
        retval = copy_sighand(clone_flags, p);
        if (retval)
                goto bad_fork_cleanup_fs;
        retval = copy_signal(clone_flags, p);
        if (retval)
                goto bad_fork_cleanup_sighand;
        retval = copy_mm(clone_flags, p);
        if (retval)
                goto bad_fork_cleanup_signal;
        retval = copy_namespaces(clone_flags, p);
        if (retval)
                goto bad_fork_cleanup_mm;
        retval = copy_io(clone_flags, p);
        if (retval)
                goto bad_fork_cleanup_namespaces;
        retval = copy_thread(clone_flags, stack_start, stack_size, p);
        if (retval)
                goto bad_fork_cleanup_io;

        if (pid != &init_struct_pid) {
                retval = -ENOMEM;
                pid = alloc_pid(p->nsproxy->pid_ns_for_children);
                if (!pid)
                        goto bad_fork_cleanup_io;
        }

        p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
        /*
         * Clear TID on mm_release()?
         */
        p->clear_child_tid = (clone_flags & CLONE_CHILD_CLEARTID) ? child_tidptr : NULL;
#ifdef CONFIG_BLOCK
        p->plug = NULL;
#endif
#ifdef CONFIG_FUTEX
        p->robust_list = NULL;
#ifdef CONFIG_COMPAT
        p->compat_robust_list = NULL;
#endif
        INIT_LIST_HEAD(&p->pi_state_list);
        p->pi_state_cache = NULL;
#endif

코드 라인 1~4에서 태스크의 스케줄러 관련 멤버들을 초기화한다.
- p->state: TAKS_RUNNING 상태로 바꾼다.
- p->prio: normal_prio로 설정한다. (pi boost에 의해 priority가 변경되어 있을 수도 있다)
- p->sched_class: prio를 보고 rt 또는 cfs 스케줄러로 설정한다. (deadline인 경우 에러)
- thread_info->cpu: cpu 설정
- p->on_cpu: 0으로 초기화
- p->pushable_tasks: rt 오버로드용 리스트 초기화
- p->pusjable_dl_tasks: dl 오버로드용 RB 트리 초기화
- thread_info->preempt_count: PREEMPT_ENABLE로 초기화
코드 라인 6~8에서 perf 이벤트 관련한 context 들을 모두 초기화한다.
코드 라인 9~11에서 audit 기능이 동작하는 경우 audit context 블럭을 할당한다.
- 참고: 리눅스 Audit 시스템 소개 | Techint
코드 라인 13에서 시스템V IPC용 공유 메모리(shm)에 관련 리스트의 초기화를 수행한다.
코드 라인 14~16에서 부모 태스크에서 사용중인 시스템V IPC용 세마포어들에 대해 공유(CLONE_SYSVSEM)하거나 새로 초기화하여 사용한다.
- 공유하여 사용하면 참조 카운터는 1을 증가시키고, 그렇지 않은 경우 초기화하여 사용한다.
코드 라인 17~19에서 부모 태스크에서 열고 사용하고 있는 파일 정보를 공유(CLONE_FILES)하거나 복사하여 사용한다.
- 공유하여 사용하면 참조 카운터는 1을 증가시키고, 그렇지 않은 경우 해당 파일들의 참조 카운터를 1부터 시작한다.
코드 라인 20~22에서 부모 태스크에서 사용하는 루트 파일 시스템을 공유(CLONE_FS)하거나 복사하여 사용한다.
- 공유하여 사용하면 참조 카운터는 1을 증가시키고, 그렇지 않은 경우 참조 카운터를 1부터 시작한다.
코드 라인 23~25에서 부모 태스크에서 사용하는 시그널 핸들러 정보를 공유(CLONE_SIGHAND)하거나 복사하여 사용한다.
- 공유하여 사용하면 참조 카운터는 1을 증가시키고, 그렇지 않은 경우 참조 카운터를 1부터 시작한다.
코드 라인 26~28에서 부모 태스크에서 사용하는 시그널 rlimit 정보를 알아와서 시그널 디스크립터를 생성하여 초기화한다. 단 생성되는 태스크가 스레드인 경우에는 아무일도 하지 않고 성공(0)으로 함수를 빠져나간다.
코드 라인 29~31에서 커널 스레드인경우 거의 아무일도 수행하지 않고, pthread, vfork를 통해 진입한 경우 별도의 mm을 만들지 않고 부모의 mm을 공유한다. 유저 프로세스(fork & clone)를 생성한 경우 자신의 vm을 만들되 부모의 mm 정보를 상속받아 사용한다. 부모가 사용하는 vma 정보와 페이지 테이블 정보를 복사하여 사용한다. (COW)
- COW 방식을 사용하여 실제 사용메모리는 할당하지 않고, 부모 태스크가 사용하던 vma 및 페이지 테이블만 복사하여 자신의 vm을 구성하고, 유저 태스크가 추후 실제 페이지에 수정을 위해 접근하려 할 때 fault 되어 기존 페이지를 새로 할당받은 곳에 복사하여 사용하는 방식을 사용한다.
- 공유하여 사용하면 참조 카운터는 1을 증가시키고, 그렇지 않은 경우 참조 카운터를 1부터 시작한다.
코드 라인 32~34에서 부모 태스크에서 사용하는 namespace를 사용하거나 공유(CLONE_NEWNS, CLONE_NEWUTS | CLONE_NEWIPC, CLONE_NEWPID, CLONE_NEWNET) 요청이 있는 경우 부모 태스크의 namespace 정보를 사용하여 새로운 namespace를 생성한다.
- 유저가 CAP_SYS_ADMIN 권한이 없는 경우 -EPERM 에러로 함수를 빠져나간다.
코드 라인 35~37에서 부모 태스크가 사용하는 io context 정보를 공유(CLONE_IO)하거나 새로 생성하여 사용한다.
코드 라인 38~40에서 부모 태스크가 사용하는 스레드 레지스터 정보를 사용하여 초기화한다. 부모 태스크가 사용하는 TLS 정보를 공유(CLONE_SETTLS)할 수도 있다.
코드 라인 42~47에서 인자로 받은 pid 디스크립터가 &init_struct_pid가 아닌 경우 pid 디스크립터를 새로 할당받는다.
- fork_idle() 함수를 통해 copy_process() 함수를 호출할 때 인자로 &init_struct_pid 디스크립터가 주어진다.

kernel/fork.c -6/8-

        /*
         * sigaltstack should be cleared when sharing the same VM
         */
        if ((clone_flags & (CLONE_VM|CLONE_VFORK)) == CLONE_VM)
                p->sas_ss_sp = p->sas_ss_size = 0;

        /*
         * Syscall tracing and stepping should be turned off in the
         * child regardless of CLONE_PTRACE.
         */
        user_disable_single_step(p);
        clear_tsk_thread_flag(p, TIF_SYSCALL_TRACE);
#ifdef TIF_SYSCALL_EMU
        clear_tsk_thread_flag(p, TIF_SYSCALL_EMU);
#endif
        clear_all_latency_tracing(p);

        /* ok, now we should be set up.. */
        p->pid = pid_nr(pid);
        if (clone_flags & CLONE_THREAD) {
                p->exit_signal = -1;
                p->group_leader = current->group_leader;
                p->tgid = current->tgid;
        } else {
                if (clone_flags & CLONE_PARENT)
                        p->exit_signal = current->group_leader->exit_signal;
                else
                        p->exit_signal = (clone_flags & CSIGNAL);
                p->group_leader = p;
                p->tgid = p->pid;
        }

        p->nr_dirtied = 0;
        p->nr_dirtied_pause = 128 >> (PAGE_SHIFT - 10);
        p->dirty_paused_when = 0;

        p->pdeath_signal = 0;
        INIT_LIST_HEAD(&p->thread_group);
        p->task_works = NULL;

        /*
         * Make it visible to the rest of the system, but dont wake it up yet.
         * Need tasklist lock for parent etc handling!
         */
        write_lock_irq(&tasklist_lock);

        /* CLONE_PARENT re-uses the old parent */
        if (clone_flags & (CLONE_PARENT|CLONE_THREAD)) {
                p->real_parent = current->real_parent;
                p->parent_exec_id = current->parent_exec_id;
        } else {
                p->real_parent = current;
                p->parent_exec_id = current->self_exec_id;
        }

코드 라인 4~5에서 vfork가 아닌 fork 또는 clone을 통해 VM을 공유하는 경우 시그널 보조 스택을 초기화한다.
코드 라인 11에서 gdb 등이 사용하는 ptrace 디버그용 유저 모드 싱글 스텝을 disable한다.
- arm 및 arm64 아키텍처는 하드웨어 디버거를 지원받아 사용한다.
- 참고: ptrace
코드 라인 12에서 trace용 syscall 플래그를 제거한다.
코드 라인 13~15에서 emulator용 syscall 플래그를 제거한다.
코드 라인 16에서 latency tracing을 위한 정보를 초기화한다.
코드 라인 20~23에서 스레드를 생성한 경우 이 스레드의 그룹 리더와 스레드 그룹 리더(tgid)는 부모 태스크가 가리키는 그룹리더와 스레드 그룹 리더를 그대로 사용한다.
코드 라인 24~31에서 스레드가 아닌 프로세스를 생성한 경우 그룹 리더와 스레드 그룹 리더는 자신이된다.

kernel/fork.c -7/8-

        spin_lock(&current->sighand->siglock);

        /*
         * Copy seccomp details explicitly here, in case they were changed
         * before holding sighand lock.
         */
        copy_seccomp(p);

        /*
         * Process group and session signals need to be delivered to just the
         * parent before the fork or both the parent and the child after the
         * fork. Restart if a signal comes in before we add the new process to
         * it's process group.
         * A fatal signal pending means that current will exit, so the new
         * thread can't slip out of an OOM kill (or normal SIGKILL).
        */
        recalc_sigpending();
        if (signal_pending(current)) {
                spin_unlock(&current->sighand->siglock);
                write_unlock_irq(&tasklist_lock);
                retval = -ERESTARTNOINTR;
                goto bad_fork_free_pid;
        }

        if (likely(p->pid)) {
                ptrace_init_task(p, (clone_flags & CLONE_PTRACE) || trace);

                init_task_pid(p, PIDTYPE_PID, pid);
                if (thread_group_leader(p)) {
                        init_task_pid(p, PIDTYPE_PGID, task_pgrp(current));
                        init_task_pid(p, PIDTYPE_SID, task_session(current));

                        if (is_child_reaper(pid)) {
                                ns_of_pid(pid)->child_reaper = p;
                                p->signal->flags |= SIGNAL_UNKILLABLE;
                        }

                        p->signal->leader_pid = pid;
                        p->signal->tty = tty_kref_get(current->signal->tty);
                        list_add_tail(&p->sibling, &p->real_parent->children);
                        list_add_tail_rcu(&p->tasks, &init_task.tasks);
                        attach_pid(p, PIDTYPE_PGID);
                        attach_pid(p, PIDTYPE_SID);
                        __this_cpu_inc(process_counts);
                } else {
                        current->signal->nr_threads++;
                        atomic_inc(&current->signal->live);
                        atomic_inc(&current->signal->sigcnt);
                        list_add_tail_rcu(&p->thread_group,
                                          &p->group_leader->thread_group);
                        list_add_tail_rcu(&p->thread_node,
                                          &p->signal->thread_head);
                }
                attach_pid(p, PIDTYPE_PID);
                nr_threads++;
        }

코드 라인 7에서 부모 태스크에서 사용하는 secure computing mode를 가져와서 새 태스크에 복사하여 사용한다.
- 프로세스에 이미 열린 파일 디스크립터에 대해 exit(), sigreturn(), read() 그리고 write()를 제외한 모든 시스템 호출을 할 수 없게 보호한다. 만일 시도시 SIGKILL이 발생한다.
- 참고: seccomp(2) – Linux manual page – man7.org
코드 라인 17~23에서 부모 태스크의 시그널 상태를 재평가하여 펜딩된 경우 에러로 함수를 빠져나간다. 그렇지 않은 경우 플래그에서 시그널 펜딩 플래그를 제거한다.
코드 라인 25~28에서 높은 확률로 생성된 태스크에 pid 디스크립터가 지정된 경우 이 태스크의 PIDTYPE_PID에 pid 디스크립터를 지정한다.
코드 라인 29~31에서 생성된 태스크가 스레드 그룹의 리더인 경우 이 태스크의 PIDTYPE_PGID에 부모 태스크의 프로세스 그룹 리더의 pid 디스크립터를 지정한다. 또한 PIDTYPE_SID에 부모 태스크의 세션 id의 pid 디스크립터를 지정한다.
코드 라인 33~36에서 이 태스크가 child reaper인 경우 이 pid namespace의 child_reaper에 생성 태스크를 지정한다. 또한 시그널 플래그에 SIGNAL_UNKILLABLE 플래그를 추가한다.
코드 라인 38~39에서 시그널 리더 pid에 이 pid를 지정하고, 시그널에 지정된 tty도 부모 tty를 지정한다.
코드 라인 40~41에서 init_tasks.tasks 리스트에 생성된 스레드 그룹 리더 태스크를 추가하고, 부모 태스크의 자식으로 이 태스크를 추가한다.
코드 라인 42~43에서 생성된 태스크에 지정된 PIDTYPE_PGID 타입과 PIDTYPE_SID에 이 pid를 연결한다.
코드 라인 44에서 전역 per-cpu 변수인 process_counts 카운터를 1 증가시킨다.
코드 라인 45~48에서 생성된 태스크가 스레드인 경우 부모 태스크의 시그널 카운터들을 1씩 증가시킨다.
코드 라인 49~53에서 프로세스 그룹리더의 스레드 그룹 리스트와 시그널에 생성된 스레드를 추가한다.
코드 라인 54~55에서 생성된 태스크에 지정된 PIDTYPE_PID 타입에 이 pid를 연결한다. 그런 후 전역 변수인 스레드 수(nr_threads)를 증가시킨다.

kernel/fork.c -8/8-

        total_forks++;
        spin_unlock(&current->sighand->siglock);
        syscall_tracepoint_update(p);
        write_unlock_irq(&tasklist_lock);

        proc_fork_connector(p);
        cgroup_post_fork(p);
        if (clone_flags & CLONE_THREAD)
                threadgroup_change_end(current);
        perf_event_fork(p);

        trace_task_newtask(p, clone_flags);
        uprobe_copy_process(p, clone_flags);

        return p;

bad_fork_free_pid:
        if (pid != &init_struct_pid)
                free_pid(pid);
bad_fork_cleanup_io:
        if (p->io_context)
                exit_io_context(p);
bad_fork_cleanup_namespaces:
        exit_task_namespaces(p);
bad_fork_cleanup_mm:
        if (p->mm)
                mmput(p->mm);
bad_fork_cleanup_signal:
        if (!(clone_flags & CLONE_THREAD))
                free_signal_struct(p->signal);
bad_fork_cleanup_sighand:
        __cleanup_sighand(p->sighand);
bad_fork_cleanup_fs:
        exit_fs(p); /* blocking */
bad_fork_cleanup_files:
        exit_files(p); /* blocking */
bad_fork_cleanup_semundo:
        exit_sem(p);
bad_fork_cleanup_audit:
        audit_free(p);
bad_fork_cleanup_perf:
        perf_event_free_task(p);
bad_fork_cleanup_policy:
#ifdef CONFIG_NUMA
        mpol_put(p->mempolicy);
bad_fork_cleanup_threadgroup_lock:
#endif
        if (clone_flags & CLONE_THREAD)
                threadgroup_change_end(current);
        delayacct_tsk_free(p);
        module_put(task_thread_info(p)->exec_domain->module);
bad_fork_cleanup_count:
        atomic_dec(&p->cred->user->processes);
        exit_creds(p);
bad_fork_free:
        free_task(p);
fork_out:
        return ERR_PTR(retval);
}

코드 라인 1에서 fork된 태스크의 수를 증가시킨다. (전역 변수 total_forks)
코드 라인 3에서 syscall_tracepoint 관련하여 부모 태스크에 설정된 TIF_SYSCALL_TRACEPOINT 플래그 유무를 현재 태스크에 복사하여 설정한다.
코드 라인 6에서 fork가 발생하였으므로 1개 이상의 대기중인 프로세스 이벤트 리스너들에게 넷링크를 통해 전송한다.
- 참고
  - Process Events Connector | LWN.net
  - The Proc Connector and Socket Filters | SCOTT JAMES REMNANT
코드 라인 7에서 새 태스크 생성 후 cgroup이 처리할 일을 수행한다.
- cgroup 태스크 리스트에 추가하고 에서 처리할 일을 수행한다.
- cgroup 서브시스템에 등록된 fork 후크 함수를 호출한다.
코드 라인 8~9에서 스레드 생성 요청(CLONE_THREAD)인 경우 스레드 그룹 변경에 대한 락을 닫는다.
코드 라인 10에서 fork에 대한 perf 이벤트 처리를 수행한다.
코드 라인 13에서 부모 태스크에서 사용하던 Uprobe 기반 이벤트 트레이스에서 사용하는 context들을 vfork로 만들어진 새 유저 태스크에만 복사한다.
- 참고
  - Uprobe-tracer: Uprobe-based Event Tracing | Kernel.org
  - Linux uprobe: User-Level Dynamic Tracing | Brendan Gregg’s Blog

새 유저 태스크를 위한 mm 공유 또는 복제

copy_mm()

kernel/fork.c

static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
{
        struct mm_struct *mm, *oldmm;
        int retval;

        tsk->min_flt = tsk->maj_flt = 0;
        tsk->nvcsw = tsk->nivcsw = 0;
#ifdef CONFIG_DETECT_HUNG_TASK
        tsk->last_switch_count = tsk->nvcsw + tsk->nivcsw;
#endif

        tsk->mm = NULL;
        tsk->active_mm = NULL;

        /*
         * Are we cloning a kernel thread?
         *
         * We need to steal a active VM for that..
         */
        oldmm = current->mm;
        if (!oldmm)
                return 0;

        /* initialize the new vmacache entries */
        vmacache_flush(tsk);

        if (clone_flags & CLONE_VM) {
                atomic_inc(&oldmm->mm_users);
                mm = oldmm;
                goto good_mm;
        }

        retval = -ENOMEM;
        mm = dup_mm(tsk);
        if (!mm)
                goto fail_nomem;

good_mm:
        tsk->mm = mm;
        tsk->active_mm = mm;
        return 0;

fail_nomem:
        return retval;
}

새 태스크를 위해 부모 mm 디스크립터 및 페이지 테이블을 공유 또는 복사해온다. 다음 조건에 따라 동작이 구분된다.

kernel_thread() 함수를 통해 커널 스레드의 생성이 요청된 경우
- 커널 스레드는 vm 정보 구성을 하지 않으므로 아무런 처리를 하지 않는다.
pthread_create() 또는 vfork() 함수를 통해 유저 스레드 또는 유저 태스크의 생성이 요청된 경우
- 부모가 사용하는 mm 디스크립터 정보를 그대로 공유한다.
fork() 또는 clone() 함수를 통해 유저 태스크의 생성이 요청된 경우
- 생성된 태스크용으로 mm 디스크립터와 페이지 테이블을 새롭게 할당한 후 부모가 사용하는 mm 디스크립터와 페이지 테이블을 복사하여 구성한다. (COW)

참고:

tsk->mm
- 커널 스레드인 경우 null
- 유저 태스크(유저 프로세스 및 유저 스레드)인 경우 mm 디스크립터를 가리킨다.
tsk->active_mm
- 커널 스레드인 경우 마지막에 사용했었던 mm을 가리킨다.
  - 최초 init_mm을 제외하곤 항상 마지막 유저 프로세스의 mm 디스크립터를 가리킨다.
- 유저 프로세스가 사용하는 mm 디스크립터를 가리킨다.
  - 유저 스레드들은 자신이 소속된 유저 프로세스의 mm 디스크립터를 가리킨다.

dup_mm()

kernel/fork.c

/*
 * Allocate a new mm structure and copy contents from the
 * mm structure of the passed in task structure.
 */
static struct mm_struct *dup_mm(struct task_struct *tsk)
{
        struct mm_struct *mm, *oldmm = current->mm;
        int err;

        mm = allocate_mm();
        if (!mm)
                goto fail_nomem;

        memcpy(mm, oldmm, sizeof(*mm));

        if (!mm_init(mm, tsk))
                goto fail_nomem;

        dup_mm_exe_file(oldmm, mm);

        err = dup_mmap(mm, oldmm);
        if (err)
                goto free_pt;

        mm->hiwater_rss = get_mm_rss(mm);
        mm->hiwater_vm = mm->total_vm;

        if (mm->binfmt && !try_module_get(mm->binfmt->module))
                goto free_pt;

        return mm;

free_pt:
        /* don't put binfmt in mmput, we haven't got module yet */
        mm->binfmt = NULL;
        mmput(mm);

fail_nomem:
        return NULL;
}

새 유저 태스크를 위해 mm 디스크립터와 페이지 테이블을 할당한 후, 부모 태스크가 사용하던 mm 정보와 페이지 테이블을 복사한다.

코드 라인 10~12에서 mm 디스크립터용 kmem 캐시를 통해 새 mm 디스크립터를 할당받아온다.
코드 라인 14에서 부모 mm(oldmm) 디스크립터를 할당받은 mm 디스크립터에 모두 복사한다.
- 부모 태스크가 사용했었던 모든 vm 정보들이 복사된다. (물론 여기에서는 페이지 테이블은 제외)
코드 라인 16~17에서 부모 태스크의 mm 정보를 사용할 필요가 없는 새 태스크용 mm 디스크립터의 멤버를 일부 초기화시킨다. 이 때 함수내에서 유저 태스크용 페이지 테이블을 할당받아 mm->pgd에 대입한다.
코드 라인 19에서 실행 파일 정보를 복제한다.
코드 라인 21~23에서 부모 태스크가 사용하던 vma 정보들과 페이지 테이블에서 매핑된 유저 엔트리들을 복사한다.
- COW(Copy On Write) 처리를 위해 실제 부모 태스크가 사용하던 페이지(코드, 데이터 등)을 복사하지 않고 vma 정보와 페이지 테이블만 복사한다. 추후 유저 태스크가 활성화되어 부모 태스크가 사용하던 해당 코드 또는 데이터를 공유하여 접근하는데 만일 쓰기 작업이 수행되는 일이 발생할 때에만 해당 페이지를 복사하는 방식으로 실제 메모리를 지연할당한다. -> 태스크 생성이 빨라지고, 실제 물리 메모리 소모가 줄어든다.

mm_init()

kernel/fork.c

static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
{
        mm->mmap = NULL;
        mm->mm_rb = RB_ROOT;
        mm->vmacache_seqnum = 0;
        atomic_set(&mm->mm_users, 1);
        atomic_set(&mm->mm_count, 1);
        init_rwsem(&mm->mmap_sem);
        INIT_LIST_HEAD(&mm->mmlist);
        mm->core_state = NULL;
        atomic_long_set(&mm->nr_ptes, 0);
        mm_nr_pmds_init(mm);
        mm->map_count = 0;
        mm->locked_vm = 0;
        mm->pinned_vm = 0;
        memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
        spin_lock_init(&mm->page_table_lock);
        mm_init_cpumask(mm);
        mm_init_aio(mm);
        mm_init_owner(mm, p);
        mmu_notifier_mm_init(mm);
        clear_tlb_flush_pending(mm);
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
        mm->pmd_huge_pte = NULL;
#endif

        if (current->mm) {
                mm->flags = current->mm->flags & MMF_INIT_MASK;
                mm->def_flags = current->mm->def_flags & VM_INIT_DEF_MASK;
        } else {
                mm->flags = default_dump_filter;
                mm->def_flags = 0;
        }

        if (mm_alloc_pgd(mm))
                goto fail_nopgd;

        if (init_new_context(p, mm))
                goto fail_nocontext;

        return mm;

fail_nocontext:
        mm_free_pgd(mm);
fail_nopgd:
        free_mm(mm);
        return NULL;
}

mm 디스크립터를 초기화하고 유저용 pgd 테이블도 할당한다.

코드 라인 1~33에서 mm 디스크립터를 초기화한다.
코드 라인 35~36에서 유저용 pgd 테이블도 할당하고 mm->pgd에 대입한다.
코드 라인 38~39에서 context id를 0으로 초기화한다.

유저용 pgd 테이블 할당

arm 커널에서는 하나의 pgd 테이블을 커널 영역과 유저 영역을 같이 공유하여 사용되고, 각 영역을 분리(split)하는 사이즈는 커널 컴파일 옵션(rpi2: CONFIG_VMSPLIT_2G=y)에 따라 다르다. 유저용 pgdb 페이지 테이블을 만들때 arm 아키텍처는 다음과 같은 일을 수행한다.

init_mm->pgd 테이블에서 커널 영역에 해당하는 pgd 엔트리들을 내 pgd 테이블에 복사한다.
- 커널 영역: CONFIG_VMSPLIT_2G + 16M(모듈 영역)
유저 영역에 해당하는 pgd 엔트리들은 null(0)으로 초기화한다.
만일 low exception 벡터를 사용하는 경우에는 low 벡터 주소가 유저 영역에 위치하므로 이를 access 할 수 있도록 low exception 벡터에 대한 pte 엔트리들도 할당하고 준비해야 한다.

32bit arm with LPAE 에서는 3개의 페이지 테이블을 구성하는 것이 달라지고, arm64에서는 커널용 페이지 테이블과 유저용 페이지 테이블이 아예 별도로 구성되어 있으므로 간단히 유저용 pgd 페이지 테이블만 할당 한다.

mm_alloc_pgd()

kernel/fork.c

static inline int mm_alloc_pgd(struct mm_struct *mm)
{
        mm->pgd = pgd_alloc(mm);
        if (unlikely(!mm->pgd))
                return -ENOMEM;
        return 0;
}

페이지 테이블을 할당한 후 mm->pgd에 지정한다.

사용하는 아키텍처 및 커널 옵션 구성에 따라 페이지 테이블 할당 구현이 약간씩 다르다.

pgd_alloc() – for arm

arch/arm/mm/pgd.c

/*
 * need to get a 16k page for level 1
 */
pgd_t *pgd_alloc(struct mm_struct *mm)
{
        pgd_t *new_pgd, *init_pgd;
        pud_t *new_pud, *init_pud;
        pmd_t *new_pmd, *init_pmd;
        pte_t *new_pte, *init_pte;

        new_pgd = __pgd_alloc();
        if (!new_pgd)
                goto no_pgd;

        memset(new_pgd, 0, USER_PTRS_PER_PGD * sizeof(pgd_t));

        /*
         * Copy over the kernel and IO PGD entries
         */
        init_pgd = pgd_offset_k(0);
        memcpy(new_pgd + USER_PTRS_PER_PGD, init_pgd + USER_PTRS_PER_PGD,
                       (PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t));

        clean_dcache_area(new_pgd, PTRS_PER_PGD * sizeof(pgd_t));

#ifdef CONFIG_ARM_LPAE
        /*
         * Allocate PMD table for modules and pkmap mappings.
         */ 
        new_pud = pud_alloc(mm, new_pgd + pgd_index(MODULES_VADDR),
                            MODULES_VADDR);
        if (!new_pud)
                goto no_pud;

        new_pmd = pmd_alloc(mm, new_pud, 0);
        if (!new_pmd)
                goto no_pmd;
#endif

arm 아키텍처용 pgd 페이지 테이블을 할당한 후 반환한다.

코드 라인 11~13에서 pgd 페이지 테이블을 할당해온다.
- 3레벨을 사용하는 LPAE 시스템인 경우에는 4개 엔트리 * 8 바이트 = 32 바이트를 pgd 전용 kmem cache를 통해 할당한다.
- 2레벨을 사용하는 경우에는 버디 시스템을 통해 4개 페이지 * 4K = 16K 페이지를 할당한다.
코드 라인 15에서 pgd 테이블에서 유저 영역에 해당하는 부분만 0으로 초기화한다.
- rpi2 예) 2048개의 리눅스 pgd 엔트리에서 하위 유저 영역에 해당하는 부분은 1024(하위 2G 해당)-8(16M 모듈)=1018개의 8바이트 엔트리를 0으로 초기화한다.
코드 라인 20~22에서 커널용 pgd 테이블에서 커널 영역에 해당하는 부분만 새로 할당한 페이지의 커널 영역에 복사한다.
- 주의: 커널 영역 사이즈는 커널 쪽 vm_split 영역 사이즈와 모듈 영역을 더해야 한다.
코드 라인 24에서 할당받은 pgd 테이블 주소와 사이즈만큼에 해당하는 영역에 대해 data 캐시를 클린한다.
코드 라인 26~38에서 3레벨 페이지 테이블을 구성해야 하는 LPAE 시스템에서는 pud는 pgd를 그대로 사용하여 패스시키고 pmd 테이블도 추가로 할당하여 구성한다.

        if (!vectors_high()) {
                /*
                 * On ARM, first page must always be allocated since it
                 * contains the machine vectors. The vectors are always high
                 * with LPAE.
                 */
                new_pud = pud_alloc(mm, new_pgd, 0);
                if (!new_pud)
                        goto no_pud;

                new_pmd = pmd_alloc(mm, new_pud, 0);
                if (!new_pmd)
                        goto no_pmd;

                new_pte = pte_alloc_map(mm, NULL, new_pmd, 0);
                if (!new_pte)
                        goto no_pte;

                init_pud = pud_offset(init_pgd, 0);
                init_pmd = pmd_offset(init_pud, 0);
                init_pte = pte_offset_map(init_pmd, 0);
                set_pte_ext(new_pte + 0, init_pte[0], 0);
                set_pte_ext(new_pte + 1, init_pte[1], 0);
                pte_unmap(init_pte);
                pte_unmap(new_pte);
        }

        return new_pgd;

no_pte:
        pmd_free(mm, new_pmd);
        mm_dec_nr_pmds(mm);
no_pmd:
        pud_free(mm, new_pud);
no_pud:
        __pgd_free(new_pgd);
no_pgd:
        return NULL;
}

코드 라인 1~26에서 low exception 벡터를 사용하는 시스템에서는 벡터가 유저 영역에 위치하므로 이를 위한 별도의 매핑을 추가로 만들어줘야 한다.
코드 라인 28에서 정상적으로 할당받은 pgd 테이블의 시작 주소를 반환한다.

pgd_alloc() – for arm64

arch/arm64/mm/pgd.c

pgd_t *pgd_alloc(struct mm_struct *mm)
{
        if (PGD_SIZE == PAGE_SIZE)
                return (pgd_t *)__get_free_page(PGALLOC_GFP);
        else
                return kmem_cache_alloc(pgd_cache, PGALLOC_GFP);
}

arm64에서는 커널용 페이지 테이블과 유저용 페이지 테이블이 아예 별도로 구성되어 있으므로 간단히 유저용 pgd 페이지 테이블만 할당 한다.

참고

Scheduler -1- (Basic) | 문c
Scheduler -2- (Global Cpu Load) | 문c
Scheduler -3- (PELT) | 문c
Scheduler -4- (Group Scheduling) | 문c
Scheduler -5- (Scheduler Core) | 문c
Scheduler -6- (CFS Scheduler) | 문c
Scheduler -7- (Preemption & Context Switch) | 문c
Scheduler -8- (CFS Bandwidth) | 문c
Scheduler -9- (RT Scheduler) | 문c
Scheduler -10- (Deadline Scheduler) | 문c
Scheduler -11- (Stop Scheduler) | 문c
Scheduler -12- (Idle Scheduler) | 문c
Scheduler -13- (Scheduling Domain 1) | 문c
Scheduler -14- (Scheduling Domain 2) | 문c
Scheduler -15- (Load Balance 1) | 문c
Scheduler -16- (Load Balance 2) | 문c
Scheduler -17- (Load Balance 3 NUMA) | 문c
Scheduler -18- (Load Balance 4 EAS) | 문c
Scheduler -19- (초기화) | 문c
PID 관리 | 문c
do_fork() | 문c – 현재 글
cpu_startup_entry() | 문c
런큐 로드 평균(cpu_load[]) – v4.0 | 문c
PELT(Per-Entity Load Tracking) – v4.0 | 문c

fork()와 exec() | Laonbud

PID 관리

2018-01-302021-01-11 문영일 Leave a comment

PID 관리

PID(Process ID)는 Posix에서 정의한 것으로는 정확히 프로세스를 의미한다. 그러나 리눅스 커널 v2.6에 도입된 Posix 스레드(NTPL을 통한 pthread) 생성이 지원되면서부터 커널 소스내에서 사용하는 pid 값이 process id이지만 개념은 task id이다. 따라서 pid가 process id를 의미하는지 task id를 의미하는지 소스 주변 상황을 보면서 확인해야한다.(조심!!!)

ps 툴 등에서 언급하는 pid와 관련된 항목들을 알아본다.

LWP
- thread id를 의미한다.
TID
- LWP와 동일하다.
PID
- posix.1에서 정의한 process id이다.
TGID
- PID와 동일하다.
PGID
- process 그룹 id이다. (시그널 공유)
- command line 파이프라인을 통해 shell에서 새로운 child 프로세스를 fork할 때 동일 process group id를 사용한다.
- process group id는 1개의 세션에만 허락된다.
SID
- 세션 id이다.
- 1 개 이상의 process group을 가질 수 있다.

다음 그림과 같이 세션에서 쉘 sh가 구동되었고 그 아래에서 ppd가 동작된 상황이다.

sh(shell)에서 ppd를 구동하는 경우 임시 쉘이 sh와 ppd 사이 중간에 잠깐 만들어지지만 생략하였다.
- 중간에 만들어진 shell이 ppd에 대한 대표 process group id이다. (태스크와 pid가 발급되지만 생략)
- 백그라운드에서 동작할 ppd를 호출한 후 중간에 생성된 shell이 사라지면서 대표 process group id가 ppd가 된다.

예) 쉘에서 ps 명령어로 각종 pid 값을 확인해보았다.

# ps -T -p 2189 -o lwp -o tid -o pid -o tgid -o pgid -o sid -o tty -o cmd
  LWP   TID   PID  TGID  PGID   SID TTY   CMD
 2189  2189  2189  2189  2189  2188 pts/2 ppd -d
 2190  2190  2189  2189  2189  2188 pts/2 ppd -d
 2191  2191  2189  2189  2189  2188 pts/2 ppd -d
 2192  2192  2189  2189  2189  2188 pts/2 ppd -d
 2193  2193  2189  2189  2189  2188 pts/2 ppd -d

리눅스에서 실제 pid 구현과 관련된 항목들을 알아본다. (t=task 디스크립터, pid=pid 디스크립터)

t->pid
- 글로벌하게 unique한 task(thread) id를 의미하고 리눅스 내부에서 태스크 생성 시 마다 발급한다.
t->tgid
- 글로벌하게 unique한 프로세스 id를 의미하고 프로세스 생성시 마다 발급한다.
- posix.1에서 정의한 PID와 동일하다.
t->pids[PIDTYPE_PID]
- .node
  - 아래가 가리키는 pid 디스크립터의 tasks[PIDTYPE_PID] 리스트에 연결될 때 사용한다.
- .pid
  - task(thread)에 해당하는 pid 디스크립터를 가리킨다.
t->pids[PIDTYPE_PGID]
- .node
  - 아래가 가리키는 pid 디스크립터의 tasks[PIDTYPE_PGID] 리스트에 연결될 때 사용한다.
- .pid
  - 프로세스(스레드 제외)가 소속된 process group id의 pid 디스크립터를 가리킨다.
t->pids[PIDTYPE_SID]
- .node
  - 아래가 가리키는 pid 디스크립터의 tasks[PIDTYPE_SID] 리스트에 연결될 때 사용한다.
- .pid
  - 프로세스(스레드 제외)가 소속된 session id의 pid 디스크립터를 가리킨다.

pid->tasks[PIDTYPE_PID]
- task(thread) 하나가 담기는 리스트이다.
pids->tasks[PIDTYPE_PGID]
- 동일한 프로세스 그룹 id를 사용하는 프로세스(스레드 제외)들이 담기는 리스트이다.
pids->tasks[PIDTYPE_SID]
- 동일한 세션 id를 사용하는 프로세스(스레드 제외)들이 담기는 리스트이다.

초기에는 PIDTYPE_TGID가 있었지만 지금은 제거되어 사용되지 않는다. (커널 v2.6.x) 현재 스레드를 가진 대표 태스크는 t->thread_group 리스트에 하위 스레드들을 담고있다.

각 태스크는 글로벌 값인 pid, tgid 및 namespace 별로 virtual하게 관리되는 pids[]를 관리한다.

include/linux/sched.h

struct task_struct {
         ...
        pid_t pid;
        pid_t tgid;

        /* PID/PID hash table linkage. */
        struct pid_link pids[PIDTYPE_MAX];
       ...

namespace에 따라 pid가 각각 관리되는 모습을 보여준다.

다음 그림은 task 디스크립터와 pid 디스크립터가 연결된 모습을 보여준다.

task 디스크립터는 1:1로 pid 방향으로 연결된다.
pid 디스크립터에는 1:n으로 1개 이상의 task 디스크립터를 가진다.

다음 그림은 새로운 namespace에서 태스크가 fork되었을 때 upid의 연결관계를 보여준다.

APIs

pid 참조 관련

get_pid()

include/linux/pid.h

static inline struct pid *get_pid(struct pid *pid)
{
        if (pid)
                atomic_inc(&pid->count);
        return pid;
}

사용할 pid 디스크립터의 참조 카운터를 1 증가시킨다.

put_pid()

kernel/pid.c

void put_pid(struct pid *pid)
{
        struct pid_namespace *ns;

        if (!pid)
                return;

        ns = pid->numbers[pid->level].ns;
        if ((atomic_read(&pid->count) == 1) ||
             atomic_dec_and_test(&pid->count)) {
                kmem_cache_free(ns->pid_cachep, pid);
                put_pid_ns(ns);
        }
}
EXPORT_SYMBOL_GPL(put_pid);

사용한 pid 디스크립터의 참조 카운터를 1 감소시키고 0이되면 할당 해제한다. pid namespace 카운터도 1 감소시키고 0이되면 할당 해제한다. 단 초기 namespace의 할당 해제는 하지 않는다.

task로 pid 조회

get_task_pid()

kernel/pid.c

struct pid *get_task_pid(struct task_struct *task, enum pid_type type)
{
        struct pid *pid;
        rcu_read_lock();
        if (type != PIDTYPE_PID)
                task = task->group_leader;
        pid = get_pid(task->pids[type].pid);
        rcu_read_unlock();
        return pid;
}
EXPORT_SYMBOL_GPL(get_task_pid);

태스크가 소속된 프로세스의 pid 디스크립터를 반환하되 pid 참조 카운터를 1 증가시킨다.

해당 태스크의 pid를 가져오는 것이 아니라 프로세스의 pid를 가져오는 것임에 주의해야 한다.

task_pid()

linux/sched.h

static inline struct pid *task_pid(struct task_struct *task)
{
        return task->pids[PIDTYPE_PID].pid;
}

태스크에 해당하는 pid 디스크립터를 반환한다.

task_tgid()

linux/sched.h

static inline struct pid *task_tgid(struct task_struct *task)
{
        return task->group_leader->pids[PIDTYPE_PID].pid;
}

태스크가 소속된 프로세스, 즉 스레드 그룹리더에 해당하는 pid 디스크립터를 반환한다.

task_pgrp()

linux/sched.h

/*
 * Without tasklist or rcu lock it is not safe to dereference
 * the result of task_pgrp/task_session even if task == current,
 * we can race with another thread doing sys_setsid/sys_setpgid.
 */
static inline struct pid *task_pgrp(struct task_struct *task)
{
        return task->group_leader->pids[PIDTYPE_PGID].pid;
}

태스크가 소속된 프로세스 그룹 리더의 pid 디스크립터를 반환한다.

task_session()

linux/sched.h

static inline struct pid *task_session(struct task_struct *task)
{
        return task->group_leader->pids[PIDTYPE_SID].pid;
}

태스크가 소속된 세션 pid 디스크립터를 반환한다.

pid로 task 조회

get_pid_task()

kernel/pid.c

struct task_struct *get_pid_task(struct pid *pid, enum pid_type type)
{
        struct task_struct *result;
        rcu_read_lock();
        result = pid_task(pid, type);
        if (result)
                get_task_struct(result);
        rcu_read_unlock();
        return result;
}
EXPORT_SYMBOL_GPL(get_pid_task);

pid 디스크립터가 관리하는 요청한 pid 타입의 첫 번째 태스크를 반환하되 태스크의 참조 카운터를 1 증가시킨다.

pid_task()

kernel/pid.c

struct task_struct *pid_task(struct pid *pid, enum pid_type type)
{       
        struct task_struct *result = NULL;
        if (pid) {
                struct hlist_node *first;
                first = rcu_dereference_check(hlist_first_rcu(&pid->tasks[type]),
                                              lockdep_tasklist_lock_is_held());
                if (first)
                        result = hlist_entry(first, struct task_struct, pids[(type)].node);
        }
        return result;
}
EXPORT_SYMBOL(pid_task);

pid 디스크립터가 관리하는 요청한 pid 타입의 첫 번째 태스크를 반환한다.

pid 디스크립터로 pid 번호 조회

pid_nr()

include/linux/pid.h

/*
 * the helpers to get the pid's id seen from different namespaces
 *
 * pid_nr()    : global id, i.e. the id seen from the init namespace;
 * pid_vnr()   : virtual id, i.e. the id seen from the pid namespace of
 *               current.
 * pid_nr_ns() : id seen from the ns specified.
 *
 * see also task_xid_nr() etc in include/linux/sched.h
 */

static inline pid_t pid_nr(struct pid *pid)
{
        pid_t nr = 0;
        if (pid)
                nr = pid->numbers[0].nr;
        return nr;
}

pid 디스크립터에서 global pid 번호를 알아온다.

pid_vnr()

kernel/pid.c

pid_t pid_vnr(struct pid *pid)
{
        return pid_nr_ns(pid, task_active_pid_ns(current));
}
EXPORT_SYMBOL_GPL(pid_vnr);

pid 디스크립터에서 virtual pid 번호를 알아온다.

pid_nr_ns()

kernel/pid.c

pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
{
        struct upid *upid;
        pid_t nr = 0;

        if (pid && ns->level <= pid->level) {
                upid = &pid->numbers[ns->level];
                if (upid->ns == ns)
                        nr = upid->nr;
        }
        return nr;
}
EXPORT_SYMBOL_GPL(pid_nr_ns);

pid 디스크립터 및 namespace로 virtual pid 번호를 알아온다.

pid 번호로 pid 디스크립터 조회

find_get_pid()

kernel/pid.c

struct pid *find_get_pid(pid_t nr)
{
        struct pid *pid;

        rcu_read_lock();
        pid = get_pid(find_vpid(nr));
        rcu_read_unlock();

        return pid;
}
EXPORT_SYMBOL_GPL(find_get_pid);

virtual pid 번호에 해당하는 pid 디스크립터를 반환하되 pid 디스크립터의 참조 카운터를 1 증가시킨다.

find_vpid()

kernel/pid.c

struct pid *find_vpid(int nr)
{
        return find_pid_ns(nr, task_active_pid_ns(current));
}
EXPORT_SYMBOL_GPL(find_vpid);

virtual pid 번호에 해당하는 pid 디스크립터를 반환한다.

find_pid_ns()

kernel/pid.c

struct pid *find_pid_ns(int nr, struct pid_namespace *ns)
{
        struct upid *pnr;

        hlist_for_each_entry_rcu(pnr,
                        &pid_hash[pid_hashfn(nr, ns)], pid_chain)
                if (pnr->nr == nr && pnr->ns == ns)
                        return container_of(pnr, struct pid,
                                        numbers[ns->level]);

        return NULL;
}
EXPORT_SYMBOL_GPL(find_pid_ns);

pid 해시 리스트에서 요청한 virtual pid 번호 및 namespace에 매치되는 pid 디스크립터를 찾아 반환한다.

namespace 조회

task_active_pid_ns()

kernel/pid.c

struct pid_namespace *task_active_pid_ns(struct task_struct *tsk)
{
        return ns_of_pid(task_pid(tsk));
}
EXPORT_SYMBOL_GPL(task_active_pid_ns);

task 디스크립터로 pid 디스크립터를 얻은 후 namespace를 찾아 반환한다.

ns_of_pid()

include/linux/pid.h

/*
 * ns_of_pid() returns the pid namespace in which the specified pid was
 * allocated.
 *
 * NOTE:
 *      ns_of_pid() is expected to be called for a process (task) that has
 *      an attached 'struct pid' (see attach_pid(), detach_pid()) i.e @pid
 *      is expected to be non-NULL. If @pid is NULL, caller should handle
 *      the resulting NULL pid-ns.
 */
static inline struct pid_namespace *ns_of_pid(struct pid *pid) 
{
        struct pid_namespace *ns = NULL;
        if (pid) 
                ns = pid->numbers[pid->level].ns;
        return ns;
}

pid 디스크립터로 namespace를 반환한다.

pid 할당과 해제

alloc_pid()

kernel/pid.c

struct pid *alloc_pid(struct pid_namespace *ns)
{
        struct pid *pid;
        enum pid_type type;
        int i, nr;
        struct pid_namespace *tmp;
        struct upid *upid;

        pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
        if (!pid)
                goto out;

        tmp = ns;
        pid->level = ns->level;
        for (i = ns->level; i >= 0; i--) {
                nr = alloc_pidmap(tmp);
                if (nr < 0)
                        goto out_free;

                pid->numbers[i].nr = nr;
                pid->numbers[i].ns = tmp;
                tmp = tmp->parent;
        }

        if (unlikely(is_child_reaper(pid))) {
                if (pid_ns_prepare_proc(ns))
                        goto out_free;
        }

        get_pid_ns(ns);
        atomic_set(&pid->count, 1);
        for (type = 0; type < PIDTYPE_MAX; ++type)
                INIT_HLIST_HEAD(&pid->tasks[type]);

        upid = pid->numbers + ns->level;
        spin_lock_irq(&pidmap_lock);
        if (!(ns->nr_hashed & PIDNS_HASH_ADDING))
                goto out_unlock;
        for ( ; upid >= pid->numbers; --upid) {
                hlist_add_head_rcu(&upid->pid_chain,
                                &pid_hash[pid_hashfn(upid->nr, upid->ns)]);
                upid->ns->nr_hashed++;
        }
        spin_unlock_irq(&pidmap_lock);

out:
        return pid;

out_unlock:
        spin_unlock_irq(&pidmap_lock);
        put_pid_ns(ns);

out_free:
        while (++i <= ns->level)
                free_pidmap(pid->numbers + i);

        kmem_cache_free(ns->pid_cachep, pid);
        pid = NULL;
        goto out;
}

free_pid()

kernel/pid.c

void free_pid(struct pid *pid)
{
        /* We can be called with write_lock_irq(&tasklist_lock) held */
        int i;
        unsigned long flags;

        spin_lock_irqsave(&pidmap_lock, flags);
        for (i = 0; i <= pid->level; i++) {
                struct upid *upid = pid->numbers + i;
                struct pid_namespace *ns = upid->ns;
                hlist_del_rcu(&upid->pid_chain);
                switch(--ns->nr_hashed) {
                case 2:
                case 1:
                        /* When all that is left in the pid namespace
                         * is the reaper wake up the reaper.  The reaper
                         * may be sleeping in zap_pid_ns_processes().
                         */
                        wake_up_process(ns->child_reaper);
                        break;
                case PIDNS_HASH_ADDING:
                        /* Handle a fork failure of the first process */
                        WARN_ON(ns->child_reaper);
                        ns->nr_hashed = 0;
                        /* fall through */
                case 0:
                        schedule_work(&ns->proc_work);
                        break;
                }
        }
        spin_unlock_irqrestore(&pidmap_lock, flags);

        for (i = 0; i <= pid->level; i++)
                free_pidmap(pid->numbers + i);

        call_rcu(&pid->rcu, delayed_put_pid);
}

구조체

pid 구조체

include/linux/pid.h

/*
 * What is struct pid?
 *
 * A struct pid is the kernel's internal notion of a process identifier.
 * It refers to individual tasks, process groups, and sessions.  While
 * there are processes attached to it the struct pid lives in a hash
 * table, so it and then the processes that it refers to can be found
 * quickly from the numeric pid value.  The attached processes may be
 * quickly accessed by following pointers from struct pid.
 *
 * Storing pid_t values in the kernel and referring to them later has a
 * problem.  The process originally with that pid may have exited and the
 * pid allocator wrapped, and another process could have come along
 * and been assigned that pid.
 *
 * Referring to user space processes by holding a reference to struct
 * task_struct has a problem.  When the user space process exits
 * the now useless task_struct is still kept.  A task_struct plus a
 * stack consumes around 10K of low kernel memory.  More precisely
 * this is THREAD_SIZE + sizeof(struct task_struct).  By comparison
 * a struct pid is about 64 bytes.
 *
 * Holding a reference to struct pid solves both of these problems.
 * It is small so holding a reference does not consume a lot of
 * resources, and since a new struct pid is allocated when the numeric pid
 * value is reused (when pids wrap around) we don't mistakenly refer to new
 * processes.
 */
struct pid
{
        atomic_t count;
        unsigned int level;
        /* lists of tasks that use this pid */
        struct hlist_head tasks[PIDTYPE_MAX];
        struct rcu_head rcu;
        struct upid numbers[1];
};

count
- 참조카운터
level
- namespace 레벨 (init=0)
tasks[]
- 태스크들이 연결되는 리스트
rcu
- pid 소멸 시 rcu 방법으로 사용하기 위한 rcu 노드
- pid 해시 태에블에서 pid를 검색할 때 rcu_read_lock()을 사용하여 접근한다.
numbers[]
- namespace 별로 관리하기 위해 upid 구조체가 사용된다.
- namespace 레벨에 따라 배열이 조정된다.
- numbers[0].nr에는 항상 init namespace의 global pid 값이 사용된다.

upid 구조체

include/linux/pid.h

/*
 * struct upid is used to get the id of the struct pid, as it is
 * seen in particular namespace. Later the struct pid is found with
 * find_pid_ns() using the int nr and struct pid_namespace *ns.
 */
struct upid {
        /* Try to keep pid_chain in the same cacheline as nr for find_vpid */
        int nr;
        struct pid_namespace *ns;
        struct hlist_node pid_chain;
};

nr
- virtual pid 번호로 virtual 태스크 id와 동일하다.
*ns
- namespcae를 가리키는 pid_namespace 구조체 포인터
pid_chain

pid_link 구조체

struct pid_link
{
        struct hlist_node node;
        struct pid *pid;
};

node
- 태스크가 pid 해시 리스트에 연결할 때 사용
*pid
- pid 구조체 포인터

참고

Scheduler -1- (Basic) | 문c
Scheduler -2- (Global Cpu Load) | 문c
Scheduler -3- (PELT) | 문c
Scheduler -4- (Group Scheduling) | 문c
Scheduler -5- (Scheduler Core) | 문c
Scheduler -6- (CFS Scheduler) | 문c
Scheduler -7- (Preemption & Context Switch) | 문c
Scheduler -8- (CFS Bandwidth) | 문c
Scheduler -9- (RT Scheduler) | 문c
Scheduler -10- (Deadline Scheduler) | 문c
Scheduler -11- (Stop Scheduler) | 문c
Scheduler -12- (Idle Scheduler) | 문c
Scheduler -13- (Scheduling Domain 1) | 문c
Scheduler -14- (Scheduling Domain 2) | 문c
Scheduler -15- (Load Balance 1) | 문c
Scheduler -16- (Load Balance 2) | 문c
Scheduler -17- (Load Balance 3 NUMA) | 문c
Scheduler -18- (Load Balance 4 EAS) | 문c
Scheduler -19- (초기화) | 문c
PID 관리 | 문c – 현재 글
do_fork() | 문c
cpu_startup_entry() | 문c
런큐 로드 평균(cpu_load[]) – v4.0 | 문c
PELT(Per-Entity Load Tracking) – v4.0 | 문c

Linux kernel 3.x와 2.6.11의 PID Hash Table 자료구조 비교(슬라이드) | 양희철
[Linux] PID 관리 | F/OSS
PID namespaces in the 2.6.24 kernel | LWN.net

RCU(Read Copy Update) -4- (NOCB process)

2018-01-222021-04-01 문영일 Leave a comment

RCU NO-CB (Offload RCU callback)

rcu의 콜백을 처리하는 유형은 큰 흐름으로 다음과 같이 두 가지로 나뉜다.

cb 용
- 1) softirq(디폴트)에서 처리하거나, 2) cb용 콜백 처리 커널 스레드에서 처리된다.
- softirq로 처리될 때 interrupt context에서 곧장 호출되어 처리되고 만일 softirq 처리 건 수가 많아져 지연되는 경우 softirqd 커널 스레드에서 호출되어 처리된다.
- softirq로 처리되는 경우 latency가 짧아 빠른 호출이 보장된다.
no-cb 용
- 전용 no-cb용 콜백 처리 커널 스레드에 위탁(offloaded) 시켜 처리하는 것으로 latency는 약간 저하되지만 절전과 성능을 만족시키는 옵션이다.
- nohz full cpu들을 지정하거나 no-cb cpu들을 지정하는 경우 no-cb용으로 위탁(offloaded)하여 사용한다.
  - 예) “nohz_full=3-7”, 예) “rcu_nocbs=3-7”
- no-cb cpu 지정을 위해 CONFIG_RCU_NOCB_CPU 커널 옵션을 사용하고 다음 3가지 중 하나를 선택할 수 있다.
  - CONFIG_RCU_NOCB_CPU_NONE
    - 디폴트로 nocb 지정된 cpu는 없지만 “nocbs=” 커널 파라메터로 특정 cpu들을 nocb로 지정할 수 있다.
  - CONFIG_RCU_NOCB_CPU_ZERO
    - 디폴트로 cpu#0을 nocb 지정한다. 추가로 “nocbs=” 커널 파라메터를 사용하여 다른 cpu들을 nocb로 지정할 수 있다.
  - CONFIG_RCU_NOCB_CPU_ALL
    - 디폴트로 모든 cpu를 nocb로 지정한다.

bypass 콜백 리스트

커널 v5.4-rc1에서 각 cpu에 있는 seg 콜백 리스트에 3 군데 사용처(cb 호출, nocb gp 커널 스레드, nocb 커널 스레드)의 과도한 lock contention을 줄이기 위해 nocb_nobypass_lim_per_jiffy(디폴트: ms당 16개) 개 이상 유입되는 콜백을 등록하는 경우 일시적으로 bypass 콜백 리스트에 추가한 후 nocb_bypass 콜백 수가 qhimark를 초과하거나 새 틱으로 전환될 때 seg 콜백 리스트에 플러시하여 lock contention에 대한 부담을 줄여 성능을 높일 수 있도록 설계되었다.

참고: rcu/nocb: Add bypass callback queueing (2019, v5.4-rc1)

no-cb용 rcu 초기화

RCU NO-CB 설정

rcu_is_nocb_cpu()

kernel/rcu/tree_plugin.h

/* Is the specified CPU a no-CBs CPU? */

bool rcu_is_nocb_cpu(int cpu)
{
        if (have_rcu_nocb_mask)
                return cpumask_test_cpu(cpu, rcu_nocb_mask);
        return false;
}

요청한 cpu가 no-cb(rcu 스레드 사용)으로 설정되었는지 여부를 반환한다.

have_rcu_nocb_mask 변수는 rcu_nocb_setup() 함수가 호출되어 rcu_nocb_mask라는 cpu 마스크가 할당된 경우 true로 설정된다.
CONFIG_RCU_NOCB_CPU_ALL 커널 옵션을 사용하는 경우 항상 rcu 스레드에서 동작시키기 위해 true를 반환한다.
CONFIG_RCU_NOCB_CPU_ALL 및 CONFIG_RCU_NOCB_CPU 커널 옵션 둘 다 사용하지 않는 경우 항상 callback 처리하기 위해 false를 반환한다.

rcu_nocb_setup()

kernel/rcu/tree_plugin.h

/* Parse the boot-time rcu_nocb_mask CPU list from the kernel parameters. */

static int __init rcu_nocb_setup(char *str)
{
        alloc_bootmem_cpumask_var(&rcu_nocb_mask);
        have_rcu_nocb_mask = true;
        cpulist_parse(str, rcu_nocb_mask);
        return 1;
}
__setup("rcu_nocbs=", rcu_nocb_setup);

커널 파라메터 “rcu_nocbs=”에 cpu 리스트를 설정한다. 이렇게 설정된 cpu들은 rcu callback 처리를 rcu 스레드에서 처리할 수 있다.

예) rcu_nocbs=3-6,8-10

nocb용 gp 및 cb 커널 스레드의 구성

rcu_organize_nocb_kthreads()

kernel/rcu/tree_plugin.h

/*
 * Initialize GP-CB relationships for all no-CBs CPU.
 */

static void __init rcu_organize_nocb_kthreads(void)
{
        int cpu;
        bool firsttime = true;
        int ls = rcu_nocb_gp_stride;
        int nl = 0;  /* Next GP kthread. */
        struct rcu_data *rdp;
        struct rcu_data *rdp_gp = NULL;  /* Suppress misguided gcc warn. */
        struct rcu_data *rdp_prev = NULL;

        if (!cpumask_available(rcu_nocb_mask))
                return;
        if (ls == -1) {
                ls = nr_cpu_ids / int_sqrt(nr_cpu_ids);
                rcu_nocb_gp_stride = ls;
        }

        /*
         * Each pass through this loop sets up one rcu_data structure.
         * Should the corresponding CPU come online in the future, then
         * we will spawn the needed set of rcu_nocb_kthread() kthreads.
         */
        for_each_cpu(cpu, rcu_nocb_mask) {
                rdp = per_cpu_ptr(&rcu_data, cpu);
                if (rdp->cpu >= nl) {
                        /* New GP kthread, set up for CBs & next GP. */
                        nl = DIV_ROUND_UP(rdp->cpu + 1, ls) * ls;
                        rdp->nocb_gp_rdp = rdp;
                        rdp_gp = rdp;
                        if (!firsttime && dump_tree)
                                pr_cont("\n");
                        firsttime = false;
                        pr_alert("%s: No-CB GP kthread CPU %d:", __func__, cpu);
                } else {
                        /* Another CB kthread, link to previous GP kthread. */
                        rdp->nocb_gp_rdp = rdp_gp;
                        rdp_prev->nocb_next_cb_rdp = rdp;
                        pr_alert(" %d", cpu);
                }
                rdp_prev = rdp;
        }
}

no-cb용 gp 및 cb 커널 스레드를 구성한다.

코드 라인 11~12에서 커널 파라미터로 no-cb 지정된 cpu가 없는 경우 함수를 빠져나간다.
코드 라인 13~16에서 모듈 파라미터 rcu_nocb_gp_stride (디폴트=-1)가 아직 설정되지 않은 경우 다음과 같이 산출한 후 ls와 rcu_nocb_gp_stride에 대입한다.
- = cpu 수 / srqt(cpu 수)
코드 라인 23~41에서 no-cb용 cpu들을 순회하며 ls로 지정된 단위마다 각 첫 번째 cpu는 gp용 커널 스레드로 지정된다.

다음 그림은 cpu 수에 따라 구성되는 no-cb용 gp 커널 스레드와 cb 커널 스레드들을 보여준다.

NO-CB용 콜백 처리 커널 스레드

기존 커널에서 no-cb용 콜백 리스트를 별도로 분리 구성하여 사용했었는데 이를 없애고 cb용 세그먼티드 콜백리스트 방식을 사용하는 것으로 바꿔 지연도 없애며 OOM 발생 확률을 줄였다.

참고: rcu/nocb: Use rcu_segcblist for no-CBs CPUs (2019, v5.4-rc1)

rcu_nocb_cb_kthread()

kernel/rcu/tree_plugin.h

/*
 * Per-rcu_data kthread, but only for no-CBs CPUs.  Repeatedly invoke
 * nocb_cb_wait() to do the dirty work.
 */

static int rcu_nocb_cb_kthread(void *arg)
{
        struct rcu_data *rdp = arg;

        // Each pass through this loop does one callback batch, and,
        // if there are no more ready callbacks, waits for them.
        for (;;) {
                nocb_cb_wait(rdp);
                cond_resched_tasks_rcu_qs();
        }
        return 0;
}

cpu 마다 구성되는 no-cb용 콜백 처리 커널 스레드이다. 무한 루프를 돌며 대기 중인 콜백들을 처리한다.

nocb_cb_wait()

kernel/rcu/tree_plugin.h

/*
 * Invoke any ready callbacks from the corresponding no-CBs CPU,
 * then, if there are no more, wait for more to appear.
 */

static void nocb_cb_wait(struct rcu_data *rdp)
{
        unsigned long cur_gp_seq;
        unsigned long flags;
        bool needwake_gp = false;
        struct rcu_node *rnp = rdp->mynode;

        local_irq_save(flags);
        rcu_momentary_dyntick_idle();
        local_irq_restore(flags);
        local_bh_disable();
        rcu_do_batch(rdp);
        local_bh_enable();
        lockdep_assert_irqs_enabled();
        rcu_nocb_lock_irqsave(rdp, flags);
        if (rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq) &&
            rcu_seq_done(&rnp->gp_seq, cur_gp_seq) &&
            raw_spin_trylock_rcu_node(rnp)) { /* irqs already disabled. */
                needwake_gp = rcu_advance_cbs(rdp->mynode, rdp);
                raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
        }
        if (rcu_segcblist_ready_cbs(&rdp->cblist)) {
                rcu_nocb_unlock_irqrestore(rdp, flags);
                if (needwake_gp)
                        rcu_gp_kthread_wake();
                return;
        }

        trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("CBSleep"));
        WRITE_ONCE(rdp->nocb_cb_sleep, true);
        rcu_nocb_unlock_irqrestore(rdp, flags);
        if (needwake_gp)
                rcu_gp_kthread_wake();
        swait_event_interruptible_exclusive(rdp->nocb_cb_wq,
                                 !READ_ONCE(rdp->nocb_cb_sleep));
        if (!smp_load_acquire(&rdp->nocb_cb_sleep)) { /* VVV */
                /* ^^^ Ensure CB invocation follows _sleep test. */
                return;
        }
        WARN_ON(signal_pending(current));
        trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WokeEmpty"));
}

준비 완료된 콜백들을 호출하여 처리한다. 없는 경우 대기한다.

코드 라인 8~10에서 irq를 disable한 채로 rcu core가 긴급하게 qs 상태를 알아야 할 때 수행된다.
코드 라인 11~13에서 bh를 disable한 채로 준비 완료된 콜백들을 호출한다.
코드 라인 16~21에서 콜백리스트에 다음 gp를 기다리는 콜백들이 있고 gp 시퀀스도 완료된 경우해당 노드의 lock을 획득한 채로 콜백들을 advance(cascade) 처리한다.
코드 라인 22~27에서 준비 완료된 콜백들이 존재하는 경우 함수를 빠져나가며, 필요 시gp 커널 스레드를 깨운다.
코드 라인 30~35에서 rdp->nocb_cb_sleep에 true를 대입한 후 rdp->nocb_cb_sleep을 외부에서 false로 변경시킬 때까지 슬립한다.
코드 라인 36~39에서 rdp->nocb_cb_sleep에 대한 메모리 베리어 처리를 수행한다.

다음 그림은 커널 v5.4부터 no-cb용 콜백 처리 커널 스레드가 cb용 segmented 콜백 리스트를 그대로 사용하여 운영되는 모습을 보여준다.

추가로 nocb_bypass 콜백 리스트의 운용도 보여주고 있다.

다음 그림은 기존 커널 v5.3까지에서 no-cb용 콜백 처리 커널 스레드가 leader/follower로 구성되어 동작하는 모습이다.

no-cb용 gp 커널 스레드

기존 커널에서 leader 커널 스레드는 콜백과 gp를 관리하고, follower 커널 스레드들은 콜백만을 처리하는 구조였다. 그러나 수 많은 수의 콜백을 leader가 처리해야 하는 상황에서 많은 follower cpu들이 gp 완료까지 대기를 하느라 OOM이 발생하는 현상이 벌어져 새 커널에서는 gp 커널 스레드를 별도의 커널 스레드로 제공한다.

참고: rcu/nocb: Provide separate no-CBs grace-period kthreads (2019, v5.4-rc1)

rcu_nocb_gp_kthread()

kernel/rcu/tree_plugin.h

/*
 * No-CBs grace-period-wait kthread.  There is one of these per group
 * of CPUs, but only once at least one CPU in that group has come online
 * at least once since boot.  This kthread checks for newly posted
 * callbacks from any of the CPUs it is responsible for, waits for a
 * grace period, then awakens all of the rcu_nocb_cb_kthread() instances
 * that then have callback-invocation work to do.
 */

static int rcu_nocb_gp_kthread(void *arg)
{
        struct rcu_data *rdp = arg;

        for (;;) {
                WRITE_ONCE(rdp->nocb_gp_loops, rdp->nocb_gp_loops + 1);
                nocb_gp_wait(rdp);
                cond_resched_tasks_rcu_qs();
        }
        return 0;
}

cpu들을 대상으로 cpu수의 제곱근마다 그룹을 분리하고 각 그룹의 첫 번째 cpu에 구성하는 no-cb용 gp 커널 스레드이다. 무한 루프를 돌며 no-cb용 gp를 기다린다.

nocb_gp_wait()

kernel/rcu/tree_plugin.h -1/2-

/*
 * No-CBs GP kthreads come here to wait for additional callbacks to show up
 * or for grace periods to end.
 */

static void nocb_gp_wait(struct rcu_data *my_rdp)
{
        bool bypass = false;
        long bypass_ncbs;
        int __maybe_unused cpu = my_rdp->cpu;
        unsigned long cur_gp_seq;
        unsigned long flags;
        bool gotcbs;
        unsigned long j = jiffies;
        bool needwait_gp = false; // This prevents actual uninitialized use.
        bool needwake;
        bool needwake_gp;
        struct rcu_data *rdp;
        struct rcu_node *rnp;
        unsigned long wait_gp_seq = 0; // Suppress "use uninitialized" warning.

        /*
         * Each pass through the following loop checks for CBs and for the
         * nearest grace period (if any) to wait for next.  The CB kthreads
         * and the global grace-period kthread are awakened if needed.
         */
        for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_cb_rdp) {
                trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Check"));
                rcu_nocb_lock_irqsave(rdp, flags);
                bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
                if (bypass_ncbs &&
                    (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + 1) ||
                     bypass_ncbs > 2 * qhimark)) {
                        // Bypass full or old, so flush it.
                        (void)rcu_nocb_try_flush_bypass(rdp, j);
                        bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
                } else if (!bypass_ncbs && rcu_segcblist_empty(&rdp->cblist)) {
                        rcu_nocb_unlock_irqrestore(rdp, flags);
                        continue; /* No callbacks here, try next. */
                }
                if (bypass_ncbs) {
                        trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
                                            TPS("Bypass"));
                        bypass = true;
                }
                rnp = rdp->mynode;
                if (bypass) {  // Avoid race with first bypass CB.
                        WRITE_ONCE(my_rdp->nocb_defer_wakeup,
                                   RCU_NOCB_WAKE_NOT);
                        del_timer(&my_rdp->nocb_timer);
                }
                // Advance callbacks if helpful and low contention.
                needwake_gp = false;
                if (!rcu_segcblist_restempty(&rdp->cblist,
                                             RCU_NEXT_READY_TAIL) ||
                    (rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq) &&
                     rcu_seq_done(&rnp->gp_seq, cur_gp_seq))) {
                        raw_spin_lock_rcu_node(rnp); /* irqs disabled. */
                        needwake_gp = rcu_advance_cbs(rnp, rdp);
                        raw_spin_unlock_rcu_node(rnp); /* irqs disabled. */
                }
                // Need to wait on some grace period?
                WARN_ON_ONCE(!rcu_segcblist_restempty(&rdp->cblist,
                                                      RCU_NEXT_READY_TAIL));
                if (rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq)) {
                        if (!needwait_gp ||
                            ULONG_CMP_LT(cur_gp_seq, wait_gp_seq))
                                wait_gp_seq = cur_gp_seq;
                        needwait_gp = true;
                        trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
                                            TPS("NeedWaitGP"));
                }
                if (rcu_segcblist_ready_cbs(&rdp->cblist)) {
                        needwake = rdp->nocb_cb_sleep;
                        WRITE_ONCE(rdp->nocb_cb_sleep, false);
                        smp_mb(); /* CB invocation -after- GP end. */
                } else {
                        needwake = false;
                }
                rcu_nocb_unlock_irqrestore(rdp, flags);
                if (needwake) {
                        swake_up_one(&rdp->nocb_cb_wq);
                        gotcbs = true;
                }
                if (needwake_gp)
                        rcu_gp_kthread_wake();
        }

no-cb용 gp 커널 스레에서 호출되며 gp를 대기한다.

코드 라인 22~35에서 gp 커널 스레드가 관리하는 cpu들을 대상으로 순회하며 ->nocb_bypass 리스트에 콜백들이 있고, 1틱 이상 시간이 지났거나 bypass 대기 중인 콜백들이 qhimark(디폴트=10000)의 2배를 넘는 경우 이들을 flush 한다. 그렇지 않고 bypass 콜백도 없고 세그먼트 콜백 리스트도 비어 있는 경우 skip 한다.
코드 라인 36~46에서 여전히 bypass 콜백이 존재하는 경우 bypass를 true로 변경하고, no-cb 타이머를 제거한다.
코드 라인 48~56에서 NEXT 구간에 콜백들이 존재하거나 wait 구간에서 대기 중인 콜백의 gp 시퀀스가 이미 만료된 상태인 경우 콜백들을 advance(cascade) 처리한다.
코드 라인 60~67에서 대기 중인 콜백이 있는 경우 needwait_gp에 true를 대입한다. wait_gp_seq를 갱신하는데 wait 구간의 gp 시퀀스보다 작을 때에만 갱신한다.
코드 라인 68~81에서 준비 완료된 콜백들이 있는 경우 no-cb용 콜백 처리 커널 스레드를 깨우고, nocb_cb_sleep에 false를 대입하여 no-cb용 gp 커널 스레드를 깨운다. 그런 후 계속 다음 cpu를 처리하기 위해 루프를 돈다.

kernel/rcu/tree_plugin.h -2/2-

        my_rdp->nocb_gp_bypass = bypass;
        my_rdp->nocb_gp_gp = needwait_gp;
        my_rdp->nocb_gp_seq = needwait_gp ? wait_gp_seq : 0;
        if (bypass && !rcu_nocb_poll) {
                // At least one child with non-empty ->nocb_bypass, so set
                // timer in order to avoid stranding its callbacks.
                raw_spin_lock_irqsave(&my_rdp->nocb_gp_lock, flags);
                mod_timer(&my_rdp->nocb_bypass_timer, j + 2);
                raw_spin_unlock_irqrestore(&my_rdp->nocb_gp_lock, flags);
        }
        if (rcu_nocb_poll) {
                /* Polling, so trace if first poll in the series. */
                if (gotcbs)
                        trace_rcu_nocb_wake(rcu_state.name, cpu, TPS("Poll"));
                schedule_timeout_interruptible(1);
        } else if (!needwait_gp) {
                /* Wait for callbacks to appear. */
                trace_rcu_nocb_wake(rcu_state.name, cpu, TPS("Sleep"));
                swait_event_interruptible_exclusive(my_rdp->nocb_gp_wq,
                                !READ_ONCE(my_rdp->nocb_gp_sleep));
                trace_rcu_nocb_wake(rcu_state.name, cpu, TPS("EndSleep"));
        } else {
                rnp = my_rdp->mynode;
                trace_rcu_this_gp(rnp, my_rdp, wait_gp_seq, TPS("StartWait"));
                swait_event_interruptible_exclusive(
                        rnp->nocb_gp_wq[rcu_seq_ctr(wait_gp_seq) & 0x1],
                        rcu_seq_done(&rnp->gp_seq, wait_gp_seq) ||
                        !READ_ONCE(my_rdp->nocb_gp_sleep));
                trace_rcu_this_gp(rnp, my_rdp, wait_gp_seq, TPS("EndWait"));
        }
        if (!rcu_nocb_poll) {
                raw_spin_lock_irqsave(&my_rdp->nocb_gp_lock, flags);
                if (bypass)
                        del_timer(&my_rdp->nocb_bypass_timer);
                WRITE_ONCE(my_rdp->nocb_gp_sleep, true);
                raw_spin_unlock_irqrestore(&my_rdp->nocb_gp_lock, flags);
        }
        my_rdp->nocb_gp_seq = -1;
        WARN_ON(signal_pending(current));
}

코드 라인 1에서 no-cb용 bypass 콜백이 지난 gp에서 스캔되었느지 여부를 대입한다.
코드 라인 2에서 no-cb용 다음 gp를 대기해야 하는지 요청 여부를 지정한다.
코드 라인 3에서 no-cb용 gp 시퀀스 번호를 대입한다. 만일 다음 gp 요청이 없는 경우 0을 대입한다.
코드 라인 4~10에서 bypass 콜백이 있고 no-cb용 cb 커널 스레드가 폴링을 하지 않는 경우 no-cb용 bypass 타이머를 2틱 뒤의 시각으로 설정한다.
코드 라인 11~15에서 no-cb용 cb 커널 스레드가 폴링을 하는 경우 1틱 슬립한다.
코드 라인 16~21에서 gp 요청이 없는 경우 rdp->nocb_gp_sleep이 외부에서 false를 대입할 때까지 슬립하며 대기한다.
코드 라인 22~30에서 그 외의 경우 대기 중인 gp 시퀀스가 종료되거나 rdp->nocb_gp_sleep이 외부에서 false를 대입할 때까지 슬립하며 대기한다.
코드 라인 31~37에서 no-cb용 cb 커널 스레드가 폴링을 하지 않는 경우 ->nocb_gp_sleep에 true를 대입한다. 만일 bypass 콜백이 있는 경우 bypass 타이머를 제거한다.
코드 라인 38에서 no-cb용 gp 시퀀스에 -1을 대입한다.

no-cb용 gp wakeup 타이머

do_nocb_deferred_wakeup()

kernel/rcu/tree_plugin.h

/*
 * Do a deferred wakeup of rcu_nocb_kthread() from fastpath.
 * This means we do an inexact common-case check.  Note that if
 * we miss, ->nocb_timer will eventually clean things up.
 */

static void do_nocb_deferred_wakeup(struct rcu_data *rdp)
{
        if (rcu_nocb_need_deferred_wakeup(rdp))
                do_nocb_deferred_wakeup_common(rdp);
}

no-cb용 gp 커널스레드의 deferred wakeup 요청을 처리한다.

rcu_nocb_need_deferred_wakeup()

kernel/rcu/tree_plugin.h

/* Is a deferred wakeup of rcu_nocb_kthread() required? */

static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp)
{
        return READ_ONCE(rdp->nocb_defer_wakeup);
}

no-cb용 커널스레드의 deferred wakeup이 요청되었는지 여부를 반환한다.

deferred wakeup 요청 값

kernel/rcu/tree.h”

/* Values for nocb_defer_wakeup field in struct rcu_data. */

#define RCU_NOCB_WAKE_NOT       0
#define RCU_NOCB_WAKE           1
#define RCU_NOCB_WAKE_FORCE     2

RCU_NOCB_WAKE_NOT
- wake 요청하지 않은 상태
RCU_NOCB_WAKE
- wake 요청한 상태
RCU_NOCB_WAKE_FORCE
- wake 요청을 강제한 상태

deferred wakeup 타이머 핸들러

do_nocb_deferred_wakeup_timer()

kernel/rcu/tree_plugin.h

/* Do a deferred wakeup of rcu_nocb_kthread() from a timer handler. */

static void do_nocb_deferred_wakeup_timer(struct timer_list *t)
{
        struct rcu_data *rdp = from_timer(rdp, t, nocb_timer);

        do_nocb_deferred_wakeup_common(rdp);
}

no-cb용 커널스레드의 deferred wakeup이 요청된 경우 no-cb용 gp 커널 스레드를 깨운다.

do_nocb_deferred_wakeup_common()

kernel/rcu/tree_plugin.h

/* Do a deferred wakeup of rcu_nocb_kthread(). */

static void do_nocb_deferred_wakeup_common(struct rcu_data *rdp)
{
        unsigned long flags;
        int ndw;

        rcu_nocb_lock_irqsave(rdp, flags);
        if (!rcu_nocb_need_deferred_wakeup(rdp)) {
                rcu_nocb_unlock_irqrestore(rdp, flags);
                return;
        }
        ndw = READ_ONCE(rdp->nocb_defer_wakeup);
        WRITE_ONCE(rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
        wake_nocb_gp(rdp, ndw == RCU_NOCB_WAKE_FORCE, flags);
        trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("DeferredWake"));
}

no-cb용 커널스레드의 deferred wakeup이 요청된 경우 no-cb용 gp 커널 스레드를 깨운다.

코드 라인 7~10에서 no-cb용 커널스레드의 wakeup 유예가 요청된 경우가 아니면 함수를 빠져나간다.
코드 라인 11~12용 nocb_defer_wakeup 요청을 RCU_NOCB_WAKE_NOT(0)으로 클리어하고, 기존 값은 ndw로 알아온다.
코드 라인 13에서 gp 커널 스레드를 깨운다.
- 기존 값이 RCU_NOCB_WAKE_FORCE(2)인 경우 wakeup을 강제한다.

no-cb용 gp 커널 스레드 깨우기

wake_nocb_gp()

kernel/rcu/tree_plugin.h

/*
 * Kick the GP kthread for this NOCB group.  Caller holds ->nocb_lock
 * and this function releases it.
 */

static void wake_nocb_gp(struct rcu_data *rdp, bool force,
                           unsigned long flags)
        __releases(rdp->nocb_lock)
{
        bool needwake = false;
        struct rcu_data *rdp_gp = rdp->nocb_gp_rdp;

        lockdep_assert_held(&rdp->nocb_lock);
        if (!READ_ONCE(rdp_gp->nocb_gp_kthread)) {
                trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
                                    TPS("AlreadyAwake"));
                rcu_nocb_unlock_irqrestore(rdp, flags);
                return;
        }
        del_timer(&rdp->nocb_timer);
        rcu_nocb_unlock_irqrestore(rdp, flags);
        raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
        if (force || READ_ONCE(rdp_gp->nocb_gp_sleep)) {
                WRITE_ONCE(rdp_gp->nocb_gp_sleep, false);
                needwake = true;
                trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("DoWake"));
        }
        raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags);
        if (needwake)
                wake_up_process(rdp_gp->nocb_gp_kthread);
}

no-cb용 gp 커널 스레드를 깨운다.

코드 라인 6에서 현재 cpu에 대응하여 gp를 관리하는 cpu의 rdp를 알아온다.
코드 라인 9~14에서 gp용 cpu의 no-cb용 gp 커널 스레드가 아직 준비되지 않은 경우 함수를 빠져나간다.
코드 라인 15에서 no-cb용 타이머를 삭제한다.
코드 라인 18~25에서 입력 인자 @force가 요청되었거나 rdp_gp->nocb_gp_sleep이 true로 no-cb용 gp 커널 스레드가 슬립하며 대기 중인 경우 스레드를 깨운다.

no-cb용 gp 커널 스레드 지연시켜 깨우기

wake_nocb_gp_defer()

kernel/rcu/tree_plugin.h

/*
 * Arrange to wake the GP kthread for this NOCB group at some future
 * time when it is safe to do so.
 */

static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
                               const char *reason)
{
        if (rdp->nocb_defer_wakeup == RCU_NOCB_WAKE_NOT)
                mod_timer(&rdp->nocb_timer, jiffies + 1);
        if (rdp->nocb_defer_wakeup < waketype)
                WRITE_ONCE(rdp->nocb_defer_wakeup, waketype);
        trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, reason);
}

no-cb용 gp 커널 스레드를 @waketype에 맞춰 deferred wakeup 요청한다.

코드 라인 4~5에서 기존 설정 값(rdp->nocb_defer_wakeup)이 RCU_NOCB_WAKE_NOT(0) 상태에서 요청된 경우 1틱을 지연시켜 no-cb용 gp 커널 스레드를 깨우기 위해 타이머를 변경한다.
코드 라인 6~7에서 rdp->nocb_defer_wakeup 보다 @wakeup이 큰 경우에 한해 갱신한다.

no-cb용 bypass 타이머

bypass 타이머 핸들러

do_nocb_bypass_wakeup_timer()

kernel/rcu/tree_plugin.h

/* Wake up the no-CBs GP kthread to flush ->nocb_bypass. */

static void do_nocb_bypass_wakeup_timer(struct timer_list *t)
{
        unsigned long flags;
        struct rcu_data *rdp = from_timer(rdp, t, nocb_bypass_timer);

        trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Timer"));
        rcu_nocb_lock_irqsave(rdp, flags);
        smp_mb__after_spinlock(); /* Timer expire before wakeup. */
        __call_rcu_nocb_wake(rdp, true, flags);
}

nocb_bypass 리스트의 콜백들을 모두 처리하기 위해 nocb용 gp kthread를 깨운다.

__call_rcu_nocb_wake()

kernel/rcu/tree_plugin.h

/*
 * Awaken the no-CBs grace-period kthead if needed, either due to it
 * legitimately being asleep or due to overload conditions.
 *
 * If warranted, also wake up the kthread servicing this CPUs queues.
 */

static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
                                 unsigned long flags)
                                 __releases(rdp->nocb_lock)
{
        unsigned long cur_gp_seq;
        unsigned long j;
        long len;
        struct task_struct *t;

        // If we are being polled or there is no kthread, just leave.
        t = READ_ONCE(rdp->nocb_gp_kthread);
        if (rcu_nocb_poll || !t) {
                trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
                                    TPS("WakeNotPoll"));
                rcu_nocb_unlock_irqrestore(rdp, flags);
                return;
        }
        // Need to actually to a wakeup.
        len = rcu_segcblist_n_cbs(&rdp->cblist);
        if (was_alldone) {
                rdp->qlen_last_fqs_check = len;
                if (!irqs_disabled_flags(flags)) {
                        /* ... if queue was empty ... */
                        wake_nocb_gp(rdp, false, flags);
                        trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
                                            TPS("WakeEmpty"));
                } else {
                        wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE,
                                           TPS("WakeEmptyIsDeferred"));
                        rcu_nocb_unlock_irqrestore(rdp, flags);
                }
        } else if (len > rdp->qlen_last_fqs_check + qhimark) {
                /* ... or if many callbacks queued. */
                rdp->qlen_last_fqs_check = len;
                j = jiffies;
                if (j != rdp->nocb_gp_adv_time &&
                    rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq) &&
                    rcu_seq_done(&rdp->mynode->gp_seq, cur_gp_seq)) {
                        rcu_advance_cbs_nowake(rdp->mynode, rdp);
                        rdp->nocb_gp_adv_time = j;
                }
                smp_mb(); /* Enqueue before timer_pending(). */
                if ((rdp->nocb_cb_sleep ||
                     !rcu_segcblist_ready_cbs(&rdp->cblist)) &&
                    !timer_pending(&rdp->nocb_bypass_timer))
                        wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_FORCE,
                                           TPS("WakeOvfIsDeferred"));
                rcu_nocb_unlock_irqrestore(rdp, flags);
        } else {
                trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNot"));
                rcu_nocb_unlock_irqrestore(rdp, flags);
        }
        return;
}

nocb_bypass 리스트의 콜백들을 모두 처리하기 위해 nocb용 gp kthread를 깨운다.

코드 라인 11~17에서 아직 해당 cpu의 no-cb용 gp 커널 스레드가 지정되지 않은 경우 그냥 함수를 빠져나간다.
코드 라인 19에서 세그먼트 콜백 리스트의 전체 콜백 수를 len에 알아온다.
코드 라인 20~31에서 두 번째 인자 @was_alldone이 설정된 경우 nocb용 gp 커널 스레드를 깨운다. irq disable 상태에서는 1 틱 만큼 지연시켜 깨운다.
코드 라인 32~48에서 너무 많은 콜백이 처리를 기다리고 있는 중인 경우이다. nocb_gp_adv_time 틱이 흘러 변경되었거나 다음(wait) 구간의 gp 시퀀스가 완료된 경우 nocb_gp_adv_time 을 현재 시각으로 갱신하고 콜백을 advance(cascade) 처리한다. 만일 no-cb용 gp 커널 스레드가 슬립하여 대기중이거나 세그먼트 콜백 리스트에 완료된 콜백이 없는 경우이면서 bypass 타이머가 설정되어 있지 않은 경우 no-cb용 gp 커널 스레드를 wakeup 강제(force)한다.
코드 라인 49~52에서 그 외의 경우 nocb 언락한다.

bypass 리스트에 콜백 추가 시도

rcu_nocb_try_bypass()

kernel/rcu/tree_plugin.h -1/2-

/*
 * See whether it is appropriate to use the ->nocb_bypass list in order
 * to control contention on ->nocb_lock.  A limited number of direct
 * enqueues are permitted into ->cblist per jiffy.  If ->nocb_bypass
 * is non-empty, further callbacks must be placed into ->nocb_bypass,
 * otherwise rcu_barrier() breaks.  Use rcu_nocb_flush_bypass() to switch
 * back to direct use of ->cblist.  However, ->nocb_bypass should not be
 * used if ->cblist is empty, because otherwise callbacks can be stranded
 * on ->nocb_bypass because we cannot count on the current CPU ever again
 * invoking call_rcu().  The general rule is that if ->nocb_bypass is
 * non-empty, the corresponding no-CBs grace-period kthread must not be
 * in an indefinite sleep state.
 *
 * Finally, it is not permitted to use the bypass during early boot,
 * as doing so would confuse the auto-initialization code.  Besides
 * which, there is no point in worrying about lock contention while
 * there is only one CPU in operation.
 */

static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
                                bool *was_alldone, unsigned long flags)
{
        unsigned long c;
        unsigned long cur_gp_seq;
        unsigned long j = jiffies;
        long ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);

        if (!rcu_segcblist_is_offloaded(&rdp->cblist)) {
                *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
                return false; /* Not offloaded, no bypassing. */
        }
        lockdep_assert_irqs_disabled();

        // Don't use ->nocb_bypass during early boot.
        if (rcu_scheduler_active != RCU_SCHEDULER_RUNNING) {
                rcu_nocb_lock(rdp);
                WARN_ON_ONCE(rcu_cblist_n_cbs(&rdp->nocb_bypass));
                *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
                return false;
        }

        // If we have advanced to a new jiffy, reset counts to allow
        // moving back from ->nocb_bypass to ->cblist.
        if (j == rdp->nocb_nobypass_last) {
                c = rdp->nocb_nobypass_count + 1;
        } else {
                WRITE_ONCE(rdp->nocb_nobypass_last, j);
                c = rdp->nocb_nobypass_count - nocb_nobypass_lim_per_jiffy;
                if (ULONG_CMP_LT(rdp->nocb_nobypass_count,
                                 nocb_nobypass_lim_per_jiffy))
                        c = 0;
                else if (c > nocb_nobypass_lim_per_jiffy)
                        c = nocb_nobypass_lim_per_jiffy;
        }
        WRITE_ONCE(rdp->nocb_nobypass_count, c);

        // If there hasn't yet been all that many ->cblist enqueues
        // this jiffy, tell the caller to enqueue onto ->cblist.  But flush
        // ->nocb_bypass first.
        if (rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy) {
                rcu_nocb_lock(rdp);
                *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
                if (*was_alldone)
                        trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
                                            TPS("FirstQ"));
                WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j));
                WARN_ON_ONCE(rcu_cblist_n_cbs(&rdp->nocb_bypass));
                return false; // Caller must enqueue the callback.
        }

no-cb용 bypass 리스트에 콜백을 추가한다. 만일 성공하는 경우 true를 반환한다.

코드 라인 7에서 nocb_bypass 리스트에 존재하는 콜백들의 수를 ncbs에 알아온다.
코드 라인 9~12에서 오프로드되지 않은 경우 false를 반환한다. 또한 펜딩 콜백이 없는지 여부를 출력 인자 @was_alldone에 대입한다.
코드 라인 16~21에서 rcu 스케줄러가 아직 준비되지 않은 경우 false를 반환한다. 또한 펜딩 콜백이 없는지 여부를 출력 인자 @was_alldone에 대입한다.
코드 라인 25~36에서 새 틱에 진입할 때마다 틱당 nobypass 카운터 제한 수(nocb_nobypass_lim_per_jiffy) 이하로 제한한다.
- 틱 변화가 없는 경우엔 +1 증가시킨다.
- 새 틱에 진입한 경우 nocb_nobypass_count 를 0 ~ nocb_nobypass_lim_per_jiffy 값으로 제한하여 갱신한다.
코드 라인 41~50에서 nocb_nobypass_count 수가 틱당 제한 수 미만으로 적게 유입되는 경우 nocb_bypass 리스트에서 처리하지 않고 원래의 세그먼트 콜백 리스트에 추가하기 위해 false를 반환한다. 또한 펜딩 콜백이 없는지 여부를 출력 인자 @was_alldone에 대입한다.
- nocb_nobypass_lim_per_jiffy
  - 틱 당 유입되는 콜백 수로 이 값을 초과하는 경우에만 nocb_bypass 리스트에 추가한다.

kernel/rcu/tree_plugin.h -2/2-

        // If ->nocb_bypass has been used too long or is too full,
        // flush ->nocb_bypass to ->cblist.
        if ((ncbs && j != READ_ONCE(rdp->nocb_bypass_first)) ||
            ncbs >= qhimark) {
                rcu_nocb_lock(rdp);
                if (!rcu_nocb_flush_bypass(rdp, rhp, j)) {
                        *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
                        if (*was_alldone)
                                trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
                                                    TPS("FirstQ"));
                        WARN_ON_ONCE(rcu_cblist_n_cbs(&rdp->nocb_bypass));
                        return false; // Caller must enqueue the callback.
                }
                if (j != rdp->nocb_gp_adv_time &&
                    rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq) &&
                    rcu_seq_done(&rdp->mynode->gp_seq, cur_gp_seq)) {
                        rcu_advance_cbs_nowake(rdp->mynode, rdp);
                        rdp->nocb_gp_adv_time = j;
                }
                rcu_nocb_unlock_irqrestore(rdp, flags);
                return true; // Callback already enqueued.
        }

        // We need to use the bypass.
        rcu_nocb_wait_contended(rdp);
        rcu_nocb_bypass_lock(rdp);
        ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
        rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */
        rcu_cblist_enqueue(&rdp->nocb_bypass, rhp);
        if (!ncbs) {
                WRITE_ONCE(rdp->nocb_bypass_first, j);
                trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("FirstBQ"));
        }
        rcu_nocb_bypass_unlock(rdp);
        smp_mb(); /* Order enqueue before wake. */
        if (ncbs) {
                local_irq_restore(flags);
        } else {
                // No-CBs GP kthread might be indefinitely asleep, if so, wake.
                rcu_nocb_lock(rdp); // Rare during call_rcu() flood.
                if (!rcu_segcblist_pend_cbs(&rdp->cblist)) {
                        trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
                                            TPS("FirstBQwake"));
                        __call_rcu_nocb_wake(rdp, true, flags);
                } else {
                        trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
                                            TPS("FirstBQnoWake"));
                        rcu_nocb_unlock_irqrestore(rdp, flags);
                }
        }
        return true; // Callback already enqueued.
}

코드 라인 3~22에서 첫 콜백이 추가된 후 1 틱이 지났거나 nocb_bypas 콜백들이 qhimark(디폴트=10000) 보다 많은 경우 이들을 한꺼번에 세그먼트 콜백리스트에 플러시하고 함수를 true로 빠져나가는 것으로 매 콜백 건마다 수행되는 lock 컨텐션을 줄인다.
코드 라인 25~26에서 no-cb 락 contention이 없어질때까지 스핀하며 대기한 후 락을 획득한다.
코드 라인 27~34에서 nocb_bypass 리스트에 콜백을 하나 추가하고 언락한다. 만일 nocb_bypass 리스트에 콜백이 처음 추가된 경우 nocb_bypass_first에 현재 시각을 갱신한다.
코드 라인 36~50에서 nocb_bypass 리스트가 비어있는 상태에서 첫 콜백으로 추가되었고 세그먼트 콜백 리스트에서 대기중인 콜백이 없는 경우 no-cb용 gp 커널 스레드를 깨운다.
코드 라인 51에서 true를 반환한다.

다음 그림은 콜백을 offload 하여 운영하는 경우 과도하게 유입되는 콜백을 nocb_bypass 리스트에 추가하고, 1틱이 지났거나 10000개 이상 누적되는 경우 flush 하는 것으로 lock contention을 줄여 성능을 올리기 위해 사용하는 모습을 보여준다.

rcu_nocb_flush_bypass()

kernel/rcu/tree_plugin.h

/*
 * Flush the ->nocb_bypass queue into ->cblist, enqueuing rhp if non-NULL.
 * However, if there is a callback to be enqueued and if ->nocb_bypass
 * proves to be initially empty, just return false because the no-CB GP
 * kthread may need to be awakened in this case.
 *
 * Note that this function always returns true if rhp is NULL.
 */

static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
                                  unsigned long j)
{
        if (!rcu_segcblist_is_offloaded(&rdp->cblist))
                return true;
        rcu_lockdep_assert_cblist_protected(rdp);
        rcu_nocb_bypass_lock(rdp);
        return rcu_nocb_do_flush_bypass(rdp, rhp, j);
}

nocb_bypass 리스트의 콜백들을 모두 세그먼트 콜백리스트에 옮긴다.

코드 라인 4~5에서 콜백 오프로드되지 않은 경우 true를 반환한다.
코드 라인 8에서 nocb_bypass 리스트의 콜백들을 모두 세그먼트 콜백리스트에 옮긴다. 성공한 경우 true를 반환한다.

rcu_nocb_do_flush_bypass()

kernel/rcu/tree_plugin.h

/*
 * Flush the ->nocb_bypass queue into ->cblist, enqueuing rhp if non-NULL.
 * However, if there is a callback to be enqueued and if ->nocb_bypass
 * proves to be initially empty, just return false because the no-CB GP
 * kthread may need to be awakened in this case.
 *
 * Note that this function always returns true if rhp is NULL.
 */

static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
                                     unsigned long j)
{
        struct rcu_cblist rcl;

        WARN_ON_ONCE(!rcu_segcblist_is_offloaded(&rdp->cblist));
        rcu_lockdep_assert_cblist_protected(rdp);
        lockdep_assert_held(&rdp->nocb_bypass_lock);
        if (rhp && !rcu_cblist_n_cbs(&rdp->nocb_bypass)) {
                raw_spin_unlock(&rdp->nocb_bypass_lock);
                return false;
        }
        /* Note: ->cblist.len already accounts for ->nocb_bypass contents. */
        if (rhp)
                rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */
        rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp);
        rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rcl);
        WRITE_ONCE(rdp->nocb_bypass_first, j);
        rcu_nocb_bypass_unlock(rdp);
        return true;
}

nocb_bypass 리스트의 콜백들을 모두 세그먼트 콜백리스트에 옮긴다. 성공한 경우 true를 반환한다.

코드 라인 9~12에서 nocb_bypass 리스트에 콜백들이 하나도 없는 경우 false를 반환한다.
코드 라인 14~15에서 인자로 주어진 콜백 @rhp가 있는 경우 콜백 수를 1 증가시킨다.
코드 라인 16에서 nocb_bypass 리스트의 콜백을 내부 임시 리스트에 옮긴다. 이 때 콜백도 하나 추가한다.
코드 라인 17에서 내부 임시 리스트에 옮겨진 콜백들을 세그먼트 콜백리스트의 next 구간에 추가한다.
코드 라인 18에서 nocb_bypass_first에 현재 시각을 갱신한다.
코드 라인 20에서 정상적으로 옮겼으므로 true를 반환한다.

참고

RCU(Read Copy Update) -1- (Basic) | 문c
RCU(Read Copy Update) -2- (Callback process) | 문c
RCU(Read Copy Update) -3- (RCU threads) | 문c
RCU(Read Copy Update) -4- (NOCB process) | 문c – 현재글
RCU(Read Copy Update) -5- (Callback list) | 문c
RCU(Read Copy Update) -6- (Expedited GP) | 문c
RCU(Read Copy Update) -7- (Preemptible RCU) | 문c
rcu_init() | 문c
wait_for_completion() | 문c

Namespace

2018-01-112018-01-11 문영일 3 Comments

Namespace 관리 리소스

리눅스에서 namespace는 lightweight 가상화 솔루션이다. XEN이나 KVM 같은 가상화 솔루션들은 커널 인스턴스들을 생성하여 동작시키는 것에 반하여 리눅스의 namespace는 커널 인스턴스를 만들지 않고 기존의 리소스들을 필요한 만큼의 namespace로 분리하여 묶어 관리하는 방법으로 사용한다. 리눅스의 cgroup과 namespace를 사용하여 container를 생성하여 사용하는 LXC, 그리고 LXC를 사용하는 docker 솔루션 등이 구현되었다. 커널이 부팅된 후 관리 자원은 각각의 초기 디폴트 namespace에서 관리한다. 그런 후 사용자의 필요에 따라 namespace를 추가하여 자원들을 별도로 분리하여 관리할 수 있다. 관리 가능한 namespace 리소스들은 다음과 같다.

pid namespace
- 프로세스에 대응하는 pid
- CONFIG_PID_NS 커널 옵션 사용
- 참고
  - Separation Anxiety: A Tutorial for Isolating Your System with Linux Namespaces | toptal
uts namespace
- “uname” syscall 에서 보여주는 호스트명 및 커널 버전 정보
- CONFIG_UTS_NS 커널 옵션 사용
ipc namespace
- SYSVIPC와 posix mqueue
- CONFIG_IPC_NS 커널 옵션 사용
net namespace
- 네트워크 디바이스, IP, 라우팅 정보, IP 테이블 정보
- CONFIG_NET_NS 커널 옵션 사용
- 참고
  - Linux Switching – Interconnecting Namespaces | Open Cloud Blog
  - VRF & Linux Network Name Space | Ritesh Madapurath
  - VRF for Linux — a contribution to the Linux Kernel | Cumulus
user namespace
- 사용자 정보 (커널 v3.8에 소개되어 아직도 많은 리눅스 파일 시스템이 user namespace를 지원하지 않는다.)
- CONFIG_USER_NS 커널 옵션 사용
mount namespace
- 마운트 파일 시스템

Namespace 초기화

리눅스에서 user namespace를 제외한 다른 리소스들의 구현은 nsproxy를 통해 연결된다.

struct nsproxy init_nsproxy = {
        .count                  = ATOMIC_INIT(1),
        .uts_ns                 = &init_uts_ns,
#if defined(CONFIG_POSIX_MQUEUE) || defined(CONFIG_SYSVIPC)
        .ipc_ns                 = &init_ipc_ns,
#endif
        .mnt_ns                 = NULL,
        .pid_ns_for_children    = &init_pid_ns,
#ifdef CONFIG_NET
        .net_ns                 = &init_net,
#endif
};

nsproxy_cache_init()

kernel/nsproxy.c

int __init nsproxy_cache_init(void)
{
        nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC);
        return 0;
}

nsproxy 구조체를 할당해줄 수 있는 kmem 슬랩 캐시를 준비한다.

새로운 Namespace 생성

kernel/nsproxy.c

SYSCALL_DEFINE2(setns, int, fd, int, nstype)
{
        struct task_struct *tsk = current;
        struct nsproxy *new_nsproxy;
        struct file *file;
        struct ns_common *ns;
        int err;

        file = proc_ns_fget(fd);
        if (IS_ERR(file))
                return PTR_ERR(file);

        err = -EINVAL;
        ns = get_proc_ns(file_inode(file));
        if (nstype && (ns->ops->type != nstype))
                goto out;

        new_nsproxy = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs);
        if (IS_ERR(new_nsproxy)) {
                err = PTR_ERR(new_nsproxy);
                goto out;
        }

        err = ns->ops->install(new_nsproxy, ns);
        if (err) {
                free_nsproxy(new_nsproxy);
                goto out;
        }
        switch_task_namespaces(tsk, new_nsproxy);
out:
        fput(file);
        return err;
}

setns 라는 이름의 syscall을 호출하여 현재 태스크(스레드)를 새로 지정한 namespace에 연결시킨다.

코드 라인 8~11에서 proc 인터페이스에 해당하는 파일 디스크립터로 file 구조체 포인터를 알아온다.
- 예) “/proc/1/ns/mnt”
코드 라인 13~16에서 file 디스크립터에 연결된 ns_common 구조체 포인터인 ns를 얻어온다. 만일 nstype이 지정된 경우 file 디스크립터의 namespace type과 다른 경우 에러를 반환한다.
코드 라인 18~22에서 새로운 namespace에 현재 태스크를 지정한다.
코드 라인 24~28에서 해당 리소스의 namespace의 install에 등록된 콜백 함수를 호출하여 설치한다.
- pid: pidns_install()
- ipc: ipcns_install()
- net: netns_install()
- user: userns_install()
- mnt: mntns_install()
- uts: utsns_install()
코드 라인 29에서 태스크의 멤버 nsproxy 가 새로운 nsproxy로 연결하게 한다. 기존에 연결해두었던 nsproxy의 참조 카운터가 0이되면 해제(free)한다.

create_new_namespaces()

kernel/nsproxy.c

/*
 * Create new nsproxy and all of its the associated namespaces.
 * Return the newly created nsproxy.  Do not attach this to the task,
 * leave it to the caller to do proper locking and attach it to task.
 */
static struct nsproxy *create_new_namespaces(unsigned long flags,
        struct task_struct *tsk, struct user_namespace *user_ns,
        struct fs_struct *new_fs)
{
        struct nsproxy *new_nsp;
        int err;

        new_nsp = create_nsproxy();
        if (!new_nsp)
                return ERR_PTR(-ENOMEM);

        new_nsp->mnt_ns = copy_mnt_ns(flags, tsk->nsproxy->mnt_ns, user_ns, new_fs);
        if (IS_ERR(new_nsp->mnt_ns)) {
                err = PTR_ERR(new_nsp->mnt_ns);
                goto out_ns;
        }

        new_nsp->uts_ns = copy_utsname(flags, user_ns, tsk->nsproxy->uts_ns);
        if (IS_ERR(new_nsp->uts_ns)) {
                err = PTR_ERR(new_nsp->uts_ns);
                goto out_uts;
        }

        new_nsp->ipc_ns = copy_ipcs(flags, user_ns, tsk->nsproxy->ipc_ns);
        if (IS_ERR(new_nsp->ipc_ns)) {
                err = PTR_ERR(new_nsp->ipc_ns);
                goto out_ipc;
        }

        new_nsp->pid_ns_for_children =
                copy_pid_ns(flags, user_ns, tsk->nsproxy->pid_ns_for_children);
        if (IS_ERR(new_nsp->pid_ns_for_children)) {
                err = PTR_ERR(new_nsp->pid_ns_for_children);
                goto out_pid;
        }

        new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
        if (IS_ERR(new_nsp->net_ns)) {
                err = PTR_ERR(new_nsp->net_ns);
                goto out_net;
        }

        return new_nsp;

out_net:
        if (new_nsp->pid_ns_for_children)
                put_pid_ns(new_nsp->pid_ns_for_children);
out_pid:
        if (new_nsp->ipc_ns)
                put_ipc_ns(new_nsp->ipc_ns);
out_ipc:
        if (new_nsp->uts_ns)
                put_uts_ns(new_nsp->uts_ns);
out_uts:
        if (new_nsp->mnt_ns)
                put_mnt_ns(new_nsp->mnt_ns);
out_ns:
        kmem_cache_free(nsproxy_cachep, new_nsp);
        return ERR_PTR(err);
}

nsproxy를 새로 할당한 후 namespace들을 연결하고 nsproxy를 반환한다. 플래그에는 새롭게 생성할 namespace를 지정한 플래그 값을 사용할 수 있다.

코드 라인 13~15에서 nsproxy 구조체를 새로 할당해온다.
코드 라인 17~21에서 CLONE_NEWNS 플래그가 지정된 경우 새롭게 mnt_namespace 구조체를 할당하고, 플래그가 지정되지 않은 경우 기존 mnt_namespace를 알아온다. 그리고 요청한 태스크를 mnt_namespcae에 지정한다.
- namespace 생성 플래그:
  - CLONE_NEWNS
  - CLONE_NEWUTS
  - CLONE_NEWIPC
  - CLONE_NEWUSER
  - CLONE_NEWPID
  - CLONE_NEWNET
코드 라인 23~46에서 mnt namespace와 비슷한 방식으로 uts, ipc, pid, net namespace도 처리한다.
코드 라인 48에서 새롭게 만들어진 rxproxy 구조체 포인터를 반환한다.

create_nsproxy()

kernel/nsproxy.c

static inline struct nsproxy *create_nsproxy(void)
{
        struct nsproxy *nsproxy;

        nsproxy = kmem_cache_alloc(nsproxy_cachep, GFP_KERNEL);
        if (nsproxy)
                atomic_set(&nsproxy->count, 1);
        return nsproxy;
}

nsproxy 구조체를 할당해온다.

copy_pid_ns()

kernel/pid_namespace.c

struct pid_namespace *copy_pid_ns(unsigned long flags,
        struct user_namespace *user_ns, struct pid_namespace *old_ns)
{      
        if (!(flags & CLONE_NEWPID))
                return get_pid_ns(old_ns);
        if (task_active_pid_ns(current) != old_ns)
                return ERR_PTR(-EINVAL);
        return create_pid_namespace(user_ns, old_ns);
}

create_pid_namespace()

kernel/pid_namespace.c

static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns,
        struct pid_namespace *parent_pid_ns)
{
        struct pid_namespace *ns;
        unsigned int level = parent_pid_ns->level + 1;
        int i;
        int err;

        if (level > MAX_PID_NS_LEVEL) {
                err = -EINVAL;
                goto out;
        }

        err = -ENOMEM;
        ns = kmem_cache_zalloc(pid_ns_cachep, GFP_KERNEL);
        if (ns == NULL)
                goto out;

        ns->pidmap[0].page = kzalloc(PAGE_SIZE, GFP_KERNEL);
        if (!ns->pidmap[0].page)
                goto out_free;

        ns->pid_cachep = create_pid_cachep(level + 1);
        if (ns->pid_cachep == NULL)
                goto out_free_map;

        err = ns_alloc_inum(&ns->ns);
        if (err)
                goto out_free_map;
        ns->ns.ops = &pidns_operations;

        kref_init(&ns->kref);
        ns->level = level;
        ns->parent = get_pid_ns(parent_pid_ns);
        ns->user_ns = get_user_ns(user_ns);
        ns->nr_hashed = PIDNS_HASH_ADDING;
        INIT_WORK(&ns->proc_work, proc_cleanup_work);

        set_bit(0, ns->pidmap[0].page);
        atomic_set(&ns->pidmap[0].nr_free, BITS_PER_PAGE - 1);

        for (i = 1; i < PIDMAP_ENTRIES; i++)
                atomic_set(&ns->pidmap[i].nr_free, BITS_PER_PAGE);

        return ns;

out_free_map:
        kfree(ns->pidmap[0].page);
out_free:
        kmem_cache_free(pid_ns_cachep, ns);
out:
        return ERR_PTR(err);
}

새로 만들어지는 child pid namespace는 최대 32단계까지 생성가능하다. pid_namespace를 생성하고 초기화하고 내부 pidmap의 구성은 다음을 참고한다.

참고: pidmap_init() | 문c

Namespace 예)

아래 <pid> 값은 해당 태스크의 pid 숫자로 치환되어야 한다.

root@jake-VirtualBox:/proc/<pid>/ns# ls /proc/1/ns/ -la
합계 0
dr-x--x--x 2 root root 0  1월 11 11:16 .
dr-xr-xr-x 9 root root 0  1월  8 14:06 ..
lrwxrwxrwx 1 root root 0  1월 11 11:16 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 root root 0  1월 11 11:16 mnt -> mnt:[4026531840]
lrwxrwxrwx 1 root root 0  1월 11 11:16 net -> net:[4026531957]
lrwxrwxrwx 1 root root 0  1월 11 11:16 pid -> pid:[4026531836]
lrwxrwxrwx 1 root root 0  1월 11 11:16 user -> user:[4026531837]
lrwxrwxrwx 1 root root 0  1월 11 11:16 uts -> uts:[4026531838]

# unshare --help

Usage:
 unshare [options] <program> [<argument>...]

Run a program with some namespaces unshared from the parent.

Options:
 -m, --mount               unshare mounts namespace
 -u, --uts                 unshare UTS namespace (hostname etc)
 -i, --ipc                 unshare System V IPC namespace
 -n, --net                 unshare network namespace
 -p, --pid                 unshare pid namespace
 -U, --user                unshare user namespace
 -f, --fork                fork before launching <program>
     --mount-proc[=<dir>]  mount proc filesystem first (implies --mount)
 -r, --map-root-user       map current user to root (implies --user)
 -s, --setgroups allow|deny  control the setgroups syscall in user namespaces

 -h, --help     display this help and exit
 -V, --version  output version information and exit

참고

Docker vs LXC vs Hypervisor | 문c
Namespaces in operation, part 1: namespaces overview | LWN.net
Linux Namespaces | diyC
setns(2) | man7.org
Unshare(1) | man7.org
Linux unshare -m for per-process private filesystem mount points | End Point