문c 블로그

trace_init()

2016-09-22 문영일

trace_init()

kernel/trace/trace.c

void __init trace_init(void)
{
        if (tracepoint_printk) {
                tracepoint_print_iter =
                        kmalloc(sizeof(*tracepoint_print_iter), GFP_KERNEL);
                if (WARN_ON(!tracepoint_print_iter))
                        tracepoint_printk = 0;
        }
        tracer_alloc_buffers();
        trace_event_init();
}

Built only when the CONFIG_TRACING kernel option is set, this function initializes the trace buffers and trace events.

For performance reasons, avoid enabling this option when building a production kernel.

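if (tracepoint_printk) {
- True when a kernel parameter such as “tp_printk=on” was passed at boot.
tracepoint_print_iter = kmalloc(sizeof(*tracepoint_print_iter), GFP_KERNEL);
- Allocates space for a trace_iterator structure. If the allocation fails, WARN_ON() fires and tracepoint_printk is reset to 0.
tracer_alloc_buffers();
- Allocates and initializes the per-cpu trace buffers.
  - (omitted)
trace_event_init();
- Initializes event tracing when the CONFIG_EVENT_TRACING kernel option is used.
  - (omitted)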

“tp_printk=” kernel parameter

static int __init set_tracepoint_printk(char *str)
{
        if ((strcmp(str, "=0") != 0 && strcmp(str, "=off") != 0))
                tracepoint_printk = 1;
        return 1;
}
__setup("tp_printk", set_tracepoint_printk);

Unless the argument is “tp_printk=0” or “tp_printk=off”, the global tracepoint_printk is set to 1, which enables printing of tracepoint output.
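
The comparison against "=0" and "=off" makes more sense once you remember that a __setup() handler receives only the text that follows the registered name. Below is a minimal user-space sketch of that decision. The function name model_tp_printk is hypothetical and exists only for illustration; it is not a kernel API.

```c
#include <assert.h>
#include <string.h>

/*
 * User-space model of set_tracepoint_printk() (hypothetical helper).
 * str is whatever follows "tp_printk" on the command line:
 * "" for a bare "tp_printk", "=off" for "tp_printk=off", and so on.
 * Returns the resulting value of the tracepoint_printk flag.
 */
int model_tp_printk(const char *str)
{
        int tracepoint_printk = 0;      /* default: disabled */

        assert(str != NULL);
        /* Anything other than "=0" or "=off" turns the flag on. */
        if (strcmp(str, "=0") != 0 && strcmp(str, "=off") != 0)
                tracepoint_printk = 1;
        return tracepoint_printk;
}
```

Under this model a bare "tp_printk" (empty remainder) and "tp_printk=on" both enable the flag, while only the two exact strings "=0" and "=off" leave it disabled.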

rcu_init()

2016-09-22 (updated 2021-04-01) 문영일

RCU initialization

RCU is initialized by the rcu_init() and rcu_init_nohz() functions.

rcu_init()

kernel/rcu/tree.c

void __init rcu_init(void)
{
        int cpu;

        rcu_early_boot_tests();

        rcu_bootup_announce();
        rcu_init_geometry();
        rcu_init_one();
        if (dump_tree)
                rcu_dump_rcu_node_tree();
        if (use_softirq)
                open_softirq(RCU_SOFTIRQ, rcu_core_si);

        /*
         * We don't need protection against CPU-hotplug here because
         * this is called early in boot, before either interrupts
         * or the scheduler are operational.
         */
        pm_notifier(rcu_pm_notify, 0);
        for_each_online_cpu(cpu) {
                rcutree_prepare_cpu(cpu);
                rcu_cpu_starting(cpu);
                rcutree_online_cpu(cpu);
        }

        /* Create workqueue for expedited GPs and for Tree SRCU. */
        rcu_gp_wq = alloc_workqueue("rcu_gp", WQ_MEM_RECLAIM, 0);
        WARN_ON(!rcu_gp_wq);
        rcu_par_gp_wq = alloc_workqueue("rcu_par_gp", WQ_MEM_RECLAIM, 0);
        WARN_ON(!rcu_par_gp_wq);
        srcu_init();
}

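Initializes the RCU data structures so that RCU can be used, and registers a callback that is invoked on power-management (PM) state changes.

At code line 5, prints the “Running RCU self tests” message. If the module parameter “rcupdate.rcu_self_test” is set, one test callback is registered through both the call_rcu() and call_srcu() APIs so that the “RCU test callback executed %d” message is printed.
At code line 7, logs messages describing the active RCU configuration, starting with “Preemptible hierarchical RCU implementation.”
At code line 8, computes the settings that define the geometry of the rcu_node tree inside the rcu_state structure.
At code line 9, initializes and lays out every rcu_node and rcu_data under rcu_state.
At code lines 10~11, if the “dump_tree” kernel parameter is set, prints the “rcu_node tree layout dump” message followed by the node layout.
At code lines 12~13, if use_softirq is set, registers rcu_core_si() as the handler of the RCU softirq vector.
At code line 20, registers rcu_pm_notify() so that it is called on PM (power management) suspend/resume.
At code lines 21~25, iterates over every online cpu, initializing and starting its per-cpu RCU state.
At code lines 28~31, allocates the workqueues used for expedited GPs and for Tree SRCU.
At code line 32, initializes SRCU.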

Early-boot tests

rcu_early_boot_tests()

kernel/rcu/update.c

void rcu_early_boot_tests(void)
{
        pr_info("Running RCU self tests\n");

        if (rcu_self_test)
                early_boot_test_call_rcu();
        rcu_test_sync_prims();
}

Prints the “Running RCU self tests” message. If the module parameter “rcupdate.rcu_self_test” is set, one test callback is registered through both the call_rcu() and call_srcu() APIs so that the “RCU test callback executed %d” message is printed.

early_boot_test_call_rcu()

kernel/rcu/update.c

static void early_boot_test_call_rcu(void)
{
        static struct rcu_head head;
        static struct rcu_head shead;

        call_rcu(&head, test_callback);
        if (IS_ENABLED(CONFIG_SRCU))
                call_srcu(&early_srcu, &shead, test_callback);
}

test_callback()

kernel/rcu/update.c

static void test_callback(struct rcu_head *r)
{
        rcu_self_test_counter++;
        pr_info("RCU test callback executed %d\n", rcu_self_test_counter);
}

A test callback that prints the “RCU test callback executed %d” message.

rcu_test_sync_prims()

kernel/rcu/update.c

/*
 * Test each non-SRCU synchronous grace-period wait API.  This is
 * useful just after a change in mode for these primitives, and
 * during early boot.
 */

void rcu_test_sync_prims(void)
{
        if (!IS_ENABLED(CONFIG_PROVE_RCU))
                return;
        synchronize_rcu();
        synchronize_rcu_expedited();
}

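Calls the synchronous RCU primitives to check how they behave during early boot. This runs only when CONFIG_PROVE_RCU is enabled.

During early boot these calls must return without waiting for a grace period, effectively doing nothing.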

Boot-up announcement

rcu_bootup_announce()

kernel/rcu/tree_plugin.h

/*
 * Tell them what RCU they are running.
 */

static void __init rcu_bootup_announce(void)
{
        pr_info("Preemptible hierarchical RCU implementation.\n");
        rcu_bootup_announce_oddness();
}

Logs messages describing the active RCU configuration. The output starts with “Preemptible hierarchical RCU implementation.” and continues with further details.

rcu_bootup_announce_oddness()

kernel/rcu/tree_plugin.h

/*
 * Check the RCU kernel configuration parameters and print informative
 * messages about anything out of the ordinary.
 */

static void __init rcu_bootup_announce_oddness(void)
{
        if (IS_ENABLED(CONFIG_RCU_TRACE))
                pr_info("\tRCU event tracing is enabled.\n");
        if ((IS_ENABLED(CONFIG_64BIT) && RCU_FANOUT != 64) ||
            (!IS_ENABLED(CONFIG_64BIT) && RCU_FANOUT != 32))
                pr_info("\tCONFIG_RCU_FANOUT set to non-default value of %d.\n",
                        RCU_FANOUT);
        if (rcu_fanout_exact)
                pr_info("\tHierarchical RCU autobalancing is disabled.\n");
        if (IS_ENABLED(CONFIG_RCU_FAST_NO_HZ))
                pr_info("\tRCU dyntick-idle grace-period acceleration is enabled.\n");
        if (IS_ENABLED(CONFIG_PROVE_RCU))
                pr_info("\tRCU lockdep checking is enabled.\n");
        if (RCU_NUM_LVLS >= 4)
                pr_info("\tFour(or more)-level hierarchy is enabled.\n");
        if (RCU_FANOUT_LEAF != 16)
                pr_info("\tBuild-time adjustment of leaf fanout to %d.\n",
                        RCU_FANOUT_LEAF);
        if (rcu_fanout_leaf != RCU_FANOUT_LEAF)
                pr_info("\tBoot-time adjustment of leaf fanout to %d.\n",
                        rcu_fanout_leaf);
        if (nr_cpu_ids != NR_CPUS)
                pr_info("\tRCU restricting CPUs from NR_CPUS=%d to nr_cpu_ids=%u.\n", NR_CPUS, nr_cpu_ids);
#ifdef CONFIG_RCU_BOOST
        pr_info("\tRCU priority boosting: priority %d delay %d ms.\n",
                kthread_prio, CONFIG_RCU_BOOST_DELAY);
#endif
        if (blimit != DEFAULT_RCU_BLIMIT)
                pr_info("\tBoot-time adjustment of callback invocation limit to %ld.\n", blimit);
        if (qhimark != DEFAULT_RCU_QHIMARK)
                pr_info("\tBoot-time adjustment of callback high-water mark to %ld.\n", qhimark);
        if (qlowmark != DEFAULT_RCU_QLOMARK)
                pr_info("\tBoot-time adjustment of callback low-water mark to %ld.\n", qlowmark);
        if (jiffies_till_first_fqs != ULONG_MAX)
                pr_info("\tBoot-time adjustment of first FQS scan delay to %ld jiffies.\n", jiffies_till_first_fqs);
        if (jiffies_till_next_fqs != ULONG_MAX)
                pr_info("\tBoot-time adjustment of subsequent FQS scan delay to %ld jiffies.\n", jiffies_till_next_fqs);
        if (jiffies_till_sched_qs != ULONG_MAX)
                pr_info("\tBoot-time adjustment of scheduler-enlistment delay to %ld jiffies.\n", jiffies_till_sched_qs);
        if (rcu_kick_kthreads)
                pr_info("\tKick kthreads if too-long grace period.\n");
        if (IS_ENABLED(CONFIG_DEBUG_OBJECTS_RCU_HEAD))
                pr_info("\tRCU callback double-/use-after-free debug enabled.\n");
        if (gp_preinit_delay)
                pr_info("\tRCU debug GP pre-init slowdown %d jiffies.\n", gp_preinit_delay);
        if (gp_init_delay)
                pr_info("\tRCU debug GP init slowdown %d jiffies.\n", gp_init_delay);
        if (gp_cleanup_delay)
                pr_info("\tRCU debug GP init slowdown %d jiffies.\n", gp_cleanup_delay);
        if (!use_softirq)
                pr_info("\tRCU_SOFTIRQ processing moved to rcuc kthreads.\n");
        if (IS_ENABLED(CONFIG_RCU_EQS_DEBUG))
                pr_info("\tRCU debug extended QS entry/exit.\n");
        rcupdate_announce_bootup_oddness();
}

Prints the following log messages according to the kernel's RCU configuration.

If CONFIG_RCU_TRACE is set: “RCU event tracing is enabled.”
If CONFIG_RCU_FANOUT differs from its default (32-bit: 32, 64-bit: 64): “CONFIG_RCU_FANOUT set to non-default value of %d.”
If the rcu_fanout_exact module parameter is set: “Hierarchical RCU autobalancing is disabled.”
If CONFIG_RCU_FAST_NO_HZ is set: “RCU dyntick-idle grace-period acceleration is enabled.”
If CONFIG_PROVE_RCU is set: “RCU lockdep checking is enabled.”
If the rcu_node tree has 4 or more levels (1~4 levels are possible): “Four(or more)-level hierarchy is enabled.”
If CONFIG_RCU_FANOUT_LEAF differs from its default (16): “Build-time adjustment of leaf fanout to %d.”
If the rcu_fanout_leaf module parameter differs from RCU_FANOUT_LEAF: “Boot-time adjustment of leaf fanout to %d.”
If nr_cpu_ids differs from the NR_CPUS set at kernel compile time: “RCU restricting CPUs from NR_CPUS=%d to nr_cpu_ids=%u.”
If CONFIG_RCU_BOOST is set: “RCU priority boosting: priority %d delay %d ms.”
If blimit differs from DEFAULT_RCU_BLIMIT (10): “Boot-time adjustment of callback invocation limit to %ld.”
If qhimark differs from DEFAULT_RCU_QHIMARK (10000): “Boot-time adjustment of callback high-water mark to %ld.”
If qlowmark differs from DEFAULT_RCU_QLOMARK (100): “Boot-time adjustment of callback low-water mark to %ld.”
If jiffies_till_first_fqs is set: “Boot-time adjustment of first FQS scan delay to %ld jiffies.”
If jiffies_till_next_fqs is set: “Boot-time adjustment of subsequent FQS scan delay to %ld jiffies.”
If jiffies_till_sched_qs is set: “Boot-time adjustment of scheduler-enlistment delay to %ld jiffies.”
If rcu_kick_kthreads is set: “Kick kthreads if too-long grace period.”
If CONFIG_DEBUG_OBJECTS_RCU_HEAD is set: “RCU callback double-/use-after-free debug enabled.”
If gp_preinit_delay is set: “RCU debug GP pre-init slowdown %d jiffies.”
If gp_init_delay is set: “RCU debug GP init slowdown %d jiffies.”
If gp_cleanup_delay is set: “RCU debug GP init slowdown %d jiffies.” (the source above prints the same text as for gp_init_delay)
If use_softirq (default=1) is turned off: “RCU_SOFTIRQ processing moved to rcuc kthreads.”
If CONFIG_RCU_EQS_DEBUG is set: “RCU debug extended QS entry/exit.”

rcupdate_announce_bootup_oddness()

kernel/rcu/update.c

/*
 * Print any significant non-default boot-time settings.
 */

void __init rcupdate_announce_bootup_oddness(void)
{
        if (rcu_normal)
                pr_info("\tNo expedited grace period (rcu_normal).\n");
        else if (rcu_normal_after_boot)
                pr_info("\tNo expedited grace period (rcu_normal_after_boot).\n");
        else if (rcu_expedited)
                pr_info("\tAll grace periods are expedited (rcu_expedited).\n");
        if (rcu_cpu_stall_suppress)
                pr_info("\tRCU CPU stall warnings suppressed (rcu_cpu_stall_suppress).\n");
        if (rcu_cpu_stall_timeout != CONFIG_RCU_CPU_STALL_TIMEOUT)
                pr_info("\tRCU CPU stall warnings timeout set to %d (rcu_cpu_stall_timeout).\n", rcu_cpu_stall_timeout);
        rcu_tasks_bootup_oddness();
}

Prints the following log messages for the boot-time RCU settings.

If rcu_normal is set: “No expedited grace period (rcu_normal).”
Otherwise, if rcu_normal_after_boot is set: “No expedited grace period (rcu_normal_after_boot).”
Otherwise, if rcu_expedited is set: “All grace periods are expedited (rcu_expedited).”
If rcu_cpu_stall_suppress is set: “RCU CPU stall warnings suppressed (rcu_cpu_stall_suppress).”
If rcu_cpu_stall_timeout differs from CONFIG_RCU_CPU_STALL_TIMEOUT (21 seconds): “RCU CPU stall warnings timeout set to %d (rcu_cpu_stall_timeout).”

rcu_tasks_bootup_oddness()

kernel/rcu/update.c

/*
 * Print any non-default Tasks RCU settings.
 */

static void __init rcu_tasks_bootup_oddness(void)
{
#ifdef CONFIG_TASKS_RCU
        if (rcu_task_stall_timeout != RCU_TASK_STALL_TIMEOUT)
                pr_info("\tTasks-RCU CPU stall warnings timeout set to %d (rcu_task_stall_timeout).\n", rcu_task_stall_timeout);
        else
                pr_info("\tTasks RCU enabled.\n");
#endif /* #ifdef CONFIG_TASKS_RCU */
}

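Prints the following log messages for the Tasks RCU settings (only when CONFIG_TASKS_RCU is enabled).

If rcu_task_stall_timeout differs from RCU_TASK_STALL_TIMEOUT (600 seconds): “Tasks-RCU CPU stall warnings timeout set to %d (rcu_task_stall_timeout).”
Otherwise: “Tasks RCU enabled.”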

Power-management (PM) notification handling

rcu_pm_notify()

kernel/rcu/tree.c

/*
 * On non-huge systems, use expedited RCU grace periods to make suspend
 * and hibernation run faster.
 */

static int rcu_pm_notify(struct notifier_block *self,
                         unsigned long action, void *hcpu)
{
        switch (action) {
        case PM_HIBERNATION_PREPARE:
        case PM_SUSPEND_PREPARE:
                rcu_expedite_gp();
                break;
        case PM_POST_HIBERNATION:
        case PM_POST_SUSPEND:
                rcu_unexpedite_gp();
                break;
        default:
                break;
        }
        return NOTIFY_OK;
}

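When a power-management notification arrives, acts according to action: expedited grace periods are enabled before suspend or hibernation (rcu_expedite_gp()) and turned back off afterwards (rcu_unexpedite_gp()).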

Tree geometry initialization

rcu_init_geometry()

kernel/rcu/tree.c

/*
 * Compute the rcu_node tree geometry from kernel parameters.  This cannot
 * replace the definitions in tree.h because those are needed to size
 * the ->node array in the rcu_state structure.
 */

static void __init rcu_init_geometry(void)
{
        ulong d;
        int i;
        int rcu_capacity[RCU_NUM_LVLS];

        /*
         * Initialize any unspecified boot parameters.
         * The default values of jiffies_till_first_fqs and
         * jiffies_till_next_fqs are set to the RCU_JIFFIES_TILL_FORCE_QS
         * value, which is a function of HZ, then adding one for each
         * RCU_JIFFIES_FQS_DIV CPUs that might be on the system.
         */
        d = RCU_JIFFIES_TILL_FORCE_QS + nr_cpu_ids / RCU_JIFFIES_FQS_DIV;
        if (jiffies_till_first_fqs == ULONG_MAX)
                jiffies_till_first_fqs = d;
        if (jiffies_till_next_fqs == ULONG_MAX)
                jiffies_till_next_fqs = d;
        adjust_jiffies_till_sched_qs();

        /* If the compile-time values are accurate, just leave. */
        if (rcu_fanout_leaf == RCU_FANOUT_LEAF &&
            nr_cpu_ids == NR_CPUS)
                return;
        pr_info("Adjusting geometry for rcu_fanout_leaf=%d, nr_cpu_ids=%u\n",
                rcu_fanout_leaf, nr_cpu_ids);

        /*
         * The boot-time rcu_fanout_leaf parameter must be at least two
         * and cannot exceed the number of bits in the rcu_node masks.
         * Complain and fall back to the compile-time values if this
         * limit is exceeded.
         */
        if (rcu_fanout_leaf < 2 ||
            rcu_fanout_leaf > sizeof(unsigned long) * 8) {
                rcu_fanout_leaf = RCU_FANOUT_LEAF;
                WARN_ON(1);
                return;
        }

        /*
         * Compute number of nodes that can be handled an rcu_node tree
         * with the given number of levels.
         */
        rcu_capacity[0] = rcu_fanout_leaf;
        for (i = 1; i < RCU_NUM_LVLS; i++)
                rcu_capacity[i] = rcu_capacity[i - 1] * RCU_FANOUT;

        /*
         * The tree must be able to accommodate the configured number of CPUs.
         * If this limit is exceeded, fall back to the compile-time values.
         */
        if (nr_cpu_ids > rcu_capacity[RCU_NUM_LVLS - 1]) {
                rcu_fanout_leaf = RCU_FANOUT_LEAF;
                WARN_ON(1);
                return;
        }

        /* Calculate the number of levels in the tree. */
        for (i = 0; nr_cpu_ids > rcu_capacity[i]; i++) {
        }
        rcu_num_lvls = i + 1;

        /* Calculate the number of rcu_nodes at each level of the tree. */
        for (i = 0; i < rcu_num_lvls; i++) {
                int cap = rcu_capacity[(rcu_num_lvls - 1) - i];
                num_rcu_lvl[i] = DIV_ROUND_UP(nr_cpu_ids, cap);
        }

        /* Calculate the total number of rcu_node structures. */
        rcu_num_nodes = 0;
        for (i = 0; i < rcu_num_lvls; i++)
                rcu_num_nodes += num_rcu_lvl[i];
}

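Computes the tree geometry used to lay out the nodes.

Computes the globals jiffies_till_first_fqs and jiffies_till_next_fqs.
Computes the global rcu_num_lvls.
Fills the global num_rcu_lvl[] array with the number of rcu_node structures at each level.
Computes the global rcu_num_nodes.
- The sum of all num_rcu_lvl[] entries.

At code line 14, d starts from RCU_JIFFIES_TILL_FORCE_QS and gains one extra jiffy of delay for every 256 (RCU_JIFFIES_FQS_DIV) CPUs in nr_cpu_ids.
- rpi2: RCU_JIFFIES_TILL_FORCE_QS(1) + nr_cpu_ids(4) / RCU_JIFFIES_FQS_DIV(256) = 1
- RCU_JIFFIES_TILL_FORCE_QS is the default delay. It grows when the system HZ exceeds 250 (and again above 500), and further delay is added once nr_cpu_ids reaches 256 or more.
At code lines 15~16, if the module parameter jiffies_till_first_fqs was not specified, it is set to d.
At code lines 17~18, if the module parameter jiffies_till_next_fqs was not specified, it is set to d.
At code line 19, derives jiffies_to_sched_qs from the values computed above.
At code lines 22~24, if the module parameter rcu_fanout_leaf and nr_cpu_ids both still match their compile-time values, nothing has changed and the function returns.
At code lines 25~26, something did change, so the “Adjusting geometry for rcu_fanout_leaf=%d, nr_cpu_ids=%u” message is printed.
At code lines 34~39, if rcu_fanout_leaf is smaller than 2 or exceeds the number of bits in an unsigned long (32 on 32-bit, 64 on 64-bit), it is reset to CONFIG_RCU_FANOUT_LEAF (default=16) and the function returns.
At code lines 45~47, fills rcu_capacity[] with the maximum number of CPUs that a tree of each depth can handle.
At code lines 53~57, if nr_cpu_ids exceeds rcu_capacity[RCU_NUM_LVLS - 1], the capacity of the deepest tree allowed at build time, rcu_fanout_leaf is reset to CONFIG_RCU_FANOUT_LEAF (default=16) and the function returns.
At code lines 60~62, determines the number of tree levels.
At code lines 65~68, stores in num_rcu_lvl[] the number of rcu_node structures each level needs for nr_cpu_ids CPUs.
At code lines 71~73, sums num_rcu_lvl[] to obtain rcu_num_nodes.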

kernel/rcu/tree.h

#define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
                                        /* For jiffies_till_first_fqs and */
                                        /*  and jiffies_till_next_fqs. */

#define RCU_JIFFIES_FQS_DIV     256     /* Very large systems need more */
                                        /*  delay between bouts of */
                                        /*  quiescent-state forcing. */

RCU_JIFFIES_TILL_FORCE_QS
- The base FQS wait time, in ticks.
RCU_JIFFIES_FQS_DIV
- On systems with many CPUs, one extra tick of delay is added for every this many CPUs.
- e.g. cpus=512 -> 2 extra ticks of FQS delay
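
The two macros combine into the default FQS delay as sketched below. This is a user-space illustration only: HZ and nr_cpu_ids are passed in as parameters, and the function names fqs_base_delay and fqs_default_delay are hypothetical.

```c
#include <assert.h>

#define RCU_JIFFIES_FQS_DIV 256

/* Mirrors RCU_JIFFIES_TILL_FORCE_QS, with HZ supplied as a parameter. */
unsigned long fqs_base_delay(unsigned long hz)
{
        return 1 + (hz > 250) + (hz > 500);
}

/*
 * Default used for jiffies_till_first_fqs and jiffies_till_next_fqs
 * when neither was given at boot: the base delay plus one tick for
 * every RCU_JIFFIES_FQS_DIV CPUs.
 */
unsigned long fqs_default_delay(unsigned long hz, unsigned long nr_cpu_ids)
{
        assert(hz > 0);
        return fqs_base_delay(hz) + nr_cpu_ids / RCU_JIFFIES_FQS_DIV;
}
```

For example, HZ=100 with 4 CPUs gives 1 tick, while HZ=250 with 512 CPUs gives 1 + 2 = 3 ticks.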

Figure: how jiffies_till_first_fqs and jiffies_till_next_fqs are computed.

Figure: the rcu_capacity[] array after it has been computed. (32 bits)

Adjusting jiffies_till_sched_qs

adjust_jiffies_till_sched_qs()

kernel/rcu/tree.c

/*
 * Make sure that we give the grace-period kthread time to detect any
 * idle CPUs before taking active measures to force quiescent states.
 * However, don't go below 100 milliseconds, adjusted upwards for really
 * large systems.
 */

static void adjust_jiffies_till_sched_qs(void)
{
        unsigned long j;

        /* If jiffies_till_sched_qs was specified, respect the request. */
        if (jiffies_till_sched_qs != ULONG_MAX) {
                WRITE_ONCE(jiffies_to_sched_qs, jiffies_till_sched_qs);
                return;
        }
        /* Otherwise, set to third fqs scan, but bound below on large system. */
        j = READ_ONCE(jiffies_till_first_fqs) +
                      2 * READ_ONCE(jiffies_till_next_fqs);
        if (j < HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV)
                j = HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV;
        pr_info("RCU calculated value of scheduler-enlistment delay is %ld jiffies.\n", j);
        WRITE_ONCE(jiffies_to_sched_qs, j);
}

Computes the jiffies_to_sched_qs value.

At code lines 6~9, if the module parameter jiffies_till_sched_qs was specified, jiffies_to_sched_qs is set to that value and the function returns.
At code lines 11~16, writes jiffies_till_first_fqs + jiffies_till_next_fqs * 2 to jiffies_to_sched_qs, subject to the following lower bound.
- The number of ticks in 0.1 seconds + number of CPUs / 256
- rpi4 = 25
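
The selection logic above can be sketched in user space as follows. The function name sched_qs_delay is hypothetical, and the kernel globals (HZ, nr_cpu_ids and the three jiffies_* values) are passed in as parameters for illustration.

```c
#include <assert.h>
#include <limits.h>

/*
 * Models adjust_jiffies_till_sched_qs() (hypothetical helper).
 * requested == ULONG_MAX means "jiffies_till_sched_qs was not given".
 */
unsigned long sched_qs_delay(unsigned long requested, unsigned long hz,
                             unsigned long nr_cpu_ids,
                             unsigned long first_fqs, unsigned long next_fqs)
{
        unsigned long j;
        unsigned long lower_bound;

        assert(hz > 0);

        /* An explicit request is used as-is. */
        if (requested != ULONG_MAX)
                return requested;

        /* Otherwise aim for the third FQS scan... */
        j = first_fqs + 2 * next_fqs;

        /* ...but never go below about 100 ms, scaled up for big systems. */
        lower_bound = hz / 10 + nr_cpu_ids / 256;
        if (j < lower_bound)
                j = lower_bound;
        return j;
}
```

With HZ=250, 4 CPUs and both FQS delays at 1, the third-scan value is 3, so the lower bound of 25 wins. That matches the jiffies_to_sched_qs value of 25 reported for the rpi4.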

The following are the initial RCU module parameter values observed on an rpi4.

$ cat /sys/module/rcutree/parameters/jiffies_till_first_fqs
1
$ cat /sys/module/rcutree/parameters/jiffies_till_next_fqs
1
$ cat /sys/module/rcutree/parameters/jiffies_till_sched_qs
18446744073709551615
$ cat /sys/module/rcutree/parameters/jiffies_to_sched_qs
25
$ cat /sys/module/rcutree/parameters/rcu_fanout_leaf
16
$ cat /sys/module/rcutree/parameters/qlowmark
100
$ cat /sys/module/rcutree/parameters/qhimark
10000
$ cat /sys/module/rcutree/parameters/blimit
10
$ cat /sys/module/rcutree/parameters/kthread_prio
0
$ cat /sys/module/rcutree/parameters/gp_cleanup_delay
0
$ cat /sys/module/rcutree/parameters/gp_init_delay
0
$ cat /sys/module/rcutree/parameters/gp_preinit_delay
0
$ cat /sys/module/rcutree/parameters/rcu_divisor
7
$ cat /sys/module/rcutree/parameters/rcu_fanout_exact
N
$ cat /sys/module/rcutree/parameters/rcu_resched_ns
3000000
$ cat /sys/module/rcutree/parameters/sysrq_rcu
N
$ cat /sys/module/rcutree/parameters/use_softirq
Y

Build-time RCU tree layout

At build time, the size of the rcu_node array is determined from the number of CPUs given by NR_CPUS.

The rcu_node array is laid out statically inside the single rcu_state structure. (Older kernels used three rcu_state structures.)
Depending on NR_CPUS, the rcu_node array supports a tree of 1 to 4 levels.
The number of CPUs each tree depth can support is as follows.
- 32-bit systems (with CONFIG_RCU_FANOUT_LEAF=16, the default)
  - 1 level: up to 16 CPUs
  - 2 levels: up to 16x32 CPUs
  - 3 levels: up to 16x32x32 CPUs
  - 4 levels: up to 16x32x32x32 CPUs
- 64-bit systems (with CONFIG_RCU_FANOUT_LEAF=16, the default)
  - 1 level: up to 16 CPUs
  - 2 levels: up to 16x64 CPUs
  - 3 levels: up to 16x64x64 CPUs
  - 4 levels: up to 16x64x64x64 CPUs
CPU hotplug is supported, and the tree is designed so that the node configuration changes as CPU states change.
Each rcu_node can have at most CONFIG_RCU_FANOUT child rcu_node structures.
- Configurable from 2 to 32 on 32-bit systems and from 2 to 64 on 64-bit systems.
- default: 32 on 32-bit, 64 on 64-bit
Each bottom-level leaf node can be connected to up to rcu_fanout_leaf rcu_data (CPU) structures.
- CONFIG_RCU_FANOUT_LEAF
  - Manages 16 CPUs (rcu_data) by default and can be set anywhere from 2 to RCU_FANOUT.
  - The default of 16 is used to limit contention on each leaf node's lock.
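
The capacity figures in the list above all follow one formula: leaf fanout multiplied by the inner fanout once per additional level. A small sketch, where rcu_tree_max_cpus is a hypothetical helper name and not a kernel function:

```c
#include <assert.h>

/*
 * Maximum number of CPUs a tree with the given number of levels can
 * cover: fanout_leaf CPUs per leaf node, and fanout children for each
 * node above the leaves.
 */
long rcu_tree_max_cpus(int levels, int fanout_leaf, int fanout)
{
        long cap = fanout_leaf;
        int i;

        assert(levels >= 1 && levels <= 4);
        for (i = 1; i < levels; i++)
                cap *= fanout;
        return cap;
}
```

With the 64-bit defaults (leaf fanout 16, fanout 64) the four depths cover 16, 1024, 65536 and 4194304 CPUs respectively.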

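Figure: the difference between the minimum 1-level and the maximum 4-level layouts.

Figure: the number of CPUs that a maximum 4-level tree can manage.

Figure: how NR_CPUS determines, at compile time, the number of levels and the number of rcu_node structures at each level.

Figure: computing the number of rcu_node structures created when compiling with support for 20 CPUs.

Uses CONFIG_RCU_FANOUT=64, CONFIG_RCU_FANOUT_LEAF=16.

Figure: the order in which the rcu_node structures of a 4-level tree are laid out inside the rcu_state structure.

The rcu_data structures connect directly to the bottom-level leaf nodes.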

RCU structure initialization

rcu_init_one()

kernel/rcu/tree.c

/*
 * Helper function for rcu_init() that initializes the rcu_state structure.
 */

static void __init rcu_init_one(void)
{
        static const char * const buf[] = RCU_NODE_NAME_INIT;
        static const char * const fqs[] = RCU_FQS_NAME_INIT;
        static struct lock_class_key rcu_node_class[RCU_NUM_LVLS];
        static struct lock_class_key rcu_fqs_class[RCU_NUM_LVLS];

        int levelspread[RCU_NUM_LVLS];          /* kids/node in each level. */
        int cpustride = 1;
        int i;
        int j;
        struct rcu_node *rnp;

        BUILD_BUG_ON(RCU_NUM_LVLS > ARRAY_SIZE(buf));  /* Fix buf[] init! */

        /* Silence gcc 4.8 false positive about array index out of range. */
        if (rcu_num_lvls <= 0 || rcu_num_lvls > RCU_NUM_LVLS)
                panic("rcu_init_one: rcu_num_lvls out of range");

        /* Initialize the level-tracking arrays. */

        for (i = 1; i < rcu_num_lvls; i++)
                rcu_state.level[i] =
                        rcu_state.level[i - 1] + num_rcu_lvl[i - 1];
        rcu_init_levelspread(levelspread, num_rcu_lvl);

        /* Initialize the elements themselves, starting from the leaves. */

        for (i = rcu_num_lvls - 1; i >= 0; i--) {
                cpustride *= levelspread[i];
                rnp = rcu_state.level[i];
                for (j = 0; j < num_rcu_lvl[i]; j++, rnp++) {
                        raw_spin_lock_init(&ACCESS_PRIVATE(rnp, lock));
                        lockdep_set_class_and_name(&ACCESS_PRIVATE(rnp, lock),
                                                   &rcu_node_class[i], buf[i]);
                        raw_spin_lock_init(&rnp->fqslock);
                        lockdep_set_class_and_name(&rnp->fqslock,
                                                   &rcu_fqs_class[i], fqs[i]);
                        rnp->gp_seq = rcu_state.gp_seq;
                        rnp->gp_seq_needed = rcu_state.gp_seq;
                        rnp->completedqs = rcu_state.gp_seq;
                        rnp->qsmask = 0;
                        rnp->qsmaskinit = 0;
                        rnp->grplo = j * cpustride;
                        rnp->grphi = (j + 1) * cpustride - 1;
                        if (rnp->grphi >= nr_cpu_ids)
                                rnp->grphi = nr_cpu_ids - 1;
                        if (i == 0) {
                                rnp->grpnum = 0;
                                rnp->grpmask = 0;
                                rnp->parent = NULL;
                        } else {
                                rnp->grpnum = j % levelspread[i - 1];
                                rnp->grpmask = BIT(rnp->grpnum);
                                rnp->parent = rcu_state.level[i - 1] +
                                              j / levelspread[i - 1];
                        }
                        rnp->level = i;
                        INIT_LIST_HEAD(&rnp->blkd_tasks);
                        rcu_init_one_nocb(rnp);
                        init_waitqueue_head(&rnp->exp_wq[0]);
                        init_waitqueue_head(&rnp->exp_wq[1]);
                        init_waitqueue_head(&rnp->exp_wq[2]);
                        init_waitqueue_head(&rnp->exp_wq[3]);
                        spin_lock_init(&rnp->exp_lock);
                }
        }

        init_swait_queue_head(&rcu_state.gp_wq);
        init_swait_queue_head(&rcu_state.expedited_wq);
        rnp = rcu_first_leaf_node();
        for_each_possible_cpu(i) {
                while (i > rnp->grphi)
                        rnp++;
                per_cpu_ptr(&rcu_data, i)->mynode = rnp;
                rcu_boot_init_percpu_data(i);
        }
}

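Initializes the rcu_node and rcu_data structures that belong to the single rcu_state. (Older kernels had three rcu_state structures, so this function was called three times.)

At code lines 17~18, panics unless the rcu_node hierarchy depth is within the allowed 1~4 levels (at most RCU_NUM_LVLS).
At code lines 22~24, points each level[] entry at the first rcu_node of that level.
At code line 25, computes the number of child nodes managed at each RCU level. When the module parameter rcu_fanout_exact (default=0) is 0, the children are spread to match nr_cpu_ids.
At code lines 29~67, initializes the nodes from the leaf level up to the root.
- grplo and grphi hold the range of CPU numbers that each node manages.
At code lines 69~70, initializes the two swait queues.
At code lines 71~77, initializes the rcu_data of each possible cpu and points its ->mynode at the leaf node responsible for it.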
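
The grplo/grphi computation and the ->mynode lookup can be modeled in user space as below. This is a sketch only: struct cpu_range, node_cpu_range() and leaf_index_for_cpu() are hypothetical names, and cpustride is passed in directly, whereas the kernel accumulates it from levelspread[] while walking up from the leaves.

```c
#include <assert.h>

/* Hypothetical pair standing in for rnp->grplo and rnp->grphi. */
struct cpu_range {
        int grplo;
        int grphi;
};

/* CPU range of the j-th node on a level whose nodes span cpustride CPUs. */
struct cpu_range node_cpu_range(int j, int cpustride, int nr_cpu_ids)
{
        struct cpu_range r;

        assert(j >= 0 && cpustride > 0 && nr_cpu_ids > 0);
        r.grplo = j * cpustride;
        r.grphi = (j + 1) * cpustride - 1;
        /* The last node is clamped to the highest valid CPU id. */
        if (r.grphi >= nr_cpu_ids)
                r.grphi = nr_cpu_ids - 1;
        return r;
}

/* Index of the leaf node that owns cpu, found by walking the leaves. */
int leaf_index_for_cpu(int cpu, int cpustride, int nr_cpu_ids)
{
        int j = 0;

        assert(cpu >= 0 && cpu < nr_cpu_ids);
        while (cpu > node_cpu_range(j, cpustride, nr_cpu_ids).grphi)
                j++;
        return j;
}
```

For 20 CPUs with a leaf stride of 16, leaf 0 covers CPUs 0~15 and leaf 1 covers CPUs 16~19 (clamped from 31), so CPU 16 maps to leaf 1.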

Figure: how the grplo and grphi of an rcu_node are computed.

Figure: how the grpnum, grpmask and level of an rcu_node are computed.

Figure: an example rcu_node configuration with 3 levels, shown as a tree.

rcu_init_levelspread()

kernel/rcu/rcu.h

/*
 * Compute the per-level fanout, either using the exact fanout specified
 * or balancing the tree, depending on the rcu_fanout_exact boot parameter.
 */

static inline void rcu_init_levelspread(int *levelspread, const int *levelcnt)
{
        int i;

        if (rcu_fanout_exact) {
                levelspread[rcu_num_lvls - 1] = rcu_fanout_leaf;
                for (i = rcu_num_lvls - 2; i >= 0; i--)
                        levelspread[i] = RCU_FANOUT;
        } else {
                int ccur;
                int cprv;

                cprv = nr_cpu_ids;
                for (i = rcu_num_lvls - 1; i >= 0; i--) {
                        ccur = levelcnt[i];
                        levelspread[i] = (cprv + ccur - 1) / ccur;
                        cprv = ccur;
                }
        }
}

Computes the number of child nodes managed at each RCU level. With rcu_fanout_exact=0 (the default), the nodes are spread out to match nr_cpu_ids in order to reduce node lock contention.

At code lines 5~8, if the module parameter rcu_fanout_exact is set, the leaf level uses rcu_fanout_leaf (default=16) and every other level uses RCU_FANOUT (default=32 on 32-bit, 64 on 64-bit).
At code lines 9~19, otherwise the number of children per node at each level is balanced to match nr_cpu_ids.

Figure: how the module parameter rcu_fanout_exact affects the way nodes are spread.

rcu_init_one_nocb()

kernel/rcu/tree_plugin.h

static void rcu_init_one_nocb(struct rcu_node *rnp)
{
        init_waitqueue_head(&rnp->nocb_gp_wq[0]);
        init_waitqueue_head(&rnp->nocb_gp_wq[1]);
}

When the CONFIG_RCU_NOCB_CPU kernel option is used, initializes the two wait queues in nocb_gp_wq[].

rcu_boot_init_percpu_data()

kernel/rcu/tree.c

/*
 * Do boot-time initialization of a CPU's per-CPU RCU data.
 */
static void __init
rcu_boot_init_percpu_data(int cpu)
{
        struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);

        /* Set up local state, ensuring consistent view of global state. */
        rdp->grpmask = leaf_node_cpu_bit(rdp->mynode, cpu);
        WARN_ON_ONCE(rdp->dynticks_nesting != 1);
        WARN_ON_ONCE(rcu_dynticks_in_eqs(rcu_dynticks_snap(rdp)));
        rdp->rcu_ofl_gp_seq = rcu_state.gp_seq;
        rdp->rcu_ofl_gp_flags = RCU_GP_CLEANED;
        rdp->rcu_onl_gp_seq = rcu_state.gp_seq;
        rdp->rcu_onl_gp_flags = RCU_GP_CLEANED;
        rdp->cpu = cpu;
        rcu_boot_init_nocb_percpu_data(rdp);
}

Initializes, at boot time, the members of each CPU's rcu_data structure.

Setting up callback-processing kernel threads for nohz and no-cb

rcu_init_nohz()

kernel/rcu/tree_plugin.h

void __init rcu_init_nohz(void)
{
        int cpu;
        bool need_rcu_nocb_mask = false;
        struct rcu_data *rdp;

#if defined(CONFIG_NO_HZ_FULL)
        if (tick_nohz_full_running && cpumask_weight(tick_nohz_full_mask))
                need_rcu_nocb_mask = true;
#endif /* #if defined(CONFIG_NO_HZ_FULL) */

        if (!cpumask_available(rcu_nocb_mask) && need_rcu_nocb_mask) {
                if (!zalloc_cpumask_var(&rcu_nocb_mask, GFP_KERNEL)) {
                        pr_info("rcu_nocb_mask allocation failed, callback offloading disabled.\n");
                        return;
                }
        }
        if (!cpumask_available(rcu_nocb_mask))
                return;

#if defined(CONFIG_NO_HZ_FULL)
        if (tick_nohz_full_running)
                cpumask_or(rcu_nocb_mask, rcu_nocb_mask, tick_nohz_full_mask);
#endif /* #if defined(CONFIG_NO_HZ_FULL) */

        if (!cpumask_subset(rcu_nocb_mask, cpu_possible_mask)) {
                pr_info("\tNote: kernel parameter 'rcu_nocbs=', 'nohz_full', or 'isolcpus=' contains nonexistent CPUs.\n");
                cpumask_and(rcu_nocb_mask, cpu_possible_mask,
                            rcu_nocb_mask);
        }
        if (cpumask_empty(rcu_nocb_mask))
                pr_info("\tOffload RCU callbacks from CPUs: (none).\n");
        else
                pr_info("\tOffload RCU callbacks from CPUs: %*pbl.\n",
                        cpumask_pr_args(rcu_nocb_mask));
        if (rcu_nocb_poll)
                pr_info("\tPoll for callbacks from no-CBs CPUs.\n");

        for_each_cpu(cpu, rcu_nocb_mask) {
                rdp = per_cpu_ptr(&rcu_data, cpu);
                if (rcu_segcblist_empty(&rdp->cblist))
                        rcu_segcblist_init(&rdp->cblist);
                rcu_segcblist_offload(&rdp->cblist);
        }
        rcu_organize_nocb_kthreads();
}

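Performs the initialization needed for RCU nohz handling.

At code lines 7~10, if CPUs were specified with the “nohz_full=” kernel parameter, sets the local variable need_rcu_nocb_mask to true.
At code lines 12~19, if the rcu_nocb_mask bitmask (normally set by the “rcu_nocbs=” kernel parameter) has not been allocated but is needed, allocates it. If no mask is available after that, the function returns.
At code lines 21~24, if nohz full is running on the system, adds the nohz full CPUs to rcu_nocb_mask.
At code lines 26~30, if some no-cb CPUs are not possible CPUs, prints a note and removes them from the rcu_nocb_mask bitmask, keeping only possible CPUs.
At code lines 31~35, prints the offloaded (no-cb) CPUs.
At code lines 36~37, if the “rcu_nocb_poll” kernel parameter is set, prints a message that no-cb CPUs are polled for callbacks.
At code lines 39~44, sets offloaded=1 on the callback list of each offloaded CPU.
At code line 45, for each no-cb CPU, designates the CPU on which its no-cb GP kernel thread runs.
- In no-cb mode the CPUs are managed in groups, and the representative CPU of each group also creates a no-cb GP kernel thread.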

wait_for_completion()

2016-09-08 (updated 2017-12-04) 문영일

Figure: the processing flow of wait_for_completion(), which waits for the work-completion signal, and complete(), which sends it.

Declaration and initialization

DECLARE_COMPLETION()

include/linux/completion.h

/**
 * DECLARE_COMPLETION - declare and initialize a completion structure
 * @work:  identifier for the completion structure
 *
 * This macro declares and initializes a completion structure. Generally used
 * for static declarations. You should use the _ONSTACK variant for automatic
 * variables.
 */
#define DECLARE_COMPLETION(work) \
        struct completion work = COMPLETION_INITIALIZER(work)

Declares and initializes a completion structure with the given name.

e.g. DECLARE_COMPLETION(abc)
- Declares and initializes a completion structure named abc

COMPLETION_INITIALIZER()

include/linux/completion.h

#define COMPLETION_INITIALIZER(work) \
        { 0, __WAIT_QUEUE_HEAD_INITIALIZER((work).wait) }

Initializes the FIFO wait queue of the completion structure with the given name and sets the initial value of the done member to 0.

APIs

wait_for_completion()

kernel/sched/completion.c

/**     
 * wait_for_completion: - waits for completion of a task
 * @x:  holds the state of this particular completion
 *
 * This waits to be signaled for completion of a specific task. It is NOT
 * interruptible and there is no timeout.
 *
 * See also similar routines (i.e. wait_for_completion_timeout()) with timeout
 * and interrupt capability. Also see complete().
 */
void __sched wait_for_completion(struct completion *x)
{
        wait_for_common(x, MAX_SCHEDULE_TIMEOUT, TASK_UNINTERRUPTIBLE);
}
EXPORT_SYMBOL(wait_for_completion);

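Waits for the task to complete. It sleeps in the TASK_UNINTERRUPTIBLE state with no timeout (MAX_SCHEDULE_TIMEOUT), so it cannot be interrupted by a signal, and it returns no value.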

wait_for_common()

kernel/sched/completion.c

static long __sched
wait_for_common(struct completion *x, long timeout, int state)
{
        return __wait_for_common(x, schedule_timeout, timeout, state);
}

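Waits up to the given time for the task to complete. On completion it returns the remaining timeout in jiffies as a positive integer (at least 1); it returns 0 if the timeout expired and -ERESTARTSYS if it was interrupted by a signal.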

__wait_for_common()

kernel/sched/completion.c

static inline long __sched
__wait_for_common(struct completion *x,
                  long (*action)(long), long timeout, int state)
{
        might_sleep();

        spin_lock_irq(&x->wait.lock);
        timeout = do_wait_for_common(x, action, timeout, state);
        spin_unlock_irq(&x->wait.lock);
        return timeout;
}

do_wait_for_common()

kernel/sched/completion.c

static inline long __sched
do_wait_for_common(struct completion *x,
                   long (*action)(long), long timeout, int state)
{
        if (!x->done) {
                DECLARE_WAITQUEUE(wait, current);

                __add_wait_queue_tail_exclusive(&x->wait, &wait);
                do {
                        if (signal_pending_state(state, current)) {
                                timeout = -ERESTARTSYS;
                                break;
                        }
                        __set_current_state(state);
                        spin_unlock_irq(&x->wait.lock);
                        timeout = action(timeout);
                        spin_lock_irq(&x->wait.lock);
                } while (!x->done && timeout);
                __remove_wait_queue(&x->wait, &wait);
                if (!x->done)
                        return timeout;
        }
        x->done--;
        return timeout ?: 1;
}

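Adds the current task to the wait queue of the completion structure and waits for the completion signal.

if (!x->done) {
- If the completion has not been signaled yet
DECLARE_WAITQUEUE(wait, current);
- Creates a wait node for the current task.
__add_wait_queue_tail_exclusive(&x->wait, &wait);
- Adds the wait node just created to the tail of the wait queue of the completion structure passed in x, with the exclusive flag set.
do { if (signal_pending_state(state, current)) { timeout = -ERESTARTSYS; break; }
- Loops, and leaves the loop if a signal is pending for the given state.
__set_current_state(state); spin_unlock_irq(&x->wait.lock); timeout = action(timeout); spin_lock_irq(&x->wait.lock);
- Sets the state of the current task to state, releases the lock, and calls the function passed in action; the lock is taken again when it returns.
  - schedule_timeout() is used as the default function.
} while (!x->done && timeout);
- Keeps looping unless the completion signal has been received or the timeout has expired.
__remove_wait_queue(&x->wait, &wait);
- Removes the wait node that was added from the wait queue.
if (!x->done) return timeout;
- If the completion signal has not been received, returns the timeout value.
x->done--; return timeout ?: 1;
- Decrements the completion counter done by 1 and returns the timeout value, or 1 if it is 0.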
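
The return value rules at the end of do_wait_for_common() can be checked in isolation. The following is a minimal user-space sketch, not kernel code: wait_result_model() is a hypothetical name, and the wait loop itself is left out so that only the final bookkeeping is modelled.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of the tail of do_wait_for_common(): "done" is the
 * completion counter and "timeout" is the value the wait loop ended with
 * (remaining jiffies, 0 on timeout, or a negative error code).
 */
static long wait_result_model(unsigned int *done, long timeout)
{
        assert(done != NULL);

        if (!*done)
                return timeout;         /* timed out (0) or interrupted (< 0) */

        (*done)--;                      /* consume one completion */
        return timeout ? timeout : 1;   /* never report 0 on success */
}
```

A caller can therefore treat a negative result as an interruption, 0 as a timeout, and any positive value as success.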

complete()

kernel/sched/completion.c

/**
 * complete: - signals a single thread waiting on this completion
 * @x:  holds the state of this particular completion
 *
 * This will wake up a single thread waiting on this completion. Threads will be
 * awakened in the same order in which they were queued.
 *
 * See also complete_all(), wait_for_completion() and related routines.
 *
 * It may be assumed that this function implies a write memory barrier before
 * changing the task state if and only if any tasks are woken up.
 */
void complete(struct completion *x)
{
        unsigned long flags;

        spin_lock_irqsave(&x->wait.lock, flags);
        x->done++;
        __wake_up_locked(&x->wait, TASK_NORMAL, 1);
        spin_unlock_irqrestore(&x->wait.lock, flags);
}
EXPORT_SYMBOL(complete);

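Wakes up one task waiting for the completion and sends it the completion signal so that it leaves the waiting function (wait_for_completion()).

If more than one thread is registered on the wait queue and all of them need to be woken, use the complete_all() function instead.
TASK_NORMAL:
- TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE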
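
The done counter protocol shared by complete() and wait_for_completion() can be imitated in user space with a mutex and a condition variable. This is a sketch under assumptions: all names ending in _model are hypothetical, and pthread_cond_signal() does not guarantee the FIFO wake-up order that the kernel wait queue provides.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical user-space analogue of struct completion. */
struct completion_model {
        unsigned int done;              /* number of unconsumed completions */
        pthread_mutex_t lock;
        pthread_cond_t wait;
};

static void completion_model_init(struct completion_model *x)
{
        x->done = 0;
        pthread_mutex_init(&x->lock, NULL);
        pthread_cond_init(&x->wait, NULL);
}

/* Like complete(): record one completion and wake one waiter. */
static void complete_model(struct completion_model *x)
{
        pthread_mutex_lock(&x->lock);
        x->done++;
        pthread_cond_signal(&x->wait);
        pthread_mutex_unlock(&x->lock);
}

/* Like wait_for_completion(): sleep until done != 0, then consume it. */
static void wait_for_completion_model(struct completion_model *x)
{
        pthread_mutex_lock(&x->lock);
        while (!x->done)
                pthread_cond_wait(&x->wait, &x->lock);
        assert(x->done > 0);
        x->done--;
        pthread_mutex_unlock(&x->lock);
}
```

Because done is a counter rather than a flag, a completion that is signaled before anyone waits is not lost: the later waiter sees done != 0 and returns immediately.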

__wake_up_locked()

kernel/sched/wait.c

/*
 * Same as __wake_up but called with the spinlock in wait_queue_head_t held.
 */
void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr)
{
        __wake_up_common(q, mode, nr, 0, NULL); 
}
EXPORT_SYMBOL_GPL(__wake_up_locked);

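Wakes up tasks sleeping on the wait queue. The caller must already hold the spinlock of the wait queue; complete() passes nr = 1, so one exclusive waiter is woken.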

__wake_up_common()

kernel/sched/wait.c

/*
 * The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
 * wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
 * number) then we wake all the non-exclusive tasks and one exclusive task.
 *
 * There are circumstances in which we can try to wake a task which has already
 * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
 * zero in this (rare) case, and we handle it by continuing to scan the queue.
 */
static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
                        int nr_exclusive, int wake_flags, void *key)
{
        wait_queue_t *curr, *next;

        list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
                unsigned flags = curr->flags;

                if (curr->func(curr, mode, wake_flags, key) &&
                                (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
                        break;
        }
}

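Walks the entries waiting on the wait queue (for example, threads waiting for a completion) and calls the wake-up function (func) of each one. All non-exclusive entries are woken, and entries with the exclusive flag are woken only up to nr_exclusive of them.

func()
- For wait nodes created with DECLARE_WAITQUEUE(), as in do_wait_for_common(), this is default_wake_function().
- schedule_timeout() and io_schedule_timeout() are not wake-up functions; they are the action functions passed to __wait_for_common().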
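
The exclusive wake-up rule is easy to misread, so here is a small user-space sketch of the loop. It assumes that every wake-up call succeeds, and the names waiter_model, wake_up_model() and WQ_FLAG_EXCLUSIVE_MODEL are hypothetical stand-ins, not kernel identifiers.

```c
#include <assert.h>
#include <stddef.h>

#define WQ_FLAG_EXCLUSIVE_MODEL 0x01    /* hypothetical stand-in for WQ_FLAG_EXCLUSIVE */

struct waiter_model {
        unsigned int flags;     /* WQ_FLAG_EXCLUSIVE_MODEL or 0 */
        int woken;              /* set to 1 once the waiter is woken */
};

/*
 * Walk the queue in order, wake every entry, and stop after nr_exclusive
 * exclusive entries have been woken. nr_exclusive == 0 wakes everything.
 * Returns the number of entries woken.
 */
static size_t wake_up_model(struct waiter_model *q, size_t n, int nr_exclusive)
{
        size_t i, woken = 0;

        assert(q != NULL || n == 0);
        for (i = 0; i < n; i++) {
                q[i].woken = 1;         /* stands in for curr->func(...) */
                woken++;
                if ((q[i].flags & WQ_FLAG_EXCLUSIVE_MODEL) && !--nr_exclusive)
                        break;
        }
        return woken;
}
```

With a queue of one non-exclusive entry followed by two exclusive ones and nr_exclusive = 1, the first two entries are woken and the walk stops there.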

Structures

completion structure

include/linux/completion.h

/*
 * struct completion - structure used to maintain state for a "completion"
 *
 * This is the opaque structure used to maintain the state for a "completion".
 * Completions currently use a FIFO to queue threads that have to wait for
 * the "completion" event.
 *
 * See also:  complete(), wait_for_completion() (and friends _timeout,
 * _interruptible, _interruptible_timeout, and _killable), init_completion(),
 * reinit_completion(), and macros DECLARE_COMPLETION(),
 * DECLARE_COMPLETION_ONSTACK().
 */
struct completion {
        unsigned int done;
        wait_queue_head_t wait;
};

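done
- Initial value is 0; each complete() increments it by 1, and each waiter that consumes a completion decrements it by 1
wait
- FIFO wait queue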

idr_init_cache()

2016-08-26 문영일

idr_init_cache()

lib/idr.c

void __init idr_init_cache(void)
{
        idr_layer_cache = kmem_cache_create("idr_layer_cache",
                                sizeof(struct idr_layer), 0, SLAB_PANIC, NULL);
}

Creates a dedicated slab cache for allocating idr_layer structures.

From then on, every idr_layer structure is allocated from idr_layer_cache.

References

IDR (integer ID management) | 문c

IDR (integer ID management)

2016-08-26 문영일

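IDR uses a radix tree to manage integer IDs and returns the pointer value associated with each ID. The features of the Linux IDR are as follows.

ID management
- The radix tree lets each additional layer manage 256 (0x100) times as many IDs.
  - On a 32-bit system, depending on the number of layers used:
    - 1 layer: manages IDs 0 ~ 0xff
    - 2 layers: manages IDs 0 ~ 0xffff
    - 3 layers: manages IDs 0 ~ 0xffffff
    - 4 layers: manages IDs 0 ~ 0x7fffffff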
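  - On a 64-bit system, depending on the number of layers used:
    - 1 layer: manages IDs 0 ~ 0xff
    - 2 layers: manages IDs 0 ~ 0xffff
    - 3 layers: manages IDs 0 ~ 0xffffff
    - 4 layers: manages IDs 0 ~ 0xffffffff
    - 5 layers: manages IDs 0 ~ 0xff_ffffffff
    - 6 layers: manages IDs 0 ~ 0xffff_ffffffff
    - 7 layers: manages IDs 0 ~ 0xffffff_ffffffff
    - 8 layers: manages IDs 0 ~ 0x7fffffff_ffffffff
    - Note that the idr_alloc() and idr_max() code quoted in this post works with int IDs and caps the shift at MAX_IDR_SHIFT, so the IDs that code can return stop at INT_MAX (0x7fffffff).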
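- Requesting a large ID number increases the number of layers and therefore costs more.
IDR preload buffer
- idr_layer structures are allocated from a slab cache; a few are allocated in advance so that they can be supplied quickly when the IDR layers grow horizontally or vertically.

The following figure shows how IDR is managed layer by layer.

The following figure shows IDs 256~511 and ID 1025 allocated and managed.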

Static IDR declaration and initialization

DEFINE_IDR()

include/linux/idr.h

#define DEFINE_IDR(name)        struct idr name = IDR_INIT(name)

Declares and initializes an idr structure with the given name.

IDR_INIT()

include/linux/idr.h

#define IDR_INIT(name)                                                  \
{                                                                       \
        .lock                   = __SPIN_LOCK_UNLOCKED(name.lock),      \
}

Initializes the idr structure with the given name.

Dynamic IDR initialization

idr_init()

/**     
 * idr_init - initialize idr handle
 * @idp:        idr handle
 *
 * This function is use to set up the handle (@idp) that you will pass
 * to the rest of the functions.
 */
void idr_init(struct idr *idp)
{
        memset(idp, 0, sizeof(struct idr));
        spin_lock_init(&idp->lock);
}
EXPORT_SYMBOL(idr_init);

Initializes all members of the idr structure to 0 and then initializes only the lock member as a spinlock.

IDR allocation

The old way of allocating an ID used the following two APIs back to back.
- int idr_pre_get(struct idr *idp, gfp_t gfp_mask);
- int idr_get_new(struct idr *idp, void *ptr, int *id);

The new way of allocating an ID uses the following three APIs back to back.
- void idr_preload(gfp_t gfp_mask);
- int idr_alloc(struct idr *idp, void *ptr, int start, int end, gfp_t gfp_mask);
- void idr_preload_end(void);
- Applied since kernel v3.9-rc1 in 2013
  - See: idr: implement idr_preload[_end]() and idr_alloc()

idr_preload()

lib/idr.c

/**
 * idr_preload - preload for idr_alloc()
 * @gfp_mask: allocation mask to use for preloading
 *
 * Preload per-cpu layer buffer for idr_alloc().  Can only be used from
 * process context and each idr_preload() invocation should be matched with
 * idr_preload_end().  Note that preemption is disabled while preloaded.
 *
 * The first idr_alloc() in the preloaded section can be treated as if it
 * were invoked with @gfp_mask used for preloading.  This allows using more
 * permissive allocation masks for idrs protected by spinlocks.
 *
 * For example, if idr_alloc() below fails, the failure can be treated as
 * if idr_alloc() were called with GFP_KERNEL rather than GFP_NOWAIT.
 *
 *      idr_preload(GFP_KERNEL);
 *      spin_lock(lock);
 *
 *      id = idr_alloc(idr, ptr, start, end, GFP_NOWAIT);
 *
 *      spin_unlock(lock);
 *      idr_preload_end();
 *      if (id < 0)
 *              error;
 */
void idr_preload(gfp_t gfp_mask)
{
        /*
         * Consuming preload buffer from non-process context breaks preload
         * allocation guarantee.  Disallow usage from those contexts.
         */
        WARN_ON_ONCE(in_interrupt());
        might_sleep_if(gfp_mask & __GFP_WAIT);

        preempt_disable();

        /*
         * idr_alloc() is likely to succeed w/o full idr_layer buffer and
         * return value from idr_alloc() needs to be checked for failure
         * anyway.  Silently give up if allocation fails.  The caller can
         * treat failures from idr_alloc() as if idr_alloc() were called
         * with @gfp_mask which should be enough.
         */
        while (__this_cpu_read(idr_preload_cnt) < MAX_IDR_FREE) {
                struct idr_layer *new;

                preempt_enable();
                new = kmem_cache_zalloc(idr_layer_cache, gfp_mask);
                preempt_disable();
                if (!new)
                        break;

                /* link the new one to per-cpu preload list */
                new->ary[0] = __this_cpu_read(idr_preload_head);
                __this_cpu_write(idr_preload_head, new);
                __this_cpu_inc(idr_preload_cnt);
        }
}
EXPORT_SYMBOL(idr_preload);

Preallocates 8 (on a 32-bit system) idr_layer entries in the idr preload buffer. The preload buffer prepares in advance the idr_layer structures that idr_alloc() needs so that they can be supplied as quickly as possible; since idr_alloc() runs inside a preemption-disabled section, this keeps that section as short as possible.

might_sleep_if(gfp_mask & __GFP_WAIT);
- If the __GFP_WAIT flag is given, the allocation may sleep, so this annotates that the function can sleep at this point.
preempt_disable();
- From here on, the task must not be preempted.
while (__this_cpu_read(idr_preload_cnt) < MAX_IDR_FREE) {
- Keeps looping while idr_preload_cnt on the current cpu is less than MAX_IDR_FREE (twice the maximum level).
  - The goal is to preallocate MAX_IDR_FREE (twice the maximum level) idr_layer entries.
preempt_enable(); new = kmem_cache_zalloc(idr_layer_cache, gfp_mask); preempt_disable(); if (!new) break;
- Allocates an idr_layer structure from idr_layer_cache with preemption enabled, and leaves the loop if the allocation fails.
new->ary[0] = __this_cpu_read(idr_preload_head); __this_cpu_write(idr_preload_head, new);
- Adds the newly allocated idr_layer structure to the idr preload buffer list.
  - In the idr_preload_head list, each idr_layer links to the next entry through ary[0].
__this_cpu_inc(idr_preload_cnt);
- Increments idr_preload_cnt since an entry was added.
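
The preload buffer is a singly linked stack threaded through ary[0]. The following user-space sketch shows only that list handling; it assumes a buffer limit of 8, uses calloc() in place of the slab cache, ignores the per-cpu and preemption aspects, and all names ending in _model or starting with preload_ are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_IDR_FREE_MODEL 8    /* assumed value: 2 * 4 levels */

/* Hypothetical cut-down idr_layer: only the link slot is modelled. */
struct layer_model {
        struct layer_model *ary[1];
};

struct preload_model {
        struct layer_model *head;       /* like idr_preload_head */
        int cnt;                        /* like idr_preload_cnt */
};

/* Top the buffer up to MAX_IDR_FREE_MODEL entries, linking via ary[0]. */
static void preload_fill(struct preload_model *pb)
{
        while (pb->cnt < MAX_IDR_FREE_MODEL) {
                struct layer_model *new = calloc(1, sizeof(*new));

                if (!new)
                        break;          /* silently give up on failure */
                new->ary[0] = pb->head;
                pb->head = new;
                pb->cnt++;
        }
}

/* Take one entry off the buffer, or return NULL if it is empty. */
static struct layer_model *preload_pop(struct preload_model *pb)
{
        struct layer_model *p = pb->head;

        if (p) {
                pb->head = p->ary[0];
                pb->cnt--;
                assert(pb->cnt >= 0);
                p->ary[0] = NULL;       /* hand out a clean entry */
        }
        return p;
}
```

Calling preload_fill() again after a pop only allocates the missing entries, which mirrors how a later idr_preload() call refills the buffer.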

The following figure shows the idr_preload() function having preallocated 8 (on a 32-bit system) idr_layer structures in the idr preload buffer.

idr_preload_end()

include/linux/idr.h

/**
 * idr_preload_end - end preload section started with idr_preload()
 *
 * Each idr_preload() should be matched with an invocation of this
 * function.  See idr_preload() for details.
 */
static inline void idr_preload_end(void)
{
        preempt_enable();
}

Since idr_alloc() has finished, preemption is enabled again, putting the task back into a preemptible state.

idr_alloc()

lib/idr.c

/**
 * idr_alloc - allocate new idr entry
 * @idr: the (initialized) idr
 * @ptr: pointer to be associated with the new id
 * @start: the minimum id (inclusive)
 * @end: the maximum id (exclusive, <= 0 for max)
 * @gfp_mask: memory allocation flags
 *
 * Allocate an id in [start, end) and associate it with @ptr.  If no ID is
 * available in the specified range, returns -ENOSPC.  On memory allocation
 * failure, returns -ENOMEM. 
 *
 * Note that @end is treated as max when <= 0.  This is to always allow
 * using @start + N as @end as long as N is inside integer range.
 *
 * The user is responsible for exclusively synchronizing all operations
 * which may modify @idr.  However, read-only accesses such as idr_find()
 * or iteration can be performed under RCU read lock provided the user
 * destroys @ptr in RCU-safe way after removal from idr.
 */
int idr_alloc(struct idr *idr, void *ptr, int start, int end, gfp_t gfp_mask)
{
        int max = end > 0 ? end - 1 : INT_MAX;  /* inclusive upper limit */
        struct idr_layer *pa[MAX_IDR_LEVEL + 1];
        int id;

        might_sleep_if(gfpflags_allow_blocking(gfp_mask));

        /* sanity checks */
        if (WARN_ON_ONCE(start < 0))
                return -EINVAL;
        if (unlikely(max < start))
                return -ENOSPC;

        /* allocate id */
        id = idr_get_empty_slot(idr, start, pa, gfp_mask, NULL);
        if (unlikely(id < 0))
                return id;
        if (unlikely(id > max))
                return -ENOSPC;

        idr_fill_slot(idr, ptr, id, pa);
        return id;
}
EXPORT_SYMBOL_GPL(idr_alloc);

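Finds a free id in the integer range start ~ (end-1), stores ptr there, and returns the id. If end is 0 or less, the upper limit is INT_MAX, the largest int value.

might_sleep_if(gfpflags_allow_blocking(gfp_mask));
- If gfp_mask allows blocking, the allocation may sleep, so this annotates that the function can sleep at this point.
id = idr_get_empty_slot(idr, start, pa, gfp_mask, NULL);
- Finds and returns a free ID at or above start. If the layers need to grow in the process, they are created.
if (unlikely(id < 0)) return id;
- A negative value means no ID could be allocated, so it is returned as the error.
if (unlikely(id > max)) return -ENOSPC;
- If no ID can be allocated within the requested range, the -ENOSPC error is returned to say there is no space.
idr_fill_slot(idr, ptr, id, pa);
- Updates each layer that corresponds to the id.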
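
The range handling of idr_alloc() ([start, end), with end <= 0 meaning no upper limit) can be tried out with a toy allocator. This sketch replaces the radix tree with a flat 16-slot array, so it only illustrates the argument checks and error codes; toy_idr, toy_idr_alloc() and TOY_IDS are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>

#define TOY_IDS 16      /* hypothetical tiny ID space for the sketch */

struct toy_idr {
        void *slot[TOY_IDS];    /* NULL means the ID is free */
};

/* Allocate the lowest free ID in [start, end); end <= 0 means "no limit". */
static int toy_idr_alloc(struct toy_idr *idr, void *ptr, int start, int end)
{
        int max = end > 0 ? end - 1 : INT_MAX;  /* inclusive upper limit */
        int id;

        assert(ptr != NULL);    /* NULL would look like a free slot here */
        if (start < 0)
                return -EINVAL;
        if (max < start)
                return -ENOSPC;

        for (id = start; id < TOY_IDS && id <= max; id++) {
                if (!idr->slot[id]) {
                        idr->slot[id] = ptr;
                        return id;
                }
        }
        return -ENOSPC;
}
```

Requesting [5, 6) twice returns 5 the first time and -ENOSPC the second time, because the only ID in that range is already taken.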

The following figure shows IDs 0~65535 all in use and managed with 2 layers, and then ID 65536 being allocated so that the tree grows to 3 layers.

idr_get_empty_slot()

lib/idr.c

static int idr_get_empty_slot(struct idr *idp, int starting_id,
                              struct idr_layer **pa, gfp_t gfp_mask,
                              struct idr *layer_idr)
{
        struct idr_layer *p, *new;
        int layers, v, id;
        unsigned long flags;

        id = starting_id;
build_up:
        p = idp->top;
        layers = idp->layers;
        if (unlikely(!p)) {
                if (!(p = idr_layer_alloc(gfp_mask, layer_idr)))
                        return -ENOMEM;
                p->layer = 0;
                layers = 1;
        }
        /*
         * Add a new layer to the top of the tree if the requested
         * id is larger than the currently allocated space.
         */
        while (id > idr_max(layers)) {
                layers++;
                if (!p->count) {
                        /* special case: if the tree is currently empty,
                         * then we grow the tree by moving the top node
                         * upwards.
                         */
                        p->layer++;
                        WARN_ON_ONCE(p->prefix);
                        continue;
                }
                if (!(new = idr_layer_alloc(gfp_mask, layer_idr))) {
                        /*
                         * The allocation failed.  If we built part of
                         * the structure tear it down.
                         */
                        spin_lock_irqsave(&idp->lock, flags);
                        for (new = p; p && p != idp->top; new = p) {
                                p = p->ary[0];
                                new->ary[0] = NULL;
                                new->count = 0;
                                bitmap_clear(new->bitmap, 0, IDR_SIZE);
                                __move_to_free_list(idp, new);
                        }
                        spin_unlock_irqrestore(&idp->lock, flags);
                        return -ENOMEM;
                }
                new->ary[0] = p;
                new->count = 1;
                new->layer = layers-1;
                new->prefix = id & idr_layer_prefix_mask(new->layer);
                if (bitmap_full(p->bitmap, IDR_SIZE))
                        __set_bit(0, new->bitmap);
                p = new;
        }
        rcu_assign_pointer(idp->top, p);
        idp->layers = layers;
        v = sub_alloc(idp, &id, pa, gfp_mask, layer_idr);
        if (v == -EAGAIN)
                goto build_up;
        return(v);
}

Finds and returns a free ID at or above starting_id. If the layers need to grow (tree depth) in the process, they are created.

build_up: p = idp->top; layers = idp->layers;
- Takes the node pointed to by top of the idr structure and reads the number of layers in use.
if (unlikely(!p)) { if (!(p = idr_layer_alloc(gfp_mask, layer_idr))) return -ENOMEM;
- In the unlikely case that there is no top node yet, an idr_layer structure is allocated; if that allocation fails, the -ENOMEM error is returned.
p->layer = 0; layers = 1;
- Since it is a leaf node, 0 is stored in the layer member and the number of layers is set to 1.
while (id > idr_max(layers)) {
- Loops while the requested id exceeds the maximum that the current number of idr layers can handle.
if (!(new = idr_layer_alloc(gfp_mask, layer_idr))) {
- If an allocation fails while growing the upper layers (tree depth)
for (new = p; p && p != idp->top; new = p) { p = p->ary[0]; new->ary[0] = NULL; new->count = 0; bitmap_clear(new->bitmap, 0, IDR_SIZE); __move_to_free_list(idp, new); }
- Moves all the idr_layers already created for growing the layers to the id_free list.
new->ary[0] = p; new->count = 1; new->layer = layers-1; new->prefix = id & idr_layer_prefix_mask(new->layer); if (bitmap_full(p->bitmap, IDR_SIZE)) __set_bit(0, new->bitmap); p = new;
- Points ary[0] of the newly created layer at the existing layer and gives it a count of 1.
- Also sets the prefix value, and if the bitmap of the existing layer is full, sets the first bit of the new bitmap to 1 to mark it full.
rcu_assign_pointer(idp->top, p);
- Makes top of the idr structure point to node p.
idp->layers = layers;
- Updates layers of the idr structure.
v = sub_alloc(idp, &id, pa, gfp_mask, layer_idr);
- Growing the tree depth was completed in the while loop above; here an ID is allocated without changing the depth, and any lower layers within that range that are still missing are allocated.
if (v == -EAGAIN) goto build_up;
- If sub_alloc() returns -EAGAIN because the tree must grow before the ID can be allocated, starts again from build_up.

idr_layer_alloc()

lib/idr.c

/**
 * idr_layer_alloc - allocate a new idr_layer
 * @gfp_mask: allocation mask
 * @layer_idr: optional idr to allocate from
 *
 * If @layer_idr is %NULL, directly allocate one using @gfp_mask or fetch
 * one from the per-cpu preload buffer.  If @layer_idr is not %NULL, fetch
 * an idr_layer from @idr->id_free.
 *
 * @layer_idr is to maintain backward compatibility with the old alloc
 * interface - idr_pre_get() and idr_get_new*() - and will be removed
 * together with per-pool preload buffer.
 */
static struct idr_layer *idr_layer_alloc(gfp_t gfp_mask, struct idr *layer_idr)
{
        struct idr_layer *new;

        /* this is the old path, bypass to get_from_free_list() */
        if (layer_idr)
                return get_from_free_list(layer_idr);

        /*
         * Try to allocate directly from kmem_cache.  We want to try this
         * before preload buffer; otherwise, non-preloading idr_alloc()
         * users will end up taking advantage of preloading ones.  As the
         * following is allowed to fail for preloaded cases, suppress
         * warning this time.
         */
        new = kmem_cache_zalloc(idr_layer_cache, gfp_mask | __GFP_NOWARN);
        if (new)
                return new;

        /*
         * Try to fetch one from the per-cpu preload buffer if in process
         * context.  See idr_preload() for details.
         */
        if (!in_interrupt()) {
                preempt_disable();
                new = __this_cpu_read(idr_preload_head);
                if (new) {
                        __this_cpu_write(idr_preload_head, new->ary[0]);
                        __this_cpu_dec(idr_preload_cnt);
                        new->ary[0] = NULL;
                }
                preempt_enable();
                if (new)
                        return new;
        }

        /*
         * Both failed.  Try kmem_cache again w/o adding __GFP_NOWARN so
         * that memory allocation failure warning is printed as intended.
         */
        return kmem_cache_zalloc(idr_layer_cache, gfp_mask);
}

Allocates an idr_layer structure from the idr preload buffer or from idr_layer_cache.

if (layer_idr) return get_from_free_list(layer_idr);
- If layer_idr, the second argument of this function, is not null, an idr_layer is taken from the id_free member of the idr structure for compatibility with the old allocation method.
- The new method, which uses the idr preload buffer, passes null in layer_idr and skips this routine.
new = kmem_cache_zalloc(idr_layer_cache, gfp_mask | __GFP_NOWARN); if (new) return new;
- To be fair to callers that use idr_alloc() without idr_preload(), the idr_layer structure is first allocated directly from idr_layer_cache rather than from the idr preload buffer.
  - When idr_preload() was used, this allocation is allowed to fail, so the __GFP_NOWARN option suppresses the warning message.
if (!in_interrupt()) { preempt_disable();
- Unless called from interrupt context, disables preemption in order to use the idr preload buffer.
new = __this_cpu_read(idr_preload_head); if (new) { __this_cpu_write(idr_preload_head, new->ary[0]); __this_cpu_dec(idr_preload_cnt); new->ary[0] = NULL; }
- Takes an idr_layer structure from the idr_preload_head list and removes it from the list.
preempt_enable(); if (new) return new;
- Makes the task preemptible again and returns the entry if one was obtained.
return kmem_cache_zalloc(idr_layer_cache, gfp_mask);
- If everything failed, tries idr_layer_cache directly one last time. This time a warning message is printed on failure.

get_from_free_list()

lib/idr.c

static struct idr_layer *get_from_free_list(struct idr *idp)
{
        struct idr_layer *p;
        unsigned long flags;

        spin_lock_irqsave(&idp->lock, flags);
        if ((p = idp->id_free)) {
                idp->id_free = p->ary[0];
                idp->id_free_cnt--;
                p->ary[0] = NULL;
        }
        spin_unlock_irqrestore(&idp->lock, flags);
        return(p);
}

If the id_free list has an entry, removes it from the list and returns it.

idr_max()

lib/idr.c

/* the maximum ID which can be allocated given idr->layers */
static int idr_max(int layers)
{
        int bits = min_t(int, layers * IDR_BITS, MAX_IDR_SHIFT);

        return (1 << bits) - 1;
}

Returns the maximum ID (a positive integer) that can be allocated with the given number of layers.

e.g. 32-bit system
- 1 layer: 0xff (8 bits)
- 2 layers: 0xffff (16 bits)
- 3 layers: 0xff_ffff (24 bits)
- 4 layers: 0x7fff_ffff (positive integer, limited to 31 bits)
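
The table above can be reproduced with a user-space copy of the calculation. This is a sketch under assumptions: IDR_BITS = 8 and MAX_IDR_SHIFT = 31 are taken from the values implied by the table, idr_max_model() and layers_needed_model() are hypothetical names, and an unsigned shift is used so that the 31-bit case does not overflow a signed int in standard C.

```c
#include <assert.h>

#define IDR_BITS        8       /* assumed: 256 slots per layer */
#define MAX_IDR_SHIFT   31      /* assumed: sizeof(int) * 8 - 1 */

/* Largest ID that a tree with the given number of layers can hold. */
static int idr_max_model(int layers)
{
        int bits = layers * IDR_BITS;

        assert(layers >= 1);
        if (bits > MAX_IDR_SHIFT)
                bits = MAX_IDR_SHIFT;

        /* unsigned shift avoids signed overflow when bits == 31 */
        return (int)((1U << bits) - 1U);
}

/* Number of layers needed before the tree can hold the given ID. */
static int layers_needed_model(int id)
{
        int layers = 1;

        assert(id >= 0);
        while (id > idr_max_model(layers))
                layers++;
        return layers;
}
```

layers_needed_model() follows the same "while (id > idr_max(layers))" test that idr_get_empty_slot() uses to decide how far the tree has to grow.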

__move_to_free_list()

lib/idr.c

/* only called when idp->lock is held */
static void __move_to_free_list(struct idr *idp, struct idr_layer *p)
{
        p->ary[0] = idp->id_free;
        idp->id_free = p;
        idp->id_free_cnt++;
}

Adds the entry to the id_free list.

idr_layer_prefix_mask()

lib/idr.c

/*
 * Prefix mask for an idr_layer at @layer.  For layer 0, the prefix mask is
 * all bits except for the lower IDR_BITS.  For layer 1, 2 * IDR_BITS, and
 * so on.
 */
static int idr_layer_prefix_mask(int layer)
{
        return ~idr_max(layer + 1);
}

Returns the prefix mask for the requested layer.

e.g. For layer 0, 0xffffff00 is returned.
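
To see what the mask does to an id, the calculation can be repeated in user space with unsigned arithmetic. This is a sketch, assuming IDR_BITS = 8 and MAX_IDR_SHIFT = 31; idr_max_u(), prefix_mask_model() and prefix_model() are hypothetical names.

```c
#include <assert.h>

#define IDR_BITS        8       /* assumed: 256 slots per layer */
#define MAX_IDR_SHIFT   31      /* assumed: sizeof(int) * 8 - 1 */

/* Unsigned variant of idr_max() for the given number of layers. */
static unsigned int idr_max_u(int layers)
{
        int bits = layers * IDR_BITS;

        assert(layers >= 1);
        if (bits > MAX_IDR_SHIFT)
                bits = MAX_IDR_SHIFT;
        return (1U << bits) - 1U;
}

/* Mask that keeps only the id bits above those covered by the layer. */
static unsigned int prefix_mask_model(int layer)
{
        return ~idr_max_u(layer + 1);
}

/* Prefix shared by every id stored under a node at the given layer. */
static unsigned int prefix_model(unsigned int id, int layer)
{
        return id & prefix_mask_model(layer);
}
```

For id 0x123456, the leaf node (layer 0) gets prefix 0x123400 and its parent (layer 1) gets prefix 0x120000.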

sub_alloc()

lib/idr.c

/**
 * sub_alloc - try to allocate an id without growing the tree depth
 * @idp: idr handle
 * @starting_id: id to start search at
 * @pa: idr_layer[MAX_IDR_LEVEL] used as backtrack buffer
 * @gfp_mask: allocation mask for idr_layer_alloc()
 * @layer_idr: optional idr passed to idr_layer_alloc()
 *
 * Allocate an id in range [@starting_id, INT_MAX] from @idp without
 * growing its depth.  Returns
 *
 *  the allocated id >= 0 if successful,
 *  -EAGAIN if the tree needs to grow for allocation to succeed,
 *  -ENOSPC if the id space is exhausted,
 *  -ENOMEM if more idr_layers need to be allocated.
 */
static int sub_alloc(struct idr *idp, int *starting_id, struct idr_layer **pa,
                     gfp_t gfp_mask, struct idr *layer_idr)
{
        int n, m, sh;
        struct idr_layer *p, *new;
        int l, id, oid;

        id = *starting_id;
 restart:
        p = idp->top;
        l = idp->layers;
        pa[l--] = NULL;
        while (1) {
                /*
                 * We run around this while until we reach the leaf node...
                 */
                n = (id >> (IDR_BITS*l)) & IDR_MASK;
                m = find_next_zero_bit(p->bitmap, IDR_SIZE, n);
                if (m == IDR_SIZE) {
                        /* no space available go back to previous layer. */
                        l++;
                        oid = id;
                        id = (id | ((1 << (IDR_BITS * l)) - 1)) + 1;

                        /* if already at the top layer, we need to grow */
                        if (id > idr_max(idp->layers)) {
                                *starting_id = id;
                                return -EAGAIN;
                        }
                        p = pa[l];
                        BUG_ON(!p);

                        /* If we need to go up one layer, continue the
                         * loop; otherwise, restart from the top.
                         */
                        sh = IDR_BITS * (l + 1);
                        if (oid >> sh == id >> sh)
                                continue;
                        else
                                goto restart;
                }

Allocates an ID without changing the depth of the layers, allocating any lower layers within that range that are still missing. The output argument pa, an array of pointers, receives the idr_layer pointers from the top layer down to the layer that holds the id.

pa[0] points to the lowest layer, the following elements point to the successively higher layers related to the id up to the top layer, and the array is terminated with null.

id = *starting_id;
- Starts from starting_id.
restart: p = idp->top;
- When the search has to start over, control comes back to the restart: label and takes the top node connected to top of the idr structure.
l = idp->layers; pa[l--] = NULL;
- Reads the number of layers in use into l and stores null at the end of the pa[] array.
while (1) { n = (id >> (IDR_BITS*l)) & IDR_MASK; m = find_next_zero_bit(p->bitmap, IDR_SIZE, n);
- Loops. n is the bitmap index for the current layer, obtained by shifting id right by (current layer x 8) bits and masking with IDR_MASK. m is the position of the first bit set to 0 in the bitmap at or after n; if none is found, find_next_zero_bit() returns IDR_SIZE.
- e.g. With 3 layers in total, when allocating the last remaining slot at the current level
  - id=0, bitmap=0x7fffffff_ffffffff_ffffffff_ffffffff_ffffffff_ffffffff_ffffffff_ffffffff, l=2
    - n=0, m=0xff
if (m == IDR_SIZE) {
- No free position at or after the given index means the current node has no ID space left.
l++; oid = id; id = (id | ((1 << (IDR_BITS * l)) - 1)) + 1;
- Increments l to go back to the upper layer, backs up the current id, and, to look for the next free position, sets id to the first id of the sibling layer to the right of the current one.
if (id > idr_max(idp->layers)) { *starting_id = id; return -EAGAIN; }
- If the current layer structure has no more room for an ID, stores id in starting_id and returns -EAGAIN to request that the tree depth be grown.
p = pa[l];
- Selects the upper layer.
sh = IDR_BITS * (l + 1); if (oid >> sh == id >> sh) continue; else goto restart;
- If the newly chosen id can still be handled under the same upper node, continues the loop; otherwise jumps to the restart label and starts over from the top.
if (m != n) { sh = IDR_BITS*l; id = ((id >> sh) ^ n ^ m) << sh; }
- e.g. id=0, n=0, m=0xff, l=2
  - id=0xff0000
if ((id >= MAX_IDR_BIT) || (id < 0)) return -ENOSPC;
- If id is outside the positive integer range, the system cannot handle it and the -ENOSPC error is returned.
if (l == 0) break;
- Leaves the loop once the last layer, the leaf node, has been processed.
if (!p->ary[m]) { new = idr_layer_alloc(gfp_mask, layer_idr); if (!new) return -ENOMEM;
- Creates the lower node if it is missing.
new->layer = l-1; new->prefix = id & idr_layer_prefix_mask(new->layer); rcu_assign_pointer(p->ary[m], new); p->count++;
- Sets the layer and prefix of the lower node, connects it to ary[] of the current node, and increments count.

                if (m != n) {
                        sh = IDR_BITS*l;
                        id = ((id >> sh) ^ n ^ m) << sh;
                }
                if ((id >= MAX_IDR_BIT) || (id < 0))
                        return -ENOSPC;
                if (l == 0)
                        break;
                /*
                 * Create the layer below if it is missing.
                 */
                if (!p->ary[m]) {
                        new = idr_layer_alloc(gfp_mask, layer_idr);
                        if (!new)
                                return -ENOMEM;
                        new->layer = l-1;
                        new->prefix = id & idr_layer_prefix_mask(new->layer);
                        rcu_assign_pointer(p->ary[m], new);
                        p->count++;
                }
                pa[l--] = p;
                p = p->ary[m];
        }

        pa[l] = p;
        return id;
}

if (m != n) { sh = IDR_BITS*l; id = ((id >> sh) ^ n ^ m) << sh; }
- If m and n differ, computes the id of the free position.
if ((id >= MAX_IDR_BIT) || (id < 0)) return -ENOSPC;
- If id is out of range, returns an error saying there is no space to allocate.
if (l == 0) break;
- Leaves the loop once the lowest layer has been reached.
if (!p->ary[m]) { new = idr_layer_alloc(gfp_mask, layer_idr); if (!new) return -ENOMEM;
- If the lower layer node that manages the id to allocate does not exist, a layer is allocated.
new->layer = l-1; new->prefix = id & idr_layer_prefix_mask(new->layer); rcu_assign_pointer(p->ary[m], new); p->count++;
- Sets the layer number and prefix of the allocated layer, makes ary[m] point to it, and increments count of the current node.
pa[l--] = p; p = p->ary[m]; }
- Records the current node in pa[], decrements l to move down to the next lower layer, and continues the loop.
pa[l] = p; return id;
- Finally updates pa[0] and returns id.
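
The three pieces of id arithmetic used by sub_alloc() can be exercised on their own. The sketch below assumes IDR_BITS = 8, uses unsigned arithmetic to stay within defined behaviour in standard C, and limits the layer argument to the range that makes sense for a 4-layer tree; the function names are hypothetical.

```c
#include <assert.h>

#define IDR_BITS 8                              /* assumed: 256 slots per layer */
#define IDR_MASK ((1U << IDR_BITS) - 1U)

/* Slot index of id inside a node at layer l (0 is the leaf layer). */
static unsigned int slot_index_model(unsigned int id, int l)
{
        assert(l >= 0 && l <= 3);
        return (id >> (IDR_BITS * l)) & IDR_MASK;
}

/*
 * Replace the slot index at layer l with m and clear every bit below it,
 * as "id = ((id >> sh) ^ n ^ m) << sh" does.
 */
static unsigned int retarget_id_model(unsigned int id, int l, unsigned int m)
{
        unsigned int sh = IDR_BITS * l;
        unsigned int n = slot_index_model(id, l);

        return ((id >> sh) ^ n ^ m) << sh;
}

/*
 * First id after the block of 256^l ids that contains id; l is the
 * layer value after the "l++" step (1..3 in this sketch).
 */
static unsigned int next_subtree_id_model(unsigned int id, int l)
{
        assert(l >= 1 && l <= 3);
        return (id | ((1U << (IDR_BITS * l)) - 1U)) + 1U;
}
```

retarget_id_model(0, 2, 0xff) gives 0xff0000, which matches the id=0, n=0, m=0xff, l=2 example in the walkthrough.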

idr_fill_slot()

lib/idr.c

/*
 * @id and @pa are from a successful allocation from idr_get_empty_slot().
 * Install the user pointer @ptr and mark the slot full.
 */
static void idr_fill_slot(struct idr *idr, void *ptr, int id,
                          struct idr_layer **pa)
{
        /* update hint used for lookup, cleared from free_layer() */
        rcu_assign_pointer(idr->hint, pa[0]);

        rcu_assign_pointer(pa[0]->ary[id & IDR_MASK], (struct idr_layer *)ptr);
        pa[0]->count++;
        idr_mark_full(pa, id);
}

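Stores the address of the leaf node where the ID was just allocated in the hint member of the idr structure, stores ptr in the ary[] array, increments count, and then sets the bitmap bits for the layers that have become full.

rcu_assign_pointer(idr->hint, pa[0]);
- Stores the address of the leaf node where the ID was just allocated in the hint member of the idr structure.
- idr_find() uses it as a fast path when the layer pointed to by hint covers the requested id.
rcu_assign_pointer(pa[0]->ary[id & IDR_MASK], (struct idr_layer *)ptr);
- Stores ptr in the ary[] slot that corresponds to the id in the leaf node where the ID was just allocated.
pa[0]->count++;
- Increments the counter of that leaf node.
idr_mark_full(pa, id);
- Sets the bit for the id in the bitmap of that leaf node to 1 to mark the ID as allocated; then, walking from that node toward the top node, for each node that has become full, sets the corresponding bit in the bitmap of its parent to 1.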

idr_mark_full()

lib/idr.c

static void idr_mark_full(struct idr_layer **pa, int id)
{
        struct idr_layer *p = pa[0];
        int l = 0;

        __set_bit(id & IDR_MASK, p->bitmap);
        /*
         * If this layer is full mark the bit in the layer above to
         * show that this part of the radix tree is full.  This may
         * complete the layer above and require walking up the radix
         * tree.
         */
        while (bitmap_full(p->bitmap, IDR_SIZE)) {
                if (!(p = pa[++l]))
                        break;
                id = id >> IDR_BITS;
                __set_bit((id & IDR_MASK), p->bitmap);
        }
}

마지막에 ID를 할당한 leaf 노드의 bitmap에 id에 해당하는 비트를 1로 설정하여 ID가 할당되었음을 표시한 후, 그 노드부터 최상위 노드중 full된 노드의 상위 노드 bitmap에 해당 비트를 1로 설정한다.

struct idr_layer *p = pa[0];
- The lowest (leaf) node.
__set_bit(id & IDR_MASK, p->bitmap);
- Sets the position for id in the bitmap of this idr_layer node to mark the ID as allocated.
while (bitmap_full(p->bitmap, IDR_SIZE)) {
- Keeps looping as long as the current node is full.
if (!(p = pa[++l])) break;
- Moves to the parent node; if there is no parent, leaves the loop.
id = id >> IDR_BITS;
- Shifts id right by IDR_BITS (divides it by 256) to obtain the slot index within the parent.
__set_bit((id & IDR_MASK), p->bitmap);
- Sets the position for id in the current node's bitmap to mark the child node as full.
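
The upward propagation can be tried out in user space. The following is a minimal sketch, not kernel code: struct layer, layer_full(), mark_full() and demo_mark_full() are hypothetical names, the bitmap is modeled as one byte per slot, and pa[] is assumed to be NULL-terminated as in the kernel function.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define IDR_BITS 8
#define IDR_SIZE (1 << IDR_BITS)
#define IDR_MASK (IDR_SIZE - 1)

/* Hypothetical stand-in for struct idr_layer: one flag byte per slot. */
struct layer {
        unsigned char bitmap[IDR_SIZE];
};

static bool layer_full(const struct layer *p)
{
        for (int i = 0; i < IDR_SIZE; i++)
                if (!p->bitmap[i])
                        return false;
        return true;
}

/* pa[0] is the leaf, pa[1] its parent, ...; the array ends with NULL. */
static void mark_full(struct layer **pa, int id)
{
        struct layer *p = pa[0];
        int l = 0;

        p->bitmap[id & IDR_MASK] = 1;           /* the ID itself is in use */
        while (layer_full(p)) {
                p = pa[++l];
                if (!p)
                        break;                  /* walked past the top layer */
                id >>= IDR_BITS;                /* slot index one level up */
                p->bitmap[id & IDR_MASK] = 1;   /* child subtree is full */
        }
}

/* Demo: leaf 0x01 of a 2-layer tree has 255 IDs in use; take the last one. */
static int demo_mark_full(void)
{
        struct layer leaf, top;
        struct layer *pa[3] = { &leaf, &top, NULL };

        memset(&leaf, 0, sizeof(leaf));
        memset(&top, 0, sizeof(top));
        memset(leaf.bitmap, 1, IDR_SIZE - 1);   /* slots 0..254 already used */

        mark_full(pa, 0x1ff);                   /* slot 255 of leaf 0x01 */
        return top.bitmap[0x01];                /* set once the leaf is full */
}
```

In this model the parent bit is only set by the allocation that fills the last free slot of the leaf, which is the same condition under which the kernel loop climbs one level.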

The following figure shows how idr_mark_full() marks each bitmap as full after ID 0xffffff has been allocated.

IDR removal

idr_remove()

lib/idr.c

/**
 * idr_remove - remove the given id and free its slot
 * @idp: idr handle
 * @id: unique key
 */
void idr_remove(struct idr *idp, int id)
{
        struct idr_layer *p;
        struct idr_layer *to_free;

        if (id < 0)
                return;

        if (id > idr_max(idp->layers)) {
                idr_remove_warning(id);
                return;
        }

        sub_remove(idp, (idp->layers - 1) * IDR_BITS, id);
        if (idp->top && idp->top->count == 1 && (idp->layers > 1) &&
            idp->top->ary[0]) {
                /*
                 * Single child at leftmost slot: we can shrink the tree.
                 * This level is not needed anymore since when layers are
                 * inserted, they are inserted at the top of the existing
                 * tree.
                 */
                to_free = idp->top;
                p = idp->top->ary[0];
                rcu_assign_pointer(idp->top, p);
                --idp->layers;
                to_free->count = 0;
                bitmap_clear(to_free->bitmap, 0, IDR_SIZE);
                free_layer(idp, to_free);
        }
}
EXPORT_SYMBOL(idr_remove);

Removes an allocated id. Layers that become empty in the process are freed, and where possible the tree depth is reduced as well.

if (id < 0) return; if (id > idr_max(idp->layers)) { idr_remove_warning(id); return; }
- If id is outside the range the IDR can currently handle, returns without removing anything. An id above the current maximum also triggers a warning.
sub_remove(idp, (idp->layers - 1) * IDR_BITS, id);
- Without reducing the tree depth, unlinks and frees the layers on the path to id that become empty.
if (idp->top && idp->top->count == 1 && (idp->layers > 1) && idp->top->ary[0]) {
- True when the top layer has a count of 1 and its only child is the lower layer in slot 0.
to_free = idp->top; p = idp->top->ary[0]; rcu_assign_pointer(idp->top, p);
- In preparation for the removal, saves the top layer in to_free and makes its child the new top layer.
--idp->layers; to_free->count = 0; bitmap_clear(to_free->bitmap, 0, IDR_SIZE); free_layer(idp, to_free);
- Decrements the number of layers (tree depth), clears the count and bitmap of the layer being removed, and frees it.
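
The shrink condition can be isolated in a small user-space model. This is only a sketch under simplifying assumptions: struct node, struct tree, shrink_once() and demo_shrink() are hypothetical names, the bitmap and RCU handling are left out, and free() stands in for the deferred free_layer().

```c
#include <stdbool.h>
#include <stdlib.h>

#define IDR_BITS 8
#define IDR_SIZE (1 << IDR_BITS)

/* Hypothetical, simplified stand-ins for struct idr_layer and struct idr. */
struct node {
        struct node *ary[IDR_SIZE];
        int count;
};

struct tree {
        struct node *top;
        int layers;
};

/* Same test as in idr_remove(): drop the top layer if it only wraps slot 0. */
static bool shrink_once(struct tree *t)
{
        struct node *to_free = t->top;

        if (!(to_free && to_free->count == 1 && t->layers > 1 &&
              to_free->ary[0]))
                return false;

        t->top = to_free->ary[0];
        --t->layers;
        free(to_free);          /* the kernel defers this via call_rcu() */
        return true;
}

/* Demo: the top layer's only child sits in slot 0 and has two leaves. */
static int demo_shrink(void)
{
        struct node *top = calloc(1, sizeof(*top));
        struct node *mid = calloc(1, sizeof(*mid));
        struct node *leaf0 = calloc(1, sizeof(*leaf0));
        struct node *leaf1 = calloc(1, sizeof(*leaf1));
        struct tree t = { top, 3 };
        int layers;

        if (!top || !mid || !leaf0 || !leaf1) {
                free(top); free(mid); free(leaf0); free(leaf1);
                return -1;
        }
        mid->ary[0] = leaf0;
        mid->ary[1] = leaf1;
        mid->count = 2;
        top->ary[0] = mid;
        top->count = 1;

        shrink_once(&t);        /* 3 -> 2 layers */
        shrink_once(&t);        /* no-op: the new top has two children */
        layers = t.layers;

        free(leaf0); free(leaf1); free(t.top);
        return layers;
}
```

Because idr_remove() uses an if rather than a loop, a single call lowers the depth by at most one level, which the model mirrors by requiring one shrink_once() call per level.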

The following figure shows how layers are removed and the tree depth is reduced when id 65536 is removed from a tree with three layers.

sub_remove()

lib/idr.c

static void sub_remove(struct idr *idp, int shift, int id)
{
        struct idr_layer *p = idp->top;
        struct idr_layer **pa[MAX_IDR_LEVEL + 1];
        struct idr_layer ***paa = &pa[0];
        struct idr_layer *to_free;
        int n;

        *paa = NULL;
        *++paa = &idp->top;

        while ((shift > 0) && p) {
                n = (id >> shift) & IDR_MASK;
                __clear_bit(n, p->bitmap);
                *++paa = &p->ary[n];
                p = p->ary[n];
                shift -= IDR_BITS;
        }
        n = id & IDR_MASK;
        if (likely(p != NULL && test_bit(n, p->bitmap))) {
                __clear_bit(n, p->bitmap);
                RCU_INIT_POINTER(p->ary[n], NULL);
                to_free = NULL;
                while(*paa && ! --((**paa)->count)){
                        if (to_free)
                                free_layer(idp, to_free);
                        to_free = **paa;
                        **paa-- = NULL;
                }
                if (!*paa)
                        idp->layers = 0;
                if (to_free)
                        free_layer(idp, to_free);
        } else
                idr_remove_warning(id);
}

Without reducing the tree depth, unlinks and frees the layers on the path to the id being removed that become empty.

struct idr_layer ***paa = &pa[0]; *paa = NULL; *++paa = &idp->top;
- Stores NULL in pa[0] as a sentinel and the address of the top-layer pointer in pa[1].
while ((shift > 0) && p) { n = (id >> shift) & IDR_MASK; __clear_bit(n, p->bitmap); *++paa = &p->ary[n]; p = p->ary[n]; shift -= IDR_BITS; }
- Descends to just above the leaf layer, recording in pa[] the address of each slot that was followed and clearing the related bit in each bitmap, since the subtree will no longer be full.
n = id & IDR_MASK;
- The bit position within the leaf layer.
if (likely(p != NULL && test_bit(n, p->bitmap))) { __clear_bit(n, p->bitmap); RCU_INIT_POINTER(p->ary[n], NULL); to_free = NULL;
- In the likely case that the bit is set in the leaf layer's bitmap, clears the bit and sets ary[n] to NULL with RCU_INIT_POINTER().
while(*paa && ! --((**paa)->count)){ if (to_free) free_layer(idp, to_free); to_free = **paa; **paa-- = NULL; }
- Walks back up through the slots saved in pa[], decrementing the count of each layer. As long as the count drops to 0, it frees the layer previously held in to_free (if any), keeps the current layer in to_free, sets the slot saved in pa[] to NULL, and steps back to the previous pa[] entry.
if (!*paa) idp->layers = 0;
- If the walk reached the sentinel, even the top layer became empty, so idp->layers is set to 0 to indicate that no layers remain.
if (to_free) free_layer(idp, to_free);
- Frees the layer left in to_free.
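
The slot index chosen at each level during the descent is just one IDR_BITS-wide field of the id. A minimal sketch of that arithmetic follows; idr_path() is a hypothetical helper written for illustration, not a kernel function.

```c
#define IDR_BITS 8
#define IDR_MASK ((1 << IDR_BITS) - 1)

/*
 * Fill slots[] with the ary[] index used at each level, top layer first,
 * for a tree that currently has `layers` layers. Returns the number of
 * entries written (equal to layers). The caller provides enough room.
 */
static int idr_path(int id, int layers, int *slots)
{
        int shift = (layers - 1) * IDR_BITS;
        int depth = 0;

        while (shift > 0) {                     /* non-leaf layers */
                slots[depth++] = (id >> shift) & IDR_MASK;
                shift -= IDR_BITS;
        }
        slots[depth++] = id & IDR_MASK;         /* leaf layer */
        return depth;
}
```

For id 65536 (0x10000) in a three-layer tree this yields slot 1 in the top layer, then slot 0 in the middle layer, then slot 0 in the leaf.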

static inline void free_layer(struct idr *idr, struct idr_layer *p)
{
        if (idr->hint == p)
                RCU_INIT_POINTER(idr->hint, NULL);
        call_rcu(&p->rcu_head, idr_layer_rcu_free);
}

If idr->hint points to the layer p being removed, hint is set to NULL. The layer is then passed to call_rcu() so that idr_layer_rcu_free() frees it after an RCU grace period.

idr_layer_rcu_free()

lib/idr.c

static void idr_layer_rcu_free(struct rcu_head *head)
{
        struct idr_layer *layer;

        layer = container_of(head, struct idr_layer, rcu_head);
        kmem_cache_free(idr_layer_cache, layer);
}

Frees the given idr_layer back to idr_layer_cache.

IDR destruction

idr_destroy()

lib/idr.c

/**
 * idr_destroy - release all cached layers within an idr tree
 * @idp: idr handle
 *
 * Free all id mappings and all idp_layers.  After this function, @idp is
 * completely unused and can be freed / recycled.  The caller is
 * responsible for ensuring that no one else accesses @idp during or after
 * idr_destroy().
 *
 * A typical clean-up sequence for objects stored in an idr tree will use
 * idr_for_each() to free all objects, if necessary, then idr_destroy() to
 * free up the id mappings and cached idr_layers.
 */
void idr_destroy(struct idr *idp)
{
        __idr_remove_all(idp);

        while (idp->id_free_cnt) {
                struct idr_layer *p = get_from_free_list(idp);
                kmem_cache_free(idr_layer_cache, p);
        }
}
EXPORT_SYMBOL(idr_destroy);

Removes every idr layer and frees the preallocated layers that are waiting on the id_free list.

__idr_remove_all()

lib/idr.c

static void __idr_remove_all(struct idr *idp)
{
        int n, id, max;
        int bt_mask;
        struct idr_layer *p;
        struct idr_layer *pa[MAX_IDR_LEVEL + 1];
        struct idr_layer **paa = &pa[0];

        n = idp->layers * IDR_BITS;
        *paa = idp->top;
        RCU_INIT_POINTER(idp->top, NULL);
        max = idr_max(idp->layers);

        id = 0;
        while (id >= 0 && id <= max) {
                p = *paa;
                while (n > IDR_BITS && p) {
                        n -= IDR_BITS;
                        p = p->ary[(id >> n) & IDR_MASK];
                        *++paa = p;
                }

                bt_mask = id;
                id += 1 << n;
                /* Get the highest bit that the above add changed from 0->1. */
                while (n < fls(id ^ bt_mask)) {
                        if (*paa)
                                free_layer(idp, *paa);
                        n += IDR_BITS;
                        --paa;
                }
        }
        idp->layers = 0;
}

Frees all idr layers.

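n = idp->layers * IDR_BITS;
- The maximum number of bits to process.
- e.g. layers=3 gives n=24
*paa = idp->top;
- Stores the top layer in pa[0].
RCU_INIT_POINTER(idp->top, NULL);
- Sets idp->top to NULL to indicate that no idr_layer is registered.
max = idr_max(idp->layers);
- The largest id that can be handled.
id = 0; while (id >= 0 && id <= max) { p = *paa; while (n > IDR_BITS && p) { n -= IDR_BITS; p = p->ary[(id >> n) & IDR_MASK]; *++paa = p; }
- Descends until the leaf layer is reached, pushing each layer visited onto pa[].
bt_mask = id; id += 1 << n;
- Saves id, then advances id to the first id managed by the next sibling layer at the current level.
while (n < fls(id ^ bt_mask)) { if (*paa) free_layer(idp, *paa); n += IDR_BITS; --paa; }
- Frees the layer that was just finished. If the increment carried into higher bits, the parent layers have been fully traversed too, so the loop keeps climbing and frees them as well.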
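
How far the loop climbs depends only on the highest bit that the increment flipped. The sketch below reproduces that arithmetic in user space; fls_u32() is a hypothetical stand-in for the kernel's fls() (1-based position of the most significant set bit, 0 for 0), and layers_finished() is a name made up for this example.

```c
#define IDR_BITS 8

/* Stand-in for the kernel's fls(): 1-based index of the highest set bit. */
static int fls_u32(unsigned int x)
{
        int r = 0;

        while (x) {
                r++;
                x >>= 1;
        }
        return r;
}

/*
 * After the leaf that starts at `id` has been processed (n == IDR_BITS),
 * count how many layers the inner while loop of __idr_remove_all() would
 * step over, i.e. how many levels of pa[] it pops.
 */
static int layers_finished(int id)
{
        int n = IDR_BITS;
        int bt_mask = id;
        int popped = 0;

        id += 1 << n;                           /* first id of the next leaf */
        while (n < fls_u32((unsigned int)(id ^ bt_mask))) {
                popped++;                       /* free_layer(idp, *paa) here */
                n += IDR_BITS;
        }
        return popped;
}
```

Starting from id 0 only the leaf is finished. Starting from 0xff00 the addition carries into bit 16, so both the leaf and its parent are finished, and starting from 0xffff00 in a three-layer tree all three layers are.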

Structures and key constants

idr_layer

include/linux/idr.h

struct idr_layer {
        int                     prefix; /* the ID prefix of this idr_layer */
        int                     layer;  /* distance from leaf */
        struct idr_layer __rcu  *ary[1<<IDR_BITS];
        int                     count;  /* When zero, we can release it */
        union {
                /* A zero bit means "space here" */
                DECLARE_BITMAP(bitmap, IDR_SIZE);
                struct rcu_head         rcu_head;
        };
};

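prefix
- The upper bits of the ID, that is, the bits left after excluding those that the layer itself manages. The mask that selects them depends on the layer.
- Example (32-bit int)
  - layer 0 (leaf node) -> 0xffffff00
  - layer 1 -> 0xffff0000
  - layer 2 -> 0xff000000
  - layer 3 -> 0x80000000
layer
- The layer number (0-based)
  - The leaf node is 0.
ary[256]
- In a non-leaf node each entry points to a lower layer; in a leaf node each entry holds a user pointer.
count
- In a leaf node, the number of IDs that are allocated and in use; in a non-leaf node, the number of child nodes linked to it.
- When this value is 0 the layer can be released.
bitmap
- In a leaf node a bit is set to 1 when the ID is allocated; in a non-leaf node a bit is set to 1 when the child node is full.
rcu_head
- Shares a union with bitmap and is used when the node is freed through RCU.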
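
The prefix masks listed above can be reproduced with a few lines of arithmetic. This is a sketch modeled on idr_max(), which appears in the code earlier in this post; its body here is an assumption (the bit count is clamped to MAX_IDR_SHIFT, which is what produces the 0x80000000 entry), and prefix_mask() is a hypothetical name. A 32-bit int is assumed.

```c
#include <limits.h>

#define IDR_BITS 8
#define MAX_IDR_SHIFT ((int)(sizeof(int) * CHAR_BIT) - 1)

/* Largest id a tree with `layers` layers can hold (assumed clamping). */
static int idr_max(int layers)
{
        int bits = layers * IDR_BITS;

        if (bits > MAX_IDR_SHIFT)
                bits = MAX_IDR_SHIFT;           /* never touch the sign bit */
        return (int)((1U << bits) - 1);
}

/* Mask of the id bits that lie above everything `layer` manages. */
static unsigned int prefix_mask(int layer)
{
        return ~(unsigned int)idr_max(layer + 1);
}
```

A layer's prefix would then be the id ANDed with prefix_mask(layer), so every id stored under the same layer shares one prefix value.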

idr

include/linux/idr.h

struct idr {
        struct idr_layer __rcu  *hint;  /* the last layer allocated from */
        struct idr_layer __rcu  *top;
        int                     layers; /* only valid w/o concurrent changes */
        int                     cur;    /* current pos for cyclic allocation */
        spinlock_t              lock;
        int                     id_free_cnt;
        struct idr_layer        *id_free;
};

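hint
- The node from which the last ID was allocated.
top
- The topmost node.
layers
- The number of layers in use (tree depth), which grows and shrinks with the largest ID value.
- When 0, there are no layers and no nodes in use.
- Ranges from 0 to MAX_IDR_LEVEL, which is 4 because int is 32 bits wide on both 32-bit and 64-bit systems.
- e.g. layers=1 when the largest id is 255
cur
- The current position for cyclic allocation.
lock
- The lock used to manage the layers.
id_free_cnt
- The number of idr_layer entries preallocated as a cache, kept for compatibility with the legacy ID allocation scheme.
id_free
- The list on which those preallocated idr_layer entries are kept, also for compatibility with the legacy ID allocation scheme.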

IDR_BITS

#define IDR_BITS 8
- This value was raised to 8 in kernel v3.9-rc1 (2013); before that it was 5 on 32-bit systems and 6 on 64-bit systems.
- See: idr: make idr_layer larger

IDR_SIZE

#define IDR_SIZE (1 << IDR_BITS)
- 256

IDR_MASK

#define IDR_MASK ((1 << IDR_BITS)-1)
- 255

MAX_IDR_SHIFT

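#define MAX_IDR_SHIFT (sizeof(int) * 8 - 1)
- The number of bits in an int, excluding the sign bit.
- 31 on both 32-bit and 64-bit systems, because int is 32 bits wide on every architecture Linux supports.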

MAX_IDR_BIT

#define MAX_IDR_BIT (1U << MAX_IDR_SHIFT)
- One more than the largest non-negative int value.
- 2^31 on both 32-bit and 64-bit systems

MAX_IDR_LEVEL

#define MAX_IDR_LEVEL ((MAX_IDR_SHIFT + IDR_BITS - 1) / IDR_BITS)
- MAX_IDR_SHIFT rounded up to a multiple of IDR_BITS and divided by IDR_BITS, which is the maximum number of levels the tree can grow to.
- 4 levels on both 32-bit and 64-bit systems ((31 + 7) / 8)

MAX_IDR_FREE

#define MAX_IDR_FREE (MAX_IDR_LEVEL * 2)
- 8 on both 32-bit and 64-bit systems (MAX_IDR_LEVEL * 2)
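
The values above can be checked by compiling the same macro definitions in user space. This is a small sketch; print_idr_constants() is a hypothetical helper, and the results assume that sizeof(int) is 4, which holds for both ILP32 and LP64 Linux targets.

```c
#include <stdio.h>

#define IDR_BITS 8
#define IDR_SIZE (1 << IDR_BITS)
#define IDR_MASK ((1 << IDR_BITS)-1)

#define MAX_IDR_SHIFT (sizeof(int) * 8 - 1)
#define MAX_IDR_BIT (1U << MAX_IDR_SHIFT)
#define MAX_IDR_LEVEL ((MAX_IDR_SHIFT + IDR_BITS - 1) / IDR_BITS)
#define MAX_IDR_FREE (MAX_IDR_LEVEL * 2)

/* Print each derived constant; the sizeof-based ones have type size_t. */
static void print_idr_constants(void)
{
        printf("IDR_SIZE      = %d\n", IDR_SIZE);
        printf("IDR_MASK      = %d\n", IDR_MASK);
        printf("MAX_IDR_SHIFT = %zu\n", MAX_IDR_SHIFT);
        printf("MAX_IDR_BIT   = %u\n", MAX_IDR_BIT);
        printf("MAX_IDR_LEVEL = %zu\n", MAX_IDR_LEVEL);
        printf("MAX_IDR_FREE  = %zu\n", MAX_IDR_FREE);
}
```

Since every derived constant depends on sizeof(int) rather than sizeof(long), building this for a 32-bit or a 64-bit target gives the same numbers.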

References

idr – integer ID management | LWN.net
A simplified IDR API | LWN.net
idr: implement idr_alloc() and convert existing users | LWN.net
[Linux] IDR & IDA – integer ID management | F/OSS
Trees I: Radix trees | LWN.net