문c 블로그

Timer -8- (Timecounter)

2017-03-10 (updated 2020-02-03) 문영일

Timecounter/Cyclecounter

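The timecounter provides a h/w-independent time counter API. Because few drivers use this counter, its code was removed from clocksource, where it originally lived, and split out into a separate file.

On the arm architected timer, a 56-bit cyclecounter is used to initialize the Timecounter.

It is mainly tied to h/w timers for the PTP (Precision Time Protocol) feature of high-speed Ethernet drivers, and it is also used by the Intel HD Audio driver.

Drivers that use it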
- drivers/net/ethernet/amd/xgbe/xgbe-drv.c
  - AMD 10Gb Ethernet driver
- drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
  - Broadcom Everest network driver
- drivers/net/ethernet/freescale/fec_ptp.c
  - Fast Ethernet Controller (ENET) PTP driver for MX6x
- drivers/net/ethernet/ti/cpts.c
  - TI Common Platform Time Sync
- drivers/net/ethernet/mellanox/mlx4/en_clock.c
  - mlx4 ptp clock
- drivers/net/ethernet/intel/igb/igb_ptp.c
  - PTP Hardware Clock (PHC) driver for the Intel 82576 and 82580
- drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c
  - Intel 10 Gigabit PCI Express Linux driver
- drivers/net/ethernet/intel/e1000e/netdev.c
  - Intel PRO/1000 Linux driver
- sound/pci/hda/hda_controller.c
  - Implementation of primary alsa driver code base for Intel HD Audio.
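Reference: time: move the timecounter/cyclecounter code into its own file.

APIs

timecounter_init()
- Binds the timecounter to a hardware counter register (through a cyclecounter).
timecounter_read()
- Returns, in ns, the time elapsed since timecounter_init(), plus the initial time stamp.
cyclecounter_cyc2ns()
- Converts a cycle counter value to ns.
timecounter_adjtime()
- Adjusts the time by adding delta to tc->nsec.
timecounter_cyc2time()
- Computes the delta between the requested cycle value and the last cycle counter value, converts it to ns, and returns the corresponding time.

Initializing the timecounter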

timecounter_init()

kernel/time/timecounter.c

/**
 * timecounter_init - initialize a time counter
 * @tc:                 Pointer to time counter which is to be initialized/reset
 * @cc:                 A cycle counter, ready to be used.
 * @start_tstamp:       Arbitrary initial time stamp.
 *
 * After this call the current cycle register (roughly) corresponds to
 * the initial time stamp. Every call to timecounter_read() increments
 * the time stamp counter by the number of elapsed nanoseconds.
 */

void timecounter_init(struct timecounter *tc,
                      const struct cyclecounter *cc,
                      u64 start_tstamp)
{       
        tc->cc = cc;
        tc->cycle_last = cc->read(cc);
        tc->nsec = start_tstamp;
        tc->mask = (1ULL << cc->shift) - 1;
        tc->frac = 0;
}
EXPORT_SYMBOL_GPL(timecounter_init);

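Initializes the given timecounter with the cyclecounter and the start value (ns), and stores the cycle value read from the h/w timer in cycle_last.

cycle_last is set to the current timer counter value read through cc->read() (returned as a 64-bit value).
nsec is set to the initial start value.
rpi2: uses the architected generic timer
- mask is a 24-bit value, 0xff_ffff (shift=24).
- This value is used as the bitmask for the frac (fractional nanoseconds) field.

The figure below shows a 56-bit timecounter being initialized with the armv7 architected generic timer on rpi2.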

timecounter_read()

kernel/time/timecounter.c

/**
 * timecounter_read - return nanoseconds elapsed since timecounter_init()
 *                    plus the initial time stamp
 * @tc:          Pointer to time counter.
 *
 * In other words, keeps track of time since the same epoch as
 * the function which generated the initial time stamp.
 */

u64 timecounter_read(struct timecounter *tc)
{
        u64 nsec;

        /* increment time by nanoseconds since last call */
        nsec = timecounter_read_delta(tc);
        nsec += tc->nsec;
        tc->nsec = nsec;

        return nsec;
}
EXPORT_SYMBOL_GPL(timecounter_read);

Adds the delta (ns) elapsed since the last call, stores the result (ns) in tc->nsec, and returns it.

The following figure shows what happens when timecounter_read() is called for the first time 100 cycles (5208 ns) after initialization with timecounter_init().

timecounter_read_delta()

kernel/time/timecounter.c

/**
 * timecounter_read_delta - get nanoseconds since last call of this function
 * @tc:         Pointer to time counter
 *
 * When the underlying cycle counter runs over, this will be handled
 * correctly as long as it does not run over more than once between
 * calls.
 *
 * The first call to this function for a new time counter initializes
 * the time tracking and returns an undefined result.
 */

static u64 timecounter_read_delta(struct timecounter *tc)
{
        cycle_t cycle_now, cycle_delta;
        u64 ns_offset;

        /* read cycle counter: */
        cycle_now = tc->cc->read(tc->cc);

        /* calculate the delta since the last timecounter_read_delta(): */
        cycle_delta = (cycle_now - tc->cycle_last) & tc->cc->mask;

        /* convert to nanoseconds: */
        ns_offset = cyclecounter_cyc2ns(tc->cc, cycle_delta,
                                        tc->mask, &tc->frac);

        /* update time stamp of timecounter_read_delta() call: */
        tc->cycle_last = cycle_now;

        return ns_offset;
}

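Reads the cycle count, saves it in tc->cycle_last, and returns the delta (ns) elapsed since the last call.

Code line 7 reads the cycle count of the h/w timer attached to the cyclecounter.
Code line 10 subtracts the previous cycle value from the value just read, filters the result with the cyclecounter mask, and assigns it to cycle_delta.
Code lines 13~14 convert cycle_delta to elapsed time (ns).
- ((cycle_delta * cc->mult) + frac) >> cc->shift
Code line 17 saves the cycle count that was just read in tc->cycle_last.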
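
The masked subtraction on code line 10 is what lets a counter narrower than 64 bits survive a single wraparound. The following is a minimal user-space sketch of that step only. The helper name cycle_delta_sketch() and its bits parameter are assumptions made for illustration; the kernel keeps the mask itself in cc->mask.

```c
#include <stdint.h>

/*
 * Delta between two raw counter reads for a counter that is `bits` wide.
 * Hypothetical helper: the kernel code uses tc->cc->mask directly.
 */
static uint64_t cycle_delta_sketch(uint64_t now, uint64_t last, unsigned int bits)
{
    /* Build the mask of valid counter bits (all ones for a 64-bit counter). */
    uint64_t mask = (bits < 64) ? ((1ULL << bits) - 1) : ~0ULL;

    /* Unsigned subtraction wraps modulo 2^64; the mask reduces it to modulo 2^bits. */
    return (now - last) & mask;
}
```

With a 56-bit counter, a previous read of 0x00fffffffffffff0 followed by a read of 0x10 gives a delta of 0x20 cycles, which is the true distance across the wrap. The result is only correct if the counter wrapped at most once between the two reads.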

cyclecounter_cyc2ns()

include/linux/timecounter.h

/**
 * cyclecounter_cyc2ns - converts cycle counter cycles to nanoseconds
 * @cc:         Pointer to cycle counter.
 * @cycles:     Cycles
 * @mask:       bit mask for maintaining the 'frac' field
 * @frac:       pointer to storage for the fractional nanoseconds.
 */

static inline u64 cyclecounter_cyc2ns(const struct cyclecounter *cc,
                                      cycle_t cycles, u64 mask, u64 *frac)
{
        u64 ns = (u64) cycles;

        ns = (ns * cc->mult) + *frac;
        *frac = ns & mask;
        return ns >> cc->shift;
}

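Converts a cycle counter value to nanoseconds.

frac
- The low bits that the right shift would otherwise discard are kept in *frac and added back in on the next conversion.
- Reference: timecounter: keep track of accumulated fractional nanoseconds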
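
To see why the frac field matters, the conversion can be replayed in user space. This is a sketch under stated assumptions: struct cc_sketch, cyc2ns_sketch() and accumulate_ns() are hypothetical names, and only the arithmetic is taken from cyclecounter_cyc2ns() above. The values mult=0x3415_5555 and shift=24 are the rpi2 example used in this article.

```c
#include <stdint.h>

/* Hypothetical user-space stand-in for struct cyclecounter (mult/shift only). */
struct cc_sketch {
    uint32_t mult;
    uint32_t shift;
};

/* Same arithmetic as cyclecounter_cyc2ns(): carry the sub-ns remainder in *frac. */
static uint64_t cyc2ns_sketch(const struct cc_sketch *cc, uint64_t cycles,
                              uint64_t mask, uint64_t *frac)
{
    uint64_t ns = cycles * cc->mult + *frac;

    *frac = ns & mask;          /* keep the bits the shift throws away */
    return ns >> cc->shift;
}

/* Sum n conversions of `step` cycles each, like repeated timecounter_read() calls. */
static uint64_t accumulate_ns(const struct cc_sketch *cc, uint64_t step, int n)
{
    uint64_t mask = (1ULL << cc->shift) - 1;   /* as set up by timecounter_init() */
    uint64_t frac = 0, total = 0;

    for (int i = 0; i < n; i++)
        total += cyc2ns_sketch(cc, step, mask, &frac);
    return total;
}
```

With mult=0x34155555 and shift=24, one conversion of 100 cycles yields 5208 ns, and four such conversions add up to 20833 ns. Without the carried fraction the sum would be 4 * 5208 = 20832 ns, so the fraction keeps the truncation error from accumulating.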

timecounter_adjtime()

include/linux/timecounter.h

/**
 * timecounter_adjtime - Shifts the time of the clock.
 * @delta:      Desired change in nanoseconds.
 */

static inline void timecounter_adjtime(struct timecounter *tc, s64 delta)
{
        tc->nsec += delta;
}

Adjusts only the time (ns) of the timecounter by adding delta. (The cycle value is not changed.)

Structures

timecounter structure

include/linux/timecounter.h

/**
 * struct timecounter - layer above a %struct cyclecounter which counts nanoseconds
 *      Contains the state needed by timecounter_read() to detect
 *      cycle counter wrap around. Initialize with
 *      timecounter_init(). Also used to convert cycle counts into the
 *      corresponding nanosecond counts with timecounter_cyc2time(). Users
 *      of this code are responsible for initializing the underlying
 *      cycle counter hardware, locking issues and reading the time
 *      more often than the cycle counter wraps around. The nanosecond
 *      counter will only wrap around after ~585 years.
 *
 * @cc:                 the cycle counter used by this instance
 * @cycle_last:         most recent cycle counter value seen by
 *                      timecounter_read()
 * @nsec:               continuously increasing count
 * @mask:               bit mask for maintaining the 'frac' field
 * @frac:               accumulated fractional nanoseconds
 */

struct timecounter {
        const struct cyclecounter *cc;
        cycle_t cycle_last;
        u64 nsec;
        u64 mask;
        u64 frac;
};

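*cc
- Must point to the cyclecounter structure that corresponds to the h/w timer count (cycle) value.
cycle_last
- Holds the most recent cycle value read through the cyclecounter.
- It is used to compute the delta cycles on the next read.
nsec
- Holds the actual time (ns) that corresponds to cycle_last.
mask
- Bitmask for the frac field; timecounter_init() sets it to (1ULL << cc->shift) - 1. Wraparound of the cycle counter is handled by the separate mask in the cyclecounter.
frac
- Fractional nanoseconds

cyclecounter structure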

include/linux/timecounter.h

/**
 * struct cyclecounter - hardware abstraction for a free running counter
 *      Provides completely state-free accessors to the underlying hardware.
 *      Depending on which hardware it reads, the cycle counter may wrap
 *      around quickly. Locking rules (if necessary) have to be defined
 *      by the implementor and user of specific instances of this API.
 *
 * @read:               returns the current cycle value
 * @mask:               bitmask for two's complement
 *                      subtraction of non 64 bit counters,
 *                      see CYCLECOUNTER_MASK() helper macro
 * @mult:               cycle to nanosecond multiplier
 * @shift:              cycle to nanosecond divisor (power of two)
 */

struct cyclecounter {
        u64 (*read)(const struct cyclecounter *cc);
        u64 mask;
        u32 mult;
        u32 shift;
};

(*read)
- Hook that is connected to the function that reads the counter value of the h/w timer.
mask
- Cycle counter mask that keeps only the valid bits of a value derived from the h/w counter.
- e.g. 56-bit counter = 0x00ff_ffff_ffff_ffff
mult & shift
- To convert cycles to ns, multiply by mult and then shift right by shift.
- e.g. mult=0x3415_5555, shift=24
  - 1 cycle = 52ns
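
The mult value in the example can be reproduced from the counter frequency. The sketch below assumes a 19.2 MHz counter (the rpi2 rate quoted elsewhere in this series) and uses a hypothetical helper name, calc_mult_sketch(). It rounds to the nearest integer, which matches the example values in this article, but it is not presented as the kernel's own routine.

```c
#include <stdint.h>

/*
 * Pick mult so that (cycles * mult) >> shift approximates
 * cycles * 1e9 / freq_hz. Hypothetical helper for illustration;
 * callers must keep (1e9 << shift) within 64 bits.
 */
static uint32_t calc_mult_sketch(uint64_t freq_hz, uint32_t shift)
{
    uint64_t tmp = 1000000000ULL << shift;  /* ns per second, scaled by 2^shift */

    tmp += freq_hz / 2;                     /* round to nearest */
    return (uint32_t)(tmp / freq_hz);
}
```

For freq_hz=19200000 and shift=24 this gives 0x34155555, and (1 * 0x34155555) >> 24 is 52, i.e. the 52 ns per cycle shown above (the exact period is about 52.08 ns).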


Timer -4- (Clock Sources Watchdog)

2017-03-10 (updated 2020-02-03) 문영일

Clock Sources Watchdog

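A watchdog for handling unstable clock sources. At present it is applied only to the x86 architecture.

Note: this must be distinguished from the other watchdog systems in the kernel.

Registering a clock source on the watchdog list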

clocksource_enqueue_watchdog()

kernel/time/clocksource.c

static void clocksource_enqueue_watchdog(struct clocksource *cs)
{
        unsigned long flags;

        spin_lock_irqsave(&watchdog_lock, flags);
        if (cs->flags & CLOCK_SOURCE_MUST_VERIFY) {
                /* cs is a clocksource to be watched. */
                list_add(&cs->wd_list, &watchdog_list);
                cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
        } else {
                /* cs is a watchdog. */
                if (cs->flags & CLOCK_SOURCE_IS_CONTINUOUS)
                        cs->flags |= CLOCK_SOURCE_VALID_FOR_HRES;
                /* Pick the best watchdog. */
                if (!watchdog || cs->rating > watchdog->rating) {
                        watchdog = cs;
                        /* Reset watchdog cycles */
                        clocksource_reset_watchdog();
                }
        }
        /* Check if the watchdog timer needs to be started. */
        clocksource_start_watchdog();
        spin_unlock_irqrestore(&watchdog_lock, flags);
}

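If the given clock source has the must_verify flag set, it is added to the watchdog list, and a 0.5-second timer is armed so that its stability can be judged and the watchdog thread can be run when needed. If the flag is not set, the clock source is a watchdog candidate, and the global watchdog is made to point to the candidate with the best rating.

Code lines 6~9: if the must_verify flag is set, add the clock source to the watchdog list and clear its watchdog flag.
Code lines 10~13: otherwise, if the continuous flag is set, add the valid_for_hres flag.
Code lines 15~19: if no watchdog has been chosen yet, or the given clock source has a higher rating than the current watchdog clock source, make it the watchdog clock source and clear the watchdog flag of every clock source on the watchdog list.
Code line 22 starts the watchdog timer.

The following figure shows a clock source with the must_verify flag being added to the watchdog list, the timer being armed with a 0.5-second expiry to check the stability of the clock source, and the watchdog thread then being run.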

clocksource_reset_watchdog()

kernel/time/clocksource.c

static inline void clocksource_reset_watchdog(void)
{
        struct clocksource *cs;

        list_for_each_entry(cs, &watchdog_list, wd_list)
                cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
}

Clears only the watchdog bit in the flags of every clock source registered on the watchdog list.

clocksource_start_watchdog()

kernel/time/clocksource.c

static inline void clocksource_start_watchdog(void)
{
        if (watchdog_running || !watchdog || list_empty(&watchdog_list))
                return;
        init_timer(&watchdog_timer);
        watchdog_timer.function = clocksource_watchdog;
        watchdog_timer.expires = jiffies + WATCHDOG_INTERVAL;
        add_timer_on(&watchdog_timer, cpumask_first(cpu_online_mask));
        watchdog_running = 1;
}

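Requests a lowres timer with a 0.5-second expiry for the clock source watchdog.

If the timer is already running, there is no watchdog clock source, or the watchdog list is empty, the function returns without doing anything.

Clock source watchdog handler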

clocksource_watchdog()

kernel/time/clocksource.c

static void clocksource_watchdog(unsigned long data)
{
        struct clocksource *cs;
        cycle_t csnow, wdnow, delta;
        int64_t wd_nsec, cs_nsec;
        int next_cpu, reset_pending;

        spin_lock(&watchdog_lock);
        if (!watchdog_running)
                goto out;

        reset_pending = atomic_read(&watchdog_reset_pending);

        list_for_each_entry(cs, &watchdog_list, wd_list) {

                /* Clocksource already marked unstable? */
                if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
                        if (finished_booting)
                                schedule_work(&watchdog_work);
                        continue;
                }

                local_irq_disable();
                csnow = cs->read(cs);
                wdnow = watchdog->read(watchdog);
                local_irq_enable();

                /* Clocksource initialized ? */
                if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
                    atomic_read(&watchdog_reset_pending)) {
                        cs->flags |= CLOCK_SOURCE_WATCHDOG;
                        cs->wd_last = wdnow;
                        cs->cs_last = csnow;
                        continue;
                }

                delta = clocksource_delta(wdnow, cs->wd_last, watchdog->mask);
                wd_nsec = clocksource_cyc2ns(delta, watchdog->mult,
                                             watchdog->shift);

                delta = clocksource_delta(csnow, cs->cs_last, cs->mask);
                cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift);
                cs->cs_last = csnow;
                cs->wd_last = wdnow;

                if (atomic_read(&watchdog_reset_pending))
                        continue;

                /* Check the deviation from the watchdog clocksource. */
                if ((abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD)) {
                        clocksource_unstable(cs, cs_nsec - wd_nsec);
                        continue;
                }

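The watchdog timer handler. It compares every clock on the watchdog list against the watchdog clock source, and if the deviation exceeds the threshold (0.0625 seconds), marks the clock unstable and hands it over to the watchdog thread.

Code lines 9~10: if the watchdog is not running, stop and leave the function.
- This covers the case where the timer fires after the watchdog timer has been stopped.
Code line 12 reads and keeps the current watchdog reset pending value.
Code lines 14~21: if the clock source is already marked unstable, skip to the next clock source. If booting has finished, schedule the watchdog work first.
Code lines 23~26 read the counter of the current clock source and the counter of the watchdog clock source.
Code lines 29~35: if the clock source does not have the watchdog flag set, or a watchdog reset is pending, set the watchdog flag on the current clock source, save the watchdog counter value and the clock source counter value just read in wd_last and cs_last, and skip to the next clock source.
Code lines 37~39: for the watchdog clock source, store the counter (cycle) difference between the previously saved value and the value just read in delta, and the elapsed time in wd_nsec.
Code lines 41~42 compute the same for the current clock source and store the elapsed time in cs_nsec.
Code lines 43~44 save the values just read as the last-read counter values of the current clock source (cs_last and wd_last).
Code lines 46~47: if a watchdog reset has become pending, skip to the next clock source.
Code lines 50~53: if the difference between the elapsed time of the current clock source and that of the watchdog clock source exceeds the watchdog threshold (0.0625 seconds), mark the current clock source unstable and skip to the next clock source.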
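
The deviation check on code lines 50~53 can be sketched in isolation. The names below carry a _SKETCH or _sketch suffix because they are hypothetical; the threshold is written as one second shifted right by 4, which is an assumption chosen to match the 0.0625-second figure given above.

```c
#include <stdint.h>
#include <stdbool.h>

#define NSEC_PER_SEC_SKETCH        1000000000LL
/* 1 s >> 4 = 62,500,000 ns = 0.0625 s (assumed to match the text above). */
#define WATCHDOG_THRESHOLD_SKETCH  (NSEC_PER_SEC_SKETCH >> 4)

/*
 * cs_nsec: time elapsed according to the clock source under test.
 * wd_nsec: time elapsed according to the watchdog clock source.
 * Returns true when the two disagree by more than the threshold.
 */
static bool is_unstable_sketch(int64_t cs_nsec, int64_t wd_nsec)
{
    int64_t diff = cs_nsec - wd_nsec;

    if (diff < 0)
        diff = -diff;
    return diff > WATCHDOG_THRESHOLD_SKETCH;
}
```

Because both intervals are measured over the same 0.5-second watchdog period, a difference of exactly 62,500,000 ns still passes; only a larger drift marks the clock source unstable.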

                if (!(cs->flags & CLOCK_SOURCE_VALID_FOR_HRES) &&
                    (cs->flags & CLOCK_SOURCE_IS_CONTINUOUS) &&
                    (watchdog->flags & CLOCK_SOURCE_IS_CONTINUOUS)) {
                        /* Mark it valid for high-res. */
                        cs->flags |= CLOCK_SOURCE_VALID_FOR_HRES;

                        /*
                         * clocksource_done_booting() will sort it if
                         * finished_booting is not set yet.
                         */
                        if (!finished_booting)
                                continue;

                        /*
                         * If this is not the current clocksource let
                         * the watchdog thread reselect it. Due to the
                         * change to high res this clocksource might
                         * be preferred now. If it is the current
                         * clocksource let the tick code know about
                         * that change.
                         */
                        if (cs != curr_clocksource) {
                                cs->flags |= CLOCK_SOURCE_RESELECT;
                                schedule_work(&watchdog_work);
                        } else {
                                tick_clock_notify();
                        }
                }
        }

        /*
         * We only clear the watchdog_reset_pending, when we did a
         * full cycle through all clocksources.
         */
        if (reset_pending)
                atomic_dec(&watchdog_reset_pending);

        /*
         * Cycle through CPUs to check if the CPUs stay synchronized
         * to each other.
         */
        next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
        if (next_cpu >= nr_cpu_ids)
                next_cpu = cpumask_first(cpu_online_mask);
        watchdog_timer.expires += WATCHDOG_INTERVAL;
        add_timer_on(&watchdog_timer, next_cpu);
out:
        spin_unlock(&watchdog_lock);
}

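Code lines 1~5: if both the current clock source and the watchdog clock source have the continuous flag set and the current clock source does not yet have the valid_for_hres flag, set that flag.
Code lines 11~12: if booting has not finished, skip to the next clock source.
Code lines 22~27: if the clock source is not curr_clocksource, add the reselect flag and schedule the watchdog work; if it is, notify the tick code asynchronously that the clock source has changed.
Code lines 35~36: if a reset was already pending when the routine started, decrement the watchdog reset pending value.
Code lines 42~46 re-arm the watchdog timer on the next cpu with the watchdog interval (0.5 seconds).

Handling unstable clock sources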
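
The round-robin CPU selection on code lines 42~46 can be illustrated with a plain bitmask. This is only a sketch: next_online_cpu_sketch() and the encoding of online CPUs as bits of an unsigned int are assumptions, standing in for cpumask_next() and cpumask_first() on cpu_online_mask.

```c
/*
 * Next online CPU after `cur`, wrapping to the first online CPU.
 * Bit n of `online_mask` is set when CPU n is online (hypothetical encoding,
 * valid for nr_cpus up to the bit width of unsigned int).
 */
static int next_online_cpu_sketch(unsigned int online_mask, int cur, int nr_cpus)
{
    /* Look for an online CPU above the current one. */
    for (int cpu = cur + 1; cpu < nr_cpus; cpu++)
        if (online_mask & (1u << cpu))
            return cpu;

    /* None found: wrap around to the first online CPU. */
    for (int cpu = 0; cpu < nr_cpus; cpu++)
        if (online_mask & (1u << cpu))
            return cpu;

    return -1;  /* no CPU is online in the mask */
}
```

With CPUs 0, 1 and 3 online (mask 0xB), the timer moves 0 -> 1 -> 3 -> 0, so each online CPU takes a turn at reading the clock sources.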

clocksource_unstable()

kernel/time/clocksource.c

static void clocksource_unstable(struct clocksource *cs, int64_t delta)
{
        printk(KERN_WARNING "Clocksource %s unstable (delta = %Ld ns)\n",
               cs->name, delta);
        __clocksource_unstable(cs);
}

Prints a warning message for the unstable clock source, then marks it unstable and schedules the watchdog work function to handle it.

__clocksource_unstable()

kernel/time/clocksource.c

static void __clocksource_unstable(struct clocksource *cs)
{
        cs->flags &= ~(CLOCK_SOURCE_VALID_FOR_HRES | CLOCK_SOURCE_WATCHDOG);
        cs->flags |= CLOCK_SOURCE_UNSTABLE;
        if (finished_booting)
                schedule_work(&watchdog_work);
}

Marks the clock source unstable and, once booting has finished, schedules the function registered in the work item below to handle it.

kernel/time/clocksource.c

static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);

This is the work item that creates and runs the watchdog thread.

clocksource_watchdog_work()

kernel/time/clocksource.c

static void clocksource_watchdog_work(struct work_struct *work)
{
        /*
         * If kthread_run fails the next watchdog scan over the
         * watchdog_list will find the unstable clock again.
         */
        kthread_run(clocksource_watchdog_kthread, NULL, "kwatchdog");
}

Creates and runs the watchdog thread.

Watchdog thread

clocksource_watchdog_kthread()

kernel/time/clocksource.c

static int clocksource_watchdog_kthread(void *data)
{
        mutex_lock(&clocksource_mutex);
        if (__clocksource_watchdog_kthread())
                clocksource_select();
        mutex_unlock(&clocksource_mutex);
        return 0;
}

Changes the rating of the unstable clocks on the watchdog list to 0, moves them back onto the clock source list, and then goes through clock source selection again.

__clocksource_watchdog_kthread()

kernel/time/clocksource.c

static int __clocksource_watchdog_kthread(void)
{
        struct clocksource *cs, *tmp;
        unsigned long flags;
        LIST_HEAD(unstable);
        int select = 0;

        spin_lock_irqsave(&watchdog_lock, flags);
        list_for_each_entry_safe(cs, tmp, &watchdog_list, wd_list) {
                if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
                        list_del_init(&cs->wd_list);
                        list_add(&cs->wd_list, &unstable);
                        select = 1;
                }
                if (cs->flags & CLOCK_SOURCE_RESELECT) {
                        cs->flags &= ~CLOCK_SOURCE_RESELECT;
                        select = 1;
                }
        }
        /* Check if the watchdog timer needs to be stopped. */
        clocksource_stop_watchdog();
        spin_unlock_irqrestore(&watchdog_lock, flags);

        /* Needs to be done outside of watchdog lock */
        list_for_each_entry_safe(cs, tmp, &unstable, wd_list) {
                list_del_init(&cs->wd_list); 
                __clocksource_change_rating(cs, 0);
        }
        return select;
}

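Changes the rating of the unstable clock sources on the watchdog list to 0 and moves them back onto the clock source list.

Code lines 8~14 move unstable clock sources from the watchdog list to the temporary unstable list.
Code lines 15~18: for clocks that have the reselect flag, the flag is cleared again and select is set to 1.
Code line 21 stops the watchdog timer if it is no longer needed.
Code lines 25~28 change the rating of the unstable clock sources on the unstable list to 0 and move them back onto the clock source list.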

__clocksource_change_rating()

kernel/time/clocksource.c

static void __clocksource_change_rating(struct clocksource *cs, int rating)
{
        list_del(&cs->list);
        cs->rating = rating;
        clocksource_enqueue(cs);
}

Changes the rating of the given clock source and adds it to the clock source list again.

Selecting a clock source when booting completes

clocksource_done_booting()

kernel/time/clocksource.c

/*
 * clocksource_done_booting - Called near the end of core bootup
 *
 * Hack to avoid lots of clocksource churn at boot time.
 * We use fs_initcall because we want this to start before
 * device_initcall but after subsys_initcall.
 */
static int __init clocksource_done_booting(void)
{
        mutex_lock(&clocksource_mutex);
        curr_clocksource = clocksource_default_clock();
        finished_booting = 1;
        /*
         * Run the watchdog first to eliminate unstable clock sources
         */
        __clocksource_watchdog_kthread();
        clocksource_select();
        mutex_unlock(&clocksource_mutex);
        return 0;
}
fs_initcall(clocksource_done_booting);

Weeds out unstable clock sources and selects the best clock source.


Timer -7- (Sched Clock & Delay Timers)

2017-03-10 (updated 2020-02-12) 문영일

Sched Clock

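sched_clock provides an ns-resolution counter used for time calculations; a high-precision counter provided by the clock source subsystem is registered as sched_clock.

sched_clock, which used to be driven by a 32-bit regular timer, was extended to a 64-bit hrtimer-based design. (kernel v3.13-rc1)
- Reference: sched_clock: Add support for >32 bit sched_clock
- Reference: sched_clock: Use an hrtimer instead of timer
arm and arm64 systems that use the architected timer
- Until a sched_clock backed by the 56-bit architected timer is registered, a function based on the jiffies value, which is updated by the regular timer, is used.
- They use the CONFIG_GENERIC_SCHED_CLOCK kernel option.
The registered scheduler clock (ns) value can be read through the sched_clock() API.

The following figure shows the transition from the jiffies clock counter to a scheduler clock based on the 56-bit architected counter once it is registered.

Initializing the scheduler clock

sched_clock_init()

arm and arm64 do not use the CONFIG_HAVE_UNSTABLE_SCHED_CLOCK kernel option, so the variant of the function built without this option is analyzed here.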

kernel/sched/clock.c

void __init sched_clock_init(void)
{
        static_branch_inc(&sched_clock_running);
        local_irq_disable();
        generic_sched_clock_init();
        local_irq_enable();
}

Performs the generic scheduler clock initialization with irqs disabled.

generic_sched_clock_init()

kernel/time/sched_clock.c

void __init generic_sched_clock_init(void)
{
        /*
         * If no sched_clock() function has been provided at that point,
         * make it the final one one.
         */
        if (cd.actual_read_sched_clock == jiffy_sched_clock_read)
                sched_clock_register(jiffy_sched_clock_read, BITS_PER_LONG, HZ);

        update_sched_clock();

        /*
         * Start the timer to keep sched_clock() properly updated and
         * sets the initial epoch.
         */
        hrtimer_init(&sched_clock_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
        sched_clock_timer.function = sched_clock_poll;
        hrtimer_start(&sched_clock_timer, cd.wrap_kt, HRTIMER_MODE_REL);
}

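Initializes sched_clock.

Code lines 7~8: if no high-precision hw-based scheduler clock has been registered and the read function of the scheduler clock is still the jiffy one, jiffies is used as the scheduler clock.
Code line 10 updates the scheduler clock.
Code lines 16~18 use an hrtimer to program a periodic call (about once an hour) to sched_clock_poll(), which updates sched_clock.

Initial values of the scheduler clock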

kernel/time/sched_clock.c

static struct clock_data cd ____cacheline_aligned = {
        .read_data[0] = { .mult = NSEC_PER_SEC / HZ,
                          .read_sched_clock = jiffy_sched_clock_read, },
        .actual_read_sched_clock = jiffy_sched_clock_read,
};

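If no scheduler clock is specified, the jiffies hook function above is used.

Early in kernel boot, jiffy_sched_clock_read() is used, but on arm and arm64, once the generic architected timer is ready, the following 56-bit counter based function is used instead.
- e.g. arch_counter_get_cntvct()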

jiffy_sched_clock_read()

kernel/time/sched_clock.c

static u64 notrace jiffy_sched_clock_read(void)
{
        /*
         * We don't need to use get_jiffies_64 on 32-bit arches here
         * because we register with BITS_PER_LONG
         */
        return (u64)(jiffies - INITIAL_JIFFIES);
}

Returns the number of jiffies elapsed since INITIAL_JIFFIES. This is the read function used early in kernel boot.

sched_clock_poll()

kernel/time/sched_clock.c

static enum hrtimer_restart sched_clock_poll(struct hrtimer *hrt)
{
        update_sched_clock();
        hrtimer_forward_now(hrt, cd.wrap_kt);

        return HRTIMER_RESTART;
}

Updates the scheduler clock, then re-programs the hrtimer using its forward feature. (about a one-hour period)

Registering the Sched Clock

sched_clock_register()

kernel/time/sched_clock.c

void __init
sched_clock_register(u64 (*read)(void), int bits, unsigned long rate)
{
        u64 res, wrap, new_mask, new_epoch, cyc, ns;
        u32 new_mult, new_shift;
        unsigned long r;
        char r_unit;
        struct clock_read_data rd;

        if (cd.rate > rate)
                return;

        WARN_ON(!irqs_disabled());

        /* Calculate the mult/shift to convert counter ticks to ns. */
        clocks_calc_mult_shift(&new_mult, &new_shift, rate, NSEC_PER_SEC, 3600);

        new_mask = CLOCKSOURCE_MASK(bits);
        cd.rate = rate;

        /* Calculate how many nanosecs until we risk wrapping */
        wrap = clocks_calc_max_nsecs(new_mult, new_shift, 0, new_mask, NULL);
        cd.wrap_kt = ns_to_ktime(wrap);

        rd = cd.read_data[0];

        /* Update epoch for new counter and update 'epoch_ns' from old counter*/
        new_epoch = read();
        cyc = cd.actual_read_sched_clock();
        ns = rd.epoch_ns + cyc_to_ns((cyc - rd.epoch_cyc) & rd.sched_clock_mask, rd.mult, rd.shift);
        cd.actual_read_sched_clock = read;

        rd.read_sched_clock     = read;
        rd.sched_clock_mask     = new_mask;
        rd.mult                 = new_mult;
        rd.shift                = new_shift;
        rd.epoch_cyc            = new_epoch;
        rd.epoch_ns             = ns;

        update_clock_read_data(&rd);

        if (sched_clock_timer.function != NULL) {
                /* update timeout for clock wrap */
                hrtimer_start(&sched_clock_timer, cd.wrap_kt, HRTIMER_MODE_REL);
        }

        r = rate;
        if (r >= 4000000) {
                r /= 1000000;
                r_unit = 'M';
        } else {
                if (r >= 1000) {
                        r /= 1000;
                        r_unit = 'k';
                } else {
                        r_unit = ' ';
                }
        }

        /* Calculate the ns resolution of this counter */
        res = cyc_to_ns(1ULL, new_mult, new_shift);

        pr_info("sched_clock: %u bits at %lu%cHz, resolution %lluns, wraps every %lluns\n",
                bits, r, r_unit, res, wrap);

        /* Enable IRQ time accounting if we have a fast enough sched_clock() */
        if (irqtime > 0 || (irqtime == -1 && rate >= 1000000))
                enable_sched_clock_irqtime();

        pr_debug("Registered %pS as sched_clock source\n", read);
}

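Registers the counter read function of a clock source as sched_clock.

Code lines 10~11: if the rate of the already registered sched_clock is higher than the requested @rate, return without doing anything.
- When several scheduler clocks are requested, the one with the highest rate is used.
Code line 16 computes the mult/shift needed to convert the requested clock frequency to ns, valid over a 3600-second range.
- rpi4 e.g. rate=54M -> mult=0x250_97b4, shift=21
- rpi2 e.g. rate=19.2M -> mult=0x682_aaab, shift=21
Code line 18 builds the mask value from the requested bits.
- rpi2 & rpi4 e.g. bits=56 -> new_mask = 0xff_ffff_ffff_ffff
Code lines 22~23 compute the wrap time, convert it to ktime, and store it in cd.wrap_kt.
- clocks_calc_max_nsecs() applies 50% of the wrap time usable by the counter.
- rpi4 e.g. rate=54MHz -> wrap=4,398,046,511,102 ns (about 73 minutes); with the code above, wrap_kt holds the same value as a ktime.
Code lines 28~29 read the requested new clock counter into new_epoch and the existing clock counter into cyc.
Code line 30 takes epoch_ns of the existing clock counter, adds the delta ns for the counter value just read, and assigns the result to ns.
- When sched_clock is first registered, the jiffies cyc value read is 0, so ns is always 0.
- When a clock source with a higher rate is later registered as sched_clock, the ns elapsed so far is carried over.
Code line 31 sets the new counter read function that the scheduler clock will read from.
- rpi2 & rpi4 e.g. arch_counter_get_cntvct()
Code lines 33~40 fill a clock_read_data structure with the new values and apply it to the scheduler clock.
Code lines 42~45 restart sched_clock_timer, which runs with the wrap_kt period (about one hour), if the timer has already been set up.
- rpi4 e.g. about every 73 minutes
Code lines 47~58 derive r and r_unit from rate for printing. (M is used when rate is 4M or higher, k when it is below that but at least 1000, and no unit prefix otherwise.)
- rpi4 e.g. rate=54000000 -> r=54, r_unit='M'
- rpi2 e.g. rate=19200000 -> r=19, r_unit='M'
Code line 61 computes the ns that corresponds to 1 cycle and assigns it to res.
Code lines 63~64 print information about sched_clock.
- rpi4 e.g. "sched_clock: 56 bits at 54MHz, resolution 18ns, wraps every 4398046511102ns"
- rpi2 e.g. "sched_clock: 56 bits at 19MHz, resolution 52ns, wraps every 3579139424256ns"
Code lines 67~68: if irqtime is greater than 0, or irqtime is -1 and the rate of sched_clock is 1M or higher, enable irq time accounting by setting the global variable sched_clock_irqtime to 1.
- The default value of irqtime is -1.
- irq time accounting works only on kernels built with the IRQ_TIME_ACCOUNTING kernel option and without the NO_HZ_FULL kernel option.
Code line 70 prints the name of the clock counter function that is registered and will be used as the scheduler clock.
- rpi4 e.g. "Registered arch_counter_get_cntvct+0x0/0x10 as sched_clock source"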
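
The unit selection on code lines 47~58 is easy to check on its own. The sketch below copies only that branch structure into a hypothetical helper, format_rate_sketch(); the output parameters are an assumption made so the result can be inspected without printing.

```c
/*
 * Scale `rate` (Hz) for display the way sched_clock_register() does:
 * 'M' from 4 MHz upwards, 'k' from 1 kHz upwards, otherwise no prefix.
 * Hypothetical helper; the kernel does this inline before pr_info().
 */
static void format_rate_sketch(unsigned long rate, unsigned long *r, char *r_unit)
{
    *r = rate;
    if (*r >= 4000000) {
        *r /= 1000000;
        *r_unit = 'M';
    } else if (*r >= 1000) {
        *r /= 1000;
        *r_unit = 'k';
    } else {
        *r_unit = ' ';
    }
}
```

Integer division truncates, which is why the 19.2 MHz rpi2 counter is reported as "19MHz". Rates between 1 MHz and 4 MHz stay in kHz, so for example 3000000 is shown as 3000 with unit 'k'.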

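The following figure shows the 56-bit architected counter used by an rpi4 system registered as the scheduler clock.

Updating and reading the scheduler clock

So that it can be read quickly from an nmi interrupt handler without dead-lock, the scheduler clock uses a lock-less implementation built on a sequence counter, and it is managed with two clock_read_data structures as follows.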

struct clock_read_data read_data[2];
Reference: timers, sched/clock: Avoid deadlock during read from NMI (2015, v4.1-rc1)

The following figure shows operation with the two clock data copies.

Updating the scheduler clock

update_sched_clock()

kernel/time/sched_clock.c

/*
 * Atomically update the sched_clock() epoch.
 */

static void update_sched_clock(void)
{
        u64 cyc;
        u64 ns;
        struct clock_read_data rd;

        rd = cd.read_data[0];

        cyc = cd.actual_read_sched_clock();
        ns = rd.epoch_ns + cyc_to_ns((cyc - rd.epoch_cyc) & rd.sched_clock_mask, rd.mult, rd.shift);

        rd.epoch_ns = ns;
        rd.epoch_cyc = cyc;

        update_clock_read_data(&rd);
}

Reads the scheduler clock counter and updates the epoch (epoch_cyc and epoch_ns).
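
The epoch update can be replayed with plain integers. This is a sketch under stated assumptions: struct rd_sketch keeps only the fields used here, the counter value is passed in rather than read from hardware, and cyc_to_ns_sketch() is assumed to be the usual (cyc * mult) >> shift conversion. The rpi4 values mult=0x250_97b4 and shift=21 come from the registration example in this article.

```c
#include <stdint.h>

/* Hypothetical cut-down version of struct clock_read_data. */
struct rd_sketch {
    uint64_t epoch_ns;   /* ns value at the last update */
    uint64_t epoch_cyc;  /* counter value at the last update */
    uint64_t mask;       /* valid counter bits */
    uint32_t mult;
    uint32_t shift;
};

/* Assumed conversion: multiply, then shift right. */
static uint64_t cyc_to_ns_sketch(uint64_t cyc, uint32_t mult, uint32_t shift)
{
    return (cyc * mult) >> shift;
}

/* Move the epoch forward to the counter value `cyc_now`. */
static void update_epoch_sketch(struct rd_sketch *rd, uint64_t cyc_now)
{
    uint64_t delta = (cyc_now - rd->epoch_cyc) & rd->mask;

    rd->epoch_ns += cyc_to_ns_sketch(delta, rd->mult, rd->shift);
    rd->epoch_cyc = cyc_now;
}
```

Starting from a zero epoch, feeding in 54,000,000 cycles (one second at 54 MHz) moves epoch_ns to 999,999,996, a few ns short of one second because mult is rounded. Moving the epoch forward before the multiplication can overflow is the reason for the periodic update.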

update_clock_read_data()

kernel/time/sched_clock.c

/*
 * Updating the data required to read the clock.
 *
 * sched_clock() will never observe mis-matched data even if called from
 * an NMI. We do this by maintaining an odd/even copy of the data and
 * steering sched_clock() to one or the other using a sequence counter.
 * In order to preserve the data cache profile of sched_clock() as much
 * as possible the system reverts back to the even copy when the update
 * completes; the odd copy is used *only* during an update.
 */

static void update_clock_read_data(struct clock_read_data *rd)
{
        /* update the backup (odd) copy with the new data */
        cd.read_data[1] = *rd;

        /* steer readers towards the odd copy */
        raw_write_seqcount_latch(&cd.seq);

        /* now its safe for us to update the normal (even) copy */
        cd.read_data[0] = *rd;

        /* switch readers back to the even copy */
        raw_write_seqcount_latch(&cd.seq);
}

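Applies the @rd values to both the odd and even clock data copies of the scheduler clock.

read_data[1] is updated first; the sequence is then incremented to an odd value, read_data[0] is updated, and the sequence is incremented again back to an even value.

Reading the scheduler clock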

sched_clock()

kernel/time/sched_clock.c

unsigned long long notrace sched_clock(void)
{
        u64 cyc, res;
        unsigned int seq;
        struct clock_read_data *rd;

        do {
                seq = raw_read_seqcount(&cd.seq);
                rd = cd.read_data + (seq & 1);

                cyc = (rd->read_sched_clock() - rd->epoch_cyc) &
                      rd->sched_clock_mask;
                res = rd->epoch_ns + cyc_to_ns(cyc, rd->mult, rd->shift);
        } while (read_seqcount_retry(&cd.seq, seq));

        return res;
}

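Reads and returns the scheduler clock.

When the sequence is even, read_data[1] may be in the middle of an update, so the clock data in read_data[0] is used.
When the sequence is odd, read_data[0] is being updated, so the clock data in read_data[1] is used.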
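
The odd/even selection described above can be sketched in user space. This is only a simplified, single-threaded model: the names model_update() and model_read_epoch_ns() are hypothetical, and the memory barriers that the kernel's raw_write_seqcount_latch() and read_seqcount_retry() provide are deliberately left out.

```c
#include <stdint.h>

/* Simplified model of the two-copy clock data (not the kernel code). */
struct clock_read_data {
        uint64_t epoch_ns;
        uint64_t epoch_cyc;
};

struct clock_data {
        unsigned int seq;                    /* even: use copy 0, odd: use copy 1 */
        struct clock_read_data read_data[2];
};

static struct clock_data cd;

/* Writer: mirrors the order of steps in update_clock_read_data(). */
static void model_update(const struct clock_read_data *rd)
{
        cd.read_data[1] = *rd;   /* refresh the backup (odd) copy */
        cd.seq++;                /* seq is now odd: readers use read_data[1] */
        cd.read_data[0] = *rd;   /* safe to rewrite the normal (even) copy */
        cd.seq++;                /* seq is even again: readers use read_data[0] */
}

/* Reader: picks the copy by the low bit of seq and retries on a change. */
static uint64_t model_read_epoch_ns(void)
{
        unsigned int seq;
        uint64_t ns;

        do {
                seq = cd.seq;
                ns = cd.read_data[seq & 1].epoch_ns;
        } while (cd.seq != seq);

        return ns;
}
```

After one model_update() the sequence has advanced by two, so a reader is back on the even copy, which matches the comment in the kernel source about preserving the data cache profile.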

Initializing the scheduler clock suspend/resume handlers

The following figure shows how the handlers are initialized so that the scheduler clock is switched on suspend/resume.

sched_clock_syscore_init()

kernel/time/sched_clock.c

static int __init sched_clock_syscore_init(void)
{
        register_syscore_ops(&sched_clock_ops);

        return 0;
}
device_initcall(sched_clock_syscore_init);

Registers sched_clock_ops for suspend/resume.

sched_clock_ops

kernel/time/sched_clock.c

static struct syscore_ops sched_clock_ops = {
        .suspend        = sched_clock_suspend,
        .resume         = sched_clock_resume,
};

sched_clock_suspend()

kernel/time/sched_clock.c

int sched_clock_suspend(void)
{
        struct clock_read_data *rd = &cd.read_data[0];

        update_sched_clock();
        hrtimer_cancel(&sched_clock_timer);
        rd->read_sched_clock = suspended_sched_clock_read;

        return 0;
}

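Called on suspend to change how the scheduler clock behaves.

In code line 5, updates sched_clock.
In code line 6, cancels sched_clock_timer, which runs with a period of about one hour.
In code line 7, changes the hook function so that sched_clock() reads the epoch_cyc value stored by the last sched_clock update instead of the hw counter.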

sched_clock_resume()

kernel/time/sched_clock.c

void sched_clock_resume(void)
{
        struct clock_read_data *rd = &cd.read_data[0];

        rd->epoch_cyc = cd.actual_read_sched_clock();
        hrtimer_start(&sched_clock_timer, cd.wrap_kt, HRTIMER_MODE_REL);
        rd->read_sched_clock = cd.actual_read_sched_clock;
}

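Called on resume to change how the scheduler clock behaves.

In code line 5, updates the sched_clock epoch_cyc by reading the real hw counter.
In code line 6, restarts sched_clock_timer, which runs with a period of about one hour.
In code line 7, changes the hook function so that sched_clock() reads the real hw counter again.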

suspended_sched_clock_read()

kernel/time/sched_clock.c

/*
 * Clock read function for use when the clock is suspended.
 *
 * This function makes it appear to sched_clock() as if the clock
 * stopped counting at its last update.
 *
 * This function must only be called from the critical
 * section in sched_clock(). It relies on the read_seqcount_retry()
 * at the end of the critical section to be sure we observe the
 * correct copy of 'epoch_cyc'.
 */

static u64 notrace suspended_sched_clock_read(void)
{
        unsigned int seq = raw_read_seqcount(&cd.seq);

        return cd.read_data[seq & 1].epoch_cyc;
}

Returns the scheduler clock value to be read while suspended.

delay functions – ARM64

On arm64 systems the cpu waits in a busy-wait loop that uses wfe. The ndelay() and udelay() APIs are used in atomic context. mdelay(), however, busy-waits for too long and is not recommended; where possible it is better to use msleep(), which is meant for non-atomic context.

The following figure shows the call relationship of the delay functions for arm64.

Millisecond delay

mdelay()

include/linux/delay.h

#ifndef mdelay
#define mdelay(n) (\
        (__builtin_constant_p(n) && (n)<=MAX_UDELAY_MS) ? udelay((n)*1000) : \
        ({unsigned long __ms=(n); while (__ms--) udelay(1000);}))
#endif

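Delays for @n milliseconds.

If @n is a constant no larger than MAX_UDELAY_MS (5) milliseconds, it is multiplied by 1000 and passed to udelay().
- That is, for 5 ms or less the value is converted to us and udelay() is called only once.
  - 1000, 2000, 3000, 4000 or 5000
Otherwise udelay(1000) is called @n times.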
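
The two branches of the macro can be illustrated with a small model. Everything here is a hypothetical stand-in: udelay_stub() only records what it was asked to do, and the is_constant flag plays the role of __builtin_constant_p(n), which a plain function cannot evaluate for its caller.

```c
#include <stdbool.h>

#define MAX_UDELAY_MS 5

static unsigned long udelay_calls;      /* number of udelay_stub() calls */
static unsigned long udelay_total_us;   /* total microseconds requested */

/* Stand-in for udelay(): records the request instead of busy-waiting. */
static void udelay_stub(unsigned long us)
{
        udelay_calls++;
        udelay_total_us += us;
}

/* Model of mdelay(): one call for small constants, a 1 ms loop otherwise. */
static void mdelay_model(unsigned long ms, bool is_constant)
{
        if (is_constant && ms <= MAX_UDELAY_MS) {
                udelay_stub(ms * 1000);
        } else {
                while (ms--)
                        udelay_stub(1000);
        }
}
```

A constant 3 ms request ends up as a single 3000 us call, while an 8 ms request becomes eight 1000 us calls, which keeps every individual udelay() argument well below the 20000 us limit.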

Microsecond delay

udelay()

include/asm-generic/delay.h

/*
 * The weird n/20000 thing suppresses a "comparison is always false due to
 * limited range of data type" warning with non-const 8-bit arguments.
 */

/* 0x10c7 is 2**32 / 1000000 (rounded up) */

#define udelay(n)                                                       \
        ({                                                              \
                if (__builtin_constant_p(n)) {                          \
                        if ((n) / 20000 >= 1)                           \
                                 __bad_udelay();                        \
                        else                                            \
                                __const_udelay((n) * 0x10c7ul);         \
                } else {                                                \
                        __udelay(n);                                    \
                }                                                       \
        })

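Delays for @n microseconds.

If @n is a constant of 20000 or more, i.e. 20 ms or more, the build fails with an error through the reference to __bad_udelay().
If @n is a constant below 20000, i.e. less than 20 ms, __const_udelay() is called with @n multiplied by 0x10c7.
Otherwise __udelay() is called as is.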

__udelay()

arch/arm64/lib/delay.c

void __udelay(unsigned long usecs)
{
        __const_udelay(usecs * 0x10C7UL); /* 2**32 / 1000000 (rounded up) */
}
EXPORT_SYMBOL(__udelay);

Delays for @usecs microseconds.

Calls __const_udelay() with @usecs multiplied by 0x10c7.
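
The constant 0x10c7 comes from the comment in the source, 2**32 / 1000000 rounded up. A minimal sketch of that arithmetic follows; the helper name xloops_per_unit() is made up for this illustration and does not exist in the kernel.

```c
#include <stdint.h>

/*
 * Scale factor that turns a time value into "xloops":
 * ceil(2^32 / units_per_sec), e.g. units_per_sec = 1000000 for microseconds.
 */
static uint64_t xloops_per_unit(uint64_t units_per_sec)
{
        const uint64_t two32 = 1ULL << 32;

        return (two32 + units_per_sec - 1) / units_per_sec;
}
```

For microseconds this gives 4295 (0x10c7), and the same rounding applied to nanoseconds gives 5, the factor used by the nanosecond delay functions.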

Loop-unit delay

__const_udelay()

arch/arm64/lib/delay.c

inline void __const_udelay(unsigned long xloops)
{
        __delay(xloops_to_cycles(xloops));
}
EXPORT_SYMBOL(__const_udelay);

Delays for @xloops loops.

Converts the loop-unit value @xloops into cycles and calls __delay() with the result.

Nanosecond delay

ndelay()

include/asm-generic/delay.h

/* 0x5 is 2**32 / 1000000000 (rounded up) */

#define ndelay(n)                                                       \
        ({                                                              \
                if (__builtin_constant_p(n)) {                          \
                        if ((n) / 20000 >= 1)                           \
                                __bad_ndelay();                         \
                        else                                            \
                                __const_udelay((n) * 5ul);              \
                } else {                                                \
                        __ndelay(n);                                    \
                }                                                       \
        })

#endif /* __ASM_GENERIC_DELAY_H */

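Delays for @n nanoseconds.

If @n is a constant of 20000 or more, i.e. 20 us or more, the build fails with an error through the reference to __bad_ndelay().
If @n is a constant below 20000, i.e. less than 20 us, __const_udelay() is called with @n multiplied by 5.
- 5 xloops per 1 ns (2**32 / 1000000000, rounded up)
Otherwise __ndelay() is called as is.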

__ndelay()

arch/arm64/lib/delay.c

void __ndelay(unsigned long nsecs)
{
        __const_udelay(nsecs * 0x5UL); /* 2**32 / 1000000000 (rounded up) */
}
EXPORT_SYMBOL(__ndelay);

Delays for @nsecs nanoseconds.

Calls __const_udelay() with @nsecs multiplied by 5.

Cycle-unit delay

xloops_to_cycles()

arch/arm64/lib/delay.c

static inline unsigned long xloops_to_cycles(unsigned long xloops)
{
        return (xloops * loops_per_jiffy * HZ) >> 32;
}

Converts @xloops from loop units into cycles and returns the result.
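
As a worked example of this conversion, the sketch below passes loops_per_jiffy and HZ as parameters instead of using the kernel globals. The function names are hypothetical, and the values 19200 and 1000 are the 19.2 MHz / HZ=1000 figures that appear in the time_init() example.

```c
#include <stdint.h>

/* Same formula as xloops_to_cycles(), with lpj and hz as explicit inputs. */
static uint64_t xloops_to_cycles_model(uint64_t xloops, uint64_t lpj,
                                       uint64_t hz)
{
        return (xloops * lpj * hz) >> 32;
}

/* Microseconds -> xloops (x 0x10C7) -> counter cycles. */
static uint64_t usecs_to_cycles_model(uint64_t usecs, uint64_t lpj,
                                      uint64_t hz)
{
        return xloops_to_cycles_model(usecs * 0x10C7ULL, lpj, hz);
}
```

With lpj = 19200 and hz = 1000 the counter runs at 19.2 MHz, so a 10 us delay becomes 192 cycles and a 1 us delay becomes 19 cycles (19.2 truncated by the shift).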

__delay()

arch/arm64/lib/delay.c

void __delay(unsigned long cycles)
{
        cycles_t start = get_cycles();

        if (arch_timer_evtstrm_available()) {
                const cycles_t timer_evt_period =
                        USECS_TO_CYCLES(ARCH_TIMER_EVT_STREAM_PERIOD_US);

                while ((get_cycles() - start + timer_evt_period) < cycles)
                        wfe();
        }

        while ((get_cycles() - start) < cycles)
                cpu_relax();
}
EXPORT_SYMBOL(__delay);

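Delays for @cycles cycles.

In code lines 5~11, if the architected timer's event stream is running, the cpu waits with wfe in 100 us steps until the requested cycle count is nearly reached, which lowers the cpu load and saves power.
In code lines 13~14, busy-waits for the remaining cycles and leaves the loop once the requested cycle count has been reached.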
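
The two-phase wait can be simulated with a fake counter. This is a sketch under simple assumptions, not the kernel code: each simulated wfe is assumed to wake exactly one event-stream period later, each spin is assumed to cost one counter cycle, and all names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

static uint64_t fake_counter;     /* simulated counter value */
static unsigned int wfe_count;    /* number of simulated wfe waits */
static unsigned int spin_count;   /* number of simulated cpu_relax() spins */

static uint64_t get_cycles_model(void)
{
        return fake_counter;
}

static void delay_model(uint64_t cycles, bool evtstrm, uint64_t evt_period)
{
        uint64_t start = get_cycles_model();

        if (evtstrm) {
                /* Coarse phase: sleep while a whole period still fits. */
                while ((get_cycles_model() - start + evt_period) < cycles) {
                        fake_counter += evt_period;   /* assumed wfe wake-up */
                        wfe_count++;
                }
        }

        /* Fine phase: spin for whatever is left. */
        while ((get_cycles_model() - start) < cycles) {
                fake_counter++;                       /* assumed spin cost */
                spin_count++;
        }
}
```

For a 5000-cycle request with a 1920-cycle period (100 us at 19.2 MHz), the model sleeps twice with wfe and spins only for the last 1160 cycles.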

delay functions – ARM32

arm32 systems use a busy-wait based delay timer.

The following figure shows the call relationship of the delay functions for arm32.

Registering the delay timer (generic timer) – ARM32

arch_timer_delay_timer_register()

arch/arm/kernel/arch_timer.c

static void __init arch_timer_delay_timer_register(void)
{
        /* Use the architected timer for the delay loop. */
        arch_delay_timer.read_current_timer = arch_timer_read_counter_long;
        arch_delay_timer.freq = arch_timer_get_rate();
        register_current_timer_delay(&arch_delay_timer);
}

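Registers the generic timer built into the armv7 architecture so that it can be used as the delay timer.

The following figure shows the process of registering a generic timer configured for 100 Hz as the delay timer.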

register_current_timer_delay() – ARM32

arch/arm/lib/delay.c

void __init register_current_timer_delay(const struct delay_timer *timer)
{
        u32 new_mult, new_shift;
        u64 res;

        clocks_calc_mult_shift(&new_mult, &new_shift, timer->freq,
                               NSEC_PER_SEC, 3600);
        res = cyc_to_ns(1ULL, new_mult, new_shift);

        if (!delay_calibrated && (!delay_res || (res < delay_res))) {
                pr_info("Switching to timer-based delay loop, resolution %lluns\n", res);
                delay_timer                     = timer;
                lpj_fine                        = timer->freq / HZ;
                delay_res                       = res;

                /* cpufreq may scale loops_per_jiffy, so keep a private copy */
                arm_delay_ops.ticks_per_jiffy   = lpj_fine;
                arm_delay_ops.delay             = __timer_delay;
                arm_delay_ops.const_udelay      = __timer_const_udelay;
                arm_delay_ops.udelay            = __timer_udelay;
        } else {
                pr_info("Ignoring duplicate/late registration of read_current_timer delay\n");
        }
}

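Registers the delay timer and calibrates it. The first registration always performs the calibration.

In code lines 6~7, computes new_mult/new_shift so that the number of nanoseconds per cycle can be derived with an accuracy that holds for one hour.
In code line 8, computes the resolution res (the nanoseconds that correspond to one cycle).
- rpi2: 100 Hz, 19.2 MHz clock -> res=52
In code lines 10~20, if calibration has not completed yet and this is either the first registration or the requested timer has a higher resolution, sets up the delay timer.
- The smaller the res value, the higher the resolution of the timer.
- When several clock sources register, the timer with the best resolution is chosen as the delay timer.
- Once calibrate_delay() has finished the calibration, clock sources can no longer register a delay timer.
- rpi2 e.g.) "Switching to timer-based delay loop, resolution 52ns"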
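
The res=52 figure for rpi2 can be reproduced in user space. The sketch below is modelled on the kernel's clocks_calc_mult_shift() and cyc_to_ns() helpers as I understand them; the _model names are hypothetical and the code is an approximation for illustration, not a copy of the kernel source.

```c
#include <stdint.h>

/* Find mult/shift so that ns = (cycles * mult) >> shift for up to maxsec. */
static void calc_mult_shift_model(uint32_t *mult, uint32_t *shift,
                                  uint32_t from, uint32_t to, uint32_t maxsec)
{
        uint64_t tmp;
        uint32_t sft, sftacc = 32;

        /* Bits that must stay free so maxsec worth of cycles cannot overflow. */
        tmp = ((uint64_t)maxsec * from) >> 32;
        while (tmp) {
                tmp >>= 1;
                sftacc--;
        }

        /* Largest shift whose rounded multiplier still fits in sftacc bits. */
        for (sft = 32; sft > 0; sft--) {
                tmp = (uint64_t)to << sft;
                tmp += from / 2;
                tmp /= from;
                if ((tmp >> sftacc) == 0)
                        break;
        }
        *mult = (uint32_t)tmp;
        *shift = sft;
}

static uint64_t cyc_to_ns_model(uint64_t cyc, uint32_t mult, uint32_t shift)
{
        return (cyc * mult) >> shift;
}
```

For a 19.2 MHz counter converted to nanoseconds over 3600 seconds, this model yields shift 21 and a multiplier of 109226667, and one cycle converts to 52 ns (1 / 19.2 MHz is about 52.08 ns).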

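sleep functions

The following functions can be used in non-atomic context. For waits of roughly 10 us to 20 ms the kernel's timers-howto document recommends usleep_range(); udelay(), which also works in atomic context, is better kept for waits shorter than about 10 us.

hrtimer based
- usleep_range()
jiffies and legacy timer based
- msleep()
- msleep_interruptible()

The following figure shows the call relationship of the sleep functions.

Second-unit sleep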

ssleep()

include/linux/delay.h

static inline void ssleep(unsigned int seconds)
{
        msleep(seconds * 1000);
}

Sleeps for @seconds seconds.

Millisecond sleep

msleep()

kernel/time/timer.c

/**
 * msleep - sleep safely even with waitqueue interruptions
 * @msecs: Time in milliseconds to sleep for
 */

void msleep(unsigned int msecs)
{
        unsigned long timeout = msecs_to_jiffies(msecs) + 1;

        while (timeout)
                timeout = schedule_timeout_uninterruptible(timeout);
}
EXPORT_SYMBOL(msleep);

Sleeps for @msecs milliseconds, based on the jiffies scheduler tick.
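
Because the timeout is rounded up to jiffies and one extra jiffy is added, msleep() can sleep noticeably longer than requested when HZ is low. A rough model of the timeout calculation is shown below; it assumes HZ divides 1000 evenly, and the function names are hypothetical.

```c
/* Round milliseconds up to whole jiffies (assumes 1000 % hz == 0). */
static unsigned long msecs_to_jiffies_model(unsigned int msecs, unsigned int hz)
{
        unsigned int ms_per_jiffy = 1000 / hz;

        return (msecs + ms_per_jiffy - 1) / ms_per_jiffy;
}

/* Timeout that msleep() hands to the scheduler: rounded jiffies plus one. */
static unsigned long msleep_timeout_model(unsigned int msecs, unsigned int hz)
{
        return msecs_to_jiffies_model(msecs, hz) + 1;
}
```

With HZ=100, a request for 15 ms becomes 3 jiffies, i.e. up to 30 ms, which is one reason usleep_range() is preferred for short sleeps.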

Microsecond sleep

usleep_range()

kernel/time/timer.c

/**
 * usleep_range - Sleep for an approximate time
 * @min: Minimum time in usecs to sleep
 * @max: Maximum time in usecs to sleep
 *
 * In non-atomic context where the exact wakeup time is flexible, use
 * usleep_range() instead of udelay().  The sleep improves responsiveness
 * by avoiding the CPU-hogging busy-wait of udelay(), and the range reduces
 * power usage by allowing hrtimers to take advantage of an already-
 * scheduled interrupt instead of scheduling a new one just for this sleep.
 */

void __sched usleep_range(unsigned long min, unsigned long max)
{
        ktime_t exp = ktime_add_us(ktime_get(), min);
        u64 delta = (u64)(max - min) * NSEC_PER_USEC;

        for (;;) {
                __set_current_state(TASK_UNINTERRUPTIBLE);
                /* Do not return before the requested sleep time has elapsed */
                if (!schedule_hrtimeout_range(&exp, delta, HRTIMER_MODE_ABS))
                        break;
        }
}
EXPORT_SYMBOL(usleep_range);

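Sleeps for at least @min and, as a target, at most @max microseconds using an hrtimer. The expiry is set to now + @min and the difference @max – @min is passed as slack, which lets the hrtimer code coalesce the wakeup with an interrupt that is already scheduled.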
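
The expiry and slack that usleep_range() hands to schedule_hrtimeout_range() can be written out as plain arithmetic. The struct and function names below are hypothetical, and the current time is passed in as a parameter instead of being read with ktime_get().

```c
#include <stdint.h>

#define NSEC_PER_USEC 1000LL

/* Earliest wakeup time and the slack allowed after it, both in ns. */
struct sleep_window {
        int64_t exp_ns;      /* absolute expiry: now + min */
        uint64_t delta_ns;   /* slack: max - min */
};

static struct sleep_window usleep_range_window(int64_t now_ns,
                                               unsigned long min_us,
                                               unsigned long max_us)
{
        struct sleep_window w;

        w.exp_ns = now_ns + (int64_t)min_us * NSEC_PER_USEC;
        w.delta_ns = (uint64_t)(max_us - min_us) * NSEC_PER_USEC;
        return w;
}
```

For example, usleep_range(100, 200) issued at t = 1 ms yields an expiry of 1.1 ms with 100 us of slack.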

References

Timer -1- (Lowres Timer) | 문c
Timer -2- (HRTimer) | 문c
Timer -3- (Clock Sources Subsystem) | 문c
Timer -4- (Clock Sources Watchdog) | 문c
Timer -5- (Clock Events Subsystem) | 문c
Timer -6- (Clock Source & Timer Driver) | 문c
Timer -7- (Sched Clock & Delay Timers) | 문c (this post)
Timer -8- (Timecounter) | 문c
Timer -9- (Tick Device) | 문c
Timer -10- (Timekeeping) | 문c
Timer -11- (Posix Clock & Timers) | 문c
time_init() | 문c
sched_clock_postinit() | 문c
tick_init() | 문c
timekeeping_init() | 문c
calibrate_delay() | 문c

delays – Information on the various kernel delay / sleep mechanisms (Documentation/timers/timers-howto) | Kernel.org
[Linux:Kernel] 지연시간 – 다양한 커널 딜레이(delay) / 슬립(sleep) 메카니즘의 정보 | 다솜돌이

time_init()

2017-03-08 / 2019-12-25 문영일

Clock and timer initialization

time_init() – ARM64

arch/arm64/kernel/time.c

void __init time_init(void)
{
        u32 arch_timer_rate;

        of_clk_init(NULL);
        timer_probe();

        tick_setup_hrtimer_broadcast();

        arch_timer_rate = arch_timer_get_rate();
        if (!arch_timer_rate)
                panic("Unable to initialise architected timer.\n");

        /* Calibrate the delay loop directly */
        lpj_fine = arch_timer_rate / HZ;
}

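Initializes the clocks and timers.

In code line 5, initializes the device tree based clock devices.
- See: Common Clock Framework -1- (초기화)
In code line 6, initializes the clock sources for the timers.
- See: Timer -3- (Clock Sources Subsystem) | 문c
In code line 8, initializes the hrtimer used for tick broadcast.
In code lines 10~15, obtains the timer rate and stores the rate divided by HZ in lpj_fine.
- e.g.) arch_timer_rate = 19,200,000 (19.2 MHz), HZ=1000
  - lpj_fine=19,200

The following figure shows the function call relationship used by time_init() to initialize the clocks and timers.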

time_init() – ARM32

arch/arm/kernel/time.c

void __init time_init(void)
{
        if (machine_desc->init_time) {
                machine_desc->init_time();
        } else {
#ifdef CONFIG_COMMON_CLK
                of_clk_init(NULL);
#endif
                timer_probe();
        }
}

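Initializes the clocks and timers.

In code lines 3~4, if the system provides machine specific initialization code, calls that function.
- rpi2: calls bcm2709_timer_init()
In code lines 5~9, otherwise uses the Device Tree to initialize the clock devices and the clock sources for the timers.

The following figure shows the function call relationship used by time_init() to initialize the clocks and timers.

time initialization through the machine descriptor – RPI2(BCM2709) – kernel v4.0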

bcm2709_timer_init()

arch/arm/mach-bcm2709/bcm2709.c

static void __init bcm2709_timer_init(void)
{
        extern void dc4_arch_timer_init(void);
        // timer control
        writel(0, __io_address(ARM_LOCAL_CONTROL));
        // timer pre_scaler
        writel(0x80000000, __io_address(ARM_LOCAL_PRESCALER)); // 19.2MHz
        //writel(0x06AAAAAB, __io_address(ARM_LOCAL_PRESCALER)); // 1MHz

        if (use_dt)
        {
                of_clk_init(NULL);
                clocksource_of_init();
        }
        else
                dc4_arch_timer_init();
}

Clears the Local timer control of the boot cpu to 0, sets the pre-scaler for 19.2 MHz, and then initializes the clock sources.

In code line 5, clears the Local timer control register to 0.
- ARM_LOCAL_CONTROL
  - HW_REGISTER_RW(ARM_LOCAL_BASE+0x000)
  - ARM_LOCAL_BASE = 0x4000_0000
In code line 7, sets the pre-scaler of the Local timer for 19.2 MHz.
- ARM_LOCAL_PRESCALER
  - HW_REGISTER_RW(ARM_LOCAL_BASE+0x008)
In code lines 10~14, initializes the clocks and clock sources through the device tree.
In code lines 15~16, initializes the clocks and clock sources with rpi2 machine specific code.

dc4_arch_timer_init()

drivers/clocksource/arm_arch_timer.c

int __init dc4_arch_timer_init(void)
{       
        if (arch_timers_present & ARCH_CP15_TIMER) {
                pr_warn("arch_timer: multiple nodes in dt, skipping\n");
                return -1;
        }
        
        arch_timers_present |= ARCH_CP15_TIMER;
                
        /* Try to determine the frequency from the device tree or CNTFRQ */
        arch_timer_rate = 19200000;
                
        arch_timer_ppi[PHYS_SECURE_PPI]    = IRQ_ARM_LOCAL_CNTPSIRQ;
        arch_timer_ppi[PHYS_NONSECURE_PPI] = IRQ_ARM_LOCAL_CNTPNSIRQ;
        arch_timer_ppi[VIRT_PPI]           = IRQ_ARM_LOCAL_CNTVIRQ;
        arch_timer_ppi[HYP_PPI]            = IRQ_ARM_LOCAL_CNTHPIRQ;

        /*
         * If HYP mode is available, we know that the physical timer
         * has been configured to be accessible from PL1. Use it, so
         * that a guest can use the virtual timer instead.
         *     
         * If no interrupt provided for virtual timer, we'll have to
         * stick to the physical timer. It'd better be accessible...
         */    
        if (is_hyp_mode_available() || !arch_timer_ppi[VIRT_PPI]) {
                arch_timer_use_virtual = false;
        }

        arch_timer_c3stop = 0;
        
        arch_timer_register();
        arch_timer_common_init();
        return 0;
}

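Registers the generic timer as the architecture clock source, hooks up its interrupts and registers it as an event source. It also prepares the timer to be registered as the scheduler clock and the delay timer.

In code lines 3~8, if a generic architected timer that uses coprocessor cp15 has already been initialized, leaves the function.
In code line 11, stores 19.2 MHz in the global arch_timer_rate.
In code lines 13~16, stores the interrupt number used by each of the four timers.
- In code order, IRQs 96, 97, 99 and 98 are used.
In code lines 26~28, if the system booted in hyp mode or no virtual timer interrupt is provided, sets arch_timer_use_virtual to false.
In code line 30, stores 0 in the global arch_timer_c3stop.
- This means that the counter never stops and always keeps running.
In code line 32, registers the generic timer selected for this kernel on the per-cpu interrupt, enables it immediately for the boot cpu and registers it as a clock event device.
In code line 33, registers the generic timer selected for this kernel as a clock source and scheduler clock, and also as the delay timer.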

References

Common Clock Framework -1- (초기화)
Common Clock Framework -2- (APIs)
Timer -1- (Lowres Timer) | 문c
Timer -2- (HRTimer) | 문c
Timer -3- (Clock Sources Subsystem) | 문c
Timer -4- (Clock Sources Watchdog) | 문c
Timer -5- (Clock Events Subsystem) | 문c
Timer -6- (Sched Clock & Delay Timers) | 문c
Timer -7- (Timecounter) | 문c
Timer -8- (Tick Device) | 문c
Timer -9- (Timekeeping) | 문c
Timer -10- (Posix Clock & Timers) | 문c
time_init() | 문c (this post)
sched_clock_postinit() | 문c
sched_clock_init() | 문c
init_timers() | 문c
hrtimers_init() | 문c
tick_init() | 문c
timekeeping_init() | 문c
calibrate_delay() | 문c

Timer -2- (HRTimer)

2017-03-08 / 2020-02-03 문영일

hrtimer

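hrtimer (High Resolution kernel Timer) manages timers at 1 ns resolution. The hrtimer API entered the mainline kernel in v2.6.16, and high-resolution mode support followed in v2.6.21.
- The original kernel timer was implemented as a jiffies based lowres timer, and because it is driven by the HZ based tick it could only offer a coarse resolution of a few ms to tens of ms.
- hrtimer is used when a finer resolution is needed than the scheduler tick of a few ms to tens of ms that drives the lowres timer.
Usable timer h/w
- hrtimer is designed around high resolution h/w timers, but it can also run on low resolution h/w timers.
Pending hrtimer expiry times are managed in an RB tree.

Note: the terms are easy to confuse, so read them as follows where possible.

hrtimer
- The kernel API and subsystem that works in nanoseconds (ns), regardless of whether the hw supports high resolution
timer (lowres timer)
- The kernel API and subsystem that works in tick units (100ms, 25ms, 50ms, 10ms, …)
high resolution timer
- A hw timer that runs at high resolution (typically a few ns to tens of ns)
- Most recent ARM systems (armv7, armv8, …) support high resolution timers.
low resolution timer
- A hw timer that runs at low resolution (hundreds of ns or more)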

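hrtimer and the generic time subsystem

The following functions are carried out through hrtimer.

Linux time management (hrtimer is used whenever possible)
- monotonic (time in nanoseconds that starts from 0)
- realtime (real world time)
- boottime (time in nanoseconds that starts from 0 and also advances during suspend)
- taiclock (International Atomic Time, a continuous atomic time scale into which no leap seconds are inserted)
High precision timers
- Callback functions can be run with nanosecond accuracy.
Scheduler tick
- Uses the high precision timer facility above to provide the scheduler tick through periodic or oneshot clock events.
Base clock of the lowres timer
- The clock supplied to the jiffies driven lowres timer (the one traditionally called the kernel timer)
process accounting, profiling, …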

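Relationship with the surrounding subsystems

The actual legacy code is implemented quite loosely(?) and is connected to many subsystems in countless different ways.

Hundreds of implementations were stitched together here and there by copy & paste instead of being reused.
It is said that the sheer diversity of 32bit arm embedded implementations made some Linux developers all but give up.
Even so, the common subsystems keep being cleaned up, and recently the device tree has pushed things further toward standardization.

The following figure shows how the Time subsystems of recent kernels relate to each other.

The left side shows timekeeping, which manages the four Linux clocks, being fed continuously by the clock source.
The right side shows how the interrupt raised at an hrtimer expiry is delivered through the clock event to the tick device and passed on to each subsystem.

Main APIs

The hrtimer APIs that work in nanoseconds are as follows.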

hrtimer_init()
hrtimer_start()
hrtimer_start_range_ns()
hrtimer_start_expires()
hrtimer_restart()
hrtimer_cancel()
hrtimer_try_to_cancel()
hrtimer_forward()

ktime APIs

A ktime_t value holds nanoseconds (ns) in a single signed 64-bit value, as shown below.

include/linux/ktime.h

/* Nanosecond scalar representation for kernel time values */
typedef s64     ktime_t;

The APIs for reading and setting ktime values are as follows.

ktime_get()
- Returns the current monotonic time as ktime_t.
ktime_get_ns()
- Returns the current monotonic time in nanoseconds (ns).
ktime_get_with_offset(offs)
- Returns the time of the clock type selected by @offs as ktime_t.
  - TK_OFFS_REAL
  - TK_OFFS_BOOT
  - TK_OFFS_TAI
- e.g.) ktime_get_with_offset(TK_OFFS_REAL)
ktime_mono_to_any(tmono, offs)
- Converts the monotonic ktime_t time @tmono to the clock type selected by @offs and returns it as ktime_t.
- e.g.) ktime_mono_to_any(tmono, TK_OFFS_REAL)
ktime_get_raw()
- Returns the current raw monotonic time as ktime_t.
ktime_get_raw_ns()
- Returns the current raw monotonic time in nanoseconds (ns).
ktime_get_real()
- Returns the current realtime time as ktime_t.
ktime_get_real_ns()
- Returns the current realtime time in nanoseconds (ns).
ktime_get_boottime()
- Returns the current boottime time as ktime_t.
ktime_get_clocktai()
- Returns the current tai time as ktime_t.
ktime_set(secs, nsecs)
- Builds a ktime_t time from @secs seconds and @nsecs nanoseconds and returns it.
ktime_mono_to_real(mono)
- Converts the monotonic ktime_t time @mono to realtime and returns it as ktime_t.

The APIs for ktime arithmetic and comparison are as follows.

ktime_add(kt1, kt2)
- Adds the two ktime_t times @kt1 and @kt2 and returns the result as ktime_t. (overflow is ignored)
ktime_add_ns(kt, ns)
- Adds @ns nanoseconds to the ktime_t time @kt and returns the result as ktime_t. (overflow is ignored)
ktime_add_us(kt, us)
- Converts @us microseconds to nanoseconds, adds them to the ktime_t time @kt and returns the result as ktime_t. (overflow is ignored)
ktime_sub(kt1, kt2)
- Returns the ktime_t time @kt1 minus @kt2. (underflow is ignored)
ktime_sub_ns(kt, ns)
- Subtracts @ns nanoseconds from the ktime_t time @kt and returns the result as ktime_t. (underflow is ignored)
ktime_sub_us(kt, us)
- Converts @us microseconds to nanoseconds, subtracts them from the ktime_t time @kt and returns the result as ktime_t. (underflow is ignored)
ktime_compare(cmp1, cmp2)
- Compares the two ktime_t times @cmp1 and @cmp2 and returns the result.
  - @cmp1 < @cmp2 : return < 0 (negative)
  - @cmp1 == @cmp2 : return 0
  - @cmp1 > @cmp2 : return > 0 (positive)
ktime_after(cmp1, cmp2)
- Returns whether the ktime_t time @cmp1 is after @cmp2.
ktime_before(cmp1, cmp2)
- Returns whether the ktime_t time @cmp1 is before @cmp2.
ktime_divns(kt, div)
- Divides the ktime_t time @kt by @div and returns the s64 result.
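
Since ktime_t is just a signed 64-bit nanosecond count, the arithmetic helpers listed above reduce to integer operations. The sketch below is a user-space model with hypothetical kt_ names chosen to avoid any confusion with the real kernel functions; it only illustrates the semantics described in the list.

```c
#include <stdint.h>

typedef int64_t ktime_t;              /* nanoseconds, as in the kernel typedef */

#define NSEC_PER_USEC 1000LL
#define NSEC_PER_SEC  1000000000LL

/* Model of ktime_set(): seconds plus nanoseconds as one scalar. */
static ktime_t kt_set(int64_t secs, unsigned long nsecs)
{
        return secs * NSEC_PER_SEC + (int64_t)nsecs;
}

/* Model of ktime_add_us(): the microseconds are scaled to ns first. */
static ktime_t kt_add_us(ktime_t kt, uint64_t usec)
{
        return kt + (int64_t)(usec * NSEC_PER_USEC);
}

/* Model of ktime_compare(): negative, zero or positive. */
static int kt_compare(ktime_t cmp1, ktime_t cmp2)
{
        if (cmp1 < cmp2)
                return -1;
        if (cmp1 > cmp2)
                return 1;
        return 0;
}

/* Model of ktime_us_delta(): difference expressed in microseconds. */
static int64_t kt_us_delta(ktime_t later, ktime_t earlier)
{
        return (later - earlier) / NSEC_PER_USEC;
}
```

Starting from kt_set(1, 500), adding 250 us gives 1000250500 ns, which compares as later than the start and is 250 us away from it.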

The APIs for ktime conversion are as follows.

ktime_to_timespec(kt)
- Returns the ktime_t time @kt as a timespec.
ktime_to_timespec_cond(kt, ts)
- Stores the ktime_t time @kt in the timespec output argument @ts. Returns 1 if the conversion succeeded.
ns_to_timespec(nsec)
- Returns @nsec nanoseconds as a timespec.
ktime_to_timespec64(kt)
- Returns the ktime_t time @kt as a timespec64.
ktime_to_timespec64_cond(kt, ts)
- Stores the ktime_t time @kt in the timespec64 output argument @ts. Returns 1 if the conversion succeeded.
ns_to_timespec64(nsec)
- Returns @nsec nanoseconds as a timespec64.
timespec_to_ktime(ts)
- Returns the timespec @ts as ktime_t.
timespec64_to_ktime(ts)
- Returns the timespec64 @ts as ktime_t.
ktime_to_timeval(kt)
- Returns the ktime_t time @kt as a timeval.
ns_to_timeval(nsec)
- Returns @nsec nanoseconds as a timeval.
timeval_to_ktime(tv)
- Returns the timeval @tv as ktime_t.
ktime_to_ns(kt)
- Returns the ktime_t time @kt in nanoseconds.
ktime_to_us(kt)
- Returns the ktime_t time @kt in microseconds.
ktime_to_ms(kt)
- Returns the ktime_t time @kt in milliseconds.
ns_to_ktime(ns)
- Returns @ns nanoseconds as a ktime_t time.
ms_to_ktime(ms)
- Returns @ms milliseconds as a ktime_t time.
ktime_us_delta(later, earlier)
- Returns the difference between the two ktime_t times @later and @earlier in microseconds.
ktime_ms_delta(later, earlier)
- Returns the difference between the two ktime_t times @later and @earlier in milliseconds.

The APIs for setting the system realtime time are as follows.

do_settimeofday64()
- Sets the system realtime time from a timespec64 realtime value.
do_sys_settimeofday()
- Sets the system realtime time from a timespec64 realtime value and a timezone.
do_gettimeofday() (removed)

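per cpu base and clock base management

As the following figure shows, there is a cpu base that manages the hrtimers of each cpu, and inside it there are clock bases that manage them per clock type.

There are four clock types, and because the clocks that run in hardirq are managed separately from the ones that run in softirq, a total of eight types are used.

The eight clock base types

The four types handled in hardirq

HRTIMER_BASE_MONOTONIC
- Starts from 0 at boot and is guaranteed to advance monotonically, much like the jiffies tick count. Time spent in suspend is not included.
HRTIMER_BASE_REALTIME
- Manages the real world clock.
HRTIMER_BASE_BOOTTIME
- Like HRTIMER_BASE_MONOTONIC it manages the clock since the kernel booted, with the difference that time spent in suspend is included.
- See: [RFC] Introduce CLOCK_BOOTTIME | LWN.net
HRTIMER_BASE_TAI
- International Atomic Time (TAI), a continuous atomic time scale
- Similar to a UTC (Coordinated Universal Time) based clock, but leap seconds are inserted only into UTC. Since the leap second at the end of 2016-12-31, TAI has been 37 seconds ahead of UTC: the initial 10 second offset of 1972 plus the 27 leap seconds inserted since then.
- International Atomic Time | Wikipedia

The four types handled in softirq are used in the same way as the hardirq ones.

HRTIMER_BASE_MONOTONIC_SOFT
HRTIMER_BASE_REALTIME_SOFT
HRTIMER_BASE_BOOTTIME_SOFT
HRTIMER_BASE_TAI_SOFT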
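
The split into four hard and four soft bases maps naturally onto a bitmask, which is how an active_mask argument such as HRTIMER_ACTIVE_HARD or HRTIMER_ACTIVE_SOFT can select one half of the bases. The sketch below is a user-space model with shortened, hypothetical names; the mask layout is modelled on my reading of kernel/time/hrtimer.c and should be treated as an assumption.

```c
/* Model of the eight clock bases: four hard ones followed by four soft ones. */
enum base_type_model {
        BASE_MONOTONIC,
        BASE_REALTIME,
        BASE_BOOTTIME,
        BASE_TAI,
        BASE_MONOTONIC_SOFT,
        BASE_REALTIME_SOFT,
        BASE_BOOTTIME_SOFT,
        BASE_TAI_SOFT,
        MAX_CLOCK_BASES
};

/* The soft bases start at bit MASK_SHIFT of the active_bases bitmap. */
#define MASK_SHIFT   BASE_MONOTONIC_SOFT
#define ACTIVE_HARD  ((1U << MASK_SHIFT) - 1)
#define ACTIVE_SOFT  (ACTIVE_HARD << MASK_SHIFT)
#define ACTIVE_ALL   (ACTIVE_SOFT | ACTIVE_HARD)

/* Keep only the active bases that belong to the requested context. */
static unsigned int select_bases(unsigned int active_bases, unsigned int mask)
{
        return active_bases & mask;
}
```

With this layout the hard bases occupy bits 0~3 (0x0f) and the soft bases bits 4~7 (0xf0), so one AND operation is enough to pick the bases that a given handler should walk.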

hrtimer Latency

hrtimers work at a resolution of 1 ns, but note that when the Linux hrtimer interrupt handling runs in softirq, which is implemented as a bottom-half, a latency of a few ns to several hundred us (tens of us on average) occurs.

See: KTAS: Analysis of Timer Latency for Embedded Linux Kernel – 다운로드 pdf

The following figure, taken from the reference above, shows the softirq latency of hrtimer measured over 10,000 runs.

With an RT (RealTime) Linux kernel, lower latency can be guaranteed than with the regular Linux kernel.

See: Evaluation of Real-time property in Embedded Linux | Hitachi – 다운로드 pdf

The following figure, taken from the reference above, gives a rough latency comparison of the interrupt response time between the RT kernel and the regular kernel.

With the CONFIG_PREEMPT_RT kernel option, hrtimer also allows preemption of the timer thread that runs as softirq.

See: hrtimer: Prepare support for PREEMPT_RT (2019, v5.4-rc1)

hrtimers subsystem initialization

hrtimers_init()

kernel/time/hrtimer.c

void __init hrtimers_init(void)
{
        hrtimers_prepare_cpu(smp_processor_id());
        open_softirq(HRTIMER_SOFTIRQ, hrtimer_run_softirq);
}

Initializes the hrtimer subsystem for the local cpu, and registers hrtimer_run_softirq() as the softirq handler for hrtimer.

hrtimers_prepare_cpu()

kernel/time/hrtimer.c

/*
 * Functions related to boot-time initialization:
 */

int hrtimers_prepare_cpu(unsigned int cpu)
{
        struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
        int i;

        for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
                cpu_base->clock_base[i].cpu_base = cpu_base;
                timerqueue_init_head(&cpu_base->clock_base[i].active);
        }

        cpu_base->cpu = cpu;
        cpu_base->active_bases = 0;
        cpu_base->hres_active = 0;
        cpu_base->hang_detected = 0;
        cpu_base->next_timer = NULL;
        cpu_base->softirq_next_timer = NULL;
        cpu_base->expires_next = KTIME_MAX;
        cpu_base->softirq_expires_next = KTIME_MAX;
        return 0;
}

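Initializes the hrtimer cpu_base and clock_base of the requested cpu.

In code line 3, gets the hrtimer cpu base of the requested cpu.
In code lines 6~9, loops over all HRTIMER_MAX_CLOCK_BASES (8) bases and initializes the RB tree used for the active timer queue of each clock base.
In code lines 11~19, initializes each member of the cpu base for the requested cpu.

The following figure shows the hrtimer cpu base of cpu 0 after initialization.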

timerqueue_init_head()

include/linux/timerqueue.h

static inline void timerqueue_init_head(struct timerqueue_head *head)
{
        head->rb_root = RB_ROOT_CACHED;
}

Initializes the hrtimer queue.

hrtimer hardirq & softirq handlers

hardirq handler

__hrtimer_peek_ahead_timers()

kernel/time/hrtimer.c

/* called with interrupts disabled */
static inline void __hrtimer_peek_ahead_timers(void)
{
        struct tick_device *td;

        if (!hrtimer_hres_active())
                return;

        td = this_cpu_ptr(&tick_cpu_device);
        if (td && td->evtdev)
                hrtimer_interrupt(td->evtdev);
}

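When hrtimer requests are pending, handles the hrtimer interrupt through the event device that is responsible for them.

In code lines 6~7, if high resolution mode is not active, leaves the function without doing anything.
In code lines 9~11, if an event device is registered on the tick device, handles the hrtimer interrupt.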

hrtimer_hres_active()

kernel/time/hrtimer.c

static inline int hrtimer_hres_active(void)
{
        return __hrtimer_hres_active(this_cpu_ptr(&hrtimer_bases));
}

Calls the function below.

__hrtimer_hres_active()

kernel/time/hrtimer.c

/*
 * Is the high resolution mode active ?
 */
static inline int __hrtimer_hres_active(struct hrtimer_cpu_base *cpu_base)
{
        return IS_ENABLED(CONFIG_HIGH_RES_TIMERS) ?
                cpu_base->hres_active : 0;
}

Returns whether high resolution mode is active on the given cpu base. 0=inactive, 1=active

hrtimer_interrupt()

kernel/time/hrtimer.c -1/2-

/*
 * High resolution timer interrupt
 * Called with interrupts disabled
 */

void hrtimer_interrupt(struct clock_event_device *dev)
{
        struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
        ktime_t expires_next, now, entry_time, delta;
        unsigned long flags;
        int retries = 0;

        BUG_ON(!cpu_base->hres_active);
        cpu_base->nr_events++;
        dev->next_event = KTIME_MAX;

        raw_spin_lock_irqsave(&cpu_base->lock, flags);
        entry_time = now = hrtimer_update_base(cpu_base);
retry:
        cpu_base->in_hrtirq = 1;
        /*
         * We set expires_next to KTIME_MAX here with cpu_base->lock
         * held to prevent that a timer is enqueued in our queue via
         * the migration code. This does not affect enqueueing of
         * timers which run their callback and need to be requeued on
         * this CPU.
         */
        cpu_base->expires_next = KTIME_MAX;

        if (!ktime_before(now, cpu_base->softirq_expires_next)) {
                cpu_base->softirq_expires_next = KTIME_MAX;
                cpu_base->softirq_activated = 1;
                raise_softirq_irqoff(HRTIMER_SOFTIRQ);
        }

        __hrtimer_run_queues(cpu_base, now, flags, HRTIMER_ACTIVE_HARD);

        /* Reevaluate the clock bases for the next expiry */
        expires_next = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_ALL);
        /*
         * Store the new expiry value so the migration code can verify
         * against it.
         */
        cpu_base->expires_next = expires_next;
        cpu_base->in_hrtirq = 0;
        raw_spin_unlock_irqrestore(&cpu_base->lock, flags);

        /* Reprogramming necessary ? */
        if (!tick_program_event(expires_next, 0)) {
                cpu_base->hang_detected = 0;
                return;
        }

This is the hrtimer interrupt handler for hardirq; it is called directly through the clock_event_device of the tick device.

In code line 9, increments the nr_events counter by 1 every time the hrtimer handler is called on the local cpu.
In code line 10, stores KTIME_MAX as the next event.
In code lines 12~13, takes the lock, refreshes the real, boot and tai offsets from the value read from the timekeeper's clocksource, and stores the monotonic time in entry_time and now.
In code lines 14~15, this is the retry: label. Marks that hrtimer irq handling is in progress.
In code lines 23~29, stores KTIME_MAX in expires_next. If the current time has reached softirq_expires_next, also stores KTIME_MAX in softirq_expires_next, sets softirq_activated and raises HRTIMER_SOFTIRQ so that the hrtimer softirq handler runs.
In code line 31, runs the expired hardirq hrtimers.
In code lines 34~41, gets the next expiry time across both hardirq and softirq timers, stores it in expires_next and releases the lock.
In code lines 44~47, programs the next hrtimer event. If programming succeeds (tick_program_event() returns 0), stores 0 in hang_detected and leaves the function.
- Programming the tick fails when the requested time has already passed.

kernel/time/hrtimer.c -2/2-

        /*
         * The next timer was already expired due to:
         * - tracing
         * - long lasting callbacks
         * - being scheduled away when running in a VM
         *
         * We need to prevent that we loop forever in the hrtimer
         * interrupt routine. We give it 3 attempts to avoid
         * overreacting on some spurious event.
         *
         * Acquire base lock for updating the offsets and retrieving
         * the current time.
         */
        raw_spin_lock_irqsave(&cpu_base->lock, flags);
        now = hrtimer_update_base(cpu_base);
        cpu_base->nr_retries++;
        if (++retries < 3)
                goto retry;
        /*
         * Give the system a chance to do something else than looping
         * here. We stored the entry time, so we know exactly how long
         * we spent here. We schedule the next event this amount of
         * time away.
         */
        cpu_base->nr_hangs++;
        cpu_base->hang_detected = 1;
        raw_spin_unlock_irqrestore(&cpu_base->lock, flags);

        delta = ktime_sub(now, entry_time);
        if ((unsigned int)delta > cpu_base->max_hang_time)
                cpu_base->max_hang_time = (unsigned int) delta;
        /*
         * Limit it to a sensible value as we enforce a longer
         * delay. Give the CPU at least 100ms to catch up.
         */
        if (delta > 100 * NSEC_PER_MSEC)
                expires_next = ktime_add_ns(now, 100 * NSEC_PER_MSEC);
        else
                expires_next = ktime_add(now, delta);
        tick_program_event(expires_next, 1);
        pr_warn_once("hrtimer: interrupt took %llu ns\n", ktime_to_ns(delta));
}

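In code lines 14~18, takes the lock, refreshes timekeeping and stores the monotonic time in now. Increments the retry counters and jumps to the retry label so that up to 2 retries are possible.
- Because of tracing or long running callbacks, the next timer may already have expired. In that case the timer is handled right away, with at most 3 attempts in total.
In code lines 25~27, this is the handling for a hang. Increments the nr_hangs counter, stores 1 in hang_detected and releases the lock.
In code lines 29~31, stores the time spent handling the interrupt in delta (in ns). If it is larger than max_hang_time, updates max_hang_time.
In code lines 36~40, requests reprogramming of the tick at the current monotonic time plus delta. If delta exceeds 100 ms, 100 ms is added instead of delta.
In code line 41, prints the time spent in hrtimer handling as a warning message, exactly once.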
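
The back-off applied after a detected hang is a small piece of arithmetic: the next event is pushed out by the time already spent in the handler, capped at 100 ms. A minimal sketch follows, with a hypothetical function name and plain 64-bit nanosecond values in place of ktime_t.

```c
#include <stdint.h>

#define NSEC_PER_MSEC 1000000LL

/*
 * Next expiry after a hang: now + (now - entry_time), but never more
 * than 100 ms into the future.
 */
static int64_t hang_next_expiry(int64_t now, int64_t entry_time)
{
        int64_t delta = now - entry_time;

        if (delta > 100 * NSEC_PER_MSEC)
                return now + 100 * NSEC_PER_MSEC;

        return now + delta;
}
```

If the handler entered at 2 ms and it is now 5 ms, the next event is programmed for 8 ms; if 500 ms were spent, the cap applies and the next event lands only 100 ms later.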

softirq handler

hrtimer_run_softirq()

kernel/time/hrtimer.c

static __latent_entropy void hrtimer_run_softirq(struct softirq_action *h)
{
        struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
        unsigned long flags;
        ktime_t now;

        hrtimer_cpu_base_lock_expiry(cpu_base);
        raw_spin_lock_irqsave(&cpu_base->lock, flags);

        now = hrtimer_update_base(cpu_base);
        __hrtimer_run_queues(cpu_base, now, flags, HRTIMER_ACTIVE_SOFT);

        cpu_base->softirq_activated = 0;
        hrtimer_update_softirq_timer(cpu_base, true);

        raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
        hrtimer_cpu_base_unlock_expiry(cpu_base);
}

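This is the entry function of the softirq for hrtimers. It runs the handler functions attached to soft hrtimers that have expired and were handed over from interrupt processing.

In kernel v4.2-rc1, the softirq-based hrtimer processing was moved to hard interrupt context. In kernel v4.16-rc1, the structure was changed so that both hardirq and softirq contexts are supported.
Reference: hrtimer: Implement support for softirq based hrtimers (2017, v4.16-rc1)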

Invoking expired timers

__hrtimer_run_queues()

kernel/time/hrtimer.c

static void __hrtimer_run_queues(struct hrtimer_cpu_base *cpu_base, ktime_t now,
                                 unsigned long flags, unsigned int active_mask)
{
        struct hrtimer_clock_base *base;
        unsigned int active = cpu_base->active_bases & active_mask;

        for_each_active_base(base, cpu_base, active) {
                struct timerqueue_node *node;
                ktime_t basenow;

                basenow = ktime_add(now, base->offset);

                while ((node = timerqueue_getnext(&base->active))) {
                        struct hrtimer *timer;

                        timer = container_of(node, struct hrtimer, node);

                        /*
                         * The immediate goal for using the softexpires is
                         * minimizing wakeups, not running timers at the
                         * earliest interrupt after their soft expiration.
                         * This allows us to avoid using a Priority Search
                         * Tree, which can answer a stabbing querry for
                         * overlapping intervals and instead use the simple
                         * BST we already have.
                         * We don't add extra wakeups by delaying timers that
                         * are right-of a not yet expired timer, because that
                         * timer will have to trigger a wakeup anyway.
                         */
                        if (basenow < hrtimer_get_softexpires_tv64(timer))
                                break;

                        __run_hrtimer(cpu_base, base, timer, &basenow, flags);
                        if (active_mask == HRTIMER_ACTIVE_SOFT)
                                hrtimer_sync_wait_running(cpu_base, flags);
                }
        }
}

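Invokes the expired hrtimers, limited to the hrtimer bases selected by @active_mask.

In code lines 7~11, the hrtimer clock bases set in @active_mask are walked, and the current time is converted to each clock's time and stored in basenow.
- Example) when handling an hrtimer requested on the realtime clock
  - realtime time (basenow) = monotonic time (now) + realtime offset (base->offset)
In code lines 13~36, the next hrtimer to handle is read from the clock base. If the current time is earlier than the timer's soft expiry time, the loop is exited without handling it; otherwise the expired hrtimer is run. As a result, a single interrupt also handles nearby timers whose soft expiry time has already passed.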
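
The effect of comparing against the soft expiry time can be seen in a small userspace model. This is a sketch under assumptions: the queue is a plain array already sorted by hard expiry, and struct model_timer and count_runnable() are hypothetical names that do not exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct model_timer {
        int64_t softexpires;    /* earliest time the timer may run */
        int64_t expires;        /* hard expiry (soft + slack); the queue sort key */
};

/*
 * Walk the timers in queue order and count those that would be run at
 * time basenow. The walk stops at the first timer whose soft expiry is
 * still in the future, like the break in __hrtimer_run_queues().
 */
static size_t count_runnable(const struct model_timer *q, size_t n,
                             int64_t basenow)
{
        size_t ran = 0;

        for (size_t i = 0; i < n; i++) {
                /* the model expects a queue sorted by hard expiry */
                assert(i == 0 || q[i - 1].expires <= q[i].expires);
                if (basenow < q[i].softexpires)
                        break;
                ran++;
        }
        return ran;
}
```

With timers (soft 20, hard 40), (soft 35, hard 45) and (soft 50, hard 60), an interrupt at time 40 runs the first two: the second one has not reached its hard expiry, but its soft expiry has passed, so it does not need an interrupt of its own.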

__run_hrtimer()

kernel/time/hrtimer.c

/*
 * The write_seqcount_barrier()s in __run_hrtimer() split the thing into 3
 * distinct sections:
 *
 *  - queued:   the timer is queued
 *  - callback: the timer is being ran
 *  - post:     the timer is inactive or (re)queued
 *
 * On the read side we ensure we observe timer->state and cpu_base->running
 * from the same section, if anything changed while we looked at it, we retry.
 * This includes timer->base changing because sequence numbers alone are
 * insufficient for that.
 *
 * The sequence numbers are required because otherwise we could still observe
 * a false negative if the read side got smeared over multiple consequtive
 * __run_hrtimer() invocations.
 */

static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base,
                          struct hrtimer_clock_base *base,
                          struct hrtimer *timer, ktime_t *now,
                          unsigned long flags)
{
        enum hrtimer_restart (*fn)(struct hrtimer *);
        int restart;

        lockdep_assert_held(&cpu_base->lock);

        debug_deactivate(timer);
        base->running = timer;

        /*
         * Separate the ->running assignment from the ->state assignment.
         *
         * As with a regular write barrier, this ensures the read side in
         * hrtimer_active() cannot observe base->running == NULL &&
         * timer->state == INACTIVE.
         */
        raw_write_seqcount_barrier(&base->seq);

        __remove_hrtimer(timer, base, HRTIMER_STATE_INACTIVE, 0);
        fn = timer->function;

        /*
         * Clear the 'is relative' flag for the TIME_LOW_RES case. If the
         * timer is restarted with a period then it becomes an absolute
         * timer. If its not restarted it does not matter.
         */
        if (IS_ENABLED(CONFIG_TIME_LOW_RES))
                timer->is_rel = false;

        /*
         * The timer is marked as running in the CPU base, so it is
         * protected against migration to a different CPU even if the lock
         * is dropped.
         */
        raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
        trace_hrtimer_expire_entry(timer, now);
        restart = fn(timer);
        trace_hrtimer_expire_exit(timer);
        raw_spin_lock_irq(&cpu_base->lock);

        /*
         * Note: We clear the running state after enqueue_hrtimer and
         * we do not reprogram the event hardware. Happens either in
         * hrtimer_start_range_ns() or in hrtimer_interrupt()
         *
         * Note: Because we dropped the cpu_base->lock above,
         * hrtimer_start_range_ns() can have popped in and enqueued the timer
         * for us already.
         */
        if (restart != HRTIMER_NORESTART &&
            !(timer->state & HRTIMER_STATE_ENQUEUED))
                enqueue_hrtimer(timer, base, HRTIMER_MODE_ABS);

        /*
         * Separate the ->running assignment from the ->state assignment.
         *
         * As with a regular write barrier, this ensures the read side in
         * hrtimer_active() cannot observe base->running.timer == NULL &&
         * timer->state == INACTIVE.
         */
        raw_write_seqcount_barrier(&base->seq);

        WARN_ON_ONCE(base->running != timer);
        base->running = NULL;
}

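Handles a single hrtimer. An hrtimer that asks to repeat is enqueued again.

In code line 12, base->running is set to the timer to indicate that the hrtimer is being processed on the requested clock base.
In code line 21, the sequence count is incremented and a memory barrier is issued so that base->running and timer->state, which hrtimer_active() reads, stay consistent.
In code line 23, the timer is removed from the clock queue and its enqueued state is cleared.
In code lines 31~32, when a low-resolution hw timer is used (CONFIG_TIME_LOW_RES), the relative flag (is_rel) is cleared.
In code line 41, the callback function of the hrtimer is executed. Its return value tells whether the timer should be restarted.
In code lines 54~56, if the callback asked for a restart and the timer is not already enqueued, the hrtimer is queued on the clock base again.
In code line 68, base->running is set to NULL to indicate that the hrtimer is no longer being processed on the requested clock base.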
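
The dequeue, callback, and conditional requeue sequence can be modelled without any locking. The sketch below is an illustration only: the model_ names and the periodic_cb() callback are hypothetical, and a single int stands in for HRTIMER_STATE_ENQUEUED.

```c
#include <assert.h>

enum model_restart { MODEL_NORESTART, MODEL_RESTART };

struct model_timer {
        int enqueued;   /* stands in for HRTIMER_STATE_ENQUEUED */
        int remaining;  /* periods left before the callback stops */
        enum model_restart (*function)(struct model_timer *);
};

/* Example callback: ask to be restarted until the budget is used up. */
static enum model_restart periodic_cb(struct model_timer *t)
{
        return --t->remaining > 0 ? MODEL_RESTART : MODEL_NORESTART;
}

/* Dequeue, run the callback, and requeue only if it asked for a restart. */
static void model_run_timer(struct model_timer *t)
{
        enum model_restart restart;

        assert(t->enqueued);
        t->enqueued = 0;                /* __remove_hrtimer() */
        restart = t->function(t);       /* the kernel drops the lock around this call */
        if (restart != MODEL_NORESTART && !t->enqueued)
                t->enqueued = 1;        /* enqueue_hrtimer() */
}
```

A timer with remaining = 2 is requeued after its first run and left inactive after the second.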

The following figure shows how using a slack range with hrtimers reduces the number of interrupts.

Updating the time of each clock type in the clock bases when the time changes

hrtimer_update_base()

kernel/time/hrtimer.c

static inline ktime_t hrtimer_update_base(struct hrtimer_cpu_base *base)
{
        ktime_t *offs_real = &base->clock_base[HRTIMER_BASE_REALTIME].offset;
        ktime_t *offs_boot = &base->clock_base[HRTIMER_BASE_BOOTTIME].offset;
        ktime_t *offs_tai = &base->clock_base[HRTIMER_BASE_TAI].offset;

        ktime_t now = ktime_get_update_offsets_now(&base->clock_was_set_seq,
                                            offs_real, offs_boot, offs_tai);

        base->clock_base[HRTIMER_BASE_REALTIME_SOFT].offset = *offs_real;
        base->clock_base[HRTIMER_BASE_BOOTTIME_SOFT].offset = *offs_boot;
        base->clock_base[HRTIMER_BASE_TAI_SOFT].offset = *offs_tai;

        return now;
}

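Fetches the current monotonic system time (timekeeping) and updates the clocks of the given hrtimer cpu base.

In code lines 3~8, the offset values of the clocks other than the monotonic clock are updated through the timekeeper, which manages the system time.
In code lines 10~12, the clock bases for softirq are updated as well.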

hrtimer cpu base

hrtimer_bases

kernel/time/hrtimer.c

/*
 * The timer bases:
 *
 * There are more clockids than hrtimer bases. Thus, we index
 * into the timer bases by the hrtimer_base_type enum. When trying
 * to reach a base using a clockid, hrtimer_clockid_to_base()
 * is used to convert from clockid to the proper hrtimer_base_type.
 */

DEFINE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases) =
{
        .lock = __RAW_SPIN_LOCK_UNLOCKED(hrtimer_bases.lock),
        .clock_base =
        {
                {
                        .index = HRTIMER_BASE_MONOTONIC,
                        .clockid = CLOCK_MONOTONIC,
                        .get_time = &ktime_get,
                },
                {
                        .index = HRTIMER_BASE_REALTIME,
                        .clockid = CLOCK_REALTIME,
                        .get_time = &ktime_get_real,
                },
                {
                        .index = HRTIMER_BASE_BOOTTIME,
                        .clockid = CLOCK_BOOTTIME,
                        .get_time = &ktime_get_boottime,
                },
                {
                        .index = HRTIMER_BASE_TAI,
                        .clockid = CLOCK_TAI,
                        .get_time = &ktime_get_clocktai,
                },
                {
                        .index = HRTIMER_BASE_MONOTONIC_SOFT,
                        .clockid = CLOCK_MONOTONIC,
                        .get_time = &ktime_get,
                },
                {
                        .index = HRTIMER_BASE_REALTIME_SOFT,
                        .clockid = CLOCK_REALTIME,
                        .get_time = &ktime_get_real,
                },
                {
                        .index = HRTIMER_BASE_BOOTTIME_SOFT,
                        .clockid = CLOCK_BOOTTIME,
                        .get_time = &ktime_get_boottime,
                },
                {
                        .index = HRTIMER_BASE_TAI_SOFT,
                        .clockid = CLOCK_TAI,
                        .get_time = &ktime_get_clocktai,
                },
        }
};

The 8 clock bases used by hrtimers (4 for hardirq + 4 for softirq) are managed per cpu.

Finding the next expiry time

__hrtimer_get_next_event()

kernel/time/hrtimer.c

/*
 * Recomputes cpu_base::*next_timer and returns the earliest expires_next but
 * does not set cpu_base::*expires_next, that is done by hrtimer_reprogram.
 *
 * When a softirq is pending, we can ignore the HRTIMER_ACTIVE_SOFT bases,
 * those timers will get run whenever the softirq gets handled, at the end of
 * hrtimer_run_softirq(), hrtimer_update_softirq_timer() will re-add these bases.
 *
 * Therefore softirq values are those from the HRTIMER_ACTIVE_SOFT clock bases.
 * The !softirq values are the minima across HRTIMER_ACTIVE_ALL, unless an actual
 * softirq is pending, in which case they're the minima of HRTIMER_ACTIVE_HARD.
 *
 * @active_mask must be one of:
 *  - HRTIMER_ACTIVE_ALL,
 *  - HRTIMER_ACTIVE_SOFT, or
 *  - HRTIMER_ACTIVE_HARD.
 */

static ktime_t
__hrtimer_get_next_event(struct hrtimer_cpu_base *cpu_base, unsigned int active_mask)
{
        unsigned int active;
        struct hrtimer *next_timer = NULL;
        ktime_t expires_next = KTIME_MAX;

        if (!cpu_base->softirq_activated && (active_mask & HRTIMER_ACTIVE_SOFT)) {
                active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;
                cpu_base->softirq_next_timer = NULL;
                expires_next = __hrtimer_next_event_base(cpu_base, NULL,
                                                         active, KTIME_MAX);

                next_timer = cpu_base->softirq_next_timer;
        }

        if (active_mask & HRTIMER_ACTIVE_HARD) {
                active = cpu_base->active_bases & HRTIMER_ACTIVE_HARD;
                cpu_base->next_timer = next_timer;
                expires_next = __hrtimer_next_event_base(cpu_base, NULL, active,
                                                         expires_next);
        }

        return expires_next;
}

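Returns the monotonic expiry time (ktime) of the earliest hrtimer among the clock bases of the requested cpu_base that are selected by active_mask.

The following figure shows how the earliest-expiring hrtimer, which is to be programmed into the hw timer, is found among the clock bases requested by the active_mask bits of the cpu base, and how its time (ktime) is obtained.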

__hrtimer_next_event_base()

kernel/time/hrtimer.c

static ktime_t __hrtimer_next_event_base(struct hrtimer_cpu_base *cpu_base,
                                         const struct hrtimer *exclude,
                                         unsigned int active,
                                         ktime_t expires_next)
{
        struct hrtimer_clock_base *base;
        ktime_t expires;

        for_each_active_base(base, cpu_base, active) {
                struct timerqueue_node *next;
                struct hrtimer *timer;

                next = timerqueue_getnext(&base->active);
                timer = container_of(next, struct hrtimer, node);
                if (timer == exclude) {
                        /* Get to the next timer in the queue. */
                        next = timerqueue_iterate_next(next);
                        if (!next)
                                continue;

                        timer = container_of(next, struct hrtimer, node);
                }
                expires = ktime_sub(hrtimer_get_expires(timer), base->offset);
                if (expires < expires_next) {
                        expires_next = expires;

                        /* Skip cpu_base update if a timer is being excluded. */
                        if (exclude)
                                continue;

                        if (timer->is_soft)
                                cpu_base->softirq_next_timer = timer;
                        else
                                cpu_base->next_timer = timer;
                }
        }
        /*
         * clock_was_set() might have changed base->offset of any of
         * the clock bases so the result might be negative. Fix it up
         * to prevent a false positive in clockevents_program_event().
         */
        if (expires_next < 0)
                expires_next = 0;
        return expires_next;
}

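For the @active clocks, expressed as a bitmask of the cpu base, returns the earliest expiry time, excluding the @exclude timer. If there is none, @expires_next is returned unchanged.

In code lines 9~22, the @active clocks are walked and the timer with the earliest expiry time on each is fetched. If the fetched timer is @exclude, the one after it is fetched instead.
In code lines 23~35, if that expiry time (converted to monotonic time by subtracting base->offset) is earlier than @expires_next, @expires_next is updated. Unless a timer is being excluded, the hrtimer is also recorded in the cpu base (next_timer or softirq_next_timer).
In code lines 42~44, @expires_next is returned, or 0 if it is negative.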
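
The minimum search over the active bases can be sketched as follows. This is a simplified model under stated assumptions: each base is reduced to the expiry of its first queued timer, the @exclude handling and the next_timer bookkeeping are left out, and the model_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_BASES 4

struct model_base {
        int64_t first_expires;  /* earliest expiry queued on this base, in its own clock */
        int64_t offset;         /* base clock minus monotonic clock */
};

/*
 * Return the earliest monotonic expiry among the bases whose bit is set
 * in active, starting from expires_next. Negative results are clamped
 * to 0, as at the end of __hrtimer_next_event_base().
 */
static int64_t model_next_event(const struct model_base *bases,
                                unsigned int active, int64_t expires_next)
{
        assert((active >> MODEL_NR_BASES) == 0);

        for (unsigned int i = 0; i < MODEL_NR_BASES; i++) {
                int64_t expires;

                if (!(active & (1U << i)))
                        continue;
                expires = bases[i].first_expires - bases[i].offset;
                if (expires < expires_next)
                        expires_next = expires;
        }
        return expires_next < 0 ? 0 : expires_next;
}
```

A base with first_expires = 1300 and offset = 1000 contributes a monotonic expiry of 300, so it wins over a monotonic base whose first timer expires at 500.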

HRTimer APIs

hrtimer initialization

hrtimer_init()

kernel/time/hrtimer.c

/**
 * hrtimer_init - initialize a timer to the given clock
 * @timer:      the timer to be initialized
 * @clock_id:   the clock to be used
 * @mode:       The modes which are relevant for intitialization:
 *              HRTIMER_MODE_ABS, HRTIMER_MODE_REL, HRTIMER_MODE_ABS_SOFT,
 *              HRTIMER_MODE_REL_SOFT
 *
 *              The PINNED variants of the above can be handed in,
 *              but the PINNED bit is ignored as pinning happens
 *              when the hrtimer is started
 */

void hrtimer_init(struct hrtimer *timer, clockid_t clock_id,
                  enum hrtimer_mode mode)
{
        debug_init(timer, clock_id, mode);
        __hrtimer_init(timer, clock_id, mode);
}
EXPORT_SYMBOL_GPL(hrtimer_init);

Initializes an hrtimer that uses the requested @clock_id type and @mode.

hrtimer_mode type

include/linux/hrtimer.h

/*
 * Mode arguments of xxx_hrtimer functions:
 *
 * HRTIMER_MODE_ABS             - Time value is absolute
 * HRTIMER_MODE_REL             - Time value is relative to now
 * HRTIMER_MODE_PINNED          - Timer is bound to CPU (is only considered
 *                                when starting the timer)
 * HRTIMER_MODE_SOFT            - Timer callback function will be executed in
 *                                soft irq context
 * HRTIMER_MODE_HARD            - Timer callback function will be executed in
 *                                hard irq context even on PREEMPT_RT.
 */

enum hrtimer_mode {
        HRTIMER_MODE_ABS        = 0x00,
        HRTIMER_MODE_REL        = 0x01,
        HRTIMER_MODE_PINNED     = 0x02,
        HRTIMER_MODE_SOFT       = 0x04,
        HRTIMER_MODE_HARD       = 0x08,

        HRTIMER_MODE_ABS_PINNED = HRTIMER_MODE_ABS | HRTIMER_MODE_PINNED,
        HRTIMER_MODE_REL_PINNED = HRTIMER_MODE_REL | HRTIMER_MODE_PINNED,

        HRTIMER_MODE_ABS_SOFT   = HRTIMER_MODE_ABS | HRTIMER_MODE_SOFT,
        HRTIMER_MODE_REL_SOFT   = HRTIMER_MODE_REL | HRTIMER_MODE_SOFT,

        HRTIMER_MODE_ABS_PINNED_SOFT = HRTIMER_MODE_ABS_PINNED | HRTIMER_MODE_SOFT,
        HRTIMER_MODE_REL_PINNED_SOFT = HRTIMER_MODE_REL_PINNED | HRTIMER_MODE_SOFT,

        HRTIMER_MODE_ABS_HARD   = HRTIMER_MODE_ABS | HRTIMER_MODE_HARD,
        HRTIMER_MODE_REL_HARD   = HRTIMER_MODE_REL | HRTIMER_MODE_HARD,

        HRTIMER_MODE_ABS_PINNED_HARD = HRTIMER_MODE_ABS_PINNED | HRTIMER_MODE_HARD,
        HRTIMER_MODE_REL_PINNED_HARD = HRTIMER_MODE_REL_PINNED | HRTIMER_MODE_HARD,
};

The following single flags can be specified.
- HRTIMER_MODE_ABS (0x00)
  - uses an absolute time
  - e.g. expires at xx minutes xx seconds of the realtime clock
- HRTIMER_MODE_REL (0x01)
  - uses a relative time
  - e.g. expires xx minutes after the current realtime time
- HRTIMER_MODE_PINNED (0x02)
  - pins the cpu on which the timer runs
- HRTIMER_MODE_SOFT (0x04)
  - runs in softirq context
- HRTIMER_MODE_HARD (0x08)
  - runs in hardirq context
The following composite flags can also be used.
- HRTIMER_MODE_ABS_PINNED
  - absolute time + pinned cpu
- HRTIMER_MODE_REL_PINNED
  - relative time + pinned cpu
- HRTIMER_MODE_ABS_SOFT
  - absolute time + softirq context
- HRTIMER_MODE_REL_SOFT
  - relative time + softirq context
- HRTIMER_MODE_ABS_PINNED_SOFT
  - absolute time + pinned cpu + softirq context
- HRTIMER_MODE_REL_PINNED_SOFT
  - relative time + pinned cpu + softirq context
- HRTIMER_MODE_ABS_HARD
  - absolute time + hardirq context
- HRTIMER_MODE_REL_HARD
  - relative time + hardirq context
- HRTIMER_MODE_ABS_PINNED_HARD
  - absolute time + pinned cpu + hardirq context
- HRTIMER_MODE_REL_PINNED_HARD
  - relative time + pinned cpu + hardirq context
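
One consequence of the values above is that HRTIMER_MODE_ABS is 0, so it cannot be tested with a bitwise AND; "absolute" simply means that the REL bit is clear. The following C11 sketch shows this with a copy of the flag values. The MODEL_ names and the mode_is_*() helpers are hypothetical and exist only in this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Values copied from enum hrtimer_mode; the MODEL_ names are local to this sketch. */
enum model_mode {
        MODEL_MODE_ABS    = 0x00,
        MODEL_MODE_REL    = 0x01,
        MODEL_MODE_PINNED = 0x02,
        MODEL_MODE_SOFT   = 0x04,
        MODEL_MODE_HARD   = 0x08,
        MODEL_MODE_REL_PINNED_SOFT =
                MODEL_MODE_REL | MODEL_MODE_PINNED | MODEL_MODE_SOFT,
};

static_assert(MODEL_MODE_REL_PINNED_SOFT == 0x07,
              "a composite mode is the OR of its single flags");

/* ABS is 0, so (mode & ABS) is always 0; test for the absence of REL instead. */
static bool mode_is_absolute(unsigned int mode)
{
        return !(mode & MODEL_MODE_REL);
}

static bool mode_is_pinned(unsigned int mode)
{
        return mode & MODEL_MODE_PINNED;
}

static bool mode_is_soft(unsigned int mode)
{
        return mode & MODEL_MODE_SOFT;
}
```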

__hrtimer_init()

kernel/time/hrtimer.c

static void __hrtimer_init(struct hrtimer *timer, clockid_t clock_id,
                           enum hrtimer_mode mode)
{
        bool softtimer = !!(mode & HRTIMER_MODE_SOFT);
        struct hrtimer_cpu_base *cpu_base;
        int base;

        /*
         * On PREEMPT_RT enabled kernels hrtimers which are not explicitely
         * marked for hard interrupt expiry mode are moved into soft
         * interrupt context for latency reasons and because the callbacks
         * can invoke functions which might sleep on RT, e.g. spin_lock().
         */
        if (IS_ENABLED(CONFIG_PREEMPT_RT) && !(mode & HRTIMER_MODE_HARD))
                softtimer = true;

        memset(timer, 0, sizeof(struct hrtimer));

        cpu_base = raw_cpu_ptr(&hrtimer_bases);

        /*
         * POSIX magic: Relative CLOCK_REALTIME timers are not affected by
         * clock modifications, so they needs to become CLOCK_MONOTONIC to
         * ensure POSIX compliance.
         */
        if (clock_id == CLOCK_REALTIME && mode & HRTIMER_MODE_REL)
                clock_id = CLOCK_MONOTONIC;

        base = softtimer ? HRTIMER_MAX_CLOCK_BASES / 2 : 0;
        base += hrtimer_clockid_to_base(clock_id);
        timer->is_soft = softtimer;
        timer->is_hard = !softtimer;
        timer->base = &cpu_base->clock_base[base];
        timerqueue_init(&timer->node);
}

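Initializes an hrtimer that uses the requested @clock_id type and @mode.

When neither hard nor soft mode is specified, the default depends on whether the kernel is an RT kernel.
- On an RT kernel (CONFIG_PREEMPT_RT), softirq is used by default.
- On a non-RT kernel, hardirq is used by default.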
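
The way the clock base index is derived from the mode can be sketched in userspace. This is an illustration under assumptions: the clock is passed as a base index rather than a clockid, PREEMPT_RT is a runtime bool instead of a config option, and model_pick_base() and the MODEL_ names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Base indices mirror the hrtimer_bases layout: 4 hard bases, then 4 soft ones. */
enum model_base {
        MODEL_BASE_MONOTONIC,
        MODEL_BASE_REALTIME,
        MODEL_BASE_BOOTTIME,
        MODEL_BASE_TAI,
        MODEL_MAX_CLOCK_BASES = 8,
};

#define MODEL_MODE_REL  0x01u
#define MODEL_MODE_SOFT 0x04u
#define MODEL_MODE_HARD 0x08u

static int model_pick_base(enum model_base clock, unsigned int mode,
                           bool preempt_rt)
{
        bool soft = mode & MODEL_MODE_SOFT;

        assert((int)clock < MODEL_MAX_CLOCK_BASES / 2);

        /* On RT, timers not explicitly marked HARD expire in softirq context. */
        if (preempt_rt && !(mode & MODEL_MODE_HARD))
                soft = true;

        /* Relative REALTIME timers are moved to MONOTONIC. */
        if (clock == MODEL_BASE_REALTIME && (mode & MODEL_MODE_REL))
                clock = MODEL_BASE_MONOTONIC;

        return (soft ? MODEL_MAX_CLOCK_BASES / 2 : 0) + (int)clock;
}
```

For instance, an absolute soft REALTIME timer lands on index 5 (HRTIMER_BASE_REALTIME_SOFT), while a relative REALTIME timer on a non-RT kernel lands on index 0.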

hrtimer_clockid_to_base()

kernel/time/hrtimer.c

static inline int hrtimer_clockid_to_base(clockid_t clock_id)
{
        if (likely(clock_id < MAX_CLOCKS)) {
                int base = hrtimer_clock_to_base_table[clock_id];

                if (likely(base != HRTIMER_MAX_CLOCK_BASES))
                        return base;
        }
        WARN(1, "Invalid clockid %d. Using MONOTONIC\n", clock_id);
        return HRTIMER_BASE_MONOTONIC;
}

Returns the hrtimer clock base index that corresponds to @clock_id.

If a clock id that is out of range or unusable is requested, a warning message is printed and the monotonic clock base is returned.

hrtimer_clock_to_base_table

kernel/time/hrtimer.c

static const int hrtimer_clock_to_base_table[MAX_CLOCKS] = {
        /* Make sure we catch unsupported clockids */
        [0 ... MAX_CLOCKS - 1]  = HRTIMER_MAX_CLOCK_BASES,

        [CLOCK_REALTIME]        = HRTIMER_BASE_REALTIME,
        [CLOCK_MONOTONIC]       = HRTIMER_BASE_MONOTONIC,
        [CLOCK_BOOTTIME]        = HRTIMER_BASE_BOOTTIME,
        [CLOCK_TAI]             = HRTIMER_BASE_TAI,
};

A table that maps a clock id to its hrtimer clock base.
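
The table relies on a range designated initializer that first marks every entry as unsupported and then overrides the supported ones. The following sketch reproduces the idea in userspace. It assumes GCC or Clang in C11 mode, since `[a ... b]` is a GNU C extension; the MODEL_ names are hypothetical, and the numeric clock ids are the Linux values of CLOCK_REALTIME, CLOCK_MONOTONIC, CLOCK_BOOTTIME and CLOCK_TAI. The lower-bound check on clock_id is an addition of this model.

```c
#include <assert.h>

#define MODEL_MAX_CLOCKS 16

enum {
        MODEL_CLOCK_REALTIME  = 0,
        MODEL_CLOCK_MONOTONIC = 1,
        MODEL_CLOCK_BOOTTIME  = 7,
        MODEL_CLOCK_TAI       = 11,
};

enum {
        MODEL_BASE_MONOTONIC,
        MODEL_BASE_REALTIME,
        MODEL_BASE_BOOTTIME,
        MODEL_BASE_TAI,
        MODEL_BASE_INVALID,     /* plays the role of HRTIMER_MAX_CLOCK_BASES */
};

static_assert(MODEL_CLOCK_TAI < MODEL_MAX_CLOCKS, "table must cover every id");

static const int model_table[MODEL_MAX_CLOCKS] = {
        /* Mark everything unsupported first, then override (GNU C extension). */
        [0 ... MODEL_MAX_CLOCKS - 1] = MODEL_BASE_INVALID,

        [MODEL_CLOCK_REALTIME]  = MODEL_BASE_REALTIME,
        [MODEL_CLOCK_MONOTONIC] = MODEL_BASE_MONOTONIC,
        [MODEL_CLOCK_BOOTTIME]  = MODEL_BASE_BOOTTIME,
        [MODEL_CLOCK_TAI]       = MODEL_BASE_TAI,
};

static int model_clockid_to_base(int clock_id)
{
        if (clock_id >= 0 && clock_id < MODEL_MAX_CLOCKS &&
            model_table[clock_id] != MODEL_BASE_INVALID)
                return model_table[clock_id];

        /* The kernel prints a warning here before falling back. */
        return MODEL_BASE_MONOTONIC;
}
```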

timerqueue_init()

include/linux/timerqueue.h

static inline void timerqueue_init(struct timerqueue_node *node)
{
        RB_CLEAR_NODE(&node->node);
}

Initializes a timerqueue node, which is managed in an RB tree.

Starting an hrtimer

hrtimer_start()

include/linux/hrtimer.h

/**
 * hrtimer_start - (re)start an hrtimer
 * @timer:      the timer to be added
 * @tim:        expiry time
 * @mode:       timer mode: absolute (HRTIMER_MODE_ABS) or
 *              relative (HRTIMER_MODE_REL), and pinned (HRTIMER_MODE_PINNED);
 *              softirq based mode is considered for debug purpose only!
 */

static inline void hrtimer_start(struct hrtimer *timer, ktime_t tim,
                                 const enum hrtimer_mode mode)
{
        hrtimer_start_range_ns(timer, tim, 0, mode);
}

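Requests that the hrtimer run at the requested ktime, interpreted as a relative or an absolute time.

Example) The following uses an hrtimer on the monotonic clock so that the my_hrtimer_callback() function is called after 100ms.

The following figure shows the state after two hrtimers that use the monotonic clock have been added.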

hrtimer_start_range_ns()

kernel/time/hrtimer.c

/**
 * hrtimer_start_range_ns - (re)start an hrtimer
 * @timer:      the timer to be added
 * @tim:        expiry time
 * @delta_ns:   "slack" range for the timer
 * @mode:       timer mode: absolute (HRTIMER_MODE_ABS) or
 *              relative (HRTIMER_MODE_REL), and pinned (HRTIMER_MODE_PINNED);
 *              softirq based mode is considered for debug purpose only!
 */

void hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
                            u64 delta_ns, const enum hrtimer_mode mode)
{
        struct hrtimer_clock_base *base;
        unsigned long flags;

        /*
         * Check whether the HRTIMER_MODE_SOFT bit and hrtimer.is_soft
         * match on CONFIG_PREEMPT_RT = n. With PREEMPT_RT check the hard
         * expiry mode because unmarked timers are moved to softirq expiry.
         */
        if (!IS_ENABLED(CONFIG_PREEMPT_RT))
                WARN_ON_ONCE(!(mode & HRTIMER_MODE_SOFT) ^ !timer->is_soft);
        else
                WARN_ON_ONCE(!(mode & HRTIMER_MODE_HARD) ^ !timer->is_hard);
        base = lock_hrtimer_base(timer, &flags);

        if (__hrtimer_start_range_ns(timer, tim, delta_ns, mode, base))
                hrtimer_reprogram(timer, true);

        unlock_hrtimer_base(timer, &flags);
}
EXPORT_SYMBOL_GPL(hrtimer_start_range_ns);

Requests that the hrtimer run at a relative or an absolute time, given the requested ktime and @delta_ns (the slack range).

__hrtimer_start_range_ns()

kernel/time/hrtimer.c

static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
                                    u64 delta_ns, const enum hrtimer_mode mode,
                                    struct hrtimer_clock_base *base)
{
        struct hrtimer_clock_base *new_base;

        /* Remove an active timer from the queue: */
        remove_hrtimer(timer, base, true);

        if (mode & HRTIMER_MODE_REL)
                tim = ktime_add_safe(tim, base->get_time());

        tim = hrtimer_update_lowres(timer, tim, mode);

        hrtimer_set_expires_range_ns(timer, tim, delta_ns);

        /* Switch the timer base, if necessary: */
        new_base = switch_hrtimer_base(timer, base, mode & HRTIMER_MODE_PINNED);

        return enqueue_hrtimer(timer, new_base, mode);
}

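Requests that the hrtimer run at the requested ktime, interpreted as a relative or an absolute time, with a slack of delta. This is the slack range scheme: the timer's actual expiry time is the hard expiry time, which includes delta. However, if the soft expiry time has already been reached while another timer is being handled, the timer can be handled together with that other timer; the slack range is granted to make this possible.

In code line 8, if the requested hrtimer is already active in a queue, it is removed.
In code lines 10~11, if a relative time was requested, @tim is added to the current time of the timer's clock base (base->get_time()) to produce an absolute time.
In code line 13, when a low-resolution hw timer is used and a relative time was requested, one tick is added.
In code line 15, the expiry times are set on the hrtimer.
- timer->_softexpires stores only time, and timer->node.expires stores time + delta.
In code line 18, the clock base of the current cpu is used where possible (when the current cpu can serve the hrtimer); otherwise the timer is switched to the clock base of another cpu.
In code line 20, the hrtimer is added to the generic timer queue (implemented as an RB tree).

The following figure shows a timer whose range runs from a 20us (soft) expiry time to a 40us (hard) expiry time, obtained by adding 20us of slack.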

hrtimer_set_expires_range_ns()

include/linux/hrtimer.h

static inline void hrtimer_set_expires_range_ns(struct hrtimer *timer, ktime_t time, u64 delta)
{
        timer->_softexpires = time;
        timer->node.expires = ktime_add_safe(time, ns_to_ktime(delta));
}

Records the expiry times of the hrtimer.

timer->_softexpires stores only time, and timer->node.expires stores time + delta.
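
The two expiry values and the overflow-safe addition can be modelled in userspace. The sketch below makes some assumptions: times are non-negative int64_t nanoseconds, the saturation value is INT64_MAX (the kernel's ktime_add_safe() saturates at its own maximum ktime value), and the model_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_KTIME_MAX INT64_MAX

struct model_timer {
        int64_t softexpires;    /* time */
        int64_t expires;        /* time + delta (hard expiry) */
};

/* Add delta to time, saturating instead of overflowing. */
static int64_t model_add_safe(int64_t time, uint64_t delta)
{
        assert(time >= 0);
        if (delta > (uint64_t)(MODEL_KTIME_MAX - time))
                return MODEL_KTIME_MAX;
        return time + (int64_t)delta;
}

static void model_set_expires_range_ns(struct model_timer *t, int64_t time,
                                       uint64_t delta)
{
        t->softexpires = time;
        t->expires = model_add_safe(time, delta);
}
```

Setting time = 20000ns with delta = 20000ns gives a soft expiry of 20us and a hard expiry of 40us. A time close to the maximum does not wrap around to a negative hard expiry.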

Switching the timer base

switch_hrtimer_base()

kernel/time/hrtimer.c

/*
 * We switch the timer base to a power-optimized selected CPU target,
 * if:
 *      - NO_HZ_COMMON is enabled
 *      - timer migration is enabled
 *      - the timer callback is not running
 *      - the timer is not the first expiring timer on the new target
 *
 * If one of the above requirements is not fulfilled we move the timer
 * to the current CPU or leave it on the previously assigned CPU if
 * the timer callback is currently running.
 */

static inline struct hrtimer_clock_base *
switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base,
                    int pinned)
{
        struct hrtimer_cpu_base *new_cpu_base, *this_cpu_base;
        struct hrtimer_clock_base *new_base;
        int basenum = base->index;

        this_cpu_base = this_cpu_ptr(&hrtimer_bases);
        new_cpu_base = get_target_base(this_cpu_base, pinned);
again:
        new_base = &new_cpu_base->clock_base[basenum];

        if (base != new_base) {
                /*
                 * We are trying to move timer to new_base.
                 * However we can't change timer's base while it is running,
                 * so we keep it on the same CPU. No hassle vs. reprogramming
                 * the event source in the high resolution case. The softirq
                 * code will take care of this when the timer function has
                 * completed. There is no conflict as we hold the lock until
                 * the timer is enqueued.
                 */
                if (unlikely(hrtimer_callback_running(timer)))
                        return base;

                /* See the comment in lock_hrtimer_base() */
                WRITE_ONCE(timer->base, &migration_base);
                raw_spin_unlock(&base->cpu_base->lock);
                raw_spin_lock(&new_base->cpu_base->lock);

                if (new_cpu_base != this_cpu_base &&
                    hrtimer_check_target(timer, new_base)) {
                        raw_spin_unlock(&new_base->cpu_base->lock);
                        raw_spin_lock(&base->cpu_base->lock);
                        new_cpu_base = this_cpu_base;
                        timer->base = base;
                        goto again;
                }
                WRITE_ONCE(timer->base, new_base);
        } else {
                if (new_cpu_base != this_cpu_base &&
                    hrtimer_check_target(timer, new_base)) {
                        new_cpu_base = this_cpu_base;
                        goto again;
                }
        }
        return new_base;
}

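Switches the hrtimer to the clock base of another cpu when the clock base of the current cpu cannot be used for it.

In code line 10, the hrtimer cpu base of a busy cpu in the nearest cpu domain is obtained as the migration target.
In code lines 14~40, the timer is switched to the clock base of the new cpu, with the following exceptions.
- In the unlikely case that the callback of the requested hrtimer is already running, the original clock base is returned.
- If the timer expires before the next timer on the new cpu's clock base, the switch is retried with the current cpu instead.
In code lines 41~47, the timer already sits on the target clock base. If that target is not the current cpu and the timer expires before the next timer on it, the switch is retried with the current cpu.
In code line 48, the resulting hrtimer clock base of the new cpu is returned.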

Finding the hrtimer cpu base for nohz

get_target_base()

kernel/time/hrtimer.c

static inline
struct hrtimer_cpu_base *get_target_base(struct hrtimer_cpu_base *base,
                                         int pinned)
{
#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
        if (static_branch_likely(&timers_migration_enabled) && !pinned)
                return &per_cpu(hrtimer_bases, get_nohz_timer_target());
#endif
        return base;
}

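Returns the target hrtimer cpu base on which the timer will run. If @pinned is set, the requested @base is returned as is.

Otherwise, when timer migration is enabled, a busy cpu in the nearest cpu domain is chosen as the migration target. (A busy cpu is searched for, skipping cpus in nohz idle state.)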

hrtimer_check_target()

kernel/time/hrtimer.c

/*
 * We do not migrate the timer when it is expiring before the next
 * event on the target cpu. When high resolution is enabled, we cannot
 * reprogram the target cpu hardware and we would cause it to fire
 * late. To keep it simple, we handle the high resolution enabled and
 * disabled case similar.
 *
 * Called with cpu_base->lock of target cpu held.
 */

static int
hrtimer_check_target(struct hrtimer *timer, struct hrtimer_clock_base *new_base)
{
        ktime_t expires;

        expires = ktime_sub(hrtimer_get_expires(timer), new_base->offset);
        return expires < new_base->cpu_base->expires_next;
}

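Returns true if the requested timer expires before the next timer event of the new clock base's cpu. (This is used to forbid changing the target cpu.)

In code line 6, the expiry time of the requested timer is converted by subtracting the offset of the new clock base, which yields the expiry time on the monotonic clock.
In code line 7, true is returned if that expiry time is earlier than the next expiry already programmed on the target cpu (expires_next).

The following figure shows a check of whether the requested hrtimer can be migrated to new_base (the target clock base), in a case where true (migration not possible) is returned.

If the timer would expire before the next timer to expire on the destination clock base, it cannot be inserted there, so true is returned.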
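
The comparison and its effect on the cpu choice can be sketched as follows. This is a simplified model, assuming int64_t nanosecond values and ignoring locking and the callback-running case; model_check_target() and model_pick_cpu() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True means "do not migrate": the timer would fire before the target's next event. */
static bool model_check_target(int64_t timer_expires, int64_t new_base_offset,
                               int64_t target_expires_next)
{
        int64_t expires = timer_expires - new_base_offset;      /* monotonic time */

        return expires < target_expires_next;
}

/* Fall back to the current cpu when the remote target cannot take the timer. */
static int model_pick_cpu(int this_cpu, int target_cpu, int64_t timer_expires,
                          int64_t new_base_offset, int64_t target_expires_next)
{
        assert(this_cpu >= 0 && target_cpu >= 0);

        if (target_cpu != this_cpu &&
            model_check_target(timer_expires, new_base_offset,
                               target_expires_next))
                return this_cpu;
        return target_cpu;
}
```

A timer expiring at 1500 on a base with offset 1000 has a monotonic expiry of 500. If the target cpu's next event is at 800, the timer stays on the current cpu, because the remote hardware cannot be reprogrammed from here.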

Finding the target cpu for nohz

get_nohz_timer_target()

kernel/sched/core.c

/*
 * In the semi idle case, use the nearest busy CPU for migrating timers
 * from an idle CPU.  This is good for power-savings.
 *
 * We don't do similar optimization for completely idle system, as
 * selecting an idle CPU will add more delays to the timers than intended
 * (as that CPU's timer base may not be uptodate wrt jiffies etc).
 */

int get_nohz_timer_target(void)
{
        int i, cpu = smp_processor_id();
        struct sched_domain *sd;

        if (!idle_cpu(cpu) && housekeeping_cpu(cpu, HK_FLAG_TIMER))
                return cpu;

        rcu_read_lock();
        for_each_domain(cpu, sd) {
                for_each_cpu(i, sched_domain_span(sd)) {
                        if (cpu == i)
                                continue;

                        if (!idle_cpu(i) && housekeeping_cpu(i, HK_FLAG_TIMER)) {
                                cpu = i;
                                goto unlock;
                        }
                }
        }

        if (!housekeeping_cpu(cpu, HK_FLAG_TIMER))
                cpu = housekeeping_any_cpu(HK_FLAG_TIMER);
unlock:
        rcu_read_unlock();
        return cpu;
}

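Finds a busy cpu in the nearest cpu domain to migrate the timer to. (If the current cpu is idle or cannot handle timers, another busy cpu is searched for.)

In code lines 6~7, if the local cpu is not idle (resting with no task) and can handle timers, the local cpu is returned.
- A cpu in nohz idle state tries not to handle timers in order to save power, and a cpu in nohz full state tries to have timers handled on another cpu for performance.
In code lines 10~20, the cpu domains are walked, and for the cpus in each domain, the id of the first cpu that is not idle and can handle timers is returned.
In code lines 22~23, if no such cpu was found and the current cpu is not a housekeeping cpu for timers, any cpu that can handle timers is returned.
- The default value of the /proc/sys/kernel/timer_migration file is 1.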
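
The selection order can be modelled with a flat array of cpus. This sketch makes strong simplifications that are assumptions of the example: scheduler domains are replaced by a scan in index order, housekeeping_any_cpu() is replaced by "first housekeeping cpu", and struct model_cpu and model_nohz_timer_target() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

struct model_cpu {
        bool idle;              /* idle_cpu() */
        bool housekeeping;      /* housekeeping_cpu(cpu, HK_FLAG_TIMER) */
};

static int model_nohz_timer_target(const struct model_cpu *cpus, int nr_cpus,
                                   int cpu)
{
        assert(cpu >= 0 && cpu < nr_cpus);

        /* 1. A busy local cpu that may handle timers keeps the timer. */
        if (!cpus[cpu].idle && cpus[cpu].housekeeping)
                return cpu;

        /* 2. Otherwise look for another busy cpu that may handle timers. */
        for (int i = 0; i < nr_cpus; i++) {
                if (i == cpu)
                        continue;
                if (!cpus[i].idle && cpus[i].housekeeping)
                        return i;
        }

        /* 3. Nothing busy: leave an isolated cpu, else stay local. */
        if (!cpus[cpu].housekeeping) {
                for (int i = 0; i < nr_cpus; i++)
                        if (cpus[i].housekeeping)
                                return i;
        }
        return cpu;
}
```

When every cpu is idle, the timer stays on the local cpu as long as that cpu is allowed to handle timers.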

Housekeeping cpu

housekeeping_cpu()

include/linux/sched/isolation.h

static inline bool housekeeping_cpu(int cpu, enum hk_flags flags)
{
#ifdef CONFIG_CPU_ISOLATION
        if (static_branch_unlikely(&housekeeping_overridden))
                return housekeeping_test_cpu(cpu, flags);
#endif
        return true;
}

Returns whether the cpu can handle the work indicated by @flags.

If the CONFIG_CPU_ISOLATION kernel option is not used, it always returns true.

housekeeping_test_cpu()

kernel/sched/isolation.c

bool housekeeping_test_cpu(int cpu, enum hk_flags flags)
{
        if (static_branch_unlikely(&housekeeping_overridden))
                if (housekeeping_flags & flags)
                        return cpumask_test_cpu(cpu, housekeeping_mask);
        return true;
}
EXPORT_SYMBOL_GPL(housekeeping_test_cpu);

@flags에 대해 처리 가능한 cpu인지 여부를 반환한다.

cpu가 여러 이유로(성능 또는 절전, cpu 분리(isolation), …) 다음 플래그에 따른 처리를 수행할 수 있는지 여부를 나타낸다.

HK_FLAG_TIMER
- timers
HK_FLAG_RCU
- RCU
HK_FLAG_SCHED
- scheduler
HK_FLAG_TICK
- scheduler tick
HK_FLAG_DOMAIN
- scheduling domains
HK_FLAG_WQ
- workqueue
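
The flag test performed by housekeeping_test_cpu() can be modelled roughly as below. The `model_` names, the flag values and the single-word cpu mask are assumptions made for this sketch; the kernel uses a static key, enum hk_flags and a cpumask instead.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits; names and values are assumptions of this sketch. */
enum model_hk_flags {
	MODEL_HK_TIMER = 1u << 0,
	MODEL_HK_RCU   = 1u << 1,
	MODEL_HK_TICK  = 1u << 2,
};

struct model_housekeeping {
	bool overridden;      /* models the housekeeping_overridden static key */
	unsigned int flags;   /* kinds of work that are restricted */
	unsigned long mask;   /* bit n set: cpu n may do housekeeping work */
};

static bool model_housekeeping_cpu(const struct model_housekeeping *hk,
				   int cpu, unsigned int flags)
{
	assert(cpu >= 0 && cpu < (int)(sizeof(hk->mask) * 8));

	/* Only restricted kinds of work consult the cpu mask. */
	if (hk->overridden && (hk->flags & flags))
		return (hk->mask >> cpu) & 1ul;
	return true;   /* no restriction for this kind of work */
}
```

In this model a cpu outside the mask is rejected only for the kinds of work that were actually restricted; for every other flag the answer stays true.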

hrtimer_callback_running()

include/linux/hrtimer.h

/*
 * Helper function to check, whether the timer is running the callback
 * function             
 */
static inline int hrtimer_callback_running(struct hrtimer *timer)
{               
        return timer->base->running == timer;
}

Returns true if the hrtimer's callback function is currently running.

Adding to the hrtimer queue

enqueue_hrtimer()

kernel/time/hrtimer.c

/*
 * enqueue_hrtimer - internal function to (re)start a timer
 *
 * The timer is inserted in expiry order. Insertion into the
 * red black tree is O(log(n)). Must hold the base lock.
 *
 * Returns 1 when the new timer is the leftmost timer in the tree.
 */

static int enqueue_hrtimer(struct hrtimer *timer,
                           struct hrtimer_clock_base *base,
                           enum hrtimer_mode mode)
{
        debug_activate(timer);

        base->cpu_base->active_bases |= 1 << base->index;

        timer->state |= HRTIMER_STATE_ENQUEUED;

        return timerqueue_add(&base->active, &timer->node);
}

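Adds the hrtimer as a node to the active timer queue (RB tree) of the given clock base. Returns 1 if the added timer is the one that expires first in that timer queue.

In code line 7, the bit for the clock base used by the requested timer is set in the cpu_base->active_bases mask to mark the base as in use.
- This allows a quick check of which clock bases of the cpu base hold active hrtimers.
In code line 9, the hrtimer is put into the enqueued state.
In code line 11, the hrtimer is added as a node to the active timer queue (RB tree) of the given clock base.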

timerqueue_add()

lib/timerqueue.c

/**
 * timerqueue_add - Adds timer to timerqueue.
 *
 * @head: head of timerqueue
 * @node: timer node to be added
 *
 * Adds the timer node to the timerqueue, sorted by the node's expires
 * value. Returns true if the newly added timer is the first expiring timer in
 * the queue.
 */

bool timerqueue_add(struct timerqueue_head *head, struct timerqueue_node *node)
{
        struct rb_node **p = &head->rb_root.rb_root.rb_node;
        struct rb_node *parent = NULL;
        struct timerqueue_node *ptr;
        bool leftmost = true;

        /* Make sure we don't add nodes that are already added */
        WARN_ON_ONCE(!RB_EMPTY_NODE(&node->node));

        while (*p) {
                parent = *p;
                ptr = rb_entry(parent, struct timerqueue_node, node);
                if (node->expires < ptr->expires) {
                        p = &(*p)->rb_left;
                } else {
                        p = &(*p)->rb_right;
                        leftmost = false;
                }
        }
        rb_link_node(&node->node, parent, p);
        rb_insert_color_cached(&node->node, &head->rb_root, leftmost);

        return leftmost;
}
EXPORT_SYMBOL_GPL(timerqueue_add);

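Adds a node (hrtimer) to the timer queue (RB tree), sorted by expiry time, and reports whether the new node is now the earliest one.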
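
The leftmost tracking in timerqueue_add() does not depend on the tree being red-black. A minimal sketch with a plain, unbalanced binary search tree shows the same idea; `struct tq_node`, `struct tq_head` and `tq_add()` are hypothetical names for this example, and the rebalancing done by rb_insert_color_cached() is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node of a plain (unbalanced) binary search tree. */
struct tq_node {
	int64_t expires;
	struct tq_node *left, *right;
};

struct tq_head {
	struct tq_node *root;
	struct tq_node *leftmost;   /* cached earliest-expiring node */
};

/* Insert by expiry; return true if @node is now the earliest timer. */
static bool tq_add(struct tq_head *head, struct tq_node *node)
{
	struct tq_node **p = &head->root;
	bool leftmost = true;

	assert(head != NULL && node != NULL);
	node->left = node->right = NULL;

	while (*p) {
		if (node->expires < (*p)->expires) {
			p = &(*p)->left;
		} else {
			/* Equal expiries go right, so older entries stay first. */
			p = &(*p)->right;
			leftmost = false;
		}
	}
	*p = node;

	/* Only a walk that never turned right ends at the leftmost slot. */
	if (leftmost)
		head->leftmost = node;
	return leftmost;
}
```

A caller such as enqueue_hrtimer() only needs the boolean result: when it is true, the new timer has become the first to expire and the hardware may need to be reprogrammed.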

nohz full related

wake_up_nohz_cpu()

kernel/sched/core.c

void wake_up_nohz_cpu(int cpu)
{
        if (!wake_up_full_nohz_cpu(cpu))
                wake_up_idle_cpu(cpu);
}

Wakes up a nohz full cpu or, if the cpu is not a nohz full cpu, a cpu in the nohz idle state.

wake_up_full_nohz_cpu()

kernel/sched/core.c

static bool wake_up_full_nohz_cpu(int cpu) 
{
        /*
         * We just need the target to call irq_exit() and re-evaluate
         * the next tick. The nohz full kick at least implies that.
         * If needed we can still optimize that later with an
         * empty IRQ.
         */
        if (cpu_is_offline(cpu))
                return true;  /* Don't try to wake offline CPUs. */
        if (tick_nohz_full_cpu(cpu)) {
                if (cpu != smp_processor_id() ||
                    tick_nohz_tick_stopped())
                        tick_nohz_full_kick_cpu(cpu);
                return true;
        }

        return false;
}

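If the requested cpu runs in nohz full mode, it is kicked so that it re-evaluates its next tick, unless it is the current cpu and its tick is still running, and true is returned. True is also returned for an offline cpu, which is never woken.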

tick_nohz_tick_stopped()

include/linux/tick.h

static inline int tick_nohz_tick_stopped(void)
{
        return __this_cpu_read(tick_cpu_sched.tick_stopped);
}

Returns whether the scheduler tick of the current cpu is stopped.

tick_nohz_full_kick_cpu()

kernel/time/tick-sched.c

/*
 * Kick the CPU if it's full dynticks in order to force it to
 * re-evaluate its dependency on the tick and restart it if necessary.
 */

void tick_nohz_full_kick_cpu(int cpu)
{
        if (!tick_nohz_full_cpu(cpu))
                return;

        irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu);
}

If the requested cpu runs in nohz full mode, an irq_work is queued on that cpu so that it re-evaluates its dependency on the tick and restarts the tick if necessary.

tick_nohz_full_cpu()

include/linux/tick.h

static inline bool tick_nohz_full_cpu(int cpu)
{
        if (!tick_nohz_full_enabled())
                return false;

        return cpumask_test_cpu(cpu, tick_nohz_full_mask);
}

Returns whether the requested cpu runs in nohz full mode.

Programming the hrtimer

hrtimer_reprogram()

kernel/time/hrtimer.c

/*
 * When a timer is enqueued and expires earlier than the already enqueued
 * timers, we have to check, whether it expires earlier than the timer for
 * which the clock event device was armed.
 *
 * Called with interrupts disabled and base->cpu_base.lock held
 */

static void hrtimer_reprogram(struct hrtimer *timer, bool reprogram)
{
        struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
        struct hrtimer_clock_base *base = timer->base;
        ktime_t expires = ktime_sub(hrtimer_get_expires(timer), base->offset);

        WARN_ON_ONCE(hrtimer_get_expires_tv64(timer) < 0);

        /*
         * CLOCK_REALTIME timer might be requested with an absolute
         * expiry time which is less than base->offset. Set it to 0.
         */
        if (expires < 0)
                expires = 0;

        if (timer->is_soft) {
                /*
                 * soft hrtimer could be started on a remote CPU. In this
                 * case softirq_expires_next needs to be updated on the
                 * remote CPU. The soft hrtimer will not expire before the
                 * first hard hrtimer on the remote CPU -
                 * hrtimer_check_target() prevents this case.
                 */
                struct hrtimer_cpu_base *timer_cpu_base = base->cpu_base;

                if (timer_cpu_base->softirq_activated)
                        return;

                if (!ktime_before(expires, timer_cpu_base->softirq_expires_next))
                        return;

                timer_cpu_base->softirq_next_timer = timer;
                timer_cpu_base->softirq_expires_next = expires;

                if (!ktime_before(expires, timer_cpu_base->expires_next) ||
                    !reprogram)
                        return;
        }

        /*
         * If the timer is not on the current cpu, we cannot reprogram
         * the other cpus clock event device.
         */
        if (base->cpu_base != cpu_base)
                return;

        /*
         * If the hrtimer interrupt is running, then it will
         * reevaluate the clock bases and reprogram the clock event
         * device. The callbacks are always executed in hard interrupt
         * context so we don't need an extra check for a running
         * callback.
         */
        if (cpu_base->in_hrtirq)
                return;

        if (expires >= cpu_base->expires_next)
                return;

        /* Update the pointer to the next expiring timer */
        cpu_base->next_timer = timer;
        cpu_base->expires_next = expires;

        /*
         * If hres is not active, hardware does not have to be
         * programmed yet.
         *
         * If a hang was detected in the last timer interrupt then we
         * do not schedule a timer which is earlier than the expiry
         * which we enforced in the hang detection. We want the system
         * to make progress.
         */
        if (!__hrtimer_hres_active(cpu_base) || cpu_base->hang_detected)
                return;

        /*
         * Program the timer hardware. We enforce the expiry for
         * events which are already in the past.
         */
        tick_program_event(expires, 1);
}

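Reprograms the clock event device for the requested hrtimer, unless the timer belongs to another cpu's base, the hrtimer interrupt is currently running, or the timer does not expire earlier than the event that is already programmed.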

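In code lines 13~14, if the expires value, converted to the monotonic base by subtracting the clock's offset, is less than 0, it is reset to 0.
In code lines 16~38, for an hrtimer that must run in softirq context, the function returns early when no reprogramming is needed, that is in any of the following cases.
- the softirq has already been raised (softirq_activated is set)
- the timer does not expire earlier than the next softirq expiry time
- the timer does not expire earlier than the next hardirq expiry time, or @reprogram is false
In code lines 44~45, the function returns if the timer's cpu base is not that of the current cpu, because the clock event device of another cpu cannot be reprogrammed from here.
In code lines 54~55, the function returns if the hrtimer interrupt is currently running in hardirq context, since the interrupt handler reprograms the device itself.
In code lines 57~58, the function returns if the expiry time is not earlier than the earliest expiry already recorded for the cpu base.
In code lines 61~62, the requested hrtimer is recorded as the next timer to expire, and the next expiry time is updated.
In code lines 73~74, the function returns if high resolution mode is not active or if a hang was detected on the current cpu base.
In code line 80, the expiry is programmed in one-shot fashion through the tick device (hw).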
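
The chain of early returns for a hardirq timer can be condensed into a small decision model. This is a sketch under simplifying assumptions: `struct reprog_model` and `model_reprogram()` are hypothetical, the softirq branch is omitted, and @expires is taken to be already converted to the monotonic base.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, flattened view of the state hrtimer_reprogram() looks at. */
struct reprog_model {
	bool    same_cpu;       /* timer's cpu base is the local one */
	bool    in_hrtirq;      /* hrtimer interrupt currently running */
	bool    hres_active;    /* high resolution mode enabled */
	bool    hang_detected;  /* last interrupt detected a hang */
	int64_t expires_next;   /* earliest expiry currently recorded (ns) */
};

/*
 * Hard-timer path only. Returns true when the hardware would be programmed.
 * expires_next is updated at the same point as in the kernel function,
 * that is before the hres/hang check.
 */
static bool model_reprogram(struct reprog_model *m, int64_t expires)
{
	assert(m != NULL && expires >= 0);

	if (!m->same_cpu || m->in_hrtirq)
		return false;          /* someone else will reprogram */
	if (expires >= m->expires_next)
		return false;          /* not the earliest timer */

	m->expires_next = expires;     /* remember the new earliest expiry */

	if (!m->hres_active || m->hang_detected)
		return false;          /* bookkeeping only, no hardware access */
	return true;                   /* tick_program_event() would run here */
}
```

Note that a detected hang suppresses only the hardware access: the recorded next expiry is still updated, which matches the order of the checks in the listing above.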

Forwarding the expiry time

hrtimer_forward()

kernel/time/hrtimer.c

/**
 * hrtimer_forward - forward the timer expiry
 * @timer:      hrtimer to forward
 * @now:        forward past this time
 * @interval:   the interval to forward
 *
 * Forward the timer expiry so it will expire in the future.
 * Returns the number of overruns.
 *
 * Can be safely called from the callback function of @timer. If
 * called from other contexts @timer must neither be enqueued nor
 * running the callback and the caller needs to take care of
 * serialization.
 *
 * Note: This only updates the timer expiry value and does not requeue
 * the timer.
 */

u64 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval)
{
        u64 orun = 1;
        ktime_t delta;

        delta = ktime_sub(now, hrtimer_get_expires(timer));

        if (delta < 0)
                return 0;

        if (WARN_ON(timer->state & HRTIMER_STATE_ENQUEUED))
                return 0;

        if (interval < hrtimer_resolution)
                interval = hrtimer_resolution;

        if (unlikely(delta >= interval)) {
                s64 incr = ktime_to_ns(interval);

                orun = ktime_divns(delta, incr);
                hrtimer_add_expires_ns(timer, incr * orun);
                if (hrtimer_get_expires_tv64(timer) > now)
                        return orun;
                /*
                 * This (and the ktime_add() below) is the
                 * correction for exact:
                 */
                orun++;
        }
        hrtimer_add_expires(timer, interval);

        return orun;
}
EXPORT_SYMBOL_GPL(hrtimer_forward);

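Only for a timer that has already expired, the expiry time is moved forward from its old value in multiples of @interval until it lies past @now. The return value is the number of intervals the expiry was forwarded by (the overrun count); 0 is returned if the expiry was not changed.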

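In code lines 6~9, the old expiry time of the timer is subtracted from @now and stored in delta. If the timer has not expired yet, the function returns 0.
In code lines 11~12, if the hrtimer is in the enqueued state, a warning is printed and the function returns 0.
In code lines 14~15, if the interval is smaller than the timer resolution, the interval is raised to the timer resolution.
In code lines 17~29, in the unlikely case that the timer expired more than one interval ago, orun is set to the number of whole intervals that fit into delta and the expiry time of the hrtimer is advanced by orun x interval. If the advanced expiry already lies past the current time, orun is returned.
In code lines 30~32, the expiry is advanced by one more interval and orun is returned.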
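
The arithmetic of these steps is easy to try out in user space. The sketch below models only the calculation: `model_forward()` is a hypothetical name, times are plain nanosecond integers, and the enqueued-state check and the resolution clamp of the kernel function are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Forward *expires past @now in steps of @interval; return the overruns. */
static uint64_t model_forward(int64_t *expires, int64_t now, int64_t interval)
{
	uint64_t orun = 1;
	int64_t delta;

	assert(expires != NULL && interval > 0);

	delta = now - *expires;
	if (delta < 0)
		return 0;              /* not expired yet: leave it alone */

	if (delta >= interval) {
		/* Number of whole intervals that were missed. */
		orun = (uint64_t)(delta / interval);
		*expires += interval * (int64_t)orun;
		/*
		 * Kept to mirror the kernel listing; with exact integer
		 * division the expiry cannot be past @now at this point.
		 */
		if (*expires > now)
			return orun;
		orun++;                /* one more step to get past now */
	}
	*expires += interval;
	return orun;
}
```

For example, with an old expiry of 100, an interval of 10 and now = 135, three whole intervals were missed, the expiry first moves to 130 and then to 140, and the function returns 4. With now = 105 the expiry simply moves to 110 and 1 is returned.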

The following figure shows one case where hrtimer_forward() leaves the expiry time unchanged and one case where it changes it.

Structures

cpu base

hrtimer_cpu_base structure

include/linux/hrtimer.h

/**
 * struct hrtimer_cpu_base - the per cpu clock bases
 * @lock:               lock protecting the base and associated clock bases
 *                      and timers
 * @cpu:                cpu number
 * @active_bases:       Bitfield to mark bases with active timers
 * @clock_was_set_seq:  Sequence counter of clock was set events
 * @hres_active:        State of high resolution mode
 * @in_hrtirq:          hrtimer_interrupt() is currently executing
 * @hang_detected:      The last hrtimer interrupt detected a hang
 * @softirq_activated:  displays, if the softirq is raised - update of softirq
 *                      related settings is not required then.
 * @nr_events:          Total number of hrtimer interrupt events
 * @nr_retries:         Total number of hrtimer interrupt retries
 * @nr_hangs:           Total number of hrtimer interrupt hangs
 * @max_hang_time:      Maximum time spent in hrtimer_interrupt
 * @softirq_expiry_lock: Lock which is taken while softirq based hrtimer are
 *                       expired
 * @timer_waiters:      A hrtimer_cancel() invocation waits for the timer
 *                      callback to finish.
 * @expires_next:       absolute time of the next event, is required for remote
 *                      hrtimer enqueue; it is the total first expiry time (hard
 *                      and soft hrtimer are taken into account)
 * @next_timer:         Pointer to the first expiring timer
 * @softirq_expires_next: Time to check, if soft queues needs also to be expired
 * @softirq_next_timer: Pointer to the first expiring softirq based timer
 * @clock_base:         array of clock bases for this cpu
 *
 * Note: next_timer is just an optimization for __remove_hrtimer().
 *       Do not dereference the pointer because it is not reliable on
 *       cross cpu removals.
 */

struct hrtimer_cpu_base {
        raw_spinlock_t                  lock;
        unsigned int                    cpu;
        unsigned int                    active_bases;
        unsigned int                    clock_was_set_seq;
        unsigned int                    hres_active             : 1,
                                        in_hrtirq               : 1,
                                        hang_detected           : 1,
                                        softirq_activated       : 1;
#ifdef CONFIG_HIGH_RES_TIMERS
        unsigned int                    nr_events;
        unsigned short                  nr_retries;
        unsigned short                  nr_hangs;
        unsigned int                    max_hang_time;
#endif
#ifdef CONFIG_PREEMPT_RT
        spinlock_t                      softirq_expiry_lock;
        atomic_t                        timer_waiters;
#endif
        ktime_t                         expires_next;
        struct hrtimer                  *next_timer;
        ktime_t                         softirq_expires_next;
        struct hrtimer                  *softirq_next_timer;
        struct hrtimer_clock_base       clock_base[HRTIMER_MAX_CLOCK_BASES];
} ____cacheline_aligned;

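lock
- Lock that protects the base, the clock bases attached to it, and their timers.
cpu
- The cpu id.
active_bases
- Bit field with one bit per clock base that holds active timers.
- e.g. 0x13 -> there are active hrtimers on the HRTIMER_BASE_MONOTONIC_SOFT clock, which runs in softirq context, and on the HRTIMER_BASE_MONOTONIC and HRTIMER_BASE_REALTIME clocks, which run in hardirq context.
clock_was_set_seq
- Sequence value used to detect whether the clock has been changed (set).
hres_active:1
- Whether the high resolution hw timer is in use.
in_hrtirq:1
- Whether the hardirq context is currently running (1 while hrtimer_interrupt() is executing, 0 once it has finished).
hang_detected:1
- Set when a hang was detected in the last interrupt.
softirq_activated:1
- Whether the softirq has been raised and is being processed.
nr_events
- Number of hrtimer interrupt events.
nr_retries
- Number of hrtimer interrupt retries.
nr_hangs
- Number of hangs.
max_hang_time
- The longest time spent in the hrtimer interrupt context.
softirq_expiry_lock
- Lock held while expired softirq timers are being processed.
timer_waiters
- Number of waiters that called hrtimer_cancel() and are waiting for the timer callback to finish.
expires_next
- Expiry time, on the monotonic base, of the timer that expires first. According to the structure comment above, both hard and soft hrtimers are taken into account.
*next_timer
- The next hrtimer to be handled in hardirq context.
softirq_expires_next
- Expiry time, on the monotonic base, of the timer that expires first among the 4 softirq clocks.
*softirq_next_timer
- The next hrtimer to be handled in softirq context.
clock_base[]
- The 8 clock bases.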
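
The 0x13 example for active_bases can be reproduced with a few bit operations. In the sketch below the enum mirrors the order of the kernel's clock base indexes (four hardirq bases followed by their four softirq counterparts); the shortened names, the two mask macros and the helper functions are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Same order as the kernel's clock base indexes: hard first, then soft. */
enum model_base {
	BASE_MONOTONIC, BASE_REALTIME, BASE_BOOTTIME, BASE_TAI,
	BASE_MONOTONIC_SOFT, BASE_REALTIME_SOFT, BASE_BOOTTIME_SOFT,
	BASE_TAI_SOFT,
	MAX_CLOCK_BASES
};

#define ACTIVE_HARD ((1u << BASE_MONOTONIC_SOFT) - 1)     /* bits 0..3 */
#define ACTIVE_SOFT (ACTIVE_HARD << BASE_MONOTONIC_SOFT)  /* bits 4..7 */

/* What enqueue_hrtimer() does to the mask for a given base index. */
static unsigned int mark_active(unsigned int active_bases, enum model_base index)
{
	assert(index < MAX_CLOCK_BASES);
	return active_bases | (1u << index);
}

/* True if the clock base with @index holds at least one active timer. */
static bool base_is_active(unsigned int active_bases, enum model_base index)
{
	assert(index < MAX_CLOCK_BASES);
	return (active_bases >> index) & 1u;
}
```

Marking the monotonic, realtime and monotonic-soft bases yields 0x13; masking that value with ACTIVE_HARD leaves 0x03 and with ACTIVE_SOFT leaves 0x10, which separates the hardirq bases from the softirq ones.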

clock base

hrtimer_clock_base structure

include/linux/hrtimer.h

/**
 * struct hrtimer_clock_base - the timer base for a specific clock
 * @cpu_base:           per cpu clock base
 * @index:              clock type index for per_cpu support when moving a
 *                      timer to a base on another cpu.
 * @clockid:            clock id for per_cpu support
 * @seq:                seqcount around __run_hrtimer
 * @running:            pointer to the currently running hrtimer
 * @active:             red black tree root node for the active timers
 * @get_time:           function to retrieve the current time of the clock
 * @offset:             offset of this clock to the monotonic base
 */

struct hrtimer_clock_base {
        struct hrtimer_cpu_base *cpu_base;
        unsigned int            index;
        clockid_t               clockid;
        seqcount_t              seq;
        struct hrtimer          *running;
        struct timerqueue_head  active;
        ktime_t                 (*get_time)(void);
        ktime_t                 offset;
} __hrtimer_clock_base_align;

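*cpu_base
- Pointer to the cpu base.
index
- clock base index (0~7, since a cpu base holds 8 clock bases)
clockid
- clock id
seq
- Sequence counter around __run_hrtimer().
running
- The hrtimer that is currently running.
active
- Root of the RB tree that holds the active timers.
(*get_time)
- The function that reads the current time of this clock base.
offset
- Offset (ns) between this clock and the monotonic clock.
- If this clock is the monotonic clock, the value is 0.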
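
The role of offset can be illustrated with the conversion that hrtimer_reprogram() applies: an expiry given on some clock base is moved to the monotonic base by subtracting that base's offset, and a negative result is clamped to 0. `expiry_to_monotonic()` is a hypothetical helper for this sketch, and it assumes non-negative inputs so that the subtraction cannot overflow.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert an expiry given on some clock base to the monotonic base.
 * @offset is that base's offset to the monotonic clock in nanoseconds.
 */
static int64_t expiry_to_monotonic(int64_t expires, int64_t offset)
{
	int64_t mono;

	/* Assumption of this sketch: both values are non-negative. */
	assert(expires >= 0 && offset >= 0);

	mono = expires - offset;
	/* An absolute time that lies before the offset is due immediately. */
	return mono < 0 ? 0 : mono;
}
```

With an offset of 1575075653300352953 ns, a realtime expiry of 9223372036854775807 (the largest 64-bit value) corresponds to 7648296383554422854 ns on the monotonic base, while any realtime expiry smaller than the offset is clamped to 0.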

hrtimer structure

include/linux/hrtimer.h

/**
 * struct hrtimer - the basic hrtimer structure
 * @node:       timerqueue node, which also manages node.expires,
 *              the absolute expiry time in the hrtimers internal
 *              representation. The time is related to the clock on
 *              which the timer is based. Is setup by adding
 *              slack to the _softexpires value. For non range timers
 *              identical to _softexpires.
 * @_softexpires: the absolute earliest expiry time of the hrtimer.
 *              The time which was given as expiry time when the timer
 *              was armed.
 * @function:   timer expiry callback function
 * @base:       pointer to the timer base (per cpu and per clock)
 * @state:      state information (See bit values above)
 * @is_rel:     Set if the timer was armed relative
 * @is_soft:    Set if hrtimer will be expired in soft interrupt context.
 * @is_hard:    Set if hrtimer will be expired in hard interrupt context
 *              even on RT.
 *
 * The hrtimer structure must be initialized by hrtimer_init()
 */

struct hrtimer {
        struct timerqueue_node          node;
        ktime_t                         _softexpires;
        enum hrtimer_restart            (*function)(struct hrtimer *);
        struct hrtimer_clock_base       *base;
        u8                              state;
        u8                              is_rel;
        u8                              is_soft;
        u8                              is_hard;
};

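node
- Node that is linked into the RB tree; it holds the actual expiry time with slack applied.
_softexpires
- Expiry time without slack (the soft expiry time).
(*function)
- Function called when the timer expires.
*base
- clock base
state
- Timer state
  - HRTIMER_STATE_INACTIVE (0x00)
  - HRTIMER_STATE_ENQUEUED (0x01)
is_rel
- Whether a relative time is used.
is_soft
- Whether the timer runs in softirq context.
- Always set to the opposite of is_hard.
is_hard
- Whether the timer runs in hardirq context.
- Always set to the opposite of is_soft.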
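
The relation between _softexpires and node.expires can be sketched as follows: a range timer may fire anywhere between the soft expiry and the soft expiry plus slack, and the tree is sorted by the later value. `struct model_timer` and both helpers are hypothetical reductions for this example; the saturating addition imitates what the kernel's ktime_add_safe() is for.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduced timer: only the two expiry fields. */
struct model_timer {
	int64_t softexpires;   /* earliest time the timer may fire */
	int64_t expires;       /* softexpires + slack; the sort key */
};

/* Arm @t with a range: it may fire anywhere in [time, time + slack_ns]. */
static void model_set_expires_range(struct model_timer *t, int64_t time,
				    int64_t slack_ns)
{
	assert(t != NULL && time >= 0 && slack_ns >= 0);

	t->softexpires = time;
	/* Saturate instead of overflowing. */
	t->expires = (time > INT64_MAX - slack_ns) ? INT64_MAX
						   : time + slack_ns;
}

/* A timer is due as soon as @now has reached the soft expiry. */
static int model_timer_due(const struct model_timer *t, int64_t now)
{
	assert(t != NULL);
	return now >= t->softexpires;
}
```

A soft expiry of 1396738000463329 ns with a slack of 9988998 ns gives a hard expiry of 1396738010452327 ns. For a timer without slack both fields hold the same value.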

Timer information

The state of the hrtimers running on all cpus can be inspected as follows.

Clock 0 below is the monotonic clock, and its offset is 0.
Clock 2 below, the boottime clock, has never been through a suspend, so it shows the same offset as clock 0.

$ cat /proc/timer_list
Timer List Version: v0.8
HRTIMER_MAX_CLOCK_BASES: 8
now at 1396737667102735 nsecs

cpu: 0
 clock 0:
  .base:       ffffffc07ff6a7c0
  .index:      0
  .resolution: 1 nsecs
  .get_time:   ktime_get
  .offset:     0 nsecs
active timers:
 #0: <ffffffc07ff6ac78>, tick_sched_timer, S:01
 # expires at 1396737704000000-1396737704000000 nsecs [in 36897265 to 36897265 nsecs]
 #1: <ffffff800b18b920>, hrtimer_wakeup, S:01
 # expires at 1396738000463329-1396738010452327 nsecs [in 333360594 to 343349592 nsecs]
 #2: <ffffffc07abd0400>, timerfd_tmrproc, S:01
 # expires at 1396761589001000-1396761589001000 nsecs [in 23921898265 to 23921898265 nsecs]
 #3: <ffffff800b37bc00>, hrtimer_wakeup, S:01
 # expires at 1396780922286140-1396780922336140 nsecs [in 43255183405 to 43255233405 nsecs]
 #4: <ffffff800b3a3c00>, hrtimer_wakeup, S:01
 # expires at 1396916185370677-1396916185420677 nsecs [in 178518267942 to 178518317942 nsecs]
 #5: sched_clock_timer, sched_clock_poll, S:01
 # expires at 1398578790530436-1398578790530436 nsecs [in 1841123427701 to 1841123427701 nsecs]
 clock 1:
  .base:       ffffffc07ff6a800
  .index:      1
  .resolution: 1 nsecs
  .get_time:   ktime_get_real
  .offset:     1575075653300352953 nsecs
active timers:
 #0: <ffffffc070a11300>, timerfd_tmrproc, S:01
 # expires at 9223372036854775807-9223372036854775807 nsecs [in 7646899645887320119 to 7646899645887320119 nsecs]
 clock 2:
  .base:       ffffffc07ff6a840
  .index:      2
  .resolution: 1 nsecs
  .get_time:   ktime_get_boottime
  .offset:     0 nsecs
active timers:
 #0: , timerfd_tmrproc, S:01
 # expires at 1399988673897000-1399988673897000 nsecs [in 3251006794265 to 3251006794265 nsecs]
 clock 3:
  .base:       ffffffc07ff6a880
  .index:      3
  .resolution: 1 nsecs
  .get_time:   ktime_get_clocktai
  .offset:     1575075653300352953 nsecs
...

$ cat /proc/timer_list
Timer List Version: v0.7
HRTIMER_MAX_CLOCK_BASES: 4
now at 6680679880466542 nsecs

cpu: 0
 clock 0:
  .base:       b9b483d0
  .index:      0
  .resolution: 1 nsecs
  .get_time:   ktime_get
  .offset:     0 nsecs
active timers:
 #0: <b9b48628>, tick_sched_timer, S:01, hrtimer_start, swapper/0/0
 # expires at 6680679900000000-6680679900000000 nsecs [in 19533458 to 19533458 nsecs]
 #1: <b6e19a48>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, ifplugd/1632
 # expires at 6680680247387930-6680680248387924 nsecs [in 366921388 to 367921382 nsecs]
(...omitted...)
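
The "expires at" lines of /proc/timer_list can be checked mechanically. The following sketch parses one such line with sscanf(); `struct expiry_line` and `parse_expiry_line()` are hypothetical names, and the format string is derived only from the sample output shown above, so other kernel versions may print the line differently.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdio.h>

struct expiry_line {
	int64_t soft, hard;        /* "expires at <soft>-<hard> nsecs" */
	int64_t in_soft, in_hard;  /* "[in <in_soft> to <in_hard> nsecs]" */
};

/* Parse one " # expires at ..." line; returns 0 on success, -1 otherwise. */
static int parse_expiry_line(const char *line, struct expiry_line *out)
{
	int n;

	assert(line != NULL && out != NULL);

	n = sscanf(line,
		   " # expires at %" SCNd64 "-%" SCNd64
		   " nsecs [in %" SCNd64 " to %" SCNd64 " nsecs]",
		   &out->soft, &out->hard, &out->in_soft, &out->in_hard);

	return n == 4 ? 0 : -1;
}
```

For clock 0, whose offset is 0, the "in" values are simply the expiry minus the "now at" value of the dump: 1396737704000000 - 1396737667102735 = 36897265 for the tick_sched_timer entry. The difference between the two values of a pair is the slack of that timer, 9988998 ns for the first hrtimer_wakeup entry.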
