문c Blog

sched_clock_init()

2017-03-15 (updated 2020-01-03), 문영일

Merged into the following post:

Timer -7- (Sched Clock & Delay Timers) | 문c

Timer -8- (Timecounter)

2017-03-10 (updated 2020-02-03), 문영일

Timecounter/Cyclecounter

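Provides a hardware-independent timecounter API. Because few drivers use this counter, the code was moved out of clocksource, where it originally lived, into a separate file.

The arm architected timer driver initializes and uses a timecounter on top of a 56-bit cyclecounter.

It was mainly introduced to tie a h/w timer to the PTP (Precision Time Protocol) feature of high-speed Ethernet drivers, and it is also used by the Intel HD Audio driver.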

Drivers using it
- drivers/net/ethernet/amd/xgbe/xgbe-drv.c
  - AMD 10Gb Ethernet driver
- drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
  - Broadcom Everest network driver
- drivers/net/ethernet/freescale/fec_ptp.c
  - Fast Ethernet Controller (ENET) PTP driver for MX6x
- drivers/net/ethernet/ti/cpts.c
  - TI Common Platform Time Sync
- drivers/net/ethernet/mellanox/mlx4/en_clock.c
  - mlx4 ptp clock
- drivers/net/ethernet/intel/igb/igb_ptp.c
  - PTP Hardware Clock (PHC) driver for the Intel 82576 and 82580
- drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c
  - Intel 10 Gigabit PCI Express Linux driver
- drivers/net/ethernet/intel/e1000e/netdev.c
  - Intel PRO/1000 Linux driver
- sound/pci/hda/hda_controller.c
  - Implementation of primary alsa driver code base for Intel HD Audio.
Reference: time: move the timecounter/cyclecounter code into its own file.

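APIs

timecounter_init()
- Ties the timecounter to a hardware counter register (through a cyclecounter).
timecounter_read()
- Returns the time elapsed since timecounter_init(), converted to ns and added to the initial time stamp.
cyclecounter_cyc2ns()
- Converts a cycle counter value to ns.
timecounter_adjtime()
- Corrects the time by adding delta to tc->nsec.
timecounter_cyc2time()
- Computes the delta cycles between a given cycle value and the last cycle counter value, converts them to ns, and returns the corresponding time.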

Timecounter initialization

timecounter_init()

kernel/time/timecounter.c

/**
 * timecounter_init - initialize a time counter
 * @tc:                 Pointer to time counter which is to be initialized/reset
 * @cc:                 A cycle counter, ready to be used.
 * @start_tstamp:       Arbitrary initial time stamp.
 *
 * After this call the current cycle register (roughly) corresponds to
 * the initial time stamp. Every call to timecounter_read() increments
 * the time stamp counter by the number of elapsed nanoseconds.
 */

void timecounter_init(struct timecounter *tc,
                      const struct cyclecounter *cc,
                      u64 start_tstamp)
{       
        tc->cc = cc;
        tc->cycle_last = cc->read(cc);
        tc->nsec = start_tstamp;
        tc->mask = (1ULL << cc->shift) - 1;
        tc->frac = 0;
}
EXPORT_SYMBOL_GPL(timecounter_init);

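Initializes the given timecounter: links it to the given cyclecounter, sets the start value (ns), and stores in cycle_last the cycle value read from the h/w timer.

cycle_last is set to the current timer counter value, read through cc->read() as a u64.
nsec is set to the initial start value.
rpi2 (architecture generic timer):
- mask is the 24-bit value 0xff_ffff, because cc->shift is 24.
- This value is used as the bit mask for the frac (fractional nanoseconds) field.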
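A minimal user-space sketch of the initialization, assuming a hypothetical `fake_hw_counter` variable and `fake_read()` hook in place of the real hardware register, and the rpi2 values used in this post (56-bit counter, mult=0x3415_5555, shift=24). The two structures are re-declared with `stdint.h` types; the function body follows the quoted timecounter_init().

```c
#include <assert.h>
#include <stdint.h>

/* User-space stand-ins for the kernel structures (same field names). */
struct cyclecounter {
    uint64_t (*read)(const struct cyclecounter *cc);
    uint64_t mask;
    uint32_t mult;
    uint32_t shift;
};

struct timecounter {
    const struct cyclecounter *cc;
    uint64_t cycle_last;
    uint64_t nsec;
    uint64_t mask;
    uint64_t frac;
};

/* Hypothetical fake hardware counter that the caller can set. */
static uint64_t fake_hw_counter;

static uint64_t fake_read(const struct cyclecounter *cc)
{
    return fake_hw_counter & cc->mask;
}

/* Same steps as the kernel's timecounter_init(). */
static void timecounter_init(struct timecounter *tc,
                             const struct cyclecounter *cc,
                             uint64_t start_tstamp)
{
    tc->cc = cc;
    tc->cycle_last = cc->read(cc);
    tc->nsec = start_tstamp;
    tc->mask = (1ULL << cc->shift) - 1;   /* frac mask, not a cycle mask */
    tc->frac = 0;
}

/* rpi2-like cyclecounter: 56-bit counter at 19.2 MHz, shift 24. */
static const struct cyclecounter rpi2_cc = {
    .read  = fake_read,
    .mask  = (1ULL << 56) - 1,
    .mult  = 0x34155555,
    .shift = 24,
};
```

With shift=24 the frac mask comes out as 0xff_ffff, matching the rpi2 values above, while cycle_last simply holds whatever the counter read at that moment.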

The following figure shows a timecounter being initialized on top of the 56-bit counter of the armv7 architecture generic timer on rpi2.

timecounter_read()

kernel/time/timecounter.c

/**
 * timecounter_read - return nanoseconds elapsed since timecounter_init()
 *                    plus the initial time stamp
 * @tc:          Pointer to time counter.
 *
 * In other words, keeps track of time since the same epoch as
 * the function which generated the initial time stamp.
 */

u64 timecounter_read(struct timecounter *tc)
{
        u64 nsec;

        /* increment time by nanoseconds since last call */
        nsec = timecounter_read_delta(tc);
        nsec += tc->nsec;
        tc->nsec = nsec;

        return nsec;
}
EXPORT_SYMBOL_GPL(timecounter_read);

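Adds the delta (ns) elapsed since the last call to tc->nsec, stores the result back in tc->nsec, and returns it.

The following figure shows what happens when timecounter_read() is called for the first time, 100 cycles (5208 ns) after initialization with timecounter_init().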
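The 100-cycle case in the figure can be reproduced with a small user-space port of timecounter_init(), timecounter_read() and the helpers they call. As before, `fake_hw_counter` and `fake_read()` are hypothetical stand-ins for the hardware, and the conversion factors are the rpi2 values from this post.

```c
#include <assert.h>
#include <stdint.h>

struct cyclecounter {
    uint64_t (*read)(const struct cyclecounter *cc);
    uint64_t mask;
    uint32_t mult;
    uint32_t shift;
};

struct timecounter {
    const struct cyclecounter *cc;
    uint64_t cycle_last, nsec, mask, frac;
};

/* Hypothetical stand-in for the hardware counter register. */
static uint64_t fake_hw_counter;

static uint64_t fake_read(const struct cyclecounter *cc)
{
    return fake_hw_counter & cc->mask;
}

/* Same arithmetic as cyclecounter_cyc2ns(). */
static uint64_t cyclecounter_cyc2ns(const struct cyclecounter *cc,
                                    uint64_t cycles, uint64_t mask,
                                    uint64_t *frac)
{
    uint64_t ns = cycles * cc->mult + *frac;

    *frac = ns & mask;          /* keep the sub-ns remainder */
    return ns >> cc->shift;
}

/* Same steps as timecounter_read_delta(). */
static uint64_t timecounter_read_delta(struct timecounter *tc)
{
    uint64_t now = tc->cc->read(tc->cc);
    uint64_t delta = (now - tc->cycle_last) & tc->cc->mask;
    uint64_t ns = cyclecounter_cyc2ns(tc->cc, delta, tc->mask, &tc->frac);

    tc->cycle_last = now;
    return ns;
}

static void timecounter_init(struct timecounter *tc,
                             const struct cyclecounter *cc, uint64_t start)
{
    tc->cc = cc;
    tc->cycle_last = cc->read(cc);
    tc->nsec = start;
    tc->mask = (1ULL << cc->shift) - 1;
    tc->frac = 0;
}

static uint64_t timecounter_read(struct timecounter *tc)
{
    tc->nsec += timecounter_read_delta(tc);
    return tc->nsec;
}

/* rpi2-like cyclecounter: 56 bits, 19.2 MHz, shift 24. */
static const struct cyclecounter rpi2_cc = {
    .read = fake_read, .mask = (1ULL << 56) - 1,
    .mult = 0x34155555, .shift = 24,
};
```

100 x 0x3415_5555 = 87,381,333,300; shifting right by 24 gives 5208 ns, and the remaining 5,592,372 (out of 2^24) stays in tc.frac for the next read.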

timecounter_read_delta()

kernel/time/timecounter.c

/**
 * timecounter_read_delta - get nanoseconds since last call of this function
 * @tc:         Pointer to time counter
 *
 * When the underlying cycle counter runs over, this will be handled
 * correctly as long as it does not run over more than once between
 * calls.
 *
 * The first call to this function for a new time counter initializes
 * the time tracking and returns an undefined result.
 */

static u64 timecounter_read_delta(struct timecounter *tc)
{
        cycle_t cycle_now, cycle_delta;
        u64 ns_offset;

        /* read cycle counter: */
        cycle_now = tc->cc->read(tc->cc);

        /* calculate the delta since the last timecounter_read_delta(): */
        cycle_delta = (cycle_now - tc->cycle_last) & tc->cc->mask;

        /* convert to nanoseconds: */
        ns_offset = cyclecounter_cyc2ns(tc->cc, cycle_delta,
                                        tc->mask, &tc->frac);

        /* update time stamp of timecounter_read_delta() call: */
        tc->cycle_last = cycle_now;

        return ns_offset;
}

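Reads the cycle count, returns the delta (ns) elapsed since the last call, and stores the count just read in tc->cycle_last.

Code line 7 reads the cycle count of the h/w timer attached to the cyclecounter.
Code line 10 computes (cycles just read - previous cycles), masks the result with the cyclecounter's mask, and stores it in cycle_delta.
Code lines 13~14 convert cycle_delta into elapsed time (ns).
- ((cycle_delta * cc->mult) + tc->frac) >> cc->shift, with the low cc->shift bits carried over in tc->frac
Code line 17 stores the cycle count just read in tc->cycle_last.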
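The masking in code line 10 is what lets a single counter wrap go unnoticed. A small sketch, assuming a 56-bit counter and made-up cycle values, shows the arithmetic:

```c
#include <assert.h>
#include <stdint.h>

/* 56-bit counter mask, as in the arm architected timer example. */
#define CYCLE_MASK_56 ((1ULL << 56) - 1)

/* Same expression as code line 10 of timecounter_read_delta():
 * unsigned subtraction wraps modulo 2^64, and the mask then
 * reduces the result modulo 2^56. */
static uint64_t cycle_delta(uint64_t now, uint64_t last, uint64_t mask)
{
    return (now - last) & mask;
}
```

cycle_delta(0x10, CYCLE_MASK_56 - 0xf, CYCLE_MASK_56) gives 0x20: the 16 cycles up to and including the wrap to 0, plus the 16 after it. A counter that wraps twice between calls cannot be told apart from one that wrapped once, which is the limitation stated in the function's comment.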

cyclecounter_cyc2ns()

include/linux/timecounter.h

/**
 * cyclecounter_cyc2ns - converts cycle counter cycles to nanoseconds
 * @cc:         Pointer to cycle counter.
 * @cycles:     Cycles
 * @mask:       bit mask for maintaining the 'frac' field
 * @frac:       pointer to storage for the fractional nanoseconds.
 */

static inline u64 cyclecounter_cyc2ns(const struct cyclecounter *cc,
                                      cycle_t cycles, u64 mask, u64 *frac)
{
        u64 ns = (u64) cycles;

        ns = (ns * cc->mult) + *frac;
        *frac = ns & mask;
        return ns >> cc->shift;
}

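Converts a cycle counter value to nanoseconds.

frac
- Holds the low cc->shift bits (fractional nanoseconds) that the right shift would otherwise discard and adds them back in on the next conversion, so rounding errors do not accumulate.
- Reference: timecounter: keep track of accumulated fractional nanoseconds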
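To see why frac matters, the sketch below converts 100 reads of one cycle each with the rpi2 factors from this post (mult=0x3415_5555, shift=24), once carrying frac as cyclecounter_cyc2ns() does and once discarding the remainder. The macro and function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* rpi2-like conversion factors (19.2 MHz, shift = 24). */
#define MULT      0x34155555u
#define SHIFT     24u
#define FRAC_MASK ((1ULL << SHIFT) - 1)

/* Same arithmetic as cyclecounter_cyc2ns(): carry the low bits in *frac. */
static uint64_t cyc2ns_frac(uint64_t cycles, uint64_t *frac)
{
    uint64_t ns = cycles * MULT + *frac;

    *frac = ns & FRAC_MASK;
    return ns >> SHIFT;
}

/* Same conversion, but the remainder is thrown away on every call. */
static uint64_t cyc2ns_nofrac(uint64_t cycles)
{
    return (cycles * MULT) >> SHIFT;
}

/* Sum the results of `calls` conversions of one cycle each. */
static uint64_t sum_one_cycle_reads(unsigned calls, int keep_frac)
{
    uint64_t total = 0, frac = 0;

    for (unsigned i = 0; i < calls; i++)
        total += keep_frac ? cyc2ns_frac(1, &frac) : cyc2ns_nofrac(1);
    return total;
}
```

With frac carried, the 100 reads add up to 5208 ns, the same as converting 100 cycles at once; discarding the remainder loses about 0.08 ns per read and yields 5200 ns.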

timecounter_adjtime()

include/linux/timecounter.h

/**
 * timecounter_adjtime - Shifts the time of the clock.
 * @delta:      Desired change in nanoseconds.
 */

static inline void timecounter_adjtime(struct timecounter *tc, s64 delta)
{
        tc->nsec += delta;
}

Adjusts only the timecounter's time (ns) by adding delta; the cycle value (cycle_last) is not changed.
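A trivial sketch of the adjustment, using a hypothetical reduced struct that keeps only the two fields involved, makes the point that only nsec moves:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reduced timecounter: only the fields relevant here. */
struct timecounter_lite {
    uint64_t cycle_last;
    uint64_t nsec;
};

/* Same body as timecounter_adjtime(): a negative delta is converted to
 * uint64_t, and modular addition then subtracts it. */
static void timecounter_adjtime(struct timecounter_lite *tc, int64_t delta)
{
    tc->nsec += delta;
}
```

PTP drivers such as those listed above use this to apply a time offset; the next timecounter_read() then continues from the adjusted nsec, measured from the unchanged cycle_last.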

Structures

struct timecounter

include/linux/timecounter.h

/**
 * struct timecounter - layer above a %struct cyclecounter which counts nanoseconds
 *      Contains the state needed by timecounter_read() to detect
 *      cycle counter wrap around. Initialize with
 *      timecounter_init(). Also used to convert cycle counts into the
 *      corresponding nanosecond counts with timecounter_cyc2time(). Users
 *      of this code are responsible for initializing the underlying
 *      cycle counter hardware, locking issues and reading the time
 *      more often than the cycle counter wraps around. The nanosecond
 *      counter will only wrap around after ~585 years.
 *
 * @cc:                 the cycle counter used by this instance
 * @cycle_last:         most recent cycle counter value seen by
 *                      timecounter_read()
 * @nsec:               continuously increasing count
 * @mask:               bit mask for maintaining the 'frac' field
 * @frac:               accumulated fractional nanoseconds
 */

struct timecounter {
        const struct cyclecounter *cc;
        cycle_t cycle_last;
        u64 nsec;
        u64 mask;
        u64 frac;
};

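*cc
- Must point to the cyclecounter that corresponds to the h/w timer's count (cycle) value.
cycle_last
- Stores the last cycle value read through the cyclecounter.
- Used as the base for computing the delta cycles on the next read.
nsec
- Holds the time (ns) corresponding to cycle_last.
mask
- Bit mask for maintaining the frac field: (1 << cc->shift) - 1. It is not a cycle mask; wrap-around of the cycle counter is handled with cc->mask.
frac
- Accumulated fractional nanoseconds, scaled by 2^cc->shift.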

struct cyclecounter

include/linux/timecounter.h

/**
 * struct cyclecounter - hardware abstraction for a free running counter
 *      Provides completely state-free accessors to the underlying hardware.
 *      Depending on which hardware it reads, the cycle counter may wrap
 *      around quickly. Locking rules (if necessary) have to be defined
 *      by the implementor and user of specific instances of this API.
 *
 * @read:               returns the current cycle value
 * @mask:               bitmask for two's complement
 *                      subtraction of non 64 bit counters,
 *                      see CYCLECOUNTER_MASK() helper macro
 * @mult:               cycle to nanosecond multiplier
 * @shift:              cycle to nanosecond divisor (power of two)
 */

struct cyclecounter {
        u64 (*read)(const struct cyclecounter *cc);
        u64 mask;
        u32 mult;
        u32 shift;
};

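(*read)
- Hook pointing at the function that reads the h/w timer's counter value.
mask
- Cycle counter mask that keeps only the valid bits of the value read from the h/w counter.
- e.g. 56-bit counter = 0x00ff_ffff_ffff_ffff
mult & shift
- To convert cycles to ns, multiply by mult and then shift right by shift.
- e.g. mult=0x3415_5555, shift=24 (19.2 MHz counter on rpi2)
  - 1 cycle ≈ 52.08 ns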
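The mult value in the example can be derived from the counter frequency. The helper below is a sketch assuming a fixed shift; it uses the same round-to-nearest division that the kernel's clocks_calc_mult_shift() applies for each candidate shift (the shift-selection part is omitted). `calc_mult()` and `cycles_to_ns()` are made-up names.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* mult such that (cycles * mult) >> shift ~= cycles * 1e9 / rate_hz,
 * rounded to the nearest integer. */
static uint32_t calc_mult(uint32_t rate_hz, uint32_t shift)
{
    return (uint32_t)(((NSEC_PER_SEC << shift) + rate_hz / 2) / rate_hz);
}

static uint64_t cycles_to_ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
    return (cycles * mult) >> shift;
}
```

For 19.2 MHz and shift=24 it yields 0x3415_5555. One second of cycles (19,200,000) then converts to 999,999,999 ns rather than exactly 1e9, because mult had to be rounded to an integer.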

References

Timer -1- (Lowres Timer) | 문c
Timer -2- (HRTimer) | 문c
Timer -3- (Clock Sources Subsystem) | 문c
Timer -4- (Clock Sources Watchdog) | 문c
Timer -5- (Clock Events Subsystem) | 문c
Timer -6- (Clock Source & Timer Driver) | 문c
Timer -7- (Sched Clock & Delay Timers) | 문c
Timer -8- (Timecounter) | 문c (this post)
Timer -9- (Tick Device) | 문c
Timer -10- (Timekeeping) | 문c
Timer -11- (Posix Clock & Timers) | 문c
time_init() | 문c
sched_clock_postinit() | 문c
tick_init() | 문c
timekeeping_init() | 문c
calibrate_delay() | 문c

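Mailbox was full, so I emptied it

2017-03-10 (updated 2017-03-10), 문영일

Hello,

My mailbox exceeded its capacity, so emails sent to me were not being delivered.

If you tried to contact me and did not hear back, please send your message once more ^^;

I only just noticed the problem and have emptied the mailbox.

by 문c (jake9999 @ dreamwiz.com)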

Timer -4- (Clock Sources Watchdog)

2017-03-10 (updated 2020-02-03), 문영일

Clock Sources Watchdog

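A watchdog for handling unstable clock sources; currently it is applied only on the x86 architecture.

Note: it must not be confused with the kernel's other watchdog systems.

Registering a clock source on the watchdog list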

clocksource_enqueue_watchdog()

kernel/time/clocksource.c

static void clocksource_enqueue_watchdog(struct clocksource *cs)
{
        unsigned long flags;

        spin_lock_irqsave(&watchdog_lock, flags);
        if (cs->flags & CLOCK_SOURCE_MUST_VERIFY) {
                /* cs is a clocksource to be watched. */
                list_add(&cs->wd_list, &watchdog_list);
                cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
        } else {
                /* cs is a watchdog. */
                if (cs->flags & CLOCK_SOURCE_IS_CONTINUOUS)
                        cs->flags |= CLOCK_SOURCE_VALID_FOR_HRES;
                /* Pick the best watchdog. */
                if (!watchdog || cs->rating > watchdog->rating) {
                        watchdog = cs;
                        /* Reset watchdog cycles */
                        clocksource_reset_watchdog();
                }
        }
        /* Check if the watchdog timer needs to be started. */
        clocksource_start_watchdog();
        spin_unlock_irqrestore(&watchdog_lock, flags);
}

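If the given clock source has the MUST_VERIFY flag, it is added to the watchdog list and a 0.5-second timer is started; when the timer fires, the watchdog handler compares the clock against the watchdog clock source to decide whether it is stable. Without that flag, the clock source becomes a watchdog candidate, and the global watchdog is pointed at the candidate with the best rating.

Code lines 6~9: if the MUST_VERIFY flag is set, add the clock source to the watchdog list and clear its WATCHDOG flag.
Code lines 10~13: otherwise (the clock source is a watchdog candidate), add the VALID_FOR_HRES flag if it has the CONTINUOUS flag.
Code lines 15~19: if no watchdog has been chosen yet, or the given clock source has a higher rating than the current watchdog clock source, make it the watchdog clock source and clear the WATCHDOG flag of every clock source on the watchdog list.
Code line 22: start the watchdog timer if it is not already running.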
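The decision logic of clocksource_enqueue_watchdog() can be sketched without locks, lists or timers. The flag values, the `watched` field standing in for membership of watchdog_list, and the clock source names and ratings in the usage below are illustrative assumptions, not values taken from a particular system.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag names mirror the kernel's; the numeric values are made up. */
#define CS_MUST_VERIFY    0x1
#define CS_IS_CONTINUOUS  0x2
#define CS_VALID_FOR_HRES 0x4
#define CS_WATCHDOG       0x8

struct cs {
    const char *name;
    int rating;
    unsigned flags;
    bool watched;   /* stands in for membership of watchdog_list */
};

static struct cs *watchdog;   /* best-rated non-MUST_VERIFY source */

/* Simplified, single-threaded copy of the enqueue decisions. */
static void enqueue_watchdog(struct cs *cs, struct cs **all, size_t n)
{
    if (cs->flags & CS_MUST_VERIFY) {
        cs->watched = true;                 /* list_add(..., &watchdog_list) */
        cs->flags &= ~CS_WATCHDOG;
    } else {
        if (cs->flags & CS_IS_CONTINUOUS)
            cs->flags |= CS_VALID_FOR_HRES;
        if (!watchdog || cs->rating > watchdog->rating) {
            watchdog = cs;
            /* clocksource_reset_watchdog(): clear WATCHDOG on watched sources */
            for (size_t i = 0; i < n; i++)
                if (all[i]->watched)
                    all[i]->flags &= ~CS_WATCHDOG;
        }
    }
}
```

A MUST_VERIFY source such as tsc only joins the watched list, each better-rated candidate replaces the watchdog and forces the watched sources to re-initialize, and a lower-rated candidate changes nothing.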

The following figure shows a clock source with the MUST_VERIFY flag being added to the watchdog list, a timer being started with a 0.5-second expiry to check its stability, and the watchdog then running.

clocksource_reset_watchdog()

kernel/time/clocksource.c

static inline void clocksource_reset_watchdog(void)
{
        struct clocksource *cs;

        list_for_each_entry(cs, &watchdog_list, wd_list)
                cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
}

Clears only the WATCHDOG bit in the flags of every clock source on the watchdog list.

clocksource_start_watchdog()

kernel/time/clocksource.c

static inline void clocksource_start_watchdog(void)
{
        if (watchdog_running || !watchdog || list_empty(&watchdog_list))
                return;
        init_timer(&watchdog_timer);
        watchdog_timer.function = clocksource_watchdog;
        watchdog_timer.expires = jiffies + WATCHDOG_INTERVAL;
        add_timer_on(&watchdog_timer, cpumask_first(cpu_online_mask));
        watchdog_running = 1;
}

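Arms the clock source watchdog's lowres timer with a 0.5-second expiry (WATCHDOG_INTERVAL) on the first online CPU.

If the timer is already running, there is no watchdog clock source, or the watchdog list is empty, it returns without doing anything.

Clock source watchdog handler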

clocksource_watchdog()

kernel/time/clocksource.c

static void clocksource_watchdog(unsigned long data)
{
        struct clocksource *cs;
        cycle_t csnow, wdnow, delta;
        int64_t wd_nsec, cs_nsec;
        int next_cpu, reset_pending;

        spin_lock(&watchdog_lock);
        if (!watchdog_running)
                goto out;

        reset_pending = atomic_read(&watchdog_reset_pending);

        list_for_each_entry(cs, &watchdog_list, wd_list) {

                /* Clocksource already marked unstable? */
                if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
                        if (finished_booting)
                                schedule_work(&watchdog_work);
                        continue;
                }

                local_irq_disable();
                csnow = cs->read(cs);
                wdnow = watchdog->read(watchdog);
                local_irq_enable();

                /* Clocksource initialized ? */
                if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
                    atomic_read(&watchdog_reset_pending)) {
                        cs->flags |= CLOCK_SOURCE_WATCHDOG;
                        cs->wd_last = wdnow;
                        cs->cs_last = csnow;
                        continue;
                }

                delta = clocksource_delta(wdnow, cs->wd_last, watchdog->mask);
                wd_nsec = clocksource_cyc2ns(delta, watchdog->mult,
                                             watchdog->shift);

                delta = clocksource_delta(csnow, cs->cs_last, cs->mask);
                cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift);
                cs->cs_last = csnow;
                cs->wd_last = wdnow;

                if (atomic_read(&watchdog_reset_pending))
                        continue;

                /* Check the deviation from the watchdog clocksource. */
                if ((abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD)) {
                        clocksource_unstable(cs, cs_nsec - wd_nsec);
                        continue;
                }

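The watchdog event handler. For every clock source on the watchdog list, it compares the elapsed time measured by that clock source with the elapsed time measured by the watchdog clock source; if the difference exceeds the threshold (0.0625 s), it marks the clock source unstable and leaves further handling to the watchdog thread.

Code lines 9~10: if the watchdog is not running, stop processing and return.
- This covers a timer event that arrives after the watchdog timer has been stopped.
Code line 12: read and keep the current value of the watchdog reset-pending counter.
Code lines 14~21: if the clock source is already marked unstable, skip to the next clock source. If booting has finished, schedule the watchdog work, which starts the watchdog thread.
Code lines 23~26: read the current clock source's counter and the watchdog clock source's counter with local interrupts disabled.
Code lines 29~35: if the clock source does not have the WATCHDOG flag yet, or a watchdog reset is pending, set the WATCHDOG flag, save the watchdog and clock source counter values just read in wd_last and cs_last, and skip to the next clock source.
Code lines 37~39: for the watchdog clock source, compute the cycle difference between the saved counter value and the value just read into delta, and the elapsed time into wd_nsec.
Code lines 41~42: do the same for the current clock source, storing its elapsed time in cs_nsec.
Code lines 43~44: save the values just read as the last-read counter values (cs_last and wd_last) of the current clock source.
Code lines 46~47: if a watchdog reset is pending, skip to the next clock source.
Code lines 50~53: if the difference between the elapsed time of the current clock source and that of the watchdog clock source exceeds the watchdog threshold (WATCHDOG_THRESHOLD, 0.0625 s), mark the current clock source unstable and skip to the next clock source.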

                if (!(cs->flags & CLOCK_SOURCE_VALID_FOR_HRES) &&
                    (cs->flags & CLOCK_SOURCE_IS_CONTINUOUS) &&
                    (watchdog->flags & CLOCK_SOURCE_IS_CONTINUOUS)) {
                        /* Mark it valid for high-res. */
                        cs->flags |= CLOCK_SOURCE_VALID_FOR_HRES;

                        /*
                         * clocksource_done_booting() will sort it if
                         * finished_booting is not set yet.
                         */
                        if (!finished_booting)
                                continue;

                        /*
                         * If this is not the current clocksource let
                         * the watchdog thread reselect it. Due to the
                         * change to high res this clocksource might
                         * be preferred now. If it is the current
                         * clocksource let the tick code know about
                         * that change.
                         */
                        if (cs != curr_clocksource) {
                                cs->flags |= CLOCK_SOURCE_RESELECT;
                                schedule_work(&watchdog_work);
                        } else {
                                tick_clock_notify();
                        }
                }
        }

        /*
         * We only clear the watchdog_reset_pending, when we did a
         * full cycle through all clocksources.
         */
        if (reset_pending)
                atomic_dec(&watchdog_reset_pending);

        /*
         * Cycle through CPUs to check if the CPUs stay synchronized
         * to each other.
         */
        next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
        if (next_cpu >= nr_cpu_ids)
                next_cpu = cpumask_first(cpu_online_mask);
        watchdog_timer.expires += WATCHDOG_INTERVAL;
        add_timer_on(&watchdog_timer, next_cpu);
out:
        spin_unlock(&watchdog_lock);
}

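Code lines 1~5: if both the current clock source and the watchdog clock source have the CONTINUOUS flag and the current clock source does not have VALID_FOR_HRES yet, set that flag.
Code lines 11~12: if booting has not finished yet, skip to the next clock source (clocksource_done_booting() will handle it).
Code lines 22~27: if the current clock source is not curr_clocksource, add the RESELECT flag and schedule the watchdog work; otherwise asynchronously notify the tick code that the clock source changed (tick_clock_notify()).
Code lines 35~36: if a reset was already pending when the routine started, decrement watchdog_reset_pending.
Code lines 42~46: re-arm the watchdog timer on the next online CPU (wrapping around to the first one) with the watchdog interval (0.5 s).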
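The core test of the handler (code lines 37~53 of the first half) reduces to converting two cycle deltas to ns and comparing them. A sketch using the kernel's constant (WATCHDOG_THRESHOLD is NSEC_PER_SEC >> 4, i.e. 62.5 ms); `struct counter`, `elapsed_ns()` and `deviates()` are hypothetical names, and the usage assumes an imaginary 1 GHz counter so that cycles equal nanoseconds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>   /* llabs */

#define NSEC_PER_SEC       1000000000LL
#define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4)   /* 62.5 ms */

struct counter {
    uint64_t mask;
    uint32_t mult, shift;
};

/* Elapsed ns between two reads, like clocksource_delta() followed by
 * clocksource_cyc2ns(). */
static int64_t elapsed_ns(const struct counter *c, uint64_t now, uint64_t last)
{
    uint64_t delta = (now - last) & c->mask;

    return (int64_t)((delta * c->mult) >> c->shift);
}

/* True when the clock source drifted from the watchdog by more than the
 * threshold over one interval, the case that calls clocksource_unstable(). */
static bool deviates(const struct counter *cs, uint64_t cs_now, uint64_t cs_last,
                     const struct counter *wd, uint64_t wd_now, uint64_t wd_last)
{
    int64_t cs_nsec = elapsed_ns(cs, cs_now, cs_last);
    int64_t wd_nsec = elapsed_ns(wd, wd_now, wd_last);

    return llabs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD;
}
```

Over a 0.5 s interval, a 50 ms disagreement is tolerated, while a 100 ms one would get the clock source marked unstable.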

Handling unstable clock sources

clocksource_unstable()

kernel/time/clocksource.c

static void clocksource_unstable(struct clocksource *cs, int64_t delta)
{
        printk(KERN_WARNING "Clocksource %s unstable (delta = %Ld ns)\n",
               cs->name, delta);
        __clocksource_unstable(cs);
}

Prints a warning for the unstable clock source and hands it over to __clocksource_unstable(), which schedules the watchdog work to deal with it.

__clocksource_unstable()

kernel/time/clocksource.c

static void __clocksource_unstable(struct clocksource *cs)
{
        cs->flags &= ~(CLOCK_SOURCE_VALID_FOR_HRES | CLOCK_SOURCE_WATCHDOG);
        cs->flags |= CLOCK_SOURCE_UNSTABLE;
        if (finished_booting)
                schedule_work(&watchdog_work);
}

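Clears the VALID_FOR_HRES and WATCHDOG flags, sets UNSTABLE, and, once booting has finished, schedules the work item below to handle the unstable clock source.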

kernel/time/clocksource.c

static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);

A work item that creates and runs the watchdog thread.

clocksource_watchdog_work()

kernel/time/clocksource.c

static void clocksource_watchdog_work(struct work_struct *work)
{
        /*
         * If kthread_run fails the next watchdog scan over the
         * watchdog_list will find the unstable clock again.
         */
        kthread_run(clocksource_watchdog_kthread, NULL, "kwatchdog");
}

Creates and runs the watchdog thread ("kwatchdog").

Watchdog thread

clocksource_watchdog_kthread()

kernel/time/clocksource.c

static int clocksource_watchdog_kthread(void *data)
{
        mutex_lock(&clocksource_mutex);
        if (__clocksource_watchdog_kthread())
                clocksource_select();
        mutex_unlock(&clocksource_mutex);
        return 0;
}

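Moves the unstable clocks on the watchdog list back to the clock source list with their rating changed to 0; if anything changed, the clock source is selected again (clocksource_select()).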

__clocksource_watchdog_kthread()

kernel/time/clocksource.c

static int __clocksource_watchdog_kthread(void)
{
        struct clocksource *cs, *tmp;
        unsigned long flags;
        LIST_HEAD(unstable);
        int select = 0;

        spin_lock_irqsave(&watchdog_lock, flags);
        list_for_each_entry_safe(cs, tmp, &watchdog_list, wd_list) {
                if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
                        list_del_init(&cs->wd_list);
                        list_add(&cs->wd_list, &unstable);
                        select = 1;
                }
                if (cs->flags & CLOCK_SOURCE_RESELECT) {
                        cs->flags &= ~CLOCK_SOURCE_RESELECT;
                        select = 1;
                }
        }
        /* Check if the watchdog timer needs to be stopped. */
        clocksource_stop_watchdog();
        spin_unlock_irqrestore(&watchdog_lock, flags);

        /* Needs to be done outside of watchdog lock */
        list_for_each_entry_safe(cs, tmp, &unstable, wd_list) {
                list_del_init(&cs->wd_list); 
                __clocksource_change_rating(cs, 0);
        }
        return select;
}

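Changes the rating of the unstable clock sources on the watchdog list to 0 and re-queues them on the clock source list.

Code lines 8~14: move unstable clock sources from the watchdog list to the temporary unstable list.
Code lines 15~18: for clocks with the RESELECT flag, only clear the flag (a reselect is still requested through select).
Code line 21: stop the watchdog timer if it is no longer needed (no watchdog clock source, or an empty watchdog list).
Code lines 25~28: for each clock source on the unstable list, change its rating to 0 and put it back on the clock source list.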
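A single-threaded sketch of what the watchdog thread does with the flags, followed by a stand-in for clocksource_select() that simply picks the highest rating. The real clocksource_select() also honors a user override and other constraints, and the kernel unlinks unstable sources from watchdog_list; both are omitted here. Names, flag values and ratings are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CS_UNSTABLE 0x1
#define CS_RESELECT 0x2

struct cs {
    const char *name;
    int rating;
    unsigned flags;
};

/* Mirrors __clocksource_watchdog_kthread(): unstable sources are demoted
 * to rating 0 and RESELECT is cleared; returns whether to reselect. */
static bool watchdog_kthread_pass(struct cs *list, size_t n)
{
    bool select = false;

    for (size_t i = 0; i < n; i++) {
        if (list[i].flags & CS_UNSTABLE) {
            list[i].rating = 0;          /* __clocksource_change_rating(cs, 0) */
            select = true;
        }
        if (list[i].flags & CS_RESELECT) {
            list[i].flags &= ~CS_RESELECT;
            select = true;
        }
    }
    return select;
}

/* Simplified stand-in for clocksource_select(): highest rating wins. */
static const struct cs *select_best(const struct cs *list, size_t n)
{
    const struct cs *best = NULL;

    for (size_t i = 0; i < n; i++)
        if (!best || list[i].rating > best->rating)
            best = &list[i];
    return best;
}
```

Once the unstable tsc is demoted to rating 0, the next selection falls back to the best remaining source.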

__clocksource_change_rating()

kernel/time/clocksource.c

static void __clocksource_change_rating(struct clocksource *cs, int rating)
{
        list_del(&cs->list);
        cs->rating = rating;
        clocksource_enqueue(cs);
}

Changes the rating of the given clock source and re-adds it to the clock source list.

Selecting a clock source when booting completes

clocksource_done_booting()

kernel/time/clocksource.c

/*
 * clocksource_done_booting - Called near the end of core bootup
 *
 * Hack to avoid lots of clocksource churn at boot time.
 * We use fs_initcall because we want this to start before
 * device_initcall but after subsys_initcall.
 */
static int __init clocksource_done_booting(void)
{
        mutex_lock(&clocksource_mutex);
        curr_clocksource = clocksource_default_clock();
        finished_booting = 1;
        /*
         * Run the watchdog first to eliminate unstable clock sources
         */
        __clocksource_watchdog_kthread();
        clocksource_select();
        mutex_unlock(&clocksource_mutex);
        return 0;
}
fs_initcall(clocksource_done_booting);

Removes unstable clock sources and selects the best clock source.

References

Timer -1- (Lowres Timer) | 문c
Timer -2- (HRTimer) | 문c
Timer -3- (Clock Sources Subsystem) | 문c
Timer -4- (Clock Sources Watchdog) | 문c (this post)
Timer -5- (Clock Events Subsystem) | 문c
Timer -6- (Clock Source & Timer Driver) | 문c
Timer -7- (Sched Clock & Delay Timers) | 문c
Timer -8- (Timecounter) | 문c
Timer -9- (Tick Device) | 문c
Timer -10- (Timekeeping) | 문c
Timer -11- (Posix Clock & Timers) | 문c
time_init() | 문c
sched_clock_postinit() | 문c
tick_init() | 문c
timekeeping_init() | 문c
calibrate_delay() | 문c

Timer -7- (Sched Clock & Delay Timers)

2017-03-10 (updated 2020-02-12), 문영일

Sched Clock

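sched_clock provides a nanosecond counter used for time calculations; a high-precision counter provided by the clock source subsystem is registered as sched_clock.

sched_clock, which was limited to 32 bits and kept updated by a regular timer, was extended to support counters wider than 32 bits and switched to an hrtimer. (kernel v3.13-rc1)
- Reference: sched_clock: Add support for >32 bit sched_clock
- Reference: sched_clock: Use an hrtimer instead of timer
arm and arm64 systems using the architected timer
- Until a sched_clock based on the 56-bit architected timer is registered, a function based on the jiffies value (updated by the regular timer tick) is used.
- They use the CONFIG_GENERIC_SCHED_CLOCK kernel option.
The registered sched clock value (ns) can be read through the sched_clock() API.

The following figure shows the sched clock switching from the jiffies clock counter to one based on the 56-bit architected counter once that counter is registered.

Sched clock initialization

sched_clock_init()

On arm and arm64 the CONFIG_HAVE_UNSTABLE_SCHED_CLOCK kernel option is not used, so the version of the function built without that option is analyzed here.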

kernel/sched/clock.c

void __init sched_clock_init(void)
{
        static_branch_inc(&sched_clock_running);
        local_irq_disable();
        generic_sched_clock_init();
        local_irq_enable();
}

Marks sched_clock as running and then performs the generic sched clock initialization with IRQs disabled.

generic_sched_clock_init()

kernel/time/sched_clock.c

void __init generic_sched_clock_init(void)
{
        /*
         * If no sched_clock() function has been provided at that point,
         * make it the final one one.
         */
        if (cd.actual_read_sched_clock == jiffy_sched_clock_read)
                sched_clock_register(jiffy_sched_clock_read, BITS_PER_LONG, HZ);

        update_sched_clock();

        /*
         * Start the timer to keep sched_clock() properly updated and
         * sets the initial epoch.
         */
        hrtimer_init(&sched_clock_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
        sched_clock_timer.function = sched_clock_poll;
        hrtimer_start(&sched_clock_timer, cd.wrap_kt, HRTIMER_MODE_REL);
}

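Initializes sched_clock.

Code lines 7~8: if no high-precision hardware-based sched clock has been registered and the read function is still the jiffy-based one, register jiffies as the sched clock (BITS_PER_LONG bits at HZ).
Code line 10: update the sched clock.
Code lines 16~18: start an hrtimer that calls sched_clock_poll() every cd.wrap_kt (roughly an hour with the 56-bit architected counter); that function updates sched_clock.

Initial sched clock value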

kernel/time/sched_clock.c

static struct clock_data cd ____cacheline_aligned = {
        .read_data[0] = { .mult = NSEC_PER_SEC / HZ,
                          .read_sched_clock = jiffy_sched_clock_read, },
        .actual_read_sched_clock = jiffy_sched_clock_read,
};

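If no sched clock has been registered, the jiffies hook function above is used.

Early in kernel boot jiffy_sched_clock_read() is used, but on arm and arm64, once the generic architected timer is ready, the following 56-bit counter based function is used instead.
- e.g. arch_counter_get_cntvct()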

jiffy_sched_clock_read()

kernel/time/sched_clock.c

static u64 notrace jiffy_sched_clock_read(void)
{
        /*
         * We don't need to use get_jiffies_64 on 32-bit arches here
         * because we register with BITS_PER_LONG
         */
        return (u64)(jiffies - INITIAL_JIFFIES);
}

Early in kernel boot, jiffy_sched_clock_read() is used; it returns the number of jiffies elapsed since INITIAL_JIFFIES.

sched_clock_poll()

kernel/time/sched_clock.c

static enum hrtimer_restart sched_clock_poll(struct hrtimer *hrt)
{
        update_sched_clock();
        hrtimer_forward_now(hrt, cd.wrap_kt);

        return HRTIMER_RESTART;
}

Updates the sched clock and re-arms the hrtimer with hrtimer_forward_now() (roughly every hour).

Registering the sched clock

sched_clock_register()

kernel/time/sched_clock.c

void __init
sched_clock_register(u64 (*read)(void), int bits, unsigned long rate)
{
        u64 res, wrap, new_mask, new_epoch, cyc, ns;
        u32 new_mult, new_shift;
        unsigned long r;
        char r_unit;
        struct clock_read_data rd;

        if (cd.rate > rate)
                return;

        WARN_ON(!irqs_disabled());

        /* Calculate the mult/shift to convert counter ticks to ns. */
        clocks_calc_mult_shift(&new_mult, &new_shift, rate, NSEC_PER_SEC, 3600);

        new_mask = CLOCKSOURCE_MASK(bits);
        cd.rate = rate;

        /* Calculate how many nanosecs until we risk wrapping */
        wrap = clocks_calc_max_nsecs(new_mult, new_shift, 0, new_mask, NULL);
        cd.wrap_kt = ns_to_ktime(wrap);

        rd = cd.read_data[0];

        /* Update epoch for new counter and update 'epoch_ns' from old counter*/
        new_epoch = read();
        cyc = cd.actual_read_sched_clock();
        ns = rd.epoch_ns + cyc_to_ns((cyc - rd.epoch_cyc) & rd.sched_clock_mask, rd.mult, rd.shift);
        cd.actual_read_sched_clock = read;

        rd.read_sched_clock     = read;
        rd.sched_clock_mask     = new_mask;
        rd.mult                 = new_mult;
        rd.shift                = new_shift;
        rd.epoch_cyc            = new_epoch;
        rd.epoch_ns             = ns;

        update_clock_read_data(&rd);

        if (sched_clock_timer.function != NULL) {
                /* update timeout for clock wrap */
                hrtimer_start(&sched_clock_timer, cd.wrap_kt, HRTIMER_MODE_REL);
        }

        r = rate;
        if (r >= 4000000) {
                r /= 1000000;
                r_unit = 'M';
        } else {
                if (r >= 1000) {
                        r /= 1000;
                        r_unit = 'k';
                } else {
                        r_unit = ' ';
                }
        }

        /* Calculate the ns resolution of this counter */
        res = cyc_to_ns(1ULL, new_mult, new_shift);

        pr_info("sched_clock: %u bits at %lu%cHz, resolution %lluns, wraps every %lluns\n",
                bits, r, r_unit, res, wrap);

        /* Enable IRQ time accounting if we have a fast enough sched_clock() */
        if (irqtime > 0 || (irqtime == -1 && rate >= 1000000))
                enable_sched_clock_irqtime();

        pr_debug("Registered %pS as sched_clock source\n", read);
}

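Registers a clock source's counter read function as sched_clock.

Code lines 10~11: if the rate of the already registered sched_clock is higher than the requested @rate, return without doing anything.
- When several sched clocks are registered, the one with the highest rate is used.
Code line 16: compute the mult/shift needed to convert the requested clock frequency to ns, chosen so that the conversion stays valid over a range of 3600 seconds.
- rpi4 example: rate=54M -> mult=0x250_97b4, shift=21
- rpi2 example: rate=19.2M -> mult=0x682_aaab, shift=21
Code line 18: compute the mask from the requested number of bits.
- rpi2 & rpi4 example: bits=56 -> new_mask = 0xff_ffff_ffff_ffff
Code lines 22~23: compute the wrap time, convert it to ktime, and store it in cd.wrap_kt.
- clocks_calc_max_nsecs() returns 50% of the time until the counter can actually wrap.
- rpi4 example: rate=54MHz -> wrap=4,398,046,511,102 (about 73 minutes), wrap_kt=3,131,746,996,224 (about 52 minutes)
Code lines 28~29: read the requested new clock counter into new_epoch and the existing clock counter into cyc.
Code line 30: add the ns elapsed on the existing counter since its epoch to epoch_ns and store the result in ns.
- On the first sched_clock registration ns is always 0: the initial cd.read_data[0] shown above leaves sched_clock_mask and epoch_ns at 0, so the delta term is masked away.
- When a clock source with a higher rate later replaces the sched_clock, the ns elapsed up to that point is carried over.
Code line 31: set the new counter read function that the sched clock will use.
- rpi2 & rpi4 example: arch_counter_get_cntvct()
Code lines 33~40: fill a clock_read_data structure with the new values and apply it to the sched clock.
Code lines 42~45: if sched_clock_timer has already been set up, restart it with the wrap_kt period (about an hour).
- rpi4 example: about every 73 minutes
Code lines 47~58: compute r and r_unit from rate for printing (M when rate is 4 MHz or more, k when it is at least 1 kHz, otherwise no unit).
- rpi4 example: rate=54000000 -> r=54, r_unit='M'
- rpi2 example: rate=19200000 -> r=19, r_unit='M'
Code line 61: compute the ns corresponding to one cycle and store it in res.
Code lines 63~64: print information about the sched_clock.
- rpi4 example: "sched_clock: 56 bits at 54MHz, resolution 18ns, wraps every 4398046511102ns"
- rpi2 example: "sched_clock: 56 bits at 19MHz, resolution 52ns, wraps every 3579139424256ns"
Code lines 67~68: if irqtime is greater than 0, or irqtime is -1 and the rate of the sched_clock being registered is at least 1 MHz, call enable_sched_clock_irqtime(), which sets the global variable sched_clock_irqtime to 1 so that IRQ time accounting can be performed.
- The default value of irqtime is -1.
- IRQ time accounting works only in kernels built with the IRQ_TIME_ACCOUNTING kernel option and without NO_HZ_FULL.
Code line 70: print the name of the counter function registered as the sched clock.
- rpi4 example: "Registered arch_counter_get_cntvct+0x0/0x10 as sched_clock source"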
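The mult/shift values quoted for rpi2 and rpi4 can be reproduced with a user-space port of clocks_calc_mult_shift(), called the way sched_clock_register() calls it (to = NSEC_PER_SEC, maxsec = 3600). This is a sketch written from the kernel function's logic with `stdint.h` types and plain division in place of do_div(); `calc_mult_shift()` and `cyc_to_ns()` are local names.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000u

/*
 * User-space port of clocks_calc_mult_shift(): find mult/shift so that
 * (cycles * mult) >> shift converts ticks of a `from` Hz clock into ticks
 * of a `to` Hz clock (ns when to = NSEC_PER_SEC), without 64-bit overflow
 * for up to `maxsec` seconds worth of cycles.
 */
static void calc_mult_shift(uint32_t *mult, uint32_t *shift,
                            uint32_t from, uint32_t to, uint32_t maxsec)
{
    uint64_t tmp;
    uint32_t sft, sftacc = 32;

    /* Each bit that maxsec * from needs beyond 32 bits is one bit
     * fewer that mult may use (sftacc). */
    tmp = ((uint64_t)maxsec * from) >> 32;
    while (tmp) {
        tmp >>= 1;
        sftacc--;
    }

    /* Try the largest shift first; stop at the first one whose
     * rounded mult fits in sftacc bits. */
    for (sft = 32; sft > 0; sft--) {
        tmp = ((uint64_t)to << sft) + from / 2;
        tmp /= from;                    /* do_div() in the kernel */
        if ((tmp >> sftacc) == 0)
            break;
    }
    *mult = (uint32_t)tmp;
    *shift = sft;
}

/* Same formula as the cyc_to_ns() used by sched_clock. */
static uint64_t cyc_to_ns(uint64_t cyc, uint32_t mult, uint32_t shift)
{
    return (cyc * mult) >> shift;
}
```

Both rates end up with shift=21: it is the largest shift whose mult still fits in sftacc bits, the limit that keeps 3600 s worth of cycles multiplied by mult within 64 bits. One cycle then converts to 18 ns (54 MHz) and 52 ns (19.2 MHz), the resolutions printed above.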

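The following figure shows the 56-bit architected counter used by the rpi4 system registered as the sched clock.

Updating and reading the sched clock

So that the sched clock can be read quickly, even from an NMI handler, without deadlocking, it uses a lock-less implementation based on a sequence counter and manages its data with two clock_read_data structures, as follows.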

struct clock_read_data read_data[2];
Reference: timers, sched/clock: Avoid deadlock during read from NMI (2015, v4.1-rc1)

The following figure shows how the two clock data copies are used.

Updating the sched clock

update_sched_clock()

kernel/time/sched_clock.c

/*
 * Atomically update the sched_clock() epoch.
 */

static void update_sched_clock(void)
{
        u64 cyc;
        u64 ns;
        struct clock_read_data rd;

        rd = cd.read_data[0];

        cyc = cd.actual_read_sched_clock();
        ns = rd.epoch_ns + cyc_to_ns((cyc - rd.epoch_cyc) & rd.sched_clock_mask, rd.mult, rd.shift);

        rd.epoch_ns = ns;
        rd.epoch_cyc = cyc;

        update_clock_read_data(&rd);
}

Reads the current counter and advances the sched clock epoch (epoch_ns, epoch_cyc) to it.
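The epoch arithmetic of update_sched_clock() in isolation: a sketch with a hypothetical reduced `clock_read_data_lite` struct and the rpi4 factors quoted above (mult=0x250_97b4, shift=21). Folding the elapsed time into epoch_ns/epoch_cyc keeps (cyc - epoch_cyc) small; without an update, (cyc - epoch_cyc) * mult would overflow 64 bits after about 2^64 / mult ≈ 4.75e11 cycles, roughly 8,796 s at 54 MHz, which is twice the wrap value printed for rpi4.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of struct clock_read_data. */
struct clock_read_data_lite {
    uint64_t epoch_ns;
    uint64_t epoch_cyc;
    uint64_t sched_clock_mask;
    uint32_t mult;
    uint32_t shift;
};

static uint64_t cyc_to_ns(uint64_t cyc, uint32_t mult, uint32_t shift)
{
    return (cyc * mult) >> shift;
}

/* The ns value sched_clock() would return for counter value cyc. */
static uint64_t rd_to_ns(const struct clock_read_data_lite *rd, uint64_t cyc)
{
    return rd->epoch_ns +
           cyc_to_ns((cyc - rd->epoch_cyc) & rd->sched_clock_mask,
                     rd->mult, rd->shift);
}

/* Same steps as update_sched_clock(): fold elapsed time into the epoch. */
static void update_epoch(struct clock_read_data_lite *rd, uint64_t cyc)
{
    rd->epoch_ns = rd_to_ns(rd, cyc);
    rd->epoch_cyc = cyc;
}
```

54,000,000 cycles (one second at 54 MHz) convert to 999,999,996 ns; after the epoch update, the next second is measured from the new epoch and adds the same amount.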

update_clock_read_data()

kernel/time/sched_clock.c

/*
 * Updating the data required to read the clock.
 *
 * sched_clock() will never observe mis-matched data even if called from
 * an NMI. We do this by maintaining an odd/even copy of the data and
 * steering sched_clock() to one or the other using a sequence counter.
 * In order to preserve the data cache profile of sched_clock() as much
 * as possible the system reverts back to the even copy when the update
 * completes; the odd copy is used *only* during an update.
 */

static void update_clock_read_data(struct clock_read_data *rd)
{
        /* update the backup (odd) copy with the new data */
        cd.read_data[1] = *rd;

        /* steer readers towards the odd copy */
        raw_write_seqcount_latch(&cd.seq);

        /* now its safe for us to update the normal (even) copy */
        cd.read_data[0] = *rd;

        /* switch readers back to the even copy */
        raw_write_seqcount_latch(&cd.seq);
}

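Updates the sched clock's two copies (odd and even) of the clock data with @rd.

It first updates read_data[1], then increments the sequence so that it becomes odd (steering readers to read_data[1]), updates read_data[0], and finally increments the sequence again so that it becomes even (steering readers back to read_data[0]).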
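Why the two copies help can be shown with a single-threaded sketch in which a hook runs between the write steps, standing in for an NMI that calls sched_clock() at an awkward moment. Everything here (`struct rd_pair`, `nmi_hook`, and the epoch_ns == epoch_cyc * 10 invariant used to detect torn reads) is made up for the demonstration, and it omits the memory barriers that raw_write_seqcount_latch() provides on real SMP hardware.

```c
#include <assert.h>
#include <stdint.h>

struct rd_pair {            /* two fields that must be read consistently */
    uint64_t epoch_ns;
    uint64_t epoch_cyc;
};

static struct rd_pair read_data[2];
static unsigned seq;        /* even: readers use [0]; odd: readers use [1] */
static void (*nmi_hook)(void);   /* simulated NMI reader */

static void maybe_nmi(void)
{
    if (nmi_hook)
        nmi_hook();
}

static struct rd_pair reader_snapshot(void)
{
    return read_data[seq & 1];
}

/* Field-by-field write so that a hook can observe a half-written slot. */
static void write_slot(int idx, const struct rd_pair *rd)
{
    read_data[idx].epoch_ns = rd->epoch_ns;
    maybe_nmi();
    read_data[idx].epoch_cyc = rd->epoch_cyc;
    maybe_nmi();
}

/* Sketch of update_clock_read_data(). */
static void update_read_data(const struct rd_pair *rd)
{
    write_slot(1, rd);      /* readers still use read_data[0] (seq even) */
    seq++;                  /* odd: steer readers to read_data[1] */
    maybe_nmi();
    write_slot(0, rd);      /* readers now use read_data[1] */
    seq++;                  /* even: back to read_data[0] */
    maybe_nmi();
}

/* For contrast: overwrite the copy readers are using, in place. */
static void naive_update(const struct rd_pair *rd)
{
    write_slot(0, rd);
    write_slot(1, rd);
}

/* Test helper: count snapshots that mix old and new fields. */
static int torn_reads;
static void check_consistent(void)
{
    struct rd_pair s = reader_snapshot();

    if (s.epoch_ns != s.epoch_cyc * 10)
        torn_reads++;
}
```

With the latch sequence no snapshot taken by the hook is ever half-updated, while writing the in-use copy in place lets the hook observe a torn pair.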

Reading the sched clock

sched_clock()

kernel/time/sched_clock.c

unsigned long long notrace sched_clock(void)
{
        u64 cyc, res;
        unsigned int seq;
        struct clock_read_data *rd;

        do {
                seq = raw_read_seqcount(&cd.seq);
                rd = cd.read_data + (seq & 1);

                cyc = (rd->read_sched_clock() - rd->epoch_cyc) &
                      rd->sched_clock_mask;
                res = rd->epoch_ns + cyc_to_ns(cyc, rd->mult, rd->shift);
        } while (read_seqcount_retry(&cd.seq, seq));

        return res;
}

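Reads and returns the sched clock.

If the sequence is even, read_data[1] may be under update, so the clock data in read_data[0] is used.
If the sequence is odd, read_data[0] is being updated, so the clock data in read_data[1] is used.
If the sequence has changed by the time the read finishes, read_seqcount_retry() repeats the loop.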
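The reader side can be modeled the same way. In this hypothetical sketch, writer_first_half() stands in for a writer step that runs between the reader loading the sequence and checking it again. The changed sequence forces one retry, and the second attempt reads the odd copy. latch_read() and writer_first_half() are illustrative names, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct data { uint64_t epoch_ns; };
struct latch { unsigned int seq; struct data copy[2]; };

/* Hypothetical writer step: first half of update_clock_read_data() */
static void writer_first_half(struct latch *l)
{
	l->copy[1].epoch_ns = l->copy[0].epoch_ns + 100; /* new data into the odd copy */
	l->seq++;                                        /* seq odd: readers use copy[1] */
}

/* Model of the sched_clock() read loop; 'interfere' (may be NULL) runs once,
 * during the first attempt, to simulate a concurrent update */
static uint64_t latch_read(struct latch *l, void (*interfere)(struct latch *),
			   int *tries)
{
	unsigned int seq;
	uint64_t val;

	assert(tries != NULL);
	*tries = 0;
	do {
		seq = l->seq;                     /* raw_read_seqcount() */
		val = l->copy[seq & 1].epoch_ns;  /* pick the copy by parity */
		if (++*tries == 1 && interfere)
			interfere(l);
	} while (l->seq != seq);              /* read_seqcount_retry() */
	return val;
}
```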

Initializing the sched clock suspend/resume handlers

The following figure shows how the handlers that switch the sched clock on suspend/resume are initialized.

sched_clock_syscore_init()

kernel/time/sched_clock.c

static int __init sched_clock_syscore_init(void)
{
        register_syscore_ops(&sched_clock_ops);

        return 0;
}
device_initcall(sched_clock_syscore_init);

Registers sched_clock_ops so that the sched clock handlers are called on system suspend/resume.

sched_clock_ops

kernel/time/sched_clock.c

static struct syscore_ops sched_clock_ops = {
        .suspend        = sched_clock_suspend,
        .resume         = sched_clock_resume,
};

sched_clock_suspend()

kernel/time/sched_clock.c

int sched_clock_suspend(void)
{
        struct clock_read_data *rd = &cd.read_data[0];

        update_sched_clock();
        hrtimer_cancel(&sched_clock_timer);
        rd->read_sched_clock = suspended_sched_clock_read;

        return 0;
}

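Called on suspend to change how the sched clock is read.

Code line 5 brings the sched clock epoch up to date with update_sched_clock().
Code line 6 cancels sched_clock_timer, which normally fires about once an hour to refresh the epoch before the counter wraps.
Code line 7 replaces the read_sched_clock hook with suspended_sched_clock_read(), so sched_clock() returns the saved epoch_cyc instead of reading the hardware counter.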

sched_clock_resume()

kernel/time/sched_clock.c

void sched_clock_resume(void)
{
        struct clock_read_data *rd = &cd.read_data[0];

        rd->epoch_cyc = cd.actual_read_sched_clock();
        hrtimer_start(&sched_clock_timer, cd.wrap_kt, HRTIMER_MODE_REL);
        rd->read_sched_clock = cd.actual_read_sched_clock;
}

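Called on resume to restore normal operation of the sched clock.

Code line 5 sets epoch_cyc to the current hardware counter value. epoch_ns is left unchanged, so the time spent suspended is not added to sched_clock().
Code line 6 restarts sched_clock_timer (period cd.wrap_kt, about an hour).
Code line 7 restores the hook so that sched_clock() reads the actual hardware counter again.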

suspended_sched_clock_read()

kernel/time/sched_clock.c

/*
 * Clock read function for use when the clock is suspended.
 *
 * This function makes it appear to sched_clock() as if the clock
 * stopped counting at its last update.
 *
 * This function must only be called from the critical
 * section in sched_clock(). It relies on the read_seqcount_retry()
 * at the end of the critical section to be sure we observe the
 * correct copy of 'epoch_cyc'.
 */

static u64 notrace suspended_sched_clock_read(void)
{
        unsigned int seq = raw_read_seqcount(&cd.seq);

        return cd.read_data[seq & 1].epoch_cyc;
}

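Returns the counter value to use while suspended. This is the epoch_cyc of the copy selected by the current sequence, so sched_clock() appears to have stopped at its last update.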
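The effect of the suspend/resume hooks on the returned time can be shown with a small model. It rests on several assumptions. One cycle equals one nanosecond, so mult/shift are omitted. There is a single copy of the clock data, and hw_counter stands in for the hardware counter. model_suspend(), model_resume() and model_sched_clock() are hypothetical names, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the hardware counter (1 cycle == 1 ns here) */
static uint64_t hw_counter;

struct clk {
	uint64_t epoch_ns, epoch_cyc;
	uint64_t (*read)(const struct clk *);   /* like rd->read_sched_clock */
};

static uint64_t read_hw(const struct clk *c) { (void)c; return hw_counter; }

/* Like suspended_sched_clock_read(): the clock appears stopped at epoch_cyc */
static uint64_t read_suspended(const struct clk *c) { return c->epoch_cyc; }

static uint64_t model_sched_clock(const struct clk *c)
{
	return c->epoch_ns + (c->read(c) - c->epoch_cyc);
}

static void model_suspend(struct clk *c)
{
	/* update_sched_clock(): fold the elapsed cycles into the epoch */
	c->epoch_ns += hw_counter - c->epoch_cyc;
	c->epoch_cyc = hw_counter;
	c->read = read_suspended;
}

static void model_resume(struct clk *c)
{
	c->epoch_cyc = hw_counter;   /* epoch_ns kept: suspended time is not counted */
	c->read = read_hw;
}
```

In this model, the counter moves from 1000 to 5000 while suspended, yet the clock stays at 1000. After resume it continues from 1000.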

Delay functions – ARM64

On arm64 systems the CPU waits in a busy-wait loop that uses wfe where possible. In atomic context, the ndelay() and udelay() APIs are used. mdelay() is discouraged because it busy-waits for too long. Where possible, use msleep() in non-atomic context instead.

The following figure shows the call graph of the arm64 delay functions.

Delay in milliseconds

mdelay()

include/linux/delay.h

#ifndef mdelay
#define mdelay(n) (\
        (__builtin_constant_p(n) && (n)<=MAX_UDELAY_MS) ? udelay((n)*1000) : \
        ({unsigned long __ms=(n); while (__ms--) udelay(1000);}))
#endif

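Delays for @n milliseconds.

If @n is a constant no larger than MAX_UDELAY_MS (5) milliseconds, udelay() is called once with @n multiplied by 1000.
- i.e. for 5 ms or less, the value is converted to microseconds and udelay() is called only once
  - udelay(1000), udelay(2000), udelay(3000), udelay(4000) or udelay(5000)
Otherwise, udelay(1000) is called @n times.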
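The branch in mdelay() can be restated as a function that reports how the delay is split into udelay() calls. This is a hypothetical helper. is_const models __builtin_constant_p(n), which the real macro evaluates at compile time. MAX_UDELAY_MS is taken as its default value of 5.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_UDELAY_MS 5   /* default value quoted in the text */

/* Hypothetical: returns the number of udelay() calls mdelay(n) makes and
 * stores the argument of each call in *usecs_per_call */
static unsigned long mdelay_udelay_calls(unsigned long n, bool is_const,
					 unsigned long *usecs_per_call)
{
	assert(usecs_per_call != 0);
	if (is_const && n <= MAX_UDELAY_MS) {
		*usecs_per_call = n * 1000;   /* a single udelay(n * 1000) */
		return 1;
	}
	*usecs_per_call = 1000;           /* n calls of udelay(1000) */
	return n;
}
```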

Delay in microseconds

udelay()

include/asm-generic/delay.h

/*
 * The weird n/20000 thing suppresses a "comparison is always false due to
 * limited range of data type" warning with non-const 8-bit arguments.
 */

/* 0x10c7 is 2**32 / 1000000 (rounded up) */

#define udelay(n)                                                       \
        ({                                                              \
                if (__builtin_constant_p(n)) {                          \
                        if ((n) / 20000 >= 1)                           \
                                 __bad_udelay();                        \
                        else                                            \
                                __const_udelay((n) * 0x10c7ul);         \
                } else {                                                \
                        __udelay(n);                                    \
                }                                                       \
        })

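Delays for @n microseconds.

If @n is a constant of 20000 or more (20 ms or longer), the build fails, because __bad_udelay() is declared but never defined.
If @n is a constant below 20000 (under 20 ms), __const_udelay() is called with @n multiplied by 0x10c7.
Otherwise (@n is not a compile-time constant), __udelay() is called as is.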
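The magic constant can be checked directly. 0x10c7 is 2^32/10^6 rounded up, so usecs * 0x10c7 expresses the delay in seconds as a 32.32 fixed-point value (units of 2^-32 s). ceil_2_32_over() is a hypothetical helper written for this check. Applied to 10^9, it also gives the factor 5 that ndelay() uses.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: ceil(2^32 / divisor), the per-unit scaling factor */
static uint64_t ceil_2_32_over(uint64_t divisor)
{
	assert(divisor != 0);
	return ((1ULL << 32) + divisor - 1) / divisor;
}
```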

__udelay()

arch/arm64/lib/delay.c

void __udelay(unsigned long usecs)
{
        __const_udelay(usecs * 0x10C7UL); /* 2**32 / 1000000 (rounded up) */
}
EXPORT_SYMBOL(__udelay);

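Delays for @usecs microseconds.

Calls __const_udelay() with @usecs multiplied by 0x10c7. Because 0x10c7 is 2^32/10^6 rounded up, the product is the delay in seconds scaled by 2^32.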

Delay in loops

__const_udelay()

arch/arm64/lib/delay.c

inline void __const_udelay(unsigned long xloops)
{
        __delay(xloops_to_cycles(xloops));
}
EXPORT_SYMBOL(__const_udelay);

Delays for @xloops loop units.

Converts @xloops to cycles with xloops_to_cycles() and calls __delay() with the result.

Delay in nanoseconds

ndelay()

include/asm-generic/delay.h

/* 0x5 is 2**32 / 1000000000 (rounded up) */

#define ndelay(n)                                                       \
        ({                                                              \
                if (__builtin_constant_p(n)) {                          \
                        if ((n) / 20000 >= 1)                           \
                                __bad_ndelay();                         \
                        else                                            \
                                __const_udelay((n) * 5ul);              \
                } else {                                                \
                        __ndelay(n);                                    \
                }                                                       \
        })

#endif /* __ASM_GENERIC_DELAY_H */

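Delays for @n nanoseconds.

If @n is a constant of 20000 or more (20 us or longer), the build fails via the undefined __bad_ndelay().
If @n is a constant below 20000 (under 20 us), __const_udelay() is called with @n multiplied by 5.
- 5 loop units per nanosecond: 2^32/10^9 (about 4.29) rounded up
Otherwise, __ndelay() is called as is.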
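Rounding 2^32/10^9 (about 4.29) up to 5 is much coarser than the 0x10c7 factor of udelay(). The hypothetical helper below computes the delay actually requested, in per mille of the nominal value, for a given factor. Under this model, ndelay() asks for about 16% more time than requested, while the udelay() factor is essentially exact. The model ignores the later truncation in xloops_to_cycles().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: effective delay in per mille of the requested one when
 * a unit of 1/unit_per_sec seconds is scaled by 'factor' instead of the exact
 * 2^32 / unit_per_sec */
static uint64_t effective_permille(uint64_t factor, uint64_t unit_per_sec)
{
	assert(factor * unit_per_sec <= UINT64_MAX / 1000);
	return (factor * unit_per_sec * 1000) >> 32;
}
```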

__ndelay()

arch/arm64/lib/delay.c

void __ndelay(unsigned long nsecs)
{
        __const_udelay(nsecs * 0x5UL); /* 2**32 / 1000000000 (rounded up) */
}
EXPORT_SYMBOL(__ndelay);

Delays for @nsecs nanoseconds.

Calls __const_udelay() with @nsecs multiplied by 5.

Delay in cycles

xloops_to_cycles()

arch/arm64/lib/delay.c

static inline unsigned long xloops_to_cycles(unsigned long xloops)
{
        return (xloops * loops_per_jiffy * HZ) >> 32;
}

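Converts @xloops from loop units to cycles and returns the result. loops_per_jiffy * HZ is the number of loop units per second, which on arm64 is the arch timer frequency. @xloops is the delay in seconds scaled by 2^32, so the product shifted right by 32 is the number of cycles.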
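Putting the pieces together, the sketch below converts microseconds to counter cycles the same way __udelay() and xloops_to_cycles() do. It assumes loops_per_jiffy = timer frequency / HZ, which holds when the delay loop is calibrated from the arch timer. The example uses a 19.2 MHz counter with HZ=100 (loops_per_jiffy = 192000). usecs_to_cycles() is a hypothetical helper, and lpj/hz are passed as parameters rather than read from kernel globals.

```c
#include <assert.h>

/* Same arithmetic as arm64 xloops_to_cycles(), with lpj and hz as parameters */
static unsigned long long xloops_to_cycles(unsigned long long xloops,
					   unsigned long long lpj,
					   unsigned long long hz)
{
	return (xloops * lpj * hz) >> 32;
}

/* Hypothetical helper: usecs -> xloops (x 0x10C7, as in __udelay()) -> cycles */
static unsigned long long usecs_to_cycles(unsigned long long usecs,
					  unsigned long long lpj,
					  unsigned long long hz)
{
	assert(usecs < 20000);   /* udelay() range; keeps the product far from overflow */
	return xloops_to_cycles(usecs * 0x10C7ULL, lpj, hz);
}
```

For 10 us this gives exactly 192 cycles (10 us × 19.2 MHz). For 1 us, the >> 32 truncation gives 19 instead of 19.2.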

__delay()

arch/arm64/lib/delay.c

void __delay(unsigned long cycles)
{
        cycles_t start = get_cycles();

        if (arch_timer_evtstrm_available()) {
                const cycles_t timer_evt_period =
                        USECS_TO_CYCLES(ARCH_TIMER_EVT_STREAM_PERIOD_US);

                while ((get_cycles() - start + timer_evt_period) < cycles)
                        wfe();
        }

        while ((get_cycles() - start) < cycles)
                cpu_relax();
}
EXPORT_SYMBOL(__delay);

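Delays for @cycles cycles.

Code lines 5~11: if the arch timer event stream is available, the CPU waits with wfe and is woken by the event stream every 100 us (ARCH_TIMER_EVT_STREAM_PERIOD_US), which reduces CPU load and saves power. wfe is used only while at least one full event period remains, so a wfe wait does not overshoot the requested delay.
Code lines 13~14: the remaining cycles are spun with cpu_relax(), and the loop exits once the requested number of cycles has elapsed.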
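The two-phase loop of __delay() can be simulated with a fake counter. This is a hypothetical model, and its assumptions are simplifications. get_cycles() is replaced by a global counter, each wfe() is assumed to sleep exactly one event-stream period, and each cpu_relax() is assumed to take one cycle. With a 1920-cycle period (100 us at 19.2 MHz), a 10000-cycle delay uses 5 wfe waits and spins only for the last 400 cycles.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical simulation state: fake counter and call statistics */
static uint64_t fake_cycles;
static unsigned int wfe_count, relax_count;

static void fake_wfe(uint64_t period) { fake_cycles += period; wfe_count++; }
static void fake_relax(void)          { fake_cycles += 1; relax_count++; }

/* Model of arm64 __delay(); evtstrm models arch_timer_evtstrm_available() */
static void sim_delay(uint64_t cycles, int evtstrm, uint64_t timer_evt_period)
{
	uint64_t start = fake_cycles;

	if (evtstrm) {
		/* sleep in wfe only while a whole event period still fits */
		while ((fake_cycles - start + timer_evt_period) < cycles)
			fake_wfe(timer_evt_period);
	}
	/* spin for the remainder */
	while ((fake_cycles - start) < cycles)
		fake_relax();
}
```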

Delay functions – ARM32

On arm32 systems, a busy-wait based delay timer is used.

The following figure shows the call graph of the arm32 delay functions.

Registering the delay timer (generic timer) – ARM32

arch_timer_delay_timer_register()

arch/arm/kernel/arch_timer.c

static void __init arch_timer_delay_timer_register(void)
{
        /* Use the architected timer for the delay loop. */
        arch_delay_timer.read_current_timer = arch_timer_read_counter_long;
        arch_delay_timer.freq = arch_timer_get_rate();
        register_current_timer_delay(&arch_delay_timer);
}

Registers the generic timer built into the armv7 architecture so that it can be used as the delay timer.

The following figure shows a generic timer being registered as the delay timer on a system with HZ=100.

register_current_timer_delay() – ARM32

arch/arm/lib/delay.c

void __init register_current_timer_delay(const struct delay_timer *timer)
{
        u32 new_mult, new_shift;
        u64 res;

        clocks_calc_mult_shift(&new_mult, &new_shift, timer->freq,
                               NSEC_PER_SEC, 3600);
        res = cyc_to_ns(1ULL, new_mult, new_shift);

        if (!delay_calibrated && (!delay_res || (res < delay_res))) {
                pr_info("Switching to timer-based delay loop, resolution %lluns\n", res);
                delay_timer                     = timer;
                lpj_fine                        = timer->freq / HZ;
                delay_res                       = res;

                /* cpufreq may scale loops_per_jiffy, so keep a private copy */
                arm_delay_ops.ticks_per_jiffy   = lpj_fine;
                arm_delay_ops.delay             = __timer_delay;
                arm_delay_ops.const_udelay      = __timer_const_udelay;
                arm_delay_ops.udelay            = __timer_udelay;
        } else {
                pr_info("Ignoring duplicate/late registration of read_current_timer delay\n");
        }
}

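Registers the delay timer and sets the values used for delay-loop calibration. The first registration made before calibration is always accepted.

Code lines 6~7: compute new_mult/new_shift so that cycles can be converted to nanoseconds accurately over a range of one hour (3600 s).
Code line 8: compute the resolution res (nanoseconds per cycle).
- rpi2: HZ=100, 19.2 MHz clock -> res=52
Code lines 10~20: if calibration has not been done yet and this is either the first registration or a timer with a finer resolution, it is set up as the delay timer.
- The smaller res is, the higher the resolution of the timer.
- When several clock sources register, the highest-resolution timer is chosen as the delay timer.
- Once calibrate_delay() has completed calibration, no further delay timer registrations from clock sources are accepted.
- rpi2 example: "Switching to timer-based delay loop, resolution 52ns"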
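The rpi2 numbers can be reproduced in user space. The function below follows the kernel's clocks_calc_mult_shift(), with do_div() replaced by a plain 64-bit division. The 19.2 MHz rate and HZ=100 are the rpi2 example values from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Follows the kernel's clocks_calc_mult_shift(); do_div() replaced by '/' */
static void calc_mult_shift(uint32_t *mult, uint32_t *shift,
			    uint32_t from, uint32_t to, uint32_t maxsec)
{
	uint64_t tmp;
	uint32_t sft, sftacc = 32;

	assert(from != 0);
	/* bits left for mult so that maxsec seconds of cycles do not overflow */
	tmp = ((uint64_t)maxsec * from) >> 32;
	while (tmp) {
		tmp >>= 1;
		sftacc--;
	}

	/* largest shift (best precision) whose mult still fits in sftacc bits */
	for (sft = 32; sft > 0; sft--) {
		tmp = (uint64_t)to << sft;
		tmp += from / 2;          /* round to nearest */
		tmp /= from;
		if ((tmp >> sftacc) == 0)
			break;
	}
	*mult = (uint32_t)tmp;
	*shift = sft;
}

static uint64_t cyc_to_ns(uint64_t cyc, uint32_t mult, uint32_t shift)
{
	return (cyc * mult) >> shift;
}
```

For 19.2 MHz this yields mult=109226667 and shift=21, and one cycle converts to 52 ns. lpj_fine would be 19200000 / 100 = 192000.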

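Sleep functions

The functions below can be used in non-atomic context. Following Documentation/timers/timers-howto, udelay() is still preferred for delays under about 10 us. usleep_range() is recommended for about 10 us ~ 20 ms, and msleep() for longer delays.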

Backed by hrtimers
- usleep_range()
Backed by jiffies and legacy timers
- msleep()
- msleep_interruptible()
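These rules of thumb can be condensed into a small decision function. It is a hypothetical helper. The thresholds of about 10 us and 20 ms are the guidelines of Documentation/timers/timers-howto, not hard kernel limits, and the 1 ms split in atomic context is an illustrative choice.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result type for the suggestion */
enum wait_api { API_UDELAY, API_MDELAY, API_USLEEP_RANGE, API_MSLEEP };

/* Hypothetical helper encoding the timers-howto rules of thumb */
static enum wait_api suggest_wait_api(unsigned long usecs, bool atomic)
{
	if (atomic)                    /* cannot sleep: busy-wait only */
		return usecs < 1000 ? API_UDELAY : API_MDELAY; /* long mdelay() is discouraged */
	if (usecs < 10)
		return API_UDELAY;         /* a few us: hrtimer setup may not be worth it */
	if (usecs <= 20000)
		return API_USLEEP_RANGE;   /* about 10 us to 20 ms: hrtimer based */
	return API_MSLEEP;             /* longer: jiffies based */
}
```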

The following figure shows the call graph of the sleep functions.

Sleep in seconds

ssleep()

include/linux/delay.h

static inline void ssleep(unsigned int seconds)
{
        msleep(seconds * 1000);
}

Sleeps for @seconds seconds.

Sleep in milliseconds

msleep()

kernel/time/timer.c

/**
 * msleep - sleep safely even with waitqueue interruptions
 * @msecs: Time in milliseconds to sleep for
 */

void msleep(unsigned int msecs)
{
        unsigned long timeout = msecs_to_jiffies(msecs) + 1;

        while (timeout)
                timeout = schedule_timeout_uninterruptible(timeout);
}
EXPORT_SYMBOL(msleep);

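Sleeps for @msecs milliseconds, based on the jiffies scheduler tick. One extra jiffy is added to the converted timeout because the current tick is already partly over, so the sleep lasts at least @msecs. At HZ=100, for example, msleep(1) can sleep for up to about 20 ms.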
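The jiffies rounding explains why short msleep() calls oversleep. msecs_to_jiffies_hz() below is a simplified, hypothetical version of msecs_to_jiffies(). It only handles HZ values that divide 1000 and rounds up, as the kernel does in that case.

```c
#include <assert.h>

#define MSEC_PER_SEC 1000UL

/* Simplified msecs_to_jiffies() for HZ values that divide 1000 (rounds up) */
static unsigned long msecs_to_jiffies_hz(unsigned long m, unsigned long hz)
{
	unsigned long ms_per_tick;

	assert(hz != 0 && hz <= MSEC_PER_SEC && MSEC_PER_SEC % hz == 0);
	ms_per_tick = MSEC_PER_SEC / hz;
	return (m + ms_per_tick - 1) / ms_per_tick;
}

/* Timeout that msleep() passes to schedule_timeout_uninterruptible():
 * one extra jiffy because the current tick is already partly over */
static unsigned long msleep_timeout(unsigned long msecs, unsigned long hz)
{
	return msecs_to_jiffies_hz(msecs, hz) + 1;
}
```

At HZ=100, msleep(1) waits for 2 jiffies. That is roughly 10~20 ms, depending on where in the current tick the call starts.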

Sleep in microseconds

usleep_range()

kernel/time/timer.c

/**
 * usleep_range - Sleep for an approximate time
 * @min: Minimum time in usecs to sleep
 * @max: Maximum time in usecs to sleep
 *
 * In non-atomic context where the exact wakeup time is flexible, use
 * usleep_range() instead of udelay().  The sleep improves responsiveness
 * by avoiding the CPU-hogging busy-wait of udelay(), and the range reduces
 * power usage by allowing hrtimers to take advantage of an already-
 * scheduled interrupt instead of scheduling a new one just for this sleep.
 */

void __sched usleep_range(unsigned long min, unsigned long max)
{
        ktime_t exp = ktime_add_us(ktime_get(), min);
        u64 delta = (u64)(max - min) * NSEC_PER_USEC;

        for (;;) {
                __set_current_state(TASK_UNINTERRUPTIBLE);
                /* Do not return before the requested sleep time has elapsed */
                if (!schedule_hrtimeout_range(&exp, delta, HRTIMER_MODE_ABS))
                        break;
        }
}
EXPORT_SYMBOL(usleep_range);

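Sleeps for at least @min and at most about @max microseconds, using an hrtimer rather than the jiffies tick. The timer expires at now + @min, and @max - @min is passed as slack, which lets the wakeup be merged with an already scheduled timer interrupt. The loop repeats until schedule_hrtimeout_range() returns 0, i.e. until the requested time has elapsed.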
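The window that usleep_range() computes can be written as plain arithmetic. In this hypothetical sketch, now_ns stands in for ktime_get() and struct wake_window is an illustrative type. The result is the earliest hrtimer expiry (exp) and the latest one (exp + delta). The actual wakeup may come later still because of scheduling latency.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_USEC 1000ULL

/* Hypothetical result type: the hrtimer expiry range usleep_range() asks for */
struct wake_window {
	uint64_t earliest_ns;   /* exp = now + min */
	uint64_t latest_ns;     /* exp + delta, delta = max - min */
};

static struct wake_window usleep_range_window(uint64_t now_ns,
					      unsigned long min, unsigned long max)
{
	struct wake_window w;

	assert(max >= min);
	w.earliest_ns = now_ns + (uint64_t)min * NSEC_PER_USEC;
	w.latest_ns = w.earliest_ns + (uint64_t)(max - min) * NSEC_PER_USEC;
	return w;
}
```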

References

Timer -1- (Lowres Timer) | 문c
Timer -2- (HRTimer) | 문c
Timer -3- (Clock Sources Subsystem) | 문c
Timer -4- (Clock Sources Watchdog) | 문c
Timer -5- (Clock Events Subsystem) | 문c
Timer -6- (Clock Source & Timer Driver) | 문c
Timer -7- (Sched Clock & Delay Timers) | 문c – this article
Timer -8- (Timecounter) | 문c
Timer -9- (Tick Device) | 문c
Timer -10- (Timekeeping) | 문c
Timer -11- (Posix Clock & Timers) | 문c
time_init() | 문c
sched_clock_postinit() | 문c
tick_init() | 문c
timekeeping_init() | 문c
calibrate_delay() | 문c

delays – Information on the various kernel delay / sleep mechanisms (Documentation/timers/timers-howto) | Kernel.org
[Linux:Kernel] Delays – Information on the various kernel delay / sleep mechanisms | 다솜돌이