Timer -10- (Timekeeping)

Timekeeping

The timekeeping subsystem, which is responsible for clocks in the Linux kernel, manages four clocks. (POSIX-defined clocks number at most 16.)

monotonic
boottime
realtime
tai

The following figure shows how the timekeeping subsystem is maintained.

xtime

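xtime is the wall (real-time) clock managed by the Linux kernel. Once per tick the kernel reads the cycle counter from the clock source and folds the result into timekeeping, which is kept in nanoseconds: the number of cycles elapsed is multiplied by the nanoseconds per cycle and added to xtime. If the nanoseconds-per-cycle value were truncated to an integer before multiplying, a large error would accumulate. To raise the precision, the real number with its fractional digits is converted to a scaled integer, and that scaled value is used for the multiplication. The kernel uses binary (power-of-two) scaled integers.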

Let us look at how much error occurs.

With a 54 MHz clock, one cycle takes about 18 ns.
- 1,000,000,000 ns / 54,000,000 Hz = 18.518518518 ns
- Suppose only the integer part, 18 ns, is used.
On a 250 Hz system, 216,000 cycles elapse in one tick (4 ms).
- 54 MHz / 250 Hz = 216,000 cycles
The time added after one tick is 216,000 cycles * 18 ns = 3,888,000 ns = 3.888 ms, which is 0.112 ms short of the actual 4 ms.

To reduce this error, a real number with fractional precision is needed. Because the kernel does not use floating point, the real number is converted to a binary scaled integer. The scaled value is mult, and the number of fractional bits is shift.

Scaling 18.518518518518 to 10 binary digits of precision gives 18.518518518 * 2^10 = 18962 (0x4a12), with the fraction dropped.
- mult (0x4a12) / 2^shift (10) = 18.51757813
Scaling 18.518518518518 to 24 binary digits of precision gives 18.518518518 * 2^24 = 310,689,185 (0x1284_bda1).
- mult (0x1284_bda1) / 2^shift (24) = 18.51851851
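
The scaling above can be reproduced with plain integer arithmetic. The following is a minimal user-space sketch, not kernel code: `ns_per_cycle_to_mult()` is a hypothetical helper name, and it truncates the fraction exactly as the worked example does (the kernel's own helpers may round instead).

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Hypothetical helper (not a kernel API): scale "nanoseconds per cycle"
 * by 2^shift and truncate the fraction, as in the worked example.
 */
uint32_t ns_per_cycle_to_mult(uint64_t freq_hz, unsigned int shift)
{
        uint64_t mult;

        assert(freq_hz > 0 && shift < 32);        /* keep the shift inside 64 bits */
        mult = (NSEC_PER_SEC << shift) / freq_hz; /* integer division truncates */
        assert(mult <= UINT32_MAX);               /* mult is held as a 32-bit value */
        return (uint32_t)mult;
}
```

For a 54 MHz clock, `ns_per_cycle_to_mult(54000000, 10)` gives 0x4a12 and `ns_per_cycle_to_mult(54000000, 24)` gives 0x1284bda1, matching the values above.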

With a clock source that uses 24-bit precision, xtime is updated as follows.

xtime += (delta cycles) * mult

To convert xtime back to nanoseconds:

elapsed time (ns) = xtime >> shift

With a 24-bit mult and shift, the change in xtime after one tick, converted back to nanoseconds, reflects the 24-bit binary precision: the error shrinks to 1 ns for the tick.

xtime = 216,000 cycles * mult(0x1284_bda1) = 0x3d08_ffff_63c0
elapsed time (ns) = xtime(0x3d08_ffff_63c0) >> shift(24) = 0x3d08_ff = 3,999,999 ns
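
To compare the two approaches side by side, here is a small user-space sketch. The function names are illustrative only and do not exist in the kernel; the arithmetic is the same multiply-then-shift shown above.

```c
#include <assert.h>
#include <stdint.h>

/* Integer-only conversion: the fractional nanoseconds per cycle are lost. */
uint64_t cycles_to_ns_truncated(uint64_t cycles, uint64_t ns_per_cycle)
{
        return cycles * ns_per_cycle;
}

/* Scaled conversion: multiply by mult, then drop the fractional bits. */
uint64_t cycles_to_ns_scaled(uint64_t cycles, uint32_t mult, unsigned int shift)
{
        assert(shift < 64);
        assert(mult == 0 || cycles <= UINT64_MAX / mult); /* no 64-bit overflow */
        return (cycles * mult) >> shift;
}
```

For one 4 ms tick at 54 MHz (216,000 cycles), the truncated version returns 3,888,000 ns, while the scaled version with mult 0x1284bda1 and shift 24 returns 3,999,999 ns.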

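The following figure compares decimal scaled integers with binary scaled integers.

Every three decimal places correspond to about 10 bits in binary.

Comparison of time structure types

The xtime variable, which older kernels kept as a timespec structure, is now used only inside the kernel's timekeeping component.

The figure below compares the time structure types.

The times below are calculated in UTC. (Korean time uses the KST time zone.)
- UTC example: "1 Jan 1970 10:20:30" -> 10 hours * 3600 s + 20 minutes * 60 s + 30 s = 37,230 s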

timespec64 structure

include/linux/time64.h

struct timespec64 {
        time64_t        tv_sec;                 /* seconds */
        long            tv_nsec;                /* nanoseconds */
};

ktime_t type

include/linux/ktime.h

/* Nanosecond scalar representation for kernel time values */
typedef s64     ktime_t;

timespec structure

include/uapi/linux/time.h

struct timespec {
        __kernel_time_t tv_sec;                 /* seconds */
        long            tv_nsec;                /* nanoseconds */
};

timeval structure

include/uapi/linux/time.h

struct timeval {
        __kernel_time_t         tv_sec;         /* seconds */
        __kernel_suseconds_t    tv_usec;        /* microseconds */
};
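
The practical difference between these types is that ktime_t is a single signed nanosecond count, while timespec64 splits the same instant into seconds and nanoseconds. The sketch below illustrates the relationship with user-space stand-ins (`struct ts64`, `kt_ns`, and both function names are hypothetical); the kernel provides its own conversion helpers, such as timespec64_to_ktime(), which appears later in this document.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

struct ts64 {                   /* stand-in for struct timespec64 */
        int64_t tv_sec;
        long    tv_nsec;
};
typedef int64_t kt_ns;          /* stand-in for ktime_t (scalar nanoseconds) */

/* seconds + nanoseconds -> scalar nanoseconds */
kt_ns ts64_to_ns(struct ts64 ts)
{
        assert(ts.tv_nsec >= 0 && ts.tv_nsec < NSEC_PER_SEC);
        return ts.tv_sec * NSEC_PER_SEC + ts.tv_nsec;
}

/* scalar nanoseconds -> seconds + nanoseconds, with 0 <= tv_nsec < 1e9 */
struct ts64 ns_to_ts64(kt_ns ns)
{
        struct ts64 ts;

        ts.tv_sec = ns / NSEC_PER_SEC;
        ts.tv_nsec = (long)(ns % NSEC_PER_SEC);
        if (ts.tv_nsec < 0) {   /* C division truncates toward zero */
                ts.tv_nsec += (long)NSEC_PER_SEC;
                ts.tv_sec--;
        }
        return ts;
}
```

With the UTC example above, 37,230 s plus 500 ns becomes the scalar 37,230,000,000,500 ns. A negative value such as -1 ns converts to tv_sec = -1 and tv_nsec = 999,999,999, so only the seconds field carries the sign.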

Timekeeping APIs

Get & Set timeofday

do_sys_settimeofday64()
~~do_gettimeofday()~~
do_settimeofday64()
~~do_sys_settimeofday()~~

Kernel time accessors

get_seconds()
~~current_kernel_time()~~

timespec-based interfaces

~~getrawmonotonic64()~~
getnstimeofday()
ktime_get_ts()
ktime_get_real_ts()
getrawmonotonic()
getboottime()

timespec64-based interfaces

ktime_get_ts64()
ktime_get_raw_ts64()
ktime_get_real_ts64()
ktime_get_coarse_ts64()
ktime_get_coarse_real_ts64()
~~__getnstimeofday64()~~
~~getnstimeofday64()~~
getboottime64()
ktime_get_boottime_ts64()
ktime_get_coarse_boottime_ts64()
ktime_get_clocktai_ts64()
ktime_get_coarse_clocktai_ts64()
ktime_get_boottime_seconds()
ktime_get_clocktai_seconds()

time64_t-based interfaces

get_seconds()
ktime_get_seconds()
__ktime_get_real_seconds()
ktime_get_real_seconds()

ktime_t-based interfaces

ktime_get()
ktime_get_with_offset()
ktime_mono_to_any()
ktime_get_raw()
ktime_get_real()
ktime_get_boottime()
ktime_get_clocktai()
ktime_mono_to_real()
ktime_get_coarse_real()
ktime_get_coarse_boottime()
ktime_get_coarse_clocktai()
ktime_get_coarse()

ktime -> ns conversion

ktime_get_ns()
ktime_get_real_ns()
ktime_get_boot_ns()
ktime_get_raw_ns()
ktime_get_coarse_ns()
ktime_get_coarse_boottime_ns()
ktime_get_coarse_clocktai_ns()
ktime_get_clocktai_ns()
ktime_get_resolution_ns()
ktime_get_mono_fast_ns()
ktime_get_raw_fast_ns()
ktime_get_boot_fast_ns()
ktime_get_real_fast_ns()

Interfaces that return a ktime-based clock as a timespec

~~get_monotonic_boottime()~~
~~get_monotonic_boottime64()~~
~~timekeeping_clocktai()~~

RTC

timekeeping_inject_sleeptime64()
timekeeping_rtc_skipsuspend()
timekeeping_rtc_skipresume()

PPS

~~getnstime_raw_and_real()~~

Persistent clock interfaces

~~has_persistent_clock()~~
~~read_persistent_clock()~~
read_persistent_clock64()
~~read_boot_clock()~~
~~update_persistent_clock()~~
update_persistent_clock64()
read_persistent_wall_and_boot_offset()

timekeeping initialization

timekeeping_init()

kernel/time/timekeeping.c

/*
 * timekeeping_init - Initializes the clocksource and common timekeeping values
 */

void __init timekeeping_init(void)
{
        struct timespec64 wall_time, boot_offset, wall_to_mono;
        struct timekeeper *tk = &tk_core.timekeeper;
        struct clocksource *clock;
        unsigned long flags;

        read_persistent_wall_and_boot_offset(&wall_time, &boot_offset);
        if (timespec64_valid_settod(&wall_time) &&
            timespec64_to_ns(&wall_time) > 0) {
                persistent_clock_exists = true;
        } else if (timespec64_to_ns(&wall_time) != 0) {
                pr_warn("Persistent clock returned invalid value");
                wall_time = (struct timespec64){0};
        }

        if (timespec64_compare(&wall_time, &boot_offset) < 0)
                boot_offset = (struct timespec64){0};

        /*
         * We want set wall_to_mono, so the following is true:
         * wall time + wall_to_mono = boot time
         */
        wall_to_mono = timespec64_sub(boot_offset, wall_time);

        raw_spin_lock_irqsave(&timekeeper_lock, flags);
        write_seqcount_begin(&tk_core.seq);
        ntp_init();

        clock = clocksource_default_clock();
        if (clock->enable)
                clock->enable(clock);
        tk_setup_internals(tk, clock);

        tk_set_xtime(tk, &wall_time);
        tk->raw_sec = 0;

        tk_set_wall_to_mono(tk, wall_to_mono);

        timekeeping_update(tk, TK_MIRROR | TK_CLOCK_WAS_SET);

        write_seqcount_end(&tk_core.seq);
        raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
}

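Initializes the timekeeping subsystem. If a persistent clock exists, the Linux clock is set from it; otherwise the clock is set to 0 (1 Jan 1970 00:00:00 UTC).

In code line 8, reads the persistent clock value and the boot offset (ns).
- On a system without a persistent clock the value read is 0, so the clock starts from 0.
In code lines 9~15, if the persistent clock value was read correctly, sets persistent_clock_exists = true. If an invalid persistent clock value was read, prints a warning message and resets the value to 0.
- The value is invalid if it is negative (before 1970), if the nanoseconds field is out of range, or if the seconds value exceeds the maximum.
In code lines 17~18, if the wall time read is smaller than boot_offset, the boot offset is considered invalid and is reset to 0.
In code line 24, prepares wall_to_mono, which is used to convert realtime to monotonic time.
- wall_to_mono = boot_offset - wall_time
- If a correct persistent time was obtained, wall_to_mono is negative.
- On a system without a persistent clock, both values start at 0, so wall_to_mono also starts at 0.
In code lines 26~27, takes the lock and begins the write sequence lock in order to change the timekeeping values.
In code line 28, initializes NTP.
In code lines 30~32, enables the default clock source.
- On rpi2 the default clock source is jiffies, which is always enabled.
In code line 33, performs the timekeeper's internal setup to use the clock source.
In code line 35, sets wall_time as the initial value of xtime.
In code line 36, initializes raw_sec to 0.
In code line 38, sets wtm (wall_to_mono), derived from the boot offset, so that realtime can be converted quickly to the monotonic clock.
- Reference: Timer -9- (Timekeeping) | 문c
In code line 40, updates the timekeeping values and performs the related notifications.
In code lines 42~43, releases the locks.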

tk_setup_internals()

kernel/time/timekeeping.c

/**
 * tk_setup_internals - Set up internals to use clocksource clock.
 *
 * @tk:         The target timekeeper to setup.
 * @clock:              Pointer to clocksource.
 *
 * Calculates a fixed cycle/nsec interval for a given clocksource/adjustment
 * pair and interval request.
 *
 * Unless you're the timekeeping code, you should not be using this!
 */

static void tk_setup_internals(struct timekeeper *tk, struct clocksource *clock)
{
        u64 interval;
        u64 tmp, ntpinterval;
        struct clocksource *old_clock;

        ++tk->cs_was_changed_seq;
        old_clock = tk->tkr_mono.clock;
        tk->tkr_mono.clock = clock;
        tk->tkr_mono.mask = clock->mask;
        tk->tkr_mono.cycle_last = tk_clock_read(&tk->tkr_mono);

        tk->tkr_raw.clock = clock;
        tk->tkr_raw.mask = clock->mask;
        tk->tkr_raw.cycle_last = tk->tkr_mono.cycle_last;

        /* Do the ns -> cycle conversion first, using original mult */
        tmp = NTP_INTERVAL_LENGTH;
        tmp <<= clock->shift;
        ntpinterval = tmp;
        tmp += clock->mult/2;
        do_div(tmp, clock->mult);
        if (tmp == 0)
                tmp = 1;

        interval = (u64) tmp;
        tk->cycle_interval = interval;

        /* Go back from cycles -> shifted ns */
        tk->xtime_interval = interval * clock->mult;
        tk->xtime_remainder = ntpinterval - tk->xtime_interval;
        tk->raw_interval = interval * clock->mult;

         /* if changing clocks, convert xtime_nsec shift units */
        if (old_clock) {
                int shift_change = clock->shift - old_clock->shift;
                if (shift_change < 0) {
                        tk->tkr_mono.xtime_nsec >>= -shift_change;
                        tk->tkr_raw.xtime_nsec >>= -shift_change;
                } else {
                        tk->tkr_mono.xtime_nsec <<= shift_change;
                        tk->tkr_raw.xtime_nsec <<= shift_change;
                }
        }

        tk->tkr_mono.shift = clock->shift;
        tk->tkr_raw.shift = clock->shift;

        tk->ntp_error = 0;
        tk->ntp_error_shift = NTP_SCALE_SHIFT - clock->shift;
        tk->ntp_tick = ntpinterval << tk->ntp_error_shift;

        /*
         * The timekeeper keeps its own mult values for the currently
         * active clocksource. These value will be adjusted via NTP
         * to counteract clock drifting.
         */
        tk->tkr_mono.mult = clock->mult;
        tk->tkr_raw.mult = clock->mult;
        tk->ntp_err_mult = 0;
        tk->skip_second_overflow = 0;
}

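Performs the timekeeper's internal setup to use a clock source.

In code lines 8~9, backs up the clock source previously in use and assigns the new clock source.
In code lines 10~11, takes the mask from the new clock source and reads its current cycle value into cycle_last.
In code lines 13~15, handles tkr_raw in the same way.
In code lines 18~24, computes ntpinterval as the NTP interval length (ns) << shift, then divides it by mult with rounding to get the number of cycles in one NTP interval (at least 1).
- e.g. NTP interval length = 1E9 / 100 Hz = 1E7
  - -> ntpinterval = 1E7 << 24 = 0x9896_8000_0000
In code lines 26~27, assigns the number of cycles in one tick to interval and cycle_interval.
- e.g. 0x9896_8000_0000 / mult = 192,000 (consistent with a 19.2 MHz clock source)
In code line 30, applies mult to interval, the number of cycles in one tick, and assigns the result to xtime_interval.
In code line 31, subtracts xtime_interval from ntpinterval and assigns the result to xtime_remainder.
In code line 32, computes the time taken by interval cycles, which is close to one scheduler tick, in shifted nanoseconds.
- raw_interval = interval * mult; in this code the product is stored without shifting it back.
In code lines 35~44, if there was a previous clock, rescales xtime_nsec by the change in precision, depending on whether the new shift is larger or smaller than the old one.
In code lines 46~47, stores the clock source's clock->shift in the timekeeper.
In code lines 49~51, initializes the NTP-related values.
In code lines 58~61, stores the clock source's clock->mult in the timekeeper and initializes the remaining NTP-related variables to 0.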
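
The interval arithmetic in this function can be checked in user space. The sketch below is an illustration only: `struct intervals` and `setup_intervals()` are hypothetical names, and the inputs are the 54 MHz / 250 Hz values from the xtime example earlier in this document (mult 0x1284bda1, shift 24, 4,000,000 ns per tick) rather than values taken from a real system.

```c
#include <assert.h>
#include <stdint.h>

struct intervals {
        uint64_t cycle_interval;  /* cycles per NTP interval */
        uint64_t xtime_interval;  /* shifted ns covered by those cycles */
        int64_t  xtime_remainder; /* shifted ns lost to rounding */
};

/* Mirrors the ns -> cycles -> shifted ns steps of tk_setup_internals(). */
struct intervals setup_intervals(uint64_t ntp_interval_ns, uint32_t mult,
                                 unsigned int shift)
{
        struct intervals iv;
        uint64_t ntpinterval, tmp;

        assert(mult > 0 && shift < 32);
        ntpinterval = ntp_interval_ns << shift;
        tmp = (ntpinterval + mult / 2) / mult;  /* rounded division */
        if (tmp == 0)
                tmp = 1;

        iv.cycle_interval = tmp;
        iv.xtime_interval = tmp * mult;
        iv.xtime_remainder = (int64_t)(ntpinterval - iv.xtime_interval);
        return iv;
}
```

With those inputs, cycle_interval is 216,000, xtime_interval is 0x3d08_ffff_63c0, and xtime_remainder is 40,000 shifted nanoseconds, the part of the 4 ms interval that 216,000 whole cycles do not cover.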

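The figure below shows the jiffies clock source being used as the timekeeper's clock source for the first time.

The figure below shows the processing when the timekeeper's clock source changes from the jiffies clock source to the arch_sys_counter clock source.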

Setting the ToD clock

do_settimeofday64()

kernel/time/timekeeping.c

/**
 * do_settimeofday64 - Sets the time of day.
 * @ts:     pointer to the timespec64 variable containing the new time
 *
 * Sets the time of day to the new time and update NTP and notify hrtimers
 */

int do_settimeofday64(const struct timespec64 *ts)
{
        struct timekeeper *tk = &tk_core.timekeeper;
        struct timespec64 ts_delta, xt;
        unsigned long flags;
        int ret = 0;

        if (!timespec64_valid_settod(ts))
                return -EINVAL;

        raw_spin_lock_irqsave(&timekeeper_lock, flags);
        write_seqcount_begin(&tk_core.seq);

        timekeeping_forward_now(tk);

        xt = tk_xtime(tk);
        ts_delta.tv_sec = ts->tv_sec - xt.tv_sec;
        ts_delta.tv_nsec = ts->tv_nsec - xt.tv_nsec;

        if (timespec64_compare(&tk->wall_to_monotonic, &ts_delta) > 0) {
                ret = -EINVAL;
                goto out;
        }

        tk_set_wall_to_mono(tk, timespec64_sub(tk->wall_to_monotonic, ts_delta));

        tk_set_xtime(tk, ts);
out:
        timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR | TK_CLOCK_WAS_SET);

        write_seqcount_end(&tk_core.seq);
        raw_spin_unlock_irqrestore(&timekeeper_lock, flags);

        /* signal hrtimers about time change */
        clock_was_set();

        if (!ret)
                audit_tk_injoffset(ts_delta);

        return ret;
}
EXPORT_SYMBOL(do_settimeofday64);

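Sets the requested time in timekeeping.

In code lines 8~9, if the requested time is not valid, stops and returns -EINVAL.
In code line 14, reads the timekeeper clock source and brings the timekeeper up to date. (cycle_last, xtime, raw_time)
In code lines 16~18, computes the delta between the requested time and the current time.
In code lines 20~23, jumps to the out label with -EINVAL if applying the delta would make wall_to_monotonic positive, that is, a realtime earlier than the monotonic time.
- Reference: time: Always make sure wall_to_monotonic isn't positive (2015, v4.3-rc1)
In code line 25, subtracts the delta from wall_to_monotonic so that realtime can still be converted to monotonic.
In code line 27, sets the requested time in xtime.
In code line 29, updates timekeeping.
In code line 35, because the clock has changed, clock_was_set() is called so that the next event is reprogrammed on every CPU and the timers on the cancel_list are cancelled.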
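
The reason wall_to_monotonic is adjusted by the same delta is that the monotonic clock must not move when the wall clock is set. The sketch below shows this invariant with scalar nanoseconds instead of timespec64 values; `struct clock_state`, `set_realtime()`, and `monotonic_ns()` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

struct clock_state {
        int64_t realtime_ns;     /* stand-in for xtime */
        int64_t wall_to_mono_ns; /* stand-in for wall_to_monotonic */
};

int64_t monotonic_ns(const struct clock_state *cs)
{
        assert(cs);
        return cs->realtime_ns + cs->wall_to_mono_ns;
}

/* Returns 0 on success, -1 where do_settimeofday64() returns -EINVAL. */
int set_realtime(struct clock_state *cs, int64_t new_realtime_ns)
{
        int64_t delta;

        assert(cs);
        delta = new_realtime_ns - cs->realtime_ns;

        /* Same test as the kernel code: the result must not become positive. */
        if (cs->wall_to_mono_ns > delta)
                return -1;

        cs->wall_to_mono_ns -= delta;   /* keeps monotonic unchanged */
        cs->realtime_ns = new_realtime_ns;
        return 0;
}
```

Starting from realtime 1000 and wall_to_mono -900 (monotonic 100), setting realtime to 5000 leaves monotonic at 100 with wall_to_mono at -4900. Setting realtime to 50 is rejected, because it would require a positive wall_to_mono.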

The following figure shows how the time is set through the date command.

Updating the timekeeper counters (xtime & raw_time)

timekeeping_forward_now()

kernel/time/timekeeping.c

/**
 * timekeeping_forward_now - update clock to the current time
 *
 * Forward the current clock to update its state since the last call to
 * update_wall_time(). This is useful before significant clock changes,
 * as it avoids having to deal with this time offset explicitly.
 */

static void timekeeping_forward_now(struct timekeeper *tk)
{
        u64 cycle_now, delta;

        cycle_now = tk_clock_read(&tk->tkr_mono);
        delta = clocksource_delta(cycle_now, tk->tkr_mono.cycle_last, tk->tkr_mono.mask);
        tk->tkr_mono.cycle_last = cycle_now;
        tk->tkr_raw.cycle_last  = cycle_now;

        tk->tkr_mono.xtime_nsec += delta * tk->tkr_mono.mult;

        /* If arch requires, add in get_arch_timeoffset() */
        tk->tkr_mono.xtime_nsec += (u64)arch_gettimeoffset() << tk->tkr_mono.shift;

        tk->tkr_raw.xtime_nsec += delta * tk->tkr_raw.mult;

        /* If arch requires, add in get_arch_timeoffset() */
        tk->tkr_raw.xtime_nsec += (u64)arch_gettimeoffset() << tk->tkr_raw.shift;

        tk_normalize_xtime(tk);
}

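Reads the timekeeper clock source, subtracts the stored cycle_last to get the delta cycles, and adds them to the current time, xtime (xtime_nsec, xtime_sec).

In code lines 5~8, reads the cycle value of the timekeeper's mono clock source and subtracts the previous cycle_last to get the delta cycles. The value just read is stored in both cycle_last fields.
In code line 10, multiplies delta by mult and adds the result to xtime_nsec of tkr_mono.
In code line 13, shifts the architecture-specific arch_gettimeoffset() value left by shift and adds it to tkr_mono's xtime_nsec.
- This applies on some ARM systems that cannot use nohz and high-res timers and are built with the CONFIG_ARCH_USES_GETTIMEOFFSET kernel option.
In code line 15, multiplies delta by mult and adds the result to xtime_nsec of tkr_raw (mono-raw).
In code line 18, shifts the architecture-specific arch_gettimeoffset() value left by shift and adds it to tkr_raw's xtime_nsec.
In code line 20, while xtime_nsec is one second or more, loops one second at a time and increments xtime_sec (and likewise raw_sec for tkr_raw). Even when delta is very large, the loop carries every whole second out of xtime_nsec.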

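The following figure shows how the timekeeper reads the cycle counter, applies mult to the delta cycles (by multiplication), and adds the result to xtime_nsec and xtime_sec to update the current time.

delta cycles = cycle value read from the clock source - cycle_last
xtime_nsec += (delta cycles * mult)
The part of xtime_nsec beyond one second is added to xtime_sec (normalization: xtime_sec is incremented one second at a time)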
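
The three steps above can be sketched in user space as follows. This is a simplified illustration, not the kernel function: `struct tk_sketch` and `forward_now()` are hypothetical names, only one clock (no separate raw clock) is tracked, and the arch_gettimeoffset() term is left out.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

struct tk_sketch {
        uint64_t cycle_last;
        uint64_t mask;        /* valid bits of the hardware counter */
        uint32_t mult;
        unsigned int shift;
        uint64_t xtime_nsec;  /* shifted nanoseconds */
        uint64_t xtime_sec;
};

void forward_now(struct tk_sketch *tk, uint64_t cycle_now)
{
        uint64_t delta, one_sec;

        assert(tk);
        /* Masking makes the subtraction correct across counter wrap-around. */
        delta = (cycle_now - tk->cycle_last) & tk->mask;
        tk->cycle_last = cycle_now;

        tk->xtime_nsec += delta * tk->mult;

        /* Normalization: carry whole seconds into xtime_sec. */
        one_sec = NSEC_PER_SEC << tk->shift;
        while (tk->xtime_nsec >= one_sec) {
                tk->xtime_nsec -= one_sec;
                tk->xtime_sec++;
        }
}
```

As an example, take a 32-bit counter at 1 MHz (mult 256000, shift 8) with cycle_last = 0xffffff00 and 999,900,000 ns already accumulated. Reading cycle_now = 0x100 after the counter has wrapped gives a delta of 512 cycles (512,000 ns), so xtime_sec increases by one and 412,000 ns remain in xtime_nsec.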

Time Normalization

tk_normalize_xtime()

kernel/time/timekeeping.c

static inline void tk_normalize_xtime(struct timekeeper *tk)
{
        while (tk->tkr_mono.xtime_nsec >= ((u64)NSEC_PER_SEC << tk->tkr_mono.shift)) {
                tk->tkr_mono.xtime_nsec -= (u64)NSEC_PER_SEC << tk->tkr_mono.shift;
                tk->xtime_sec++;
        }
        while (tk->tkr_raw.xtime_nsec >= ((u64)NSEC_PER_SEC << tk->tkr_raw.shift)) {
                tk->tkr_raw.xtime_nsec -= (u64)NSEC_PER_SEC << tk->tkr_raw.shift;
                tk->raw_sec++;
        }
}

When the nanoseconds in xtime reach one second or more, loops one second at a time, moving each whole second from the nanosecond field to the seconds field.

The seconds are incremented by 1 at a time until the nanoseconds are within 0 ~ 1G.
Reference: time: Condense timekeeper.xtime into xtime_sec

Setting the wall_to_monotonic conversion value

tk_set_wall_to_mono()

kernel/time/timekeeping.c

static void tk_set_wall_to_mono(struct timekeeper *tk, struct timespec64 wtm)
{
        struct timespec64 tmp;

        /*
         * Verify consistency of: offset_real = -wall_to_monotonic
         * before modifying anything
         */
        set_normalized_timespec64(&tmp, -tk->wall_to_monotonic.tv_sec,
                                        -tk->wall_to_monotonic.tv_nsec);
        WARN_ON_ONCE(tk->offs_real.tv64 != timespec64_to_ktime(tmp).tv64);
        tk->wall_to_monotonic = wtm;
        set_normalized_timespec64(&tmp, -wtm.tv_sec, -wtm.tv_nsec);
        tk->offs_real = timespec64_to_ktime(tmp);
        tk->offs_tai = ktime_add(tk->offs_real, ktime_set(tk->tai_offset, 0));
}

Sets wall_to_monotonic so that a realtime value can be converted to monotonic.

Because monotonic = realtime + wall_to_monotonic, the value assigned to wall_to_monotonic is normally negative.

In code lines 9~11, computes -wall_to_monotonic and prints a warning if it differs from offs_real.
In code line 12, stores the time passed as the argument in wall_to_monotonic.
- Since realtime + wall_to_monotonic = monotonic, wall_to_monotonic normally holds a negative value.
In code lines 13~14, negates the argument, normalizes it as a timespec64 so that the nanoseconds fall within 0 ~ 1G, converts it to ktime form, and assigns it to offs_real.
- Since realtime - offs_real = monotonic, offs_real holds a positive value.
In code line 15, adds tai_offset to offs_real and stores the result in offs_tai.

The following figure shows how wall_to_monotonic changes when the time is set to "1 Jan 2017 10:20:30".

set_normalized_timespec64()

kernel/time/time.c

/**
 * set_normalized_timespec - set timespec sec and nsec parts and normalize
 *
 * @ts:         pointer to timespec variable to be set
 * @sec:        seconds to set
 * @nsec:       nanoseconds to set
 *
 * Set seconds and nanoseconds field of a timespec variable and
 * normalize to the timespec storage format
 *
 * Note: The tv_nsec part is always in the range of
 *      0 <= tv_nsec < NSEC_PER_SEC
 * For negative values only the tv_sec field is negative !
 */

void set_normalized_timespec64(struct timespec64 *ts, time64_t sec, s64 nsec)
{
        while (nsec >= NSEC_PER_SEC) {
                /*
                 * The following asm() prevents the compiler from
                 * optimising this loop into a modulo operation. See
                 * also __iter_div_u64_rem() in include/linux/time.h
                 */
                asm("" : "+rm"(nsec));   
                nsec -= NSEC_PER_SEC;
                ++sec;
        }
        while (nsec < 0) {                      
                asm("" : "+rm"(nsec));
                nsec += NSEC_PER_SEC;
                --sec;
        }                       
        ts->tv_sec = sec;
        ts->tv_nsec = nsec;
}
EXPORT_SYMBOL(set_normalized_timespec64);

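Adjusts out-of-range nanoseconds one second at a time and stores the result in the output argument ts in timespec64 form. As the comment in the code notes, the loop is deliberately kept from being optimised into a modulo operation.

The seconds are incremented or decremented by 1 at a time until the nanoseconds are within 0 ~ 1G.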
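
The normalization is easiest to see with negative inputs, such as those tk_set_wall_to_mono() passes when it negates wall_to_monotonic. The following is a user-space sketch with a stand-in struct (`struct ts64` and `normalize_ts64()` are hypothetical names); it omits the `asm("")` barrier, so a compiler is free to turn these loops into a division.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

struct ts64 {           /* stand-in for struct timespec64 */
        int64_t tv_sec;
        int64_t tv_nsec;
};

/* Brings nsec into [0, NSEC_PER_SEC) by moving whole seconds into sec. */
void normalize_ts64(struct ts64 *ts, int64_t sec, int64_t nsec)
{
        assert(ts);
        while (nsec >= NSEC_PER_SEC) {
                nsec -= NSEC_PER_SEC;
                ++sec;
        }
        while (nsec < 0) {
                nsec += NSEC_PER_SEC;
                --sec;
        }
        ts->tv_sec = sec;
        ts->tv_nsec = nsec;
}
```

For instance, negating { -10 s, 300,000,000 ns } gives the inputs (10, -300,000,000), which normalize to { 9 s, 700,000,000 ns }. For a negative result only tv_sec is negative: (-1, -1) normalizes to { -2 s, 999,999,999 ns }.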

Updating wall time

update_wall_time()

kernel/time/timekeeping.c

/**
 * update_wall_time - Uses the current clocksource to increment the wall time
 *
 */

void update_wall_time(void)
{
        timekeeping_advance(TK_ADV_TICK);
}

timekeeping_advance()

kernel/time/timekeeping.c

/*
 * timekeeping_advance - Updates the timekeeper to the current time and
 * current NTP tick length
 */

static void timekeeping_advance(enum timekeeping_adv_mode mode)
{
        struct timekeeper *real_tk = &tk_core.timekeeper;
        struct timekeeper *tk = &shadow_timekeeper;
        u64 offset;
        int shift = 0, maxshift;
        unsigned int clock_set = 0;
        unsigned long flags;

        raw_spin_lock_irqsave(&timekeeper_lock, flags);

        /* Make sure we're fully resumed: */
        if (unlikely(timekeeping_suspended))
                goto out;

#ifdef CONFIG_ARCH_USES_GETTIMEOFFSET
        offset = real_tk->cycle_interval;

        if (mode != TK_ADV_TICK)
                goto out;
#else
        offset = clocksource_delta(tk_clock_read(&tk->tkr_mono),
                                   tk->tkr_mono.cycle_last, tk->tkr_mono.mask);

        /* Check if there's really nothing to do */
        if (offset < real_tk->cycle_interval && mode == TK_ADV_TICK)
                goto out;
#endif

        /* Do some additional sanity checking */
        timekeeping_check_update(tk, offset);

        /*
         * With NO_HZ we may have to accumulate many cycle_intervals
         * (think "ticks") worth of time at once. To do this efficiently,
         * we calculate the largest doubling multiple of cycle_intervals
         * that is smaller than the offset.  We then accumulate that
         * chunk in one go, and then try to consume the next smaller
         * doubled multiple.
         */
        shift = ilog2(offset) - ilog2(tk->cycle_interval);
        shift = max(0, shift);
        /* Bound shift to one less than what overflows tick_length */
        maxshift = (64 - (ilog2(ntp_tick_length())+1)) - 1;
        shift = min(shift, maxshift);
        while (offset >= tk->cycle_interval) {
                offset = logarithmic_accumulation(tk, offset, shift,
                                                        &clock_set);
                if (offset < tk->cycle_interval<<shift)
                        shift--;
        }

        /* Adjust the multiplier to correct NTP error */
        timekeeping_adjust(tk, offset); 

        /*
         * Finally, make sure that after the rounding
         * xtime_nsec isn't larger than NSEC_PER_SEC
         */
        clock_set |= accumulate_nsecs_to_secs(tk);

        write_seqcount_begin(&tk_core.seq);
        /*
         * Update the real timekeeper.
         *
         * We could avoid this memcpy by switching pointers, but that
         * requires changes to all other timekeeper usage sites as
         * well, i.e. move the timekeeper pointer getter into the
         * spinlocked/seqcount protected sections. And we trade this
         * memcpy under the tk_core.seq against one before we start
         * updating.
         */
        timekeeping_update(tk, clock_set);
        memcpy(real_tk, tk, sizeof(*tk));
        /* The memcpy must come last. Do not put anything here! */
        write_seqcount_end(&tk_core.seq);
out:
        raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
        if (clock_set)
                /* Have to call _delayed version, since in irq context*/
                clock_was_set_delayed();
}

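Called at scheduler tick intervals to update timekeeping.

In code lines 13~14, in the unlikely case that timekeeping is suspended (PM suspend), exits without doing anything.
In code lines 16~28, computes the difference between the last cycle value read and the current cycle value and assigns it to offset. If offset is smaller than cycle_interval there is nothing to do, so the function exits.
In code line 31, performs additional sanity checks on offset.
In code lines 41~51, derives a suitable shift from the difference between log2 of offset and log2 of cycle_interval, bounds it by maxshift, and accumulates offset in chunks of cycle_interval << shift, lowering shift as the remaining offset shrinks.
In code line 54, adjusts the multiplier to correct the NTP error.
In code line 60, if the xtime nanoseconds exceed one second, carries them into the xtime seconds. It also calls the NTP code for leap second handling. If a leap second occurred, the TK_CLOCK_WAS_SET (1 << 2) bit is added to clock_set.
In code lines 73~74, updates timekeeping and then copies the shadow timekeeper to the real timekeeper.
In code lines 79~81, if clock_set is set, schedules clock_was_set() through a workqueue, so that the next event is reprogrammed on every CPU and the timers on the cancel_list are cancelled.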
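
The shift selection and the accumulation loop can be tried out in isolation. The sketch below is a simplified user-space illustration: `ilog2_u64()` and `accumulate_log()` are hypothetical names, the maxshift bound and all NTP bookkeeping are omitted, and the only state tracked is how many whole cycle_intervals were consumed.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the highest set bit, or -1 for 0 (a simple stand-in for ilog2()). */
int ilog2_u64(uint64_t v)
{
        int l = -1;

        while (v) {
                v >>= 1;
                l++;
        }
        return l;
}

/* Consumes offset in power-of-two multiples of cycle_interval.
 * Returns the unconsumed cycles; *intervals_done counts whole intervals. */
uint64_t accumulate_log(uint64_t offset, uint64_t cycle_interval,
                        uint64_t *intervals_done)
{
        int shift;

        assert(cycle_interval > 0 && intervals_done);
        shift = ilog2_u64(offset) - ilog2_u64(cycle_interval);
        if (shift < 0)
                shift = 0;

        *intervals_done = 0;
        while (offset >= cycle_interval) {
                uint64_t chunk = cycle_interval << shift;

                if (offset >= chunk) {          /* accumulate one shifted interval */
                        offset -= chunk;
                        *intervals_done += (uint64_t)1 << shift;
                }
                if (shift > 0 && offset < (cycle_interval << shift))
                        shift--;                /* try the next smaller chunk */
        }
        return offset;
}
```

With cycle_interval = 216,000 and offset = 13 * 216,000 + 500, the loop starts at shift 4, consumes chunks of 8, 4, and 1 intervals, and returns the 500 leftover cycles. This is why many missed ticks (for example with NO_HZ) cost only a logarithmic number of iterations.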

clocksource_delta()

kernel/time/timekeeping_internal.h

static inline cycle_t clocksource_delta(cycle_t now, cycle_t last, cycle_t mask)
{
        return (now - last) & mask;
}

Masks the difference between the current cycle value and the previous cycle value and returns it.

logarithmic_accumulation()

Reference: time: Implement logarithmic time accumulation (2009, 2.6.33-rc1)

kernel/time/timekeeping.c

/**
 * logarithmic_accumulation - shifted accumulation of cycles
 *
 * This functions accumulates a shifted interval of cycles into
 * into a shifted interval nanoseconds. Allows for O(log) accumulation
 * loop.
 *
 * Returns the unconsumed cycles.
 */

static cycle_t logarithmic_accumulation(struct timekeeper *tk, cycle_t offset,
                                                u32 shift,
                                                unsigned int *clock_set)
{
        cycle_t interval = tk->cycle_interval << shift;
        u64 raw_nsecs;

        /* If the offset is smaller then a shifted interval, do nothing */
        if (offset < interval)
                return offset;

        /* Accumulate one shifted interval */
        offset -= interval;
        tk->tkr.cycle_last += interval;

        tk->tkr.xtime_nsec += tk->xtime_interval << shift;
        *clock_set |= accumulate_nsecs_to_secs(tk);

        /* Accumulate raw time */
        raw_nsecs = (u64)tk->raw_interval << shift;
        raw_nsecs += tk->raw_time.tv_nsec;
        if (raw_nsecs >= NSEC_PER_SEC) {
                u64 raw_secs = raw_nsecs;
                raw_nsecs = do_div(raw_secs, NSEC_PER_SEC);
                tk->raw_time.tv_sec += raw_secs;
        }
        tk->raw_time.tv_nsec = raw_nsecs;

        /* Accumulate error between NTP and clock interval */
        tk->ntp_error += tk->ntp_tick << shift;
        tk->ntp_error -= (tk->xtime_interval + tk->xtime_remainder) <<
                                                (tk->ntp_error_shift + shift);

        return offset;
}

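Accumulates a shifted interval of cycles into shifted nanoseconds. It does so only when offset is at least the shifted interval, and the offset that remains unapplied is returned.

In code line 5, shifts cycle_interval (the number of cycles in one NTP interval) left by shift and assigns it to interval.
In code lines 13~14, subtracts the shifted interval from offset and adds it to cycle_last.
In code line 16, adds xtime_interval << shift to xtime_nsec.
In code line 17, if the xtime nanoseconds exceed one second, carries them into the xtime seconds. It also calls the NTP code for leap second handling.
In code lines 20~27, adds the shifted raw_interval to raw_time.
In code lines 30~32, adds the shifted ntp_tick to ntp_error and subtracts the shifted (xtime_interval + xtime_remainder).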

accumulate_nsecs_to_secs()

kernel/time/timekeeping.c

/**
 * accumulate_nsecs_to_secs - Accumulates nsecs into secs
 *
 * Helper function that accumulates a the nsecs greater then a second
 * from the xtime_nsec field to the xtime_secs field.
 * It also calls into the NTP code to handle leapsecond processing.
 *
 */

static inline unsigned int accumulate_nsecs_to_secs(struct timekeeper *tk)
{
        u64 nsecps = (u64)NSEC_PER_SEC << tk->tkr.shift;
        unsigned int clock_set = 0;

        while (tk->tkr.xtime_nsec >= nsecps) {
                int leap;

                tk->tkr.xtime_nsec -= nsecps;
                tk->xtime_sec++;

                /* Figure out if its a leap sec and apply if needed */
                leap = second_overflow(tk->xtime_sec);
                if (unlikely(leap)) {
                        struct timespec64 ts;

                        tk->xtime_sec += leap;

                        ts.tv_sec = leap;
                        ts.tv_nsec = 0;
                        tk_set_wall_to_mono(tk,
                                timespec64_sub(tk->wall_to_monotonic, ts));

                        __timekeeping_set_tai_offset(tk, tk->tai_offset - leap);

                        clock_set = TK_CLOCK_WAS_SET;
                }
        }
        return clock_set;
}

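If the xtime nanoseconds exceed one second, carries them into the xtime seconds. It also calls the NTP code for leap second handling.

In code line 3, assigns one second in nanoseconds, shifted left by shift, to nsecps.
- e.g. 1E9 << shift
In code lines 6~10, while xtime_nsec is at least one second, moves one second from xtime_nsec to xtime_sec.
In code line 13, since xtime_sec has changed, calls the NTP code and obtains the leap second value.
In code lines 14~27, if there is a leap value, adjusts xtime_sec and wtm, and adjusts tai_offset as well.
In code line 29, returns 0 if no leap adjustment was made and TK_CLOCK_WAS_SET if one was.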
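
The effect of an inserted leap second on the three values can be followed in a small user-space sketch. Everything here is illustrative: `struct tk_leap_sketch`, `leap_stub()`, and `accumulate_secs()` are hypothetical names, `leap_stub()` merely imitates second_overflow() returning -1 at a chosen second, and wall_to_monotonic is reduced to its seconds part.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

struct tk_leap_sketch {
        uint64_t xtime_nsec;       /* shifted nanoseconds */
        int64_t  xtime_sec;
        int64_t  wall_to_mono_sec; /* seconds part only, for brevity */
        int      tai_offset;
        unsigned int shift;
};

/* Hypothetical stand-in for second_overflow(): -1 means "insert a leap second". */
int leap_stub(int64_t secs, int64_t insert_at)
{
        return secs == insert_at ? -1 : 0;
}

/* Returns 1 where the kernel function would return TK_CLOCK_WAS_SET. */
int accumulate_secs(struct tk_leap_sketch *tk, int64_t insert_at)
{
        uint64_t nsecps;
        int clock_set = 0;

        assert(tk);
        nsecps = NSEC_PER_SEC << tk->shift;
        while (tk->xtime_nsec >= nsecps) {
                int leap;

                tk->xtime_nsec -= nsecps;
                tk->xtime_sec++;

                leap = leap_stub(tk->xtime_sec, insert_at);
                if (leap) {
                        tk->xtime_sec += leap;        /* wall clock steps back */
                        tk->wall_to_mono_sec -= leap; /* monotonic stays continuous */
                        tk->tai_offset -= leap;       /* TAI - UTC grows by one */
                        clock_set = 1;
                }
        }
        return clock_set;
}
```

Starting at xtime_sec = 86,399 with 1.5 s pending, wall_to_mono_sec = -1000, and tai_offset = 37, an insertion at second 86,400 leaves xtime_sec at 86,399 with 0.5 s pending, wall_to_mono_sec at -999, and tai_offset at 38. The sum xtime + wall_to_mono is 85,400.5 s both before and after, so the monotonic clock does not jump.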

NTP adjustment

second_overflow()

kernel/time/ntp.c

/*
 * this routine handles the overflow of the microsecond field
 *
 * The tricky bits of code to handle the accurate clock support
 * were provided by Dave Mills (Mills@UDEL.EDU) of NTP fame.
 * They were originally developed for SUN and DEC kernels.
 * All the kudos should go to Dave for this stuff.
 *
 * Also handles leap second processing, and returns leap offset
 */

int second_overflow(unsigned long secs)
{
        s64 delta;
        int leap = 0;

        /*
         * Leap second processing. If in leap-insert state at the end of the
         * day, the system clock is set back one second; if in leap-delete
         * state, the system clock is set ahead one second.
         */
        switch (time_state) {
        case TIME_OK:
                if (time_status & STA_INS)
                        time_state = TIME_INS;
                else if (time_status & STA_DEL)
                        time_state = TIME_DEL;
                break;
        case TIME_INS:
                if (!(time_status & STA_INS))
                        time_state = TIME_OK;
                else if (secs % 86400 == 0) {
                        leap = -1;
                        time_state = TIME_OOP;
                        printk(KERN_NOTICE
                                "Clock: inserting leap second 23:59:60 UTC\n");
                }
                break;
        case TIME_DEL:
                if (!(time_status & STA_DEL))
                        time_state = TIME_OK;
                else if ((secs + 1) % 86400 == 0) {
                        leap = 1;
                        time_state = TIME_WAIT;
                        printk(KERN_NOTICE
                                "Clock: deleting leap second 23:59:59 UTC\n");
                }
                break;
        case TIME_OOP:
                time_state = TIME_WAIT;
                break;

        case TIME_WAIT:
                if (!(time_status & (STA_INS | STA_DEL)))
                        time_state = TIME_OK;
                break;
        }

        /* Bump the maxerror field */
        time_maxerror += MAXFREQ / NSEC_PER_USEC;
        if (time_maxerror > NTP_PHASE_LIMIT) {
                time_maxerror = NTP_PHASE_LIMIT;
                time_status |= STA_UNSYNC;
        }

        /* Compute the phase adjustment for the next second */
        tick_length      = tick_length_base;

        delta            = ntp_offset_chunk(time_offset);
        time_offset     -= delta;
        tick_length     += delta;

        /* Check PPS signal */
        pps_dec_valid();

        if (!time_adjust)
                goto out;

        if (time_adjust > MAX_TICKADJ) {
                time_adjust -= MAX_TICKADJ;
                tick_length += MAX_TICKADJ_SCALED;
                goto out;
        }

        if (time_adjust < -MAX_TICKADJ) {
                time_adjust += MAX_TICKADJ;
                tick_length -= MAX_TICKADJ_SCALED;
                goto out;
        }

        tick_length += (s64)(time_adjust * NSEC_PER_USEC / NTP_INTERVAL_FREQ)
                                                         << NTP_SCALE_SHIFT;
        time_adjust = 0;

out:
        return leap;
}

Timekeeping adjustment

Reference: timekeeping: Rework frequency adjustments to work better w/ nohz (2013, v3.17-rc1)

timekeeping_adjust()

kernel/time/timekeeping.c

/*
 * Adjust the timekeeper's multiplier to the correct frequency
 * and also to reduce the accumulated error value.
 */

static void timekeeping_adjust(struct timekeeper *tk, s64 offset)
{
        u32 mult;

        /*
         * Determine the multiplier from the current NTP tick length.
         * Avoid expensive division when the tick length doesn't change.
         */
        if (likely(tk->ntp_tick == ntp_tick_length())) {
                mult = tk->tkr_mono.mult - tk->ntp_err_mult;
        } else {
                tk->ntp_tick = ntp_tick_length();
                mult = div64_u64((tk->ntp_tick >> tk->ntp_error_shift) -
                                 tk->xtime_remainder, tk->cycle_interval);
        }

        /*
         * If the clock is behind the NTP time, increase the multiplier by 1
         * to catch up with it. If it's ahead and there was a remainder in the
         * tick division, the clock will slow down. Otherwise it will stay
         * ahead until the tick length changes to a non-divisible value.
         */
        tk->ntp_err_mult = tk->ntp_error > 0 ? 1 : 0;
        mult += tk->ntp_err_mult;

        timekeeping_apply_adjustment(tk, offset, mult - tk->tkr_mono.mult);

        if (unlikely(tk->tkr_mono.clock->maxadj &&
                (abs(tk->tkr_mono.mult - tk->tkr_mono.clock->mult)
                        > tk->tkr_mono.clock->maxadj))) {
                printk_once(KERN_WARNING
                        "Adjusting %s more than 11%% (%ld vs %ld)\n",
                        tk->tkr_mono.clock->name, (long)tk->tkr_mono.mult,
                        (long)tk->tkr_mono.clock->mult + tk->tkr_mono.clock->maxadj);
        }

        /*
         * It may be possible that when we entered this function, xtime_nsec
         * was very small.  Further, if we're slightly speeding the clocksource
         * in the code above, its possible the required corrective factor to
         * xtime_nsec could cause it to underflow.
         *
         * Now, since we have already accumulated the second and the NTP
         * subsystem has been notified via second_overflow(), we need to skip
         * the next update.
         */
        if (unlikely((s64)tk->tkr_mono.xtime_nsec < 0)) {
                tk->tkr_mono.xtime_nsec += (u64)NSEC_PER_SEC <<
                                                        tk->tkr_mono.shift;
                tk->xtime_sec--;
                tk->skip_second_overflow = 1;
        }
}

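Changes the timekeeper's mult so that it runs at the correct frequency, and also reduces the accumulated error value.

In code lines 9~10, when tk->ntp_tick equals the current NTP tick length, mult is obtained by removing the previously applied error correction (ntp_err_mult) from tkr_mono.mult, which avoids an expensive division.
- This yields the mult needed to match the frequency specified by NTP.
In code lines 11~15, otherwise the NTP tick length is read and stored in tk->ntp_tick, and mult is recomputed by division.
In code lines 23~26, if ntp_error is positive (the clock is behind NTP time), mult is increased by 1; otherwise it is left as is. The difference from the current tkr_mono.mult is then applied to the timekeeper together with @offset.
In code lines 28~35, a warning is printed once if the adjusted mult deviates from the clocksource's own mult by more than maxadj.
In code lines 47~52, if xtime_nsec has become negative, 1 second is subtracted from xtime_sec, one second's worth of shifted nanoseconds (NSEC_PER_SEC << shift) is added to xtime_nsec, and skip_second_overflow is set so that the next second_overflow() update is skipped.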
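
The multiplier selection above can be sketched in user-space C. This is only an illustration, not the kernel's code: `struct tk_sketch` and `sketch_pick_mult()` are hypothetical names, the fast path that skips the division is left out, and the sample values in the note below (54 MHz clocksource, shift 24, HZ=250) are assumptions carried over from the xtime example earlier in this article.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of struct timekeeper, for illustration only. */
struct tk_sketch {
        uint64_t ntp_tick;         /* NTP tick length, ns << 32 (NTP_SCALE_SHIFT) */
        uint32_t ntp_error_shift;  /* 32 - clocksource shift */
        int64_t  xtime_remainder;  /* shifted ns lost when rounding cycle_interval */
        uint64_t cycle_interval;   /* cycles per NTP interval */
        int64_t  ntp_error;        /* > 0 means the clock is behind NTP time */
};

/*
 * Slow-path multiplier choice: convert the NTP tick to clock-shifted
 * nanoseconds, drop the rounding remainder, divide by the cycles per
 * tick, then add 1 when the clock has to catch up.
 */
static uint32_t sketch_pick_mult(const struct tk_sketch *tk)
{
        uint64_t shifted_ns;
        uint32_t mult;

        assert(tk->cycle_interval != 0);

        shifted_ns = (tk->ntp_tick >> tk->ntp_error_shift)
                     - (uint64_t)tk->xtime_remainder;
        mult = (uint32_t)(shifted_ns / tk->cycle_interval);

        if (tk->ntp_error > 0)
                mult += 1;
        return mult;
}
```

With ntp_tick = 4,000,000 << 32, ntp_error_shift = 8, xtime_remainder = 40,000 and cycle_interval = 216,000, the division gives 310,689,185, the same 24-bit mult derived in the xtime section. A positive ntp_error raises it by one.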

timekeeping_freqadjust()

kernel/time/timekeeping.c

/*
 * Calculate the multiplier adjustment needed to match the frequency
 * specified by NTP
 */
static __always_inline void timekeeping_freqadjust(struct timekeeper *tk,
                                                        s64 offset)
{
        s64 interval = tk->cycle_interval;
        s64 xinterval = tk->xtime_interval;
        s64 tick_error;
        bool negative;
        u32 adj;

        /* Remove any current error adj from freq calculation */
        if (tk->ntp_err_mult)
                xinterval -= tk->cycle_interval;

        tk->ntp_tick = ntp_tick_length();

        /* Calculate current error per tick */
        tick_error = ntp_tick_length() >> tk->ntp_error_shift;
        tick_error -= (xinterval + tk->xtime_remainder);

        /* Don't worry about correcting it if its small */
        if (likely((tick_error >= 0) && (tick_error <= interval)))
                return;

        /* preserve the direction of correction */
        negative = (tick_error < 0);

        /* Sort out the magnitude of the correction */
        tick_error = abs(tick_error);
        for (adj = 0; tick_error > interval; adj++)
                tick_error >>= 1;

        /* scale the corrections */
        timekeeping_apply_adjustment(tk, offset, negative, adj);
}

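Calculates the mult adjustment needed to match the frequency specified by NTP.

In code lines 15~16, if an error correction is currently applied to mult (ntp_err_mult is set), cycle_interval is subtracted from the xtime interval so that the correction does not enter the frequency calculation.
In code line 18, the NTP tick length is read and cached in ntp_tick.
In code lines 21~26, the error per tick is calculated, and the function returns if the error is small enough not to need correcting.
- The error is the NTP tick length converted to clock-shifted nanoseconds (shifted right by ntp_error_shift) minus xinterval and xtime_remainder. A correction is needed only when this value falls outside the range 0 ~ cycle_interval.
In code lines 32~34, while tick_error is larger than cycle_interval, adj is incremented and tick_error is halved.
In code line 37, the correction is applied with the direction (negative) and magnitude (adj) found above. Note that this listing calls timekeeping_apply_adjustment() with four arguments, which is an older signature than the three-argument version listed in the next section.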
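
The magnitude search in this function can be tried out in isolation. The following is a minimal sketch under the assumption that only the per-tick error and the cycle interval matter; `sketch_freq_correction()` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the magnitude search above.
 * Returns false when the per-tick error is small enough to ignore;
 * otherwise stores the direction in *negative and the number of
 * halvings in *adj.
 */
static bool sketch_freq_correction(int64_t tick_error, int64_t interval,
                                   bool *negative, uint32_t *adj)
{
        assert(interval > 0);

        /* Errors within 0..interval are left alone. */
        if (tick_error >= 0 && tick_error <= interval)
                return false;

        /* Preserve the direction, then work on the magnitude. */
        *negative = tick_error < 0;
        if (*negative)
                tick_error = -tick_error;

        /* Halve until the error fits in one interval; count the steps. */
        for (*adj = 0; tick_error > interval; (*adj)++)
                tick_error >>= 1;

        return true;
}
```

For example, with an interval of 216,000 cycles an error of 1,728,000 (eight intervals) needs three halvings, so adj becomes 3, while an error of -1 is reported as a negative correction with adj 0.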

timekeeping_apply_adjustment()

kernel/time/timekeeping.c

/*
 * Apply a multiplier adjustment to the timekeeper
 */

static __always_inline void timekeeping_apply_adjustment(struct timekeeper *tk,
                                                         s64 offset,
                                                         s32 mult_adj)
{
        s64 interval = tk->cycle_interval;

        if (mult_adj == 0) {
                return;
        } else if (mult_adj == -1) {
                interval = -interval;
                offset = -offset;
        } else if (mult_adj != 1) {
                interval *= mult_adj;
                offset *= mult_adj;
        }

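Adjusts the timekeeper's multiplier by @mult_adj and compensates using @offset. (Updates tkr_mono.mult, tkr_mono.xtime_nsec and xtime_interval.)

In code lines 7~8, if @mult_adj is 0 (mult has not changed), the function returns without adjusting.
In code lines 9~11, if @mult_adj is -1, interval and offset are negated.
In code lines 12~14, for any other value except 1, interval and offset are multiplied by @mult_adj to scale them.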

        /*
         * So the following can be confusing.
         *
         * To keep things simple, lets assume mult_adj == 1 for now.
         *
         * When mult_adj != 1, remember that the interval and offset values
         * have been appropriately scaled so the math is the same.
         *
         * The basic idea here is that we're increasing the multiplier
         * by one, this causes the xtime_interval to be incremented by
         * one cycle_interval. This is because:
         *      xtime_interval = cycle_interval * mult
         * So if mult is being incremented by one:
         *      xtime_interval = cycle_interval * (mult + 1)
         * Its the same as:
         *      xtime_interval = (cycle_interval * mult) + cycle_interval
         * Which can be shortened to:
         *      xtime_interval += cycle_interval
         *
         * So offset stores the non-accumulated cycles. Thus the current
         * time (in shifted nanoseconds) is:
         *      now = (offset * adj) + xtime_nsec
         * Now, even though we're adjusting the clock frequency, we have
         * to keep time consistent. In other words, we can't jump back
         * in time, and we also want to avoid jumping forward in time.
         *
         * So given the same offset value, we need the time to be the same
         * both before and after the freq adjustment.
         *      now = (offset * adj_1) + xtime_nsec_1
         *      now = (offset * adj_2) + xtime_nsec_2
         * So:
         *      (offset * adj_1) + xtime_nsec_1 =
         *              (offset * adj_2) + xtime_nsec_2
         * And we know:
         *      adj_2 = adj_1 + 1
         * So:
         *      (offset * adj_1) + xtime_nsec_1 =
         *              (offset * (adj_1+1)) + xtime_nsec_2
         *      (offset * adj_1) + xtime_nsec_1 =
         *              (offset * adj_1) + offset + xtime_nsec_2
         * Canceling the sides:
         *      xtime_nsec_1 = offset + xtime_nsec_2
         * Which gives us:
         *      xtime_nsec_2 = xtime_nsec_1 - offset
         * Which simplfies to:
         *      xtime_nsec -= offset
         */

        if ((mult_adj > 0) && (tk->tkr_mono.mult + mult_adj < mult_adj)) {
                /* NTP adjustment caused clocksource mult overflow */
                WARN_ON_ONCE(1);
                return;
        }

        tk->tkr_mono.mult += mult_adj;
        tk->xtime_interval += interval;
        tk->tkr_mono.xtime_nsec -= offset;
}

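In code lines 1~5, if the adjustment is positive and adding it would overflow the 32-bit mult, a warning is issued and the function returns.
In code line 7, @mult_adj is added to tkr_mono.mult.
In code line 8, the scaled interval is added to xtime_interval.
In code line 9, offset is subtracted from xtime_nsec so that the current time is the same before and after the frequency change.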
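
The derivation in the long comment (xtime_nsec -= offset keeps "now" unchanged) can be checked numerically. Below is a minimal sketch assuming a reduced state; `struct adj_sketch`, `sketch_now()` and `sketch_apply_adjustment()` are hypothetical names, and unlike the kernel function the sketch returns a bool instead of warning on overflow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the timekeeper state, for illustration only. */
struct adj_sketch {
        uint32_t mult;            /* cycles -> shifted ns multiplier */
        uint64_t cycle_interval;  /* cycles per NTP interval */
        uint64_t xtime_interval;  /* cycle_interval * mult */
        uint64_t xtime_nsec;      /* accumulated shifted ns */
};

/* Current time in shifted ns for `offset` not-yet-accumulated cycles. */
static uint64_t sketch_now(const struct adj_sketch *tk, uint64_t offset)
{
        return offset * tk->mult + tk->xtime_nsec;
}

/* Change mult by mult_adj while keeping sketch_now() unchanged. */
static bool sketch_apply_adjustment(struct adj_sketch *tk, int64_t offset,
                                    int32_t mult_adj)
{
        int64_t interval = (int64_t)tk->cycle_interval * mult_adj;

        assert(offset >= 0);

        /* Refuse an adjustment that would wrap the 32-bit multiplier. */
        if (mult_adj > 0 && tk->mult + (uint32_t)mult_adj < (uint32_t)mult_adj)
                return false;

        tk->mult += (uint32_t)mult_adj;
        tk->xtime_interval += (uint64_t)interval;
        /* Each extra unit of mult adds `offset` shifted ns; take it back. */
        tk->xtime_nsec -= (uint64_t)(offset * mult_adj);
        return true;
}
```

Calling sketch_now() with the same offset before and after an adjustment returns the same value, which is exactly the "no jump in time" property the comment argues for.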

Timekeeping Update

timekeeping_update()

kernel/time/timekeeping.c

/* must hold timekeeper_lock */
static void timekeeping_update(struct timekeeper *tk, unsigned int action)
{
        if (action & TK_CLEAR_NTP) {
                tk->ntp_error = 0;
                ntp_clear();
        }

        tk_update_leap_state(tk);
        tk_update_ktime_data(tk);

        update_vsyscall(tk);
        update_pvclock_gtod(tk, action & TK_CLOCK_WAS_SET);

        tk->tkr_mono.base_real = tk->tkr_mono.base + tk->offs_real;
        update_fast_timekeeper(&tk->tkr_mono, &tk_fast_mono);
        update_fast_timekeeper(&tk->tkr_raw,  &tk_fast_raw);

        if (action & TK_CLOCK_WAS_SET)
                tk->clock_was_set_seq++;
        /*
         * The mirroring of the data to the shadow-timekeeper needs
         * to happen last here to ensure we don't over-write the
         * timekeeper structure on the next update with stale data
         */
        if (action & TK_MIRROR)
                memcpy(&shadow_timekeeper, &tk_core.timekeeper,
                       sizeof(tk_core.timekeeper));
}

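Updates the timekeeping values and performs the related notifications.

In code lines 4~7, if TK_CLEAR_NTP is requested, ntp_error is reset to 0 and the NTP state is cleared.
In code lines 9~10, the timekeeper's next_leap_ktime and its ktime_t based scalar nanosecond members are updated.
In code line 12, the timekeeper data is published for architectures that support vsyscall.
- Although not yet applied on arm, many architectures such as arm64 and x86 map kernel code or data so that user space can access it directly. This avoids the overhead of a context switch into the kernel, so user space can use kernel resources very quickly. Because of security issues, however, its use is very limited. arm has a similar interface called User Helpers, which closely resembles the VDSO.
- References
  - Kernel-provided User Helpers | 문c
  - VDSO manual page
  - vDSO | wikipedia
In code line 13, the functions registered in the notifier blocks on the pvclock_gtod_chain list are called.
- Currently this is used on the x86 architecture for virtualization support.
In code line 15, tkr_mono.base_real is set to tkr_mono.base + offs_real.
In code lines 16~17, the timekeeper's tkr_mono and tkr_raw are also copied to the fast timekeepers (tk_fast_mono and tk_fast_raw).
In code lines 19~20, if TK_CLOCK_WAS_SET is set in the @action argument, clock_was_set_seq is incremented.
In code lines 26~28, if TK_MIRROR is requested, the timekeeper is copied to the shadow timekeeper.
- Reference: timekeeping: Implement a shadow timekeeper

The following figure shows how the time is propagated from the timekeeper core to each of the time-management structures.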

tk_update_ktime_data()

kernel/time/timekeeping.c

/*
 * Update the ktime_t based scalar nsec members of the timekeeper
 */

static inline void tk_update_ktime_data(struct timekeeper *tk)
{
        u64 seconds;
        u32 nsec;

        /*
         * The xtime based monotonic readout is:
         *      nsec = (xtime_sec + wtm_sec) * 1e9 + wtm_nsec + now();
         * The ktime based monotonic readout is:
         *      nsec = base_mono + now();
         * ==> base_mono = (xtime_sec + wtm_sec) * 1e9 + wtm_nsec
         */
        seconds = (u64)(tk->xtime_sec + tk->wall_to_monotonic.tv_sec);
        nsec = (u32) tk->wall_to_monotonic.tv_nsec;
        tk->tkr_mono.base = ns_to_ktime(seconds * NSEC_PER_SEC + nsec);

        /*
         * The sum of the nanoseconds portions of xtime and
         * wall_to_monotonic can be greater/equal one second. Take
         * this into account before updating tk->ktime_sec.
         */
        nsec += (u32)(tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift);
        if (nsec >= NSEC_PER_SEC)
                seconds++;
        tk->ktime_sec = seconds;

        /* Update the monotonic raw base */
        tk->tkr_raw.base = ns_to_ktime(tk->raw_sec * NSEC_PER_SEC);
}

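Updates the timekeeper's ktime_t based nanosecond members.

In code lines 13~15, the seconds of xtime and wall_to_monotonic (wtm) are added, converted to nanoseconds together with wall_to_monotonic.tv_nsec, and stored in tkr_mono.base.
In code lines 22~25, the sum of the xtime and wtm seconds is stored in ktime_sec; if wall_to_monotonic.tv_nsec + (xtime_nsec >> shift) amounts to one second or more, the value is incremented by 1.
In code line 28, raw_sec is converted to ktime_t nanoseconds and stored in tkr_raw.base.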
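
The carry handling in the steps above is easy to get wrong, so here is a small sketch of the same arithmetic. The names `struct ktime_sketch` and `sketch_update_ktime()` are hypothetical, and plain integers stand in for ktime_t and timespec64.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NSEC_PER_SEC 1000000000ULL

/* Hypothetical result pair standing in for tkr_mono.base and ktime_sec. */
struct ktime_sketch {
        uint64_t base_ns;    /* monotonic base in ns, without xtime_nsec */
        uint64_t ktime_sec;  /* whole monotonic seconds, with carry */
};

static struct ktime_sketch sketch_update_ktime(uint64_t xtime_sec,
                                               int64_t wtm_sec,
                                               uint32_t wtm_nsec,
                                               uint64_t xtime_nsec,
                                               uint32_t shift)
{
        struct ktime_sketch out;
        uint64_t seconds = xtime_sec + (uint64_t)wtm_sec;
        uint32_t nsec = wtm_nsec;

        assert(wtm_nsec < SKETCH_NSEC_PER_SEC && shift < 64);

        /* base = (xtime_sec + wtm_sec) * 1e9 + wtm_nsec */
        out.base_ns = seconds * SKETCH_NSEC_PER_SEC + nsec;

        /* The two nanosecond parts together may carry into the seconds. */
        nsec += (uint32_t)(xtime_nsec >> shift);
        if (nsec >= SKETCH_NSEC_PER_SEC)
                seconds++;
        out.ktime_sec = seconds;
        return out;
}
```

With xtime_sec = 1000, wall_to_monotonic = (-901 s, 600,000,000 ns) and 500,000,000 ns held in xtime_nsec, the base is 99.6 s while ktime_sec is 100, because the two nanosecond parts add up to more than one second.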

The figure below shows how the timekeeper's tkr_mono.base, tkr_raw.base and ktime_sec values are updated.

Registering PM Suspend & Resume Notifications

timekeeping_init_ops()

kernel/time/timekeeping.c

static int __init timekeeping_init_ops(void)
{
        register_syscore_ops(&timekeeping_syscore_ops);
        return 0;
}
device_initcall(timekeeping_init_ops);

Registers the timekeeping syscore_ops in syscore_ops_list.

kernel/time/timekeeping.c

/* sysfs resume/suspend bits for timekeeping */
static struct syscore_ops timekeeping_syscore_ops = {
        .resume         = timekeeping_resume,
        .suspend        = timekeeping_suspend,
};

register_syscore_ops()

drivers/base/syscore.c

/**
 * register_syscore_ops - Register a set of system core operations.
 * @ops: System core operations to register.
 */
void register_syscore_ops(struct syscore_ops *ops)
{
        mutex_lock(&syscore_ops_lock);
        list_add_tail(&ops->node, &syscore_ops_list);
        mutex_unlock(&syscore_ops_lock);
}
EXPORT_SYMBOL_GPL(register_syscore_ops);

Adds the given syscore_ops to the tail of syscore_ops_list.

Fast Timekeeper

Reference: timekeeping: Provide fast and NMI safe access to CLOCK_MONOTONIC

update_fast_timekeeper()

kernel/time/timekeeping.c

/**
 * update_fast_timekeeper - Update the fast and NMI safe monotonic timekeeper.
 * @tkr: Timekeeping readout base from which we take the update
 *
 * We want to use this from any context including NMI and tracing /
 * instrumenting the timekeeping code itself.
 *
 * So we handle this differently than the other timekeeping accessor
 * functions which retry when the sequence count has changed. The
 * update side does:
 *
 * smp_wmb();   <- Ensure that the last base[1] update is visible
 * tkf->seq++;
 * smp_wmb();   <- Ensure that the seqcount update is visible
 * update(tkf->base[0], tkr);
 * smp_wmb();   <- Ensure that the base[0] update is visible
 * tkf->seq++;
 * smp_wmb();   <- Ensure that the seqcount update is visible
 * update(tkf->base[1], tkr);
 *
 * The reader side does:
 *
 * do {
 *      seq = tkf->seq;
 *      smp_rmb();
 *      idx = seq & 0x01;
 *      now = now(tkf->base[idx]);
 *      smp_rmb();
 * } while (seq != tkf->seq)
 *
 * As long as we update base[0] readers are forced off to
 * base[1]. Once base[0] is updated readers are redirected to base[0]
 * and the base[1] update takes place.
 *
 * So if a NMI hits the update of base[0] then it will use base[1]
 * which is still consistent. In the worst case this can result is a
 * slightly wrong timestamp (a few nanoseconds). See
 * @ktime_get_mono_fast_ns.
 */
static void update_fast_timekeeper(struct tk_read_base *tkr)
{
        struct tk_read_base *base = tk_fast_mono.base;

        /* Force readers off to base[1] */
        raw_write_seqcount_latch(&tk_fast_mono.seq);

        /* Update base[0] */
        memcpy(base, tkr, sizeof(*base));

        /* Force readers back to base[0] */
        raw_write_seqcount_latch(&tk_fast_mono.seq);

        /* Update base[1] */
        memcpy(base + 1, base, sizeof(*base));
}

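Copies the timekeeper's readout base to the fast timekeeper for tracing support. The data is copied into two slots so that it can be read locklessly and very quickly. This listing is from an older kernel in which the function takes only @tkr and always updates tk_fast_mono; the calls in timekeeping_update() above pass the target fast timekeeper as a second argument.

In code line 45, the sequence is incremented. The sequence value becomes odd at this point.
In code line 48, tkr is copied into the first slot of the fast timekeeper.
In code line 51, the sequence is incremented. The sequence value becomes even at this point.
In code line 54, the first slot is copied into the second slot of the fast timekeeper.

The following figure shows the timekeeper being updated into the fast timekeeper.

The process of copying tkr_raw -> tk_fast_raw is identical and is therefore omitted.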

ktime_get_mono_fast_ns()

kernel/time/timekeeping.c

/**
 * ktime_get_mono_fast_ns - Fast NMI safe access to clock monotonic
 *
 * This timestamp is not guaranteed to be monotonic across an update.
 * The timestamp is calculated by:
 *
 *      now = base_mono + clock_delta * slope
 *
 * So if the update lowers the slope, readers who are forced to the
 * not yet updated second array are still using the old steeper slope.
 *
 * tmono
 * ^
 * |    o  n
 * |   o n
 * |  u
 * | o
 * |o
 * |12345678---> reader order
 *
 * o = old slope
 * u = update
 * n = new slope
 *
 * So reader 6 will observe time going backwards versus reader 5.
 *
 * While other CPUs are likely to be able observe that, the only way
 * for a CPU local observation is when an NMI hits in the middle of
 * the update. Timestamps taken from that NMI context might be ahead
 * of the following timestamps. Callers need to be aware of that and
 * deal with it.
 */
u64 notrace ktime_get_mono_fast_ns(void)
{
        struct tk_read_base *tkr;
        unsigned int seq;
        u64 now;

        do {
                seq = raw_read_seqcount(&tk_fast_mono.seq);
                tkr = tk_fast_mono.base + (seq & 0x01);
                now = ktime_to_ns(tkr->base_mono) + timekeeping_get_ns(tkr);

        } while (read_seqcount_retry(&tk_fast_mono.seq, seq));
        return now;
}
EXPORT_SYMBOL_GPL(ktime_get_mono_fast_ns);

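Gets the monotonic clock value from the fast timekeeper. The fast timekeeper is intended for tracing; it is implemented locklessly and is used when the monotonic value must be obtained very quickly.

In code line 40, the sequence value is read.
In code lines 41~42, base[0] of the fast timekeeper is used if the sequence is even and base[1] if it is odd, and the monotonic value is calculated from it.
In code line 44, the loop repeats if the sequence value has changed.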
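
The interplay between the two-slot update and this reader can be modeled in a few lines. The sketch below is a single-threaded model only: it has no memory barriers, so it is not NMI-safe or SMP-safe like the kernel's seqcount latch, and `struct fast_sketch` and the `sketch_latch_*()` functions are hypothetical names. The update is split into two halves so that a read can be placed in the middle of it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical two-slot latch; single-threaded model, no memory barriers. */
struct fast_sketch {
        unsigned int seq;
        uint64_t base[2];   /* stands in for the two tk_read_base copies */
};

/* First half of the update: odd seq steers readers to base[1]. */
static void sketch_latch_begin(struct fast_sketch *f, uint64_t value)
{
        f->seq++;
        f->base[0] = value;
}

/* Second half: even seq steers readers to base[0], then base[1] follows. */
static void sketch_latch_end(struct fast_sketch *f)
{
        assert(f->seq & 0x01);   /* must follow sketch_latch_begin() */
        f->seq++;
        f->base[1] = f->base[0];
}

/* Reader: pick the slot from the low bit of seq, retry if seq moved. */
static uint64_t sketch_latch_read(const struct fast_sketch *f)
{
        unsigned int seq;
        uint64_t now;

        do {
                seq = f->seq;
                now = f->base[seq & 0x01];
        } while (seq != f->seq);
        return now;
}
```

A read issued between sketch_latch_begin() and sketch_latch_end() still returns the old but consistent value from base[1], which corresponds to the NMI case described in the kernel comment.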

Structures

timekeeper structure

include/linux/timekeeper_internal.h

/**
 * struct timekeeper - Structure holding internal timekeeping values.
 * @tkr_mono:           The readout base structure for CLOCK_MONOTONIC
 * @tkr_raw:            The readout base structure for CLOCK_MONOTONIC_RAW
 * @xtime_sec:          Current CLOCK_REALTIME time in seconds
 * @ktime_sec:          Current CLOCK_MONOTONIC time in seconds
 * @wall_to_monotonic:  CLOCK_REALTIME to CLOCK_MONOTONIC offset
 * @offs_real:          Offset clock monotonic -> clock realtime
 * @offs_boot:          Offset clock monotonic -> clock boottime
 * @offs_tai:           Offset clock monotonic -> clock tai
 * @tai_offset:         The current UTC to TAI offset in seconds
 * @clock_was_set_seq:  The sequence number of clock was set events
 * @cs_was_changed_seq: The sequence number of clocksource change events
 * @next_leap_ktime:    CLOCK_MONOTONIC time value of a pending leap-second
 * @raw_sec:            CLOCK_MONOTONIC_RAW  time in seconds
 * @monotonic_to_boot:  CLOCK_MONOTONIC to CLOCK_BOOTTIME offset
 * @cycle_interval:     Number of clock cycles in one NTP interval
 * @xtime_interval:     Number of clock shifted nano seconds in one NTP
 *                      interval.
 * @xtime_remainder:    Shifted nano seconds left over when rounding
 *                      @cycle_interval
 * @raw_interval:       Shifted raw nano seconds accumulated per NTP interval.
 * @ntp_error:          Difference between accumulated time and NTP time in ntp
 *                      shifted nano seconds.
 * @ntp_error_shift:    Shift conversion between clock shifted nano seconds and
 *                      ntp shifted nano seconds.
 * @last_warning:       Warning ratelimiter (DEBUG_TIMEKEEPING)
 * @underflow_seen:     Underflow warning flag (DEBUG_TIMEKEEPING)
 * @overflow_seen:      Overflow warning flag (DEBUG_TIMEKEEPING)
 *
 * Note: For timespec(64) based interfaces wall_to_monotonic is what
 * we need to add to xtime (or xtime corrected for sub jiffie times)
 * to get to monotonic time.  Monotonic is pegged at zero at system
 * boot time, so wall_to_monotonic will be negative, however, we will
 * ALWAYS keep the tv_nsec part positive so we can use the usual
 * normalization.
 *
 * wall_to_monotonic is moved after resume from suspend for the
 * monotonic time not to jump. We need to add total_sleep_time to
 * wall_to_monotonic to get the real boot based time offset.
 *
 * wall_to_monotonic is no longer the boot time, getboottime must be
 * used instead.
 *
 * @monotonic_to_boottime is a timespec64 representation of @offs_boot to
 * accelerate the VDSO update for CLOCK_BOOTTIME.
 */

struct timekeeper {
        struct tk_read_base     tkr_mono;
        struct tk_read_base     tkr_raw;
        u64                     xtime_sec;
        unsigned long           ktime_sec;
        struct timespec64       wall_to_monotonic;
        ktime_t                 offs_real;
        ktime_t                 offs_boot;
        ktime_t                 offs_tai;
        s32                     tai_offset;
        unsigned int            clock_was_set_seq;
        u8                      cs_was_changed_seq;
        ktime_t                 next_leap_ktime;
        u64                     raw_sec;
        struct timespec64       monotonic_to_boot;

        /* The following members are for timekeeping internal use */
        u64                     cycle_interval;
        u64                     xtime_interval;
        s64                     xtime_remainder;
        u64                     raw_interval;
        /* The ntp_tick_length() value currently being used.
         * This cached copy ensures we consistently apply the tick
         * length for an entire tick, as ntp_tick_length may change
         * mid-tick, and we don't want to apply that new value to
         * the tick in progress.
         */
        u64                     ntp_tick;
        /* Difference between accumulated time and NTP time in ntp
         * shifted nano seconds. */
        s64                     ntp_error;
        u32                     ntp_error_shift;
        u32                     ntp_err_mult;
        /* Flag used to avoid updating NTP twice with same second */
        u32                     skip_second_overflow;
#ifdef CONFIG_DEBUG_TIMEKEEPING
        long                    last_warning;
        /*
         * These simple flag variables are managed
         * without locks, which is racy, but they are
         * ok since we don't really care about being
         * super precise about how many events were
         * seen, just that a problem was observed.
         */
        int                     underflow_seen;
        int                     overflow_seen;
#endif
};

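tkr_mono
- The tk_read_base structure for reading CLOCK_MONOTONIC
- Renamed from tkr to tkr_mono.
  - Reference: time: Rename timekeeper::tkr to timekeeper::tkr_mono (2015, v4.1-rc1)
tkr_raw
- The tk_read_base structure for reading CLOCK_MONOTONIC_RAW
- Reference: time: Add timerkeeper::tkr_raw (2015, v4.1-rc1)
xtime_sec
- Current CLOCK_REALTIME time in seconds
ktime_sec
- Current CLOCK_MONOTONIC time in seconds
wall_to_monotonic
- The value added to the real (wall) time to convert it to monotonic time; it usually holds a negative value.
- realtime + wall_to_monotonic = monotonic
offs_real
- The value added to monotonic time to obtain the real time; it usually holds a positive value.
- monotonic + offs_real = realtime
offs_boot
- The value added to monotonic time to obtain the boottime; it usually holds a positive value.
- Unlike monotonic, boottime also advances while the system is suspended.
- monotonic + offs_boot = boottime
offs_tai
- The value added to monotonic time to obtain TAI (International Atomic Time); it usually holds a positive value.
- monotonic + offs_tai = tai
tai_offset
- The current offset in seconds for converting UTC -> TAI
clock_was_set_seq
- Sequence number of clock-was-set events
next_leap_ktime
- CLOCK_MONOTONIC time of a pending leap second
raw_sec
- CLOCK_MONOTONIC_RAW time in seconds
cycle_interval
- Number of clocksource cycles in one NTP interval
- When update_wall_time() is called, the cycles read from the clocksource counter are accumulated into timekeeping only once they reach this value.
- e.g.) On a 100 Hz system it holds the number of cycles corresponding to 1E7 ns.
xtime_interval
- The nanoseconds of one NTP interval, shifted left by shift
xtime_remainder
- The shifted nanoseconds left over when cycle_interval is rounded to a whole number of cycles (the shifted NTP interval minus xtime_interval)
raw_interval
- Shifted raw nanoseconds accumulated per NTP interval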
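
How cycle_interval, xtime_interval and xtime_remainder relate to each other can be shown with a short calculation. This is a sketch modeled loosely on what tk_setup_internals() computes, not a copy of it; `struct interval_sketch` and `sketch_intervals()` are hypothetical names, and the sample values (54 MHz, mult 310,689,185, shift 24, HZ=250) are assumptions taken from the xtime example earlier in this article.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical holder for the three interval fields. */
struct interval_sketch {
        uint64_t cycle_interval;   /* cycles per NTP interval */
        uint64_t xtime_interval;   /* shifted ns per NTP interval */
        int64_t  xtime_remainder;  /* shifted ns lost to rounding */
};

static struct interval_sketch sketch_intervals(uint64_t ntp_interval_ns,
                                               uint32_t mult, uint32_t shift)
{
        struct interval_sketch out;
        uint64_t ntpinterval;

        assert(mult != 0 && shift < 32);

        /* One NTP interval expressed in shifted nanoseconds. */
        ntpinterval = ntp_interval_ns << shift;

        /* Round to the nearest whole number of cycles, at least 1. */
        out.cycle_interval = (ntpinterval + mult / 2) / mult;
        if (out.cycle_interval == 0)
                out.cycle_interval = 1;

        out.xtime_interval = out.cycle_interval * mult;
        out.xtime_remainder = (int64_t)(ntpinterval - out.xtime_interval);
        return out;
}
```

For a 4,000,000 ns interval this gives cycle_interval = 216,000, xtime_interval = 67,108,863,960,000 and xtime_remainder = 40,000 shifted nanoseconds, which is the small error that the NTP adjustment code later corrects.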

tk_read_base structure

include/linux/timekeeper_internal.h

/**
 * struct tk_read_base - base structure for timekeeping readout
 * @clock:      Current clocksource used for timekeeping.
 * @mask:       Bitmask for two's complement subtraction of non 64bit clocks
 * @cycle_last: @clock cycle value at last update
 * @mult:       (NTP adjusted) multiplier for scaled math conversion
 * @shift:      Shift value for scaled math conversion
 * @xtime_nsec: Shifted (fractional) nano seconds offset for readout
 * @base:       ktime_t (nanoseconds) base time for readout
 * @base_real:  Nanoseconds base value for clock REALTIME readout
 *
 * This struct has size 56 byte on 64 bit. Together with a seqcount it
 * occupies a single 64byte cache line.
 *
 * The struct is separate from struct timekeeper as it is also used
 * for a fast NMI safe accessors.
 *
 * @base_real is for the fast NMI safe accessor to allow reading clock
 * realtime from any context.
 */

struct tk_read_base {
        struct clocksource      *clock;
        u64                     mask;
        u64                     cycle_last;
        u32                     mult;
        u32                     shift;
        u64                     xtime_nsec;
        ktime_t                 base;
        u64                     base_real;
};

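*clock
- The clocksource used by the timekeeper
(*read)
- The clocksource's cycle read function (a member in older kernel versions; it is not part of the structure listed above)
mask
- Bitmask applied to the cycle delta so that counters narrower than 64 bits wrap correctly
cycle_last
- The cycle value at the last update
mult/shift
- Nanoseconds are obtained by multiplying the cycles by mult and shifting right by shift
xtime_nsec
- Stored as nanoseconds << shift.
base
- ktime_t based monotonic base time
base_real
- Nanosecond base value for reading the realtime clock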
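
The way these members combine during a readout can be sketched as follows. This is a simplified illustration, not the kernel's accessor: `struct read_sketch` and `sketch_get_ns()` are hypothetical names, the clocksource pointer is replaced by a cycle value passed in by the caller, and the 54 MHz / shift 24 numbers are assumptions from the xtime example earlier in this article.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of struct tk_read_base (no clocksource pointer). */
struct read_sketch {
        uint64_t mask;
        uint64_t cycle_last;
        uint32_t mult;
        uint32_t shift;
        uint64_t xtime_nsec;   /* ns << shift */
};

/* Nanoseconds elapsed since the last update for counter value `cycles`. */
static uint64_t sketch_get_ns(const struct read_sketch *tkr, uint64_t cycles)
{
        uint64_t delta, nsec;

        assert(tkr->shift < 64);

        /* Masked subtraction handles wrap of counters narrower than 64 bits. */
        delta = (cycles - tkr->cycle_last) & tkr->mask;

        nsec = delta * tkr->mult + tkr->xtime_nsec;
        return nsec >> tkr->shift;
}
```

With mult = 310,689,185 and shift = 24, a delta of 216,000 cycles reads back as 3,999,999 ns, and a 32-bit counter that wrapped from 0xfffffff0 to 0x10 still yields a delta of 32 cycles thanks to the mask.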

References

Timer -1- (Lowres Timer) | 문c
Timer -2- (HRTimer) | 문c
Timer -3- (Clock Sources Subsystem) | 문c
Timer -4- (Clock Sources Watchdog) | 문c
Timer -5- (Clock Events Subsystem) | 문c
Timer -6- (Clock Source & Timer Driver) | 문c
Timer -7- (Sched Clock & Delay Timers) | 문c
Timer -8- (Timecounter) | 문c
Timer -9- (Tick Device) | 문c
Timer -10- (Timekeeping) | 문c (this post)
Timer -11- (Posix Clock & Timers) | 문c
time_init() | 문c
sched_clock_postinit() | 문c
tick_init() | 문c
calibrate_delay() | 문c

The Raspberry Pi as a Stratum-1 NTP Server
