kprobe

리눅스 커널을 위한 디버깅 메커니즘이며 실운영 시스템 안에서 이벤트 모니터링을 위해 사용되어지기도 한다.

kprobes1

kprobes2

include/linux/kprobes.h

struct kprobe {
        struct hlist_node hlist;
        struct list_head list;
        unsigned long nmissed;
        kprobe_opcode_t *addr;
        const char *symbol_name;
        unsigned int offset;
        kprobe_pre_handler_t pre_handler;
        kprobe_post_handler_t post_handler;
        kprobe_fault_handler_t fault_handler;
        kprobe_break_handler_t break_handler;
        kprobe_opcode_t opcode;
        struct arch_specific_insn ainsn;
        u32 flags;
};

 

typedef int (*kprobe_pre_handler_t)(struct kprobe*, struct pt_regs*);
typedef void (*kprobe_post_handler_t)(struct kprobe*, struct pt_regs*, 
              unsigned long flags);
typedef int (*kprobe_fault_handler_t)(struct kprobe*, struct pt_regs*, 
             int trapnr);

int register_kprobe(struct kprobe *p);
int unregister_kprobe(struct kprobe *p);

 

참고

kprobes 동작 방식(1)
kprobes 동작 방식(2)
An introduction to KProbes
arm에서 kprobes 사용하기(1)
arm에서 kprobes 사용하기(2)

 

Ftrace

리눅스 커널의 내부 동작을 trace

주요 기능

  • 1) dynamic kernel function tracing
    .stack usage, call graph, counting, filters
  • 2) latency tracing
    .interrupts, wake-up
  • 3) static tracepoints를 사용하여 event tracing
    .scheduling, interrupts, file systems, virtual guest connections

특징

  • debugfs 파일시스템을 /sys/kernel/debug에 마운트하여 사용
  • gcc 컴파일러의 프로파일링 옵션(-pg)로 빌드된 커널 사용
  • linux kernel 2.6.27에서 mainline에 합류
  • 적은(10%) overhead
  • buffered(trace) vs live tracing(trace_pipe)
  • trace source로 tracepoints, kprobes를 사용

Tracer Plugins

  • function: 임의의 기간동안 커널 내부 함수의 호출관계를 분석
  • function_graph: 그래프 형식의 함수 관계 분석
  • wakeup, wakeup_dl, wakeup_rt: wake up latency 분석
  • mmiotrace: 메모리 맵의 I/O에 대한 분석
  • irqsoff: interrupt latency 분석
  • nop: 디버깅 기능 off

preemptoff, preemptirqsoff 기능은 커널 3.0.x 에서 제거됨.
sched_switch의 경우 deprecated되어 wakeup 기능으로 동작

1) Function Tracing

function filter: 모든 함수 → 특정 함수

# cd /sys/kernel/debug/tracing
# cat set_ftrace_filter
#### all functions enabled ####

# echo schedule_tail >> set_ftrace_filter
# cat set_ftrace_filter
schedule_tail

# echo 'sched*' > set_ftrace_filter
# cat set_ftrace_filter
schedule_on_each_cpu
sched_feat_open
sched_feat_show
(…)

 

제외시킬 function: 없음 → 특정 함수

# cat set_ftrace_notrace
#### no functions disabled ####

# cat set_ftrace_notrace
xen_pte_unlock
update_persistent_clock
read_persistent_clock
(…)

 

사용가능한 tracer 목록

# cat current_tracer
nop

# cat available_tracers
blk mmiotrace function_graph wakeup_dl wakeup_rt wakeup function nop

 

사용가능한 function 목록

# cat available_filter_functions
run_init_process
try_to_run_init_process
do_one_initcall
match_dev_by_uuid
rootfs_mount
name_to_dev_t
init_linuxrc
init_linuxrc
calibration_delay_done
calibrate_delay
(...)

 

function trace

# cat trace | head -30
# tracer: function
# entries-in-buffer/entries-written: 62862/62862   #P:4
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
            bash-11785 [000] d... 172870.350417: sched_stat_sleep: comm=rcuos/0 pid=9 ...
            bash-11785 [000] d... 172870.350418: sched_wake_idle_without_ipi: cpu=1
            bash-11785 [000] d... 172870.350419: sched_wakeup: comm=rcuos/0 pid=9 ...
          <idle>-0     [001] .N.. 172870.350436: schedule <-schedule_preempt_disabled
          <idle>-0     [001] dN.. 172870.350438: sched_stat_wait: comm=rcuos/0 pid=9 ...

 

function_graph trace

# cat trace | head -30
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 3) + 11.008 us   |  }
 1)               |  schedule() {
 1)               |  /* sched_stat_wait: comm=rcuos/0 pid=9 delay=0 [ns] */
 1)               |  /* sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 ...
 ------------------------------------------
 1)    <idle>-0    =>   rcuos/0-9 
 ------------------------------------------

 

2) Latency Tracing

wakeup

# cat trace
# tracer: wakeup
#
# wakeup latency trace v1.1.5 on 4.0.5
# --------------------------------------------------------------------
# latency: 14 us, #75/75, CPU#0 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:4)
#    -----------------
#    | task: watchdog/0-12 (uid:0 nice:0 policy:1 rt_prio:99)
#    -----------------
#
#                  _------=> CPU#
#                 / _-----=> irqs-off
#                | / _----=> need-resched
#                || / _---=> hardirq/softirq
#                ||| / _--=> preempt-depth
#                |||| /     delay
#  cmd     pid   ||||| time  |   caller
#     \   /      |||||  \    |   /
  <idle>-0       0dNh.    0us :      0:120:R   + [000]    12:  0:R watchdog/0
  <idle>-0       0dNh.    0us : ttwu_do_activate.constprop.92 <-try_to_wake_up
  <idle>-0       0dNh.    1us : task_woken_rt <-ttwu_do_wakeup
  <idle>-0       0dNh.    1us : _raw_spin_unlock <-try_to_wake_up
  <idle>-0       0dNh.    1us : ttwu_stat <-try_to_wake_up
  <idle>-0       0dNh.    1us : _raw_spin_unlock_irqrestore <-try_to_wake_up
  <idle>-0       0dNh.    1us : ktime_get <-watchdog_timer_fn
  <idle>-0       0dNh.    1us : hrtimer_forward <-watchdog_timer_fn
  <idle>-0       0dNh.    2us : _raw_spin_lock <-__run_hrtimer

 

wake_dl

# cat trace
# tracer: wakeup_dl
#
# wakeup_dl latency trace v1.1.5 on 4.0.5
# --------------------------------------------------------------------
# latency: 665155 us, #0/0, CPU#2 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:4)
#    -----------------
#    | task: Xorg-840 (uid:0 nice:0 policy:0 rt_prio:0)
#    -----------------
#
#                  _------=> CPU#
#                 / _-----=> irqs-off
#                | / _----=> need-resched
#                || / _---=> hardirq/softirq
#                ||| / _--=> preempt-depth
#                |||| /     delay
#  cmd     pid   ||||| time  |   caller
#     \   /      |||||  \    |   /

 

wakeup_rt

# cat trace
# tracer: wakeup_rt
#
# wakeup_rt latency trace v1.1.5 on 4.0.5
# --------------------------------------------------------------------
# latency: 18 us, #75/75, CPU#1 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:4)
#    -----------------
#    | task: irq/51-iwlwifi-423 (uid:0 nice:0 policy:1 rt_prio:50)
#    -----------------
#
#                  _------=> CPU#
#                 / _-----=> irqs-off
#                | / _----=> need-resched
#                || / _---=> hardirq/softirq
#                ||| / _--=> preempt-depth
#                |||| /     delay
#  cmd     pid   ||||| time  |   caller
#     \   /      |||||  \    |   /
  <idle>-0       1dNh.    0us :      0:120:R   + [001]   423: 49:R irq/51-iwlwifi
  <idle>-0       1dNh.    0us : ttwu_do_activate.constprop.92 <-try_to_wake_up
  <idle>-0       1dNh.    1us : task_woken_rt <-ttwu_do_wakeup
  <idle>-0       1dNh.    1us : _raw_spin_unlock <-try_to_wake_up
  <idle>-0       1dNh.    1us : ttwu_stat <-try_to_wake_up
  <idle>-0       1dNh.    1us : _raw_spin_unlock_irqrestore <-try_to_wake_up
  <idle>-0       1dNh.    1us : add_interrupt_randomness <-handle_irq_event_percpu

 

3) Event Tracing

event 디렉토리

 

# ls events/
asoc              filemap       irq_vectors    mac80211_msg  random        sock      vsyscall
block             ftrace        iwlwifi        mce           ras           spi       workqueue
cfg80211          gpio          iwlwifi_data   migrate       raw_syscalls  swiotlb   writeback
compaction        hda           iwlwifi_io     module        rcu           syscalls  xen
context_tracking  hda_intel     iwlwifi_msg    napi          regmap        task      xhci-hcd
drm               header_event  iwlwifi_ucode  net           regulator     thermal
enable            header_page   jbd2           nmi           rpm           timer
exceptions        i2c           kmem           oom           sched         tlb
ext4              i915          kvm            pagemap       scsi          udp
fence             iommu         kvmmmu         power         signal        v4l2
filelock          irq           mac80211       printk        skb           vmscan

# ls events/sched/
enable                  sched_process_exec  sched_stat_iowait   sched_wait_task
filter                  sched_process_exit  sched_stat_runtime  sched_wake_idle_without_ipi
sched_kthread_stop      sched_process_fork  sched_stat_sleep    sched_wakeup
sched_kthread_stop_ret  sched_process_free  sched_stat_wait     sched_wakeup_new
sched_migrate_task      sched_process_hang  sched_stick_numa
sched_move_numa         sched_process_wait  sched_swap_numa
sched_pi_setprio        sched_stat_blocked  sched_switch

# ls events/sched/sched_wakeup
enable  filter  format  id  trigger

 

event 목록

# cat available_events | head -20
mac80211:drv_return_void
mac80211:drv_return_int
mac80211:drv_return_bool
mac80211:drv_return_u32
mac80211:drv_return_u64
mac80211:drv_start
mac80211:drv_get_et_strings
mac80211:drv_get_et_sset_count
mac80211:drv_get_et_stats
mac80211:drv_suspend
mac80211:drv_resume
mac80211:drv_set_wakeup
mac80211:drv_stop
mac80211:drv_add_interface
mac80211:drv_change_interface
mac80211:drv_remove_interface
mac80211:drv_config
mac80211:drv_bss_info_changed
mac80211:drv_prepare_multicast
mac80211:drv_configure_filter

 

4) Stack Tracing

# echo 1 > /proc/sys/kernel/stack_tracer_enabled
# cat stack_trace
        Depth    Size   Location    (16 entries)
        -----    ----   --------
  0)     2824      16   update_group_capacity+0x2c/0x270
  1)     2808       8   msecs_to_jiffies+0x5/0x30
  2)     2800      64   update_group_capacity+0x2c/0x270
  3)     2736     384   find_busiest_group+0xfb/0x940
  4)     2352     288   load_balance+0x1ac/0x8f0
  5)     2064     128   pick_next_task_fair+0x488/0x8a0
  6)     1936      80   __schedule+0x116/0xa10
  7)     1856       8   schedule+0x37/0x90
  8)     1848     184   schedule_hrtimeout_range_clock.part.26+0xfc/0x120
  9)     1664      16   schedule_hrtimeout_range_clock+0x19/0x50
 10)     1648      16   schedule_hrtimeout_range+0x13/0x20
 11)     1632      24   poll_schedule_timeout+0x49/0x70
 12)     1608     952   do_select+0x57c/0x7d0
 13)      656     416   core_sys_select+0x1c7/0x2f0
 14)      240     112   SyS_select+0xab/0x100
 15)      128     128   system_call_fastpath+0x16/0x1b

 

5) Function profiling

# echo 1 > function_profile_enabled
# cat trace_stat/function0 | head -15
  Function                               Hit    Time            Avg             s^2
  --------                               ---    ----            ---             ---
  __schedule                           10960    647848540 us     59110.26 us     2099595 us
  schedule                              6499    647692443 us     99660.32 us     4087623 us
  schedule_hrtimeout_range              1021    461369396 us     451879.9 us     5813184 us
  schedule_hrtimeout_range_clock        1021    461369211 us     451879.7 us     5763201 us
  poll_schedule_timeout                 1016    454877997 us     447714.5 us     8181934 us
  SyS_poll                               340    332641192 us     978356.4 us     105339792 us
  do_sys_poll                            340    332640884 us     978355.5 us     105204044 us
  schedule_hrtimeout_range_clock         138    188387417 us     1365126 us     582984815 us
  SyS_select                            1720    122266989 us     71085.45 us     782863.5 us
  core_sys_select                       1720    122266182 us     71084.98 us     653943.4 us
  do_select                             1720    122265404 us     71084.53 us     556184.2 us
  schedule_timeout                        72    51148556 us     710396.6 us     1317174442 us
  SyS_read                              1249    31128240 us     24922.53 us     2673959 us

* 결과 시간에서 sleep time을 제거하고 출력하려면 “echo 0 > options/sleep-time”을 사용한다.

 

참고

arm에서 kprobes 사용하기(1)
arm에서 kprobes 사용하기(2)
Function Tracer
FTrace

TRACE_EVENT

TRACE_EVENT 특징

– Static Kernel Tracing: 커널 코드안에서 static probe point가 있다.
– TRACE_EVENT 매크로는 low overhead 성능을 위해 만들어졌다.
– ftrace는 물론 perf, LTTng와 SystemTap에서도 사용된다.

TRACE_EVENT 매크로 사용

  • 커널 코드내부에 tracepoint를 만들어야 한다.
  • tracepoint를 hook하는 callback 펑션을 만들어야 한다.
  • callback 펑션은 tracer ring buffer를 통해 가능한 빨리 데이터 기록.
  • 기록된 데이터는 사람이 읽을 수 있는 데이터 형태를 갖어야 한다.
TRACE_EVENT(name, proto, args, struct, assign, print)

TRACE_EVENT 매크로 인수

.name

  • tracepoint 이름
  • trace_ 접두어로 함수가 만들어진다.
    TRACE_EVENT(sched_switch,

     

.prototype

  • tracepoint callbacks을 위한 prototype
    TP_PROTO(struct rq *rq, struct task_struct *prev, struct task_struct *next)
    trace_sched_switch(struct rq *rq, struct task_struct *prev, struct task_struct *next);

     

.args

  • prototype에 매치되는 실제 데이터 인수
    TP_ARGS(rq, prev, next),
      #define TRACE_POINT(name, proto, args) \
           void trace_##name(proto)            \
           {                                   \
                   if (trace_##name##_active)  \
                           callback(args);     \
           }

     

.struct (optional)

  • tracepoint 로 전달될 데이터가 저장될 구조체
    TP_STRUCT__entry(
    		__array(	char,	prev_comm,	TASK_COMM_LEN	)
    		__field(	pid_t,	prev_pid			)
    		__field(	int,	prev_prio			)
    		__field(	long,	prev_state			)
    		__array(	char,	next_comm,	TASK_COMM_LEN	)
    		__field(	pid_t,	next_pid			)
    		__field(	int,	next_prio			)
        ),
      struct {
    	      char   prev_comm[TASK_COMM_LEN];
    	      pid_t  prev_pid;
    	      int    prev_prio;
    	      long   prev_state;
    	      char   next_comm[TASK_COMM_LEN];
    	      pid_t  next_pid;
    	      int    next_prio;
        };

     

.assign

  • ring buffer에 연결된 구조체에 데이터를 옮기기 위해 사용
     TP_fast_assign(
    		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
    		__entry->prev_pid	= prev->pid;
    		__entry->prev_prio	= prev->prio;
    		__entry->prev_state	= prev->state;
    		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
    		__entry->next_pid	= next->pid;
    		__entry->next_prio	= next->prio;
        ),

     

.print

  • 사람이 읽을 수 있는 ASCII 형태 출력
    TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s ==> " \
     		  "next_comm=%s next_pid=%d next_prio=%d",
    		__entry->prev_comm, __entry->prev_pid, __entry->prev_prio,
    		__entry->prev_state ?
    		  __print_flags(__entry->prev_state, "|",
    				{ 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" },
    				{ 16, "Z" }, { 32, "X" }, { 64, "x" },
    				{ 128, "W" }) : "R",
    		__entry->next_comm, __entry->next_pid, __entry->next_prio)

 format 화일

#cat /sys/kernel/debug/tracing/events/sched/sched_switch/format
name: sched_switch
ID: 264
format:
        field:unsigned short common_type;       offset:0;       size:2; signed:0;
        field:unsigned char common_flags;       offset:2;       size:1; signed:0;
        field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
        field:int common_pid;   offset:4;       size:4; signed:1;

        field:char prev_comm[16];       offset:8;       size:16;        signed:1;
        field:pid_t prev_pid;   offset:24;      size:4; signed:1;
        field:int prev_prio;    offset:28;      size:4; signed:1;
        field:long prev_state;  offset:32;      size:8; signed:1;
        field:char next_comm[16];       offset:40;      size:16;        signed:1;
        field:pid_t next_pid;   offset:56;      size:4; signed:1;
        field:int next_prio;    offset:60;      size:4; signed:1;

print fmt: "prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d", REC->prev_comm, REC->prev_pid, REC->prev_prio, REC->prev_state & (1024-1) ? __print_flags(REC->prev_state & (1024-1), "|", { 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" }, { 16, "Z" }, { 32, "X" }, { 64, "x" }, { 128, "K" }, { 256, "W" }, { 512, "P" }) : "R", REC->prev_state & 1024 ? "+" : "", REC->next_comm, REC->next_pid, REC->next_prio

 

trace 함수 호출 예)

  static inline void
   context_switch(struct rq *rq, struct task_struct *prev,
	          struct task_struct *next)
   {
	   struct mm_struct *mm, *oldmm;

	   prepare_task_switch(rq, prev, next);
	   trace_sched_switch(rq, prev, next);
	   mm = next->mm;
	   oldmm = prev->active_mm;
[04:37:30.629091416] (+0.000050732) sched_switch: { cpu_id = 2 }, { prev_comm = "swapper/2", prev_tid = 0, prev_prio = 20,
prev_state = 0, next_comm = "lttng", next_tid = 8347, next_prio = 20 }

 

TRACE_EVENT 정의된 헤더 화일

TRACE_EVENT(sched_switch,

        TP_PROTO(struct task_struct *prev,
                 struct task_struct *next),

        TP_ARGS(prev, next),

        TP_STRUCT__entry(
                __array(        char,   prev_comm,      TASK_COMM_LEN   )
                __field(        pid_t,  prev_pid                        )
                __field(        int,    prev_prio                       )
                __field(        long,   prev_state                      )
                __array(        char,   next_comm,      TASK_COMM_LEN   )
                __field(        pid_t,  next_pid                        )
                __field(        int,    next_prio                       )
        ),

        TP_fast_assign(
                memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
                __entry->prev_pid       = prev->pid;
                __entry->prev_prio      = prev->prio;
                __entry->prev_state     = __trace_sched_switch_state(prev);
                memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
                __entry->next_pid       = next->pid;
                __entry->next_prio      = next->prio;
        ),

        TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d",
                __entry->prev_comm, __entry->prev_pid, __entry->prev_prio,
                __entry->prev_state & (TASK_STATE_MAX-1) ?
                  __print_flags(__entry->prev_state & (TASK_STATE_MAX-1), "|",
                                { 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" },
                                { 16, "Z" }, { 32, "X" }, { 64, "x" },
                                { 128, "K" }, { 256, "W" }, { 512, "P" }) : "R",
                __entry->prev_state & TASK_STATE_MAX ? "+" : "",
                __entry->next_comm, __entry->next_pid, __entry->next_prio)
);

 

get_random_bytes() 에서의 사용 예)

drivers/char/random.c – get_random_bytes()

trace_get_random_bytes(nbytes, _RET_IP_);

 

include/trace/events/random.h

DEFINE_EVENT(random__get_random_bytes, get_random_bytes,
	TP_PROTO(int nbytes, unsigned long IP),
           TP_ARGS(nbytes, IP)
);

DEFINE_EVENT를 통해 아래의 함수들이 생성됨

extern struct tracepoint __tracepoint_get_random_bytes;
static inline void trace_get_random_bytes();
static inline void trace_get_random_bytes_rcuidle();
static inline int register_trace_get_random_bytes();
static inline int unregister_trace_get_random_bytes();
static inline void check_trace_callback_type_get_random_bytes();
static inline bool trace_get_random_bytes_enabled();

 

include/linux/tracepoint.h

struct tracepoint {
        const char *name;               /* Tracepoint name */
        struct static_key key;
        void (*regfunc)(void);
        void (*unregfunc)(void);
        struct tracepoint_func __rcu *funcs;
};
#define DEFINE_EVENT(template, name, proto, args)		\
        DECLARE_TRACE(name, PARAMS(proto), PARAMS(args))
#define TRACE_EVENT(name, proto, args, struct, assign, print)   	\
        DECLARE_TRACE(name, PARAMS(proto), PARAMS(args))
#define DECLARE_TRACE(name, proto, args)			\
                	__DECLARE_TRACE(name, PARAMS(proto), 		\
		PARAMS(args), 1,   					\
              	PARAMS(void *__data, proto),            		\
                     PARAMS(__data, args))

 

include/linux/tracepoint.h

#define __DECLARE_TRACE(name, proto, args, cond, data_proto, data_args) \
        extern struct tracepoint __tracepoint_##name;          		\
        static inline void trace_##name(proto)                          	\
        {                                                               			\
                if (static_key_false(&__tracepoint_##name.key))         	\
                        __DO_TRACE(&__tracepoint_##name,                	\
                                TP_PROTO(data_proto),                   		\
                                TP_ARGS(data_args),                     		\
                                TP_CONDITION(cond),,);                  	\
                if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) {         	\
                        rcu_read_lock_sched_notrace();                  	\
                        rcu_dereference_sched(__tracepoint_##name.funcs);\
                        rcu_read_unlock_sched_notrace();                	\
                }                                                       			\
        }                                                               			\
        __DECLARE_TRACE_RCU(name, PARAMS(proto), PARAMS(args), \
                PARAMS(cond), PARAMS(data_proto),PARAMS(data_args)) \
        static inline int                                               		\
        register_trace_##name(void (*probe)(data_proto), void *data)	\
        {                                                               			\
                return tracepoint_probe_register(&__tracepoint_##name, \
                                                (void *)probe, data);   		\
        }                                                               			\
        static inline int                                               		\
        unregister_trace_##name(void (*probe)(data_proto), void *data)  \
        {                                                               			\
                return tracepoint_probe_unregister(&__tracepoint_##name,\
                                                (void *)probe, data);   		\
        }                                                               			\
        static inline void                                              		\
        check_trace_callback_type_##name(void (*cb)(data_proto)) 	\
        {                                                               			\
        }                                                               			\
        static inline bool                                              		\
        trace_##name##_enabled(void)                                    	\
        {                                                               			\
                return static_key_false(&__tracepoint_##name.key);  	\
        }

CONFIG_TRACEPOINTS 옵션을 사용하여 커널을 빌드 시 위와 같은 함수로 동작하고 그렇지 않은 경우 빈함수가 만들어지고 에러를 리턴한다.

#define __DO_TRACE(tp, proto, args, cond, prercu, postrcu)      	\
        do {                                                            		\
                struct tracepoint_func *it_func_ptr;                    		\
                void *it_func;                                          		\
                void *__data;                                           		\
                                                                        			\
                if (!(cond))                                            		\
                        return;                                         		\
                prercu;                                                 		\
                rcu_read_lock_sched_notrace();                          	\
                it_func_ptr = rcu_dereference_sched((tp)->funcs);       	\
                if (it_func_ptr) {                                      		\
                        do {                                            		\
                                it_func = (it_func_ptr)->func;          		\
                                __data = (it_func_ptr)->data;           		\
                                ((void(*)(proto))(it_func))(args);      		\
                        } while ((++it_func_ptr)->func);                		\
                }                                                       			\
                rcu_read_unlock_sched_notrace();                        	\
                postrcu;                                                		\
        } while (0)

이벤트 분석

# cat events/sched/enable
0

# echo 1 > events/sched/enable
# cat set_event
sched:sched_wake_idle_without_ipi
sched:sched_swap_numa
sched:sched_stick_numa
sched:sched_move_numa
sched:sched_process_hang
sched:sched_pi_setprio
sched:sched_stat_runtime
sched:sched_stat_blocked
sched:sched_stat_iowait
sched:sched_stat_sleep
sched:sched_stat_wait
sched:sched_process_exec
sched:sched_process_fork
sched:sched_process_wait
sched:sched_wait_task
sched:sched_process_exit
sched:sched_process_free
sched:sched_migrate_task
sched:sched_switch
sched:sched_wakeup_new
sched:sched_wakeup
sched:sched_kthread_stop_ret
sched:sched_kthread_stop

# ls events/sched/
enable                  sched_process_exec  sched_stat_iowait   sched_wait_task
filter                  sched_process_exit  sched_stat_runtime  sched_wake_idle_without_ipi
sched_kthread_stop      sched_process_fork  sched_stat_sleep    sched_wakeup
sched_kthread_stop_ret  sched_process_free  sched_stat_wait     sched_wakeup_new
sched_migrate_task      sched_process_hang  sched_stick_numa
sched_move_numa         sched_process_wait  sched_swap_numa
sched_pi_setprio        sched_stat_blocked  sched_switch

사용 가능한 이벤트 조회

# cat /sys/kernel/debug/tracing/available_events
mac80211:drv_return_void
mac80211:drv_return_int
(...)
irq_vectors:thermal_apic_exit
irq_vectors:thermal_apic_entry
irq_vectors:threshold_apic_exit
irq_vectors:threshold_apic_entry
irq_vectors:call_function_single_exit
irq_vectors:call_function_single_entry
irq_vectors:call_function_exit
irq_vectors:call_function_entry
irq_vectors:irq_work_exit
irq_vectors:irq_work_entry
irq_vectors:x86_platform_ipi_exit
irq_vectors:x86_platform_ipi_entry
irq_vectors:error_apic_exit
irq_vectors:error_apic_entry
irq_vectors:spurious_apic_exit
irq_vectors:spurious_apic_entry
irq_vectors:reschedule_exit
irq_vectors:reschedule_entry
irq_vectors:local_timer_exit
irq_vectors:local_timer_entry
nmi:nmi_handler
syscalls:sys_exit_mmap
syscalls:sys_enter_mmap
vsyscall:emulate_vsyscall
raw_syscalls:sys_exit
raw_syscalls:sys_enter
mce:mce_record
tlb:tlb_flush
exceptions:page_fault_kernel
exceptions:page_fault_user
syscalls:sys_exit_unshare
syscalls:sys_enter_unshare
syscalls:sys_exit_set_tid_address
syscalls:sys_enter_set_tid_address
task:task_rename
task:task_newtask
(...)

 

참고

 

 

Kernel Tracing

Tools

Trace Tools

  • SystemTap
  • LTTng
  • GDB
  • ktap
  • dtrace4linux
  • Oracle Linux Dtrace
  • sysdig

Trace frameworks

  • Ftrace
  • perf_events (perf)
  • eBPF

Trace source

  • tracepoints – kernel static tracing
  • kprobes – kernel dynamic tracing
  • uprobes – user-level dynamic tracing

Trace history

  • 초기: Static tracers, prototype dynamic tracers
  • 2004: Linux kprobes (2.6.9)
  • 2005: Solaris DTrace (s10)
  • 2008: Linux ftrace (2.6.27)
  • 2009: Linux perf (2.6.31)
  • 2009 tracepoints (2.6.32)
  • 2010~2014: ftrace & perf_events enhancements
  • 2014: eBPF patches
    (참고: Brendan Gregg – Linux Performance Analysis: New Tools and Old Secrets)

 

Kernel Option

Kernel hacking  --→
	[*] Tracers  --→
		-*-   Kernel Function Tracer                     
		[*]     Kernel Function Graph Tracer             
		[*]   Interrupts-off Latency Tracer              
		[ ]   Preemption-off Latency Tracer              
		[*]   Scheduling Latency Tracer                  
		[ ]   Trace syscalls                             
		-*-   Create a snapshot trace buffer             
		-*-     Allow snapshot to swap per CPU            
   	            Branch Profiling (No branch profiling)  --->
		[*]   Trace max stack                            
		[*]   Support for tracing block IO actions       
		[ ]   Enable kprobes-based dynamic events        
		[ ]   Enable uprobes-based dynamic events        
		[*]   enable/disable function tracing dynamically
		[*]   Kernel function profiler                  
		[ ]   Perform a startup test on ftrace          
		[ ]   Add tracepoint that benchmarks tracepoints
		< >   Ring buffer benchmark stress tester
		[ ]   Ring buffer startup self test

Dependency

  • 프로파일링: 커널 빌드 시 gcc의 -pg 옵션을 사용하여 추가적인 프로파일링 코드를 생성하고 이를 통해 함수들의 분석을 돕는다.
  • mount -t debugfs none /sys/kernel/debug
  • CONFIG_DEBUG_FS
  • CONFIG_FUNCTION_PROFILER
  • CONFIG_FTRACE
  • CONFIG_KPROBES (dynamic kernel tracing)

Visual Tools

KernelShark

visual_tools1

Trace Compass

visual_tools2

perf CPU Flame Graph

visual_tools3

perf Block I/O Latency Heat Map

visual_tools4

LTTviewer

visual_tools5

참고

Kernel Debugging

kernel menuconfig

Kernel hacking
    printk and dmesg options  --->
    Compile-time checks and compiler options  --->
-*- Magic SysRq key
(0x1) Enable magic SysRq key functions by default
-*- Kernel debugging
    Memory Debugging  --->
[ ] Debug shared IRQ handlers
    Debug Lockups and Hangs  --->
[ ] Panic on Oops
(0) panic timeout
[*] Collect scheduler debugging info
[ ] Collect scheduler statistics
[ ] Detect stack corruption on calls to schedule()
[*] Collect kernel timers statistics
[ ] Debug preemptible kernel
    Lock Debugging (spinlocks, mutexes, etc...)  --->
-*- Stack backtrace support
[ ] kobject debugging
[*] Verbose BUG() reporting (adds 70K)
[ ] Debug linked list manipulation
[ ] Debug priority linked list manipulation
[ ] Debug SG table operations
[ ] Debug notifier call chains
[ ] Debug credential management
    RCU Debugging  --->
[ ] Force extended block device numbers and spread them
< > Notifier error injection
[ ] Fault-injection framework
[*] Tracers  --->
    Runtime Testing  --->
[ ] Enable debugging of DMA-API usage
< > Test module loading with 'hello world' module
< > Test user/kernel boundary protections
< > Test BPF filter functionality
< > Test firmware loading via userspace interface
< > udelay test driver
[ ] Sample kernel code  ----
[*] KGDB: kernel debugger  --->
[ ] Export kernel pagetable layout to userspace via debugfs
[ ] Filter access to /dev/mem
[*] Enable stack unwinding support (EXPERIMENTAL)
[ ] Verbose user fault messages
[ ] Kernel low-level debugging functions (read help!)
< > Kprobes test module
[ ] Write the current PID to the CONTEXTIDR register
[ ] Set loadable kernel module data as NX and text as RO
[ ] CoreSight Tracing Support  ----

Debug File System (CONFIG_DEBUG_FS)

  • 커널 메뉴 설정에서 아래가 체크되어야 한다.
    Kernel hacking ––>
    → Compile-time checks and compiler options ––>
    → [*] Debug filesystem
  • Debugfs 는 마운트되어야 사용할 수 있다.
    (mount -t debugfs none /sys/kernel/debug)
  • /sys/kernel/debug/debug_objects/stats을 통해 통계 정보를 알 수 있다.

참고