0001 =========================================================
0002 Notes on Analysing Behaviour Using Events and Tracepoints
0003 =========================================================
0004 :Author: Mel Gorman (PCL information heavily based on email from Ingo Molnar)
0005
0006 1. Introduction
0007 ===============
0008
0009 Tracepoints (see Documentation/trace/tracepoints.rst) can be used without
0010 creating custom kernel modules to register probe functions using the event
0011 tracing infrastructure.
0012
Simplistically, tracepoints represent important events that can be
combined with other tracepoints to build a "Big Picture" of what is
going on within the system. There are a large number of methods for
gathering and interpreting these events. As there are no current best
practices, this document describes some of the methods that can be used.
0018
0019 This document assumes that debugfs is mounted on /sys/kernel/debug and that
0020 the appropriate tracing options have been configured into the kernel. It is
0021 assumed that the PCL tool tools/perf has been installed and is in your path.
0022
0023 2. Listing Available Events
0024 ===========================
0025
0026 2.1 Standard Utilities
0027 ----------------------
0028
0029 All possible events are visible from /sys/kernel/debug/tracing/events. Simply
0030 calling::
0031
0032 $ find /sys/kernel/debug/tracing/events -type d
0033
0034 will give a fair indication of the number of events available.
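
Each event directory also contains a format file describing the fields
that are logged with that event. For example, the fields recorded by the
page allocation tracepoint used later in this document can be inspected
with::

  $ cat /sys/kernel/debug/tracing/events/kmem/mm_page_alloc/format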
0035
0036 2.2 PCL (Performance Counters for Linux)
0037 ----------------------------------------
0038
0039 Discovery and enumeration of all counters and events, including tracepoints,
0040 are available with the perf tool. Getting a list of available events is a
0041 simple case of::
0042
0043 $ perf list 2>&1 | grep Tracepoint
0044 ext4:ext4_free_inode [Tracepoint event]
0045 ext4:ext4_request_inode [Tracepoint event]
0046 ext4:ext4_allocate_inode [Tracepoint event]
0047 ext4:ext4_write_begin [Tracepoint event]
0048 ext4:ext4_ordered_write_end [Tracepoint event]
0049 [ .... remaining output snipped .... ]
0050
0051
0052 3. Enabling Events
0053 ==================
0054
0055 3.1 System-Wide Event Enabling
0056 ------------------------------
0057
See Documentation/trace/events.rst for a proper description of how events
can be enabled system-wide. A short example of enabling all events related
to page allocation would look something like::
0061
0062 $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done
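
Alternatively, individual events can be enabled and disabled via the
set_event file described in Documentation/trace/events.rst. As a small
sketch, the two page allocator events used throughout this document could
be enabled with::

  $ echo kmem:mm_page_alloc >> /sys/kernel/debug/tracing/set_event
  $ echo kmem:mm_page_free  >> /sys/kernel/debug/tracing/set_event

All events can be disabled again by writing an empty string to the same
file.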
0063
0064 3.2 System-Wide Event Enabling with SystemTap
0065 ---------------------------------------------
0066
In SystemTap, tracepoints are accessible using the kernel.trace() function
call. The following is an example that reports every 5 seconds which
processes were allocating pages.
0070 ::
0071
0072 global page_allocs
0073
0074 probe kernel.trace("mm_page_alloc") {
0075 page_allocs[execname()]++
0076 }
0077
0078 function print_count() {
0079 printf ("%-25s %-s\n", "#Pages Allocated", "Process Name")
0080 foreach (proc in page_allocs-)
0081 printf("%-25d %s\n", page_allocs[proc], proc)
0082 printf ("\n")
0083 delete page_allocs
0084 }
0085
0086 probe timer.s(5) {
0087 print_count()
0088 }
0089
0090 3.3 System-Wide Event Enabling with PCL
0091 ---------------------------------------
0092
By specifying the -a switch and profiling a sleep command, system-wide
events can be examined for a fixed duration.
0095 ::
0096
0097 $ perf stat -a \
0098 -e kmem:mm_page_alloc -e kmem:mm_page_free \
0099 -e kmem:mm_page_free_batched \
0100 sleep 10
0101 Performance counter stats for 'sleep 10':
0102
0103 9630 kmem:mm_page_alloc
0104 2143 kmem:mm_page_free
0105 7424 kmem:mm_page_free_batched
0106
0107 10.002577764 seconds time elapsed
0108
0109 Similarly, one could execute a shell and exit it as desired to get a report
0110 at that point.
0111
0112 3.4 Local Event Enabling
0113 ------------------------
0114
0115 Documentation/trace/ftrace.rst describes how to enable events on a per-thread
0116 basis using set_ftrace_pid.
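
As a rough sketch only (see that document for the exact procedure), the
PID of the thread of interest is written to set_ftrace_pid before tracing
is enabled::

  $ echo $$ > /sys/kernel/debug/tracing/set_ftrace_pid

where $$ is the PID of the current shell.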
0117
3.5 Local Event Enabling with PCL
---------------------------------
0120
Events can be activated and tracked for the duration of a single process
using PCL as follows.
0123 ::
0124
0125 $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free \
0126 -e kmem:mm_page_free_batched ./hackbench 10
0127 Time: 0.909
0128
0129 Performance counter stats for './hackbench 10':
0130
0131 17803 kmem:mm_page_alloc
0132 12398 kmem:mm_page_free
0133 4827 kmem:mm_page_free_batched
0134
0135 0.973913387 seconds time elapsed
0136
0137 4. Event Filtering
0138 ==================
0139
Documentation/trace/ftrace.rst covers in depth how to filter events in
ftrace. Running grep and awk over trace_pipe is obviously an option, as is
any script that reads trace_pipe.
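
In-kernel filtering is also possible: each event directory has a filter
file that accepts boolean expressions over the fields listed in the
event's format file. As a sketch, assuming the mm_page_alloc tracepoint
exposes an 'order' field, high-order allocations could be isolated with::

  $ echo 'order > 0' > /sys/kernel/debug/tracing/events/kmem/mm_page_alloc/filter

Writing 0 to the same file clears the filter again.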
0143
0144 5. Analysing Event Variances with PCL
0145 =====================================
0146
Any workload can exhibit variances between runs and it can be important
to know what the standard deviation is. By and large, this is left to the
performance analyst to do by hand. In the event that the discrete event
occurrences are useful to the performance analyst, perf can be used.
0151 ::
0152
  $ perf stat --repeat 5 -e kmem:mm_page_alloc -e kmem:mm_page_free \
		-e kmem:mm_page_free_batched ./hackbench 10
0155 Time: 0.890
0156 Time: 0.895
0157 Time: 0.915
0158 Time: 1.001
0159 Time: 0.899
0160
0161 Performance counter stats for './hackbench 10' (5 runs):
0162
0163 16630 kmem:mm_page_alloc ( +- 3.542% )
0164 11486 kmem:mm_page_free ( +- 4.771% )
0165 4730 kmem:mm_page_free_batched ( +- 2.325% )
0166
0167 0.982653002 seconds time elapsed ( +- 1.448% )
0168
If a higher-level event is required that depends on some aggregation of
discrete events, then a script needs to be developed.
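
For illustration only, a trivial aggregation could parse the perf stat
output (which is printed on stderr) to report the net number of pages
allocated, assuming the counts are printed without thousands separators
as in the examples above::

  $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free ./hackbench 10 2>&1 | \
	awk '$2 == "kmem:mm_page_alloc" { alloc = $1 }
	     $2 == "kmem:mm_page_free"  { freed = $1 }
	     END { print "Net pages allocated:", alloc - freed }'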
0171
0172 Using --repeat, it is also possible to view how events are fluctuating over
0173 time on a system-wide basis using -a and sleep.
0174 ::
0175
0176 $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free \
0177 -e kmem:mm_page_free_batched \
0178 -a --repeat 10 \
0179 sleep 1
0180 Performance counter stats for 'sleep 1' (10 runs):
0181
0182 1066 kmem:mm_page_alloc ( +- 26.148% )
0183 182 kmem:mm_page_free ( +- 5.464% )
0184 890 kmem:mm_page_free_batched ( +- 30.079% )
0185
0186 1.002251757 seconds time elapsed ( +- 0.005% )
0187
0188 6. Higher-Level Analysis with Helper Scripts
0189 ============================================
0190
When events are enabled, the events that are triggering can be read from
/sys/kernel/debug/tracing/trace_pipe in a human-readable format, although
binary options exist as well. By post-processing the output, further
information can be gathered on-line as appropriate. Examples of
post-processing might include:
0195
- Reading information from /proc for the PID that triggered the event
- Deriving a higher-level event from a series of lower-level events
- Calculating latencies between two events
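
As a minimal sketch of this style of post-processing, the following counts
how many events each task generated over a fixed number of trace_pipe
lines, assuming the relevant events have already been enabled::

  $ head -n 50000 /sys/kernel/debug/tracing/trace_pipe | \
	awk '{ tasks[$1]++ } END { for (t in tasks) print tasks[t], t }' | \
	sort -rn | head

The first field of each human-readable trace_pipe line is the comm-pid
pair of the task that triggered the event.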
0199
0200 Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example
0201 script that can read trace_pipe from STDIN or a copy of a trace. When used
0202 on-line, it can be interrupted once to generate a report without exiting
0203 and twice to exit.
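
For example, to use it on-line with the events of interest already
enabled, something like the following could be run::

  $ cat /sys/kernel/debug/tracing/trace_pipe | \
	perl Documentation/trace/postprocess/trace-pagealloc-postprocess.pl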
0204
Simplistically, the script just reads STDIN and counts up events, but it
can also do more, such as:
0207
0208 - Derive high-level events from many low-level events. If a number of pages
0209 are freed to the main allocator from the per-CPU lists, it recognises
0210 that as one per-CPU drain even though there is no specific tracepoint
0211 for that event
0212 - It can aggregate based on PID or individual process number
0213 - In the event memory is getting externally fragmented, it reports
0214 on whether the fragmentation event was severe or moderate.
0215 - When receiving an event about a PID, it can record who the parent was so
0216 that if large numbers of events are coming from very short-lived
0217 processes, the parent process responsible for creating all the helpers
0218 can be identified
0219
0220 7. Lower-Level Analysis with PCL
0221 ================================
0222
0223 There may also be a requirement to identify what functions within a program
0224 were generating events within the kernel. To begin this sort of analysis, the
0225 data must be recorded. At the time of writing, this required root:
0226 ::
0227
0228 $ perf record -c 1 \
0229 -e kmem:mm_page_alloc -e kmem:mm_page_free \
0230 -e kmem:mm_page_free_batched \
0231 ./hackbench 10
0232 Time: 0.894
0233 [ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ]
0234
Note the use of '-c 1' to set the event sample period to 1 so that every
event is recorded. The default sample period is quite high to minimise
overhead, but the information collected can be very coarse as a result.
0238
This record wrote a file called perf.data which can be analysed using
perf report.
0241 ::
0242
0243 $ perf report
0244 # Samples: 30922
0245 #
0246 # Overhead Command Shared Object
0247 # ........ ......... ................................
0248 #
0249 87.27% hackbench [vdso]
0250 6.85% hackbench /lib/i686/cmov/libc-2.9.so
0251 2.62% hackbench /lib/ld-2.9.so
0252 1.52% perf [vdso]
0253 1.22% hackbench ./hackbench
0254 0.48% hackbench [kernel]
0255 0.02% perf /lib/i686/cmov/libc-2.9.so
0256 0.01% perf /usr/bin/perf
0257 0.01% perf /lib/ld-2.9.so
0258 0.00% hackbench /lib/i686/cmov/libpthread-2.9.so
0259 #
0260 # (For more details, try: perf report --sort comm,dso,symbol)
0261 #
0262
According to this, the vast majority of events were triggered within the
VDSO. With simple binaries, this will often be the case, so let's take a
slightly different example. In the course of writing this, it was noticed
that X was generating an insane number of page allocations, so let's look
at it:
0268 ::
0269
0270 $ perf record -c 1 -f \
0271 -e kmem:mm_page_alloc -e kmem:mm_page_free \
0272 -e kmem:mm_page_free_batched \
0273 -p `pidof X`
0274
0275 This was interrupted after a few seconds and
0276 ::
0277
0278 $ perf report
0279 # Samples: 27666
0280 #
0281 # Overhead Command Shared Object
0282 # ........ ....... .......................................
0283 #
0284 51.95% Xorg [vdso]
0285 47.95% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1
0286 0.09% Xorg /lib/i686/cmov/libc-2.9.so
0287 0.01% Xorg [kernel]
0288 #
0289 # (For more details, try: perf report --sort comm,dso,symbol)
0290 #
0291
0292 So, almost half of the events are occurring in a library. To get an idea which
0293 symbol:
0294 ::
0295
0296 $ perf report --sort comm,dso,symbol
0297 # Samples: 27666
0298 #
0299 # Overhead Command Shared Object Symbol
0300 # ........ ....... ....................................... ......
0301 #
0302 51.95% Xorg [vdso] [.] 0x000000ffffe424
0303 47.93% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] pixmanFillsse2
0304 0.09% Xorg /lib/i686/cmov/libc-2.9.so [.] _int_malloc
0305 0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] pixman_region32_copy_f
0306 0.01% Xorg [kernel] [k] read_hpet
0307 0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] get_fast_path
0308 0.00% Xorg [kernel] [k] ftrace_trace_userstack
0309
0310 To see where within the function pixmanFillsse2 things are going wrong:
0311 ::
0312
0313 $ perf annotate pixmanFillsse2
0314 [ ... ]
0315 0.00 : 34eeb: 0f 18 08 prefetcht0 (%eax)
0316 : }
0317 :
0318 : extern __inline void __attribute__((__gnu_inline__, __always_inline__, _
0319 : _mm_store_si128 (__m128i *__P, __m128i __B) : {
0320 : *__P = __B;
0321 12.40 : 34eee: 66 0f 7f 80 40 ff ff movdqa %xmm0,-0xc0(%eax)
0322 0.00 : 34ef5: ff
0323 12.40 : 34ef6: 66 0f 7f 80 50 ff ff movdqa %xmm0,-0xb0(%eax)
0324 0.00 : 34efd: ff
0325 12.39 : 34efe: 66 0f 7f 80 60 ff ff movdqa %xmm0,-0xa0(%eax)
0326 0.00 : 34f05: ff
0327 12.67 : 34f06: 66 0f 7f 80 70 ff ff movdqa %xmm0,-0x90(%eax)
0328 0.00 : 34f0d: ff
0329 12.58 : 34f0e: 66 0f 7f 40 80 movdqa %xmm0,-0x80(%eax)
0330 12.31 : 34f13: 66 0f 7f 40 90 movdqa %xmm0,-0x70(%eax)
0331 12.40 : 34f18: 66 0f 7f 40 a0 movdqa %xmm0,-0x60(%eax)
0332 12.31 : 34f1d: 66 0f 7f 40 b0 movdqa %xmm0,-0x50(%eax)
0333
At a glance, it looks like the time is being spent copying pixmaps to
the card. Further investigation would be needed to determine why pixmaps
are being copied around so much, but a starting point would be to take
the ancient build of libpixman out of the library path where it had been
totally forgotten about months ago!