0001 =========================================================
0002 Notes on Analysing Behaviour Using Events and Tracepoints
0003 =========================================================
0004 :Author: Mel Gorman (PCL information heavily based on email from Ingo Molnar)
0005
0006 1. Introduction
0007 ===============
0008
0009 Tracepoints (see Documentation/trace/tracepoints.rst) can be used without
0010 creating custom kernel modules to register probe functions using the event
0011 tracing infrastructure.
0012
Simplistically, tracepoints represent important events that can be
combined with other tracepoints to build a "Big Picture" of what is
going on within the system. There are a large number of methods for
gathering and interpreting these events. As there are no current best
practices, this document describes some of the methods that can be used.
0018
0019 This document assumes that debugfs is mounted on /sys/kernel/debug and that
0020 the appropriate tracing options have been configured into the kernel. It is
0021 assumed that the PCL tool tools/perf has been installed and is in your path.
0022
0023 2. Listing Available Events
0024 ===========================
0025
0026 2.1 Standard Utilities
0027 ----------------------
0028
0029 All possible events are visible from /sys/kernel/debug/tracing/events. Simply
0030 calling::
0031
0032 $ find /sys/kernel/debug/tracing/events -type d
0033
0034 will give a fair indication of the number of events available.
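
Each event directory also contains a format file describing the fields
that are logged with that event. For example, the fields recorded by the
page allocation tracepoint used later in this document can be inspected
with::

  $ cat /sys/kernel/debug/tracing/events/kmem/mm_page_alloc/format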
0035
0036 2.2 PCL (Performance Counters for Linux)
0037 ----------------------------------------
0038
0039 Discovery and enumeration of all counters and events, including tracepoints,
0040 are available with the perf tool. Getting a list of available events is a
0041 simple case of::
0042
0043 $ perf list 2>&1 | grep Tracepoint
0044 ext4:ext4_free_inode [Tracepoint event]
0045 ext4:ext4_request_inode [Tracepoint event]
0046 ext4:ext4_allocate_inode [Tracepoint event]
0047 ext4:ext4_write_begin [Tracepoint event]
0048 ext4:ext4_ordered_write_end [Tracepoint event]
0049 [ .... remaining output snipped .... ]
0050
0051
0052 3. Enabling Events
0053 ==================
0054
0055 3.1 System-Wide Event Enabling
0056 ------------------------------
0057
See Documentation/trace/events.rst for a proper description of how events
can be enabled system-wide. A short example of enabling all events related
to page allocation would look something like::
0061
0062 $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done
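
Alternatively, individual events can be enabled and disabled via the
set_event file described in Documentation/trace/events.rst. As a small
sketch, the two page allocator events used throughout this document could
be enabled with::

  $ echo kmem:mm_page_alloc >> /sys/kernel/debug/tracing/set_event
  $ echo kmem:mm_page_free  >> /sys/kernel/debug/tracing/set_event

All events can be disabled again by writing an empty string to the same
file.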
0063
0064 3.2 System-Wide Event Enabling with SystemTap
0065 ---------------------------------------------
0066
In SystemTap, tracepoints are accessible using the kernel.trace() function
call. The following is an example that reports every 5 seconds which
processes were allocating pages.
0070 ::
0071
0072 global page_allocs
0073
0074 probe kernel.trace("mm_page_alloc") {
0075 page_allocs[execname()]++
0076 }
0077
0078 function print_count() {
0079 printf ("%-25s %-s\n", "#Pages Allocated", "Process Name")
0080 foreach (proc in page_allocs-)
0081 printf("%-25d %s\n", page_allocs[proc], proc)
0082 printf ("\n")
0083 delete page_allocs
0084 }
0085
0086 probe timer.s(5) {
0087 print_count()
0088 }
0089
0090 3.3 System-Wide Event Enabling with PCL
0091 ---------------------------------------
0092
By specifying the -a switch and profiling a sleep command, system-wide
events can be examined for a fixed duration.
0095 ::
0096
0097 $ perf stat -a \
0098 -e kmem:mm_page_alloc -e kmem:mm_page_free \
0099 -e kmem:mm_page_free_batched \
0100 sleep 10
0101 Performance counter stats for 'sleep 10':
0102
0103 9630 kmem:mm_page_alloc
0104 2143 kmem:mm_page_free
0105 7424 kmem:mm_page_free_batched
0106
0107 10.002577764 seconds time elapsed
0108
0109 Similarly, one could execute a shell and exit it as desired to get a report
0110 at that point.
0111
0112 3.4 Local Event Enabling
0113 ------------------------
0114
0115 Documentation/trace/ftrace.rst describes how to enable events on a per-thread
0116 basis using set_ftrace_pid.
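
As a rough sketch only (see that document for the exact procedure), the
PID of the thread of interest is written to set_ftrace_pid before tracing
is enabled::

  $ echo $$ > /sys/kernel/debug/tracing/set_ftrace_pid

where $$ is the PID of the current shell.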
0117
3.5 Local Event Enabling with PCL
---------------------------------
0120
Events can be activated and tracked for the duration of a single process
using PCL as follows.
0123 ::
0124
0125 $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free \
0126 -e kmem:mm_page_free_batched ./hackbench 10
0127 Time: 0.909
0128
0129 Performance counter stats for './hackbench 10':
0130
0131 17803 kmem:mm_page_alloc
0132 12398 kmem:mm_page_free
0133 4827 kmem:mm_page_free_batched
0134
0135 0.973913387 seconds time elapsed
0136
0137 4. Event Filtering
0138 ==================
0139
Documentation/trace/ftrace.rst covers in depth how to filter events in
ftrace. Running grep and awk over trace_pipe is obviously an option, as is
any script that reads trace_pipe.
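
In-kernel filtering is also possible: each event directory has a filter
file that accepts boolean expressions over the fields listed in the
event's format file. As a sketch, assuming the mm_page_alloc tracepoint
exposes an 'order' field, high-order allocations could be isolated with::

  $ echo 'order > 0' > /sys/kernel/debug/tracing/events/kmem/mm_page_alloc/filter

Writing 0 to the same file clears the filter again.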
0143
0144 5. Analysing Event Variances with PCL
0145 =====================================
0146
Any workload can exhibit variances between runs and it can be important
to know what the standard deviation is. By and large, this is left to the
performance analyst to do by hand. In the event that the discrete event
occurrences are useful to the performance analyst, perf can be used.
0151 ::
0152
  $ perf stat --repeat 5 -e kmem:mm_page_alloc -e kmem:mm_page_free \
		-e kmem:mm_page_free_batched ./hackbench 10
0155 Time: 0.890
0156 Time: 0.895
0157 Time: 0.915
0158 Time: 1.001
0159 Time: 0.899
0160
0161 Performance counter stats for './hackbench 10' (5 runs):
0162
0163 16630 kmem:mm_page_alloc ( +- 3.542% )
0164 11486 kmem:mm_page_free ( +- 4.771% )
0165 4730 kmem:mm_page_free_batched ( +- 2.325% )
0166
0167 0.982653002 seconds time elapsed ( +- 1.448% )
0168
If a higher-level event is required that depends on some aggregation of
discrete events, then a script needs to be developed.
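
For illustration only, a trivial aggregation could parse the perf stat
output (which is printed on stderr) to report the net number of pages
allocated, assuming the counts are printed without thousands separators
as in the examples above::

  $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free ./hackbench 10 2>&1 | \
	awk '$2 == "kmem:mm_page_alloc" { alloc = $1 }
	     $2 == "kmem:mm_page_free"  { freed = $1 }
	     END { print "Net pages allocated:", alloc - freed }'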
0171
0172 Using --repeat, it is also possible to view how events are fluctuating over
0173 time on a system-wide basis using -a and sleep.
0174 ::
0175
0176 $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free \
0177 -e kmem:mm_page_free_batched \
0178 -a --repeat 10 \
0179 sleep 1
0180 Performance counter stats for 'sleep 1' (10 runs):
0181
0182 1066 kmem:mm_page_alloc ( +- 26.148% )
0183 182 kmem:mm_page_free ( +- 5.464% )
0184 890 kmem:mm_page_free_batched ( +- 30.079% )
0185
0186 1.002251757 seconds time elapsed ( +- 0.005% )
0187
0188 6. Higher-Level Analysis with Helper Scripts
0189 ============================================
0190
When events are enabled, the events that are triggering can be read from
/sys/kernel/debug/tracing/trace_pipe in a human-readable format, although
binary options exist as well. By post-processing the output, further
information can be gathered on-line as appropriate. Examples of
post-processing might include:
0195
- Reading information from /proc for the PID that triggered the event
- Deriving a higher-level event from a series of lower-level events
- Calculating latencies between two events
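
As a minimal sketch of this style of post-processing, the following counts
how many events each task generated over a fixed number of trace_pipe
lines, assuming the relevant events have already been enabled::

  $ head -n 50000 /sys/kernel/debug/tracing/trace_pipe | \
	awk '{ tasks[$1]++ } END { for (t in tasks) print tasks[t], t }' | \
	sort -rn | head

The first field of each human-readable trace_pipe line is the comm-pid
pair of the task that triggered the event.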
0199
0200 Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example
0201 script that can read trace_pipe from STDIN or a copy of a trace. When used
0202 on-line, it can be interrupted once to generate a report without exiting
0203 and twice to exit.
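
For example, to use it on-line with the events of interest already
enabled, something like the following could be run::

  $ cat /sys/kernel/debug/tracing/trace_pipe | \
	perl Documentation/trace/postprocess/trace-pagealloc-postprocess.pl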
0204
Simplistically, the script just reads STDIN and counts up events, but it
can also do more, such as:
0207
0208 - Derive high-level events from many low-level events. If a number of pages
0209 are freed to the main allocator from the per-CPU lists, it recognises
0210 that as one per-CPU drain even though there is no specific tracepoint
0211 for that event
0212 - It can aggregate based on PID or individual process number
0213 - In the event memory is getting externally fragmented, it reports
0214 on whether the fragmentation event was severe or moderate.
0215 - When receiving an event about a PID, it can record who the parent was so
0216 that if large numbers of events are coming from very short-lived
0217 processes, the parent process responsible for creating all the helpers
0218 can be identified
0219
0220 7. Lower-Level Analysis with PCL
0221 ================================
0222
0223 There may also be a requirement to identify what functions within a program
0224 were generating events within the kernel. To begin this sort of analysis, the
0225 data must be recorded. At the time of writing, this required root:
0226 ::
0227
0228 $ perf record -c 1 \
0229 -e kmem:mm_page_alloc -e kmem:mm_page_free \
0230 -e kmem:mm_page_free_batched \
0231 ./hackbench 10
0232 Time: 0.894
0233 [ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ]
0234
Note the use of '-c 1' to set the event sample period to 1 so that every
event is recorded. The default sample period is quite high to minimise
overhead, but the information collected can be very coarse as a result.
0238
This record wrote a file called perf.data which can be analysed using
perf report.
0241 ::
0242
0243 $ perf report
0244 # Samples: 30922
0245 #
0246 # Overhead Command Shared Object
0247 # ........ ......... ................................
0248 #
0249 87.27% hackbench [vdso]
0250 6.85% hackbench /lib/i686/cmov/libc-2.9.so
0251 2.62% hackbench /lib/ld-2.9.so
0252 1.52% perf [vdso]
0253 1.22% hackbench ./hackbench
0254 0.48% hackbench [kernel]
0255 0.02% perf /lib/i686/cmov/libc-2.9.so
0256 0.01% perf /usr/bin/perf
0257 0.01% perf /lib/ld-2.9.so
0258 0.00% hackbench /lib/i686/cmov/libpthread-2.9.so
0259 #
0260 # (For more details, try: perf report --sort comm,dso,symbol)
0261 #
0262
According to this, the vast majority of events were triggered within the
VDSO. With simple binaries, this will often be the case, so let's take a
slightly different example. In the course of writing this, it was noticed
that X was generating an insane number of page allocations, so let's look
at it:
0268 ::
0269
0270 $ perf record -c 1 -f \
0271 -e kmem:mm_page_alloc -e kmem:mm_page_free \
0272 -e kmem:mm_page_free_batched \
0273 -p `pidof X`
0274
0275 This was interrupted after a few seconds and
0276 ::
0277
0278 $ perf report
0279 # Samples: 27666
0280 #
0281 # Overhead Command Shared Object
0282 # ........ ....... .......................................
0283 #
0284 51.95% Xorg [vdso]
0285 47.95% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1
0286 0.09% Xorg /lib/i686/cmov/libc-2.9.so
0287 0.01% Xorg [kernel]
0288 #
0289 # (For more details, try: perf report --sort comm,dso,symbol)
0290 #
0291
0292 So, almost half of the events are occurring in a library. To get an idea which
0293 symbol:
0294 ::
0295
0296 $ perf report --sort comm,dso,symbol
0297 # Samples: 27666
0298 #
0299 # Overhead Command Shared Object Symbol
0300 # ........ ....... ....................................... ......
0301 #
0302 51.95% Xorg [vdso] [.] 0x000000ffffe424
0303 47.93% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] pixmanFillsse2
0304 0.09% Xorg /lib/i686/cmov/libc-2.9.so [.] _int_malloc
0305 0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] pixman_region32_copy_f
0306 0.01% Xorg [kernel] [k] read_hpet
0307 0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] get_fast_path
0308 0.00% Xorg [kernel] [k] ftrace_trace_userstack
0309
0310 To see where within the function pixmanFillsse2 things are going wrong:
0311 ::
0312
0313 $ perf annotate pixmanFillsse2
0314 [ ... ]
0315 0.00 : 34eeb: 0f 18 08 prefetcht0 (%eax)
0316 : }
0317 :
0318 : extern __inline void __attribute__((__gnu_inline__, __always_inline__, _
0319 : _mm_store_si128 (__m128i *__P, __m128i __B) : {
0320 : *__P = __B;
0321 12.40 : 34eee: 66 0f 7f 80 40 ff ff movdqa %xmm0,-0xc0(%eax)
0322 0.00 : 34ef5: ff
0323 12.40 : 34ef6: 66 0f 7f 80 50 ff ff movdqa %xmm0,-0xb0(%eax)
0324 0.00 : 34efd: ff
0325 12.39 : 34efe: 66 0f 7f 80 60 ff ff movdqa %xmm0,-0xa0(%eax)
0326 0.00 : 34f05: ff
0327 12.67 : 34f06: 66 0f 7f 80 70 ff ff movdqa %xmm0,-0x90(%eax)
0328 0.00 : 34f0d: ff
0329 12.58 : 34f0e: 66 0f 7f 40 80 movdqa %xmm0,-0x80(%eax)
0330 12.31 : 34f13: 66 0f 7f 40 90 movdqa %xmm0,-0x70(%eax)
0331 12.40 : 34f18: 66 0f 7f 40 a0 movdqa %xmm0,-0x60(%eax)
0332 12.31 : 34f1d: 66 0f 7f 40 b0 movdqa %xmm0,-0x50(%eax)
0333
At a glance, it looks like the time is being spent copying pixmaps to
the card. Further investigation would be needed to determine why pixmaps
are being copied around so much, but a starting point would be to take
the ancient build of libpixman out of the library path where it had been
totally forgotten about months ago!