=========================================================
Notes on Analysing Behaviour Using Events and Tracepoints
=========================================================
:Author: Mel Gorman (PCL information heavily based on email from Ingo Molnar)

1. Introduction
===============

Tracepoints (see Documentation/trace/tracepoints.rst) can be used without
creating custom kernel modules to register probe functions using the event
tracing infrastructure.

Simplistically, tracepoints represent important events that can be
taken in conjunction with other tracepoints to build a "Big Picture" of
what is going on within the system. There are a large number of methods for
gathering and interpreting these events. Lacking any current Best Practices,
this document describes some of the methods that can be used.

This document assumes that debugfs is mounted on /sys/kernel/debug and that
the appropriate tracing options have been configured into the kernel. It is
assumed that the PCL tool tools/perf has been installed and is in your path.

2. Listing Available Events
===========================

2.1 Standard Utilities
----------------------

All possible events are visible from /sys/kernel/debug/tracing/events. Simply
calling::

  $ find /sys/kernel/debug/tracing/events -type d

will give a fair indication of the number of events available.

2.2 PCL (Performance Counters for Linux)
----------------------------------------

Discovery and enumeration of all counters and events, including tracepoints,
are available with the perf tool. Getting a list of available events is a
simple case of::

  $ perf list 2>&1 | grep Tracepoint
  ext4:ext4_free_inode                     [Tracepoint event]
  ext4:ext4_request_inode                  [Tracepoint event]
  ext4:ext4_allocate_inode                 [Tracepoint event]
  ext4:ext4_write_begin                    [Tracepoint event]
  ext4:ext4_ordered_write_end              [Tracepoint event]
  [ .... remaining output snipped .... ]

3. Enabling Events
==================

3.1 System-Wide Event Enabling
------------------------------

See Documentation/trace/events.rst for a proper description on how events
can be enabled system-wide. A short example of enabling all events related
to page allocation would look something like::

  $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done

3.2 System-Wide Event Enabling with SystemTap
---------------------------------------------

In SystemTap, tracepoints are accessible using the kernel.trace() function
call. The following is an example that reports every 5 seconds which
processes were allocating pages.
::

  global page_allocs

  probe kernel.trace("mm_page_alloc") {
        page_allocs[execname()]++
  }

  function print_count() {
        printf ("%-25s %-s\n", "#Pages Allocated", "Process Name")
        foreach (proc in page_allocs-)
                printf("%-25d %s\n", page_allocs[proc], proc)
        printf ("\n")
        delete page_allocs
  }

  probe timer.s(5) {
          print_count()
  }

3.3 System-Wide Event Enabling with PCL
---------------------------------------

By specifying the -a switch and profiling the sleep command, system-wide
events can be examined for a fixed duration of time.
::

 $ perf stat -a \
        -e kmem:mm_page_alloc -e kmem:mm_page_free \
        -e kmem:mm_page_free_batched \
        sleep 10
 Performance counter stats for 'sleep 10':

           9630  kmem:mm_page_alloc
           2143  kmem:mm_page_free
           7424  kmem:mm_page_free_batched

   10.002577764  seconds time elapsed

Similarly, one could execute a shell and exit it as desired to get a report
at that point.

3.4 Local Event Enabling
------------------------

Documentation/trace/ftrace.rst describes how to enable events on a per-thread
basis using set_ftrace_pid.
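
As a minimal sketch of the mechanism (set_ftrace_pid steers the ftrace
tracers at a chosen PID; the paths assume debugfs is mounted as in Section 1,
and the commands require root):

```shell
# Sketch only: restrict ftrace to the current shell's PID. Requires root
# and assumes debugfs is mounted on /sys/kernel/debug as in Section 1.
TRACING=/sys/kernel/debug/tracing
if [ -w "$TRACING/set_ftrace_pid" ]; then
        echo $$ > "$TRACING/set_ftrace_pid"        # trace only this PID
        echo function > "$TRACING/current_tracer"  # use the function tracer
        head -5 "$TRACING/trace"                   # peek at the output
        echo nop > "$TRACING/current_tracer"       # stop tracing
        echo > "$TRACING/set_ftrace_pid"           # clear the PID filter
fi
```

The guard makes the sketch a no-op on systems where tracing is unavailable
or the user lacks permission.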

3.5 Local Event Enablement with PCL
-----------------------------------

Events can be activated and tracked for the duration of a single process
using PCL as follows.
::

  $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free \
                 -e kmem:mm_page_free_batched ./hackbench 10
  Time: 0.909

    Performance counter stats for './hackbench 10':

          17803  kmem:mm_page_alloc
          12398  kmem:mm_page_free
           4827  kmem:mm_page_free_batched

    0.973913387  seconds time elapsed

4. Event Filtering
==================

Documentation/trace/ftrace.rst covers in depth how to filter events in
ftrace. Filtering trace_pipe output with grep and awk, or with any script
that reads trace_pipe, is also an option.
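
As a hedged sketch, a filter can also be attached to an event before enabling
it. The 'order' field of kmem:mm_page_alloc is assumed here; the full filter
syntax is described in Documentation/trace/events.rst:

```shell
# Sketch only: report mm_page_alloc events for order > 0 allocations.
# Requires root; assumes debugfs is mounted on /sys/kernel/debug and that
# the kmem:mm_page_alloc event exposes an 'order' field.
EVENT=/sys/kernel/debug/tracing/events/kmem/mm_page_alloc
if [ -w "$EVENT/filter" ]; then
        echo 'order > 0' > "$EVENT/filter"     # only high-order allocations
        echo 1 > "$EVENT/enable"
        head -5 /sys/kernel/debug/tracing/trace_pipe
        echo 0 > "$EVENT/enable"
        echo 0 > "$EVENT/filter"               # writing 0 clears the filter
fi
```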

5. Analysing Event Variances with PCL
=====================================

Any workload can exhibit variances between runs and it can be important
to know what the standard deviation is. By and large, this is left to the
performance analyst to do by hand. In the event that the discrete event
occurrences are useful to the performance analyst, perf can be used.
::

  $ perf stat --repeat 5 -e kmem:mm_page_alloc -e kmem:mm_page_free \
                        -e kmem:mm_page_free_batched ./hackbench 10
  Time: 0.890
  Time: 0.895
  Time: 0.915
  Time: 1.001
  Time: 0.899

   Performance counter stats for './hackbench 10' (5 runs):

          16630  kmem:mm_page_alloc         ( +-   3.542% )
          11486  kmem:mm_page_free          ( +-   4.771% )
           4730  kmem:mm_page_free_batched  ( +-   2.325% )

    0.982653002  seconds time elapsed   ( +-   1.448% )

In the event that some higher-level metric is required that depends on some
aggregation of discrete events, a script needs to be developed.

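As a sketch of what such a script might look like, the awk fragment below
derives a hypothetical higher-level "net pages allocated" figure from the
per-event counts in perf stat output. The canned input stands in for real
perf output:

```shell
# Sketch only: aggregate discrete event counts into a higher-level metric.
# The canned file below mimics the counter lines that perf stat prints.
cat <<'EOF' > /tmp/perf-stat.txt
          17803  kmem:mm_page_alloc
          12398  kmem:mm_page_free
EOF
awk '$2 == "kmem:mm_page_alloc" { alloc = $1 }
     $2 == "kmem:mm_page_free"  { freed = $1 }
     END { printf "net pages allocated: %d\n", alloc - freed }' /tmp/perf-stat.txt
```

For the canned counts this prints "net pages allocated: 5405"; in practice
the awk fragment would be fed the real output of perf stat captured with 2>&1.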
Using --repeat, it is also possible to view how events are fluctuating over
time on a system-wide basis using -a and sleep.
::

  $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free \
                -e kmem:mm_page_free_batched \
                -a --repeat 10 \
                sleep 1
  Performance counter stats for 'sleep 1' (10 runs):

           1066  kmem:mm_page_alloc         ( +-  26.148% )
            182  kmem:mm_page_free          ( +-   5.464% )
            890  kmem:mm_page_free_batched  ( +-  30.079% )

    1.002251757  seconds time elapsed   ( +-   0.005% )

6. Higher-Level Analysis with Helper Scripts
============================================

When events are enabled, the events that are triggering can be read from
/sys/kernel/debug/tracing/trace_pipe in human-readable format, although binary
options exist as well. By post-processing the output, further information can
be gathered on-line as appropriate. Examples of post-processing might include

  - Reading information from /proc for the PID that triggered the event
  - Deriving a higher-level event from a series of lower-level events
  - Calculating latencies between two events
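
As an illustration of the latency case, the sketch below measures the gap
between a hypothetical pair of events using the timestamp field of trace_pipe
output (the sample lines are canned, not real output):

```shell
# Sketch only: compute the latency between two related events using the
# timestamp field ($3, "seconds.usecs:") of trace_pipe-formatted lines.
cat <<'EOF' > /tmp/latency-trace.txt
     kswapd0-311   [000]   100.000100: mm_vmscan_wakeup_kswapd: nid=0
     kswapd0-311   [000]   100.004600: mm_vmscan_kswapd_sleep: nid=0
EOF
awk '/mm_vmscan_wakeup_kswapd/ { start = $3 + 0 }
     /mm_vmscan_kswapd_sleep/  { printf "latency: %.6f s\n", ($3 + 0) - start }' \
    /tmp/latency-trace.txt
```

Adding $3 to 0 makes awk convert the "seconds.usecs:" field to a number,
discarding the trailing colon.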

Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example
script that can read trace_pipe from STDIN or a copy of a trace. When used
on-line, it can be interrupted once to generate a report without exiting
and twice to exit.

Simplistically, the script just reads STDIN and counts up events, but it
can also do more, such as

  - Derive high-level events from many low-level events. If a number of pages
    are freed to the main allocator from the per-CPU lists, it recognises
    that as one per-CPU drain even though there is no specific tracepoint
    for that event
  - Aggregate events based on PID or process name
  - In the event memory is getting externally fragmented, report
    on whether the fragmentation event was severe or moderate
  - When receiving an event about a PID, record who the parent was so
    that if large numbers of events are coming from very short-lived
    processes, the parent process responsible for creating all the helpers
    can be identified
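
The basic counting behaviour described above can be sketched in a few lines
of awk; field 1 of trace_pipe output is "comm-pid", so splitting on the dash
yields the process name (canned sample input is used here):

```shell
# Sketch only: count events per process name from trace_pipe-formatted
# input. Field 1 is "comm-pid"; the split below assumes the process name
# itself contains no dash.
cat <<'EOF' > /tmp/count-trace.txt
            Xorg-1573  [001]   594.857776: mm_page_alloc: order=0
            Xorg-1573  [001]   594.857802: mm_page_alloc: order=0
       hackbench-2011  [000]   594.858134: mm_page_alloc: order=1
EOF
awk '{ split($1, part, "-"); count[part[1]]++ }
     END { for (p in count) printf "%-25d %s\n", count[p], p }' /tmp/count-trace.txt
```

Running the same awk against a live `cat trace_pipe` gives a rolling count
per process, much like the SystemTap example in Section 3.2.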

7. Lower-Level Analysis with PCL
================================

There may also be a requirement to identify what functions within a program
were generating events within the kernel. To begin this sort of analysis, the
data must be recorded. At the time of writing, this required root:
::

  $ perf record -c 1 \
        -e kmem:mm_page_alloc -e kmem:mm_page_free \
        -e kmem:mm_page_free_batched \
        ./hackbench 10
  Time: 0.894
  [ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ]

Note the use of '-c 1' to set the sample period to 1 so that every event is
recorded. The default sample period is quite high to minimise overhead, but
the information collected can be very coarse as a result.

This record wrote a file called perf.data, which can be analysed using
perf report.
::

  $ perf report
  # Samples: 30922
  #
  # Overhead    Command                     Shared Object
  # ........  .........  ................................
  #
      87.27%  hackbench  [vdso]
       6.85%  hackbench  /lib/i686/cmov/libc-2.9.so
       2.62%  hackbench  /lib/ld-2.9.so
       1.52%       perf  [vdso]
       1.22%  hackbench  ./hackbench
       0.48%  hackbench  [kernel]
       0.02%       perf  /lib/i686/cmov/libc-2.9.so
       0.01%       perf  /usr/bin/perf
       0.01%       perf  /lib/ld-2.9.so
       0.00%  hackbench  /lib/i686/cmov/libpthread-2.9.so
  #
  # (For more details, try: perf report --sort comm,dso,symbol)
  #

According to this, the vast majority of events were triggered within the
VDSO. With simple binaries, this will often be the case, so let's take a
slightly different example. In the course of writing this, it was noticed
that X was generating an insane amount of page allocations, so let's look
at it:
::

  $ perf record -c 1 -f \
                -e kmem:mm_page_alloc -e kmem:mm_page_free \
                -e kmem:mm_page_free_batched \
                -p `pidof X`

This was interrupted after a few seconds and
::

  $ perf report
  # Samples: 27666
  #
  # Overhead  Command                            Shared Object
  # ........  .......  .......................................
  #
      51.95%     Xorg  [vdso]
      47.95%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1
       0.09%     Xorg  /lib/i686/cmov/libc-2.9.so
       0.01%     Xorg  [kernel]
  #
  # (For more details, try: perf report --sort comm,dso,symbol)
  #

So, almost half of the events are occurring in a library. To get an idea
which symbol was responsible:
::

  $ perf report --sort comm,dso,symbol
  # Samples: 27666
  #
  # Overhead  Command                            Shared Object  Symbol
  # ........  .......  .......................................  ......
  #
      51.95%     Xorg  [vdso]                                   [.] 0x000000ffffe424
      47.93%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] pixmanFillsse2
       0.09%     Xorg  /lib/i686/cmov/libc-2.9.so               [.] _int_malloc
       0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] pixman_region32_copy_f
       0.01%     Xorg  [kernel]                                 [k] read_hpet
       0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] get_fast_path
       0.00%     Xorg  [kernel]                                 [k] ftrace_trace_userstack

To see where within the function pixmanFillsse2 things are going wrong:
::

  $ perf annotate pixmanFillsse2
  [ ... ]
    0.00 :         34eeb:       0f 18 08                prefetcht0 (%eax)
         :      }
         :
         :      extern __inline void __attribute__((__gnu_inline__, __always_inline__, _
         :      _mm_store_si128 (__m128i *__P, __m128i __B)
         :      {
         :        *__P = __B;
   12.40 :         34eee:       66 0f 7f 80 40 ff ff    movdqa %xmm0,-0xc0(%eax)
    0.00 :         34ef5:       ff
   12.40 :         34ef6:       66 0f 7f 80 50 ff ff    movdqa %xmm0,-0xb0(%eax)
    0.00 :         34efd:       ff
   12.39 :         34efe:       66 0f 7f 80 60 ff ff    movdqa %xmm0,-0xa0(%eax)
    0.00 :         34f05:       ff
   12.67 :         34f06:       66 0f 7f 80 70 ff ff    movdqa %xmm0,-0x90(%eax)
    0.00 :         34f0d:       ff
   12.58 :         34f0e:       66 0f 7f 40 80          movdqa %xmm0,-0x80(%eax)
   12.31 :         34f13:       66 0f 7f 40 90          movdqa %xmm0,-0x70(%eax)
   12.40 :         34f18:       66 0f 7f 40 a0          movdqa %xmm0,-0x60(%eax)
   12.31 :         34f1d:       66 0f 7f 40 b0          movdqa %xmm0,-0x50(%eax)

At a glance, it looks like the time is being spent copying pixmaps to
the card. Further investigation would be needed to determine why pixmaps
are being copied around so much, but a starting point would be to take the
ancient build of libpixman out of the library path, where it had been
totally forgotten about for months!