perf-arm-spe(1)
===============

NAME
----
perf-arm-spe - Support for Arm Statistical Profiling Extension within Perf tools

SYNOPSIS
--------
[verse]
'perf record' -e arm_spe//

DESCRIPTION
-----------

The SPE (Statistical Profiling Extension) feature provides accurate attribution of latencies and
events down to individual instructions. Rather than being interrupt-driven, it picks an
instruction to sample and then captures data for it during execution. The data includes the
execution time in cycles. For loads and stores it also includes the data address, cache miss
events, and data origin.

The sampling has 5 stages:

  1. Choose an operation
  2. Collect data about the operation
  3. Optionally discard the record based on a filter
  4. Write the record to memory
  5. Interrupt when the buffer is full

Choose an operation
~~~~~~~~~~~~~~~~~~~

The operation is chosen from a sample population. For SPE this is an IMPLEMENTATION DEFINED choice
of all architectural instructions or all micro-ops. Sampling happens at a programmable interval.
The architecture provides a mechanism for the SPE driver to infer the minimum interval at which it
should sample. The driver uses this minimum interval if no interval is specified. A pseudo-random
perturbation is also added to the sampling interval by default.

Collect data about the operation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The program counter, PMU events, timings and data addresses related to the operation are recorded.
Sampling ensures that only one sampled operation is in flight at a time.

Optionally discard the record based on a filter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Based on programmable criteria, the record is either kept or discarded. If the record is
discarded then the flow stops here for this sample.

Write the record to memory
~~~~~~~~~~~~~~~~~~~~~~~~~~

The record is appended to a memory buffer.

Interrupt when the buffer is full
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When the buffer fills, an interrupt is sent and the driver signals Perf to collect the records.
Perf saves the raw data in the perf.data file.
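
The five stages can be pictured with a small toy model. This is only a sketch of the flow
described above, not of the hardware: the function name 'spe_toy_model', the fixed interval
without jitter, the made-up latency formula and the buffer size counted in records are all
assumptions chosen for illustration.

```shell
# Toy model of the five SPE stages. Every name and number here is illustrative.
# Usage: spe_toy_model TOTAL_OPS INTERVAL MIN_LATENCY BUFFER_SLOTS
spe_toy_model() {
    total_ops=$1 interval=$2 min_latency=$3 buffer_slots=$4
    written=0 discarded=0 interrupts=0 in_buffer=0
    op=1
    while [ "$op" -le "$total_ops" ]; do
        # Stage 1: choose every INTERVAL-th operation (no jitter in this model).
        if [ $((op % interval)) -eq 0 ]; then
            # Stage 2: collect data; the latency is a made-up function of the op number.
            latency=$((op % 50))
            # Stage 3: optionally discard the record based on a filter.
            if [ "$latency" -lt "$min_latency" ]; then
                discarded=$((discarded + 1))
            else
                # Stage 4: write the record to the buffer.
                written=$((written + 1))
                in_buffer=$((in_buffer + 1))
                # Stage 5: "interrupt" when the buffer is full, then start a fresh buffer.
                if [ "$in_buffer" -eq "$buffer_slots" ]; then
                    interrupts=$((interrupts + 1))
                    in_buffer=0
                fi
            fi
        fi
        op=$((op + 1))
    done
    echo "written=$written discarded=$discarded interrupts=$interrupts"
}
```

For example, 'spe_toy_model 100 10 20 2' samples ten operations, discards the four whose made-up
latency is below 20, and fills the two-record buffer three times.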

Opening the file
----------------

Up to this point neither the kernel nor Perf has decoded the SPE data. Decoding happens only when
the recorded file is opened with 'perf report' or 'perf script'. When decoding the data, Perf
generates "synthetic samples" as if these were generated at the time of the recording. These
samples are the same as if normal sampling was done by Perf without using SPE, although they may
have more attributes associated with them. For example, a normal sample may have just the
instruction pointer, but an SPE sample can have data addresses and latency attributes.

Why Sampling?
-------------

 - Sampling, rather than tracing, cuts down the profiling problem to something more manageable for
 hardware. Only one sampled operation is in flight at a time.

 - Allows precise attribution data, including the full PC of the instruction and the data virtual
 and physical addresses.

 - Allows correlation between an instruction and events, such as TLB and cache misses. (The data
 source indicates which particular cache was hit, but the meaning is implementation defined because
 different implementations can have different cache configurations.)

However, SPE does not provide any call-graph information, and relies on statistical methods.

Collisions
----------

When an operation is sampled while a previous sampled operation has not finished, a collision
occurs. The new sample is dropped. Collisions affect the integrity of the data, so the sample rate
should be set to avoid collisions.

The 'sample_collision' PMU event can be used to determine the number of lost samples. However, this
count is based on collisions _before_ filtering occurs. It therefore cannot be used as an exact
count of the dropped samples that would have made it through the filter, but it can serve as a
rough guide.
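
As a rough guide, the share of samples lost to collisions can be worked out from two counts. The
sketch below assumes you already have the 'sample_collision' count and a count of the samples
that were taken without colliding. The function name 'spe_collision_pct' is hypothetical, and
how the counts are gathered is left open. Because collisions are counted before filtering, treat
the result as an estimate only.

```shell
# Rough share of samples lost to collisions, as a percentage.
#   $1: value of the sample_collision PMU event
#   $2: number of samples taken without colliding (an assumed input)
spe_collision_pct() {
    awk -v c="$1" -v t="$2" 'BEGIN {
        if (c + t == 0) { print "0.00"; exit }
        printf "%.2f\n", 100 * c / (c + t)
    }'
}
```

For example, 'spe_collision_pct 50 950' prints 5.00, suggesting that the sampling interval may be
worth increasing.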

The effect of microarchitectural sampling
-----------------------------------------

If an implementation samples micro-operations instead of instructions, the results of sampling must
be weighted accordingly.

For example, if a given instruction A is always converted into two micro-operations, A0 and A1, it
becomes twice as likely to appear in the sample population.

The coarse effect of conversions, and, if applicable, sampling of speculative operations, can be
estimated from the 'sample_pop' and 'inst_retired' PMU events.
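
One way to read the two counters is as a ratio. The sketch below divides a 'sample_pop' count by
an 'inst_retired' count. The function name 'spe_ops_per_inst' is hypothetical and the counts
are assumed to have been collected over the same workload. A result close to 1 suggests the
sample population is roughly one operation per retired instruction. A larger value hints at
micro-op expansion or speculative operations being part of the population.

```shell
# Coarse ratio of sampled-population operations to retired instructions.
#   $1: value of the sample_pop PMU event
#   $2: value of the inst_retired PMU event
spe_ops_per_inst() {
    awk -v pop="$1" -v ret="$2" 'BEGIN {
        if (ret == 0) { print "n/a"; exit }
        printf "%.2f\n", pop / ret
    }'
}
```

For example, 'spe_ops_per_inst 1500 1000' prints 1.50.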

Kernel Requirements
-------------------

The ARM_SPE_PMU config must be set so that the driver is built either as a module or statically.

Depending on CPU model, the kernel may need to be booted with page table isolation disabled
(kpti=off). If KPTI needs to be disabled, the SPE driver will fail to initialize with the console
message "profiling buffer inaccessible. Try passing 'kpti=off' on the kernel command line".
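
A quick way to check the first requirement is to look at the configuration file of the running
kernel. The sketch below assumes the option appears as CONFIG_ARM_SPE_PMU in that file. The
function name 'spe_config_state' is hypothetical, and where the config file lives depends on
the distribution.

```shell
# Print how CONFIG_ARM_SPE_PMU is set in a kernel config file: y, m or unset.
#   $1: path to a kernel config file
spe_config_state() {
    config_file=$1
    state=$(sed -n 's/^CONFIG_ARM_SPE_PMU=\([ym]\)$/\1/p' "$config_file")
    echo "${state:-unset}"
}
```

On systems that install the config under /boot (an assumption, not all do), it could be used as:

  spe_config_state "/boot/config-$(uname -r)"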

Capturing SPE with perf command-line tools
------------------------------------------

You can record a session with SPE samples:

  perf record -e arm_spe// -- ./mybench

The sample period is set with the -c option. Because the minimum interval is used by default,
it's recommended to set this to a higher value. The value is written to PMSIRR.INTERVAL.
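
When choosing a value for -c, it can help to estimate how many records a given interval
produces. The sketch below simply divides an operation count by the interval. The function name
'spe_expected_samples' is hypothetical, and the estimate ignores jitter, filtering and
collisions, so it is an upper-bound style guess rather than a prediction.

```shell
# Rough number of records produced for a workload at a given sampling interval.
#   $1: approximate number of operations in the sample population
#   $2: sampling interval, as passed to -c
spe_expected_samples() {
    ops=$1 interval=$2
    echo $((ops / interval))
}
```

For example, 'spe_expected_samples 1000000000 4096' prints 244140. A recording with that interval
would look like:

  perf record -e arm_spe// -c 4096 -- ./mybench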

Config parameters
~~~~~~~~~~~~~~~~~

These are placed between the // in the event and are comma separated. For example '-e
arm_spe/load_filter=1,min_latency=10/'

  branch_filter=1     - collect branches only (PMSFCR.B)
  event_filter=<mask> - filter on specific events (PMSEVFR) - see bitfield description below
  jitter=1            - use jitter to avoid resonance when sampling (PMSIRR.RND)
  load_filter=1       - collect loads only (PMSFCR.LD)
  min_latency=<n>     - collect only samples with this latency or higher* (PMSLATFR)
  pa_enable=1         - collect physical address (as well as VA) of loads/stores (PMSCR.PA) - requires privilege
  pct_enable=1        - collect physical timestamp instead of virtual timestamp (PMSCR.PCT) - requires privilege
  store_filter=1      - collect stores only (PMSFCR.ST)
  ts_enable=1         - enable timestamping with value of generic timer (PMSCR.TS)

+++*+++ Latency is the total latency from the point at which sampling started on that instruction, rather
than only the execution latency.
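
When several parameters are combined in scripts, a small helper can keep the event string
readable. The sketch below joins its arguments with commas and wraps them in 'arm_spe/.../'.
The function name 'spe_event' is hypothetical, and it does not validate the parameter names.

```shell
# Build an SPE event string from key=value parameters.
# With no arguments it prints the plain event, arm_spe//.
spe_event() {
    params=""
    for p in "$@"; do
        # Prefix a comma only when something has been added already.
        params="${params:+$params,}$p"
    done
    echo "arm_spe/$params/"
}
```

For example, to collect only stores with timestamps enabled:

  perf record -e "$(spe_event store_filter=1 ts_enable=1)" -- ./mybench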

Only some events can be filtered on; these include:

  bit 1     - instruction retired (i.e. omit speculative instructions)
  bit 3     - L1D refill
  bit 5     - TLB refill
  bit 7     - mispredict
  bit 11    - misaligned access

So to sample just retired instructions:

  perf record -e arm_spe/event_filter=2/ -- ./mybench

or just mispredicted branches:

  perf record -e arm_spe/event_filter=0x80/ -- ./mybench
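
The mask values in these examples are simply the listed bit numbers turned into a bitmask (bit 1
gives 2, bit 7 gives 0x80). The sketch below does that conversion for one or more bit numbers.
The function name 'spe_event_mask' is hypothetical. How the hardware treats a mask with
several bits set is defined by the PMSEVFR register description, so check that before combining
bits.

```shell
# Turn a list of event bit numbers into an event_filter mask in hex.
spe_event_mask() {
    mask=0
    for bit in "$@"; do
        mask=$((mask | (1 << bit)))
    done
    printf '0x%x\n' "$mask"
}
```

For example, 'spe_event_mask 7' prints 0x80, matching the mispredicted branch example, and it can
be used inline:

  perf record -e arm_spe/event_filter=$(spe_event_mask 3)/ -- ./mybench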

Viewing the data
~~~~~~~~~~~~~~~~

By default perf report and perf script will assign samples to separate groups depending on the
attributes/events of the SPE record. Because instructions can have multiple events associated with
them, the samples in these groups are not necessarily unique. For example perf report shows these
groups:

  Available samples
  0 arm_spe//
  0 dummy:u
  21 l1d-miss
  897 l1d-access
  5 llc-miss
  7 llc-access
  2 tlb-miss
  1K tlb-access
  36 branch-miss
  0 remote-access
  900 memory

The arm_spe// and dummy:u events are implementation details and are expected to be empty.

To get a full list of unique samples that are not sorted into groups, set the itrace option to
generate 'instruction' samples. The period option is also taken into account, so set it to 1
instruction unless you want to further downsample the already sampled SPE data:

  perf report --itrace=i1i

Memory access details are also stored on the samples and this can be viewed with:

  perf report --mem-mode

Common errors
~~~~~~~~~~~~~

 - "Cannot find PMU `arm_spe'. Missing kernel support?"

   Module not built or loaded, KPTI not disabled (see above), or running on a VM.

 - "Arm SPE CONTEXT packets not found in the traces."

   Root privilege is required to collect context packets. However, these only increase the accuracy
   of assigning PIDs to kernel samples. For userspace sampling this can be ignored.

 - Excessively large perf.data file size

   Increase the sampling interval (see above).
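
For the first error, it can help to check whether the kernel has registered an SPE PMU at all.
The sketch below lists entries whose names start with 'arm_spe' in the sysfs event source
directory. The function name 'spe_pmu_list' is hypothetical, and the default path and the
'arm_spe' name prefix are assumptions about how the driver exposes the PMU.

```shell
# List SPE PMU instances found in an event_source devices directory.
#   $1: directory to scan (defaults to the assumed sysfs location)
spe_pmu_list() {
    dir=${1:-/sys/bus/event_source/devices}
    for d in "$dir"/arm_spe*; do
        # An unmatched glob is left as a literal pattern, so test that the entry exists.
        if [ -e "$d" ]; then
            basename "$d"
        fi
    done
    return 0
}
```

If 'spe_pmu_list' prints nothing, the PMU is not available to perf and the causes listed for that
error are worth checking.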


SEE ALSO
--------

linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1],
linkperf:perf-inject[1]