0001 perf-c2c(1)
0002 ===========
0003
0004 NAME
0005 ----
0006 perf-c2c - Shared Data C2C/HITM Analyzer.
0007
0008 SYNOPSIS
0009 --------
0010 [verse]
0011 'perf c2c record' [<options>] <command>
0012 'perf c2c record' [<options>] \-- [<record command options>] <command>
0013 'perf c2c report' [<options>]
0014
0015 DESCRIPTION
0016 -----------
0017 C2C stands for Cache To Cache.
0018
0019 The perf c2c tool provides means for Shared Data C2C/HITM analysis. It allows
0020 you to track down cacheline contention.
0021
0022 On x86, the tool is based on load latency and precise store facility events
0023 provided by Intel CPUs. On PowerPC, the tool uses random instruction sampling
0024 with the thresholding feature.
0025
0026 These events provide:
0027 - memory address of the access
0028 - type of the access (load and store details)
0029 - latency (in cycles) of the load access
0030
0031 The c2c tool provides means to record this data and report back access details
0032 for the cachelines with the highest contention, that is, the highest number of HITM accesses.
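
As an illustration of how a sampled memory address relates to a cacheline, the Python sketch below splits an address into a cacheline base address and an offset within that line. It assumes 64-byte cachelines, which is common but CPU-specific; `split_address` is a hypothetical helper and not part of perf.

```python
CACHELINE_SIZE = 64  # assumption: 64-byte cachelines; the real size depends on the CPU


def split_address(addr, line_size=CACHELINE_SIZE):
    """Return (cacheline base address, offset within the cacheline)."""
    if line_size <= 0 or line_size & (line_size - 1):
        raise ValueError("line_size must be a power of two")
    base = addr & ~(line_size - 1)    # clear the low bits to get the line address
    offset = addr & (line_size - 1)   # keep the low bits to get the offset
    return base, offset


base, offset = split_address(0x7F3A1C2048)
print(hex(base), hex(offset))
```

Two accesses contend for the same cacheline when they share the same base address, even if their offsets differ.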
0033
0034 The basic workflow with this tool follows the standard record/report phases.
0035 The user runs the record command to record event data and the report command
0036 to display it.
0037
0038
0039 RECORD OPTIONS
0040 --------------
0041 -e::
0042 --event=::
0043 Select the PMU event. Use 'perf c2c record -e list'
0044 to list available events.
0045
0046 -v::
0047 --verbose::
0048 Be more verbose (show counter open errors, etc).
0049
0050 -l::
0051 --ldlat::
0052 Configure mem-loads latency. (x86 only)
0053
0054 -k::
0055 --all-kernel::
0056 Configure all used events to run in kernel space.
0057
0058 -u::
0059 --all-user::
0060 Configure all used events to run in user space.
0061
0062 REPORT OPTIONS
0063 --------------
0064 -k::
0065 --vmlinux=<file>::
0066 vmlinux pathname
0067
0068 -v::
0069 --verbose::
0070 Be more verbose (show counter open errors, etc).
0071
0072 -i::
0073 --input::
0074 Specify the input file to process.
0075
0076 -N::
0077 --node-info::
0078 Show extra node info in report (see NODE INFO section)
0079
0080 -c::
0081 --coalesce::
0082 Specify sorting fields for single cacheline display.
0083 The following fields are available: tid,pid,iaddr,dso
0084 (see COALESCE)
0085
0086 -g::
0087 --call-graph::
0088 Set up callchain parameters.
0089 Please refer to perf-report man page for details.
0090
0091 --stdio::
0092 Force the stdio output (see STDIO OUTPUT)
0093
0094 --stats::
0095 Display only statistic tables and force stdio mode.
0096
0097 --full-symbols::
0098 Display full length of symbols.
0099
0100 --no-source::
0101 Do not display Source:Line column.
0102
0103 --show-all::
0104 Show all captured HITM lines, regardless of the HITM % 0.0005 limit.
0105
0106 -f::
0107 --force::
0108 Don't do ownership validation.
0109
0110 -d::
0111 --display::
0112 Switch to HITM type (rmt, lcl) or peer snooping type (peer) to display
0113 and sort on. The default is total HITMs (tot), except on Arm64, where peer
0114 mode is the default.
0115
0116 --stitch-lbr::
0117 Show callgraph with stitched LBRs, which may have more complete
0118 callgraph. The perf.data file must have been obtained using
0119 perf c2c record --call-graph lbr.
0120 Disabled by default. In common cases with call stack overflows,
0121 it can recreate better call stacks than the default lbr call stack
0122 output. But this approach is not foolproof. There can be cases
0123 where it creates incorrect call stacks from incorrect matches.
0124 The known limitations include exception handling such as
0125 setjmp/longjmp, where calls and returns do not match.
0126
0127 C2C RECORD
0128 ----------
0129 The perf c2c record command sets up options related to HITM cacheline analysis
0130 and calls the standard perf record command.
0131
0132 The following perf record options are configured by default:
0133 (check perf record man page for details)
0134
0135 -W,-d,--phys-data,--sample-cpu
0136
0137 Unless specified otherwise with the '-e' option, the following events are monitored by
0138 default on x86:
0139
0140 cpu/mem-loads,ldlat=30/P
0141 cpu/mem-stores/P
0142
0143 and the following on PowerPC:
0144
0145 cpu/mem-loads/
0146 cpu/mem-stores/
0147
0148 The user can pass any 'perf record' option after the '--' marker, for example (to enable
0149 callchains and system wide monitoring):
0150
0151 $ perf c2c record -- -g -a
0152
0153 Please check the RECORD OPTIONS section for specific c2c record options.
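
For scripted runs, the split between c2c record options and pass-through 'perf record' options can be sketched in Python as below. `c2c_record_cmd` is a hypothetical helper that only assembles the argument list; it uses options described in this page and does not run perf itself.

```python
import shlex


def c2c_record_cmd(workload, c2c_opts=(), record_opts=()):
    """Assemble a 'perf c2c record' command line as an argument list."""
    cmd = ["perf", "c2c", "record", *c2c_opts]
    if record_opts:
        # Everything after '--' is handed to 'perf record' unchanged.
        cmd += ["--", *record_opts]
    return cmd + list(workload)


# Example: user-space events only, with callchains and system wide monitoring.
cmd = c2c_record_cmd(["sleep", "5"], c2c_opts=["--all-user"], record_opts=["-g", "-a"])
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` on a machine where perf is installed and the caller has permission to sample.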
0154
0155 C2C REPORT
0156 ----------
0157 The perf c2c report command displays shared data analysis. It comes in two
0158 display modes: stdio and tui (default).
0159
0160 The report command workflow is as follows:
0161 - sort all the data based on the cacheline address
0162 - store access details for each cacheline
0163 - sort all cachelines based on user settings
0164 - display data
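
The four steps above can be sketched in Python as follows. This is a simplified illustration, not the actual perf implementation: the sample tuples, the access kind names and the 64-byte cacheline size are assumptions made for the example.

```python
from collections import defaultdict

LINE_SIZE = 64  # assumption: 64-byte cachelines

# Hypothetical, simplified samples: (data address, access kind).
samples = [
    (0x1008, "lcl_hitm"),
    (0x1010, "store"),
    (0x2000, "load"),
    (0x1030, "rmt_hitm"),
    (0x2008, "lcl_hitm"),
]

# Steps 1 and 2: bucket samples by cacheline address and keep access details.
lines = defaultdict(lambda: {"hitm": 0, "accesses": []})
for addr, kind in samples:
    entry = lines[addr & ~(LINE_SIZE - 1)]
    entry["accesses"].append((addr & (LINE_SIZE - 1), kind))
    if kind.endswith("hitm"):
        entry["hitm"] += 1

# Step 3: order the cachelines, here by total HITM count, highest first.
ranked = sorted(lines.items(), key=lambda item: item[1]["hitm"], reverse=True)

# Step 4: display one row per cacheline.
for index, (line, entry) in enumerate(ranked):
    print(f"{index} {line:#x} hitm={entry['hitm']} records={len(entry['accesses'])}")
```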
0165
0166 In general, the perf c2c report output consists of 2 basic views:
0167 1) most expensive cachelines list
0168 2) offsets details for each cacheline
0169
0170 For each cacheline in the 1) list we display the following data:
0171 (Both stdio and TUI modes display the same fields)
0172
0173 Index
0174 - zero based index to identify the cacheline
0175
0176 Cacheline
0177 - cacheline address (hex number)
0178
0179 Rmt/Lcl Hitm (Display with HITM types)
0180 - cacheline percentage of all Remote/Local HITM accesses
0181
0182 Peer Snoop (Display with peer type)
0183 - cacheline percentage of all peer accesses
0184
0185 LLC Load Hitm - Total, LclHitm, RmtHitm (For display with HITM types)
0186 - count of Total/Local/Remote load HITMs
0187
0188 Load Peer - Total, Local, Remote (For display with peer type)
0189 - count of Total/Local/Remote load from peer cache or DRAM
0190
0191 Total records
0192 - sum of all cachelines accesses
0193
0194 Total loads
0195 - sum of all load accesses
0196
0197 Total stores
0198 - sum of all store accesses
0199
0200 Store Reference - L1Hit, L1Miss, N/A
0201 L1Hit - store accesses that hit L1
0202 L1Miss - store accesses that missed L1
0203 N/A - store accesses for which the memory level is not available
0204
0205 Core Load Hit - FB, L1, L2
0206 - count of load hits in FB (Fill Buffer), L1 and L2 cache
0207
0208 LLC Load Hit - LlcHit, LclHitm
0209 - count of LLC load accesses, includes LLC hits and LLC HITMs
0210
0211 RMT Load Hit - RmtHit, RmtHitm
0212 - count of remote load accesses, including remote hits and remote HITMs;
0213 on Arm Neoverse cores, RmtHit accounts for remote accesses, including
0214 remote DRAM or any upward cache level in the remote node
0215
0216 Load Dram - Lcl, Rmt
0217 - count of local and remote DRAM accesses
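
The 'Rmt/Lcl Hitm' column expresses each cacheline's share of all HITM accesses of that kind. A minimal Python sketch of that arithmetic follows; the cacheline addresses and remote HITM counts are made up for illustration.

```python
# Hypothetical remote HITM counts, keyed by cacheline address.
rmt_hitm = {0x1000: 30, 0x2000: 15, 0x3000: 5}

total = sum(rmt_hitm.values())

# Percentage of all remote HITMs attributed to each cacheline.
# An empty or all-zero input yields no rows instead of dividing by zero.
share = {line: 100.0 * count / total for line, count in rmt_hitm.items()} if total else {}

for line, pct in share.items():
    print(f"{line:#x} {pct:6.2f}%")
```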
0218
0219 For each offset in the 2) list we display the following data:
0220
0221 HITM - Rmt, Lcl (Display with HITM types)
0222 - % of Remote/Local HITM accesses for given offset within cacheline
0223
0224 Peer Snoop - Rmt, Lcl (Display with peer type)
0225 - % of Remote/Local peer accesses for given offset within cacheline
0226
0227 Store Refs - L1 Hit, L1 Miss, N/A
0228 - % of store accesses that hit L1, missed L1 and N/A (not available) memory
0229 level for given offset within cacheline
0230
0231 Data address - Offset
0232 - offset address
0233
0234 Pid
0235 - pid of the process responsible for the accesses
0236
0237 Tid
0238 - tid of the thread responsible for the accesses
0239
0240 Code address
0241 - code address responsible for the accesses
0242
0243 cycles - rmt hitm, lcl hitm, load (Display with HITM types)
0244 - sum of cycles for given accesses - Remote/Local HITM and generic load
0245
0246 cycles - rmt peer, lcl peer, load (Display with peer type)
0247 - sum of cycles for given accesses - Remote/Local peer load and generic load
0248
0249 cpu cnt
0250 - number of cpus that participated in the access
0251
0252 Symbol
0253 - code symbol related to the 'Code address' value
0254
0255 Shared Object
0256 - shared object name related to the 'Code address' value
0257
0258 Source:Line
0259 - source information related to the 'Code address' value
0260
0261 Node
0262 - nodes participating in the access (see NODE INFO section)
0263
0264 NODE INFO
0265 ---------
0266 The 'Node' field displays the nodes that access the given cacheline
0267 offset. Its output comes in 3 flavors:
0268 - node IDs separated by ','
0269 - node IDs with stats for each ID, in the following format:
0270 Node{cpus %hitms %stores} (Display with HITM types)
0271 Node{cpus %peers %stores} (Display with peer type)
0272 - node IDs with a list of affected CPUs, in the following format:
0273 Node{cpu list}
0274
0275 The user can switch between the above flavors with the -N option or
0276 use the 'n' key to switch interactively in TUI mode.
0277
0278 COALESCE
0279 --------
0280 The user can specify how to sort offsets for a cacheline.
0281
0282 The following fields are available and govern the final
0283 set of output fields for the cacheline offsets output:
0284
0285 tid - coalesced by process TIDs
0286 pid - coalesced by process PIDs
0287 iaddr - coalesced by code address, following fields are displayed:
0288 Code address, Code symbol, Shared Object, Source line
0289 dso - coalesced by shared object
0290
0291 By default the coalescing is set up with 'pid,iaddr'.
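
The effect of the coalescing fields can be sketched in Python as grouping access records by a tuple of the selected fields. The records and the `coalesce` helper below are hypothetical and only illustrate how the choice of fields changes the number of output rows.

```python
from collections import Counter

# Hypothetical access records for one cacheline offset.
accesses = [
    {"pid": 100, "tid": 101, "iaddr": 0x401000, "dso": "app"},
    {"pid": 100, "tid": 102, "iaddr": 0x401000, "dso": "app"},
    {"pid": 100, "tid": 102, "iaddr": 0x401020, "dso": "app"},
    {"pid": 200, "tid": 200, "iaddr": 0x7F0010, "dso": "libfoo.so"},
]


def coalesce(records, fields=("pid", "iaddr")):
    """Count the records that share the same values for the given fields."""
    return Counter(tuple(rec[f] for f in fields) for rec in records)


by_default = coalesce(accesses)                          # like -c pid,iaddr
by_thread = coalesce(accesses, fields=("tid", "iaddr"))  # like -c tid,iaddr
print(len(by_default), len(by_thread))
```

With 'pid,iaddr' the two threads of pid 100 that execute the same code address fall into one row; with 'tid,iaddr' they are reported separately.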
0292
0293 STDIO OUTPUT
0294 ------------
0295 The stdio output displays data on standard output.
0296
0297 The following tables are displayed:
0298 Trace Event Information
0299 - overall statistics of memory accesses
0300
0301 Global Shared Cache Line Event Information
0302 - overall statistics on shared cachelines
0303
0304 Shared Data Cache Line Table
0305 - list of most expensive cachelines
0306
0307 Shared Cache Line Distribution Pareto
0308 - list of all accessed offsets for each cacheline
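
When the stdio tables are produced from a script, the report command line can be assembled as in the Python sketch below. `c2c_report_cmd` is a hypothetical helper; it only combines options described in REPORT OPTIONS and does not invoke perf.

```python
import shlex


def c2c_report_cmd(input_file=None, display=None, coalesce=None, stdio=False):
    """Assemble a 'perf c2c report' command line as an argument list."""
    cmd = ["perf", "c2c", "report"]
    if input_file:
        cmd += ["-i", input_file]
    if display:
        if display not in ("tot", "rmt", "lcl", "peer"):
            raise ValueError(f"unknown display type: {display}")
        cmd += ["-d", display]
    if coalesce:
        cmd += ["-c", ",".join(coalesce)]
    if stdio:
        cmd.append("--stdio")
    return cmd


# Example: stdio report sorted on local HITMs, coalesced by thread and code address.
cmd = c2c_report_cmd("perf.data", display="lcl", coalesce=["tid", "iaddr"], stdio=True)
print(shlex.join(cmd))
```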
0309
0310 TUI OUTPUT
0311 ----------
0312 The TUI output provides an interactive interface to navigate
0313 through the cachelines list and to display offset details.
0314
0315 For details please refer to the help window by pressing the '?' key.
0316
0317 CREDITS
0318 -------
0319 Although Don Zickus, Dick Fowles and Joe Mario worked together
0320 to get this implemented, we got lots of early help from Arnaldo
0321 Carvalho de Melo, Stephane Eranian, Jiri Olsa and Andi Kleen.
0322
0323 C2C BLOG
0324 --------
0325 Check Joe's blog on the c2c tool for a detailed use case explanation:
0326 https://joemario.github.io/blog/2016/09/01/c2c-blog/
0327
0328 SEE ALSO
0329 --------
0330 linkperf:perf-record[1], linkperf:perf-mem[1]