0001 perf-c2c(1)
0002 ===========
0003 
0004 NAME
0005 ----
0006 perf-c2c - Shared Data C2C/HITM Analyzer.
0007 
0008 SYNOPSIS
0009 --------
0010 [verse]
0011 'perf c2c record' [<options>] <command>
0012 'perf c2c record' [<options>] \-- [<record command options>] <command>
0013 'perf c2c report' [<options>]
0014 
0015 DESCRIPTION
0016 -----------
0017 C2C stands for Cache To Cache.
0018 
0019 The perf c2c tool provides a means for Shared Data C2C/HITM analysis. It allows
0020 you to track down cacheline contention.
0021 
0022 On x86, the tool is based on load latency and precise store facility events
0023 provided by Intel CPUs. On PowerPC, the tool uses random instruction sampling
0024 with the thresholding feature.
0025 
0026 These events provide:
0027   - memory address of the access
0028   - type of the access (load and store details)
0029   - latency (in cycles) of the load access
0030 
0031 The c2c tool provides a means to record this data and report access details
0032 for the cachelines with the highest contention (highest number of HITM accesses).
0033 
0034 The basic workflow with this tool follows the standard record/report phases.
0035 Use the record command to record event data and the report command to
0036 display it.
0037 
0038 
0039 RECORD OPTIONS
0040 --------------
0041 -e::
0042 --event=::
0043         Select the PMU event. Use 'perf c2c record -e list'
0044         to list available events.
0045 
0046 -v::
0047 --verbose::
0048         Be more verbose (show counter open errors, etc).
0049 
0050 -l::
0051 --ldlat::
0052         Configure mem-loads latency. (x86 only)
0053 
0054 -k::
0055 --all-kernel::
0056         Configure all used events to run in kernel space.
0057 
0058 -u::
0059 --all-user::
0060         Configure all used events to run in user space.
0061 
0062 REPORT OPTIONS
0063 --------------
0064 -k::
0065 --vmlinux=<file>::
0066         vmlinux pathname
0067 
0068 -v::
0069 --verbose::
0070         Be more verbose (show counter open errors, etc).
0071 
0072 -i::
0073 --input::
0074         Specify the input file to process.
0075 
0076 -N::
0077 --node-info::
0078         Show extra node info in report (see NODE INFO section)
0079 
0080 -c::
0081 --coalesce::
0082         Specify sorting fields for single cacheline display.
0083         The following fields are available: tid,pid,iaddr,dso
0084         (see COALESCE)
0085 
0086 -g::
0087 --call-graph::
0088         Set up callchain parameters.
0089         Please refer to the perf-report man page for details.
0090 
0091 --stdio::
0092         Force the stdio output (see STDIO OUTPUT)
0093 
0094 --stats::
0095         Display only statistic tables and force stdio mode.
0096 
0097 --full-symbols::
0098         Display full length of symbols.
0099 
0100 --no-source::
0101         Do not display Source:Line column.
0102 
0103 --show-all::
0104         Show all captured HITM lines, regardless of the HITM % 0.0005 limit.
0105 
0106 -f::
0107 --force::
0108         Don't do ownership validation.
0109 
0110 -d::
0111 --display::
0112         Switch to HITM type (rmt, lcl) or peer snooping type (peer) to display
0113         and sort on. The default is total HITMs (tot), except on Arm64, where
0114         peer mode is the default.
0115 
0116 --stitch-lbr::
0117         Show callgraph with stitched LBRs, which may give a more complete
0118         callgraph. The perf.data file must have been obtained using
0119         perf c2c record --call-graph lbr.
0120         Disabled by default. In common cases with call stack overflows,
0121         it can recreate better call stacks than the default lbr call stack
0122         output. But this approach is not foolproof. There can be cases
0123         where it creates incorrect call stacks from incorrect matches.
0124         The known limitations include exception handling such as
0125         setjmp/longjmp, whose calls and returns do not match.
0126 
0127 C2C RECORD
0128 ----------
0129 The perf c2c record command sets up options related to HITM cacheline analysis
0130 and calls the standard perf record command.
0131 
0132 The following perf record options are configured by default
0133 (check the perf record man page for details):
0134 
0135   -W,-d,--phys-data,--sample-cpu
0136 
0137 Unless specified otherwise with the '-e' option, the following events are
0138 monitored by default on x86:
0139 
0140   cpu/mem-loads,ldlat=30/P
0141   cpu/mem-stores/P
0142 
0143 and the following on PowerPC:
0144 
0145   cpu/mem-loads/
0146   cpu/mem-stores/
0147 
0148 You can pass any 'perf record' option after the '--' mark, for example to
0149 enable callchains and system-wide monitoring:
0150 
0151   $ perf c2c record -- -g -a
0152 
0153 Please check the RECORD OPTIONS section for specific c2c record options.
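
As a rough illustration of the defaults described in this section, the following Python sketch assembles the kind of 'perf record' command line that 'perf c2c record' sets up on x86. The helper name build_record_argv and the ordering of the arguments are assumptions made for this example. It does not reproduce perf's actual implementation, and it only prints the command instead of running it.

```python
def build_record_argv(command, ldlat=30, events=None, passthrough=()):
    """Return an argv list resembling what 'perf c2c record' sets up on x86.

    Hypothetical helper for illustration; the argument ordering is assumed.
    """
    # Options this man page lists as configured by default.
    argv = ["perf", "record", "-W", "-d", "--phys-data", "--sample-cpu"]
    if events is None:
        # Default x86 events used when '-e' is not given.
        events = ["cpu/mem-loads,ldlat=%d/P" % ldlat, "cpu/mem-stores/P"]
    for event in events:
        argv += ["-e", event]
    # Options placed behind '--' are handed to perf record unchanged.
    argv += list(passthrough)
    argv += list(command)
    return argv


if __name__ == "__main__":
    print(" ".join(build_record_argv(["sleep", "5"], passthrough=["-g", "-a"])))
```

The passthrough tuple plays the role of the options given after '--', so the example above mirrors 'perf c2c record -- -g -a' followed by a workload.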
0154 
0155 C2C REPORT
0156 ----------
0157 The perf c2c report command displays shared data analysis.  It comes in two
0158 display modes: stdio and tui (default).
0159 
0160 The report command workflow is as follows:
0161   - sort all the data based on the cacheline address
0162   - store access details for each cacheline
0163   - sort all cachelines based on user settings
0164   - display data
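
The four workflow steps can be sketched in a few lines of Python. This is a simplified model, not perf's code: the 64-byte cacheline size, the (address, kind) sample layout, and the choice to sort by total HITM count are assumptions made for the example.

```python
from collections import defaultdict

CACHELINE_SIZE = 64  # assumption: 64-byte cachelines


def cacheline_of(addr):
    """Mask off the low bits to get the address of the containing cacheline."""
    return addr & ~(CACHELINE_SIZE - 1)


def build_report(samples):
    """Group samples by cacheline and order cachelines by HITM count.

    Each sample is a hypothetical (data_address, kind) pair, where kind is
    one of 'hitm', 'load' or 'store'.
    """
    lines = defaultdict(lambda: {"hitm": 0, "load": 0, "store": 0, "accesses": []})
    # Steps 1 and 2: bucket by cacheline address and keep the access details.
    for addr, kind in samples:
        entry = lines[cacheline_of(addr)]
        entry[kind] += 1
        entry["accesses"].append((addr, kind))
    # Step 3: sort cachelines, here by HITM count (highest first).
    return sorted(lines.items(), key=lambda item: item[1]["hitm"], reverse=True)


if __name__ == "__main__":
    samples = [
        (0x1008, "hitm"), (0x1010, "store"), (0x1038, "hitm"),
        (0x2000, "load"), (0x2004, "hitm"),
    ]
    # Step 4: display one row per cacheline.
    for index, (line, stats) in enumerate(build_report(samples)):
        print(index, hex(line), stats["hitm"], stats["load"], stats["store"])
```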
0165 
0166 In general, the perf c2c report output consists of 2 basic views:
0167   1) most expensive cachelines list
0168   2) offsets details for each cacheline
0169 
0170 For each cacheline in the 1) list we display the following data:
0171 (Both stdio and TUI modes follow the same fields output)
0172 
0173   Index
0174   - zero based index to identify the cacheline
0175 
0176   Cacheline
0177   - cacheline address (hex number)
0178 
0179   Rmt/Lcl Hitm (Display with HITM types)
0180   - cacheline percentage of all Remote/Local HITM accesses
0181 
0182   Peer Snoop (Display with peer type)
0183   - cacheline percentage of all peer accesses
0184 
0185   LLC Load Hitm - Total, LclHitm, RmtHitm (For display with HITM types)
0186   - count of Total/Local/Remote load HITMs
0187 
0188   Load Peer - Total, Local, Remote (For display with peer type)
0189   - count of Total/Local/Remote loads from peer cache or DRAM
0190 
0191   Total records
0192   - sum of all cacheline accesses
0193 
0194   Total loads
0195   - sum of all load accesses
0196 
0197   Total stores
0198   - sum of all store accesses
0199 
0200   Store Reference - L1Hit, L1Miss, N/A
0201     L1Hit - store accesses that hit L1
0202     L1Miss - store accesses that missed L1
0203     N/A - store accesses for which the memory level is not available
0204 
0205   Core Load Hit - FB, L1, L2
0206   - count of load hits in FB (Fill Buffer), L1 and L2 cache
0207 
0208   LLC Load Hit - LlcHit, LclHitm
0209   - count of LLC load accesses, includes LLC hits and LLC HITMs
0210 
0211   RMT Load Hit - RmtHit, RmtHitm
0212   - count of remote load accesses, includes remote hits and remote HITMs;
0213     on Arm Neoverse cores, RmtHit accounts for remote accesses, including
0214     remote DRAM or any upward cache level in the remote node
0215 
0216   Load Dram - Lcl, Rmt
0217   - count of local and remote DRAM accesses
0218 
0219 For each offset in the 2) list we display the following data:
0220 
0221   HITM - Rmt, Lcl (Display with HITM types)
0222   - % of Remote/Local HITM accesses for given offset within cacheline
0223 
0224   Peer Snoop - Rmt, Lcl (Display with peer type)
0225   - % of Remote/Local peer accesses for given offset within cacheline
0226 
0227   Store Refs - L1 Hit, L1 Miss, N/A
0228   - % of store accesses that hit L1, missed L1, or have no memory level
0229     available (N/A), for the given offset within the cacheline
0230 
0231   Data address - Offset
0232   - offset address
0233 
0234   Pid
0235   - pid of the process responsible for the accesses
0236 
0237   Tid
0238   - tid of the process responsible for the accesses
0239 
0240   Code address
0241   - code address responsible for the accesses
0242 
0243   cycles - rmt hitm, lcl hitm, load (Display with HITM types)
0244     - sum of cycles for given accesses - Remote/Local HITM and generic load
0245 
0246   cycles - rmt peer, lcl peer, load (Display with peer type)
0247     - sum of cycles for given accesses - Remote/Local peer load and generic load
0248 
0249   cpu cnt
0250     - number of cpus that participated in the access
0251 
0252   Symbol
0253     - code symbol related to the 'Code address' value
0254 
0255   Shared Object
0256     - shared object name related to the 'Code address' value
0257 
0258   Source:Line
0259     - source information related to the 'Code address' value
0260 
0261   Node
0262     - nodes participating in the access (see NODE INFO section)
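
To make the per-offset percentages concrete, here is a small Python sketch that computes the share of HITM accesses falling on each offset of one cacheline. The 64-byte cacheline size and the function name offset_hitm_percent are assumptions for illustration. The real tool tracks remote and local HITMs separately and derives these values from its own statistics.

```python
from collections import Counter

CACHELINE_SIZE = 64  # assumption: 64-byte cachelines


def offset_hitm_percent(hitm_addresses):
    """Return {offset: percent} for HITM samples that fall in one cacheline."""
    # The offset is the position of the access inside its cacheline.
    counts = Counter(addr & (CACHELINE_SIZE - 1) for addr in hitm_addresses)
    total = sum(counts.values())
    return {off: 100.0 * n / total for off, n in sorted(counts.items())}


if __name__ == "__main__":
    # Three HITMs at offset 0x08 and one at offset 0x38 of cacheline 0x1000.
    for off, pct in offset_hitm_percent([0x1008, 0x1008, 0x1038, 0x1008]).items():
        print("0x%02x %6.2f%%" % (off, pct))
```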
0263 
0264 NODE INFO
0265 ---------
0266 The 'Node' field displays the nodes that access the given cacheline
0267 offset. Its output comes in 3 flavors:
0268   - node IDs separated by ','
0269   - node IDs with stats for each ID, in the following format:
0270       Node{cpus %hitms %stores} (Display with HITM types)
0271       Node{cpus %peers %stores} (Display with peer type)
0272   - node IDs with list of affected CPUs in the following format:
0273       Node{cpu list}
0274 
0275 You can switch between the above flavors with the -N option or
0276 use the 'n' key to switch interactively in TUI mode.
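
The three flavors can be pictured with the following Python sketch. The input dictionary, the flavor names ('ids', 'stats', 'cpus'), the exact spacing, and the reading of 'cpus' in the stats flavor as a CPU count are all assumptions made for this example. The actual column layout is defined by perf itself.

```python
def format_nodes(nodes, flavor):
    """Render per-node data in one of three flavors.

    nodes maps a node ID to a hypothetical dict with the keys
    'cpus' (list of CPU numbers), 'hitm_pct' and 'store_pct'.
    """
    parts = []
    for node_id in sorted(nodes):
        info = nodes[node_id]
        if flavor == "ids":
            # Flavor 1: plain node IDs.
            parts.append(str(node_id))
        elif flavor == "stats":
            # Flavor 2: Node{cpus %hitms %stores}, with cpus taken as a count.
            parts.append("%d{%d %.1f%% %.1f%%}" % (
                node_id, len(info["cpus"]), info["hitm_pct"], info["store_pct"]))
        elif flavor == "cpus":
            # Flavor 3: Node{cpu list}.
            cpu_list = ",".join(str(cpu) for cpu in sorted(info["cpus"]))
            parts.append("%d{%s}" % (node_id, cpu_list))
        else:
            raise ValueError("unknown flavor: %r" % flavor)
    separator = "," if flavor == "ids" else " "
    return separator.join(parts)


if __name__ == "__main__":
    nodes = {
        0: {"cpus": [0, 1], "hitm_pct": 60.0, "store_pct": 25.0},
        1: {"cpus": [8], "hitm_pct": 40.0, "store_pct": 75.0},
    }
    for flavor in ("ids", "stats", "cpus"):
        print(format_nodes(nodes, flavor))
```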
0277 
0278 COALESCE
0279 --------
0280 You can specify how to sort the offsets of a cacheline.
0281 
0282 The following fields are available and govern the final
0283 set of output fields for the cacheline offsets output:
0284 
0285   tid   - coalesced by process TIDs
0286   pid   - coalesced by process PIDs
0287   iaddr - coalesced by code address, following fields are displayed:
0288              Code address, Code symbol, Shared Object, Source line
0289   dso   - coalesced by shared object
0290 
0291 By default the coalescing is set up with 'pid,iaddr'.
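
The effect of the coalesce fields can be sketched as a grouping operation: access records that agree on the cacheline offset and on every selected field collapse into one output row. The record layout and the function name coalesce below are assumptions for illustration, as is the choice to always include the offset in the grouping key.

```python
from collections import defaultdict

VALID_FIELDS = ("tid", "pid", "iaddr", "dso")


def coalesce(accesses, fields="pid,iaddr"):
    """Merge access records that share the same offset and coalesce fields.

    accesses is a list of hypothetical dicts with the keys 'offset', 'tid',
    'pid', 'iaddr' and 'dso'. Returns {(offset, field values...): count}.
    """
    keys = fields.split(",")
    for key in keys:
        if key not in VALID_FIELDS:
            raise ValueError("unknown coalesce field: %s" % key)
    merged = defaultdict(int)
    for access in accesses:
        group = (access["offset"],) + tuple(access[key] for key in keys)
        merged[group] += 1
    return dict(merged)


if __name__ == "__main__":
    accesses = [
        {"offset": 0x08, "tid": 101, "pid": 100, "iaddr": 0x400a10, "dso": "app"},
        {"offset": 0x08, "tid": 102, "pid": 100, "iaddr": 0x400a10, "dso": "app"},
        {"offset": 0x38, "tid": 101, "pid": 100, "iaddr": 0x400b20, "dso": "app"},
    ]
    print(len(coalesce(accesses)))               # rows with the default 'pid,iaddr'
    print(len(coalesce(accesses, "tid,iaddr")))  # rows when threads are kept apart
```

With the default 'pid,iaddr', two threads of the same process hitting the same instruction at the same offset end up in one row. Coalescing by 'tid,iaddr' keeps them in separate rows.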
0292 
0293 STDIO OUTPUT
0294 ------------
0295 The stdio output displays data on standard output.
0296 
0297 The following tables are displayed:
0298   Trace Event Information
0299   - overall statistics of memory accesses
0300 
0301   Global Shared Cache Line Event Information
0302   - overall statistics on shared cachelines
0303 
0304   Shared Data Cache Line Table
0305   - list of most expensive cachelines
0306 
0307   Shared Cache Line Distribution Pareto
0308   - list of all accessed offsets for each cacheline
0309 
0310 TUI OUTPUT
0311 ----------
0312 The TUI output provides an interactive interface to navigate
0313 through the cacheline list and to display offset details.
0314 
0315 For details please refer to the help window by pressing the '?' key.
0316 
0317 CREDITS
0318 -------
0319 Although Don Zickus, Dick Fowles and Joe Mario worked together
0320 to get this implemented, we got lots of early help from Arnaldo
0321 Carvalho de Melo, Stephane Eranian, Jiri Olsa and Andi Kleen.
0322 
0323 C2C BLOG
0324 --------
0325 Check Joe's blog on the c2c tool for a detailed use case explanation:
0326   https://joemario.github.io/blog/2016/09/01/c2c-blog/
0327 
0328 SEE ALSO
0329 --------
0330 linkperf:perf-record[1], linkperf:perf-mem[1]