.. SPDX-License-Identifier: GPL-2.0
.. _imc:

===================================
IMC (In-Memory Collection Counters)
===================================

Anju T Sudhakar, 10 May 2019

.. contents::
    :depth: 3


Basic overview
==============

IMC (In-Memory Collection counters) is a hardware monitoring facility that
collects large numbers of hardware performance events at Nest level (these are
on-chip but off-core), Core level and Thread level.

The Nest PMU counters are handled by a Nest IMC microcode which runs in the OCC
(On-Chip Controller) complex. The microcode collects the counter data and moves
the nest IMC counter data to memory.

The Core and Thread IMC PMU counters are handled in the core. Core-level PMU
counters give us the IMC counters' data per core, and thread-level PMU counters
give us the IMC counters' data per CPU thread.

OPAL obtains the IMC PMU and supported events information from the IMC Catalog
and passes it on to the kernel via the device tree. The information for each
event contains:

- Event name
- Event offset
- Event description

and possibly also:

- Event scale
- Event unit

Some PMUs may have common scale and unit values for all their supported
events. In those cases, the scale and unit properties for those events must be
inherited from the PMU.

The event offset in the memory is where the counter data gets accumulated.
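
The scale/unit inheritance and the role of the event offset can be sketched
as follows. This is only an illustration: the dictionary layout, the names
`resolve_event`, `read_counter` and `EXAMPLE_EVENT`, the default scale of 1.0,
and the assumption that each counter is a 64-bit big-endian value are not taken
from the IMC catalog or the kernel sources.

.. code-block:: python

  import struct

  def resolve_event(pmu, event):
      """Return (offset, scale, unit), falling back to PMU-wide values."""
      # Assumption: an event without its own scale/unit inherits the PMU's;
      # if the PMU has none either, use a neutral default.
      scale = event.get("scale", pmu.get("scale", 1.0))
      unit = event.get("unit", pmu.get("unit", ""))
      return event["offset"], float(scale), unit

  def read_counter(buf, offset):
      """Read one counter from a snapshot of the counter memory.

      Assumption: each counter is a 64-bit big-endian value.
      """
      (raw,) = struct.unpack_from(">Q", buf, offset)
      return raw

  # Hypothetical PMU description and memory snapshot, for illustration only.
  pmu = {
      "scale": 0.5,
      "unit": "MiB",
      "events": {"EXAMPLE_EVENT": {"offset": 0x18}},
  }
  memory = bytearray(64)
  struct.pack_into(">Q", memory, 0x18, 1000)

  offset, scale, unit = resolve_event(pmu, pmu["events"]["EXAMPLE_EVENT"])
  value = read_counter(memory, offset) * scale
  print(value, unit)

Here the event carries no scale or unit of its own, so both come from the PMU
entry, and the raw count is read at the event's offset before scaling.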

The IMC catalog is available at:
https://github.com/open-power/ima-catalog

The kernel discovers the IMC counters information in the device tree at the
`imc-counters` device node, which has a compatible field
`ibm,opal-in-memory-counters`. From the device tree, the kernel parses the PMUs
and their event information, and registers each PMU and its attributes in the
kernel.
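
To check from user space whether firmware has exposed such a node, one option
is to walk the flattened device tree that Linux exports as a directory
hierarchy. The sketch below assumes the tree is visible under
`/proc/device-tree` and that the `compatible` property is a list of
NUL-terminated strings; `find_imc_nodes` is a hypothetical helper, not a kernel
or perf interface.

.. code-block:: python

  import os

  IMC_COMPATIBLE = "ibm,opal-in-memory-counters"

  def find_imc_nodes(root="/proc/device-tree"):
      """Return directories under root whose compatible property lists IMC."""
      matches = []
      for dirpath, _dirnames, filenames in os.walk(root):
          if "compatible" not in filenames:
              continue
          with open(os.path.join(dirpath, "compatible"), "rb") as f:
              # The property is read as a list of NUL-terminated strings.
              entries = f.read().split(b"\0")
          if IMC_COMPATIBLE.encode() in entries:
              matches.append(dirpath)
      return matches

  if __name__ == "__main__":
      for node in find_imc_nodes():
          print(node)

On a system without IMC support (or without a device tree) the function simply
returns an empty list.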

IMC example usage
=================

.. code-block:: sh

  # perf list
  [...]
  nest_mcs01/PM_MCS01_64B_RD_DISP_PORT01/           [Kernel PMU event]
  nest_mcs01/PM_MCS01_64B_RD_DISP_PORT23/           [Kernel PMU event]
  [...]
  core_imc/CPM_0THRD_NON_IDLE_PCYC/                 [Kernel PMU event]
  core_imc/CPM_1THRD_NON_IDLE_INST/                 [Kernel PMU event]
  [...]
  thread_imc/CPM_0THRD_NON_IDLE_PCYC/               [Kernel PMU event]
  thread_imc/CPM_1THRD_NON_IDLE_INST/               [Kernel PMU event]

To see per-chip data for nest_mcs01/PM_MCS01_64B_WR_DISP_PORT01/:

.. code-block:: sh

  # ./perf stat -e "nest_mcs01/PM_MCS01_64B_WR_DISP_PORT01/" -a --per-socket

To see non-idle instructions for core 0:

.. code-block:: sh

  # ./perf stat -e "core_imc/CPM_NON_IDLE_INST/" -C 0 -I 1000

To see non-idle cycles for a "make":

.. code-block:: sh

  # ./perf stat -e "thread_imc/CPM_NON_IDLE_PCYC/" make


IMC Trace-mode
==============

POWER9 supports two IMC modes: Accumulation mode and Trace mode. In
Accumulation mode, event counts are accumulated in system memory, and the
hypervisor then reads the posted counts periodically or when requested. In IMC
Trace mode, the 64-bit trace SCOM value is initialized with the event
information. The CPMCxSEL and CPMC_LOAD fields in the trace SCOM specify the
event to be monitored and the sampling duration. On each overflow in the
CPMCxSEL, hardware snapshots the program counter along with the event counts
and writes them into the memory pointed to by LDBAR.

LDBAR is a 64-bit special-purpose per-thread register. It has bits that
indicate whether hardware is configured for accumulation or trace mode.

LDBAR Register Layout
---------------------

+-------+----------------------+
| 0     | Enable/Disable       |
+-------+----------------------+
| 1     | 0: Accumulation Mode |
|       +----------------------+
|       | 1: Trace Mode        |
+-------+----------------------+
| 2:3   | Reserved             |
+-------+----------------------+
| 4:6   | PB scope             |
+-------+----------------------+
| 7     | Reserved             |
+-------+----------------------+
| 8:50  | Counter Address      |
+-------+----------------------+
| 51:63 | Reserved             |
+-------+----------------------+
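
As a reading aid for the layout above, the following sketch splits a 64-bit
LDBAR value into its fields. It assumes the usual POWER bit numbering, in which
bit 0 is the most significant bit; the helper names `ibm_field` and
`decode_ldbar` and the sample value are made up for illustration. The counter
address is returned as the raw 43-bit field, without interpreting how it maps
to a physical address.

.. code-block:: python

  def ibm_field(value, first, last, width=64):
      """Extract bits first..last (inclusive), where bit 0 is the MSB."""
      nbits = last - first + 1
      shift = (width - 1) - last
      return (value >> shift) & ((1 << nbits) - 1)

  def decode_ldbar(value):
      """Split an LDBAR value into the fields listed in the layout table."""
      return {
          "enabled": bool(ibm_field(value, 0, 0)),
          "mode": "trace" if ibm_field(value, 1, 1) else "accumulation",
          "pb_scope": ibm_field(value, 4, 6),
          "counter_address": ibm_field(value, 8, 50),
      }

  # Made-up register value: enable bit and trace-mode bit set, nothing else.
  print(decode_ldbar(0xC000000000000000))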

TRACE_IMC_SCOM bit representation
---------------------------------

+-------+------------+
| 0:1   | SAMPSEL    |
+-------+------------+
| 2:33  | CPMC_LOAD  |
+-------+------------+
| 34:40 | CPMC1SEL   |
+-------+------------+
| 41:47 | CPMC2SEL   |
+-------+------------+
| 48:50 | BUFFERSIZE |
+-------+------------+
| 51:63 | RESERVED   |
+-------+------------+

CPMC_LOAD contains the sampling duration. SAMPSEL and CPMCxSEL determine the
event to count. BUFFERSIZE indicates the memory range. On each overflow,
hardware snapshots the program counter along with the event counts, updates the
memory, and reloads the CPMC_LOAD value for the next sampling duration. IMC
hardware does not support exceptions, so it quietly wraps around when the
memory buffer reaches the end.
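
The field positions in the table can be turned into a small packing helper.
The sketch below again assumes that bit 0 is the most significant bit of the
64-bit word; `pack_trace_scom` is a hypothetical name, and the field values in
the example are placeholders chosen only to show where each field lands, not
settings recommended for real hardware.

.. code-block:: python

  FIELDS = {  # name: (first bit, last bit), bit 0 being the MSB
      "SAMPSEL": (0, 1),
      "CPMC_LOAD": (2, 33),
      "CPMC1SEL": (34, 40),
      "CPMC2SEL": (41, 47),
      "BUFFERSIZE": (48, 50),
  }

  def pack_trace_scom(**values):
      """Build a 64-bit TRACE_IMC_SCOM word from named field values."""
      word = 0
      for name, val in values.items():
          first, last = FIELDS[name]
          nbits = last - first + 1
          if not 0 <= val < (1 << nbits):
              raise ValueError(f"{name}={val} does not fit in {nbits} bits")
          # Shift the field so that its last bit lands on IBM bit 'last'.
          word |= val << (63 - last)
      return word

  # Placeholder values, for illustration only.
  word = pack_trace_scom(SAMPSEL=1, CPMC_LOAD=0xFFFFFFFF,
                         CPMC1SEL=1, CPMC2SEL=1, BUFFERSIZE=1)
  print(f"{word:#018x}")

Bits 51:63 are reserved and are left at zero by this helper.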

*Currently the event monitored for trace-mode is fixed as cycle.*

Trace IMC example usage
=======================

.. code-block:: sh

  # perf list
  [...]
  trace_imc/trace_cycles/                           [Kernel PMU event]

To record an application/process with the trace-imc event:

.. code-block:: sh

  # perf record -e trace_imc/trace_cycles/ yes > /dev/null
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.012 MB perf.data (21 samples) ]

The generated `perf.data` can be read using `perf report`.

Benefits of using IMC trace-mode
================================

PMI (Performance Monitoring Interrupt) handling is avoided, since IMC trace
mode snapshots the program counter and writes it to memory. This also provides
a way for the operating system to do instruction sampling in real time without
PMI processing overhead.

The following compares the PMI counts when `perf top` is run first without and
then with the trace-imc event:

.. code-block:: sh

  # grep PMI /proc/interrupts
  PMI:          0          0          0          0   Performance monitoring interrupts
  # ./perf top
  ...
  # grep PMI /proc/interrupts
  PMI:      39735       8710      17338      17801   Performance monitoring interrupts
  # ./perf top -e trace_imc/trace_cycles/
  ...
  # grep PMI /proc/interrupts
  PMI:      39735       8710      17338      17801   Performance monitoring interrupts


That is, the PMI counts do not increment when the `trace_imc` event is used.
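
The same comparison can be scripted. The following sketch parses the `PMI`
line of `/proc/interrupts`-style text and computes the per-CPU difference
between two snapshots; `pmi_counts` is a hypothetical helper, and the two
sample lines are the ones shown above (after plain `perf top`, and after
`perf top` with the trace-imc event).

.. code-block:: python

  def pmi_counts(interrupts_text):
      """Return the per-CPU PMI counts from /proc/interrupts-style text."""
      for line in interrupts_text.splitlines():
          label, _, rest = line.strip().partition(":")
          if label == "PMI":
              counts = []
              for token in rest.split():
                  if not token.isdigit():
                      break  # reached the textual description
                  counts.append(int(token))
              return counts
      return []

  before = pmi_counts(
      "PMI: 39735 8710 17338 17801 Performance monitoring interrupts")
  after = pmi_counts(
      "PMI: 39735 8710 17338 17801 Performance monitoring interrupts")
  delta = [a - b for a, b in zip(after, before)]
  print(delta)

A delta of zero on every CPU corresponds to the observation that no PMIs were
taken while sampling with the trace-imc event.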