.. SPDX-License-Identifier: GPL-2.0
.. _imc:

===================================
IMC (In-Memory Collection Counters)
===================================

Anju T Sudhakar, 10 May 2019

.. contents::
    :depth: 3


Basic overview
==============

IMC (In-Memory Collection counters) is a hardware monitoring facility that
collects large numbers of hardware performance events at Nest level (these are
on-chip but off-core), Core level and Thread level.

The Nest PMU counters are handled by the Nest IMC microcode, which runs in the
OCC (On-Chip Controller) complex. The microcode collects the nest IMC counter
data and moves it to memory.

The Core and Thread IMC PMU counters are handled in the core. Core-level PMU
counters give us the IMC counters' data per core, and thread-level PMU counters
give us the IMC counters' data per CPU thread.

OPAL obtains the IMC PMU and supported events information from the IMC Catalog
and passes it on to the kernel via the device tree. The information for each
event contains:

- Event name
- Event offset
- Event description

and possibly also:

- Event scale
- Event unit

Some PMUs may have common scale and unit values for all their supported
events. In those cases, the events must inherit the scale and unit properties
from the PMU.

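As an illustration of this inheritance rule (not the kernel's actual parsing
code), the sketch below fills in a missing scale or unit from the parent PMU.
The dictionary layout, the `resolve_event` helper and all sample values are
hypothetical.

.. code-block:: python

  def resolve_event(pmu, event):
      """Return a copy of event with scale/unit inherited from pmu if absent."""
      resolved = dict(event)
      for key in ("scale", "unit"):
          # An event-level value wins; otherwise fall back to the PMU's value.
          if key not in resolved and key in pmu:
              resolved[key] = pmu[key]
      return resolved


  # Hypothetical PMU that defines a common scale and unit for its events.
  pmu = {"name": "example_pmu", "scale": "0.5", "unit": "MiB"}
  event = {"name": "EXAMPLE_EVENT", "offset": 0x100, "desc": "Example event"}
  resolved = resolve_event(pmu, event)

Here `resolved` carries the PMU's scale and unit, while an event that defines
its own values would keep them.
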
The event offset is the location in memory where the counter data gets
accumulated.

The IMC catalog is available at:
        https://github.com/open-power/ima-catalog

The kernel discovers the IMC counters information in the device tree at the
`imc-counters` device node, which has the compatible field
`ibm,opal-in-memory-counters`. From the device tree, the kernel parses the PMUs
and their events' information and registers each PMU and its attributes in the
kernel.

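From user space, the presence of such a node can be checked with a short
script. The following is a minimal sketch, not part of the kernel or perf: the
`find_compatible_nodes` helper is hypothetical, and it assumes that the device
tree is exposed as a directory hierarchy under `/proc/device-tree` in which
each `compatible` property is a file of NUL-terminated strings.

.. code-block:: python

  import os

  IMC_COMPATIBLE = b"ibm,opal-in-memory-counters"


  def find_compatible_nodes(root, compatible=IMC_COMPATIBLE):
      """Return node directories under root whose 'compatible' lists compatible."""
      matches = []
      for dirpath, _dirnames, filenames in os.walk(root):
          if "compatible" not in filenames:
              continue
          try:
              with open(os.path.join(dirpath, "compatible"), "rb") as f:
                  # The property holds one or more NUL-terminated strings.
                  entries = f.read().split(b"\0")
          except OSError:
              continue
          if compatible in entries:
              matches.append(dirpath)
      return matches


  # Assumption: the device tree is mounted at /proc/device-tree. On systems
  # without it, os.walk() yields nothing and the result is an empty list.
  nodes = find_compatible_nodes("/proc/device-tree")

An empty `nodes` list suggests that the firmware did not export IMC counters
to this kernel.
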
IMC example usage
=================

.. code-block:: sh

  # perf list
  [...]
  nest_mcs01/PM_MCS01_64B_RD_DISP_PORT01/            [Kernel PMU event]
  nest_mcs01/PM_MCS01_64B_RD_DISP_PORT23/            [Kernel PMU event]
  [...]
  core_imc/CPM_0THRD_NON_IDLE_PCYC/                  [Kernel PMU event]
  core_imc/CPM_1THRD_NON_IDLE_INST/                  [Kernel PMU event]
  [...]
  thread_imc/CPM_0THRD_NON_IDLE_PCYC/                [Kernel PMU event]
  thread_imc/CPM_1THRD_NON_IDLE_INST/                [Kernel PMU event]

To see per-chip data for nest_mcs01/PM_MCS01_64B_WR_DISP_PORT01/:

.. code-block:: sh

  # ./perf stat -e "nest_mcs01/PM_MCS01_64B_WR_DISP_PORT01/" -a --per-socket

To see non-idle instructions for core 0:

.. code-block:: sh

  # ./perf stat -e "core_imc/CPM_NON_IDLE_INST/" -C 0 -I 1000

To see non-idle cycles for a "make":

.. code-block:: sh

  # ./perf stat -e "thread_imc/CPM_NON_IDLE_PCYC/" make


IMC Trace-mode
==============

POWER9 supports two modes for IMC: Accumulation mode and Trace mode. In
Accumulation mode, event counts are accumulated in system memory. The
hypervisor then reads the posted counts periodically or when requested. In IMC
Trace mode, the 64-bit trace SCOM value is initialized with the event
information. The CPMCxSEL and CPMC_LOAD fields in the trace SCOM specify the
event to be monitored and the sampling duration. On each overflow in the
CPMCxSEL, hardware snapshots the program counter along with event counts and
writes them into the memory pointed to by LDBAR.

LDBAR is a 64-bit special-purpose per-thread register. It has bits to indicate
whether hardware is configured for accumulation or trace mode.

LDBAR Register Layout
---------------------

  +-------+----------------------+
  | 0     | Enable/Disable       |
  +-------+----------------------+
  | 1     | 0: Accumulation Mode |
  |       +----------------------+
  |       | 1: Trace Mode        |
  +-------+----------------------+
  | 2:3   | Reserved             |
  +-------+----------------------+
  | 4:6   | PB scope             |
  +-------+----------------------+
  | 7     | Reserved             |
  +-------+----------------------+
  | 8:50  | Counter Address      |
  +-------+----------------------+
  | 51:63 | Reserved             |
  +-------+----------------------+

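The layout can be exercised with a small decoder. This is a sketch under two
assumptions that the table does not state: bit 0 is the most significant bit
(MSB-0 numbering), and the counter address bits are stored in place, so that
shifting the extracted field back recovers the address. The helper names and
the sample value are hypothetical.

.. code-block:: python

  def msb0_field(value, first, last, width=64):
      """Extract bits first..last (inclusive, MSB-0 numbering) from value."""
      shift = width - 1 - last
      mask = (1 << (last - first + 1)) - 1
      return (value >> shift) & mask


  def decode_ldbar(ldbar):
      """Split a 64-bit LDBAR value into the fields of the layout table."""
      return {
          "enable": msb0_field(ldbar, 0, 0),
          "trace_mode": msb0_field(ldbar, 1, 1),
          "pb_scope": msb0_field(ldbar, 4, 6),
          # Assumption: address bits sit in place, so shift them back.
          "counter_address": msb0_field(ldbar, 8, 50) << (63 - 50),
      }


  # Hypothetical value: enabled, trace mode, buffer at 0x20000000.
  fields = decode_ldbar((1 << 63) | (1 << 62) | 0x20000000)

With this sample value, `fields` reports the enable and trace-mode bits as set
and returns the buffer address unchanged.
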
TRACE_IMC_SCOM bit representation
---------------------------------

  +-------+------------+
  | 0:1   | SAMPSEL    |
  +-------+------------+
  | 2:33  | CPMC_LOAD  |
  +-------+------------+
  | 34:40 | CPMC1SEL   |
  +-------+------------+
  | 41:47 | CPMC2SEL   |
  +-------+------------+
  | 48:50 | BUFFERSIZE |
  +-------+------------+
  | 51:63 | RESERVED   |
  +-------+------------+

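To make the field positions concrete, the sketch below packs and unpacks a
TRACE_IMC_SCOM value according to this table. It assumes MSB-0 bit numbering
(bit 0 is the most significant bit); the helper names and the field values in
the example are hypothetical and do not describe a real event configuration.

.. code-block:: python

  # Field name -> (first bit, last bit), taken from the table above.
  SCOM_FIELDS = {
      "SAMPSEL": (0, 1),
      "CPMC_LOAD": (2, 33),
      "CPMC1SEL": (34, 40),
      "CPMC2SEL": (41, 47),
      "BUFFERSIZE": (48, 50),
  }


  def pack_trace_scom(**values):
      """Build a 64-bit SCOM value; fields that are not given stay zero."""
      scom = 0
      for name, value in values.items():
          first, last = SCOM_FIELDS[name]
          width = last - first + 1
          if not 0 <= value < (1 << width):
              raise ValueError(f"{name} does not fit in {width} bits")
          scom |= value << (63 - last)
      return scom


  def unpack_trace_scom(scom):
      """Return a dict with the value of every field in SCOM_FIELDS."""
      fields = {}
      for name, (first, last) in SCOM_FIELDS.items():
          mask = (1 << (last - first + 1)) - 1
          fields[name] = (scom >> (63 - last)) & mask
      return fields


  # Hypothetical field values, chosen only to show the round trip.
  scom = pack_trace_scom(SAMPSEL=1, CPMC_LOAD=0xFFFF, CPMC1SEL=2, BUFFERSIZE=1)
  fields = unpack_trace_scom(scom)

Unpacking returns the values that were packed, with CPMC2SEL reported as zero
because it was not set.
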
CPMC_LOAD contains the sampling duration. SAMPSEL and CPMCxSEL determine the
event to count. BUFFERSIZE indicates the memory range. On each overflow,
hardware snapshots the program counter along with event counts, updates the
memory, and reloads the CPMC_LOAD value for the next sampling duration. IMC
hardware does not support exceptions, so it quietly wraps around when the
memory buffer reaches its end.

*Currently the event monitored for trace-mode is fixed to cycles.*

Trace IMC example usage
=======================

.. code-block:: sh

  # perf list
  [...]
  trace_imc/trace_cycles/                            [Kernel PMU event]

To record an application/process with the trace-imc event:

.. code-block:: sh

  # perf record -e trace_imc/trace_cycles/ yes > /dev/null
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.012 MB perf.data (21 samples) ]

The generated `perf.data` can be read using `perf report`.

Benefits of using IMC trace-mode
================================

PMI (Performance Monitoring Interrupt) handling is avoided, since IMC trace
mode snapshots the program counter and writes it to memory. This also provides
a way for the operating system to do instruction sampling in real time without
PMI processing overhead.

The following output compares `perf top` with and without the trace-imc event.
It shows the PMI counts before any profiling, after running `perf top` without
the trace-imc event, and after running it with the event.

.. code-block:: sh

  # grep PMI /proc/interrupts
  PMI:          0          0          0          0   Performance monitoring interrupts
  # ./perf top
  ...
  # grep PMI /proc/interrupts
  PMI:      39735       8710      17338      17801   Performance monitoring interrupts
  # ./perf top -e trace_imc/trace_cycles/
  ...
  # grep PMI /proc/interrupts
  PMI:      39735       8710      17338      17801   Performance monitoring interrupts


That is, the PMI counts do not increment when using the `trace_imc` event.
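
The comparison can also be automated by parsing the PMI line and subtracting
two snapshots. The sketch below does this for the two snapshots taken around
the `trace_imc` run in the output above. The `pmi_counts` helper is
hypothetical, and it assumes the `/proc/interrupts` line format shown in that
output (a `PMI:` label, one count per CPU, then a description).

.. code-block:: python

  def pmi_counts(interrupts_text):
      """Return the per-CPU PMI counts from /proc/interrupts-style text."""
      for line in interrupts_text.splitlines():
          label, _, rest = line.partition(":")
          if label.strip() != "PMI":
              continue
          counts = []
          for token in rest.split():
              # Per-CPU counts come first; stop at the textual description.
              if not token.isdigit():
                  break
              counts.append(int(token))
          return counts
      raise ValueError("no PMI line found")


  # Snapshots copied from the output above (before and after the trace run).
  before = "PMI:      39735       8710      17338      17801   Performance monitoring interrupts"
  after = "PMI:      39735       8710      17338      17801   Performance monitoring interrupts"
  delta = [a - b for b, a in zip(pmi_counts(before), pmi_counts(after))]

Every entry of `delta` is zero for these snapshots, which matches the
observation that no PMIs were raised during the `trace_imc` run.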