0001 .. SPDX-License-Identifier: GPL-2.0
0002 .. include:: <isonum.txt>
0003
0004 ===========================================
0005 User Interface for Resource Control feature
0006 ===========================================
0007
0008 :Copyright: |copy| 2016 Intel Corporation
0009 :Authors: - Fenghua Yu <fenghua.yu@intel.com>
0010 - Tony Luck <tony.luck@intel.com>
0011 - Vikas Shivappa <vikas.shivappa@intel.com>
0012
0013
Intel refers to this feature as Intel Resource Director Technology (Intel(R) RDT).
AMD refers to this feature as AMD Platform Quality of Service (AMD QoS).
0016
This feature is enabled by the CONFIG_X86_CPU_RESCTRL Kconfig option and is
indicated by the following x86 /proc/cpuinfo flag bits:
0019
0020 ============================================= ================================
0021 RDT (Resource Director Technology) Allocation "rdt_a"
0022 CAT (Cache Allocation Technology) "cat_l3", "cat_l2"
0023 CDP (Code and Data Prioritization) "cdp_l3", "cdp_l2"
0024 CQM (Cache QoS Monitoring) "cqm_llc", "cqm_occup_llc"
0025 MBM (Memory Bandwidth Monitoring) "cqm_mbm_total", "cqm_mbm_local"
0026 MBA (Memory Bandwidth Allocation) "mba"
0027 ============================================= ================================
0028
0029 To use the feature mount the file system::
0030
0031 # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps]] /sys/fs/resctrl
0032
0033 mount options are:
0034
0035 "cdp":
0036 Enable code/data prioritization in L3 cache allocations.
0037 "cdpl2":
0038 Enable code/data prioritization in L2 cache allocations.
0039 "mba_MBps":
Enable the MBA Software Controller (mba_sc) to specify MBA
bandwidth in MBps.
0042
0043 L2 and L3 CDP are controlled separately.
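
For example, on a platform that supports L3 CDP and the MBA software
controller both could be requested at mount time like this (a sketch;
which options are accepted depends on the platform)::

  # mount -t resctrl resctrl -o cdp,mba_MBps /sys/fs/resctrl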
0044
0045 RDT features are orthogonal. A particular system may support only
0046 monitoring, only control, or both monitoring and control. Cache
0047 pseudo-locking is a unique way of using cache control to "pin" or
0048 "lock" data in the cache. Details can be found in
0049 "Cache Pseudo-Locking".
0050
0051
The mount succeeds if either allocation or monitoring is present, but
0053 only those files and directories supported by the system will be created.
0054 For more details on the behavior of the interface during monitoring
0055 and allocation, see the "Resource alloc and monitor groups" section.
0056
0057 Info directory
0058 ==============
0059
0060 The 'info' directory contains information about the enabled
0061 resources. Each resource has its own subdirectory. The subdirectory
0062 names reflect the resource names.
0063
0064 Each subdirectory contains the following files with respect to
0065 allocation:
0066
Cache resource (L3/L2) subdirectory contains the following files
0068 related to allocation:
0069
0070 "num_closids":
0071 The number of CLOSIDs which are valid for this
0072 resource. The kernel uses the smallest number of
CLOSIDs of all enabled resources as the limit.
0074 "cbm_mask":
0075 The bitmask which is valid for this resource.
0076 This mask is equivalent to 100%.
0077 "min_cbm_bits":
0078 The minimum number of consecutive bits which
0079 must be set when writing a mask.
0080
0081 "shareable_bits":
Bitmask of shareable resource with other executing
entities (e.g. I/O). The user can use this when
setting up exclusive cache partitions. Note that
some platforms support devices that have their
own settings for cache use which can override
these bits.
0088 "bit_usage":
0089 Annotated capacity bitmasks showing how all
0090 instances of the resource are used. The legend is:
0091
0092 "0":
0093 Corresponding region is unused. When the system's
0094 resources have been allocated and a "0" is found
0095 in "bit_usage" it is a sign that resources are
0096 wasted.
0097
0098 "H":
0099 Corresponding region is used by hardware only
0100 but available for software use. If a resource
0101 has bits set in "shareable_bits" but not all
of these bits appear in the resource groups'
schemata then the bits appearing in
"shareable_bits" but in no resource group will
be marked as "H".
0106 "X":
0107 Corresponding region is available for sharing and
0108 used by hardware and software. These are the
0109 bits that appear in "shareable_bits" as
0110 well as a resource group's allocation.
0111 "S":
0112 Corresponding region is used by software
0113 and available for sharing.
0114 "E":
0115 Corresponding region is used exclusively by
0116 one resource group. No sharing allowed.
0117 "P":
0118 Corresponding region is pseudo-locked. No
0119 sharing allowed.
0120
Memory bandwidth (MB) subdirectory contains the following files
0122 with respect to allocation:
0123
0124 "min_bandwidth":
0125 The minimum memory bandwidth percentage which
0126 user can request.
0127
0128 "bandwidth_gran":
0129 The granularity in which the memory bandwidth
0130 percentage is allocated. The allocated
0131 b/w percentage is rounded off to the next
0132 control step available on the hardware. The
0133 available bandwidth control steps are:
0134 min_bandwidth + N * bandwidth_gran.
0135
0136 "delay_linear":
0137 Indicates if the delay scale is linear or
non-linear. This field is purely informational.
0140
0141 "thread_throttle_mode":
0142 Indicator on Intel systems of how tasks running on threads
0143 of a physical core are throttled in cases where they
0144 request different memory bandwidth percentages:
0145
0146 "max":
0147 the smallest percentage is applied
0148 to all threads
0149 "per-thread":
0150 bandwidth percentages are directly applied to
0151 the threads running on the core
0152
0153 If RDT monitoring is available there will be an "L3_MON" directory
0154 with the following files:
0155
0156 "num_rmids":
0157 The number of RMIDs available. This is the
0158 upper bound for how many "CTRL_MON" + "MON"
0159 groups can be created.
0160
0161 "mon_features":
0162 Lists the monitoring events if
0163 monitoring is enabled for the resource.
0164
0165 "max_threshold_occupancy":
0166 Read/write file provides the largest value (in
0167 bytes) at which a previously used LLC_occupancy
0168 counter can be considered for re-use.
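
For example, on a system with L3 monitoring enabled these files might read
as follows (the values are illustrative and vary between platforms)::

  # cat /sys/fs/resctrl/info/L3_MON/num_rmids
  256
  # cat /sys/fs/resctrl/info/L3_MON/mon_features
  llc_occupancy
  mbm_total_bytes
  mbm_local_bytes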
0169
0170 Finally, in the top level of the "info" directory there is a file
0171 named "last_cmd_status". This is reset with every "command" issued
0172 via the file system (making new directories or writing to any of the
0173 control files). If the command was successful, it will read as "ok".
If the command failed, it will provide more information than can be
0175 conveyed in the error returns from file operations. E.g.
0176 ::
0177
0178 # echo L3:0=f7 > schemata
0179 bash: echo: write error: Invalid argument
0180 # cat info/last_cmd_status
0181 mask f7 has non-consecutive 1-bits
0182
0183 Resource alloc and monitor groups
0184 =================================
0185
0186 Resource groups are represented as directories in the resctrl file
0187 system. The default group is the root directory which, immediately
0188 after mounting, owns all the tasks and cpus in the system and can make
0189 full use of all resources.
0190
0191 On a system with RDT control features additional directories can be
0192 created in the root directory that specify different amounts of each
0193 resource (see "schemata" below). The root and these additional top level
0194 directories are referred to as "CTRL_MON" groups below.
0195
0196 On a system with RDT monitoring the root directory and other top level
0197 directories contain a directory named "mon_groups" in which additional
0198 directories can be created to monitor subsets of tasks in the CTRL_MON
0199 group that is their ancestor. These are called "MON" groups in the rest
0200 of this document.
0201
0202 Removing a directory will move all tasks and cpus owned by the group it
0203 represents to the parent. Removing one of the created CTRL_MON groups
0204 will automatically remove all MON groups below it.
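
As a sketch (the group names are arbitrary, and "mon_groups" is only
present when monitoring is supported), a CTRL_MON group with a MON group
below it could be created and removed like this::

  # mkdir /sys/fs/resctrl/grp0
  # mkdir /sys/fs/resctrl/grp0/mon_groups/m0
  # rmdir /sys/fs/resctrl/grp0        # also removes grp0/mon_groups/m0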
0205
0206 All groups contain the following files:
0207
0208 "tasks":
0209 Reading this file shows the list of all tasks that belong to
0210 this group. Writing a task id to the file will add a task to the
0211 group. If the group is a CTRL_MON group the task is removed from
0212 whichever previous CTRL_MON group owned the task and also from
0213 any MON group that owned the task. If the group is a MON group,
0214 then the task must already belong to the CTRL_MON parent of this
0215 group. The task is removed from any previous MON group.
0216
0217
0218 "cpus":
0219 Reading this file shows a bitmask of the logical CPUs owned by
0220 this group. Writing a mask to this file will add and remove
0221 CPUs to/from this group. As with the tasks file a hierarchy is
0222 maintained where MON groups may only include CPUs owned by the
0223 parent CTRL_MON group.
0224 When the resource group is in pseudo-locked mode this file will
0225 only be readable, reflecting the CPUs associated with the
0226 pseudo-locked region.
0227
0228
0229 "cpus_list":
0230 Just like "cpus", only using ranges of CPUs instead of bitmasks.
0231
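As a sketch of using these files (the PID and CPU mask are arbitrary and
"p0" is an existing CTRL_MON group), a task and two CPUs could be assigned
to a group like this::

  # echo 1234 > /sys/fs/resctrl/p0/tasks
  # echo 3 > /sys/fs/resctrl/p0/cpus
  # cat /sys/fs/resctrl/p0/cpus_list
  0-1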
0232
0233 When control is enabled all CTRL_MON groups will also contain:
0234
0235 "schemata":
0236 A list of all the resources available to this group.
0237 Each resource has its own line and format - see below for details.
0238
0239 "size":
0240 Mirrors the display of the "schemata" file to display the size in
0241 bytes of each allocation instead of the bits representing the
0242 allocation.
0243
0244 "mode":
0245 The "mode" of the resource group dictates the sharing of its
0246 allocations. A "shareable" resource group allows sharing of its
0247 allocations while an "exclusive" resource group does not. A
0248 cache pseudo-locked region is created by first writing
0249 "pseudo-locksetup" to the "mode" file before writing the cache
0250 pseudo-locked region's schemata to the resource group's "schemata"
0251 file. On successful pseudo-locked region creation the mode will
0252 automatically change to "pseudo-locked".
0253
0254 When monitoring is enabled all MON groups will also contain:
0255
0256 "mon_data":
0257 This contains a set of files organized by L3 domain and by
0258 RDT event. E.g. on a system with two L3 domains there will
0259 be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
directories has one file per event (e.g. "llc_occupancy",
0261 "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
0262 files provide a read out of the current value of the event for
0263 all tasks in the group. In CTRL_MON groups these files provide
0264 the sum for all tasks in the CTRL_MON group and all tasks in
MON groups. Please see the example section for more details on usage.
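
For example, the event files of one L3 domain could be listed like this
(a sketch; which files are present depends on "mon_features")::

  # ls /sys/fs/resctrl/mon_data/mon_L3_00/
  llc_occupancy  mbm_local_bytes  mbm_total_bytes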
0266
0267 Resource allocation rules
0268 -------------------------
0269
0270 When a task is running the following rules define which resources are
0271 available to it:
0272
0273 1) If the task is a member of a non-default group, then the schemata
0274 for that group is used.
0275
0276 2) Else if the task belongs to the default group, but is running on a
0277 CPU that is assigned to some specific group, then the schemata for the
0278 CPU's group is used.
0279
0280 3) Otherwise the schemata for the default group is used.
0281
0282 Resource monitoring rules
0283 -------------------------
0284 1) If a task is a member of a MON group, or non-default CTRL_MON group
0285 then RDT events for the task will be reported in that group.
0286
0287 2) If a task is a member of the default CTRL_MON group, but is running
0288 on a CPU that is assigned to some specific group, then the RDT events
0289 for the task will be reported in that group.
0290
0291 3) Otherwise RDT events for the task will be reported in the root level
0292 "mon_data" group.
0293
0294
0295 Notes on cache occupancy monitoring and control
0296 ===============================================
0297 When moving a task from one group to another you should remember that
0298 this only affects *new* cache allocations by the task. E.g. you may have
a task in a monitor group showing 3 MB of cache occupancy. If you move
it to a new group and immediately check the occupancy of the old and new
0301 groups you will likely see that the old group is still showing 3 MB and
0302 the new group zero. When the task accesses locations still in cache from
0303 before the move, the h/w does not update any counters. On a busy system
0304 you will likely see the occupancy in the old group go down as cache lines
0305 are evicted and re-used while the occupancy in the new group rises as
0306 the task accesses memory and loads into the cache are counted based on
0307 membership in the new group.
0308
0309 The same applies to cache allocation control. Moving a task to a group
0310 with a smaller cache partition will not evict any cache lines. The
0311 process may continue to use them from the old partition.
0312
Hardware uses a CLOSid (Class of Service ID) and an RMID (Resource Monitoring
ID) to identify a control group and a monitoring group respectively. Each of
the resource groups is mapped to these IDs based on the kind of group. The
number of CLOSids and RMIDs is limited by the hardware and hence the creation
of a "CTRL_MON" directory may fail if we run out of either CLOSids or RMIDs,
and creation of a "MON" group may fail if we run out of RMIDs.
0319
0320 max_threshold_occupancy - generic concepts
0321 ------------------------------------------
0322
Note that an RMID once freed may not be immediately available for use as
the RMID is still tagged to the cache lines of the previous user of the RMID.
Hence such RMIDs are placed on a limbo list and checked back in once the cache
occupancy has gone down. If there is a time when the system has a lot of
limbo RMIDs but none of them are ready to be used, the user may see an -EBUSY
during mkdir.
0329
0330 max_threshold_occupancy is a user configurable value to determine the
0331 occupancy at which an RMID can be freed.
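
A sketch of tuning it (the value is in bytes and purely illustrative)::

  # echo 131072 > /sys/fs/resctrl/info/L3_MON/max_threshold_occupancy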
0332
0333 Schemata files - general concepts
0334 ---------------------------------
0335 Each line in the file describes one resource. The line starts with
0336 the name of the resource, followed by specific values to be applied
0337 in each of the instances of that resource on the system.
0338
0339 Cache IDs
0340 ---------
0341 On current generation systems there is one L3 cache per socket and L2
0342 caches are generally just shared by the hyperthreads on a core, but this
isn't an architectural requirement. We could have multiple separate L3
caches on a socket, or multiple cores could share an L2 cache. So instead
0345 of using "socket" or "core" to define the set of logical cpus sharing
0346 a resource we use a "Cache ID". At a given cache level this will be a
0347 unique number across the whole system (but it isn't guaranteed to be a
0348 contiguous sequence, there may be gaps). To find the ID for each logical
0349 CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id
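
For example, the L3 cache IDs of the first two logical CPUs could be listed
like this (a sketch; the index number of the L3 cache and the resulting IDs
vary between systems)::

  # grep . /sys/devices/system/cpu/cpu{0,1}/cache/index3/id
  /sys/devices/system/cpu/cpu0/cache/index3/id:0
  /sys/devices/system/cpu/cpu1/cache/index3/id:0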
0350
0351 Cache Bit Masks (CBM)
0352 ---------------------
0353 For cache resources we describe the portion of the cache that is available
0354 for allocation using a bitmask. The maximum value of the mask is defined
0355 by each cpu model (and may be different for different cache levels). It
0356 is found using CPUID, but is also provided in the "info" directory of
0357 the resctrl file system in "info/{resource}/cbm_mask". Intel hardware
0358 requires that these masks have all the '1' bits in a contiguous block. So
0359 0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9
0360 and 0xA are not. On a system with a 20-bit mask each bit represents 5%
0361 of the capacity of the cache. You could partition the cache into four
0362 equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
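
A sketch of such a four-way split on cache id 0 (assuming a 20-bit
"cbm_mask"; the group names are arbitrary and the default group is left
unchanged, so these allocations remain shareable with it)::

  # cd /sys/fs/resctrl
  # mkdir part0 part1 part2 part3
  # echo "L3:0=1f" > part0/schemata
  # echo "L3:0=3e0" > part1/schemata
  # echo "L3:0=7c00" > part2/schemata
  # echo "L3:0=f8000" > part3/schemata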
0363
0364 Memory bandwidth Allocation and monitoring
0365 ==========================================
0366
0367 For Memory bandwidth resource, by default the user controls the resource
0368 by indicating the percentage of total memory bandwidth.
0369
0370 The minimum bandwidth percentage value for each cpu model is predefined
0371 and can be looked up through "info/MB/min_bandwidth". The bandwidth
0372 granularity that is allocated is also dependent on the cpu model and can
0373 be looked up at "info/MB/bandwidth_gran". The available bandwidth
0374 control steps are: min_bw + N * bw_gran. Intermediate values are rounded
0375 to the next control step available on the hardware.
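
For example, if "info/MB/min_bandwidth" reads 10 and "info/MB/bandwidth_gran"
reads 10, the valid control steps are 10, 20, ..., 100, and an existing
resource group (the name "p0" is illustrative) could be given 40% like this::

  # cat /sys/fs/resctrl/info/MB/min_bandwidth
  10
  # cat /sys/fs/resctrl/info/MB/bandwidth_gran
  10
  # echo "MB:0=40" > /sys/fs/resctrl/p0/schemata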
0376
The bandwidth throttling is a core specific mechanism on some Intel
0378 SKUs. Using a high bandwidth and a low bandwidth setting on two threads
0379 sharing a core may result in both threads being throttled to use the
0380 low bandwidth (see "thread_throttle_mode").
0381
The fact that Memory bandwidth allocation (MBA) may be a core
specific mechanism whereas memory bandwidth monitoring (MBM) is done at
0384 the package level may lead to confusion when users try to apply control
0385 via the MBA and then monitor the bandwidth to see if the controls are
0386 effective. Below are such scenarios:
0387
1. The user may *not* see an increase in actual bandwidth when percentage
0389 values are increased:
0390
0391 This can occur when aggregate L2 external bandwidth is more than L3
0392 external bandwidth. Consider an SKL SKU with 24 cores on a package and
0393 where L2 external is 10GBps (hence aggregate L2 external bandwidth is
0394 240GBps) and L3 external bandwidth is 100GBps. Now a workload with '20
0395 threads, having 50% bandwidth, each consuming 5GBps' consumes the max L3
0396 bandwidth of 100GBps although the percentage value specified is only 50%
0397 << 100%. Hence increasing the bandwidth percentage will not yield any
0398 more bandwidth. This is because although the L2 external bandwidth still
0399 has capacity, the L3 external bandwidth is fully used. Also note that
this would be dependent on the number of cores the benchmark is run on.
0401
2. The same bandwidth percentage may mean different actual bandwidth
depending on the number of threads:
0404
For the same SKU in #1, a 'single thread, with 10% bandwidth' and '4
threads, with 10% bandwidth' can consume up to 10GBps and 40GBps respectively
although they have the same percentage bandwidth of 10%. This is simply
because as threads start using more cores in an rdtgroup, the actual bandwidth
may increase or vary although the user-specified bandwidth percentage is the
same.
0410
0411 In order to mitigate this and make the interface more user friendly,
resctrl added support for specifying the bandwidth in MBps as well. The
kernel underneath would use a software feedback mechanism or a "Software
Controller (mba_sc)" which reads the actual bandwidth using MBM counters
and adjusts the memory bandwidth percentages to ensure::
0416
0417 "actual bandwidth < user specified bandwidth".
0418
By default, the schemata would take the bandwidth percentage values
whereas the user can switch to the "MBA software controller" mode using
a mount option 'mba_MBps'. The schemata format is specified in the below
0422 sections.
0423
0424 L3 schemata file details (code and data prioritization disabled)
0425 ----------------------------------------------------------------
0426 With CDP disabled the L3 schemata format is::
0427
0428 L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
0429
0430 L3 schemata file details (CDP enabled via mount option to resctrl)
0431 ------------------------------------------------------------------
0432 When CDP is enabled L3 control is split into two separate resources
0433 so you can specify independent masks for code and data like this::
0434
0435 L3DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
0436 L3CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
0437
0438 L2 schemata file details
0439 ------------------------
0440 CDP is supported at L2 using the 'cdpl2' mount option. The schemata
0441 format is either::
0442
0443 L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
0444
or::
0446
0447 L2DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
0448 L2CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
0449
0450
0451 Memory bandwidth Allocation (default mode)
0452 ------------------------------------------
0453
0454 Memory b/w domain is L3 cache.
0455 ::
0456
0457 MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...
0458
0459 Memory bandwidth Allocation specified in MBps
0460 ---------------------------------------------
0461
0462 Memory bandwidth domain is L3 cache.
0463 ::
0464
0465 MB:<cache_id0>=bw_MBps0;<cache_id1>=bw_MBps1;...
0466
0467 Reading/writing the schemata file
0468 ---------------------------------
0469 Reading the schemata file will show the state of all resources
0470 on all domains. When writing you only need to specify those values
0471 which you wish to change. E.g.
0472 ::
0473
0474 # cat schemata
0475 L3DATA:0=fffff;1=fffff;2=fffff;3=fffff
0476 L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
0477 # echo "L3DATA:2=3c0;" > schemata
0478 # cat schemata
0479 L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
0480 L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
0481
0482 Cache Pseudo-Locking
0483 ====================
0484 CAT enables a user to specify the amount of cache space that an
0485 application can fill. Cache pseudo-locking builds on the fact that a
0486 CPU can still read and write data pre-allocated outside its current
0487 allocated area on a cache hit. With cache pseudo-locking, data can be
0488 preloaded into a reserved portion of cache that no application can
0489 fill, and from that point on will only serve cache hits. The cache
0490 pseudo-locked memory is made accessible to user space where an
0491 application can map it into its virtual address space and thus have
0492 a region of memory with reduced average read latency.
0493
0494 The creation of a cache pseudo-locked region is triggered by a request
0495 from the user to do so that is accompanied by a schemata of the region
0496 to be pseudo-locked. The cache pseudo-locked region is created as follows:
0497
0498 - Create a CAT allocation CLOSNEW with a CBM matching the schemata
0499 from the user of the cache region that will contain the pseudo-locked
0500 memory. This region must not overlap with any current CAT allocation/CLOS
0501 on the system and no future overlap with this cache region is allowed
0502 while the pseudo-locked region exists.
0503 - Create a contiguous region of memory of the same size as the cache
0504 region.
0505 - Flush the cache, disable hardware prefetchers, disable preemption.
0506 - Make CLOSNEW the active CLOS and touch the allocated memory to load
0507 it into the cache.
0508 - Set the previous CLOS as active.
0509 - At this point the closid CLOSNEW can be released - the cache
0510 pseudo-locked region is protected as long as its CBM does not appear in
0511 any CAT allocation. Even though the cache pseudo-locked region will from
0512 this point on not appear in any CBM of any CLOS an application running with
0513 any CLOS will be able to access the memory in the pseudo-locked region since
0514 the region continues to serve cache hits.
0515 - The contiguous region of memory loaded into the cache is exposed to
0516 user-space as a character device.
0517
0518 Cache pseudo-locking increases the probability that data will remain
0519 in the cache via carefully configuring the CAT feature and controlling
0520 application behavior. There is no guarantee that data is placed in
0521 cache. Instructions like INVD, WBINVD, CLFLUSH, etc. can still evict
0522 “locked” data from cache. Power management C-states may shrink or
0523 power off cache. Deeper C-states will automatically be restricted on
0524 pseudo-locked region creation.
0525
0526 It is required that an application using a pseudo-locked region runs
0527 with affinity to the cores (or a subset of the cores) associated
0528 with the cache on which the pseudo-locked region resides. A sanity check
0529 within the code will not allow an application to map pseudo-locked memory
0530 unless it runs with affinity to cores associated with the cache on which the
0531 pseudo-locked region resides. The sanity check is only done during the
initial mmap() handling; there is no enforcement afterwards and the
application itself needs to ensure it remains affine to the correct cores.
0534
0535 Pseudo-locking is accomplished in two stages:
0536
0537 1) During the first stage the system administrator allocates a portion
0538 of cache that should be dedicated to pseudo-locking. At this time an
equivalent portion of memory is allocated, loaded into the allocated
cache portion, and exposed as a character device.
0541 2) During the second stage a user-space application maps (mmap()) the
0542 pseudo-locked memory into its address space.
0543
0544 Cache Pseudo-Locking Interface
0545 ------------------------------
0546 A pseudo-locked region is created using the resctrl interface as follows:
0547
0548 1) Create a new resource group by creating a new directory in /sys/fs/resctrl.
0549 2) Change the new resource group's mode to "pseudo-locksetup" by writing
0550 "pseudo-locksetup" to the "mode" file.
0551 3) Write the schemata of the pseudo-locked region to the "schemata" file. All
0552 bits within the schemata should be "unused" according to the "bit_usage"
0553 file.
0554
0555 On successful pseudo-locked region creation the "mode" file will contain
0556 "pseudo-locked" and a new character device with the same name as the resource
0557 group will exist in /dev/pseudo_lock. This character device can be mmap()'ed
0558 by user space in order to obtain access to the pseudo-locked memory region.
0559
0560 An example of cache pseudo-locked region creation and usage can be found below.
0561
0562 Cache Pseudo-Locking Debugging Interface
0563 ----------------------------------------
0564 The pseudo-locking debugging interface is enabled by default (if
0565 CONFIG_DEBUG_FS is enabled) and can be found in /sys/kernel/debug/resctrl.
0566
0567 There is no explicit way for the kernel to test if a provided memory
0568 location is present in the cache. The pseudo-locking debugging interface uses
0569 the tracing infrastructure to provide two ways to measure cache residency of
0570 the pseudo-locked region:
0571
0572 1) Memory access latency using the pseudo_lock_mem_latency tracepoint. Data
0573 from these measurements are best visualized using a hist trigger (see
0574 example below). In this test the pseudo-locked region is traversed at
0575 a stride of 32 bytes while hardware prefetchers and preemption
0576 are disabled. This also provides a substitute visualization of cache
0577 hits and misses.
0578 2) Cache hit and miss measurements using model specific precision counters if
0579 available. Depending on the levels of cache on the system the pseudo_lock_l2
0580 and pseudo_lock_l3 tracepoints are available.
0581
0582 When a pseudo-locked region is created a new debugfs directory is created for
0583 it in debugfs as /sys/kernel/debug/resctrl/<newdir>. A single
0584 write-only file, pseudo_lock_measure, is present in this directory. The
0585 measurement of the pseudo-locked region depends on the number written to this
0586 debugfs file:
0587
0588 1:
0589 writing "1" to the pseudo_lock_measure file will trigger the latency
0590 measurement captured in the pseudo_lock_mem_latency tracepoint. See
0591 example below.
0592 2:
0593 writing "2" to the pseudo_lock_measure file will trigger the L2 cache
0594 residency (cache hits and misses) measurement captured in the
0595 pseudo_lock_l2 tracepoint. See example below.
0596 3:
0597 writing "3" to the pseudo_lock_measure file will trigger the L3 cache
0598 residency (cache hits and misses) measurement captured in the
0599 pseudo_lock_l3 tracepoint.
0600
0601 All measurements are recorded with the tracing infrastructure. This requires
0602 the relevant tracepoints to be enabled before the measurement is triggered.
0603
0604 Example of latency debugging interface
0605 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0606 In this example a pseudo-locked region named "newlock" was created. Here is
0607 how we can measure the latency in cycles of reading from this region and
0608 visualize this data with a histogram that is available if CONFIG_HIST_TRIGGERS
0609 is set::
0610
0611 # :> /sys/kernel/debug/tracing/trace
0612 # echo 'hist:keys=latency' > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/trigger
0613 # echo 1 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/enable
0614 # echo 1 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
0615 # echo 0 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/enable
0616 # cat /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/hist
0617
0618 # event histogram
0619 #
0620 # trigger info: hist:keys=latency:vals=hitcount:sort=hitcount:size=2048 [active]
0621 #
0622
0623 { latency: 456 } hitcount: 1
0624 { latency: 50 } hitcount: 83
0625 { latency: 36 } hitcount: 96
0626 { latency: 44 } hitcount: 174
0627 { latency: 48 } hitcount: 195
0628 { latency: 46 } hitcount: 262
0629 { latency: 42 } hitcount: 693
0630 { latency: 40 } hitcount: 3204
0631 { latency: 38 } hitcount: 3484
0632
0633 Totals:
0634 Hits: 8192
0635 Entries: 9
0636 Dropped: 0
0637
0638 Example of cache hits/misses debugging
0639 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0640 In this example a pseudo-locked region named "newlock" was created on the L2
0641 cache of a platform. Here is how we can obtain details of the cache hits
0642 and misses using the platform's precision counters.
0643 ::
0644
0645 # :> /sys/kernel/debug/tracing/trace
0646 # echo 1 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_l2/enable
0647 # echo 2 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
0648 # echo 0 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_l2/enable
0649 # cat /sys/kernel/debug/tracing/trace
0650
0651 # tracer: nop
0652 #
0653 # _-----=> irqs-off
0654 # / _----=> need-resched
0655 # | / _---=> hardirq/softirq
0656 # || / _--=> preempt-depth
0657 # ||| / delay
0658 # TASK-PID CPU# |||| TIMESTAMP FUNCTION
0659 # | | | |||| | |
0660 pseudo_lock_mea-1672 [002] .... 3132.860500: pseudo_lock_l2: hits=4097 miss=0
0661
0662
0663 Examples for RDT allocation usage
0664 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0665
0666 1) Example 1
0667
On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks, a minimum b/w of 10% and a memory bandwidth
granularity of 10%.
0671 ::
0672
0673 # mount -t resctrl resctrl /sys/fs/resctrl
0674 # cd /sys/fs/resctrl
0675 # mkdir p0 p1
0676 # echo "L3:0=3;1=c\nMB:0=50;1=50" > /sys/fs/resctrl/p0/schemata
0677 # echo "L3:0=3;1=3\nMB:0=50;1=50" > /sys/fs/resctrl/p1/schemata
0678
0679 The default resource group is unmodified, so we have access to all parts
0680 of all caches (its schemata file reads "L3:0=f;1=f").
0681
0682 Tasks that are under the control of group "p0" may only allocate from the
0683 "lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
0684 Tasks in group "p1" use the "lower" 50% of cache on both sockets.
0685
0686 Similarly, tasks that are under the control of group "p0" may use a
0687 maximum memory b/w of 50% on socket0 and 50% on socket 1.
0688 Tasks in group "p1" may also use 50% memory b/w on both sockets.
0689 Note that unlike cache masks, memory b/w cannot specify whether these
allocations can overlap or not. The allocation specifies the maximum
0691 b/w that the group may be able to use and the system admin can configure
0692 the b/w accordingly.
0693
If resctrl is using the software controller (mba_sc) then the user can enter
the max b/w in MBps rather than the percentage values.
0696 ::
0697
0698 # echo "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata
0699 # echo "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata
0700
In the above example the tasks in "p1" and "p0" on socket 0 would use a max
b/w of 1024MBps whereas on socket 1 they would use 500MBps.
0703
0704 2) Example 2
0705
0706 Again two sockets, but this time with a more realistic 20-bit mask.
0707
0708 Two real time tasks pid=1234 running on processor 0 and pid=5678 running on
0709 processor 1 on socket 0 on a 2-socket and dual core machine. To avoid noisy
0710 neighbors, each of the two real-time tasks exclusively occupies one quarter
0711 of L3 cache on socket 0.
0712 ::
0713
0714 # mount -t resctrl resctrl /sys/fs/resctrl
0715 # cd /sys/fs/resctrl
0716
0717 First we reset the schemata for the default group so that the "upper"
0718 50% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by
0719 ordinary tasks::
0720
0721 # echo "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata
0722
0723 Next we make a resource group for our first real time task and give
0724 it access to the "top" 25% of the cache on socket 0.
0725 ::
0726
0727 # mkdir p0
0728 # echo "L3:0=f8000;1=fffff" > p0/schemata
0729
0730 Finally we move our first real time task into this resource group. We
0731 also use taskset(1) to ensure the task always runs on a dedicated CPU
0732 on socket 0. Most uses of resource groups will also constrain which
0733 processors tasks run on.
0734 ::
0735
0736 # echo 1234 > p0/tasks
0737 # taskset -cp 1 1234
0738
0739 Ditto for the second real time task (with the remaining 25% of cache)::
0740
0741 # mkdir p1
0742 # echo "L3:0=7c00;1=fffff" > p1/schemata
0743 # echo 5678 > p1/tasks
0744 # taskset -cp 2 5678
0745
0746 For the same 2 socket system with memory b/w resource and CAT L3 the
schemata would look like this (assuming min_bandwidth is 10 and
bandwidth_gran is 10):
0749
0750 For our first real time task this would request 20% memory b/w on socket 0.
0751 ::
0752
0753 # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata
0754
For our second real time task this would request another 20% memory b/w
on socket 0.
::

  # echo -e "L3:0=7c00;1=fffff\nMB:0=20;1=100" > p1/schemata
0760
0761 3) Example 3
0762
A single socket system which has real-time tasks running on cores 4-7 and
non real-time workload assigned to cores 0-3. The real-time tasks share text
0765 and data, so a per task association is not required and due to interaction
0766 with the kernel it's desired that the kernel on these cores shares L3 with
0767 the tasks.
0768 ::
0769
0770 # mount -t resctrl resctrl /sys/fs/resctrl
0771 # cd /sys/fs/resctrl
0772
0773 First we reset the schemata for the default group so that the "upper"
0774 50% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0
0775 cannot be used by ordinary tasks::
0776
0777 # echo "L3:0=3ff\nMB:0=50" > schemata
0778
0779 Next we make a resource group for our real time cores and give it access
0780 to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on
0781 socket 0.
0782 ::
0783
0784 # mkdir p0
0785 # echo "L3:0=ffc00\nMB:0=50" > p0/schemata
0786
0787 Finally we move core 4-7 over to the new group and make sure that the
0788 kernel and the tasks running there get 50% of the cache. They should
0789 also get 50% of memory bandwidth assuming that the cores 4-7 are SMT
0790 siblings and only the real time threads are scheduled on the cores 4-7.
0791 ::
0792
0793 # echo F0 > p0/cpus
0794
0795 4) Example 4
0796
0797 The resource groups in previous examples were all in the default "shareable"
0798 mode allowing sharing of their cache allocations. If one resource group
configures a cache allocation then nothing prevents another resource group
from overlapping with that allocation.
0801
In this example a new exclusive resource group will be created on an L2 CAT
0803 system with two L2 cache instances that can be configured with an 8-bit
0804 capacity bitmask. The new exclusive resource group will be configured to use
0805 25% of each cache instance.
0806 ::
0807
0808 # mount -t resctrl resctrl /sys/fs/resctrl/
0809 # cd /sys/fs/resctrl
0810
0811 First, we observe that the default group is configured to allocate to all L2
0812 cache::
0813
0814 # cat schemata
0815 L2:0=ff;1=ff
0816
0817 We could attempt to create the new resource group at this point, but it will
0818 fail because of the overlap with the schemata of the default group::
0819
0820 # mkdir p0
0821 # echo 'L2:0=0x3;1=0x3' > p0/schemata
0822 # cat p0/mode
0823 shareable
0824 # echo exclusive > p0/mode
0825 -sh: echo: write error: Invalid argument
0826 # cat info/last_cmd_status
0827 schemata overlaps
0828
0829 To ensure that there is no overlap with another resource group the default
0830 resource group's schemata has to change, making it possible for the new
0831 resource group to become exclusive.
0832 ::
0833
0834 # echo 'L2:0=0xfc;1=0xfc' > schemata
0835 # echo exclusive > p0/mode
0836 # grep . p0/*
0837 p0/cpus:0
0838 p0/mode:exclusive
0839 p0/schemata:L2:0=03;1=03
0840 p0/size:L2:0=262144;1=262144
0841
On creation, a new resource group will not overlap with an exclusive
resource group::
0844
0845 # mkdir p1
0846 # grep . p1/*
0847 p1/cpus:0
0848 p1/mode:shareable
0849 p1/schemata:L2:0=fc;1=fc
0850 p1/size:L2:0=786432;1=786432
0851
0852 The bit_usage will reflect how the cache is used::
0853
0854 # cat info/L2/bit_usage
0855 0=SSSSSSEE;1=SSSSSSEE
0856
0857 A resource group cannot be forced to overlap with an exclusive resource group::
0858
0859 # echo 'L2:0=0x1;1=0x1' > p1/schemata
0860 -sh: echo: write error: Invalid argument
0861 # cat info/last_cmd_status
0862 overlaps with exclusive group
0863
0864 Example of Cache Pseudo-Locking
0865 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Lock a portion of the L2 cache of cache id 1 using CBM 0x3. The pseudo-locked
region is exposed at /dev/pseudo_lock/newlock and can be provided to an
application as an argument to mmap().
0869 ::
0870
0871 # mount -t resctrl resctrl /sys/fs/resctrl/
0872 # cd /sys/fs/resctrl
0873
Ensure that there are bits available that can be pseudo-locked. Since only
unused bits can be pseudo-locked, the bits to be pseudo-locked need to be
removed from the default resource group's schemata::
0877
0878 # cat info/L2/bit_usage
0879 0=SSSSSSSS;1=SSSSSSSS
0880 # echo 'L2:1=0xfc' > schemata
0881 # cat info/L2/bit_usage
0882 0=SSSSSSSS;1=SSSSSS00
0883
0884 Create a new resource group that will be associated with the pseudo-locked
0885 region, indicate that it will be used for a pseudo-locked region, and
0886 configure the requested pseudo-locked region capacity bitmask::
0887
0888 # mkdir newlock
0889 # echo pseudo-locksetup > newlock/mode
0890 # echo 'L2:1=0x3' > newlock/schemata
0891
0892 On success the resource group's mode will change to pseudo-locked, the
0893 bit_usage will reflect the pseudo-locked region, and the character device
0894 exposing the pseudo-locked region will exist::
0895
0896 # cat newlock/mode
0897 pseudo-locked
0898 # cat info/L2/bit_usage
0899 0=SSSSSSSS;1=SSSSSSPP
0900 # ls -l /dev/pseudo_lock/newlock
0901 crw------- 1 root root 243, 0 Apr 3 05:01 /dev/pseudo_lock/newlock
0902
0903 ::
0904
0905 /*
0906 * Example code to access one page of pseudo-locked cache region
0907 * from user space.
0908 */
0909 #define _GNU_SOURCE
0910 #include <fcntl.h>
0911 #include <sched.h>
0912 #include <stdio.h>
0913 #include <stdlib.h>
0914 #include <unistd.h>
0915 #include <sys/mman.h>
0916
0917 /*
0918 * It is required that the application runs with affinity to only
0919 * cores associated with the pseudo-locked region. Here the cpu
0920 * is hardcoded for convenience of example.
0921 */
0922 static int cpuid = 2;
0923
0924 int main(int argc, char *argv[])
0925 {
0926 cpu_set_t cpuset;
0927 long page_size;
0928 void *mapping;
0929 int dev_fd;
0930 int ret;
0931
0932 page_size = sysconf(_SC_PAGESIZE);
0933
0934 CPU_ZERO(&cpuset);
0935 CPU_SET(cpuid, &cpuset);
0936 ret = sched_setaffinity(0, sizeof(cpuset), &cpuset);
0937 if (ret < 0) {
0938 perror("sched_setaffinity");
0939 exit(EXIT_FAILURE);
0940 }
0941
0942 dev_fd = open("/dev/pseudo_lock/newlock", O_RDWR);
0943 if (dev_fd < 0) {
0944 perror("open");
0945 exit(EXIT_FAILURE);
0946 }
0947
0948 mapping = mmap(0, page_size, PROT_READ | PROT_WRITE, MAP_SHARED,
0949 dev_fd, 0);
0950 if (mapping == MAP_FAILED) {
0951 perror("mmap");
0952 close(dev_fd);
0953 exit(EXIT_FAILURE);
0954 }
0955
0956 /* Application interacts with pseudo-locked memory @mapping */
0957
0958 ret = munmap(mapping, page_size);
0959 if (ret < 0) {
0960 perror("munmap");
0961 close(dev_fd);
0962 exit(EXIT_FAILURE);
0963 }
0964
0965 close(dev_fd);
0966 exit(EXIT_SUCCESS);
0967 }
0968
0969 Locking between applications
0970 ----------------------------
0971
0972 Certain operations on the resctrl filesystem, composed of read/writes
0973 to/from multiple files, must be atomic.
0974
0975 As an example, the allocation of an exclusive reservation of L3 cache
0976 involves:
0977
0978 1. Read the cbmmasks from each directory or the per-resource "bit_usage"
0979 2. Find a contiguous set of bits in the global CBM bitmask that is clear
0980 in any of the directory cbmmasks
0981 3. Create a new directory
0982 4. Set the bits found in step 2 to the new directory "schemata" file
0983
0984 If two applications attempt to allocate space concurrently then they can
0985 end up allocating the same bits so the reservations are shared instead of
0986 exclusive.
0987
0988 To coordinate atomic operations on the resctrlfs and to avoid the problem
0989 above, the following locking procedure is recommended:
0990
0991 Locking is based on flock, which is available in libc and also as a shell
0992 script command
0993
0994 Write lock:
0995
0996 A) Take flock(LOCK_EX) on /sys/fs/resctrl
0997 B) Read/write the directory structure.
0998 C) funlock
0999
1000 Read lock:
1001
1002 A) Take flock(LOCK_SH) on /sys/fs/resctrl
B) If successful, read the directory structure.
1004 C) funlock
1005
1006 Example with bash::
1007
1008 # Atomically read directory structure
1009 $ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl
1010
1011 # Read directory contents and create new subdirectory
1012
1013 $ cat create-dir.sh
1014 find /sys/fs/resctrl/ > output.txt
1015 mask = function-of(output.txt)
1016 mkdir /sys/fs/resctrl/newres/
1017 echo mask > /sys/fs/resctrl/newres/schemata
1018
1019 $ flock /sys/fs/resctrl/ ./create-dir.sh
1020
1021 Example with C::
1022
1023 /*
* Example code to take advisory locks
1025 * before accessing resctrl filesystem
1026 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/file.h>
1029
1030 void resctrl_take_shared_lock(int fd)
1031 {
1032 int ret;
1033
1034 /* take shared lock on resctrl filesystem */
1035 ret = flock(fd, LOCK_SH);
1036 if (ret) {
1037 perror("flock");
1038 exit(-1);
1039 }
1040 }
1041
1042 void resctrl_take_exclusive_lock(int fd)
1043 {
1044 int ret;
1045
/* take exclusive lock on resctrl filesystem */
1047 ret = flock(fd, LOCK_EX);
1048 if (ret) {
1049 perror("flock");
1050 exit(-1);
1051 }
1052 }
1053
1054 void resctrl_release_lock(int fd)
1055 {
1056 int ret;
1057
/* release lock on resctrl filesystem */
1059 ret = flock(fd, LOCK_UN);
1060 if (ret) {
1061 perror("flock");
1062 exit(-1);
1063 }
1064 }
1065
int main(void)
{
int fd;
1069
1070 fd = open("/sys/fs/resctrl", O_DIRECTORY);
1071 if (fd == -1) {
1072 perror("open");
1073 exit(-1);
1074 }
1075 resctrl_take_shared_lock(fd);
1076 /* code to read directory contents */
1077 resctrl_release_lock(fd);
1078
1079 resctrl_take_exclusive_lock(fd);
1080 /* code to read and write directory contents */
1081 resctrl_release_lock(fd);
1082 }
1083
1084 Examples for RDT Monitoring along with allocation usage
1085 =======================================================
1086 Reading monitored data
1087 ----------------------
Reading an event file (for example mon_data/mon_L3_00/llc_occupancy) would
1089 show the current snapshot of LLC occupancy of the corresponding MON
1090 group or CTRL_MON group.
1091
1092
1093 Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group)
1094 ------------------------------------------------------------------------
1095 On a two socket machine (one L3 cache per socket) with just four bits
1096 for cache bit masks::
1097
1098 # mount -t resctrl resctrl /sys/fs/resctrl
1099 # cd /sys/fs/resctrl
1100 # mkdir p0 p1
1101 # echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
1102 # echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata
1103 # echo 5678 > p1/tasks
1104 # echo 5679 > p1/tasks
1105
1106 The default resource group is unmodified, so we have access to all parts
1107 of all caches (its schemata file reads "L3:0=f;1=f").
1108
1109 Tasks that are under the control of group "p0" may only allocate from the
1110 "lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
1111 Tasks in group "p1" use the "lower" 50% of cache on both sockets.
1112
1113 Create monitor groups and assign a subset of tasks to each monitor group.
1114 ::
1115
1116 # cd /sys/fs/resctrl/p1/mon_groups
1117 # mkdir m11 m12
1118 # echo 5678 > m11/tasks
1119 # echo 5679 > m12/tasks
1120
1121 fetch data (data shown in bytes)
1122 ::
1123
1124 # cat m11/mon_data/mon_L3_00/llc_occupancy
1125 16234000
1126 # cat m11/mon_data/mon_L3_01/llc_occupancy
1127 14789000
1128 # cat m12/mon_data/mon_L3_00/llc_occupancy
1129 16789000
1130
1131 The parent ctrl_mon group shows the aggregated data.
1132 ::
1133
# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
1135 31234000
1136
1137 Example 2 (Monitor a task from its creation)
1138 --------------------------------------------
1139 On a two socket machine (one L3 cache per socket)::
1140
1141 # mount -t resctrl resctrl /sys/fs/resctrl
1142 # cd /sys/fs/resctrl
1143 # mkdir p0 p1
1144
An RMID is allocated to the group once it is created and hence the <cmd>
1146 below is monitored from its creation.
1147 ::
1148
1149 # echo $$ > /sys/fs/resctrl/p1/tasks
1150 # <cmd>
1151
1152 Fetch the data::
1153
# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
1155 31789000
1156
1157 Example 3 (Monitor without CAT support or before creating CAT groups)
1158 ---------------------------------------------------------------------
1159
Assume a system like HSW has only CQM and no CAT support. In this case
resctrl will still mount but cannot create CTRL_MON directories. However,
the user can create different MON groups within the root group and thereby
monitor all tasks including kernel threads.
1164
This can also be used to profile jobs' cache size footprint before being
1166 able to allocate them to different allocation groups.
1167 ::
1168
1169 # mount -t resctrl resctrl /sys/fs/resctrl
1170 # cd /sys/fs/resctrl
1171 # mkdir mon_groups/m01
1172 # mkdir mon_groups/m02
1173
1174 # echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks
1175 # echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks
1176
Monitor the groups separately and also get per domain data. From the
output below it is apparent that the tasks are mostly doing work on
domain (socket) 0.
1180 ::
1181
# cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_00/llc_occupancy
31234000
# cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_01/llc_occupancy
34555
# cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_00/llc_occupancy
31234000
# cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_01/llc_occupancy
32789
1189 32789
1190
1191
1192 Example 4 (Monitor real time tasks)
1193 -----------------------------------
1194
1195 A single socket system which has real time tasks running on cores 4-7
1196 and non real time tasks on other cpus. We want to monitor the cache
1197 occupancy of the real time threads on these cores.
1198 ::
1199
1200 # mount -t resctrl resctrl /sys/fs/resctrl
1201 # cd /sys/fs/resctrl
1202 # mkdir p1
1203
1204 Move the cpus 4-7 over to p1::
1205
1206 # echo f0 > p1/cpus
1207
1208 View the llc occupancy snapshot::
1209
1210 # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
1211 11234000
1212
1213 Intel RDT Errata
1214 ================
1215
1216 Intel MBM Counters May Report System Memory Bandwidth Incorrectly
1217 -----------------------------------------------------------------
1218
1219 Errata SKX99 for Skylake server and BDF102 for Broadwell server.
1220
1221 Problem: Intel Memory Bandwidth Monitoring (MBM) counters track metrics
1222 according to the assigned Resource Monitor ID (RMID) for that logical
1223 core. The IA32_QM_CTR register (MSR 0xC8E), used to report these
1224 metrics, may report incorrect system bandwidth for certain RMID values.
1225
1226 Implication: Due to the errata, system memory bandwidth may not match
1227 what is reported.
1228
1229 Workaround: MBM total and local readings are corrected according to the
1230 following correction factor table:
1231
1232 +---------------+---------------+---------------+-----------------+
1233 |core count |rmid count |rmid threshold |correction factor|
1234 +---------------+---------------+---------------+-----------------+
1235 |1 |8 |0 |1.000000 |
1236 +---------------+---------------+---------------+-----------------+
1237 |2 |16 |0 |1.000000 |
1238 +---------------+---------------+---------------+-----------------+
1239 |3 |24 |15 |0.969650 |
1240 +---------------+---------------+---------------+-----------------+
1241 |4 |32 |0 |1.000000 |
1242 +---------------+---------------+---------------+-----------------+
1243 |6 |48 |31 |0.969650 |
1244 +---------------+---------------+---------------+-----------------+
1245 |7 |56 |47 |1.142857 |
1246 +---------------+---------------+---------------+-----------------+
1247 |8 |64 |0 |1.000000 |
1248 +---------------+---------------+---------------+-----------------+
1249 |9 |72 |63 |1.185115 |
1250 +---------------+---------------+---------------+-----------------+
1251 |10 |80 |63 |1.066553 |
1252 +---------------+---------------+---------------+-----------------+
1253 |11 |88 |79 |1.454545 |
1254 +---------------+---------------+---------------+-----------------+
1255 |12 |96 |0 |1.000000 |
1256 +---------------+---------------+---------------+-----------------+
1257 |13 |104 |95 |1.230769 |
1258 +---------------+---------------+---------------+-----------------+
1259 |14 |112 |95 |1.142857 |
1260 +---------------+---------------+---------------+-----------------+
1261 |15 |120 |95 |1.066667 |
1262 +---------------+---------------+---------------+-----------------+
1263 |16 |128 |0 |1.000000 |
1264 +---------------+---------------+---------------+-----------------+
1265 |17 |136 |127 |1.254863 |
1266 +---------------+---------------+---------------+-----------------+
1267 |18 |144 |127 |1.185255 |
1268 +---------------+---------------+---------------+-----------------+
1269 |19 |152 |0 |1.000000 |
1270 +---------------+---------------+---------------+-----------------+
1271 |20 |160 |127 |1.066667 |
1272 +---------------+---------------+---------------+-----------------+
1273 |21 |168 |0 |1.000000 |
1274 +---------------+---------------+---------------+-----------------+
1275 |22 |176 |159 |1.454334 |
1276 +---------------+---------------+---------------+-----------------+
1277 |23 |184 |0 |1.000000 |
1278 +---------------+---------------+---------------+-----------------+
1279 |24 |192 |127 |0.969744 |
1280 +---------------+---------------+---------------+-----------------+
1281 |25 |200 |191 |1.280246 |
1282 +---------------+---------------+---------------+-----------------+
1283 |26 |208 |191 |1.230921 |
1284 +---------------+---------------+---------------+-----------------+
1285 |27 |216 |0 |1.000000 |
1286 +---------------+---------------+---------------+-----------------+
1287 |28 |224 |191 |1.143118 |
1288 +---------------+---------------+---------------+-----------------+
1289
1290 If rmid > rmid threshold, MBM total and local values should be multiplied
1291 by the correction factor.
1292
1293 See:
1294
1295 1. Erratum SKX99 in Intel Xeon Processor Scalable Family Specification Update:
1296 http://web.archive.org/web/20200716124958/https://www.intel.com/content/www/us/en/processors/xeon/scalable/xeon-scalable-spec-update.html
1297
1298 2. Erratum BDF102 in Intel Xeon E5-2600 v4 Processor Product Family Specification Update:
1299 http://web.archive.org/web/20191125200531/https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-v4-spec-update.pdf
1300
1301 3. The errata in Intel Resource Director Technology (Intel RDT) on 2nd Generation Intel Xeon Scalable Processors Reference Manual:
1302 https://software.intel.com/content/www/us/en/develop/articles/intel-resource-director-technology-rdt-reference-manual.html
1303
1304 for further information.