.. _numaperf:

=============
NUMA Locality
=============

Some platforms may have multiple types of memory attached to a compute
node. These disparate memory ranges may share some characteristics, such
as CPU cache coherence, but may have different performance. For example,
different media types and buses affect bandwidth and latency.

A system supports such heterogeneous memory by grouping each memory type
under different domains, or "nodes", based on locality and performance
characteristics.  Some memory may share a node with a CPU, while other
memory is provided in memory-only nodes. While memory-only nodes do not
provide CPUs, they may still be local to one or more compute nodes
relative to other nodes. The following diagram shows one such example of
two compute nodes with local memory and a memory-only node for each
compute node::

 +------------------+     +------------------+
 | Compute Node 0   +-----+ Compute Node 1   |
 | Local Node0 Mem  |     | Local Node1 Mem  |
 +--------+---------+     +--------+---------+
          |                        |
 +--------+---------+     +--------+---------+
 | Slower Node2 Mem |     | Slower Node3 Mem |
 +------------------+     +------------------+

A "memory initiator" is a node containing one or more devices such as
CPUs or separate memory I/O devices that can initiate memory requests.
A "memory target" is a node containing one or more physical address
ranges accessible from one or more memory initiators.

When multiple memory initiators exist, they may not all have the same
performance when accessing a given memory target. Each initiator-target
pair may be organized into different ranked access classes to represent
this relationship. The highest performing initiator to a given target
is considered to be one of that target's local initiators, and is given
the highest access class, 0. Any given target may have one or more
local initiators, and any given initiator may have multiple local
memory targets.

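As an illustration of the ranking described above, the following sketch
picks the local initiators of each target from a table of access
latencies. The latency table, the node numbers and the helper name
"local_initiators" are hypothetical. The kernel learns this relationship
from platform firmware rather than from a computation like this one, and
a real platform may weigh bandwidth as well as latency.

```python
def local_initiators(latency_ns):
    """Map each target node to the initiators with the lowest latency.

    latency_ns maps (initiator, target) pairs to a latency in nanoseconds.
    The table is hypothetical input, not something read from the kernel.
    """
    best = {}
    for (initiator, target), latency in latency_ns.items():
        current = best.get(target)
        if current is None or latency < current[0]:
            # New best latency for this target: start a fresh set.
            best[target] = (latency, {initiator})
        elif latency == current[0]:
            # Tie: a target may have more than one local initiator.
            current[1].add(initiator)
    return {target: sorted(nodes) for target, (_, nodes) in best.items()}


# Hypothetical latencies for the four-node diagram above.
example = {
    (0, 0): 80, (1, 0): 130,
    (0, 1): 130, (1, 1): 80,
    (0, 2): 300, (1, 2): 350,
    (0, 3): 350, (1, 3): 300,
}
print(local_initiators(example))
```
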
To help applications match memory targets with their initiators, the
kernel provides symlinks from each to the other. The following example
lists the relationship for the access class "0" memory initiators and
targets::

        # symlinks -v /sys/devices/system/node/nodeX/access0/targets/
        relative: /sys/devices/system/node/nodeX/access0/targets/nodeY -> ../../nodeY

        # symlinks -v /sys/devices/system/node/nodeY/access0/initiators/
        relative: /sys/devices/system/node/nodeY/access0/initiators/nodeX -> ../../nodeX

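The same relationship can be read programmatically. The following is a
minimal sketch, assuming Python 3 and the sysfs layout shown above. The
helper name "linked_nodes" and its "root" parameter are illustrative and
not part of any kernel interface. The "root" parameter exists only so the
helper can be pointed at a different directory tree.

```python
import os
import re

NODE_ROOT = "/sys/devices/system/node"


def linked_nodes(node, kind, access_class=0, root=NODE_ROOT):
    """Return the node numbers linked under nodeN/accessC/<kind>/.

    kind is "targets" or "initiators". An empty list is returned when
    the directory does not exist, for example when the platform does not
    report this information.
    """
    path = os.path.join(root, f"node{node}", f"access{access_class}", kind)
    try:
        entries = os.listdir(path)
    except OSError:
        return []
    found = []
    for name in entries:
        # Skip anything that is not a nodeN link, such as attribute files.
        match = re.fullmatch(r"node(\d+)", name)
        if match:
            found.append(int(match.group(1)))
    return sorted(found)


# Prints an empty list on systems that do not expose this directory.
print(linked_nodes(0, "targets"))
```
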
A memory initiator may have multiple memory targets in the same access
class. The target memory's initiators in a given class indicate that the
nodes' access characteristics share the same performance relative to other
linked initiator nodes. The targets within an initiator's access class,
though, do not necessarily perform the same as each other.

The access class "1" is used to allow differentiation between initiators
that are CPUs and hence suitable for generic task scheduling, and
I/O initiators such as GPUs and NICs.  Unlike access class 0, only
nodes containing CPUs are considered.

================
NUMA Performance
================

Applications may wish to consider which node they want their memory to
be allocated from based on the node's performance characteristics. If
the system provides these attributes, the kernel exports them under the
node sysfs hierarchy by appending the attributes directory under the
memory node's access class 0 initiators as follows::

        /sys/devices/system/node/nodeY/access0/initiators/

These attributes apply only when accessed from nodes that are linked
under this access class's initiators.

The performance characteristics the kernel provides for the local
initiators are exported as follows::

        # tree -P "read*|write*" /sys/devices/system/node/nodeY/access0/initiators/
        /sys/devices/system/node/nodeY/access0/initiators/
        |-- read_bandwidth
        |-- read_latency
        |-- write_bandwidth
        `-- write_latency

The bandwidth attributes are provided in MiB/second.

The latency attributes are provided in nanoseconds.

The values reported here correspond to the rated latency and bandwidth
for the platform.

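A program can collect these values for a memory target as sketched
below. This is a minimal example, assuming Python 3 and assuming that
each attribute file holds a single decimal integer. The helper name
"read_access_performance" and its "root" parameter are illustrative
only. Attributes that are missing or unreadable are left out of the
result rather than reported as zero.

```python
import os

NODE_ROOT = "/sys/devices/system/node"
ATTRIBUTES = ("read_bandwidth", "read_latency",
              "write_bandwidth", "write_latency")


def read_access_performance(node, access_class=0, root=NODE_ROOT):
    """Return the performance attributes of a memory target as integers.

    Bandwidth values are in MiB/second and latency values in nanoseconds.
    Attributes the platform does not report are omitted from the result.
    """
    base = os.path.join(root, f"node{node}",
                        f"access{access_class}", "initiators")
    values = {}
    for name in ATTRIBUTES:
        try:
            with open(os.path.join(base, name)) as attr:
                values[name] = int(attr.read().strip())
        except (OSError, ValueError):
            # Missing file or unexpected content: skip this attribute.
            continue
    return values


# Prints an empty dict when the platform exports no attributes for node 0.
print(read_access_performance(0))
```
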
Access class 1 takes the same form but only includes values for CPU to
memory activity.

==========
NUMA Cache
==========

System memory may be constructed in a hierarchy of elements with various
performance characteristics in order to provide a large address space of
slower performing memory cached by a smaller, higher performing memory. The
system physical addresses that memory initiators are aware of are provided
by the last memory level in the hierarchy. The system meanwhile uses
higher performing memory to transparently cache access to progressively
slower levels.

The term "far memory" is used to denote the last level memory in the
hierarchy. Each increasing cache level provides higher performing
initiator access, and the term "near memory" represents the fastest
cache provided by the system.

This numbering is different from CPU caches, where the cache level (e.g.
L1, L2, L3) uses the CPU-side view and each increased level is lower
performing. In contrast, the memory cache level is centric to the last
level memory, so the higher numbered cache level corresponds to memory
nearer to the CPU, and further from far memory.

The memory-side caches are not directly addressable by software. When
software accesses a system address, the system will return it from the
near memory cache if it is present. If it is not present, the system
accesses the next level of memory until there is either a hit in that
cache level, or it reaches far memory.

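The lookup order described above can be modelled with a toy example.
This is a sketch only: the function name "lookup", the dictionaries and
the addresses are hypothetical, and the model ignores cache fills,
evictions and write handling. Levels are numbered as in this document,
so the highest number is the near memory cache.

```python
def lookup(address, cache_levels, far_memory):
    """Return (value, source) for an address in a toy memory hierarchy.

    cache_levels is ordered from near memory to the level closest to far
    memory; each level is a dict mapping cached addresses to values.
    far_memory is a dict holding every addressable location.
    """
    for distance, level in enumerate(cache_levels):
        if address in level:
            # The first list entry is near memory, the highest level number.
            return level[address], f"cache level {len(cache_levels) - distance}"
    return far_memory[address], "far memory"


# Hypothetical contents: two cache levels in front of far memory.
far = {0x1000: "a", 0x2000: "b", 0x3000: "c"}
levels = [{0x1000: "a"}, {0x1000: "a", 0x2000: "b"}]

print(lookup(0x1000, levels, far))  # hit in the near memory cache
print(lookup(0x2000, levels, far))  # miss in near memory, hit one level down
print(lookup(0x3000, levels, far))  # miss in every cache level
```
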
An application does not need to know about caching attributes in order
to use the system. Software may optionally query the memory cache
attributes in order to maximize the performance out of such a setup.
If the system provides a way for the kernel to discover this information,
for example with ACPI HMAT (Heterogeneous Memory Attribute Table),
the kernel will append these attributes to the NUMA node memory target.

When the kernel first registers a memory cache with a node, the kernel
will create the following directory::

        /sys/devices/system/node/nodeX/memory_side_cache/

If that directory is not present, the system either does not provide
a memory-side cache, or that information is not accessible to the kernel.

The attributes for each level of cache are provided under its cache
level index::

        /sys/devices/system/node/nodeX/memory_side_cache/indexA/
        /sys/devices/system/node/nodeX/memory_side_cache/indexB/
        /sys/devices/system/node/nodeX/memory_side_cache/indexC/

Each cache level's directory provides its attributes. For example, the
following shows a single cache level and the attributes available for
software to query::

        # tree /sys/devices/system/node/node0/memory_side_cache/
        /sys/devices/system/node/node0/memory_side_cache/
        |-- index1
        |   |-- indexing
        |   |-- line_size
        |   |-- size
        |   `-- write_policy

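Software that wants to query these attributes could walk the directory
as sketched below. This is a minimal example, assuming Python 3, decimal
index numbers in the directory names and a single decimal integer in
each attribute file. The helper names "describe" and
"memory_side_caches" and the "root" parameter are illustrative only.
The "indexing" and "write_policy" values are decoded using the zero and
non-zero meanings this document gives for them.

```python
import os
import re

NODE_ROOT = "/sys/devices/system/node"
CACHE_ATTRIBUTES = ("indexing", "line_size", "size", "write_policy")


def describe(raw):
    """Decode the zero/non-zero encodings into readable strings."""
    decoded = dict(raw)
    if "indexing" in raw:
        decoded["indexing"] = (
            "direct-mapped" if raw["indexing"] == 0 else "not direct-mapped")
    if "write_policy" in raw:
        decoded["write_policy"] = (
            "write-back" if raw["write_policy"] == 0 else "write-through")
    return decoded


def memory_side_caches(node, root=NODE_ROOT):
    """Return {level: attributes} for a node's memory-side caches.

    An empty dict means the directory is absent: either there is no
    memory-side cache or the kernel could not discover it.
    """
    base = os.path.join(root, f"node{node}", "memory_side_cache")
    try:
        entries = os.listdir(base)
    except OSError:
        return {}
    caches = {}
    for name in entries:
        match = re.fullmatch(r"index(\d+)", name)
        if not match:
            continue
        raw = {}
        for attr_name in CACHE_ATTRIBUTES:
            try:
                with open(os.path.join(base, name, attr_name)) as attr:
                    raw[attr_name] = int(attr.read().strip())
            except (OSError, ValueError):
                # Missing file or unexpected content: skip this attribute.
                continue
        caches[int(match.group(1))] = describe(raw)
    return caches


# Prints an empty dict when node 0 reports no memory-side cache.
print(memory_side_caches(0))
```
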
The "indexing" will be 0 if it is a direct-mapped cache, and non-zero
for any other index-based, multi-way associative cache.

The "line_size" is the number of bytes accessed from the next cache
level on a miss.

The "size" is the number of bytes provided by this cache level.

The "write_policy" will be 0 for write-back, and non-zero for
write-through caching.

========
See Also
========

[1] https://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
- Section 5.2.27