0001 .. _mm_concepts:
0002
0003 =================
0004 Concepts overview
0005 =================
0006
Memory management in Linux is a complex system that has evolved over
the years, gaining more and more functionality to support a variety of
systems from MMU-less microcontrollers to supercomputers. Memory
management for systems without an MMU is called ``nommu`` and it
definitely deserves a dedicated document, which will hopefully be
written eventually. Yet, although some of the concepts are the same,
here we assume that an MMU is available and that the CPU can translate
a virtual address to a physical address.
0015
0016 .. contents:: :local:
0017
0018 Virtual Memory Primer
0019 =====================
0020
0021 The physical memory in a computer system is a limited resource and
0022 even for systems that support memory hotplug there is a hard limit on
0023 the amount of memory that can be installed. The physical memory is not
0024 necessarily contiguous; it might be accessible as a set of distinct
address ranges. Moreover, different CPU architectures, and even
different implementations of the same architecture, have different
views of how these address ranges are defined.
0028
All this makes dealing directly with physical memory quite complex,
and to avoid this complexity the concept of virtual memory was
developed.
0031
Virtual memory abstracts the details of physical memory from the
application software, allows keeping only the needed information in
physical memory (demand paging) and provides a mechanism for the
protection and controlled sharing of data between processes.
0036
0037 With virtual memory, each and every memory access uses a virtual
0038 address. When the CPU decodes an instruction that reads (or
0039 writes) from (or to) the system memory, it translates the `virtual`
0040 address encoded in that instruction to a `physical` address that the
0041 memory controller can understand.
0042
0043 The physical system memory is divided into page frames, or pages. The
0044 size of each page is architecture specific. Some architectures allow
0045 selection of the page size from several supported values; this
0046 selection is performed at the kernel build time by setting an
0047 appropriate kernel configuration option.
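
For example, the page size the kernel was configured with can be
queried from user space with sysconf(3):

.. code-block:: c

   #include <stdio.h>
   #include <unistd.h>

   int main(void)
   {
           /* Page size selected at kernel build time (or fixed by the
            * architecture). */
           long page_size = sysconf(_SC_PAGESIZE);

           printf("page size: %ld bytes\n", page_size);
           return 0;
   }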
0048
0049 Each physical memory page can be mapped as one or more virtual
0050 pages. These mappings are described by page tables that allow
0051 translation from a virtual address used by programs to the physical
0052 memory address. The page tables are organized hierarchically.
0053
0054 The tables at the lowest level of the hierarchy contain physical
0055 addresses of actual pages used by the software. The tables at higher
0056 levels contain physical addresses of the pages belonging to the lower
0057 levels. The pointer to the top level page table resides in a
0058 register. When the CPU performs the address translation, it uses this
0059 register to access the top level page table. The high bits of the
0060 virtual address are used to index an entry in the top level page
table. That entry is then used to access the next level in the
hierarchy, with the next bits of the virtual address used as the index
into that level's page table. The lowest bits of the virtual address
define the offset inside the actual page.
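
As an illustration only (the constants below are not taken from kernel
code), the following sketch splits a virtual address the way a
four-level x86-64 page table walk with 4K pages would: 9 bits of index
per level and a 12-bit offset inside the page:

.. code-block:: c

   #include <stdint.h>
   #include <stdio.h>

   int main(void)
   {
           uint64_t vaddr = 0x00007f1234567abcULL;  /* arbitrary address */

           unsigned int offset = vaddr & 0xfff;           /* bits 11..0  */
           unsigned int pte = (vaddr >> 12) & 0x1ff;      /* bits 20..12 */
           unsigned int pmd = (vaddr >> 21) & 0x1ff;      /* bits 29..21 */
           unsigned int pud = (vaddr >> 30) & 0x1ff;      /* bits 38..30 */
           unsigned int pgd = (vaddr >> 39) & 0x1ff;      /* bits 47..39 */

           printf("pgd=%u pud=%u pmd=%u pte=%u offset=0x%x\n",
                  pgd, pud, pmd, pte, offset);
           return 0;
   }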
0065
0066 Huge Pages
0067 ==========
0068
The address translation requires several memory accesses, and memory
accesses are slow relative to CPU speed. To avoid spending precious
processor cycles on the address translation, CPUs maintain a cache of
such translations called the Translation Lookaside Buffer (or
TLB). Usually the TLB is a pretty scarce resource and applications
with a large memory working set will experience a performance hit
because of TLB misses.
0076
Many modern CPU architectures allow mapping of memory pages directly
by the higher levels in the page table. For instance, on x86 it is
possible to map 2M and even 1G pages using entries in the second and
the third level page tables. In Linux such pages are called
`huge`. Usage of huge pages significantly reduces pressure on the TLB,
improves the TLB hit rate and thus improves overall system
performance.
0083
There are two mechanisms in Linux that enable mapping of the physical
memory with huge pages. The first one is `HugeTLB filesystem`, or
hugetlbfs. It is a pseudo filesystem that uses RAM as its backing
store. For the files created in this filesystem the data resides in
memory and is mapped using huge pages. The hugetlbfs is described at
:ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`.
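
As an illustration, a minimal sketch of mapping a file from a
hugetlbfs mount. The mount point ``/dev/hugepages``, the file name and
the 2M huge page size are assumptions, and huge pages must have been
reserved beforehand (for example via ``/proc/sys/vm/nr_hugepages``):

.. code-block:: c

   #include <fcntl.h>
   #include <stdio.h>
   #include <stdlib.h>
   #include <string.h>
   #include <sys/mman.h>
   #include <unistd.h>

   #define HPAGE_SIZE (2UL * 1024 * 1024)  /* assumed huge page size */

   int main(void)
   {
           int fd = open("/dev/hugepages/example", O_CREAT | O_RDWR, 0600);
           char *p;

           if (fd < 0) {
                   perror("open");
                   return EXIT_FAILURE;
           }

           /* The mapping length must be a multiple of the huge page size. */
           p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
           if (p == MAP_FAILED) {
                   perror("mmap");
                   return EXIT_FAILURE;
           }

           memset(p, 0, HPAGE_SIZE);  /* touch the huge page */

           munmap(p, HPAGE_SIZE);
           close(fd);
           unlink("/dev/hugepages/example");
           return 0;
   }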
0090
Another, more recent, mechanism that enables use of huge pages is
called `Transparent HugePages`, or THP. Unlike hugetlbfs, which
requires users and/or system administrators to configure which parts
of the system memory should and can be mapped by huge pages, THP
manages such mappings transparently to the user, hence the name. See
:ref:`Documentation/admin-guide/mm/transhuge.rst <admin_guide_transhuge>`
for more details about THP.
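
As an illustration, a sketch of hinting the kernel with madvise(2)
that an anonymous range is a good THP candidate; it assumes a kernel
built with CONFIG_TRANSPARENT_HUGEPAGE:

.. code-block:: c

   #include <stdio.h>
   #include <stdlib.h>
   #include <string.h>
   #include <sys/mman.h>

   #define LEN (4UL * 1024 * 1024)  /* large enough to hold 2M pages */

   int main(void)
   {
           /* An ordinary anonymous mapping; no huge page setup needed. */
           char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

           if (p == MAP_FAILED) {
                   perror("mmap");
                   return EXIT_FAILURE;
           }

           /* Hint that this range should be backed by huge pages; the
            * kernel may or may not honor it. */
           if (madvise(p, LEN, MADV_HUGEPAGE))
                   perror("madvise");

           memset(p, 0, LEN);  /* touching the range faults pages in */

           munmap(p, LEN);
           return 0;
   }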
0099
0100 Zones
0101 =====
0102
0103 Often hardware poses restrictions on how different physical memory
0104 ranges can be accessed. In some cases, devices cannot perform DMA to
0105 all the addressable memory. In other cases, the size of the physical
0106 memory exceeds the maximal addressable size of virtual memory and
0107 special actions are required to access portions of the memory. Linux
0108 groups memory pages into `zones` according to their possible
0109 usage. For example, ZONE_DMA will contain memory that can be used by
devices for DMA, ZONE_HIGHMEM will contain memory that is not
permanently mapped into the kernel's address space, and ZONE_NORMAL
will contain normally addressed pages.
0113
0114 The actual layout of the memory zones is hardware dependent as not all
0115 architectures define all zones, and requirements for DMA are different
0116 for different platforms.
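
The zones configured on a particular system can be observed from user
space; a minimal sketch that prints the per-node zone headers from
``/proc/zoneinfo``:

.. code-block:: c

   #include <stdio.h>
   #include <string.h>

   int main(void)
   {
           FILE *f = fopen("/proc/zoneinfo", "r");
           char line[256];

           if (!f) {
                   perror("fopen");
                   return 1;
           }

           /* Lines like "Node 0, zone   Normal" name each zone per node. */
           while (fgets(line, sizeof(line), f))
                   if (!strncmp(line, "Node", 4))
                           printf("%s", line);

           fclose(f);
           return 0;
   }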
0117
0118 Nodes
0119 =====
0120
0121 Many multi-processor machines are NUMA - Non-Uniform Memory Access -
0122 systems. In such systems the memory is arranged into banks that have
0123 different access latency depending on the "distance" from the
0124 processor. Each bank is referred to as a `node` and for each node Linux
0125 constructs an independent memory management subsystem. A node has its
0126 own set of zones, lists of free and used pages and various statistics
0127 counters. You can find more details about NUMA in
0128 :ref:`Documentation/mm/numa.rst <numa>` and in
0129 :ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`.
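
As an illustration, the sketch below uses the libnuma (numa(3))
library to allocate memory on a specific node; the choice of node 0 is
arbitrary and the program must be linked with ``-lnuma``:

.. code-block:: c

   #include <numa.h>
   #include <stdio.h>
   #include <string.h>

   int main(void)
   {
           size_t size = 1 << 20;
           void *p;

           if (numa_available() < 0) {
                   fprintf(stderr, "NUMA is not available\n");
                   return 1;
           }

           printf("highest node id: %d\n", numa_max_node());

           /* Ask for memory residing on node 0. */
           p = numa_alloc_onnode(size, 0);
           if (!p) {
                   fprintf(stderr, "allocation failed\n");
                   return 1;
           }

           memset(p, 0, size);  /* touch the pages so they get allocated */
           numa_free(p, size);
           return 0;
   }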
0130
0131 Page cache
0132 ==========
0133
0134 The physical memory is volatile and the common case for getting data
0135 into the memory is to read it from files. Whenever a file is read, the
data is put into the `page cache` to avoid expensive disk access on
subsequent reads. Similarly, when one writes to a file, the data
0138 is placed in the page cache and eventually gets into the backing
0139 storage device. The written pages are marked as `dirty` and when Linux
0140 decides to reuse them for other purposes, it makes sure to synchronize
0141 the file contents on the device with the updated data.
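
As an illustration of the write path described above, the sketch below
(the file path is arbitrary) dirties pages in the page cache with
write(2) and then forces writeback with fsync(2):

.. code-block:: c

   #include <fcntl.h>
   #include <stdio.h>
   #include <stdlib.h>
   #include <unistd.h>

   int main(void)
   {
           const char buf[] = "hello, page cache\n";
           int fd = open("/tmp/pagecache-example",
                         O_CREAT | O_WRONLY | O_TRUNC, 0600);

           if (fd < 0) {
                   perror("open");
                   return EXIT_FAILURE;
           }

           /* write() copies the data into the page cache and marks the
            * pages dirty; it may return before anything reaches the disk. */
           if (write(fd, buf, sizeof(buf) - 1) < 0)
                   perror("write");

           /* fsync() forces writeback of the dirty pages to the device. */
           if (fsync(fd))
                   perror("fsync");

           close(fd);
           return 0;
   }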
0142
0143 Anonymous Memory
0144 ================
0145
The `anonymous memory` or `anonymous mappings` represent memory that
is not backed by a filesystem. Such mappings are implicitly created
for the program's stack and heap, or by explicit calls to the mmap(2)
system call. Usually, the anonymous mappings only define virtual
memory areas that the program is allowed to access. Read accesses
result in the creation of a page table entry that references a special
physical page filled with zeroes. When the program performs a write, a
regular physical page will be allocated to hold the written data. The
page will be marked dirty and if the kernel decides to repurpose it,
the dirty page will be swapped out.
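
A minimal sketch of this behaviour with an explicit anonymous mapping
(the mapping size is arbitrary):

.. code-block:: c

   #include <stdio.h>
   #include <stdlib.h>
   #include <sys/mman.h>

   int main(void)
   {
           size_t len = 1 << 20;

           /* This only creates a virtual memory area; no physical pages
            * are allocated yet. */
           char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

           if (p == MAP_FAILED) {
                   perror("mmap");
                   return EXIT_FAILURE;
           }

           /* A read fault maps the shared zero page... */
           char c = p[0];
           (void)c;

           /* ...and the first write faults in a real physical page. */
           p[0] = 1;

           munmap(p, len);
           return 0;
   }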
0156
0157 Reclaim
0158 =======
0159
Throughout the system lifetime, a physical page can be used for
storing different types of data. It can be kernel internal data
structures, DMA'able buffers for device driver use, data read from a
filesystem, memory allocated by user space processes, etc.
0164
Depending on the page usage it is treated differently by the Linux
memory management. The pages that can be freed at any time, either
because they cache data available elsewhere (for instance, on a hard
disk) or because they can be swapped out (again, to the hard disk),
are called `reclaimable`. The most notable categories of reclaimable
pages are the page cache and anonymous memory.
0171
0172 In most cases, the pages holding internal kernel data and used as DMA
0173 buffers cannot be repurposed, and they remain pinned until freed by
0174 their user. Such pages are called `unreclaimable`. However, in certain
0175 circumstances, even pages occupied with kernel data structures can be
0176 reclaimed. For instance, in-memory caches of filesystem metadata can
be re-read from the storage device and therefore it is possible to
discard them from the main memory when the system is under memory
pressure.
0180
The process of freeing the reclaimable physical memory pages and
repurposing them is called (surprise!) `reclaim`. Linux can reclaim
pages either asynchronously or synchronously, depending on the state
of the system. When the system is not loaded, most of the memory is
free and allocation requests will be satisfied immediately from the
supply of free pages. As the load increases, the amount of free pages
goes down and when it reaches a certain threshold (the low watermark),
an allocation request will awaken the ``kswapd`` daemon. It will
asynchronously scan memory pages and either just free them, if the
data they contain is available elsewhere, or evict them to the backing
storage device (remember those dirty pages?). As memory usage
increases even more and reaches another threshold - the min
watermark - an allocation will trigger `direct reclaim`. In this case
the allocation is stalled until enough memory pages are reclaimed to
satisfy the request.
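
Reclaim activity can be observed from user space through
``/proc/vmstat``; a sketch that prints the ``pgscan_*`` and
``pgsteal_*`` counters, assuming the running kernel exposes them:

.. code-block:: c

   #include <stdio.h>
   #include <string.h>

   int main(void)
   {
           FILE *f = fopen("/proc/vmstat", "r");
           char line[128];

           if (!f) {
                   perror("fopen");
                   return 1;
           }

           /* Counters prefixed with pgscan/pgsteal reflect pages scanned
            * and reclaimed by kswapd and by direct reclaim. */
           while (fgets(line, sizeof(line), f))
                   if (!strncmp(line, "pgscan", 6) ||
                       !strncmp(line, "pgsteal", 7))
                           printf("%s", line);

           fclose(f);
           return 0;
   }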
0195
0196 Compaction
0197 ==========
0198
As the system runs, tasks allocate and free memory and it becomes
fragmented. Although with virtual memory it is possible to present
scattered physical pages as a virtually contiguous range, sometimes it
is necessary to allocate large physically contiguous memory
areas. Such a need may arise, for instance, when a device driver
requires a large buffer for DMA, or when THP allocates a huge
page. Memory `compaction` addresses the fragmentation issue. This
mechanism moves occupied pages from the lower part of a memory zone to
free pages in the upper part of the zone. When a compaction scan is
finished, free pages are grouped together at the beginning of the zone
and allocations of large physically contiguous areas become possible.
0210
Like reclaim, compaction may happen asynchronously in the
``kcompactd`` daemon or synchronously as a result of a memory
allocation request.
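
Compaction can also be requested manually. The sketch below writes to
``/proc/sys/vm/compact_memory``, which exists when the kernel is built
with CONFIG_COMPACTION and requires root privileges:

.. code-block:: c

   #include <stdio.h>

   int main(void)
   {
           /* Writing to this file asks the kernel to compact all zones. */
           FILE *f = fopen("/proc/sys/vm/compact_memory", "w");

           if (!f) {
                   perror("fopen");
                   return 1;
           }

           fputs("1\n", f);
           fclose(f);
           return 0;
   }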
0213
0214 OOM killer
0215 ==========
0216
0217 It is possible that on a loaded machine memory will be exhausted and the
0218 kernel will be unable to reclaim enough memory to continue to operate. In
0219 order to save the rest of the system, it invokes the `OOM killer`.
0220
The `OOM killer` selects a task to sacrifice for the sake of the
overall system health. The selected task is killed in the hope that
after it exits enough memory will be freed to continue normal
operation.
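
User space can bias the selection through
``/proc/<pid>/oom_score_adj``. A sketch that marks the current process
as a preferred victim (the value 500 is arbitrary; -1000 exempts a
task and 1000 makes it the most likely choice):

.. code-block:: c

   #include <stdio.h>

   int main(void)
   {
           /* Raise this process's badness score so the OOM killer
            * prefers it over others. */
           FILE *f = fopen("/proc/self/oom_score_adj", "w");

           if (!f) {
                   perror("fopen");
                   return 1;
           }

           fputs("500\n", f);
           fclose(f);

           /* ... the rest of the program runs with the adjusted score. */
           return 0;
   }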