.. SPDX-License-Identifier: GPL-2.0

=========================================
A vmemmap diet for HugeTLB and Device DAX
=========================================

HugeTLB
=======

This section explains how HugeTLB Vmemmap Optimization (HVO) works.

The ``struct page`` structures are used to describe a physical page frame. By
default, there is a one-to-one mapping from a page frame to its corresponding
``struct page``.

HugeTLB pages consist of multiple base-page-sized pages and are supported by
many architectures. See Documentation/admin-guide/mm/hugetlbpage.rst for more
details. On the x86-64 architecture, HugeTLB pages of size 2MB and 1GB are
currently supported. Since the base page size on x86 is 4KB, a 2MB HugeTLB page
consists of 512 base pages and a 1GB HugeTLB page consists of 262144 base pages.
For each base page, there is a corresponding ``struct page``.

Within the HugeTLB subsystem, only the first 4 ``struct page`` are used to
contain unique information about a HugeTLB page. ``__NR_USED_SUBPAGE`` provides
this upper limit. The only 'useful' information in the remaining ``struct page``
is the compound_head field, and this field is the same for all tail pages.

By removing redundant ``struct page`` for HugeTLB pages, memory can be returned
to the buddy allocator for other uses.

Different architectures support different HugeTLB page sizes. For example, the
following table lists the HugeTLB page sizes supported by the x86 and arm64
architectures. Because arm64 supports 4k, 16k, and 64k base pages as well as
contiguous entries, it supports many HugeTLB page sizes.

+--------------+-----------+-----------------------------------------------+
| Architecture | Page Size |                HugeTLB Page Size              |
+--------------+-----------+-----------+-----------+-----------+-----------+
|    x86-64    |    4KB    |    2MB    |    1GB    |           |           |
+--------------+-----------+-----------+-----------+-----------+-----------+
|              |    4KB    |   64KB    |    2MB    |    32MB   |    1GB    |
|              +-----------+-----------+-----------+-----------+-----------+
|    arm64     |   16KB    |    2MB    |   32MB    |     1GB   |           |
|              +-----------+-----------+-----------+-----------+-----------+
|              |   64KB    |    2MB    |  512MB    |    16GB   |           |
+--------------+-----------+-----------+-----------+-----------+-----------+

When the system boots, every HugeTLB page has more than one ``struct page``
struct, whose total size is (unit: pages)::

   struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE

where HugeTLB_Size is the size of the HugeTLB page. The size of a HugeTLB page
is always n times PAGE_SIZE, which gives the following relationship::

   HugeTLB_Size = n * PAGE_SIZE

Then::

   struct_size = n * PAGE_SIZE / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
               = n * sizeof(struct page) / PAGE_SIZE

We can use huge mapping at the pud/pmd level for the HugeTLB page.

For a HugeTLB page mapped at the pmd level::

   struct_size = n * sizeof(struct page) / PAGE_SIZE
               = PAGE_SIZE / sizeof(pte_t) * sizeof(struct page) / PAGE_SIZE
               = sizeof(struct page) / sizeof(pte_t)
               = 64 / 8
               = 8 (pages)

where n is the number of pte entries that one page can contain, so the value of
n is (PAGE_SIZE / sizeof(pte_t)).

This optimization only supports 64-bit systems, so the value of sizeof(pte_t)
is 8. It is also applicable only when the size of ``struct page`` is a power of
two. In most cases, the size of ``struct page`` is 64 bytes (e.g. x86-64 and
arm64). So if we use pmd level mapping for a HugeTLB page, its ``struct page``
structs occupy 8 page frames, whose size depends on the size of the base page.

For a HugeTLB page mapped at the pud level::

   struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd)
               = PAGE_SIZE / 8 * 8 (pages)
               = PAGE_SIZE (pages)

where struct_size(pmd) is the size of the ``struct page`` structs of a HugeTLB
page mapped at the pmd level.

E.g.: on x86_64, the ``struct page`` structs of a 2MB HugeTLB page occupy 8 page
frames, while those of a 1GB HugeTLB page occupy 4096.

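The arithmetic above can be checked with a small user-space sketch. It is not
kernel code: ``vmemmap_pages()`` and ``STRUCT_PAGE_SIZE`` are hypothetical
names, and the 64-byte ``struct page`` size is the assumption stated in the
text.

.. code-block:: c

        #include <assert.h>

        /* Assumed sizeof(struct page), as on x86-64 and arm64. */
        #define STRUCT_PAGE_SIZE 64ULL

        /*
         * Hypothetical helper: number of vmemmap page frames holding the
         * struct page array of one HugeTLB page ("struct_size" above).
         * Integer division rounds down, so the result is 0 when the array
         * is smaller than one page.
         */
        static unsigned long long vmemmap_pages(unsigned long long hugetlb_size,
                                                unsigned long long page_size)
        {
                unsigned long long nr_struct_pages;

                assert(page_size != 0 && hugetlb_size % page_size == 0);

                nr_struct_pages = hugetlb_size / page_size;
                return nr_struct_pages * STRUCT_PAGE_SIZE / page_size;
        }

Under these assumptions, ``vmemmap_pages(2ULL << 20, 4096)`` evaluates to 8 and
``vmemmap_pages(1ULL << 30, 4096)`` to 4096, matching the pmd and pud cases.
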
Next, we take the pmd level mapping of the HugeTLB page as an example to
show the internal implementation of this optimization. There are 8 pages of
``struct page`` structs associated with a HugeTLB page which is pmd mapped.

Here is how things look before optimization::

    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
 +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
 |           |                     |     0     | -------------> |     0     |
 |           |                     +-----------+                +-----------+
 |           |                     |     1     | -------------> |     1     |
 |           |                     +-----------+                +-----------+
 |           |                     |     2     | -------------> |     2     |
 |           |                     +-----------+                +-----------+
 |           |                     |     3     | -------------> |     3     |
 |           |                     +-----------+                +-----------+
 |           |                     |     4     | -------------> |     4     |
 |    PMD    |                     +-----------+                +-----------+
 |   level   |                     |     5     | -------------> |     5     |
 |  mapping  |                     +-----------+                +-----------+
 |           |                     |     6     | -------------> |     6     |
 |           |                     +-----------+                +-----------+
 |           |                     |     7     | -------------> |     7     |
 |           |                     +-----------+                +-----------+
 |           |
 |           |
 |           |
 +-----------+

The value of page->compound_head is the same for all tail pages. The first
page of ``struct page`` (page 0) associated with the HugeTLB page contains the 4
``struct page`` necessary to describe the HugeTLB page. The only use of the
remaining pages of ``struct page`` (page 1 to page 7) is to point to
page->compound_head. Therefore, we can remap pages 1 to 7 to page 0. Only 1
page of ``struct page`` will be used for each HugeTLB page. This will allow us
to free the remaining 7 pages to the buddy allocator.

Here is how things look after remapping::

    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
 +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
 |           |                     |     0     | -------------> |     0     |
 |           |                     +-----------+                +-----------+
 |           |                     |     1     | ---------------^ ^ ^ ^ ^ ^ ^
 |           |                     +-----------+                  | | | | | |
 |           |                     |     2     | -----------------+ | | | | |
 |           |                     +-----------+                    | | | | |
 |           |                     |     3     | -------------------+ | | | |
 |           |                     +-----------+                      | | | |
 |           |                     |     4     | ---------------------+ | | |
 |    PMD    |                     +-----------+                        | | |
 |   level   |                     |     5     | -----------------------+ | |
 |  mapping  |                     +-----------+                          | |
 |           |                     |     6     | -------------------------+ |
 |           |                     +-----------+                            |
 |           |                     |     7     | ---------------------------+
 |           |                     +-----------+
 |           |
 |           |
 |           |
 +-----------+

When a HugeTLB page is freed to the buddy system, we should allocate 7 pages
for vmemmap pages and restore the previous mapping relationship.

The HugeTLB page of the pud level mapping is handled similarly. We can use the
same approach to free (PAGE_SIZE - 1) vmemmap pages.

Apart from the HugeTLB page of the pmd/pud level mapping, some architectures
(e.g. aarch64) provide a contiguous bit in the translation table entries
that hints to the MMU that the entry is one of a contiguous set of
entries that can be cached in a single TLB entry.

The contiguous bit is used to increase the mapping size at the pmd and pte
(last) level. This type of HugeTLB page can be optimized only when the total
size of its ``struct page`` structs is greater than **1** page.

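The rule above, together with the number of vmemmap pages that can be freed,
can be expressed as a short user-space sketch. The helper names
(``vmemmap_bytes()``, ``hvo_optimizable()``, ``hvo_freed_pages()``) are
hypothetical and the 64-byte ``struct page`` size is an assumption; the kernel
implements this differently.

.. code-block:: c

        #include <assert.h>
        #include <stdbool.h>

        /* Assumed sizeof(struct page), as on x86-64 and arm64. */
        #define STRUCT_PAGE_SIZE 64ULL

        /* Bytes of struct page metadata describing one HugeTLB page. */
        static unsigned long long vmemmap_bytes(unsigned long long hugetlb_size,
                                                unsigned long long page_size)
        {
                assert(page_size != 0 && hugetlb_size % page_size == 0);
                return hugetlb_size / page_size * STRUCT_PAGE_SIZE;
        }

        /* Optimizable only if the struct pages span more than one page. */
        static bool hvo_optimizable(unsigned long long hugetlb_size,
                                    unsigned long long page_size)
        {
                return vmemmap_bytes(hugetlb_size, page_size) > page_size;
        }

        /* Every vmemmap page except the head vmemmap page can be freed. */
        static unsigned long long hvo_freed_pages(unsigned long long hugetlb_size,
                                                  unsigned long long page_size)
        {
                if (!hvo_optimizable(hugetlb_size, page_size))
                        return 0;
                return vmemmap_bytes(hugetlb_size, page_size) / page_size - 1;
        }

For a 4KB base page this gives 7 freed pages for a 2MB HugeTLB page, 4095 for a
1GB one, and none for a 64KB contiguous-PTE page on arm64, whose 16
``struct page`` structs fill only a quarter of a page.
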
Notice: The head vmemmap page is not freed to the buddy allocator and all
tail vmemmap pages are mapped to the head vmemmap page frame. So we can see
more than one ``struct page`` struct with ``PG_head`` (e.g. 8 per 2 MB HugeTLB
page) associated with each HugeTLB page. ``compound_head()`` can handle
this correctly. There is only **one** head ``struct page``; the tail
``struct page`` structs with ``PG_head`` are fake head ``struct page`` structs.
We need an approach to distinguish between those two types of ``struct page``
so that ``compound_head()`` can return the real head ``struct page`` when the
parameter is a tail ``struct page`` with ``PG_head`` set. The following code
snippet describes how to distinguish between real and fake head ``struct page``.

.. code-block:: c

        if (test_bit(PG_head, &page->flags)) {
                unsigned long head = READ_ONCE(page[1].compound_head);

                /* Bit 0 set: page[1] is a tail page encoding its head. */
                if (head & 1) {
                        if (head == (unsigned long)page + 1) {
                                /* head struct page */
                        } else {
                                /* tail struct page */
                        }
                } else {
                        /* head struct page */
                }
        }

We can safely access the fields of **page[1]** when ``PG_head`` is set because
the page is a compound page composed of at least two contiguous pages.
See ``page_fixed_fake_head()`` for the implementation.

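To see why the comparison against ``page + 1`` identifies the real head, the
check can be exercised in a user-space model. This is only a sketch under
stated assumptions: ``struct model_page``, ``model_init()`` and
``is_real_head()`` are hypothetical, the ``PG_head`` bit position is made up,
and the aliasing of vmemmap pages is imitated by copying page 0 rather than by
remapping.

.. code-block:: c

        #include <assert.h>
        #include <stdbool.h>
        #include <stdint.h>
        #include <string.h>

        #define MODEL_PAGE_SIZE  4096
        #define NR_VMEMMAP_PAGES 8
        #define PG_HEAD_BIT      ((uintptr_t)1)  /* hypothetical bit position */

        /* Simplified stand-in for struct page, padded to 64 bytes. */
        struct model_page {
                uintptr_t flags;
                uintptr_t compound_head;        /* bit 0 set => tail page */
                unsigned char pad[64 - 2 * sizeof(uintptr_t)];
        };

        #define PAGES_PER_VMEMMAP_PAGE (MODEL_PAGE_SIZE / sizeof(struct model_page))
        #define NR_PAGES (NR_VMEMMAP_PAGES * PAGES_PER_VMEMMAP_PAGE)

        static struct model_page vmemmap[NR_PAGES];

        /*
         * Describe one 2MB compound page, then imitate the remapping by
         * copying vmemmap page 0 over vmemmap pages 1 to 7.
         */
        static void model_init(void)
        {
                size_t i;

                memset(vmemmap, 0, sizeof(vmemmap));
                vmemmap[0].flags = PG_HEAD_BIT;
                for (i = 1; i < NR_PAGES; i++)
                        vmemmap[i].compound_head = (uintptr_t)&vmemmap[0] + 1;
                for (i = 1; i < NR_VMEMMAP_PAGES; i++)
                        memcpy(&vmemmap[i * PAGES_PER_VMEMMAP_PAGE], &vmemmap[0],
                               MODEL_PAGE_SIZE);
        }

        /* Same test as the snippet above; true only for the real head. */
        static bool is_real_head(const struct model_page *page)
        {
                uintptr_t head;

                if (!(page->flags & PG_HEAD_BIT))
                        return false;
                head = page[1].compound_head;
                if (head & 1)
                        return head == (uintptr_t)page + 1;
                return true;
        }

After ``model_init()``, the first entry of every vmemmap page carries
``PG_HEAD_BIT``, but ``is_real_head()`` accepts only ``&vmemmap[0]``: for a fake
head, ``page[1].compound_head`` still encodes the address of the real head, so
it cannot equal ``page + 1``.
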
Device DAX
==========

The device-dax interface uses the same tail deduplication technique explained
in the previous section, except when used with the vmemmap in
the device (altmap).

The following page sizes are supported in DAX: PAGE_SIZE (4K on x86_64),
PMD_SIZE (2M on x86_64) and PUD_SIZE (1G on x86_64).

The differences with HugeTLB are relatively minor.

It only uses 3 ``struct page`` for storing all information as opposed
to 4 on HugeTLB pages.

There's no remapping of vmemmap given that device-dax memory is not part of
System RAM ranges initialized at boot. Thus the tail page deduplication
happens at a later stage when we populate the sections. HugeTLB reuses the
head vmemmap page, whereas device-dax reuses the tail
vmemmap page. This results in only half of the savings compared to HugeTLB.

Deduplicated tail pages are not mapped read-only.

Here's how things look on device-dax after the sections are populated::

 +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
 |           |                     |     0     | -------------> |     0     |
 |           |                     +-----------+                +-----------+
 |           |                     |     1     | -------------> |     1     |
 |           |                     +-----------+                +-----------+
 |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
 |           |                     +-----------+                   | | | | |
 |           |                     |     3     | ------------------+ | | | |
 |           |                     +-----------+                     | | | |
 |           |                     |     4     | --------------------+ | | |
 |    PMD    |                     +-----------+                       | | |
 |   level   |                     |     5     | ----------------------+ | |
 |  mapping  |                     +-----------+                         | |
 |           |                     |     6     | ------------------------+ |
 |           |                     +-----------+                           |
 |           |                     |     7     | --------------------------+
 |           |                     +-----------+
 |           |
 |           |
 |           |
 +-----------+
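
The two layouts shown in the diagrams differ only in which vmemmap page frame
is shared. A minimal sketch of that difference follows; the names
``build_vmemmap_remap()`` and ``frames_kept()`` are hypothetical and the code
only models the diagrams, not the kernel implementation.

.. code-block:: c

        #include <assert.h>
        #include <stddef.h>

        /*
         * Fill frame[i] with the page frame backing vmemmap page i.
         * Pages 0..reused keep their own frame; later pages share frame
         * "reused". Per the diagrams, reused = 0 models HugeTLB and
         * reused = 1 models device-dax.
         */
        static void build_vmemmap_remap(size_t *frame, size_t nr_pages,
                                        size_t reused)
        {
                size_t i;

                assert(reused < nr_pages);
                for (i = 0; i < nr_pages; i++)
                        frame[i] = i <= reused ? i : reused;
        }

        /* Distinct page frames still backing the vmemmap afterwards. */
        static size_t frames_kept(size_t reused)
        {
                return reused + 1;
        }

With 8 vmemmap pages, ``reused = 1`` leaves pages 2 to 7 pointing at frame 1 and
keeps two frames, whereas ``reused = 0`` points pages 1 to 7 at frame 0 and
keeps one.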