.. _hugetlbfs_reserve:

=====================
Hugetlbfs Reservation
=====================

Overview
========

Huge pages as described at :ref:`hugetlbpage` are typically
preallocated for application use. These huge pages are instantiated in a
task's address space at page fault time if the VMA indicates huge pages are
to be used. If no huge page exists at page fault time, the task is sent
a SIGBUS and often dies an unhappy death. Shortly after huge page support
was added, it was determined that it would be better to detect a shortage
of huge pages at mmap() time. The idea is that if there were not enough
huge pages to cover the mapping, the mmap() would fail. This was first
done with a simple check in the code at mmap() time to determine if there
were enough free huge pages to cover the mapping. Like most things in the
kernel, the code has evolved over time. However, the basic idea was to
'reserve' huge pages at mmap() time to ensure that huge pages would be
available for page faults in that mapping. The description below attempts to
describe how huge page reserve processing is done in the v4.10 kernel.


Audience
========
This description is primarily targeted at kernel developers who are modifying
hugetlbfs code.


The Data Structures
===================

resv_huge_pages
    This is a global (per-hstate) count of reserved huge pages. Reserved
    huge pages are only available to the task which reserved them.
    Therefore, the number of huge pages generally available is computed
    as (``free_huge_pages - resv_huge_pages``).
Reserve Map
    A reserve map is described by the structure::

        struct resv_map {
            struct kref refs;
            spinlock_t lock;
            struct list_head regions;
            long adds_in_progress;
            struct list_head region_cache;
            long region_cache_count;
        };

    There is one reserve map for each huge page mapping in the system.
    The regions list within the resv_map describes the regions within
    the mapping. A region is described as::

        struct file_region {
            struct list_head link;
            long from;
            long to;
        };

    The 'from' and 'to' fields of the file region structure are huge page
    indices into the mapping. Depending on the type of mapping, a
    region in the resv_map may indicate reservations exist for the
    range, or reservations do not exist.
Flags for MAP_PRIVATE Reservations
    These are stored in the bottom bits of the reservation map pointer.

    ``#define HPAGE_RESV_OWNER    (1UL << 0)``
        Indicates this task is the owner of the reservations
        associated with the mapping.
    ``#define HPAGE_RESV_UNMAPPED (1UL << 1)``
        Indicates task originally mapping this range (and creating
        reserves) has unmapped a page from this task (the child)
        due to a failed COW.
Page Flags
    The PagePrivate page flag is used to indicate that a huge page
    reservation must be restored when the huge page is freed. More
    details will be discussed in the "Freeing Huge Pages" section.
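
Storing flags in the bottom bits of a pointer works because the pointed-to
structure is aligned, so its low address bits are always zero. The following is
a minimal user-space sketch of that technique, not the kernel code: the
``toy_`` structure and helper names are hypothetical, and only the two flag
values come from the text above.

```c
#include <assert.h>
#include <stdint.h>

#define HPAGE_RESV_OWNER    (1UL << 0)
#define HPAGE_RESV_UNMAPPED (1UL << 1)
#define HPAGE_RESV_MASK     (HPAGE_RESV_OWNER | HPAGE_RESV_UNMAPPED)

struct resv_map;    /* opaque here; only its address matters */

/* Hypothetical stand-in for the VMA field that holds the tagged pointer. */
struct toy_vma {
    uintptr_t private_data;
};

/* Store the map pointer while preserving any flag bits already set. */
static void toy_set_resv_map(struct toy_vma *vma, struct resv_map *map)
{
    /* The two low bits must be free, i.e. the map is 4-byte aligned. */
    assert(((uintptr_t)map & HPAGE_RESV_MASK) == 0);
    vma->private_data = (vma->private_data & HPAGE_RESV_MASK) |
                        (uintptr_t)map;
}

/* Recover the plain pointer by masking the flag bits off. */
static struct resv_map *toy_resv_map(const struct toy_vma *vma)
{
    return (struct resv_map *)(vma->private_data &
                               ~(uintptr_t)HPAGE_RESV_MASK);
}

static void toy_set_resv_flags(struct toy_vma *vma, unsigned long flags)
{
    vma->private_data |= flags & HPAGE_RESV_MASK;
}

static int toy_is_resv_set(const struct toy_vma *vma, unsigned long flag)
{
    return (vma->private_data & flag) != 0;
}
```

Setting HPAGE_RESV_OWNER with ``toy_set_resv_flags()`` leaves the value
returned by ``toy_resv_map()`` unchanged, which is the property the flag
encoding relies on.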


Reservation Map Location (Private or Shared)
============================================

A huge page mapping or segment is either private or shared. If private,
it is typically only available to a single address space (task). If shared,
it can be mapped into multiple address spaces (tasks). The location and
semantics of the reservation map are significantly different for the two types
of mappings. Location differences are:

- For private mappings, the reservation map hangs off the VMA structure.
  Specifically, vma->vm_private_data. This reserve map is created at the
  time the mapping (mmap(MAP_PRIVATE)) is created.
- For shared mappings, the reservation map hangs off the inode. Specifically,
  inode->i_mapping->private_data. Since shared mappings are always backed
  by files in the hugetlbfs filesystem, the hugetlbfs code ensures each inode
  contains a reservation map. As a result, the reservation map is allocated
  when the inode is created.


Creating Reservations
=====================
Reservations are created when a huge page backed shared memory segment is
created (shmget(SHM_HUGETLB)) or a mapping is created via mmap(MAP_HUGETLB).
These operations result in a call to the routine hugetlb_reserve_pages()::

    int hugetlb_reserve_pages(struct inode *inode,
                              long from, long to,
                              struct vm_area_struct *vma,
                              vm_flags_t vm_flags)

The first thing hugetlb_reserve_pages() does is check if the NORESERVE
flag was specified in either the shmget() or mmap() call. If NORESERVE
was specified, then this routine returns immediately as no reservations
are desired.

The arguments 'from' and 'to' are huge page indices into the mapping or
underlying file. For shmget(), 'from' is always 0 and 'to' corresponds to
the length of the segment/mapping. For mmap(), the offset argument could
be used to specify the offset into the underlying file. In such a case,
the 'from' and 'to' arguments have been adjusted by this offset.
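
As an illustration of how a byte offset and length translate into the
'from' and 'to' huge page indices, here is a small sketch. It assumes 2 MB
huge pages (other sizes exist), and ``toy_range`` and ``toy_resv_range()``
are hypothetical names used only for this example.

```c
#include <assert.h>

/* Assumption: 2 MB huge pages, i.e. a shift of 21 bits. */
#define TOY_HPAGE_SHIFT 21UL
#define TOY_HPAGE_SIZE  (1UL << TOY_HPAGE_SHIFT)

struct toy_range {
    long from;  /* first huge page index (inclusive) */
    long to;    /* end huge page index (exclusive) */
};

/* offset and length are in bytes and must be huge page aligned. */
static struct toy_range toy_resv_range(unsigned long offset,
                                       unsigned long length)
{
    struct toy_range r;

    assert(offset % TOY_HPAGE_SIZE == 0);
    assert(length % TOY_HPAGE_SIZE == 0);
    r.from = (long)(offset >> TOY_HPAGE_SHIFT);
    r.to = (long)((offset + length) >> TOY_HPAGE_SHIFT);
    return r;
}
```

With these assumptions, a 6 MB mapping at a 4 MB file offset gives
from = 2 and to = 5, so (to - from) = 3 huge pages. A segment with no offset
always starts at from = 0.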

One of the big differences between PRIVATE and SHARED mappings is the way
in which reservations are represented in the reservation map.

- For shared mappings, an entry in the reservation map indicates a reservation
  exists or did exist for the corresponding page. As reservations are
  consumed, the reservation map is not modified.
- For private mappings, the lack of an entry in the reservation map indicates
  a reservation exists for the corresponding page. As reservations are
  consumed, entries are added to the reservation map. Therefore, the
  reservation map can also be used to determine which reservations have
  been consumed.

For private mappings, hugetlb_reserve_pages() creates the reservation map and
hangs it off the VMA structure. In addition, the HPAGE_RESV_OWNER flag is set
to indicate this VMA owns the reservations.

The reservation map is consulted to determine how many huge page reservations
are needed for the current mapping/segment. For private mappings, this is
always the value (to - from). However, for shared mappings it is possible that
some reservations may already exist within the range (to - from). See the
section :ref:`Reservation Map Modifications <resv_map_modifications>`
for details on how this is accomplished.

The mapping may be associated with a subpool. If so, the subpool is consulted
to ensure there is sufficient space for the mapping. It is possible that the
subpool has set aside reservations that can be used for the mapping. See the
section :ref:`Subpool Reservations <sub_pool_resv>` for more details.

After consulting the reservation map and subpool, the number of needed new
reservations is known. The routine hugetlb_acct_memory() is called to check
for and take the requested number of reservations. hugetlb_acct_memory()
calls into routines that potentially allocate and adjust surplus page counts.
However, within those routines the code is simply checking to ensure there
are enough free huge pages to accommodate the reservation. If there are,
the global reservation count resv_huge_pages is adjusted something like the
following::

    if (resv_needed <= (free_huge_pages - resv_huge_pages))
        resv_huge_pages += resv_needed;

Note that the global lock hugetlb_lock is held when checking and adjusting
these counters.
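
The check-then-take step can be modeled in user space as follows. This is only
a sketch of the accounting idea: ``toy_hstate`` and
``toy_take_reservations()`` are hypothetical names, surplus page handling is
ignored, and the model is single-threaded, so it has no counterpart to
hugetlb_lock.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical, reduced view of the per-hstate counters. */
struct toy_hstate {
    long free_huge_pages;
    long resv_huge_pages;
};

/*
 * Try to take resv_needed reservations. Returns 0 on success, or -ENOMEM
 * if too few unreserved free pages remain. The counters are left untouched
 * on failure.
 */
static int toy_take_reservations(struct toy_hstate *h, long resv_needed)
{
    long available = h->free_huge_pages - h->resv_huge_pages;

    assert(resv_needed >= 0);
    if (resv_needed > available)
        return -ENOMEM;
    h->resv_huge_pages += resv_needed;
    return 0;
}
```

With 10 free pages of which 4 are already reserved, a request for 6 succeeds
and raises the reserve count to 10. Any further request then fails, because no
unreserved free page is left.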

If there were enough free huge pages and the global count resv_huge_pages
was adjusted, then the reservation map associated with the mapping is
modified to reflect the reservations. In the case of a shared mapping, a
file_region will exist that includes the range 'from' - 'to'. For private
mappings, no modifications are made to the reservation map as lack of an
entry indicates a reservation exists.

If hugetlb_reserve_pages() was successful, the global reservation count and
reservation map associated with the mapping will be modified as required to
ensure reservations exist for the range 'from' - 'to'.

.. _consume_resv:

Consuming Reservations/Allocating a Huge Page
=============================================

Reservations are consumed when huge pages associated with the reservations
are allocated and instantiated in the corresponding mapping. The allocation
is performed within the routine alloc_huge_page()::

    struct page *alloc_huge_page(struct vm_area_struct *vma,
                                 unsigned long addr, int avoid_reserve)

alloc_huge_page() is passed a VMA pointer and a virtual address, so it can
consult the reservation map to determine if a reservation exists. In addition,
alloc_huge_page() takes the argument avoid_reserve which indicates reserves
should not be used even if it appears they have been set aside for the
specified address. The avoid_reserve argument is most often used in the case
of Copy on Write and Page Migration where additional copies of an existing
page are being allocated.

The helper routine vma_needs_reservation() is called to determine if a
reservation exists for the address within the mapping (vma). See the section
:ref:`Reservation Map Helper Routines <resv_map_helpers>` for detailed
information on what this routine does.
The value returned from vma_needs_reservation() is generally
0 or 1: 0 if a reservation exists for the address, 1 if no reservation exists.
If a reservation does not exist, and there is a subpool associated with the
mapping, the subpool is consulted to determine if it contains reservations.
If the subpool contains reservations, one can be used for this allocation.
However, in every case the avoid_reserve argument overrides the use of
a reservation for the allocation. After determining whether a reservation
exists and can be used for the allocation, the routine dequeue_huge_page_vma()
is called. This routine takes two arguments related to reservations:

- avoid_reserve, this is the same value/argument passed to alloc_huge_page()
- chg, even though this argument is of type long only the values 0 or 1 are
  passed to dequeue_huge_page_vma(). If the value is 0, it indicates a
  reservation exists (see the section "Reservations and Memory Policy" for
  possible issues). If the value is 1, it indicates a reservation does not
  exist and the page must be taken from the global free pool if possible.

The free lists associated with the memory policy of the VMA are searched for
a free page. If a page is found, the value free_huge_pages is decremented
when the page is removed from the free list. If there was a reservation
associated with the page, the following adjustments are made::

    SetPagePrivate(page);   /* Indicates allocating this page consumed
                             * a reservation, and if an error is
                             * encountered such that the page must be
                             * freed, the reservation will be restored. */
    resv_huge_pages--;      /* Decrement the global reservation count */

Note, if no huge page can be found that satisfies the VMA's memory policy
an attempt will be made to allocate one using the buddy allocator. This
brings up the issue of surplus huge pages and overcommit which is beyond
the scope of reservations. Even if a surplus page is allocated, the same
reservation based adjustments as above will be made: SetPagePrivate(page) and
resv_huge_pages--.

After obtaining a new huge page, (page)->private is set to the value of
the subpool associated with the page if it exists. This will be used for
subpool accounting when the page is freed.

The routine vma_commit_reservation() is then called to adjust the reserve
map based on the consumption of the reservation. In general, this involves
ensuring the page is represented within a file_region structure of the reserve
map. For shared mappings where the reservation was present, an entry
in the reserve map already existed so no change is made. However, if there
was no reservation in a shared mapping or this was a private mapping, a new
entry must be created.

It is possible that the reserve map could have been changed between the call
to vma_needs_reservation() at the beginning of alloc_huge_page() and the
call to vma_commit_reservation() after the page was allocated. This would
be possible if hugetlb_reserve_pages() was called for the same page in a shared
mapping. In such cases, the reservation count and subpool free page count
will be off by one. This rare condition can be identified by comparing the
return values from vma_needs_reservation() and vma_commit_reservation(). If
such a race is detected, the subpool and global reserve counts are adjusted to
compensate. See the section
:ref:`Reservation Map Helper Routines <resv_map_helpers>` for more
information on these routines.


Instantiate Huge Pages
======================

After huge page allocation, the page is typically added to the page tables
of the allocating task. Before this, pages in a shared mapping are added
to the page cache and pages in private mappings are added to an anonymous
reverse mapping. In both cases, the PagePrivate flag is cleared. Therefore,
when a huge page that has been instantiated is freed no adjustment is made
to the global reservation count (resv_huge_pages).


Freeing Huge Pages
==================

Huge page freeing is performed by the routine free_huge_page(). This routine
is the destructor for hugetlbfs compound pages. As a result, it is only
passed a pointer to the page struct. When a huge page is freed, reservation
accounting may need to be performed. This would be the case if the page was
associated with a subpool that contained reserves, or the page is being freed
on an error path where a global reserve count must be restored.

The page->private field points to any subpool associated with the page.
If the PagePrivate flag is set, it indicates the global reserve count should
be adjusted (see the section
:ref:`Consuming Reservations/Allocating a Huge Page <consume_resv>`
for information on how these are set).

The routine first calls hugepage_subpool_put_pages() for the page. If this
routine returns a value of 0 (which does not equal the passed value of 1) it
indicates reserves are associated with the subpool, and this newly freed page
must be used to keep the number of subpool reserves above the minimum size.
Therefore, the global resv_huge_pages counter is incremented in this case.

If the PagePrivate flag was set in the page, the global resv_huge_pages counter
will always be incremented.
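
The two conditions that restore a global reservation on free can be summarized
in a short model. This is a sketch under stated assumptions rather than the
kernel routine: ``toy_hstate`` and ``toy_free_huge_page()`` are hypothetical,
the subpool result is passed in as a plain value, and the page is assumed to
go back on the free list (surplus pages are ignored).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced view of the per-hstate counters. */
struct toy_hstate {
    long free_huge_pages;
    long resv_huge_pages;
};

/*
 * page_private:    state of the PagePrivate flag on the page being freed.
 * subpool_put_ret: value returned by the subpool "put" of one page; 0 means
 *                  the subpool kept the page as one of its reserves.
 */
static void toy_free_huge_page(struct toy_hstate *h, bool page_private,
                               long subpool_put_ret)
{
    bool restore_reserve = page_private;

    assert(subpool_put_ret == 0 || subpool_put_ret == 1);
    if (subpool_put_ret == 0)
        restore_reserve = true;
    if (restore_reserve)
        h->resv_huge_pages++;
    h->free_huge_pages++;   /* assumption: page returns to the free list */
}
```

Freeing an instantiated page (flag clear, subpool returns 1) only raises the
free count. Either a set PagePrivate flag or a subpool return of 0 also raises
resv_huge_pages by one.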

.. _sub_pool_resv:

Subpool Reservations
====================

There is a struct hstate associated with each huge page size. The hstate
tracks all huge pages of the specified size. A subpool represents a subset
of pages within a hstate that is associated with a mounted hugetlbfs
filesystem.

When a hugetlbfs filesystem is mounted a min_size option can be specified
which indicates the minimum number of huge pages required by the filesystem.
If this option is specified, the number of huge pages corresponding to
min_size are reserved for use by the filesystem. This number is tracked in
the min_hpages field of a struct hugepage_subpool. At mount time,
hugetlb_acct_memory(min_hpages) is called to reserve the specified number of
huge pages. If they can not be reserved, the mount fails.

The routines hugepage_subpool_get/put_pages() are called when pages are
obtained from or released back to a subpool. They perform all subpool
accounting, and track any reservations associated with the subpool.
hugepage_subpool_get/put_pages() are passed the number of huge pages by which
to adjust the subpool 'used page' count (up for get, down for put). Normally,
they return the same value that was passed or an error if not enough pages
exist in the subpool.

However, if reserves are associated with the subpool a return value less
than the passed value may be returned. This return value indicates the
number of additional global pool adjustments which must be made. For example,
suppose a subpool contains 3 reserved huge pages and someone asks for 5.
The 3 reserved pages associated with the subpool can be used to satisfy part
of the request. But, 2 pages must be obtained from the global pools. To
relay this information to the caller, the value 2 is returned. The caller
is then responsible for attempting to obtain the additional two pages from
the global pools.
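
The "3 reserved, 5 requested" example can be reproduced with a simplified
model of the get side of subpool accounting. This is a sketch, not the kernel
routine: ``toy_subpool`` and ``toy_subpool_get_pages()`` are hypothetical, the
field names are only loosely modeled on struct hugepage_subpool, locking is
omitted, and the use of -1 to mean "no limit set" is an assumption of the
model.

```c
#include <assert.h>
#include <errno.h>

/* Simplified subpool; -1 in max_hpages or min_hpages means "not set". */
struct toy_subpool {
    long max_hpages;    /* maximum pages allowed, or -1 */
    long used_hpages;   /* pages currently charged to the subpool */
    long min_hpages;    /* minimum (reserved) pages, or -1 */
    long rsv_hpages;    /* reserved pages not yet handed out */
};

/*
 * Charge delta pages to the subpool. Returns the number of pages that must
 * still be obtained from the global pools, or -ENOMEM if the maximum size
 * would be exceeded.
 */
static long toy_subpool_get_pages(struct toy_subpool *sp, long delta)
{
    long ret = delta;

    assert(delta >= 0);
    if (sp->max_hpages != -1) {
        if (sp->used_hpages + delta > sp->max_hpages)
            return -ENOMEM;
        sp->used_hpages += delta;
    }
    if (sp->min_hpages != -1 && sp->rsv_hpages > 0) {
        if (delta > sp->rsv_hpages) {
            /* Reserves cover only part of the request. */
            ret = delta - sp->rsv_hpages;
            sp->rsv_hpages = 0;
        } else {
            /* Reserves cover the whole request. */
            ret = 0;
            sp->rsv_hpages -= delta;
        }
    }
    return ret;
}
```

For a subpool with min_hpages = 3 and 3 reserves remaining, a request for 5
returns 2 and leaves no reserves. A subpool with no minimum returns the
requested value unchanged.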


COW and Reservations
====================

Since shared mappings all point to and use the same underlying pages, the
biggest reservation concern for COW is private mappings. In this case,
two tasks can be pointing at the same previously allocated page. One task
attempts to write to the page, so a new page must be allocated so that each
task points to its own page.

When the page was originally allocated, the reservation for that page was
consumed. When an attempt to allocate a new page is made as a result of
COW, it is possible that no huge pages are free and the allocation
will fail.

When the private mapping was originally created, the owner of the mapping
was noted by setting the HPAGE_RESV_OWNER bit in the pointer to the reservation
map of the owner. Since the owner created the mapping, the owner owns all
the reservations associated with the mapping. Therefore, when a write fault
occurs and there is no page available, different action is taken for the owner
and non-owner of the reservation.

In the case where the faulting task is not the owner, the fault will fail and
the task will typically receive a SIGBUS.

If the owner is the faulting task, we want it to succeed since it owned the
original reservation. To accomplish this, the page is unmapped from the
non-owning task. In this way, the only reference is from the owning task.
In addition, the HPAGE_RESV_UNMAPPED bit is set in the reservation map pointer
of the non-owning task. The non-owning task may receive a SIGBUS if it later
faults on a non-present page. But, the original owner of the
mapping/reservation will behave as expected.
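
The owner versus non-owner outcome described above reduces to a small
decision table. The sketch below only restates that table in code; the enum
and ``toy_cow_decide()`` are hypothetical and do not correspond to kernel
identifiers.

```c
#include <stdbool.h>

enum toy_cow_action {
    TOY_COW_COPY,                   /* a new page was available */
    TOY_COW_SIGBUS,                 /* non-owner, no page: fault fails */
    TOY_COW_UNMAP_FROM_NON_OWNER    /* owner, no page: unmap from others */
};

/*
 * page_available: a huge page could be allocated for the copy.
 * is_owner:       the faulting task has HPAGE_RESV_OWNER set.
 */
static enum toy_cow_action toy_cow_decide(bool page_available, bool is_owner)
{
    if (page_available)
        return TOY_COW_COPY;
    return is_owner ? TOY_COW_UNMAP_FROM_NON_OWNER : TOY_COW_SIGBUS;
}
```

In the last case the non-owning task would additionally have
HPAGE_RESV_UNMAPPED set in its reservation map pointer, which this model does
not track.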


.. _resv_map_modifications:

Reservation Map Modifications
=============================

The following low level routines are used to make modifications to a
reservation map. Typically, these routines are not called directly. Rather,
a reservation map helper routine is called which calls one of these low level
routines. These low level routines are fairly well documented in the source
code (mm/hugetlb.c). These routines are::

    long region_chg(struct resv_map *resv, long f, long t);
    long region_add(struct resv_map *resv, long f, long t);
    void region_abort(struct resv_map *resv, long f, long t);
    long region_count(struct resv_map *resv, long f, long t);

Operations on the reservation map typically involve two operations:

1) region_chg() is called to examine the reserve map and determine how
   many pages in the specified range [f, t) are NOT currently represented.

   The calling code performs global checks and allocations to determine if
   there are enough huge pages for the operation to succeed.

2)
  a) If the operation can succeed, region_add() is called to actually modify
     the reservation map for the same range [f, t) previously passed to
     region_chg().
  b) If the operation can not succeed, region_abort() is called for the same
     range [f, t) to abort the operation.

Note that this is a two step process where region_add() and region_abort()
are guaranteed to succeed after a prior call to region_chg() for the same
range. region_chg() is responsible for pre-allocating any data structures
necessary to ensure the subsequent operations (specifically region_add())
will succeed.

As mentioned above, region_chg() determines the number of pages in the range
which are NOT currently represented in the map. This number is returned to
the caller. region_add() returns the number of pages in the range added to
the map. In most cases, the return value of region_add() is the same as the
return value of region_chg(). However, in the case of shared mappings it is
possible for changes to the reservation map to be made between the calls to
region_chg() and region_add(). In this case, the return value of region_add()
will not match the return value of region_chg(). It is likely that in such
cases global counts and subpool accounting will be incorrect and in need of
adjustment. It is the responsibility of the caller to check for this condition
and make the appropriate adjustments.
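
The two step protocol, and the way a racing update makes the two return values
differ, can be illustrated with a toy reserve map. This is a deliberately
simplified sketch: it tracks one flag per huge page index instead of a list of
file_region structures, has no locking or pre-allocation, and all ``toy_``
names and the fixed map size are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAP_PAGES 64

/* Toy reserve map: one flag per huge page index, not a region list. */
struct toy_resv_map {
    bool present[TOY_MAP_PAGES];
    long adds_in_progress;
};

/* Step 1: count pages in [f, t) NOT yet represented; begins an add. */
static long toy_region_chg(struct toy_resv_map *m, long f, long t)
{
    long i, chg = 0;

    assert(0 <= f && f <= t && t <= TOY_MAP_PAGES);
    for (i = f; i < t; i++)
        if (!m->present[i])
            chg++;
    m->adds_in_progress++;
    return chg;
}

/* Step 2a: mark [f, t) as represented; returns the pages actually added. */
static long toy_region_add(struct toy_resv_map *m, long f, long t)
{
    long i, added = 0;

    assert(0 <= f && f <= t && t <= TOY_MAP_PAGES);
    for (i = f; i < t; i++) {
        if (!m->present[i]) {
            m->present[i] = true;
            added++;
        }
    }
    m->adds_in_progress--;
    return added;
}

/* Step 2b: leave the map untouched and end the in-progress add. */
static void toy_region_abort(struct toy_resv_map *m, long f, long t)
{
    (void)f;
    (void)t;
    m->adds_in_progress--;
}
```

If another update marks a page in the range between ``toy_region_chg()`` and
``toy_region_add()``, the second call returns a smaller count than the first.
That difference is what a caller would compare to detect the race.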

The routine region_del() is called to remove regions from a reservation map.
It is typically called in the following situations:

- When a file in the hugetlbfs filesystem is being removed, the inode will
  be released and the reservation map freed. Before freeing the reservation
  map, all the individual file_region structures must be freed. In this case
  region_del() is passed the range [0, LONG_MAX).
- When a hugetlbfs file is being truncated. In this case, all allocated pages
  after the new file size must be freed. In addition, any file_region entries
  in the reservation map past the new end of file must be deleted. In this
  case, region_del() is passed the range [new_end_of_file, LONG_MAX).
- When a hole is being punched in a hugetlbfs file. In this case, huge pages
  are removed from the middle of the file one at a time. As the pages are
  removed, region_del() is called to remove the corresponding entry from the
  reservation map. In this case, region_del() is passed the range
  [page_idx, page_idx + 1).

In every case, region_del() will return the number of pages removed from the
reservation map. In VERY rare cases, region_del() can fail. This can only
happen in the hole punch case where it has to split an existing file_region
entry and can not allocate a new structure. In this error case, region_del()
will return -ENOMEM. The problem here is that the reservation map will
indicate that there is a reservation for the page. However, the subpool and
global reservation counts will not reflect the reservation. To handle this
situation, the routine hugetlb_fix_reserve_counts() is called to adjust the
counters so that they correspond with the reservation map entry that could
not be deleted.

region_count() is called when unmapping a private huge page mapping. In
private mappings, the lack of an entry in the reservation map indicates that
a reservation exists. Therefore, by counting the number of entries in the
reservation map we know how many reservations were consumed and how many are
outstanding (outstanding = (end - start) - region_count(resv, start, end)).
Since the mapping is going away, the subpool and global reservation counts
are decremented by the number of outstanding reservations.
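
A worked instance of the outstanding reservation arithmetic is sketched
below. It assumes no subpool is involved, so the whole amount comes off the
global count; ``toy_counts`` and ``toy_unmap_private()`` are hypothetical
names, and the consumed count is passed in where the kernel would call
region_count().

```c
#include <assert.h>

/* Hypothetical holder for the global reservation count. */
struct toy_counts {
    long resv_huge_pages;
};

/*
 * start, end: huge page indices of the private mapping, [start, end).
 * consumed:   number of entries in the reserve map for that range.
 * Returns the number of outstanding reservations that were released.
 */
static long toy_unmap_private(struct toy_counts *c, long start, long end,
                              long consumed)
{
    long outstanding;

    assert(start <= end);
    assert(0 <= consumed && consumed <= end - start);
    outstanding = (end - start) - consumed;
    c->resv_huge_pages -= outstanding;
    return outstanding;
}
```

For a private mapping of 8 huge pages in which 3 pages were faulted in,
5 reservations are still outstanding and are subtracted from the global count
when the mapping is torn down.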

.. _resv_map_helpers:

Reservation Map Helper Routines
===============================

Several helper routines exist to query and modify the reservation maps.
These routines are only interested in reservations for a specific huge
page, so they just pass in an address instead of a range. In addition,
they pass in the associated VMA. From the VMA, the type of mapping (private
or shared) and the location of the reservation map (inode or VMA) can be
determined. These routines simply call the underlying routines described
in the section "Reservation Map Modifications". However, they do take into
account the 'opposite' meaning of reservation map entries for private and
shared mappings and hide this detail from the caller::

    long vma_needs_reservation(struct hstate *h,
                               struct vm_area_struct *vma,
                               unsigned long addr)

This routine calls region_chg() for the specified page. If no reservation
exists, 1 is returned. If a reservation exists, 0 is returned::

    long vma_commit_reservation(struct hstate *h,
                                struct vm_area_struct *vma,
                                unsigned long addr)

This calls region_add() for the specified page. As in the case of region_chg()
and region_add(), this routine is to be called after a previous call to
vma_needs_reservation(). It will add a reservation entry for the page. It
returns 1 if the reservation was added and 0 if not. The return value should
be compared with the return value of the previous call to
vma_needs_reservation(). An unexpected difference indicates the reservation
map was modified between calls::

    void vma_end_reservation(struct hstate *h,
                             struct vm_area_struct *vma,
                             unsigned long addr)

This calls region_abort() for the specified page. As in the case of
region_chg() and region_abort(), this routine is to be called after a previous
call to vma_needs_reservation(). It will abort/end the in progress reservation
add operation::

    long vma_add_reservation(struct hstate *h,
                             struct vm_area_struct *vma,
                             unsigned long addr)

This is a special wrapper routine to help facilitate reservation cleanup
on error paths. It is only called from the routine restore_reserve_on_error().
This routine is used in conjunction with vma_needs_reservation() in an attempt
to add a reservation to the reservation map. It takes into account the
different reservation map semantics for private and shared mappings. Hence,
region_add() is called for shared mappings (as an entry present in the map
indicates a reservation), and region_del() is called for private mappings (as
the absence of an entry in the map indicates a reservation). See the section
"Reservation Cleanup in Error Paths" for more information on what needs to
be done on error paths.
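
The 'opposite' meaning of map entries that these helpers hide can be written
down as a single function. This is a sketch of the semantics only:
``toy_needs_reservation()`` is a hypothetical name, the presence of a map
entry is passed in directly instead of being looked up, and the private case
assumes the task owns the reservations (HPAGE_RESV_OWNER set).

```c
#include <stdbool.h>

/*
 * is_shared:     true for a shared mapping, false for a private one.
 * entry_present: the reserve map has an entry covering the page.
 *
 * Returns 0 if a reservation exists for the page, 1 if one is needed,
 * matching the 0/1 convention described for vma_needs_reservation().
 */
static long toy_needs_reservation(bool is_shared, bool entry_present)
{
    if (is_shared)
        return entry_present ? 0 : 1;   /* entry means "reserved" */
    return entry_present ? 1 : 0;       /* entry means "consumed" */
}
```

The same entry therefore yields opposite answers for the two mapping types,
which is why callers are expected to go through the helpers rather than
interpret the map themselves.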


Reservation Cleanup in Error Paths
==================================

As mentioned in the section
:ref:`Reservation Map Helper Routines <resv_map_helpers>`, reservation
map modifications are performed in two steps. First vma_needs_reservation()
is called before a page is allocated. If the allocation is successful,
then vma_commit_reservation() is called. If not, vma_end_reservation() is
called. Global and subpool reservation counts are adjusted based on success or
failure of the operation and all is well.

Additionally, after a huge page is instantiated the PagePrivate flag is
cleared so that accounting when the page is ultimately freed is correct.

However, there are several instances where errors are encountered after a huge
page is allocated but before it is instantiated. In this case, the page
allocation has consumed the reservation and made the appropriate subpool,
reservation map and global count adjustments. If the page is freed at this
time (before instantiation and clearing of PagePrivate), then free_huge_page()
will increment the global reservation count. However, the reservation map
indicates the reservation was consumed. This resulting inconsistent state
will cause the 'leak' of a reserved huge page. The global reserve count will
be higher than it should be and prevent allocation of a pre-allocated page.

The routine restore_reserve_on_error() attempts to handle this situation. It
is fairly well documented. The intention of this routine is to restore
the reservation map to the way it was before the page allocation. In this
way, the state of the reservation map will correspond to the global reservation
count after the page is freed.

The routine restore_reserve_on_error() itself may encounter errors while
attempting to restore the reservation map entry. In this case, it will
simply clear the PagePrivate flag of the page. In this way, the global
reserve count will not be incremented when the page is freed. However, the
reservation map will continue to look as though the reservation was consumed.
A page can still be allocated for the address, but it will not use a reserved
page as originally intended.

There is some code (most notably userfaultfd) which can not call
restore_reserve_on_error(). In this case, it simply modifies the PagePrivate
flag so that a reservation will not be leaked when the huge page is freed.


Reservations and Memory Policy
==============================
Per-node huge page lists existed in struct hstate when git was first used
to manage Linux code. The concept of reservations was added some time later.
When reservations were added, no attempt was made to take memory policy
into account. While cpusets are not exactly the same as memory policy, this
comment in hugetlb_acct_memory() sums up the interaction between reservations
and cpusets/memory policy::

    /*
     * When cpuset is configured, it breaks the strict hugetlb page
     * reservation as the accounting is done on a global variable. Such
     * reservation is completely rubbish in the presence of cpuset because
     * the reservation is not checked against page availability for the
     * current cpuset. Application can still potentially OOM'ed by kernel
     * with lack of free htlb page in cpuset that the task is in.
     * Attempt to enforce strict accounting with cpuset is almost
     * impossible (or too ugly) because cpuset is too fluid that
     * task or memory node can be dynamically moved between cpusets.
     *
     * The change of semantics for shared hugetlb mapping with cpuset is
     * undesirable. However, in order to preserve some of the semantics,
     * we fall back to check against current free page availability as
     * a best attempt and hopefully to minimize the impact of changing
     * semantics that cpuset has.
     */

Huge page reservations were added to prevent unexpected page allocation
failures (OOM) at page fault time. However, if an application makes use
of cpusets or memory policy there is no guarantee that huge pages will be
available on the required nodes. This is true even if there are a sufficient
number of global reservations.

Hugetlbfs regression testing
============================

The most complete set of hugetlb tests is in the libhugetlbfs repository.
If you modify any hugetlb related code, use the libhugetlbfs test suite
to check for regressions. In addition, if you add any new hugetlb
functionality, please add appropriate tests to libhugetlbfs.

--
Mike Kravetz, 7 April 2017