==================
HugeTLB Controller
==================

HugeTLB controller can be created by first mounting the cgroup filesystem::

  # mount -t cgroup -o hugetlb none /sys/fs/cgroup

With the above step, the initial or the parent HugeTLB group becomes
visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in
the system. /sys/fs/cgroup/tasks lists the tasks in this cgroup.

New groups can be created under the parent group /sys/fs/cgroup::

  # cd /sys/fs/cgroup
  # mkdir g1
  # echo $$ > g1/tasks

The above steps create a new group g1 and move the current shell
process (bash) into it.
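
The group-creation steps above can be wrapped in a small shell function that
also confirms the move. This is only a sketch: the function name
hugetlb_join_group and the HUGETLB_ROOT variable are made up for illustration,
and the default path assumes the mount point used above.

```shell
# Hypothetical helper: create group $1 under $HUGETLB_ROOT and move the
# current shell into it. HUGETLB_ROOT defaults to the mount point used above.
hugetlb_join_group() {
    root=${HUGETLB_ROOT:-/sys/fs/cgroup}
    group=$1
    mkdir -p "$root/$group" || return 1
    # Writing a PID to the tasks file moves that task into the group.
    echo $$ > "$root/$group/tasks" || return 1
    # Reading tasks back should now list the PID of this shell.
    grep -qx "$$" "$root/$group/tasks"
}
```

Running ``hugetlb_join_group g1`` as root should have the same effect as the
three commands shown above.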

Brief summary of control files::

 hugetlb.<hugepagesize>.rsvd.limit_in_bytes            # set/show limit of "hugepagesize" hugetlb reservations
 hugetlb.<hugepagesize>.rsvd.max_usage_in_bytes        # show max "hugepagesize" hugetlb reservations and no-reserve faults
 hugetlb.<hugepagesize>.rsvd.usage_in_bytes            # show current reservations and no-reserve faults for "hugepagesize" hugetlb
 hugetlb.<hugepagesize>.rsvd.failcnt                   # show the number of allocation failures due to HugeTLB reservation limit
 hugetlb.<hugepagesize>.limit_in_bytes                 # set/show limit of "hugepagesize" hugetlb faults
 hugetlb.<hugepagesize>.max_usage_in_bytes             # show max "hugepagesize" hugetlb usage recorded
 hugetlb.<hugepagesize>.usage_in_bytes                 # show current usage for "hugepagesize" hugetlb
 hugetlb.<hugepagesize>.failcnt                        # show the number of allocation failures due to HugeTLB usage limit
 hugetlb.<hugepagesize>.numa_stat                      # show the numa information of the hugetlb memory charged to this cgroup

For a system supporting three hugepage sizes (64k, 32M and 1G), the control
files include::

  hugetlb.1GB.limit_in_bytes
  hugetlb.1GB.max_usage_in_bytes
  hugetlb.1GB.numa_stat
  hugetlb.1GB.usage_in_bytes
  hugetlb.1GB.failcnt
  hugetlb.1GB.rsvd.limit_in_bytes
  hugetlb.1GB.rsvd.max_usage_in_bytes
  hugetlb.1GB.rsvd.usage_in_bytes
  hugetlb.1GB.rsvd.failcnt
  hugetlb.64KB.limit_in_bytes
  hugetlb.64KB.max_usage_in_bytes
  hugetlb.64KB.numa_stat
  hugetlb.64KB.usage_in_bytes
  hugetlb.64KB.failcnt
  hugetlb.64KB.rsvd.limit_in_bytes
  hugetlb.64KB.rsvd.max_usage_in_bytes
  hugetlb.64KB.rsvd.usage_in_bytes
  hugetlb.64KB.rsvd.failcnt
  hugetlb.32MB.limit_in_bytes
  hugetlb.32MB.max_usage_in_bytes
  hugetlb.32MB.numa_stat
  hugetlb.32MB.usage_in_bytes
  hugetlb.32MB.failcnt
  hugetlb.32MB.rsvd.limit_in_bytes
  hugetlb.32MB.rsvd.max_usage_in_bytes
  hugetlb.32MB.rsvd.usage_in_bytes
  hugetlb.32MB.rsvd.failcnt
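
To see how a hugepage size maps onto the <hugepagesize> part of these file
names, the sketch below converts a size given in kB into a label such as
32MB. The function name hugetlb_size_label is hypothetical, and the rule
(use the largest of GB, MB or KB that the size reaches) is inferred from
the file names listed above, so treat it as an assumption.

```shell
# Hypothetical helper: turn a hugepage size in kB into the label used in
# hugetlb.<hugepagesize>.* file names (for example 32768 -> 32MB).
hugetlb_size_label() {
    kb=$1
    if [ "$kb" -ge 1048576 ]; then
        echo "$((kb / 1048576))GB"
    elif [ "$kb" -ge 1024 ]; then
        echo "$((kb / 1024))MB"
    else
        echo "${kb}KB"
    fi
}

# The three sizes from the example above, expressed in kB.
for kb in 64 32768 1048576; do
    echo "hugetlb.$(hugetlb_size_label "$kb").limit_in_bytes"
done
```

The loop prints the limit_in_bytes file name for the 64KB, 32MB and 1GB sizes.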


1. Page fault accounting

Control files::

  hugetlb.<hugepagesize>.limit_in_bytes
  hugetlb.<hugepagesize>.max_usage_in_bytes
  hugetlb.<hugepagesize>.usage_in_bytes
  hugetlb.<hugepagesize>.failcnt

The HugeTLB controller allows users to limit the HugeTLB usage (page fault) per
control group and enforces the limit during page fault. Since HugeTLB
doesn't support page reclaim, enforcing the limit at page fault time implies
that the application will get a SIGBUS signal if it tries to fault in HugeTLB
pages beyond its limit. Therefore the application needs to know exactly how many
HugeTLB pages it uses beforehand, and the sysadmin needs to make sure that
there are enough available on the machine for all the users to avoid processes
getting SIGBUS.
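
Because the limit file takes a byte count while applications usually think in
pages, a small conversion helps. The sketch below is illustrative only: the
function names hugetlb_limit_pages and hugetlb_fault_stats are hypothetical,
and the caller is assumed to pass the cgroup directory, the size label and the
page size in bytes.

```shell
# Hypothetical helper: cap page-fault usage of one hugepage size at N pages.
#   $1 = cgroup directory, $2 = size label (e.g. 32MB),
#   $3 = page size in bytes, $4 = number of pages
hugetlb_limit_pages() {
    dir=$1
    label=$2
    page_bytes=$3
    pages=$4
    # limit_in_bytes takes a byte count, so convert pages to bytes first.
    echo "$((page_bytes * pages))" > "$dir/hugetlb.$label.limit_in_bytes"
}

# Hypothetical helper: print current usage and the number of failed charges.
hugetlb_fault_stats() {
    dir=$1
    label=$2
    echo "usage=$(cat "$dir/hugetlb.$label.usage_in_bytes")"
    echo "failcnt=$(cat "$dir/hugetlb.$label.failcnt")"
}
```

For example, ``hugetlb_limit_pages /sys/fs/cgroup/g1 32MB 33554432 4`` writes
134217728 (4 pages of 32 MB) to hugetlb.32MB.limit_in_bytes of group g1.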


2. Reservation accounting

Control files::

  hugetlb.<hugepagesize>.rsvd.limit_in_bytes
  hugetlb.<hugepagesize>.rsvd.max_usage_in_bytes
  hugetlb.<hugepagesize>.rsvd.usage_in_bytes
  hugetlb.<hugepagesize>.rsvd.failcnt

The HugeTLB controller allows users to limit the HugeTLB reservations per
control group and enforces the limit at reservation time and at the fault of
HugeTLB memory for which no reservation exists. Since reservation limits are
enforced at reservation time (on mmap or shmget), reservation limits never
cause the application to get a SIGBUS signal if the memory was reserved
beforehand. For MAP_NORESERVE allocations, the reservation limit behaves the
same as the fault limit, enforcing memory usage at fault time and causing the
application to receive a SIGBUS if it crosses its limit.

Reservation limits are superior to the page fault limits described above,
since reservation limits are enforced at reservation time (on mmap or shmget)
and never cause the application to get a SIGBUS signal if the memory was
reserved beforehand. This allows for easier fallback to alternatives such as
non-HugeTLB memory. In the case of page fault accounting, it's very hard to
avoid processes getting SIGBUS, since the sysadmin needs to know precisely the
HugeTLB usage of all the tasks in the system and make sure there are enough
pages to satisfy all requests. Avoiding tasks getting SIGBUS on overcommitted
systems is practically impossible with page fault accounting.
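
A launcher script could use the reservation files to decide up front whether
to try HugeTLB memory or fall back. The sketch below is advisory only: the
function name hugetlb_rsvd_fits is hypothetical, and the check is racy, since
other tasks in the group may reserve memory between the check and the actual
mmap or shmget call, which remains the authoritative test.

```shell
# Hypothetical helper: succeed if a reservation of $3 bytes currently fits
# under the reservation limit of cgroup directory $1 for size label $2.
hugetlb_rsvd_fits() {
    dir=$1
    label=$2
    want=$3
    limit=$(cat "$dir/hugetlb.$label.rsvd.limit_in_bytes") || return 2
    usage=$(cat "$dir/hugetlb.$label.rsvd.usage_in_bytes") || return 2
    # Advisory only: usage can change between this check and the mmap call.
    [ "$((limit - usage))" -ge "$want" ]
}
```

A caller might then write ``if hugetlb_rsvd_fits /sys/fs/cgroup/g1 32MB
33554432; then ...`` and start the application with non-HugeTLB memory in the
else branch.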


3. Caveats with shared memory

For shared HugeTLB memory, both HugeTLB reservation and page faults are charged
to the first task that causes the memory to be reserved or faulted, and all
subsequent uses of this reserved or faulted memory are done without charging.

Shared HugeTLB memory is only uncharged when it is unreserved or deallocated.
This is usually when the HugeTLB file is deleted, and not when the task that
caused the reservation or fault has exited.


4. Caveats with HugeTLB cgroup offline

When a HugeTLB cgroup goes offline with some reservations or faults still
charged to it, the behavior is as follows:

- The fault charges are charged to the parent HugeTLB cgroup (reparented).
- The reservation charges remain on the offline HugeTLB cgroup.

This means that if a HugeTLB cgroup gets offlined while there are still HugeTLB
reservations charged to it, that cgroup persists as a zombie until all HugeTLB
reservations are uncharged. HugeTLB reservations behave in this manner to match
the memory controller, whose cgroups also persist as zombies until all charged
memory is uncharged. Also, the tracking of HugeTLB reservations is a bit more
complex compared to the tracking of HugeTLB faults, so it is significantly
harder to reparent reservations at offline time.
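
One way to avoid leaving such a zombie behind is to look for outstanding
reservation charges before removing a group. The sketch below assumes the
control files described earlier; the function name hugetlb_pending_rsvd is
hypothetical.

```shell
# Hypothetical helper: list hugepage sizes that still have reservations
# charged to cgroup directory $1; prints nothing if all counters are zero.
hugetlb_pending_rsvd() {
    dir=$1
    for f in "$dir"/hugetlb.*.rsvd.usage_in_bytes; do
        # Skip the unexpanded pattern when no such file exists.
        [ -e "$f" ] || continue
        bytes=$(cat "$f")
        if [ "$bytes" -ne 0 ]; then
            echo "${f##*/} $bytes"
        fi
    done
}
```

If ``hugetlb_pending_rsvd /sys/fs/cgroup/g1`` prints nothing, no reservations
are currently charged to g1.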