==========================
Memory Resource Controller
==========================

NOTE:
      This document is hopelessly outdated and it calls for a complete
      rewrite. It still contains useful information, so we are keeping it
      here, but make sure to check the current code if you need a deeper
      understanding.

NOTE:
      The Memory Resource Controller has generically been referred to as the
      memory controller in this document. Do not confuse the memory controller
      used here with the memory controller that is used in hardware.

(For editors) In this document:
      When we mention a cgroup (cgroupfs's directory) with memory controller,
      we call it "memory cgroup". When you look at git-log and source code,
      you'll notice that patch titles and function names tend to use "memcg".
      In this document, we avoid using it.
Benefits and Purpose of the memory controller
=============================================

The memory controller isolates the memory behaviour of a group of tasks
from the rest of the system. The article on LWN [12] mentions some probable
uses of the memory controller. The memory controller can be used to

a. Isolate an application or a group of applications.
   Memory-hungry applications can be isolated and limited to a smaller
   amount of memory.
b. Create a cgroup with a limited amount of memory; this can be used
   as a good alternative to booting with mem=XXXX.
c. Virtualization solutions can control the amount of memory they want
   to assign to a virtual machine instance.
d. A CD/DVD burner could control the amount of memory used by the
   rest of the system to ensure that burning does not fail due to lack
   of available memory.
e. There are several other use cases; find one, or use the controller just
   for fun (to learn and hack on the VM subsystem).

Current Status: linux-2.6.34-mmotm (development version of April 2010)
Features:

 - accounting of anonymous pages, file caches and swap caches, and limiting
   their usage.
 - pages are linked to per-memcg LRU lists exclusively; there is no global LRU.
 - optionally, memory+swap usage can be accounted and limited.
 - hierarchical accounting
 - soft limits
 - moving (recharging) charges when a task migrates is selectable.
 - usage threshold notifier
 - memory pressure notifier
 - oom-killer disable knob and oom-notifier
 - Root cgroup has no limit controls.

 Kernel memory support is a work in progress, and the current version provides
 basic functionality. (See Section 2.7)
Brief summary of control files.

==================================== ==========================================
 tasks                               attach a task(thread) and show list of
                                     threads
 cgroup.procs                        show list of processes
 cgroup.event_control                an interface for event_fd()
                                     This knob is not available on
                                     CONFIG_PREEMPT_RT systems.
 memory.usage_in_bytes               show current usage for memory
                                     (See 5.5 for details)
 memory.memsw.usage_in_bytes         show current usage for memory+Swap
                                     (See 5.5 for details)
 memory.limit_in_bytes               set/show limit of memory usage
 memory.memsw.limit_in_bytes         set/show limit of memory+Swap usage
 memory.failcnt                      show the number of memory usage hits
                                     limits
 memory.memsw.failcnt                show the number of memory+Swap hits
                                     limits
 memory.max_usage_in_bytes           show max memory usage recorded
 memory.memsw.max_usage_in_bytes     show max memory+Swap usage recorded
 memory.soft_limit_in_bytes          set/show soft limit of memory usage
                                     This knob is not available on
                                     CONFIG_PREEMPT_RT systems.
 memory.stat                         show various statistics
 memory.use_hierarchy                set/show hierarchical account enabled
                                     This knob is deprecated and shouldn't be
                                     used.
 memory.force_empty                  trigger forced page reclaim
 memory.pressure_level               set memory pressure notifications
 memory.swappiness                   set/show swappiness parameter of vmscan
                                     (See sysctl's vm.swappiness)
 memory.move_charge_at_immigrate     set/show controls of moving charges
 memory.oom_control                  set/show oom controls.
 memory.numa_stat                    show the number of memory usage per numa
                                     node
 memory.kmem.limit_in_bytes          This knob is deprecated and writing to
                                     it will return -ENOTSUPP.
 memory.kmem.usage_in_bytes          show current kernel memory allocation
 memory.kmem.failcnt                 show the number of kernel memory usage
                                     hits limits
 memory.kmem.max_usage_in_bytes      show max kernel memory usage recorded

 memory.kmem.tcp.limit_in_bytes      set/show hard limit for tcp buf memory
 memory.kmem.tcp.usage_in_bytes      show current tcp buf memory allocation
 memory.kmem.tcp.failcnt             show the number of tcp buf memory usage
                                     hits limits
 memory.kmem.tcp.max_usage_in_bytes  show max tcp buf memory usage recorded
==================================== ==========================================
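
The control files above are ordinary files, so the interaction pattern is plain
shell I/O. As a minimal sketch (a scratch directory stands in for a mounted
memory cgroup here, since the real /sys/fs/cgroup paths require a cgroupfs
mount):

```shell
# Sketch only: a temp directory plays the role of a memory cgroup directory,
# so the echo/cat pattern can be shown without touching cgroupfs.
cg=$(mktemp -d)

# Writing a knob is a plain write; reading it back is a plain read.
echo 4194304 > "$cg/memory.limit_in_bytes"
cat "$cg/memory.limit_in_bytes"            # prints 4194304

# Counter-style files such as failcnt are reset by writing 0.
echo 0 > "$cg/memory.failcnt"
cat "$cg/memory.failcnt"                   # prints 0
```

In a real setup the same commands are issued against the cgroup's directory
under the memory controller mount point.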

1. History
==========

The memory controller has a long history. A request for comments for the memory
controller was posted by Balbir Singh [1]. At the time the RFC was posted
there were several implementations for memory control. The goal of the
RFC was to build consensus and agreement for the minimal features required
for memory control. The first RSS controller was posted by Balbir Singh [2]
in Feb 2007. Pavel Emelianov [3][4][5] has since posted three versions of the
RSS controller. At OLS, at the resource management BoF, everyone suggested
that we handle both page cache and RSS together. Another request was raised
to allow user space handling of OOM. The current memory controller is
at version 6; it combines both mapped (RSS) and unmapped Page
Cache Control [11].

2. Memory Control
=================

Memory is a unique resource in the sense that it is present in a limited
amount. If a task requires a lot of CPU processing, the task can spread
its processing over a period of hours, days, months or years, but with
memory, the same physical memory needs to be reused to accomplish the task.

The memory controller implementation has been divided into phases. These
are:

1. Memory controller
2. mlock(2) controller
3. Kernel user memory accounting and slab control
4. user mappings length controller

The memory controller is the first controller developed.

2.1. Design
-----------

The core of the design is a counter called the page_counter. The
page_counter tracks the current memory usage and limit of the group of
processes associated with the controller. Each cgroup has a memory controller
specific data structure (mem_cgroup) associated with it.

2.2. Accounting
---------------

::

                +--------------------+
                |  mem_cgroup        |
                |  (page_counter)    |
                +--------------------+
                 /            ^      \
                /             |       \
           +---------------+  |        +---------------+
           | mm_struct     |  |....    | mm_struct     |
           |               |  |        |               |
           +---------------+  |        +---------------+
                              |
                              +---------------+
                                              |
           +---------------+           +------+--------+
           | page          +---------->| page_cgroup   |
           |               |           |               |
           +---------------+           +---------------+

             (Figure 1: Hierarchy of Accounting)


Figure 1 shows the important aspects of the controller:

1. Accounting happens per cgroup.
2. Each mm_struct knows which cgroup it belongs to.
3. Each page has a pointer to the page_cgroup, which in turn knows the
   cgroup it belongs to.

The accounting is done as follows: mem_cgroup_charge_common() is invoked to
set up the necessary data structures and check if the cgroup that is being
charged is over its limit. If it is, then reclaim is invoked on the cgroup.
More details can be found in the reclaim section of this document.
If everything goes well, a page metadata structure called page_cgroup is
updated. page_cgroup has its own LRU on cgroup.
(*) The page_cgroup structure is allocated at boot/memory-hotplug time.

2.2.1 Accounting details
------------------------

All mapped anon pages (RSS) and cache pages (Page Cache) are accounted.
Some pages which are never reclaimable and will not be on the LRU
are not accounted. We only account pages under usual VM management.

RSS pages are accounted at page_fault unless they've already been accounted
for earlier. A file page will be accounted for as Page Cache when it's
inserted into the inode (radix-tree). While it's mapped into the page tables of
processes, duplicate accounting is carefully avoided.

An RSS page is unaccounted when it's fully unmapped. A PageCache page is
unaccounted when it's removed from the radix-tree. Even if RSS pages are fully
unmapped (by kswapd), they may exist as SwapCache in the system until they
are really freed. Such SwapCaches are also accounted.
A swapped-in page is accounted after being added to the swapcache.

Note: The kernel does swapin-readahead and reads multiple swap entries at once.
Since a page's memcg is recorded into swap regardless of whether memsw is
enabled, the page will be accounted after swap-in.

At page migration, accounting information is kept.

Note: we only account pages-on-LRU because our purpose is to control the
amount of used pages; not-on-LRU pages tend to be out-of-control from the VM's
point of view.
2.3 Shared Page Accounting
--------------------------

Shared pages are accounted on the basis of the first-touch approach. The
cgroup that first touches a page is accounted for the page. The principle
behind this approach is that a cgroup that aggressively uses a shared
page will eventually get charged for it (once it is uncharged from
the cgroup that brought it in -- this will happen on memory pressure).

But see section 8.2: when moving a task to another cgroup, its pages may
be recharged to the new cgroup, if move_charge_at_immigrate has been chosen.

2.4 Swap Extension
------------------

Swap usage is always recorded for each cgroup. The Swap Extension allows you
to read and limit it.

When CONFIG_SWAP is enabled, the following files are added:

 - memory.memsw.usage_in_bytes
 - memory.memsw.limit_in_bytes

memsw means memory+swap. Usage of memory+swap is limited by
memsw.limit_in_bytes.

Example: Assume a system with 4G of swap. A task which allocates 6G of memory
(by mistake) under a 2G memory limitation will use all swap.
In this case, setting memsw.limit_in_bytes=3G will prevent bad use of swap.
By using the memsw limit, you can avoid a system OOM which can be caused by
swap shortage.
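
A quick back-of-the-envelope check of the example above (the limit values are
the ones from the example; this is plain arithmetic, not a real cgroup):

```shell
# With memory.limit_in_bytes = 2G and memory.memsw.limit_in_bytes = 3G,
# memsw caps memory+swap together, so swap consumption is bounded by the
# difference between the two limits.
mem_limit=$((2 * 1024 * 1024 * 1024))    # 2G memory limit, in bytes
memsw_limit=$((3 * 1024 * 1024 * 1024))  # 3G memory+swap limit, in bytes

max_swap=$((memsw_limit - mem_limit))
echo "swap capped at $((max_swap / 1024 / 1024)) MiB"   # 1024 MiB, not all 4G
```

So the runaway 6G allocation can consume at most 1G of the 4G swap instead of
exhausting it.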

**why 'memory+swap' rather than swap**

The global LRU (kswapd) can swap out arbitrary pages. Swapping a page out
means moving its charge from memory to swap; there is no change in the usage
of memory+swap. In other words, when we want to limit the usage of swap
without affecting the global LRU, a memory+swap limit is better than just
limiting swap, from an OS point of view.

**What happens when a cgroup hits memory.memsw.limit_in_bytes**

When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out
in this cgroup. Then, swap-out will not be done by the cgroup routine and file
caches are dropped instead. But as mentioned above, the global LRU can still
swap out memory from the cgroup for the sanity of the system's memory
management state. You cannot forbid it by cgroup.

2.5 Reclaim
-----------

Each cgroup maintains a per-cgroup LRU which has the same structure as the
global VM's. When a cgroup goes over its limit, we first try
to reclaim memory from the cgroup so as to make space for the new
pages that the cgroup has touched. If the reclaim is unsuccessful,
an OOM routine is invoked to select and kill the bulkiest task in the
cgroup. (See 10. OOM Control below.)

The reclaim algorithm has not been modified for cgroups, except that
pages that are selected for reclaiming come from the per-cgroup LRU
list.

NOTE:
  Reclaim does not work for the root cgroup, since we cannot set any
  limits on the root cgroup.

Note2:
  When panic_on_oom is set to "2", the whole system will panic.

When an oom event notifier is registered, an event will be delivered.
(See the oom_control section.)

2.6 Locking
-----------

Lock order is as follows:

  Page lock (PG_locked bit of page->flags)
    mm->page_table_lock or split pte_lock
      lock_page_memcg (memcg->move_lock)
        mapping->i_pages lock
          lruvec->lru_lock

Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by
lruvec->lru_lock; the PG_lru bit of page->flags is cleared before
isolating a page from its LRU under lruvec->lru_lock.

2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
-----------------------------------------------

With the Kernel memory extension, the Memory Controller is able to limit
the amount of kernel memory used by the system. Kernel memory is fundamentally
different from user memory, since it can't be swapped out, which makes it
possible to DoS the system by consuming too much of this precious resource.

Kernel memory accounting is enabled for all memory cgroups by default. But
it can be disabled system-wide by passing cgroup.memory=nokmem to the kernel
at boot time. In this case, kernel memory will not be accounted at all.

Kernel memory limits are not imposed for the root cgroup. Usage for the root
cgroup may or may not be accounted. The memory used is accumulated into
memory.kmem.usage_in_bytes, or in a separate counter when it makes sense
(currently only for tcp).

The main "kmem" counter is fed into the main counter, so kmem charges will
also be visible from the user counter.

Currently no soft limit is implemented for kernel memory. It is future work
to trigger slab reclaim when those limits are reached.

2.7.1 Current Kernel Memory resources accounted
-----------------------------------------------

stack pages:
  every process consumes some stack pages. By accounting them into
  kernel memory, we prevent new processes from being created when the kernel
  memory usage is too high.

slab pages:
  pages allocated by the SLAB or SLUB allocator are tracked. A copy
  of each kmem_cache is created the first time the cache is touched
  from inside the memcg. The creation is done lazily, so some objects can
  still be skipped while the cache is being created. All objects in a slab
  page should belong to the same memcg. This only fails to hold when a task is
  migrated to a different memcg during the page allocation by the cache.

sockets memory pressure:
  some socket protocols have memory pressure
  thresholds. The Memory Controller allows them to be controlled individually
  per cgroup, instead of globally.

tcp memory pressure:
  sockets memory pressure for the tcp protocol.

2.7.2 Common use cases
----------------------

Because the "kmem" counter is fed into the main user counter, kernel memory
can never be limited completely independently of user memory. Say "U" is the
user limit, and "K" the kernel limit. There are three possible ways limits can
be set:

U != 0, K = unlimited:
    This is the standard memcg limitation mechanism already present before kmem
    accounting. Kernel memory is completely ignored.

U != 0, K < U:
    Kernel memory is a subset of the user memory. This setup is useful in
    deployments where the total amount of memory per-cgroup is overcommitted.
    Overcommitting kernel memory limits is definitely not recommended, since
    the box can still run out of non-reclaimable memory.
    In this case, the admin could set up K so that the sum over all groups is
    never greater than the total memory, and freely set U at the cost of the
    QoS.

WARNING:
    In the current implementation, memory reclaim will NOT be
    triggered for a cgroup when it hits K while staying below U, which makes
    this setup impractical.

U != 0, K >= U:
    Kernel memory charges will also be fed into the user counter, and reclaim
    will be triggered for the cgroup for both kinds of memory. This setup
    gives the admin a unified view of memory, and it is also useful for people
    who just want to track kernel memory usage.
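
The "U != 0, K < U" budgeting idea can be sketched with simple arithmetic.
The machine size, group count and overcommit factor below are made-up
illustrative numbers, not recommendations:

```shell
# Keep the sum of all kernel limits (K) within physical memory, while the
# user limits (U) are deliberately overcommitted.
total=$((64 * 1024))              # machine memory in MiB (illustrative)
groups=16                          # number of cgroups (illustrative)

k_per_group=$((total / groups))    # sum of all K never exceeds total memory
u_per_group=$((total / 4))         # U overcommitted 4x across the groups

echo "K=${k_per_group}MiB U=${u_per_group}MiB sum_K=$((k_per_group * groups))MiB"
```

Here the non-reclaimable kernel memory can never exceed the box, while user
memory is freely overcommitted at the cost of QoS, as described above.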

3. User Interface
=================

3.0. Configuration
------------------

a. Enable CONFIG_CGROUPS
b. Enable CONFIG_MEMCG
c. Enable CONFIG_MEMCG_SWAP (to use the swap extension)
d. Enable CONFIG_MEMCG_KMEM (to use the kmem extension)

3.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
-------------------------------------------------------------------

::

        # mount -t tmpfs none /sys/fs/cgroup
        # mkdir /sys/fs/cgroup/memory
        # mount -t cgroup none /sys/fs/cgroup/memory -o memory

3.2. Make the new group and move bash into it::

        # mkdir /sys/fs/cgroup/memory/0
        # echo $$ > /sys/fs/cgroup/memory/0/tasks

Since we are now in the 0 cgroup, we can alter the memory limit::

        # echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes

NOTE:
  We can use a suffix (k, K, m, M, g or G) to indicate values in kilo-,
  mega- or gigabytes. (Here, kilo, mega and giga mean kibibytes, mebibytes
  and gibibytes.)
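
The suffix rule above can be mirrored by a small helper for scripts that
compute limits; the `to_bytes` function name is our own, not a kernel
interface, and the multipliers are binary (KiB/MiB/GiB) as the note says:

```shell
# Convert a suffixed value (4M, 1G, 512k, ...) to bytes, using binary
# multipliers as the memory controller does. Plain numbers pass through.
to_bytes() {
    case "$1" in
        *[kK]) echo $(( ${1%?} * 1024 )) ;;
        *[mM]) echo $(( ${1%?} * 1024 * 1024 )) ;;
        *[gG]) echo $(( ${1%?} * 1024 * 1024 * 1024 )) ;;
        *)     echo "$1" ;;
    esac
}

to_bytes 4M    # 4194304, matching the 4M limit set in the example above
```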

NOTE:
  We can write "-1" to reset ``*.limit_in_bytes`` (unlimited).

NOTE:
  We cannot set limits on the root cgroup any more.

::

  # cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes
  4194304

We can check the usage::

  # cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes
  1216512

A successful write to this file does not guarantee a successful setting of
this limit to the value written into the file. This can be due to a
number of factors, such as rounding up to page boundaries or the total
availability of memory on the system. The user is required to re-read
this file after a write to see the value committed by the kernel::

  # echo 1 > memory.limit_in_bytes
  # cat memory.limit_in_bytes
  4096
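
The write-then-re-read dance above can be wrapped in a tiny helper. As a
sketch, the page rounding the kernel performs internally is simulated here
against a scratch file (`set_limit` is our own name, not a kernel interface):

```shell
# set_limit <file> <bytes>: write a value, then re-read it to learn what was
# actually committed. The rounding up to a 4096-byte page boundary mimics
# what the kernel does to limit_in_bytes writes.
PAGE=4096
set_limit() {
    rounded=$(( ( $2 + PAGE - 1 ) / PAGE * PAGE ))
    echo "$rounded" > "$1"
    cat "$1"        # always re-read: this is the committed value
}

f=$(mktemp)
set_limit "$f" 1      # prints 4096, as in the echo 1 / cat example above
```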

The memory.failcnt field gives the number of times that the cgroup limit was
exceeded.

The memory.stat file gives accounting information. Currently, the number of
caches, RSS and Active/Inactive pages is shown.

4. Testing
==========

For testing features and implementation, see memcg_test.txt.

Performance testing is also important. To see the pure memory controller
overhead, testing on tmpfs will give you good numbers for the small overheads.
Example: do a kernel make on tmpfs.

Page-fault scalability is also important. When measuring a parallel
page-fault test, a multi-process test may be better than a multi-thread
test because the latter adds noise from shared objects/state.

But the above two test extreme situations.
Running ordinary tests under the memory controller is always helpful.

4.1 Troubleshooting
-------------------

Sometimes a user might find that the application under a cgroup is
terminated by the OOM killer. There are several causes for this:

1. The cgroup limit is too low (just too low to do anything useful).
2. The user is using anonymous memory and swap is turned off or too low.

A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of
some of the pages cached in the cgroup (page cache pages).

To understand what is happening, disabling the OOM killer as described in
"10. OOM Control" (below) and observing what happens will be helpful.

4.2 Task migration
------------------

When a task migrates from one cgroup to another, its charge is not
carried forward by default. The pages allocated from the original cgroup still
remain charged to it; the charge is dropped when the page is freed or
reclaimed.

You can move the charges of a task along with task migration.
See 8. "Move charges at task migration".

4.3 Removing a cgroup
---------------------

A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
cgroup might have some charge associated with it, even though all
tasks have migrated away from it (because we charge against pages, not
against tasks).

We move the stats to the parent; there is no change to the charges other than
being uncharged from the child.

Charges recorded in swap information are not updated at removal of a cgroup.
The recorded information is discarded, and a cgroup which uses the swap
(swapcache) will be charged as its new owner.

5. Misc. interfaces
===================

5.1 force_empty
---------------
  The memory.force_empty interface is provided to make a cgroup's memory usage
  empty. When writing anything to this file::

    # echo 0 > memory.force_empty

  the cgroup will be reclaimed and as many pages as possible will be
  reclaimed.

  The typical use case for this interface is before calling rmdir().
  Though rmdir() offlines the memcg, the memcg may still stay there due to
  charged file caches. Some out-of-use page caches may keep their charges
  until memory pressure happens. If you want to avoid that, force_empty will
  be useful.

5.2 stat file
-------------

The memory.stat file includes the following statistics:

per-memory cgroup local status
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

=============== ===============================================================
cache           # of bytes of page cache memory.
rss             # of bytes of anonymous and swap cache memory (includes
                transparent hugepages).
rss_huge        # of bytes of anonymous transparent hugepages.
mapped_file     # of bytes of mapped file (includes tmpfs/shmem)
pgpgin          # of charging events to the memory cgroup. The charging
                event happens each time a page is accounted as either a
                mapped anon page (RSS) or a cache page (Page Cache) to the
                cgroup.
pgpgout         # of uncharging events to the memory cgroup. The uncharging
                event happens each time a page is unaccounted from the cgroup.
swap            # of bytes of swap usage
dirty           # of bytes that are waiting to get written back to the disk.
writeback       # of bytes of file/anon cache that are queued for syncing to
                disk.
inactive_anon   # of bytes of anonymous and swap cache memory on inactive
                LRU list.
active_anon     # of bytes of anonymous and swap cache memory on active
                LRU list.
inactive_file   # of bytes of file-backed memory on inactive LRU list.
active_file     # of bytes of file-backed memory on active LRU list.
unevictable     # of bytes of memory that cannot be reclaimed (mlocked etc).
=============== ===============================================================

status considering hierarchy (see memory.use_hierarchy settings)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

========================= ===================================================
hierarchical_memory_limit # of bytes of memory limit with regard to the
                          hierarchy under which the memory cgroup is.
hierarchical_memsw_limit  # of bytes of memory+swap limit with regard to the
                          hierarchy under which the memory cgroup is.

total_<counter>           # hierarchical version of <counter>, which in
                          addition to the cgroup's own value includes the
                          sum of all hierarchical children's values of
                          <counter>, i.e. total_cache
========================= ===================================================

The following additional stats are dependent on CONFIG_DEBUG_VM
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

========================= ========================================
recent_rotated_anon       VM internal parameter. (see mm/vmscan.c)
recent_rotated_file       VM internal parameter. (see mm/vmscan.c)
recent_scanned_anon       VM internal parameter. (see mm/vmscan.c)
recent_scanned_file       VM internal parameter. (see mm/vmscan.c)
========================= ========================================

Memo:
        recent_rotated means the recent frequency of LRU rotation.
        recent_scanned means the recent # of scans of the LRU.
        These are shown for easier debugging; please see the code for their
        precise meanings.

Note:
        Only anonymous and swap cache memory is listed as part of the 'rss'
        stat. This should not be confused with the true 'resident set size'
        or the amount of physical memory used by the cgroup.

        'rss + mapped_file' will give you the resident set size of the cgroup.

        (Note: file and shmem may be shared among other cgroups. In that case,
        mapped_file is accounted only when the memory cgroup is the owner of
        the page cache.)

5.3 swappiness
--------------

Overrides /proc/sys/vm/swappiness for the particular group. The tunable
in the root cgroup corresponds to the global swappiness setting.

Please note that, unlike during global reclaim, limit reclaim
enforces that a swappiness of 0 really prevents any swapping even if
swap storage is available. This might lead to the memcg OOM killer being
invoked if there are no file pages to reclaim.

5.4 failcnt
-----------

A memory cgroup provides the memory.failcnt and memory.memsw.failcnt files.
This failcnt (= failure count) shows the number of times the usage counter
hit its limit. When a memory cgroup hits a limit, failcnt increases and
memory under it will be reclaimed.

You can reset failcnt by writing 0 to the failcnt file::

        # echo 0 > .../memory.failcnt

5.5 usage_in_bytes
------------------

For efficiency, like other kernel components, the memory cgroup uses some
optimization to avoid unnecessary cacheline false sharing. usage_in_bytes is
affected by this method and doesn't show the 'exact' value of memory (and
swap) usage; it's a fuzz value for efficient access. (Of course, it's
synchronized when necessary.) If you want to know the more exact memory usage,
you should use the RSS+CACHE(+SWAP) values in memory.stat (see 5.2).
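
Computing that more exact figure from memory.stat is a small awk exercise. The
stat values below are made-up sample data, not real cgroup output:

```shell
# Sum the rss, cache and swap fields of a memory.stat-style listing to get
# the "exact" usage suggested above. Sample data stands in for
# /sys/fs/cgroup/memory/<group>/memory.stat.
stat='cache 8192
rss 4096
rss_huge 0
swap 2048
pgpgin 10'

exact=$(printf '%s\n' "$stat" | awk '
    $1 == "cache" || $1 == "rss" || $1 == "swap" { sum += $2 }
    END { print sum }')
echo "exact usage: $exact bytes"   # 14336 for this sample
```

Against a real cgroup, the here-string would be replaced by
`cat .../memory.stat`.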

5.6 numa_stat
-------------

This is similar to numa_maps but operates on a per-memcg basis.  This is
useful for providing visibility into the numa locality information within
a memcg since the pages are allowed to be allocated from any physical
node.  One of the use cases is evaluating application performance by
combining this information with the application's CPU allocation.

Each memcg's numa_stat file includes "total", "file", "anon" and "unevictable"
per-node page counts, including "hierarchical_<counter>" which sums up all
hierarchical children's values in addition to the memcg's own value.

The output format of memory.numa_stat is::

  total=<total pages> N0=<node 0 pages> N1=<node 1 pages> ...
  file=<total file pages> N0=<node 0 pages> N1=<node 1 pages> ...
  anon=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
  unevictable=<total unevictable pages> N0=<node 0 pages> N1=<node 1 pages> ...
  hierarchical_<counter>=<counter pages> N0=<node 0 pages> N1=<node 1 pages> ...

The "total" count is the sum of file + anon + unevictable.
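
A sketch of parsing that format, with made-up sample data (the `val` helper is
our own; a real script would read the lines from memory.numa_stat):

```shell
# Sample numa_stat lines; in practice these come from memory.numa_stat.
line='total=6 N0=4 N1=2'
files='file=3 N0=2 N1=1'
anon='anon=2 N0=1 N1=1'
unev='unevictable=1 N0=1 N1=0'

# val <line> <key>: extract the value of key=value from a space-separated line.
val() { echo "$1" | tr ' ' '\n' | awk -F= -v k="$2" '$1 == k { print $2 }'; }

total=$(val "$line" total)
sum=$(( $(val "$files" file) + $(val "$anon" anon) + $(val "$unev" unevictable) ))
echo "total=$total sum=$sum"   # the two agree: total = file + anon + unevictable
```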

6. Hierarchy support
====================

The memory controller supports a deep hierarchy and hierarchical accounting.
The hierarchy is created by creating the appropriate cgroups in the
cgroup filesystem. Consider, for example, the following cgroup filesystem
hierarchy::

               root
             /  |   \
            /   |    \
           a    b     c
                      | \
                      |  \
                      d   e

In the diagram above, with hierarchical accounting enabled, all memory
usage of e is accounted to its ancestors up until the root (i.e., c and root).
If one of the ancestors goes over its limit, the reclaim algorithm reclaims
from the tasks in the ancestor and the children of the ancestor.

6.1 Hierarchical accounting and reclaim
---------------------------------------

Hierarchical accounting is enabled by default. Disabling the hierarchical
accounting is deprecated. An attempt to do so will result in a failure
and a warning printed to dmesg.

For compatibility reasons, writing 1 to memory.use_hierarchy will always
succeed::

        # echo 1 > memory.use_hierarchy
0681 
7. Soft limits
==============

Soft limits allow for greater sharing of memory. The idea behind soft limits
is to allow control groups to use as much memory as needed, provided

a. There is no memory contention
b. They do not exceed their hard limit

When the system detects memory contention or low memory, control groups
are pushed back to their soft limits. If the soft limit of each control
group is very high, they are pushed back as much as possible to make
sure that one control group does not starve the others of memory.

Please note that soft limits are a best-effort feature; they come with
no guarantees, but the kernel does its best to make sure that when memory
is heavily contended for, memory is allocated based on the soft limit
hints/setup. Currently, soft limit based reclaim is set up such that
it is invoked from balance_pgdat (kswapd).

7.1 Interface
-------------

Soft limits can be set up by using the following commands (in this example we
assume a soft limit of 256 MiB)::

        # echo 256M > memory.soft_limit_in_bytes

If we want to change this to 1G, we can at any time use::

        # echo 1G > memory.soft_limit_in_bytes

NOTE1:
       Soft limits take effect over a long period of time, since they involve
       reclaiming memory for balancing between memory cgroups.
NOTE2:
       It is recommended to always set the soft limit below the hard limit,
       otherwise the hard limit will take precedence.

8. Move charges at task migration
=================================

Users can move charges associated with a task along with task migration, that
is, uncharge the task's pages from the old cgroup and charge them to the new
cgroup. This feature is not supported in !CONFIG_MMU environments because of
the lack of page tables.

8.1 Interface
-------------

This feature is disabled by default. It can be enabled (and disabled again) by
writing to memory.move_charge_at_immigrate of the destination cgroup.

If you want to enable it::

        # echo (some positive value) > memory.move_charge_at_immigrate

Note:
      Each bit of move_charge_at_immigrate has its own meaning about what type
      of charges should be moved. See 8.2 for details.
Note:
      Charges are moved only when you move mm->owner, in other words,
      the leader of a thread group.
Note:
      If we cannot find enough space for the task in the destination cgroup, we
      try to make space by reclaiming memory. Task migration may fail if we
      cannot make enough space.
Note:
      Moving charges can take several seconds if many charges have to be moved.

If you want to disable it again::

        # echo 0 > memory.move_charge_at_immigrate

8.2 Type of charges which can be moved
--------------------------------------

Each bit in move_charge_at_immigrate has its own meaning about what type of
charges should be moved. In any case, note that a page or swap charge can be
moved only when it is charged to the task's current (old) memory cgroup.

+---+--------------------------------------------------------------------------+
|bit| what type of charges would be moved ?                                    |
+===+==========================================================================+
| 0 | A charge of an anonymous page (or swap of it) used by the target task.   |
|   | You must enable Swap Extension (see 2.4) to enable move of swap charges. |
+---+--------------------------------------------------------------------------+
| 1 | A charge of file pages (normal file, tmpfs file (e.g. ipc shared memory) |
|   | and swaps of tmpfs file) mmapped by the target task. Unlike the case of  |
|   | anonymous pages, file pages (and swaps) in the range mmapped by the task |
|   | will be moved even if the task hasn't faulted them in, i.e. they might   |
|   | not be the task's "RSS", but other task's "RSS" that maps the same file. |
|   | And mapcount of the page is ignored (the page can be moved even if       |
|   | page_mapcount(page) > 1). You must enable Swap Extension (see 2.4) to    |
|   | enable move of swap charges.                                             |
+---+--------------------------------------------------------------------------+

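The value written to move_charge_at_immigrate is the OR of the bits in the
table above. A minimal sketch of composing it (the constant names are this
example's own, not kernel identifiers)::

```python
# Sketch: build the bitmask written to move_charge_at_immigrate from the
# bits defined in the table above. The names below are illustrative.

MOVE_ANON = 1 << 0   # bit 0: anonymous pages (and their swap, with Swap Extension)
MOVE_FILE = 1 << 1   # bit 1: mmapped file pages (and tmpfs swap)

value = MOVE_ANON | MOVE_FILE
assert value == 3    # "echo 3 > memory.move_charge_at_immigrate" moves both types
```

So writing 1 moves only anonymous-page charges, 2 moves only file-page
charges, and 3 moves both.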
8.3 TODO
--------

- All charge-moving operations are done under cgroup_mutex. It's not good
  behavior to hold the mutex for too long, so we may need some trick.

9. Memory thresholds
====================

Memory cgroup implements memory thresholds using the cgroups notification
API (see cgroups.txt). It allows registering multiple memory and memsw
thresholds and delivers notifications when a threshold is crossed.

To register a threshold, an application must:

- create an eventfd using eventfd(2);
- open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
- write a string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to
  cgroup.event_control.

The application will be notified through the eventfd when memory usage crosses
the threshold in either direction.

This is applicable to both root and non-root cgroups.

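The registration steps above can be sketched as follows. Only the
control-string construction runs here; the eventfd(2) call and the cgroupfs
paths need a live memory cgroup, so that part is shown but not exercised.
The helper names are this example's own, and os.eventfd() requires
Python 3.10+ on Linux::

```python
# Sketch: register a memory threshold notification, assuming a mounted
# cgroup-v1 memory controller. Helper names are illustrative.

import os

def event_control_line(event_fd, usage_fd, threshold):
    # Format: "<event_fd> <fd of memory.usage_in_bytes> <threshold>"
    return f"{event_fd} {usage_fd} {threshold}"

def register_threshold(cgroup_dir, threshold):
    efd = os.eventfd(0)  # step 1: create an eventfd (Python 3.10+)
    usage_fd = os.open(  # step 2: open the usage file
        os.path.join(cgroup_dir, "memory.usage_in_bytes"), os.O_RDONLY)
    # step 3: write the registration string to cgroup.event_control
    with open(os.path.join(cgroup_dir, "cgroup.event_control"), "w") as f:
        f.write(event_control_line(efd, usage_fd, threshold))
    return efd  # becomes readable whenever usage crosses the threshold

# The control string for fds 4 and 5 and a 256 MiB threshold:
assert event_control_line(4, 5, 256 * 1024 * 1024) == "4 5 268435456"
```

After registration, a read(2) on the returned eventfd blocks until the
threshold is crossed in either direction.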
10. OOM Control
===============

The memory.oom_control file is used for OOM notification and other controls.

Memory cgroup implements an OOM notifier using the cgroup notification
API (see cgroups.txt). It allows registering multiple OOM notification
deliveries and sends a notification when an OOM occurs.

To register a notifier, an application must:

 - create an eventfd using eventfd(2)
 - open the memory.oom_control file
 - write a string like "<event_fd> <fd of memory.oom_control>" to
   cgroup.event_control

The application will be notified through the eventfd when an OOM occurs.
OOM notification doesn't work for the root cgroup.

You can disable the OOM-killer by writing "1" to the memory.oom_control file,
as::

        # echo 1 > memory.oom_control

If the OOM-killer is disabled, tasks under the cgroup will hang/sleep
in the memory cgroup's OOM-waitqueue when they request accountable memory.

To let them run again, you have to relax the memory cgroup's OOM status by

        * enlarging the limit or reducing usage.

To reduce usage,

        * kill some tasks.
        * move some tasks to another group with account migration.
        * remove some files (on tmpfs?)

Then, stopped tasks will work again.

Reading the file shows the current OOM status:

        - oom_kill_disable 0 or 1
          (if 1, the oom-killer is disabled)
        - under_oom        0 or 1
          (if 1, the memory cgroup is under OOM, and tasks may be stopped.)
        - oom_kill         integer counter
          The number of processes belonging to this cgroup killed by any
          kind of OOM killer.

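The three status fields read from memory.oom_control are simple key/value
pairs; a short parsing sketch (the sample text below is illustrative output,
not from a live system)::

```python
# Sketch: parse the contents of memory.oom_control into a dict of the
# three status fields listed above. Sample input is made up.

def parse_oom_control(text):
    status = {}
    for line in text.splitlines():
        key, value = line.split()
        status[key] = int(value)
    return status

sample = "oom_kill_disable 1\nunder_oom 0\noom_kill 3"
status = parse_oom_control(sample)
assert status["oom_kill_disable"] == 1   # the OOM-killer is disabled
assert status["under_oom"] == 0          # not currently under OOM
assert status["oom_kill"] == 3           # three tasks killed so far
```

A monitoring tool would typically poll under_oom after disabling the killer,
since tasks sleep in the OOM-waitqueue while it is 1.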
11. Memory Pressure
===================

The pressure level notifications can be used to monitor the memory
allocation cost; based on the pressure, applications can implement
different strategies of managing their memory resources. The pressure
levels are defined as follows:

The "low" level means that the system is reclaiming memory for new
allocations. Monitoring this reclaiming activity might be useful for
maintaining cache levels. Upon notification, the program (typically an
"Activity Manager") might analyze vmstat and act in advance (i.e.
prematurely shut down unimportant services).

The "medium" level means that the system is experiencing medium memory
pressure; the system might be swapping, paging out active file caches,
etc. Upon this event, applications may decide to further analyze
vmstat/zoneinfo/memcg or internal memory usage statistics and free any
resources that can be easily reconstructed or re-read from disk.

The "critical" level means that the system is actively thrashing, is
about to run out of memory (OOM), or the in-kernel OOM killer is even
about to trigger. Applications should do whatever they can to help the
system. It might be too late to consult vmstat or any other statistics,
so it's advisable to take immediate action.

By default, events are propagated upward until the event is handled, i.e. the
events are not pass-through. For example, suppose you have three cgroups:
A->B->C. Now you set up an event listener on cgroups A, B and C, and group C
experiences some pressure. In this situation, only group C will receive the
notification, i.e. groups A and B will not receive it. This is done to avoid
excessive "broadcasting" of messages, which disturbs the system and which is
especially bad if we are low on memory or thrashing. Group B will receive a
notification only if there are no event listeners for group C.

There are three optional modes that specify different propagation behavior:

 - "default": this is the default behavior specified above. This mode is the
   same as omitting the optional mode parameter, preserved for backwards
   compatibility.

 - "hierarchy": events always propagate up to the root, similar to the default
   behavior, except that propagation continues regardless of whether there are
   event listeners at each level. In the above example, groups A, B, and C
   will receive notification of memory pressure.

 - "local": events are pass-through, i.e. listeners only receive notifications
   when memory pressure is experienced in the memcg for which the notification
   is registered. In the above example, group C will receive a notification if
   registered for "local" notification and the group experiences memory
   pressure. However, group B will never receive a notification if it is
   registered for local notification, regardless of whether there is an event
   listener for group C.

The level and event notification mode ("hierarchy" or "local", if necessary)
are specified by a comma-delimited string, i.e. "low,hierarchy" specifies
hierarchical, pass-through notification for all ancestor memcgs. Notification
with the default, non-pass-through behavior does not specify a mode.
"medium,local" specifies pass-through notification for the medium level.

The file memory.pressure_level is only used to set up an eventfd. To
register a notification, an application must:

- create an eventfd using eventfd(2);
- open memory.pressure_level;
- write a string like "<event_fd> <fd of memory.pressure_level> <level[,mode]>"
  to cgroup.event_control.

The application will be notified through the eventfd when memory pressure is
at the specified level (or higher). Read/write operations on
memory.pressure_level are not implemented.

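The "<level[,mode]>" argument has a small, fixed grammar; a sketch of
validating and composing it (the helper name is this example's own, not a
kernel interface)::

```python
# Sketch: build and validate the "<level[,mode]>" string written to
# cgroup.event_control for memory.pressure_level notifications.

LEVELS = ("low", "medium", "critical")
MODES = ("default", "hierarchy", "local")

def pressure_spec(level, mode=None):
    if level not in LEVELS:
        raise ValueError(f"unknown pressure level: {level}")
    if mode is None:
        return level                    # default, non-pass-through behavior
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    return f"{level},{mode}"

assert pressure_spec("low", "hierarchy") == "low,hierarchy"
assert pressure_spec("medium", "local") == "medium,local"
assert pressure_spec("critical") == "critical"
```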
Test:

   Here is a small script example that makes a new cgroup, sets up a
   memory limit, sets up a notification in the cgroup and then makes the
   cgroup experience a critical pressure::

        # cd /sys/fs/cgroup/memory/
        # mkdir foo
        # cd foo
        # cgroup_event_listener memory.pressure_level low,hierarchy &
        # echo 8000000 > memory.limit_in_bytes
        # echo 8000000 > memory.memsw.limit_in_bytes
        # echo $$ > tasks
        # dd if=/dev/zero | read x

   (Expect a bunch of notifications, and eventually, the oom-killer will
   trigger.)

12. TODO
========

1. Make the per-cgroup scanner reclaim not-shared pages first
2. Teach the controller to account for shared pages
3. Start reclamation in the background when the limit is
   not yet hit but usage is getting closer

Summary
=======

Overall, the memory controller has been a stable controller and has been
commented on and discussed quite extensively in the community.

References
==========

1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/
2. Singh, Balbir. Memory Controller (RSS Control),
   http://lwn.net/Articles/222762/
3. Emelianov, Pavel. Resource controllers based on process cgroups,
   https://lore.kernel.org/r/45ED7DEC.7010403@sw.ru
4. Emelianov, Pavel. RSS controller based on process cgroups (v2),
   https://lore.kernel.org/r/461A3010.90403@sw.ru
5. Emelianov, Pavel. RSS controller based on process cgroups (v3),
   https://lore.kernel.org/r/465D9739.8070209@openvz.org
6. Menage, Paul. Control Groups v10, http://lwn.net/Articles/236032/
7. Vaidyanathan, Srinivasan. Control Groups: Pagecache accounting and control
   subsystem (v3), http://lwn.net/Articles/235534/
8. Singh, Balbir. RSS controller v2 test results (lmbench),
   https://lore.kernel.org/r/464C95D4.7070806@linux.vnet.ibm.com
9. Singh, Balbir. RSS controller v2 AIM9 results,
   https://lore.kernel.org/r/464D267A.50107@linux.vnet.ibm.com
10. Singh, Balbir. Memory controller v6 test results,
    https://lore.kernel.org/r/20070819094658.654.84837.sendpatchset@balbir-laptop
11. Singh, Balbir. Memory controller introduction (v6),
    https://lore.kernel.org/r/20070817084228.26003.12568.sendpatchset@balbir-laptop
12. Corbet, Jonathan. Controlling memory use in cgroups,
    http://lwn.net/Articles/243795/