======================================================
Memory Resource Controller (Memcg) Implementation Memo
======================================================

Last Updated: 2010/2

Base Kernel Version: based on 2.6.33-rc7-mm (candidate for 34).

Because the VM is getting complex (one of the reasons is memcg...), memcg's
behavior is also complex. This is a document for memcg's internal behavior.
Please note that implementation details can be changed.

(*) Topics on the API should be in Documentation/admin-guide/cgroup-v1/memory.rst

0. How is usage recorded?
=========================

Two objects are used.

page_cgroup ....an object per page.

  Allocated at boot or memory hotplug. Freed at memory hot-removal.

swap_cgroup ... an entry per swp_entry.

  Allocated at swapon(). Freed at swapoff().

The page_cgroup has a USED bit, so double counting against a page_cgroup
never occurs. swap_cgroup is used only when a charged page is swapped out.

1. Charge
=========

A page/swp_entry may be charged (usage += PAGE_SIZE) at

mem_cgroup_try_charge()

2. Uncharge
===========

A page/swp_entry may be uncharged (usage -= PAGE_SIZE) by

mem_cgroup_uncharge()
  Called when a page's refcount goes down to 0.

mem_cgroup_uncharge_swap()
  Called when a swp_entry's refcnt goes down to 0. A charge against swap
  disappears.

3. charge-commit-cancel
=======================

Memcg pages are charged in two steps:

  - mem_cgroup_try_charge()
  - mem_cgroup_commit_charge() or mem_cgroup_cancel_charge()

At try_charge(), there are no flags to say "this page is charged";
at this point, usage += PAGE_SIZE.

At commit(), the page is associated with the memcg.

At cancel(), simply usage -= PAGE_SIZE.

In the explanations below, we assume CONFIG_MEM_RES_CTRL_SWAP=y.

4. Anonymous
============

An anonymous page is newly allocated at
  - page fault into a MAP_ANONYMOUS mapping.
  - Copy-On-Write.

4.1 Swap-in.
  At swap-in, the page is taken from the swap cache. There are two cases.

  (a) If the SwapCache is newly allocated and read, it has no charges.
  (b) If the SwapCache has been mapped by processes, it has been
      charged already.

4.2 Swap-out.
  At swap-out, the typical state transition is as follows.

  (a) add to swap cache. (marked as SwapCache)
      swp_entry's refcnt += 1.
  (b) fully unmapped.
      swp_entry's refcnt += # of ptes.
  (c) write back to swap.
  (d) delete from swap cache. (remove from SwapCache)
      swp_entry's refcnt -= 1.

Finally, at task exit,

  (e) zap_pte() is called and swp_entry's refcnt -= 1 -> 0.

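The swap accounting above can be observed from userspace. A minimal sketch,
assuming a v1 memory controller mounted at /cgroup, swap accounting enabled,
and available swap (the group name, limit, and sizes are illustrative)::

  # mkdir /cgroup/swap_test
  # echo 10M > /cgroup/swap_test/memory.limit_in_bytes
  # echo $$ > /cgroup/swap_test/tasks
  # a="$(dd if=/dev/zero bs=1M count=30 | tr '\0' 'x')"
  # grep swap /cgroup/swap_test/memory.stat

The ``tr`` keeps the bytes non-NUL so the shell really stores ~30M of
anonymous memory; the excess over the 10M limit must be swapped out.
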
5. Page Cache
=============

Page Cache is charged at
  - filemap_add_folio().

The logic is very clear. (For migration, see below.)

Note:
  __remove_from_page_cache() is called by remove_from_page_cache()
  and __remove_mapping().

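Page-cache charging can also be watched from userspace. A minimal sketch,
assuming a v1 memory controller mounted at /cgroup (the group name is
illustrative, and /some/large/file stands in for any big file)::

  # mkdir /cgroup/pagecache_test
  # echo $$ > /cgroup/pagecache_test/tasks
  # cat /some/large/file > /dev/null
  # grep cache /cgroup/pagecache_test/memory.stat
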
6. Shmem (tmpfs) Page Cache
===========================

The best way to understand shmem's page state transitions is to read
mm/shmem.c.

But a brief explanation of memcg's behavior around shmem will be
helpful for understanding the logic.

A shmem page (just a leaf page, not a direct/indirect block) can be on

  - the radix-tree of the shmem inode.
  - SwapCache.
  - both the radix-tree and SwapCache. This happens at swap-in
    and swap-out.

It's charged when...

  - A new page is added to shmem's radix-tree.
  - A swapped page is read in. (This moves a charge from swap_cgroup to
    page_cgroup.)

7. Page Migration
=================

At page migration, the charge is transferred from the old page to the
new page by

  mem_cgroup_migrate()

8. LRU
======

Each memcg has its own vector of LRUs (inactive anon, active anon,
inactive file, active file, unevictable) of pages from each node,
each LRU handled under a single lru_lock for that memcg and node.

9. Typical Tests.
=================

Tests for racy cases.

9.1 Small limit to memcg.
-------------------------

When testing racy cases, it is a good idea to set the memcg's limit
very small, rather than in the GB range. Many races have been found
in tests under limits of some KB or some tens of MB.

(Memory behavior under a GB limit and under a MB limit shows very
different situations.)

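A minimal sketch of such a setup, assuming a v1 memory controller mounted
at /cgroup (the group name, limit, and workload are illustrative)::

  # mkdir /cgroup/small_test
  # echo 1M > /cgroup/small_test/memory.limit_in_bytes
  # echo $$ > /cgroup/small_test/tasks
  # make -C /path/to/some/source/tree

Under such a tight limit, any memory-hungry job constantly exercises the
charge/reclaim paths, which is where the races tend to show up.
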
9.2 Shmem
---------

Historically, memcg's shmem handling was poor, and we saw a fair amount
of trouble here. This is because shmem is page cache but can also be
SwapCache. Testing with shmem/tmpfs is always a good test.

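A minimal sketch of a tmpfs test, assuming a v1 memory controller mounted
at /cgroup and available swap (mount point, sizes, and limit are
illustrative)::

  # mkdir /cgroup/shmem_test /mnt/tmpfs_test
  # mount -t tmpfs -o size=64M none /mnt/tmpfs_test
  # echo 16M > /cgroup/shmem_test/memory.limit_in_bytes
  # echo $$ > /cgroup/shmem_test/tasks
  # dd if=/dev/zero of=/mnt/tmpfs_test/file bs=1M count=32

Writing past the limit forces shmem pages through the page-cache ->
SwapCache transition described in section 6.
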
9.3 Migration
-------------

For NUMA, migration is another special case. For easy testing, cpuset
is useful. The following is a sample script to do migration::

  mount -t cgroup -o cpuset none /opt/cpuset

  mkdir /opt/cpuset/01
  echo 1 > /opt/cpuset/01/cpuset.cpus
  echo 0 > /opt/cpuset/01/cpuset.mems
  echo 1 > /opt/cpuset/01/cpuset.memory_migrate
  mkdir /opt/cpuset/02
  echo 1 > /opt/cpuset/02/cpuset.cpus
  echo 1 > /opt/cpuset/02/cpuset.mems
  echo 1 > /opt/cpuset/02/cpuset.memory_migrate

In the above setup, when you move a task from 01 to 02, page migration
from node 0 to node 1 will occur. The following is a script to migrate
all tasks under a cpuset::

  --
  # G1 and G2 are the source and destination cpuset directories,
  # e.g. G1=/opt/cpuset/01 and G2=/opt/cpuset/02.
  move_task()
  {
          for pid in $1
          do
                  /bin/echo $pid >$2/tasks 2>/dev/null
                  echo -n $pid
                  echo -n " "
          done
          echo END
  }

  G1_TASK=`cat ${G1}/tasks`
  G2_TASK=`cat ${G2}/tasks`
  move_task "${G1_TASK}" ${G2} &
  --

9.4 Memory hotplug
------------------

The memory hotplug test is another good test.

To offline memory, do the following::

  # echo offline > /sys/devices/system/memory/memoryXXX/state

(XXX is the number of the memory block.)

This is an easy way to test page migration, too.

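A minimal repeat-offline/online sketch; the block number 16 is illustrative,
so pick a block whose ``removable`` file reads 1 on your system, and run a
memcg workload in parallel::

  while true; do
          echo offline > /sys/devices/system/memory/memory16/state
          echo online > /sys/devices/system/memory/memory16/state
  done
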
9.5 nested cgroups
------------------

Use tests like the following for testing nested cgroups::

  mkdir /opt/cgroup/01/child_a
  mkdir /opt/cgroup/01/child_b

  # set a limit on 01 (the values are examples)
  echo 500M > /opt/cgroup/01/memory.limit_in_bytes
  # add a limit to 01/child_b
  echo 100M > /opt/cgroup/01/child_b/memory.limit_in_bytes
  # run jobs under child_a and child_b

Create/delete the following groups at random while the jobs are running
(a driver loop is sketched below)::

  /opt/cgroup/01/child_a/child_aa
  /opt/cgroup/01/child_b/child_bb
  /opt/cgroup/01/child_c

Running new jobs in a new group is also good.

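A minimal sketch of such a random create/delete loop, using the group names
from above (this is one possible driver, not the only way)::

  while true; do
          mkdir /opt/cgroup/01/child_a/child_aa 2>/dev/null
          mkdir /opt/cgroup/01/child_b/child_bb 2>/dev/null
          rmdir /opt/cgroup/01/child_a/child_aa 2>/dev/null
          rmdir /opt/cgroup/01/child_b/child_bb 2>/dev/null
  done
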
9.6 Mount with other subsystems
-------------------------------

Mounting with other subsystems is a good test because there are
races and lock dependencies with other cgroup subsystems.

Example::

  # mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices

and do task move, mkdir, rmdir, etc. under this.

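A minimal sketch of such a mkdir/move/rmdir stress loop under the combined
mount (the group name is illustrative)::

  while true; do
          mkdir /cgroup/stress_test
          # cpuset requires cpus and mems to be set before tasks can attach
          cat /cgroup/cpuset.cpus > /cgroup/stress_test/cpuset.cpus
          cat /cgroup/cpuset.mems > /cgroup/stress_test/cpuset.mems
          echo $$ > /cgroup/stress_test/tasks
          echo $$ > /cgroup/tasks
          rmdir /cgroup/stress_test
  done
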
9.7 swapoff
-----------

Besides the fact that management of swap is one of the complicated parts
of memcg, the call path of swap-in at swapoff is not the same as the
usual swap-in path. It's worth testing explicitly.

For example, a test like the following is good:

(Shell-A)::

  # mount -t cgroup none /cgroup -o memory
  # mkdir /cgroup/test
  # echo 40M > /cgroup/test/memory.limit_in_bytes
  # echo 0 > /cgroup/test/tasks

Run a malloc(100M) program under this. You'll see 60M of swap.

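If no dedicated malloc program is at hand, the shell itself can be made to
hold the memory, as in section 9.10; the ``tr`` keeps the bytes non-NUL so
the shell really stores them::

  # a="$(dd if=/dev/zero bs=1M count=100 | tr '\0' 'x')"
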
(Shell-B)::

  # move all tasks in /cgroup/test to /cgroup
  # /sbin/swapoff -a
  # rmdir /cgroup/test
  # kill malloc task.

Of course, the tmpfs vs. swapoff case should be tested, too.

9.8 OOM-Killer
--------------

An out-of-memory condition caused by a memcg's limit will kill tasks
under the memcg. When hierarchy is used, a task under the hierarchy
will be killed by the kernel.

In this case, panic_on_oom shouldn't be invoked and tasks
in other groups shouldn't be killed.

It's not difficult to cause OOM under memcg, as follows.

Case A) when you can use swapoff::

  # swapoff -a
  # echo 50M > /memory.limit_in_bytes

run 51M of malloc

Case B) when you use mem+swap limitation::

  # echo 50M > memory.limit_in_bytes
  # echo 50M > memory.memsw.limit_in_bytes

run 51M of malloc

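The same shell idiom as in section 9.7 works as the 51M malloc here, e.g.::

  # a="$(dd if=/dev/zero bs=1M count=51 | tr '\0' 'x')"
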
9.9 Move charges at task migration
----------------------------------

Charges associated with a task can be moved along with task migration.

(Shell-A)::

  # mkdir /cgroup/A
  # echo $$ > /cgroup/A/tasks

run some programs which use some amount of memory in /cgroup/A.

(Shell-B)::

  # mkdir /cgroup/B
  # echo 1 > /cgroup/B/memory.move_charge_at_immigrate
  # echo "pid of the program running in group A" > /cgroup/B/tasks

You can see that charges have been moved by reading ``*.usage_in_bytes`` or
memory.stat of both A and B.

See 8.2 of Documentation/admin-guide/cgroup-v1/memory.rst for which value
should be written to move_charge_at_immigrate.

9.10 Memory thresholds
----------------------

The memory controller implements memory thresholds using the cgroups
notification API. You can use tools/cgroup/cgroup_event_listener.c to
test it.

(Shell-A) Create a cgroup and run the event listener::

  # mkdir /cgroup/A
  # ./cgroup_event_listener /cgroup/A/memory.usage_in_bytes 5M

(Shell-B) Add a task to the cgroup and try to allocate and free memory::

  # echo $$ > /cgroup/A/tasks
  # a="$(dd if=/dev/zero bs=1M count=10)"
  # a=

You will see a message from cgroup_event_listener every time you cross
the thresholds.

Use /cgroup/A/memory.memsw.usage_in_bytes to test memsw thresholds.

It's a good idea to test the root cgroup as well.