======================================================
Memory Resource Controller (Memcg) Implementation Memo
======================================================

Last Updated: 2010/2

Base Kernel Version: based on 2.6.33-rc7-mm (candidate for 34).

Because the VM is getting complex (one of the reasons is memcg...), memcg's
behavior is also complex. This is a document for memcg's internal behavior.
Please note that implementation details can be changed.

(*) Topics on the API should be in Documentation/admin-guide/cgroup-v1/memory.rst

0. How is usage recorded?
=========================

Two objects are used.

page_cgroup ....an object per page.

  Allocated at boot or memory hotplug. Freed at memory hot-removal.

swap_cgroup ... an entry per swp_entry.

  Allocated at swapon(). Freed at swapoff().

The page_cgroup has a USED bit, so double counting against a page_cgroup
never occurs. swap_cgroup is used only when a charged page is swapped out.

1. Charge
=========

A page/swp_entry may be charged (usage += PAGE_SIZE) at

mem_cgroup_try_charge()

2. Uncharge
===========

A page/swp_entry may be uncharged (usage -= PAGE_SIZE) by

mem_cgroup_uncharge()
  Called when a page's refcount goes down to 0.

mem_cgroup_uncharge_swap()
  Called when a swp_entry's refcnt goes down to 0. A charge against swap
  disappears.

3. charge-commit-cancel
=======================

Memcg pages are charged in two steps:

  - mem_cgroup_try_charge()
  - mem_cgroup_commit_charge() or mem_cgroup_cancel_charge()

At try_charge(), there are no flags to say "this page is charged";
at this point, usage += PAGE_SIZE.

At commit(), the page is associated with the memcg.

At cancel(), simply usage -= PAGE_SIZE.

In the explanations below, we assume CONFIG_MEM_RES_CTRL_SWAP=y.

4. Anonymous
============

An anonymous page is newly allocated at
  - page fault into a MAP_ANONYMOUS mapping.
  - Copy-On-Write.

4.1 Swap-in.
  At swap-in, the page is taken from the swap cache. There are two cases.

  (a) If the SwapCache is newly allocated and read, it has no charges.
  (b) If the SwapCache has been mapped by processes, it has been
      charged already.

4.2 Swap-out.
  At swap-out, the typical state transition is as follows.

  (a) add to swap cache. (marked as SwapCache)
      swp_entry's refcnt += 1.
  (b) fully unmapped.
      swp_entry's refcnt += # of ptes.
  (c) write back to swap.
  (d) delete from swap cache. (remove from SwapCache)
      swp_entry's refcnt -= 1.

Finally, at task exit,

  (e) zap_pte() is called and swp_entry's refcnt -= 1 -> 0.

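The swap accounting above can be observed from userspace. A minimal sketch,
assuming a v1 memory controller mounted at /cgroup, swap accounting enabled,
and available swap (the group name, limit, and sizes are illustrative)::

  # mkdir /cgroup/swap_test
  # echo 10M > /cgroup/swap_test/memory.limit_in_bytes
  # echo $$ > /cgroup/swap_test/tasks
  # a="$(dd if=/dev/zero bs=1M count=30 | tr '\0' 'x')"
  # grep swap /cgroup/swap_test/memory.stat

The ``tr`` keeps the bytes non-NUL so the shell really stores ~30M of
anonymous memory; the excess over the 10M limit must be swapped out.
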
5. Page Cache
=============

Page Cache is charged at
  - filemap_add_folio().

The logic is very clear. (For migration, see below.)

Note:
  __remove_from_page_cache() is called by remove_from_page_cache()
  and __remove_mapping().

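Page-cache charging can also be watched from userspace. A minimal sketch,
assuming a v1 memory controller mounted at /cgroup (the group name is
illustrative, and /some/large/file stands in for any big file)::

  # mkdir /cgroup/pagecache_test
  # echo $$ > /cgroup/pagecache_test/tasks
  # cat /some/large/file > /dev/null
  # grep cache /cgroup/pagecache_test/memory.stat
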
6. Shmem (tmpfs) Page Cache
===========================

The best way to understand shmem's page state transitions is to read
mm/shmem.c.

But a brief explanation of memcg's behavior around shmem will be
helpful for understanding the logic.

A shmem page (just a leaf page, not a direct/indirect block) can be on

  - the radix-tree of the shmem inode.
  - SwapCache.
  - both the radix-tree and SwapCache. This happens at swap-in
    and swap-out.

It's charged when...

  - A new page is added to shmem's radix-tree.
  - A swapped page is read in. (This moves a charge from swap_cgroup to
    page_cgroup.)

7. Page Migration
=================

At page migration, the charge is transferred from the old page to the
new page by

  mem_cgroup_migrate()

8. LRU
======

Each memcg has its own vector of LRUs (inactive anon, active anon,
inactive file, active file, unevictable) of pages from each node,
each LRU handled under a single lru_lock for that memcg and node.

9. Typical Tests.
=================

Tests for racy cases.

9.1 Small limit to memcg.
-------------------------

When testing racy cases, it is a good idea to set the memcg's limit
very small, rather than in the GB range. Many races have been found
in tests under limits of some KB or some tens of MB.

(Memory behavior under a GB limit and under a MB limit shows very
different situations.)

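A minimal sketch of such a setup, assuming a v1 memory controller mounted
at /cgroup (the group name, limit, and workload are illustrative)::

  # mkdir /cgroup/small_test
  # echo 1M > /cgroup/small_test/memory.limit_in_bytes
  # echo $$ > /cgroup/small_test/tasks
  # make -C /path/to/some/source/tree

Under such a tight limit, any memory-hungry job constantly exercises the
charge/reclaim paths, which is where the races tend to show up.
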
9.2 Shmem
---------

Historically, memcg's shmem handling was poor, and we saw a fair amount
of trouble here. This is because shmem is page cache but can also be
SwapCache. Testing with shmem/tmpfs is always a good test.

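A minimal sketch of a tmpfs test, assuming a v1 memory controller mounted
at /cgroup and available swap (mount point, sizes, and limit are
illustrative)::

  # mkdir /cgroup/shmem_test /mnt/tmpfs_test
  # mount -t tmpfs -o size=64M none /mnt/tmpfs_test
  # echo 16M > /cgroup/shmem_test/memory.limit_in_bytes
  # echo $$ > /cgroup/shmem_test/tasks
  # dd if=/dev/zero of=/mnt/tmpfs_test/file bs=1M count=32

Writing past the limit forces shmem pages through the page-cache ->
SwapCache transition described in section 6.
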
9.3 Migration
-------------

For NUMA, migration is another special case. For easy testing, cpuset
is useful. The following is a sample script to do migration::

  mount -t cgroup -o cpuset none /opt/cpuset

  mkdir /opt/cpuset/01
  echo 1 > /opt/cpuset/01/cpuset.cpus
  echo 0 > /opt/cpuset/01/cpuset.mems
  echo 1 > /opt/cpuset/01/cpuset.memory_migrate
  mkdir /opt/cpuset/02
  echo 1 > /opt/cpuset/02/cpuset.cpus
  echo 1 > /opt/cpuset/02/cpuset.mems
  echo 1 > /opt/cpuset/02/cpuset.memory_migrate

In the above setup, when you move a task from 01 to 02, page migration
from node 0 to node 1 will occur. The following is a script to migrate
all tasks under a cpuset::

  --
  # G1 and G2 are the source and destination cpuset directories,
  # e.g. G1=/opt/cpuset/01 and G2=/opt/cpuset/02.
  move_task()
  {
          for pid in $1
          do
                  /bin/echo $pid >$2/tasks 2>/dev/null
                  echo -n $pid
                  echo -n " "
          done
          echo END
  }

  G1_TASK=`cat ${G1}/tasks`
  G2_TASK=`cat ${G2}/tasks`
  move_task "${G1_TASK}" ${G2} &
  --

9.4 Memory hotplug
------------------

The memory hotplug test is another good test.

To offline memory, do the following::

  # echo offline > /sys/devices/system/memory/memoryXXX/state

(XXX is the number of the memory block.)

This is an easy way to test page migration, too.

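A minimal repeat-offline/online sketch; the block number 16 is illustrative,
so pick a block whose ``removable`` file reads 1 on your system, and run a
memcg workload in parallel::

  while true; do
          echo offline > /sys/devices/system/memory/memory16/state
          echo online > /sys/devices/system/memory/memory16/state
  done
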
9.5 nested cgroups
------------------

Use tests like the following for testing nested cgroups::

  mkdir /opt/cgroup/01/child_a
  mkdir /opt/cgroup/01/child_b

  # set a limit on 01 (the values are examples)
  echo 500M > /opt/cgroup/01/memory.limit_in_bytes
  # add a limit to 01/child_b
  echo 100M > /opt/cgroup/01/child_b/memory.limit_in_bytes
  # run jobs under child_a and child_b

Create/delete the following groups at random while the jobs are running
(a driver loop is sketched below)::

  /opt/cgroup/01/child_a/child_aa
  /opt/cgroup/01/child_b/child_bb
  /opt/cgroup/01/child_c

Running new jobs in a new group is also good.

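A minimal sketch of such a random create/delete loop, using the group names
from above (this is one possible driver, not the only way)::

  while true; do
          mkdir /opt/cgroup/01/child_a/child_aa 2>/dev/null
          mkdir /opt/cgroup/01/child_b/child_bb 2>/dev/null
          rmdir /opt/cgroup/01/child_a/child_aa 2>/dev/null
          rmdir /opt/cgroup/01/child_b/child_bb 2>/dev/null
  done
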
9.6 Mount with other subsystems
-------------------------------

Mounting with other subsystems is a good test because there are
races and lock dependencies with other cgroup subsystems.

Example::

  # mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices

and do task move, mkdir, rmdir, etc. under this.

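A minimal sketch of such a mkdir/move/rmdir stress loop under the combined
mount (the group name is illustrative)::

  while true; do
          mkdir /cgroup/stress_test
          # cpuset requires cpus and mems to be set before tasks can attach
          cat /cgroup/cpuset.cpus > /cgroup/stress_test/cpuset.cpus
          cat /cgroup/cpuset.mems > /cgroup/stress_test/cpuset.mems
          echo $$ > /cgroup/stress_test/tasks
          echo $$ > /cgroup/tasks
          rmdir /cgroup/stress_test
  done
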
9.7 swapoff
-----------

Besides the fact that management of swap is one of the complicated parts
of memcg, the call path of swap-in at swapoff is not the same as the
usual swap-in path. It's worth testing explicitly.

For example, a test like the following is good:

(Shell-A)::

  # mount -t cgroup none /cgroup -o memory
  # mkdir /cgroup/test
  # echo 40M > /cgroup/test/memory.limit_in_bytes
  # echo 0 > /cgroup/test/tasks

Run a malloc(100M) program under this. You'll see 60M of swap.

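If no dedicated malloc program is at hand, the shell itself can be made to
hold the memory, as in section 9.10; the ``tr`` keeps the bytes non-NUL so
the shell really stores them::

  # a="$(dd if=/dev/zero bs=1M count=100 | tr '\0' 'x')"
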
(Shell-B)::

  # move all tasks in /cgroup/test to /cgroup
  # /sbin/swapoff -a
  # rmdir /cgroup/test
  # kill malloc task.

Of course, the tmpfs vs. swapoff case should be tested, too.

9.8 OOM-Killer
--------------

An out-of-memory condition caused by a memcg's limit will kill tasks
under the memcg. When hierarchy is used, a task under the hierarchy
will be killed by the kernel.

In this case, panic_on_oom shouldn't be invoked and tasks
in other groups shouldn't be killed.

It's not difficult to cause OOM under memcg, as follows.

Case A) when you can use swapoff::

  # swapoff -a
  # echo 50M > /memory.limit_in_bytes

run 51M of malloc

Case B) when you use mem+swap limitation::

  # echo 50M > memory.limit_in_bytes
  # echo 50M > memory.memsw.limit_in_bytes

run 51M of malloc

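The same shell idiom as in section 9.7 works as the 51M malloc here, e.g.::

  # a="$(dd if=/dev/zero bs=1M count=51 | tr '\0' 'x')"
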
9.9 Move charges at task migration
----------------------------------

Charges associated with a task can be moved along with task migration.

(Shell-A)::

  # mkdir /cgroup/A
  # echo $$ > /cgroup/A/tasks

run some programs which use some amount of memory in /cgroup/A.

(Shell-B)::

  # mkdir /cgroup/B
  # echo 1 > /cgroup/B/memory.move_charge_at_immigrate
  # echo "pid of the program running in group A" > /cgroup/B/tasks

You can see that charges have been moved by reading ``*.usage_in_bytes`` or
memory.stat of both A and B.

See 8.2 of Documentation/admin-guide/cgroup-v1/memory.rst for which value
should be written to move_charge_at_immigrate.

9.10 Memory thresholds
----------------------

The memory controller implements memory thresholds using the cgroups
notification API. You can use tools/cgroup/cgroup_event_listener.c to
test it.

(Shell-A) Create a cgroup and run the event listener::

  # mkdir /cgroup/A
  # ./cgroup_event_listener /cgroup/A/memory.usage_in_bytes 5M

(Shell-B) Add a task to the cgroup and try to allocate and free memory::

  # echo $$ > /cgroup/A/tasks
  # a="$(dd if=/dev/zero bs=1M count=10)"
  # a=

You will see a message from cgroup_event_listener every time you cross
the thresholds.

Use /cgroup/A/memory.memsw.usage_in_bytes to test memsw thresholds.

It's a good idea to test the root cgroup as well.