======================================================
Memory Resource Controller (Memcg) Implementation Memo
======================================================

Last Updated: 2010/2

Base Kernel Version: based on 2.6.33-rc7-mm (candidate for 34).

Because the VM is getting complex (one of the reasons is memcg...), memcg's
behavior is complex too. This is a document for memcg's internal behavior.
Please note that implementation details can change.

(*) Topics on the API should be in Documentation/admin-guide/cgroup-v1/memory.rst.

0. How to record usage?
========================

   Two objects are used.

   page_cgroup ... an object per page.

        Allocated at boot or memory hotplug. Freed at memory hot removal.

   swap_cgroup ... an entry per swp_entry.

        Allocated at swapon(). Freed at swapoff().

   The page_cgroup has a USED bit, so a page is never double-counted against
   a page_cgroup. swap_cgroup is used only when a charged page is swapped out.

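   For reference, the two objects looked roughly like this around the base
   kernel version of this memo (a simplified sketch; the field layout is an
   approximation, and later kernels reorganized this bookkeeping)::

        /* one per page, allocated at boot / memory hotplug */
        struct page_cgroup {
                unsigned long flags;            /* e.g. the USED bit */
                struct mem_cgroup *mem_cgroup;  /* who is charged */
                struct page *page;
                struct list_head lru;           /* per-memcg LRU list */
        };

        /* one per swp_entry, allocated at swapon() */
        struct swap_cgroup {
                unsigned short id;              /* id of the charged memcg */
        };
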
1. Charge
=========

   A page/swp_entry may be charged (usage += PAGE_SIZE) at

        mem_cgroup_try_charge()

2. Uncharge
===========

  A page/swp_entry may be uncharged (usage -= PAGE_SIZE) by

        mem_cgroup_uncharge()
          Called when a page's refcount goes down to 0.

        mem_cgroup_uncharge_swap()
          Called when swp_entry's refcnt goes down to 0. A charge against swap
          disappears.

3. Charge-commit-cancel
=======================

        Memcg pages are charged in two steps:

                - mem_cgroup_try_charge()
                - mem_cgroup_commit_charge() or mem_cgroup_cancel_charge()

        At try_charge(), there are no flags to say "this page is charged";
        at this point, usage += PAGE_SIZE.

        At commit(), the page is associated with the memcg.

        At cancel(), simply usage -= PAGE_SIZE.

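        A caller-side sketch of this protocol is below. It is illustrative
        only: the signatures are assumed from older kernels in which
        try/commit/cancel were separate hooks (they have changed since), and
        map_page_somewhere() is a hypothetical step that may fail::

                #include <linux/memcontrol.h>
                #include <linux/mm.h>

                /* hypothetical insertion step; returns nonzero on failure */
                extern int map_page_somewhere(struct page *page);

                static int charge_then_map(struct page *page, struct mm_struct *mm)
                {
                        struct mem_cgroup *memcg;
                        int err;

                        /* usage += PAGE_SIZE, but no flag says "charged" yet */
                        err = mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg);
                        if (err)
                                return err;

                        if (map_page_somewhere(page)) {
                                /* roll back: usage -= PAGE_SIZE */
                                mem_cgroup_cancel_charge(page, memcg);
                                return -ENOMEM;
                        }

                        /* the page is now associated with memcg */
                        mem_cgroup_commit_charge(page, memcg, false);
                        return 0;
                }
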
In the explanation below, we assume CONFIG_MEM_RES_CTRL_SWAP=y.

4. Anonymous
============

        An anonymous page is newly allocated at
                  - page fault into a MAP_ANONYMOUS mapping.
                  - Copy-On-Write.

        4.1 Swap-in.
        At swap-in, the page is taken from the swap cache. There are 2 cases.

        (a) If the SwapCache is newly allocated and read, it has no charges.
        (b) If the SwapCache has been mapped by processes, it has been
            charged already.

        4.2 Swap-out.
        At swap-out, the typical state transition is below.

        (a) add to swap cache. (marked as SwapCache)
            swp_entry's refcnt += 1.
        (b) fully unmapped.
            swp_entry's refcnt += # of ptes.
        (c) write back to swap.
        (d) delete from swap cache. (remove from SwapCache)
            swp_entry's refcnt -= 1.

        Finally, at task exit,
        (e) zap_pte() is called and swp_entry's refcnt -= 1 -> 0.

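        As a worked example, consider an anonymous page mapped by two ptes:
        the refcnt becomes 1 at (a), 3 after (b) (each pte now references the
        swp_entry), stays 3 across (c), drops to 2 at (d), and the two
        zap_pte() calls at (e) bring it from 2 to 0, at which point the charge
        against swap disappears via mem_cgroup_uncharge_swap().
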
5. Page Cache
=============

        Page Cache is charged at
        - filemap_add_folio().

        The logic is very clear. (About migration, see below.)

        Note:
          __remove_from_page_cache() is called by remove_from_page_cache()
          and __remove_mapping().

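        A sketch of the call-site pattern is below (illustrative only: error
        handling is reduced to the minimum, and the mapping/index are assumed
        to come from the caller)::

                #include <linux/pagemap.h>

                static int add_one_folio(struct address_space *mapping, pgoff_t index)
                {
                        struct folio *folio = filemap_alloc_folio(GFP_KERNEL, 0);
                        int err;

                        if (!folio)
                                return -ENOMEM;
                        /*
                         * filemap_add_folio() charges the folio to the charging
                         * task's memcg (usage += folio size) before linking it
                         * into the mapping.
                         */
                        err = filemap_add_folio(mapping, folio, index, GFP_KERNEL);
                        if (err)
                                folio_put(folio);
                        return err;
                }
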
6. Shmem (tmpfs) Page Cache
===========================

        The best way to understand shmem's page state transition is to read
        mm/shmem.c.

        But a brief explanation of the behavior of memcg around shmem will be
        helpful to understand the logic.

        Shmem's page (just leaf page, not direct/indirect block) can be on

                - radix-tree of shmem's inode.
                - SwapCache.
                - Both on radix-tree and SwapCache. This happens at swap-in
                  and swap-out.

        It's charged when...

        - A new page is added to shmem's radix-tree.
        - A swapped page is read back. (a charge is moved from swap_cgroup to
          page_cgroup)

7. Page Migration
=================

        mem_cgroup_migrate()

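        At migration, the charge follows the data to the new page. A minimal
        sketch is below (the folio-based signature is assumed from recent
        kernels; older kernels passed struct page pointers)::

                #include <linux/memcontrol.h>

                static void hand_over_charge(struct folio *old, struct folio *new)
                {
                        /* contents, flags and mappings were already copied
                         * by the migration core */
                        mem_cgroup_migrate(old, new);   /* "new" now owns the charge */
                }
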
8. LRU
======

        Each memcg has its own vector of LRUs (inactive anon, active anon,
        inactive file, active file, unevictable) of pages from each node,
        each LRU handled under a single lru_lock for that memcg and node.

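        Conceptually (this is a sketch, not the kernel's actual definition),
        the per-memcg, per-node bookkeeping looks like::

                #include <linux/list.h>
                #include <linux/spinlock.h>

                enum sketch_lru {
                        SKETCH_INACTIVE_ANON,
                        SKETCH_ACTIVE_ANON,
                        SKETCH_INACTIVE_FILE,
                        SKETCH_ACTIVE_FILE,
                        SKETCH_UNEVICTABLE,
                        NR_SKETCH_LRU,
                };

                /* one instance per (memcg, node) pair */
                struct sketch_memcg_node_lru {
                        spinlock_t lru_lock;                    /* single lock per memcg and node */
                        struct list_head lists[NR_SKETCH_LRU];  /* one list per LRU type */
                };
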
9. Typical Tests.
=================

Tests for racy cases.

9.1 Small limit to memcg.
-------------------------

        When you test racy cases, it is better to set memcg's limit very
        small, rather than in GB. Many races were found in tests under
        xKB or xxMB limits.

        (Memory behavior under GB limits and memory behavior under MB limits
        show very different situations.)

9.2 Shmem
---------

        Historically, memcg's shmem handling was poor and we saw some amount
        of trouble here. This is because shmem is page cache but can also be
        SwapCache. Testing with shmem/tmpfs is always a good test.

9.3 Migration
-------------

        For NUMA, migration is another special case. cpuset is useful for
        easy testing. The following is a sample script to do migration::

                mount -t cgroup -o cpuset none /opt/cpuset

                mkdir /opt/cpuset/01
                echo 1 > /opt/cpuset/01/cpuset.cpus
                echo 0 > /opt/cpuset/01/cpuset.mems
                echo 1 > /opt/cpuset/01/cpuset.memory_migrate
                mkdir /opt/cpuset/02
                echo 1 > /opt/cpuset/02/cpuset.cpus
                echo 1 > /opt/cpuset/02/cpuset.mems
                echo 1 > /opt/cpuset/02/cpuset.memory_migrate

        In the above setup, when you move a task from 01 to 02, page migration
        from node 0 to node 1 will occur. The following is a script to migrate
        all tasks under a cpuset::

                move_task()
                {
                        for pid in $1
                        do
                                /bin/echo $pid >$2/tasks 2>/dev/null
                                echo -n $pid
                                echo -n " "
                        done
                        echo END
                }

                G1_TASK=`cat ${G1}/tasks`
                G2_TASK=`cat ${G2}/tasks`
                move_task "${G1_TASK}" ${G2} &

9.4 Memory hotplug
------------------

        The memory hotplug test is another good test.

        To offline memory, do the following::

                # echo offline > /sys/devices/system/memory/memoryXXX/state

        (XXX is the number of the memory block.)

        This is an easy way to test page migration, too.

9.5 nested cgroups
------------------

        Use tests like the following for testing nested cgroups::

                mkdir /opt/cgroup/01/child_a
                mkdir /opt/cgroup/01/child_b

                set limit to 01.
                add limit to 01/child_b
                run jobs under child_a and child_b

        Create/delete the following groups at random while jobs are running::

                /opt/cgroup/01/child_a/child_aa
                /opt/cgroup/01/child_b/child_bb
                /opt/cgroup/01/child_c

        Running new jobs in a new group is also good.

9.6 Mount with other subsystems
-------------------------------

        Mounting with other subsystems is a good test because there are
        races and lock dependencies with the other cgroup subsystems.

        Example::

                # mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices

        and do task move, mkdir, rmdir etc... under this.

9.7 swapoff
-----------

        Besides the fact that swap management is one of the complicated parts
        of memcg, the swap-in call path at swapoff is not the same as the
        usual swap-in path. It is worth testing explicitly.

        For example, a test like the following is good:

        (Shell-A)::

                # mount -t cgroup none /cgroup -o memory
                # mkdir /cgroup/test
                # echo 40M > /cgroup/test/memory.limit_in_bytes
                # echo 0 > /cgroup/test/tasks

        Run a malloc(100M) program under this. You'll see 60M of swap usage.
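
        A minimal program for this is sketched below. The helper name
        alloc_touch.c and its argv-based size are hypothetical (it is not part
        of the kernel tree); because the size in MiB is an argument, the same
        binary can also be used for the 51M runs in 9.8::

                /* alloc_touch.c - allocate N MiB (default 100) and keep
                 * touching every page so the working set really stays that
                 * large. Build with: gcc -o alloc_touch alloc_touch.c */
                #include <stdio.h>
                #include <stdlib.h>

                int main(int argc, char **argv)
                {
                        size_t mb = argc > 1 ? strtoul(argv[1], NULL, 10) : 100;
                        size_t size = mb << 20;
                        volatile char *buf = malloc(size);

                        if (!buf) {
                                perror("malloc");
                                return 1;
                        }
                        for (;;)        /* loop until killed from Shell-B */
                                for (size_t off = 0; off < size; off += 4096)
                                        buf[off] = 1;
                }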

        (Shell-B)::

                # move all tasks in /cgroup/test to /cgroup
                # /sbin/swapoff -a
                # rmdir /cgroup/test
                # kill malloc task.

        Of course, the tmpfs vs. swapoff case should be tested, too.

9.8 OOM-Killer
--------------

        An out-of-memory condition caused by memcg's limit will kill tasks
        under the memcg. When hierarchy is used, a task under the hierarchy
        will be killed by the kernel.

        In this case, panic_on_oom shouldn't be invoked and tasks
        in other groups shouldn't be killed.

        It's not difficult to cause OOM under memcg, as follows.

        Case A) when you can swapoff::

                # swapoff -a
                # echo 50M > /memory.limit_in_bytes

        Run a program that mallocs 51M.

        Case B) when you use mem+swap limitation::

                # echo 50M > memory.limit_in_bytes
                # echo 50M > memory.memsw.limit_in_bytes

        Run a program that mallocs 51M.

9.9 Move charges at task migration
----------------------------------

        Charges associated with a task can be moved along with task migration.

        (Shell-A)::

                # mkdir /cgroup/A
                # echo $$ > /cgroup/A/tasks

        Run some programs which use some amount of memory in /cgroup/A.

        (Shell-B)::

                # mkdir /cgroup/B
                # echo 1 > /cgroup/B/memory.move_charge_at_immigrate
                # echo "pid of the program running in group A" > /cgroup/B/tasks

        You can see that charges have been moved by reading
        ``*.usage_in_bytes`` or memory.stat of both A and B.

        See 8.2 of Documentation/admin-guide/cgroup-v1/memory.rst to see what
        value should be written to move_charge_at_immigrate.

9.10 Memory thresholds
----------------------

        The memory controller implements memory thresholds using the cgroups
        notification API. You can use tools/cgroup/cgroup_event_listener.c to
        test it.

        (Shell-A) Create a cgroup and run the event listener::

                # mkdir /cgroup/A
                # ./cgroup_event_listener /cgroup/A/memory.usage_in_bytes 5M

        (Shell-B) Add a task to the cgroup and try to allocate and free memory::

                # echo $$ > /cgroup/A/tasks
                # a="$(dd if=/dev/zero bs=1M count=10)"
                # a=

        You will see a message from cgroup_event_listener every time you cross
        a threshold.

        Use /cgroup/A/memory.memsw.usage_in_bytes to test memsw thresholds.

        It's a good idea to test the root cgroup as well.