.. _swap_numa:

===========================================
Automatically bind swap device to NUMA node
===========================================

If the system has more than one swap device and the swap devices have node
information, we can use this information to decide which swap device to use
in get_swap_pages() to get better performance.


How to use this feature
=======================

Each swap device has a priority, which decides the order in which the devices
are used. To make use of automatic binding, there is no need to manipulate the
priority settings of the swap devices. For example, on a 2-node machine,
assume two swap devices, swapA and swapB, are going to be swapped on, with
swapA attached to node 0 and swapB attached to node 1. Simply swap them on by
doing::

    # swapon /dev/swapA
    # swapon /dev/swapB

Then node 0 will use the two swap devices in the order swapA then swapB, and
node 1 will use them in the order swapB then swapA. Note that the order in
which the devices are swapped on doesn't matter.
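As a rough illustration of the ordering described above, the following Python
sketch models which device each node tries first. It is only a model, not
kernel code: the helper name ``node_swap_order`` and the dictionary layout are
assumptions made for this example, and it assumes that no device was given an
explicit priority.

.. code-block:: python

    def node_swap_order(devices, node):
        """Return the swap devices in the order `node` would try them.

        `devices` maps a device name to the NUMA node it is attached to,
        listed in swapon order. Hypothetical helper for illustration only.
        """
        local = [name for name, nid in devices.items() if nid == node]
        remote = [name for name, nid in devices.items() if nid != node]
        # Devices attached to the local node come first, the rest follow.
        return local + remote


    # swapA is attached to node 0, swapB to node 1.
    devices = {"swapA": 0, "swapB": 1}
    print(node_swap_order(devices, 0))  # ['swapA', 'swapB']
    print(node_swap_order(devices, 1))  # ['swapB', 'swapA']

    # Swapping the devices on in the opposite order gives the same result.
    reversed_devices = {"swapB": 1, "swapA": 0}
    print(node_swap_order(reversed_devices, 0))  # ['swapA', 'swapB']

The sketch ignores round-robin use of several devices on the same node, which
does not arise in this two-device setup.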

A more complex example on a 4-node machine: assume six swap devices are going
to be swapped on, where swapA and swapB are attached to node 0, swapC is
attached to node 1, swapD and swapE are attached to node 2 and swapF is
attached to node 3. The way to swap them on is the same as above::

    # swapon /dev/swapA
    # swapon /dev/swapB
    # swapon /dev/swapC
    # swapon /dev/swapD
    # swapon /dev/swapE
    # swapon /dev/swapF

Then node 0 will use them in the order of::

    swapA/swapB -> swapC -> swapD -> swapE -> swapF

swapA and swapB will be used in round-robin mode before any other swap device.

Node 1 will use them in the order of::

    swapC -> swapA -> swapB -> swapD -> swapE -> swapF

Node 2 will use them in the order of::

    swapD/swapE -> swapA -> swapB -> swapC -> swapF

Similarly, swapD and swapE will be used in round-robin mode before any
other swap devices.

Node 3 will use them in the order of::

    swapF -> swapA -> swapB -> swapC -> swapD -> swapE


Implementation details
======================

The current code uses a priority-based list, swap_avail_list, to decide
which swap device to use; if multiple swap devices share the same priority,
they are used round robin. This change replaces the single global
swap_avail_list with a per-NUMA-node list, i.e. each NUMA node sees its own
priority-based list of available swap devices. A swap device's priority can
be promoted on its matching node's swap_avail_list.

Currently, a swap device's priority is set as follows: the user can set a
value >= 0, or the system will pick one, starting from -1 and going downwards.
The priority value in the swap_avail_list is the negated value of the swap
device's priority, because a plist is sorted from low to high. The new policy
doesn't change the semantics for the priority >= 0 case. The previous
"starting from -1 then downwards" now becomes "starting from -2 then
downwards", and -1 is reserved as the promoted value. So if multiple swap
devices are attached to the same node, they will all be promoted to priority
-1 on that node's plist and will be used round robin before any other swap
device whose priority was picked by the system.
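The priority rules above can be modelled in a few lines of Python. This is a
sketch for illustration only, not the kernel implementation:
``assign_priorities`` and ``node_order`` are hypothetical names, and the model
assumes that the promoted priority -1 is stored on the per-node list as its
negated value 1, as the description of plist ordering implies.

.. code-block:: python

    def assign_priorities(devices):
        """Give each device a priority following the rules in the text.

        `devices` is a list of (name, node, requested) tuples in swapon
        order; `requested` is a user priority >= 0, or None to let the
        system pick one.
        """
        least_priority = -1  # -1 itself is reserved for promotion
        result = []
        for name, node, requested in devices:
            if requested is None:
                least_priority -= 1  # first automatic priority is -2
                prio = least_priority
            else:
                prio = requested
            result.append((name, node, prio))
        return result


    def node_order(devices, node):
        """Return round-robin groups of device names in the order `node` uses them."""
        keyed = []
        for name, dev_node, prio in assign_priorities(devices):
            if prio < 0 and dev_node == node:
                prio = -1  # promoted on the matching node's list
            keyed.append((-prio, name))  # negated: the list sorts low to high
        keyed.sort(key=lambda item: item[0])  # stable: ties keep swapon order

        groups = []
        last_key = None
        for key, name in keyed:
            if key == last_key:
                groups[-1].append(name)  # same key: used round robin
            else:
                groups.append([name])
                last_key = key
        return groups


    # The 4-node example: (name, node, requested priority).
    devices = [
        ("swapA", 0, None), ("swapB", 0, None), ("swapC", 1, None),
        ("swapD", 2, None), ("swapE", 2, None), ("swapF", 3, None),
    ]
    print(node_order(devices, 0))
    # [['swapA', 'swapB'], ['swapC'], ['swapD'], ['swapE'], ['swapF']]
    print(node_order(devices, 1))
    # [['swapC'], ['swapA'], ['swapB'], ['swapD'], ['swapE'], ['swapF']]

With the devices swapped on in this order, the model reproduces the per-node
orders listed in the 4-node example. Under these assumptions, the order among
devices that are not local to a node follows their automatically assigned
priorities.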