==============
Memory Hotplug
==============

Created:                                        Jul 28 2007
Add description of notifier of memory hotplug   Oct 11 2007

This document describes memory hotplug, including how to use it and its
current status. Because memory hotplug is still under development, the
contents of this text will change often.

1. Introduction
  1.1 purpose of memory hotplug
  1.2. Phases of memory hotplug
  1.3. Unit of Memory online/offline operation
2. Kernel Configuration
3. sysfs files for memory hotplug
4. Physical memory hot-add phase
  4.1 Hardware(Firmware) Support
  4.2 Notify memory hot-add event by hand
5. Logical Memory hot-add phase
  5.1. State of memory
  5.2. How to online memory
6. Logical memory remove
  6.1 Memory offline and ZONE_MOVABLE
  6.2. How to offline memory
7. Physical memory remove
8. Memory hotplug event notifier
9. Future Work

Note(1): x86_64 has a special implementation of memory hotplug.
         This text does not describe it.
Note(2): This text assumes that sysfs is mounted at /sys.


---------------
1. Introduction
---------------

1.1 purpose of memory hotplug
------------
Memory Hotplug allows users to increase/decrease the amount of memory.
Generally, there are two purposes.

(A) For changing the amount of memory.
    This is to allow a feature like capacity on demand.
(B) For installing/removing DIMMs or NUMA-nodes physically.
    This is to exchange DIMMs/NUMA-nodes, reduce power consumption, etc.

(A) is required by highly virtualized environments and (B) is required by
hardware which supports memory power management.

Linux memory hotplug is designed for both purposes.


1.2. Phases of memory hotplug
---------------
There are 2 phases in Memory Hotplug.
  1) Physical Memory Hotplug phase
  2) Logical Memory Hotplug phase.

The first phase is to communicate with hardware/firmware and to set up or
tear down the environment for hotplugged memory. This phase is necessary
for purpose (B), but it is also a good phase for communication with
highly virtualized environments.

When memory is hotplugged, the kernel recognizes new memory, makes new memory
management tables, and makes sysfs files for new memory's operation.

If firmware supports notifying the OS that new memory has been connected,
this phase is triggered automatically. ACPI can notify this event. If not,
a "probe" operation by the system administrator is used instead
(see Section 4).

The Logical Memory Hotplug phase changes the memory state to
available/unavailable for users. The amount of memory seen by the user is
changed by this phase. When a memory range becomes available, the kernel
makes all memory in it available as free pages.

In this document, this phase is described as online/offline.

The Logical Memory Hotplug phase is triggered when the system administrator
writes to a sysfs file. For the hot-add case, it must be executed by hand
after the Physical Hotplug phase.
(However, if you write udev hotplug scripts for memory hotplug, these
 phases can be executed seamlessly.)


1.3. Unit of Memory online/offline operation
------------
Memory hotplug uses the SPARSEMEM memory model, which allows memory to be
divided into chunks of the same size. These chunks are called "sections". The
size of a memory section is architecture dependent. For example, power uses
16MiB, ia64 uses 1GiB.

Memory sections are combined into chunks referred to as "memory blocks". The
size of a memory block is architecture dependent and represents the logical
unit upon which memory online/offline operations are to be performed. The
default size of a memory block is the same as memory section size unless an
architecture specifies otherwise. (see Section 3.)

To determine the size (in bytes) of a memory block please read this file:

/sys/devices/system/memory/block_size_bytes


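As a minimal sketch, the helper below reads the block size and prints it as a
decimal number. It assumes that the file holds a hexadecimal value without a
"0x" prefix; please verify this on your own system. MEMORY_SYSFS and
block_size_bytes() are names made up for this example; MEMORY_SYSFS exists
only so that the helper can be pointed at a test directory.

```shell
# MEMORY_SYSFS is a hypothetical variable for this sketch; override it to
# run against a fake directory tree instead of the real sysfs.
MEMORY_SYSFS="${MEMORY_SYSFS:-/sys/devices/system/memory}"

# Print the memory block size in bytes, as a decimal number.
# Assumption: block_size_bytes contains hex digits without a "0x" prefix.
block_size_bytes() {
    hex=$(cat "$MEMORY_SYSFS/block_size_bytes") || return 1
    echo $(( 0x$hex ))
}
```

If the file contains 8000000, block_size_bytes prints 134217728 (128MiB).
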
-----------------------
2. Kernel Configuration
-----------------------
To use the memory hotplug feature, the kernel must be compiled with the
following config options.

- For all memory hotplug
    Memory model -> Sparse Memory  (CONFIG_SPARSEMEM)
    Allow for memory hot-add       (CONFIG_MEMORY_HOTPLUG)

- To enable memory removal, the following are also necessary
    Allow for memory hot remove    (CONFIG_MEMORY_HOTREMOVE)
    Page Migration                 (CONFIG_MIGRATION)

- For ACPI memory hotplug, the following are also necessary
    Memory hotplug (under ACPI Support menu) (CONFIG_ACPI_HOTPLUG_MEMORY)
    This option can be built as a kernel module.

- As a related configuration, if your box has a feature of NUMA-node hotplug
  via ACPI, then this option is necessary too.
    ACPI0004,PNP0A05 and PNP0A06 Container Driver (under ACPI Support menu)
    (CONFIG_ACPI_CONTAINER).
    This option can be built as a kernel module too.


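One way to check an existing kernel against the options above is to search its
config file. The sketch below assumes that such a file is available and takes
its path as an argument (/boot/config-$(uname -r) is a common location, but
that is an assumption about your system). check_hotplug_config is a made-up
name, and only the four non-ACPI options are checked.

```shell
# Report, for each option needed for memory hot-add and hot-remove, whether
# it is set to "y" in the kernel config file $1.
# Returns non-zero if at least one option is missing.
check_hotplug_config() {
    config=$1
    missing=0
    for opt in CONFIG_SPARSEMEM CONFIG_MEMORY_HOTPLUG \
               CONFIG_MEMORY_HOTREMOVE CONFIG_MIGRATION; do
        if grep -q "^${opt}=y\$" "$config"; then
            echo "$opt: enabled"
        else
            echo "$opt: missing"
            missing=1
        fi
    done
    return "$missing"
}
```

For example: % check_hotplug_config /boot/config-$(uname -r)
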
---------------------------------
3. sysfs files for memory hotplug
---------------------------------
All memory blocks have their device information in sysfs.  Each memory block
is described under /sys/devices/system/memory as

/sys/devices/system/memory/memoryXXX
(XXX is the memory block id.)

For the memory block covered by the sysfs directory, it is expected that all
memory sections in this range are present and no memory holes exist in the
range. Currently there is no way to determine if there is a memory hole, but
the existence of one should not affect the hotplug capabilities of the memory
block.

For example, assume 1GiB memory block size. A device for a memory starting at
0x100000000 is /sys/devices/system/memory/memory4
(0x100000000 / 1GiB = 4)
This device covers address range [0x100000000 ... 0x140000000)

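The arithmetic of the example above can be reproduced with shell arithmetic.
This is only an illustration; block_id_for_addr and block_range are made-up
names, and the block size is passed in by hand instead of being read from
sysfs.

```shell
# Print the id of the memory block that contains physical address $1,
# given the memory block size $2 (both in bytes; hex values are accepted).
block_id_for_addr() {
    echo $(( $1 / $2 ))
}

# Print the half-open address range [start ... end) covered by memory block
# id $1 for memory block size $2.
block_range() {
    printf '[0x%x ... 0x%x)\n' $(( $1 * $2 )) $(( ($1 + 1) * $2 ))
}
```

With a 1GiB (0x40000000) memory block size:

% block_id_for_addr 0x100000000 0x40000000
4
% block_range 4 0x40000000
[0x100000000 ... 0x140000000)
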
Under each memory block, you can see 5 files:

/sys/devices/system/memory/memoryXXX/phys_index
/sys/devices/system/memory/memoryXXX/phys_device
/sys/devices/system/memory/memoryXXX/state
/sys/devices/system/memory/memoryXXX/removable
/sys/devices/system/memory/memoryXXX/valid_zones

'phys_index'      : read-only and contains memory block id, same as XXX.
'state'           : read-write
                    at read:  contains online/offline state of memory.
                    at write: user can specify "online_kernel",
                    "online_movable", "online", "offline" command
                    which will be performed on all sections in the block.
'phys_device'     : read-only: designed to show the name of physical memory
                    device.  This is not well implemented now.
'removable'       : read-only: contains an integer value indicating
                    whether the memory block is removable or not
                    removable.  A value of 1 indicates that the memory
                    block is removable and a value of 0 indicates that
                    it is not removable. A memory block is removable only if
                    every section in the block is removable.
'valid_zones'     : read-only: designed to show which zones this memory block
                    can be onlined to.
                    The first column shows its default zone.
                    "memory6/valid_zones: Normal Movable" shows this memory
                    block can be onlined to ZONE_NORMAL by default and to
                    ZONE_MOVABLE by online_movable.
                    "memory7/valid_zones: Movable Normal" shows this memory
                    block can be onlined to ZONE_MOVABLE by default and to
                    ZONE_NORMAL by online_kernel.

NOTE:
  These directories/files appear after physical memory hotplug phase.

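The files described above can be summarized for all memory blocks with a small
loop. This is a sketch, not a tool shipped with the kernel: MEMORY_SYSFS and
show_memory_blocks are made-up names, and a file that cannot be read (for
example on a kernel without valid_zones) is shown as "-".

```shell
# MEMORY_SYSFS is a hypothetical variable for this sketch; override it to
# run against a fake directory tree instead of the real sysfs.
MEMORY_SYSFS="${MEMORY_SYSFS:-/sys/devices/system/memory}"

# Print one line per memory block with the contents of its state,
# removable and valid_zones files ("-" if a file cannot be read).
show_memory_blocks() {
    for dir in "$MEMORY_SYSFS"/memory[0-9]*; do
        [ -d "$dir" ] || continue
        state=$(cat "$dir/state" 2>/dev/null) || state=-
        removable=$(cat "$dir/removable" 2>/dev/null) || removable=-
        zones=$(cat "$dir/valid_zones" 2>/dev/null) || zones=-
        printf '%s: state=%s removable=%s valid_zones=%s\n' \
            "${dir##*/}" "$state" "$removable" "$zones"
    done
}
```

The blocks are listed in the order in which the shell expands the glob, which
is alphabetical and not numerical.
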
If CONFIG_NUMA is enabled the memoryXXX/ directories can also be accessed
via symbolic links located in the /sys/devices/system/node/node* directories.

For example:
/sys/devices/system/node/node0/memory9 -> ../../memory/memory9

A backlink will also be created:
/sys/devices/system/memory/memory9/node0 -> ../../node/node0


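The backlink makes it easy to find the node of a given memory block. The
following sketch looks for a nodeN symbolic link inside the memory block
directory; MEMORY_SYSFS and block_node are made-up names for this example.

```shell
# MEMORY_SYSFS is a hypothetical variable for this sketch; override it to
# run against a fake directory tree instead of the real sysfs.
MEMORY_SYSFS="${MEMORY_SYSFS:-/sys/devices/system/memory}"

# Print the name of the node (for example "node0") that memory block id $1
# is linked to. Returns non-zero if the block has no node backlink.
block_node() {
    for link in "$MEMORY_SYSFS/memory$1"/node[0-9]*; do
        [ -L "$link" ] || continue
        echo "${link##*/}"
        return 0
    done
    return 1
}
```

With the links shown above, "block_node 9" prints node0.
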
--------------------------------
4. Physical memory hot-add phase
--------------------------------

4.1 Hardware(Firmware) Support
------------
On x86_64/ia64 platforms, memory hotplug by ACPI is supported.

In general, firmware (ACPI) which supports memory hotplug defines a
memory class object with _HID "PNP0C80". When a notify is asserted to PNP0C80,
Linux's ACPI handler hot-adds memory to the system and calls a hotplug udev
script. This is done automatically.

However, scripts for memory hotplug are not (currently) contained in the
generic udev package. You may have to write them yourself or online/offline
memory by hand. Please see "How to online memory" and "How to offline memory"
in this text.

If firmware supports NUMA-node hotplug and defines an object with _HID
"ACPI0004", "PNP0A05", or "PNP0A06", notification is asserted to it, and the
ACPI handler calls hotplug code for all of the objects which are defined in
it. If a memory device is found, the memory hotplug code will be called.


4.2 Notify memory hot-add event by hand
------------
On some architectures, the firmware may not notify the kernel of a memory
hotplug event.  Therefore, the memory "probe" interface is supported to
explicitly notify the kernel.  This interface depends on
CONFIG_ARCH_MEMORY_PROBE and can be configured on powerpc, sh, and x86
if hotplug is supported, although for x86 this should be handled by ACPI
notification.

The probe interface is located at
/sys/devices/system/memory/probe

You can tell the kernel the physical address of new memory by

% echo start_address_of_new_memory > /sys/devices/system/memory/probe

Then, the memory range [start_address_of_new_memory,
start_address_of_new_memory + memory_block_size) is hot-added. In this case,
the hotplug script is not called (in the current implementation). You'll have
to online the memory yourself.  Please see "How to online memory" in this text.


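A probe can be wrapped in a small helper that performs a few sanity checks
first. This is a sketch under assumptions: it assumes that the start address
has to be aligned to the memory block size and that block_size_bytes holds a
hexadecimal value without a "0x" prefix. MEMORY_SYSFS and probe_memory are
made-up names for this example.

```shell
# MEMORY_SYSFS is a hypothetical variable for this sketch; override it to
# run against a fake directory tree instead of the real sysfs.
MEMORY_SYSFS="${MEMORY_SYSFS:-/sys/devices/system/memory}"

# Ask the kernel to hot-add the memory block starting at physical address $1.
# Assumption: the address must be a multiple of the memory block size, so
# that is checked before anything is written to the probe file.
probe_memory() {
    addr=$1
    probe="$MEMORY_SYSFS/probe"
    if [ ! -e "$probe" ]; then
        echo "no probe file (CONFIG_ARCH_MEMORY_PROBE not set?)" >&2
        return 1
    fi
    hex=$(cat "$MEMORY_SYSFS/block_size_bytes") || return 1
    size=$(( 0x$hex ))
    if [ $(( $addr % $size )) -ne 0 ]; then
        echo "$addr is not aligned to the memory block size" >&2
        return 1
    fi
    printf '0x%x\n' "$addr" > "$probe"
}
```

For example: % probe_memory 0x100000000
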
-------------------------------
5. Logical Memory hot-add phase
-------------------------------

5.1. State of memory
------------
To see the (online/offline) state of a memory block, read the 'state' file.

% cat /sys/devices/system/memory/memoryXXX/state

If the memory block is online, you'll read "online".
If the memory block is offline, you'll read "offline".


5.2. How to online memory
------------
When the memory is hot-added, the kernel decides whether or not to "online"
it according to the policy which can be read from "auto_online_blocks" file:

% cat /sys/devices/system/memory/auto_online_blocks

The default depends on the CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel config
option. If it is disabled the default is "offline" which means the newly added
memory is not in a ready-to-use state and you have to "online" the newly added
memory blocks manually. Automatic onlining can be requested by writing "online"
to "auto_online_blocks" file:

% echo online > /sys/devices/system/memory/auto_online_blocks

This sets a global policy and impacts all memory blocks that will subsequently
be hotplugged. Currently offline blocks keep their state. It is possible, under
certain circumstances, that some memory blocks will be added but will fail to
online. User space tools can check their "state" files
(/sys/devices/system/memory/memoryXXX/state) and try to online them manually.

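Setting the policy and confirming it can be combined in one helper. The sketch
below only accepts the two values mentioned in this text and reads the file
back after writing; MEMORY_SYSFS and set_auto_online are made-up names for
this example.

```shell
# MEMORY_SYSFS is a hypothetical variable for this sketch; override it to
# run against a fake directory tree instead of the real sysfs.
MEMORY_SYSFS="${MEMORY_SYSFS:-/sys/devices/system/memory}"

# Set the global auto-online policy to "online" or "offline" and confirm it
# by reading the file back. Returns non-zero on any failure.
set_auto_online() {
    case $1 in
        online|offline) ;;
        *) echo "usage: set_auto_online online|offline" >&2; return 2 ;;
    esac
    file="$MEMORY_SYSFS/auto_online_blocks"
    [ -e "$file" ] || return 1
    echo "$1" > "$file" || return 1
    [ "$(cat "$file")" = "$1" ]
}
```

For example: % set_auto_online online
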
If the automatic onlining wasn't requested, failed, or some memory block was
offlined it is possible to change the individual block's state by writing to the
"state" file:

% echo online > /sys/devices/system/memory/memoryXXX/state

This onlining will not change the ZONE type of the target memory block.
If the memory block is in ZONE_NORMAL, you can change it to ZONE_MOVABLE:

% echo online_movable > /sys/devices/system/memory/memoryXXX/state
(NOTE: current limit: this memory block must be adjacent to ZONE_MOVABLE)

And if the memory block is in ZONE_MOVABLE, you can change it to ZONE_NORMAL:

% echo online_kernel > /sys/devices/system/memory/memoryXXX/state
(NOTE: current limit: this memory block must be adjacent to ZONE_NORMAL)

After this, memory block XXX's state will be 'online' and the amount of
available memory will be increased.

Currently, newly added memory is added as ZONE_NORMAL (for powerpc, ZONE_DMA).
This may be changed in the future.


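When several memory blocks were hot-added, onlining them one by one is
tedious. As a sketch, the helper below writes the requested command to every
memory block whose state currently reads "offline". MEMORY_SYSFS and
online_offline_blocks are made-up names for this example.

```shell
# MEMORY_SYSFS is a hypothetical variable for this sketch; override it to
# run against a fake directory tree instead of the real sysfs.
MEMORY_SYSFS="${MEMORY_SYSFS:-/sys/devices/system/memory}"

# Write $1 (default "online") to the state file of every memory block that
# currently reads "offline". Reports each block that could not be onlined
# and returns non-zero if there was at least one.
online_offline_blocks() {
    mode=${1:-online}
    rc=0
    for dir in "$MEMORY_SYSFS"/memory[0-9]*; do
        [ -f "$dir/state" ] || continue
        [ "$(cat "$dir/state")" = offline ] || continue
        if ! echo "$mode" > "$dir/state" 2>/dev/null; then
            echo "failed to online ${dir##*/}" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

For example: % online_offline_blocks online_movable
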
------------------------
6. Logical memory remove
------------------------

6.1 Memory offline and ZONE_MOVABLE
------------
Memory offlining is more complicated than memory online. Because memory offline
has to make the whole memory block be unused, memory offline can fail if
the memory block includes memory which cannot be freed.

In general, memory offline can use 2 techniques.

(1) reclaim and free all memory in the memory block.
(2) migrate all pages in the memory block.

In the current implementation, Linux's memory offline uses method (2), freeing
all pages in the memory block by page migration. But not all pages are
migratable. Under current Linux, migratable pages are anonymous pages and
page caches. For offlining a memory block by migration, the kernel has to
guarantee that the memory block contains only migratable pages.

Now, a boot option for making a memory block which consists of migratable pages
is supported. By specifying the "kernelcore=" or "movablecore=" boot option, you
can create ZONE_MOVABLE, a zone which is used only for movable pages.
(See also Documentation/admin-guide/kernel-parameters.rst)

Assuming the system has "TOTAL" amount of memory at boot time, these boot
options create ZONE_MOVABLE as follows.

1) When kernelcore=YYYY boot option is used,
  Size of memory not for movable pages (not for offline) is YYYY.
  Size of memory for movable pages (for offline) is TOTAL-YYYY.

2) When movablecore=ZZZZ boot option is used,
  Size of memory not for movable pages (not for offline) is TOTAL - ZZZZ.
  Size of memory for movable pages (for offline) is ZZZZ.


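The two cases above are plain subtraction, which the following sketch spells
out. It only mirrors the arithmetic in this text; the sizes the kernel
actually ends up with may differ, for example because of rounding or because
the memory is spread over several nodes. Both function names are made up, and
all sizes are integers in one unit of your choice (MiB below).

```shell
# kernelcore=YYYY: print "<not-movable size> <movable size>" for TOTAL $1
# and YYYY $2.
sizes_for_kernelcore() {
    echo "$2 $(( $1 - $2 ))"
}

# movablecore=ZZZZ: print "<not-movable size> <movable size>" for TOTAL $1
# and ZZZZ $2.
sizes_for_movablecore() {
    echo "$(( $1 - $2 )) $2"
}
```

With TOTAL = 16384 (16GiB in MiB):

% sizes_for_kernelcore 16384 4096
4096 12288
% sizes_for_movablecore 16384 4096
12288 4096
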
Note: Unfortunately, there is no information to show which memory block belongs
to ZONE_MOVABLE. This is TBD.


6.2. How to offline memory
------------
You can offline a memory block by using the same sysfs interface that was used
in memory onlining.

% echo offline > /sys/devices/system/memory/memoryXXX/state

If offline succeeds, the state of the memory block is changed to be "offline".
If it fails, some error code (like -EBUSY) will be returned by the kernel.
Even if a memory block does not belong to ZONE_MOVABLE, you can try to offline
it.  If it doesn't contain 'unmovable' memory, the offline will succeed.

A memory block under ZONE_MOVABLE is considered to be able to be offlined
easily.  But under some busy state, it may return -EBUSY. Even if a memory
block cannot be offlined due to -EBUSY, you can retry offlining it and may be
able to offline it (or not). (For example, a page is referred to by some kernel
internal call and released soon.)

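Retrying as described above can be sketched as a small loop around the write
to the "state" file. The number of attempts and the delay between them are
arbitrary choices for this example, not values taken from the kernel, and
MEMORY_SYSFS and offline_block are made-up names.

```shell
# MEMORY_SYSFS is a hypothetical variable for this sketch; override it to
# run against a fake directory tree instead of the real sysfs.
MEMORY_SYSFS="${MEMORY_SYSFS:-/sys/devices/system/memory}"

# Try to offline memory block id $1, at most $2 times (default 3), sleeping
# $3 seconds (default 1) between attempts.
# Returns 0 on success, 1 if every attempt failed, 2 if the block is unknown.
offline_block() {
    tries=${2:-3}
    delay=${3:-1}
    state="$MEMORY_SYSFS/memory$1/state"
    [ -f "$state" ] || return 2
    i=0
    while [ "$i" -lt "$tries" ]; do
        # A failed write (for example -EBUSY) makes echo return non-zero.
        if echo offline > "$state" 2>/dev/null &&
           [ "$(cat "$state")" = offline ]; then
            return 0
        fi
        i=$(( i + 1 ))
        if [ "$i" -lt "$tries" ]; then
            sleep "$delay"
        fi
    done
    return 1
}
```

For example, "offline_block 9 5 2" tries up to 5 times, 2 seconds apart.
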
Consideration:
Memory hotplug's design direction is to make the possibility of memory offlining
higher and to guarantee unplugging memory under any situation. But it needs
more work. Returning -EBUSY under some situation may be good because the user
can decide whether or not to retry. Currently, memory offlining code
does some amount of retry with 120 seconds timeout.

-------------------------
7. Physical memory remove
-------------------------
More implementation is needed:
 - Notifying the firmware that the OS has completed its removal work.
 - Guarding against removal while that work is not yet complete.

--------------------------------
8. Memory hotplug event notifier
--------------------------------
Hotplugging events are sent to a notification queue.

There are six types of notification defined in include/linux/memory.h:

MEM_GOING_ONLINE
  Generated before new memory becomes available in order to be able to
  prepare subsystems to handle memory. The page allocator is still unable
  to allocate from the new memory.

MEM_CANCEL_ONLINE
  Generated if MEM_GOING_ONLINE fails.

MEM_ONLINE
  Generated when memory has been successfully brought online. The callback may
  allocate pages from the new memory.

MEM_GOING_OFFLINE
  Generated to begin the process of offlining memory. Allocations are no
  longer possible from the memory but some of the memory to be offlined
  is still in use. The callback can be used to free memory known to a
  subsystem from the indicated memory block.

MEM_CANCEL_OFFLINE
  Generated if MEM_GOING_OFFLINE fails. Memory is available again from
  the memory block that we attempted to offline.

MEM_OFFLINE
  Generated after offlining memory is complete.

A callback routine can be registered by calling

  hotplug_memory_notifier(callback_func, priority)

Callback functions with higher values of priority are called before callback
functions with lower values.

A callback function must have the following prototype:

  int callback_func(
    struct notifier_block *self, unsigned long action, void *arg);

The first argument of the callback function (self) is a pointer to the block
of the notifier chain that points to the callback function itself.
The second argument (action) is one of the event types described above.
The third argument (arg) is a pointer to a struct memory_notify.

struct memory_notify {
       unsigned long start_pfn;
       unsigned long nr_pages;
       int status_change_nid_normal;
       int status_change_nid_high;
       int status_change_nid;
};

start_pfn is the start pfn of the online/offline memory.
nr_pages is the number of pages of the online/offline memory.
status_change_nid_normal is set to the node id when N_NORMAL_MEMORY of the
nodemask is (will be) set/cleared. If this is -1, the nodemask status is not
changed.
status_change_nid_high is set to the node id when N_HIGH_MEMORY of the nodemask
is (will be) set/cleared. If this is -1, the nodemask status is not changed.
status_change_nid is set to the node id when N_MEMORY of the nodemask is (will
be) set/cleared. It means that a new (memoryless) node gets new memory by
online, or that a node loses all its memory. If this is -1, the nodemask status
is not changed.
If status_change_nid* >= 0, the callback should create/discard structures for
the node if necessary.

The callback routine shall return one of the values
NOTIFY_DONE, NOTIFY_OK, NOTIFY_BAD, NOTIFY_STOP
defined in include/linux/notifier.h

NOTIFY_DONE and NOTIFY_OK have no effect on the further processing.

NOTIFY_BAD is used as response to the MEM_GOING_ONLINE, MEM_GOING_OFFLINE,
MEM_ONLINE, or MEM_OFFLINE action to cancel hotplugging. It stops
further processing of the notification queue.

NOTIFY_STOP stops further processing of the notification queue.

--------------
9. Future Work
--------------
  - allowing memory hot-add to ZONE_MOVABLE. maybe we need some switch like
    sysctl or new control file.
  - showing memory block and physical device relationship.
  - test and improve memory offlining.
  - support HugeTLB page migration and offlining.
  - memmap removing at memory offline.
  - physical memory removal.