0001 .. SPDX-License-Identifier: GPL-2.0
0002 .. include:: <isonum.txt>
0003 
0004 .. _driverapi_pm_devices:
0005 
0006 ==============================
0007 Device Power Management Basics
0008 ==============================
0009 
0010 :Copyright: |copy| 2010-2011 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
0011 :Copyright: |copy| 2010 Alan Stern <stern@rowland.harvard.edu>
0012 :Copyright: |copy| 2016 Intel Corporation
0013 
0014 :Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
0015 
0016 
0017 Most of the code in Linux is device drivers, so most of the Linux power
0018 management (PM) code is also driver-specific.  Most drivers will do very
0019 little; others, especially for platforms with small batteries (like cell
0020 phones), will do a lot.
0021 
0022 This writeup gives an overview of how drivers interact with system-wide
0023 power management goals, emphasizing the models and interfaces that are
0024 shared by everything that hooks up to the driver model core.  Read it as
0025 background for the domain-specific work you'd do with any specific driver.
0026 
0027 
0028 Two Models for Device Power Management
0029 ======================================
0030 
0031 Drivers will use one or both of these models to put devices into low-power
0032 states:
0033 
0034     System Sleep model:
0035 
0036         Drivers can enter low-power states as part of entering system-wide
0037         low-power states like "suspend" (also known as "suspend-to-RAM"), or
0038         (mostly for systems with disks) "hibernation" (also known as
0039         "suspend-to-disk").
0040 
0041         This is something that device, bus, and class drivers collaborate on
0042         by implementing various role-specific suspend and resume methods to
0043         cleanly power down hardware and software subsystems, then reactivate
0044         them without loss of data.
0045 
0046         Some drivers can manage hardware wakeup events, which make the system
0047         leave the low-power state.  This feature may be enabled or disabled
0048         using the relevant :file:`/sys/devices/.../power/wakeup` file (for
0049         Ethernet drivers the ioctl interface used by ethtool may also be used
0050         for this purpose); enabling it may cost some power, but it lets the
0051         whole system enter low-power states more often.
0052 
0053     Runtime Power Management model:
0054 
0055         Devices may also be put into low-power states while the system is
0056         running, independently of other power management activity in principle.
0057         However, devices are not generally independent of each other (for
0058         example, a parent device cannot be suspended unless all of its child
0059         devices have been suspended).  Moreover, depending on the bus type the
0060         device is on, it may be necessary to carry out some bus-specific
0061         operations on the device for this purpose.  Devices put into low power
0062         states at run time may require special handling during system-wide power
0063         transitions (suspend or hibernation).
0064 
0065         For these reasons not only the device driver itself, but also the
0066         appropriate subsystem (bus type, device type or device class) driver and
0067         the PM core are involved in runtime power management.  As in the system
0068         sleep power management case, they need to collaborate by implementing
0069         various role-specific suspend and resume methods, so that the hardware
0070         is cleanly powered down and reactivated without data or service loss.
0071 
0072 There's not a lot to be said about those low-power states except that they are
0073 very system-specific, and often device-specific.  If enough devices have been
0074 put into low-power states at run time, the effect may be very similar to
0075 entering some system-wide low-power state (system sleep).  Synergies also
0076 exist: several drivers using runtime PM might put the system into a state
0077 where even deeper power-saving options are available.
0078 
0079 Most suspended devices will have quiesced all I/O: no more DMA or IRQs (except
0080 for wakeup events), no more data read or written, and requests from upstream
0081 drivers are no longer accepted.  A given bus or platform may have different
0082 requirements though.
0083 
0084 Examples of hardware wakeup events include an alarm from a real time clock,
0085 network wake-on-LAN packets, keyboard or mouse activity, and media insertion
0086 or removal (for PCMCIA, MMC/SD, USB, and so on).
0087 
0088 Interfaces for Entering System Sleep States
0089 ===========================================
0090 
0091 There are programming interfaces provided for subsystems (bus type, device type,
0092 device class) and device drivers to allow them to participate in the power
0093 management of devices they are concerned with.  These interfaces cover both
0094 system sleep and runtime power management.
0095 
0096 
0097 Device Power Management Operations
0098 ----------------------------------
0099 
0100 Device power management operations, at the subsystem level as well as at the
0101 device driver level, are implemented by defining and populating objects of type
0102 struct dev_pm_ops defined in :file:`include/linux/pm.h`.  The roles of the
0103 methods included in it will be explained in what follows.  For now, it should be
0104 sufficient to remember that the last three methods are specific to runtime power
0105 management while the remaining ones are used during system-wide power
0106 transitions.
0107 
0108 There also is a deprecated "old" or "legacy" interface for power management
0109 operations available at least for some subsystems.  This approach does not use
0110 struct dev_pm_ops objects and it is suitable only for implementing system
0111 sleep power management methods in a limited way.  Therefore it is not described
0112 in this document, so please refer directly to the source code for more
0113 information about it.
0114 
0115 
0116 Subsystem-Level Methods
0117 -----------------------
0118 
0119 The core methods to suspend and resume devices reside in
0120 struct dev_pm_ops pointed to by the :c:member:`ops` member of
0121 struct dev_pm_domain, or by the :c:member:`pm` member of struct bus_type,
0122 struct device_type and struct class.  They are mostly of interest to the
0123 people writing infrastructure for platforms and buses, like PCI or USB, or
0124 device type and device class drivers.  They also are relevant to the writers of
0125 device drivers whose subsystems (PM domains, device types, device classes and
0126 bus types) don't provide all power management methods.
0127 
0128 Bus drivers implement these methods as appropriate for the hardware and the
0129 drivers using it; PCI works differently from USB, and so on.  Not many people
0130 write subsystem-level drivers; most driver code is a "device driver" that builds
0131 on top of bus-specific framework code.
0132 
0133 For more information on these driver calls, see the description later;
0134 they are called in phases for every device, respecting the parent-child
0135 sequencing in the driver model tree.
0136 
0137 
0138 :file:`/sys/devices/.../power/wakeup` files
0139 -------------------------------------------
0140 
0141 All device objects in the driver model contain fields that control the handling
0142 of system wakeup events (hardware signals that can force the system out of a
0143 sleep state).  These fields are initialized by bus or device driver code using
0144 :c:func:`device_set_wakeup_capable()` and :c:func:`device_set_wakeup_enable()`,
0145 defined in :file:`include/linux/pm_wakeup.h`.
0146 
0147 The :c:member:`power.can_wakeup` flag just records whether the device (and its
0148 driver) can physically support wakeup events.  The
0149 :c:func:`device_set_wakeup_capable()` routine affects this flag.  The
0150 :c:member:`power.wakeup` field is a pointer to an object of type
0151 struct wakeup_source used for controlling whether or not the device should use
0152 its system wakeup mechanism and for notifying the PM core of system wakeup
0153 events signaled by the device.  This object is only present for wakeup-capable
0154 devices (i.e. devices whose :c:member:`can_wakeup` flags are set) and is created
0155 (or removed) by :c:func:`device_set_wakeup_capable()`.
0156 
0157 Whether or not a device is capable of issuing wakeup events is a hardware
0158 matter, and the kernel is responsible for keeping track of it.  By contrast,
0159 whether or not a wakeup-capable device should issue wakeup events is a policy
0160 decision, and it is managed by user space through a sysfs attribute: the
0161 :file:`power/wakeup` file.  User space can write the "enabled" or "disabled"
0162 strings to it to indicate whether or not, respectively, the device is supposed
0163 to signal system wakeup.  This file is only present if the
0164 :c:member:`power.wakeup` object exists for the given device and is created (or
0165 removed) along with that object, by :c:func:`device_set_wakeup_capable()`.
0166 Reads from the file will return the corresponding string.
0167 
0168 The initial value in the :file:`power/wakeup` file is "disabled" for the
0169 majority of devices; the major exceptions are power buttons, keyboards, and
0170 Ethernet adapters whose WoL (wake-on-LAN) feature has been set up with ethtool.
0171 It should also default to "enabled" for devices that don't generate wakeup
0172 requests on their own but merely forward wakeup requests from one bus to another
0173 (like PCI Express ports).
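
The user-space side of this interface can be sketched in a few lines of C.
This is only an illustration, not code from the kernel tree: the helper names
``read_wakeup_state()`` and ``write_wakeup_state()`` are made up for this
example, and the caller has to supply the full path of the
:file:`power/wakeup` file for the device of interest.

```c
#include <stdio.h>
#include <string.h>

/* Possible results of reading a power/wakeup attribute. */
enum wakeup_state {
	WAKEUP_UNKNOWN = -1,	/* file missing, unreadable or unexpected text */
	WAKEUP_DISABLED = 0,
	WAKEUP_ENABLED = 1,
};

/*
 * Read a power/wakeup attribute.  A missing file usually just means that
 * the device is not wakeup-capable, so it is reported as WAKEUP_UNKNOWN.
 */
enum wakeup_state read_wakeup_state(const char *path)
{
	char buf[16] = "";
	FILE *f = fopen(path, "r");

	if (!f)
		return WAKEUP_UNKNOWN;
	if (!fgets(buf, sizeof(buf), f))
		buf[0] = '\0';
	fclose(f);

	buf[strcspn(buf, "\n")] = '\0';	/* strip the trailing newline, if any */
	if (!strcmp(buf, "enabled"))
		return WAKEUP_ENABLED;
	if (!strcmp(buf, "disabled"))
		return WAKEUP_DISABLED;
	return WAKEUP_UNKNOWN;
}

/* Write "enabled" or "disabled"; returns 0 on success, -1 on error. */
int write_wakeup_state(const char *path, int enable)
{
	FILE *f = fopen(path, "w");
	int ret;

	if (!f)
		return -1;
	ret = (fputs(enable ? "enabled" : "disabled", f) == EOF) ? -1 : 0;
	if (fclose(f) != 0)
		ret = -1;
	return ret;
}
```

Writing the attribute normally requires root privileges, and the write only
expresses policy: whether the hardware can actually wake the system is still
decided by the kernel through the ``power.can_wakeup`` flag.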
0174 
0175 The :c:func:`device_may_wakeup()` routine returns true only if the
0176 :c:member:`power.wakeup` object exists and the corresponding :file:`power/wakeup`
0177 file contains the "enabled" string.  This information is used by subsystems,
0178 like the PCI bus type code, to see whether or not to enable the devices' wakeup
0179 mechanisms.  If device wakeup mechanisms are enabled or disabled directly by
0180 drivers, they also should use :c:func:`device_may_wakeup()` to decide what to do
0181 during a system sleep transition.  Device drivers, however, are not expected to
0182 call :c:func:`device_set_wakeup_enable()` directly in any case.
0183 
0184 It ought to be noted that system wakeup is conceptually different from "remote
0185 wakeup" used by runtime power management, although it may be supported by the
0186 same physical mechanism.  Remote wakeup is a feature allowing devices in
0187 low-power states to trigger specific interrupts to signal conditions in which
0188 they should be put into the full-power state.  Those interrupts may or may not
0189 be used to signal system wakeup events, depending on the hardware design.  On
0190 some systems it is impossible to trigger them from system sleep states.  In any
0191 case, remote wakeup should always be enabled for runtime power management for
0192 all devices and drivers that support it.
0193 
0194 
0195 :file:`/sys/devices/.../power/control` files
0196 --------------------------------------------
0197 
0198 Each device in the driver model has a flag to control whether it is subject to
0199 runtime power management.  This flag, :c:member:`runtime_auto`, is initialized
0200 by the bus type (or generally subsystem) code using :c:func:`pm_runtime_allow()`
0201 or :c:func:`pm_runtime_forbid()`; the default is to allow runtime power
0202 management.
0203 
0204 The setting can be adjusted by user space by writing either "on" or "auto" to
0205 the device's :file:`power/control` sysfs file.  Writing "auto" calls
0206 :c:func:`pm_runtime_allow()`, setting the flag and allowing the device to be
0207 runtime power-managed by its driver.  Writing "on" calls
0208 :c:func:`pm_runtime_forbid()`, clearing the flag, returning the device to full
0209 power if it was in a low-power state, and preventing the
0210 device from being runtime power-managed.  User space can check the current value
0211 of the :c:member:`runtime_auto` flag by reading that file.
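
The mapping between the strings and the flag can be modeled as follows.  This
is a user-space sketch of the behavior described above, not the kernel's
implementation: ``struct model_device`` and the ``model_control_*()``
functions are hypothetical names, and the two assignments stand in for what
:c:func:`pm_runtime_allow()` and :c:func:`pm_runtime_forbid()` are described
as doing.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the parts of struct device used here. */
struct model_device {
	bool runtime_auto;	/* models the runtime_auto flag */
	bool runtime_suspended;	/* true while in a low-power state */
};

/* True if buf equals word, optionally followed by one newline. */
static bool word_matches(const char *buf, const char *word)
{
	size_t n = strlen(word);

	return !strncmp(buf, word, n) &&
	       (buf[n] == '\0' || !strcmp(buf + n, "\n"));
}

/* Models a write to power/control. */
int model_control_store(struct model_device *dev, const char *buf)
{
	if (word_matches(buf, "auto")) {
		/* "auto": set the flag, runtime PM is allowed. */
		dev->runtime_auto = true;
	} else if (word_matches(buf, "on")) {
		/* "on": clear the flag and return to full power. */
		dev->runtime_auto = false;
		dev->runtime_suspended = false;
	} else {
		return -EINVAL;
	}
	return 0;
}

/* Models a read from power/control. */
const char *model_control_show(const struct model_device *dev)
{
	return dev->runtime_auto ? "auto" : "on";
}
```

Note that the model only touches the runtime PM state; nothing in it affects
how the device is handled during system-wide transitions.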
0212 
0213 The device's :c:member:`runtime_auto` flag has no effect on the handling of
0214 system-wide power transitions.  In particular, the device can (and in the
0215 majority of cases should and will) be put into a low-power state during a
0216 system-wide transition to a sleep state even though its :c:member:`runtime_auto`
0217 flag is clear.
0218 
0219 For more information about the runtime power management framework, refer to
0220 Documentation/power/runtime_pm.rst.
0221 
0222 
0223 Calling Drivers to Enter and Leave System Sleep States
0224 ======================================================
0225 
0226 When the system goes into a sleep state, each device's driver is asked to
0227 suspend the device by putting it into a state compatible with the target
0228 system state.  That's usually some version of "off", but the details are
0229 system-specific.  Also, wakeup-enabled devices will usually stay partly
0230 functional in order to wake the system.
0231 
0232 When the system leaves that low-power state, the device's driver is asked to
0233 resume it by returning it to full power.  The suspend and resume operations
0234 always go together, and both are multi-phase operations.
0235 
0236 For simple drivers, suspend might quiesce the device using class code
0237 and then turn its hardware as "off" as possible during suspend_noirq.  The
0238 matching resume calls would then completely reinitialize the hardware
0239 before reactivating its class I/O queues.
0240 
0241 More power-aware drivers might prepare the devices for triggering system wakeup
0242 events.
0243 
0244 
0245 Call Sequence Guarantees
0246 ------------------------
0247 
0248 To ensure that bridges and similar links needing to talk to a device are
0249 available when the device is suspended or resumed, the device hierarchy is
0250 walked in a bottom-up order to suspend devices.  A top-down order is
0251 used to resume those devices.
0252 
0253 The ordering of the device hierarchy is defined by the order in which devices
0254 get registered:  a child can never be registered, probed or resumed before
0255 its parent, and it can't be removed or suspended after that parent.
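
The ordering rule can be illustrated with a small user-space model.  It
assumes a flat array of devices kept in registration order and ignores
details such as asynchronous suspend; ``struct model_device`` and the
``model_*()`` helpers are hypothetical and do not correspond to PM core
functions.

```c
#include <stddef.h>

/* Hypothetical stand-in for a registered device. */
struct model_device {
	const char *name;
	const struct model_device *parent;	/* NULL for a root device */
};

/* Suspend order: the reverse of registration order (children first). */
void model_suspend_order(const struct model_device *const *registered,
			 size_t n, const struct model_device **out)
{
	for (size_t i = 0; i < n; i++)
		out[i] = registered[n - 1 - i];
}

/* Resume order: registration order (parents first). */
void model_resume_order(const struct model_device *const *registered,
			size_t n, const struct model_device **out)
{
	for (size_t i = 0; i < n; i++)
		out[i] = registered[i];
}

/*
 * Return 1 if "order" respects the hierarchy: with parents_first set, no
 * device precedes its parent; otherwise no device follows its parent.
 */
int model_order_respects_hierarchy(const struct model_device *const *order,
				   size_t n, int parents_first)
{
	for (size_t i = 0; i < n; i++) {
		const struct model_device *parent = order[i]->parent;

		if (!parent)
			continue;
		for (size_t j = 0; j < n; j++) {
			if (order[j] != parent)
				continue;
			if (parents_first ? j > i : j < i)
				return 0;
		}
	}
	return 1;
}
```

Because a child is always registered after its parent, walking the
registration list backwards visits every child before its parent, which is
exactly what suspend needs.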
0256 
0257 The policy is that the device hierarchy should match hardware bus topology.
0258 [Or at least the control bus, for devices which use multiple busses.]
0259 In particular, this means that a device registration may fail if the parent of
0260 the device is suspending (i.e. has been chosen by the PM core as the next
0261 device to suspend) or has already suspended, as well as after all of the other
0262 devices have been suspended.  Device drivers must be prepared to cope with such
0263 situations.
0264 
0265 
0266 System Power Management Phases
0267 ------------------------------
0268 
0269 Suspending or resuming the system is done in several phases.  Different phases
0270 are used for suspend-to-idle, shallow (standby), and deep ("suspend-to-RAM")
0271 sleep states and the hibernation state ("suspend-to-disk").  Each phase involves
0272 executing callbacks for every device before the next phase begins.  Not all
0273 buses or classes support all these callbacks and not all drivers use all the
0274 callbacks.  The various phases always run after tasks have been frozen and
0275 before they are unfrozen.  Furthermore, the ``*_noirq`` phases run at a time
0276 when IRQ handlers have been disabled (except for those marked with the
0277 IRQF_NO_SUSPEND flag).
0278 
0279 All phases use PM domain, bus, type, class or driver callbacks (that is, methods
0280 defined in ``dev->pm_domain->ops``, ``dev->bus->pm``, ``dev->type->pm``,
0281 ``dev->class->pm`` or ``dev->driver->pm``).  These callbacks are regarded by the
0282 PM core as mutually exclusive.  Moreover, PM domain callbacks always take
0283 precedence over all of the other callbacks and, for example, type callbacks take
0284 precedence over bus, class and driver callbacks.  To be precise, the following
0285 rules are used to determine which callback to execute in the given phase:
0286 
0287     1.  If ``dev->pm_domain`` is present, the PM core will choose the callback
0288         provided by ``dev->pm_domain->ops`` for execution.
0289 
0290     2.  Otherwise, if both ``dev->type`` and ``dev->type->pm`` are present, the
0291         callback provided by ``dev->type->pm`` will be chosen for execution.
0292 
0293     3.  Otherwise, if both ``dev->class`` and ``dev->class->pm`` are present,
0294         the callback provided by ``dev->class->pm`` will be chosen for
0295         execution.
0296 
0297     4.  Otherwise, if both ``dev->bus`` and ``dev->bus->pm`` are present, the
0298         callback provided by ``dev->bus->pm`` will be chosen for execution.
0299 
0300 This allows PM domains and device types to override callbacks provided by bus
0301 types or device classes if necessary.
0302 
0303 The PM domain, type, class and bus callbacks may in turn invoke device- or
0304 driver-specific methods stored in ``dev->driver->pm``, but they don't have to do
0305 that.
0306 
0307 If the subsystem callback chosen for execution is not present, the PM core will
0308 execute the corresponding method from the ``dev->driver->pm`` set instead if
0309 there is one.
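
The selection rules above, including the fallback to the driver's callback,
can be condensed into the following user-space model.  It is a sketch of the
rules as stated in this section rather than a copy of the PM core code: the
structures are reduced to a single ``suspend`` callback, and all of the
``model_*`` and ``example_*`` names are made up for illustration.

```c
#include <stddef.h>

typedef int (*pm_callback_t)(void *dev);

/* Reduced stand-in for struct dev_pm_ops: one phase is enough here. */
struct model_pm_ops {
	pm_callback_t suspend;
};

/* Each pointer is NULL when the corresponding object or ops is absent. */
struct model_device {
	const struct model_pm_ops *pm_domain_ops; /* dev->pm_domain->ops */
	const struct model_pm_ops *type_pm;	  /* dev->type->pm */
	const struct model_pm_ops *class_pm;	  /* dev->class->pm */
	const struct model_pm_ops *bus_pm;	  /* dev->bus->pm */
	const struct model_pm_ops *driver_pm;	  /* dev->driver->pm */
};

/* Example callbacks, distinguishable by their return values. */
int example_domain_suspend(void *dev) { (void)dev; return 1; }
int example_type_suspend(void *dev)   { (void)dev; return 2; }
int example_bus_suspend(void *dev)    { (void)dev; return 3; }
int example_driver_suspend(void *dev) { (void)dev; return 4; }

/* Pick the ->suspend callback to run for dev, or NULL if there is none. */
pm_callback_t model_pick_suspend(const struct model_device *dev)
{
	pm_callback_t cb = NULL;

	/* Rules 1-4: the first subsystem level that is present wins. */
	if (dev->pm_domain_ops)
		cb = dev->pm_domain_ops->suspend;
	else if (dev->type_pm)
		cb = dev->type_pm->suspend;
	else if (dev->class_pm)
		cb = dev->class_pm->suspend;
	else if (dev->bus_pm)
		cb = dev->bus_pm->suspend;

	/* Fallback: the chosen subsystem callback is not present. */
	if (!cb && dev->driver_pm)
		cb = dev->driver_pm->suspend;

	return cb;
}
```

Note that a present but empty higher-priority level does not fall through to
the next subsystem level; it falls through to the driver's callback only.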
0310 
0311 
0312 Entering System Suspend
0313 -----------------------
0314 
0315 When the system goes into the freeze, standby or memory sleep state,
0316 the phases are: ``prepare``, ``suspend``, ``suspend_late``, ``suspend_noirq``.
0317 
0318     1.  The ``prepare`` phase is meant to prevent races by preventing new
0319         devices from being registered; the PM core would never know that all the
0320         children of a device had been suspended if new children could be
0321         registered at will.  [By contrast, from the PM core's perspective,
0322         devices may be unregistered at any time.]  Unlike the other
0323         suspend-related phases, during the ``prepare`` phase the device
0324         hierarchy is traversed top-down.
0325 
0326         After the ``->prepare`` callback method returns, no new children may be
0327         registered below the device.  The method may also prepare the device or
0328         driver in some way for the upcoming system power transition, but it
0329         should not put the device into a low-power state.  Moreover, if the
0330         device supports runtime power management, the ``->prepare`` callback
0331         method must not update its state in case it is necessary to resume it
0332         from runtime suspend later on.
0333 
0334         For devices supporting runtime power management, the return value of the
0335         prepare callback can be used to indicate to the PM core that it may
0336         safely leave the device in runtime suspend (if runtime-suspended
0337         already), provided that all of the device's descendants are also left in
0338         runtime suspend.  Namely, if the prepare callback returns a positive
0339         number and that happens for all of the descendants of the device too,
0340         and all of them (including the device itself) are runtime-suspended, the
0341         PM core will skip the ``suspend``, ``suspend_late`` and
0342         ``suspend_noirq`` phases as well as all of the corresponding phases of
0343         the subsequent device resume for all of these devices.  In that case,
0344         the ``->complete`` callback will be the next one invoked after the
0345         ``->prepare`` callback and is entirely responsible for putting the
0346         device into a consistent state as appropriate.
0347 
0348         Note that this direct-complete procedure applies even if the device is
0349         disabled for runtime PM; only the runtime-PM status matters.  It follows
0350         that if a device has system-sleep callbacks but does not support runtime
0351         PM, then its prepare callback must never return a positive value.  This
0352         is because all such devices are initially set to runtime-suspended with
0353         runtime PM disabled.
0354 
0355         This feature also can be controlled by device drivers by using the
0356         ``DPM_FLAG_NO_DIRECT_COMPLETE`` and ``DPM_FLAG_SMART_PREPARE`` driver
0357         power management flags.  [Typically, they are set at the time the driver
0358         is probed against the device in question by passing them to the
0359         :c:func:`dev_pm_set_driver_flags` helper function.]  If the first of
0360         these flags is set, the PM core will not apply the direct-complete
0361         procedure described above to the given device and, consequently, to any
0362         of its ancestors.  The second flag, when set, informs the middle layer
0363         code (bus types, device types, PM domains, classes) that it should take
0364         the return value of the ``->prepare`` callback provided by the driver
0365         into account and it may only return a positive value from its own
0366         ``->prepare`` callback if the driver's one also has returned a positive
0367         value.
0368 
0369     2.  The ``->suspend`` methods should quiesce the device to stop it from
0370         performing I/O.  They also may save the device registers and put it into
0371         the appropriate low-power state, depending on the bus type the device is
0372         on, and they may enable wakeup events.
0373 
0374         However, for devices supporting runtime power management, the
0375         ``->suspend`` methods provided by subsystems (bus types and PM domains
0376         in particular) must follow an additional rule regarding what can be done
0377         to the devices before their drivers' ``->suspend`` methods are called.
0378         Namely, they may resume the devices from runtime suspend by
0379         calling :c:func:`pm_runtime_resume` for them, if that is necessary, but
0380         they must not update the state of the devices in any other way at that
0381         time (in case the drivers need to resume the devices from runtime
0382         suspend in their ``->suspend`` methods).  In fact, the PM core prevents
0383         subsystems or drivers from putting devices into runtime suspend at
0384         these times by calling :c:func:`pm_runtime_get_noresume` before issuing
0385         the ``->prepare`` callback (and calling :c:func:`pm_runtime_put` after
0386         issuing the ``->complete`` callback).
0387 
0388     3.  For a number of devices it is convenient to split suspend into the
0389         "quiesce device" and "save device state" phases, in which cases
0390         ``suspend_late`` is meant to do the latter.  It is always executed after
0391         runtime power management has been disabled for the device in question.
0392 
0393     4.  The ``suspend_noirq`` phase occurs after IRQ handlers have been disabled,
0394         which means that the driver's interrupt handler will not be called while
0395         the callback method is running.  The ``->suspend_noirq`` methods should
0396         save the values of the device's registers that weren't saved previously
0397         and finally put the device into the appropriate low-power state.
0398 
0399         The majority of subsystems and device drivers need not implement this
0400         callback.  However, bus types allowing devices to share interrupt
0401         vectors, like PCI, generally need it; otherwise a driver might encounter
0402         an error during the suspend phase by fielding a shared interrupt
0403         generated by some other device after its own device had been set to low
0404         power.
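
The direct-complete condition from the description of the ``prepare`` phase
can be written down as a small recursive check.  The sketch below is a
user-space model that covers only the conditions stated in this section;
``struct model_device`` and ``model_direct_complete()`` are hypothetical
names, and the ``no_direct_complete`` field merely stands in for the
``DPM_FLAG_NO_DIRECT_COMPLETE`` driver flag.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a device and its position in the hierarchy. */
struct model_device {
	int prepare_ret;		/* value returned by ->prepare */
	bool runtime_suspended;		/* runtime PM status is "suspended" */
	bool no_direct_complete;	/* models DPM_FLAG_NO_DIRECT_COMPLETE */
	struct model_device **children;
	size_t nr_children;
};

/*
 * Return true if the suspend, suspend_late and suspend_noirq phases (and
 * the matching resume phases) may be skipped for dev: its ->prepare
 * returned a positive number, it is runtime-suspended, direct-complete has
 * not been opted out of, and the same holds for all of its descendants.
 */
bool model_direct_complete(const struct model_device *dev)
{
	if (dev->prepare_ret <= 0 || !dev->runtime_suspended ||
	    dev->no_direct_complete)
		return false;

	for (size_t i = 0; i < dev->nr_children; i++)
		if (!model_direct_complete(dev->children[i]))
			return false;

	return true;
}
```

A single descendant that is not eligible is enough to disqualify all of its
ancestors, which is why setting the flag on one device also affects the
devices above it.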
0405 
0406 At the end of these phases, drivers should have stopped all I/O transactions
0407 (DMA, IRQs), saved enough state that they can re-initialize or restore previous
0408 state (as needed by the hardware), and placed the device into a low-power state.
0409 On many platforms they will gate off one or more clock sources; sometimes they
0410 will also switch off power supplies or reduce voltages.  [Drivers supporting
0411 runtime PM may already have performed some or all of these steps.]
0412 
0413 If :c:func:`device_may_wakeup()` returns ``true``, the device should be
0414 prepared for generating hardware wakeup signals to trigger a system wakeup event
0415 when the system is in the sleep state.  For example, :c:func:`enable_irq_wake()`
0416 might identify GPIO signals hooked up to a switch or other external hardware,
0417 and :c:func:`pci_enable_wake()` does something similar for the PCI PME signal.
0418 
0419 If any of these callbacks returns an error, the system won't enter the desired
0420 low-power state.  Instead, the PM core will unwind its actions by resuming all
0421 the devices that were suspended.
0422 
0423 
0424 Leaving System Suspend
0425 ----------------------
0426 
0427 When resuming from freeze, standby or memory sleep, the phases are:
0428 ``resume_noirq``, ``resume_early``, ``resume``, ``complete``.
0429 
0430     1.  The ``->resume_noirq`` callback methods should perform any actions
0431         needed before the driver's interrupt handlers are invoked.  This
0432         generally means undoing the actions of the ``suspend_noirq`` phase.  If
0433         the bus type permits devices to share interrupt vectors, like PCI, the
0434         method should bring the device and its driver into a state in which the
0435         driver can recognize if the device is the source of incoming interrupts,
0436         if any, and handle them correctly.
0437 
0438         For example, the PCI bus type's ``->pm.resume_noirq()`` puts the device
0439         into the full-power state (D0 in the PCI terminology) and restores the
0440         standard configuration registers of the device.  Then it calls the
0441         device driver's ``->pm.resume_noirq()`` method to perform device-specific
0442         actions.
0443 
0444     2.  The ``->resume_early`` methods should prepare devices for the execution
0445         of the resume methods.  This generally involves undoing the actions of
0446         the preceding ``suspend_late`` phase.
0447 
0448     3.  The ``->resume`` methods should bring the device back to its operating
0449         state, so that it can perform normal I/O.  This generally involves
0450         undoing the actions of the ``suspend`` phase.
0451 
0452     4.  The ``complete`` phase should undo the actions of the ``prepare`` phase.
0453         For this reason, unlike the other resume-related phases, during the
0454         ``complete`` phase the device hierarchy is traversed bottom-up.
0455 
0456         Note, however, that new children may be registered below the device as
0457         soon as the ``->resume`` callbacks occur; it's not necessary to wait
0458         until the ``complete`` phase runs.
0459 
0460         Moreover, if the preceding ``->prepare`` callback returned a positive
0461         number, the device may have been left in runtime suspend throughout the
0462         whole system suspend and resume (its ``->suspend``, ``->suspend_late``,
0463         ``->suspend_noirq``, ``->resume_noirq``,
0464         ``->resume_early``, and ``->resume`` callbacks may have been
0465         skipped).  In that case, the ``->complete`` callback is entirely
0466         responsible for putting the device into a consistent state after system
0467         suspend if necessary.  [For example, it may need to queue up a runtime
0468         resume request for the device for this purpose.]  To check if that is
0469         the case, the ``->complete`` callback can consult the device's
0470         ``power.direct_complete`` flag.  If that flag is set when the
0471         ``->complete`` callback is being run then the direct-complete mechanism
0472         was used, and special actions may be required to make the device work
0473         correctly afterward.
0474 
0475 At the end of these phases, drivers should be as functional as they were before
0476 suspending: I/O can be performed using DMA and IRQs, and the relevant clocks are
0477 gated on.
0478 
0479 However, the details here may again be platform-specific.  For example,
0480 some systems support multiple "run" states, and the mode in effect at
0481 the end of resume might not be the one which preceded suspension.
0482 That means the availability of certain clocks or power supplies may have
0483 changed, which could easily affect how a driver works.
0484 
0485 Drivers need to be able to handle hardware which has been reset since all of the
0486 suspend methods were called, for example by complete reinitialization.
0487 This may be the hardest part, and the one most protected by NDA'd documents
0488 and chip errata.  It's simplest if the hardware state hasn't changed since
0489 the suspend was carried out, but that can only be guaranteed if the target
0490 system sleep entered was suspend-to-idle.  For the other system sleep states
0491 that may not be the case (and usually isn't for ACPI-defined system sleep
0492 states, like S3).
0493 
0494 Drivers must also be prepared to notice that the device has been removed
0495 while the system was powered down, whenever that's physically possible.
0496 PCMCIA, MMC, USB, Firewire, SCSI, and even IDE are common examples of busses
0497 where common Linux platforms will see such removal.  Details of how drivers
0498 will notice and handle such removals are currently bus-specific, and often
0499 involve a separate thread.
0500 
0501 These callbacks may return an error value, but the PM core will ignore such
0502 errors since there's nothing it can do about them other than printing them in
0503 the system log.
0504 
0505 
0506 Entering Hibernation
0507 --------------------
0508 
0509 Hibernating the system is more complicated than putting it into sleep states,
0510 because it involves creating and saving a system image.  Therefore there are
0511 more phases for hibernation, with a different set of callbacks.  These phases
0512 always run after tasks have been frozen and enough memory has been freed.
0513 
0514 The general procedure for hibernation is to quiesce all devices ("freeze"),
0515 create an image of the system memory while everything is stable, reactivate all
0516 devices ("thaw"), write the image to permanent storage, and finally shut down
0517 the system ("power off").  The phases used to accomplish this are: ``prepare``,
0518 ``freeze``, ``freeze_late``, ``freeze_noirq``, ``thaw_noirq``, ``thaw_early``,
0519 ``thaw``, ``complete``, ``prepare``, ``poweroff``, ``poweroff_late``,
0520 ``poweroff_noirq``.
0521 
0522     1.  The ``prepare`` phase is discussed in the "Entering System Suspend"
0523         section above.
0524 
0525     2.  The ``->freeze`` methods should quiesce the device so that it doesn't
0526         generate IRQs or DMA, and they may need to save the values of device
0527         registers.  However the device does not have to be put in a low-power
0528         state, and to save time it's best not to do so.  Also, the device should
0529         not be prepared to generate wakeup events.
0530 
0531     3.  The ``freeze_late`` phase is analogous to the ``suspend_late`` phase
0532         described earlier, except that the device should not be put into a
0533         low-power state and should not be allowed to generate wakeup events.
0534 
0535     4.  The ``freeze_noirq`` phase is analogous to the ``suspend_noirq`` phase
0536         discussed earlier, except again that the device should not be put into
0537         a low-power state and should not be allowed to generate wakeup events.
0538 
0539 At this point the system image is created.  All devices should be inactive and
0540 the contents of memory should remain undisturbed while this happens, so that the
0541 image forms an atomic snapshot of the system state.
0542 
0543     5.  The ``thaw_noirq`` phase is analogous to the ``resume_noirq`` phase
0544         discussed earlier.  The main difference is that its methods can assume
0545         the device is in the same state as at the end of the ``freeze_noirq``
0546         phase.
0547 
0548     6.  The ``thaw_early`` phase is analogous to the ``resume_early`` phase
0549         described above.  Its methods should undo the actions of the preceding
0550         ``freeze_late``, if necessary.
0551 
0552     7.  The ``thaw`` phase is analogous to the ``resume`` phase discussed
0553         earlier.  Its methods should bring the device back to an operating
0554         state, so that it can be used for saving the image if necessary.
0555 
0556     8.  The ``complete`` phase is discussed in the "Leaving System Suspend"
0557         section above.
0558 
0559 At this point the system image is saved, and the devices then need to be
0560 prepared for the upcoming system shutdown.  This is much like suspending them
0561 before putting the system into the suspend-to-idle, shallow or deep sleep state,
0562 and the phases are similar.
0563 
0564     9.  The ``prepare`` phase is discussed above.
0565 
0566     10. The ``poweroff`` phase is analogous to the ``suspend`` phase.
0567 
0568     11. The ``poweroff_late`` phase is analogous to the ``suspend_late`` phase.
0569 
0570     12. The ``poweroff_noirq`` phase is analogous to the ``suspend_noirq`` phase.
0571 
0572 The ``->poweroff``, ``->poweroff_late`` and ``->poweroff_noirq`` callbacks
0573 should do essentially the same things as the ``->suspend``, ``->suspend_late``
0574 and ``->suspend_noirq`` callbacks, respectively.  A notable difference is
0575 that they need not store the device register values, because the registers
0576 should already have been stored during the ``freeze``, ``freeze_late`` or
0577 ``freeze_noirq`` phases.  Also, on many machines the firmware will power-down
0578 the entire system, so it is not necessary for the callback to put the device in
0579 a low-power state.
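
The ordering described above can be modelled in plain user-space C.  This is
only an illustrative sketch, not kernel code: ``struct toy_device``,
``toy_run_phase()`` and ``toy_hibernate()`` are hypothetical names, and the
model only tracks the few properties the text mentions (quiescing, register
saving, and the fact that a low-power state is not required).

```c
#include <stddef.h>
#include <string.h>

/* Phase names in the order given above for creating and saving an image. */
const char *const hibernate_phases[] = {
	"prepare", "freeze", "freeze_late", "freeze_noirq",
	/* the system image is created at this point */
	"thaw_noirq", "thaw_early", "thaw", "complete",
	/* the system image is written to storage at this point */
	"prepare", "poweroff", "poweroff_late", "poweroff_noirq",
};

#define HIBERNATE_PHASE_COUNT \
	(sizeof(hibernate_phases) / sizeof(hibernate_phases[0]))

/* Hypothetical per-device state tracked by this model. */
struct toy_device {
	int quiesced;    /* device may not generate IRQs or DMA */
	int regs_saved;  /* register values have been stored */
	int low_power;   /* device was put into a low-power state */
};

/* Apply the effect of one phase to the modelled device. */
void toy_run_phase(struct toy_device *dev, const char *phase)
{
	if (strcmp(phase, "freeze") == 0) {
		/* Quiesce and save registers, but stay at full power. */
		dev->quiesced = 1;
		dev->regs_saved = 1;
	} else if (strcmp(phase, "thaw") == 0) {
		/* Back to an operating state so the image can be saved. */
		dev->quiesced = 0;
	} else if (strcmp(phase, "poweroff") == 0) {
		/*
		 * Registers were already saved during "freeze", and the
		 * firmware is assumed to power the whole system down, so
		 * the device is only quiesced here.
		 */
		dev->quiesced = 1;
	}
	/* The remaining phases are no-ops in this simplified model. */
}

/* Walk all phases in order; returns the number of phases executed. */
size_t toy_hibernate(struct toy_device *dev)
{
	size_t i;

	for (i = 0; i < HIBERNATE_PHASE_COUNT; i++)
		toy_run_phase(dev, hibernate_phases[i]);

	return i;
}
```

In a real driver the corresponding work would live in the ``->freeze``,
``->thaw`` and ``->poweroff`` callbacks of its struct dev_pm_ops; the model
merely makes the ordering and the "save registers once" rule visible.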
0580 
0581 
0582 Leaving Hibernation
0583 -------------------
0584 
0585 Resuming from hibernation is, again, more complicated than resuming from a sleep
0586 state in which the contents of main memory are preserved, because it requires
0587 a system image to be loaded into memory and the pre-hibernation memory contents
0588 to be restored before control can be passed back to the image kernel.
0589 
0590 Although in principle the image might be loaded into memory and the
0591 pre-hibernation memory contents restored by the boot loader, in practice this
0592 can't be done because boot loaders aren't smart enough and there is no
0593 established protocol for passing the necessary information.  So instead, the
0594 boot loader loads a fresh instance of the kernel, called "the restore kernel",
0595 into memory and passes control to it in the usual way.  Then the restore kernel
0596 reads the system image, restores the pre-hibernation memory contents, and passes
0597 control to the image kernel.  Thus two different kernel instances are involved
0598 in resuming from hibernation.  In fact, the restore kernel may be completely
0599 different from the image kernel: a different configuration and even a different
0600 version.  This has important consequences for device drivers and their
0601 subsystems.
0602 
0603 To be able to load the system image into memory, the restore kernel needs to
0604 include at least a subset of device drivers allowing it to access the storage
0605 medium containing the image, although it doesn't need to include all of the
0606 drivers present in the image kernel.  After the image has been loaded, the
0607 devices managed by the restore kernel need to be prepared for passing control back
0608 to the image kernel.  This is very similar to the initial steps involved in
0609 creating a system image, and it is accomplished in the same way, using
0610 ``prepare``, ``freeze``, and ``freeze_noirq`` phases.  However, the devices
0611 affected by these phases are only those having drivers in the restore kernel;
0612 other devices will still be in whatever state the boot loader left them.
0613 
0614 Should the restoration of the pre-hibernation memory contents fail, the restore
0615 kernel would go through the "thawing" procedure described above, using the
0616 ``thaw_noirq``, ``thaw_early``, ``thaw``, and ``complete`` phases, and then
0617 continue running normally.  This happens only rarely.  Most often the
0618 pre-hibernation memory contents are restored successfully and control is passed
0619 to the image kernel, which then becomes responsible for bringing the system back
0620 to the working state.
0621 
0622 To achieve this, the image kernel must restore the devices' pre-hibernation
0623 functionality.  The operation is much like waking up from a sleep state (with
0624 the memory contents preserved), although it involves different phases:
0625 ``restore_noirq``, ``restore_early``, ``restore``, ``complete``.
0626 
0627     1.  The ``restore_noirq`` phase is analogous to the ``resume_noirq`` phase.
0628 
0629     2.  The ``restore_early`` phase is analogous to the ``resume_early`` phase.
0630 
0631     3.  The ``restore`` phase is analogous to the ``resume`` phase.
0632 
0633     4.  The ``complete`` phase is discussed above.
0634 
0635 The main difference from ``resume[_early|_noirq]`` is that
0636 ``restore[_early|_noirq]`` must assume the device has been accessed and
0637 reconfigured by the boot loader or the restore kernel.  Consequently, the state
0638 of the device may be different from the state remembered from the ``freeze``,
0639 ``freeze_late`` and ``freeze_noirq`` phases.  The device may even need to be
0640 reset and completely re-initialized.  In many cases this difference doesn't
0641 matter, so the ``->resume[_early|_noirq]`` and ``->restore[_early|_noirq]``
0642 method pointers can be set to the same routines.  Nevertheless, different
0643 callback pointers are used in case there is a situation where it actually does
0644 matter.
0645 
0646 
0647 Power Management Notifiers
0648 ==========================
0649 
0650 There are some operations that cannot be carried out by the power management
0651 callbacks discussed above, because the callbacks occur too late or too early.
0652 To handle these cases, subsystems and device drivers may register power
0653 management notifiers that are called before tasks are frozen and after they have
0654 been thawed.  Generally speaking, the PM notifiers are suitable for performing
0655 actions that either require user space to be available, or at least won't
0656 interfere with user space.
0657 
0658 For details refer to Documentation/driver-api/pm/notifiers.rst.
0659 
0660 
0661 Device Low-Power (suspend) States
0662 =================================
0663 
0664 Device low-power states aren't standard.  One device might only handle
0665 "on" and "off", while another might support a dozen different versions of
0666 "on" (how many engines are active?), plus a state that gets back to "on"
0667 faster than from a full "off".
0668 
0669 Some buses define rules about what different suspend states mean.  PCI
0670 gives one example: after the suspend sequence completes, a non-legacy
0671 PCI device may not perform DMA or issue IRQs, and any wakeup events it
0672 issues would be issued through the PME# bus signal.  Plus, there are
0673 several PCI-standard device states, some of which are optional.
0674 
0675 In contrast, integrated system-on-chip processors often use IRQs as the
0676 wakeup event sources (so drivers would call :c:func:`enable_irq_wake`) and
0677 might be able to treat DMA completion as a wakeup event (sometimes DMA can stay
0678 active too, so that only the CPU and some peripherals sleep).
0679 
0680 Some details here may be platform-specific.  Systems may have devices that
0681 can be fully active in certain sleep states, such as an LCD display that's
0682 refreshed using DMA while most of the system is sleeping lightly ... and
0683 its frame buffer might even be updated by a DSP or other non-Linux CPU while
0684 the Linux control processor stays idle.
0685 
0686 Moreover, the specific actions taken may depend on the target system state.
0687 One target system state might allow a given device to be very operational;
0688 another might require a hard shut down with re-initialization on resume.
0689 And two different target systems might use the same device in different
0690 ways; the aforementioned LCD might be active in one product's "standby",
0691 but a different product using the same SOC might work differently.
0692 
0693 
0694 Device Power Management Domains
0695 ===============================
0696 
0697 Sometimes devices share reference clocks or other power resources.  In those
0698 cases it generally is not possible to put devices into low-power states
0699 individually.  Instead, a set of devices sharing a power resource can be put
0700 into a low-power state together at the same time by turning off the shared
0701 power resource.  Of course, they also need to be put into the full-power state
0702 together, by turning the shared power resource on.  A set of devices with this
0703 property is often referred to as a power domain. A power domain may also be
0704 nested inside another power domain. The nested domain is referred to as the
0705 sub-domain of the parent domain.
0706 
0707 Support for power domains is provided through the :c:member:`pm_domain` field of
0708 struct device.  This field is a pointer to an object of type
0709 struct dev_pm_domain, defined in :file:`include/linux/pm.h`, providing a set
0710 of power management callbacks analogous to the subsystem-level and device driver
0711 callbacks that are executed for the given device during all power transitions,
0712 instead of the respective subsystem-level callbacks.  Specifically, if a
0713 device's :c:member:`pm_domain` pointer is not NULL, the ``->suspend()`` callback
0714 from the object pointed to by it will be executed instead of its subsystem's
0715 (e.g. bus type's) ``->suspend()`` callback and analogously for all of the
0716 remaining callbacks.  In other words, power management domain callbacks, if
0717 defined for the given device, always take precedence over the callbacks provided
0718 by the device's subsystem (e.g. bus type).
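
The precedence rule can be illustrated with a small user-space model.  The
types below are simplified stand-ins, not the kernel's definitions: they
mirror only the fact that a PM domain embeds a set of callbacks and that a
bus type points to one.  All ``toy_`` names are hypothetical.

```c
#include <stddef.h>

struct toy_device;
typedef int (*toy_pm_cb)(struct toy_device *dev);

/* Stand-in for a set of PM callbacks; only ->suspend is modelled. */
struct toy_pm_ops {
	toy_pm_cb suspend;
};

/* Stand-in for a PM domain: it embeds its own callbacks. */
struct toy_pm_domain {
	struct toy_pm_ops ops;
};

/* Stand-in for a subsystem (e.g. a bus type). */
struct toy_bus {
	const struct toy_pm_ops *pm;
};

struct toy_device {
	struct toy_pm_domain *pm_domain;  /* NULL if not in a PM domain */
	const struct toy_bus *bus;
};

/* Example callbacks, distinguishable by their return values. */
int toy_domain_suspend(struct toy_device *dev) { (void)dev; return 1; }
int toy_bus_suspend(struct toy_device *dev) { (void)dev; return 2; }

/*
 * Pick the ->suspend callback to run for a device: a non-NULL pm_domain
 * pointer takes precedence over the subsystem's callbacks.
 */
toy_pm_cb toy_pick_suspend(const struct toy_device *dev)
{
	if (dev->pm_domain)
		return dev->pm_domain->ops.suspend;

	if (dev->bus && dev->bus->pm)
		return dev->bus->pm->suspend;

	return NULL;
}
```

Note that in this model a device with a PM domain never reaches the bus
callback, even if the domain leaves ``->suspend`` unset, which matches the
"instead of" wording above.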
0719 
0720 The support for device power management domains is only relevant to platforms
0721 needing to use the same device driver power management callbacks in many
0722 different power domain configurations and wanting to avoid incorporating the
0723 support for power domains into subsystem-level callbacks, for example by
0724 modifying the platform bus type.  Other platforms need not implement it or take
0725 it into account in any way.
0726 
0727 Devices may be defined as IRQ-safe, which indicates to the PM core that their
0728 runtime PM callbacks may be invoked with disabled interrupts (see
0729 Documentation/power/runtime_pm.rst for more information).  If an
0730 IRQ-safe device belongs to a PM domain, the runtime PM of the domain will be
0731 disallowed, unless the domain itself is defined as IRQ-safe. However, it
0732 makes sense to define a PM domain as IRQ-safe only if all the devices in it
0733 are IRQ-safe. Moreover, if an IRQ-safe domain has a parent domain, the runtime
0734 PM of the parent is only allowed if the parent itself is IRQ-safe too, with the
0735 additional restriction that all child domains of an IRQ-safe parent must also
0736 be IRQ-safe.
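
The IRQ-safe rules above can be written down as predicates.  The following is
a hedged user-space sketch, not the kernel's PM domain implementation:
``struct toy_domain`` and the ``toy_`` functions are hypothetical, and they
encode only the rules stated in this section.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a PM domain and the devices it contains. */
struct toy_domain {
	bool irq_safe;                    /* domain defined as IRQ-safe */
	const bool *dev_irq_safe;         /* one IRQ-safe flag per device */
	size_t ndevs;
	const struct toy_domain *parent;  /* NULL for a top-level domain */
};

/*
 * Runtime PM of a domain is disallowed if it contains an IRQ-safe device
 * while the domain itself is not IRQ-safe.
 */
bool toy_domain_runtime_pm_allowed(const struct toy_domain *d)
{
	size_t i;

	for (i = 0; i < d->ndevs; i++)
		if (d->dev_irq_safe[i] && !d->irq_safe)
			return false;

	return true;
}

/*
 * Runtime PM of the parent of an IRQ-safe domain is only allowed if the
 * parent is IRQ-safe as well.
 */
bool toy_parent_runtime_pm_allowed(const struct toy_domain *child)
{
	if (!child->parent)
		return true;  /* no parent, nothing to restrict */

	return !(child->irq_safe && !child->parent->irq_safe);
}

/* Marking a domain IRQ-safe only makes sense if all its devices are. */
bool toy_irq_safe_domain_makes_sense(const struct toy_domain *d)
{
	size_t i;

	for (i = 0; i < d->ndevs; i++)
		if (!d->dev_irq_safe[i])
			return false;

	return true;
}
```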
0737 
0738 
0739 Runtime Power Management
0740 ========================
0741 
0742 Many devices are able to dynamically power down while the system is still
0743 running. This feature is useful for devices that are not being used, and
0744 can offer significant power savings on a running system.  These devices
0745 often support a range of runtime power states, which might use names such
0746 as "off", "sleep", "idle", "active", and so on.  Those states will in some
0747 cases (like PCI) be partially constrained by the bus the device uses, and will
0748 usually include hardware states that are also used in system sleep states.
0749 
0750 A system-wide power transition can be started while some devices are in low
0751 power states due to runtime power management.  The system sleep PM callbacks
0752 should recognize such situations and react to them appropriately, but the
0753 necessary actions are subsystem-specific.
0754 
0755 In some cases the decision may be made at the subsystem level while in other
0756 cases the device driver may be left to decide.  In some cases it may be
0757 desirable to leave a suspended device in that state during a system-wide power
0758 transition, but in other cases the device must be put back into the full-power
0759 state temporarily, for example so that its system wakeup capability can be
0760 disabled.  This all depends on the hardware and the design of the subsystem and
0761 device driver in question.
0762 
0763 If it is necessary to resume a device from runtime suspend during a system-wide
0764 transition into a sleep state, that can be done by calling
0765 :c:func:`pm_runtime_resume` from the ``->suspend`` callback (or the ``->freeze``
0766 or ``->poweroff`` callback for transitions related to hibernation) of either the
0767 device's driver or its subsystem (for example, a bus type or a PM domain).
0768 However, subsystems must not otherwise change the runtime status of devices
0769 from their ``->prepare`` and ``->suspend`` callbacks (or equivalent) *before*
0770 invoking device drivers' ``->suspend`` callbacks (or equivalent).
0771 
0772 .. _smart_suspend_flag:
0773 
0774 The ``DPM_FLAG_SMART_SUSPEND`` Driver Flag
0775 ------------------------------------------
0776 
0777 Some bus types and PM domains have a policy to resume all devices from runtime
0778 suspend upfront in their ``->suspend`` callbacks, but that may not be really
0779 necessary if the device's driver can cope with runtime-suspended devices.
0780 The driver can indicate this by setting ``DPM_FLAG_SMART_SUSPEND`` in
0781 :c:member:`power.driver_flags` at probe time, with the assistance of the
0782 :c:func:`dev_pm_set_driver_flags` helper routine.
0783 
0784 Setting that flag causes the PM core and middle-layer code
0785 (bus types, PM domains etc.) to skip the ``->suspend_late`` and
0786 ``->suspend_noirq`` callbacks provided by the driver if the device remains in
0787 runtime suspend throughout those phases of the system-wide suspend (and
0788 similarly for the "freeze" and "poweroff" parts of system hibernation).
0789 [Otherwise the same driver
0790 callback might be executed twice in a row for the same device, which would not
0791 be valid in general.]  If the middle-layer system-wide PM callbacks are present
0792 for the device then they are responsible for skipping these driver callbacks;
0793 if not then the PM core skips them.  The subsystem callback routines can
0794 determine whether they need to skip the driver callbacks by testing the return
0795 value from the :c:func:`dev_pm_skip_suspend` helper function.
0796 
0797 In addition, with ``DPM_FLAG_SMART_SUSPEND`` set, the driver's ``->thaw_noirq``
0798 and ``->thaw_early`` callbacks are skipped in hibernation if the device remained
0799 in runtime suspend throughout the preceding "freeze" transition.  Again, if the
0800 middle-layer callbacks are present for the device, they are responsible for
0801 doing this, otherwise the PM core takes care of it.
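
As a rough illustration, a middle-layer "late" suspend routine honouring
``DPM_FLAG_SMART_SUSPEND`` might be structured as in the user-space model
below.  Everything here is a stand-in: ``toy_skip_suspend()`` plays the role
of :c:func:`dev_pm_skip_suspend` and is assumed to reduce to "flag set and
device in runtime suspend", and the flag value is arbitrary.

```c
#include <stdbool.h>

/* Stand-in for DPM_FLAG_SMART_SUSPEND; the value is illustrative only. */
#define TOY_FLAG_SMART_SUSPEND (1u << 0)

/* Hypothetical device model. */
struct toy_device {
	unsigned int driver_flags;  /* models power.driver_flags */
	bool runtime_suspended;     /* device is in runtime suspend */
	int late_calls;             /* how often the driver callback ran */
};

/* Models the helper a subsystem tests before calling into the driver. */
bool toy_skip_suspend(const struct toy_device *dev)
{
	return (dev->driver_flags & TOY_FLAG_SMART_SUSPEND) &&
	       dev->runtime_suspended;
}

/* The driver's ->suspend_late callback in this model. */
int toy_driver_suspend_late(struct toy_device *dev)
{
	dev->late_calls++;
	return 0;
}

/*
 * Middle-layer ->suspend_late: since it is present, it is the one
 * responsible for skipping the driver callback when appropriate.
 */
int toy_bus_suspend_late(struct toy_device *dev)
{
	if (toy_skip_suspend(dev))
		return 0;

	return toy_driver_suspend_late(dev);
}
```

The same check would be repeated in the middle layer's "noirq" suspend
routine, so that a device left in runtime suspend does not see its driver's
suspend callbacks run after ``->runtime_suspend`` has already been executed.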
0802 
0803 
0804 The ``DPM_FLAG_MAY_SKIP_RESUME`` Driver Flag
0805 --------------------------------------------
0806 
0807 During system-wide resume from a sleep state it's easiest to put devices into
0808 the full-power state, as explained in Documentation/power/runtime_pm.rst.
0809 [Refer to that document for more information regarding this particular issue as
0810 well as for information on the device runtime power management framework in
0811 general.]  However, it often is desirable to leave devices in suspend after
0812 system transitions to the working state, especially if those devices had been in
0813 runtime suspend before the preceding system-wide suspend (or analogous)
0814 transition.
0815 
0816 To that end, device drivers can use the ``DPM_FLAG_MAY_SKIP_RESUME`` flag to
0817 indicate to the PM core and middle-layer code that they allow their "noirq" and
0818 "early" resume callbacks to be skipped if the device can be left in suspend
0819 after system-wide PM transitions to the working state.  Whether or not that is
0820 the case generally depends on the state of the device before the given system
0821 suspend-resume cycle and on the type of the system transition under way.
0822 In particular, the "thaw" and "restore" transitions related to hibernation are
0823 not affected by ``DPM_FLAG_MAY_SKIP_RESUME`` at all.  [All callbacks are
0824 issued during the "restore" transition regardless of the flag settings,
0825 and whether or not any driver callbacks
0826 are skipped during the "thaw" transition depends on whether or not the
0827 ``DPM_FLAG_SMART_SUSPEND`` flag is set (see `above <smart_suspend_flag_>`_).
0828 In addition, a device is not allowed to remain in runtime suspend if any of its
0829 children will be returned to full power.]
0830 
0831 The ``DPM_FLAG_MAY_SKIP_RESUME`` flag is taken into account in combination with
0832 the :c:member:`power.may_skip_resume` status bit set by the PM core during the
0833 "suspend" phase of suspend-type transitions.  If the driver or the middle layer
0834 has a reason to prevent the driver's "noirq" and "early" resume callbacks from
0835 being skipped during the subsequent system resume transition, it should
0836 clear :c:member:`power.may_skip_resume` in its ``->suspend``, ``->suspend_late``
0837 or ``->suspend_noirq`` callback.  [Note that the drivers setting
0838 ``DPM_FLAG_SMART_SUSPEND`` need to clear :c:member:`power.may_skip_resume` in
0839 their ``->suspend`` callback in case the other two are skipped.]
0840 
0841 Setting the :c:member:`power.may_skip_resume` status bit along with the
0842 ``DPM_FLAG_MAY_SKIP_RESUME`` flag is necessary, but generally not sufficient,
0843 for the driver's "noirq" and "early" resume callbacks to be skipped.  Whether or
0844 not they should be skipped can be determined by evaluating the
0845 :c:func:`dev_pm_skip_resume` helper function.
0846 
0847 If that function returns ``true``, the driver's "noirq" and "early" resume
0848 callbacks should be skipped and the device's runtime PM status will be set to
0849 "suspended" by the PM core.  Otherwise, if the device was runtime-suspended
0850 during the preceding system-wide suspend transition and its
0851 ``DPM_FLAG_SMART_SUSPEND`` flag is set, its runtime PM status will be set to
0852 "active" by the PM core.  [Hence, the drivers that do not set
0853 ``DPM_FLAG_SMART_SUSPEND`` should not expect the runtime PM status of their
0854 devices to be changed from "suspended" to "active" by the PM core during
0855 system-wide resume-type transitions.]
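
The decision logic described in this section can be summarized in a
user-space model.  This is a sketch under stated assumptions, not the
implementation of :c:func:`dev_pm_skip_resume`: it encodes only the
conditions named in the text (which are necessary but not claimed to be
sufficient in the kernel), and all ``toy_`` names and flag values are
hypothetical.

```c
#include <stdbool.h>

enum toy_transition { TOY_RESUME, TOY_THAW, TOY_RESTORE };
enum toy_rpm_status { TOY_RPM_ACTIVE, TOY_RPM_SUSPENDED };

/* Stand-ins for the driver flags; the values are illustrative only. */
#define TOY_FLAG_SMART_SUSPEND   (1u << 0)
#define TOY_FLAG_MAY_SKIP_RESUME (1u << 1)

/* Hypothetical device model. */
struct toy_device {
	unsigned int driver_flags;
	bool may_skip_resume;    /* models power.may_skip_resume */
	bool runtime_suspended;  /* in runtime suspend during suspend/freeze */
	bool child_needs_power;  /* a child will be returned to full power */
};

/* May the driver's "noirq" and "early" resume callbacks be skipped? */
bool toy_skip_resume(const struct toy_device *dev, enum toy_transition t)
{
	/* All callbacks are issued during "restore". */
	if (t == TOY_RESTORE)
		return false;

	/* "thaw" depends on DPM_FLAG_SMART_SUSPEND only. */
	if (t == TOY_THAW)
		return (dev->driver_flags & TOY_FLAG_SMART_SUSPEND) &&
		       dev->runtime_suspended;

	/* Both the driver flag and the status bit are required. */
	if (!(dev->driver_flags & TOY_FLAG_MAY_SKIP_RESUME) ||
	    !dev->may_skip_resume)
		return false;

	/* A device cannot stay suspended if a child needs full power. */
	return !dev->child_needs_power;
}

/* Runtime PM status of the device after the resume-type transition. */
enum toy_rpm_status toy_status_after_resume(const struct toy_device *dev,
					    enum toy_transition t)
{
	if (toy_skip_resume(dev, t))
		return TOY_RPM_SUSPENDED;

	if (dev->runtime_suspended &&
	    (dev->driver_flags & TOY_FLAG_SMART_SUSPEND))
		return TOY_RPM_ACTIVE;

	/* Without SMART_SUSPEND the status is left as it was. */
	return dev->runtime_suspended ? TOY_RPM_SUSPENDED : TOY_RPM_ACTIVE;
}
```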
0856 
0857 If the ``DPM_FLAG_MAY_SKIP_RESUME`` flag is not set for a device, but
0858 ``DPM_FLAG_SMART_SUSPEND`` is set and the driver's "late" and "noirq" suspend
0859 callbacks are skipped, its system-wide "noirq" and "early" resume callbacks, if
0860 present, are invoked as usual and the device's runtime PM status is set to
0861 "active" by the PM core before enabling runtime PM for it.  In that case, the
0862 driver must be prepared to cope with the invocation of its system-wide resume
0863 callbacks back-to-back with its ``->runtime_suspend`` one (without the
0864 intervening ``->runtime_resume`` and system-wide suspend callbacks) and the
0865 final state of the device must reflect the "active" runtime PM status in that
0866 case.  [Note that this is not a problem at all if the driver's
0867 ``->suspend_late`` callback pointer points to the same function as its
0868 ``->runtime_suspend`` one and its ``->resume_early`` callback pointer points to
0869 the same function as the ``->runtime_resume`` one, while none of the other
0870 system-wide suspend-resume callbacks of the driver are present, for example.]
0871 
0872 Likewise, if ``DPM_FLAG_MAY_SKIP_RESUME`` is set for a device, its driver's
0873 system-wide "noirq" and "early" resume callbacks may be skipped while its "late"
0874 and "noirq" suspend callbacks may have been executed (in principle, regardless
0875 of whether or not ``DPM_FLAG_SMART_SUSPEND`` is set).  In that case, the driver
0876 needs to be able to cope with the invocation of its ``->runtime_resume``
0877 callback back-to-back with its "late" and "noirq" suspend ones.  [For instance,
0878 that is not a concern if the driver sets both ``DPM_FLAG_SMART_SUSPEND`` and
0879 ``DPM_FLAG_MAY_SKIP_RESUME`` and uses the same pair of suspend/resume callback
0880 functions for runtime PM and system-wide suspend/resume.]