0001 ====================
0002 PCI Power Management
0003 ====================
0004
0005 Copyright (c) 2010 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
0006
0007 An overview of concepts and the Linux kernel's interfaces related to PCI power
0008 management. Based on previous work by Patrick Mochel <mochel@transmeta.com>
0009 (and others).
0010
0011 This document only covers the aspects of power management specific to PCI
0012 devices. For a general description of the kernel's interfaces related to device
0013 power management refer to Documentation/driver-api/pm/devices.rst and
0014 Documentation/power/runtime_pm.rst.
0015
0016 .. contents:
0017
0018 1. Hardware and Platform Support for PCI Power Management
0019 2. PCI Subsystem and Device Power Management
0020 3. PCI Device Drivers and Power Management
0021 4. Resources
0022
0023
0024 1. Hardware and Platform Support for PCI Power Management
0025 =========================================================
0026
0027 1.1. Native and Platform-Based Power Management
0028 -----------------------------------------------
0029
0030 In general, power management is a feature allowing one to save energy by putting
0031 devices into states in which they draw less power (low-power states) at the
0032 price of reduced functionality or performance.
0033
0034 Usually, a device is put into a low-power state when it is underutilized or
0035 completely inactive. However, when it is necessary to use the device once
0036 again, it has to be put back into the "fully functional" state (full-power
0037 state). This may happen when there are some data for the device to handle or
0038 as a result of an external event requiring the device to be active, which may
0039 be signaled by the device itself.
0040
0041 PCI devices may be put into low-power states in two ways: by using the device
0042 capabilities introduced by the PCI Bus Power Management Interface Specification,
0043 or with the help of platform firmware, such as an ACPI BIOS. In the first
0044 approach, referred to as native PCI power management (native PCI PM) in what
0045 follows, the device power state is changed as a result of writing a
0046 specific value into one of its standard configuration registers. The second
0047 approach requires the platform firmware to provide special methods that may be
0048 used by the kernel to change the device's power state.
0049
0050 Devices supporting the native PCI PM usually can generate wakeup signals called
0051 Power Management Events (PMEs) to let the kernel know about external events
0052 requiring the device to be active. After receiving a PME the kernel is supposed
0053 to put the device that sent it into the full-power state. However, the PCI Bus
0054 Power Management Interface Specification doesn't define any standard method of
0055 delivering the PME from the device to the CPU and the operating system kernel.
0056 It is assumed that the platform firmware will perform this task and therefore,
0057 even though a PCI device is set up to generate PMEs, it also may be necessary to
0058 prepare the platform firmware for notifying the CPU of the PMEs coming from the
0059 device (e.g. by generating interrupts).
0060
0061 In turn, if the methods provided by the platform firmware are used for changing
0062 the power state of a device, usually the platform also provides a method for
0063 preparing the device to generate wakeup signals. In that case, however, it
0064 is often also necessary to prepare the device for generating PMEs using the
0065 native PCI PM mechanism, because the method provided by the platform depends on
0066 that.
0067
0068 Thus in many situations both the native and the platform-based power management
0069 mechanisms have to be used simultaneously to obtain the desired result.
0070
0071 1.2. Native PCI Power Management
0072 --------------------------------
0073
0074 The PCI Bus Power Management Interface Specification (PCI PM Spec) was
0075 introduced between the PCI 2.1 and PCI 2.2 Specifications. It defined a
0076 standard interface for performing various operations related to power
0077 management.
0078
0079 The implementation of the PCI PM Spec is optional for conventional PCI devices,
0080 but it is mandatory for PCI Express devices. If a device supports the PCI PM
0081 Spec, it has an 8 byte power management capability field in its PCI
0082 configuration space. This field is used to describe and control the standard
0083 features related to the native PCI power management.
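
As a rough illustration of what "control" means here, the sketch below models
how the power-state field of the Power Management Control/Status Register
(PMCSR) inside that capability could be updated. The offset and masks mirror
the PCI_PM_CTRL* constants in include/uapi/linux/pci_regs.h; the function name
pmcsr_with_state() is hypothetical and this is a userspace model, not kernel
code.

```c
#include <assert.h>
#include <stdint.h>

#define PM_CTRL_OFFSET      4        /* PMCSR offset within the PM capability */
#define PM_CTRL_STATE_MASK  0x0003u  /* bits 1:0: D-state (0 = D0 ... 3 = D3hot) */
#define PM_CTRL_PME_ENABLE  0x0100u  /* bit 8: PME generation enabled */

/* Return the PMCSR value requesting D-state `state`; other bits are kept. */
static uint16_t pmcsr_with_state(uint16_t pmcsr, unsigned int state)
{
    assert(state <= 3);  /* D3cold cannot be requested through this register */
    pmcsr &= (uint16_t)~PM_CTRL_STATE_MASK;
    return (uint16_t)(pmcsr | state);
}
```

For example, pmcsr_with_state(0x0100, 3) keeps PME generation enabled while
requesting D3hot.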
0084
0085 The PCI PM Spec defines 4 operating states for devices (D0-D3) and for buses
0086 (B0-B3). The higher the number, the less power is drawn by the device or bus
0087 in that state. However, the higher the number, the longer the latency for
0088 the device or bus to return to the full-power state (D0 or B0, respectively).
0089
0090 There are two variants of the D3 state defined by the specification. The first
0091 one is D3hot, referred to as the software accessible D3, because devices can be
0092 programmed to go into it. The second one, D3cold, is the state that PCI devices
0093 are in when the supply voltage (Vcc) is removed from them. It is not possible
0094 to program a PCI device to go into D3cold, although there may be a programmable
0095 interface for putting the bus the device is on into a state in which Vcc is
0096 removed from all devices on the bus.
0097
0098 PCI bus power management, however, is not supported by the Linux kernel at the
0099 time of this writing and therefore it is not covered by this document.
0100
0101 Note that every PCI device can be in the full-power state (D0) or in D3cold,
0102 regardless of whether or not it implements the PCI PM Spec. In addition to
0103 that, if the PCI PM Spec is implemented by the device, it must support D3hot
0104 as well as D0. The support for the D1 and D2 power states is optional.
0105
0106 PCI devices supporting the PCI PM Spec can be programmed to go to any of the
0107 supported low-power states (except for D3cold). While in D1-D3hot the
0108 standard configuration registers of the device must be accessible to software
0109 (i.e. the device is required to respond to PCI configuration accesses), although
0110 its I/O and memory spaces are then disabled. This allows the device to be
0111 programmatically put into D0. Thus the kernel can switch the device back and
0112 forth between D0 and the supported low-power states (except for D3cold) and the
0113 possible power state transitions the device can undergo are the following:
0114
0115 +---------------+------------+
0116 | Current State | New State  |
0117 +---------------+------------+
0118 | D0            | D1, D2, D3 |
0119 +---------------+------------+
0120 | D1            | D2, D3     |
0121 +---------------+------------+
0122 | D2            | D3         |
0123 +---------------+------------+
0124 | D1, D2, D3    | D0         |
0125 +---------------+------------+
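
The table can be restated as a small predicate. The following is only a
userspace sketch: the enum and the function name transition_allowed() are
hypothetical, and "D3" in the table is taken to mean D3hot, since D3cold cannot
be programmed.

```c
#include <stdbool.h>

enum d_state { D0 = 0, D1 = 1, D2 = 2, D3HOT = 3 };

/*
 * True if software may program a direct transition cur -> next according to
 * the table: deeper states are reachable from shallower ones, and D0 is
 * reachable from any low-power state.
 */
static bool transition_allowed(enum d_state cur, enum d_state next)
{
    if (cur == next)
        return false;   /* not a transition */
    if (next == D0)
        return true;    /* D1, D2, D3 -> D0 */
    return next > cur;  /* otherwise only towards lower power */
}
```

Note that the model does not check whether the optional D1 and D2 states are
actually supported by a given device.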
0126
0127 The transition from D3cold to D0 occurs when the supply voltage is provided to
0128 the device (i.e. power is restored). In that case the device returns to D0 with
0129 a full power-on reset sequence and the power-on defaults are restored to the
0130 device by hardware just as at initial power up.
0131
0132 PCI devices supporting the PCI PM Spec can be programmed to generate PMEs
0133 while in any power state (D0-D3), but they are not required to be capable
0134 of generating PMEs from all supported power states. In particular, the
0135 capability of generating PMEs from D3cold is optional and depends on the
0136 presence of additional voltage (3.3Vaux) allowing the device to remain
0137 sufficiently active to generate a wakeup signal.
0138
0139 1.3. ACPI Device Power Management
0140 ---------------------------------
0141
0142 The platform firmware support for the power management of PCI devices is
0143 system-specific. However, if the system in question is compliant with the
0144 Advanced Configuration and Power Interface (ACPI) Specification, like the
0145 majority of x86-based systems, it is supposed to implement device power
0146 management interfaces defined by the ACPI standard.
0147
0148 For this purpose the ACPI BIOS provides special functions called "control
0149 methods" that may be executed by the kernel to perform specific tasks, such as
0150 putting a device into a low-power state. These control methods are encoded
0151 using a special byte-code language called the ACPI Machine Language (AML) and
0152 stored in the machine's BIOS. The kernel loads them from the BIOS and executes
0153 them as needed using an AML interpreter that translates the AML byte code into
0154 computations and memory or I/O space accesses. This way, in theory, a BIOS
0155 writer can provide the kernel with a means to perform actions depending
0156 on the system design in a system-specific fashion.
0157
0158 ACPI control methods may be divided into global control methods, which are not
0159 associated with any particular devices, and device control methods, which have
0160 to be defined separately for each device supposed to be handled with the help of
0161 the platform. This means, in particular, that ACPI device control methods can
0162 only be used to handle devices that the BIOS writer knew about in advance. The
0163 ACPI methods used for device power management fall into that category.
0164
0165 The ACPI specification assumes that devices can be in one of four power states
0166 labeled as D0, D1, D2, and D3 that roughly correspond to the native PCI PM
0167 D0-D3 states (although the difference between D3hot and D3cold is not taken
0168 into account by ACPI). Moreover, for each power state of a device there is a
0169 set of power resources that have to be enabled for the device to be put into
0170 that state. These power resources are controlled (i.e. enabled or disabled)
0171 with the help of their own control methods, _ON and _OFF, that have to be
0172 defined individually for each of them.
0173
0174 To put a device into the ACPI power state Dx (where x is a number between 0 and
0175 3 inclusive) the kernel is supposed to (1) enable the power resources required
0176 by the device in this state using their _ON control methods and (2) execute the
0177 _PSx control method defined for the device. In addition to that, if the device
0178 is going to be put into a low-power state (D1-D3) and is supposed to generate
0179 wakeup signals from that state, the _DSW (or _PSW, replaced with _DSW by ACPI
0180 3.0) control method defined for it has to be executed before _PSx. Power
0181 resources that are not required by the device in the target power state and are
0182 not required any more by any other device should be disabled (by executing their
0183 _OFF control methods). If the current power state of the device is D3, it can
0184 only be put into D0 this way.
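
The sequence described above can be sketched as a small planning function.
This is a userspace model with hypothetical names (acpi_dx_plan() and the
acpi_step enum). The text only requires _DSW to run before _PSx; placing _DSW
before _ON and _OFF after _PSx is an assumption made for the sketch.

```c
#include <stdbool.h>

enum acpi_step { STEP_DSW, STEP_RES_ON, STEP_PSX, STEP_RES_OFF };

/*
 * Fill `steps` with the actions for a transition from D<cur> to D<target>
 * and return their count, or -1 if the transition cannot be done this way.
 */
static int acpi_dx_plan(int cur, int target, bool wakeup,
                        enum acpi_step steps[4])
{
    int n = 0;

    if (cur == 3 && target != 0)
        return -1;              /* from D3 only D0 is reachable */
    if (wakeup && target > 0)
        steps[n++] = STEP_DSW;  /* _DSW (or _PSW) must precede _PSx */
    steps[n++] = STEP_RES_ON;   /* _ON for resources required in D<target> */
    steps[n++] = STEP_PSX;      /* _PS0 ... _PS3 */
    steps[n++] = STEP_RES_OFF;  /* _OFF for resources no longer needed */
    return n;
}
```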
0185
0186 However, quite often the power states of devices are changed during a
0187 system-wide transition into a sleep state or back into the working state. ACPI
0188 defines four system sleep states, S1, S2, S3, and S4, and denotes the system
0189 working state as S0. In general, the target system sleep (or working) state
0190 determines the highest power (lowest number) state the device can be put
0191 into and the kernel is supposed to obtain this information by executing the
0192 device's _SxD control method (where x is a number between 0 and 4 inclusive).
0193 If the device is required to wake up the system from the target sleep state, the
0194 lowest power (highest number) state it can be put into is also determined by the
0195 target state of the system. The kernel is then supposed to use the device's
0196 _SxW control method to obtain the number of that state. It also is supposed to
0197 use the device's _PRW control method to learn which power resources need to be
0198 enabled for the device to be able to generate wakeup signals.
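
To make the two bounds concrete, the sketch below computes the range of device
states permitted for a given target system state. It is a simplified userspace
model: sleep_state_range() and struct d_range are hypothetical names, and the
sxd and sxw arguments stand for the values the kernel would obtain from _SxD
and _SxW.

```c
#include <stdbool.h>

struct d_range {
    int shallowest;  /* highest-power state allowed (from _SxD) */
    int deepest;     /* lowest-power state allowed */
};

/*
 * sxd and sxw are D-state numbers (0..3) as reported for the target system
 * state. Without wakeup, any state down to D3 is acceptable.
 */
static struct d_range sleep_state_range(int sxd, int sxw, bool wakeup)
{
    struct d_range r = { sxd, 3 };

    if (wakeup)
        r.deepest = sxw;  /* deeper states could not wake up the system */
    return r;
}
```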
0199
0200 1.4. Wakeup Signaling
0201 ---------------------
0202
0203 Wakeup signals generated by PCI devices, either as native PCI PMEs, or as
0204 a result of the execution of the _DSW (or _PSW) ACPI control method before
0205 putting the device into a low-power state, have to be caught and handled as
0206 appropriate. If they are sent while the system is in the working state
0207 (ACPI S0), they should be translated into interrupts so that the kernel can
0208 put the devices generating them into the full-power state and take care of the
0209 events that triggered them. In turn, if they are sent while the system is
0210 sleeping, they should cause the system's core logic to trigger wakeup.
0211
0212 On ACPI-based systems wakeup signals sent by conventional PCI devices are
0213 converted into ACPI General-Purpose Events (GPEs) which are hardware signals
0214 from the system core logic generated in response to various events that need to
0215 be acted upon. Every GPE is associated with one or more sources of potentially
0216 interesting events. In particular, a GPE may be associated with a PCI device
0217 capable of signaling wakeup. The information on the connections between GPEs
0218 and event sources is recorded in the system's ACPI BIOS from where it can be
0219 read by the kernel.
0220
0221 If a PCI device known to the system's ACPI BIOS signals wakeup, the GPE
0222 associated with it (if there is one) is triggered. The GPEs associated with PCI
0223 bridges may also be triggered in response to a wakeup signal from one of the
0224 devices below the bridge (this also is the case for root bridges) and, for
0225 example, native PCI PMEs from devices unknown to the system's ACPI BIOS may be
0226 handled this way.
0227
0228 A GPE may be triggered when the system is sleeping (i.e. when it is in one of
0229 the ACPI S1-S4 states), in which case system wakeup is started by its core logic
0230 (the device that was the source of the signal causing the system wakeup to occur
0231 may be identified later). The GPEs used in such situations are referred to as
0232 wakeup GPEs.
0233
0234 Usually, however, GPEs are also triggered when the system is in the working
0235 state (ACPI S0) and in that case the system's core logic generates a System
0236 Control Interrupt (SCI) to notify the kernel of the event. Then, the SCI
0237 handler identifies the GPE that caused the interrupt to be generated which,
0238 in turn, allows the kernel to identify the source of the event (that may be
0239 a PCI device signaling wakeup). The GPEs used for notifying the kernel of
0240 events occurring while the system is in the working state are referred to as
0241 runtime GPEs.
0242
0243 Unfortunately, there is no standard way of handling wakeup signals sent by
0244 conventional PCI devices on systems that are not ACPI-based, but there is one
0245 for PCI Express devices. Namely, the PCI Express Base Specification introduced
0246 a native mechanism for converting native PCI PMEs into interrupts generated by
0247 root ports. For conventional PCI devices native PMEs are out-of-band, so they
0248 are routed separately and they need not pass through bridges (in principle they
0249 may be routed directly to the system's core logic), but for PCI Express devices
0250 they are in-band messages that have to pass through the PCI Express hierarchy,
0251 including the root port on the path from the device to the Root Complex. Thus
0252 it was possible to introduce a mechanism by which a root port generates an
0253 interrupt whenever it receives a PME message from one of the devices below it.
0254 The PCI Express Requester ID of the device that sent the PME message is then
0255 recorded in one of the root port's configuration registers from where it may be
0256 read by the interrupt handler allowing the device to be identified. [PME
0257 messages sent by PCI Express endpoints integrated with the Root Complex don't
0258 pass through root ports, but instead they cause a Root Complex Event Collector
0259 (if there is one) to generate interrupts.]
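
As an illustration of the identification step, the sketch below decodes a value
read from the root port's Root Status register in the PCI Express capability.
The bit layout (Requester ID in bits 15:0, PME Status in bit 16) follows the
PCI_EXP_RTSTA definitions in include/uapi/linux/pci_regs.h; decode_pme_source()
and struct pme_source are hypothetical names for this userspace model.

```c
#include <stdint.h>

#define RTSTA_PME_REQ_ID  0x0000ffffu  /* bits 15:0: Requester ID of the source */
#define RTSTA_PME_STATUS  0x00010000u  /* bit 16: a PME has been received */

struct pme_source {
    uint8_t bus;  /* Requester ID bits 15:8 */
    uint8_t dev;  /* Requester ID bits 7:3 */
    uint8_t fn;   /* Requester ID bits 2:0 */
};

/* Decode the register value; returns 0 if no PME is flagged, 1 otherwise. */
static int decode_pme_source(uint32_t rtsta, struct pme_source *src)
{
    uint16_t req_id;

    if (!(rtsta & RTSTA_PME_STATUS))
        return 0;
    req_id = (uint16_t)(rtsta & RTSTA_PME_REQ_ID);
    src->bus = (uint8_t)(req_id >> 8);
    src->dev = (uint8_t)((req_id >> 3) & 0x1f);
    src->fn  = (uint8_t)(req_id & 0x07);
    return 1;
}
```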
0260
0261 In principle the native PCI Express PME signaling may also be used on ACPI-based
0262 systems along with the GPEs, but to use it the kernel has to ask the system's
0263 ACPI BIOS to release control of root port configuration registers. The ACPI
0264 BIOS, however, is not required to allow the kernel to control these registers
0265 and if it doesn't do that, the kernel must not modify their contents. Of course
0266 the native PCI Express PME signaling cannot be used by the kernel in that case.
0267
0268
0269 2. PCI Subsystem and Device Power Management
0270 ============================================
0271
0272 2.1. Device Power Management Callbacks
0273 --------------------------------------
0274
0275 The PCI Subsystem participates in the power management of PCI devices in a
0276 number of ways. First of all, it provides an intermediate code layer between
0277 the device power management core (PM core) and PCI device drivers.
0278 Specifically, the pm field of the PCI subsystem's struct bus_type object,
0279 pci_bus_type, points to a struct dev_pm_ops object, pci_dev_pm_ops, containing
0280 pointers to several device power management callbacks::
0281
0282 const struct dev_pm_ops pci_dev_pm_ops = {
0283 .prepare = pci_pm_prepare,
0284 .complete = pci_pm_complete,
0285 .suspend = pci_pm_suspend,
0286 .resume = pci_pm_resume,
0287 .freeze = pci_pm_freeze,
0288 .thaw = pci_pm_thaw,
0289 .poweroff = pci_pm_poweroff,
0290 .restore = pci_pm_restore,
0291 .suspend_noirq = pci_pm_suspend_noirq,
0292 .resume_noirq = pci_pm_resume_noirq,
0293 .freeze_noirq = pci_pm_freeze_noirq,
0294 .thaw_noirq = pci_pm_thaw_noirq,
0295 .poweroff_noirq = pci_pm_poweroff_noirq,
0296 .restore_noirq = pci_pm_restore_noirq,
0297 .runtime_suspend = pci_pm_runtime_suspend,
0298 .runtime_resume = pci_pm_runtime_resume,
0299 .runtime_idle = pci_pm_runtime_idle,
0300 };
0301
0302 These callbacks are executed by the PM core in various situations related to
0303 device power management and they, in turn, execute power management callbacks
0304 provided by PCI device drivers. They also perform power management operations
0305 involving some standard configuration registers of PCI devices that device
0306 drivers need not know or care about.
0307
0308 The structure representing a PCI device, struct pci_dev, contains several fields
0309 that these callbacks operate on::
0310
0311 struct pci_dev {
0312 ...
0313 pci_power_t current_state; /* Current operating state. */
0314 int pm_cap; /* PM capability offset in the
0315 configuration space */
0316 unsigned int pme_support:5; /* Bitmask of states from which PME#
0317 can be generated */
0318 unsigned int pme_poll:1; /* Poll device's PME status bit */
0319 unsigned int d1_support:1; /* Low power state D1 is supported */
0320 unsigned int d2_support:1; /* Low power state D2 is supported */
0321 unsigned int no_d1d2:1; /* D1 and D2 are forbidden */
0322 unsigned int wakeup_prepared:1; /* Device prepared for wake up */
0323 unsigned int d3hot_delay; /* D3hot->D0 transition time in ms */
0324 ...
0325 };
0326
0327 They also indirectly use some fields of the struct device that is embedded in
0328 struct pci_dev.
0329
0330 2.2. Device Initialization
0331 --------------------------
0332
0333 The PCI subsystem's first task related to device power management is to
0334 prepare the device for power management and initialize the fields of struct
0335 pci_dev used for this purpose. This happens in two functions defined in
0336 drivers/pci/pci.c, pci_pm_init() and platform_pci_wakeup_init().
0337
0338 The first of these functions checks if the device supports native PCI PM
0339 and if that's the case the offset of its power management capability structure
0340 in the configuration space is stored in the pm_cap field of the device's struct
0341 pci_dev object. Next, the function checks which PCI low-power states are
0342 supported by the device and from which low-power states the device can generate
0343 native PCI PMEs. The power management fields of the device's struct pci_dev and
0344 the struct device embedded in it are updated accordingly and the generation of
0345 PMEs by the device is disabled.
0346
0347 The second function checks if the device can be prepared to signal wakeup with
0348 the help of the platform firmware, such as the ACPI BIOS. If that is the case,
0349 the function updates the wakeup fields in struct device embedded in the
0350 device's struct pci_dev and uses the firmware-provided method to prevent the
0351 device from signaling wakeup.
0352
0353 At this point the device is ready for power management. For driverless devices,
0354 however, this functionality is limited to a few basic operations carried out
0355 during system-wide transitions to a sleep state and back to the working state.
0356
0357 2.3. Runtime Device Power Management
0358 ------------------------------------
0359
0360 The PCI subsystem plays a vital role in the runtime power management of PCI
0361 devices. For this purpose it uses the general runtime power management
0362 (runtime PM) framework described in Documentation/power/runtime_pm.rst.
0363 Namely, it provides subsystem-level callbacks::
0364
0365 pci_pm_runtime_suspend()
0366 pci_pm_runtime_resume()
0367 pci_pm_runtime_idle()
0368
0369 that are executed by the core runtime PM routines. It also implements the
0370 entire mechanics necessary for handling runtime wakeup signals from PCI devices
0371 in low-power states, which at the time of this writing works for both the native
0372 PCI Express PME signaling and the ACPI GPE-based wakeup signaling described in
0373 Section 1.
0374
0375 First, a PCI device is put into a low-power state, or suspended, with the help
0376 of pm_schedule_suspend() or pm_runtime_suspend() which for PCI devices call
0377 pci_pm_runtime_suspend() to do the actual job. For this to work, the device's
0378 driver has to provide a pm->runtime_suspend() callback (see below), which is
0379 run by pci_pm_runtime_suspend() as the first action. If the driver's callback
0380 returns successfully, the device's standard configuration registers are saved,
0381 the device is prepared to generate wakeup signals and, finally, it is put into
0382 the target low-power state.
0383
0384 The low-power state to put the device into is the lowest-power (highest number)
0385 state from which it can signal wakeup. The exact method of signaling wakeup is
0386 system-dependent and is determined by the PCI subsystem on the basis of the
0387 reported capabilities of the device and the platform firmware. To prepare the
0388 device for signaling wakeup and put it into the selected low-power state, the
0389 PCI subsystem can use the platform firmware as well as the device's native PCI
0390 PM capabilities, if supported.
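
A much simplified version of this selection, looking only at the device's
native PME capability, might look as follows. The pme_support argument is
modeled on the bitmask field of struct pci_dev shown earlier (bit n set means
PMEs can be generated from state n); deepest_wakeup_state() is a hypothetical
name, and the platform firmware and D3cold are deliberately left out.

```c
enum { D0 = 0, D1 = 1, D2 = 2, D3HOT = 3, D3COLD = 4 };

/*
 * Return the deepest state not below D3hot from which the device can
 * generate PMEs, or D0 if it cannot signal wakeup from any low-power state.
 */
static int deepest_wakeup_state(unsigned int pme_support)
{
    int state = D3HOT;

    while (state > D0 && !(pme_support & (1u << state)))
        state--;
    return state;
}
```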
0391
0392 It is expected that the device driver's pm->runtime_suspend() callback will
0393 not attempt to prepare the device for signaling wakeup or to put it into a
0394 low-power state. The driver ought to leave these tasks to the PCI subsystem
0395 that has all of the information necessary to perform them.
0396
0397 A suspended device is brought back into the "active" state, or resumed,
0398 with the help of pm_request_resume() or pm_runtime_resume() which both call
0399 pci_pm_runtime_resume() for PCI devices. Again, this only works if the device's
0400 driver provides a pm->runtime_resume() callback (see below). However, before
0401 the driver's callback is executed, pci_pm_runtime_resume() brings the device
0402 back into the full-power state, prevents it from signaling wakeup while in that
0403 state and restores its standard configuration registers. Thus the driver's
0404 callback need not worry about the PCI-specific aspects of the device resume.
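
The ordering of the driver's and the subsystem's work in the two paths can be
sketched with a small userspace model. All names below (struct mock_dev,
bus_runtime_suspend(), bus_runtime_resume() and the sample driver callbacks)
are hypothetical; the model only records the order of operations in a log
string and does not touch any hardware.

```c
#include <string.h>

struct mock_dev {
    int (*drv_runtime_suspend)(struct mock_dev *dev);
    int (*drv_runtime_resume)(struct mock_dev *dev);
    char log[128];  /* records the order of operations */
};

static void note(struct mock_dev *dev, const char *what)
{
    strncat(dev->log, what, sizeof(dev->log) - strlen(dev->log) - 1);
}

/* Sample driver callbacks: they only handle device-specific work. */
static int drv_suspend(struct mock_dev *dev) { note(dev, "drv-suspend;"); return 0; }
static int drv_resume(struct mock_dev *dev)  { note(dev, "drv-resume;");  return 0; }

static int bus_runtime_suspend(struct mock_dev *dev)
{
    int err;

    if (!dev->drv_runtime_suspend)
        return -1;                       /* nothing can be done without it */
    err = dev->drv_runtime_suspend(dev); /* the driver goes first */
    if (err)
        return err;
    note(dev, "save-config;");
    note(dev, "enable-wakeup;");
    note(dev, "low-power;");
    return 0;
}

static int bus_runtime_resume(struct mock_dev *dev)
{
    if (!dev->drv_runtime_resume)
        return -1;
    note(dev, "full-power;");
    note(dev, "disable-wakeup;");
    note(dev, "restore-config;");
    return dev->drv_runtime_resume(dev); /* the driver goes last */
}
```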
0405
0406 Note that generally pci_pm_runtime_resume() may be called in two different
0407 situations. First, it may be called at the request of the device's driver, for
0408 example if there are some data for it to process. Second, it may be called
0409 as a result of a wakeup signal from the device itself (this sometimes is
0410 referred to as "remote wakeup"). Of course, for this purpose the wakeup signal
0411 is handled in one of the ways described in Section 1 and finally converted into
0412 a notification for the PCI subsystem after the source device has been
0413 identified.
0414
0415 The pci_pm_runtime_idle() function, called for PCI devices by pm_runtime_idle()
0416 and pm_request_idle(), executes the device driver's pm->runtime_idle()
0417 callback, if defined, and if that callback doesn't return an error code (or is
0418 not present at all), suspends the device with the help of pm_runtime_suspend().
0419 Sometimes pci_pm_runtime_idle() is called automatically by the PM core (for
0420 example, right after the device has been resumed), in which case it is
0421 expected to suspend the device if that makes sense. Usually, however, the
0422 PCI subsystem cannot tell whether the device can actually be suspended, so
0423 it lets the device's driver decide by running its
0424 pm->runtime_idle() callback.
0425
0426 2.4. System-Wide Power Transitions
0427 ----------------------------------
0428 There are a few different types of system-wide power transitions, described in
0429 Documentation/driver-api/pm/devices.rst. Each of them requires devices to be
0430 handled in a specific way and the PM core executes subsystem-level power
0431 management callbacks for this purpose. They are executed in phases such that
0432 each phase involves executing the same subsystem-level callback for every device
0433 belonging to the given subsystem before the next phase begins. These phases
0434 always run after tasks have been frozen.
0435
0436 2.4.1. System Suspend
0437 ^^^^^^^^^^^^^^^^^^^^^
0438
0439 When the system is going into a sleep state in which the contents of memory will
0440 be preserved, such as one of the ACPI sleep states S1-S3, the phases are:
0441
0442 prepare, suspend, suspend_noirq.
0443
0444 The following PCI bus type's callbacks, respectively, are used in these phases::
0445
0446 pci_pm_prepare()
0447 pci_pm_suspend()
0448 pci_pm_suspend_noirq()
0449
0450 The pci_pm_prepare() routine first puts the device into the "fully functional"
0451 state with the help of pm_runtime_resume(). Then, it executes the device
0452 driver's pm->prepare() callback if defined (i.e. if the driver's struct
0453 dev_pm_ops object is present and the prepare pointer in that object is valid).
0454
0455 The pci_pm_suspend() routine first checks if the device's driver implements
0456 legacy PCI suspend routines (see Section 3), in which case the driver's legacy
0457 suspend callback is executed, if present, and its result is returned. Next, if
0458 the device's driver doesn't provide a struct dev_pm_ops object (containing
0459 pointers to the driver's callbacks), pci_pm_default_suspend() is called, which
0460 simply turns off the device's bus master capability and runs
0461 pcibios_disable_device() to disable it, unless the device is a bridge (PCI
0462 bridges are ignored by this routine). Next, the device driver's pm->suspend()
0463 callback is executed, if defined, and its result is returned if it fails.
0464 Finally, pci_fixup_device() is called to apply hardware suspend quirks related
0465 to the device if necessary.
0466
0467 Note that the suspend phase is carried out asynchronously for PCI devices, so
0468 the pci_pm_suspend() callback may be executed in parallel for any pair of PCI
0469 devices that don't depend on each other in a known way (i.e. none of the paths
0470 in the device tree from the root bridge to a leaf device contains both of them).
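
The independence condition stated in parentheses can be expressed as an
ancestor check on the device tree. The sketch below is a userspace model with
hypothetical names (struct node, may_run_in_parallel()); the PM core's actual
dependency tracking is more involved than a plain parent pointer walk.

```c
#include <stdbool.h>
#include <stddef.h>

struct node {
    const struct node *parent;  /* NULL for the root bridge */
};

/* True if `a` lies on the path from the root to `d` (excluding `d` itself). */
static bool is_ancestor(const struct node *a, const struct node *d)
{
    for (d = d->parent; d != NULL; d = d->parent)
        if (d == a)
            return true;
    return false;
}

/* Two distinct devices are independent if neither is above the other. */
static bool may_run_in_parallel(const struct node *x, const struct node *y)
{
    return x != y && !is_ancestor(x, y) && !is_ancestor(y, x);
}
```

Two devices below the same bridge pass this check, whereas a bridge and any
device below it do not.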
0471
0472 The pci_pm_suspend_noirq() routine is executed after suspend_device_irqs() has
0473 been called, which means that the device driver's interrupt handler won't be
0474 invoked while this routine is running. It first checks if the device's driver
0475 implements legacy PCI suspend routines (Section 3), in which case the legacy
0476 late suspend routine is called and its result is returned (the standard
0477 configuration registers of the device are saved if the driver's callback hasn't
0478 done that). Second, if the device driver's struct dev_pm_ops object is not
0479 present, the device's standard configuration registers are saved and the routine
0480 returns success. Otherwise the device driver's pm->suspend_noirq() callback is
0481 executed, if present, and its result is returned if it fails. Next, if the
0482 device's standard configuration registers haven't been saved yet (one of the
0483 device driver's callbacks executed before might do that), pci_pm_suspend_noirq()
0484 saves them, prepares the device to signal wakeup (if necessary) and puts it into
0485 a low-power state.
0486
0487 The low-power state to put the device into is the lowest-power (highest number)
0488 state from which it can signal wakeup while the system is in the target sleep
0489 state. Just like in the runtime PM case described above, the mechanism of
0490 signaling wakeup is system-dependent and determined by the PCI subsystem, which
0491 is also responsible for preparing the device to signal wakeup from the system's
0492 target sleep state as appropriate.
0493
0494 PCI device drivers (that don't implement legacy power management callbacks) are
0495 generally not expected to prepare devices for signaling wakeup or to put them
0496 into low-power states. However, if one of the driver's suspend callbacks
0497 (pm->suspend() or pm->suspend_noirq()) saves the device's standard configuration
0498 registers, pci_pm_suspend_noirq() will assume that the device has been prepared
0499 to signal wakeup and put into a low-power state by the driver (the driver is
0500 then assumed to have used the helper functions provided by the PCI subsystem for
0501 this purpose). PCI device drivers are not encouraged to do that, but in some
0502 rare cases doing that in the driver may be the optimum approach.
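
The decision flow of the suspend_noirq phase, including the "state already
saved" assumption discussed above, can be modeled roughly as follows. This is
a userspace sketch with hypothetical names (struct mock_pci_dev,
model_suspend_noirq(), drv_saves_state()); the legacy driver path is omitted.

```c
#include <stdbool.h>
#include <stddef.h>

struct mock_pci_dev;

struct mock_pm_ops {
    int (*suspend_noirq)(struct mock_pci_dev *dev);
};

struct mock_pci_dev {
    const struct mock_pm_ops *pm;  /* NULL: the driver has no dev_pm_ops */
    bool may_wakeup;               /* device is expected to signal wakeup */
    bool state_saved;              /* config registers already saved */
    bool wakeup_prepared;
    bool low_power;
};

static int model_suspend_noirq(struct mock_pci_dev *dev)
{
    if (!dev->pm) {
        dev->state_saved = true;   /* only save the registers */
        return 0;
    }
    if (dev->pm->suspend_noirq) {
        int err = dev->pm->suspend_noirq(dev);

        if (err)
            return err;
    }
    if (!dev->state_saved) {
        /* The driver left it to the subsystem: save, arm wakeup, power down. */
        dev->state_saved = true;
        dev->wakeup_prepared = dev->may_wakeup;
        dev->low_power = true;
    }
    return 0;
}

/* Sample driver callback that saves the state itself (discouraged). */
static int drv_saves_state(struct mock_pci_dev *dev)
{
    dev->state_saved = true;
    return 0;
}
```

With drv_saves_state() installed, the model leaves the power state alone, which
mirrors the assumption that the driver has already taken care of it.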
0503
0504 2.4.2. System Resume
0505 ^^^^^^^^^^^^^^^^^^^^
0506
0507 When the system is undergoing a transition from a sleep state in which the
0508 contents of memory have been preserved, such as one of the ACPI sleep states
0509 S1-S3, into the working state (ACPI S0), the phases are:
0510
0511 resume_noirq, resume, complete.
0512
0513 The following PCI bus type's callbacks, respectively, are executed in these
0514 phases::
0515
0516 pci_pm_resume_noirq()
0517 pci_pm_resume()
0518 pci_pm_complete()
0519
0520 The pci_pm_resume_noirq() routine first puts the device into the full-power
0521 state, restores its standard configuration registers and applies early resume
0522 hardware quirks related to the device, if necessary. This is done
0523 unconditionally, regardless of whether or not the device's driver implements
0524 legacy PCI power management callbacks (this way all PCI devices are in the
0525 full-power state and their standard configuration registers have been restored
0526 when their interrupt handlers are invoked for the first time during resume,
0527 which allows the kernel to avoid problems with the handling of shared interrupts
0528 by drivers whose devices are still suspended). If legacy PCI power management
0529 callbacks (see Section 3) are implemented by the device's driver, the legacy
0530 early resume callback is executed and its result is returned. Otherwise, the
0531 device driver's pm->resume_noirq() callback is executed, if defined, and its
0532 result is returned.
0533
0534 The pci_pm_resume() routine first checks if the device's standard configuration
0535 registers have been restored and restores them if that's not the case (this
0536 is only necessary in the error path during a failing suspend). Next, resume
0537 hardware quirks related to the device are applied, if necessary, and if the
0538 device's driver implements legacy PCI power management callbacks (see
0539 Section 3), the driver's legacy resume callback is executed and its result is
0540 returned. Otherwise, the device's wakeup signaling mechanisms are blocked and
0541 its driver's pm->resume() callback is executed, if defined (the callback's
0542 result is then returned).
0543
0544 The resume phase is carried out asynchronously for PCI devices, like the
0545 suspend phase described above, which means that if two PCI devices don't depend
0546 on each other in a known way, the pci_pm_resume() routine may be executed for
both of them in parallel.
0548
0549 The pci_pm_complete() routine only executes the device driver's pm->complete()
0550 callback, if defined.
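
The order of steps that the text above gives for the resume_noirq phase can be
modeled in a few lines of plain C. This is a userspace sketch only, not the
kernel's implementation: struct model_dev, its fields, model_resume_noirq() and
example_resume_noirq() are hypothetical names made up for illustration, and
"legacy callbacks are implemented" is simplified to a single non-NULL pointer.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a PCI device; this is not a kernel structure. */
struct model_dev {
	bool full_power;	/* device is in the full-power state */
	bool config_restored;	/* standard config registers restored */
	/* Legacy early resume callback, NULL if the driver has none. */
	int (*legacy_resume_early)(struct model_dev *dev);
	/* Stand-in for the driver's pm->resume_noirq(), may be NULL. */
	int (*resume_noirq)(struct model_dev *dev);
};

/* Follows the order of steps described for the resume_noirq phase. */
int model_resume_noirq(struct model_dev *dev)
{
	/* Done unconditionally, before any driver code runs. */
	dev->full_power = true;
	dev->config_restored = true;

	/* A legacy driver gets its early resume callback called. */
	if (dev->legacy_resume_early)
		return dev->legacy_resume_early(dev);

	/* Otherwise the dev_pm_ops callback is used, if defined. */
	if (dev->resume_noirq)
		return dev->resume_noirq(dev);

	return 0;
}

/* Example driver callback that relies on the state set up above. */
int example_resume_noirq(struct model_dev *dev)
{
	return (dev->full_power && dev->config_restored) ? 0 : -1;
}
```

The point of the model is the ordering: by the time any driver callback runs,
the device is already in the full-power state with its standard configuration
registers restored.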
0551
0552 2.4.3. System Hibernation
0553 ^^^^^^^^^^^^^^^^^^^^^^^^^
0554
0555 System hibernation is more complicated than system suspend, because it requires
0556 a system image to be created and written into a persistent storage medium. The
0557 image is created atomically and all devices are quiesced, or frozen, before that
0558 happens.
0559
0560 The freezing of devices is carried out after enough memory has been freed (at
0561 the time of this writing the image creation requires at least 50% of system RAM
0562 to be free) in the following three phases:
0563
0564 prepare, freeze, freeze_noirq
0565
0566 that correspond to the PCI bus type's callbacks::
0567
	pci_pm_prepare()
	pci_pm_freeze()
	pci_pm_freeze_noirq()
0571
0572 This means that the prepare phase is exactly the same as for system suspend.
0573 The other two phases, however, are different.
0574
0575 The pci_pm_freeze() routine is quite similar to pci_pm_suspend(), but it runs
0576 the device driver's pm->freeze() callback, if defined, instead of pm->suspend(),
0577 and it doesn't apply the suspend-related hardware quirks. It is executed
0578 asynchronously for different PCI devices that don't depend on each other in a
0579 known way.
0580
0581 The pci_pm_freeze_noirq() routine, in turn, is similar to
0582 pci_pm_suspend_noirq(), but it calls the device driver's pm->freeze_noirq()
routine instead of pm->suspend_noirq(). It also doesn't attempt to prepare the
device for signaling wakeup or put it into a low-power state. Still, it saves
0585 the device's standard configuration registers if they haven't been saved by one
0586 of the driver's callbacks.
0587
0588 Once the image has been created, it has to be saved. However, at this point all
0589 devices are frozen and they cannot handle I/O, while their ability to handle
0590 I/O is obviously necessary for the image saving. Thus they have to be brought
0591 back to the fully functional state and this is done in the following phases:
0592
0593 thaw_noirq, thaw, complete
0594
0595 using the following PCI bus type's callbacks::
0596
	pci_pm_thaw_noirq()
	pci_pm_thaw()
	pci_pm_complete()
0600
0601 respectively.
0602
0603 The first of them, pci_pm_thaw_noirq(), is analogous to pci_pm_resume_noirq().
0604 It puts the device into the full power state and restores its standard
0605 configuration registers. It also executes the device driver's pm->thaw_noirq()
0606 callback, if defined, instead of pm->resume_noirq().
0607
0608 The pci_pm_thaw() routine is similar to pci_pm_resume(), but it runs the device
0609 driver's pm->thaw() callback instead of pm->resume(). It is executed
0610 asynchronously for different PCI devices that don't depend on each other in a
0611 known way.
0612
0613 The complete phase is the same as for system resume.
0614
0615 After saving the image, devices need to be powered down before the system can
0616 enter the target sleep state (ACPI S4 for ACPI-based systems). This is done in
0617 three phases:
0618
0619 prepare, poweroff, poweroff_noirq
0620
0621 where the prepare phase is exactly the same as for system suspend. The other
0622 two phases are analogous to the suspend and suspend_noirq phases, respectively.
0623 The PCI subsystem-level callbacks they correspond to::
0624
	pci_pm_poweroff()
	pci_pm_poweroff_noirq()
0627
work in analogy with pci_pm_suspend() and pci_pm_suspend_noirq(), respectively,
0629 although they don't attempt to save the device's standard configuration
0630 registers.
0631
0632 2.4.4. System Restore
0633 ^^^^^^^^^^^^^^^^^^^^^
0634
0635 System restore requires a hibernation image to be loaded into memory and the
0636 pre-hibernation memory contents to be restored before the pre-hibernation system
0637 activity can be resumed.
0638
0639 As described in Documentation/driver-api/pm/devices.rst, the hibernation image
0640 is loaded into memory by a fresh instance of the kernel, called the boot kernel,
0641 which in turn is loaded and run by a boot loader in the usual way. After the
0642 boot kernel has loaded the image, it needs to replace its own code and data with
0643 the code and data of the "hibernated" kernel stored within the image, called the
0644 image kernel. For this purpose all devices are frozen just like before creating
0645 the image during hibernation, in the
0646
0647 prepare, freeze, freeze_noirq
0648
0649 phases described above. However, the devices affected by these phases are only
0650 those having drivers in the boot kernel; other devices will still be in whatever
0651 state the boot loader left them.
0652
0653 Should the restoration of the pre-hibernation memory contents fail, the boot
0654 kernel would go through the "thawing" procedure described above, using the
0655 thaw_noirq, thaw, and complete phases (that will only affect the devices having
0656 drivers in the boot kernel), and then continue running normally.
0657
0658 If the pre-hibernation memory contents are restored successfully, which is the
0659 usual situation, control is passed to the image kernel, which then becomes
0660 responsible for bringing the system back to the working state. To achieve this,
0661 it must restore the devices' pre-hibernation functionality, which is done much
0662 like waking up from the memory sleep state, although it involves different
0663 phases:
0664
0665 restore_noirq, restore, complete
0666
0667 The first two of these are analogous to the resume_noirq and resume phases
0668 described above, respectively, and correspond to the following PCI subsystem
0669 callbacks::
0670
	pci_pm_restore_noirq()
	pci_pm_restore()
0673
0674 These callbacks work in analogy with pci_pm_resume_noirq() and pci_pm_resume(),
0675 respectively, but they execute the device driver's pm->restore_noirq() and
0676 pm->restore() callbacks, if available.
0677
0678 The complete phase is carried out in exactly the same way as during system
0679 resume.
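
The phase sequences listed in Sections 2.4.1 to 2.4.4 can be collected in one
lookup table. The sketch below is plain userspace C written for illustration.
The kernel does not store the phases this way, and struct pm_stage, the stage
labels and pm_stage_phases() are hypothetical names. Only the phases named in
this document are listed.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical lookup table entry; not a kernel data structure. */
struct pm_stage {
	const char *name;	/* label made up for this sketch */
	const char *phases[3];	/* phase names as used in this document */
};

static const struct pm_stage pm_stages[] = {
	{ "suspend",            { "prepare", "suspend", "suspend_noirq" } },
	{ "resume",             { "resume_noirq", "resume", "complete" } },
	/* Hibernation: freeze, create image, thaw, save image, power off. */
	{ "hibernate-freeze",   { "prepare", "freeze", "freeze_noirq" } },
	{ "hibernate-thaw",     { "thaw_noirq", "thaw", "complete" } },
	{ "hibernate-poweroff", { "prepare", "poweroff", "poweroff_noirq" } },
	/* Restore: the boot kernel freezes, the image kernel restores. */
	{ "restore-freeze",     { "prepare", "freeze", "freeze_noirq" } },
	{ "restore",            { "restore_noirq", "restore", "complete" } },
};

/* Return the three phases of a stage, or NULL if the label is unknown. */
const char *const *pm_stage_phases(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(pm_stages) / sizeof(pm_stages[0]); i++)
		if (strcmp(pm_stages[i].name, name) == 0)
			return pm_stages[i].phases;
	return NULL;
}
```

For example, pm_stage_phases("restore") yields restore_noirq, restore and
complete, in that order.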
0680
0681
0682 3. PCI Device Drivers and Power Management
0683 ==========================================
0684
0685 3.1. Power Management Callbacks
0686 -------------------------------
0687
0688 PCI device drivers participate in power management by providing callbacks to be
0689 executed by the PCI subsystem's power management routines described above and by
0690 controlling the runtime power management of their devices.
0691
At the time of this writing there are two ways to define power management
callbacks for a PCI device driver: the recommended one, based on using a
0694 dev_pm_ops structure described in Documentation/driver-api/pm/devices.rst, and
0695 the "legacy" one, in which the .suspend() and .resume() callbacks from struct
0696 pci_driver are used. The legacy approach, however, doesn't allow one to define
0697 runtime power management callbacks and is not really suitable for any new
0698 drivers. Therefore it is not covered by this document (refer to the source code
0699 to learn more about it).
0700
0701 It is recommended that all PCI device drivers define a struct dev_pm_ops object
0702 containing pointers to power management (PM) callbacks that will be executed by
0703 the PCI subsystem's PM routines in various circumstances. A pointer to the
0704 driver's struct dev_pm_ops object has to be assigned to the driver.pm field in
0705 its struct pci_driver object. Once that has happened, the "legacy" PM callbacks
0706 in struct pci_driver are ignored (even if they are not NULL).
0707
0708 The PM callbacks in struct dev_pm_ops are not mandatory and if they are not
0709 defined (i.e. the respective fields of struct dev_pm_ops are unset) the PCI
0710 subsystem will handle the device in a simplified default manner. If they are
0711 defined, though, they are expected to behave as described in the following
0712 subsections.
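
As a minimal sketch of the recommended approach, the fragment below defines a
struct dev_pm_ops object and assigns it to the driver.pm field of a struct
pci_driver. So that it compiles outside a kernel tree, it declares trimmed
stand-ins for the kernel types. A real driver would include <linux/pm.h> and
<linux/pci.h> instead. The "foo" driver and its callbacks are hypothetical.

```c
#include <stddef.h>

/*
 * Userspace stand-ins so the sketch compiles on its own. The real
 * structures have many more members.
 */
struct device { void *driver_data; };
struct dev_pm_ops {
	int (*suspend)(struct device *dev);
	int (*resume)(struct device *dev);
	int (*runtime_suspend)(struct device *dev);
};
struct device_driver { const struct dev_pm_ops *pm; };
struct pci_driver {
	const char *name;
	struct device_driver driver;
};

/* Hypothetical callbacks of a hypothetical "foo" driver. */
static int foo_suspend(struct device *dev)
{
	(void)dev;	/* quiesce the device here */
	return 0;
}

static int foo_resume(struct device *dev)
{
	(void)dev;	/* reprogram the device here */
	return 0;
}

/* Members left out of the initializer stay NULL, that is, unset. */
static const struct dev_pm_ops foo_pm_ops = {
	.suspend = foo_suspend,
	.resume = foo_resume,
};

struct pci_driver foo_driver = {
	.name = "foo",
	.driver.pm = &foo_pm_ops,
};
```

Here .runtime_suspend is deliberately left unset to show that individual
callbacks are optional.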
0713
0714 3.1.1. prepare()
0715 ^^^^^^^^^^^^^^^^
0716
0717 The prepare() callback is executed during system suspend, during hibernation
0718 (when a hibernation image is about to be created), during power-off after
0719 saving a hibernation image and during system restore, when a hibernation image
0720 has just been loaded into memory.
0721
0722 This callback is only necessary if the driver's device has children that in
0723 general may be registered at any time. In that case the role of the prepare()
0724 callback is to prevent new children of the device from being registered until
0725 one of the resume_noirq(), thaw_noirq(), or restore_noirq() callbacks is run.
0726
0727 In addition to that the prepare() callback may carry out some operations
0728 preparing the device to be suspended, although it should not allocate memory
0729 (if additional memory is required to suspend the device, it has to be
0730 preallocated earlier, for example in a suspend/hibernate notifier as described
0731 in Documentation/driver-api/pm/notifiers.rst).
0732
0733 3.1.2. suspend()
0734 ^^^^^^^^^^^^^^^^
0735
0736 The suspend() callback is only executed during system suspend, after prepare()
0737 callbacks have been executed for all devices in the system.
0738
0739 This callback is expected to quiesce the device and prepare it to be put into a
low-power state by the PCI subsystem. It is not required (in fact it is not
even recommended) that a PCI driver's suspend() callback save the standard
0742 configuration registers of the device, prepare it for waking up the system, or
0743 put it into a low-power state. All of these operations can very well be taken
0744 care of by the PCI subsystem, without the driver's participation.
0745
However, in some rare cases it is convenient to carry out these operations in
0747 a PCI driver. Then, pci_save_state(), pci_prepare_to_sleep(), and
0748 pci_set_power_state() should be used to save the device's standard configuration
0749 registers, to prepare it for system wakeup (if necessary), and to put it into a
0750 low-power state, respectively. Moreover, if the driver calls pci_save_state(),
0751 the PCI subsystem will not execute either pci_prepare_to_sleep(), or
0752 pci_set_power_state() for its device, so the driver is then responsible for
0753 handling the device as appropriate.
0754
0755 While the suspend() callback is being executed, the driver's interrupt handler
0756 can be invoked to handle an interrupt from the device, so all suspend-related
0757 operations relying on the driver's ability to handle interrupts should be
0758 carried out in this callback.
0759
0760 3.1.3. suspend_noirq()
0761 ^^^^^^^^^^^^^^^^^^^^^^
0762
0763 The suspend_noirq() callback is only executed during system suspend, after
0764 suspend() callbacks have been executed for all devices in the system and
0765 after device interrupts have been disabled by the PM core.
0766
0767 The difference between suspend_noirq() and suspend() is that the driver's
0768 interrupt handler will not be invoked while suspend_noirq() is running. Thus
0769 suspend_noirq() can carry out operations that would cause race conditions to
0770 arise if they were performed in suspend().
0771
0772 3.1.4. freeze()
0773 ^^^^^^^^^^^^^^^
0774
The freeze() callback is hibernation-specific and is executed in two situations:
0776 during hibernation, after prepare() callbacks have been executed for all devices
0777 in preparation for the creation of a system image, and during restore,
0778 after a system image has been loaded into memory from persistent storage and the
0779 prepare() callbacks have been executed for all devices.
0780
0781 The role of this callback is analogous to the role of the suspend() callback
0782 described above. In fact, they only need to be different in the rare cases when
0783 the driver takes the responsibility for putting the device into a low-power
0784 state.
0785
In those cases the freeze() callback should not prepare the device for system
wakeup or put it into a low-power state. Still, either it or freeze_noirq()
should save the device's standard configuration registers using
pci_save_state().
0789
0790 3.1.5. freeze_noirq()
0791 ^^^^^^^^^^^^^^^^^^^^^
0792
0793 The freeze_noirq() callback is hibernation-specific. It is executed during
0794 hibernation, after prepare() and freeze() callbacks have been executed for all
0795 devices in preparation for the creation of a system image, and during restore,
0796 after a system image has been loaded into memory and after prepare() and
0797 freeze() callbacks have been executed for all devices. It is always executed
0798 after device interrupts have been disabled by the PM core.
0799
0800 The role of this callback is analogous to the role of the suspend_noirq()
callback described above and it is very rarely necessary to define
0802 freeze_noirq().
0803
0804 The difference between freeze_noirq() and freeze() is analogous to the
0805 difference between suspend_noirq() and suspend().
0806
0807 3.1.6. poweroff()
0808 ^^^^^^^^^^^^^^^^^
0809
0810 The poweroff() callback is hibernation-specific. It is executed when the system
is about to be powered off after saving a hibernation image to persistent
storage. prepare() callbacks are executed for all devices before poweroff() is
0813 called.
0814
0815 The role of this callback is analogous to the role of the suspend() and freeze()
0816 callbacks described above, although it does not need to save the contents of
0817 the device's registers. In particular, if the driver wants to put the device
0818 into a low-power state itself instead of allowing the PCI subsystem to do that,
0819 the poweroff() callback should use pci_prepare_to_sleep() and
0820 pci_set_power_state() to prepare the device for system wakeup and to put it
0821 into a low-power state, respectively, but it need not save the device's standard
0822 configuration registers.
0823
0824 3.1.7. poweroff_noirq()
0825 ^^^^^^^^^^^^^^^^^^^^^^^
0826
0827 The poweroff_noirq() callback is hibernation-specific. It is executed after
0828 poweroff() callbacks have been executed for all devices in the system.
0829
0830 The role of this callback is analogous to the role of the suspend_noirq() and
0831 freeze_noirq() callbacks described above, but it does not need to save the
0832 contents of the device's registers.
0833
0834 The difference between poweroff_noirq() and poweroff() is analogous to the
0835 difference between suspend_noirq() and suspend().
0836
0837 3.1.8. resume_noirq()
0838 ^^^^^^^^^^^^^^^^^^^^^
0839
0840 The resume_noirq() callback is only executed during system resume, after the
0841 PM core has enabled the non-boot CPUs. The driver's interrupt handler will not
0842 be invoked while resume_noirq() is running, so this callback can carry out
0843 operations that might race with the interrupt handler.
0844
0845 Since the PCI subsystem unconditionally puts all devices into the full power
0846 state in the resume_noirq phase of system resume and restores their standard
0847 configuration registers, resume_noirq() is usually not necessary. In general
0848 it should only be used for performing operations that would lead to race
0849 conditions if carried out by resume().
0850
0851 3.1.9. resume()
0852 ^^^^^^^^^^^^^^^
0853
0854 The resume() callback is only executed during system resume, after
0855 resume_noirq() callbacks have been executed for all devices in the system and
0856 device interrupts have been enabled by the PM core.
0857
0858 This callback is responsible for restoring the pre-suspend configuration of the
0859 device and bringing it back to the fully functional state. The device should be
able to process I/O in the usual way after resume() has returned.
0861
0862 3.1.10. thaw_noirq()
0863 ^^^^^^^^^^^^^^^^^^^^
0864
0865 The thaw_noirq() callback is hibernation-specific. It is executed after a
0866 system image has been created and the non-boot CPUs have been enabled by the PM
0867 core, in the thaw_noirq phase of hibernation. It also may be executed if the
0868 loading of a hibernation image fails during system restore (it is then executed
0869 after enabling the non-boot CPUs). The driver's interrupt handler will not be
0870 invoked while thaw_noirq() is running.
0871
0872 The role of this callback is analogous to the role of resume_noirq(). The
0873 difference between these two callbacks is that thaw_noirq() is executed after
0874 freeze() and freeze_noirq(), so in general it does not need to modify the
0875 contents of the device's registers.
0876
0877 3.1.11. thaw()
0878 ^^^^^^^^^^^^^^
0879
0880 The thaw() callback is hibernation-specific. It is executed after thaw_noirq()
0881 callbacks have been executed for all devices in the system and after device
0882 interrupts have been enabled by the PM core.
0883
0884 This callback is responsible for restoring the pre-freeze configuration of
the device, so that it will work in the usual way after thaw() has returned.
0886
0887 3.1.12. restore_noirq()
0888 ^^^^^^^^^^^^^^^^^^^^^^^
0889
0890 The restore_noirq() callback is hibernation-specific. It is executed in the
0891 restore_noirq phase of hibernation, when the boot kernel has passed control to
0892 the image kernel and the non-boot CPUs have been enabled by the image kernel's
0893 PM core.
0894
0895 This callback is analogous to resume_noirq() with the exception that it cannot
0896 make any assumption on the previous state of the device, even if the BIOS (or
0897 generally the platform firmware) is known to preserve that state over a
0898 suspend-resume cycle.
0899
0900 For the vast majority of PCI device drivers there is no difference between
0901 resume_noirq() and restore_noirq().
0902
0903 3.1.13. restore()
0904 ^^^^^^^^^^^^^^^^^
0905
0906 The restore() callback is hibernation-specific. It is executed after
0907 restore_noirq() callbacks have been executed for all devices in the system and
0908 after the PM core has enabled device drivers' interrupt handlers to be invoked.
0909
0910 This callback is analogous to resume(), just like restore_noirq() is analogous
0911 to resume_noirq(). Consequently, the difference between restore_noirq() and
0912 restore() is analogous to the difference between resume_noirq() and resume().
0913
0914 For the vast majority of PCI device drivers there is no difference between
0915 resume() and restore().
0916
0917 3.1.14. complete()
0918 ^^^^^^^^^^^^^^^^^^
0919
0920 The complete() callback is executed in the following situations:
0921
  - during system resume, after resume() callbacks have been executed for all
    devices,
  - during hibernation, before saving the system image, after thaw() callbacks
    have been executed for all devices,
  - during system restore, when the system is going back to its pre-hibernation
    state, after restore() callbacks have been executed for all devices.
0928
0929 It also may be executed if the loading of a hibernation image into memory fails
0930 (in that case it is run after thaw() callbacks have been executed for all
0931 devices that have drivers in the boot kernel).
0932
0933 This callback is entirely optional, although it may be necessary if the
0934 prepare() callback performs operations that need to be reversed.
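
One possible pairing of prepare() and complete() is sketched below: prepare()
blocks the registration of new children and complete() lifts the block again.
This is a userspace illustration under stated assumptions. struct foo_priv,
foo_register_child() and the children_blocked flag are hypothetical, struct
device is a stand-in for the kernel type, and the locking that a real driver
would need is omitted.

```c
#include <errno.h>
#include <stdbool.h>

/* Stand-in for the kernel's struct device. */
struct device { void *driver_data; };

/* Hypothetical per-device state of a hypothetical "foo" driver. */
struct foo_priv {
	bool children_blocked;
	int nr_children;
};

/* ->prepare(): stop accepting new children for the transition. */
int foo_prepare(struct device *dev)
{
	struct foo_priv *priv = dev->driver_data;

	priv->children_blocked = true;
	return 0;
}

/* ->complete(): reverse what foo_prepare() did. */
void foo_complete(struct device *dev)
{
	struct foo_priv *priv = dev->driver_data;

	priv->children_blocked = false;
}

/* Hypothetical helper the driver would call to add a child device. */
int foo_register_child(struct device *dev)
{
	struct foo_priv *priv = dev->driver_data;

	if (priv->children_blocked)
		return -EBUSY;	/* a power transition is in progress */
	priv->nr_children++;
	return 0;
}
```

Whether the block is lifted in complete() or earlier, in one of the "noirq"
resume callbacks, is a driver design choice. The sketch uses complete() because
it is the callback described here as the counterpart of prepare().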
0935
0936 3.1.15. runtime_suspend()
0937 ^^^^^^^^^^^^^^^^^^^^^^^^^
0938
0939 The runtime_suspend() callback is specific to device runtime power management
0940 (runtime PM). It is executed by the PM core's runtime PM framework when the
0941 device is about to be suspended (i.e. quiesced and put into a low-power state)
0942 at run time.
0943
0944 This callback is responsible for freezing the device and preparing it to be
0945 put into a low-power state, but it must allow the PCI subsystem to perform all
0946 of the PCI-specific actions necessary for suspending the device.
0947
0948 3.1.16. runtime_resume()
0949 ^^^^^^^^^^^^^^^^^^^^^^^^
0950
0951 The runtime_resume() callback is specific to device runtime PM. It is executed
0952 by the PM core's runtime PM framework when the device is about to be resumed
0953 (i.e. put into the full-power state and programmed to process I/O normally) at
0954 run time.
0955
0956 This callback is responsible for restoring the normal functionality of the
0957 device after it has been put into the full-power state by the PCI subsystem.
0958 The device is expected to be able to process I/O in the usual way after
0959 runtime_resume() has returned.
0960
0961 3.1.17. runtime_idle()
0962 ^^^^^^^^^^^^^^^^^^^^^^
0963
0964 The runtime_idle() callback is specific to device runtime PM. It is executed
0965 by the PM core's runtime PM framework whenever it may be desirable to suspend
0966 the device according to the PM core's information. In particular, it is
0967 automatically executed right after runtime_resume() has returned in case the
0968 resume of the device has happened as a result of a spurious event.
0969
0970 This callback is optional, but if it is not implemented or if it returns 0, the
0971 PCI subsystem will call pm_runtime_suspend() for the device, which in turn will
0972 cause the driver's runtime_suspend() callback to be executed.
0973
0974 3.1.18. Pointing Multiple Callback Pointers to One Routine
0975 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
0976
0977 Although in principle each of the callbacks described in the previous
0978 subsections can be defined as a separate function, it often is convenient to
0979 point two or more members of struct dev_pm_ops to the same routine. There are
0980 a few convenience macros that can be used for this purpose.
0981
0982 The SIMPLE_DEV_PM_OPS macro declares a struct dev_pm_ops object with one
0983 suspend routine pointed to by the .suspend(), .freeze(), and .poweroff()
0984 members and one resume routine pointed to by the .resume(), .thaw(), and
0985 .restore() members. The other function pointers in this struct dev_pm_ops are
0986 unset.
0987
0988 The UNIVERSAL_DEV_PM_OPS macro is similar to SIMPLE_DEV_PM_OPS, but it
0989 additionally sets the .runtime_resume() pointer to the same value as
0990 .resume() (and .thaw(), and .restore()) and the .runtime_suspend() pointer to
0991 the same value as .suspend() (and .freeze() and .poweroff()).
0992
The SET_SYSTEM_SLEEP_PM_OPS macro can be used inside a declaration of struct
0994 dev_pm_ops to indicate that one suspend routine is to be pointed to by the
0995 .suspend(), .freeze(), and .poweroff() members and one resume routine is to
0996 be pointed to by the .resume(), .thaw(), and .restore() members.
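
The effect of these macros can be illustrated with simplified re-creations of
them. The definitions below are written for this sketch so that it compiles in
user space. They are not copies of the kernel's macros, which live in
<linux/pm.h> and also depend on the kernel configuration. The "foo" callbacks
are hypothetical.

```c
#include <stddef.h>

/* Stand-ins for the kernel types, trimmed to what the sketch needs. */
struct device { void *driver_data; };
struct dev_pm_ops {
	int (*suspend)(struct device *dev);
	int (*resume)(struct device *dev);
	int (*freeze)(struct device *dev);
	int (*thaw)(struct device *dev);
	int (*poweroff)(struct device *dev);
	int (*restore)(struct device *dev);
};

/* Simplified re-creations of the macros, for illustration only. */
#define SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \
	.suspend = suspend_fn, .resume = resume_fn, \
	.freeze = suspend_fn, .thaw = resume_fn, \
	.poweroff = suspend_fn, .restore = resume_fn,

#define SIMPLE_DEV_PM_OPS(name, suspend_fn, resume_fn) \
	const struct dev_pm_ops name = { \
		SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \
	}

/* Hypothetical driver routines shared by several callbacks. */
static int foo_suspend(struct device *dev)
{
	(void)dev;
	return 0;
}

static int foo_resume(struct device *dev)
{
	(void)dev;
	return 0;
}

/* One suspend routine and one resume routine cover six members. */
SIMPLE_DEV_PM_OPS(foo_pm_ops, foo_suspend, foo_resume);
```

After this declaration .suspend, .freeze and .poweroff all point to
foo_suspend(), while .resume, .thaw and .restore all point to foo_resume().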
0997
0998 3.1.19. Driver Flags for Power Management
0999 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1000
1001 The PM core allows device drivers to set flags that influence the handling of
1002 power management for the devices by the core itself and by middle layer code
1003 including the PCI bus type. The flags should be set once at the driver probe
1004 time with the help of the dev_pm_set_driver_flags() function and they should not
1005 be updated directly afterwards.
1006
1007 The DPM_FLAG_NO_DIRECT_COMPLETE flag prevents the PM core from using the
1008 direct-complete mechanism allowing device suspend/resume callbacks to be skipped
1009 if the device is in runtime suspend when the system suspend starts. That also
1010 affects all of the ancestors of the device, so this flag should only be used if
1011 absolutely necessary.
1012
1013 The DPM_FLAG_SMART_PREPARE flag causes the PCI bus type to return a positive
1014 value from pci_pm_prepare() only if the ->prepare callback provided by the
1015 driver of the device returns a positive value. That allows the driver to opt
1016 out from using the direct-complete mechanism dynamically (whereas setting
1017 DPM_FLAG_NO_DIRECT_COMPLETE means permanent opt-out).
1018
1019 The DPM_FLAG_SMART_SUSPEND flag tells the PCI bus type that from the driver's
1020 perspective the device can be safely left in runtime suspend during system
1021 suspend. That causes pci_pm_suspend(), pci_pm_freeze() and pci_pm_poweroff()
1022 to avoid resuming the device from runtime suspend unless there are PCI-specific
1023 reasons for doing that. Also, it causes pci_pm_suspend_late/noirq() and
1024 pci_pm_poweroff_late/noirq() to return early if the device remains in runtime
1025 suspend during the "late" phase of the system-wide transition under way.
1026 Moreover, if the device is in runtime suspend in pci_pm_resume_noirq() or
1027 pci_pm_restore_noirq(), its runtime PM status will be changed to "active" (as it
1028 is going to be put into D0 going forward).
1029
1030 Setting the DPM_FLAG_MAY_SKIP_RESUME flag means that the driver allows its
1031 "noirq" and "early" resume callbacks to be skipped if the device can be left
1032 in suspend after a system-wide transition into the working state. This flag is
1033 taken into consideration by the PM core along with the power.may_skip_resume
1034 status bit of the device which is set by pci_pm_suspend_noirq() in certain
1035 situations. If the PM core determines that the driver's "noirq" and "early"
1036 resume callbacks should be skipped, the dev_pm_skip_resume() helper function
1037 will return "true" and that will cause pci_pm_resume_noirq() and
1038 pci_pm_resume_early() to return upfront without touching the device and
1039 executing the driver callbacks.
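
A probe routine that sets driver flags once might look like the sketch below.
It is a userspace illustration: the structures are trimmed stand-ins for the
kernel types, dev_pm_set_driver_flags() is a stub for the kernel helper of the
same name, the flag bit values should be treated as placeholders, and
foo_probe() is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder bit values; the real definitions are in <linux/pm.h>. */
#define DPM_FLAG_NO_DIRECT_COMPLETE	(1u << 0)
#define DPM_FLAG_SMART_PREPARE		(1u << 1)
#define DPM_FLAG_SMART_SUSPEND		(1u << 2)
#define DPM_FLAG_MAY_SKIP_RESUME	(1u << 3)

/* Trimmed stand-ins for the kernel structures. */
struct dev_pm_info { uint32_t driver_flags; };
struct device { struct dev_pm_info power; };
struct pci_dev { struct device dev; };
struct pci_device_id;

/* Stub standing in for the kernel helper of the same name. */
static inline void dev_pm_set_driver_flags(struct device *dev, uint32_t flags)
{
	dev->power.driver_flags = flags;
}

/* Hypothetical probe routine: set the flags once and leave them alone. */
int foo_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	(void)id;

	dev_pm_set_driver_flags(&pdev->dev,
				DPM_FLAG_SMART_SUSPEND |
				DPM_FLAG_MAY_SKIP_RESUME);

	/* ... the rest of the device setup would go here ... */
	return 0;
}
```

Combining DPM_FLAG_SMART_SUSPEND with DPM_FLAG_MAY_SKIP_RESUME is only one
possible choice. Which flags are appropriate depends on what the driver's
callbacks can tolerate.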
1040
1041 3.2. Device Runtime Power Management
1042 ------------------------------------
1043
1044 In addition to providing device power management callbacks PCI device drivers
1045 are responsible for controlling the runtime power management (runtime PM) of
1046 their devices.
1047
1048 The PCI device runtime PM is optional, but it is recommended that PCI device
1049 drivers implement it at least in the cases where there is a reliable way of
1050 verifying that the device is not used (like when the network cable is detached
1051 from an Ethernet adapter or there are no devices attached to a USB controller).
1052
1053 To support the PCI runtime PM the driver first needs to implement the
1054 runtime_suspend() and runtime_resume() callbacks. It also may need to implement
1055 the runtime_idle() callback to prevent the device from being suspended again
1056 every time right after the runtime_resume() callback has returned
1057 (alternatively, the runtime_suspend() callback will have to check if the
1058 device should really be suspended and return -EAGAIN if that is not the case).
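
Both options are sketched below for a hypothetical Ethernet-like "foo" driver
that only wants to be suspended while its link is down. struct device is a
stand-in for the kernel type, struct foo_priv and its fields are made up for
illustration, and the use of -EBUSY in runtime_idle() is this sketch's choice:
the text only states that a return value of 0 leads to a suspend attempt.

```c
#include <errno.h>
#include <stdbool.h>

/* Stand-in for the kernel's struct device. */
struct device { void *driver_data; };

/* Hypothetical per-device state. */
struct foo_priv {
	bool link_up;	/* device is in use while the link is up */
	bool suspended;
};

/* ->runtime_idle(): a nonzero result keeps the device from being suspended. */
int foo_runtime_idle(struct device *dev)
{
	struct foo_priv *priv = dev->driver_data;

	return priv->link_up ? -EBUSY : 0;
}

/* ->runtime_suspend(): the alternative check, refusing with -EAGAIN. */
int foo_runtime_suspend(struct device *dev)
{
	struct foo_priv *priv = dev->driver_data;

	if (priv->link_up)
		return -EAGAIN;

	priv->suspended = true;	/* quiesce the device here */
	return 0;
}

/* ->runtime_resume(): make the device able to process I/O again. */
int foo_runtime_resume(struct device *dev)
{
	struct foo_priv *priv = dev->driver_data;

	priv->suspended = false;
	return 0;
}
```

A driver would normally pick one of the two checks. Both are shown here only to
put them side by side.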
1059
1060 The runtime PM of PCI devices is enabled by default by the PCI core. PCI
1061 device drivers do not need to enable it and should not attempt to do so.
However, it is blocked by pci_pm_init(), which runs the pm_runtime_forbid()
1063 helper function. In addition to that, the runtime PM usage counter of
1064 each PCI device is incremented by local_pci_probe() before executing the
1065 probe callback provided by the device's driver.
1066
1067 If a PCI driver implements the runtime PM callbacks and intends to use the
1068 runtime PM framework provided by the PM core and the PCI subsystem, it needs
1069 to decrement the device's runtime PM usage counter in its probe callback
1070 function. If it doesn't do that, the counter will always be different from
1071 zero for the device and it will never be runtime-suspended. The simplest
1072 way to do that is by calling pm_runtime_put_noidle(), but if the driver
1073 wants to schedule an autosuspend right away, for example, it may call
1074 pm_runtime_put_autosuspend() instead for this purpose. Generally, it
just needs to call a function that decrements the device's usage counter
1076 from its probe routine to make runtime PM work for the device.
1077
1078 It is important to remember that the driver's runtime_suspend() callback
may be executed right after the usage counter has been decremented, because
user space may already have unblocked the runtime PM of the device via sysfs
(which runs the pm_runtime_allow() helper function), so the driver must
be prepared to cope with that.
1083
1084 The driver itself should not call pm_runtime_allow(), though. Instead, it
1085 should let user space or some platform-specific code do that (user space can
1086 do it via sysfs as stated above), but it must be prepared to handle the
1087 runtime PM of the device correctly as soon as pm_runtime_allow() is called
1088 (which may happen at any time, even before the driver is loaded).
1089
When the driver's remove callback runs, it has to balance the decrement
of the device's runtime PM usage counter done at probe time. For this reason,
1092 if it has decremented the counter in its probe callback, it must run
1093 pm_runtime_get_noresume() in its remove callback. [Since the core carries
1094 out a runtime resume of the device and bumps up the device's usage counter
1095 before running the driver's remove callback, the runtime PM of the device
1096 is effectively disabled for the duration of the remove execution and all
1097 runtime PM helper functions incrementing the device's usage counter are
1098 then effectively equivalent to pm_runtime_get_noresume().]
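
The usage counter balance between probe and remove can be sketched as follows.
This is a userspace model: the structures are trimmed stand-ins, the two
pm_runtime_*() functions are stubs for the PM core helpers of the same names
(the real counter is atomic), and foo_probe(), foo_remove() and
model_core_probe() are hypothetical.

```c
#include <stddef.h>

/* Trimmed stand-ins for the kernel structures. */
struct dev_pm_info { int usage_count; };
struct device { struct dev_pm_info power; };
struct pci_dev { struct device dev; };
struct pci_device_id;

/* Stubs for the PM core helpers of the same names. */
static void pm_runtime_get_noresume(struct device *dev)
{
	dev->power.usage_count++;
}

static void pm_runtime_put_noidle(struct device *dev)
{
	dev->power.usage_count--;
}

/* Hypothetical probe callback. */
int foo_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	(void)id;

	/* ... device setup would go here ... */

	/* Drop the reference taken by the PCI core before probe. */
	pm_runtime_put_noidle(&pdev->dev);
	return 0;
}

/* Hypothetical remove callback. */
void foo_remove(struct pci_dev *pdev)
{
	/* Balance the decrement done in foo_probe(). */
	pm_runtime_get_noresume(&pdev->dev);

	/* ... device teardown would go here ... */
}

/*
 * Model of the usage counter increment that the text attributes to
 * local_pci_probe(), followed by the driver's probe callback.
 */
int model_core_probe(struct pci_dev *pdev)
{
	pdev->dev.power.usage_count++;
	return foo_probe(pdev, NULL);
}
```

In this model the counter is zero after a successful probe, so the device can
be runtime-suspended, and it is back at one after foo_remove() has run.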
1099
1100 The runtime PM framework works by processing requests to suspend or resume
devices, or to check if they are idle (in which case it is reasonable to
1102 subsequently request that they be suspended). These requests are represented
1103 by work items put into the power management workqueue, pm_wq. Although there
1104 are a few situations in which power management requests are automatically
1105 queued by the PM core (for example, after processing a request to resume a
1106 device the PM core automatically queues a request to check if the device is
1107 idle), device drivers are generally responsible for queuing power management
1108 requests for their devices. For this purpose they should use the runtime PM
1109 helper functions provided by the PM core, discussed in
1110 Documentation/power/runtime_pm.rst.
1111
1112 Devices can also be suspended and resumed synchronously, without placing a
request into pm_wq. In the majority of cases this is also done by their
drivers, which use helper functions provided by the PM core for this purpose.
1115
1116 For more information on the runtime PM of devices refer to
1117 Documentation/power/runtime_pm.rst.
1118
1119
1120 4. Resources
1121 ============
1122
1123 PCI Local Bus Specification, Rev. 3.0
1124
1125 PCI Bus Power Management Interface Specification, Rev. 1.2
1126
1127 Advanced Configuration and Power Interface (ACPI) Specification, Rev. 3.0b
1128
1129 PCI Express Base Specification, Rev. 2.0
1130
1131 Documentation/driver-api/pm/devices.rst
1132
1133 Documentation/power/runtime_pm.rst