=================================
Debugging hibernation and suspend
=================================

(C) 2007 Rafael J. Wysocki <rjw@sisk.pl>, GPL

1. Testing hibernation (aka suspend to disk or STD)
===================================================

To check if hibernation works, you can try to hibernate in the "reboot" mode::

    # echo reboot > /sys/power/disk
    # echo disk > /sys/power/state

and the system should create a hibernation image, reboot, resume and get back to
the command prompt where you started the transition. If that happens,
hibernation is likely to work correctly. Still, you need to repeat the test at
least a couple of times in a row for confidence, because some problems only
show up on a second attempt at suspending and resuming the system. Moreover,
hibernating in the "reboot" and "shutdown" modes causes the PM core to skip
some platform-related callbacks which on ACPI systems might be necessary to
make hibernation work. Thus, if your machine fails to hibernate or resume in
the "reboot" mode, you should try the "platform" mode::

    # echo platform > /sys/power/disk
    # echo disk > /sys/power/state

which is the default and recommended mode of hibernation.

Unfortunately, the "platform" mode of hibernation does not work on some systems
with broken BIOSes. In such cases the "shutdown" mode of hibernation might
work::

    # echo shutdown > /sys/power/disk
    # echo disk > /sys/power/state

(it is similar to the "reboot" mode, but it requires you to press the power
button to make the system resume).
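
The mode selection shown above can be wrapped in a small helper. The following
is only a sketch: it relies on /sys/power/disk listing the supported modes with
the selected one in square brackets when read, and POWER_DIR, mode_supported
and hibernate_in_mode are hypothetical names introduced for illustration.

.. code-block:: shell

    # Sketch only: POWER_DIR is a hypothetical override so that the helpers
    # can be pointed at a scratch directory instead of the real /sys/power.
    POWER_DIR="${POWER_DIR:-/sys/power}"

    # Succeed if hibernation mode $1 is listed in $POWER_DIR/disk.
    # The brackets marking the selected mode are stripped before matching.
    mode_supported() {
        tr -d '[]' < "$POWER_DIR/disk" | tr ' ' '\n' | grep -qxF -- "$1"
    }

    # Select hibernation mode $1 and start the transition (run as root).
    hibernate_in_mode() {
        mode_supported "$1" || { echo "mode '$1' not supported" >&2; return 1; }
        echo "$1" > "$POWER_DIR/disk" && echo disk > "$POWER_DIR/state"
    }

With these definitions loaded, running "hibernate_in_mode shutdown" as root
would refuse to start the transition if the "shutdown" mode is not offered by
the running kernel.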

If neither the "platform" nor the "shutdown" hibernation mode works, you will
need to identify what goes wrong.

a) Test modes of hibernation
----------------------------

To find out why hibernation fails on your system, you can use a special testing
facility available if the kernel is compiled with CONFIG_PM_DEBUG set. In that
case the file /sys/power/pm_test can be used to make the hibernation core run
in a test mode. There are 5 test modes available:

freezer
    - test the freezing of processes

devices
    - test the freezing of processes and suspending of devices

platform
    - test the freezing of processes, suspending of devices and platform
      global control methods [1]_

processors
    - test the freezing of processes, suspending of devices, platform
      global control methods [1]_ and the disabling of nonboot CPUs

core
    - test the freezing of processes, suspending of devices, platform global
      control methods\ [1]_, the disabling of nonboot CPUs and suspending
      of platform/system devices

.. [1]

    the platform global control methods are only available on ACPI systems
    and are only tested if the hibernation mode is set to "platform"

To use one of them, write the corresponding string to /sys/power/pm_test
(e.g. "devices" to test the freezing of processes and suspending of devices)
and issue the standard hibernation commands. For example, to use the "devices"
test mode along with the "platform" mode of hibernation, you should do the
following::

    # echo devices > /sys/power/pm_test
    # echo platform > /sys/power/disk
    # echo disk > /sys/power/state

Then, the kernel will try to freeze processes, suspend devices, wait a few
seconds (5 by default, but configurable by the suspend.pm_test_delay module
parameter), resume devices and thaw processes. If "platform" is written to
/sys/power/pm_test, then after suspending devices the kernel will additionally
invoke the global control methods (e.g. ACPI global control methods) used to
prepare the platform firmware for hibernation. Next, it will wait a
configurable number of seconds and invoke the platform (e.g. ACPI) global
methods used to cancel hibernation.

Writing "none" to /sys/power/pm_test causes the kernel to switch back to the
normal hibernation/suspend operations. Also, when read, /sys/power/pm_test
contains a space-separated list of all available tests (including "none", which
represents the normal functionality) in which the current test level is
indicated by square brackets.
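
The bracket notation described above can be read back from a script. The
helper below is a minimal sketch and current_pm_test is a hypothetical name;
the optional file argument exists only so that the function can be tried on a
copy of the file.

.. code-block:: shell

    # Print the currently selected test level from a pm_test-style file.
    # $1 defaults to /sys/power/pm_test.
    current_pm_test() {
        # Keep only the word enclosed in square brackets.
        sed -n 's/.*\[\([^]]*\)].*/\1/p' "${1:-/sys/power/pm_test}"
    }

Checking that it prints "none" before a real hibernation attempt is a simple
way to make sure that no test mode has been left enabled by accident.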

Generally, as you can see, each test level is more "invasive" than the previous
one and the "core" level tests the hardware and drivers as deeply as possible
without creating a hibernation image. Obviously, if the "devices" test fails,
the "platform" test will fail as well and so on. Thus, as a rule of thumb, you
should try the test modes starting from "freezer", through "devices", "platform"
and "processors" up to "core" (repeat the test on each level a couple of times
to rule out random factors).
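
The rule of thumb above can be sketched as a loop. This is an illustration
only, not a supported tool: POWER_DIR, REPEAT and run_pm_tests are hypothetical
names, and the sketch assumes that a failing test makes the write to the
"state" file return an error.

.. code-block:: shell

    # Sketch only: POWER_DIR and REPEAT are hypothetical knobs, not kernel
    # interfaces.
    POWER_DIR="${POWER_DIR:-/sys/power}"
    REPEAT="${REPEAT:-3}"

    # Walk the test levels from least to most invasive, repeating each one.
    # Prints the first failing level and returns 1, or returns 0 if all pass.
    run_pm_tests() {
        status=0
        for level in freezer devices platform processors core; do
            i=0
            while [ "$i" -lt "$REPEAT" ]; do
                echo "$level" > "$POWER_DIR/pm_test" || { status=1; break 2; }
                # Assumption: a failed test makes this write fail.
                if ! echo disk > "$POWER_DIR/state"; then
                    echo "$level"
                    status=1
                    break 2
                fi
                i=$((i + 1))
            done
        done
        # Always switch back to the normal mode of operation.
        echo none > "$POWER_DIR/pm_test"
        return "$status"
    }

Stopping at the first failing level is deliberate: the later levels include
everything the earlier ones do, so they would be expected to fail too.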

If the "freezer" test fails, there is a task that cannot be frozen (in that case
it is usually possible to identify the offending task by analysing the output of
dmesg obtained after the failing test). Failure at this level usually means
that there is a problem with the task freezer subsystem that should be
reported.

If the "devices" test fails, most likely there is a driver that cannot suspend
or resume its device (in the latter case the system may hang or become unstable
after the test, so please take that into consideration). To find this driver,
you can carry out a binary search according to the rules:

- if the test fails, unload half of the drivers currently loaded and repeat
  (that would probably involve rebooting the system, so always note what
  drivers have been loaded before the test),
- if the test succeeds, load half of the drivers you have unloaded most
  recently and repeat.
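
The bookkeeping for such a binary search can be sketched as follows. The
function names are hypothetical, and the sketch only records the module list
and picks half of it; it does not unload anything, because dependencies
between modules may constrain the order in which they can be removed.

.. code-block:: shell

    # Print the names of the currently loaded modules, one per line.
    # $1 defaults to /proc/modules; the argument exists only to ease testing.
    loaded_modules() {
        cut -d' ' -f1 "${1:-/proc/modules}"
    }

    # Print the first half (rounded down) of the lines in file $1, that is
    # the candidates to unload in the next step of the search.
    first_half() {
        total=$(wc -l < "$1")
        head -n "$((total / 2))" "$1"
    }

For example, saving the output of loaded_modules to a file before each test
gives you the list that is needed to restore the configuration after a reboot.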

Once you have found the failing driver (there can be more than one of them),
you have to unload it every time before hibernation. In that case please
make sure to report the problem with the driver.

It is also possible that the "devices" test will still fail after you have
unloaded all modules. In that case, you may want to look in your kernel
configuration for the drivers that can be compiled as modules (and test again
with these drivers compiled as modules). You may also try to use some special
kernel command line options such as "noapic", "noacpi" or even "acpi=off".

If the "platform" test fails, there is a problem with the handling of the
platform (e.g. ACPI) firmware on your system. In that case the "platform" mode
of hibernation is not likely to work. You can try the "shutdown" mode, but that
is rather a poor man's workaround.

If the "processors" test fails, the disabling/enabling of nonboot CPUs does not
work (of course, this may only be an issue on SMP systems) and the problem
should be reported. In that case you can also try to switch the nonboot CPUs
off and on using the /sys/devices/system/cpu/cpu*/online sysfs attributes and
see if that works.
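
Switching the CPUs off and on by hand can be sketched as below. The function
name cycle_nonboot_cpus is hypothetical, and the sketch simply skips any CPU
without an "online" attribute, which is typically the case for the boot CPU.

.. code-block:: shell

    # Take every CPU that has an "online" attribute offline and back online.
    # $1 defaults to /sys/devices/system/cpu; the argument exists only to
    # ease testing. Prints each CPU that fails and returns 1 in that case.
    cycle_nonboot_cpus() {
        cpu_root="${1:-/sys/devices/system/cpu}"
        rc=0
        for attr in "$cpu_root"/cpu[0-9]*/online; do
            [ -e "$attr" ] || continue
            cpu=$(basename "$(dirname "$attr")")
            # Skip the "online" step if taking the CPU offline failed.
            if ! echo 0 > "$attr" || ! echo 1 > "$attr"; then
                echo "$cpu"
                rc=1
            fi
        done
        return "$rc"
    }

If this reports a CPU that cannot be cycled, it is worth including that
information in the report about the failing "processors" test.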

If the "core" test fails, which means that suspending of the system/platform
devices has failed (these devices are suspended on one CPU with interrupts off),
the problem is most probably hardware-related and serious, so it should be
reported.

A failure of any of the "platform", "processors" or "core" tests may cause your
system to hang or become unstable, so please beware. Such a failure usually
indicates a serious problem that may very well be related to the hardware, but
please report it anyway.

b) Testing minimal configuration
--------------------------------

If all of the hibernation test modes work, you can boot the system with the
"init=/bin/bash" command line parameter and attempt to hibernate in the
"reboot", "shutdown" and "platform" modes. If that does not work, there
probably is a problem with a driver statically compiled into the kernel and you
can try to compile more drivers as modules, so that they can be tested
individually. Otherwise, there is a problem with a modular driver and you can
find it by loading half of the modules you normally use and binary searching
in accordance with the algorithm:

- if there are n modules loaded and the attempt to suspend and resume fails,
  unload n/2 of the modules and try again (that would probably involve
  rebooting the system),
- if there are n modules loaded and the attempt to suspend and resume succeeds,
  load n/2 modules more and try again.

Again, if you find the offending modules, they must be unloaded every time
before hibernation, and please report the problems with them.

c) Using the "test_resume" hibernation option
---------------------------------------------

/sys/power/disk generally tells the kernel what to do after creating a
hibernation image. One of the available options is "test_resume" which
causes the just created image to be used for immediate restoration. Namely,
after doing::

    # echo test_resume > /sys/power/disk
    # echo disk > /sys/power/state

a hibernation image will be created and a resume from it will be triggered
immediately without involving the platform firmware in any way.

That test can be used to check if failures to resume from hibernation are
related to bad interactions with the platform firmware. That is, if the above
works every time, but resume from actual hibernation does not work or is
unreliable, the platform firmware may be responsible for the failures.

On architectures and platforms that support using different kernels to restore
hibernation images (that is, the kernel used to read the image from storage and
load it into memory is different from the one included in the image) or support
kernel address space randomization, it also can be used to check if failures
to resume may be related to the differences between the restore and image
kernels.

d) Advanced debugging
---------------------

If hibernation does not work on your system even in the minimal configuration
and compiling more drivers as modules is not practical or some modules cannot
be unloaded, you can use one of the more advanced debugging techniques to find
the problem. First, if there is a serial port in your box, you can boot the
kernel with the 'no_console_suspend' parameter and try to log kernel messages
using the serial console. This may provide you with some information about the
reasons for the suspend (resume) failure. Alternatively, it may be possible to
use a FireWire port for debugging with firescope
(http://v3.sk/~lkundrak/firescope/). On x86 it is also possible to use the
PM_TRACE mechanism documented in Documentation/power/s2ram.rst .
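
Before relying on the serial console it may be worth verifying that the
parameter has actually reached the kernel. The check below is a minimal sketch
and has_kernel_param is a hypothetical name. A serial console additionally
needs a setting such as "console=ttyS0,115200", where the device name and the
baud rate are assumptions that depend on your hardware.

.. code-block:: shell

    # Succeed if kernel parameter $1 appears as a whole word on the kernel
    # command line. $2 defaults to /proc/cmdline; it exists only to ease
    # testing.
    has_kernel_param() {
        tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
    }

For instance, "has_kernel_param no_console_suspend || echo missing" prints
"missing" if the kernel was booted without that parameter.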

2. Testing suspend to RAM (STR)
===============================

To verify that the STR works, it is generally more convenient to use the s2ram
tool available from http://suspend.sf.net and documented at
http://en.opensuse.org/SDB:Suspend_to_RAM (S2RAM_LINK).

The /sys/power/pm_test facility can be used for STR as well. Namely, after
writing "freezer", "devices", "platform", "processors", or "core" into
/sys/power/pm_test (available if the kernel is compiled with CONFIG_PM_DEBUG
set) the suspend code will work in the test mode corresponding to the given
string. The STR test modes are defined in the same way as for hibernation, so
please refer to Section 1 for more information about them. In particular, the
"core" test allows you to test everything except for the actual invocation of
the platform firmware in order to put the system into the sleep state.

Among other things, the testing with the help of /sys/power/pm_test may allow
you to identify drivers that fail to suspend or resume their devices. They
should be unloaded every time before an STR transition.

Next, you can follow the instructions at S2RAM_LINK to test the system, but if
it does not work "out of the box", you may need to boot it with
"init=/bin/bash" and test s2ram in the minimal configuration. In that case,
you may be able to search for failing drivers by following a procedure
analogous to the one described in Section 1. If you find some failing drivers,
you will have to unload them every time before an STR transition (i.e. before
you run s2ram), and please report the problems with them.

There is a debugfs entry which shows the suspend to RAM statistics. Here is an
example of its output::

    # mount -t debugfs none /sys/kernel/debug
    # cat /sys/kernel/debug/suspend_stats
    success: 20
    fail: 5
    failed_freeze: 0
    failed_prepare: 0
    failed_suspend: 5
    failed_suspend_noirq: 0
    failed_resume: 0
    failed_resume_noirq: 0
    failures:
      last_failed_dev:      alarm
                            adc
      last_failed_errno:    -16
                            -16
      last_failed_step:     suspend
                            suspend

The "success" field is the number of successful suspend to RAM transitions and
the "fail" field is the number of failed ones. The remaining counters give the
number of failures in the individual steps of suspend to RAM. In addition,
suspend_stats lists the last 2 failed devices together with the corresponding
error numbers and failed steps.
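
Individual counters can be picked out of that output with a short helper. The
following is a minimal sketch that assumes the "name: value" layout shown in
the example above; suspend_stat is a hypothetical name.

.. code-block:: shell

    # Print the value of field $1 (for example "fail" or "failed_suspend").
    # $2 defaults to /sys/kernel/debug/suspend_stats; it exists only to ease
    # testing.
    suspend_stat() {
        # Match the first line whose first word is exactly "<name>:".
        awk -v key="$1:" '$1 == key { print $2; exit }' \
            "${2:-/sys/kernel/debug/suspend_stats}"
    }

Comparing the value of "fail" before and after a series of STR attempts shows
how many of them have failed without reading the whole file each time.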