==========================
PCI Bus EEH Error Recovery
==========================

Linas Vepstas <linas@austin.ibm.com>

12 January 2005


Overview:
---------
The IBM POWER-based pSeries and iSeries computers include PCI bus
controller chips that have extended capabilities for detecting and
reporting a large variety of PCI bus error conditions.  These features
go under the name of "EEH", for "Enhanced Error Handling".  The EEH
hardware features allow PCI bus errors to be cleared and a PCI
card to be "rebooted", without also having to reboot the operating
system.

This is in contrast to traditional PCI error handling, where the
PCI chip is wired directly to the CPU, and an error would cause
a CPU machine-check/check-stop condition, halting the CPU entirely.
Another "traditional" technique is to ignore such errors, which
can lead to data corruption (of either user data or kernel data),
hung/unresponsive adapters, or system crashes/lockups.  Thus,
the idea behind EEH is that the operating system can become more
reliable and robust by being protected from PCI errors, and by
being given the ability to "reboot"/recover individual PCI devices.

Future systems from other vendors, based on the PCI-E specification,
may contain similar features.


Causes of EEH Errors
--------------------
EEH was originally designed to guard against hardware failure, such
as PCI cards dying from heat, humidity, dust, vibration and bad
electrical connections. The vast majority of EEH errors seen in
"real life" are due to either poorly seated PCI cards or,
unfortunately quite commonly, device driver bugs, device firmware
bugs, and sometimes PCI card hardware bugs.

The most common software bug is one that causes the device to
attempt to DMA to a location in system memory that has not been
reserved for DMA access for that card.  Catching this is a powerful
feature of EEH, as it prevents what would otherwise have been silent
memory corruption caused by the bad DMA.  A number of device driver
bugs have been found and fixed in this way over the past few
years.  Other possible causes of EEH errors include data or
address line parity errors (for example, due to poor electrical
connectivity caused by a poorly seated card), and PCI-X split-completion
errors (due to software, device firmware, or device PCI hardware bugs).
The vast majority of "true hardware failures" can be cured by
physically removing and re-seating the PCI card.


Detection and Recovery
----------------------
The following discussion presents a generic overview of how to detect
and recover from EEH errors.  This is followed by an overview of how
the current implementation in the Linux kernel does it.  The actual
implementation is subject to change, and some of the finer points are
still being debated.  These may in turn be swayed if or when other
architectures implement similar functionality.

When a PCI Host Bridge (PHB, the bus controller connecting the
PCI bus to the system CPU electronics complex) detects a PCI error
condition, it will "isolate" the affected PCI card.  Isolation
will block all writes (either to the card from the system, or
from the card to the system), and it will cause all reads to
return all-ff's (0xff, 0xffff, 0xffffffff for 8/16/32-bit reads).
This value was chosen because it is the same value you would
get if the device was physically unplugged from the slot.
Isolation applies to PCI memory, I/O space, and PCI config
space.  Interrupts, however, will continue to be delivered.
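
The isolation behaviour described above can be illustrated with a small
user-space model. This is only a sketch: ``struct slot_model``,
``slot_read()`` and ``slot_write()`` are hypothetical names invented for
this example and do not correspond to any kernel or firmware interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one PCI slot: a single 32-bit register plus
 * an "isolated" flag standing in for the PHB's frozen state. */
struct slot_model {
    bool     isolated;
    uint32_t reg;
};

/* Reads from an isolated slot return all ones at the access width
 * (0xff, 0xffff or 0xffffffff for 8/16/32-bit reads). */
static uint32_t slot_read(const struct slot_model *s, unsigned width_bits)
{
    uint32_t mask = (width_bits >= 32) ? 0xffffffffu
                                       : ((1u << width_bits) - 1u);
    return s->isolated ? mask : (s->reg & mask);
}

/* Writes to an isolated slot are silently dropped. */
static void slot_write(struct slot_model *s, uint32_t val)
{
    if (!s->isolated)
        s->reg = val;
}
```

In this model a driver that writes a value and reads it back while the
slot is isolated sees all ones instead of the written value, which is
the symptom that prompts a check for an EEH freeze.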

Detection and recovery are performed with the aid of ppc64
firmware.  The programming interfaces in the Linux kernel
into the firmware are referred to as RTAS (Run-Time Abstraction
Services).  The Linux kernel does not (should not) access
the EEH function in the PCI chipsets directly, primarily because
there are a number of different chipsets out there, each with
different interfaces and quirks. The firmware provides a
uniform abstraction layer that will work with all pSeries
and iSeries hardware (and be forwards-compatible).

If the OS or device driver suspects that a PCI slot has been
EEH-isolated, there is a firmware call it can make to determine if
this is the case. If so, then the device driver should put itself
into a consistent state (given that it won't be able to complete any
pending work) and start recovery of the card.  Recovery normally
would consist of resetting the PCI device (holding the PCI #RST
line high for two seconds), followed by setting up the device
config space (the base address registers (BARs), latency timer,
cache line size, interrupt line, and so on).  This is followed by a
reinitialization of the device driver.  In a worst-case scenario,
the power to the card can be toggled, at least on hot-plug-capable
slots.  In principle, layers far above the device driver probably
do not need to know that the PCI card has been "rebooted" in this
way; ideally, there should be at most a pause in Ethernet/disk/USB
I/O while the card is being reset.

If the card cannot be recovered after three or four resets, the
kernel/device driver should assume the worst-case scenario, that the
card has died completely, and report this error to the sysadmin.
In addition, error messages are reported through RTAS and also through
syslogd (/var/log/messages) to alert the sysadmin of PCI resets.
The correct way to deal with failed adapters is to use the standard
PCI hotplug tools to remove and replace the dead card.
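
The "give up after a few resets" policy can be sketched as a bounded
retry loop. Everything here is an assumption made for illustration: the
limit of three is one of the two values the text mentions, and
``recover_device()``, ``reset_fn`` and ``reset_succeeds_on_nth()`` are
hypothetical names, not kernel functions.

```c
#include <stdbool.h>

/* Assumed upper bound; the text says "three or four resets". */
#define MAX_RESET_ATTEMPTS 3

enum recovery_result { RECOVERED, DEVICE_DEAD };

/* Hypothetical callback: performs one reset and reports whether the
 * device responds normally afterwards. */
typedef bool (*reset_fn)(void *ctx);

/* Try up to MAX_RESET_ATTEMPTS resets; record how many were used. */
static enum recovery_result recover_device(reset_fn reset_once, void *ctx,
                                           int *attempts_used)
{
    for (int i = 1; i <= MAX_RESET_ATTEMPTS; i++) {
        *attempts_used = i;
        if (reset_once(ctx))
            return RECOVERED;
    }
    return DEVICE_DEAD;   /* caller should report this to the sysadmin */
}

/* Toy reset for demonstration: ctx points to an int counter, and the
 * reset succeeds on the call that brings the counter down to zero. */
static bool reset_succeeds_on_nth(void *ctx)
{
    int *remaining = ctx;
    return --(*remaining) <= 0;
}
```

With the counter set to 2 the toy device recovers on the second reset;
with a counter larger than the limit the loop returns ``DEVICE_DEAD``,
at which point the card would be replaced using the hotplug tools.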


Current PPC64 Linux EEH Implementation
--------------------------------------
At this time, a generic EEH recovery mechanism has been implemented,
so that individual device drivers do not need to be modified to support
EEH recovery.  This generic mechanism piggy-backs on the PCI hotplug
infrastructure, and percolates events up through the userspace/udev
infrastructure.  Following is a detailed description of how this is
accomplished.

EEH must be enabled in the PHBs very early during the boot process,
and again whenever a PCI slot is hot-plugged. The former is performed by
eeh_init() in arch/powerpc/platforms/pseries/eeh.c, and the latter by
drivers/pci/hotplug/pSeries_pci.c calling in to the eeh.c code.
EEH must be enabled before a PCI scan of the device can proceed.
Current Power5 hardware will not work unless EEH is enabled,
although older Power4 can run with it disabled.  Effectively,
EEH can no longer be turned off.  PCI devices *must* be
registered with the EEH code; the EEH code needs to know about
the I/O address ranges of the PCI device in order to detect an
error.  Given an arbitrary address, the routine
pci_get_device_by_addr() will find the PCI device associated
with that address (if any).

The default arch/powerpc/include/asm/io.h macros readb(), inb(), insb(),
etc. include a check to see if the I/O read returned all-0xff's.
If so, these make a call to eeh_dn_check_failure(), which in turn
asks the firmware if the all-ff's value is the sign of a true EEH
error.  If it is not, processing continues as normal.  The grand
total number of these false alarms or "false positives" can be
seen in /proc/ppc64/eeh (subject to change).  Normally, almost
all of these occur during boot, when the PCI bus is scanned, where
a large number of 0xff reads are part of the bus scan procedure.
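
The check-then-confirm flow can be sketched in user-space C as follows.
This is a model under stated assumptions, not the kernel code:
``firmware_says_frozen()``, ``fake_slot_frozen``, ``false_positives``
and ``check_read32()`` are hypothetical stand-ins for the firmware
query, the slot state, the statistics counter and the check embedded in
the I/O macros.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the slot state and the statistics
 * counter; neither name comes from the kernel sources. */
static bool fake_slot_frozen;
static unsigned long false_positives;

/* Stand-in for the firmware query made on a suspicious read. */
static bool firmware_says_frozen(void)
{
    return fake_slot_frozen;
}

/* Returns true only if the value is all ones AND firmware confirms a
 * freeze. An all-ones value from a healthy slot is counted as a false
 * positive and otherwise ignored. */
static bool check_read32(uint32_t val)
{
    if (val != 0xffffffffu)
        return false;            /* fast path: no firmware call */
    if (firmware_says_frozen())
        return true;
    false_positives++;
    return false;
}
```

The point of this arrangement is that the comparatively expensive
firmware query is made only for the rare reads that return all ones,
so ordinary I/O pays just the cost of one comparison.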

If a frozen slot is detected, code in
arch/powerpc/platforms/pseries/eeh.c will print a stack trace to
syslog (/var/log/messages).  This stack trace has proven to be very
useful to device-driver authors for finding out at what point the EEH
error was detected, as the error itself usually occurs slightly
beforehand.

Next, the EEH code uses the Linux kernel notifier chain/work queue
mechanism to allow any interested parties to find out about the
failure.  Device drivers, or other parts of the kernel, can use
``eeh_register_notifier(struct notifier_block *)`` to find out about EEH
events.  The event will include a pointer to the PCI device, the
device node and some state info.  Receivers of the event can "do as
they wish"; the default handler will be described further in this
section.
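
For readers unfamiliar with notifier chains, the idea can be modelled in
a few lines of user-space C. This is a simplified sketch: ``struct
notifier``, ``chain_register()``, ``chain_call()`` and
``record_event()`` are hypothetical names, and the model omits the
priorities, locking and return-value handling that a real kernel
notifier chain has.

```c
#include <stddef.h>

/* Simplified user-space model of a notifier chain entry. It is not
 * the kernel's struct notifier_block. */
struct notifier {
    int (*call)(struct notifier *self, unsigned long event, void *data);
    struct notifier *next;
};

static struct notifier *chain_head;

/* Append to the tail so handlers run in registration order. */
static void chain_register(struct notifier *n)
{
    struct notifier **p = &chain_head;

    while (*p)
        p = &(*p)->next;
    n->next = NULL;
    *p = n;
}

/* Call every registered handler; returns how many were invoked. */
static int chain_call(unsigned long event, void *data)
{
    int called = 0;

    for (struct notifier *n = chain_head; n; n = n->next) {
        n->call(n, event, data);
        called++;
    }
    return called;
}

/* Example handler: stores the event code it saw through data. */
static int record_event(struct notifier *self, unsigned long event,
                        void *data)
{
    (void)self;
    *(unsigned long *)data = event;
    return 0;
}
```

A subsystem interested in failures registers one entry; when an event
is raised, every registered handler is invoked in turn with the same
event code and data pointer.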

To assist in the recovery of the device, eeh.c exports the
following functions:

rtas_set_slot_reset()
   assert the PCI #RST line for 1/8th of a second
rtas_configure_bridge()
   ask firmware to configure any PCI bridges
   located topologically under the PCI slot
eeh_save_bars() and eeh_restore_bars()
   save and restore the PCI config-space info
   for a device and any devices under it


A handler for the EEH notifier_block events is implemented in
drivers/pci/hotplug/pSeries_pci.c, called handle_eeh_events().
It saves the device BARs and then calls rpaphp_unconfig_pci_adapter().
This last call causes the device driver for the card to be stopped,
which causes uevents to go out to user space. This triggers
user-space scripts that might issue commands such as "ifdown eth0"
for ethernet cards, and so on.  This handler then sleeps for 5 seconds,
hoping to give the user-space scripts enough time to complete.
It then resets the PCI card, reconfigures the device BARs and
any bridges underneath. It then calls rpaphp_enable_pci_slot(),
which restarts the device driver and triggers more user-space
events (for example, calling "ifup eth0" for ethernet cards).
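
The ordering of the handler's steps can be written down as a small
state sequence. The following is only an illustrative user-space
sketch: the ``enum step`` values paraphrase the steps in the text, and
``run_recovery()``, ``step_fn``, ``struct step_log`` and ``log_step()``
are hypothetical names rather than anything found in pSeries_pci.c.

```c
#include <stddef.h>

/* The recovery steps, in the order the text describes them. */
enum step {
    SAVE_BARS, UNCONFIG_ADAPTER, WAIT_FOR_USERSPACE,
    RESET_SLOT, RESTORE_BARS, CONFIGURE_BRIDGES, ENABLE_SLOT,
    NUM_STEPS
};

/* Hypothetical hook: invoked once per step, returns 0 on success. */
typedef int (*step_fn)(enum step s, void *ctx);

/* Runs the steps in order and stops at the first failure.
 * Returns the number of steps that completed. */
static size_t run_recovery(step_fn do_step, void *ctx)
{
    size_t done = 0;

    for (int s = SAVE_BARS; s < NUM_STEPS; s++) {
        if (do_step((enum step)s, ctx) != 0)
            break;
        done++;
    }
    return done;
}

/* Demonstration hook: appends each step to a log array. */
struct step_log {
    enum step seen[NUM_STEPS];
    size_t    n;
};

static int log_step(enum step s, void *ctx)
{
    struct step_log *out = ctx;

    out->seen[out->n++] = s;
    return 0;
}
```

The ordering matters: the BARs must be saved before the adapter is
unconfigured, and they can only be restored after the reset, since the
reset clears the device's config space.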


Device Shutdown and User-Space Events
-------------------------------------
This section documents what happens when a PCI slot is unconfigured,
focusing on how the device driver gets shut down, and on how the
events get delivered to user-space scripts.

Following is an example sequence of events that cause a device driver
close function to be called during the first phase of an EEH reset.
The sequence shown is for the pcnet32 device driver::

    rpaphp_unconfig_pci_adapter (struct slot *)  // in rpaphp_pci.c
    {
      calls
      pci_remove_bus_device (struct pci_dev *) // in drivers/pci/remove.c
      {
        calls
        pci_destroy_dev (struct pci_dev *)
        {
          calls
          device_unregister (&dev->dev) // in drivers/base/core.c
          {
            calls
            device_del (struct device *)
            {
              calls
              bus_remove_device() // in drivers/base/bus.c
              {
                calls
                device_release_driver()
                {
                  calls
                  struct device_driver->remove() which is just
                  pci_device_remove()  // in drivers/pci/pci_driver.c
                  {
                    calls
                    struct pci_driver->remove() which is just
                    pcnet32_remove_one() // in drivers/net/pcnet32.c
                    {
                      calls
                      unregister_netdev() // in net/core/dev.c
                      {
                        calls
                        dev_close()  // in net/core/dev.c
                        {
                           calls dev->stop();
                           which is just pcnet32_close() // in pcnet32.c
                           {
                             which does what you wanted
                             to stop the device
                           }
                        }
                      }
                      which
                      frees pcnet32 device driver memory
                    }
                  }
      }}}}}}}


In drivers/pci/pci_driver.c,
struct device_driver->remove() is just pci_device_remove(),
which calls struct pci_driver->remove(), which is pcnet32_remove_one(),
which calls unregister_netdev()  (in net/core/dev.c),
which calls dev_close()  (in net/core/dev.c),
which calls dev->stop(), which is pcnet32_close(),
which then does the appropriate shutdown.

----

Following is the analogous stack trace for events sent to user-space
when the PCI device is unconfigured::

  rpaphp_unconfig_pci_adapter() {              // in rpaphp_pci.c
    calls
    pci_remove_bus_device (struct pci_dev *) { // in drivers/pci/remove.c
      calls
      pci_destroy_dev (struct pci_dev *) {
        calls
        device_unregister (&dev->dev) {        // in drivers/base/core.c
          calls
          device_del(struct device * dev) {    // in drivers/base/core.c
            calls
            kobject_del() {                    // in lib/kobject.c
              calls
              kobject_uevent() {               // in lib/kobject.c
                calls
                kset_uevent() {                // in lib/kobject.c
                  calls
                  kset->uevent_ops->uevent()   // which is really just
                                               // a call to
                  dev_uevent() {               // in drivers/base/core.c
                    calls
                    dev->bus->uevent() which is really just a call to
                    pci_uevent () {            // in drivers/pci/hotplug.c
                      which prints device name, etc....
                    }
                  }
                }
                then kobject_uevent() sends a netlink uevent to userspace
                --> userspace uevent
                (during early boot, nobody listens to netlink events and
                kobject_uevent() executes uevent_helper[], which runs the
                event process /sbin/hotplug)
              }
              kobject_del() then calls sysfs_remove_dir(), which would
              trigger any user-space daemon that is watching sysfs
              for the delete event.
            }
    }}}}}


Pros and Cons of the Current Design
-----------------------------------
There are several issues with the current EEH software recovery design,
which may be addressed in future revisions.  But first, note that the
big plus of the current design is that no changes need to be made to
individual device drivers, so that the current design throws a wide net.
The biggest negative of the design is that it potentially disturbs
network daemons and file systems that didn't need to be disturbed.

-  A minor complaint is that resetting the network card causes
   user-space back-to-back ifdown/ifup burps that potentially disturb
   network daemons that didn't even need to know that the PCI
   card was being rebooted.

-  A more serious concern is that the same reset, for SCSI devices,
   causes havoc to mounted file systems.  Scripts cannot post-facto
   unmount a file system without flushing pending buffers, but this
   is impossible, because I/O has already been stopped.  Thus,
   ideally, the reset should happen at or below the block layer,
   so that the file systems are not disturbed.

   Reiserfs does not tolerate errors returned from the block device.
   Ext3fs seems to be tolerant, retrying reads/writes until it does
   succeed. Both have been only lightly tested in this scenario.

   The SCSI-generic subsystem already has built-in code for performing
   SCSI device resets, SCSI bus resets, and SCSI host-bus-adapter
   (HBA) resets.  These are cascaded into a chain of attempted
   resets if a SCSI command fails. These are completely hidden
   from the block layer.  It would be very natural to add an EEH
   reset into this chain of events.

-  If a SCSI error occurs for the root device, all is lost unless
   the sysadmin had the foresight to run /bin, /sbin, /etc, /var
   and so on, out of ramdisk/tmpfs.


Conclusions
-----------
There's forward progress ...