.. SPDX-License-Identifier: GPL-2.0

==============
Devlink Health
==============

Background
==========

The ``devlink`` health mechanism is targeted at real-time alerting, so that
the user knows when something bad has happened to a PCI device. Its goals are:

  * Provide alert debug information.
  * Self-healing.
  * If a problem needs vendor support, provide a way to gather all needed
    debugging information.

Overview
========

The main idea is to unify and centralize driver health reports in the
generic ``devlink`` instance and allow the user to set different
attributes of the health reporting and recovery procedures.

The ``devlink`` health reporter:
The device driver creates a "health reporter" for each error/health type.
An error/health type can be known/generic (e.g. PCI error, firmware error,
rx/tx error) or unknown (driver specific).
For each registered health reporter, a driver can issue error/health reports
asynchronously. All handling of health reports is done by ``devlink``.
The device driver can provide specific callbacks for each "health reporter", e.g.:

  * Recovery procedures
  * Diagnostics procedures
  * Object dump procedures
  * Out-of-box (OOB) initial parameters

Different parts of the driver can register different types of health reporters
with different handlers.

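The per-reporter callback idea can be illustrated with a small userspace
model. This is only a sketch under stated assumptions, not the kernel API:
``reporter_ops_model``, ``reporter_instance``, ``reporter_invoke`` and the
``tx`` example reporter are hypothetical names invented for illustration. The
model assumes that every callback is optional and that invoking a missing
callback yields ``-EOPNOTSUPP``.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-reporter callback table; every callback is optional. */
struct reporter_ops_model {
    const char *name;
    int (*recover)(void *priv);
    int (*diagnose)(void *priv);
    int (*dump)(void *priv);
};

/* One registered reporter: its callbacks plus an opaque driver context. */
struct reporter_instance {
    const struct reporter_ops_model *ops;
    void *priv; /* passed back unchanged to each callback */
};

enum reporter_op { OP_RECOVER, OP_DIAGNOSE, OP_DUMP };

/* Dispatch an operation to the driver callback, if the driver provided one. */
static int reporter_invoke(const struct reporter_instance *r, enum reporter_op op)
{
    int (*cb)(void *priv) = NULL;

    assert(r != NULL && r->ops != NULL);
    switch (op) {
    case OP_RECOVER:
        cb = r->ops->recover;
        break;
    case OP_DIAGNOSE:
        cb = r->ops->diagnose;
        break;
    case OP_DUMP:
        cb = r->ops->dump;
        break;
    }
    return cb ? cb(r->priv) : -EOPNOTSUPP;
}

/* Example driver side: a "tx" reporter that only knows how to recover. */
static int tx_recover(void *priv)
{
    int *resets = priv;

    (*resets)++; /* stand-in for resetting a transmit queue */
    return 0;
}

static const struct reporter_ops_model tx_ops = {
    .name = "tx",
    .recover = tx_recover,
};
```

In this model, a different part of the driver would define its own ops table
(for example one that provides only a dump callback) and register it as a
separate reporter instance.
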
Actions
=======

Once an error is reported, devlink health will perform the following actions:

  * A log is sent to the kernel trace events buffer
  * Health status and statistics are updated for the reporter instance
  * An object dump is taken and saved at the reporter instance (as long as
    no other dump is already stored)
  * An auto recovery attempt is made, depending on:

    - Auto-recovery configuration
    - Grace period vs. time passed since the last recovery

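The sequence of actions can be sketched as a simplified userspace model. It
is an illustration only and not the kernel implementation: the struct, its
field names and ``model_report`` are hypothetical, times are plain
milliseconds supplied by the caller, and the steps simply follow the order of
the list in this section. The model also assumes that the grace period only
applies once at least one recovery has taken place.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical model of one reporter's state; not the kernel's struct. */
struct reporter_model {
    bool auto_recover;         /* auto-recovery configuration */
    uint64_t grace_period_ms;  /* minimum spacing between auto recoveries */
    uint64_t last_recovery_ms; /* time of the most recent recovery */
    bool has_recovered;        /* true once any recovery has run */
    bool dump_stored;          /* a dump is currently saved */
    bool healthy;              /* current health status */
    unsigned int error_count;
    unsigned int recovery_count;
};

enum report_result { REPORT_RECOVERED, REPORT_NO_RECOVERY };

static enum report_result model_report(struct reporter_model *r,
                                       uint64_t now_ms, const char *msg)
{
    assert(r != NULL);

    /* 1. Log the event (stderr stands in for the trace buffer). */
    fprintf(stderr, "health report: %s\n", msg);

    /* 2. Update health status and statistics. */
    r->error_count++;
    r->healthy = false;

    /* 3. Take a dump only if none is already stored. */
    if (!r->dump_stored)
        r->dump_stored = true;

    /* 4. Attempt auto recovery, subject to configuration and grace period. */
    if (!r->auto_recover)
        return REPORT_NO_RECOVERY;
    if (r->has_recovered &&
        now_ms - r->last_recovery_ms < r->grace_period_ms)
        return REPORT_NO_RECOVERY;

    r->recovery_count++;
    r->last_recovery_ms = now_ms;
    r->has_recovered = true;
    r->healthy = true;
    return REPORT_RECOVERED;
}
```

With a 500 ms grace period in this model, a second error reported 200 ms
after a recovery leaves the reporter in the error state, while one reported
600 ms after it triggers another recovery.
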
User Interface
==============

The user can access/change each reporter's parameters and invoke
driver-specific callbacks via ``devlink``, e.g. per error type (per health
reporter):

  * Configure reporter's generic parameters (like: disable/enable auto recovery)
  * Invoke recovery procedure
  * Run diagnostics
  * Object dump

.. list-table:: List of devlink health interfaces
   :widths: 10 90

   * - Name
     - Description
   * - ``DEVLINK_CMD_HEALTH_REPORTER_GET``
     - Retrieves status and configuration info per device and reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_SET``
     - Allows reporter-related configuration setting.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_RECOVER``
     - Triggers the reporter's recovery procedure.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_TEST``
     - Triggers a fake health event on the reporter. The effects of the test
       event in terms of recovery flow should closely follow those of a real
       event.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE``
     - Retrieves current device state related to the reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
     - Retrieves the last stored dump. Devlink health
       saves a single dump. If a dump is not already stored by devlink
       for this reporter, devlink generates a new dump.
       Dump output is defined by the reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR``
     - Clears the last saved dump for the specified reporter.

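The single-dump behaviour described for ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
and ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR`` can be sketched with a small
userspace model. The names ``dump_slot``, ``generate_dump``, ``dump_get`` and
``dump_clear`` are hypothetical, and the dump content is a placeholder string;
in devlink the dump output is defined by the reporter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define DUMP_MAX 64

/* Hypothetical model of the single dump slot kept per reporter. */
struct dump_slot {
    bool stored;             /* a dump is currently saved */
    char data[DUMP_MAX];     /* placeholder dump content */
    unsigned int generation; /* how many dumps have been generated */
};

/* Stand-in for the reporter's dump callback. */
static void generate_dump(struct dump_slot *s)
{
    s->generation++;
    snprintf(s->data, sizeof(s->data), "dump #%u", s->generation);
    s->stored = true;
}

/* Return the stored dump, generating one first if none is stored. */
static const char *dump_get(struct dump_slot *s)
{
    assert(s != NULL);
    if (!s->stored)
        generate_dump(s);
    return s->data;
}

/* Drop the stored dump so that the next get generates a fresh one. */
static void dump_clear(struct dump_slot *s)
{
    assert(s != NULL);
    s->stored = false;
    s->data[0] = '\0';
}
```

Calling ``dump_get`` twice in a row returns the same stored dump in this
model; a new dump is only generated after ``dump_clear`` has emptied the slot.
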
The following diagram provides a general overview of ``devlink-health``::

                                                   netlink
                                          +--------------------------+
                                          |                          |
                                          |            +             |
                                          |            |             |
                                          +--------------------------+
                                                       |request for ops
                                                       |(diagnose,
      driver                               devlink     |recover,
                                                       |dump)
    +--------+                            +--------------------------+
    |        |                            |    reporter|             |
    |        |                            |  +---------v----------+  |
    |        |   ops execution            |  |                    |  |
    |     <----------------------------------+                    |  |
    |        |                            |  |                    |  |
    |        |                            |  + ^------------------+  |
    |        |                            |    | request for ops     |
    |        |                            |    | (recover, dump)     |
    |        |                            |    |                     |
    |        |                            |  +-+------------------+  |
    |        |     health report          |  | health handler     |  |
    |        +------------------------------->                    |  |
    |        |                            |  +--------------------+  |
    |        |     health reporter create |                          |
    |        +---------------------------->                          |
    +--------+                            +--------------------------+