.. SPDX-License-Identifier: GPL-2.0

==============
Devlink Health
==============

Background
==========

The ``devlink`` health mechanism targets real-time alerting, so that users
know when something bad has happened to a PCI device. It aims to:

* Provide alert debug information.
* Enable self-healing.
* Provide a way to gather all needed debugging information if the problem
  needs vendor support.

Overview
========

The main idea is to unify and centralize driver health reports in the
generic ``devlink`` instance and allow the user to set different
attributes of the health reporting and recovery procedures.

A ``devlink`` health reporter works as follows. A device driver creates a
"health reporter" for each error/health type. An error/health type can be
known/generic (e.g. PCI error, fw error, rx/tx error) or unknown (driver
specific). For each registered health reporter, a driver can issue
error/health reports asynchronously. All health report handling is done by
``devlink``. A device driver can provide specific callbacks for each
"health reporter", e.g.:

* Recovery procedures
* Diagnostics procedures
* Object dump procedures
* Out-of-box (OOB) initial parameters

Different parts of the driver can register different types of health reporters
with different handlers.
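
The per-reporter callback model can be illustrated with a small user-space
sketch in plain C. It is not the kernel ``devlink`` API: the names
``struct reporter_ops``, ``struct reporter``, ``reporter_recover()``,
``tx_ops`` and ``fw_ops`` are hypothetical, and returning ``-EOPNOTSUPP``
for a missing callback is an assumption of this sketch.

.. code-block:: c

   #include <errno.h>
   #include <stddef.h>

   /* Hypothetical model; not the kernel's devlink structures. */
   struct reporter_ops {
       const char *name;             /* error/health type, e.g. "tx" */
       int (*recover)(void *priv);   /* optional recovery procedure */
       int (*diagnose)(void *priv);  /* optional diagnostics procedure */
       int (*dump)(void *priv);      /* optional object dump procedure */
   };

   struct reporter {
       const struct reporter_ops *ops;
       void *priv;                   /* driver context handed to callbacks */
   };

   /* Run the recovery callback if the driver supplied one. */
   static int reporter_recover(const struct reporter *r)
   {
       if (!r->ops->recover)
           return -EOPNOTSUPP;       /* assumption: no callback, no support */
       return r->ops->recover(r->priv);
   }

   /* Driver side: a "tx" reporter whose recovery counts queue resets. */
   static int tx_recover(void *priv)
   {
       int *resets = priv;

       (*resets)++;                  /* stand-in for resetting a TX queue */
       return 0;
   }

   static const struct reporter_ops tx_ops = {
       .name = "tx",
       .recover = tx_recover,
   };

   /* A "fw" reporter that registers no recovery callback at all. */
   static const struct reporter_ops fw_ops = {
       .name = "fw",
   };

Each reporter carries its own set of handlers, so two parts of the same
driver can expose different capabilities through one common entry point.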

Actions
=======

Once an error is reported, ``devlink`` health performs the following actions:

* A log is sent to the kernel trace events buffer.
* Health status and statistics are updated for the reporter instance.
* An object dump is taken and saved at the reporter instance (as long as
  no other dump is already stored).
* An auto-recovery attempt is made, depending on:

  - Auto-recovery configuration
  - Grace period vs. time passed since the last recovery
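
The decision flow in this list can be sketched as a simplified user-space
model. It follows the order of the list above rather than the kernel
implementation, omits the trace log, and assumes that recovery always
succeeds. ``struct reporter_state``, its fields and ``handle_report()`` are
hypothetical names.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   /* Hypothetical model of one reporter instance. */
   struct reporter_state {
       bool healthy;
       bool auto_recover;         /* auto-recovery configuration */
       uint64_t grace_period_ms;  /* minimum spacing between auto-recoveries */
       uint64_t last_recover_ms;  /* time of the last recovery */
       bool recovered_before;     /* false until the first recovery */
       bool dump_stored;          /* true while a dump is saved */
       unsigned int error_count;
       unsigned int recover_count;
   };

   /*
    * Handle one error report at time now_ms.
    * Returns true if an auto-recovery attempt was made.
    */
   static bool handle_report(struct reporter_state *r, uint64_t now_ms)
   {
       /* Update health status and statistics. */
       r->healthy = false;
       r->error_count++;

       /* Take a dump only if none is stored yet. */
       if (!r->dump_stored)
           r->dump_stored = true;

       /* Auto-recovery depends on the configuration ... */
       if (!r->auto_recover)
           return false;

       /* ... and on the time passed since the last recovery. */
       if (r->recovered_before &&
           now_ms - r->last_recover_ms < r->grace_period_ms)
           return false;

       /* Attempt recovery; this model assumes it always succeeds. */
       r->healthy = true;
       r->recover_count++;
       r->last_recover_ms = now_ms;
       r->recovered_before = true;
       return true;
   }

With a grace period of 500 ms, a second error 200 ms after a recovery is
counted and leaves the reporter unhealthy, but triggers no new recovery
attempt.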

User Interface
==============

The user can access and change each reporter's parameters and invoke its
driver-specific callbacks via ``devlink``, e.g. per error type (per health
reporter):

* Configure the reporter's generic parameters (e.g. disable/enable auto
  recovery)
* Invoke the recovery procedure
* Run diagnostics
* Get an object dump

.. list-table:: List of devlink health interfaces
   :widths: 10 90

   * - Name
     - Description
   * - ``DEVLINK_CMD_HEALTH_REPORTER_GET``
     - Retrieves status and configuration info per DEV and reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_SET``
     - Allows reporter-related configuration setting.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_RECOVER``
     - Triggers the reporter's recovery procedure.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_TEST``
     - Triggers a fake health event on the reporter. The effects of the test
       event in terms of recovery flow should closely follow those of a real
       event.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE``
     - Retrieves current device state related to the reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
     - Retrieves the last stored dump. Devlink health
       saves a single dump. If a dump is not already stored by devlink
       for this reporter, devlink generates a new dump.
       Dump output is defined by the reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR``
     - Clears the last saved dump file for the specified reporter.
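
The single-dump semantics of ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET`` and
``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR`` can be sketched in user space as
follows. This is an illustrative model only: ``struct dump_slot``,
``generate_dump()``, ``dump_get()`` and ``dump_clear()`` are hypothetical
names, and the dump content is a placeholder string.

.. code-block:: c

   #include <stdbool.h>
   #include <stdio.h>

   /* Hypothetical single-slot dump store for one reporter. */
   struct dump_slot {
       bool stored;
       char data[64];
       unsigned int generated;    /* how many dumps were produced so far */
   };

   /* Stand-in for the reporter's dump callback. */
   static void generate_dump(struct dump_slot *s)
   {
       s->generated++;
       snprintf(s->data, sizeof(s->data), "dump #%u", s->generated);
       s->stored = true;
   }

   /* DUMP_GET: return the stored dump, generating one first if needed. */
   static const char *dump_get(struct dump_slot *s)
   {
       if (!s->stored)
           generate_dump(s);
       return s->data;
   }

   /* DUMP_CLEAR: drop the stored dump so the next get produces a new one. */
   static void dump_clear(struct dump_slot *s)
   {
       s->stored = false;
       s->data[0] = '\0';
   }

Repeated gets return the same stored dump until it is cleared, which is why
a dump taken at the time of an error stays available for later inspection.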

The following diagram provides a general overview of ``devlink-health``::

                                                   netlink
                                          +--------------------------+
                                          |                          |
                                          |            +             |
                                          |            |             |
                                          +--------------------------+
                                                       |request for ops
                                                       |(diagnose,
      driver                               devlink     |recover,
                                                       |dump)
    +--------+                            +--------------------------+
    |        |                            |    reporter|             |
    |        |                            |  +---------v----------+  |
    |        |   ops execution            |  |                    |  |
    |     <----------------------------------+                    |  |
    |        |                            |  |                    |  |
    |        |                            |  + ^------------------+  |
    |        |                            |    | request for ops     |
    |        |                            |    | (recover, dump)     |
    |        |                            |    |                     |
    |        |                            |  +-+------------------+  |
    |        |     health report          |  | health handler     |  |
    |        +------------------------------->                    |  |
    |        |                            |  +--------------------+  |
    |        |     health reporter create |                          |
    |        +---------------------------->                          |
    +--------+                            +--------------------------+