.. SPDX-License-Identifier: GPL-2.0

=============
Devlink DPIPE
=============

Background
==========

While performing the hardware offloading process, many of the hardware
specifics cannot be presented. These details are useful for debugging, and
``devlink-dpipe`` offers a standardized way to provide visibility into the
offloading process.

For example, the routing longest prefix match (LPM) algorithm used by the
Linux kernel may differ from the hardware implementation. The pipeline debug
API (DPIPE) is aimed at providing the user visibility into the ASIC's
pipeline in a generic way.

The hardware offload process is expected to be transparent: the user should
not be able to distinguish between the hardware and software
implementations. In this process, hardware specifics are neglected. In
reality, those details can be significant and should be exposed in some
standard way.

This problem becomes even more complex when one wishes to offload the
control path of the whole networking stack to a switch ASIC. Due to
differences between the hardware and software models, some processes cannot
be represented correctly.

One example is the kernel's LPM algorithm, which in many cases differs
greatly from the hardware implementation. The configuration API is the same,
but one cannot rely on the Forwarding Information Base (FIB) in hardware to
look like the kernel's Level Path Compression trie (LPC-trie).

In many situations, analyzing a system failure based solely on the kernel's
dump may not be enough. Combining this data with complementary information
about the underlying hardware makes debugging easier; the same information
can also be useful when debugging performance issues.

Overview
========

The ``devlink-dpipe`` interface closes this gap. The hardware's pipeline is
modeled as a graph of match/action tables. Each table represents a specific
hardware block. This model is not new; it was first used by the P4 language.

Traditionally it has been used as an alternative model for hardware
configuration, but the ``devlink-dpipe`` interface uses it for visibility
purposes as a standard complementary tool. The system's view from
``devlink-dpipe`` should change according to the changes done by the
standard configuration tools.

For example, it is quite common to implement Access Control Lists (ACL)
using Ternary Content Addressable Memory (TCAM). The TCAM can be divided
into TCAM regions. Complex TC filters can have multiple rules with
different priorities and different lookup keys. Hardware TCAM regions, on
the other hand, have a predefined lookup key. Offloading the TC filter rules
using the TCAM engine can therefore result in multiple TCAM regions being
interconnected in a chain (which may affect the data path latency). In
response to a new TC filter, new tables should be created describing those
regions.

Model
=====

The ``DPIPE`` model introduces several objects:

* headers
* tables
* entries

A ``header`` describes packet formats and provides names for fields within
the packet. A ``table`` describes hardware blocks. An ``entry`` describes
the actual content of a specific table.
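As a rough illustration only, the three object types can be pictured as
plain data records. The Python sketch below is not the kernel or netlink
representation; every class and attribute name in it (``Field``, ``Header``,
``Table``, ``Entry``, ``bitwidth`` and so on) is an assumption made for this
example, and the values are made up.

.. code:: python

    from dataclasses import dataclass, field
    from typing import Dict, List, Tuple


    @dataclass(frozen=True)
    class Field:
        """A named field inside a header (names and widths are illustrative)."""
        name: str
        bitwidth: int


    @dataclass(frozen=True)
    class Header:
        """A packet format: a name plus the fields it contains."""
        name: str
        fields: Tuple[Field, ...]


    @dataclass
    class Entry:
        """One row of a table: match values, action values and a counter."""
        index: int
        match_values: Dict[str, int]
        action_values: Dict[str, int]
        counter: int = 0


    @dataclass
    class Table:
        """A hardware block: what it matches on, what it does, its rows."""
        name: str
        size: int
        matches: List[str]
        actions: List[str]
        counters_enabled: bool = False
        entries: List[Entry] = field(default_factory=list)


    # Hypothetical instances of the three object types.
    ipv4 = Header("ipv4", (Field("dst_addr", 32),))
    host = Table("local_host", size=4096,
                 matches=["meta.rif_port", "ipv4.dst_addr"],
                 actions=["ethernet.daddr"])
    host.entries.append(Entry(index=0,
                              match_values={"meta.rif_port": 3,
                                            "ipv4.dst_addr": 0xC0000201},
                              action_values={"ethernet.daddr": 0x0202AABBCCDD}))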

The hardware pipeline is not port specific, but rather describes the whole
ASIC. Thus it is tied to the top of the ``devlink`` infrastructure.

Drivers can register and unregister tables at run time, in order to support
dynamic behavior. This dynamic behavior is mandatory for describing hardware
blocks like TCAM regions which can be allocated and freed dynamically.

``devlink-dpipe`` generally is not intended for configuration. The exception
is hardware counting for a specific table.

The following commands are used to obtain the ``dpipe`` objects from
userspace:

* ``table_get``: Receive a table's description.
* ``headers_get``: Receive a device's supported headers.
* ``entries_get``: Receive a table's current entries.
* ``counters_set``: Enable or disable counters on a table.

Table
-----

The driver should implement the following operations for each table:

* ``matches_dump``: Dump the supported matches.
* ``actions_dump``: Dump the supported actions.
* ``entries_dump``: Dump the actual content of the table.
* ``counters_set_update``: Synchronize hardware with counters enabled or
  disabled.

Header/Field
------------

As in P4, headers and fields are used to describe a table's behavior. There
is a slight difference between the standard protocol headers and specific
ASIC metadata. The protocol headers should be declared in the ``devlink``
core API. ASIC metadata, on the other hand, is driver specific and should be
defined in the driver. Additionally, each driver-specific devlink
documentation file should document the driver-specific ``dpipe`` headers it
implements. The headers and fields are identified by enumeration.

In order to provide further visibility, some ASIC metadata fields could be
mapped to kernel objects. For example, internal router interface indexes can
be directly mapped to the net device ifindex. FIB table indexes used by
different Virtual Routing and Forwarding (VRF) tables can be mapped to
internal routing table indexes.

Match
-----

Matches are kept primitive and close to hardware operation. Match types like
LPM are not supported, because LPM is exactly the kind of process we wish to
describe in full detail. Examples of matches:

* ``field_exact``: Exact match on a specific field.
* ``field_exact_mask``: Exact match on a specific field after masking.
* ``field_range``: Match on a specific range.
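The match types listed above can be read as simple predicates on an integer
field value. The following Python sketch is illustrative only: the function
signatures are invented for this example, and details such as treating the
range bounds as inclusive are assumptions rather than documented behavior.

.. code:: python

    def field_exact(value: int, key: int) -> bool:
        """Exact match: the field must equal the key."""
        return value == key


    def field_exact_mask(value: int, key: int, mask: int) -> bool:
        """Exact match after masking: only bits set in mask are compared."""
        return (value & mask) == (key & mask)


    def field_range(value: int, low: int, high: int) -> bool:
        """Range match; inclusive bounds are assumed here."""
        return low <= value <= high


    # 192.0.2.77 compared against 192.0.2.0 with a /24 mask.
    dst = 0xC000024D
    masked_hit = field_exact_mask(dst, 0xC0000200, 0xFFFFFF00)
    exact_hit = field_exact(dst, 0xC0000200)
    range_hit = field_range(8080, 8000, 8999)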

The IDs of the header and the field should be specified in order to
identify the specific field. Furthermore, the header index should be
specified in order to distinguish multiple headers of the same type in a
packet (tunneling).

Action
------

As with matches, the actions are kept primitive and close to hardware
operation. For example:

* ``field_modify``: Modify the field value.
* ``field_inc``: Increment the field value.
* ``push_header``: Add a header.
* ``pop_header``: Remove a header.
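To make the action primitives above concrete, the sketch below applies them
to a toy packet held as an ordered list of headers. It is a hypothetical
model: the packet layout, the helper ``find_header``, the ``index`` argument
and the choice to push and pop at the outermost position are all assumptions
made for illustration, not part of the ``devlink-dpipe`` API.

.. code:: python

    from typing import Dict, List, Tuple

    # A packet is modelled as an ordered list of (header name, fields).
    Packet = List[Tuple[str, Dict[str, int]]]


    def find_header(packet: Packet, header: str, index: int = 0):
        """Return the fields of the index-th occurrence of a header."""
        hits = [fields for name, fields in packet if name == header]
        return hits[index]


    def field_modify(packet, header, fld, value, index=0):
        """Overwrite a field with a new value."""
        find_header(packet, header, index)[fld] = value


    def field_inc(packet, header, fld, amount=1, index=0):
        """Add amount (which may be negative) to a field."""
        find_header(packet, header, index)[fld] += amount


    def push_header(packet, header, fields):
        """Add a header; the outermost position is assumed."""
        packet.insert(0, (header, dict(fields)))


    def pop_header(packet):
        """Remove and return the outermost header (assumed position)."""
        return packet.pop(0)


    pkt = [("ethernet", {"daddr": 0}), ("ipv4", {"ttl": 64})]
    field_modify(pkt, "ethernet", "daddr", 0x0202AABBCCDD)
    field_inc(pkt, "ipv4", "ttl", -1)
    push_header(pkt, "vlan", {"vid": 10})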

Entry
-----

Entries of a specific table can be dumped on demand. Each entry is
identified with an index and its properties are described by a list of
match/action values and a specific counter. By dumping the tables' content
the interactions between tables can be resolved.

Abstraction Example
===================

The following is an example of the abstraction model of the L3 part of the
Mellanox Spectrum ASIC. The blocks are described in the order they appear in
the pipeline. The table sizes in the following examples are not real
hardware sizes and are provided for demonstration purposes.

LPM
---

The LPM algorithm can be implemented as a list of hash tables. Each hash
table contains routes with the same prefix length. The root of the list is
/32, and in case of a miss the hardware will continue to the next hash
table. The depth of the search will affect the data path latency.

In case of a hit the entry contains information about the next stage of the
pipeline which resolves the MAC address. The next stage can be either the
local host table for directly connected routes, or the adjacency table for
next-hops. The ``meta.lpm_prefix`` field is used to connect two LPM tables.

.. code::

    table lpm_prefix_16 {
      size: 4096,
      counters_enabled: true,
      match: { meta.vr_id: exact,
               ipv4.dst_addr: exact_mask,
               ipv6.dst_addr: exact_mask,
               meta.lpm_prefix: exact },
      action: { meta.adj_index: set,
                meta.adj_group_size: set,
                meta.rif_port: set,
                meta.lpm_prefix: set },
    }
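The chained lookup described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: it handles IPv4 only,
it chains just the prefix lengths that actually hold routes, and the names
``build_chain`` and ``lpm_lookup`` are invented for this example rather than
taken from any driver. The second return value is the number of tables
probed, which stands in for the search depth mentioned above.

.. code:: python

    import ipaddress
    from typing import Dict, List, Optional, Tuple

    # One hash table per prefix length, keyed by (vr_id, masked dst).
    LpmTable = Dict[Tuple[int, int], str]


    def build_chain(routes) -> List[Tuple[int, LpmTable]]:
        """Group IPv4 routes by prefix length, longest prefix first."""
        by_len: Dict[int, LpmTable] = {}
        for vr_id, cidr, next_stage in routes:
            net = ipaddress.ip_network(cidr)
            key = (vr_id, int(net.network_address))
            by_len.setdefault(net.prefixlen, {})[key] = next_stage
        return sorted(by_len.items(), key=lambda item: item[0],
                      reverse=True)


    def lpm_lookup(chain, vr_id: int,
                   dst: str) -> Tuple[Optional[str], int]:
        """Walk the chain; return (next stage or None, tables probed)."""
        addr = int(ipaddress.ip_address(dst))
        for depth, (plen, table) in enumerate(chain, start=1):
            mask = (0xFFFFFFFF << (32 - plen)) & 0xFFFFFFFF
            hit = table.get((vr_id, addr & mask))
            if hit is not None:
                return hit, depth
        return None, len(chain)


    # Hypothetical routes in virtual router 0.
    chain = build_chain([
        (0, "192.0.2.1/32", "local_host"),
        (0, "192.0.2.0/24", "adjacency"),
        (0, "0.0.0.0/0", "adjacency"),
    ])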

Local Host
----------

In the case of local routes the LPM lookup already resolves the egress
router interface (RIF), yet the exact MAC address is not known. The local
host table is a hash table that uses the output interface ID combined with
the destination IP address as a key. The result is the MAC address.

.. code::

    table local_host {
      size: 4096,
      counters_enabled: true,
      match: { meta.rif_port: exact,
               ipv4.dst_addr: exact },
      action: { ethernet.daddr: set }
    }

Adjacency
---------

For remote routes, this table performs ECMP (equal-cost multi-path)
selection. The LPM lookup yields an ECMP group size and an index that serves
as a global offset into this table. Concurrently, a hash of the packet is
generated. Based on the ECMP group size and the packet's hash, a local
offset is generated. Multiple LPM entries can point to the same adjacency
group.

.. code::

    table adjacency {
      size: 4096,
      counters_enabled: true,
      match: { meta.adj_index: exact,
               meta.adj_group_size: exact,
               meta.packet_hash_index: exact },
      action: { ethernet.daddr: set,
                meta.erif: set }
    }
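The way the global and local offsets combine can be sketched as follows. The
text does not say how the local offset is derived from the hash and the
group size; reducing the hash modulo the group size is an assumption made
here for illustration, and ``adjacency_slot`` and the table contents are
hypothetical.

.. code:: python

    def adjacency_slot(adj_index: int, adj_group_size: int,
                       packet_hash: int) -> int:
        """Global offset of the group plus a per-packet local offset."""
        if adj_group_size <= 0:
            raise ValueError("group size must be positive")
        # Assumption: the local offset is the hash modulo the group size.
        local_offset = packet_hash % adj_group_size
        return adj_index + local_offset


    # Hypothetical adjacency table: slot -> (destination MAC, egress RIF).
    adjacency = {
        100: ("02:00:00:00:00:01", 7),
        101: ("02:00:00:00:00:02", 8),
        102: ("02:00:00:00:00:03", 9),
    }

    # A group of three next hops starting at slot 100.
    daddr, erif = adjacency[adjacency_slot(100, 3, packet_hash=4)]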

ERIF
----

Once the egress RIF and destination MAC have been resolved by the previous
tables, this table performs several operations, such as TTL decrement and
MTU check. The forward/drop decision is then taken and the port L3
statistics are updated based on the packet's type (broadcast, unicast,
multicast).

.. code::

    table erif {
      size: 800,
      counters_enabled: true,
      match: { meta.rif_port: exact,
               meta.is_l3_unicast: exact,
               meta.is_l3_broadcast: exact,
               meta.is_l3_multicast: exact },
      action: { meta.l3_drop: set,
                meta.l3_forward: set }
    }
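The sequence of checks in this stage can be pictured with the small Python
sketch below. It is not the hardware's actual logic: the drop conditions
(TTL reaching zero after the decrement, packet length above the RIF's MTU),
the dictionary keys and the name ``erif_process`` are assumptions chosen for
illustration.

.. code:: python

    from collections import Counter


    def erif_process(packet: dict, rif: dict, stats: Counter) -> bool:
        """Return True to forward the packet, False to drop it."""
        ttl = packet["ttl"] - 1
        # Assumed drop conditions: expired TTL or packet larger than MTU.
        if ttl <= 0 or packet["length"] > rif["mtu"]:
            stats[("drop", packet["l3_type"])] += 1
            return False
        packet["ttl"] = ttl
        stats[("forward", packet["l3_type"])] += 1
        return True


    stats = Counter()
    rif = {"mtu": 1500}
    good = {"ttl": 64, "length": 200, "l3_type": "unicast"}
    expired = {"ttl": 1, "length": 100, "l3_type": "multicast"}
    forwarded = erif_process(good, rif, stats)
    dropped = not erif_process(expired, rif, stats)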