.. SPDX-License-Identifier: GPL-2.0

=============
Devlink DPIPE
=============

Background
==========

While performing the hardware offloading process, many of the hardware
specifics cannot be presented. These details are useful for debugging, and
``devlink-dpipe`` offers a standardized way to provide visibility into the
offloading process.

For example, the routing longest prefix match (LPM) algorithm used by the
Linux kernel may differ from the hardware implementation. The pipeline debug
API (DPIPE) is aimed at providing the user visibility into the ASIC's
pipeline in a generic way.

The hardware offload process is expected to be done in a way that the user
should not be able to distinguish between the hardware and software
implementations. In this process, hardware specifics are neglected. In
reality, those details can carry a lot of meaning and should be exposed in
some standard way.

This problem is made even more complex when one wishes to offload the
control path of the whole networking stack to a switch ASIC. Due to
differences in the hardware and software models, some processes cannot be
represented correctly.

One example is the kernel's LPM algorithm, which in many cases differs
greatly from the hardware implementation. The configuration API is the same,
but one cannot rely on the Forwarding Information Base (FIB) in hardware to
look like the kernel's Level Path Compression trie (LPC-trie).

In many situations, trying to analyze system failures based solely on the
kernel's dump may not be enough. By combining this data with complementary
information about the underlying hardware, debugging can be made easier;
additionally, the information can be useful when debugging performance
issues.

Overview
========

The ``devlink-dpipe`` interface closes this gap. The hardware's pipeline is
modeled as a graph of match/action tables. Each table represents a specific
hardware block. This model is not new, having first been used by the P4
language.

Traditionally it has been used as an alternative model for hardware
configuration, but the ``devlink-dpipe`` interface uses it for visibility
purposes as a standard complementary tool. The system's view from
``devlink-dpipe`` should change according to the changes made by the
standard configuration tools.

For example, it's quite common to implement Access Control Lists (ACL)
using Ternary Content Addressable Memory (TCAM). The TCAM can be divided
into TCAM regions. Complex TC filters can have multiple rules with
different priorities and different lookup keys. Hardware TCAM regions, on
the other hand, have a predefined lookup key. Offloading the TC filter rules
using the TCAM engine can result in multiple TCAM regions being
interconnected in a chain (which may affect the data path latency). In
response to a new TC filter, new tables should be created describing those
regions.

Model
=====

The ``DPIPE`` model introduces several objects:

  * headers
  * tables
  * entries

A ``header`` describes packet formats and provides names for fields within
the packet. A ``table`` describes hardware blocks. An ``entry`` describes
the actual content of a specific table.
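
The relationship between the three objects can be pictured with a few plain
data classes. This is only an illustrative sketch and not the kernel's data
layout: the class names, attribute names and the sample entry values are
assumptions made for the example, while the ``local_host`` table description
is borrowed from the abstraction example later in this document.

.. code:: python

    from dataclasses import dataclass, field


    @dataclass(frozen=True)
    class Header:
        """Names the fields of one packet format (hypothetical layout)."""
        name: str
        fields: tuple  # field names, e.g. ("dst_addr",)


    @dataclass
    class Entry:
        """One row of a table: match values, action values and a counter."""
        index: int
        match_values: dict
        action_values: dict
        counter: int = 0


    @dataclass
    class Table:
        """One hardware block, described by its supported matches/actions."""
        name: str
        size: int
        matches: dict  # "header.field" -> match type
        actions: dict  # "header.field" -> action type
        counters_enabled: bool = False
        entries: list = field(default_factory=list)


    ipv4 = Header("ipv4", ("dst_addr",))

    local_host = Table(
        name="local_host",
        size=4096,
        matches={"meta.rif_port": "exact", "ipv4.dst_addr": "exact"},
        actions={"ethernet.daddr": "set"},
        counters_enabled=True,
    )

    # An entry supplies concrete values for the fields the table declares.
    local_host.entries.append(
        Entry(
            index=0,
            match_values={"meta.rif_port": 3, "ipv4.dst_addr": "192.0.2.10"},
            action_values={"ethernet.daddr": "00:00:5e:00:53:01"},
        )
    )

The table carries only the description (which fields are matched and how),
whereas each entry carries the values, which mirrors the split between
describing a table and dumping its entries.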

The hardware pipeline is not port specific, but rather describes the whole
ASIC. Thus it is tied to the top of the ``devlink`` infrastructure.

Drivers can register and unregister tables at run time, in order to support
dynamic behavior. This dynamic behavior is mandatory for describing hardware
blocks like TCAM regions which can be allocated and freed dynamically.

``devlink-dpipe`` generally is not intended for configuration. The exception
is hardware counting for a specific table.

The following commands are used to obtain the ``dpipe`` objects from
userspace:

  * ``table_get``: Receive a table's description.
  * ``headers_get``: Receive a device's supported headers.
  * ``entries_get``: Receive a table's current entries.
  * ``counters_set``: Enable or disable counters on a table.

Table
-----

The driver should implement the following operations for each table:

  * ``matches_dump``: Dump the supported matches.
  * ``actions_dump``: Dump the supported actions.
  * ``entries_dump``: Dump the actual content of the table.
  * ``counters_set_update``: Synchronize hardware with counters enabled or
    disabled.

Header/Field
------------

Similarly to P4, headers and fields are used to describe a table's
behavior. There is a slight difference between the standard protocol headers
and specific ASIC metadata. The protocol headers should be declared in the
``devlink`` core API. On the other hand, ASIC metadata is driver specific
and should be defined in the driver. Additionally, each driver-specific
devlink documentation file should document the driver-specific ``dpipe``
headers it implements. The headers and fields are identified by enumeration.

In order to provide further visibility, some ASIC metadata fields could be
mapped to kernel objects. For example, internal router interface indexes can
be directly mapped to the net device ifindex. FIB table indexes used by
different Virtual Routing and Forwarding (VRF) tables can be mapped to
internal routing table indexes.

Match
-----

Matches are kept primitive and close to hardware operation. Match types like
LPM are not supported because LPM is exactly the kind of process we wish to
describe in full detail. Examples of matches:

  * ``field_exact``: Exact match on a specific field.
  * ``field_exact_mask``: Exact match on a specific field after masking.
  * ``field_range``: Match on a specific range.

The IDs of the header and the field should be specified in order to
identify the specific field. Furthermore, the header index should be
specified in order to distinguish multiple headers of the same type in a
packet (tunneling).
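
The semantics of the three match primitives can be sketched as small
predicates. This is a hedged illustration only: the function signatures, the
inclusive range bounds and the ``(header, header index, field)`` key used to
address a field are assumptions for the example, not the ``devlink`` API.

.. code:: python

    def field_exact(value, key):
        """Hit when the field equals the key."""
        return value == key


    def field_exact_mask(value, key, mask):
        """Hit when the field equals the key after both are masked."""
        return (value & mask) == (key & mask)


    def field_range(value, low, high):
        """Hit when the field lies in [low, high] (bounds assumed inclusive)."""
        return low <= value <= high


    # A tunneled packet carries two IPv4 headers. The header index tells
    # them apart: 0 is taken here to be the outer header, 1 the inner one.
    packet = {
        ("ipv4", 0, "dst_addr"): 0xC6336401,  # outer: 198.51.100.1
        ("ipv4", 1, "dst_addr"): 0xC000024D,  # inner: 192.0.2.77
    }

    subnet = 0xC0000200  # 192.0.2.0
    mask24 = 0xFFFFFF00  # /24

    outer_hit = field_exact_mask(packet[("ipv4", 0, "dst_addr")], subnet, mask24)
    inner_hit = field_exact_mask(packet[("ipv4", 1, "dst_addr")], subnet, mask24)

With the same header and field IDs, only the header index decides whether the
outer or the inner destination address is compared against the key.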

Action
------

Similarly to matches, the actions are kept primitive and close to hardware
operation. For example:

  * ``field_modify``: Modify the field value.
  * ``field_inc``: Increment the field value.
  * ``push_header``: Add a header.
  * ``pop_header``: Remove a header.
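
As a rough sketch, the four example actions can be modeled as operations on a
header stack. The representation is an assumption made for illustration: a
packet is a list of ``(header name, fields)`` pairs with the outermost header
first, headers are pushed and popped at the outermost position, and the
``meta.hits`` field is hypothetical.

.. code:: python

    def field_modify(packet, position, field_name, value):
        """Overwrite one field of the header at the given stack position."""
        packet[position][1][field_name] = value


    def field_inc(packet, position, field_name, amount=1):
        """Increment one field of the header at the given stack position."""
        packet[position][1][field_name] += amount


    def push_header(packet, name, fields):
        """Add a new outermost header."""
        packet.insert(0, (name, dict(fields)))


    def pop_header(packet):
        """Remove and return the outermost header."""
        return packet.pop(0)


    packet = [
        ("ethernet", {"daddr": "00:00:5e:00:53:01"}),
        ("meta", {"hits": 0}),
    ]

    field_modify(packet, 0, "daddr", "00:00:5e:00:53:02")
    field_inc(packet, 1, "hits")
    push_header(packet, "vlan", {"vid": 10})
    popped = pop_header(packet)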

Entry
-----

Entries of a specific table can be dumped on demand. Each entry is
identified with an index and its properties are described by a list of
match/action values and a specific counter. By dumping the tables' content,
the interactions between tables can be resolved.

Abstraction Example
===================

The following is an example of the abstraction model of the L3 part of the
Mellanox Spectrum ASIC. The blocks are described in the order they appear in
the pipeline. The table sizes in the following examples are not real
hardware sizes and are provided for demonstration purposes.

LPM
---

The LPM algorithm can be implemented as a list of hash tables. Each hash
table contains routes with the same prefix length. The root of the list is
/32, and in case of a miss the hardware will continue to the next hash
table. The depth of the search will affect the data path latency.

In case of a hit, the entry contains information about the next stage of the
pipeline, which resolves the MAC address. The next stage can be either the
local host table for directly connected routes, or the adjacency table for
next-hops. The ``meta.lpm_prefix`` field is used to connect two LPM tables.

.. code::

    table lpm_prefix_16 {
      size: 4096,
      counters_enabled: true,
      match: { meta.vr_id: exact,
               ipv4.dst_addr: exact_mask,
               ipv6.dst_addr: exact_mask,
               meta.lpm_prefix: exact },
      action: { meta.adj_index: set,
                meta.adj_group_size: set,
                meta.rif_port: set,
                meta.lpm_prefix: set },
    }
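
The list-of-hash-tables lookup described in this section can be sketched as
follows. This is a minimal software model under stated assumptions, not the
ASIC's implementation: the ``HashLpm`` class is hypothetical, it handles IPv4
only, it probes only prefix lengths that currently hold routes, and it
returns the number of probes as a stand-in for search depth.

.. code:: python

    import ipaddress


    class HashLpm:
        """LPM as one exact-match hash table per prefix length."""

        def __init__(self):
            self.tables = {}  # prefix length -> {network address: result}

        def add(self, cidr, result):
            net = ipaddress.ip_network(cidr)
            table = self.tables.setdefault(net.prefixlen, {})
            table[int(net.network_address)] = result

        def lookup(self, addr):
            """Return (result, probes); result is None on a complete miss."""
            ip = int(ipaddress.ip_address(addr))
            probes = 0
            # Start at the longest prefix and continue to the next table
            # on every miss.
            for plen in sorted(self.tables, reverse=True):
                probes += 1
                mask = (0xFFFFFFFF << (32 - plen)) & 0xFFFFFFFF
                hit = self.tables[plen].get(ip & mask)
                if hit is not None:
                    return hit, probes
            return None, probes


    lpm = HashLpm()
    lpm.add("10.0.0.0/8", "adjacency")
    lpm.add("10.1.0.0/16", "adjacency")
    lpm.add("10.1.2.3/32", "local_host")

    host_route = lpm.lookup("10.1.2.3")
    short_route = lpm.lookup("10.9.9.9")

Each table is an exact match on the masked address, which is why the table
description above uses ``exact_mask`` instead of a dedicated LPM match type.
A destination covered only by a short prefix needs more probes than a host
route, which illustrates how search depth relates to data path latency.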

Local Host
----------

In the case of local routes, the LPM lookup already resolves the egress
router interface (RIF), yet the exact MAC address is not known. The local
host table is a hash table combining the output interface ID with the
destination IP address as a key. The result is the MAC address.

.. code::

    table local_host {
      size: 4096,
      counters_enabled: true,
      match: { meta.rif_port: exact,
               ipv4.dst_addr: exact },
      action: { ethernet.daddr: set }
    }

Adjacency
---------

For remote routes, this table performs equal-cost multi-path (ECMP)
selection. The LPM lookup yields an ECMP group size and an index that serves
as a global offset into this table. Concurrently, a hash of the packet is
generated. Based on the ECMP group size and the packet's hash, a local offset
is generated. Multiple LPM entries can point to the same adjacency group.

.. code::

    table adjacency {
      size: 4096,
      counters_enabled: true,
      match: { meta.adj_index: exact,
               meta.adj_group_size: exact,
               meta.packet_hash_index: exact },
      action: { ethernet.daddr: set,
                meta.erif: set }
    }
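
The combination of global and local offsets can be sketched as below. The
text does not say how the local offset is derived from the hash, so reducing
the hash modulo the group size is an assumption, as are the function names
and the use of CRC32 over a flow tuple as the packet hash.

.. code:: python

    import zlib


    def flow_hash(src, dst, sport, dport):
        """Hypothetical packet hash: CRC32 over a textual flow tuple."""
        return zlib.crc32(f"{src}|{dst}|{sport}|{dport}".encode())


    def adjacency_slot(adj_index, adj_group_size, packet_hash):
        """Global offset from the LPM result plus a hash-derived local offset."""
        local_offset = packet_hash % adj_group_size  # assumed reduction
        return adj_index + local_offset


    # An ECMP group of four next-hops starting at global offset 100.
    group_base, group_size = 100, 4

    first = adjacency_slot(
        group_base, group_size, flow_hash("192.0.2.1", "198.51.100.7", 1234, 80)
    )
    again = adjacency_slot(
        group_base, group_size, flow_hash("192.0.2.1", "198.51.100.7", 1234, 80)
    )

Because the hash depends only on the flow, packets of the same flow always
land on the same slot of the group, while different flows can spread across
the group's slots.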

ERIF
----

Once the egress RIF and destination MAC have been resolved by the previous
tables, this table performs several operations such as TTL decrement and MTU
check. Then the forward/drop decision is taken and the port L3 statistics
are updated based on the packet's type (broadcast, unicast, multicast).

.. code::

    table erif {
      size: 800,
      counters_enabled: true,
      match: { meta.rif_port: exact,
               meta.is_l3_unicast: exact,
               meta.is_l3_broadcast: exact,
               meta.is_l3_multicast: exact },
      action: { meta.l3_drop: set,
                meta.l3_forward: set }
    }
0243     table erif {
0244       size: 800,
0245       counters_enabled: true,
0246       match: { meta.rif_port: exact,
0247                meta.is_l3_unicast: exact,
0248                meta.is_l3_broadcast: exact,
0249                meta.is_l3_multicast, exact },
0250       action: { meta.l3_drop: set,
0251                 meta.l3_forward: set }
0252     }