.. SPDX-License-Identifier: GPL-2.0

====================================
Netfilter's flowtable infrastructure
====================================

This documentation describes the Netfilter flowtable infrastructure, which
allows you to define a fastpath through the flowtable datapath. This
infrastructure also provides hardware offload support. The flowtable supports
the layer 3 IPv4 and IPv6 protocols and the layer 4 TCP and UDP protocols.

Overview
--------

Once the first packet of the flow has successfully gone through the IP
forwarding path, you can offload the flow to the flowtable through your ruleset
from the second packet on. The flowtable infrastructure provides a rule action
that allows you to specify when to add a flow to the flowtable.

A packet that finds a matching entry in the flowtable (i.e. a flowtable hit) is
transmitted to the output netdevice via neigh_xmit(). Such packets therefore
bypass the classic IP forwarding path; the visible effect is that you do not see
them from any of the Netfilter hooks coming after ingress. If there is no
matching entry in the flowtable (i.e. a flowtable miss), the packet follows the
classic IP forwarding path.

The flowtable uses a resizable hashtable. Lookups are based on the following
n-tuple selectors: layer 2 protocol encapsulation (VLAN and PPPoE), layer 3
source and destination addresses, layer 4 source and destination ports, and the
input interface (useful in case there are several conntrack zones in place).

The 'flow add' action allows you to populate the flowtable: the user selectively
specifies which flows are placed into the flowtable. Hence, packets follow the
classic IP forwarding path unless the user explicitly instructs flows to use
this new alternative forwarding path via policy.

The flowtable datapath is represented in Fig.1, which describes the classic IP
forwarding path including the Netfilter hooks and the flowtable fastpath bypass.

::

                                         userspace process
                                          ^              |
                                          |              |
                                     _____|____     ____\/___
                                    /          \   /         \
                                    |   input   |  |  output  |
                                    \__________/   \_________/
                                         ^               |
                                         |               |
      _________      __________      ---------     _____\/_____
     /         \    /          \     |Routing |   /            \
  -->  ingress  ---> prerouting ---> |decision|   | postrouting |--> neigh_xmit
     \_________/    \__________/     ----------   \____________/          ^
       |      ^                          |               ^                |
   flowtable  |                     ____\/___            |                |
       |      |                    /         \           |                |
    __\/___   |                    | forward |------------                |
    |-----|   |                    \_________/                            |
    |-----|   |                 'flow offload' rule                       |
    |-----|   |                   adds entry to                           |
    |_____|   |                     flowtable                             |
       |      |                                                           |
      / \     |                                                           |
     /hit\_no_|                                                           |
     \ ? /                                                                |
      \ /                                                                 |
       |__yes_________________fastpath bypass ____________________________|

               Fig.1 Netfilter hooks and flowtable interactions

The flowtable entry also stores the NAT configuration, so all packets are
mangled according to the NAT policy that is specified from the classic IP
forwarding path. The TTL is decremented before calling neigh_xmit(). Fragmented
traffic is passed up to follow the classic IP forwarding path: since the
transport header is missing, flowtable lookups are not possible in this case.
TCP RST and FIN packets are also passed up to the classic IP forwarding path to
release the flow gracefully. Packets that exceed the MTU are also passed up to
the classic forwarding path to report packet-too-big ICMP errors to the sender.

Example configuration
---------------------

Enabling the flowtable bypass is relatively easy: you only need to create a
flowtable and add one rule to your forward chain::

        table inet x {
                flowtable f {
                        hook ingress priority 0; devices = { eth0, eth1 };
                }
                chain y {
                        type filter hook forward priority 0; policy accept;
                        ip protocol tcp flow add @f
                        counter packets 0 bytes 0
                }
        }

This example adds the flowtable 'f' to the ingress hook of the eth0 and eth1
netdevices. You can create as many flowtables as you want in case you need to
perform resource partitioning. The flowtable priority defines the order in which
hooks are run in the pipeline. This is convenient in case you already have an
nftables ingress chain: make sure the flowtable priority is smaller than that of
the nftables ingress chain, so that the flowtable runs first in the pipeline.
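
As a minimal sketch of this priority ordering, assume an nftables ingress chain
already exists on eth0. The table name 'z' and the chain name 'c' are purely
illustrative and not part of the example above. Giving the chain priority 10
places it after a flowtable registered at priority 0::

        table netdev z {
                chain c {
                        # priority 10 is larger than the flowtable priority 0,
                        # so the flowtable runs before this chain
                        type filter hook ingress device eth0 priority 10; policy accept;
                }
        }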

The 'flow add' action in the forward chain 'y' (nftables also accepts the
'flow offload' spelling) adds an entry to the flowtable for the TCP syn-ack
packet coming in the reply direction. Once the flow is offloaded, you will
observe that the counter rule in the example above does not get updated for the
packets that are being forwarded through the forwarding bypass.
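
The rule in the example configuration only matches IPv4 TCP traffic. As a
hedged variant, assuming you also want to offload UDP flows and IPv6 traffic,
the 'meta l4proto' selector matches the layer 4 protocol for both address
families in an inet table::

        table inet x {
                flowtable f {
                        hook ingress priority 0; devices = { eth0, eth1 };
                }
                chain y {
                        type filter hook forward priority 0; policy accept;
                        # offload TCP and UDP flows, for IPv4 and IPv6 alike
                        meta l4proto { tcp, udp } flow add @f
                }
        }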

You can identify offloaded flows through the [OFFLOAD] tag when listing your
connection tracking table.

::

        # conntrack -L
        tcp      6 src=10.141.10.2 dst=192.168.10.2 sport=52728 dport=5201 src=192.168.10.2 dst=192.168.10.1 sport=5201 dport=52728 [OFFLOAD] mark=0 use=2


Layer 2 encapsulation
---------------------

Since Linux kernel 5.13, the flowtable infrastructure discovers the real
netdevice behind VLAN and PPPoE netdevices. The flowtable software datapath
parses the VLAN and PPPoE layer 2 headers to extract the ethertype and the
VLAN ID / PPPoE session ID which are used for the flowtable lookups. The
flowtable datapath also deals with layer 2 decapsulation.

You do not need to add the PPPoE and the VLAN devices to your flowtable;
the real device is sufficient for the flowtable to track your flows.
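
For instance, assume a VLAN device eth0.100 stacked on top of eth0 and a PPPoE
device ppp0 running over eth1. These device names are illustrative and describe
a hypothetical setup. A sketch of the corresponding flowtable definition lists
only the real devices::

        table inet x {
                flowtable f {
                        # eth0.100 and ppp0 are deliberately not listed, only
                        # the real devices underneath them
                        hook ingress priority 0; devices = { eth0, eth1 };
                }
        }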

Bridge and IP forwarding
------------------------

Since Linux kernel 5.13, you can add bridge ports to the flowtable. The
flowtable infrastructure discovers the topology behind the bridge device. This
allows the flowtable to define a fastpath bypass between the bridge ports
(represented as eth1 and eth2 in the example figure below) and the gateway
device (represented as eth0) in your switch/router.

::

                      fastpath bypass
               .-------------------------.
              /                           \
              |           IP forwarding   |
              |          /             \ \/
              |       br0               eth0 ..... eth0
              .       / \                          *host B*
               -> eth1  eth2
                   .           *switch/router*
                   .
                   .
                 eth0
               *host A*

The flowtable infrastructure also supports bridge VLAN filtering actions
such as PVID and untagged. You can also stack a classic VLAN device on top of
your bridge port.

If you want your flowtable to define a fastpath between your bridge ports and
your IP forwarding path, you have to add your bridge ports (as represented by
the real netdevice) to your flowtable definition.
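
Using the device names from the figure above, a sketch of such a definition
lists the gateway device eth0 together with the bridge ports eth1 and eth2.
This sketch assumes that the bridge device br0 itself does not need to be
listed, since the bridge ports are the real netdevices::

        table inet x {
                flowtable f {
                        # eth0 is the gateway device, eth1 and eth2 are the
                        # bridge ports of br0
                        hook ingress priority 0; devices = { eth0, eth1, eth2 };
                }
        }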

Counters
--------

The flowtable can synchronize packet and byte counters with the existing
connection tracking entry by specifying the counter statement in your flowtable
definition, e.g.

::

        table inet x {
                flowtable f {
                        hook ingress priority 0; devices = { eth0, eth1 };
                        counter
                }
        }

Counter support is available since Linux kernel 5.7.

Hardware offload
----------------

If your network device provides hardware offload support, you can turn it on by
means of the 'offload' flag in your flowtable definition, e.g.

::

        table inet x {
                flowtable f {
                        hook ingress priority 0; devices = { eth0, eth1 };
                        flags offload;
                }
        }

There is a workqueue that adds the flows to the hardware. Note that a few
packets might still run over the flowtable software path until the workqueue has
a chance to offload the flow to the network device.

You can identify hardware offloaded flows through the [HW_OFFLOAD] tag when
listing your connection tracking table. Note the distinction between the two
tags: [OFFLOAD] refers to the software flowtable fastpath, whereas [HW_OFFLOAD]
refers to the hardware offload datapath being used by the flow.

The flowtable hardware offload infrastructure also supports DSA (Distributed
Switch Architecture).

Limitations
-----------

The flowtable behaves like a cache. The flowtable entries might get stale if
either the destination MAC address or the egress netdevice that is used for
transmission changes.

This might be a problem if:

- You run the flowtable in software mode and you combine bridge and IP
  forwarding in your setup.
- Hardware offload is enabled.

More reading
------------

This documentation is based on the LWN.net articles [1]_\ [2]_. Rafal Milecki
also wrote a comprehensive summary called "A state of network acceleration"
that describes how things were before this infrastructure was mainlined [3]_,
and he also gives a rough summary of this work [4]_.

.. [1] https://lwn.net/Articles/738214/
.. [2] https://lwn.net/Articles/742164/
.. [3] http://lists.infradead.org/pipermail/lede-dev/2018-January/010830.html
.. [4] http://lists.infradead.org/pipermail/lede-dev/2018-January/010829.html