.. SPDX-License-Identifier: GPL-2.0

=============================================
Open vSwitch datapath developer documentation
=============================================

The Open vSwitch kernel module allows flexible userspace control over
flow-level packet processing on selected network devices.  It can be
used to implement a plain Ethernet switch, network device bonding,
VLAN processing, network access control, flow-based network control,
and so on.

The kernel module implements multiple "datapaths" (analogous to
bridges), each of which can have multiple "vports" (analogous to ports
within a bridge).  Each datapath also has associated with it a "flow
table" that userspace populates with "flows" that map from keys based
on packet headers and metadata to sets of actions.  The most common
action forwards the packet to another vport; other actions are also
implemented.

When a packet arrives on a vport, the kernel module processes it by
extracting its flow key and looking it up in the flow table.  If there
is a matching flow, it executes the associated actions.  If there is
no match, it queues the packet to userspace for processing (as part of
its processing, userspace will likely set up a flow to handle further
packets of the same type entirely in-kernel).

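The receive path described above can be modelled in a few lines of Python.
This is only an illustrative sketch, not the kernel implementation: the names
``ToyDatapath`` and ``toy_extract_key`` are hypothetical, and the toy parser
looks at nothing beyond the ingress port and the Ethernet type.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical flow key model: an ordered tuple of (attribute, value) pairs.
FlowKey = Tuple[Tuple[str, int], ...]
Action = Callable[[bytes], None]


def toy_extract_key(in_port: int, packet: bytes) -> FlowKey:
    """Toy parser: ingress port (metadata) plus the Ethernet type only."""
    eth_type = int.from_bytes(packet[12:14], "big")
    return (("in_port", in_port), ("eth_type", eth_type))


class ToyDatapath:
    """One datapath: a flow table and a queue of packets sent to userspace."""

    def __init__(self) -> None:
        self.flow_table: Dict[FlowKey, List[Action]] = {}
        self.upcalls: List[Tuple[FlowKey, bytes]] = []

    def receive(self, in_port: int, packet: bytes) -> str:
        key = toy_extract_key(in_port, packet)
        actions = self.flow_table.get(key)
        if actions is None:
            # Miss: queue the packet, together with the parsed key, to userspace.
            self.upcalls.append((key, packet))
            return "miss"
        # Hit: run every action associated with the matching flow.
        for action in actions:
            action(packet)
        return "hit"
```

In this model, the first packet of a flow is reported as a miss; once
userspace stores an action list under the queued key, later packets with the
same key are handled without another upcall.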

Flow key compatibility
----------------------

Network protocols evolve over time.  New protocols become important
and existing protocols lose their prominence.  For the Open vSwitch
kernel module to remain relevant, it must be possible for newer
versions to parse additional protocols as part of the flow key.  It
might even be desirable, someday, to drop support for parsing
protocols that have become obsolete.  Therefore, the Netlink interface
to Open vSwitch is designed to allow carefully written userspace
applications to work with any version of the flow key, past or future.

To support this forward and backward compatibility, whenever the
kernel module passes a packet to userspace, it also passes along the
flow key that it parsed from the packet.  Userspace then extracts its
own notion of a flow key from the packet and compares it against the
kernel-provided version:

    - If userspace's notion of the flow key for the packet matches the
      kernel's, then nothing special is necessary.

    - If the kernel's flow key includes more fields than the userspace
      version of the flow key, for example if the kernel decoded IPv6
      headers but userspace stopped at the Ethernet type (because it
      does not understand IPv6), then again nothing special is
      necessary.  Userspace can still set up a flow in the usual way,
      as long as it uses the kernel-provided flow key to do it.

    - If the userspace flow key includes more fields than the
      kernel's, for example if userspace decoded an IPv6 header but
      the kernel stopped at the Ethernet type, then userspace can
      forward the packet manually, without setting up a flow in the
      kernel.  This case is bad for performance because every packet
      that the kernel considers part of the flow must go to userspace,
      but the forwarding behavior is correct.  (If userspace can
      determine that the values of the extra fields would not affect
      forwarding behavior, then it could set up a flow anyway.)

How flow keys evolve over time is important to making this work, so
the following sections go into detail.

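The three compatibility cases listed earlier in this section can be
sketched as a small decision function.  This is a simplified model under
stated assumptions: flow keys are flat dictionaries, "includes more fields"
is judged by attribute names alone, and ``plan_for_miss`` and its return
strings are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, Set


def plan_for_miss(
    kernel_key: Dict[str, object],
    user_key: Dict[str, object],
    extras_affect_forwarding: Callable[[Set[str]], bool] = lambda extras: True,
) -> str:
    """Decide how userspace handles a packet the kernel could not match."""
    extras = set(user_key) - set(kernel_key)
    if not extras:
        # Keys agree, or the kernel parsed further than userspace did:
        # install a flow using the kernel-provided key.
        return "install_flow_with_kernel_key"
    if not extras_affect_forwarding(extras):
        # Userspace parsed further, but the extra fields do not matter.
        return "install_flow_with_kernel_key"
    # Userspace parsed further and the extra fields matter: every packet
    # of this kernel-level flow has to be forwarded by userspace.
    return "forward_in_userspace"
```

The default for ``extras_affect_forwarding`` is deliberately conservative:
unless the caller can show that the extra fields are irrelevant, the sketch
picks the slow but correct path.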

Flow key format
---------------

A flow key is passed over a Netlink socket as a sequence of Netlink
attributes.  Some attributes represent packet metadata, defined as any
information about a packet that cannot be extracted from the packet
itself, e.g. the vport on which the packet was received.  Most
attributes, however, are extracted from headers within the packet,
e.g. source and destination addresses from Ethernet, IP, or TCP
headers.

The <linux/openvswitch.h> header file defines the exact format of the
flow key attributes.  For informal explanatory purposes here, we write
them as comma-separated strings, with parentheses indicating arguments
and nesting.  For example, the following could represent a flow key
corresponding to a TCP packet that arrived on vport 1::

    in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4),
    eth_type(0x0800), ipv4(src=172.16.0.20, dst=172.18.0.52, proto=6, tos=0,
    frag=no), tcp(src=49163, dst=80)

Often we ellipsize arguments not important to the discussion, e.g.::

    in_port(1), eth(...), eth_type(0x0800), ipv4(...), tcp(...)

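To make "a sequence of Netlink attributes" more concrete, the sketch below
packs and unpacks attributes using the generic Netlink attribute layout: a
16-bit length, a 16-bit type, then the payload padded to a 4-byte boundary.
The attribute type numbers are placeholders invented for this example; the
real values and payload formats are those defined in <linux/openvswitch.h>.
Flag bits in the type field are ignored here.

```python
import struct

NLA_HDRLEN = 4    # 16-bit nla_len followed by 16-bit nla_type
NLA_ALIGNTO = 4   # attributes start on 4-byte boundaries

# Placeholder type numbers for illustration only (NOT the real header values).
KEY_IN_PORT, KEY_ETHERTYPE, KEY_ENCAP = 1, 2, 3


def nla_pack(attr_type: int, payload: bytes) -> bytes:
    """Encode one attribute; nla_len counts header and payload, not padding."""
    length = NLA_HDRLEN + len(payload)
    padding = (-length) % NLA_ALIGNTO
    return struct.pack("=HH", length, attr_type) + payload + b"\x00" * padding


def nla_unpack(buf: bytes):
    """Decode a buffer into a list of (type, payload) pairs."""
    attrs = []
    offset = 0
    while offset + NLA_HDRLEN <= len(buf):
        length, attr_type = struct.unpack_from("=HH", buf, offset)
        if length < NLA_HDRLEN or offset + length > len(buf):
            raise ValueError("malformed or truncated attribute")
        attrs.append((attr_type, buf[offset + NLA_HDRLEN:offset + length]))
        # Step over the attribute and its alignment padding.
        offset += (length + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1)
    return attrs
```

Nesting, as written with parentheses in the informal notation, corresponds to
an attribute whose payload is itself a packed attribute sequence, so
``nla_unpack`` can be applied again to the payload of such an attribute.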

Wildcarded flow key format
--------------------------

A wildcarded flow is described with two sequences of Netlink attributes
passed over the Netlink socket: a flow key, exactly as described above, and an
optional corresponding flow mask.

A wildcarded flow can represent a group of exact match flows. Each '1' bit
in the mask specifies an exact match with the corresponding bit in the flow key.
A '0' bit specifies a don't care bit, which will match either a '1' or '0' bit
of an incoming packet. Using wildcarded flows can improve the flow setup rate
by reducing the number of new flows that the userspace program needs to process.

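The mask semantics can be illustrated with plain integer arithmetic.  This
is a minimal sketch that treats one field (an IPv4 destination address) as a
32-bit integer; the helper names are hypothetical and the real key and mask
are Netlink attribute sequences, not single integers.

```python
import ipaddress


def ip_bits(addr: str) -> int:
    """Convert dotted-quad notation to a 32-bit integer."""
    return int(ipaddress.IPv4Address(addr))


def masked_match(packet_bits: int, key_bits: int, mask_bits: int) -> bool:
    """'1' mask bits must match the key exactly; '0' bits are don't care."""
    return (packet_bits & mask_bits) == (key_bits & mask_bits)


def exact_flows_covered(mask_bits: int, width: int) -> int:
    """Number of exact match flows that one wildcarded flow stands for."""
    dont_care_bits = width - bin(mask_bits).count("1")
    return 1 << dont_care_bits
```

With a key of 172.18.0.52 and a mask of 255.255.0.0, any destination in
172.18.0.0/16 matches, so a single wildcarded flow stands in for 65536 exact
match flows on that field.
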
Support for the mask Netlink attribute is optional for both the kernel and the
userspace program. The kernel can ignore the mask attribute, installing an exact
match flow, or reduce the number of don't care bits in the kernel to less than
what was specified by the userspace program. In this case, variations in bits
that the kernel does not implement will simply result in additional flow setups.
The kernel module will also work with userspace programs that neither support
nor supply flow mask attributes.

Since the kernel may ignore or modify wildcard bits, it can be difficult for
the userspace program to know exactly what matches are installed. There are
two possible approaches: reactively install flows as they miss the kernel
flow table (and therefore not attempt to determine wildcard changes at all)
or use the kernel's response messages to determine the installed wildcards.

When interacting with userspace, the kernel should maintain the match portion
of the key exactly as originally installed. This provides a handle to
identify the flow for all future operations. However, when reporting the
mask of an installed flow, the mask should include any restrictions imposed
by the kernel.

The behavior when using overlapping wildcarded flows is undefined. It is the
responsibility of the userspace program to ensure that any incoming packet
can match at most one flow, wildcarded or not. The current implementation
performs best-effort detection of overlapping wildcarded flows and may reject
some but not all of them. However, this behavior may change in future versions.

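Because avoiding overlap is left to userspace, a userspace program may want
its own check before installing a flow.  The sketch below shows one way to
test two masked flows for overlap; it is not the kernel's best-effort
detection, and it again models a single 32-bit field with hypothetical helper
names.

```python
import ipaddress


def ip_bits(addr: str) -> int:
    """Convert dotted-quad notation to a 32-bit integer."""
    return int(ipaddress.IPv4Address(addr))


def flows_overlap(key_a: int, mask_a: int, key_b: int, mask_b: int) -> bool:
    """True if some packet could match both masked flows.

    Only bits that both flows care about can rule out a common packet, so
    the flows overlap exactly when their keys agree on those bits.
    """
    cared_by_both = mask_a & mask_b
    return ((key_a ^ key_b) & cared_by_both) == 0
```

For example, 172.18.0.0 with mask 255.255.0.0 overlaps 172.18.5.0 with mask
255.255.255.0, because every packet matching the second flow also matches the
first.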

Unique flow identifiers
-----------------------

An alternative to using the original match portion of a key as the handle for
flow identification is a unique flow identifier, or "UFID". UFIDs are optional
for both the kernel and the userspace program.

Userspace programs that support UFID are expected to provide it during flow
setup in addition to the flow, then refer to the flow using the UFID for all
future operations. The kernel is not required to index flows by the original
flow key if a UFID is specified.

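The choice of handle can be pictured with a toy index.  This is only a model
of the bookkeeping described above, under the assumption that a UFID is an
opaque byte string; ``ToyFlowIndex`` and the action strings are hypothetical.

```python
from typing import Dict, Hashable, Optional, Tuple

Handle = Tuple[str, Hashable]


class ToyFlowIndex:
    """Stores each flow under exactly one handle: its UFID or its key."""

    def __init__(self) -> None:
        self._flows: Dict[Handle, list] = {}

    def install(self, key: Hashable, actions: list,
                ufid: Optional[bytes] = None) -> Handle:
        # With a UFID, the flow is indexed by the UFID only; the original
        # key is not kept as a second index in this model.
        handle: Handle = ("ufid", ufid) if ufid is not None else ("key", key)
        self._flows[handle] = actions
        return handle

    def lookup(self, handle: Handle) -> Optional[list]:
        return self._flows.get(handle)
```

In this model a flow installed with a UFID cannot be found again through its
original key, which is why userspace is expected to keep using the UFID for
all later operations on that flow.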

Basic rule for evolving flow keys
---------------------------------

Some care is needed to really maintain forward and backward
compatibility for applications that follow the rules listed under
"Flow key compatibility" above.

The basic rule is obvious::

    ==================================================================
    New network protocol support must only supplement existing flow
    key attributes.  It must not change the meaning of already defined
    flow key attributes.
    ==================================================================

This rule does have less-obvious consequences so it is worth working
through a few examples.  Suppose, for example, that the kernel module
did not already implement VLAN parsing.  Instead, it just interpreted
the 802.1Q TPID (0x8100) as the Ethertype then stopped parsing the
packet.  The flow key for any packet with an 802.1Q header would look
essentially like this, ignoring metadata::

    eth(...), eth_type(0x8100)

Naively, to add VLAN support, it makes sense to add a new "vlan" flow
key attribute to contain the VLAN tag, then continue to decode the
encapsulated headers beyond the VLAN tag using the existing field
definitions.  With this change, a TCP packet in VLAN 10 would have a
flow key much like this::

    eth(...), vlan(vid=10, pcp=0), eth_type(0x0800), ip(proto=6, ...), tcp(...)

But this change would negatively affect a userspace application that
has not been updated to understand the new "vlan" flow key attribute.
The application could, following the flow compatibility rules above,
ignore the "vlan" attribute that it does not understand and therefore
assume that the flow contained IP packets.  This is a bad assumption
(the flow only contains IP packets if one parses and skips over the
802.1Q header) and it could cause the application's behavior to change
across kernel versions even though it follows the compatibility rules.

The solution is to use a set of nested attributes.  This is, for
example, why 802.1Q support uses nested attributes.  A TCP packet in
VLAN 10 is actually expressed as::

    eth(...), eth_type(0x8100), vlan(vid=10, pcp=0), encap(eth_type(0x0800),
    ip(proto=6, ...), tcp(...))

Notice how the "eth_type", "ip", and "tcp" flow key attributes are
nested inside the "encap" attribute.  Thus, an application that does
not understand the "vlan" key will not see any of those attributes
and therefore will not misinterpret them.  (Also, the outer eth_type
is still 0x8100, not changed to 0x0800.)

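The difference between the naive and the nested encoding can be shown from
the point of view of an application that predates VLAN support.  This is a
sketch under simplifying assumptions: flow keys are lists of (name, value)
pairs, and the legacy application silently skips every top-level attribute
it does not know.  All names are hypothetical.

```python
# Attributes an application written before VLAN support would understand.
KNOWN_TO_LEGACY_APP = {"eth", "eth_type", "ip", "tcp"}


def legacy_view(flow_key):
    """Top-level attributes the old application sees; the rest are ignored."""
    return {name: value for name, value in flow_key
            if name in KNOWN_TO_LEGACY_APP}


def legacy_app_sees_ip(flow_key) -> bool:
    """Would the old application conclude that this flow carries IP?"""
    return legacy_view(flow_key).get("eth_type") == 0x0800


# Naive encoding: "vlan" is added and parsing simply continues.
naive = [("eth", "..."), ("vlan", {"vid": 10, "pcp": 0}),
         ("eth_type", 0x0800), ("ip", {"proto": 6}), ("tcp", "...")]

# Nested encoding: the inner headers are hidden inside "encap".
nested = [("eth", "..."), ("eth_type", 0x8100),
          ("vlan", {"vid": 10, "pcp": 0}),
          ("encap", [("eth_type", 0x0800), ("ip", {"proto": 6}),
                     ("tcp", "...")])]
```

With the naive encoding, ``legacy_app_sees_ip(naive)`` is true, which is the
bad assumption described above.  With the nested encoding the old application
sees only ``eth`` and an Ethernet type of 0x8100, exactly as it did before
VLAN parsing existed.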

Handling malformed packets
--------------------------

Don't drop packets in the kernel for malformed protocol headers, bad
checksums, etc.  This would prevent userspace from implementing a
simple Ethernet switch that forwards every packet.

Instead, in such a case, include an attribute with "empty" content.
It doesn't matter if the empty content could be valid protocol values,
as long as those values are rarely seen in practice, because userspace
can always arrange for all packets with those values to be sent to
userspace and handle them individually.

For example, consider a packet that contains an IP header that
indicates protocol 6 for TCP, but which is truncated just after the IP
header, so that the TCP header is missing.  The flow key for this
packet would include a tcp attribute with all-zero src and dst, like
this::

    eth(...), eth_type(0x0800), ip(proto=6, ...), tcp(src=0, dst=0)

As another example, consider a packet with an Ethernet type of 0x8100,
indicating that a VLAN TCI should follow, but which is truncated just
after the Ethernet type.  The flow key for this packet would include
an all-zero-bits vlan and an empty encap attribute, like this::

    eth(...), eth_type(0x8100), vlan(0), encap()

Unlike a TCP packet with source and destination ports 0, an
all-zero-bits VLAN TCI is not that rare, so the CFI bit (aka
VLAN_TAG_PRESENT inside the kernel) is ordinarily set in a vlan
attribute expressly to allow this situation to be distinguished.
Thus, the flow key in this second example unambiguously indicates a
missing or malformed VLAN TCI.

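The role of the CFI bit can be sketched as follows.  This is an illustrative
model rather than the kernel's parser: the function names are hypothetical,
and the only facts relied on are that the TCI is the 16-bit field following
the 0x8100 Ethernet type and that the CFI bit is bit 12 (0x1000) of the TCI.

```python
ETH_P_8021Q = 0x8100
VLAN_CFI = 0x1000  # bit 12 of the 16-bit TCI


def vlan_attr_value(frame: bytes) -> int:
    """Value of the vlan attribute for a frame whose Ethernet type is 0x8100."""
    if int.from_bytes(frame[12:14], "big") != ETH_P_8021Q:
        raise ValueError("not an 802.1Q frame")
    if len(frame) < 16:
        # Truncated right after the Ethernet type: report "empty" content.
        return 0
    tci = int.from_bytes(frame[14:16], "big")
    # A TCI that was really present is reported with the CFI bit set.
    return tci | VLAN_CFI


def tci_was_missing(vlan_value: int) -> bool:
    """An attribute without the CFI bit marks a missing or malformed TCI."""
    return (vlan_value & VLAN_CFI) == 0
```

A frame that really carries an all-zero TCI therefore yields 0x1000, while a
frame truncated before the TCI yields 0, so the two cases stay distinct.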

Other rules
-----------

The other rules for flow keys are much less subtle:

    - Duplicate attributes are not allowed at a given nesting level.

    - Ordering of attributes is not significant.

    - When the kernel sends a given flow key to userspace, it always
      composes it the same way.  This allows userspace to hash and
      compare entire flow keys that it may not be able to fully
      interpret.
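
The first two rules can be expressed as small checks on a decoded flow key.
This is a sketch under the assumption that a key is a list of (name, value)
pairs, that nested attributes are represented as lists, and that all other
values are hashable; the function names are hypothetical.

```python
def check_no_duplicates(attrs) -> None:
    """Raise ValueError if a name repeats within one nesting level."""
    seen = set()
    for name, value in attrs:
        if name in seen:
            raise ValueError(f"duplicate attribute {name!r}")
        seen.add(name)
        if isinstance(value, list):
            # Each nesting level is checked on its own.
            check_no_duplicates(value)


def normalize(attrs) -> frozenset:
    """Order-independent, hashable form of a (possibly nested) flow key."""
    return frozenset(
        (name, normalize(value) if isinstance(value, list) else value)
        for name, value in attrs
    )


def same_flow_key(a, b) -> bool:
    """Compare two flow keys without regard to attribute order."""
    return normalize(a) == normalize(b)
```

For keys received from the kernel, the third rule makes even this
normalization unnecessary: since a given key is always composed the same way,
the raw attribute bytes can be hashed and compared directly.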