.. SPDX-License-Identifier: GPL-2.0

=========================
Resilient Next-hop Groups
=========================

Resilient groups are a type of next-hop group that is aimed at minimizing
disruption in flow routing across changes to the group composition and
weights of constituent next hops.

The idea behind resilient hashing groups is best explained in contrast to
the legacy multipath next-hop group, which uses the hash-threshold
algorithm, described in RFC 2992.

To select a next hop, the hash-threshold algorithm first assigns a range of
hashes to each next hop in the group, and then selects the next hop by
comparing the SKB hash with the individual ranges. When a next hop is
removed from the group, the ranges are recomputed, which leads to
reassignment of parts of the hash space from one next hop to another. RFC
2992 illustrates it thus::

  +-------+-------+-------+-------+-------+
  |   1   |   2   |   3   |   4   |   5   |
  +-------+-+-----+---+---+-----+-+-------+
  |    1    |    2    |    4    |    5    |
  +---------+---------+---------+---------+

  Before and after deletion of next hop 3
  under the hash-threshold algorithm.

Note how next hop 2 gave up part of the hash space in favor of next hop 1,
and 4 in favor of 5. While there will usually be some overlap between the
previous and the new distribution, some traffic flows change the next hop
that they resolve to.
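
For illustration, the following is a minimal user-space sketch of
hash-threshold selection for equally-weighted next hops. The kernel's
implementation differs in detail and the function name is made up::

  #include <stdint.h>

  /* RFC 2992 hash-threshold selection, equal weights only: map the
   * 32-bit hash space onto num_nh equal, contiguous slices.
   */
  static unsigned int ht_select_nexthop(uint32_t skb_hash, unsigned int num_nh)
  {
          /* Removing a next hop shrinks num_nh and moves every slice
           * boundary, which is what reassigns flows between the
           * remaining next hops.
           */
          return (unsigned int)(((uint64_t)skb_hash * num_nh) >> 32);
  }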

If a multipath group is used for load-balancing between multiple servers,
this hash space reassignment causes packets from a single flow to suddenly
start arriving at a server that does not expect them. This can result in
TCP connections being reset.

If a multipath group is used for load-balancing among available paths to
the same server, the issue is that different latencies and reordering along
the way cause the packets to arrive in the wrong order, resulting in
degraded application performance.

To mitigate the above-mentioned flow redirection, resilient next-hop groups
insert another layer of indirection between the hash space and its
constituent next hops: a hash table. The selection algorithm uses the SKB
hash to choose a hash table bucket, then reads the next hop that this
bucket contains, and forwards traffic there.
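
In pseudo-C, the lookup reduces to the following sketch (the structure and
names are illustrative, not the kernel's)::

  #include <stdint.h>

  struct resilient_table {
          unsigned int num_buckets;  /* fixed at group creation */
          unsigned int *bucket_nh;   /* bucket index -> next-hop index */
  };

  static unsigned int res_select_nexthop(const struct resilient_table *tbl,
                                         uint32_t skb_hash)
  {
          /* Only the bucket -> next hop mapping ever changes; the
           * hash -> bucket mapping is stable, so a flow keeps hitting
           * the same bucket across group updates.
           */
          return tbl->bucket_nh[skb_hash % tbl->num_buckets];
  }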

This indirection brings an important feature. In the hash-threshold
algorithm, the range of hashes associated with a next hop must be
continuous. With a hash table, the mapping between the hash table buckets
and the individual next hops is arbitrary. Therefore, when a next hop is
deleted, the buckets that held it are simply reassigned to other next
hops::

  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |1|1|1|1|2|2|2|2|3|3|3|3|4|4|4|4|5|5|5|5|
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                   v v v v
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |1|1|1|1|2|2|2|2|1|2|4|5|4|4|4|4|5|5|5|5|
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

  Before and after deletion of next hop 3
  under the resilient hashing algorithm.

When weights of next hops in a group are altered, it may be possible to
choose a subset of buckets that are currently not used for forwarding
traffic, and use those to satisfy the new next-hop distribution demands,
keeping the "busy" buckets intact. This way, established flows ideally keep
being forwarded to the same endpoints through the same paths as before the
next-hop group change.

Algorithm
---------

In a nutshell, the algorithm works as follows. Each next hop deserves a
certain number of buckets, according to its weight and the number of
buckets in the hash table. In accordance with the source code, we will call
this number a "wants count" of a next hop. In case of an event that might
cause a change in bucket allocation, the wants counts of the individual
next hops are updated.

Next hops that have fewer buckets than their wants count are called
"underweight". Those that have more are "overweight". If there are no
overweight (and therefore no underweight) next hops in the group, it is
said to be "balanced".
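
As a rough sketch (the kernel's exact rounding rules differ; see
net/ipv4/nexthop.c), the wants count and the classification could be
computed as follows, with all names being illustrative::

  struct nh_state {
          unsigned int weight;  /* configured next-hop weight */
          unsigned int wants;   /* buckets this next hop deserves */
          unsigned int have;    /* buckets currently assigned to it */
  };

  static void compute_wants(struct nh_state *nhs, unsigned int num_nh,
                            unsigned int num_buckets)
  {
          unsigned int total = 0, i;

          for (i = 0; i < num_nh; i++)
                  total += nhs[i].weight;

          for (i = 0; i < num_nh; i++)
                  nhs[i].wants = num_buckets * nhs[i].weight / total;

          /* A next hop with have < wants is "underweight", one with
           * have > wants is "overweight"; the group is "balanced" when
           * neither kind exists.
           */
  }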

Each bucket maintains a last-used timer. Every time a packet is forwarded
through a bucket, this timer is updated to the current jiffies value. One
attribute of a resilient group is then the "idle timer", which is the
amount of time that a bucket must not be hit by traffic in order for it to
be considered "idle". Buckets that are not idle are busy.
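
The idle test therefore boils down to a wrap-safe jiffies comparison along
these lines (an illustrative sketch, not the kernel code; all values are in
jiffies)::

  #include <stdbool.h>

  static bool bucket_is_idle(unsigned long now, unsigned long used_time,
                             unsigned long idle_timer)
  {
          /* Idle when no packet has hit the bucket for at least the
           * group's idle timer; otherwise the bucket is busy and upkeep
           * tries not to migrate it.
           */
          return (long)(now - used_time) >= (long)idle_timer;
  }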

After assigning wants counts to next hops, an "upkeep" algorithm runs. For
buckets:

1) that have no assigned next hop, or
2) whose next hop has been removed, or
3) that are idle and their next hop is overweight,

upkeep changes the next hop that the bucket references to one of the
underweight next hops. If, after considering all buckets in this manner,
there are still underweight next hops, another upkeep run is scheduled for
a future time.

There may not be enough "idle" buckets to satisfy the updated wants counts
of all next hops. Another attribute of a resilient group is the "unbalanced
timer". This timer can be set to 0, in which case the table will stay out
of balance until idle buckets do appear, possibly never. If set to a
non-zero value, the value represents the period of time that the table is
permitted to stay out of balance.

With this in mind, we update the above list of conditions with one more
item. Thus buckets:

4) whose next hop is overweight, and the amount of time that the table has
   been out of balance exceeds the unbalanced timer, if that is non-zero,

\... are migrated as well.
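
Restating the four conditions in code form, the per-bucket upkeep decision
is roughly the following (an illustrative sketch; the authoritative logic
lives in net/ipv4/nexthop.c)::

  #include <stdbool.h>

  struct upkeep_ctx {
          bool has_nh;             /* bucket has a next hop assigned */
          bool nh_removed;         /* that next hop has been removed */
          bool nh_overweight;      /* that next hop is overweight */
          bool bucket_idle;        /* no traffic for >= idle_timer */
          bool unbalanced_expired; /* non-zero unbalanced timer ran out */
  };

  static bool bucket_should_migrate(const struct upkeep_ctx *c)
  {
          if (!c->has_nh || c->nh_removed)
                  return true;                           /* 1) and 2) */
          if (c->nh_overweight && c->bucket_idle)
                  return true;                           /* 3) */
          if (c->nh_overweight && c->unbalanced_expired)
                  return true;                           /* 4) */
          return false;
  }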

Offloading & Driver Feedback
----------------------------

When offloading resilient groups, the algorithm that distributes buckets
among next hops is still the one in SW. Drivers are notified of updates to
next hop groups in the following three ways:

- Full group notification with the type
  ``NH_NOTIFIER_INFO_TYPE_RES_TABLE``. This is used just after the group is
  created and buckets populated for the first time.

- Single-bucket notifications of the type
  ``NH_NOTIFIER_INFO_TYPE_RES_BUCKET``, which is used for notifications of
  individual migrations within an already-established group.

- Pre-replace notification, ``NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE``. This
  is sent before the group is replaced, and is a way for the driver to veto
  the group before committing anything to the HW.

Some single-bucket notifications are forced, as indicated by the "force"
flag in the notification. Those are used for the cases where e.g. the next
hop associated with the bucket was removed, and the bucket really must be
migrated.

Non-forced notifications can be overridden by the driver by returning an
error code. The use case for this is that the driver notifies the HW that a
bucket should be migrated, but the HW discovers that the bucket has in fact
been hit by traffic.
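
A hypothetical driver-side handler for these notifications could look
roughly like this. The ``nh_notifier_info`` structure and its fields are
meant to follow ``include/net/nexthop.h`` (check the header for the
authoritative layout); the ``my_drv_*`` helpers and the overall logic are
made up for illustration::

  static int my_drv_nexthop_event(struct notifier_block *nb,
                                  unsigned long event, void *ptr)
  {
          struct nh_notifier_info *info = ptr;
          int err = 0;

          switch (info->type) {
          case NH_NOTIFIER_INFO_TYPE_RES_TABLE:
                  /* Initial dump of the whole table: program every bucket
                   * described by info->nh_res_table (hypothetical helper).
                   */
                  err = my_drv_program_table(info);
                  break;
          case NH_NOTIFIER_INFO_TYPE_RES_BUCKET:
                  /* Single-bucket migration. A non-forced migration may be
                   * vetoed: if the HW saw recent traffic on this bucket,
                   * return an error and the core will retry later.
                   */
                  if (!info->nh_res_bucket->force &&
                      my_drv_bucket_active(info)) /* hypothetical */
                          err = -EBUSY;
                  else
                          err = my_drv_program_bucket(info);
                  break;
          default:
                  break;
          }

          return notifier_from_errno(err);
  }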

A second way for the HW to report that a bucket is busy is through the
``nexthop_res_grp_activity_update()`` API. The buckets identified this way
as busy are treated as if traffic hit them.

Offloaded buckets should be flagged as either "offload" or "trap". This is
done through the ``nexthop_bucket_set_hw_flags()`` API.
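
A driver that polls its HW counters periodically might report bucket
activity along these lines. The two kernel APIs are the ones named above
(their exact signatures should be checked against
``include/net/nexthop.h``); the polling helper and the timing are
hypothetical::

  static void my_drv_report_activity(struct net *net, u32 nh_group_id,
                                     u16 num_buckets)
  {
          unsigned long *activity;
          u16 i;

          activity = bitmap_zalloc(num_buckets, GFP_KERNEL);
          if (!activity)
                  return;

          for (i = 0; i < num_buckets; i++)
                  if (my_drv_hw_bucket_was_hit(nh_group_id, i)) /* hypothetical */
                          __set_bit(i, activity);

          /* Buckets set in the bitmap are treated as if traffic hit them. */
          nexthop_res_grp_activity_update(net, nh_group_id, num_buckets,
                                          activity);

          bitmap_free(activity);
  }

When a bucket has been successfully programmed into the HW, the driver
would similarly call ``nexthop_bucket_set_hw_flags()`` to set the "offload"
(or "trap") flag on it.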

Netlink UAPI
------------

Resilient Group Replacement
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Resilient groups are configured using the ``RTM_NEWNEXTHOP`` message in the
same manner as other multipath groups. The following changes apply to the
attributes passed in the netlink message:

=================== ==========================================================
``NHA_GROUP_TYPE``  Should be ``NEXTHOP_GRP_TYPE_RES`` for a resilient group.
``NHA_RES_GROUP``   A nest that contains attributes specific to resilient
                    groups.
=================== ==========================================================

``NHA_RES_GROUP`` payload:

=================================== =========================================
``NHA_RES_GROUP_BUCKETS``           Number of buckets in the hash table.
``NHA_RES_GROUP_IDLE_TIMER``        Idle timer in units of clock_t.
``NHA_RES_GROUP_UNBALANCED_TIMER``  Unbalanced timer in units of clock_t.
=================================== =========================================
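
To make the nesting concrete, a hypothetical libmnl-based construction of
such a replacement request could look as follows. Attribute value widths
and the ``struct nexthop_grp`` layout should be double-checked against
``include/uapi/linux/nexthop.h``; error handling and the actual send are
omitted::

  #include <libmnl/libmnl.h>
  #include <linux/nexthop.h>
  #include <linux/rtnetlink.h>
  #include <stdint.h>
  #include <sys/socket.h>

  static void build_res_group_replace(char *buf, uint32_t group_id)
  {
          struct nlmsghdr *nlh = mnl_nlmsg_put_header(buf);
          struct nexthop_grp entries[2] = {
                  { .id = 1 },    /* next hops 1 and 2, equal weights */
                  { .id = 2 },
          };
          struct nlattr *nest;
          struct nhmsg *nhm;

          nlh->nlmsg_type = RTM_NEWNEXTHOP;
          nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_REPLACE;

          nhm = mnl_nlmsg_put_extra_header(nlh, sizeof(*nhm));
          nhm->nh_family = AF_UNSPEC;

          mnl_attr_put_u32(nlh, NHA_ID, group_id);
          mnl_attr_put(nlh, NHA_GROUP, sizeof(entries), entries);
          mnl_attr_put_u16(nlh, NHA_GROUP_TYPE, NEXTHOP_GRP_TYPE_RES);

          nest = mnl_attr_nest_start(nlh, NHA_RES_GROUP);
          mnl_attr_put_u16(nlh, NHA_RES_GROUP_BUCKETS, 8);
          /* Timers are in clock_t units; with a USER_HZ of 100, these
           * correspond to 60 and 300 seconds respectively.
           */
          mnl_attr_put_u32(nlh, NHA_RES_GROUP_IDLE_TIMER, 6000);
          mnl_attr_put_u32(nlh, NHA_RES_GROUP_UNBALANCED_TIMER, 30000);
          mnl_attr_nest_end(nlh, nest);
  }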

Next Hop Get
^^^^^^^^^^^^

Requests to get resilient next-hop groups use the ``RTM_GETNEXTHOP``
message in exactly the same way as other next hop get requests. The
response attributes match the replacement attributes cited above, except
that the ``NHA_RES_GROUP`` payload will include the following additional
attribute:

=================================== =========================================
``NHA_RES_GROUP_UNBALANCED_TIME``   How long the resilient group has been out
                                    of balance, in units of clock_t.
=================================== =========================================

Bucket Get
^^^^^^^^^^

The message ``RTM_GETNEXTHOPBUCKET`` without the ``NLM_F_DUMP`` flag is
used to request a single bucket. The attributes recognized in get requests
are:

=================== ==========================================================
``NHA_ID``          ID of the next-hop group that the bucket belongs to.
``NHA_RES_BUCKET``  A nest that contains attributes specific to a bucket.
=================== ==========================================================

``NHA_RES_BUCKET`` payload:

======================== ====================================================
``NHA_RES_BUCKET_INDEX`` Index of the bucket in the resilient table.
======================== ====================================================

Bucket Dumps
^^^^^^^^^^^^

The message ``RTM_GETNEXTHOPBUCKET`` with the ``NLM_F_DUMP`` flag is used
to request a dump of matching buckets. The attributes recognized in dump
requests are:

=================== ==========================================================
``NHA_ID``          If specified, limits the dump to just the next-hop group
                    with this ID.
``NHA_OIF``         If specified, limits the dump to buckets that contain
                    next hops that use the device with this ifindex.
``NHA_MASTER``      If specified, limits the dump to buckets that contain
                    next hops that use a device in the VRF with this ifindex.
``NHA_RES_BUCKET``  A nest that contains attributes specific to a bucket.
=================== ==========================================================

``NHA_RES_BUCKET`` payload:

======================== ====================================================
``NHA_RES_BUCKET_NH_ID`` If specified, limits the dump to just the buckets
                         that contain the next hop with this ID.
======================== ====================================================

Usage
-----

To illustrate the usage, consider the following commands::

  # ip nexthop add id 1 via 192.0.2.2 dev eth0
  # ip nexthop add id 2 via 192.0.2.3 dev eth0
  # ip nexthop add id 10 group 1/2 type resilient \
        buckets 8 idle_timer 60 unbalanced_timer 300

The last command creates a resilient next-hop group. It will have 8 buckets
(which is an unusually low number, used here for demonstration purposes
only), each bucket will be considered idle when no traffic hits it for at
least 60 seconds, and if the table remains out of balance for 300 seconds,
it will be forcefully brought into balance.

Changing next-hop weights leads to a change in bucket allocation::

  # ip nexthop replace id 10 group 1,3/2 type resilient

This can be confirmed by looking at individual buckets::

  # ip nexthop bucket show id 10
  id 10 index 0 idle_time 5.59 nhid 1
  id 10 index 1 idle_time 5.59 nhid 1
  id 10 index 2 idle_time 8.74 nhid 2
  id 10 index 3 idle_time 8.74 nhid 2
  id 10 index 4 idle_time 8.74 nhid 1
  id 10 index 5 idle_time 8.74 nhid 1
  id 10 index 6 idle_time 8.74 nhid 1
  id 10 index 7 idle_time 8.74 nhid 1

Note the two buckets that have a shorter idle time. Those are the ones that
were migrated after the next-hop replace command to satisfy the new demand
that next hop 1 be given 6 buckets instead of 4.

Netdevsim
---------

The netdevsim driver implements a mock offload of resilient groups, and
exposes a debugfs interface that allows marking individual buckets as busy.
For example, the following will mark bucket 23 in next-hop group 10 as
active::

  # echo 10 23 > /sys/kernel/debug/netdevsim/netdevsim10/fib/nexthop_bucket_activity

In addition, another debugfs interface can be used to make the next attempt
to migrate a bucket fail::

  # echo 1 > /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_nexthop_bucket_replace

Besides serving as an example, the interfaces that netdevsim exposes are
useful in automated testing, and
``tools/testing/selftests/drivers/net/netdevsim/nexthop.sh`` makes use of
them to test the algorithm.