0001 ================================
0002 Documentation for /proc/sys/net/
0003 ================================
0004
0005 Copyright
0006
0007 Copyright (c) 1999
0008
0009         - Terrehon Bowden <terrehon@pacbell.net>
0010         - Bodo Bauer <bb@ricochet.net>
0011
0012 Copyright (c) 2000
0013
0014         - Jorge Nerin <comandante@zaralinux.com>
0015
0016 Copyright (c) 2009
0017
0018         - Shen Feng <shen@cn.fujitsu.com>
0019
0020 For general info and legal blurb, please look in index.rst.
0021
0022 ------------------------------------------------------------------------------
0023
This file contains the documentation for the sysctl files in
/proc/sys/net.

The interface to the networking parts of the kernel is located in
/proc/sys/net. The following table shows all possible subdirectories. You may
see only some of them, depending on your kernel's configuration.
0030
0031
Table: Subdirectories in /proc/sys/net

 ========= =================== = ========== ==================
 Directory Content               Directory  Content
 ========= =================== = ========== ==================
 core      General parameters    appletalk  Appletalk protocol
 unix      Unix domain sockets   netrom     NET/ROM
 802       E802 protocol         ax25       AX25
 ethernet  Ethernet protocol     rose       X.25 PLP layer
 ipv4      IP version 4          x25        X.25 protocol
 bridge    Bridging              decnet     DECnet
 ipv6      IP version 6          tipc       TIPC
 ========= =================== = ========== ==================
0045
0046 1. /proc/sys/net/core - Network core options
0047 ============================================
0048
0049 bpf_jit_enable
0050 --------------
0051
This enables the BPF Just in Time (JIT) compiler. BPF is a flexible
and efficient infrastructure that allows bytecode to be executed at various
hook points. It is used in a number of Linux kernel subsystems such
as networking (e.g. XDP, tc), tracing (e.g. kprobes, uprobes, tracepoints)
and security (e.g. seccomp). LLVM has a BPF back end that can compile
restricted C into a sequence of BPF instructions. After a program has been
loaded through bpf(2) and has passed the in-kernel verifier, a JIT
translates these BPF proglets into native CPU instructions. There are
two flavors of JITs. The newer eBPF JIT is currently supported on:
0061
0062   - x86_64
0063   - x86_32
0064   - arm64
0065   - arm32
0066   - ppc64
0067   - ppc32
0068   - sparc64
0069   - mips64
0070   - s390x
0071   - riscv64
0072   - riscv32
0073
The older cBPF JIT is supported on the following architectures:
0075
0076   - mips
0077   - sparc
0078
eBPF JITs are a superset of cBPF JITs, meaning the kernel will
migrate cBPF instructions into eBPF instructions and then JIT
compile them transparently. Older cBPF JITs can only translate
tcpdump filters, seccomp rules, etc., but not the eBPF programs
loaded through bpf(2) that are mentioned above.
0084
Values:

        - 0 - disable the JIT (default value)
        - 1 - enable the JIT
        - 2 - enable the JIT and ask the compiler to emit traces to the kernel log
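
As a rough illustration of the value list above, the following Python sketch
maps the raw text of the sysctl file to a description. The function and
dictionary names are hypothetical and exist only for this example; the
descriptions are copied from the list above.

```python
# Minimal sketch: interpret the contents of
# /proc/sys/net/core/bpf_jit_enable. All names here are illustrative
# assumptions, not kernel or library API.

BPF_JIT_ENABLE_MODES = {
    0: "JIT disabled",
    1: "JIT enabled",
    2: "JIT enabled, compiler traces emitted to the kernel log",
}


def describe_bpf_jit_enable(raw: str) -> str:
    """Translate the text read from the sysctl file into a description."""
    value = int(raw.strip())
    return BPF_JIT_ENABLE_MODES.get(value, f"unknown value {value}")
```

On a system that exposes the file, the function could be fed with
``open("/proc/sys/net/core/bpf_jit_enable").read()``.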
0090
0091 bpf_jit_harden
0092 --------------
0093
This enables hardening for the BPF JIT compiler. Only eBPF JIT backends
are supported. Enabling hardening costs some performance, but can
mitigate JIT spraying.
0097
0098 Values:
0099
0100         - 0 - disable JIT hardening (default value)
0101         - 1 - enable JIT hardening for unprivileged users only
0102         - 2 - enable JIT hardening for all users
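
The three values can be read as a small decision table. The sketch below
encodes that table in Python; the function name and the ``privileged`` flag
are assumptions made for the example and do not describe how the kernel
decides whether a user is privileged.

```python
# Minimal sketch of the bpf_jit_harden value table. The function name
# and the boolean "privileged" parameter are hypothetical.

def jit_hardening_applies(bpf_jit_harden: int, privileged: bool) -> bool:
    """Return True if JIT hardening would apply to the given user."""
    if bpf_jit_harden == 0:
        return False              # hardening disabled
    if bpf_jit_harden == 1:
        return not privileged     # unprivileged users only
    if bpf_jit_harden == 2:
        return True               # all users
    raise ValueError(f"unexpected bpf_jit_harden value: {bpf_jit_harden}")
```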
0103
0104 bpf_jit_kallsyms
0105 ----------------
0106
When the BPF JIT compiler is enabled, the compiled images reside at addresses
unknown to the kernel, meaning they show up neither in traces nor
in /proc/kallsyms. This sysctl enables export of these addresses, which can
be used for debugging/tracing. If bpf_jit_harden is enabled, this
feature is disabled.

Values:

        - 0 - disable JIT kallsyms export (default value)
        - 1 - enable JIT kallsyms export for privileged users only
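
The interaction with bpf_jit_harden described above can be sketched as
follows. This is only an illustration: the function name is hypothetical,
and treating every nonzero bpf_jit_harden value as "enabled" is an
assumption of the sketch.

```python
# Minimal sketch: when would JIT image addresses be exported?
# Assumption: any nonzero bpf_jit_harden counts as "hardening enabled".

def jit_kallsyms_exported(bpf_jit_kallsyms: int,
                          bpf_jit_harden: int,
                          privileged: bool) -> bool:
    """Return True if JIT addresses would be visible to this user."""
    if bpf_jit_harden != 0:
        return False              # hardening disables the export
    # Value 1 exports the addresses for privileged users only.
    return bpf_jit_kallsyms == 1 and privileged
```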
0117
0118 bpf_jit_limit
0119 -------------
0120
0121 This enforces a global limit for memory allocations to the BPF JIT
0122 compiler in order to reject unprivileged JIT requests once it has
0123 been surpassed. bpf_jit_limit contains the value of the global limit
0124 in bytes.
0125
0126 dev_weight
0127 ----------
0128
The maximum number of packets that the kernel can handle on a NAPI interrupt.
This is a per-CPU variable. For drivers that support LRO or GRO_HW, a hardware
aggregated packet is counted as one packet in this context.
0132
0133 Default: 64
0134
0135 dev_weight_rx_bias
0136 ------------------
0137
RPS (e.g. RFS, aRFS) processing competes with the registered NAPI poll function
of the driver for the netdev_budget of each softirq cycle. This parameter influences
the proportion of the configured netdev_budget that is spent on RPS-based packet
processing during RX softirq cycles. It is also meant to make the current
dev_weight adaptable to asymmetric CPU needs on the RX/TX sides of the network
stack (see dev_weight_tx_bias). It is effective on a per-CPU basis. The resulting
value is derived from dev_weight by multiplication (dev_weight * dev_weight_rx_bias).
0145
0146 Default: 1
0147
0148 dev_weight_tx_bias
0149 ------------------
0150
Scales the maximum number of packets that can be processed during a TX softirq cycle.
Effective on a per-CPU basis. Allows scaling of the current dev_weight for asymmetric
net stack processing needs. Be careful to avoid making TX softirq processing a CPU hog.

The resulting value is derived from dev_weight (dev_weight * dev_weight_tx_bias).
0156
0157 Default: 1
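
The two products mentioned for dev_weight_rx_bias and dev_weight_tx_bias can
be written out as a short calculation. The function name and the returned
dictionary are hypothetical; the defaults are the ones documented above
(dev_weight 64, both biases 1).

```python
# Minimal sketch of the documented multiplications. Names are
# illustrative only.

def effective_dev_weights(dev_weight: int = 64,
                          rx_bias: int = 1,
                          tx_bias: int = 1) -> dict:
    """Return the per-CPU RX and TX weights after applying the biases."""
    return {
        "rx": dev_weight * rx_bias,   # dev_weight * dev_weight_rx_bias
        "tx": dev_weight * tx_bias,   # dev_weight * dev_weight_tx_bias
    }
```

With the defaults both directions end up at 64; setting only the RX bias to
2 would double the RX side and leave the TX side unchanged.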
0158
0159 default_qdisc
0160 -------------
0161
The default queuing discipline to use for network devices. This allows
overriding the default of pfifo_fast with an alternative. Because the default
queuing discipline is created without additional parameters, it is best suited
to queuing disciplines that work well without configuration, such as stochastic
fair queue (sfq), CoDel (codel) or fair queue CoDel (fq_codel). Don't use
queuing disciplines like Hierarchical Token Bucket or Deficit Round Robin,
which require setting up classes and bandwidths. Note that physical multiqueue
interfaces still use mq as root qdisc, which in turn uses this default for its
leaves. Virtual devices (e.g. lo or veth) ignore this setting and instead
default to noqueue.
0172
0173 Default: pfifo_fast
0174
0175 busy_read
0176 ---------
0177
Low latency busy poll timeout for socket reads (needs CONFIG_NET_RX_BUSY_POLL).
Approximate time in microseconds to busy loop waiting for packets on the device queue.
This sets the default value of the SO_BUSY_POLL socket option.
It can be set or overridden per socket with the SO_BUSY_POLL socket option,
which is the preferred method of enabling. If you need to enable the feature
globally via sysctl, a value of 50 is recommended.

Enabling busy polling will increase power usage.
0186
0187 Default: 0 (off)
0188
busy_poll
---------

Low latency busy poll timeout for poll and select (needs CONFIG_NET_RX_BUSY_POLL).
Approximate time in microseconds to busy loop waiting for events.
The recommended value depends on the number of sockets you poll on:
50 for several sockets, 100 for several hundred.
For more than that you probably want to use epoll.
Note that only sockets with SO_BUSY_POLL set will be busy polled,
so you want to either selectively set SO_BUSY_POLL on those sockets or set
net.core.busy_read globally.

Enabling busy polling will increase power usage.
0201
0202 Default: 0 (off)
0203
0204 rmem_default
0205 ------------
0206
0207 The default setting of the socket receive buffer in bytes.
0208
0209 rmem_max
0210 --------
0211
0212 The maximum receive socket buffer size in bytes.
0213
tstamp_allow_data
-----------------

Allow processes to receive tx timestamps looped together with the original
packet contents. If disabled, transmit timestamp requests from unprivileged
processes are dropped unless socket option SOF_TIMESTAMPING_OPT_TSONLY is set.
0219
0220 Default: 1 (on)
0221
0222
0223 wmem_default
0224 ------------
0225
0226 The default setting (in bytes) of the socket send buffer.
0227
0228 wmem_max
0229 --------
0230
0231 The maximum send socket buffer size in bytes.
0232
0233 message_burst and message_cost
0234 ------------------------------
0235
These parameters are used to limit the warning messages written to the kernel
log from the networking code. They enforce a rate limit to make a
denial-of-service attack impossible. A higher message_cost factor results in
fewer messages being written. message_burst controls when messages will
be dropped. The default settings limit warning messages to one every five
seconds.
0242
0243 warnings
0244 --------
0245
0246 This sysctl is now unused.
0247
0248 This was used to control console messages from the networking stack that
0249 occur because of problems on the network like duplicate address or bad
0250 checksums.
0251
0252 These messages are now emitted at KERN_DEBUG and can generally be enabled
0253 and controlled by the dynamic_debug facility.
0254
0255 netdev_budget
0256 -------------
0257
Maximum number of packets taken from all interfaces in one polling cycle (NAPI
poll). In one polling cycle, interfaces which are registered for polling are
probed in a round-robin manner. Also, a polling cycle may not exceed
netdev_budget_usecs microseconds, even if netdev_budget has not been
exhausted.
0263
netdev_budget_usecs
-------------------
0266
0267 Maximum number of microseconds in one NAPI polling cycle. Polling
0268 will exit when either netdev_budget_usecs have elapsed during the
0269 poll cycle or the number of packets processed reaches netdev_budget.
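
The two exit conditions for a polling cycle can be illustrated with a toy
simulation. This is not the kernel's NAPI code; it is a sketch that assumes
every packet costs the same fixed time, and all names are hypothetical.

```python
# Toy model of one polling cycle: stop when the packet budget is used
# up or when the time budget has elapsed, whichever happens first.
# Assumption: each packet takes exactly usecs_per_packet microseconds.

def simulate_poll_cycle(pending_packets: int,
                        usecs_per_packet: int,
                        netdev_budget: int,
                        netdev_budget_usecs: int):
    """Return (packets processed, reason the cycle ended)."""
    processed = 0
    elapsed_usecs = 0
    while processed < pending_packets:
        if processed >= netdev_budget:
            return processed, "budget"   # netdev_budget reached
        if elapsed_usecs >= netdev_budget_usecs:
            return processed, "time"     # netdev_budget_usecs elapsed
        processed += 1
        elapsed_usecs += usecs_per_packet
    return processed, "drained"          # nothing left to poll
```

For example, with a budget of 300 packets and 2000 microseconds, cheap
packets exhaust the packet budget first, while packets costing 10
microseconds each hit the time limit after 200 packets.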
0270
0271 netdev_max_backlog
0272 ------------------
0273
Maximum number of packets queued on the INPUT side when the interface
receives packets faster than the kernel can process them.
0276
0277 netdev_rss_key
0278 --------------
0279
RSS (Receive Side Scaling) enabled drivers use a 40-byte host key that is
randomly generated.
Some user space might need to gather its content even if drivers do not
provide ethtool -x support yet.

::

  myhost:~# cat /proc/sys/net/core/netdev_rss_key
  84:50:f4:00:a8:15:d1:a7:e9:7f:1d:60:35:c7:47:25:42:97:74:ca:56:bb:b6:a1:d8: ... (52 bytes total)

The file contains NUL bytes if no driver has ever called the netdev_rss_key_fill() function.

Note:
  /proc/sys/net/core/netdev_rss_key contains 52 bytes of key,
  but most drivers only use 40 bytes of it.

::

  myhost:~# ethtool -x eth0
  RX flow hash indirection table for eth0 with 8 RX ring(s):
      0:    0     1     2     3     4     5     6     7
  RSS hash key:
  84:50:f4:00:a8:15:d1:a7:e9:7f:1d:60:35:c7:47:25:42:97:74:ca:56:bb:b6:a1:d8:43:e3:c9:0c:fd:17:55:c2:3a:4d:69:ed:f1:42:89
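
User space that gathers the key has to turn the colon-separated hex text
into bytes. The following Python sketch shows one way to do that and to
detect the all-NUL case; the function names are hypothetical, and the sample
string is the 40-byte key from the ethtool output above.

```python
# Minimal sketch: parse the colon-separated hex format used by
# netdev_rss_key and "ethtool -x". Names are illustrative only.

def parse_rss_key(text: str) -> bytes:
    """Convert 'aa:bb:cc' style text into the corresponding bytes."""
    return bytes(int(part, 16) for part in text.strip().split(":"))


def rss_key_is_unset(key: bytes) -> bool:
    """True if the key consists of NUL bytes only."""
    return all(b == 0 for b in key)


# Sample: the 40-byte key shown in the ethtool output above.
SAMPLE_KEY = (
    "84:50:f4:00:a8:15:d1:a7:e9:7f:1d:60:35:c7:47:25:42:97:74:ca:"
    "56:bb:b6:a1:d8:43:e3:c9:0c:fd:17:55:c2:3a:4d:69:ed:f1:42:89"
)
```

Applied to the full sysctl file, the same parser would be expected to return
52 bytes, of which most drivers use only the first 40.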
0303
0304 netdev_tstamp_prequeue
0305 ----------------------
0306
If set to 0, RX packet timestamps can be sampled after RPS processing, when
the target CPU processes packets. This may delay the timestamps somewhat, but
allows the load to be distributed across several CPUs.
0310
0311 If set to 1 (default), timestamps are sampled as soon as possible, before
0312 queueing.
0313
0314 netdev_unregister_timeout_secs
0315 ------------------------------
0316
0317 Unregister network device timeout in seconds.
0318 This option controls the timeout (in seconds) used to issue a warning while
0319 waiting for a network device refcount to drop to 0 during device
0320 unregistration. A lower value may be useful during bisection to detect
0321 a leaked reference faster. A larger value may be useful to prevent false
0322 warnings on slow/loaded systems.
0323 Default value is 10, minimum 1, maximum 3600.
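
A script that sets this value may want to check the documented range first.
The sketch below does only that; the constant and function names are
hypothetical, and how the kernel reacts to an out-of-range write is not
modelled here.

```python
# Minimal sketch: range check for netdev_unregister_timeout_secs,
# using the limits documented above. Names are illustrative only.

UNREGISTER_TIMEOUT_DEFAULT = 10
UNREGISTER_TIMEOUT_MIN = 1
UNREGISTER_TIMEOUT_MAX = 3600


def is_valid_unregister_timeout(secs: int) -> bool:
    """Return True if secs lies within the documented range."""
    return UNREGISTER_TIMEOUT_MIN <= secs <= UNREGISTER_TIMEOUT_MAX
```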
0324
0325 skb_defer_max
0326 -------------
0327
Maximum size (in skbs) of the per-CPU list of skbs being freed
by the CPU which allocated them. So far this is used only by the TCP stack.
0330
0331 Default: 64
0332
0333 optmem_max
0334 ----------
0335
0336 Maximum ancillary buffer size allowed per socket. Ancillary data is a sequence
0337 of struct cmsghdr structures with appended data.
0338
0339 fb_tunnels_only_for_init_net
0340 ----------------------------
0341
Controls whether fallback tunnels (like tunl0, gre0, gretap0, erspan0,
sit0, ip6tnl0, ip6gre0) are automatically created. There are 3 possibilities:

(a) value = 0; the respective fallback tunnels are created in every net
    namespace when the module is loaded (backward compatible behavior).
(b) value = 1; [kcmd value: initns] the respective fallback tunnels are
    created only in the init net namespace, and no other net namespace
    will have them.
(c) value = 2; [kcmd value: none] fallback tunnels are not created
    in any net namespace when a module is loaded.

Setting the value to "2" is pointless after boot if these modules are
built-in, so there is a kernel command-line option that can change this
default. Please refer to
Documentation/admin-guide/kernel-parameters.txt for additional details.

Not creating fallback tunnels lets userspace create only the devices that
are needed and avoids creating redundant ones.

Default: 0 (for compatibility reasons)
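
The three cases can be summarised as a small predicate. The sketch below is
only an illustration of the rules listed above; the function name and the
``is_init_net`` flag are hypothetical.

```python
# Minimal sketch of the fb_tunnels_only_for_init_net rules. Names are
# illustrative assumptions.

def fallback_tunnels_created(value: int, is_init_net: bool) -> bool:
    """Would fallback tunnels be created in this net namespace?"""
    if value == 0:
        return True           # (a) created in every net namespace
    if value == 1:
        return is_init_net    # (b) created only in the init net namespace
    if value == 2:
        return False          # (c) never created
    raise ValueError(f"unexpected value: {value}")
```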
0359
0360 devconf_inherit_init_net
0361 ------------------------
0362
0363 Controls if a new network namespace should inherit all current
0364 settings under /proc/sys/net/{ipv4,ipv6}/conf/{all,default}/. By
0365 default, we keep the current behavior: for IPv4 we inherit all current
0366 settings from init_net and for IPv6 we reset all settings to default.
0367
0368 If set to 1, both IPv4 and IPv6 settings are forced to inherit from
0369 current ones in init_net. If set to 2, both IPv4 and IPv6 settings are
0370 forced to reset to their default values. If set to 3, both IPv4 and IPv6
0371 settings are forced to inherit from current ones in the netns where this
0372 new netns has been created.
0373
Default: 0 (for compatibility reasons)
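
The four values differ only in where a new namespace takes its IPv4 and IPv6
settings from. The following sketch restates that as a lookup; the function
name and the returned labels ("init_net", "defaults", "creating_netns") are
hypothetical shorthand for the sources described above.

```python
# Minimal sketch of the devconf_inherit_init_net rules. The labels are
# shorthand invented for this example.

def devconf_source(value: int, family: str) -> str:
    """Return where a new netns gets its settings for 'ipv4' or 'ipv6'."""
    if family not in ("ipv4", "ipv6"):
        raise ValueError(f"unknown family: {family}")
    if value == 0:
        # Compatibility behavior: IPv4 inherits, IPv6 resets.
        return "init_net" if family == "ipv4" else "defaults"
    if value == 1:
        return "init_net"          # both inherit from init_net
    if value == 2:
        return "defaults"          # both reset to default values
    if value == 3:
        return "creating_netns"    # both inherit from the creating netns
    raise ValueError(f"unexpected value: {value}")
```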
0375
0376 txrehash
0377 --------
0378
Controls the default hash rethink behaviour on a listening socket when the
SO_TXREHASH option is set to SOCK_TXREHASH_DEFAULT (i.e. not overridden by
setsockopt).

If set to 1 (default), hash rethink is performed on the listening socket.
If set to 0, hash rethink is not performed.
0384
0385 gro_normal_batch
0386 ----------------
0387
Maximum number of segments to batch up on output of GRO. When a packet
exits GRO, either as a coalesced superframe or as an original packet which
GRO has decided not to coalesce, it is placed on a per-NAPI list. This
list is then passed to the stack when the number of segments reaches the
gro_normal_batch limit.
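
The batching rule can be illustrated with a toy model in which each packet
leaving GRO is represented only by its segment count. This is a sketch, not
the kernel implementation; the function name is hypothetical, and what
happens to packets still on the list when input ends is left out.

```python
# Toy model of the per-NAPI list: packets accumulate until the summed
# segment count reaches gro_normal_batch, then the list is handed on.
# Assumption: a packet is described solely by its number of segments.

def batch_gro_output(segment_counts, gro_normal_batch: int):
    """Return (batches passed to the stack, packets still on the list)."""
    batches = []
    current = []
    segments = 0
    for count in segment_counts:
        current.append(count)
        segments += count
        if segments >= gro_normal_batch:
            batches.append(current)   # list is passed to the stack
            current = []
            segments = 0
    return batches, current
```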
0393
0394 high_order_alloc_disable
0395 ------------------------
0396
By default the allocator for page frags tries to use high order pages (order-3
on x86). While the default behavior gives good results in most cases, some users
might hit contention in page allocation/freeing. This was especially
true on older kernels (< 5.14), when high-order pages were not stored on per-cpu
lists. This sysctl allows opting in to order-0 allocation instead, but is now
mostly of historical importance.
0403
0404 Default: 0
0405
2. /proc/sys/net/unix - Parameters for Unix domain sockets
==========================================================

There is only one file in this directory.
unix_dgram_qlen limits the maximum number of datagrams queued in a Unix domain
socket's buffer. It will not take effect unless the PF_UNIX flag is specified.
0412
0413
3. /proc/sys/net/ipv4 - IPv4 settings
=====================================

Please see: Documentation/networking/ip-sysctl.rst and
Documentation/admin-guide/sysctl/net.rst for descriptions of these entries.
0418
0419
4. Appletalk
============

The /proc/sys/net/appletalk directory holds the Appletalk configuration data
when Appletalk is loaded. The configurable parameters are:
0425
0426 aarp-expiry-time
0427 ----------------
0428
The amount of time we keep an ARP entry before expiring it. Used to age out
old hosts.
0431
0432 aarp-resolve-time
0433 -----------------
0434
0435 The amount of time we will spend trying to resolve an Appletalk address.
0436
0437 aarp-retransmit-limit
0438 ---------------------
0439
0440 The number of times we will retransmit a query before giving up.
0441
0442 aarp-tick-time
0443 --------------
0444
Controls the rate at which entries are checked for expiry.
0446
The directory /proc/net/appletalk holds the list of active Appletalk sockets
on a machine.

The fields indicate the DDP type, the local address (in network:node format),
the remote address, the size of the transmit pending queue, the size of the
receive queue (bytes waiting for applications to read), the state and the uid
owning the socket.

/proc/net/atalk_iface lists all the interfaces configured for Appletalk. It
shows the name of the interface, its Appletalk address, the network range on
that address (or network number for phase 1 networks), and the status of the
interface.

/proc/net/atalk_route lists each known network route. It lists the target
(network) that the route leads to, the router (may be directly connected), the
route flags, and the device the route is using.
0463
5. TIPC
=======
0466
0467 tipc_rmem
0468 ---------
0469
The TIPC protocol has a tunable for the receive memory, similar to
tcp_rmem, i.e. a vector of 3 INTEGERs: (min, default, max)

::

    # cat /proc/sys/net/tipc/tipc_rmem
    4252725 34021800        68043600
    #
0478
0479 The max value is set to CONN_OVERLOAD_LIMIT, and the default and min values
0480 are scaled (shifted) versions of that same value.  Note that the min value
0481 is not at this point in time used in any meaningful way, but the triplet is
0482 preserved in order to be consistent with things like tcp_rmem.
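
The triplet can be parsed and the "scaled (shifted)" relationship checked
against the sample output shown above. In the sketch below the class and
function names are hypothetical, and the shift amounts (1 for default, 4 for
min) are inferred from that one sample rather than taken from the TIPC
source.

```python
# Minimal sketch: parse tipc_rmem and compare it with shifted versions
# of max. The shift amounts are an assumption derived from the sample
# values "4252725 34021800 68043600".

from typing import NamedTuple


class TipcRmem(NamedTuple):
    min: int
    default: int
    max: int


def parse_tipc_rmem(text: str) -> TipcRmem:
    """Parse the whitespace-separated (min, default, max) triplet."""
    low, default, high = (int(field) for field in text.split())
    return TipcRmem(low, default, high)


def matches_sample_scaling(rmem: TipcRmem) -> bool:
    """Check default == max >> 1 and min == max >> 4 (sample-derived)."""
    return rmem.default == rmem.max >> 1 and rmem.min == rmem.max >> 4
```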
0483
0484 named_timeout
0485 -------------
0486
0487 TIPC name table updates are distributed asynchronously in a cluster, without
0488 any form of transaction handling. This means that different race scenarios are
0489 possible. One such is that a name withdrawal sent out by one node and received
0490 by another node may arrive after a second, overlapping name publication already
0491 has been accepted from a third node, although the conflicting updates
0492 originally may have been issued in the correct sequential order.
0493 If named_timeout is nonzero, failed topology updates will be placed on a defer
0494 queue until another event arrives that clears the error, or until the timeout
0495 expires. Value is in milliseconds.