.. SPDX-License-Identifier: GPL-2.0

=====================================
Network Devices, the Kernel, and You!
=====================================


Introduction
============
The following is a random collection of documentation regarding
network devices.

struct net_device lifetime rules
================================
Network device structures need to persist even after the module is unloaded
and must be allocated with alloc_netdev_mqs() and friends.
If the device has registered successfully, it will be freed on last use
by free_netdev(). This is required to handle the pathological case cleanly
(example: ``rmmod mydriver </sys/class/net/myeth/mtu``).

alloc_netdev_mqs() / alloc_netdev() reserve extra space for driver
private data, which gets freed when the network device is freed. If
separately allocated data is attached to the network device
(netdev_priv()), then it is up to the module exit handler to free that.

There are two groups of APIs for registering struct net_device.
The first group can be used in normal contexts where ``rtnl_lock`` is not
already held: register_netdev(), unregister_netdev().
The second group can be used when ``rtnl_lock`` is already held:
register_netdevice(), unregister_netdevice(), free_netdev().

Simple drivers
--------------

Most drivers (especially device drivers) handle the lifetime of struct
net_device in a context where ``rtnl_lock`` is not held (e.g. driver probe
and remove paths).

In that case the struct net_device registration is done using
the register_netdev() and unregister_netdev() functions:

.. code-block:: c

  int probe()
  {
    struct my_device_priv *priv;
    struct net_device *dev;
    int err;

    dev = alloc_netdev_mqs(...);
    if (!dev)
      return -ENOMEM;
    priv = netdev_priv(dev);

    /* ... do all device setup before calling register_netdev() ...
     */

    err = register_netdev(dev);
    if (err)
      goto err_undo;

    /* net_device is visible to the user! */

    return 0;

  err_undo:
    /* ... undo the device setup ... */
    free_netdev(dev);
    return err;
  }

  void remove()
  {
    /* dev is the net_device that probe() allocated and registered */
    unregister_netdev(dev);
    free_netdev(dev);
  }

Note that after calling register_netdev() the device is visible in the system.
Users can open it and start sending / receiving traffic immediately,
or run any other callback, so all initialization must be done prior to
registration.

unregister_netdev() closes the device and waits for all users to be done
with it. The memory of struct net_device itself may still be referenced
by sysfs but all operations on that device will fail.

free_netdev() can be called after unregister_netdev() returns or when
register_netdev() failed.
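
The ordering rules above can be illustrated with a small userspace state
machine. This is only a sketch of the call ordering, not kernel code: the
``model_*`` and ``OP_*`` names are hypothetical, and the model assumes that a
device whose registration failed is simply back in the "allocated" state,
from which free_netdev() is permitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lifecycle states for a userspace model; not kernel types. */
enum model_state {
	MODEL_ALLOCATED,    /* alloc_netdev_mqs() succeeded, not visible yet */
	MODEL_REGISTERED,   /* register_netdev() succeeded, visible to users */
	MODEL_UNREGISTERED, /* unregister_netdev() has returned */
	MODEL_FREED,        /* free_netdev() has been called */
};

enum model_op { OP_REGISTER_OK, OP_REGISTER_FAIL, OP_UNREGISTER, OP_FREE };

/*
 * Apply one operation to the model. Returns false and leaves *st unchanged
 * when the call would violate the ordering described in the text.
 */
static bool model_step(enum model_state *st, enum model_op op)
{
	assert(st);

	switch (op) {
	case OP_REGISTER_OK:
		if (*st != MODEL_ALLOCATED)
			return false;
		*st = MODEL_REGISTERED;
		return true;
	case OP_REGISTER_FAIL:
		/* A failed registration leaves the device allocated only. */
		return *st == MODEL_ALLOCATED;
	case OP_UNREGISTER:
		if (*st != MODEL_REGISTERED)
			return false;
		*st = MODEL_UNREGISTERED;
		return true;
	case OP_FREE:
		/* Never free a device that is still registered. */
		if (*st != MODEL_ALLOCATED && *st != MODEL_UNREGISTERED)
			return false;
		*st = MODEL_FREED;
		return true;
	}
	return false;
}
```

In this model, freeing a registered device is rejected, while the sequence
register, unregister, free is accepted, which mirrors the probe() and
remove() paths shown earlier.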

Device management under RTNL
----------------------------

Registering struct net_device while in a context which already holds
the ``rtnl_lock`` requires extra care. In those scenarios most drivers
will want to make use of struct net_device's ``needs_free_netdev``
and ``priv_destructor`` members for freeing of state.

Example flow of netdev handling under ``rtnl_lock``:

.. code-block:: c

  static void my_setup(struct net_device *dev)
  {
    dev->needs_free_netdev = true;
  }

  static void my_destructor(struct net_device *dev)
  {
    struct my_device_priv *priv = netdev_priv(dev);

    some_obj_destroy(priv->obj);
    some_uninit(priv);
  }

  int create_link()
  {
    struct my_device_priv *priv;
    struct net_device *dev;
    int err;

    ASSERT_RTNL();

    dev = alloc_netdev(sizeof(*priv), "net%d", NET_NAME_UNKNOWN, my_setup);
    if (!dev)
      return -ENOMEM;
    priv = netdev_priv(dev);

    /* Implicit constructor */
    err = some_init(priv);
    if (err)
      goto err_free_dev;

    priv->obj = some_obj_create();
    if (!priv->obj) {
      err = -ENOMEM;
      goto err_some_uninit;
    }
    /* End of constructor, set the destructor: */
    dev->priv_destructor = my_destructor;

    err = register_netdevice(dev);
    if (err)
      /* register_netdevice() calls destructor on failure */
      goto err_free_dev;

    /* If anything fails now unregister_netdevice() (or unregister_netdev())
     * will take care of calling my_destructor and free_netdev().
     */

    return 0;

  err_some_uninit:
    some_uninit(priv);
  err_free_dev:
    free_netdev(dev);
    return err;
  }

If struct net_device.priv_destructor is set, it will be called by the core
some time after unregister_netdevice(); it will also be called if
register_netdevice() fails. The callback may be invoked with or without
``rtnl_lock`` held.

There is no explicit constructor callback; the driver "constructs" the
private netdev state after allocating it and before registration.

Setting struct net_device.needs_free_netdev makes the core call free_netdev()
automatically after unregister_netdevice() when all references to the device
are gone. It only takes effect after a successful call to register_netdevice(),
so if register_netdevice() fails the driver is responsible for calling
free_netdev().
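
The question of who ends up calling free_netdev() can be condensed into a
small decision function. The sketch below is a userspace model with
hypothetical names (``struct model_netdev``, ``driver_calls_free_netdev()``).
It assumes that, when ``needs_free_netdev`` is not set, the driver frees the
device itself after unregistering, as in the simple driver case. It does not
cover devices spawned from struct rtnl_link_ops.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of one device's history; not a kernel structure. */
struct model_netdev {
	bool needs_free_netdev;  /* value the driver set before registering */
	bool register_succeeded; /* outcome of register_netdevice() */
};

/* Returns true when the driver itself has to call free_netdev(). */
static bool driver_calls_free_netdev(const struct model_netdev *d)
{
	assert(d);

	/* needs_free_netdev only takes effect after a successful register. */
	if (!d->register_succeeded)
		return true;

	/* Registered: the core frees the device only if asked to. */
	return !d->needs_free_netdev;
}
```

For the create_link() flow above, ``needs_free_netdev`` is true, so the model
says the driver frees the device only on the failed-registration path.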

free_netdev() is safe to call on error paths right after unregister_netdevice()
or when register_netdevice() fails. Parts of the netdev (de)registration
process happen after ``rtnl_lock`` is released, therefore in those cases
free_netdev() will defer some of the processing until ``rtnl_lock`` is
released.

Devices spawned from struct rtnl_link_ops should never free the
struct net_device directly.

.ndo_init and .ndo_uninit
~~~~~~~~~~~~~~~~~~~~~~~~~

``.ndo_init`` and ``.ndo_uninit`` callbacks are called during net_device
registration and de-registration, under ``rtnl_lock``. Drivers can use
those e.g. when parts of their init process need to run under ``rtnl_lock``.

``.ndo_init`` runs before the device is visible in the system. ``.ndo_uninit``
runs during de-registration, after the device is closed, but other subsystems
may still have outstanding references to the netdevice.

MTU
===
Each network device has a Maximum Transmission Unit (MTU). Upper layer
protocols must not pass a socket buffer (skb) to a device to transmit with
more data than the MTU. The MTU does not include link layer header overhead,
so for example on Ethernet, if the standard MTU of 1500 bytes is used, the
actual skb will contain up to 1514 bytes because of the Ethernet
header. Devices should allow for the 4 byte VLAN header as well.

Segmentation Offload (GSO, TSO) is an exception to this rule.  The
upper layer protocol may pass a large socket buffer to the device
transmit routine, and the device will break that up into separate
packets based on the current MTU.
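
As a rough illustration of that splitting, the number of packets produced
from one large buffer can be estimated with a ceiling division. This is a
userspace sketch, not the kernel's GSO code: ``segment_count()`` and its
``hdr_len`` parameter (the network and transport header bytes that each
packet must carry inside the MTU) are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Estimate how many packets a large payload is split into.
 * mtu:     device MTU in bytes (link layer header not included)
 * hdr_len: assumed per-packet network + transport header bytes
 */
static size_t segment_count(size_t payload_len, size_t mtu, size_t hdr_len)
{
	size_t per_packet;

	assert(mtu > hdr_len);
	per_packet = mtu - hdr_len;  /* payload bytes that fit in one packet */

	/* Ceiling division: a partial last packet still counts as one. */
	return (payload_len + per_packet - 1) / per_packet;
}
```

With an MTU of 1500 and 40 bytes of assumed headers, each packet carries
1460 payload bytes, so a 64000 byte payload needs 44 packets in this model.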

MTU is symmetrical and applies both to receive and transmit. A device
must be able to receive at least the maximum size packet allowed by
the MTU. A network device may use the MTU as a mechanism to size receive
buffers, but the device should allow packets with a VLAN header. With
the standard Ethernet MTU of 1500 bytes, the device should allow up to
1518 byte packets (1500 + 14 header + 4 tag).  The device may
drop, truncate, or pass up oversize packets, but dropping oversize
packets is preferred.
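
The receive-side arithmetic can be written down directly. The sketch below is
plain userspace C with hypothetical names (``MODEL_ETH_HLEN``,
``MODEL_VLAN_HLEN``, ``max_rx_frame_len()``, ``rx_should_drop()``). It assumes
a single VLAN tag and applies the preferred policy of dropping oversize
frames.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_ETH_HLEN  14 /* Ethernet header bytes */
#define MODEL_VLAN_HLEN 4  /* one VLAN tag */

/* Largest frame a device should accept for a given MTU. */
static size_t max_rx_frame_len(size_t mtu)
{
	assert(mtu > 0);
	return mtu + MODEL_ETH_HLEN + MODEL_VLAN_HLEN;
}

/* Preferred policy from the text: drop anything larger than the limit. */
static bool rx_should_drop(size_t frame_len, size_t mtu)
{
	return frame_len > max_rx_frame_len(mtu);
}
```

For the standard MTU of 1500 this gives the 1518 byte limit quoted above; a
1519 byte frame would be dropped in this model.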


struct net_device synchronization rules
=======================================
ndo_open:
        Synchronization: rtnl_lock() semaphore.
        Context: process

ndo_stop:
        Synchronization: rtnl_lock() semaphore.
        Context: process
        Note: netif_running() is guaranteed false

ndo_do_ioctl:
        Synchronization: rtnl_lock() semaphore.
        Context: process

        This is only called by network subsystems internally,
        not by user space calling ioctl as it was before
        Linux 5.14.

ndo_siocbond:
        Synchronization: rtnl_lock() semaphore.
        Context: process

        Used by the bonding driver for the SIOCBOND family of
        ioctl commands.

ndo_siocwandev:
        Synchronization: rtnl_lock() semaphore.
        Context: process

        Used by the drivers/net/wan framework to handle
        the SIOCWANDEV ioctl with the if_settings structure.

ndo_siocdevprivate:
        Synchronization: rtnl_lock() semaphore.
        Context: process

        This is used to implement SIOCDEVPRIVATE ioctl helpers.
        Do not add these to new drivers.

ndo_eth_ioctl:
        Synchronization: rtnl_lock() semaphore.
        Context: process

ndo_get_stats:
        Synchronization: rtnl_lock() semaphore, dev_base_lock rwlock, or RCU.
        Context: atomic (can't sleep under rwlock or RCU)

ndo_start_xmit:
        Synchronization: __netif_tx_lock spinlock.

        When the driver sets NETIF_F_LLTX in dev->features this will be
        called without holding netif_tx_lock. In this case the driver
        has to lock by itself when needed.
        The locking there should also properly protect against
        set_rx_mode. WARNING: use of NETIF_F_LLTX is deprecated.
        Don't use it for new drivers.

        Context: Process with BHs disabled or BH (timer),
                 will be called with interrupts disabled by netconsole.

        Return codes:

        * NETDEV_TX_OK: everything is OK.
        * NETDEV_TX_BUSY: cannot transmit the packet, try later.
          Usually a bug, means queue start/stop flow control is broken in
          the driver. Note: the driver must NOT put the skb in its DMA ring.
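
The remark about queue start/stop flow control can be made concrete with a
small userspace model. It is a sketch under stated assumptions, not driver
code: every ``model_*`` / ``MODEL_*`` name is hypothetical, the ring has a
fixed size, and the "stack" is reduced to a function that never calls the
transmit routine on a stopped queue. The point is that a driver which stops
its queue as soon as the ring fills up never has to return the BUSY code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_RING_SIZE 4 /* hypothetical number of TX descriptors */

enum model_tx_ret { MODEL_TX_OK, MODEL_TX_BUSY };

struct model_txq {
	size_t in_flight; /* descriptors handed to the imaginary hardware */
	bool stopped;     /* stands in for the "queue stopped" flag */
};

/* Transmit routine of the model, playing the role of ndo_start_xmit. */
static enum model_tx_ret model_xmit(struct model_txq *q)
{
	assert(q);

	if (q->in_flight == MODEL_RING_SIZE)
		return MODEL_TX_BUSY; /* flow control failed; nothing queued */

	q->in_flight++;

	/* Stop the queue once the ring is full so we are not called again. */
	if (q->in_flight == MODEL_RING_SIZE)
		q->stopped = true;

	return MODEL_TX_OK;
}

/* Completion path: one descriptor finished, so restart the queue. */
static void model_tx_complete(struct model_txq *q)
{
	assert(q && q->in_flight > 0);
	q->in_flight--;
	q->stopped = false;
}

/* The stack in this model only transmits on a queue that is not stopped. */
static bool model_stack_send(struct model_txq *q)
{
	assert(q);

	if (q->stopped)
		return false; /* packet stays queued above the driver */

	return model_xmit(q) == MODEL_TX_OK;
}
```

Calling ``model_stack_send()`` five times on an empty queue succeeds four
times and is refused once, without ``model_xmit()`` ever reporting
``MODEL_TX_BUSY``. Calling ``model_xmit()`` directly on a full ring shows the
broken-flow-control case.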

ndo_tx_timeout:
        Synchronization: netif_tx_lock spinlock; all TX queues frozen.
        Context: BHs disabled
        Notes: netif_queue_stopped() is guaranteed true

ndo_set_rx_mode:
        Synchronization: netif_addr_lock spinlock.
        Context: BHs disabled

struct napi_struct synchronization rules
========================================
napi->poll:
        Synchronization:
                NAPI_STATE_SCHED bit in napi->state.  Device
                driver's ndo_stop method will invoke napi_disable() on
                all NAPI instances which will do a sleeping poll on the
                NAPI_STATE_SCHED napi->state bit, waiting for all pending
                NAPI activity to cease.

        Context:
                 softirq
                 will be called with interrupts disabled by netconsole.
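
The idea of using a single state bit as the synchronization point can be
sketched in userspace with C11 atomics. This is a simplified model, not the
kernel implementation: ``MODEL_STATE_SCHED`` and the ``model_*`` functions are
hypothetical stand-ins, and the model only shows that one caller at a time can
own the bit and that a disabling path can wait until nobody owns it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_STATE_SCHED 0x1u /* hypothetical stand-in for NAPI_STATE_SCHED */

struct model_napi {
	atomic_uint state;
};

/*
 * Try to claim the instance for polling. Only the caller that flips the
 * bit from 0 to 1 gets true; everyone else sees it already set.
 */
static bool model_schedule_prep(struct model_napi *n)
{
	unsigned int old;

	assert(n);
	old = atomic_fetch_or(&n->state, MODEL_STATE_SCHED);
	return !(old & MODEL_STATE_SCHED);
}

/* Polling finished: release the bit so the instance can be claimed again. */
static void model_complete(struct model_napi *n)
{
	assert(n);
	atomic_fetch_and(&n->state, ~MODEL_STATE_SCHED);
}

/* What a disabling path would wait for in this model: nobody owns the bit. */
static bool model_is_idle(struct model_napi *n)
{
	assert(n);
	return !(atomic_load(&n->state) & MODEL_STATE_SCHED);
}
```

In this model a second ``model_schedule_prep()`` call fails until
``model_complete()`` has run, which is the property that keeps two poll
invocations for the same instance from running concurrently.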