==================
IP over InfiniBand
==================

  The ib_ipoib driver is an implementation of the IP over InfiniBand
  protocol as specified by RFC 4391 and 4392, issued by the IETF ipoib
  working group.  It is a "native" implementation in the sense of
  setting the interface type to ARPHRD_INFINIBAND and the hardware
  address length to 20 (earlier proprietary implementations
  masqueraded to the kernel as ethernet interfaces).

Partitions and P_Keys
=====================

  When the IPoIB driver is loaded, it creates one interface for each
  port using the P_Key at index 0.  To create an interface with a
  different P_Key, write the desired P_Key into the main interface's
  /sys/class/net/<intf name>/create_child file.  For example::

    echo 0x8001 > /sys/class/net/ib0/create_child

  This will create an interface named ib0.8001 with P_Key 0x8001.  To
  remove a subinterface, use the "delete_child" file::

    echo 0x8001 > /sys/class/net/ib0/delete_child

  The P_Key for any interface is given by the "pkey" file, and the
  main interface for a subinterface is given by the "parent" file.

  Child interfaces can also be created and deleted through IPoIB's
  rtnl_link_ops.  Children behave the same regardless of which of the
  two methods created them.

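  As a rough sketch, the sysfs steps above could be wrapped in small
  shell helpers.  The function names and the SYSFS_NET override are
  hypothetical (the override exists only so the helpers can be tried
  against a scratch directory), and the %04x name format is inferred
  from the ib0.8001 example rather than taken from the driver::

    # Hypothetical helpers; SYSFS_NET is an assumed override for testing.
    SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

    # Print the name expected for a child of parent $1 with P_Key $2.
    ipoib_child_name() {
        printf '%s.%04x\n' "$1" "$2"
    }

    # Create a child of parent $1 with P_Key $2 (needs root on real sysfs).
    ipoib_create_child() {
        echo "$2" > "$SYSFS_NET/$1/create_child"
    }

    # Remove the child of parent $1 that uses P_Key $2.
    ipoib_delete_child() {
        echo "$2" > "$SYSFS_NET/$1/delete_child"
    }

    # Print the P_Key of interface $1.
    ipoib_pkey() {
        cat "$SYSFS_NET/$1/pkey"
    }

  For instance, ``ipoib_create_child ib0 0x8001`` followed by
  ``ipoib_pkey ib0.8001`` should report the child's P_Key.  The
  rtnl_link_ops path is usually driven through iproute2, for example
  ``ip link add link ib0 name ib0.8001 type ipoib pkey 0x8001``,
  assuming an iproute2 build that knows the ipoib link type.
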
Datagram vs Connected modes
===========================

  The IPoIB driver supports two modes of operation: datagram and
  connected.  The mode is set and read through an interface's
  /sys/class/net/<intf name>/mode file.

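  A minimal sketch of reading and switching the mode through that
  file follows.  The helper names and the SYSFS_NET override are
  hypothetical; the two accepted strings are the mode names given
  above::

    # Hypothetical helpers; SYSFS_NET is an assumed override for testing.
    SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

    # Print the current mode of interface $1.
    ipoib_get_mode() {
        cat "$SYSFS_NET/$1/mode"
    }

    # Set interface $1 to mode $2, which must be "datagram" or "connected".
    ipoib_set_mode() {
        case "$2" in
            datagram|connected)
                echo "$2" > "$SYSFS_NET/$1/mode"
                ;;
            *)
                echo "unknown mode: $2" >&2
                return 1
                ;;
        esac
    }

  For instance, ``ipoib_set_mode ib0 connected`` followed by
  ``ipoib_get_mode ib0`` should print ``connected`` on a device that
  supports connected mode.
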
  In datagram mode, the IB UD (Unreliable Datagram) transport is used,
  so the interface MTU is equal to the IB L2 MTU minus the IPoIB
  encapsulation header (4 bytes).  For example, in a typical IB
  fabric with a 2K MTU, the IPoIB MTU will be 2048 - 4 = 2044 bytes.

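  The arithmetic above can be written as a one-line helper.  This is
  only an illustration; the function and variable names are
  hypothetical::

    # IPoIB encapsulation header size in bytes, as stated above.
    IPOIB_ENCAP_LEN=4

    # Print the datagram-mode IPoIB MTU for an IB L2 MTU of $1 bytes.
    ipoib_ud_mtu() {
        echo $(( $1 - $IPOIB_ENCAP_LEN ))
    }

  ``ipoib_ud_mtu 2048`` prints 2044, matching the example above, and a
  fabric running a 4K IB MTU would give 4092.
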
  In connected mode, the IB RC (Reliable Connected) transport is used.
  Connected mode takes advantage of the connected nature of the IB
  transport and allows an MTU up to the maximal IP packet size of 64K,
  which reduces the number of IP packets needed for handling large UDP
  datagrams, TCP segments, etc., and increases the performance for
  large messages.

  In connected mode, the interface's UD QP is still used for multicast
  and for communication with peers that don't support connected mode.
  In this case, RX emulation of ICMP PMTU packets is used to cause the
  networking stack to use the smaller UD MTU for these neighbours.

Stateless offloads
==================

  If the IB HW supports IPoIB stateless offloads, IPoIB advertises
  TCP/IP checksum and/or Large Send (LSO) offloading capability to the
  network stack.

  Large Receive (LRO) offloading is also implemented and may be turned
  on/off using ethtool calls.  Currently LRO is supported only for
  checksum offload capable devices.

  Stateless offloads are supported only in datagram mode.

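  As a sketch, the ethtool calls mentioned above might look like the
  following.  The helper names and the ETHTOOL override are
  hypothetical; the override only allows a dry run with another
  command substituted.  The -k/-K options and the lro feature name are
  standard ethtool usage, but whether a given device accepts them
  depends on the hardware::

    # ETHTOOL is an assumed override so the commands can be dry-run.
    ETHTOOL="${ETHTOOL:-ethtool}"

    # Show the offload features currently reported for interface $1.
    ipoib_show_offloads() {
        "$ETHTOOL" -k "$1"
    }

    # Turn LRO on or off ($2) for interface $1.
    ipoib_set_lro() {
        "$ETHTOOL" -K "$1" lro "$2"
    }

  For instance, ``ipoib_set_lro ib0 off`` runs
  ``ethtool -K ib0 lro off``.
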
Interrupt moderation
====================

  If the underlying IB device supports CQ event moderation, one can
  use ethtool to set interrupt mitigation parameters and thus reduce
  the overhead incurred by handling interrupts.  The main code path of
  IPoIB doesn't use events for TX completion signaling so only RX
  moderation is supported.

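  A sketch of setting RX moderation through ethtool's coalescing
  options follows.  The helper names and the ETHTOOL override are
  hypothetical, and the numeric values in the usage note are arbitrary
  placeholders rather than recommended settings::

    # ETHTOOL is an assumed override so the commands can be dry-run.
    ETHTOOL="${ETHTOOL:-ethtool}"

    # Show the coalescing settings of interface $1.
    ipoib_show_coalesce() {
        "$ETHTOOL" -c "$1"
    }

    # Set RX moderation on interface $1: wait up to $2 microseconds or
    # $3 frames before raising an RX event.
    ipoib_set_rx_moderation() {
        "$ETHTOOL" -C "$1" rx-usecs "$2" rx-frames "$3"
    }

  For instance, ``ipoib_set_rx_moderation ib0 10 16`` runs
  ``ethtool -C ib0 rx-usecs 10 rx-frames 16``.  Only the RX parameters
  are used here because, as noted above, TX moderation is not
  supported.
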
Debugging Information
=====================

  By compiling the IPoIB driver with CONFIG_INFINIBAND_IPOIB_DEBUG set
  to 'y', tracing messages are compiled into the driver.  They are
  turned on by setting the module parameters debug_level and
  mcast_debug_level to 1.  These parameters can be controlled at
  runtime through files in /sys/module/ib_ipoib/.

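  A minimal sketch of toggling these parameters at runtime follows.
  The helper names and the SYSFS_MODULE override are hypothetical, and
  the parameters/ subdirectory is assumed here because it is the usual
  sysfs location for module parameters::

    # SYSFS_MODULE is an assumed override for testing.
    SYSFS_MODULE="${SYSFS_MODULE:-/sys/module}"

    # Set ib_ipoib module parameter $1 to value $2 (needs root).
    ipoib_set_param() {
        echo "$2" > "$SYSFS_MODULE/ib_ipoib/parameters/$1"
    }

    # Turn general and multicast tracing on (1) or off (0).
    ipoib_set_tracing() {
        ipoib_set_param debug_level "$1" &&
        ipoib_set_param mcast_debug_level "$1"
    }

  For instance, ``ipoib_set_tracing 1`` enables both kinds of tracing
  and ``ipoib_set_tracing 0`` turns them off again.
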
  CONFIG_INFINIBAND_IPOIB_DEBUG also enables files in the debugfs
  virtual filesystem.  By mounting this filesystem, for example with::

    mount -t debugfs none /sys/kernel/debug

  it is possible to get statistics about multicast groups from the
  files /sys/kernel/debug/ipoib/ib0_mcg and so on.

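  As a small sketch, the multicast statistics file for an interface
  could be read with a helper like the one below.  The function name
  and the DEBUGFS override are hypothetical, and the <intf>_mcg file
  name pattern is inferred from the ib0_mcg example above::

    # DEBUGFS is an assumed override for testing.
    DEBUGFS="${DEBUGFS:-/sys/kernel/debug}"

    # Print multicast group statistics for interface $1.
    ipoib_show_mcg() {
        f="$DEBUGFS/ipoib/${1}_mcg"
        if [ ! -r "$f" ]; then
            echo "$f not readable; is debugfs mounted?" >&2
            return 1
        fi
        cat "$f"
    }

  For instance, ``ipoib_show_mcg ib0`` prints the contents of
  /sys/kernel/debug/ipoib/ib0_mcg when debugfs is mounted there.
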
  The performance impact of this option is negligible, so it
  is safe to enable this option with debug_level set to 0 for normal
  operation.

  CONFIG_INFINIBAND_IPOIB_DEBUG_DATA enables even more debug output in
  the data path when data_debug_level is set to 1.  However, even with
  the output disabled, enabling this configuration option will affect
  performance, because it adds tests to the fast path.

References
==========

  Transmission of IP over InfiniBand (IPoIB) (RFC 4391)
    http://ietf.org/rfc/rfc4391.txt

  IP over InfiniBand (IPoIB) Architecture (RFC 4392)
    http://ietf.org/rfc/rfc4392.txt

  IP over InfiniBand: Connected Mode (RFC 4755)
    http://ietf.org/rfc/rfc4755.txt