.. SPDX-License-Identifier: GPL-2.0

=============================
Kernel Connection Multiplexor
=============================

Kernel Connection Multiplexor (KCM) is a mechanism that provides a message based
interface over TCP for generic application protocols. With KCM an application
can efficiently send and receive application protocol messages over TCP using
datagram sockets.

KCM implements an NxM multiplexor in the kernel as diagrammed below::

    +------------+   +------------+   +------------+   +------------+
    | KCM socket |   | KCM socket |   | KCM socket |   | KCM socket |
    +------------+   +------------+   +------------+   +------------+
        |                 |               |                |
        +-----------+     |               |     +----------+
                    |     |               |     |
                +----------------------------------+
                |           Multiplexor            |
                +----------------------------------+
                    |   |           |           |  |
        +---------+   |           |           |  ------------+
        |             |           |           |              |
    +----------+  +----------+  +----------+  +----------+ +----------+
    |  Psock   |  |  Psock   |  |  Psock   |  |  Psock   | |  Psock   |
    +----------+  +----------+  +----------+  +----------+ +----------+
        |              |           |            |             |
    +----------+  +----------+  +----------+  +----------+ +----------+
    | TCP sock |  | TCP sock |  | TCP sock |  | TCP sock | | TCP sock |
    +----------+  +----------+  +----------+  +----------+ +----------+

KCM sockets
===========

The KCM sockets provide the user interface to the multiplexor. All the KCM sockets
bound to a multiplexor are considered to have equivalent function, and I/O
operations on different sockets may be done in parallel without the need for
synchronization between threads in userspace.

Multiplexor
===========

The multiplexor provides the message steering. In the transmit path, messages
written on a KCM socket are sent atomically on an appropriate TCP socket.
Similarly, in the receive path, messages are constructed on each TCP socket
(Psock) and complete messages are steered to a KCM socket.

TCP sockets & Psocks
====================

TCP sockets may be bound to a KCM multiplexor. A Psock structure is allocated
for each bound TCP socket; this structure holds the state for constructing
messages on receive as well as other connection specific information for KCM.

Connected mode semantics
========================

Each multiplexor assumes that all attached TCP connections are to the same
destination and can use the different connections for load balancing when
transmitting. The normal send and recv calls (including sendmmsg and recvmmsg)
can be used to send and receive messages from the KCM socket.

Socket types
============

KCM supports SOCK_DGRAM and SOCK_SEQPACKET socket types.

Message delineation
-------------------

Messages are sent over a TCP stream with some application protocol message
format that typically includes a header which frames the messages. The length
of a received message can be deduced from the application protocol header
(often just a simple length field).

A TCP stream must be parsed to determine message boundaries. Berkeley Packet
Filter (BPF) is used for this. When attaching a TCP socket to a multiplexor, a
BPF program must be specified. The program is called at the start of receiving
a new message and is given an skbuff that contains the bytes received so far.
It parses the message header and returns the length of the message. Given this
information, KCM will construct the message of the stated length and deliver it
to a KCM socket.
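
As a rough illustration of what such a parser has to compute, the userspace
sketch below applies one possible framing rule: a 4-byte big-endian payload
length followed by the payload. The function name ``frame_length`` and the
header layout are assumptions made for this example; in KCM the equivalent
logic runs as a BPF program inside the kernel.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of KCM: return the total length
 * (header plus payload) of the message that starts at buf, or 0 if
 * fewer than 4 header bytes are available yet.
 * Assumes a 4-byte big-endian payload length at offset 0.
 */
uint64_t frame_length(const uint8_t *buf, size_t avail)
{
    uint32_t payload;

    if (avail < 4)
        return 0; /* header is not complete yet */

    payload = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
              ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];

    /* Widen before adding so a huge length field cannot wrap around. */
    return (uint64_t)payload + 4;
}
```

The result is widened to 64 bits so that a hostile length field close to
``UINT32_MAX`` does not overflow when the header size is added.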

TCP socket management
---------------------

When a TCP socket is attached to a KCM multiplexor, data ready (POLLIN) and
write space available (POLLOUT) events are handled by the multiplexor. If there
is a state change (disconnection) or other error on a TCP socket, an error is
posted on the TCP socket so that a POLLERR event happens and KCM discontinues
using the socket. When the application gets the error notification for a
TCP socket, it should unattach the socket from KCM and then handle the error
condition (the typical response is to close the socket and create a new
connection if necessary).

KCM limits the maximum receive message size to be the size of the receive
socket buffer on the attached TCP socket (the socket buffer size can be set by
SO_RCVBUF). If the length of a new message reported by the BPF program is
greater than this limit, a corresponding error (EMSGSIZE) is posted on the TCP
socket. The BPF program may also enforce a maximum message size and report an
error when it is exceeded.

A timeout may be set for assembling messages on a receive socket. The timeout
value is taken from the receive timeout of the attached TCP socket (this is set
by SO_RCVTIMEO). If the timer expires before assembly is complete, an error
(ETIMEDOUT) is posted on the socket.
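
Both limits are ordinary socket options on the TCP socket, so they can be set
with setsockopt. The sketch below is a minimal example; the helper name
``kcm_prepare_tcp`` and its parameters are hypothetical, and it assumes the
options are applied before the socket is attached.

```c
#include <sys/socket.h>
#include <sys/time.h>

/*
 * Hypothetical helper: set the two TCP socket options that KCM consults.
 * Returns 0 on success, -1 with errno set on failure.
 */
int kcm_prepare_tcp(int tcpfd, int rcvbuf_bytes, int assembly_timeout_sec)
{
    struct timeval tv = { .tv_sec = assembly_timeout_sec, .tv_usec = 0 };

    /* Receive buffer size: bounds the largest message KCM will assemble. */
    if (setsockopt(tcpfd, SOL_SOCKET, SO_RCVBUF,
                   &rcvbuf_bytes, sizeof(rcvbuf_bytes)) < 0)
        return -1;

    /* Receive timeout: bounds how long message assembly may take. */
    if (setsockopt(tcpfd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0)
        return -1;

    return 0;
}
```

Note that Linux adjusts the value passed to SO_RCVBUF (see socket(7)), so the
effective limit is best read back with getsockopt rather than assumed.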

User interface
==============

Creating a multiplexor
----------------------

A new multiplexor and its initial KCM socket are created by a socket call::

  socket(AF_KCM, type, protocol)

- type is either SOCK_DGRAM or SOCK_SEQPACKET
- protocol is KCMPROTO_CONNECTED

Cloning KCM sockets
-------------------

After the first KCM socket is created using the socket call as described
above, additional sockets for the multiplexor can be created by cloning
a KCM socket. This is accomplished by an ioctl on a KCM socket::

  /* From linux/kcm.h */
  struct kcm_clone {
        int fd;
  };

  struct kcm_clone info;

  memset(&info, 0, sizeof(info));

  err = ioctl(kcmfd, SIOCKCMCLONE, &info);

  if (!err)
    newkcmfd = info.fd;

Attach transport sockets
------------------------

Attaching of transport sockets to a multiplexor is performed by calling an
ioctl on a KCM socket for the multiplexor. For example::

  /* From linux/kcm.h */
  struct kcm_attach {
        int fd;
        int bpf_fd;
  };

  struct kcm_attach info;

  memset(&info, 0, sizeof(info));

  info.fd = tcpfd;
  info.bpf_fd = bpf_prog_fd;

  ioctl(kcmfd, SIOCKCMATTACH, &info);

The kcm_attach structure contains:

  - fd: file descriptor for the TCP socket being attached
  - bpf_fd: file descriptor for the compiled BPF program that was loaded

Unattach transport sockets
--------------------------

Unattaching a transport socket from a multiplexor is straightforward. An
"unattach" ioctl is done with the kcm_unattach structure as the argument::

  /* From linux/kcm.h */
  struct kcm_unattach {
        int fd;
  };

  struct kcm_unattach info;

  memset(&info, 0, sizeof(info));

  info.fd = tcpfd;

  ioctl(kcmfd, SIOCKCMUNATTACH, &info);

Disabling receive on KCM socket
-------------------------------

A setsockopt is used to disable or enable receiving on a KCM socket.
When receive is disabled, any pending messages in the socket's
receive buffer are moved to other sockets. This feature is useful
if an application thread knows that it will be doing a lot of
work on a request and won't be able to service new messages for a
while. Example use::

  int val = 1;

  setsockopt(kcmfd, SOL_KCM, KCM_RECV_DISABLE, &val, sizeof(val));

BPF programs for message delineation
------------------------------------

BPF programs can be compiled using the BPF LLVM backend. For example,
the BPF program for parsing Thrift is::

  #include "bpf.h" /* for __sk_buff */
  #include "bpf_helpers.h" /* for load_word intrinsic */

  SEC("socket_kcm")
  int bpf_prog1(struct __sk_buff *skb)
  {
       return load_word(skb, 0) + 4;
  }

  char _license[] SEC("license") = "GPL";

Use in applications
===================

KCM accelerates application layer protocols. Specifically, it allows
applications to use a message based interface for sending and receiving
messages. The kernel provides necessary assurances that messages are sent
and received atomically. This relieves much of the burden applications have
in mapping a message based protocol onto the TCP stream. KCM also makes
application layer messages a unit of work in the kernel for the purposes of
steering and scheduling, which in turn allows a simpler networking model in
multithreaded applications.

Configurations
--------------

In an Nx1 configuration, KCM logically provides multiple socket handles
to the same TCP connection. This allows parallelism between I/O
operations on the TCP socket (for instance copyin and copyout of data is
parallelized). In an application, a KCM socket can be opened for each
processing thread and added to an epoll set (similar to how SO_REUSEPORT
is used to allow multiple listener sockets on the same port).

In an MxN configuration, multiple connections are established to the
same destination. These are used for simple load balancing.

Message batching
----------------

The primary purpose of KCM is load balancing between KCM sockets and hence
threads in a nominal use case. Perfect load balancing, that is steering
each received message to a different KCM socket or steering each sent
message to a different TCP socket, can negatively impact performance
since this doesn't allow for affinities to be established. Balancing
based on groups, or batches of messages, can be beneficial for performance.

On transmit, there are three ways an application can batch (pipeline)
messages on a KCM socket.

  1) Send multiple messages in a single sendmmsg.
  2) Send a group of messages each with a sendmsg call, where all messages
     except the last have MSG_BATCH in the flags of the sendmsg call.
  3) Create a "super message" composed of multiple messages and send this
     with a single sendmsg.

On receive, the KCM module attempts to queue the messages received during
each TCP ready callback on the same KCM socket. The targeted KCM socket
changes at each receive ready callback on the KCM socket. The application
does not need to configure this.

Error handling
--------------

An application should include a thread to monitor errors raised on
the TCP connection. Normally, this will be done by placing each
TCP socket attached to a KCM multiplexor in an epoll set for the POLLERR
event. If an error occurs on an attached TCP socket, KCM sets EPIPE
on the socket, thus waking up the application thread. When the application
sees the error (which may just be a disconnect) it should unattach the
socket from KCM and then close it. It is assumed that once an error is
posted on the TCP socket the data stream is unrecoverable (i.e. an error
may have occurred in the middle of receiving a message).

TCP connection monitoring
-------------------------

In KCM there is no means to correlate a message to the TCP socket that
was used to send or receive the message (except in the case there is
only one attached TCP socket). However, the application does retain
an open file descriptor to the socket so it will be able to get statistics
from the socket which can be used in detecting issues (such as high
retransmissions on the socket).