.. SPDX-License-Identifier: GPL-2.0

=============================
Kernel Connection Multiplexor
=============================

Kernel Connection Multiplexor (KCM) is a mechanism that provides a message-based
interface over TCP for generic application protocols. With KCM an application
can efficiently send and receive application protocol messages over TCP using
datagram sockets.

KCM implements an NxM multiplexor in the kernel as diagrammed below::

    +------------+   +------------+   +------------+   +------------+
    | KCM socket |   | KCM socket |   | KCM socket |   | KCM socket |
    +------------+   +------------+   +------------+   +------------+
          |                 |               |                |
          +-----------+     |               |     +----------+
                      |     |               |     |
                  +----------------------------------+
                  |           Multiplexor            |
                  +----------------------------------+
                     |   |           |           |  |
         +-----------+   |           |           |  +-----------+
         |               |           |           |              |
    +----------+  +----------+  +----------+  +----------+ +----------+
    |  Psock   |  |  Psock   |  |  Psock   |  |  Psock   | |  Psock   |
    +----------+  +----------+  +----------+  +----------+ +----------+
         |              |            |             |            |
    +----------+  +----------+  +----------+  +----------+ +----------+
    | TCP sock |  | TCP sock |  | TCP sock |  | TCP sock | | TCP sock |
    +----------+  +----------+  +----------+  +----------+ +----------+

KCM sockets
===========

The KCM sockets provide the user interface to the multiplexor. All the KCM sockets
bound to a multiplexor are considered to have equivalent function, and I/O
operations on different sockets may be done in parallel without the need for
synchronization between threads in userspace.

Multiplexor
===========

The multiplexor provides the message steering. In the transmit path, messages
written on a KCM socket are sent atomically on an appropriate TCP socket.
Similarly, in the receive path, messages are constructed on each TCP socket
(Psock) and complete messages are steered to a KCM socket.

TCP sockets & Psocks
====================

TCP sockets may be bound to a KCM multiplexor. A Psock structure is allocated
for each bound TCP socket. This structure holds the state for constructing
messages on receive as well as other connection-specific information for KCM.

Connected mode semantics
========================

Each multiplexor assumes that all attached TCP connections are to the same
destination and can use the different connections for load balancing when
transmitting. The normal send and recv calls (including sendmmsg and recvmmsg)
can be used to send and receive messages from the KCM socket.
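
As an illustration of the send and receive calls, the following is a minimal
sketch rather than a reference implementation. It assumes a framing of a 4-byte
big-endian length followed by the payload; the helper names
``kcm_send_framed`` and ``kcm_recv_message`` are hypothetical and not part of
any KCM API.

```c
#include <arpa/inet.h>   /* htonl */
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>     /* struct iovec */

/*
 * Hypothetical helper: send one application message on a KCM socket.
 * Assumes a 4-byte big-endian length header followed by the payload.
 * Returns the number of bytes sent, or -1 with errno set.
 */
static ssize_t kcm_send_framed(int kcmfd, const void *payload, uint32_t len)
{
    uint32_t hdr = htonl(len);
    struct iovec iov[2] = {
        { .iov_base = &hdr, .iov_len = sizeof(hdr) },
        { .iov_base = (void *)payload, .iov_len = len },
    };
    struct msghdr msg = { .msg_iov = iov, .msg_iovlen = 2 };

    /* One sendmsg() call carries exactly one application message. */
    return sendmsg(kcmfd, &msg, 0);
}

/* Receive one complete message (header included) into buf. */
static ssize_t kcm_recv_message(int kcmfd, void *buf, size_t buflen)
{
    return recv(kcmfd, buf, buflen, 0);
}
```

Because each call handles a whole message, threads that own different KCM
sockets can use these helpers without sharing any userspace lock.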

Socket types
============

KCM supports SOCK_DGRAM and SOCK_SEQPACKET socket types.

Message delineation
-------------------

Messages are sent over a TCP stream with some application protocol message
format that typically includes a header which frames the messages. The length
of a received message can be deduced from the application protocol header
(often just a simple length field).

A TCP stream must be parsed to determine message boundaries. Berkeley Packet
Filter (BPF) is used for this. When attaching a TCP socket to a multiplexor, a
BPF program must be specified. The program is called at the start of receiving
a new message and is given an skbuff that contains the bytes received so far.
It parses the message header and returns the length of the message. Given this
information, KCM will construct the message of the stated length and deliver it
to a KCM socket.
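
The parsing step can be modelled in plain userspace C. The sketch below is not
a BPF program; it only illustrates the arithmetic under the assumption of a
4-byte big-endian length header that counts the payload alone. The function
name ``frame_length`` and the "return 0 when the header is incomplete"
convention are choices made for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace model of a length-prefix parser (not the BPF program itself).
 * Assumes a 4-byte big-endian length header that counts only the payload.
 * Returns the total message length, or 0 if the header is still incomplete.
 */
static uint32_t frame_length(const uint8_t *data, size_t avail)
{
    uint32_t payload_len;

    if (avail < 4)
        return 0;   /* need more bytes before the length is known */

    payload_len = ((uint32_t)data[0] << 24) | ((uint32_t)data[1] << 16) |
                  ((uint32_t)data[2] << 8)  |  (uint32_t)data[3];

    return payload_len + 4;   /* the header bytes are part of the message */
}
```

For a header of ``00 00 00 05`` the function reports a 9-byte message: four
header bytes plus five payload bytes.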

TCP socket management
---------------------

When a TCP socket is attached to a KCM multiplexor, data ready (POLLIN) and
write space available (POLLOUT) events are handled by the multiplexor. If there
is a state change (disconnection) or other error on a TCP socket, an error is
posted on the TCP socket so that a POLLERR event happens and KCM discontinues
using the socket. When the application gets the error notification for a
TCP socket, it should unattach the socket from KCM and then handle the error
condition (the typical response is to close the socket and create a new
connection if necessary).

KCM limits the maximum receive message size to the size of the receive
socket buffer on the attached TCP socket (the socket buffer size can be set by
SO_RCVBUF). If the length of a new message reported by the BPF program is
greater than this limit, a corresponding error (EMSGSIZE) is posted on the TCP
socket. The BPF program may also enforce a maximum message size and report an
error when it is exceeded.

A timeout may be set for assembling messages on a receive socket. The timeout
value is taken from the receive timeout of the attached TCP socket (this is set
by SO_RCVTIMEO). If the timer expires before assembly is complete, an error
(ETIMEDOUT) is posted on the socket.
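
Both limits are ordinary socket options on the TCP socket, so they can be set
before the socket is attached. A minimal sketch, assuming the hypothetical
helper name ``tcp_set_kcm_limits`` and caller-chosen values:

```c
#include <sys/socket.h>
#include <sys/time.h>   /* struct timeval */

/*
 * Hypothetical helper: configure a TCP socket before attaching it to KCM.
 * rcvbuf_bytes requests the receive buffer size that bounds message size;
 * timeout_sec bounds how long assembly of one message may take.
 * Returns 0 on success, or -1 with errno set.
 */
static int tcp_set_kcm_limits(int tcpfd, int rcvbuf_bytes, int timeout_sec)
{
    struct timeval tv = { .tv_sec = timeout_sec, .tv_usec = 0 };

    if (setsockopt(tcpfd, SOL_SOCKET, SO_RCVBUF,
                   &rcvbuf_bytes, sizeof(rcvbuf_bytes)) < 0)
        return -1;

    return setsockopt(tcpfd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}
```

Note that Linux doubles the value passed to SO_RCVBUF, so reading the option
back with getsockopt reports a larger size than the one requested.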

User interface
==============

Creating a multiplexor
----------------------

A new multiplexor and initial KCM socket is created by a socket call::

    socket(AF_KCM, type, protocol)

- type is either SOCK_DGRAM or SOCK_SEQPACKET
- protocol is KCMPROTO_CONNECTED

Cloning KCM sockets
-------------------

After the first KCM socket is created using the socket call as described
above, additional sockets for the multiplexor can be created by cloning
a KCM socket. This is accomplished by an ioctl on a KCM socket::

    /* From linux/kcm.h */
    struct kcm_clone {
        int fd;
    };

    struct kcm_clone info;

    memset(&info, 0, sizeof(info));

    err = ioctl(kcmfd, SIOCKCMCLONE, &info);

    if (!err)
        newkcmfd = info.fd;

Attach transport sockets
------------------------

Attaching of transport sockets to a multiplexor is performed by calling an
ioctl on a KCM socket for the multiplexor, e.g.::

    /* From linux/kcm.h */
    struct kcm_attach {
        int fd;
        int bpf_fd;
    };

    struct kcm_attach info;

    memset(&info, 0, sizeof(info));

    info.fd = tcpfd;
    info.bpf_fd = bpf_prog_fd;

    ioctl(kcmfd, SIOCKCMATTACH, &info);

The kcm_attach structure contains:

- fd: file descriptor for the TCP socket being attached
- bpf_fd: file descriptor for the compiled BPF program that has been loaded

Unattach transport sockets
--------------------------

Unattaching a transport socket from a multiplexor is straightforward. An
"unattach" ioctl is done with the kcm_unattach structure as the argument::

    /* From linux/kcm.h */
    struct kcm_unattach {
        int fd;
    };

    struct kcm_unattach info;

    memset(&info, 0, sizeof(info));

    info.fd = tcpfd;

    ioctl(kcmfd, SIOCKCMUNATTACH, &info);

Disabling receive on KCM socket
-------------------------------

A setsockopt is used to disable or enable receiving on a KCM socket.
When receive is disabled, any pending messages in the socket's
receive buffer are moved to other sockets. This feature is useful
if an application thread knows that it will be doing a lot of
work on a request and won't be able to service new messages for a
while. Example use::

    int val = 1;

    setsockopt(kcmfd, SOL_KCM, KCM_RECV_DISABLE, &val, sizeof(val));

BPF programs for message delineation
------------------------------------

BPF programs can be compiled using the BPF LLVM backend. For example,
the BPF program for parsing Thrift is::

    #include "bpf.h" /* for __sk_buff */
    #include "bpf_helpers.h" /* for load_word intrinsic */

    SEC("socket_kcm")
    int bpf_prog1(struct __sk_buff *skb)
    {
        /* 32-bit length field at offset 0, plus the 4 header bytes */
        return load_word(skb, 0) + 4;
    }

    char _license[] SEC("license") = "GPL";

Use in applications
===================

KCM accelerates application layer protocols. Specifically, it allows
applications to use a message-based interface for sending and receiving
messages. The kernel provides necessary assurances that messages are sent
and received atomically. This relieves much of the burden applications have
in mapping a message-based protocol onto the TCP stream. KCM also makes
application layer messages a unit of work in the kernel for the purposes of
steering and scheduling, which in turn allows a simpler networking model in
multithreaded applications.

Configurations
--------------

In an Nx1 configuration, KCM logically provides multiple socket handles
to the same TCP connection. This allows parallelism between I/O
operations on the TCP socket (for instance copyin and copyout of data is
parallelized). In an application, a KCM socket can be opened for each
processing thread and added to an epoll set (similar to how SO_REUSEPORT
is used to allow multiple listener sockets on the same port).

In an NxM configuration, multiple connections are established to the
same destination. These are used for simple load balancing.

Message batching
----------------

The primary purpose of KCM is load balancing between KCM sockets and hence
threads in a nominal use case. Perfect load balancing, that is steering
each received message to a different KCM socket or steering each sent
message to a different TCP socket, can negatively impact performance
since this doesn't allow for affinities to be established. Balancing
based on groups, or batches of messages, can be beneficial for performance.

On transmit, there are three ways an application can batch (pipeline)
messages on a KCM socket.

1) Send multiple messages in a single sendmmsg.
2) Send a group of messages each with a sendmsg call, where all messages
   except the last have MSG_BATCH in the flags of the sendmsg call.
3) Create a "super message" composed of multiple messages and send this
   with a single sendmsg.
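
The first option in the list above can be sketched as follows. This is an
illustration under assumptions, not a prescribed pattern: ``kcm_send_batch``
and ``BATCH_MAX`` are hypothetical names, and each ``struct iovec`` is assumed
to hold one complete, already framed message.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE      /* for sendmmsg() and struct mmsghdr */
#endif
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define BATCH_MAX 16     /* arbitrary cap chosen for this sketch */

/*
 * Hypothetical helper: send up to BATCH_MAX already framed messages with a
 * single sendmmsg() call. Returns the number of messages sent, or -1.
 */
static int kcm_send_batch(int kcmfd, struct iovec *msgs, unsigned int count)
{
    struct mmsghdr hdrs[BATCH_MAX];
    unsigned int i;

    if (count > BATCH_MAX)
        count = BATCH_MAX;

    memset(hdrs, 0, sizeof(hdrs));
    for (i = 0; i < count; i++) {
        /* Each mmsghdr describes exactly one application message. */
        hdrs[i].msg_hdr.msg_iov = &msgs[i];
        hdrs[i].msg_hdr.msg_iovlen = 1;
    }

    return sendmmsg(kcmfd, hdrs, count, 0);
}
```

The return value can be smaller than ``count``, in which case the caller
resubmits the remaining messages.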

On receive, the KCM module attempts to queue messages received on the
same KCM socket during each TCP ready callback. The targeted KCM socket
changes at each receive ready callback on the KCM socket. The application
does not need to configure this.

Error handling
--------------

An application should include a thread to monitor errors raised on
the TCP connection. Normally, this will be done by placing each
TCP socket attached to a KCM multiplexor in an epoll set for the POLLERR
event. If an error occurs on an attached TCP socket, KCM sets EPIPE
on the socket, thus waking up the application thread. When the application
sees the error (which may just be a disconnect) it should unattach the
socket from KCM and then close it. It is assumed that once an error is
posted on the TCP socket the data stream is unrecoverable (i.e. an error
may have occurred in the middle of receiving a message).
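
The monitoring thread described above needs two building blocks: registering
a TCP socket for error events and reading the error once it is reported. A
minimal sketch, with the hypothetical helper names ``monitor_add`` and
``socket_pending_error``:

```c
#include <sys/epoll.h>
#include <sys/socket.h>

/* Register tcpfd with epfd so that error conditions wake the monitor. */
static int monitor_add(int epfd, int tcpfd)
{
    struct epoll_event ev = { .events = EPOLLERR, .data.fd = tcpfd };

    return epoll_ctl(epfd, EPOLL_CTL_ADD, tcpfd, &ev);
}

/*
 * Fetch and clear the pending error on a socket.
 * Returns the error code (0 if none), or -1 if getsockopt itself failed.
 */
static int socket_pending_error(int fd)
{
    int err = 0;
    socklen_t len = sizeof(err);

    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
        return -1;
    return err;
}
```

When epoll_wait reports EPOLLERR for a descriptor, the monitor can read the
code with ``socket_pending_error``, unattach the socket from KCM and close it.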

TCP connection monitoring
-------------------------

In KCM there is no means to correlate a message to the TCP socket that
was used to send or receive the message (except in the case there is
only one attached TCP socket). However, the application does retain
an open file descriptor to the socket so it will be able to get statistics
from the socket which can be used in detecting issues (such as high
retransmissions on the socket).