============
MSG_ZEROCOPY
============

Intro
=====

The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
The feature is currently implemented for TCP and UDP sockets.


Opportunity and Caveats
-----------------------

Copying large buffers between user process and kernel can be
expensive. Linux supports various interfaces that eschew copying,
such as sendpage and splice. The MSG_ZEROCOPY flag extends the
underlying copy avoidance mechanism to common socket send calls.

Copy avoidance is not a free lunch. As implemented, with page pinning,
it replaces per byte copy cost with page accounting and completion
notification overhead. As a result, MSG_ZEROCOPY is generally only
effective at writes larger than roughly 10 KB.

Page pinning also changes system call semantics. It temporarily shares
the buffer between process and network stack. Unlike with copying, the
process cannot immediately overwrite the buffer after system call
return without possibly modifying the data in flight. Kernel integrity
is not affected, but a buggy program can possibly corrupt its own data
stream.

The kernel returns a notification when it is safe to modify data.
Converting an existing application to MSG_ZEROCOPY is therefore not
always as trivial as just passing the flag.


More Info
---------

Much of this document was derived from a longer paper presented at
netdev 2.1. For more in-depth information see that paper and talk,
the excellent reporting over at LWN.net, or read the original code.

  paper, slides, video
    https://netdevconf.org/2.1/session.html?debruijn

  LWN article
    https://lwn.net/Articles/726917/

  patchset
    [PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY
    https://lore.kernel.org/netdev/20170803202945.70750-1-willemdebruijn.kernel@gmail.com


Interface
=========

Passing the MSG_ZEROCOPY flag is the most obvious step to enable copy
avoidance, but not the only one.

Socket Setup
------------

The kernel is permissive when applications pass undefined flags to the
send system call. By default it simply ignores these. To avoid enabling
copy avoidance mode for legacy processes that accidentally already pass
this flag, a process must first signal intent by setting a socket option:

::

        int one = 1;

        if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
                error(1, errno, "setsockopt zerocopy");

Transmission
------------

The change to send (or sendto, sendmsg, sendmmsg) itself is trivial.
Pass the new flag.

::

        ret = send(fd, buf, sizeof(buf), MSG_ZEROCOPY);

A zerocopy failure will return -1 with errno ENOBUFS. This happens if
the socket option was not set, the socket exceeds its optmem limit or
the user exceeds their ulimit on locked pages.

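One possible reaction to ENOBUFS, sketched here under assumptions, is
to retry the same buffer as a regular copying send. The helper name
zc_send_or_copy and the zc_pending output parameter are hypothetical,
and the sketch assumes SO_ZEROCOPY was already set on the socket. The
fallback definition of MSG_ZEROCOPY is only there for older libc
headers.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000  /* value from the Linux uapi headers */
#endif

/*
 * Try a zerocopy send. If the kernel refuses with ENOBUFS, retry the same
 * buffer as a regular copying send. *zc_pending is set to 1 only when the
 * zerocopy attempt itself succeeded with a non-zero length, i.e. when a
 * completion notification should be expected for this call.
 */
static ssize_t zc_send_or_copy(int fd, const void *buf, size_t len,
                               int *zc_pending)
{
        ssize_t ret = send(fd, buf, len, MSG_ZEROCOPY);

        *zc_pending = ret > 0;
        if (ret == -1 && errno == ENOBUFS)
                ret = send(fd, buf, len, 0);    /* plain copy, no notification */
        return ret;
}
```

When the fallback path is taken the buffer may be reused as soon as
the call returns, because the data was copied. An application that
hits ENOBUFS often may prefer to read pending notifications first, as
that releases optmem and pinned pages.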

Mixing copy avoidance and copying
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many workloads have a mixture of large and small buffers. Because copy
avoidance is more expensive than copying for small packets, the
feature is implemented as a flag. It is safe to mix calls with the flag
with those without.

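As a rough illustration of mixing both modes, a sender could choose
the flag per call based on buffer size. The helper zc_flags_for_len
and the ZC_THRESHOLD cutoff are assumptions made for this sketch. The
cutoff merely echoes the approximate 10 KB figure quoted earlier; the
real break-even point depends on hardware and workload and is best
measured.

```c
#include <stddef.h>
#include <sys/socket.h>

#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000  /* value from the Linux uapi headers */
#endif

/* Hypothetical cutoff: below this size, plain copying is assumed cheaper. */
#define ZC_THRESHOLD (10 * 1024)

/* Return the send flags to use for a buffer of the given length. */
static int zc_flags_for_len(size_t len)
{
        return len >= ZC_THRESHOLD ? MSG_ZEROCOPY : 0;
}
```

A call site would then read ``send(fd, buf, len, zc_flags_for_len(len))``.
Only the calls that actually carried the flag produce completion
notifications.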

Notifications
-------------

The kernel has to notify the process when it is safe to reuse a
previously passed buffer. It queues completion notifications on the
socket error queue, akin to the transmit timestamping interface.

The notification itself is a simple scalar value. Each socket
maintains an internal unsigned 32-bit counter. Each send call with
MSG_ZEROCOPY that successfully sends data increments the counter. The
counter is not incremented on failure or if called with length zero.
The counter counts system call invocations, not bytes. It wraps after
UINT_MAX calls.

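To map notification values back to its own buffers, a process can
mirror this counter in user space. The following is a minimal sketch,
not part of the kernel interface: struct zc_tracker and zc_track_send
are hypothetical names, and the sketch assumes the counter starts at
zero on a new socket, so that the first successful MSG_ZEROCOPY send
is identified by 0.

```c
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical user-space mirror of the per-socket notification counter. */
struct zc_tracker {
        uint32_t next_id;       /* ID the next successful zerocopy send gets */
};

/*
 * Call after each send that passed MSG_ZEROCOPY, with its return value.
 * Returns 1 and stores the ID in *id if the call consumed a counter value,
 * 0 otherwise (the send failed or had length zero).
 */
static int zc_track_send(struct zc_tracker *t, ssize_t ret, uint32_t *id)
{
        if (ret <= 0)
                return 0;
        *id = t->next_id++;     /* unsigned arithmetic wraps like the kernel's */
        return 1;
}
```

The returned ID can be stored next to the buffer pointer, so that the
buffer is released once a notification range covering that ID arrives.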

Notification Reception
~~~~~~~~~~~~~~~~~~~~~~

The below snippet demonstrates the API. In the simplest case, each
send syscall is followed by a poll and recvmsg on the error queue.

Reading from the error queue is always a non-blocking operation. The
poll call is there to block until an error is outstanding. It will set
POLLERR in its output flags. That flag does not have to be set in the
events field. Errors are signaled unconditionally.

::

        pfd.fd = fd;
        pfd.events = 0;
        if (poll(&pfd, 1, -1) != 1 || (pfd.revents & POLLERR) == 0)
                error(1, errno, "poll");

        ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
        if (ret == -1)
                error(1, errno, "recvmsg");

        read_notification(&msg);

The example is for demonstration purposes only. In practice, it is more
efficient to not wait for notifications, but read without blocking
every couple of send calls.

Notifications can be processed out of order with other operations on
the socket. A socket that has an error queued would normally block
other operations until the error is read. Zerocopy notifications have
a zero error code, however, to not block send and recv calls.

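The non-blocking variant mentioned above could look roughly like the
following sketch, which reads the error queue until it is empty. The
function name zc_drain_errqueue and the 128-byte control buffer are
assumptions of this example. For brevity it only counts messages; a
real caller would parse each control message before reading the next.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Read all currently queued error-queue messages without blocking.
 * Returns the number of messages read, or -1 on an unexpected error.
 */
static int zc_drain_errqueue(int fd)
{
        char control[128];
        struct msghdr msg;
        int n = 0;

        for (;;) {
                memset(&msg, 0, sizeof(msg));
                msg.msg_control = control;
                msg.msg_controllen = sizeof(control);

                if (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1) {
                        if (errno == EAGAIN || errno == EWOULDBLOCK)
                                return n;       /* queue is empty */
                        return -1;
                }
                n++;
        }
}
```

Calling such a helper once every few send calls keeps the number of
pinned buffers bounded without a poll per send.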

Notification Batching
~~~~~~~~~~~~~~~~~~~~~

Multiple outstanding packets can be read at once using the recvmmsg
call. This is often not needed. In each message the kernel returns not
a single value, but a range. It coalesces consecutive notifications
while one is outstanding for reception on the error queue.

When a new notification is about to be queued, it checks whether the
new value extends the range of the notification at the tail of the
queue. If so, it drops the new notification packet and instead increases
the range upper value of the outstanding notification.

For protocols that acknowledge data in-order, like TCP, each
notification can be squashed into the previous one, so that no more
than one notification is outstanding at any one point.

Ordered delivery is the common case, but not guaranteed. Notifications
may arrive out of order on retransmission and socket teardown.

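The coalescing rule can be modelled in a few lines of user-space C.
This is an illustrative simplification, not the kernel code: struct
zc_range and zc_try_coalesce are hypothetical names, and the kernel
applies additional checks that are omitted here.

```c
#include <stdint.h>

/* Inclusive range of completed send calls, as carried by one notification. */
struct zc_range {
        uint32_t lo;
        uint32_t hi;
};

/*
 * Simplified model of the coalescing rule: if the new range starts right
 * after the tail's upper value, grow the tail and return 1. Otherwise
 * return 0, in which case the new notification is queued separately.
 */
static int zc_try_coalesce(struct zc_range *tail, struct zc_range next)
{
        if (next.lo != (uint32_t)(tail->hi + 1))
                return 0;
        tail->hi = next.hi;
        return 1;
}
```

With in-order completions every new value extends the tail, which is
why a TCP socket rarely has more than one notification queued. An
out-of-order completion fails the check and shows up as its own
message.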

Notification Parsing
~~~~~~~~~~~~~~~~~~~~

The below snippet demonstrates how to parse the control message: the
read_notification() call in the previous snippet. A notification
is encoded in the standard error format, sock_extended_err.

The level and type fields in the control data are protocol family
specific: SOL_IP with IP_RECVERR, or SOL_IPV6 with IPV6_RECVERR.

Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero,
as explained before, to avoid blocking read and write system calls on
the socket.

The 32-bit notification range is encoded as [ee_info, ee_data]. This
range is inclusive. Other fields in the struct must be treated as
undefined, except for ee_code, as discussed below.

::

        struct sock_extended_err *serr;
        struct cmsghdr *cm;

        /* msg is a pointer to the msghdr filled in by recvmsg */
        cm = CMSG_FIRSTHDR(msg);
        if (!cm || cm->cmsg_level != SOL_IP ||
            cm->cmsg_type != IP_RECVERR)
                error(1, 0, "cmsg");

        serr = (void *) CMSG_DATA(cm);
        if (serr->ee_errno != 0 ||
            serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
                error(1, 0, "serr");

        printf("completed: %u..%u\n", serr->ee_info, serr->ee_data);


Deferred copies
~~~~~~~~~~~~~~~

Passing flag MSG_ZEROCOPY is a hint to the kernel to apply copy
avoidance, and a contract that the kernel will queue a completion
notification. It is not a guarantee that the copy is elided.

Copy avoidance is not always feasible. Devices that do not support
scatter-gather I/O cannot send packets made up of kernel generated
protocol headers plus zerocopy user data. A packet may need to be
converted to a private copy of data deep in the stack, say to compute
a checksum.

In all these cases, the kernel returns a completion notification when
it releases its hold on the shared pages. That notification may arrive
before the (copied) data is fully transmitted. A zerocopy completion
notification is therefore not a transmit completion notification.

Deferred copies can be more expensive than a copy immediately in the
system call, if the data is no longer warm in the cache. The process
also incurs notification processing cost for no benefit. For this
reason, the kernel signals if data was completed with a copy, by
setting flag SO_EE_CODE_ZEROCOPY_COPIED in field ee_code on return.
A process may use this signal to stop passing flag MSG_ZEROCOPY on
subsequent requests on the same socket.

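
Implementation
==============

Loopback
--------

Data sent to local sockets can be queued indefinitely if the receive
process does not read its socket. Unbounded notification latency is not
acceptable. For this reason all packets generated with MSG_ZEROCOPY
that are looped to a local socket will incur a deferred copy. This
includes looping onto packet sockets (e.g., tcpdump) and tun devices.


Testing
=======

More realistic example code can be found in the kernel source under
tools/testing/selftests/net/msg_zerocopy.c.

Be cognizant of the loopback constraint. The test can be run between
a pair of hosts. But if run between a local pair of processes, for
instance when run with msg_zerocopy.sh between a veth pair across
namespaces, the test will not show any improvement. For testing, the
loopback restriction can be temporarily relaxed by making
skb_orphan_frags_rx identical to skb_orphan_frags.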
0247 
0248 More realistic example code can be found in the kernel source under
0249 tools/testing/selftests/net/msg_zerocopy.c.
0250 
0251 Be cognizant of the loopback constraint. The test can be run between
0252 a pair of hosts. But if run between a local pair of processes, for
0253 instance when run with msg_zerocopy.sh between a veth pair across
0254 namespaces, the test will not show any improvement. For testing, the
0255 loopback restriction can be temporarily relaxed by making
0256 skb_orphan_frags_rx identical to skb_orphan_frags.