****************************
RDMA Transport (RTRS)
****************************

RTRS (RDMA Transport) is a reliable high-speed transport library
that establishes an optimal number of connections between client
and server machines over an RDMA (InfiniBand, RoCE, iWARP)
transport. It is optimized for transferring (reading/writing) IO blocks.

Its core interface follows BIO semantics: it can either write data
from an sg list to the remote side or request ("read") a data transfer
from the remote side into a given sg list.

RTRS provides I/O fail-over and load-balancing capabilities by using
multipath I/O (see "add_path" and "mp_policy" configuration entries in
Documentation/ABI/testing/sysfs-class-rtrs-client).

RTRS is used by the RNBD (RDMA Network Block Device) modules.

==================
Transport protocol
==================

Overview
--------
An established connection between a client and a server is called an rtrs
session. A session is associated with a set of memory chunks reserved on the
server side for a given client for RDMA transfers. A session
consists of multiple paths, each representing a separate physical link
between client and server. The paths are used for load balancing and failover.
Each path consists of as many connections (QPs) as there are CPUs on
the client.

When processing an incoming write or read request, the rtrs client uses memory
chunks reserved for it on the server side. Their number, size and addresses
need to be exchanged between client and server during the connection
establishment phase. Apart from the memory-related information, the client
needs to inform the server about the session name and identify each path and
connection individually.

On an established session the client sends write or read messages to the
server. The server uses the immediate field to tell the client which request
is being acknowledged and to report the errno. The client uses the immediate
field to tell the server which of the memory chunks has been accessed and at
which offset the message can be found.

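The two uses of the 32-bit immediate field described above can be sketched
as follows. This is only an illustration: the bit split (16 bits for the
errno, a caller-supplied number of offset bits) and all ex_* names are
assumptions made for the example and are not taken from the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical split of the 32-bit immediate; the driver's layout may differ. */
#define EX_ERRNO_BITS 16u
#define EX_ERRNO_MASK ((1u << EX_ERRNO_BITS) - 1u)

/* Server -> client: pack the acknowledged request id and an errno. */
static uint32_t ex_pack_rsp_imm(uint16_t req_id, int err)
{
	/* Only the magnitude is stored; errno values are reported as <= 0. */
	uint32_t mag = (uint32_t)(err < 0 ? -err : err);

	assert(mag <= EX_ERRNO_MASK);
	return ((uint32_t)req_id << EX_ERRNO_BITS) | mag;
}

static void ex_unpack_rsp_imm(uint32_t imm, uint16_t *req_id, int *err)
{
	*req_id = (uint16_t)(imm >> EX_ERRNO_BITS);
	*err = -(int)(imm & EX_ERRNO_MASK);
}

/*
 * Client -> server: pack the chunk index and the offset of the message
 * inside that chunk. off_bits is assumed to be log2(chunk size).
 */
static uint32_t ex_pack_req_imm(uint32_t chunk, uint32_t off, unsigned off_bits)
{
	assert(off_bits > 0 && off_bits < 32);
	assert(off < (1u << off_bits));
	assert(chunk < (1u << (32 - off_bits)));
	return (chunk << off_bits) | off;
}

static void ex_unpack_req_imm(uint32_t imm, unsigned off_bits,
			      uint32_t *chunk, uint32_t *off)
{
	*chunk = imm >> off_bits;
	*off = imm & ((1u << off_bits) - 1u);
}
```

Packing both values into one word lets an otherwise empty RDMA message carry
the whole acknowledgement, at the cost of limiting the range of each value.
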
The module parameter always_invalidate was introduced to address the security
problem discussed at LPC RDMA MC 2019. When always_invalidate=Y, the server
invalidates each RDMA buffer before handing it over to the RNBD server, which
then passes it to the block layer. A new rkey is generated and registered for
the buffer after it returns from the block layer and the RNBD server.
The new rkey is sent back to the client along with the IO result.
This procedure is the default behaviour of the driver. The invalidation and
registration on each IO cause a performance drop of up to 20%. A user of the
driver may choose to load the modules with this mechanism switched off
(always_invalidate=N) if they understand and accept the risk of a malicious
client being able to corrupt the memory of a server it is connected to. This
might be a reasonable option in a scenario where all the clients and all the
servers are located within a secure datacenter.


Connection establishment
------------------------

1. Client starts establishing connections belonging to a path of a session one
by one by attaching RTRS_MSG_CON_REQ messages to the rdma_connect requests.
Those include the uuid of the session and the uuid of the path to be
established. They are used by the server to find a persisting session/path or
to create a new one when necessary. The message also contains the protocol
version and magic for compatibility, the total number of connections per
session (as many as CPUs on the client), the id of the current connection and
the reconnect counter, which is used to resolve situations where the client
is trying to reconnect a path while the server is still destroying the old
one.

2. Server accepts the connection requests one by one and attaches
RTRS_MSG_CONN_RSP messages to the rdma_accept. Apart from magic and
protocol version, the messages include the error code, the queue depth
supported by the server (number of memory chunks which are going to be
allocated for that session) and the maximum size of one IO. The
RTRS_MSG_NEW_RKEY_F flag is set when always_invalidate=Y.

3. After all connections of a path are established, the client sends the
RTRS_MSG_INFO_REQ message, containing the name of the session, to the server.
This message requests the address information from the server.

4. Server replies to the session info request message with RTRS_MSG_INFO_RSP,
which contains the addresses and keys of the RDMA buffers allocated for that
session.

5. Session becomes connected after all paths to be established are connected
(i.e. steps 1-4 finished for all paths requested for a session).

6. Server and client periodically exchange heartbeat messages (empty RDMA
messages with an immediate field), which are used to detect a crash on the
remote side or a network outage in the absence of IO.

7. On any RDMA-related error or in the case of a heartbeat timeout, the
corresponding path is disconnected, all inflight IOs are failed over to a
healthy path, if any, and the reconnect mechanism is triggered.

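Steps 6 and 7 can be pictured with a small model of heartbeat bookkeeping
and fail-over target selection. It is a sketch under stated assumptions: the
threshold of five missed intervals, the two path states and every ex_* name
are invented for the illustration and do not describe the driver's internals.

```c
#include <stdbool.h>
#include <stddef.h>

enum ex_path_state { EX_PATH_CONNECTED, EX_PATH_RECONNECTING };

struct ex_path {
	enum ex_path_state state;
	unsigned missed_hb;	/* consecutive intervals without a heartbeat */
};

#define EX_MAX_MISSED_HB 5u	/* assumed threshold, not the driver's value */

/* Called whenever a heartbeat or its ack arrives on the path. */
static void ex_hb_received(struct ex_path *p)
{
	p->missed_hb = 0;
}

/*
 * Called once per heartbeat interval. Returns true when the path has just
 * timed out, i.e. the caller should fail over inflight IO and reconnect.
 */
static bool ex_hb_tick(struct ex_path *p)
{
	if (p->state != EX_PATH_CONNECTED)
		return false;
	if (++p->missed_hb <= EX_MAX_MISSED_HB)
		return false;
	p->state = EX_PATH_RECONNECTING;
	return true;
}

/* Pick the first healthy path for failed-over IO, or NULL if there is none. */
static struct ex_path *ex_failover_target(struct ex_path *paths, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (paths[i].state == EX_PATH_CONNECTED)
			return &paths[i];
	return NULL;
}
```
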
CLT                                     SRV
*for each connection belonging to a path and for each path:
RTRS_MSG_CON_REQ   ------------------->
                   <------------------- RTRS_MSG_CON_RSP
...
*after all connections are established:
RTRS_MSG_INFO_REQ  ------------------->
                   <------------------- RTRS_MSG_INFO_RSP
*heartbeat is started from both sides:
                   -------------------> [RTRS_HB_MSG_IMM]
[RTRS_HB_MSG_ACK]  <-------------------
[RTRS_HB_MSG_IMM]  <-------------------
                   -------------------> [RTRS_HB_MSG_ACK]

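The fields carried by the connection request, and one way a server could use
the reconnect counter, are sketched below. The struct layout, the magic and
version values, the verdict codes and the rule "a counter that differs from
the one recorded for an existing path means the old path is stale" are all
assumptions made for illustration; only the list of fields comes from the
description above.

```c
#include <stdint.h>
#include <string.h>

#define EX_MAGIC   0xABCDu	/* hypothetical values */
#define EX_VERSION 2u

/* Hypothetical connection request carrying the fields listed in step 1. */
struct ex_con_req {
	uint16_t magic;
	uint16_t version;
	uint16_t cid;		/* id of the current connection */
	uint16_t cid_num;	/* total number of connections */
	uint16_t recon_cnt;	/* reconnect counter */
	uint8_t  sess_uuid[16];
	uint8_t  path_uuid[16];
};

/* What the server is assumed to remember about an existing path. */
struct ex_srv_path {
	uint8_t  path_uuid[16];
	uint16_t recon_cnt;
};

enum ex_verdict {
	EX_ACCEPT,
	EX_REJECT_PROTO,	/* magic or version mismatch */
	EX_REJECT_BAD_CID,	/* connection id out of range */
	EX_REJECT_STALE,	/* old path with this uuid not torn down yet */
};

/* known may be NULL when the server has no path for this request yet. */
static enum ex_verdict ex_check_con_req(const struct ex_con_req *req,
					const struct ex_srv_path *known)
{
	if (req->magic != EX_MAGIC || req->version != EX_VERSION)
		return EX_REJECT_PROTO;
	if (req->cid >= req->cid_num)
		return EX_REJECT_BAD_CID;
	if (known &&
	    memcmp(known->path_uuid, req->path_uuid, sizeof(req->path_uuid)) == 0 &&
	    known->recon_cnt != req->recon_cnt)
		return EX_REJECT_STALE;
	return EX_ACCEPT;
}
```
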
IO path
-------

* Write (always_invalidate=N) *

1. When processing a write request, the client selects one of the memory
chunks on the server side and RDMA-writes the user data, the user header and
the RTRS_MSG_RDMA_WRITE message there. Apart from the type (write), the
message only contains the size of the user header. The client tells the server
which chunk has been accessed and at what offset the RTRS_MSG_RDMA_WRITE can
be found by using the IMM field.

2. When confirming a write request, the server sends an "empty" RDMA message
with an immediate field. The 32-bit field is used to specify the outstanding
inflight IO and the error code.

CLT                                                          SRV
usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM]
[RTRS_IO_RSP_IMM]                        <----------------- (id + errno)

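The chunk layout used by a write (user data, then user header, then the
write message) can be modelled on a plain byte buffer. This is a minimal
sketch, assuming a two-field message struct and a local buffer standing in
for the remote chunk; the ex_* names and the message type value are
hypothetical, and byte order handling is left out.

```c
#include <stdint.h>
#include <string.h>

#define EX_MSG_WRITE 1u		/* hypothetical message type value */

/* Hypothetical write message: only the type and the user header size. */
struct ex_msg_rdma_write {
	uint16_t type;
	uint16_t usr_len;
};

/*
 * Lay out user data, user header and the write message back to back in a
 * chunk. Returns the offset of the message inside the chunk (the value the
 * client would report via the IMM field), or -1 if the chunk is too small.
 */
static long ex_fill_write_chunk(uint8_t *chunk, size_t chunk_size,
				const void *data, size_t data_len,
				const void *hdr, uint16_t hdr_len)
{
	struct ex_msg_rdma_write msg = { EX_MSG_WRITE, hdr_len };
	size_t msg_off = data_len + hdr_len;

	if (data_len > chunk_size || msg_off > chunk_size ||
	    sizeof(msg) > chunk_size - msg_off)
		return -1;

	memcpy(chunk, data, data_len);
	memcpy(chunk + data_len, hdr, hdr_len);
	memcpy(chunk + msg_off, &msg, sizeof(msg));
	return (long)msg_off;
}
```

With this layout the receiver can locate the user header by stepping back
usr_len bytes from the reported message offset.
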
* Write (always_invalidate=Y) *

1. When processing a write request, the client selects one of the memory
chunks on the server side and RDMA-writes the user data, the user header and
the RTRS_MSG_RDMA_WRITE message there. Apart from the type (write), the
message only contains the size of the user header. The client tells the server
which chunk has been accessed and at what offset the RTRS_MSG_RDMA_WRITE can
be found by using the IMM field. The server first invalidates the rkey
associated with the memory chunk and, once the invalidation finishes, passes
the IO to the RNBD server module.

2. When confirming a write request, the server sends an "empty" RDMA message
with an immediate field. The 32-bit field is used to specify the outstanding
inflight IO and the error code. The new rkey is sent back using a
SEND_WITH_IMM WR. When the client receives the new rkey message, it validates
the message, updates the rkey for the rbuffer, finishes the IO and then posts
the recv buffer back for later use.

CLT                                                          SRV
usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM]
[RTRS_MSG_RKEY_RSP]                      <----------------- (RTRS_MSG_RKEY_RSP)
[RTRS_IO_RSP_IMM]                        <----------------- (id + errno)


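The client-side handling of a new rkey message (validate, then update the
rkey of the affected remote buffer) might look roughly like this. The
message layout, the type value and the ex_* names are assumptions for the
sketch; byte order conversion and re-posting of the recv buffer are omitted.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EX_MSG_RKEY_RSP 5u	/* hypothetical message type value */

/* What the client is assumed to track per remote memory chunk. */
struct ex_rbuf {
	uint64_t addr;
	uint32_t rkey;
};

/* Hypothetical wire layout of the new rkey message. */
struct ex_msg_rkey_rsp {
	uint16_t type;
	uint16_t buf_id;
	uint32_t rkey;
};

/*
 * Validate a received rkey message and update the matching remote buffer.
 * Returns 0 on success and -1 if the message is malformed, in which case
 * the buffer table is left untouched.
 */
static int ex_handle_rkey_rsp(struct ex_rbuf *rbufs, size_t nr_rbufs,
			      const void *raw, size_t len)
{
	struct ex_msg_rkey_rsp msg;

	if (len != sizeof(msg))
		return -1;
	memcpy(&msg, raw, sizeof(msg));
	if (msg.type != EX_MSG_RKEY_RSP || msg.buf_id >= nr_rbufs)
		return -1;

	rbufs[msg.buf_id].rkey = msg.rkey;
	return 0;
}
```
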
* Read (always_invalidate=N) *

1. When processing a read request, the client selects one of the memory chunks
on the server side and RDMA-writes the user header and the
RTRS_MSG_RDMA_READ message there. This message contains the type (read), the
size of the user header, flags (specifying if memory invalidation is
necessary) and the list of addresses along with keys for the data to be read
into.

2. When confirming a read request, the server transfers the requested data
first, attaches an invalidation message if requested and finally sends an
"empty" RDMA message with an immediate field. The 32-bit field is used to
specify the outstanding inflight IO and the error code.

CLT                                           SRV
usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM]
[RTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
or in case client requested invalidation:
[RTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)

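A read message as described above (type, user header size, flags and a list
of address/key descriptors) can be serialized as a fixed header followed by
the descriptors. The sketch below assumes a particular field order, a single
invalidation flag bit and ex_* names, none of which are taken from the
driver; byte order conversion is omitted.

```c
#include <stdint.h>
#include <string.h>

#define EX_MSG_READ   2u	/* hypothetical message type value */
#define EX_F_NEED_INV 0x1u	/* hypothetical "invalidation requested" flag */

/* One destination segment on the client: where the server writes the data. */
struct ex_sg_desc {
	uint64_t addr;
	uint32_t key;
	uint32_t len;
};

/* Hypothetical fixed part of the read message; descriptors follow it. */
struct ex_msg_rdma_read {
	uint16_t type;
	uint16_t usr_len;
	uint16_t flags;
	uint16_t sg_cnt;
};

/*
 * Serialize header plus descriptors into buf. Returns the number of bytes
 * written, or -1 if the arguments are invalid or buf is too small.
 */
static long ex_build_read_msg(uint8_t *buf, size_t buf_size, uint16_t usr_len,
			      int need_inv, const struct ex_sg_desc *sg,
			      uint16_t sg_cnt)
{
	struct ex_msg_rdma_read hdr = {
		EX_MSG_READ, usr_len,
		(uint16_t)(need_inv ? EX_F_NEED_INV : 0), sg_cnt
	};
	size_t sg_bytes = (size_t)sg_cnt * sizeof(*sg);
	size_t total = sizeof(hdr) + sg_bytes;

	if ((sg_cnt && !sg) || total > buf_size)
		return -1;

	memcpy(buf, &hdr, sizeof(hdr));
	if (sg_cnt)
		memcpy(buf + sizeof(hdr), sg, sg_bytes);
	return (long)total;
}
```
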
* Read (always_invalidate=Y) *

1. When processing a read request, the client selects one of the memory chunks
on the server side and RDMA-writes the user header and the
RTRS_MSG_RDMA_READ message there. This message contains the type (read), the
size of the user header, flags (specifying if memory invalidation is
necessary) and the list of addresses along with keys for the data to be read
into. The server first invalidates the rkey associated with the memory chunk
and, once the invalidation finishes, passes the IO to the RNBD server module.

2. When confirming a read request, the server transfers the requested data
first, attaches an invalidation message if requested and finally sends an
"empty" RDMA message with an immediate field. The 32-bit field is used to
specify the outstanding inflight IO and the error code. The new rkey is sent
back using a SEND_WITH_IMM WR. When the client receives the new rkey message,
it validates the message, updates the rkey for the rbuffer, finishes the IO
and then posts the recv buffer back for later use.

CLT                                           SRV
usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM]
[RTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
[RTRS_MSG_RKEY_RSP]          <-------------- (RTRS_MSG_RKEY_RSP)
or in case client requested invalidation:
[RTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)

=========================================
Contributors List (in alphabetical order)
=========================================
Danil Kipnis <danil.kipnis@profitbricks.com>
Fabian Holler <mail@fholler.de>
Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Jack Wang <jinpu.wang@profitbricks.com>
Kleber Souza <kleber.souza@profitbricks.com>
Lutz Pogrell <lutz.pogrell@cloud.ionos.com>
Milind Dumbare <Milind.dumbare@gmail.com>
Roman Penyaev <roman.penyaev@profitbricks.com>