====================
TCM Userspace Design
====================


.. Contents:

   1) Design
     a) Background
     b) Benefits
     c) Design constraints
     d) Implementation overview
        i. Mailbox
        ii. Command ring
        iii. Data Area
     e) Device discovery
     f) Device events
     g) Other contingencies
   2) Writing a user pass-through handler
     a) Discovering and configuring TCMU uio devices
     b) Waiting for events on the device(s)
     c) Managing the command ring
   3) A final note


Design
======

TCM is another name for LIO, an in-kernel iSCSI target (server).
Existing TCM targets run in the kernel.  TCMU (TCM in Userspace)
allows userspace programs to be written which act as iSCSI targets.
This document describes the design.

The existing kernel provides modules for different SCSI transport
protocols.  TCM also modularizes the data storage.  There are existing
modules for file, block device, RAM or using another SCSI device as
storage.  These are called "backstores" or "storage engines".  These
built-in modules are implemented entirely as kernel code.

Background
----------

In addition to modularizing the transport protocol used for carrying
SCSI commands ("fabrics"), the Linux kernel target, LIO, also modularizes
the actual data storage. The storage modules are referred to as
"backstores" or "storage engines". The target comes with backstores that
allow a file, a block device, RAM, or another SCSI device to be used for
the local storage needed for the exported SCSI LUN. Like the rest of LIO,
these are implemented entirely as kernel code.

These backstores cover the most common use cases, but not all. One new
use case that other non-kernel target solutions, such as tgt, are able
to support is using Gluster's GLFS or Ceph's RBD as a backstore. The
target then serves as a translator, allowing initiators to store data
in these non-traditional networked storage systems, while still only
using standard protocols themselves.

If the target is a userspace process, supporting these is easy. tgt,
for example, needs only a small adapter module for each, because the
modules just use the available userspace libraries for RBD and GLFS.

Adding support for these backstores in LIO is considerably more
difficult, because LIO is entirely kernel code. Instead of undertaking
the significant work to port the GLFS or RBD APIs and protocols to the
kernel, another approach is to create a userspace pass-through
backstore for LIO, "TCMU".


Benefits
--------

In addition to allowing relatively easy support for RBD and GLFS, TCMU
will also allow easier development of new backstores. TCMU combines
with the LIO loopback fabric to become something similar to FUSE
(Filesystem in Userspace), but at the SCSI layer instead of the
filesystem layer. A SUSE, if you will.

The disadvantage is that there are more distinct components to
configure, and potentially to malfunction. This is unavoidable, but
hopefully not fatal if we're careful to keep things as simple as
possible.

Design constraints
------------------

- Good performance: high throughput, low latency
- Cleanly handle if userspace:

   1) never attaches
   2) hangs
   3) dies
   4) misbehaves

- Allow future flexibility in user & kernel implementations
- Be reasonably memory-efficient
- Simple to configure & run
- Simple to write a userspace backend


Implementation overview
-----------------------

The core of the TCMU interface is a memory region that is shared
between kernel and userspace. Within this region are: a control area
(mailbox); a lockless producer/consumer circular buffer for commands
to be passed up, and status returned; and an in/out data buffer area.

TCMU uses the pre-existing UIO subsystem. UIO allows device driver
development in userspace, and this is conceptually very close to the
TCMU use case, except instead of a physical device, TCMU implements a
memory-mapped layout designed for SCSI commands. Using UIO also
benefits TCMU by handling device introspection (e.g. a way for
userspace to determine how large the shared region is) and signaling
mechanisms in both directions.

There are no embedded pointers in the memory region. Everything is
expressed as an offset from the region's starting address. This allows
the ring to still work if the user process dies and is restarted with
the region mapped at a different virtual address.

See target_core_user.h for the struct definitions.
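
Because every reference inside the region is an offset, a handler converts
offsets to pointers only at the moment of use. The following is a minimal
sketch of that idea; ``demo_region_ptr`` and ``demo_region_off`` are
hypothetical helper names, not part of the TCMU interface.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: turn an offset stored in the shared region into a
 * pointer that is valid for the current mapping of that region.
 */
static inline void *demo_region_ptr(void *region_base, size_t off)
{
    return (uint8_t *)region_base + off;
}

/* Hypothetical inverse: recover the offset of a pointer inside the region. */
static inline size_t demo_region_off(const void *region_base, const void *p)
{
    return (size_t)((const uint8_t *)p - (const uint8_t *)region_base);
}
```

The same offset yields a valid pointer regardless of where mmap() placed
the region, which is why a restarted handler can pick up the ring again.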

The Mailbox
-----------

The mailbox is always at the start of the shared memory region. It
contains a version, the starting offset and size of the command ring,
and head and tail offsets used by the kernel and userspace
(respectively) to put commands on the ring and to indicate when the
commands are completed.

version - 1 (userspace should abort if otherwise)

flags:
    - TCMU_MAILBOX_FLAG_CAP_OOOC:
        indicates out-of-order completion is supported.
        See "The Command Ring" for details.

cmdr_off
        The offset of the start of the command ring from the start
        of the memory region, to account for the mailbox size.
cmdr_size
        The size of the command ring. This does *not* need to be a
        power of two.
cmd_head
        Modified by the kernel to indicate when a command has been
        placed on the ring.
cmd_tail
        Modified by userspace to indicate when it has completed
        processing of a command.
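
A handler might sanity-check the mailbox right after mapping the region.
The sketch below uses a simplified stand-in struct, ``demo_mailbox``,
whose field names follow the list above; the real ``struct tcmu_mailbox``
in target_core_user.h has its own layout and should be used in practice.
Only the version check is required by the text above; the bounds checks
are extra defensive assumptions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for illustration only, NOT the real kernel layout. */
struct demo_mailbox {
    uint16_t version;
    uint16_t flags;
    uint32_t cmdr_off;
    uint32_t cmdr_size;
    uint32_t cmd_head;
    uint32_t cmd_tail;
};

/* Return true if the mailbox looks usable for a mapping of map_len bytes. */
static bool demo_mailbox_ok(const struct demo_mailbox *mb, size_t map_len)
{
    if (mb->version != 1)               /* userspace should abort otherwise */
        return false;
    if (mb->cmdr_size == 0 || mb->cmdr_off > map_len)
        return false;
    if (mb->cmdr_size > map_len - mb->cmdr_off)   /* ring must fit in map */
        return false;
    /* head and tail are offsets within the ring */
    return mb->cmd_head < mb->cmdr_size && mb->cmd_tail < mb->cmdr_size;
}
```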

The Command Ring
----------------

Commands are placed on the ring by the kernel incrementing
mailbox.cmd_head by the size of the command, modulo cmdr_size, and
then signaling userspace via uio_event_notify(). Once the command is
completed, userspace updates mailbox.cmd_tail in the same way and
signals the kernel via a 4-byte write(). When cmd_head equals
cmd_tail, the ring is empty -- no commands are currently waiting to be
processed by userspace.
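
The head/tail arithmetic described above can be sketched as follows.
``demo_ring_advance`` and ``demo_ring_is_empty`` are hypothetical names
used only for illustration; the modulo form works for any cmdr_size,
including sizes that are not a power of two.

```c
#include <stdbool.h>
#include <stdint.h>

/* Advance a ring offset by len bytes, wrapping at cmdr_size. */
static inline uint32_t demo_ring_advance(uint32_t pos, uint32_t len,
                                         uint32_t cmdr_size)
{
    /* Widen first so that pos + len cannot overflow 32 bits. */
    return (uint32_t)(((uint64_t)pos + len) % cmdr_size);
}

/* The ring holds no pending commands when head and tail coincide. */
static inline bool demo_ring_is_empty(uint32_t cmd_head, uint32_t cmd_tail)
{
    return cmd_head == cmd_tail;
}
```

For example, with a ring of 1000 bytes, consuming a 40-byte entry at
offset 960 moves the tail back to offset 0.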

TCMU commands are 8-byte aligned. They start with a common header
containing "len_op", a 32-bit value that stores the length, as well as
the opcode in the lowest unused bits. It also contains cmd_id and
flags fields for setting by the kernel (kflags) and userspace
(uflags).
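
Since entry lengths are multiples of 8, the low three bits of the length
are always zero and can carry the opcode. The sketch below illustrates
that packing with a locally defined mask; ``DEMO_OP_MASK`` and the
``demo_*`` functions are assumptions made for this example. Real handlers
should use tcmu_hdr_get_op() and tcmu_hdr_get_len() from
target_core_user.h instead.

```c
#include <stdint.h>

/* Assumed for illustration: the low 3 bits of len_op hold the opcode. */
#define DEMO_OP_MASK 0x7u

/* Extract the opcode from a packed len_op value. */
static inline uint32_t demo_get_op(uint32_t len_op)
{
    return len_op & DEMO_OP_MASK;
}

/* Extract the entry length (a multiple of 8) from a packed len_op value. */
static inline uint32_t demo_get_len(uint32_t len_op)
{
    return len_op & ~DEMO_OP_MASK;
}

/* Pack a length and an opcode into one 32-bit value. */
static inline uint32_t demo_pack_len_op(uint32_t len, uint32_t op)
{
    return (len & ~DEMO_OP_MASK) | (op & DEMO_OP_MASK);
}
```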

Currently only two opcodes are defined, TCMU_OP_CMD and TCMU_OP_PAD.

When the opcode is CMD, the entry in the command ring is a struct
tcmu_cmd_entry. Userspace finds the SCSI CDB (Command Data Block) via
tcmu_cmd_entry.req.cdb_off. This is an offset from the start of the
overall shared memory region, not the entry. The data in/out buffers
are accessible via the req.iov[] array. iov_cnt contains the number of
entries in iov[] needed to describe either the Data-In or Data-Out
buffers. For bidirectional commands, iov_cnt specifies how many iovec
entries cover the Data-Out area, and iov_bidi_cnt specifies how many
iovec entries immediately after that in iov[] cover the Data-In
area. Just like other fields, iov.iov_base is an offset from the start
of the region.
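
To make the iov[] convention concrete, the sketch below gathers the bytes
described by an iovec array into a flat buffer, treating each iov_base as
an offset from the start of the region. ``demo_gather`` is a hypothetical
helper written for this example, and it performs no bounds checking
against the region size.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/*
 * Copy the data described by iov[0..iov_cnt) into dst, up to dst_len
 * bytes. iov_base is interpreted as an offset from region, not as a
 * pointer. Returns the number of bytes copied.
 */
static size_t demo_gather(const void *region, const struct iovec *iov,
                          size_t iov_cnt, void *dst, size_t dst_len)
{
    size_t copied = 0;

    for (size_t i = 0; i < iov_cnt && copied < dst_len; i++) {
        size_t off = (size_t)(uintptr_t)iov[i].iov_base;
        size_t n = iov[i].iov_len;

        if (n > dst_len - copied)
            n = dst_len - copied;       /* truncate to the space left */
        memcpy((uint8_t *)dst + copied, (const uint8_t *)region + off, n);
        copied += n;
    }
    return copied;
}
```

For a bidirectional command, the same helper could be called once on the
first iov_cnt entries and once on the following iov_bidi_cnt entries.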

When completing a command, userspace sets rsp.scsi_status, and
rsp.sense_buffer if necessary. Userspace then increments
mailbox.cmd_tail by the entry's length, as stored in hdr.len_op (mod
cmdr_size), and signals the kernel via the UIO method, a 4-byte write
to the file descriptor.

If TCMU_MAILBOX_FLAG_CAP_OOOC is set in mailbox->flags, the kernel is
capable of handling out-of-order completions. In this case, userspace
can handle commands in an order different from the original one. Since
the kernel still processes the commands in the same order in which they
appear in the command ring, userspace needs to update the cmd->id when
completing a command (in other words, steal the original command's
entry).

When the opcode is PAD, userspace only updates cmd_tail as above --
it's a no-op. (The kernel inserts PAD entries to ensure each CMD entry
is contiguous within the command ring.)

More opcodes may be added in the future. If userspace encounters an
opcode it does not handle, it must set the UNKNOWN_OP bit (bit 0) in
hdr.uflags, update cmd_tail, and proceed with processing additional
commands, if any.

The Data Area
-------------

This is shared-memory space after the command ring. The organization
of this area is not defined in the TCMU interface, and userspace
should access only the parts referenced by pending iovs.


Device Discovery
----------------

Other devices may be using UIO besides TCMU. Unrelated user processes
may also be handling different sets of TCMU devices. TCMU userspace
processes must find their devices by scanning
/sys/class/uio/uio*/name in sysfs. For TCMU devices, these names will
be of the format::

        tcm-user/<hba_num>/<device_name>/<subtype>/<path>

where "tcm-user" is common for all TCMU-backed UIO devices. <hba_num>
and <device_name> allow userspace to find the device's path in the
kernel target's configfs tree. Assuming the usual mount point, it is
found at::

        /sys/kernel/config/target/core/user_<hba_num>/<device_name>

This location contains attributes such as "hw_block_size", that
userspace needs to know for correct operation.

<subtype> will be a userspace-process-unique string to identify the
TCMU device as expecting to be backed by a certain handler, and <path>
will be an additional handler-specific string for the user process to
configure the device, if needed. The name cannot contain ':', due to
LIO limitations.
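
Splitting the UIO name into its parts and deriving the configfs path
could look like the sketch below. The struct, the function names and the
fixed buffer sizes are assumptions made for this example; the sample
subtype "rbd" and path "pool/image" in the note that follows are
likewise only illustrative.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct demo_tcmu_name {
    unsigned int hba_num;
    char device_name[64];
    char subtype[64];
    char path[128];     /* may be empty; may itself contain '/' */
};

/* Parse "tcm-user/<hba_num>/<device_name>/<subtype>/<path>".
 * Returns 0 on success, -1 if the name is not in the expected format. */
static int demo_parse_uio_name(const char *name, struct demo_tcmu_name *out)
{
    int consumed = 0;

    memset(out, 0, sizeof(*out));
    if (sscanf(name, "tcm-user/%u/%63[^/]/%63[^/]%n", &out->hba_num,
               out->device_name, out->subtype, &consumed) < 3)
        return -1;
    if (name[consumed] == '/')      /* everything after subtype is <path> */
        snprintf(out->path, sizeof(out->path), "%s", name + consumed + 1);
    return 0;
}

/* Build the configfs directory for the device, assuming the usual mount. */
static int demo_configfs_path(const struct demo_tcmu_name *n,
                              char *buf, size_t len)
{
    return snprintf(buf, len, "/sys/kernel/config/target/core/user_%u/%s",
                    n->hba_num, n->device_name);
}
```

Given the name "tcm-user/1/disk0/rbd/pool/image", this yields hba_num 1,
device_name "disk0", subtype "rbd" and path "pool/image", and the
configfs directory /sys/kernel/config/target/core/user_1/disk0. The
caller is expected to strip the trailing newline read from sysfs first.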

For all devices so discovered, the user handler opens /dev/uioX and
calls mmap()::

        mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0)

where size must be equal to the value read from
/sys/class/uio/uioX/maps/map0/size.


Device Events
-------------

If a new device is added or removed, a notification will be broadcast
over netlink, using a generic netlink family name of "TCM-USER" and a
multicast group named "config". This will include the UIO name as
described in the previous section, as well as the UIO minor
number. This should allow userspace to identify both the UIO device and
the LIO device, so that after determining the device is supported
(based on subtype) it can take the appropriate action.


Other contingencies
-------------------

Userspace handler process never attaches:

- TCMU will post commands, and then abort them after a timeout period
  (30 seconds).

Userspace handler process is killed:

- It is still possible to restart and re-connect to TCMU
  devices. The command ring is preserved. However, after the timeout
  period, the kernel will abort pending tasks.

Userspace handler process hangs:

- The kernel will abort pending tasks after a timeout period.

Userspace handler process is malicious:

- The process can trivially break the handling of devices it controls,
  but should not be able to access kernel memory outside its shared
  memory areas.


Writing a user pass-through handler (with example code)
=======================================================

A user process handling a TCMU device must support the following:

a) Discovering and configuring TCMU uio devices
b) Waiting for events on the device(s)
c) Managing the command ring: Parsing operations and commands,
   performing work as needed, setting response fields (scsi_status and
   possibly sense_buffer), updating cmd_tail, and notifying the kernel
   that work has been finished

First, consider instead writing a plugin for tcmu-runner. tcmu-runner
implements all of this, and provides a higher-level API for plugin
authors.

TCMU is designed so that multiple unrelated processes can manage TCMU
devices separately. All handlers should make sure to only open their
devices, based upon a known subtype string.

a) Discovering and configuring TCMU UIO devices::

      /* error checking omitted for brevity */

      #include <fcntl.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/mman.h>
      #include <unistd.h>

      int fd, dev_fd;
      ssize_t ret;
      char buf[256];
      unsigned long long map_len;
      void *map;

      fd = open("/sys/class/uio/uio0/name", O_RDONLY);
      ret = read(fd, buf, sizeof(buf));
      close(fd);
      buf[ret-1] = '\0'; /* null-terminate and chop off the \n */

      /* we only want uio devices whose name is a format we expect */
      if (strncmp(buf, "tcm-user", 8))
        exit(-1);

      /* Further checking for subtype also needed here */

      /* uio0 is hard-coded here; a real handler would check every uio* */
      fd = open("/sys/class/uio/uio0/maps/map0/size", O_RDONLY);
      ret = read(fd, buf, sizeof(buf));
      close(fd);
      buf[ret-1] = '\0'; /* null-terminate and chop off the \n */

      map_len = strtoull(buf, NULL, 0);

      dev_fd = open("/dev/uio0", O_RDWR);
      map = mmap(NULL, map_len, PROT_READ|PROT_WRITE, MAP_SHARED, dev_fd, 0);


b) Waiting for events on the device(s)::

      while (1) {
        char buf[4];

        /* blocks until the kernel signals an event on the UIO device */
        int ret = read(dev_fd, buf, 4);

        handle_device_events(dev_fd, map);
      }


c) Managing the command ring::

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <unistd.h>
      #include <linux/target_core_user.h>

      /*
       * SCSI_NO_SENSE and SCSI_CHECK_CONDITION are assumed to be defined
       * by the handler as the SCSI status codes 0 and 2 respectively.
       */

      int handle_device_events(int fd, void *map)
      {
        struct tcmu_mailbox *mb = map;
        struct tcmu_cmd_entry *ent = (void *) mb + mb->cmdr_off + mb->cmd_tail;
        int did_some_work = 0;

        /* Process events from cmd ring until we catch up with cmd_head */
        while (ent != (void *)mb + mb->cmdr_off + mb->cmd_head) {

          if (tcmu_hdr_get_op(ent->hdr.len_op) == TCMU_OP_CMD) {
            /* cdb_off is relative to the start of the shared region */
            uint8_t *cdb = (void *)mb + ent->req.cdb_off;
            bool success = true;

            /* Handle command here. */
            printf("SCSI opcode: 0x%x\n", cdb[0]);

            /* Set response fields */
            if (success)
              ent->rsp.scsi_status = SCSI_NO_SENSE;
            else {
              /* Also fill in rsp->sense_buffer here */
              ent->rsp.scsi_status = SCSI_CHECK_CONDITION;
            }
          }
          else if (tcmu_hdr_get_op(ent->hdr.len_op) != TCMU_OP_PAD) {
            /* Tell the kernel we didn't handle unknown opcodes */
            ent->hdr.uflags |= TCMU_UFLAG_UNKNOWN_OP;
          }
          else {
            /* Do nothing for PAD entries except update cmd_tail */
          }

          /* update cmd_tail; the helper takes the len_op value itself */
          mb->cmd_tail = (mb->cmd_tail + tcmu_hdr_get_len(ent->hdr.len_op)) % mb->cmdr_size;
          ent = (void *) mb + mb->cmdr_off + mb->cmd_tail;
          did_some_work = 1;
        }

        /* Notify the kernel that work has been finished */
        if (did_some_work) {
          uint32_t buf = 0;

          write(fd, &buf, 4);
        }

        return 0;
      }


A final note
============

Please be careful to return codes as defined by the SCSI
specifications. These are different from some values defined in the
scsi/scsi.h include file. For example, CHECK CONDITION's status code
is 2, not 1.
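
As an illustration of returning specification-defined values, the sketch
below defines a few SCSI status codes and fills a buffer with
fixed-format sense data. The ``demo_*`` names are hypothetical; a handler
would write the result into rsp.sense_buffer and store the returned
status in rsp.scsi_status.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* SCSI status codes as defined by the SCSI specifications. */
enum demo_sam_status {
    DEMO_SAM_GOOD            = 0x00,
    DEMO_SAM_CHECK_CONDITION = 0x02,
    DEMO_SAM_BUSY            = 0x08,
};

/*
 * Fill buf with fixed-format sense data (response code 0x70).
 * Returns the CHECK CONDITION status to report, or -1 if buf is too
 * small to hold the 18-byte fixed-format layout.
 */
static int demo_set_sense(uint8_t *buf, size_t len,
                          uint8_t key, uint8_t asc, uint8_t ascq)
{
    if (len < 18)
        return -1;
    memset(buf, 0, len);
    buf[0] = 0x70;          /* current error, fixed format */
    buf[2] = key & 0x0f;    /* sense key */
    buf[7] = 0x0a;          /* additional sense length: 10 more bytes */
    buf[12] = asc;          /* additional sense code */
    buf[13] = ascq;         /* additional sense code qualifier */
    return DEMO_SAM_CHECK_CONDITION;
}
```

For instance, an unsupported CDB could be reported with sense key 0x05
(ILLEGAL REQUEST) and additional sense code 0x20 (INVALID COMMAND
OPERATION CODE).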