=========
dm-switch
=========

The device-mapper switch target creates a device that supports an
arbitrary mapping of fixed-size regions of I/O across a fixed set of
paths. The path used for any specific region can be switched
dynamically by sending the target a message.

It maps I/O to underlying block devices efficiently when there is a large
number of fixed-size address regions but no simple pattern that would
allow a compact representation of the mapping, such as the one dm-stripe
uses.

Background
----------

Dell EqualLogic and some other iSCSI storage arrays use a distributed
frameless architecture. In this architecture, the storage group
consists of a number of distinct storage arrays ("members"), each having
independent controllers, disk storage and network adapters. When a LUN
is created it is spread across multiple members. The details of the
spreading are hidden from initiators connected to this storage system.
The storage group exposes a single target discovery portal, no matter
how many members are being used. When iSCSI sessions are created, each
session is connected to an Ethernet port on a single member. Data to a
LUN can be sent on any iSCSI session, and if the blocks being accessed
are stored on another member the I/O will be forwarded as required. This
forwarding is invisible to the initiator. The storage layout is also
dynamic, and the blocks stored on disk may be moved from member to
member as needed to balance the load.

This architecture simplifies the management and configuration of both
the storage group and initiators. In a multipathing configuration, it
is possible to set up multiple iSCSI sessions to use multiple network
interfaces on both the host and target to take advantage of the
increased network bandwidth. An initiator could use a simple round
robin algorithm to send I/O across all paths and let the storage array
members forward it as necessary, but there is a performance advantage to
sending data directly to the correct member.

A device-mapper table already lets you map different regions of a
device onto different targets. However, in this architecture the LUN is
spread with an address region size on the order of tens of MBs, which
means the resulting table could have more than a million entries and
consume far too much memory.
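
The scale of the problem can be sketched with a little shell arithmetic.
The LUN size and region size below are hypothetical figures chosen only
for illustration, and the sketch assumes one table entry per region:

```shell
# Hypothetical figures, chosen only to illustrate the scale.
lun_bytes=$((16 * 1024 * 1024 * 1024 * 1024))   # 16 TiB LUN (assumed)
region_bytes=$((16 * 1024 * 1024))              # 16 MiB region (assumed)

# Assumption: one table entry is needed per region (rounded up).
entries=$(( (lun_bytes + region_bytes - 1) / region_bytes ))
echo "$entries"   # prints 1048576
```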

Using this device-mapper switch target we can now build a two-layer
device hierarchy:

    Upper Tier - Determine which array member the I/O should be sent to.
    Lower Tier - Load balance amongst paths to a particular member.

The lower tier consists of a single dm multipath device for each member.
Each of these multipath devices contains the set of paths directly to
the array member in one priority group, and leverages existing path
selectors to load balance amongst these paths. We also build a
non-preferred priority group containing paths to other array members for
failover.

The upper tier consists of a single dm-switch device. This device uses
a bitmap to look up the location of the I/O and choose the appropriate
lower-tier device to route the I/O to. By using a bitmap we are able to
use 4 bits for each address range in a 16-member group (which is very
large for us). This is a much denser representation than the dm table
b-tree can achieve.
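
The 4-bit figure follows from the number of members: 4 bits are enough to
encode path numbers 0 to 15. A minimal sketch of that calculation is shown
below. The helper name bits_for_paths and the region count are
hypothetical, and this is not the kernel's own code:

```shell
# Smallest number of bits able to encode path numbers 0 .. num_paths-1.
# bits_for_paths is a hypothetical helper used only for illustration.
bits_for_paths() {
    num_paths=$1
    bits=1
    while [ $((1 << bits)) -lt "$num_paths" ]; do
        bits=$((bits + 1))
    done
    echo "$bits"
}

bits=$(bits_for_paths 16)
regions=1048576    # hypothetical region count
echo "bits per region: $bits"                                  # 4
echo "packed table size: $(( regions * bits / 8 )) bytes"      # 524288
```

Under these assumptions a million-region map packs into roughly half a
megabyte.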

Construction Parameters
=======================

    <num_paths> <region_size> <num_optional_args> [<optional_args>...] [<dev_path> <offset>]+
        <num_paths>
            The number of paths across which to distribute the I/O.

        <region_size>
            The number of 512-byte sectors in a region. Each region can be redirected
            to any of the available paths.

        <num_optional_args>
            The number of optional arguments. Currently, no optional arguments
            are supported and so this must be zero.

        <dev_path>
            The block device that represents a specific path to the device.

        <offset>
            The offset of the start of data on the specific <dev_path> (in units
            of 512-byte sectors). This number is added to the sector number when
            forwarding the request to the specific path. Typically it is zero.
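
As an illustration of how these parameters fit together, the following
sketch assembles a one-line table for the switch target. The function name
switch_table, the device size and the device paths are assumptions made
for the example. Every path is given an offset of zero and no optional
arguments are passed:

```shell
# switch_table <num_sectors> <region_size> <dev_path>...
# Hypothetical helper: prints a one-line dm table for the switch target.
# Every path uses offset 0 and <num_optional_args> is 0.
switch_table() {
    num_sectors=$1
    region_size=$2
    shift 2
    # "$#" is now the number of remaining arguments, i.e. <num_paths>.
    line="0 $num_sectors switch $# $region_size 0"
    for dev in "$@"; do
        line="$line $dev 0"
    done
    echo "$line"
}

# Assumed 1 GiB device (2097152 sectors) with 128-sector regions.
switch_table 2097152 128 /dev/vg1/switch0 /dev/vg1/switch1 /dev/vg1/switch2
```

The printed line could then be passed to dmsetup create through its
--table option.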

Messages
========

set_region_mappings <index>:<path_nr> [<index>]:<path_nr> [<index>]:<path_nr>...

Modify the region table by specifying which regions are redirected to
which paths.

<index>
    The region number (region size was specified in constructor parameters).
    If index is omitted, the next region (previous index + 1) is used.
    Expressed in hexadecimal (WITHOUT any prefix like 0x).

<path_nr>
    The path number in the range 0 ... (<num_paths> - 1).
    Expressed in hexadecimal (WITHOUT any prefix like 0x).

R<n>,<m>
    This parameter allows repetitive patterns to be loaded quickly. <n> and <m>
    are hexadecimal numbers. The last <n> mappings are repeated in the next <m>
    slots.
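
To make the argument syntax concrete, here is a minimal sketch that
expands a set_region_mappings argument list into explicit
<index>:<path_nr> pairs. The function expand_mappings is hypothetical and
is not the kernel's parser. It assumes that the <n> entries referenced by
R<n>,<m> were set earlier in the same call, and it does no error checking:

```shell
# expand_mappings <arg>...
# Hypothetical helper: prints one "<index>:<path_nr>" pair per line (hex).
expand_mappings() {
    idx=-1                                  # index of the last region written
    for arg in "$@"; do
        case $arg in
        R*)                                 # R<n>,<m>: repeat last n over next m slots
            spec=${arg#R}
            n=$((0x${spec%,*}))
            m=$((0x${spec#*,}))
            while [ "$m" -gt 0 ]; do
                idx=$((idx + 1))
                eval "path=\$map_$((idx - n))"   # copy the entry n slots back
                eval "map_$idx=\$path"
                printf '%x:%s\n' "$idx" "$path"
                m=$((m - 1))
            done
            ;;
        *)                                  # [<index>]:<path_nr>
            index=${arg%%:*}
            path=${arg#*:}
            if [ -n "$index" ]; then
                idx=$((0x$index))           # explicit hexadecimal index
            else
                idx=$((idx + 1))            # omitted: previous index + 1
            fi
            eval "map_$idx=\$path"
            printf '%x:%s\n' "$idx" "$path"
            ;;
        esac
    done
}

expand_mappings 1000:1 :2 R2,10
```

For this input the sketch prints 18 pairs, from 1000:1 through 1011:2,
alternating between paths 1 and 2.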

Status
======

No status line is reported.

Example
=======

Assume that you have volumes vg1/switch0, vg1/switch1 and vg1/switch2, all
of the same size.

Create a switch device with a 64kB region size (128 sectors of 512 bytes)::

    dmsetup create switch --table "0 `blockdev --getsz /dev/vg1/switch0`
        switch 3 128 0 /dev/vg1/switch0 0 /dev/vg1/switch1 0 /dev/vg1/switch2 0"
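
The value 128 used above is the 64kB region size expressed in 512-byte
sectors. The sketch below shows that conversion and, assuming that region
numbers are simply the sector number divided by the region size, which
region a given sector falls into. The helper region_index is hypothetical:

```shell
region_kib=64
region_size=$(( region_kib * 1024 / 512 ))   # sectors per region
echo "$region_size"                          # prints 128

# Hypothetical helper: region index (hex, no 0x prefix) containing a sector.
# Assumes region number = sector / region_size.
region_index() {
    printf '%x\n' $(( $1 / $2 ))
}

region_index 1048576 "$region_size"          # prints 2000
```

The hexadecimal result is in the form expected for <index> in a
set_region_mappings message.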

Set mappings for the first 7 entries to point to devices switch0, switch1,
switch2, switch0, switch1, switch2, switch1::

    dmsetup message switch 0 set_region_mappings 0:0 :1 :2 :0 :1 :2 :1

Set a repetitive mapping. This command::

    dmsetup message switch 0 set_region_mappings 1000:1 :2 R2,10

is equivalent to::

    dmsetup message switch 0 set_region_mappings 1000:1 :2 :1 :2 :1 :2 :1 :2 \
        :1 :2 :1 :2 :1 :2 :1 :2 :1 :2