=========
dm-switch
=========

The device-mapper switch target creates a device that supports an
arbitrary mapping of fixed-size regions of I/O across a fixed set of
paths.  The path used for any specific region can be switched
dynamically by sending the target a message.

It maps I/O to underlying block devices efficiently when there is a large
number of fixed-size address regions but no simple pattern that would
allow for a compact representation of the mapping, such as the striping
pattern that dm-stripe relies on.

Background
----------

Dell EqualLogic and some other iSCSI storage arrays use a distributed
frameless architecture.  In this architecture, the storage group
consists of a number of distinct storage arrays ("members"), each having
independent controllers, disk storage and network adapters.  When a LUN
is created, it is spread across multiple members.  The details of the
spreading are hidden from initiators connected to this storage system.
The storage group exposes a single target discovery portal, no matter
how many members are being used.  When iSCSI sessions are created, each
session is connected to an Ethernet port on a single member.  Data to a
LUN can be sent on any iSCSI session, and if the blocks being accessed
are stored on another member, the I/O will be forwarded as required.
This forwarding is invisible to the initiator.  The storage layout is
also dynamic, and the blocks stored on disk may be moved from member to
member as needed to balance the load.

This architecture simplifies the management and configuration of both
the storage group and initiators.  In a multipathing configuration, it
is possible to set up multiple iSCSI sessions to use multiple network
interfaces on both the host and target to take advantage of the
increased network bandwidth.  An initiator could use a simple
round-robin algorithm to send I/O across all paths and let the storage
array members forward it as necessary, but there is a performance
advantage to sending data directly to the correct member.

A device-mapper table already lets you map different regions of a
device onto different targets.  However, in this architecture the LUN is
spread with an address region size on the order of tens of megabytes,
which means the resulting table could have more than a million entries
and consume far too much memory.

Using this device-mapper switch target, we can now build a two-layer
device hierarchy:

    Upper Tier - Determine which array member the I/O should be sent to.
    Lower Tier - Load balance amongst paths to a particular member.

The lower tier consists of a single dm multipath device for each member.
Each of these multipath devices contains the set of paths directly to
the array member in one priority group, and leverages existing path
selectors to load balance amongst these paths.  We also build a
non-preferred priority group containing paths to other array members for
failover.

The upper tier consists of a single dm-switch device.  This device uses
a bitmap to look up the location of the I/O and choose the appropriate
lower-tier device to route the I/O to.  By using a bitmap we are able to
use 4 bits for each address range in a 16 member group (which is very
large for us).  This is a much denser representation than the dm table
b-tree can achieve.
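
The density claim can be checked with a small sketch.  The Python below is
illustrative only and is not the kernel's data layout: the class name
``PackedRegionTable``, the helper ``entry_bits``, the restriction to
power-of-two entry widths, and the figure of 2^20 regions (standing in for
"more than a million entries") are all assumptions made for the example.

```python
def entry_bits(num_paths: int) -> int:
    """Smallest power-of-two bit width (1, 2, 4 or 8) that holds a path number."""
    if not 1 <= num_paths <= 256:
        raise ValueError("this sketch supports 1..256 paths")
    bits = 1
    while (1 << bits) < num_paths:
        bits *= 2
    return bits


class PackedRegionTable:
    """Region-to-path table packed into a bytearray (hypothetical layout)."""

    def __init__(self, num_regions: int, num_paths: int) -> None:
        if num_regions < 1:
            raise ValueError("num_regions must be positive")
        self.num_regions = num_regions
        self.num_paths = num_paths
        self.bits = entry_bits(num_paths)
        self.per_byte = 8 // self.bits
        # Round up so the last, partially used byte is allocated too.
        # Every region starts out mapped to path 0.
        self._data = bytearray((num_regions + self.per_byte - 1) // self.per_byte)

    def _locate(self, region: int):
        """Return (byte index, bit shift) of the entry for a region."""
        if not 0 <= region < self.num_regions:
            raise IndexError("region out of range")
        byte, slot = divmod(region, self.per_byte)
        return byte, slot * self.bits

    def set(self, region: int, path: int) -> None:
        if not 0 <= path < self.num_paths:
            raise ValueError("path out of range")
        byte, shift = self._locate(region)
        mask = ((1 << self.bits) - 1) << shift
        # Clear the old entry, then store the new path number in its place.
        self._data[byte] = (self._data[byte] & ~mask & 0xFF) | (path << shift)

    def get(self, region: int) -> int:
        byte, shift = self._locate(region)
        return (self._data[byte] >> shift) & ((1 << self.bits) - 1)

    def size_bytes(self) -> int:
        return len(self._data)
```

With 16 paths each entry takes 4 bits, so a table of 2^20 regions occupies
512 KiB in this sketch.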

Construction Parameters
=======================

    <num_paths> <region_size> <num_optional_args> [<optional_args>...] [<dev_path> <offset>]+
        <num_paths>
            The number of paths across which to distribute the I/O.

        <region_size>
            The number of 512-byte sectors in a region. Each region can be redirected
            to any of the available paths.

        <num_optional_args>
            The number of optional arguments. Currently, no optional arguments
            are supported and so this must be zero.

        <dev_path>
            The block device that represents a specific path to the device.

        <offset>
            The offset of the start of data on the specific <dev_path> (in units
            of 512-byte sectors). This number is added to the sector number when
            forwarding the request to the specific path. Typically it is zero.
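
As an illustration of how these parameters fit together, the following
Python sketch assembles a table line.  The function name
``build_switch_table`` and the 2097152-sector device length used in the
usage note are hypothetical; the leading ``<start> <length> switch``
fields are the usual device-mapper table prefix that precedes the target
parameters.

```python
from typing import Sequence, Tuple


def build_switch_table(length_sectors: int, region_size: int,
                       paths: Sequence[Tuple[str, int]],
                       start: int = 0) -> str:
    """Return a device-mapper table line for a switch target.

    paths is a sequence of (dev_path, offset) pairs; offsets and sizes
    are in 512-byte sectors.
    """
    if length_sectors <= 0:
        raise ValueError("length_sectors must be positive")
    if region_size <= 0:
        raise ValueError("region_size must be positive")
    if not paths:
        raise ValueError("at least one path is required")

    # <num_optional_args> is always 0: no optional arguments are supported.
    fields = [str(start), str(length_sectors), "switch",
              str(len(paths)), str(region_size), "0"]
    for dev_path, offset in paths:
        if offset < 0:
            raise ValueError("offset must not be negative")
        fields += [dev_path, str(offset)]
    return " ".join(fields)
```

For example, three paths with 128-sector regions over an assumed
2097152-sector device yield
``0 2097152 switch 3 128 0 /dev/vg1/switch0 0 /dev/vg1/switch1 0 /dev/vg1/switch2 0``.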

Messages
========

set_region_mappings <index>:<path_nr> [<index>]:<path_nr> [<index>]:<path_nr>...

Modify the region table by specifying which regions are redirected to
which paths.

<index>
    The region number (the region size is specified in the construction
    parameters).  If index is omitted, the next region (previous index + 1)
    is used.  Expressed in hexadecimal (WITHOUT any prefix like 0x).

<path_nr>
    The path number in the range 0 ... (<num_paths> - 1).
    Expressed in hexadecimal (WITHOUT any prefix like 0x).

R<n>,<m>
    This parameter allows repetitive patterns to be loaded quickly. <n> and <m>
    are hexadecimal numbers. The last <n> mappings are repeated in the next <m>
    slots.
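
The argument syntax can be made concrete with a user-space sketch that
expands a message into explicit (region, path) pairs.  This is not the
kernel parser: the names ``parse_hex`` and ``expand_region_mappings`` are
hypothetical, and the sketch assumes that the first mapping carries an
explicit index and that ``R<n>,<m>`` only repeats mappings written earlier
in the same message, since the description above does not define the other
cases.

```python
import string
from typing import Dict, List, Optional, Tuple


def parse_hex(text: str) -> int:
    """Parse bare hexadecimal digits; a prefix such as 0x is rejected."""
    if not text or any(ch not in string.hexdigits for ch in text):
        raise ValueError(f"not a bare hexadecimal number: {text!r}")
    return int(text, 16)


def expand_region_mappings(args: List[str]) -> List[Tuple[int, int]]:
    """Expand set_region_mappings arguments into (region, path) pairs."""
    written: Dict[int, int] = {}      # mappings set so far in this message
    pairs: List[Tuple[int, int]] = []
    index: Optional[int] = None       # last region written, if any

    def write(region: int, path: int) -> None:
        written[region] = path
        pairs.append((region, path))

    for arg in args:
        if arg.startswith("R"):
            # R<n>,<m>: repeat the last <n> mappings over the next <m> slots.
            n_text, comma, m_text = arg[1:].partition(",")
            if not comma:
                raise ValueError(f"malformed repeat argument: {arg!r}")
            cycle, count = parse_hex(n_text), parse_hex(m_text)
            if index is None or cycle < 1:
                raise ValueError("nothing to repeat")
            for _ in range(count):
                index += 1
                source = index - cycle
                if source not in written:
                    raise ValueError("repeat reaches before this message")
                write(index, written[source])
            continue

        index_text, colon, path_text = arg.partition(":")
        if not colon:
            raise ValueError(f"malformed mapping: {arg!r}")
        if index_text:
            index = parse_hex(index_text)
        elif index is None:
            raise ValueError("first mapping needs an explicit index")
        else:
            index += 1                # omitted index: previous index + 1
        write(index, parse_hex(path_text))
    return pairs
```

Expanding ``1000:1 :2 R2,10`` with this sketch gives 18 pairs starting at
region 0x1000, because ``10`` is hexadecimal for 16 repeated slots.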

Status
======

No status line is reported.

Example
=======

Assume that you have volumes vg1/switch0, vg1/switch1 and vg1/switch2 of
the same size.

Create a switch device with a 64 KiB region size (128 sectors of 512
bytes)::

    dmsetup create switch --table "0 `blockdev --getsz /dev/vg1/switch0`
        switch 3 128 0 /dev/vg1/switch0 0 /dev/vg1/switch1 0 /dev/vg1/switch2 0"

Set mappings for the first 7 entries to point to devices switch0, switch1,
switch2, switch0, switch1, switch2, switch1::

    dmsetup message switch 0 set_region_mappings 0:0 :1 :2 :0 :1 :2 :1

Set a repetitive mapping.  Because <m> is hexadecimal, ``R2,10`` repeats
the last 2 mappings over the next 16 slots.  This command::

    dmsetup message switch 0 set_region_mappings 1000:1 :2 R2,10

is equivalent to::

    dmsetup message switch 0 set_region_mappings 1000:1 :2 :1 :2 :1 :2 :1 :2 \
        :1 :2 :1 :2 :1 :2 :1 :2 :1 :2
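
To see where I/O would land after the first two commands, the routing rule
can be sketched in Python.  This only models the description given under
Construction Parameters (region = sector / region_size, then the path's
offset is added to the sector); the function name ``route_sector`` and the
constants are hypothetical, and sectors are assumed to be relative to the
start of the switch device.

```python
from typing import List, Sequence, Tuple

REGION_SIZE = 128                      # sectors per region (64 KiB)
PATHS = [("/dev/vg1/switch0", 0),      # (dev_path, offset) per path number
         ("/dev/vg1/switch1", 0),
         ("/dev/vg1/switch2", 0)]
REGION_TO_PATH = [0, 1, 2, 0, 1, 2, 1]  # first 7 regions, as set above


def route_sector(sector: int, region_size: int,
                 region_to_path: Sequence[int],
                 paths: List[Tuple[str, int]]) -> Tuple[str, int]:
    """Return (device, sector on that device) for a sector of the switch."""
    if sector < 0:
        raise ValueError("sector must not be negative")
    region = sector // region_size
    if region >= len(region_to_path):
        raise IndexError("region is not covered by this table")
    dev_path, offset = paths[region_to_path[region]]
    # The path's offset is added to the sector number when forwarding.
    return dev_path, sector + offset
```

For instance, sector 400 lies in region 3, which the first message mapped
to path 0, so it is routed to /dev/vg1/switch0.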