========
dm-zoned
========

The dm-zoned device mapper target exposes a zoned block device (ZBC and
ZAC compliant devices) as a regular block device without any write
pattern constraints. In effect, it implements a drive-managed zoned
block device which hides from the user (a file system or an application
doing raw block device accesses) the sequential write constraints of
host-managed zoned block devices and can mitigate the potential
device-side performance degradation due to excessive random writes on
host-aware zoned block devices.

For a more detailed description of the zoned block device models and
their constraints see (for SCSI devices):

https://www.t10.org/drafts.htm#ZBC_Family

and (for ATA devices):

http://www.t13.org/Documents/UploadedDocuments/docs2015/di537r05-Zoned_Device_ATA_Command_Set_ZAC.pdf

The dm-zoned implementation is simple and minimizes system overhead (CPU
and memory usage as well as storage capacity loss). For a 10 TB
host-managed disk with 256 MB zones, dm-zoned memory usage per disk
instance is at most 4.5 MB and as few as 5 zones will be used
internally for storing metadata and performing reclaim operations.
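To put the capacity figure in perspective, the following sketch works
out how many zones such a disk has and what fraction of them 5 zones
represent. It is illustrative arithmetic only: it assumes decimal
terabytes for the disk capacity and 256 MiB for the zone size, and the
helper name ``zone_overhead`` is hypothetical.

```python
# Illustrative arithmetic only; the input figures come from the example above.
TB = 10**12      # assumption: disk capacity is quoted in decimal terabytes
MIB = 2**20      # assumption: "256 MB" zones are 256 MiB


def zone_overhead(disk_bytes, zone_bytes, internal_zones):
    """Return (number of whole zones, fraction of zones used internally)."""
    total_zones = disk_bytes // zone_bytes
    return total_zones, internal_zones / total_zones


total, fraction = zone_overhead(10 * TB, 256 * MIB, 5)
print(f"{total} zones, {fraction:.4%} used internally")
```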

dm-zoned target devices are formatted and checked using the dmzadm
utility available at:

https://github.com/hgst/dm-zoned-tools

Algorithm
=========

dm-zoned implements an on-disk buffering scheme to handle non-sequential
write accesses to the sequential zones of a zoned block device.
Conventional zones are used for caching as well as for storing internal
metadata. It can also use a regular block device together with the zoned
block device; in that case the regular block device will be logically
split into zones with the same size as those of the zoned block device.
These zones will be placed in front of the zones from the zoned block
device and will be handled just like conventional zones.

The zones of the device(s) are separated into 2 types:

1) Metadata zones: these are conventional zones used to store metadata.
Metadata zones are not reported as usable capacity to the user.

2) Data zones: all remaining zones, the vast majority of which will be
sequential zones used exclusively to store user data. The conventional
zones of the device may also be used for buffering random user writes.
Data in these zones may be directly mapped to the conventional zone, but
later moved to a sequential zone so that the conventional zone can be
reused for buffering incoming random writes.

dm-zoned exposes a logical device with a sector size of 4096 bytes,
irrespective of the physical sector size of the backend zoned block
device being used. This allows reducing the amount of metadata needed to
manage valid blocks (blocks written).

The on-disk metadata format is as follows:

1) The first block of the first conventional zone found contains the
super block which describes the on-disk amount and position of metadata
blocks.

2) Following the super block, a set of blocks is used to describe the
mapping of the logical device blocks. The mapping is done per chunk of
blocks, with the chunk size equal to the zone size of the zoned block
device. The mapping table is indexed by chunk number and each mapping
entry indicates the zone number of the device storing the chunk of data.
Each mapping entry may also indicate the zone number of a conventional
zone used to buffer random modifications to the data zone.

3) A set of blocks used to store bitmaps indicating the validity of
blocks in the data zones follows the mapping table. A valid block is
defined as a block that was written and not discarded. For a buffered
data chunk, a block is always valid only in the data zone mapping the
chunk or in the buffer zone of the chunk.
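As a rough illustration of the layout described above, the sketch below
estimates how many 4096-byte blocks one metadata set would occupy. It
is not the on-disk format definition: the 8-byte mapping entry (two
32-bit zone numbers) and the one-bit-per-block bitmap are assumptions
made for the example, and ``metadata_blocks`` is a hypothetical helper.

```python
BLOCK_SIZE = 4096        # logical block size stated in the text
MAP_ENTRY_BYTES = 8      # assumption: data zone + buffer zone number, 32 bits each


def ceil_div(a, b):
    """Integer division rounding up."""
    return -(-a // b)


def metadata_blocks(nr_chunks, nr_data_zones, zone_bytes):
    """Estimate (super, mapping, bitmap) block counts for one metadata set."""
    super_blocks = 1
    # One mapping entry per chunk, packed into 4096-byte blocks.
    map_blocks = ceil_div(nr_chunks * MAP_ENTRY_BYTES, BLOCK_SIZE)
    # Assumption: one validity bit per 4096-byte block of each data zone.
    blocks_per_zone = zone_bytes // BLOCK_SIZE
    bitmap_blocks_per_zone = ceil_div(blocks_per_zone, 8 * BLOCK_SIZE)
    bitmap_blocks = nr_data_zones * bitmap_blocks_per_zone
    return super_blocks, map_blocks, bitmap_blocks


# Hypothetical device: 37000 chunks backed by 37000 data zones of 256 MiB.
print(metadata_blocks(37000, 37000, 256 * 2**20))
```

Under these assumptions a 256 MiB zone holds 65536 blocks, so its
validity bitmap fits in two 4096-byte blocks, and the bitmaps dominate
the metadata size.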

For a logical chunk mapped to a conventional zone, all write operations
are processed by directly writing to the zone. If the mapping zone is a
sequential zone, the write operation is processed directly only if the
write offset within the logical chunk is equal to the write pointer
offset within the sequential data zone (i.e. the write operation is
aligned on the zone write pointer). Otherwise, write operations are
processed indirectly using a buffer zone. In that case, an unused
conventional zone is allocated and assigned to the chunk being
accessed. Writing a block to the buffer zone of a chunk will
automatically invalidate the same block in the sequential zone mapping
the chunk. If all blocks of the sequential zone become invalid, the zone
is freed and the chunk buffer zone becomes the primary zone mapping the
chunk, resulting in native random write performance similar to a regular
block device.
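The write path described above can be modelled with a minimal sketch
such as the following. It is a simplified in-memory model and not the
kernel implementation: the ``Zone`` and ``Chunk`` classes, the
``write_block`` function and its return strings are all hypothetical
names, and block contents are left out so that only the validity
bookkeeping is shown.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Zone:
    sequential: bool
    write_pointer: int = 0                     # next writable block offset
    valid: set = field(default_factory=set)    # offsets of valid blocks


@dataclass
class Chunk:
    data_zone: Zone
    buffer_zone: Optional[Zone] = None


def write_block(chunk, offset, free_conventional):
    """Record a one-block write at `offset`; return the path taken."""
    dzone = chunk.data_zone
    if not dzone.sequential:
        # Conventional zone: always written directly.
        dzone.valid.add(offset)
        return "direct-conventional"
    if offset == dzone.write_pointer:
        # Aligned on the write pointer: written directly to the sequential zone.
        dzone.valid.add(offset)
        dzone.write_pointer += 1
        if chunk.buffer_zone is not None:
            # Keep the block valid in one zone only.
            chunk.buffer_zone.valid.discard(offset)
        return "direct-sequential"
    # Unaligned write: go through a buffer zone, allocating one if needed.
    # pop() raises IndexError when no conventional zone is free.
    if chunk.buffer_zone is None:
        chunk.buffer_zone = free_conventional.pop()
    chunk.buffer_zone.valid.add(offset)
    dzone.valid.discard(offset)                # invalidate the old copy
    if not dzone.valid:
        # Sequential zone fully invalid: the buffer zone becomes the
        # primary zone (the freed sequential zone is simply dropped here).
        chunk.data_zone, chunk.buffer_zone = chunk.buffer_zone, None
    return "buffered"
```

In this model, a chunk whose sequential zone has lost all of its valid
blocks ends up mapped to a conventional zone, after which every write
takes the ``direct-conventional`` path.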

Read operations are processed according to the block validity
information provided by the bitmaps. Valid blocks are read either from
the sequential zone mapping a chunk, or if the chunk is buffered, from
the buffer zone assigned. If the accessed chunk has no mapping, or the
accessed blocks are invalid, the read buffer is zeroed and the read
operation terminated.
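A minimal sketch of this read logic is shown below. As a simplifying
assumption, each zone is modelled as a dictionary mapping the offsets of
its valid blocks to their contents, which stands in for both the
validity bitmap and the stored data. The ``read_block`` function and the
``mapping`` structure are hypothetical.

```python
BLOCK_SIZE = 4096


def read_block(mapping, chunk, offset):
    """Return the contents of one block, or zeroes if it is not valid.

    `mapping` maps a chunk number to a (data_zone, buffer_zone) pair,
    where buffer_zone may be None.
    """
    entry = mapping.get(chunk)
    if entry is None:
        # Unmapped chunk: the read buffer is zeroed.
        return bytes(BLOCK_SIZE)
    data_zone, buffer_zone = entry
    if buffer_zone is not None and offset in buffer_zone:
        # Block is valid in the buffer zone of a buffered chunk.
        return buffer_zone[offset]
    if offset in data_zone:
        # Block is valid in the zone mapping the chunk.
        return data_zone[offset]
    # Block was never written or was discarded.
    return bytes(BLOCK_SIZE)
```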

After some time, the limited number of conventional zones available may
be exhausted (all used to map chunks or buffer sequential zones) and
unaligned writes to unbuffered chunks become impossible. To avoid this
situation, a reclaim process regularly scans used conventional zones and
tries to reclaim the least recently used zones by copying the valid
blocks of the buffer zone to a free sequential zone. Once the copy
completes, the chunk mapping is updated to point to the sequential zone
and the buffer zone is freed for reuse.
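One reclaim step can be sketched as follows. This is a simplified model
under stated assumptions, not the kernel code: zones are dictionaries
mapping valid block offsets to contents, the least recently used order
is represented by the order of the ``lru_chunks`` list (oldest first),
and handling of gaps between valid blocks is omitted. All names are
hypothetical.

```python
def reclaim_lru(mapping, lru_chunks, free_sequential, free_conventional):
    """Move the least recently used conventional zone to a sequential zone.

    Returns the reclaimed chunk number, or None if nothing could be done.
    """
    if not lru_chunks or not free_sequential:
        return None
    chunk = lru_chunks.pop(0)              # oldest entry first
    old_zone = mapping[chunk]
    new_zone = free_sequential.pop()
    # Copy valid blocks in ascending order, as a sequential zone requires.
    for offset in sorted(old_zone):
        new_zone[offset] = old_zone[offset]
    # Only after the copy completes is the mapping switched over.
    mapping[chunk] = new_zone
    old_zone.clear()
    free_conventional.append(old_zone)     # zone is available for reuse
    return chunk
```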

Metadata Protection
===================

To protect metadata against corruption in case of sudden power loss or
system crash, 2 sets of metadata zones are used. One set, the primary
set, is used as the main metadata region, while the secondary set is
used as a staging area. Modified metadata is first written to the
secondary set and validated by updating the super block in the secondary
set; a generation counter is used to indicate that this set contains the
newest metadata. Once this operation completes, in-place metadata
block updates can be done in the primary metadata set. This ensures that
one of the sets is always consistent (all modifications committed or none
at all). Flush operations are used as a commit point. Upon reception of
a flush request, metadata modification activity is temporarily blocked
(for both incoming BIO processing and reclaim process) and all dirty
metadata blocks are staged and updated. Normal operation is then
resumed. Flushing metadata thus only temporarily delays write and
discard requests. Read requests can be processed concurrently while
metadata flush is being executed.
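The two-step update can be illustrated with the small model below. It is
a sketch under stated assumptions rather than the actual implementation:
each metadata set is a dictionary holding a generation counter and its
blocks, the ``crash_after_stage`` flag simulates a power loss between the
two steps, and the recovery rule (use the set with the higher generation)
is an assumption inferred from the role of the generation counter.

```python
import copy


def new_set():
    """Create an empty metadata set with generation 0."""
    return {"gen": 0, "blocks": {}}


def commit(primary, secondary, dirty, crash_after_stage=False):
    """Stage dirty blocks in the secondary set, then update the primary."""
    gen = max(primary["gen"], secondary["gen"]) + 1
    # Step 1: write the dirty blocks to the staging area, then validate
    # them by updating the secondary super block (generation counter).
    secondary["blocks"].update(dirty)
    secondary["gen"] = gen
    if crash_after_stage:
        return                              # simulated power loss
    # Step 2: in-place updates of the primary set.
    primary["blocks"].update(dirty)
    primary["gen"] = gen


def recover(primary, secondary):
    """Assumed recovery rule: trust the set with the higher generation."""
    newest = secondary if secondary["gen"] > primary["gen"] else primary
    return copy.deepcopy(newest)
```

For example, if a crash happens after staging, the primary set still
holds the previous generation while the secondary set holds the complete
new one, so ``recover`` returns the staged metadata.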

If a regular device is used in conjunction with the zoned block device,
a third set of metadata (without the zone bitmaps) is written to the
start of the zoned block device. This metadata has a generation counter of
'0' and will never be updated during normal operation; it just serves for
identification purposes. The first and second copies of the metadata
are located at the start of the regular block device.

Usage
=====

A zoned block device must first be formatted using the dmzadm tool. This
will analyze the device zone configuration, determine where to place the
metadata sets on the device and initialize the metadata sets.

Ex::

        dmzadm --format /dev/sdxx


If two drives are to be used, both devices must be specified, with the
regular block device as the first device.

Ex::

        dmzadm --format /dev/sdxx /dev/sdyy


Formatted device(s) can be started with the dmzadm utility, too:

Ex::

        dmzadm --start /dev/sdxx /dev/sdyy


Information about the internal layout and current usage of the zones can
be obtained with the 'status' callback from dmsetup:

Ex::

        dmsetup status /dev/dm-X

will return a line::

        0 <size> zoned <nr_zones> zones <nr_unmap_rnd>/<nr_rnd> random <nr_unmap_seq>/<nr_seq> sequential

where <nr_zones> is the total number of zones, <nr_unmap_rnd> is the number
of unmapped (i.e. free) random zones, <nr_rnd> the total number of random
zones, <nr_unmap_seq> the number of unmapped sequential zones, and <nr_seq>
the total number of sequential zones.
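For scripting, a status line can be split into its fields with a short
parser such as the sketch below. It assumes the output matches the
format shown above exactly. The ``parse_status`` function is a
hypothetical helper, and the numbers in the sample line are made up for
illustration.

```python
import re

# Pattern for the status line format described above.
STATUS_RE = re.compile(
    r"^(?P<start>\d+) (?P<size>\d+) zoned (?P<nr_zones>\d+) zones "
    r"(?P<nr_unmap_rnd>\d+)/(?P<nr_rnd>\d+) random "
    r"(?P<nr_unmap_seq>\d+)/(?P<nr_seq>\d+) sequential$"
)


def parse_status(line):
    """Parse one dm-zoned status line into a dict of integers."""
    match = STATUS_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized status line: {line!r}")
    return {name: int(value) for name, value in match.groupdict().items()}


# Made-up sample values, for illustration only.
sample = "0 1000000 zoned 100 zones 10/20 random 60/80 sequential"
status = parse_status(sample)
free_rnd_percent = 100 * status["nr_unmap_rnd"] // status["nr_rnd"]
print(status["nr_zones"], free_rnd_percent)
```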

Normally the reclaim process will be started once fewer than 50 percent
of the random zones are free. In order to start the reclaim process
manually even before reaching this threshold, the 'dmsetup message'
function can be used:

Ex::

        dmsetup message /dev/dm-X 0 reclaim

will start the reclaim process and random zones will be moved to sequential
zones.