=======
dm-raid
=======

The device-mapper RAID (dm-raid) target provides a bridge from DM to MD.
It allows the MD RAID drivers to be accessed using a device-mapper
interface.


Mapping Table Interface
-----------------------
The target is named "raid" and it accepts the following parameters::

  <raid_type> <#raid_params> <raid_params> \
    <#raid_devs> <metadata_dev0> <dev0> [.. <metadata_devN> <devN>]

<raid_type>:

  ============= ===============================================================
  raid0         RAID0 striping (no resilience)
  raid1         RAID1 mirroring
  raid4         RAID4 with dedicated last parity disk
  raid5_n       RAID5 with dedicated last parity disk supporting takeover
                Same as raid4

                - Transitory layout
  raid5_la      RAID5 left asymmetric

                - rotating parity 0 with data continuation
  raid5_ra      RAID5 right asymmetric

                - rotating parity N with data continuation
  raid5_ls      RAID5 left symmetric

                - rotating parity 0 with data restart
  raid5_rs      RAID5 right symmetric

                - rotating parity N with data restart
  raid6_zr      RAID6 zero restart

                - rotating parity zero (left-to-right) with data restart
  raid6_nr      RAID6 N restart

                - rotating parity N (right-to-left) with data restart
  raid6_nc      RAID6 N continue

                - rotating parity N (right-to-left) with data continuation
  raid6_n_6     RAID6 with dedicated parity disks

                - parity and Q-syndrome on the last 2 disks;
                  layout for takeover from/to raid4/raid5_n
  raid6_la_6    Same as "raid5_la" plus dedicated last Q-syndrome disk

                - layout for takeover from raid5_la from/to raid6
  raid6_ra_6    Same as "raid5_ra" plus dedicated last Q-syndrome disk

                - layout for takeover from raid5_ra from/to raid6
  raid6_ls_6    Same as "raid5_ls" plus dedicated last Q-syndrome disk

                - layout for takeover from raid5_ls from/to raid6
  raid6_rs_6    Same as "raid5_rs" plus dedicated last Q-syndrome disk

                - layout for takeover from raid5_rs from/to raid6
  raid10        Various RAID10 inspired algorithms chosen by additional params
                (see raid10_format and raid10_copies below)

                - RAID10: Striped Mirrors (aka 'Striping on top of mirrors')
                - RAID1E: Integrated Adjacent Stripe Mirroring
                - RAID1E: Integrated Offset Stripe Mirroring
                - and other similar RAID10 variants
  ============= ===============================================================

Reference: Chapter 4 of
https://www.snia.org/sites/default/files/SNIA_DDF_Technical_Position_v2.0.pdf

<#raid_params>: The number of parameters that follow.

<raid_params> consists of

  Mandatory parameters:
    <chunk_size>:
      Chunk size in sectors. This parameter is often known as
      "stripe size". It is the only mandatory parameter and
      is placed first.

  followed by optional parameters (in any order):
    [sync|nosync]
      Force or prevent RAID initialization.

    [rebuild <idx>]
      Rebuild drive number 'idx' (first drive is 0).

    [daemon_sleep <ms>]
      Interval between runs of the bitmap daemon that clears bits.
      A longer interval means less bitmap I/O but resyncing after
      a failure is likely to take longer.

    [min_recovery_rate <kB/sec/disk>]
      Throttle RAID initialization
    [max_recovery_rate <kB/sec/disk>]
      Throttle RAID initialization
    [write_mostly <idx>]
      Mark drive index 'idx' write-mostly.
    [max_write_behind <sectors>]
      See '--write-behind=' (man mdadm)
    [stripe_cache <sectors>]
      Stripe cache size (RAID 4/5/6 only)
    [region_size <sectors>]
      The region_size multiplied by the number of regions is the
      logical size of the array. The bitmap records the device
      synchronisation state for each region.

    [raid10_copies <# copies>], [raid10_format <near|far|offset>]
      These two options are used to alter the default layout of
      a RAID10 configuration. The number of copies can be
      specified, but the default is 2. There are also three
      variations to how the copies are laid down - the default
      is "near". Near copies are what most people think of with
      respect to mirroring. If these options are left unspecified,
      or 'raid10_copies 2' and/or 'raid10_format near' are given,
      then the layouts for 2, 3 and 4 devices are:

      ======== ========== ==============
      2 drives 3 drives   4 drives
      ======== ========== ==============
      A1  A1   A1  A1  A2 A1  A1  A2  A2
      A2  A2   A2  A3  A3 A3  A3  A4  A4
      A3  A3   A4  A4  A5 A5  A5  A6  A6
      A4  A4   A5  A6  A6 A7  A7  A8  A8
      ..  ..   ..  ..  .. ..  ..  ..  ..
      ======== ========== ==============

      The 2-device layout is equivalent to 2-way RAID1. The 4-device
      layout is what a traditional RAID10 would look like. The
      3-device layout is what might be called a 'RAID1E - Integrated
      Adjacent Stripe Mirroring'.

      If 'raid10_copies 2' and 'raid10_format far', then the layouts
      for 2, 3 and 4 devices are:

      ======== ============ ===================
      2 drives 3 drives     4 drives
      ======== ============ ===================
      A1  A2   A1   A2   A3 A1   A2   A3   A4
      A3  A4   A4   A5   A6 A5   A6   A7   A8
      A5  A6   A7   A8   A9 A9   A10  A11  A12
      ..  ..   ..   ..   .. ..   ..   ..   ..
      A2  A1   A3   A1   A2 A2   A1   A4   A3
      A4  A3   A6   A4   A5 A6   A5   A8   A7
      A6  A5   A9   A7   A8 A10  A9   A12  A11
      ..  ..   ..   ..   .. ..   ..   ..   ..
      ======== ============ ===================

      If 'raid10_copies 2' and 'raid10_format offset', then the
      layouts for 2, 3 and 4 devices are:

      ======== ========== ================
      2 drives 3 drives   4 drives
      ======== ========== ================
      A1  A2   A1  A2  A3 A1  A2  A3  A4
      A2  A1   A3  A1  A2 A2  A1  A4  A3
      A3  A4   A4  A5  A6 A5  A6  A7  A8
      A4  A3   A6  A4  A5 A6  A5  A8  A7
      A5  A6   A7  A8  A9 A9  A10 A11 A12
      A6  A5   A9  A7  A8 A10 A9  A12 A11
      ..  ..   ..  ..  .. ..  ..  ..  ..
      ======== ========== ================

      Here we see layouts closely akin to 'RAID1E - Integrated
      Offset Stripe Mirroring'.

    [delta_disks <N>]
      The delta_disks option value (-251 < N < +251) triggers
      device removal (negative value) or device addition (positive
      value) to any reshape supporting raid levels 4/5/6 and 10.
      RAID levels 4/5/6 allow for addition of devices (metadata
      and data device tuple), raid10_near and raid10_offset only
      allow for device addition. raid10_far does not support any
      reshaping at all.
      A minimum number of devices has to be kept to preserve
      resilience: 3 devices for raid4/5 and 4 devices for raid6.

    [data_offset <sectors>]
      This option value defines the offset into each data device
      where the data starts. This is used to provide out-of-place
      reshaping space to avoid writing over data while
      changing the layout of stripes, hence an interruption/crash
      may happen at any time without the risk of losing data.
      E.g. when adding devices to an existing raid set during
      forward reshaping, the out-of-place space will be allocated
      at the beginning of each raid device. The kernel raid4/5/6/10
      MD personalities supporting such device addition will read the data from
      the existing first stripes (those with a smaller number of stripes)
      starting at data_offset to fill up a new stripe with the larger
      number of stripes, calculate the redundancy blocks (CRC/Q-syndrome)
      and write that new stripe to offset 0. The same is applied to all
      N-1 other new stripes. This out-of-place scheme is used to change
      the RAID type (i.e. the allocation algorithm) as well, e.g.
      changing from raid5_ls to raid5_n.

    [journal_dev <dev>]
      This option adds a journal device to raid4/5/6 raid sets and
      uses it to close the 'write hole' caused by the non-atomic updates
      to the component devices, which can cause data loss during recovery.
      The journal device is used as writethrough, thus causing writes to
      be throttled versus non-journaled raid4/5/6 sets.
      Takeover/reshape is not possible with a raid4/5/6 journal device;
      it has to be deconfigured before requesting these.

    [journal_mode <mode>]
      This option sets the caching mode on journaled raid4/5/6 raid sets
      (see 'journal_dev <dev>' above) to 'writethrough' or 'writeback'.
      If 'writeback' is selected, the journal device has to be resilient
      and must not suffer from the 'write hole' problem itself (e.g. use
      raid1 or raid10) to avoid a single point of failure.

<#raid_devs>: The number of devices composing the array.
  Each device consists of two entries. The first is the device
  containing the metadata (if any); the second is the one containing the
  data. A maximum of 64 metadata/data device entries is supported
  up to target version 1.8.0; 1.9.0 supports up to 253, a limit
  enforced by the MD kernel runtime.

  If a drive has failed or is missing at creation time, a '-' can be
  given for both the metadata and data drives for a given position.


Example Tables
--------------

::

  # RAID4 - 4 data drives, 1 parity (no metadata devices)
  # No metadata devices specified to hold superblock/bitmap info
  # Chunk size of 1MiB
  # (Lines separated for easy reading)

  0 1960893648 raid \
     raid4 1 2048 \
     5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81

  # RAID4 - 4 data drives, 1 parity (with metadata devices)
  # Chunk size of 1MiB, force RAID initialization,
  #   min recovery rate at 20 kiB/sec/disk

  0 1960893648 raid \
     raid4 4 2048 sync min_recovery_rate 20 \
     5 8:17 8:18 8:33 8:34 8:49 8:50 8:65 8:66 8:81 8:82
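
As a further illustration (the target length and device numbers below are
hypothetical and must match your actual devices), a 4-drive RAID10 set with
two "near" copies could be described as::

  # RAID10 - 4 drives, 2 "near" copies (with metadata devices)
  # Chunk size of 512KiB

  0 980446824 raid \
     raid10 5 1024 raid10_copies 2 raid10_format near \
     4 8:17 8:18 8:33 8:34 8:49 8:50 8:65 8:66

A table is activated with 'dmsetup create'; the device name "my_raid" below
is arbitrary and the table string is the first example above::

  dmsetup create my_raid --table \
     "0 1960893648 raid raid4 1 2048 5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81"
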


Status Output
-------------
'dmsetup table' displays the table used to construct the mapping.
The optional parameters are always printed in the order listed
above, with "sync" or "nosync" always output ahead of the other
arguments, regardless of the order used when originally loading the table.
Arguments that can be repeated are ordered by value.


'dmsetup status' yields information on the state and health of the array.
The output is as follows (normally a single line, but expanded here for
clarity)::

  1: <s> <l> raid \
  2:      <raid_type> <#devices> <health_chars> \
  3:      <sync_ratio> <sync_action> <mismatch_cnt>

Line 1 is the standard output produced by device-mapper.

Lines 2 & 3 are produced by the raid target and are best explained by example::

  0 1960893648 raid raid4 5 AAAAA 2/490221568 init 0

Here we can see the RAID type is raid4, there are 5 devices - all of
which are 'A'live, and the array is 2/490221568 complete with its initial
recovery. Here is a fuller description of the individual fields:

=============== =========================================================
<raid_type>     Same as the <raid_type> used to create the array.
<health_chars>  One char for each device, indicating:

                - 'A' = alive and in-sync
                - 'a' = alive but not in-sync
                - 'D' = dead/failed.
<sync_ratio>    The ratio indicating how much of the array has undergone
                the process described by 'sync_action'. If the
                'sync_action' is "check" or "repair", then the process
                of "resync" or "recover" can be considered complete.
<sync_action>   One of the following possible states:

                idle
                  - No synchronization action is being performed.
                frozen
                  - The current action has been halted.
                resync
                  - Array is undergoing its initial synchronization
                    or is resynchronizing after an unclean shutdown
                    (possibly aided by a bitmap).
                recover
                  - A device in the array is being rebuilt or
                    replaced.
                check
                  - A user-initiated full check of the array is
                    being performed. All blocks are read and
                    checked for consistency. The number of
                    discrepancies found is recorded in
                    <mismatch_cnt>. No changes are made to the
                    array by this action.
                repair
                  - The same as "check", but discrepancies are
                    corrected.
                reshape
                  - The array is undergoing a reshape.
<mismatch_cnt>  The number of discrepancies found between mirror copies
                in RAID1/10 or wrong parity values found in RAID4/5/6.
                This value is valid only after a "check" of the array
                is performed. A healthy array has a 'mismatch_cnt' of 0.
<data_offset>   The current data offset to the start of the user data on
                each component device of a raid set (see the respective
                raid parameter to support out-of-place reshaping).
<journal_char>  - 'A' - active write-through journal device.
                - 'a' - active write-back journal device.
                - 'D' - dead journal device.
                - '-' - no journal device.
=============== =========================================================
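
For example, the current state of an active array can be queried with
dmsetup; the device name "my_raid" is a placeholder and the output shown
is only illustrative::

  dmsetup status my_raid
  0 1960893648 raid raid4 5 AAAAA 490221568/490221568 idle 0
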


Message Interface
-----------------
The dm-raid target will accept certain actions through the 'message' interface.
('man dmsetup' for more information on the message interface.) These actions
include:

========= ================================================
"idle"    Halt the current sync action.
"frozen"  Freeze the current sync action.
"resync"  Initiate/continue a resync.
"recover" Initiate/continue a recover process.
"check"   Initiate a check (i.e. a "scrub") of the array.
"repair"  Initiate a repair of the array.
========= ================================================
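
For example, a scrub of the array can be started by sending the "check"
message to sector 0 of an active dm-raid device (the device name "my_raid"
is a placeholder); progress and the resulting mismatch count then show up
in 'dmsetup status'::

  dmsetup message my_raid 0 check
  dmsetup status my_raid
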


Discard Support
---------------
The implementation of discard support among hardware vendors varies.
When a block is discarded, some storage devices will return zeroes when
the block is read. These devices set the 'discard_zeroes_data'
attribute. Other devices will return random data. Confusingly, some
devices that advertise 'discard_zeroes_data' will not reliably return
zeroes when discarded blocks are read! Since RAID 4/5/6 uses blocks
from a number of devices to calculate parity blocks and (for performance
reasons) relies on 'discard_zeroes_data' being reliable, it is important
that the devices be consistent. Blocks may be discarded in the middle
of a RAID 4/5/6 stripe and if subsequent read results are not
consistent, the parity blocks may be calculated differently at any time,
making the parity blocks useless for redundancy. It is important to
understand how your hardware behaves with discards if you are going to
enable discards with RAID 4/5/6.

Since the behavior of storage devices is unreliable in this respect,
even when reporting 'discard_zeroes_data', by default RAID 4/5/6
discard support is disabled -- this ensures data integrity at the
expense of losing some performance.

Storage devices that properly support 'discard_zeroes_data' are
increasingly whitelisted in the kernel and can thus be trusted.

For trusted devices, the following dm-raid module parameter can be set
to safely enable discard support for RAID 4/5/6:

    'devices_handle_discard_safely'
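
As a sketch, and assuming dm-raid is built as a module, the parameter can
be supplied at load time or toggled later through sysfs (the sysfs path
below follows the usual module parameter convention)::

    modprobe dm-raid devices_handle_discard_safely=1

    echo 1 > /sys/module/dm_raid/parameters/devices_handle_discard_safely
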


Version History
---------------

::

 1.0.0   Initial version. Support for RAID 4/5/6
 1.1.0   Added support for RAID 1
 1.2.0   Handle creation of arrays that contain failed devices.
 1.3.0   Added support for RAID 10
 1.3.1   Allow device replacement/rebuild for RAID 10
 1.3.2   Fix/improve redundancy checking for RAID10
 1.4.0   Non-functional change. Removes arg from mapping function.
 1.4.1   RAID10 fix redundancy validation checks (commit 55ebbb5).
 1.4.2   Add RAID10 "far" and "offset" algorithm support.
 1.5.0   Add message interface to allow manipulation of the sync_action.
         New status (STATUSTYPE_INFO) fields: sync_action and mismatch_cnt.
 1.5.1   Add ability to restore transiently failed devices on resume.
 1.5.2   'mismatch_cnt' is zero unless [last_]sync_action is "check".
 1.6.0   Add discard support (and devices_handle_discard_safely module param).
 1.7.0   Add support for MD RAID0 mappings.
 1.8.0   Explicitly check for compatible flags in the superblock metadata
         and refuse to start the raid set if any are set by a newer
         target version, thus avoiding data corruption on a raid set
         with a reshape in progress.
 1.9.0   Add support for RAID level takeover/reshape/region size
         and set size reduction.
 1.9.1   Fix activation of existing RAID 4/10 mapped devices
 1.9.2   Don't emit '- -' on the status table line in case the constructor
         fails reading a superblock. Correctly emit 'maj:min1 maj:min2' and
         'D' on the status line. If '- -' is passed into the constructor, emit
         '- -' on the table line and '-' as the status line health character.
 1.10.0  Add support for raid4/5/6 journal device
 1.10.1  Fix data corruption on reshape request
 1.11.0  Fix table line argument order
         (wrong raid10_copies/raid10_format sequence)
 1.11.1  Add raid4/5/6 journal write-back support via journal_mode option
 1.12.1  Fix for MD deadlock between mddev_suspend() and md_write_start() available
 1.13.0  Fix dev_health status at end of "recover" (was 'a', now 'A')
 1.13.1  Fix deadlock caused by early md_stop_writes(). Also fix size and
         state races.
 1.13.2  Fix raid redundancy validation and avoid keeping raid set frozen
 1.14.0  Fix reshape race on small devices. Fix stripe adding reshape
         deadlock/potential data corruption. Update superblock when
         specific devices are requested via rebuild. Fix RAID leg
         rebuild errors.
 1.15.0  Fix size extensions not being synchronized in case of new MD bitmap
         pages allocated; also fix those not occurring after previous reductions
 1.15.1  Fix argument count and arguments for rebuild/write_mostly/journal_(dev|mode)
         on the status line.