=======
dm-raid
=======

The device-mapper RAID (dm-raid) target provides a bridge from DM to MD.
It allows the MD RAID drivers to be accessed using a device-mapper
interface.


Mapping Table Interface
-----------------------
The target is named "raid" and it accepts the following parameters::

  <raid_type> <#raid_params> <raid_params> \
    <#raid_devs> <metadata_dev0> <dev0> [.. <metadata_devN> <devN>]

<raid_type>:

  ============= ===============================================================
  raid0         RAID0 striping (no resilience)
  raid1         RAID1 mirroring
  raid4         RAID4 with dedicated last parity disk
  raid5_n       RAID5 with dedicated last parity disk supporting takeover
                Same as raid4

                - Transitory layout
  raid5_la      RAID5 left asymmetric

                - rotating parity 0 with data continuation
  raid5_ra      RAID5 right asymmetric

                - rotating parity N with data continuation
  raid5_ls      RAID5 left symmetric

                - rotating parity 0 with data restart
  raid5_rs      RAID5 right symmetric

                - rotating parity N with data restart
  raid6_zr      RAID6 zero restart

                - rotating parity zero (left-to-right) with data restart
  raid6_nr      RAID6 N restart

                - rotating parity N (right-to-left) with data restart
  raid6_nc      RAID6 N continue

                - rotating parity N (right-to-left) with data continuation
  raid6_n_6     RAID6 with dedicated parity disks

                - parity and Q-syndrome on the last 2 disks;
                  layout for takeover from/to raid4/raid5_n
  raid6_la_6    Same as "raid5_la" plus dedicated last Q-syndrome disk

                - layout for takeover from raid5_la from/to raid6
  raid6_ra_6    Same as "raid5_ra" plus dedicated last Q-syndrome disk

                - layout for takeover from raid5_ra from/to raid6
  raid6_ls_6    Same as "raid5_ls" plus dedicated last Q-syndrome disk

                - layout for takeover from raid5_ls from/to raid6
  raid6_rs_6    Same as "raid5_rs" plus dedicated last Q-syndrome disk

                - layout for takeover from raid5_rs from/to raid6
  raid10        Various RAID10 inspired algorithms chosen by additional params
                (see raid10_format and raid10_copies below)

                - RAID10: Striped Mirrors (aka 'Striping on top of mirrors')
                - RAID1E: Integrated Adjacent Stripe Mirroring
                - RAID1E: Integrated Offset Stripe Mirroring
                - and other similar RAID10 variants
  ============= ===============================================================

  Reference: Chapter 4 of
  https://www.snia.org/sites/default/files/SNIA_DDF_Technical_Position_v2.0.pdf

<#raid_params>: The number of parameters that follow.

<raid_params> consists of

    Mandatory parameters:
        <chunk_size>:
                      Chunk size in sectors.  This parameter is often known as
                      "stripe size".  It is the only mandatory parameter and
                      is placed first.

    followed by optional parameters (in any order):
        [sync|nosync]
                Force or prevent RAID initialization.

        [rebuild <idx>]
                Rebuild drive number 'idx' (first drive is 0).

        [daemon_sleep <ms>]
                Interval between runs of the bitmap daemon that clears
                bits.  A longer interval means less bitmap I/O but
                resyncing after a failure is likely to take longer.

        [min_recovery_rate <kB/sec/disk>]
                Throttle RAID initialization
        [max_recovery_rate <kB/sec/disk>]
                Throttle RAID initialization
        [write_mostly <idx>]
                Mark drive index 'idx' write-mostly.
        [max_write_behind <sectors>]
                See '--write-behind=' (man mdadm)
        [stripe_cache <sectors>]
                Stripe cache size (RAID 4/5/6 only)
        [region_size <sectors>]
                The region_size multiplied by the number of regions is the
                logical size of the array.  The bitmap records the device
                synchronisation state for each region.

        [raid10_copies   <# copies>], [raid10_format   <near|far|offset>]
                These two options are used to alter the default layout of
                a RAID10 configuration.  The number of copies can be
                specified, but the default is 2.  There are also three
                variations to how the copies are laid down - the default
                is "near".  Near copies are what most people think of with
                respect to mirroring.  (An example table line using these
                options is sketched after this parameter list.)  If these
                options are left unspecified, or 'raid10_copies 2' and/or
                'raid10_format near' are given, then the layouts for 2, 3
                and 4 devices are:

                ========         ==========        ==============
                2 drives         3 drives          4 drives
                ========         ==========        ==============
                A1  A1           A1  A1  A2        A1  A1  A2  A2
                A2  A2           A2  A3  A3        A3  A3  A4  A4
                A3  A3           A4  A4  A5        A5  A5  A6  A6
                A4  A4           A5  A6  A6        A7  A7  A8  A8
                ..  ..           ..  ..  ..        ..  ..  ..  ..
                ========         ==========        ==============

                The 2-device layout is equivalent to 2-way RAID1.  The
                4-device layout is what a traditional RAID10 would look like.
                The 3-device layout is what might be called a 'RAID1E -
                Integrated Adjacent Stripe Mirroring'.

                If 'raid10_copies 2' and 'raid10_format far', then the layouts
                for 2, 3 and 4 devices are:

                ========             ============         ===================
                2 drives             3 drives             4 drives
                ========             ============         ===================
                A1  A2               A1   A2   A3         A1   A2   A3   A4
                A3  A4               A4   A5   A6         A5   A6   A7   A8
                A5  A6               A7   A8   A9         A9   A10  A11  A12
                ..  ..               ..   ..   ..         ..   ..   ..   ..
                A2  A1               A3   A1   A2         A2   A1   A4   A3
                A4  A3               A6   A4   A5         A6   A5   A8   A7
                A6  A5               A9   A7   A8         A10  A9   A12  A11
                ..  ..               ..   ..   ..         ..   ..   ..   ..
                ========             ============         ===================

                If 'raid10_copies 2' and 'raid10_format offset', then the
                layouts for 2, 3 and 4 devices are:

                ========       ==========         ================
                2 drives       3 drives           4 drives
                ========       ==========         ================
                A1  A2         A1  A2  A3         A1  A2  A3  A4
                A2  A1         A3  A1  A2         A2  A1  A4  A3
                A3  A4         A4  A5  A6         A5  A6  A7  A8
                A4  A3         A6  A4  A5         A6  A5  A8  A7
                A5  A6         A7  A8  A9         A9  A10 A11 A12
                A6  A5         A9  A7  A8         A10 A9  A12 A11
                ..  ..         ..  ..  ..         ..  ..  ..  ..
                ========       ==========         ================

                Here we see layouts closely akin to 'RAID1E - Integrated
                Offset Stripe Mirroring'.

        [delta_disks <N>]
                The delta_disks option value (-251 < N < +251) triggers
                device removal (negative value) or device addition (positive
                value) in a reshape, for the RAID levels that support it:
                raid levels 4/5/6 allow both addition and removal of devices
                (as metadata and data device tuples), raid10_near and
                raid10_offset only allow device addition, and raid10_far does
                not support any reshaping at all.  A minimum number of
                devices must be retained to preserve resilience: 3 devices
                for raid4/5 and 4 devices for raid6.  (A reshape sketch
                appears after this parameter list.)

        [data_offset <sectors>]
                This option value defines the offset into each data device
                where the data starts.  This is used to provide out-of-place
                reshaping space to avoid writing over data while
                changing the layout of stripes, hence an interruption/crash
                may happen at any time without the risk of losing data.
                E.g. when adding devices to an existing raid set during
                forward reshaping, the out-of-place space will be allocated
                at the beginning of each raid device.  The kernel raid4/5/6/10
                MD personalities supporting such device addition will read the
                data from the existing first stripes (those spanning the
                smaller number of devices) starting at data_offset, fill up a
                new stripe spanning the larger number of devices, calculate
                the redundancy blocks (parity/Q-syndrome) and write that new
                stripe to offset 0.  The same is applied to all N-1 other new
                stripes.  This out-of-place scheme is also used to change the
                RAID type (i.e. the allocation algorithm), e.g. changing from
                raid5_ls to raid5_n.

        [journal_dev <dev>]
                This option adds a journal device to raid4/5/6 raid sets and
                uses it to close the 'write hole' caused by the non-atomic
                updates to the component devices, which can cause data loss
                during recovery.  The journal device is used in writethrough
                mode, thus causing writes to be throttled versus
                non-journaled raid4/5/6 sets.  Takeover/reshape is not
                possible with a raid4/5/6 journal device; it has to be
                deconfigured before requesting these.

        [journal_mode <mode>]
                This option sets the caching mode on journaled raid4/5/6 raid
                sets (see 'journal_dev <dev>' above) to 'writethrough' or
                'writeback'.  If 'writeback' is selected, the journal device
                has to be resilient and must not suffer from the 'write hole'
                problem itself (e.g. use raid1 or raid10) to avoid a single
                point of failure.
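
To tie the optional parameters together, here are two hedged example
lines (device numbers and sizes below are invented for illustration;
in practice LVM normally constructs these tables)::

  # RAID10 - 2 "far" copies across 4 devices, 64KiB chunks,
  # forced initialization, no metadata devices
  0 1960893648 raid \
          raid10 6 128 sync raid10_copies 2 raid10_format far \
          4 - 8:17 - 8:33 - 8:49 - 8:65

  # Reshape sketch - request growing a 3-device raid5_ls set by one
  # metadata/data device pair by reloading the table with delta_disks
  0 976754768 raid \
          raid5_ls 3 128 delta_disks 1 \
          4 8:16 8:17 8:32 8:33 8:48 8:49 8:64 8:65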

<#raid_devs>: The number of devices composing the array.
        Each device consists of two entries.  The first is the device
        containing the metadata (if any); the second is the one containing the
        data.  A maximum of 64 metadata/data device entries is supported
        up to target version 1.8.0.
        Target version 1.9.0 supports up to 253 entries, as enforced by the
        MD kernel runtime.

        If a drive has failed or is missing at creation time, a '-' can be
        given for both the metadata and data drives for a given position.


Example Tables
--------------

::

  # RAID4 - 4 data drives, 1 parity (no metadata devices)
  # No metadata devices specified to hold superblock/bitmap info
  # Chunk size of 1MiB
  # (Lines separated for easy reading)

  0 1960893648 raid \
          raid4 1 2048 \
          5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81

  # RAID4 - 4 data drives, 1 parity (with metadata devices)
  # Chunk size of 1MiB, force RAID initialization,
  #       min recovery rate at 20 kiB/sec/disk

  0 1960893648 raid \
          raid4 4 2048 sync min_recovery_rate 20 \
          5 8:17 8:18 8:33 8:34 8:49 8:50 8:65 8:66 8:81 8:82
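
Such a table is loaded with dmsetup.  A minimal sketch using the first
example above (the device name 'my_raid4' is arbitrary)::

  dmsetup create my_raid4 --table \
    "0 1960893648 raid raid4 1 2048 5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81"

  # Verify the mapping and inspect the initial state
  dmsetup table my_raid4
  dmsetup status my_raid4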


Status Output
-------------
'dmsetup table' displays the table used to construct the mapping.
The optional parameters are always printed in the order listed
above, with "sync" or "nosync" always output ahead of the other
arguments, regardless of the order used when originally loading the table.
Arguments that can be repeated are ordered by value.


'dmsetup status' yields information on the state and health of the array.
The output is as follows (normally a single line, but expanded here for
clarity)::

  1: <s> <l> raid \
  2:      <raid_type> <#devices> <health_chars> \
  3:      <sync_ratio> <sync_action> <mismatch_cnt> <data_offset> <journal_char>

Line 1 is the standard output produced by device-mapper.

Lines 2 & 3 are produced by the raid target and are best explained by example::

        0 1960893648 raid raid4 5 AAAAA 2/490221568 init 0

Here we can see the RAID type is raid4, there are 5 devices - all of
which are 'A'live, and the array is 2/490221568 complete with its initial
recovery.  Here is a fuller description of the individual fields:

        =============== =========================================================
        <raid_type>     Same as the <raid_type> used to create the array.
        <health_chars>  One char for each device, indicating:

                        - 'A' = alive and in-sync
                        - 'a' = alive but not in-sync
                        - 'D' = dead/failed.
        <sync_ratio>    The ratio indicating how much of the array has undergone
                        the process described by 'sync_action'.  If the
                        'sync_action' is "check" or "repair", then the process
                        of "resync" or "recover" can be considered complete.
        <sync_action>   One of the following possible states:

                        idle
                                - No synchronization action is being performed.
                        frozen
                                - The current action has been halted.
                        resync
                                - Array is undergoing its initial synchronization
                                  or is resynchronizing after an unclean shutdown
                                  (possibly aided by a bitmap).
                        recover
                                - A device in the array is being rebuilt or
                                  replaced.
                        check
                                - A user-initiated full check of the array is
                                  being performed.  All blocks are read and
                                  checked for consistency.  The number of
                                  discrepancies found is recorded in
                                  <mismatch_cnt>.  No changes are made to the
                                  array by this action.
                        repair
                                - The same as "check", but discrepancies are
                                  corrected.
                        reshape
                                - The array is undergoing a reshape.
        <mismatch_cnt>  The number of discrepancies found between mirror copies
                        in RAID1/10 or wrong parity values found in RAID4/5/6.
                        This value is valid only after a "check" of the array
                        is performed.  A healthy array has a 'mismatch_cnt' of 0.
        <data_offset>   The current data offset to the start of the user data on
                        each component device of a raid set (see the
                        'data_offset' raid parameter above, which supports
                        out-of-place reshaping).
        <journal_char>  - 'A' - active write-through journal device.
                        - 'a' - active write-back journal device.
                        - 'D' - dead journal device.
                        - '-' - no journal device.
        =============== =========================================================
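
For routine monitoring, the status line can be polled with ordinary
tooling.  A minimal sketch, assuming a mapped device named 'my_raid'
(field positions follow the example above)::

  # Watch health characters and sync progress
  watch -n5 'dmsetup status my_raid'

  # Extract just the sync ratio (the 7th whitespace-separated field)
  dmsetup status my_raid | awk '{print $7}'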


Message Interface
-----------------
The dm-raid target will accept certain actions through the 'message' interface.
('man dmsetup' for more information on the message interface.)  These actions
include:

        ========= ================================================
        "idle"    Halt the current sync action.
        "frozen"  Freeze the current sync action.
        "resync"  Initiate/continue a resync.
        "recover" Initiate/continue a recover process.
        "check"   Initiate a check (i.e. a "scrub") of the array.
        "repair"  Initiate a repair of the array.
        ========= ================================================
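
For example, to start a scrub and later halt it again on a mapped device
named 'my_raid' (the name is arbitrary; the '0' is the sector offset
dmsetup requires)::

  dmsetup message my_raid 0 "check"
  dmsetup message my_raid 0 "idle"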


Discard Support
---------------
The implementation of discard support among hardware vendors varies.
When a block is discarded, some storage devices will return zeroes when
the block is read.  These devices set the 'discard_zeroes_data'
attribute.  Other devices will return random data.  Confusingly, some
devices that advertise 'discard_zeroes_data' will not reliably return
zeroes when discarded blocks are read!  Since RAID 4/5/6 uses blocks
from a number of devices to calculate parity blocks and (for performance
reasons) relies on 'discard_zeroes_data' being reliable, it is important
that the devices be consistent.  Blocks may be discarded in the middle
of a RAID 4/5/6 stripe and if subsequent read results are not
consistent, the parity blocks may be calculated differently at any time,
making the parity blocks useless for redundancy.  It is important to
understand how your hardware behaves with discards if you are going to
enable discards with RAID 4/5/6.

Since the behavior of storage devices is unreliable in this respect,
even when reporting 'discard_zeroes_data', by default RAID 4/5/6
discard support is disabled -- this ensures data integrity at the
expense of losing some performance.

Storage devices that properly support 'discard_zeroes_data' are
increasingly whitelisted in the kernel and can thus be trusted.

For trusted devices, the following dm-raid module parameter can be set
to safely enable discard support for RAID 4/5/6:

    'devices_handle_discard_safely'
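
A sketch of enabling it, either at module load time or at runtime via
sysfs (this assumes dm-raid is built as a module and that the parameter
is writable, as in current kernels)::

  modprobe dm-raid devices_handle_discard_safely=1

  # or, on an already loaded module
  echo 1 > /sys/module/dm_raid/parameters/devices_handle_discard_safely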


Version History
---------------

::

 1.0.0  Initial version.  Support for RAID 4/5/6
 1.1.0  Added support for RAID 1
 1.2.0  Handle creation of arrays that contain failed devices.
 1.3.0  Added support for RAID 10
 1.3.1  Allow device replacement/rebuild for RAID 10
 1.3.2  Fix/improve redundancy checking for RAID10
 1.4.0  Non-functional change.  Removes arg from mapping function.
 1.4.1  RAID10 fix redundancy validation checks (commit 55ebbb5).
 1.4.2  Add RAID10 "far" and "offset" algorithm support.
 1.5.0  Add message interface to allow manipulation of the sync_action.
        New status (STATUSTYPE_INFO) fields: sync_action and mismatch_cnt.
 1.5.1  Add ability to restore transiently failed devices on resume.
 1.5.2  'mismatch_cnt' is zero unless [last_]sync_action is "check".
 1.6.0  Add discard support (and devices_handle_discard_safely module param).
 1.7.0  Add support for MD RAID0 mappings.
 1.8.0  Explicitly check for compatible flags in the superblock metadata
        and reject to start the raid set if any are set by a newer
        target version, thus avoiding data corruption on a raid set
        with a reshape in progress.
 1.9.0  Add support for RAID level takeover/reshape/region size
        and set size reduction.
 1.9.1  Fix activation of existing RAID 4/10 mapped devices
 1.9.2  Don't emit '- -' on the status table line in case the constructor
        fails reading a superblock. Correctly emit 'maj:min1 maj:min2' and
        'D' on the status line.  If '- -' is passed into the constructor, emit
        '- -' on the table line and '-' as the status line health character.
 1.10.0 Add support for raid4/5/6 journal device
 1.10.1 Fix data corruption on reshape request
 1.11.0 Fix table line argument order
        (wrong raid10_copies/raid10_format sequence)
 1.11.1 Add raid4/5/6 journal write-back support via journal_mode option
 1.12.1 Fix for MD deadlock between mddev_suspend() and md_write_start() available
 1.13.0 Fix dev_health status at end of "recover" (was 'a', now 'A')
 1.13.1 Fix deadlock caused by early md_stop_writes().  Also fix size and
        state races.
 1.13.2 Fix raid redundancy validation and avoid keeping raid set frozen
 1.14.0 Fix reshape race on small devices.  Fix stripe adding reshape
        deadlock/potential data corruption.  Update superblock when
        specific devices are requested via rebuild.  Fix RAID leg
        rebuild errors.
 1.15.0 Fix size extensions not being synchronized in case of new MD bitmap
        pages allocated;  also fix those not occurring after previous reductions
 1.15.1 Fix argument count and arguments for rebuild/write_mostly/journal_(dev|mode)
        on the status line.