================
RAID 4/5/6 cache
================

A RAID 4/5/6 array can include an extra disk for data caching besides the
normal RAID disks. The role of the RAID disks is not changed by the cache
disk; the cache disk caches data destined for the RAID disks. The cache can
be in write-through mode (supported since kernel 4.4) or write-back mode
(supported since kernel 4.10). mdadm (since 3.4) has a new option
'--write-journal' to create an array with a cache. Please refer to the mdadm
manual for details. By default (when the RAID array starts), the cache is in
write-through mode. A user can switch it to write-back mode by::

        echo "write-back" > /sys/block/md0/md/journal_mode

And switch it back to write-through mode by::

        echo "write-through" > /sys/block/md0/md/journal_mode

In both modes, all writes to the array hit the cache disk first. This means
the cache disk must be fast and able to sustain the full write load.

write-through mode
==================

This mode mainly fixes the 'write hole' issue. For a RAID 4/5/6 array, an
unclean shutdown can leave data in some stripes in an inconsistent state,
e.g., data and parity don't match. The reason is that a stripe write involves
several RAID disks, and it's possible the writes haven't hit all RAID disks
yet before the unclean shutdown. We call an array degraded if it has
inconsistent data. MD tries to resync the array to bring it back to a normal
state, but before the resync completes, any system crash can cause real data
corruption in the RAID array. This problem is called the 'write hole'.

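The inconsistency can be illustrated with a toy model. This is a hedged
sketch, not MD's actual code: a tiny 3-disk RAID 5 stripe with XOR parity,
where a crash lands after a data chunk is written but before the matching
parity update.

```python
def parity(chunks):
    """XOR parity over the data chunks of a stripe."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)

# Consistent stripe: two data chunks plus their parity.
d0, d1 = b"\x11\x11", b"\x22\x22"
p = parity([d0, d1])
assert parity([d0, d1]) == p  # data and parity match

# Partial update: new data for d0 reaches its disk, but the crash happens
# before the matching parity update is written.
d0 = b"\x55\x55"
assert parity([d0, d1]) != p  # write hole: stripe is now inconsistent
```

Until a resync recomputes the parity, reconstructing a lost chunk from this
stripe would yield wrong data.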
The write-through cache caches all data on the cache disk first. After the
data is safe on the cache disk, it is flushed to the RAID disks. This
two-step write guarantees that MD can recover correct data after an unclean
shutdown even if the array is degraded. Thus the cache closes the 'write
hole'.

In write-through mode, MD reports IO completion to the upper layer (usually a
filesystem) only after the data is safe on the RAID disks, so a cache disk
failure doesn't cause data loss. Of course, a cache disk failure means the
array is exposed to the 'write hole' again.

In write-through mode, the cache disk isn't required to be big. Several
hundred megabytes are enough.

write-back mode
===============

Write-back mode fixes the 'write hole' issue too, since all write data is
cached on the cache disk. But the main goal of the 'write-back' cache is to
speed up writes. If a write crosses all RAID disks of a stripe, we call it a
full-stripe write. For non-full-stripe writes, MD must read old data before
the new parity can be calculated. These synchronous reads hurt write
throughput. Writes which are sequential but not dispatched at the same time
suffer from this overhead too. The write-back cache aggregates the data and
flushes it to the RAID disks only after the data amounts to a full-stripe
write. This completely avoids the overhead, so it's very helpful for some
workloads. A typical example is a workload which does sequential writes
followed by fsync.

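The aggregation idea can be sketched as follows. This is a toy model under
assumed names (``STRIPE_CHUNKS``, ``write``), not MD's implementation:
sub-stripe writes are buffered in a cache and flushed to the RAID disks only
once every chunk of a stripe is present, so no read of old data or parity is
needed.

```python
STRIPE_CHUNKS = 4  # assumed number of data disks per stripe

cache = {}      # (stripe, chunk) -> data buffered in the write-back cache
flushed = []    # stripes written to the RAID disks as full-stripe writes

def write(stripe, chunk, data):
    cache[(stripe, chunk)] = data
    # Flush only when the whole stripe is cached (a full-stripe write).
    if all((stripe, c) in cache for c in range(STRIPE_CHUNKS)):
        flushed.append(stripe)
        for c in range(STRIPE_CHUNKS):
            del cache[(stripe, c)]

# Sequential writes issued separately still aggregate into one flush.
for c in range(STRIPE_CHUNKS):
    write(0, c, b"x")

assert flushed == [0]   # one full-stripe write, no synchronous reads
assert not cache        # buffered data released after the flush
```

A real implementation must also flush incomplete stripes when space runs low,
as described later in this document.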
In write-back mode, MD reports IO completion to the upper layer (usually a
filesystem) right after the data hits the cache disk. The data is flushed to
the RAID disks later, after specific conditions are met. So a cache disk
failure will cause data loss.

In write-back mode, MD also caches data in memory. The memory cache holds the
same data stored on the cache disk, so a power loss doesn't cause data loss.
The memory cache size has a performance impact on the array. It's recommended
to make the size big. A user can configure the size by::

        echo "2048" > /sys/block/md0/md/stripe_cache_size

A too-small cache disk will make write aggregation less efficient in this
mode, depending on the workload. It's recommended to use a cache disk of at
least several gigabytes in write-back mode.

The implementation
==================

The write-through and write-back cache use the same disk format. The cache
disk is organized as a simple write log. The log consists of 'meta data' and
'data' pairs. The meta data describes the data. It also includes a checksum
and a sequence ID for recovery identification. Data can be IO data or parity
data. Data is checksummed too. The checksum is stored in the meta data ahead
of the data. The checksum is an optimization, because MD can write the meta
data and data freely without worrying about ordering. The MD superblock has a
field pointing to the valid meta data at the log head.

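As a rough mental model, each log record can be pictured like this. This is a
hedged sketch of the idea, not the on-disk format: a meta-data entry carries
a sequence ID plus the checksum of the data it describes, so both can be
validated during recovery.

```python
import zlib

log = []
seq = 0

def append(data):
    """Append one 'meta data'/'data' pair to the toy log."""
    global seq
    seq += 1
    # The meta data describes the data: length, sequence ID, data checksum.
    meta = {"seq": seq, "len": len(data), "csum": zlib.crc32(data)}
    log.append((meta, data))

append(b"io data")
append(b"parity data")

# A record is valid only if its stored checksum matches its data.
meta, data = log[0]
assert meta["csum"] == zlib.crc32(data)
```

Because the checksum travels with the meta data, a torn write of either half
of a pair is detectable without enforcing a strict write order between them.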
The log implementation is pretty straightforward. The difficult part is the
order in which MD writes data to the cache disk and the RAID disks.
Specifically, in write-through mode, MD calculates parity for the IO data,
writes both IO data and parity to the log, writes the data and parity to the
RAID disks after the data and parity are settled down in the log, and finally
the IO is finished. Reads just read from the RAID disks as usual.

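The ordering constraint above can be sketched as a sequence of events. This
is a toy model with a stand-in parity calculation, not MD's code: the log
write precedes the RAID-disk writes, and completion is reported last.

```python
events = []

def stripe_write(data):
    parity = bytes(b ^ 0xFF for b in data)   # stand-in parity, not RAID math
    events.append(("log", data, parity))     # step 1: settle data+parity in log
    events.append(("raid", data, parity))    # step 2: write to the RAID disks
    events.append("io_done")                 # step 3: report IO completion

stripe_write(b"abc")
assert events[0][0] == "log"        # log write happens first
assert events[1][0] == "raid"       # RAID disks written only afterwards
assert events[-1] == "io_done"      # completion reported last
```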
In write-back mode, MD writes IO data to the log and reports IO completion.
The data is also fully cached in memory at that time, which means reads must
query the memory cache. If some conditions are met, MD will flush the data to
the RAID disks. MD will calculate parity for the data and write the parity
into the log. After this is finished, MD will write both data and parity into
the RAID disks, then MD can release the memory cache. The flush conditions
are: the stripe becomes a full-stripe write, free cache disk space is low, or
free in-kernel memory cache space is low.

After an unclean shutdown, MD does recovery. MD reads all meta data and data
from the log. The sequence ID and checksum help detect corrupted meta data
and data. If MD finds a stripe with data and valid parities (1 parity for
RAID 4/5 and 2 for RAID 6), MD will write the data and parities to the RAID
disks. If the parities are incomplete, they are discarded. If part of the
data is corrupted, it is discarded too. MD then loads the valid data and
writes it to the RAID disks in the normal way.
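The recovery scan can be sketched as follows. This is a hedged toy model
(assumed record structure, not MD's code): walk the log in order, keep
records whose sequence IDs increase and whose checksums match, and stop at
the first invalid record, discarding everything from that point on.

```python
import zlib

def make(seq, data, corrupt=False):
    csum = zlib.crc32(data)
    if corrupt:
        csum ^= 0xFFFFFFFF  # simulate a torn or corrupted record
    return {"seq": seq, "csum": csum, "data": data}

log = [make(1, b"a"), make(2, b"b"), make(3, b"c", corrupt=True), make(4, b"d")]

def recover(log):
    """Return the data of the valid log prefix after an unclean shutdown."""
    valid, last_seq = [], 0
    for rec in log:
        if rec["seq"] != last_seq + 1 or rec["csum"] != zlib.crc32(rec["data"]):
            break  # corrupted or out-of-sequence: discard this and the rest
        valid.append(rec["data"])
        last_seq = rec["seq"]
    return valid

# Records 1 and 2 survive; record 3 fails its checksum, so 3 and 4 are dropped.
assert recover(log) == [b"a", b"b"]
```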