=====
Cache
=====

Introduction
============

dm-cache is a device mapper target written by Joe Thornber, Heinz
Mauelshagen, and Mike Snitzer.

It aims to improve performance of a block device (e.g., a spindle) by
dynamically migrating some of its data to a faster, smaller device
(e.g., an SSD).

This device-mapper solution allows us to insert this caching at
different levels of the dm stack, for instance above the data device for
a thin-provisioning pool. Caching solutions that are integrated more
closely with the virtual memory system should give better performance.

The target reuses the metadata library used by the thin-provisioning
target.

The decision as to what data to migrate and when is left to a plug-in
policy module. Several of these have been written as we experiment,
and we hope other people will contribute others for specific I/O
scenarios (e.g., a VM image server).

Glossary
========

Migration
    Movement of the primary copy of a logical block from one
    device to the other.
Promotion
    Migration from slow device to fast device.
Demotion
    Migration from fast device to slow device.

The origin device always contains a copy of the logical block, which
may be out of date or kept in sync with the copy on the cache device
(depending on policy).

Design
======

Sub-devices
-----------

The target is constructed by passing three devices to it (along with
other parameters detailed later):

1. An origin device - the big, slow one.

2. A cache device - the small, fast one.

3. A small metadata device - records which blocks are in the cache,
   which are dirty, and extra hints for use by the policy object.
   This information could be put on the cache device, but having it
   separate allows the volume manager to configure it differently,
   e.g. as a mirror for extra robustness. This metadata device may only
   be used by a single cache device.

Fixed block size
----------------

The origin is divided up into blocks of a fixed size. This block size
is configurable when you first create the cache. Typically we've been
using block sizes of 256KB - 1024KB. The block size must be between 64
sectors (32KB) and 2097152 sectors (1GB) and a multiple of 64 sectors (32KB).
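
The size limits above can be checked before a table line is built. The
following is a minimal sketch in Python; ``is_valid_block_size`` and
``kb_to_sectors`` are hypothetical helper names, not part of any dm-cache
tool, and the code only encodes the limits stated in this section.

```python
SECTOR_BYTES = 512            # a device-mapper sector is 512 bytes

MIN_BLOCK_SECTORS = 64        # 32KB
MAX_BLOCK_SECTORS = 2097152   # 1GB


def kb_to_sectors(kb):
    """Convert a size in KB (1024 bytes) to 512-byte sectors."""
    return kb * 1024 // SECTOR_BYTES


def is_valid_block_size(sectors):
    """Return True if `sectors` is within range and a multiple of 64."""
    return (
        MIN_BLOCK_SECTORS <= sectors <= MAX_BLOCK_SECTORS
        and sectors % MIN_BLOCK_SECTORS == 0
    )
```

The 256KB - 1024KB sizes mentioned above correspond to 512 - 2048
sectors, and both ends of that range pass the check.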

Having a fixed block size simplifies the target considerably, but it is
something of a compromise. For instance, a small part of a block may be
getting hit a lot, yet the whole block will be promoted to the cache.
So large block sizes are bad because they waste cache space, and small
block sizes are bad because they increase the amount of metadata (both
in core and on disk).
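
To put numbers on this trade-off, the sketch below counts how many cache
blocks (and hence how many per-block metadata entries) a fast device
yields at different block sizes. The 20GB device size is an arbitrary
figure chosen for illustration, and ``cache_block_count`` is a
hypothetical helper.

```python
SECTOR_BYTES = 512


def cache_block_count(cache_dev_sectors, block_size_sectors):
    """Number of whole cache blocks that fit on the cache device."""
    return cache_dev_sectors // block_size_sectors


# Assumed example: a 20GB fast device, expressed in 512-byte sectors.
cache_dev_sectors = 20 * 1024 ** 3 // SECTOR_BYTES

for block_size in (64, 512, 2048):
    blocks = cache_block_count(cache_dev_sectors, block_size)
    print(block_size, "sectors per block ->", blocks, "cache blocks")
```

Going from 2048-sector blocks to 64-sector blocks multiplies the number
of blocks to track by 32, while with the larger size a hot region smaller
than a block still occupies a whole block of cache space.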

Cache operating modes
---------------------

The cache has three operating modes: writeback, writethrough and
passthrough.

If writeback, the default, is selected then a write to a block that is
cached will go only to the cache and the block will be marked dirty in
the metadata.

If writethrough is selected then a write to a cached block will not
complete until it has hit both the origin and cache devices. Clean
blocks should remain clean.

Passthrough mode is useful when the cache contents are not known to be
coherent with the origin device. If it is selected then all reads are
served from the origin device (all reads miss the cache) and all writes
are forwarded to the origin device; additionally, write hits cause cache
block invalidates. To enable passthrough mode the cache must be clean.
Passthrough mode allows a cache device to be activated without having to
worry about coherency. Coherency that exists is maintained, although
the cache will gradually cool as writes take place. If the coherency of
the cache can later be verified, or established through use of the
"invalidate_cblocks" message, the cache device can be transitioned to
writethrough or writeback mode while still warm. Otherwise, the cache
contents can be discarded prior to transitioning to the desired
operating mode.
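
The effect of each mode on a write can be summarised with a toy model.
This is only a sketch of the behaviour described above, not the kernel
implementation: ``handle_write`` is a hypothetical function, and sending
every write miss straight to the origin is a simplification (whether a
block is promoted is up to the policy).

```python
def handle_write(mode, cached, dirty):
    """Toy model of a write to one logical block.

    Returns (targets, cached, dirty): the devices the write is sent to
    and the block's cache state afterwards.
    """
    if not cached:
        # Write miss: simplified here to "write to the origin".
        return ({"origin"}, False, False)
    if mode == "writeback":
        # Write hit: cache only, and the block becomes dirty.
        return ({"cache"}, True, True)
    if mode == "writethrough":
        # Write hit: both devices, so a clean block stays clean.
        return ({"origin", "cache"}, True, dirty)
    if mode == "passthrough":
        # Write hit: origin only, and the cache block is invalidated.
        return ({"origin"}, False, False)
    raise ValueError("unknown mode: " + mode)
```

Only writeback ever produces a dirty block in this model, which is why
the other two modes always leave the cache clean.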

A simple cleaner policy is provided, which will clean (write back) all
dirty blocks in a cache. This is useful when decommissioning or
shrinking a cache. Shrinking the cache's fast device requires all cache
blocks in the area of the cache being removed to be clean. If the
area being removed from the cache still contains dirty blocks the resize
will fail. Care must be taken to never reduce the volume used for the
cache's fast device until the cache is clean. This is of particular
importance if writeback mode is used. Writethrough and passthrough
modes already maintain a clean cache. Future support to partially clean
the cache, above a specified threshold, will allow for keeping the cache
warm and in writeback mode during resize.
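
The shrink rule amounts to a simple check on the dirty blocks. The
sketch below illustrates it under the assumption that the area being
removed is the tail of the cache device; ``can_shrink`` is a hypothetical
name and this is not how the kernel code is structured.

```python
def can_shrink(dirty_cblocks, new_nr_cblocks):
    """Return True if no dirty cache block lies in the removed area.

    `dirty_cblocks` is an iterable of dirty cache block indices and
    `new_nr_cblocks` is the number of cache blocks after the resize, so
    blocks with an index >= new_nr_cblocks are the ones being removed.
    """
    return all(cblock < new_nr_cblocks for cblock in dirty_cblocks)
```

A cache that has been fully cleaned has no dirty blocks at all, so the
check passes for any new size.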

Migration throttling
--------------------

Migrating data between the origin and cache device uses bandwidth.
The user can set a throttle to prevent more than a certain amount of
migration occurring at any one time. Currently we're not taking any
account of normal I/O traffic going to the devices. More work is
needed here to avoid migrating during peak I/O moments.

For the time being, a message "migration_threshold <#sectors>"
can be used to set the maximum number of sectors being migrated,
the default being 2048 sectors (1MB).
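
Since the message takes a sector count, a size in MB has to be converted
first. A small sketch, assuming a cache device named ``my_cache`` (the
name used in the examples in this document);
``migration_threshold_message`` is a hypothetical helper that only
builds the command line and does not run it.

```python
SECTOR_BYTES = 512


def migration_threshold_message(device, mb):
    """Build a dmsetup command line that sets migration_threshold.

    `mb` is the desired limit in MB (1024 * 1024 bytes); the message
    itself takes a count of 512-byte sectors.
    """
    sectors = mb * 1024 * 1024 // SECTOR_BYTES
    return "dmsetup message {} 0 migration_threshold {}".format(device, sectors)


print(migration_threshold_message("my_cache", 1))
# prints: dmsetup message my_cache 0 migration_threshold 2048
```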

Updating on-disk metadata
-------------------------

On-disk metadata is committed every time a FLUSH or FUA bio is written.
If no such requests are made then commits will occur every second. This
means the cache behaves like a physical disk that has a volatile write
cache. If power is lost you may lose some recent writes. The metadata
should always be consistent in spite of any crash.

The 'dirty' state for a cache block changes far too frequently for us
to keep updating it on the fly. So we treat it as a hint. In normal
operation it will be written when the dm device is suspended. If the
system crashes all cache blocks will be assumed dirty when restarted.
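
The consequence for the dirty state can be pictured as follows. This is
a toy model of the rule just described, with a hypothetical function
name, and says nothing about how the metadata is actually laid out.

```python
def dirty_blocks_after_restart(saved_dirty, nr_cblocks, clean_shutdown):
    """Toy model of the dirty state seen when the cache is restarted.

    After a clean shutdown (the device was suspended) the saved dirty
    bits are used; after a crash the hint is not trusted and every cache
    block is assumed dirty.
    """
    if clean_shutdown:
        return set(saved_dirty)
    return set(range(nr_cblocks))
```

Assuming too many blocks dirty costs extra write-back work after a
crash, but it never loses data.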

Per-block policy hints
----------------------

Policy plug-ins can store a chunk of data per cache block. It's up to
the policy how big this chunk is, but it should be kept small. Like the
dirty flags, this data is lost if there's a crash, so a safe fallback
value should always be possible.

Policy hints affect performance, not correctness.

Policy messaging
----------------

Policies will have different tunables, specific to each one, so we
need a generic way of getting and setting these. Device-mapper
messages are used. Refer to cache-policies.txt.

Discard bitset resolution
-------------------------

We can avoid copying data during migration if we know the block has
been discarded. A prime example of this is when mkfs discards the
whole block device. We store a bitset tracking the discard state of
blocks. However, we allow this bitset to have a different block size
from the cache blocks. This is because we need to track the discard
state for all of the origin device (compare with the dirty bitset
which is just for the smaller cache device).
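
A sketch of how a coarser discard bitset could be consulted before a
migration. The function names are hypothetical, and the assumption that
the discard block size is a whole multiple of the cache block size is
made for this illustration only.

```python
def discard_block_index(oblock, cache_block_sectors, discard_block_sectors):
    """Index of the discard-bitset entry covering origin block `oblock`.

    Assumes the discard block size is a multiple of the cache block size.
    """
    blocks_per_discard_block = discard_block_sectors // cache_block_sectors
    return oblock // blocks_per_discard_block


def needs_copy(oblock, discarded, cache_block_sectors, discard_block_sectors):
    """Return False if the migration can skip copying the block's data."""
    index = discard_block_index(oblock, cache_block_sectors,
                                discard_block_sectors)
    return index not in discarded
```

With 512-sector cache blocks and 4096-sector discard blocks, one discard
bit covers eight origin blocks, so the bitset for the whole origin stays
small.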

Target interface
================

Constructor
-----------

::

    cache <metadata dev> <cache dev> <origin dev> <block size>
          <#feature args> [<feature arg>]*
          <policy> <#policy args> [<policy arg>]*

================ =======================================================
metadata dev     fast device holding the persistent metadata
cache dev        fast device holding cached data blocks
origin dev       slow device holding original data blocks
block size       cache unit size in sectors

#feature args    number of feature arguments passed
feature args     optional features, e.g. writethrough or passthrough
                 (the default is writeback); all are listed below

policy           the replacement policy to use
#policy args     an even number of arguments corresponding to
                 key/value pairs passed to the policy
policy args      key/value pairs passed to the policy
                 E.g. 'sequential_threshold 1024'
                 See cache-policies.txt for details.
================ =======================================================

Optional feature arguments are:

==================== ========================================================
writethrough         write through caching that prohibits cache block
                     content from being different from origin block content.
                     Without this argument, the default behaviour is to write
                     back cache block contents later for performance reasons,
                     so they may differ from the corresponding origin blocks.

passthrough          a degraded mode useful for various cache coherency
                     situations (e.g., rolling back snapshots of
                     underlying storage). Reads and writes always go to
                     the origin. If a write goes to a cached origin
                     block, then the cache block is invalidated.
                     To enable passthrough mode the cache must be clean.

metadata2            use version 2 of the metadata. This stores the dirty
                     bits in a separate btree, which improves speed of
                     shutting down the cache.

no_discard_passdown  disable passing down discards from the cache
                     to the origin's data device.
==================== ========================================================

A policy called 'default' is always registered. This is an alias for
the policy we currently think is giving best all round performance.

As the default policy could vary between kernels, if you are relying on
the characteristics of a specific policy, always request it by name.
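
Getting the two argument counts right is the error-prone part of writing
a table line by hand. The sketch below assembles one from its parts;
``cache_table`` is a hypothetical helper, and the leading ``0`` and the
length are the usual device-mapper start sector and size in sectors.

```python
def cache_table(length, metadata_dev, cache_dev, origin_dev, block_size,
                features=(), policy="default", policy_args=None):
    """Assemble a dm-cache table line from its parts.

    `policy_args` is a dict of key/value pairs; the emitted count is the
    number of individual words, so it is always even.
    """
    policy_args = policy_args or {}
    words = [0, length, "cache", metadata_dev, cache_dev, origin_dev,
             block_size, len(features), *features,
             policy, 2 * len(policy_args)]
    for key, value in policy_args.items():
        words += [key, value]
    return " ".join(str(word) for word in words)


print(cache_table(41943040, "/dev/mapper/metadata", "/dev/mapper/ssd",
                  "/dev/mapper/origin", 512, features=["writeback"]))
```

The device paths and sizes are those used in the Examples section of
this document; the resulting string is what would be passed to
``dmsetup create --table``.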

Status
------

::

    <metadata block size> <#used metadata blocks>/<#total metadata blocks>
    <cache block size> <#used cache blocks>/<#total cache blocks>
    <#read hits> <#read misses> <#write hits> <#write misses>
    <#demotions> <#promotions> <#dirty> <#feature args> <feature args>*
    <#core args> <core args>* <policy name> <#policy args> <policy args>*
    <cache metadata mode> <needs_check>

========================= =====================================================
metadata block size       Fixed block size for each metadata block in
                          sectors
#used metadata blocks     Number of metadata blocks used
#total metadata blocks    Total number of metadata blocks
cache block size          Configurable block size for the cache device
                          in sectors
#used cache blocks        Number of blocks resident in the cache
#total cache blocks       Total number of cache blocks
#read hits                Number of times a READ bio has been mapped
                          to the cache
#read misses              Number of times a READ bio has been mapped
                          to the origin
#write hits               Number of times a WRITE bio has been mapped
                          to the cache
#write misses             Number of times a WRITE bio has been
                          mapped to the origin
#demotions                Number of times a block has been removed
                          from the cache
#promotions               Number of times a block has been moved to
                          the cache
#dirty                    Number of blocks in the cache that differ
                          from the origin
#feature args             Number of feature args to follow
feature args              'writethrough' (optional)
#core args                Number of core arguments (must be even)
core args                 Key/value pairs for tuning the core
                          e.g. migration_threshold
policy name               Name of the policy
#policy args              Number of policy arguments to follow (must be even)
policy args               Key/value pairs e.g. sequential_threshold
cache metadata mode       ro if read-only, rw if read-write

                          In serious cases where even a read-only mode is
                          deemed unsafe no further I/O will be permitted and
                          the status will just contain the string 'Fail'.
                          The userspace recovery tools should then be used.
needs_check               'needs_check' if set, '-' if not set.
                          When set, a metadata operation has failed,
                          resulting in the needs_check flag being set in
                          the metadata's superblock. The metadata device
                          must be deactivated and checked/repaired before
                          the cache can be made fully operational again.
========================= =====================================================
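
Because three of the fields are counted lists, the status line is easier
to consume with a small parser. The following is a sketch only: the
function name is hypothetical, the sample line is made up for
illustration rather than captured from a real device, and the input is
assumed to be just the cache-specific fields described above (without
the start sector, length and target name that precede them in
``dmsetup status`` output).

```python
def parse_cache_status(text):
    """Parse the cache-specific status fields into a dict."""
    words = text.split()
    if words == ["Fail"]:
        return {"failed": True}
    pos = 0

    def take(count=1):
        # Consume and return the next `count` words.
        nonlocal pos
        chunk = words[pos:pos + count]
        pos += count
        return chunk

    status = {"failed": False}
    status["metadata_block_size"] = int(take()[0])
    used, total = take()[0].split("/")
    status["metadata_blocks"] = (int(used), int(total))
    status["cache_block_size"] = int(take()[0])
    used, total = take()[0].split("/")
    status["cache_blocks"] = (int(used), int(total))
    for name in ("read_hits", "read_misses", "write_hits", "write_misses",
                 "demotions", "promotions", "dirty"):
        status[name] = int(take()[0])
    status["features"] = take(int(take()[0]))
    core = take(int(take()[0]))
    status["core_args"] = dict(zip(core[0::2], core[1::2]))
    status["policy"] = take()[0]
    policy = take(int(take()[0]))
    status["policy_args"] = dict(zip(policy[0::2], policy[1::2]))
    status["metadata_mode"] = take()[0]
    # Tolerate a missing trailing field by treating it as "not set".
    status["needs_check"] = take() == ["needs_check"]
    return status


# Made-up sample values, for illustration only.
sample = ("8 72/1024 512 100/81920 50 60 70 80 0 100 0 1 writethrough "
          "2 migration_threshold 2048 mq 2 sequential_threshold 1024 rw -")
status = parse_cache_status(sample)
print(status["cache_blocks"], status["policy"], status["needs_check"])
```

A monitoring script built on this would typically look first at
``failed``, ``metadata_mode`` and ``needs_check``, since those indicate
that the metadata needs attention.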

Messages
--------

Policies will have different tunables, specific to each one, so we
need a generic way of getting and setting these. Device-mapper
messages are used. (A sysfs interface would also be possible.)

The message format is::

    <key> <value>

E.g.::

    dmsetup message my_cache 0 sequential_threshold 1024

Invalidation is removing an entry from the cache without writing it
back. Cache blocks can be invalidated via the invalidate_cblocks
message, which takes an arbitrary number of cblock ranges. Each cblock
range's end value is "one past the end", meaning 5-10 expresses a range
of values from 5 to 9. Each cblock must be expressed as a decimal
value. In the future, a variant message that takes cblock ranges
expressed in hexadecimal may be needed to better support efficient
invalidation of larger caches. The cache must be in passthrough mode
when invalidate_cblocks is used::

    invalidate_cblocks [<cblock>|<cblock begin>-<cblock end>]*

E.g.::

    dmsetup message my_cache 0 invalidate_cblocks 2345 3456-4567 5678-6789
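
The "one past the end" convention is the same half-open convention that
Python's ``range`` uses, which makes the arguments easy to interpret.
The sketch below is illustrative: ``parse_cblock_ranges`` is a
hypothetical name, and rejecting an empty or reversed range is a choice
made for this example.

```python
def parse_cblock_ranges(args):
    """Turn invalidate_cblocks arguments into half-open (begin, end) pairs."""
    ranges = []
    for arg in args:
        if "-" in arg:
            begin, end = (int(part) for part in arg.split("-"))
        else:
            # A single cblock is a range of length one.
            begin = int(arg)
            end = begin + 1
        if begin >= end:
            raise ValueError("empty or reversed range: " + arg)
        ranges.append((begin, end))
    return ranges


begin, end = parse_cblock_ranges(["5-10"])[0]
print(list(range(begin, end)))
# prints: [5, 6, 7, 8, 9]
```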

Examples
========

The test suite can be found here:

https://github.com/jthornber/device-mapper-test-suite

Each of the following commands creates a cache device named my_cache.
The first uses the default policy with no policy arguments; the second
uses the mq policy with two key/value pairs::

    dmsetup create my_cache --table "0 41943040 cache /dev/mapper/metadata \
        /dev/mapper/ssd /dev/mapper/origin 512 1 writeback default 0"
    dmsetup create my_cache --table "0 41943040 cache /dev/mapper/metadata \
        /dev/mapper/ssd /dev/mapper/origin 1024 1 writeback \
        mq 4 sequential_threshold 1024 random_threshold 8"
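
All sizes in these table lines are in 512-byte sectors. As a quick
sanity check, the short sketch below converts the numbers used above
into more familiar units; the helper name is hypothetical.

```python
SECTOR_BYTES = 512


def sectors_to_kb(sectors):
    """Convert 512-byte sectors to KB (1024 bytes)."""
    return sectors * SECTOR_BYTES // 1024


print(sectors_to_kb(41943040) // (1024 * 1024), "GB device length")
print(sectors_to_kb(512), "KB cache blocks in the first command")
print(sectors_to_kb(1024), "KB cache blocks in the second command")
```

So both commands map a 20GB device, with 256KB and 512KB cache blocks
respectively.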