0001 =====
0002 Cache
0003 =====
0004 
0005 Introduction
0006 ============
0007 
0008 dm-cache is a device mapper target written by Joe Thornber, Heinz
0009 Mauelshagen, and Mike Snitzer.
0010 
0011 It aims to improve performance of a block device (e.g. a spindle) by
0012 dynamically migrating some of its data to a faster, smaller device
0013 (e.g. an SSD).
0014 
0015 This device-mapper solution allows us to insert this caching at
0016 different levels of the dm stack, for instance above the data device for
0017 a thin-provisioning pool.  Caching solutions that are integrated more
0018 closely with the virtual memory system should give better performance.
0019 
0020 The target reuses the metadata library used by the thin-provisioning
0021 target.
0022 
0023 The decision as to what data to migrate and when is left to a plug-in
0024 policy module.  Several of these have been written as we experiment,
0025 and we hope other people will contribute others for specific I/O
0026 scenarios (e.g. a VM image server).
0027 
0028 Glossary
0029 ========
0030 
0031   Migration
0032                Movement of the primary copy of a logical block from one
0033                device to the other.
0034   Promotion
0035                Migration from slow device to fast device.
0036   Demotion
0037                Migration from fast device to slow device.
0038 
0039 The origin device always contains a copy of the logical block, which
0040 may be out of date or kept in sync with the copy on the cache device
0041 (depending on policy).
0042 
0043 Design
0044 ======
0045 
0046 Sub-devices
0047 -----------
0048 
0049 The target is constructed by passing three devices to it (along with
0050 other parameters detailed later):
0051 
0052 1. An origin device - the big, slow one.
0053 
0054 2. A cache device - the small, fast one.
0055 
0056 3. A small metadata device - records which blocks are in the cache,
0057    which are dirty, and extra hints for use by the policy object.
0058    This information could be put on the cache device, but having it
0059    separate allows the volume manager to configure it differently,
0060    e.g. as a mirror for extra robustness.  This metadata device may only
0061    be used by a single cache device.
0062 
0063 Fixed block size
0064 ----------------
0065 
0066 The origin is divided up into blocks of a fixed size.  This block size
0067 is configurable when you first create the cache.  Typically we've been
0068 using block sizes of 256KB - 1024KB.  The block size must be between 64
0069 sectors (32KB) and 2097152 sectors (1GB) and a multiple of 64 sectors (32KB).
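
The limits above can be checked before a table is loaded.  The following
is a minimal sketch, assuming 512-byte sectors; the function names
kb_to_sectors and valid_block_size are hypothetical helpers, not part of
dmsetup or the kernel.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

kb_to_sectors() {
    # 1KB is two 512-byte sectors
    echo $(( $1 * 2 ))
}

valid_block_size() {
    # $1: block size in sectors
    # Must be within 64..2097152 sectors and a multiple of 64 sectors.
    [ "$1" -ge 64 ] && [ "$1" -le 2097152 ] && [ $(( $1 % 64 )) -eq 0 ]
}

bs=$(kb_to_sectors 256)     # 256KB is 512 sectors
if valid_block_size "$bs"; then
    echo "block size $bs sectors is acceptable"
fi
```

The resulting sector count is the value passed as <block size> in the
constructor.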
0070 
0071 Having a fixed block size simplifies the target a lot.  But it is
0072 something of a compromise.  For instance, a small part of a block may be
0073 getting hit a lot, yet the whole block will be promoted to the cache.
0074 So large block sizes are bad because they waste cache space.  And small
0075 block sizes are bad because they increase the amount of metadata (both
0076 in core and on disk).
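
To get a feel for the trade-off, it helps to count how many cache blocks
a given block size produces, since the metadata grows with that count.
A rough sketch follows; the 10GB cache device and the function name
cache_block_count are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: cache_block_count is a hypothetical helper.

cache_block_count() {
    # $1: cache device size in sectors, $2: block size in sectors
    echo $(( $1 / $2 ))
}

# A hypothetical 10GB cache device is 20971520 sectors of 512 bytes.
cache_block_count 20971520 512      # 256KB blocks  -> 40960
cache_block_count 20971520 2048     # 1024KB blocks -> 10240
```

Quartering the block size quadruples the number of blocks that must be
tracked, both in core and on disk.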
0077 
0078 Cache operating modes
0079 ---------------------
0080 
0081 The cache has three operating modes: writeback, writethrough and
0082 passthrough.
0083 
0084 If writeback, the default, is selected then a write to a block that is
0085 cached will go only to the cache and the block will be marked dirty in
0086 the metadata.
0087 
0088 If writethrough is selected then a write to a cached block will not
0089 complete until it has hit both the origin and cache devices.  Clean
0090 blocks should remain clean.
0091 
0092 If passthrough is selected, useful when the cache contents are not known
0093 to be coherent with the origin device, then all reads are served from
0094 the origin device (all reads miss the cache) and all writes are
0095 forwarded to the origin device; additionally, write hits cause cache
0096 block invalidates.  To enable passthrough mode the cache must be clean.
0097 Passthrough mode allows a cache device to be activated without having to
0098 worry about coherency.  Coherency that exists is maintained, although
0099 the cache will gradually cool as writes take place.  If the coherency of
0100 the cache can later be verified, or established through use of the
0101 "invalidate_cblocks" message, the cache device can be transitioned to
0102 writethrough or writeback mode while still warm.  Otherwise, the cache
0103 contents can be discarded prior to transitioning to the desired
0104 operating mode.
0105 
0106 A simple cleaner policy is provided, which will clean (write back) all
0107 dirty blocks in a cache.  This is useful when decommissioning or
0108 shrinking a cache.  Shrinking the cache's fast device requires all cache
0109 blocks, in the area of the cache being removed, to be clean.  If the
0110 area being removed from the cache still contains dirty blocks the resize
0111 will fail.  Care must be taken to never reduce the volume used for the
0112 cache's fast device until the cache is clean.  This is of particular
0113 importance if writeback mode is used.  Writethrough and passthrough
0114 modes already maintain a clean cache.  Future support to partially clean
0115 the cache, above a specified threshold, will allow for keeping the cache
0116 warm and in writeback mode during resize.
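
One way to switch an existing cache over to the cleaner policy is to
reload its table with a different policy name.  The sketch below only
prints the commands rather than running them.  It assumes the policy is
registered under the name "cleaner", reuses the device names from the
Examples section, and the helper cleaner_table is hypothetical.

```shell
#!/bin/sh
# Sketch only: cleaner_table is a hypothetical helper.

cleaner_table() {
    # $1 length in sectors, $2 metadata dev, $3 cache dev, $4 origin dev,
    # $5 block size in sectors.
    # Zero feature args (writeback), policy "cleaner", zero policy args.
    echo "0 $1 cache $2 $3 $4 $5 0 cleaner 0"
}

table=$(cleaner_table 41943040 /dev/mapper/metadata \
        /dev/mapper/ssd /dev/mapper/origin 512)

# Printed for review; the commands need root to actually run.
echo "dmsetup suspend my_cache"
echo "dmsetup reload my_cache --table '$table'"
echo "dmsetup resume my_cache"
```

Once the reloaded table is live, the <#dirty> field in the status output
should fall to zero before the fast device is resized or removed.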
0117 
0118 Migration throttling
0119 --------------------
0120 
0121 Migrating data between the origin and cache device uses bandwidth.
0122 The user can set a throttle to prevent more than a certain amount of
0123 migration occurring at any one time.  Currently the throttle takes no
0124 account of normal I/O traffic going to the devices.  More work is
0125 needed here to avoid migrating during peak I/O.
0126 
0127 For the time being, a message "migration_threshold <#sectors>"
0128 can be used to set the maximum number of sectors being migrated,
0129 the default being 2048 sectors (1MB).
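
Since the threshold is given in sectors, a small conversion avoids
mistakes.  This sketch assumes 512-byte sectors and a cache device named
my_cache; the helper names are hypothetical, and the command is printed
rather than executed.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

mb_to_sectors() {
    # 1MB is 2048 sectors of 512 bytes
    echo $(( $1 * 2048 ))
}

migration_threshold_cmd() {
    # $1: dm device name, $2: migration limit in MB
    echo "dmsetup message $1 0 migration_threshold $(mb_to_sectors "$2")"
}

migration_threshold_cmd my_cache 4
# prints: dmsetup message my_cache 0 migration_threshold 8192
```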
0130 
0131 Updating on-disk metadata
0132 -------------------------
0133 
0134 On-disk metadata is committed every time a FLUSH or FUA bio is written.
0135 If no such requests are made then commits will occur every second.  This
0136 means the cache behaves like a physical disk that has a volatile write
0137 cache.  If power is lost you may lose some recent writes.  The metadata
0138 should always be consistent in spite of any crash.
0139 
0140 The 'dirty' state for a cache block changes far too frequently for us
0141 to keep updating it on the fly.  So we treat it as a hint.  In normal
0142 operation it will be written when the dm device is suspended.  If the
0143 system crashes, all cache blocks will be assumed dirty when restarted.
0144 
0145 Per-block policy hints
0146 ----------------------
0147 
0148 Policy plug-ins can store a chunk of data per cache block.  It's up to
0149 the policy how big this chunk is, but it should be kept small.  Like the
0150 dirty flags, this data is lost if there's a crash, so a safe fallback
0151 value should always be possible.
0152 
0153 Policy hints affect performance, not correctness.
0154 
0155 Policy messaging
0156 ----------------
0157 
0158 Policies will have different tunables, specific to each one, so we
0159 need a generic way of getting and setting these.  Device-mapper
0160 messages are used.  Refer to cache-policies.txt.
0161 
0162 Discard bitset resolution
0163 -------------------------
0164 
0165 We can avoid copying data during migration if we know the block has
0166 been discarded.  A prime example of this is when mkfs discards the
0167 whole block device.  We store a bitset tracking the discard state of
0168 blocks.  However, we allow this bitset to have a different block size
0169 from the cache blocks.  This is because we need to track the discard
0170 state for all of the origin device (compare with the dirty bitset
0171 which is just for the smaller cache device).
0172 
0173 Target interface
0174 ================
0175 
0176 Constructor
0177 -----------
0178 
0179   ::
0180 
0181    cache <metadata dev> <cache dev> <origin dev> <block size>
0182          <#feature args> [<feature arg>]*
0183          <policy> <#policy args> [policy args]*
0184 
0185  ================ =======================================================
0186  metadata dev     fast device holding the persistent metadata
0187  cache dev        fast device holding cached data blocks
0188  origin dev       slow device holding original data blocks
0189  block size       cache unit size in sectors
0190 
0191  #feature args    number of feature arguments passed
0192  feature args     writethrough or passthrough (The default is writeback.)
0193 
0194  policy           the replacement policy to use
0195  #policy args     an even number of arguments corresponding to
0196                   key/value pairs passed to the policy
0197  policy args      key/value pairs passed to the policy
0198                   E.g. 'sequential_threshold 1024'
0199                   See cache-policies.txt for details.
0200  ================ =======================================================
0201 
0202 Optional feature arguments are:
0203 
0204 
0205    ==================== ========================================================
0206    writethrough         write through caching that prohibits cache block
0207                         content from being different from origin block content.
0208                         Without this argument, the default behaviour is to write
0209                         back cache block contents later for performance reasons,
0210                         so they may differ from the corresponding origin blocks.
0211 
0212    passthrough          a degraded mode useful for various cache coherency
0213                         situations (e.g., rolling back snapshots of
0214                         underlying storage).  Reads and writes always go to
0215                         the origin.  If a write goes to a cached origin
0216                         block, then the cache block is invalidated.
0217                         To enable passthrough mode the cache must be clean.
0218 
0219    metadata2            use version 2 of the metadata.  This stores the dirty
0220                         bits in a separate btree, which improves speed of
0221                         shutting down the cache.
0222 
0223    no_discard_passdown  disable passing down discards from the cache
0224                         to the origin's data device.
0225    ==================== ========================================================
0226 
0227 A policy called 'default' is always registered.  This is an alias for
0228 the policy we currently think is giving best all round performance.
0229 
0230 As the default policy could vary between kernels, if you are relying on
0231 the characteristics of a specific policy, always request it by name.
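
The constructor arguments can be assembled mechanically from the
operating mode.  The following sketch builds a table line with no policy
arguments; cache_table is a hypothetical helper, and the device paths
are the ones used in the Examples section.

```shell
#!/bin/sh
# Sketch only: cache_table is a hypothetical helper.

cache_table() {
    # $1 length in sectors, $2 metadata dev, $3 cache dev, $4 origin dev,
    # $5 block size in sectors, $6 operating mode, $7 policy name
    case "$6" in
        writeback)
            features="0" ;;           # default mode, no feature arg needed
        writethrough|passthrough)
            features="1 $6" ;;
        *)
            echo "unknown mode: $6" >&2
            return 1 ;;
    esac
    echo "0 $1 cache $2 $3 $4 $5 $features $7 0"
}

cache_table 41943040 /dev/mapper/metadata /dev/mapper/ssd \
            /dev/mapper/origin 512 writethrough default
```

The printed line can be passed to dmsetup create via --table.  Naming a
specific policy instead of "default" follows the advice above.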
0232 
0233 Status
0234 ------
0235 
0236 ::
0237 
0238   <metadata block size> <#used metadata blocks>/<#total metadata blocks>
0239   <cache block size> <#used cache blocks>/<#total cache blocks>
0240   <#read hits> <#read misses> <#write hits> <#write misses>
0241   <#demotions> <#promotions> <#dirty> <#features> <features>*
0242   <#core args> <core args>* <policy name> <#policy args> <policy args>*
0243   <cache metadata mode> <needs_check>
0244 
0245 
0246 ========================= =====================================================
0247 metadata block size       Fixed block size for each metadata block in
0248                           sectors
0249 #used metadata blocks     Number of metadata blocks used
0250 #total metadata blocks    Total number of metadata blocks
0251 cache block size          Configurable block size for the cache device
0252                           in sectors
0253 #used cache blocks        Number of blocks resident in the cache
0254 #total cache blocks       Total number of cache blocks
0255 #read hits                Number of times a READ bio has been mapped
0256                           to the cache
0257 #read misses              Number of times a READ bio has been mapped
0258                           to the origin
0259 #write hits               Number of times a WRITE bio has been mapped
0260                           to the cache
0261 #write misses             Number of times a WRITE bio has been
0262                           mapped to the origin
0263 #demotions                Number of times a block has been removed
0264                           from the cache
0265 #promotions               Number of times a block has been moved to
0266                           the cache
0267 #dirty                    Number of blocks in the cache that differ
0268                           from the origin
0269 #features                 Number of feature args to follow
0270 features                  'writethrough' (optional)
0271 #core args                Number of core arguments (must be even)
0272 core args                 Key/value pairs for tuning the core
0273                           e.g. migration_threshold
0274 policy name               Name of the policy
0275 #policy args              Number of policy arguments to follow (must be even)
0276 policy args               Key/value pairs e.g. sequential_threshold
0277 cache metadata mode       ro if read-only, rw if read-write
0278 
0279                           In serious cases where even a read-only mode is
0280                           deemed unsafe no further I/O will be permitted and
0281                           the status will just contain the string 'Fail'.
0282                           The userspace recovery tools should then be used.
0283 needs_check               'needs_check' if set, '-' if not set
0284                           A metadata operation has failed, resulting in the
0285                           needs_check flag being set in the metadata's
0286                           superblock.  The metadata device must be
0287                           deactivated and checked/repaired before the
0288                           cache can be made fully operational again.
0289                           '-' indicates needs_check is not set.
0290 ========================= =====================================================
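
The fixed leading fields make the status line easy to summarise.  The
sketch below assumes the usual dmsetup status layout, where the start
sector, length and target name come before the fields listed above.  The
sample line is made up for illustration, and cache_status_summary is a
hypothetical helper.

```shell
#!/bin/sh
# Sketch only: cache_status_summary is a hypothetical helper.

cache_status_summary() {
    # Reads one status line on stdin.  Fields 1-3 are start, length and
    # target name, so <metadata block size> is field 4, the cache block
    # usage is field 7, read hits/misses are fields 8-9, #dirty is 14.
    awk '{
        split($7, used, "/")
        hits = $8
        total = $8 + $9
        pct = (total > 0) ? int(100 * hits / total) : 0
        printf "cache blocks used: %d of %d\n", used[1], used[2]
        printf "read hit rate: %d%%\n", pct
        printf "dirty blocks: %d\n", $14
    }'
}

# Made-up sample; on a live system pipe in: dmsetup status my_cache
sample="0 41943040 cache 8 72/4096 512 1000/81920 300 100 50 25 3 1003 12 1 writeback 2 migration_threshold 2048 smq 0 rw -"
echo "$sample" | cache_status_summary
```

Only the fixed-position fields are read here, because everything after
<#dirty> shifts with the number of feature, core and policy arguments.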
0291 
0292 Messages
0293 --------
0294 
0295 Policies will have different tunables, specific to each one, so we
0296 need a generic way of getting and setting these.  Device-mapper
0297 messages are used.  (A sysfs interface would also be possible.)
0298 
0299 The message format is::
0300 
0301    <key> <value>
0302 
0303 E.g.::
0304 
0305    dmsetup message my_cache 0 sequential_threshold 1024
0306 
0307 
0308 Invalidation is removing an entry from the cache without writing it
0309 back.  Cache blocks can be invalidated via the invalidate_cblocks
0310 message, which takes an arbitrary number of cblock ranges.  Each cblock
0311 range's end value is "one past the end", meaning 5-10 expresses a range
0312 of values from 5 to 9.  Each cblock must be expressed as a decimal
0313 value.  In the future, a variant message that takes cblock ranges
0314 expressed in hexadecimal may be needed to better support efficient
0315 invalidation of larger caches.  The cache must be in passthrough mode
0316 when invalidate_cblocks is used::
0317 
0318    invalidate_cblocks [<cblock>|<cblock begin>-<cblock end>]*
0319 
0320 E.g.::
0321 
0322    dmsetup message my_cache 0 invalidate_cblocks 2345 3456-4567 5678-6789
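
Because the end of each range is exclusive, it is easy to be off by one
when working out how many blocks a message will invalidate.  A small
sketch of the counting rule follows; cblock_count is a hypothetical
helper.

```shell
#!/bin/sh
# Sketch only: cblock_count is a hypothetical helper.

cblock_count() {
    # Each argument is a single cblock ("2345") or a range whose end is
    # one past the last block ("3456-4567").
    total=0
    for arg in "$@"; do
        case "$arg" in
            *-*)
                begin=${arg%-*}
                end=${arg#*-}
                total=$(( total + end - begin )) ;;
            *)
                total=$(( total + 1 )) ;;
        esac
    done
    echo "$total"
}

cblock_count 5-10                         # 5 blocks: 5, 6, 7, 8, 9
cblock_count 2345 3456-4567 5678-6789     # 1 + 1111 + 1111 = 2223
```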
0323 
0324 Examples
0325 ========
0326 
0327 The test suite can be found here:
0328 
0329 https://github.com/jthornber/device-mapper-test-suite
0330 
0331 ::
0332 
0333   dmsetup create my_cache --table "0 41943040 cache /dev/mapper/metadata \
0334           /dev/mapper/ssd /dev/mapper/origin 512 1 writeback default 0"
0335   dmsetup create my_cache --table "0 41943040 cache /dev/mapper/metadata \
0336           /dev/mapper/ssd /dev/mapper/origin 1024 1 writeback \
0337           mq 4 sequential_threshold 1024 random_threshold 8"