0001 ==========
0002 MD Cluster
0003 ==========
0004
The cluster MD is a shared-device RAID for a cluster. It supports
two levels: raid1 and raid10 (limited support).
0007
0008
0009 1. On-disk format
0010 =================
0011
0012 Separate write-intent-bitmaps are used for each cluster node.
0013 The bitmaps record all writes that may have been started on that node,
0014 and may not yet have finished. The on-disk layout is::
0015
0016 0 4k 8k 12k
0017 -------------------------------------------------------------------
0018 | idle | md super | bm super [0] + bits |
0019 | bm bits[0, contd] | bm super[1] + bits | bm bits[1, contd] |
0020 | bm super[2] + bits | bm bits [2, contd] | bm super[3] + bits |
0021 | bm bits [3, contd] | | |
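
For illustration, the slot offsets in this diagram can be computed as
below. This is a minimal userspace sketch based only on the layout
above, where each per-node bitmap occupies two 4k blocks; the real
on-disk size of a bitmap depends on the array size and the bitmap
chunk size::

    /* Sketch: offsets taken from the layout diagram above.  The
     * two-blocks-per-slot figure is an assumption from the diagram,
     * not a fixed property of the format. */
    #include <stdio.h>

    #define BLK4K        4096UL
    #define MD_SUPER_OFF (1 * BLK4K)   /* md superblock lives at 4k */
    #define BM_AREA_OFF  (2 * BLK4K)   /* first bitmap super at 8k  */
    #define BM_SLOT_SIZE (2 * BLK4K)   /* super + bits, bits contd. */

    static unsigned long bm_super_offset(int slot)
    {
        return BM_AREA_OFF + (unsigned long)slot * BM_SLOT_SIZE;
    }

    int main(void)
    {
        for (int slot = 0; slot < 4; slot++)
            printf("bitmap slot %d: superblock at %luk\n",
                   slot, bm_super_offset(slot) / 1024);
        return 0;
    }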
0022
0023 During "normal" functioning we assume the filesystem ensures that only
0024 one node writes to any given block at a time, so a write request will
0025
0026 - set the appropriate bit (if not already set)
0027 - commit the write to all mirrors
0028 - schedule the bit to be cleared after a timeout.
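
A minimal sketch of this sequence in userspace C follows; the three
helpers are illustrative stand-ins, not real md functions::

    /* Models the three-step clustered write path described above. */
    #include <stdbool.h>
    #include <stdio.h>

    static bool bit_set[1024];          /* toy write-intent bitmap */

    static void set_bitmap_bit(int chunk)
    {
        if (!bit_set[chunk]) {          /* only if not already set */
            bit_set[chunk] = true;
            printf("bitmap: bit %d set and flushed\n", chunk);
        }
    }

    static void write_to_mirrors(int chunk)
    {
        printf("chunk %d committed to all mirrors\n", chunk);
    }

    static void schedule_bit_clear(int chunk)
    {
        /* md clears the bit after a timeout, so back-to-back writes
         * to the same chunk don't bounce the bit on disk. */
        printf("bit %d scheduled for clearing\n", chunk);
    }

    int main(void)
    {
        int chunk = 42;

        set_bitmap_bit(chunk);
        write_to_mirrors(chunk);
        schedule_bit_clear(chunk);
        return 0;
    }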
0029
0030 Reads are just handled normally. It is up to the filesystem to ensure
0031 one node doesn't read from a location where another node (or the same
0032 node) is writing.
0033
0034
0035 2. DLM Locks for management
0036 ===========================
0037
0038 There are three groups of locks for managing the device:
0039
0040 2.1 Bitmap lock resource (bm_lockres)
0041 -------------------------------------
0042
0043 The bm_lockres protects individual node bitmaps. They are named in
0044 the form bitmap000 for node 1, bitmap001 for node 2 and so on. When a
node joins the cluster, it acquires the lock in PW mode and holds
it for as long as the node remains part of the cluster. The lock
0047 resource number is based on the slot number returned by the DLM
0048 subsystem. Since DLM starts node count from one and bitmap slots
0049 start from zero, one is subtracted from the DLM slot number to arrive
0050 at the bitmap slot number.
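
For example, a DLM slot number can be mapped to the corresponding lock
resource name as follows (a sketch; only the "bitmap%03d" naming
convention is taken from the description above)::

    /* Sketch: derive the bitmap lock resource name from the DLM slot.
     * DLM slots count from one, bitmap slots from zero, hence "- 1". */
    #include <stdio.h>

    static void bitmap_lockres_name(int dlm_slot, char *name, size_t len)
    {
        snprintf(name, len, "bitmap%03d", dlm_slot - 1);
    }

    int main(void)
    {
        char name[16];

        for (int dlm_slot = 1; dlm_slot <= 3; dlm_slot++) {
            bitmap_lockres_name(dlm_slot, name, sizeof(name));
            printf("DLM slot %d -> %s\n", dlm_slot, name);
        }
        return 0;
    }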
0051
0052 The LVB of the bitmap lock for a particular node records the range
0053 of sectors that are being re-synced by that node. No other
node may write to those sectors. This is used when a new node
joins the cluster.
0056
0057 2.2 Message passing locks
0058 -------------------------
0059
0060 Each node has to communicate with other nodes when starting or ending
0061 resync, and for metadata superblock updates. This communication is
managed through three locks: "token", "message", and "ack", together
with the Lock Value Block (LVB) of the "message" lock.
0064
2.3 New-device management
0066 -------------------------
0067
A single lock, "no-new-dev", is used to co-ordinate the addition of
new devices - this must be synchronized across the array.
Normally all nodes hold the lock in concurrent-read (CR) mode.
0071
0072 3. Communication
0073 ================
0074
0075 Messages can be broadcast to all nodes, and the sender waits for all
0076 other nodes to acknowledge the message before proceeding. Only one
0077 message can be processed at a time.
0078
0079 3.1 Message Types
0080 -----------------
0081
0082 There are six types of messages which are passed:
0083
0084 3.1.1 METADATA_UPDATED
0085 ^^^^^^^^^^^^^^^^^^^^^^
0086
0087 informs other nodes that the metadata has
been updated, and that each node must re-read the md superblock. This is
0089 performed synchronously. It is primarily used to signal device
0090 failure.
0091
0092 3.1.2 RESYNCING
0093 ^^^^^^^^^^^^^^^
0094 informs other nodes that a resync is initiated or
0095 ended so that each node may suspend or resume the region. Each
0096 RESYNCING message identifies a range of the devices that the
0097 sending node is about to resync. This overrides any previous
notification from that node: only one range can be resynced at a
time per node.
0100
0101 3.1.3 NEWDISK
0102 ^^^^^^^^^^^^^
0103
0104 informs other nodes that a device is being added to
0105 the array. Message contains an identifier for that device. See
0106 below for further details.
0107
0108 3.1.4 REMOVE
0109 ^^^^^^^^^^^^
0110
0111 A failed or spare device is being removed from the
0112 array. The slot-number of the device is included in the message.
0113
3.1.5 RE_ADD
^^^^^^^^^^^^

A failed device is being re-activated - the assumption
is that it has been determined to be working again.
0118
3.1.6 BITMAP_NEEDS_SYNC
^^^^^^^^^^^^^^^^^^^^^^^

If a node is stopped locally but the bitmap
isn't clean, then another node is informed to take ownership of the
resync.
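
Taken together, the message types could be represented as a simple
enumeration. This is a sketch following the list above; the actual
encoding used by drivers/md/md-cluster.c may differ::

    /* The six message types described above; the numeric values are
     * an assumption for illustration, not the on-wire encoding. */
    enum msg_type {
        METADATA_UPDATED = 0,
        RESYNCING,
        NEWDISK,
        REMOVE,
        RE_ADD,
        BITMAP_NEEDS_SYNC,
    };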
0124
0125 3.2 Communication mechanism
0126 ---------------------------
0127
The DLM LVB is used to communicate between the nodes of the cluster.
There are three resources used for the purpose:
0130
0131 3.2.1 token
0132 ^^^^^^^^^^^
The resource which protects the entire communication
system. The node holding the token resource is allowed to
communicate.
0136
0137 3.2.2 message
0138 ^^^^^^^^^^^^^
0139 The lock resource which carries the data to communicate.
0140
0141 3.2.3 ack
0142 ^^^^^^^^^
0143
The resource whose acquisition means the message has been
acknowledged by all nodes in the cluster. The BAST of the resource
is used to inform the receiving nodes that a node wants to
communicate.
0148
0149 The algorithm is:
0150
1. receive status - all nodes have a concurrent-read lock on "ack"::
0152
0153 sender receiver receiver
0154 "ack":CR "ack":CR "ack":CR
0155
2. sender gets EX on "token",
   sender gets EX on "message"::
0158
0159 sender receiver receiver
0160 "token":EX "ack":CR "ack":CR
0161 "message":EX
0162 "ack":CR
0163
0164 Sender checks that it still needs to send a message. Messages
0165 received or other events that happened while waiting for the
0166 "token" may have made this message inappropriate or redundant.
0167
0168 3. sender writes LVB
0169
   sender down-converts "message" from EX to CW
0171
   sender tries to get EX on "ack"
0173
0174 ::
0175
0176 [ wait until all receivers have *processed* the "message" ]
0177
0178 [ triggered by bast of "ack" ]
0179 receiver get CR on "message"
0180 receiver read LVB
0181 receiver processes the message
0182 [ wait finish ]
0183 receiver releases "ack"
0184 receiver tries to get PR on "message"
0185
0186 sender receiver receiver
0187 "token":EX "message":CR "message":CR
0188 "message":CW
0189 "ack":EX
0190
0191 4. triggered by grant of EX on "ack" (indicating all receivers
0192 have processed message)
0193
0194 sender down-converts "ack" from EX to CR
0195
0196 sender releases "message"
0197
0198 sender releases "token"
0199
0200 ::
0201
0202 receiver upconvert to PR on "message"
0203 receiver get CR of "ack"
0204 receiver release "message"
0205
0206 sender receiver receiver
0207 "ack":CR "ack":CR "ack":CR
0208
0209
0210 4. Handling Failures
0211 ====================
0212
0213 4.1 Node Failure
0214 ----------------
0215
When a node fails, the DLM informs the cluster with the slot
number of the failed node. A surviving node then starts a cluster
recovery thread. The cluster recovery thread:
0219
0220 - acquires the bitmap<number> lock of the failed node
0221 - opens the bitmap
0222 - reads the bitmap of the failed node
0223 - copies the set bitmap to local node
0224 - cleans the bitmap of the failed node
0225 - releases bitmap<number> lock of the failed node
0226 - initiates resync of the bitmap on the current node
  md_check_recovery() is invoked within recover_bitmaps();
  md_check_recovery() then calls metadata_update_start()/finish(),
  which locks the communication channel via lock_comm().
  This means that while one node is resyncing, all other nodes are
  blocked from writing anywhere on the array.
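
The bitmap-copy step amounts to OR-ing the failed node's set bits into
the local bitmap and then clearing the failed node's bitmap. A toy
model, with locking and on-disk I/O elided::

    /* Toy model of the recovery thread's bitmap handling. */
    #include <stdint.h>
    #include <stdio.h>

    #define BM_WORDS 4

    static void absorb_failed_bitmap(uint64_t *local, uint64_t *failed)
    {
        for (int i = 0; i < BM_WORDS; i++) {
            local[i] |= failed[i];      /* copy the set bits       */
            failed[i] = 0;              /* clean the failed bitmap */
        }
    }

    int main(void)
    {
        uint64_t local[BM_WORDS]  = { 0x1, 0x0, 0x0, 0x0 };
        uint64_t failed[BM_WORDS] = { 0x2, 0xff, 0x0, 0x0 };

        absorb_failed_bitmap(local, failed);
        printf("local[0]=%#llx local[1]=%#llx\n",
               (unsigned long long)local[0],
               (unsigned long long)local[1]);
        return 0;
    }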
0232
The resync process is the regular md resync. However, in a clustered
environment, when a resync is performed, the resyncing node needs to
tell the other nodes which areas are suspended. Before a resync
starts, the node sends out a RESYNCING message with the (lo,hi) range
of the area which needs to be suspended. Each node maintains a
suspend_list, which contains the list of ranges which are currently
suspended. On receiving RESYNCING, a node adds the range to its
suspend_list. Similarly, when the node performing the resync finishes,
it sends a RESYNCING message with an empty range to the other nodes,
and they remove the corresponding entry from the suspend_list.
0243
A helper function, ->area_resyncing(), can be used to check whether a
particular I/O range should be suspended.
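
A minimal model of the suspend_list and an ->area_resyncing()-style
check is sketched below; the real kernel code also tracks which node
announced each range and distinguishes READ from WRITE, which is
omitted here::

    /* Sketch: suspend_list entries record the (lo, hi) range a node
     * announced in a RESYNCING message. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct suspend_info {
        int      slot;              /* node announcing the resync */
        uint64_t lo, hi;            /* suspended sector range     */
    };

    static struct suspend_info suspend_list[8];
    static int nr_suspended;

    static bool area_resyncing(uint64_t lo, uint64_t hi)
    {
        for (int i = 0; i < nr_suspended; i++)
            if (hi > suspend_list[i].lo && lo < suspend_list[i].hi)
                return true;        /* ranges overlap: suspend I/O */
        return false;
    }

    int main(void)
    {
        /* RESYNCING received: node in slot 1 suspends [1000, 2000) */
        suspend_list[nr_suspended++] =
            (struct suspend_info){ .slot = 1, .lo = 1000, .hi = 2000 };

        printf("write [1500,1600) suspended? %d\n",
               area_resyncing(1500, 1600));
        printf("write [3000,3100) suspended? %d\n",
               area_resyncing(3000, 3100));
        return 0;
    }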
0246
4.2 Device Failure
------------------
0249
0250 Device failures are handled and communicated with the metadata update
0251 routine. When a node detects a device failure it does not allow
0252 any further writes to that device until the failure has been
0253 acknowledged by all other nodes.
0254
0255 5. Adding a new Device
0256 ----------------------
0257
0258 For adding a new device, it is necessary that all nodes "see" the new
0259 device to be added. For this, the following algorithm is used:
0260
0261 1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
0262 ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD)
0263 2. Node 1 sends a NEWDISK message with uuid and slot number
0264 3. Other nodes issue kobject_uevent_env with uuid and slot number
0265 (Steps 4,5 could be a udev rule)
0266 4. In userspace, the node searches for the disk, perhaps
0267 using blkid -t SUB_UUID=""
0268 5. Other nodes issue either of the following depending on whether
0269 the disk was found:
0270 ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
0271 disc.number set to slot number)
0272 ioctl(CLUSTERED_DISK_NACK)
6. Other nodes drop lock on "no-new-dev" (CR) if device is found
0274 7. Node 1 attempts EX lock on "no-new-dev"
0275 8. If node 1 gets the lock, it sends METADATA_UPDATED after
0276 unmarking the disk as SpareLocal
9. If node 1 cannot get the "no-new-dev" lock, it fails the operation
   and sends METADATA_UPDATED.
10. Other nodes learn whether the disk was added or not from the
    following METADATA_UPDATED message.
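
A sketch of step 5 as seen on a receiving node is shown below. It uses
the ADD_NEW_DISK and CLUSTERED_DISK_NACK ioctls named in the steps
above; error handling is minimal and the details should be checked
against <linux/raid/md_u.h> and <linux/raid/md_p.h>::

    /* Acknowledge a NEWDISK request with a candidate device, or NACK
     * if the device was not found on this node. */
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/raid/md_u.h>
    #include <linux/raid/md_p.h>

    static int ack_new_disk(int md_fd, int found, int slot,
                            int dev_major, int dev_minor)
    {
        mdu_disk_info_t info;

        if (!found)
            return ioctl(md_fd, CLUSTERED_DISK_NACK);

        memset(&info, 0, sizeof(info));
        info.number = slot;                   /* slot from NEWDISK   */
        info.major  = dev_major;
        info.minor  = dev_minor;
        info.state  = 1 << MD_DISK_CANDIDATE; /* clustered candidate */
        return ioctl(md_fd, ADD_NEW_DISK, &info);
    }

    int main(void)
    {
        /* Demonstration only: fd -1 makes the ioctl fail harmlessly. */
        return ack_new_disk(-1, 0, 0, 0, 0) < 0 ? 0 : 1;
    }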
0281
0282 6. Module interface
0283 ===================
0284
0285 There are 17 call-backs which the md core can make to the cluster
0286 module. Understanding these can give a good overview of the whole
0287 process.
0288
0289 6.1 join(nodes) and leave()
0290 ---------------------------
0291
0292 These are called when an array is started with a clustered bitmap,
0293 and when the array is stopped. join() ensures the cluster is
0294 available and initializes the various resources.
0295 Only the first 'nodes' nodes in the cluster can use the array.
0296
0297 6.2 slot_number()
0298 -----------------
0299
0300 Reports the slot number advised by the cluster infrastructure.
0301 Range is from 0 to nodes-1.
0302
0303 6.3 resync_info_update()
0304 ------------------------
0305
0306 This updates the resync range that is stored in the bitmap lock.
0307 The starting point is updated as the resync progresses. The
0308 end point is always the end of the array.
0309 It does *not* send a RESYNCING message.
0310
0311 6.4 resync_start(), resync_finish()
0312 -----------------------------------
0313
0314 These are called when resync/recovery/reshape starts or stops.
0315 They update the resyncing range in the bitmap lock and also
0316 send a RESYNCING message. resync_start reports the whole
0317 array as resyncing, resync_finish reports none of it.
0318
0319 resync_finish() also sends a BITMAP_NEEDS_SYNC message which
0320 allows some other node to take over.
0321
0322 6.5 metadata_update_start(), metadata_update_finish(), metadata_update_cancel()
0323 -------------------------------------------------------------------------------
0324
metadata_update_start is used to get exclusive access to
the metadata. If a change is still needed once that access is
gained, metadata_update_finish() will send a METADATA_UPDATED
message to all other nodes, otherwise metadata_update_cancel()
can be used to release the lock.
0330
0331 6.6 area_resyncing()
0332 --------------------
0333
0334 This combines two elements of functionality.
0335
0336 Firstly, it will check if any node is currently resyncing
0337 anything in a given range of sectors. If any resync is found,
0338 then the caller will avoid writing or read-balancing in that
0339 range.
0340
0341 Secondly, while node recovery is happening it reports that
0342 all areas are resyncing for READ requests. This avoids races
0343 between the cluster-filesystem and the cluster-RAID handling
0344 a node failure.
0345
0346 6.7 add_new_disk_start(), add_new_disk_finish(), new_disk_ack()
0347 ---------------------------------------------------------------
0348
0349 These are used to manage the new-disk protocol described above.
When a new device is added, add_new_disk_start() is called before
it is bound to the array and, if that succeeds, add_new_disk_finish()
is called once the device is fully added.
0353
When a device is added in acknowledgement of a previous
request, or when the device is declared "unavailable",
new_disk_ack() is called.
0357
0358 6.8 remove_disk()
0359 -----------------
0360
This is called when a spare or failed device is removed from
the array. It causes a REMOVE message to be sent to the other nodes.
0363
0364 6.9 gather_bitmaps()
0365 --------------------
0366
This sends a RE_ADD message to all other nodes and then
gathers bitmap information from all bitmaps. The combined
bitmap is then used to recover the re-added device.
0370
0371 6.10 lock_all_bitmaps() and unlock_all_bitmaps()
0372 ------------------------------------------------
0373
These are called when the bitmap is changed to none. If a node plans
to clear the cluster RAID's bitmap, it needs to make sure no other
node is using the array. This is achieved by locking all the bitmap
locks within the cluster; the locks are unlocked accordingly
afterwards.
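
For example, clearing a clustered bitmap could be guarded roughly as
below; try_lock_bitmap() and unlock_bitmap() are hypothetical
stand-ins for the DLM operations::

    /* Sketch of lock_all_bitmaps(): take every per-node bitmap lock,
     * or back out if another node still holds one (meaning the array
     * is still in use there). */
    #include <stdbool.h>

    #define MAX_NODES 4

    static bool try_lock_bitmap(int slot) { (void)slot; return true; }
    static void unlock_bitmap(int slot)   { (void)slot; }

    static bool lock_all_bitmaps(void)
    {
        for (int slot = 0; slot < MAX_NODES; slot++) {
            if (!try_lock_bitmap(slot)) {
                while (--slot >= 0)     /* another node uses the raid */
                    unlock_bitmap(slot);
                return false;
            }
        }
        return true;                    /* safe to clear the bitmap */
    }

    int main(void)
    {
        return lock_all_bitmaps() ? 0 : 1;
    }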
0379
0380 7. Unsupported features
0381 =======================
0382
There are some things that cluster MD does not support yet.

- changing array_sectors.