0001 ==========
0002 MD Cluster
0003 ==========
0004
The cluster MD is a shared-device RAID for a cluster. It supports
two levels: raid1 and raid10 (limited support).
0007
0008
0009 1. On-disk format
0010 =================
0011
0012 Separate write-intent-bitmaps are used for each cluster node.
0013 The bitmaps record all writes that may have been started on that node,
0014 and may not yet have finished. The on-disk layout is::
0015
0016 0 4k 8k 12k
0017 -------------------------------------------------------------------
0018 | idle | md super | bm super [0] + bits |
0019 | bm bits[0, contd] | bm super[1] + bits | bm bits[1, contd] |
0020 | bm super[2] + bits | bm bits [2, contd] | bm super[3] + bits |
0021 | bm bits [3, contd] | | |
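
For illustration, the slot offsets in this diagram can be computed as
below. This is a minimal userspace sketch based only on the layout
above, where each per-node bitmap occupies two 4k blocks; the real
on-disk size of a bitmap depends on the array size and the bitmap
chunk size::

    /* Sketch: offsets taken from the layout diagram above.  The
     * two-blocks-per-slot figure is an assumption from the diagram,
     * not a fixed property of the format. */
    #include <stdio.h>

    #define BLK4K        4096UL
    #define MD_SUPER_OFF (1 * BLK4K)   /* md superblock lives at 4k */
    #define BM_AREA_OFF  (2 * BLK4K)   /* first bitmap super at 8k  */
    #define BM_SLOT_SIZE (2 * BLK4K)   /* super + bits, bits contd. */

    static unsigned long bm_super_offset(int slot)
    {
        return BM_AREA_OFF + (unsigned long)slot * BM_SLOT_SIZE;
    }

    int main(void)
    {
        for (int slot = 0; slot < 4; slot++)
            printf("bitmap slot %d: superblock at %luk\n",
                   slot, bm_super_offset(slot) / 1024);
        return 0;
    }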
0022
0023 During "normal" functioning we assume the filesystem ensures that only
0024 one node writes to any given block at a time, so a write request will
0025
0026 - set the appropriate bit (if not already set)
0027 - commit the write to all mirrors
0028 - schedule the bit to be cleared after a timeout.
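
A minimal sketch of this sequence in userspace C follows; the three
helpers are illustrative stand-ins, not real md functions::

    /* Models the three-step clustered write path described above. */
    #include <stdbool.h>
    #include <stdio.h>

    static bool bit_set[1024];          /* toy write-intent bitmap */

    static void set_bitmap_bit(int chunk)
    {
        if (!bit_set[chunk]) {          /* only if not already set */
            bit_set[chunk] = true;
            printf("bitmap: bit %d set and flushed\n", chunk);
        }
    }

    static void write_to_mirrors(int chunk)
    {
        printf("chunk %d committed to all mirrors\n", chunk);
    }

    static void schedule_bit_clear(int chunk)
    {
        /* md clears the bit after a timeout, so back-to-back writes
         * to the same chunk don't bounce the bit on disk. */
        printf("bit %d scheduled for clearing\n", chunk);
    }

    int main(void)
    {
        int chunk = 42;

        set_bitmap_bit(chunk);
        write_to_mirrors(chunk);
        schedule_bit_clear(chunk);
        return 0;
    }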
0029
0030 Reads are just handled normally. It is up to the filesystem to ensure
0031 one node doesn't read from a location where another node (or the same
0032 node) is writing.
0033
0034
0035 2. DLM Locks for management
0036 ===========================
0037
0038 There are three groups of locks for managing the device:
0039
0040 2.1 Bitmap lock resource (bm_lockres)
0041 -------------------------------------
0042
0043 The bm_lockres protects individual node bitmaps. They are named in
0044 the form bitmap000 for node 1, bitmap001 for node 2 and so on. When a
node joins the cluster, it acquires the lock in PW mode and holds
it for as long as the node remains part of the cluster. The lock
0047 resource number is based on the slot number returned by the DLM
0048 subsystem. Since DLM starts node count from one and bitmap slots
0049 start from zero, one is subtracted from the DLM slot number to arrive
0050 at the bitmap slot number.
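
For example, a DLM slot number can be mapped to the corresponding lock
resource name as follows (a sketch; only the "bitmap%03d" naming
convention is taken from the description above)::

    /* Sketch: derive the bitmap lock resource name from the DLM slot.
     * DLM slots count from one, bitmap slots from zero, hence "- 1". */
    #include <stdio.h>

    static void bitmap_lockres_name(int dlm_slot, char *name, size_t len)
    {
        snprintf(name, len, "bitmap%03d", dlm_slot - 1);
    }

    int main(void)
    {
        char name[16];

        for (int dlm_slot = 1; dlm_slot <= 3; dlm_slot++) {
            bitmap_lockres_name(dlm_slot, name, sizeof(name));
            printf("DLM slot %d -> %s\n", dlm_slot, name);
        }
        return 0;
    }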
0051
0052 The LVB of the bitmap lock for a particular node records the range
0053 of sectors that are being re-synced by that node. No other
node may write to those sectors. This is used when a new node
joins the cluster.
0056
0057 2.2 Message passing locks
0058 -------------------------
0059
0060 Each node has to communicate with other nodes when starting or ending
0061 resync, and for metadata superblock updates. This communication is
managed through three locks: "token", "message", and "ack", together
with the Lock Value Block (LVB) of the "message" lock.
0064
2.3 New-device management
0066 -------------------------
0067
A single lock, "no-new-dev", is used to co-ordinate the addition of
new devices - this must be synchronized across the array.
Normally all nodes hold the lock in concurrent-read (CR) mode.
0071
0072 3. Communication
0073 ================
0074
0075 Messages can be broadcast to all nodes, and the sender waits for all
0076 other nodes to acknowledge the message before proceeding. Only one
0077 message can be processed at a time.
0078
0079 3.1 Message Types
0080 -----------------
0081
0082 There are six types of messages which are passed:
0083
0084 3.1.1 METADATA_UPDATED
0085 ^^^^^^^^^^^^^^^^^^^^^^
0086
0087 informs other nodes that the metadata has
been updated, and that each node must re-read the md superblock. This is
0089 performed synchronously. It is primarily used to signal device
0090 failure.
0091
0092 3.1.2 RESYNCING
0093 ^^^^^^^^^^^^^^^
0094 informs other nodes that a resync is initiated or
0095 ended so that each node may suspend or resume the region. Each
0096 RESYNCING message identifies a range of the devices that the
0097 sending node is about to resync. This overrides any previous
notification from that node: only one range can be resynced at a
time per node.
0100
0101 3.1.3 NEWDISK
0102 ^^^^^^^^^^^^^
0103
0104 informs other nodes that a device is being added to
0105 the array. Message contains an identifier for that device. See
0106 below for further details.
0107
0108 3.1.4 REMOVE
0109 ^^^^^^^^^^^^
0110
0111 A failed or spare device is being removed from the
0112 array. The slot-number of the device is included in the message.
0113
3.1.5 RE_ADD
^^^^^^^^^^^^

A failed device is being re-activated - the assumption
is that it has been determined to be working again.
0118
3.1.6 BITMAP_NEEDS_SYNC
^^^^^^^^^^^^^^^^^^^^^^^

If a node is stopped locally but the bitmap
isn't clean, then another node is informed to take ownership of the
resync.
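
Taken together, the message types could be represented as a simple
enumeration. This is a sketch following the list above; the actual
encoding used by drivers/md/md-cluster.c may differ::

    /* The six message types described above; the numeric values are
     * an assumption for illustration, not the on-wire encoding. */
    enum msg_type {
        METADATA_UPDATED = 0,
        RESYNCING,
        NEWDISK,
        REMOVE,
        RE_ADD,
        BITMAP_NEEDS_SYNC,
    };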
0124
0125 3.2 Communication mechanism
0126 ---------------------------
0127
The DLM LVB is used to communicate between the nodes of the cluster.
There are three resources used for the purpose:
0130
0131 3.2.1 token
0132 ^^^^^^^^^^^
The resource which protects the entire communication
system. The node holding the token resource is allowed to
communicate.
0136
0137 3.2.2 message
0138 ^^^^^^^^^^^^^
0139 The lock resource which carries the data to communicate.
0140
0141 3.2.3 ack
0142 ^^^^^^^^^
0143
The resource whose acquisition means the message has been
acknowledged by all nodes in the cluster. The BAST of the resource
is used to inform the receiving nodes that a node wants to
communicate.
0148
0149 The algorithm is:
0150
1. receive status - all nodes have a concurrent-read lock on "ack"::
0152
0153 sender receiver receiver
0154 "ack":CR "ack":CR "ack":CR
0155
2. sender gets EX on "token",
   sender gets EX on "message"::
0158
0159 sender receiver receiver
0160 "token":EX "ack":CR "ack":CR
0161 "message":EX
0162 "ack":CR
0163
0164 Sender checks that it still needs to send a message. Messages
0165 received or other events that happened while waiting for the
0166 "token" may have made this message inappropriate or redundant.
0167
0168 3. sender writes LVB
0169
   sender down-converts "message" from EX to CW
0171
   sender tries to get EX on "ack"
0173
0174 ::
0175
0176 [ wait until all receivers have *processed* the "message" ]
0177
0178 [ triggered by bast of "ack" ]
0179 receiver get CR on "message"
0180 receiver read LVB
0181 receiver processes the message
0182 [ wait finish ]
0183 receiver releases "ack"
0184 receiver tries to get PR on "message"
0185
0186 sender receiver receiver
0187 "token":EX "message":CR "message":CR
0188 "message":CW
0189 "ack":EX
0190
0191 4. triggered by grant of EX on "ack" (indicating all receivers
0192 have processed message)
0193
0194 sender down-converts "ack" from EX to CR
0195
0196 sender releases "message"
0197
0198 sender releases "token"
0199
0200 ::
0201
0202 receiver upconvert to PR on "message"
0203 receiver get CR of "ack"
0204 receiver release "message"
0205
0206 sender receiver receiver
0207 "ack":CR "ack":CR "ack":CR
0208
0209
0210 4. Handling Failures
0211 ====================
0212
0213 4.1 Node Failure
0214 ----------------
0215
When a node fails, the DLM informs the cluster with the slot
number of the failed node. A surviving node then starts a cluster
recovery thread. The cluster recovery thread:
0219
0220 - acquires the bitmap<number> lock of the failed node
0221 - opens the bitmap
0222 - reads the bitmap of the failed node
0223 - copies the set bitmap to local node
0224 - cleans the bitmap of the failed node
0225 - releases bitmap<number> lock of the failed node
0226 - initiates resync of the bitmap on the current node
  md_check_recovery() is invoked within recover_bitmaps();
  md_check_recovery() then calls metadata_update_start()/finish(),
  which locks the communication channel via lock_comm().
  This means that while one node is resyncing, all other nodes are
  blocked from writing anywhere on the array.
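
The bitmap-copy step amounts to OR-ing the failed node's set bits into
the local bitmap and then clearing the failed node's bitmap. A toy
model, with locking and on-disk I/O elided::

    /* Toy model of the recovery thread's bitmap handling. */
    #include <stdint.h>
    #include <stdio.h>

    #define BM_WORDS 4

    static void absorb_failed_bitmap(uint64_t *local, uint64_t *failed)
    {
        for (int i = 0; i < BM_WORDS; i++) {
            local[i] |= failed[i];      /* copy the set bits       */
            failed[i] = 0;              /* clean the failed bitmap */
        }
    }

    int main(void)
    {
        uint64_t local[BM_WORDS]  = { 0x1, 0x0, 0x0, 0x0 };
        uint64_t failed[BM_WORDS] = { 0x2, 0xff, 0x0, 0x0 };

        absorb_failed_bitmap(local, failed);
        printf("local[0]=%#llx local[1]=%#llx\n",
               (unsigned long long)local[0],
               (unsigned long long)local[1]);
        return 0;
    }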
0232
The resync process is the regular md resync. However, in a clustered
environment, when a resync is performed, the resyncing node needs to
tell the other nodes which areas are suspended. Before a resync
starts, the node sends out a RESYNCING message with the (lo,hi) range
of the area which needs to be suspended. Each node maintains a
suspend_list, which contains the list of ranges which are currently
suspended. On receiving RESYNCING, a node adds the range to its
suspend_list. Similarly, when the node performing the resync finishes,
it sends a RESYNCING message with an empty range to the other nodes,
and they remove the corresponding entry from the suspend_list.
0243
A helper function, ->area_resyncing(), can be used to check whether a
particular I/O range should be suspended.
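
A minimal model of the suspend_list and an ->area_resyncing()-style
check is sketched below; the real kernel code also tracks which node
announced each range and distinguishes READ from WRITE, which is
omitted here::

    /* Sketch: suspend_list entries record the (lo, hi) range a node
     * announced in a RESYNCING message. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct suspend_info {
        int      slot;              /* node announcing the resync */
        uint64_t lo, hi;            /* suspended sector range     */
    };

    static struct suspend_info suspend_list[8];
    static int nr_suspended;

    static bool area_resyncing(uint64_t lo, uint64_t hi)
    {
        for (int i = 0; i < nr_suspended; i++)
            if (hi > suspend_list[i].lo && lo < suspend_list[i].hi)
                return true;        /* ranges overlap: suspend I/O */
        return false;
    }

    int main(void)
    {
        /* RESYNCING received: node in slot 1 suspends [1000, 2000) */
        suspend_list[nr_suspended++] =
            (struct suspend_info){ .slot = 1, .lo = 1000, .hi = 2000 };

        printf("write [1500,1600) suspended? %d\n",
               area_resyncing(1500, 1600));
        printf("write [3000,3100) suspended? %d\n",
               area_resyncing(3000, 3100));
        return 0;
    }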
0246
4.2 Device Failure
------------------
0249
0250 Device failures are handled and communicated with the metadata update
0251 routine. When a node detects a device failure it does not allow
0252 any further writes to that device until the failure has been
0253 acknowledged by all other nodes.
0254
0255 5. Adding a new Device
0256 ----------------------
0257
0258 For adding a new device, it is necessary that all nodes "see" the new
0259 device to be added. For this, the following algorithm is used:
0260
0261 1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
0262 ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD)
0263 2. Node 1 sends a NEWDISK message with uuid and slot number
0264 3. Other nodes issue kobject_uevent_env with uuid and slot number
0265 (Steps 4,5 could be a udev rule)
0266 4. In userspace, the node searches for the disk, perhaps
0267 using blkid -t SUB_UUID=""
0268 5. Other nodes issue either of the following depending on whether
0269 the disk was found:
0270 ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
0271 disc.number set to slot number)
0272 ioctl(CLUSTERED_DISK_NACK)
6. Other nodes drop lock on "no-new-dev" (CR) if device is found
0274 7. Node 1 attempts EX lock on "no-new-dev"
0275 8. If node 1 gets the lock, it sends METADATA_UPDATED after
0276 unmarking the disk as SpareLocal
9. If node 1 cannot get the "no-new-dev" lock, it fails the operation
   and sends METADATA_UPDATED.
10. Other nodes learn whether the disk was added or not from the
    following METADATA_UPDATED message.
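
A sketch of step 5 as seen on a receiving node is shown below. It uses
the ADD_NEW_DISK and CLUSTERED_DISK_NACK ioctls named in the steps
above; error handling is minimal and the details should be checked
against <linux/raid/md_u.h> and <linux/raid/md_p.h>::

    /* Acknowledge a NEWDISK request with a candidate device, or NACK
     * if the device was not found on this node. */
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/raid/md_u.h>
    #include <linux/raid/md_p.h>

    static int ack_new_disk(int md_fd, int found, int slot,
                            int dev_major, int dev_minor)
    {
        mdu_disk_info_t info;

        if (!found)
            return ioctl(md_fd, CLUSTERED_DISK_NACK);

        memset(&info, 0, sizeof(info));
        info.number = slot;                   /* slot from NEWDISK   */
        info.major  = dev_major;
        info.minor  = dev_minor;
        info.state  = 1 << MD_DISK_CANDIDATE; /* clustered candidate */
        return ioctl(md_fd, ADD_NEW_DISK, &info);
    }

    int main(void)
    {
        /* Demonstration only: fd -1 makes the ioctl fail harmlessly. */
        return ack_new_disk(-1, 0, 0, 0, 0) < 0 ? 0 : 1;
    }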
0281
0282 6. Module interface
0283 ===================
0284
0285 There are 17 call-backs which the md core can make to the cluster
0286 module. Understanding these can give a good overview of the whole
0287 process.
0288
0289 6.1 join(nodes) and leave()
0290 ---------------------------
0291
0292 These are called when an array is started with a clustered bitmap,
0293 and when the array is stopped. join() ensures the cluster is
0294 available and initializes the various resources.
0295 Only the first 'nodes' nodes in the cluster can use the array.
0296
0297 6.2 slot_number()
0298 -----------------
0299
0300 Reports the slot number advised by the cluster infrastructure.
0301 Range is from 0 to nodes-1.
0302
0303 6.3 resync_info_update()
0304 ------------------------
0305
0306 This updates the resync range that is stored in the bitmap lock.
0307 The starting point is updated as the resync progresses. The
0308 end point is always the end of the array.
0309 It does *not* send a RESYNCING message.
0310
0311 6.4 resync_start(), resync_finish()
0312 -----------------------------------
0313
0314 These are called when resync/recovery/reshape starts or stops.
0315 They update the resyncing range in the bitmap lock and also
0316 send a RESYNCING message. resync_start reports the whole
0317 array as resyncing, resync_finish reports none of it.
0318
0319 resync_finish() also sends a BITMAP_NEEDS_SYNC message which
0320 allows some other node to take over.
0321
0322 6.5 metadata_update_start(), metadata_update_finish(), metadata_update_cancel()
0323 -------------------------------------------------------------------------------
0324
metadata_update_start is used to get exclusive access to
the metadata. If a change is still needed once that access is
gained, metadata_update_finish() will send a METADATA_UPDATED
message to all other nodes, otherwise metadata_update_cancel()
can be used to release the lock.
0330
0331 6.6 area_resyncing()
0332 --------------------
0333
0334 This combines two elements of functionality.
0335
0336 Firstly, it will check if any node is currently resyncing
0337 anything in a given range of sectors. If any resync is found,
0338 then the caller will avoid writing or read-balancing in that
0339 range.
0340
0341 Secondly, while node recovery is happening it reports that
0342 all areas are resyncing for READ requests. This avoids races
0343 between the cluster-filesystem and the cluster-RAID handling
0344 a node failure.
0345
0346 6.7 add_new_disk_start(), add_new_disk_finish(), new_disk_ack()
0347 ---------------------------------------------------------------
0348
0349 These are used to manage the new-disk protocol described above.
When a new device is added, add_new_disk_start() is called before
it is bound to the array and, if that succeeds, add_new_disk_finish()
is called once the device is fully added.
0353
When a device is added in acknowledgement of a previous
request, or when the device is declared "unavailable",
new_disk_ack() is called.
0357
0358 6.8 remove_disk()
0359 -----------------
0360
This is called when a spare or failed device is removed from
the array. It causes a REMOVE message to be sent to the other nodes.
0363
0364 6.9 gather_bitmaps()
0365 --------------------
0366
This sends a RE_ADD message to all other nodes and then
gathers bitmap information from all bitmaps. The combined
bitmap is then used to recover the re-added device.
0370
0371 6.10 lock_all_bitmaps() and unlock_all_bitmaps()
0372 ------------------------------------------------
0373
These are called when the bitmap is changed to none. If a node plans
to clear the cluster RAID's bitmap, it needs to make sure no other
node is using the array. This is achieved by locking all the bitmap
locks within the cluster; the locks are unlocked accordingly
afterwards.
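
For example, clearing a clustered bitmap could be guarded roughly as
below; try_lock_bitmap() and unlock_bitmap() are hypothetical
stand-ins for the DLM operations::

    /* Sketch of lock_all_bitmaps(): take every per-node bitmap lock,
     * or back out if another node still holds one (meaning the array
     * is still in use there). */
    #include <stdbool.h>

    #define MAX_NODES 4

    static bool try_lock_bitmap(int slot) { (void)slot; return true; }
    static void unlock_bitmap(int slot)   { (void)slot; }

    static bool lock_all_bitmaps(void)
    {
        for (int slot = 0; slot < MAX_NODES; slot++) {
            if (!try_lock_bitmap(slot)) {
                while (--slot >= 0)     /* another node uses the raid */
                    unlock_bitmap(slot);
                return false;
            }
        }
        return true;                    /* safe to clear the bitmap */
    }

    int main(void)
    {
        return lock_all_bitmaps() ? 0 : 1;
    }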
0379
0380 7. Unsupported features
0381 =======================
0382
There are some things that cluster MD does not support yet.

- changing array_sectors.