Documentation/filesystems/xfs-self-describing-metadata.rst

0001 .. SPDX-License-Identifier: GPL-2.0
0002
0003 ============================
0004 XFS Self Describing Metadata
0005 ============================
0006
0007 Introduction
0008 ============
0009
0010 The largest scalability problem facing XFS is not one of algorithmic
0011 scalability, but of verification of the filesystem structure. Scalability of the
0012 structures and indexes on disk and the algorithms for iterating them are
0013 adequate for supporting PB scale filesystems with billions of inodes, however it
0014 is this very scalability that causes the verification problem.
0015
0016 Almost all metadata on XFS is dynamically allocated. The only fixed location
0017 metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all
0018 other metadata structures need to be discovered by walking the filesystem
0019 structure in different ways. While this is already done by userspace tools for
0020 validating and repairing the structure, there are limits to what they can
0021 verify, and this in turn limits the supportable size of an XFS filesystem.
0022
0023 For example, it is entirely possible to manually use xfs_db and a bit of
0024 scripting to analyse the structure of a 100TB filesystem when trying to
0025 determine the root cause of a corruption problem, but it is still mainly a
0026 manual task of verifying that things like single bit errors or misplaced writes
0027 weren't the ultimate cause of a corruption event. It may take a few hours to a
0028 few days to perform such forensic analysis, so at this scale root cause
0029 analysis is entirely possible.
0030
0031 However, if we scale the filesystem up to 1PB, we now have 10x as much metadata
0032 to analyse and so that analysis blows out towards weeks/months of forensic work.
0033 Most of the analysis work is slow and tedious, so as the amount of analysis goes
0034 up, it becomes more likely that the cause will be lost in the noise.  Hence the primary
0035 concern for supporting PB scale filesystems is minimising the time and effort
0036 required for basic forensic analysis of the filesystem structure.
0037
0038
0039 Self Describing Metadata
0040 ========================
0041
0042 One of the problems with the current metadata format is that apart from the
0043 magic number in the metadata block, we have no other way of identifying what it
0044 is supposed to be. We can't even identify if it is in the right place. Put simply,
0045 you can't look at a single metadata block in isolation and say "yes, it is
0046 supposed to be there and the contents are valid".
0047
0048 Hence most of the time spent on forensic analysis is spent doing basic
0049 verification of metadata values, looking for values that are in range (and hence
0050 not detected by automated verification checks) but are not correct. Finding and
0051 understanding how things like cross linked block lists (e.g. sibling
0052 pointers in a btree ending up with loops in them) came about is the key to
0053 understanding what went wrong, but it is impossible to tell what order the
0054 blocks were linked into each other or written to disk after the fact.
0055
0056 Hence we need to record more information into the metadata to allow us to
0057 quickly determine if the metadata is intact and can be ignored for the purpose
0058 of analysis. We can't protect against every possible type of error, but we can
0059 ensure that common types of errors are easily detectable.  Hence the concept of
0060 self describing metadata.
0061
0062 The first, fundamental requirement of self describing metadata is that the
0063 metadata object contains some form of unique identifier in a well known
0064 location. This allows us to identify the expected contents of the block and
0065 hence parse and verify the metadata object. If we can't independently identify
0066 the type of metadata in the object, then the metadata doesn't describe itself
0067 very well at all!
0068
0069 Luckily, almost all XFS metadata has magic numbers embedded already - only the
0070 AGFL, remote symlinks and remote attribute blocks do not contain identifying
0071 magic numbers. Hence we can change the on-disk format of all these objects to
0072 add more identifying information and detect this simply by changing the magic
0073 numbers in the metadata objects. That is, if it has the current magic number,
0074 the metadata isn't self identifying. If it contains a new magic number, it is
0075 self identifying and we can do much more expansive automated verification of the
0076 metadata object at runtime, during forensic analysis or repair.
0077
0078 As a primary concern, self describing metadata needs some form of overall
0079 integrity checking. We cannot trust the metadata if we cannot verify that it has
0080 not been changed as a result of external influences. Hence we need some form of
0081 integrity check, and this is done by adding CRC32c validation to the metadata
0082 block. If we can verify the block contains the metadata it was intended to
0083 contain, a large amount of the manual verification work can be skipped.
0084
0085 CRC32c was selected as metadata cannot be more than 64k in length in XFS and
0086 hence a 32 bit CRC is more than sufficient to detect multi-bit errors in
0087 metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it is
0088 fast. So while CRC32c is not the strongest of possible integrity checks that
0089 could be used, it is more than sufficient for our needs and has relatively
0090 little overhead. Adding support for larger integrity fields and/or algorithms
0091 does not really provide any extra value over CRC32c, but it does add a lot of
0092 complexity and so there is no provision for changing the integrity checking
0093 mechanism.
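
As a minimal illustration of the integrity check described above, the following
sketch computes CRC32c (Castagnoli polynomial, reflected form 0x82F63B78) one bit
at a time. It is not the kernel implementation, which uses table-driven or
hardware-accelerated code; the names ``demo_crc32c`` and
``demo_detects_bit_flip`` are hypothetical and exist only for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Bitwise CRC32c (Castagnoli), reflected polynomial 0x82F63B78.
 * Illustrative only: slow, but small enough to read in one go.
 */
static uint32_t demo_crc32c(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint32_t crc = 0xFFFFFFFFu;

	while (len--) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return ~crc;
}

/*
 * Hypothetical helper: fill a 4k "metadata block" with a pattern, flip a
 * single bit, and report whether the checksum changed (1) or not (0).
 */
static int demo_detects_bit_flip(void)
{
	uint8_t block[4096];
	uint32_t before, after;

	memset(block, 0xA5, sizeof(block));
	before = demo_crc32c(block, sizeof(block));
	block[1234] ^= 0x10;		/* simulate a single bit error */
	after = demo_crc32c(block, sizeof(block));

	return before != after;
}
```

The standard CRC32c check value for the ASCII string "123456789" is 0xE3069283,
which is a convenient way to confirm that an implementation such as this one
matches the algorithm.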
0094
0095 Self describing metadata needs to contain enough information so that the
0096 metadata block can be verified as being in the correct place without needing to
0097 look at any other metadata. This means it needs to contain location information.
0098 Just adding a block number to the metadata is not sufficient to protect against
0099 mis-directed writes - a write might be misdirected to the wrong LUN and so be
0100 written to the "correct block" of the wrong filesystem. Hence location
0101 information must contain a filesystem identifier as well as a block number.
0102
0103 Another key information point in forensic analysis is knowing who the metadata
0104 block belongs to. We already know the type, the location, that it is valid
0105 and/or corrupted, and how long ago it was last modified. Knowing the owner
0106 of the block is important as it allows us to find other related metadata to
0107 determine the scope of the corruption. For example, if we have an extent btree
0108 object, we don't know what inode it belongs to and hence have to walk the entire
0109 filesystem to find the owner of the block. Worse, the corruption could mean that
0110 no owner can be found (i.e. it's an orphan block), and so without an owner field
0111 in the metadata we have no idea of the scope of the corruption. If we have an
0112 owner field in the metadata object, we can immediately do top down validation to
0113 determine the scope of the problem.
0114
0115 Different types of metadata have different owner identifiers. For example,
0116 directory, attribute and extent tree blocks are all owned by an inode, while
0117 freespace btree blocks are owned by an allocation group. Hence the size and
0118 contents of the owner field are determined by the type of metadata object we are
0119 looking at.  The owner information can also identify misplaced writes (e.g.
0120 freespace btree block written to the wrong AG).
0121
0122 Self describing metadata also needs to contain some indication of when it was
0123 written to the filesystem. One of the key information points when doing forensic
0124 analysis is how recently the block was modified. Correlation of a set of corrupted
0125 metadata blocks based on modification times is important as it can indicate
0126 whether the corruptions are related, whether there have been multiple corruption
0127 events that led to the eventual failure, and even whether there are corruptions
0128 present that the run-time verification is not detecting.
0129
0130 For example, we can determine whether a metadata object is supposed to be free
0131 space or still allocated if it is still referenced by its owner by looking at
0132 when the free space btree block that contains the block was last written
0133 compared to when the metadata object itself was last written.  If the free space
0134 block is more recent than the object and the object's owner, then there is a
0135 very good chance that the block should have been removed from the owner.
0136
0137 To provide this "written timestamp", each metadata block gets the Log Sequence
0138 Number (LSN) of the most recent transaction that modified it written into it.
0139 This number will always increase over the life of the filesystem, and the only
0140 thing that resets it is running xfs_repair on the filesystem. Further, by use of
0141 the LSN we can tell if the corrupted metadata all belonged to the same log
0142 checkpoint and hence have some idea of how much modification occurred between
0143 the first and last instance of corrupt metadata on disk and, further, how much
0144 modification occurred between the corruption being written and when it was
0145 detected.
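
The comparison of "written timestamps" described above might be sketched as
follows. This is an illustration only: it assumes a 64 bit LSN that packs a log
cycle number into the upper 32 bits and a block offset within the log into the
lower 32 bits, and every ``demo_`` name is hypothetical rather than a kernel
interface.

```c
#include <stdint.h>

/* Assumed layout: log cycle in the upper 32 bits, log block in the lower 32. */
typedef int64_t demo_lsn_t;

static inline uint32_t demo_lsn_cycle(demo_lsn_t lsn)
{
	return (uint32_t)((uint64_t)lsn >> 32);
}

static inline uint32_t demo_lsn_block(demo_lsn_t lsn)
{
	return (uint32_t)lsn;
}

static inline demo_lsn_t demo_make_lsn(uint32_t cycle, uint32_t block)
{
	return (demo_lsn_t)(((uint64_t)cycle << 32) | block);
}

/* Returns <0 if a was written before b, 0 if equal, >0 if a is more recent. */
static inline int demo_lsn_cmp(demo_lsn_t a, demo_lsn_t b)
{
	if (demo_lsn_cycle(a) != demo_lsn_cycle(b))
		return demo_lsn_cycle(a) < demo_lsn_cycle(b) ? -1 : 1;
	if (demo_lsn_block(a) != demo_lsn_block(b))
		return demo_lsn_block(a) < demo_lsn_block(b) ? -1 : 1;
	return 0;
}

/*
 * Heuristic from the text: if the free space btree block covering an object
 * is more recent than both the object and the object's owner, the object
 * probably should have been removed from its owner.
 */
static inline int demo_object_probably_freed(demo_lsn_t freesp_lsn,
					     demo_lsn_t object_lsn,
					     demo_lsn_t owner_lsn)
{
	return demo_lsn_cmp(freesp_lsn, object_lsn) > 0 &&
	       demo_lsn_cmp(freesp_lsn, owner_lsn) > 0;
}
```

The same comparison can be used to sort a set of corrupted blocks by LSN, which
shows at a glance whether they fall into one log checkpoint or several.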
0146
0147 Runtime Validation
0148 ==================
0149
0150 Validation of self-describing metadata takes place at runtime in two places:
0151
0152         - immediately after a successful read from disk
0153         - immediately prior to write IO submission
0154
0155 The verification is completely stateless - it is done independently of the
0156 modification process, and seeks only to check that the metadata is what it says
0157 it is and that the metadata fields are within bounds and internally consistent.
0158 As such, we cannot catch all types of corruption that can occur within a block
0159 as there may be certain limitations that operational state enforces on the
0160 metadata, or there may be corruption of interblock relationships (e.g. corrupted
0161 sibling pointer lists). Hence we still need stateful checking in the main code
0162 body, but in general most of the per-field validation is handled by the
0163 verifiers.
0164
0165 For read verification, the caller needs to specify the expected type of metadata
0166 that it should see, and the IO completion process verifies that the metadata
0167 object matches what was expected. If the verification process fails, then it
0168 marks the object being read as EFSCORRUPTED. The caller needs to catch this
0169 error (same as for IO errors), and if it needs to take special action due to a
0170 verification error it can do so by catching the EFSCORRUPTED error value. If we
0171 need more discrimination of error type at higher levels, we can define new
0172 error numbers for different errors as necessary.
0173
0174 The first step in read verification is checking the magic number and determining
0175 whether CRC validation is necessary. If it is, the CRC32c is calculated and
0176 compared against the value stored in the object itself. Once this is validated,
0177 further checks are made against the location information, followed by extensive
0178 object specific metadata validation. If any of these checks fail, then the
0179 buffer is considered corrupt and the EFSCORRUPTED error is set appropriately.
0180
0181 Write verification is the opposite of the read verification - first the object
0182 is extensively verified and if it is OK we then update the LSN from the last
0183 modification made to the object. After this, we calculate the CRC and insert it
0184 into the object. Once this is done the write IO is allowed to continue. If any
0185 error occurs during this process, the buffer is again marked with an EFSCORRUPTED
0186 error for the higher layers to catch.
0187
0188 Structures
0189 ==========
0190
0191 A typical on-disk structure needs to contain the following information::
0192
0193     struct xfs_ondisk_hdr {
0194             __be32  magic;              /* magic number */
0195             __be32  crc;                /* CRC, not logged */
0196             uuid_t  uuid;               /* filesystem identifier */
0197             __be64  owner;              /* parent object */
0198             __be64  blkno;              /* location on disk */
0199             __be64  lsn;                /* last modification in log, not logged */
0200     };
0201
0202 Depending on the metadata, this information may be part of a header structure
0203 separate to the metadata contents, or may be distributed through an existing
0204 structure. The latter occurs with metadata that already contains some of this
0205 information, such as the superblock and AG headers.
0206
0207 Other metadata may have different formats for the information, but the same
0208 level of information is generally provided. For example:
0209
0210         - short btree blocks have a 32 bit owner (ag number) and a 32 bit block
0211           number for location. The two of these combined provide the same
0212           information as @owner and @blkno in the above structure, but using 8
0213           bytes less space on disk.
0214
0215         - directory/attribute node blocks have a 16 bit magic number, and the
0216           header that contains the magic number has other information in it as
0217           well. Hence the additional metadata headers change the overall format
0218           of the metadata.
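
The space saving mentioned for short btree blocks can be checked with a small
sketch. Both structures below are hypothetical: they use plain fixed-width
integers in place of ``__be32``, ``__be64`` and ``uuid_t``, and they follow only
the description given here, not the real on-disk btree block header.

```c
#include <stdint.h>

/* Generic layout, mirroring struct xfs_ondisk_hdr with stand-in types. */
struct demo_long_hdr {
	uint32_t magic;		/* magic number */
	uint32_t crc;		/* CRC32c of the block */
	uint8_t  uuid[16];	/* filesystem identifier */
	uint64_t owner;		/* parent object, e.g. an inode number */
	uint64_t blkno;		/* location on disk */
	uint64_t lsn;		/* last modification in log */
};

/* Short form: AG number as owner and a 32 bit block number for location. */
struct demo_short_hdr {
	uint32_t magic;		/* magic number */
	uint32_t crc;		/* CRC32c of the block */
	uint8_t  uuid[16];	/* filesystem identifier */
	uint32_t owner;		/* allocation group number */
	uint32_t blkno;		/* 32 bit block number */
	uint64_t lsn;		/* last modification in log */
};

/* Compile-time checks of the sizes claimed in the text (C11). */
_Static_assert(sizeof(struct demo_long_hdr) == 48,
	       "generic header is expected to be 48 bytes");
_Static_assert(sizeof(struct demo_short_hdr) == 40,
	       "short header is expected to be 8 bytes smaller");
```

Ordering the fields so that each 64 bit member starts on an 8 byte boundary
avoids compiler padding, which matters for any structure that is written to
disk verbatim.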
0219
0220 A typical buffer read verifier is structured as follows::
0221
0222     #define XFS_FOO_CRC_OFF             offsetof(struct xfs_ondisk_hdr, crc)
0223
0224     static void
0225     xfs_foo_read_verify(
0226             struct xfs_buf      *bp)
0227     {
0228             struct xfs_mount    *mp = bp->b_mount;
0229
0230             if ((xfs_sb_version_hascrc(&mp->m_sb) &&
0231                 !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length),
0232                                             XFS_FOO_CRC_OFF)) ||
0233                 !xfs_foo_verify(bp)) {
0234                     XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
0235                     xfs_buf_ioerror(bp, EFSCORRUPTED);
0236             }
0237     }
0238
0239 The code ensures that the CRC is only checked if the filesystem has CRCs enabled
0240 by checking the superblock for the feature bit, and then if the CRC verifies OK
0241 (or is not needed) it verifies the actual contents of the block.
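
Because the CRC is stored inside the block it protects, the checksum helpers
need the offset of the CRC field. One way such helpers can work, sketched here
as an assumption and not as the kernel code, is to checksum the buffer as though
the CRC field contained zeroes. All ``demo_`` names are hypothetical, and the
CRC is stored in host byte order purely to keep the example short.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Feed more bytes into a running (uninverted) bitwise CRC32c state. */
static uint32_t demo_crc32c_update(uint32_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;

	while (len--) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* CRC32c of buf with the four bytes at crc_off treated as zero. */
static uint32_t demo_calc_cksum(const uint8_t *buf, size_t len, size_t crc_off)
{
	static const uint8_t zero[sizeof(uint32_t)];
	uint32_t crc = 0xFFFFFFFFu;

	crc = demo_crc32c_update(crc, buf, crc_off);
	crc = demo_crc32c_update(crc, zero, sizeof(zero));
	crc = demo_crc32c_update(crc, buf + crc_off + sizeof(zero),
				 len - crc_off - sizeof(zero));
	return ~crc;
}

/* Write path: compute the checksum and store it in the block. */
static void demo_update_cksum(uint8_t *buf, size_t len, size_t crc_off)
{
	uint32_t crc = demo_calc_cksum(buf, len, crc_off);

	memcpy(buf + crc_off, &crc, sizeof(crc));
}

/* Read path: returns 1 if the stored checksum matches the contents. */
static int demo_verify_cksum(const uint8_t *buf, size_t len, size_t crc_off)
{
	uint32_t stored;

	memcpy(&stored, buf + crc_off, sizeof(stored));
	return stored == demo_calc_cksum(buf, len, crc_off);
}
```

Treating the CRC field as zero means the block never has to be modified or
copied in order to verify it, so verification can run on a read-only buffer.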
0242
0243 The verifier function will take a couple of different forms, depending on
0244 whether the magic number can be used to determine the format of the block. In
0245 the case it can't, the code is structured as follows::
0246
0247     static bool
0248     xfs_foo_verify(
0249             struct xfs_buf              *bp)
0250     {
0251             struct xfs_mount    *mp = bp->b_mount;
0252             struct xfs_ondisk_hdr       *hdr = bp->b_addr;
0253
0254             if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
0255                     return false;
0256
0257             if (xfs_sb_version_hascrc(&mp->m_sb)) {
0258                     if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
0259                             return false;
0260                     if (bp->b_bn != be64_to_cpu(hdr->blkno))
0261                             return false;
0262                     if (hdr->owner == 0)
0263                             return false;
0264             }
0265
0266             /* object specific verification checks here */
0267
0268             return true;
0269     }
0270
0271 If there are different magic numbers for the different formats, the verifier
0272 will look like::
0273
0274     static bool
0275     xfs_foo_verify(
0276             struct xfs_buf              *bp)
0277     {
0278             struct xfs_mount    *mp = bp->b_mount;
0279             struct xfs_ondisk_hdr       *hdr = bp->b_addr;
0280
0281             if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) {
0282                     if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
0283                             return false;
0284                     if (bp->b_bn != be64_to_cpu(hdr->blkno))
0285                             return false;
0286                     if (hdr->owner == 0)
0287                             return false;
0288             } else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
0289                     return false;
0290
0291             /* object specific verification checks here */
0292
0293             return true;
0294     }
0295
0296 Write verifiers are very similar to the read verifiers; they just do things in
0297 the opposite order. A typical write verifier::
0298
0299     static void
0300     xfs_foo_write_verify(
0301             struct xfs_buf      *bp)
0302     {
0303             struct xfs_mount    *mp = bp->b_mount;
0304             struct xfs_buf_log_item     *bip = bp->b_fspriv;
0305
0306             if (!xfs_foo_verify(bp)) {
0307                     XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
0308                     xfs_buf_ioerror(bp, EFSCORRUPTED);
0309                     return;
0310             }
0311
0312             if (!xfs_sb_version_hascrc(&mp->m_sb))
0313                     return;
0314
0315
0316             if (bip) {
0317                     struct xfs_ondisk_hdr       *hdr = bp->b_addr;
0318                     hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn);
0319             }
0320             xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF);
0321     }
0322
0323 This will verify the internal structure of the metadata before we go any
0324 further, detecting corruptions that have occurred as the metadata has been
0325 modified in memory. If the metadata verifies OK, and CRCs are enabled, we then
0326 update the LSN field (when it was last modified) and calculate the CRC on the
0327 metadata. Once this is done, we can issue the IO.
0328
0329 Inodes and Dquots
0330 =================
0331
0332 Inodes and dquots are special snowflakes. They have per-object CRC and
0333 self-identifiers, but they are packed so that there are multiple objects per
0334 buffer. Hence we do not use per-buffer verifiers to do the work of per-object
0335 verification and CRC calculations. The per-buffer verifiers simply perform basic
0336 identification of the buffer - that they contain inodes or dquots, and that
0337 there are magic numbers in all the expected spots. All further CRC and
0338 verification checks are done when each inode is read from or written back to the
0339 buffer.
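
The per-buffer identification step described above might look like the
following sketch. It assumes that every inode slot in the buffer starts with a
16 bit big-endian magic number (0x494e, "IN") and that the caller supplies the
inode size; the function and macro names are hypothetical, and no CRC or
per-field checking is attempted because that happens per object.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_INODE_MAGIC	0x494e	/* "IN" */

/* Read a 16 bit big-endian value from an unaligned byte pointer. */
static uint16_t demo_get_be16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Returns 1 if every inode-sized slot in the buffer starts with the inode
 * magic number, 0 otherwise (including when the geometry makes no sense).
 */
static int demo_inode_buf_verify(const uint8_t *buf, size_t buflen,
				 size_t inode_size)
{
	size_t off;

	if (inode_size < 2 || buflen == 0 || buflen % inode_size != 0)
		return 0;

	for (off = 0; off < buflen; off += inode_size) {
		if (demo_get_be16(buf + off) != DEMO_INODE_MAGIC)
			return 0;
	}
	return 1;
}
```

Keeping the buffer-level check this shallow is what allows the more expensive
per-inode CRC and field validation to be deferred until an individual inode is
actually read from or written back to the buffer.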
0340
0341 The structure of the verifiers and the identifier checks is very similar to the
0342 buffer code described above. The only difference is where they are called. For
0343 example, inode read verification is done in xfs_inode_from_disk() when the inode
0344 is first read out of the buffer and the struct xfs_inode is instantiated. The
0345 inode is already extensively verified during writeback in xfs_iflush_int(), so the
0346 only addition here is to add the LSN and CRC to the inode as it is copied back
0347 into the buffer.
0348
0349 XXX: inode unlinked list modification doesn't recalculate the inode CRC! None of
0350 the unlinked list modifications check or update CRCs, neither during unlink nor
0351 log recovery. So, it's gone unnoticed until now. This won't matter immediately -
0352 repair will probably complain about it - but it needs to be fixed.