.. SPDX-License-Identifier: GPL-2.0

Journal (jbd2)
--------------

Introduced in ext3, the ext4 filesystem employs a journal to protect the
filesystem against metadata inconsistencies in the case of a system crash. Up
to 10,240,000 file system blocks (see mke2fs(8) for more details on journal
size limits) can be reserved inside the filesystem as a place to land
“important” data writes on-disk as quickly as possible. Once the important
data transaction is fully written to the disk and flushed from the disk write
cache, a record of the data being committed is also written to the journal. At
some later point in time, the journal code writes the transactions to their
final locations on disk (this could involve a lot of seeking or a lot of small
read-write-erases) before erasing the commit record. Should the system
crash during the second slow write, the journal can be replayed all the
way to the latest commit record, guaranteeing the atomicity of whatever
gets written through the journal to the disk. The effect of this is to
guarantee that the filesystem does not become stuck midway through a
metadata update.

For performance reasons, ext4 by default only writes filesystem metadata
through the journal. This means that file data blocks are *not*
guaranteed to be in any consistent state after a crash. If this default
guarantee level (``data=ordered``) is not satisfactory, there is a mount
option to control journal behavior. If ``data=journal``, all data and
metadata are written to disk through the journal. This is slower but
safest. If ``data=writeback``, dirty data blocks are not flushed to the
disk before the metadata are written to disk through the journal.

In ``data=ordered`` mode, ext4 also supports fast commits, which
help reduce commit latency significantly. The default ``data=ordered``
mode works by logging metadata blocks to the journal. In fast commit
mode, ext4 only stores the minimal delta needed to recreate the
affected metadata in fast commit space that is shared with JBD2.
Once the fast commit area fills up, or if a fast commit is not possible,
or if the JBD2 commit timer goes off, ext4 performs a traditional full commit.
A full commit invalidates all the fast commits that happened before
it and thus empties the fast commit area for further fast
commits. This feature needs to be enabled at mkfs time.

The journal inode is typically inode 8. The first 68 bytes of the
journal inode are replicated in the ext4 superblock. The journal itself
is a normal (but hidden) file within the filesystem. The file usually
consumes an entire block group, though mke2fs tries to put it in the
middle of the disk.

All fields in jbd2 are written to disk in big-endian order. This is the
opposite of ext4.

NOTE: Both ext4 and ocfs2 use jbd2.

The maximum size of a journal embedded in an ext4 filesystem is 2^32
blocks. jbd2 itself does not seem to care.

Layout
~~~~~~

Generally speaking, the journal has this format:

.. list-table::
   :widths: 16 48 16
   :header-rows: 1

   * - Superblock
     - descriptor_block (data_blocks or revocation_block) [more data or
       revocations] commit_block
     - [more transactions...]
   * - 
     - One transaction
     -

Notice that a transaction begins with either a descriptor and some data,
or a block revocation list. A finished transaction always ends with a
commit. If there is no commit record (or the checksums don't match), the
transaction will be discarded during replay.

External Journal
~~~~~~~~~~~~~~~~

Optionally, an ext4 filesystem can be created with an external journal
device (as opposed to an internal journal, which uses a reserved inode).
In this case, on the filesystem device, ``s_journal_inum`` should be
zero and ``s_journal_uuid`` should be set. On the journal device there
will be an ext4 super block in the usual place, with a matching UUID.
The journal superblock will be in the next full block after the
superblock.

.. list-table::
   :widths: 12 12 12 32 12
   :header-rows: 1

   * - 1024 bytes of padding
     - ext4 Superblock
     - Journal Superblock
     - descriptor_block (data_blocks or revocation_block) [more data or
       revocations] commit_block
     - [more transactions...]
   * - 
     -
     -
     - One transaction
     -

Block Header
~~~~~~~~~~~~

Every block in the journal starts with a common 12-byte header
``struct journal_header_s``:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - h_magic
     - jbd2 magic number, 0xC03B3998.
   * - 0x4
     - __be32
     - h_blocktype
     - Description of what this block contains. See the jbd2_blocktype_ table
       below.
   * - 0x8
     - __be32
     - h_sequence
     - The transaction ID that goes with this block.

.. _jbd2_blocktype:

The journal block type can be any one of:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 1
     - Descriptor. This block precedes a series of data blocks that were
       written through the journal during a transaction.
   * - 2
     - Block commit record. This block signifies the completion of a
       transaction.
   * - 3
     - Journal superblock, v1.
   * - 4
     - Journal superblock, v2.
   * - 5
     - Block revocation records. This speeds up recovery by enabling the
       journal to skip writing blocks that were subsequently rewritten.

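As an illustration of the header layout and the big-endian rule, the following
is a minimal userspace sketch that decodes the 12-byte header from a raw block
buffer. The names ``jbd2_header_view``, ``read_be32``, ``jbd2_parse_header``
and ``jbd2_blocktype_name`` are made up for this example and are not kernel
interfaces; only the offsets, the magic number and the block type values come
from the tables above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JBD2_MAGIC_NUMBER 0xC03B3998U /* h_magic value from the table above */

/* Hypothetical host-byte-order copy of the common block header. */
struct jbd2_header_view {
    uint32_t magic;
    uint32_t blocktype;
    uint32_t sequence;
};

/* Read a big-endian 32-bit value from a possibly unaligned pointer. */
static uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Returns 1 and fills *out if buf begins with a jbd2 header, else 0. */
static int jbd2_parse_header(const uint8_t *buf, size_t len,
                             struct jbd2_header_view *out)
{
    assert(buf != NULL && out != NULL);
    if (len < 12)
        return 0;
    out->magic = read_be32(buf);         /* offset 0x0: h_magic */
    out->blocktype = read_be32(buf + 4); /* offset 0x4: h_blocktype */
    out->sequence = read_be32(buf + 8);  /* offset 0x8: h_sequence */
    return out->magic == JBD2_MAGIC_NUMBER;
}

/* Map h_blocktype to a short label, following the block type table. */
static const char *jbd2_blocktype_name(uint32_t type)
{
    switch (type) {
    case 1: return "descriptor";
    case 2: return "commit";
    case 3: return "superblock v1";
    case 4: return "superblock v2";
    case 5: return "revoke";
    default: return "unknown";
    }
}
```

A journal scanner could call ``jbd2_parse_header()`` on the start of each
block and treat any block without the magic number as a data block.
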
Super Block
~~~~~~~~~~~

The super block for the journal is much simpler than ext4's.
The key data kept within are the size of the journal and where to find the
start of the log of transactions.

The journal superblock is recorded as ``struct journal_superblock_s``,
which is 1024 bytes long:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * -
     -
     -
     - Static information describing the journal.
   * - 0x0
     - journal_header_t (12 bytes)
     - s_header
     - Common header identifying this as a superblock.
   * - 0xC
     - __be32
     - s_blocksize
     - Journal device block size.
   * - 0x10
     - __be32
     - s_maxlen
     - Total number of blocks in this journal.
   * - 0x14
     - __be32
     - s_first
     - First block of log information.
   * -
     -
     -
     - Dynamic information describing the current state of the log.
   * - 0x18
     - __be32
     - s_sequence
     - First commit ID expected in log.
   * - 0x1C
     - __be32
     - s_start
     - Block number of the start of log. Contrary to the comments, this field
       being zero does not imply that the journal is clean!
   * - 0x20
     - __be32
     - s_errno
     - Error value, as set by jbd2_journal_abort().
   * -
     -
     -
     - The remaining fields are only valid in a v2 superblock.
   * - 0x24
     - __be32
     - s_feature_compat
     - Compatible feature set. See the table jbd2_compat_ below.
   * - 0x28
     - __be32
     - s_feature_incompat
     - Incompatible feature set. See the table jbd2_incompat_ below.
   * - 0x2C
     - __be32
     - s_feature_ro_compat
     - Read-only compatible feature set. There aren't any of these currently.
   * - 0x30
     - __u8
     - s_uuid[16]
     - 128-bit uuid for journal. This is compared against the copy in the ext4
       super block at mount time.
   * - 0x40
     - __be32
     - s_nr_users
     - Number of file systems sharing this journal.
   * - 0x44
     - __be32
     - s_dynsuper
     - Location of dynamic super block copy. (Not used?)
   * - 0x48
     - __be32
     - s_max_transaction
     - Limit of journal blocks per transaction. (Not used?)
   * - 0x4C
     - __be32
     - s_max_trans_data
     - Limit of data blocks per transaction. (Not used?)
   * - 0x50
     - __u8
     - s_checksum_type
     - Checksum algorithm used for the journal.  See jbd2_checksum_type_ for
       more info.
   * - 0x51
     - __u8[3]
     - s_padding2
     -
   * - 0x54
     - __be32
     - s_num_fc_blocks
     - Number of fast commit blocks in the journal.
   * - 0x58
     - __u32
     - s_padding[41]
     -
   * - 0xFC
     - __be32
     - s_checksum
     - Checksum of the entire superblock, with this field set to zero.
   * - 0x100
     - __u8
     - s_users[16*48]
     - ids of all file systems sharing the log. e2fsprogs/Linux don't allow
       shared external journals, but I imagine Lustre (or ocfs2?), which use
       the jbd2 code, might.

.. _jbd2_compat:

The journal compat features are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - Journal maintains checksums on the data blocks.
       (JBD2_FEATURE_COMPAT_CHECKSUM)

.. _jbd2_incompat:

The journal incompat features are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - Journal has block revocation records. (JBD2_FEATURE_INCOMPAT_REVOKE)
   * - 0x2
     - Journal can deal with 64-bit block numbers.
       (JBD2_FEATURE_INCOMPAT_64BIT)
   * - 0x4
     - Journal commits asynchronously. (JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)
   * - 0x8
     - This journal uses v2 of the checksum on-disk format. Each journal
       metadata block gets its own checksum, and the block tags in the
       descriptor table contain checksums for each of the data blocks in the
       journal. (JBD2_FEATURE_INCOMPAT_CSUM_V2)
   * - 0x10
     - This journal uses v3 of the checksum on-disk format. This is the same as
       v2, but the journal block tag size is fixed regardless of the size of
       block numbers. (JBD2_FEATURE_INCOMPAT_CSUM_V3)
   * - 0x20
     - Journal has fast commit blocks. (JBD2_FEATURE_INCOMPAT_FAST_COMMIT)

.. _jbd2_checksum_type:

Journal checksum type codes are one of the following.  crc32 or crc32c are the
most likely choices.

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 1
     - CRC32
   * - 2
     - MD5
   * - 3
     - SHA1
   * - 4
     - CRC32C

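To make the superblock offsets concrete, here is a small sketch that pulls a
few fields out of a raw 1024-byte journal superblock. It assumes the buffer
was read from the start of the journal; ``jbd2_sb_view``, ``be32_at`` and
``jbd2_read_sb`` are hypothetical helper names, and the feature word is only
read for a v2 superblock, as the table above says the later fields are only
valid there.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JBD2_SB_SIZE 1024
#define JBD2_FEATURE_INCOMPAT_64BIT   0x2U
#define JBD2_FEATURE_INCOMPAT_CSUM_V3 0x10U

/* Hypothetical host-byte-order view of selected superblock fields. */
struct jbd2_sb_view {
    int is_v2;          /* 1 if h_blocktype == 4 */
    uint32_t blocksize; /* s_blocksize, offset 0xC */
    uint32_t maxlen;    /* s_maxlen, offset 0x10 */
    uint32_t first;     /* s_first, offset 0x14 */
    uint32_t sequence;  /* s_sequence, offset 0x18 */
    uint32_t start;     /* s_start, offset 0x1C */
    uint32_t incompat;  /* s_feature_incompat, offset 0x28 (v2 only) */
};

/* Decode a big-endian 32-bit field at byte offset off. */
static uint32_t be32_at(const uint8_t *sb, size_t off)
{
    return ((uint32_t)sb[off] << 24) | ((uint32_t)sb[off + 1] << 16) |
           ((uint32_t)sb[off + 2] << 8) | (uint32_t)sb[off + 3];
}

/* Returns 0 on success, -1 if the buffer is not a journal superblock. */
static int jbd2_read_sb(const uint8_t *sb, size_t len,
                        struct jbd2_sb_view *out)
{
    uint32_t type;

    assert(sb != NULL && out != NULL);
    if (len < JBD2_SB_SIZE || be32_at(sb, 0x0) != 0xC03B3998U)
        return -1;
    type = be32_at(sb, 0x4);
    if (type != 3 && type != 4) /* superblock v1 or v2 */
        return -1;

    out->is_v2 = (type == 4);
    out->blocksize = be32_at(sb, 0xC);
    out->maxlen = be32_at(sb, 0x10);
    out->first = be32_at(sb, 0x14);
    out->sequence = be32_at(sb, 0x18);
    out->start = be32_at(sb, 0x1C);
    out->incompat = out->is_v2 ? be32_at(sb, 0x28) : 0;
    return 0;
}
```

A caller would typically test ``incompat`` against the feature bits before
interpreting descriptor or revocation blocks, since the tag and record sizes
depend on them.
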
Descriptor Block
~~~~~~~~~~~~~~~~

The descriptor block contains an array of journal block tags that
describe the final locations of the data blocks that follow in the
journal. Descriptor blocks are open-coded instead of being completely
described by a data structure, but here is the block structure anyway.
Descriptor blocks consume at least 36 bytes, but use a full block:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal_header_t
     - (open coded)
     - Common block header.
   * - 0xC
     - struct journal_block_tag_s
     - open coded array[]
     - Enough tags either to fill up the block or to describe all the data
       blocks that follow this descriptor block.

Journal block tags have any of the following formats, depending on which
journal feature and block tag flags are set.

If JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the journal block tag is
defined as ``struct journal_block_tag3_s``, which looks like the
following. The size is 16 or 32 bytes.

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - t_blocknr
     - Lower 32-bits of the location of where the corresponding data block
       should end up on disk.
   * - 0x4
     - __be32
     - t_flags
     - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
       more info.
   * - 0x8
     - __be32
     - t_blocknr_high
     - Upper 32-bits of the location of where the corresponding data block
       should end up on disk. This is zero if JBD2_FEATURE_INCOMPAT_64BIT is
       not enabled.
   * - 0xC
     - __be32
     - t_checksum
     - Checksum of the journal UUID, the sequence number, and the data block.
   * -
     -
     -
     - This field appears to be open coded. It always comes at the end of the
       tag, after t_checksum. This field is not present if the "same UUID" flag
       is set.
   * - 0x10
     - char
     - uuid[16]
     - A UUID to go with this tag. This field appears to be copied from the
       ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
       field.

.. _jbd2_tag_flags:

The journal tag flags are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - On-disk block is escaped. The first four bytes of the data block just
       happened to match the jbd2 magic number.
   * - 0x2
     - This block has the same UUID as previous, therefore the UUID field is
       omitted.
   * - 0x4
     - The data block was deleted by the transaction. (Not used?)
   * - 0x8
     - This is the last tag in this descriptor block.

If JBD2_FEATURE_INCOMPAT_CSUM_V3 is NOT set, the journal block tag
is defined as ``struct journal_block_tag_s``, which looks like the
following. The size is 8, 12, 24, or 28 bytes:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - t_blocknr
     - Lower 32-bits of the location of where the corresponding data block
       should end up on disk.
   * - 0x4
     - __be16
     - t_checksum
     - Checksum of the journal UUID, the sequence number, and the data block.
       Note that only the lower 16 bits are stored.
   * - 0x6
     - __be16
     - t_flags
     - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
       more info.
   * -
     -
     -
     - This next field is only present if the super block indicates support for
       64-bit block numbers.
   * - 0x8
     - __be32
     - t_blocknr_high
     - Upper 32-bits of the location of where the corresponding data block
       should end up on disk.
   * -
     -
     -
     - This field appears to be open coded. It always comes at the end of the
       tag, after t_flags or t_blocknr_high. This field is not present if the
       "same UUID" flag is set.
   * - 0x8 or 0xC
     - char
     - uuid[16]
     - A UUID to go with this tag. This field appears to be copied from the
       ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
       field.

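The tag sizes quoted above follow from three inputs: whether CSUM_V3 is set,
whether 64-bit block numbers are enabled, and whether the tag carries the
"same UUID" flag. The sketch below reproduces that arithmetic exactly as this
document states it. The function name ``jbd2_tag_size_per_doc`` and the
``TAG_FLAG_SAME_UUID`` macro are invented for the example, and the kernel's
own size computation may differ in detail, so treat this as a reading aid
rather than a reference.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JBD2_FEATURE_INCOMPAT_64BIT   0x2U
#define JBD2_FEATURE_INCOMPAT_CSUM_V3 0x10U
#define TAG_FLAG_SAME_UUID            0x2U /* hypothetical name, value from the flags table */

/*
 * Size in bytes of one descriptor tag, following the sizes listed in this
 * document: 16 or 32 with CSUM_V3, otherwise 8, 12, 24, or 28.
 */
static size_t jbd2_tag_size_per_doc(uint32_t incompat, uint32_t tag_flags)
{
    size_t size;

    /* Only the four defined tag flag bits are expected here. */
    assert((tag_flags & ~0xFU) == 0);

    if (incompat & JBD2_FEATURE_INCOMPAT_CSUM_V3) {
        size = 16; /* fixed-size tag, t_blocknr_high always present */
    } else {
        size = 8;  /* t_blocknr + t_checksum + t_flags */
        if (incompat & JBD2_FEATURE_INCOMPAT_64BIT)
            size += 4; /* t_blocknr_high */
    }

    /* The 16-byte UUID trails the tag unless "same UUID" is set. */
    if (!(tag_flags & TAG_FLAG_SAME_UUID))
        size += 16;

    return size;
}
```

A descriptor walker would advance by this many bytes per tag and stop once it
sees the "last tag" flag (0x8).
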
If JBD2_FEATURE_INCOMPAT_CSUM_V2 or
JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the end of the block is a
``struct jbd2_journal_block_tail``, which looks like this:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - t_checksum
     - Checksum of the journal UUID + the descriptor block, with this field set
       to zero.

Data Block
~~~~~~~~~~

In general, the data blocks being written to disk through the journal
are written verbatim into the journal file after the descriptor block.
However, if the first four bytes of the block match the jbd2 magic
number then those four bytes are replaced with zeroes and the “escaped”
flag is set in the descriptor block tag.

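The escaping rule can be sketched as a pair of helpers: one applied before a
data block is copied into the journal, and one applied when the block is
copied back out during replay. The function names and the ``TAG_FLAG_ESCAPE``
macro are hypothetical; the flag value 0x1 and the magic number are taken
from the tables above. Restoring the magic on the way out is the natural
inverse of the zeroing step described in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TAG_FLAG_ESCAPE 0x1U /* hypothetical name for the "escaped" tag flag */

/* jbd2 magic number 0xC03B3998 as it appears on disk (big-endian). */
static const uint8_t jbd2_magic_bytes[4] = {0xC0, 0x3B, 0x39, 0x98};

/*
 * Before journalling: if the data block starts with the magic number, zero
 * those four bytes and record that fact in the tag flags.
 * Returns 1 if the block was escaped, 0 otherwise.
 */
static int jbd2_escape_block(uint8_t *data, uint32_t *tag_flags)
{
    size_t i;

    assert(data != NULL && tag_flags != NULL);
    for (i = 0; i < 4; i++)
        if (data[i] != jbd2_magic_bytes[i])
            return 0;
    for (i = 0; i < 4; i++)
        data[i] = 0;
    *tag_flags |= TAG_FLAG_ESCAPE;
    return 1;
}

/* During replay: put the magic number back if the tag says it was escaped. */
static void jbd2_unescape_block(uint8_t *data, uint32_t tag_flags)
{
    size_t i;

    assert(data != NULL);
    if (!(tag_flags & TAG_FLAG_ESCAPE))
        return;
    for (i = 0; i < 4; i++)
        data[i] = jbd2_magic_bytes[i];
}
```

Without this step, a scanner walking the journal could mistake such a data
block for a journal metadata block.
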
Revocation Block
~~~~~~~~~~~~~~~~

A revocation block is used to prevent replay of a block in an earlier
transaction. This is used to mark blocks that were journalled at one
time but are no longer journalled. Typically this happens if a metadata
block is freed and re-allocated as a file data block; in this case, a
journal replay after the file block was written to disk will cause
corruption.

**NOTE**: This mechanism is NOT used to express “this journal block is
superseded by this other journal block”, as the author (djwong)
mistakenly thought. Any block being added to a transaction will cause
the removal of all existing revocation records for that block.

Revocation blocks are described in
``struct jbd2_journal_revoke_header_s``, are at least 16 bytes in
length, but use a full block:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal_header_t
     - r_header
     - Common block header.
   * - 0xC
     - __be32
     - r_count
     - Number of bytes used in this block.
   * - 0x10
     - __be32 or __be64
     - blocks[0]
     - Blocks to revoke.

After r_count is a linear array of block numbers that are effectively
revoked by this transaction. The size of each block number is 8 bytes if
the superblock advertises 64-bit block number support, or 4 bytes
otherwise.

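Given ``r_count`` and the 64-bit feature bit, the number of revoked block
numbers in a revocation block can be derived with simple arithmetic. The
sketch below assumes that ``r_count`` counts bytes from the start of the
block, so it includes the 16-byte header; that reading matches "number of
bytes used in this block" but is an assumption of this example. The function
name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JBD2_FEATURE_INCOMPAT_64BIT 0x2U
#define JBD2_REVOKE_HEADER_SIZE     16U /* 12-byte common header + r_count */

/*
 * Number of revoked block numbers stored in one revocation block.
 * r_count is assumed to include the 16-byte revoke header.
 */
static size_t jbd2_revoke_entries(uint32_t r_count, uint32_t incompat)
{
    size_t record_size = (incompat & JBD2_FEATURE_INCOMPAT_64BIT) ? 8 : 4;

    assert(record_size == 4 || record_size == 8);
    if (r_count < JBD2_REVOKE_HEADER_SIZE)
        return 0; /* malformed or empty block */
    return (r_count - JBD2_REVOKE_HEADER_SIZE) / record_size;
}
```

For example, with 32-bit block numbers an ``r_count`` of 24 would describe two
revoked blocks, stored at offsets 0x10 and 0x14.
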
If JBD2_FEATURE_INCOMPAT_CSUM_V2 or
JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the end of the revocation
block is a ``struct jbd2_journal_revoke_tail``, which has this format:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - r_checksum
     - Checksum of the journal UUID + revocation block.

Commit Block
~~~~~~~~~~~~

The commit block is a sentry that indicates that a transaction has been
completely written to the journal. Once this commit block reaches the
journal, the data stored with this transaction can be written to their
final locations on disk.

The commit block is described by ``struct commit_header``, which is 60
bytes long (but uses a full block):

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal_header_s
     - (open coded)
     - Common block header.
   * - 0xC
     - unsigned char
     - h_chksum_type
     - The type of checksum to use to verify the integrity of the data blocks
       in the transaction. See jbd2_checksum_type_ for more info.
   * - 0xD
     - unsigned char
     - h_chksum_size
     - The number of bytes used by the checksum. Most likely 4.
   * - 0xE
     - unsigned char
     - h_padding[2]
     -
   * - 0x10
     - __be32
     - h_chksum[JBD2_CHECKSUM_BYTES]
     - 32 bytes of space to store checksums. If
       JBD2_FEATURE_INCOMPAT_CSUM_V2 or JBD2_FEATURE_INCOMPAT_CSUM_V3
       is set, the first ``__be32`` is the checksum of the journal UUID and
       the entire commit block, with this field zeroed. If
       JBD2_FEATURE_COMPAT_CHECKSUM is set, the first ``__be32`` is the
       crc32 of all the blocks already written to the transaction.
   * - 0x30
     - __be64
     - h_commit_sec
     - The time that the transaction was committed, in seconds since the epoch.
   * - 0x38
     - __be32
     - h_commit_nsec
     - Nanoseconds component of the above timestamp.

Fast commits
~~~~~~~~~~~~

The fast commit area is organized as a log of tag-length-value (TLV) entries.
Each TLV begins with a ``struct ext4_fc_tl``, which stores the tag and the
length of the entire field. It is followed by a variable-length, tag-specific
value. Here is the list of supported tags and their meanings:

.. list-table::
   :widths: 8 20 20 32
   :header-rows: 1

   * - Tag
     - Meaning
     - Value struct
     - Description
   * - EXT4_FC_TAG_HEAD
     - Fast commit area header
     - ``struct ext4_fc_head``
     - Stores the TID of the transaction after which these fast commits should
       be applied.
   * - EXT4_FC_TAG_ADD_RANGE
     - Add extent to inode
     - ``struct ext4_fc_add_range``
     - Stores the inode number and the extent to be added to this inode.
   * - EXT4_FC_TAG_DEL_RANGE
     - Remove logical offsets from inode
     - ``struct ext4_fc_del_range``
     - Stores the inode number and the logical offset range that needs to be
       removed.
   * - EXT4_FC_TAG_CREAT
     - Create directory entry for a newly created file
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode number and directory entry of the
       newly created file.
   * - EXT4_FC_TAG_LINK
     - Link a directory entry to an inode
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode number and directory entry.
   * - EXT4_FC_TAG_UNLINK
     - Unlink a directory entry of an inode
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode number and directory entry.
   * - EXT4_FC_TAG_PAD
     - Padding (unused area)
     - None
     - Unused bytes in the fast commit area.
   * - EXT4_FC_TAG_TAIL
     - Mark the end of a fast commit
     - ``struct ext4_fc_tail``
     - Stores the TID of the commit and the CRC of the fast commit that this
       tag terminates.

Fast Commit Replay Idempotence
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Fast commit tags are idempotent in nature provided the recovery code follows
certain rules. The guiding principle that the commit path follows while
committing is that it stores the result of a particular operation instead of
storing the procedure.

Let's consider this rename operation: 'mv /a /b'. Let's assume dirent '/a'
was associated with inode 10. During fast commit, instead of storing this
operation as a procedure "rename a to b", we store the resulting file system
state as a "series" of outcomes:

- Link dirent b to inode 10
- Unlink dirent a
- Inode 10 with valid refcount

Now when the recovery code runs, it needs to "enforce" this state on the file
system. This is what guarantees idempotence of fast commit replay.

Let's take an example of a procedure that is not idempotent and see how fast
commits make it idempotent. Consider the following sequence of operations:

1) rm A
2) mv B A
3) read A

If we store this sequence of operations as is, then the replay is not
idempotent. Let's say while in replay, we crash after (2). During the second
replay, file A (which was actually created as a result of the "mv B A"
operation) would get deleted. Thus, the file named A would be absent when we
try to read A. So, this sequence of operations is not idempotent. However, as
mentioned above, instead of storing the procedure, fast commits store the
outcome of each procedure. Thus the fast commit log for the above procedure
would be as follows:

(Let's assume dirent A was linked to inode 10 and dirent B was linked to
inode 11 before the replay)

1) Unlink A
2) Link A to inode 11
3) Unlink B
4) Inode 11

If we crash after (3), we will have file A linked to inode 11. During the second
replay, we will remove file A (inode 11). But we will create it back and make
it point to inode 11. We won't find B, so we'll just skip that step. At this
point, the refcount for inode 11 is not reliable, but that gets fixed by the
replay of the last inode 11 tag. Thus, by converting a non-idempotent procedure
into a series of idempotent outcomes, fast commits ensure idempotence during
the replay.

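The argument above can be checked with a toy model. The sketch below keeps a
tiny in-memory directory (name to inode number) and replays the three dirent
outcomes from the example, with a simulated crash partway through. Everything
here (``toy_dir``, ``fc_rec``, ``toy_replay`` and friends) is invented for
illustration and has no relation to the real ext4 replay code; the final
"Inode 11" refcount step is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX 8

struct toy_slot { char name[8]; unsigned ino; }; /* ino == 0 means free */
struct toy_dir { struct toy_slot slots[TOY_MAX]; };

enum fc_op { FC_LINK, FC_UNLINK };
struct fc_rec { enum fc_op op; const char *name; unsigned ino; };

static struct toy_slot *toy_find(struct toy_dir *d, const char *name)
{
    size_t i;

    for (i = 0; i < TOY_MAX; i++)
        if (d->slots[i].ino != 0 && strcmp(d->slots[i].name, name) == 0)
            return &d->slots[i];
    return NULL;
}

/* Enforce "name is absent": a missing dirent is simply skipped. */
static void toy_unlink(struct toy_dir *d, const char *name)
{
    struct toy_slot *s = toy_find(d, name);

    if (s != NULL)
        s->ino = 0;
}

/* Enforce "name points at ino", creating or overwriting the dirent. */
static void toy_link(struct toy_dir *d, const char *name, unsigned ino)
{
    struct toy_slot *s = toy_find(d, name);
    size_t i;

    for (i = 0; s == NULL && i < TOY_MAX; i++)
        if (d->slots[i].ino == 0)
            s = &d->slots[i];
    assert(s != NULL);
    strncpy(s->name, name, sizeof(s->name) - 1);
    s->name[sizeof(s->name) - 1] = '\0';
    s->ino = ino;
}

/* Apply the first stop_after records of the log (simulating a crash). */
static void toy_replay(struct toy_dir *d, const struct fc_rec *log,
                       size_t n, size_t stop_after)
{
    size_t i;

    for (i = 0; i < n && i < stop_after; i++) {
        if (log[i].op == FC_LINK)
            toy_link(d, log[i].name, log[i].ino);
        else
            toy_unlink(d, log[i].name);
    }
}

/* Inode number that name points at, or 0 if the name is absent. */
static unsigned toy_lookup(struct toy_dir *d, const char *name)
{
    struct toy_slot *s = toy_find(d, name);

    return s != NULL ? s->ino : 0;
}
```

Starting from A linked to inode 10 and B linked to inode 11, replaying the log
"Unlink A, Link A to inode 11, Unlink B" any number of times, with or without
an interruption in between, leaves A pointing at inode 11 and B absent.
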
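Journal Checkpoint
~~~~~~~~~~~~~~~~~~

Checkpointing the journal ensures all transactions and their associated buffers
are submitted to the disk. In-progress transactions are waited upon and included
in the checkpoint. Checkpointing is used internally during critical updates to
the filesystem including journal recovery, filesystem resizing, and freeing of
the journal_t structure.

A journal checkpoint can be triggered from userspace via the ioctl
EXT4_IOC_CHECKPOINT. This ioctl takes a single u64 argument for flags.
Currently, three flags are supported. First, EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN
can be used to verify input to the ioctl. It returns an error if there is any
invalid input; otherwise it returns success without performing
any checkpointing. This can be used to check whether the ioctl exists on a
system and to verify there are no issues with arguments or flags. The
other two flags are EXT4_IOC_CHECKPOINT_FLAG_DISCARD and
EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT. These flags cause the journal blocks to be
discarded or zero-filled, respectively, after the journal checkpoint is
complete. EXT4_IOC_CHECKPOINT_FLAG_DISCARD and EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT
cannot both be set. The ioctl may be useful when snapshotting a system or for
complying with content deletion SLOs.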
0747 can be used to verify input to the ioctl. It returns error if there is any
0748 invalid input, otherwise it returns success without performing
0749 any checkpointing. This can be used to check whether the ioctl exists on a
0750 system and to verify there are no issues with arguments or flags. The
0751 other two flags are EXT4_IOC_CHECKPOINT_FLAG_DISCARD and
0752 EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT. These flags cause the journal blocks to be
0753 discarded or zero-filled, respectively, after the journal checkpoint is
0754 complete. EXT4_IOC_CHECKPOINT_FLAG_DISCARD and EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT
0755 cannot both be set. The ioctl may be useful when snapshotting a system or for
0756 complying with content deletion SLOs.