.. SPDX-License-Identifier: GPL-2.0

Journal (jbd2)
--------------

The ext4 filesystem employs a journal, first introduced in ext3, to protect
the filesystem against metadata inconsistencies in the case of a system
crash. Up to 10,240,000 filesystem blocks (see mke2fs(8) for more details on
journal size limits) can be reserved inside the filesystem as a place to land
“important” data writes on disk as quickly as possible. Once the important
data transaction is fully written to the disk and flushed from the disk write
cache, a record of the data being committed is also written to the journal. At
some later point in time, the journal code writes the transactions to their
final locations on disk (this could involve a lot of seeking or a lot of small
read-write-erases) before erasing the commit record. Should the system
crash during the second, slow write, the journal can be replayed all the
way to the latest commit record, guaranteeing the atomicity of whatever
gets written through the journal to the disk. The effect of this is to
guarantee that the filesystem does not become stuck midway through a
metadata update.

For performance reasons, ext4 by default only writes filesystem metadata
through the journal. This means that file data blocks are *not*
guaranteed to be in any consistent state after a crash. If this default
guarantee level (``data=ordered``) is not satisfactory, there is a mount
option to control journal behavior. If ``data=journal``, all data and
metadata are written to disk through the journal. This is slower but
safest. If ``data=writeback``, dirty data blocks are not flushed to the
disk before the metadata are written to disk through the journal.

In ``data=ordered`` mode, ext4 also supports fast commits, which
help reduce commit latency significantly. The default ``data=ordered``
mode works by logging metadata blocks to the journal. In fast commit
mode, ext4 only stores the minimal delta needed to recreate the
affected metadata in fast commit space that is shared with JBD2.
Once the fast commit area fills up, or if a fast commit is not possible,
or if the JBD2 commit timer goes off, ext4 performs a traditional full
commit. A full commit invalidates all the fast commits that happened before
it and thus empties the fast commit area for further fast
commits. This feature needs to be enabled at mkfs time.

The journal inode is typically inode 8. The first 68 bytes of the
journal inode are replicated in the ext4 superblock. The journal itself
is a normal (but hidden) file within the filesystem. The file usually
consumes an entire block group, though mke2fs tries to put it in the
middle of the disk.

All fields in jbd2 are written to disk in big-endian order. This is the
opposite of ext4.

NOTE: Both ext4 and ocfs2 use jbd2.

The maximum size of a journal embedded in an ext4 filesystem is 2^32
blocks. jbd2 itself does not seem to care.

Layout
~~~~~~

Generally speaking, the journal has this format:

.. list-table::
   :widths: 16 48 16
   :header-rows: 1

   * - Superblock
     - descriptor_block (data_blocks or revocation_block) [more data or
       revocations] commit_block
     - [more transactions...]
   * -
     - One transaction
     -

Notice that a transaction begins with either a descriptor and some data,
or a block revocation list. A finished transaction always ends with a
commit. If there is no commit record (or the checksums don't match), the
transaction will be discarded during replay.

External Journal
~~~~~~~~~~~~~~~~

Optionally, an ext4 filesystem can be created with an external journal
device (as opposed to an internal journal, which uses a reserved inode).
In this case, on the filesystem device, ``s_journal_inum`` should be
zero and ``s_journal_uuid`` should be set. On the journal device there
will be an ext4 super block in the usual place, with a matching UUID.
The journal superblock will be in the next full block after the
superblock.

.. list-table::
   :widths: 12 12 12 32 12
   :header-rows: 1

   * - 1024 bytes of padding
     - ext4 Superblock
     - Journal Superblock
     - descriptor_block (data_blocks or revocation_block) [more data or
       revocations] commit_block
     - [more transactions...]
   * -
     -
     -
     - One transaction
     -

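As a worked example of the layout above, the byte offset of the journal
superblock on an external journal device can be computed from the block
size. This is a minimal sketch, assuming the ext4 superblock starts at byte
1024 (the padding shown in the table) and that "the next full block" means
the block immediately after the one holding the ext4 superblock;
``journal_sb_offset`` is a hypothetical helper, not a kernel or e2fsprogs
function.

```python
EXT4_SB_OFFSET = 1024  # bytes of padding before the ext4 superblock


def journal_sb_offset(block_size: int) -> int:
    """Byte offset of the journal superblock on an external journal device."""
    # Index of the block that contains the start of the ext4 superblock.
    sb_block = EXT4_SB_OFFSET // block_size
    # The journal superblock occupies the next full block.
    return (sb_block + 1) * block_size


for bs in (1024, 2048, 4096):
    print(bs, journal_sb_offset(bs))
```

Under these assumptions only 1024-byte blocks place the journal superblock
in block 2; larger block sizes place it in block 1.
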
Block Header
~~~~~~~~~~~~

Every block in the journal starts with a common 12-byte header
``struct journal_header_s``:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - h_magic
     - jbd2 magic number, 0xC03B3998.
   * - 0x4
     - __be32
     - h_blocktype
     - Description of what this block contains. See the jbd2_blocktype_ table
       below.
   * - 0x8
     - __be32
     - h_sequence
     - The transaction ID that goes with this block.

.. _jbd2_blocktype:

The journal block type can be any one of:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 1
     - Descriptor. This block precedes a series of data blocks that were
       written through the journal during a transaction.
   * - 2
     - Block commit record. This block signifies the completion of a
       transaction.
   * - 3
     - Journal superblock, v1.
   * - 4
     - Journal superblock, v2.
   * - 5
     - Block revocation records. This speeds up recovery by enabling the
       journal to skip writing blocks that were subsequently rewritten.

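The header and block type tables above can be exercised with a few lines of
Python. This is only an illustrative sketch: ``parse_header`` and the
``BLOCK_TYPES`` names are made up for this example, and the sample block is
synthetic rather than read from a real journal.

```python
import struct

JBD2_MAGIC = 0xC03B3998
BLOCK_TYPES = {
    1: "descriptor",
    2: "commit",
    3: "superblock v1",
    4: "superblock v2",
    5: "revocation",
}
# Three big-endian 32-bit fields: h_magic, h_blocktype, h_sequence.
HEADER = struct.Struct(">III")


def parse_header(block: bytes):
    """Return (type name, transaction ID), or None if the magic is absent."""
    magic, blocktype, sequence = HEADER.unpack_from(block, 0)
    if magic != JBD2_MAGIC:
        return None
    return BLOCK_TYPES.get(blocktype, "unknown"), sequence


# Synthetic commit block for transaction 41, padded to 1024 bytes.
sample = HEADER.pack(JBD2_MAGIC, 2, 41) + bytes(1012)
print(parse_header(sample))
```
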
Super Block
~~~~~~~~~~~

The super block for the journal is much simpler than ext4's.
The key data kept within are the size of the journal and where to find the
start of the log of transactions.

The journal superblock is recorded as ``struct journal_superblock_s``,
which is 1024 bytes long:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * -
     -
     -
     - Static information describing the journal.
   * - 0x0
     - journal_header_t (12 bytes)
     - s_header
     - Common header identifying this as a superblock.
   * - 0xC
     - __be32
     - s_blocksize
     - Journal device block size.
   * - 0x10
     - __be32
     - s_maxlen
     - Total number of blocks in this journal.
   * - 0x14
     - __be32
     - s_first
     - First block of log information.
   * -
     -
     -
     - Dynamic information describing the current state of the log.
   * - 0x18
     - __be32
     - s_sequence
     - First commit ID expected in log.
   * - 0x1C
     - __be32
     - s_start
     - Block number of the start of log. Contrary to the comments, this field
       being zero does not imply that the journal is clean!
   * - 0x20
     - __be32
     - s_errno
     - Error value, as set by jbd2_journal_abort().
   * -
     -
     -
     - The remaining fields are only valid in a v2 superblock.
   * - 0x24
     - __be32
     - s_feature_compat
     - Compatible feature set. See the table jbd2_compat_ below.
   * - 0x28
     - __be32
     - s_feature_incompat
     - Incompatible feature set. See the table jbd2_incompat_ below.
   * - 0x2C
     - __be32
     - s_feature_ro_compat
     - Read-only compatible feature set. There aren't any of these currently.
   * - 0x30
     - __u8
     - s_uuid[16]
     - 128-bit UUID for journal. This is compared against the copy in the ext4
       super block at mount time.
   * - 0x40
     - __be32
     - s_nr_users
     - Number of file systems sharing this journal.
   * - 0x44
     - __be32
     - s_dynsuper
     - Location of dynamic super block copy. (Not used?)
   * - 0x48
     - __be32
     - s_max_transaction
     - Limit of journal blocks per transaction. (Not used?)
   * - 0x4C
     - __be32
     - s_max_trans_data
     - Limit of data blocks per transaction. (Not used?)
   * - 0x50
     - __u8
     - s_checksum_type
     - Checksum algorithm used for the journal. See jbd2_checksum_type_ for
       more info.
   * - 0x51
     - __u8[3]
     - s_padding2
     -
   * - 0x54
     - __be32
     - s_num_fc_blocks
     - Number of fast commit blocks in the journal.
   * - 0x58
     - __u32
     - s_padding[42]
     -
   * - 0xFC
     - __be32
     - s_checksum
     - Checksum of the entire superblock, with this field set to zero.
   * - 0x100
     - __u8
     - s_users[16*48]
     - IDs of all file systems sharing the log. e2fsprogs/Linux don't allow
       shared external journals, but I imagine Lustre (or ocfs2?), which use
       the jbd2 code, might.

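To make the offsets in the table concrete, the leading fields of the journal
superblock (0x0 through ``s_nr_users`` at 0x40) can be unpacked in one call.
This is a sketch under the assumption that the table above is accurate;
``JournalSB`` and ``parse_journal_sb`` are illustrative names, and the sample
bytes are synthetic.

```python
import struct
from collections import namedtuple

# Header (3 x be32), six be32 fields up to s_errno, three be32 feature
# words, the 16-byte UUID, then s_nr_users. All big-endian, no padding.
SB_FORMAT = struct.Struct(">3I 6I 3I 16s I")
JournalSB = namedtuple(
    "JournalSB",
    "magic blocktype hdr_sequence blocksize maxlen first sequence start "
    "errno feature_compat feature_incompat feature_ro_compat uuid nr_users",
)


def parse_journal_sb(raw: bytes) -> JournalSB:
    """Decode the start of a journal superblock, checking magic and type."""
    sb = JournalSB._make(SB_FORMAT.unpack_from(raw, 0))
    if sb.magic != 0xC03B3998 or sb.blocktype not in (3, 4):
        raise ValueError("not a journal superblock")
    return sb


# Synthetic v2 superblock: 4096-byte blocks, 32768 blocks, log starts at 1.
raw = SB_FORMAT.pack(
    0xC03B3998, 4, 0, 4096, 32768, 1, 7, 0, 0, 0, 0x12, 0, bytes(16), 1
).ljust(1024, b"\0")
sb = parse_journal_sb(raw)
print(sb.blocksize, sb.maxlen, hex(sb.feature_incompat))
```

The format string ends at byte 0x44, which matches the offset of
``s_dynsuper`` in the table.
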
.. _jbd2_compat:

The journal compat features are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - Journal maintains checksums on the data blocks.
       (JBD2_FEATURE_COMPAT_CHECKSUM)

.. _jbd2_incompat:

The journal incompat features are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - Journal has block revocation records. (JBD2_FEATURE_INCOMPAT_REVOKE)
   * - 0x2
     - Journal can deal with 64-bit block numbers.
       (JBD2_FEATURE_INCOMPAT_64BIT)
   * - 0x4
     - Journal commits asynchronously. (JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)
   * - 0x8
     - This journal uses v2 of the checksum on-disk format. Each journal
       metadata block gets its own checksum, and the block tags in the
       descriptor table contain checksums for each of the data blocks in the
       journal. (JBD2_FEATURE_INCOMPAT_CSUM_V2)
   * - 0x10
     - This journal uses v3 of the checksum on-disk format. This is the same as
       v2, but the journal block tag size is fixed regardless of the size of
       block numbers. (JBD2_FEATURE_INCOMPAT_CSUM_V3)
   * - 0x20
     - Journal has fast commit blocks. (JBD2_FEATURE_INCOMPAT_FAST_COMMIT)

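Since the incompat word is a bit mask, a reader typically splits it into
named flags and leftover bits. The sketch below does that using only the
values from the table above; ``decode_incompat`` is a hypothetical helper,
and the short flag names are abbreviations of the ``JBD2_FEATURE_INCOMPAT_*``
constants.

```python
INCOMPAT_FLAGS = {
    0x1: "REVOKE",
    0x2: "64BIT",
    0x4: "ASYNC_COMMIT",
    0x8: "CSUM_V2",
    0x10: "CSUM_V3",
    0x20: "FAST_COMMIT",
}


def decode_incompat(value: int):
    """Split s_feature_incompat into known flag names and unknown bits."""
    names = [name for bit, name in INCOMPAT_FLAGS.items() if value & bit]
    # Summing the dict keys gives the mask of all documented bits (0x3F).
    unknown = value & ~sum(INCOMPAT_FLAGS)
    return names, unknown


print(decode_incompat(0x13))
print(decode_incompat(0x40))
```

A nonzero second element means the journal advertises a feature that this
table does not describe, so a cautious tool would stop there rather than
guess at the on-disk format.
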
.. _jbd2_checksum_type:

The journal checksum type code is one of the following. CRC32 or CRC32C are
the most likely choices.

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 1
     - CRC32
   * - 2
     - MD5
   * - 3
     - SHA1
   * - 4
     - CRC32C

Descriptor Block
~~~~~~~~~~~~~~~~

The descriptor block contains an array of journal block tags that
describe the final locations of the data blocks that follow in the
journal. Descriptor blocks are open-coded instead of being completely
described by a data structure, but here is the block structure anyway.
Descriptor blocks consume at least 36 bytes, but use a full block:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal_header_t
     - (open coded)
     - Common block header.
   * - 0xC
     - struct journal_block_tag_s
     - open coded array[]
     - Enough tags either to fill up the block or to describe all the data
       blocks that follow this descriptor block.

Journal block tags have any of the following formats, depending on which
journal feature and block tag flags are set.

If JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the journal block tag is
defined as ``struct journal_block_tag3_s``, which looks like the
following. The size is 16 or 32 bytes.

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - t_blocknr
     - Lower 32 bits of the location of where the corresponding data block
       should end up on disk.
   * - 0x4
     - __be32
     - t_flags
     - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
       more info.
   * - 0x8
     - __be32
     - t_blocknr_high
     - Upper 32 bits of the location of where the corresponding data block
       should end up on disk. This is zero if JBD2_FEATURE_INCOMPAT_64BIT is
       not enabled.
   * - 0xC
     - __be32
     - t_checksum
     - Checksum of the journal UUID, the sequence number, and the data block.
   * -
     -
     -
     - This field appears to be open coded. It always comes at the end of the
       tag, after t_checksum. This field is not present if the "same UUID" flag
       is set.
   * - 0x10
     - char
     - uuid[16]
     - A UUID to go with this tag. This field appears to be copied from the
       ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
       field.

.. _jbd2_tag_flags:

The journal tag flags are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - On-disk block is escaped. The first four bytes of the data block just
       happened to match the jbd2 magic number.
   * - 0x2
     - This block has the same UUID as previous, therefore the UUID field is
       omitted.
   * - 0x4
     - The data block was deleted by the transaction. (Not used?)
   * - 0x8
     - This is the last tag in this descriptor block.

If JBD2_FEATURE_INCOMPAT_CSUM_V3 is NOT set, the journal block tag
is defined as ``struct journal_block_tag_s``, which looks like the
following. The size is 8, 12, 24, or 28 bytes:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - t_blocknr
     - Lower 32 bits of the location of where the corresponding data block
       should end up on disk.
   * - 0x4
     - __be16
     - t_checksum
     - Checksum of the journal UUID, the sequence number, and the data block.
       Note that only the lower 16 bits are stored.
   * - 0x6
     - __be16
     - t_flags
     - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
       more info.
   * -
     -
     -
     - This next field is only present if the super block indicates support for
       64-bit block numbers.
   * - 0x8
     - __be32
     - t_blocknr_high
     - Upper 32 bits of the location of where the corresponding data block
       should end up on disk.
   * -
     -
     -
     - This field appears to be open coded. It always comes at the end of the
       tag, after t_flags or t_blocknr_high. This field is not present if the
       "same UUID" flag is set.
   * - 0x8 or 0xC
     - char
     - uuid[16]
     - A UUID to go with this tag. This field appears to be copied from the
       ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
       field.

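The tag sizes quoted above (16 or 32 bytes with CSUM_V3; 8, 12, 24, or 28
bytes without) follow from three inputs: the CSUM_V3 feature, the 64BIT
feature, and the "same UUID" tag flag. The sketch below derives them;
``tag_size`` is a hypothetical helper, and the constant names are chosen
here to mirror the table values.

```python
JBD2_FEATURE_INCOMPAT_64BIT = 0x2
JBD2_FEATURE_INCOMPAT_CSUM_V3 = 0x10
TAG_FLAG_SAME_UUID = 0x2


def tag_size(incompat: int, tag_flags: int) -> int:
    """Bytes consumed by one block tag, including any trailing UUID."""
    if incompat & JBD2_FEATURE_INCOMPAT_CSUM_V3:
        size = 16  # fixed-size journal_block_tag3_s
    elif incompat & JBD2_FEATURE_INCOMPAT_64BIT:
        size = 12  # t_blocknr, t_checksum, t_flags, t_blocknr_high
    else:
        size = 8   # t_blocknr, t_checksum, t_flags
    if not tag_flags & TAG_FLAG_SAME_UUID:
        size += 16  # uuid[16] follows the tag
    return size


for incompat in (0x0, 0x2, 0x10):
    print(hex(incompat), tag_size(incompat, 0), tag_size(incompat, 0x2))
```

A descriptor block parser would call something like this for each tag to
find where the next tag starts, stopping at the tag whose flags include 0x8.
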
If JBD2_FEATURE_INCOMPAT_CSUM_V2 or
JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the end of the block is a
``struct jbd2_journal_block_tail``, which looks like this:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - t_checksum
     - Checksum of the journal UUID + the descriptor block, with this field set
       to zero.

Data Block
~~~~~~~~~~

In general, the data blocks being written to disk through the journal
are written verbatim into the journal file after the descriptor block.
However, if the first four bytes of the block match the jbd2 magic
number then those four bytes are replaced with zeroes and the “escaped”
flag is set in the descriptor block tag.

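The escaping rule is small enough to show as a round trip. This is a sketch
of the rule as described above, not the kernel code; ``escape_block`` and
``unescape_block`` are hypothetical names, and 0x1 is the "escaped" value
from the tag flags table.

```python
import struct

JBD2_MAGIC_BYTES = struct.pack(">I", 0xC03B3998)
TAG_FLAG_ESCAPE = 0x1


def escape_block(data: bytes):
    """Return (bytes to store in the journal, tag flags to record)."""
    if data[:4] == JBD2_MAGIC_BYTES:
        # Zero the colliding bytes so the block cannot pass for a header.
        return bytes(4) + data[4:], TAG_FLAG_ESCAPE
    return data, 0


def unescape_block(stored: bytes, flags: int) -> bytes:
    """Undo escape_block() when replaying the journal."""
    if flags & TAG_FLAG_ESCAPE:
        return JBD2_MAGIC_BYTES + stored[4:]
    return stored


original = JBD2_MAGIC_BYTES + b"payload"
stored, flags = escape_block(original)
print(stored[:4], flags, unescape_block(stored, flags) == original)
```
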
Revocation Block
~~~~~~~~~~~~~~~~

A revocation block is used to prevent replay of a block in an earlier
transaction. This is used to mark blocks that were journalled at one
time but are no longer journalled. Typically this happens if a metadata
block is freed and re-allocated as a file data block; in this case, a
journal replay after the file block was written to disk will cause
corruption.

**NOTE**: This mechanism is NOT used to express “this journal block is
superseded by this other journal block”, as the author (djwong)
mistakenly thought. Any block being added to a transaction will cause
the removal of all existing revocation records for that block.

Revocation blocks are described in
``struct jbd2_journal_revoke_header_s``, are at least 16 bytes in
length, but use a full block:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal_header_t
     - r_header
     - Common block header.
   * - 0xC
     - __be32
     - r_count
     - Number of bytes used in this block.
   * - 0x10
     - __be32 or __be64
     - blocks[0]
     - Blocks to revoke.

After r_count is a linear array of block numbers that are effectively
revoked by this transaction. The size of each block number is 8 bytes if
the superblock advertises 64-bit block number support, or 4 bytes
otherwise.

If JBD2_FEATURE_INCOMPAT_CSUM_V2 or
JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the end of the revocation
block is a ``struct jbd2_journal_revoke_tail``, which has this format:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - r_checksum
     - Checksum of the journal UUID + revocation block.

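Reading the revoked block numbers out of a revocation block might look like
the sketch below. It assumes that ``r_count`` is measured from the start of
the block and therefore includes the 16-byte header, which is one reading of
"number of bytes used in this block"; ``parse_revoke_block`` is a
hypothetical helper and the sample block is synthetic.

```python
import struct


def parse_revoke_block(block: bytes, is_64bit: bool):
    """Return the list of revoked block numbers in one revocation block."""
    magic, blocktype, _sequence, r_count = struct.unpack_from(">4I", block, 0)
    if magic != 0xC03B3998 or blocktype != 5:
        raise ValueError("not a revocation block")
    fmt, width = (">Q", 8) if is_64bit else (">I", 4)
    # Assumption: r_count includes the 16-byte header, so entries occupy
    # bytes 16 .. r_count - 1.
    return [
        struct.unpack_from(fmt, block, off)[0]
        for off in range(16, r_count, width)
    ]


# Synthetic block revoking blocks 100 and 200 with 32-bit block numbers.
sample = struct.pack(">4I2I", 0xC03B3998, 5, 41, 16 + 8, 100, 200)
sample = sample.ljust(1024, b"\0")
print(parse_revoke_block(sample, is_64bit=False))
```
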
Commit Block
~~~~~~~~~~~~

The commit block is a sentry that indicates that a transaction has been
completely written to the journal. Once this commit block reaches the
journal, the data stored with this transaction can be written to their
final locations on disk.

The commit block is described by ``struct commit_header``, which is 60
bytes long (but uses a full block):

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal_header_s
     - (open coded)
     - Common block header.
   * - 0xC
     - unsigned char
     - h_chksum_type
     - The type of checksum to use to verify the integrity of the data blocks
       in the transaction. See jbd2_checksum_type_ for more info.
   * - 0xD
     - unsigned char
     - h_chksum_size
     - The number of bytes used by the checksum. Most likely 4.
   * - 0xE
     - unsigned char
     - h_padding[2]
     -
   * - 0x10
     - __be32
     - h_chksum[JBD2_CHECKSUM_BYTES]
     - 32 bytes of space to store checksums. If
       JBD2_FEATURE_INCOMPAT_CSUM_V2 or JBD2_FEATURE_INCOMPAT_CSUM_V3
       is set, the first ``__be32`` is the checksum of the journal UUID and
       the entire commit block, with this field zeroed. If
       JBD2_FEATURE_COMPAT_CHECKSUM is set, the first ``__be32`` is the
       crc32 of all the blocks already written to the transaction.
   * - 0x30
     - __be64
     - h_commit_sec
     - The time that the transaction was committed, in seconds since the epoch.
   * - 0x38
     - __be32
     - h_commit_nsec
     - Nanoseconds component of the above timestamp.

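The commit header offsets above translate directly into a ``struct`` format
string, which also serves as a check on the layout: the fields add up to 60
bytes. This is an illustrative sketch; ``parse_commit`` is a hypothetical
helper and the sample block is synthetic.

```python
import struct
from datetime import datetime, timezone

# Header (3 x be32), h_chksum_type, h_chksum_size, h_padding[2],
# h_chksum (8 x be32 = 32 bytes), h_commit_sec (be64), h_commit_nsec (be32).
COMMIT = struct.Struct(">3I B B 2x 8I Q I")


def parse_commit(block: bytes) -> dict:
    """Decode the commit header at the start of a commit block."""
    fields = COMMIT.unpack_from(block, 0)
    magic, blocktype, sequence, chksum_type, chksum_size = fields[:5]
    if magic != 0xC03B3998 or blocktype != 2:
        raise ValueError("not a commit block")
    return {
        "sequence": sequence,
        "chksum_type": chksum_type,
        "chksum_size": chksum_size,
        "first_chksum": fields[5],  # first __be32 of h_chksum
        "committed": datetime.fromtimestamp(fields[13], tz=timezone.utc),
        "nsec": fields[14],
    }


sample = COMMIT.pack(
    0xC03B3998, 2, 41, 4, 4, 0xDEADBEEF, 0, 0, 0, 0, 0, 0, 0, 1700000000, 5
).ljust(1024, b"\0")
info = parse_commit(sample)
print(info["sequence"], hex(info["first_chksum"]), info["committed"])
```
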
Fast commits
~~~~~~~~~~~~

The fast commit area is organized as a log of tag-length-value (TLV)
records. Each TLV begins with a ``struct ext4_fc_tl``, which stores the tag
and the length of the entire field, followed by a variable-length,
tag-specific value. Here is the list of supported tags and their meanings:

.. list-table::
   :widths: 8 20 20 32
   :header-rows: 1

   * - Tag
     - Meaning
     - Value struct
     - Description
   * - EXT4_FC_TAG_HEAD
     - Fast commit area header
     - ``struct ext4_fc_head``
     - Stores the TID of the transaction after which these fast commits should
       be applied.
   * - EXT4_FC_TAG_ADD_RANGE
     - Add extent to inode
     - ``struct ext4_fc_add_range``
     - Stores the inode number and the extent to be added to this inode.
   * - EXT4_FC_TAG_DEL_RANGE
     - Remove logical offset range from inode
     - ``struct ext4_fc_del_range``
     - Stores the inode number and the logical offset range that needs to be
       removed.
   * - EXT4_FC_TAG_CREAT
     - Create directory entry for a newly created file
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode number and directory entry of the
       newly created file.
   * - EXT4_FC_TAG_LINK
     - Link a directory entry to an inode
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode number and directory entry.
   * - EXT4_FC_TAG_UNLINK
     - Unlink a directory entry of an inode
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode number and directory entry.
   * - EXT4_FC_TAG_PAD
     - Padding (unused area)
     - None
     - Unused bytes in the fast commit area.
   * - EXT4_FC_TAG_TAIL
     - Mark the end of a fast commit
     - ``struct ext4_fc_tail``
     - Stores the TID of the commit and the CRC of the fast commit that this
       tag ends.

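Walking a tag-length-value log is a generic technique, sketched below. The
exact layout of ``struct ext4_fc_tl`` is not given in this document, so the
sketch makes two loudly labeled assumptions: the tag and length are each a
16-bit little-endian integer, and the length counts only the value bytes
that follow. The tag numbers in the sample are arbitrary placeholders, not
the real ``EXT4_FC_TAG_*`` values.

```python
import struct

# Assumed layout (not taken from this document): 16-bit tag, 16-bit length.
TL = struct.Struct("<HH")


def walk_tlvs(area: bytes):
    """Yield (tag, value) pairs from a TLV log until the data runs out."""
    off = 0
    while off + TL.size <= len(area):
        tag, length = TL.unpack_from(area, off)
        off += TL.size
        if off + length > len(area):
            break  # truncated record; stop rather than read past the end
        yield tag, area[off:off + length]
        off += length


# Two placeholder records: tag 9 with an 8-byte value, tag 7 with no value.
area = TL.pack(9, 8) + bytes(8) + TL.pack(7, 0)
print(list(walk_tlvs(area)))
```

A real reader would additionally stop at the tail tag and verify its CRC
before trusting any of the records that precede it.
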
Fast Commit Replay Idempotence
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Fast commit tags are idempotent in nature provided the recovery code follows
certain rules. The guiding principle that the commit path follows while
committing is that it stores the result of a particular operation instead of
storing the procedure.

Let's consider this rename operation: 'mv /a /b'. Let's assume dirent '/a'
was associated with inode 10. During fast commit, instead of storing this
operation as a procedure "rename a to b", we store the resulting file system
state as a "series" of outcomes:

- Link dirent b to inode 10
- Unlink dirent a
- Inode 10 with valid refcount

Now when the recovery code runs, it needs to "enforce" this state on the file
system. This is what guarantees idempotence of fast commit replay.

Let's take an example of a procedure that is not idempotent and see how fast
commits make it idempotent. Consider the following sequence of operations:

1) rm A
2) mv B A
3) read A

If we store this sequence of operations as is, then the replay is not
idempotent. Let's say that while in replay, we crash after (2). During the
second replay, file A (which was actually created as a result of the
"mv B A" operation) would get deleted. Thus, the file named A would be absent
when we try to read A. So, this sequence of operations is not idempotent.
However, as mentioned above, instead of storing the procedure, fast commits
store the outcome of each procedure. Thus the fast commit log for the above
procedure would be as follows:

(Let's assume dirent A was linked to inode 10 and dirent B was linked to
inode 11 before the replay)

1) Unlink A
2) Link A to inode 11
3) Unlink B
4) Inode 11

If we crash after (3), we will have file A linked to inode 11. During the
second replay, we will remove file A (inode 11). But we will create it back
and make it point to inode 11. We won't find B, so we'll just skip that step.
At this point, the refcount for inode 11 is not reliable, but that gets fixed
by the replay of the last inode 11 tag. Thus, by converting a non-idempotent
procedure into a series of idempotent outcomes, fast commits ensure
idempotence during the replay.

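The argument above can be checked on a toy model. The sketch below keeps the
namespace as a dict from dirent name to inode number and the link counts as a
second dict, then replays the four-entry log once cleanly and once with a
simulated crash after step 3 followed by a full second replay. Everything
here (``replay``, the tuple encoding of the log) is a made-up illustration,
not the ext4 recovery code.

```python
def replay(dirents: dict, nlink: dict, log, stop_after=None):
    """Apply fast-commit style outcomes to a toy namespace in place."""
    for step, (op, *args) in enumerate(log, start=1):
        if op == "unlink":
            dirents.pop(args[0], None)   # missing dirent: nothing to do
        elif op == "link":
            dirents[args[0]] = args[1]   # enforce name -> inode
        elif op == "inode":
            nlink[args[0]] = args[1]     # overwrite with the logged count
        if step == stop_after:
            return                       # simulated crash


log = [("unlink", "A"), ("link", "A", 11), ("unlink", "B"), ("inode", 11, 1)]

clean = ({"A": 10, "B": 11}, {11: 1})
replay(*clean, log)

crashed = ({"A": 10, "B": 11}, {11: 1})
replay(*crashed, log, stop_after=3)   # first replay dies after step 3
replay(*crashed, log)                 # second replay starts from the top

print(clean, crashed, clean == crashed)
```

Each entry sets state rather than transforming it, which is why running the
log again from the top converges on the same result.
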
Journal Checkpoint
~~~~~~~~~~~~~~~~~~

Checkpointing the journal ensures all transactions and their associated buffers
are submitted to the disk. In-progress transactions are waited upon and included
in the checkpoint. Checkpointing is used internally during critical updates to
the filesystem including journal recovery, filesystem resizing, and freeing of
the journal_t structure.

A journal checkpoint can be triggered from userspace via the ioctl
EXT4_IOC_CHECKPOINT. This ioctl takes a single u64 argument for flags.
Currently, three flags are supported. First, EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN
can be used to verify input to the ioctl. It returns an error if there is any
invalid input; otherwise it returns success without performing
any checkpointing. This can be used to check whether the ioctl exists on a
system and to verify there are no issues with arguments or flags. The
other two flags are EXT4_IOC_CHECKPOINT_FLAG_DISCARD and
EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT. These flags cause the journal blocks to be
discarded or zero-filled, respectively, after the journal checkpoint is
complete. EXT4_IOC_CHECKPOINT_FLAG_DISCARD and EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT
cannot both be set. The ioctl may be useful when snapshotting a system or for
complying with content deletion SLOs.
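
The flag rules in the paragraph above can be written down as a small
user-side check. This is a sketch only: the numeric flag values are
assumptions that should be verified against the ext4 headers of the target
kernel, ``check_flags`` is a hypothetical helper, and the kernel remains the
authority on what it accepts (which is what the dry-run flag is for).

```python
# Assumed values, not stated in this document; verify against the headers.
FLAG_DISCARD = 0x1
FLAG_ZEROOUT = 0x2
FLAG_DRY_RUN = 0x4
FLAG_VALID = FLAG_DISCARD | FLAG_ZEROOUT | FLAG_DRY_RUN


def check_flags(flags: int) -> bool:
    """True if flags use only known bits and obey the exclusivity rule."""
    if flags & ~FLAG_VALID:
        return False  # unknown bit set
    if flags & FLAG_DISCARD and flags & FLAG_ZEROOUT:
        return False  # DISCARD and ZEROOUT cannot both be set
    return True


for flags in (0, FLAG_DRY_RUN | FLAG_DISCARD, FLAG_DISCARD | FLAG_ZEROOUT):
    print(hex(flags), check_flags(flags))
```
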