.. SPDX-License-Identifier: GPL-2.0

Layout
------

The layout of a standard block group is approximately as follows (each
of these fields is discussed in a separate section below):

.. list-table::
   :widths: 1 1 1 1 1 1 1 1
   :header-rows: 1

   * - Group 0 Padding
     - ext4 Super Block
     - Group Descriptors
     - Reserved GDT Blocks
     - Data Block Bitmap
     - inode Bitmap
     - inode Table
     - Data Blocks
   * - 1024 bytes
     - 1 block
     - many blocks
     - many blocks
     - 1 block
     - 1 block
     - many blocks
     - many more blocks

For the special case of block group 0, the first 1024 bytes are unused,
to allow for the installation of x86 boot sectors and other oddities.
The superblock will start at offset 1024 bytes, whichever block that
happens to be (usually 0). However, if the block size is 1024 bytes,
then block 0 is marked in use and the superblock goes in block 1.
For all other block groups, there is no padding.
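
The rule above can be checked with a little arithmetic. The following is
a minimal sketch, not kernel code; the function name and the constant
are made up for illustration. It maps the fixed 1024-byte offset onto a
block number and an offset within that block for a given block size.

```python
SUPERBLOCK_OFFSET = 1024  # bytes of padding before the primary superblock


def primary_superblock_location(block_size):
    """Return (block_number, offset_within_block) of the primary superblock.

    Hypothetical helper: it only divides the fixed byte offset by the
    block size, which is all the text above describes.
    """
    return divmod(SUPERBLOCK_OFFSET, block_size)


print(primary_superblock_location(1024))  # (1, 0)
print(primary_superblock_location(4096))  # (0, 1024)
```

With 1024-byte blocks the superblock occupies all of block 1; with
larger blocks it sits 1024 bytes into block 0.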

The ext4 driver primarily works with the superblock and the group
descriptors that are found in block group 0. Redundant copies of the
superblock and group descriptors are written to some of the block groups
across the disk in case the beginning of the disk gets trashed, though
not all block groups necessarily host a redundant copy (see following
paragraph for more details). If the group does not have a redundant
copy, the block group begins with the data block bitmap. Note also that
when the filesystem is freshly formatted, mkfs will allocate “reserve
GDT block” space after the block group descriptors and before the start
of the block bitmaps to allow for future expansion of the filesystem. By
default, a filesystem is allowed to increase in size by a factor of
1024x over the original filesystem size.

The location of the inode table is given by ``grp.bg_inode_table_*``. It
is a contiguous range of blocks large enough to contain
``sb.s_inodes_per_group * sb.s_inode_size`` bytes.
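
As a rough illustration of the size calculation, the sketch below rounds
the byte count up to whole blocks. The function name is hypothetical and
the example values (8192 inodes per group, 256-byte inodes, 4096-byte
blocks) are assumptions chosen for the arithmetic, not values taken from
this document.

```python
def inode_table_blocks(inodes_per_group, inode_size, block_size):
    """Blocks needed to hold one group's inode table (rounded up)."""
    table_bytes = inodes_per_group * inode_size
    # Ceiling division without floating point.
    return -(-table_bytes // block_size)


# Assumed example geometry, for illustration only.
print(inode_table_blocks(8192, 256, 4096))  # 512
```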

As for the ordering of items in a block group, it is generally
established that the superblock and the group descriptor table, if
present, will be at the beginning of the block group. The bitmaps and
the inode table can be anywhere, and it is quite possible for the
bitmaps to come after the inode table, or for both to be in different
groups (flex_bg). Leftover space is used for file data blocks, indirect
block maps, extent tree blocks, and extended attributes.

Flexible Block Groups
---------------------

Starting in ext4, there is a new feature called flexible block groups
(flex_bg). In a flex_bg, several block groups are tied together as one
logical block group; the bitmap spaces and the inode table space in the
first block group of the flex_bg are expanded to include the bitmaps
and inode tables of all other block groups in the flex_bg. For example,
if the flex_bg size is 4, then group 0 will contain (in order) the
superblock, group descriptors, data block bitmaps for groups 0-3, inode
bitmaps for groups 0-3, inode tables for groups 0-3, and the remaining
space in group 0 is for file data. The effect of this is to group the
block group metadata close together for faster loading, and to enable
large files to be contiguous on disk. Backup copies of the superblock
and group descriptors are always at the beginning of block groups, even
if flex_bg is enabled. The number of block groups that make up a
flex_bg is given by 2 ^ ``sb.s_log_groups_per_flex``.
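
The power-of-two relationship can be sketched as follows. The helper
names are hypothetical, and the sketch assumes that each flex_bg is an
aligned run of consecutive block groups, so that a right shift by
``s_log_groups_per_flex`` yields the flex_bg index.

```python
def groups_per_flex(s_log_groups_per_flex):
    """Number of block groups in one flex_bg: 2 ^ s_log_groups_per_flex."""
    return 1 << s_log_groups_per_flex


def flex_group_of(group, s_log_groups_per_flex):
    """Index of the flex_bg containing `group` (assumes aligned runs)."""
    return group >> s_log_groups_per_flex


def first_group_in_flex(group, s_log_groups_per_flex):
    """First block group of the flex_bg that contains `group`."""
    return flex_group_of(group, s_log_groups_per_flex) << s_log_groups_per_flex


# With s_log_groups_per_flex = 2 the flex_bg size is 4, as in the example.
print(groups_per_flex(2))         # 4
print(flex_group_of(6, 2))        # 1
print(first_group_in_flex(6, 2))  # 4
```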

Meta Block Groups
-----------------

Without the META_BG option, for safety reasons, all block group
descriptor copies are kept in the first block group. Given the default
128 MiB (2^27 bytes) block group size and 64-byte group descriptors, ext4
can have at most 2^27 / 64 = 2^21 block groups. This limits the entire
filesystem size to 2^21 * 2^27 = 2^48 bytes, or 256 TiB.
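
The limit follows directly from the two numbers given above. A short
sketch of the arithmetic, with variable names chosen here for
illustration:

```python
block_group_bytes = 1 << 27  # default 128 MiB block group
desc_size = 64               # bytes per group descriptor

# Every descriptor must fit inside the first block group.
max_groups = block_group_bytes // desc_size
max_fs_bytes = max_groups * block_group_bytes

print(max_groups == 1 << 21)    # True
print(max_fs_bytes == 1 << 48)  # True
print(max_fs_bytes // (1 << 40), "TiB")  # 256 TiB
```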

The solution to this problem is to use the metablock group feature
(META_BG), which is already in ext3 for all 2.6 releases. With the
META_BG feature, ext4 filesystems are partitioned into many metablock
groups. Each metablock group is a cluster of block groups whose group
descriptor structures can be stored in a single disk block. For ext4
filesystems with 4 KB block size and 64-byte group descriptors, a single
metablock group partition includes 4096 / 64 = 64 block groups, or 8 GiB
of disk space. The metablock group feature moves the location of the
group descriptors from the congested first block group of the whole
filesystem into the first group of each metablock group itself. The
backups are in the second and last group of each metablock group. This
increases the 2^21 maximum block groups limit to the hard limit 2^32,
allowing support for a 512 PiB filesystem.
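
The figures in this paragraph can be reproduced with the sketch below.
The function name is hypothetical; the inputs are the 4 KB block size,
64-byte descriptor, and 128 MiB block group size quoted in the text.

```python
def meta_bg_geometry(block_size, desc_size, block_group_bytes):
    """Return (block groups per metablock group, bytes it spans)."""
    groups = block_size // desc_size  # descriptors that fit in one block
    return groups, groups * block_group_bytes


groups, span = meta_bg_geometry(4096, 64, 1 << 27)
print(groups)              # 64
print(span // (1 << 30))   # 8 (GiB)

# With the group count capped at 2^32 rather than 2^21:
max_fs_bytes = (1 << 32) * (1 << 27)
print(max_fs_bytes // (1 << 50))  # 512 (PiB)
```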

The change in the filesystem format replaces the current scheme where
the superblock is followed by a variable-length set of block group
descriptors. Instead, the superblock and a single block group descriptor
block are placed at the beginning of the first, second, and last block
groups in a meta-block group. A meta-block group is a collection of
block groups which can be described by a single block group descriptor
block. Since the size of the block group descriptor structure is 32
bytes (when the 64bit feature is not enabled), a meta-block group
contains 32 block groups for filesystems with a 1 KB block size, and 128
block groups for filesystems with a 4 KB block size. Filesystems can
either be created using this new block group descriptor layout, or
existing filesystems can be resized online, and the field
``s_first_meta_bg`` in the superblock will indicate the first block
group using this new layout.
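
To make the "first, second, and last" placement concrete, here is a
minimal sketch with a hypothetical helper name. It assumes every
meta-block group is full-sized (it ignores a shorter final meta-block
group at the end of the filesystem) and that the group in question lies
at or beyond ``s_first_meta_bg``.

```python
def meta_bg_descriptor_groups(group, groups_per_meta_bg):
    """Block groups that hold the descriptor block covering `group`.

    Returns the first, second, and last block group of the meta-block
    group that `group` belongs to, per the description above.
    """
    first = (group // groups_per_meta_bg) * groups_per_meta_bg
    return [first, first + 1, first + groups_per_meta_bg - 1]


# 1 KB blocks, 32-byte descriptors: 32 groups per meta-block group.
print(meta_bg_descriptor_groups(70, 32))    # [64, 65, 95]
# 4 KB blocks, 32-byte descriptors: 128 groups per meta-block group.
print(meta_bg_descriptor_groups(300, 128))  # [256, 257, 383]
```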

Please see an important note about ``BLOCK_UNINIT`` in the section about
block and inode bitmaps.

Lazy Block Group Initialization
-------------------------------

A new feature for ext4 is a set of three block group descriptor flags
that enable mkfs to skip initializing other parts of the block group
metadata. Specifically, the INODE_UNINIT and BLOCK_UNINIT flags mean
that the inode and block bitmaps for that group can be calculated and
therefore the on-disk bitmap blocks are not initialized. This is
generally the case for an empty block group or a block group containing
only fixed-location block group metadata. The INODE_ZEROED flag means
that the inode table has been initialized; mkfs leaves this flag unset
and relies on the kernel to initialize the inode tables in the
background.
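
The meaning of the three flags can be summarized in a small decoder.
This is an illustrative sketch only: the numeric values (0x1, 0x2, 0x4)
and the ``bg_flags`` field name are not given in this section and are
assumptions here, so check them against the group descriptor definition
before relying on them.

```python
# Assumed flag values; not stated in this section.
INODE_UNINIT = 0x1  # inode bitmap can be calculated, not initialized on disk
BLOCK_UNINIT = 0x2  # block bitmap can be calculated, not initialized on disk
INODE_ZEROED = 0x4  # inode table has been initialized


def decode_bg_flags(bg_flags):
    """Describe which parts of a group's metadata are initialized on disk."""
    return {
        "inode_bitmap_initialized": not bg_flags & INODE_UNINIT,
        "block_bitmap_initialized": not bg_flags & BLOCK_UNINIT,
        "inode_table_zeroed": bool(bg_flags & INODE_ZEROED),
    }


# A group whose bitmaps and inode table were all skipped by mkfs.
print(decode_bg_flags(INODE_UNINIT | BLOCK_UNINIT))
```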

By not writing zeroes to the bitmaps and inode table, mkfs time is
reduced considerably. Note the feature flag is RO_COMPAT_GDT_CSUM,
but the dumpe2fs output prints this as “uninit_bg”. They are the same
thing.