filesystems/ext4/allocators.rst

.. SPDX-License-Identifier: GPL-2.0

Block and Inode Allocation Policy
---------------------------------

ext4 recognizes (better than ext3, anyway) that data locality is
generally a desirable quality of a filesystem. On a spinning disk,
keeping related blocks near each other reduces the amount of movement
that the head actuator and disk must perform to access a data block,
thus speeding up disk I/O. On an SSD there are of course no moving
parts, but locality can increase the size of each transfer request
while reducing the total number of requests. This locality may also
have the effect of concentrating writes on a single erase block, which
can speed up file rewrites significantly. Therefore, it is useful to
reduce fragmentation whenever possible.
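
The effect of locality on request count can be illustrated with a toy
model. The sketch below is not how the block layer merges requests; it
only assumes that a run of consecutive block numbers can be served by
one transfer. The function name ``coalesce`` and the block numbers are
made up for illustration.

```python
def coalesce(blocks):
    """Merge block numbers into (start, length) runs of consecutive blocks.

    Blocks are processed in the order given; a block extends the current
    run only if it immediately follows the run's last block.
    """
    runs = []
    for b in blocks:
        if runs and runs[-1][0] + runs[-1][1] == b:
            start, length = runs[-1]
            runs[-1] = (start, length + 1)
        else:
            runs.append((b, 1))
    return runs


# Eight blocks laid out contiguously: one large request.
contiguous = coalesce([100, 101, 102, 103, 104, 105, 106, 107])

# The same amount of data scattered in pairs: four small requests.
fragmented = coalesce([100, 101, 500, 501, 900, 901, 1300, 1301])
```

In this model the contiguous file needs a single eight-block request,
while the fragmented one needs four two-block requests for the same
amount of data.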

The first tool that ext4 uses to combat fragmentation is the multi-block
allocator. When a file is first created, the block allocator
speculatively allocates 8KiB of disk space to the file on the assumption
that the space will get written soon. When the file is closed, the
unused speculative allocations are of course freed, but if the
speculation is correct (typically the case for full writes of small
files) then the file data gets written out in a single multi-block
extent.

A second, related trick that ext4 uses is delayed allocation. Under this
scheme, when a file needs more blocks to absorb file writes, the
filesystem defers deciding the exact placement on the disk until the
dirty buffers are actually being written out to disk. By not committing
to a particular placement until it's absolutely necessary (the commit
timeout is hit, or sync() is called, or the kernel runs out of memory),
the hope is that the filesystem can make better location decisions.
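
The idea behind delayed allocation can be sketched as follows. This is a
deliberately simplified model, not the kernel's implementation: the
class ``ToyDelayedFile``, its methods, and the ``next_free_block``
parameter are all hypothetical, and free space is assumed to be one
contiguous region.

```python
class ToyDelayedFile:
    """Remember dirty logical blocks; choose physical blocks only at flush."""

    def __init__(self):
        self.dirty = set()   # logical blocks written but not yet placed
        self.mapping = {}    # logical block -> physical block

    def write(self, logical_block):
        # No placement decision is made here; the block is only marked
        # dirty unless it already has a physical location.
        if logical_block not in self.mapping:
            self.dirty.add(logical_block)

    def flush(self, next_free_block):
        # At writeback time the full set of dirty blocks is known, so a
        # single contiguous physical run can be handed out for all of them.
        for offset, logical in enumerate(sorted(self.dirty)):
            self.mapping[logical] = next_free_block + offset
        allocated = len(self.dirty)
        self.dirty.clear()
        return allocated


f = ToyDelayedFile()
for logical_block in (2, 0, 1, 3):   # writes arrive out of order
    f.write(logical_block)
allocated = f.flush(next_free_block=1000)
```

Even though the writes arrive out of order, the four logical blocks end
up in one physical run starting at block 1000, because placement was
decided only once all of them were known.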

The third trick that ext4 (and ext3) uses is that it tries to keep a
file's data blocks in the same block group as its inode. This cuts down
on the seek penalty when the filesystem first has to read a file's inode
to learn where the file's data blocks live and then seek over to the
file's data blocks to begin I/O operations.
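
To make "the same block group as its inode" concrete, the sketch below
computes which block group an inode belongs to and which block range
that group covers. The per-group counts are example values chosen for
illustration, not read from any real superblock, and the function names
are hypothetical. The last group of a real filesystem may be shorter
than the range computed here.

```python
def inode_block_group(ino, inodes_per_group):
    """Return the block group holding inode number `ino`.

    Inode numbers start at 1, hence the subtraction.
    """
    return (ino - 1) // inodes_per_group


def group_block_range(group, blocks_per_group, first_data_block=0):
    """Return the (first, last) block numbers covered by `group`."""
    start = first_data_block + group * blocks_per_group
    return start, start + blocks_per_group - 1


# Example values (assumptions for this sketch).
INODES_PER_GROUP = 8192
BLOCKS_PER_GROUP = 32768

group = inode_block_group(20000, INODES_PER_GROUP)
first, last = group_block_range(group, BLOCKS_PER_GROUP)
```

With these example values, inode 20000 lives in group 2, so the
allocator would prefer to place its data somewhere in blocks 65536
through 98303.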

The fourth trick is that all the inodes in a directory are placed in the
same block group as the directory, when feasible. The working assumption
here is that the files in a directory are likely to be related, so it is
useful to try to keep them all together.
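
A rough sketch of the "when feasible" part: prefer the parent
directory's block group, and fall back to another group only when the
parent's group has no free inodes. The linear, wrapping fallback scan
used here is an assumption made to keep the example short; the kernel's
actual search order is more elaborate. ``pick_group_for_file`` and its
arguments are hypothetical names.

```python
def pick_group_for_file(parent_group, free_inodes):
    """Choose a block group for a new file's inode.

    free_inodes[g] is the number of free inodes in group g. The parent
    directory's group is tried first; otherwise groups are scanned in
    order, wrapping around. Returns None if no group has a free inode.
    """
    ngroups = len(free_inodes)
    for step in range(ngroups):
        group = (parent_group + step) % ngroups
        if free_inodes[group] > 0:
            return group
    return None


same_group = pick_group_for_file(3, [5, 5, 5, 2, 5])   # parent has room
spilled = pick_group_for_file(3, [5, 5, 5, 0, 0])      # parent is full
```

In the first call the file's inode stays in the parent's group 3; in
the second, groups 3 and 4 are full, so the scan wraps around and
settles on group 0.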

The fifth trick is that the disk volume is cut up into 128MiB block
groups (with the default 4KiB block size); these mini-containers are
used as outlined above to try to maintain data locality. However, there
is a deliberate quirk -- when a directory is created in the root
directory, the inode allocator scans the block groups and puts that
directory into the least heavily loaded block group that it can find.
This encourages directories to spread out over a disk; as the top-level
directory/file blobs fill up one block group, the allocators simply move
on to the next block group. Allegedly this scheme evens out the loading
on the block groups, though the author suspects that the directories
which are so unlucky as to land towards the end of a spinning drive get
a raw deal performance-wise.
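
Two parts of this paragraph lend themselves to a small sketch. The
first is the group size: with 4KiB blocks, a one-block bitmap tracks
4096 * 8 blocks, which works out to 128MiB per group. The second is the
spreading of top-level directories. "Least heavily loaded" is modeled
below simply as "most free blocks among groups that still have a free
inode"; that metric is an assumption for illustration, as the kernel
weighs more than one counter. ``least_loaded_group`` is a hypothetical
name.

```python
BLOCK_SIZE = 4096                              # bytes, the common default
blocks_per_group = BLOCK_SIZE * 8              # one bitmap block, one bit per block
group_bytes = blocks_per_group * BLOCK_SIZE    # 134217728 bytes = 128 MiB


def least_loaded_group(free_blocks, free_inodes):
    """Pick a group for a new top-level directory.

    Considers only groups with at least one free inode and returns the
    one with the most free blocks (the lowest-numbered group on a tie).
    Returns None if every group is out of inodes.
    """
    candidates = [g for g in range(len(free_blocks)) if free_inodes[g] > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda g: free_blocks[g])


chosen = least_loaded_group([100, 900, 900, 50], [10, 10, 10, 10])
```

Here groups 1 and 2 are equally empty, so the new directory goes to
group 1. Files created inside that directory would then tend to follow
it there, per the third and fourth tricks.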

Of course, if all of these mechanisms fail, one can always use e4defrag
to defragment files.