.. SPDX-License-Identifier: GPL-2.0

======================================
EROFS - Enhanced Read-Only File System
======================================

Overview
========

EROFS stands for Enhanced Read-Only File System.  It aims to form a generic
read-only filesystem solution for various read-only use cases instead of just
focusing on storage space saving without considering any side effects on
runtime performance.

It is designed to meet the needs of flexibility, feature extensibility and
user payload friendliness, etc.  Apart from those, it is still kept as a simple
random-access friendly high-performance filesystem to get rid of unneeded I/O
amplification and memory-resident overhead compared to similar approaches.

It is implemented to be a better choice for the following scenarios:

 - read-only storage media, or

 - part of a fully trusted read-only solution, which needs to be immutable
   and bit-for-bit identical to the official golden image of a release for
   security or other reasons, or

 - setups that need to minimize extra storage space with guaranteed
   end-to-end performance by using a compact layout, transparent file
   compression and direct access, especially embedded devices with limited
   memory and high-density hosts with numerous containers.

Here are the main features of EROFS:

 - Little endian on-disk design;

 - 4KiB block size and 32-bit block addresses, therefore 16TiB address space
   at most for now;

 - Two inode layouts for different requirements:

   =====================  ============  ======================================
                          compact (v1)  extended (v2)
   =====================  ============  ======================================
   Inode metadata size    32 bytes      64 bytes
   Max file size          4 GiB         16 EiB (also limited by max. vol size)
   Max uids/gids          65536         4294967296
   Per-inode timestamp    no            yes (64 + 32-bit timestamp)
   Max hardlinks          65536         4294967296
   Metadata reserved      8 bytes       18 bytes
   =====================  ============  ======================================

 - Metadata and data can be mixed as an option;

 - Support extended attributes (xattrs) as an option;

 - Support tailpacking data and xattr inline compared to byte-addressed
   unaligned metadata or smaller block size alternatives;

 - Support POSIX.1e ACLs by using xattrs;

 - Support transparent data compression as an option:
   LZ4 and MicroLZMA algorithms can be used on a per-file basis.  In addition,
   in-place decompression is also supported to avoid bounce buffers for
   compressed data and page cache thrashing;

 - Support direct I/O on uncompressed files to avoid double caching for loop
   devices;

 - Support FSDAX on uncompressed images for secure containers and ramdisks in
   order to get rid of unnecessary page cache;

 - Support multiple devices for multi-blob container images;

 - Support file-based on-demand loading with the Fscache infrastructure.

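The address-space and compact-inode limits in the feature list follow from
field widths.  A minimal sketch of that arithmetic, assuming the 4KiB block
size and 32-bit block addresses stated above (the constant names are
illustrative and do not come from the kernel sources):

```python
# Illustrative constants; the names are not taken from the kernel sources.
BLOCK_SIZE = 4 * 1024          # 4KiB blocks
BLOCK_ADDR_BITS = 32           # 32-bit block addresses

TIB = 1024 ** 4
GIB = 1024 ** 3

# Each block address selects one block, so the addressable space is
# (number of distinct addresses) * (block size).
max_volume_bytes = (2 ** BLOCK_ADDR_BITS) * BLOCK_SIZE

# The compact-inode limits in the table are consistent with a 32-bit size
# field and 16-bit uid/gid/link-count fields.  This is an inference from the
# table, not a statement about the exact on-disk structures.
compact_max_file_bytes = 2 ** 32
compact_max_ids = 2 ** 16

print(max_volume_bytes // TIB, "TiB")
print(compact_max_file_bytes // GIB, "GiB")
print(compact_max_ids)
```
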
The following git tree provides the file system user-space tools under
development, such as a formatting tool (mkfs.erofs), an on-disk consistency &
compatibility checking tool (fsck.erofs), and a debugging tool (dump.erofs):

- git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git

Bug reports and patches are welcome.  Please send them to the linux-erofs
mailing list:

- linux-erofs mailing list   <linux-erofs@lists.ozlabs.org>

Mount options
=============

===================    =========================================================
(no)user_xattr         Set up extended user attributes. Note: xattr is enabled
                       by default if CONFIG_EROFS_FS_XATTR is selected.
(no)acl                Set up POSIX access control lists. Note: acl is enabled
                       by default if CONFIG_EROFS_FS_POSIX_ACL is selected.
cache_strategy=%s      Select a strategy for cached decompression from now on:

                       ==========  =============================================
                         disabled  In-place I/O decompression only;
                        readahead  Cache the last incomplete compressed physical
                                   cluster for further reading. It still does
                                   in-place I/O decompression for the remaining
                                   compressed physical clusters;
                       readaround  Cache both ends of incomplete compressed
                                   physical clusters for further reading.
                                   It still does in-place I/O decompression
                                   for the remaining compressed physical
                                   clusters.
                       ==========  =============================================
dax={always,never}     Use direct access (no page cache).  See
                       Documentation/filesystems/dax.rst.
dax                    A legacy option which is an alias for ``dax=always``.
device=%s              Specify a path to an extra device to be used together.
fsid=%s                Specify a filesystem image ID for Fscache back-end.
===================    =========================================================

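As a rough illustration of how the options in the table combine, the sketch
below assembles an argument list for mount(8) without executing anything.
The helper name ``erofs_mount_argv`` and its parameters are hypothetical; only
the option spellings come from the table above.

```python
# Values accepted by the options documented in the table above.
CACHE_STRATEGIES = ("disabled", "readahead", "readaround")
DAX_MODES = ("always", "never")


def erofs_mount_argv(source, target, cache_strategy=None, dax=None,
                     user_xattr=None, acl=None, devices=()):
    """Build an argv list for mount(8); hypothetical helper, runs nothing."""
    opts = []
    if user_xattr is not None:
        opts.append("user_xattr" if user_xattr else "nouser_xattr")
    if acl is not None:
        opts.append("acl" if acl else "noacl")
    if cache_strategy is not None:
        if cache_strategy not in CACHE_STRATEGIES:
            raise ValueError("unknown cache_strategy: %r" % (cache_strategy,))
        opts.append("cache_strategy=" + cache_strategy)
    if dax is not None:
        if dax not in DAX_MODES:
            raise ValueError("unknown dax mode: %r" % (dax,))
        opts.append("dax=" + dax)
    # One device= option per extra device.
    for path in devices:
        opts.append("device=" + path)

    argv = ["mount", "-t", "erofs"]
    if opts:
        argv += ["-o", ",".join(opts)]
    return argv + [source, target]


print(erofs_mount_argv("/dev/sda", "/mnt", cache_strategy="readaround"))
```

Rejecting unknown ``cache_strategy`` and ``dax`` values early is a design
choice of this sketch, not a description of how the kernel parses options.
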
Sysfs Entries
=============

Information about mounted erofs file systems can be found in /sys/fs/erofs.
Each mounted filesystem will have a directory in /sys/fs/erofs based on its
device name (e.g., /sys/fs/erofs/sda).
See also Documentation/ABI/testing/sysfs-fs-erofs.

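A minimal sketch for enumerating these directories from user space.  The
function name is hypothetical, and the root path is a parameter so that the
code can be exercised on a system without EROFS mounts.  Directories that do
not correspond to a device may also be present, so the result should be
treated as a list of candidates.

```python
import os


def list_erofs_sysfs_dirs(sysfs_root="/sys/fs/erofs"):
    """Return the sorted names of the directories found under sysfs_root.

    An empty list is returned when the root does not exist, for example when
    EROFS is not available on the running kernel.
    """
    if not os.path.isdir(sysfs_root):
        return []
    return sorted(
        name for name in os.listdir(sysfs_root)
        if os.path.isdir(os.path.join(sysfs_root, name))
    )


print(list_erofs_sysfs_dirs())
```
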
On-disk details
===============

Summary
-------
Different from other read-only file systems, an EROFS volume is designed
to be as simple as possible::

                                |-> aligned with the block size
   ____________________________________________________________
  | |SB| | ... | Metadata | ... | Data | Metadata | ... | Data |
  |_|__|_|_____|__________|_____|______|__________|_____|______|
  0 +1K

All data areas should be aligned with the block size, but metadata areas
may not. All metadata can now be observed in two different spaces (views):

 1. Inode metadata space

    Each valid inode should be aligned with an inode slot, which is a fixed
    value (32 bytes) and designed to be kept in line with compact inode size.

    Each inode can be directly found with the following formula::

         inode offset = meta_blkaddr * block_size + 32 * nid

    ::

                                 |-> aligned with 8B
                                            |-> followed closely
     + meta_blkaddr blocks                                      |-> another slot
       _____________________________________________________________________
     |  ...   | inode |  xattrs  | extents  | data inline | ... | inode ...
     |________|_______|(optional)|(optional)|__(optional)_|_____|__________
              |-> aligned with the inode slot size
                   .                   .
                 .                         .
               .                              .
             .                                    .
           .                                         .
         .                                              .
       .____________________________________________________|-> aligned with 4B
       | xattr_ibody_header | shared xattrs | inline xattrs |
       |____________________|_______________|_______________|
       |->    12 bytes    <-|->x * 4 bytes<-|               .
                           .                .                 .
                     .                      .                   .
                .                           .                     .
            ._______________________________.______________________.
            | id | id | id | id |  ... | id | ent | ... | ent| ... |
            |____|____|____|____|______|____|_____|_____|____|_____|
                                            |-> aligned with 4B
                                                        |-> aligned with 4B

    An inode can be 32 or 64 bytes, which can be distinguished by a common
    field that all inode versions have, i_format::

        __________________               __________________
       |     i_format     |             |     i_format     |
       |__________________|             |__________________|
       |        ...       |             |        ...       |
       |                  |             |                  |
       |__________________| 32 bytes    |                  |
                                        |                  |
                                        |__________________| 64 bytes

    Xattrs, extents and inline data follow the corresponding inode with
    proper alignment, and they are optional depending on the data mapping.
    Currently, a total of 5 data layouts are supported:

    ==  ====================================================================
     0  flat file data without data inline (no extent);
     1  fixed-sized output data compression (with non-compacted indexes);
     2  flat file data with tail packing data inline (no extent);
     3  fixed-sized output data compression (with compacted indexes, v5.3+);
     4  chunk-based file (v5.15+).
    ==  ====================================================================

    The size of the optional xattrs is indicated by i_xattr_count in the
    inode header. Large xattrs or xattrs shared by many different files can
    be stored in shared xattrs metadata rather than inlined right after the
    inode.

 2. Shared xattrs metadata space

    The shared xattrs space is similar to the inode space above: it starts
    at a specific block indicated by xattr_blkaddr, and its entries are
    organized one by one with proper alignment.

    Each shared xattr can also be directly found by the following formula::

         xattr offset = xattr_blkaddr * block_size + 4 * xattr_id

::

                           |-> aligned with 4 bytes
    + xattr_blkaddr blocks                     |-> aligned with 4 bytes
     _________________________________________________________________________
    |  ...   | xattr_entry |  xattr data | ... |  xattr_entry | xattr data  ...
    |________|_____________|_____________|_____|______________|_______________

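The two lookup formulas above translate directly into code.  A minimal
sketch, assuming a 4KiB block size as the default; the function and constant
names are illustrative and are not kernel identifiers:

```python
INODE_SLOT_SIZE = 32     # bytes per inode slot, as stated above
SHARED_XATTR_UNIT = 4    # bytes per shared xattr id step, as stated above


def inode_offset(meta_blkaddr, nid, block_size=4096):
    """Byte offset of the inode identified by nid within the image."""
    return meta_blkaddr * block_size + INODE_SLOT_SIZE * nid


def shared_xattr_offset(xattr_blkaddr, xattr_id, block_size=4096):
    """Byte offset of the shared xattr identified by xattr_id."""
    return xattr_blkaddr * block_size + SHARED_XATTR_UNIT * xattr_id


# With meta_blkaddr = 2, nid 36 starts 36 slots after the start of block 2.
print(inode_offset(2, 36))
print(shared_xattr_offset(10, 5))
```

Because an extended inode occupies two 32-byte slots, consecutive inodes do
not necessarily have consecutive nid values.
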
Directories
-----------
All directories are now organized in a compact on-disk format. Note that
each directory block is divided into index and name areas in order to support
random file lookup, and all directory entries are strictly recorded in
alphabetical order to support an improved prefix binary search algorithm
(refer to the related source code).

::

                  ___________________________
                 /                           |
                /              ______________|________________
               /              /              | nameoff1       | nameoffN-1
  ____________.______________._______________v________________v__________
 | dirent | dirent | ... | dirent | filename | filename | ... | filename |
 |___.0___|____1___|_____|___N-1__|____0_____|____1_____|_____|___N-1____|
      \                           ^
       \                          |                           * could have
        \                         |                             trailing '\0'
         \________________________| nameoff0
                             Directory block

Note that apart from the offset of the first filename, nameoff0 also indicates
the total number of directory entries in this block, so there is no need to
introduce another on-disk field for it.

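The directory block layout can be illustrated with a small parser.  This is
a sketch under stated assumptions: the 12-byte little-endian dirent layout
(64-bit nid, 16-bit nameoff, 8-bit file type, 8-bit reserved) is assumed here
and should be checked against erofs_fs.h, and the lookup uses a plain binary
search rather than the prefix-optimized variant mentioned above.

```python
import bisect
import struct

# Assumed dirent layout: nid (u64), nameoff (u16), file_type (u8), reserved (u8).
DIRENT = struct.Struct("<QHBB")


def parse_dir_block(block):
    """Return a list of (name, nid) tuples stored in one directory block."""
    # nameoff0 is where the name area begins, so it also gives the entry count.
    nameoff0 = DIRENT.unpack_from(block, 0)[1]
    count = nameoff0 // DIRENT.size
    dirents = [DIRENT.unpack_from(block, i * DIRENT.size) for i in range(count)]

    entries = []
    for i, (nid, nameoff, _ftype, _reserved) in enumerate(dirents):
        # A name ends where the next one starts; the last one runs to the end
        # of the block and may carry trailing '\0' padding.
        end = dirents[i + 1][1] if i + 1 < count else len(block)
        name = block[nameoff:end].split(b"\0", 1)[0]
        entries.append((name, nid))
    return entries


def lookup(block, name):
    """Binary-search a block for name, relying on the sorted entry order."""
    entries = parse_dir_block(block)
    names = [n for n, _ in entries]
    i = bisect.bisect_left(names, name)
    if i < len(names) and names[i] == name:
        return entries[i][1]
    return None
```

Deriving the entry count from ``nameoff0`` mirrors the note above: the name
area starts right after the last dirent, so no separate counter is needed.
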
Chunk-based files
-----------------
In order to support chunk-based data deduplication, a new inode data layout has
been supported since Linux v5.15: files are split into equal-sized data chunks,
with the ``extents`` area of the inode metadata indicating how to get the chunk
data. This area can be either a simple 4-byte block address array or an array
in the 8-byte chunk index form (see struct erofs_inode_chunk_index in
erofs_fs.h for more details).

Note that chunk-based files are all uncompressed for now.

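A minimal sketch of how a file offset could be mapped to a chunk entry in
either form.  The field order assumed for the 8-byte form (16-bit advise,
16-bit device id, 32-bit block address, little endian) and the choice of
device 0 for the 4-byte form are assumptions of this example; erofs_fs.h is
the authoritative definition.

```python
import struct

BLOCK_MAP_ENTRY = struct.Struct("<I")      # 4-byte block address
CHUNK_INDEX_ENTRY = struct.Struct("<HHI")  # assumed: advise, device_id, blkaddr


def locate_chunk(extents, file_offset, chunk_size, entry_size):
    """Return (device_id, blkaddr, offset_in_chunk) for file_offset.

    extents is the raw ``extents`` area of the inode; entry_size selects the
    4-byte block address form or the 8-byte chunk index form.
    """
    chunk_no, offset_in_chunk = divmod(file_offset, chunk_size)
    if entry_size == BLOCK_MAP_ENTRY.size:
        (blkaddr,) = BLOCK_MAP_ENTRY.unpack_from(extents, chunk_no * entry_size)
        device_id = 0  # assumption: the plain form has no device field
    elif entry_size == CHUNK_INDEX_ENTRY.size:
        _advise, device_id, blkaddr = CHUNK_INDEX_ENTRY.unpack_from(
            extents, chunk_no * entry_size)
    else:
        raise ValueError("entry_size must be 4 or 8")
    return device_id, blkaddr, offset_in_chunk
```

Since all chunks have the same size, the entry for a given offset is found
with one division, which keeps random access cheap.
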
Data compression
----------------
EROFS implements LZ4 fixed-sized output compression, which generates
fixed-sized compressed data blocks from variable-sized input, in contrast to
other existing fixed-sized input solutions. Relatively higher compression
ratios can be achieved with fixed-sized output compression, since popular data
compression algorithms are nowadays mostly LZ77-based and such a fixed-sized
output approach can benefit from the historical dictionary (aka. sliding
window).

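The idea of fixed-sized output compression can be sketched in user space as
follows.  Note that this example uses zlib from the Python standard library
as a stand-in for LZ4, and a simple binary search to pick how much input goes
into each block; it illustrates the concept only and is not how mkfs.erofs
is implemented.

```python
import zlib


def fixed_output_compress(data, block_size=4096):
    """Split data into extents whose compressed form fits in one block.

    Returns a list of (input_length, compressed_bytes) tuples.  Input lengths
    vary from extent to extent while every compressed payload is at most
    block_size bytes, which is the reverse of fixed-sized input schemes.
    """
    extents = []
    pos = 0
    while pos < len(data):
        # A single input byte is assumed to fit in any realistic block size.
        lo, hi = 1, len(data) - pos
        best = zlib.compress(data[pos:pos + 1])
        # Look for a long input prefix that still fits.  'lo' only ever moves
        # to a length that was verified to fit, so the result is always valid
        # even if the compressed size is not strictly monotonic.
        while lo < hi:
            mid = (lo + hi + 1) // 2
            candidate = zlib.compress(data[pos:pos + mid])
            if len(candidate) <= block_size:
                lo, best = mid, candidate
            else:
                hi = mid - 1
        extents.append((lo, best))
        pos += lo
    return extents
```

Each returned extent decompresses independently, which is the property that
keeps random reads bounded to the blocks covering the requested range.
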
In detail, the original (uncompressed) data is split into several
variable-sized extents and, at the same time, compressed into physical
clusters (pclusters). In order to record each variable-sized extent, logical
clusters (lclusters) are introduced as the basic unit of compression indexes to
indicate whether a new extent is generated within the range (HEAD) or not
(NONHEAD). Lclusters are now fixed in block size, as illustrated below::

          |<-    variable-sized extent    ->|<-       VLE         ->|
        clusterofs                        clusterofs              clusterofs
          |                                 |                       |
 _________v_________________________________v_______________________v________
 ... |    .         |              |        .     |              |  .   ...
 ____|____._________|______________|________._____|______________|__.________
     |-> lcluster <-|-> lcluster <-|-> lcluster <-|-> lcluster <-|
          (HEAD)        (NONHEAD)       (HEAD)        (NONHEAD)    .
           .             CBLKCNT            .                    .
            .                               .                  .
             .                              .                .
       _______._____________________________.______________._________________
          ... |              |              |              | ...
       _______|______________|______________|______________|_________________
              |->      big pcluster       <-|-> pcluster <-|

A physical cluster can be seen as a container of physical compressed blocks
which contain compressed data. Previously, only lcluster-sized (4KiB) pclusters
were supported. With the big pcluster feature (available since Linux v5.13),
a pcluster can be a multiple of the lcluster size.

For each HEAD lcluster, clusterofs is recorded to indicate where a new extent
starts and blkaddr is used to seek the compressed data. For each NONHEAD
lcluster, delta0 and delta1 are available instead of blkaddr to indicate the
distance to its HEAD lcluster and the next HEAD lcluster. A PLAIN lcluster is
also a HEAD lcluster except that its data is uncompressed. See the comments
around "struct z_erofs_vle_decompressed_index" in erofs_fs.h for more details.

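The lookup described above can be modelled with a simplified in-memory
index.  The ``Lcluster`` class below is a hypothetical model for illustration
and not the on-disk format, and the CBLKCNT encoding used by big pclusters is
left out of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Lcluster:
    kind: str            # "HEAD", "PLAIN" or "NONHEAD"
    clusterofs: int = 0  # HEAD/PLAIN: where the new extent starts
    blkaddr: int = 0     # HEAD/PLAIN: where the extent's data is stored
    delta0: int = 0      # NONHEAD: distance back to its HEAD lcluster
    delta1: int = 0      # NONHEAD: distance to the next HEAD lcluster


def find_extent(lclusters, offset, lcluster_size=4096):
    """Return (extent_start, blkaddr, is_plain) for a logical offset."""
    lcn, ofs = divmod(offset, lcluster_size)
    lc = lclusters[lcn]
    if lc.kind != "NONHEAD" and ofs < lc.clusterofs:
        # The offset lies before the point where this lcluster's new extent
        # starts, so it still belongs to the extent that began earlier.
        if lcn == 0:
            raise ValueError("no extent covers this offset")
        lcn -= 1
        lc = lclusters[lcn]
    if lc.kind == "NONHEAD":
        # Jump back to the HEAD lcluster that owns this range.
        lcn -= lc.delta0
        lc = lclusters[lcn]
    return lcn * lcluster_size + lc.clusterofs, lc.blkaddr, lc.kind == "PLAIN"
```

The first branch covers a HEAD lcluster whose leading bytes are the tail of
the previous extent, which is why clusterofs has to be compared with the
offset inside the lcluster.
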
If big pcluster is enabled, the pcluster size in lclusters needs to be recorded
as well. The delta0 of the first NONHEAD lcluster stores the compressed block
count together with a special flag; such an lcluster is called a CBLKCNT
NONHEAD lcluster. Since it directly follows its HEAD lcluster, its actual
delta0 is always 1, as illustrated below::

   __________________________________________________________
  | HEAD |  NONHEAD  | NONHEAD | ... | NONHEAD | HEAD | HEAD |
  |__:___|_(CBLKCNT)_|_________|_____|_________|__:___|____:_|
     |<----- a big pcluster (with CBLKCNT) ------>|<--  -->|
           a lcluster-sized pcluster (without CBLKCNT) ^

If another HEAD follows a HEAD lcluster, there is no room to record CBLKCNT,
but the size of such a pcluster is known to be 1 lcluster.
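
The CBLKCNT rule can be sketched with a simplified model in which each
lcluster is a ``(kind, delta0)`` tuple.  The flag value below is an assumption
chosen for illustration and should be checked against erofs_fs.h; the helper
names are hypothetical.

```python
# Assumed flag bit marking a delta0 that carries a compressed block count.
CBLKCNT_FLAG = 1 << 11


def pcluster_size(lclusters, head_lcn):
    """Return the pcluster size recorded for the HEAD lcluster at head_lcn."""
    nxt = head_lcn + 1
    if nxt < len(lclusters):
        kind, delta0 = lclusters[nxt]
        if kind == "NONHEAD" and delta0 & CBLKCNT_FLAG:
            # The first NONHEAD lcluster carries the count instead of a distance.
            return delta0 & ~CBLKCNT_FLAG
    # No CBLKCNT recorded (for example, HEAD directly followed by HEAD):
    # the pcluster is a single lcluster.
    return 1


def distance_to_head(delta0):
    """Distance from a NONHEAD lcluster back to its HEAD lcluster."""
    # A CBLKCNT lcluster always sits right after its HEAD, so the distance
    # is 1 by construction and does not need to be stored.
    return 1 if delta0 & CBLKCNT_FLAG else delta0
```

Reusing delta0 this way avoids adding a new on-disk field, at the cost of one
special case when walking back from the first NONHEAD lcluster.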