.. SPDX-License-Identifier: GPL-2.0

=======================
Squashfs 4.0 Filesystem
=======================

Squashfs is a compressed read-only filesystem for Linux.

It uses zlib, lz4, lzo, xz or zstd compression to compress files, inodes and
directories.  Inodes in the system are very small and all blocks are packed to
minimise data overhead.  Block sizes greater than 4 KiB are supported, up to a
maximum of 1 MiB (default block size 128 KiB).

Squashfs is intended for general read-only filesystem use, for archival
use (i.e. in cases where a .tar.gz file may be used), and in constrained
block device/memory systems (e.g. embedded systems) where low overhead is
needed.

Mailing list: squashfs-devel@lists.sourceforge.net

Web site: www.squashfs.org

1. Filesystem Features
----------------------

Squashfs filesystem features versus Cramfs:

==============================  =========               ==========
                                Squashfs                Cramfs
==============================  =========               ==========
Max filesystem size             2^64                    256 MiB
Max file size                   ~ 2 TiB                 16 MiB
Max files                       unlimited               unlimited
Max directories                 unlimited               unlimited
Max entries per directory       unlimited               unlimited
Max block size                  1 MiB                   4 KiB
Metadata compression            yes                     no
Directory indexes               yes                     no
Sparse file support             yes                     no
Tail-end packing (fragments)    yes                     no
Exportable (NFS etc.)           yes                     no
Hard link support               yes                     no
"." and ".." in readdir         yes                     no
Real inode numbers              yes                     no
32-bit uids/gids                yes                     no
File creation time              yes                     no
Xattr support                   yes                     no
ACL support                     no                      no
==============================  =========               ==========

Squashfs compresses data, inodes and directories.  In addition, inode and
directory data are highly compacted, and packed on byte boundaries.  Each
compressed inode is on average 8 bytes in length (the exact length varies with
file type, i.e. regular file, directory, symbolic link, and block/char device
inodes have different sizes).

2. Using Squashfs
-----------------

As squashfs is a read-only filesystem, the mksquashfs program must be used to
create populated squashfs filesystems.  This and other squashfs utilities
can be obtained from http://www.squashfs.org.  Usage instructions are also
available from this site.

The squashfs-tools development tree is located on kernel.org::

        git://git.kernel.org/pub/scm/fs/squashfs/squashfs-tools.git

3. Squashfs Filesystem Design
-----------------------------

A squashfs filesystem consists of a maximum of nine parts, packed together on a
byte alignment::

         ---------------
        |  superblock   |
        |---------------|
        |  compression  |
        |    options    |
        |---------------|
        |  datablocks   |
        |  & fragments  |
        |---------------|
        |  inode table  |
        |---------------|
        |   directory   |
        |     table     |
        |---------------|
        |   fragment    |
        |    table      |
        |---------------|
        |    export     |
        |    table      |
        |---------------|
        |    uid/gid    |
        |  lookup table |
        |---------------|
        |     xattr     |
        |     table     |
         ---------------

Compressed data blocks are written to the filesystem as files are read from
the source directory, and checked for duplicates.  Once all file data has been
written, the completed inode, directory, fragment, export, uid/gid lookup and
xattr tables are written.

3.1 Compression options
-----------------------

Compressors can optionally support compression-specific options (e.g.
dictionary size).  If non-default compression options have been used, then
these are stored here.

3.2 Inodes
----------

Metadata (inodes and directories) is compressed in 8 KiB blocks.  Each
compressed block is prefixed by a two-byte length; the top bit is set if the
block is uncompressed.  A block will be uncompressed if the mksquashfs -noI
option is set, or if the compressed block was larger than the uncompressed
block.
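
As a minimal sketch of the length prefix described above, the following
Python fragment decodes a two-byte metadata block header.  It assumes the
header is stored little-endian and that the low 15 bits hold the stored
length; ``parse_metadata_header`` is a hypothetical helper name, not a
kernel function.

.. code-block:: python

    import struct

    COMPRESSED_BIT = 1 << 15  # top bit of the 16-bit length prefix


    def parse_metadata_header(raw):
        """Return (stored_length, is_compressed) for a two-byte header."""
        (header,) = struct.unpack("<H", raw[:2])  # assumption: little-endian
        is_compressed = not (header & COMPRESSED_BIT)  # bit set: uncompressed
        stored_length = header & (COMPRESSED_BIT - 1)  # remaining 15 bits
        return stored_length, is_compressed

Under these assumptions a header value of 0x9000 describes a 4096-byte block
that is stored uncompressed, since its top bit is set.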

Inodes are packed into the metadata blocks, and are not aligned to block
boundaries, so an inode may straddle two compressed blocks.  Inodes are
identified by a 48-bit number which encodes the location of the compressed
metadata block containing the inode, and the byte offset into that block where
the inode is placed (<block, offset>).
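
The <block, offset> pair can be packed into, and recovered from, a single
number as sketched below.  The split assumed here (the offset in the low
16 bits, the block location in the 32 bits above it) is one that adds up to
48 bits; the function names are hypothetical and used only for illustration.

.. code-block:: python

    OFFSET_BITS = 16  # assumption: enough to address any byte in an 8 KiB block
    BLOCK_BITS = 32   # assumption: the remaining bits of the 48-bit number


    def make_inode_ref(block, offset):
        """Pack a <block, offset> pair into one 48-bit inode number."""
        if not 0 <= block < (1 << BLOCK_BITS):
            raise ValueError("block location out of range")
        if not 0 <= offset < (1 << OFFSET_BITS):
            raise ValueError("offset out of range")
        return (block << OFFSET_BITS) | offset


    def split_inode_ref(ref):
        """Recover the <block, offset> pair from a packed inode number."""
        return ref >> OFFSET_BITS, ref & ((1 << OFFSET_BITS) - 1)

Packing both values into one number lets a directory entry or lookup table
refer to an inode without a separate per-inode location table.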

To maximise compression there are different inodes for each file type
(regular file, directory, device, etc.), with the inode contents and length
varying with the type.

To further maximise compression, two types of regular file inode and
directory inode are defined: inodes optimised for frequently occurring
regular files and directories, and extended types where extra
information has to be stored.

3.3 Directories
---------------

Like inodes, directories are packed into compressed metadata blocks, stored
in a directory table.  Directories are accessed using the start address of
the metablock containing the directory and the offset into the
decompressed block (<block, offset>).

Directories are organised in a slightly complex way, and are not simply
a list of file names.  The organisation takes advantage of the
fact that (in most cases) the inodes of the files will be in the same
compressed metadata block, and can therefore share the start block.
Directories are therefore organised in a two-level list: a directory
header containing the shared start block value, and a sequence of directory
entries, each of which shares that start block.  A new directory header
is written whenever the inode start block changes.  The directory
header/directory entry list is repeated as many times as necessary.
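
The two-level organisation can be sketched as follows.  This is a simplified
model rather than the on-disk layout: entries are plain tuples, any limit the
real format places on the number of entries per header is ignored, and
``group_directory_entries`` is a hypothetical name.

.. code-block:: python

    from itertools import groupby


    def group_directory_entries(entries):
        """Group name-sorted (name, inode_block, inode_offset) tuples.

        Returns a list of (inode_block, [(name, inode_offset), ...]) pairs,
        one pair per directory header.
        """
        listing = []
        # groupby() starts a new group each time the key changes, mirroring
        # a new header being written when the inode start block changes.
        for inode_block, run in groupby(entries, key=lambda entry: entry[1]):
            listing.append((inode_block, [(name, off) for name, _, off in run]))
        return listing

Because only consecutive entries are grouped, a start block that reappears
later in the sorted listing gets a second header; the saving comes from the
common case where neighbouring entries share a block.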

Directories are sorted in alphabetical order, and can contain a directory
index to speed up file lookup.  Directory indexes store one entry per
metablock, each entry storing the index/filename mapping to the first
directory header in each metadata block.  At lookup the index is scanned
linearly looking for the first filename alphabetically larger than the
filename being looked up.  At this point the location of the metadata block
the filename is in has been found.  The general idea of the index is to ensure
only one metadata block needs to be decompressed to do a lookup, irrespective
of the length of the directory.  This scheme has the advantage that it doesn't
require extra memory overhead and doesn't require much extra storage on disk.
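
The linear index scan described above can be sketched like this.  The index
is modelled as a name-sorted list of (first filename, block location) pairs,
and plain string comparison stands in for the alphabetical ordering; both are
assumptions made for illustration, as is the name ``find_directory_block``.

.. code-block:: python

    def find_directory_block(index, name):
        """Return the location of the metadata block that may hold `name`.

        `index` is a name-sorted list of (first_name, block_location) pairs,
        one per metadata block.  Returns None if `name` sorts before every
        indexed name.
        """
        found = None
        for first_name, block_location in index:
            if first_name > name:  # first entry larger than the target: stop
                break
            found = block_location
        return found

Only the block returned by the scan then needs to be decompressed and
searched for the name.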

3.4 File data
-------------

Regular files consist of a sequence of contiguous compressed blocks, and/or a
compressed fragment block (tail-end packed block).  The compressed size
of each datablock is stored in a block list contained within the
file inode.

To speed up access to datablocks when reading 'large' files (256 MiB or
larger), the code implements an index cache that caches the mapping from
block index to datablock location on disk.

The index cache allows Squashfs to handle large files (up to 1.75 TiB) while
retaining a simple and space-efficient block list on disk.  The cache
is split into slots, caching up to eight 224 GiB files (128 KiB blocks).
Larger files use multiple slots, with 1.75 TiB files using all 8 slots.
The index cache is designed to be memory efficient, and by default uses
16 KiB.
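
The figures quoted above can be cross-checked with a little arithmetic.  The
sketch below uses only the numbers given in this section, plus one
assumption that is not stated here: that each block list entry occupies
4 bytes.

.. code-block:: python

    KIB = 1024
    MIB = 1024 * KIB
    GIB = 1024 * MIB
    TIB = 1024 * GIB

    BLOCK_SIZE = 128 * KIB     # default data block size
    SLOTS = 8                  # number of index cache slots
    SLOT_COVERAGE = 224 * GIB  # largest file a single slot can cover
    CACHE_BYTES = 16 * KIB     # default size of the whole index cache
    BLOCK_LIST_ENTRY = 4       # assumption: bytes per block list entry


    def index_cache_figures():
        """Derive the headline numbers from the constants above."""
        blocks_in_large_file = 256 * MIB // BLOCK_SIZE
        return {
            "max_file_tib": SLOTS * SLOT_COVERAGE / TIB,
            "bytes_per_slot": CACHE_BYTES // SLOTS,
            "blocks_per_slot": SLOT_COVERAGE // BLOCK_SIZE,
            "blocks_in_256_mib": blocks_in_large_file,
            "block_list_bytes": blocks_in_large_file * BLOCK_LIST_ENTRY,
        }

Eight slots of 224 GiB give the 1.75 TiB limit, and each slot has 2 KiB of
memory to describe 1835008 blocks.  Under the 4-byte assumption, the block
list of a 256 MiB file fills exactly one 8 KiB metadata block, which is
consistent with 256 MiB being the point at which a file counts as 'large'.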

3.5 Fragment lookup table
-------------------------

Regular files can contain a fragment index which is mapped to a fragment
location on disk and compressed size using a fragment lookup table.  This
fragment lookup table is itself stored compressed into metadata blocks.
A second index table is used to locate these.  For speed of access (and
because it is small), this second index table is read at mount time and
cached in memory.
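
A two-level lookup of this kind can be sketched as below.  The 16-byte entry
size is an assumption made for illustration, the function name is
hypothetical, and the sketch assumes entries never straddle a metadata block
(the entry size divides 8192 evenly).

.. code-block:: python

    METADATA_BLOCK_SIZE = 8192  # uncompressed metadata block size
    FRAGMENT_ENTRY_SIZE = 16    # assumption: bytes per fragment table entry


    def locate_table_entry(entry_number, entry_size, index_table):
        """Return (metadata_block_location, byte_offset) for a table entry.

        `index_table` models the in-memory second index table: a list giving
        the on-disk location of each metadata block of the lookup table.
        """
        byte_position = entry_number * entry_size
        block_number = byte_position // METADATA_BLOCK_SIZE
        byte_offset = byte_position % METADATA_BLOCK_SIZE
        return index_table[block_number], byte_offset

Since the second index table is already in memory, resolving a fragment
index costs at most one metadata block read and decompress.  The uid/gid and
export tables are described in the same terms, so the same arithmetic would
apply to them with a different entry size.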

3.6 Uid/gid lookup table
------------------------

For space efficiency regular files store uid and gid indexes, which are
converted to 32-bit uids/gids using an id lookup table.  This table is
stored compressed into metadata blocks.  A second index table is used to
locate these.  For speed of access (and because it is small), this second
index table is read at mount time and cached in memory.

3.7 Export table
----------------

To enable Squashfs filesystems to be exportable (via NFS etc.), filesystems
can optionally (disabled with the -no-exports mksquashfs option) contain
an inode number to inode disk location lookup table.  This is required to
enable Squashfs to map inode numbers passed in filehandles to the inode
location on disk, which is necessary when the export code reinstantiates
expired/flushed inodes.

This table is stored compressed into metadata blocks.  A second index table is
used to locate these.  For speed of access (and because it is small), this
second index table is read at mount time and cached in memory.

3.8 Xattr table
---------------

The xattr table contains extended attributes for each inode.  The xattrs
for each inode are stored in a list, each list entry containing a type,
name and value field.  The type field encodes the xattr prefix
("user.", "trusted." etc.) and it also encodes how the name/value fields
should be interpreted.  Currently the type indicates whether the value
is stored inline (in which case the value field contains the xattr value),
or if it is stored out of line (in which case the value field stores a
reference to where the actual value is stored).  This allows large values
to be stored out of line, improving scanning and lookup performance.  It
also allows values to be de-duplicated: the value is stored once, and
all other occurrences hold an out-of-line reference to that value.
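
A type field of this kind could be decoded as sketched below.  The numeric
values (prefix ids 0 to 2, a prefix mask of 0xff and an out-of-line flag of
0x100) are assumptions made for illustration and should be checked against
the kernel headers; ``decode_xattr_type`` is a hypothetical name.

.. code-block:: python

    # Assumed encoding: low byte selects the prefix, bit 8 flags out-of-line.
    XATTR_PREFIXES = {0: "user.", 1: "trusted.", 2: "security."}
    XATTR_PREFIX_MASK = 0xFF
    XATTR_VALUE_OOL = 0x100


    def decode_xattr_type(type_field):
        """Return (prefix, is_out_of_line) for an xattr type field."""
        prefix = XATTR_PREFIXES.get(type_field & XATTR_PREFIX_MASK)
        if prefix is None:
            raise ValueError("unknown xattr prefix id")
        return prefix, bool(type_field & XATTR_VALUE_OOL)

Keeping the prefix as a small id rather than a string means the stored name
only needs to hold the part after the prefix.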

The xattr lists are packed into compressed 8 KiB metadata blocks.
To reduce overhead in inodes, rather than storing the on-disk
location of the xattr list inside each inode, a 32-bit xattr id
is stored.  This xattr id is mapped into the location of the xattr
list using a second xattr id lookup table.

4. TODOs and Outstanding Issues
-------------------------------

4.1 TODO list
-------------

Implement ACL support.

4.2 Squashfs Internal Cache
---------------------------

Blocks in Squashfs are compressed.  To avoid repeatedly decompressing
recently accessed data, Squashfs uses two small caches, one for metadata
and one for fragments.

These caches are not used for file datablocks, which are decompressed and
cached in the page cache in the normal way.  They are used to temporarily
cache fragment and metadata blocks which have been read as a result of a
metadata (i.e. inode or directory) or fragment access.  Because metadata and
fragments are packed together into blocks (to gain greater compression), the
read of a particular piece of metadata or fragment will retrieve other
metadata/fragments which have been packed with it.  Because of locality of
reference, these may be read in the near future.  Temporarily caching them
ensures they are available for near-future access without requiring an
additional read and decompress.
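
The benefit of such a cache can be illustrated with the sketch below.  It is
not the kernel implementation: a plain least-recently-used policy is swapped
in for simplicity, and ``BlockCache`` and ``read_and_decompress`` are
hypothetical names.

.. code-block:: python

    from collections import OrderedDict


    class BlockCache:
        """Small cache of decompressed blocks, keyed by on-disk location."""

        def __init__(self, capacity, read_and_decompress):
            self.capacity = capacity
            self.read_and_decompress = read_and_decompress
            self.entries = OrderedDict()
            self.misses = 0

        def get(self, location):
            if location in self.entries:
                self.entries.move_to_end(location)  # mark as recently used
                return self.entries[location]
            self.misses += 1
            data = self.read_and_decompress(location)
            self.entries[location] = data
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # drop least recently used
            return data

With a cache like this, a run of inode lookups that land in the same
metadata block pays for one read and decompress rather than one per lookup.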

In the future this internal cache may be replaced with an implementation which
uses the kernel page cache.  Because the page cache operates on page-sized
units this may introduce additional complexity in terms of locking and
associated race conditions.