.. SPDX-License-Identifier: GPL-2.0

Index Nodes
-----------

In a regular UNIX filesystem, the inode stores all the metadata
pertaining to the file (time stamps, block maps, extended attributes,
etc.), not the directory entry. To find the information associated with a
file, one must traverse the directory files to find the directory entry
associated with the file, then load the inode to find the metadata for
that file. ext4 cheats a little (for performance reasons) by storing a
copy of the file type (normally stored in the inode) in the directory
entry. (Compare all this to FAT, which stores all the file information
directly in the directory entry, but does not support hard links and is
in general more seek-happy than ext4 due to its simpler block allocator
and extensive use of linked lists.)

The inode table is a linear array of ``struct ext4_inode``. The table is
sized to have enough blocks to store at least
``sb.s_inode_size * sb.s_inodes_per_group`` bytes. The number of the
block group containing an inode can be calculated as
``(inode_number - 1) / sb.s_inodes_per_group``, and the index of the
inode within the group's table is
``(inode_number - 1) % sb.s_inodes_per_group``. There is no inode 0.

The inode checksum is calculated against the FS UUID, the inode number,
and the inode structure itself.

The inode table entry is laid out in ``struct ext4_inode``.

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1
   :class: longtable

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - __le16
     - i_mode
     - File mode. See the table i_mode_ below.
   * - 0x2
     - __le16
     - i_uid
     - Lower 16-bits of Owner UID.
   * - 0x4
     - __le32
     - i_size_lo
     - Lower 32-bits of size in bytes.
   * - 0x8
     - __le32
     - i_atime
     - Last access time, in seconds since the epoch. However, if the EA_INODE
       inode flag is set, this inode stores an extended attribute value and
       this field contains the checksum of the value.
   * - 0xC
     - __le32
     - i_ctime
     - Last inode change time, in seconds since the epoch. However, if the
       EA_INODE inode flag is set, this inode stores an extended attribute
       value and this field contains the lower 32 bits of the attribute value's
       reference count.
   * - 0x10
     - __le32
     - i_mtime
     - Last data modification time, in seconds since the epoch. However, if the
       EA_INODE inode flag is set, this inode stores an extended attribute
       value and this field contains the number of the inode that owns the
       extended attribute.
   * - 0x14
     - __le32
     - i_dtime
     - Deletion Time, in seconds since the epoch.
   * - 0x18
     - __le16
     - i_gid
     - Lower 16-bits of GID.
   * - 0x1A
     - __le16
     - i_links_count
     - Hard link count. Normally, ext4 does not permit an inode to have more
       than 65,000 hard links. This applies to files as well as directories,
       which means that there cannot be more than 64,998 subdirectories in a
       directory (each subdirectory's '..' entry counts as a hard link, as does
       the '.' entry in the directory itself). With the DIR_NLINK feature
       enabled, ext4 supports more than 64,998 subdirectories by setting this
       field to 1 to indicate that the number of hard links is not known.
   * - 0x1C
     - __le32
     - i_blocks_lo
     - Lower 32-bits of “block” count. If the huge_file feature flag is not
       set on the filesystem, the file consumes ``i_blocks_lo`` 512-byte blocks
       on disk. If huge_file is set and EXT4_HUGE_FILE_FL is NOT set in
       ``inode.i_flags``, then the file consumes
       ``i_blocks_lo + (i_blocks_hi << 32)`` 512-byte blocks on disk. If
       huge_file is set and EXT4_HUGE_FILE_FL IS set in ``inode.i_flags``,
       then this file consumes ``i_blocks_lo + (i_blocks_hi << 32)``
       filesystem blocks on disk. Here ``i_blocks_hi`` is the
       ``l_i_blocks_high`` field in i_osd2_.
   * - 0x20
     - __le32
     - i_flags
     - Inode flags. See the table i_flags_ below.
   * - 0x24
     - 4 bytes
     - i_osd1
     - See the table i_osd1_ for more details.
   * - 0x28
     - 60 bytes
     - i_block[EXT4_N_BLOCKS=15]
     - Block map or extent tree. See the section “The Contents of inode.i_block”.
   * - 0x64
     - __le32
     - i_generation
     - File version (for NFS).
   * - 0x68
     - __le32
     - i_file_acl_lo
     - Lower 32-bits of extended attribute block. ACLs are of course one of
       many possible extended attributes; the field name presumably dates
       from ACLs being the first use of extended attributes.
   * - 0x6C
     - __le32
     - i_size_high / i_dir_acl
     - Upper 32-bits of file/directory size. In ext2/3 this field was named
       i_dir_acl, though it was usually set to zero and never used.
   * - 0x70
     - __le32
     - i_obso_faddr
     - (Obsolete) fragment address.
   * - 0x74
     - 12 bytes
     - i_osd2
     - See the table i_osd2_ for more details.
   * - 0x80
     - __le16
     - i_extra_isize
     - Size of this inode - 128. Alternately, the size of the extended inode
       fields beyond the original ext2 inode, including this field.
   * - 0x82
     - __le16
     - i_checksum_hi
     - Upper 16-bits of the inode checksum.
   * - 0x84
     - __le32
     - i_ctime_extra
     - Extra change time bits. This provides sub-second precision. See the
       Inode Timestamps section.
   * - 0x88
     - __le32
     - i_mtime_extra
     - Extra modification time bits. This provides sub-second precision.
   * - 0x8C
     - __le32
     - i_atime_extra
     - Extra access time bits. This provides sub-second precision.
   * - 0x90
     - __le32
     - i_crtime
     - File creation time, in seconds since the epoch.
   * - 0x94
     - __le32
     - i_crtime_extra
     - Extra file creation time bits. This provides sub-second precision.
   * - 0x98
     - __le32
     - i_version_hi
     - Upper 32-bits for version number.
   * - 0x9C
     - __le32
     - i_projid
     - Project ID.
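The three cases described for ``i_blocks_lo`` can be sketched in C as
follows. This is only an illustration, not the kernel's code: the function
name and the ``fs_has_huge_file`` and ``fs_block_size`` parameters are
hypothetical stand-ins for values a reader would take from the superblock,
and all fields are assumed to be already converted to host byte order.

```c
#include <assert.h>
#include <stdint.h>

#define EXT4_HUGE_FILE_FL 0x40000u /* value from the i_flags table */

/*
 * Return the space an inode consumes, in 512-byte units.
 * fs_has_huge_file and fs_block_size are hypothetical parameters that a
 * caller would derive from the superblock.
 */
static uint64_t inode_sectors(uint32_t i_blocks_lo, uint16_t i_blocks_hi,
                              uint32_t i_flags, int fs_has_huge_file,
                              uint32_t fs_block_size)
{
    uint64_t count;

    assert(fs_block_size >= 512 && fs_block_size % 512 == 0);

    /* Without huge_file, only the low 32 bits are meaningful. */
    if (!fs_has_huge_file)
        return i_blocks_lo;

    count = (uint64_t)i_blocks_lo + ((uint64_t)i_blocks_hi << 32);

    /* With EXT4_HUGE_FILE_FL the count is in filesystem blocks. */
    if (i_flags & EXT4_HUGE_FILE_FL)
        count *= fs_block_size / 512;

    return count;
}
```

For instance, with 4096-byte blocks a huge-file inode whose count is 2
occupies 16 units of 512 bytes.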

.. _i_mode:

The ``i_mode`` value is a combination of the following flags:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - S_IXOTH (Others may execute)
   * - 0x2
     - S_IWOTH (Others may write)
   * - 0x4
     - S_IROTH (Others may read)
   * - 0x8
     - S_IXGRP (Group members may execute)
   * - 0x10
     - S_IWGRP (Group members may write)
   * - 0x20
     - S_IRGRP (Group members may read)
   * - 0x40
     - S_IXUSR (Owner may execute)
   * - 0x80
     - S_IWUSR (Owner may write)
   * - 0x100
     - S_IRUSR (Owner may read)
   * - 0x200
     - S_ISVTX (Sticky bit)
   * - 0x400
     - S_ISGID (Set GID)
   * - 0x800
     - S_ISUID (Set UID)
   * -
     - These are mutually-exclusive file types:
   * - 0x1000
     - S_IFIFO (FIFO)
   * - 0x2000
     - S_IFCHR (Character device)
   * - 0x4000
     - S_IFDIR (Directory)
   * - 0x6000
     - S_IFBLK (Block device)
   * - 0x8000
     - S_IFREG (Regular file)
   * - 0xA000
     - S_IFLNK (Symbolic link)
   * - 0xC000
     - S_IFSOCK (Socket)
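Because the file types occupy the top four bits and are mutually
exclusive, a reader can split ``i_mode`` into a type and a set of
permission bits with two masks. The sketch below uses the values from the
table; the function names and the ``MODE_TYPE_MASK`` macro are made up for
this example.

```c
#include <stdint.h>

#define MODE_TYPE_MASK 0xF000u /* hypothetical name: the file type bits */
#define MODE_PERM_MASK 0x0FFFu /* hypothetical name: rwx, setuid, setgid, sticky */

/* Return an ls(1)-style type character for an on-disk i_mode value. */
static char mode_type_char(uint16_t i_mode)
{
    switch (i_mode & MODE_TYPE_MASK) {
    case 0x1000: return 'p'; /* S_IFIFO */
    case 0x2000: return 'c'; /* S_IFCHR */
    case 0x4000: return 'd'; /* S_IFDIR */
    case 0x6000: return 'b'; /* S_IFBLK */
    case 0x8000: return '-'; /* S_IFREG */
    case 0xA000: return 'l'; /* S_IFLNK */
    case 0xC000: return 's'; /* S_IFSOCK */
    default:     return '?'; /* not a type listed in the table */
    }
}

/* Return the twelve permission and special bits. */
static uint16_t mode_perm_bits(uint16_t i_mode)
{
    return (uint16_t)(i_mode & MODE_PERM_MASK);
}
```

An ``i_mode`` of ``0x81A4`` would therefore decode as a regular file with
permission bits ``0x1A4`` (octal 0644).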

.. _i_flags:

The ``i_flags`` field is a combination of these values:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - This file requires secure deletion (EXT4_SECRM_FL). (not implemented)
   * - 0x2
     - This file should be preserved, should undeletion be desired
       (EXT4_UNRM_FL). (not implemented)
   * - 0x4
     - File is compressed (EXT4_COMPR_FL). (not really implemented)
   * - 0x8
     - All writes to the file must be synchronous (EXT4_SYNC_FL).
   * - 0x10
     - File is immutable (EXT4_IMMUTABLE_FL).
   * - 0x20
     - File can only be appended (EXT4_APPEND_FL).
   * - 0x40
     - The dump(1) utility should not dump this file (EXT4_NODUMP_FL).
   * - 0x80
     - Do not update access time (EXT4_NOATIME_FL).
   * - 0x100
     - Dirty compressed file (EXT4_DIRTY_FL). (not used)
   * - 0x200
     - File has one or more compressed clusters (EXT4_COMPRBLK_FL). (not used)
   * - 0x400
     - Do not compress file (EXT4_NOCOMPR_FL). (not used)
   * - 0x800
     - Encrypted inode (EXT4_ENCRYPT_FL). This bit value previously was
       EXT4_ECOMPR_FL (compression error), which was never used.
   * - 0x1000
     - Directory has hashed indexes (EXT4_INDEX_FL).
   * - 0x2000
     - AFS magic directory (EXT4_IMAGIC_FL).
   * - 0x4000
     - File data must always be written through the journal
       (EXT4_JOURNAL_DATA_FL).
   * - 0x8000
     - File tail should not be merged (EXT4_NOTAIL_FL). (not used by ext4)
   * - 0x10000
     - All directory entry data should be written synchronously (see
       ``dirsync``) (EXT4_DIRSYNC_FL).
   * - 0x20000
     - Top of directory hierarchy (EXT4_TOPDIR_FL).
   * - 0x40000
     - This is a huge file (EXT4_HUGE_FILE_FL).
   * - 0x80000
     - Inode uses extents (EXT4_EXTENTS_FL).
   * - 0x100000
     - Verity protected file (EXT4_VERITY_FL).
   * - 0x200000
     - Inode stores a large extended attribute value in its data blocks
       (EXT4_EA_INODE_FL).
   * - 0x400000
     - This file has blocks allocated past EOF (EXT4_EOFBLOCKS_FL).
       (deprecated)
   * - 0x01000000
     - Inode is a snapshot (``EXT4_SNAPFILE_FL``). (not in mainline)
   * - 0x04000000
     - Snapshot is being deleted (``EXT4_SNAPFILE_DELETED_FL``). (not in
       mainline)
   * - 0x08000000
     - Snapshot shrink has completed (``EXT4_SNAPFILE_SHRUNK_FL``). (not in
       mainline)
   * - 0x10000000
     - Inode has inline data (EXT4_INLINE_DATA_FL).
   * - 0x20000000
     - Create children with the same project ID (EXT4_PROJINHERIT_FL).
   * - 0x80000000
     - Reserved for ext4 library (EXT4_RESERVED_FL).
   * -
     - Aggregate flags:
   * - 0x705BDFFF
     - User-visible flags.
   * - 0x604BC0FF
     - User-modifiable flags. Note that while EXT4_JOURNAL_DATA_FL and
       EXT4_EXTENTS_FL can be set with setattr, they are not in the kernel's
       EXT4_FL_USER_MODIFIABLE mask, since it needs to handle the setting of
       these flags in a special manner and they are masked out of the set of
       flags that are saved directly to i_flags.
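One way to read the two aggregate values is as bit masks applied when
flags are reported to, or accepted from, user space. The sketch below only
illustrates that masking idea with the numbers from the table. It is not
the kernel's setattr path (which, as noted, treats some flags specially),
and the macro and function names are chosen for this example.

```c
#include <stdint.h>

#define FL_USER_VISIBLE    0x705BDFFFu /* "User-visible flags" row */
#define FL_USER_MODIFIABLE 0x604BC0FFu /* "User-modifiable flags" row */

/* Flags that would be reported to user space. */
static uint32_t flags_for_user(uint32_t i_flags)
{
    return i_flags & FL_USER_VISIBLE;
}

/*
 * Merge a user's requested flag word into the current one: bits outside
 * the modifiable mask keep their current value, bits inside it take the
 * requested value.
 */
static uint32_t apply_user_flags(uint32_t i_flags, uint32_t requested)
{
    return (i_flags & ~FL_USER_MODIFIABLE) | (requested & FL_USER_MODIFIABLE);
}
```

With these masks, a request to set EXT4_IMMUTABLE_FL (0x10) is honoured,
while a request to set EXT4_VERITY_FL (0x100000) is dropped because that
bit is not in the modifiable mask.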

.. _i_osd1:

The ``osd1`` field has multiple meanings depending on the creator:

Linux:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - __le32
     - l_i_version
     - Inode version. However, if the EA_INODE inode flag is set, this inode
       stores an extended attribute value and this field contains the upper 32
       bits of the attribute value's reference count.

Hurd:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - __le32
     - h_i_translator
     - ??

Masix:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - __le32
     - m_i_reserved
     - ??

.. _i_osd2:

The ``osd2`` field has multiple meanings depending on the filesystem creator:

Linux:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - __le16
     - l_i_blocks_high
     - Upper 16-bits of the block count. Please see the note attached to
       i_blocks_lo.
   * - 0x2
     - __le16
     - l_i_file_acl_high
     - Upper 16-bits of the extended attribute block (historically, the file
       ACL location). See the Extended Attributes section below.
   * - 0x4
     - __le16
     - l_i_uid_high
     - Upper 16-bits of the Owner UID.
   * - 0x6
     - __le16
     - l_i_gid_high
     - Upper 16-bits of the GID.
   * - 0x8
     - __le16
     - l_i_checksum_lo
     - Lower 16-bits of the inode checksum.
   * - 0xA
     - __le16
     - l_i_reserved
     - Unused.
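Several values are split between the original inode and the Linux
``osd2`` area. As a rough sketch, the halves can be rejoined as shown
below. The helper names are invented for this example, and the arguments
are assumed to have been converted from little-endian to host byte order
already.

```c
#include <stdint.h>

/*
 * Join the low and high 16-bit halves of a split 32-bit value, e.g.
 * i_uid with l_i_uid_high, i_gid with l_i_gid_high, or l_i_checksum_lo
 * with i_checksum_hi.
 */
static uint32_t join16(uint16_t lo, uint16_t hi)
{
    return (uint32_t)lo | ((uint32_t)hi << 16);
}

/* 48-bit extended attribute block number from its two parts. */
static uint64_t file_acl_block(uint32_t i_file_acl_lo,
                               uint16_t l_i_file_acl_high)
{
    return (uint64_t)i_file_acl_lo | ((uint64_t)l_i_file_acl_high << 32);
}
```

Note that ``i_checksum_hi`` lives beyond the first 128 bytes, so a reader
should first confirm that it lies within ``i_extra_isize`` before
combining it with ``l_i_checksum_lo``.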

Hurd:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - __le16
     - h_i_reserved1
     - ??
   * - 0x2
     - __u16
     - h_i_mode_high
     - Upper 16-bits of the file mode.
   * - 0x4
     - __le16
     - h_i_uid_high
     - Upper 16-bits of the Owner UID.
   * - 0x6
     - __le16
     - h_i_gid_high
     - Upper 16-bits of the GID.
   * - 0x8
     - __u32
     - h_i_author
     - Author code?

Masix:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - __le16
     - h_i_reserved1
     - ??
   * - 0x2
     - __u16
     - m_i_file_acl_high
     - Upper 16-bits of the extended attribute block (historically, the file
       ACL location).
   * - 0x4
     - __u32
     - m_i_reserved2[2]
     - ??

Inode Size
~~~~~~~~~~

In ext2 and ext3, the inode structure size was fixed at 128 bytes
(``EXT2_GOOD_OLD_INODE_SIZE``) and each inode had a disk record size of
128 bytes. Starting with ext4, it is possible to allocate a larger
on-disk inode at format time for all inodes in the filesystem to provide
space beyond the end of the original ext2 inode. The on-disk inode
record size is recorded in the superblock as ``s_inode_size``. The
number of bytes actually used by ``struct ext4_inode`` beyond the original
128-byte ext2 inode is recorded in the ``i_extra_isize`` field for each
inode, which allows ``struct ext4_inode`` to grow for a new kernel without
having to upgrade all of the on-disk inodes. Before accessing a field
beyond ``EXT2_GOOD_OLD_INODE_SIZE``, verify that the field lies within
``i_extra_isize``. By default, ext4 inode records are 256 bytes, and (as
of August 2019) the inode structure is 160 bytes
(``i_extra_isize = 32``). The extra space between the end of the inode
structure and the end of the inode record can be used to store extended
attributes. Each inode record can be as large as the filesystem block
size, though this is not terribly efficient.
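The bounds check described above can be sketched as a pair of small
helpers. The names are hypothetical and the checks are limited to what
this section states; a real implementation may apply further validation.

```c
#include <stddef.h>
#include <stdint.h>

#define EXT2_GOOD_OLD_INODE_SIZE 128u

/*
 * Return 1 if a field occupying [field_offset, field_offset + field_size)
 * lies within the part of the inode that is actually in use, i.e. within
 * the first 128 bytes plus i_extra_isize.
 */
static int field_fits(uint16_t i_extra_isize, size_t field_offset,
                      size_t field_size)
{
    return field_offset + field_size <=
           EXT2_GOOD_OLD_INODE_SIZE + (size_t)i_extra_isize;
}

/* Return 1 if the used part of the inode fits in the on-disk record. */
static int extra_isize_fits_record(uint16_t s_inode_size,
                                   uint16_t i_extra_isize)
{
    return EXT2_GOOD_OLD_INODE_SIZE + (uint32_t)i_extra_isize <= s_inode_size;
}
```

With ``i_extra_isize = 32``, the 4-byte ``i_projid`` field at offset 0x9C
ends exactly at byte 160 and therefore fits; with ``i_extra_isize = 28`` it
does not.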

Finding an Inode
~~~~~~~~~~~~~~~~

Each block group contains ``sb->s_inodes_per_group`` inodes. Because
inode 0 is defined not to exist, this formula can be used to find the
block group that an inode lives in:
``bg = (inode_num - 1) / sb->s_inodes_per_group``. The particular inode
can be found within the block group's inode table at
``index = (inode_num - 1) % sb->s_inodes_per_group``. To get the byte
address within the inode table, use
``offset = index * sb->s_inode_size``.
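The three formulas can be combined into one helper, shown here as a
minimal sketch. The ``struct inode_location`` type and the function name
are invented for this example, and the superblock values are passed in as
plain host-order integers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type for this example. */
struct inode_location {
    uint32_t group;  /* block group holding the inode */
    uint32_t index;  /* entry number within that group's inode table */
    uint64_t offset; /* byte offset within that group's inode table */
};

static struct inode_location locate_inode(uint32_t inode_num,
                                          uint32_t s_inodes_per_group,
                                          uint16_t s_inode_size)
{
    struct inode_location loc;

    /* Inode 0 does not exist, and a zero group size would divide by zero. */
    assert(inode_num != 0 && s_inodes_per_group != 0);

    loc.group = (inode_num - 1) / s_inodes_per_group;
    loc.index = (inode_num - 1) % s_inodes_per_group;
    loc.offset = (uint64_t)loc.index * s_inode_size;
    return loc;
}
```

For example, with 8192 inodes per group and 256-byte inode records, inode
12 is entry 11 of group 0, at byte offset 2816 in that group's inode table.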

Inode Timestamps
~~~~~~~~~~~~~~~~

Four timestamps are recorded in the lower 128 bytes of the inode
structure -- inode change time (ctime), access time (atime), data
modification time (mtime), and deletion time (dtime). The four fields
are 32-bit signed integers that represent seconds since the Unix epoch
(1970-01-01 00:00:00 GMT), which means that the fields will overflow in
January 2038. If the filesystem does not have the orphan_file feature,
inodes that are not linked from any directory but are still open (orphan
inodes) have the dtime field overloaded for use with the orphan list. The
superblock field ``s_last_orphan`` points to the first inode in the orphan
list; dtime is then the number of the next orphaned inode, or zero if
there are no more orphans.

If the inode structure size ``sb->s_inode_size`` is larger than 128
bytes and the ``i_extra_isize`` field is large enough to encompass the
respective ``i_[cma]time_extra`` field, the ctime, atime, and mtime
inode fields are widened to 64 bits. Within this “extra” 32-bit field,
the lower two bits are used to extend the 32-bit seconds field to be 34
bits wide; the upper 30 bits are used to provide nanosecond timestamp
accuracy. Therefore, timestamps should not overflow until May 2446.
dtime was not widened. There is also a fifth timestamp to record inode
creation time (crtime); this field is 64-bits wide and decoded in the
same manner as 64-bit [cma]time. Neither crtime nor dtime is accessible
through the regular stat() interface, though debugfs will report them.

We use the 32-bit signed time value plus (2^32 * (extra epoch bits)).
In other words:

.. list-table::
   :widths: 20 20 20 20 20
   :header-rows: 1

   * - Extra epoch bits
     - MSB of 32-bit time
     - Adjustment for signed 32-bit to 64-bit tv_sec
     - Decoded 64-bit tv_sec
     - Valid time range
   * - 0 0
     - 1
     - 0
     - ``-0x80000000 - -0x00000001``
     - 1901-12-13 to 1969-12-31
   * - 0 0
     - 0
     - 0
     - ``0x000000000 - 0x07fffffff``
     - 1970-01-01 to 2038-01-19
   * - 0 1
     - 1
     - 0x100000000
     - ``0x080000000 - 0x0ffffffff``
     - 2038-01-19 to 2106-02-07
   * - 0 1
     - 0
     - 0x100000000
     - ``0x100000000 - 0x17fffffff``
     - 2106-02-07 to 2174-02-25
   * - 1 0
     - 1
     - 0x200000000
     - ``0x180000000 - 0x1ffffffff``
     - 2174-02-25 to 2242-03-16
   * - 1 0
     - 0
     - 0x200000000
     - ``0x200000000 - 0x27fffffff``
     - 2242-03-16 to 2310-04-04
   * - 1 1
     - 1
     - 0x300000000
     - ``0x280000000 - 0x2ffffffff``
     - 2310-04-04 to 2378-04-22
   * - 1 1
     - 0
     - 0x300000000
     - ``0x300000000 - 0x37fffffff``
     - 2378-04-22 to 2446-05-10
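The decoding rule in the table can be written compactly in C. The sketch
below follows the formula given above (signed 32-bit seconds plus 2^32
times the two epoch bits, with the upper 30 bits of the extra field as
nanoseconds). The struct and function names are hypothetical, both inputs
are assumed to be in host byte order, and the cast to ``int32_t`` assumes
the usual two's-complement behaviour.

```c
#include <stdint.h>

#define EPOCH_BITS 2
#define EPOCH_MASK ((1u << EPOCH_BITS) - 1)

/* Hypothetical result type for this example. */
struct decoded_time {
    int64_t sec;   /* seconds since the Unix epoch */
    uint32_t nsec; /* nanoseconds */
};

/*
 * Decode a 32-bit seconds field (e.g. i_mtime) together with its
 * companion extra field (e.g. i_mtime_extra).
 */
static struct decoded_time decode_time(uint32_t seconds, uint32_t extra)
{
    struct decoded_time t;

    /* Sign-extend the seconds, then add 2^32 per unit of the epoch bits. */
    t.sec = (int64_t)(int32_t)seconds +
            ((int64_t)(extra & EPOCH_MASK) << 32);
    /* The remaining upper 30 bits hold the nanoseconds. */
    t.nsec = extra >> EPOCH_BITS;
    return t;
}
```

As a check against the table, a seconds field of ``0x80000000`` with epoch
bits ``0 1`` decodes to ``0x080000000``, the start of the 2038 to 2106 row.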

This is a somewhat odd encoding since there are effectively seven times
as many positive values as negative values. There have also been
long-standing bugs decoding and encoding dates beyond 2038, which don't
seem to be fixed as of kernel 3.12 and e2fsprogs 1.42.8. 64-bit kernels
incorrectly use the extra epoch bits 1,1 for dates between 1901 and
1970. At some point the kernel will be fixed and e2fsck will fix this
situation, assuming that it is run before 2310.