0001 .. SPDX-License-Identifier: GPL-2.0
0002
0003 ======================
0004 The SGI XFS Filesystem
0005 ======================
0006
0007 XFS is a high performance journaling filesystem which originated
0008 on the SGI IRIX platform. It is completely multi-threaded, supports
0009 large files, large filesystems, extended attributes, and variable
0010 block sizes, is extent based, and makes extensive use of
0011 Btrees (directories, extents, free space) to aid both performance
0012 and scalability.
0013
0014 Refer to the documentation at https://xfs.wiki.kernel.org/
0015 for further details. This implementation is on-disk compatible
0016 with the IRIX version of XFS.
0017
0018
0019 Mount Options
0020 =============
0021
0022 When mounting an XFS filesystem, the following options are accepted.
0023
0024 allocsize=size
0025 Sets the buffered I/O end-of-file preallocation size when
0026 doing delayed allocation writeout (default size is 64KiB).
0027 Valid values for this option are page size (typically 4KiB)
0028 through to 1GiB, inclusive, in power-of-2 increments.
0029
0030 The default behaviour is for dynamic end-of-file
0031 preallocation size, which uses a set of heuristics to
0032 optimise the preallocation size based on the current
0033 allocation patterns within the file and the access patterns
0034 to the file. Specifying a fixed ``allocsize`` value turns off
0035 the dynamic behaviour.
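
The constraints on ``allocsize`` stated above (a power of two between the page size and 1GiB) can be checked before mounting. The following is a minimal sketch, not kernel code; the helper name ``is_valid_allocsize`` and the 4KiB page size default are assumptions made for illustration.

```python
# Hypothetical helper: checks a candidate allocsize against the documented rules.
PAGE_SIZE = 4096         # assumption: typical page size, as noted in the text
MAX_ALLOCSIZE = 1 << 30  # 1GiB upper bound from the text


def is_valid_allocsize(size, page_size=PAGE_SIZE):
    """Return True if size (in bytes) is a power of two in [page_size, 1GiB]."""
    is_power_of_two = size > 0 and (size & (size - 1)) == 0
    return is_power_of_two and page_size <= size <= MAX_ALLOCSIZE


if __name__ == "__main__":
    for candidate in (2048, 4096, 65536, 3 * 65536, 1 << 30, 1 << 31):
        print(candidate, is_valid_allocsize(candidate))
```

On a real system the page size could be read with ``os.sysconf("SC_PAGE_SIZE")`` instead of relying on the 4KiB default.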
0036
0037 attr2 or noattr2
0038 These options enable/disable an "opportunistic" improvement
0039 in the way inline extended attributes are stored
0040 on-disk. When ``attr2`` is selected and the new form is used for
0041 the first time (either when setting or removing extended
0042 attributes), the on-disk superblock feature bit field will be
0043 updated to reflect this format being in use.
0044
0045 The default behaviour is determined by the on-disk feature
0046 bit indicating that ``attr2`` behaviour is active. If either
0047 mount option is set, then that becomes the new default used
0048 by the filesystem.
0049
0050 CRC enabled filesystems always use the ``attr2`` format, and so
0051 will reject the ``noattr2`` mount option if it is set.
0052
0053 discard or nodiscard (default)
0054 Enable/disable the issuing of commands to let the block
0055 device reclaim space freed by the filesystem. This is
0056 useful for SSD devices, thinly provisioned LUNs and virtual
0057 machine images, but may have a performance impact.
0058
0059 Note: It is currently recommended that you use the ``fstrim``
0060 application to ``discard`` unused blocks rather than the ``discard``
0061 mount option because the performance impact of this option
0062 is quite severe.
0063
0064 grpid/bsdgroups or nogrpid/sysvgroups (default)
0065 These options define what group ID a newly created file
0066 gets. When ``grpid`` is set, it takes the group ID of the
0067 directory in which it is created; otherwise it takes the
0068 ``fsgid`` of the current process, unless the directory has the
0069 ``setgid`` bit set, in which case it takes the ``gid`` from the
0070 parent directory, and also gets the ``setgid`` bit set if it is
0071 a directory itself.
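
The group ID rules above can be read as a small decision function. This is a sketch of that reading only; the function name and parameters are hypothetical, and it assumes a new directory inherits the ``setgid`` bit whenever the parent directory has it set.

```python
# Hypothetical model of the grpid/nogrpid rules described above.
def new_file_group(grpid, dir_gid, dir_has_setgid, process_fsgid, is_directory):
    """Return (gid, setgid_bit) for a newly created file or directory."""
    if grpid or dir_has_setgid:
        # The group comes from the directory in which the file is created.
        gid = dir_gid
    else:
        # Default (nogrpid/sysvgroups): use the creating process's fsgid.
        gid = process_fsgid
    # Assumption: a new directory under a setgid parent gets setgid itself.
    setgid_bit = dir_has_setgid and is_directory
    return gid, setgid_bit


if __name__ == "__main__":
    print(new_file_group(False, 100, False, 1000, False))
    print(new_file_group(True, 100, False, 1000, False))
    print(new_file_group(False, 100, True, 1000, True))
```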
0072
0073 filestreams
0074 Make the data allocator use the filestreams allocation mode
0075 across the entire filesystem rather than just on directories
0076 configured to use it.
0077
0078 ikeep or noikeep (default)
0079 When ``ikeep`` is specified, XFS does not delete empty inode
0080 clusters and keeps them around on disk. When ``noikeep`` is
0081 specified, empty inode clusters are returned to the free
0082 space pool.
0083
0084 inode32 or inode64 (default)
0085 When ``inode32`` is specified, it indicates that XFS limits
0086 inode creation to locations which will not result in inode
0087 numbers with more than 32 bits of significance.
0088
0089 When ``inode64`` is specified, it indicates that XFS is allowed
0090 to create inodes at any location in the filesystem,
0091 including those which will result in inode numbers occupying
0092 more than 32 bits of significance.
0093
0094 ``inode32`` is provided for backwards compatibility with older
0095 systems and applications, since 64-bit inode numbers might
0096 cause problems for some applications that cannot handle
0097 large inode numbers. If applications are in use which do
0098 not handle inode numbers bigger than 32 bits, the ``inode32``
0099 option should be specified.
0100
0101 largeio or nolargeio (default)
0102 If ``nolargeio`` is specified, the optimal I/O reported in
0103 ``st_blksize`` by **stat(2)** will be as small as possible to allow
0104 user applications to avoid inefficient read/modify/write
0105 I/O. This is typically the page size of the machine, as
0106 this is the granularity of the page cache.
0107
0108 If ``largeio`` is specified, a filesystem that was created with a
0109 ``swidth`` specified will return the ``swidth`` value (in bytes)
0110 in ``st_blksize``. If the filesystem does not have a ``swidth``
0111 specified but does specify an ``allocsize`` then ``allocsize``
0112 (in bytes) will be returned instead. Otherwise the behaviour
0113 is the same as if ``nolargeio`` was specified.
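
The selection of the ``st_blksize`` value described above can be sketched as follows. The function and parameter names are hypothetical, and the ``nolargeio`` case is simplified to "report the page size", which the text describes as the typical outcome rather than a guarantee.

```python
# Hypothetical model of how largeio/nolargeio choose the reported st_blksize.
def reported_blksize(largeio, page_size, swidth_bytes=0, allocsize_bytes=0):
    """Return the preferred I/O size, in bytes, per the rules above."""
    if largeio:
        if swidth_bytes:
            # Filesystem was created with a stripe width: report it.
            return swidth_bytes
        if allocsize_bytes:
            # No stripe width, but a fixed allocsize was given at mount time.
            return allocsize_bytes
    # nolargeio, or largeio with neither swidth nor allocsize specified.
    return page_size


if __name__ == "__main__":
    print(reported_blksize(False, 4096, swidth_bytes=262144))
    print(reported_blksize(True, 4096, swidth_bytes=262144))
    print(reported_blksize(True, 4096, allocsize_bytes=65536))
```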
0114
0115 logbufs=value
0116 Set the number of in-memory log buffers. Valid numbers
0117 range from 2-8 inclusive.
0118
0119 The default value is 8 buffers.
0120
0121 If the memory cost of 8 log buffers is too high on small
0122 systems, then it may be reduced at some cost to performance
0123 on metadata intensive workloads. The ``logbsize`` option below
0124 controls the size of each buffer and so is also relevant to
0125 this case.
0126
0127 logbsize=value
0128 Set the size of each in-memory log buffer. The size may be
0129 specified in bytes, or in kilobytes with a "k" suffix.
0130 Valid sizes for version 1 and version 2 logs are 16384 (16k)
0131 and 32768 (32k). Valid sizes for version 2 logs also
0132 include 65536 (64k), 131072 (128k) and 262144 (256k). The
0133 logbsize must be an integer multiple of the log
0134 stripe unit configured at **mkfs(8)** time.
0135
0136 The default value for version 1 logs is 32768, while the
0137 default value for version 2 logs is MAX(32768, log_sunit).
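
As an illustration of the ``logbsize`` rules above, the sketch below parses a value with an optional "k" suffix, checks it against the sizes listed for each log version, and computes the documented default. The helper names are hypothetical, and ``log_sunit`` is assumed to be expressed in bytes.

```python
# Hypothetical helpers modelling the documented logbsize constraints.
V1_SIZES = {16384, 32768}
V2_SIZES = V1_SIZES | {65536, 131072, 262144}


def parse_logbsize(text):
    """Convert '32768' or '32k' into a size in bytes."""
    text = text.strip().lower()
    if text.endswith("k"):
        return int(text[:-1]) * 1024
    return int(text)


def is_valid_logbsize(size, log_version, log_sunit=0):
    """Check the size list and the log stripe unit multiple requirement."""
    allowed = V2_SIZES if log_version == 2 else V1_SIZES
    if size not in allowed:
        return False
    return log_sunit == 0 or size % log_sunit == 0


def default_logbsize(log_version, log_sunit=0):
    """Version 1 logs default to 32768; version 2 to MAX(32768, log_sunit)."""
    return 32768 if log_version == 1 else max(32768, log_sunit)


if __name__ == "__main__":
    size = parse_logbsize("256k")
    print(size, is_valid_logbsize(size, log_version=2, log_sunit=65536))
```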
0138
0139 logdev=device and rtdev=device
0140 Use an external log (metadata journal) and/or real-time device.
0141 An XFS filesystem has up to three parts: a data section, a log
0142 section, and a real-time section. The real-time section is
0143 optional, and the log section can be separate from the data
0144 section or contained within it.
0145
0146 noalign
0147 Data allocations will not be aligned at stripe unit
0148 boundaries. This is only relevant to filesystems created
0149 with non-zero data alignment parameters (``sunit``, ``swidth``) by
0150 **mkfs(8)**.
0151
0152 norecovery
0153 The filesystem will be mounted without running log recovery.
0154 If the filesystem was not cleanly unmounted, it is likely to
0155 be inconsistent when mounted in ``norecovery`` mode.
0156 Some files or directories may not be accessible because of this.
0157 Filesystems mounted ``norecovery`` must be mounted read-only or
0158 the mount will fail.
0159
0160 nouuid
0161 Don't check for double mounted file systems using the file
0162 system ``uuid``. This is useful to mount LVM snapshot volumes,
0163 and often used in combination with ``norecovery`` for mounting
0164 read-only snapshots.
0165
0166 noquota
0167 Forcibly turns off all quota accounting and enforcement
0168 within the filesystem.
0169
0170 uquota/usrquota/uqnoenforce/quota
0171 User disk quota accounting enabled, and limits (optionally)
0172 enforced. Refer to **xfs_quota(8)** for further details.
0173
0174 gquota/grpquota/gqnoenforce
0175 Group disk quota accounting enabled and limits (optionally)
0176 enforced. Refer to **xfs_quota(8)** for further details.
0177
0178 pquota/prjquota/pqnoenforce
0179 Project disk quota accounting enabled and limits (optionally)
0180 enforced. Refer to **xfs_quota(8)** for further details.
0181
0182 sunit=value and swidth=value
0183 Used to specify the stripe unit and width for a RAID device
0184 or a stripe volume. "value" must be specified in 512-byte
0185 block units. These options are only relevant to filesystems
0186 that were created with non-zero data alignment parameters.
0187
0188 The ``sunit`` and ``swidth`` parameters specified must be compatible
0189 with the existing filesystem alignment characteristics. In
0190 general, that means the only valid changes to ``sunit`` are
0191 increasing it by a power-of-2 multiple. Valid ``swidth`` values
0192 are any integer multiple of a valid ``sunit`` value.
0193
0194 Typically the only time these mount options are necessary is
0195 after an underlying RAID device has had its geometry
0196 modified, such as adding a new disk to a RAID5 lun and
0197 reshaping it.
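
Because ``sunit`` and ``swidth`` are given in 512-byte units, converting from a RAID geometry is easy to get wrong. The sketch below shows the arithmetic under two assumptions that are not stated in the text: the stripe unit is known in bytes, and the stripe width equals the stripe unit multiplied by the number of data-bearing disks. The helper name is hypothetical.

```python
# Hypothetical helper: derive sunit/swidth mount option values.
SECTOR = 512  # the options are expressed in 512-byte block units


def stripe_mount_options(stripe_unit_bytes, data_disks):
    """Return a 'sunit=...,swidth=...' string for the given geometry."""
    if stripe_unit_bytes <= 0 or stripe_unit_bytes % SECTOR:
        raise ValueError("stripe unit must be a positive multiple of 512 bytes")
    if data_disks < 1:
        raise ValueError("need at least one data disk")
    sunit = stripe_unit_bytes // SECTOR
    swidth = sunit * data_disks  # an integer multiple of sunit, as required
    return f"sunit={sunit},swidth={swidth}"


if __name__ == "__main__":
    # 64KiB stripe unit across 4 data disks.
    print(stripe_mount_options(64 * 1024, 4))  # sunit=128,swidth=512
```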
0198
0199 swalloc
0200 Data allocations will be rounded up to stripe width boundaries
0201 when the current end of file is being extended and the file
0202 size is larger than the stripe width size.
0203
0204 wsync
0205 When specified, all filesystem namespace operations are
0206 executed synchronously. This ensures that when the namespace
0207 operation (create, unlink, etc) completes, the change to the
0208 namespace is on stable storage. This is useful in HA setups
0209 where failover must not result in clients seeing
0210 inconsistent namespace presentation during or after a
0211 failover event.
0212
0213 Deprecation of V4 Format
0214 ========================
0215
0216 The V4 filesystem format lacks certain features that are supported by
0217 the V5 format, such as metadata checksumming, strengthened metadata
0218 verification, and the ability to store timestamps past the year 2038.
0219 Because of this, the V4 format is deprecated. All users should upgrade
0220 by backing up their files, reformatting, and restoring from the backup.
0221
0222 Administrators and users can detect a V4 filesystem by running xfs_info
0223 against a filesystem mountpoint and checking for a string containing
0224 "crc=". If no such string is found, please upgrade xfsprogs to the
0225 latest version and try again.
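
The check described above can be scripted. The sketch below searches text in the style of xfs_info output for the "crc=" field; it assumes that "crc=0" indicates a V4 filesystem and "crc=1" a V5 filesystem. The sample text is illustrative only and was not captured from a real system, and the function name is hypothetical.

```python
import re

# Illustrative sample only; real xfs_info output has more lines and fields.
SAMPLE_XFS_INFO = """\
meta-data=/dev/sda1              isize=512    agcount=4, agsize=655360 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
"""


def detect_format(xfs_info_text):
    """Return 'V5', 'V4', or None when no crc= field is present."""
    match = re.search(r"\bcrc=(\d+)", xfs_info_text)
    if match is None:
        # Per the text: upgrade xfsprogs and try again.
        return None
    return "V5" if match.group(1) != "0" else "V4"


if __name__ == "__main__":
    print(detect_format(SAMPLE_XFS_INFO))
```

In practice the text would come from running xfs_info against the mountpoint, for example through ``subprocess.run``.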
0226
0227 The deprecation will take place in two parts. Support for mounting V4
0228 filesystems can now be disabled at kernel build time via Kconfig option.
0229 The option will default to yes until September 2025, at which time it
0230 will be changed to default to no. In September 2030, support will be
0231 removed from the codebase entirely.
0232
0233 Note: Distributors may choose to withdraw V4 format support earlier than
0234 the dates listed above.
0235
0236 Deprecated Mount Options
0237 ========================
0238
0239 =========================== ================
0240 Name Removal Schedule
0241 =========================== ================
0242 Mounting with V4 filesystem September 2030
0243 ikeep/noikeep September 2025
0244 attr2/noattr2 September 2025
0245 =========================== ================
0246
0247
0248 Removed Mount Options
0249 =====================
0250
0251 =========================== =======
0252 Name Removed
0253 =========================== =======
0254 delaylog/nodelaylog v4.0
0255 ihashsize v4.0
0256 irixsgid v4.0
0257 osyncisdsync/osyncisosync v4.0
0258 barrier v4.19
0259 nobarrier v4.19
0260 =========================== =======
0261
0262 sysctls
0263 =======
0264
0265 The following sysctls are available for the XFS filesystem:
0266
0267 fs.xfs.stats_clear (Min: 0 Default: 0 Max: 1)
0268 Setting this to "1" clears accumulated XFS statistics
0269 in /proc/fs/xfs/stat. It then immediately resets to "0".
0270
0271 fs.xfs.xfssyncd_centisecs (Min: 100 Default: 3000 Max: 720000)
0272 The interval at which the filesystem flushes metadata
0273 out to disk and runs internal cache cleanup routines.
0274
0275 fs.xfs.filestream_centisecs (Min: 1 Default: 3000 Max: 360000)
0276 The interval at which the filesystem ages filestreams cache
0277 references and returns timed-out AGs back to the free stream
0278 pool.
0279
0280 fs.xfs.speculative_prealloc_lifetime
0281 (Units: seconds Min: 1 Default: 300 Max: 86400)
0282 The interval at which the background scanning for inodes
0283 with unused speculative preallocation runs. The scan
0284 removes unused preallocation from clean inodes and releases
0285 the unused space back to the free pool.
0286
0287 fs.xfs.speculative_cow_prealloc_lifetime
0288 This is an alias for speculative_prealloc_lifetime.
0289
0290 fs.xfs.error_level (Min: 0 Default: 3 Max: 11)
0291 A volume knob for error reporting when internal errors occur.
0292 This will generate detailed messages & backtraces for filesystem
0293 shutdowns, for example. Current threshold values are:
0294
0295 XFS_ERRLEVEL_OFF: 0
0296 XFS_ERRLEVEL_LOW: 1
0297 XFS_ERRLEVEL_HIGH: 5
0298
0299 fs.xfs.panic_mask (Min: 0 Default: 0 Max: 511)
0300 Causes certain error conditions to call BUG(). Value is a bitmask;
0301 OR together the tags which represent errors which should cause panics:
0302
0303 XFS_NO_PTAG 0
0304 XFS_PTAG_IFLUSH 0x00000001
0305 XFS_PTAG_LOGRES 0x00000002
0306 XFS_PTAG_AILDELETE 0x00000004
0307 XFS_PTAG_ERROR_REPORT 0x00000008
0308 XFS_PTAG_SHUTDOWN_CORRUPT 0x00000010
0309 XFS_PTAG_SHUTDOWN_IOERROR 0x00000020
0310 XFS_PTAG_SHUTDOWN_LOGERROR 0x00000040
0311 XFS_PTAG_FSBLOCK_ZERO 0x00000080
0312 XFS_PTAG_VERIFIER_ERROR 0x00000100
0313
0314 This option is intended for debugging only.
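
Since ``panic_mask`` is a bitmask, a value is built by OR-ing tags together and can be decoded the same way. The sketch below uses the tag values listed above; the helper names are hypothetical.

```python
# Tag values copied from the list above.
PTAGS = {
    "XFS_PTAG_IFLUSH": 0x001,
    "XFS_PTAG_LOGRES": 0x002,
    "XFS_PTAG_AILDELETE": 0x004,
    "XFS_PTAG_ERROR_REPORT": 0x008,
    "XFS_PTAG_SHUTDOWN_CORRUPT": 0x010,
    "XFS_PTAG_SHUTDOWN_IOERROR": 0x020,
    "XFS_PTAG_SHUTDOWN_LOGERROR": 0x040,
    "XFS_PTAG_FSBLOCK_ZERO": 0x080,
    "XFS_PTAG_VERIFIER_ERROR": 0x100,
}


def encode_panic_mask(names):
    """OR together the named tags into a single mask value."""
    mask = 0
    for name in names:
        mask |= PTAGS[name]
    return mask


def decode_panic_mask(mask):
    """Return the names of all tags set in mask, in ascending bit order."""
    return [name for name, bit in PTAGS.items() if mask & bit]


if __name__ == "__main__":
    mask = encode_panic_mask(
        ["XFS_PTAG_SHUTDOWN_CORRUPT", "XFS_PTAG_SHUTDOWN_IOERROR"]
    )
    print(hex(mask), decode_panic_mask(mask))
```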
0315
0316 fs.xfs.irix_symlink_mode (Min: 0 Default: 0 Max: 1)
0317 Controls whether symlinks are created with mode 0777 (default)
0318 or whether their mode is affected by the umask (irix mode).
0319
0320 fs.xfs.irix_sgid_inherit (Min: 0 Default: 0 Max: 1)
0321 Controls files created in SGID directories.
0322 If the group ID of the new file does not match the effective group
0323 ID or one of the supplementary group IDs of the parent dir, the
0324 ISGID bit is cleared if the irix_sgid_inherit compatibility sysctl
0325 is set.
0326
0327 fs.xfs.inherit_sync (Min: 0 Default: 1 Max: 1)
0328 Setting this to "1" will cause the "sync" flag set
0329 by the **xfs_io(8)** chattr command on a directory to be
0330 inherited by files in that directory.
0331
0332 fs.xfs.inherit_nodump (Min: 0 Default: 1 Max: 1)
0333 Setting this to "1" will cause the "nodump" flag set
0334 by the **xfs_io(8)** chattr command on a directory to be
0335 inherited by files in that directory.
0336
0337 fs.xfs.inherit_noatime (Min: 0 Default: 1 Max: 1)
0338 Setting this to "1" will cause the "noatime" flag set
0339 by the **xfs_io(8)** chattr command on a directory to be
0340 inherited by files in that directory.
0341
0342 fs.xfs.inherit_nosymlinks (Min: 0 Default: 1 Max: 1)
0343 Setting this to "1" will cause the "nosymlinks" flag set
0344 by the **xfs_io(8)** chattr command on a directory to be
0345 inherited by files in that directory.
0346
0347 fs.xfs.inherit_nodefrag (Min: 0 Default: 1 Max: 1)
0348 Setting this to "1" will cause the "nodefrag" flag set
0349 by the **xfs_io(8)** chattr command on a directory to be
0350 inherited by files in that directory.
0351
0352 fs.xfs.rotorstep (Min: 1 Default: 1 Max: 256)
0353 In "inode32" allocation mode, this option determines how many
0354 files the allocator attempts to allocate in the same allocation
0355 group before moving to the next allocation group. The intent
0356 is to control the rate at which the allocator moves between
0357 allocation groups when allocating extents for new files.
0358
0359 Deprecated Sysctls
0360 ==================
0361
0362 =========================================== ================
0363 Name Removal Schedule
0364 =========================================== ================
0365 fs.xfs.irix_sgid_inherit September 2025
0366 fs.xfs.irix_symlink_mode September 2025
0367 fs.xfs.speculative_cow_prealloc_lifetime September 2025
0368 =========================================== ================
0369
0370
0371 Removed Sysctls
0372 ===============
0373
0374 ============================= =======
0375 Name Removed
0376 ============================= =======
0377 fs.xfs.xfsbufd_centisec v4.0
0378 fs.xfs.age_buffer_centisecs v4.0
0379 ============================= =======
0380
0381 Error handling
0382 ==============
0383
0384 XFS can act differently according to the type of error found during its
0385 operation. The implementation introduces the following concepts to the error
0386 handler:
0387
0388 -failure speed:
0389 Defines how fast XFS should propagate an error upwards when a specific
0390 error is found during the filesystem operation. It can propagate
0391 immediately, after a defined number of retries, after a set time period,
0392 or simply retry forever.
0393
0394 -error classes:
0395 Specifies the subsystem the error configuration will apply to, such as
0396 metadata IO or memory allocation. Different subsystems will have
0397 different error handlers for which behaviour can be configured.
0398
0399 -error handlers:
0400 Defines the behavior for a specific error.
0401
0402 The filesystem behavior during an error can be set via ``sysfs`` files. Each
0403 error handler works independently - the first condition met by an error handler
0404 for a specific class will cause the error to be propagated rather than reset and
0405 retried.
0406
0407 The action taken by the filesystem when the error is propagated is context
0408 dependent - it may cause a shut down in the case of an unrecoverable error,
0409 it may be reported back to userspace, or it may even be ignored because
0410 there's nothing useful we can do with the error or anyone we can report it to (e.g.
0411 during unmount).
0412
0413 The configuration files are organized into the following hierarchy for each
0414 mounted filesystem:
0415
0416 /sys/fs/xfs/<dev>/error/<class>/<error>/
0417
0418 Where:
0419 <dev>
0420 The short device name of the mounted filesystem. This is the same device
0421 name that shows up in XFS kernel error messages as "XFS(<dev>): ..."
0422
0423 <class>
0424 The subsystem the error configuration belongs to. As of 4.9, the defined
0425 classes are:
0426
0427 - "metadata": applies to metadata buffer write IO
0428
0429 <error>
0430 The individual error handler configurations.
0431
0432
0433 Each filesystem has "global" error configuration options defined in its top
0434 level directory:
0435
0436 /sys/fs/xfs/<dev>/error/
0437
0438 fail_at_unmount (Min: 0 Default: 1 Max: 1)
0439 Defines the filesystem error behavior at unmount time.
0440
0441 If set to a value of 1, XFS will override all other error configurations
0442 during unmount and replace them with "immediate fail" characteristics.
0443 i.e. no retries, no retry timeout. This will always allow unmount to
0444 succeed when there are persistent errors present.
0445
0446 If set to 0, the configured retry behaviour will continue until all
0447 retries and/or timeouts have been exhausted. This will delay unmount
0448 completion when there are persistent errors, and it may prevent the
0449 filesystem from ever unmounting fully in the case of "retry forever"
0450 handler configurations.
0451
0452 Note: there is no guarantee that fail_at_unmount can be set while an
0453 unmount is in progress. It is possible that the ``sysfs`` entries are
0454 removed by the unmounting filesystem before a "retry forever" error
0455 handler configuration causes unmount to hang, and hence the filesystem
0456 must be configured appropriately before unmount begins to prevent
0457 unmount hangs.
0458
0459 Each filesystem has specific error class handlers that define the error
0460 propagation behaviour for specific errors. There is also a "default" error
0461 handler defined, which defines the behaviour for all errors that don't have
0462 specific handlers defined. Where multiple retry constraints are configured for
0463 a single error, the first retry configuration that expires will cause the error
0464 to be propagated. The handler configurations are found in the directory:
0465
0466 /sys/fs/xfs/<dev>/error/<class>/<error>/
0467
0468 max_retries (Min: -1 Default: Varies Max: INTMAX)
0469 Defines the allowed number of retries of a specific error before
0470 the filesystem will propagate the error. The retry count for a given
0471 error context (e.g. a specific metadata buffer) is reset every time
0472 there is a successful completion of the operation.
0473
0474 Setting the value to "-1" will cause XFS to retry forever for this
0475 specific error.
0476
0477 Setting the value to "0" will cause XFS to fail immediately when the
0478 specific error is reported.
0479
0480 Setting the value to "N" (where 0 < N < Max) will make XFS retry the
0481 operation "N" times before propagating the error.
0482
0483 retry_timeout_seconds (Min: -1 Default: Varies Max: 1 day)
0484 Define the amount of time (in seconds) that the filesystem is
0485 allowed to retry its operations when the specific error is
0486 found.
0487
0488 Setting the value to "-1" will allow XFS to retry forever for this
0489 specific error.
0490
0491 Setting the value to "0" will cause XFS to fail immediately when the
0492 specific error is reported.
0493
0494 Setting the value to "N" (where 0 < N < Max) will allow XFS to retry the
0495 operation for up to "N" seconds before propagating the error.
0496
0497 **Note:** The default behaviour for a specific error handler is dependent on both
0498 the class and error context. For example, the default values for
0499 "metadata/ENODEV" are "0" rather than "-1" so that this error handler defaults
0500 to "fail immediately" behaviour. This is done because ENODEV is a fatal,
0501 unrecoverable error no matter how many times the metadata IO is retried.
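
The interaction between ``max_retries`` and ``retry_timeout_seconds`` described above (whichever constraint expires first causes propagation) can be modelled in a few lines. This is a simplified sketch of the documented behaviour, not the kernel's implementation; the function and parameter names are hypothetical.

```python
RETRY_FOREVER = -1  # documented "retry forever" value for both knobs


def should_propagate(retries_done, seconds_elapsed,
                     max_retries, retry_timeout_seconds):
    """Return True once either configured retry constraint has expired."""
    retries_expired = (
        max_retries != RETRY_FOREVER and retries_done >= max_retries
    )
    timeout_expired = (
        retry_timeout_seconds != RETRY_FOREVER
        and seconds_elapsed >= retry_timeout_seconds
    )
    return retries_expired or timeout_expired


if __name__ == "__main__":
    # "metadata/ENODEV" style defaults of 0: fail immediately.
    print(should_propagate(0, 0, max_retries=0, retry_timeout_seconds=0))
    # Both knobs set to -1: never propagate on retry limits alone.
    print(should_propagate(10_000, 86_400, RETRY_FOREVER, RETRY_FOREVER))
    # Five retries allowed, no time limit: propagate after the fifth retry.
    print(should_propagate(5, 12, max_retries=5,
                           retry_timeout_seconds=RETRY_FOREVER))
```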
0502
0503 Workqueue Concurrency
0504 =====================
0505
0506 XFS uses kernel workqueues to parallelize metadata update processes. This
0507 enables it to take advantage of storage hardware that can service many IO
0508 operations simultaneously. This interface exposes internal implementation
0509 details of XFS, and as such is explicitly not part of any userspace API/ABI
0510 guarantee the kernel may give userspace. These are undocumented features of
0511 the generic workqueue implementation XFS uses for concurrency, and they are
0512 provided here purely for diagnostic and tuning purposes and may change at any
0513 time in the future.
0514
0515 The control knobs for a filesystem's workqueues are organized by task at hand
0516 and the short name of the data device. They all can be found in:
0517
0518 /sys/bus/workqueue/devices/${task}!${device}
0519
0520 ================ ===========
0521 Task Description
0522 ================ ===========
0523 xfs_iwalk-$pid Inode scans of the entire filesystem. Currently limited to
0524 mount time quotacheck.
0525 xfs-gc Background garbage collection of disk space that has been
0526 speculatively allocated beyond EOF or for staging copy on
0527 write operations.
0528 ================ ===========
0529
0530 For example, the knobs for the quotacheck workqueue for /dev/nvme0n1 would be
0531 found in /sys/bus/workqueue/devices/xfs_iwalk-1111!nvme0n1/.
0532
0533 The interesting knobs for XFS workqueues are as follows:
0534
0535 ============ ===========
0536 Knob Description
0537 ============ ===========
0538 max_active Maximum number of background threads that can be started to
0539 run the work.
0540 cpumask CPUs upon which the threads are allowed to run.
0541 nice Relative priority of scheduling the threads. These are the
0542 same nice levels that can be applied to userspace processes.
0543 ============ ===========
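
The path layout and knob names above can be combined into a small helper for locating a given knob. This is a sketch only: the function names are hypothetical, and because this interface is not a stable ABI, the paths it produces may not exist on every kernel.

```python
from pathlib import PurePosixPath

WORKQUEUE_ROOT = PurePosixPath("/sys/bus/workqueue/devices")
KNOBS = ("max_active", "cpumask", "nice")  # knobs listed in the table above


def workqueue_dir(task, device):
    """Directory for a workqueue, named '<task>!<device>'."""
    return WORKQUEUE_ROOT / f"{task}!{device}"


def knob_path(task, device, knob):
    """Full path of one knob file; rejects names not listed above."""
    if knob not in KNOBS:
        raise ValueError(f"unknown knob: {knob}")
    return workqueue_dir(task, device) / knob


if __name__ == "__main__":
    # Matches the quotacheck example for /dev/nvme0n1 given earlier.
    print(workqueue_dir("xfs_iwalk-1111", "nvme0n1"))
    print(knob_path("xfs-gc", "nvme0n1", "max_active"))
```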