.. SPDX-License-Identifier: GPL-2.0

======================
The SGI XFS Filesystem
======================

XFS is a high performance journaling filesystem which originated
on the SGI IRIX platform.  It is completely multi-threaded, can
support large files and large filesystems, extended attributes,
variable block sizes, is extent based, and makes extensive use of
Btrees (directories, extents, free space) to aid both performance
and scalability.

Refer to the documentation at https://xfs.wiki.kernel.org/
for further details.  This implementation is on-disk compatible
with the IRIX version of XFS.


Mount Options
=============

When mounting an XFS filesystem, the following options are accepted.

  allocsize=size
        Sets the buffered I/O end-of-file preallocation size when
        doing delayed allocation writeout (default size is 64KiB).
        Valid values for this option are page size (typically 4KiB)
        through to 1GiB, inclusive, in power-of-2 increments.

        The default behaviour is for dynamic end-of-file
        preallocation size, which uses a set of heuristics to
        optimise the preallocation size based on the current
        allocation patterns within the file and the access patterns
        to the file. Specifying a fixed ``allocsize`` value turns off
        the dynamic behaviour.

  attr2 or noattr2
        These options enable/disable an "opportunistic" improvement
        in the way inline extended attributes are stored on-disk.
        When the new form is used for the first time when
        ``attr2`` is selected (either when setting or removing extended
        attributes) the on-disk superblock feature bit field will be
        updated to reflect this format being in use.

        The default behaviour is determined by the on-disk feature
        bit indicating that ``attr2`` behaviour is active. If either
        mount option is set, then that becomes the new default used
        by the filesystem.

        CRC enabled filesystems always use the ``attr2`` format, and so
        will reject the ``noattr2`` mount option if it is set.

  discard or nodiscard (default)
        Enable/disable the issuing of commands to let the block
        device reclaim space freed by the filesystem.  This is
        useful for SSD devices, thinly provisioned LUNs and virtual
        machine images, but may have a performance impact.

        Note: It is currently recommended that you use the ``fstrim``
        application to ``discard`` unused blocks rather than the ``discard``
        mount option because the performance impact of this option
        is quite severe.

  grpid/bsdgroups or nogrpid/sysvgroups (default)
        These options define what group ID a newly created file
        gets.  When ``grpid`` is set, it takes the group ID of the
        directory in which it is created; otherwise it takes the
        ``fsgid`` of the current process, unless the directory has the
        ``setgid`` bit set, in which case it takes the ``gid`` from the
        parent directory, and also gets the ``setgid`` bit set if it is
        a directory itself.

  filestreams
        Make the data allocator use the filestreams allocation mode
        across the entire filesystem rather than just on directories
        configured to use it.

  ikeep or noikeep (default)
        When ``ikeep`` is specified, XFS does not delete empty inode
        clusters and keeps them around on disk.  When ``noikeep`` is
        specified, empty inode clusters are returned to the free
        space pool.

  inode32 or inode64 (default)
        When ``inode32`` is specified, it indicates that XFS limits
        inode creation to locations which will not result in inode
        numbers with more than 32 bits of significance.

        When ``inode64`` is specified, it indicates that XFS is allowed
        to create inodes at any location in the filesystem,
        including those which will result in inode numbers occupying
        more than 32 bits of significance.

        ``inode32`` is provided for backwards compatibility with older
        systems and applications, since 64-bit inode numbers might
        cause problems for some applications that cannot handle
        large inode numbers.  If applications are in use which do
        not handle inode numbers bigger than 32 bits, the ``inode32``
        option should be specified.

  largeio or nolargeio (default)
        If ``nolargeio`` is specified, the optimal I/O reported in
        ``st_blksize`` by **stat(2)** will be as small as possible to allow
        user applications to avoid inefficient read/modify/write
        I/O.  This is typically the page size of the machine, as
        this is the granularity of the page cache.

        If ``largeio`` is specified, a filesystem that was created with a
        ``swidth`` specified will return the ``swidth`` value (in bytes)
        in ``st_blksize``. If the filesystem does not have a ``swidth``
        specified but does specify an ``allocsize`` then ``allocsize``
        (in bytes) will be returned instead. Otherwise the behaviour
        is the same as if ``nolargeio`` was specified.

  logbufs=value
        Set the number of in-memory log buffers.  Valid numbers
        range from 2-8 inclusive.

        The default value is 8 buffers.

        If the memory cost of 8 log buffers is too high on small
        systems, then it may be reduced at some cost to performance
        on metadata intensive workloads. The ``logbsize`` option below
        controls the size of each buffer and so is also relevant to
        this case.

  logbsize=value
        Set the size of each in-memory log buffer.  The size may be
        specified in bytes, or in kilobytes with a "k" suffix.
        Valid sizes for version 1 and version 2 logs are 16384 (16k)
        and 32768 (32k).  Valid sizes for version 2 logs also
        include 65536 (64k), 131072 (128k) and 262144 (256k). The
        logbsize must be an integer multiple of the log
        stripe unit configured at **mkfs(8)** time.

        The default value for version 1 logs is 32768, while the
        default value for version 2 logs is MAX(32768, log_sunit).

  logdev=device and rtdev=device
        Use an external log (metadata journal) and/or real-time device.
        An XFS filesystem has up to three parts: a data section, a log
        section, and a real-time section.  The real-time section is
        optional, and the log section can be separate from the data
        section or contained within it.

  noalign
        Data allocations will not be aligned at stripe unit
        boundaries. This is only relevant to filesystems created
        with non-zero data alignment parameters (``sunit``, ``swidth``) by
        **mkfs(8)**.

  norecovery
        The filesystem will be mounted without running log recovery.
        If the filesystem was not cleanly unmounted, it is likely to
        be inconsistent when mounted in ``norecovery`` mode.
        Some files or directories may not be accessible because of this.
        Filesystems mounted ``norecovery`` must be mounted read-only or
        the mount will fail.

  nouuid
        Don't check for double mounted file systems using the file
        system ``uuid``.  This is useful to mount LVM snapshot volumes,
        and often used in combination with ``norecovery`` for mounting
        read-only snapshots.

  noquota
        Forcibly turns off all quota accounting and enforcement
        within the filesystem.

  uquota/usrquota/uqnoenforce/quota
        User disk quota accounting enabled, and limits (optionally)
        enforced.  Refer to **xfs_quota(8)** for further details.

  gquota/grpquota/gqnoenforce
        Group disk quota accounting enabled and limits (optionally)
        enforced.  Refer to **xfs_quota(8)** for further details.

  pquota/prjquota/pqnoenforce
        Project disk quota accounting enabled and limits (optionally)
        enforced.  Refer to **xfs_quota(8)** for further details.

  sunit=value and swidth=value
        Used to specify the stripe unit and width for a RAID device
        or a stripe volume.  "value" must be specified in 512-byte
        block units. These options are only relevant to filesystems
        that were created with non-zero data alignment parameters.

        The ``sunit`` and ``swidth`` parameters specified must be compatible
        with the existing filesystem alignment characteristics.  In
        general, that means the only valid changes to ``sunit`` are
        increasing it by a power-of-2 multiple. Valid ``swidth`` values
        are any integer multiple of a valid ``sunit`` value.

        Typically the only time these mount options are necessary is
        after an underlying RAID device has had its geometry
        modified, such as adding a new disk to a RAID5 LUN and
        reshaping it.

  swalloc
        Data allocations will be rounded up to stripe width boundaries
        when the current end of file is being extended and the file
        size is larger than the stripe width size.

  wsync
        When specified, all filesystem namespace operations are
        executed synchronously. This ensures that when the namespace
        operation (create, unlink, etc) completes, the change to the
        namespace is on stable storage. This is useful in HA setups
        where failover must not result in clients seeing
        inconsistent namespace presentation during or after a
        failover event.

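The numeric constraints on ``allocsize``, ``logbsize``, ``sunit`` and
``swidth`` described above can be checked before a mount is attempted. The
following is a minimal Python sketch of those rules, not the kernel's own
validation code. The function names and the 4KiB ``PAGE_SIZE`` constant are
assumptions made for illustration; the real page size depends on the machine.

```python
PAGE_SIZE = 4096   # assumption: typical page size, the real value is per-machine
GIB = 1 << 30

# Sizes listed in the logbsize description above.
V1_LOGBSIZES = (16384, 32768)
V2_LOGBSIZES = V1_LOGBSIZES + (65536, 131072, 262144)


def is_valid_allocsize(size_bytes, page_size=PAGE_SIZE):
    """True if size_bytes is a power of two in [page size, 1GiB]."""
    is_power_of_two = size_bytes > 0 and (size_bytes & (size_bytes - 1)) == 0
    return is_power_of_two and page_size <= size_bytes <= GIB


def is_valid_logbsize(size_bytes, log_version=2, log_sunit_bytes=0):
    """True if size_bytes is a listed size and a multiple of the log stripe unit."""
    allowed = V2_LOGBSIZES if log_version == 2 else V1_LOGBSIZES
    if size_bytes not in allowed:
        return False
    # A log stripe unit of 0 is used here to mean "none configured".
    return log_sunit_bytes == 0 or size_bytes % log_sunit_bytes == 0


def bytes_to_512_units(nbytes):
    """Convert a byte count to the 512-byte units used by sunit= and swidth=."""
    if nbytes % 512 != 0:
        raise ValueError("value must be a multiple of 512 bytes")
    return nbytes // 512
```

As a usage example, a 64KiB stripe unit across four data disks would be
expressed as ``sunit=128`` (``bytes_to_512_units(65536)``) and ``swidth=512``
(``bytes_to_512_units(4 * 65536)``).
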
Deprecation of V4 Format
========================

The V4 filesystem format lacks certain features that are supported by
the V5 format, such as metadata checksumming, strengthened metadata
verification, and the ability to store timestamps past the year 2038.
Because of this, the V4 format is deprecated.  All users should upgrade
by backing up their files, reformatting, and restoring from the backup.

Administrators and users can detect a V4 filesystem by running xfs_info
against a filesystem mountpoint and checking for a string containing
"crc=".  A value of "crc=0" indicates the V4 format, while "crc=1"
indicates V5.  If no such string is found, please upgrade xfsprogs to the
latest version and try again.

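The check described above can be scripted. The following is a minimal Python
sketch that inspects text already captured from xfs_info. It assumes that
CRC-enabled (V5) filesystems report "crc=1" and V4 filesystems report
"crc=0". The function name and the sample output line are illustrative
assumptions, not part of xfsprogs.

```python
import re


def xfs_format_version(xfs_info_output):
    """Return 5 for crc=1, 4 for crc=0, or None if no crc= field is present."""
    match = re.search(r"\bcrc=(\d+)", xfs_info_output)
    if match is None:
        # No crc= field: the xfsprogs in use is likely too old to report it.
        return None
    return 5 if match.group(1) == "1" else 4


# Hypothetical fragment of xfs_info output, used only to exercise the parser.
sample = "         =                       crc=1        finobt=1, sparse=1"
version = xfs_format_version(sample)
```

A return value of ``None`` corresponds to the case where no such string is
found and xfsprogs should be upgraded before trying again.
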
The deprecation will take place in two parts.  Support for mounting V4
filesystems can now be disabled at kernel build time via a Kconfig option.
The option will default to yes until September 2025, at which time it
will be changed to default to no.  In September 2030, support will be
removed from the codebase entirely.

Note: Distributors may choose to withdraw V4 format support earlier than
the dates listed above.

Deprecated Mount Options
========================

===========================     ================
  Name                          Removal Schedule
===========================     ================
Mounting with V4 filesystem     September 2030
ikeep/noikeep                   September 2025
attr2/noattr2                   September 2025
===========================     ================


Removed Mount Options
=====================

===========================     =======
  Name                          Removed
===========================     =======
  delaylog/nodelaylog           v4.0
  ihashsize                     v4.0
  irixsgid                      v4.0
  osyncisdsync/osyncisosync     v4.0
  barrier                       v4.19
  nobarrier                     v4.19
===========================     =======

sysctls
=======

The following sysctls are available for the XFS filesystem:

  fs.xfs.stats_clear            (Min: 0  Default: 0  Max: 1)
        Setting this to "1" clears accumulated XFS statistics
        in /proc/fs/xfs/stat.  It then immediately resets to "0".

  fs.xfs.xfssyncd_centisecs     (Min: 100  Default: 3000  Max: 720000)
        The interval at which the filesystem flushes metadata
        out to disk and runs internal cache cleanup routines.

  fs.xfs.filestream_centisecs   (Min: 1  Default: 3000  Max: 360000)
        The interval at which the filesystem ages filestreams cache
        references and returns timed-out AGs back to the free stream
        pool.

  fs.xfs.speculative_prealloc_lifetime
        (Units: seconds   Min: 1  Default: 300  Max: 86400)
        The interval at which the background scanning for inodes
        with unused speculative preallocation runs. The scan
        removes unused preallocation from clean inodes and releases
        the unused space back to the free pool.

  fs.xfs.speculative_cow_prealloc_lifetime
        This is an alias for speculative_prealloc_lifetime.

  fs.xfs.error_level            (Min: 0  Default: 3  Max: 11)
        A volume knob for error reporting when internal errors occur.
        This will generate detailed messages & backtraces for filesystem
        shutdowns, for example.  Current threshold values are:

                XFS_ERRLEVEL_OFF:       0
                XFS_ERRLEVEL_LOW:       1
                XFS_ERRLEVEL_HIGH:      5

  fs.xfs.panic_mask             (Min: 0  Default: 0  Max: 256)
        Causes certain error conditions to call BUG(). Value is a bitmask;
        OR together the tags which represent errors which should cause panics:

                XFS_NO_PTAG                     0
                XFS_PTAG_IFLUSH                 0x00000001
                XFS_PTAG_LOGRES                 0x00000002
                XFS_PTAG_AILDELETE              0x00000004
                XFS_PTAG_ERROR_REPORT           0x00000008
                XFS_PTAG_SHUTDOWN_CORRUPT       0x00000010
                XFS_PTAG_SHUTDOWN_IOERROR       0x00000020
                XFS_PTAG_SHUTDOWN_LOGERROR      0x00000040
                XFS_PTAG_FSBLOCK_ZERO           0x00000080
                XFS_PTAG_VERIFIER_ERROR         0x00000100

        This option is intended for debugging only.

  fs.xfs.irix_symlink_mode      (Min: 0  Default: 0  Max: 1)
        Controls whether symlinks are created with mode 0777 (default)
        or whether their mode is affected by the umask (irix mode).

  fs.xfs.irix_sgid_inherit      (Min: 0  Default: 0  Max: 1)
        Controls files created in SGID directories.
        If the group ID of the new file does not match the effective group
        ID or one of the supplementary group IDs of the parent dir, the
        ISGID bit is cleared if the irix_sgid_inherit compatibility sysctl
        is set.

  fs.xfs.inherit_sync           (Min: 0  Default: 1  Max: 1)
        Setting this to "1" will cause the "sync" flag set
        by the **xfs_io(8)** chattr command on a directory to be
        inherited by files in that directory.

  fs.xfs.inherit_nodump         (Min: 0  Default: 1  Max: 1)
        Setting this to "1" will cause the "nodump" flag set
        by the **xfs_io(8)** chattr command on a directory to be
        inherited by files in that directory.

  fs.xfs.inherit_noatime        (Min: 0  Default: 1  Max: 1)
        Setting this to "1" will cause the "noatime" flag set
        by the **xfs_io(8)** chattr command on a directory to be
        inherited by files in that directory.

  fs.xfs.inherit_nosymlinks     (Min: 0  Default: 1  Max: 1)
        Setting this to "1" will cause the "nosymlinks" flag set
        by the **xfs_io(8)** chattr command on a directory to be
        inherited by files in that directory.

  fs.xfs.inherit_nodefrag       (Min: 0  Default: 1  Max: 1)
        Setting this to "1" will cause the "nodefrag" flag set
        by the **xfs_io(8)** chattr command on a directory to be
        inherited by files in that directory.

  fs.xfs.rotorstep              (Min: 1  Default: 1  Max: 256)
        In "inode32" allocation mode, this option determines how many
        files the allocator attempts to allocate in the same allocation
        group before moving to the next allocation group.  The intent
        is to control the rate at which the allocator moves between
        allocation groups when allocating extents for new files.

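Because ``fs.xfs.panic_mask`` is a bitmask, a value is built by ORing the
tag values listed for it above. The following is a minimal Python sketch of
that arithmetic. The ``XfsPtag`` class and the ``describe_mask`` helper are
illustrative names only, with the ``XFS_PTAG_`` prefix dropped from each tag.

```python
from enum import IntFlag


class XfsPtag(IntFlag):
    """Tag values copied from the fs.xfs.panic_mask description."""
    IFLUSH = 0x00000001
    LOGRES = 0x00000002
    AILDELETE = 0x00000004
    ERROR_REPORT = 0x00000008
    SHUTDOWN_CORRUPT = 0x00000010
    SHUTDOWN_IOERROR = 0x00000020
    SHUTDOWN_LOGERROR = 0x00000040
    FSBLOCK_ZERO = 0x00000080
    VERIFIER_ERROR = 0x00000100


def describe_mask(value):
    """List the tag names whose bits are set in value, in definition order."""
    return [tag.name for tag in XfsPtag if value & tag]


# Request a panic on corruption-triggered and IO-error-triggered shutdowns.
mask = int(XfsPtag.SHUTDOWN_CORRUPT | XfsPtag.SHUTDOWN_IOERROR)
```

Here ``mask`` is 48 (0x30), which is the number that would be written to the
sysctl. As noted for the option itself, this is intended for debugging only.
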
Deprecated Sysctls
==================

===========================================     ================
  Name                                          Removal Schedule
===========================================     ================
fs.xfs.irix_sgid_inherit                        September 2025
fs.xfs.irix_symlink_mode                        September 2025
fs.xfs.speculative_cow_prealloc_lifetime        September 2025
===========================================     ================


Removed Sysctls
===============

=============================   =======
  Name                          Removed
=============================   =======
  fs.xfs.xfsbufd_centisec       v4.0
  fs.xfs.age_buffer_centisecs   v4.0
=============================   =======

Error handling
==============

XFS can act differently according to the type of error found during its
operation. The implementation introduces the following concepts to the error
handler:

 -failure speed:
        Defines how fast XFS should propagate an error upwards when a specific
        error is found during the filesystem operation. It can propagate
        immediately, after a defined number of retries, after a set time period,
        or simply retry forever.

 -error classes:
        Specifies the subsystem the error configuration will apply to, such as
        metadata IO or memory allocation. Different subsystems will have
        different error handlers for which behaviour can be configured.

 -error handlers:
        Defines the behavior for a specific error.

The filesystem behavior during an error can be set via ``sysfs`` files. Each
error handler works independently - the first condition met by an error handler
for a specific class will cause the error to be propagated rather than reset and
retried.

The action taken by the filesystem when the error is propagated is context
dependent - it may cause a shut down in the case of an unrecoverable error,
it may be reported back to userspace, or it may even be ignored because
there's nothing useful we can do with the error or anyone we can report it to
(e.g. during unmount).

The configuration files are organized into the following hierarchy for each
mounted filesystem:

  /sys/fs/xfs/<dev>/error/<class>/<error>/

Where:
  <dev>
        The short device name of the mounted filesystem. This is the same device
        name that shows up in XFS kernel error messages as "XFS(<dev>): ..."

  <class>
        The subsystem the error configuration belongs to. As of 4.9, the defined
        classes are:

                - "metadata": applies to metadata buffer write IO

  <error>
        The individual error handler configurations.


Each filesystem has "global" error configuration options defined in its
top-level directory:

  /sys/fs/xfs/<dev>/error/

  fail_at_unmount               (Min:  0  Default:  1  Max: 1)
        Defines the filesystem error behavior at unmount time.

        If set to a value of 1, XFS will override all other error configurations
        during unmount and replace them with "immediate fail" characteristics,
        i.e. no retries, no retry timeout. This will always allow unmount to
        succeed when there are persistent errors present.

        If set to 0, the configured retry behaviour will continue until all
        retries and/or timeouts have been exhausted. This will delay unmount
        completion when there are persistent errors, and it may prevent the
        filesystem from ever unmounting fully in the case of "retry forever"
        handler configurations.

        Note: there is no guarantee that fail_at_unmount can be set while an
        unmount is in progress. It is possible that the ``sysfs`` entries are
        removed by the unmounting filesystem before a "retry forever" error
        handler configuration causes unmount to hang, and hence the filesystem
        must be configured appropriately before unmount begins to prevent
        unmount hangs.

Each filesystem has specific error class handlers that define the error
propagation behaviour for specific errors. There is also a "default" error
handler defined, which defines the behaviour for all errors that don't have
specific handlers defined. Where multiple retry constraints are configured for
a single error, the first retry configuration that expires will cause the error
to be propagated. The handler configurations are found in the directory:

  /sys/fs/xfs/<dev>/error/<class>/<error>/

  max_retries                   (Min: -1  Default: Varies  Max: INTMAX)
        Defines the allowed number of retries of a specific error before
        the filesystem will propagate the error. The retry count for a given
        error context (e.g. a specific metadata buffer) is reset every time
        there is a successful completion of the operation.

        Setting the value to "-1" will cause XFS to retry forever for this
        specific error.

        Setting the value to "0" will cause XFS to fail immediately when the
        specific error is reported.

        Setting the value to "N" (where 0 < N < Max) will make XFS retry the
        operation "N" times before propagating the error.

  retry_timeout_seconds         (Min:  -1  Default:  Varies  Max: 1 day)
        Defines the amount of time (in seconds) that the filesystem is
        allowed to retry its operations when the specific error is
        found.

        Setting the value to "-1" will allow XFS to retry forever for this
        specific error.

        Setting the value to "0" will cause XFS to fail immediately when the
        specific error is reported.

        Setting the value to "N" (where 0 < N < Max) will allow XFS to retry the
        operation for up to "N" seconds before propagating the error.

**Note:** The default behaviour for a specific error handler is dependent on both
the class and error context. For example, the default values for
"metadata/ENODEV" are "0" rather than "-1" so that this error handler defaults
to "fail immediately" behaviour. This is done because ENODEV is a fatal,
unrecoverable error no matter how many times the metadata IO is retried.

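The interaction between ``max_retries`` and ``retry_timeout_seconds`` can be
modelled in a few lines. The following Python sketch follows the semantics
documented above ("-1" retries forever, "0" fails immediately, and the first
constraint to expire propagates the error). It is an illustrative model
rather than the kernel implementation, and the names ``handler_dir`` and
``should_propagate`` are assumptions.

```python
RETRY_FOREVER = -1


def handler_dir(dev, error_class, error):
    """Build the sysfs directory that holds one error handler's settings."""
    return f"/sys/fs/xfs/{dev}/error/{error_class}/{error}/"


def should_propagate(failures, elapsed_seconds, max_retries, retry_timeout_seconds):
    """Decide whether an error is propagated or the operation is retried.

    failures counts failed attempts so far (the first failure is 1) and
    elapsed_seconds is the time since the first failure.
    """
    retries_exhausted = max_retries != RETRY_FOREVER and failures > max_retries
    timed_out = (retry_timeout_seconds != RETRY_FOREVER
                 and elapsed_seconds >= retry_timeout_seconds)
    # Whichever configured constraint expires first causes propagation.
    return retries_exhausted or timed_out
```

For example, with ``max_retries`` set to 2 and no timeout, the first and
second failures are retried and the third is propagated. With both values
set to "-1" the function never returns true, which is the "retry forever"
case that ``fail_at_unmount`` exists to override.
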
Workqueue Concurrency
=====================

XFS uses kernel workqueues to parallelize metadata update processes.  This
enables it to take advantage of storage hardware that can service many IO
operations simultaneously.  This interface exposes internal implementation
details of XFS, and as such is explicitly not part of any userspace API/ABI
guarantee the kernel may give userspace.  These are undocumented features of
the generic workqueue implementation XFS uses for concurrency, and they are
provided here purely for diagnostic and tuning purposes and may change at any
time in the future.

The control knobs for a filesystem's workqueues are organized by task at hand
and the short name of the data device.  They all can be found in:

  /sys/bus/workqueue/devices/${task}!${device}

================  ===========
  Task            Description
================  ===========
  xfs_iwalk-$pid  Inode scans of the entire filesystem. Currently limited to
                  mount time quotacheck.
  xfs-gc          Background garbage collection of disk space that has been
                  speculatively allocated beyond EOF or for staging copy on
                  write operations.
================  ===========

For example, the knobs for the quotacheck workqueue for /dev/nvme0n1 would be
found in /sys/bus/workqueue/devices/xfs_iwalk-1111!nvme0n1/.

The interesting knobs for XFS workqueues are as follows:

============     ===========
  Knob           Description
============     ===========
  max_active     Maximum number of background threads that can be started to
                 run the work.
  cpumask        CPUs upon which the threads are allowed to run.
  nice           Relative priority of scheduling the threads.  These are the
                 same nice levels that can be applied to userspace processes.
============     ===========
0534 
0535 ============     ===========
0536   Knob           Description
0537 ============     ===========
0538   max_active     Maximum number of background threads that can be started to
0539                  run the work.
0540   cpumask        CPUs upon which the threads are allowed to run.
0541   nice           Relative priority of scheduling the threads.  These are the
0542                  same nice levels that can be applied to userspace processes.
0543 ============     ===========