Buffer Sharing and Synchronization
==================================

The dma-buf subsystem provides the framework for sharing buffers for
hardware (DMA) access across multiple device drivers and subsystems, and
for synchronizing asynchronous hardware access.

This is used, for example, by drm "prime" multi-GPU support, but is of
course not limited to GPU use cases.

The three main components of this are: (1) dma-buf, representing a
sg_table and exposed to userspace as a file descriptor to allow passing
between devices, (2) fence, which provides a mechanism to signal when
one device has finished access, and (3) reservation, which manages the
shared or exclusive fence(s) associated with the buffer.

Shared DMA Buffers
------------------

This document serves as a guide for device-driver writers on what the dma-buf
buffer sharing API is and how to use it for exporting and importing shared
buffers.

Any device driver which wishes to be a part of DMA buffer sharing can do so as
either the 'exporter' of buffers or as the 'user'/'importer' of buffers.

Say a driver A wants to use buffers created by driver B; we then call B the
exporter and A the buffer-user/importer.

The exporter

- implements and manages operations in :c:type:`struct dma_buf_ops
  <dma_buf_ops>` for the buffer,
- allows other users to share the buffer by using dma_buf sharing APIs,
- manages the details of buffer allocation, wrapped in a :c:type:`struct
  dma_buf <dma_buf>`,
- decides about the actual backing storage where this allocation happens,
- and takes care of any migration of the scatterlist for all (shared) users of
  this buffer (a minimal exporter sketch follows this list).
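
The following is a minimal sketch of the exporter side, assuming a
hypothetical driver with a ``my_buffer`` object whose backing storage is
already described by a struct sg_table; all ``my_*`` names are illustrative
and error handling is trimmed. A real exporter would usually hand each
attachment its own copy of the scatterlist rather than sharing one table
between all importers.

.. code-block:: c

   /* Illustrative exporter sketch; the my_* names are not an existing driver. */
   #include <linux/dma-buf.h>
   #include <linux/dma-mapping.h>

   struct my_buffer {
           struct sg_table sgt;    /* describes the backing storage */
           size_t size;
   };

   static struct sg_table *my_map_dma_buf(struct dma_buf_attachment *attach,
                                          enum dma_data_direction dir)
   {
           struct my_buffer *buf = attach->dmabuf->priv;

           /*
            * Map the backing storage for the importing device. Sharing one
            * sg_table like this only works for a single importer; real
            * drivers usually duplicate the table per attachment.
            */
           if (dma_map_sgtable(attach->dev, &buf->sgt, dir, 0))
                   return ERR_PTR(-ENOMEM);
           return &buf->sgt;
   }

   static void my_unmap_dma_buf(struct dma_buf_attachment *attach,
                                struct sg_table *sgt,
                                enum dma_data_direction dir)
   {
           dma_unmap_sgtable(attach->dev, sgt, dir, 0);
   }

   static void my_release(struct dma_buf *dmabuf)
   {
           /* The last reference is gone: free the backing storage here. */
   }

   static const struct dma_buf_ops my_dma_buf_ops = {
           .map_dma_buf   = my_map_dma_buf,
           .unmap_dma_buf = my_unmap_dma_buf,
           .release       = my_release,
   };

   /*
    * Wrap the allocation in a dma_buf and return a file descriptor for it;
    * @fd_flags comes from userspace so that it can request O_CLOEXEC.
    */
   static int my_export(struct my_buffer *buf, int fd_flags)
   {
           DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
           struct dma_buf *dmabuf;

           exp_info.ops   = &my_dma_buf_ops;
           exp_info.size  = buf->size;
           exp_info.flags = O_RDWR;
           exp_info.priv  = buf;

           dmabuf = dma_buf_export(&exp_info);
           if (IS_ERR(dmabuf))
                   return PTR_ERR(dmabuf);

           return dma_buf_fd(dmabuf, fd_flags);
   }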

The buffer-user

- is one of (many) sharing users of the buffer.
- doesn't need to worry about how the buffer is allocated, or where.
- and needs a mechanism to get access to the scatterlist that makes up this
  buffer in memory, mapped into its own address space, so it can access the
  same area of memory. This interface is provided by :c:type:`struct
  dma_buf_attachment <dma_buf_attachment>`; a minimal importer sketch follows
  this list.
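
A matching minimal sketch of the importer side, again with illustrative
``my_*`` names and trimmed error handling. Note that the exact locking rules
around attaching and mapping have evolved over kernel versions (recent kernels
also provide ``_unlocked`` variants of the mapping helpers); see the
kernel-doc sections below for the authoritative details.

.. code-block:: c

   /* Illustrative importer sketch; not an existing driver. */
   #include <linux/dma-buf.h>
   #include <linux/dma-mapping.h>

   static int my_import(struct device *dev, int fd)
   {
           struct dma_buf_attachment *attach;
           struct dma_buf *dmabuf;
           struct sg_table *sgt;

           dmabuf = dma_buf_get(fd);       /* takes a reference on the buffer */
           if (IS_ERR(dmabuf))
                   return PTR_ERR(dmabuf);

           attach = dma_buf_attach(dmabuf, dev);
           if (IS_ERR(attach)) {
                   dma_buf_put(dmabuf);
                   return PTR_ERR(attach);
           }

           /* Ask the exporter for a scatterlist mapped for @dev. */
           sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
           if (IS_ERR(sgt)) {
                   dma_buf_detach(dmabuf, attach);
                   dma_buf_put(dmabuf);
                   return PTR_ERR(sgt);
           }

           /* ... program the device with the addresses in sgt ... */

           /* Tear everything down in reverse order once DMA has finished. */
           dma_buf_unmap_attachment(attach, sgt, DMA_BIDIRECTIONAL);
           dma_buf_detach(dmabuf, attach);
           dma_buf_put(dmabuf);
           return 0;
   }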

Any exporters or users of the dma-buf buffer sharing framework must have a
'select DMA_SHARED_BUFFER' in their respective Kconfigs.

Userspace Interface Notes
~~~~~~~~~~~~~~~~~~~~~~~~~

Mostly a DMA buffer file descriptor is simply an opaque object for userspace,
and hence the generic interface exposed is very minimal. There are a few
things to consider though:

- Since kernel 3.12 the dma-buf FD supports the llseek system call, but only
  with offset=0 and whence=SEEK_END|SEEK_SET. SEEK_SET is supported to allow
  the usual size discovery pattern size = SEEK_END(0); SEEK_SET(0). Every
  other llseek operation will report -EINVAL.

  If llseek on dma-buf FDs isn't supported the kernel will report -ESPIPE for
  all cases. Userspace can use this to detect support for discovering the
  dma-buf size using llseek; the example at the end of these notes shows the
  pattern.

- In order to avoid fd leaks on exec, the FD_CLOEXEC flag must be set
  on the file descriptor. This is not just a resource leak, but a
  potential security hole. It could give the newly exec'd application
  access to buffers, via the leaked fd, to which it should otherwise
  not be permitted access.

  The problem with doing this via a separate fcntl() call, versus doing it
  atomically when the fd is created, is that this is inherently racy in a
  multi-threaded app[3]. The issue is made worse when it is library code
  opening/creating the file descriptor, as the application may not even be
  aware of the fds.

  To avoid this problem, userspace must have a way to request that the
  O_CLOEXEC flag be set when the dma-buf fd is created. So any API provided
  by the exporting driver to create a dmabuf fd must provide a way to let
  userspace control setting of the O_CLOEXEC flag passed in to dma_buf_fd().

- Memory mapping the contents of the DMA buffer is also supported. See the
  discussion below on `CPU Access to DMA Buffer Objects`_ for the full
  details.

- The DMA buffer FD is also pollable, see `Implicit Fence Poll Support`_
  below for details.

- The DMA buffer FD also supports a few dma-buf-specific ioctls, see
  `DMA Buffer ioctls`_ below for details; the example at the end of these
  notes uses DMA_BUF_IOCTL_SYNC to bracket CPU access through a memory
  mapping.
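
The following userspace sketch ties these notes together: it discovers the
buffer size with llseek, waits for implicit fences with poll(), and brackets a
CPU access through mmap() with DMA_BUF_IOCTL_SYNC. The ``dmabuf_fd`` argument
is assumed to have been obtained from an exporting driver, ideally with
O_CLOEXEC requested at creation time.

.. code-block:: c

   /* Userspace sketch; dmabuf_fd comes from an exporting driver. */
   #include <poll.h>
   #include <string.h>
   #include <sys/ioctl.h>
   #include <sys/mman.h>
   #include <sys/types.h>
   #include <unistd.h>
   #include <linux/dma-buf.h>

   static int fill_dma_buf(int dmabuf_fd)
   {
           struct pollfd pfd = { .fd = dmabuf_fd, .events = POLLOUT };
           struct dma_buf_sync sync = { 0 };
           off_t size;
           void *map;

           /*
            * Size discovery: SEEK_END returns the size, SEEK_SET(0) rewinds.
            * Failure with errno == ESPIPE means the kernel does not support
            * llseek on dma-buf FDs.
            */
           size = lseek(dmabuf_fd, 0, SEEK_END);
           if (size < 0)
                   return -1;
           lseek(dmabuf_fd, 0, SEEK_SET);

           /* Wait for all pending fences before writing to the buffer. */
           poll(&pfd, 1, -1);

           map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                      dmabuf_fd, 0);
           if (map == MAP_FAILED)
                   return -1;

           /* Bracket the CPU access so caches can be managed correctly. */
           sync.flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE;
           ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);

           memset(map, 0, size);

           sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE;
           ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);

           munmap(map, size);
           return 0;
   }

Bracketing CPU access with the sync ioctl is required for portable code, since
the exporter's mapping is not necessarily cache coherent.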

Basic Operation and Device DMA Access
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. kernel-doc:: drivers/dma-buf/dma-buf.c
   :doc: dma buf device access

CPU Access to DMA Buffer Objects
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. kernel-doc:: drivers/dma-buf/dma-buf.c
   :doc: cpu access

Implicit Fence Poll Support
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. kernel-doc:: drivers/dma-buf/dma-buf.c
   :doc: implicit fence polling

DMA-BUF statistics
~~~~~~~~~~~~~~~~~~

.. kernel-doc:: drivers/dma-buf/dma-buf-sysfs-stats.c
   :doc: overview

DMA Buffer ioctls
~~~~~~~~~~~~~~~~~

.. kernel-doc:: include/uapi/linux/dma-buf.h

Kernel Functions and Structures Reference
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. kernel-doc:: drivers/dma-buf/dma-buf.c
   :export:

.. kernel-doc:: include/linux/dma-buf.h
   :internal:

Reservation Objects
-------------------

.. kernel-doc:: drivers/dma-buf/dma-resv.c
   :doc: Reservation Object Overview

.. kernel-doc:: drivers/dma-buf/dma-resv.c
   :export:

.. kernel-doc:: include/linux/dma-resv.h
   :internal:

DMA Fences
----------

.. kernel-doc:: drivers/dma-buf/dma-fence.c
   :doc: DMA fences overview

DMA Fence Cross-Driver Contract
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. kernel-doc:: drivers/dma-buf/dma-fence.c
   :doc: fence cross-driver contract

DMA Fence Signalling Annotations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. kernel-doc:: drivers/dma-buf/dma-fence.c
   :doc: fence signalling annotation

DMA Fences Functions Reference
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. kernel-doc:: drivers/dma-buf/dma-fence.c
   :export:

.. kernel-doc:: include/linux/dma-fence.h
   :internal:

DMA Fence Array
~~~~~~~~~~~~~~~

.. kernel-doc:: drivers/dma-buf/dma-fence-array.c
   :export:

.. kernel-doc:: include/linux/dma-fence-array.h
   :internal:

DMA Fence Chain
~~~~~~~~~~~~~~~

.. kernel-doc:: drivers/dma-buf/dma-fence-chain.c
   :export:

.. kernel-doc:: include/linux/dma-fence-chain.h
   :internal:

DMA Fence unwrap
~~~~~~~~~~~~~~~~

.. kernel-doc:: include/linux/dma-fence-unwrap.h
   :internal:

DMA Fence uABI/Sync File
~~~~~~~~~~~~~~~~~~~~~~~~

.. kernel-doc:: drivers/dma-buf/sync_file.c
   :export:

.. kernel-doc:: include/linux/sync_file.h
   :internal:

Indefinite DMA Fences
~~~~~~~~~~~~~~~~~~~~~

At various times, variants of struct dma_fence with an indefinite time until
dma_fence_wait() finishes have been proposed. Examples include:

* Future fences, used in HWC1 to signal when a buffer isn't used by the display
  any longer, and created with the screen update that makes the buffer visible.
  The time this fence completes is entirely under userspace's control.

* Proxy fences, proposed to handle &drm_syncobj for which the fence has not yet
  been set. Used to asynchronously delay command submission.

* Userspace fences or gpu futexes, fine-grained locking within a command buffer
  that userspace uses for synchronization across engines or with the CPU, which
  are then imported as a DMA fence for integration into existing winsys
  protocols.

* Long-running compute command buffers, while still using traditional end of
  batch DMA fences for memory management instead of context preemption DMA
  fences which get reattached when the compute job is rescheduled.

Common to all these schemes is that userspace controls the dependencies of these
fences and controls when they fire. Mixing indefinite fences with normal
in-kernel DMA fences does not work, even when a fallback timeout is included to
protect against malicious userspace:

* Only the kernel knows about all DMA fence dependencies, userspace is not aware
  of dependencies injected due to memory management or scheduler decisions.

* Only userspace knows about all dependencies in indefinite fences and when
  exactly they will complete, the kernel has no visibility.

Furthermore the kernel has to be able to hold up userspace command submission
for memory management needs, which means we must support indefinite fences being
dependent upon DMA fences. If the kernel were also to support indefinite fences
as in-kernel DMA fences, as any of the above proposals would require, there is
the potential for deadlocks.

.. kernel-render:: DOT
   :alt: Indefinite Fencing Dependency Cycle
   :caption: Indefinite Fencing Dependency Cycle

   digraph "Fencing Cycle" {
      node [shape=box bgcolor=grey style=filled]
      kernel [label="Kernel DMA Fences"]
      userspace [label="userspace controlled fences"]
      kernel -> userspace [label="memory management"]
      userspace -> kernel [label="Future fence, fence proxy, ..."]

      { rank=same; kernel userspace }
   }

This means that the kernel might accidentally create deadlocks through memory
management dependencies which userspace is unaware of, which randomly hang
workloads until the timeout kicks in, even though from userspace's perspective
the workload does not contain a deadlock. In such a mixed fencing architecture
there is no single entity with knowledge of all dependencies, and therefore
preventing such deadlocks from within the kernel is not possible.

The only solution to avoid dependency loops is by not allowing indefinite
fences in the kernel. This means:

* No future fences, proxy fences or userspace fences imported as DMA fences,
  with or without a timeout.

* No DMA fences that signal end of batchbuffer for command submission where
  userspace is allowed to use userspace fencing or long running compute
  workloads. This also means no implicit fencing for shared buffers in these
  cases.

Recoverable Hardware Page Faults Implications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Modern hardware supports recoverable page faults, which has a lot of
implications for DMA fences.

First, a pending page fault obviously holds up the work that's running on the
accelerator and a memory allocation is usually required to resolve the fault.
But memory allocations are not allowed to gate completion of DMA fences, which
means any workload using recoverable page faults cannot use DMA fences for
synchronization. Synchronization fences controlled by userspace must be used
instead.

On GPUs this poses a problem, because current desktop compositor protocols on
Linux rely on DMA fences, which means without an entirely new userspace stack
built on top of userspace fences, they cannot benefit from recoverable page
faults. Specifically this means implicit synchronization will not be possible.
The exception is when page faults are only used as migration hints and never to
on-demand fill a memory request. For now this means recoverable page
faults on GPUs are limited to pure compute workloads.

Furthermore GPUs usually have shared resources between the 3D rendering and
compute side, like compute units or command submission engines. If both a 3D
job with a DMA fence and a compute workload using recoverable page faults are
pending they could deadlock:

- The 3D workload might need to wait for the compute job to finish and release
  hardware resources first.

- The compute workload might be stuck in a page fault, because the memory
  allocation is waiting for the DMA fence of the 3D workload to complete.

There are a few options to prevent this problem, and drivers need to ensure
at least one of them:

- Compute workloads can always be preempted, even when a page fault is pending
  and not yet repaired. Not all hardware supports this.

- DMA fence workloads and workloads which need page fault handling have
  independent hardware resources to guarantee forward progress. This could be
  achieved e.g. through dedicated engines and minimal compute unit reservations
  for DMA fence workloads.

- The reservation approach could be further refined by only reserving the
  hardware resources for DMA fence workloads when they are in-flight. This must
  cover the time from when the DMA fence is visible to other threads up to the
  moment when the fence is completed through dma_fence_signal().

- As a last resort, if the hardware provides no useful reservation mechanics,
  all workloads must be flushed from the GPU when switching between jobs
  requiring DMA fences or jobs requiring page fault handling: This means all DMA
  fences must complete before a compute job with page fault handling can be
  inserted into the scheduler queue. And vice versa, before a DMA fence can be
  made visible anywhere in the system, all compute workloads must be preempted
  to guarantee all pending GPU page faults are flushed.

- Only a fairly theoretical option would be to untangle these dependencies when
  allocating memory to repair hardware page faults, either through separate
  memory blocks or runtime tracking of the full dependency graph of all DMA
  fences. This would result in a very wide impact on the kernel, since resolving
  the page fault on the CPU side can itself involve a page fault. It is much
  more feasible and robust to limit the impact of handling hardware page faults
  to the specific driver.

Note that workloads that run on independent hardware like copy engines or other
GPUs do not have any impact. This allows us to keep using DMA fences internally
in the kernel even for resolving hardware page faults, e.g. by using copy
engines to clear or copy memory needed to resolve the page fault.

In some ways this page fault problem is a special case of the `Indefinite DMA
Fences`_ discussion: indefinite fences from compute workloads are allowed to
depend on DMA fences, but not the other way around. And not even the page fault
problem is new, because some other CPU thread in userspace might
hit a page fault which holds up a userspace fence - supporting page faults on
GPUs doesn't add anything fundamentally new.