.. SPDX-License-Identifier: GPL-2.0

===========================================
Shared Virtual Addressing (SVA) with ENQCMD
===========================================

Background
==========

Shared Virtual Addressing (SVA) allows the processor and device to use the
same virtual addresses, avoiding the need for software to translate virtual
addresses to physical addresses. SVA is what PCIe calls Shared Virtual
Memory (SVM).

Besides the convenience of letting the device use application virtual
addresses, SVA also removes the need to pin pages for DMA.
PCIe Address Translation Services (ATS) along with Page Request Interface
(PRI) allow devices to handle page faults in much the same way as the CPU
handles application page faults. For more information, please refer to the
PCIe specification Chapter 10: ATS Specification.

Use of SVA requires IOMMU support in the platform. The IOMMU is also
required to support the PCIe features ATS and PRI. ATS allows devices
to cache translations for virtual addresses. The IOMMU driver uses the
mmu_notifier() support to keep the device TLB cache and the CPU cache in
sync. When an ATS lookup fails for a virtual address, the device should
use the PRI to request that the virtual address be paged into the
CPU page tables. The device must then use ATS again to fetch the
translation before use.
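
The ATS/PRI sequence described above can be sketched as a small user-space
model. This is only an illustration: the ``ats_*`` names, the fixed-size
"page table" and the request counter are invented for the example and do
not correspond to any kernel or hardware interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ATS_MODEL_PAGES 8   /* size of the toy address space, in pages */

/* Toy OS page table: present[i] is true once page i is resident. */
struct ats_mm {
	bool present[ATS_MODEL_PAGES];
	unsigned int page_requests;   /* number of PRI requests serviced */
};

/* ATS translation request: succeeds only if the page is present. */
bool ats_translate(const struct ats_mm *mm, size_t page)
{
	assert(mm != NULL);
	return page < ATS_MODEL_PAGES && mm->present[page];
}

/* PRI page request: the OS pages the address in and responds. */
bool ats_page_request(struct ats_mm *mm, size_t page)
{
	assert(mm != NULL);
	if (page >= ATS_MODEL_PAGES)
		return false;   /* invalid address: the request is rejected */
	mm->present[page] = true;
	mm->page_requests++;
	return true;
}

/*
 * Device-side access: try ATS first, fall back to PRI on a miss, then
 * repeat ATS to fetch the translation before the access proceeds.
 */
bool ats_device_access(struct ats_mm *mm, size_t page)
{
	if (ats_translate(mm, page))
		return true;
	if (!ats_page_request(mm, page))
		return false;
	return ats_translate(mm, page);
}
```

In this model the first access to a non-resident page costs one page
request, and later accesses to the same page are served by ATS alone.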

Shared Hardware Workqueues
==========================

Unlike Single Root I/O Virtualization (SR-IOV), Scalable IOV (SIOV) permits
the use of Shared Work Queues (SWQ) by both applications and Virtual
Machines (VMs). This allows better hardware utilization than hard
partitioning of resources, which can result in underutilization. To
allow the hardware to distinguish the context for which work is being
executed through the SWQ interface, SIOV uses the Process Address Space
ID (PASID), which is a 20-bit number defined by the PCI-SIG.

The PASID value is encoded in all transactions from the device. This allows
the IOMMU to track I/O at per-PASID granularity in addition to using the PCIe
Requester ID (RID), which is the Bus/Device/Function.
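
As a rough illustration of the two identifiers, the sketch below packs a
Bus/Device/Function triple into a 16-bit RID (8-bit bus, 5-bit device,
3-bit function) and combines it with a 20-bit PASID. The ``io_tag()``
helper is hypothetical: it only shows that a (RID, PASID) pair can key
per-PASID state, and is not how any particular IOMMU stores it.

```c
#include <assert.h>
#include <stdint.h>

#define PASID_BITS 20
#define PASID_MAX  ((1u << PASID_BITS) - 1)   /* 0xFFFFF */

/* Pack Bus/Device/Function into the 16-bit RID. */
uint16_t rid_from_bdf(uint8_t bus, uint8_t dev, uint8_t fn)
{
	assert(dev < 32 && fn < 8);   /* 5-bit device, 3-bit function */
	return (uint16_t)((bus << 8) | (dev << 3) | fn);
}

/* Hypothetical lookup key combining the RID with the PASID. */
uint64_t io_tag(uint16_t rid, uint32_t pasid)
{
	assert(pasid <= PASID_MAX);
	return ((uint64_t)rid << PASID_BITS) | pasid;
}
```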

ENQCMD
======

ENQCMD is a new instruction on Intel platforms that atomically submits a
work descriptor to a device. The descriptor includes the operation to be
performed, virtual addresses of all parameters, virtual address of a completion
record, and the PASID (process address space ID) of the current process.

ENQCMD works with non-posted semantics and returns a status indicating
whether the command was accepted by hardware. This lets the submitter know
whether the submission needs to be retried, or whether other device-specific
mechanisms to implement fairness or ensure forward progress should be used.
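
A bounded retry loop around such a status-returning submission might look
like the sketch below. The ``submit_fn`` callback, ``submit_with_retry()``
and ``mock_submit()`` are hypothetical names for this example; on real
hardware the callback would wrap the ENQCMD instruction, while here a mock
device stands in so the loop can run anywhere.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical submit hook: returns true if the device accepted the
 * descriptor, false if it asked for a retry.
 */
typedef bool (*submit_fn)(void *portal, const void *desc);

/*
 * Try to submit up to max_tries times. Returns the number of attempts
 * used on success, or 0 if the device never accepted the descriptor.
 */
unsigned int submit_with_retry(submit_fn submit, void *portal,
			       const void *desc, unsigned int max_tries)
{
	assert(submit != NULL);
	for (unsigned int attempt = 1; attempt <= max_tries; attempt++) {
		if (submit(portal, desc))
			return attempt;
		/* A real caller might pause or back off here. */
	}
	return 0;
}

/* Mock device: rejects submissions while its busy counter is non-zero. */
bool mock_submit(void *portal, const void *desc)
{
	unsigned int *busy = portal;

	(void)desc;
	if (*busy > 0) {
		(*busy)--;
		return false;   /* queue full: retry */
	}
	return true;            /* accepted */
}
```

Bounding the number of attempts keeps a submitter from spinning forever
when the queue stays full; what to do after giving up is device-specific.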

ENQCMD is the glue that lets applications submit commands directly to the
hardware and lets the hardware identify the application context, via the
PASID, when performing I/O operations.

Process Address Space Tagging
=============================

A new thread-scoped MSR (IA32_PASID) provides the connection between
user processes and the rest of the hardware. When an application first
accesses an SVA-capable device, this MSR is initialized with a newly
allocated PASID. The driver for the device calls an IOMMU-specific API
that sets up the routing for DMA and page-requests.

For example, the Intel Data Streaming Accelerator (DSA) uses
iommu_sva_bind_device(), which will do the following:

- Allocate the PASID, and program the process page-table (%cr3 register) in the
  PASID context entries.
- Register for mmu_notifier() to track any page-table invalidations to keep
  the device TLB in sync. For example, when a page-table entry is invalidated,
  the IOMMU propagates the invalidation to the device TLB. This will force any
  future access by the device to this virtual address to participate in
  ATS. If the IOMMU responds that the page is not present, the device
  requests that the page be paged in via the PCIe PRI protocol before
  performing I/O.

This MSR is managed with the XSAVE feature set as "supervisor state" to
ensure the MSR is updated during context switch.
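
The MSR contents can be pictured with a few helpers. The sketch assumes
the layout used by the Linux x86 headers, with the PASID in bits 19:0 and
a valid flag in bit 31; the helper names are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IA32_PASID_VALID (1ull << 31)   /* assumed: bit 31 marks validity */
#define IA32_PASID_MASK  0xFFFFFull     /* assumed: bits 19:0 hold the PASID */

/* Build the MSR value for an allocated PASID. */
uint64_t ia32_pasid_encode(uint32_t pasid)
{
	assert(pasid <= IA32_PASID_MASK);
	return IA32_PASID_VALID | pasid;
}

/* True if the MSR value carries a usable PASID. */
bool ia32_pasid_is_valid(uint64_t msr)
{
	return (msr & IA32_PASID_VALID) != 0;
}

/* Extract the 20-bit PASID from an MSR value. */
uint32_t ia32_pasid_value(uint64_t msr)
{
	return (uint32_t)(msr & IA32_PASID_MASK);
}
```

A value of zero therefore means "no PASID loaded", which matches the
cleared state a thread starts with.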

PASID Management
================

The kernel must allocate a PASID on behalf of each process which will use
ENQCMD and program it into the new MSR to communicate the process identity to
platform hardware.  ENQCMD uses the PASID stored in this MSR to tag requests
from this process.  When a user submits a work descriptor to a device using the
ENQCMD instruction, the PASID field in the descriptor is auto-filled with the
value from MSR_IA32_PASID. Requests for DMA from the device are also tagged
with the same PASID. The platform IOMMU uses the PASID in the transaction to
perform address translation. The IOMMU APIs set up the corresponding PASID
entry in the IOMMU with the process page-table address used by the CPU
(e.g. the %cr3 register on x86).
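
The auto-fill step can be modelled in software as follows. This is a
sketch under assumptions: it treats the descriptor as 64 bytes whose first
32 bits are overwritten with the 20-bit PASID (upper bits cleared), and it
reuses the bit-31 valid flag for the MSR. The ``fill_*`` names are
hypothetical; the instruction set reference listed at the end of this
document is the authority on the real format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FILL_PASID_VALID (1ull << 31)   /* assumed valid bit of the MSR */
#define FILL_PASID_MASK  0xFFFFFull     /* 20-bit PASID field */

/* Toy 64-byte descriptor: first dword carries the PASID, rest is payload. */
struct fill_desc {
	uint32_t dword0;
	uint8_t payload[60];
};

/*
 * Model of what the device receives: the payload comes from the caller,
 * the PASID comes from the MSR no matter what the caller wrote there.
 * Returns false when the MSR holds no valid PASID (the real instruction
 * faults in that case).
 */
bool fill_enqcmd(uint64_t pasid_msr, const struct fill_desc *src,
		 struct fill_desc *out)
{
	assert(src != NULL && out != NULL);
	if (!(pasid_msr & FILL_PASID_VALID))
		return false;
	*out = *src;
	out->dword0 = (uint32_t)(pasid_msr & FILL_PASID_MASK);
	return true;
}
```

Because the PASID is taken from the MSR rather than from memory, user
space cannot forge the identity attached to its requests.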

The MSR must be configured on each logical CPU before any application
thread can interact with a device. Threads that belong to the same
process share the same page tables, thus the same MSR value.

PASID Life Cycle Management
===========================

PASID is initialized as INVALID_IOASID (-1) when a process is created.

Only processes that access SVA-capable devices need to have a PASID
allocated. This allocation happens when a process opens/binds an SVA-capable
device but finds no PASID for this process. Subsequent binds of the same or
other devices share the same PASID.

Although the PASID is allocated to the process by opening a device,
it is not active in any of the threads of that process. It is loaded into the
IA32_PASID MSR lazily when a thread tries to submit a work descriptor
to a device using ENQCMD.

That first access will trigger a #GP fault because the IA32_PASID MSR
has not been initialized with the PASID value assigned to the process
when the device was opened. The Linux #GP handler notes that a PASID has
been allocated for the process, and so initializes the IA32_PASID MSR
and returns so that the ENQCMD instruction is re-executed.
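
The lazy loading path can be illustrated with a toy state machine. The
structures and function names below (``lazy_mm``, ``lazy_task``,
``lazy_gp_fixup()``, ``lazy_enqcmd()``) are invented for this sketch and
only mirror the behaviour described above; they are not the kernel's
implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LAZY_INVALID_PASID ((uint32_t)-1)   /* stands in for INVALID_IOASID */

struct lazy_mm {
	uint32_t pasid;            /* per-process PASID, or invalid */
};

struct lazy_task {
	struct lazy_mm *mm;
	bool msr_valid;            /* models the valid state of IA32_PASID */
	uint32_t msr_pasid;
	unsigned int gp_faults;    /* how many #GP faults this task took */
};

/*
 * #GP handler model: returns true if the fault was fixed up and the
 * faulting instruction should be re-executed.
 */
bool lazy_gp_fixup(struct lazy_task *t)
{
	t->gp_faults++;
	if (t->mm->pasid == LAZY_INVALID_PASID)
		return false;      /* no PASID allocated: genuine fault */
	if (t->msr_valid)
		return false;      /* MSR already loaded: other cause */
	t->msr_pasid = t->mm->pasid;
	t->msr_valid = true;
	return true;
}

/* ENQCMD model: faults while the MSR is not valid, then re-executes. */
bool lazy_enqcmd(struct lazy_task *t)
{
	assert(t != NULL && t->mm != NULL);
	while (!t->msr_valid) {
		if (!lazy_gp_fixup(t))
			return false;
	}
	return true;
}
```

Each thread pays for at most one fixup fault; once its MSR is loaded,
later submissions proceed without entering the handler.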

On fork(2) or exec(2) the PASID is removed from the process as it no
longer has the same address space that it had when the device was opened.

On clone(2) the new task shares the same address space, so will be
able to use the PASID allocated to the process. The IA32_PASID MSR is not
preemptively initialized, for three reasons: the PASID value might not be
allocated yet, the kernel does not know whether this thread is going to
access the device, and a cleared IA32_PASID MSR reduces context switch
overhead through the xstate init optimization. Since #GP faults have to be
handled on any threads that were created before the PASID was assigned to
the mm of the process, newly created threads might as well be treated in a
consistent way.
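
The difference between the two cases can be written down as a small
model. The ``life_*`` types and functions are hypothetical and exist only
to make the rules concrete: a new address space starts without a PASID,
while a new thread shares the PASID of its mm but starts with a cleared
MSR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LIFE_INVALID_PASID ((uint32_t)-1)   /* stands in for INVALID_IOASID */

struct life_mm {
	uint32_t pasid;
};

struct life_thread {
	struct life_mm *mm;
	bool msr_valid;            /* models the valid state of IA32_PASID */
};

/* fork/exec model: the child gets a new address space without a PASID. */
void life_fork(struct life_mm *child_mm, struct life_thread *child)
{
	assert(child_mm != NULL && child != NULL);
	child_mm->pasid = LIFE_INVALID_PASID;
	child->mm = child_mm;
	child->msr_valid = false;
}

/* Thread creation model: the mm (and its PASID) is shared, the MSR is not. */
void life_clone_thread(const struct life_thread *parent,
		       struct life_thread *child)
{
	assert(parent != NULL && child != NULL);
	child->mm = parent->mm;
	child->msr_valid = false;  /* loaded lazily on first ENQCMD */
}
```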

Because freeing the PASID and clearing the IA32_PASID MSR in every thread
at unbind time is complex, the PASID is freed lazily, only on mm exit.

If a process does a close(2) of the device file descriptor and munmap(2)
of the device MMIO portal, then the driver will unbind the device. The
PASID is still marked VALID in the IA32_PASID MSR of any threads in the
process that accessed the device. But this is harmless as without the
MMIO portal they cannot submit new work to the device.

Relationships
=============

 * Each process can have many threads, but only one PASID.
 * Devices have a limited number (tens to thousands) of hardware workqueues.
   The device driver manages allocating hardware workqueues.
 * A single mmap() maps a single hardware workqueue as a "portal" and
   each portal maps down to a single workqueue.
 * For each device with which a process interacts, there must be
   one or more mmap()'d portals.
 * Many threads within a process can share a single portal to access
   a single device.
 * Multiple processes can separately mmap() the same portal, in
   which case they still share one device hardware workqueue.
 * The single process-wide PASID is used by all threads to interact
   with all devices.  There is not, for instance, a PASID for each
   thread or each thread<->device pair.
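
From user space, obtaining a portal comes down to an open(2) followed by
an mmap(2) of the workqueue's character device. The sketch below shows one
way to wrap that. The ``portal_open()``/``portal_close()`` helpers, the
4 KiB portal size and the write-only mapping are assumptions made for the
example; the device node path is supplied by the caller because it depends
on the device and its configuration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define PORTAL_SIZE 0x1000   /* assumed: one 4 KiB page per portal */

struct portal {
	int fd;        /* workqueue device file descriptor, or -1 */
	void *addr;    /* mapped portal, or NULL */
};

/* Open the workqueue device and map its portal. Returns 0 or -1. */
int portal_open(const char *wq_path, struct portal *p)
{
	assert(wq_path != NULL && p != NULL);
	p->addr = NULL;
	p->fd = open(wq_path, O_RDWR);
	if (p->fd < 0) {
		p->fd = -1;
		return -1;
	}
	void *addr = mmap(NULL, PORTAL_SIZE, PROT_WRITE, MAP_SHARED, p->fd, 0);
	if (addr == MAP_FAILED) {
		close(p->fd);
		p->fd = -1;
		return -1;
	}
	p->addr = addr;
	return 0;
}

/* Unmap the portal and close the device. */
void portal_close(struct portal *p)
{
	assert(p != NULL);
	if (p->addr != NULL)
		munmap(p->addr, PORTAL_SIZE);
	if (p->fd >= 0)
		close(p->fd);
	p->addr = NULL;
	p->fd = -1;
}
```

All threads of the process can then share the returned address when
submitting work, in line with the relationships listed above.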

FAQ
===

* What is SVA/SVM?

Shared Virtual Addressing (SVA) permits I/O hardware and the processor to
work in the same address space, i.e., to share it. Some call it Shared
Virtual Memory (SVM), but the Linux community wanted to avoid confusing it
with POSIX Shared Memory and Secure Virtual Machines, which were terms
already in circulation.

* What is a PASID?

A Process Address Space ID (PASID) is a PCIe-defined Transaction Layer Packet
(TLP) prefix. A PASID is a 20-bit number allocated and managed by the OS.
PASID is included in all transactions between the platform and the device.

* How are shared workqueues different?

Traditionally, in order for userspace applications to interact with hardware,
there is a separate hardware instance required per process. For example,
consider doorbells as a mechanism of informing hardware about work to process.
Each doorbell is required to be spaced 4k (or page-size) apart for process
isolation. This requires hardware to provision that space and reserve it in
MMIO. This doesn't scale as the number of threads becomes quite large. The
hardware also manages the queue depth for Shared Work Queues (SWQ), and
consumers don't need to track queue depth. If there is no space to accept
a command, the device will return an error indicating retry.

A user should check Deferrable Memory Write (DMWr) capability on the device
and submit ENQCMD only when the device supports it. In the new DMWr PCIe
terminology, devices need to support DMWr completer capability. In addition,
all switch ports must support DMWr routing, and DMWr must be enabled by
the PCIe subsystem, much like how PCIe atomic operations are managed.

SWQ allows hardware to provision just a single address in the device. When
used with ENQCMD to submit work, the device can distinguish the process
submitting the work since it will include the PASID assigned to that
process. This helps the device scale to a large number of processes.

* Is this the same as a user space device driver?

Communicating with the device via the shared workqueue is much simpler
than a full-blown user space driver. The kernel driver does all the
initialization of the hardware. User space only needs to worry about
submitting work and processing completions.

* Is this the same as SR-IOV?

Single Root I/O Virtualization (SR-IOV) focuses on providing independent
hardware interfaces for virtualizing hardware. Hence, each interface is
required to be almost fully functional to software, supporting the
traditional BARs, space for interrupts via MSI-X, and its own register
layout. Virtual Functions (VFs) are assisted by the Physical Function (PF)
driver.

Scalable I/O Virtualization builds on the PASID concept to create device
instances for virtualization. SIOV requires host software to assist in
creating virtual devices; each virtual device is represented by a PASID
along with the bus/device/function of the device.  This allows device
hardware to optimize device resource creation, and resources can grow
dynamically on demand. SR-IOV creation and management is very static in
nature. Consult references below for more details.

* Why not just create a virtual function for each app?

Creating PCIe SR-IOV type Virtual Functions (VFs) is expensive. VFs require
duplicated hardware for PCI config space and interrupts such as MSI-X.
Resources such as interrupts have to be hard partitioned between VFs at
creation time, and cannot scale dynamically on demand. The VFs are not
completely independent from the Physical Function (PF). Most VFs require
some communication and assistance from the PF driver. SIOV, in contrast,
creates a software-defined device where all the configuration and control
aspects are mediated via the slow path. The work submission and completion
happen without any mediation.

* Does this support virtualization?

ENQCMD can be used from within a guest VM. In this case, the VMM helps
with setting up a translation table to translate from Guest PASID to Host
PASID. Please consult the ENQCMD instruction set reference for more
details.

* Does memory need to be pinned?

When devices support SVA and the platform hardware (such as the IOMMU)
supports such devices, there is no need to pin memory for DMA purposes.
Devices that support SVA also support other PCIe features that remove the
pinning requirement for memory.

Device TLB support - The device asks the IOMMU to look up an address before
use via Address Translation Service (ATS) requests.  If the mapping exists
but there is no page allocated by the OS, IOMMU hardware returns that no
mapping exists.

The device then requests that the virtual address be mapped via the Page
Request Interface (PRI). Once the OS has successfully completed the mapping,
it returns the response back to the device. The device then requests the
translation again and continues.

The IOMMU works with the OS in managing consistency of page-tables with the
device. When removing pages, it interacts with the device to remove any
device TLB entry that might have been cached before removing the mappings from
the OS.

References
==========

VT-D:
https://01.org/blogs/ashokraj/2018/recent-enhancements-intel-virtualization-technology-directed-i/o-intel-vt-d

SIOV:
https://01.org/blogs/2019/assignable-interfaces-intel-scalable-i/o-virtualization-linux

ENQCMD in ISE:
https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf

DSA spec:
https://software.intel.com/sites/default/files/341204-intel-data-streaming-accelerator-spec.pdf