===================================================
PCI Express I/O Virtualization Resource on Powernv
===================================================

Wei Yang <weiyang@linux.vnet.ibm.com>

Benjamin Herrenschmidt <benh@au1.ibm.com>

Bjorn Helgaas <bhelgaas@google.com>

26 Aug 2014

This document describes the requirement from hardware for PCI MMIO resource
sizing and assignment on PowerKVM and how generic PCI code handles this
requirement. The first two sections describe the concepts of Partitionable
Endpoints and the implementation on P8 (IODA2). The next two sections talk
about considerations on enabling SR-IOV on IODA2.

1. Introduction to Partitionable Endpoints
==========================================

A Partitionable Endpoint (PE) is a way to group the various resources
associated with a device or a set of devices to provide isolation between
partitions (i.e., filtering of DMA, MSIs etc.) and to provide a mechanism
to freeze a device that is causing errors in order to limit the possibility
of propagation of bad data.

There is thus, in HW, a table of PE states that contains a pair of "frozen"
state bits (one for MMIO and one for DMA, they get set together but can be
cleared independently) for each PE.

When a PE is frozen, all stores in any direction are dropped and all loads
return all 1's. MSIs are also blocked. There's a bit more state that
captures things like the details of the error that caused the freeze etc., but
that's not critical.
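
The frozen-state behavior described above can be pictured with a small
software model. This is only a sketch: the class and function names are
hypothetical, the 32-bit access width is an assumption, and treating the MMIO
bit as the one that gates CPU loads and stores is a simplification made for
the example, not a statement about the hardware.

```python
ALL_ONES_32 = 0xFFFF_FFFF  # what a 32-bit load from a frozen PE returns


class PEState:
    """Toy model of the pair of frozen-state bits kept for one PE."""

    def __init__(self):
        self.mmio_frozen = False
        self.dma_frozen = False

    def freeze(self):
        # The two bits are set together...
        self.mmio_frozen = True
        self.dma_frozen = True

    def clear_mmio(self):
        # ...but can be cleared independently.
        self.mmio_frozen = False

    def clear_dma(self):
        self.dma_frozen = False


def mmio_load(pe, regs, offset):
    """Load from a device register; a frozen PE returns all 1's."""
    if pe.mmio_frozen:
        return ALL_ONES_32
    return regs.get(offset, 0)


def mmio_store(pe, regs, offset, value):
    """Store to a device register; dropped (returns False) if frozen."""
    if pe.mmio_frozen:
        return False
    regs[offset] = value & ALL_ONES_32
    return True
```

In this model, a driver that reads all 1's from a register cannot tell a
frozen PE from a register that really holds that value, which is why the
freeze state has to be checked separately.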

The interesting part is how the various PCIe transactions (MMIO, DMA, ...)
are matched to their corresponding PEs.

The following section provides a rough description of what we have on P8
(IODA2). Keep in mind that this is all per PHB (PCI host bridge). Each PHB
is a completely separate HW entity that replicates the entire logic, so has
its own set of PEs, etc.

2. Implementation of Partitionable Endpoints on P8 (IODA2)
==========================================================

P8 supports up to 256 Partitionable Endpoints per PHB.

* Inbound

For DMA, MSIs and inbound PCIe error messages, we have a table (in
memory but accessed in HW by the chip) that provides a direct
correspondence between a PCIe RID (bus/dev/fn) and a PE number.
We call this the RTT.

- For DMA we then provide an entire address space for each PE that can
contain two "windows", depending on the value of PCI address bit 59.
Each window can be configured to be remapped via a "TCE table" (IOMMU
translation table), which has various configurable characteristics
not described here.

- For MSIs, we have two windows in the address space (one at the top of
the 32-bit space and one much higher) which, via a combination of the
address and MSI value, will result in one of the 2048 interrupts per
bridge being triggered. There's a PE# in the interrupt controller
descriptor table as well which is compared with the PE# obtained from
the RTT to "authorize" the device to emit that specific interrupt.

- Error messages just use the RTT.
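
As a rough illustration of the inbound path, the sketch below models the RTT
and the interrupt descriptor table as plain Python dictionaries. The
dictionary representation and the function names are assumptions made for the
example; only the bus/dev/fn packing of the RID and the PE# comparison come
from the description above.

```python
def rid(bus, dev, fn):
    """Pack bus/dev/fn into a 16-bit PCIe Requester ID (8/5/3 bits)."""
    return (bus << 8) | (dev << 3) | fn


def pe_for_rid(rtt, requester):
    """Look up the PE# for a requester in the (modelled) RTT."""
    return rtt.get(requester)


def msi_authorized(rtt, irq_pe, requester, irq):
    """An MSI is allowed only if the PE# found through the RTT matches
    the PE# recorded for that interrupt."""
    pe = pe_for_rid(rtt, requester)
    return pe is not None and irq_pe.get(irq) == pe
```

For example, with `rtt = {rid(1, 0, 0): 5}` and `irq_pe = {100: 5}`, function
01:00.0 may raise interrupt 100, while any function mapped to another PE (or
not present in the table) may not.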

* Outbound. That's where the tricky part is.

Like other PCI host bridges, the Power8 IODA2 PHB supports "windows"
from the CPU address space to the PCI address space. There is one M32
window and sixteen M64 windows. They have different characteristics.
First what they have in common: they forward a configurable portion of
the CPU address space to the PCIe bus and must be naturally aligned
and a power of two in size. The rest is different:

- The M32 window:

* Is limited to 4GB in size.

* Drops the top bits of the address (above the size) and replaces
them with a configurable value. This is typically used to generate
32-bit PCIe accesses. We configure that window at boot from FW and
don't touch it from Linux; it's usually set to forward a 2GB
portion of address space from the CPU to PCIe addresses
0x8000_0000..0xffff_ffff. (Note: The top 64KB are actually
reserved for MSIs but this is not a problem at this point; we just
need to ensure Linux doesn't assign anything there. The M32 logic
ignores that reservation, however, and will forward accesses in that
space if we try.)

* It is divided into 256 segments of equal size. A table in the chip
maps each segment to a PE#. That allows portions of the MMIO space
to be assigned to PEs on a segment granularity. For a 2GB window,
the segment granularity is 2GB/256 = 8MB.

Now, this is the "main" window we use in Linux today (excluding
SR-IOV). We basically use the trick of forcing the bridge MMIO windows
onto a segment alignment/granularity so that the space behind a bridge
can be assigned to a PE.

Ideally we would like to be able to have individual functions in PEs
but that would mean using a completely different address allocation
scheme where individual function BARs can be "grouped" to fit in one or
more segments.
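
The M32 segment arithmetic can be sketched as follows. The function names and
the use of a 256-entry Python list for the segment-to-PE# table are
assumptions for illustration; addresses are taken on the PCI side of the
window, using the usual 2GB layout mentioned above.

```python
M32_SEGMENTS = 256


def m32_segment_size(window_size):
    """Each of the 256 segments covers an equal share of the window."""
    return window_size // M32_SEGMENTS


def m32_pe_for_address(addr, window_base, window_size, segment_to_pe):
    """Find the segment an address falls in, then look up its PE#."""
    if not window_base <= addr < window_base + window_size:
        raise ValueError("address is outside the M32 window")
    segment = (addr - window_base) // m32_segment_size(window_size)
    return segment_to_pe[segment]
```

With a 2GB window, `m32_segment_size(2 << 30)` is 8MB, so a bridge window
aligned to 8MB and sized in 8MB multiples always covers whole segments, and
each of those segments can be pointed at the PE chosen for that bridge.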

- The M64 windows:

* Must be at least 256MB in size.

* Do not translate addresses (the address on PCIe is the same as the
address on the PowerBus). There is a way to also set the top 14
bits which are not conveyed by PowerBus but we don't use this.

* Can be configured to be segmented. When not segmented, we can
specify the PE# for the entire window. When segmented, a window
has 256 segments; however, there is no table for mapping a segment
to a PE#. The segment number *is* the PE#.

* Support overlaps. If an address is covered by multiple windows,
there's a defined ordering for which window applies.

We have code (fairly new compared to the M32 stuff) that exploits that
for large BARs in 64-bit space:

We configure an M64 window to cover the entire region of address space
that has been assigned by FW for the PHB (about 64GB, ignore the space
for the M32, it comes out of a different "reserve"). We configure it
as segmented.

Then we do the same thing as with M32, using the bridge alignment
trick, to match to those giant segments.
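
Because a segmented M64 window has no lookup table, the PE# follows directly
from the address. A minimal sketch, assuming a hypothetical 64GB window whose
base is naturally aligned (the base value used in the usage note is made up
for the example):

```python
M64_SEGMENTS = 256


def m64_pe_for_address(addr, window_base, window_size):
    """In a segmented M64 window the segment number *is* the PE#."""
    if not window_base <= addr < window_base + window_size:
        raise ValueError("address is outside the M64 window")
    segment_size = window_size // M64_SEGMENTS
    return (addr - window_base) // segment_size


def m64_pes_for_range(start, length, window_base, window_size):
    """List every PE# touched by the range [start, start + length)."""
    first = m64_pe_for_address(start, window_base, window_size)
    last = m64_pe_for_address(start + length - 1, window_base, window_size)
    return list(range(first, last + 1))
```

With a 64GB window the segments are 256MB each, so a 600MB region starting on
the third segment boundary lands in PE#s 2, 3 and 4. This is the situation in
which one device ends up with more than one PE#.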

Since we cannot remap, we have two additional constraints:

- We do the PE# allocation *after* the 64-bit space has been assigned
because the addresses we use directly determine the PE#. We then
update the M32 PE# for the devices that use both 32-bit and 64-bit
spaces or assign the remaining PE# to 32-bit only devices.

- We cannot "group" segments in HW, so if a device ends up using more
than one segment, we end up with more than one PE#. There is a HW
mechanism to make the freeze state cascade to "companion" PEs but
that only works for PCIe error messages (typically used so that if
you freeze a switch, it freezes all its children). So we do it in
SW. We lose a bit of effectiveness of EEH in that case, but that's
the best we found. So when any of the PEs freezes, we freeze the
other ones for that "domain". We thus introduce the concept of
"master PE" which is the one used for DMA, MSIs, etc., and "secondary
PEs" that are used for the remaining M64 segments.
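
The software cascade can be sketched like this. The `PEDomain` class, the set
of frozen PE#s and the function name are hypothetical; the sketch only shows
the rule that a freeze of any member PE is propagated to the whole domain.

```python
class PEDomain:
    """One master PE (DMA, MSIs, ...) plus secondary PEs (extra M64 segments)."""

    def __init__(self, master, secondaries):
        self.master = master
        self.secondaries = list(secondaries)

    def members(self):
        return [self.master] + self.secondaries


def freeze_with_cascade(frozen, domains, pe):
    """Record that `pe` froze and freeze the rest of its domain in software.

    `frozen` is a set of frozen PE#s and is updated in place. Returns the
    domain that was frozen, or None if the PE belongs to no domain.
    """
    for domain in domains:
        if pe in domain.members():
            frozen.update(domain.members())
            return domain
    frozen.add(pe)
    return None
```

For a domain with master PE 2 and secondary PEs 3 and 4, a freeze reported on
PE 3 leaves PEs 2, 3 and 4 all marked frozen.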

We would like to investigate using additional M64 windows in "single
PE" mode to overlay over specific BARs to work around some of that, for
example for devices with very large BARs, e.g., GPUs. It would make
sense, but we haven't done it yet.

3. Considerations for SR-IOV on PowerKVM
========================================

* SR-IOV Background

The PCIe SR-IOV feature allows a single Physical Function (PF) to
support several Virtual Functions (VFs). Registers in the PF's SR-IOV
Capability control the number of VFs and whether they are enabled.

When VFs are enabled, they appear in Configuration Space like normal
PCI devices, but the BARs in VF config space headers are unusual. For
a non-VF device, software uses BARs in the config space header to
discover the BAR sizes and assign addresses for them. For VF devices,
software uses VF BAR registers in the *PF* SR-IOV Capability to
discover sizes and assign addresses. The BARs in the VF's config space
header are read-only zeros.

When a VF BAR in the PF SR-IOV Capability is programmed, it sets the
base address for all the corresponding VF(n) BARs. For example, if the
PF SR-IOV Capability is programmed to enable eight VFs, and it has a
1MB VF BAR0, the address in that VF BAR sets the base of an 8MB region.
This region is divided into eight contiguous 1MB regions, each of which
is a BAR0 for one of the VFs. Note that even though the VF BAR
describes an 8MB region, the alignment requirement is for a single VF,
i.e., 1MB in this example.
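
The VF(n) BAR layout in the example above reduces to simple arithmetic. The
function names and the base address used in the usage note are made up for
illustration:

```python
def vf_bar_region_size(num_vfs, vf_bar_size):
    """Total space described by one VF BAR in the PF SR-IOV Capability."""
    return num_vfs * vf_bar_size


def vf_bar_address(vf_bar_base, vf_bar_size, vf_index):
    """Address of the corresponding BAR of VF number `vf_index`."""
    return vf_bar_base + vf_index * vf_bar_size


def vf_bar_base_is_aligned(vf_bar_base, vf_bar_size):
    """The base only has to be aligned to the size of a single VF BAR."""
    return vf_bar_base % vf_bar_size == 0
```

With eight VFs and a 1MB VF BAR0, a base of 0x9010_0000 is acceptable: it is
1MB aligned even though it is not aligned to the full 8MB region, and VF3's
BAR0 then sits at 0x9040_0000.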

There are several strategies for isolating VFs in PEs:

- M32 window: There's one M32 window, and it is split into 256
equally-sized segments. The finest granularity possible is a 256MB
window with 1MB segments. VF BARs that are 1MB or larger could be
mapped to separate PEs in this window. Each segment can be
individually mapped to a PE via the lookup table, so this is quite
flexible, but it works best when all the VF BARs are the same size. If
they are different sizes, the entire window has to be small enough that
the segment size matches the smallest VF BAR, which means larger VF
BARs span several segments.

- Non-segmented M64 window: A non-segmented M64 window is mapped entirely
to a single PE, so it could only isolate one VF.

- Single segmented M64 window: A segmented M64 window could be used just
like the M32 window, but the segments can't be individually mapped to
PEs (the segment number is the PE#), so there isn't as much
flexibility. A VF with multiple BARs would have to be in a "domain" of
multiple PEs, which is not as well isolated as a single PE.

- Multiple segmented M64 windows: As usual, each window is split into 256
equally-sized segments, and the segment number is the PE#. But if we
use several M64 windows, they can be set to different base addresses
and different segment sizes. If we have VFs that each have a 1MB BAR
and a 32MB BAR, we could use one M64 window to assign 1MB segments and
another M64 window to assign 32MB segments.

Finally, the plan is to use M64 windows for SR-IOV, as described in more
detail below. For a given VF BAR, we need to effectively reserve the
entire 256 segments (256 * VF BAR size) and position the VF BAR to start
at the beginning of a free range of segments/PEs inside that M64 window.

The goal is of course to be able to give a separate PE for each VF.

The IODA2 platform has 16 M64 windows, which are used to map MMIO
range to PE#. Each M64 window defines one MMIO range and this range is
divided into 256 segments, with each segment corresponding to one PE.

We decided to leverage the M64 windows to map VFs to individual PEs, since
all the VF(n) BARs described by a given VF BAR are the same size.

But doing so introduces another problem: total_VFs is usually smaller
than the number of M64 window segments, so if we map one VF BAR directly
to one M64 window, some part of the M64 window will map to another
device's MMIO range.

IODA supports 256 PEs, so segmented windows contain 256 segments. If
total_VFs is less than 256, we have the situation in Figure 1.0, where
segments [total_VFs, 255] of the M64 window may map to some MMIO range on
other devices::

       0      1               total_VFs - 1
    +------+------+-     -+------+------+
    |      |      |  ...  |      |      |
    +------+------+-     -+------+------+

              VF(n) BAR space

       0      1               total_VFs - 1               255
    +------+------+-     -+------+------+-     -+------+------+
    |      |      |  ...  |      |      |  ...  |      |      |
    +------+------+-     -+------+------+-     -+------+------+

                         M64 window

            Figure 1.0 Direct map VF(n) BAR space

Our current solution is to allocate 256 segments even if the VF(n) BAR
space doesn't need that much, as shown in Figure 1.1::

       0      1               total_VFs - 1               255
    +------+------+-     -+------+------+-     -+------+------+
    |      |      |  ...  |      |      |  ...  |      |      |
    +------+------+-     -+------+------+-     -+------+------+

                   VF(n) BAR space + extra

       0      1               total_VFs - 1               255
    +------+------+-     -+------+------+-     -+------+------+
    |      |      |  ...  |      |      |  ...  |      |      |
    +------+------+-     -+------+------+-     -+------+------+

                         M64 window

            Figure 1.1 Map VF(n) BAR space + extra

Allocating the extra space ensures that the entire M64 window will be
assigned to this one SR-IOV device and none of the space will be
available for other devices. Note that this only expands the space
reserved in software; there are still only total_VFs VFs, and they only
respond to segments [0, total_VFs - 1]. There's nothing in hardware that
responds to segments [total_VFs, 255].
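
The difference between what the VFs need and what gets reserved can be put in
numbers. This is a sketch with hypothetical function names, assuming the
segment size equals the VF BAR size as in Figure 1.1:

```python
M64_SEGMENTS = 256


def vf_bar_space_needed(total_vfs, vf_bar_size):
    """Space the VF(n) BARs actually decode."""
    return total_vfs * vf_bar_size


def vf_bar_space_reserved(vf_bar_size):
    """Space reserved in software: one full segmented M64 window."""
    return M64_SEGMENTS * vf_bar_size


def unused_segments(total_vfs):
    """Segments that are reserved but that no hardware responds to."""
    return range(total_vfs, M64_SEGMENTS)
```

For a device with total_VFs = 16 and a 1MB VF BAR, the VFs decode 16MB while
256MB is reserved, and segments 16 through 255 stay empty.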

4. Implications for the Generic PCI Code
========================================

The PCIe SR-IOV spec requires that the base of the VF(n) BAR space be
aligned to the size of an individual VF BAR.

In IODA2, the MMIO address determines the PE#. If the address is in an M32
window, we can set the PE# by updating the table that translates segments
to PE#s. Similarly, if the address is in an unsegmented M64 window, we can
set the PE# for the window. But if it's in a segmented M64 window, the
segment number is the PE#.

Therefore, the only way to control the PE# for a VF is to change the base
of the VF(n) BAR space in the VF BAR. If the PCI core allocates the exact
amount of space required for the VF(n) BAR space, the VF BAR value is fixed
and cannot be changed.

On the other hand, if the PCI core allocates additional space, the VF BAR
value can be changed as long as the entire VF(n) BAR space remains inside
the space allocated by the core.

Ideally the segment size will be the same as an individual VF BAR size.
Then each VF will be in its own PE. The VF BARs (and therefore the PE#s)
are contiguous. If VF0 is in PE(x), then VF(n) is in PE(x+n). If we
allocate 256 segments, VF0 can be placed in any of PE(0) through
PE(256 - numVFs), i.e., there are (256 - numVFs + 1) choices for its PE#.
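
Choosing the PE# of VF0 therefore amounts to shifting the VF BAR base by a
whole number of segments inside the reserved space. A minimal sketch, assuming
the reserved space starts at segment 0 of the segmented M64 window and the
segment size equals the VF BAR size; all names are hypothetical:

```python
SEGMENTS = 256


def max_vf0_shift(num_vfs):
    """Largest segment offset that keeps every VF inside the 256 segments."""
    if not 1 <= num_vfs <= SEGMENTS:
        raise ValueError("num_vfs must be between 1 and 256")
    return SEGMENTS - num_vfs


def shifted_vf_bar_base(reserved_base, vf_bar_size, vf0_pe, num_vfs):
    """VF BAR value that places VF0 in PE(vf0_pe)."""
    if not 0 <= vf0_pe <= max_vf0_shift(num_vfs):
        raise ValueError("VF(n) BAR space would leave the reserved range")
    return reserved_base + vf0_pe * vf_bar_size


def vf_pe(vf0_pe, n):
    """If VF0 is in PE(x), VF(n) is in PE(x + n)."""
    return vf0_pe + n
```

With 16 VFs and a 1MB VF BAR, moving the base up by 32MB puts VF0 in PE 32 and
VF15 in PE 47; the largest shift that still fits is 256 - 16 = 240 segments.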

If the segment size is smaller than the VF BAR size, it will take several
segments to cover a VF BAR, and a VF will be in several PEs. This is
possible, but the isolation isn't as good, and it reduces the number of PE#
choices because instead of consuming only numVFs segments, the VF(n) BAR
space will consume (numVFs * n) segments, where n is the number of segments
needed to cover one VF BAR. That means there aren't as many available
segments for adjusting the base of the VF(n) BAR space.
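
The cost of a segment size smaller than the VF BAR size can be estimated as
below. The names are hypothetical, and `segments_per_vf` plays the role of the
per-VF segment count called n in the paragraph above:

```python
SEGMENTS = 256


def segments_per_vf(vf_bar_size, segment_size):
    """Number of segments needed to cover one VF BAR (rounded up)."""
    return -(-vf_bar_size // segment_size)


def segments_consumed(num_vfs, vf_bar_size, segment_size):
    """Segments (and therefore PE#s) taken by the whole VF(n) BAR space."""
    return num_vfs * segments_per_vf(vf_bar_size, segment_size)


def segments_left_for_shifting(num_vfs, vf_bar_size, segment_size):
    """Segments still free for moving the base of the VF(n) BAR space."""
    return SEGMENTS - segments_consumed(num_vfs, vf_bar_size, segment_size)
```

For 16 VFs with a 4MB VF BAR and 1MB segments, each VF spans 4 PEs, the VF(n)
BAR space consumes 64 segments, and 192 remain for adjusting the base. With
4MB segments the same device would consume only 16.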