0001 ====================================
0002 Coherent Accelerator Interface (CXL)
0003 ====================================
0004
0005 Introduction
0006 ============
0007
0008 The coherent accelerator interface is designed to allow the
0009 coherent connection of accelerators (FPGAs and other devices) to a
0010 POWER system. These devices need to adhere to the Coherent
0011 Accelerator Interface Architecture (CAIA).
0012
0013 IBM refers to this as the Coherent Accelerator Processor Interface
0014 or CAPI. In the kernel it's referred to by the name CXL to avoid
0015 confusion with the ISDN CAPI subsystem.
0016
0017 Coherent in this context means that the accelerator and CPUs can
0018 both access system memory directly and with the same effective
0019 addresses.
0020
0021
0022 Hardware overview
0023 =================
0024
::

           POWER8/9             FPGA
        +----------+        +---------+
        |          |        |         |
        |   CPU    |        |   AFU   |
        |          |        |         |
        |          |        |         |
        |          |        |         |
        +----------+        +---------+
        |   PHB    |        |         |
        |   +------+        |   PSL   |
        |   | CAPP |<------>|         |
        +---+------+  PCIE  +---------+
0039
The POWER8/9 chip has a Coherently Attached Processor Proxy (CAPP)
unit which is part of the PCIe Host Bridge (PHB). Linux manages the
CAPP through calls into OPAL and does not program it directly.
0044
The FPGA (or coherently attached device) consists of two parts:
the POWER Service Layer (PSL) and the Accelerator Function Unit
(AFU). The AFU is used to implement specific functionality behind
the PSL. The PSL, among other things, provides memory address
translation services to allow each AFU direct access to userspace
memory.
0051
The AFU is the core part of the accelerator (e.g. the compression
or crypto function). The kernel has no knowledge of the function
of the AFU. Only userspace interacts directly with the AFU.
0055
The PSL provides the translation and interrupt services that the
AFU needs. This is what the kernel interacts with. For example, if
the AFU needs to read a particular effective address, it sends
that address to the PSL. The PSL then translates it, fetches the
data from memory and returns it to the AFU. If the PSL has a
translation miss, it interrupts the kernel and the kernel services
the fault. The context in which this fault is serviced depends on
who owns that acceleration function.
0064
- POWER8 and PSL Version 8 are compliant with CAIA Version 1.0.
- POWER9 and PSL Version 9 are compliant with CAIA Version 2.0.

PSL Version 9 provides new features such as:

* Interaction with the nest MMU on the POWER9 chip.
* Native DMA support.
* Support for sending ASB_Notify messages for host thread wakeup.
* Support for atomic operations.
0075
0076 Cards with a PSL9 won't work on a POWER8 system and cards with a
0077 PSL8 won't work on a POWER9 system.
0078
0079 AFU Modes
0080 =========
0081
There are two programming modes supported by the AFU: dedicated
and AFU directed. An AFU may support one or both modes.

When using dedicated mode, only one MMU context is supported. In
this mode, only one userspace process can use the accelerator at
a time.
0088
When using AFU directed mode, up to 16K simultaneous contexts can
be supported. This means up to 16K simultaneous userspace
applications may use the accelerator (although specific AFUs may
support fewer). In this mode, the AFU sends a 16-bit context ID
with each of its requests. This tells the PSL which context is
associated with each operation. If the PSL can't translate an
operation, the kernel can also read this ID to determine the
userspace context associated with the operation.
0097
0098
0099 MMIO space
0100 ==========
0101
0102 A portion of the accelerator MMIO space can be directly mapped
0103 from the AFU to userspace. Either the whole space can be mapped or
0104 just a per context portion. The hardware is self describing, hence
0105 the kernel can determine the offset and size of the per context
0106 portion.
0107
0108
0109 Interrupts
0110 ==========
0111
AFUs may generate interrupts that are destined for userspace. These
are received by the kernel as hardware interrupts and passed on to
userspace via the read syscall documented below.
0115
0116 Data storage faults and error interrupts are handled by the kernel
0117 driver.
0118
0119
0120 Work Element Descriptor (WED)
0121 =============================
0122
The WED is a 64-bit parameter passed to the AFU when a context is
started. Its format is up to the AFU, so the kernel has no
knowledge of what it represents. Typically it will be the
effective address of a work queue or status block where the AFU
and userspace can share control and status information.
0128
0129
0130
0131
0132 User API
0133 ========
0134
0135 1. AFU character devices
0136 ^^^^^^^^^^^^^^^^^^^^^^^^
0137
0138 For AFUs operating in AFU directed mode, two character device
0139 files will be created. /dev/cxl/afu0.0m will correspond to a
0140 master context and /dev/cxl/afu0.0s will correspond to a slave
0141 context. Master contexts have access to the full MMIO space an
0142 AFU provides. Slave contexts have access to only the per process
0143 MMIO space an AFU provides.
0144
0145 For AFUs operating in dedicated process mode, the driver will
0146 only create a single character device per AFU called
0147 /dev/cxl/afu0.0d. This will have access to the entire MMIO space
0148 that the AFU provides (like master contexts in AFU directed).
0149
The types described below are defined in include/uapi/misc/cxl.h.
0151
0152 The following file operations are supported on both slave and
0153 master devices.
0154
0155 A userspace library libcxl is available here:
0156
0157 https://github.com/ibm-capi/libcxl
0158
0159 This provides a C interface to this kernel API.
0160
0161 open
0162 ----
0163
0164 Opens the device and allocates a file descriptor to be used with
0165 the rest of the API.
0166
0167 A dedicated mode AFU only has one context and only allows the
0168 device to be opened once.
0169
An AFU directed mode AFU can have many contexts; the device can be
opened once for each context that is available.
0172
0173 When all available contexts are allocated the open call will fail
0174 and return -ENOSPC.
0175
0176 Note:
0177 IRQs need to be allocated for each context, which may limit
0178 the number of contexts that can be created, and therefore
0179 how many times the device can be opened. The POWER8 CAPP
0180 supports 2040 IRQs and 3 are used by the kernel, so 2037 are
0181 left. If 1 IRQ is needed per context, then only 2037
0182 contexts can be allocated. If 4 IRQs are needed per context,
0183 then only 2037/4 = 509 contexts can be allocated.
0184
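The arithmetic in the note above can be written out as a small
helper. This is only a sketch: the function and macro names are
illustrative and not part of the kernel API, and the 2040 and 3
figures are simply the POWER8 numbers quoted in the note.

```c
#include <assert.h>

/* Figures quoted in the note above for the POWER8 CAPP. */
#define CAPP_TOTAL_IRQS 2040u
#define KERNEL_IRQS        3u

/*
 * Hypothetical helper (not part of the kernel API): upper bound on the
 * number of contexts when every context needs irqs_per_context IRQs.
 * Integer division rounds down, because a context cannot be given a
 * partial set of IRQs.
 */
static unsigned int max_contexts_for_irqs(unsigned int irqs_per_context)
{
    assert(irqs_per_context > 0);
    return (CAPP_TOTAL_IRQS - KERNEL_IRQS) / irqs_per_context;
}
```

The actual limit for a given AFU may be lower, since the AFU itself
may support fewer contexts than the IRQ budget allows.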
0185
0186 ioctl
0187 -----
0188
0189 CXL_IOCTL_START_WORK:
0190 Starts the AFU context and associates it with the current
0191 process. Once this ioctl is successfully executed, all memory
0192 mapped into this process is accessible to this AFU context
0193 using the same effective addresses. No additional calls are
0194 required to map/unmap memory. The AFU memory context will be
0195 updated as userspace allocates and frees memory. This ioctl
0196 returns once the AFU context is started.
0197
0198 Takes a pointer to a struct cxl_ioctl_start_work
0199
::

    struct cxl_ioctl_start_work {
        __u64 flags;
        __u64 work_element_descriptor;
        __u64 amr;
        __s16 num_interrupts;
        __s16 reserved1;
        __s32 reserved2;
        __u64 reserved3;
        __u64 reserved4;
        __u64 reserved5;
        __u64 reserved6;
    };
0214
0215 flags:
0216 Indicates which optional fields in the structure are
0217 valid.
0218
0219 work_element_descriptor:
0220 The Work Element Descriptor (WED) is a 64-bit argument
0221 defined by the AFU. Typically this is an effective
0222 address pointing to an AFU specific structure
0223 describing what work to perform.
0224
0225 amr:
0226 Authority Mask Register (AMR), same as the powerpc
0227 AMR. This field is only used by the kernel when the
0228 corresponding CXL_START_WORK_AMR value is specified in
0229 flags. If not specified the kernel will use a default
0230 value of 0.
0231
0232 num_interrupts:
0233 Number of userspace interrupts to request. This field
0234 is only used by the kernel when the corresponding
0235 CXL_START_WORK_NUM_IRQS value is specified in flags.
0236 If not specified the minimum number required by the
0237 AFU will be allocated. The min and max number can be
0238 obtained from sysfs.
0239
0240 reserved fields:
0241 For ABI padding and future extensions
0242
0243 CXL_IOCTL_GET_PROCESS_ELEMENT:
0244 Get the current context id, also known as the process element.
0245 The value is returned from the kernel as a __u32.
0246
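As an illustration of how the START_WORK fields described above fit
together, the following sketch prepares the ioctl argument. To stay
buildable without kernel headers it uses a local mirror of the
structure; the names start_work_args, START_WORK_AMR,
START_WORK_NUM_IRQS and fill_start_work are hypothetical, and the
flag values are assumptions that should be replaced by the
CXL_START_WORK_* constants from include/uapi/misc/cxl.h. Zeroing the
reserved fields is a conservative choice rather than a documented
requirement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Local mirror of struct cxl_ioctl_start_work as laid out above.
 * Real code should include <misc/cxl.h> and use the kernel's type.
 */
struct start_work_args {
    uint64_t flags;
    uint64_t work_element_descriptor;
    uint64_t amr;
    int16_t  num_interrupts;
    int16_t  reserved1;
    int32_t  reserved2;
    uint64_t reserved3;
    uint64_t reserved4;
    uint64_t reserved5;
    uint64_t reserved6;
};

/* ASSUMED flag values; take the real ones from <misc/cxl.h>. */
#define START_WORK_AMR      0x1ULL
#define START_WORK_NUM_IRQS 0x2ULL

/*
 * Hypothetical helper: fill in the START_WORK argument. Pass
 * num_irqs < 0 to leave num_interrupts unset, so that the kernel
 * allocates the minimum number required by the AFU. The AMR flag is
 * never set here, so the kernel default of 0 applies.
 */
static void fill_start_work(struct start_work_args *w, uint64_t wed,
                            int num_irqs)
{
    assert(w != NULL);
    memset(w, 0, sizeof(*w));          /* flags, amr, reserved = 0 */
    w->work_element_descriptor = wed;  /* AFU-defined 64-bit value */
    if (num_irqs >= 0) {
        w->flags |= START_WORK_NUM_IRQS;
        w->num_interrupts = (int16_t)num_irqs;
    }
}
```

With the real header, the filled structure would then be passed as
the third argument of ioctl(fd, CXL_IOCTL_START_WORK, ...) on a file
descriptor obtained from open() on the AFU device.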
0247
0248 mmap
0249 ----
0250
0251 An AFU may have an MMIO space to facilitate communication with the
0252 AFU. If it does, the MMIO space can be accessed via mmap. The size
0253 and contents of this area are specific to the particular AFU. The
0254 size can be discovered via sysfs.
0255
0256 In AFU directed mode, master contexts are allowed to map all of
0257 the MMIO space and slave contexts are allowed to only map the per
0258 process MMIO space associated with the context. In dedicated
0259 process mode the entire MMIO space can always be mapped.
0260
0261 This mmap call must be done after the START_WORK ioctl.
0262
Care should be taken when accessing MMIO space. Only 32-bit and
64-bit accesses are supported by POWER8. Also, the AFU will be
designed with a specific endianness, so all MMIO accesses should
take endianness into account (the endian(3) helpers such as
le64toh() and be64toh() are recommended). These endianness issues
apply equally to any shared memory queues the WED may describe.
0269
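A minimal sketch of an endian-aware MMIO read follows. It assumes,
purely for illustration, an AFU whose registers are little-endian;
the helper name afu_mmio_read64_le is hypothetical, and the
alignment check is a conservative choice rather than something this
document requires. A big-endian AFU would use be64toh() instead.

```c
#define _DEFAULT_SOURCE   /* for le64toh() in <endian.h> on glibc */
#include <assert.h>
#include <endian.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: read one 64-bit register from a mapped MMIO
 * area of an AFU that is ASSUMED to be little-endian. base is the
 * pointer returned by mmap() and offset is an AFU-specific register
 * offset in bytes.
 */
static uint64_t afu_mmio_read64_le(const volatile void *base, size_t offset)
{
    const volatile uint64_t *reg;

    /* Keep the access naturally aligned (conservative assumption). */
    assert(offset % sizeof(uint64_t) == 0);
    reg = (const volatile uint64_t *)((const volatile char *)base + offset);

    /* One 64-bit load, then conversion to host byte order. */
    return le64toh(*reg);
}
```

Going through a volatile 64-bit pointer keeps the compiler from
splitting or merging the load, which matters because only 32-bit and
64-bit accesses are supported.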
0270
0271 read
0272 ----
0273
0274 Reads events from the AFU. Blocks if no events are pending
0275 (unless O_NONBLOCK is supplied). Returns -EIO in the case of an
0276 unrecoverable error or if the card is removed.
0277
0278 read() will always return an integral number of events.
0279
0280 The buffer passed to read() must be at least 4K bytes.
0281
The result of the read will be a buffer of one or more events. Each
event is of type struct cxl_event and varies in size::

    struct cxl_event {
        struct cxl_event_header header;
        union {
            struct cxl_event_afu_interrupt irq;
            struct cxl_event_data_storage fault;
            struct cxl_event_afu_error afu_error;
        };
    };
0293
0294 The struct cxl_event_header is defined as
0295
::

    struct cxl_event_header {
        __u16 type;
        __u16 size;
        __u16 process_element;
        __u16 reserved1;
    };
0304
0305 type:
0306 This defines the type of event. The type determines how
0307 the rest of the event is structured. These types are
0308 described below and defined by enum cxl_event_type.
0309
0310 size:
0311 This is the size of the event in bytes including the
0312 struct cxl_event_header. The start of the next event can
0313 be found at this offset from the start of the current
0314 event.
0315
0316 process_element:
0317 Context ID of the event.
0318
0319 reserved field:
0320 For future extensions and padding.
0321
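Because each header carries the size of its event, a buffer returned
by read() can be walked one event at a time. The sketch below shows
one way to do that. It uses a local mirror of the header so that it
builds without kernel headers; the names event_header and
count_events are hypothetical, and the sanity checks are defensive
additions rather than documented behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct cxl_event_header as shown above. */
struct event_header {
    uint16_t type;
    uint16_t size;
    uint16_t process_element;
    uint16_t reserved1;
};

/*
 * Hypothetical helper: count the events in the first len bytes of a
 * buffer filled by read(). Returns -1 if a header is truncated or
 * carries a size that does not fit in the remaining bytes.
 */
static int count_events(const unsigned char *buf, size_t len)
{
    size_t off = 0;
    int n = 0;

    assert(buf != NULL);
    while (off < len) {
        struct event_header hdr;

        if (len - off < sizeof(hdr))
            return -1;
        /* Copy the header out instead of casting the buffer pointer. */
        memcpy(&hdr, buf + off, sizeof(hdr));
        if (hdr.size < sizeof(hdr) || hdr.size > len - off)
            return -1;
        n++;
        off += hdr.size;   /* next event starts size bytes further on */
    }
    return n;
}
```

A caller would typically pass a buffer of at least 4K bytes to
read() and then hand the buffer and the returned byte count to a
walker of this kind, dispatching on hdr.type inside the loop.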
0322 If the event type is CXL_EVENT_AFU_INTERRUPT then the event
0323 structure is defined as
0324
::

    struct cxl_event_afu_interrupt {
        __u16 flags;
        __u16 irq; /* Raised AFU interrupt number */
        __u32 reserved1;
    };
0332
0333 flags:
0334 These flags indicate which optional fields are present
0335 in this struct. Currently all fields are mandatory.
0336
0337 irq:
0338 The IRQ number sent by the AFU.
0339
0340 reserved field:
0341 For future extensions and padding.
0342
0343 If the event type is CXL_EVENT_DATA_STORAGE then the event
0344 structure is defined as
0345
::

    struct cxl_event_data_storage {
        __u16 flags;
        __u16 reserved1;
        __u32 reserved2;
        __u64 addr;
        __u64 dsisr;
        __u64 reserved3;
    };
0356
0357 flags:
0358 These flags indicate which optional fields are present in
0359 this struct. Currently all fields are mandatory.
0360
addr:
0362 The address that the AFU unsuccessfully attempted to
0363 access. Valid accesses will be handled transparently by the
0364 kernel but invalid accesses will generate this event.
0365
0366 dsisr:
0367 This field gives information on the type of fault. It is a
0368 copy of the DSISR from the PSL hardware when the address
0369 fault occurred. The form of the DSISR is as defined in the
0370 CAIA.
0371
0372 reserved fields:
0373 For future extensions
0374
0375 If the event type is CXL_EVENT_AFU_ERROR then the event structure
0376 is defined as
0377
::

    struct cxl_event_afu_error {
        __u16 flags;
        __u16 reserved1;
        __u32 reserved2;
        __u64 error;
    };
0386
flags:
These flags indicate which optional fields are present in
this struct. Currently all fields are mandatory.
0390
0391 error:
0392 Error status from the AFU. Defined by the AFU.
0393
0394 reserved fields:
0395 For future extensions and padding
0396
0397
2. Card character device (PowerVM guest only)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In a PowerVM guest, an extra character device is created for the
card. The device is only used to write (flash) a new image on the
FPGA accelerator. Once the image is written and verified, the
device tree is updated and the card is reset to reload the updated
image.
0406
0407 open
0408 ----
0409
0410 Opens the device and allocates a file descriptor to be used with
0411 the rest of the API. The device can only be opened once.
0412
0413 ioctl
0414 -----
0415
CXL_IOCTL_DOWNLOAD_IMAGE / CXL_IOCTL_VALIDATE_IMAGE:
Starts and controls flashing a new FPGA image. Partial
reconfiguration is not supported (yet), so the image must contain
a copy of the PSL and AFU(s). Since an image can be quite large,
the caller may have to iterate, splitting the image into smaller
chunks.
0422
Takes a pointer to a struct cxl_adapter_image::

    struct cxl_adapter_image {
        __u64 flags;
        __u64 data;
        __u64 len_data;
        __u64 len_image;
        __u64 reserved1;
        __u64 reserved2;
        __u64 reserved3;
        __u64 reserved4;
    };
0435
0436 flags:
0437 These flags indicate which optional fields are present in
0438 this struct. Currently all fields are mandatory.
0439
0440 data:
0441 Pointer to a buffer with part of the image to write to the
0442 card.
0443
0444 len_data:
0445 Size of the buffer pointed to by data.
0446
0447 len_image:
0448 Full size of the image.
0449
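The chunked transfer mentioned above could be prepared along the
following lines. This is a sketch only: it uses a local mirror of
the structure, the names adapter_image and fill_image_chunk are
hypothetical, and the chunk size is left to the caller. How the
chunks are sequenced between the download and validate ioctls is not
specified here, so the sketch stops at filling in one descriptor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct cxl_adapter_image as shown above. */
struct adapter_image {
    uint64_t flags;
    uint64_t data;
    uint64_t len_data;
    uint64_t len_image;
    uint64_t reserved1;
    uint64_t reserved2;
    uint64_t reserved3;
    uint64_t reserved4;
};

/*
 * Hypothetical helper: describe the chunk of image that starts at
 * offset and is at most chunk_len bytes long. The last chunk is
 * shortened so that it never runs past the end of the image.
 */
static void fill_image_chunk(struct adapter_image *img, const void *image,
                             uint64_t image_len, uint64_t offset,
                             uint64_t chunk_len)
{
    uint64_t remaining;

    assert(img != NULL && image != NULL);
    assert(offset < image_len);
    remaining = image_len - offset;

    memset(img, 0, sizeof(*img));   /* flags and reserved fields = 0 */
    img->data = (uint64_t)(uintptr_t)((const unsigned char *)image + offset);
    img->len_data = chunk_len < remaining ? chunk_len : remaining;
    img->len_image = image_len;     /* always the full image size */
}
```

A caller would advance offset by the len_data of each descriptor
until the whole image has been covered.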
0450
0451 Sysfs Class
0452 ===========
0453
A cxl sysfs class is added under /sys/class/cxl to facilitate
enumeration and tuning of the accelerators. Its layout is
described in Documentation/ABI/testing/sysfs-class-cxl.
0457
0458
0459 Udev rules
0460 ==========
0461
The following udev rules could be used to create a symlink to the
most logical chardev to use in any programming mode (afuX.Yd for
dedicated, afuX.Ys for AFU directed), since the API is virtually
identical for each::

    SUBSYSTEM=="cxl", ATTRS{mode}=="dedicated_process", SYMLINK="cxl/%b"
    SUBSYSTEM=="cxl", ATTRS{mode}=="afu_directed", \
               KERNEL=="afu[0-9]*.[0-9]*s", SYMLINK="cxl/%b"