0001 ==================================
0002 vfio-ccw: the basic infrastructure
0003 ==================================
0004 
0005 Introduction
0006 ------------
0007 
0008 This document describes the vfio support for I/O subchannel devices on
0009 Linux/s390. The motivation for vfio-ccw is to pass subchannels through
0010 to a virtual machine, with vfio as the means.
0011 
0012 Unlike other hardware architectures, s390 defines a unified I/O access
0013 method, the so-called Channel I/O. It has its own access
0014 patterns:
0015 
0016 - Channel programs run asynchronously on a separate (co)processor.
0017 - The channel subsystem will access any memory designated by the caller
0018   in the channel program directly, i.e. there is no iommu involved.
0019 
0020 Thus, when introducing vfio support for these devices, we implement it
0021 as a mediated device (mdev). The vfio mdev is added to an iommu group
0022 so that it can be managed by the vfio framework. We also add read/write
0023 callbacks for special vfio I/O regions; these pass the channel programs
0024 from the mdev to its parent device (the real I/O subchannel device),
0025 which performs further address translation and issues the I/O
0026 instructions.
0027 
0028 This document does not intend to explain the s390 I/O architecture in
0029 every detail. More information can be found here:
0030 
0031 - A good starting point for Channel I/O in general:
0032   https://en.wikipedia.org/wiki/Channel_I/O
0033 - s390 architecture:
0034   s390 Principles of Operation manual (IBM Form. No. SA22-7832)
0035 - The existing QEMU code which implements a simple emulated channel
0036   subsystem could also be a good reference. It makes it easier to follow
0037   the flow.
0038   qemu/hw/s390x/css.c
0039 
0040 For the vfio mediated device framework:
0041 - Documentation/driver-api/vfio-mediated-device.rst
0042 
0043 Motivation of vfio-ccw
0044 ----------------------
0045 
0046 Typically, a guest virtualized via QEMU/KVM on s390 only sees
0047 paravirtualized virtio devices via the "Virtio Over Channel I/O
0048 (virtio-ccw)" transport. This makes virtio devices discoverable via
0049 standard operating system algorithms for handling channel devices.
0050 
0051 However, this is not enough. The majority of devices on s390 use the
0052 standard Channel I/O based mechanism, and we also need to be able to
0053 pass them through to a QEMU virtual machine.
0054 This includes devices that don't have a virtio counterpart (e.g. tape
0055 drives) or that have specific characteristics which guests want to
0056 exploit.
0057 
0058 For passing a device to a guest, we want to use the same interface as
0059 everybody else, namely vfio. We implement this vfio support for channel
0060 devices via the vfio mediated device framework and the subchannel device
0061 driver "vfio_ccw".
0062 
0063 Access patterns of CCW devices
0064 ------------------------------
0065 
0066 The s390 architecture implements a so-called channel subsystem, which
0067 provides a unified view of the devices physically attached to the
0068 system. Although the s390 hardware platform knows about a huge variety
0069 of peripheral attachments, such as disk devices (a.k.a. DASDs), tapes,
0070 and communication controllers, they can all be accessed by a
0071 well-defined access method, and they all present I/O completion in a
0072 unified way: I/O interruptions.
0073 
0074 All I/O requires the use of channel command words (CCWs). A CCW is an
0075 instruction to a specialized I/O channel processor. A channel program is
0076 a sequence of CCWs which are executed by the I/O channel subsystem.  To
0077 issue a channel program to the channel subsystem, it is required to
0078 build an operation request block (ORB), which can be used to point out
0079 the format of the CCW and other control information to the system. The
0080 operating system signals the I/O channel subsystem to begin executing
0081 the channel program with an SSCH (START SUBCHANNEL) instruction. The
0082 central processor is then free to proceed with non-I/O instructions
0083 until interrupted. The I/O completion result is received by the
0084 interrupt handler in the form of an interrupt response block (IRB).
0085 
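To make the notion of a channel program more concrete, the following is a minimal sketch of a format-1 CCW and of walking a chain of CCWs. The field layout and the chaining flag values follow the usual s390 definitions, but the helper name ``ccw_chain_len`` is hypothetical, the fields are kept in host byte order for simplicity, and transfer-in-channel (TIC) branches are deliberately ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CCW_FLAG_DC 0x80 /* data chaining: next CCW continues this transfer */
#define CCW_FLAG_CC 0x40 /* command chaining: next CCW is a new command */

/* Format-1 CCW layout; fields are in host byte order in this sketch. */
struct ccw1 {
    uint8_t  cmd_code; /* command to execute */
    uint8_t  flags;    /* chaining and other control flags */
    uint16_t count;    /* number of bytes to transfer */
    uint32_t cda;      /* data address (or address of an IDAL) */
} __attribute__((packed, aligned(8)));

static_assert(sizeof(struct ccw1) == 8, "a CCW occupies one doubleword");

/*
 * Hypothetical helper: count the CCWs in a chain that starts at ccw[0].
 * The chain ends at the first CCW with neither chaining flag set.
 * Returns 0 if no such CCW is found within max entries.
 */
size_t ccw_chain_len(const struct ccw1 *ccw, size_t max)
{
    for (size_t i = 0; i < max; i++) {
        if (!(ccw[i].flags & (CCW_FLAG_CC | CCW_FLAG_DC)))
            return i + 1;
    }
    return 0;
}
```

A caller would pass the start of the channel program and an upper bound on the number of CCWs it is willing to inspect.
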
0086 Back to vfio-ccw, in short:
0087 
0088 - ORBs and channel programs are built in the guest kernel (with guest
0089   physical addresses).
0090 - ORBs and channel programs are passed to the host kernel.
0091 - The host kernel translates the guest physical addresses to real
0092   addresses and starts the I/O by issuing a privileged Channel I/O
0093   instruction (e.g. SSCH).
0094 - Channel programs run asynchronously on a separate processor.
0095 - I/O completion is signaled to the host with I/O interruptions.
0096   The result is copied as an IRB to user space, which passes it back
0097   to the guest.
0098 
0099 Physical vfio ccw device and its child mdev
0100 -------------------------------------------
0101 
0102 As mentioned above, we realize vfio-ccw with an mdev implementation.
0103 
0104 Channel I/O does not have IOMMU hardware support, so the physical
0105 vfio-ccw device does not have an IOMMU level translation or isolation.
0106 
0107 Subchannel I/O instructions are all privileged instructions. When
0108 handling the I/O instruction interception, vfio-ccw polices and
0109 translates the channel program in software before it is sent to the
0110 hardware.
0111 
0112 Within this implementation, we have two drivers for two types of
0113 devices:
0114 
0115 - The vfio_ccw driver for the physical subchannel device.
0116   This is an I/O subchannel driver for the real subchannel device.  It
0117   implements a group of callbacks and registers with the mdev framework
0118   as a parent (physical) device. As a consequence, mdev provides vfio_ccw
0119   with a generic interface (sysfs) to create mdev devices. A vfio mdev
0120   can then be created by vfio_ccw and added to the mediated bus. It is
0121   this vfio device that is added to an IOMMU group and a vfio group.
0122   vfio_ccw also provides an I/O region to accept channel program
0123   requests from user space and to store the I/O interrupt result for
0124   user space to retrieve. To notify user space of an I/O completion, it
0125   offers an interface to set up an eventfd for asynchronous signaling.
0126 
0127 - The vfio_mdev driver for the mediated vfio ccw device.
0128   This is provided by the mdev framework. It is a vfio device driver for
0129   the mdev that is created by vfio_ccw.
0130   It implements a group of vfio device driver callbacks, adds itself to
0131   a vfio group, and registers itself with the mdev framework as an mdev
0132   driver.
0133   It uses a vfio iommu backend that uses the existing map and unmap
0134   ioctls, but rather than programming them into an IOMMU for a device,
0135   it simply stores the translations for use by later requests. This
0136   means that for a device programmed in a VM with guest physical
0137   addresses, the vfio kernel code can convert such an address to a
0138   process virtual address, pin the page and program the hardware with
0139   the host physical address in one step.
0140   For an mdev, the vfio iommu backend will not pin the pages during the
0141   VFIO_IOMMU_MAP_DMA ioctl. The mdev framework only maintains a database
0142   of the iova<->vaddr mappings in this operation. The vfio iommu backend
0143   exports the vfio_pin_pages and vfio_unpin_pages interfaces so that
0144   physical device drivers can pin and unpin pages on demand.
0145 
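The idea of storing iova<->vaddr translations at map time and resolving them only when a later request needs them can be illustrated with a small user-space model. This is only a sketch under simplifying assumptions: the names ``dma_map_db``, ``dma_map_add`` and ``dma_map_lookup`` are hypothetical, a fixed-size array stands in for whatever data structure the real vfio iommu backend uses, and no pages are actually pinned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One recorded translation: [iova, iova + size) maps to vaddr onwards. */
struct dma_mapping {
    uint64_t iova;
    uint64_t vaddr;
    uint64_t size;
};

#define MAX_MAPPINGS 16

/* Hypothetical mapping database; nothing is pinned when a mapping is added. */
struct dma_map_db {
    struct dma_mapping map[MAX_MAPPINGS];
    size_t count;
};

/* Record a mapping, as a map request would. Returns 0 or -1. */
int dma_map_add(struct dma_map_db *db, uint64_t iova, uint64_t vaddr,
                uint64_t size)
{
    assert(db != NULL);
    if (db->count == MAX_MAPPINGS || size == 0)
        return -1;
    db->map[db->count++] = (struct dma_mapping){ iova, vaddr, size };
    return 0;
}

/* Resolve an iova on demand, as a later I/O request would. Returns 0 or -1. */
int dma_map_lookup(const struct dma_map_db *db, uint64_t iova, uint64_t *vaddr)
{
    for (size_t i = 0; i < db->count; i++) {
        const struct dma_mapping *m = &db->map[i];

        if (iova >= m->iova && iova - m->iova < m->size) {
            *vaddr = m->vaddr + (iova - m->iova);
            return 0;
        }
    }
    return -1;
}
```

In the real implementation, the lookup step is followed by pinning the page and obtaining its host physical address; the model above stops at the virtual address.
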
0146 Below is a high-level block diagram::
0147 
0148  +-------------+
0149  |             |
0150  | +---------+ | mdev_register_driver() +--------------+
0151  | |  Mdev   | +<-----------------------+              |
0152  | |  bus    | |                        | vfio_mdev.ko |
0153  | | driver  | +----------------------->+              |<-> VFIO user
0154  | +---------+ |    probe()/remove()    +--------------+    APIs
0155  |             |
0156  |  MDEV CORE  |
0157  |   MODULE    |
0158  |   mdev.ko   |
0159  | +---------+ | mdev_register_device() +--------------+
0160  | |Physical | +<-----------------------+              |
0161  | | device  | |                        |  vfio_ccw.ko |<-> subchannel
0162  | |interface| +----------------------->+              |     device
0163  | +---------+ |       callback         +--------------+
0164  +-------------+
0165 
0166 These components work together as follows:
0167 
0168 1. vfio_ccw.ko drives the physical I/O subchannel, and registers the
0169    physical device (with callbacks) with the mdev framework.
0170    When vfio_ccw probes the subchannel device, it registers the device
0171    pointer and callbacks with the mdev framework. Mdev-related file nodes
0172    are then created under the device node in sysfs for the subchannel
0173    device, namely 'mdev_create', 'mdev_destroy' and
0174    'mdev_supported_types'.
0175 2. Create a mediated vfio ccw device.
0176    Using the 'mdev_create' sysfs file, we manually create one (and
0177    only one for our case) mediated device.
0178 3. vfio_mdev.ko drives the mediated ccw device.
0179    vfio_mdev is also the vfio device driver. It will probe the mdev and
0180    add it to an iommu_group and a vfio_group. Then we can pass the mdev
0181    through to a guest.
0182 
0183 
0184 VFIO-CCW Regions
0185 ----------------
0186 
0187 The vfio-ccw driver exposes MMIO regions to accept requests from and return
0188 results to userspace.
0189 
0190 vfio-ccw I/O region
0191 -------------------
0192 
0193 An I/O region is used to accept channel program requests from user
0194 space and to store the I/O interrupt result for user space to retrieve.
0195 The definition of the region is::
0196 
0197   struct ccw_io_region {
0198   #define ORB_AREA_SIZE 12
0199           __u8    orb_area[ORB_AREA_SIZE];
0200   #define SCSW_AREA_SIZE 12
0201           __u8    scsw_area[SCSW_AREA_SIZE];
0202   #define IRB_AREA_SIZE 96
0203           __u8    irb_area[IRB_AREA_SIZE];
0204           __u32   ret_code;
0205   } __packed;
0206 
0207 This region is always available.
0208 
0209 While starting an I/O request, orb_area should be filled with the
0210 guest ORB, and scsw_area should be filled with the SCSW of the Virtual
0211 Subchannel.
0212 
0213 irb_area stores the I/O result.
0214 
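As an illustration of how a user space program might fill the region before starting an I/O request, here is a minimal sketch. It mirrors the structure above with ``<stdint.h>`` types so that it compiles outside the kernel tree; the helper ``io_region_prepare`` is a hypothetical name, and the actual write of the region to the vfio device file descriptor is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ORB_AREA_SIZE  12
#define SCSW_AREA_SIZE 12
#define IRB_AREA_SIZE  96

/* User-space mirror of struct ccw_io_region (assumption: same layout). */
struct ccw_io_region {
    uint8_t  orb_area[ORB_AREA_SIZE];
    uint8_t  scsw_area[SCSW_AREA_SIZE];
    uint8_t  irb_area[IRB_AREA_SIZE];
    uint32_t ret_code;
} __attribute__((packed));

/* 12 + 12 + 96 + 4 bytes, with no padding because the struct is packed. */
static_assert(sizeof(struct ccw_io_region) == 124, "unexpected region size");

/*
 * Hypothetical helper: prepare a region for a start request by clearing
 * it and copying in the guest ORB and the SCSW of the virtual subchannel.
 */
void io_region_prepare(struct ccw_io_region *region,
                       const uint8_t orb[ORB_AREA_SIZE],
                       const uint8_t scsw[SCSW_AREA_SIZE])
{
    memset(region, 0, sizeof(*region));
    memcpy(region->orb_area, orb, ORB_AREA_SIZE);
    memcpy(region->scsw_area, scsw, SCSW_AREA_SIZE);
}
```

After the request completes, the same structure can be read back to obtain irb_area and ret_code.
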
0215 ret_code stores a return code for each access of the region. The following
0216 values may occur:
0217 
0218 ``0``
0219   The operation was successful.
0220 
0221 ``-EOPNOTSUPP``
0222   The orb specified transport mode or an unidentified IDAW format, or the
0223   scsw specified a function other than the start function.
0224 
0225 ``-EIO``
0226   A request was issued while the device was not in a state ready to accept
0227   requests, or an internal error occurred.
0228 
0229 ``-EBUSY``
0230   The subchannel was status pending or busy, or a request is already active.
0231 
0232 ``-EAGAIN``
0233   A request was being processed, and the caller should retry.
0234 
0235 ``-EACCES``
0236   The channel path(s) used for the I/O were found to be not operational.
0237 
0238 ``-ENODEV``
0239   The device was found to be not operational.
0240 
0241 ``-EINVAL``
0242   The orb specified a chain longer than 255 ccws, or an internal error
0243   occurred.
0244 
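A caller could act on these return codes roughly as sketched below. The enum and function names are hypothetical, and the policy (retry only on ``-EAGAIN``, treat everything else as a failure) is an assumption derived from the descriptions above, not a statement about what any particular user space program does.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical classification of the ret_code field of the I/O region. */
enum io_action {
    IO_DONE,  /* ret_code was 0 */
    IO_RETRY, /* ret_code was -EAGAIN: try the request again */
    IO_FAIL,  /* any other negative errno value */
};

/* ret_code is stored as an unsigned 32-bit field holding 0 or -errno. */
enum io_action io_ret_code_action(uint32_t ret_code)
{
    int32_t err = (int32_t)ret_code;

    assert(err <= 0); /* positive values are not expected here */
    if (err == 0)
        return IO_DONE;
    if (err == -EAGAIN)
        return IO_RETRY;
    return IO_FAIL;
}
```
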
0245 
0246 vfio-ccw cmd region
0247 -------------------
0248 
0249 The vfio-ccw cmd region is used to accept asynchronous instructions
0250 from userspace::
0251 
0252   #define VFIO_CCW_ASYNC_CMD_HSCH (1 << 0)
0253   #define VFIO_CCW_ASYNC_CMD_CSCH (1 << 1)
0254   struct ccw_cmd_region {
0255          __u32 command;
0256          __u32 ret_code;
0257   } __packed;
0258 
0259 This region is exposed via region type VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD.
0260 
0261 Currently, CLEAR SUBCHANNEL and HALT SUBCHANNEL use this region.
0262 
0263 command specifies the command to be issued; ret_code stores a return code
0264 for each access of the region. The following values may occur:
0265 
0266 ``0``
0267   The operation was successful.
0268 
0269 ``-ENODEV``
0270   The device was found to be not operational.
0271 
0272 ``-EINVAL``
0273   A command other than halt or clear was specified.
0274 
0275 ``-EIO``
0276   A request was issued while the device was not in a state ready to accept
0277   requests.
0278 
0279 ``-EAGAIN``
0280   A request was being processed, and the caller should retry.
0281 
0282 ``-EBUSY``
0283   The subchannel was status pending or busy while processing a halt request.
0284 
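The following sketch shows how a user space program might prepare a halt or clear request for this region. The structure and the two command values are taken from the definition above (with ``<stdint.h>`` types); the helper ``cmd_region_prepare`` is hypothetical and simply mirrors the documented rule that any command other than halt or clear is rejected with ``-EINVAL``.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define VFIO_CCW_ASYNC_CMD_HSCH (1 << 0)
#define VFIO_CCW_ASYNC_CMD_CSCH (1 << 1)

/* User-space mirror of struct ccw_cmd_region (assumption: same layout). */
struct ccw_cmd_region {
    uint32_t command;
    uint32_t ret_code;
} __attribute__((packed));

static_assert(sizeof(struct ccw_cmd_region) == 8, "unexpected region size");

/*
 * Hypothetical helper: fill in the region for exactly one of the two
 * supported commands. Returns 0, or -EINVAL for anything else.
 */
int cmd_region_prepare(struct ccw_cmd_region *region, uint32_t command)
{
    if (command != VFIO_CCW_ASYNC_CMD_HSCH &&
        command != VFIO_CCW_ASYNC_CMD_CSCH)
        return -EINVAL;

    memset(region, 0, sizeof(*region));
    region->command = command;
    return 0;
}
```
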
0285 vfio-ccw schib region
0286 ---------------------
0287 
0288 The vfio-ccw schib region is used to return Subchannel-Information
0289 Block (SCHIB) data to userspace::
0290 
0291   struct ccw_schib_region {
0292   #define SCHIB_AREA_SIZE 52
0293          __u8 schib_area[SCHIB_AREA_SIZE];
0294   } __packed;
0295 
0296 This region is exposed via region type VFIO_REGION_SUBTYPE_CCW_SCHIB.
0297 
0298 Reading this region triggers a STORE SUBCHANNEL to be issued to the
0299 associated hardware.
0300 
0301 vfio-ccw crw region
0302 ---------------------
0303 
0304 The vfio-ccw crw region is used to return Channel Report Word (CRW)
0305 data to userspace::
0306 
0307   struct ccw_crw_region {
0308          __u32 crw;
0309          __u32 pad;
0310   } __packed;
0311 
0312 This region is exposed via region type VFIO_REGION_SUBTYPE_CCW_CRW.
0313 
0314 Reading this region returns a CRW if one that is relevant for this
0315 subchannel (e.g. one reporting changes in channel path state) is
0316 pending, or all zeroes if not. If multiple CRWs are pending (including
0317 possibly chained CRWs), reading this region again will return the next
0318 one, until no more CRWs are pending and zeroes are returned. This is
0319 similar to how STORE CHANNEL REPORT WORD works.
0320 
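The read-until-zero behaviour described above suggests a simple drain loop. The sketch below is an assumption about how a consumer could be structured: ``crw_drain`` and the ``fake_crw_*`` names are hypothetical, and the fake reader stands in for an actual read of the region from the vfio device file descriptor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space mirror of struct ccw_crw_region (assumption: same layout). */
struct ccw_crw_region {
    uint32_t crw;
    uint32_t pad;
} __attribute__((packed));

static_assert(sizeof(struct ccw_crw_region) == 8, "unexpected region size");

/* Callback that performs one read of the region; returns 0 on success. */
typedef int (*crw_read_fn)(void *ctx, struct ccw_crw_region *out);

/*
 * Hypothetical helper: read the region repeatedly and collect CRWs until
 * a zero CRW is returned, a read fails, or max entries were stored.
 * Returns the number of CRWs collected.
 */
size_t crw_drain(crw_read_fn read_region, void *ctx, uint32_t *crws, size_t max)
{
    size_t n = 0;

    while (n < max) {
        struct ccw_crw_region region;

        if (read_region(ctx, &region) != 0 || region.crw == 0)
            break;
        crws[n++] = region.crw;
    }
    return n;
}

/* Stand-in for the device: hands out queued CRWs, then zeroes. */
struct fake_crw_queue {
    const uint32_t *pending;
    size_t count;
    size_t next;
};

int fake_crw_read(void *ctx, struct ccw_crw_region *out)
{
    struct fake_crw_queue *q = ctx;

    out->crw = q->next < q->count ? q->pending[q->next++] : 0;
    out->pad = 0;
    return 0;
}
```
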
0321 vfio-ccw operation details
0322 --------------------------
0323 
0324 vfio-ccw follows what vfio-pci did on the s390 platform and uses
0325 vfio-iommu-type1 as the vfio iommu backend.
0326 
0327 * CCW translation APIs
0328   A group of APIs (prefixed with `cp_`) that perform CCW translation. The
0329   CCWs passed in by a user space program are organized with their guest
0330   physical memory addresses. These APIs copy the CCWs into kernel
0331   space, and assemble a runnable kernel channel program by replacing the
0332   guest physical addresses with their corresponding host physical addresses.
0333   Note that we have to use IDALs even for direct-access CCWs, as the
0334   referenced memory can be located anywhere, including above 2G.
0335 
0336 * vfio_ccw device driver
0337   This driver utilizes the CCW translation APIs and introduces
0338   vfio_ccw, which is the driver for the I/O subchannel devices you want
0339   to pass through.
0340   vfio_ccw implements the following vfio ioctls::
0341 
0342     VFIO_DEVICE_GET_INFO
0343     VFIO_DEVICE_GET_IRQ_INFO
0344     VFIO_DEVICE_GET_REGION_INFO
0345     VFIO_DEVICE_RESET
0346     VFIO_DEVICE_SET_IRQS
0347 
0348   This provides an I/O region, so that the user space program can pass a
0349   channel program to the kernel, which performs further CCW translation
0350   before issuing it to a real device.
0351   This also provides the SET_IRQS ioctl to set up an event notifier that
0352   notifies the user space program of the I/O completion
0353   asynchronously.
0354 
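The reason for always using IDALs can be illustrated with a small sketch: a guest buffer that is contiguous in guest physical memory may span several blocks, each of which has to be translated separately, and the resulting host addresses may lie above 2G. The code below is not the `cp_` implementation. It assumes 64-bit IDAWs with 4K blocks, and the names ``idal_words_needed``, ``idal_build`` and ``demo_translate`` are hypothetical; ``demo_translate`` merely adds a constant offset where a real translation would look up and pin each page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IDA_BLOCK_SIZE 4096u /* assumption: 4K blocks, 64-bit IDAWs */

/* Number of IDAWs needed to describe len bytes starting at addr. */
unsigned int idal_words_needed(uint64_t addr, uint32_t len)
{
    uint64_t offset = addr & (IDA_BLOCK_SIZE - 1);

    return (unsigned int)((offset + len + IDA_BLOCK_SIZE - 1) / IDA_BLOCK_SIZE);
}

/*
 * Hypothetical helper: build an IDAL for a guest buffer. Each block is
 * translated on its own; only the first IDAW keeps the in-block offset.
 * Returns the number of IDAWs written, or 0 if they do not fit in max.
 */
unsigned int idal_build(uint64_t gpa, uint32_t len,
                        uint64_t (*translate)(uint64_t guest_block),
                        uint64_t *idaws, unsigned int max)
{
    assert(translate != NULL && idaws != NULL);

    unsigned int n = idal_words_needed(gpa, len);
    uint64_t block = gpa & ~(uint64_t)(IDA_BLOCK_SIZE - 1);

    if (n == 0 || n > max)
        return 0;
    for (unsigned int i = 0; i < n; i++)
        idaws[i] = translate(block + (uint64_t)i * IDA_BLOCK_SIZE);
    idaws[0] += gpa & (IDA_BLOCK_SIZE - 1);
    return n;
}

/* Stand-in translation: places every guest block above 4G. */
uint64_t demo_translate(uint64_t guest_block)
{
    return guest_block + 0x100000000ULL;
}
```
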
0355 The use of vfio-ccw is not limited to QEMU, but QEMU is definitely a
0356 good example for understanding how these patches work. Here is a bit
0357 more detail on how an I/O request triggered by the QEMU guest is
0358 handled (without error handling).
0359 
0360 Explanation:
0361 
0362 - Q1-Q7: QEMU side process.
0363 - K1-K6: Kernel side process.
0364 
0365 Q1.
0366     Get I/O region info during initialization.
0367 
0368 Q2.
0369     Setup event notifier and handler to handle I/O completion.
0370 
0371 ... ...
0372 
0373 Q3.
0374     Intercept an ssch instruction.
0375 Q4.
0376     Write the guest channel program and ORB to the I/O region.
0377 
0378     K1.
0379         Copy from guest to kernel.
0380     K2.
0381         Translate the guest channel program to a host kernel space
0382         channel program, which becomes runnable for a real device.
0383     K3.
0384         With the necessary information contained in the orb passed in
0385         by QEMU, issue the ccwchain to the device.
0386     K4.
0387         Return the ssch CC code.
0388 Q5.
0389     Return the CC code to the guest.
0390 
0391 ... ...
0392 
0393     K5.
0394         Interrupt handler gets the I/O result and writes the result to
0395         the I/O region.
0396     K6.
0397         Signal QEMU to retrieve the result.
0398 
0399 Q6.
0400     Get the signal; the event handler reads out the result from the I/O
0401     region.
0402 Q7.
0403     Update the irb for the guest.
0404 
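The asynchronous part of this flow (Q2, K6 and Q6) relies on an eventfd. The sketch below only demonstrates the eventfd mechanics on Linux and rests on several assumptions: the function names are hypothetical, the registration of the eventfd with the vfio device is omitted, and ``completion_signal`` stands in for the kernel side, which in reality signals the eventfd itself.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Q2: create the eventfd that will later be handed to the vfio device. */
int completion_fd_create(void)
{
    return eventfd(0, EFD_CLOEXEC);
}

/* Stand-in for K6: add one to the eventfd counter to signal completion. */
int completion_signal(int fd)
{
    uint64_t one = 1;

    assert(fd >= 0);
    return write(fd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Q6: block until at least one completion was signaled. Returns the
 * number of signals accumulated since the last read, or -1 on error.
 * The caller would then read the I/O region to fetch the result.
 */
int64_t completion_wait(int fd)
{
    uint64_t count;

    assert(fd >= 0);
    if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count))
        return -1;
    return (int64_t)count;
}
```

Because the eventfd counter accumulates, several signals that arrive before the reader runs are reported by a single read.
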
0405 Limitations
0406 -----------
0407 
0408 The current vfio-ccw implementation focuses on supporting the basic
0409 commands needed to implement block device functionality (read/write) of
0410 DASD/ECKD devices only. Some commands may need special handling in the
0411 future, for example, anything related to path grouping.
0412 
0413 DASD is a kind of storage device, while ECKD is a data recording format.
0414 More information on DASD and ECKD can be found here:
0415 https://en.wikipedia.org/wiki/Direct-access_storage_device
0416 https://en.wikipedia.org/wiki/Count_key_data
0417 
0418 Together with the corresponding work in QEMU, we can now bring the
0419 passed-through DASD/ECKD device online in a guest and use it as a block
0420 device.
0421 
0422 The current code allows the guest to start channel programs via
0423 START SUBCHANNEL, and to issue HALT SUBCHANNEL, CLEAR SUBCHANNEL,
0424 and STORE SUBCHANNEL.
0425 
0426 Currently all channel programs are prefetched, regardless of the
0427 p-bit setting in the ORB.  As a result, self-modifying channel
0428 programs are not supported.  For this reason, IPL has to be handled as
0429 a special case by a userspace/guest program; this has been implemented
0430 in QEMU's s390-ccw bios as of QEMU 4.1.
0431 
0432 vfio-ccw supports classic (command mode) channel I/O only. Transport
0433 mode (HPF) is not supported.
0434 
0435 QDIO subchannels are currently not supported. Classic devices other than
0436 DASD/ECKD might work, but have not been tested.
0437 
0438 Reference
0439 ---------
0440 1. ESA/s390 Principles of Operation manual (IBM Form. No. SA22-7832)
0441 2. ESA/390 Common I/O Device Commands manual (IBM Form. No. SA22-7204)
0442 3. https://en.wikipedia.org/wiki/Channel_I/O
0443 4. Documentation/s390/cds.rst
0444 5. Documentation/driver-api/vfio.rst
0445 6. Documentation/driver-api/vfio-mediated-device.rst