.. SPDX-License-Identifier: GPL-2.0

VMbus
=====
VMbus is a software construct provided by Hyper-V to guest VMs. It
consists of a control path and common facilities used by synthetic
devices that Hyper-V presents to guest VMs. The control path is
used to offer synthetic devices to the guest VM and, in some cases,
to rescind those devices. The common facilities include software
channels for communicating between the device driver in the guest VM
and the synthetic device implementation that is part of Hyper-V, and
signaling primitives to allow Hyper-V and the guest to interrupt
each other.

VMbus is modeled in Linux as a bus, with the expected /sys/bus/vmbus
entry in a running Linux guest. The VMbus driver (drivers/hv/vmbus_drv.c)
establishes the VMbus control path with the Hyper-V host, then
registers itself as a Linux bus driver. It implements the standard
bus functions for adding and removing devices to/from the bus.

Most synthetic devices offered by Hyper-V have a corresponding Linux
device driver. These devices include:

* SCSI controller
* NIC
* Graphics frame buffer
* Keyboard
* Mouse
* PCI device pass-thru
* Heartbeat
* Time Sync
* Shutdown
* Memory balloon
* Key/Value Pair (KVP) exchange with Hyper-V
* Hyper-V online backup (a.k.a. VSS)

Guest VMs may have multiple instances of the synthetic SCSI
controller, synthetic NIC, and PCI pass-thru devices. Other
synthetic devices are limited to a single instance per VM. Not
listed above are a small number of synthetic devices offered by
Hyper-V that are used only by Windows guests and for which Linux
does not have a driver.

Hyper-V uses the terms "VSP" and "VSC" in describing synthetic
devices. "VSP" refers to the Hyper-V code that implements a
particular synthetic device, while "VSC" refers to the driver for
the device in the guest VM. For example, the Linux driver for the
synthetic NIC is referred to as "netvsc" and the Linux driver for
the synthetic SCSI controller is "storvsc". These drivers contain
functions with names like "storvsc_connect_to_vsp".

VMbus channels
--------------
An instance of a synthetic device uses VMbus channels to communicate
between the VSP and the VSC. Channels are bi-directional and used
for passing messages. Most synthetic devices use a single channel,
but the synthetic SCSI controller and synthetic NIC may use multiple
channels to achieve higher performance and greater parallelism.

Each channel consists of two ring buffers. These are classic ring
buffers from a university data structures textbook. If the read
and write pointers are equal, the ring buffer is considered to be
empty, so a full ring buffer always has at least one byte unused.
The "in" ring buffer is for messages from the Hyper-V host to the
guest, and the "out" ring buffer is for messages from the guest to
the Hyper-V host. In Linux, the "in" and "out" designations are as
viewed by the guest side. The ring buffers are memory that is
shared between the guest and the host, and they follow the standard
paradigm where the memory is allocated by the guest, with the list
of GPAs that make up the ring buffer communicated to the host. Each
ring buffer consists of a header page (4 Kbytes) with the read and
write indices and some control flags, followed by the memory for the
actual ring. The size of the ring is determined by the VSC in the
guest and is specific to each synthetic device. The list of GPAs
making up the ring is communicated to the Hyper-V host over the
VMbus control path as a GPA Descriptor List (GPADL). See function
vmbus_establish_gpadl().
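
The empty/full convention can be illustrated with a few lines of C.
This is a minimal user-space model under stated assumptions, not the
kernel's implementation: the names ring_model, ring_bytes_to_read()
and ring_bytes_to_write() are invented for this sketch, and holding
back exactly one byte is the textbook rule rather than a statement
about the granularity Hyper-V uses.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the two indices kept in the ring header page. */
struct ring_model {
	uint32_t read_index;   /* next byte the consumer will read */
	uint32_t write_index;  /* next byte the producer will write */
	uint32_t size;         /* bytes in the data area of the ring */
};

/* Bytes available to the consumer. Equal indices mean "empty". */
static uint32_t ring_bytes_to_read(const struct ring_model *r)
{
	assert(r->read_index < r->size && r->write_index < r->size);
	if (r->write_index >= r->read_index)
		return r->write_index - r->read_index;
	return r->size - r->read_index + r->write_index;
}

/*
 * Bytes the producer may write. One byte is always held back so that
 * a full ring never makes the indices equal, which would read as empty.
 */
static uint32_t ring_bytes_to_write(const struct ring_model *r)
{
	return r->size - ring_bytes_to_read(r) - 1;
}
```

With a 16-byte ring and equal indices, the model reports 0 bytes to
read and 15 bytes to write, which is the "one byte unused" rule.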

Each ring buffer is mapped into contiguous Linux kernel virtual
space in three parts: 1) the 4 Kbyte header page, 2) the memory
that makes up the ring itself, and 3) a second mapping of the memory
that makes up the ring itself. Because (2) and (3) are contiguous
in kernel virtual space, the code that copies data to and from the
ring buffer need not be concerned with ring buffer wrap-around.
Once a copy operation has completed, the read or write index may
need to be reset to point back into the first mapping, but the
actual data copy does not need to be broken into two parts. This
approach also allows complex data structures to be easily accessed
directly in the ring without handling wrap-around.
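
The double mapping can be imitated in user space with mmap(). The
sketch below is an analogy only: the kernel builds its mapping by
other means, the backing store here is a temporary file rather than
guest pages shared with the host, and the helper names
map_ring_twice() and ring_copy_in() are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map the same `size` bytes twice, back to back, so that base[i] and
 * base[i + size] alias the same memory. `size` must be a multiple of
 * the page size. Returns NULL on failure.
 */
static unsigned char *map_ring_twice(size_t size)
{
	unsigned char *base;
	FILE *backing = tmpfile();      /* stand-in for the ring pages */

	if (!backing)
		return NULL;
	if (ftruncate(fileno(backing), (off_t)size) != 0) {
		fclose(backing);
		return NULL;
	}
	/* Reserve 2 * size bytes of contiguous virtual address space. */
	base = mmap(NULL, 2 * size, PROT_NONE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED) {
		fclose(backing);
		return NULL;
	}
	/* Place the first and second mappings of the same pages. */
	if (mmap(base, size, PROT_READ | PROT_WRITE,
		 MAP_SHARED | MAP_FIXED, fileno(backing), 0) == MAP_FAILED ||
	    mmap(base + size, size, PROT_READ | PROT_WRITE,
		 MAP_SHARED | MAP_FIXED, fileno(backing), 0) == MAP_FAILED) {
		munmap(base, 2 * size);
		fclose(backing);
		return NULL;
	}
	fclose(backing);                /* the mappings keep the memory alive */
	return base;
}

/* Copy `len` bytes in at `index` with one memcpy(); return the new index. */
static size_t ring_copy_in(unsigned char *base, size_t size, size_t index,
			   const void *src, size_t len)
{
	assert(index < size && len <= size);
	memcpy(base + index, src, len); /* may run into the second mapping */
	return (index + len) % size;    /* fold back into the first mapping */
}
```

A copy that starts two bytes before the end of the ring lands partly
in the second mapping, yet the trailing bytes are visible at the start
of the first mapping, and only the index needs to be folded back.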

On arm64 with page sizes > 4 Kbytes, the header page must still be
passed to Hyper-V as a 4 Kbyte area. But the memory for the actual
ring must be aligned to PAGE_SIZE and have a size that is a multiple
of PAGE_SIZE so that the duplicate mapping trick can be done. Hence
a portion of the header page is unused and not communicated to
Hyper-V. This case is handled by vmbus_establish_gpadl().

Hyper-V enforces a limit on the aggregate amount of guest memory
that can be shared with the host via GPADLs. This limit ensures
that a rogue guest can't force the consumption of excessive host
resources. For Windows Server 2019 and later, this limit is
approximately 1280 Mbytes. For versions prior to Windows Server
2019, the limit is approximately 384 Mbytes.

VMbus messages
--------------
All VMbus messages have a standard header that includes the message
length, the offset of the message payload, some flags, and a
transactionID. The portion of the message after the header is
unique to each VSP/VSC pair.
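
As a rough illustration, the header can be modeled by the struct
below. The layout is modeled on struct vmpacket_descriptor in
include/linux/hyperv.h, but it is reproduced here as an assumption,
including the extra type field and the convention that offset and
length are counted in 8-byte units. Consult the header file for the
authoritative definition. The names pkt_header_sketch and
pkt_payload() are invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the standard message header (assumed layout). */
struct pkt_header_sketch {
	uint16_t type;      /* packet type */
	uint16_t offset8;   /* offset of the payload, in 8-byte units */
	uint16_t len8;      /* total message length, in 8-byte units */
	uint16_t flags;     /* control flags */
	uint64_t trans_id;  /* transactionID chosen by the sender */
};

/* Return a pointer to the VSP/VSC-specific payload and its byte length. */
static const void *pkt_payload(const struct pkt_header_sketch *hdr,
			       size_t *payload_len)
{
	size_t offset = (size_t)hdr->offset8 * 8;
	size_t total = (size_t)hdr->len8 * 8;

	assert(offset >= sizeof(*hdr) && offset <= total);
	*payload_len = total - offset;
	return (const unsigned char *)hdr + offset;
}
```

Under these assumptions, a message with offset8 = 2 and len8 = 4 has a
16-byte header followed by a 16-byte payload.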

Messages follow one of two patterns:

* Unidirectional: Either side sends a message and does not
  expect a response message
* Request/response: One side (usually the guest) sends a message
  and expects a response

The transactionID (a.k.a. "requestID") is for matching requests and
responses. Some synthetic devices allow multiple requests to be
in-flight simultaneously, so the guest specifies a transactionID when
sending a request. Hyper-V sends back the same transactionID in the
matching response.
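
One way to picture the matching is a small table of in-flight
requests indexed by transactionID. The sketch below is a simplified
model, not the scheme any particular VSC uses. The table size, the
choice of slot number plus one as the ID, and the names
request_table, request_start() and request_complete() are all
assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_IN_FLIGHT 8            /* arbitrary limit for this sketch */

struct request_table {
	void *context[MAX_IN_FLIGHT];  /* NULL means the slot is free */
};

/* Record a request; returns the transactionID to send, or 0 if full. */
static uint64_t request_start(struct request_table *t, void *context)
{
	assert(context != NULL);
	for (size_t i = 0; i < MAX_IN_FLIGHT; i++) {
		if (!t->context[i]) {
			t->context[i] = context;
			return i + 1;          /* 0 is reserved for "no ID" */
		}
	}
	return 0;
}

/* Look up the request that a response belongs to and free its slot. */
static void *request_complete(struct request_table *t, uint64_t trans_id)
{
	void *context;

	if (trans_id == 0 || trans_id > MAX_IN_FLIGHT)
		return NULL;               /* ID was never issued */
	context = t->context[trans_id - 1];
	t->context[trans_id - 1] = NULL;
	return context;
}
```

Because responses may arrive in any order, the lookup is by ID rather
than by position in a queue. An ID that is out of range or already
completed yields NULL, so a stale or bogus response is rejected.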

Messages passed between the VSP and VSC are control messages. For
example, a message sent from the storvsc driver might be "execute
this SCSI command". If a message also implies some data transfer
between the guest and the Hyper-V host, the actual data to be
transferred may be embedded with the control message, or it may be
specified as a separate data buffer that the Hyper-V host will
access as a DMA operation. The former case is used when the size of
the data is small and the cost of copying the data to and from the
ring buffer is minimal. For example, time sync messages from the
Hyper-V host to the guest contain the actual time value. When the
data is larger, a separate data buffer is used. In this case, the
control message contains a list of GPAs that describe the data
buffer. For example, the storvsc driver uses this approach to
specify the data buffers to/from which disk I/O is done.

Three functions exist to send VMbus messages:

1. vmbus_sendpacket(): Control-only messages and messages with
   embedded data -- no GPAs
2. vmbus_sendpacket_pagebuffer(): Message with list of GPAs
   identifying data to transfer. An offset and length are
   associated with each GPA so that multiple discontinuous areas
   of guest memory can be targeted.
3. vmbus_sendpacket_mpb_desc(): Message with list of GPAs
   identifying data to transfer. A single offset and length are
   associated with a list of GPAs. The GPAs must describe a
   single logical area of guest memory to be targeted.

Historically, Linux guests have trusted Hyper-V to send well-formed
and valid messages, and Linux drivers for synthetic devices did not
fully validate messages. With the introduction of processor
technologies that fully encrypt guest memory and that allow the
guest to not trust the hypervisor (AMD SEV-SNP, Intel TDX), trusting
the Hyper-V host is no longer a valid assumption. The drivers for
VMbus synthetic devices are being updated to fully validate any
values read from memory that is shared with Hyper-V, which includes
messages from VMbus devices. To facilitate such validation,
messages read by the guest from the "in" ring buffer are copied to a
temporary buffer that is not shared with Hyper-V. Validation is
performed in this temporary buffer without the risk of Hyper-V
maliciously modifying the message after it is validated but before
it is used.
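
The copy-then-validate pattern can be sketched as follows. The
message layout (demo_msg), the size cap, and the function name
recv_validated() are hypothetical and chosen only to keep the
example small. Real drivers validate device-specific fields.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_MSG_BYTES 256            /* arbitrary cap for this sketch */

/* Hypothetical message: a length field followed by that many bytes. */
struct demo_msg {
	uint32_t data_len;
	uint8_t data[MAX_MSG_BYTES - sizeof(uint32_t)];
};

/*
 * Copy a message out of shared memory, then validate the private copy.
 * Returns 0 on success, -1 if the message is malformed.
 */
static int recv_validated(const void *shared, size_t avail,
			  struct demo_msg *priv)
{
	assert(shared && priv);
	if (avail < sizeof(priv->data_len) || avail > sizeof(*priv))
		return -1;
	/* Single fetch: after this, the host cannot change what is checked. */
	memcpy(priv, shared, avail);
	if (priv->data_len > avail - sizeof(priv->data_len))
		return -1;
	return 0;                        /* callers use only *priv from here on */
}
```

The point of the design is that every check and every later use reads
the private copy, so a value cannot change between validation and use.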

VMbus interrupts
----------------
VMbus provides a mechanism for the guest to interrupt the host when
the guest has queued new messages in a ring buffer. The host
expects that the guest will send an interrupt only when an "out"
ring buffer transitions from empty to non-empty. If the guest sends
interrupts at other times, the host deems such interrupts to be
unnecessary. If a guest sends an excessive number of unnecessary
interrupts, the host may throttle that guest by suspending its
execution for a few seconds to prevent a denial-of-service attack.
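
The "signal only on the empty to non-empty transition" rule reduces
to comparing the indices before the write. The model below is a
sketch under simplifying assumptions: it omits the space check, the
data copy, and the memory barriers that real shared-memory code
needs, and the names out_ring_model and queue_and_check_signal() are
invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct out_ring_model {
	uint32_t read_index;    /* advanced by the host as it consumes */
	uint32_t write_index;   /* advanced by the guest as it produces */
	uint32_t size;
};

/*
 * Queue `len` bytes and report whether the host needs an interrupt:
 * only if the ring was empty before this write.
 */
static bool queue_and_check_signal(struct out_ring_model *r, uint32_t len)
{
	bool was_empty = (r->read_index == r->write_index);

	assert(len > 0 && len < r->size);
	r->write_index = (r->write_index + len) % r->size;
	return was_empty;
}
```

The first message queued into an empty ring asks for an interrupt. A
second message queued before the host has drained the ring does not,
because the host is already expected to be processing the ring.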

Similarly, the host will interrupt the guest when it sends a new
message on the VMbus control path, or when a VMbus channel "in" ring
buffer transitions from empty to non-empty. Each CPU in the guest
may receive VMbus interrupts, so they are best modeled as per-CPU
interrupts in Linux. This model works well on arm64 where a single
per-CPU IRQ is allocated for VMbus. Since x86/x64 lacks support for
per-CPU IRQs, an x86 interrupt vector is statically allocated (see
HYPERVISOR_CALLBACK_VECTOR) across all CPUs and explicitly coded to
call the VMbus interrupt service routine. These interrupts are
visible in /proc/interrupts on the "HYP" line.

The guest CPU that a VMbus channel will interrupt is selected by the
guest when the channel is created, and the host is informed of that
selection. VMbus devices are broadly grouped into two categories:

1. "Slow" devices that need only one VMbus channel. The devices
   (such as keyboard, mouse, heartbeat, and timesync) generate
   relatively few interrupts. Their VMbus channels are all
   assigned to interrupt the VMBUS_CONNECT_CPU, which is always
   CPU 0.

2. "High speed" devices that may use multiple VMbus channels for
   higher parallelism and performance. These devices include the
   synthetic SCSI controller and synthetic NIC. Their VMbus
   channel interrupts are assigned to CPUs that are spread out
   among the available CPUs in the VM so that interrupts on
   multiple channels can be processed in parallel.

The assignment of VMbus channel interrupts to CPUs is done in the
function init_vp_index(). This assignment is done outside of the
normal Linux interrupt affinity mechanism, so the interrupts are
neither "unmanaged" nor "managed" interrupts.

The CPU that a VMbus channel will interrupt can be seen in
/sys/bus/vmbus/devices/<deviceGUID>/channels/<channelRelID>/cpu.
When running on later versions of Hyper-V, the CPU can be changed
by writing a new value to this sysfs entry. Because the interrupt
assignment is done outside of the normal Linux affinity mechanism,
there are no entries in /proc/irq corresponding to individual
VMbus channel interrupts.
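
A user-space tool could read the assignment with ordinary file I/O.
The helpers below are a sketch: channel_cpu_path() and
read_cpu_attr() are hypothetical names, the attribute is assumed to
hold a single decimal CPU number, and the device GUID and channel
RelID must be taken from a running guest.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the sysfs path of a channel's "cpu" attribute.
 * Returns 0 on success, -1 if the inputs are unusable or the
 * result does not fit in `buf`.
 */
static int channel_cpu_path(char *buf, size_t buflen,
			    const char *device_guid, unsigned int relid)
{
	int n;

	assert(buf && device_guid);
	if (strchr(device_guid, '/'))   /* a GUID never contains a slash */
		return -1;
	n = snprintf(buf, buflen, "/sys/bus/vmbus/devices/%s/channels/%u/cpu",
		     device_guid, relid);
	return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}

/* Read the CPU number from `path`; returns -1 if it cannot be read. */
static int read_cpu_attr(const char *path)
{
	int cpu = -1;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (fscanf(f, "%d", &cpu) != 1)
		cpu = -1;
	fclose(f);
	return cpu;
}
```

Changing the assignment would be the mirror image: open the same path
for writing and write the new CPU number, subject to the Hyper-V
version requirement noted above.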

An online CPU in a Linux guest may not be taken offline if it has
VMbus channel interrupts assigned to it. Any such channel
interrupts must first be manually reassigned to another CPU as
described above. When no channel interrupts are assigned to the
CPU, it can be taken offline.

When a guest CPU receives a VMbus interrupt from the host, the
function vmbus_isr() handles the interrupt. It first checks for
channel interrupts by calling vmbus_chan_sched(), which looks at a
bitmap set up by the host to determine which channels have pending
interrupts on this CPU. If multiple channels have pending
interrupts for this CPU, they are processed sequentially. When all
channel interrupts have been processed, vmbus_isr() checks for and
processes any message received on the VMbus control path.
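
The sequential scan of the pending bitmap can be modeled as below.
This is a plain C sketch, not the code in vmbus_chan_sched(): the
bitmap layout, the use of the bit number as a channel ID, and the
names scan_pending() and note_highest() are assumptions, and the
non-atomic bit clearing would not be safe against a concurrent host.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/*
 * Walk a pending-interrupt bitmap, clearing each set bit and handing
 * the bit number (standing in for a channel ID) to `handle`, one
 * channel at a time. Returns the number of channels processed.
 */
static size_t scan_pending(unsigned long *bitmap, size_t nwords,
			   void (*handle)(size_t chan_id, void *arg),
			   void *arg)
{
	size_t handled = 0;

	assert(bitmap && handle);
	for (size_t w = 0; w < nwords; w++) {
		for (size_t b = 0; b < BITS_PER_WORD; b++) {
			unsigned long mask = 1UL << b;

			if (!(bitmap[w] & mask))
				continue;
			bitmap[w] &= ~mask; /* real code must clear atomically */
			handle(w * BITS_PER_WORD + b, arg);
			handled++;
		}
	}
	return handled;
}

/* Example handler: remember the highest channel ID seen. */
static void note_highest(size_t chan_id, void *arg)
{
	size_t *highest = arg;

	if (chan_id > *highest)
		*highest = chan_id;
}
```

Each pending channel is handled exactly once per scan, in bit order,
which mirrors the sequential processing described above.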

The VMbus channel interrupt handling code is designed to work
correctly even if an interrupt is received on a CPU other than the
CPU assigned to the channel. Specifically, the code does not use
CPU-based exclusion for correctness. In normal operation, Hyper-V
will interrupt the assigned CPU. But when the CPU assigned to a
channel is being changed via sysfs, the guest doesn't know exactly
when Hyper-V will make the transition. The code must work correctly
even if there is a time lag before Hyper-V starts interrupting the
new CPU. See comments in target_cpu_store().

VMbus device creation/deletion
------------------------------
Hyper-V and the Linux guest have a separate message-passing path
that is used for synthetic device creation and deletion. This
path does not use a VMbus channel. See vmbus_post_msg() and
vmbus_on_msg_dpc().

The first step is for the guest to connect to the generic
Hyper-V VMbus mechanism. As part of establishing this connection,
the guest and Hyper-V agree on a VMbus protocol version they will
use. This negotiation allows newer Linux kernels to run on older
Hyper-V versions, and vice versa.
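
One plausible shape for such a negotiation is for the guest to offer
versions from newest to oldest until the host accepts one. The sketch
below assumes that shape. The version numbers are made up, and
host_accepts stands in for the real message exchange with Hyper-V.
The names negotiate_version() and old_host_accepts() are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Offer protocol versions from newest to oldest until the host
 * accepts one. Returns the agreed version, or 0 if none is accepted.
 */
static uint32_t negotiate_version(const uint32_t *newest_first, size_t count,
				  bool (*host_accepts)(uint32_t version))
{
	assert(newest_first && host_accepts);
	for (size_t i = 0; i < count; i++) {
		if (host_accepts(newest_first[i]))
			return newest_first[i];
	}
	return 0;
}

/* Example host that understands nothing newer than version 3. */
static bool old_host_accepts(uint32_t version)
{
	return version <= 3;
}
```

A guest that knows versions 5, 4, 3 and 2 settles on 3 with this
example host. A guest that knows only 5 and 4 fails to connect.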

The guest then tells Hyper-V to "send offers". Hyper-V sends an
offer message to the guest for each synthetic device that the VM
is configured to have. Each VMbus device type has a fixed GUID
known as the "class ID", and each VMbus device instance is also
identified by a GUID. The offer message from Hyper-V contains
both GUIDs to uniquely (within the VM) identify the device.
There is one offer message for each device instance, so a VM with
two synthetic NICs will get two offer messages with the NIC
class ID. The ordering of offer messages can vary from boot to boot
and must not be assumed to be consistent in Linux code. Offer
messages may also arrive long after Linux has initially booted
because Hyper-V supports adding devices, such as synthetic NICs,
to running VMs. A new offer message is processed by
vmbus_process_offer(), which indirectly invokes vmbus_add_channel_work().

Upon receipt of an offer message, the guest identifies the device
type based on the class ID, and invokes the correct driver to set up
the device. Driver/device matching is performed using the standard
Linux mechanism.
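
Matching on the class ID amounts to a table lookup by GUID. The
sketch below is a user-space model only, not the Linux driver core.
The types guid_sketch and driver_entry, the function match_driver(),
and the placeholder GUID values used with it are invented for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct guid_sketch {
	unsigned char b[16];
};

/* Hypothetical driver entry: the class ID it handles and its name. */
struct driver_entry {
	struct guid_sketch class_id;
	const char *name;
};

/* Return the driver whose class ID matches the offer, or NULL if none. */
static const struct driver_entry *
match_driver(const struct driver_entry *drivers, size_t count,
	     const struct guid_sketch *offer_class_id)
{
	assert(drivers && offer_class_id);
	for (size_t i = 0; i < count; i++) {
		if (memcmp(&drivers[i].class_id, offer_class_id,
			   sizeof(*offer_class_id)) == 0)
			return &drivers[i];
	}
	return NULL;
}
```

Two offers carrying the same class ID resolve to the same driver
entry, and the instance GUID in each offer is what tells the two
devices apart.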

The device driver probe function opens the primary VMbus channel to
the corresponding VSP. It allocates guest memory for the channel
ring buffers and shares the ring buffer with the Hyper-V host by
giving the host a list of GPAs for the ring buffer memory. See
vmbus_establish_gpadl().

Once the ring buffer is set up, the device driver and VSP exchange
setup messages via the primary channel. These messages may include
negotiating the device protocol version to be used between the Linux
VSC and the VSP on the Hyper-V host. The setup messages may also
include creating additional VMbus channels, which are somewhat
misnamed as "sub-channels" since they are functionally
equivalent to the primary channel once they are created.

Finally, the device driver may create entries in /dev as with
any device driver.

The Hyper-V host can send a "rescind" message to the guest to
remove a device that was previously offered. Linux drivers must
handle such a rescind message at any time. Rescinding a device
invokes the device driver "remove" function to cleanly shut
down the device and remove it. Once a synthetic device is
rescinded, neither Hyper-V nor Linux retains any state about
its previous existence. Such a device might be re-added later,
in which case it is treated as an entirely new device. See
vmbus_onoffer_rescind().