.. SPDX-License-Identifier: GPL-2.0

=================
KVM VCPU Requests
=================

Overview
========

KVM supports an internal API enabling threads to request a VCPU thread to
perform some activity. For example, a thread may request a VCPU to flush
its TLB with a VCPU request. The API consists of the following functions::

  /* Check if any requests are pending for VCPU @vcpu. */
  bool kvm_request_pending(struct kvm_vcpu *vcpu);

  /* Check if VCPU @vcpu has request @req pending. */
  bool kvm_test_request(int req, struct kvm_vcpu *vcpu);

  /* Clear request @req for VCPU @vcpu. */
  void kvm_clear_request(int req, struct kvm_vcpu *vcpu);

  /*
   * Check if VCPU @vcpu has request @req pending. When the request is
   * pending it will be cleared and a memory barrier, which pairs with
   * another in kvm_make_request(), will be issued.
   */
  bool kvm_check_request(int req, struct kvm_vcpu *vcpu);

  /*
   * Make request @req of VCPU @vcpu. Issues a memory barrier, which pairs
   * with another in kvm_check_request(), prior to setting the request.
   */
  void kvm_make_request(int req, struct kvm_vcpu *vcpu);

  /* Make request @req of all VCPUs of the VM with struct kvm @kvm. */
  bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req);

Typically a requester wants the VCPU to perform the activity as soon
as possible after making the request. This means most requests
(kvm_make_request() calls) are followed by a call to kvm_vcpu_kick(),
and kvm_make_all_cpus_request() has the kicking of all VCPUs built
into it.
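The request-then-kick sequence can be modelled in user space. The following
is a minimal sketch, assuming C11 atomics in place of the kernel's bitops.
``struct model_vcpu``, ``MODEL_REQ_TLB_FLUSH`` and all ``model_*()``
functions are hypothetical names invented for this illustration; they are
not KVM's implementation, and the kick here only counts calls.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical request number, chosen only for this sketch. */
#define MODEL_REQ_TLB_FLUSH 0u

/* User-space stand-in for the parts of struct kvm_vcpu used here. */
struct model_vcpu {
	atomic_ulong requests;	/* models the vcpu->requests bitmap */
	unsigned int kicks;	/* counts kicks instead of sending IPIs */
};

static void model_vcpu_init(struct model_vcpu *vcpu)
{
	atomic_init(&vcpu->requests, 0);
	vcpu->kicks = 0;
}

/* Models kvm_make_request(): set the request's bit. */
static void model_make_request(unsigned int req, struct model_vcpu *vcpu)
{
	atomic_fetch_or(&vcpu->requests, 1ul << req);
}

/* Models kvm_request_pending(): is any request bit set? */
static bool model_request_pending(struct model_vcpu *vcpu)
{
	return atomic_load(&vcpu->requests) != 0;
}

/* Models kvm_check_request(): test one request and clear it if set. */
static bool model_check_request(unsigned int req, struct model_vcpu *vcpu)
{
	unsigned long bit = 1ul << req;

	return (atomic_fetch_and(&vcpu->requests, ~bit) & bit) != 0;
}

/* Models kvm_vcpu_kick(); a real kick may send an IPI or wake the VCPU. */
static void model_vcpu_kick(struct model_vcpu *vcpu)
{
	vcpu->kicks++;
}

/* The usual requester sequence: make the request, then kick. */
static void model_request_tlb_flush(struct model_vcpu *vcpu)
{
	model_make_request(MODEL_REQ_TLB_FLUSH, vcpu);
	model_vcpu_kick(vcpu);
}
```

In this model the receiving side would call ``model_check_request()`` from
its run loop, which both observes and consumes the request.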

VCPU Kicks
----------

The goal of a VCPU kick is to bring a VCPU thread out of guest mode in
order to perform some KVM maintenance. To do so, an IPI is sent, forcing
a guest mode exit. However, a VCPU thread may not be in guest mode at the
time of the kick. Therefore, depending on the mode and state of the VCPU
thread, there are two other actions a kick may take. All three actions
are listed below:

1) Send an IPI. This forces a guest mode exit.
2) Wake a sleeping VCPU. Sleeping VCPUs are VCPU threads outside guest
   mode that wait on waitqueues. Waking them removes the threads from
   the waitqueues, allowing the threads to run again. This behavior
   may be suppressed; see KVM_REQUEST_NO_WAKEUP below.
3) Do nothing. When the VCPU is not in guest mode and the VCPU thread is
   not sleeping, there is nothing to do.
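The three outcomes can be summarised as a small decision function. This is
a simplified sketch: the enum, the function name and the three boolean
parameters are hypothetical, and the real kick derives this information
from ``vcpu->mode`` and the waitqueue state rather than from flags passed
in by the caller.

```c
#include <stdbool.h>

/* Hypothetical names for the three actions a kick may take. */
enum model_kick_action {
	MODEL_KICK_SEND_IPI,	/* 1) force a guest mode exit */
	MODEL_KICK_WAKE_UP,	/* 2) remove the thread from its waitqueue */
	MODEL_KICK_NOTHING,	/* 3) neither in guest mode nor sleeping */
};

/*
 * Pick the action for one VCPU. @no_wakeup models a request that carries
 * KVM_REQUEST_NO_WAKEUP, which suppresses waking a sleeping VCPU.
 */
static enum model_kick_action
model_kick_action(bool in_guest_mode, bool sleeping, bool no_wakeup)
{
	if (in_guest_mode)
		return MODEL_KICK_SEND_IPI;
	if (sleeping && !no_wakeup)
		return MODEL_KICK_WAKE_UP;
	return MODEL_KICK_NOTHING;
}
```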

VCPU Mode
---------

VCPUs have a mode state, ``vcpu->mode``, that is used to track whether the
VCPU is running in guest mode or not, as well as some specific
outside guest mode states. The architecture may use ``vcpu->mode`` to
ensure VCPU requests are seen by VCPUs (see "Ensuring Requests Are Seen"),
as well as to avoid sending unnecessary IPIs (see "IPI Reduction"), and
even to ensure IPI acknowledgements are waited upon (see "Waiting for
Acknowledgements"). The following modes are defined:

OUTSIDE_GUEST_MODE

  The VCPU thread is outside guest mode.

IN_GUEST_MODE

  The VCPU thread is in guest mode.

EXITING_GUEST_MODE

  The VCPU thread is transitioning from IN_GUEST_MODE to
  OUTSIDE_GUEST_MODE.

READING_SHADOW_PAGE_TABLES

  The VCPU thread is outside guest mode, but it wants the sender of
  certain VCPU requests, namely KVM_REQ_TLB_FLUSH, to wait until the VCPU
  thread is done reading the page tables.

VCPU Request Internals
======================

VCPU requests are simply bit indices of the ``vcpu->requests`` bitmap.
This means general bitops, like those documented in [atomic-ops]_, could
also be used, e.g. ::

  clear_bit(KVM_REQ_UNHALT & KVM_REQUEST_MASK, &vcpu->requests);

However, VCPU request users should refrain from doing so, as it would
break the abstraction. The first 8 bits are reserved for architecture
independent requests; all additional bits are available for architecture
dependent requests.

Architecture Independent Requests
---------------------------------

KVM_REQ_TLB_FLUSH

  KVM's common MMU notifier may need to flush all of a guest's TLB
  entries, calling kvm_flush_remote_tlbs() to do so. Architectures that
  choose to use the common kvm_flush_remote_tlbs() implementation will
  need to handle this VCPU request.

KVM_REQ_VM_DEAD

  This request informs all VCPUs that the VM is dead and unusable, e.g. due
  to a fatal error or because the VM's state has been intentionally
  destroyed.

KVM_REQ_UNBLOCK

  This request tells the vCPU to exit kvm_vcpu_block(). It is used, for
  example, from timer handlers that run on the host on behalf of a vCPU,
  or in order to update the interrupt routing and ensure that assigned
  devices will wake up the vCPU.

KVM_REQ_UNHALT

  This request may be made from the KVM common function kvm_vcpu_block(),
  which is used to emulate an instruction that causes a CPU to halt until
  one of an architecture-specific set of events and/or interrupts is
  received (determined by checking kvm_arch_vcpu_runnable()). When that
  event or interrupt arrives, kvm_vcpu_block() makes the request. This is
  in contrast to when kvm_vcpu_block() returns due to any other reason,
  such as a pending signal, which does not indicate the VCPU's halt
  emulation should stop, and therefore does not make the request.

KVM_REQ_OUTSIDE_GUEST_MODE

  This "request" ensures the target vCPU has exited guest mode prior to the
  sender of the request continuing on. No action needs to be taken by the
  target, and so no request is actually logged for the target. This request
  is similar to a "kick", but unlike a kick it guarantees the vCPU has
  actually exited guest mode. A kick only guarantees the vCPU will exit at
  some point in the future, e.g. a previous kick may have started the
  process, but there's no guarantee the to-be-kicked vCPU has fully exited
  guest mode.

KVM_REQUEST_MASK
----------------

VCPU requests should be masked by KVM_REQUEST_MASK before using them with
bitops. This is because only the lower 8 bits are used to represent the
request's number. The upper bits are used as flags. Currently only two
flags are defined.

VCPU Request Flags
------------------

KVM_REQUEST_NO_WAKEUP

  This flag is applied to requests that only need immediate attention
  from VCPUs running in guest mode. That is, sleeping VCPUs do not need
  to be awakened for these requests. Sleeping VCPUs will handle the
  requests when they are awakened later for some other reason.

KVM_REQUEST_WAIT

  When requests with this flag are made with kvm_make_all_cpus_request(),
  then the caller will wait for each VCPU to acknowledge its IPI before
  proceeding. This flag only applies to VCPUs that would receive IPIs.
  If, for example, the VCPU is sleeping, so no IPI is necessary, then
  the requesting thread does not wait. This means that this flag may be
  safely combined with KVM_REQUEST_NO_WAKEUP. See "Waiting for
  Acknowledgements" for more information about requests with
  KVM_REQUEST_WAIT.
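The split between request number and flags can be illustrated with a small
user-space sketch. The ``MODEL_*`` macros and helpers below are hypothetical
names for this example. The 8-bit mask follows the text above, but the exact
bit positions chosen for the two flags (bits 8 and 9) are an assumption of
the sketch, as is the example request number 5.

```c
#include <stdbool.h>

/* Lower 8 bits hold the request number, as described above. */
#define MODEL_REQUEST_MASK	0xffu
/* Assumed flag positions in the upper bits, for illustration only. */
#define MODEL_REQUEST_NO_WAKEUP	(1u << 8)
#define MODEL_REQUEST_WAIT	(1u << 9)

/* Hypothetical request: number 5, carrying both flags. */
#define MODEL_REQ_EXAMPLE \
	(5u | MODEL_REQUEST_WAIT | MODEL_REQUEST_NO_WAKEUP)

/* Strip the flags, leaving the bit index into the requests bitmap. */
static unsigned int model_request_nr(unsigned int req)
{
	return req & MODEL_REQUEST_MASK;
}

/* The bitmap bit that represents this request. */
static unsigned long model_request_bit(unsigned int req)
{
	return 1ul << model_request_nr(req);
}

/* Should a sleeping VCPU be woken for this request? */
static bool model_request_wakes_sleepers(unsigned int req)
{
	return (req & MODEL_REQUEST_NO_WAKEUP) == 0;
}

/* Does the requester wait for IPI acknowledgements? */
static bool model_request_waits(unsigned int req)
{
	return (req & MODEL_REQUEST_WAIT) != 0;
}
```

Passing an unmasked value such as ``MODEL_REQ_EXAMPLE`` directly to a bitop
would address a bit far outside the intended range, which is why the mask
must be applied first.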

VCPU Requests with Associated State
===================================

Requesters that want the receiving VCPU to handle new state need to ensure
the newly written state is observable to the receiving VCPU thread's CPU
by the time it observes the request. This means a write memory barrier
must be inserted after writing the new state and before setting the VCPU
request bit. Additionally, on the receiving VCPU thread's side, a
corresponding read barrier must be inserted after reading the request bit
and before proceeding to read the new state associated with it. See
scenario 3, Message and Flag, of [lwn-mb]_ and the kernel documentation
[memory-barriers]_.

The pair of functions, kvm_check_request() and kvm_make_request(), provide
the memory barriers, allowing this requirement to be handled internally by
the API.
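The ordering requirement can be sketched in user space with C11 fences. This
is only an analogue: a release fence stands in for the write barrier and an
acquire fence for the read barrier, which is an assumption of the sketch
rather than a statement about the barriers the kernel functions use.
``struct model_vcpu``, its ``new_state`` field and both functions are
hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* User-space stand-in; new_state is the data published with a request. */
struct model_vcpu {
	int new_state;
	atomic_ulong requests;
};

static void model_vcpu_init(struct model_vcpu *vcpu)
{
	vcpu->new_state = 0;
	atomic_init(&vcpu->requests, 0);
}

/* Requester: write the state, then the barrier, then set the request. */
static void model_make_request_with_state(struct model_vcpu *vcpu,
					   unsigned int req, int state)
{
	vcpu->new_state = state;
	/* Plays the role of the write barrier before setting the bit. */
	atomic_thread_fence(memory_order_release);
	atomic_fetch_or_explicit(&vcpu->requests, 1ul << req,
				 memory_order_relaxed);
}

/* Receiver: read the request bit, then the barrier, then the state. */
static bool model_check_request_with_state(struct model_vcpu *vcpu,
					    unsigned int req, int *state)
{
	unsigned long bit = 1ul << req;

	if (!(atomic_load_explicit(&vcpu->requests,
				   memory_order_relaxed) & bit))
		return false;
	atomic_fetch_and_explicit(&vcpu->requests, ~bit,
				  memory_order_relaxed);
	/* Plays the role of the read barrier before reading the state. */
	atomic_thread_fence(memory_order_acquire);
	*state = vcpu->new_state;
	return true;
}
```

Without the two fences, the receiver could observe the request bit and
still read a stale ``new_state``.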

Ensuring Requests Are Seen
==========================

When making requests to VCPUs, we want to avoid the receiving VCPU
executing in guest mode for an arbitrarily long time without handling the
request. We can be sure this won't happen as long as we ensure the VCPU
thread checks kvm_request_pending() before entering guest mode and that a
kick will send an IPI to force an exit from guest mode when necessary.
Extra care must be taken to cover the period after the VCPU thread's last
kvm_request_pending() check and before it has entered guest mode, as kick
IPIs will only trigger guest mode exits for VCPU threads that are in guest
mode or at least have already disabled interrupts in order to prepare to
enter guest mode. This means that an optimized implementation (see "IPI
Reduction") must be certain when it's safe to not send the IPI. One
solution, which all architectures except s390 apply, is to:

- set ``vcpu->mode`` to IN_GUEST_MODE between disabling the interrupts and
  the last kvm_request_pending() check;
- enable interrupts atomically when entering the guest.

This solution also requires memory barriers to be placed carefully in both
the requesting thread and the receiving VCPU. With the memory barriers we
can exclude the possibility of a VCPU thread observing
!kvm_request_pending() on its last check and then not receiving an IPI for
the next request made of it, even if the request is made immediately after
the check. This is done by way of the Dekker memory barrier pattern
(scenario 10 of [lwn-mb]_). As the Dekker pattern requires two variables,
this solution pairs ``vcpu->mode`` with ``vcpu->requests``. Substituting
them into the pattern gives::

  CPU1                                    CPU2
  =================                       =================
  local_irq_disable();
  WRITE_ONCE(vcpu->mode, IN_GUEST_MODE);  kvm_make_request(REQ, vcpu);
  smp_mb();                               smp_mb();
  if (kvm_request_pending(vcpu)) {        if (READ_ONCE(vcpu->mode) ==
                                              IN_GUEST_MODE) {
      ...abort guest entry...                 ...send IPI...
  }                                       }

As stated above, the IPI is only useful for VCPU threads in guest mode or
that have already disabled interrupts. This is why this specific case of
the Dekker pattern has been extended to disable interrupts before setting
``vcpu->mode`` to IN_GUEST_MODE. WRITE_ONCE() and READ_ONCE() are used to
pedantically implement the memory barrier pattern, guaranteeing the
compiler doesn't interfere with ``vcpu->mode``'s carefully planned
accesses.
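The two sides of the pattern can be modelled in user space as a pair of
functions. This is a sketch under stated assumptions: C11 sequentially
consistent fences stand in for smp_mb(), relaxed atomic accesses stand in
for WRITE_ONCE() and READ_ONCE(), and interrupt disabling has no user-space
equivalent, so it appears only as a comment. All ``model_*`` names are
hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum model_mode {
	MODEL_OUTSIDE_GUEST_MODE,
	MODEL_IN_GUEST_MODE,
};

struct model_vcpu {
	atomic_int mode;	/* models vcpu->mode */
	atomic_ulong requests;	/* models vcpu->requests */
};

static void model_vcpu_init(struct model_vcpu *vcpu)
{
	atomic_init(&vcpu->mode, MODEL_OUTSIDE_GUEST_MODE);
	atomic_init(&vcpu->requests, 0);
}

/* CPU1 side: returns true if guest entry may proceed. */
static bool model_vcpu_prepare_entry(struct model_vcpu *vcpu)
{
	/* The kernel would call local_irq_disable() at this point. */
	atomic_store_explicit(&vcpu->mode, MODEL_IN_GUEST_MODE,
			      memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb() */
	if (atomic_load_explicit(&vcpu->requests, memory_order_relaxed)) {
		/* Abort guest entry so the request can be handled. */
		atomic_store_explicit(&vcpu->mode, MODEL_OUTSIDE_GUEST_MODE,
				      memory_order_relaxed);
		return false;
	}
	return true;
}

/* CPU2 side: make the request; returns true if an IPI must be sent. */
static bool model_make_request_needs_ipi(struct model_vcpu *vcpu,
					  unsigned int req)
{
	atomic_fetch_or_explicit(&vcpu->requests, 1ul << req,
				 memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb() */
	return atomic_load_explicit(&vcpu->mode, memory_order_relaxed) ==
	       MODEL_IN_GUEST_MODE;
}
```

Because each side writes its own variable, issues a full fence and then
reads the other variable, at least one of the two sides must observe the
other's write: either the VCPU aborts entry or the requester sends the IPI.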

IPI Reduction
-------------

Since only one IPI is needed to get a VCPU to check for any/all requests,
IPIs may be coalesced. This is easily done by having the first
IPI-sending kick also change the VCPU mode to something !IN_GUEST_MODE.
The transitional state, EXITING_GUEST_MODE, is used for this purpose.
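One way to picture the coalescing is an atomic compare-and-exchange on the
mode: only the kick that wins the transition out of IN_GUEST_MODE sends the
IPI. The sketch below is a user-space model using C11
atomic_compare_exchange_strong(); the enum and ``model_kick_needs_ipi()``
are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum model_mode {
	MODEL_OUTSIDE_GUEST_MODE,
	MODEL_IN_GUEST_MODE,
	MODEL_EXITING_GUEST_MODE,
};

/*
 * Returns true only for the kick that moves the VCPU from IN_GUEST_MODE
 * to EXITING_GUEST_MODE. Later kicks see a different mode, fail the
 * exchange, and therefore send no further IPI.
 */
static bool model_kick_needs_ipi(atomic_int *mode)
{
	int expected = MODEL_IN_GUEST_MODE;

	return atomic_compare_exchange_strong(mode, &expected,
					      MODEL_EXITING_GUEST_MODE);
}
```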

Waiting for Acknowledgements
----------------------------

Some requests, those with the KVM_REQUEST_WAIT flag set, require IPIs to
be sent, and the acknowledgements to be waited upon, even when the target
VCPU threads are in modes other than IN_GUEST_MODE. For example, one case
is when a target VCPU thread is in READING_SHADOW_PAGE_TABLES mode, which
is set after disabling interrupts. To support these cases, the
KVM_REQUEST_WAIT flag changes the condition for sending an IPI from
checking that the VCPU is IN_GUEST_MODE to checking that it is not
OUTSIDE_GUEST_MODE.
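The change of condition can be written as a small predicate. This is a
simplified sketch that takes the mode as a plain value; the enum and
``model_should_send_ipi()`` are hypothetical names, and the atomic mode
transition a real kick performs is deliberately left out.

```c
#include <stdbool.h>

enum model_mode {
	MODEL_OUTSIDE_GUEST_MODE,
	MODEL_IN_GUEST_MODE,
	MODEL_EXITING_GUEST_MODE,
	MODEL_READING_SHADOW_PAGE_TABLES,
};

/*
 * @wait models a request that carries KVM_REQUEST_WAIT. With the flag,
 * any mode other than OUTSIDE_GUEST_MODE gets an IPI; without it, only
 * IN_GUEST_MODE does.
 */
static bool model_should_send_ipi(enum model_mode mode, bool wait)
{
	if (wait)
		return mode != MODEL_OUTSIDE_GUEST_MODE;
	return mode == MODEL_IN_GUEST_MODE;
}
```

A VCPU in READING_SHADOW_PAGE_TABLES mode therefore receives an IPI only
when the request carries the wait flag.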

Request-less VCPU Kicks
-----------------------

Because the determination of whether or not to send an IPI depends on the
two-variable Dekker memory barrier pattern, request-less VCPU kicks are
almost never correct. Without the assurance that a non-IPI generating
kick will still result in an action by the receiving VCPU, as the final
kvm_request_pending() check does for request-accompanying kicks, the kick
may not do anything useful at all. If, for instance, a request-less kick
is made to a VCPU that is just about to set its mode to IN_GUEST_MODE,
meaning no IPI is sent, then the VCPU thread may continue its entry
without actually having done whatever it was the kick was meant to
initiate.

One exception is x86's posted interrupt mechanism. In this case, however,
even the request-less VCPU kick is coupled with the same
local_irq_disable() + smp_mb() pattern described above; the ON bit
(Outstanding Notification) in the posted interrupt descriptor takes the
role of ``vcpu->requests``. When sending a posted interrupt, PIR.ON is
set before reading ``vcpu->mode``; dually, in the VCPU thread,
vmx_sync_pir_to_irr() reads PIR after setting ``vcpu->mode`` to
IN_GUEST_MODE.

Additional Considerations
=========================

Sleeping VCPUs
--------------

VCPU threads may need to consider requests before and/or after calling
functions that may put them to sleep, e.g. kvm_vcpu_block(). Whether they
do or not, and, if they do, which requests need consideration, is
architecture dependent. kvm_vcpu_block() calls kvm_arch_vcpu_runnable()
to check if it should awaken. One reason to do so is to provide
architectures a function where requests may be checked if necessary.

Clearing Requests
-----------------

Generally it only makes sense for the receiving VCPU thread to clear a
request. However, in some circumstances the requesting thread and the
receiving VCPU thread are executed serially, for example when they are
the same thread, or when they are using some form of concurrency control
to temporarily execute synchronously. In those cases it's possible to
know that the request may be cleared immediately, rather than waiting for
the receiving VCPU thread to handle the request in VCPU RUN. The only
current examples of this are kvm_vcpu_block() calls made by VCPUs to
block themselves. A possible side-effect of that call is to make the
KVM_REQ_UNHALT request, which may then be cleared immediately when the
VCPU returns from the call.

References
==========

.. [atomic-ops] Documentation/atomic_bitops.txt and Documentation/atomic_t.txt
.. [memory-barriers] Documentation/memory-barriers.txt
.. [lwn-mb] https://lwn.net/Articles/573436/