0001 .. SPDX-License-Identifier: GPL-2.0
0002
0003 =================
0004 KVM-specific MSRs
0005 =================
0006
0007 :Author: Glauber Costa <glommer@redhat.com>, Red Hat Inc, 2010
0008
KVM uses a set of custom MSRs to service guest requests.

Custom MSRs have a reserved range, from 0x4b564d00 to 0x4b564dff.
Some MSRs exist outside this range, but they are deprecated and their
use is discouraged.
0014
0015 Custom MSR list
0016 ---------------
0017
The currently supported custom MSRs are:
0019
0020 MSR_KVM_WALL_CLOCK_NEW:
0021 0x4b564d00
0022
0023 data:
4-byte aligned physical address of a memory area which must be
0025 in guest RAM. This memory is expected to hold a copy of the following
0026 structure::
0027
    struct pvclock_wall_clock {
        u32 version;
        u32 sec;
        u32 nsec;
    } __attribute__((__packed__));
0033
0034 whose data will be filled in by the hypervisor. The hypervisor is only
0035 guaranteed to update this data at the moment of MSR write.
0036 Users that want to reliably query this information more than once have
0037 to write more than once to this MSR. Fields have the following meanings:
0038
0039 version:
0040 guest has to check version before and after grabbing
0041 time information and check that they are both equal and even.
0042 An odd version indicates an in-progress update.
0043
0044 sec:
0045 number of seconds for wallclock at time of boot.
0046
0047 nsec:
0048 number of nanoseconds for wallclock at time of boot.
0049
0050 In order to get the current wallclock time, the system_time from
0051 MSR_KVM_SYSTEM_TIME_NEW needs to be added.
0052
0053 Note that although MSRs are per-CPU entities, the effect of this
0054 particular MSR is global.
0055
Availability of this MSR must be checked via bit 3 in 0x40000001 cpuid
leaf prior to usage.
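
As an illustration of the version check described above, the following is a
minimal user-space sketch, not the kernel's implementation. The function name
``wall_clock_read`` and the ``max_tries`` bound are assumptions made for this
example. The privileged MSR write that asks the hypervisor to fill the
structure is not shown.

```c
#include <stdatomic.h>
#include <stdint.h>

struct pvclock_wall_clock {
    uint32_t version;
    uint32_t sec;
    uint32_t nsec;
} __attribute__((__packed__));

/*
 * Hypothetical helper: copy sec/nsec out of the shared structure.
 * Returns 0 on a consistent snapshot (version equal before and after,
 * and even), or -1 if no consistent snapshot was seen in max_tries.
 */
static int wall_clock_read(const volatile struct pvclock_wall_clock *wc,
                           uint32_t *sec, uint32_t *nsec, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        uint32_t before = wc->version;
        atomic_thread_fence(memory_order_acquire);
        uint32_t s = wc->sec;
        uint32_t ns = wc->nsec;
        atomic_thread_fence(memory_order_acquire);
        uint32_t after = wc->version;

        /* An odd version means an update is in progress. */
        if (before == after && (before & 1u) == 0) {
            *sec = s;
            *nsec = ns;
            return 0;
        }
    }
    return -1;
}
```

The values obtained this way describe the wallclock at boot; the current
wallclock still requires adding system_time from MSR_KVM_SYSTEM_TIME_NEW.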
0058
0059 MSR_KVM_SYSTEM_TIME_NEW:
0060 0x4b564d01
0061
0062 data:
0063 4-byte aligned physical address of a memory area which must be in
0064 guest RAM, plus an enable bit in bit 0. This memory is expected to hold
0065 a copy of the following structure::
0066
    struct pvclock_vcpu_time_info {
        u32 version;
        u32 pad0;
        u64 tsc_timestamp;
        u64 system_time;
        u32 tsc_to_system_mul;
        s8 tsc_shift;
        u8 flags;
        u8 pad[2];
    } __attribute__((__packed__)); /* 32 bytes */
0077
0078 whose data will be filled in by the hypervisor periodically. Only one
0079 write, or registration, is needed for each VCPU. The interval between
0080 updates of this structure is arbitrary and implementation-dependent.
0081 The hypervisor may update this structure at any time it sees fit until
0082 anything with bit0 == 0 is written to it.
0083
0084 Fields have the following meanings:
0085
0086 version:
0087 guest has to check version before and after grabbing
0088 time information and check that they are both equal and even.
0089 An odd version indicates an in-progress update.
0090
0091 tsc_timestamp:
0092 the tsc value at the current VCPU at the time
0093 of the update of this structure. Guests can subtract this value
0094 from current tsc to derive a notion of elapsed time since the
0095 structure update.
0096
0097 system_time:
0098 a host notion of monotonic time, including sleep
0099 time at the time this structure was last updated. Unit is
0100 nanoseconds.
0101
0102 tsc_to_system_mul:
0103 multiplier to be used when converting
0104 tsc-related quantity to nanoseconds
0105
0106 tsc_shift:
0107 shift to be used when converting tsc-related
0108 quantity to nanoseconds. This shift will ensure that
0109 multiplication with tsc_to_system_mul does not overflow.
0110 A positive value denotes a left shift, a negative value
0111 a right shift.
0112
0113 The conversion from tsc to nanoseconds involves an additional
0114 right shift by 32 bits. With this information, guests can
0115 derive per-CPU time by doing::
0116
    time = (current_tsc - tsc_timestamp);
    if (tsc_shift >= 0)
        time <<= tsc_shift;
    else
        time >>= -tsc_shift;
    /* the product may need more than 64 bits before the shift */
    time = (time * tsc_to_system_mul) >> 32;
    time = time + system_time;
0124
0125 flags:
0126 bits in this field indicate extended capabilities
0127 coordinated between the guest and the hypervisor. Availability
0128 of specific flags has to be checked in 0x40000001 cpuid leaf.
0129 Current flags are:
0130
0131
+-----------+--------------+----------------------------------+
| flag bit  | cpuid bit    | meaning                          |
+-----------+--------------+----------------------------------+
|           |              | time measures taken across       |
|    0      |      24      | multiple cpus are guaranteed to  |
|           |              | be monotonic                     |
+-----------+--------------+----------------------------------+
|           |              | guest vcpu has been paused by    |
|    1      |     N/A      | the host                         |
|           |              | See 4.70 in api.txt              |
+-----------+--------------+----------------------------------+
0143
Availability of this MSR must be checked via bit 3 in 0x40000001 cpuid
leaf prior to usage.
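
The per-CPU time derivation described for this MSR can be sketched in C as
follows. This is an illustrative sketch under stated assumptions, not the
kernel's code: the helper names ``pvclock_scale_delta`` and
``pvclock_read_ns`` are chosen for this example, the TSC value is passed in
as a parameter (a real guest would read the TSC inside the retry loop), and
``unsigned __int128`` is a GCC/Clang extension used here to keep the
multiplication from overflowing 64 bits.

```c
#include <stdatomic.h>
#include <stdint.h>

struct pvclock_vcpu_time_info {
    uint32_t version;
    uint32_t pad0;
    uint64_t tsc_timestamp;
    uint64_t system_time;
    uint32_t tsc_to_system_mul;
    int8_t   tsc_shift;
    uint8_t  flags;
    uint8_t  pad[2];
} __attribute__((__packed__)); /* 32 bytes */

/* Convert a TSC delta to nanoseconds: shift, multiply, then >> 32. */
static uint64_t pvclock_scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
    if (shift >= 0)
        delta <<= shift;
    else
        delta >>= -shift;
    return (uint64_t)(((unsigned __int128)delta * mul) >> 32);
}

/* Retry until the version is even and unchanged across the read. */
static uint64_t pvclock_read_ns(const volatile struct pvclock_vcpu_time_info *ti,
                                uint64_t current_tsc)
{
    uint32_t version;
    uint64_t ns;

    do {
        version = ti->version;
        atomic_thread_fence(memory_order_acquire);
        ns = ti->system_time +
             pvclock_scale_delta(current_tsc - ti->tsc_timestamp,
                                 ti->tsc_to_system_mul, ti->tsc_shift);
        atomic_thread_fence(memory_order_acquire);
    } while ((version & 1u) || version != ti->version);

    return ns;
}
```

For example, with tsc_to_system_mul = 0x80000000 (a factor of 0.5 after the
32-bit shift) and tsc_shift = 1, a delta of 2000 TSC ticks converts to 2000
nanoseconds.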
0146
0147
0148 MSR_KVM_WALL_CLOCK:
0149 0x11
0150
0151 data and functioning:
0152 same as MSR_KVM_WALL_CLOCK_NEW. Use that instead.
0153
0154 This MSR falls outside the reserved KVM range and may be removed in the
0155 future. Its usage is deprecated.
0156
Availability of this MSR must be checked via bit 0 in 0x40000001 cpuid
leaf prior to usage.
0159
0160 MSR_KVM_SYSTEM_TIME:
0161 0x12
0162
0163 data and functioning:
0164 same as MSR_KVM_SYSTEM_TIME_NEW. Use that instead.
0165
0166 This MSR falls outside the reserved KVM range and may be removed in the
0167 future. Its usage is deprecated.
0168
Availability of this MSR must be checked via bit 0 in 0x40000001 cpuid
leaf prior to usage.
0171
0172 The suggested algorithm for detecting kvmclock presence is then::
0173
    if (!kvm_para_available()) /* refer to cpuid.txt */
        return NON_PRESENT;

    flags = cpuid_eax(0x40000001);
    if (flags & (1 << 3)) {         /* bit 3: new MSRs available */
        msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
        msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
        return PRESENT;
    } else if (flags & (1 << 0)) {  /* bit 0: deprecated MSRs only */
        msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
        msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
        return PRESENT;
    } else
        return NON_PRESENT;
0188
0189 MSR_KVM_ASYNC_PF_EN:
0190 0x4b564d02
0191
0192 data:
0193 Asynchronous page fault (APF) control MSR.
0194
0195 Bits 63-6 hold 64-byte aligned physical address of a 64 byte memory area
0196 which must be in guest RAM and must be zeroed. This memory is expected
0197 to hold a copy of the following structure::
0198
    struct kvm_vcpu_pv_apf_data {
        /* Used for 'page not present' events delivered via #PF */
        __u32 flags;

        /* Used for 'page ready' events delivered via interrupt notification */
        __u32 token;

        __u8 pad[56];
        __u32 enabled;
    };
0209
0210 Bits 5-4 of the MSR are reserved and should be zero. Bit 0 is set to 1
0211 when asynchronous page faults are enabled on the vcpu, 0 when disabled.
0212 Bit 1 is 1 if asynchronous page faults can be injected when vcpu is in
0213 cpl == 0. Bit 2 is 1 if asynchronous page faults are delivered to L1 as
0214 #PF vmexits. Bit 2 can be set only if KVM_FEATURE_ASYNC_PF_VMEXIT is
0215 present in CPUID. Bit 3 enables interrupt based delivery of 'page ready'
0216 events. Bit 3 can only be set if KVM_FEATURE_ASYNC_PF_INT is present in
0217 CPUID.
0218
'Page not present' events are currently always delivered as a synthetic
#PF exception. During delivery of these events, the APF CR2 register contains
a token that will be used to notify the guest when the missing page becomes
available. Also, to make it possible to distinguish between a real #PF and
an APF, the first 4 bytes of the 64-byte memory location ('flags') will be
written to by the hypervisor at the time of injection. Only the first bit of
'flags' is currently supported; when set, it indicates that the guest is
dealing with an asynchronous 'page not present' event. If APF 'flags' is '0'
during a page fault, it is a regular page fault. The guest is supposed to
clear 'flags' when it is done handling the #PF exception so the next event
can be delivered.
0230
0231 Note, since APF 'page not present' events use the same exception vector
0232 as regular page fault, guest must reset 'flags' to '0' before it does
0233 something that can generate normal page fault.
0234
Bytes 4-7 of the 64-byte memory location ('token') will be written to by the
hypervisor at the time of APF 'page ready' event injection. The content
of these bytes is a token which was previously delivered in a 'page not
present' event. The event indicates the page is now available. The guest is
supposed to write '0' to 'token' when it is done handling the 'page ready'
event and to write '1' to MSR_KVM_ASYNC_PF_ACK after clearing the location;
writing to the MSR forces KVM to re-scan its queue and deliver the next
pending notification.
0243
0244 Note, MSR_KVM_ASYNC_PF_INT MSR specifying the interrupt vector for 'page
0245 ready' APF delivery needs to be written to before enabling APF mechanism
0246 in MSR_KVM_ASYNC_PF_EN or interrupt #0 can get injected. The MSR is
0247 available if KVM_FEATURE_ASYNC_PF_INT is present in CPUID.
0248
Note, previously, 'page ready' events were delivered via the same #PF
exception as 'page not present' events, but this is now deprecated. If
bit 3 (interrupt based delivery) is not set, APF events are not delivered.
0252
0253 If APF is disabled while there are outstanding APFs, they will
0254 not be delivered.
0255
Currently, 'page ready' APF events are always delivered on the same vcpu
as the corresponding 'page not present' event, but the guest should not
rely on that.
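
The bit layout of MSR_KVM_ASYNC_PF_EN described above can be sketched as a
small encoding helper. This is an illustration only: the ``APF_*`` macro
names and the ``apf_en_encode`` function are assumptions made for this
example, and the CPUID checks required before setting bits 2 and 3 are left
to the caller.

```c
#include <stdint.h>

/* Illustrative names for the control bits described in the text. */
#define APF_ENABLED          (1ULL << 0) /* APF enabled on this vcpu */
#define APF_SEND_ALWAYS      (1ULL << 1) /* may inject when cpl == 0 */
#define APF_DELIVERY_VMEXIT  (1ULL << 2) /* deliver to L1 as #PF vmexit */
#define APF_DELIVERY_INT     (1ULL << 3) /* interrupt based 'page ready' */
#define APF_CTRL_MASK        0xfULL      /* bits 3-0 */
#define APF_ADDR_MASK        (~0x3fULL)  /* bits 63-6 */

/*
 * Hypothetical helper: build the MSR value from the physical address of
 * the 64-byte aligned data area and the control bits. Returns 0 on
 * success, -1 if the address is misaligned or a reserved bit is set.
 */
static int apf_en_encode(uint64_t area_phys, uint64_t ctrl, uint64_t *val)
{
    if (area_phys & ~APF_ADDR_MASK)
        return -1;              /* not 64-byte aligned */
    if (ctrl & ~APF_CTRL_MASK)
        return -1;              /* bits 5-4 are reserved */
    *val = area_phys | ctrl;
    return 0;
}
```

With a data area at physical address 0x1000, enabling APF with interrupt
based 'page ready' delivery yields the value 0x1009.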
0259
0260 MSR_KVM_STEAL_TIME:
0261 0x4b564d03
0262
0263 data:
64-byte aligned physical address of a memory area which must be
in guest RAM, plus an enable bit in bit 0. This memory is expected to
hold a copy of the following structure::

    struct kvm_steal_time {
        __u64 steal;
        __u32 version;
        __u32 flags;
        __u8 preempted;
        __u8 u8_pad[3];
        __u32 pad[11];
    };
0276
0277 whose data will be filled in by the hypervisor periodically. Only one
0278 write, or registration, is needed for each VCPU. The interval between
0279 updates of this structure is arbitrary and implementation-dependent.
0280 The hypervisor may update this structure at any time it sees fit until
0281 anything with bit0 == 0 is written to it. Guest is required to make sure
0282 this structure is initialized to zero.
0283
0284 Fields have the following meanings:
0285
0286 version:
0287 a sequence counter. In other words, guest has to check
0288 this field before and after grabbing time information and make
0289 sure they are both equal and even. An odd version indicates an
0290 in-progress update.
0291
0292 flags:
0293 At this point, always zero. May be used to indicate
0294 changes in this structure in the future.
0295
steal:
the amount of time in which this vCPU did not run, in
nanoseconds. Time during which the vcpu is idle will not be
reported as steal time.

preempted:
indicates whether the vCPU that owns this struct is running or
not. Non-zero values mean the vCPU has been preempted. Zero
means the vCPU is not preempted. NOTE, it is always zero if
the hypervisor doesn't support this field.
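
A minimal sketch of reading the steal field with the sequence counter
protocol described above is shown below. It is a user-space illustration
rather than the kernel's implementation; the function name
``steal_time_read`` is an assumption made for this example.

```c
#include <stdatomic.h>
#include <stdint.h>

struct kvm_steal_time {
    uint64_t steal;
    uint32_t version;
    uint32_t flags;
    uint8_t  preempted;
    uint8_t  u8_pad[3];
    uint32_t pad[11];
};

/*
 * Hypothetical helper: return the accumulated steal time in nanoseconds.
 * Retries while the version is odd (update in progress) or changes
 * between the two reads.
 */
static uint64_t steal_time_read(const volatile struct kvm_steal_time *st)
{
    uint32_t version;
    uint64_t steal;

    do {
        version = st->version;
        atomic_thread_fence(memory_order_acquire);
        steal = st->steal;
        atomic_thread_fence(memory_order_acquire);
    } while ((version & 1u) || version != st->version);

    return steal;
}
```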
0306
0307 MSR_KVM_EOI_EN:
0308 0x4b564d04
0309
0310 data:
0311 Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
0312 when disabled. Bit 1 is reserved and must be zero. When PV end of
0313 interrupt is enabled (bit 0 set), bits 63-2 hold a 4-byte aligned
0314 physical address of a 4 byte memory area which must be in guest RAM and
0315 must be zeroed.
0316
0317 The first, least significant bit of 4 byte memory location will be
0318 written to by the hypervisor, typically at the time of interrupt
0319 injection. Value of 1 means that guest can skip writing EOI to the apic
0320 (using MSR or MMIO write); instead, it is sufficient to signal
0321 EOI by clearing the bit in guest memory - this location will
0322 later be polled by the hypervisor.
0323 Value of 0 means that the EOI write is required.
0324
0325 It is always safe for the guest to ignore the optimization and perform
0326 the APIC EOI write anyway.
0327
The hypervisor is guaranteed to only modify this least
significant bit while in the current VCPU context. This means that the
guest does not need to use either a lock prefix or memory ordering
primitives to synchronise with the hypervisor.
0332
However, the hypervisor can set and clear this memory bit at any time.
The guest must therefore both read the least significant bit in the
memory area and clear it using a single CPU instruction, such as test
and clear, or compare and exchange. Otherwise the hypervisor could
interrupt the guest and clear the bit in the window between the guest
testing it (to detect whether it can skip the EOI APIC write) and the
guest clearing it (to signal EOI to the hypervisor).
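
The read-and-clear requirement can be illustrated with a C11 atomic
exchange. This is a sketch under assumptions, not the kernel's code: the
function name ``pv_eoi_try_skip`` is hypothetical, the exchange writes the
whole 4-byte word (acceptable only because the area is zeroed and the
hypervisor modifies just the least significant bit), and which instruction
the compiler emits is compiler-dependent (on x86 an exchange is typically a
single ``xchg``, which is stronger than the text requires).

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: read and clear the PV EOI word in one atomic
 * operation. Returns true if the bit was set, meaning the APIC EOI
 * write can be skipped; false means the EOI write is still required.
 */
static bool pv_eoi_try_skip(_Atomic uint32_t *pv_eoi)
{
    uint32_t old = atomic_exchange_explicit(pv_eoi, 0u, memory_order_relaxed);
    return (old & 1u) != 0;
}
```

A caller that gets ``false`` falls back to the regular APIC EOI write, which
is always safe.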
0342
0343 MSR_KVM_POLL_CONTROL:
0344 0x4b564d05
0345
0346 Control host-side polling.
0347
0348 data:
0349 Bit 0 enables (1) or disables (0) host-side HLT polling logic.
0350
0351 KVM guests can request the host not to poll on HLT, for example if
0352 they are performing polling themselves.
0353
0354 MSR_KVM_ASYNC_PF_INT:
0355 0x4b564d06
0356
0357 data:
0358 Second asynchronous page fault (APF) control MSR.
0359
0360 Bits 0-7: APIC vector for delivery of 'page ready' APF events.
0361 Bits 8-63: Reserved
0362
Interrupt vector for asynchronous 'page ready' notification delivery.
0364 The vector has to be set up before asynchronous page fault mechanism
0365 is enabled in MSR_KVM_ASYNC_PF_EN. The MSR is only available if
0366 KVM_FEATURE_ASYNC_PF_INT is present in CPUID.
0367
0368 MSR_KVM_ASYNC_PF_ACK:
0369 0x4b564d07
0370
0371 data:
0372 Asynchronous page fault (APF) acknowledgment.
0373
When the guest is done processing a 'page ready' APF event and the 'token'
field in 'struct kvm_vcpu_pv_apf_data' is cleared, it is supposed to
write '1' to bit 0 of the MSR. This causes the host to re-scan its queue
and check if there are more notifications pending. The MSR is available
if KVM_FEATURE_ASYNC_PF_INT is present in CPUID.
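
The ordering between the three APF MSRs can be sketched as follows. This is
an illustration under assumptions: ``wrmsr_stub`` is a logging stand-in for
the privileged MSR write (a real guest would execute WRMSR in ring 0), and
the function names ``apf_enable_int_delivery`` and ``apf_page_ready`` are
hypothetical. CPUID feature checks are omitted.

```c
#include <stdint.h>

#define MSR_KVM_ASYNC_PF_EN   0x4b564d02u
#define MSR_KVM_ASYNC_PF_INT  0x4b564d06u
#define MSR_KVM_ASYNC_PF_ACK  0x4b564d07u

struct kvm_vcpu_pv_apf_data {
    uint32_t flags;   /* 'page not present' events delivered via #PF */
    uint32_t token;   /* 'page ready' events delivered via interrupt */
    uint8_t  pad[56];
    uint32_t enabled;
};

/* Stand-in for WRMSR: records each write so the order can be inspected. */
struct msr_write { uint32_t msr; uint64_t val; };
static struct msr_write msr_log[8];
static unsigned int msr_log_len;

static void wrmsr_stub(uint32_t msr, uint64_t val)
{
    if (msr_log_len < 8)
        msr_log[msr_log_len++] = (struct msr_write){ msr, val };
}

/* Program the vector first, then enable APF (bit 0) with bit 3 set. */
static void apf_enable_int_delivery(uint64_t area_phys, uint8_t vector)
{
    wrmsr_stub(MSR_KVM_ASYNC_PF_INT, vector);
    wrmsr_stub(MSR_KVM_ASYNC_PF_EN, area_phys | (1ULL << 0) | (1ULL << 3));
}

/* 'Page ready' handler: take the token, clear it, then acknowledge. */
static uint32_t apf_page_ready(volatile struct kvm_vcpu_pv_apf_data *d)
{
    uint32_t token = d->token;

    /* A real guest would wake the task waiting on this token here. */
    d->token = 0;
    wrmsr_stub(MSR_KVM_ASYNC_PF_ACK, 1);
    return token;
}
```

Writing the vector before the enable bit avoids the case where the host
injects interrupt #0 because no vector has been configured yet.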
0379
0380 MSR_KVM_MIGRATION_CONTROL:
0381 0x4b564d08
0382
0383 data:
0384 This MSR is available if KVM_FEATURE_MIGRATION_CONTROL is present in
0385 CPUID. Bit 0 represents whether live migration of the guest is allowed.
0386
0387 When a guest is started, bit 0 will be 0 if the guest has encrypted
0388 memory and 1 if the guest does not have encrypted memory. If the
0389 guest is communicating page encryption status to the host using the
0390 ``KVM_HC_MAP_GPA_RANGE`` hypercall, it can set bit 0 in this MSR to
0391 allow live migration of the guest.