.. SPDX-License-Identifier: GPL-2.0

==========
Nested VMX
==========

Overview
---------

On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
to easily and efficiently run guest operating systems. Normally, these guests
*cannot* themselves be hypervisors running their own guests, because in VMX,
guests cannot use VMX instructions.

The "Nested VMX" feature adds this missing capability: running guest
hypervisors (which use VMX) with their own nested guests. It does so by
allowing a guest to use VMX instructions, and correctly and efficiently
emulating them using the single level of VMX available in the hardware.

We describe in much greater detail the theory behind the nested VMX feature,
its implementation and its performance characteristics, in the OSDI 2010 paper
"The Turtles Project: Design and Implementation of Nested Virtualization",
available at:

        https://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf


Terminology
-----------

Single-level virtualization has two levels - the host (KVM) and the guests.
In nested virtualization, we have three levels: The host (KVM), which we call
L0, the guest hypervisor, which we call L1, and its nested guest, which we
call L2.


Running nested VMX
------------------

The nested VMX feature has been enabled by default since Linux kernel v4.20.
For older Linux kernels, it can be enabled by giving the "nested=1" option to
the kvm-intel module.


No modifications are required to user space (qemu). However, qemu's default
emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must be
explicitly enabled, by giving qemu one of the following options:

     - ``-cpu host``              (emulated CPU has all features of the real CPU)

     - ``-cpu qemu64,+vmx``       (add just the vmx feature to a named CPU type)


ABIs
----

Nested VMX aims to present a standard and (eventually) fully-functional VMX
implementation for a guest hypervisor to use. As such, the official
specification of the ABI that it provides is Intel's VMX specification,
namely volume 3B of their "Intel 64 and IA-32 Architectures Software
Developer's Manual". Not all of VMX's features are currently fully supported,
but the goal is to eventually support them all, starting with the VMX features
which are used in practice by popular hypervisors (KVM and others).

As a VMX implementation, nested VMX presents a VMCS structure to L1.
As mandated by the spec, other than the two fields revision_id and abort,
this structure is *opaque* to its user, who is not supposed to know or care
about its internal structure. Rather, the structure is accessed through the
VMREAD and VMWRITE instructions.
Still, for debugging purposes, KVM developers might be interested in the
internals of this structure; this is struct vmcs12 from arch/x86/kvm/vmx.c.

The name "vmcs12" refers to the VMCS that L1 builds for L2. In the code we
also have "vmcs01", the VMCS that L0 builds for L1, and "vmcs02", the VMCS
that L0 builds to actually run L2. How this is done is explained in the
aforementioned paper.

For convenience, we repeat the content of struct vmcs12 here. If the internals
of this structure change, this can break live migration across KVM versions.
VMCS12_REVISION (from vmx.c) should be changed if struct vmcs12 or its inner
struct shadow_vmcs is ever changed.

::

        typedef u64 natural_width;
        struct __packed vmcs12 {
                /* According to the Intel spec, a VMCS region must start with
                 * these two user-visible fields */
                u32 revision_id;
                u32 abort;

                u32 launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
                u32 padding[7]; /* room for future expansion */

                u64 io_bitmap_a;
                u64 io_bitmap_b;
                u64 msr_bitmap;
                u64 vm_exit_msr_store_addr;
                u64 vm_exit_msr_load_addr;
                u64 vm_entry_msr_load_addr;
                u64 tsc_offset;
                u64 virtual_apic_page_addr;
                u64 apic_access_addr;
                u64 ept_pointer;
                u64 guest_physical_address;
                u64 vmcs_link_pointer;
                u64 guest_ia32_debugctl;
                u64 guest_ia32_pat;
                u64 guest_ia32_efer;
                u64 guest_pdptr0;
                u64 guest_pdptr1;
                u64 guest_pdptr2;
                u64 guest_pdptr3;
                u64 host_ia32_pat;
                u64 host_ia32_efer;
                u64 padding64[8]; /* room for future expansion */
                natural_width cr0_guest_host_mask;
                natural_width cr4_guest_host_mask;
                natural_width cr0_read_shadow;
                natural_width cr4_read_shadow;
                natural_width dead_space[4]; /* Last remnants of cr3_target_value[0-3]. */
                natural_width exit_qualification;
                natural_width guest_linear_address;
                natural_width guest_cr0;
                natural_width guest_cr3;
                natural_width guest_cr4;
                natural_width guest_es_base;
                natural_width guest_cs_base;
                natural_width guest_ss_base;
                natural_width guest_ds_base;
                natural_width guest_fs_base;
                natural_width guest_gs_base;
                natural_width guest_ldtr_base;
                natural_width guest_tr_base;
                natural_width guest_gdtr_base;
                natural_width guest_idtr_base;
                natural_width guest_dr7;
                natural_width guest_rsp;
                natural_width guest_rip;
                natural_width guest_rflags;
                natural_width guest_pending_dbg_exceptions;
                natural_width guest_sysenter_esp;
                natural_width guest_sysenter_eip;
                natural_width host_cr0;
                natural_width host_cr3;
                natural_width host_cr4;
                natural_width host_fs_base;
                natural_width host_gs_base;
                natural_width host_tr_base;
                natural_width host_gdtr_base;
                natural_width host_idtr_base;
                natural_width host_ia32_sysenter_esp;
                natural_width host_ia32_sysenter_eip;
                natural_width host_rsp;
                natural_width host_rip;
                natural_width paddingl[8]; /* room for future expansion */
                u32 pin_based_vm_exec_control;
                u32 cpu_based_vm_exec_control;
                u32 exception_bitmap;
                u32 page_fault_error_code_mask;
                u32 page_fault_error_code_match;
                u32 cr3_target_count;
                u32 vm_exit_controls;
                u32 vm_exit_msr_store_count;
                u32 vm_exit_msr_load_count;
                u32 vm_entry_controls;
                u32 vm_entry_msr_load_count;
                u32 vm_entry_intr_info_field;
                u32 vm_entry_exception_error_code;
                u32 vm_entry_instruction_len;
                u32 tpr_threshold;
                u32 secondary_vm_exec_control;
                u32 vm_instruction_error;
                u32 vm_exit_reason;
                u32 vm_exit_intr_info;
                u32 vm_exit_intr_error_code;
                u32 idt_vectoring_info_field;
                u32 idt_vectoring_error_code;
                u32 vm_exit_instruction_len;
                u32 vmx_instruction_info;
                u32 guest_es_limit;
                u32 guest_cs_limit;
                u32 guest_ss_limit;
                u32 guest_ds_limit;
                u32 guest_fs_limit;
                u32 guest_gs_limit;
                u32 guest_ldtr_limit;
                u32 guest_tr_limit;
                u32 guest_gdtr_limit;
                u32 guest_idtr_limit;
                u32 guest_es_ar_bytes;
                u32 guest_cs_ar_bytes;
                u32 guest_ss_ar_bytes;
                u32 guest_ds_ar_bytes;
                u32 guest_fs_ar_bytes;
                u32 guest_gs_ar_bytes;
                u32 guest_ldtr_ar_bytes;
                u32 guest_tr_ar_bytes;
                u32 guest_interruptibility_info;
                u32 guest_activity_state;
                u32 guest_sysenter_cs;
                u32 host_ia32_sysenter_cs;
                u32 padding32[8]; /* room for future expansion */
                u16 virtual_processor_id;
                u16 guest_es_selector;
                u16 guest_cs_selector;
                u16 guest_ss_selector;
                u16 guest_ds_selector;
                u16 guest_fs_selector;
                u16 guest_gs_selector;
                u16 guest_ldtr_selector;
                u16 guest_tr_selector;
                u16 host_es_selector;
                u16 host_cs_selector;
                u16 host_ss_selector;
                u16 host_ds_selector;
                u16 host_fs_selector;
                u16 host_gs_selector;
                u16 host_tr_selector;
        };


Authors
-------

These patches were written by:
    - Abel Gordon, abelg <at> il.ibm.com
    - Nadav Har'El, nyh <at> il.ibm.com
    - Orit Wasserman, oritw <at> il.ibm.com
    - Ben-Ami Yassor, benami <at> il.ibm.com
    - Muli Ben-Yehuda, muli <at> il.ibm.com

With contributions by:
    - Anthony Liguori, aliguori <at> us.ibm.com
    - Mike Day, mdday <at> us.ibm.com
    - Michael Factor, factor <at> il.ibm.com
    - Zvi Dubitzky, dubi <at> il.ibm.com

And valuable reviews by:
    - Avi Kivity, avi <at> redhat.com
    - Gleb Natapov, gleb <at> redhat.com
    - Marcelo Tosatti, mtosatti <at> redhat.com
    - Kevin Tian, kevin.tian <at> intel.com
    - and others.