.. SPDX-License-Identifier: GPL-2.0

==============================
Running nested guests with KVM
==============================

A "nested guest" is a guest that runs inside another guest (which can
itself be KVM-based or a different hypervisor).  The straightforward
example is a KVM guest that in turn runs on a KVM guest (the rest of
this document is built on this example)::

              .----------------.  .----------------.
              |                |  |                |
              |      L2        |  |      L2        |
              | (Nested Guest) |  | (Nested Guest) |
              |                |  |                |
              |----------------'--'----------------|
              |                                    |
              |       L1 (Guest Hypervisor)        |
              |          KVM (/dev/kvm)            |
              |                                    |
      .------------------------------------------------------.
      |                 L0 (Host Hypervisor)                 |
      |                    KVM (/dev/kvm)                    |
      |------------------------------------------------------|
      |        Hardware (with virtualization extensions)     |
      '------------------------------------------------------'

Terminology:

- L0 – level-0; the bare metal host, running KVM

- L1 – level-1 guest; a VM running on L0; also called the "guest
  hypervisor", as it itself is capable of running KVM.

- L2 – level-2 guest; a VM running on L1; this is the "nested guest"

.. note:: The above diagram is modelled after the x86 architecture;
          s390x, ppc64 and other architectures are likely to have
          a different design for nesting.

          For example, s390x always has an LPAR (Logical PARtition)
          hypervisor running on bare metal, adding another layer and
          resulting in at least four levels in a nested setup: L0 (bare
          metal, running the LPAR hypervisor), L1 (host hypervisor), L2
          (guest hypervisor), L3 (nested guest).

          This document will stick with the three-level terminology (L0,
          L1, and L2) for all architectures, and will largely focus on
          x86.


Use Cases
---------

There are several scenarios where nested KVM can be useful, to name a
few:

- As a developer, you want to test your software on different operating
  systems (OSes).  Instead of renting multiple VMs from a Cloud
  Provider, using nested KVM lets you rent a large enough "guest
  hypervisor" (level-1 guest).  This in turn allows you to create
  multiple nested guests (level-2 guests), running different OSes, on
  which you can develop and test your software.

- Live migration of "guest hypervisors" and their nested guests, for
  load balancing, disaster recovery, etc.

- VM image creation tools (e.g. ``virt-install``, etc.) often run
  their own VM, and users expect these to work inside a VM.

- Some OSes use virtualization internally for security (e.g. to let
  applications run safely in isolation).


Enabling "nested" (x86)
-----------------------

From Linux kernel v4.20 onwards, the ``nested`` KVM parameter is enabled
by default for Intel and AMD.  (Though your Linux distribution might
override this default.)

In case you are running a Linux kernel older than v4.20, to enable
nesting, set the ``nested`` KVM module parameter to ``Y`` or ``1``.  To
persist this setting across reboots, you can add it in a config file, as
shown below:
0087 
1. On the bare metal host (L0), list the kernel modules and ensure that
   the KVM modules are loaded::

    $ lsmod | grep -i kvm
    kvm_intel             133627  0
    kvm                   435079  1 kvm_intel

2. Show information for the ``kvm_intel`` module::

    $ modinfo kvm_intel | grep -i nested
    parm:           nested:bool

3. For the nested KVM configuration to persist across reboots, place the
   below in ``/etc/modprobe.d/kvm_intel.conf`` (create the file if it
   doesn't exist)::

    $ cat /etc/modprobe.d/kvm_intel.conf
    options kvm-intel nested=y

4. Unload and re-load the KVM Intel module::

    $ sudo rmmod kvm-intel
    $ sudo modprobe kvm-intel

5. Verify that the ``nested`` parameter for KVM is enabled::

    $ cat /sys/module/kvm_intel/parameters/nested
    Y

For AMD hosts, the process is the same as above, except that the module
name is ``kvm-amd``.
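
The same verification can be scripted for either vendor.  A minimal
sketch, assuming the standard ``/sys/module`` layout; the helper name
``nested_status`` is ours, not an existing utility:

```shell
# Print the "nested" parameter of a KVM vendor module, if it is loaded.
# nested_status is a hypothetical helper, not part of any tool.
nested_status() {
    f="/sys/module/$1/parameters/nested"
    if [ -r "$f" ]; then
        cat "$f"            # typically Y/N (Intel) or 1/0 (AMD)
    else
        echo "module $1 not loaded"
    fi
}

nested_status kvm_intel
nested_status kvm_amd
```

On a host with neither module loaded, both calls simply report that,
which makes the check safe to run anywhere.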


Additional nested-related kernel parameters (x86)
-------------------------------------------------

If your hardware is sufficiently advanced (an Intel Haswell processor or
newer, which has more recent hardware virt extensions), the following
additional features will also be enabled by default: "Shadow VMCS
(Virtual Machine Control Structure)" and APIC Virtualization on your
bare metal host (L0).  Parameters for Intel hosts::

    $ cat /sys/module/kvm_intel/parameters/enable_shadow_vmcs
    Y

    $ cat /sys/module/kvm_intel/parameters/enable_apicv
    Y

    $ cat /sys/module/kvm_intel/parameters/ept
    Y

.. note:: If you suspect your L2 (i.e. nested guest) is running slower,
          ensure the above are enabled (particularly
          ``enable_shadow_vmcs`` and ``ept``).
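
All three parameters can be checked in one loop.  A sketch; it prints a
fallback line when ``kvm_intel`` is not loaded, so it is safe to run on
any host:

```shell
# Dump the acceleration-related kvm_intel parameters, tolerating
# their absence (e.g. on AMD hosts or when the module is unloaded).
for p in enable_shadow_vmcs enable_apicv ept; do
    f="/sys/module/kvm_intel/parameters/$p"
    if [ -r "$f" ]; then
        printf '%s=%s\n' "$p" "$(cat "$f")"
    else
        printf '%s=<kvm_intel not loaded>\n' "$p"
    fi
done
```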


Starting a nested guest (x86)
-----------------------------

Once your bare metal host (L0) is configured for nesting, you should be
able to start an L1 guest with::

    $ qemu-kvm -cpu host [...]

The above will pass through the host CPU's capabilities as-is to the
guest; or, for better live migration compatibility, use a named CPU
model supported by QEMU, e.g.::

    $ qemu-kvm -cpu Haswell-noTSX-IBRS,vmx=on

The guest hypervisor will then be capable of running a nested guest
with accelerated KVM.
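
To confirm from inside the L1 guest that the virtualization extension
was actually passed through, check the CPU flags.  A sketch for Intel;
AMD guests expose ``svm`` instead of ``vmx``:

```shell
# Count logical CPUs whose flags include vmx; 0 means no VMX was
# exposed to this guest, so it cannot run KVM-accelerated L2 guests.
vmx_count=$(grep -cw vmx /proc/cpuinfo || true)
echo "logical CPUs advertising vmx: $vmx_count"
```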


Enabling "nested" (s390x)
-------------------------

1. On the host hypervisor (L0), enable the ``nested`` parameter on
   s390x::

    $ rmmod kvm
    $ modprobe kvm nested=1

.. note:: On s390x, the kernel parameter ``hpage`` is mutually exclusive
          with the ``nested`` parameter; i.e. to be able to enable
          ``nested``, the ``hpage`` parameter *must* be disabled.
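
Since the two parameters are mutually exclusive, it can be worth
inspecting both before loading.  A sketch, assuming they are exposed
under ``/sys/module/kvm/parameters``; on hosts where a parameter is not
available (e.g. non-s390x, or module unloaded) a fallback is printed:

```shell
# Show the nested/hpage settings of the kvm module, with a fallback
# when a parameter is not exposed on this host.
for p in nested hpage; do
    f="/sys/module/kvm/parameters/$p"
    if [ -r "$f" ]; then
        printf '%s=%s\n' "$p" "$(cat "$f")"
    else
        printf '%s=<not available on this host>\n' "$p"
    fi
done
```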

2. The guest hypervisor (L1) must be provided with the ``sie`` CPU
   feature; with QEMU, this can be done by using "host passthrough"
   (via the command-line ``-cpu host``).

3. Now the KVM module can be loaded in the L1 (guest hypervisor)::

    $ modprobe kvm


Live migration with nested KVM
------------------------------

Migrating an L1 guest, with a *live* nested guest in it, to another
bare metal host works as of Linux kernel 5.3 and QEMU 4.2.0 for
Intel x86 systems, and even on older versions for s390x.

On AMD systems, once an L1 guest has started an L2 guest, the L1 guest
should no longer be migrated or saved (refer to QEMU documentation on
"savevm"/"loadvm") until the L2 guest shuts down.  Attempting to migrate
or save-and-load an L1 guest while an L2 guest is running will result in
undefined behavior.  You might see a ``kernel BUG!`` entry in ``dmesg``, a
kernel 'oops', or an outright kernel panic.  Such a migrated or loaded L1
guest can no longer be considered stable or secure, and must be restarted.
Migrating an L1 guest merely configured to support nesting, while not
actually running L2 guests, is expected to function normally even on AMD
systems.

Migrating an L2 guest is always expected to succeed, so all the following
scenarios should work even on AMD systems:

- Migrating a nested guest (L2) to another L1 guest on the *same* bare
  metal host.

- Migrating a nested guest (L2) to another L1 guest on a *different*
  bare metal host.

- Migrating a nested guest (L2) to a bare metal host.
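
With libvirt, the second scenario could look like the sketch below,
run from inside L1.  The domain name ``l2guest`` and the destination
URI are hypothetical; outside a real nested setup the command simply
reports failure instead of migrating anything:

```shell
# Sketch only: live-migrate the nested guest "l2guest" (hypothetical
# name) to another L1 on a different bare metal host.
DEST="qemu+ssh://other-l1.example.com/system"   # hypothetical URI
if command -v virsh >/dev/null 2>&1; then
    virsh migrate --live l2guest "$DEST" 2>/dev/null \
        || echo "migration failed (expected outside a real nested setup)"
else
    echo "virsh not installed; intended command:"
    echo "  virsh migrate --live l2guest $DEST"
fi
```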

Reporting bugs from nested setups
---------------------------------

Debugging "nested" problems can involve sifting through log files across
L0, L1 and L2; this can result in tedious back-and-forth between the bug
reporter and the bug fixer.

- Mention that you are in a "nested" setup.  If you are running any kind
  of "nesting" at all, say so.  Unfortunately, this needs to be called
  out because when reporting bugs, people tend to forget to even
  *mention* that they're using nested virtualization.

- Ensure you are actually running KVM on KVM.  Sometimes people do not
  have KVM enabled for their guest hypervisor (L1), which results in
  them running with pure emulation, or what QEMU calls "TCG", while
  they think they're running nested KVM; this confuses "nested virt"
  (which could also mean QEMU on KVM) with "nested KVM" (KVM on KVM).
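
A quick way to tell, from inside the guest hypervisor (L1), is to check
for ``/dev/kvm`` and for the virt-extension CPU flags.  A sketch;
``vmx`` and ``svm`` are the Intel and AMD flags respectively:

```shell
# Inside L1: KVM acceleration requires /dev/kvm; without it, QEMU
# silently falls back to TCG (pure emulation) and the VM still boots.
if [ -c /dev/kvm ]; then
    echo "/dev/kvm present: nested KVM is possible here"
else
    echo "/dev/kvm missing: guests here would use TCG, not KVM"
fi
grep -m1 -woE 'vmx|svm' /proc/cpuinfo \
    || echo "no virt extensions exposed to this vCPU"
```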

Information to collect (generic)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following is not an exhaustive list, but a very good starting point:

  - Kernel, libvirt, and QEMU version from L0

  - Kernel, libvirt, and QEMU version from L1

  - QEMU command-line of L1 -- when using libvirt, you'll find it here:
    ``/var/log/libvirt/qemu/instance.log``

  - QEMU command-line of L2 -- as above, when using libvirt, get the
    complete libvirt-generated QEMU command-line

  - ``cat /proc/cpuinfo`` from L0

  - ``cat /proc/cpuinfo`` from L1

  - ``lscpu`` from L0

  - ``lscpu`` from L1

  - Full ``dmesg`` output from L0

  - Full ``dmesg`` output from L1
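
The list above can be gathered in one pass with a small script, run
once on L0 and once on L1.  A sketch; the output file name
``nested-debug.txt`` and the QEMU binary name are assumptions (adjust
both for your distribution):

```shell
# Collect the generic debug information into a single file, printing
# a placeholder when a tool is unavailable or needs more privileges.
{
    echo "== kernel ==";  uname -r
    echo "== cpuinfo =="; cat /proc/cpuinfo
    echo "== lscpu ==";   lscpu 2>/dev/null || echo "lscpu not installed"
    echo "== qemu ==";    qemu-system-x86_64 --version 2>/dev/null \
                              || echo "QEMU binary not found"
    echo "== dmesg ==";   dmesg 2>/dev/null || echo "dmesg needs root here"
} > nested-debug.txt
wc -l nested-debug.txt
```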

x86-specific info to collect
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Both commands below, ``x86info`` and ``dmidecode``, should be available
on most Linux distributions under those names:

  - Output of: ``x86info -a`` from L0

  - Output of: ``x86info -a`` from L1

  - Output of: ``dmidecode`` from L0

  - Output of: ``dmidecode`` from L1

s390x-specific info to collect
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Along with the generic details mentioned earlier, the following is also
recommended:

  - ``/proc/sysinfo`` from L1; this will also include the info from L0