0001 .. SPDX-License-Identifier: GPL-2.0
0002
0003 ==============================
0004 Running nested guests with KVM
0005 ==============================
0006
Nested virtualization is the ability to run a guest inside another
guest (the inner hypervisor can be KVM or a different one). The straightforward
0009 example is a KVM guest that in turn runs on a KVM guest (the rest of
0010 this document is built on this example)::
0011
0012 .----------------. .----------------.
0013 | | | |
0014 | L2 | | L2 |
0015 | (Nested Guest) | | (Nested Guest) |
0016 | | | |
0017 |----------------'--'----------------|
0018 | |
0019 | L1 (Guest Hypervisor) |
0020 | KVM (/dev/kvm) |
0021 | |
0022 .------------------------------------------------------.
0023 | L0 (Host Hypervisor) |
0024 | KVM (/dev/kvm) |
0025 |------------------------------------------------------|
0026 | Hardware (with virtualization extensions) |
0027 '------------------------------------------------------'
0028
0029 Terminology:
0030
0031 - L0 – level-0; the bare metal host, running KVM
0032
0033 - L1 – level-1 guest; a VM running on L0; also called the "guest
0034 hypervisor", as it itself is capable of running KVM.
0035
- L2 – level-2 guest; a VM running on L1; this is the "nested guest"
0037
0038 .. note:: The above diagram is modelled after the x86 architecture;
0039 s390x, ppc64 and other architectures are likely to have
0040 a different design for nesting.
0041
0042 For example, s390x always has an LPAR (LogicalPARtition)
0043 hypervisor running on bare metal, adding another layer and
0044 resulting in at least four levels in a nested setup — L0 (bare
0045 metal, running the LPAR hypervisor), L1 (host hypervisor), L2
0046 (guest hypervisor), L3 (nested guest).
0047
0048 This document will stick with the three-level terminology (L0,
0049 L1, and L2) for all architectures; and will largely focus on
0050 x86.
0051
0052
0053 Use Cases
0054 ---------
0055
0056 There are several scenarios where nested KVM can be useful, to name a
0057 few:
0058
0059 - As a developer, you want to test your software on different operating
0060 systems (OSes). Instead of renting multiple VMs from a Cloud
0061 Provider, using nested KVM lets you rent a large enough "guest
0062 hypervisor" (level-1 guest). This in turn allows you to create
0063 multiple nested guests (level-2 guests), running different OSes, on
0064 which you can develop and test your software.
0065
0066 - Live migration of "guest hypervisors" and their nested guests, for
0067 load balancing, disaster recovery, etc.
0068
0069 - VM image creation tools (e.g. ``virt-install``, etc) often run
0070 their own VM, and users expect these to work inside a VM.
0071
0072 - Some OSes use virtualization internally for security (e.g. to let
0073 applications run safely in isolation).
0074
0075
0076 Enabling "nested" (x86)
0077 -----------------------
0078
0079 From Linux kernel v4.20 onwards, the ``nested`` KVM parameter is enabled
0080 by default for Intel and AMD. (Though your Linux distribution might
0081 override this default.)
0082
In case you are running a Linux kernel older than v4.20, to enable
nesting, set the ``nested`` KVM module parameter to ``Y`` or ``1``. To
persist this setting across reboots, you can add it to a config file, as
shown below:
0087
1. On the bare metal host (L0), list the kernel modules and ensure that
   the KVM modules are loaded::
0090
0091 $ lsmod | grep -i kvm
0092 kvm_intel 133627 0
0093 kvm 435079 1 kvm_intel
0094
0095 2. Show information for ``kvm_intel`` module::
0096
0097 $ modinfo kvm_intel | grep -i nested
0098 parm: nested:bool
0099
3. For the nested KVM configuration to persist across reboots, place the
   below in ``/etc/modprobe.d/kvm_intel.conf`` (create the file if it
   doesn't exist)::
0103
0104 $ cat /etc/modprobe.d/kvm_intel.conf
0105 options kvm-intel nested=y
0106
0107 4. Unload and re-load the KVM Intel module::
0108
0109 $ sudo rmmod kvm-intel
0110 $ sudo modprobe kvm-intel
0111
0112 5. Verify if the ``nested`` parameter for KVM is enabled::
0113
0114 $ cat /sys/module/kvm_intel/parameters/nested
0115 Y
0116
0117 For AMD hosts, the process is the same as above, except that the module
0118 name is ``kvm-amd``.
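
For illustration, the corresponding persistence and verification steps
on an AMD host can be sketched as below. The config file name is just a
convention, and the small helper (a hypothetical name, not part of any
tool) prints "unavailable" when the module is not loaded, e.g. on a
non-AMD machine:

```shell
# Persist the setting across reboots (mirrors the Intel example):
#   echo "options kvm-amd nested=1" | sudo tee /etc/modprobe.d/kvm_amd.conf
#   sudo rmmod kvm-amd && sudo modprobe kvm-amd

# Helper: print the current "nested" value of a module, or "unavailable"
# if the module is not loaded.
nested_value() {
    f="/sys/module/$1/parameters/nested"
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo "unavailable"
    fi
}

nested_value kvm_amd
```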
0119
0120
0121 Additional nested-related kernel parameters (x86)
0122 -------------------------------------------------
0123
If your hardware is sufficiently advanced (an Intel Haswell processor or
later, which has newer hardware virt extensions), the following
additional features will also be enabled by default: "Shadow VMCS
(Virtual Machine Control Structure)" and APIC virtualization on your
bare metal host (L0). Parameters for Intel hosts::
0129
0130 $ cat /sys/module/kvm_intel/parameters/enable_shadow_vmcs
0131 Y
0132
0133 $ cat /sys/module/kvm_intel/parameters/enable_apicv
0134 Y
0135
0136 $ cat /sys/module/kvm_intel/parameters/ept
0137 Y
0138
0139 .. note:: If you suspect your L2 (i.e. nested guest) is running slower,
0140 ensure the above are enabled (particularly
0141 ``enable_shadow_vmcs`` and ``ept``).
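
As a convenience, the checks above can be scripted; this sketch prints
each nesting-related ``kvm_intel`` parameter, falling back to "not
available" when the module is not loaded:

```shell
# Print the nesting-related kvm_intel parameters (falls back to
# "not available" when the module is not loaded).
for p in nested enable_shadow_vmcs enable_apicv ept; do
    v=$(cat "/sys/module/kvm_intel/parameters/$p" 2>/dev/null) || v="not available"
    printf '%-20s %s\n' "$p" "$v"
done
```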
0142
0143
0144 Starting a nested guest (x86)
0145 -----------------------------
0146
0147 Once your bare metal host (L0) is configured for nesting, you should be
0148 able to start an L1 guest with::
0149
0150 $ qemu-kvm -cpu host [...]
0151
The above will pass through the host CPU's capabilities as-is to the
guest; or, for better live migration compatibility, use a named CPU
model supported by QEMU, e.g.::
0155
0156 $ qemu-kvm -cpu Haswell-noTSX-IBRS,vmx=on
0157
The guest hypervisor will subsequently be capable of running a nested
guest with accelerated KVM.
0160
0161
0162 Enabling "nested" (s390x)
0163 -------------------------
0164
0165 1. On the host hypervisor (L0), enable the ``nested`` parameter on
0166 s390x::
0167
0168 $ rmmod kvm
0169 $ modprobe kvm nested=1
0170
0171 .. note:: On s390x, the kernel parameter ``hpage`` is mutually exclusive
with the ``nested`` parameter — i.e. to be able to enable
0173 ``nested``, the ``hpage`` parameter *must* be disabled.
0174
0175 2. The guest hypervisor (L1) must be provided with the ``sie`` CPU
0176 feature — with QEMU, this can be done by using "host passthrough"
0177 (via the command-line ``-cpu host``).
0178
0179 3. Now the KVM module can be loaded in the L1 (guest hypervisor)::
0180
0181 $ modprobe kvm
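
To verify from inside the L1 guest that the ``sie`` feature from step 2
was actually passed through, you can check ``/proc/cpuinfo``, which on
s390x lists the CPU features (a quick sanity check; on other
architectures it simply reports the feature as missing):

```shell
# Inside the s390x L1 guest: the "sie" feature must be present for
# KVM to work; it appears in the "features" line of /proc/cpuinfo.
if grep -q -w sie /proc/cpuinfo; then
    echo "sie feature present: KVM can be used in this guest"
else
    echo "sie feature missing: check the -cpu setting of the L1 guest"
fi
```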
0182
0183
0184 Live migration with nested KVM
0185 ------------------------------
0186
0187 Migrating an L1 guest, with a *live* nested guest in it, to another
0188 bare metal host, works as of Linux kernel 5.3 and QEMU 4.2.0 for
0189 Intel x86 systems, and even on older versions for s390x.
0190
0191 On AMD systems, once an L1 guest has started an L2 guest, the L1 guest
0192 should no longer be migrated or saved (refer to QEMU documentation on
0193 "savevm"/"loadvm") until the L2 guest shuts down. Attempting to migrate
0194 or save-and-load an L1 guest while an L2 guest is running will result in
0195 undefined behavior. You might see a ``kernel BUG!`` entry in ``dmesg``, a
0196 kernel 'oops', or an outright kernel panic. Such a migrated or loaded L1
0197 guest can no longer be considered stable or secure, and must be restarted.
Migrating an L1 guest merely configured to support nesting, while not
actually running L2 guests, is expected to function normally even on AMD
systems; the restrictions above apply only once an L2 guest is started.
0201
0202 Migrating an L2 guest is always expected to succeed, so all the following
0203 scenarios should work even on AMD systems:
0204
0205 - Migrating a nested guest (L2) to another L1 guest on the *same* bare
0206 metal host.
0207
0208 - Migrating a nested guest (L2) to another L1 guest on a *different*
0209 bare metal host.
0210
0211 - Migrating a nested guest (L2) to a bare metal host.
0212
0213 Reporting bugs from nested setups
0214 -----------------------------------
0215
Debugging "nested" problems can involve sifting through log files across
L0, L1 and L2; this can result in a tedious back-and-forth between the
bug reporter and the bug fixer.
0219
0220 - Mention that you are in a "nested" setup. If you are running any kind
0221 of "nesting" at all, say so. Unfortunately, this needs to be called
0222 out because when reporting bugs, people tend to forget to even
0223 *mention* that they're using nested virtualization.
0224
- Ensure you are actually running KVM on KVM. Sometimes people do not
  have KVM enabled for their guest hypervisor (L1), which results in
  them running with pure emulation, or what QEMU calls "TCG", while they
  believe they're running nested KVM; i.e. they confuse "nested Virt"
  (which could also mean QEMU on KVM) with "nested KVM" (KVM on KVM).
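
A quick way to rule out the TCG case from inside the guest hypervisor
(L1) is to check for the KVM device node; when it is absent, QEMU falls
back to emulation:

```shell
# Inside L1: without /dev/kvm, QEMU can only run guests under TCG.
if [ -e /dev/kvm ]; then
    echo "/dev/kvm present: KVM acceleration available"
else
    echo "/dev/kvm missing: guests here will run under TCG (pure emulation)"
fi
```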
0230
0231 Information to collect (generic)
0232 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0233
0234 The following is not an exhaustive list, but a very good starting point:
0235
0236 - Kernel, libvirt, and QEMU version from L0
0237
0238 - Kernel, libvirt and QEMU version from L1
0239
0240 - QEMU command-line of L1 -- when using libvirt, you'll find it here:
0241 ``/var/log/libvirt/qemu/instance.log``
0242
0243 - QEMU command-line of L2 -- as above, when using libvirt, get the
0244 complete libvirt-generated QEMU command-line
0245
- ``cat /proc/cpuinfo`` from L0
0247
- ``cat /proc/cpuinfo`` from L1
0249
0250 - ``lscpu`` from L0
0251
0252 - ``lscpu`` from L1
0253
0254 - Full ``dmesg`` output from L0
0255
0256 - Full ``dmesg`` output from L1
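
The generic list above can be gathered with a small script, run once on
L0 and once on L1 (a sketch; the file names are arbitrary, and commands
that may need extra privileges or may be missing are allowed to fail):

```shell
# Collect generic debugging info for a nested-KVM bug report.
out="nested-debug-$(uname -n)"
mkdir -p "$out"
uname -a          > "$out/kernel-version.txt"
cat /proc/cpuinfo > "$out/cpuinfo.txt"  2>/dev/null || true
lscpu             > "$out/lscpu.txt"    2>/dev/null || true
dmesg             > "$out/dmesg.txt"    2>/dev/null || true
tar czf "$out.tar.gz" "$out"
echo "wrote $out.tar.gz"
```

The QEMU and libvirt versions can be appended in the same way, using
your distribution's package manager.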
0257
0258 x86-specific info to collect
0259 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0260
Both the commands below, ``x86info`` and ``dmidecode``, should be
available on most Linux distributions via packages of the same name:
0263
0264 - Output of: ``x86info -a`` from L0
0265
0266 - Output of: ``x86info -a`` from L1
0267
0268 - Output of: ``dmidecode`` from L0
0269
0270 - Output of: ``dmidecode`` from L1
0271
0272 s390x-specific info to collect
0273 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0274
Along with the generic details mentioned earlier, the following is also
recommended:
0277
0278 - ``/proc/sysinfo`` from L1; this will also include the info from L0