0001 .. SPDX-License-Identifier: GPL-2.0
0002 
0003 ===============================
0004 Software Guard eXtensions (SGX)
0005 ===============================
0006 
0007 Overview
0008 ========
0009 
0010 Software Guard eXtensions (SGX) hardware enables user space applications
0011 to set aside private memory regions of code and data:
0012 
0013 * Privileged (ring-0) ENCLS functions orchestrate the construction of the
0014   regions.
0015 * Unprivileged (ring-3) ENCLU functions allow an application to enter and
0016   execute inside the regions.
0017 
0018 These memory regions are called enclaves. An enclave can only be entered at a
0019 fixed set of entry points. Each entry point can hold a single hardware thread
0020 at a time.  While the enclave is loaded from a regular binary file by using
0021 ENCLS functions, only the threads inside the enclave can access its memory. The
0022 CPU denies outside access to the region and encrypts its contents before they
0023 leave the last-level cache (LLC).
0024 
0025 SGX support can be determined by running
0026 
0027         ``grep sgx /proc/cpuinfo``
0028 
0029 SGX must both be supported in the processor and enabled by the BIOS.  If SGX
0030 appears to be unsupported on a system which has hardware support, ensure
0031 support is enabled in the BIOS.  If a BIOS presents a choice between "Enabled"
0032 and "Software Enabled" modes for SGX, choose "Enabled".
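
A plain ``grep sgx`` also matches related flags such as ``sgx_lc``. As a
minimal sketch, assuming the usual ``flags : ...`` line layout of
``/proc/cpuinfo``, the check can be made exact by comparing whole flag tokens.
The helper names ``cpu_flags`` and ``has_sgx`` are hypothetical.

```python
def cpu_flags(cpuinfo_text):
    """Collect the feature flags from the text of /proc/cpuinfo."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Only "flags : ..." lines carry the CPU feature list.
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def has_sgx(cpuinfo_text):
    """True only if the exact 'sgx' flag is listed (not just 'sgx_lc')."""
    return "sgx" in cpu_flags(cpuinfo_text)
```

Passing the contents of ``/proc/cpuinfo`` to ``has_sgx()`` reports whether the
kernel advertises the feature. A negative result on SGX-capable hardware
usually points back at the BIOS setting described above.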
0033 
0034 Enclave Page Cache
0035 ==================
0036 
0037 SGX utilizes an *Enclave Page Cache (EPC)* to store pages that are associated
0038 with an enclave. It is contained in a BIOS-reserved region of physical memory.
0039 Unlike pages used for regular memory, EPC pages can only be accessed from outside
0040 of the enclave during enclave construction with special, limited SGX instructions.
0041 
0042 Only a CPU executing inside an enclave can directly access enclave memory.
0043 However, a CPU executing inside an enclave may access normal memory outside the
0044 enclave.
0045 
0046 The kernel manages enclave memory similarly to how it treats device memory.
0047 
0048 Enclave Page Types
0049 ------------------
0050 
0051 **SGX Enclave Control Structure (SECS)**
0052    An enclave's address range, attributes and other global data are defined
0053    by this structure.
0054 
0055 **Regular (REG)**
0056    Regular EPC pages contain the code and data of an enclave.
0057 
0058 **Thread Control Structure (TCS)**
0059    Thread Control Structure pages define the entry points to an enclave and
0060    track the execution state of an enclave thread.
0061 
0062 **Version Array (VA)**
0063    Version Array pages contain 512 slots, each of which can contain a version
0064    number for a page evicted from the EPC.
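
The slot count follows from the page size. The sketch below assumes 4096-byte
EPC pages and 8-byte version slots (the slot size is an assumption, chosen
because it yields the 512 slots stated above) and estimates how many VA pages
are needed to hold the versions of a given number of evicted pages. The
function name is hypothetical.

```python
PAGE_SIZE = 4096                                # EPC page size in bytes
VA_SLOT_SIZE = 8                                # assumed bytes per version slot
VA_SLOTS_PER_PAGE = PAGE_SIZE // VA_SLOT_SIZE   # 512 slots per VA page


def va_pages_needed(evicted_pages):
    """Number of VA pages required to track `evicted_pages` evictions."""
    if evicted_pages < 0:
        raise ValueError("evicted_pages must be non-negative")
    # Ceiling division: a partially used VA page still occupies a full page.
    return -(-evicted_pages // VA_SLOTS_PER_PAGE)
```

For example, evicting 513 pages would need two VA pages under these
assumptions, since the first VA page is full after 512 evictions.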
0065 
0066 Enclave Page Cache Map
0067 ----------------------
0068 
0069 The processor tracks EPC pages in a hardware metadata structure called the
0070 *Enclave Page Cache Map (EPCM)*.  The EPCM contains an entry for each EPC page
0071 which describes the owning enclave, access rights and page type, among other
0072 things.
0073 
0074 EPCM permissions are separate from the normal page tables.  This prevents the
0075 kernel from, for instance, allowing writes to data which an enclave wishes to
0076 remain read-only.  EPCM permissions may only impose additional restrictions on
0077 top of normal x86 page permissions.
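
The "additional restrictions only" rule can be pictured as a bitwise AND of
the two permission sets. This is a toy model, not the hardware's actual
encoding: the bit values ``R``, ``W`` and ``X`` and the function name are
hypothetical.

```python
# Hypothetical permission bits, for illustration only.
R, W, X = 1, 2, 4


def effective_access(page_table_perms, epcm_perms):
    """An access is allowed only if both the page tables and the EPCM allow it."""
    return page_table_perms & epcm_perms
```

With this model, ``effective_access(R | W, R)`` yields ``R``: the kernel may
map the page writable, but the EPCM entry keeps it read-only. The reverse also
holds, so an EPCM entry cannot grant access that the page tables deny.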
0078 
0079 For all intents and purposes, the SGX architecture allows the processor to
0080 invalidate all EPCM entries at will.  This requires that software be prepared to
0081 handle an EPCM fault at any time.  In practice, this can happen on events like
0082 power transitions when the ephemeral key that encrypts enclave memory is lost.
0083 
0084 Application interface
0085 =====================
0086 
0087 Enclave build functions
0088 -----------------------
0089 
0090 In addition to the traditional compiler and linker build process, SGX has a
0091 separate enclave “build” process.  Enclaves must be built before they can be
0092 executed (entered). The first step in building an enclave is opening the
0093 **/dev/sgx_enclave** device.  Since enclave memory is protected from direct
0094 access, special privileged instructions are then used to copy data into enclave
0095 pages and establish enclave page permissions.
0096 
0097 .. kernel-doc:: arch/x86/kernel/cpu/sgx/ioctl.c
0098    :functions: sgx_ioc_enclave_create
0099                sgx_ioc_enclave_add_pages
0100                sgx_ioc_enclave_init
0101                sgx_ioc_enclave_provision
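
As a rough sketch of how user space addresses these ioctls, the request
numbers can be derived from the magic value, the command number and the
argument size. The constants and struct layouts below are assumptions
transcribed from a reading of ``arch/x86/include/uapi/asm/sgx.h`` and should be
verified against the header of the running kernel. ``pack_add_pages`` is a
hypothetical helper.

```python
import struct

SGX_MAGIC = 0xA4                 # assumed ioctl magic from the uapi header
_IOC_WRITE, _IOC_READ = 1, 2     # x86 ioctl direction bits


def _ioc(direction, nr, size):
    """Equivalent of the C _IOC() macro on x86 for the SGX magic."""
    return (direction << 30) | (size << 16) | (SGX_MAGIC << 8) | nr


# Assumed argument layouts (all fields little-endian, no padding).
CREATE_FMT = "<Q"       # struct sgx_enclave_create { __u64 src; }
ADD_PAGES_FMT = "<6Q"   # src, offset, length, secinfo, flags, count
INIT_FMT = "<Q"         # struct sgx_enclave_init { __u64 sigstruct; }
PROVISION_FMT = "<I"    # struct sgx_enclave_provision { __u32 fd; }

SGX_IOC_ENCLAVE_CREATE = _ioc(_IOC_WRITE, 0x00, struct.calcsize(CREATE_FMT))
SGX_IOC_ENCLAVE_ADD_PAGES = _ioc(_IOC_READ | _IOC_WRITE, 0x01,
                                 struct.calcsize(ADD_PAGES_FMT))
SGX_IOC_ENCLAVE_INIT = _ioc(_IOC_WRITE, 0x02, struct.calcsize(INIT_FMT))
SGX_IOC_ENCLAVE_PROVISION = _ioc(_IOC_WRITE, 0x03,
                                 struct.calcsize(PROVISION_FMT))


def pack_add_pages(src, offset, length, secinfo, flags=0):
    """Build the argument buffer for SGX_IOC_ENCLAVE_ADD_PAGES.

    The trailing `count` field is written by the kernel, so it starts at 0.
    """
    return struct.pack(ADD_PAGES_FMT, src, offset, length, secinfo, flags, 0)
```

A loader would open **/dev/sgx_enclave**, then issue the create, add-pages and
init requests in that order with ``fcntl.ioctl()``, passing buffers packed as
above.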
0102 
0103 Enclave runtime management
0104 --------------------------
0105 
0106 Systems supporting SGX2 additionally support changes to initialized
0107 enclaves: modifying enclave page permissions and type, and dynamically
0108 adding and removing enclave pages. When an enclave accesses an address
0109 within its address range that does not have a backing page, a new
0110 regular page is dynamically added to the enclave. The enclave is
0111 still required to run EACCEPT on the new page before it can be used.
0112 
0113 .. kernel-doc:: arch/x86/kernel/cpu/sgx/ioctl.c
0114    :functions: sgx_ioc_enclave_restrict_permissions
0115                sgx_ioc_enclave_modify_types
0116                sgx_ioc_enclave_remove_pages
0117 
0118 Enclave vDSO
0119 ------------
0120 
0121 Entering an enclave can only be done through SGX-specific EENTER and ERESUME
0122 functions, and is a non-trivial process.  Because of the complexity of
0123 transitioning to and from an enclave, enclaves typically utilize a library to
0124 handle the actual transitions.  This is roughly analogous to how glibc
0125 implementations are used by most applications to wrap system calls.
0126 
0127 Another crucial characteristic of enclaves is that they can generate exceptions
0128 as part of their normal operation that need to be handled in the enclave or are
0129 unique to SGX.
0130 
0131 Instead of the traditional signal mechanism to handle these exceptions, SGX
0132 can leverage special exception fixup provided by the vDSO.  The kernel-provided
0133 vDSO function wraps low-level transitions to/from the enclave like EENTER and
0134 ERESUME.  The vDSO function intercepts exceptions that would otherwise generate
0135 a signal and returns the fault information directly to its caller.  This avoids
0136 the need to juggle signal handlers.
0137 
0138 .. kernel-doc:: arch/x86/include/uapi/asm/sgx.h
0139    :functions: vdso_sgx_enter_enclave_t
0140 
0141 ksgxd
0142 =====
0143 
0144 SGX support includes a kernel thread called *ksgxd*.
0145 
0146 EPC sanitization
0147 ----------------
0148 
0149 ksgxd is started when SGX initializes.  Enclave memory is typically ready
0150 for use when the processor powers on or resets.  However, if SGX has been in
0151 use since the reset, enclave pages may be in an inconsistent state.  This might
0152 occur after a crash and kexec() cycle, for instance.  At boot, ksgxd
0153 reinitializes all enclave pages so that they can be allocated and re-used.
0154 
0155 The sanitization is done by going through EPC address space and applying the
0156 EREMOVE function to each physical page. Some enclave pages like SECS pages have
0157 hardware dependencies on other pages which prevent EREMOVE from functioning.
0158 Executing two EREMOVE passes removes the dependencies.
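
Why two passes are enough can be seen in a small simulation. This is a toy
model, not the kernel's implementation: pages are plain names,
``children_of`` is a hypothetical map from a SECS page to the pages that
depend on it, and a "removal" simply fails while any child is still present.

```python
def eremove_pass(pages, children_of):
    """Sweep `pages` in address order; return the pages that could not be removed."""
    remaining = set(pages)
    failed = []
    for page in pages:
        # A SECS page cannot be removed while a child page still exists.
        if any(child in remaining for child in children_of.get(page, ())):
            failed.append(page)
        else:
            remaining.discard(page)
    return failed


def sanitize(pages, children_of):
    """Run two passes; the second one only revisits pages that failed the first."""
    dirty = eremove_pass(pages, children_of)
    return eremove_pass(dirty, children_of)
```

If a SECS page sits at a lower address than its children, the first pass
skips it and removes the children. The second pass then finds the SECS page
without dependencies and removes it.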
0159 
0160 Page reclaimer
0161 --------------
0162 
0163 Similar to the core kswapd, ksgxd, is responsible for managing the
0164 overcommitment of enclave memory.  If the system runs out of enclave memory,
0165 *ksgxd* “swaps” enclave memory to normal memory.
0166 
0167 Launch Control
0168 ==============
0169 
0170 SGX provides a launch control mechanism. After all enclave pages have been
0171 copied, the kernel executes the EINIT function, which initializes the enclave.
0172 Only after this can the CPU execute inside the enclave.
0173 
0174 The EINIT function takes an RSA-3072 signature of the enclave measurement.  The
0175 function checks that the measurement is correct and that the signature was made
0176 with the key whose SHA256 hash is held in the four
0177 **IA32_SGXLEPUBKEYHASH{0, 1, 2, 3}** MSRs.
0178 
0179 Those MSRs can be configured by the BIOS to be either read-only or writable.
0180 Linux supports only the writable configuration in order to give full control to
0181 the kernel on launch control policy. Before calling the EINIT function, the
0182 driver sets the MSRs to match the enclave's signing key.
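
The four MSR values can be thought of as one SHA256 digest split into 64-bit
words. The sketch below assumes that the hash is taken over the 384-byte
RSA-3072 modulus as it is stored in the enclave signature structure, and that
the digest is split into four little-endian 64-bit words. Both points are
assumptions about the layout, and ``lepubkeyhash_values`` is a hypothetical
name.

```python
import hashlib
import struct

SGX_MODULUS_SIZE = 384   # bytes in an RSA-3072 modulus


def lepubkeyhash_values(modulus):
    """Return the four 64-bit words of SHA256(modulus), lowest word first."""
    if len(modulus) != SGX_MODULUS_SIZE:
        raise ValueError("expected a 384-byte RSA-3072 modulus")
    digest = hashlib.sha256(modulus).digest()   # 32 bytes
    # Assumed split: word i is the value for IA32_SGXLEPUBKEYHASHi.
    return struct.unpack("<4Q", digest)
```

Under these assumptions, the driver would write word ``i`` of the result to
the ``i``-th hash MSR before executing EINIT.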
0183 
0184 Encryption engines
0185 ==================
0186 
0187 In order to conceal the enclave data while it is out of the CPU package, the
0188 memory controller has an encryption engine to transparently encrypt and decrypt
0189 enclave memory.
0190 
0191 In CPUs prior to Ice Lake, the Memory Encryption Engine (MEE) is used to
0192 encrypt pages leaving the CPU caches. MEE uses an n-ary Merkle tree with its root
0193 in SRAM to maintain integrity of the encrypted data. This provides integrity and
0194 anti-replay protection but does not scale to large memory sizes because the time
0195 required to update the Merkle tree grows logarithmically in relation to the
0196 memory size.
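
The logarithmic growth can be illustrated by counting the levels of an n-ary
tree over a given number of protected blocks. This is a generic sketch, not a
description of the MEE's real geometry: the text gives neither the arity nor
the block size, so any values passed in are hypothetical.

```python
def tree_levels(leaves, arity):
    """Levels between the leaves and the root of a full n-ary tree."""
    if leaves < 1 or arity < 2:
        raise ValueError("need at least one leaf and an arity of two or more")
    levels = 0
    nodes = leaves
    while nodes > 1:
        # Each level groups up to `arity` nodes under one parent.
        nodes = -(-nodes // arity)
        levels += 1
    return levels
```

Every level is one more node to verify or update on the path to the root, so
the per-access cost rises each time the protected region outgrows the current
tree height.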
0197 
0198 CPUs starting from Ice Lake use Total Memory Encryption (TME) in the place of
0199 MEE. TME-based SGX implementations do not have an integrity Merkle tree, which
0200 means integrity and replay attacks are not mitigated.  However, TME includes
0201 additional changes to prevent ciphertext from being returned and software memory
0202 aliases from being created.
0203 
0204 DMA to enclave memory is blocked by range registers on both MEE and TME systems
0205 (SDM section 41.10).
0206 
0207 Usage Models
0208 ============
0209 
0210 Shared Library
0211 --------------
0212 
0213 Sensitive data and the code that acts on it are partitioned from the application
0214 into a separate library. The library is then linked as a DSO which can be loaded
0215 into an enclave. The application can then make individual function calls into
0216 the enclave through special SGX instructions. A run-time within the enclave is
0217 configured to marshal function parameters into and out of the enclave and to
0218 call the correct library function.
0219 
0220 Application Container
0221 ---------------------
0222 
0223 An application may be loaded into a container enclave which is specially
0224 configured with a library OS and run-time which permits the application to run.
0225 The enclave run-time and library OS work together to execute the application
0226 when a thread enters the enclave.
0227 
0228 Impact of Potential Kernel SGX Bugs
0229 ===================================
0230 
0231 EPC leaks
0232 ---------
0233 
0234 When EPC page leaks happen, a WARNING like this is shown in dmesg:
0235 
0236 "EREMOVE returned ... and an EPC page was leaked.  SGX may become unusable..."
0237 
0238 This is effectively a kernel use-after-free of an EPC page, and due
0239 to the way SGX works, the bug is detected at freeing. Rather than
0240 adding the page back to the pool of available EPC pages, the kernel
0241 intentionally leaks the page to avoid additional errors in the future.
0242 
0243 When this happens, the kernel will likely soon leak more EPC pages, and
0244 SGX will likely become unusable because the memory available to SGX is
0245 limited. However, while this may be fatal to SGX, the rest of the kernel
0246 is unlikely to be impacted and should continue to work.
0247 
0248 As a result, when this happens, users should stop running any new
0249 SGX workloads (or just any new workloads) and migrate all valuable
0250 workloads. Although a machine reboot can recover all EPC memory, the bug
0251 should be reported to Linux developers.
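
A monitoring script could watch the kernel log for this warning. The sketch
below matches only the fixed parts of the message quoted above and leaves the
elided part as a wildcard. The function name is hypothetical, and how the log
text is obtained (for example from ``dmesg`` output) is left to the caller.

```python
import re

# Matches the fixed parts of the warning; the elided details are a wildcard.
LEAK_RE = re.compile(r"EREMOVE returned .* and an EPC page was leaked")


def count_epc_leak_warnings(log_text):
    """Count kernel log lines that report a leaked EPC page."""
    return sum(1 for line in log_text.splitlines() if LEAK_RE.search(line))
```

Any nonzero count is a reason to stop scheduling new SGX workloads on the
machine and to report the bug.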
0252 
0253 
0254 Virtual EPC
0255 ===========
0256 
0257 The implementation also has a virtual EPC driver to support SGX enclaves
0258 in guests. Unlike the SGX driver, an EPC page allocated by the virtual
0259 EPC driver doesn't have a specific enclave associated with it. This is
0260 because KVM doesn't track how a guest uses EPC pages.
0261 
0262 As a result, the SGX core page reclaimer doesn't support reclaiming EPC
0263 pages allocated to KVM guests through the virtual EPC driver. If the
0264 user wants to deploy SGX applications both on the host and in guests
0265 on the same machine, the user should reserve enough EPC (by subtracting the
0266 total virtual EPC size of all SGX VMs from the physical EPC size) for
0267 host SGX applications so they can run with acceptable performance.
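
The reservation is simple arithmetic. A minimal sketch, with a hypothetical
function name and sizes given in bytes:

```python
def host_epc_bytes(physical_epc, guest_vepc_sizes):
    """EPC left for host enclaves after all guest virtual EPC is accounted for.

    A negative result means the guests alone could exhaust the physical EPC.
    """
    return physical_epc - sum(guest_vepc_sizes)
```

For instance, with 128 MiB of physical EPC and guests sized at 32 MiB and
64 MiB, 32 MiB remains for host SGX applications.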
0268 
0269 Architectural behavior is to restore all EPC pages to an uninitialized
0270 state also after a guest reboot.  Because this state can be reached only
0271 through the privileged ``ENCLS[EREMOVE]`` instruction, ``/dev/sgx_vepc``
0272 provides the ``SGX_IOC_VEPC_REMOVE_ALL`` ioctl to execute the instruction
0273 on all pages in the virtual EPC.
0274 
0275 ``EREMOVE`` can fail for three reasons.  Userspace must pay attention
0276 to expected failures and handle them as follows:
0277 
0278 1. Page removal will always fail when any thread is running in the
0279    enclave to which the page belongs.  In this case the ioctl will
0280    return ``EBUSY`` independent of whether it has successfully removed
0281    some pages; userspace can avoid these failures by preventing execution
0282    of any vcpu which maps the virtual EPC.
0283 
0284 2. Page removal will cause a general protection fault if two calls to
0285    ``EREMOVE`` happen concurrently for pages that refer to the same
0286    "SECS" metadata pages.  This can happen if there are concurrent
0287    invocations to ``SGX_IOC_VEPC_REMOVE_ALL``, or if a ``/dev/sgx_vepc``
0288    file descriptor in the guest is closed at the same time as
0289    ``SGX_IOC_VEPC_REMOVE_ALL``; it will also be reported as ``EBUSY``.
0290    This can be avoided in userspace by serializing calls to the ioctl()
0291    and to close(), but in general it should not be a problem.
0292 
0293 3. Finally, page removal will fail for SECS metadata pages which still
0294    have child pages.  Child pages can be removed by executing
0295    ``SGX_IOC_VEPC_REMOVE_ALL`` on all ``/dev/sgx_vepc`` file descriptors
0296    mapped into the guest.  This means that the ioctl() must be called
0297    twice: an initial set of calls to remove child pages and a subsequent
0298    set of calls to remove SECS pages.  The second set of calls is only
0299    required for those mappings that returned a nonzero value from the
0300    first call.  It indicates a bug in the kernel or the userspace client
0301    if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has
0302    a return code other than 0.
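
The two rounds described in item 3 can be sketched as follows. The request
number ``0xA404`` is an assumption derived from ``_IO(0xA4, 0x04)`` in the uapi
header and should be checked against the running kernel. The function names
are hypothetical, and the ioctl call is injectable so the logic can be
exercised without an SGX device. The caller is expected to have paused all
vcpus first. ``EBUSY`` surfaces as an ``OSError`` and is not handled here.

```python
SGX_IOC_VEPC_REMOVE_ALL = 0xA404   # assumed value of _IO(0xA4, 0x04)


def _ioctl_remove_all(fd):
    """Issue the ioctl; returns the count of pages that could not be removed."""
    import fcntl
    return fcntl.ioctl(fd, SGX_IOC_VEPC_REMOVE_ALL)


def vepc_remove_all(fds, remove_all=_ioctl_remove_all):
    """Reset every virtual EPC mapping of a guest in two rounds.

    Returns the descriptors that needed the second round.
    """
    # Round 1: removes child pages; SECS pages with children are left behind.
    retry = [fd for fd in fds if remove_all(fd) != 0]
    # Round 2: only the mappings that reported leftover pages are retried.
    for fd in retry:
        left = remove_all(fd)
        if left != 0:
            raise RuntimeError(
                "fd %d: %d pages still present after second round" % (fd, left))
    return retry
```

Running round 1 on all descriptors before starting round 2 matters: a SECS
page in one mapping may have child pages in another, so retrying a single
descriptor immediately could fail again.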