0001 .. SPDX-License-Identifier: GPL-2.0
0002 
0003 ==========================
0004 Page Table Isolation (PTI)
0005 ==========================
0006 
0007 Overview
0008 ========
0009 
0010 Page Table Isolation (pti, previously known as KAISER [1]_) is a
0011 countermeasure against attacks on the shared user/kernel address
0012 space such as the "Meltdown" approach [2]_.
0013 
0014 To mitigate this class of attacks, we create an independent set of
0015 page tables for use only when running userspace applications.  When
0016 the kernel is entered via syscalls, interrupts or exceptions, the
0017 page tables are switched to the full "kernel" copy.  When the system
0018 switches back to user mode, the user copy is used again.
0019 
0020 The userspace page tables contain only a minimal amount of kernel
0021 data: only what is needed to enter/exit the kernel such as the
0022 entry/exit functions themselves and the interrupt descriptor table
0023 (IDT).  There are a few strictly unnecessary things that get mapped
0024 such as the first C function when entering an interrupt (see
0025 comments in pti.c).
0026 
0027 This approach helps to ensure that side-channel attacks leveraging
0028 the paging structures do not function when PTI is enabled.  It can be
0029 enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time.
0030 Once enabled at compile time, it can be disabled at boot with the
0031 'nopti' or 'pti=off' kernel parameters (see kernel-parameters.txt).
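
On a running system, one way to check whether PTI ended up active is the Meltdown entry under sysfs.  The sketch below is illustrative only: the sysfs path and the "Mitigation: PTI" status string are not described in this document and are assumed from common kernel behavior, and both helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Assumed location of the Meltdown status file (not named in this document). */
#define MELTDOWN_STATUS_PATH "/sys/devices/system/cpu/vulnerabilities/meltdown"

/* Hypothetical helper: true if a status string such as "Mitigation: PTI" mentions PTI. */
bool status_reports_pti(const char *status)
{
	return status != NULL && strstr(status, "PTI") != NULL;
}

/* Hypothetical helper: read the first line of the status file into buf.
 * Returns 0 on success and -1 if the file is missing or unreadable. */
int read_meltdown_status(char *buf, size_t len)
{
	FILE *f;
	char *line;

	assert(buf != NULL && len > 1);
	f = fopen(MELTDOWN_STATUS_PATH, "r");
	if (f == NULL)
		return -1;
	line = fgets(buf, (int)len, f);
	fclose(f);
	if (line == NULL)
		return -1;
	buf[strcspn(buf, "\n")] = '\0';	/* strip the trailing newline */
	return 0;
}
```

A caller might read the status with read_meltdown_status() and pass it to status_reports_pti(); a false result means either that PTI is off or that the CPU is reported as not needing it.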
0032 
0033 Page Table Management
0034 =====================
0035 
0036 When PTI is enabled, the kernel manages two sets of page tables.
0037 The first set is very similar to the single set which is present in
0038 kernels without PTI.  This includes a complete mapping of userspace
0039 that the kernel can use for things like copy_to_user().
0040 
0041 Although *complete*, the user portion of the kernel page tables is
0042 crippled by setting the NX bit in the top level.  This ensures
0043 that any missed kernel->user CR3 switch will immediately crash
0044 userspace upon executing its first instruction.
0045 
0046 The userspace page tables map only the kernel data needed to enter
0047 and exit the kernel.  This data is entirely contained in the 'struct
0048 cpu_entry_area' structure, which is placed in the fixmap so that
0049 each CPU's copy of the area has a compile-time-fixed virtual address.
0050 
0051 For new userspace mappings, the kernel makes the entries in its
0052 page tables like normal.  The only difference is when the kernel
0053 makes entries in the top (PGD) level.  In addition to setting the
0054 entry in the main kernel PGD, a copy of the entry is made in the
0055 userspace page tables' PGD.
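
The top-level bookkeeping described above can be modeled in a few lines of userspace C.  This is a simplified sketch, not the kernel's code: the names pgd_pair and model_set_pgd are hypothetical, and the bit positions (present = bit 0, user = bit 2, NX = bit 63) and the 256-entry user half of a 512-entry PGD are assumed from the x86-64 4-level paging layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PTRS_PER_PGD   512u
#define USER_PGD_SLOTS 256u		/* assumed: lower half maps userspace */
#define PAGE_PRESENT   (1ULL << 0)
#define PAGE_USER      (1ULL << 2)
#define PAGE_NX        (1ULL << 63)

/* Hypothetical model of one process's two top-level tables. */
struct pgd_pair {
	uint64_t kernel[PTRS_PER_PGD];	/* full copy, used while in the kernel */
	uint64_t user[PTRS_PER_PGD];	/* minimal copy, used while in userspace */
};

static bool slot_maps_userspace(unsigned int idx)
{
	return idx < USER_PGD_SLOTS;
}

/* Install a top-level entry.  Entries that map userspace are mirrored into
 * the user copy; the kernel copy of such an entry additionally gets NX so
 * that user code cannot execute while the kernel page tables are loaded. */
void model_set_pgd(struct pgd_pair *p, unsigned int idx, uint64_t entry)
{
	assert(p != NULL && idx < PTRS_PER_PGD);
	if (slot_maps_userspace(idx)) {
		p->user[idx] = entry;
		if ((entry & (PAGE_PRESENT | PAGE_USER)) == (PAGE_PRESENT | PAGE_USER))
			entry |= PAGE_NX;
	}
	p->kernel[idx] = entry;
}
```

Because both copies of a userspace entry point at the same lower-level table, nothing below the PGD needs to be duplicated in this model.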
0056 
0057 This sharing at the PGD level also inherently shares all the lower
0058 layers of the page tables.  This leaves a single, shared set of
0059 userspace page tables to manage: one PTE to lock, one set of
0060 accessed bits, dirty bits, etc.
0061 
0062 Overhead
0063 ========
0064 
0065 Protection against side-channel attacks is important, but
0066 this protection comes at a cost:
0067 
0068 1. Increased Memory Use
0069 
0070   a. Each process now needs an order-1 PGD instead of order-0.
0071      (Consumes an additional 4k per process).
0072   b. The 'cpu_entry_area' structure must be 2MB in size and 2MB
0073      aligned so that it can be mapped by setting a single PMD
0074      entry.  This consumes nearly 2MB of RAM once the kernel
0075      is decompressed, but no space in the kernel image itself.
0076 
0077 2. Runtime Cost
0078 
0079   a. CR3 manipulation to switch between the page table copies
0080      must be done at interrupt, syscall, and exception entry
0081      and exit (though it can be skipped when the kernel itself is
0082      interrupted).  Moves to CR3 are on the order of a hundred
0083      cycles, and are required at every such entry and exit.
0084   b. A "trampoline" must be used for SYSCALL entry.  This
0085      trampoline depends on a smaller set of resources than the
0086      non-PTI SYSCALL entry code, so requires mapping fewer
0087      things into the userspace page tables.  The downside is
0088      that stacks must be switched at entry time.
0089   c. Global pages are disabled for all kernel structures not
0090      mapped into both kernel and userspace page tables.  This
0091      feature of the MMU allows different processes to share TLB
0092      entries mapping the kernel.  Losing the feature means more
0093      TLB misses after a context switch.  The actual loss of
0094      performance is very small, however, never exceeding 1%.
0095   d. Process Context IDentifiers (PCID) is a CPU feature that
0096      allows us to skip flushing the entire TLB when switching page
0097      tables by setting a special bit in CR3 when the page tables
0098      are changed.  This makes switching the page tables (at context
0099      switch, or kernel entry/exit) cheaper.  But, on systems with
0100      PCID support, the context switch code must flush both the user
0101      and kernel entries out of the TLB.  The user PCID TLB flush is
0102      deferred until the exit to userspace, minimizing the cost.
0103      See intel.com/sdm for the gory PCID/INVPCID details.
0104   e. The userspace page tables must be populated for each new
0105      process.  Even without PTI, the shared kernel mappings
0106      are created by copying top-level (PGD) entries into each
0107      new process.  But, with PTI, there are now *two* kernel
0108      mappings: one in the kernel page tables that maps everything
0109      and one for the entry/exit structures.  At fork(), we need to
0110      copy both.
0111   f. In addition to the fork()-time copying, there must also
0112      be an update to the userspace PGD any time a set_pgd() is done
0113      on a PGD used to map userspace.  This ensures that the kernel
0114      and userspace copies always map the same userspace
0115      memory.
0116   g. On systems without PCID support, each CR3 write flushes
0117      the entire TLB.  That means that each syscall, interrupt
0118      or exception flushes the TLB.
0119   h. INVPCID is a TLB-flushing instruction which allows flushing
0120      of TLB entries for non-current PCIDs.  Some systems support
0121      PCIDs, but do not support INVPCID.  On these systems, addresses
0122      can only be flushed from the TLB for the current PCID.  When
0123      flushing a kernel address, we need to flush all PCIDs, so a
0124      single kernel address flush will require a TLB-flushing CR3
0125      write upon the next use of every PCID.
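
To make items 1a, 2a and 2d more concrete, the sketch below models how the two CR3 values for one process might be related.  It is illustrative only.  It assumes that the kernel PGD occupies the first 4k page of the order-1 allocation and the user PGD the second (so the two addresses differ in bit 12), that the user copy runs under a PCID distinguished by bit 11, and that bit 63 of a CR3 write is the "special bit" that suppresses the TLB flush.  None of these bit positions is stated in this document, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define USER_PGTABLE_BIT (1ULL << 12)	/* assumed: selects the second 4k page */
#define USER_PCID_BIT    (1ULL << 11)	/* assumed: marks the user PCID */
#define CR3_NOFLUSH      (1ULL << 63)	/* assumed: keep TLB entries on write */

/* Derive the CR3 value to load on exit to userspace. */
uint64_t user_cr3_from_kernel(uint64_t kernel_cr3)
{
	/* An 8k-aligned PGD and a kernel PCID leave both bits clear. */
	assert((kernel_cr3 & (USER_PGTABLE_BIT | USER_PCID_BIT)) == 0);
	return kernel_cr3 | USER_PGTABLE_BIT | USER_PCID_BIT;
}

/* Derive the CR3 value to load on entry from userspace. */
uint64_t kernel_cr3_from_user(uint64_t user_cr3)
{
	return user_cr3 & ~(USER_PGTABLE_BIT | USER_PCID_BIT);
}

/* Request that the CR3 write does not flush the TLB (PCID systems only). */
uint64_t cr3_without_flush(uint64_t cr3)
{
	return cr3 | CR3_NOFLUSH;
}
```

Under these assumptions, a kernel CR3 of 0x12346001 (PGD at 0x12346000, PCID 1) corresponds to a user CR3 of 0x12347801, so switching between the two copies is a matter of flipping two bits rather than looking up a second pointer.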
0126 
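The cost described in item 2h can be illustrated with a small model of deferred flushing.  This is a sketch under stated assumptions, not the kernel's implementation: the tlb_model structure, its field names and the choice of four PCIDs are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_PCIDS 4u	/* arbitrary size chosen for the model */

struct tlb_model {
	bool has_invpcid;		/* can non-current PCIDs be flushed directly? */
	unsigned int current;		/* PCID currently loaded in CR3 */
	bool stale[NR_PCIDS];		/* PCIDs that may hold an outdated entry */
	unsigned long full_flushes;	/* count of TLB-flushing CR3 writes */
};

/* Flush one kernel address.  The current PCID is handled immediately in
 * both cases; without INVPCID every other PCID is only marked stale. */
void flush_kernel_address(struct tlb_model *t)
{
	assert(t != NULL);
	if (t->has_invpcid)
		return;
	for (unsigned int i = 0; i < NR_PCIDS; i++)
		if (i != t->current)
			t->stale[i] = true;
}

/* Switch PCIDs.  A stale PCID pays for a flushing CR3 write on first use. */
void switch_pcid(struct tlb_model *t, unsigned int next)
{
	assert(t != NULL && next < NR_PCIDS);
	t->current = next;
	if (t->stale[next]) {
		t->stale[next] = false;
		t->full_flushes++;
	}
}
```

In this model a single flush_kernel_address() call without INVPCID leads to one flushing CR3 write for each other PCID the next time it is used, while the same call with INVPCID leads to none.
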
0127 Possible Future Work
0128 ====================
0129 1. We can be more careful about not actually writing to CR3
0130    unless its value is actually changed.
0131 2. Allow PTI to be enabled/disabled at runtime in addition to the
0132    boot-time switching.
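
Item 1 could take roughly the following shape.  This is only a sketch of the idea: the cpu_model structure and its cached CR3 value are hypothetical, and the model ignores any TLB flush that a skipped write might otherwise have been relied on to perform.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-CPU state: the last value written to CR3. */
struct cpu_model {
	uint64_t cr3;
	unsigned long cr3_writes;	/* how many writes were actually issued */
};

/* Write CR3 only when the value differs from the cached one.
 * Returns true if a write was issued. */
bool write_cr3_if_changed(struct cpu_model *cpu, uint64_t new_cr3)
{
	assert(cpu != NULL);
	if (cpu->cr3 == new_cr3)
		return false;
	cpu->cr3 = new_cr3;
	cpu->cr3_writes++;
	return true;
}
```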
0133 
0134 Testing
0135 ========
0136 
0137 To test stability of PTI, the following test procedure is recommended,
0138 ideally doing all of these in parallel:
0139 
0140 1. Set CONFIG_DEBUG_ENTRY=y
0141 2. Run several copies of all of the tools/testing/selftests/x86/ tests
0142    (excluding MPX and protection_keys) in a loop on multiple CPUs for
0143    several minutes.  These tests frequently uncover corner cases in the
0144    kernel entry code.  In general, old kernels might cause these tests
0145    themselves to crash, but they should never crash the kernel.
0146 3. Run the 'perf' tool in a mode (top or record) that generates
0147    frequent performance-monitoring non-maskable interrupts (see "NMI"
0148    in /proc/interrupts).  This exercises the NMI entry/exit code which
0149    is known to trigger bugs in code paths that did not expect to be
0150    interrupted, including nested NMIs.  Using "-c" boosts the rate of
0151    NMIs, and using two -c with separate counters encourages nested NMIs
0152    and less deterministic behavior.
0153    ::
0154 
0155         while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done
0156 
0157 4. Launch a KVM virtual machine.
0158 5. Run 32-bit binaries on systems supporting the SYSCALL instruction.
0159    This has been a lightly-tested code path and needs extra scrutiny.
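
While the perf loop from step 3 runs, it can be useful to confirm that NMIs are actually arriving.  The sketch below sums the per-CPU counters on the "NMI" row of /proc/interrupts.  The function names are hypothetical, and the row layout (a label, one decimal counter per CPU, then a description) is assumed.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sum the per-CPU counters on an "NMI:" row.
 * Returns -1 if the line is not the NMI row. */
long long sum_nmi_row(const char *line)
{
	long long total = 0;

	assert(line != NULL);
	while (isspace((unsigned char)*line))
		line++;
	if (strncmp(line, "NMI:", 4) != 0)
		return -1;
	line += 4;
	for (;;) {
		char *end;
		unsigned long long count = strtoull(line, &end, 10);

		if (end == line)	/* reached the textual description */
			break;
		total += (long long)count;
		line = end;
	}
	return total;
}

/* Scan /proc/interrupts for the NMI row; -1 if it cannot be found.
 * Rows longer than the buffer (very large CPU counts) are not handled. */
long long read_nmi_total(void)
{
	char line[8192];
	long long total = -1;
	FILE *f = fopen("/proc/interrupts", "r");

	if (f == NULL)
		return -1;
	while (fgets(line, sizeof(line), f) != NULL) {
		total = sum_nmi_row(line);
		if (total >= 0)
			break;
	}
	fclose(f);
	return total;
}
```

Calling read_nmi_total() before and after a perf run and comparing the two values gives a rough NMI count for that interval.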
0160 
0161 Debugging
0162 =========
0163 
0164 Bugs in PTI cause a few different signatures of crashes
0165 that are worth noting here.
0166 
0167  * Failures of the selftests/x86 code.  Usually a bug in one of the
0168    more obscure corners of entry_64.S.
0169  * Crashes in early boot, especially around CPU bringup.  Bugs
0170    in the trampoline code or mappings cause these.
0171  * Crashes at the first interrupt.  Caused by bugs in entry_64.S,
0172    like screwing up a page table switch.  Also caused by
0173    incorrectly mapping the IRQ handler entry code.
0174  * Crashes at the first NMI.  The NMI code is separate from main
0175    interrupt handlers and can have bugs that do not affect
0176    normal interrupts.  Also caused by incorrectly mapping NMI
0177    code.  NMIs that interrupt the entry code must be very
0178    careful and can be the cause of crashes that show up when
0179    running perf.
0180  * Kernel crashes at the first exit to userspace.  entry_64.S
0181    bugs, or failing to map some of the exit code.
0182  * Crashes at the first interrupt that interrupts userspace.  The paths
0183    in entry_64.S that return to userspace are sometimes separate
0184    from the ones that return to the kernel.
0185  * Double faults: overflowing the kernel stack because of page
0186    faults upon page faults.  Caused by touching non-pti-mapped
0187    data in the entry code, or forgetting to switch to kernel
0188    CR3 before calling into C functions which are not pti-mapped.
0189  * Userspace segfaults early in boot, sometimes manifesting
0190    as mount(8) failing to mount the rootfs.  These have
0191    tended to be TLB invalidation issues.  Usually invalidating
0192    the wrong PCID, or otherwise missing an invalidation.
0193 
0194 .. [1] https://gruss.cc/files/kaiser.pdf
0195 .. [2] https://meltdownattack.com/meltdown.pdf