.. SPDX-License-Identifier: GPL-2.0

=====================================
Virtually Mapped Kernel Stack Support
=====================================

:Author: Shuah Khan <skhan@linuxfoundation.org>

.. contents:: :local:

Overview
--------
This is a compilation of information from the code and the original patch
series that introduced the `Virtually Mapped Kernel Stacks feature
<https://lwn.net/Articles/694348/>`_.

Introduction
------------

Kernel stack overflows are often hard to debug and can leave the kernel
susceptible to exploits. Problems may show up only at a later time, making
them difficult to isolate and root-cause.

Virtually mapped kernel stacks with guard pages cause kernel stack
overflows to be caught immediately rather than causing
difficult-to-diagnose corruption.

The HAVE_ARCH_VMAP_STACK and VMAP_STACK configuration options enable
support for virtually mapped stacks with guard pages. With this feature,
a stack overflow triggers a reliable fault. The usability of the stack
trace after an overflow, and the response to the overflow itself, are
architecture dependent.

.. note::
   As of this writing, arm64, powerpc, riscv, s390, um, and x86 have
   support for VMAP_STACK.

HAVE_ARCH_VMAP_STACK
--------------------

Architectures that can support virtually mapped kernel stacks should
select this bool configuration option. The requirements are:

- vmalloc space must be large enough to hold many kernel stacks. This
  may rule out many 32-bit architectures.
- Stacks in vmalloc space need to work reliably. For example, if
  vmap page tables are created on demand, either this mechanism
  needs to work while the stack points to a virtual address with
  unpopulated page tables, or arch code (most likely switch_to() and
  switch_mm()) needs to ensure that the stack's page table entries
  are populated before running on a possibly unpopulated stack.
- If the stack overflows into a guard page, something reasonable
  should happen. The definition of "reasonable" is flexible, but
  instantly rebooting without logging anything would be unfriendly.
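
To make the opt-in concrete, an architecture selects the option from its
own Kconfig while arch/Kconfig holds the definition. A minimal sketch of
how the pieces fit together (abridged; exact dependencies change between
kernel versions, so check the current tree):

.. code-block:: kconfig

   # arch/Kconfig: off by default, switched on only when an
   # architecture selects it
   config HAVE_ARCH_VMAP_STACK
           def_bool n

   # An architecture's Kconfig, e.g. arch/x86/Kconfig (abridged):
   # 64-bit x86 declares that it meets the requirements above
   config X86
           select HAVE_ARCH_VMAP_STACK if X86_64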

VMAP_STACK
----------

When enabled, the VMAP_STACK bool configuration option allocates virtually
mapped task stacks. This option depends on HAVE_ARCH_VMAP_STACK.

- Enable this if you want to use virtually-mapped kernel stacks
  with guard pages. This causes kernel stack overflows to be caught
  immediately rather than causing difficult-to-diagnose corruption.

.. note::

   Using this feature with KASAN requires architecture support
   for backing virtual mappings with real shadow memory, and
   KASAN_VMALLOC must be enabled.

.. note::

   With VMAP_STACK enabled, it is not possible to perform DMA on
   stack-allocated data.

Kernel configuration options and dependencies keep changing. Refer to
the latest code base:

`Kconfig <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/Kconfig>`_

Allocation
----------

When a new kernel thread is created, its thread stack is allocated from
physically non-contiguous pages obtained from the page-level allocator.
These pages are mapped into contiguous kernel virtual space with
PAGE_KERNEL protections.

alloc_thread_stack_node() calls __vmalloc_node_range() to allocate the
stack with PAGE_KERNEL protections.

- Allocated stacks are cached and later reused by new threads, so memcg
  accounting is performed manually when assigning stacks to and releasing
  them from tasks. Hence, __vmalloc_node_range() is called without
  __GFP_ACCOUNT.
- The vm_struct is cached so that the stack can be found when freeing is
  initiated from interrupt context; free_thread_stack() can be called in
  interrupt context.
- On arm64, all VMAP'd stacks need to have the same alignment to ensure
  that VMAP'd stack overflow detection works correctly. The arch-specific
  vmap stack allocator takes care of this detail.
- This does not address interrupt stacks, as noted in the original patch
  series.
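
Putting the points above together, the CONFIG_VMAP_STACK allocation path
looks roughly like the following kernel-internal sketch (simplified from
kernel/fork.c; the stack-cache fast path and error handling are omitted,
and exact arguments vary across kernel versions, so treat this as an
illustration rather than current source):

.. code-block:: c

   static int alloc_thread_stack_node(struct task_struct *tsk, int node)
   {
           void *stack;

           /*
            * __GFP_ACCOUNT is deliberately masked out: memcg accounting
            * is done by hand when a (possibly cached) stack is assigned
            * to or released from a task.
            */
           stack = __vmalloc_node_range(THREAD_SIZE, THREAD_ALIGN,
                                        VMALLOC_START, VMALLOC_END,
                                        THREADINFO_GFP & ~__GFP_ACCOUNT,
                                        PAGE_KERNEL, 0, node,
                                        __builtin_return_address(0));
           if (!stack)
                   return -ENOMEM;

           /*
            * Cache the vm_struct so free_thread_stack() can find it
            * even when called from interrupt context.
            */
           tsk->stack_vm_area = find_vm_area(stack);
           tsk->stack = stack;
           return 0;
   }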

Thread stack allocation is initiated from clone(), fork(), vfork() and
kernel_thread() via kernel_clone(). These entry points are useful hints
when searching the code base to understand when and how a thread stack
is allocated.

The bulk of the code is in:
`kernel/fork.c <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/fork.c>`_.

The stack_vm_area pointer in task_struct keeps track of the virtually
allocated stack, and a non-NULL stack_vm_area pointer serves as an
indication that virtually mapped kernel stacks are enabled.

::

   struct vm_struct *stack_vm_area;
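
Code that needs to distinguish a virtually mapped stack goes through this
pointer; the helper below mirrors task_stack_vm_area() from
include/linux/sched/task_stack.h (kernel-internal, shown for
illustration):

.. code-block:: c

   static inline struct vm_struct *
   task_stack_vm_area(const struct task_struct *t)
   {
   #ifdef CONFIG_VMAP_STACK
           return t->stack_vm_area;        /* non-NULL: stack is vmapped */
   #else
           return NULL;
   #endif
   }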

Stack overflow handling
-----------------------

Leading and trailing guard pages help detect stack overflows. When the
stack overflows into a guard page, handlers must be careful not to
overflow the stack again; by the time they are called, very little
stack space is likely to be left.

On x86, this is done by handling the page fault that indicates the
kernel stack overflow on the double-fault stack.

Testing VMAP allocation with guard pages
----------------------------------------

How do we ensure that VMAP_STACK is actually allocating stacks with
leading and trailing guard pages? The following lkdtm tests can help
detect regressions:

::

   void lkdtm_STACK_GUARD_PAGE_LEADING(void)
   void lkdtm_STACK_GUARD_PAGE_TRAILING(void)

Conclusions
-----------

- A percpu cache of vmalloced stacks appears to be a bit faster than a
  high-order stack allocation, at least when the cache hits.
- THREAD_INFO_IN_TASK gets rid of arch-specific thread_info entirely and
  simply embeds the thread_info (containing only flags) and 'int cpu' into
  task_struct.
- The thread stack can be freed as soon as the task is dead (without
  waiting for RCU), and then, if vmapped stacks are in use, the entire
  stack can be cached for reuse on the same cpu.