Documentation/security/self-protection.rst

0001 ======================
0002 Kernel Self-Protection
0003 ======================
0004
0005 Kernel self-protection is the design and implementation of systems and
0006 structures within the Linux kernel to protect against security flaws in
0007 the kernel itself. This covers a wide range of issues, including removing
0008 entire classes of bugs, blocking security flaw exploitation methods,
0009 and actively detecting attack attempts. Not all topics are explored in
0010 this document, but it should serve as a reasonable starting point and
0011 answer any frequently asked questions. (Patches welcome, of course!)
0012
0013 In the worst-case scenario, we assume an unprivileged local attacker
0014 has arbitrary read and write access to the kernel's memory. In many
0015 cases, bugs being exploited will not provide this level of access,
0016 but with systems in place that defend against the worst case we'll
0017 cover the more limited cases as well. A higher bar, and one that should
0018 still be kept in mind, is protecting the kernel against a *privileged*
0019 local attacker, since the root user has access to a vastly increased
0020 attack surface. (Especially when they have the ability to load arbitrary
0021 kernel modules.)
0022
0023 The goals for successful self-protection systems would be that they
0024 are effective, on by default, require no opt-in by developers, have no
0025 performance impact, do not impede kernel debugging, and have tests. It
0026 is uncommon that all these goals can be met, but it is worth explicitly
0027 mentioning them, since these aspects need to be explored, dealt with,
0028 and/or accepted.
0029
0030
0031 Attack Surface Reduction
0032 ========================
0033
0034 The most fundamental defense against security exploits is to reduce the
0035 areas of the kernel that can be used to redirect execution. This ranges
0036 from limiting the exposed APIs available to userspace, making in-kernel
0037 APIs hard to use incorrectly, minimizing the areas of writable kernel
0038 memory, etc.
0039
0040 Strict kernel memory permissions
0041 --------------------------------
0042
0043 When all of kernel memory is writable, it becomes trivial for attacks
0044 to redirect execution flow. To reduce the availability of these targets
0045 the kernel needs to protect its memory with a tight set of permissions.
0046
0047 Executable code and read-only data must not be writable
0048 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0049
0050 Any areas of the kernel with executable memory must not be writable.
0051 While this obviously includes the kernel text itself, we must consider
0052 all additional places too: kernel modules, JIT memory, etc. (There are
0053 temporary exceptions to this rule to support things like instruction
0054 alternatives, breakpoints, kprobes, etc. If these must exist in a
0055 kernel, they are implemented in a way where the memory is temporarily
0056 made writable during the update, and then returned to the original
0057 permissions.)
0058
0059 In support of this are ``CONFIG_STRICT_KERNEL_RWX`` and
0060 ``CONFIG_STRICT_MODULE_RWX``, which seek to make sure that code is not
0061 writable, data is not executable, and read-only data is neither writable
0062 nor executable.
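
The same discipline can be tried out from userspace. The following is only an analogy, not kernel code: it assumes a Linux system with ``mmap()``, ``mprotect()`` and ``/dev/zero``, and every ``demo_*`` name is hypothetical. It seals a page read-only after filling it, and mirrors the "temporary exception" pattern by making the page writable only for the duration of an update.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1	/* for MAP_ANONYMOUS under strict -std modes */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map one page, fill it, then drop write permission. */
void *demo_alloc_sealed(const void *src, size_t len)
{
	long page = sysconf(_SC_PAGESIZE);
	void *p;

	if (page <= 0 || len > (size_t)page)
		return NULL;
	p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;
	memcpy(p, src, len);
	if (mprotect(p, (size_t)page, PROT_READ) != 0) {
		munmap(p, (size_t)page);
		return NULL;
	}
	return p;
}

/* Make the page writable only while updating, then restore read-only. */
int demo_update_sealed(void *p, const void *src, size_t len)
{
	long page = sysconf(_SC_PAGESIZE);

	if (page <= 0 || len > (size_t)page)
		return -1;
	if (mprotect(p, (size_t)page, PROT_READ | PROT_WRITE) != 0)
		return -1;
	memcpy(p, src, len);
	return mprotect(p, (size_t)page, PROT_READ);
}

/*
 * Probe writability without faulting: ask the kernel to read() one byte
 * into addr. Returns 1 if writable (the byte at addr becomes 0), 0 if the
 * kernel reports EFAULT, and -1 on any other error.
 */
int demo_is_writable(void *addr)
{
	int fd = open("/dev/zero", O_RDONLY);
	ssize_t n;
	int saved;

	if (fd < 0)
		return -1;
	n = read(fd, addr, 1);
	saved = errno;
	close(fd);
	if (n == 1)
		return 1;
	return saved == EFAULT ? 0 : -1;
}
```

Note that between the two ``mprotect()`` calls in ``demo_update_sealed()`` the page is writable by every thread in the process, so this sketch is weaker than an update mechanism that grants write access to a single CPU only.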
0063
0064 Most architectures enable these options by default and do not let users
0065 deselect them. Architectures such as arm that want them to be selectable
0066 can select ``ARCH_OPTIONAL_KERNEL_RWX`` in the architecture Kconfig to
0067 enable a Kconfig prompt. ``CONFIG_ARCH_OPTIONAL_KERNEL_RWX_DEFAULT`` determines
0068 the default setting when ``ARCH_OPTIONAL_KERNEL_RWX`` is enabled.
0069
0070 Function pointers and sensitive variables must not be writable
0071 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0072
0073 Vast areas of kernel memory contain function pointers that are looked
0074 up by the kernel and used to continue execution (e.g. descriptor/vector
0075 tables, file/network/etc operation structures, etc). The number of these
0076 variables must be reduced to an absolute minimum.
0077
0078 Many such variables can be made read-only by declaring them "const"
0079 so that they live in the .rodata section instead of the .data section
0080 of the kernel, gaining the protection of the kernel's strict memory
0081 permissions as described above.
0082
0083 Variables that are initialized once at ``__init`` time can be marked
0084 with the ``__ro_after_init`` attribute.
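
As a rough illustration of the two cases, the sketch below uses a hypothetical ``struct demo_ops`` table and a hypothetical ``demo_limit`` variable. It is written so that it also compiles in userspace: there the ``__ro_after_init`` macro is defined as empty, so only the ``const`` table actually gains read-only placement in that setting.

```c
#include <stddef.h>

/* Userspace stand-in (assumption): in the kernel this is a real attribute. */
#ifndef __ro_after_init
#define __ro_after_init
#endif

/* Hypothetical operations table holding function pointers. */
struct demo_ops {
	int (*open)(int id);
	int (*close)(int id);
};

static int demo_open(int id)  { return id + 1; }
static int demo_close(int id) { return id - 1; }

/* const: the table, including its function pointers, lives in .rodata. */
static const struct demo_ops demo_fixed_ops = {
	.open  = demo_open,
	.close = demo_close,
};

/* Written exactly once during initialization, read-only afterwards. */
static int demo_limit __ro_after_init;

void demo_init(int limit)
{
	demo_limit = limit;
}

int demo_get_limit(void)
{
	return demo_limit;
}

const struct demo_ops *demo_get_ops(void)
{
	return &demo_fixed_ops;
}
```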
0085
0086 What remains are variables that are updated rarely (e.g. GDT). These
0087 will need another infrastructure (similar to the temporary exceptions
0088 made to kernel code mentioned above) that allows them to spend the rest
0089 of their lifetime read-only. (For example, when being updated, only the
0090 CPU thread performing the update would be given uninterruptible write
0091 access to the memory.)
0092
0093 Segregation of kernel memory from userspace memory
0094 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0095
0096 The kernel must never execute userspace memory. The kernel must also never
0097 access userspace memory without explicit expectation to do so. These
0098 rules can be enforced either by support of hardware-based restrictions
0099 (x86's SMEP/SMAP, ARM's PXN/PAN) or via emulation (ARM's Memory Domains).
0100 By blocking userspace memory in this way, execution and data parsing
0101 cannot be passed to trivially-controlled userspace memory, forcing
0102 attacks to operate entirely in kernel memory.
0103
0104 Reduced access to syscalls
0105 --------------------------
0106
0107 One trivial way to eliminate many syscalls for 64-bit systems is building
0108 without ``CONFIG_COMPAT``. However, this is rarely a feasible scenario.
0109
0110 The "seccomp" system provides an opt-in feature made available to
0111 userspace, which provides a way to reduce the number of kernel entry
0112 points available to a running process. This limits the breadth of kernel
0113 code that can be reached, possibly reducing the availability of a given
0114 bug to an attack.
0115
0116 An area of improvement would be creating viable ways to keep access to
0117 things like compat, user namespaces, BPF creation, and perf limited only
0118 to trusted processes. This would keep the scope of kernel entry points
0119 restricted to the more regular set normally available to unprivileged
0120 userspace.
0121
0122 Restricting access to kernel modules
0123 ------------------------------------
0124
0125 The kernel should never allow an unprivileged user the ability to
0126 load specific kernel modules, since that would provide a facility to
0127 unexpectedly extend the available attack surface. (The on-demand loading
0128 of modules via their predefined subsystems, e.g. MODULE_ALIAS_*, is
0129 considered "expected" here, though additional consideration should be
0130 given even to these.) For example, loading a filesystem module via an
0131 unprivileged socket API is nonsense: only the root or physically local
0132 user should trigger filesystem module loading. (And even this can be up
0133 for debate in some scenarios.)
0134
0135 To protect against even privileged users, systems may need to either
0136 disable module loading entirely (e.g. monolithic kernel builds or
0137 modules_disabled sysctl), or provide signed modules (e.g.
0138 ``CONFIG_MODULE_SIG_FORCE``, or dm-verity with LoadPin), to keep from having
0139 root load arbitrary kernel code via the module loader interface.
0140
0141
0142 Memory integrity
0143 ================
0144
0145 There are many memory structures in the kernel that are regularly abused
0146 to gain execution control during an attack. By far the most commonly
0147 understood is that of the stack buffer overflow in which the return
0148 address stored on the stack is overwritten. Many other examples of this
0149 kind of attack exist, and protections exist to defend against them.
0150
0151 Stack buffer overflow
0152 ---------------------
0153
0154 The classic stack buffer overflow involves writing past the expected end
0155 of a variable stored on the stack, ultimately writing a controlled value
0156 to the stack frame's stored return address. The most widely used defense
0157 is the presence of a stack canary between the stack variables and the
0158 return address (``CONFIG_STACKPROTECTOR``), which is verified just before
0159 the function returns. Other defenses include things like shadow stacks.
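
The canary check can be modeled in plain C. This is a simulation under stated assumptions, not what the compiler emits: the hypothetical ``struct demo_frame`` stands in for a real stack frame, and the copy is capped at the size of that struct so the demonstration itself stays within bounds.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BUF_LEN 8

/* Models a frame: local buffer, then the canary, then the return address. */
struct demo_frame {
	char buf[DEMO_BUF_LEN];
	uint64_t canary;
	uint64_t return_addr;
};

/*
 * Copies input into the frame without checking it against DEMO_BUF_LEN
 * (the bug being modeled), then verifies the canary before "returning".
 * Returns 0 if the canary is intact, -1 if an overflow was detected.
 */
int demo_call(uint64_t secret, const char *input, size_t len)
{
	struct demo_frame f;

	f.canary = secret;
	f.return_addr = 0x1234;

	if (len > sizeof(f))
		len = sizeof(f);	/* keep the simulation itself in bounds */
	memcpy(&f, input, len);		/* may run past buf into the canary */

	return f.canary == secret ? 0 : -1;
}
```

A linear overflow has to cross the canary to reach the return address, so the corruption is noticed before the overwritten address is used. A write that skips over the canary, or an attacker who already knows the secret, is not caught by this check.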
0160
0161 Stack depth overflow
0162 --------------------
0163
0164 A less well understood attack is using a bug that triggers the
0165 kernel to consume stack memory with deep function calls or large stack
0166 allocations. With this attack it is possible to write beyond the end of
0167 the kernel's preallocated stack space and into sensitive structures. Two
0168 important changes need to be made for better protections: moving the
0169 sensitive thread_info structure elsewhere, and adding a faulting memory
0170 hole at the bottom of the stack to catch these overflows.
0171
0172 Heap memory integrity
0173 ---------------------
0174
0175 The structures used to track heap free lists can be sanity-checked during
0176 allocation and freeing to make sure they aren't being used to manipulate
0177 other memory areas.
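
One possible shape of such a check, as a minimal sketch: a hypothetical fixed-size slab (``struct demo_slab``) whose free objects store the next free pointer in their first bytes. Before following or storing a pointer, the allocator verifies that it lands on an object boundary inside the slab. The names and the layout are assumptions made for illustration and do not describe any particular kernel allocator.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_OBJ_SIZE 32
#define DEMO_NR_OBJS  4

struct demo_slab {
	_Alignas(void *) unsigned char mem[DEMO_OBJ_SIZE * DEMO_NR_OBJS];
	void *freelist;		/* first free object, or NULL */
	int corrupted;		/* set when a sanity check fails */
};

/* A valid free-list pointer is object-aligned and inside the slab. */
static int demo_ptr_valid(const struct demo_slab *s, const void *p)
{
	uintptr_t base = (uintptr_t)s->mem;
	uintptr_t addr = (uintptr_t)p;

	if (addr < base || addr >= base + sizeof(s->mem))
		return 0;
	return (addr - base) % DEMO_OBJ_SIZE == 0;
}

void demo_slab_init(struct demo_slab *s)
{
	s->freelist = NULL;
	s->corrupted = 0;
	for (int i = DEMO_NR_OBJS - 1; i >= 0; i--) {
		void *obj = s->mem + (size_t)i * DEMO_OBJ_SIZE;

		memcpy(obj, &s->freelist, sizeof(void *));
		s->freelist = obj;
	}
}

void *demo_slab_alloc(struct demo_slab *s)
{
	void *obj = s->freelist;
	void *next;

	if (obj == NULL)
		return NULL;
	if (!demo_ptr_valid(s, obj))
		goto corrupt;
	memcpy(&next, obj, sizeof(next));
	if (next != NULL && !demo_ptr_valid(s, next))
		goto corrupt;	/* refuse to follow a wild pointer */
	s->freelist = next;
	return obj;
corrupt:
	s->corrupted = 1;
	s->freelist = NULL;
	return NULL;
}

int demo_slab_free(struct demo_slab *s, void *obj)
{
	/* Reject foreign pointers and the simplest double free. */
	if (!demo_ptr_valid(s, obj) || obj == s->freelist) {
		s->corrupted = 1;
		return -1;
	}
	memcpy(obj, &s->freelist, sizeof(void *));
	s->freelist = obj;
	return 0;
}
```

A real allocator would report or halt on a failed check rather than set a flag; the flag is used here only to keep the sketch testable.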
0178
0179 Counter integrity
0180 -----------------
0181
0182 Many places in the kernel use atomic counters to track object references
0183 or perform similar lifetime management. When these counters can be made
0184 to wrap (over or under) this traditionally exposes a use-after-free
0185 flaw. By trapping atomic wrapping, this class of bug vanishes.
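
One way to keep a counter from wrapping is to saturate it. The sketch below uses C11 atomics and hypothetical ``demo_ref_*`` names; it illustrates the idea and is not the kernel's implementation. A counter that reaches its maximum stays pinned there, which leaks the object instead of freeing it early, and a counter that has reached zero can no longer be incremented or decremented.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_REF_SATURATED UINT_MAX

struct demo_ref {
	atomic_uint count;
};

void demo_ref_init(struct demo_ref *r, unsigned int n)
{
	atomic_init(&r->count, n);
}

/* Take a reference. Returns false if the object was already released. */
bool demo_ref_get(struct demo_ref *r)
{
	unsigned int old = atomic_load(&r->count);

	do {
		if (old == 0)
			return false;	/* would resurrect a freed object */
		if (old == DEMO_REF_SATURATED)
			return true;	/* pinned: the value never changes */
	} while (!atomic_compare_exchange_weak(&r->count, &old, old + 1));
	return true;
}

/* Drop a reference. Returns true when the caller must free the object. */
bool demo_ref_put(struct demo_ref *r)
{
	unsigned int old = atomic_load(&r->count);

	do {
		if (old == 0 || old == DEMO_REF_SATURATED)
			return false;	/* underflow refused, or pinned */
	} while (!atomic_compare_exchange_weak(&r->count, &old, old - 1));
	return old == 1;
}
```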
0186
0187 Size calculation overflow detection
0188 -----------------------------------
0189
0190 Similar to counter overflow, integer overflows (usually size calculations)
0191 need to be detected at runtime to kill this class of bug, which
0192 traditionally leads to being able to write past the end of kernel buffers.
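
A minimal sketch of a checked size calculation, assuming a compiler that provides the GCC/Clang ``__builtin_mul_overflow()`` and ``__builtin_add_overflow()`` builtins. The ``demo_*`` helpers are hypothetical and compute ``header + n * elem``, the usual shape of a "struct plus trailing array" allocation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Stores header + n * elem in *bytes. Returns false on overflow. */
bool demo_flex_size(size_t header, size_t n, size_t elem, size_t *bytes)
{
	size_t body;

	if (__builtin_mul_overflow(n, elem, &body))
		return false;
	if (__builtin_add_overflow(header, body, bytes))
		return false;
	return true;
}

/* Allocation wrapper that fails instead of allocating a too-small buffer. */
void *demo_alloc_flex(size_t header, size_t n, size_t elem)
{
	size_t bytes;

	if (!demo_flex_size(header, n, elem, &bytes))
		return NULL;
	return malloc(bytes);
}
```

Without the check, a wrapped product would yield a small allocation while later code still iterates over ``n`` elements, which is the write past the end of the buffer described above.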
0193
0194
0195 Probabilistic defenses
0196 ======================
0197
0198 While many protections can be considered deterministic (e.g. read-only
0199 memory cannot be written to), some protections provide only statistical
0200 defense, in that an attack must gather enough information about a
0201 running system to overcome the defense. While not perfect, these do
0202 provide meaningful defenses.
0203
0204 Canaries, blinding, and other secrets
0205 -------------------------------------
0206
0207 It should be noted that things like the stack canary discussed earlier
0208 are technically statistical defenses, since they rely on a secret value,
0209 and such values may become discoverable through an information exposure
0210 flaw.
0211
0212 Blinding literal values for things like JITs, where the executable
0213 contents may be partially under the control of userspace, needs a similar
0214 secret value.
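
The idea behind blinding can be sketched as follows, with hypothetical names throughout. Rather than placing a user-supplied constant verbatim into executable memory, a JIT would emit the constant XORed with a secret key and then emit a second operation that XORs the key back in at run time. The sketch models only the arithmetic, not instruction emission, and takes the key as a parameter; a real key would have to come from a working random number generator.

```c
#include <stdint.h>

/* The two values that end up in the generated code instead of imm. */
struct demo_blinded {
	uint32_t masked;	/* imm ^ key: emitted as the literal */
	uint32_t key;		/* emitted as the operand of a later XOR */
};

struct demo_blinded demo_blind(uint32_t imm, uint32_t key)
{
	struct demo_blinded b = { .masked = imm ^ key, .key = key };

	return b;
}

/* What the generated code computes at run time. */
uint32_t demo_unblind(struct demo_blinded b)
{
	return b.masked ^ b.key;
}
```

The program still sees the original value, but an attacker who chose ``imm`` cannot predict the bytes that land in executable memory without knowing the key.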
0215
0216 It is critical that the secret values used are separate (e.g. a
0217 different canary per stack) and high entropy (e.g. is the RNG actually
0218 working?) in order to maximize their success.
0219
0220 Kernel Address Space Layout Randomization (KASLR)
0221 -------------------------------------------------
0222
0223 Since the location of kernel memory is almost always instrumental in
0224 mounting a successful attack, making the location non-deterministic
0225 raises the difficulty of an exploit. (Note that this in turn makes
0226 the value of information exposures higher, since they may be used to
0227 discover desired memory locations.)
0228
0229 Text and module base
0230 ~~~~~~~~~~~~~~~~~~~~
0231
0232 By relocating the physical and virtual base address of the kernel at
0233 boot-time (``CONFIG_RANDOMIZE_BASE``), attacks needing kernel code will be
0234 frustrated. Additionally, offsetting the module loading base address
0235 means that even systems that load the same set of modules in the same
0236 order every boot will not share a common base address with the rest of
0237 the kernel text.
0238
0239 Stack base
0240 ~~~~~~~~~~
0241
0242 If the base address of the kernel stack is not the same between processes,
0243 or even not the same between syscalls, targets on or beyond the stack
0244 become more difficult to locate.
0245
0246 Dynamic memory base
0247 ~~~~~~~~~~~~~~~~~~~
0248
0249 Much of the kernel's dynamic memory (e.g. kmalloc, vmalloc, etc) ends up
0250 being relatively deterministic in layout due to the order of early-boot
0251 initializations. If the base address of these areas is not the same
0252 between boots, targeting them is frustrated, requiring an information
0253 exposure specific to the region.
0254
0255 Structure layout
0256 ~~~~~~~~~~~~~~~~
0257
0258 By performing a per-build randomization of the layout of sensitive
0259 structures, attacks must either be tuned to known kernel builds or expose
0260 enough kernel memory to determine structure layouts before manipulating
0261 them.
0262
0263
0264 Preventing Information Exposures
0265 ================================
0266
0267 Since the locations of sensitive structures are the primary target for
0268 attacks, it is important to defend against exposure of both kernel memory
0269 addresses and kernel memory contents (since they may contain kernel
0270 addresses or other sensitive things like canary values).
0271
0272 Kernel addresses
0273 ----------------
0274
0275 Printing kernel addresses to userspace leaks sensitive information about
0276 the kernel memory layout. Care should be exercised when using any printk
0277 specifier that prints the raw address, currently %px, %p[ad], (and %p[sSb]
0278 in certain circumstances [*]).  Any file written to using one of these
0279 specifiers should be readable only by privileged processes.
0280
0281 Kernels 4.14 and older printed the raw address using %p. As of 4.15-rc1,
0282 addresses printed with the specifier %p are hashed before printing.
0283
0284 [*] If KALLSYMS is enabled and symbol lookup fails, the raw address is
0285 printed. If KALLSYMS is not enabled the raw address is printed.
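
To illustrate what "hashed before printing" means, here is a minimal sketch with hypothetical ``demo_*`` names. The address is combined with a secret key and passed through a mixing function, so the printed value is stable for a given pointer but does not reveal the address. The mixer below is a stand-in chosen for brevity: it is invertible and not cryptographically secure, and a real implementation would need a keyed cryptographic hash with a key generated at boot.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in 64-bit mixer (assumption): NOT a secure keyed hash. */
static uint64_t demo_mix64(uint64_t x)
{
	x ^= x >> 30;
	x *= 0xbf58476d1ce4e5b9ULL;
	x ^= x >> 27;
	x *= 0x94d049bb133111ebULL;
	x ^= x >> 31;
	return x;
}

/* Map a pointer to an opaque identifier that depends on a secret key. */
uint64_t demo_ptr_to_id(const void *p, uint64_t key)
{
	return demo_mix64((uint64_t)(uintptr_t)p ^ key);
}

/* Format the identifier, never the raw address, for output. */
int demo_format_ptr(char *buf, size_t len, const void *p, uint64_t key)
{
	return snprintf(buf, len, "%016llx",
			(unsigned long long)demo_ptr_to_id(p, key));
}
```

Because the same pointer always yields the same identifier, log lines that refer to one object can still be correlated with each other.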
0286
0287 Unique identifiers
0288 ------------------
0289
0290 Kernel memory addresses must never be used as identifiers exposed to
0291 userspace. Instead, use an atomic counter, an idr, or similar unique
0292 identifier.
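
A minimal sketch of the counter approach, using C11 atomics and hypothetical names. Each object receives its identifier from a global counter at initialization time, so the value handed to userspace says nothing about where the object lives in memory.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Global source of identifiers; starts at 1 so that 0 can mean "none". */
static atomic_uint_fast64_t demo_next_id = 1;

struct demo_object {
	uint64_t id;	/* safe to expose; unrelated to the object's address */
};

void demo_object_init(struct demo_object *obj)
{
	obj->id = atomic_fetch_add(&demo_next_id, 1);
}
```

Sequential identifiers do reveal how many objects have been created. If that matters, a different allocation scheme is needed; the point here is only that the identifier is not an address.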
0293
0294 Memory initialization
0295 ---------------------
0296
0297 Memory copied to userspace must always be fully initialized. If it is
0298 not explicitly cleared with memset(), this will require changes to the
0299 compiler to make sure structure holes are cleared.
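
The structure-hole problem can be shown with a hypothetical reply structure. On typical ABIs the compiler inserts padding between a one-byte member and a following eight-byte member, and assigning the members alone leaves that padding holding whatever was in memory before. Clearing the whole object first is the explicit fix:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical structure that would be copied out to userspace. */
struct demo_reply {
	uint8_t  kind;		/* typically followed by 7 bytes of padding */
	uint64_t value;
};

void demo_fill_reply(struct demo_reply *out, uint8_t kind, uint64_t value)
{
	/* Clear everything, padding included, before setting members. */
	memset(out, 0, sizeof(*out));
	out->kind = kind;
	out->value = value;
}
```

Strictly speaking, the C standard leaves the contents of padding unspecified after a member is stored, which is one reason compiler support is the more robust long-term answer.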
0300
0301 Memory poisoning
0302 ----------------
0303
0304 When releasing memory, it is best to poison the contents, to avoid reuse
0305 attacks that rely on the old contents of memory. For example, clear the
0306 stack on syscall return (``CONFIG_GCC_PLUGIN_STACKLEAK``) and wipe heap
0307 memory on free. This frustrates many uninitialized variable attacks, stack
0308 content exposures, heap content exposures, and use-after-free attacks.
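
Poison-on-free can be sketched with a tiny, hypothetical fixed-slot pool. The pool, the slot size and the poison byte are assumptions made for illustration. Releasing a slot overwrites its contents, so a later user of that slot, or a stale pointer into it, finds the poison pattern instead of the previous owner's data.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_POISON	0x6b	/* illustrative poison byte */
#define DEMO_SLOT_SIZE	16
#define DEMO_NR_SLOTS	2

/* Zero-initialize before first use, e.g. struct demo_pool p = {0}; */
struct demo_pool {
	unsigned char slots[DEMO_NR_SLOTS][DEMO_SLOT_SIZE];
	unsigned char in_use[DEMO_NR_SLOTS];
};

void *demo_pool_alloc(struct demo_pool *p)
{
	for (size_t i = 0; i < DEMO_NR_SLOTS; i++) {
		if (!p->in_use[i]) {
			p->in_use[i] = 1;
			return p->slots[i];
		}
	}
	return NULL;
}

/* Returns 0 on success, -1 for an unknown or already-free slot. */
int demo_pool_free(struct demo_pool *p, void *obj)
{
	for (size_t i = 0; i < DEMO_NR_SLOTS; i++) {
		if (obj == (void *)p->slots[i] && p->in_use[i]) {
			/* Wipe the old contents before the slot is reused. */
			memset(p->slots[i], DEMO_POISON, DEMO_SLOT_SIZE);
			p->in_use[i] = 0;
			return 0;
		}
	}
	return -1;
}
```

With a general-purpose allocator, a plain ``memset()`` immediately before ``free()`` may be removed by the compiler as a dead store, so a wipe that is guaranteed to happen needs a primitive the compiler will not optimize away.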
0309
0310 Destination tracking
0311 --------------------
0312
0313 To help kill classes of bugs that result in kernel addresses being
0314 written to userspace, the destination of writes needs to be tracked. If
0315 the buffer is destined for userspace (e.g. seq_file backed ``/proc`` files),
0316 it should automatically censor sensitive values.