.. SPDX-License-Identifier: GPL-2.0
.. Copyright (C) 2020, Google LLC.

Kernel Electric-Fence (KFENCE)
==============================

Kernel Electric-Fence (KFENCE) is a low-overhead sampling-based memory safety
error detector. KFENCE detects heap out-of-bounds access, use-after-free, and
invalid-free errors.

KFENCE is designed to be enabled in production kernels, and has near zero
performance overhead. Compared to KASAN, KFENCE gives up precision in exchange
for performance. The main motivation behind KFENCE's design is that, with
enough total uptime, KFENCE will detect bugs in code paths not typically
exercised by non-production test workloads. One way to quickly accumulate a
large enough total uptime is to deploy the tool across a large fleet of
machines.

Usage
-----

To enable KFENCE, configure the kernel with::

    CONFIG_KFENCE=y

To build a kernel with KFENCE support, but disabled by default (to enable, set
``kfence.sample_interval`` to a non-zero value), configure the kernel with::

    CONFIG_KFENCE=y
    CONFIG_KFENCE_SAMPLE_INTERVAL=0

KFENCE provides several other configuration options to customize behaviour (see
the respective help text in ``lib/Kconfig.kfence`` for more info).

Tuning performance
~~~~~~~~~~~~~~~~~~

The most important parameter is KFENCE's sample interval, which can be set via
the kernel boot parameter ``kfence.sample_interval`` in milliseconds. The
sample interval determines the frequency with which heap allocations will be
guarded by KFENCE. The default is configurable via the Kconfig option
``CONFIG_KFENCE_SAMPLE_INTERVAL``. Setting ``kfence.sample_interval=0``
disables KFENCE.
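
As a rough illustration of how the boot parameter and the Kconfig default
interact, the following Python sketch extracts ``kfence.sample_interval`` from
a kernel command line string. The helper names and the fallback value of 100 ms
are assumptions made for this example; the fallback stands in for whatever
``CONFIG_KFENCE_SAMPLE_INTERVAL`` was set to at build time. This is not kernel
code.

```python
def kfence_sample_interval(cmdline, kconfig_default_ms=100):
    """Return the effective sample interval in ms; 0 means KFENCE is disabled.

    kconfig_default_ms stands in for CONFIG_KFENCE_SAMPLE_INTERVAL and is an
    assumption of this sketch.
    """
    interval = kconfig_default_ms
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "kfence.sample_interval" and sep:
            # In this sketch a later occurrence overrides an earlier one.
            interval = int(value)
    return interval


def kfence_enabled(cmdline, kconfig_default_ms=100):
    """KFENCE is treated as enabled whenever the interval is non-zero."""
    return kfence_sample_interval(cmdline, kconfig_default_ms) != 0
```

For example, ``kfence_enabled("quiet kfence.sample_interval=0")`` evaluates to
``False``, matching the rule that an interval of zero disables KFENCE.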

The sample interval controls a timer that sets up KFENCE allocations. By
default, to keep the real sample interval predictable, the normal timer also
causes CPU wake-ups when the system is completely idle. This may be undesirable
on power-constrained systems. The boot parameter ``kfence.deferrable=1``
instead switches to a "deferrable" timer which does not force CPU wake-ups on
idle systems, at the risk of unpredictable sample intervals. The default is
configurable via the Kconfig option ``CONFIG_KFENCE_DEFERRABLE``.

.. warning::
   The KUnit test suite is very likely to fail when using a deferrable timer
   since it currently causes very unpredictable sample intervals.

The KFENCE memory pool is of fixed size, and if the pool is exhausted, no
further KFENCE allocations occur. With ``CONFIG_KFENCE_NUM_OBJECTS`` (default
255), the number of available guarded objects can be controlled. Each object
requires 2 pages, one for the object itself and the other one used as a guard
page; object pages are interleaved with guard pages, and every object page is
therefore surrounded by two guard pages.

The total memory dedicated to the KFENCE memory pool can be computed as::

    ( #objects + 1 ) * 2 * PAGE_SIZE

Using the default config, and assuming a page size of 4 KiB, results in
dedicating 2 MiB to the KFENCE memory pool.
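
The arithmetic above can be checked with a few lines of Python. The function
name is made up for this example; the defaults are the ones quoted in the text
(255 objects, 4 KiB pages).

```python
def kfence_pool_bytes(num_objects=255, page_size=4096):
    """Total pool size in bytes: (#objects + 1) * 2 * PAGE_SIZE."""
    return (num_objects + 1) * 2 * page_size


# Default configuration with 4 KiB pages.
print(kfence_pool_bytes() // (1024 * 1024), "MiB")  # prints "2 MiB"
```

The same formula shows how the footprint scales: with a hypothetical 64 KiB
page size and the default object count, the pool would take 32 MiB.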

Note: On architectures that support huge pages, KFENCE will ensure that the
pool is using pages of size ``PAGE_SIZE``. This will result in additional page
tables being allocated.

Error reports
~~~~~~~~~~~~~

A typical out-of-bounds access looks like this::

    ==================================================================
    BUG: KFENCE: out-of-bounds read in test_out_of_bounds_read+0xa6/0x234

    Out-of-bounds read at 0xffff8c3f2e291fff (1B left of kfence-#72):
     test_out_of_bounds_read+0xa6/0x234
     kunit_try_run_case+0x61/0xa0
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x176/0x1b0
     ret_from_fork+0x22/0x30

    kfence-#72: 0xffff8c3f2e292000-0xffff8c3f2e29201f, size=32, cache=kmalloc-32

    allocated by task 484 on cpu 0 at 32.919330s:
     test_alloc+0xfe/0x738
     test_out_of_bounds_read+0x9b/0x234
     kunit_try_run_case+0x61/0xa0
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x176/0x1b0
     ret_from_fork+0x22/0x30

    CPU: 0 PID: 484 Comm: kunit_try_catch Not tainted 5.13.0-rc3+ #7
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
    ==================================================================

The header of the report provides a short summary of the function involved in
the access. It is followed by more detailed information about the access and
its origin. Note that real kernel addresses are only shown when using the
kernel command line option ``no_hash_pointers``.
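
When reports are collected from many machines, it can help to pull the summary
out of the header line mechanically. The following Python sketch does so with a
regular expression written against the example headers in this document; the
pattern and the function name are assumptions of this example, not a format
guaranteed by the kernel.

```python
import re

# Matches e.g. "BUG: KFENCE: out-of-bounds read in func+0xa6/0x234".
HEADER_RE = re.compile(
    r"BUG: KFENCE: (?P<kind>.+?) in "
    r"(?P<func>[^\s+]+)\+(?P<offset>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
)


def parse_kfence_header(line):
    """Return the error kind and faulting function, or None if no match."""
    m = HEADER_RE.search(line)
    if m is None:
        return None
    return {
        "kind": m["kind"],
        "function": m["func"],
        "offset": int(m["offset"], 16),  # offset of the access in the function
        "size": int(m["size"], 16),      # total size of the function
    }
```

Grouping reports by the ``kind`` and ``function`` fields is one simple way to
deduplicate occurrences of the same bug.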

Use-after-free accesses are reported as::

    ==================================================================
    BUG: KFENCE: use-after-free read in test_use_after_free_read+0xb3/0x143

    Use-after-free read at 0xffff8c3f2e2a0000 (in kfence-#79):
     test_use_after_free_read+0xb3/0x143
     kunit_try_run_case+0x61/0xa0
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x176/0x1b0
     ret_from_fork+0x22/0x30

    kfence-#79: 0xffff8c3f2e2a0000-0xffff8c3f2e2a001f, size=32, cache=kmalloc-32

    allocated by task 488 on cpu 2 at 33.871326s:
     test_alloc+0xfe/0x738
     test_use_after_free_read+0x76/0x143
     kunit_try_run_case+0x61/0xa0
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x176/0x1b0
     ret_from_fork+0x22/0x30

    freed by task 488 on cpu 2 at 33.871358s:
     test_use_after_free_read+0xa8/0x143
     kunit_try_run_case+0x61/0xa0
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x176/0x1b0
     ret_from_fork+0x22/0x30

    CPU: 2 PID: 488 Comm: kunit_try_catch Tainted: G B 5.13.0-rc3+ #7
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
    ==================================================================

KFENCE also reports on invalid frees, such as double-frees::

    ==================================================================
    BUG: KFENCE: invalid free in test_double_free+0xdc/0x171

    Invalid free of 0xffff8c3f2e2a4000 (in kfence-#81):
     test_double_free+0xdc/0x171
     kunit_try_run_case+0x61/0xa0
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x176/0x1b0
     ret_from_fork+0x22/0x30

    kfence-#81: 0xffff8c3f2e2a4000-0xffff8c3f2e2a401f, size=32, cache=kmalloc-32

    allocated by task 490 on cpu 1 at 34.175321s:
     test_alloc+0xfe/0x738
     test_double_free+0x76/0x171
     kunit_try_run_case+0x61/0xa0
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x176/0x1b0
     ret_from_fork+0x22/0x30

    freed by task 490 on cpu 1 at 34.175348s:
     test_double_free+0xa8/0x171
     kunit_try_run_case+0x61/0xa0
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x176/0x1b0
     ret_from_fork+0x22/0x30

    CPU: 1 PID: 490 Comm: kunit_try_catch Tainted: G B 5.13.0-rc3+ #7
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
    ==================================================================

KFENCE also uses pattern-based redzones on the other side of an object's guard
page, to detect out-of-bounds writes on the unprotected side of the object.
These are reported on frees::

    ==================================================================
    BUG: KFENCE: memory corruption in test_kmalloc_aligned_oob_write+0xef/0x184

    Corrupted memory at 0xffff8c3f2e33aff9 [ 0xac . . . . . . ] (in kfence-#156):
     test_kmalloc_aligned_oob_write+0xef/0x184
     kunit_try_run_case+0x61/0xa0
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x176/0x1b0
     ret_from_fork+0x22/0x30

    kfence-#156: 0xffff8c3f2e33afb0-0xffff8c3f2e33aff8, size=73, cache=kmalloc-96

    allocated by task 502 on cpu 7 at 42.159302s:
     test_alloc+0xfe/0x738
     test_kmalloc_aligned_oob_write+0x57/0x184
     kunit_try_run_case+0x61/0xa0
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x176/0x1b0
     ret_from_fork+0x22/0x30

    CPU: 7 PID: 502 Comm: kunit_try_catch Tainted: G B 5.13.0-rc3+ #7
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
    ==================================================================

For such errors, the address where the corruption occurred as well as the
invalidly written bytes (offset from the address) are shown; in this
representation, '.' denotes an untouched byte. In the example above ``0xac`` is
the value written to the invalid address at offset 0, and the remaining '.'
denote that no following bytes have been touched. Note that real values are
only shown if the kernel was booted with ``no_hash_pointers``; to avoid
information disclosure otherwise, '!' is used instead to denote invalidly
written bytes.
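
To make the byte notation concrete, the following Python sketch decodes the
bracketed part of a "Corrupted memory at" line into a mapping from byte offset
to written value. It is written against the example line in this document; the
function name is an assumption of this example, and hidden values ('!') are
represented as ``None``.

```python
def parse_corrupted_bytes(report_line):
    """Map byte offsets (relative to the reported address) to written values."""
    start = report_line.index("[") + 1
    end = report_line.index("]")
    written = {}
    for offset, token in enumerate(report_line[start:end].split()):
        if token == ".":
            continue  # untouched redzone byte
        # '!' marks a corrupted byte whose value is hidden (hashed pointers).
        written[offset] = None if token == "!" else int(token, 16)
    return written
```

Applied to the example above, the result is ``{0: 0xac}``: a single byte at
offset 0 from ``0xffff8c3f2e33aff9`` was overwritten.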

And finally, KFENCE may also report on invalid accesses to any protected page
where it was not possible to determine an associated object, e.g. if adjacent
object pages had not yet been allocated::

    ==================================================================
    BUG: KFENCE: invalid read in test_invalid_access+0x26/0xe0

    Invalid read at 0xffffffffb670b00a:
     test_invalid_access+0x26/0xe0
     kunit_try_run_case+0x51/0x85
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x137/0x160
     ret_from_fork+0x22/0x30

    CPU: 4 PID: 124 Comm: kunit_try_catch Tainted: G W 5.8.0-rc6+ #7
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
    ==================================================================

DebugFS interface
~~~~~~~~~~~~~~~~~

Some debugging information is exposed via debugfs:

* The file ``/sys/kernel/debug/kfence/stats`` provides runtime statistics.

* The file ``/sys/kernel/debug/kfence/objects`` provides a list of objects
  allocated via KFENCE, including those already freed but protected.
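
For monitoring, the statistics file can be turned into a dictionary with a
short script. The sketch below assumes that the file consists of
``name: value`` lines with integer values; this document does not specify the
format, so both the layout and the counter names in the sample are assumptions
made for illustration.

```python
def parse_kfence_stats(text):
    """Parse 'name: value' lines into a dict, skipping anything else."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        if sep and value.strip().isdigit():
            stats[name.strip()] = int(value)
    return stats


# Hypothetical sample contents, used here instead of reading the real file.
sample = "enabled: 1\ncurrently allocated: 3\ntotal bugs: 0\n"
print(parse_kfence_stats(sample))
```

On a running system the input would come from reading
``/sys/kernel/debug/kfence/stats``, which typically requires root privileges
and a mounted debugfs.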

Implementation Details
----------------------

Guarded allocations are set up based on the sample interval. After expiration
of the sample interval, the next allocation through the main allocator (SLAB or
SLUB) returns a guarded allocation from the KFENCE object pool (allocation
sizes up to PAGE_SIZE are supported). At this point, the timer is reset, and
the next allocation is set up after the expiration of the interval.
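
The sampling behaviour described above can be modelled in a few lines of
Python. This is a toy model under stated assumptions, not the kernel's
implementation: time is given as a sorted list of allocation timestamps in
milliseconds, the timer is assumed to be armed at time zero, and pool
exhaustion is ignored.

```python
def sampled_allocations(alloc_times_ms, sample_interval_ms):
    """Return the indices of the allocations this toy model would guard."""
    if sample_interval_ms == 0:
        return []  # an interval of zero disables sampling
    guarded = []
    next_sample_at = sample_interval_ms  # assumption: timer armed at t = 0
    for index, now in enumerate(alloc_times_ms):
        if now >= next_sample_at:
            # First allocation after expiry is served from the KFENCE pool.
            guarded.append(index)
            # The timer is reset at this point.
            next_sample_at = now + sample_interval_ms
    return guarded
```

The model makes one property easy to see: no matter how many allocations
happen, at most one allocation per interval is guarded, which is what keeps the
overhead low.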

When using ``CONFIG_KFENCE_STATIC_KEYS=y``, KFENCE allocations are "gated"
through the main allocator's fast-path by relying on static branches via the
static keys infrastructure. The static branch is toggled to redirect the
allocation to KFENCE. Depending on sample interval, target workloads, and
system architecture, this may perform better than the simple dynamic branch.
Careful benchmarking is recommended.

KFENCE objects each reside on a dedicated page, at either the left or right
page boundaries selected at random. The pages to the left and right of the
object page are "guard pages", whose attributes are changed to a protected
state, and cause page faults on any attempted access. Such page faults are then
intercepted by KFENCE, which handles the fault gracefully by reporting an
out-of-bounds access, and marking the page as accessible so that the faulting
code can (wrongly) continue executing (set ``panic_on_warn`` to panic instead).

To detect out-of-bounds writes to memory within the object's page itself,
KFENCE also uses pattern-based redzones. For each object page, a redzone is set
up for all non-object memory. For typical alignments, the redzone is only
required on the unguarded side of an object. Because KFENCE must honor the
cache's requested alignment, special alignments may result in unprotected gaps
on either side of an object, all of which are redzoned.

The following figure illustrates the page layout::

    ---+-----------+-----------+-----------+-----------+-----------+---
       | xxxxxxxxx | O :       | xxxxxxxxx |       : O | xxxxxxxxx |
       | xxxxxxxxx | B :       | xxxxxxxxx |       : B | xxxxxxxxx |
       | x GUARD x | J : RED-  | x GUARD x | RED-  : J | x GUARD x |
       | xxxxxxxxx | E :  ZONE | xxxxxxxxx |  ZONE : E | xxxxxxxxx |
       | xxxxxxxxx | C :       | xxxxxxxxx |       : C | xxxxxxxxx |
       | xxxxxxxxx | T :       | xxxxxxxxx |       : T | xxxxxxxxx |
    ---+-----------+-----------+-----------+-----------+-----------+---
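
The placement shown in the figure can be worked through numerically. The Python
sketch below places an object at the left or right boundary of its page, honors
a power-of-two alignment by rounding the start address down, and returns the
resulting redzone ranges. The function name, the 4 KiB page size, and the
alignment of 8 used in the usage note are assumptions of this example.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def place_object(page_addr, size, align, right):
    """Return (start, left_redzone, right_redzone); ranges are [begin, end)."""
    if right:
        # Push the object towards the right guard page, then align down.
        start = (page_addr + PAGE_SIZE - size) & ~(align - 1)
    else:
        start = page_addr  # page-aligned, so any smaller alignment holds
    end = start + size
    left_redzone = (page_addr, start)
    right_redzone = (end, page_addr + PAGE_SIZE)
    return start, left_redzone, right_redzone
```

With ``page_addr=0xffff8c3f2e33a000``, ``size=73``, ``align=8`` and
``right=True``, the object starts at ``0xffff8c3f2e33afb0`` and a 7-byte gap
from ``0xffff8c3f2e33aff9`` to the end of the page remains. This agrees with
the memory corruption report shown earlier, where the object spans
``0xffff8c3f2e33afb0-0xffff8c3f2e33aff8`` and the corrupted redzone byte is at
``0xffff8c3f2e33aff9``.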

Upon deallocation of a KFENCE object, the object's page is again protected and
the object is marked as freed. Any further access to the object causes a fault
and KFENCE reports a use-after-free access. Freed objects are inserted at the
tail of KFENCE's freelist, so that the least recently freed objects are reused
first, and the chances of detecting use-after-frees of recently freed objects
are increased.
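
The effect of inserting freed objects at the tail of the freelist can be seen
in a small model. The class below is a hypothetical illustration built on a
double-ended queue; it only tracks object indices and does not model page
protection.

```python
from collections import deque


class ToyFreelist:
    """FIFO freelist: allocate from the head, return freed slots to the tail."""

    def __init__(self, num_objects):
        self.free = deque(range(num_objects))

    def alloc(self):
        # Returns None when the pool is exhausted.
        return self.free.popleft() if self.free else None

    def release(self, index):
        # A just-freed slot goes to the tail, so it is reused last.
        self.free.append(index)
```

With three slots, allocating slot 0 and freeing it immediately means the next
allocations receive slots 1 and 2 before slot 0 is handed out again. During
that time, slot 0 stays protected and a stale access to it can still be caught.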

If pool utilization reaches 75% (default) or above, KFENCE limits currently
covered allocations of the same source from further filling up the pool. This
reduces the risk of the pool eventually being fully occupied by allocated
objects, while still ensuring diverse coverage of allocations. The "source" of
an allocation is based on its partial allocation stack trace. A side-effect is
that this also limits frequent long-lived allocations (e.g. pagecache) of the
same source filling up the pool permanently, which is the most common risk for
the pool becoming full and the sampled allocation rate dropping to zero. The
threshold at which to start limiting currently covered allocations can be
configured via the boot parameter ``kfence.skip_covered_thresh`` (pool usage%).
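
The decision described above can be summarized as a predicate. The following
Python sketch is a simplification under stated assumptions: the "source" is
represented by an arbitrary hashable value and the currently covered sources by
a plain set, whereas the text only says that the source is derived from a
partial allocation stack trace. All names are hypothetical.

```python
def should_skip(allocated, num_objects, source, covered_sources,
                skip_covered_thresh=75):
    """Return True if a new guarded allocation from `source` would be skipped.

    skip_covered_thresh is the pool usage in percent, mirroring the boot
    parameter kfence.skip_covered_thresh.
    """
    usage_percent = allocated * 100 // num_objects
    return usage_percent >= skip_covered_thresh and source in covered_sources
```

Below the threshold every source is admitted; at or above it, only sources that
are not yet represented in the pool are admitted, which preserves the remaining
capacity for allocations that add coverage.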

Interface
---------

The following describes the functions which are used by allocators as well as
page handling code to set up and deal with KFENCE allocations.

.. kernel-doc:: include/linux/kfence.h
   :functions: is_kfence_address
               kfence_shutdown_cache
               kfence_alloc kfence_free __kfence_free
               kfence_ksize kfence_object_start
               kfence_handle_page_fault
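
To give an intuition for the kind of question ``is_kfence_address`` answers,
the following Python model checks whether an address falls into the pool and,
if so, which object page it belongs to. It is a sketch and not the kernel
implementation. The pool layout (two leading pages, then alternating object and
guard pages) and the pool base address in the usage note are assumptions; they
were chosen because they are consistent with the object numbers and addresses
in the example reports above.

```python
PAGE_SIZE = 4096    # assumption: 4 KiB pages
NUM_OBJECTS = 255   # default CONFIG_KFENCE_NUM_OBJECTS
POOL_SIZE = (NUM_OBJECTS + 1) * 2 * PAGE_SIZE


def in_kfence_pool(addr, pool_base):
    """Range check: does addr lie inside the KFENCE pool?"""
    return pool_base <= addr < pool_base + POOL_SIZE


def object_index(addr, pool_base):
    """Return the object index for addr, or None for guard pages and misses.

    Assumed layout: object i lives in the page at
    pool_base + (i + 1) * 2 * PAGE_SIZE.
    """
    if not in_kfence_pool(addr, pool_base):
        return None
    page = (addr - pool_base) // PAGE_SIZE
    if page < 2 or page % 2:
        return None  # leading pages and odd pages are guard pages here
    return page // 2 - 1
```

Assuming a pool base of ``0xffff8c3f2e200000``, the address
``0xffff8c3f2e292000`` maps to object 72 and ``0xffff8c3f2e33afb0`` to object
156, while ``0xffff8c3f2e291fff`` falls into a guard page.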

Related Tools
-------------

In userspace, a similar approach is taken by `GWP-ASan
<http://llvm.org/docs/GwpAsan.html>`_. GWP-ASan also relies on guard pages and
a sampling strategy to detect memory unsafety bugs at scale. KFENCE's design is
directly influenced by GWP-ASan, and can be seen as its kernel sibling. Another
similar but non-sampling approach, which also inspired the name "KFENCE", can be
found in the userspace `Electric Fence Malloc Debugger
<https://linux.die.net/man/3/efence>`_.

In the kernel, several tools exist to debug memory access errors, and in
particular KASAN can detect all bug classes that KFENCE can detect. While KASAN
is more precise, relying on compiler instrumentation, this comes at a
performance cost.

It is worth highlighting that KASAN and KFENCE are complementary, with
different target environments. For instance, KASAN is the better debugging aid
when test cases or reproducers exist: because KFENCE has a lower chance of
detecting the error, debugging with it would require more effort. Deployments
at scale that cannot afford to enable KASAN, however, would benefit from using
KFENCE to discover bugs in code paths not exercised by test cases or fuzzers.