Documentation/livepatch/reliable-stacktrace.rst

0001 ===================
0002 Reliable Stacktrace
0003 ===================
0004
0005 This document outlines basic information about reliable stacktracing.
0006
0007 .. Table of Contents:
0008
0009 .. contents:: :local:
0010
0011 1. Introduction
0012 ===============
0013
0014 The kernel livepatch consistency model relies on accurately identifying which
0015 functions may have live state and therefore may not be safe to patch. One way
0016 to identify which functions are live is to use a stacktrace.
0017
0018 Existing stacktrace code may not always give an accurate picture of all
0019 functions with live state, and best-effort approaches which can be helpful for
0020 debugging are unsound for livepatching. Livepatching depends on architectures
0021 to provide a *reliable* stacktrace which ensures it never omits any live
0022 functions from a trace.
0023
0024
0025 2. Requirements
0026 ===============
0027
0028 Architectures must implement one of the reliable stacktrace functions.
0029 Architectures using CONFIG_ARCH_STACKWALK must implement
0030 'arch_stack_walk_reliable', and other architectures must implement
0031 'save_stack_trace_tsk_reliable'.
0032
0033 Principally, the reliable stacktrace function must ensure that either:
0034
0035 * The trace includes all functions that the task may be returned to, and the
0036   return code is zero to indicate that the trace is reliable.
0037
0038 * The return code is non-zero to indicate that the trace is not reliable.
0039
0040 .. note::
0041    In some cases it is legitimate to omit specific functions from the trace,
0042    but all other functions must be reported. These cases are described in
0043    further detail below.
0044
0045 Secondly, the reliable stacktrace function must be robust to cases where
0046 the stack or other unwind state is corrupt or otherwise unreliable. The
0047 function should attempt to detect such cases and return a non-zero error
0048 code, and should not get stuck in an infinite loop or access memory in
0049 an unsafe way.  Specific cases are described in further detail below.
0050
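As a rough illustration of both requirements, the following user-space sketch
walks a chain of frame records held in a simulated stack. The 'struct
frame_record' layout, the 'walk_reliable' name, the 'STACK_END' marker, and
the use of array indices in place of frame pointers are all assumptions made
for this example and are not kernel interfaces. The sketch returns zero only
when every record up to the terminator was reported, and returns non-zero
when the chain leaves the stack, fails to make forward progress, or does not
fit in the output buffer.

.. code-block:: c

   #include <stddef.h>

   /* Hypothetical frame record, using array indices in place of pointers. */
   struct frame_record {
       size_t next;        /* index of the caller's record */
       unsigned long ret;  /* return address saved in this record */
   };

   /* Hypothetical marker stored in the outermost record. */
   #define STACK_END ((size_t)-1)

   /*
    * Walk the records starting at index 'fp' and store each return address
    * in 'out'. Returns 0 and sets '*count' if the walk reached STACK_END,
    * or -1 if the trace cannot be trusted.
    */
   static int walk_reliable(const struct frame_record *stack, size_t nr_records,
                            size_t fp, unsigned long *out, size_t max,
                            size_t *count)
   {
       size_t n = 0;

       while (fp != STACK_END) {
           if (fp >= nr_records)   /* record lies outside the stack */
               return -1;
           if (n == max)           /* a truncated trace would omit entries */
               return -1;
           out[n++] = stack[fp].ret;

           /* Records must strictly ascend, which rules out cycles. */
           if (stack[fp].next != STACK_END && stack[fp].next <= fp)
               return -1;
           fp = stack[fp].next;
       }

       *count = n;
       return 0;
   }

Requiring each record to sit strictly above the previous one bounds the number
of iterations by the size of the stack, so a corrupt chain cannot make the
walk loop forever.
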
0051
0052 3. Compile-time analysis
0053 ========================
0054
0055 To ensure that kernel code can be correctly unwound in all cases,
0056 architectures may need to verify that code has been compiled in a manner
0057 expected by the unwinder. For example, an unwinder may expect that
0058 functions manipulate the stack pointer in a limited way, or that all
0059 functions use specific prologue and epilogue sequences. Architectures
0060 with such requirements should verify the kernel compilation using
0061 objtool.
0062
0063 In some cases, an unwinder may require metadata to correctly unwind.
0064 Where necessary, this metadata should be generated at build time using
0065 objtool.
0066
0067
0068 4. Considerations
0069 =================
0070
0071 The unwinding process varies across architectures, their respective procedure
0072 call standards, and kernel configurations. This section describes common
0073 details that architectures should consider.
0074
0075 4.1 Identifying successful termination
0076 --------------------------------------
0077
0078 Unwinding may terminate early for a number of reasons, including:
0079
0080 * Stack or frame pointer corruption.
0081
0082 * Missing unwind support for an uncommon scenario, or a bug in the unwinder.
0083
0084 * Dynamically generated code (e.g. eBPF) or foreign code (e.g. EFI runtime
0085   services) not following the conventions expected by the unwinder.
0086
0087 To ensure that this does not result in functions being omitted from the trace,
0088 even if not caught by other checks, it is strongly recommended that
0089 architectures verify that a stacktrace ends at an expected location, e.g.
0090
0091 * Within a specific function that is an entry point to the kernel.
0092
0093 * At a specific location on a stack expected for a kernel entry point.
0094
0095 * On a specific stack expected for a kernel entry point (e.g. if the
0096   architecture has separate task and IRQ stacks).
0097
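A termination check along these lines could be sketched as follows. The
'struct unwind_end' and 'struct expected_end' types and the
'check_termination' helper are hypothetical names introduced for this
example; a real architecture would derive the expected values from its own
entry code and stack layout.

.. code-block:: c

   #include <stdint.h>

   /* Hypothetical state recorded when the unwinder stops. */
   struct unwind_end {
       uintptr_t fp;   /* frame pointer of the last frame visited */
       uintptr_t pc;   /* address reported for the last frame */
   };

   /* Hypothetical description of where a complete trace must end. */
   struct expected_end {
       uintptr_t final_fp;     /* location of the outermost frame record */
       uintptr_t entry_start;  /* start of the kernel entry function */
       uintptr_t entry_end;    /* end of the kernel entry function (exclusive) */
   };

   /*
    * Returns 0 if the unwind stopped at the expected frame and inside the
    * expected entry function, or -1 if it stopped anywhere else.
    */
   static int check_termination(const struct unwind_end *end,
                                const struct expected_end *exp)
   {
       if (end->fp != exp->final_fp)
           return -1;
       if (end->pc < exp->entry_start || end->pc >= exp->entry_end)
           return -1;
       return 0;
   }

An unwind that stops for any other reason fails both comparisons or one of
them, so it is reported as unreliable rather than silently accepted as a
short trace.
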
0098 4.2 Identifying unwindable code
0099 -------------------------------
0100
0101 Unwinding typically relies on code following specific conventions (e.g.
0102 manipulating a frame pointer), but there can be code which may not follow these
0103 conventions and may require special handling in the unwinder, e.g.
0104
0105 * Exception vectors and entry assembly.
0106
0107 * Procedure Linkage Table (PLT) entries and veneer functions.
0108
0109 * Trampoline assembly (e.g. ftrace, kprobes).
0110
0111 * Dynamically generated code (e.g. eBPF, optprobe trampolines).
0112
0113 * Foreign code (e.g. EFI runtime services).
0114
0115 To ensure that such cases do not result in functions being omitted from a
0116 trace, it is strongly recommended that architectures positively identify code
0117 which is known to be reliable to unwind from, and reject unwinding from all
0118 other code.
0119
0120 Kernel code including modules and eBPF can be distinguished from foreign code
0121 using '__kernel_text_address()'. Checking for this also helps to detect stack
0122 corruption.
0123
0124 There are several ways an architecture may identify kernel code which is deemed
0125 unreliable to unwind from, e.g.
0126
0127 * Placing such code into special linker sections, and rejecting unwinding from
0128   any code in these sections.
0129
0130 * Identifying specific portions of code using bounds information.
0131
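The positive-identification approach can be sketched as a lookup over address
ranges, where anything not found in the table is rejected. The 'struct
text_range' type and the 'pc_is_unwindable' helper are hypothetical and exist
only for this example; in the kernel the ranges would come from linker
section bounds or similar information rather than a hand-written table.

.. code-block:: c

   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>

   /* Hypothetical range of text, tagged by whether it is safe to unwind. */
   struct text_range {
       uintptr_t start;  /* first address in the range */
       uintptr_t end;    /* one past the last address in the range */
       bool reliable;    /* true if code here follows unwinder conventions */
   };

   /*
    * Returns true only if 'pc' falls inside a range that is explicitly
    * marked reliable. Unknown addresses are treated as unreliable.
    */
   static bool pc_is_unwindable(const struct text_range *ranges, size_t nr,
                                uintptr_t pc)
   {
       for (size_t i = 0; i < nr; i++) {
           if (pc >= ranges[i].start && pc < ranges[i].end)
               return ranges[i].reliable;
       }
       return false;  /* not known text: reject */
   }

Defaulting to rejection means that newly added trampolines or foreign code
are treated as unreliable until they are explicitly described.
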
0132 4.3 Unwinding across interrupts and exceptions
0133 ----------------------------------------------
0134
0135 At function call boundaries the stack and other unwind state are expected to be
0136 in a consistent state suitable for reliable unwinding, but this may not be the
0137 case part-way through a function. For example, during a function prologue or
0138 epilogue a frame pointer may be transiently invalid, or during the function
0139 body the return address may be held in an arbitrary general purpose register.
0140 For some architectures this may change at runtime as a result of dynamic
0141 instrumentation.
0142
0143 If an interrupt or other exception is taken while the stack or other unwind
0144 state is in an inconsistent state, it may not be possible to reliably unwind,
0145 and it may not be possible to identify whether such unwinding will be reliable.
0146 See below for examples.
0147
0148 Architectures which cannot identify when it is reliable to unwind such cases
0149 (or where it is never reliable) must reject unwinding across exception
0150 boundaries. Note that it may be reliable to unwind across certain
0151 exceptions (e.g. IRQ) but unreliable to unwind across other exceptions
0152 (e.g. NMI).
0153
0154 Architectures which can identify when it is reliable to unwind such cases (or
0155 have no such cases) should attempt to unwind across exception boundaries, as
0156 doing so can prevent unnecessarily stalling livepatch consistency checks and
0157 permits livepatch transitions to complete more quickly.
0158
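One way to picture this policy is a pass over the frames of a trace, each
tagged with how it was entered. The 'enum frame_kind' values, the
'check_boundaries' helper, and the 'irq_unwind_ok' flag are hypothetical and
only model the decision; they do not describe how any architecture records
exception frames.

.. code-block:: c

   #include <stdbool.h>
   #include <stddef.h>

   /* Hypothetical tag describing how each frame in a trace was entered. */
   enum frame_kind {
       FRAME_CALL,  /* ordinary function call */
       FRAME_IRQ,   /* interrupt taken part-way through a function */
       FRAME_NMI,   /* NMI taken part-way through a function */
   };

   /*
    * Returns 0 if every boundary in the trace is one the architecture can
    * unwind across, or -1 if the trace must be reported as unreliable.
    * 'irq_unwind_ok' states whether unwinding across an IRQ is known to be
    * reliable; this sketch never trusts any other kind of exception.
    */
   static int check_boundaries(const enum frame_kind *kinds, size_t nr,
                               bool irq_unwind_ok)
   {
       for (size_t i = 0; i < nr; i++) {
           switch (kinds[i]) {
           case FRAME_CALL:
               break;
           case FRAME_IRQ:
               if (!irq_unwind_ok)
                   return -1;
               break;
           default:
               return -1;
           }
       }
       return 0;
   }
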
0159 4.4 Rewriting of return addresses
0160 ---------------------------------
0161
0162 Some trampolines temporarily modify the return address of a function in order
0163 to intercept when that function returns with a return trampoline, e.g.
0164
0165 * An ftrace trampoline may modify the return address so that function graph
0166   tracing can intercept returns.
0167
0168 * A kprobes (or optprobes) trampoline may modify the return address so that
0169   kretprobes can intercept returns.
0170
0171 When this happens, the original return address will not be in its usual
0172 location. For trampolines which are not subject to live patching, where an
0173 unwinder can reliably determine the original return address and no unwind state
0174 is altered by the trampoline, the unwinder may report the original return
0175 address in place of the trampoline and report this as reliable. Otherwise, an
0176 unwinder must report these cases as unreliable.
0177
0178 Special care is required when identifying the original return address, as this
0179 information is not in a consistent location for the duration of the entry
0180 trampoline or return trampoline. For example, considering the x86_64
0181 'return_to_handler' return trampoline:
0182
0183 .. code-block:: none
0184
0185    SYM_CODE_START(return_to_handler)
0186            UNWIND_HINT_EMPTY
0187            subq  $24, %rsp
0188
0189            /* Save the return values */
0190            movq %rax, (%rsp)
0191            movq %rdx, 8(%rsp)
0192            movq %rbp, %rdi
0193
0194            call ftrace_return_to_handler
0195
0196            movq %rax, %rdi
0197            movq 8(%rsp), %rdx
0198            movq (%rsp), %rax
0199            addq $24, %rsp
0200            JMP_NOSPEC rdi
0201    SYM_CODE_END(return_to_handler)
0202
0203 While the traced function runs, its return address on the stack points to
0204 the start of return_to_handler, and the original return address is stored in
0205 the task's cur_ret_stack. During this time the unwinder can find the return
0206 address using ftrace_graph_ret_addr().
0207
0208 When the traced function returns to return_to_handler, there is no longer a
0209 return address on the stack, though the original return address is still stored
0210 in the task's cur_ret_stack. Within ftrace_return_to_handler(), the original
0211 return address is removed from cur_ret_stack and is transiently moved
0212 arbitrarily by the compiler before being returned in rax. The return_to_handler
0213 trampoline moves this into rdi before jumping to it.
0214
0215 Architectures might not always be able to unwind such sequences, such as when
0216 ftrace_return_to_handler() has removed the address from cur_ret_stack, and the
0217 location of the return address cannot be reliably determined.
0218
0219 It is recommended that architectures unwind cases where return_to_handler has
0220 not yet been returned to, but architectures are not required to unwind from the
0221 middle of return_to_handler and can report this as unreliable. Architectures
0222 are not required to unwind from other trampolines which modify the return
0223 address.
0224
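The recommended case, where the traced function has not yet returned to the
return trampoline, can be modelled in user space as a lookup in a per-task
list of saved return addresses. This is only a simplified model: the
'TRAMPOLINE_ADDR' constant, 'struct ret_stack', and 'resolve_ret_addr' are
hypothetical names, and the code does not reproduce the kernel's
ftrace_graph_ret_addr() implementation.

.. code-block:: c

   /* Hypothetical address standing in for the return trampoline. */
   #define TRAMPOLINE_ADDR 0xdead0000UL

   /* Hypothetical per-task record of original return addresses. */
   struct ret_stack {
       const unsigned long *orig;  /* saved addresses, innermost last */
       int depth;                  /* number of saved addresses */
   };

   /*
    * Translate one return address found on the stack. '*idx' counts how
    * many saved entries this unwind has consumed so far. Returns 0 with
    * the address to report in '*out', or -1 if the trampoline was seen
    * but no saved entry remains, in which case the trace is unreliable.
    */
   static int resolve_ret_addr(const struct ret_stack *rs, int *idx,
                               unsigned long addr, unsigned long *out)
   {
       if (addr != TRAMPOLINE_ADDR) {
           *out = addr;  /* not rewritten: report as-is */
           return 0;
       }
       if (*idx >= rs->depth)
           return -1;
       *out = rs->orig[rs->depth - 1 - *idx];
       (*idx)++;
       return 0;
   }

Each rewritten address consumes exactly one saved entry, so an unwinder that
visits the same frame twice would run out of entries and report the trace as
unreliable rather than report a wrong caller.
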
0225 4.5 Obscuring of return addresses
0226 ---------------------------------
0227
0228 Some trampolines do not rewrite the return address in order to intercept
0229 returns, but do transiently clobber the return address or other unwind state.
0230
0231 For example, the x86_64 implementation of optprobes patches the probed function
0232 with a JMP instruction which targets the associated optprobe trampoline. When
0233 the probe is hit, the CPU will branch to the optprobe trampoline, and the
0234 address of the probed function is not held in any register or on the stack.
0235
0236 Similarly, the arm64 implementation of DYNAMIC_FTRACE_WITH_REGS patches traced
0237 functions with the following:
0238
0239 .. code-block:: none
0240
0241    MOV X9, X30
0242    BL <trampoline>
0243
0244 The MOV saves the link register (X30) into X9 to preserve the return address
0245 before the BL clobbers the link register and branches to the trampoline. At the
0246 start of the trampoline, the address of the traced function is in X9 rather
0247 than the link register as would usually be the case.
0248
0249 Architectures must ensure that unwinders either reliably unwind such cases,
0250 or report the unwinding as unreliable.
0251
0252 4.6 Link register unreliability
0253 -------------------------------
0254
0255 On some architectures, 'call' instructions place the return address into a
0256 link register, and 'return' instructions consume the return address from the
0257 link register without modifying the register. On these architectures software
0258 must save the return address to the stack prior to making a function call. Over
0259 the duration of a function call, the return address may be held in the link
0260 register alone, on the stack alone, or in both locations.
0261
0262 Unwinders typically assume the link register is always live, but this
0263 assumption can lead to unreliable stack traces. For example, consider the
0264 following arm64 assembly for a simple function:
0265
0266 .. code-block:: none
0267
0268    function:
0269            STP X29, X30, [SP, -16]!
0270            MOV X29, SP
0271            BL <other_function>
0272            LDP X29, X30, [SP], #16
0273            RET
0274
0275 At entry to the function, the link register (X30) points to the caller, and the
0276 frame pointer (X29) points to the caller's frame including the caller's return
0277 address. The first two instructions create a new stackframe and update the
0278 frame pointer, and at this point the link register and the frame pointer both
0279 describe this function's return address. A trace at this point may describe
0280 this function twice, and if the function return is being traced, the unwinder
0281 may consume two entries from the fgraph return stack rather than one entry.
0282
0283 The BL invokes 'other_function' with the link register pointing to this
0284 function's LDP and the frame pointer pointing to this function's stackframe.
0285 When 'other_function' returns, the link register is left pointing at the LDP,
0286 and so a trace at this point could result in 'function' appearing twice in the
0287 backtrace.
0288
0289 Similarly, a function may deliberately clobber the LR, e.g.
0290
0291 .. code-block:: none
0292
0293    caller:
0294            STP X29, X30, [SP, -16]!
0295            MOV X29, SP
0296            ADR LR, <callee>
0297            BLR LR
0298            LDP X29, X30, [SP], #16
0299            RET
0300
0301 The ADR places the address of 'callee' into the LR, before the BLR branches to
0302 this address. If a trace is made immediately after the ADR, 'callee' will
0303 appear to be the parent of 'caller', rather than the child.
0304
0305 Due to cases such as the above, it may only be possible to reliably consume a
0306 link register value at a function call boundary. Architectures where this is
0307 the case must reject unwinding across exception boundaries unless they can
0308 reliably identify when the LR or stack value should be used (e.g. using
0309 metadata generated by objtool).