Entry/exit handling for exceptions, interrupts, syscalls and KVM
================================================================

All transitions between execution domains require state updates which are
subject to strict ordering constraints. State updates are required for the
following:

  * Lockdep
  * RCU / Context tracking
  * Preemption counter
  * Tracing
  * Time accounting

The update order depends on the transition type and is explained below in
the transition type sections: `Syscalls`_, `KVM`_, `Interrupts and regular
exceptions`_, `NMI and NMI-like exceptions`_.

Non-instrumentable code - noinstr
---------------------------------

Most instrumentation facilities depend on RCU, so instrumentation is prohibited
for entry code before RCU starts watching and exit code after RCU stops
watching. In addition, many architectures must save and restore register state,
which means that (for example) a breakpoint in the breakpoint entry code would
overwrite the debug registers of the initial breakpoint.

Such code must be marked with the 'noinstr' attribute, placing that code into a
special section inaccessible to instrumentation and debug facilities. Some
functions are partially instrumentable, which is handled by marking them
noinstr and using instrumentation_begin() and instrumentation_end() to flag the
instrumentable ranges of code:

.. code-block:: c

  noinstr void entry(void)
  {
          handle_entry();     // <-- must be 'noinstr' or '__always_inline'
          ...

          instrumentation_begin();
          handle_context();   // <-- instrumentable code
          instrumentation_end();

          ...
          handle_exit();      // <-- must be 'noinstr' or '__always_inline'
  }

This allows verification of the 'noinstr' restrictions via objtool on
supported architectures.

Invoking non-instrumentable functions from instrumentable context has no
restrictions and is useful to protect e.g. state switching which would
cause malfunction if instrumented.

All non-instrumentable entry/exit code sections before and after the RCU
state transitions must run with interrupts disabled.

Syscalls
--------

Syscall-entry code starts in assembly code and calls out into low-level C code
after establishing low-level architecture-specific state and stack frames. This
low-level C code must not be instrumented. A typical syscall handling function
invoked from low-level assembly code looks like this:

.. code-block:: c

  noinstr void syscall(struct pt_regs *regs, int nr)
  {
          arch_syscall_enter(regs);
          nr = syscall_enter_from_user_mode(regs, nr);

          instrumentation_begin();
          if (!invoke_syscall(regs, nr) && nr != -1)
                  result_reg(regs) = __sys_ni_syscall(regs);
          instrumentation_end();

          syscall_exit_to_user_mode(regs);
  }

syscall_enter_from_user_mode() first invokes enter_from_user_mode() which
establishes state in the following order:

  * Lockdep
  * RCU / Context tracking
  * Tracing

and then invokes the various entry work functions like ptrace, seccomp, audit,
syscall tracing, etc. After all that is done, the instrumentable invoke_syscall
function can be invoked. The instrumentable code section then ends, after which
syscall_exit_to_user_mode() is invoked.

syscall_exit_to_user_mode() handles all work which needs to be done before
returning to user space like tracing, audit, signals, task work etc. After
that it invokes exit_to_user_mode() which again handles the state
transition in the reverse order:

  * Tracing
  * RCU / Context tracking
  * Lockdep

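The mirrored entry and exit ordering described above can be illustrated with a
small user-space model. This is only a sketch: the model_* names, the enum and
the stack are hypothetical and do not exist in the kernel, where the real work
happens inside enter_from_user_mode() and exit_to_user_mode().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: each subsystem is just a tag pushed onto a stack. */
enum model_step { MODEL_LOCKDEP, MODEL_RCU, MODEL_TRACING };

#define MODEL_MAX_STEPS 3

static enum model_step model_stack[MODEL_MAX_STEPS];
static size_t model_depth;

static void model_establish(enum model_step step)
{
        assert(model_depth < MODEL_MAX_STEPS);
        model_stack[model_depth++] = step;
}

static void model_teardown(enum model_step step)
{
        /* Only the most recently established state may be torn down. */
        assert(model_depth > 0);
        assert(model_stack[model_depth - 1] == step);
        model_depth--;
}

/* Models the order documented for enter_from_user_mode(). */
static void model_enter_from_user_mode(void)
{
        model_establish(MODEL_LOCKDEP);
        model_establish(MODEL_RCU);
        model_establish(MODEL_TRACING);
}

/* Models the reverse order documented for exit_to_user_mode(). */
static void model_exit_to_user_mode(void)
{
        model_teardown(MODEL_TRACING);
        model_teardown(MODEL_RCU);
        model_teardown(MODEL_LOCKDEP);
}

/* Returns the number of states still established after one round trip. */
static size_t model_round_trip(void)
{
        model_enter_from_user_mode();
        model_exit_to_user_mode();
        return model_depth;
}
```

In this model, model_round_trip() returns 0 because every state established on
entry has been torn down again. Swapping two calls in
model_exit_to_user_mode() trips the assertion in model_teardown().
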
syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
available as fine grained subfunctions in cases where the architecture code
has to do extra work between the various steps. In such cases it has to
ensure that enter_from_user_mode() is called first on entry and
exit_to_user_mode() is called last on exit.

Do not nest syscalls. Nested syscalls will cause RCU and/or context tracking
to print a warning.

KVM
---

Entering or exiting guest mode is very similar to syscalls. From the host
kernel point of view the CPU goes off into user space when entering the
guest and returns to the kernel on exit.

kvm_guest_enter_irqoff() is a KVM-specific variant of exit_to_user_mode()
and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode().
The state operations have the same ordering.

Task work handling is done separately for guest at the boundary of the
vcpu_run() loop via xfer_to_guest_mode_handle_work() which is a subset of
the work handled on return to user space.

Do not nest KVM entry/exit transitions because doing so is nonsensical.

Interrupts and regular exceptions
---------------------------------

Interrupt entry and exit handling is slightly more complex than syscalls
and KVM transitions.

If an interrupt is raised while the CPU executes in user space, the entry
and exit handling is exactly the same as for syscalls.

If the interrupt is raised while the CPU executes in kernel space the entry and
exit handling is slightly different. RCU state is only updated when the
interrupt is raised in the context of the CPU's idle task. Otherwise, RCU will
already be watching. Lockdep and tracing have to be updated unconditionally.

irqentry_enter() and irqentry_exit() provide the implementation for this.

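The conditional RCU update for interrupts taken in kernel space can be sketched
as a user-space model. The struct, the global flag and the model_* functions
are hypothetical stand-ins chosen for illustration; they only mimic the idea
that the value returned on entry tells the exit path whether RCU has to stop
watching again.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the state returned by irqentry_enter(). */
struct model_irqentry_state {
        bool exit_rcu;  /* true if this entry made RCU start watching */
};

/* false models an idle task in which RCU is not watching. */
static bool model_rcu_watching;

static struct model_irqentry_state model_irqentry_enter_kernel(void)
{
        struct model_irqentry_state state = { .exit_rcu = false };

        if (!model_rcu_watching) {
                /* Interrupt hit the idle task: RCU must start watching. */
                model_rcu_watching = true;
                state.exit_rcu = true;
        }
        /* Lockdep and tracing updates would happen here unconditionally. */
        return state;
}

static void model_irqentry_exit_kernel(struct model_irqentry_state state)
{
        /* RCU is watching for the whole duration of the handler. */
        assert(model_rcu_watching);

        /* Undo the RCU transition only if this entry performed it. */
        if (state.exit_rcu)
                model_rcu_watching = false;
}
```

Passing the entry state to the exit function keeps the pair symmetric without
any global bookkeeping about who switched RCU on.
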
The architecture-specific part looks similar to syscall handling:

.. code-block:: c

  noinstr void interrupt(struct pt_regs *regs, int nr)
  {
          irqentry_state_t state;

          arch_interrupt_enter(regs);
          state = irqentry_enter(regs);

          instrumentation_begin();

          irq_enter_rcu();
          invoke_irq_handler(regs, nr);
          irq_exit_rcu();

          instrumentation_end();

          irqentry_exit(regs, state);
  }

Note that the invocation of the actual interrupt handler is within an
irq_enter_rcu() and irq_exit_rcu() pair.

irq_enter_rcu() updates the preemption count which makes in_hardirq()
return true, handles NOHZ tick state and interrupt time accounting. This
means that up to the point where irq_enter_rcu() is invoked in_hardirq()
returns false.

irq_exit_rcu() handles interrupt time accounting, undoes the preemption
count update and eventually handles soft interrupts and NOHZ tick state.

In theory, the preemption count could be updated in irqentry_enter(). In
practice, deferring this update to irq_enter_rcu() allows the preemption-count
code to be traced, while also maintaining symmetry with irq_exit_rcu() and
irqentry_exit(), which are described in the next paragraph. The only downside
is that the early entry code up to irq_enter_rcu() must be aware that the
preemption count has not yet been updated with the HARDIRQ_OFFSET state.

Note that irq_exit_rcu() must remove HARDIRQ_OFFSET from the preemption count
before it handles soft interrupts, whose handlers must run in BH context rather
than irq-disabled context. In addition, irqentry_exit() might schedule, which
also requires that HARDIRQ_OFFSET has been removed from the preemption count.

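The window in which in_hardirq() reports true can be sketched in user space as
follows. The MODEL_* constants are illustrative values rather than the kernel's
authoritative bit layout, and all model_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative values only: the real bit layout is defined by the kernel's
 * preemption-count headers and may differ between kernel versions.
 */
#define MODEL_HARDIRQ_SHIFT   16
#define MODEL_HARDIRQ_OFFSET  (1U << MODEL_HARDIRQ_SHIFT)
#define MODEL_HARDIRQ_MASK    (0xfU << MODEL_HARDIRQ_SHIFT)

static unsigned int model_preempt_count;

static bool model_in_hardirq(void)
{
        return (model_preempt_count & MODEL_HARDIRQ_MASK) != 0;
}

static void model_irq_enter_rcu(void)
{
        model_preempt_count += MODEL_HARDIRQ_OFFSET;
}

static void model_irq_exit_rcu(void)
{
        assert(model_in_hardirq());
        model_preempt_count -= MODEL_HARDIRQ_OFFSET;
        /* Soft interrupts would be handled here, outside hardirq context. */
}

/* Records what the in_hardirq() model reports at each stage. */
struct model_trace {
        bool early;    /* between entry and irq_enter_rcu() */
        bool handler;  /* while the interrupt handler runs */
        bool late;     /* after irq_exit_rcu(), before the exit code */
};

static struct model_trace model_interrupt(void)
{
        struct model_trace t;

        t.early = model_in_hardirq();
        model_irq_enter_rcu();
        t.handler = model_in_hardirq();
        model_irq_exit_rcu();
        t.late = model_in_hardirq();
        return t;
}
```

In this model only the handler stage observes hardirq context, which matches
the caveat above for early entry code.
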
Even though interrupt handlers are expected to run with local interrupts
disabled, interrupt nesting is common from an entry/exit perspective. For
example, softirq handling happens within an irqentry_{enter,exit}() block with
local interrupts enabled. Also, although uncommon, nothing prevents an
interrupt handler from re-enabling interrupts.

Interrupt entry/exit code doesn't strictly need to handle reentrancy, since it
runs with local interrupts disabled. But NMIs can happen anytime, and a lot of
the entry code is shared between the two.

NMI and NMI-like exceptions
---------------------------

NMIs and NMI-like exceptions (machine checks, double faults, debug
interrupts, etc.) can hit any context and must be extra careful with
the state.

State changes for debug exceptions and machine-check exceptions depend on
whether these exceptions happened in user-space (breakpoints or watchpoints) or
in kernel mode (code patching). From user-space, they are treated like
interrupts, while from kernel mode they are treated like NMIs.

NMIs and other NMI-like exceptions handle state transitions without
distinguishing between user-mode and kernel-mode origin.

The state update on entry is handled in irqentry_nmi_enter() which updates
state in the following order:

  * Preemption counter
  * Lockdep
  * RCU / Context tracking
  * Tracing

The exit counterpart irqentry_nmi_exit() does the reverse operation in the
reverse order.

Note that the update of the preemption counter has to be the first
operation on enter and the last operation on exit. The reason is that both
lockdep and RCU rely on in_nmi() returning true in this case. The
preemption count modification in the NMI entry/exit case must not be
traced.

Architecture-specific code looks like this:

.. code-block:: c

  noinstr void nmi(struct pt_regs *regs)
  {
          irqentry_state_t state;

          arch_nmi_enter(regs);
          state = irqentry_nmi_enter(regs);

          instrumentation_begin();
          nmi_handler(regs);
          instrumentation_end();

          irqentry_nmi_exit(regs, state);
  }

and for e.g. a debug exception it can look like this:

.. code-block:: c

  noinstr void debug(struct pt_regs *regs)
  {
          irqentry_state_t state;

          arch_nmi_enter(regs);

          debug_regs = save_debug_regs();

          if (user_mode(regs)) {
                  state = irqentry_enter(regs);

                  instrumentation_begin();
                  user_mode_debug_handler(regs, debug_regs);
                  instrumentation_end();

                  irqentry_exit(regs, state);
          } else {
                  state = irqentry_nmi_enter(regs);

                  instrumentation_begin();
                  kernel_mode_debug_handler(regs, debug_regs);
                  instrumentation_end();

                  irqentry_nmi_exit(regs, state);
          }
  }

There is no combined irqentry_nmi_if_kernel() function available as the
above cannot be handled in an exception-agnostic way.

NMIs can happen in any context. For example, an NMI-like exception can be
triggered while handling an NMI. NMI entry code therefore has to be reentrant,
and state updates need to handle nesting.
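
Nesting-safe state handling can be sketched with a user-space model in which
every entry saves what it is about to overwrite and the matching exit restores
it. The counter, the lockdep flag and all model_* names are hypothetical
simplifications; only the ordering (preemption counter first on entry, last on
exit) is taken from the description above.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NMI_OFFSET 1U  /* illustrative; not the kernel's bit layout */

static unsigned int model_nmi_count;       /* NMI part of the preempt count */
static bool model_lockdep_irqs_on = true;  /* lockdep's view of IRQ state */

/* Hypothetical per-entry state, handed from enter to the matching exit. */
struct model_nmi_state {
        bool lockdep;
};

static bool model_in_nmi(void)
{
        return model_nmi_count != 0;
}

static struct model_nmi_state model_nmi_enter(void)
{
        struct model_nmi_state state;

        model_nmi_count += MODEL_NMI_OFFSET;    /* first: preemption counter */
        assert(model_in_nmi());                 /* later steps rely on this */
        state.lockdep = model_lockdep_irqs_on;  /* save the interrupted view */
        model_lockdep_irqs_on = false;
        return state;
}

static void model_nmi_exit(struct model_nmi_state state)
{
        model_lockdep_irqs_on = state.lockdep;  /* restore in reverse order */
        assert(model_in_nmi());
        model_nmi_count -= MODEL_NMI_OFFSET;    /* last: preemption counter */
}

/* Runs an NMI nested inside an NMI and returns the deepest nesting level. */
static unsigned int model_nested_nmi(void)
{
        struct model_nmi_state outer = model_nmi_enter();
        struct model_nmi_state inner = model_nmi_enter();
        unsigned int depth = model_nmi_count;

        model_nmi_exit(inner);
        assert(model_in_nmi());           /* still inside the outer NMI */
        assert(!model_lockdep_irqs_on);   /* outer NMI's view is intact */
        model_nmi_exit(outer);
        return depth;
}
```

Because each level restores only what it saved, the inner exit leaves the outer
NMI's view untouched, and the outer exit returns the model to its initial
state.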