Entry/exit handling for exceptions, interrupts, syscalls and KVM
================================================================

All transitions between execution domains require state updates which are
subject to strict ordering constraints. State updates are required for the
following:

  * Lockdep
  * RCU / Context tracking
  * Preemption counter
  * Tracing
  * Time accounting

The update order depends on the transition type and is explained below in
the transition type sections: `Syscalls`_, `KVM`_, `Interrupts and regular
exceptions`_, `NMI and NMI-like exceptions`_.

Non-instrumentable code - noinstr
---------------------------------

Most instrumentation facilities depend on RCU, so instrumentation is prohibited
for entry code before RCU starts watching and exit code after RCU stops
watching. In addition, many architectures must save and restore register state,
which means that (for example) a breakpoint in the breakpoint entry code would
overwrite the debug registers of the initial breakpoint.

Such code must be marked with the 'noinstr' attribute, placing that code into a
special section inaccessible to instrumentation and debug facilities. Some
functions are partially instrumentable, which is handled by marking them
noinstr and using instrumentation_begin() and instrumentation_end() to flag the
instrumentable ranges of code:

.. code-block:: c

  noinstr void entry(void)
  {
        handle_entry();     // <-- must be 'noinstr' or '__always_inline'
        ...

        instrumentation_begin();
        handle_context();   // <-- instrumentable code
        instrumentation_end();

        ...
        handle_exit();      // <-- must be 'noinstr' or '__always_inline'
  }

This allows verification of the 'noinstr' restrictions via objtool on
supported architectures.

Invoking non-instrumentable functions from instrumentable context has no
restrictions and is useful to protect e.g. state switching which would
cause malfunction if instrumented.

All non-instrumentable entry/exit code sections before and after the RCU
state transitions must run with interrupts disabled.

Syscalls
--------

Syscall-entry code starts in assembly code and calls out into low-level C code
after establishing low-level architecture-specific state and stack frames. This
low-level C code must not be instrumented. A typical syscall handling function
invoked from low-level assembly code looks like this:

.. code-block:: c

  noinstr void syscall(struct pt_regs *regs, int nr)
  {
        arch_syscall_enter(regs);
        nr = syscall_enter_from_user_mode(regs, nr);

        instrumentation_begin();
        if (!invoke_syscall(regs, nr) && nr != -1)
                result_reg(regs) = __sys_ni_syscall(regs);
        instrumentation_end();

        syscall_exit_to_user_mode(regs);
  }

syscall_enter_from_user_mode() first invokes enter_from_user_mode() which
establishes state in the following order:

  * Lockdep
  * RCU / Context tracking
  * Tracing

and then invokes the various entry work functions like ptrace, seccomp, audit,
syscall tracing, etc. After all that is done, the instrumentable invoke_syscall
function can be invoked. The instrumentable code section then ends, after which
syscall_exit_to_user_mode() is invoked.

syscall_exit_to_user_mode() handles all work which needs to be done before
returning to user space like tracing, audit, signals, task work etc. After
that it invokes exit_to_user_mode() which again handles the state
transition in the reverse order:

  * Tracing
  * RCU / Context tracking
  * Lockdep

syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
available as fine grained subfunctions in cases where the architecture code
has to do extra work between the various steps. In such cases it has to
ensure that enter_from_user_mode() is called first on entry and
exit_to_user_mode() is called last on exit.

Do not nest syscalls. Nested syscalls will cause RCU and/or context tracking
to print a warning.

KVM
---

Entering or exiting guest mode is very similar to syscalls. From the host
kernel point of view the CPU goes off into user space when entering the
guest and returns to the kernel on exit.

kvm_guest_enter_irqoff() is a KVM-specific variant of exit_to_user_mode()
and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode().
The state operations have the same ordering.

Task work handling is done separately for guests at the boundary of the
vcpu_run() loop via xfer_to_guest_mode_handle_work() which is a subset of
the work handled on return to user space.

Do not nest KVM entry/exit transitions because doing so is nonsensical.

Interrupts and regular exceptions
---------------------------------

Interrupt entry and exit handling is slightly more complex than that of
syscalls and KVM transitions.

If an interrupt is raised while the CPU executes in user space, the entry
and exit handling is exactly the same as for syscalls.

If the interrupt is raised while the CPU executes in kernel space the entry and
exit handling is slightly different. RCU state is only updated when the
interrupt is raised in the context of the CPU's idle task. Otherwise, RCU will
already be watching. Lockdep and tracing have to be updated unconditionally.

irqentry_enter() and irqentry_exit() provide the implementation for this.

The architecture-specific part looks similar to syscall handling:

.. code-block:: c

  noinstr void interrupt(struct pt_regs *regs, int nr)
  {
        arch_interrupt_enter(regs);
        state = irqentry_enter(regs);

        instrumentation_begin();

        irq_enter_rcu();
        invoke_irq_handler(regs, nr);
        irq_exit_rcu();

        instrumentation_end();

        irqentry_exit(regs, state);
  }

Note that the invocation of the actual interrupt handler is within an
irq_enter_rcu() and irq_exit_rcu() pair.

irq_enter_rcu() updates the preemption count which makes in_hardirq()
return true, handles NOHZ tick state and interrupt time accounting. This
means that up to the point where irq_enter_rcu() is invoked in_hardirq()
returns false.

irq_exit_rcu() handles interrupt time accounting, undoes the preemption
count update and eventually handles soft interrupts and NOHZ tick state.

In theory, the preemption count could be updated in irqentry_enter(). In
practice, deferring this update to irq_enter_rcu() allows the preemption-count
code to be traced, while also maintaining symmetry with irq_exit_rcu() and
irqentry_exit(), which are described in the next paragraph. The only downside
is that the early entry code up to irq_enter_rcu() must be aware that the
preemption count has not yet been updated with the HARDIRQ_OFFSET state.

Note that irq_exit_rcu() must remove HARDIRQ_OFFSET from the preemption count
before it handles soft interrupts, whose handlers must run in BH context rather
than irq-disabled context. In addition, irqentry_exit() might schedule, which
also requires that HARDIRQ_OFFSET has been removed from the preemption count.

Even though interrupt handlers are expected to run with local interrupts
disabled, interrupt nesting is common from an entry/exit perspective. For
example, softirq handling happens within an irqentry_{enter,exit}() block with
local interrupts enabled. Also, although uncommon, nothing prevents an
interrupt handler from re-enabling interrupts.

Interrupt entry/exit code doesn't strictly need to handle reentrancy, since it
runs with local interrupts disabled. But NMIs can happen anytime, and a lot of
the entry code is shared between the two.

NMI and NMI-like exceptions
---------------------------

NMIs and NMI-like exceptions (machine checks, double faults, debug
interrupts, etc.) can hit any context and must be extra careful with
the state.

State changes for debug exceptions and machine-check exceptions depend on
whether these exceptions happened in user-space (breakpoints or watchpoints) or
in kernel mode (code patching). From user-space, they are treated like
interrupts, while from kernel mode they are treated like NMIs.

NMIs and other NMI-like exceptions handle state transitions without
distinguishing between user-mode and kernel-mode origin.

The state update on entry is handled in irqentry_nmi_enter() which updates
state in the following order:

  * Preemption counter
  * Lockdep
  * RCU / Context tracking
  * Tracing

The exit counterpart irqentry_nmi_exit() does the reverse operation in the
reverse order.

Note that the update of the preemption counter has to be the first
operation on enter and the last operation on exit. The reason is that both
lockdep and RCU rely on in_nmi() returning true in this case. The
preemption count modification in the NMI entry/exit case must not be
traced.

Architecture-specific code looks like this:

.. code-block:: c

  noinstr void nmi(struct pt_regs *regs)
  {
        arch_nmi_enter(regs);
        state = irqentry_nmi_enter(regs);

        instrumentation_begin();
        nmi_handler(regs);
        instrumentation_end();

        irqentry_nmi_exit(regs, state);
  }

and for e.g. a debug exception it can look like this:

.. code-block:: c

  noinstr void debug(struct pt_regs *regs)
  {
        arch_nmi_enter(regs);

        debug_regs = save_debug_regs();

        if (user_mode(regs)) {
                state = irqentry_enter(regs);

                instrumentation_begin();
                user_mode_debug_handler(regs, debug_regs);
                instrumentation_end();

                irqentry_exit(regs, state);
        } else {
                state = irqentry_nmi_enter(regs);

                instrumentation_begin();
                kernel_mode_debug_handler(regs, debug_regs);
                instrumentation_end();

                irqentry_nmi_exit(regs, state);
        }
  }

There is no combined irqentry_nmi_if_kernel() function available as the
above cannot be handled in an exception-agnostic way.

NMIs can happen in any context; for example, an NMI-like exception can be
triggered while handling an NMI. So NMI entry code has to be reentrant and
state updates need to handle nesting.