=============================================================
An ad-hoc collection of notes on IA64 MCA and INIT processing
=============================================================

Feel free to update it with notes about any area that is not clear.

---

MCA/INIT are completely asynchronous. They can occur at any time, when
the OS is in any state, including when one of the cpus is already
holding a spinlock. Trying to get any lock from MCA/INIT state is
asking for deadlock. Also, the state of structures that are protected
by locks, including linked lists, is indeterminate.
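
To illustrate the constraint above, here is a minimal user space sketch
of a lock attempt that cannot deadlock: a bounded try that gives up.
The toy_lock type and both function names are made up for this note.
They are not the kernel's spinlock API and are not taken from mca.c.
When the try fails, the caller has to carry on without the lock and
treat the protected data as suspect.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type for illustration; not the kernel's spinlock_t. */
struct toy_lock {
	atomic_flag held;
};

#define TOY_LOCK_INIT { ATOMIC_FLAG_INIT }

/*
 * Try to take the lock at most max_tries times and never spin forever.
 * Returns true if the lock was taken, false if the caller must proceed
 * without it.
 */
static bool toy_trylock_bounded(struct toy_lock *lock, unsigned int max_tries)
{
	while (max_tries--) {
		if (!atomic_flag_test_and_set_explicit(&lock->held,
						       memory_order_acquire))
			return true;
	}
	return false;
}

static void toy_unlock(struct toy_lock *lock)
{
	atomic_flag_clear_explicit(&lock->held, memory_order_release);
}
```

A handler written this way would call toy_trylock_bounded() once with a
small try count and skip the protected work if it returns false.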

---

The ia64 MCA process is complicated. All of this is mandated by Intel's
specification for ia64 SAL, error recovery and unwind; it is not as
if we have a choice here.

* MCA occurs on one cpu, usually due to a double bit memory error.
  This is the monarch cpu.

* SAL sends an MCA rendezvous interrupt (which is a normal interrupt)
  to all the other cpus, the slaves.

* Slave cpus that receive the MCA interrupt call down into SAL, where
  they end up spinning disabled (that is, with interrupts off) while
  the MCA is being serviced.

* If any slave cpu was already spinning disabled when the MCA occurred
  then it cannot service the MCA interrupt. SAL waits ~20 seconds then
  sends an unmaskable INIT event to the slave cpus that have not
  already rendezvoused.

* Because MCA/INIT can be delivered at any time, including when the cpu
  is down in PAL in physical mode, the registers at the time of the
  event are _completely_ undefined. In particular the MCA/INIT
  handlers cannot rely on the thread pointer: PAL physical mode can
  (and does) modify TP. It is allowed to do that as long as it resets
  TP on return. However MCA/INIT events expose us to these PAL
  internal TP changes. Hence curr_task().

* If an MCA/INIT event occurs while the kernel was running (not user
  space) and the kernel has called PAL then the MCA/INIT handler cannot
  assume that the kernel stack is in a fit state to be used, mainly
  because PAL may or may not maintain the stack pointer internally.
  Because the MCA/INIT handlers cannot trust the kernel stack, they
  have to use their own, per-cpu stacks. The MCA/INIT stacks are
  preformatted with just enough task state to let the relevant handlers
  do their job.

* Unlike most other architectures, the ia64 struct task is embedded in
  the kernel stack[1]. So switching to a new kernel stack means that
  we switch to a new task as well. Because various bits of the kernel
  assume that current points into the struct task, switching to a new
  stack also means a new value for current.

* Once all slaves have rendezvoused and are spinning disabled, the
  monarch is entered. The monarch now tries to diagnose the problem
  and decide if it can recover or not.

* Part of the monarch's job is to look at the state of all the other
  tasks. The only way to do that on ia64 is to call the unwinder,
  as mandated by Intel.

* The starting point for the unwind depends on whether a task is
  running or not. That is, whether it is on a cpu or is blocked. The
  monarch has to determine whether or not a task is on a cpu before it
  knows how to start unwinding it. The tasks that received an MCA or
  INIT event are no longer running; they have been converted to blocked
  tasks. But (and it's a big but), the cpus that received the MCA
  rendezvous interrupt are still running on their normal kernel stacks!

* To distinguish between these two cases, the monarch must know which
  tasks are on a cpu and which are not. Hence each slave cpu that
  switches to an MCA/INIT stack registers its new stack using
  set_curr_task(), so the monarch can tell that the _original_ task is
  no longer running on that cpu. That gives us a decent chance of
  getting a valid backtrace of the _original_ task.

* MCA/INIT can be nested, to a depth of 2 on any cpu. In the case of a
  nested error, we want diagnostics on the MCA/INIT handler that
  failed, not on the task that was originally running. Again this
  requires set_curr_task() so the MCA/INIT handlers can register their
  own stack as running on that cpu. Then a recursive error gets a
  trace of the failing handler's "task".

[1]
My (Keith Owens) original design called for ia64 to separate its
struct task and the kernel stacks. Then the MCA/INIT data would be
chained stacks like i386 interrupt stacks. But that required
radical surgery on the rest of ia64, plus extra hard wired TLB
entries with their associated performance degradation. David
Mosberger vetoed that approach, which meant that separate kernel
stacks meant separate "tasks" for the MCA/INIT handlers.
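
The bookkeeping described in the list above can be modelled in a few
lines of user space C. This is only a toy: the toy_ names, the fixed
cpu count and the plain array are assumptions made for this note, while
the real curr_task() and set_curr_task() are scheduler hooks that
operate on struct task_struct. The sketch shows a handler saving the
task it displaced, registering its own "task" on the cpu, and the
monarch then asking whether a given task is still on a cpu.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_CPUS 4		/* hypothetical cpu count for the model */

struct toy_task {
	const char *comm;	/* task name, for humans */
};

/* Models the scheduler's per-cpu "task currently running here" slot. */
static struct toy_task *toy_cpu_curr[TOY_NR_CPUS];

/* Toy counterpart of curr_task(): does not rely on the thread pointer. */
static struct toy_task *toy_curr_task(int cpu)
{
	return toy_cpu_curr[cpu];
}

/* Toy counterpart of set_curr_task(): register p as running on cpu. */
static void toy_set_curr_task(int cpu, struct toy_task *p)
{
	toy_cpu_curr[cpu] = p;
}

/*
 * What the monarch needs to know before unwinding p: is it on a cpu
 * (unwind from live state) or blocked (unwind from its saved state)?
 */
static bool toy_task_on_cpu(const struct toy_task *p)
{
	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (toy_cpu_curr[cpu] == p)
			return true;
	return false;
}
```

In this model a handler entering on cpu N would first save
toy_curr_task(N), then call toy_set_curr_task(N, handler_task), and
put the saved task back on the way out. While the handler is
registered, toy_task_on_cpu() reports the original task as not on a
cpu, which is what lets it be unwound as a blocked task.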

---

INIT is less complicated than MCA. Pressing the NMI button or using
the equivalent command on the management console sends INIT to all
cpus. SAL picks one of the cpus as the monarch and the rest are
slaves. All the OS INIT handlers are entered at approximately the same
time. The OS monarch prints the state of all tasks and returns, after
which the slaves return and the system resumes.

At least that is what is supposed to happen. Alas there are broken
versions of SAL out there. Some drive all the cpus as monarchs. Some
drive them all as slaves. Some drive one cpu as monarch, wait for that
cpu to return from the OS then drive the rest as slaves. Some versions
of SAL cannot even cope with returning from the OS; they spin inside
SAL on resume. The OS INIT code has workarounds for some of these
broken SAL symptoms, but some simply cannot be fixed from the OS side.

---

The scheduler hooks used by ia64 (curr_task, set_curr_task) are layer
violations. Unfortunately MCA/INIT start off as massive layer
violations (can occur at _any_ time) and they build from there.

At least ia64 makes an attempt at recovering from hardware errors, but
it is a difficult problem because of the asynchronous nature of these
errors. When processing an unmaskable interrupt we sometimes need
special code to cope with our inability to take any locks.

---

How is ia64 MCA/INIT different from x86 NMI?

* x86 NMI typically gets delivered to one cpu. MCA/INIT gets sent to
  all cpus.

* x86 NMI cannot be nested. MCA/INIT can be nested, to a depth of 2
  per cpu.

* x86 has a separate struct task which points to one of multiple kernel
  stacks. ia64 has the struct task embedded in the single kernel
  stack, so switching stack means switching task.

* x86 does not call the BIOS so the NMI handler does not have to worry
  about any registers having changed. MCA/INIT can occur while the cpu
  is in PAL in physical mode, with undefined registers and an undefined
  kernel stack.

* i386 backtrace is not very sensitive to whether a process is running
  or not. ia64 unwind is very, very sensitive to whether a process is
  running or not.

---

What happens when MCA/INIT is delivered while a cpu is running user
space code?

The user mode registers are stored in the RSE area of the MCA/INIT
stack on entry to the OS and are restored from there on return to SAL,
so user mode registers are preserved across a recoverable MCA/INIT.
Since the OS has no idea what unwind data is available for the user
space stack, MCA/INIT never tries to backtrace user space. That in turn
means the OS does not bother making the user space process look like a
blocked task, i.e. the OS does not copy pt_regs and switch_stack to the
user space stack. Also the OS has no idea how big the user space RSE
and memory stacks are, which makes it too risky to copy the saved state
to a user mode stack.

---

How do we get a backtrace on the tasks that were running when MCA/INIT
was delivered?

mca.c::ia64_mca_modify_original_stack(). That identifies and
verifies the original kernel stack, copies the dirty registers from
the MCA/INIT stack's RSE to the original stack's RSE, copies the
skeleton struct pt_regs and switch_stack to the original stack, fills
in the skeleton structures from the PAL minstate area and updates the
original stack's thread.ksp. That makes the original stack look
exactly like any other blocked task, i.e. it now appears to be
sleeping. To get a backtrace, just start with thread.ksp for the
original task and unwind like any other sleeping task.

---

How do we identify the tasks that were running when MCA/INIT was
delivered?

If the previous task has been verified and converted to a blocked
state, then sos->prev_task on the MCA/INIT stack is updated to point to
the previous task. You can look at that field in dumps or debuggers.
To help distinguish between the handler and the original tasks,
handlers have _TIF_MCA_INIT set in thread_info.flags.
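
For anyone scripting this in a debugger or dump tool, the lookup above
boils down to a flag test followed by a pointer read. The sketch below
models it with made-up structures: the dbg_ names, the bit chosen for
TOY_TIF_MCA_INIT and the assumption that prev_task is NULL when the
original task could not be verified are all illustrative, not the
kernel's definitions.

```c
#include <stddef.h>

/* Hypothetical bit; the real _TIF_MCA_INIT comes from the ia64 headers. */
#define TOY_TIF_MCA_INIT	(1UL << 0)

struct dbg_task;

/* Stands in for the sos data kept on each MCA/INIT handler stack. */
struct dbg_sos {
	struct dbg_task *prev_task;	/* assumed NULL until verified */
};

struct dbg_task {
	unsigned long flags;		/* stands in for thread_info.flags */
	struct dbg_sos sos;		/* only meaningful for handler tasks */
};

/*
 * Return the task that was running when the event arrived, or NULL if
 * t is not a handler task or the original task was never verified.
 */
static struct dbg_task *dbg_original_task(struct dbg_task *t)
{
	if (!(t->flags & TOY_TIF_MCA_INIT))
		return NULL;
	return t->sos.prev_task;
}
```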

The sos data is always in the MCA/INIT handler stack, at offset
MCA_SOS_OFFSET. You can get that value from mca_asm.h or calculate it
as KERNEL_STACK_SIZE - sizeof(struct pt_regs) - sizeof(struct
ia64_sal_os_state), with 16 byte alignment for all structures.
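
As a worked example of that calculation, the helper below applies the
formula exactly as stated here, rounding each offset down to a 16 byte
boundary. Rounding down is an assumption about what "16 byte alignment"
means, the toy_ names are made up, and the sizes passed in are whatever
the caller supplies rather than real ia64 structure sizes. Check
mca_asm.h for the authoritative definition.

```c
#include <stddef.h>

/* Round an offset down to a 16 byte boundary. */
#define TOY_ALIGN16_DOWN(x)	((size_t)(x) & ~(size_t)15)

/*
 * Offset of the sos data from the base of a handler stack, following
 * the formula in the text: the pt_regs area sits at the top of the
 * stack and the sos area sits just below it, each 16 byte aligned.
 */
static size_t toy_sos_offset(size_t stack_size, size_t pt_regs_size,
			     size_t sos_size)
{
	size_t pt_regs_off = TOY_ALIGN16_DOWN(stack_size - pt_regs_size);

	return TOY_ALIGN16_DOWN(pt_regs_off - sos_size);
}
```

With made-up sizes of 32768 for the stack, 400 for pt_regs and 1000 for
the sos structure, toy_sos_offset() gives 31360: the pt_regs offset is
32368, which is already aligned, and 32368 - 1000 = 31368 rounds down
to 31360.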

Also, the comm field of the MCA/INIT task is modified to include the pid
of the original task, for humans to use. For example, a comm field of
'MCA 12159' means that pid 12159 was running when the MCA was
delivered.
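
If you need to pull the pid back out of such a comm string in a script
or dump tool, something along these lines would do. It assumes the
layout is exactly the handler name, one space, then the decimal pid, as
in the example above. Both function names are made up for this note.

```c
#include <stdio.h>

/* Build a comm string such as "MCA 12159" from a handler name and pid. */
static void toy_format_handler_comm(char *comm, size_t len,
				    const char *type, int pid)
{
	snprintf(comm, len, "%s %d", type, pid);
}

/*
 * Split a comm string of the form "<name> <pid>".  The name is limited
 * to 7 characters plus the terminating NUL.  Returns 0 on success and
 * -1 if the string does not have that form.
 */
static int toy_parse_handler_comm(const char *comm, char type[8], int *pid)
{
	return sscanf(comm, "%7s %d", type, pid) == 2 ? 0 : -1;
}
```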