===================================
Light-weight System Calls for IA-64
===================================

Started: 13-Jan-2003

Last update: 27-Sep-2003

David Mosberger-Tang
<davidm@hpl.hp.com>

Using the "epc" instruction effectively introduces a new mode of
execution to the ia64 linux kernel.  We call this mode the
"fsys-mode".  To recap, the normal states of execution are:

- kernel mode:
    Both the register stack and the memory stack have been
    switched over to kernel memory.  The user-level state is saved
    in a pt_regs structure at the top of the kernel memory stack.

- user mode:
    Both the register stack and the memory stack are in
    user memory.  The user-level state is contained in the
    CPU registers.

- bank 0 interruption-handling mode:
    This is the non-interruptible state in which all
    interruption-handlers start execution.  The user-level
    state remains in the CPU registers and some kernel state may
    be stored in bank 0 of registers r16-r31.

In contrast, fsys-mode has the following special properties:

- execution is at privilege level 0 (most-privileged)

- CPU registers may contain a mixture of user-level and kernel-level
  state (it is the responsibility of the kernel to ensure that no
  security-sensitive kernel-level state is leaked back to
  user-level)

- execution is interruptible and preemptible (an fsys-mode handler
  can disable interrupts and avoid all other interruption-sources
  to avoid preemption)

- neither the memory-stack nor the register-stack can be trusted while
  in fsys-mode (they point to the user-level stacks, which may
  be invalid, or completely bogus addresses)

In summary, fsys-mode is much more similar to running in user-mode
than it is to running in kernel-mode.  Because the privilege level is
0, however, fsys-mode code requires some care (see below).


How to tell fsys-mode
=====================

Linux operates in fsys-mode when (a) the privilege level is 0 (most
privileged) and (b) the stacks have NOT been switched to kernel memory
yet.  For convenience, the header file <asm-ia64/ptrace.h> provides
three macros::

  user_mode(regs)
  user_stack(task,regs)
  fsys_mode(task,regs)

The "regs" argument is a pointer to a pt_regs structure.  The "task"
argument is a pointer to the task structure to which the "regs"
pointer belongs.  user_mode() returns TRUE if the CPU state pointed
to by "regs" was executing in user mode (privilege level 3).
user_stack() returns TRUE if the state pointed to by "regs" was
executing on the user-level stack(s).  Finally, fsys_mode() returns
TRUE if the CPU state pointed to by "regs" was executing in fsys-mode.
The fsys_mode() macro is equivalent to the expression::

  !user_mode(regs) && user_stack(task,regs)

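The relationship between the three predicates can be sketched in plain C.
This is only an illustrative model, not the kernel's implementation: the
``model_regs`` structure and the ``model_*`` functions are hypothetical
names, the "task" argument is dropped, and the saved state is reduced to the
two facts the predicates depend on.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the saved CPU state.  The real
 * struct pt_regs is much larger; only what the predicates need is kept. */
struct model_regs {
    unsigned int cpl;       /* privilege level when interrupted: 0..3 */
    bool on_user_stack;     /* stacks not yet switched to kernel memory */
};

/* User mode means privilege level 3, as described in the text. */
static bool model_user_mode(const struct model_regs *regs)
{
    assert(regs->cpl <= 3);
    return regs->cpl == 3;
}

/* True while execution is still on the user-level stack(s). */
static bool model_user_stack(const struct model_regs *regs)
{
    return regs->on_user_stack;
}

/* fsys-mode: privileged, but the stacks have not been switched yet. */
static bool model_fsys_mode(const struct model_regs *regs)
{
    return !model_user_mode(regs) && model_user_stack(regs);
}
```

In this model, a state with ``cpl == 0`` and ``on_user_stack == true`` is the
only combination reported as fsys-mode; full kernel mode and plain user mode
both yield false.
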
How to write an fsyscall handler
================================

The file arch/ia64/kernel/fsys.S contains a table of fsyscall-handlers
(fsyscall_table).  This table contains one entry for each system call.
By default, a system call is handled by fsys_fallback_syscall().  This
routine takes care of entering (full) kernel mode and calling the
normal Linux system call handler.  For performance-critical system
calls, it is possible to write a hand-tuned fsyscall handler.  For
example, fsys.S contains fsys_getpid(), which is a hand-tuned version
of the getpid() system call.

The entry and exit states of an fsyscall handler are as follows:

Machine state on entry to fsyscall handler
------------------------------------------

=========  ===============================================================
r10        0
r11        saved ar.pfs (a user-level value)
r15        system call number
r16        "current" task pointer (in normal kernel-mode, this is in r13)
r32-r39    system call arguments
b6         return address (a user-level value)
ar.pfs     previous frame-state (a user-level value)
PSR.be     cleared to zero (i.e., little-endian byte order is in effect)
-          all other registers may contain values passed in from user-mode
=========  ===============================================================

Required machine state on exit from fsyscall handler
----------------------------------------------------

=========  ===========================================================
r11        saved ar.pfs (as passed into the fsyscall handler)
r15        system call number (as passed into the fsyscall handler)
r32-r39    system call arguments (as passed into the fsyscall handler)
b6         return address (as passed into the fsyscall handler)
ar.pfs     previous frame-state (as passed into the fsyscall handler)
=========  ===========================================================

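The exit requirement amounts to "everything listed in the second table must
equal its value on entry".  As a rough illustration, the following C sketch
expresses that check over a snapshot of the relevant registers.  The
``fsys_contract_regs`` structure and ``fsys_state_preserved()`` are
hypothetical names invented for this example; a real handler is written in
assembly and simply avoids clobbering these registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical snapshot of the registers covered by the exit contract. */
struct fsys_contract_regs {
    uint64_t r11;       /* saved ar.pfs */
    uint64_t r15;       /* system call number */
    uint64_t args[8];   /* r32-r39: system call arguments */
    uint64_t b6;        /* return address */
    uint64_t ar_pfs;    /* previous frame-state */
};

/* Returns true if every register named in the exit table is unchanged. */
static bool fsys_state_preserved(const struct fsys_contract_regs *entry,
                                 const struct fsys_contract_regs *exit_state)
{
    assert(entry != NULL && exit_state != NULL);
    return entry->r11 == exit_state->r11 &&
           entry->r15 == exit_state->r15 &&
           entry->b6 == exit_state->b6 &&
           entry->ar_pfs == exit_state->ar_pfs &&
           memcmp(entry->args, exit_state->args, sizeof(entry->args)) == 0;
}
```

A check of this kind is only a debugging aid in a simulator or test harness;
it is not something an fsyscall handler itself would execute.
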
Fsyscall handlers can execute with very little overhead, but with that
speed comes a set of restrictions:

* Fsyscall-handlers MUST check for any pending work in the flags
  member of the thread-info structure and if any of the
  TIF_ALLWORK_MASK flags are set, the handler needs to fall back on
  doing a full system call (by calling fsys_fallback_syscall).

* Fsyscall-handlers MUST preserve incoming arguments (r32-r39, r11,
  r15, b6, and ar.pfs) because they will be needed in case of a
  system call restart.  Of course, all "preserved" registers also
  must be preserved, in accordance with the normal calling conventions.

* Fsyscall-handlers MUST check whether argument registers contain a
  NaT value before using them in any way that could trigger a
  NaT-consumption fault.  If a system call argument is found to
  contain a NaT value, an fsyscall-handler may return immediately
  with r8=EINVAL, r10=-1.

* Fsyscall-handlers MUST NOT use the "alloc" instruction or perform
  any other operation that would trigger mandatory RSE
  (register-stack engine) traffic.

* Fsyscall-handlers MUST NOT write to any stacked registers because
  it is not safe to assume that user-level called a handler with the
  proper number of arguments.

* Fsyscall-handlers need to be careful when accessing per-CPU variables:
  unless proper safe-guards are taken (e.g., interruptions are avoided),
  execution may be pre-empted and resumed on another CPU at any given
  time.

* Fsyscall-handlers must be careful not to leak sensitive kernel
  information back to user-level.  In particular, before returning to
  user-level, care needs to be taken to clear any scratch registers
  that could contain sensitive information (note that the current
  task pointer is not considered sensitive: it's already exposed
  through ar.k6).

* Fsyscall-handlers MUST NOT access user-memory without first
  validating access-permission (this can typically be done via
  probe.r.fault and/or probe.w.fault) and without guarding against
  memory access exceptions (this can be done with the EX() macros
  defined by asmmacro.h).

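The first and third rules above determine the overall control flow of a
handler.  The C sketch below models that flow only; real handlers are
written in ia64 assembly in fsys.S.  All names here (``fsys_arg``,
``fsys_result``, ``model_fsys_handler()``, ``MODEL_TIF_ALLWORK_MASK`` and its
value) are hypothetical, the order of the two checks is an arbitrary choice,
and the "r10 = 0 on success" convention is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's TIF_ALLWORK_MASK. */
#define MODEL_TIF_ALLWORK_MASK 0xffu

enum fsys_outcome { FSYS_DONE, FSYS_FALLBACK };

/* One argument register: its value plus its NaT bit. */
struct fsys_arg {
    uint64_t value;
    bool nat;
};

struct fsys_result {
    enum fsys_outcome outcome;  /* handled here, or full syscall needed */
    int64_t r8;                 /* return value, or error code */
    int64_t r10;                /* 0 on success, -1 on error */
};

static struct fsys_result model_fsys_handler(uint32_t thread_flags,
                                             const struct fsys_arg *arg)
{
    struct fsys_result res = { FSYS_DONE, 0, 0 };

    assert(arg != NULL);

    /* Pending work: give up and take the full system call path. */
    if (thread_flags & MODEL_TIF_ALLWORK_MASK) {
        res.outcome = FSYS_FALLBACK;
        return res;
    }

    /* NaT argument: return immediately with r8=EINVAL, r10=-1. */
    if (arg->nat) {
        res.r8 = EINVAL;
        res.r10 = -1;
        return res;
    }

    /* Fast path: placeholder work that just echoes the argument. */
    res.r8 = (int64_t)arg->value;
    return res;
}
```

The model deliberately leaves out the remaining rules (no "alloc", no writes
to stacked registers, clearing scratch registers, probing user memory),
since those concern register usage that has no direct C equivalent.
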
The above restrictions may seem draconian, but remember that it's
possible to trade off some of the restrictions by paying a slightly
higher overhead.  For example, if an fsyscall-handler could benefit
from the shadow register bank, it could temporarily disable PSR.i and
PSR.ic, switch to bank 0 (bsw.0) and then use the shadow registers as
needed.  In other words, following the above rules yields extremely
fast system call execution (while fully preserving system call
semantics), but there is also a lot of flexibility in handling more
complicated cases.

Signal handling
===============

The delivery of (asynchronous) signals must be delayed until fsys-mode
is exited.  This is accomplished with the help of the lower-privilege
transfer trap: arch/ia64/kernel/process.c:do_notify_resume_user()
checks whether the interrupted task was in fsys-mode and, if so, sets
PSR.lp and returns immediately.  When fsys-mode is exited via the
"br.ret" instruction that lowers the privilege level, a trap will
occur.  The trap handler clears PSR.lp again and returns immediately.
The kernel exit path then checks for and delivers any pending signals.

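The deferral sequence can be pictured as a small state machine.  The
following C model is a sketch of the steps described above and nothing more:
``model_cpu``, ``model_notify_resume()`` and ``model_fsys_return()`` are
hypothetical names, and signal delivery is reduced to incrementing a
counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the state involved in signal deferral. */
struct model_cpu {
    bool fsys_mode;       /* interrupted code was in fsys-mode */
    bool psr_lp;          /* lower-privilege transfer trap armed */
    bool signal_pending;  /* an asynchronous signal is waiting */
    int delivered;        /* number of signals delivered so far */
};

/* Models the check described for do_notify_resume_user(). */
static void model_notify_resume(struct model_cpu *cpu)
{
    assert(cpu != NULL);
    if (cpu->fsys_mode) {
        cpu->psr_lp = true;   /* defer: arm the trap and return */
        return;
    }
    if (cpu->signal_pending) {
        cpu->signal_pending = false;
        cpu->delivered++;
    }
}

/* Models the "br.ret" that lowers the privilege level. */
static void model_fsys_return(struct model_cpu *cpu)
{
    assert(cpu != NULL);
    cpu->fsys_mode = false;
    if (cpu->psr_lp) {
        cpu->psr_lp = false;       /* trap handler clears PSR.lp */
        model_notify_resume(cpu);  /* exit path delivers pending signals */
    }
}
```

With a signal pending while ``fsys_mode`` is true, ``model_notify_resume()``
delivers nothing and only arms ``psr_lp``; the subsequent
``model_fsys_return()`` clears it and delivers the signal.
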
PSR Handling
============

Apart from raising the privilege level (PSR.cpl), the "epc" instruction
doesn't change the contents of PSR.  This is in contrast to a regular
interruption, which clears almost all bits.  Because of that, some care
needs to be taken to ensure things work as expected.  The following
discussion describes how each PSR bit is handled.

======= =======================================================================
PSR.be  Cleared when entering fsys-mode.  A srlz.d instruction is used
        to ensure the CPU is in little-endian mode before the first
        load/store instruction is executed.  PSR.be is normally NOT
        restored upon return from an fsys-mode handler.  In other
        words, user-level code must not rely on PSR.be being preserved
        across a system call.
PSR.up  Unchanged.
PSR.ac  Unchanged.
PSR.mfl Unchanged.  Note: fsys-mode handlers must not write
        floating-point registers!
PSR.mfh Unchanged.  Note: fsys-mode handlers must not write
        floating-point registers!
PSR.ic  Unchanged.  Note: fsys-mode handlers can clear the bit, if needed.
PSR.i   Unchanged.  Note: fsys-mode handlers can clear the bit, if needed.
PSR.pk  Unchanged.
PSR.dt  Unchanged.
PSR.dfl Unchanged.  Note: fsys-mode handlers must not write
        floating-point registers!
PSR.dfh Unchanged.  Note: fsys-mode handlers must not write
        floating-point registers!
PSR.sp  Unchanged.
PSR.pp  Unchanged.
PSR.di  Unchanged.
PSR.si  Unchanged.
PSR.db  Unchanged.  The kernel prevents user-level from setting a hardware
        breakpoint that triggers at any privilege level other than
        3 (user-mode).
PSR.lp  Unchanged.
PSR.tb  Lazy redirect.  If a taken-branch trap occurs while in
        fsys-mode, the trap-handler modifies the saved machine state
        such that execution resumes in the gate page at
        syscall_via_break(), with privilege level 3.  Note: the
        taken-branch trap would occur on the branch invoking the
        fsyscall-handler, at which point, by definition, a syscall
        restart is still safe.  If the system call number is invalid,
        the fsys-mode handler will return directly to user-level.  This
        return will trigger a taken-branch trap, but since the trap is
        taken _after_ restoring the privilege level, the CPU has already
        left fsys-mode, so no special treatment is needed.
PSR.rt  Unchanged.
PSR.cpl Cleared to 0.
PSR.is  Unchanged (guaranteed to be 0 on entry to the gate page).
PSR.mc  Unchanged.
PSR.it  Unchanged (guaranteed to be 1).
PSR.id  Unchanged.  Note: the ia64 linux kernel never sets this bit.
PSR.da  Unchanged.  Note: the ia64 linux kernel never sets this bit.
PSR.dd  Unchanged.  Note: the ia64 linux kernel never sets this bit.
PSR.ss  Lazy redirect.  If set, "epc" will cause a Single Step Trap to
        be taken.  The trap handler then modifies the saved machine
        state such that execution resumes in the gate page at
        syscall_via_break(), with privilege level 3.
PSR.ri  Unchanged.
PSR.ed  Unchanged.  Note: This bit could only have an effect if an fsys-mode
        handler performed a speculative load that gets NaTted.  If so, this
        would be the normal & expected behavior, so no special treatment is
        needed.
PSR.bn  Unchanged.  Note: fsys-mode handlers may clear the bit, if needed.
        Doing so requires clearing PSR.i and PSR.ic as well.
PSR.ia  Unchanged.  Note: the ia64 linux kernel never sets this bit.
======= =======================================================================

Using fast system calls
=======================

To use fast system calls, userspace applications simply need to call
__kernel_syscall_via_epc().  For example:

-- example fgettimeofday() call --

-- fgettimeofday.S --

::

  #include <asm/asmmacro.h>

  GLOBAL_ENTRY(fgettimeofday)
          .prologue
          .save ar.pfs, r11
          mov r11 = ar.pfs
          .body

          movl r2 = 0xa000000000020660;;  // gate address (64-bit immediate)
                                  // found by inspection of System.map for the
                                  // __kernel_syscall_via_epc() function.  See
                                  // below for how to do this for real.

          mov b7 = r2
          mov r15 = 1087                  // gettimeofday syscall
          ;;
          br.call.sptk.many b6 = b7
          ;;

          .restore sp

          mov ar.pfs = r11
          br.ret.sptk.many rp;;           // return to caller
  END(fgettimeofday)

-- end fgettimeofday.S --

In reality, the gate address is obtained from two extra values passed
via the ELF auxiliary vector (include/asm-ia64/elf.h):

* AT_SYSINFO : is the address of __kernel_syscall_via_epc()
* AT_SYSINFO_EHDR : is the address of the kernel gate ELF DSO

The ELF DSO is a pre-linked library that is mapped in by the kernel at
the gate page.  It is a proper ELF shared object so, with a dynamic
loader that recognises the library, you should be able to make calls to
the exported functions within it as with any other shared library.
AT_SYSINFO points into the kernel DSO at the
__kernel_syscall_via_epc() function for historical reasons (it was
used before the kernel DSO) and as a convenience.
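
From C, both auxiliary-vector entries can be read with glibc's getauxval()
instead of hard-coding an address taken from System.map.  The sketch below
assumes a Linux system with glibc 2.16 or later; ``gate_info``,
``lookup_gate_info()`` and ``gate_dso_present()`` are hypothetical names
chosen for this example.  getauxval() returns 0 for an entry the kernel did
not supply, so callers should treat 0 as "not available".

```c
#include <assert.h>
#include <elf.h>        /* AT_SYSINFO, AT_SYSINFO_EHDR */
#include <stddef.h>
#include <sys/auxv.h>   /* getauxval() */

/* Hypothetical container for the two gate-related auxv values. */
struct gate_info {
    unsigned long syscall_entry;  /* AT_SYSINFO: fast syscall entry point */
    unsigned long dso_ehdr;       /* AT_SYSINFO_EHDR: gate DSO ELF header */
};

/* Read both values from the auxiliary vector; absent entries read as 0. */
static struct gate_info lookup_gate_info(void)
{
    struct gate_info info;

    info.syscall_entry = getauxval(AT_SYSINFO);
    info.dso_ehdr = getauxval(AT_SYSINFO_EHDR);
    return info;
}

/* Returns nonzero when the kernel advertised a gate DSO. */
static int gate_dso_present(const struct gate_info *info)
{
    assert(info != NULL);
    return info->dso_ehdr != 0;
}
```

On ia64, the value in ``syscall_entry`` would take the place of the literal
gate address loaded into r2 in the fgettimeofday example.  On other
architectures AT_SYSINFO may be absent, in which case the field is 0.
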