===================================
Light-weight System Calls for IA-64
===================================

                        Started: 13-Jan-2003

                    Last update: 27-Sep-2003

                      David Mosberger-Tang
                      <davidm@hpl.hp.com>

Using the "epc" instruction effectively introduces a new mode of
execution to the ia64 linux kernel.  We call this mode the
"fsys-mode".  To recap, the normal states of execution are:

  - kernel mode:
        Both the register stack and the memory stack have been
        switched over to kernel memory.  The user-level state is saved
        in a pt_regs structure at the top of the kernel memory stack.

  - user mode:
        Both the register stack and the memory stack are in
        user memory.  The user-level state is contained in the
        CPU registers.

  - bank 0 interruption-handling mode:
        This is the non-interruptible state which all
        interruption-handlers start execution in.  The user-level
        state remains in the CPU registers and some kernel state may
        be stored in bank 0 of registers r16-r31.

In contrast, fsys-mode has the following special properties:

  - execution is at privilege level 0 (most-privileged)

  - CPU registers may contain a mixture of user-level and kernel-level
    state (it is the responsibility of the kernel to ensure that no
    security-sensitive kernel-level state is leaked back to
    user-level)

  - execution is interruptible and preemptible (an fsys-mode handler
    can disable interrupts and avoid all other interruption-sources
    to avoid preemption)

  - neither the memory-stack nor the register-stack can be trusted while
    in fsys-mode (they point to the user-level stacks, which may
    be invalid, or completely bogus addresses)

In summary, fsys-mode is much more similar to running in user-mode
than it is to running in kernel-mode.  Of course, given that the
privilege level is at level 0, this means that fsys-mode requires some
care (see below).


How to tell fsys-mode
=====================

Linux operates in fsys-mode when (a) the privilege level is 0 (most
privileged) and (b) the stacks have NOT been switched to kernel memory
yet.  For convenience, the header file <asm-ia64/ptrace.h> provides
three macros::

        user_mode(regs)
        user_stack(task,regs)
        fsys_mode(task,regs)

The "regs" argument is a pointer to a pt_regs structure.  The "task"
argument is a pointer to the task structure to which the "regs"
pointer belongs.  user_mode() returns TRUE if the CPU state pointed
to by "regs" was executing in user mode (privilege level 3).
user_stack() returns TRUE if the state pointed to by "regs" was
executing on the user-level stack(s).  Finally, fsys_mode() returns
TRUE if the CPU state pointed to by "regs" was executing in fsys-mode.
The fsys_mode() macro is equivalent to the expression::

        !user_mode(regs) && user_stack(task,regs)
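
The relationship between the three predicates can be illustrated with a
small stand-alone C model.  This is only a sketch: struct toy_regs and
the toy_*() functions are hypothetical stand-ins invented for
illustration, not the kernel's pt_regs or the real macros from
<asm-ia64/ptrace.h>, and the "task" argument is dropped because the toy
structure records directly whether the stacks were switched.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the saved CPU state.
 * This is NOT the kernel's struct pt_regs. */
struct toy_regs {
	unsigned cpl;         /* saved privilege level: 0 (kernel) or 3 (user) */
	bool on_user_stacks;  /* true if stacks were not yet switched to kernel memory */
};

/* TRUE if the saved state was executing at privilege level 3. */
static bool toy_user_mode(const struct toy_regs *regs)
{
	return regs->cpl == 3;
}

/* TRUE if the saved state was executing on the user-level stack(s). */
static bool toy_user_stack(const struct toy_regs *regs)
{
	return regs->on_user_stacks;
}

/* fsys-mode: privileged, but still running on the user-level stacks. */
static bool toy_fsys_mode(const struct toy_regs *regs)
{
	return !toy_user_mode(regs) && toy_user_stack(regs);
}
```

In this model, a state with cpl 0 and on_user_stacks set is the only
combination for which toy_fsys_mode() returns true; plain user mode and
full kernel mode both yield false.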

How to write an fsyscall handler
================================

The file arch/ia64/kernel/fsys.S contains a table of fsyscall-handlers
(fsyscall_table).  This table contains one entry for each system call.
By default, a system call is handled by fsys_fallback_syscall().  This
routine takes care of entering (full) kernel mode and calling the
normal Linux system call handler.  For performance-critical system
calls, it is possible to write a hand-tuned fsyscall_handler.  For
example, fsys.S contains fsys_getpid(), which is a hand-tuned version
of the getpid() system call.
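
The table-with-default-fallback idea can be sketched in portable C.  The
real table lives in assembly in fsys.S; everything below (the table
size, the getpid index, the handler signature, the return values and the
-ENOSYS result for an out-of-range number) is a hypothetical model
chosen for illustration only.

```c
#include <errno.h>
#include <stddef.h>

#define TOY_NR_SYSCALLS 8   /* hypothetical table size */
#define TOY_NR_GETPID   3   /* hypothetical slot for the hand-tuned handler */

typedef long (*toy_handler_t)(long arg0);

static long toy_fallback_calls;  /* counts trips through the slow path */

/* Stand-in for fsys_fallback_syscall(): would enter full kernel mode. */
static long toy_fallback_syscall(long arg0)
{
	(void)arg0;
	toy_fallback_calls++;
	return 0;
}

/* Stand-in for a hand-tuned handler such as fsys_getpid(). */
static long toy_fsys_getpid(long arg0)
{
	(void)arg0;
	return 42;  /* made-up pid */
}

static toy_handler_t toy_fsyscall_table[TOY_NR_SYSCALLS];

/* Every entry defaults to the fallback; fast handlers override it. */
static void toy_init_table(void)
{
	for (size_t i = 0; i < TOY_NR_SYSCALLS; i++)
		toy_fsyscall_table[i] = toy_fallback_syscall;
	toy_fsyscall_table[TOY_NR_GETPID] = toy_fsys_getpid;
}

/* Look up the handler by system call number and invoke it. */
static long toy_dispatch(unsigned nr, long arg0)
{
	if (nr >= TOY_NR_SYSCALLS)
		return -ENOSYS;  /* assumption made by this model */
	return toy_fsyscall_table[nr](arg0);
}
```

After toy_init_table(), only the getpid slot bypasses the fallback;
adding another fast handler is a one-line change to the table setup.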

The entry and exit states of an fsyscall handler are as follows:

Machine state on entry to fsyscall handler
------------------------------------------

  ========= ===============================================================
  r10       0
  r11       saved ar.pfs (a user-level value)
  r15       system call number
  r16       "current" task pointer (in normal kernel-mode, this is in r13)
  r32-r39   system call arguments
  b6        return address (a user-level value)
  ar.pfs    previous frame-state (a user-level value)
  PSR.be    cleared to zero (i.e., little-endian byte order is in effect)
  -         all other registers may contain values passed in from user-mode
  ========= ===============================================================

Required machine state on exit from fsyscall handler
----------------------------------------------------

  ========= ===========================================================
  r11       saved ar.pfs (as passed into the fsyscall handler)
  r15       system call number (as passed into the fsyscall handler)
  r32-r39   system call arguments (as passed into the fsyscall handler)
  b6        return address (as passed into the fsyscall handler)
  ar.pfs    previous frame-state (as passed into the fsyscall handler)
  ========= ===========================================================
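
The exit requirement amounts to "the registers listed in the second
table must hold the same values as on entry".  As a rough illustration,
the following C sketch captures those registers in a hypothetical
structure (struct toy_fsys_state is not a kernel type) and compares an
entry snapshot against an exit snapshot.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the registers an fsyscall handler must preserve. */
struct toy_fsys_state {
	uint64_t r11;      /* saved ar.pfs */
	uint64_t r15;      /* system call number */
	uint64_t args[8];  /* r32-r39: system call arguments */
	uint64_t b6;       /* return address */
	uint64_t ar_pfs;   /* previous frame-state */
};

/* Returns true if every must-preserve register is unchanged at exit. */
static bool toy_state_preserved(const struct toy_fsys_state *entry,
				const struct toy_fsys_state *at_exit)
{
	if (entry->r11 != at_exit->r11 || entry->r15 != at_exit->r15)
		return false;
	if (entry->b6 != at_exit->b6 || entry->ar_pfs != at_exit->ar_pfs)
		return false;
	for (int i = 0; i < 8; i++)
		if (entry->args[i] != at_exit->args[i])
			return false;
	return true;
}
```

A check like this is only meaningful in a test harness or simulator; a
real handler simply avoids writing these registers in the first place.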

Fsyscall handlers can execute with very little overhead, but with that
speed comes a set of restrictions:

 * Fsyscall-handlers MUST check for any pending work in the flags
   member of the thread-info structure and if any of the
   TIF_ALLWORK_MASK flags are set, the handler needs to fall back on
   doing a full system call (by calling fsys_fallback_syscall).

 * Fsyscall-handlers MUST preserve incoming arguments (r32-r39, r11,
   r15, b6, and ar.pfs) because they will be needed in case of a
   system call restart.  Of course, all "preserved" registers also
   must be preserved, in accordance with the normal calling conventions.

 * Fsyscall-handlers MUST check argument registers for containing a
   NaT value before using them in any way that could trigger a
   NaT-consumption fault.  If a system call argument is found to
   contain a NaT value, an fsyscall-handler may return immediately
   with r8=EINVAL, r10=-1.

 * Fsyscall-handlers MUST NOT use the "alloc" instruction or perform
   any other operation that would trigger mandatory RSE
   (register-stack engine) traffic.

 * Fsyscall-handlers MUST NOT write to any stacked registers because
   it is not safe to assume that user-level called a handler with the
   proper number of arguments.

 * Fsyscall-handlers need to be careful when accessing per-CPU variables:
   unless proper safe-guards are taken (e.g., interruptions are avoided),
   execution may be pre-empted and resumed on another CPU at any given
   time.

 * Fsyscall-handlers must be careful not to leak sensitive kernel
   information back to user-level.  In particular, before returning to
   user-level, care needs to be taken to clear any scratch registers
   that could contain sensitive information (note that the current
   task pointer is not considered sensitive: it's already exposed
   through ar.k6).

 * Fsyscall-handlers MUST NOT access user-memory without first
   validating access-permission (this can typically be done via
   probe.r.fault and/or probe.w.fault) and without guarding against
   memory access exceptions (this can be done with the EX() macros
   defined by asmmacro.h).
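
Two of the rules above (the pending-work check and the NaT check) shape
the first few instructions of almost every handler.  The control flow
can be modelled in C as below.  This is a sketch under stated
assumptions: real handlers are written in ia64 assembly, the mask value
is made up, and struct toy_arg, struct toy_result and toy_handler() are
hypothetical names that do not exist in the kernel.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_TIF_ALLWORK_MASK 0x0fu  /* made-up value standing in for TIF_ALLWORK_MASK */

/* A register value plus its NaT ("not a thing") bit. */
struct toy_arg {
	uint64_t value;
	bool nat;
};

/* What the handler hands back: r8/r10, or a request for the slow path. */
struct toy_result {
	long r8;
	long r10;
	bool fell_back;  /* true if a full system call is required */
};

static struct toy_result toy_handler(uint32_t thread_flags, struct toy_arg arg)
{
	struct toy_result res = { 0, 0, false };

	/* Rule: pending work forces the full system call path. */
	if (thread_flags & TOY_TIF_ALLWORK_MASK) {
		res.fell_back = true;
		return res;
	}

	/* Rule: never consume a NaT argument; fail with EINVAL instead. */
	if (arg.nat) {
		res.r8 = EINVAL;
		res.r10 = -1;
		return res;
	}

	/* Placeholder for the actual fast-path work. */
	res.r8 = (long)(arg.value + 1);
	return res;
}
```

The order of the two checks in this model is a choice of the sketch, not
something the rules above prescribe.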

The above restrictions may seem draconian, but remember that it's
possible to trade off some of the restrictions by paying a slightly
higher overhead.  For example, if an fsyscall-handler could benefit
from the shadow register bank, it could temporarily disable PSR.i and
PSR.ic, switch to bank 0 (bsw.0) and then use the shadow registers as
needed.  In other words, following the above rules yields extremely
fast system call execution (while fully preserving system call
semantics), but there is also a lot of flexibility in handling more
complicated cases.

Signal handling
===============

The delivery of (asynchronous) signals must be delayed until fsys-mode
is exited.  This is accomplished with the help of the lower-privilege
transfer trap: arch/ia64/kernel/process.c:do_notify_resume_user()
checks whether the interrupted task was in fsys-mode and, if so, sets
PSR.lp and returns immediately.  When fsys-mode is exited via the
"br.ret" instruction that lowers the privilege level, a trap will
occur.  The trap handler clears PSR.lp again and returns immediately.
The kernel exit path then checks for and delivers any pending signals.
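
The sequence described above can be followed step by step in a small
state-machine model.  The sketch below is plain C with hypothetical
names (struct toy_cpu, toy_notify_resume(), toy_br_ret_to_user()); it
mirrors the prose only and is not derived from the code in process.c.

```c
#include <stdbool.h>

/* Hypothetical model of the state relevant to delayed signal delivery. */
struct toy_cpu {
	bool in_fsys_mode;      /* interrupted task is in fsys-mode */
	bool psr_lp;            /* lower-privilege transfer trap armed */
	bool signal_pending;
	int signals_delivered;
};

/* Models the check described for do_notify_resume_user(). */
static void toy_notify_resume(struct toy_cpu *c)
{
	if (!c->signal_pending)
		return;
	if (c->in_fsys_mode) {
		c->psr_lp = true;  /* defer: arm the trap and return immediately */
		return;
	}
	c->signal_pending = false;
	c->signals_delivered++;
}

/* Models "br.ret" lowering the privilege level at the end of a handler. */
static void toy_br_ret_to_user(struct toy_cpu *c)
{
	c->in_fsys_mode = false;
	if (c->psr_lp) {
		c->psr_lp = false;     /* trap handler clears PSR.lp again */
		toy_notify_resume(c);  /* exit path delivers pending signals */
	}
}
```

In the model, a signal that arrives during fsys-mode is not delivered by
toy_notify_resume(); it is delivered only once toy_br_ret_to_user() has
left fsys-mode and the armed trap has fired.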

PSR Handling
============

The "epc" instruction doesn't change the contents of PSR at all.  This
is in contrast to a regular interruption, which clears almost all
bits.  Because of that, some care needs to be taken to ensure things
work as expected.  The following discussion describes how each PSR bit
is handled.

======= =======================================================================
PSR.be  Cleared when entering fsys-mode.  A srlz.d instruction is used
        to ensure the CPU is in little-endian mode before the first
        load/store instruction is executed.  PSR.be is normally NOT
        restored upon return from an fsys-mode handler.  In other
        words, user-level code must not rely on PSR.be being preserved
        across a system call.
PSR.up  Unchanged.
PSR.ac  Unchanged.
PSR.mfl Unchanged.  Note: fsys-mode handlers must not write-registers!
PSR.mfh Unchanged.  Note: fsys-mode handlers must not write-registers!
PSR.ic  Unchanged.  Note: fsys-mode handlers can clear the bit, if needed.
PSR.i   Unchanged.  Note: fsys-mode handlers can clear the bit, if needed.
PSR.pk  Unchanged.
PSR.dt  Unchanged.
PSR.dfl Unchanged.  Note: fsys-mode handlers must not write-registers!
PSR.dfh Unchanged.  Note: fsys-mode handlers must not write-registers!
PSR.sp  Unchanged.
PSR.pp  Unchanged.
PSR.di  Unchanged.
PSR.si  Unchanged.
PSR.db  Unchanged.  The kernel prevents user-level from setting a hardware
        breakpoint that triggers at any privilege level other than
        3 (user-mode).
PSR.lp  Unchanged.
PSR.tb  Lazy redirect.  If a taken-branch trap occurs while in
        fsys-mode, the trap-handler modifies the saved machine state
        such that execution resumes in the gate page at
        syscall_via_break(), with privilege level 3.  Note: the
        taken branch would occur on the branch invoking the
        fsyscall-handler, at which point, by definition, a syscall
        restart is still safe.  If the system call number is invalid,
        the fsys-mode handler will return directly to user-level.  This
        return will trigger a taken-branch trap, but since the trap is
        taken _after_ restoring the privilege level, the CPU has already
        left fsys-mode, so no special treatment is needed.
PSR.rt  Unchanged.
PSR.cpl Cleared to 0.
PSR.is  Unchanged (guaranteed to be 0 on entry to the gate page).
PSR.mc  Unchanged.
PSR.it  Unchanged (guaranteed to be 1).
PSR.id  Unchanged.  Note: the ia64 linux kernel never sets this bit.
PSR.da  Unchanged.  Note: the ia64 linux kernel never sets this bit.
PSR.dd  Unchanged.  Note: the ia64 linux kernel never sets this bit.
PSR.ss  Lazy redirect.  If set, "epc" will cause a Single Step Trap to
        be taken.  The trap handler then modifies the saved machine
        state such that execution resumes in the gate page at
        syscall_via_break(), with privilege level 3.
PSR.ri  Unchanged.
PSR.ed  Unchanged.  Note: This bit could only have an effect if an fsys-mode
        handler performed a speculative load that gets NaTted.  If so, this
        would be the normal & expected behavior, so no special treatment is
        needed.
PSR.bn  Unchanged.  Note: fsys-mode handlers may clear the bit, if needed.
        Doing so requires clearing PSR.i and PSR.ic as well.
PSR.ia  Unchanged.  Note: the ia64 linux kernel never sets this bit.
======= =======================================================================

Using fast system calls
=======================

To use fast system calls, userspace applications simply need to call
__kernel_syscall_via_epc().  For example:

-- example fgettimeofday() call --

-- fgettimeofday.S --

::

  #include <asm/asmmacro.h>

  GLOBAL_ENTRY(fgettimeofday)
  .prologue
  .save ar.pfs, r11
  mov r11 = ar.pfs
  .body

  movl r2 = 0xa000000000020660;; // gate address (movl: 64-bit immediate)
                               // found by inspection of System.map for the
                               // __kernel_syscall_via_epc() function.  See
                               // below for how to do this for real.

  mov b7 = r2
  mov r15 = 1087                       // gettimeofday syscall
  ;;
  br.call.sptk.many b6 = b7
  ;;

  .restore sp

  mov ar.pfs = r11
  br.ret.sptk.many rp;;       // return to caller
  END(fgettimeofday)

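-- end fgettimeofday.S --

In reality, the gate address is obtained from two extra values passed
via the ELF auxiliary vector (include/asm-ia64/elf.h):

 * AT_SYSINFO : the address of __kernel_syscall_via_epc()
 * AT_SYSINFO_EHDR : the address of the kernel gate ELF DSO

The ELF DSO is a pre-linked library that is mapped in by the kernel at
the gate page.  It is a proper ELF shared object so, with a dynamic
loader that recognises the library, you should be able to make calls to
the exported functions within it as with any other shared library.
AT_SYSINFO points into the kernel DSO at the
__kernel_syscall_via_epc() function for historical reasons (it was
used before the kernel DSO) and as a convenience.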
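
One way a userspace program might read these two auxiliary-vector
entries is via getauxval().  Note the assumptions: getauxval() is a C
library interface (glibc 2.16 and later) that this document does not
itself mention, struct gate_info and read_gate_info() are hypothetical
names, and on architectures other than ia64 the AT_SYSINFO entry may
simply be absent, in which case getauxval() returns 0.

```c
#include <elf.h>       /* AT_SYSINFO, AT_SYSINFO_EHDR */
#include <sys/auxv.h>  /* getauxval() */

/* Hypothetical container for the two gate-related auxv entries. */
struct gate_info {
	unsigned long sysinfo;       /* AT_SYSINFO: fast-syscall entry point */
	unsigned long sysinfo_ehdr;  /* AT_SYSINFO_EHDR: base of the kernel ELF DSO */
};

/* Reads both entries; a field is 0 if the kernel did not supply it. */
static struct gate_info read_gate_info(void)
{
	struct gate_info g;

	g.sysinfo = getauxval(AT_SYSINFO);
	g.sysinfo_ehdr = getauxval(AT_SYSINFO_EHDR);
	return g;
}
```

A caller would check that g.sysinfo is non-zero before using it as the
branch target in place of the hard-coded gate address shown in
fgettimeofday.S.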
0294  * AT_SYSINFO : is the address of __kernel_syscall_via_epc()
0295  * AT_SYSINFO_EHDR : is the address of the kernel gate ELF DSO
0296 
0297 The ELF DSO is a pre-linked library that is mapped in by the kernel at
0298 the gate page.  It is a proper ELF shared object so, with a dynamic
0299 loader that recognises the library, you should be able to make calls to
0300 the exported functions within it as with any other shared library.
0301 AT_SYSINFO points into the kernel DSO at the
0302 __kernel_syscall_via_epc() function for historical reasons (it was
0303 used before the kernel DSO) and as a convenience.