====================
The robust futex ABI
====================

:Author: Started by Paul Jackson <pj@sgi.com>


Robust_futexes provide a mechanism, used in addition to normal futexes,
by which the kernel assists in cleaning up held locks on task exit.

The interesting data as to what futexes a thread is holding is kept on a
linked list in user space, where it can be updated efficiently as locks
are taken and dropped, without kernel intervention.  The only additional
kernel intervention required for robust_futexes above and beyond what is
required for futexes is:

 1) a one time call, per thread, to tell the kernel where its list of
    held robust_futexes begins, and
 2) internal kernel code at exit, to handle any listed locks held
    by the exiting thread.

The existing normal futexes already provide a "Fast Userspace Locking"
mechanism, which handles uncontested locking without needing a system
call, and handles contested locking by maintaining a list of waiting
threads in the kernel.  Options on the sys_futex(2) system call support
waiting on a particular futex, and waking up the next waiter on a
particular futex.

For robust_futexes to work, the user code (typically in a library such
as glibc linked with the application) has to manage and place the
necessary list elements exactly as the kernel expects them.  If it fails
to do so, then improperly listed locks will not be cleaned up on exit,
probably causing deadlock or other such failure of the other threads
waiting on the same locks.

A thread that anticipates possibly using robust_futexes should first
issue the system call::

    asmlinkage long
    sys_set_robust_list(struct robust_list_head __user *head, size_t len);

The pointer 'head' points to a structure in the thread's address space
consisting of three words.  Each word is 32 bits on 32 bit architectures,
or 64 bits on 64 bit architectures, in local byte order.  Each thread
should have its own thread private 'head'.

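As a rough illustration of the registration call, the sketch below keeps
one thread private 'head' and hands it to the kernel through the raw
syscall(2) wrapper.  It assumes a Linux system whose <linux/futex.h>
provides 'struct robust_list_head' and that 'len' is the size of that
structure.  The helper name register_robust_list() and the variable
my_head are hypothetical.  Note that C libraries such as glibc normally
register their own list for each thread, which this call would replace,
so treat it as a demonstration rather than something to add to a program
that also uses robust pthread mutexes.

```c
#define _GNU_SOURCE
#include <linux/futex.h>   /* struct robust_list_head */
#include <stddef.h>
#include <sys/syscall.h>   /* SYS_set_robust_list */
#include <unistd.h>        /* syscall() */

/* One private head per thread, as the text requires. */
static _Thread_local struct robust_list_head my_head;

/*
 * Hypothetical helper: initialise an empty list and register it for the
 * calling thread.  Returns 0 on success, -1 with errno set on failure.
 * Caution: this replaces any list the C library registered earlier.
 */
static int register_robust_list(void)
{
	my_head.list.next = &my_head.list;  /* empty list points to itself */
	my_head.futex_offset = 0;           /* placeholder until the lock layout is known */
	my_head.list_op_pending = NULL;     /* no insertion or removal in progress */

	return (int)syscall(SYS_set_robust_list, &my_head, sizeof(my_head));
}
```
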
If a thread is running in 32 bit compatibility mode on a 64 bit native
arch kernel, then it can actually have two such structures - one using
32 bit words for 32 bit compatibility mode, and one using 64 bit words
for 64 bit native mode.  The kernel, if it is a 64 bit kernel supporting
32 bit compatibility mode, will attempt to process both lists on each
task exit, if the corresponding sys_set_robust_list() call has been made
to set up that list.

  The first word in the memory structure at 'head' contains a
  pointer to a singly linked list of 'lock entries', one per lock,
  as described below.  If the list is empty, the pointer will point
  to itself, 'head'.  The last 'lock entry' points back to the 'head'.

  The second word, called 'offset', specifies the signed offset from
  the address of a 'lock entry' to what will be called its 'lock
  word'.  The 'lock word' is always a 32 bit word, unlike the other
  words above.  The 'lock word' holds 2 flag bits in the upper 2 bits,
  and the thread id (TID) of the thread holding the lock in the bottom
  30 bits.  See further below for a description of the flag bits.

  The third word, called 'list_op_pending', contains a transient copy
  of the address of the 'lock entry', during list insertion and
  removal, and is needed to correctly resolve races should a thread
  exit while in the middle of a locking or unlocking operation.

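The three word layout can be checked at compile time.  The sketch below
assumes the user space definitions of 'struct robust_list' and 'struct
robust_list_head' from <linux/futex.h>, where the second word is named
'futex_offset' rather than 'offset'.  The helper robust_list_is_empty()
is hypothetical and only restates the "points to itself" rule.

```c
#include <linux/futex.h>  /* struct robust_list, struct robust_list_head */
#include <stddef.h>       /* offsetof */

/* Three machine words, in the order the text describes them. */
_Static_assert(sizeof(struct robust_list_head) == 3 * sizeof(void *),
	       "head is three words");
_Static_assert(offsetof(struct robust_list_head, list) == 0,
	       "word 1: pointer to the first lock entry");
_Static_assert(offsetof(struct robust_list_head, futex_offset) ==
	       sizeof(void *),
	       "word 2: offset from a lock entry to its lock word");
_Static_assert(offsetof(struct robust_list_head, list_op_pending) ==
	       2 * sizeof(void *),
	       "word 3: lock entry being inserted or removed");

/* Hypothetical helper: an empty list is one whose first word points at 'head'. */
static int robust_list_is_empty(const struct robust_list_head *head)
{
	return head->list.next == &head->list;
}
```
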
Each 'lock entry' on the singly linked list starting at 'head' consists
of just a single word, pointing to the next 'lock entry', or back to
'head' if there are no more entries.  In addition, near each 'lock
entry', at an offset from the 'lock entry' specified by the 'offset'
word, is one 'lock word'.

The 'lock word' is always 32 bits, and is intended to be the same 32 bit
lock variable used by the futex mechanism, in conjunction with
robust_futexes.  The kernel will only be able to wake up the next thread
waiting for a lock on a thread's exit if that next thread used the futex
mechanism to register the address of that 'lock word' with the kernel.

For each futex lock currently held by a thread, if it wants this
robust_futex support for exit cleanup of that lock, it should have one
'lock entry' on this list, with its associated 'lock word' at the
specified 'offset'.  Should a thread die while holding any such locks,
the kernel will walk this list, mark any such locks with a bit
indicating their holder died, and wake up the next thread waiting for
that lock using the futex mechanism.

When a thread has invoked the above system call to indicate it
anticipates using robust_futexes, the kernel stores the passed in 'head'
pointer for that task.  The task may retrieve that value later on by
using the system call::

    asmlinkage long
    sys_get_robust_list(int pid, struct robust_list_head __user **head_ptr,
                        size_t __user *len_ptr);

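A minimal sketch of reading the value back, again through the raw
syscall(2) wrapper on Linux.  It relies on a 'pid' of 0 selecting the
calling thread, which is the behaviour documented in the
get_robust_list(2) manual page.  The helper name query_own_robust_list()
is hypothetical.  A caller would pass the addresses of a 'struct
robust_list_head *' and a 'size_t', and on success the size written back
is expected to be that of the three word head structure.

```c
#define _GNU_SOURCE
#include <linux/futex.h>   /* struct robust_list_head */
#include <stddef.h>
#include <sys/syscall.h>   /* SYS_get_robust_list */
#include <unistd.h>        /* syscall() */

/*
 * Hypothetical helper: fetch the head pointer the kernel has stored for
 * the calling thread (pid 0) and the size of the head structure.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int query_own_robust_list(struct robust_list_head **head, size_t *len)
{
	return (int)syscall(SYS_get_robust_list, 0, head, len);
}
```
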
It is anticipated that threads will use robust_futexes embedded in
larger, user level locking structures, one per lock.  The kernel
robust_futex mechanism doesn't care what else is in that structure, so
long as the 'offset' to the 'lock word' is the same for all
robust_futexes used by that thread.  The thread should link those locks
it currently holds using the 'lock entry' pointers.  It may also have
other links between the locks, such as the reverse side of a doubly
linked list, but that doesn't matter to the kernel.

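To make the embedding concrete, here is a sketch of one possible user
level lock structure.  The structure 'struct my_lock', its private
fields, the macro MY_LOCK_FUTEX_OFFSET and the helper lock_word_of() are
all hypothetical; only the 'lock entry' and the 'lock word' are part of
the ABI.  The macro computes the value that would be stored in the
'offset' word of 'head', and lock_word_of() applies it the way the text
describes, from the address of a 'lock entry' to its 'lock word'.

```c
#include <linux/futex.h>  /* struct robust_list */
#include <stddef.h>       /* offsetof */
#include <stdint.h>

/* Hypothetical user level lock; only 'entry' and 'futex' matter to the kernel. */
struct my_lock {
	struct my_lock *prev;      /* private back link, ignored by the kernel */
	struct robust_list entry;  /* the 'lock entry' */
	int recursion_count;       /* any other private state */
	uint32_t futex;            /* the 32 bit 'lock word' */
};

/* The same offset must hold for every robust lock the thread uses. */
#define MY_LOCK_FUTEX_OFFSET \
	((long)offsetof(struct my_lock, futex) - \
	 (long)offsetof(struct my_lock, entry))

/* Locate the 'lock word' that belongs to a given 'lock entry'. */
static uint32_t *lock_word_of(struct robust_list *entry, long futex_offset)
{
	return (uint32_t *)((char *)entry + futex_offset);
}
```
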
By keeping its locks linked this way, on a list starting with a 'head'
pointer known to the kernel, the kernel can provide to a thread the
essential service available for robust_futexes, which is to help clean
up locks held at the time of a (perhaps unexpected) exit.

Actual locking and unlocking, during normal operations, is handled
entirely by user level code in the contending threads, and by the
existing futex mechanism to wait for locks and to wake up waiters.  The
kernel's only essential involvement in robust_futexes is to remember
where the list 'head' is, and to walk the list on thread exit, handling
locks still held by the departing thread, as described below.

There may exist thousands of futex lock structures in a thread's shared
memory, on various data structures, at a given point in time.  Only those
lock structures for locks currently held by that thread should be on
that thread's robust_futex linked lock list at a given time.

A given futex lock structure in a user shared memory region may be held
at different times by any of the threads with access to that region.  The
thread currently holding such a lock, if any, is identified by its TID
in the lower 30 bits of the 'lock word'.

When adding or removing a lock from its list of held locks, in order for
the kernel to correctly handle lock cleanup regardless of when the task
exits (perhaps it gets an unexpected signal 9 in the middle of
manipulating this list), the user code must observe the following
protocol on 'lock entry' insertion and removal:

On insertion:

 1) set the 'list_op_pending' word to the address of the 'lock entry'
    to be inserted,
 2) acquire the futex lock,
 3) add the lock entry, with its thread id (TID) in the bottom 30 bits
    of the 'lock word', to the linked list starting at 'head', and
 4) clear the 'list_op_pending' word.

On removal:

 1) set the 'list_op_pending' word to the address of the 'lock entry'
    to be removed,
 2) remove the lock entry for this lock from the 'head' list,
 3) release the futex lock, and
 4) clear the 'list_op_pending' word.

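The two sequences above can be sketched in user space as follows.  This
is a simplified model under stated assumptions, not how any particular
library implements it: 'struct demo_lock', demo_trylock() and
demo_unlock() are hypothetical names, only the uncontended path is
shown (no futex wait on failure, no futex wake on release), and the
compiler barriers a real implementation needs to keep the stores in
program order are omitted.  The caller supplies its own TID and a 'head'
whose 'futex_offset' matches the structure layout.

```c
#include <linux/futex.h>  /* struct robust_list_head, FUTEX_TID_MASK */
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lock: the 'lock entry' followed by the 'lock word'. */
struct demo_lock {
	struct robust_list entry;
	_Atomic uint32_t futex;
};

/* Uncontended acquire only: returns 0 on success, -1 if already held. */
static int demo_trylock(struct robust_list_head *head,
			struct demo_lock *lock, uint32_t tid)
{
	uint32_t expected = 0;

	head->list_op_pending = &lock->entry;           /* insertion step 1 */
	if (!atomic_compare_exchange_strong(&lock->futex, &expected,
					    tid & FUTEX_TID_MASK)) {
		head->list_op_pending = NULL;           /* lock not taken */
		return -1;
	}                                               /* insertion step 2 */
	lock->entry.next = head->list.next;             /* insertion step 3 */
	head->list.next = &lock->entry;
	head->list_op_pending = NULL;                   /* insertion step 4 */
	return 0;
}

/* Precondition: 'lock' is on the list at 'head' and held by the caller. */
static void demo_unlock(struct robust_list_head *head,
			struct demo_lock *lock)
{
	struct robust_list *prev = &head->list;

	head->list_op_pending = &lock->entry;           /* removal step 1 */
	while (prev->next != &lock->entry)              /* removal step 2 */
		prev = prev->next;
	prev->next = lock->entry.next;
	atomic_store(&lock->futex, 0);                  /* removal step 3 */
	head->list_op_pending = NULL;                   /* removal step 4 */
}
```
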
On exit, the kernel will consider the address stored in
'list_op_pending' and the address of each 'lock entry' found by walking
the list starting at 'head'.  For each such address, if the bottom 30
bits of the 'lock word' at offset 'offset' from that address equal the
exiting thread's TID, then the kernel will do two things:

 1) if bit 31 (0x80000000) is set in that word, then attempt a futex
    wakeup on that address, which will waken the next thread that has
    used the futex mechanism to wait on that address, and
 2) atomically set bit 30 (0x40000000) in the 'lock word'.

In the above, bit 31 was set by futex waiters on that lock to indicate
they were waiting, and bit 30 is set by the kernel to indicate that the
lock owner died holding the lock.

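The per lock handling can be modelled in user space to show how the
three fields of the 'lock word' interact.  The function
model_handle_death() below is a hypothetical model that follows the
wording of the text, not the kernel source, and it only reports whether
a wakeup would be attempted instead of performing one.  The constants
FUTEX_WAITERS (bit 31), FUTEX_OWNER_DIED (bit 30) and FUTEX_TID_MASK
(the bottom 30 bits) are assumed to come from <linux/futex.h>.

```c
#include <linux/futex.h>  /* FUTEX_WAITERS, FUTEX_OWNER_DIED, FUTEX_TID_MASK */
#include <stdatomic.h>
#include <stdint.h>

/*
 * User space model of the exit handling for one 'lock word'.
 * Returns 1 if a futex wakeup would be attempted, 0 otherwise.
 */
static int model_handle_death(_Atomic uint32_t *lock_word,
			      uint32_t exiting_tid)
{
	uint32_t old = atomic_load(lock_word);

	if ((old & FUTEX_TID_MASK) != exiting_tid)
		return 0;  /* not held by the exiting thread: leave it alone */

	/* Atomically set bit 30 so the next owner can see the holder died. */
	old = atomic_fetch_or(lock_word, FUTEX_OWNER_DIED);

	/* Bit 31 was set by a waiter, so someone needs to be woken. */
	return (old & FUTEX_WAITERS) != 0;
}
```
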
The kernel exit code will silently stop scanning the list further if at
any point:

 1) the 'head' pointer or a subsequent linked list pointer
    is not a valid address of a user space word,
 2) the calculated location of the 'lock word' (address plus
    'offset') is not the valid address of a 32 bit user space
    word, or
 3) the list contains more than 1 million (subject to
    future kernel configuration changes) elements.

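The walk itself, including the element limit, can be modelled in user
space.  The sketch below is a hypothetical single threaded model:
'struct model_lock', mark_if_owned() and model_exit_walk() are invented
names, the limit is passed in as a parameter so that it can be
exercised with small lists, and the invalid address checks are left out
because only the kernel can perform them safely.  The model handles the
'list_op_pending' entry once, after the walk, which is an assumption
made here to avoid marking the same lock twice.  The return value is the
number of lock words owned by 'tid' that were marked.

```c
#include <linux/futex.h>  /* struct robust_list_head, FUTEX_* constants */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lock layout used by this model. */
struct model_lock {
	struct robust_list entry;
	uint32_t futex;
};

/* Mark the lock word as "owner died" if 'tid' holds it; returns 1 if marked. */
static int mark_if_owned(struct robust_list *entry, long offset, uint32_t tid)
{
	uint32_t *word = (uint32_t *)((char *)entry + offset);

	if ((*word & FUTEX_TID_MASK) != tid)
		return 0;                /* held by someone else: skip it */
	*word |= FUTEX_OWNER_DIED;       /* the kernel does this atomically */
	return 1;
}

/* Visit at most 'limit' list entries, then the pending entry if any. */
static unsigned long model_exit_walk(struct robust_list_head *head,
				     uint32_t tid, unsigned long limit)
{
	struct robust_list *pending = head->list_op_pending;
	struct robust_list *entry = head->list.next;
	unsigned long marked = 0;

	while (entry != &head->list && limit-- > 0) {
		if (entry != pending)    /* pending entry is handled below */
			marked += mark_if_owned(entry, head->futex_offset, tid);
		entry = entry->next;
	}
	if (pending)
		marked += mark_if_owned(pending, head->futex_offset, tid);
	return marked;
}
```
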
When the kernel sees a list entry whose 'lock word' doesn't have the
exiting thread's TID in the lower 30 bits, it does nothing with that
entry, and goes on to the next entry.