Back to home page




0001 Futex Requeue PI
0002 ----------------
0004 Requeueing of tasks from a non-PI futex to a PI futex requires
0005 special handling in order to ensure the underlying rt_mutex is never
0006 left without an owner if it has waiters; doing so would break the PI
0007 boosting logic [see rt-mutex-desgin.txt] For the purposes of
0008 brevity, this action will be referred to as "requeue_pi" throughout
0009 this document.  Priority inheritance is abbreviated throughout as
0010 "PI".
0012 Motivation
0013 ----------
0015 Without requeue_pi, the glibc implementation of
0016 pthread_cond_broadcast() must resort to waking all the tasks waiting
0017 on a pthread_condvar and letting them try to sort out which task
0018 gets to run first in classic thundering-herd formation.  An ideal
0019 implementation would wake the highest-priority waiter, and leave the
0020 rest to the natural wakeup inherent in unlocking the mutex
0021 associated with the condvar.
0023 Consider the simplified glibc calls:
0025 /* caller must lock mutex */
0026 pthread_cond_wait(cond, mutex)
0027 {
0028         lock(cond->__data.__lock);
0029         unlock(mutex);
0030         do {
0031            unlock(cond->__data.__lock);
0032            futex_wait(cond->__data.__futex);
0033            lock(cond->__data.__lock);
0034         } while(...)
0035         unlock(cond->__data.__lock);
0036         lock(mutex);
0037 }
0039 pthread_cond_broadcast(cond)
0040 {
0041         lock(cond->__data.__lock);
0042         unlock(cond->__data.__lock);
0043         futex_requeue(cond->data.__futex, cond->mutex);
0044 }
0046 Once pthread_cond_broadcast() requeues the tasks, the cond->mutex
0047 has waiters. Note that pthread_cond_wait() attempts to lock the
0048 mutex only after it has returned to user space.  This will leave the
0049 underlying rt_mutex with waiters, and no owner, breaking the
0050 previously mentioned PI-boosting algorithms.
0052 In order to support PI-aware pthread_condvar's, the kernel needs to
0053 be able to requeue tasks to PI futexes.  This support implies that
0054 upon a successful futex_wait system call, the caller would return to
0055 user space already holding the PI futex.  The glibc implementation
0056 would be modified as follows:
0059 /* caller must lock mutex */
0060 pthread_cond_wait_pi(cond, mutex)
0061 {
0062         lock(cond->__data.__lock);
0063         unlock(mutex);
0064         do {
0065            unlock(cond->__data.__lock);
0066            futex_wait_requeue_pi(cond->__data.__futex);
0067            lock(cond->__data.__lock);
0068         } while(...)
0069         unlock(cond->__data.__lock);
0070         /* the kernel acquired the mutex for us */
0071 }
0073 pthread_cond_broadcast_pi(cond)
0074 {
0075         lock(cond->__data.__lock);
0076         unlock(cond->__data.__lock);
0077         futex_requeue_pi(cond->data.__futex, cond->mutex);
0078 }
0080 The actual glibc implementation will likely test for PI and make the
0081 necessary changes inside the existing calls rather than creating new
0082 calls for the PI cases.  Similar changes are needed for
0083 pthread_cond_timedwait() and pthread_cond_signal().
0085 Implementation
0086 --------------
0088 In order to ensure the rt_mutex has an owner if it has waiters, it
0089 is necessary for both the requeue code, as well as the waiting code,
0090 to be able to acquire the rt_mutex before returning to user space.
0091 The requeue code cannot simply wake the waiter and leave it to
0092 acquire the rt_mutex as it would open a race window between the
0093 requeue call returning to user space and the waiter waking and
0094 starting to run.  This is especially true in the uncontended case.
0096 The solution involves two new rt_mutex helper routines,
0097 rt_mutex_start_proxy_lock() and rt_mutex_finish_proxy_lock(), which
0098 allow the requeue code to acquire an uncontended rt_mutex on behalf
0099 of the waiter and to enqueue the waiter on a contended rt_mutex.
0100 Two new system calls provide the kernel<->user interface to
0103 FUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait()
0104 and pthread_cond_timedwait()) to block on the initial futex and wait
0105 to be requeued to a PI-aware futex.  The implementation is the
0106 result of a high-speed collision between futex_wait() and
0107 futex_lock_pi(), with some extra logic to check for the additional
0108 wake-up scenarios.
0110 FUTEX_CMP_REQUEUE_PI is called by the waker
0111 (pthread_cond_broadcast() and pthread_cond_signal()) to requeue and
0112 possibly wake the waiting tasks. Internally, this system call is
0113 still handled by futex_requeue (by passing requeue_pi=1).  Before
0114 requeueing, futex_requeue() attempts to acquire the requeue target
0115 PI futex on behalf of the top waiter.  If it can, this waiter is
0116 woken.  futex_requeue() then proceeds to requeue the remaining
0117 nr_wake+nr_requeue tasks to the PI futex, calling
0118 rt_mutex_start_proxy_lock() prior to each requeue to prepare the
0119 task as a waiter on the underlying rt_mutex.  It is possible that
0120 the lock can be acquired at this stage as well, if so, the next
0121 waiter is woken to finish the acquisition of the lock.
0123 FUTEX_CMP_REQUEUE_PI accepts nr_wake and nr_requeue as arguments, but
0124 their sum is all that really matters.  futex_requeue() will wake or
0125 requeue up to nr_wake + nr_requeue tasks.  It will wake only as many
0126 tasks as it can acquire the lock for, which in the majority of cases
0127 should be 0 as good programming practice dictates that the caller of
0128 either pthread_cond_broadcast() or pthread_cond_signal() acquire the
0129 mutex prior to making the call. FUTEX_CMP_REQUEUE_PI requires that
0130 nr_wake=1.  nr_requeue should be INT_MAX for broadcast and 0 for
0131 signal.