================
Futex Requeue PI
================

Requeueing of tasks from a non-PI futex to a PI futex requires
special handling in order to ensure the underlying rt_mutex is never
left without an owner if it has waiters; doing so would break the PI
boosting logic [see rt-mutex-design.rst]. For the purposes of
brevity, this action will be referred to as "requeue_pi" throughout
this document. Priority inheritance is abbreviated throughout as
"PI".

Motivation
----------

Without requeue_pi, the glibc implementation of
pthread_cond_broadcast() must resort to waking all the tasks waiting
on a pthread_condvar and letting them try to sort out which task
gets to run first in classic thundering-herd formation. An ideal
implementation would wake the highest-priority waiter, and leave the
rest to the natural wakeup inherent in unlocking the mutex
associated with the condvar.

Consider the simplified glibc calls::

  /* caller must lock mutex */
  pthread_cond_wait(cond, mutex)
  {
          lock(cond->__data.__lock);
          unlock(mutex);
          do {
                  unlock(cond->__data.__lock);
                  futex_wait(cond->__data.__futex);
                  lock(cond->__data.__lock);
          } while(...)
          unlock(cond->__data.__lock);
          lock(mutex);
  }

  pthread_cond_broadcast(cond)
  {
          lock(cond->__data.__lock);
          unlock(cond->__data.__lock);
          futex_requeue(cond->__data.__futex, cond->mutex);
  }

Once pthread_cond_broadcast() requeues the tasks, the cond->mutex
has waiters. Note that pthread_cond_wait() attempts to lock the
mutex only after it has returned to user space. This will leave the
underlying rt_mutex with waiters and no owner, breaking the
previously mentioned PI-boosting logic.

In order to support PI-aware pthread_condvar's, the kernel needs to
be able to requeue tasks to PI futexes. This support implies that
upon a successful futex wait, the caller would return to user space
already holding the PI futex. The glibc implementation would be
modified as follows::

  /* caller must lock mutex */
  pthread_cond_wait_pi(cond, mutex)
  {
          lock(cond->__data.__lock);
          unlock(mutex);
          do {
                  unlock(cond->__data.__lock);
                  futex_wait_requeue_pi(cond->__data.__futex);
                  lock(cond->__data.__lock);
          } while(...)
          unlock(cond->__data.__lock);
          /* the kernel acquired the mutex for us */
  }

  pthread_cond_broadcast_pi(cond)
  {
          lock(cond->__data.__lock);
          unlock(cond->__data.__lock);
          futex_requeue_pi(cond->__data.__futex, cond->mutex);
  }

The actual glibc implementation will likely test for PI and make the
necessary changes inside the existing calls rather than creating new
calls for the PI cases. Similar changes are needed for
pthread_cond_timedwait() and pthread_cond_signal().

Implementation
--------------

In order to ensure the rt_mutex has an owner if it has waiters, both
the requeue code and the waiting code must be able to acquire the
rt_mutex before returning to user space. The requeue code cannot
simply wake the waiter and leave it to acquire the rt_mutex on its
own, as that would open a race window between the requeue call
returning to user space and the waiter waking and starting to run.
This is especially true in the uncontended case.

The solution involves two new rt_mutex helper routines,
rt_mutex_start_proxy_lock() and rt_mutex_finish_proxy_lock(), which
allow the requeue code to acquire an uncontended rt_mutex on behalf
of the waiter and to enqueue the waiter on a contended rt_mutex.
Two new futex operations provide the kernel<->user interface to
requeue_pi: FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI.

FUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait()
and pthread_cond_timedwait()) to block on the initial futex and wait
to be requeued to a PI-aware futex. The implementation is the
result of a high-speed collision between futex_wait() and
futex_lock_pi(), with some extra logic to check for the additional
wake-up scenarios.
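
As a rough illustration of the interface, the sketch below shows how
a waiter could issue this operation through the raw futex(2) system
call (glibc does not export a futex() wrapper). The cond_futex and
mutex_futex words and the wait_requeue_pi() helper are invented for
the example; the real logic lives inside glibc's condvar code, which
must also keep the PI futex value protocol intact::

  #include <errno.h>
  #include <stdint.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/futex.h>

  /*
   * Example futex words; in a real condvar these live inside the
   * condvar and mutex structures.  mutex_futex must follow the PI
   * futex protocol (owner TID in the value, managed by the kernel).
   */
  static uint32_t cond_futex;
  static uint32_t mutex_futex;

  /* Block on cond_futex, waiting to be requeued to mutex_futex. */
  static int wait_requeue_pi(uint32_t expected_cond_val)
  {
          /*
           * A NULL timeout means wait indefinitely.  On success (0)
           * the kernel has already acquired the PI futex for us, so
           * we return to user space owning the mutex.
           */
          long ret = syscall(SYS_futex, &cond_futex,
                             FUTEX_WAIT_REQUEUE_PI, expected_cond_val,
                             NULL, &mutex_futex, 0);
          return ret ? -errno : 0;
  }

On a successful return the caller already owns the PI futex and
releases it through the normal PI unlock path (an atomic exchange of
the TID, falling back to FUTEX_UNLOCK_PI when FUTEX_WAITERS is set).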

FUTEX_CMP_REQUEUE_PI is called by the waker
(pthread_cond_broadcast() and pthread_cond_signal()) to requeue and
possibly wake the waiting tasks. Internally, this operation is
still handled by futex_requeue() (by passing requeue_pi=1). Before
requeueing, futex_requeue() attempts to acquire the requeue target
PI futex on behalf of the top waiter. If it can, this waiter is
woken. futex_requeue() then proceeds to requeue the remaining
nr_wake+nr_requeue tasks to the PI futex, calling
rt_mutex_start_proxy_lock() prior to each requeue to prepare the
task as a waiter on the underlying rt_mutex. It is possible that
the lock can be acquired at this stage as well; if so, the next
waiter is woken to finish the acquisition of the lock.

FUTEX_CMP_REQUEUE_PI accepts nr_wake and nr_requeue as arguments, but
their sum is all that really matters. futex_requeue() will wake or
requeue up to nr_wake + nr_requeue tasks. It will wake only as many
tasks as it can acquire the lock for, which in the majority of cases
should be 0, as good programming practice dictates that the caller of
either pthread_cond_broadcast() or pthread_cond_signal() acquire the
mutex prior to making the call. FUTEX_CMP_REQUEUE_PI requires that
nr_wake=1. nr_requeue should be INT_MAX for broadcast and 0 for
signal.
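
To make these argument conventions concrete, the sketch below shows
how a waker could invoke the operation directly through the raw
futex(2) system call. The futex words and the wake_and_requeue_pi()
helper are again invented for the example, and error handling is
simplified::

  #include <errno.h>
  #include <stdint.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/futex.h>

  static uint32_t cond_futex;     /* non-PI condvar futex */
  static uint32_t mutex_futex;    /* PI futex backing the mutex */

  /*
   * Wake at most one waiter (nr_wake is required to be 1) and requeue
   * up to nr_requeue of the rest onto the PI futex.  The kernel
   * compares the condvar futex against expected_cond_val (val3) and
   * fails with EAGAIN if it has changed.
   */
  static int wake_and_requeue_pi(int nr_requeue, uint32_t expected_cond_val)
  {
          /* nr_requeue (val2) is passed in the timeout argument slot. */
          long ret = syscall(SYS_futex, &cond_futex, FUTEX_CMP_REQUEUE_PI,
                             1, (unsigned long)nr_requeue, &mutex_futex,
                             expected_cond_val);

          /* On success, ret is the number of tasks woken or requeued. */
          return ret < 0 ? -errno : (int)ret;
  }

A broadcast would call wake_and_requeue_pi(INT_MAX, val) and a signal
wake_and_requeue_pi(0, val), matching the nr_requeue values given
above.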