======================================
Wound/Wait Deadlock-Proof Mutex Design
======================================

Please read mutex-design.rst first, as it applies to wait/wound mutexes too.

Motivation for WW-Mutexes
-------------------------

GPUs perform operations that commonly involve many buffers.  Those buffers
can be shared across contexts/processes, exist in different memory
domains (for example VRAM vs system memory), and so on.  And with
PRIME / dmabuf, they can even be shared across devices.  So there are
a handful of situations where the driver needs to wait for buffers to
become ready.  If you think about this in terms of waiting on a buffer
mutex for it to become available, this presents a problem because
there is no way to guarantee that buffers appear in an execbuf/batch in
the same order in all contexts.  That order is directly under the control
of userspace, and a result of the sequence of GL calls that an application
makes, which results in the potential for deadlock.  The problem gets
more complex when you consider that the kernel may need to migrate the
buffer(s) into VRAM before the GPU operates on the buffer(s), which
may in turn require evicting some other buffers (and you don't want to
evict other buffers which are already queued up to the GPU), but for a
simplified understanding of the problem you can ignore this.

The algorithm that the TTM graphics subsystem came up with for dealing with
this problem is quite simple.  For each group of buffers (execbuf) that needs
to be locked, the caller is assigned a unique reservation id/ticket
from a global counter.  In case of deadlock while locking all the buffers
associated with an execbuf, the one with the lowest reservation ticket (i.e.
the oldest task) wins, and the one with the higher reservation id (i.e. the
younger task) unlocks all of the buffers that it has already locked, and then
tries again.

In the RDBMS literature, a reservation ticket is associated with a transaction,
and the deadlock handling approach is called Wait-Die. The name is based on
the actions of a locking thread when it encounters an already locked mutex.
If the transaction holding the lock is younger, the locking transaction waits.
If the transaction holding the lock is older, the locking transaction backs off
and dies. Hence Wait-Die.
There is also another algorithm called Wound-Wait:
If the transaction holding the lock is younger, the locking transaction
wounds the transaction holding the lock, requesting it to die.
If the transaction holding the lock is older, it waits for the other
transaction. Hence Wound-Wait.
The two algorithms are both fair in that a transaction will eventually succeed.
However, the Wound-Wait algorithm is typically stated to generate fewer backoffs
compared to Wait-Die, but is, on the other hand, associated with more work than
Wait-Die when recovering from a backoff. Wound-Wait is also a preemptive
algorithm in that transactions are wounded by other transactions, and that
requires a reliable way to pick up the wounded condition and preempt the
running transaction. Note that this is not the same as process preemption. A
Wound-Wait transaction is considered preempted when it dies (returning
-EDEADLK) following a wound.
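
The decision each algorithm makes when a locking transaction meets an already
held lock can be written down in a few lines. The following userspace sketch is
illustrative only and is not the kernel implementation: all names in it
(ww_policy, ww_action, stamp_older, ww_decide) are hypothetical, and the
wraparound-tolerant stamp comparison is an assumption about how tickets taken
from a global counter might be compared::

  #include <stdbool.h>

  /* Hypothetical names for illustration; not part of the kernel API. */
  enum ww_policy { POLICY_WAIT_DIE, POLICY_WOUND_WAIT };
  enum ww_action { ACTION_WAIT, ACTION_DIE, ACTION_WOUND_HOLDER };

  /*
   * A lower stamp means an older transaction. Comparing the signed
   * difference keeps the result meaningful if the counter wraps
   * (assumption; stamps of distinct transactions are never equal).
   */
  static bool stamp_older(unsigned long a, unsigned long b)
  {
          return (long)(a - b) < 0;
  }

  /* What the locking transaction does when the lock is already held. */
  static enum ww_action ww_decide(enum ww_policy policy,
                                  unsigned long locker_stamp,
                                  unsigned long holder_stamp)
  {
          bool locker_is_older = stamp_older(locker_stamp, holder_stamp);

          if (policy == POLICY_WAIT_DIE)
                  return locker_is_older ? ACTION_WAIT : ACTION_DIE;

          /* Wound-Wait: an older locker wounds the younger holder. */
          return locker_is_older ? ACTION_WOUND_HOLDER : ACTION_WAIT;
  }

For example, a locker holding ticket 1 that meets a holder with ticket 2 waits
under Wait-Die, but wounds the holder under Wound-Wait.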

Concepts
--------

Compared to normal mutexes two additional concepts/objects show up in the lock
interface for w/w mutexes:

Acquire context: To ensure eventual forward progress it is important that a task
trying to acquire locks doesn't grab a new reservation id, but keeps the one it
acquired when starting the lock acquisition. This ticket is stored in the
acquire context. Furthermore the acquire context keeps track of debugging state
to catch w/w mutex interface abuse. An acquire context represents a
transaction.

W/w class: In contrast to normal mutexes the lock class needs to be explicit for
w/w mutexes, since it is required to initialize the acquire context. The lock
class also specifies which algorithm to use, Wound-Wait or Wait-Die.

Furthermore there are three different classes of w/w lock acquire functions:

* Normal lock acquisition with a context, using ww_mutex_lock.

* Slowpath lock acquisition on the contending lock, used by the task that just
  killed its transaction after having dropped all already acquired locks.
  These functions have the _slow suffix.

  From a simple semantics point-of-view the _slow functions are not strictly
  required, since simply calling the normal ww_mutex_lock functions on the
  contending lock (after having dropped all other already acquired locks) will
  work correctly. After all, if no other ww mutex has been acquired yet there's
  no deadlock potential and hence the ww_mutex_lock call will block and not
  prematurely return -EDEADLK. The advantage of the _slow functions is in
  interface safety:

  - ww_mutex_lock has a __must_check int return type, whereas ww_mutex_lock_slow
    has a void return type. Note that since ww mutex code needs loops/retries
    anyway the __must_check doesn't result in spurious warnings, even though the
    very first lock operation can never fail.
  - When full debugging is enabled ww_mutex_lock_slow checks that all acquired
    ww mutexes have been released (preventing deadlocks) and makes sure that we
    block on the contending lock (preventing spinning through the -EDEADLK
    slowpath until the contended lock can be acquired).

* Functions to only acquire a single w/w mutex, which results in the exact same
  semantics as a normal mutex. This is done by calling ww_mutex_lock with a NULL
  context.

  Again this is not strictly required. But often you only want to acquire a
  single lock, in which case it's pointless to set up an acquire context (and so
  better to avoid grabbing a deadlock avoidance ticket).

Of course, all the usual variants for handling wake-ups due to signals are also
provided.

Usage
-----

The algorithm (Wait-Die vs Wound-Wait) is chosen by using either
DEFINE_WW_CLASS() (Wound-Wait) or DEFINE_WD_CLASS() (Wait-Die).
As a rough rule of thumb, use Wound-Wait iff you
expect the number of simultaneous competing transactions to be typically small,
and you want to reduce the number of rollbacks.
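
As a minimal sketch of that choice (kernel code, so it only builds inside a
kernel tree; the class, structure and function names are hypothetical), a class
is defined once with one of the two macros and every mutex belonging to it is
initialized against that class::

  #include <linux/ww_mutex.h>

  /* Hypothetical class names; each class uses exactly one algorithm. */
  static DEFINE_WW_CLASS(my_wound_wait_class);    /* Wound-Wait */
  static DEFINE_WD_CLASS(my_wait_die_class);      /* Wait-Die */

  /* Hypothetical object protected by a w/w mutex. */
  struct my_obj {
          struct ww_mutex lock;
  };

  static void my_obj_init(struct my_obj *obj)
  {
          /* The mutex inherits the algorithm of the class it is tied to. */
          ww_mutex_init(&obj->lock, &my_wound_wait_class);
  }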

There are three different ways to acquire locks within the same w/w class.
Common definitions for methods #1 and #2::

  static DEFINE_WW_CLASS(ww_class);

  struct obj {
        struct ww_mutex lock;
        /* obj data */
  };

  struct obj_entry {
        struct list_head head;
        struct obj *obj;
  };

Method 1, using a list in execbuf->buffers that's not allowed to be reordered.
This is useful if a list of required objects is already tracked somewhere.
Furthermore the lock helper can propagate the -EALREADY return code back to
the caller as a signal that an object is on the list twice. This is useful if
the list is constructed from userspace input and the ABI requires userspace to
not have duplicate entries (e.g. for a gpu commandbuffer submission ioctl)::

  int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
  {
        struct obj *res_obj = NULL;
        struct obj_entry *contended_entry = NULL;
        struct obj_entry *entry;
        int ret;

        ww_acquire_init(ctx, &ww_class);

  retry:
        list_for_each_entry(entry, list, head) {
                /* already locked through the slowpath below, skip it once */
                if (entry->obj == res_obj) {
                        res_obj = NULL;
                        continue;
                }
                ret = ww_mutex_lock(&entry->obj->lock, ctx);
                if (ret < 0) {
                        contended_entry = entry;
                        goto err;
                }
        }

        ww_acquire_done(ctx);
        return 0;

  err:
        /* unlock everything locked before the contended entry */
        list_for_each_entry_continue_reverse(entry, list, head)
                ww_mutex_unlock(&entry->obj->lock);

        if (res_obj)
                ww_mutex_unlock(&res_obj->lock);

        if (ret == -EDEADLK) {
                /* we lost out in a seqno race, lock and retry.. */
                ww_mutex_lock_slow(&contended_entry->obj->lock, ctx);
                res_obj = contended_entry->obj;
                goto retry;
        }
        ww_acquire_fini(ctx);

        return ret;
  }

Method 2, using a list in execbuf->buffers that can be reordered. Same semantics
of duplicate entry detection using -EALREADY as method 1 above. But the
list-reordering allows for a bit more idiomatic code::

  int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
  {
        struct obj_entry *entry, *entry2;
        int ret;

        ww_acquire_init(ctx, &ww_class);

        list_for_each_entry(entry, list, head) {
                ret = ww_mutex_lock(&entry->obj->lock, ctx);
                if (ret < 0) {
                        entry2 = entry;

                        list_for_each_entry_continue_reverse(entry2, list, head)
                                ww_mutex_unlock(&entry2->obj->lock);

                        if (ret != -EDEADLK) {
                                ww_acquire_fini(ctx);
                                return ret;
                        }

                        /* we lost out in a seqno race, lock and retry.. */
                        ww_mutex_lock_slow(&entry->obj->lock, ctx);

                        /*
                         * Move entry to the head of the list; this will
                         * point entry->next to the first unlocked entry,
                         * restarting the for loop.
                         */
                        list_del(&entry->head);
                        list_add(&entry->head, list);
                }
        }

        ww_acquire_done(ctx);
        return 0;
  }

Unlocking works the same way for both methods #1 and #2::

  void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
  {
        struct obj_entry *entry;

        list_for_each_entry(entry, list, head)
                ww_mutex_unlock(&entry->obj->lock);

        ww_acquire_fini(ctx);
  }

Method 3 is useful if the list of objects is constructed ad-hoc and not upfront,
e.g. when adjusting edges in a graph where each node has its own ww_mutex lock,
and edges can only be changed when holding the locks of all involved nodes. w/w
mutexes are a natural fit for such a case for two reasons:

- They can handle lock-acquisition in any order, which allows us to start walking
  a graph from a starting point and then iteratively discover new edges and
  lock down the nodes those edges connect to.
- Due to the -EALREADY return code signalling that a given object is already
  held there's no need for additional book-keeping to break cycles in the graph
  or keep track of which locks are already held (when using more than one node
  as a starting point).

Note that this approach differs in two important ways from the above methods:

- Since the list of objects is dynamically constructed (and might very well be
  different when retrying due to hitting the -EDEADLK die condition) there's
  no need to keep any object on a persistent list when it's not locked. We can
  therefore move the list_head into the object itself.
- On the other hand the dynamic object list construction also means that the
  -EALREADY return code can't be propagated.

Note also that methods #1 and #2 and method #3 can be combined, e.g. to first
lock a list of starting nodes (passed in from userspace) using one of the above
methods, and then lock any additional objects affected by the operations using
method #3 below. The backoff/retry procedure will be a bit more involved, since
when the dynamic locking step hits -EDEADLK we also need to unlock all the
objects acquired with the fixed list. But the w/w mutex debug checks will catch
any interface misuse for these cases.

Also, method 3 can't fail the lock acquisition step since it doesn't return
-EALREADY. Of course this would be different when using the _interruptible
variants, but that's outside of the scope of these examples here::

  struct obj {
        struct ww_mutex ww_mutex;
        struct list_head locked_list;
  };

  static DEFINE_WW_CLASS(ww_class);

  void __unlock_objs(struct list_head *list)
  {
        struct obj *entry, *temp;

        list_for_each_entry_safe(entry, temp, list, locked_list) {
                /*
                 * Need to do that before unlocking, since only the current
                 * lock holder is allowed to use the object.
                 */
                list_del(&entry->locked_list);
                ww_mutex_unlock(&entry->ww_mutex);
        }
  }

  void lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
  {
        struct obj *obj;
        int ret;

        ww_acquire_init(ctx, &ww_class);

  retry:
        /* re-init loop start state */
        loop {
                /*
                 * Magic code which walks over a graph and decides which
                 * object to lock next, storing it in obj ("loop" is
                 * pseudocode for that walk).
                 */

                ret = ww_mutex_lock(&obj->ww_mutex, ctx);
                if (ret == -EALREADY) {
                        /* we have that one already, get to the next object */
                        continue;
                }
                if (ret == -EDEADLK) {
                        __unlock_objs(list);

                        ww_mutex_lock_slow(&obj->ww_mutex, ctx);
                        list_add(&obj->locked_list, list);
                        goto retry;
                }

                /* locked a new object, add it to the list */
                list_add_tail(&obj->locked_list, list);
        }

        ww_acquire_done(ctx);
  }

  void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
  {
        __unlock_objs(list);
        ww_acquire_fini(ctx);
  }

Method 4: Only lock one single object. In that case deadlock detection and
prevention is obviously overkill, since by grabbing just one lock you can't
produce a deadlock within just one class. To simplify this case the w/w mutex
API can be used with a NULL context.
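
A minimal sketch of this single-lock case, reusing struct obj from the common
definitions for methods #1 and #2 (kernel code; the function name
touch_single_obj is hypothetical). No acquire context is set up, so there is no
ww_acquire_init/ww_acquire_fini pair and no backoff path::

  int touch_single_obj(struct obj *obj)
  {
        int ret;

        /*
         * With a NULL context this behaves like a normal mutex lock.
         * The return value is still checked because ww_mutex_lock is
         * declared __must_check.
         */
        ret = ww_mutex_lock(&obj->lock, NULL);
        if (ret)
                return ret;

        /* ... access obj data ... */

        ww_mutex_unlock(&obj->lock);
        return 0;
  }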

Implementation Details
----------------------

Design:
^^^^^^^

  ww_mutex currently encapsulates a struct mutex; this means no extra overhead
  for normal mutex locks, which are far more common. As such there is only a
  small increase in code size if wait/wound mutexes are not used.

  We maintain the following invariants for the wait list:

  (1) Waiters with an acquire context are sorted by stamp order; waiters
      without an acquire context are interspersed in FIFO order.
  (2) For Wait-Die, among waiters with contexts, only the first one can have
      other locks acquired already (ctx->acquired > 0). Note that this waiter
      may come after other waiters without contexts in the list.

  The Wound-Wait preemption is implemented with a lazy-preemption scheme:
  The wounded status of the transaction is checked only when there is
  contention for a new lock and hence a true chance of deadlock. In that
  situation, if the transaction is wounded, it backs off, clears the
  wounded status and retries. A great benefit of implementing preemption in
  this way is that the wounded transaction can identify a contending lock to
  wait for before restarting the transaction. Just blindly restarting the
  transaction would likely make the transaction end up in a situation where
  it would have to back off again.

  In general, not much contention is expected. The locks are typically used to
  serialize access to resources for devices, and optimization focus should
  therefore be directed towards the uncontended cases.

Lockdep:
^^^^^^^^

  Special care has been taken to warn for as many cases of API abuse
  as possible. Some common API abuses will be caught with
  CONFIG_DEBUG_MUTEXES, but CONFIG_PROVE_LOCKING is recommended.

  Some of the errors which will be warned about:

   - Forgetting to call ww_acquire_fini or ww_acquire_init.
   - Attempting to lock more mutexes after ww_acquire_done.
   - Attempting to lock the wrong mutex after -EDEADLK and
     unlocking all mutexes.
   - Attempting to lock the right mutex after -EDEADLK,
     before unlocking all mutexes.
   - Calling ww_mutex_lock_slow before -EDEADLK was returned.
   - Unlocking mutexes with the wrong unlock function.
   - Calling one of the ww_acquire_* twice on the same context.
   - Using a different ww_class for the mutex than for the ww_acquire_ctx.
   - Normal lockdep errors that can result in deadlocks.

  Some of the lockdep errors that can result in deadlocks:

   - Calling ww_acquire_init to initialize a second ww_acquire_ctx before
     having called ww_acquire_fini on the first.
   - 'normal' deadlocks that can occur.

FIXME:
  Update this section once we have the TASK_DEADLOCK task state flag magic
  implemented.