0001 .. SPDX-License-Identifier: GPL-2.0
0002
0003 .. _kernel_hacking_locktypes:
0004
0005 ==========================
0006 Lock types and their rules
0007 ==========================
0008
0009 Introduction
0010 ============
0011
0012 The kernel provides a variety of locking primitives which can be divided
0013 into three categories:
0014
0015 - Sleeping locks
0016 - CPU local locks
0017 - Spinning locks
0018
0019 This document conceptually describes these lock types and provides rules
0020 for their nesting, including the rules for use under PREEMPT_RT.
0021
0022
0023 Lock categories
0024 ===============
0025
0026 Sleeping locks
0027 --------------
0028
0029 Sleeping locks can only be acquired in preemptible task context.
0030
0031 Although implementations allow try_lock() from other contexts, it is
0032 necessary to carefully evaluate the safety of unlock() as well as of
0033 try_lock(). Furthermore, it is also necessary to evaluate the debugging
0034 versions of these primitives. In short, don't acquire sleeping locks from
0035 other contexts unless there is no other option.
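
For illustration, a minimal sketch of typical sleeping-lock usage from
preemptible task context; the data structure and its fields are
illustrative, not taken from any particular subsystem::

  struct foo {
    struct mutex lock;
    int count;
  };

  /* Called from preemptible task context only. */
  static void foo_update(struct foo *f)
  {
    mutex_lock(&f->lock);   /* may sleep until the current owner releases it */
    f->count++;
    mutex_unlock(&f->lock); /* must be released by the acquiring task */
  }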
0036
0037 Sleeping lock types:
0038
0039 - mutex
0040 - rt_mutex
0041 - semaphore
0042 - rw_semaphore
0043 - ww_mutex
0044 - percpu_rw_semaphore
0045
0046 On PREEMPT_RT kernels, these lock types are converted to sleeping locks:
0047
0048 - local_lock
0049 - spinlock_t
0050 - rwlock_t
0051
0052
0053 CPU local locks
0054 ---------------
0055
0056 - local_lock
0057
0058 On non-PREEMPT_RT kernels, local_lock functions are wrappers around
preemption and interrupt disabling primitives. In contrast to other locking
mechanisms, disabling preemption or interrupts is a purely CPU-local
concurrency control mechanism and is not suited for inter-CPU concurrency
control.
0063
0064
0065 Spinning locks
0066 --------------
0067
0068 - raw_spinlock_t
0069 - bit spinlocks
0070
0071 On non-PREEMPT_RT kernels, these lock types are also spinning locks:
0072
0073 - spinlock_t
0074 - rwlock_t
0075
0076 Spinning locks implicitly disable preemption and the lock / unlock functions
0077 can have suffixes which apply further protections:
0078
0079 =================== ====================================================
0080 _bh() Disable / enable bottom halves (soft interrupts)
0081 _irq() Disable / enable interrupts
0082 _irqsave/restore() Save and disable / restore interrupt disabled state
0083 =================== ====================================================
0084
0085
0086 Owner semantics
0087 ===============
0088
0089 The aforementioned lock types except semaphores have strict owner
0090 semantics:
0091
0092 The context (task) that acquired the lock must release it.
0093
0094 rw_semaphores have a special interface which allows non-owner release for
0095 readers.
0096
0097
0098 rtmutex
0099 =======
0100
0101 RT-mutexes are mutexes with support for priority inheritance (PI).
0102
0103 PI has limitations on non-PREEMPT_RT kernels due to preemption and
0104 interrupt disabled sections.
0105
0106 PI clearly cannot preempt preemption-disabled or interrupt-disabled
0107 regions of code, even on PREEMPT_RT kernels. Instead, PREEMPT_RT kernels
0108 execute most such regions of code in preemptible task context, especially
0109 interrupt handlers and soft interrupts. This conversion allows spinlock_t
0110 and rwlock_t to be implemented via RT-mutexes.
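
The rt_mutex interface mirrors the mutex interface. A minimal,
illustrative sketch; the lock name is hypothetical::

  static DEFINE_RT_MUTEX(example_rtmutex);  /* illustrative */

  static void example(void)
  {
    rt_mutex_lock(&example_rtmutex);   /* owner can be priority boosted */
    /* critical section */
    rt_mutex_unlock(&example_rtmutex);
  }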
0111
0112
0113 semaphore
0114 =========
0115
0116 semaphore is a counting semaphore implementation.
0117
0118 Semaphores are often used for both serialization and waiting, but new use
0119 cases should instead use separate serialization and wait mechanisms, such
0120 as mutexes and completions.
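
As an illustration of this guidance, a minimal sketch of using a
completion for the waiting part instead of a semaphore; the names are
hypothetical::

  static DECLARE_COMPLETION(setup_done);  /* illustrative */

  static void waiter(void)
  {
    /* Instead of down() on a semaphore released elsewhere: */
    wait_for_completion(&setup_done);
  }

  static void finisher(void)
  {
    /* Instead of up(): */
    complete(&setup_done);
  }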
0121
0122 semaphores and PREEMPT_RT
0123 ----------------------------
0124
0125 PREEMPT_RT does not change the semaphore implementation because counting
0126 semaphores have no concept of owners, thus preventing PREEMPT_RT from
0127 providing priority inheritance for semaphores. After all, an unknown
0128 owner cannot be boosted. As a consequence, blocking on semaphores can
0129 result in priority inversion.
0130
0131
0132 rw_semaphore
0133 ============
0134
0135 rw_semaphore is a multiple readers and single writer lock mechanism.
0136
0137 On non-PREEMPT_RT kernels the implementation is fair, thus preventing
0138 writer starvation.
0139
0140 rw_semaphore complies by default with the strict owner semantics, but there
0141 exist special-purpose interfaces that allow non-owner release for readers.
These interfaces work independently of the kernel configuration.
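
For illustration, a minimal sketch of the non-owner reader interface,
where one context acquires the rw_semaphore for reading and a different
context releases it; the lock name and the split across functions are
illustrative::

  static DECLARE_RWSEM(example_rwsem);  /* illustrative */

  static void submit_work(void)
  {
    /* Reader acquisition that will be released by another context. */
    down_read_non_owner(&example_rwsem);
    /* hand the work off, e.g. to a worker or interrupt handler ... */
  }

  static void work_done(void)  /* runs in a different context */
  {
    up_read_non_owner(&example_rwsem);
  }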
0143
0144 rw_semaphore and PREEMPT_RT
0145 ---------------------------
0146
0147 PREEMPT_RT kernels map rw_semaphore to a separate rt_mutex-based
0148 implementation, thus changing the fairness:
0149
0150 Because an rw_semaphore writer cannot grant its priority to multiple
0151 readers, a preempted low-priority reader will continue holding its lock,
0152 thus starving even high-priority writers. In contrast, because readers
0153 can grant their priority to a writer, a preempted low-priority writer will
0154 have its priority boosted until it releases the lock, thus preventing that
0155 writer from starving readers.
0156
0157
0158 local_lock
0159 ==========
0160
0161 local_lock provides a named scope to critical sections which are protected
0162 by disabling preemption or interrupts.
0163
0164 On non-PREEMPT_RT kernels local_lock operations map to the preemption and
0165 interrupt disabling and enabling primitives:
0166
0167 =============================== ======================
0168 local_lock(&llock) preempt_disable()
0169 local_unlock(&llock) preempt_enable()
0170 local_lock_irq(&llock) local_irq_disable()
0171 local_unlock_irq(&llock) local_irq_enable()
0172 local_lock_irqsave(&llock) local_irq_save()
0173 local_unlock_irqrestore(&llock) local_irq_restore()
0174 =============================== ======================
0175
0176 The named scope of local_lock has two advantages over the regular
0177 primitives:
0178
- The lock name allows static analysis and also clearly documents the
  protection scope, while the regular primitives are scopeless and
  opaque.
0182
- If lockdep is enabled, the local_lock gains a lockmap which allows
  validating the correctness of the protection. This can detect cases
  where e.g. a function using preempt_disable() as a protection mechanism
  is invoked from interrupt or soft-interrupt context. Aside from that,
  lockdep_assert_held(&llock) works as with any other locking primitive
  (see the sketch below).
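
For illustration, a minimal sketch of a local_lock protecting per-CPU
data; the struct, per-CPU variable and function names are illustrative::

  struct foo_pcpu {
    local_lock_t lock;
    int count;
  };

  static DEFINE_PER_CPU(struct foo_pcpu, foo_pcpu) = {
    .lock = INIT_LOCAL_LOCK(lock),
  };

  static void foo_count(void)
  {
    local_lock(&foo_pcpu.lock);   /* preempt_disable() on non-PREEMPT_RT */
    this_cpu_inc(foo_pcpu.count);
    local_unlock(&foo_pcpu.lock);
  }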
0188
0189 local_lock and PREEMPT_RT
0190 -------------------------
0191
0192 PREEMPT_RT kernels map local_lock to a per-CPU spinlock_t, thus changing
0193 semantics:
0194
0195 - All spinlock_t changes also apply to local_lock.
0196
0197 local_lock usage
0198 ----------------
0199
0200 local_lock should be used in situations where disabling preemption or
0201 interrupts is the appropriate form of concurrency control to protect
per-CPU data structures on a non-PREEMPT_RT kernel.
0203
0204 local_lock is not suitable to protect against preemption or interrupts on a
0205 PREEMPT_RT kernel due to the PREEMPT_RT specific spinlock_t semantics.
0206
0207
0208 raw_spinlock_t and spinlock_t
0209 =============================
0210
0211 raw_spinlock_t
0212 --------------
0213
0214 raw_spinlock_t is a strict spinning lock implementation in all kernels,
0215 including PREEMPT_RT kernels. Use raw_spinlock_t only in real critical
0216 core code, low-level interrupt handling and places where disabling
0217 preemption or interrupts is required, for example, to safely access
0218 hardware state. raw_spinlock_t can sometimes also be used when the
0219 critical section is tiny, thus avoiding RT-mutex overhead.
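
For illustration, a minimal sketch of a tiny hardware-access critical
section; the lock, base pointer and register offset are hypothetical::

  static DEFINE_RAW_SPINLOCK(hw_lock);  /* illustrative */

  static void hw_write_ctrl(void __iomem *hw_base, u32 val)
  {
    unsigned long flags;

    /* Truly atomic on all kernels, including PREEMPT_RT. */
    raw_spin_lock_irqsave(&hw_lock, flags);
    writel(val, hw_base + 0x04);  /* hypothetical control register offset */
    raw_spin_unlock_irqrestore(&hw_lock, flags);
  }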
0220
0221 spinlock_t
0222 ----------
0223
0224 The semantics of spinlock_t change with the state of PREEMPT_RT.
0225
0226 On a non-PREEMPT_RT kernel spinlock_t is mapped to raw_spinlock_t and has
0227 exactly the same semantics.
0228
0229 spinlock_t and PREEMPT_RT
0230 -------------------------
0231
0232 On a PREEMPT_RT kernel spinlock_t is mapped to a separate implementation
0233 based on rt_mutex which changes the semantics:
0234
0235 - Preemption is not disabled.
0236
0237 - The hard interrupt related suffixes for spin_lock / spin_unlock
0238 operations (_irq, _irqsave / _irqrestore) do not affect the CPU's
0239 interrupt disabled state.
0240
0241 - The soft interrupt related suffix (_bh()) still disables softirq
0242 handlers.
0243
0244 Non-PREEMPT_RT kernels disable preemption to get this effect.
0245
0246 PREEMPT_RT kernels use a per-CPU lock for serialization which keeps
0247 preemption enabled. The lock disables softirq handlers and also
0248 prevents reentrancy due to task preemption.
0249
0250 PREEMPT_RT kernels preserve all other spinlock_t semantics:
0251
0252 - Tasks holding a spinlock_t do not migrate. Non-PREEMPT_RT kernels
0253 avoid migration by disabling preemption. PREEMPT_RT kernels instead
0254 disable migration, which ensures that pointers to per-CPU variables
0255 remain valid even if the task is preempted.
0256
0257 - Task state is preserved across spinlock acquisition, ensuring that the
0258 task-state rules apply to all kernel configurations. Non-PREEMPT_RT
0259 kernels leave task state untouched. However, PREEMPT_RT must change
0260 task state if the task blocks during acquisition. Therefore, it saves
0261 the current task state before blocking and the corresponding lock wakeup
0262 restores it, as shown below::
0263
0264 task->state = TASK_INTERRUPTIBLE
0265 lock()
0266 block()
0267 task->saved_state = task->state
0268 task->state = TASK_UNINTERRUPTIBLE
0269 schedule()
0270 lock wakeup
0271 task->state = task->saved_state
0272
0273 Other types of wakeups would normally unconditionally set the task state
0274 to RUNNING, but that does not work here because the task must remain
0275 blocked until the lock becomes available. Therefore, when a non-lock
0276 wakeup attempts to awaken a task blocked waiting for a spinlock, it
0277 instead sets the saved state to RUNNING. Then, when the lock
0278 acquisition completes, the lock wakeup sets the task state to the saved
0279 state, in this case setting it to RUNNING::
0280
0281 task->state = TASK_INTERRUPTIBLE
0282 lock()
0283 block()
0284 task->saved_state = task->state
0285 task->state = TASK_UNINTERRUPTIBLE
0286 schedule()
0287 non lock wakeup
0288 task->saved_state = TASK_RUNNING
0289
0290 lock wakeup
0291 task->state = task->saved_state
0292
0293 This ensures that the real wakeup cannot be lost.
0294
0295
0296 rwlock_t
0297 ========
0298
0299 rwlock_t is a multiple readers and single writer lock mechanism.
0300
0301 Non-PREEMPT_RT kernels implement rwlock_t as a spinning lock and the
0302 suffix rules of spinlock_t apply accordingly. The implementation is fair,
0303 thus preventing writer starvation.
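
For illustration, a minimal sketch of rwlock_t usage; the protected table
is illustrative::

  static DEFINE_RWLOCK(table_lock);  /* illustrative */
  static int table[16];

  static int table_read(int idx)
  {
    int val;

    read_lock(&table_lock);   /* multiple readers may hold this concurrently */
    val = table[idx];
    read_unlock(&table_lock);
    return val;
  }

  static void table_write(int idx, int val)
  {
    write_lock(&table_lock);  /* exclusive against readers and other writers */
    table[idx] = val;
    write_unlock(&table_lock);
  }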
0304
0305 rwlock_t and PREEMPT_RT
0306 -----------------------
0307
0308 PREEMPT_RT kernels map rwlock_t to a separate rt_mutex-based
0309 implementation, thus changing semantics:
0310
0311 - All the spinlock_t changes also apply to rwlock_t.
0312
0313 - Because an rwlock_t writer cannot grant its priority to multiple
0314 readers, a preempted low-priority reader will continue holding its lock,
0315 thus starving even high-priority writers. In contrast, because readers
0316 can grant their priority to a writer, a preempted low-priority writer
0317 will have its priority boosted until it releases the lock, thus
0318 preventing that writer from starving readers.
0319
0320
0321 PREEMPT_RT caveats
0322 ==================
0323
0324 local_lock on RT
0325 ----------------
0326
0327 The mapping of local_lock to spinlock_t on PREEMPT_RT kernels has a few
0328 implications. For example, on a non-PREEMPT_RT kernel the following code
0329 sequence works as expected::
0330
0331 local_lock_irq(&local_lock);
0332 raw_spin_lock(&lock);
0333
0334 and is fully equivalent to::
0335
0336 raw_spin_lock_irq(&lock);
0337
0338 On a PREEMPT_RT kernel this code sequence breaks because local_lock_irq()
0339 is mapped to a per-CPU spinlock_t which neither disables interrupts nor
preemption. The following code sequence works correctly on both
0341 PREEMPT_RT and non-PREEMPT_RT kernels::
0342
0343 local_lock_irq(&local_lock);
0344 spin_lock(&lock);
0345
0346 Another caveat with local locks is that each local_lock has a specific
0347 protection scope. So the following substitution is wrong::
0348
0349 func1()
0350 {
0351 local_irq_save(flags); -> local_lock_irqsave(&local_lock_1, flags);
0352 func3();
0353 local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock_1, flags);
0354 }
0355
0356 func2()
0357 {
0358 local_irq_save(flags); -> local_lock_irqsave(&local_lock_2, flags);
0359 func3();
0360 local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock_2, flags);
0361 }
0362
0363 func3()
0364 {
0365 lockdep_assert_irqs_disabled();
0366 access_protected_data();
0367 }
0368
0369 On a non-PREEMPT_RT kernel this works correctly, but on a PREEMPT_RT kernel
0370 local_lock_1 and local_lock_2 are distinct and cannot serialize the callers
0371 of func3(). Also the lockdep assert will trigger on a PREEMPT_RT kernel
0372 because local_lock_irqsave() does not disable interrupts due to the
0373 PREEMPT_RT-specific semantics of spinlock_t. The correct substitution is::
0374
0375 func1()
0376 {
0377 local_irq_save(flags); -> local_lock_irqsave(&local_lock, flags);
0378 func3();
0379 local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock, flags);
0380 }
0381
0382 func2()
0383 {
0384 local_irq_save(flags); -> local_lock_irqsave(&local_lock, flags);
0385 func3();
0386 local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock, flags);
0387 }
0388
0389 func3()
0390 {
0391 lockdep_assert_held(&local_lock);
0392 access_protected_data();
0393 }
0394
0395
0396 spinlock_t and rwlock_t
0397 -----------------------
0398
0399 The changes in spinlock_t and rwlock_t semantics on PREEMPT_RT kernels
0400 have a few implications. For example, on a non-PREEMPT_RT kernel the
0401 following code sequence works as expected::
0402
0403 local_irq_disable();
0404 spin_lock(&lock);
0405
0406 and is fully equivalent to::
0407
0408 spin_lock_irq(&lock);
0409
The same applies to rwlock_t and the _irqsave() suffix variants.
0411
On a PREEMPT_RT kernel this code sequence breaks because an RT-mutex requires a
0413 fully preemptible context. Instead, use spin_lock_irq() or
0414 spin_lock_irqsave() and their unlock counterparts. In cases where the
0415 interrupt disabling and locking must remain separate, PREEMPT_RT offers a
0416 local_lock mechanism. Acquiring the local_lock pins the task to a CPU,
0417 allowing things like per-CPU interrupt disabled locks to be acquired.
0418 However, this approach should be used only where absolutely necessary.
0419
0420 A typical scenario is protection of per-CPU variables in thread context::
0421
0422 struct foo *p = get_cpu_ptr(&var1);
0423
0424 spin_lock(&p->lock);
0425 p->count += this_cpu_read(var2);
0426
0427 This is correct code on a non-PREEMPT_RT kernel, but on a PREEMPT_RT kernel
0428 this breaks. The PREEMPT_RT-specific change of spinlock_t semantics does
not allow acquiring p->lock because get_cpu_ptr() implicitly disables
0430 preemption. The following substitution works on both kernels::
0431
0432 struct foo *p;
0433
0434 migrate_disable();
0435 p = this_cpu_ptr(&var1);
0436 spin_lock(&p->lock);
0437 p->count += this_cpu_read(var2);
0438
0439 migrate_disable() ensures that the task is pinned on the current CPU which
in turn guarantees that the per-CPU accesses to var1 and var2 stay on
0441 the same CPU while the task remains preemptible.
0442
0443 The migrate_disable() substitution is not valid for the following
0444 scenario::
0445
  func()
  {
    struct foo *p;

    migrate_disable();
    p = this_cpu_ptr(&var1);
    p->val = func2();
    migrate_enable();
  }
0453
0454 This breaks because migrate_disable() does not protect against reentrancy from
0455 a preempting task. A correct substitution for this case is::
0456
  func()
  {
    struct foo *p;

    local_lock(&foo_lock);
    p = this_cpu_ptr(&var1);
    p->val = func2();
    local_unlock(&foo_lock);
  }
0464
0465 On a non-PREEMPT_RT kernel this protects against reentrancy by disabling
0466 preemption. On a PREEMPT_RT kernel this is achieved by acquiring the
0467 underlying per-CPU spinlock.
0468
0469
0470 raw_spinlock_t on RT
0471 --------------------
0472
Acquiring a raw_spinlock_t disables preemption and possibly also
interrupts, so the critical section must avoid acquiring a regular
spinlock_t or rwlock_t; for example, it must avoid allocating memory.
Thus, on a non-PREEMPT_RT kernel the following code works perfectly::
0478
0479 raw_spin_lock(&lock);
0480 p = kmalloc(sizeof(*p), GFP_ATOMIC);
0481
0482 But this code fails on PREEMPT_RT kernels because the memory allocator is
0483 fully preemptible and therefore cannot be invoked from truly atomic
0484 contexts. However, it is perfectly fine to invoke the memory allocator
0485 while holding normal non-raw spinlocks because they do not disable
0486 preemption on PREEMPT_RT kernels::
0487
0488 spin_lock(&lock);
0489 p = kmalloc(sizeof(*p), GFP_ATOMIC);
0490
0491
0492 bit spinlocks
0493 -------------
0494
0495 PREEMPT_RT cannot substitute bit spinlocks because a single bit is too
0496 small to accommodate an RT-mutex. Therefore, the semantics of bit
0497 spinlocks are preserved on PREEMPT_RT kernels, so that the raw_spinlock_t
0498 caveats also apply to bit spinlocks.
0499
0500 Some bit spinlocks are replaced with regular spinlock_t for PREEMPT_RT
0501 using conditional (#ifdef'ed) code changes at the usage site. In contrast,
0502 usage-site changes are not needed for the spinlock_t substitution.
Instead, conditionals in header files and the core locking implementation
0504 enable the compiler to do the substitution transparently.
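
For illustration, a minimal sketch of such a usage-site conditional; the
struct, bit number and helper are hypothetical and not a particular
in-tree example::

  struct foo {
    unsigned long state;   /* bit 0 used as a bit spinlock */
  #ifdef CONFIG_PREEMPT_RT
    spinlock_t lock;       /* substitute lock for PREEMPT_RT */
  #endif
  };

  static void foo_lock_state(struct foo *f)
  {
  #ifdef CONFIG_PREEMPT_RT
    spin_lock(&f->lock);
  #else
    bit_spin_lock(0, &f->state);
  #endif
  }

  /*
   * The unlock helper follows the same pattern with spin_unlock() and
   * bit_spin_unlock().
   */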
0505
0506
0507 Lock type nesting rules
0508 =======================
0509
0510 The most basic rules are:
0511
0512 - Lock types of the same lock category (sleeping, CPU local, spinning)
0513 can nest arbitrarily as long as they respect the general lock ordering
0514 rules to prevent deadlocks.
0515
0516 - Sleeping lock types cannot nest inside CPU local and spinning lock types.
0517
0518 - CPU local and spinning lock types can nest inside sleeping lock types.
0519
- Spinning lock types can nest inside all lock types.
0521
0522 These constraints apply both in PREEMPT_RT and otherwise.
0523
0524 The fact that PREEMPT_RT changes the lock category of spinlock_t and
0525 rwlock_t from spinning to sleeping and substitutes local_lock with a
0526 per-CPU spinlock_t means that they cannot be acquired while holding a raw
0527 spinlock. This results in the following nesting ordering:
0528
0529 1) Sleeping locks
0530 2) spinlock_t, rwlock_t, local_lock
0531 3) raw_spinlock_t and bit spinlocks
0532
0533 Lockdep will complain if these constraints are violated, both in
0534 PREEMPT_RT and otherwise.
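
For illustration, a minimal sketch of nesting which follows this ordering;
the lock instances are illustrative::

  mutex_lock(&sleeping_mutex);      /* 1) sleeping lock, outermost */
  spin_lock(&normal_spinlock);      /* 2) spinlock_t nests inside sleeping locks */
  raw_spin_lock(&low_level_lock);   /* 3) raw_spinlock_t nests inside both */

  /* ... critical section ... */

  raw_spin_unlock(&low_level_lock);
  spin_unlock(&normal_spinlock);
  mutex_unlock(&sleeping_mutex);

Acquiring them in the reverse category order, for example taking a
spinlock_t while holding a raw_spinlock_t, violates these rules.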