Documentation/arm/vlocks.rst

0001 ======================================
0002 vlocks for Bare-Metal Mutual Exclusion
0003 ======================================
0004
0005 Voting Locks, or "vlocks" provide a simple low-level mutual exclusion
0006 mechanism, with reasonable but minimal requirements on the memory
0007 system.
0008
0009 These are intended to be used to coordinate critical activity among CPUs
0010 which are otherwise non-coherent, in situations where the hardware
0011 provides no other mechanism to support this and ordinary spinlocks
0012 cannot be used.
0013
0014
0015 vlocks make use of the atomicity provided by the memory system for
0016 writes to a single memory location.  To arbitrate, every CPU "votes for
0017 itself", by storing a unique number to a common memory location.  The
0018 final value seen in that memory location when all the votes have been
0019 cast identifies the winner.
0020
0021 In order to make sure that the election produces an unambiguous result
0022 in finite time, a CPU will only enter the election in the first place if
0023 no winner has been chosen and the election does not appear to have
0024 started yet.
0025
0026
0027 Algorithm
0028 ---------
0029
0030 The easiest way to explain the vlocks algorithm is with some pseudo-code::
0031
0032
0033         int currently_voting[NR_CPUS] = { 0, };
0034         int last_vote = -1; /* no votes yet */
0035
0036         bool vlock_trylock(int this_cpu)
0037         {
0038                 /* signal our desire to vote */
0039                 currently_voting[this_cpu] = 1;
0040                 if (last_vote != -1) {
0041                         /* someone already volunteered himself */
0042                         currently_voting[this_cpu] = 0;
0043                         return false; /* not ourself */
0044                 }
0045
0046                 /* let's suggest ourself */
0047                 last_vote = this_cpu;
0048                 currently_voting[this_cpu] = 0;
0049
0050                 /* then wait until everyone else is done voting */
0051                 for_each_cpu(i) {
0052                         while (currently_voting[i] != 0)
0053                                 /* wait */;
0054                 }
0055
0056                 /* result */
0057                 if (last_vote == this_cpu)
0058                         return true; /* we won */
0059                 return false;
0060         }
0061
0062         bool vlock_unlock(void)
0063         {
0064                 last_vote = -1;
0065         }
0066
0067
0068 The currently_voting[] array provides a way for the CPUs to determine
0069 whether an election is in progress, and plays a role analogous to the
0070 "entering" array in Lamport's bakery algorithm [1].
0071
0072 However, once the election has started, the underlying memory system
0073 atomicity is used to pick the winner.  This avoids the need for a static
0074 priority rule to act as a tie-breaker, or any counters which could
0075 overflow.
0076
0077 As long as the last_vote variable is globally visible to all CPUs, it
0078 will contain only one value that won't change once every CPU has cleared
0079 its currently_voting flag.
0080
0081
0082 Features and limitations
0083 ------------------------
0084
0085  * vlocks are not intended to be fair.  In the contended case, it is the
0086    _last_ CPU which attempts to get the lock which will be most likely
0087    to win.
0088
0089    vlocks are therefore best suited to situations where it is necessary
0090    to pick a unique winner, but it does not matter which CPU actually
0091    wins.
0092
0093  * Like other similar mechanisms, vlocks will not scale well to a large
0094    number of CPUs.
0095
0096    vlocks can be cascaded in a voting hierarchy to permit better scaling
0097    if necessary, as in the following hypothetical example for 4096 CPUs::
0098
0099         /* first level: local election */
0100         my_town = towns[(this_cpu >> 4) & 0xf];
0101         I_won = vlock_trylock(my_town, this_cpu & 0xf);
0102         if (I_won) {
0103                 /* we won the town election, let's go for the state */
0104                 my_state = states[(this_cpu >> 8) & 0xf];
0105                 I_won = vlock_lock(my_state, this_cpu & 0xf));
0106                 if (I_won) {
0107                         /* and so on */
0108                         I_won = vlock_lock(the_whole_country, this_cpu & 0xf];
0109                         if (I_won) {
0110                                 /* ... */
0111                         }
0112                         vlock_unlock(the_whole_country);
0113                 }
0114                 vlock_unlock(my_state);
0115         }
0116         vlock_unlock(my_town);
0117
0118
0119 ARM implementation
0120 ------------------
0121
0122 The current ARM implementation [2] contains some optimisations beyond
0123 the basic algorithm:
0124
0125  * By packing the members of the currently_voting array close together,
0126    we can read the whole array in one transaction (providing the number
0127    of CPUs potentially contending the lock is small enough).  This
0128    reduces the number of round-trips required to external memory.
0129
0130    In the ARM implementation, this means that we can use a single load
0131    and comparison::
0132
0133         LDR     Rt, [Rn]
0134         CMP     Rt, #0
0135
0136    ...in place of code equivalent to::
0137
0138         LDRB    Rt, [Rn]
0139         CMP     Rt, #0
0140         LDRBEQ  Rt, [Rn, #1]
0141         CMPEQ   Rt, #0
0142         LDRBEQ  Rt, [Rn, #2]
0143         CMPEQ   Rt, #0
0144         LDRBEQ  Rt, [Rn, #3]
0145         CMPEQ   Rt, #0
0146
0147    This cuts down on the fast-path latency, as well as potentially
0148    reducing bus contention in contended cases.
0149
0150    The optimisation relies on the fact that the ARM memory system
0151    guarantees coherency between overlapping memory accesses of
0152    different sizes, similarly to many other architectures.  Note that
0153    we do not care which element of currently_voting appears in which
0154    bits of Rt, so there is no need to worry about endianness in this
0155    optimisation.
0156
0157    If there are too many CPUs to read the currently_voting array in
0158    one transaction then multiple transations are still required.  The
0159    implementation uses a simple loop of word-sized loads for this
0160    case.  The number of transactions is still fewer than would be
0161    required if bytes were loaded individually.
0162
0163
0164    In principle, we could aggregate further by using LDRD or LDM, but
0165    to keep the code simple this was not attempted in the initial
0166    implementation.
0167
0168
0169  * vlocks are currently only used to coordinate between CPUs which are
0170    unable to enable their caches yet.  This means that the
0171    implementation removes many of the barriers which would be required
0172    when executing the algorithm in cached memory.
0173
0174    packing of the currently_voting array does not work with cached
0175    memory unless all CPUs contending the lock are cache-coherent, due
0176    to cache writebacks from one CPU clobbering values written by other
0177    CPUs.  (Though if all the CPUs are cache-coherent, you should be
0178    probably be using proper spinlocks instead anyway).
0179
0180
0181  * The "no votes yet" value used for the last_vote variable is 0 (not
0182    -1 as in the pseudocode).  This allows statically-allocated vlocks
0183    to be implicitly initialised to an unlocked state simply by putting
0184    them in .bss.
0185
0186    An offset is added to each CPU's ID for the purpose of setting this
0187    variable, so that no CPU uses the value 0 for its ID.
0188
0189
0190 Colophon
0191 --------
0192
0193 Originally created and documented by Dave Martin for Linaro Limited, for
0194 use in ARM-based big.LITTLE platforms, with review and input gratefully
0195 received from Nicolas Pitre and Achin Gupta.  Thanks to Nicolas for
0196 grabbing most of this text out of the relevant mail thread and writing
0197 up the pseudocode.
0198
0199 Copyright (C) 2012-2013  Linaro Limited
0200 Distributed under the terms of Version 2 of the GNU General Public
0201 License, as defined in linux/COPYING.
0202
0203
0204 References
0205 ----------
0206
0207 [1] Lamport, L. "A New Solution of Dijkstra's Concurrent Programming
0208     Problem", Communications of the ACM 17, 8 (August 1974), 453-455.
0209
0210     https://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm
0211
0212 [2] linux/arch/arm/common/vlock.S, www.kernel.org.