0001 ======================================
0002 vlocks for Bare-Metal Mutual Exclusion
0003 ======================================
0004
0005 Voting Locks, or "vlocks" provide a simple low-level mutual exclusion
0006 mechanism, with reasonable but minimal requirements on the memory
0007 system.
0008
0009 These are intended to be used to coordinate critical activity among CPUs
0010 which are otherwise non-coherent, in situations where the hardware
0011 provides no other mechanism to support this and ordinary spinlocks
0012 cannot be used.
0013
0014
0015 vlocks make use of the atomicity provided by the memory system for
0016 writes to a single memory location. To arbitrate, every CPU "votes for
0017 itself", by storing a unique number to a common memory location. The
0018 final value seen in that memory location when all the votes have been
0019 cast identifies the winner.
0020
0021 In order to make sure that the election produces an unambiguous result
0022 in finite time, a CPU will only enter the election in the first place if
0023 no winner has been chosen and the election does not appear to have
0024 started yet.
0025
0026
0027 Algorithm
0028 ---------
0029
0030 The easiest way to explain the vlocks algorithm is with some pseudo-code::
0031
0032
0033 int currently_voting[NR_CPUS] = { 0, };
0034 int last_vote = -1; /* no votes yet */
0035
0036 bool vlock_trylock(int this_cpu)
0037 {
0038 /* signal our desire to vote */
0039 currently_voting[this_cpu] = 1;
0040 if (last_vote != -1) {
0041 /* someone already volunteered himself */
0042 currently_voting[this_cpu] = 0;
0043 return false; /* not ourself */
0044 }
0045
0046 /* let's suggest ourself */
0047 last_vote = this_cpu;
0048 currently_voting[this_cpu] = 0;
0049
0050 /* then wait until everyone else is done voting */
0051 for_each_cpu(i) {
0052 while (currently_voting[i] != 0)
0053 /* wait */;
0054 }
0055
0056 /* result */
0057 if (last_vote == this_cpu)
0058 return true; /* we won */
0059 return false;
0060 }
0061
0062 bool vlock_unlock(void)
0063 {
0064 last_vote = -1;
0065 }
0066
0067
0068 The currently_voting[] array provides a way for the CPUs to determine
0069 whether an election is in progress, and plays a role analogous to the
0070 "entering" array in Lamport's bakery algorithm [1].
0071
0072 However, once the election has started, the underlying memory system
0073 atomicity is used to pick the winner. This avoids the need for a static
0074 priority rule to act as a tie-breaker, or any counters which could
0075 overflow.
0076
0077 As long as the last_vote variable is globally visible to all CPUs, it
0078 will contain only one value that won't change once every CPU has cleared
0079 its currently_voting flag.
0080
0081
0082 Features and limitations
0083 ------------------------
0084
0085 * vlocks are not intended to be fair. In the contended case, it is the
0086 _last_ CPU which attempts to get the lock which will be most likely
0087 to win.
0088
0089 vlocks are therefore best suited to situations where it is necessary
0090 to pick a unique winner, but it does not matter which CPU actually
0091 wins.
0092
0093 * Like other similar mechanisms, vlocks will not scale well to a large
0094 number of CPUs.
0095
0096 vlocks can be cascaded in a voting hierarchy to permit better scaling
0097 if necessary, as in the following hypothetical example for 4096 CPUs::
0098
0099 /* first level: local election */
0100 my_town = towns[(this_cpu >> 4) & 0xf];
0101 I_won = vlock_trylock(my_town, this_cpu & 0xf);
0102 if (I_won) {
0103 /* we won the town election, let's go for the state */
0104 my_state = states[(this_cpu >> 8) & 0xf];
0105 I_won = vlock_lock(my_state, this_cpu & 0xf));
0106 if (I_won) {
0107 /* and so on */
0108 I_won = vlock_lock(the_whole_country, this_cpu & 0xf];
0109 if (I_won) {
0110 /* ... */
0111 }
0112 vlock_unlock(the_whole_country);
0113 }
0114 vlock_unlock(my_state);
0115 }
0116 vlock_unlock(my_town);
0117
0118
0119 ARM implementation
0120 ------------------
0121
0122 The current ARM implementation [2] contains some optimisations beyond
0123 the basic algorithm:
0124
0125 * By packing the members of the currently_voting array close together,
0126 we can read the whole array in one transaction (providing the number
0127 of CPUs potentially contending the lock is small enough). This
0128 reduces the number of round-trips required to external memory.
0129
0130 In the ARM implementation, this means that we can use a single load
0131 and comparison::
0132
0133 LDR Rt, [Rn]
0134 CMP Rt, #0
0135
0136 ...in place of code equivalent to::
0137
0138 LDRB Rt, [Rn]
0139 CMP Rt, #0
0140 LDRBEQ Rt, [Rn, #1]
0141 CMPEQ Rt, #0
0142 LDRBEQ Rt, [Rn, #2]
0143 CMPEQ Rt, #0
0144 LDRBEQ Rt, [Rn, #3]
0145 CMPEQ Rt, #0
0146
0147 This cuts down on the fast-path latency, as well as potentially
0148 reducing bus contention in contended cases.
0149
0150 The optimisation relies on the fact that the ARM memory system
0151 guarantees coherency between overlapping memory accesses of
0152 different sizes, similarly to many other architectures. Note that
0153 we do not care which element of currently_voting appears in which
0154 bits of Rt, so there is no need to worry about endianness in this
0155 optimisation.
0156
0157 If there are too many CPUs to read the currently_voting array in
0158 one transaction then multiple transations are still required. The
0159 implementation uses a simple loop of word-sized loads for this
0160 case. The number of transactions is still fewer than would be
0161 required if bytes were loaded individually.
0162
0163
0164 In principle, we could aggregate further by using LDRD or LDM, but
0165 to keep the code simple this was not attempted in the initial
0166 implementation.
0167
0168
0169 * vlocks are currently only used to coordinate between CPUs which are
0170 unable to enable their caches yet. This means that the
0171 implementation removes many of the barriers which would be required
0172 when executing the algorithm in cached memory.
0173
0174 packing of the currently_voting array does not work with cached
0175 memory unless all CPUs contending the lock are cache-coherent, due
0176 to cache writebacks from one CPU clobbering values written by other
0177 CPUs. (Though if all the CPUs are cache-coherent, you should be
0178 probably be using proper spinlocks instead anyway).
0179
0180
0181 * The "no votes yet" value used for the last_vote variable is 0 (not
0182 -1 as in the pseudocode). This allows statically-allocated vlocks
0183 to be implicitly initialised to an unlocked state simply by putting
0184 them in .bss.
0185
0186 An offset is added to each CPU's ID for the purpose of setting this
0187 variable, so that no CPU uses the value 0 for its ID.
0188
0189
0190 Colophon
0191 --------
0192
0193 Originally created and documented by Dave Martin for Linaro Limited, for
0194 use in ARM-based big.LITTLE platforms, with review and input gratefully
0195 received from Nicolas Pitre and Achin Gupta. Thanks to Nicolas for
0196 grabbing most of this text out of the relevant mail thread and writing
0197 up the pseudocode.
0198
0199 Copyright (C) 2012-2013 Linaro Limited
0200 Distributed under the terms of Version 2 of the GNU General Public
0201 License, as defined in linux/COPYING.
0202
0203
0204 References
0205 ----------
0206
0207 [1] Lamport, L. "A New Solution of Dijkstra's Concurrent Programming
0208 Problem", Communications of the ACM 17, 8 (August 1974), 453-455.
0209
0210 https://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm
0211
0212 [2] linux/arch/arm/common/vlock.S, www.kernel.org.