This document provides "recipes", that is, litmus tests for commonly
occurring situations, as well as a few that illustrate subtly broken but
attractive nuisances.  Many of these recipes include example code from
v5.7 of the Linux kernel.

The first section covers simple special cases, the second section
takes off the training wheels to cover more involved examples,
and the third section provides a few rules of thumb.


Simple special cases
====================

This section presents two simple special cases, the first being where
there is only one CPU or where only one memory location is accessed,
and the second being use of that old concurrency workhorse, locking.


Single CPU or single memory location
------------------------------------

If there is only one CPU on the one hand or only one variable
on the other, the code will execute in order.  There are (as
usual) some things to be careful of:

1.      Some aspects of the C language are unordered.  For example,
        in the expression "f(x) + g(y)", the order in which f and g are
        called is not defined; the object code is allowed to use either
        order or even to interleave the computations.

2.      Compilers are permitted to use the "as-if" rule.  That is, a
        compiler can emit whatever code it likes for normal accesses,
        as long as the results of a single-threaded execution appear
        just as if the compiler had followed all the relevant rules.
        To see this, compile with a high level of optimization and run
        the debugger on the resulting binary.

3.      If there is only one variable but multiple CPUs, that variable
        must be properly aligned and all accesses to that variable must
        be full sized.  Variables that straddle cachelines or pages void
        your full-ordering warranty, as do undersized accesses that load
        from or store to only part of the variable.

4.      If there are multiple CPUs, accesses to shared variables should
        use READ_ONCE() and WRITE_ONCE() or stronger to prevent load/store
        tearing, load/store fusing, and invented loads and stores, as
        illustrated by the sketch following this list.  There are
        exceptions to this rule, including:

        i.      When there is no possibility of a given shared variable
                being updated by some other CPU, for example, while
                holding the update-side lock, reads from that variable
                need not use READ_ONCE().

        ii.     When there is no possibility of a given shared variable
                being either read or updated by other CPUs, for example,
                when running during early boot, reads from that variable
                need not use READ_ONCE() and writes to that variable
                need not use WRITE_ONCE().

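For example, consider the following minimal sketch, in which one CPU
requests work and another polls for the request.  The flag and both
functions are hypothetical, not taken from the kernel:

        int need_flush; /* shared between CPUs */

        void request_flush(void)        /* runs on one CPU */
        {
                WRITE_ONCE(need_flush, 1);
        }

        void flush_worker(void)         /* runs on some other CPU */
        {
                /*
                 * Without READ_ONCE(), the compiler could fuse or hoist
                 * this load, turning the loop into an infinite spin.
                 */
                while (!READ_ONCE(need_flush))
                        cpu_relax();
                WRITE_ONCE(need_flush, 0);
                /* ... do the requested work ... */
        }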

Locking
-------

Locking is well-known and straightforward, at least if you don't think
about it too hard.  And the basic rule is indeed quite simple: Any CPU that
has acquired a given lock sees any changes previously seen or made by any
CPU prior to that CPU having released that same lock.  Note that this
statement is a bit stronger than "Any CPU holding a given lock sees all
changes made by any CPU during the time that CPU was holding this same
lock".  For example, consider the following pair of code fragments:

        /* See MP+polocks.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                spin_lock(&mylock);
                WRITE_ONCE(y, 1);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                r0 = READ_ONCE(y);
                spin_unlock(&mylock);
                r1 = READ_ONCE(x);
        }

The basic rule guarantees that if CPU0() acquires mylock before CPU1(),
then both r0 and r1 must be set to the value 1.  This also has the
consequence that if the final value of r0 is equal to 1, then the final
value of r1 must also be equal to 1.  In contrast, the weaker rule would
say nothing about the final value of r1.

The converse to the basic rule also holds, as illustrated by the
following litmus test:

        /* See MP+porevlocks.litmus. */
        void CPU0(void)
        {
                r0 = READ_ONCE(y);
                spin_lock(&mylock);
                r1 = READ_ONCE(x);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                spin_unlock(&mylock);
                WRITE_ONCE(y, 1);
        }

This converse to the basic rule guarantees that if CPU0() acquires
mylock before CPU1(), then both r0 and r1 must be set to the value 0.
This also has the consequence that if the final value of r1 is equal
to 0, then the final value of r0 must also be equal to 0.  In contrast,
the weaker rule would say nothing about the final value of r0.

These examples show only a single pair of CPUs, but the effects of the
locking basic rule extend across multiple acquisitions of a given lock
across multiple CPUs.

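For example, here is a minimal sketch (not itself one of the named
litmus tests) of a three-CPU chain in which each CPU acquires mylock
in turn:

        void CPU0(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                r0 = READ_ONCE(x);
                WRITE_ONCE(y, 1);
                spin_unlock(&mylock);
        }

        void CPU2(void)
        {
                spin_lock(&mylock);
                r1 = READ_ONCE(y);
                r2 = READ_ONCE(x);
                spin_unlock(&mylock);
        }

If the lock is acquired in the order CPU0(), CPU1(), CPU2(), then the
basic rule, applied to each successive pair of critical sections,
guarantees that r0, r1, and r2 all have the final value 1.
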
However, it is not necessarily the case that accesses ordered by
locking will be seen as ordered by CPUs not holding that lock.
Consider this example:

        /* See Z6.0+pooncelock+pooncelock+pombonce.litmus. */
        void CPU0(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                WRITE_ONCE(y, 1);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                r0 = READ_ONCE(y);
                WRITE_ONCE(z, 1);
                spin_unlock(&mylock);
        }

        void CPU2(void)
        {
                WRITE_ONCE(z, 2);
                smp_mb();
                r1 = READ_ONCE(x);
        }

Counter-intuitive though it might be, it is quite possible to have
the final value of r0 be 1, the final value of z be 2, and the final
value of r1 be 0.  The reason for this surprising outcome is that
CPU2() never acquired the lock, and thus did not benefit from the
lock's ordering properties.

Ordering can be extended to CPUs not holding the lock by careful use
of smp_mb__after_spinlock():

        /* See Z6.0+pooncelock+poonceLock+pombonce.litmus. */
        void CPU0(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                WRITE_ONCE(y, 1);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                smp_mb__after_spinlock();
                r0 = READ_ONCE(y);
                WRITE_ONCE(z, 1);
                spin_unlock(&mylock);
        }

        void CPU2(void)
        {
                WRITE_ONCE(z, 2);
                smp_mb();
                r1 = READ_ONCE(x);
        }

This addition of smp_mb__after_spinlock() strengthens the lock acquisition
sufficiently to rule out the counter-intuitive outcome.


Taking off the training wheels
==============================

This section looks at more complex examples, including message passing,
load buffering, release-acquire chains, and store buffering.
Many classes of litmus tests have abbreviated names, which may be found
here: https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf


Message passing (MP)
--------------------

The MP pattern has one CPU execute a pair of stores to a pair of variables
and another CPU execute a pair of loads from this same pair of variables,
but in the opposite order.  The goal is to avoid the counter-intuitive
outcome in which the first load sees the value written by the second store
but the second load does not see the value written by the first store.
In the absence of any ordering, this goal may not be met, as can be seen
in the MP+poonceonces.litmus litmus test.  This section therefore looks at
a number of ways of meeting this goal.

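For reference, here is a sketch of that unordered version, loosely
following MP+poonceonces.litmus, in which nothing prevents the outcome
r0 == 1 && r1 == 0:

        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                WRITE_ONCE(y, 1);
        }

        void CPU1(void)
        {
                r0 = READ_ONCE(y);
                r1 = READ_ONCE(x);
        }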

Release and acquire
~~~~~~~~~~~~~~~~~~~

Use of smp_store_release() and smp_load_acquire() is one way to force
the desired MP ordering.  The general approach is shown below:

        /* See MP+pooncerelease+poacquireonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_store_release(&y, 1);
        }

        void CPU1(void)
        {
                r0 = smp_load_acquire(&y);
                r1 = READ_ONCE(x);
        }

The smp_store_release() macro orders any prior accesses against the
store, while the smp_load_acquire() macro orders the load against any
subsequent accesses.  Therefore, if the final value of r0 is the value 1,
the final value of r1 must also be the value 1.

The init_stack_slab() function in lib/stackdepot.c uses release-acquire
in this way to safely initialize a slab of the stack.  Working out
the mutual-exclusion design is left as an exercise for the reader.


Assign and dereference
~~~~~~~~~~~~~~~~~~~~~~

Use of rcu_assign_pointer() and rcu_dereference() is quite similar to the
use of smp_store_release() and smp_load_acquire(), except that both
rcu_assign_pointer() and rcu_dereference() operate on RCU-protected
pointers.  The general approach is shown below:

        /* See MP+onceassign+derefonce.litmus. */
        int z;
        int *y = &z;
        int x;

        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                rcu_assign_pointer(y, &x);
        }

        void CPU1(void)
        {
                rcu_read_lock();
                r0 = rcu_dereference(y);
                r1 = READ_ONCE(*r0);
                rcu_read_unlock();
        }

In this example, if the final value of r0 is &x then the final value of
r1 must be 1.

The rcu_assign_pointer() macro has the same ordering properties as does
smp_store_release(), but the rcu_dereference() macro orders the load only
against later accesses that depend on the value loaded.  A dependency
is present if the value loaded determines the address of a later access
(address dependency, as shown above), the value written by a later store
(data dependency), or whether or not a later store is executed in the
first place (control dependency).  Note that the term "data dependency"
is sometimes casually used to cover both address and data dependencies.
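
For illustration, here are two minimal hypothetical sketches, not drawn
from any litmus test, showing a data dependency and a control dependency
extending from an ordinary READ_ONCE() of a shared int a to a store to a
shared int b:

        void data_dependency(void)
        {
                int r0;

                r0 = READ_ONCE(a);
                /* Data dependency: the value written depends on the value loaded. */
                WRITE_ONCE(b, r0 + 1);
        }

        void control_dependency(void)
        {
                int r0;

                r0 = READ_ONCE(a);
                /* Control dependency: whether the store executes depends on the load. */
                if (r0)
                        WRITE_ONCE(b, 1);
        }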

In lib/math/prime_numbers.c, the expand_to_next_prime() function invokes
rcu_assign_pointer(), and the next_prime_number() function invokes
rcu_dereference().  This combination mediates access to a bit vector
that is expanded as additional primes are needed.


Write and read memory barriers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is usually better to use smp_store_release() instead of smp_wmb()
and to use smp_load_acquire() instead of smp_rmb().  However, the older
smp_wmb() and smp_rmb() APIs are still heavily used, so it is important
to understand their use cases.  The general approach is shown below:

        /* See MP+fencewmbonceonce+fencermbonceonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_wmb();
                WRITE_ONCE(y, 1);
        }

        void CPU1(void)
        {
                r0 = READ_ONCE(y);
                smp_rmb();
                r1 = READ_ONCE(x);
        }

The smp_wmb() macro orders prior stores against later stores, and the
smp_rmb() macro orders prior loads against later loads.  Therefore, if
the final value of r0 is 1, the final value of r1 must also be 1.

The xlog_state_switch_iclogs() function in fs/xfs/xfs_log.c contains
the following write-side code fragment:

        log->l_curr_block -= log->l_logBBsize;
        ASSERT(log->l_curr_block >= 0);
        smp_wmb();
        log->l_curr_cycle++;

And the xlog_valid_lsn() function in fs/xfs/xfs_log_priv.h contains
the corresponding read-side code fragment:

        cur_cycle = READ_ONCE(log->l_curr_cycle);
        smp_rmb();
        cur_block = READ_ONCE(log->l_curr_block);

Alternatively, consider the following comment in function
perf_output_put_handle() in kernel/events/ring_buffer.c:

         *   kernel                             user
         *
         *   if (LOAD ->data_tail) {            LOAD ->data_head
         *                      (A)             smp_rmb()       (C)
         *      STORE $data                     LOAD $data
         *      smp_wmb()       (B)             smp_mb()        (D)
         *      STORE ->data_head               STORE ->data_tail
         *   }

The B/C pairing is an example of the MP pattern using smp_wmb() on the
write side and smp_rmb() on the read side.

Of course, given that smp_mb() is strictly stronger than either smp_wmb()
or smp_rmb(), any code fragment that would work with smp_rmb() and
smp_wmb() would also work with smp_mb() replacing either or both of the
weaker barriers.

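For example, a sketch of the MP pattern with smp_mb() substituted for
both of the weaker barriers (needlessly heavy for MP, and shown here
only for illustration) still guarantees that a final value of 1 for r0
implies a final value of 1 for r1:

        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_mb();       /* smp_wmb() suffices here, as shown above. */
                WRITE_ONCE(y, 1);
        }

        void CPU1(void)
        {
                r0 = READ_ONCE(y);
                smp_mb();       /* smp_rmb() suffices here, as shown above. */
                r1 = READ_ONCE(x);
        }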

Load buffering (LB)
-------------------

The LB pattern has one CPU load from one variable and then store to a
second, while another CPU loads from the second variable and then stores
to the first.  The goal is to avoid the counter-intuitive situation where
each load reads the value written by the other CPU's store.  In the
absence of any ordering it is quite possible that this may happen, as
can be seen in the LB+poonceonces.litmus litmus test.

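Here is a sketch of that unordered version, loosely following
LB+poonceonces.litmus, in which nothing prevents both r0 and r1 from
ending up equal to 1:

        void CPU0(void)
        {
                r0 = READ_ONCE(x);
                WRITE_ONCE(y, 1);
        }

        void CPU1(void)
        {
                r1 = READ_ONCE(y);
                WRITE_ONCE(x, 1);
        }
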
One way of avoiding the counter-intuitive outcome is through the use of a
control dependency paired with a full memory barrier:

        /* See LB+fencembonceonce+ctrlonceonce.litmus. */
        void CPU0(void)
        {
                r0 = READ_ONCE(x);
                if (r0)
                        WRITE_ONCE(y, 1);
        }

        void CPU1(void)
        {
                r1 = READ_ONCE(y);
                smp_mb();
                WRITE_ONCE(x, 1);
        }

This pairing of a control dependency in CPU0() with a full memory
barrier in CPU1() prevents r0 and r1 from both ending up equal to 1.

The A/D pairing from the ring-buffer use case shown earlier also
illustrates LB.  Here is a repeat of the comment in
perf_output_put_handle() in kernel/events/ring_buffer.c, showing a
control dependency on the kernel side and a full memory barrier on
the user side:

         *   kernel                             user
         *
         *   if (LOAD ->data_tail) {            LOAD ->data_head
         *                      (A)             smp_rmb()       (C)
         *      STORE $data                     LOAD $data
         *      smp_wmb()       (B)             smp_mb()        (D)
         *      STORE ->data_head               STORE ->data_tail
         *   }
         *
         * Where A pairs with D, and B pairs with C.

The kernel's control dependency between the load from ->data_tail
and the store to data, combined with the user's full memory barrier
between the load from data and the store to ->data_tail, prevents
the counter-intuitive outcome where the kernel overwrites the data
before the user gets done loading it.


Release-acquire chains
----------------------

Release-acquire chains are a low-overhead, flexible, and easy-to-use
method of maintaining order.  However, they do have some limitations that
need to be fully understood.  Here is an example that maintains order:

        /* See ISA2+pooncerelease+poacquirerelease+poacquireonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_store_release(&y, 1);
        }

        void CPU1(void)
        {
                r0 = smp_load_acquire(&y);
                smp_store_release(&z, 1);
        }

        void CPU2(void)
        {
                r1 = smp_load_acquire(&z);
                r2 = READ_ONCE(x);
        }

In this case, if r0 and r1 both have final values of 1, then r2 must
also have a final value of 1.

The ordering in this example is stronger than it needs to be.  For
example, ordering would still be preserved if CPU1()'s smp_load_acquire()
invocation was replaced with READ_ONCE().

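That weaker variant would look as follows (a sketch, not itself one of
the named litmus tests); CPU1()'s smp_store_release() still orders the
prior load from y before the store to z, which is all that this chain
requires:

        void CPU1(void)
        {
                r0 = READ_ONCE(y);
                smp_store_release(&z, 1);
        }
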
It is tempting to assume that CPU0()'s store to x is globally ordered
before CPU1()'s store to z, but this is not the case:

        /* See Z6.0+pooncerelease+poacquirerelease+mbonceonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_store_release(&y, 1);
        }

        void CPU1(void)
        {
                r0 = smp_load_acquire(&y);
                smp_store_release(&z, 1);
        }

        void CPU2(void)
        {
                WRITE_ONCE(z, 2);
                smp_mb();
                r1 = READ_ONCE(x);
        }

One might hope that if the final value of r0 is 1 and the final value
of z is 2, then the final value of r1 must also be 1, but it really is
possible for r1 to have the final value of 0.  The reason, of course,
is that in this version, CPU2() is not part of the release-acquire chain.
This situation is accounted for in the rules of thumb below.

Despite this limitation, release-acquire chains are low-overhead as
well as simple and powerful, at least as memory-ordering mechanisms go.


Store buffering
---------------

Store buffering can be thought of as upside-down load buffering, so
that one CPU first stores to one variable and then loads from a second,
while another CPU stores to the second variable and then loads from the
first.  Preserving order requires nothing less than full barriers:

        /* See SB+fencembonceonces.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_mb();
                r0 = READ_ONCE(y);
        }

        void CPU1(void)
        {
                WRITE_ONCE(y, 1);
                smp_mb();
                r1 = READ_ONCE(x);
        }

Omitting either smp_mb() will allow both r0 and r1 to have final
values of 0, but providing both full barriers as shown above prevents
this counter-intuitive outcome.

This pattern most famously appears as part of Dekker's locking
algorithm, but it has a much more practical use within the Linux kernel
of ordering wakeups.  The following comment taken from waitqueue_active()
in include/linux/wait.h shows the canonical pattern:

 *      CPU0 - waker                    CPU1 - waiter
 *
 *                                      for (;;) {
 *      @cond = true;                     prepare_to_wait(&wq_head, &wait, state);
 *      smp_mb();                         // smp_mb() from set_current_state()
 *      if (waitqueue_active(wq_head))         if (@cond)
 *        wake_up(wq_head);                      break;
 *                                        schedule();
 *                                      }
 *                                      finish_wait(&wq_head, &wait);

On CPU0, the store is to @cond and the load is in waitqueue_active().
On CPU1, prepare_to_wait() contains both a store to wq_head and a call
to set_current_state(), which contains an smp_mb() barrier; the load is
"if (@cond)".  The full barriers prevent the undesirable outcome where
CPU1 puts the waiting task to sleep and CPU0 fails to wake it up.

Note that use of locking can greatly simplify this pattern.



Rules of thumb
==============

There might seem to be no pattern governing what ordering primitives are
needed in which situations, but this is not the case.  There is a pattern
based on the relation between the accesses linking successive CPUs in a
given litmus test.  There are three types of linkage, all three of which
appear in the annotated sketch following this list:


1.      Write-to-read, where the next CPU reads the value that the
        previous CPU wrote.  The LB litmus-test patterns contain only
        this type of relation.  In formal memory-modeling texts, this
        relation is called "reads-from" and is usually abbreviated "rf".

2.      Read-to-write, where the next CPU overwrites the value that the
        previous CPU read.  The SB litmus test contains only this type
        of relation.  In formal memory-modeling texts, this relation is
        often called "from-reads" and is sometimes abbreviated "fr".

3.      Write-to-write, where the next CPU overwrites the value written
        by the previous CPU.  The Z6.0 litmus test pattern contains a
        write-to-write relation between the last access of CPU1() and
        the first access of CPU2().  In formal memory-modeling texts,
        this relation is often called "coherence order" and is sometimes
        abbreviated "co".  In the C++ standard, it is instead called
        "modification order" and often abbreviated "mo".
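
For example, here is the release-acquire Z6.0 pattern from the previous
section again, informally annotated with the relation linking each pair
of CPUs in the counter-intuitive outcome (r0 == 1, final z == 2, and
r1 == 0):

        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_store_release(&y, 1);       /* rf: CPU1() reads this value of y. */
        }

        void CPU1(void)
        {
                r0 = smp_load_acquire(&y);
                smp_store_release(&z, 1);       /* co: overwritten by CPU2()'s store to z. */
        }

        void CPU2(void)
        {
                WRITE_ONCE(z, 2);
                smp_mb();
                r1 = READ_ONCE(x);              /* fr: reads a value of x that CPU0()'s store overwrites. */
        }

Because two of these links are something other than write-to-read, the
rules of thumb below call for full memory barriers, and release-acquire
alone does not rule out this outcome.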

The strength of memory ordering required for a given litmus test to
avoid a counter-intuitive outcome depends on the types of relations
linking the memory accesses for the outcome in question:

o       If all links are write-to-read links, then the weakest
        possible ordering within each CPU suffices.  For example, in
        the LB litmus test, a control dependency was enough to do the
        job.

o       If all but one of the links are write-to-read links, then a
        release-acquire chain suffices.  Both the MP and the ISA2
        litmus tests illustrate this case.

o       If more than one of the links are something other than
        write-to-read links, then a full memory barrier is required
        between each successive pair of non-write-to-read links.  This
        case is illustrated by the Z6.0 litmus tests, both in the
        locking and in the release-acquire sections.

However, if you find yourself having to stretch these rules of thumb
to fit your situation, you should consider creating a litmus test and
running it on the model.