0001 This document provides "recipes", that is, litmus tests for commonly
0002 occurring situations, as well as a few that illustrate subtly broken but
0003 attractive nuisances. Many of these recipes include example code from
0004 v5.7 of the Linux kernel.
0005
0006 The first section covers simple special cases, the second section
0007 takes off the training wheels to cover more involved examples,
0008 and the third section provides a few rules of thumb.
0009
0010
0011 Simple special cases
0012 ====================
0013
0014 This section presents two simple special cases, the first being where
0015 there is only one CPU or only one memory location is accessed, and the
0016 second being use of that old concurrency workhorse, locking.
0017
0018
0019 Single CPU or single memory location
0020 ------------------------------------
0021
0022 If there is only one CPU on the one hand or only one variable
0023 on the other, the code will execute in order. There are (as
0024 usual) some things to be careful of:
0025
0026 1. Some aspects of the C language are unordered. For example,
0027 in the expression "f(x) + g(y)", the order in which f and g are
0028 called is not defined; the object code is allowed to use either
0029 order or even to interleave the computations.
0030
0031 2. Compilers are permitted to use the "as-if" rule. That is, a
0032 compiler can emit whatever code it likes for normal accesses,
0033 as long as the results of a single-threaded execution appear
0034 just as if the compiler had followed all the relevant rules.
0035 To see this, compile with a high level of optimization and run
0036 the debugger on the resulting binary.
0037
0038 3. If there is only one variable but multiple CPUs, that variable
0039 must be properly aligned and all accesses to that variable must
0040 be full sized. Variables that straddle cachelines or pages void
0041 your full-ordering warranty, as do undersized accesses that load
0042 from or store to only part of the variable.
0043
4. If there are multiple CPUs, accesses to shared variables should
use READ_ONCE() and WRITE_ONCE() or stronger to prevent load/store
tearing, load/store fusing, and invented loads and stores, as
illustrated by the sketch following this list. There are
exceptions to this rule, including:
0048
0049 i. When there is no possibility of a given shared variable
0050 being updated by some other CPU, for example, while
0051 holding the update-side lock, reads from that variable
0052 need not use READ_ONCE().
0053
0054 ii. When there is no possibility of a given shared variable
0055 being either read or updated by other CPUs, for example,
0056 when running during early boot, reads from that variable
0057 need not use READ_ONCE() and writes to that variable
0058 need not use WRITE_ONCE().
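
Here is a brief sketch of the READ_ONCE()/WRITE_ONCE() usage called
for in item 4; this is hypothetical code, not taken from the Linux
kernel, and the function names exist only for this illustration:

	int flag; /* Shared, properly aligned, always accessed full-sized. */

	void some_writer(void) /* Hypothetical. */
	{
		WRITE_ONCE(flag, 1); /* Marked store: no tearing or fusing. */
	}

	void some_reader(void) /* Hypothetical. */
	{
		while (!READ_ONCE(flag)) /* Marked load each pass: cannot be */
			cpu_relax();	 /* fused or hoisted out of the loop. */
	}

Note that this sketch only tames the compiler's handling of "flag"
itself; publishing data along with the flag additionally requires the
ordering supplied by the locking, release-acquire, and memory-barrier
patterns described later in this document.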
0059
0060
0061 Locking
0062 -------
0063
0064 Locking is well-known and straightforward, at least if you don't think
0065 about it too hard. And the basic rule is indeed quite simple: Any CPU that
has acquired a given lock sees any changes previously seen or made by
any CPU prior to that CPU's release of that same lock. Note that this
statement is a bit stronger than "Any CPU holding a given lock sees all
changes made by any CPU during the time that CPU was holding this same
lock". For example,
0070 consider the following pair of code fragments:
0071
0072 /* See MP+polocks.litmus. */
0073 void CPU0(void)
0074 {
0075 WRITE_ONCE(x, 1);
0076 spin_lock(&mylock);
0077 WRITE_ONCE(y, 1);
0078 spin_unlock(&mylock);
0079 }
0080
0081 void CPU1(void)
0082 {
0083 spin_lock(&mylock);
0084 r0 = READ_ONCE(y);
0085 spin_unlock(&mylock);
0086 r1 = READ_ONCE(x);
0087 }
0088
0089 The basic rule guarantees that if CPU0() acquires mylock before CPU1(),
0090 then both r0 and r1 must be set to the value 1. This also has the
0091 consequence that if the final value of r0 is equal to 1, then the final
0092 value of r1 must also be equal to 1. In contrast, the weaker rule would
0093 say nothing about the final value of r1.
0094
0095 The converse to the basic rule also holds, as illustrated by the
0096 following litmus test:
0097
0098 /* See MP+porevlocks.litmus. */
0099 void CPU0(void)
0100 {
0101 r0 = READ_ONCE(y);
0102 spin_lock(&mylock);
0103 r1 = READ_ONCE(x);
0104 spin_unlock(&mylock);
0105 }
0106
0107 void CPU1(void)
0108 {
0109 spin_lock(&mylock);
0110 WRITE_ONCE(x, 1);
0111 spin_unlock(&mylock);
0112 WRITE_ONCE(y, 1);
0113 }
0114
0115 This converse to the basic rule guarantees that if CPU0() acquires
0116 mylock before CPU1(), then both r0 and r1 must be set to the value 0.
0117 This also has the consequence that if the final value of r1 is equal
0118 to 0, then the final value of r0 must also be equal to 0. In contrast,
0119 the weaker rule would say nothing about the final value of r0.
0120
0121 These examples show only a single pair of CPUs, but the effects of the
locking basic rule extend across multiple acquisitions of a given lock
by multiple CPUs.
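
For example, consider the following sketch, which is patterned after
the litmus tests above but is not itself taken from the kernel tree:

	void CPU0(void)
	{
		spin_lock(&mylock);
		WRITE_ONCE(x, 1);
		spin_unlock(&mylock);
	}

	void CPU1(void)
	{
		spin_lock(&mylock);
		r0 = READ_ONCE(x);
		WRITE_ONCE(y, 1);
		spin_unlock(&mylock);
	}

	void CPU2(void)
	{
		spin_lock(&mylock);
		r1 = READ_ONCE(y);
		r2 = READ_ONCE(x);
		spin_unlock(&mylock);
	}

If the final values of r0 and r1 are both 1, then CPU0(), CPU1(), and
CPU2() acquired mylock in that order, so the basic rule applies
transitively and the final value of r2 must also be 1.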
0124
0125 However, it is not necessarily the case that accesses ordered by
0126 locking will be seen as ordered by CPUs not holding that lock.
0127 Consider this example:
0128
0129 /* See Z6.0+pooncelock+pooncelock+pombonce.litmus. */
0130 void CPU0(void)
0131 {
0132 spin_lock(&mylock);
0133 WRITE_ONCE(x, 1);
0134 WRITE_ONCE(y, 1);
0135 spin_unlock(&mylock);
0136 }
0137
0138 void CPU1(void)
0139 {
0140 spin_lock(&mylock);
0141 r0 = READ_ONCE(y);
0142 WRITE_ONCE(z, 1);
0143 spin_unlock(&mylock);
0144 }
0145
0146 void CPU2(void)
0147 {
0148 WRITE_ONCE(z, 2);
0149 smp_mb();
0150 r1 = READ_ONCE(x);
0151 }
0152
0153 Counter-intuitive though it might be, it is quite possible to have
0154 the final value of r0 be 1, the final value of z be 2, and the final
0155 value of r1 be 0. The reason for this surprising outcome is that
0156 CPU2() never acquired the lock, and thus did not benefit from the
0157 lock's ordering properties.
0158
0159 Ordering can be extended to CPUs not holding the lock by careful use
0160 of smp_mb__after_spinlock():
0161
0162 /* See Z6.0+pooncelock+poonceLock+pombonce.litmus. */
0163 void CPU0(void)
0164 {
0165 spin_lock(&mylock);
0166 WRITE_ONCE(x, 1);
0167 WRITE_ONCE(y, 1);
0168 spin_unlock(&mylock);
0169 }
0170
0171 void CPU1(void)
0172 {
0173 spin_lock(&mylock);
0174 smp_mb__after_spinlock();
0175 r0 = READ_ONCE(y);
0176 WRITE_ONCE(z, 1);
0177 spin_unlock(&mylock);
0178 }
0179
0180 void CPU2(void)
0181 {
0182 WRITE_ONCE(z, 2);
0183 smp_mb();
0184 r1 = READ_ONCE(x);
0185 }
0186
0187 This addition of smp_mb__after_spinlock() strengthens the lock acquisition
0188 sufficiently to rule out the counter-intuitive outcome.
0189
0190
0191 Taking off the training wheels
0192 ==============================
0193
0194 This section looks at more complex examples, including message passing,
load buffering, release-acquire chains, and store buffering.
0196 Many classes of litmus tests have abbreviated names, which may be found
0197 here: https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf
0198
0199
0200 Message passing (MP)
0201 --------------------
0202
0203 The MP pattern has one CPU execute a pair of stores to a pair of variables
0204 and another CPU execute a pair of loads from this same pair of variables,
0205 but in the opposite order. The goal is to avoid the counter-intuitive
0206 outcome in which the first load sees the value written by the second store
0207 but the second load does not see the value written by the first store.
0208 In the absence of any ordering, this goal may not be met, as can be seen
0209 in the MP+poonceonces.litmus litmus test. This section therefore looks at
0210 a number of ways of meeting this goal.
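
For reference, a sketch of the unordered version, corresponding to
MP+poonceonces.litmus, looks like this:

	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		WRITE_ONCE(y, 1);
	}

	void CPU1(void)
	{
		r0 = READ_ONCE(y);
		r1 = READ_ONCE(x);
	}

Here nothing prevents the counter-intuitive outcome in which the final
value of r0 is 1 and the final value of r1 is 0.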
0211
0212
0213 Release and acquire
0214 ~~~~~~~~~~~~~~~~~~~
0215
0216 Use of smp_store_release() and smp_load_acquire() is one way to force
0217 the desired MP ordering. The general approach is shown below:
0218
0219 /* See MP+pooncerelease+poacquireonce.litmus. */
0220 void CPU0(void)
0221 {
0222 WRITE_ONCE(x, 1);
0223 smp_store_release(&y, 1);
0224 }
0225
0226 void CPU1(void)
0227 {
0228 r0 = smp_load_acquire(&y);
0229 r1 = READ_ONCE(x);
0230 }
0231
0232 The smp_store_release() macro orders any prior accesses against the
store, while the smp_load_acquire() macro orders the load against any
0234 subsequent accesses. Therefore, if the final value of r0 is the value 1,
0235 the final value of r1 must also be the value 1.
0236
0237 The init_stack_slab() function in lib/stackdepot.c uses release-acquire
in this way to safely initialize a slab of the stack. Working out
0239 the mutual-exclusion design is left as an exercise for the reader.
0240
0241
0242 Assign and dereference
0243 ~~~~~~~~~~~~~~~~~~~~~~
0244
0245 Use of rcu_assign_pointer() and rcu_dereference() is quite similar to the
0246 use of smp_store_release() and smp_load_acquire(), except that both
0247 rcu_assign_pointer() and rcu_dereference() operate on RCU-protected
0248 pointers. The general approach is shown below:
0249
0250 /* See MP+onceassign+derefonce.litmus. */
0251 int z;
0252 int *y = &z;
0253 int x;
0254
0255 void CPU0(void)
0256 {
0257 WRITE_ONCE(x, 1);
0258 rcu_assign_pointer(y, &x);
0259 }
0260
0261 void CPU1(void)
0262 {
0263 rcu_read_lock();
0264 r0 = rcu_dereference(y);
0265 r1 = READ_ONCE(*r0);
0266 rcu_read_unlock();
0267 }
0268
0269 In this example, if the final value of r0 is &x then the final value of
0270 r1 must be 1.
0271
0272 The rcu_assign_pointer() macro has the same ordering properties as does
0273 smp_store_release(), but the rcu_dereference() macro orders the load only
0274 against later accesses that depend on the value loaded. A dependency
0275 is present if the value loaded determines the address of a later access
0276 (address dependency, as shown above), the value written by a later store
0277 (data dependency), or whether or not a later store is executed in the
0278 first place (control dependency). Note that the term "data dependency"
0279 is sometimes casually used to cover both address and data dependencies.
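
To make these three kinds of dependency concrete, here is a hedged
sketch (not taken from the kernel) that extends CPU1() from the example
above; the variables ptr_slot and b are hypothetical and exist only for
this illustration:

	int *ptr_slot; /* Hypothetical. */
	int b;	       /* Hypothetical. */

	void CPU1(void)
	{
		rcu_read_lock();
		r0 = rcu_dereference(y);
		r1 = READ_ONCE(*r0);	  /* Address dependency. */
		WRITE_ONCE(ptr_slot, r0); /* Data dependency. */
		if (r0 != &z)		  /* Control dependency orders */
			WRITE_ONCE(b, 1); /* this store. */
		rcu_read_unlock();
	}

In each case, the dependency extends from the rcu_dereference() load to
the later access, which is therefore ordered after that load.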
0280
0281 In lib/math/prime_numbers.c, the expand_to_next_prime() function invokes
0282 rcu_assign_pointer(), and the next_prime_number() function invokes
0283 rcu_dereference(). This combination mediates access to a bit vector
0284 that is expanded as additional primes are needed.
0285
0286
0287 Write and read memory barriers
0288 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0289
0290 It is usually better to use smp_store_release() instead of smp_wmb()
0291 and to use smp_load_acquire() instead of smp_rmb(). However, the older
0292 smp_wmb() and smp_rmb() APIs are still heavily used, so it is important
0293 to understand their use cases. The general approach is shown below:
0294
0295 /* See MP+fencewmbonceonce+fencermbonceonce.litmus. */
0296 void CPU0(void)
0297 {
0298 WRITE_ONCE(x, 1);
0299 smp_wmb();
0300 WRITE_ONCE(y, 1);
0301 }
0302
0303 void CPU1(void)
0304 {
0305 r0 = READ_ONCE(y);
0306 smp_rmb();
0307 r1 = READ_ONCE(x);
0308 }
0309
0310 The smp_wmb() macro orders prior stores against later stores, and the
0311 smp_rmb() macro orders prior loads against later loads. Therefore, if
0312 the final value of r0 is 1, the final value of r1 must also be 1.
0313
0314 The xlog_state_switch_iclogs() function in fs/xfs/xfs_log.c contains
0315 the following write-side code fragment:
0316
0317 log->l_curr_block -= log->l_logBBsize;
0318 ASSERT(log->l_curr_block >= 0);
0319 smp_wmb();
0320 log->l_curr_cycle++;
0321
0322 And the xlog_valid_lsn() function in fs/xfs/xfs_log_priv.h contains
0323 the corresponding read-side code fragment:
0324
0325 cur_cycle = READ_ONCE(log->l_curr_cycle);
0326 smp_rmb();
0327 cur_block = READ_ONCE(log->l_curr_block);
0328
0329 Alternatively, consider the following comment in function
0330 perf_output_put_handle() in kernel/events/ring_buffer.c:
0331
0332 * kernel user
0333 *
0334 * if (LOAD ->data_tail) { LOAD ->data_head
0335 * (A) smp_rmb() (C)
0336 * STORE $data LOAD $data
0337 * smp_wmb() (B) smp_mb() (D)
0338 * STORE ->data_head STORE ->data_tail
0339 * }
0340
0341 The B/C pairing is an example of the MP pattern using smp_wmb() on the
0342 write side and smp_rmb() on the read side.
0343
0344 Of course, given that smp_mb() is strictly stronger than either smp_wmb()
0345 or smp_rmb(), any code fragment that would work with smp_rmb() and
0346 smp_wmb() would also work with smp_mb() replacing either or both of the
0347 weaker barriers.
0348
0349
0350 Load buffering (LB)
0351 -------------------
0352
0353 The LB pattern has one CPU load from one variable and then store to a
0354 second, while another CPU loads from the second variable and then stores
0355 to the first. The goal is to avoid the counter-intuitive situation where
0356 each load reads the value written by the other CPU's store. In the
0357 absence of any ordering it is quite possible that this may happen, as
0358 can be seen in the LB+poonceonces.litmus litmus test.
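
For reference, a sketch of the unordered version, corresponding to
LB+poonceonces.litmus, looks like this:

	void CPU0(void)
	{
		r0 = READ_ONCE(x);
		WRITE_ONCE(y, 1);
	}

	void CPU1(void)
	{
		r1 = READ_ONCE(y);
		WRITE_ONCE(x, 1);
	}

Here nothing prevents the counter-intuitive outcome in which both r0
and r1 have final values of 1.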
0359
0360 One way of avoiding the counter-intuitive outcome is through the use of a
0361 control dependency paired with a full memory barrier:
0362
0363 /* See LB+fencembonceonce+ctrlonceonce.litmus. */
0364 void CPU0(void)
0365 {
0366 r0 = READ_ONCE(x);
0367 if (r0)
0368 WRITE_ONCE(y, 1);
0369 }
0370
0371 void CPU1(void)
0372 {
0373 r1 = READ_ONCE(y);
0374 smp_mb();
0375 WRITE_ONCE(x, 1);
0376 }
0377
0378 This pairing of a control dependency in CPU0() with a full memory
0379 barrier in CPU1() prevents r0 and r1 from both ending up equal to 1.
0380
0381 The A/D pairing from the ring-buffer use case shown earlier also
0382 illustrates LB. Here is a repeat of the comment in
0383 perf_output_put_handle() in kernel/events/ring_buffer.c, showing a
0384 control dependency on the kernel side and a full memory barrier on
0385 the user side:
0386
0387 * kernel user
0388 *
0389 * if (LOAD ->data_tail) { LOAD ->data_head
0390 * (A) smp_rmb() (C)
0391 * STORE $data LOAD $data
0392 * smp_wmb() (B) smp_mb() (D)
0393 * STORE ->data_head STORE ->data_tail
0394 * }
0395 *
0396 * Where A pairs with D, and B pairs with C.
0397
0398 The kernel's control dependency between the load from ->data_tail
and the store to data, combined with the user's full memory barrier
between the load from data and the store to ->data_tail, prevents
0401 the counter-intuitive outcome where the kernel overwrites the data
0402 before the user gets done loading it.
0403
0404
0405 Release-acquire chains
0406 ----------------------
0407
0408 Release-acquire chains are a low-overhead, flexible, and easy-to-use
0409 method of maintaining order. However, they do have some limitations that
0410 need to be fully understood. Here is an example that maintains order:
0411
0412 /* See ISA2+pooncerelease+poacquirerelease+poacquireonce.litmus. */
0413 void CPU0(void)
0414 {
0415 WRITE_ONCE(x, 1);
0416 smp_store_release(&y, 1);
0417 }
0418
0419 void CPU1(void)
0420 {
	r0 = smp_load_acquire(&y);
0422 smp_store_release(&z, 1);
0423 }
0424
0425 void CPU2(void)
0426 {
	r1 = smp_load_acquire(&z);
0428 r2 = READ_ONCE(x);
0429 }
0430
0431 In this case, if r0 and r1 both have final values of 1, then r2 must
0432 also have a final value of 1.
0433
0434 The ordering in this example is stronger than it needs to be. For
0435 example, ordering would still be preserved if CPU1()'s smp_load_acquire()
0436 invocation was replaced with READ_ONCE().
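
A sketch of this weakened (but still sufficient) version of CPU1()
might look like this:

	void CPU1(void)
	{
		r0 = READ_ONCE(y);
		smp_store_release(&z, 1);
	}

The smp_store_release() still orders CPU1()'s prior load from y against
its store to z, which is what keeps the chain intact.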
0437
0438 It is tempting to assume that CPU0()'s store to x is globally ordered
0439 before CPU1()'s store to z, but this is not the case:
0440
0441 /* See Z6.0+pooncerelease+poacquirerelease+mbonceonce.litmus. */
0442 void CPU0(void)
0443 {
0444 WRITE_ONCE(x, 1);
0445 smp_store_release(&y, 1);
0446 }
0447
0448 void CPU1(void)
0449 {
	r0 = smp_load_acquire(&y);
0451 smp_store_release(&z, 1);
0452 }
0453
0454 void CPU2(void)
0455 {
0456 WRITE_ONCE(z, 2);
0457 smp_mb();
0458 r1 = READ_ONCE(x);
0459 }
0460
0461 One might hope that if the final value of r0 is 1 and the final value
0462 of z is 2, then the final value of r1 must also be 1, but it really is
0463 possible for r1 to have the final value of 0. The reason, of course,
0464 is that in this version, CPU2() is not part of the release-acquire chain.
0465 This situation is accounted for in the rules of thumb below.
0466
0467 Despite this limitation, release-acquire chains are low-overhead as
0468 well as simple and powerful, at least as memory-ordering mechanisms go.
0469
0470
0471 Store buffering
0472 ---------------
0473
0474 Store buffering can be thought of as upside-down load buffering, so
0475 that one CPU first stores to one variable and then loads from a second,
0476 while another CPU stores to the second variable and then loads from the
0477 first. Preserving order requires nothing less than full barriers:
0478
0479 /* See SB+fencembonceonces.litmus. */
0480 void CPU0(void)
0481 {
0482 WRITE_ONCE(x, 1);
0483 smp_mb();
0484 r0 = READ_ONCE(y);
0485 }
0486
0487 void CPU1(void)
0488 {
0489 WRITE_ONCE(y, 1);
0490 smp_mb();
0491 r1 = READ_ONCE(x);
0492 }
0493
0494 Omitting either smp_mb() will allow both r0 and r1 to have final
0495 values of 0, but providing both full barriers as shown above prevents
0496 this counter-intuitive outcome.
0497
0498 This pattern most famously appears as part of Dekker's locking
0499 algorithm, but it has a much more practical use within the Linux kernel
0500 of ordering wakeups. The following comment taken from waitqueue_active()
0501 in include/linux/wait.h shows the canonical pattern:
0502
0503 * CPU0 - waker CPU1 - waiter
0504 *
0505 * for (;;) {
0506 * @cond = true; prepare_to_wait(&wq_head, &wait, state);
0507 * smp_mb(); // smp_mb() from set_current_state()
0508 * if (waitqueue_active(wq_head)) if (@cond)
0509 * wake_up(wq_head); break;
0510 * schedule();
0511 * }
0512 * finish_wait(&wq_head, &wait);
0513
0514 On CPU0, the store is to @cond and the load is in waitqueue_active().
0515 On CPU1, prepare_to_wait() contains both a store to wq_head and a call
0516 to set_current_state(), which contains an smp_mb() barrier; the load is
0517 "if (@cond)". The full barriers prevent the undesirable outcome where
0518 CPU1 puts the waiting task to sleep and CPU0 fails to wake it up.
0519
0520 Note that use of locking can greatly simplify this pattern.
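
For example, here is a hedged sketch (not a kernel code fragment) of
the store-buffering pattern with both critical sections protected by
the same lock:

	void CPU0(void)
	{
		spin_lock(&mylock);
		WRITE_ONCE(x, 1);
		r0 = READ_ONCE(y);
		spin_unlock(&mylock);
	}

	void CPU1(void)
	{
		spin_lock(&mylock);
		WRITE_ONCE(y, 1);
		r1 = READ_ONCE(x);
		spin_unlock(&mylock);
	}

Whichever critical section runs second is guaranteed by the locking
basic rule to see the other CPU's store, so r0 and r1 cannot both have
final values of 0, and no explicit memory barriers are required.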
0521
0522
0523 Rules of thumb
0524 ==============
0525
0526 There might seem to be no pattern governing what ordering primitives are
0527 needed in which situations, but this is not the case. There is a pattern
0528 based on the relation between the accesses linking successive CPUs in a
0529 given litmus test. There are three types of linkage:
0530
0531 1. Write-to-read, where the next CPU reads the value that the
0532 previous CPU wrote. The LB litmus-test patterns contain only
0533 this type of relation. In formal memory-modeling texts, this
0534 relation is called "reads-from" and is usually abbreviated "rf".
0535
0536 2. Read-to-write, where the next CPU overwrites the value that the
0537 previous CPU read. The SB litmus test contains only this type
0538 of relation. In formal memory-modeling texts, this relation is
0539 often called "from-reads" and is sometimes abbreviated "fr".
0540
0541 3. Write-to-write, where the next CPU overwrites the value written
0542 by the previous CPU. The Z6.0 litmus test pattern contains a
0543 write-to-write relation between the last access of CPU1() and
0544 the first access of CPU2(). In formal memory-modeling texts,
0545 this relation is often called "coherence order" and is sometimes
0546 abbreviated "co". In the C++ standard, it is instead called
0547 "modification order" and often abbreviated "mo".
0548
0549 The strength of memory ordering required for a given litmus test to
0550 avoid a counter-intuitive outcome depends on the types of relations
0551 linking the memory accesses for the outcome in question:
0552
0553 o If all links are write-to-read links, then the weakest
0554 possible ordering within each CPU suffices. For example, in
0555 the LB litmus test, a control dependency was enough to do the
0556 job.
0557
0558 o If all but one of the links are write-to-read links, then a
0559 release-acquire chain suffices. Both the MP and the ISA2
0560 litmus tests illustrate this case.
0561
0562 o If more than one of the links are something other than
0563 write-to-read links, then a full memory barrier is required
0564 between each successive pair of non-write-to-read links. This
0565 case is illustrated by the Z6.0 litmus tests, both in the
0566 locking and in the release-acquire sections.
0567
0568 However, if you find yourself having to stretch these rules of thumb
0569 to fit your situation, you should consider creating a litmus test and
0570 running it on the model.
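
For example, a litmus test for the release-acquire MP pattern shown
earlier might look as follows. This is a sketch in the format used by
the tests in tools/memory-model/litmus-tests/, and is not guaranteed to
match the in-tree MP+pooncerelease+poacquireonce.litmus character for
character:

	C MP+pooncerelease+poacquireonce

	{}

	P0(int *x, int *y)
	{
		WRITE_ONCE(*x, 1);
		smp_store_release(y, 1);
	}

	P1(int *x, int *y)
	{
		int r0;
		int r1;

		r0 = smp_load_acquire(y);
		r1 = READ_ONCE(*x);
	}

	exists (1:r0=1 /\ 1:r1=0)

Such a test can then be run on the model roughly as described in
tools/memory-model/README, for example:

	$ herd7 -conf linux-kernel.cfg MP+pooncerelease+poacquireonce.litmus

herd7 then reports whether the state named in the "exists" clause can
be reached, which for this test it cannot.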