This document provides options for those wishing to keep their
memory-ordering lives simple, as is necessary for those whose domain
is complex.  After all, there are bugs other than memory-ordering bugs,
and the time spent gaining memory-ordering knowledge is not available
for gaining domain knowledge.  Furthermore, the Linux-kernel memory
model (LKMM) is quite complex, with subtle differences in code often
having dramatic effects on correctness.

The options near the beginning of this list are quite simple.  The idea
is not that kernel hackers don't already know about them, but rather
that they might need the occasional reminder.

Please note that this is a generic guide, and that specific subsystems
will often have special requirements or idioms.  For example, developers
of MMIO-based device drivers will often need to use mb(), rmb(), and
wmb(), and therefore might find smp_mb(), smp_rmb(), and smp_wmb()
to be more natural than smp_load_acquire() and smp_store_release().
On the other hand, those coming in from other environments will likely
be more familiar with these last two.


Single-threaded code
====================

In single-threaded code, there is no reordering, at least assuming
that your toolchain and hardware are working correctly.  In addition,
it is generally a mistake to assume your code will only run in a
single-threaded context, as the kernel can enter the same code path on
multiple CPUs at the same time.  One important exception is a function
that makes no external data references.

In the general case, you will need to take explicit steps to ensure that
your code really is executed within a single thread that does not access
shared variables.  A simple way to achieve this is to define a global lock
that you acquire at the beginning of your code and release at the end,
taking care to ensure that all references to your code's shared data are
also carried out under that same lock.  Because only one thread can hold
this lock at a given time, your code will be executed single-threaded.
This approach is called "code locking".
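
As an illustration, code locking can be sketched as follows.  This is
a userspace analogue using POSIX mutexes (in the kernel you would
instead use a mutex or spinlock); the function and variable names are
made up for this example.

```c
#include <pthread.h>

/* Userspace sketch of code locking: one global lock guards an entire
 * code path together with all of the data it touches.  In the kernel,
 * this would be a mutex or spinlock. */
static pthread_mutex_t code_lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter;	/* accessed only while holding code_lock */

long do_single_threaded_work(long delta)
{
	long snapshot;

	pthread_mutex_lock(&code_lock);		/* begin single-threaded region */
	shared_counter += delta;
	snapshot = shared_counter;
	pthread_mutex_unlock(&code_lock);	/* end single-threaded region */
	return snapshot;
}
```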

Code locking can severely limit both performance and scalability, so it
should be used with caution, and only on code paths that execute rarely.
After all, a huge amount of effort was required to remove the Linux
kernel's old "Big Kernel Lock", so let's please be very careful about
adding new "little kernel locks".

One of the advantages of locking is that, in happy contrast with the
year 1981, almost all kernel developers are very familiar with locking.
The Linux kernel's lockdep (CONFIG_PROVE_LOCKING=y) is very helpful with
the formerly feared deadlock scenarios.

Please use the standard locking primitives provided by the kernel rather
than rolling your own.  For one thing, the standard primitives interact
properly with lockdep.  For another thing, these primitives have been
tuned to deal better with high contention.  And for one final thing, it is
surprisingly hard to correctly code production-quality lock acquisition
and release functions.  After all, even simple non-production-quality
locking functions must carefully prevent both the CPU and the compiler
from moving code in either direction across the locking function.

Despite the scalability limitations of single-threaded code, RCU
takes this approach for much of its grace-period processing and also
for early-boot operation.  The reason RCU is able to scale despite
single-threaded grace-period processing is use of batching, where all
updates that accumulated during one grace period are handled by the
next one.  In other words, slowing down grace-period processing makes
it more efficient.  Nor is RCU unique:  Similar batching optimizations
are used in many I/O operations.


Packaged code
=============

Even if performance and scalability concerns prevent your code from
being completely single-threaded, it is often possible to use library
functions that handle the concurrency nearly or entirely on their own.
This approach delegates any LKMM worries to the library maintainer.

In the kernel, what is the "library"?  Quite a bit.  It includes the
contents of the lib/ directory and much of the include/linux/ directory,
along with a lot of other heavily used APIs.  Notable examples include
the list macros (for example, include/linux/{,rcu}list.h), workqueues,
smp_call_function(), and the various hash tables and search trees.


Data locking
============

With code locking, we use single-threaded code execution to guarantee
serialized access to the data that the code is accessing.  However,
we can also achieve this by instead associating the lock with specific
instances of the data structures.  This creates a "critical section"
in the code execution that will execute as though it is single-threaded.
By placing all the accesses and modifications to a shared data structure
inside a critical section, we ensure that the execution context that
holds the lock has exclusive access to the shared data.

The poster boy for this approach is the hash table, where placing a lock
in each hash bucket allows operations on different buckets to proceed
concurrently.  This works because the buckets do not overlap with each
other, so that an operation on one bucket does not interfere with any
other bucket.
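
A data-locking hash table can be sketched as follows, again as a
userspace analogue (a kernel version would typically keep a spinlock_t
and an hlist_head in each bucket); all names here are illustrative,
not a real kernel API.

```c
#include <pthread.h>
#include <stddef.h>

#define NBUCKETS 16

/* Each bucket carries its own lock, so operations on different
 * buckets proceed concurrently. */
struct bucket {
	pthread_mutex_t lock;
	long value;		/* stand-in for the bucket's chain of entries */
};

/* GNU designated-initializer range extension, accepted by gcc/clang. */
static struct bucket table[NBUCKETS] = {
	[0 ... NBUCKETS - 1] = { PTHREAD_MUTEX_INITIALIZER, 0 },
};

static size_t hash(long key)
{
	return (size_t)key % NBUCKETS;
}

void table_add(long key, long delta)
{
	struct bucket *b = &table[hash(key)];

	pthread_mutex_lock(&b->lock);	/* serializes only this bucket */
	b->value += delta;
	pthread_mutex_unlock(&b->lock);
}

long table_read(long key)
{
	struct bucket *b = &table[hash(key)];
	long v;

	pthread_mutex_lock(&b->lock);
	v = b->value;
	pthread_mutex_unlock(&b->lock);
	return v;
}
```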

As the number of buckets increases, data locking scales naturally.
In particular, if the amount of data increases with the number of CPUs,
increasing the number of buckets as the number of CPUs increases results
in a naturally scalable data structure.


Per-CPU processing
==================

Partitioning processing and data over CPUs allows each CPU to take
a single-threaded approach while providing excellent performance and
scalability.  Of course, there is no free lunch:  The dark side of this
excellence is substantially increased memory footprint.

In addition, it is sometimes necessary to update some global view of
this processing and data, in which case something like locking must
be used to protect this global view.  This is the approach taken by
the percpu_counter infrastructure.  In many cases, there are already
generic/library variants of commonly used per-CPU constructs available.
Please use them rather than rolling your own.
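
The idea can be sketched in userspace with thread-local storage
standing in for per-CPU data (the kernel equivalents being
DEFINE_PER_CPU() and this_cpu_inc()); the names are illustrative.

```c
/* Userspace sketch of per-CPU processing: each thread keeps its own
 * counter, so updates need no locking or ordering at all.  A kernel
 * version would use DEFINE_PER_CPU(long, ...) plus this_cpu_inc(),
 * with locking only for any rarely needed global summation. */
static _Thread_local long my_count;	/* analogue of a per-CPU variable */

void count_event(void)
{
	my_count++;	/* this thread's private data: no concurrency */
}

long read_local_count(void)
{
	return my_count;
}
```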

RCU uses DEFINE_PER_CPU*() declarations to create a number of per-CPU
data sets.  For example, each CPU does private quiescent-state processing
within its instance of the per-CPU rcu_data structure, and then uses data
locking to report quiescent states up the grace-period combining tree.


Packaged primitives: Sequence locking
=====================================

Lockless programming is considered by many to be more difficult than
lock-based programming, but there are a few lockless design patterns that
have been built out into an API.  One of these APIs is sequence locking.
Although this API can be used in extremely complex ways, there are simple
and effective ways of using it that avoid the need to pay attention to
memory ordering.

The basic keep-things-simple rule for sequence locking is "do not write
in read-side code".  Yes, you can do writes from within sequence-locking
readers, but it won't be so simple.  For example, such writes will be
lockless and should be idempotent.
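
The keep-it-simple reader pattern can be sketched with a minimal
userspace seqlock analogue.  The kernel's real implementation
(read_seqbegin()/read_seqretry() in include/linux/seqlock.h) handles
many details omitted here; writers are assumed to be serialized by a
lock, and all names below are illustrative.

```c
#include <stdatomic.h>

static atomic_uint seq;		/* even: no write in progress */
static int data_a, data_b;	/* the protected pair of values */

/* Writer side: assumed serialized by a lock, as with the kernel's
 * write_seqlock().  Bumping seq to an odd value warns readers off. */
void pair_write(int a, int b)
{
	atomic_fetch_add(&seq, 1);	/* seq now odd */
	atomic_thread_fence(memory_order_release);
	data_a = a;
	data_b = b;
	atomic_thread_fence(memory_order_release);
	atomic_fetch_add(&seq, 1);	/* seq even again */
}

/* Reader side: only reads, per the keep-things-simple rule, and
 * retries until it obtains a consistent snapshot. */
void pair_read(int *a, int *b)
{
	unsigned int start;

	do {
		while ((start = atomic_load(&seq)) & 1)
			;		/* write in progress: spin */
		atomic_thread_fence(memory_order_acquire);
		*a = data_a;
		*b = data_b;
		atomic_thread_fence(memory_order_acquire);
	} while (atomic_load(&seq) != start);
}
```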

For more sophisticated use cases, LKMM can guide you, including use
cases involving combining sequence locking with other synchronization
primitives.  (LKMM does not yet know about sequence locking, so it is
currently necessary to open-code it in your litmus tests.)

Additional information may be found in include/linux/seqlock.h.


Packaged primitives: RCU
========================

Another lockless design pattern that has been baked into an API
is RCU.  The Linux kernel makes sophisticated use of RCU, but the
keep-things-simple rules for RCU are "do not write in read-side code",
"do not update anything that is visible to and accessed by readers",
and "protect updates with locking".

These rules are illustrated by the functions foo_update_a() and
foo_get_a() shown in Documentation/RCU/whatisRCU.rst.  Additional
RCU usage patterns may be found in Documentation/RCU and in the
source code.


Packaged primitives: Atomic operations
======================================

Back in the day, the Linux kernel had three types of atomic operations:

1.      Initialization and read-out, such as atomic_set() and atomic_read().

2.      Operations that did not return a value and provided no ordering,
        such as atomic_inc() and atomic_dec().

3.      Operations that returned a value and provided full ordering, such as
        atomic_add_return() and atomic_dec_and_test().  Note that some
        value-returning operations provide full ordering only conditionally.
        For example, cmpxchg() provides ordering only upon success.

More recent kernels have operations that return a value but do not
provide full ordering.  These are flagged with either a _relaxed()
suffix (providing no ordering), or an _acquire() or _release() suffix
(providing limited ordering).
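
These three categories can be sketched with their closest C11
counterparts (kernel names in comments).  Note that seq_cst is used
here only as a rough stand-in for the kernel's full ordering; the two
memory models are not identical.

```c
#include <stdatomic.h>

static atomic_long v;

/* 1. Initialization and read-out (kernel: atomic_set()/atomic_read()):
 *    atomic, but implying no ordering. */
void counter_set(long n) { atomic_store_explicit(&v, n, memory_order_relaxed); }
long counter_read(void)  { return atomic_load_explicit(&v, memory_order_relaxed); }

/* 2. Value-less RMW (kernel: atomic_inc()): atomic but unordered. */
void counter_inc(void)   { atomic_fetch_add_explicit(&v, 1, memory_order_relaxed); }

/* 3. Value-returning RMW (kernel: atomic_add_return()): the kernel
 *    version is fully ordered; seq_cst is the closest C11 analogue. */
long counter_add_return(long n)
{
	return atomic_fetch_add_explicit(&v, n, memory_order_seq_cst) + n;
}
```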

Additional information may be found in these files:

Documentation/atomic_t.txt
Documentation/atomic_bitops.txt
Documentation/core-api/refcount-vs-atomic.rst

Reading code using these primitives is often also quite helpful.


Lockless, fully ordered
=======================

When using locking, there often comes a time when it is necessary
to access some variable or another without holding the data lock
that serializes access to that variable.

If you want to keep things simple, use the initialization and read-out
operations from the previous section only when there are no racing
accesses.  Otherwise, use only fully ordered operations when accessing
or modifying the variable.  This approach guarantees that code prior
to a given access to that variable will be seen by all CPUs as having
happened before any code following any later access to that same variable.
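
The following sketch shows a fully ordered handoff, using C11 seq_cst
exchange as a stand-in for a fully ordered kernel operation such as
atomic_xchg(); the names and the flag protocol are illustrative only.

```c
#include <stdatomic.h>

static int payload;		/* written before the handoff */
static atomic_int handoff;

/* Fully ordered update (kernel analogue: atomic_xchg()): everything
 * before it is visible to any CPU that observes the new value via a
 * similarly ordered operation. */
void publish(int value)
{
	payload = value;
	atomic_exchange(&handoff, 1);	/* seq_cst: full ordering */
}

/* Fully ordered consumption: returns 1 and fills in *value exactly
 * once per publish(). */
int try_consume(int *value)
{
	if (atomic_exchange(&handoff, 0)) {
		*value = payload;
		return 1;
	}
	return 0;
}
```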

Please note that per-CPU functions are not atomic operations and
hence they do not provide any ordering guarantees at all.

If the lockless accesses are frequently executed reads that are used
only for heuristics, or if they are frequently executed writes that
are used only for statistics, please see the next section.


Lockless statistics and heuristics
==================================

Unordered primitives such as atomic_read(), atomic_set(), READ_ONCE(), and
WRITE_ONCE() can safely be used in some cases.  These primitives provide
no ordering, but they do prevent the compiler from carrying out a number
of destructive optimizations (for which please see the next section).
One example use for these primitives is statistics, such as per-CPU
counters exemplified by the rt_cache_stat structure's routing-cache
statistics counters.  Another example use case is heuristics, such as
the jiffies_till_first_fqs and jiffies_till_next_fqs kernel parameters
controlling how often RCU scans for idle CPUs.

But be careful.  "Unordered" really does mean "unordered".  It is all
too easy to assume ordering, and this assumption must be avoided when
using these primitives.


Don't let the compiler trip you up
==================================

It can be quite tempting to use plain C-language accesses for lockless
loads from and stores to shared variables.  Although this is both
possible and quite common in the Linux kernel, it does require a
surprising amount of analysis, care, and knowledge about the compiler.
Yes, some decades ago it was not unfair to consider a C compiler to be
an assembler with added syntax and better portability, but the advent of
sophisticated optimizing compilers means that those days are long gone.
Today's optimizing compilers can profoundly rewrite your code during the
translation process, and have long been ready, willing, and able to do so.
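
One classic example of such a rewrite is load hoisting out of a wait
loop, sketched here in userspace with C11 atomics standing in for the
kernel's READ_ONCE(); the variable names are illustrative.

```c
#include <stdatomic.h>

static int ready_plain;		/* plain C variable */
static atomic_int ready;	/* marked for concurrent access */

/* BROKEN: with a plain load, the compiler is entitled to read
 * ready_plain once and turn this into while (1), because nothing
 * tells it the value can change behind its back. */
void wait_broken(void)
{
	while (!ready_plain)
		;
}

/* OK: an explicit marked access (kernel: READ_ONCE()) forces a fresh
 * load on every iteration. */
void wait_ok(void)
{
	while (!atomic_load_explicit(&ready, memory_order_relaxed))
		;
}
```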

Therefore, if you really need to use C-language assignments instead of
READ_ONCE(), WRITE_ONCE(), and so on, you will need to have a very good
understanding of both the C standard and your compiler.  Here are some
introductory references and some tooling to start you on this noble quest:

Who's afraid of a big bad optimizing compiler?
        https://lwn.net/Articles/793253/
Calibrating your fear of big bad optimizing compilers
        https://lwn.net/Articles/799218/
Concurrency bugs should fear the big bad data-race detector (part 1)
        https://lwn.net/Articles/816850/
Concurrency bugs should fear the big bad data-race detector (part 2)
        https://lwn.net/Articles/816854/


More complex use cases
======================

If the alternatives above do not do what you need, please look at the
recipes-pairs.txt file to peel off the next layer of the memory-ordering
onion.