Back to home page




0001 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
0002         "">
0003         <html>
0004         <head><title>A Tour Through TREE_RCU's Data Structures []</title>
0005         <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
0007            <p>January 27, 2016</p>
0008            <p>This article was contributed by Paul E.&nbsp;McKenney</p>
0010 <h3>Introduction</h3>
0012 This document describes RCU's major data structures and their relationship
0013 to each other.
0015 <ol>
0016 <li>    <a href="#Data-Structure Relationships">
0017     Data-Structure Relationships</a>
0018 <li>    <a href="#The rcu_state Structure">
0019     The <tt>rcu_state</tt> Structure</a>
0020 <li>    <a href="#The rcu_node Structure">
0021     The <tt>rcu_node</tt> Structure</a>
0022 <li>    <a href="#The rcu_data Structure">
0023     The <tt>rcu_data</tt> Structure</a>
0024 <li>    <a href="#The rcu_dynticks Structure">
0025     The <tt>rcu_dynticks</tt> Structure</a>
0026 <li>    <a href="#The rcu_head Structure">
0027     The <tt>rcu_head</tt> Structure</a>
0028 <li>    <a href="#RCU-Specific Fields in the task_struct Structure">
0029     RCU-Specific Fields in the <tt>task_struct</tt> Structure</a>
0030 <li>    <a href="#Accessor Functions">
0031     Accessor Functions</a>
0032 </ol>
0034 At the end we have the
0035 <a href="#Answers to Quick Quizzes">answers to the quick quizzes</a>.
0037 <h3><a name="Data-Structure Relationships">Data-Structure Relationships</a></h3>
0039 <p>RCU is for all intents and purposes a large state machine, and its
0040 data structures maintain the state in such a way as to allow RCU readers
0041 to execute extremely quickly, while also processing the RCU grace periods
0042 requested by updaters in an efficient and extremely scalable fashion.
0043 The efficiency and scalability of RCU updaters is provided primarily
0044 by a combining tree, as shown below:
0046 </p><p><img src="BigTreeClassicRCU.svg" alt="BigTreeClassicRCU.svg" width="30%">
0048 </p><p>This diagram shows an enclosing <tt>rcu_state</tt> structure
0049 containing a tree of <tt>rcu_node</tt> structures.
0050 Each leaf node of the <tt>rcu_node</tt> tree has up to 16
0051 <tt>rcu_data</tt> structures associated with it, so that there
0052 are <tt>NR_CPUS</tt> number of <tt>rcu_data</tt> structures,
0053 one for each possible CPU.
0054 This structure is adjusted at boot time, if needed, to handle the
0055 common case where <tt>nr_cpu_ids</tt> is much less than
0056 <tt>NR_CPUs</tt>.
0057 For example, a number of Linux distributions set <tt>NR_CPUs=4096</tt>,
0058 which results in a three-level <tt>rcu_node</tt> tree.
0059 If the actual hardware has only 16 CPUs, RCU will adjust itself
0060 at boot time, resulting in an <tt>rcu_node</tt> tree with only a single node.
0062 </p><p>The purpose of this combining tree is to allow per-CPU events
0063 such as quiescent states, dyntick-idle transitions,
0064 and CPU hotplug operations to be processed efficiently
0065 and scalably.
0066 Quiescent states are recorded by the per-CPU <tt>rcu_data</tt> structures,
0067 and other events are recorded by the leaf-level <tt>rcu_node</tt>
0068 structures.
0069 All of these events are combined at each level of the tree until finally
0070 grace periods are completed at the tree's root <tt>rcu_node</tt>
0071 structure.
0072 A grace period can be completed at the root once every CPU
0073 (or, in the case of <tt>CONFIG_PREEMPT_RCU</tt>, task)
0074 has passed through a quiescent state.
0075 Once a grace period has completed, record of that fact is propagated
0076 back down the tree.
0078 </p><p>As can be seen from the diagram, on a 64-bit system
0079 a two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout
0080 of 64 at the root and a fanout of 16 at the leaves.
0082 <table>
0083 <tr><th>&nbsp;</th></tr>
0084 <tr><th align="left">Quick Quiz:</th></tr>
0085 <tr><td>
0086     Why isn't the fanout at the leaves also 64?
0087 </td></tr>
0088 <tr><th align="left">Answer:</th></tr>
0089 <tr><td bgcolor="#ffffff"><font color="ffffff">
0090     Because there are more types of events that affect the leaf-level
0091     <tt>rcu_node</tt> structures than further up the tree.
0092     Therefore, if the leaf <tt>rcu_node</tt> structures have fanout of
0093     64, the contention on these structures' <tt>-&gt;structures</tt>
0094     becomes excessive.
0095     Experimentation on a wide variety of systems has shown that a fanout
0096     of 16 works well for the leaves of the <tt>rcu_node</tt> tree.
0097     </font>
0099     <p><font color="ffffff">Of course, further experience with
0100     systems having hundreds or thousands of CPUs may demonstrate
0101     that the fanout for the non-leaf <tt>rcu_node</tt> structures
0102     must also be reduced.
0103     Such reduction can be easily carried out when and if it proves
0104     necessary.
0105     In the meantime, if you are using such a system and running into
0106     contention problems on the non-leaf <tt>rcu_node</tt> structures,
0107     you may use the <tt>CONFIG_RCU_FANOUT</tt> kernel configuration
0108     parameter to reduce the non-leaf fanout as needed.
0109     </font>
0111     <p><font color="ffffff">Kernels built for systems with
0112     strong NUMA characteristics might also need to adjust
0113     <tt>CONFIG_RCU_FANOUT</tt> so that the domains of the
0114     <tt>rcu_node</tt> structures align with hardware boundaries.
0115     However, there has thus far been no need for this.
0116 </font></td></tr>
0117 <tr><td>&nbsp;</td></tr>
0118 </table>
0120 <p>If your system has more than 1,024 CPUs (or more than 512 CPUs on
0121 a 32-bit system), then RCU will automatically add more levels to the
0122 tree.
0123 For example, if you are crazy enough to build a 64-bit system with 65,536
0124 CPUs, RCU would configure the <tt>rcu_node</tt> tree as follows:
0126 </p><p><img src="HugeTreeClassicRCU.svg" alt="HugeTreeClassicRCU.svg" width="50%">
0128 </p><p>RCU currently permits up to a four-level tree, which on a 64-bit system
0129 accommodates up to 4,194,304 CPUs, though only a mere 524,288 CPUs for
0130 32-bit systems.
0131 On the other hand, you can set <tt>CONFIG_RCU_FANOUT</tt> to be
0132 as small as 2 if you wish, which would permit only 16 CPUs, which
0133 is useful for testing.
0135 </p><p>This multi-level combining tree allows us to get most of the
0136 performance and scalability
0137 benefits of partitioning, even though RCU grace-period detection is
0138 inherently a global operation.
0139 The trick here is that only the last CPU to report a quiescent state
0140 into a given <tt>rcu_node</tt> structure need advance to the <tt>rcu_node</tt>
0141 structure at the next level up the tree.
0142 This means that at the leaf-level <tt>rcu_node</tt> structure, only
0143 one access out of sixteen will progress up the tree.
0144 For the internal <tt>rcu_node</tt> structures, the situation is even
0145 more extreme:  Only one access out of sixty-four will progress up
0146 the tree.
0147 Because the vast majority of the CPUs do not progress up the tree,
0148 the lock contention remains roughly constant up the tree.
0149 No matter how many CPUs there are in the system, at most 64 quiescent-state
0150 reports per grace period will progress all the way to the root
0151 <tt>rcu_node</tt> structure, thus ensuring that the lock contention
0152 on that root <tt>rcu_node</tt> structure remains acceptably low.
0154 </p><p>In effect, the combining tree acts like a big shock absorber,
0155 keeping lock contention under control at all tree levels regardless
0156 of the level of loading on the system.
0158 </p><p>The Linux kernel actually supports multiple flavors of RCU
0159 running concurrently, so RCU builds separate data structures for each
0160 flavor.
0161 For example, for <tt>CONFIG_TREE_RCU=y</tt> kernels, RCU provides
0162 rcu_sched and rcu_bh, as shown below:
0164 </p><p><img src="BigTreeClassicRCUBH.svg" alt="BigTreeClassicRCUBH.svg" width="33%">
0166 </p><p>Energy efficiency is increasingly important, and for that
0167 reason the Linux kernel provides <tt>CONFIG_NO_HZ_IDLE</tt>, which
0168 turns off the scheduling-clock interrupts on idle CPUs, which in
0169 turn allows those CPUs to attain deeper sleep states and to consume
0170 less energy.
0171 CPUs whose scheduling-clock interrupts have been turned off are
0172 said to be in <i>dyntick-idle mode</i>.
0173 RCU must handle dyntick-idle CPUs specially
0174 because RCU would otherwise wake up each CPU on every grace period,
0175 which would defeat the whole purpose of <tt>CONFIG_NO_HZ_IDLE</tt>.
0176 RCU uses the <tt>rcu_dynticks</tt> structure to track
0177 which CPUs are in dyntick idle mode, as shown below:
0179 </p><p><img src="BigTreeClassicRCUBHdyntick.svg" alt="BigTreeClassicRCUBHdyntick.svg" width="33%">
0181 </p><p>However, if a CPU is in dyntick-idle mode, it is in that mode
0182 for all flavors of RCU.
0183 Therefore, a single <tt>rcu_dynticks</tt> structure is allocated per
0184 CPU, and all of a given CPU's <tt>rcu_data</tt> structures share
0185 that <tt>rcu_dynticks</tt>, as shown in the figure.
0187 </p><p>Kernels built with <tt>CONFIG_PREEMPT_RCU</tt> support
0188 rcu_preempt in addition to rcu_sched and rcu_bh, as shown below:
0190 </p><p><img src="BigTreePreemptRCUBHdyntick.svg" alt="BigTreePreemptRCUBHdyntick.svg" width="35%">
0192 </p><p>RCU updaters wait for normal grace periods by registering
0193 RCU callbacks, either directly via <tt>call_rcu()</tt> and
0194 friends (namely <tt>call_rcu_bh()</tt> and <tt>call_rcu_sched()</tt>),
0195 there being a separate interface per flavor of RCU)
0196 or indirectly via <tt>synchronize_rcu()</tt> and friends.
0197 RCU callbacks are represented by <tt>rcu_head</tt> structures,
0198 which are queued on <tt>rcu_data</tt> structures while they are
0199 waiting for a grace period to elapse, as shown in the following figure:
0201 </p><p><img src="BigTreePreemptRCUBHdyntickCB.svg" alt="BigTreePreemptRCUBHdyntickCB.svg" width="40%">
0203 </p><p>This figure shows how <tt>TREE_RCU</tt>'s and
0204 <tt>PREEMPT_RCU</tt>'s major data structures are related.
0205 Lesser data structures will be introduced with the algorithms that
0206 make use of them.
0208 </p><p>Note that each of the data structures in the above figure has
0209 its own synchronization:
0211 <p><ol>
0212 <li>    Each <tt>rcu_state</tt> structures has a lock and a mutex,
0213     and some fields are protected by the corresponding root
0214     <tt>rcu_node</tt> structure's lock.
0215 <li>    Each <tt>rcu_node</tt> structure has a spinlock.
0216 <li>    The fields in <tt>rcu_data</tt> are private to the corresponding
0217     CPU, although a few can be read and written by other CPUs.
0218 <li>    Similarly, the fields in <tt>rcu_dynticks</tt> are private
0219     to the corresponding CPU, although a few can be read by
0220     other CPUs.
0221 </ol>
0223 <p>It is important to note that different data structures can have
0224 very different ideas about the state of RCU at any given time.
0225 For but one example, awareness of the start or end of a given RCU
0226 grace period propagates slowly through the data structures.
0227 This slow propagation is absolutely necessary for RCU to have good
0228 read-side performance.
0229 If this balkanized implementation seems foreign to you, one useful
0230 trick is to consider each instance of these data structures to be
0231 a different person, each having the usual slightly different
0232 view of reality.
0234 </p><p>The general role of each of these data structures is as
0235 follows:
0237 </p><ol>
0238 <li>    <tt>rcu_state</tt>:
0239     This structure forms the interconnection between the
0240     <tt>rcu_node</tt> and <tt>rcu_data</tt> structures,
0241     tracks grace periods, serves as short-term repository
0242     for callbacks orphaned by CPU-hotplug events,
0243     maintains <tt>rcu_barrier()</tt> state,
0244     tracks expedited grace-period state,
0245     and maintains state used to force quiescent states when
0246     grace periods extend too long,
0247 <li>    <tt>rcu_node</tt>: This structure forms the combining
0248     tree that propagates quiescent-state
0249     information from the leaves to the root, and also propagates
0250     grace-period information from the root to the leaves.
0251     It provides local copies of the grace-period state in order
0252     to allow this information to be accessed in a synchronized
0253     manner without suffering the scalability limitations that
0254     would otherwise be imposed by global locking.
0255     In <tt>CONFIG_PREEMPT_RCU</tt> kernels, it manages the lists
0256     of tasks that have blocked while in their current
0257     RCU read-side critical section.
0258     In <tt>CONFIG_PREEMPT_RCU</tt> with
0259     <tt>CONFIG_RCU_BOOST</tt>, it manages the
0260     per-<tt>rcu_node</tt> priority-boosting
0261     kernel threads (kthreads) and state.
0262     Finally, it records CPU-hotplug state in order to determine
0263     which CPUs should be ignored during a given grace period.
0264 <li>    <tt>rcu_data</tt>: This per-CPU structure is the
0265     focus of quiescent-state detection and RCU callback queuing.
0266     It also tracks its relationship to the corresponding leaf
0267     <tt>rcu_node</tt> structure to allow more-efficient
0268     propagation of quiescent states up the <tt>rcu_node</tt>
0269     combining tree.
0270     Like the <tt>rcu_node</tt> structure, it provides a local
0271     copy of the grace-period information to allow for-free
0272     synchronized
0273     access to this information from the corresponding CPU.
0274     Finally, this structure records past dyntick-idle state
0275     for the corresponding CPU and also tracks statistics.
0276 <li>    <tt>rcu_dynticks</tt>:
0277     This per-CPU structure tracks the current dyntick-idle
0278     state for the corresponding CPU.
0279     Unlike the other three structures, the <tt>rcu_dynticks</tt>
0280     structure is not replicated per RCU flavor.
0281 <li>    <tt>rcu_head</tt>:
0282     This structure represents RCU callbacks, and is the
0283     only structure allocated and managed by RCU users.
0284     The <tt>rcu_head</tt> structure is normally embedded
0285     within the RCU-protected data structure.
0286 </ol>
0288 <p>If all you wanted from this article was a general notion of how
0289 RCU's data structures are related, you are done.
0290 Otherwise, each of the following sections give more details on
0291 the <tt>rcu_state</tt>, <tt>rcu_node</tt>, <tt>rcu_data</tt>,
0292 and <tt>rcu_dynticks</tt> data structures.
0294 <h3><a name="The rcu_state Structure">
0295 The <tt>rcu_state</tt> Structure</a></h3>
0297 <p>The <tt>rcu_state</tt> structure is the base structure that
0298 represents a flavor of RCU.
0299 This structure forms the interconnection between the
0300 <tt>rcu_node</tt> and <tt>rcu_data</tt> structures,
0301 tracks grace periods, contains the lock used to
0302 synchronize with CPU-hotplug events,
0303 and maintains state used to force quiescent states when
0304 grace periods extend too long,
0306 </p><p>A few of the <tt>rcu_state</tt> structure's fields are discussed,
0307 singly and in groups, in the following sections.
0308 The more specialized fields are covered in the discussion of their
0309 use.
0311 <h5>Relationship to rcu_node and rcu_data Structures</h5>
0313 This portion of the <tt>rcu_state</tt> structure is declared
0314 as follows:
0316 <pre>
0317   1   struct rcu_node node[NUM_RCU_NODES];
0318   2   struct rcu_node *level[NUM_RCU_LVLS + 1];
0319   3   struct rcu_data __percpu *rda;
0320 </pre>
0322 <table>
0323 <tr><th>&nbsp;</th></tr>
0324 <tr><th align="left">Quick Quiz:</th></tr>
0325 <tr><td>
0326     Wait a minute!
0327     You said that the <tt>rcu_node</tt> structures formed a tree,
0328     but they are declared as a flat array!
0329     What gives?
0330 </td></tr>
0331 <tr><th align="left">Answer:</th></tr>
0332 <tr><td bgcolor="#ffffff"><font color="ffffff">
0333     The tree is laid out in the array.
0334     The first node In the array is the head, the next set of nodes in the
0335     array are children of the head node, and so on until the last set of
0336     nodes in the array are the leaves.
0337     </font>
0339     <p><font color="ffffff">See the following diagrams to see how
0340     this works.
0341 </font></td></tr>
0342 <tr><td>&nbsp;</td></tr>
0343 </table>
0345 <p>The <tt>rcu_node</tt> tree is embedded into the
0346 <tt>-&gt;node[]</tt> array as shown in the following figure:
0348 </p><p><img src="TreeMapping.svg" alt="TreeMapping.svg" width="40%">
0350 </p><p>One interesting consequence of this mapping is that a
0351 breadth-first traversal of the tree is implemented as a simple
0352 linear scan of the array, which is in fact what the
0353 <tt>rcu_for_each_node_breadth_first()</tt> macro does.
0354 This macro is used at the beginning and ends of grace periods.
0356 </p><p>Each entry of the <tt>-&gt;level</tt> array references
0357 the first <tt>rcu_node</tt> structure on the corresponding level
0358 of the tree, for example, as shown below:
0360 </p><p><img src="TreeMappingLevel.svg" alt="TreeMappingLevel.svg" width="40%">
0362 </p><p>The zero<sup>th</sup> element of the array references the root
0363 <tt>rcu_node</tt> structure, the first element references the
0364 first child of the root <tt>rcu_node</tt>, and finally the second
0365 element references the first leaf <tt>rcu_node</tt> structure.
0367 </p><p>For whatever it is worth, if you draw the tree to be tree-shaped
0368 rather than array-shaped, it is easy to draw a planar representation:
0370 </p><p><img src="TreeLevel.svg" alt="TreeLevel.svg" width="60%">
0372 </p><p>Finally, the <tt>-&gt;rda</tt> field references a per-CPU
0373 pointer to the corresponding CPU's <tt>rcu_data</tt> structure.
0375 </p><p>All of these fields are constant once initialization is complete,
0376 and therefore need no protection.
0378 <h5>Grace-Period Tracking</h5>
0380 <p>This portion of the <tt>rcu_state</tt> structure is declared
0381 as follows:
0383 <pre>
0384   1   unsigned long gpnum;
0385   2   unsigned long completed;
0386 </pre>
0388 <p>RCU grace periods are numbered, and
0389 the <tt>-&gt;gpnum</tt> field contains the number of the grace
0390 period that started most recently.
0391 The <tt>-&gt;completed</tt> field contains the number of the
0392 grace period that completed most recently.
0393 If the two fields are equal, the RCU grace period that most recently
0394 started has already completed, and therefore the corresponding
0395 flavor of RCU is idle.
0396 If <tt>-&gt;gpnum</tt> is one greater than <tt>-&gt;completed</tt>,
0397 then <tt>-&gt;gpnum</tt> gives the number of the current RCU
0398 grace period, which has not yet completed.
0399 Any other combination of values indicates that something is broken.
0400 These two fields are protected by the root <tt>rcu_node</tt>'s
0401 <tt>-&gt;lock</tt> field.
0403 </p><p>There are <tt>-&gt;gpnum</tt> and <tt>-&gt;completed</tt> fields
0404 in the <tt>rcu_node</tt> and <tt>rcu_data</tt> structures
0405 as well.
0406 The fields in the <tt>rcu_state</tt> structure represent the
0407 most current values, and those of the other structures are compared
0408 in order to detect the start of a new grace period in a distributed
0409 fashion.
0410 The values flow from <tt>rcu_state</tt> to <tt>rcu_node</tt>
0411 (down the tree from the root to the leaves) to <tt>rcu_data</tt>.
0413 <h5>Miscellaneous</h5>
0415 <p>This portion of the <tt>rcu_state</tt> structure is declared
0416 as follows:
0418 <pre>
0419   1   unsigned long gp_max;
0420   2   char abbr;
0421   3   char *name;
0422 </pre>
0424 <p>The <tt>-&gt;gp_max</tt> field tracks the duration of the longest
0425 grace period in jiffies.
0426 It is protected by the root <tt>rcu_node</tt>'s <tt>-&gt;lock</tt>.
0428 <p>The <tt>-&gt;name</tt> field points to the name of the RCU flavor
0429 (for example, &ldquo;rcu_sched&rdquo;), and is constant.
0430 The <tt>-&gt;abbr</tt> field contains a one-character abbreviation,
0431 for example, &ldquo;s&rdquo; for RCU-sched.
0433 <h3><a name="The rcu_node Structure">
0434 The <tt>rcu_node</tt> Structure</a></h3>
0436 <p>The <tt>rcu_node</tt> structures form the combining
0437 tree that propagates quiescent-state
0438 information from the leaves to the root and also that propagates
0439 grace-period information from the root down to the leaves.
0440 They provides local copies of the grace-period state in order
0441 to allow this information to be accessed in a synchronized
0442 manner without suffering the scalability limitations that
0443 would otherwise be imposed by global locking.
0444 In <tt>CONFIG_PREEMPT_RCU</tt> kernels, they manage the lists
0445 of tasks that have blocked while in their current
0446 RCU read-side critical section.
0447 In <tt>CONFIG_PREEMPT_RCU</tt> with
0448 <tt>CONFIG_RCU_BOOST</tt>, they manage the
0449 per-<tt>rcu_node</tt> priority-boosting
0450 kernel threads (kthreads) and state.
0451 Finally, they record CPU-hotplug state in order to determine
0452 which CPUs should be ignored during a given grace period.
0454 </p><p>The <tt>rcu_node</tt> structure's fields are discussed,
0455 singly and in groups, in the following sections.
0457 <h5>Connection to Combining Tree</h5>
0459 <p>This portion of the <tt>rcu_node</tt> structure is declared
0460 as follows:
0462 <pre>
0463   1   struct rcu_node *parent;
0464   2   u8 level;
0465   3   u8 grpnum;
0466   4   unsigned long grpmask;
0467   5   int grplo;
0468   6   int grphi;
0469 </pre>
0471 <p>The <tt>-&gt;parent</tt> pointer references the <tt>rcu_node</tt>
0472 one level up in the tree, and is <tt>NULL</tt> for the root
0473 <tt>rcu_node</tt>.
0474 The RCU implementation makes heavy use of this field to push quiescent
0475 states up the tree.
0476 The <tt>-&gt;level</tt> field gives the level in the tree, with
0477 the root being at level zero, its children at level one, and so on.
0478 The <tt>-&gt;grpnum</tt> field gives this node's position within
0479 the children of its parent, so this number can range between 0 and 31
0480 on 32-bit systems and between 0 and 63 on 64-bit systems.
0481 The <tt>-&gt;level</tt> and <tt>-&gt;grpnum</tt> fields are
0482 used only during initialization and for tracing.
0483 The <tt>-&gt;grpmask</tt> field is the bitmask counterpart of
0484 <tt>-&gt;grpnum</tt>, and therefore always has exactly one bit set.
0485 This mask is used to clear the bit corresponding to this <tt>rcu_node</tt>
0486 structure in its parent's bitmasks, which are described later.
0487 Finally, the <tt>-&gt;grplo</tt> and <tt>-&gt;grphi</tt> fields
0488 contain the lowest and highest numbered CPU served by this
0489 <tt>rcu_node</tt> structure, respectively.
0491 </p><p>All of these fields are constant, and thus do not require any
0492 synchronization.
0494 <h5>Synchronization</h5>
0496 <p>This field of the <tt>rcu_node</tt> structure is declared
0497 as follows:
0499 <pre>
0500   1   raw_spinlock_t lock;
0501 </pre>
0503 <p>This field is used to protect the remaining fields in this structure,
0504 unless otherwise stated.
0505 That said, all of the fields in this structure can be accessed without
0506 locking for tracing purposes.
0507 Yes, this can result in confusing traces, but better some tracing confusion
0508 than to be heisenbugged out of existence.
0510 <h5>Grace-Period Tracking</h5>
0512 <p>This portion of the <tt>rcu_node</tt> structure is declared
0513 as follows:
0515 <pre>
0516   1   unsigned long gpnum;
0517   2   unsigned long completed;
0518 </pre>
0520 <p>These fields are the counterparts of the fields of the same name in
0521 the <tt>rcu_state</tt> structure.
0522 They each may lag up to one behind their <tt>rcu_state</tt>
0523 counterparts.
0524 If a given <tt>rcu_node</tt> structure's <tt>-&gt;gpnum</tt> and
0525 <tt>-&gt;complete</tt> fields are equal, then this <tt>rcu_node</tt>
0526 structure believes that RCU is idle.
0527 Otherwise, as with the <tt>rcu_state</tt> structure,
0528 the <tt>-&gt;gpnum</tt> field will be one greater than the
0529 <tt>-&gt;complete</tt> fields, with <tt>-&gt;gpnum</tt>
0530 indicating which grace period this <tt>rcu_node</tt> believes
0531 is still being waited for.
0533 </p><p>The <tt>&gt;gpnum</tt> field of each <tt>rcu_node</tt>
0534 structure is updated at the beginning
0535 of each grace period, and the <tt>-&gt;completed</tt> fields are
0536 updated at the end of each grace period.
0538 <h5>Quiescent-State Tracking</h5>
0540 <p>These fields manage the propagation of quiescent states up the
0541 combining tree.
0543 </p><p>This portion of the <tt>rcu_node</tt> structure has fields
0544 as follows:
0546 <pre>
0547   1   unsigned long qsmask;
0548   2   unsigned long expmask;
0549   3   unsigned long qsmaskinit;
0550   4   unsigned long expmaskinit;
0551 </pre>
0553 <p>The <tt>-&gt;qsmask</tt> field tracks which of this
0554 <tt>rcu_node</tt> structure's children still need to report
0555 quiescent states for the current normal grace period.
0556 Such children will have a value of 1 in their corresponding bit.
0557 Note that the leaf <tt>rcu_node</tt> structures should be
0558 thought of as having <tt>rcu_data</tt> structures as their
0559 children.
0560 Similarly, the <tt>-&gt;expmask</tt> field tracks which
0561 of this <tt>rcu_node</tt> structure's children still need to report
0562 quiescent states for the current expedited grace period.
0563 An expedited grace period has
0564 the same conceptual properties as a normal grace period, but the
0565 expedited implementation accepts extreme CPU overhead to obtain
0566 much lower grace-period latency, for example, consuming a few
0567 tens of microseconds worth of CPU time to reduce grace-period
0568 duration from milliseconds to tens of microseconds.
0569 The <tt>-&gt;qsmaskinit</tt> field tracks which of this
0570 <tt>rcu_node</tt> structure's children cover for at least
0571 one online CPU.
0572 This mask is used to initialize <tt>-&gt;qsmask</tt>,
0573 and <tt>-&gt;expmaskinit</tt> is used to initialize
0574 <tt>-&gt;expmask</tt> and the beginning of the
0575 normal and expedited grace periods, respectively.
0577 <table>
0578 <tr><th>&nbsp;</th></tr>
0579 <tr><th align="left">Quick Quiz:</th></tr>
0580 <tr><td>
0581     Why are these bitmasks protected by locking?
0582     Come on, haven't you heard of atomic instructions???
0583 </td></tr>
0584 <tr><th align="left">Answer:</th></tr>
0585 <tr><td bgcolor="#ffffff"><font color="ffffff">
0586     Lockless grace-period computation!  Such a tantalizing possibility!
0587     </font>
0589     <p><font color="ffffff">But consider the following sequence of events:
0590     </font>
0592     <ol>
0593     <li>    <font color="ffffff">CPU&nbsp;0 has been in dyntick-idle
0594         mode for quite some time.
0595         When it wakes up, it notices that the current RCU
0596         grace period needs it to report in, so it sets a
0597         flag where the scheduling clock interrupt will find it.
0598         </font><p>
0599     <li>    <font color="ffffff">Meanwhile, CPU&nbsp;1 is running
0600         <tt>force_quiescent_state()</tt>,
0601         and notices that CPU&nbsp;0 has been in dyntick idle mode,
0602         which qualifies as an extended quiescent state.
0603         </font><p>
0604     <li>    <font color="ffffff">CPU&nbsp;0's scheduling clock
0605         interrupt fires in the
0606         middle of an RCU read-side critical section, and notices
0607         that the RCU core needs something, so commences RCU softirq
0608         processing.
0609         </font>
0610         <p>
0611     <li>    <font color="ffffff">CPU&nbsp;0's softirq handler
0612         executes and is just about ready
0613         to report its quiescent state up the <tt>rcu_node</tt>
0614         tree.
0615         </font><p>
0616     <li>    <font color="ffffff">But CPU&nbsp;1 beats it to the punch,
0617         completing the current
0618         grace period and starting a new one.
0619         </font><p>
0620     <li>    <font color="ffffff">CPU&nbsp;0 now reports its quiescent
0621         state for the wrong
0622         grace period.
0623         That grace period might now end before the RCU read-side
0624         critical section.
0625         If that happens, disaster will ensue.
0626         </font>
0627     </ol>
0629     <p><font color="ffffff">So the locking is absolutely required in
0630     order to coordinate
0631     clearing of the bits with the grace-period numbers in
0632     <tt>-&gt;gpnum</tt> and <tt>-&gt;completed</tt>.
0633 </font></td></tr>
0634 <tr><td>&nbsp;</td></tr>
0635 </table>
0637 <h5>Blocked-Task Management</h5>
0639 <p><tt>PREEMPT_RCU</tt> allows tasks to be preempted in the
0640 midst of their RCU read-side critical sections, and these tasks
0641 must be tracked explicitly.
0642 The details of exactly why and how they are tracked will be covered
0643 in a separate article on RCU read-side processing.
0644 For now, it is enough to know that the <tt>rcu_node</tt>
0645 structure tracks them.
0647 <pre>
0648   1   struct list_head blkd_tasks;
0649   2   struct list_head *gp_tasks;
0650   3   struct list_head *exp_tasks;
0651   4   bool wait_blkd_tasks;
0652 </pre>
0654 <p>The <tt>-&gt;blkd_tasks</tt> field is a list header for
0655 the list of blocked and preempted tasks.
0656 As tasks undergo context switches within RCU read-side critical
0657 sections, their <tt>task_struct</tt> structures are enqueued
0658 (via the <tt>task_struct</tt>'s <tt>-&gt;rcu_node_entry</tt>
0659 field) onto the head of the <tt>-&gt;blkd_tasks</tt> list for the
0660 leaf <tt>rcu_node</tt> structure corresponding to the CPU
0661 on which the outgoing context switch executed.
0662 As these tasks later exit their RCU read-side critical sections,
0663 they remove themselves from the list.
0664 This list is therefore in reverse time order, so that if one of the tasks
0665 is blocking the current grace period, all subsequent tasks must
0666 also be blocking that same grace period.
0667 Therefore, a single pointer into this list suffices to track
0668 all tasks blocking a given grace period.
0669 That pointer is stored in <tt>-&gt;gp_tasks</tt> for normal
0670 grace periods and in <tt>-&gt;exp_tasks</tt> for expedited
0671 grace periods.
0672 These last two fields are <tt>NULL</tt> if either there is
0673 no grace period in flight or if there are no blocked tasks
0674 preventing that grace period from completing.
0675 If either of these two pointers is referencing a task that
0676 removes itself from the <tt>-&gt;blkd_tasks</tt> list,
0677 then that task must advance the pointer to the next task on
0678 the list, or set the pointer to <tt>NULL</tt> if there
0679 are no subsequent tasks on the list.
0681 </p><p>For example, suppose that tasks&nbsp;T1, T2, and&nbsp;T3 are
0682 all hard-affinitied to the largest-numbered CPU in the system.
0683 Then if task&nbsp;T1 blocked in an RCU read-side
0684 critical section, then an expedited grace period started,
0685 then task&nbsp;T2 blocked in an RCU read-side critical section,
0686 then a normal grace period started, and finally task&nbsp;3 blocked
0687 in an RCU read-side critical section, then the state of the
0688 last leaf <tt>rcu_node</tt> structure's blocked-task list
0689 would be as shown below:
0691 </p><p><img src="blkd_task.svg" alt="blkd_task.svg" width="60%">
0693 </p><p>Task&nbsp;T1 is blocking both grace periods, task&nbsp;T2 is
0694 blocking only the normal grace period, and task&nbsp;T3 is blocking
0695 neither grace period.
0696 Note that these tasks will not remove themselves from this list
0697 immediately upon resuming execution.
0698 They will instead remain on the list until they execute the outermost
0699 <tt>rcu_read_unlock()</tt> that ends their RCU read-side critical
0700 section.
0702 <p>
0703 The <tt>-&gt;wait_blkd_tasks</tt> field indicates whether or not
0704 the current grace period is waiting on a blocked task.
0706 <h5>Sizing the <tt>rcu_node</tt> Array</h5>
0708 <p>The <tt>rcu_node</tt> array is sized via a series of
0709 C-preprocessor expressions as follows:
0711 <pre>
0712  1 #ifdef CONFIG_RCU_FANOUT
0714  3 #else
0715  4 # ifdef CONFIG_64BIT
0716  5 # define RCU_FANOUT 64
0717  6 # else
0718  7 # define RCU_FANOUT 32
0719  8 # endif
0720  9 #endif
0721 10
0724 13 #else
0725 14 # ifdef CONFIG_64BIT
0726 15 # define RCU_FANOUT_LEAF 64
0727 16 # else
0728 17 # define RCU_FANOUT_LEAF 32
0729 18 # endif
0730 19 #endif
0731 20
0732 21 #define RCU_FANOUT_1        (RCU_FANOUT_LEAF)
0733 22 #define RCU_FANOUT_2        (RCU_FANOUT_1 * RCU_FANOUT)
0734 23 #define RCU_FANOUT_3        (RCU_FANOUT_2 * RCU_FANOUT)
0735 24 #define RCU_FANOUT_4        (RCU_FANOUT_3 * RCU_FANOUT)
0736 25
0737 26 #if NR_CPUS &lt;= RCU_FANOUT_1
0738 27 #  define RCU_NUM_LVLS        1
0739 28 #  define NUM_RCU_LVL_0        1
0740 29 #  define NUM_RCU_NODES        NUM_RCU_LVL_0
0741 30 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0 }
0742 31 #  define RCU_NODE_NAME_INIT  { "rcu_node_0" }
0743 32 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0" }
0744 33 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0" }
0745 34 #elif NR_CPUS &lt;= RCU_FANOUT_2
0746 35 #  define RCU_NUM_LVLS        2
0747 36 #  define NUM_RCU_LVL_0        1
0748 37 #  define NUM_RCU_LVL_1        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
0749 38 #  define NUM_RCU_NODES        (NUM_RCU_LVL_0 + NUM_RCU_LVL_1)
0750 39 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0, NUM_RCU_LVL_1 }
0751 40 #  define RCU_NODE_NAME_INIT  { "rcu_node_0", "rcu_node_1" }
0752 41 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0", "rcu_node_fqs_1" }
0753 42 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0", "rcu_node_exp_1" }
0754 43 #elif NR_CPUS &lt;= RCU_FANOUT_3
0755 44 #  define RCU_NUM_LVLS        3
0756 45 #  define NUM_RCU_LVL_0        1
0757 46 #  define NUM_RCU_LVL_1        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
0758 47 #  define NUM_RCU_LVL_2        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
0759 48 #  define NUM_RCU_NODES        (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2)
0760 49 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 }
0761 50 #  define RCU_NODE_NAME_INIT  { "rcu_node_0", "rcu_node_1", "rcu_node_2" }
0762 51 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" }
0763 52 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2" }
0764 53 #elif NR_CPUS &lt;= RCU_FANOUT_4
0765 54 #  define RCU_NUM_LVLS        4
0766 55 #  define NUM_RCU_LVL_0        1
0767 56 #  define NUM_RCU_LVL_1        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3)
0768 57 #  define NUM_RCU_LVL_2        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
0769 58 #  define NUM_RCU_LVL_3        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
0770 59 #  define NUM_RCU_NODES        (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3)
0771 60 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2, NUM_RCU_LVL_3 }
0772 61 #  define RCU_NODE_NAME_INIT  { "rcu_node_0", "rcu_node_1", "rcu_node_2", "rcu_node_3" }
0773 62 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2", "rcu_node_fqs_3" }
0774 63 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2", "rcu_node_exp_3" }
0775 64 #else
0776 65 # error "CONFIG_RCU_FANOUT insufficient for NR_CPUS"
0777 66 #endif
0778 </pre>
0780 <p>The maximum number of levels in the <tt>rcu_node</tt> structure
0781 is currently limited to four, as specified by lines&nbsp;21-24
0782 and the structure of the subsequent &ldquo;if&rdquo; statement.
0783 For 32-bit systems, this allows 16*32*32*32=524,288 CPUs, which
0784 should be sufficient for the next few years at least.
0785 For 64-bit systems, 16*64*64*64=4,194,304 CPUs is allowed, which
0786 should see us through the next decade or so.
0787 This four-level tree also allows kernels built with
0788 <tt>CONFIG_RCU_FANOUT=8</tt> to support up to 4096 CPUs,
0789 which might be useful in very large systems having eight CPUs per
0790 socket (but please note that no one has yet shown any measurable
0791 performance degradation due to misaligned socket and <tt>rcu_node</tt>
0792 boundaries).
0793 In addition, building kernels with a full four levels of <tt>rcu_node</tt>
0794 tree permits better testing of RCU's combining-tree code.
0796 </p><p>The <tt>RCU_FANOUT</tt> symbol controls how many children
0797 are permitted at each non-leaf level of the <tt>rcu_node</tt> tree.
0798 If the <tt>CONFIG_RCU_FANOUT</tt> Kconfig option is not specified,
0799 it is set based on the word size of the system, which is also
0800 the Kconfig default.
0802 </p><p>The <tt>RCU_FANOUT_LEAF</tt> symbol controls how many CPUs are
0803 handled by each leaf <tt>rcu_node</tt> structure.
0804 Experience has shown that allowing a given leaf <tt>rcu_node</tt>
0805 structure to handle 64 CPUs, as permitted by the number of bits in
0806 the <tt>-&gt;qsmask</tt> field on a 64-bit system, results in
0807 excessive contention for the leaf <tt>rcu_node</tt> structures'
0808 <tt>-&gt;lock</tt> fields.
0809 The number of CPUs per leaf <tt>rcu_node</tt> structure is therefore
0810 limited to 16 given the default value of <tt>CONFIG_RCU_FANOUT_LEAF</tt>.
0811 If <tt>CONFIG_RCU_FANOUT_LEAF</tt> is unspecified, the value
0812 selected is based on the word size of the system, just as for
0813 <tt>CONFIG_RCU_FANOUT</tt>.
0814 Lines&nbsp;11-19 perform this computation.
0816 </p><p>Lines&nbsp;21-24 compute the maximum number of CPUs supported by
0817 a single-level (which contains a single <tt>rcu_node</tt> structure),
0818 two-level, three-level, and four-level <tt>rcu_node</tt> tree,
0819 respectively, given the fanout specified by <tt>RCU_FANOUT</tt>
0820 and <tt>RCU_FANOUT_LEAF</tt>.
0821 These numbers of CPUs are retained in the
0822 <tt>RCU_FANOUT_1</tt>,
0823 <tt>RCU_FANOUT_2</tt>,
0824 <tt>RCU_FANOUT_3</tt>, and
0825 <tt>RCU_FANOUT_4</tt>
0826 C-preprocessor variables, respectively.
0828 </p><p>These variables are used to control the C-preprocessor <tt>#if</tt>
0829 statement spanning lines&nbsp;26-66 that computes the number of
0830 <tt>rcu_node</tt> structures required for each level of the tree,
0831 as well as the number of levels required.
0832 The number of levels is placed in the <tt>NUM_RCU_LVLS</tt>
0833 C-preprocessor variable by lines&nbsp;27, 35, 44, and&nbsp;54.
0834 The number of <tt>rcu_node</tt> structures for the topmost level
0835 of the tree is always exactly one, and this value is unconditionally
0836 placed into <tt>NUM_RCU_LVL_0</tt> by lines&nbsp;28, 36, 45, and&nbsp;55.
0837 The rest of the levels (if any) of the <tt>rcu_node</tt> tree
0838 are computed by dividing the maximum number of CPUs by the
0839 fanout supported by the number of levels from the current level down,
0840 rounding up.  This computation is performed by lines&nbsp;37,
0841 46-47, and&nbsp;56-58.
0842 Lines&nbsp;31-33, 40-42, 50-52, and&nbsp;62-63 create initializers
0843 for lockdep lock-class names.
0844 Finally, lines&nbsp;64-66 produce an error if the maximum number of
0845 CPUs is too large for the specified fanout.
0847 <h3><a name="The rcu_data Structure">
0848 The <tt>rcu_data</tt> Structure</a></h3>
0850 <p>The <tt>rcu_data</tt> maintains the per-CPU state for the
0851 corresponding flavor of RCU.
0852 The fields in this structure may be accessed only from the corresponding
0853 CPU (and from tracing) unless otherwise stated.
0854 This structure is the
0855 focus of quiescent-state detection and RCU callback queuing.
0856 It also tracks its relationship to the corresponding leaf
0857 <tt>rcu_node</tt> structure to allow more-efficient
0858 propagation of quiescent states up the <tt>rcu_node</tt>
0859 combining tree.
0860 Like the <tt>rcu_node</tt> structure, it provides a local
0861 copy of the grace-period information to allow for-free
0862 synchronized
0863 access to this information from the corresponding CPU.
0864 Finally, this structure records past dyntick-idle state
0865 for the corresponding CPU and also tracks statistics.
0867 </p><p>The <tt>rcu_data</tt> structure's fields are discussed,
0868 singly and in groups, in the following sections.
0870 <h5>Connection to Other Data Structures</h5>
0872 <p>This portion of the <tt>rcu_data</tt> structure is declared
0873 as follows:
0875 <pre>
0876   1   int cpu;
0877   2   struct rcu_state *rsp;
0878   3   struct rcu_node *mynode;
0879   4   struct rcu_dynticks *dynticks;
0880   5   unsigned long grpmask;
0881   6   bool beenonline;
0882 </pre>
0884 <p>The <tt>-&gt;cpu</tt> field contains the number of the
0885 corresponding CPU, the <tt>-&gt;rsp</tt> pointer references
0886 the corresponding <tt>rcu_state</tt> structure (and is most frequently
0887 used to locate the name of the corresponding flavor of RCU for tracing),
0888 and the <tt>-&gt;mynode</tt> field references the corresponding
0889 <tt>rcu_node</tt> structure.
0890 The <tt>-&gt;mynode</tt> is used to propagate quiescent states
0891 up the combining tree.
0892 <p>The <tt>-&gt;dynticks</tt> pointer references the
0893 <tt>rcu_dynticks</tt> structure corresponding to this
0894 CPU.
0895 Recall that a single per-CPU instance of the <tt>rcu_dynticks</tt>
0896 structure is shared among all flavors of RCU.
0897 These first four fields are constant and therefore require not
0898 synchronization.
0900 </p><p>The <tt>-&gt;grpmask</tt> field indicates the bit in
0901 the <tt>-&gt;mynode-&gt;qsmask</tt> corresponding to this
0902 <tt>rcu_data</tt> structure, and is also used when propagating
0903 quiescent states.
0904 The <tt>-&gt;beenonline</tt> flag is set whenever the corresponding
0905 CPU comes online, which means that the debugfs tracing need not dump
0906 out any <tt>rcu_data</tt> structure for which this flag is not set.
0908 <h5>Quiescent-State and Grace-Period Tracking</h5>
0910 <p>This portion of the <tt>rcu_data</tt> structure is declared
0911 as follows:
0913 <pre>
0914   1   unsigned long completed;
0915   2   unsigned long gpnum;
0916   3   bool cpu_no_qs;
0917   4   bool core_needs_qs;
0918   5   bool gpwrap;
0919   6   unsigned long rcu_qs_ctr_snap;
0920 </pre>
0922 <p>The <tt>completed</tt> and <tt>gpnum</tt>
0923 fields are the counterparts of the fields of the same name
0924 in the <tt>rcu_state</tt> and <tt>rcu_node</tt> structures.
0925 They may each lag up to one behind their <tt>rcu_node</tt>
0926 counterparts, but in <tt>CONFIG_NO_HZ_IDLE</tt> and
0927 <tt>CONFIG_NO_HZ_FULL</tt> kernels can lag
0928 arbitrarily far behind for CPUs in dyntick-idle mode (but these counters
0929 will catch up upon exit from dyntick-idle mode).
0930 If a given <tt>rcu_data</tt> structure's <tt>-&gt;gpnum</tt> and
0931 <tt>-&gt;complete</tt> fields are equal, then this <tt>rcu_data</tt>
0932 structure believes that RCU is idle.
0933 Otherwise, as with the <tt>rcu_state</tt> and <tt>rcu_node</tt>
0934 structure,
0935 the <tt>-&gt;gpnum</tt> field will be one greater than the
0936 <tt>-&gt;complete</tt> fields, with <tt>-&gt;gpnum</tt>
0937 indicating which grace period this <tt>rcu_data</tt> believes
0938 is still being waited for.
0940 <table>
0941 <tr><th>&nbsp;</th></tr>
0942 <tr><th align="left">Quick Quiz:</th></tr>
0943 <tr><td>
0944     All this replication of the grace period numbers can only cause
0945     massive confusion.
0946     Why not just keep a global pair of counters and be done with it???
0947 </td></tr>
0948 <tr><th align="left">Answer:</th></tr>
0949 <tr><td bgcolor="#ffffff"><font color="ffffff">
0950     Because if there was only a single global pair of grace-period
0951     numbers, there would need to be a single global lock to allow
0952     safely accessing and updating them.
0953     And if we are not going to have a single global lock, we need
0954     to carefully manage the numbers on a per-node basis.
0955     Recall from the answer to a previous Quick Quiz that the consequences
0956     of applying a previously sampled quiescent state to the wrong
0957     grace period are quite severe.
0958 </font></td></tr>
0959 <tr><td>&nbsp;</td></tr>
0960 </table>
0962 <p>The <tt>-&gt;cpu_no_qs</tt> flag indicates that the
0963 CPU has not yet passed through a quiescent state,
0964 while the <tt>-&gt;core_needs_qs</tt> flag indicates that the
0965 RCU core needs a quiescent state from the corresponding CPU.
0966 The <tt>-&gt;gpwrap</tt> field indicates that the corresponding
0967 CPU has remained idle for so long that the <tt>completed</tt>
0968 and <tt>gpnum</tt> counters are in danger of overflow, which
0969 will cause the CPU to disregard the values of its counters on
0970 its next exit from idle.
0971 Finally, the <tt>rcu_qs_ctr_snap</tt> field is used to detect
0972 cases where a given operation has resulted in a quiescent state
0973 for all flavors of RCU, for example, <tt>cond_resched_rcu_qs()</tt>.
0975 <h5>RCU Callback Handling</h5>
0977 <p>In the absence of CPU-hotplug events, RCU callbacks are invoked by
0978 the same CPU that registered them.
0979 This is strictly a cache-locality optimization: callbacks can and
0980 do get invoked on CPUs other than the one that registered them.
0981 After all, if the CPU that registered a given callback has gone
0982 offline before the callback can be invoked, there really is no other
0983 choice.
0985 </p><p>This portion of the <tt>rcu_data</tt> structure is declared
0986 as follows:
0988 <pre>
0989  1 struct rcu_head *nxtlist;
0990  2 struct rcu_head **nxttail[RCU_NEXT_SIZE];
0991  3 unsigned long nxtcompleted[RCU_NEXT_SIZE];
0992  4 long qlen_lazy;
0993  5 long qlen;
0994  6 long qlen_last_fqs_check;
0995  7 unsigned long n_force_qs_snap;
0996  8 unsigned long n_cbs_invoked;
0997  9 unsigned long n_cbs_orphaned;
0998 10 unsigned long n_cbs_adopted;
0999 11 long blimit;
1000 </pre>
1002 <p>The <tt>-&gt;nxtlist</tt> pointer and the
1003 <tt>-&gt;nxttail[]</tt> array form a four-segment list with
1004 older callbacks near the head and newer ones near the tail.
1005 Each segment contains callbacks with the corresponding relationship
1006 to the current grace period.
1007 The pointer out of the end of each of the four segments is referenced
1008 by the element of the <tt>-&gt;nxttail[]</tt> array indexed by
1009 <tt>RCU_DONE_TAIL</tt> (for callbacks handled by a prior grace period),
1010 <tt>RCU_WAIT_TAIL</tt> (for callbacks waiting on the current grace period),
1011 <tt>RCU_NEXT_READY_TAIL</tt> (for callbacks that will wait on the next
1012 grace period), and
1013 <tt>RCU_NEXT_TAIL</tt> (for callbacks that are not yet associated
1014 with a specific grace period)
1015 respectively, as shown in the following figure.
1017 </p><p><img src="nxtlist.svg" alt="nxtlist.svg" width="40%">
1019 </p><p>In this figure, the <tt>-&gt;nxtlist</tt> pointer references the
1020 first
1021 RCU callback in the list.
1022 The <tt>-&gt;nxttail[RCU_DONE_TAIL]</tt> array element references
1023 the <tt>-&gt;nxtlist</tt> pointer itself, indicating that none
1024 of the callbacks is ready to invoke.
1025 The <tt>-&gt;nxttail[RCU_WAIT_TAIL]</tt> array element references callback
1026 CB&nbsp;2's <tt>-&gt;next</tt> pointer, which indicates that
1027 CB&nbsp;1 and CB&nbsp;2 are both waiting on the current grace period.
1028 The <tt>-&gt;nxttail[RCU_NEXT_READY_TAIL]</tt> array element
1029 references the same RCU callback that <tt>-&gt;nxttail[RCU_WAIT_TAIL]</tt>
1030 does, which indicates that there are no callbacks waiting on the next
1031 RCU grace period.
1032 The <tt>-&gt;nxttail[RCU_NEXT_TAIL]</tt> array element references
1033 CB&nbsp;4's <tt>-&gt;next</tt> pointer, indicating that all the
1034 remaining RCU callbacks have not yet been assigned to an RCU grace
1035 period.
1036 Note that the <tt>-&gt;nxttail[RCU_NEXT_TAIL]</tt> array element
1037 always references the last RCU callback's <tt>-&gt;next</tt> pointer
1038 unless the callback list is empty, in which case it references
1039 the <tt>-&gt;nxtlist</tt> pointer.
1041 </p><p>CPUs advance their callbacks from the
1042 <tt>RCU_NEXT_TAIL</tt> to the <tt>RCU_NEXT_READY_TAIL</tt> to the
1043 <tt>RCU_WAIT_TAIL</tt> to the <tt>RCU_DONE_TAIL</tt> list segments
1044 as grace periods advance.
1045 The CPU advances the callbacks in its <tt>rcu_data</tt> structure
1046 whenever it notices that another RCU grace period has completed.
1047 The CPU detects the completion of an RCU grace period by noticing
1048 that the value of its <tt>rcu_data</tt> structure's
1049 <tt>-&gt;completed</tt> field differs from that of its leaf
1050 <tt>rcu_node</tt> structure.
1051 Recall that each <tt>rcu_node</tt> structure's
1052 <tt>-&gt;completed</tt> field is updated at the end of each
1053 grace period.
1055 </p><p>The <tt>-&gt;nxtcompleted[]</tt> array records grace-period
1056 numbers corresponding to the list segments.
1057 This allows CPUs that go idle for extended periods to determine
1058 which of their callbacks are ready to be invoked after reawakening.
1060 </p><p>The <tt>-&gt;qlen</tt> counter contains the number of
1061 callbacks in <tt>-&gt;nxtlist</tt>, and the
1062 <tt>-&gt;qlen_lazy</tt> contains the number of those callbacks that
1063 are known to only free memory, and whose invocation can therefore
1064 be safely deferred.
1065 The <tt>-&gt;qlen_last_fqs_check</tt> and
1066 <tt>-&gt;n_force_qs_snap</tt> coordinate the forcing of quiescent
1067 states from <tt>call_rcu()</tt> and friends when callback
1068 lists grow excessively long.
1070 </p><p>The <tt>-&gt;n_cbs_invoked</tt>,
1071 <tt>-&gt;n_cbs_orphaned</tt>, and <tt>-&gt;n_cbs_adopted</tt>
1072 fields count the number of callbacks invoked,
1073 sent to other CPUs when this CPU goes offline,
1074 and received from other CPUs when those other CPUs go offline.
1075 Finally, the <tt>-&gt;blimit</tt> counter is the maximum number of
1076 RCU callbacks that may be invoked at a given time.
1078 <h5>Dyntick-Idle Handling</h5>
1080 <p>This portion of the <tt>rcu_data</tt> structure is declared
1081 as follows:
1083 <pre>
1084   1   int dynticks_snap;
1085   2   unsigned long dynticks_fqs;
1086 </pre>
1088 The <tt>-&gt;dynticks_snap</tt> field is used to take a snapshot
1089 of the corresponding CPU's dyntick-idle state when forcing
1090 quiescent states, and is therefore accessed from other CPUs.
1091 Finally, the <tt>-&gt;dynticks_fqs</tt> field is used to
1092 count the number of times this CPU is determined to be in
1093 dyntick-idle state, and is used for tracing and debugging purposes.
1095 <h3><a name="The rcu_dynticks Structure">
1096 The <tt>rcu_dynticks</tt> Structure</a></h3>
1098 <p>The <tt>rcu_dynticks</tt> maintains the per-CPU dyntick-idle state
1099 for the corresponding CPU.
1100 Unlike the other structures, <tt>rcu_dynticks</tt> is not
1101 replicated over the different flavors of RCU.
1102 The fields in this structure may be accessed only from the corresponding
1103 CPU (and from tracing) unless otherwise stated.
1104 Its fields are as follows:
1106 <pre>
1107   1   int dynticks_nesting;
1108   2   int dynticks_nmi_nesting;
1109   3   atomic_t dynticks;
1110 </pre>
1112 <p>The <tt>-&gt;dynticks_nesting</tt> field counts the
1113 nesting depth of normal interrupts.
1114 In addition, this counter is incremented when exiting dyntick-idle
1115 mode and decremented when entering it.
1116 This counter can therefore be thought of as counting the number
1117 of reasons why this CPU cannot be permitted to enter dyntick-idle
1118 mode, aside from non-maskable interrupts (NMIs).
1119 NMIs are counted by the <tt>-&gt;dynticks_nmi_nesting</tt>
1120 field, except that NMIs that interrupt non-dyntick-idle execution
1121 are not counted.
1123 </p><p>Finally, the <tt>-&gt;dynticks</tt> field counts the corresponding
1124 CPU's transitions to and from dyntick-idle mode, so that this counter
1125 has an even value when the CPU is in dyntick-idle mode and an odd
1126 value otherwise.
1128 <table>
1129 <tr><th>&nbsp;</th></tr>
1130 <tr><th align="left">Quick Quiz:</th></tr>
1131 <tr><td>
1132     Why not just count all NMIs?
1133     Wouldn't that be simpler and less error prone?
1134 </td></tr>
1135 <tr><th align="left">Answer:</th></tr>
1136 <tr><td bgcolor="#ffffff"><font color="ffffff">
1137     It seems simpler only until you think hard about how to go about
1138     updating the <tt>rcu_dynticks</tt> structure's
1139     <tt>-&gt;dynticks</tt> field.
1140 </font></td></tr>
1141 <tr><td>&nbsp;</td></tr>
1142 </table>
1144 <p>Additional fields are present for some special-purpose
1145 builds, and are discussed separately.
1147 <h3><a name="The rcu_head Structure">
1148 The <tt>rcu_head</tt> Structure</a></h3>
1150 <p>Each <tt>rcu_head</tt> structure represents an RCU callback.
1151 These structures are normally embedded within RCU-protected data
1152 structures whose algorithms use asynchronous grace periods.
1153 In contrast, when using algorithms that block waiting for RCU grace periods,
1154 RCU users need not provide <tt>rcu_head</tt> structures.
1156 </p><p>The <tt>rcu_head</tt> structure has fields as follows:
1158 <pre>
1159   1   struct rcu_head *next;
1160   2   void (*func)(struct rcu_head *head);
1161 </pre>
1163 <p>The <tt>-&gt;next</tt> field is used
1164 to link the <tt>rcu_head</tt> structures together in the
1165 lists within the <tt>rcu_data</tt> structures.
1166 The <tt>-&gt;func</tt> field is a pointer to the function
1167 to be called when the callback is ready to be invoked, and
1168 this function is passed a pointer to the <tt>rcu_head</tt>
1169 structure.
1170 However, <tt>kfree_rcu()</tt> uses the <tt>-&gt;func</tt>
1171 field to record the offset of the <tt>rcu_head</tt>
1172 structure within the enclosing RCU-protected data structure.
1174 </p><p>Both of these fields are used internally by RCU.
1175 From the viewpoint of RCU users, this structure is an
1176 opaque &ldquo;cookie&rdquo;.
1178 <table>
1179 <tr><th>&nbsp;</th></tr>
1180 <tr><th align="left">Quick Quiz:</th></tr>
1181 <tr><td>
1182     Given that the callback function <tt>-&gt;func</tt>
1183     is passed a pointer to the <tt>rcu_head</tt> structure,
1184     how is that function supposed to find the beginning of the
1185     enclosing RCU-protected data structure?
1186 </td></tr>
1187 <tr><th align="left">Answer:</th></tr>
1188 <tr><td bgcolor="#ffffff"><font color="ffffff">
1189     In actual practice, there is a separate callback function per
1190     type of RCU-protected data structure.
1191     The callback function can therefore use the <tt>container_of()</tt>
1192     macro in the Linux kernel (or other pointer-manipulation facilities
1193     in other software environments) to find the beginning of the
1194     enclosing structure.
1195 </font></td></tr>
1196 <tr><td>&nbsp;</td></tr>
1197 </table>
1199 <h3><a name="RCU-Specific Fields in the task_struct Structure">
1200 RCU-Specific Fields in the <tt>task_struct</tt> Structure</a></h3>
1202 <p>The <tt>CONFIG_PREEMPT_RCU</tt> implementation uses some
1203 additional fields in the <tt>task_struct</tt> structure:
1205 <pre>
1206  1 #ifdef CONFIG_PREEMPT_RCU
1207  2   int rcu_read_lock_nesting;
1208  3   union rcu_special rcu_read_unlock_special;
1209  4   struct list_head rcu_node_entry;
1210  5   struct rcu_node *rcu_blocked_node;
1211  6 #endif /* #ifdef CONFIG_PREEMPT_RCU */
1212  7 #ifdef CONFIG_TASKS_RCU
1213  8   unsigned long rcu_tasks_nvcsw;
1214  9   bool rcu_tasks_holdout;
1215 10   struct list_head rcu_tasks_holdout_list;
1216 11   int rcu_tasks_idle_cpu;
1217 12 #endif /* #ifdef CONFIG_TASKS_RCU */
1218 </pre>
1220 <p>The <tt>-&gt;rcu_read_lock_nesting</tt> field records the
1221 nesting level for RCU read-side critical sections, and
1222 the <tt>-&gt;rcu_read_unlock_special</tt> field is a bitmask
1223 that records special conditions that require <tt>rcu_read_unlock()</tt>
1224 to do additional work.
1225 The <tt>-&gt;rcu_node_entry</tt> field is used to form lists of
1226 tasks that have blocked within preemptible-RCU read-side critical
1227 sections and the <tt>-&gt;rcu_blocked_node</tt> field references
1228 the <tt>rcu_node</tt> structure whose list this task is a member of,
1229 or <tt>NULL</tt> if it is not blocked within a preemptible-RCU
1230 read-side critical section.
1232 <p>The <tt>-&gt;rcu_tasks_nvcsw</tt> field tracks the number of
1233 voluntary context switches that this task had undergone at the
1234 beginning of the current tasks-RCU grace period,
1235 <tt>-&gt;rcu_tasks_holdout</tt> is set if the current tasks-RCU
1236 grace period is waiting on this task, <tt>-&gt;rcu_tasks_holdout_list</tt>
1237 is a list element enqueuing this task on the holdout list,
1238 and <tt>-&gt;rcu_tasks_idle_cpu</tt> tracks which CPU this
1239 idle task is running, but only if the task is currently running,
1240 that is, if the CPU is currently idle.
1242 <h3><a name="Accessor Functions">
1243 Accessor Functions</a></h3>
1245 <p>The following listing shows the
1246 <tt>rcu_get_root()</tt>, <tt>rcu_for_each_node_breadth_first</tt>,
1247 <tt>rcu_for_each_nonleaf_node_breadth_first()</tt>, and
1248 <tt>rcu_for_each_leaf_node()</tt> function and macros:
1250 <pre>
1251   1 static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
1252   2 {
1253   3   return &amp;rsp-&gt;node[0];
1254   4 }
1255   5
1256   6 #define rcu_for_each_node_breadth_first(rsp, rnp) \
1257   7   for ((rnp) = &amp;(rsp)-&gt;node[0]; \
1258   8        (rnp) &lt; &amp;(rsp)-&gt;node[NUM_RCU_NODES]; (rnp)++)
1259   9
1260  10 #define rcu_for_each_nonleaf_node_breadth_first(rsp, rnp) \
1261  11   for ((rnp) = &amp;(rsp)-&gt;node[0]; \
1262  12        (rnp) &lt; (rsp)-&gt;level[NUM_RCU_LVLS - 1]; (rnp)++)
1263  13
1264  14 #define rcu_for_each_leaf_node(rsp, rnp) \
1265  15   for ((rnp) = (rsp)-&gt;level[NUM_RCU_LVLS - 1]; \
1266  16        (rnp) &lt; &amp;(rsp)-&gt;node[NUM_RCU_NODES]; (rnp)++)
1267 </pre>
1269 <p>The <tt>rcu_get_root()</tt> simply returns a pointer to the
1270 first element of the specified <tt>rcu_state</tt> structure's
1271 <tt>-&gt;node[]</tt> array, which is the root <tt>rcu_node</tt>
1272 structure.
1274 </p><p>As noted earlier, the <tt>rcu_for_each_node_breadth_first()</tt>
1275 macro takes advantage of the layout of the <tt>rcu_node</tt>
1276 structures in the <tt>rcu_state</tt> structure's
1277 <tt>-&gt;node[]</tt> array, performing a breadth-first traversal by
1278 simply traversing the array in order.
1279 The <tt>rcu_for_each_nonleaf_node_breadth_first()</tt> macro operates
1280 similarly, but traverses only the first part of the array, thus excluding
1281 the leaf <tt>rcu_node</tt> structures.
1282 Finally, the <tt>rcu_for_each_leaf_node()</tt> macro traverses only
1283 the last part of the array, thus traversing only the leaf
1284 <tt>rcu_node</tt> structures.
1286 <table>
1287 <tr><th>&nbsp;</th></tr>
1288 <tr><th align="left">Quick Quiz:</th></tr>
1289 <tr><td>
1290     What do <tt>rcu_for_each_nonleaf_node_breadth_first()</tt> and
1291     <tt>rcu_for_each_leaf_node()</tt> do if the <tt>rcu_node</tt> tree
1292     contains only a single node?
1293 </td></tr>
1294 <tr><th align="left">Answer:</th></tr>
1295 <tr><td bgcolor="#ffffff"><font color="ffffff">
1296     In the single-node case,
1297     <tt>rcu_for_each_nonleaf_node_breadth_first()</tt> is a no-op
1298     and <tt>rcu_for_each_leaf_node()</tt> traverses the single node.
1299 </font></td></tr>
1300 <tr><td>&nbsp;</td></tr>
1301 </table>
1303 <h3><a name="Summary">
1304 Summary</a></h3>
1306 So each flavor of RCU is represented by an <tt>rcu_state</tt> structure,
1307 which contains a combining tree of <tt>rcu_node</tt> and
1308 <tt>rcu_data</tt> structures.
1309 Finally, in <tt>CONFIG_NO_HZ_IDLE</tt> kernels, each CPU's dyntick-idle
1310 state is tracked by an <tt>rcu_dynticks</tt> structure.
1312 If you made it this far, you are well prepared to read the code
1313 walkthroughs in the other articles in this series.
1315 <h3><a name="Acknowledgments">
1316 Acknowledgments</a></h3>
1318 I owe thanks to Cyrill Gorcunov, Mathieu Desnoyers, Dhaval Giani, Paul
1319 Turner, Abhishek Srivastava, Matt Kowalczyk, and Serge Hallyn
1320 for helping me get this document into a more human-readable state.
1322 <h3><a name="Legal Statement">
1323 Legal Statement</a></h3>
1325 <p>This work represents the view of the author and does not necessarily
1326 represent the view of IBM.
1328 </p><p>Linux is a registered trademark of Linus Torvalds.
1330 </p><p>Other company, product, and service names may be trademarks or
1331 service marks of others.
1333 </body></html>