Back to home page

OSCL-LXR

 
 

    


0001 .. SPDX-License-Identifier: GPL-2.0 OR GFDL-1.2-no-invariants-only
0002 
0003 ===========================
0004 Lockless Ring Buffer Design
0005 ===========================
0006 
0007 Copyright 2009 Red Hat Inc.
0008 
0009 :Author:   Steven Rostedt <srostedt@redhat.com>
0010 :License:  The GNU Free Documentation License, Version 1.2
0011            (dual licensed under the GPL v2)
0012 :Reviewers:  Mathieu Desnoyers, Huang Ying, Hidetoshi Seto,
0013              and Frederic Weisbecker.
0014 
0015 
0016 Written for: 2.6.31
0017 
0018 Terminology used in this Document
0019 ---------------------------------
0020 
0021 tail
0022         - where new writes happen in the ring buffer.
0023 
0024 head
0025         - where new reads happen in the ring buffer.
0026 
0027 producer
0028         - the task that writes into the ring buffer (same as writer)
0029 
0030 writer
0031         - same as producer
0032 
0033 consumer
0034         - the task that reads from the buffer (same as reader)
0035 
0036 reader
0037         - same as consumer.
0038 
0039 reader_page
0040         - A page outside the ring buffer used solely (for the most part)
0041           by the reader.
0042 
0043 head_page
0044         - a pointer to the page that the reader will use next
0045 
0046 tail_page
0047         - a pointer to the page that will be written to next
0048 
0049 commit_page
0050         - a pointer to the page with the last finished non-nested write.
0051 
0052 cmpxchg
0053         - hardware-assisted atomic transaction that performs the following::
0054 
0055             A = B if previous A == C
0056 
0057             R = cmpxchg(A, C, B) is saying that we replace A with B if and only
0058                 if current A is equal to C, and we put the old (current)
0059                 A into R
0060 
0061             R gets the previous A regardless if A is updated with B or not.
0062 
0063           To see if the update was successful a compare of ``R == C``
0064           may be used.
0065 
0066 The Generic Ring Buffer
0067 -----------------------
0068 
0069 The ring buffer can be used in either an overwrite mode or in
0070 producer/consumer mode.
0071 
0072 Producer/consumer mode is where if the producer were to fill up the
0073 buffer before the consumer could free up anything, the producer
0074 will stop writing to the buffer. This will lose most recent events.
0075 
0076 Overwrite mode is where if the producer were to fill up the buffer
0077 before the consumer could free up anything, the producer will
0078 overwrite the older data. This will lose the oldest events.
0079 
0080 No two writers can write at the same time (on the same per-cpu buffer),
0081 but a writer may interrupt another writer, but it must finish writing
0082 before the previous writer may continue. This is very important to the
0083 algorithm. The writers act like a "stack". The way interrupts works
0084 enforces this behavior::
0085 
0086 
0087   writer1 start
0088      <preempted> writer2 start
0089          <preempted> writer3 start
0090                      writer3 finishes
0091                  writer2 finishes
0092   writer1 finishes
0093 
0094 This is very much like a writer being preempted by an interrupt and
0095 the interrupt doing a write as well.
0096 
0097 Readers can happen at any time. But no two readers may run at the
0098 same time, nor can a reader preempt/interrupt another reader. A reader
0099 cannot preempt/interrupt a writer, but it may read/consume from the
0100 buffer at the same time as a writer is writing, but the reader must be
0101 on another processor to do so. A reader may read on its own processor
0102 and can be preempted by a writer.
0103 
0104 A writer can preempt a reader, but a reader cannot preempt a writer.
0105 But a reader can read the buffer at the same time (on another processor)
0106 as a writer.
0107 
0108 The ring buffer is made up of a list of pages held together by a linked list.
0109 
0110 At initialization a reader page is allocated for the reader that is not
0111 part of the ring buffer.
0112 
0113 The head_page, tail_page and commit_page are all initialized to point
0114 to the same page.
0115 
0116 The reader page is initialized to have its next pointer pointing to
0117 the head page, and its previous pointer pointing to a page before
0118 the head page.
0119 
0120 The reader has its own page to use. At start up time, this page is
0121 allocated but is not attached to the list. When the reader wants
0122 to read from the buffer, if its page is empty (like it is on start-up),
0123 it will swap its page with the head_page. The old reader page will
0124 become part of the ring buffer and the head_page will be removed.
0125 The page after the inserted page (old reader_page) will become the
0126 new head page.
0127 
0128 Once the new page is given to the reader, the reader could do what
0129 it wants with it, as long as a writer has left that page.
0130 
0131 A sample of how the reader page is swapped: Note this does not
0132 show the head page in the buffer, it is for demonstrating a swap
0133 only.
0134 
0135 ::
0136 
0137   +------+
0138   |reader|          RING BUFFER
0139   |page  |
0140   +------+
0141                   +---+   +---+   +---+
0142                   |   |-->|   |-->|   |
0143                   |   |<--|   |<--|   |
0144                   +---+   +---+   +---+
0145                    ^ |             ^ |
0146                    | +-------------+ |
0147                    +-----------------+
0148 
0149 
0150   +------+
0151   |reader|          RING BUFFER
0152   |page  |-------------------+
0153   +------+                   v
0154     |             +---+   +---+   +---+
0155     |             |   |-->|   |-->|   |
0156     |             |   |<--|   |<--|   |<-+
0157     |             +---+   +---+   +---+  |
0158     |              ^ |             ^ |   |
0159     |              | +-------------+ |   |
0160     |              +-----------------+   |
0161     +------------------------------------+
0162 
0163   +------+
0164   |reader|          RING BUFFER
0165   |page  |-------------------+
0166   +------+ <---------------+ v
0167     |  ^          +---+   +---+   +---+
0168     |  |          |   |-->|   |-->|   |
0169     |  |          |   |   |   |<--|   |<-+
0170     |  |          +---+   +---+   +---+  |
0171     |  |             |             ^ |   |
0172     |  |             +-------------+ |   |
0173     |  +-----------------------------+   |
0174     +------------------------------------+
0175 
0176   +------+
0177   |buffer|          RING BUFFER
0178   |page  |-------------------+
0179   +------+ <---------------+ v
0180     |  ^          +---+   +---+   +---+
0181     |  |          |   |   |   |-->|   |
0182     |  |  New     |   |   |   |<--|   |<-+
0183     |  | Reader   +---+   +---+   +---+  |
0184     |  |  page ----^                 |   |
0185     |  |                             |   |
0186     |  +-----------------------------+   |
0187     +------------------------------------+
0188 
0189 
0190 
0191 It is possible that the page swapped is the commit page and the tail page,
0192 if what is in the ring buffer is less than what is held in a buffer page.
0193 
0194 ::
0195 
0196             reader page    commit page   tail page
0197                 |              |             |
0198                 v              |             |
0199                +---+           |             |
0200                |   |<----------+             |
0201                |   |<------------------------+
0202                |   |------+
0203                +---+      |
0204                           |
0205                           v
0206       +---+    +---+    +---+    +---+
0207   <---|   |--->|   |--->|   |--->|   |--->
0208   --->|   |<---|   |<---|   |<---|   |<---
0209       +---+    +---+    +---+    +---+
0210 
0211 This case is still valid for this algorithm.
0212 When the writer leaves the page, it simply goes into the ring buffer
0213 since the reader page still points to the next location in the ring
0214 buffer.
0215 
0216 
0217 The main pointers:
0218 
0219   reader page
0220             - The page used solely by the reader and is not part
0221               of the ring buffer (may be swapped in)
0222 
0223   head page
0224             - the next page in the ring buffer that will be swapped
0225               with the reader page.
0226 
0227   tail page
0228             - the page where the next write will take place.
0229 
0230   commit page
0231             - the page that last finished a write.
0232 
0233 The commit page only is updated by the outermost writer in the
0234 writer stack. A writer that preempts another writer will not move the
0235 commit page.
0236 
0237 When data is written into the ring buffer, a position is reserved
0238 in the ring buffer and passed back to the writer. When the writer
0239 is finished writing data into that position, it commits the write.
0240 
0241 Another write (or a read) may take place at anytime during this
0242 transaction. If another write happens it must finish before continuing
0243 with the previous write.
0244 
0245 
0246    Write reserve::
0247 
0248        Buffer page
0249       +---------+
0250       |written  |
0251       +---------+  <--- given back to writer (current commit)
0252       |reserved |
0253       +---------+ <--- tail pointer
0254       | empty   |
0255       +---------+
0256 
0257    Write commit::
0258 
0259        Buffer page
0260       +---------+
0261       |written  |
0262       +---------+
0263       |written  |
0264       +---------+  <--- next position for write (current commit)
0265       | empty   |
0266       +---------+
0267 
0268 
0269  If a write happens after the first reserve::
0270 
0271        Buffer page
0272       +---------+
0273       |written  |
0274       +---------+  <-- current commit
0275       |reserved |
0276       +---------+  <--- given back to second writer
0277       |reserved |
0278       +---------+ <--- tail pointer
0279 
0280   After second writer commits::
0281 
0282 
0283        Buffer page
0284       +---------+
0285       |written  |
0286       +---------+  <--(last full commit)
0287       |reserved |
0288       +---------+
0289       |pending  |
0290       |commit   |
0291       +---------+ <--- tail pointer
0292 
0293   When the first writer commits::
0294 
0295        Buffer page
0296       +---------+
0297       |written  |
0298       +---------+
0299       |written  |
0300       +---------+
0301       |written  |
0302       +---------+  <--(last full commit and tail pointer)
0303 
0304 
0305 The commit pointer points to the last write location that was
0306 committed without preempting another write. When a write that
0307 preempted another write is committed, it only becomes a pending commit
0308 and will not be a full commit until all writes have been committed.
0309 
0310 The commit page points to the page that has the last full commit.
0311 The tail page points to the page with the last write (before
0312 committing).
0313 
0314 The tail page is always equal to or after the commit page. It may
0315 be several pages ahead. If the tail page catches up to the commit
0316 page then no more writes may take place (regardless of the mode
0317 of the ring buffer: overwrite and produce/consumer).
0318 
0319 The order of pages is::
0320 
0321  head page
0322  commit page
0323  tail page
0324 
0325 Possible scenario::
0326 
0327                                tail page
0328     head page         commit page  |
0329         |                 |        |
0330         v                 v        v
0331       +---+    +---+    +---+    +---+
0332   <---|   |--->|   |--->|   |--->|   |--->
0333   --->|   |<---|   |<---|   |<---|   |<---
0334       +---+    +---+    +---+    +---+
0335 
0336 There is a special case that the head page is after either the commit page
0337 and possibly the tail page. That is when the commit (and tail) page has been
0338 swapped with the reader page. This is because the head page is always
0339 part of the ring buffer, but the reader page is not. Whenever there
0340 has been less than a full page that has been committed inside the ring buffer,
0341 and a reader swaps out a page, it will be swapping out the commit page.
0342 
0343 ::
0344 
0345             reader page    commit page   tail page
0346                 |              |             |
0347                 v              |             |
0348                +---+           |             |
0349                |   |<----------+             |
0350                |   |<------------------------+
0351                |   |------+
0352                +---+      |
0353                           |
0354                           v
0355       +---+    +---+    +---+    +---+
0356   <---|   |--->|   |--->|   |--->|   |--->
0357   --->|   |<---|   |<---|   |<---|   |<---
0358       +---+    +---+    +---+    +---+
0359                           ^
0360                           |
0361                       head page
0362 
0363 
0364 In this case, the head page will not move when the tail and commit
0365 move back into the ring buffer.
0366 
0367 The reader cannot swap a page into the ring buffer if the commit page
0368 is still on that page. If the read meets the last commit (real commit
0369 not pending or reserved), then there is nothing more to read.
0370 The buffer is considered empty until another full commit finishes.
0371 
0372 When the tail meets the head page, if the buffer is in overwrite mode,
0373 the head page will be pushed ahead one. If the buffer is in producer/consumer
0374 mode, the write will fail.
0375 
0376 Overwrite mode::
0377 
0378               tail page
0379                  |
0380                  v
0381       +---+    +---+    +---+    +---+
0382   <---|   |--->|   |--->|   |--->|   |--->
0383   --->|   |<---|   |<---|   |<---|   |<---
0384       +---+    +---+    +---+    +---+
0385                           ^
0386                           |
0387                       head page
0388 
0389 
0390               tail page
0391                  |
0392                  v
0393       +---+    +---+    +---+    +---+
0394   <---|   |--->|   |--->|   |--->|   |--->
0395   --->|   |<---|   |<---|   |<---|   |<---
0396       +---+    +---+    +---+    +---+
0397                                    ^
0398                                    |
0399                                head page
0400 
0401 
0402                       tail page
0403                           |
0404                           v
0405       +---+    +---+    +---+    +---+
0406   <---|   |--->|   |--->|   |--->|   |--->
0407   --->|   |<---|   |<---|   |<---|   |<---
0408       +---+    +---+    +---+    +---+
0409                                    ^
0410                                    |
0411                                head page
0412 
0413 Note, the reader page will still point to the previous head page.
0414 But when a swap takes place, it will use the most recent head page.
0415 
0416 
0417 Making the Ring Buffer Lockless:
0418 --------------------------------
0419 
0420 The main idea behind the lockless algorithm is to combine the moving
0421 of the head_page pointer with the swapping of pages with the reader.
0422 State flags are placed inside the pointer to the page. To do this,
0423 each page must be aligned in memory by 4 bytes. This will allow the 2
0424 least significant bits of the address to be used as flags, since
0425 they will always be zero for the address. To get the address,
0426 simply mask out the flags::
0427 
0428   MASK = ~3
0429 
0430   address & MASK
0431 
0432 Two flags will be kept by these two bits:
0433 
0434    HEADER
0435         - the page being pointed to is a head page
0436 
0437    UPDATE
0438         - the page being pointed to is being updated by a writer
0439           and was or is about to be a head page.
0440 
0441 ::
0442 
0443               reader page
0444                   |
0445                   v
0446                 +---+
0447                 |   |------+
0448                 +---+      |
0449                             |
0450                             v
0451         +---+    +---+    +---+    +---+
0452     <---|   |--->|   |-H->|   |--->|   |--->
0453     --->|   |<---|   |<---|   |<---|   |<---
0454         +---+    +---+    +---+    +---+
0455 
0456 
0457 The above pointer "-H->" would have the HEADER flag set. That is
0458 the next page is the next page to be swapped out by the reader.
0459 This pointer means the next page is the head page.
0460 
0461 When the tail page meets the head pointer, it will use cmpxchg to
0462 change the pointer to the UPDATE state::
0463 
0464 
0465               tail page
0466                  |
0467                  v
0468       +---+    +---+    +---+    +---+
0469   <---|   |--->|   |-H->|   |--->|   |--->
0470   --->|   |<---|   |<---|   |<---|   |<---
0471       +---+    +---+    +---+    +---+
0472 
0473               tail page
0474                  |
0475                  v
0476       +---+    +---+    +---+    +---+
0477   <---|   |--->|   |-U->|   |--->|   |--->
0478   --->|   |<---|   |<---|   |<---|   |<---
0479       +---+    +---+    +---+    +---+
0480 
0481 "-U->" represents a pointer in the UPDATE state.
0482 
0483 Any access to the reader will need to take some sort of lock to serialize
0484 the readers. But the writers will never take a lock to write to the
0485 ring buffer. This means we only need to worry about a single reader,
0486 and writes only preempt in "stack" formation.
0487 
0488 When the reader tries to swap the page with the ring buffer, it
0489 will also use cmpxchg. If the flag bit in the pointer to the
0490 head page does not have the HEADER flag set, the compare will fail
0491 and the reader will need to look for the new head page and try again.
0492 Note, the flags UPDATE and HEADER are never set at the same time.
0493 
0494 The reader swaps the reader page as follows::
0495 
0496   +------+
0497   |reader|          RING BUFFER
0498   |page  |
0499   +------+
0500                   +---+    +---+    +---+
0501                   |   |--->|   |--->|   |
0502                   |   |<---|   |<---|   |
0503                   +---+    +---+    +---+
0504                    ^ |               ^ |
0505                    | +---------------+ |
0506                    +-----H-------------+
0507 
0508 The reader sets the reader page next pointer as HEADER to the page after
0509 the head page::
0510 
0511 
0512   +------+
0513   |reader|          RING BUFFER
0514   |page  |-------H-----------+
0515   +------+                   v
0516     |             +---+    +---+    +---+
0517     |             |   |--->|   |--->|   |
0518     |             |   |<---|   |<---|   |<-+
0519     |             +---+    +---+    +---+  |
0520     |              ^ |               ^ |   |
0521     |              | +---------------+ |   |
0522     |              +-----H-------------+   |
0523     +--------------------------------------+
0524 
0525 It does a cmpxchg with the pointer to the previous head page to make it
0526 point to the reader page. Note that the new pointer does not have the HEADER
0527 flag set.  This action atomically moves the head page forward::
0528 
0529   +------+
0530   |reader|          RING BUFFER
0531   |page  |-------H-----------+
0532   +------+                   v
0533     |  ^          +---+   +---+   +---+
0534     |  |          |   |-->|   |-->|   |
0535     |  |          |   |<--|   |<--|   |<-+
0536     |  |          +---+   +---+   +---+  |
0537     |  |             |             ^ |   |
0538     |  |             +-------------+ |   |
0539     |  +-----------------------------+   |
0540     +------------------------------------+
0541 
0542 After the new head page is set, the previous pointer of the head page is
0543 updated to the reader page::
0544 
0545   +------+
0546   |reader|          RING BUFFER
0547   |page  |-------H-----------+
0548   +------+ <---------------+ v
0549     |  ^          +---+   +---+   +---+
0550     |  |          |   |-->|   |-->|   |
0551     |  |          |   |   |   |<--|   |<-+
0552     |  |          +---+   +---+   +---+  |
0553     |  |             |             ^ |   |
0554     |  |             +-------------+ |   |
0555     |  +-----------------------------+   |
0556     +------------------------------------+
0557 
0558   +------+
0559   |buffer|          RING BUFFER
0560   |page  |-------H-----------+  <--- New head page
0561   +------+ <---------------+ v
0562     |  ^          +---+   +---+   +---+
0563     |  |          |   |   |   |-->|   |
0564     |  |  New     |   |   |   |<--|   |<-+
0565     |  | Reader   +---+   +---+   +---+  |
0566     |  |  page ----^                 |   |
0567     |  |                             |   |
0568     |  +-----------------------------+   |
0569     +------------------------------------+
0570 
0571 Another important point: The page that the reader page points back to
0572 by its previous pointer (the one that now points to the new head page)
0573 never points back to the reader page. That is because the reader page is
0574 not part of the ring buffer. Traversing the ring buffer via the next pointers
0575 will always stay in the ring buffer. Traversing the ring buffer via the
0576 prev pointers may not.
0577 
0578 Note, the way to determine a reader page is simply by examining the previous
0579 pointer of the page. If the next pointer of the previous page does not
0580 point back to the original page, then the original page is a reader page::
0581 
0582 
0583              +--------+
0584              | reader |  next   +----+
0585              |  page  |-------->|    |<====== (buffer page)
0586              +--------+         +----+
0587                  |                | ^
0588                  |                v | next
0589             prev |              +----+
0590                  +------------->|    |
0591                                 +----+
0592 
0593 The way the head page moves forward:
0594 
0595 When the tail page meets the head page and the buffer is in overwrite mode
0596 and more writes take place, the head page must be moved forward before the
0597 writer may move the tail page. The way this is done is that the writer
0598 performs a cmpxchg to convert the pointer to the head page from the HEADER
0599 flag to have the UPDATE flag set. Once this is done, the reader will
0600 not be able to swap the head page from the buffer, nor will it be able to
0601 move the head page, until the writer is finished with the move.
0602 
0603 This eliminates any races that the reader can have on the writer. The reader
0604 must spin, and this is why the reader cannot preempt the writer::
0605 
0606               tail page
0607                  |
0608                  v
0609       +---+    +---+    +---+    +---+
0610   <---|   |--->|   |-H->|   |--->|   |--->
0611   --->|   |<---|   |<---|   |<---|   |<---
0612       +---+    +---+    +---+    +---+
0613 
0614               tail page
0615                  |
0616                  v
0617       +---+    +---+    +---+    +---+
0618   <---|   |--->|   |-U->|   |--->|   |--->
0619   --->|   |<---|   |<---|   |<---|   |<---
0620       +---+    +---+    +---+    +---+
0621 
0622 The following page will be made into the new head page::
0623 
0624              tail page
0625                  |
0626                  v
0627       +---+    +---+    +---+    +---+
0628   <---|   |--->|   |-U->|   |-H->|   |--->
0629   --->|   |<---|   |<---|   |<---|   |<---
0630       +---+    +---+    +---+    +---+
0631 
0632 After the new head page has been set, we can set the old head page
0633 pointer back to NORMAL::
0634 
0635              tail page
0636                  |
0637                  v
0638       +---+    +---+    +---+    +---+
0639   <---|   |--->|   |--->|   |-H->|   |--->
0640   --->|   |<---|   |<---|   |<---|   |<---
0641       +---+    +---+    +---+    +---+
0642 
0643 After the head page has been moved, the tail page may now move forward::
0644 
0645                       tail page
0646                           |
0647                           v
0648       +---+    +---+    +---+    +---+
0649   <---|   |--->|   |--->|   |-H->|   |--->
0650   --->|   |<---|   |<---|   |<---|   |<---
0651       +---+    +---+    +---+    +---+
0652 
0653 
0654 The above are the trivial updates. Now for the more complex scenarios.
0655 
0656 
0657 As stated before, if enough writes preempt the first write, the
0658 tail page may make it all the way around the buffer and meet the commit
0659 page. At this time, we must start dropping writes (usually with some kind
0660 of warning to the user). But what happens if the commit was still on the
0661 reader page? The commit page is not part of the ring buffer. The tail page
0662 must account for this::
0663 
0664 
0665             reader page    commit page
0666                 |              |
0667                 v              |
0668                +---+           |
0669                |   |<----------+
0670                |   |
0671                |   |------+
0672                +---+      |
0673                           |
0674                           v
0675       +---+    +---+    +---+    +---+
0676   <---|   |--->|   |-H->|   |--->|   |--->
0677   --->|   |<---|   |<---|   |<---|   |<---
0678       +---+    +---+    +---+    +---+
0679                  ^
0680                  |
0681              tail page
0682 
0683 If the tail page were to simply push the head page forward, the commit when
0684 leaving the reader page would not be pointing to the correct page.
0685 
0686 The solution to this is to test if the commit page is on the reader page
0687 before pushing the head page. If it is, then it can be assumed that the
0688 tail page wrapped the buffer, and we must drop new writes.
0689 
0690 This is not a race condition, because the commit page can only be moved
0691 by the outermost writer (the writer that was preempted).
0692 This means that the commit will not move while a writer is moving the
0693 tail page. The reader cannot swap the reader page if it is also being
0694 used as the commit page. The reader can simply check that the commit
0695 is off the reader page. Once the commit page leaves the reader page
0696 it will never go back on it unless a reader does another swap with the
0697 buffer page that is also the commit page.
0698 
0699 
0700 Nested writes
0701 -------------
0702 
0703 In the pushing forward of the tail page we must first push forward
0704 the head page if the head page is the next page. If the head page
0705 is not the next page, the tail page is simply updated with a cmpxchg.
0706 
0707 Only writers move the tail page. This must be done atomically to protect
0708 against nested writers::
0709 
0710   temp_page = tail_page
0711   next_page = temp_page->next
0712   cmpxchg(tail_page, temp_page, next_page)
0713 
0714 The above will update the tail page if it is still pointing to the expected
0715 page. If this fails, a nested write pushed it forward, the current write
0716 does not need to push it::
0717 
0718 
0719              temp page
0720                  |
0721                  v
0722               tail page
0723                  |
0724                  v
0725       +---+    +---+    +---+    +---+
0726   <---|   |--->|   |--->|   |--->|   |--->
0727   --->|   |<---|   |<---|   |<---|   |<---
0728       +---+    +---+    +---+    +---+
0729 
0730 Nested write comes in and moves the tail page forward::
0731 
0732                       tail page (moved by nested writer)
0733               temp page   |
0734                  |        |
0735                  v        v
0736       +---+    +---+    +---+    +---+
0737   <---|   |--->|   |--->|   |--->|   |--->
0738   --->|   |<---|   |<---|   |<---|   |<---
0739       +---+    +---+    +---+    +---+
0740 
0741 The above would fail the cmpxchg, but since the tail page has already
0742 been moved forward, the writer will just try again to reserve storage
0743 on the new tail page.
0744 
0745 But the moving of the head page is a bit more complex::
0746 
0747               tail page
0748                  |
0749                  v
0750       +---+    +---+    +---+    +---+
0751   <---|   |--->|   |-H->|   |--->|   |--->
0752   --->|   |<---|   |<---|   |<---|   |<---
0753       +---+    +---+    +---+    +---+
0754 
0755 The write converts the head page pointer to UPDATE::
0756 
0757               tail page
0758                  |
0759                  v
0760       +---+    +---+    +---+    +---+
0761   <---|   |--->|   |-U->|   |--->|   |--->
0762   --->|   |<---|   |<---|   |<---|   |<---
0763       +---+    +---+    +---+    +---+
0764 
0765 But if a nested writer preempts here, it will see that the next
0766 page is a head page, but it is also nested. It will detect that
0767 it is nested and will save that information. The detection is the
0768 fact that it sees the UPDATE flag instead of a HEADER or NORMAL
0769 pointer.
0770 
0771 The nested writer will set the new head page pointer::
0772 
0773              tail page
0774                  |
0775                  v
0776       +---+    +---+    +---+    +---+
0777   <---|   |--->|   |-U->|   |-H->|   |--->
0778   --->|   |<---|   |<---|   |<---|   |<---
0779       +---+    +---+    +---+    +---+
0780 
0781 But it will not reset the update back to normal. Only the writer
0782 that converted a pointer from HEAD to UPDATE will convert it back
0783 to NORMAL::
0784 
0785                       tail page
0786                           |
0787                           v
0788       +---+    +---+    +---+    +---+
0789   <---|   |--->|   |-U->|   |-H->|   |--->
0790   --->|   |<---|   |<---|   |<---|   |<---
0791       +---+    +---+    +---+    +---+
0792 
0793 After the nested writer finishes, the outermost writer will convert
0794 the UPDATE pointer to NORMAL::
0795 
0796 
0797                       tail page
0798                           |
0799                           v
0800       +---+    +---+    +---+    +---+
0801   <---|   |--->|   |--->|   |-H->|   |--->
0802   --->|   |<---|   |<---|   |<---|   |<---
0803       +---+    +---+    +---+    +---+
0804 
0805 
0806 It can be even more complex if several nested writes came in and moved
0807 the tail page ahead several pages::
0808 
0809 
0810   (first writer)
0811 
0812               tail page
0813                  |
0814                  v
0815       +---+    +---+    +---+    +---+
0816   <---|   |--->|   |-H->|   |--->|   |--->
0817   --->|   |<---|   |<---|   |<---|   |<---
0818       +---+    +---+    +---+    +---+
0819 
0820 The write converts the head page pointer to UPDATE::
0821 
0822               tail page
0823                  |
0824                  v
0825       +---+    +---+    +---+    +---+
0826   <---|   |--->|   |-U->|   |--->|   |--->
0827   --->|   |<---|   |<---|   |<---|   |<---
0828       +---+    +---+    +---+    +---+
0829 
0830 Next writer comes in, and sees the update and sets up the new
0831 head page::
0832 
0833   (second writer)
0834 
0835              tail page
0836                  |
0837                  v
0838       +---+    +---+    +---+    +---+
0839   <---|   |--->|   |-U->|   |-H->|   |--->
0840   --->|   |<---|   |<---|   |<---|   |<---
0841       +---+    +---+    +---+    +---+
0842 
0843 The nested writer moves the tail page forward. But does not set the old
0844 update page to NORMAL because it is not the outermost writer::
0845 
0846                       tail page
0847                           |
0848                           v
0849       +---+    +---+    +---+    +---+
0850   <---|   |--->|   |-U->|   |-H->|   |--->
0851   --->|   |<---|   |<---|   |<---|   |<---
0852       +---+    +---+    +---+    +---+
0853 
0854 Another writer preempts and sees the page after the tail page is a head page.
0855 It changes it from HEAD to UPDATE::
0856 
0857   (third writer)
0858 
0859                       tail page
0860                           |
0861                           v
0862       +---+    +---+    +---+    +---+
0863   <---|   |--->|   |-U->|   |-U->|   |--->
0864   --->|   |<---|   |<---|   |<---|   |<---
0865       +---+    +---+    +---+    +---+
0866 
0867 The writer will move the head page forward::
0868 
0869 
0870   (third writer)
0871 
0872                       tail page
0873                           |
0874                           v
0875       +---+    +---+    +---+    +---+
0876   <---|   |--->|   |-U->|   |-U->|   |-H->
0877   --->|   |<---|   |<---|   |<---|   |<---
0878       +---+    +---+    +---+    +---+
0879 
0880 But now that the third writer did change the HEAD flag to UPDATE it
0881 will convert it to normal::
0882 
0883 
0884   (third writer)
0885 
0886                       tail page
0887                           |
0888                           v
0889       +---+    +---+    +---+    +---+
0890   <---|   |--->|   |-U->|   |--->|   |-H->
0891   --->|   |<---|   |<---|   |<---|   |<---
0892       +---+    +---+    +---+    +---+
0893 
0894 
0895 Then it will move the tail page, and return back to the second writer::
0896 
0897 
0898   (second writer)
0899 
0900                                tail page
0901                                    |
0902                                    v
0903       +---+    +---+    +---+    +---+
0904   <---|   |--->|   |-U->|   |--->|   |-H->
0905   --->|   |<---|   |<---|   |<---|   |<---
0906       +---+    +---+    +---+    +---+
0907 
0908 
0909 The second writer will fail to move the tail page because it was already
0910 moved, so it will try again and add its data to the new tail page.
0911 It will return to the first writer::
0912 
0913 
0914   (first writer)
0915 
0916                                tail page
0917                                    |
0918                                    v
0919       +---+    +---+    +---+    +---+
0920   <---|   |--->|   |-U->|   |--->|   |-H->
0921   --->|   |<---|   |<---|   |<---|   |<---
0922       +---+    +---+    +---+    +---+
0923 
0924 The first writer cannot know atomically if the tail page moved
0925 while it updates the HEAD page. It will then update the head page to
0926 what it thinks is the new head page::
0927 
0928 
0929   (first writer)
0930 
0931                                tail page
0932                                    |
0933                                    v
0934       +---+    +---+    +---+    +---+
0935   <---|   |--->|   |-U->|   |-H->|   |-H->
0936   --->|   |<---|   |<---|   |<---|   |<---
0937       +---+    +---+    +---+    +---+
0938 
0939 Since the cmpxchg returns the old value of the pointer the first writer
0940 will see it succeeded in updating the pointer from NORMAL to HEAD.
0941 But as we can see, this is not good enough. It must also check to see
0942 if the tail page is either where it use to be or on the next page::
0943 
0944 
0945   (first writer)
0946 
0947                  A        B    tail page
0948                  |        |        |
0949                  v        v        v
0950       +---+    +---+    +---+    +---+
0951   <---|   |--->|   |-U->|   |-H->|   |-H->
0952   --->|   |<---|   |<---|   |<---|   |<---
0953       +---+    +---+    +---+    +---+
0954 
0955 If tail page != A and tail page != B, then it must reset the pointer
0956 back to NORMAL. The fact that it only needs to worry about nested
0957 writers means that it only needs to check this after setting the HEAD page::
0958 
0959 
0960   (first writer)
0961 
0962                  A        B    tail page
0963                  |        |        |
0964                  v        v        v
0965       +---+    +---+    +---+    +---+
0966   <---|   |--->|   |-U->|   |--->|   |-H->
0967   --->|   |<---|   |<---|   |<---|   |<---
0968       +---+    +---+    +---+    +---+
0969 
0970 Now the writer can update the head page. This is also why the head page must
0971 remain in UPDATE and only reset by the outermost writer. This prevents
0972 the reader from seeing the incorrect head page::
0973 
0974 
0975   (first writer)
0976 
0977                  A        B    tail page
0978                  |        |        |
0979                  v        v        v
0980       +---+    +---+    +---+    +---+
0981   <---|   |--->|   |--->|   |--->|   |-H->
0982   --->|   |<---|   |<---|   |<---|   |<---
0983       +---+    +---+    +---+    +---+