0001 .. SPDX-License-Identifier: GPL-2.0 OR GFDL-1.2-no-invariants-only
0002
0003 ===========================
0004 Lockless Ring Buffer Design
0005 ===========================
0006
0007 Copyright 2009 Red Hat Inc.
0008
0009 :Author: Steven Rostedt <srostedt@redhat.com>
0010 :License: The GNU Free Documentation License, Version 1.2
0011 (dual licensed under the GPL v2)
0012 :Reviewers: Mathieu Desnoyers, Huang Ying, Hidetoshi Seto,
0013 and Frederic Weisbecker.
0014
0015
0016 Written for: 2.6.31
0017
0018 Terminology used in this Document
0019 ---------------------------------
0020
0021 tail
0022 - where new writes happen in the ring buffer.
0023
0024 head
0025 - where new reads happen in the ring buffer.
0026
0027 producer
0028 - the task that writes into the ring buffer (same as writer)
0029
0030 writer
0031 - same as producer
0032
0033 consumer
0034 - the task that reads from the buffer (same as reader)
0035
0036 reader
0037 - same as consumer.
0038
0039 reader_page
0040 - A page outside the ring buffer used solely (for the most part)
0041 by the reader.
0042
0043 head_page
0044 - a pointer to the page that the reader will use next
0045
0046 tail_page
0047 - a pointer to the page that will be written to next
0048
0049 commit_page
0050 - a pointer to the page with the last finished non-nested write.
0051
0052 cmpxchg
0053 - hardware-assisted atomic transaction that performs the following::
0054
0055 A = B if previous A == C
0056
0057 R = cmpxchg(A, C, B) is saying that we replace A with B if and only
0058 if current A is equal to C, and we put the old (current)
0059 A into R
0060
0061 R gets the previous A regardless if A is updated with B or not.
0062
0063 To see if the update was successful a compare of ``R == C``
0064 may be used.
0065
0066 The Generic Ring Buffer
0067 -----------------------
0068
0069 The ring buffer can be used in either an overwrite mode or in
0070 producer/consumer mode.
0071
0072 Producer/consumer mode is where if the producer were to fill up the
0073 buffer before the consumer could free up anything, the producer
0074 will stop writing to the buffer. This will lose most recent events.
0075
0076 Overwrite mode is where if the producer were to fill up the buffer
0077 before the consumer could free up anything, the producer will
0078 overwrite the older data. This will lose the oldest events.
0079
0080 No two writers can write at the same time (on the same per-cpu buffer),
0081 but a writer may interrupt another writer, but it must finish writing
0082 before the previous writer may continue. This is very important to the
0083 algorithm. The writers act like a "stack". The way interrupts works
0084 enforces this behavior::
0085
0086
0087 writer1 start
0088 <preempted> writer2 start
0089 <preempted> writer3 start
0090 writer3 finishes
0091 writer2 finishes
0092 writer1 finishes
0093
0094 This is very much like a writer being preempted by an interrupt and
0095 the interrupt doing a write as well.
0096
0097 Readers can happen at any time. But no two readers may run at the
0098 same time, nor can a reader preempt/interrupt another reader. A reader
0099 cannot preempt/interrupt a writer, but it may read/consume from the
0100 buffer at the same time as a writer is writing, but the reader must be
0101 on another processor to do so. A reader may read on its own processor
0102 and can be preempted by a writer.
0103
0104 A writer can preempt a reader, but a reader cannot preempt a writer.
0105 But a reader can read the buffer at the same time (on another processor)
0106 as a writer.
0107
0108 The ring buffer is made up of a list of pages held together by a linked list.
0109
0110 At initialization a reader page is allocated for the reader that is not
0111 part of the ring buffer.
0112
0113 The head_page, tail_page and commit_page are all initialized to point
0114 to the same page.
0115
0116 The reader page is initialized to have its next pointer pointing to
0117 the head page, and its previous pointer pointing to a page before
0118 the head page.
0119
0120 The reader has its own page to use. At start up time, this page is
0121 allocated but is not attached to the list. When the reader wants
0122 to read from the buffer, if its page is empty (like it is on start-up),
0123 it will swap its page with the head_page. The old reader page will
0124 become part of the ring buffer and the head_page will be removed.
0125 The page after the inserted page (old reader_page) will become the
0126 new head page.
0127
0128 Once the new page is given to the reader, the reader could do what
0129 it wants with it, as long as a writer has left that page.
0130
0131 A sample of how the reader page is swapped: Note this does not
0132 show the head page in the buffer, it is for demonstrating a swap
0133 only.
0134
0135 ::
0136
0137 +------+
0138 |reader| RING BUFFER
0139 |page |
0140 +------+
0141 +---+ +---+ +---+
0142 | |-->| |-->| |
0143 | |<--| |<--| |
0144 +---+ +---+ +---+
0145 ^ | ^ |
0146 | +-------------+ |
0147 +-----------------+
0148
0149
0150 +------+
0151 |reader| RING BUFFER
0152 |page |-------------------+
0153 +------+ v
0154 | +---+ +---+ +---+
0155 | | |-->| |-->| |
0156 | | |<--| |<--| |<-+
0157 | +---+ +---+ +---+ |
0158 | ^ | ^ | |
0159 | | +-------------+ | |
0160 | +-----------------+ |
0161 +------------------------------------+
0162
0163 +------+
0164 |reader| RING BUFFER
0165 |page |-------------------+
0166 +------+ <---------------+ v
0167 | ^ +---+ +---+ +---+
0168 | | | |-->| |-->| |
0169 | | | | | |<--| |<-+
0170 | | +---+ +---+ +---+ |
0171 | | | ^ | |
0172 | | +-------------+ | |
0173 | +-----------------------------+ |
0174 +------------------------------------+
0175
0176 +------+
0177 |buffer| RING BUFFER
0178 |page |-------------------+
0179 +------+ <---------------+ v
0180 | ^ +---+ +---+ +---+
0181 | | | | | |-->| |
0182 | | New | | | |<--| |<-+
0183 | | Reader +---+ +---+ +---+ |
0184 | | page ----^ | |
0185 | | | |
0186 | +-----------------------------+ |
0187 +------------------------------------+
0188
0189
0190
0191 It is possible that the page swapped is the commit page and the tail page,
0192 if what is in the ring buffer is less than what is held in a buffer page.
0193
0194 ::
0195
0196 reader page commit page tail page
0197 | | |
0198 v | |
0199 +---+ | |
0200 | |<----------+ |
0201 | |<------------------------+
0202 | |------+
0203 +---+ |
0204 |
0205 v
0206 +---+ +---+ +---+ +---+
0207 <---| |--->| |--->| |--->| |--->
0208 --->| |<---| |<---| |<---| |<---
0209 +---+ +---+ +---+ +---+
0210
0211 This case is still valid for this algorithm.
0212 When the writer leaves the page, it simply goes into the ring buffer
0213 since the reader page still points to the next location in the ring
0214 buffer.
0215
0216
0217 The main pointers:
0218
0219 reader page
0220 - The page used solely by the reader and is not part
0221 of the ring buffer (may be swapped in)
0222
0223 head page
0224 - the next page in the ring buffer that will be swapped
0225 with the reader page.
0226
0227 tail page
0228 - the page where the next write will take place.
0229
0230 commit page
0231 - the page that last finished a write.
0232
0233 The commit page only is updated by the outermost writer in the
0234 writer stack. A writer that preempts another writer will not move the
0235 commit page.
0236
0237 When data is written into the ring buffer, a position is reserved
0238 in the ring buffer and passed back to the writer. When the writer
0239 is finished writing data into that position, it commits the write.
0240
0241 Another write (or a read) may take place at anytime during this
0242 transaction. If another write happens it must finish before continuing
0243 with the previous write.
0244
0245
0246 Write reserve::
0247
0248 Buffer page
0249 +---------+
0250 |written |
0251 +---------+ <--- given back to writer (current commit)
0252 |reserved |
0253 +---------+ <--- tail pointer
0254 | empty |
0255 +---------+
0256
0257 Write commit::
0258
0259 Buffer page
0260 +---------+
0261 |written |
0262 +---------+
0263 |written |
0264 +---------+ <--- next position for write (current commit)
0265 | empty |
0266 +---------+
0267
0268
0269 If a write happens after the first reserve::
0270
0271 Buffer page
0272 +---------+
0273 |written |
0274 +---------+ <-- current commit
0275 |reserved |
0276 +---------+ <--- given back to second writer
0277 |reserved |
0278 +---------+ <--- tail pointer
0279
0280 After second writer commits::
0281
0282
0283 Buffer page
0284 +---------+
0285 |written |
0286 +---------+ <--(last full commit)
0287 |reserved |
0288 +---------+
0289 |pending |
0290 |commit |
0291 +---------+ <--- tail pointer
0292
0293 When the first writer commits::
0294
0295 Buffer page
0296 +---------+
0297 |written |
0298 +---------+
0299 |written |
0300 +---------+
0301 |written |
0302 +---------+ <--(last full commit and tail pointer)
0303
0304
0305 The commit pointer points to the last write location that was
0306 committed without preempting another write. When a write that
0307 preempted another write is committed, it only becomes a pending commit
0308 and will not be a full commit until all writes have been committed.
0309
0310 The commit page points to the page that has the last full commit.
0311 The tail page points to the page with the last write (before
0312 committing).
0313
0314 The tail page is always equal to or after the commit page. It may
0315 be several pages ahead. If the tail page catches up to the commit
0316 page then no more writes may take place (regardless of the mode
0317 of the ring buffer: overwrite and produce/consumer).
0318
0319 The order of pages is::
0320
0321 head page
0322 commit page
0323 tail page
0324
0325 Possible scenario::
0326
0327 tail page
0328 head page commit page |
0329 | | |
0330 v v v
0331 +---+ +---+ +---+ +---+
0332 <---| |--->| |--->| |--->| |--->
0333 --->| |<---| |<---| |<---| |<---
0334 +---+ +---+ +---+ +---+
0335
0336 There is a special case that the head page is after either the commit page
0337 and possibly the tail page. That is when the commit (and tail) page has been
0338 swapped with the reader page. This is because the head page is always
0339 part of the ring buffer, but the reader page is not. Whenever there
0340 has been less than a full page that has been committed inside the ring buffer,
0341 and a reader swaps out a page, it will be swapping out the commit page.
0342
0343 ::
0344
0345 reader page commit page tail page
0346 | | |
0347 v | |
0348 +---+ | |
0349 | |<----------+ |
0350 | |<------------------------+
0351 | |------+
0352 +---+ |
0353 |
0354 v
0355 +---+ +---+ +---+ +---+
0356 <---| |--->| |--->| |--->| |--->
0357 --->| |<---| |<---| |<---| |<---
0358 +---+ +---+ +---+ +---+
0359 ^
0360 |
0361 head page
0362
0363
0364 In this case, the head page will not move when the tail and commit
0365 move back into the ring buffer.
0366
0367 The reader cannot swap a page into the ring buffer if the commit page
0368 is still on that page. If the read meets the last commit (real commit
0369 not pending or reserved), then there is nothing more to read.
0370 The buffer is considered empty until another full commit finishes.
0371
0372 When the tail meets the head page, if the buffer is in overwrite mode,
0373 the head page will be pushed ahead one. If the buffer is in producer/consumer
0374 mode, the write will fail.
0375
0376 Overwrite mode::
0377
0378 tail page
0379 |
0380 v
0381 +---+ +---+ +---+ +---+
0382 <---| |--->| |--->| |--->| |--->
0383 --->| |<---| |<---| |<---| |<---
0384 +---+ +---+ +---+ +---+
0385 ^
0386 |
0387 head page
0388
0389
0390 tail page
0391 |
0392 v
0393 +---+ +---+ +---+ +---+
0394 <---| |--->| |--->| |--->| |--->
0395 --->| |<---| |<---| |<---| |<---
0396 +---+ +---+ +---+ +---+
0397 ^
0398 |
0399 head page
0400
0401
0402 tail page
0403 |
0404 v
0405 +---+ +---+ +---+ +---+
0406 <---| |--->| |--->| |--->| |--->
0407 --->| |<---| |<---| |<---| |<---
0408 +---+ +---+ +---+ +---+
0409 ^
0410 |
0411 head page
0412
0413 Note, the reader page will still point to the previous head page.
0414 But when a swap takes place, it will use the most recent head page.
0415
0416
0417 Making the Ring Buffer Lockless:
0418 --------------------------------
0419
0420 The main idea behind the lockless algorithm is to combine the moving
0421 of the head_page pointer with the swapping of pages with the reader.
0422 State flags are placed inside the pointer to the page. To do this,
0423 each page must be aligned in memory by 4 bytes. This will allow the 2
0424 least significant bits of the address to be used as flags, since
0425 they will always be zero for the address. To get the address,
0426 simply mask out the flags::
0427
0428 MASK = ~3
0429
0430 address & MASK
0431
0432 Two flags will be kept by these two bits:
0433
0434 HEADER
0435 - the page being pointed to is a head page
0436
0437 UPDATE
0438 - the page being pointed to is being updated by a writer
0439 and was or is about to be a head page.
0440
0441 ::
0442
0443 reader page
0444 |
0445 v
0446 +---+
0447 | |------+
0448 +---+ |
0449 |
0450 v
0451 +---+ +---+ +---+ +---+
0452 <---| |--->| |-H->| |--->| |--->
0453 --->| |<---| |<---| |<---| |<---
0454 +---+ +---+ +---+ +---+
0455
0456
0457 The above pointer "-H->" would have the HEADER flag set. That is
0458 the next page is the next page to be swapped out by the reader.
0459 This pointer means the next page is the head page.
0460
0461 When the tail page meets the head pointer, it will use cmpxchg to
0462 change the pointer to the UPDATE state::
0463
0464
0465 tail page
0466 |
0467 v
0468 +---+ +---+ +---+ +---+
0469 <---| |--->| |-H->| |--->| |--->
0470 --->| |<---| |<---| |<---| |<---
0471 +---+ +---+ +---+ +---+
0472
0473 tail page
0474 |
0475 v
0476 +---+ +---+ +---+ +---+
0477 <---| |--->| |-U->| |--->| |--->
0478 --->| |<---| |<---| |<---| |<---
0479 +---+ +---+ +---+ +---+
0480
0481 "-U->" represents a pointer in the UPDATE state.
0482
0483 Any access to the reader will need to take some sort of lock to serialize
0484 the readers. But the writers will never take a lock to write to the
0485 ring buffer. This means we only need to worry about a single reader,
0486 and writes only preempt in "stack" formation.
0487
0488 When the reader tries to swap the page with the ring buffer, it
0489 will also use cmpxchg. If the flag bit in the pointer to the
0490 head page does not have the HEADER flag set, the compare will fail
0491 and the reader will need to look for the new head page and try again.
0492 Note, the flags UPDATE and HEADER are never set at the same time.
0493
0494 The reader swaps the reader page as follows::
0495
0496 +------+
0497 |reader| RING BUFFER
0498 |page |
0499 +------+
0500 +---+ +---+ +---+
0501 | |--->| |--->| |
0502 | |<---| |<---| |
0503 +---+ +---+ +---+
0504 ^ | ^ |
0505 | +---------------+ |
0506 +-----H-------------+
0507
0508 The reader sets the reader page next pointer as HEADER to the page after
0509 the head page::
0510
0511
0512 +------+
0513 |reader| RING BUFFER
0514 |page |-------H-----------+
0515 +------+ v
0516 | +---+ +---+ +---+
0517 | | |--->| |--->| |
0518 | | |<---| |<---| |<-+
0519 | +---+ +---+ +---+ |
0520 | ^ | ^ | |
0521 | | +---------------+ | |
0522 | +-----H-------------+ |
0523 +--------------------------------------+
0524
0525 It does a cmpxchg with the pointer to the previous head page to make it
0526 point to the reader page. Note that the new pointer does not have the HEADER
0527 flag set. This action atomically moves the head page forward::
0528
0529 +------+
0530 |reader| RING BUFFER
0531 |page |-------H-----------+
0532 +------+ v
0533 | ^ +---+ +---+ +---+
0534 | | | |-->| |-->| |
0535 | | | |<--| |<--| |<-+
0536 | | +---+ +---+ +---+ |
0537 | | | ^ | |
0538 | | +-------------+ | |
0539 | +-----------------------------+ |
0540 +------------------------------------+
0541
0542 After the new head page is set, the previous pointer of the head page is
0543 updated to the reader page::
0544
0545 +------+
0546 |reader| RING BUFFER
0547 |page |-------H-----------+
0548 +------+ <---------------+ v
0549 | ^ +---+ +---+ +---+
0550 | | | |-->| |-->| |
0551 | | | | | |<--| |<-+
0552 | | +---+ +---+ +---+ |
0553 | | | ^ | |
0554 | | +-------------+ | |
0555 | +-----------------------------+ |
0556 +------------------------------------+
0557
0558 +------+
0559 |buffer| RING BUFFER
0560 |page |-------H-----------+ <--- New head page
0561 +------+ <---------------+ v
0562 | ^ +---+ +---+ +---+
0563 | | | | | |-->| |
0564 | | New | | | |<--| |<-+
0565 | | Reader +---+ +---+ +---+ |
0566 | | page ----^ | |
0567 | | | |
0568 | +-----------------------------+ |
0569 +------------------------------------+
0570
0571 Another important point: The page that the reader page points back to
0572 by its previous pointer (the one that now points to the new head page)
0573 never points back to the reader page. That is because the reader page is
0574 not part of the ring buffer. Traversing the ring buffer via the next pointers
0575 will always stay in the ring buffer. Traversing the ring buffer via the
0576 prev pointers may not.
0577
0578 Note, the way to determine a reader page is simply by examining the previous
0579 pointer of the page. If the next pointer of the previous page does not
0580 point back to the original page, then the original page is a reader page::
0581
0582
0583 +--------+
0584 | reader | next +----+
0585 | page |-------->| |<====== (buffer page)
0586 +--------+ +----+
0587 | | ^
0588 | v | next
0589 prev | +----+
0590 +------------->| |
0591 +----+
0592
0593 The way the head page moves forward:
0594
0595 When the tail page meets the head page and the buffer is in overwrite mode
0596 and more writes take place, the head page must be moved forward before the
0597 writer may move the tail page. The way this is done is that the writer
0598 performs a cmpxchg to convert the pointer to the head page from the HEADER
0599 flag to have the UPDATE flag set. Once this is done, the reader will
0600 not be able to swap the head page from the buffer, nor will it be able to
0601 move the head page, until the writer is finished with the move.
0602
0603 This eliminates any races that the reader can have on the writer. The reader
0604 must spin, and this is why the reader cannot preempt the writer::
0605
0606 tail page
0607 |
0608 v
0609 +---+ +---+ +---+ +---+
0610 <---| |--->| |-H->| |--->| |--->
0611 --->| |<---| |<---| |<---| |<---
0612 +---+ +---+ +---+ +---+
0613
0614 tail page
0615 |
0616 v
0617 +---+ +---+ +---+ +---+
0618 <---| |--->| |-U->| |--->| |--->
0619 --->| |<---| |<---| |<---| |<---
0620 +---+ +---+ +---+ +---+
0621
0622 The following page will be made into the new head page::
0623
0624 tail page
0625 |
0626 v
0627 +---+ +---+ +---+ +---+
0628 <---| |--->| |-U->| |-H->| |--->
0629 --->| |<---| |<---| |<---| |<---
0630 +---+ +---+ +---+ +---+
0631
0632 After the new head page has been set, we can set the old head page
0633 pointer back to NORMAL::
0634
0635 tail page
0636 |
0637 v
0638 +---+ +---+ +---+ +---+
0639 <---| |--->| |--->| |-H->| |--->
0640 --->| |<---| |<---| |<---| |<---
0641 +---+ +---+ +---+ +---+
0642
0643 After the head page has been moved, the tail page may now move forward::
0644
0645 tail page
0646 |
0647 v
0648 +---+ +---+ +---+ +---+
0649 <---| |--->| |--->| |-H->| |--->
0650 --->| |<---| |<---| |<---| |<---
0651 +---+ +---+ +---+ +---+
0652
0653
0654 The above are the trivial updates. Now for the more complex scenarios.
0655
0656
0657 As stated before, if enough writes preempt the first write, the
0658 tail page may make it all the way around the buffer and meet the commit
0659 page. At this time, we must start dropping writes (usually with some kind
0660 of warning to the user). But what happens if the commit was still on the
0661 reader page? The commit page is not part of the ring buffer. The tail page
0662 must account for this::
0663
0664
0665 reader page commit page
0666 | |
0667 v |
0668 +---+ |
0669 | |<----------+
0670 | |
0671 | |------+
0672 +---+ |
0673 |
0674 v
0675 +---+ +---+ +---+ +---+
0676 <---| |--->| |-H->| |--->| |--->
0677 --->| |<---| |<---| |<---| |<---
0678 +---+ +---+ +---+ +---+
0679 ^
0680 |
0681 tail page
0682
0683 If the tail page were to simply push the head page forward, the commit when
0684 leaving the reader page would not be pointing to the correct page.
0685
0686 The solution to this is to test if the commit page is on the reader page
0687 before pushing the head page. If it is, then it can be assumed that the
0688 tail page wrapped the buffer, and we must drop new writes.
0689
0690 This is not a race condition, because the commit page can only be moved
0691 by the outermost writer (the writer that was preempted).
0692 This means that the commit will not move while a writer is moving the
0693 tail page. The reader cannot swap the reader page if it is also being
0694 used as the commit page. The reader can simply check that the commit
0695 is off the reader page. Once the commit page leaves the reader page
0696 it will never go back on it unless a reader does another swap with the
0697 buffer page that is also the commit page.
0698
0699
0700 Nested writes
0701 -------------
0702
0703 In the pushing forward of the tail page we must first push forward
0704 the head page if the head page is the next page. If the head page
0705 is not the next page, the tail page is simply updated with a cmpxchg.
0706
0707 Only writers move the tail page. This must be done atomically to protect
0708 against nested writers::
0709
0710 temp_page = tail_page
0711 next_page = temp_page->next
0712 cmpxchg(tail_page, temp_page, next_page)
0713
0714 The above will update the tail page if it is still pointing to the expected
0715 page. If this fails, a nested write pushed it forward, the current write
0716 does not need to push it::
0717
0718
0719 temp page
0720 |
0721 v
0722 tail page
0723 |
0724 v
0725 +---+ +---+ +---+ +---+
0726 <---| |--->| |--->| |--->| |--->
0727 --->| |<---| |<---| |<---| |<---
0728 +---+ +---+ +---+ +---+
0729
0730 Nested write comes in and moves the tail page forward::
0731
0732 tail page (moved by nested writer)
0733 temp page |
0734 | |
0735 v v
0736 +---+ +---+ +---+ +---+
0737 <---| |--->| |--->| |--->| |--->
0738 --->| |<---| |<---| |<---| |<---
0739 +---+ +---+ +---+ +---+
0740
0741 The above would fail the cmpxchg, but since the tail page has already
0742 been moved forward, the writer will just try again to reserve storage
0743 on the new tail page.
0744
0745 But the moving of the head page is a bit more complex::
0746
0747 tail page
0748 |
0749 v
0750 +---+ +---+ +---+ +---+
0751 <---| |--->| |-H->| |--->| |--->
0752 --->| |<---| |<---| |<---| |<---
0753 +---+ +---+ +---+ +---+
0754
0755 The write converts the head page pointer to UPDATE::
0756
0757 tail page
0758 |
0759 v
0760 +---+ +---+ +---+ +---+
0761 <---| |--->| |-U->| |--->| |--->
0762 --->| |<---| |<---| |<---| |<---
0763 +---+ +---+ +---+ +---+
0764
0765 But if a nested writer preempts here, it will see that the next
0766 page is a head page, but it is also nested. It will detect that
0767 it is nested and will save that information. The detection is the
0768 fact that it sees the UPDATE flag instead of a HEADER or NORMAL
0769 pointer.
0770
0771 The nested writer will set the new head page pointer::
0772
0773 tail page
0774 |
0775 v
0776 +---+ +---+ +---+ +---+
0777 <---| |--->| |-U->| |-H->| |--->
0778 --->| |<---| |<---| |<---| |<---
0779 +---+ +---+ +---+ +---+
0780
0781 But it will not reset the update back to normal. Only the writer
0782 that converted a pointer from HEAD to UPDATE will convert it back
0783 to NORMAL::
0784
0785 tail page
0786 |
0787 v
0788 +---+ +---+ +---+ +---+
0789 <---| |--->| |-U->| |-H->| |--->
0790 --->| |<---| |<---| |<---| |<---
0791 +---+ +---+ +---+ +---+
0792
0793 After the nested writer finishes, the outermost writer will convert
0794 the UPDATE pointer to NORMAL::
0795
0796
0797 tail page
0798 |
0799 v
0800 +---+ +---+ +---+ +---+
0801 <---| |--->| |--->| |-H->| |--->
0802 --->| |<---| |<---| |<---| |<---
0803 +---+ +---+ +---+ +---+
0804
0805
0806 It can be even more complex if several nested writes came in and moved
0807 the tail page ahead several pages::
0808
0809
0810 (first writer)
0811
0812 tail page
0813 |
0814 v
0815 +---+ +---+ +---+ +---+
0816 <---| |--->| |-H->| |--->| |--->
0817 --->| |<---| |<---| |<---| |<---
0818 +---+ +---+ +---+ +---+
0819
0820 The write converts the head page pointer to UPDATE::
0821
0822 tail page
0823 |
0824 v
0825 +---+ +---+ +---+ +---+
0826 <---| |--->| |-U->| |--->| |--->
0827 --->| |<---| |<---| |<---| |<---
0828 +---+ +---+ +---+ +---+
0829
0830 Next writer comes in, and sees the update and sets up the new
0831 head page::
0832
0833 (second writer)
0834
0835 tail page
0836 |
0837 v
0838 +---+ +---+ +---+ +---+
0839 <---| |--->| |-U->| |-H->| |--->
0840 --->| |<---| |<---| |<---| |<---
0841 +---+ +---+ +---+ +---+
0842
0843 The nested writer moves the tail page forward. But does not set the old
0844 update page to NORMAL because it is not the outermost writer::
0845
0846 tail page
0847 |
0848 v
0849 +---+ +---+ +---+ +---+
0850 <---| |--->| |-U->| |-H->| |--->
0851 --->| |<---| |<---| |<---| |<---
0852 +---+ +---+ +---+ +---+
0853
0854 Another writer preempts and sees the page after the tail page is a head page.
0855 It changes it from HEAD to UPDATE::
0856
0857 (third writer)
0858
0859 tail page
0860 |
0861 v
0862 +---+ +---+ +---+ +---+
0863 <---| |--->| |-U->| |-U->| |--->
0864 --->| |<---| |<---| |<---| |<---
0865 +---+ +---+ +---+ +---+
0866
0867 The writer will move the head page forward::
0868
0869
0870 (third writer)
0871
0872 tail page
0873 |
0874 v
0875 +---+ +---+ +---+ +---+
0876 <---| |--->| |-U->| |-U->| |-H->
0877 --->| |<---| |<---| |<---| |<---
0878 +---+ +---+ +---+ +---+
0879
0880 But now that the third writer did change the HEAD flag to UPDATE it
0881 will convert it to normal::
0882
0883
0884 (third writer)
0885
0886 tail page
0887 |
0888 v
0889 +---+ +---+ +---+ +---+
0890 <---| |--->| |-U->| |--->| |-H->
0891 --->| |<---| |<---| |<---| |<---
0892 +---+ +---+ +---+ +---+
0893
0894
0895 Then it will move the tail page, and return back to the second writer::
0896
0897
0898 (second writer)
0899
0900 tail page
0901 |
0902 v
0903 +---+ +---+ +---+ +---+
0904 <---| |--->| |-U->| |--->| |-H->
0905 --->| |<---| |<---| |<---| |<---
0906 +---+ +---+ +---+ +---+
0907
0908
0909 The second writer will fail to move the tail page because it was already
0910 moved, so it will try again and add its data to the new tail page.
0911 It will return to the first writer::
0912
0913
0914 (first writer)
0915
0916 tail page
0917 |
0918 v
0919 +---+ +---+ +---+ +---+
0920 <---| |--->| |-U->| |--->| |-H->
0921 --->| |<---| |<---| |<---| |<---
0922 +---+ +---+ +---+ +---+
0923
0924 The first writer cannot know atomically if the tail page moved
0925 while it updates the HEAD page. It will then update the head page to
0926 what it thinks is the new head page::
0927
0928
0929 (first writer)
0930
0931 tail page
0932 |
0933 v
0934 +---+ +---+ +---+ +---+
0935 <---| |--->| |-U->| |-H->| |-H->
0936 --->| |<---| |<---| |<---| |<---
0937 +---+ +---+ +---+ +---+
0938
0939 Since the cmpxchg returns the old value of the pointer the first writer
0940 will see it succeeded in updating the pointer from NORMAL to HEAD.
0941 But as we can see, this is not good enough. It must also check to see
0942 if the tail page is either where it use to be or on the next page::
0943
0944
0945 (first writer)
0946
0947 A B tail page
0948 | | |
0949 v v v
0950 +---+ +---+ +---+ +---+
0951 <---| |--->| |-U->| |-H->| |-H->
0952 --->| |<---| |<---| |<---| |<---
0953 +---+ +---+ +---+ +---+
0954
0955 If tail page != A and tail page != B, then it must reset the pointer
0956 back to NORMAL. The fact that it only needs to worry about nested
0957 writers means that it only needs to check this after setting the HEAD page::
0958
0959
0960 (first writer)
0961
0962 A B tail page
0963 | | |
0964 v v v
0965 +---+ +---+ +---+ +---+
0966 <---| |--->| |-U->| |--->| |-H->
0967 --->| |<---| |<---| |<---| |<---
0968 +---+ +---+ +---+ +---+
0969
0970 Now the writer can update the head page. This is also why the head page must
0971 remain in UPDATE and only reset by the outermost writer. This prevents
0972 the reader from seeing the incorrect head page::
0973
0974
0975 (first writer)
0976
0977 A B tail page
0978 | | |
0979 v v v
0980 +---+ +---+ +---+ +---+
0981 <---| |--->| |--->| |--->| |-H->
0982 --->| |<---| |<---| |<---| |<---
0983 +---+ +---+ +---+ +---+