=====================
The errseq_t datatype
=====================

An errseq_t is a way of recording errors in one place, and allowing any
number of "subscribers" to tell whether it has changed since a previous
point where it was sampled.

The initial use case for this is tracking errors for file
synchronization syscalls (fsync, fdatasync, msync and sync_file_range),
but it may be usable in other situations.

It's implemented as an unsigned 32-bit value.  The low-order bits are
designated to hold an error code (between 1 and MAX_ERRNO).  The upper bits
are used as a counter.  This is done with atomics instead of locking so that
these functions can be called from any context.

Note that there is a risk of collisions if new errors are being recorded
frequently, since only 19 bits are left over to use as a counter.

To mitigate this, the bit between the error value and counter is used as
a flag to tell whether the value has been sampled since a new value was
recorded.  That allows us to avoid bumping the counter if no one has
sampled it since the last time an error was recorded.

Thus we end up with a value that looks something like this:

+--------------------------------------+----+------------------------+
| 31..13                               | 12 | 11..0                  |
+--------------------------------------+----+------------------------+
| counter                              | SF | errno                  |
+--------------------------------------+----+------------------------+
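
As a rough illustration of that layout, the following userspace sketch
splits a raw 32-bit value into its three fields.  The macro, struct and
function names are hypothetical and are not part of the errseq_t API; the
bit positions are simply read off the table above.

```c
#include <assert.h>
#include <stdint.h>

/* Field boundaries read off the layout table; all names are hypothetical. */
#define SKETCH_ERRNO_MASK	0x0fffu		/* bits 11..0: errno */
#define SKETCH_SEEN_FLAG	(1u << 12)	/* bit 12: SF ("seen") flag */
#define SKETCH_CTR_SHIFT	13		/* bits 31..13: counter */

/* The SF flag is the bit directly above the errno field. */
static_assert(SKETCH_ERRNO_MASK + 1u == SKETCH_SEEN_FLAG,
	      "SF flag must sit just above the errno field");

struct sketch_fields {
	uint32_t counter;	/* value of the counter field */
	int seen;		/* 1 if the SF flag is set, else 0 */
	int err;		/* errno as a positive number, 0 if none */
};

/* Split a raw 32-bit value into counter, SF flag and errno. */
static inline struct sketch_fields sketch_decode(uint32_t v)
{
	struct sketch_fields f;

	f.counter = v >> SKETCH_CTR_SHIFT;
	f.seen = (v & SKETCH_SEEN_FLAG) != 0;
	f.err = (int)(v & SKETCH_ERRNO_MASK);
	return f;
}
```

Decoding an all-zero value gives a zero counter, a clear flag and no error,
which is the "never been an error" state.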

The general idea is for "watchers" to sample an errseq_t value and keep
it as a running cursor.  That value can later be used to tell whether
any new errors have occurred since that sampling was done, and atomically
record the state at the time that it was checked.  This allows us to
record errors in one place, and then have a number of "watchers" that
can tell whether the value has changed since they last checked it.

A new errseq_t should always be zeroed out.  An errseq_t value of all zeroes
is the special (but common) case where there has never been an error.  An
all-zero value thus serves as the "epoch" if one wishes to know whether there
has ever been an error set since it was first initialized.

API usage
=========

Let me tell you a story about a worker drone.  Now, he's a good worker
overall, but the company is a little... management-heavy.  He has to
report to 77 supervisors today, and tomorrow the "big boss" is coming in
from out of town and he's sure to test the poor fellow too.

They're all handing him work to do -- so much he can't keep track of who
handed him what, but that's not really a big problem.  The supervisors
just want to know when he's finished all of the work they've handed him so
far and whether he made any mistakes since they last asked.

He might have made the mistake on work they didn't actually hand him,
but he can't keep track of things at that level of detail; all he can
remember is the most recent mistake that he made.

Here's our worker_drone representation::

        struct worker_drone {
                errseq_t        wd_err; /* for recording errors */
        };

Every day, the worker_drone starts out with a blank slate::

        struct worker_drone wd;

        wd.wd_err = (errseq_t)0;

The supervisors come in and get an initial read for the day.  They
don't care about anything that happened before their watch begins::

        struct supervisor {
                errseq_t        s_wd_err; /* private "cursor" for wd_err */
                spinlock_t      s_wd_err_lock; /* protects s_wd_err */
        };

        struct supervisor       su;

        su.s_wd_err = errseq_sample(&wd.wd_err);
        spin_lock_init(&su.s_wd_err_lock);

Now they start handing him tasks to do.  Every few minutes they ask him to
finish up all of the work they've handed him so far.  Then they ask him
whether he made any mistakes on any of it::

        spin_lock(&su.s_wd_err_lock);
        err = errseq_check_and_advance(&wd.wd_err, &su.s_wd_err);
        spin_unlock(&su.s_wd_err_lock);

Up to this point, that just keeps returning 0.

Now, the owners of this company are quite miserly and have given him
substandard equipment with which to do his job. Occasionally it
glitches and he makes a mistake.  He sighs a heavy sigh, and marks it
down::

        errseq_set(&wd.wd_err, -EIO);

...and then gets back to work.  The supervisors eventually poll again
and they each get the error when they next check.  Subsequent calls will
return 0, until another error is recorded, at which point it's reported
to each of them once.

Note that the supervisors can't tell how many mistakes he made, only
whether one was made since they last checked, and the latest value
recorded.
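
To make the once-per-supervisor behaviour concrete, here is a simplified
userspace model of the three calls used so far, built on C11 atomics.  It is
only a sketch under stated assumptions: the m_ names are hypothetical
stand-ins, the constants assume the 12-bit errno layout shown earlier, and
the rule that the sample call returns 0 while the current error is still
unseen is an assumption of this model rather than something this document
states.  The real implementation is the one in lib/errseq.c.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

#define M_MAX_ERRNO	4095u		/* errno field, bits 11..0 */
#define M_SEEN		(1u << 12)	/* SF flag */
#define M_CTR_INC	(1u << 13)	/* one step of the counter */

typedef uint32_t m_errseq_t;		/* hypothetical stand-in for errseq_t */

/* Record a negative errno; bump the counter only if the old value was seen. */
static inline void m_errseq_set(_Atomic m_errseq_t *eseq, int err)
{
	m_errseq_t old = atomic_load(eseq);
	m_errseq_t next;

	assert(err < 0 && (unsigned int)(-err) <= M_MAX_ERRNO);
	do {
		/* Replace the errno field and clear the SF flag. */
		next = (old & ~(M_MAX_ERRNO | M_SEEN)) | (m_errseq_t)(-err);
		if (old & M_SEEN)
			next += M_CTR_INC;
		if (next == old)
			return;		/* same unseen error: nothing to do */
	} while (!atomic_compare_exchange_weak(eseq, &old, next));
}

/* Take an initial cursor.  Model assumption: an unseen error samples as 0. */
static inline m_errseq_t m_errseq_sample(_Atomic m_errseq_t *eseq)
{
	m_errseq_t old = atomic_load(eseq);

	return (old & M_SEEN) ? old : 0;
}

/* Report a change since *since, mark it seen, and move the cursor forward. */
static inline int m_errseq_check_and_advance(_Atomic m_errseq_t *eseq,
					     m_errseq_t *since)
{
	m_errseq_t old = atomic_load(eseq);
	m_errseq_t seen;

	if (old == *since)
		return 0;
	seen = old | M_SEEN;
	if (seen != old)
		atomic_compare_exchange_strong(eseq, &old, seen);
	*since = seen;
	return -(int)(seen & M_MAX_ERRNO);
}
```

With two cursors sampled from a zeroed value, recording -EIO once makes the
next check-and-advance on each cursor return -EIO, and the one after that
return 0.  The caller is still expected to serialize access to each cursor.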

Occasionally the big boss comes in for a spot check and asks the worker
to do a one-off job for him. He's not really watching the worker
full-time like the supervisors, but he does need to know whether a
mistake occurred while his job was processing.

He can just sample the current errseq_t in the worker, and then use that
to tell whether an error has occurred later::

        errseq_t since = errseq_sample(&wd.wd_err);
        /* submit some work and wait for it to complete */
        err = errseq_check(&wd.wd_err, since);

Since he's just going to discard "since" after that point, he doesn't
need to advance it here. He also doesn't need any locking since it's
not usable by anyone else.
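
The difference from the supervisors' call can be sketched in userspace as
well.  The cut-down model below repeats just enough code to stand alone;
the b_ names are hypothetical, the constants assume the layout shown
earlier, and the sampling rule (an unseen error samples as 0) is an
assumption of the model.  It is meant to show that a plain check never
moves the cursor, so it keeps reporting the same error.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

#define B_MAX_ERRNO	4095u		/* errno field, bits 11..0 */
#define B_SEEN		(1u << 12)	/* SF flag */
#define B_CTR_INC	(1u << 13)	/* one step of the counter */

/* Record a negative errno (hypothetical model of the set operation). */
static inline void b_set(_Atomic uint32_t *eseq, int err)
{
	uint32_t old = atomic_load(eseq);
	uint32_t next;

	assert(err < 0 && (unsigned int)(-err) <= B_MAX_ERRNO);
	do {
		next = (old & ~(B_MAX_ERRNO | B_SEEN)) | (uint32_t)(-err);
		if (old & B_SEEN)
			next += B_CTR_INC;
		if (next == old)
			return;
	} while (!atomic_compare_exchange_weak(eseq, &old, next));
}

/* Take a one-off sample.  Model assumption: an unseen error samples as 0. */
static inline uint32_t b_sample(_Atomic uint32_t *eseq)
{
	uint32_t old = atomic_load(eseq);

	return (old & B_SEEN) ? old : 0;
}

/* Compare against the sample; "since" is passed by value and never updated. */
static inline int b_check(_Atomic uint32_t *eseq, uint32_t since)
{
	uint32_t cur = atomic_load(eseq);

	return cur == since ? 0 : -(int)(cur & B_MAX_ERRNO);
}
```

In this model, checking twice after a single recorded -EIO returns -EIO both
times, which is fine for a caller that throws the sample away afterwards.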

Serializing errseq_t cursor updates
===================================

Note that the errseq_t API does not protect the errseq_t cursor during a
check_and_advance operation. Only the canonical error code is handled
atomically.  In a situation where more than one task might be using the
same errseq_t cursor at the same time, it's important to serialize
updates to that cursor.

If that's not done, then it's possible for the cursor to go backward
in which case the same error could be reported more than once.

Because of this, it's often advantageous to first do an errseq_check to
see if anything has changed, and only later do an
errseq_check_and_advance after taking the lock. e.g.::

        if (errseq_check(&wd.wd_err, READ_ONCE(su.s_wd_err))) {
                /* su.s_wd_err is protected by s_wd_err_lock */
                spin_lock(&su.s_wd_err_lock);
                err = errseq_check_and_advance(&wd.wd_err, &su.s_wd_err);
                spin_unlock(&su.s_wd_err_lock);
        }

That avoids the spinlock in the common case where nothing has changed
since the last time it was checked.
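
The same peek-then-lock pattern can be tried out in userspace.  In the
sketch below every s_ name is hypothetical: a toy spinlock built on
atomic_flag stands in for spinlock_t, and the cursor is stored as an atomic
so that the unlocked read, which plays the role of READ_ONCE, is well
defined in C11.  The check functions are simplified models that assume the
layout shown earlier, not the kernel's code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define S_MAX_ERRNO	4095u		/* errno field, bits 11..0 */
#define S_SEEN		(1u << 12)	/* SF flag */

static_assert(S_MAX_ERRNO + 1u == S_SEEN, "SF flag sits above the errno field");

struct s_supervisor {
	_Atomic uint32_t s_wd_err;	/* cursor; written only under the lock */
	atomic_flag s_wd_err_lock;	/* toy spinlock protecting s_wd_err */
};

static inline void s_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;	/* spin until the holder clears the flag */
}

static inline void s_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

/* Model of a plain check: compare only, never touches the cursor. */
static inline int s_check(_Atomic uint32_t *eseq, uint32_t since)
{
	uint32_t cur = atomic_load(eseq);

	return cur == since ? 0 : -(int)(cur & S_MAX_ERRNO);
}

/* Model of check-and-advance; the caller must hold the cursor's lock. */
static inline int s_check_and_advance(_Atomic uint32_t *eseq,
				      _Atomic uint32_t *since)
{
	uint32_t old = atomic_load(eseq);
	uint32_t seen;

	if (old == atomic_load(since))
		return 0;
	seen = old | S_SEEN;
	if (seen != old)
		atomic_compare_exchange_strong(eseq, &old, seen);
	atomic_store(since, seen);
	return -(int)(seen & S_MAX_ERRNO);
}

/* Peek without the lock; take it only when something appears to have changed. */
static inline int s_poll(struct s_supervisor *su, _Atomic uint32_t *wd_err)
{
	int err = 0;

	if (s_check(wd_err, atomic_load_explicit(&su->s_wd_err,
						  memory_order_relaxed))) {
		s_lock(&su->s_wd_err_lock);
		err = s_check_and_advance(wd_err, &su->s_wd_err);
		s_unlock(&su->s_wd_err_lock);
	}
	return err;
}
```

The check is repeated under the lock because another task sharing the
cursor may already have advanced it between the peek and the lock; in that
case the locked call simply returns 0.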

Functions
=========

.. kernel-doc:: lib/errseq.c