Back to home page

OSCL-LXR

 
 

    


0001 .. _kernel_hacking_lock:
0002 
0003 ===========================
0004 Unreliable Guide To Locking
0005 ===========================
0006 
0007 :Author: Rusty Russell
0008 
0009 Introduction
0010 ============
0011 
0012 Welcome, to Rusty's Remarkably Unreliable Guide to Kernel Locking
0013 issues. This document describes the locking systems in the Linux Kernel
0014 in 2.6.
0015 
0016 With the wide availability of HyperThreading, and preemption in the
0017 Linux Kernel, everyone hacking on the kernel needs to know the
0018 fundamentals of concurrency and locking for SMP.
0019 
0020 The Problem With Concurrency
0021 ============================
0022 
0023 (Skip this if you know what a Race Condition is).
0024 
0025 In a normal program, you can increment a counter like so:
0026 
0027 ::
0028 
0029           very_important_count++;
0030 
0031 
0032 This is what they would expect to happen:
0033 
0034 
0035 .. table:: Expected Results
0036 
0037   +------------------------------------+------------------------------------+
0038   | Instance 1                         | Instance 2                         |
0039   +====================================+====================================+
0040   | read very_important_count (5)      |                                    |
0041   +------------------------------------+------------------------------------+
0042   | add 1 (6)                          |                                    |
0043   +------------------------------------+------------------------------------+
0044   | write very_important_count (6)     |                                    |
0045   +------------------------------------+------------------------------------+
0046   |                                    | read very_important_count (6)      |
0047   +------------------------------------+------------------------------------+
0048   |                                    | add 1 (7)                          |
0049   +------------------------------------+------------------------------------+
0050   |                                    | write very_important_count (7)     |
0051   +------------------------------------+------------------------------------+
0052 
0053 This is what might happen:
0054 
0055 .. table:: Possible Results
0056 
0057   +------------------------------------+------------------------------------+
0058   | Instance 1                         | Instance 2                         |
0059   +====================================+====================================+
0060   | read very_important_count (5)      |                                    |
0061   +------------------------------------+------------------------------------+
0062   |                                    | read very_important_count (5)      |
0063   +------------------------------------+------------------------------------+
0064   | add 1 (6)                          |                                    |
0065   +------------------------------------+------------------------------------+
0066   |                                    | add 1 (6)                          |
0067   +------------------------------------+------------------------------------+
0068   | write very_important_count (6)     |                                    |
0069   +------------------------------------+------------------------------------+
0070   |                                    | write very_important_count (6)     |
0071   +------------------------------------+------------------------------------+
0072 
0073 
0074 Race Conditions and Critical Regions
0075 ------------------------------------
0076 
0077 This overlap, where the result depends on the relative timing of
0078 multiple tasks, is called a race condition. The piece of code containing
0079 the concurrency issue is called a critical region. And especially since
0080 Linux starting running on SMP machines, they became one of the major
0081 issues in kernel design and implementation.
0082 
0083 Preemption can have the same effect, even if there is only one CPU: by
0084 preempting one task during the critical region, we have exactly the same
0085 race condition. In this case the thread which preempts might run the
0086 critical region itself.
0087 
0088 The solution is to recognize when these simultaneous accesses occur, and
0089 use locks to make sure that only one instance can enter the critical
0090 region at any time. There are many friendly primitives in the Linux
0091 kernel to help you do this. And then there are the unfriendly
0092 primitives, but I'll pretend they don't exist.
0093 
0094 Locking in the Linux Kernel
0095 ===========================
0096 
0097 If I could give you one piece of advice on locking: **keep it simple**.
0098 
0099 Be reluctant to introduce new locks.
0100 
0101 Two Main Types of Kernel Locks: Spinlocks and Mutexes
0102 -----------------------------------------------------
0103 
0104 There are two main types of kernel locks. The fundamental type is the
0105 spinlock (``include/asm/spinlock.h``), which is a very simple
0106 single-holder lock: if you can't get the spinlock, you keep trying
0107 (spinning) until you can. Spinlocks are very small and fast, and can be
0108 used anywhere.
0109 
0110 The second type is a mutex (``include/linux/mutex.h``): it is like a
0111 spinlock, but you may block holding a mutex. If you can't lock a mutex,
0112 your task will suspend itself, and be woken up when the mutex is
0113 released. This means the CPU can do something else while you are
0114 waiting. There are many cases when you simply can't sleep (see
0115 `What Functions Are Safe To Call From Interrupts?`_),
0116 and so have to use a spinlock instead.
0117 
0118 Neither type of lock is recursive: see
0119 `Deadlock: Simple and Advanced`_.
0120 
0121 Locks and Uniprocessor Kernels
0122 ------------------------------
0123 
0124 For kernels compiled without ``CONFIG_SMP``, and without
0125 ``CONFIG_PREEMPT`` spinlocks do not exist at all. This is an excellent
0126 design decision: when no-one else can run at the same time, there is no
0127 reason to have a lock.
0128 
0129 If the kernel is compiled without ``CONFIG_SMP``, but ``CONFIG_PREEMPT``
0130 is set, then spinlocks simply disable preemption, which is sufficient to
0131 prevent any races. For most purposes, we can think of preemption as
0132 equivalent to SMP, and not worry about it separately.
0133 
0134 You should always test your locking code with ``CONFIG_SMP`` and
0135 ``CONFIG_PREEMPT`` enabled, even if you don't have an SMP test box,
0136 because it will still catch some kinds of locking bugs.
0137 
0138 Mutexes still exist, because they are required for synchronization
0139 between user contexts, as we will see below.
0140 
0141 Locking Only In User Context
0142 ----------------------------
0143 
0144 If you have a data structure which is only ever accessed from user
0145 context, then you can use a simple mutex (``include/linux/mutex.h``) to
0146 protect it. This is the most trivial case: you initialize the mutex.
0147 Then you can call mutex_lock_interruptible() to grab the
0148 mutex, and mutex_unlock() to release it. There is also a
0149 mutex_lock(), which should be avoided, because it will
0150 not return if a signal is received.
0151 
0152 Example: ``net/netfilter/nf_sockopt.c`` allows registration of new
0153 setsockopt() and getsockopt() calls, with
0154 nf_register_sockopt(). Registration and de-registration
0155 are only done on module load and unload (and boot time, where there is
0156 no concurrency), and the list of registrations is only consulted for an
0157 unknown setsockopt() or getsockopt() system
0158 call. The ``nf_sockopt_mutex`` is perfect to protect this, especially
0159 since the setsockopt and getsockopt calls may well sleep.
0160 
0161 Locking Between User Context and Softirqs
0162 -----------------------------------------
0163 
0164 If a softirq shares data with user context, you have two problems.
0165 Firstly, the current user context can be interrupted by a softirq, and
0166 secondly, the critical region could be entered from another CPU. This is
0167 where spin_lock_bh() (``include/linux/spinlock.h``) is
0168 used. It disables softirqs on that CPU, then grabs the lock.
0169 spin_unlock_bh() does the reverse. (The '_bh' suffix is
0170 a historical reference to "Bottom Halves", the old name for software
0171 interrupts. It should really be called spin_lock_softirq()' in a
0172 perfect world).
0173 
0174 Note that you can also use spin_lock_irq() or
0175 spin_lock_irqsave() here, which stop hardware interrupts
0176 as well: see `Hard IRQ Context`_.
0177 
0178 This works perfectly for UP as well: the spin lock vanishes, and this
0179 macro simply becomes local_bh_disable()
0180 (``include/linux/interrupt.h``), which protects you from the softirq
0181 being run.
0182 
0183 Locking Between User Context and Tasklets
0184 -----------------------------------------
0185 
0186 This is exactly the same as above, because tasklets are actually run
0187 from a softirq.
0188 
0189 Locking Between User Context and Timers
0190 ---------------------------------------
0191 
0192 This, too, is exactly the same as above, because timers are actually run
0193 from a softirq. From a locking point of view, tasklets and timers are
0194 identical.
0195 
0196 Locking Between Tasklets/Timers
0197 -------------------------------
0198 
0199 Sometimes a tasklet or timer might want to share data with another
0200 tasklet or timer.
0201 
0202 The Same Tasklet/Timer
0203 ~~~~~~~~~~~~~~~~~~~~~~
0204 
0205 Since a tasklet is never run on two CPUs at once, you don't need to
0206 worry about your tasklet being reentrant (running twice at once), even
0207 on SMP.
0208 
0209 Different Tasklets/Timers
0210 ~~~~~~~~~~~~~~~~~~~~~~~~~
0211 
0212 If another tasklet/timer wants to share data with your tasklet or timer
0213 , you will both need to use spin_lock() and
0214 spin_unlock() calls. spin_lock_bh() is
0215 unnecessary here, as you are already in a tasklet, and none will be run
0216 on the same CPU.
0217 
0218 Locking Between Softirqs
0219 ------------------------
0220 
0221 Often a softirq might want to share data with itself or a tasklet/timer.
0222 
0223 The Same Softirq
0224 ~~~~~~~~~~~~~~~~
0225 
0226 The same softirq can run on the other CPUs: you can use a per-CPU array
0227 (see `Per-CPU Data`_) for better performance. If you're
0228 going so far as to use a softirq, you probably care about scalable
0229 performance enough to justify the extra complexity.
0230 
0231 You'll need to use spin_lock() and
0232 spin_unlock() for shared data.
0233 
0234 Different Softirqs
0235 ~~~~~~~~~~~~~~~~~~
0236 
0237 You'll need to use spin_lock() and
0238 spin_unlock() for shared data, whether it be a timer,
0239 tasklet, different softirq or the same or another softirq: any of them
0240 could be running on a different CPU.
0241 
0242 Hard IRQ Context
0243 ================
0244 
0245 Hardware interrupts usually communicate with a tasklet or softirq.
0246 Frequently this involves putting work in a queue, which the softirq will
0247 take out.
0248 
0249 Locking Between Hard IRQ and Softirqs/Tasklets
0250 ----------------------------------------------
0251 
0252 If a hardware irq handler shares data with a softirq, you have two
0253 concerns. Firstly, the softirq processing can be interrupted by a
0254 hardware interrupt, and secondly, the critical region could be entered
0255 by a hardware interrupt on another CPU. This is where
0256 spin_lock_irq() is used. It is defined to disable
0257 interrupts on that cpu, then grab the lock.
0258 spin_unlock_irq() does the reverse.
0259 
0260 The irq handler does not need to use spin_lock_irq(), because
0261 the softirq cannot run while the irq handler is running: it can use
0262 spin_lock(), which is slightly faster. The only exception
0263 would be if a different hardware irq handler uses the same lock:
0264 spin_lock_irq() will stop that from interrupting us.
0265 
0266 This works perfectly for UP as well: the spin lock vanishes, and this
0267 macro simply becomes local_irq_disable()
0268 (``include/asm/smp.h``), which protects you from the softirq/tasklet/BH
0269 being run.
0270 
0271 spin_lock_irqsave() (``include/linux/spinlock.h``) is a
0272 variant which saves whether interrupts were on or off in a flags word,
0273 which is passed to spin_unlock_irqrestore(). This means
0274 that the same code can be used inside an hard irq handler (where
0275 interrupts are already off) and in softirqs (where the irq disabling is
0276 required).
0277 
0278 Note that softirqs (and hence tasklets and timers) are run on return
0279 from hardware interrupts, so spin_lock_irq() also stops
0280 these. In that sense, spin_lock_irqsave() is the most
0281 general and powerful locking function.
0282 
0283 Locking Between Two Hard IRQ Handlers
0284 -------------------------------------
0285 
0286 It is rare to have to share data between two IRQ handlers, but if you
0287 do, spin_lock_irqsave() should be used: it is
0288 architecture-specific whether all interrupts are disabled inside irq
0289 handlers themselves.
0290 
0291 Cheat Sheet For Locking
0292 =======================
0293 
0294 Pete Zaitcev gives the following summary:
0295 
0296 -  If you are in a process context (any syscall) and want to lock other
0297    process out, use a mutex. You can take a mutex and sleep
0298    (``copy_from_user()`` or ``kmalloc(x,GFP_KERNEL)``).
0299 
0300 -  Otherwise (== data can be touched in an interrupt), use
0301    spin_lock_irqsave() and
0302    spin_unlock_irqrestore().
0303 
0304 -  Avoid holding spinlock for more than 5 lines of code and across any
0305    function call (except accessors like readb()).
0306 
0307 Table of Minimum Requirements
0308 -----------------------------
0309 
0310 The following table lists the **minimum** locking requirements between
0311 various contexts. In some cases, the same context can only be running on
0312 one CPU at a time, so no locking is required for that context (eg. a
0313 particular thread can only run on one CPU at a time, but if it needs
0314 shares data with another thread, locking is required).
0315 
0316 Remember the advice above: you can always use
0317 spin_lock_irqsave(), which is a superset of all other
0318 spinlock primitives.
0319 
0320 ============== ============= ============= ========= ========= ========= ========= ======= ======= ============== ==============
0321 .              IRQ Handler A IRQ Handler B Softirq A Softirq B Tasklet A Tasklet B Timer A Timer B User Context A User Context B
0322 ============== ============= ============= ========= ========= ========= ========= ======= ======= ============== ==============
0323 IRQ Handler A  None
0324 IRQ Handler B  SLIS          None
0325 Softirq A      SLI           SLI           SL
0326 Softirq B      SLI           SLI           SL        SL
0327 Tasklet A      SLI           SLI           SL        SL        None
0328 Tasklet B      SLI           SLI           SL        SL        SL        None
0329 Timer A        SLI           SLI           SL        SL        SL        SL        None
0330 Timer B        SLI           SLI           SL        SL        SL        SL        SL      None
0331 User Context A SLI           SLI           SLBH      SLBH      SLBH      SLBH      SLBH    SLBH    None
0332 User Context B SLI           SLI           SLBH      SLBH      SLBH      SLBH      SLBH    SLBH    MLI            None
0333 ============== ============= ============= ========= ========= ========= ========= ======= ======= ============== ==============
0334 
0335 Table: Table of Locking Requirements
0336 
0337 +--------+----------------------------+
0338 | SLIS   | spin_lock_irqsave          |
0339 +--------+----------------------------+
0340 | SLI    | spin_lock_irq              |
0341 +--------+----------------------------+
0342 | SL     | spin_lock                  |
0343 +--------+----------------------------+
0344 | SLBH   | spin_lock_bh               |
0345 +--------+----------------------------+
0346 | MLI    | mutex_lock_interruptible   |
0347 +--------+----------------------------+
0348 
0349 Table: Legend for Locking Requirements Table
0350 
0351 The trylock Functions
0352 =====================
0353 
0354 There are functions that try to acquire a lock only once and immediately
0355 return a value telling about success or failure to acquire the lock.
0356 They can be used if you need no access to the data protected with the
0357 lock when some other thread is holding the lock. You should acquire the
0358 lock later if you then need access to the data protected with the lock.
0359 
0360 spin_trylock() does not spin but returns non-zero if it
0361 acquires the spinlock on the first try or 0 if not. This function can be
0362 used in all contexts like spin_lock(): you must have
0363 disabled the contexts that might interrupt you and acquire the spin
0364 lock.
0365 
0366 mutex_trylock() does not suspend your task but returns
0367 non-zero if it could lock the mutex on the first try or 0 if not. This
0368 function cannot be safely used in hardware or software interrupt
0369 contexts despite not sleeping.
0370 
0371 Common Examples
0372 ===============
0373 
0374 Let's step through a simple example: a cache of number to name mappings.
0375 The cache keeps a count of how often each of the objects is used, and
0376 when it gets full, throws out the least used one.
0377 
0378 All In User Context
0379 -------------------
0380 
0381 For our first example, we assume that all operations are in user context
0382 (ie. from system calls), so we can sleep. This means we can use a mutex
0383 to protect the cache and all the objects within it. Here's the code::
0384 
0385     #include <linux/list.h>
0386     #include <linux/slab.h>
0387     #include <linux/string.h>
0388     #include <linux/mutex.h>
0389     #include <asm/errno.h>
0390 
0391     struct object
0392     {
0393             struct list_head list;
0394             int id;
0395             char name[32];
0396             int popularity;
0397     };
0398 
0399     /* Protects the cache, cache_num, and the objects within it */
0400     static DEFINE_MUTEX(cache_lock);
0401     static LIST_HEAD(cache);
0402     static unsigned int cache_num = 0;
0403     #define MAX_CACHE_SIZE 10
0404 
0405     /* Must be holding cache_lock */
0406     static struct object *__cache_find(int id)
0407     {
0408             struct object *i;
0409 
0410             list_for_each_entry(i, &cache, list)
0411                     if (i->id == id) {
0412                             i->popularity++;
0413                             return i;
0414                     }
0415             return NULL;
0416     }
0417 
0418     /* Must be holding cache_lock */
0419     static void __cache_delete(struct object *obj)
0420     {
0421             BUG_ON(!obj);
0422             list_del(&obj->list);
0423             kfree(obj);
0424             cache_num--;
0425     }
0426 
0427     /* Must be holding cache_lock */
0428     static void __cache_add(struct object *obj)
0429     {
0430             list_add(&obj->list, &cache);
0431             if (++cache_num > MAX_CACHE_SIZE) {
0432                     struct object *i, *outcast = NULL;
0433                     list_for_each_entry(i, &cache, list) {
0434                             if (!outcast || i->popularity < outcast->popularity)
0435                                     outcast = i;
0436                     }
0437                     __cache_delete(outcast);
0438             }
0439     }
0440 
0441     int cache_add(int id, const char *name)
0442     {
0443             struct object *obj;
0444 
0445             if ((obj = kmalloc(sizeof(*obj), GFP_KERNEL)) == NULL)
0446                     return -ENOMEM;
0447 
0448             strscpy(obj->name, name, sizeof(obj->name));
0449             obj->id = id;
0450             obj->popularity = 0;
0451 
0452             mutex_lock(&cache_lock);
0453             __cache_add(obj);
0454             mutex_unlock(&cache_lock);
0455             return 0;
0456     }
0457 
0458     void cache_delete(int id)
0459     {
0460             mutex_lock(&cache_lock);
0461             __cache_delete(__cache_find(id));
0462             mutex_unlock(&cache_lock);
0463     }
0464 
0465     int cache_find(int id, char *name)
0466     {
0467             struct object *obj;
0468             int ret = -ENOENT;
0469 
0470             mutex_lock(&cache_lock);
0471             obj = __cache_find(id);
0472             if (obj) {
0473                     ret = 0;
0474                     strcpy(name, obj->name);
0475             }
0476             mutex_unlock(&cache_lock);
0477             return ret;
0478     }
0479 
0480 Note that we always make sure we have the cache_lock when we add,
0481 delete, or look up the cache: both the cache infrastructure itself and
0482 the contents of the objects are protected by the lock. In this case it's
0483 easy, since we copy the data for the user, and never let them access the
0484 objects directly.
0485 
0486 There is a slight (and common) optimization here: in
0487 cache_add() we set up the fields of the object before
0488 grabbing the lock. This is safe, as no-one else can access it until we
0489 put it in cache.
0490 
0491 Accessing From Interrupt Context
0492 --------------------------------
0493 
0494 Now consider the case where cache_find() can be called
0495 from interrupt context: either a hardware interrupt or a softirq. An
0496 example would be a timer which deletes object from the cache.
0497 
0498 The change is shown below, in standard patch format: the ``-`` are lines
0499 which are taken away, and the ``+`` are lines which are added.
0500 
0501 ::
0502 
0503     --- cache.c.usercontext 2003-12-09 13:58:54.000000000 +1100
0504     +++ cache.c.interrupt   2003-12-09 14:07:49.000000000 +1100
0505     @@ -12,7 +12,7 @@
0506              int popularity;
0507      };
0508 
0509     -static DEFINE_MUTEX(cache_lock);
0510     +static DEFINE_SPINLOCK(cache_lock);
0511      static LIST_HEAD(cache);
0512      static unsigned int cache_num = 0;
0513      #define MAX_CACHE_SIZE 10
0514     @@ -55,6 +55,7 @@
0515      int cache_add(int id, const char *name)
0516      {
0517              struct object *obj;
0518     +        unsigned long flags;
0519 
0520              if ((obj = kmalloc(sizeof(*obj), GFP_KERNEL)) == NULL)
0521                      return -ENOMEM;
0522     @@ -63,30 +64,33 @@
0523              obj->id = id;
0524              obj->popularity = 0;
0525 
0526     -        mutex_lock(&cache_lock);
0527     +        spin_lock_irqsave(&cache_lock, flags);
0528              __cache_add(obj);
0529     -        mutex_unlock(&cache_lock);
0530     +        spin_unlock_irqrestore(&cache_lock, flags);
0531              return 0;
0532      }
0533 
0534      void cache_delete(int id)
0535      {
0536     -        mutex_lock(&cache_lock);
0537     +        unsigned long flags;
0538     +
0539     +        spin_lock_irqsave(&cache_lock, flags);
0540              __cache_delete(__cache_find(id));
0541     -        mutex_unlock(&cache_lock);
0542     +        spin_unlock_irqrestore(&cache_lock, flags);
0543      }
0544 
0545      int cache_find(int id, char *name)
0546      {
0547              struct object *obj;
0548              int ret = -ENOENT;
0549     +        unsigned long flags;
0550 
0551     -        mutex_lock(&cache_lock);
0552     +        spin_lock_irqsave(&cache_lock, flags);
0553              obj = __cache_find(id);
0554              if (obj) {
0555                      ret = 0;
0556                      strcpy(name, obj->name);
0557              }
0558     -        mutex_unlock(&cache_lock);
0559     +        spin_unlock_irqrestore(&cache_lock, flags);
0560              return ret;
0561      }
0562 
0563 Note that the spin_lock_irqsave() will turn off
0564 interrupts if they are on, otherwise does nothing (if we are already in
0565 an interrupt handler), hence these functions are safe to call from any
0566 context.
0567 
0568 Unfortunately, cache_add() calls kmalloc()
0569 with the ``GFP_KERNEL`` flag, which is only legal in user context. I
0570 have assumed that cache_add() is still only called in
0571 user context, otherwise this should become a parameter to
0572 cache_add().
0573 
0574 Exposing Objects Outside This File
0575 ----------------------------------
0576 
0577 If our objects contained more information, it might not be sufficient to
0578 copy the information in and out: other parts of the code might want to
0579 keep pointers to these objects, for example, rather than looking up the
0580 id every time. This produces two problems.
0581 
0582 The first problem is that we use the ``cache_lock`` to protect objects:
0583 we'd need to make this non-static so the rest of the code can use it.
0584 This makes locking trickier, as it is no longer all in one place.
0585 
0586 The second problem is the lifetime problem: if another structure keeps a
0587 pointer to an object, it presumably expects that pointer to remain
0588 valid. Unfortunately, this is only guaranteed while you hold the lock,
0589 otherwise someone might call cache_delete() and even
0590 worse, add another object, re-using the same address.
0591 
0592 As there is only one lock, you can't hold it forever: no-one else would
0593 get any work done.
0594 
0595 The solution to this problem is to use a reference count: everyone who
0596 has a pointer to the object increases it when they first get the object,
0597 and drops the reference count when they're finished with it. Whoever
0598 drops it to zero knows it is unused, and can actually delete it.
0599 
0600 Here is the code::
0601 
0602     --- cache.c.interrupt   2003-12-09 14:25:43.000000000 +1100
0603     +++ cache.c.refcnt  2003-12-09 14:33:05.000000000 +1100
0604     @@ -7,6 +7,7 @@
0605      struct object
0606      {
0607              struct list_head list;
0608     +        unsigned int refcnt;
0609              int id;
0610              char name[32];
0611              int popularity;
0612     @@ -17,6 +18,35 @@
0613      static unsigned int cache_num = 0;
0614      #define MAX_CACHE_SIZE 10
0615 
0616     +static void __object_put(struct object *obj)
0617     +{
0618     +        if (--obj->refcnt == 0)
0619     +                kfree(obj);
0620     +}
0621     +
0622     +static void __object_get(struct object *obj)
0623     +{
0624     +        obj->refcnt++;
0625     +}
0626     +
0627     +void object_put(struct object *obj)
0628     +{
0629     +        unsigned long flags;
0630     +
0631     +        spin_lock_irqsave(&cache_lock, flags);
0632     +        __object_put(obj);
0633     +        spin_unlock_irqrestore(&cache_lock, flags);
0634     +}
0635     +
0636     +void object_get(struct object *obj)
0637     +{
0638     +        unsigned long flags;
0639     +
0640     +        spin_lock_irqsave(&cache_lock, flags);
0641     +        __object_get(obj);
0642     +        spin_unlock_irqrestore(&cache_lock, flags);
0643     +}
0644     +
0645      /* Must be holding cache_lock */
0646      static struct object *__cache_find(int id)
0647      {
0648     @@ -35,6 +65,7 @@
0649      {
0650              BUG_ON(!obj);
0651              list_del(&obj->list);
0652     +        __object_put(obj);
0653              cache_num--;
0654      }
0655 
0656     @@ -63,6 +94,7 @@
0657              strscpy(obj->name, name, sizeof(obj->name));
0658              obj->id = id;
0659              obj->popularity = 0;
0660     +        obj->refcnt = 1; /* The cache holds a reference */
0661 
0662              spin_lock_irqsave(&cache_lock, flags);
0663              __cache_add(obj);
0664     @@ -79,18 +111,15 @@
0665              spin_unlock_irqrestore(&cache_lock, flags);
0666      }
0667 
0668     -int cache_find(int id, char *name)
0669     +struct object *cache_find(int id)
0670      {
0671              struct object *obj;
0672     -        int ret = -ENOENT;
0673              unsigned long flags;
0674 
0675              spin_lock_irqsave(&cache_lock, flags);
0676              obj = __cache_find(id);
0677     -        if (obj) {
0678     -                ret = 0;
0679     -                strcpy(name, obj->name);
0680     -        }
0681     +        if (obj)
0682     +                __object_get(obj);
0683              spin_unlock_irqrestore(&cache_lock, flags);
0684     -        return ret;
0685     +        return obj;
0686      }
0687 
0688 We encapsulate the reference counting in the standard 'get' and 'put'
0689 functions. Now we can return the object itself from
0690 cache_find() which has the advantage that the user can
0691 now sleep holding the object (eg. to copy_to_user() to
0692 name to userspace).
0693 
0694 The other point to note is that I said a reference should be held for
0695 every pointer to the object: thus the reference count is 1 when first
0696 inserted into the cache. In some versions the framework does not hold a
0697 reference count, but they are more complicated.
0698 
0699 Using Atomic Operations For The Reference Count
0700 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0701 
0702 In practice, :c:type:`atomic_t` would usually be used for refcnt. There are a
0703 number of atomic operations defined in ``include/asm/atomic.h``: these
0704 are guaranteed to be seen atomically from all CPUs in the system, so no
0705 lock is required. In this case, it is simpler than using spinlocks,
0706 although for anything non-trivial using spinlocks is clearer. The
0707 atomic_inc() and atomic_dec_and_test()
0708 are used instead of the standard increment and decrement operators, and
0709 the lock is no longer used to protect the reference count itself.
0710 
0711 ::
0712 
0713     --- cache.c.refcnt  2003-12-09 15:00:35.000000000 +1100
0714     +++ cache.c.refcnt-atomic   2003-12-11 15:49:42.000000000 +1100
0715     @@ -7,7 +7,7 @@
0716      struct object
0717      {
0718              struct list_head list;
0719     -        unsigned int refcnt;
0720     +        atomic_t refcnt;
0721              int id;
0722              char name[32];
0723              int popularity;
0724     @@ -18,33 +18,15 @@
0725      static unsigned int cache_num = 0;
0726      #define MAX_CACHE_SIZE 10
0727 
0728     -static void __object_put(struct object *obj)
0729     -{
0730     -        if (--obj->refcnt == 0)
0731     -                kfree(obj);
0732     -}
0733     -
0734     -static void __object_get(struct object *obj)
0735     -{
0736     -        obj->refcnt++;
0737     -}
0738     -
0739      void object_put(struct object *obj)
0740      {
0741     -        unsigned long flags;
0742     -
0743     -        spin_lock_irqsave(&cache_lock, flags);
0744     -        __object_put(obj);
0745     -        spin_unlock_irqrestore(&cache_lock, flags);
0746     +        if (atomic_dec_and_test(&obj->refcnt))
0747     +                kfree(obj);
0748      }
0749 
0750      void object_get(struct object *obj)
0751      {
0752     -        unsigned long flags;
0753     -
0754     -        spin_lock_irqsave(&cache_lock, flags);
0755     -        __object_get(obj);
0756     -        spin_unlock_irqrestore(&cache_lock, flags);
0757     +        atomic_inc(&obj->refcnt);
0758      }
0759 
0760      /* Must be holding cache_lock */
0761     @@ -65,7 +47,7 @@
0762      {
0763              BUG_ON(!obj);
0764              list_del(&obj->list);
0765     -        __object_put(obj);
0766     +        object_put(obj);
0767              cache_num--;
0768      }
0769 
0770     @@ -94,7 +76,7 @@
0771              strscpy(obj->name, name, sizeof(obj->name));
0772              obj->id = id;
0773              obj->popularity = 0;
0774     -        obj->refcnt = 1; /* The cache holds a reference */
0775     +        atomic_set(&obj->refcnt, 1); /* The cache holds a reference */
0776 
0777              spin_lock_irqsave(&cache_lock, flags);
0778              __cache_add(obj);
0779     @@ -119,7 +101,7 @@
0780              spin_lock_irqsave(&cache_lock, flags);
0781              obj = __cache_find(id);
0782              if (obj)
0783     -                __object_get(obj);
0784     +                object_get(obj);
0785              spin_unlock_irqrestore(&cache_lock, flags);
0786              return obj;
0787      }
0788 
0789 Protecting The Objects Themselves
0790 ---------------------------------
0791 
0792 In these examples, we assumed that the objects (except the reference
0793 counts) never changed once they are created. If we wanted to allow the
0794 name to change, there are three possibilities:
0795 
0796 -  You can make ``cache_lock`` non-static, and tell people to grab that
0797    lock before changing the name in any object.
0798 
0799 -  You can provide a cache_obj_rename() which grabs this
0800    lock and changes the name for the caller, and tell everyone to use
0801    that function.
0802 
0803 -  You can make the ``cache_lock`` protect only the cache itself, and
0804    use another lock to protect the name.
0805 
0806 Theoretically, you can make the locks as fine-grained as one lock for
0807 every field, for every object. In practice, the most common variants
0808 are:
0809 
0810 -  One lock which protects the infrastructure (the ``cache`` list in
0811    this example) and all the objects. This is what we have done so far.
0812 
0813 -  One lock which protects the infrastructure (including the list
0814    pointers inside the objects), and one lock inside the object which
0815    protects the rest of that object.
0816 
0817 -  Multiple locks to protect the infrastructure (eg. one lock per hash
0818    chain), possibly with a separate per-object lock.
0819 
0820 Here is the "lock-per-object" implementation:
0821 
0822 ::
0823 
0824     --- cache.c.refcnt-atomic   2003-12-11 15:50:54.000000000 +1100
0825     +++ cache.c.perobjectlock   2003-12-11 17:15:03.000000000 +1100
0826     @@ -6,11 +6,17 @@
0827 
0828      struct object
0829      {
0830     +        /* These two protected by cache_lock. */
0831              struct list_head list;
0832     +        int popularity;
0833     +
0834              atomic_t refcnt;
0835     +
0836     +        /* Doesn't change once created. */
0837              int id;
0838     +
0839     +        spinlock_t lock; /* Protects the name */
0840              char name[32];
0841     -        int popularity;
0842      };
0843 
0844      static DEFINE_SPINLOCK(cache_lock);
0845     @@ -77,6 +84,7 @@
0846              obj->id = id;
0847              obj->popularity = 0;
0848              atomic_set(&obj->refcnt, 1); /* The cache holds a reference */
0849     +        spin_lock_init(&obj->lock);
0850 
0851              spin_lock_irqsave(&cache_lock, flags);
0852              __cache_add(obj);
0853 
0854 Note that I decide that the popularity count should be protected by the
0855 ``cache_lock`` rather than the per-object lock: this is because it (like
0856 the :c:type:`struct list_head <list_head>` inside the object)
0857 is logically part of the infrastructure. This way, I don't need to grab
0858 the lock of every object in __cache_add() when seeking
0859 the least popular.
0860 
0861 I also decided that the id member is unchangeable, so I don't need to
0862 grab each object lock in __cache_find() to examine the
0863 id: the object lock is only used by a caller who wants to read or write
0864 the name field.
0865 
0866 Note also that I added a comment describing what data was protected by
0867 which locks. This is extremely important, as it describes the runtime
0868 behavior of the code, and can be hard to gain from just reading. And as
0869 Alan Cox says, “Lock data, not code”.
0870 
0871 Common Problems
0872 ===============
0873 
0874 Deadlock: Simple and Advanced
0875 -----------------------------
0876 
0877 There is a coding bug where a piece of code tries to grab a spinlock
0878 twice: it will spin forever, waiting for the lock to be released
0879 (spinlocks, rwlocks and mutexes are not recursive in Linux). This is
0880 trivial to diagnose: not a
0881 stay-up-five-nights-talk-to-fluffy-code-bunnies kind of problem.
0882 
0883 For a slightly more complex case, imagine you have a region shared by a
0884 softirq and user context. If you use a spin_lock() call
0885 to protect it, it is possible that the user context will be interrupted
0886 by the softirq while it holds the lock, and the softirq will then spin
0887 forever trying to get the same lock.
0888 
0889 Both of these are called deadlock, and as shown above, it can occur even
0890 with a single CPU (although not on UP compiles, since spinlocks vanish
0891 on kernel compiles with ``CONFIG_SMP``\ =n. You'll still get data
0892 corruption in the second example).
0893 
0894 This complete lockup is easy to diagnose: on SMP boxes the watchdog
0895 timer or compiling with ``DEBUG_SPINLOCK`` set
0896 (``include/linux/spinlock.h``) will show this up immediately when it
0897 happens.
0898 
0899 A more complex problem is the so-called 'deadly embrace', involving two
0900 or more locks. Say you have a hash table: each entry in the table is a
0901 spinlock, and a chain of hashed objects. Inside a softirq handler, you
0902 sometimes want to alter an object from one place in the hash to another:
0903 you grab the spinlock of the old hash chain and the spinlock of the new
0904 hash chain, and delete the object from the old one, and insert it in the
0905 new one.
0906 
0907 There are two problems here. First, if your code ever tries to move the
0908 object to the same chain, it will deadlock with itself as it tries to
0909 lock it twice. Secondly, if the same softirq on another CPU is trying to
0910 move another object in the reverse direction, the following could
0911 happen:
0912 
0913 +-----------------------+-----------------------+
0914 | CPU 1                 | CPU 2                 |
0915 +=======================+=======================+
0916 | Grab lock A -> OK     | Grab lock B -> OK     |
0917 +-----------------------+-----------------------+
0918 | Grab lock B -> spin   | Grab lock A -> spin   |
0919 +-----------------------+-----------------------+
0920 
0921 Table: Consequences
0922 
0923 The two CPUs will spin forever, waiting for the other to give up their
0924 lock. It will look, smell, and feel like a crash.
0925 
0926 Preventing Deadlock
0927 -------------------
0928 
0929 Textbooks will tell you that if you always lock in the same order, you
0930 will never get this kind of deadlock. Practice will tell you that this
0931 approach doesn't scale: when I create a new lock, I don't understand
0932 enough of the kernel to figure out where in the 5000 lock hierarchy it
0933 will fit.
0934 
0935 The best locks are encapsulated: they never get exposed in headers, and
0936 are never held around calls to non-trivial functions outside the same
0937 file. You can read through this code and see that it will never
0938 deadlock, because it never tries to grab another lock while it has that
0939 one. People using your code don't even need to know you are using a
0940 lock.
0941 
0942 A classic problem here is when you provide callbacks or hooks: if you
0943 call these with the lock held, you risk simple deadlock, or a deadly
0944 embrace (who knows what the callback will do?).
0945 
0946 Overzealous Prevention Of Deadlocks
0947 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0948 
0949 Deadlocks are problematic, but not as bad as data corruption. Code which
0950 grabs a read lock, searches a list, fails to find what it wants, drops
0951 the read lock, grabs a write lock and inserts the object has a race
0952 condition.
0953 
0954 Racing Timers: A Kernel Pastime
0955 -------------------------------
0956 
0957 Timers can produce their own special problems with races. Consider a
0958 collection of objects (list, hash, etc) where each object has a timer
0959 which is due to destroy it.
0960 
0961 If you want to destroy the entire collection (say on module removal),
0962 you might do the following::
0963 
0964             /* THIS CODE BAD BAD BAD BAD: IF IT WAS ANY WORSE IT WOULD USE
0965                HUNGARIAN NOTATION */
0966             spin_lock_bh(&list_lock);
0967 
0968             while (list) {
0969                     struct foo *next = list->next;
0970                     del_timer(&list->timer);
0971                     kfree(list);
0972                     list = next;
0973             }
0974 
0975             spin_unlock_bh(&list_lock);
0976 
0977 
0978 Sooner or later, this will crash on SMP, because a timer can have just
0979 gone off before the spin_lock_bh(), and it will only get
0980 the lock after we spin_unlock_bh(), and then try to free
0981 the element (which has already been freed!).
0982 
0983 This can be avoided by checking the result of
0984 del_timer(): if it returns 1, the timer has been deleted.
0985 If 0, it means (in this case) that it is currently running, so we can
0986 do::
0987 
0988             retry:
0989                     spin_lock_bh(&list_lock);
0990 
0991                     while (list) {
0992                             struct foo *next = list->next;
0993                             if (!del_timer(&list->timer)) {
0994                                     /* Give timer a chance to delete this */
0995                                     spin_unlock_bh(&list_lock);
0996                                     goto retry;
0997                             }
0998                             kfree(list);
0999                             list = next;
1000                     }
1001 
1002                     spin_unlock_bh(&list_lock);
1003 
1004 
1005 Another common problem is deleting timers which restart themselves (by
1006 calling add_timer() at the end of their timer function).
1007 Because this is a fairly common case which is prone to races, you should
1008 use del_timer_sync() (``include/linux/timer.h``) to
1009 handle this case. It returns the number of times the timer had to be
1010 deleted before we finally stopped it from adding itself back in.
1011 
1012 Locking Speed
1013 =============
1014 
1015 There are three main things to worry about when considering speed of
1016 some code which does locking. First is concurrency: how many things are
1017 going to be waiting while someone else is holding a lock. Second is the
1018 time taken to actually acquire and release an uncontended lock. Third is
1019 using fewer, or smarter locks. I'm assuming that the lock is used fairly
1020 often: otherwise, you wouldn't be concerned about efficiency.
1021 
1022 Concurrency depends on how long the lock is usually held: you should
1023 hold the lock for as long as needed, but no longer. In the cache
1024 example, we always create the object without the lock held, and then
1025 grab the lock only when we are ready to insert it in the list.
1026 
1027 Acquisition times depend on how much damage the lock operations do to
1028 the pipeline (pipeline stalls) and how likely it is that this CPU was
1029 the last one to grab the lock (ie. is the lock cache-hot for this CPU):
1030 on a machine with more CPUs, this likelihood drops fast. Consider a
1031 700MHz Intel Pentium III: an instruction takes about 0.7ns, an atomic
1032 increment takes about 58ns, a lock which is cache-hot on this CPU takes
1033 160ns, and a cacheline transfer from another CPU takes an additional 170
1034 to 360ns. (These figures from Paul McKenney's `Linux Journal RCU
1035 article <http://www.linuxjournal.com/article.php?sid=6993>`__).
1036 
1037 These two aims conflict: holding a lock for a short time might be done
1038 by splitting locks into parts (such as in our final per-object-lock
1039 example), but this increases the number of lock acquisitions, and the
1040 results are often slower than having a single lock. This is another
1041 reason to advocate locking simplicity.
1042 
1043 The third concern is addressed below: there are some methods to reduce
1044 the amount of locking which needs to be done.
1045 
1046 Read/Write Lock Variants
1047 ------------------------
1048 
1049 Both spinlocks and mutexes have read/write variants: ``rwlock_t`` and
1050 :c:type:`struct rw_semaphore <rw_semaphore>`. These divide
1051 users into two classes: the readers and the writers. If you are only
1052 reading the data, you can get a read lock, but to write to the data you
1053 need the write lock. Many people can hold a read lock, but a writer must
1054 be sole holder.
1055 
1056 If your code divides neatly along reader/writer lines (as our cache code
1057 does), and the lock is held by readers for significant lengths of time,
1058 using these locks can help. They are slightly slower than the normal
1059 locks though, so in practice ``rwlock_t`` is not usually worthwhile.
1060 
1061 Avoiding Locks: Read Copy Update
1062 --------------------------------
1063 
1064 There is a special method of read/write locking called Read Copy Update.
1065 Using RCU, the readers can avoid taking a lock altogether: as we expect
1066 our cache to be read more often than updated (otherwise the cache is a
1067 waste of time), it is a candidate for this optimization.
1068 
1069 How do we get rid of read locks? Getting rid of read locks means that
1070 writers may be changing the list underneath the readers. That is
1071 actually quite simple: we can read a linked list while an element is
1072 being added if the writer adds the element very carefully. For example,
1073 adding ``new`` to a single linked list called ``list``::
1074 
1075             new->next = list->next;
1076             wmb();
1077             list->next = new;
1078 
1079 
1080 The wmb() is a write memory barrier. It ensures that the
1081 first operation (setting the new element's ``next`` pointer) is complete
1082 and will be seen by all CPUs, before the second operation is (putting
1083 the new element into the list). This is important, since modern
1084 compilers and modern CPUs can both reorder instructions unless told
1085 otherwise: we want a reader to either not see the new element at all, or
1086 see the new element with the ``next`` pointer correctly pointing at the
1087 rest of the list.
1088 
1089 Fortunately, there is a function to do this for standard
1090 :c:type:`struct list_head <list_head>` lists:
1091 list_add_rcu() (``include/linux/list.h``).
1092 
1093 Removing an element from the list is even simpler: we replace the
1094 pointer to the old element with a pointer to its successor, and readers
1095 will either see it, or skip over it.
1096 
1097 ::
1098 
1099             list->next = old->next;
1100 
1101 
1102 There is list_del_rcu() (``include/linux/list.h``) which
1103 does this (the normal version poisons the old object, which we don't
1104 want).
1105 
1106 The reader must also be careful: some CPUs can look through the ``next``
1107 pointer to start reading the contents of the next element early, but
1108 don't realize that the pre-fetched contents is wrong when the ``next``
1109 pointer changes underneath them. Once again, there is a
1110 list_for_each_entry_rcu() (``include/linux/list.h``)
1111 to help you. Of course, writers can just use
1112 list_for_each_entry(), since there cannot be two
1113 simultaneous writers.
1114 
1115 Our final dilemma is this: when can we actually destroy the removed
1116 element? Remember, a reader might be stepping through this element in
1117 the list right now: if we free this element and the ``next`` pointer
1118 changes, the reader will jump off into garbage and crash. We need to
1119 wait until we know that all the readers who were traversing the list
1120 when we deleted the element are finished. We use
1121 call_rcu() to register a callback which will actually
1122 destroy the object once all pre-existing readers are finished.
1123 Alternatively, synchronize_rcu() may be used to block
1124 until all pre-existing are finished.
1125 
1126 But how does Read Copy Update know when the readers are finished? The
1127 method is this: firstly, the readers always traverse the list inside
1128 rcu_read_lock()/rcu_read_unlock() pairs:
1129 these simply disable preemption so the reader won't go to sleep while
1130 reading the list.
1131 
1132 RCU then waits until every other CPU has slept at least once: since
1133 readers cannot sleep, we know that any readers which were traversing the
1134 list during the deletion are finished, and the callback is triggered.
1135 The real Read Copy Update code is a little more optimized than this, but
1136 this is the fundamental idea.
1137 
1138 ::
1139 
1140     --- cache.c.perobjectlock   2003-12-11 17:15:03.000000000 +1100
1141     +++ cache.c.rcupdate    2003-12-11 17:55:14.000000000 +1100
1142     @@ -1,15 +1,18 @@
1143      #include <linux/list.h>
1144      #include <linux/slab.h>
1145      #include <linux/string.h>
1146     +#include <linux/rcupdate.h>
1147      #include <linux/mutex.h>
1148      #include <asm/errno.h>
1149 
1150      struct object
1151      {
1152     -        /* These two protected by cache_lock. */
1153     +        /* This is protected by RCU */
1154              struct list_head list;
1155              int popularity;
1156 
1157     +        struct rcu_head rcu;
1158     +
1159              atomic_t refcnt;
1160 
1161              /* Doesn't change once created. */
1162     @@ -40,7 +43,7 @@
1163      {
1164              struct object *i;
1165 
1166     -        list_for_each_entry(i, &cache, list) {
1167     +        list_for_each_entry_rcu(i, &cache, list) {
1168                      if (i->id == id) {
1169                              i->popularity++;
1170                              return i;
1171     @@ -49,19 +52,25 @@
1172              return NULL;
1173      }
1174 
1175     +/* Final discard done once we know no readers are looking. */
1176     +static void cache_delete_rcu(void *arg)
1177     +{
1178     +        object_put(arg);
1179     +}
1180     +
1181      /* Must be holding cache_lock */
1182      static void __cache_delete(struct object *obj)
1183      {
1184              BUG_ON(!obj);
1185     -        list_del(&obj->list);
1186     -        object_put(obj);
1187     +        list_del_rcu(&obj->list);
1188              cache_num--;
1189     +        call_rcu(&obj->rcu, cache_delete_rcu);
1190      }
1191 
1192      /* Must be holding cache_lock */
1193      static void __cache_add(struct object *obj)
1194      {
1195     -        list_add(&obj->list, &cache);
1196     +        list_add_rcu(&obj->list, &cache);
1197              if (++cache_num > MAX_CACHE_SIZE) {
1198                      struct object *i, *outcast = NULL;
1199                      list_for_each_entry(i, &cache, list) {
1200     @@ -104,12 +114,11 @@
1201      struct object *cache_find(int id)
1202      {
1203              struct object *obj;
1204     -        unsigned long flags;
1205 
1206     -        spin_lock_irqsave(&cache_lock, flags);
1207     +        rcu_read_lock();
1208              obj = __cache_find(id);
1209              if (obj)
1210                      object_get(obj);
1211     -        spin_unlock_irqrestore(&cache_lock, flags);
1212     +        rcu_read_unlock();
1213              return obj;
1214      }
1215 
1216 Note that the reader will alter the popularity member in
1217 __cache_find(), and now it doesn't hold a lock. One
1218 solution would be to make it an ``atomic_t``, but for this usage, we
1219 don't really care about races: an approximate result is good enough, so
1220 I didn't change it.
1221 
1222 The result is that cache_find() requires no
1223 synchronization with any other functions, so is almost as fast on SMP as
1224 it would be on UP.
1225 
1226 There is a further optimization possible here: remember our original
1227 cache code, where there were no reference counts and the caller simply
1228 held the lock whenever using the object? This is still possible: if you
1229 hold the lock, no one can delete the object, so you don't need to get
1230 and put the reference count.
1231 
1232 Now, because the 'read lock' in RCU is simply disabling preemption, a
1233 caller which always has preemption disabled between calling
1234 cache_find() and object_put() does not
1235 need to actually get and put the reference count: we could expose
1236 __cache_find() by making it non-static, and such
1237 callers could simply call that.
1238 
1239 The benefit here is that the reference count is not written to: the
1240 object is not altered in any way, which is much faster on SMP machines
1241 due to caching.
1242 
1243 Per-CPU Data
1244 ------------
1245 
1246 Another technique for avoiding locking which is used fairly widely is to
1247 duplicate information for each CPU. For example, if you wanted to keep a
1248 count of a common condition, you could use a spin lock and a single
1249 counter. Nice and simple.
1250 
1251 If that was too slow (it's usually not, but if you've got a really big
1252 machine to test on and can show that it is), you could instead use a
1253 counter for each CPU, then none of them need an exclusive lock. See
1254 DEFINE_PER_CPU(), get_cpu_var() and
1255 put_cpu_var() (``include/linux/percpu.h``).
1256 
1257 Of particular use for simple per-cpu counters is the ``local_t`` type,
1258 and the cpu_local_inc() and related functions, which are
1259 more efficient than simple code on some architectures
1260 (``include/asm/local.h``).
1261 
1262 Note that there is no simple, reliable way of getting an exact value of
1263 such a counter, without introducing more locks. This is not a problem
1264 for some uses.
1265 
1266 Data Which Mostly Used By An IRQ Handler
1267 ----------------------------------------
1268 
1269 If data is always accessed from within the same IRQ handler, you don't
1270 need a lock at all: the kernel already guarantees that the irq handler
1271 will not run simultaneously on multiple CPUs.
1272 
1273 Manfred Spraul points out that you can still do this, even if the data
1274 is very occasionally accessed in user context or softirqs/tasklets. The
1275 irq handler doesn't use a lock, and all other accesses are done as so::
1276 
1277         spin_lock(&lock);
1278         disable_irq(irq);
1279         ...
1280         enable_irq(irq);
1281         spin_unlock(&lock);
1282 
1283 The disable_irq() prevents the irq handler from running
1284 (and waits for it to finish if it's currently running on other CPUs).
1285 The spinlock prevents any other accesses happening at the same time.
1286 Naturally, this is slower than just a spin_lock_irq()
1287 call, so it only makes sense if this type of access happens extremely
1288 rarely.
1289 
1290 What Functions Are Safe To Call From Interrupts?
1291 ================================================
1292 
1293 Many functions in the kernel sleep (ie. call schedule()) directly or
1294 indirectly: you can never call them while holding a spinlock, or with
1295 preemption disabled. This also means you need to be in user context:
1296 calling them from an interrupt is illegal.
1297 
1298 Some Functions Which Sleep
1299 --------------------------
1300 
1301 The most common ones are listed below, but you usually have to read the
1302 code to find out if other calls are safe. If everyone else who calls it
1303 can sleep, you probably need to be able to sleep, too. In particular,
1304 registration and deregistration functions usually expect to be called
1305 from user context, and can sleep.
1306 
1307 -  Accesses to userspace:
1308 
1309    -  copy_from_user()
1310 
1311    -  copy_to_user()
1312 
1313    -  get_user()
1314 
1315    -  put_user()
1316 
1317 -  kmalloc(GP_KERNEL) <kmalloc>`
1318 
1319 -  mutex_lock_interruptible() and
1320    mutex_lock()
1321 
1322    There is a mutex_trylock() which does not sleep.
1323    Still, it must not be used inside interrupt context since its
1324    implementation is not safe for that. mutex_unlock()
1325    will also never sleep. It cannot be used in interrupt context either
1326    since a mutex must be released by the same task that acquired it.
1327 
1328 Some Functions Which Don't Sleep
1329 --------------------------------
1330 
1331 Some functions are safe to call from any context, or holding almost any
1332 lock.
1333 
1334 -  printk()
1335 
1336 -  kfree()
1337 
1338 -  add_timer() and del_timer()
1339 
1340 Mutex API reference
1341 ===================
1342 
1343 .. kernel-doc:: include/linux/mutex.h
1344    :internal:
1345 
1346 .. kernel-doc:: kernel/locking/mutex.c
1347    :export:
1348 
1349 Futex API reference
1350 ===================
1351 
1352 .. kernel-doc:: kernel/futex/core.c
1353    :internal:
1354 
1355 .. kernel-doc:: kernel/futex/futex.h
1356    :internal:
1357 
1358 .. kernel-doc:: kernel/futex/pi.c
1359    :internal:
1360 
1361 .. kernel-doc:: kernel/futex/requeue.c
1362    :internal:
1363 
1364 .. kernel-doc:: kernel/futex/waitwake.c
1365    :internal:
1366 
1367 Further reading
1368 ===============
1369 
1370 -  ``Documentation/locking/spinlocks.rst``: Linus Torvalds' spinlocking
1371    tutorial in the kernel sources.
1372 
1373 -  Unix Systems for Modern Architectures: Symmetric Multiprocessing and
1374    Caching for Kernel Programmers:
1375 
1376    Curt Schimmel's very good introduction to kernel level locking (not
1377    written for Linux, but nearly everything applies). The book is
1378    expensive, but really worth every penny to understand SMP locking.
1379    [ISBN: 0201633388]
1380 
1381 Thanks
1382 ======
1383 
1384 Thanks to Telsa Gwynne for DocBooking, neatening and adding style.
1385 
1386 Thanks to Martin Pool, Philipp Rumpf, Stephen Rothwell, Paul Mackerras,
1387 Ruedi Aschwanden, Alan Cox, Manfred Spraul, Tim Waugh, Pete Zaitcev,
1388 James Morris, Robert Love, Paul McKenney, John Ashby for proofreading,
1389 correcting, flaming, commenting.
1390 
1391 Thanks to the cabal for having no influence on this document.
1392 
1393 Glossary
1394 ========
1395 
1396 preemption
1397   Prior to 2.5, or when ``CONFIG_PREEMPT`` is unset, processes in user
1398   context inside the kernel would not preempt each other (ie. you had that
1399   CPU until you gave it up, except for interrupts). With the addition of
1400   ``CONFIG_PREEMPT`` in 2.5.4, this changed: when in user context, higher
1401   priority tasks can "cut in": spinlocks were changed to disable
1402   preemption, even on UP.
1403 
1404 bh
1405   Bottom Half: for historical reasons, functions with '_bh' in them often
1406   now refer to any software interrupt, e.g. spin_lock_bh()
1407   blocks any software interrupt on the current CPU. Bottom halves are
1408   deprecated, and will eventually be replaced by tasklets. Only one bottom
1409   half will be running at any time.
1410 
1411 Hardware Interrupt / Hardware IRQ
1412   Hardware interrupt request. in_hardirq() returns true in a
1413   hardware interrupt handler.
1414 
1415 Interrupt Context
1416   Not user context: processing a hardware irq or software irq. Indicated
1417   by the in_interrupt() macro returning true.
1418 
1419 SMP
1420   Symmetric Multi-Processor: kernels compiled for multiple-CPU machines.
1421   (``CONFIG_SMP=y``).
1422 
1423 Software Interrupt / softirq
1424   Software interrupt handler. in_hardirq() returns false;
1425   in_softirq() returns true. Tasklets and softirqs both
1426   fall into the category of 'software interrupts'.
1427 
1428   Strictly speaking a softirq is one of up to 32 enumerated software
1429   interrupts which can run on multiple CPUs at once. Sometimes used to
1430   refer to tasklets as well (ie. all software interrupts).
1431 
1432 tasklet
1433   A dynamically-registrable software interrupt, which is guaranteed to
1434   only run on one CPU at a time.
1435 
1436 timer
1437   A dynamically-registrable software interrupt, which is run at (or close
1438   to) a given time. When running, it is just like a tasklet (in fact, they
1439   are called from the ``TIMER_SOFTIRQ``).
1440 
1441 UP
1442   Uni-Processor: Non-SMP. (``CONFIG_SMP=n``).
1443 
1444 User Context
1445   The kernel executing on behalf of a particular process (ie. a system
1446   call or trap) or kernel thread. You can tell which process with the
1447   ``current`` macro.) Not to be confused with userspace. Can be
1448   interrupted by software or hardware interrupts.
1449 
1450 Userspace
1451   A process executing its own code outside the kernel.