======================================================
hrtimers - subsystem for high-resolution kernel timers
======================================================

This patch introduces a new subsystem for high-resolution kernel timers.

One might ask the question: we already have a timer subsystem
(kernel/timers.c), why do we need two timer subsystems? After a lot of
back and forth trying to integrate high-resolution and high-precision
features into the existing timer framework, and after testing various
such high-resolution timer implementations in practice, we came to the
conclusion that the timer wheel code is fundamentally not suitable for
such an approach. We initially didn't believe this ('there must be a way
to solve this'), and spent considerable effort trying to integrate
things into the timer wheel, but we failed. In hindsight, there are
several reasons why such integration is hard/impossible:

- the forced handling of low-resolution and high-resolution timers in
  the same way leads to a lot of compromises, macro magic and #ifdef
  mess. The timers.c code is very "tightly coded" around jiffies and
  32-bitness assumptions, and has been honed and micro-optimized for a
  relatively narrow use case (jiffies in a relatively narrow HZ range)
  for many years - and thus even small extensions to it easily break
  the wheel concept, leading to even worse compromises. The timer wheel
  code is very good and tight code, and there are zero problems with it
  in its current usage - but it is simply not suitable to be extended
  for high-res timers.

- the unpredictable [O(N)] overhead of cascading leads to delays which
  necessitate a more complex handling of high-resolution timers, which
  in turn decreases robustness. Such a design still leads to rather large
  timing inaccuracies. Cascading is a fundamental property of the timer
  wheel concept; it cannot be 'designed out' without inevitably
  degrading other portions of the timers.c code in an unacceptable way.

- the implementation of the current posix-timer subsystem on top of
  the timer wheel has already introduced a quite complex handling of
  the required readjusting of absolute CLOCK_REALTIME timers at
  settimeofday or NTP time - further underlining our experience by
  example: the timer wheel data structure is too rigid for high-res
  timers.

- the timer wheel code is best suited to use cases which can be
  identified as "timeouts". Such timeouts are usually set up to cover
  error conditions in various I/O paths, such as networking and block
  I/O. The vast majority of those timers never expire and are rarely
  recascaded because the expected correct event arrives in time so they
  can be removed from the timer wheel before any further processing of
  them becomes necessary. Thus the users of these timeouts can accept
  the granularity and precision tradeoffs of the timer wheel, and
  largely expect the timer subsystem to have near-zero overhead.
  Accurate timing for them is not a core purpose - in fact most of the
  timeout values used are ad-hoc. For them it is at most a necessary
  evil to guarantee the processing of actual timeout completions
  (because most of the timeouts are deleted before completion), which
  should thus be as cheap and unintrusive as possible.

The primary users of precision timers are user-space applications that
utilize nanosleep, posix-timers and itimer interfaces. Also, in-kernel
users like drivers and subsystems which require precise timed events
(e.g. multimedia) can benefit from the availability of a separate
high-resolution timer subsystem as well.

While this subsystem does not offer high-resolution clock sources just
yet, the hrtimer subsystem can be easily extended with high-resolution
clock capabilities, and patches for that exist and are maturing quickly.
The increasing demand for realtime and multimedia applications along
with other potential users for precise timers gives another reason to
separate the "timeout" and "precise timer" subsystems.

Another potential benefit is that such a separation allows even more
special-purpose optimization of the existing timer wheel for the low
resolution and low precision use cases - once the precision-sensitive
APIs are separated from the timer wheel and are migrated over to
hrtimers. E.g. we could decrease the frequency of the timeout subsystem
from 250 Hz to 100 Hz (or even lower).

hrtimer subsystem implementation details
----------------------------------------

The basic design considerations were:

- simplicity

- data structure not bound to jiffies or any other granularity. All the
  kernel logic works at 64-bit nanoseconds resolution - no compromises.

- simplification of existing, timing-related kernel code

Another basic requirement was the immediate enqueueing and ordering of
timers at activation time. After looking at several possible solutions
such as radix trees and hashes, we chose the red-black tree as the basic
data structure. Rbtrees are available as a library in the kernel and are
used in various performance-critical areas of e.g. memory management and
file systems. The rbtree is solely used for time-sorted ordering, while
a separate list is used to give the expiry code fast access to the
queued timers, without having to walk the rbtree.

(This separate list is also useful for later when we'll introduce
high-resolution clocks, where we need separate pending and expired
queues while keeping the time-order intact.)

Time-ordered enqueueing is not purely for the purposes of
high-resolution clocks though, it also simplifies the handling of
absolute timers based on a low-resolution CLOCK_REALTIME. The existing
implementation needed to keep an extra list of all armed absolute
CLOCK_REALTIME timers along with complex locking. In case of
settimeofday and NTP, all the timers (!) had to be dequeued, the
time-changing code had to fix them up one by one, and all of them had to
be enqueued again. The time-ordered enqueueing and the storage of the
expiry time in absolute time units removes all this complex and poorly
scaling code from the posix-timer implementation - the clock can simply
be set without having to touch the rbtree. This also makes the handling
of posix-timers simpler in general.
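
As a rough illustration of the two ideas above (time-ordered enqueueing
at activation time, and expiry stored in absolute time units), here is a
minimal user-space sketch. It is not the kernel code: a sorted singly
linked list stands in for the rbtree (so insertion is O(N) here rather
than O(log N)), there is no locking, and all names (struct sketch_timer,
sketch_enqueue(), sketch_run_expired()) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical timer: the expiry is an absolute time in nanoseconds. */
struct sketch_timer {
    int64_t expires_ns;
    int fired;
    struct sketch_timer *next;
};

/* Hypothetical queue: 'first' always points at the earliest timer. */
struct sketch_queue {
    struct sketch_timer *first;
};

/*
 * Insert the timer at its time-ordered position right away.
 * A sorted list is used here only to keep the sketch short; the
 * subsystem described above uses an rbtree for this ordering.
 */
static inline void sketch_enqueue(struct sketch_queue *q,
                                  struct sketch_timer *t)
{
    struct sketch_timer **link = &q->first;

    while (*link && (*link)->expires_ns <= t->expires_ns)
        link = &(*link)->next;
    t->fired = 0;
    t->next = *link;
    *link = t;
}

/*
 * Expire every timer whose absolute expiry is <= now_ns and return
 * how many were expired. Because expiry times are absolute, a clock
 * change only changes the now_ns passed in; the queue is not touched.
 */
static inline int sketch_run_expired(struct sketch_queue *q, int64_t now_ns)
{
    int count = 0;

    while (q->first && q->first->expires_ns <= now_ns) {
        struct sketch_timer *t = q->first;

        q->first = t->next;
        t->next = NULL;
        t->fired = 1;
        count++;
    }
    return count;
}
```

Enqueueing timers that expire at 300, 100 and 200 leaves the 100 timer
at the head of the queue. Calling sketch_run_expired() with a now_ns of
250 (for example after the clock was set forward) expires the first two
timers, and the time-setting path never has to dequeue or fix up
anything.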

The locking and per-CPU behavior of hrtimers was mostly taken from the
existing timer wheel code, as it is mature and well suited. Sharing code
was not really a win, due to the different data structures. Also, the
hrtimer functions now have clearer behavior and clearer names - such as
hrtimer_try_to_cancel() and hrtimer_cancel() [which are roughly
equivalent to del_timer() and del_timer_sync()] - so there's no direct
1:1 mapping between them on the algorithmic level, and thus no real
potential for code sharing either.

Basic data types: every time value, absolute or relative, is in a
special nanosecond-resolution type: ktime_t. The kernel-internal
representation of ktime_t values and operations is implemented via
macros and inline functions, and can be switched between a "hybrid
union" type and a plain "scalar" 64-bit nanoseconds representation (at
compile time). The hybrid union type optimizes time conversions on 32-bit
CPUs. This build-time-selectable ktime_t storage format was implemented
to avoid the performance impact of 64-bit multiplications and divisions
on 32-bit CPUs. Such operations are frequently necessary to convert
between the storage formats provided by kernel and userspace interfaces
and the internal time format. (See include/linux/ktime.h for further
details.)
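
To make the "scalar" representation more concrete, the following
user-space sketch models a time value as a signed 64-bit nanosecond
count and converts to and from struct timespec. It only illustrates the
idea; the type and helper names (sketch_ktime_t, sketch_ktime_set() and
so on) are hypothetical and are not the definitions found in
include/linux/ktime.h.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for a scalar time type: signed 64-bit ns. */
typedef int64_t sketch_ktime_t;

#define SKETCH_NSEC_PER_SEC 1000000000LL

/* Build a scalar value from seconds and nanoseconds (64-bit multiply). */
static inline sketch_ktime_t sketch_ktime_set(int64_t secs, long nsecs)
{
    return secs * SKETCH_NSEC_PER_SEC + nsecs;
}

/* Convert the timespec format used by userspace interfaces. */
static inline sketch_ktime_t sketch_timespec_to_ktime(struct timespec ts)
{
    return sketch_ktime_set((int64_t)ts.tv_sec, ts.tv_nsec);
}

/* With a scalar representation, adding two values is a plain addition. */
static inline sketch_ktime_t sketch_ktime_add(sketch_ktime_t a,
                                              sketch_ktime_t b)
{
    return a + b;
}

/*
 * Convert back to timespec. This needs a 64-bit division and modulo,
 * which is the kind of operation that is expensive on 32-bit CPUs.
 * The sketch assumes kt is non-negative.
 */
static inline struct timespec sketch_ktime_to_timespec(sketch_ktime_t kt)
{
    struct timespec ts;

    ts.tv_sec = (time_t)(kt / SKETCH_NSEC_PER_SEC);
    ts.tv_nsec = (long)(kt % SKETCH_NSEC_PER_SEC);
    return ts;
}
```

The conversion helpers are where the 64-bit multiplication and division
show up, which is why a format that keeps seconds and nanoseconds apart
can be cheaper on 32-bit CPUs.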

hrtimers - rounding of timer values
-----------------------------------

The hrtimer code will round timer events to lower-resolution clocks
because it has to. Otherwise it will do no artificial rounding at all.

One question is what resolution value should be returned to the user by
the clock_getres() interface. This will return whatever real resolution
a given clock has - be it low-res, high-res, or artificially-low-res.
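
From user space, the reported resolution can be queried with the POSIX
clock_getres() call. The helper below is a small sketch of such a query;
the function name sketch_clock_resolution_ns() is hypothetical, and the
value it returns depends on the kernel and clock in use.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/*
 * Return the resolution reported for the given clock in nanoseconds,
 * or -1 if clock_getres() fails (for example for an unsupported clock).
 */
static inline int64_t sketch_clock_resolution_ns(clockid_t clk)
{
    struct timespec res;

    if (clock_getres(clk, &res) != 0)
        return -1;
    return (int64_t)res.tv_sec * 1000000000LL + res.tv_nsec;
}
```

Calling sketch_clock_resolution_ns(CLOCK_REALTIME) or
sketch_clock_resolution_ns(CLOCK_MONOTONIC) shows whether a given system
reports a low-res or a high-res clock.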

hrtimers - testing and verification
-----------------------------------

We used the high-resolution clock subsystem on top of hrtimers to verify
the hrtimer implementation details in practice, and we also ran the posix
timer tests in order to ensure specification compliance. We also ran
tests on low-resolution clocks.

The hrtimer patch converts the following kernel functionality to use
hrtimers:

 - nanosleep
 - itimers
 - posix-timers

The conversion of nanosleep and posix-timers enabled the unification of
nanosleep and clock_nanosleep.
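
For reference, this is roughly how a user-space program exercises the
clock_nanosleep() interface with an absolute deadline. It is a sketch
only: the helper name sketch_sleep_for_ns() is hypothetical, the delay
is assumed to be non-negative and below one second, and nothing here
shows how the kernel implements the call.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stddef.h>
#include <time.h>

/*
 * Sleep until "now + delay_ns" on CLOCK_MONOTONIC, using an absolute
 * deadline so that restarts after a signal do not accumulate drift.
 * Returns 0 on success or an errno-style error number. If deadline_out
 * is not NULL, the computed deadline is stored there.
 * Assumes 0 <= delay_ns < 1000000000.
 */
static inline int sketch_sleep_for_ns(long delay_ns,
                                      struct timespec *deadline_out)
{
    struct timespec deadline;
    int err;

    if (clock_gettime(CLOCK_MONOTONIC, &deadline) != 0)
        return errno;

    /* Add the delay and normalize the nanosecond field. */
    deadline.tv_nsec += delay_ns;
    while (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_nsec -= 1000000000L;
        deadline.tv_sec++;
    }

    /* clock_nanosleep() returns the error number directly. */
    do {
        err = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME,
                              &deadline, NULL);
    } while (err == EINTR);

    if (deadline_out != NULL)
        *deadline_out = deadline;
    return err;
}
```

A relative nanosleep() call can be expressed through the same interface
by passing 0 instead of TIMER_ABSTIME.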

The code was successfully compiled for the following platforms:

 i386, x86_64, ARM, PPC, PPC64, IA64

The code was run-tested on the following platforms:

 i386(UP/SMP), x86_64(UP/SMP), ARM, PPC

hrtimers were also integrated into the -rt tree, along with an
hrtimers-based high-resolution clock implementation, so the hrtimers
code got a healthy amount of testing and use in practice.

        Thomas Gleixner, Ingo Molnar