===========================================================
Clock sources, Clock events, sched_clock() and delay timers
===========================================================

This document tries to briefly explain some basic kernel timekeeping
abstractions. It partly pertains to the drivers usually found in
drivers/clocksource in the kernel tree, but the code may be spread out
across the kernel.

If you grep through the kernel source you will find a number of
architecture-specific implementations of clock sources, clock events and
several likewise architecture-specific overrides of the sched_clock()
function and some delay timers.

To provide timekeeping for your platform, the clock source provides
the basic timeline, whereas clock events fire interrupts at certain points
on this timeline, providing facilities such as high-resolution timers.
sched_clock() is used for scheduling and timestamping, and delay timers
provide an accurate delay source using hardware counters.


Clock sources
-------------

The purpose of the clock source is to provide a timeline for the system that
tells you where you are in time. For example, issuing the command 'date' on
a Linux system will eventually read the clock source to determine exactly
what time it is.

Typically the clock source is a monotonic, atomic counter which provides
n bits that count from 0 to (2^n)-1 and then wrap around to 0 and start over.
It will ideally NEVER stop ticking as long as the system is running. It
may stop during system suspend.

The clock source shall have as high a resolution as possible, and the
frequency shall be as stable and correct as possible compared to a
real-world wall clock. It should not move unpredictably back and forth in
time or miss a few cycles here and there.

It must be immune to the kind of effects that occur in hardware where, for
example, the counter register is read in two phases on the bus: the lowest
16 bits first and the higher 16 bits in a second bus cycle. If the counter
bits are updated in between, the combined value can be very strange.
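
A common way for a driver to get this immunity is to read the high half,
then the low half, then the high half again, and to retry if the high half
changed. The following is a minimal user-space sketch of that idea, not code
taken from the kernel: struct demo_split_counter and the demo_fake_* helpers
are hypothetical and merely stand in for two 16-bit register reads.

```c
#include <stdint.h>

/* Hypothetical accessors standing in for two 16-bit register reads. */
struct demo_split_counter {
	uint16_t (*read_hi)(void *ctx);
	uint16_t (*read_lo)(void *ctx);
	void *ctx;
};

/*
 * Read a 32-bit counter exposed as two 16-bit halves.  If the high half
 * changed while the low half was being read, the low half may belong to
 * either side of the carry, so read again until the high half is stable.
 */
static uint32_t demo_split_counter_read(const struct demo_split_counter *c)
{
	uint16_t hi, lo, hi2;

	hi2 = c->read_hi(c->ctx);
	do {
		hi = hi2;
		lo = c->read_lo(c->ctx);
		hi2 = c->read_hi(c->ctx);
	} while (hi != hi2);

	return ((uint32_t)hi << 16) | lo;
}

/* Fake hardware for demonstration: the counter advances on every access. */
struct demo_fake_hw {
	uint32_t value;
	uint32_t step;
};

static uint16_t demo_fake_read_hi(void *ctx)
{
	struct demo_fake_hw *hw = ctx;
	uint16_t hi = (uint16_t)(hw->value >> 16);

	hw->value += hw->step;
	return hi;
}

static uint16_t demo_fake_read_lo(void *ctx)
{
	struct demo_fake_hw *hw = ctx;
	uint16_t lo = (uint16_t)(hw->value & 0xffff);

	hw->value += hw->step;
	return lo;
}
```

With the fake counter starting at 0x0001ffff, a naive "high then low" read
would combine a high half of 0x0001 with a low half of 0x0000, which is almost
a full 16-bit range behind the real value. The retry loop notices the changed
high half and reads again.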

When the wall-clock accuracy of the clock source isn't satisfactory, there
are various quirks and layers in the timekeeping code for e.g. synchronizing
the user-visible time to RTC clocks in the system or against networked time
servers using NTP, but all they basically do is update an offset against
the clock source, which provides the fundamental timeline for the system.
These measures do not affect the clock source per se, they only adapt the
system to its shortcomings.

The clock source struct shall provide means to translate the provided counter
into a nanosecond value as an unsigned long long (unsigned 64 bit) number.
Since this operation may be invoked very often, doing this in a strict
mathematical sense is not desirable: instead the number is taken as close as
possible to a nanosecond value using only the arithmetic operations
multiply and shift, so in clocksource_cyc2ns() you find:

  ns ~= (clocksource * mult) >> shift
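
To make the arithmetic concrete, here is a small user-space sketch of how a
mult value can be derived from a counter frequency and a fixed shift, and how
it is then applied. The demo_* names are hypothetical; the rounding shown is
one reasonable choice and is not claimed to match the kernel helpers line by
line.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Pick mult for a given counter frequency and a fixed shift so that
 * (cycles * mult) >> shift approximates nanoseconds.  One cycle lasts
 * NSEC_PER_SEC / hz nanoseconds; scaling by 2^shift keeps the fraction.
 * Adding hz / 2 rounds to nearest instead of truncating.
 */
static uint32_t demo_hz2mult(uint32_t hz, uint32_t shift)
{
	uint64_t tmp = NSEC_PER_SEC << shift;

	tmp += hz / 2;
	return (uint32_t)(tmp / hz);
}

/*
 * Convert a cycle count to nanoseconds.  The product can overflow 64 bits
 * for large cycle counts, so callers must keep 'cycles' small enough.
 */
static uint64_t demo_cyc2ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
	return (cycles * mult) >> shift;
}
```

For a 100 MHz counter and a shift of 24, each cycle is exactly 10 ns, so mult
comes out as 10 * 2^24 and one second's worth of cycles converts to
1000000000 ns. A larger shift gives more precision but leaves fewer bits
before the multiplication overflows.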

You will find a number of helper functions in the clock source code intended
to aid in providing these mult and shift values, such as
clocksource_khz2mult() and clocksource_hz2mult(), which help determine the
mult factor from a fixed shift, and clocksource_register_hz() and
clocksource_register_khz(), which help assign both shift and mult
factors using the frequency of the clock source as the only input.

For really simple clock sources accessed from a single I/O memory location
there is nowadays even clocksource_mmio_init() which will take a memory
location, bit width, a parameter telling whether the counter in the
register counts up or down, and the timer clock rate, and then conjure all
necessary parameters.

Since a 32-bit counter at say 100 MHz will wrap around to zero after some 43
seconds, the code handling the clock source will have to compensate for this.
That is the reason why the clock source struct also contains a 'mask'
member telling how many bits of the source are valid. This way the timekeeping
code knows when the counter will wrap around and can insert the necessary
compensation code on both sides of the wrap point so that the system timeline
remains monotonic.
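
The role of the mask can be illustrated with a short user-space sketch. The
helper names are hypothetical; the point is only that subtracting two counter
reads and masking the result to the counter width yields the right number of
elapsed cycles even across a wrap.

```c
#include <stdint.h>

/* Mask with the low 'bits' bits set, for a counter that is 'bits' wide. */
static uint64_t demo_mask(unsigned int bits)
{
	return bits < 64 ? (1ULL << bits) - 1 : ~0ULL;
}

/*
 * Cycles elapsed between two reads of a free-running counter.  The
 * subtraction is done modulo 2^64 and then cut down to the counter width,
 * so the result is right even if the counter wrapped once in between.
 */
static uint64_t demo_delta(uint64_t now, uint64_t last, uint64_t mask)
{
	return (now - last) & mask;
}
```

This only works if the counter is sampled at least once per wrap period,
which for the 32-bit, 100 MHz example above means at least once every 43
seconds or so.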


Clock events
------------

Clock events are the conceptual reverse of clock sources: they take a
desired time specification value and calculate the values to poke into
hardware timer registers.

Clock events are orthogonal to clock sources. The same hardware
and register range may be used for the clock event, but it is essentially
a different thing. The hardware driving clock events has to be able to
fire interrupts, so as to trigger events on the system timeline. On an SMP
system, it is ideal (and customary) to have one such event-driving timer per
CPU core, so that each core can trigger events independently of any other
core.

You will notice that the clock event device code is based on the same basic
idea about translating counters to nanoseconds using mult and shift
arithmetic, and you find the same family of helper functions again for
assigning these values. The clock event driver does not need a 'mask'
attribute however: the system will not try to plan events beyond the time
horizon of the clock event.
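
As an illustration of the "reverse" direction, the sketch below converts a
requested delay in nanoseconds into timer ticks with the same mult and shift
arithmetic, after clamping the request to what the hardware can express.
struct demo_clock_event and its fields are a hypothetical, simplified model
and not the kernel's clock event device struct.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical, simplified stand-in for a clock event device. */
struct demo_clock_event {
	uint32_t mult;
	uint32_t shift;
	uint64_t min_delta_ns;	/* shortest programmable interval */
	uint64_t max_delta_ns;	/* time horizon of the hardware timer */
};

/* mult such that (ns * mult) >> shift approximates timer ticks. */
static uint32_t demo_ns2ticks_mult(uint32_t hz, uint32_t shift)
{
	return (uint32_t)(((uint64_t)hz << shift) / NSEC_PER_SEC);
}

/* Clamp the requested delay to the device limits, then convert it. */
static uint64_t demo_ns_to_ticks(const struct demo_clock_event *ev,
				 uint64_t delta_ns)
{
	if (delta_ns < ev->min_delta_ns)
		delta_ns = ev->min_delta_ns;
	if (delta_ns > ev->max_delta_ns)
		delta_ns = ev->max_delta_ns;

	return (delta_ns * ev->mult) >> ev->shift;
}
```

Because requests are clamped to max_delta_ns before conversion, the tick
count never exceeds what the timer register can hold, which is why no mask
is needed here.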


sched_clock()
-------------

In addition to the clock sources and clock events there is a special weak
function in the kernel called sched_clock(). This function shall return the
number of nanoseconds since the system was started. An architecture may or
may not provide an implementation of sched_clock() on its own. If a local
implementation is not provided, the system jiffy counter will be used as
sched_clock().

As the name suggests, sched_clock() is used for scheduling the system, for
example to determine the absolute timeslice for a certain process in the CFS
scheduler. It is also used for printk timestamps when you have chosen to
include time information in printk for things like bootcharts.

Compared to clock sources, sched_clock() has to be very fast: it is called
much more often, especially by the scheduler. If you have to trade accuracy
against speed, you may sacrifice accuracy for speed in sched_clock(). It
does however require some of the same basic characteristics as the clock
source, i.e. it should be monotonic.

The sched_clock() function may wrap only on unsigned long long boundaries,
i.e. after 64 bits. Since this is a nanosecond value this will mean it wraps
after circa 585 years. (For most practical systems this means "never".)

If an architecture does not provide its own implementation of this function,
it will fall back to using jiffies, limiting its resolution to one jiffy,
i.e. 1/HZ seconds for the architecture's HZ setting. This will affect
scheduling accuracy and will likely show up in system benchmarks.

The clock driving sched_clock() may stop or reset to zero during system
suspend/sleep. This does not matter to its function of scheduling
events on the system. However it may result in interesting timestamps in
printk().

The sched_clock() function should be callable in any context: it should be
IRQ- and NMI-safe and return a sane value wherever it is called.

Some architectures may have a limited set of time sources and lack a nice
counter to derive a 64-bit nanosecond value, so for example on the ARM
architecture, special helper functions have been created to provide a
sched_clock() nanosecond base from a 16- or 32-bit counter. Sometimes the
same counter that is also used as clock source is used for this purpose.
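
The general idea behind such helpers can be sketched in user space as
follows. This is only an illustration under simplifying assumptions: the
demo_* names are hypothetical, and the sketch ignores the problem of making
concurrent readers safe, which a real implementation has to solve.

```c
#include <stdint.h>

/* Hypothetical state for extending a narrow counter to 64-bit nanoseconds. */
struct demo_sched_clock {
	uint64_t epoch_ns;	/* nanoseconds accumulated at the last update */
	uint64_t epoch_cyc;	/* raw counter value at the last update */
	uint64_t mask;		/* valid bits of the hardware counter */
	uint32_t mult;
	uint32_t shift;
};

static uint64_t demo_cyc_to_ns(uint64_t cyc, uint32_t mult, uint32_t shift)
{
	return (cyc * mult) >> shift;
}

/* Nanoseconds since start, given the current raw counter value. */
static uint64_t demo_sched_clock_read(const struct demo_sched_clock *sc,
				      uint64_t cyc)
{
	uint64_t delta = (cyc - sc->epoch_cyc) & sc->mask;

	return sc->epoch_ns + demo_cyc_to_ns(delta, sc->mult, sc->shift);
}

/*
 * Fold the elapsed time into the epoch.  This has to run at least once
 * per wrap period of the counter, for example from a periodic timer.
 */
static void demo_sched_clock_update(struct demo_sched_clock *sc, uint64_t cyc)
{
	sc->epoch_ns = demo_sched_clock_read(sc, cyc);
	sc->epoch_cyc = cyc;
}
```

With a 16-bit counter running at 1 MHz (mult 1000, shift 0), an update at raw
value 60000 followed by a read at raw value 4464 reports 70 ms: the counter
wrapped in between, but the masked delta of 10000 cycles is still correct.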

On SMP systems, it is crucial for performance that sched_clock() can be called
independently on each CPU without any synchronization performance hits.
Some hardware (such as the x86 TSC) will cause the sched_clock() function to
drift between the CPUs on the system. The kernel can work around this by
enabling the CONFIG_HAVE_UNSTABLE_SCHED_CLOCK option. This is another aspect
that makes sched_clock() different from the ordinary clock source.


Delay timers (some architectures only)
--------------------------------------

On systems with variable CPU frequency, the various kernel delay() functions
will sometimes behave strangely. Basically these delays usually use a hard
loop to delay a certain number of jiffy fractions using a "lpj" (loops per
jiffy) value, calibrated on boot.

Let's hope that your system is running on maximum frequency when this value
is calibrated: if so, then when the frequency is geared down to half the
full frequency, any delay() will be twice as long. Usually this does not
hurt, as you're commonly requesting that amount of delay *or more*. But
basically the semantics are quite unpredictable on such systems.

Enter timer-based delays. Using these, a timer read may be used instead of
a hard-coded loop for providing the desired delay.

This is done by declaring a struct delay_timer and assigning the appropriate
function pointers and rate settings for this delay timer.
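
The principle of a timer-based delay can be shown with a small user-space
model. struct demo_delay_timer below is hypothetical: it only mirrors the
"read function plus rate" shape described above and is not the kernel's
struct delay_timer. The fake timer exists purely so the sketch can run
without hardware.

```c
#include <stdint.h>

/* Hypothetical user-space model of a timer-based delay source. */
struct demo_delay_timer {
	uint64_t (*read_current_timer)(void *ctx);
	uint64_t freq;		/* timer ticks per second */
	void *ctx;
};

/*
 * Busy-wait for at least 'usecs' microseconds by watching the timer
 * instead of counting loop iterations, so that a change in CPU frequency
 * does not change the length of the delay.
 */
static void demo_udelay(const struct demo_delay_timer *t, uint64_t usecs)
{
	uint64_t start = t->read_current_timer(t->ctx);
	/* Round up so the delay is never shorter than requested. */
	uint64_t ticks = (usecs * t->freq + 999999) / 1000000;

	while (t->read_current_timer(t->ctx) - start < ticks)
		;
}

/* Fake timer for demonstration: advances by 'step' ticks on every read. */
struct demo_fake_timer {
	uint64_t now;
	uint64_t step;
};

static uint64_t demo_fake_read(void *ctx)
{
	struct demo_fake_timer *f = ctx;

	f->now += f->step;
	return f->now;
}
```

With a 24 MHz timer, a 10 microsecond delay corresponds to 240 ticks, and the
loop returns only once the timer has advanced by at least that much, however
fast or slow the CPU executes the loop itself.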

This is available on some architectures like OpenRISC or ARM.