=====================
Scheduler Nice Design
=====================

This document explains the thinking about the revamped and streamlined
nice-levels implementation in the new Linux scheduler.

Nice levels were always pretty weak under Linux and people continuously
pestered us to make nice +19 tasks use up much less CPU time.

Unfortunately that was not that easy to implement under the old
scheduler (otherwise we'd have done it long ago), because nice level
support was historically coupled to timeslice length, and timeslice
units were driven by the HZ tick, so the smallest timeslice was 1/HZ.

In the O(1) scheduler (in 2003) we changed negative nice levels to be
much stronger than they were before in 2.4 (and people were happy about
that change), and we also intentionally calibrated the linear timeslice
rule so that the nice +19 level would be _exactly_ 1 jiffy. To better
understand it, the timeslice graph went like this (cheesy ASCII art
alert!)::


                   A
             \     | [timeslice length]
              \    |
               \   |
                \  |
                 \ |
                  \|___100msecs
                   |^ . _
                   |      ^ . _
                   |            ^ . _
 -*----------------------------------*-----> [nice level]
 -20               |                +19
                   |
                   |

That way, if someone really wanted to renice tasks, +19 would give a
much bigger hit than the normal linear rule would. (The solution of
changing the ABI to extend priorities was discarded early on.)

This approach worked to some degree for some time, but later on with
HZ=1000 it caused 1 jiffy to be 1 msec, which meant about 1% CPU usage,
which we felt to be a bit excessive. Excessive _not_ because it's too
small a CPU utilization, but because it causes too frequent (once per
millisec) rescheduling (and would thus thrash the cache, etc. Remember,
this was long ago when hardware was weaker and caches were smaller, and
people were running number crunching apps at nice +19).

So for HZ=1000 we changed nice +19 to 5msecs, because that felt like the
right minimal granularity - and this translates to 5% CPU utilization.
But the fundamental HZ-sensitive property for nice +19 still remained,
and we never got a single complaint about nice +19 being too _weak_ in
terms of CPU utilization; we only got complaints about it (still) being
too _strong_ :-)
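
The HZ dependence can be made concrete with a little arithmetic. The
following is a minimal user-space sketch, not kernel code. It assumes
that the nice +19 task competes with a single nice 0 CPU hog that gets
the 100 msec timeslice shown in the graph, and the helper names are
made up for this illustration:

```c
/* Length of one jiffy in milliseconds for a given HZ value. */
double jiffy_msecs(int hz)
{
    return 1000.0 / hz;
}

/*
 * Approximate CPU share (in percent) of a nice +19 task that runs for
 * nice19_slice_msecs per round, while one competing nice 0 task runs
 * for nice0_slice_msecs. Assumption: both tasks are always runnable
 * and simply alternate full timeslices.
 */
double nice19_share_pct(double nice19_slice_msecs, double nice0_slice_msecs)
{
    return 100.0 * nice19_slice_msecs /
           (nice19_slice_msecs + nice0_slice_msecs);
}
```

Under these assumptions nice19_share_pct(5.0, 100.0) comes out just
under 5%, while a one-jiffy timeslice at HZ=100, that is
nice19_share_pct(jiffy_msecs(100), 100.0), comes out at roughly 9%.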

To sum it up: we always wanted to make nice levels more consistent, but
within the constraints of HZ and jiffies and their nasty design-level
coupling to timeslices and granularity it was not really viable.

The second (less frequent but still periodically occurring) complaint
about Linux's nice level support was its asymmetry around the origin
(which you can see demonstrated in the picture above), or more
accurately: the fact that nice level behavior depended on the _absolute_
nice level as well, while the nice API itself is fundamentally
"relative"::

   int nice(int inc);

   asmlinkage long sys_nice(int increment)

(The first one is the glibc API, the second one is the syscall API.)
Note that the 'inc' is relative to the current nice level. Tools like
the "nice" command, as run from a shell such as bash, mirror this
relative API.
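
A small user-space sketch of these relative semantics, using the
standard nice() and getpriority() calls. The helper names and the
NICE_FAILED sentinel are hypothetical, chosen only for this
illustration:

```c
#include <errno.h>
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical sentinel: outside the valid -20..+19 nice range. */
#define NICE_FAILED 100

/* Read the calling process's current (absolute) nice level. */
int current_nice_level(void)
{
    int level;

    errno = 0;
    level = getpriority(PRIO_PROCESS, 0);

    /* -1 is a legitimate nice level, so errno decides. */
    if (level == -1 && errno != 0)
        return NICE_FAILED;
    return level;
}

/*
 * Change the nice level by 'inc' relative to the current level and
 * return the resulting absolute level, or NICE_FAILED on error.
 */
int renice_self_by(int inc)
{
    int level;

    errno = 0;
    level = nice(inc);

    if (level == -1 && errno != 0)
        return NICE_FAILED;
    return level;
}
```

Calling renice_self_by(1) from a process at nice 0 yields 1, and calling
it from a process at nice +5 yields 6 (the result is capped at +19). The
argument is always an increment, never an absolute level.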

With the old scheduler, if you started, for example, one niced task with
+1 and another with +2, the CPU split between the two tasks would depend
on the nice level of the parent shell - if it was at nice -10 the CPU
split was different than if it was at +5 or +10.
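
To see why the split depended on the absolute level, consider a purely
illustrative linear timeslice rule for the positive nice range. This is
not the old scheduler's code: the 100 msec timeslice at nice 0 is taken
from the graph, the 5 msec timeslice at nice +19 from the HZ=1000 case,
and the straight line between them, as well as the function names, are
assumptions:

```c
/*
 * Hypothetical linear rule: 100 msecs at nice 0, falling in equal
 * steps to 5 msecs at nice +19. Only valid for 0 <= nice <= 19.
 */
double linear_timeslice_msecs(int nice)
{
    return 100.0 - 5.0 * nice;
}

/*
 * CPU share (in percent) of task A when two always-runnable tasks
 * alternate full timeslices.
 */
double linear_split_pct(int nice_a, int nice_b)
{
    double a = linear_timeslice_msecs(nice_a);
    double b = linear_timeslice_msecs(nice_b);

    return 100.0 * a / (a + b);
}
```

With this rule linear_split_pct(1, 2) is about 51.4%, but
linear_split_pct(17, 18) is 60%: the same one-level difference gives a
different split depending on the level the tasks started from.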

A third complaint against Linux's nice level support was that negative
nice levels were not 'punchy enough', so lots of people had to resort to
running audio (and other multimedia) apps under RT priorities such as
SCHED_FIFO. But this caused other problems: SCHED_FIFO is not starvation
proof, and a buggy SCHED_FIFO app can also lock up the system for good.

The new scheduler in v2.6.23 addresses all three types of complaints:

To address the first complaint (of positive nice levels not being
"punchy" enough), the scheduler was decoupled from 'time slice' and HZ
concepts (and granularity was made a separate concept from nice levels)
and thus it was possible to implement better and more consistent nice
+19 support: with the new scheduler nice +19 tasks get a HZ-independent
1.5% of the CPU, instead of the variable 3%-5%-9% range they got in the
old scheduler.

To address the second complaint (of nice levels not being consistent),
the new scheduler makes a nice(1) call have the same CPU utilization
effect on tasks, regardless of their absolute nice levels. So on the new
scheduler, running a nice +10 and a nice +11 task has the same CPU
utilization "split" between them as running a nice -5 and a nice -4
task. (One will get 55% of the CPU, the other 45%.) That is why nice
levels were changed to be "multiplicative" (or exponential) - that way
it does not matter which nice level you start out from, the 'relative
result' will always be the same.
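
A minimal sketch of such a multiplicative rule, assuming each nice level
changes a task's weight by a factor of 1.25 and that nice 0 has a weight
of 1024. Both constants and the function names are assumptions picked to
roughly reproduce the 55%/45% split; this does not reproduce the
scheduler's actual weight table:

```c
/* Assumed constants, see the text above: not taken from kernel source. */
#define NICE_0_WEIGHT   1024.0
#define WEIGHT_STEP     1.25

/* Weight of a task: each nice level up divides the weight by the step. */
double nice_to_weight(int nice)
{
    double weight = NICE_0_WEIGHT;
    int i;

    for (i = 0; i < nice; i++)
        weight /= WEIGHT_STEP;
    for (i = 0; i > nice; i--)
        weight *= WEIGHT_STEP;
    return weight;
}

/* CPU share (in percent) of task A competing with task B. */
double weighted_split_pct(int nice_a, int nice_b)
{
    double a = nice_to_weight(nice_a);
    double b = nice_to_weight(nice_b);

    return 100.0 * a / (a + b);
}
```

Here weighted_split_pct(10, 11) and weighted_split_pct(-5, -4) both come
out at about 55.6%, because only the difference between the two levels
matters. Under the same assumptions weighted_split_pct(19, 0) is about
1.4%, in line with the small HZ-independent share quoted for nice +19
tasks.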

The third complaint (of negative nice levels not being "punchy" enough
and forcing audio apps to run under the more dangerous SCHED_FIFO
scheduling policy) is addressed by the new scheduler almost
automatically: stronger negative nice levels are an automatic
side-effect of the recalibrated dynamic range of nice levels.