Documentation/scheduler/schedutil.rst

0001 =========
0002 Schedutil
0003 =========
0004
0005 .. note::
0006
0007    All this assumes a linear relation between frequency and work capacity,
0008    we know this is flawed, but it is the best workable approximation.
0009
0010
0011 PELT (Per Entity Load Tracking)
0012 ===============================
0013
0014 With PELT we track some metrics across the various scheduler entities, from
0015 individual tasks to task-group slices to CPU runqueues. As the basis for this
0016 we use an Exponentially Weighted Moving Average (EWMA), each period (1024us)
0017 is decayed such that y^32 = 0.5. That is, the most recent 32ms contribute
0018 half, while the rest of history contribute the other half.
0019
0020 Specifically:
0021
0022   ewma_sum(u) := u_0 + u_1*y + u_2*y^2 + ...
0023
0024   ewma(u) = ewma_sum(u) / ewma_sum(1)
0025
0026 Since this is essentially a progression of an infinite geometric series, the
0027 results are composable, that is ewma(A) + ewma(B) = ewma(A+B). This property
0028 is key, since it gives the ability to recompose the averages when tasks move
0029 around.
0030
0031 Note that blocked tasks still contribute to the aggregates (task-group slices
0032 and CPU runqueues), which reflects their expected contribution when they
0033 resume running.
0034
0035 Using this we track 2 key metrics: 'running' and 'runnable'. 'Running'
0036 reflects the time an entity spends on the CPU, while 'runnable' reflects the
0037 time an entity spends on the runqueue. When there is only a single task these
0038 two metrics are the same, but once there is contention for the CPU 'running'
0039 will decrease to reflect the fraction of time each task spends on the CPU
0040 while 'runnable' will increase to reflect the amount of contention.
0041
0042 For more detail see: kernel/sched/pelt.c
0043
0044
0045 Frequency / CPU Invariance
0046 ==========================
0047
0048 Because consuming the CPU for 50% at 1GHz is not the same as consuming the CPU
0049 for 50% at 2GHz, nor is running 50% on a LITTLE CPU the same as running 50% on
0050 a big CPU, we allow architectures to scale the time delta with two ratios, one
0051 Dynamic Voltage and Frequency Scaling (DVFS) ratio and one microarch ratio.
0052
0053 For simple DVFS architectures (where software is in full control) we trivially
0054 compute the ratio as::
0055
0056             f_cur
0057   r_dvfs := -----
0058             f_max
0059
0060 For more dynamic systems where the hardware is in control of DVFS we use
0061 hardware counters (Intel APERF/MPERF, ARMv8.4-AMU) to provide us this ratio.
0062 For Intel specifically, we use::
0063
0064            APERF
0065   f_cur := ----- * P0
0066            MPERF
0067
0068              4C-turbo;  if available and turbo enabled
0069   f_max := { 1C-turbo;  if turbo enabled
0070              P0;        otherwise
0071
0072                     f_cur
0073   r_dvfs := min( 1, ----- )
0074                     f_max
0075
0076 We pick 4C turbo over 1C turbo to make it slightly more sustainable.
0077
0078 r_cpu is determined as the ratio of highest performance level of the current
0079 CPU vs the highest performance level of any other CPU in the system.
0080
0081   r_tot = r_dvfs * r_cpu
0082
0083 The result is that the above 'running' and 'runnable' metrics become invariant
0084 of DVFS and CPU type. IOW. we can transfer and compare them between CPUs.
0085
0086 For more detail see:
0087
0088  - kernel/sched/pelt.h:update_rq_clock_pelt()
0089  - arch/x86/kernel/smpboot.c:"APERF/MPERF frequency ratio computation."
0090  - Documentation/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization"
0091
0092
0093 UTIL_EST / UTIL_EST_FASTUP
0094 ==========================
0095
0096 Because periodic tasks have their averages decayed while they sleep, even
0097 though when running their expected utilization will be the same, they suffer a
0098 (DVFS) ramp-up after they are running again.
0099
0100 To alleviate this (a default enabled option) UTIL_EST drives an Infinite
0101 Impulse Response (IIR) EWMA with the 'running' value on dequeue -- when it is
0102 highest. A further default enabled option UTIL_EST_FASTUP modifies the IIR
0103 filter to instantly increase and only decay on decrease.
0104
0105 A further runqueue wide sum (of runnable tasks) is maintained of:
0106
0107   util_est := \Sum_t max( t_running, t_util_est_ewma )
0108
0109 For more detail see: kernel/sched/fair.c:util_est_dequeue()
0110
0111
0112 UCLAMP
0113 ======
0114
0115 It is possible to set effective u_min and u_max clamps on each CFS or RT task;
0116 the runqueue keeps an max aggregate of these clamps for all running tasks.
0117
0118 For more detail see: include/uapi/linux/sched/types.h
0119
0120
0121 Schedutil / DVFS
0122 ================
0123
0124 Every time the scheduler load tracking is updated (task wakeup, task
0125 migration, time progression) we call out to schedutil to update the hardware
0126 DVFS state.
0127
0128 The basis is the CPU runqueue's 'running' metric, which per the above it is
0129 the frequency invariant utilization estimate of the CPU. From this we compute
0130 a desired frequency like::
0131
0132              max( running, util_est );  if UTIL_EST
0133   u_cfs := { running;                   otherwise
0134
0135                clamp( u_cfs + u_rt , u_min, u_max );    if UCLAMP_TASK
0136   u_clamp := { u_cfs + u_rt;                            otherwise
0137
0138   u := u_clamp + u_irq + u_dl;          [approx. see source for more detail]
0139
0140   f_des := min( f_max, 1.25 u * f_max )
0141
0142 XXX IO-wait: when the update is due to a task wakeup from IO-completion we
0143 boost 'u' above.
0144
0145 This frequency is then used to select a P-state/OPP or directly munged into a
0146 CPPC style request to the hardware.
0147
0148 XXX: deadline tasks (Sporadic Task Model) allows us to calculate a hard f_min
0149 required to satisfy the workload.
0150
0151 Because these callbacks are directly from the scheduler, the DVFS hardware
0152 interaction should be 'fast' and non-blocking. Schedutil supports
0153 rate-limiting DVFS requests for when hardware interaction is slow and
0154 expensive, this reduces effectiveness.
0155
0156 For more information see: kernel/sched/cpufreq_schedutil.c
0157
0158
0159 NOTES
0160 =====
0161
0162  - On low-load scenarios, where DVFS is most relevant, the 'running' numbers
0163    will closely reflect utilization.
0164
0165  - In saturated scenarios task movement will cause some transient dips,
0166    suppose we have a CPU saturated with 4 tasks, then when we migrate a task
0167    to an idle CPU, the old CPU will have a 'running' value of 0.75 while the
0168    new CPU will gain 0.25. This is inevitable and time progression will
0169    correct this. XXX do we still guarantee f_max due to no idle-time?
0170
0171  - Much of the above is about avoiding DVFS dips, and independent DVFS domains
0172    having to re-learn / ramp-up when load shifts.
0173