Back to home page

OSCL-LXR

 
 

    


0001 =========================
0002 Capacity Aware Scheduling
0003 =========================
0004 
0005 1. CPU Capacity
0006 ===============
0007 
0008 1.1 Introduction
0009 ----------------
0010 
0011 Conventional, homogeneous SMP platforms are composed of purely identical
0012 CPUs. Heterogeneous platforms on the other hand are composed of CPUs with
0013 different performance characteristics - on such platforms, not all CPUs can be
0014 considered equal.
0015 
0016 CPU capacity is a measure of the performance a CPU can reach, normalized against
0017 the most performant CPU in the system. Heterogeneous systems are also called
0018 asymmetric CPU capacity systems, as they contain CPUs of different capacities.
0019 
0020 Disparity in maximum attainable performance (IOW in maximum CPU capacity) stems
0021 from two factors:
0022 
0023 - not all CPUs may have the same microarchitecture (µarch).
0024 - with Dynamic Voltage and Frequency Scaling (DVFS), not all CPUs may be
0025   physically able to attain the higher Operating Performance Points (OPP).
0026 
0027 Arm big.LITTLE systems are an example of both. The big CPUs are more
0028 performance-oriented than the LITTLE ones (more pipeline stages, bigger caches,
0029 smarter predictors, etc), and can usually reach higher OPPs than the LITTLE ones
0030 can.
0031 
0032 CPU performance is usually expressed in Millions of Instructions Per Second
0033 (MIPS), which can also be expressed as a given amount of instructions attainable
0034 per Hz, leading to::
0035 
0036   capacity(cpu) = work_per_hz(cpu) * max_freq(cpu)
0037 
0038 1.2 Scheduler terms
0039 -------------------
0040 
0041 Two different capacity values are used within the scheduler. A CPU's
0042 ``capacity_orig`` is its maximum attainable capacity, i.e. its maximum
0043 attainable performance level. A CPU's ``capacity`` is its ``capacity_orig`` to
0044 which some loss of available performance (e.g. time spent handling IRQs) is
0045 subtracted.
0046 
0047 Note that a CPU's ``capacity`` is solely intended to be used by the CFS class,
0048 while ``capacity_orig`` is class-agnostic. The rest of this document will use
0049 the term ``capacity`` interchangeably with ``capacity_orig`` for the sake of
0050 brevity.
0051 
0052 1.3 Platform examples
0053 ---------------------
0054 
0055 1.3.1 Identical OPPs
0056 ~~~~~~~~~~~~~~~~~~~~
0057 
0058 Consider an hypothetical dual-core asymmetric CPU capacity system where
0059 
0060 - work_per_hz(CPU0) = W
0061 - work_per_hz(CPU1) = W/2
0062 - all CPUs are running at the same fixed frequency
0063 
0064 By the above definition of capacity:
0065 
0066 - capacity(CPU0) = C
0067 - capacity(CPU1) = C/2
0068 
0069 To draw the parallel with Arm big.LITTLE, CPU0 would be a big while CPU1 would
0070 be a LITTLE.
0071 
0072 With a workload that periodically does a fixed amount of work, you will get an
0073 execution trace like so::
0074 
0075  CPU0 work ^
0076            |     ____                ____                ____
0077            |    |    |              |    |              |    |
0078            +----+----+----+----+----+----+----+----+----+----+-> time
0079 
0080  CPU1 work ^
0081            |     _________           _________           ____
0082            |    |         |         |         |         |
0083            +----+----+----+----+----+----+----+----+----+----+-> time
0084 
0085 CPU0 has the highest capacity in the system (C), and completes a fixed amount of
0086 work W in T units of time. On the other hand, CPU1 has half the capacity of
0087 CPU0, and thus only completes W/2 in T.
0088 
0089 1.3.2 Different max OPPs
0090 ~~~~~~~~~~~~~~~~~~~~~~~~
0091 
0092 Usually, CPUs of different capacity values also have different maximum
0093 OPPs. Consider the same CPUs as above (i.e. same work_per_hz()) with:
0094 
0095 - max_freq(CPU0) = F
0096 - max_freq(CPU1) = 2/3 * F
0097 
0098 This yields:
0099 
0100 - capacity(CPU0) = C
0101 - capacity(CPU1) = C/3
0102 
0103 Executing the same workload as described in 1.3.1, which each CPU running at its
0104 maximum frequency results in::
0105 
0106  CPU0 work ^
0107            |     ____                ____                ____
0108            |    |    |              |    |              |    |
0109            +----+----+----+----+----+----+----+----+----+----+-> time
0110 
0111                             workload on CPU1
0112  CPU1 work ^
0113            |     ______________      ______________      ____
0114            |    |              |    |              |    |
0115            +----+----+----+----+----+----+----+----+----+----+-> time
0116 
0117 1.4 Representation caveat
0118 -------------------------
0119 
0120 It should be noted that having a *single* value to represent differences in CPU
0121 performance is somewhat of a contentious point. The relative performance
0122 difference between two different µarchs could be X% on integer operations, Y% on
0123 floating point operations, Z% on branches, and so on. Still, results using this
0124 simple approach have been satisfactory for now.
0125 
0126 2. Task utilization
0127 ===================
0128 
0129 2.1 Introduction
0130 ----------------
0131 
0132 Capacity aware scheduling requires an expression of a task's requirements with
0133 regards to CPU capacity. Each scheduler class can express this differently, and
0134 while task utilization is specific to CFS, it is convenient to describe it here
0135 in order to introduce more generic concepts.
0136 
0137 Task utilization is a percentage meant to represent the throughput requirements
0138 of a task. A simple approximation of it is the task's duty cycle, i.e.::
0139 
0140   task_util(p) = duty_cycle(p)
0141 
0142 On an SMP system with fixed frequencies, 100% utilization suggests the task is a
0143 busy loop. Conversely, 10% utilization hints it is a small periodic task that
0144 spends more time sleeping than executing. Variable CPU frequencies and
0145 asymmetric CPU capacities complexify this somewhat; the following sections will
0146 expand on these.
0147 
0148 2.2 Frequency invariance
0149 ------------------------
0150 
0151 One issue that needs to be taken into account is that a workload's duty cycle is
0152 directly impacted by the current OPP the CPU is running at. Consider running a
0153 periodic workload at a given frequency F::
0154 
0155   CPU work ^
0156            |     ____                ____                ____
0157            |    |    |              |    |              |    |
0158            +----+----+----+----+----+----+----+----+----+----+-> time
0159 
0160 This yields duty_cycle(p) == 25%.
0161 
0162 Now, consider running the *same* workload at frequency F/2::
0163 
0164   CPU work ^
0165            |     _________           _________           ____
0166            |    |         |         |         |         |
0167            +----+----+----+----+----+----+----+----+----+----+-> time
0168 
0169 This yields duty_cycle(p) == 50%, despite the task having the exact same
0170 behaviour (i.e. executing the same amount of work) in both executions.
0171 
0172 The task utilization signal can be made frequency invariant using the following
0173 formula::
0174 
0175   task_util_freq_inv(p) = duty_cycle(p) * (curr_frequency(cpu) / max_frequency(cpu))
0176 
0177 Applying this formula to the two examples above yields a frequency invariant
0178 task utilization of 25%.
0179 
0180 2.3 CPU invariance
0181 ------------------
0182 
0183 CPU capacity has a similar effect on task utilization in that running an
0184 identical workload on CPUs of different capacity values will yield different
0185 duty cycles.
0186 
0187 Consider the system described in 1.3.2., i.e.::
0188 
0189 - capacity(CPU0) = C
0190 - capacity(CPU1) = C/3
0191 
0192 Executing a given periodic workload on each CPU at their maximum frequency would
0193 result in::
0194 
0195  CPU0 work ^
0196            |     ____                ____                ____
0197            |    |    |              |    |              |    |
0198            +----+----+----+----+----+----+----+----+----+----+-> time
0199 
0200  CPU1 work ^
0201            |     ______________      ______________      ____
0202            |    |              |    |              |    |
0203            +----+----+----+----+----+----+----+----+----+----+-> time
0204 
0205 IOW,
0206 
0207 - duty_cycle(p) == 25% if p runs on CPU0 at its maximum frequency
0208 - duty_cycle(p) == 75% if p runs on CPU1 at its maximum frequency
0209 
0210 The task utilization signal can be made CPU invariant using the following
0211 formula::
0212 
0213   task_util_cpu_inv(p) = duty_cycle(p) * (capacity(cpu) / max_capacity)
0214 
0215 with ``max_capacity`` being the highest CPU capacity value in the
0216 system. Applying this formula to the above example above yields a CPU
0217 invariant task utilization of 25%.
0218 
0219 2.4 Invariant task utilization
0220 ------------------------------
0221 
0222 Both frequency and CPU invariance need to be applied to task utilization in
0223 order to obtain a truly invariant signal. The pseudo-formula for a task
0224 utilization that is both CPU and frequency invariant is thus, for a given
0225 task p::
0226 
0227                                      curr_frequency(cpu)   capacity(cpu)
0228   task_util_inv(p) = duty_cycle(p) * ------------------- * -------------
0229                                      max_frequency(cpu)    max_capacity
0230 
0231 In other words, invariant task utilization describes the behaviour of a task as
0232 if it were running on the highest-capacity CPU in the system, running at its
0233 maximum frequency.
0234 
0235 Any mention of task utilization in the following sections will imply its
0236 invariant form.
0237 
0238 2.5 Utilization estimation
0239 --------------------------
0240 
0241 Without a crystal ball, task behaviour (and thus task utilization) cannot
0242 accurately be predicted the moment a task first becomes runnable. The CFS class
0243 maintains a handful of CPU and task signals based on the Per-Entity Load
0244 Tracking (PELT) mechanism, one of those yielding an *average* utilization (as
0245 opposed to instantaneous).
0246 
0247 This means that while the capacity aware scheduling criteria will be written
0248 considering a "true" task utilization (using a crystal ball), the implementation
0249 will only ever be able to use an estimator thereof.
0250 
0251 3. Capacity aware scheduling requirements
0252 =========================================
0253 
0254 3.1 CPU capacity
0255 ----------------
0256 
0257 Linux cannot currently figure out CPU capacity on its own, this information thus
0258 needs to be handed to it. Architectures must define arch_scale_cpu_capacity()
0259 for that purpose.
0260 
0261 The arm and arm64 architectures directly map this to the arch_topology driver
0262 CPU scaling data, which is derived from the capacity-dmips-mhz CPU binding; see
0263 Documentation/devicetree/bindings/arm/cpu-capacity.txt.
0264 
0265 3.2 Frequency invariance
0266 ------------------------
0267 
0268 As stated in 2.2, capacity-aware scheduling requires a frequency-invariant task
0269 utilization. Architectures must define arch_scale_freq_capacity(cpu) for that
0270 purpose.
0271 
0272 Implementing this function requires figuring out at which frequency each CPU
0273 have been running at. One way to implement this is to leverage hardware counters
0274 whose increment rate scale with a CPU's current frequency (APERF/MPERF on x86,
0275 AMU on arm64). Another is to directly hook into cpufreq frequency transitions,
0276 when the kernel is aware of the switched-to frequency (also employed by
0277 arm/arm64).
0278 
0279 4. Scheduler topology
0280 =====================
0281 
0282 During the construction of the sched domains, the scheduler will figure out
0283 whether the system exhibits asymmetric CPU capacities. Should that be the
0284 case:
0285 
0286 - The sched_asym_cpucapacity static key will be enabled.
0287 - The SD_ASYM_CPUCAPACITY_FULL flag will be set at the lowest sched_domain
0288   level that spans all unique CPU capacity values.
0289 - The SD_ASYM_CPUCAPACITY flag will be set for any sched_domain that spans
0290   CPUs with any range of asymmetry.
0291 
0292 The sched_asym_cpucapacity static key is intended to guard sections of code that
0293 cater to asymmetric CPU capacity systems. Do note however that said key is
0294 *system-wide*. Imagine the following setup using cpusets::
0295 
0296   capacity    C/2          C
0297             ________    ________
0298            /        \  /        \
0299   CPUs     0  1  2  3  4  5  6  7
0300            \__/  \______________/
0301   cpusets   cs0         cs1
0302 
0303 Which could be created via:
0304 
0305 .. code-block:: sh
0306 
0307   mkdir /sys/fs/cgroup/cpuset/cs0
0308   echo 0-1 > /sys/fs/cgroup/cpuset/cs0/cpuset.cpus
0309   echo 0 > /sys/fs/cgroup/cpuset/cs0/cpuset.mems
0310 
0311   mkdir /sys/fs/cgroup/cpuset/cs1
0312   echo 2-7 > /sys/fs/cgroup/cpuset/cs1/cpuset.cpus
0313   echo 0 > /sys/fs/cgroup/cpuset/cs1/cpuset.mems
0314 
0315   echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
0316 
0317 Since there *is* CPU capacity asymmetry in the system, the
0318 sched_asym_cpucapacity static key will be enabled. However, the sched_domain
0319 hierarchy of CPUs 0-1 spans a single capacity value: SD_ASYM_CPUCAPACITY isn't
0320 set in that hierarchy, it describes an SMP island and should be treated as such.
0321 
0322 Therefore, the 'canonical' pattern for protecting codepaths that cater to
0323 asymmetric CPU capacities is to:
0324 
0325 - Check the sched_asym_cpucapacity static key
0326 - If it is enabled, then also check for the presence of SD_ASYM_CPUCAPACITY in
0327   the sched_domain hierarchy (if relevant, i.e. the codepath targets a specific
0328   CPU or group thereof)
0329 
0330 5. Capacity aware scheduling implementation
0331 ===========================================
0332 
0333 5.1 CFS
0334 -------
0335 
0336 5.1.1 Capacity fitness
0337 ~~~~~~~~~~~~~~~~~~~~~~
0338 
0339 The main capacity scheduling criterion of CFS is::
0340 
0341   task_util(p) < capacity(task_cpu(p))
0342 
0343 This is commonly called the capacity fitness criterion, i.e. CFS must ensure a
0344 task "fits" on its CPU. If it is violated, the task will need to achieve more
0345 work than what its CPU can provide: it will be CPU-bound.
0346 
0347 Furthermore, uclamp lets userspace specify a minimum and a maximum utilization
0348 value for a task, either via sched_setattr() or via the cgroup interface (see
0349 Documentation/admin-guide/cgroup-v2.rst). As its name imply, this can be used to
0350 clamp task_util() in the previous criterion.
0351 
0352 5.1.2 Wakeup CPU selection
0353 ~~~~~~~~~~~~~~~~~~~~~~~~~~
0354 
0355 CFS task wakeup CPU selection follows the capacity fitness criterion described
0356 above. On top of that, uclamp is used to clamp the task utilization values,
0357 which lets userspace have more leverage over the CPU selection of CFS
0358 tasks. IOW, CFS wakeup CPU selection searches for a CPU that satisfies::
0359 
0360   clamp(task_util(p), task_uclamp_min(p), task_uclamp_max(p)) < capacity(cpu)
0361 
0362 By using uclamp, userspace can e.g. allow a busy loop (100% utilization) to run
0363 on any CPU by giving it a low uclamp.max value. Conversely, it can force a small
0364 periodic task (e.g. 10% utilization) to run on the highest-performance CPUs by
0365 giving it a high uclamp.min value.
0366 
0367 .. note::
0368 
0369   Wakeup CPU selection in CFS can be eclipsed by Energy Aware Scheduling
0370   (EAS), which is described in Documentation/scheduler/sched-energy.rst.
0371 
0372 5.1.3 Load balancing
0373 ~~~~~~~~~~~~~~~~~~~~
0374 
0375 A pathological case in the wakeup CPU selection occurs when a task rarely
0376 sleeps, if at all - it thus rarely wakes up, if at all. Consider::
0377 
0378   w == wakeup event
0379 
0380   capacity(CPU0) = C
0381   capacity(CPU1) = C / 3
0382 
0383                            workload on CPU0
0384   CPU work ^
0385            |     _________           _________           ____
0386            |    |         |         |         |         |
0387            +----+----+----+----+----+----+----+----+----+----+-> time
0388                 w                   w                   w
0389 
0390                            workload on CPU1
0391   CPU work ^
0392            |     ____________________________________________
0393            |    |
0394            +----+----+----+----+----+----+----+----+----+----+->
0395                 w
0396 
0397 This workload should run on CPU0, but if the task either:
0398 
0399 - was improperly scheduled from the start (inaccurate initial
0400   utilization estimation)
0401 - was properly scheduled from the start, but suddenly needs more
0402   processing power
0403 
0404 then it might become CPU-bound, IOW ``task_util(p) > capacity(task_cpu(p))``;
0405 the CPU capacity scheduling criterion is violated, and there may not be any more
0406 wakeup event to fix this up via wakeup CPU selection.
0407 
0408 Tasks that are in this situation are dubbed "misfit" tasks, and the mechanism
0409 put in place to handle this shares the same name. Misfit task migration
0410 leverages the CFS load balancer, more specifically the active load balance part
0411 (which caters to migrating currently running tasks). When load balance happens,
0412 a misfit active load balance will be triggered if a misfit task can be migrated
0413 to a CPU with more capacity than its current one.
0414 
0415 5.2 RT
0416 ------
0417 
0418 5.2.1 Wakeup CPU selection
0419 ~~~~~~~~~~~~~~~~~~~~~~~~~~
0420 
0421 RT task wakeup CPU selection searches for a CPU that satisfies::
0422 
0423   task_uclamp_min(p) <= capacity(task_cpu(cpu))
0424 
0425 while still following the usual priority constraints. If none of the candidate
0426 CPUs can satisfy this capacity criterion, then strict priority based scheduling
0427 is followed and CPU capacities are ignored.
0428 
0429 5.3 DL
0430 ------
0431 
0432 5.3.1 Wakeup CPU selection
0433 ~~~~~~~~~~~~~~~~~~~~~~~~~~
0434 
0435 DL task wakeup CPU selection searches for a CPU that satisfies::
0436 
0437   task_bandwidth(p) < capacity(task_cpu(p))
0438 
0439 while still respecting the usual bandwidth and deadline constraints. If
0440 none of the candidate CPUs can satisfy this capacity criterion, then the
0441 task will remain on its current CPU.