0001 =======================
0002 Energy Aware Scheduling
0003 =======================
0004
0005 1. Introduction
0006 ---------------
0007
0008 Energy Aware Scheduling (or EAS) gives the scheduler the ability to predict
0009 the impact of its decisions on the energy consumed by CPUs. EAS relies on an
0010 Energy Model (EM) of the CPUs to select an energy efficient CPU for each task,
with a minimal impact on throughput. This document provides an introduction
to how EAS works, the main design decisions behind it, and what is needed to
get it to run.
0014
0015 Before going any further, please note that at the time of writing::
0016
0017 /!\ EAS does not support platforms with symmetric CPU topologies /!\
0018
0019 EAS operates only on heterogeneous CPU topologies (such as Arm big.LITTLE)
0020 because this is where the potential for saving energy through scheduling is
0021 the highest.
0022
0023 The actual EM used by EAS is _not_ maintained by the scheduler, but by a
0024 dedicated framework. For details about this framework and what it provides,
0025 please refer to its documentation (see Documentation/power/energy-model.rst).
0026
0027
0028 2. Background and Terminology
0029 -----------------------------
0030
0031 To make it clear from the start:
0032 - energy = [joule] (resource like a battery on powered devices)
0033 - power = energy/time = [joule/second] = [watt]
0034
0035 The goal of EAS is to minimize energy, while still getting the job done. That
0036 is, we want to maximize::
0037
0038 performance [inst/s]
0039 --------------------
0040 power [W]
0041
0042 which is equivalent to minimizing::
0043
0044 energy [J]
0045 -----------
0046 instruction
0047
while still getting 'good' performance. Since power is energy per unit of
time, maximizing performance per watt is the same as minimizing energy per
instruction. It is essentially an alternative optimization objective to the
current performance-only objective for the scheduler. This alternative
considers two objectives: energy-efficiency and performance.
0052
0053 The idea behind introducing an EM is to allow the scheduler to evaluate the
0054 implications of its decisions rather than blindly applying energy-saving
0055 techniques that may have positive effects only on some platforms. At the same
0056 time, the EM must be as simple as possible to minimize the scheduler latency
0057 impact.
0058
0059 In short, EAS changes the way CFS tasks are assigned to CPUs. When it is time
0060 for the scheduler to decide where a task should run (during wake-up), the EM
0061 is used to break the tie between several good CPU candidates and pick the one
0062 that is predicted to yield the best energy consumption without harming the
0063 system's throughput. The predictions made by EAS rely on specific elements of
0064 knowledge about the platform's topology, which include the 'capacity' of CPUs,
0065 and their respective energy costs.
0066
0067
0068 3. Topology information
0069 -----------------------
0070
0071 EAS (as well as the rest of the scheduler) uses the notion of 'capacity' to
0072 differentiate CPUs with different computing throughput. The 'capacity' of a CPU
0073 represents the amount of work it can absorb when running at its highest
0074 frequency compared to the most capable CPU of the system. Capacity values are
0075 normalized in a 1024 range, and are comparable with the utilization signals of
0076 tasks and CPUs computed by the Per-Entity Load Tracking (PELT) mechanism. Thanks
0077 to capacity and utilization values, EAS is able to estimate how big/busy a
0078 task/CPU is, and to take this into consideration when evaluating performance vs
0079 energy trade-offs. The capacity of CPUs is provided via arch-specific code
0080 through the arch_scale_cpu_capacity() callback.
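
For instance, since capacity and utilization share the same scale, a trivial
helper like the one below (illustrative only, not a kernel API) is enough to
tell how much extra work a CPU can still absorb::

  #define SCHED_CAPACITY_SCALE    1024    /* capacity of the biggest CPU */

  /* How much extra work a CPU can still absorb at its highest frequency. */
  static unsigned long spare_capacity(unsigned long capacity, unsigned long util)
  {
          return util < capacity ? capacity - util : 0;
  }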
0081
0082 The rest of platform knowledge used by EAS is directly read from the Energy
0083 Model (EM) framework. The EM of a platform is composed of a power cost table
0084 per 'performance domain' in the system (see Documentation/power/energy-model.rst
for further details about performance domains).
0086
0087 The scheduler manages references to the EM objects in the topology code when the
0088 scheduling domains are built, or re-built. For each root domain (rd), the
0089 scheduler maintains a singly linked list of all performance domains intersecting
0090 the current rd->span. Each node in the list contains a pointer to a struct
0091 em_perf_domain as provided by the EM framework.
0092
0093 The lists are attached to the root domains in order to cope with exclusive
0094 cpuset configurations. Since the boundaries of exclusive cpusets do not
0095 necessarily match those of performance domains, the lists of different root
0096 domains can contain duplicate elements.
0097
0098 Example 1.
0099 Let us consider a platform with 12 CPUs, split in 3 performance domains
0100 (pd0, pd4 and pd8), organized as follows::
0101
0102 CPUs: 0 1 2 3 4 5 6 7 8 9 10 11
0103 PDs: |--pd0--|--pd4--|---pd8---|
0104 RDs: |----rd1----|-----rd2-----|
0105
0106 Now, consider that userspace decided to split the system with two
0107 exclusive cpusets, hence creating two independent root domains, each
0108 containing 6 CPUs. The two root domains are denoted rd1 and rd2 in the
0109 above figure. Since pd4 intersects with both rd1 and rd2, it will be
0110 present in the linked list '->pd' attached to each of them:
0111
0112 * rd1->pd: pd0 -> pd4
0113 * rd2->pd: pd4 -> pd8
0114
0115 Please note that the scheduler will create two duplicate list nodes for
0116 pd4 (one for each list). However, both just hold a pointer to the same
0117 shared data structure of the EM framework.
0118
0119 Since the access to these lists can happen concurrently with hotplug and other
0120 things, they are protected by RCU, like the rest of topology structures
0121 manipulated by the scheduler.
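
The list nodes attached to a root domain can be pictured along these lines
(a simplified sketch of what was just described; names are illustrative and
the authoritative definition lives in the scheduler's internal headers)::

  #include <linux/energy_model.h>
  #include <linux/rcupdate.h>

  struct perf_domain {
          struct em_perf_domain *em_pd;   /* shared EM data, not duplicated  */
          struct perf_domain *next;       /* next PD overlapping rd->span    */
          struct rcu_head rcu;            /* deferred free on domain rebuild */
  };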
0122
0123 EAS also maintains a static key (sched_energy_present) which is enabled when at
0124 least one root domain meets all conditions for EAS to start. Those conditions
0125 are summarized in Section 6.
0126
0127
0128 4. Energy-Aware task placement
0129 ------------------------------
0130
0131 EAS overrides the CFS task wake-up balancing code. It uses the EM of the
0132 platform and the PELT signals to choose an energy-efficient target CPU during
0133 wake-up balance. When EAS is enabled, select_task_rq_fair() calls
find_energy_efficient_cpu() to make the placement decision. This function looks
0135 for the CPU with the highest spare capacity (CPU capacity - CPU utilization) in
0136 each performance domain since it is the one which will allow us to keep the
0137 frequency the lowest. Then, the function checks if placing the task there could
0138 save energy compared to leaving it on prev_cpu, i.e. the CPU where the task ran
0139 in its previous activation.
0140
0141 find_energy_efficient_cpu() uses compute_energy() to estimate what will be the
0142 energy consumed by the system if the waking task was migrated. compute_energy()
0143 looks at the current utilization landscape of the CPUs and adjusts it to
0144 'simulate' the task migration. The EM framework provides the em_pd_energy() API
0145 which computes the expected energy consumption of each performance domain for
0146 the given utilization landscape.
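
In pseudo-code, the decision can be sketched as follows. This is an
illustrative, self-contained model of the logic described above, not the
kernel implementation: pick_energy_efficient_cpu(), estimate_energy() and the
cpu_cap[]/cpu_util[] arrays are made-up names standing in for
find_energy_efficient_cpu(), compute_energy() and the PELT signals::

  /*
   * Illustrative model only; the real logic lives in
   * find_energy_efficient_cpu() and compute_energy() in kernel/sched/fair.c.
   */
  #include <limits.h>

  struct pd_view {
          const int *cpus;                /* CPUs of one performance domain */
          int nr_cpus;
  };

  /* Per-CPU capacity and utilization, both on the 0..1024 scale. */
  extern unsigned long cpu_cap[], cpu_util[];

  /* Hypothetical estimator: system energy if @task_util moves to CPU @dst. */
  unsigned long estimate_energy(unsigned long task_util, int dst);

  static int pick_energy_efficient_cpu(const struct pd_view *pds, int nr_pds,
                                       unsigned long task_util, int prev_cpu)
  {
          unsigned long prev_energy = estimate_energy(task_util, prev_cpu);
          unsigned long best_energy = ULONG_MAX;
          int best_cpu = prev_cpu;

          for (int d = 0; d < nr_pds; d++) {
                  unsigned long max_spare = 0;
                  int candidate = -1;

                  /* Highest spare capacity => lowest resulting frequency. */
                  for (int i = 0; i < pds[d].nr_cpus; i++) {
                          int cpu = pds[d].cpus[i];

                          if (cpu_util[cpu] + task_util > cpu_cap[cpu])
                                  continue;       /* task would not fit */

                          unsigned long spare = cpu_cap[cpu] - cpu_util[cpu];

                          if (spare >= max_spare) {
                                  max_spare = spare;
                                  candidate = cpu;
                          }
                  }

                  if (candidate < 0)
                          continue;

                  unsigned long energy = estimate_energy(task_util, candidate);

                  if (energy < best_energy) {
                          best_energy = energy;
                          best_cpu = candidate;
                  }
          }

          /* Migrate only if this is predicted to save energy vs. prev_cpu. */
          return best_energy < prev_energy ? best_cpu : prev_cpu;
  }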
0147
0148 An example of energy-optimized task placement decision is detailed below.
0149
0150 Example 2.
0151 Let us consider a (fake) platform with 2 independent performance domains
0152 composed of two CPUs each. CPU0 and CPU1 are little CPUs; CPU2 and CPU3
0153 are big.
0154
0155 The scheduler must decide where to place a task P whose util_avg = 200
0156 and prev_cpu = 0.
0157
The current utilization landscape of the CPUs is depicted in the graph
below. CPUs 0-3 have a util_avg of 400, 100, 600 and 500 respectively.
0160 Each performance domain has three Operating Performance Points (OPPs).
0161 The CPU capacity and power cost associated with each OPP is listed in
0162 the Energy Model table. The util_avg of P is shown on the figures
0163 below as 'PP'::
0164
0165 CPU util.
0166 1024 - - - - - - - Energy Model
0167 +-----------+-------------+
0168 | Little | Big |
0169 768 ============= +-----+-----+------+------+
0170 | Cap | Pwr | Cap | Pwr |
0171 +-----+-----+------+------+
0172 512 =========== - ##- - - - - | 170 | 50 | 512 | 400 |
0173 ## ## | 341 | 150 | 768 | 800 |
0174 341 -PP - - - - ## ## | 512 | 300 | 1024 | 1700 |
0175 PP ## ## +-----+-----+------+------+
0176 170 -## - - - - ## ##
0177 ## ## ## ##
0178 ------------ -------------
0179 CPU0 CPU1 CPU2 CPU3
0180
0181 Current OPP: ===== Other OPP: - - - util_avg (100 each): ##
0182
0183
0184 find_energy_efficient_cpu() will first look for the CPUs with the
0185 maximum spare capacity in the two performance domains. In this example,
0186 CPU1 and CPU3. Then it will estimate the energy of the system if P was
0187 placed on either of them, and check if that would save some energy
0188 compared to leaving P on CPU0. EAS assumes that OPPs follow utilization
0189 (which is coherent with the behaviour of the schedutil CPUFreq
0190 governor, see Section 6. for more details on this topic).
0191
0192 **Case 1. P is migrated to CPU1**::
0193
0194 1024 - - - - - - -
0195
0196 Energy calculation:
0197 768 ============= * CPU0: 200 / 341 * 150 = 88
0198 * CPU1: 300 / 341 * 150 = 131
0199 * CPU2: 600 / 768 * 800 = 625
0200 512 - - - - - - - ##- - - - - * CPU3: 500 / 768 * 800 = 520
0201 ## ## => total_energy = 1364
0202 341 =========== ## ##
0203 PP ## ##
0204 170 -## - - PP- ## ##
0205 ## ## ## ##
0206 ------------ -------------
0207 CPU0 CPU1 CPU2 CPU3
0208
0209
0210 **Case 2. P is migrated to CPU3**::
0211
0212 1024 - - - - - - -
0213
0214 Energy calculation:
0215 768 ============= * CPU0: 200 / 341 * 150 = 88
0216 * CPU1: 100 / 341 * 150 = 43
0217 PP * CPU2: 600 / 768 * 800 = 625
0218 512 - - - - - - - ##- - -PP - * CPU3: 700 / 768 * 800 = 729
0219 ## ## => total_energy = 1485
0220 341 =========== ## ##
0221 ## ##
0222 170 -## - - - - ## ##
0223 ## ## ## ##
0224 ------------ -------------
0225 CPU0 CPU1 CPU2 CPU3
0226
0227
0228 **Case 3. P stays on prev_cpu / CPU 0**::
0229
0230 1024 - - - - - - -
0231
0232 Energy calculation:
0233 768 ============= * CPU0: 400 / 512 * 300 = 234
0234 * CPU1: 100 / 512 * 300 = 58
0235 * CPU2: 600 / 768 * 800 = 625
0236 512 =========== - ##- - - - - * CPU3: 500 / 768 * 800 = 520
0237 ## ## => total_energy = 1437
0238 341 -PP - - - - ## ##
0239 PP ## ##
0240 170 -## - - - - ## ##
0241 ## ## ## ##
0242 ------------ -------------
0243 CPU0 CPU1 CPU2 CPU3
0244
0245
From these calculations, Case 1 has the lowest total energy. So CPU 1
is the best candidate from an energy-efficiency standpoint.
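
The energy figures above follow a simple rule: within a performance domain,
the OPP is the lowest one whose capacity covers the largest utilization in the
domain, and each CPU is then charged util / capacity_OPP * power_OPP. The
short, illustrative program below hard-codes the Example 2 numbers and
reproduces the three totals (the per-CPU figures in the text are rounded, so
the totals may differ by a couple of units; the ranking is what matters)::

  #include <stdio.h>

  struct opp { int cap, pwr; };

  /* Energy Model table of Example 2. */
  static const struct opp little[] = { {170, 50}, {341, 150}, {512, 300} };
  static const struct opp big[]    = { {512, 400}, {768, 800}, {1024, 1700} };

  /* Lowest OPP whose capacity covers the maximum utilization in the domain. */
  static struct opp pick_opp(const struct opp *table, int n, int max_util)
  {
          for (int i = 0; i < n; i++)
                  if (table[i].cap >= max_util)
                          return table[i];
          return table[n - 1];    /* saturate at the highest OPP */
  }

  static double pd_energy(const struct opp *table, int n, const int *util, int nr)
  {
          int max_util = 0;
          double sum = 0.0;

          for (int i = 0; i < nr; i++)
                  if (util[i] > max_util)
                          max_util = util[i];

          struct opp o = pick_opp(table, n, max_util);

          for (int i = 0; i < nr; i++)
                  sum += (double)util[i] / o.cap * o.pwr;
          return sum;
  }

  int main(void)
  {
          /* util of {CPU0, CPU1} and {CPU2, CPU3} for each candidate placement */
          const int cases[3][2][2] = {
                  { {200, 300}, {600, 500} },     /* Case 1: P on CPU1        */
                  { {200, 100}, {600, 700} },     /* Case 2: P on CPU3        */
                  { {400, 100}, {600, 500} },     /* Case 3: P stays on CPU0  */
          };

          for (int c = 0; c < 3; c++) {
                  double e = pd_energy(little, 3, cases[c][0], 2) +
                             pd_energy(big, 3, cases[c][1], 2);
                  printf("Case %d: total_energy = %.0f\n", c + 1, e);
          }
          return 0;
  }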
0248
0249 Big CPUs are generally more power hungry than the little ones and are thus used
0250 mainly when a task doesn't fit the littles. However, little CPUs aren't always
0251 necessarily more energy-efficient than big CPUs. For some systems, the high OPPs
0252 of the little CPUs can be less energy-efficient than the lowest OPPs of the
0253 bigs, for example. So, if the little CPUs happen to have enough utilization at
0254 a specific point in time, a small task waking up at that moment could be better
off executing on the big side in order to save energy, even though it would fit
0256 on the little side.
0257
0258 And even in the case where all OPPs of the big CPUs are less energy-efficient
0259 than those of the little, using the big CPUs for a small task might still, under
0260 specific conditions, save energy. Indeed, placing a task on a little CPU can
0261 result in raising the OPP of the entire performance domain, and that will
0262 increase the cost of the tasks already running there. If the waking task is
0263 placed on a big CPU, its own execution cost might be higher than if it was
0264 running on a little, but it won't impact the other tasks of the little CPUs
0265 which will keep running at a lower OPP. So, when considering the total energy
0266 consumed by CPUs, the extra cost of running that one task on a big core can be
0267 smaller than the cost of raising the OPP on the little CPUs for all the other
0268 tasks.
0269
0270 The examples above would be nearly impossible to get right in a generic way, and
0271 for all platforms, without knowing the cost of running at different OPPs on all
0272 CPUs of the system. Thanks to its EM-based design, EAS should cope with them
0273 correctly without too many troubles. However, in order to ensure a minimal
0274 impact on throughput for high-utilization scenarios, EAS also implements another
0275 mechanism called 'over-utilization'.
0276
0277
0278 5. Over-utilization
0279 -------------------
0280
0281 From a general standpoint, the use-cases where EAS can help the most are those
0282 involving a light/medium CPU utilization. Whenever long CPU-bound tasks are
0283 being run, they will require all of the available CPU capacity, and there isn't
much that can be done by the scheduler to save energy without severely harming
0285 throughput. In order to avoid hurting performance with EAS, CPUs are flagged as
0286 'over-utilized' as soon as they are used at more than 80% of their compute
0287 capacity. As long as no CPUs are over-utilized in a root domain, load balancing
is disabled and EAS overrides the wake-up balancing code. EAS is likely to load
0289 the most energy efficient CPUs of the system more than the others if that can be
0290 done without harming throughput. So, the load-balancer is disabled to prevent
0291 it from breaking the energy-efficient task placement found by EAS. It is safe to
0292 do so when the system isn't overutilized since being below the 80% tipping point
0293 implies that:
0294
0295 a. there is some idle time on all CPUs, so the utilization signals used by
0296 EAS are likely to accurately represent the 'size' of the various tasks
0297 in the system;
0298 b. all tasks should already be provided with enough CPU capacity,
0299 regardless of their nice values;
0300 c. since there is spare capacity all tasks must be blocking/sleeping
0301 regularly and balancing at wake-up is sufficient.
0302
0303 As soon as one CPU goes above the 80% tipping point, at least one of the three
0304 assumptions above becomes incorrect. In this scenario, the 'overutilized' flag
0305 is raised for the entire root domain, EAS is disabled, and the load-balancer is
0306 re-enabled. By doing so, the scheduler falls back onto load-based algorithms for
wake-up and load balance under CPU-bound conditions. This better respects the
nice values of tasks.
0309
0310 Since the notion of overutilization largely relies on detecting whether or not
0311 there is some idle time in the system, the CPU capacity 'stolen' by higher
0312 (than CFS) scheduling classes (as well as IRQ) must be taken into account. As
0313 such, the detection of overutilization accounts for the capacity used not only
0314 by CFS tasks, but also by the other scheduling classes and IRQ.
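
Putting the two previous paragraphs together, an over-utilization check can
be sketched as below. This is an illustrative helper, not the kernel's
implementation: the 80% tipping point is expressed as 'utilization plus 25%
headroom no longer fits', and the capacity consumed by higher scheduling
classes and IRQ is subtracted first::

  struct cpu_usage {
          unsigned long capacity;         /* arch_scale_cpu_capacity()       */
          unsigned long cfs_util;         /* PELT utilization of CFS tasks   */
          unsigned long other_util;       /* capacity used by RT, DL and IRQ */
  };

  /* Illustrative: ~80% tipping point, i.e. util + 25% headroom must still fit. */
  static int cpu_is_overutilized(const struct cpu_usage *u)
  {
          unsigned long available;

          if (u->other_util >= u->capacity)
                  return 1;

          available = u->capacity - u->other_util;

          /* cfs_util * 1.25 >= available  <=>  usage beyond ~80% of available */
          return u->cfs_util + (u->cfs_util >> 2) >= available;
  }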
0315
0316
0317 6. Dependencies and requirements for EAS
0318 ----------------------------------------
0319
0320 Energy Aware Scheduling depends on the CPUs of the system having specific
0321 hardware properties and on other features of the kernel being enabled. This
0322 section lists these dependencies and provides hints as to how they can be met.
0323
0324
0325 6.1 - Asymmetric CPU topology
0326 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
0327
0328
0329 As mentioned in the introduction, EAS is only supported on platforms with
0330 asymmetric CPU topologies for now. This requirement is checked at run-time by
0331 looking for the presence of the SD_ASYM_CPUCAPACITY_FULL flag when the scheduling
0332 domains are built.
0333
0334 See Documentation/scheduler/sched-capacity.rst for requirements to be met for this
0335 flag to be set in the sched_domain hierarchy.
0336
0337 Please note that EAS is not fundamentally incompatible with SMP, but no
0338 significant savings on SMP platforms have been observed yet. This restriction
0339 could be amended in the future if proven otherwise.
0340
0341
0342 6.2 - Energy Model presence
0343 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
0344
0345 EAS uses the EM of a platform to estimate the impact of scheduling decisions on
0346 energy. So, your platform must provide power cost tables to the EM framework in
0347 order to make EAS start. To do so, please refer to documentation of the
0348 independent EM framework in Documentation/power/energy-model.rst.
0349
0350 Please also note that the scheduling domains need to be re-built after the
0351 EM has been registered in order to start EAS.
0352
EAS uses the EM to forecast energy usage, and is therefore mostly concerned
with the energy difference between the possible options for task placement.
For EAS it doesn't matter whether the EM power values are expressed in
milli-Watts or in an 'abstract scale'.
0357
0358
0359 6.3 - Energy Model complexity
0360 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
0361
0362 The task wake-up path is very latency-sensitive. When the EM of a platform is
0363 too complex (too many CPUs, too many performance domains, too many performance
0364 states, ...), the cost of using it in the wake-up path can become prohibitive.
0365 The energy-aware wake-up algorithm has a complexity of:
0366
0367 C = Nd * (Nc + Ns)
0368
0369 with: Nd the number of performance domains; Nc the number of CPUs; and Ns the
0370 total number of OPPs (ex: for two perf. domains with 4 OPPs each, Ns = 8).
0371
0372 A complexity check is performed at the root domain level, when scheduling
0373 domains are built. EAS will not start on a root domain if its C happens to be
0374 higher than the completely arbitrary EM_MAX_COMPLEXITY threshold (2048 at the
0375 time of writing).
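
As an illustration, the check boils down to a product and a comparison. The
helper below is a sketch with made-up parameter names; for instance, two
performance domains of 4 CPUs with 3 OPPs each give C = 2 * (8 + 6) = 28, far
below the threshold::

  #define EM_MAX_COMPLEXITY       2048

  /* e.g. Nd = 2, Nc = 8, 3 OPPs per domain so Ns = 6: C = 2 * (8 + 6) = 28 */
  static int em_too_complex(int nr_pds, int nr_cpus, int nr_opps_total)
  {
          int complexity = nr_pds * (nr_cpus + nr_opps_total);

          return complexity > EM_MAX_COMPLEXITY;
  }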
0376
0377 If you really want to use EAS but the complexity of your platform's Energy
0378 Model is too high to be used with a single root domain, you're left with only
0379 two possible options:
0380
0381 1. split your system into separate, smaller, root domains using exclusive
cpusets and enable EAS locally on each of them. This option has the
benefit of working out of the box but the drawback of preventing load
0384 balance between root domains, which can result in an unbalanced system
0385 overall;
0386 2. submit patches to reduce the complexity of the EAS wake-up algorithm,
0387 hence enabling it to cope with larger EMs in reasonable time.
0388
0389
0390 6.4 - Schedutil governor
0391 ^^^^^^^^^^^^^^^^^^^^^^^^
0392
EAS tries to predict at which OPP the CPUs will be running in the near future
in order to estimate their energy consumption. To do so, it is assumed that
the OPPs of CPUs follow their utilization.
0396
0397 Although it is very difficult to provide hard guarantees regarding the accuracy
0398 of this assumption in practice (because the hardware might not do what it is
0399 told to do, for example), schedutil as opposed to other CPUFreq governors at
0400 least _requests_ frequencies calculated using the utilization signals.
0401 Consequently, the only sane governor to use together with EAS is schedutil,
0402 because it is the only one providing some degree of consistency between
0403 frequency requests and energy predictions.
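
For reference, the mapping schedutil applies can be sketched as below: the
requested frequency is proportional to utilization with roughly 25% of
headroom. This is an illustrative simplification; the exact expression and
the utilization inputs it uses differ across kernel versions::

  /* Illustrative only: request a frequency proportional to utilization,
   * with ~25% headroom (util and max_cap are on the same 0..1024 scale). */
  static unsigned long requested_freq(unsigned long util, unsigned long max_cap,
                                      unsigned long max_freq)
  {
          return (max_freq + (max_freq >> 2)) * util / max_cap;
  }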
0404
0405 Using EAS with any other governor than schedutil is not supported.
0406
0407
0408 6.5 Scale-invariant utilization signals
0409 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
0410
0411 In order to make accurate prediction across CPUs and for all performance
0412 states, EAS needs frequency-invariant and CPU-invariant PELT signals. These can
0413 be obtained using the architecture-defined arch_scale{cpu,freq}_capacity()
0414 callbacks.
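
As an illustration, a raw running-time delta can be made invariant by scaling
it with both factors before it is accumulated into a PELT signal. The helper
below is a simplified sketch, not the actual PELT code::

  #define SCHED_CAPACITY_SHIFT    10      /* scale factors are 0..1024 */

  /*
   * Illustrative only: scale a raw running-time delta by the current
   * frequency (freq_scale = cur_freq / max_freq, as 0..1024) and by the
   * CPU's capacity relative to the biggest CPU (cpu_scale), so that PELT
   * accumulates comparable amounts of 'work' on any CPU at any OPP.
   */
  static unsigned long scale_invariant_delta(unsigned long delta,
                                             unsigned long freq_scale,
                                             unsigned long cpu_scale)
  {
          delta = (delta * freq_scale) >> SCHED_CAPACITY_SHIFT;
          delta = (delta * cpu_scale) >> SCHED_CAPACITY_SHIFT;
          return delta;
  }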
0415
0416 Using EAS on a platform that doesn't implement these two callbacks is not
0417 supported.
0418
0419
0420 6.6 Multithreading (SMT)
0421 ^^^^^^^^^^^^^^^^^^^^^^^^
0422
0423 EAS in its current form is SMT unaware and is not able to leverage
0424 multithreaded hardware to save energy. EAS considers threads as independent
0425 CPUs, which can actually be counter-productive for both performance and energy.
0426
0427 EAS on SMT is not supported.