=======================
Intel Powerclamp Driver
=======================

By:
  - Arjan van de Ven <arjan@linux.intel.com>
  - Jacob Pan <jacob.jun.pan@linux.intel.com>

.. Contents:

    (*) Introduction
        - Goals and Objectives

    (*) Theory of Operation
        - Idle Injection
        - Calibration

    (*) Performance Analysis
        - Effectiveness and Limitations
        - Power vs Performance
        - Scalability
        - Calibration
        - Comparison with Alternative Techniques

    (*) Usage and Interfaces
        - Generic Thermal Layer (sysfs)
        - Kernel APIs (TBD)

Introduction
============

Consider the situation where a system’s power consumption must be
reduced at runtime, due to power budget, thermal constraint, or noise
level, and where active cooling is not preferred. Software managed
passive power reduction must be performed to prevent the hardware
actions that are designed for catastrophic scenarios.

Currently, P-states, T-states (clock modulation), and CPU offlining
are used for CPU throttling.

On Intel CPUs, C-states provide effective power reduction, but so far
they’re only used opportunistically, based on workload. With the
development of the intel_powerclamp driver, the method of
synchronizing idle injection across all online CPU threads was
introduced. The goal is to achieve forced and controllable C-state
residency.

Tests and analysis have been performed in the areas of power,
performance, scalability, and user experience. In many cases, a clear
advantage is shown over taking the CPU offline or modulating the CPU
clock.


Theory of Operation
===================

Idle Injection
--------------

On modern Intel processors (Nehalem or later), package level C-state
residency is available in MSRs, thus also available to the kernel.

These MSRs are::

      #define MSR_PKG_C2_RESIDENCY      0x60D
      #define MSR_PKG_C3_RESIDENCY      0x3F8
      #define MSR_PKG_C6_RESIDENCY      0x3F9
      #define MSR_PKG_C7_RESIDENCY      0x3FA

If the kernel can also inject idle time into the system, then a
closed-loop control system can be established that manages package
level C-state. The intel_powerclamp driver is conceived as such a
control system, where the target set point is a user-selected idle
ratio (based on power reduction), and the error is the difference
between the actual package level C-state residency ratio and the
target idle ratio.
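
The error term described above can be sketched as follows. This is a
minimal illustration and not the driver's code: it assumes the caller has
already sampled the summed package residency counters and the TSC at both
ends of a window, and that both tick at the same rate. The function names
are hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical helper: percentage of a sampling window spent in package
 * C-states. res_* are assumed to be sums of the package residency MSRs,
 * tsc_* are timestamp counter reads taken at the same two instants.
 */
static unsigned int pkg_idle_pct(uint64_t res_start, uint64_t res_end,
                                 uint64_t tsc_start, uint64_t tsc_end)
{
    uint64_t elapsed = tsc_end - tsc_start;   /* unsigned math absorbs wrap */
    uint64_t idle = res_end - res_start;

    if (elapsed == 0)
        return 0;               /* empty window, nothing to report */
    if (idle > elapsed)
        idle = elapsed;         /* clamp against counter skew */
    /* Assumes idle * 100 fits in 64 bits, true for short windows. */
    return (unsigned int)(idle * 100 / elapsed);
}

/*
 * Control error in percentage points: negative when the package was idle
 * less than requested, positive when it was idle more.
 */
static int idle_error_pct(unsigned int actual_pct, unsigned int target_pct)
{
    return (int)actual_pct - (int)target_pct;
}
```

For example, a window of 1000 TSC ticks with 250 ticks of package residency
gives 25 percent; against a 30 percent target the error is -5.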

Injection is controlled by high priority kernel threads, spawned for
each online CPU.

These kernel threads, with SCHED_FIFO class, are created to perform
clamping actions of controlled duty ratio and duration. Each per-CPU
thread synchronizes its idle time and duration, based on the rounding
of jiffies, which prevents accumulated errors and avoids a jittery
effect. Threads are also bound to the CPU such that they cannot be
migrated, unless the CPU is taken offline. In this case, threads
belonging to the offlined CPUs will be terminated immediately.

Running as SCHED_FIFO at a relatively high priority also allows this
scheme to work for both preemptible and non-preemptible kernels.
Alignment of idle time around jiffies ensures scalability for HZ
values. This effect can be better visualized using a Perf timechart.
The diagram below shows the behavior of kernel thread
kidle_inject/cpu. During idle injection, it runs monitor/mwait idle
for a given "duration", then relinquishes the CPU to other tasks,
until the next time interval.

The NOHZ schedule tick is disabled during idle time, but interrupts
are not masked. Tests show that the extra wakeups from the scheduler
tick have a dramatic impact on the effectiveness of the powerclamp
driver on large scale systems (Westmere system with 80 processors).

::

  CPU0
                    ____________          ____________
  kidle_inject/0   |   sleep    |  mwait |  sleep     |
          _________|            |________|            |_______
                                 duration
  CPU1
                    ____________          ____________
  kidle_inject/1   |   sleep    |  mwait |  sleep     |
          _________|            |________|            |_______
                                ^
                                |
                                |
                                roundup(jiffies, interval)
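
The alignment marked by roundup(jiffies, interval) in the diagram can be
sketched in plain C. This is an illustration only: the helper names are
hypothetical, and 64-bit unsigned integers stand in for the kernel's jiffies
counter. The point it shows is that threads waking at slightly different
times still compute the same start for the next idle period.

```c
#include <stdint.h>

/* Round x up to the next multiple of step (step must be non-zero). */
static uint64_t roundup_u64(uint64_t x, uint64_t step)
{
    return ((x + step - 1) / step) * step;
}

/*
 * Hypothetical helper: the jiffies value at which the next idle period
 * should begin. Every CPU that calls this within the same interval gets
 * the same answer, so per-CPU timing errors do not accumulate.
 */
static uint64_t next_injection(uint64_t now_jiffies, uint64_t interval)
{
    if (interval == 0)
        return now_jiffies;     /* degenerate interval, start immediately */
    return roundup_u64(now_jiffies, interval);
}
```

With an interval of 10 jiffies, a thread running at jiffy 1003 and another at
jiffy 1007 both target jiffy 1010.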

Only one CPU is allowed to collect statistics and update global
control parameters. This CPU is referred to as the controlling CPU in
this document. The controlling CPU is elected at runtime, with a
policy that favors the BSP (bootstrap processor), taking into account
the possibility of a CPU hot-plug.

In terms of dynamics of the idle control system, package level idle
time is considered largely as a non-causal system where its behavior
cannot be based on the past or current input. Therefore, the
intel_powerclamp driver attempts to enforce the desired idle time
instantly as given input (target idle ratio). After injection,
powerclamp monitors the actual idle for a given time window and
adjusts the next injection accordingly to avoid over/under correction.

When used in a causal control system, such as a temperature control,
it is up to the user of this driver to implement algorithms where
past samples and outputs are included in the feedback. For example, a
PID-based thermal controller can use the powerclamp driver to
maintain a desired target temperature, based on integral and
derivative gains of the past samples.
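
A PID-based user of the driver might look roughly like the sketch below. It
is an assumption-laden illustration, not part of the driver: the structure,
the function name, and the gains are hypothetical, temperatures are in
degrees Celsius, and the output is an idle percentage clamped to the range
the cooling device accepts.

```c
/* Hypothetical controller state; gains are chosen by the user. */
struct pid_state {
    double kp, ki, kd;      /* proportional, integral, derivative gains */
    double integral;        /* accumulated error over past samples */
    double prev_err;        /* error from the previous sample */
};

/*
 * One control step. Returns the idle percentage to request, clamped to
 * [0, max_state]. dt_s is the sample period in seconds and must be > 0.
 */
static int pid_idle_pct(struct pid_state *s, double temp_c, double target_c,
                        double dt_s, int max_state)
{
    double err = temp_c - target_c;     /* positive when too hot */
    double deriv, out;

    s->integral += err * dt_s;
    deriv = (err - s->prev_err) / dt_s;
    s->prev_err = err;

    out = s->kp * err + s->ki * s->integral + s->kd * deriv;
    if (out < 0)
        out = 0;                        /* cool enough, no injection */
    if (out > max_state)
        out = max_state;                /* cannot exceed the device cap */
    return (int)out;
}
```

With kp = 2 and the other gains at zero, a reading 10 degrees above target
requests 20 percent idle. A real controller would likely also bound the
integral term to avoid windup.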


Calibration
-----------
During scalability testing, it was observed that synchronized actions
among CPUs become challenging as the number of cores grows. This is
also true for the ability of a system to enter package level C-states.

To make sure the intel_powerclamp driver scales well, online
calibration is implemented. The goals for doing such a calibration
are:

a) determine the effective range of idle injection ratio
b) determine the amount of compensation needed at each target ratio

Compensation to each target ratio consists of two parts:

    a) steady state error compensation
       This is to offset the error occurring when the system can
       enter idle without extra wakeups (such as external interrupts).

    b) dynamic error compensation
       When an excessive number of wakeups occurs during idle, an
       additional idle ratio can be added to quiet interrupts, by
       slowing down CPU activities.

A debugfs file is provided for the user to examine compensation
progress and results, such as on a Westmere system::

  [jacob@nex01 ~]$ cat /sys/kernel/debug/intel_powerclamp/powerclamp_calib
  controlling cpu: 0
  pct confidence steady dynamic (compensation)
  0       0       0       0
  1       1       0       0
  2       1       1       0
  3       3       1       0
  4       3       1       0
  5       3       1       0
  6       3       1       0
  7       3       1       0
  8       3       1       0
  ...
  30      3       2       0
  31      3       2       0
  32      3       1       0
  33      3       2       0
  34      3       1       0
  35      3       2       0
  36      3       1       0
  37      3       2       0
  38      3       1       0
  39      3       2       0
  40      3       3       0
  41      3       1       0
  42      3       2       0
  43      3       1       0
  44      3       1       0
  45      3       2       0
  46      3       3       0
  47      3       0       0
  48      3       2       0
  49      3       3       0

Calibration occurs during runtime. No offline method is available.
Steady state compensation is used only when confidence levels of all
adjacent ratios have reached a satisfactory level. A confidence level
is accumulated based on clean data collected at runtime. Data
collected during a period without extra interrupts is considered
clean.

To compensate for an excessive number of wakeups during idle,
additional idle time is injected when such a condition is detected.
Currently, we have a simple algorithm to double the injection ratio. A
possible enhancement might be to throttle the offending IRQ, such as
delaying EOI for level triggered interrupts. But it is a challenge to
be non-intrusive to the scheduler or the IRQ core code.
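
The two compensation parts can be sketched as below. This is an illustration
under stated assumptions, not the driver's algorithm: the table layout, the
confidence threshold of 3 (the highest value in the debugfs listing), and
averaging over the neighbouring ratios are all assumptions, and every name is
hypothetical. Only the gating on adjacent confidence levels and the doubling
on excessive wakeups come from the text.

```c
#define MAX_RATIO 50    /* table covers pct 0..49, as in the debugfs listing */
#define CONF_OK   3     /* assumed "satisfactory" confidence level */

struct calib {
    int confidence;     /* accumulated from clean samples */
    int steady_comp;    /* steady state compensation, in percent */
};

/*
 * Steady state compensation for a target ratio. It is applied only when
 * the ratio and both of its neighbours have reached CONF_OK.
 */
static int steady_compensation(const struct calib *tbl, int ratio)
{
    int i, sum = 0;

    if (ratio < 1 || ratio + 1 >= MAX_RATIO)
        return 0;       /* edge ratios lack two neighbours in this sketch */
    for (i = ratio - 1; i <= ratio + 1; i++) {
        if (tbl[i].confidence < CONF_OK)
            return 0;   /* not enough clean data yet */
        sum += tbl[i].steady_comp;
    }
    return sum / 3;     /* assumption: average over the neighbourhood */
}

/* Ratio to inject, doubled when excessive wakeups were detected. */
static int effective_ratio(const struct calib *tbl, int target,
                           int excessive_wakeups)
{
    int ratio = target + steady_compensation(tbl, target);

    if (excessive_wakeups)
        ratio *= 2;
    return ratio;
}
```

A caller would still need to keep the result within whatever range the
injection threads can actually honor.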


CPU Online/Offline
------------------
Per-CPU kernel threads are started/stopped upon receiving
notifications of CPU hotplug activities. The intel_powerclamp driver
keeps track of clamping kernel threads, even after a CPU offline
event migrates them to other CPUs.


Performance Analysis
====================
This section describes the general performance data collected on
multiple systems, including Westmere (80P) and Ivy Bridge (4P, 8P).

Effectiveness and Limitations
-----------------------------
The maximum idle injection ratio is capped at 50 percent. As
mentioned earlier, since interrupts are allowed during forced idle
time, excessive interrupts could result in less effectiveness. The
extreme case would be doing a ping -f to generate a flood of network
interrupts without much CPU acknowledgement. In this case, little can
be done from the idle injection threads. In most normal cases, such
as copying a large file with scp, applications can be throttled by the
powerclamp driver, since slowing down the CPU also slows down network
protocol processing, which in turn reduces interrupts.

When control parameters are changed at runtime by the controlling
CPU, it may take an additional period for the rest of the CPUs to
catch up with the changes. During this time, idle injection is out of
sync, thus not able to enter package C-states at the expected ratio.
But this effect is minor, in that in most cases the target ratio is
updated much less frequently than the idle injection frequency.

Scalability
-----------
Tests also show a minor, but measurable, difference between the 4P/8P
Ivy Bridge system and the 80P Westmere server under 50% idle ratio.
More compensation is needed on Westmere for the same amount of
target idle ratio. The compensation also increases as the idle ratio
gets larger. This is why the calibration code is needed.

On the IVB 8P system, compared to taking a CPU offline, powerclamp can
achieve up to 40% better performance per watt (measured by a spin
counter summed over per-CPU counting threads spawned for all running
CPUs).

Usage and Interfaces
====================
The powerclamp driver is registered to the generic thermal layer as a
cooling device. Currently, it’s not bound to any thermal zones::

  jacob@chromoly:/sys/class/thermal/cooling_device14$ grep . *
  cur_state:0
  max_state:50
  type:intel_powerclamp

cur_state allows the user to set the desired idle percentage. Writing
0 to cur_state will stop idle injection. Writing a value between 1 and
max_state will start the idle injection. Reading cur_state returns the
actual and current idle percentage. This may not be the same value
set by the user, in that the current idle percentage depends on
workload and includes natural idle. When idle injection is disabled,
reading cur_state returns -1 instead of 0, to avoid confusing the
100% busy state with the disabled state.
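
A userspace program can drive the same interface with ordinary file I/O. The
sketch below is illustrative only: the function names are hypothetical, and
the caller is assumed to have already located the cooling device directory
whose type file reads intel_powerclamp, since the device index differs from
system to system.

```c
#include <stdio.h>

/*
 * Hypothetical helper: write a target idle percentage to <dir>/cur_state.
 * Values outside [0, max_state] are rejected. Returns 0 on success.
 */
static int powerclamp_set(const char *dir, int pct, int max_state)
{
    char path[256];
    FILE *f;
    int ok;

    if (pct < 0 || pct > max_state)
        return -1;
    snprintf(path, sizeof(path), "%s/cur_state", dir);
    f = fopen(path, "w");
    if (!f)
        return -1;
    ok = fprintf(f, "%d\n", pct) > 0;
    if (fclose(f) != 0)
        ok = 0;         /* the write may only fail at flush time */
    return ok ? 0 : -1;
}

/*
 * Hypothetical helper: read <dir>/cur_state into *pct. Per the text
 * above, -1 means idle injection is disabled. Returns 0 on success.
 */
static int powerclamp_get(const char *dir, int *pct)
{
    char path[256];
    FILE *f;
    int ok;

    snprintf(path, sizeof(path), "%s/cur_state", dir);
    f = fopen(path, "r");
    if (!f)
        return -1;
    ok = fscanf(f, "%d", pct) == 1;
    fclose(f);
    return ok ? 0 : -1;
}
```

Writing to the real sysfs file requires root privileges. The value read back
can differ from the value written, because it reflects measured idle time
rather than the stored target.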

Example usage:

- To inject 25% idle time::

    $ sudo sh -c "echo 25 > /sys/class/thermal/cooling_device80/cur_state"

If the system is not busy and has more than 25% idle time already,
then the powerclamp driver will not start idle injection. Using top
will not show idle injection kernel threads.

If the system is busy (spin test below) and has less than 25% natural
idle time, powerclamp kernel threads will do idle injection. Forced
idle time is accounted as normal idle, in that the same code path is
taken as for the idle task.

In this example, 24.1% idle is shown. This helps the system admin or
user determine the cause of slowdown when the powerclamp driver is in
action::


  Tasks: 197 total,   1 running, 196 sleeping,   0 stopped,   0 zombie
  Cpu(s): 71.2%us,  4.7%sy,  0.0%ni, 24.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
  Mem:   3943228k total,  1689632k used,  2253596k free,    74960k buffers
  Swap:  4087804k total,        0k used,  4087804k free,   945336k cached

    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
   3352 jacob     20   0  262m  644  428 S  286  0.0   0:17.16 spin
   3341 root     -51   0     0    0    0 D   25  0.0   0:01.62 kidle_inject/0
   3344 root     -51   0     0    0    0 D   25  0.0   0:01.60 kidle_inject/3
   3342 root     -51   0     0    0    0 D   25  0.0   0:01.61 kidle_inject/1
   3343 root     -51   0     0    0    0 D   25  0.0   0:01.60 kidle_inject/2
   2935 jacob     20   0  696m 125m  35m S    5  3.3   0:31.11 firefox
   1546 root      20   0  158m  20m 6640 S    3  0.5   0:26.97 Xorg
   2100 jacob     20   0 1223m  88m  30m S    3  2.3   0:23.68 compiz

Tests have shown that by using the powerclamp driver as a cooling
device, a PID-based userspace thermal controller can manage to
control CPU temperature effectively, when no other thermal influence
is added. For example, an Ultrabook user can compile the kernel under
a certain temperature (below most active trip points).