=======================
Intel Powerclamp Driver
=======================

By:
  - Arjan van de Ven <arjan@linux.intel.com>
  - Jacob Pan <jacob.jun.pan@linux.intel.com>

.. Contents:

        (*) Introduction
            - Goals and Objectives

        (*) Theory of Operation
            - Idle Injection
            - Calibration

        (*) Performance Analysis
            - Effectiveness and Limitations
            - Power vs Performance
            - Scalability
            - Calibration
            - Comparison with Alternative Techniques

        (*) Usage and Interfaces
            - Generic Thermal Layer (sysfs)
            - Kernel APIs (TBD)

INTRODUCTION
============

Consider the situation where a system’s power consumption must be
reduced at runtime, due to a power budget, thermal constraint, or noise
level, and where active cooling is not preferred. Software-managed
passive power reduction must be performed to prevent the hardware
actions that are designed for catastrophic scenarios.

Currently, P-states, T-states (clock modulation), and CPU offlining
are used for CPU throttling.

On Intel CPUs, C-states provide effective power reduction, but so far
they’re only used opportunistically, based on workload. The
intel_powerclamp driver introduces a method of synchronizing idle
injection across all online CPU threads. The goal is to achieve forced
and controllable C-state residency.

Tests and analysis have been carried out in the areas of power,
performance, scalability, and user experience. In many cases, a clear
advantage is shown over taking the CPU offline or modulating the CPU
clock.


THEORY OF OPERATION
===================

Idle Injection
--------------

On modern Intel processors (Nehalem or later), package-level C-state
residency is available in MSRs, and thus also available to the kernel.

These MSRs are::

      #define MSR_PKG_C2_RESIDENCY      0x60D
      #define MSR_PKG_C3_RESIDENCY      0x3F8
      #define MSR_PKG_C6_RESIDENCY      0x3F9
      #define MSR_PKG_C7_RESIDENCY      0x3FA

If the kernel can also inject idle time into the system, then a
closed-loop control system can be established that manages package-level
C-state residency. The intel_powerclamp driver is conceived as such a
control system, where the target set point is a user-selected idle
ratio (based on power reduction), and the error is the difference
between the actual package-level C-state residency ratio and the target
idle ratio.

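The control error described above can be illustrated with a small
sketch. It is not the driver's code: it assumes that the residency
counters advance at the TSC rate, so that a counter delta divided by a
TSC delta gives a residency ratio, and it picks "target minus actual" as
the sign convention. The function names are hypothetical, and counter
wraparound is ignored.

```python
# Illustrative sketch only, not the driver's implementation.
# Assumption: package C-state residency counters tick at the TSC rate.


def residency_pct(msr_prev, msr_now, tsc_prev, tsc_now):
    """Percentage of the sampling window spent in a package C-state."""
    tsc_delta = tsc_now - tsc_prev
    if tsc_delta <= 0:
        raise ValueError("TSC must advance between samples")
    return 100 * (msr_now - msr_prev) // tsc_delta


def control_error(target_idle_pct, actual_idle_pct):
    """Assumed sign convention: positive means too little idle was achieved."""
    return target_idle_pct - actual_idle_pct


# Hypothetical sample: 250 residency ticks over a 1000-tick window.
actual = residency_pct(msr_prev=0, msr_now=250, tsc_prev=0, tsc_now=1000)
error = control_error(target_idle_pct=30, actual_idle_pct=actual)
```

A positive error here would call for more injected idle time in the next
window, a negative one for less.
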
Injection is controlled by high-priority kernel threads, spawned for
each online CPU.

These kernel threads, with SCHED_FIFO class, are created to perform
clamping actions of controlled duty ratio and duration. Each per-CPU
thread synchronizes its idle time and duration, based on the rounding
of jiffies, so that errors do not accumulate and cause a jittery
effect. Threads are also bound to the CPU such that they cannot be
migrated, unless the CPU is taken offline. In this case, threads
belonging to the offlined CPUs will be terminated immediately.

Running as SCHED_FIFO at a relatively high priority also allows this
scheme to work for both preemptible and non-preemptible kernels.
Alignment of idle time around jiffies ensures scalability for HZ
values. This effect can be better visualized using a Perf timechart.
The following diagram shows the behavior of kernel thread
kidle_inject/cpu. During idle injection, it runs monitor/mwait idle
for a given "duration", then relinquishes the CPU to other tasks,
until the next time interval.

The NOHZ schedule tick is disabled during idle time, but interrupts
are not masked. Tests show that the extra wakeups from the scheduler
tick have a dramatic impact on the effectiveness of the powerclamp
driver on large scale systems (Westmere system with 80 processors).

::

  CPU0
                    ____________          ____________
  kidle_inject/0   |   sleep    |  mwait |  sleep     |
          _________|            |________|            |_______
                                 duration
  CPU1
                    ____________          ____________
  kidle_inject/1   |   sleep    |  mwait |  sleep     |
          _________|            |________|            |_______
                                ^
                                |
                                |
                                roundup(jiffies, interval)

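The alignment shown in the diagram can be sketched in a few lines. The
rounding itself is plain integer arithmetic. The relation between
duration, ratio and interval used below is an assumption made for
illustration (a duty ratio of N percent means the injection period is
100/N times the idle duration), and the function names are hypothetical.

```python
def roundup(x, y):
    """Round x up to the next multiple of y (integer arithmetic)."""
    return ((x + y - 1) // y) * y


def injection_interval(duration_jiffies, target_ratio_pct):
    """Assumed relation: period that yields the requested duty ratio."""
    return duration_jiffies * 100 // target_ratio_pct


# Hypothetical numbers: 6 jiffies of idle per period at a 25% target.
interval = injection_interval(duration_jiffies=6, target_ratio_pct=25)

# Two CPUs that sample slightly different jiffies values still compute
# the same wakeup time, so their idle periods line up.
wake_cpu0 = roundup(1001, interval)
wake_cpu1 = roundup(1003, interval)
```

Because every thread rounds to the same grid, a small difference in when
each thread reads jiffies does not turn into a drift between CPUs.
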
Only one CPU is allowed to collect statistics and update global
control parameters. This CPU is referred to as the controlling CPU in
this document. The controlling CPU is elected at runtime, with a
policy that favors the BSP (bootstrap processor), taking into account
the possibility of a CPU hot-plug.

In terms of the dynamics of the idle control system, package-level idle
time is considered largely as a non-causal system where its behavior
cannot be based on the past or current input. Therefore, the
intel_powerclamp driver attempts to enforce the desired idle time
instantly as given input (target idle ratio). After injection,
powerclamp monitors the actual idle for a given time window and adjusts
the next injection accordingly to avoid over/under correction.

When used in a causal control system, such as a temperature control,
it is up to the user of this driver to implement algorithms where
past samples and outputs are included in the feedback. For example, a
PID-based thermal controller can use the powerclamp driver to
maintain a desired target temperature, based on integral and
derivative gains of the past samples.


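As a rough illustration of such a user-side controller, the sketch below
maps a temperature reading to an idle percentage. It is a textbook PID
loop, not code from the driver or from any particular thermal daemon.
The class name, the gains and the temperatures are hypothetical, and the
50 percent ceiling mirrors the cap on the injection ratio. Anti-windup
and sensor filtering are left out.

```python
class PIDController:
    """Minimal PID loop that outputs an idle percentage (illustrative)."""

    def __init__(self, kp, ki, kd, target_temp, max_idle_pct=50):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.target_temp = target_temp
        self.max_idle_pct = max_idle_pct
        self.integral = 0.0
        self.prev_error = None

    def update(self, temp, dt):
        """Return the idle percentage to request for this sample."""
        error = temp - self.target_temp  # positive when too hot
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        output = (self.kp * error
                  + self.ki * self.integral
                  + self.kd * derivative)
        # Clamp to the range the cooling device accepts.
        return int(min(max(output, 0), self.max_idle_pct))


# Hypothetical gains; a proportional-only setup keeps the example simple.
pid = PIDController(kp=2.0, ki=0.0, kd=0.0, target_temp=70.0)
idle_pct = pid.update(temp=80.0, dt=1.0)
```

The returned value is what such a controller would write to the cooling
device on each sampling period.
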
Calibration
-----------

During scalability testing, it is observed that synchronized actions
among CPUs become challenging as the number of cores grows. This is
also true for the ability of a system to enter package-level C-states.

To make sure the intel_powerclamp driver scales well, online
calibration is implemented. The goals for doing such a calibration
are:

a) determine the effective range of idle injection ratio
b) determine the amount of compensation needed at each target ratio

Compensation for each target ratio consists of two parts:

        a) steady state error compensation
        This is to offset the error occurring when the system can
        enter idle without extra wakeups (such as external interrupts).

        b) dynamic error compensation
        When an excessive number of wakeups occurs during idle, an
        additional idle ratio can be added to quiet interrupts, by
        slowing down CPU activities.

A debugfs file is provided for the user to examine compensation
progress and results, such as on a Westmere system::

  [jacob@nex01 ~]$ cat /sys/kernel/debug/intel_powerclamp/powerclamp_calib
  controlling cpu: 0
  pct confidence steady dynamic (compensation)
  0       0       0       0
  1       1       0       0
  2       1       1       0
  3       3       1       0
  4       3       1       0
  5       3       1       0
  6       3       1       0
  7       3       1       0
  8       3       1       0
  ...
  30      3       2       0
  31      3       2       0
  32      3       1       0
  33      3       2       0
  34      3       1       0
  35      3       2       0
  36      3       1       0
  37      3       2       0
  38      3       1       0
  39      3       2       0
  40      3       3       0
  41      3       1       0
  42      3       2       0
  43      3       1       0
  44      3       1       0
  45      3       2       0
  46      3       3       0
  47      3       0       0
  48      3       2       0
  49      3       3       0

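For scripted inspection, output in the layout shown above can be turned
into records. The parser below is a sketch that assumes exactly that
layout (a "controlling cpu:" line, a header line, then four integer
columns per row). The function name and the dictionary keys are
hypothetical, and any line that does not match is skipped.

```python
def parse_calib(text):
    """Parse powerclamp_calib-style text into (controlling_cpu, rows)."""
    controlling_cpu = None
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if line.strip().startswith("controlling cpu:"):
            controlling_cpu = int(fields[-1])
            continue
        if len(fields) != 4:
            continue  # header, "..." or blank line
        try:
            pct, confidence, steady, dynamic = (int(f) for f in fields)
        except ValueError:
            continue  # not a data row
        rows.append({"pct": pct, "confidence": confidence,
                     "steady": steady, "dynamic": dynamic})
    return controlling_cpu, rows


sample = """controlling cpu: 0
pct confidence steady dynamic (compensation)
0       0       0       0
2       1       1       0
...
49      3       3       0
"""
cpu, rows = parse_calib(sample)
```

Reading the real file requires debugfs to be mounted and usually root
privileges; the sample string above stands in for it.
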
Calibration occurs during runtime. No offline method is available.
Steady state compensation is used only when the confidence levels of
all adjacent ratios have reached a satisfactory level. A confidence
level is accumulated based on clean data collected at runtime. Data
collected during a period without extra interrupts is considered
clean.

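One way to read the confidence rule is sketched below. It is an
interpretation, not the driver's code: "adjacent ratios" is taken to
mean the target ratio and its two neighbours, the threshold of 3 is
borrowed from the confidence column in the sample output, and averaging
the three steady values is an assumption. All names are hypothetical.

```python
CONFIDENCE_OK = 3  # assumed threshold, taken from the sample output


def steady_compensation(calib, ratio, threshold=CONFIDENCE_OK):
    """Return steady-state compensation for ratio, or 0 if not trusted.

    calib maps an idle ratio (percent) to a dict with "confidence" and
    "steady" entries.
    """
    neighbours = [calib.get(r) for r in (ratio - 1, ratio, ratio + 1)]
    if any(n is None or n["confidence"] < threshold for n in neighbours):
        return 0  # not enough clean data around this ratio yet
    # Assumption: average the neighbours to smooth out noisy samples.
    return sum(n["steady"] for n in neighbours) // len(neighbours)


# Hypothetical calibration data.
calib = {
    29: {"confidence": 3, "steady": 2},
    30: {"confidence": 3, "steady": 2},
    31: {"confidence": 3, "steady": 2},
    32: {"confidence": 1, "steady": 1},
}
comp = steady_compensation(calib, 30)
```

With this reading, a single low-confidence neighbour is enough to
withhold compensation for a ratio.
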
To compensate for an excessive number of wakeups during idle, additional
idle time is injected when such a condition is detected. Currently,
the driver uses a simple algorithm that doubles the injection ratio. A
possible enhancement might be to throttle the offending IRQ, such as
delaying EOI for level triggered interrupts. But it is a challenge to be
non-intrusive to the scheduler or the IRQ core code.


CPU Online/Offline
------------------

Per-CPU kernel threads are started/stopped upon receiving
notifications of CPU hotplug activities. The intel_powerclamp driver
keeps track of clamping kernel threads, even after they are migrated
to other CPUs, after a CPU offline event.


Performance Analysis
====================

This section describes the general performance data collected on
multiple systems, including Westmere (80P) and Ivy Bridge (4P, 8P).

Effectiveness and Limitations
-----------------------------

The idle injection ratio is capped at 50 percent. As mentioned earlier,
since interrupts are allowed during forced idle time, excessive
interrupts could result in less effectiveness. The extreme case would
be doing a ping -f to generate flooded network interrupts without much
CPU acknowledgement. In this case, little can be done from the idle
injection threads. In most normal cases, such as copying a large file
with scp, applications can be throttled by the powerclamp driver, since
slowing down the CPU also slows down network protocol processing, which
in turn reduces interrupts.

When control parameters are changed at runtime by the controlling CPU,
it may take an additional period for the rest of the CPUs to catch up
with the changes. During this time, idle injection is out of sync, and
thus not able to enter package C-states at the expected ratio. But
this effect is minor, because in most cases the target ratio is updated
much less frequently than the idle injection frequency.

Scalability
-----------

Tests also show a minor, but measurable, difference between the 4P/8P
Ivy Bridge system and the 80P Westmere server under 50% idle ratio.
More compensation is needed on Westmere for the same target idle
ratio. The compensation also increases as the idle ratio gets larger.
These differences are why the calibration code is needed.

On the IVB 8P system, compared to taking CPUs offline, powerclamp can
achieve up to 40% better performance per watt (measured by a spin
counter summed over per-CPU counting threads spawned for all running
CPUs).

Usage and Interfaces
====================

The powerclamp driver is registered to the generic thermal layer as a
cooling device. Currently, it’s not bound to any thermal zones::

  jacob@chromoly:/sys/class/thermal/cooling_device14$ grep . *
  cur_state:0
  max_state:50
  type:intel_powerclamp

cur_state allows the user to set the desired idle percentage. Writing 0
to cur_state will stop idle injection. Writing a value between 1 and
max_state will start the idle injection. Reading cur_state returns the
actual and current idle percentage. This may not be the same value
set by the user, in that the current idle percentage depends on workload
and includes natural idle. When idle injection is disabled, reading
cur_state returns -1 instead of 0, to avoid confusing the 100% busy
state with the disabled state.

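A script that polls cur_state has to treat the disabled state
separately from a busy system. The helper below is a minimal sketch of
that distinction, following the -1 convention described above. The
function name is hypothetical, and the directory is passed in by the
caller because the cooling device number differs between systems.

```python
from pathlib import Path


def read_idle_state(device_dir):
    """Return the current idle percentage, or None if injection is off."""
    value = int(Path(device_dir, "cur_state").read_text().strip())
    # A negative reading marks "injection disabled", not "0% idle".
    return None if value < 0 else value
```

Returning None keeps callers from mistaking a disabled clamp for a fully
busy system.
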
Example usage:

- To inject 25% idle time::

        $ sudo sh -c "echo 25 > /sys/class/thermal/cooling_device80/cur_state"

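Since the cooling device number varies from system to system, a script
may prefer to look the device up by its type rather than hard-code the
path. The sketch below does that and then writes the requested
percentage. The helper names are hypothetical; it assumes the sysfs
layout shown earlier (type, max_state and cur_state files in each
cooling_device directory), and writing to the real files needs root
privileges.

```python
from pathlib import Path

THERMAL_ROOT = Path("/sys/class/thermal")


def find_powerclamp(root=THERMAL_ROOT):
    """Return the cooling device directory of type intel_powerclamp."""
    for type_file in sorted(Path(root).glob("cooling_device*/type")):
        if type_file.read_text().strip() == "intel_powerclamp":
            return type_file.parent
    return None  # driver not loaded or not registered


def set_idle_pct(device_dir, pct):
    """Request pct percent idle injection; 0 stops injection."""
    max_state = int(Path(device_dir, "max_state").read_text())
    if not 0 <= pct <= max_state:
        raise ValueError(f"idle percentage must be within 0..{max_state}")
    Path(device_dir, "cur_state").write_text(f"{pct}\n")
```

A caller would run set_idle_pct(find_powerclamp(), 25) after checking
that the lookup did not return None.
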
If the system is not busy and has more than 25% idle time already,
then the powerclamp driver will not start idle injection. Using top
will not show idle injection kernel threads.

If the system is busy (spin test below) and has less than 25% natural
idle time, powerclamp kernel threads will do idle injection. Forced
idle time is accounted as normal idle, in that the same code path is
taken as the idle task.

In this example, 24.1% idle is shown. This helps the system admin or
user determine the cause of slowdown, when the powerclamp driver is in
action::

  Tasks: 197 total,   1 running, 196 sleeping,   0 stopped,   0 zombie
  Cpu(s): 71.2%us,  4.7%sy,  0.0%ni, 24.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
  Mem:   3943228k total,  1689632k used,  2253596k free,    74960k buffers
  Swap:  4087804k total,        0k used,  4087804k free,   945336k cached

    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
   3352 jacob     20   0  262m  644  428 S  286  0.0   0:17.16 spin
   3341 root     -51   0     0    0    0 D   25  0.0   0:01.62 kidle_inject/0
   3344 root     -51   0     0    0    0 D   25  0.0   0:01.60 kidle_inject/3
   3342 root     -51   0     0    0    0 D   25  0.0   0:01.61 kidle_inject/1
   3343 root     -51   0     0    0    0 D   25  0.0   0:01.60 kidle_inject/2
   2935 jacob     20   0  696m 125m  35m S    5  3.3   0:31.11 firefox
   1546 root      20   0  158m  20m 6640 S    3  0.5   0:26.97 Xorg
   2100 jacob     20   0 1223m  88m  30m S    3  2.3   0:23.68 compiz

0306     PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
0307    3352 jacob     20   0  262m  644  428 S  286  0.0   0:17.16 spin
0308    3341 root     -51   0     0    0    0 D   25  0.0   0:01.62 kidle_inject/0
0309    3344 root     -51   0     0    0    0 D   25  0.0   0:01.60 kidle_inject/3
0310    3342 root     -51   0     0    0    0 D   25  0.0   0:01.61 kidle_inject/1
0311    3343 root     -51   0     0    0    0 D   25  0.0   0:01.60 kidle_inject/2
0312    2935 jacob     20   0  696m 125m  35m S    5  3.3   0:31.11 firefox
0313    1546 root      20   0  158m  20m 6640 S    3  0.5   0:26.97 Xorg
0314    2100 jacob     20   0 1223m  88m  30m S    3  2.3   0:23.68 compiz
0315 
0316 Tests have shown that by using the powerclamp driver as a cooling
0317 device, a PID based userspace thermal controller can manage to
0318 control CPU temperature effectively, when no other thermal influence
0319 is added. For example, a UltraBook user can compile the kernel under
0320 certain temperature (below most active trip points).