.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

==============================================
``intel_idle`` CPU Idle Time Management Driver
==============================================

:Copyright: |copy| 2020 Intel Corporation

:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


General Information
===================

``intel_idle`` is a part of the
:doc:`CPU idle time management subsystem <cpuidle>` in the Linux kernel
(``CPUIdle``).  It is the default CPU idle time management driver for the
Nehalem and later generations of Intel processors, but the level of support for
a particular processor model in it depends on whether or not it recognizes that
processor model and may also depend on information coming from the platform
firmware.  [To understand ``intel_idle`` it is necessary to know how ``CPUIdle``
works in general, so this is the time to get familiar with
Documentation/admin-guide/pm/cpuidle.rst if you have not done that yet.]

``intel_idle`` uses the ``MWAIT`` instruction to inform the processor that the
logical CPU executing it is idle and so it may be possible to put some of the
processor's functional blocks into low-power states.  That instruction takes two
arguments (passed in the ``EAX`` and ``ECX`` registers of the target CPU), the
first of which, referred to as a *hint*, can be used by the processor to
determine what can be done (for details refer to Intel Software Developer’s
Manual [1]_).  Accordingly, ``intel_idle`` refuses to work with processors in
which the support for the ``MWAIT`` instruction has been disabled (for example,
via the platform firmware configuration menu) or which do not support that
instruction at all.
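
As an illustration of what a hint value carries, the sketch below splits one
into its two fields.  It assumes the ``EAX`` layout described in the manual
referenced above, in which bits 7:4 hold the target C-state (stored as the
C-state number minus one) and bits 3:0 hold a sub-state.  The function name
``decode_mwait_hint()`` is hypothetical and is not part of the kernel.

.. code-block:: python

    def decode_mwait_hint(hint):
        """Split an MWAIT hint (EAX value) into (C-state number, sub-state)."""
        cstate_field = (hint >> 4) & 0xF  # bits 7:4 (assumed layout)
        substate = hint & 0xF             # bits 3:0 (assumed layout)
        # The field holds the C-state number minus one, so 0xF wraps to C0.
        cstate = (cstate_field + 1) & 0xF
        return cstate, substate

Under that assumption, ``decode_mwait_hint(0x20)`` returns ``(3, 0)``.  The
number obtained this way follows the processor's own numbering and need not
match the name under which the driver exposes the idle state.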

``intel_idle`` is not modular, so it cannot be unloaded, which means that the
only way to pass early-configuration-time parameters to it is via the kernel
command line.


.. _intel-idle-enumeration-of-states:

Enumeration of Idle States
==========================

Each ``MWAIT`` hint value is interpreted by the processor as a license to
reconfigure itself in a certain way in order to save energy.  The processor
configurations (with reduced power draw) resulting from that are referred to
as C-states (in the ACPI terminology) or idle states.  The list of meaningful
``MWAIT`` hint values and idle states (i.e. low-power configurations of the
processor) corresponding to them depends on the processor model and it may also
depend on the configuration of the platform.

In order to create a list of available idle states required by the ``CPUIdle``
subsystem (see :ref:`idle-states-representation` in
Documentation/admin-guide/pm/cpuidle.rst),
``intel_idle`` can use two sources of information: static tables of idle states
for different processor models included in the driver itself and the ACPI tables
of the system.  The former are always used if the processor model at hand is
recognized by ``intel_idle`` and the latter are used if that is required for
the given processor model (which is the case for all server processor models
recognized by ``intel_idle``) or if the processor model is not recognized.
[There is a module parameter that can be used to make the driver use the ACPI
tables with any processor model recognized by it; see
`below <intel-idle-parameters_>`_.]

If the ACPI tables are going to be used for building the list of available idle
states, ``intel_idle`` first looks for a ``_CST`` object under one of the ACPI
objects corresponding to the CPUs in the system (refer to the ACPI specification
[2]_ for the description of ``_CST`` and its output package).  Because the
``CPUIdle`` subsystem expects that the list of idle states supplied by the
driver will be suitable for all of the CPUs handled by it and ``intel_idle`` is
registered as the ``CPUIdle`` driver for all of the CPUs in the system, the
driver looks for the first ``_CST`` object returning at least one valid idle
state description and such that all of the idle states included in its return
package are of the FFH (Functional Fixed Hardware) type, which means that the
``MWAIT`` instruction is expected to be used to tell the processor that it can
enter one of them.  The return package of that ``_CST`` is then assumed to be
applicable to all of the other CPUs in the system and the idle state
descriptions extracted from it are stored in a preliminary list of idle states
coming from the ACPI tables.  [This step is skipped if ``intel_idle`` is
configured to ignore the ACPI tables; see `below <intel-idle-parameters_>`_.]

Next, the first (index 0) entry in the list of available idle states is
initialized to represent a "polling idle state" (a pseudo-idle state in which
the target CPU continuously fetches and executes instructions), and the
subsequent (real) idle state entries are populated as follows.

If the processor model at hand is recognized by ``intel_idle``, there is a
(static) table of idle state descriptions for it in the driver.  In that case,
the "internal" table is the primary source of information on idle states and the
information from it is copied to the final list of available idle states.  If
using the ACPI tables for the enumeration of idle states is not required
(depending on the processor model), all of the listed idle states are enabled by
default (so all of them will be taken into consideration by ``CPUIdle``
governors during CPU idle state selection).  Otherwise, some of the listed idle
states may not be enabled by default if there are no matching entries in the
preliminary list of idle states coming from the ACPI tables.  In that case user
space can still enable them later (on a per-CPU basis) with the help of
the ``disable`` idle state attribute in ``sysfs`` (see
:ref:`idle-states-representation` in
Documentation/admin-guide/pm/cpuidle.rst).  This basically means that
the idle states "known" to the driver may not be enabled by default if they have
not been exposed by the platform firmware (through the ACPI tables).
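
To see which idle states ended up disabled on a given CPU, the ``name`` and
``disable`` attributes can be read from ``sysfs``.  The following is a minimal
sketch, assuming the usual ``cpu<N>/cpuidle/state<i>`` directory layout under
``/sys/devices/system/cpu``; the function name and its parameters are
hypothetical.

.. code-block:: python

    import glob
    import os

    def list_idle_states(cpu=0, sysfs_root="/sys/devices/system/cpu"):
        """Return a sorted list of (index, name, disabled) for one CPU."""
        base = os.path.join(sysfs_root, "cpu%d" % cpu, "cpuidle")
        states = []
        for path in glob.glob(os.path.join(base, "state[0-9]*")):
            index = int(os.path.basename(path)[len("state"):])
            with open(os.path.join(path, "name")) as f:
                name = f.read().strip()
            with open(os.path.join(path, "disable")) as f:
                # Any value other than 0 is treated as "disabled" here.
                disabled = f.read().strip() != "0"
            states.append((index, name, disabled))
        return sorted(states)

Writing ``0`` to the ``disable`` attribute of a state (as root) enables that
state for the CPU in question.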

If the given processor model is not recognized by ``intel_idle``, but it
supports ``MWAIT``, the preliminary list of idle states coming from the ACPI
tables is used for building the final list that will be supplied to the
``CPUIdle`` core during driver registration.  For each idle state in that list,
the description, ``MWAIT`` hint and exit latency are copied to the corresponding
entry in the final list of idle states.  The name of the idle state represented
by it (to be returned by the ``name`` idle state attribute in ``sysfs``) is
"CX_ACPI", where X is the index of that idle state in the final list (note that
the minimum value of X is 1, because 0 is reserved for the "polling" state), and
its target residency is based on the exit latency value.  Specifically, for
C1-type idle states the exit latency value is also used as the target residency
(for compatibility with the majority of the "internal" tables of idle states for
various processor models recognized by ``intel_idle``) and for the other idle
state types (C2 and C3) the target residency value is 3 times the exit latency
(again, that is because it reflects the target residency to exit latency ratio
in the majority of cases for the processor models recognized by ``intel_idle``).
All of the idle states in the final list are enabled by default in this case.
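
The naming and target residency rules above can be sketched as follows.  This
is an illustration only, not the driver's code: the function name, the
dictionary keys and the ``POLL`` placeholder used for index 0 are assumptions
made for the example.

.. code-block:: python

    def build_acpi_state_list(acpi_states):
        """Build the final idle state list from ACPI-provided descriptions.

        acpi_states is a list of dicts with the keys "type" (1, 2 or 3),
        "hint", "latency_us" and "desc".
        """
        final = [{"name": "POLL"}]  # index 0 is reserved for the polling state
        for cx in acpi_states:
            index = len(final)  # the first real state gets index 1
            latency = cx["latency_us"]
            # C1-type: residency equals latency; C2 and C3: 3 times latency.
            residency = latency if cx["type"] == 1 else 3 * latency
            final.append({
                "name": "C%d_ACPI" % index,
                "desc": cx["desc"],
                "mwait_hint": cx["hint"],
                "exit_latency_us": latency,
                "target_residency_us": residency,
                "enabled": True,  # all states are enabled in this case
            })
        return final

For instance, a C2-type entry with an exit latency of 50 placed second in the
ACPI list would be named ``C2_ACPI`` and get a target residency of 150.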


.. _intel-idle-initialization:

Initialization
==============

The initialization of ``intel_idle`` starts with checking if the kernel command
line options forbid the use of the ``MWAIT`` instruction.  If that is the case,
an error code is returned right away.

The next step is to check whether or not the processor model is known to the
driver, which determines the idle states enumeration method (see
`above <intel-idle-enumeration-of-states_>`_), and whether or not the processor
supports ``MWAIT`` (the initialization fails if that is not the case).  Then,
the ``MWAIT`` support in the processor is enumerated through ``CPUID`` and the
driver initialization fails if the level of support is not as expected (for
example, if the total number of ``MWAIT`` substates returned is 0).
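
The substate check can be pictured with the sketch below.  It assumes that the
substate counts are reported by ``CPUID`` leaf 5 in ``EDX`` as eight 4-bit
fields, one per ``MWAIT`` C-state starting from C0; that layout, the function
names and the sample value used afterwards are assumptions made for
illustration and are not taken from the driver.

.. code-block:: python

    def mwait_substates_per_cstate(edx):
        """Return the substate counts for MWAIT C-states C0 through C7."""
        return [(edx >> (4 * cstate)) & 0xF for cstate in range(8)]

    def mwait_substates_usable(edx):
        """Return False if no MWAIT substates are reported at all."""
        return sum(mwait_substates_per_cstate(edx)) != 0

With a made-up ``EDX`` value of ``0x1120``, the first function returns
``[0, 2, 1, 1, 0, 0, 0, 0]`` and the second one returns ``True``.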

Next, if the driver is not configured to ignore the ACPI tables (see
`below <intel-idle-parameters_>`_), the idle states information provided by the
platform firmware is extracted from them.

Then, ``CPUIdle`` device objects are allocated for all CPUs and the list of
available idle states is created as explained
`above <intel-idle-enumeration-of-states_>`_.

Finally, ``intel_idle`` is registered with the help of cpuidle_register_driver()
as the ``CPUIdle`` driver for all CPUs in the system and a CPU online callback
for configuring individual CPUs is registered via cpuhp_setup_state(), which
(among other things) causes the callback routine to be invoked for all of the
CPUs present in the system at that time (each CPU executes its own instance of
the callback routine).  That routine registers a ``CPUIdle`` device for the CPU
running it (which enables the ``CPUIdle`` subsystem to operate that CPU) and
optionally performs some CPU-specific initialization actions that may be
required for the given processor model.


.. _intel-idle-parameters:

Kernel Command Line Options and Module Parameters
=================================================

The *x86* architecture support code recognizes three kernel command line
options related to CPU idle time management: ``idle=poll``, ``idle=halt``,
and ``idle=nomwait``.  If any of them is present in the kernel command line, the
``MWAIT`` instruction is not allowed to be used, so the initialization of
``intel_idle`` will fail.
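
Whether the running kernel was booted with one of these options can be checked
from user space by inspecting the kernel command line.  The sketch below
assumes that the command line is available in ``/proc/cmdline``; the function
names are hypothetical.

.. code-block:: python

    BLOCKING_IDLE_OPTIONS = ("idle=poll", "idle=halt", "idle=nomwait")

    def mwait_blocking_options(cmdline):
        """Return the options in cmdline that prevent MWAIT from being used."""
        return [opt for opt in cmdline.split() if opt in BLOCKING_IDLE_OPTIONS]

    def read_cmdline(path="/proc/cmdline"):
        """Return the kernel command line of the running system."""
        with open(path) as f:
            return f.read()

A non-empty result of ``mwait_blocking_options(read_cmdline())`` indicates that
``intel_idle`` could not have been initialized on that boot.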

Apart from that there are four module parameters recognized by ``intel_idle``
itself that can be set via the kernel command line (they cannot be updated via
``sysfs``, so that is the only way to change their values).

The ``max_cstate`` parameter value is the maximum idle state index in the list
of idle states supplied to the ``CPUIdle`` core during the registration of the
driver.  It is also the maximum number of regular (non-polling) idle states that
can be used by ``intel_idle``, so the enumeration of idle states is terminated
after finding that number of usable idle states (the other idle states that
potentially might have been used if ``max_cstate`` had been greater are not
taken into consideration at all).  Setting ``max_cstate`` can prevent
``intel_idle`` from exposing idle states that are regarded as "too deep" for
some reason to the ``CPUIdle`` core, but it does so by making them effectively
invisible until the system is shut down and started again, which may not always
be desirable.  In practice, it is only really necessary to do that if the idle
states in question cannot be enabled during system startup, because in the
working state of the system the CPU power management quality of service (PM
QoS) feature can be used to prevent ``CPUIdle`` from touching those idle states
even if they have been enumerated (see :ref:`cpu-pm-qos` in
Documentation/admin-guide/pm/cpuidle.rst).
Setting ``max_cstate`` to 0 causes the ``intel_idle`` initialization to fail.

The ``no_acpi`` and ``use_acpi`` module parameters (recognized by ``intel_idle``
if the kernel has been configured with ACPI support) can be set to make the
driver ignore the system's ACPI tables entirely or use them for all of the
recognized processor models, respectively (they both are unset by default and
``use_acpi`` has no effect if ``no_acpi`` is set).

The value of the ``states_off`` module parameter (0 by default) represents a
list of idle states to be disabled by default in the form of a bitmask.

Namely, the positions of the bits that are set in the ``states_off`` value are
the indices of idle states to be disabled by default (as reflected by the names
of the corresponding idle state directories in ``sysfs``, :file:`state0`,
:file:`state1` ... :file:`state<i>` ..., where ``<i>`` is the index of the given
idle state; see :ref:`idle-states-representation` in
Documentation/admin-guide/pm/cpuidle.rst).

For example, if ``states_off`` is equal to 3, the driver will disable idle
states 0 and 1 by default, and if it is equal to 8, idle state 3 will be
disabled by default and so on (bit positions beyond the maximum idle state index
are ignored).
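
The mapping between a ``states_off`` value and idle state indices can be
sketched as below.  The helper names are hypothetical; ``num_states`` stands
for the number of entries in the list of idle states, including the polling
state at index 0.

.. code-block:: python

    def states_disabled_by_mask(states_off, num_states):
        """Return the indices of the idle states selected by states_off."""
        # Bits at positions of num_states and above are ignored.
        return [i for i in range(num_states) if states_off & (1 << i)]

    def mask_for_states(indices):
        """Return the states_off value selecting the given state indices."""
        mask = 0
        for i in indices:
            mask |= 1 << i
        return mask

With five idle states, ``states_disabled_by_mask(3, 5)`` returns ``[0, 1]`` and
``states_disabled_by_mask(8, 5)`` returns ``[3]``, matching the examples above.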

The idle states disabled this way can be enabled (on a per-CPU basis) from user
space via ``sysfs``.


.. _intel-idle-core-and-package-idle-states:

Core and Package Levels of Idle States
======================================

Typically, in a processor supporting the ``MWAIT`` instruction there are (at
least) two levels of idle states (or C-states).  One level, referred to as
"core C-states", covers individual cores in the processor, whereas the other
level, referred to as "package C-states", covers the entire processor package
and it may also involve other components of the system (GPUs, memory
controllers, I/O hubs etc.).

Some of the ``MWAIT`` hint values allow the processor to use core C-states only
(most importantly, that is the case for the ``MWAIT`` hint value corresponding
to the ``C1`` idle state), but the majority of them give it a license to put
the target core (i.e. the core containing the logical CPU executing ``MWAIT``
with the given hint value) into a specific core C-state and then (if possible)
to enter a specific package C-state at the deeper level.  For example, the
``MWAIT`` hint value representing the ``C3`` idle state allows the processor to
put the target core into the low-power state referred to as "core ``C3``" (or
``CC3``), which happens if all of the logical CPUs (SMT siblings) in that core
have executed ``MWAIT`` with the ``C3`` hint value (or with a hint value
representing a deeper idle state), and in addition to that (in the majority of
cases) it gives the processor a license to put the entire package (possibly
including some non-CPU components such as a GPU or a memory controller) into the
low-power state referred to as "package ``C3``" (or ``PC3``), which happens if
all of the cores have gone into the ``CC3`` state and (possibly) some additional
conditions are satisfied (for instance, if the GPU is covered by ``PC3``, it may
be required to be in a certain GPU-specific low-power state for ``PC3`` to be
reachable).

As a rule, there is no simple way to make the processor use core C-states only
if the conditions for entering the corresponding package C-states are met, so
the logical CPU executing ``MWAIT`` with a hint value that is not core-level
only (like for ``C1``) must always assume that this may cause the processor to
enter a package C-state.  [That is why the exit latency and target residency
values corresponding to the majority of ``MWAIT`` hint values in the "internal"
tables of idle states in ``intel_idle`` reflect the properties of package
C-states.]  If using package C-states is not desirable at all, either
:ref:`PM QoS <cpu-pm-qos>` or the ``max_cstate`` module parameter of
``intel_idle`` described `above <intel-idle-parameters_>`_ must be used to
restrict the range of permissible idle states to the ones with core-level only
``MWAIT`` hint values (like ``C1``).
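
As an illustration of the PM QoS route, the sketch below holds a CPU latency
limit for the duration of a ``with`` block.  It assumes the
``/dev/cpu_dma_latency`` interface described in
Documentation/admin-guide/pm/cpuidle.rst, where a signed 32-bit value written
to the file stays in effect for as long as the file descriptor remains open.
The function name is hypothetical and a suitable limit has to be chosen by the
user for the system at hand.

.. code-block:: python

    import contextlib
    import os
    import struct

    @contextlib.contextmanager
    def cpu_latency_limit(latency_us, path="/dev/cpu_dma_latency"):
        """Hold a CPU latency limit of latency_us while the block runs."""
        fd = os.open(path, os.O_WRONLY)
        try:
            # The value is written as a signed 32-bit integer in native
            # byte order.
            os.write(fd, struct.pack("=i", latency_us))
            yield fd
        finally:
            os.close(fd)  # closing the file drops the request

For the limit to keep the processor out of package C-states, it needs to be
lower than the exit latency of the shallowest idle state whose ``MWAIT`` hint
is not core-level only, so that ``CPUIdle`` governors skip that state and all
of the deeper ones.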


References
==========

.. [1] *Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2B*,
       https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-2b-manual.html

.. [2] *Advanced Configuration and Power Interface (ACPI) Specification*,
       https://uefi.org/specifications