.. SPDX-License-Identifier: GPL-2.0

===============
Core Scheduling
===============
Core scheduling support allows userspace to define groups of tasks that can
share a core. These groups can be specified either for security use cases (one
group of tasks does not trust another), or for performance use cases (some
workloads may benefit from running on the same core as they do not need the
same hardware resources of the shared core, or may prefer different cores if
they do share hardware resource needs). This document only describes the
security use case.

Security use case
-----------------
A cross-HT attack involves the attacker and victim running on different Hyper
Threads of the same core. MDS and L1TF are examples of such attacks.  The only
full mitigation of cross-HT attacks is to disable Hyper Threading (HT). Core
scheduling is a scheduler feature that can mitigate some (not all) cross-HT
attacks. It allows HT to be turned on safely by ensuring that only tasks in a
user-designated trusted group can share a core. This increase in core sharing
can also improve performance; however, performance is not guaranteed to improve
in all cases, though it has been seen to improve with a number of real-world
workloads. In theory, core scheduling aims to perform at least as well as when
Hyper Threading is disabled. In practice, this is mostly, though not always,
the case: synchronizing scheduling decisions across two or more CPUs in a core
involves additional overhead, especially when the system is lightly loaded.
When ``total_threads <= N_CPUS/2``, where N_CPUS is the total number of CPUs,
the extra overhead may cause core scheduling to perform more poorly than with
SMT disabled. Always measure the performance of your workloads.

Usage
-----
Core scheduling support is enabled via the ``CONFIG_SCHED_CORE`` config option.
Using this feature, userspace defines groups of tasks that can be co-scheduled
on the same core. The core scheduler uses this information to make sure that
tasks that are not in the same group never run simultaneously on a core, while
doing its best to satisfy the system's scheduling requirements.

Core scheduling can be enabled via the ``PR_SCHED_CORE`` prctl interface.
This interface provides support for the creation of core scheduling groups, as
well as admission and removal of tasks from created groups::

    #include <sys/prctl.h>

    int prctl(int option, unsigned long arg2, unsigned long arg3,
            unsigned long arg4, unsigned long arg5);

option:
    ``PR_SCHED_CORE``

arg2:
    Command for operation, must be one of:

    - ``PR_SCHED_CORE_GET`` -- get core_sched cookie of ``pid``.
    - ``PR_SCHED_CORE_CREATE`` -- create a new unique cookie for ``pid``.
    - ``PR_SCHED_CORE_SHARE_TO`` -- push core_sched cookie to ``pid``.
    - ``PR_SCHED_CORE_SHARE_FROM`` -- pull core_sched cookie from ``pid``.

arg3:
    ``pid`` of the task for which the operation applies.

arg4:
    ``pid_type`` for which the operation applies. It is one of the
    ``PR_SCHED_CORE_SCOPE_``-prefixed macro constants.  For example, if arg4
    is ``PR_SCHED_CORE_SCOPE_THREAD_GROUP``, then the operation of this command
    will be performed for all tasks in the task group of ``pid``.

arg5:
    userspace pointer to an unsigned long for storing the cookie returned by
    the ``PR_SCHED_CORE_GET`` command. Should be 0 for all other commands.

In order for a process to push a cookie to, or pull a cookie from, another
process, it is required to have ``PTRACE_MODE_READ_REALCREDS`` ptrace access
to that process.
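
For example, a single call can place the calling process's entire thread group
into a new core scheduling group, and a second call can read the resulting
cookie back. This is only a minimal sketch: it assumes an SMT-enabled machine,
a kernel built with ``CONFIG_SCHED_CORE``, and headers that provide the
``PR_SCHED_CORE`` definitions (``linux/prctl.h``)::

    #include <stdio.h>
    #include <sys/prctl.h>
    #include <linux/prctl.h>

    int main(void)
    {
        unsigned long cookie = 0;

        /* Create a new cookie covering every thread of this process. */
        if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0,
                  PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0)) {
            perror("PR_SCHED_CORE_CREATE");
            return 1;
        }

        /* Read the cookie back; pid 0 refers to the calling task, and
         * PR_SCHED_CORE_GET requires the per-thread scope. */
        if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, 0,
                  PR_SCHED_CORE_SCOPE_THREAD, &cookie)) {
            perror("PR_SCHED_CORE_GET");
            return 1;
        }

        printf("core sched cookie: %#lx\n", cookie);
        return 0;
    }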

Building hierarchies of tasks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The simplest way to build hierarchies of threads/processes which share a
cookie and thus a core is to rely on the fact that the core-sched cookie is
inherited across forks/clones and execs; thus, setting a cookie for the
'initial' script/executable/daemon will place every spawned child in the
same core-sched group.
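
As an illustrative sketch (with the same assumptions as the example above), a
parent that tags itself before forking needs no further calls for its
children::

    #include <stdio.h>
    #include <sys/prctl.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <linux/prctl.h>

    int main(void)
    {
        unsigned long cookie = 0;

        /* Tag the 'initial' process; children will inherit the cookie. */
        if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0,
                  PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0)) {
            perror("PR_SCHED_CORE_CREATE");
            return 1;
        }

        if (fork() == 0) {
            /* The child carries the parent's cookie without any further
             * prctl() calls, as would anything it subsequently execs. */
            prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, 0,
                  PR_SCHED_CORE_SCOPE_THREAD, &cookie);
            printf("child cookie: %#lx\n", cookie);
            _exit(0);
        }
        wait(NULL);
        return 0;
    }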

Cookie Transferral
~~~~~~~~~~~~~~~~~~
Transferring a cookie between the current and other tasks is possible using
``PR_SCHED_CORE_SHARE_FROM`` and ``PR_SCHED_CORE_SHARE_TO`` to inherit a cookie
from a specified task or share a cookie with a task. In combination, this
allows a simple helper program to pull a cookie from a task in an existing core
scheduling group and share it with already running tasks.
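
One possible shape for such a helper (hypothetical name ``coreshare``; it takes
a source pid followed by destination pids, and assumes the caller has the
required ptrace access to each of them)::

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/prctl.h>
    #include <sys/types.h>
    #include <linux/prctl.h>

    int main(int argc, char **argv)
    {
        pid_t src;
        int i;

        if (argc < 3) {
            fprintf(stderr, "usage: coreshare <src-pid> <dst-pid>...\n");
            return 1;
        }
        src = atoi(argv[1]);

        /* Pull the source task's cookie onto this helper ... */
        if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, src,
                  PR_SCHED_CORE_SCOPE_THREAD, 0)) {
            perror("PR_SCHED_CORE_SHARE_FROM");
            return 1;
        }

        /* ... then push it to the whole thread group of each destination. */
        for (i = 2; i < argc; i++) {
            pid_t dst = atoi(argv[i]);

            if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, dst,
                      PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0))
                perror("PR_SCHED_CORE_SHARE_TO");
        }
        return 0;
    }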

Design/Implementation
---------------------
Each task that is tagged is assigned a cookie internally in the kernel. As
mentioned in `Usage`_, tasks with the same cookie value are assumed to trust
each other and share a core.

The basic idea is that every schedule event tries to select tasks for all the
siblings of a core such that all the selected tasks running on a core are
trusted (same cookie) at any point in time. Kernel threads are assumed trusted.
The idle task is considered special, as it trusts everything and everything
trusts it.

During a schedule() event on any sibling of a core, the highest priority task on
the sibling's core is picked and assigned to the sibling calling schedule(), if
the sibling has the task enqueued. For the rest of the siblings in the core, the
highest priority task with the same cookie is selected if there is one runnable
in their individual run queues. If a task with the same cookie is not available,
the idle task is selected.  The idle task is globally trusted.

Once a task has been selected for all the siblings in the core, an IPI is sent to
siblings for whom a new task was selected. Siblings on receiving the IPI will
switch to the new task immediately. If an idle task is selected for a sibling,
then the sibling is considered to be in a `forced idle` state. I.e., it may
have tasks in its own runqueue to run, but it will still have to run idle.
More on this in the next section.
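
The selection logic can be pictured with a small userspace-only sketch. This is
purely schematic (toy types and toy priorities, not the kernel's actual
pick_next_task() code): the core-wide highest priority task is chosen first,
and every other sibling must then find a runnable task with a matching cookie
or go idle::

    #include <stdio.h>

    /* Toy task: a lower 'prio' value means higher priority; cookie 0 is
     * the system-wide default. */
    struct toy_task {
        const char *name;
        int prio;
        unsigned long cookie;
    };

    /* For one sibling's priority-sorted queue, return the first (i.e.
     * highest priority) task whose cookie matches, or "idle" if none
     * does (the forced idle case described in the next section). */
    static struct toy_task pick_compatible(const struct toy_task *rq,
                                           int n, unsigned long cookie)
    {
        int i;

        for (i = 0; i < n; i++)
            if (rq[i].cookie == cookie)
                return rq[i];
        return (struct toy_task){ "idle", 999, 0 };
    }

    int main(void)
    {
        /* Per-sibling runqueues, already sorted by priority. */
        struct toy_task rq0[] = { { "A", 1, 0x1 }, { "B", 5, 0x2 } };
        struct toy_task rq1[] = { { "C", 2, 0x2 }, { "D", 3, 0x1 } };

        /* The core-wide highest priority task ("A", cookie 0x1) runs on
         * sibling 0 ... */
        struct toy_task max = rq0[0].prio <= rq1[0].prio ? rq0[0] : rq1[0];

        /* ... so sibling 1 must skip the higher priority but untrusted
         * "C" and run "D", or go idle if nothing compatible is runnable. */
        struct toy_task other = pick_compatible(rq1, 2, max.cookie);

        printf("sibling0 runs %s, sibling1 runs %s\n", max.name, other.name);
        return 0;
    }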

Forced-idling of hyperthreads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The scheduler tries its best to find tasks that trust each other such that all
tasks selected to be scheduled are of the highest priority in a core.  However,
it is possible that some runqueues had tasks that were incompatible with the
highest priority ones in the core. Favoring security over fairness, one or more
siblings could be forced to select a lower priority task if the highest
priority task is not trusted with respect to the core wide highest priority
task.  If a sibling does not have a trusted task to run, it will be forced idle
by the scheduler (idle thread is scheduled to run).

When the highest priority task is selected to run, a reschedule-IPI is sent to
the sibling to force it into idle. This results in 4 cases which need to be
considered depending on whether a VM or a regular usermode process was running
on either HT::

          HT1 (attack)            HT2 (victim)
   A      idle -> user space      user space -> idle
   B      idle -> user space      guest -> idle
   C      idle -> guest           user space -> idle
   D      idle -> guest           guest -> idle

Note that for better performance, we do not wait for the destination CPU
(victim) to enter idle mode. This is because the sending of the IPI would bring
the destination CPU immediately into kernel mode from user space, or cause a
VMEXIT in the case of guests. At best, this would only leak some scheduler
metadata which may not be worth protecting. It is also possible that the IPI is
received too late on some architectures, but this has not been observed in the
case of x86.

Trust model
~~~~~~~~~~~
Core scheduling maintains trust relationships amongst groups of tasks by
assigning them a tag that is the same cookie value.
When a system with core scheduling boots, all tasks are considered to trust
each other. This is because the core scheduler does not have information about
trust relationships until userspace uses the above mentioned interfaces to
communicate them. In other words, all tasks have a default cookie value of 0
and are considered system-wide trusted. The forced-idling of siblings running
cookie-0 tasks is also avoided.
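
This default can be observed directly; the sketch below (same assumptions as
the earlier examples) reads back the cookie of a task that was never tagged and
prints 0::

    #include <stdio.h>
    #include <sys/prctl.h>
    #include <linux/prctl.h>

    int main(void)
    {
        unsigned long cookie = ~0UL;    /* sentinel, overwritten below */

        /* A task that was never tagged reports the default cookie, 0. */
        if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, 0,
                  PR_SCHED_CORE_SCOPE_THREAD, &cookie)) {
            perror("PR_SCHED_CORE_GET");
            return 1;
        }
        printf("cookie: %#lx\n", cookie);
        return 0;
    }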

Once userspace uses the above mentioned interfaces to group sets of tasks, tasks
within such groups are considered to trust each other, but do not trust those
outside. Tasks outside the group also don't trust tasks within.

Limitations of core-scheduling
------------------------------
Core scheduling tries to guarantee that only trusted tasks run concurrently on a
core. But there could be a small window of time during which untrusted tasks run
concurrently, or the kernel could be running concurrently with a task not
trusted by the kernel.

IPI processing delays
~~~~~~~~~~~~~~~~~~~~~
Core scheduling selects only trusted tasks to run together. An IPI is used to
notify the siblings to switch to the new task. But there could be hardware
delays in receiving the IPI on some architectures (on x86, this has not been
observed). This may cause an attacker task to start running on a CPU before its
siblings receive the IPI. Even though the cache is flushed on entry to user
mode, victim tasks on siblings may populate data in the cache and
microarchitectural buffers after the attacker starts to run, and this opens up
a possibility of data leakage.

Open cross-HT issues that core scheduling does not solve
--------------------------------------------------------
1. For MDS
~~~~~~~~~~
Core scheduling cannot protect against MDS attacks between a sibling running in
user mode and another running in kernel mode. Even though all siblings run
tasks which trust each other, when the kernel is executing code on behalf of a
task, it cannot trust the code running in the sibling. Such attacks are
possible for any combination of sibling CPU modes (host or guest mode).

2. For L1TF
~~~~~~~~~~~
Core scheduling cannot protect against an L1TF guest attacker exploiting a
guest or host victim. This is because the guest attacker can craft invalid
PTEs which are not inverted due to a vulnerable guest kernel. The only
solution is to disable EPT (Extended Page Tables).

For both MDS and L1TF, if the guest vCPUs are configured to not trust each
other (by tagging separately), then the guest to guest attacks would go away.
Or it could be a system admin policy which considers guest to guest attacks as
a guest problem.

Another approach to resolve these would be to make every untrusted task on the
system not trust every other untrusted task. While this could reduce
parallelism of the untrusted tasks, it would still solve the above issues while
allowing system processes (trusted tasks) to share a core.

3. Protecting the kernel (IRQ, syscall, VMEXIT)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Unfortunately, core scheduling does not protect kernel contexts running on
sibling hyperthreads from one another. Prototypes of mitigations have been
posted to LKML to solve this, but it is debatable whether such windows are
practically exploitable, and whether the performance overhead of the prototypes
is worth it (not to mention the added code complexity).

Other Use cases
---------------
The main use case for Core scheduling is mitigating the cross-HT vulnerabilities
with SMT enabled. There are other use cases where this feature could be used:

- Isolating tasks that need a whole core: Examples include realtime tasks, tasks
  that use SIMD instructions, etc.
- Gang scheduling: Requirements for a group of tasks that need to be scheduled
  together could also be realized using core scheduling. One example is the
  vCPUs of a VM; a per-thread tagging sketch for this case follows below.
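
For instance, a VMM could give all of a VM's vCPU threads one shared cookie so
that they are gang scheduled onto the same core. The sketch below is
hypothetical (thread ids are passed on the command line, whereas a real VMM
would already know its vCPU tids) and makes the same assumptions as the earlier
examples::

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/prctl.h>
    #include <sys/types.h>
    #include <linux/prctl.h>

    int main(int argc, char **argv)
    {
        int i;

        /* Create a fresh cookie on the calling thread ... */
        if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0,
                  PR_SCHED_CORE_SCOPE_THREAD, 0)) {
            perror("PR_SCHED_CORE_CREATE");
            return 1;
        }

        /* ... and push it to each vCPU thread individually, so the whole
         * gang ends up in one core scheduling group. */
        for (i = 1; i < argc; i++) {
            pid_t tid = atoi(argv[i]);

            if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, tid,
                      PR_SCHED_CORE_SCOPE_THREAD, 0))
                perror("PR_SCHED_CORE_SHARE_TO");
        }
        return 0;
    }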