==========================================
Reducing OS jitter due to per-cpu kthreads
==========================================

This document lists per-CPU kthreads in the Linux kernel and presents
options to control their OS jitter. Note that non-per-CPU kthreads are
not listed here. To reduce OS jitter from non-per-CPU kthreads, bind
them to a "housekeeping" CPU dedicated to such work.

References
==========

- Documentation/core-api/irq/irq-affinity.rst: Binding interrupts to sets of CPUs.

- Documentation/admin-guide/cgroup-v1: Using cgroups to bind tasks to sets of CPUs.

- man taskset: Using the taskset command to bind tasks to sets
  of CPUs.

- man sched_setaffinity: Using the sched_setaffinity() system
  call to bind tasks to sets of CPUs.

- /sys/devices/system/cpu/cpuN/online: Control CPU N's hotplug state,
  writing "0" to offline and "1" to online.

- In order to locate kernel-generated OS jitter on CPU N::

    cd /sys/kernel/debug/tracing
    echo 1 > max_graph_depth # Increase the "1" for more detail
    echo function_graph > current_tracer
    # run workload
    cat per_cpu/cpuN/trace
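The taskset command accepts a hexadecimal CPU mask. As a minimal sketch, the helper below converts a list of CPU numbers into such a mask. The name ``cpus_to_mask`` is hypothetical and not part of any existing tool, and the helper assumes CPU numbers small enough for shell arithmetic.

```shell
#!/bin/sh
# Hypothetical helper: print the hexadecimal affinity mask for the
# CPU numbers given as arguments (limited to CPUs 0-62 by shell
# arithmetic).
cpus_to_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}

# Example (not run here): bind an existing task to housekeeping CPUs 0-1.
# taskset -p "$(cpus_to_mask 0 1)" "$PID"
```

On systems with more CPUs than shell arithmetic can represent, the list form accepted by "taskset -c" is likely the simpler choice.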

kthreads
========

Name:
  ehca_comp/%u

Purpose:
  Periodically process Infiniband-related work.

To reduce its OS jitter, do any of the following:

1. Don't use eHCA Infiniband hardware, instead choosing hardware
   that does not require per-CPU kthreads. This will prevent these
   kthreads from being created in the first place. (This will
   work for most people, as this hardware, though important, is
   relatively old and is produced in relatively low unit volumes.)
2. Do all eHCA-Infiniband-related work on other CPUs, including
   interrupts.
3. Rework the eHCA driver so that its per-CPU kthreads are
   provisioned only on selected CPUs.

Name:
  irq/%d-%s

Purpose:
  Handle threaded interrupts.

To reduce its OS jitter, do the following:

1. Use irq affinity to force the irq threads to execute on
   some other CPU.
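One way to apply irq affinity is through the per-IRQ smp_affinity_list files under /proc/irq. The following is a sketch only: the function name ``move_irqs`` and its optional directory argument are assumptions made for illustration, and the directory argument exists so that the function can be tried against a scratch directory before touching a live system.

```shell
#!/bin/sh
# Sketch: steer every IRQ to the housekeeping CPUs in $1 (a CPU list
# such as "0-1").  The optional second argument is the directory
# holding the per-IRQ entries; it defaults to /proc/irq.
move_irqs() {
    cpulist=$1
    irqdir=${2:-/proc/irq}
    for f in "$irqdir"/*/smp_affinity_list; do
        [ -e "$f" ] || continue
        # Some IRQs (for example, per-CPU ones) reject affinity
        # changes, so a failed write is reported but not fatal.
        echo "$cpulist" > "$f" 2>/dev/null ||
            echo "could not move ${f%/smp_affinity_list}" >&2
    done
}
```

Run as root, "move_irqs 0-1" would attempt to confine all movable interrupts to CPUs 0 and 1.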

Name:
  kcmtpd_ctr_%d

Purpose:
  Handle Bluetooth work.

To reduce its OS jitter, do one of the following:

1. Don't use Bluetooth, in which case these kthreads won't be
   created in the first place.
2. Use irq affinity to force Bluetooth-related interrupts to
   occur on some other CPU and furthermore initiate all
   Bluetooth activity on some other CPU.

Name:
  ksoftirqd/%u

Purpose:
  Execute softirq handlers when threaded or when under heavy load.

To reduce its OS jitter, each softirq vector must be handled
separately as follows:

TIMER_SOFTIRQ
-------------

Do all of the following:

1. To the extent possible, keep the CPU out of the kernel when it
   is non-idle, for example, by avoiding system calls and by forcing
   both kernel threads and interrupts to execute elsewhere.
2. Build with CONFIG_HOTPLUG_CPU=y. After boot completes, force
   the CPU offline, then bring it back online. This forces
   recurring timers to migrate elsewhere. If you are concerned
   with multiple CPUs, force them all offline before bringing the
   first one back online. Once you have onlined the CPUs in question,
   do not offline any other CPUs, because doing so could force the
   timer back onto one of the CPUs in question.
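The offline-then-online sequence can be sketched as follows, using the /sys/devices/system/cpu/cpuN/online files listed in the references. The function name ``cycle_cpus`` and the ``CPU_SYSFS`` override are hypothetical and serve only to illustrate the ordering: every CPU of interest goes offline before the first one comes back.

```shell
#!/bin/sh
# Sketch: take every listed CPU offline, then bring them all back
# online, so that all are offline before the first is onlined.
# CPU_SYSFS is an assumed override for dry runs; it defaults to
# the real sysfs location.  Requires root and CONFIG_HOTPLUG_CPU=y.
cycle_cpus() {
    base=${CPU_SYSFS:-/sys/devices/system/cpu}
    for cpu in "$@"; do
        echo 0 > "$base/cpu$cpu/online" || return 1
    done
    for cpu in "$@"; do
        echo 1 > "$base/cpu$cpu/online" || return 1
    done
}
```

For example, "cycle_cpus 2 3" would cycle CPUs 2 and 3. Some systems do not allow the boot CPU to be offlined, in which case its online file may be absent.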

NET_TX_SOFTIRQ and NET_RX_SOFTIRQ
---------------------------------

Do all of the following:

1. Force networking interrupts onto other CPUs.
2. Initiate any network I/O on other CPUs.
3. Once your application has started, prevent CPU-hotplug operations
   from being initiated from tasks that might run on the CPU to
   be de-jittered. (It is OK to force this CPU offline and then
   bring it back online before you start your application.)

BLOCK_SOFTIRQ
-------------

Do all of the following:

1. Force block-device interrupts onto some other CPU.
2. Initiate any block I/O on other CPUs.
3. Once your application has started, prevent CPU-hotplug operations
   from being initiated from tasks that might run on the CPU to
   be de-jittered. (It is OK to force this CPU offline and then
   bring it back online before you start your application.)

IRQ_POLL_SOFTIRQ
----------------

Do all of the following:

1. Force block-device interrupts onto some other CPU.
2. Initiate any block I/O and block-I/O polling on other CPUs.
3. Once your application has started, prevent CPU-hotplug operations
   from being initiated from tasks that might run on the CPU to
   be de-jittered. (It is OK to force this CPU offline and then
   bring it back online before you start your application.)

TASKLET_SOFTIRQ
---------------

Do one or more of the following:

1. Avoid use of drivers that use tasklets. (Such drivers will contain
   calls to things like tasklet_schedule().)
2. Convert all drivers that you must use from tasklets to workqueues.
3. Force interrupts for drivers using tasklets onto other CPUs,
   and also do I/O involving these drivers on other CPUs.

SCHED_SOFTIRQ
-------------

Do all of the following:

1. Avoid sending scheduler IPIs to the CPU to be de-jittered.
   For example, ensure that at most one runnable kthread is present
   on that CPU. If a thread that expects to run on the de-jittered
   CPU awakens, the scheduler will send an IPI that can result in
   a subsequent SCHED_SOFTIRQ.
2. Build with CONFIG_NO_HZ_FULL=y and ensure that the CPU to be
   de-jittered is marked as an adaptive-ticks CPU using the
   "nohz_full=" boot parameter. This reduces the number of
   scheduler-clock interrupts that the de-jittered CPU receives,
   minimizing its chances of being selected to do the load balancing
   work that runs in SCHED_SOFTIRQ context.
3. To the extent possible, keep the CPU out of the kernel when it
   is non-idle, for example, by avoiding system calls and by
   forcing both kernel threads and interrupts to execute elsewhere.
   This further reduces the number of scheduler-clock interrupts
   received by the de-jittered CPU.
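To check whether a given CPU was actually marked as adaptive-ticks, its number can be compared against a kernel-style CPU list such as "1-3,5". The helper below is a sketch: the name ``cpu_in_list`` is hypothetical, only plain numbers and simple ranges are handled, and the sysfs file in the usage comment is assumed to be present on CONFIG_NO_HZ_FULL=y kernels.

```shell
#!/bin/sh
# Hypothetical helper: succeed if CPU $1 appears in the kernel-style
# CPU list $2 (for example "1-3,5").  Stride notation is not handled.
cpu_in_list() {
    cpu=$1
    old_ifs=$IFS
    IFS=,
    set -- $2
    IFS=$old_ifs
    for range in "$@"; do
        lo=${range%-*}   # "1-3" -> "1"; a single number is unchanged
        hi=${range#*-}   # "1-3" -> "3"; a single number is unchanged
        if [ "$cpu" -ge "$lo" ] && [ "$cpu" -le "$hi" ]; then
            return 0
        fi
    done
    return 1
}

# Example (not run here):
# cpu_in_list 3 "$(cat /sys/devices/system/cpu/nohz_full)" &&
#     echo "CPU 3 is an adaptive-ticks CPU"
```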

HRTIMER_SOFTIRQ
---------------

Do all of the following:

1. To the extent possible, keep the CPU out of the kernel when it
   is non-idle. For example, avoid system calls and force both
   kernel threads and interrupts to execute elsewhere.
2. Build with CONFIG_HOTPLUG_CPU=y. Once boot completes, force the
   CPU offline, then bring it back online. This forces recurring
   timers to migrate elsewhere. If you are concerned with multiple
   CPUs, force them all offline before bringing the first one
   back online. Once you have onlined the CPUs in question, do not
   offline any other CPUs, because doing so could force the timer
   back onto one of the CPUs in question.

RCU_SOFTIRQ
-----------

Do at least one of the following:

1. Offload callbacks and keep the CPU in either dyntick-idle or
   adaptive-ticks state by doing all of the following:

   a. Build with CONFIG_NO_HZ_FULL=y and ensure that the CPU to be
      de-jittered is marked as an adaptive-ticks CPU using the
      "nohz_full=" boot parameter. Bind the rcuo kthreads to
      housekeeping CPUs, which can tolerate OS jitter.
   b. To the extent possible, keep the CPU out of the kernel
      when it is non-idle, for example, by avoiding system
      calls and by forcing both kernel threads and interrupts
      to execute elsewhere.

2. Enable RCU to do its processing remotely via dyntick-idle by
   doing all of the following:

   a. Build with CONFIG_NO_HZ=y.
   b. Ensure that the CPU goes idle frequently, allowing other
      CPUs to detect that it has passed through an RCU quiescent
      state. If the kernel is built with CONFIG_NO_HZ_FULL=y,
      userspace execution also allows other CPUs to detect that
      the CPU in question has passed through a quiescent state.
   c. To the extent possible, keep the CPU out of the kernel
      when it is non-idle, for example, by avoiding system
      calls and by forcing both kernel threads and interrupts
      to execute elsewhere.

Name:
  kworker/%u:%d%s (cpu, id, priority)

Purpose:
  Execute workqueue requests.

To reduce its OS jitter, do any of the following:

1. Run your workload at a real-time priority, which will allow
   preempting the kworker daemons.
2. A given workqueue can be made visible in the sysfs filesystem
   by passing the WQ_SYSFS flag to that workqueue's alloc_workqueue().
   Such a workqueue can be confined to a given subset of the
   CPUs using the ``/sys/devices/virtual/workqueue/*/cpumask`` sysfs
   files. The set of WQ_SYSFS workqueues can be displayed using
   "ls /sys/devices/virtual/workqueue". That said, the workqueues
   maintainer would like to caution people against indiscriminately
   sprinkling WQ_SYSFS across all the workqueues. The reason for
   caution is that it is easy to add WQ_SYSFS, but because sysfs is
   part of the formal user/kernel API, it can be nearly impossible
   to remove it, even if its addition was a mistake.
3. Do any of the following needed to avoid jitter that your
   application cannot tolerate:

   a. Build your kernel with CONFIG_SLUB=y rather than
      CONFIG_SLAB=y, thus avoiding the slab allocator's periodic
      use of each CPU's workqueues to run its cache_reap()
      function.
   b. Avoid using oprofile, thus avoiding OS jitter from
      wq_sync_buffer().
   c. Limit your CPU frequency so that a CPU-frequency
      governor is not required, possibly enlisting the aid of
      special heatsinks or other cooling technologies. If done
      correctly, and if your CPU architecture permits, you should
      be able to build your kernel with CONFIG_CPU_FREQ=n to
      avoid the CPU-frequency governor periodically running
      on each CPU, including cs_dbs_timer() and od_dbs_timer().

      WARNING: Please check your CPU specifications to
      make sure that this is safe on your particular system.
   d. As of v3.18, Christoph Lameter's on-demand vmstat workers
      commit prevents OS jitter due to vmstat_update() on
      CONFIG_SMP=y systems. Before v3.18, it is not possible
      to entirely get rid of the OS jitter, but you can
      decrease its frequency by writing a large value to
      /proc/sys/vm/stat_interval. The default value is HZ,
      for an interval of one second. Larger values will, of
      course, make your virtual-memory statistics update more
      slowly. You can also run your workload at a real-time
      priority, thus preempting vmstat_update(), but if your
      workload is CPU-bound, this is a bad idea.
      However, there is an RFC patch from Christoph Lameter
      (based on an earlier one from Gilad Ben-Yossef) that
      reduces or even eliminates vmstat overhead for some
      workloads at
      https://lore.kernel.org/r/00000140e9dfd6bd-40db3d4f-c1be-434f-8132-7820f81bb586-000000@email.amazonses.com.
   e. If running on high-end powerpc servers, build with
      CONFIG_PPC_RTAS_DAEMON=n. This prevents the RTAS
      daemon from running on each CPU every second or so.
      (This will require editing Kconfig files and will defeat
      this platform's RAS functionality.) This avoids jitter
      due to the rtas_event_scan() function.

      WARNING: Please check your CPU specifications to
      make sure that this is safe on your particular system.
   f. If running on Cell Processor, build your kernel with
      CBE_CPUFREQ_SPU_GOVERNOR=n to avoid OS jitter from
      spu_gov_work().

      WARNING: Please check your CPU specifications to
      make sure that this is safe on your particular system.
   g. If running on PowerMAC, build your kernel with
      CONFIG_PMAC_RACKMETER=n to disable the CPU-meter,
      avoiding OS jitter from rackmeter_do_timer().
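Confining WQ_SYSFS workqueues through their cpumask files might look like the sketch below. The function name ``confine_workqueues`` and the ``WQ_SYSFS_DIR`` override are assumptions made for illustration. Workqueues that do not expose a cpumask file are skipped.

```shell
#!/bin/sh
# Sketch: write CPU mask $1 (hexadecimal, e.g. "3" for CPUs 0-1) to
# the cpumask file of every WQ_SYSFS workqueue that has one.
# WQ_SYSFS_DIR is an assumed override for dry runs.
confine_workqueues() {
    mask=$1
    dir=${WQ_SYSFS_DIR:-/sys/devices/virtual/workqueue}
    for f in "$dir"/*/cpumask; do
        [ -e "$f" ] || continue
        echo "$mask" > "$f" || echo "could not update $f" >&2
    done
}
```

Run as root, "confine_workqueues 3" would attempt to restrict each such workqueue to CPUs 0 and 1.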

Name:
  rcuc/%u

Purpose:
  Execute RCU callbacks in CONFIG_RCU_BOOST=y kernels.

To reduce its OS jitter, do at least one of the following:

1. Build the kernel with CONFIG_PREEMPT=n. This prevents these
   kthreads from being created in the first place, and also obviates
   the need for RCU priority boosting. This approach is feasible
   for workloads that do not require high degrees of responsiveness.
2. Build the kernel with CONFIG_RCU_BOOST=n. This prevents these
   kthreads from being created in the first place. This approach
   is feasible only if your workload never requires RCU priority
   boosting, for example, if you ensure frequent idle time on all
   CPUs that might execute within the kernel.
3. Build with CONFIG_RCU_NOCB_CPU=y and boot with the rcu_nocbs=
   boot parameter offloading RCU callbacks from all CPUs susceptible
   to OS jitter. This approach prevents the rcuc/%u kthreads from
   having any work to do, so that they are never awakened.
4. Ensure that the CPU never enters the kernel, and, in particular,
   avoid initiating any CPU hotplug operations on this CPU. This is
   another way of preventing any callbacks from being queued on the
   CPU, again preventing the rcuc/%u kthreads from having any work
   to do.
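When relying on rcu_nocbs=, it can help to confirm that the running kernel was booted with the intended CPU list. The sketch below extracts a boot parameter's value from a kernel command line. The name ``boot_param`` is hypothetical, and the optional second argument exists only so that the function can be exercised without reading /proc/cmdline.

```shell
#!/bin/sh
# Hypothetical helper: print the value of boot parameter $1 from the
# command line in $2 (default: the running kernel's /proc/cmdline).
# Prints nothing and fails if the parameter is absent.  The body runs
# in a subshell so that disabling globbing does not leak out.
boot_param() (
    set -f
    name=$1
    cmdline=${2:-$(cat /proc/cmdline)}
    for word in $cmdline; do
        case $word in
            "$name"=*) printf '%s\n' "${word#*=}"; exit 0 ;;
        esac
    done
    exit 1
)

# Example (not run here):
# boot_param rcu_nocbs || echo "rcu_nocbs= was not given at boot"
```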

Name:
  rcuop/%d and rcuos/%d

Purpose:
  Offload RCU callbacks from the corresponding CPU.

To reduce its OS jitter, do at least one of the following:

1. Use affinity, cgroups, or another mechanism to force these kthreads
   to execute on some other CPU.
2. Build with CONFIG_RCU_NOCB_CPU=n, which will prevent these
   kthreads from being created in the first place. However, please
   note that this will not eliminate OS jitter, but will instead
   shift it to RCU_SOFTIRQ.
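Using affinity to move these kthreads requires finding their PIDs first. As a sketch, the filter below picks the rcuop/N and rcuos/N kthreads out of a process listing. The name ``rcuo_pids`` is hypothetical, and the choice of CPU 0 as the housekeeping CPU in the usage comment is an assumption.

```shell
#!/bin/sh
# Sketch: read "PID COMMAND" lines on stdin (the format printed by
# "ps -e -o pid= -o comm=") and print the PIDs of the rcuop/N and
# rcuos/N kthreads.
rcuo_pids() {
    while read -r pid comm; do
        case $comm in
            rcuop/*|rcuos/*) echo "$pid" ;;
        esac
    done
}

# Example (not run here, requires root): move the offload kthreads
# to housekeeping CPU 0.
# ps -e -o pid= -o comm= | rcuo_pids | while read -r pid; do
#     taskset -pc 0 "$pid"
# done
```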