==========================
Real-Time group scheduling
==========================

.. CONTENTS

   0. WARNING
   1. Overview
     1.1 The problem
     1.2 The solution
   2. The interface
     2.1 System-wide settings
     2.2 Default behaviour
     2.3 Basis for grouping tasks
   3. Future plans


0. WARNING
==========

Fiddling with these settings can result in an unstable system. The knobs are
root only and assume that root knows what they are doing.

Most notable:

 * very small values in sched_rt_period_us can result in an unstable
   system when the period is smaller than either the available hrtimer
   resolution, or the time it takes to handle the budget refresh itself.

 * very small values in sched_rt_runtime_us can result in an unstable
   system when the runtime is so small the system has difficulty making
   forward progress (NOTE: the migration thread and kstopmachine both
   are real-time processes).

1. Overview
===========


1.1 The problem
---------------

Realtime scheduling is all about determinism: a group has to be able to rely on
the amount of bandwidth (e.g. CPU time) being constant. In order to schedule
multiple groups of realtime tasks, each group must be assigned a fixed portion
of the CPU time available. Without a minimum guarantee, a realtime group can
obviously fall short. A fuzzy upper limit is of no use since it cannot be
relied upon. That leaves us with just the single fixed portion.

1.2 The solution
----------------

CPU time is divided by specifying how much time can be spent running in a
given period. We allocate this "run time" to each realtime group, and the
other realtime groups will not be permitted to use it.

Any time not allocated to a realtime group will be used to run normal priority
tasks (SCHED_OTHER). Any allocated run time not used will also be picked up by
SCHED_OTHER.

Let's consider an example: a fixed frame rate realtime renderer must deliver 25
frames a second, which yields a period of 0.04s per frame. Now say it will also
have to play some music and respond to input, leaving it with around 80% CPU
time dedicated for the graphics. We can then give this group a run time of 0.8
* 0.04s = 0.032s.

This way the graphics group will have a 0.04s period with a 0.032s run time
limit. Now if the audio thread needs to refill the DMA buffer every 0.005s, but
needs only about 3% CPU time to do so, it can make do with a run time of
0.03 * 0.005s = 0.00015s. So this group can be scheduled with a period of
0.005s and a run time of 0.00015s.

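
The arithmetic in this example can be sketched in a few lines of Python. This
is only an illustration of the numbers above: the helper name ``run_time_us``
is hypothetical, and the values are written in microseconds as an assumption
to keep the arithmetic in integers.

.. code-block:: python

   def run_time_us(period_us, cpu_percent):
       """Return the run time (in us) that gives cpu_percent of period_us."""
       if period_us <= 0 or not 0 <= cpu_percent <= 100:
           raise ValueError("period must be positive and percent within 0..100")
       # Integer arithmetic avoids floating point rounding of the budget.
       return period_us * cpu_percent // 100

   # Graphics group: 25 frames per second -> 0.04s period, 80% of the CPU.
   graphics = run_time_us(40_000, 80)   # 32000us == 0.032s
   # Audio group: DMA refill every 0.005s, about 3% of the CPU.
   audio = run_time_us(5_000, 3)        # 150us == 0.00015s
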
The remaining CPU time will be used for user input and other tasks. Because
realtime tasks have explicitly allocated the CPU time they need to perform
their tasks, buffer underruns in the graphics or audio can be eliminated.

NOTE: the above example is not fully implemented yet. We still
lack an EDF scheduler to make non-uniform periods usable.


2. The Interface
================


2.1 System-wide settings
------------------------

The system-wide settings are configured under the /proc virtual file system:

/proc/sys/kernel/sched_rt_period_us:
  The scheduling period that is equivalent to 100% CPU bandwidth.

/proc/sys/kernel/sched_rt_runtime_us:
  A global limit on how much time realtime scheduling may use. Even without
  CONFIG_RT_GROUP_SCHED enabled, this will limit time reserved to realtime
  processes. With CONFIG_RT_GROUP_SCHED it signifies the total bandwidth
  available to all realtime groups.

  * Time is specified in us because the interface is s32. This gives an
    operating range from 1us to about 35 minutes.
  * sched_rt_period_us takes values from 1 to INT_MAX.
  * sched_rt_runtime_us takes values from -1 to (INT_MAX - 1).
  * A run time of -1 specifies runtime == period, i.e. no limit.


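
The ranges listed above can be written down as a small Python check. This is
a sketch of the documented limits only, not of the kernel's own validation,
which may apply further rules that are not modelled here. The function name
``check_global_rt_settings`` is hypothetical.

.. code-block:: python

   INT_MAX = 2**31 - 1   # largest value of a signed 32-bit integer (s32)

   def check_global_rt_settings(period_us, runtime_us):
       """Check both values against the documented ranges.

       Returns the effective run time in us, or raises ValueError.
       """
       if not 1 <= period_us <= INT_MAX:
           raise ValueError("sched_rt_period_us must be within 1..INT_MAX")
       if not -1 <= runtime_us <= INT_MAX - 1:
           raise ValueError("sched_rt_runtime_us must be within -1..INT_MAX-1")
       # A run time of -1 means runtime == period, i.e. no limit.
       return period_us if runtime_us == -1 else runtime_us

   # INT_MAX microseconds expressed in minutes: a little under 36.
   max_minutes = INT_MAX / 1_000_000 / 60

Dividing INT_MAX microseconds down to minutes is where the "about 35 minutes"
upper end of the operating range comes from.
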
2.2 Default behaviour
---------------------

The default values are 1000000 (1s) for sched_rt_period_us and 950000 (0.95s)
for sched_rt_runtime_us. This gives 0.05s to be used by SCHED_OTHER (non-RT
tasks). These defaults were chosen so that a runaway realtime task will not
lock up the machine but leave a little time to recover it. By setting runtime
to -1 you'd get the old behaviour back.

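
As a quick check of these defaults, the time left over for non-RT tasks in
each period can be computed as follows. The helper ``non_rt_time_us`` is a
hypothetical name used for illustration; it reads no kernel state and only
restates the arithmetic.

.. code-block:: python

   DEFAULT_PERIOD_US = 1_000_000    # default sched_rt_period_us (1s)
   DEFAULT_RUNTIME_US = 950_000     # default sched_rt_runtime_us (0.95s)

   def non_rt_time_us(period_us, runtime_us):
       """Time per period (in us) that realtime tasks cannot consume."""
       if runtime_us == -1:          # -1 means no limit: nothing is held back
           return 0
       return period_us - runtime_us

   left_over = non_rt_time_us(DEFAULT_PERIOD_US, DEFAULT_RUNTIME_US)  # 50000us
   unlimited = non_rt_time_us(DEFAULT_PERIOD_US, -1)                  # 0us
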
By default all bandwidth is assigned to the root group and new groups get the
period from /proc/sys/kernel/sched_rt_period_us and a run time of 0. If you
want to assign bandwidth to another group, reduce the root group's bandwidth
and assign some or all of the difference to another group.

Realtime group scheduling means you have to assign a portion of total CPU
bandwidth to the group before it will accept realtime tasks. Therefore you will
not be able to run realtime tasks as any user other than root until you have
done that, even if the user has the rights to run processes with realtime
priority!


2.3 Basis for grouping tasks
----------------------------

Enabling CONFIG_RT_GROUP_SCHED lets you explicitly allocate real
CPU bandwidth to task groups.

This uses the cgroup virtual file system and "<cgroup>/cpu.rt_runtime_us"
to control the CPU time reserved for each control group.

For more information on working with control groups, you should read
Documentation/admin-guide/cgroup-v1/cgroups.rst as well.

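
Setting a group's run time amounts to writing a number into its
``cpu.rt_runtime_us`` file. The sketch below assumes the caller already knows
the directory of the control group (the "<cgroup>" part, which this document
does not specify) and has permission to write there. The function name
``set_group_rt_runtime`` is hypothetical, and the kernel may still reject a
value that breaks the limits described in this section.

.. code-block:: python

   from pathlib import Path

   def set_group_rt_runtime(cgroup_dir, runtime_us):
       """Write runtime_us to <cgroup_dir>/cpu.rt_runtime_us and read it back."""
       knob = Path(cgroup_dir) / "cpu.rt_runtime_us"
       knob.write_text(f"{runtime_us}\n")
       # Reading the file back shows the value that is now in effect.
       return int(knob.read_text())
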
Group settings are checked against the following limits in order to keep the
configuration schedulable::

   \Sum_{i} runtime_{i} / global_period <= global_runtime / global_period

For now, this can be simplified to just the following (but see Future plans)::

   \Sum_{i} runtime_{i} <= global_runtime


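
The simplified inequality can be expressed directly in Python. This is a
sketch of the inequality as written, with every group sharing the global
period, and not of the kernel's actual check. The name ``groups_fit`` and the
group run times in the example are made up for illustration.

.. code-block:: python

   def groups_fit(group_runtimes_us, global_runtime_us):
       """True if the sum of the group run times fits in the global run time."""
       return sum(group_runtimes_us) <= global_runtime_us

   # With the default global run time of 950000us:
   ok = groups_fit([500_000, 400_000], 950_000)        # 900000 <= 950000
   too_much = groups_fit([500_000, 500_000], 950_000)  # 1000000 > 950000
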
3. Future plans
===============

There is work in progress to make the scheduling period for each group
("<cgroup>/cpu.rt_period_us") configurable as well.

The constraint on the period is that a subgroup must have a smaller or
equal period to its parent. But realistically it's not very useful *yet*
as it's prone to starvation without deadline scheduling.

Consider two sibling groups A and B; both have 50% bandwidth, but A's
period is twice the length of B's.

* group A: period=100000us, runtime=50000us

        - this runs for 0.05s once every 0.1s

* group B: period= 50000us, runtime=25000us

        - this runs for 0.025s twice every 0.1s (or once every 0.05s).

This means that currently a while (1) loop in A will run for the full period of
B and can starve B's tasks (assuming they are of lower priority) for a whole
period.

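
The numbers behind this starvation case can be checked with a short sketch.
The dictionaries and helper names are hypothetical. The point is only that
both groups have the same share, yet A's run time budget is as long as B's
whole period.

.. code-block:: python

   group_a = {"period_us": 100_000, "runtime_us": 50_000}
   group_b = {"period_us": 50_000, "runtime_us": 25_000}

   def bandwidth_percent(group):
       """Share of the CPU the group may use, as a whole percentage."""
       return group["runtime_us"] * 100 // group["period_us"]

   def can_cover_full_period(runner, victim):
       """True if runner's budget is at least as long as victim's period."""
       return runner["runtime_us"] >= victim["period_us"]

   same_share = bandwidth_percent(group_a) == bandwidth_percent(group_b)  # both 50
   a_starves_b = can_cover_full_period(group_a, group_b)   # 50000 >= 50000
   b_starves_a = can_cover_full_period(group_b, group_a)   # 25000 < 100000
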
The next project will be SCHED_EDF (Earliest Deadline First scheduling) to bring
full deadline scheduling to the Linux kernel. Deadline scheduling the above
groups and treating the end of the period as a deadline will ensure that they
both get their allocated time.

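
The difference can be illustrated with a toy simulation. This is a simplified
model and not the kernel's algorithm: time advances in fixed ticks, both
groups always have work to do, budgets are refreshed at the start of each
period, and the dictionary order stands in for static priority. The function
name ``simulate`` and the tick length of 5000us are assumptions made for this
sketch.

.. code-block:: python

   def simulate(groups, horizon, use_deadlines):
       """Return, per group, the ticks it ran in each of its periods.

       groups maps a name to (period, runtime) in ticks; dict order is the
       static priority order used when use_deadlines is False.
       """
       remaining, deadline = {}, {}
       served = {name: [] for name in groups}
       for now in range(horizon):
           for name, (period, runtime) in groups.items():
               if now % period == 0:          # new period: refresh the budget
                   remaining[name] = runtime
                   deadline[name] = now + period
                   served[name].append(0)
           runnable = [n for n in groups if remaining[n] > 0]
           if not runnable:
               continue                       # every budget is used up: idle tick
           if use_deadlines:
               # Earliest deadline first; ties go to the first group listed.
               chosen = min(runnable, key=lambda n: deadline[n])
           else:
               chosen = runnable[0]           # highest static priority
           remaining[chosen] -= 1
           served[chosen][-1] += 1
       return served

   # One tick stands for 5000us: A = 100000us/50000us, B = 50000us/25000us.
   groups = {"A": (20, 10), "B": (10, 5)}
   static = simulate(groups, 40, use_deadlines=False)
   edf = simulate(groups, 40, use_deadlines=True)

In this model the static policy leaves B with no run time in every other
period (``static["B"]`` is ``[0, 5, 0, 5]``), while the deadline policy gives
both groups their full run time in every period (``edf["B"]`` is
``[5, 5, 5, 5]`` and ``edf["A"]`` is ``[10, 10]``).
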
Implementing SCHED_EDF might take a while to complete. Priority Inheritance is
the biggest challenge as the current Linux PI infrastructure is geared towards
the limited static priority levels 0-99. With deadline scheduling you need to
do deadline inheritance (since priority is inversely proportional to the
deadline delta (deadline - now)).

This means the whole PI machinery will have to be reworked - and that is one of
the most complex pieces of code we have.