0001 ==========================
0002 Real-Time group scheduling
0003 ==========================
0004 
0005 .. CONTENTS
0006 
0007    0. WARNING
0008    1. Overview
0009      1.1 The problem
0010      1.2 The solution
0011    2. The interface
0012      2.1 System-wide settings
0013      2.2 Default behaviour
0014      2.3 Basis for grouping tasks
0015    3. Future plans
0016 
0017 
0018 0. WARNING
0019 ==========
0020 
0021  Fiddling with these settings can result in an unstable system, the knobs are
0022  root only and assumes root knows what he is doing.
0023 
0024 Most notable:
0025 
0026  * very small values in sched_rt_period_us can result in an unstable
0027    system when the period is smaller than either the available hrtimer
0028    resolution, or the time it takes to handle the budget refresh itself.
0029 
0030  * very small values in sched_rt_runtime_us can result in an unstable
0031    system when the runtime is so small the system has difficulty making
0032    forward progress (NOTE: the migration thread and kstopmachine are both
0033    real-time processes).
0034 
0035 1. Overview
0036 ===========
0037 
0038 
0039 1.1 The problem
0040 ---------------
0041 
0042 Realtime scheduling is all about determinism: a group has to be able to rely on
0043 the amount of bandwidth (e.g. CPU time) being constant. In order to schedule
0044 multiple groups of realtime tasks, each group must be assigned a fixed portion
0045 of the CPU time available.  Without a minimum guarantee a realtime group can
0046 obviously fall short. A fuzzy upper limit is of no use since it cannot be
0047 relied upon. That leaves us with just the single fixed portion.
0048 
0049 1.2 The solution
0050 ----------------
0051 
0052 CPU time is divided by specifying how much time can be spent running in a
0053 given period. We allocate this "run time" to each realtime group, and the
0054 other realtime groups are not permitted to use it.
0055 
0056 Any time not allocated to a realtime group will be used to run normal priority
0057 tasks (SCHED_OTHER). Any allocated run time not used will also be picked up by
0058 SCHED_OTHER.
0059 
0060 Let's consider an example: a fixed-frame-rate realtime renderer must deliver 25
0061 frames a second, which yields a period of 0.04s per frame. Now say it will also
0062 have to play some music and respond to input, leaving it with around 80% CPU
0063 time dedicated for the graphics. We can then give this group a run time of 0.8
0064 * 0.04s = 0.032s.
0065 
0066 This way the graphics group will have a 0.04s period with a 0.032s run time
0067 limit. Now if the audio thread needs to refill the DMA buffer every 0.005s, but
0068 needs only about 3% CPU time to do so, it can do with a run time of 0.03 *
0069 0.005s = 0.00015s. So this group can be scheduled with a period of 0.005s and a
0070 run time of 0.00015s.
0071 
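The arithmetic in the example above can be written down as a minimal sketch.
The helper name run_time_us is hypothetical and only illustrates the
calculation (run time = CPU share * period); it is not a kernel interface.
Values are kept in microseconds, the unit used by the knobs described later.

```python
from fractions import Fraction


def run_time_us(period_us, cpu_share):
    """Run time in microseconds for a group needing `cpu_share` of each period.

    Hypothetical helper for illustration only.
    """
    if period_us <= 0:
        raise ValueError("period must be positive")
    if not 0 <= cpu_share <= 1:
        raise ValueError("cpu_share must be between 0 and 1")
    return int(period_us * cpu_share)


# Graphics group: 25 frames per second gives a 0.04s period, 80% CPU share.
graphics_period_us = 1_000_000 // 25
graphics_runtime_us = run_time_us(graphics_period_us, Fraction(80, 100))

# Audio group: DMA buffer refill every 0.005s, about 3% CPU share.
audio_period_us = 5_000
audio_runtime_us = run_time_us(audio_period_us, Fraction(3, 100))

print(graphics_period_us, graphics_runtime_us)  # 0.04s period, 0.032s run time
print(audio_period_us, audio_runtime_us)        # 0.005s period, 0.00015s run time
```
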
0072 The remaining CPU time will be used for user input and other tasks. Because
0073 realtime tasks have explicitly allocated the CPU time they need to perform
0074 their tasks, buffer underruns in the graphics or audio can be eliminated.
0075 
0076 NOTE: the above example is not fully implemented yet. We still
0077 lack an EDF scheduler to make non-uniform periods usable.
0078 
0079 
0080 2. The Interface
0081 ================
0082 
0083 
0084 2.1 System-wide settings
0085 ------------------------
0086 
0087 The system-wide settings are configured under the /proc virtual file system:
0088 
0089 /proc/sys/kernel/sched_rt_period_us:
0090   The scheduling period that is equivalent to 100% CPU bandwidth.
0091 
0092 /proc/sys/kernel/sched_rt_runtime_us:
0093   A global limit on how much time realtime scheduling may use.  Even without
0094   CONFIG_RT_GROUP_SCHED enabled, this will limit time reserved to realtime
0095   processes. With CONFIG_RT_GROUP_SCHED it signifies the total bandwidth
0096   available to all realtime groups.
0097 
0098   * Time is specified in us because the interface is s32. This gives an
0099     operating range from 1us to about 35 minutes.
0100   * sched_rt_period_us takes values from 1 to INT_MAX.
0101   * sched_rt_runtime_us takes values from -1 to (INT_MAX - 1).
0102   * A run time of -1 specifies runtime == period, i.e. no limit.
0103 
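The value ranges listed above can be checked before writing to the knobs. The
following is a minimal sketch; the function name rt_bandwidth_fraction is
hypothetical, and rejecting a run time larger than the period is an assumption
of this example rather than something stated above.

```python
INT_MAX = 2**31 - 1  # largest value of a signed 32-bit integer (s32)


def rt_bandwidth_fraction(period_us, runtime_us):
    """Return the share of each period that realtime tasks may use.

    Hypothetical helper: validates the ranges listed above.
    """
    if not 1 <= period_us <= INT_MAX:
        raise ValueError("sched_rt_period_us must be in 1..INT_MAX")
    if not -1 <= runtime_us <= INT_MAX - 1:
        raise ValueError("sched_rt_runtime_us must be in -1..INT_MAX-1")
    if runtime_us == -1:
        return 1.0  # runtime == period, i.e. no limit
    if runtime_us > period_us:
        # Assumption of this sketch: a run time above the period is an error.
        raise ValueError("run time larger than period")
    return runtime_us / period_us


# Upper end of the operating range in minutes ("about 35 minutes").
max_range_minutes = INT_MAX / 1_000_000 / 60
```

For example, rt_bandwidth_fraction(1000000, 950000) yields 0.95, and a run
time of -1 yields 1.0.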
0104 
0105 2.2 Default behaviour
0106 ---------------------
0107 
0108 The default values are sched_rt_period_us = 1000000 (1s) and
0109 sched_rt_runtime_us = 950000 (0.95s).  This leaves 0.05s of every period for
0110 SCHED_OTHER (non-RT tasks). These defaults were chosen so that a runaway
0111 realtime task will not lock up the machine but will leave a little time to
0112 recover it.  By setting runtime to -1 you'd get the old behaviour back.
0113 
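A minimal sketch of reading the two knobs and computing the time left for
non-RT tasks follows. The function names are hypothetical; the directory is a
parameter (defaulting to /proc/sys/kernel, where the files above live) so the
code can also be exercised against a scratch directory. With the defaults,
non_rt_time_us(1000000, 950000) gives 50000us, i.e. the 0.05s mentioned above.

```python
from pathlib import Path


def read_rt_sysctls(sysctl_dir="/proc/sys/kernel"):
    """Return (period_us, runtime_us) read from the two sysctl files."""
    base = Path(sysctl_dir)
    period_us = int((base / "sched_rt_period_us").read_text())
    runtime_us = int((base / "sched_rt_runtime_us").read_text())
    return period_us, runtime_us


def non_rt_time_us(period_us, runtime_us):
    """Time per period that realtime tasks cannot claim.

    A run time of -1 means runtime == period, so nothing is held back.
    """
    if runtime_us == -1:
        return 0
    return period_us - runtime_us
```
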
0114 By default all bandwidth is assigned to the root group and new groups get the
0115 period from /proc/sys/kernel/sched_rt_period_us and a run time of 0. If you
0116 want to assign bandwidth to another group, reduce the root group's bandwidth
0117 and assign some or all of the difference to another group.
0118 
0119 Realtime group scheduling means you have to assign a portion of total CPU
0120 bandwidth to the group before it will accept realtime tasks. Therefore you will
0121 not be able to run realtime tasks as any user other than root until you have
0122 done that, even if the user has the rights to run processes with realtime
0123 priority!
0124 
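The procedure described above (reduce the root group's run time, then give
some of the difference to another group) might be sketched as follows. This is
an illustration under assumptions: the helper names are hypothetical, the
cgroup directories are passed in by the caller because their mount point is
not specified here, and the file name cpu.rt_runtime_us is the one given in
section 2.3.

```python
from pathlib import Path

RT_RUNTIME_FILE = "cpu.rt_runtime_us"  # per-group file named in section 2.3


def read_rt_runtime(cgroup_dir):
    """Read a group's realtime run time in microseconds."""
    return int(Path(cgroup_dir, RT_RUNTIME_FILE).read_text())


def write_rt_runtime(cgroup_dir, runtime_us):
    """Write a group's realtime run time in microseconds."""
    Path(cgroup_dir, RT_RUNTIME_FILE).write_text(f"{runtime_us}\n")


def move_rt_runtime(root_dir, group_dir, amount_us):
    """Take `amount_us` of run time from the root group and give it to a group."""
    if amount_us <= 0:
        raise ValueError("amount must be positive")
    root_runtime = read_rt_runtime(root_dir)
    if amount_us > root_runtime:
        raise ValueError("root group does not have that much run time")
    group_runtime = read_rt_runtime(group_dir)
    # Reduce the root group first so the total never exceeds what it had.
    write_rt_runtime(root_dir, root_runtime - amount_us)
    write_rt_runtime(group_dir, group_runtime + amount_us)
```

Only after a group has been given a non-zero run time this way will it accept
realtime tasks.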
0125 
0126 2.3 Basis for grouping tasks
0127 ----------------------------
0128 
0129 Enabling CONFIG_RT_GROUP_SCHED lets you explicitly allocate real
0130 CPU bandwidth to task groups.
0131 
0132 This uses the cgroup virtual file system and "<cgroup>/cpu.rt_runtime_us"
0133 to control the CPU time reserved for each control group.
0134 
0135 For more information on working with control groups, you should read
0136 Documentation/admin-guide/cgroup-v1/cgroups.rst as well.
0137 
0138 Group settings are checked against the following limits in order to keep the
0139 configuration schedulable:
0140 
0141    \Sum_{i} runtime_{i} / global_period <= global_runtime / global_period
0142 
0143 For now, this can be simplified to just the following (but see Future plans):
0144 
0145    \Sum_{i} runtime_{i} <= global_runtime
0146 
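The simplified limit above can be checked with a few lines of code. This is a
minimal sketch; is_schedulable is a hypothetical name, and treating a global
run time of -1 as "equal to the period" follows the rule from section 2.1.

```python
def is_schedulable(group_runtimes_us, global_runtime_us, global_period_us):
    """Check  Sum_i runtime_i <= global_runtime  for groups sharing one period."""
    if global_period_us <= 0:
        raise ValueError("global period must be positive")
    if global_runtime_us == -1:
        # -1 means runtime == period, i.e. no limit below 100%.
        global_runtime_us = global_period_us
    if any(runtime < 0 for runtime in group_runtimes_us):
        raise ValueError("group run times must not be negative")
    return sum(group_runtimes_us) <= global_runtime_us
```

With the default global settings, is_schedulable([500000, 300000], 950000,
1000000) is True, while two groups of 500000us each would be rejected.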
0147 
0148 3. Future plans
0149 ===============
0150 
0151 There is work in progress to make the scheduling period for each group
0152 ("<cgroup>/cpu.rt_period_us") configurable as well.
0153 
0154 The constraint on the period is that a subgroup's period must be smaller
0155 than or equal to its parent's. But realistically it's not very useful _yet_
0156 as it's prone to starvation without deadline scheduling.
0157 
0158 Consider two sibling groups A and B; both have 50% bandwidth, but A's
0159 period is twice the length of B's.
0160 
0161 * group A: period=100000us, runtime=50000us
0162 
0163         - this runs for 0.05s once every 0.1s
0164 
0165 * group B: period= 50000us, runtime=25000us
0166 
0167         - this runs for 0.025s twice every 0.1s (or once every 0.05 sec).
0168 
0169 This means that currently a while (1) loop in A will run for the full period of
0170 B and can starve B's tasks (assuming they are of lower priority) for a whole
0171 period.
0172 
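The starvation described above can be reproduced with a toy model. This is a
sketch under assumptions, not the kernel's algorithm: one CPU, time advancing
in 1ms ticks, both groups always wanting to run, and A's tasks having the
higher priority. The function name is hypothetical. In the result, B receives
nothing during its first 50ms period.

```python
def simulate_fixed_priority(groups, horizon_ms):
    """Toy model of budgeted groups under fixed priorities.

    groups: list of (name, period_ms, runtime_ms), highest priority first.
    Returns {name: [time received in each of that group's periods]}.
    """
    budget = {}
    received = {name: [] for name, _, _ in groups}
    for t in range(horizon_ms):
        for name, period, runtime in groups:
            if t % period == 0:
                budget[name] = runtime    # period boundary: refill the budget
                received[name].append(0)  # start accounting a new period
        for name, _, _ in groups:         # highest priority first
            if budget[name] > 0:
                budget[name] -= 1
                received[name][-1] += 1
                break                     # one group runs per tick
    return received


result = simulate_fixed_priority([("A", 100, 50), ("B", 50, 25)], 100)
print(result)
```
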
0173 The next project will be SCHED_EDF (Earliest Deadline First scheduling) to bring
0174 full deadline scheduling to the Linux kernel. Deadline scheduling the above
0175 groups and treating the end of the period as a deadline will ensure that they
0176 both get their allocated time.
0177 
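To illustrate why treating the end of each period as a deadline helps, here is
a toy earliest-deadline-first model of groups A and B from above. It is a
sketch under assumptions (one CPU, 1ms ticks, both groups always wanting to
run) with a hypothetical function name; it does not describe how SCHED_EDF
would be implemented. In this model both groups receive their full run time in
every period.

```python
def simulate_edf(groups, horizon_ms):
    """Toy model: the group whose period ends soonest runs first.

    groups: list of (name, period_ms, runtime_ms).
    Returns {name: [time received in each of that group's periods]}.
    """
    budget, deadline = {}, {}
    received = {name: [] for name, _, _ in groups}
    for t in range(horizon_ms):
        for name, period, runtime in groups:
            if t % period == 0:
                budget[name] = runtime      # refill at the period boundary
                deadline[name] = t + period  # end of period is the deadline
                received[name].append(0)
        runnable = [name for name, _, _ in groups if budget[name] > 0]
        if runnable:
            # Earliest deadline wins; ties go to the group listed first.
            chosen = min(runnable, key=lambda name: deadline[name])
            budget[chosen] -= 1
            received[chosen][-1] += 1
    return received


result = simulate_edf([("A", 100, 50), ("B", 50, 25)], 100)
print(result)
```
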
0178 Implementing SCHED_EDF might take a while to complete. Priority Inheritance is
0179 the biggest challenge as the current Linux PI infrastructure is geared towards
0180 the limited static priority levels 0-99. With deadline scheduling you need to
0181 do deadline inheritance (since priority is inversely proportional to the
0182 deadline delta (deadline - now)).
0183 
0184 This means the whole PI machinery will have to be reworked - and that is one of
0185 the most complex pieces of code we have.