=================
Scheduler Domains
=================

Each CPU has a "base" scheduling domain (struct sched_domain). The domain
hierarchy is built from these base domains via the ->parent pointer. The
->parent chain MUST be NULL-terminated, and domain structures should be
per-CPU as they are locklessly updated.

Each scheduling domain spans a number of CPUs (stored in the ->span field).
A domain's span MUST be a superset of its child's span (this restriction could
be relaxed if the need arises), and a base domain for CPU i MUST span at least
i. The top domain for each CPU will generally span all CPUs in the system.
Strictly it doesn't have to, but if it does not, some CPUs may never be given
tasks to run unless the CPUs allowed mask is explicitly set. A sched domain's
span means "balance process load among these CPUs".
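
The span rules above can be sketched as a small userspace check. This is an
illustrative model only, not kernel code: struct toy_domain and
toy_hierarchy_ok() are hypothetical names, and a span is modelled as a 64-bit
integer with bit i standing for CPU i rather than as a real cpumask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for struct sched_domain (assumption: at most 64 CPUs). */
struct toy_domain {
	struct toy_domain *parent;	/* NULL at the top of the hierarchy */
	uint64_t span;			/* bit i set => CPU i is in the span */
};

/*
 * Walk from the base domain of @cpu up the ->parent chain. Returns true
 * if the base domain spans @cpu and every parent's span is a superset
 * of its child's span.
 */
bool toy_hierarchy_ok(const struct toy_domain *base, unsigned int cpu)
{
	const struct toy_domain *sd;

	assert(base != NULL);

	/* A base domain for CPU i must span at least i. */
	if (cpu >= 64 || !(base->span & (UINT64_C(1) << cpu)))
		return false;

	for (sd = base; sd->parent; sd = sd->parent) {
		/* Any child bit missing from the parent breaks the rule. */
		if (sd->span & ~sd->parent->span)
			return false;
	}
	return true;
}
```

For example, a chain with spans 0x03 -> 0x0F -> 0xFF passes for CPU 0 or 1,
while the same chain fails for CPU 2 because the base span does not contain it.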

Each scheduling domain must have one or more CPU groups (struct sched_group)
which are organised as a circular, one-way linked list from the ->groups
pointer. The union of cpumasks of these groups MUST be the same as the
domain's span. The group pointed to by the ->groups pointer MUST contain the CPU
to which the domain belongs. Groups may be shared among CPUs as they contain
read-only data after they have been set up. The intersection of cpumasks from
any two of these groups may be non-empty. If this is the case, the SD_OVERLAP
flag is set on the corresponding scheduling domain and its groups may not be
shared between CPUs.
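
The group rules can be illustrated in the same toy style. The sketch below is
an assumption-laden model, not the kernel's validation code: struct toy_group,
struct toy_groups_report and toy_check_groups() are hypothetical names, and
cpumasks are again plain 64-bit integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for struct sched_group. */
struct toy_group {
	struct toy_group *next;	/* circular, one-way list */
	uint64_t cpumask;	/* bit i set => CPU i is in the group */
};

struct toy_groups_report {
	bool valid;	/* union == span and first group contains the CPU */
	bool overlap;	/* some CPU appears in more than one group */
};

/*
 * Walk the circular list once, starting at @groups (the group the
 * ->groups pointer of @cpu's domain would point to).
 */
struct toy_groups_report toy_check_groups(const struct toy_group *groups,
					   uint64_t span, unsigned int cpu)
{
	struct toy_groups_report r = { false, false };
	const struct toy_group *g = groups;
	uint64_t seen = 0;

	if (!groups || cpu >= 64)
		return r;

	do {
		assert(g != NULL);	/* the list must really be circular */
		if (seen & g->cpumask)
			r.overlap = true;	/* would imply SD_OVERLAP */
		seen |= g->cpumask;
		g = g->next;
	} while (g != groups);

	r.valid = (seen == span) &&
		  (groups->cpumask & (UINT64_C(1) << cpu)) != 0;
	return r;
}
```

With two groups 0x3 and 0xC and a span of 0xF, the check reports a valid,
non-overlapping layout for CPU 0; changing the second group to 0xE keeps the
union equal to the span but reports an overlap.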

Balancing within a sched domain occurs between groups. That is, each group
is treated as one entity. The load of a group is defined as the sum of the
load of each of its member CPUs, and only when the load of a group becomes
out of balance are tasks moved between groups.
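
The definition of group load given above reduces to a sum over the group's
cpumask. A minimal sketch, assuming a hypothetical per-CPU load array and a
fixed TOY_NR_CPUS; the kernel's actual load metrics are more elaborate than a
single number per CPU:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NR_CPUS 8	/* assumption: small fixed CPU count for the model */

/*
 * Load of a group == sum of the loads of its member CPUs.
 * @group_mask: bit i set => CPU i belongs to the group.
 * @cpu_load:   hypothetical per-CPU load values.
 */
unsigned long toy_group_load(uint64_t group_mask,
			     const unsigned long cpu_load[TOY_NR_CPUS])
{
	unsigned long sum = 0;
	unsigned int cpu;

	assert(cpu_load != NULL);

	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		if (group_mask & (UINT64_C(1) << cpu))
			sum += cpu_load[cpu];
	}
	return sum;
}
```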

In kernel/sched/core.c, trigger_load_balance() is run periodically on each CPU
through scheduler_tick(). It raises a softirq after the next regularly scheduled
rebalancing event for the current runqueue has arrived. The actual load
balancing workhorse, run_rebalance_domains()->rebalance_domains(), is then run
in softirq context (SCHED_SOFTIRQ).

The latter function takes two arguments: the runqueue of the current CPU and
whether the CPU was idle at the time scheduler_tick() ran. It iterates over all
sched domains our CPU is on, starting from its base domain and going up the
->parent chain. While doing that, it checks to see if the current domain has
exhausted its rebalance interval. If so, it runs load_balance() on that domain.
It then checks the parent sched_domain (if it exists), and the parent of the
parent and so forth.
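
The walk up the ->parent chain can be sketched as follows. This is a
simplified userspace model, not the kernel function: struct toy_timed_domain,
toy_load_balance() and toy_rebalance_domains() are hypothetical names, time is
a plain tick counter, and the idle argument and all locking are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Toy domain carrying only what the interval check needs. */
struct toy_timed_domain {
	struct toy_timed_domain *parent;	/* NULL at the top */
	unsigned long last_balance;	/* tick of the last balance */
	unsigned long interval;		/* minimum ticks between balances */
	unsigned int nr_balanced;	/* times toy_load_balance() ran */
};

/* Placeholder for the real balancing work; it only counts calls. */
void toy_load_balance(struct toy_timed_domain *sd)
{
	sd->nr_balanced++;
}

/*
 * Starting at the base domain, visit every domain up the ->parent
 * chain and balance those whose interval has expired at tick @now.
 * Returns how many domains were balanced.
 */
unsigned int toy_rebalance_domains(struct toy_timed_domain *base,
				   unsigned long now)
{
	struct toy_timed_domain *sd;
	unsigned int ran = 0;

	for (sd = base; sd; sd = sd->parent) {
		assert(sd->interval > 0);
		if (now - sd->last_balance >= sd->interval) {
			toy_load_balance(sd);
			sd->last_balance = now;
			ran++;
		}
	}
	return ran;
}
```

In this model a base domain with a short interval is balanced often, while
its parent with a longer interval is only balanced on the ticks where its own
interval has also run out.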

Initially, load_balance() finds the busiest group in the current sched domain.
If it succeeds, it looks for the busiest runqueue of all the CPUs' runqueues in
that group. If it manages to find such a runqueue, it locks both our initial
CPU's runqueue and the newly found busiest one and starts moving tasks from the
busiest runqueue to ours. The exact number of tasks moved is given by the
imbalance previously computed while iterating over this sched domain's groups.
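
The three steps (busiest group, busiest runqueue, move tasks) can be modelled
in a few lines. Everything here is an assumption made for illustration:
toy_mask_load() and toy_pull_tasks() are hypothetical names, a runqueue is
reduced to a task count, groups are assumed to be equally sized, locking is
omitted, and the imbalance is taken to be half the load difference between the
busiest group and the local group, which is not the kernel's actual
calculation.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NR_CPUS 8	/* assumption: small fixed CPU count for the model */

/* Sum of runnable tasks over the CPUs set in @mask. */
unsigned int toy_mask_load(uint64_t mask,
			   const unsigned int nr_running[TOY_NR_CPUS])
{
	unsigned int cpu, sum = 0;

	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (mask & (UINT64_C(1) << cpu))
			sum += nr_running[cpu];
	return sum;
}

/*
 * Pull tasks towards @dst_cpu from the busiest runqueue of the busiest
 * group. Returns the number of tasks moved (0 if already balanced).
 */
unsigned int toy_pull_tasks(const uint64_t *groups, unsigned int nr_groups,
			    unsigned int dst_cpu,
			    unsigned int nr_running[TOY_NR_CPUS])
{
	unsigned int i, cpu, load, imbalance;
	unsigned int local_load = 0, busiest_load = 0;
	uint64_t busiest_mask = 0;
	int src_cpu = -1;

	if (dst_cpu >= TOY_NR_CPUS)
		return 0;

	/* Step 1: find the local group's load and the busiest other group. */
	for (i = 0; i < nr_groups; i++) {
		load = toy_mask_load(groups[i], nr_running);
		if (groups[i] & (UINT64_C(1) << dst_cpu)) {
			local_load = load;
		} else if (load > busiest_load) {
			busiest_load = load;
			busiest_mask = groups[i];
		}
	}
	if (busiest_load <= local_load)
		return 0;

	/* Step 2: find the busiest runqueue inside that group. */
	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		if (!(busiest_mask & (UINT64_C(1) << cpu)))
			continue;
		if (src_cpu < 0 || nr_running[cpu] > nr_running[src_cpu])
			src_cpu = (int)cpu;
	}
	assert(src_cpu >= 0);

	/* Step 3: move the (toy) imbalance; real code locks both runqueues. */
	imbalance = (busiest_load - local_load) / 2;
	if (imbalance > nr_running[src_cpu])
		imbalance = nr_running[src_cpu];
	nr_running[src_cpu] -= imbalance;
	nr_running[dst_cpu] += imbalance;
	return imbalance;
}
```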

Implementing sched domains
==========================

The "base" domain will "span" the first level of the hierarchy. In the case
of SMT, you'll span all siblings of the physical CPU, with each group being
a single virtual CPU.

In SMP, the parent of the base domain will span all physical CPUs in the
node, with each group being a single physical CPU. Then with NUMA, the parent
of the SMP domain will span the entire machine, with each group having the
cpumask of a node. Alternatively, you could do multi-level NUMA; or Opteron,
for example, might have just one domain covering its one NUMA level.

The implementor should read the comments on the SD_* flags in
include/linux/sched/sd_flags.h to get an idea of the specifics and what to
tune for the SD flags of a sched_domain.

Architectures may override the generic domain builder and the default SD flags
for a given topology level by creating a sched_domain_topology_level array and
calling set_sched_topology() with this array as the parameter.

The sched-domains debugging infrastructure can be turned on by enabling
CONFIG_SCHED_DEBUG and adding 'sched_verbose' to your kernel command line. If
you forgot to tweak your command line, you can also flip the
/sys/kernel/debug/sched/verbose knob. This enables an error-checking parse of
the sched domains which should catch most possible errors (described above). It
also prints out the domain structure in a visual format.