=============================
Per-task statistics interface
=============================


Taskstats is a netlink-based interface for sending per-task and
per-process statistics from the kernel to userspace.

Taskstats was designed for the following benefits:

- efficiently provide statistics during the lifetime of a task and on its exit
- unified interface for multiple accounting subsystems
- extensibility for use by future accounting patches

Terminology
-----------

"pid", "tid" and "task" are used interchangeably and refer to the standard
Linux task defined by struct task_struct. Per-pid stats are the same as
per-task stats.

"tgid", "process" and "thread group" are used interchangeably and refer to the
tasks that share an mm_struct, i.e. the traditional Unix process. Despite the
use of tgid, there is no special treatment for the task that is the thread
group leader: a process is deemed alive as long as it has any task belonging
to it.

Usage
-----

To get statistics during a task's lifetime, userspace opens a unicast netlink
socket (NETLINK_GENERIC family) and sends commands specifying a pid or a tgid.
The response contains statistics for a task (if pid is specified) or the sum of
statistics for all tasks of the process (if tgid is specified).

To obtain statistics for tasks which are exiting, the userspace listener
sends a register command and specifies a cpumask. Whenever a task exits on
one of the cpus in the cpumask, its per-pid statistics are sent to the
registered listener. Using cpumasks limits the data received by any one
listener and assists in flow control over the netlink interface, as explained
in more detail below.

If the exiting task is the last thread exiting its thread group,
an additional record containing the per-tgid stats is also sent to userspace.
The latter contains the sum of per-pid stats for all threads in the thread
group, both past and present.

getdelays.c is a simple utility demonstrating usage of the taskstats interface
for reporting delay accounting statistics. Users can register cpumasks,
send commands and process responses, listen for per-tid/tgid exit data,
write the data received to a file and do basic flow control by increasing
receive buffer sizes.

Interface
---------

The user-kernel interface is encapsulated in include/linux/taskstats.h.

To avoid this documentation becoming obsolete as the interface evolves, only
an outline of the current version is given. taskstats.h always overrides the
description here.

struct taskstats is the common accounting structure for both per-pid and
per-tgid data. It is versioned and can be extended by each accounting subsystem
that is added to the kernel. The fields and their semantics are defined in the
taskstats.h file.

The data exchanged between user and kernel space is a netlink message belonging
to the NETLINK_GENERIC family and using the netlink attributes interface.
The messages are in the format::

    +----------+- - -+-------------+-------------------+
    | nlmsghdr | Pad |  genlmsghdr | taskstats payload |
    +----------+- - -+-------------+-------------------+


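As a rough illustration of this layout, the sketch below packs a request that
carries a single u32 pid or tgid attribute. The numeric constants are my
reading of the enums in taskstats.h and the generic netlink headers and should
be checked against the headers of the running kernel. ``family_id``, ``seq``
and ``port_id`` are hypothetical parameters: the family id is not fixed and
would normally be resolved at runtime through the generic netlink controller.

```python
import struct

# Assumed values; verify against taskstats.h and the netlink headers.
TASKSTATS_CMD_GET = 1
TASKSTATS_CMD_ATTR_PID = 1
TASKSTATS_CMD_ATTR_TGID = 2
TASKSTATS_GENL_VERSION = 1
NLM_F_REQUEST = 0x1
NLMSG_HDRLEN = 16  # u32 len, u16 type, u16 flags, u32 seq, u32 pid
NLA_HDRLEN = 4     # u16 len, u16 type


def nla_align(n):
    """Round n up to the 4-byte netlink alignment boundary."""
    return (n + 3) & ~3


def pack_attr(attr_type, payload):
    """Encode one attribute: u16 length, u16 type, payload, zero padding."""
    length = NLA_HDRLEN + len(payload)  # the length field excludes padding
    pad = nla_align(length) - length
    return struct.pack("=HH", length, attr_type) + payload + b"\0" * pad


def build_get_request(family_id, seq, port_id, attr_type, task_id):
    """Build nlmsghdr + genlmsghdr + one u32 attribute (pid or tgid)."""
    attr = pack_attr(attr_type, struct.pack("=I", task_id))
    # genlmsghdr: u8 cmd, u8 version, u16 reserved
    genl = struct.pack("=BBH", TASKSTATS_CMD_GET, TASKSTATS_GENL_VERSION, 0)
    total = NLMSG_HDRLEN + len(genl) + len(attr)
    # nlmsghdr is already 4-byte aligned, so the "Pad" region is empty here.
    nlh = struct.pack("=IHHII", total, family_id, NLM_F_REQUEST, seq, port_id)
    return nlh + genl + attr
```
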
The taskstats payload is one of the following three kinds:

1. Commands: sent from user to kernel. Commands to get data on
   a pid/tgid consist of one attribute, of type TASKSTATS_CMD_ATTR_PID/TGID,
   containing a u32 pid or tgid in the attribute payload. The pid/tgid denotes
   the task/process for which userspace wants statistics.

   Commands to register/deregister interest in exit data from a set of cpus
   consist of one attribute, of type
   TASKSTATS_CMD_ATTR_REGISTER/DEREGISTER_CPUMASK, containing a cpumask in the
   attribute payload. The cpumask is specified as an ASCII string of
   comma-separated cpu ranges, e.g. to listen to exit data from cpus
   1,2,3,5,7,8 the cpumask would be "1-3,5,7-8". If userspace forgets to
   deregister interest in cpus before closing the listening socket, the kernel
   cleans up its interest set over time. However, for the sake of efficiency,
   an explicit deregistration is advisable.

2. Response for a command: sent from the kernel in response to a userspace
   command. The payload is a series of three attributes of type:

   a) TASKSTATS_TYPE_AGGR_PID/TGID: attribute containing no payload, indicating
      that a pid/tgid will be followed by some stats.

   b) TASKSTATS_TYPE_PID/TGID: attribute whose payload is the pid/tgid whose
      stats are being returned.

   c) TASKSTATS_TYPE_STATS: attribute with a struct taskstats as payload. The
      same structure is used for both per-pid and per-tgid stats.

3. New message: sent by the kernel whenever a task exits. The payload consists
   of a series of attributes of the following types:

   a) TASKSTATS_TYPE_AGGR_PID: indicates next two attributes will be pid+stats
   b) TASKSTATS_TYPE_PID: contains exiting task's pid
   c) TASKSTATS_TYPE_STATS: contains the exiting task's per-pid stats
   d) TASKSTATS_TYPE_AGGR_TGID: indicates next two attributes will be tgid+stats
   e) TASKSTATS_TYPE_TGID: contains tgid of process to which task belongs
   f) TASKSTATS_TYPE_STATS: contains the per-tgid stats for exiting task's process


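The attribute sequences listed above could be walked in userspace roughly as
follows. This is a sketch, not the getdelays.c logic: the attribute type
numbers are my reading of taskstats.h, and the walker also descends into an
AGGR attribute when its payload is non-empty, on the assumption that some
kernels nest the id and stats attributes inside it. Attribute types it does
not recognise are skipped, and the stats payload is returned as raw bytes.

```python
import struct

# Assumed values; verify against taskstats.h.
TASKSTATS_TYPE_PID = 1
TASKSTATS_TYPE_TGID = 2
TASKSTATS_TYPE_STATS = 3
TASKSTATS_TYPE_AGGR_PID = 4
TASKSTATS_TYPE_AGGR_TGID = 5
NLA_HDRLEN = 4
NLA_TYPE_MASK = 0x3FFF  # strips the nested / byte-order flag bits


def iter_attrs(buf):
    """Yield (type, payload) for each attribute in a packed attribute stream."""
    off = 0
    while off + NLA_HDRLEN <= len(buf):
        length, attr_type = struct.unpack_from("=HH", buf, off)
        if length < NLA_HDRLEN or off + length > len(buf):
            break  # truncated or malformed attribute: stop rather than guess
        yield attr_type & NLA_TYPE_MASK, buf[off + NLA_HDRLEN:off + length]
        off += (length + 3) & ~3  # the next attribute starts 4-byte aligned


def parse_payload(buf):
    """Return a list of (kind, id, raw_stats) records found in the payload."""
    records = []
    current = {"kind": None, "id": None}

    def walk(stream):
        for attr_type, payload in iter_attrs(stream):
            if attr_type in (TASKSTATS_TYPE_AGGR_PID, TASKSTATS_TYPE_AGGR_TGID):
                walk(payload)  # no-op when the AGGR attribute is empty
            elif (attr_type in (TASKSTATS_TYPE_PID, TASKSTATS_TYPE_TGID)
                  and len(payload) >= 4):
                current["kind"] = (
                    "pid" if attr_type == TASKSTATS_TYPE_PID else "tgid")
                (current["id"],) = struct.unpack_from("=I", payload)
            elif attr_type == TASKSTATS_TYPE_STATS:
                records.append((current["kind"], current["id"], payload))
            # Any other attribute type is ignored.

    walk(buf)
    return records
```
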
per-tgid stats
--------------

Taskstats provides per-process stats, in addition to per-task stats, since
resource management is often done at a process granularity and aggregating task
stats in userspace alone is inefficient and potentially inaccurate (due to lack
of atomicity).

However, maintaining per-process stats in addition to per-task stats within the
kernel has space and time overheads. To address this, the taskstats code
accumulates each exiting task's statistics into a process-wide data structure.
When the last task of a process exits, the accumulated process-level data is
also sent to userspace (along with the per-task data).

When a user queries per-tgid data, the stats of all live threads in the group
are summed and added to the accumulated total for previously exited threads of
the same thread group.

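The accumulation scheme described above can be modelled in a few lines. This
is a toy userspace model, not the kernel code: the ``Stats`` fields and the
``ThreadGroup`` class are hypothetical names chosen for illustration, and the
real counters are the ones defined in struct taskstats.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Stats:
    # Hypothetical counters standing in for struct taskstats fields.
    cpu_ns: int = 0
    blkio_delay_ns: int = 0

    def plus(self, other: "Stats") -> "Stats":
        return Stats(self.cpu_ns + other.cpu_ns,
                     self.blkio_delay_ns + other.blkio_delay_ns)


@dataclass
class ThreadGroup:
    # Running total for threads that have already exited.
    exited_total: Stats = field(default_factory=Stats)
    # Current per-thread stats for threads that are still alive, keyed by tid.
    live: Dict[int, Stats] = field(default_factory=dict)

    def on_exit(self, tid: int) -> Optional[Stats]:
        """Fold an exiting thread into the total.

        Returns the per-tgid total if this was the last thread, else None.
        """
        self.exited_total = self.exited_total.plus(self.live.pop(tid))
        return self.exited_total if not self.live else None

    def query(self) -> Stats:
        """Per-tgid answer: exited total plus the sum over live threads."""
        total = self.exited_total
        for stats in self.live.values():
            total = total.plus(stats)
        return total
```
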
Extending taskstats
-------------------

There are two ways to extend the taskstats interface to export more
per-task/process stats as patches to collect them get added to the kernel
in the future:

1. Adding more fields to the end of the existing struct taskstats. Backward
   compatibility is ensured by the version number within the
   structure. Userspace will use only the fields of the struct that correspond
   to the version it is using.

2. Defining separate statistic structs and using the netlink attributes
   interface to return them. Since userspace processes each netlink attribute
   independently, it can always ignore attributes whose type it does not
   understand (because it is using an older version of the interface).


Choosing between 1. and 2. is a matter of trading off flexibility and
overhead. If only a few fields need to be added, then 1. is the preferable
path since the kernel and userspace don't need to incur the overhead of
processing new netlink attributes. But if the new fields expand the existing
struct too much, requiring disparate userspace accounting utilities to
unnecessarily receive large structures whose fields are of no interest, then
extending the attributes structure would be worthwhile.

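Option 1 relies on userspace reading only the fields that belong to the
version it understands. A minimal sketch of that idea follows. The layout is
hypothetical: only the leading u16 version field reflects struct taskstats,
while the field names, sizes and lack of padding are invented for the example.

```python
import struct

# Hypothetical append-only layout: (first_version, name, struct format),
# listed in the order the fields appear after the u16 version.
FIELDS = [
    (1, "cpu_delay_ns", "Q"),
    (1, "blkio_delay_ns", "Q"),
    (2, "swapin_delay_ns", "Q"),
]


def decode(buf, understood_version):
    """Decode only the fields known to both the sender and this reader."""
    (version,) = struct.unpack_from("=H", buf, 0)
    usable = min(version, understood_version)
    out = {"version": version}
    off = 2
    for first_version, name, fmt in FIELDS:
        if first_version > usable:
            break  # fields are append-only, so everything after is newer
        size = struct.calcsize("=" + fmt)
        if off + size > len(buf):
            break  # buffer shorter than expected: keep what was decoded
        (out[name],) = struct.unpack_from("=" + fmt, buf, off)
        off += size
    return out
```

An older reader given a newer, longer buffer simply stops at the last field
it knows, which is why new fields can only be added at the end.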
Flow control for taskstats
--------------------------

When the rate of task exits becomes large, a listener may not be able to keep
up with the kernel's rate of sending per-tid/tgid exit data, leading to data
loss. This possibility gets compounded when the taskstats structure gets
extended and the number of cpus grows large.

To avoid losing statistics, userspace should do one or more of the following:

- increase the receive buffer sizes for the netlink sockets opened by
  listeners to receive exit data.

- create more listeners and reduce the number of cpus being listened to by
  each listener. In the extreme case, there could be one listener for each cpu.
  Users may also consider setting the cpu affinity of the listener to the subset
  of cpus to which it listens, especially if they are listening to just one cpu.

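One way to spread cpus across several listeners is to split the cpu list into
contiguous chunks and format each chunk in the comma-separated range syntax
used for the cpumask attribute. The helpers below are a sketch with
hypothetical names; they only build the strings and do not register anything
with the kernel.

```python
def format_cpulist(cpus):
    """Format cpu numbers as comma-separated ranges, e.g. "1-3,5,7-8"."""
    ranges = []
    for cpu in sorted(set(cpus)):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu          # extend the current range
        else:
            ranges.append([cpu, cpu])    # start a new range
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def split_cpus(cpus, n_listeners):
    """Return one cpumask string per listener, covering every cpu once."""
    cpus = sorted(set(cpus))
    if not cpus:
        return []
    n = max(1, min(n_listeners, len(cpus)))  # never more listeners than cpus
    base, extra = divmod(len(cpus), n)
    masks, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        masks.append(format_cpulist(cpus[start:start + size]))
        start += size
    return masks
```
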
Despite these measures, if userspace receives ENOBUFS error messages
indicating overflow of receive buffers, it should take measures to handle the
loss of data.
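
A listener might combine a larger receive buffer with explicit ENOBUFS
handling along the following lines. This is a sketch under assumptions: the
helper names and the ``counters`` dictionary are hypothetical, the functions
work on any socket object rather than a netlink socket specifically, and the
kernel may apply a smaller buffer size than the one requested.

```python
import errno
import socket


def grow_rcvbuf(sock, size):
    """Request a larger receive buffer; return the size the kernel reports."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


def recv_counting_overruns(sock, bufsize, counters):
    """Receive one message; on ENOBUFS record the loss and return None."""
    try:
        return sock.recv(bufsize)
    except OSError as exc:
        if exc.errno != errno.ENOBUFS:
            raise  # unrelated errors are left to the caller
        # Some exit records were dropped; note it so the gap can be
        # reported or compensated for by re-querying live tasks.
        counters["overruns"] = counters.get("overruns", 0) + 1
        return None
```
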