.. _psi:

================================
PSI - Pressure Stall Information
================================

:Date: April, 2018
:Author: Johannes Weiner <hannes@cmpxchg.org>

When CPU, memory or IO devices are contended, workloads experience
latency spikes, throughput losses, and run the risk of OOM kills.

Without an accurate measure of such contention, users are forced to
either play it safe and under-utilize their hardware resources, or
roll the dice and frequently suffer the disruptions resulting from
excessive overcommit.

The psi feature identifies and quantifies the disruptions caused by
such resource crunches and the time impact they have on complex
workloads or even entire systems.

Having an accurate measure of productivity losses caused by resource
scarcity aids users in sizing workloads to hardware--or provisioning
hardware according to workload demand.

As psi aggregates this information in realtime, systems can be managed
dynamically using techniques such as load shedding, migrating jobs to
other systems or data centers, or strategically pausing or killing low
priority or restartable batch jobs.

This allows maximizing hardware utilization without sacrificing
workload health or risking major disruptions such as OOM kills.

Pressure interface
==================

Pressure information for each resource is exported through the
respective file in /proc/pressure/ -- cpu, memory, and io.

The format is as such::

  some avg10=0.00 avg60=0.00 avg300=0.00 total=0
  full avg10=0.00 avg60=0.00 avg300=0.00 total=0

0045 The "some" line indicates the share of time in which at least some
0046 tasks are stalled on a given resource.
0047
0048 The "full" line indicates the share of time in which all non-idle
0049 tasks are stalled on a given resource simultaneously. In this state
0050 actual CPU cycles are going to waste, and a workload that spends
0051 extended time in this state is considered to be thrashing. This has
0052 severe impact on performance, and it's useful to distinguish this
0053 situation from a state where some tasks are stalled but the CPU is
0054 still doing productive work. As such, time spent in this subset of the
0055 stall state is tracked separately and exported in the "full" averages.
0056
0057 CPU full is undefined at the system level, but has been reported
0058 since 5.13, so it is set to zero for backward compatibility.
0059
The ratios (in %) are tracked as recent trends over ten, sixty, and
three hundred second windows, giving insight into short-term events as
well as medium- and long-term trends. The total absolute stall time
(in us) is tracked and exported as well, to allow detection of latency
spikes which wouldn't necessarily make a dent in the time averages,
or to average trends over custom time frames.
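
As an illustration, here is a minimal sketch of consuming this
interface, assuming only the file layout shown above: it reads
/proc/pressure/memory like any other procfs file and extracts the
ten-second "some" average with a scanf format string derived from
that layout::

  #include <stdio.h>

  int main(void)
  {
      char line[256];
      float avg10;
      FILE *f = fopen("/proc/pressure/memory", "r");

      if (!f) {
          perror("/proc/pressure/memory");
          return 1;
      }
      /* Each line starts with "some" or "full"; match the former. */
      while (fgets(line, sizeof(line), f)) {
          if (sscanf(line, "some avg10=%f", &avg10) == 1)
              printf("memory some avg10: %.2f%%\n", avg10);
      }
      fclose(f);
      return 0;
  }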

Monitoring for pressure thresholds
==================================

Users can register triggers and use poll() to be woken up when resource
pressure exceeds certain thresholds.

A trigger describes the maximum cumulative stall time over a specific
time window, e.g. 100ms of total stall time within any 500ms window to
generate a wakeup event.

To register a trigger, the user has to open the psi interface file
under /proc/pressure/ representing the resource to be monitored and
write the desired threshold and time window. The open file descriptor
should be used to wait for trigger events using select(), poll() or
epoll(). The following format is used::

  <some|full> <stall amount in us> <time window in us>

For example, writing "some 150000 1000000" into /proc/pressure/memory
would add a 150ms threshold for partial memory stalls measured within
a 1sec time window. Writing "full 50000 1000000" into
/proc/pressure/io would add a 50ms threshold for full io stalls
measured within a 1sec time window.

Triggers can be set on more than one psi metric, and more than one
trigger can be specified for the same psi metric. However, each trigger
needs its own file descriptor to be pollable separately from the
others, so a separate open() syscall should be made for each trigger,
even when opening the same psi interface file. Write operations to a
file descriptor with an already existing psi trigger will fail with
EBUSY.
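
The following sketch illustrates this rule: it registers two triggers
on the same interface file through two independent descriptors and
polls them together. The threshold and window values are arbitrary
examples, and error handling is reduced to the essentials::

  #include <fcntl.h>
  #include <poll.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
      const char *trigs[2] = {
          "some 100000 1000000",  /* 100ms partial stall per 1s */
          "full 50000 1000000",   /* 50ms full stall per 1s */
      };
      struct pollfd fds[2];
      int i;

      for (i = 0; i < 2; i++) {
          /* One open() per trigger, even for the same file. */
          fds[i].fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
          if (fds[i].fd < 0 ||
              write(fds[i].fd, trigs[i], strlen(trigs[i]) + 1) < 0) {
              perror("trigger setup");
              return 1;
          }
          fds[i].events = POLLPRI;
      }

      for (;;) {
          if (poll(fds, 2, -1) < 0) {
              perror("poll");
              return 1;
          }
          for (i = 0; i < 2; i++)
              if (fds[i].revents & POLLPRI)
                  printf("trigger %d fired\n", i);
      }
  }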

Monitors activate only when the system enters the stall state for the
monitored psi metric and deactivate upon exit from the stall state.
While the system is in the stall state, psi signal growth is monitored
at a rate of 10 times per tracking window.

The kernel accepts window sizes ranging from 500ms to 10s, so the
minimum monitoring update interval is 50ms and the maximum is 1s. The
minimum limit is set to prevent overly frequent polling. The maximum
limit is chosen as a high enough number after which monitors are most
likely not needed and psi averages can be used instead.

When activated, a psi monitor stays active for at least the duration
of one tracking window to avoid repeated activations/deactivations
when the system is bouncing in and out of the stall state.

Notifications to userspace are rate-limited to one per tracking window.

The trigger will de-register when the file descriptor used to define
the trigger is closed.

Userspace monitor usage example
===============================

::

  #include <errno.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <poll.h>
  #include <string.h>
  #include <unistd.h>

  /*
   * Monitor memory partial stall with 1s tracking window size
   * and 150ms threshold.
   */
  int main() {
      const char trig[] = "some 150000 1000000";
      struct pollfd fds;
      int n;

      fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
      if (fds.fd < 0) {
          printf("/proc/pressure/memory open error: %s\n",
                 strerror(errno));
          return 1;
      }
      fds.events = POLLPRI;

      /* Register the trigger by writing it to the open descriptor. */
      if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
          printf("/proc/pressure/memory write error: %s\n",
                 strerror(errno));
          return 1;
      }

      printf("waiting for events...\n");
      while (1) {
          /* Block until the kernel delivers a trigger event. */
          n = poll(&fds, 1, -1);
          if (n < 0) {
              printf("poll error: %s\n", strerror(errno));
              return 1;
          }
          if (fds.revents & POLLERR) {
              printf("got POLLERR, event source is gone\n");
              return 0;
          }
          if (fds.revents & POLLPRI) {
              printf("event triggered!\n");
          } else {
              printf("unknown event received: 0x%x\n", fds.revents);
              return 1;
          }
      }

      return 0;
  }
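
The example can be built with any C compiler, e.g. "gcc psi-monitor.c
-o psi-monitor" (the file name is arbitrary). On a kernel built with
CONFIG_PSI=y it blocks in poll() and prints a line whenever the
configured memory pressure threshold is exceeded.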

Cgroup2 interface
=================

In a system with a CONFIG_CGROUPS=y kernel and the cgroup2 filesystem
mounted, pressure stall information is also tracked for tasks grouped
into cgroups. Each subdirectory in the cgroupfs mountpoint contains
cpu.pressure, memory.pressure, and io.pressure files; the format is
the same as the /proc/pressure/ files.

Per-cgroup psi monitors can be specified and used the same way as
system-wide ones.
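
For instance, here is a minimal sketch of registering a trigger on one
cgroup's memory pressure, assuming cgroup2 is mounted at the
conventional /sys/fs/cgroup and using a placeholder cgroup name
("workload")::

  #include <fcntl.h>
  #include <poll.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
      const char trig[] = "some 150000 1000000";
      struct pollfd fds;

      /* Same trigger format as /proc/pressure/, different file. */
      fds.fd = open("/sys/fs/cgroup/workload/memory.pressure",
                    O_RDWR | O_NONBLOCK);
      if (fds.fd < 0 || write(fds.fd, trig, strlen(trig) + 1) < 0) {
          perror("memory.pressure");
          return 1;
      }
      fds.events = POLLPRI;

      /* Wait for a single event, then exit. */
      if (poll(&fds, 1, -1) > 0 && (fds.revents & POLLPRI))
          printf("memory pressure event in cgroup\n");
      close(fds.fd);
      return 0;
  }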