Performance Counters for Linux
------------------------------

Performance counters are special hardware registers available on most modern
CPUs. These registers count the number of certain types of hardware events,
such as instructions executed, cache misses suffered, or branches
mis-predicted - without slowing down the kernel or applications. These
registers can also trigger interrupts when a threshold number of events
has passed - and can thus be used to profile the code that runs on that CPU.

The Linux Performance Counter subsystem provides an abstraction of these
hardware capabilities. It provides per task and per CPU counters, counter
groups, and it provides event capabilities on top of those. It
provides "virtual" 64-bit counters, regardless of the width of the
underlying hardware counters.

Performance counters are accessed via special file descriptors.
There's one file descriptor per virtual counter used.

The special file descriptor is opened via the sys_perf_event_open()
system call:

   int sys_perf_event_open(struct perf_event_attr *hw_event_uptr,
                           pid_t pid, int cpu, int group_fd,
                           unsigned long flags);

The syscall returns the new fd. The fd can be used via the normal
VFS system calls: read() can be used to read the counter, fcntl()
can be used to set the blocking mode, etc.

Multiple counters can be kept open at a time, and the counters
can be poll()ed.
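
For illustration, here is a minimal sketch of opening a counting counter on
the current task and reading it back. The perf_event_open() wrapper around
the raw syscall and the use of <linux/perf_event.h> for the event constants
are assumptions of this sketch, not requirements of the interface:

   #include <stdio.h>
   #include <string.h>
   #include <unistd.h>
   #include <sys/syscall.h>
   #include <linux/perf_event.h>

   /* Hypothetical convenience wrapper around the raw system call. */
   static long perf_event_open(struct perf_event_attr *hw_event,
                               pid_t pid, int cpu, int group_fd,
                               unsigned long flags)
   {
           return syscall(__NR_perf_event_open, hw_event, pid, cpu,
                          group_fd, flags);
   }

   int main(void)
   {
           struct perf_event_attr hw_event;
           unsigned long long count;
           int fd;

           memset(&hw_event, 0, sizeof(hw_event));
           hw_event.config = PERF_COUNT_HW_INSTRUCTIONS;  /* hardware event */

           /* pid == 0: current task, cpu == -1: any CPU, no group, no flags */
           fd = perf_event_open(&hw_event, 0, -1, -1, 0);
           if (fd < 0) {
                   perror("perf_event_open");
                   return 1;
           }

           /* ... workload to be measured ... */

           read(fd, &count, sizeof(count));      /* the counter, as a u64 */
           printf("instructions: %llu\n", count);
           close(fd);

           return 0;
   }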

When creating a new counter fd, 'perf_event_attr' is:

struct perf_event_attr {
        /*
         * The MSB of the config word signifies if the rest contains cpu
         * specific (raw) counter configuration data; if unset, the next
         * 7 bits are an event type and the rest of the bits are the event
         * identifier.
         */
        __u64                   config;

        __u64                   irq_period;
        __u32                   record_type;
        __u32                   read_format;
        __u64                   disabled       :  1, /* off by default        */
                                inherit        :  1, /* children inherit it   */
                                pinned         :  1, /* must always be on PMU */
                                exclusive      :  1, /* only group on PMU     */
                                exclude_user   :  1, /* don't count user      */
                                exclude_kernel :  1, /* ditto kernel          */
                                exclude_hv     :  1, /* ditto hypervisor      */
                                exclude_idle   :  1, /* don't count when idle */
                                mmap           :  1, /* include mmap data     */
                                munmap         :  1, /* include munmap data   */
                                comm           :  1, /* include comm data     */

                                __reserved_1   : 53; /* remaining bits of the __u64 */

        __u32                   extra_config_len;
        __u32                   wakeup_events;  /* wakeup every n events */

        __u64                   __reserved_2;
        __u64                   __reserved_3;
};

The 'config' field specifies what the counter should count. It
is divided into 3 bit-fields:

   raw_type: 1 bit   (most significant bit)       0x8000_0000_0000_0000
   type:     7 bits  (next most significant)      0x7f00_0000_0000_0000
   event_id: 56 bits (least significant)          0x00ff_ffff_ffff_ffff

If 'raw_type' is 1, then the counter will count a hardware event
specified by the remaining 63 bits of 'config'. The encoding is
machine-specific.

If 'raw_type' is 0, then the 'type' field says what kind of counter
this is, with the following encoding:

enum perf_type_id {
        PERF_TYPE_HARDWARE      = 0,
        PERF_TYPE_SOFTWARE      = 1,
        PERF_TYPE_TRACEPOINT    = 2,
};
0090
0091 A counter of PERF_TYPE_HARDWARE will count the hardware event
0092 specified by 'event_id':
0093
0094 /*
0095 * Generalized performance counter event types, used by the hw_event.event_id
0096 * parameter of the sys_perf_event_open() syscall:
0097 */
0098 enum perf_hw_id {
0099 /*
0100 * Common hardware events, generalized by the kernel:
0101 */
0102 PERF_COUNT_HW_CPU_CYCLES = 0,
0103 PERF_COUNT_HW_INSTRUCTIONS = 1,
0104 PERF_COUNT_HW_CACHE_REFERENCES = 2,
0105 PERF_COUNT_HW_CACHE_MISSES = 3,
0106 PERF_COUNT_HW_BRANCH_INSTRUCTIONS = 4,
0107 PERF_COUNT_HW_BRANCH_MISSES = 5,
0108 PERF_COUNT_HW_BUS_CYCLES = 6,
0109 PERF_COUNT_HW_STALLED_CYCLES_FRONTEND = 7,
0110 PERF_COUNT_HW_STALLED_CYCLES_BACKEND = 8,
0111 PERF_COUNT_HW_REF_CPU_CYCLES = 9,
0112 };
0113
0114 These are standardized types of events that work relatively uniformly
0115 on all CPUs that implement Performance Counters support under Linux,
0116 although there may be variations (e.g., different CPUs might count
0117 cache references and misses at different levels of the cache hierarchy).
0118 If a CPU is not able to count the selected event, then the system call
0119 will return -EINVAL.
0120
0121 More hw_event_types are supported as well, but they are CPU-specific
0122 and accessed as raw events. For example, to count "External bus
0123 cycles while bus lock signal asserted" events on Intel Core CPUs, pass
0124 in a 0x4064 event_id value and set hw_event.raw_type to 1.
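
As a rough sketch (not a definitive recipe), such a raw 'config' value can be
composed following the bit layout described above; the raw_config() helper is
purely illustrative, __u64 comes from <linux/types.h>, and 0x4064 is the
machine-specific example from the text:

   /*
    * Build a raw 'config' value per the layout above:
    * MSB = raw_type, low 63 bits = machine-specific event encoding.
    */
   static __u64 raw_config(__u64 machine_specific_encoding)
   {
           return (1ULL << 63) | machine_specific_encoding;
   }

   /* e.g. "bus cycles while bus lock signal asserted" on Intel Core CPUs: */
   /* hw_event.config = raw_config(0x4064); */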

A counter of type PERF_TYPE_SOFTWARE will count one of the available
software events, selected by 'event_id':

/*
 * Special "software" counters provided by the kernel, even if the hardware
 * does not support performance counters. These counters measure various
 * physical and sw events of the kernel (and allow the profiling of them as
 * well):
 */
enum perf_sw_ids {
        PERF_COUNT_SW_CPU_CLOCK         = 0,
        PERF_COUNT_SW_TASK_CLOCK        = 1,
        PERF_COUNT_SW_PAGE_FAULTS       = 2,
        PERF_COUNT_SW_CONTEXT_SWITCHES  = 3,
        PERF_COUNT_SW_CPU_MIGRATIONS    = 4,
        PERF_COUNT_SW_PAGE_FAULTS_MIN   = 5,
        PERF_COUNT_SW_PAGE_FAULTS_MAJ   = 6,
        PERF_COUNT_SW_ALIGNMENT_FAULTS  = 7,
        PERF_COUNT_SW_EMULATION_FAULTS  = 8,
};

Counters of the type PERF_TYPE_TRACEPOINT are available when the ftrace event
tracer is available, and event_id values can be obtained from
/debug/tracing/events/*/*/id
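
For illustration, a 'config' value for a non-raw counter can be composed like
this, following the bit layout this document describes (the make_config()
helper is only a sketch, not part of the ABI; __u64 comes from <linux/types.h>):

   /*
    * Compose 'config' from a type and an event_id, per the layout above:
    * 7-bit type below the raw bit, 56-bit event_id in the low bits.
    */
   static __u64 make_config(__u64 type, __u64 event_id)
   {
           return (type << 56) | (event_id & ((1ULL << 56) - 1));
   }

   /* A software counter for page faults: */
   /* hw_event.config = make_config(PERF_TYPE_SOFTWARE, PERF_COUNT_SW_PAGE_FAULTS); */

   /* A tracepoint counter, with the id read from /debug/tracing/events/... */
   /* hw_event.config = make_config(PERF_TYPE_TRACEPOINT, tracepoint_id); */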


Counters come in two flavours: counting counters and sampling
counters. A "counting" counter is one that is used for counting the
number of events that occur, and is characterised by having
irq_period = 0.


A read() on a counter returns the current value of the counter and possible
additional values as specified by 'read_format'; each value is a u64 (8 bytes)
in size.

/*
 * Bits that can be set in hw_event.read_format to request that
 * reads on the counter should return the indicated quantities,
 * in increasing order of bit value, after the counter value.
 */
enum perf_event_read_format {
        PERF_FORMAT_TOTAL_TIME_ENABLED  = 1,
        PERF_FORMAT_TOTAL_TIME_RUNNING  = 2,
};
0171
0172 Using these additional values one can establish the overcommit ratio for a
0173 particular counter allowing one to take the round-robin scheduling effect
0174 into account.
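
A minimal sketch of how the extra values could be used to scale an
over-committed counter, assuming both bits are set in 'read_format' (the
struct and helper below are illustrative only; they need <unistd.h> and
<linux/types.h>, and error handling is kept minimal):

   /* read() then returns three u64 values, in this order: */
   struct read_values {
           __u64   count;          /* the counter value               */
           __u64   time_enabled;   /* PERF_FORMAT_TOTAL_TIME_ENABLED  */
           __u64   time_running;   /* PERF_FORMAT_TOTAL_TIME_RUNNING  */
   };

   static double scaled_count(int fd)
   {
           struct read_values v;

           if (read(fd, &v, sizeof(v)) != sizeof(v) || !v.time_running)
                   return 0.0;

           /* Scale the raw count up by the overcommit ratio. */
           return (double)v.count * v.time_enabled / v.time_running;
   }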


A "sampling" counter is one that is set up to generate an interrupt
every N events, where N is given by 'irq_period'. A sampling counter
has irq_period > 0. The record_type controls what data is recorded on each
interrupt:

/*
 * Bits that can be set in hw_event.record_type to request information
 * in the overflow packets.
 */
enum perf_event_record_format {
        PERF_RECORD_IP          = 1U << 0,
        PERF_RECORD_TID         = 1U << 1,
        PERF_RECORD_TIME        = 1U << 2,
        PERF_RECORD_ADDR        = 1U << 3,
        PERF_RECORD_GROUP       = 1U << 4,
        PERF_RECORD_CALLCHAIN   = 1U << 5,
};

Such (and other) events will be recorded in a ring-buffer, which is
available to user-space using mmap() (see below).
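
A sketch of a sampling configuration, using the field names from the struct
shown earlier in this document (the attribute and event declarations above
are assumed to be in scope, and <string.h> is needed for memset()):

   /*
    * Sample every 100000 CPU-cycle events, recording the instruction
    * pointer and pid/tid of the interrupted context in each record.
    */
   static void setup_sampling(struct perf_event_attr *hw_event)
   {
           memset(hw_event, 0, sizeof(*hw_event));
           hw_event->config      = PERF_COUNT_HW_CPU_CYCLES;
           hw_event->irq_period  = 100000;
           hw_event->record_type = PERF_RECORD_IP | PERF_RECORD_TID;
   }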

The 'disabled' bit specifies whether the counter starts out disabled
or enabled. If it is initially disabled, it can be enabled by ioctl
or prctl (see below).

The 'inherit' bit, if set, specifies that this counter should count
events on descendant tasks as well as the task specified. This only
applies to new descendants, not to any existing descendants at the
time the counter is created (nor to any new descendants of existing
descendants).

The 'pinned' bit, if set, specifies that the counter should always be
on the CPU if at all possible. It only applies to hardware counters
and only to group leaders. If a pinned counter cannot be put onto the
CPU (e.g. because there are not enough hardware counters or because of
a conflict with some other event), then the counter goes into an
'error' state, where reads return end-of-file (i.e. read() returns 0)
until the counter is subsequently enabled or disabled.

The 'exclusive' bit, if set, specifies that when this counter's group
is on the CPU, it should be the only group using the CPU's counters.
In future, this will allow sophisticated monitoring programs to supply
extra configuration information via 'extra_config_len' to exploit
advanced features of the CPU's Performance Monitor Unit (PMU) that are
not otherwise accessible and that might disrupt other hardware
counters.

The 'exclude_user', 'exclude_kernel' and 'exclude_hv' bits provide a
way to request that counting of events be restricted to times when the
CPU is in user, kernel and/or hypervisor mode.

Furthermore, the 'exclude_host' and 'exclude_guest' bits provide a way
to request counting of events restricted to guest and host contexts when
using Linux as the hypervisor.

The 'mmap' and 'munmap' bits allow recording of PROT_EXEC mmap/munmap
operations. These can be used to relate userspace IP addresses to actual
code, even after the mapping (or even the whole process) is gone;
these events are recorded in the ring-buffer (see below).

The 'comm' bit allows tracking of process comm data on process creation.
This too is recorded in the ring-buffer (see below).

The 'pid' parameter to the sys_perf_event_open() system call allows the
counter to be specific to a task:

 pid == 0: if the pid parameter is zero, the counter is attached to the
 current task.

 pid > 0: the counter is attached to a specific task (if the current task
 has sufficient privilege to do so)

 pid < 0: all tasks are counted (per cpu counters)

The 'cpu' parameter allows a counter to be made specific to a CPU:

 cpu >= 0: the counter is restricted to a specific CPU
 cpu == -1: the counter counts on all CPUs

(Note: the combination of 'pid == -1' and 'cpu == -1' is not valid.)

A 'pid > 0' and 'cpu == -1' counter is a per task counter that counts
events of that task and 'follows' that task to whatever CPU the task
gets scheduled to. Per task counters can be created by any user, for
their own tasks.

A 'pid == -1' and 'cpu == x' counter is a per CPU counter that counts
all events on CPU-x. Per CPU counters need CAP_PERFMON or CAP_SYS_ADMIN
privilege.
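
For example, a per CPU counter for CPU 0 might be opened like this (a sketch
reusing the hypothetical perf_event_open() wrapper from the earlier example):

   /* pid == -1, cpu == 0: count all tasks on CPU 0 (needs privilege). */
   int cpu0_fd = perf_event_open(&hw_event, -1, 0, -1, 0);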

The 'flags' parameter is currently unused and must be zero.

The 'group_fd' parameter allows counter "groups" to be set up. A
counter group has one counter which is the group "leader". The leader
is created first, with group_fd = -1 in the sys_perf_event_open call
that creates it. The rest of the group members are created
subsequently, with group_fd giving the fd of the group leader.
(A single counter on its own is created with group_fd = -1 and is
considered to be a group with only 1 member.)

A counter group is scheduled onto the CPU as a unit, that is, it will
only be put onto the CPU if all of the counters in the group can be
put onto the CPU. This means that the values of the member counters
can be meaningfully compared, added, divided (to get ratios), etc.,
with each other, since they have counted events for the same set of
executed instructions.
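
As an illustration (a non-authoritative sketch reusing the hypothetical
perf_event_open() wrapper from above; includes and error handling omitted),
a cycles/instructions group whose ratio gives instructions per cycle:

   static double measure_ipc(void)
   {
           struct perf_event_attr hw_event;
           unsigned long long cycles = 0, instructions = 0;
           int leader_fd, member_fd;

           memset(&hw_event, 0, sizeof(hw_event));
           hw_event.config = PERF_COUNT_HW_CPU_CYCLES;
           leader_fd = perf_event_open(&hw_event, 0, -1, -1 /* leader */, 0);

           hw_event.config = PERF_COUNT_HW_INSTRUCTIONS;
           member_fd = perf_event_open(&hw_event, 0, -1, leader_fd /* join */, 0);

           /* ... workload to be measured ... */

           read(leader_fd, &cycles, sizeof(cycles));
           read(member_fd, &instructions, sizeof(instructions));

           return cycles ? (double)instructions / cycles : 0.0;
   }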


As stated above, asynchronous events, like counter overflow or PROT_EXEC mmap
tracking, are logged into a ring-buffer. This ring-buffer is created and
accessed through mmap().

The mmap size should be 1+2^n pages, where the first page is a meta-data page
(struct perf_event_mmap_page) that contains various bits of information such
as where the ring-buffer head is.

/*
 * Structure of the page that can be mapped via mmap
 */
struct perf_event_mmap_page {
        __u32   version;                /* version number of this structure */
        __u32   compat_version;         /* lowest version this is compat with */

        /*
         * Bits needed to read the hw counters in user-space.
         *
         *   u32 seq;
         *   s64 count;
         *
         *   do {
         *     seq = pc->lock;
         *
         *     barrier()
         *     if (pc->index) {
         *       count = pmc_read(pc->index - 1);
         *       count += pc->offset;
         *     } else
         *       goto regular_read;
         *
         *     barrier();
         *   } while (pc->lock != seq);
         *
         * NOTE: for obvious reasons this only works on self-monitoring
         *       processes.
         */
        __u32   lock;                   /* seqlock for synchronization */
        __u32   index;                  /* hardware counter identifier */
        __s64   offset;                 /* add to hardware counter value */

        /*
         * Control data for the mmap() data buffer.
         *
         * When reading this value, user-space should issue an rmb() on SMP
         * capable platforms -- see perf_event_wakeup().
         */
        __u32   data_head;              /* head in the data section */
};

NOTE: the hw-counter userspace bits are arch specific and are currently only
implemented on powerpc.
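
As a sketch (assuming the structure above is in scope; the helper names, the
use of a GCC builtin as the read barrier, and the choice of n == 3 are
assumptions, not part of the interface), the buffer could be mapped and the
head read like this:

   /* Needs <sys/mman.h> and <unistd.h>. Map 1 + 2^n pages (here n == 3);
    * page 0 is the meta-data page. */
   static struct perf_event_mmap_page *map_counter(int fd, size_t *data_size)
   {
           size_t page = sysconf(_SC_PAGESIZE);
           void *base;

           *data_size = 8 * page;
           base = mmap(NULL, page + *data_size, PROT_READ, MAP_SHARED, fd, 0);

           return base == MAP_FAILED ? NULL : base;
   }

   /* Read the current ring-buffer head; the barrier mirrors the rmb() note. */
   static __u32 read_data_head(volatile struct perf_event_mmap_page *pc)
   {
           __u32 head = pc->data_head;

           __sync_synchronize();
           return head;
   }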

The following 2^n pages are the ring-buffer which contains events of the form:

#define PERF_RECORD_MISC_KERNEL         (1 << 0)
#define PERF_RECORD_MISC_USER           (1 << 1)
#define PERF_RECORD_MISC_OVERFLOW       (1 << 2)

struct perf_event_header {
        __u32   type;
        __u16   misc;
        __u16   size;
};

enum perf_event_type {

        /*
         * The MMAP events record the PROT_EXEC mappings so that we can
         * correlate userspace IPs to code. They have the following structure:
         *
         * struct {
         *      struct perf_event_header        header;
         *
         *      u32                             pid, tid;
         *      u64                             addr;
         *      u64                             len;
         *      u64                             pgoff;
         *      char                            filename[];
         * };
         */
        PERF_RECORD_MMAP        = 1,
        PERF_RECORD_MUNMAP      = 2,

        /*
         * struct {
         *      struct perf_event_header        header;
         *
         *      u32                             pid, tid;
         *      char                            comm[];
         * };
         */
        PERF_RECORD_COMM        = 3,

        /*
         * When header.misc & PERF_RECORD_MISC_OVERFLOW the event_type field
         * will be PERF_RECORD_*
         *
         * struct {
         *      struct perf_event_header        header;
         *
         *      { u64                   ip;       } && PERF_RECORD_IP
         *      { u32                   pid, tid; } && PERF_RECORD_TID
         *      { u64                   time;     } && PERF_RECORD_TIME
         *      { u64                   addr;     } && PERF_RECORD_ADDR
         *
         *      { u64                   nr;
         *        { u64 event, val; }   cnt[nr];  } && PERF_RECORD_GROUP
         *
         *      { u16                   nr,
         *                              hv,
         *                              kernel,
         *                              user;
         *        u64                   ips[nr];  } && PERF_RECORD_CALLCHAIN
         * };
         */
};

NOTE: PERF_RECORD_CALLCHAIN is arch specific and currently only implemented
on x86.
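
To make the layout concrete, here is a sketch of walking the records between
two head positions. It ignores the case of a record wrapping around the end
of the buffer, and the function and parameter names are illustrative only:

   /* 'data' points at the first ring-buffer page (one page past the
    * meta-data page); 'size' is the size of the 2^n data pages. */
   static void walk_records(unsigned char *data, size_t size,
                            __u32 old_head, __u32 head)
   {
           while (old_head < head) {
                   struct perf_event_header *hdr =
                           (struct perf_event_header *)&data[old_head % size];

                   if (hdr->misc & PERF_RECORD_MISC_OVERFLOW)
                           ; /* overflow record: layout follows record_type */
                   else if (hdr->type == PERF_RECORD_MMAP)
                           ; /* decode the mmap structure described above */

                   old_head += hdr->size;  /* each record carries its size */
           }
   }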

Notification of new events is possible through poll()/select()/epoll() and
fcntl() managing signals.

Normally a notification is generated for every page filled; however, one can
additionally set perf_event_attr.wakeup_events to generate one every
so many counter overflow events.

Future work will include a splice() interface to the ring-buffer.


Counters can be enabled and disabled in two ways: via ioctl and via
prctl. When a counter is disabled, it doesn't count or generate
events but does continue to exist and maintain its count value.

An individual counter can be enabled with

        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

or disabled with

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

For a counter group, pass PERF_IOC_FLAG_GROUP as the third argument.
Enabling or disabling the leader of a group enables or disables the
whole group; that is, while the group leader is disabled, none of the
counters in the group will count. Enabling or disabling a member of a
group other than the leader only affects that counter - disabling a
non-leader stops that counter from counting but doesn't affect any
other counter.
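
For example (a sketch; 'group_leader_fd' stands for the fd of a previously
created group leader):

        /* Disable, then later re-enable, the whole group via its leader. */
        ioctl(group_leader_fd, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
        /* ... */
        ioctl(group_leader_fd, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);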

Additionally, non-inherited overflow counters can use

        ioctl(fd, PERF_EVENT_IOC_REFRESH, nr);

to enable a counter for 'nr' events, after which it gets disabled again.

A process can enable or disable all the counter groups that are
attached to it, using prctl:

        prctl(PR_TASK_PERF_EVENTS_ENABLE);

        prctl(PR_TASK_PERF_EVENTS_DISABLE);

This applies to all counters on the current process, whether created
by this process or by another, and doesn't affect any counters that
this process has created on other processes. It only enables or
disables the group leaders, not any other members in the groups.


Arch requirements
-----------------

If your architecture does not have hardware performance metrics, you can
still use the generic software counters based on hrtimers for sampling.

So to start with, in order to add HAVE_PERF_EVENTS to your Kconfig, you
will need at least this:
- asm/perf_event.h - a basic stub will suffice at first
- support for atomic64 types (and associated helper functions)

If your architecture does have hardware capabilities, you can override the
weak stub hw_perf_event_init() to register hardware counters.

Architectures that have d-cache aliasing issues, such as Sparc and ARM,
should select PERF_USE_VMALLOC in order to avoid these for perf mmap().