===============
BPF ring buffer
===============

This document describes BPF ring buffer design, API, and implementation details.

.. contents::
    :local:
    :depth: 2

Motivation
----------

There are two distinct motivations for this work, neither of which is
satisfied by the existing perf buffer, and which prompted the creation of a
new ring buffer implementation:

- more efficient memory utilization by sharing the ring buffer across CPUs;
- preserving the ordering of events that happen sequentially in time, even
  across multiple CPUs (e.g., fork/exec/exit events for a task).

These two problems are independent, and the perf buffer satisfies neither.
Both are a result of the choice to have a per-CPU perf ring buffer, and both
can be solved by a multi-producer, single-consumer (MPSC) implementation of
the ring buffer. The ordering problem could technically be solved for the
perf buffer with some in-kernel counting, but given that the first problem
requires an MPSC buffer anyway, the same solution solves the second problem
automatically.

Semantics and APIs
------------------

A single ring buffer is presented to BPF programs as an instance of a BPF map
of type ``BPF_MAP_TYPE_RINGBUF``. Two other alternatives were considered, but
ultimately rejected.

One way would have been to, similarly to ``BPF_MAP_TYPE_PERF_EVENT_ARRAY``,
make ``BPF_MAP_TYPE_RINGBUF`` represent an array of ring buffers, but without
enforcing the "same CPU only" rule. This would have been a more familiar
interface, compatible with the existing perf buffer use in BPF, but it would
fail applications that need more advanced logic to look up a ring buffer by
an arbitrary key. With the current approach, ``BPF_MAP_TYPE_HASH_OF_MAPS``
addresses this need. Additionally, given the performance of BPF ringbuf, many
use cases would just opt into a simple single ring buffer shared among all
CPUs, for which the array approach would be overkill.

Another approach could have introduced a new concept, alongside BPF maps, to
represent a generic "container" object, which doesn't necessarily have the
key/value interface with lookup/update/delete operations. This approach would
require a lot of extra infrastructure to be built for observability and
verifier support. It would also add another concept that BPF developers would
have to familiarize themselves with, new syntax in libbpf, etc., yet would
provide no additional benefits over the approach of using a map.
``BPF_MAP_TYPE_RINGBUF`` doesn't support lookup/update/delete operations, but
neither do a few other map types (e.g., queue and stack; array doesn't
support delete, etc.).

The chosen approach has the advantage of re-using existing BPF map
infrastructure (introspection APIs in the kernel, libbpf support, etc.),
being a familiar concept (no need to teach users a new type of object in
a BPF program), and utilizing existing tooling (bpftool). For the common
scenario of using a single ring buffer for all CPUs, it is as simple and
straightforward as a dedicated "container" object would be. On the other
hand, by being a map, it can be combined with ``ARRAY_OF_MAPS`` and
``HASH_OF_MAPS`` map-in-maps to implement a wide variety of topologies, from
one ring buffer for each CPU (e.g., as a replacement for perf buffer use
cases), to complicated application-level hashing/sharding of ring buffers
(e.g., having a small pool of ring buffers with a hashed task tgid as the
lookup key, to preserve ordering while reducing contention).

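As an illustration of such a topology, a per-CPU layout could be declared
with libbpf's BTF map-definition syntax along these lines. This is only a
sketch: the map names and sizes are invented, and it assumes the usual libbpf
BPF-side headers:

```c
/* Hypothetical per-CPU ring buffer topology via map-in-map; assumes
 * #include <linux/bpf.h> and #include <bpf/bpf_helpers.h>. */
struct ringbuf_map {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 256 * 1024);        /* 256 KB, power of 2 */
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
	__uint(max_entries, 64);                /* one slot per CPU */
	__type(key, int);
	__array(values, struct ringbuf_map);
} ringbufs SEC(".maps");
```

A BPF program would then pick an inner ring buffer with
``bpf_map_lookup_elem()``, keyed by CPU or any other sharding key, before
reserving a record.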
Key and value sizes are enforced to be zero. ``max_entries`` is used to
specify the size of the ring buffer and has to be a power of 2.

There are a number of similarities between the perf buffer
(``BPF_MAP_TYPE_PERF_EVENT_ARRAY``) and the new BPF ring buffer semantics:

- variable-length records;
- if there is no more space left in the ring buffer, reservation fails and
  there is no blocking;
- memory-mappable data area for user-space applications, for ease of
  consumption and high performance;
- epoll notifications for new incoming data;
- but still the ability to busy-poll for new data to achieve the lowest
  latency, if necessary.

BPF ringbuf provides two sets of APIs to BPF programs:

- ``bpf_ringbuf_output()`` allows *copying* data from one place to the ring
  buffer, similarly to ``bpf_perf_event_output()``;
- ``bpf_ringbuf_reserve()``/``bpf_ringbuf_commit()``/``bpf_ringbuf_discard()``
  APIs split the whole process into two steps. First, a fixed amount of space
  is reserved. If successful, a pointer to data inside the ring buffer data
  area is returned, which BPF programs can use similarly to data inside
  array/hash maps. Once ready, this piece of memory is either committed or
  discarded. Discard is similar to commit, but makes the consumer ignore the
  record.

``bpf_ringbuf_output()`` has the disadvantage of incurring an extra memory
copy, because the record has to be prepared somewhere else first. On the
other hand, it allows submitting records of a length that isn't known to the
verifier beforehand. It also closely matches ``bpf_perf_event_output()``, so
it will simplify migration significantly.

``bpf_ringbuf_reserve()`` avoids the extra copy by providing a pointer
directly into ring buffer memory. In a lot of cases records are larger than
BPF stack space allows, so many programs have had to use an extra per-CPU
array as a temporary heap for preparing a sample; ``bpf_ringbuf_reserve()``
avoids this need completely. In exchange, it only allows reserving a constant
size of memory known ahead of time, so that the verifier can check that the
BPF program can't access memory outside its reserved record space.
``bpf_ringbuf_output()``, while slightly slower due to the extra memory copy,
covers some use cases that are not suitable for ``bpf_ringbuf_reserve()``.

The difference between commit and discard is very small. Discard just marks
a record as discarded, and such records are supposed to be ignored by
consumer code. Discard is useful for some advanced use cases, such as
ensuring all-or-nothing multi-record submission, or emulating temporary
``malloc()``/``free()`` within a single BPF program invocation.

Each reserved record is tracked by the verifier through the existing
reference-tracking logic, similar to socket ref-tracking. It is thus
impossible to reserve a record but forget to submit (or discard) it.

The ``bpf_ringbuf_query()`` helper allows querying various properties of the
ring buffer. Currently 4 are supported:

- ``BPF_RB_AVAIL_DATA`` returns the amount of unconsumed data in the ring
  buffer;
- ``BPF_RB_RING_SIZE`` returns the size of the ring buffer;
- ``BPF_RB_CONS_POS``/``BPF_RB_PROD_POS`` return the current logical
  position of the consumer/producer, respectively.

The returned values are momentary snapshots of the ring buffer state and
could be off by the time the helper returns, so this should be used only for
debugging/reporting purposes or for implementing various heuristics that take
into account the highly changeable nature of some of those characteristics.

One such heuristic might involve more fine-grained control over poll/epoll
notifications about new data availability in the ring buffer. Together with
the ``BPF_RB_NO_WAKEUP``/``BPF_RB_FORCE_WAKEUP`` flags for the
output/commit/discard helpers, it allows the BPF program a high degree of
control and, e.g., more efficient batched notifications. The default
self-balancing strategy, though, should be adequate for most applications
and already works reliably and efficiently.

Design and Implementation
-------------------------

This reserve/commit schema allows a natural way for multiple producers,
either on different CPUs or even on the same CPU/in the same BPF program, to
reserve independent records and work with them without blocking other
producers. This means that if a BPF program was interrupted by another BPF
program sharing the same ring buffer, they will both get a record reserved
(provided there is enough space left) and can work with it and submit it
independently. This applies to NMI context as well, except that due to the
use of a spinlock during reservation, in NMI context
``bpf_ringbuf_reserve()`` might fail to get the lock, in which case the
reservation will fail even if the ring buffer is not full.

The ring buffer itself is internally implemented as a power-of-2 sized
circular buffer, with two logical and ever-increasing counters (which might
wrap around on 32-bit architectures; that's not a problem):

- the consumer counter shows the logical position up to which the consumer
  has consumed the data;
- the producer counter denotes the amount of data reserved by all producers.

Each time a record is reserved, the producer that "owns" the record will
successfully advance the producer counter. At that point the data is still
not yet ready to be consumed, though. Each record has an 8-byte header,
which contains the length of the reserved record, as well as two extra bits:
a busy bit to denote that the record is still being worked on, and a discard
bit, which might be set at commit time if the record is discarded. In the
latter case, the consumer is supposed to skip the record and move on to the
next one. The record header also encodes the record's relative offset from
the beginning of the ring buffer data area (in pages). This allows
``bpf_ringbuf_commit()``/``bpf_ringbuf_discard()`` to accept only the
pointer to the record itself, without also requiring a pointer to the ring
buffer: the ring buffer memory location is restored from the record metadata
header. This significantly simplifies the verifier and improves API
usability.

Producer counter increments are serialized under a spinlock, so there is
a strict ordering between reservations. Commits, on the other hand, are
completely lockless and independent. All records become available to the
consumer in the order of reservations, but only after all previous records
have already been committed. It is thus possible for slow producers to
temporarily hold off already-submitted records that were reserved later.

One interesting implementation detail, which significantly simplifies (and
thus also speeds up) both producers and consumers, is that the data area is
mapped twice, contiguously back-to-back, in virtual memory. This makes it
possible to avoid any special handling for samples that have to wrap around
at the end of the circular buffer data area, because the next page after the
last data page is the first data page again, so the sample still appears
completely contiguous in virtual memory. See the comment and a simple ASCII
diagram showing this visually in ``bpf_ringbuf_area_alloc()``.

Another feature that distinguishes BPF ringbuf from the perf ring buffer is
self-pacing notification of new data availability. The
``bpf_ringbuf_commit()`` implementation will send a notification of a new
record being available after commit only if the consumer has already caught
up right up to the record being committed. If not, the consumer still has to
catch up and thus will see the new data anyway, without needing an extra
poll notification. Benchmarks (see
tools/testing/selftests/bpf/benchs/bench_ringbufs.c) show that this allows
achieving very high throughput without having to resort to tricks like
"notify only every Nth sample", which are necessary with the perf buffer.
For extreme cases, when a BPF program wants more manual control over
notifications, the commit/discard/output helpers accept the
``BPF_RB_NO_WAKEUP`` and ``BPF_RB_FORCE_WAKEUP`` flags, which give full
control over notifications of data availability, but require extra caution
and diligence in using this API.