==============
BPF Design Q&A
==============

BPF extensibility and applicability to networking, tracing, security
in the Linux kernel and several user space implementations of the BPF
virtual machine led to a number of misunderstandings about what BPF actually is.
This short Q&A is an attempt to address that and to outline where BPF
is heading in the long term.

.. contents::
    :local:
    :depth: 3

Questions and Answers
=====================

Q: Is BPF a generic instruction set similar to x64 and arm64?
-------------------------------------------------------------
A: NO.

Q: Is BPF a generic virtual machine?
-------------------------------------
A: NO.

BPF is a generic instruction set *with* a C calling convention.
---------------------------------------------------------------

Q: Why was the C calling convention chosen?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A: Because BPF programs are designed to run in the Linux kernel,
which is written in C. Hence BPF defines an instruction set compatible
with the two most used architectures, x64 and arm64 (and takes into
consideration important quirks of other architectures), and
defines a calling convention that is compatible with the C calling
convention of the Linux kernel on those architectures.

Q: Can multiple return values be supported in the future?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: NO. BPF allows only register R0 to be used as a return value.

Q: Can more than 5 function arguments be supported in the future?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: NO. The BPF calling convention only allows registers R1-R5 to be used
as arguments. BPF is not a standalone instruction set
(unlike the x64 ISA, which allows msft, cdecl and other conventions).

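The convention described above (arguments in R1-R5, a single return value
in R0) can be modelled in plain user-space C. This is only a sketch, not
kernel code: ``struct bpf_regs``, ``call_helper`` and ``sum5`` are
hypothetical names made up for illustration.

```c
#include <stdint.h>

/* Hypothetical model of the BPF register file: R0..R10. */
struct bpf_regs {
	uint64_t r[11];
};

/* A callee sees exactly five 64-bit arguments and returns one value. */
typedef uint64_t (*helper_fn)(uint64_t r1, uint64_t r2, uint64_t r3,
			      uint64_t r4, uint64_t r5);

/* Hypothetical helper: adds its five arguments. */
static uint64_t sum5(uint64_t r1, uint64_t r2, uint64_t r3,
		     uint64_t r4, uint64_t r5)
{
	return r1 + r2 + r3 + r4 + r5;
}

/* Model of a call: arguments come from R1-R5, the result lands in R0. */
static void call_helper(struct bpf_regs *regs, helper_fn fn)
{
	regs->r[0] = fn(regs->r[1], regs->r[2], regs->r[3],
			regs->r[4], regs->r[5]);
}
```

Because the signature is an ordinary C function type, a call in this model
is just a native C call, which is the point of choosing the C calling
convention.
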
Q: Can BPF programs access instruction pointer or return address?
-----------------------------------------------------------------
A: NO.

Q: Can BPF programs access the stack pointer?
---------------------------------------------
A: NO.

Only the frame pointer (register R10) is accessible.
From the compiler's point of view, it's necessary to have a stack pointer.
For example, LLVM defines register R11 as the stack pointer in its
BPF backend, but it makes sure that generated code never uses it.

Q: Does the C calling convention diminish possible use cases?
-------------------------------------------------------------
A: YES.

BPF design forces the addition of major functionality in the form
of kernel helper functions and kernel objects like BPF maps, with
seamless interoperability between them. It lets the kernel call into
BPF programs and programs call kernel helpers with zero overhead,
as if all of them were native C code. That is particularly the case
for JITed BPF programs, which are indistinguishable from
native kernel C code.

Q: Does it mean that 'innovative' extensions to BPF code are disallowed?
------------------------------------------------------------------------
A: Soft yes.

At least for now, until the BPF core has support for
bpf-to-bpf calls, indirect calls, loops, global variables,
jump tables, read-only sections, and all other normal constructs
that C code can produce.

Q: Can loops be supported in a safe way?
----------------------------------------
A: It's not clear yet.

BPF developers are trying to find a way to
support bounded loops.

Q: What are the verifier limits?
--------------------------------
A: The only limit known to user space is BPF_MAXINSNS (4096).
It's the maximum number of instructions that an unprivileged bpf
program can have. The verifier has various internal limits,
such as the maximum number of instructions that can be explored during
program analysis. Currently, that limit is set to 1 million,
which essentially means that the largest program can consist
of 1 million NOP instructions. There is a limit to the maximum number
of subsequent branches, a limit to the number of nested bpf-to-bpf
calls, a limit to the number of verifier states per instruction,
and a limit to the number of maps used by the program.
All these limits can be hit with a sufficiently complex program.

There are also non-numerical limits that can cause the program
to be rejected. The verifier used to recognize only pointer + constant
expressions. Now it can recognize pointer + bounded_register.
bpf_map_lookup_elem(key) had a requirement that 'key' must be
a pointer to the stack. Now, 'key' can be a pointer to a map value.
The verifier is steadily getting 'smarter'. The limits are
being removed. The only way to know that the program is going to
be accepted by the verifier is to try to load it.
The bpf development process guarantees that future kernel
versions will accept all bpf programs that were accepted by
earlier versions.


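The 'pointer + bounded_register' case corresponds to C code of roughly
the following shape. This is a user-space sketch: ``lookup`` and
``TABLE_SIZE`` are hypothetical names, and whether the compiled BPF
equivalent is accepted by a given kernel can only be confirmed by
loading it.

```c
#include <stdint.h>

#define TABLE_SIZE 16 /* hypothetical table size */

/* Return table[idx], or -1 when idx is out of range. */
static int lookup(const uint8_t table[TABLE_SIZE], uint32_t idx)
{
	/* Without this check idx is unbounded, so table + idx is unsafe. */
	if (idx >= TABLE_SIZE)
		return -1;

	/* Here idx is known to lie in [0, TABLE_SIZE - 1]. */
	return table[idx];
}
```

The explicit range check is what turns an arbitrary register into a
bounded one; the access after it is a pointer plus a value with a known
upper bound.
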
Instruction level questions
---------------------------

Q: LD_ABS and LD_IND instructions vs C code
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Q: How come LD_ABS and LD_IND instructions are present in BPF whereas
C code cannot express them and has to use builtin intrinsics?

A: This is an artifact of compatibility with classic BPF. Modern
networking code in BPF performs better without them.
See 'direct packet access'.

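With direct packet access, a program reads packet bytes through ordinary
pointers after checking them against the end of the packet. The following
user-space sketch models that pattern; ``struct pkt`` and ``load_byte``
are hypothetical stand-ins, not kernel definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the packet bounds a program receives. */
struct pkt {
	const uint8_t *data;     /* first byte of the packet */
	const uint8_t *data_end; /* one past the last byte */
};

/* Read the byte at offset off, or return -1 if it lies outside the packet. */
static int load_byte(const struct pkt *p, size_t off)
{
	size_t len = (size_t)(p->data_end - p->data);

	/* Bounds check before the dereference. */
	if (off >= len)
		return -1;
	return p->data[off];
}
```

The load itself is a plain memory access that C can express directly,
which is why no intrinsic is needed for it.
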
Q: BPF instructions mapping not one-to-one to native CPU
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Q: It seems that not all BPF instructions map one-to-one to native CPU
instructions. For example, why are BPF_JNE and other compare-and-jump
instructions not CPU-like?

A: This was necessary to avoid introducing flags into the ISA, which are
impossible to make generic and efficient across CPU architectures.

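A compare-and-jump instruction carries both operands and the branch
offset, so no flags register is needed between a compare and a branch.
The sketch below models this in user-space C; ``struct jmp_insn`` and
``exec_jne`` are simplified, hypothetical names, and the offset is taken
relative to the following instruction.

```c
#include <stdint.h>

/* Hypothetical, simplified model of a conditional jump instruction. */
struct jmp_insn {
	uint8_t dst_reg;
	uint8_t src_reg;
	int16_t off;
};

/* Return the next pc after a JNE: compare and branch in one step. */
static int exec_jne(const uint64_t regs[11], const struct jmp_insn *insn,
		    int pc)
{
	if (regs[insn->dst_reg] != regs[insn->src_reg])
		return pc + 1 + insn->off;
	return pc + 1;
}
```
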
Q: Why doesn't the BPF_DIV instruction map to x64 div?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because if we picked a one-to-one relationship to x64, it would have been
more complicated to support on arm64 and other archs. Also, it
needs a div-by-zero runtime check.

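The runtime check exists because native divide instructions disagree on a
zero divisor: x64 div raises an exception, while arm64 udiv returns zero.
A sketch of the checked operation follows. The name ``div64`` is
hypothetical, and the zero result for a zero divisor follows the BPF
instruction set documentation; treat it as an assumption for any
particular kernel version.

```c
#include <stdint.h>

/* Unsigned 64-bit divide with an explicit zero-divisor check. */
static uint64_t div64(uint64_t dst, uint64_t src)
{
	/* A zero divisor must never reach the native divide instruction. */
	if (src == 0)
		return 0;
	return dst / src;
}
```
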
Q: Why is there no BPF_SDIV for signed divide operation?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because it would be rarely used. LLVM errors out in such a case and
prints a suggestion to use unsigned divide instead.

Q: Why does BPF have an implicit prologue and epilogue?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because architectures like sparc have register windows, and in general
there are enough subtle differences between architectures that naively
storing the return address on the stack won't work. Another reason is that
BPF has to be safe from division by zero (and the legacy exception path
of the LD_ABS insn). Those instructions need to invoke the epilogue and
return implicitly.

Q: Why were BPF_JLT and BPF_JLE instructions not introduced in the beginning?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because classic BPF didn't have them and BPF authors felt that a compiler
workaround would be acceptable. It turned out that programs lose performance
due to the lack of these compare instructions, and they were added.
These two instructions are a perfect example of what kind of new BPF
instructions are acceptable and can be added in the future.
These two already had equivalent instructions in native CPUs.
New instructions that don't have a one-to-one mapping to HW instructions
will not be accepted.

Q: BPF 32-bit subregister requirements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Q: BPF 32-bit subregisters have a requirement to zero the upper 32 bits of BPF
registers, which makes BPF an inefficient virtual machine for 32-bit
CPU architectures and 32-bit HW accelerators. Can true 32-bit registers
be added to BPF in the future?

A: NO.

But some optimizations on zero-ing the upper 32 bits for BPF registers are
available, and can be leveraged to improve the performance of JITed BPF
programs for 32-bit architectures.

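The zeroing requirement can be stated as a small user-space model: a
32-bit operation computes on the low halves and the 64-bit destination
ends up with its upper half cleared. ``alu32_add`` is a hypothetical name
used only for this sketch.

```c
#include <stdint.h>

/* Model of a 32-bit add writing to a 64-bit BPF register. */
static uint64_t alu32_add(uint64_t dst, uint64_t src)
{
	/* Compute in 32 bits; widening the result zero-extends it. */
	uint32_t lo = (uint32_t)dst + (uint32_t)src;

	return (uint64_t)lo;
}
```

On a 32-bit CPU the 64-bit register is typically held as two machine
words, so the zeroing of the upper word is the extra work the question
refers to.
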
Starting with version 7, LLVM is able to generate instructions that operate
on 32-bit subregisters, provided the option -mattr=+alu32 is passed when
compiling a program. Furthermore, the verifier can now mark the
instructions for which zero-ing the upper bits of the destination register
is required, and insert an explicit zero-extension (zext) instruction
(a mov32 variant). This means that for architectures without zext hardware
support, the JIT back-ends do not need to clear the upper bits for
subregisters written by alu32 instructions or narrow loads. Instead, the
back-ends simply need to support code generation for that mov32 variant,
and to override bpf_jit_needs_zext() to make it return "true" (in order to
enable zext insertion in the verifier).

Note that it is possible for a JIT back-end to have partial hardware
support for zext. In that case, if verifier zext insertion is enabled,
it could lead to the insertion of unnecessary zext instructions. Such
instructions could be removed by creating a simple peephole inside the JIT
back-end: if one instruction has hardware support for zext and if the next
instruction is an explicit zext, then the latter can be skipped when doing
the code generation.

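The peephole described above can be sketched as a single pass over the
instruction stream. The record layout and the names ``struct sketch_insn``
and ``count_emitted`` are hypothetical; a real JIT back-end would derive
both properties from the instruction encoding.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified instruction record used only for this sketch. */
struct sketch_insn {
	bool hw_zext;       /* native code for this insn clears the upper bits */
	bool explicit_zext; /* this insn is a verifier-inserted zext */
};

/* Count the instructions the JIT would actually emit. */
static size_t count_emitted(const struct sketch_insn *insns, size_t n)
{
	size_t emitted = 0;

	for (size_t i = 0; i < n; i++) {
		/* Skip an explicit zext directly after a hw-zext insn. */
		if (i > 0 && insns[i].explicit_zext && insns[i - 1].hw_zext)
			continue;
		emitted++;
	}
	return emitted;
}
```

Only the zext that immediately follows an instruction with hardware
zero-extension is dropped; any other explicit zext is still generated.
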
Q: Does BPF have a stable ABI?
------------------------------
A: YES. BPF instructions, arguments to BPF programs, the set of helper
functions and their arguments, and recognized return codes are all part
of the ABI. However, there is one specific exception for tracing programs
that use helpers like bpf_probe_read() to walk kernel internal
data structures and are compiled with kernel internal headers. Both of these
kernel internals are subject to change and can break with newer kernels
such that the program needs to be adapted accordingly.

Q: Are tracepoints part of the stable ABI?
------------------------------------------
A: NO. Tracepoints are tied to internal implementation details, hence they are
subject to change and can break with newer kernels. BPF programs need to change
accordingly when this happens.

Q: Are places where kprobes can attach part of the stable ABI?
--------------------------------------------------------------
A: NO. The places to which kprobes can attach are internal implementation
details, which means that they are subject to change and can break with
newer kernels. BPF programs need to change accordingly when this happens.

Q: How much stack space does a BPF program use?
-----------------------------------------------
A: Currently all program types are limited to 512 bytes of stack
space, but the verifier computes the actual amount of stack used,
and both the interpreter and most JITed code consume only the necessary amount.

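Stack slots are addressed at negative offsets from the frame pointer
(R10). The user-space sketch below models an 8-byte store and load inside
a 512-byte frame; all names are hypothetical and the bounds check is only
a simplified illustration of the limit.

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_STACK_SIZE 512 /* the per-program limit quoted above */

/* Return 1 if an 8-byte access at fp + off stays inside the frame. */
static int stack_off_ok(int off)
{
	return off >= -SKETCH_STACK_SIZE && off <= -8;
}

/* Store val at fp + off; fp models R10 and points at the end of the stack. */
static int stack_store64(uint8_t stack[SKETCH_STACK_SIZE], int off,
			 uint64_t val)
{
	uint8_t *fp = stack + SKETCH_STACK_SIZE;

	if (!stack_off_ok(off))
		return -1;
	memcpy(fp + off, &val, sizeof(val));
	return 0;
}

/* Load the 8-byte value stored at fp + off into *val. */
static int stack_load64(const uint8_t stack[SKETCH_STACK_SIZE], int off,
			uint64_t *val)
{
	const uint8_t *fp = stack + SKETCH_STACK_SIZE;

	if (!stack_off_ok(off))
		return -1;
	memcpy(val, fp + off, sizeof(*val));
	return 0;
}
```
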
Q: Can BPF be offloaded to HW?
------------------------------
A: YES. BPF HW offload is supported by the NFP driver.

Q: Does the classic BPF interpreter still exist?
------------------------------------------------
A: NO. Classic BPF programs are converted into extended BPF instructions.

Q: Can BPF call arbitrary kernel functions?
-------------------------------------------
A: NO. BPF programs can only call a set of helper functions which
is defined for every program type.

Q: Can BPF overwrite arbitrary kernel memory?
---------------------------------------------
A: NO.

Tracing bpf programs can *read* arbitrary memory with bpf_probe_read()
and bpf_probe_read_str() helpers. Networking programs cannot read
arbitrary memory, since they don't have access to these helpers.
Programs can never read or write arbitrary memory directly.

Q: Can BPF overwrite arbitrary user memory?
-------------------------------------------
A: Sort-of.

Tracing BPF programs can overwrite the user memory
of the current task with bpf_probe_write_user(). Every time such a
program is loaded, the kernel prints a warning message, so
this helper is only useful for experiments and prototypes.
Tracing BPF programs are root only.

Q: New functionality via kernel modules?
----------------------------------------
Q: Can BPF functionality such as new program or map types, new
helpers, etc. be added from kernel module code?

A: NO.

Q: Directly calling kernel function is an ABI?
----------------------------------------------
Q: Some kernel functions (e.g. tcp_slow_start) can be called
by BPF programs. Do these kernel functions become an ABI?

A: NO.

The kernel function prototypes will change and the bpf programs will be
rejected by the verifier. Also, for example, some of the bpf-callable
kernel functions are already used by other kernel tcp
cc (congestion-control) implementations. If any of these kernel
functions changes, both the in-tree and out-of-tree kernel tcp cc
implementations have to be changed. The same goes for the bpf
programs: they have to be adjusted accordingly.

Q: Attaching to arbitrary kernel functions is an ABI?
-----------------------------------------------------
Q: BPF programs can be attached to many kernel functions. Do these
kernel functions become part of the ABI?

A: NO.

The kernel function prototypes will change, and BPF programs attaching to
them will need to change. BPF compile-once-run-everywhere (CO-RE)
should be used in order to make it easier to adapt your BPF programs to
different versions of the kernel.

Q: Marking a function with BTF_ID makes that function an ABI?
-------------------------------------------------------------
A: NO.

The BTF_ID macro does not cause a function to become part of the ABI
any more than does the EXPORT_SYMBOL_GPL macro.