===================
Classic BPF vs eBPF
===================

eBPF is designed to be JITed with a one-to-one mapping to native instructions.
This also opens up the possibility for GCC/LLVM compilers to generate optimized
eBPF code through an eBPF backend that performs almost as fast as natively
compiled code.

Some core changes of the eBPF format from classic BPF:

- Number of registers increases from 2 to 10:

  The old format had two registers, A and X, and a hidden frame pointer. The
  new layout extends this to 10 internal registers and a read-only frame
  pointer. Since 64-bit CPUs pass function arguments in registers, the number
  of arguments from an eBPF program to an in-kernel function is restricted to
  5, and one register is used to accept the return value from an in-kernel
  function. Natively, x86_64 passes the first 6 arguments in registers, and
  aarch64/sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6
  callee saved registers, and aarch64/sparcv9/mips64 have 11 or more callee
  saved registers.

  Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64,
  etc., and the eBPF calling convention maps directly to the ABIs used by the
  kernel on 64-bit architectures.

  On 32-bit architectures the JIT may map programs that use only 32-bit
  arithmetic and leave more complex programs to the interpreter.

  R0 - R5 are scratch registers, and an eBPF program needs to spill/fill them
  if necessary across calls. Note that there is only one eBPF program (== one
  eBPF main routine) and it cannot call other eBPF functions; it can only
  call predefined in-kernel functions.

- Register width increases from 32-bit to 64-bit:

  Still, the semantics of the original 32-bit ALU operations are preserved
  via 32-bit subregisters. All eBPF registers are 64-bit with 32-bit lower
  subregisters that zero-extend into 64-bit when they are written to.
  That behavior maps directly to the x86_64 and arm64 subregister definition,
  but makes other JITs more difficult.

  32-bit architectures run 64-bit eBPF programs via the interpreter.
  Their JITs may convert BPF programs that only use 32-bit subregisters into
  the native instruction set and let the rest be interpreted.

  Operation is 64-bit because on 64-bit architectures pointers are also
  64-bit wide, and we want to pass 64-bit values in and out of kernel
  functions. 32-bit eBPF registers would otherwise require defining a
  register-pair ABI, so a direct eBPF register to HW register mapping would
  not be possible, and the JIT would need to do combine/split/move operations
  for every register in and out of the function, which is complex, bug prone
  and slow. Another reason is the use of atomic 64-bit counters.

- Conditional jt/jf targets replaced with jt/fall-through:

  While the original design has constructs such as ``if (cond) jump_true;
  else jump_false;``, they are replaced with alternative constructs like
  ``if (cond) jump_true; /* else fall-through */``.

- Introduces bpf_call insn and register passing convention for zero overhead
  calls from/to other kernel functions:

  Before an in-kernel function call, the eBPF program needs to place function
  arguments into the R1 to R5 registers to satisfy the calling convention;
  the interpreter then takes them from the registers and passes them to the
  in-kernel function. If the R1 - R5 registers are mapped to CPU registers
  that are used for argument passing on a given architecture, the JIT compiler
  doesn't need to emit extra moves. Function arguments will be in the correct
  registers and the BPF_CALL instruction will be JITed as a single 'call' HW
  instruction. This calling convention was picked to cover common call
  situations without a performance penalty.

  After an in-kernel function call, R1 - R5 are reset to unreadable and R0 has
  the return value of the function. Since R6 - R9 are callee saved, their
  state is preserved across the call.

  For example, consider three C functions::

    u64 f1() { return (*_f2)(1); }
    u64 f2(u64 a) { return f3(a + 1, a); }
    u64 f3(u64 a, u64 b) { return a - b; }

  GCC can compile f1, f3 into x86_64::

    f1:
        movl $1, %edi
        movq _f2(%rip), %rax
        jmp  *%rax
    f3:
        movq %rdi, %rax
        subq %rsi, %rax
        ret

  Function f2 in eBPF may look like::

    f2:
        bpf_mov R2, R1
        bpf_add R1, 1
        bpf_call f3
        bpf_exit

  If f2 is JITed and the pointer is stored to ``_f2``, the calls f1 -> f2 -> f3
  and returns will be seamless. Without JIT, the __bpf_prog_run() interpreter
  needs to be used to call into f2.

  For practical reasons all eBPF programs have only one argument 'ctx', which
  is already placed into R1 (e.g. on __bpf_prog_run() startup), and the
  programs can call kernel functions with up to 5 arguments. Calls with 6 or
  more arguments are currently not supported, but these restrictions can be
  lifted if necessary in the future.

  On 64-bit architectures all registers map to HW registers one to one. For
  example, the x86_64 JIT compiler can map them as ...

  ::

    R0 - rax
    R1 - rdi
    R2 - rsi
    R3 - rdx
    R4 - rcx
    R5 - r8
    R6 - rbx
    R7 - r13
    R8 - r14
    R9 - r15
    R10 - rbp

  ... since the x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument
  passing and rbx, r12 - r15 are callee saved.

  Then the following eBPF pseudo-program::

    bpf_mov R6, R1 /* save ctx */
    bpf_mov R2, 2
    bpf_mov R3, 3
    bpf_mov R4, 4
    bpf_mov R5, 5
    bpf_call foo
    bpf_mov R7, R0 /* save foo() return value */
    bpf_mov R1, R6 /* restore ctx for next call */
    bpf_mov R2, 6
    bpf_mov R3, 7
    bpf_mov R4, 8
    bpf_mov R5, 9
    bpf_call bar
    bpf_add R0, R7
    bpf_exit

  After JIT to x86_64, this may look like::

    push %rbp
    mov %rsp,%rbp
    sub $0x228,%rsp
    mov %rbx,-0x228(%rbp)
    mov %r13,-0x220(%rbp)
    mov %rdi,%rbx
    mov $0x2,%esi
    mov $0x3,%edx
    mov $0x4,%ecx
    mov $0x5,%r8d
    callq foo
    mov %rax,%r13
    mov %rbx,%rdi
    mov $0x6,%esi
    mov $0x7,%edx
    mov $0x8,%ecx
    mov $0x9,%r8d
    callq bar
    add %r13,%rax
    mov -0x228(%rbp),%rbx
    mov -0x220(%rbp),%r13
    leaveq
    retq

  Which is in this example equivalent in C to::

    u64 bpf_filter(u64 ctx)
    {
        return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9);
    }

  In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64
  arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in the proper
  registers and place their return value into ``%rax``, which is R0 in eBPF.
  Prologue and epilogue are emitted by the JIT and are implicit in the
  interpreter. R0-R5 are scratch registers, so the eBPF program needs to
  preserve them across calls as defined by the calling convention.

  For example the following program is invalid::

    bpf_mov R1, 1
    bpf_call foo
    bpf_mov R0, R1
    bpf_exit

  After the call the registers R1-R5 contain junk values and cannot be read.
  An in-kernel verifier (see verifier.rst) is used to validate eBPF programs.

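The scratch-register rule can be illustrated with a toy checker. This is only
a sketch and not the kernel verifier: ``toy_insn``, ``toy_check()`` and the
``TOY_*`` opcodes are hypothetical names invented here, and only straight-line
code is handled. It tracks whether each register is readable, assuming that R1
(ctx) and R10 are readable on entry, that a call makes R1 - R5 unreadable and
defines R0, and that reading an unreadable register is an error.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy opcodes; these are not real eBPF encodings. */
enum toy_op { TOY_MOV_IMM, TOY_MOV_REG, TOY_ADD_REG, TOY_CALL, TOY_EXIT };

struct toy_insn {
    enum toy_op op;
    int dst;   /* destination register index, 0..10 */
    int src;   /* source register index, used by the *_REG ops */
};

/*
 * Returns true if no instruction reads an unreadable register and the
 * program ends in TOY_EXIT with R0 defined. Straight-line code only.
 */
static bool toy_check(const struct toy_insn *prog, size_t len)
{
    bool readable[11] = { false };
    size_t i;
    int r;

    readable[1] = true;   /* R1 holds ctx on entry */
    readable[10] = true;  /* R10 is the read-only frame pointer */

    for (i = 0; i < len; i++) {
        const struct toy_insn *in = &prog[i];

        switch (in->op) {
        case TOY_MOV_IMM:
            readable[in->dst] = true;
            break;
        case TOY_MOV_REG:
            if (!readable[in->src])
                return false;
            readable[in->dst] = true;
            break;
        case TOY_ADD_REG:
            if (!readable[in->src] || !readable[in->dst])
                return false;
            break;
        case TOY_CALL:
            for (r = 1; r <= 5; r++)
                readable[r] = false;  /* argument registers are clobbered */
            readable[0] = true;       /* R0 now holds the return value */
            break;
        case TOY_EXIT:
            return readable[0];       /* R0 must hold a return value */
        }
    }
    return false;  /* fell off the end without an exit */
}
```

With this model, the invalid four-instruction program above is rejected at
``bpf_mov R0, R1``, while the foo()/bar() pseudo-program, which saves ctx in
R6 and the first return value in R7, is accepted.
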
Also in the new design, eBPF is limited to 4096 insns, which means that any
program will terminate quickly and will only call a fixed number of kernel
functions. Both original BPF and eBPF use two-operand instructions,
which helps to do a one-to-one mapping between eBPF insn and x86 insn during
JIT.

The input context pointer for invoking the interpreter function is generic;
its content is defined by a specific use case. For seccomp, register R1 points
to seccomp_data; for converted BPF filters, R1 points to a skb.

A program that is translated internally consists of the following elements::

  op:16, jt:8, jf:8, k:32    ==>    op:8, dst_reg:4, src_reg:4, off:16, imm:32

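As a rough illustration of the two layouts, the field widths can be written
down as C structs. The struct names ``classic_insn`` and ``ebpf_insn`` are
made up for this sketch; the field names and the signedness of ``off`` and
``imm`` are assumptions modeled on the kernel's UAPI ``struct sock_filter`` and
``struct bpf_insn``. Bit-fields on ``uint8_t`` are implementation-defined in
ISO C, but GCC and Clang accept them.

```c
#include <stdint.h>

/* Classic BPF instruction: op:16, jt:8, jf:8, k:32 */
struct classic_insn {
    uint16_t code;  /* opcode */
    uint8_t  jt;    /* jump offset if the condition is true */
    uint8_t  jf;    /* jump offset if the condition is false */
    uint32_t k;     /* generic immediate field */
};

/* eBPF instruction: op:8, dst_reg:4, src_reg:4, off:16, imm:32 */
struct ebpf_insn {
    uint8_t code;        /* opcode */
    uint8_t dst_reg:4;   /* destination register */
    uint8_t src_reg:4;   /* source register */
    int16_t off;         /* signed offset */
    int32_t imm;         /* signed immediate constant */
};

/* Both formats occupy 8 bytes per instruction (C11 static assertions). */
_Static_assert(sizeof(struct classic_insn) == 8, "classic insn is 8 bytes");
_Static_assert(sizeof(struct ebpf_insn) == 8, "eBPF insn is 8 bytes");
```

The two 8-bit jump targets of the classic format are what make room for the
two 4-bit register fields and the single 16-bit offset in eBPF.
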
So far 87 eBPF instructions have been implemented. The 8-bit 'op' opcode field
has room for new instructions. Some of them may use a 16/24/32 byte encoding.
New instructions must be a multiple of 8 bytes to preserve backward
compatibility.

eBPF is a general purpose RISC instruction set. Not every register and
every instruction is used during translation from original BPF to eBPF.
For example, socket filters do not use the ``exclusive add`` instruction, but
tracing filters may use it to maintain counters of events. Register R9
is not used by socket filters either, but more complex filters may run
out of registers and would have to resort to spill/fill to the stack.

eBPF can be used as a generic assembler for last step performance
optimizations; socket filters and seccomp use it as an assembler. Tracing
filters may use it as an assembler to generate code from the kernel. In-kernel
usage may not be bounded by security considerations, since generated eBPF code
may be optimizing an internal code path and not be exposed to user space.
Safety of eBPF can come from the verifier (see verifier.rst). In such use
cases as described, it may be used as a safe instruction set.

Just like the original BPF, eBPF runs within a controlled environment,
is deterministic and the kernel can easily prove that. The safety of the program
can be determined in two steps: the first step does a depth-first search to
disallow loops and performs other CFG validation; the second step starts from
the first insn and descends all possible paths. It simulates execution of every
insn and observes the state change of registers and stack.

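The first of the two steps can be sketched as a depth-first search that
rejects any back-edge in the control flow graph. The code below is a toy model
and not the kernel's implementation: ``cfg_insn``, ``cfg_visit()`` and
``cfg_is_loop_free()`` are hypothetical names, and the sketch assumes that a
jump targets the instruction at index ``i + 1 + off``. It uses recursion for
brevity, which is tolerable here only because programs are small.

```c
#include <stdbool.h>

enum cfg_kind { CFG_PLAIN, CFG_COND_JMP, CFG_JA, CFG_EXIT };

struct cfg_insn {
    enum cfg_kind kind;
    int off;  /* jump offset relative to the next instruction */
};

enum { UNSEEN = 0, ON_PATH = 1, DONE = 2 };

/* Visit instruction i; return false on a back-edge or out-of-range target. */
static bool cfg_visit(const struct cfg_insn *prog, int len, int i, char *state)
{
    int succ[2], n = 0, k;

    if (i < 0 || i >= len)
        return false;          /* jump or fall-through leaves the program */
    if (state[i] == ON_PATH)
        return false;          /* back-edge: the program contains a loop */
    if (state[i] == DONE)
        return true;           /* already explored via another path */

    state[i] = ON_PATH;
    if (prog[i].kind == CFG_PLAIN || prog[i].kind == CFG_COND_JMP)
        succ[n++] = i + 1;                 /* fall-through edge */
    if (prog[i].kind == CFG_COND_JMP || prog[i].kind == CFG_JA)
        succ[n++] = i + 1 + prog[i].off;   /* branch edge */

    for (k = 0; k < n; k++)
        if (!cfg_visit(prog, len, succ[k], state))
            return false;

    state[i] = DONE;
    return true;
}

/* state must point to len zero-initialized bytes. */
static bool cfg_is_loop_free(const struct cfg_insn *prog, int len, char *state)
{
    return len > 0 && cfg_visit(prog, len, 0, state);
}
```

A forward conditional jump followed by an exit passes this check, while an
unconditional jump with a negative offset back to an instruction on the
current path is rejected.
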
opcode encoding
===============

eBPF reuses most of the opcode encoding from classic BPF to simplify
conversion of classic BPF to eBPF.

For arithmetic and jump instructions the 8-bit 'code' field is divided into three
parts::

  +----------------+--------+--------------------+
  |   4 bits       |  1 bit |   3 bits           |
  | operation code | source | instruction class  |
  +----------------+--------+--------------------+
  (MSB)                                      (LSB)

The three LSB bits store the instruction class, which is one of:

  ===================     ===============
  Classic BPF classes     eBPF classes
  ===================     ===============
  BPF_LD    0x00          BPF_LD    0x00
  BPF_LDX   0x01          BPF_LDX   0x01
  BPF_ST    0x02          BPF_ST    0x02
  BPF_STX   0x03          BPF_STX   0x03
  BPF_ALU   0x04          BPF_ALU   0x04
  BPF_JMP   0x05          BPF_JMP   0x05
  BPF_RET   0x06          BPF_JMP32 0x06
  BPF_MISC  0x07          BPF_ALU64 0x07
  ===================     ===============

The 4th bit encodes the source operand ...

    ::

        BPF_K     0x00
        BPF_X     0x08

 * in classic BPF, this means::

        BPF_SRC(code) == BPF_X - use register X as source operand
        BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand

 * in eBPF, this means::

        BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand
        BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand

... and four MSB bits store operation code.

If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of::

  BPF_ADD   0x00
  BPF_SUB   0x10
  BPF_MUL   0x20
  BPF_DIV   0x30
  BPF_OR    0x40
  BPF_AND   0x50
  BPF_LSH   0x60
  BPF_RSH   0x70
  BPF_NEG   0x80
  BPF_MOD   0x90
  BPF_XOR   0xa0
  BPF_MOV   0xb0  /* eBPF only: mov reg to reg */
  BPF_ARSH  0xc0  /* eBPF only: sign extending shift right */
  BPF_END   0xd0  /* eBPF only: endianness conversion */

If BPF_CLASS(code) == BPF_JMP or BPF_JMP32 [ in eBPF ], BPF_OP(code) is one of::

  BPF_JA    0x00  /* BPF_JMP only */
  BPF_JEQ   0x10
  BPF_JGT   0x20
  BPF_JGE   0x30
  BPF_JSET  0x40
  BPF_JNE   0x50  /* eBPF only: jump != */
  BPF_JSGT  0x60  /* eBPF only: signed '>' */
  BPF_JSGE  0x70  /* eBPF only: signed '>=' */
  BPF_CALL  0x80  /* eBPF BPF_JMP only: function call */
  BPF_EXIT  0x90  /* eBPF BPF_JMP only: function return */
  BPF_JLT   0xa0  /* eBPF only: unsigned '<' */
  BPF_JLE   0xb0  /* eBPF only: unsigned '<=' */
  BPF_JSLT  0xc0  /* eBPF only: signed '<' */
  BPF_JSLE  0xd0  /* eBPF only: signed '<=' */

So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF
and eBPF. There are only two registers in classic BPF, so it means A += X.
In eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly,
BPF_XOR | BPF_K | BPF_ALU means A ^= imm32 in classic BPF and the analogous
dst_reg = (u32) dst_reg ^ (u32) imm32 in eBPF.

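The composition of the 'code' byte can be checked with a few lines of C. The
constant values are copied from the tables in this section; the extractor
masks are written out from the 4/1/3 bit split and are assumed to match the
helper macros in the kernel's UAPI headers. The names ``add32_reg`` and
``xor32_imm`` are made up for this sketch.

```c
#include <stdint.h>

/* Field values copied from the tables in this section. */
#define BPF_ALU    0x04
#define BPF_K      0x00
#define BPF_X      0x08
#define BPF_ADD    0x00
#define BPF_XOR    0xa0

/* Field extractors following the 4-bit / 1-bit / 3-bit split. */
#define BPF_CLASS(code) ((code) & 0x07)
#define BPF_SRC(code)   ((code) & 0x08)
#define BPF_OP(code)    ((code) & 0xf0)

/* 32-bit add with a register source: 0x00 | 0x08 | 0x04 == 0x0c */
static const uint8_t add32_reg = BPF_ADD | BPF_X | BPF_ALU;

/* 32-bit xor with an immediate source: 0xa0 | 0x00 | 0x04 == 0xa4 */
static const uint8_t xor32_imm = BPF_XOR | BPF_K | BPF_ALU;
```

Decoding works the other way around: ``BPF_CLASS(add32_reg)`` yields
``BPF_ALU``, ``BPF_SRC(add32_reg)`` yields ``BPF_X`` and ``BPF_OP(xor32_imm)``
yields ``BPF_XOR``.
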
Classic BPF uses the BPF_MISC class to represent A = X and X = A moves.
eBPF uses the BPF_MOV | BPF_X | BPF_ALU code instead. Since there are no
BPF_MISC operations in eBPF, class 7 is used as BPF_ALU64 to mean
exactly the same operations as BPF_ALU, but with 64-bit wide operands
instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.:
dst_reg = dst_reg + src_reg

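The difference between the two classes can be modeled in plain C. This is a
sketch of the semantics described in this document, not interpreter code;
``alu32_add()`` and ``alu64_add()`` are hypothetical helper names. The 32-bit
variant truncates both operands and zero-extends the result into the 64-bit
register, as stated for the 32-bit subregisters.

```c
#include <stdint.h>

/* BPF_ADD | BPF_X | BPF_ALU: 32-bit add, result zero-extended to 64 bits. */
static uint64_t alu32_add(uint64_t dst, uint64_t src)
{
    return (uint32_t)((uint32_t)dst + (uint32_t)src);
}

/* BPF_ADD | BPF_X | BPF_ALU64: full 64-bit add. */
static uint64_t alu64_add(uint64_t dst, uint64_t src)
{
    return dst + src;
}
```

For instance, adding 1 to 0xffffffff wraps to 0 in the 32-bit form but yields
0x100000000 in the 64-bit form, and any upper 32 bits present in ``dst`` are
cleared by the 32-bit form.
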
Classic BPF wastes the whole BPF_RET class to represent a single ``ret``
operation. Classic BPF_RET | BPF_K means copy imm32 into the return register
and perform function exit. eBPF is modeled to match the CPU, so
BPF_JMP | BPF_EXIT in eBPF means function exit only. The eBPF program needs to
store the return value into register R0 before doing a BPF_EXIT. Class 6 in
eBPF is used as BPF_JMP32 to mean exactly the same operations as BPF_JMP, but
with 32-bit wide operands for the comparisons instead.

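As an illustration, a classic ``ret k`` can be expanded into the two eBPF
instructions described above: a move of the immediate into R0 followed by an
exit. This is a minimal sketch under stated assumptions and not the kernel's
converter: ``ebpf_insn_sketch`` and ``convert_ret_k()`` are hypothetical names,
the struct mirrors the ``op:8, dst_reg:4, src_reg:4, off:16, imm:32`` layout,
and a 32-bit BPF_ALU move is assumed to be the right way to load R0.

```c
#include <stdint.h>

struct ebpf_insn_sketch {
    uint8_t code;
    uint8_t dst_reg:4;
    uint8_t src_reg:4;
    int16_t off;
    int32_t imm;
};

/* Values copied from the tables in this section. */
#define BPF_ALU   0x04
#define BPF_JMP   0x05
#define BPF_K     0x00
#define BPF_MOV   0xb0
#define BPF_EXIT  0x90

/* Expand classic "ret k" into "R0 = k; exit". Writes two instructions. */
static void convert_ret_k(uint32_t k, struct ebpf_insn_sketch out[2])
{
    /* R0 = (u32) k; the cast keeps the bit pattern on common compilers */
    out[0] = (struct ebpf_insn_sketch){
        .code = BPF_ALU | BPF_MOV | BPF_K,
        .dst_reg = 0,
        .imm = (int32_t)k,
    };
    /* function exit; the return value is taken from R0 */
    out[1] = (struct ebpf_insn_sketch){ .code = BPF_JMP | BPF_EXIT };
}
```

The resulting opcode bytes are 0xb4 for the move and 0x95 for the exit.
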
For load and store instructions the 8-bit 'code' field is divided as::

  +--------+--------+-------------------+
  | 3 bits | 2 bits |   3 bits          |
  |  mode  |  size  | instruction class |
  +--------+--------+-------------------+
  (MSB)                             (LSB)

Size modifier is one of ...

::

  BPF_W   0x00    /* word */
  BPF_H   0x08    /* half word */
  BPF_B   0x10    /* byte */
  BPF_DW  0x18    /* eBPF only, double word */

... which encodes size of load/store operation::

 B  - 1 byte
 H  - 2 byte
 W  - 4 byte
 DW - 8 byte (eBPF only)

Mode modifier is one of::

  BPF_IMM     0x00  /* used for 32-bit mov in classic BPF and 64-bit in eBPF */
  BPF_ABS     0x20
  BPF_IND     0x40
  BPF_MEM     0x60
  BPF_LEN     0x80  /* classic BPF only, reserved in eBPF */
  BPF_MSH     0xa0  /* classic BPF only, reserved in eBPF */
  BPF_ATOMIC  0xc0  /* eBPF only, atomic operations */
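
The load/store split can be decoded in the same way as the ALU fields. The
constants below are copied from the tables above; the ``BPF_SIZE()`` and
``BPF_MODE()`` masks are written out from the 3/2/3 bit layout and are assumed
to match the kernel's UAPI helper macros, and ``bpf_size_to_bytes()`` is a
hypothetical helper for this sketch.

```c
#include <stdint.h>

/* Values copied from the tables in this section. */
#define BPF_LDX  0x01
#define BPF_STX  0x03
#define BPF_W    0x00
#define BPF_H    0x08
#define BPF_B    0x10
#define BPF_DW   0x18
#define BPF_MEM  0x60

/* Field extractors following the 3-bit / 2-bit / 3-bit split. */
#define BPF_SIZE(code) ((code) & 0x18)
#define BPF_MODE(code) ((code) & 0xe0)

/* Width in bytes of the access encoded in a load/store 'code' byte. */
static int bpf_size_to_bytes(uint8_t code)
{
    switch (BPF_SIZE(code)) {
    case BPF_B:  return 1;
    case BPF_H:  return 2;
    case BPF_W:  return 4;
    case BPF_DW: return 8;
    }
    return 0;  /* not reached: the four cases cover all size bit patterns */
}
```

For example, ``BPF_LDX | BPF_MEM | BPF_DW`` evaluates to 0x79 and decodes to an
8-byte access in BPF_MEM mode.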
0367 
0368 Mode modifier is one of::
0369 
0370   BPF_IMM     0x00  /* used for 32-bit mov in classic BPF and 64-bit in eBPF */
0371   BPF_ABS     0x20
0372   BPF_IND     0x40
0373   BPF_MEM     0x60
0374   BPF_LEN     0x80  /* classic BPF only, reserved in eBPF */
0375   BPF_MSH     0xa0  /* classic BPF only, reserved in eBPF */
0376   BPF_ATOMIC  0xc0  /* eBPF only, atomic operations */