0001 .. SPDX-License-Identifier: GPL-2.0
0002
0003 .. _networking-filter:
0004
0005 =======================================================
0006 Linux Socket Filtering aka Berkeley Packet Filter (BPF)
0007 =======================================================
0008
0009 Notice
0010 ------
0011
This file used to document the eBPF format and mechanisms even when not
related to socket filtering. See ../bpf/index.rst for more details
on eBPF.
0015
0016 Introduction
0017 ------------
0018
Linux Socket Filtering (LSF) is derived from the Berkeley Packet Filter.
Although there are some distinct differences between BSD and Linux
kernel filtering, when we speak of BPF or LSF in the Linux context, we
mean the very same filtering mechanism in the Linux kernel.
0023
0024 BPF allows a user-space program to attach a filter onto any socket and
0025 allow or disallow certain types of data to come through the socket. LSF
0026 follows exactly the same filter code structure as BSD's BPF, so referring
0027 to the BSD bpf.4 manpage is very helpful in creating filters.
0028
0029 On Linux, BPF is much simpler than on BSD. One does not have to worry
0030 about devices or anything like that. You simply create your filter code,
0031 send it to the kernel via the SO_ATTACH_FILTER option and if your filter
0032 code passes the kernel check on it, you then immediately begin filtering
0033 data on that socket.
0034
You can also detach filters from your socket via the SO_DETACH_FILTER
option. This will probably not be used much, since closing a socket that
has a filter attached removes the filter automatically. The other, less
common case is attaching a different filter to a socket that already has
one running: the kernel removes the old filter and puts the new one in
its place, provided the new filter passes the checks. If it fails them,
the old filter remains on that socket.
0043
The SO_LOCK_FILTER option allows locking the filter attached to a socket.
Once set, a filter cannot be removed or changed. This allows one process to
set up a socket, attach a filter, lock it, then drop privileges and be
assured that the filter will be kept until the socket is closed.
0048
0049 The biggest user of this construct might be libpcap. Issuing a high-level
0050 filter command like `tcpdump -i em1 port 22` passes through the libpcap
0051 internal compiler that generates a structure that can eventually be loaded
0052 via SO_ATTACH_FILTER to the kernel. `tcpdump -i em1 port 22 -ddd`
0053 displays what is being placed into this structure.
0054
Although we were only speaking about sockets here, BPF in Linux is used
in many more places. There's xt_bpf for netfilter, cls_bpf in the kernel
qdisc layer, SECCOMP-BPF (SECure COMPuting [1]_), and lots of other places
such as the team driver, PTP code, etc., where BPF is being used.
0059
0060 .. [1] Documentation/userspace-api/seccomp_filter.rst
0061
0062 Original BPF paper:
0063
Steven McCanne and Van Jacobson. 1993. The BSD packet filter: a new
architecture for user-level packet capture. In Proceedings of the
USENIX Winter 1993 Conference (USENIX'93). USENIX Association, Berkeley,
CA, USA, 2-2. [http://www.tcpdump.org/papers/bpf-usenix93.pdf]
0069
0070 Structure
0071 ---------
0072
0073 User space applications include <linux/filter.h> which contains the
0074 following relevant structures::
0075
    struct sock_filter {    /* Filter block */
        __u16   code;       /* Actual filter code */
        __u8    jt;         /* Jump true */
        __u8    jf;         /* Jump false */
        __u32   k;          /* Generic multiuse field */
    };
0082
Such structures are assembled into an array of 4-tuples, each containing
a code, jt, jf and k value. jt and jf are jump offsets and k is a generic
value to be used for the provided code::
0086
    struct sock_fprog {                 /* Required for SO_ATTACH_FILTER. */
        unsigned short      len;        /* Number of filter blocks */
        struct sock_filter __user *filter;
    };
0091
For socket filtering, a pointer to this structure (as shown in the
following example) is passed to the kernel through setsockopt(2).
0094
0095 Example
0096 -------
0097
0098 ::
0099
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <arpa/inet.h>
    #include <linux/if_ether.h>
    #include <linux/filter.h>   /* struct sock_filter, struct sock_fprog */

    /* ARRAY_SIZE() is a kernel macro, so user space has to define it. */
    #define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))

    /* From the example above: tcpdump -i em1 port 22 -dd */
    struct sock_filter code[] = {
        { 0x28,  0,  0, 0x0000000c },
        { 0x15,  0,  8, 0x000086dd },
        { 0x30,  0,  0, 0x00000014 },
        { 0x15,  2,  0, 0x00000084 },
        { 0x15,  1,  0, 0x00000006 },
        { 0x15,  0, 17, 0x00000011 },
        { 0x28,  0,  0, 0x00000036 },
        { 0x15, 14,  0, 0x00000016 },
        { 0x28,  0,  0, 0x00000038 },
        { 0x15, 12, 13, 0x00000016 },
        { 0x15,  0, 12, 0x00000800 },
        { 0x30,  0,  0, 0x00000017 },
        { 0x15,  2,  0, 0x00000084 },
        { 0x15,  1,  0, 0x00000006 },
        { 0x15,  0,  8, 0x00000011 },
        { 0x28,  0,  0, 0x00000014 },
        { 0x45,  6,  0, 0x00001fff },
        { 0xb1,  0,  0, 0x0000000e },
        { 0x48,  0,  0, 0x0000000e },
        { 0x15,  2,  0, 0x00000016 },
        { 0x48,  0,  0, 0x00000010 },
        { 0x15,  0,  1, 0x00000016 },
        { 0x06,  0,  0, 0x0000ffff },
        { 0x06,  0,  0, 0x00000000 },
    };

    struct sock_fprog bpf = {
        .len = ARRAY_SIZE(code),
        .filter = code,
    };

    int main(void)
    {
        int sock, ret;

        /* PF_PACKET sockets require CAP_NET_RAW. */
        sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        if (sock < 0) {
            perror("socket");
            return 1;
        }

        ret = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf));
        if (ret < 0) {
            perror("setsockopt");
            close(sock);
            return 1;
        }

        /* ... receive the filtered packets ... */
        close(sock);
        return 0;
    }
0149
0150 The above example code attaches a socket filter for a PF_PACKET socket
0151 in order to let all IPv4/IPv6 packets with port 22 pass. The rest will
0152 be dropped for this socket.
0153
The setsockopt(2) call for SO_DETACH_FILTER ignores the value of its
argument, whereas SO_LOCK_FILTER, which prevents the filter from being
detached, takes an integer value of 0 or 1.
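

The attach, lock and detach calls can be combined roughly as in the sketch
below. The helper names attach_locked() and detach_filter() are made up for
this illustration and are not part of any kernel or libc API; the dummy
integer handed to SO_DETACH_FILTER is an assumption that mirrors the
``&val, sizeof(val)`` form used in the summary of system calls.

.. code-block:: c

    #include <sys/socket.h>
    #include <linux/filter.h>

    /* Hypothetical helper: attach @prog to @sock, then lock it in place. */
    int attach_locked(int sock, struct sock_fprog *prog)
    {
        int one = 1;

        if (setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER,
                       prog, sizeof(*prog)) < 0)
            return -1;

        /* From here on the filter can neither be replaced nor removed. */
        return setsockopt(sock, SOL_SOCKET, SO_LOCK_FILTER,
                          &one, sizeof(one));
    }

    /* Hypothetical helper: expected to fail once the filter is locked. */
    int detach_filter(int sock)
    {
        int dummy = 0;  /* value is not interpreted by the kernel */

        return setsockopt(sock, SOL_SOCKET, SO_DETACH_FILTER,
                          &dummy, sizeof(dummy));
    }

After attach_locked() has succeeded, a later detach_filter() on the same
socket should return -1 (with errno set to EPERM) until the socket is closed.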
0157
0158 Note that socket filters are not restricted to PF_PACKET sockets only,
0159 but can also be used on other socket families.
0160
0161 Summary of system calls:
0162
0163 * setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_FILTER, &val, sizeof(val));
0164 * setsockopt(sockfd, SOL_SOCKET, SO_DETACH_FILTER, &val, sizeof(val));
0165 * setsockopt(sockfd, SOL_SOCKET, SO_LOCK_FILTER, &val, sizeof(val));
0166
0167 Normally, most use cases for socket filtering on packet sockets will be
0168 covered by libpcap in high-level syntax, so as an application developer
0169 you should stick to that. libpcap wraps its own layer around all that.
0170
Writing a filter "by hand" can be an alternative if i) using/linking to
libpcap is not an option, ii) the required BPF filters use Linux extensions
that are not supported by libpcap's compiler, iii) a filter is too complex
to be cleanly implemented with libpcap's compiler, or iv) particular filter
code should be optimized differently than libpcap's internal compiler does.
For example, xt_bpf and cls_bpf users might have requirements that could
result in more complex filter code, or code that cannot be expressed with
libpcap (e.g. different return codes for various code paths). Moreover,
BPF JIT implementors may wish to manually write test cases and thus need
low-level access to BPF code as well.
0182
0183 BPF engine and instruction set
0184 ------------------------------
0185
0186 Under tools/bpf/ there's a small helper tool called bpf_asm which can
0187 be used to write low-level filters for example scenarios mentioned in the
0188 previous section. Asm-like syntax mentioned here has been implemented in
0189 bpf_asm and will be used for further explanations (instead of dealing with
0190 less readable opcodes directly, principles are the same). The syntax is
0191 closely modelled after Steven McCanne's and Van Jacobson's BPF paper.
0192
0193 The BPF architecture consists of the following basic elements:
0194
=======  ====================================================
Element  Description
=======  ====================================================
A        32 bit wide accumulator
X        32 bit wide X register
M[]      16 x 32 bit wide misc registers aka "scratch memory
         store", addressable from 0 to 15
=======  ====================================================
0203
A program that is translated by bpf_asm into "opcodes" is an array that
consists of the following elements (as already mentioned)::
0206
0207 op:16, jt:8, jf:8, k:32
0208
The element op is a 16 bit wide opcode that has a particular instruction
encoded. jt and jf are two 8 bit wide jump targets, one for condition
"jump if true", the other one for "jump if false". Finally, element k
contains a miscellaneous argument that can be interpreted in different
ways depending on the given instruction in op.
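

As a sketch of how an op value breaks down, the BPF_CLASS(), BPF_SIZE() and
BPF_MODE() macros that <linux/filter.h> pulls in from <linux/bpf_common.h>
can split an opcode into its parts. The helper names class_name() and
is_ldh_abs() are assumptions made for this illustration only.

.. code-block:: c

    #include <linux/filter.h>

    /* Illustrative helper: name the instruction class encoded in @code. */
    const char *class_name(__u16 code)
    {
        switch (BPF_CLASS(code)) {
        case BPF_LD:
            return "ld";
        case BPF_LDX:
            return "ldx";
        case BPF_ST:
            return "st";
        case BPF_STX:
            return "stx";
        case BPF_ALU:
            return "alu";
        case BPF_JMP:
            return "jmp";
        case BPF_RET:
            return "ret";
        default:
            return "misc";  /* BPF_MISC */
        }
    }

    /* Illustrative helper: 1 if @code loads a half-word at packet offset k. */
    int is_ldh_abs(__u16 code)
    {
        return BPF_CLASS(code) == BPF_LD &&
               BPF_SIZE(code) == BPF_H &&
               BPF_MODE(code) == BPF_ABS;
    }

Applied to the opcodes from the example section, 0x28 is a half-word load
from an absolute packet offset, 0x15 belongs to the jmp class and 0x06 to
the ret class.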
0214
The instruction set consists of load, store, branch, alu, miscellaneous
and return instructions that are also represented in bpf_asm syntax. This
table lists all available bpf_asm instructions and what their underlying
opcodes, as defined in linux/filter.h, stand for:
0219
===========  ===================  =====================
Instruction  Addressing mode      Description
===========  ===================  =====================
ld           1, 2, 3, 4, 12       Load word into A
ldi          4                    Load word into A
ldh          1, 2                 Load half-word into A
ldb          1, 2                 Load byte into A
ldx          3, 4, 5, 12          Load word into X
ldxi         4                    Load word into X
ldxb         5                    Load byte into X

st           3                    Store A into M[]
stx          3                    Store X into M[]

jmp          6                    Jump to label
ja           6                    Jump to label
jeq          7, 8, 9, 10          Jump on A == <x>
jneq         9, 10                Jump on A != <x>
jne          9, 10                Jump on A != <x>
jlt          9, 10                Jump on A <  <x>
jle          9, 10                Jump on A <= <x>
jgt          7, 8, 9, 10          Jump on A >  <x>
jge          7, 8, 9, 10          Jump on A >= <x>
jset         7, 8, 9, 10          Jump on A &  <x>

add          0, 4                 A + <x>
sub          0, 4                 A - <x>
mul          0, 4                 A * <x>
div          0, 4                 A / <x>
mod          0, 4                 A % <x>
neg                               -A
and          0, 4                 A & <x>
or           0, 4                 A | <x>
xor          0, 4                 A ^ <x>
lsh          0, 4                 A << <x>
rsh          0, 4                 A >> <x>

tax                               Copy A into X
txa                               Copy X into A

ret          4, 11                Return
===========  ===================  =====================
0262
0263 The next table shows addressing formats from the 2nd column:
0264
===============  ===================  ===============================================
Addressing mode  Syntax               Description
===============  ===================  ===============================================
0                x/%x                 Register X
1                [k]                  BHW at byte offset k in the packet
2                [x + k]              BHW at the offset X + k in the packet
3                M[k]                 Word at offset k in M[]
4                #k                   Literal value stored in k
5                4*([k]&0xf)          Lower nibble * 4 at byte offset k in the packet
6                L                    Jump label L
7                #k,Lt,Lf             Jump to Lt if true, otherwise jump to Lf
8                x/%x,Lt,Lf           Jump to Lt if true, otherwise jump to Lf
9                #k,Lt                Jump to Lt if predicate is true
10               x/%x,Lt              Jump to Lt if predicate is true
11               a/%a                 Accumulator A
12               extension            BPF extension
===============  ===================  ===============================================
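

To make the semantics of a few of these instructions concrete, the following
user-space sketch interprets a very small subset: ldh and ldb with addressing
mode 1, jeq with a literal operand, and ret with a literal operand. It is not
the kernel's interpreter; the function name run_filter() is hypothetical, and
every instruction outside that subset is simply treated as "drop".

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>
    #include <linux/filter.h>

    /*
     * Illustrative only: handles ldh [k], ldb [k], jeq #k and ret #k.
     * Returns the filter's return value, or 0 (drop) on anything else.
     */
    uint32_t run_filter(const struct sock_filter *prog, size_t len,
                        const uint8_t *pkt, size_t pkt_len)
    {
        uint32_t A = 0;     /* accumulator */
        size_t pc;

        for (pc = 0; pc < len; pc++) {
            const struct sock_filter *insn = &prog[pc];

            switch (insn->code) {
            case BPF_LD | BPF_H | BPF_ABS:  /* network byte order */
                if (pkt_len < 2 || insn->k > pkt_len - 2)
                    return 0;               /* load out of bounds */
                A = (uint32_t)pkt[insn->k] << 8 | pkt[insn->k + 1];
                break;
            case BPF_LD | BPF_B | BPF_ABS:
                if (insn->k >= pkt_len)
                    return 0;               /* load out of bounds */
                A = pkt[insn->k];
                break;
            case BPF_JMP | BPF_JEQ | BPF_K: /* offsets count from pc + 1 */
                pc += (A == insn->k) ? insn->jt : insn->jf;
                break;
            case BPF_RET | BPF_K:
                return insn->k;
            default:
                return 0;   /* instruction not covered by this sketch */
            }
        }
        return 0;           /* fell off the end of the program */
    }

Note that jt and jf are relative to the instruction following the jump, which
is why the sketch adds them to pc before the loop increment takes effect.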
0282
The Linux kernel also has a couple of BPF extensions that are used along
with the class of load instructions by "overloading" the k argument with
a negative offset + a particular extension offset. The result of such a
BPF extension is loaded into A.
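

In C, this overloading might look as follows. The sketch assumes the
SKF_AD_OFF and SKF_AD_IFINDEX constants and the BPF_STMT()/BPF_JUMP() macros
from <linux/filter.h>; the array name ifidx_filter is hypothetical. It
encodes a filter that only accepts packets seen on interface index 13.

.. code-block:: c

    #include <linux/filter.h>

    /* Illustrative: accept only packets whose interface index is 13. */
    struct sock_filter ifidx_filter[] = {
        /* ld ifidx: k = negative base SKF_AD_OFF + extension offset */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, SKF_AD_OFF + SKF_AD_IFINDEX),
        /* jneq #13, drop */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 13, 0, 1),
        /* ret #-1 */
        BPF_STMT(BPF_RET | BPF_K, 0xffffffff),
        /* drop: ret #0 */
        BPF_STMT(BPF_RET | BPF_K, 0),
    };

Here k of the first instruction ends up as 0xfffff008, that is -0x1000 + 8
stored as an unsigned 32 bit value, which the kernel recognizes as a request
for the interface index rather than as a packet offset.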
0287
0288 Possible BPF extensions are shown in the following table:
0289
==========  =================================================
Extension   Description
==========  =================================================
len         skb->len
proto       skb->protocol
type        skb->pkt_type
poff        Payload start offset
ifidx       skb->dev->ifindex
nla         Netlink attribute of type X with offset A
nlan        Nested Netlink attribute of type X with offset A
mark        skb->mark
queue       skb->queue_mapping
hatype      skb->dev->type
rxhash      skb->hash
cpu         raw_smp_processor_id()
vlan_tci    skb_vlan_tag_get(skb)
vlan_avail  skb_vlan_tag_present(skb)
vlan_tpid   skb->vlan_proto
rand        prandom_u32()
==========  =================================================
0310
These extensions can also be prefixed with '#'.

Examples for low-level BPF:
0313
0314 **ARP packets**::
0315
0316 ldh [12]
0317 jne #0x806, drop
0318 ret #-1
0319 drop: ret #0
0320
0321 **IPv4 TCP packets**::
0322
0323 ldh [12]
0324 jne #0x800, drop
0325 ldb [23]
0326 jneq #6, drop
0327 ret #-1
0328 drop: ret #0
0329
**ICMP random packet sampling, 1 in 4**::
0331
0332 ldh [12]
0333 jne #0x800, drop
0334 ldb [23]
0335 jneq #1, drop
0336 # get a random uint32 number
0337 ld rand
0338 mod #4
0339 jneq #1, drop
0340 ret #-1
0341 drop: ret #0
0342
0343 **SECCOMP filter example**::
0344
0345 ld [4] /* offsetof(struct seccomp_data, arch) */
0346 jne #0xc000003e, bad /* AUDIT_ARCH_X86_64 */
0347 ld [0] /* offsetof(struct seccomp_data, nr) */
0348 jeq #15, good /* __NR_rt_sigreturn */
0349 jeq #231, good /* __NR_exit_group */
0350 jeq #60, good /* __NR_exit */
0351 jeq #0, good /* __NR_read */
0352 jeq #1, good /* __NR_write */
0353 jeq #5, good /* __NR_fstat */
0354 jeq #9, good /* __NR_mmap */
0355 jeq #14, good /* __NR_rt_sigprocmask */
0356 jeq #13, good /* __NR_rt_sigaction */
0357 jeq #35, good /* __NR_nanosleep */
0358 bad: ret #0 /* SECCOMP_RET_KILL_THREAD */
0359 good: ret #0x7fff0000 /* SECCOMP_RET_ALLOW */
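

For comparison, a shortened variant of the SECCOMP filter above could be
written in C roughly as follows. This is only a sketch: the array name
seccomp_sketch is hypothetical, the allow-list is cut down to three system
calls for brevity, and the numbers are the x86_64 ones used above. The
constants come from <linux/seccomp.h> and <linux/audit.h>.

.. code-block:: c

    #include <stddef.h>
    #include <linux/audit.h>
    #include <linux/filter.h>
    #include <linux/seccomp.h>

    /* Illustrative subset: allow exit_group, exit and write only. */
    struct sock_filter seccomp_sketch[] = {
        /* ld [4]: load seccomp_data.arch */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, arch)),
        /* jne #0xc000003e, bad: continue on match, else skip 4 to "bad" */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 4),
        /* ld [0]: load the system call number */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 231, 3, 0), /* exit_group */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 60, 2, 0),  /* exit */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 1, 1, 0),   /* write */
        /* bad: */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_THREAD),
        /* good: */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };

Since classic BPF only encodes "jump if equal", the ``jne ..., bad`` line
becomes a jeq with jt = 0 (fall through) and jf pointing at the bad label.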
0360
0361 Examples for low-level BPF extension:
0362
0363 **Packet for interface index 13**::
0364
0365 ld ifidx
0366 jneq #13, drop
0367 ret #-1
0368 drop: ret #0
0369
0370 **(Accelerated) VLAN w/ id 10**::
0371
0372 ld vlan_tci
0373 jneq #10, drop
0374 ret #-1
0375 drop: ret #0
0376
The above example code can be placed into a file (here called "foo"), and
then be passed to the bpf_asm tool for generating opcodes, in an output
format that xt_bpf and cls_bpf understand and can load directly. Example
with the above ARP code::
0381
0382 $ ./bpf_asm foo
0383 4,40 0 0 12,21 0 1 2054,6 0 0 4294967295,6 0 0 0,
0384
0385 In copy and paste C-like output::
0386
0387 $ ./bpf_asm -c foo
0388 { 0x28, 0, 0, 0x0000000c },
0389 { 0x15, 0, 1, 0x00000806 },
0390 { 0x06, 0, 0, 0xffffffff },
0391 { 0x06, 0, 0, 0000000000 },
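

The comma-separated decimal form printed by plain bpf_asm (instruction count
first, then one "code jt jf k" group per instruction) is easy to turn back
into a struct sock_filter array. The parser below is a minimal sketch under
that assumption about the format; parse_bpf_decimal() is a hypothetical name
and not part of bpf_asm or of any library.

.. code-block:: c

    #include <stdio.h>
    #include <linux/filter.h>

    /*
     * Illustrative parser for "len,code jt jf k,code jt jf k,...".
     * Returns the number of instructions stored in @out, or -1 on error.
     */
    int parse_bpf_decimal(const char *s, struct sock_filter *out,
                          unsigned int max)
    {
        unsigned int len, code, jt, jf, k, i;
        int n = 0;

        /* Leading instruction count, terminated by a comma. */
        if (sscanf(s, "%u,%n", &len, &n) != 1 || n == 0 || len > max)
            return -1;
        s += n;

        for (i = 0; i < len; i++) {
            n = 0;
            if (sscanf(s, "%u %u %u %u%n", &code, &jt, &jf, &k, &n) != 4 ||
                n == 0)
                return -1;
            /* code is 16 bit wide, jt and jf are 8 bit wide. */
            if (code > 0xffff || jt > 0xff || jf > 0xff)
                return -1;
            out[i].code = code;
            out[i].jt = jt;
            out[i].jf = jf;
            out[i].k = k;
            s += n;
            if (*s == ',')  /* separator before the next instruction */
                s++;
        }
        return (int)len;
    }

Fed with the ARP output above, the sketch yields four instructions whose
fields match the C-like output of ``bpf_asm -c``.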
0392
In particular, as usage with xt_bpf or cls_bpf can result in more complex BPF
filters that might not be obvious at first, it's good to test filters before
attaching to a live system. For that purpose, there's a small tool called
bpf_dbg under tools/bpf/ in the kernel source directory. This debugger allows
for testing BPF filters against given pcap files, single-stepping through the
BPF code on the pcap's packets, and doing BPF machine register dumps.
0399
0400 Starting bpf_dbg is trivial and just requires issuing::
0401
0402 # ./bpf_dbg
0403
0404 In case input and output do not equal stdin/stdout, bpf_dbg takes an
0405 alternative stdin source as a first argument, and an alternative stdout
0406 sink as a second one, e.g. `./bpf_dbg test_in.txt test_out.txt`.
0407
0408 Other than that, a particular libreadline configuration can be set via
0409 file "~/.bpf_dbg_init" and the command history is stored in the file
0410 "~/.bpf_dbg_history".
0411
0412 Interaction in bpf_dbg happens through a shell that also has auto-completion
0413 support (follow-up example commands starting with '>' denote bpf_dbg shell).
0414 The usual workflow would be to ...
0415
0416 * load bpf 6,40 0 0 12,21 0 3 2048,48 0 0 23,21 0 1 1,6 0 0 65535,6 0 0 0
0417 Loads a BPF filter from standard output of bpf_asm, or transformed via
0418 e.g. ``tcpdump -iem1 -ddd port 22 | tr '\n' ','``. Note that for JIT
0419 debugging (next section), this command creates a temporary socket and
0420 loads the BPF code into the kernel. Thus, this will also be useful for
0421 JIT developers.
0422
0423 * load pcap foo.pcap
0424
0425 Loads standard tcpdump pcap file.
0426
0427 * run [<n>]
0428
0429 bpf passes:1 fails:9
0430 Runs through all packets from a pcap to account how many passes and fails
0431 the filter will generate. A limit of packets to traverse can be given.
0432
0433 * disassemble::
0434
0435 l0: ldh [12]
0436 l1: jeq #0x800, l2, l5
0437 l2: ldb [23]
0438 l3: jeq #0x1, l4, l5
0439 l4: ret #0xffff
0440 l5: ret #0
0441
0442 Prints out BPF code disassembly.
0443
0444 * dump::
0445
0446 /* { op, jt, jf, k }, */
0447 { 0x28, 0, 0, 0x0000000c },
0448 { 0x15, 0, 3, 0x00000800 },
0449 { 0x30, 0, 0, 0x00000017 },
0450 { 0x15, 0, 1, 0x00000001 },
0451 { 0x06, 0, 0, 0x0000ffff },
0452 { 0x06, 0, 0, 0000000000 },
0453
0454 Prints out C-style BPF code dump.
0455
0456 * breakpoint 0::
0457
0458 breakpoint at: l0: ldh [12]
0459
0460 * breakpoint 1::
0461
0462 breakpoint at: l1: jeq #0x800, l2, l5
0463
0464 ...
0465
0466 Sets breakpoints at particular BPF instructions. Issuing a `run` command
0467 will walk through the pcap file continuing from the current packet and
0468 break when a breakpoint is being hit (another `run` will continue from
0469 the currently active breakpoint executing next instructions):
0470
0471 * run::
0472
0473 -- register dump --
0474 pc: [0] <-- program counter
0475 code: [40] jt[0] jf[0] k[12] <-- plain BPF code of current instruction
0476 curr: l0: ldh [12] <-- disassembly of current instruction
0477 A: [00000000][0] <-- content of A (hex, decimal)
0478 X: [00000000][0] <-- content of X (hex, decimal)
0479 M[0,15]: [00000000][0] <-- folded content of M (hex, decimal)
0480 -- packet dump -- <-- Current packet from pcap (hex)
0481 len: 42
0482 0: 00 19 cb 55 55 a4 00 14 a4 43 78 69 08 06 00 01
0483 16: 08 00 06 04 00 01 00 14 a4 43 78 69 0a 3b 01 26
0484 32: 00 00 00 00 00 00 0a 3b 01 01
0485 (breakpoint)
0486 >
0487
0488 * breakpoint::
0489
0490 breakpoints: 0 1
0491
0492 Prints currently set breakpoints.
0493
0494 * step [-<n>, +<n>]
0495
0496 Performs single stepping through the BPF program from the current pc
0497 offset. Thus, on each step invocation, above register dump is issued.
0498 This can go forwards and backwards in time, a plain `step` will break
0499 on the next BPF instruction, thus +1. (No `run` needs to be issued here.)
0500
0501 * select <n>
0502
0503 Selects a given packet from the pcap file to continue from. Thus, on
0504 the next `run` or `step`, the BPF program is being evaluated against
0505 the user pre-selected packet. Numbering starts just as in Wireshark
0506 with index 1.
0507
0508 * quit
0509
0510 Exits bpf_dbg.
0511
0512 JIT compiler
0513 ------------
0514
The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC,
PowerPC, ARM, ARM64, MIPS, RISC-V and s390, which can be enabled through
CONFIG_BPF_JIT. The JIT compiler is transparently invoked for each
filter attached from user space or by internal kernel users, provided it
has previously been enabled by root::
0520
0521 echo 1 > /proc/sys/net/core/bpf_jit_enable
0522
For JIT developers, auditors, etc., each compile run can output the generated
opcode image into the kernel log via::
0525
0526 echo 2 > /proc/sys/net/core/bpf_jit_enable
0527
0528 Example output from dmesg::
0529
0530 [ 3389.935842] flen=6 proglen=70 pass=3 image=ffffffffa0069c8f
0531 [ 3389.935847] JIT code: 00000000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 68
0532 [ 3389.935849] JIT code: 00000010: 44 2b 4f 6c 4c 8b 87 d8 00 00 00 be 0c 00 00 00
0533 [ 3389.935850] JIT code: 00000020: e8 1d 94 ff e0 3d 00 08 00 00 75 16 be 17 00 00
0534 [ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00
0535 [ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3
0536
When CONFIG_BPF_JIT_ALWAYS_ON is enabled, bpf_jit_enable is permanently set
to 1 and setting any other value will result in failure. This is even the
case for setting bpf_jit_enable to 2, since dumping the final JIT image into
the kernel log is discouraged and introspection through bpftool (under
tools/bpf/bpftool/) is the generally recommended approach instead.
0542
0543 In the kernel source tree under tools/bpf/, there's bpf_jit_disasm for
0544 generating disassembly out of the kernel log's hexdump::
0545
0546 # ./bpf_jit_disasm
0547 70 bytes emitted from JIT compiler (pass:3, flen:6)
0548 ffffffffa0069c8f + <x>:
0549 0: push %rbp
0550 1: mov %rsp,%rbp
0551 4: sub $0x60,%rsp
0552 8: mov %rbx,-0x8(%rbp)
0553 c: mov 0x68(%rdi),%r9d
0554 10: sub 0x6c(%rdi),%r9d
0555 14: mov 0xd8(%rdi),%r8
0556 1b: mov $0xc,%esi
0557 20: callq 0xffffffffe0ff9442
0558 25: cmp $0x800,%eax
0559 2a: jne 0x0000000000000042
0560 2c: mov $0x17,%esi
0561 31: callq 0xffffffffe0ff945e
0562 36: cmp $0x1,%eax
0563 39: jne 0x0000000000000042
0564 3b: mov $0xffff,%eax
0565 40: jmp 0x0000000000000044
0566 42: xor %eax,%eax
0567 44: leaveq
0568 45: retq
0569
Issuing option `-o` will "annotate" opcodes to resulting assembler
instructions, which can be very useful for JIT developers::
0572
0573 # ./bpf_jit_disasm -o
0574 70 bytes emitted from JIT compiler (pass:3, flen:6)
0575 ffffffffa0069c8f + <x>:
0576 0: push %rbp
0577 55
0578 1: mov %rsp,%rbp
0579 48 89 e5
0580 4: sub $0x60,%rsp
0581 48 83 ec 60
0582 8: mov %rbx,-0x8(%rbp)
0583 48 89 5d f8
0584 c: mov 0x68(%rdi),%r9d
0585 44 8b 4f 68
0586 10: sub 0x6c(%rdi),%r9d
0587 44 2b 4f 6c
0588 14: mov 0xd8(%rdi),%r8
0589 4c 8b 87 d8 00 00 00
0590 1b: mov $0xc,%esi
0591 be 0c 00 00 00
0592 20: callq 0xffffffffe0ff9442
0593 e8 1d 94 ff e0
0594 25: cmp $0x800,%eax
0595 3d 00 08 00 00
0596 2a: jne 0x0000000000000042
0597 75 16
0598 2c: mov $0x17,%esi
0599 be 17 00 00 00
0600 31: callq 0xffffffffe0ff945e
0601 e8 28 94 ff e0
0602 36: cmp $0x1,%eax
0603 83 f8 01
0604 39: jne 0x0000000000000042
0605 75 07
0606 3b: mov $0xffff,%eax
0607 b8 ff ff 00 00
0608 40: jmp 0x0000000000000044
0609 eb 02
0610 42: xor %eax,%eax
0611 31 c0
0612 44: leaveq
0613 c9
0614 45: retq
0615 c3
0616
For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provide a useful
toolchain for developing and testing the kernel's JIT compiler.
0619
0620 BPF kernel internals
--------------------

Internally, the kernel interpreter uses a different instruction set
format that follows similar underlying principles as the BPF described in
the previous paragraphs. However, this format is modelled more closely on
the underlying architecture to mimic native instruction sets, so that
better performance can be achieved. This new ISA is called eBPF. See
../bpf/index.rst for details. (Note: eBPF, which originates from
[e]xtended BPF, is not the same as BPF extensions! While eBPF is an ISA,
BPF extensions date back to classic BPF's 'overloading' of the
BPF_LD | BPF_{B,H,W} | BPF_ABS instruction.)
0631
The new instruction set was originally designed with the possible goal in
mind of writing programs in "restricted C" and compiling them into eBPF with
an optional GCC/LLVM backend, so that they can be just-in-time mapped to
modern 64-bit CPUs with minimal performance overhead over two steps, that
is, C -> eBPF -> native code.
0636
Currently, the new format is used for running user BPF programs, which
include seccomp BPF, classic socket filters, the cls_bpf traffic classifier,
the team driver's classifier for its load-balancing mode, netfilter's xt_bpf
extension, the PTP dissector/classifier, and much more. They are all
internally converted by the kernel into the new instruction set
representation and run in the eBPF interpreter. For in-kernel handlers, this
all works transparently by using bpf_prog_create() for setting up the filter
and bpf_prog_destroy() for destroying it. The function
bpf_prog_run(filter, ctx) transparently invokes the eBPF interpreter or JITed
code to run the filter. 'filter' is a pointer to the struct bpf_prog that we
got from bpf_prog_create(), and 'ctx' is the given context (e.g. an
skb pointer). All constraints and restrictions from bpf_check_classic() apply
before the conversion to the new layout is done behind the scenes!
0650
0651 Currently, the classic BPF format is being used for JITing on most
0652 32-bit architectures, whereas x86-64, aarch64, s390x, powerpc64,
0653 sparc64, arm32, riscv64, riscv32 perform JIT compilation from eBPF
0654 instruction set.
0655
0656 Testing
0657 -------
0658
0659 Next to the BPF toolchain, the kernel also ships a test module that contains
0660 various test cases for classic and eBPF that can be executed against
0661 the BPF interpreter and JIT compiler. It can be found in lib/test_bpf.c and
0662 enabled via Kconfig::
0663
0664 CONFIG_TEST_BPF=m
0665
0666 After the module has been built and installed, the test suite can be executed
0667 via insmod or modprobe against 'test_bpf' module. Results of the test cases
0668 including timings in nsec can be found in the kernel log (dmesg).
0669
0670 Misc
0671 ----
0672
0673 Also trinity, the Linux syscall fuzzer, has built-in support for BPF and
0674 SECCOMP-BPF kernel fuzzing.
0675
0676 Written by
0677 ----------
0678
0679 The document was written in the hope that it is found useful and in order
0680 to give potential BPF hackers or security auditors a better overview of
0681 the underlying architecture.
0682
0683 - Jay Schulist <jschlst@samba.org>
0684 - Daniel Borkmann <daniel@iogearbox.net>
0685 - Alexei Starovoitov <ast@kernel.org>