.. SPDX-License-Identifier: GPL-2.0

.. _networking-filter:

=======================================================
Linux Socket Filtering aka Berkeley Packet Filter (BPF)
=======================================================

Notice
------

This file used to document the eBPF format and mechanisms even when not
related to socket filtering.  The ../bpf/index.rst has more details
on eBPF.

Introduction
------------

Linux Socket Filtering (LSF) is derived from the Berkeley Packet Filter.
Although there are some distinct differences between BSD and Linux
kernel filtering, when we speak of BPF or LSF in the Linux context, we
mean the very same filtering mechanism in the Linux kernel.

BPF allows a user-space program to attach a filter onto any socket and
allow or disallow certain types of data to come through the socket. LSF
follows exactly the same filter code structure as BSD's BPF, so referring
to the BSD bpf.4 manpage is very helpful in creating filters.

On Linux, BPF is much simpler than on BSD. One does not have to worry
about devices or anything like that. You simply create your filter code
and send it to the kernel via the SO_ATTACH_FILTER option; if your filter
code passes the kernel's check, filtering of data on that socket begins
immediately.

You can also detach filters from your socket via the SO_DETACH_FILTER
option. This will probably not be used much, since closing a socket that
has a filter on it removes the filter automatically. The other, less
common case is attaching a different filter to a socket that already has
a running filter: the kernel takes care of removing the old one and
putting your new one in its place, provided the new filter passes the
checks. If it fails, the old filter remains on that socket.

The SO_LOCK_FILTER option allows locking the filter attached to a socket.
Once set, a filter cannot be removed or changed. This allows a process to
set up a socket, attach a filter, lock it, then drop privileges and be
assured that the filter will be kept until the socket is closed.

The biggest user of this construct might be libpcap. Issuing a high-level
filter command like `tcpdump -i em1 port 22` passes through the libpcap
internal compiler that generates a structure that can eventually be loaded
via SO_ATTACH_FILTER to the kernel. `tcpdump -i em1 port 22 -ddd`
displays what is being placed into this structure.

Although we were only speaking about sockets here, BPF in Linux is used
in many more places. There's xt_bpf for netfilter, cls_bpf in the kernel
qdisc layer, SECCOMP-BPF (SECure COMPuting [1]_), and lots of other places
such as the team driver, PTP code, etc.

.. [1] Documentation/userspace-api/seccomp_filter.rst

Original BPF paper:

Steven McCanne and Van Jacobson. 1993. The BSD packet filter: a new
architecture for user-level packet capture. In Proceedings of the
USENIX Winter 1993 Conference (USENIX'93). USENIX Association, Berkeley,
CA, USA, 2-2. [http://www.tcpdump.org/papers/bpf-usenix93.pdf]

Structure
---------

User space applications include <linux/filter.h> which contains the
following relevant structures::

        struct sock_filter {    /* Filter block */
                __u16   code;   /* Actual filter code */
                __u8    jt;     /* Jump true */
                __u8    jf;     /* Jump false */
                __u32   k;      /* Generic multiuse field */
        };

A filter is assembled as an array of such 4-tuples, each containing
a code, jt, jf and k value. jt and jf are jump offsets and k is a generic
value to be used for the given code::

        struct sock_fprog {                     /* Required for SO_ATTACH_FILTER. */
                unsigned short             len; /* Number of filter blocks */
                struct sock_filter __user *filter;
        };

For socket filtering, a pointer to this structure (as shown in the
following example) is passed to the kernel through setsockopt(2).
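
As a rough illustration of how the two structures fit together, the sketch
below builds a small filter (accept ARP frames, drop the rest) with the
BPF_STMT() and BPF_JUMP() initializer macros from <linux/filter.h>. The
names arp_filter, arp_prog and arp_prog_len() are made up for this sketch
and do not appear anywhere in the kernel.

```c
#include <assert.h>
#include <linux/filter.h>   /* struct sock_filter, struct sock_fprog, BPF_* */

/* op:16 + jt:8 + jf:8 + k:32 make up one 8 byte filter block. */
static_assert(sizeof(struct sock_filter) == 8, "unexpected filter block size");

/* Accept ARP frames, drop everything else (hypothetical name). */
static struct sock_filter arp_filter[] = {
        /* A = half-word at byte offset 12 (the EtherType) */
        BPF_STMT(BPF_LD | BPF_H | BPF_ABS, 12),
        /* if (A == 0x0806) skip 0 instructions, else skip 1 */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x0806, 0, 1),
        /* accept: pass up to 0xffffffff bytes of the packet */
        BPF_STMT(BPF_RET | BPF_K, 0xffffffff),
        /* drop: pass 0 bytes */
        BPF_STMT(BPF_RET | BPF_K, 0),
};

/* The descriptor whose address would be handed to setsockopt(2). */
static struct sock_fprog arp_prog = {
        .len    = sizeof(arp_filter) / sizeof(arp_filter[0]),
        .filter = arp_filter,
};

/* Number of filter blocks in the program (hypothetical helper). */
static unsigned short arp_prog_len(void)
{
        return arp_prog.len;
}
```

The macros expand to plain { code, jt, jf, k } initializers, so the array
above holds the same four tuples that `bpf_asm -c` prints for the ARP
example further below.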

Example
-------

::

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <arpa/inet.h>
    #include <linux/filter.h>
    #include <linux/if_ether.h>

    /* ARRAY_SIZE() is a kernel macro; user space has to define it. */
    #define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

    /* From the example above: tcpdump -i em1 port 22 -dd */
    struct sock_filter code[] = {
            { 0x28,  0,  0, 0x0000000c },
            { 0x15,  0,  8, 0x000086dd },
            { 0x30,  0,  0, 0x00000014 },
            { 0x15,  2,  0, 0x00000084 },
            { 0x15,  1,  0, 0x00000006 },
            { 0x15,  0, 17, 0x00000011 },
            { 0x28,  0,  0, 0x00000036 },
            { 0x15, 14,  0, 0x00000016 },
            { 0x28,  0,  0, 0x00000038 },
            { 0x15, 12, 13, 0x00000016 },
            { 0x15,  0, 12, 0x00000800 },
            { 0x30,  0,  0, 0x00000017 },
            { 0x15,  2,  0, 0x00000084 },
            { 0x15,  1,  0, 0x00000006 },
            { 0x15,  0,  8, 0x00000011 },
            { 0x28,  0,  0, 0x00000014 },
            { 0x45,  6,  0, 0x00001fff },
            { 0xb1,  0,  0, 0x0000000e },
            { 0x48,  0,  0, 0x0000000e },
            { 0x15,  2,  0, 0x00000016 },
            { 0x48,  0,  0, 0x00000010 },
            { 0x15,  0,  1, 0x00000016 },
            { 0x06,  0,  0, 0x0000ffff },
            { 0x06,  0,  0, 0x00000000 },
    };

    struct sock_fprog bpf = {
            .len = ARRAY_SIZE(code),
            .filter = code,
    };

    int main(void)
    {
            int sock, ret;

            /* Packet sockets require CAP_NET_RAW. */
            sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
            if (sock < 0) {
                    perror("socket");
                    return 1;
            }

            ret = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf));
            if (ret < 0) {
                    perror("setsockopt");
                    close(sock);
                    return 1;
            }

            /* ... */
            close(sock);
            return 0;
    }

The above example code attaches a socket filter to a PF_PACKET socket
in order to let all IPv4/IPv6 packets with port 22 pass. The rest will
be dropped for this socket.

The setsockopt(2) call for SO_DETACH_FILTER does not need a meaningful
argument value, whereas SO_LOCK_FILTER, which prevents the filter from
being detached, takes an integer value of 0 or 1.

Note that socket filters are not restricted to PF_PACKET sockets only,
but can also be used on other socket families.

Summary of system calls:

 * setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_FILTER, &val, sizeof(val));
 * setsockopt(sockfd, SOL_SOCKET, SO_DETACH_FILTER, &val, sizeof(val));
 * setsockopt(sockfd, SOL_SOCKET, SO_LOCK_FILTER,   &val, sizeof(val));

Normally, most use cases for socket filtering on packet sockets will be
covered by libpcap in high-level syntax, so as an application developer
you should stick to that. libpcap wraps its own layer around all that.

Writing a filter "by hand" can be an alternative if i) using/linking to
libpcap is not an option, ii) the required BPF filters use Linux
extensions that are not supported by libpcap's compiler, iii) a filter
is more complex and not cleanly implementable with libpcap's compiler,
or iv) particular filter codes should be optimized differently than
libpcap's internal compiler does. For example, xt_bpf and cls_bpf users
might have requirements that could result in more complex filter code,
or one that cannot be expressed with libpcap (e.g. different return
codes for various code paths). Moreover, BPF JIT implementors may wish
to manually write test cases and thus need low-level access to BPF code
as well.

BPF engine and instruction set
------------------------------

Under tools/bpf/ there's a small helper tool called bpf_asm which can
be used to write low-level filters for the example scenarios mentioned in
the previous section. The asm-like syntax described here has been
implemented in bpf_asm and will be used for further explanations (instead
of dealing with less readable opcodes directly; the principles are the
same). The syntax is closely modelled after Steven McCanne's and Van
Jacobson's BPF paper.

The BPF architecture consists of the following basic elements:

  =======          ====================================================
  Element          Description
  =======          ====================================================
  A                32 bit wide accumulator
  X                32 bit wide X register
  M[]              16 x 32 bit wide misc registers aka "scratch memory
                   store", addressable from 0 to 15
  =======          ====================================================

A program that is translated by bpf_asm into "opcodes" is an array that
consists of the following elements (as already mentioned)::

  op:16, jt:8, jf:8, k:32

The element op is a 16 bit wide opcode that has a particular instruction
encoded. jt and jf are two 8 bit wide jump targets, one for the condition
"jump if true", the other one for "jump if false". Finally, element k
contains a miscellaneous argument that can be interpreted in different
ways depending on the given instruction in op.
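
To make the roles of op, jt, jf and k more concrete, here is a toy
user-space interpreter for a very small subset of classic BPF (absolute
half-word and byte loads, jeq against a constant, and ret with a
constant). It is only a sketch for illustration, not the kernel's
implementation; the function name toy_bpf_run() is made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/filter.h>   /* struct sock_filter, BPF_* opcode macros */

/*
 * Run prog[0..len) against the packet and return the filter verdict.
 * Any out-of-bounds load or unsupported opcode yields 0 (drop).
 */
static uint32_t toy_bpf_run(const struct sock_filter *prog, size_t len,
                            const uint8_t *pkt, size_t pkt_len)
{
        uint32_t A = 0;         /* accumulator */
        size_t pc = 0;          /* index of the next instruction */

        assert(prog != NULL && pkt != NULL);

        while (pc < len) {
                const struct sock_filter *insn = &prog[pc++];

                switch (insn->code) {
                case BPF_LD | BPF_H | BPF_ABS:  /* ldh [k], network byte order */
                        if ((uint64_t)insn->k + 2 > pkt_len)
                                return 0;
                        A = (uint32_t)pkt[insn->k] << 8 | pkt[insn->k + 1];
                        break;
                case BPF_LD | BPF_B | BPF_ABS:  /* ldb [k] */
                        if (insn->k >= pkt_len)
                                return 0;
                        A = pkt[insn->k];
                        break;
                case BPF_JMP | BPF_JEQ | BPF_K: /* jeq #k, jt, jf */
                        /* jt and jf count from the following instruction */
                        pc += (A == insn->k) ? insn->jt : insn->jf;
                        break;
                case BPF_RET | BPF_K:           /* ret #k */
                        return insn->k;
                default:                        /* outside the toy subset */
                        return 0;
                }
        }
        return 0;       /* ran past the end of the program */
}
```

Running the four ARP opcodes { 0x28, 0, 0, 12 }, { 0x15, 0, 1, 0x806 },
{ 0x06, 0, 0, 0xffffffff }, { 0x06, 0, 0, 0 } over a frame whose bytes 12
and 13 are 08 06 takes the "true" branch and returns 0xffffffff; any
other EtherType skips one instruction and returns 0.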

The instruction set consists of load, store, branch, alu, miscellaneous
and return instructions that are also represented in bpf_asm syntax. This
table lists all available bpf_asm instructions and what their underlying
opcodes, as defined in linux/filter.h, stand for:

  ===========      ===================  =====================
  Instruction      Addressing mode      Description
  ===========      ===================  =====================
  ld               1, 2, 3, 4, 12       Load word into A
  ldi              4                    Load word into A
  ldh              1, 2                 Load half-word into A
  ldb              1, 2                 Load byte into A
  ldx              3, 4, 5, 12          Load word into X
  ldxi             4                    Load word into X
  ldxb             5                    Load byte into X

  st               3                    Store A into M[]
  stx              3                    Store X into M[]

  jmp              6                    Jump to label
  ja               6                    Jump to label
  jeq              7, 8, 9, 10          Jump on A == <x>
  jneq             9, 10                Jump on A != <x>
  jne              9, 10                Jump on A != <x>
  jlt              9, 10                Jump on A <  <x>
  jle              9, 10                Jump on A <= <x>
  jgt              7, 8, 9, 10          Jump on A >  <x>
  jge              7, 8, 9, 10          Jump on A >= <x>
  jset             7, 8, 9, 10          Jump on A &  <x>

  add              0, 4                 A + <x>
  sub              0, 4                 A - <x>
  mul              0, 4                 A * <x>
  div              0, 4                 A / <x>
  mod              0, 4                 A % <x>
  neg                                   -A
  and              0, 4                 A & <x>
  or               0, 4                 A | <x>
  xor              0, 4                 A ^ <x>
  lsh              0, 4                 A << <x>
  rsh              0, 4                 A >> <x>

  tax                                   Copy A into X
  txa                                   Copy X into A

  ret              4, 11                Return
  ===========      ===================  =====================

The next table shows addressing formats from the 2nd column (BHW stands
for byte, half-word or word, depending on the load instruction used):

  ===============  ===================  ===============================================
  Addressing mode  Syntax               Description
  ===============  ===================  ===============================================
   0               x/%x                 Register X
   1               [k]                  BHW at byte offset k in the packet
   2               [x + k]              BHW at the offset X + k in the packet
   3               M[k]                 Word at offset k in M[]
   4               #k                   Literal value stored in k
   5               4*([k]&0xf)          Lower nibble * 4 at byte offset k in the packet
   6               L                    Jump label L
   7               #k,Lt,Lf             Jump to Lt if true, otherwise jump to Lf
   8               x/%x,Lt,Lf           Jump to Lt if true, otherwise jump to Lf
   9               #k,Lt                Jump to Lt if predicate is true
  10               x/%x,Lt              Jump to Lt if predicate is true
  11               a/%a                 Accumulator A
  12               extension            BPF extension
  ===============  ===================  ===============================================
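
Addressing mode 5 may look odd at first. It exists mainly to fetch the
IPv4 header length, which is stored as a count of 32 bit words in the
lower nibble of the first header byte. The port 22 example above uses it
as { 0xb1, 0, 0, 0x0000000e }, i.e. `ldxb 4*([14]&0xf)`. A small C sketch
of the computation follows; the helper name msh_load() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * What "ldxb 4*([k]&0xf)" computes: the lower nibble of the byte at
 * offset k, multiplied by four.
 */
static uint32_t msh_load(const uint8_t *pkt, size_t pkt_len, size_t k)
{
        assert(k < pkt_len);    /* a real filter would drop the packet instead */
        return 4u * (pkt[k] & 0x0f);
}
```

For an Ethernet frame carrying an IPv4 header without options, byte 14 is
0x45, so the result is 20, the header length in bytes.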

The Linux kernel also has a couple of BPF extensions that are used along
with the class of load instructions by "overloading" the k argument with
a negative offset + a particular extension offset. The result of such a
BPF extension is loaded into A.
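
In C, the "overloading" shows up as an ordinary absolute load whose k is
built from the SKF_AD_* constants in <linux/filter.h>. The sketch below
encodes the ifidx extension by hand; the names ld_ifidx and ld_ifidx_k()
are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/filter.h>   /* BPF_*, SKF_AD_OFF, SKF_AD_IFINDEX */

/* Extension offsets live in the negative range of k. */
static_assert(SKF_AD_OFF < 0, "extension base offset must be negative");

/* "ld ifidx": a word load whose k selects the extension. */
static const struct sock_filter ld_ifidx =
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, SKF_AD_OFF + SKF_AD_IFINDEX);

/* The resulting k, seen as the unsigned 32 bit value stored in the tuple. */
static uint32_t ld_ifidx_k(void)
{
        return ld_ifidx.k;
}
```

With SKF_AD_OFF being -0x1000 and SKF_AD_IFINDEX being 8, k ends up as
0xfffff008, which the kernel treats as a request for the interface index
rather than as a packet offset.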

Possible BPF extensions are shown in the following table:

  ===================================   =================================================
  Extension                             Description
  ===================================   =================================================
  len                                   skb->len
  proto                                 skb->protocol
  type                                  skb->pkt_type
  poff                                  Payload start offset
  ifidx                                 skb->dev->ifindex
  nla                                   Netlink attribute of type X with offset A
  nlan                                  Nested Netlink attribute of type X with offset A
  mark                                  skb->mark
  queue                                 skb->queue_mapping
  hatype                                skb->dev->type
  rxhash                                skb->hash
  cpu                                   raw_smp_processor_id()
  vlan_tci                              skb_vlan_tag_get(skb)
  vlan_avail                            skb_vlan_tag_present(skb)
  vlan_tpid                             skb->vlan_proto
  rand                                  prandom_u32()
  ===================================   =================================================

These extensions can also be prefixed with '#'.

Examples for low-level BPF:

**ARP packets**::

  ldh [12]
  jne #0x806, drop
  ret #-1
  drop: ret #0

**IPv4 TCP packets**::

  ldh [12]
  jne #0x800, drop
  ldb [23]
  jneq #6, drop
  ret #-1
  drop: ret #0

**ICMP random packet sampling, 1 in 4**::

  ldh [12]
  jne #0x800, drop
  ldb [23]
  jneq #1, drop
  # get a random uint32 number
  ld rand
  mod #4
  jneq #1, drop
  ret #-1
  drop: ret #0

**SECCOMP filter example**::

  ld [4]                  /* offsetof(struct seccomp_data, arch) */
  jne #0xc000003e, bad    /* AUDIT_ARCH_X86_64 */
  ld [0]                  /* offsetof(struct seccomp_data, nr) */
  jeq #15, good           /* __NR_rt_sigreturn */
  jeq #231, good          /* __NR_exit_group */
  jeq #60, good           /* __NR_exit */
  jeq #0, good            /* __NR_read */
  jeq #1, good            /* __NR_write */
  jeq #5, good            /* __NR_fstat */
  jeq #9, good            /* __NR_mmap */
  jeq #14, good           /* __NR_rt_sigprocmask */
  jeq #13, good           /* __NR_rt_sigaction */
  jeq #35, good           /* __NR_nanosleep */
  bad: ret #0             /* SECCOMP_RET_KILL_THREAD */
  good: ret #0x7fff0000   /* SECCOMP_RET_ALLOW */
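
The same kind of SECCOMP filter can be spelled out in C with the macros
from <linux/filter.h>. The sketch below is a shortened variant of the
filter above that only lets exit_group and write through, so that the
relative jump offsets stay easy to follow. The names seccomp_sketch and
seccomp_sketch_len() are hypothetical, and the block only builds the
array without installing it.

```c
#include <assert.h>
#include <stddef.h>          /* offsetof() */
#include <linux/audit.h>     /* AUDIT_ARCH_X86_64 */
#include <linux/filter.h>    /* struct sock_filter, BPF_* */
#include <linux/seccomp.h>   /* struct seccomp_data, SECCOMP_RET_* */

/* The literal offsets used in the assembler version above. */
static_assert(offsetof(struct seccomp_data, nr) == 0, "nr is the first field");
static_assert(offsetof(struct seccomp_data, arch) == 4, "arch follows nr");

static const struct sock_filter seccomp_sketch[] = {
        /* 0: ld [4] */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        /* 1: jne #AUDIT_ARCH_X86_64, bad (false branch skips 3 to index 5) */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 3),
        /* 2: ld [0] */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* 3: jeq #231, good (__NR_exit_group on x86-64, skips 2 to index 6) */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 231, 2, 0),
        /* 4: jeq #1, good (__NR_write on x86-64, skips 1 to index 6) */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 1, 1, 0),
        /* 5: bad */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_THREAD),
        /* 6: good */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Number of instructions in the sketch. */
static size_t seccomp_sketch_len(void)
{
        return sizeof(seccomp_sketch) / sizeof(seccomp_sketch[0]);
}
```

Installing such a program is typically done with prctl(2), first setting
PR_SET_NO_NEW_PRIVS and then PR_SET_SECCOMP with SECCOMP_MODE_FILTER and
a struct sock_fprog pointing at the array. That step is left out here
because it would restrict the calling process.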

Examples for low-level BPF extensions:

**Packet for interface index 13**::

  ld ifidx
  jneq #13, drop
  ret #-1
  drop: ret #0

**(Accelerated) VLAN w/ id 10**::

  ld vlan_tci
  jneq #10, drop
  ret #-1
  drop: ret #0

The above example code can be placed into a file (here called "foo"), and
then be passed to the bpf_asm tool for generating opcodes, i.e. output that
xt_bpf and cls_bpf understand and that can be loaded directly. Example
with the above ARP code::

    $ ./bpf_asm foo
    4,40 0 0 12,21 0 1 2054,6 0 0 4294967295,6 0 0 0,

In copy and paste C-like output::

    $ ./bpf_asm -c foo
    { 0x28,  0,  0, 0x0000000c },
    { 0x15,  0,  1, 0x00000806 },
    { 0x06,  0,  0, 0xffffffff },
    { 0x06,  0,  0, 0000000000 },
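
The decimal form is simply the instruction count followed by one
"op jt jf k" group per instruction, separated by commas. A program that
wants to load such a string itself could parse it roughly as sketched
below; parse_bpf_decimal() is a hypothetical helper, not part of bpf_asm
or of any kernel header.

```c
#include <assert.h>
#include <stdio.h>
#include <linux/filter.h>   /* struct sock_filter */

/*
 * Parse "<len>,<op> <jt> <jf> <k>,..." into out[]. Returns the number
 * of instructions, or -1 on malformed input or if the program does not
 * fit into max entries.
 */
static int parse_bpf_decimal(const char *s, struct sock_filter *out, int max)
{
        unsigned int len, op, jt, jf, k;
        int used = 0, i;

        assert(s != NULL && out != NULL);

        /* Leading instruction count, terminated by a comma. */
        if (sscanf(s, "%u,%n", &len, &used) != 1 || used == 0 ||
            max < 0 || len > (unsigned int)max)
                return -1;
        s += used;

        for (i = 0; i < (int)len; i++) {
                used = 0;
                if (sscanf(s, "%u %u %u %u%n", &op, &jt, &jf, &k, &used) != 4 ||
                    used == 0 || op > 0xffff || jt > 0xff || jf > 0xff)
                        return -1;
                out[i] = (struct sock_filter){ op, jt, jf, k };
                s += used;
                if (*s == ',')  /* separator; bpf_asm also emits a trailing one */
                        s++;
        }
        return (int)len;
}
```

Fed with the ARP output shown above, the helper fills four entries whose
op values 40, 21, 6 and 6 correspond to 0x28, 0x15, 0x06 and 0x06 in the
C-like output.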

In particular, as usage with xt_bpf or cls_bpf can result in more complex BPF
filters that might not be obvious at first, it's good to test filters before
attaching them to a live system. For that purpose, there's a small tool called
bpf_dbg under tools/bpf/ in the kernel source directory. This debugger allows
testing BPF filters against given pcap files, single stepping through the
BPF code on the pcap's packets, and dumping the BPF machine registers.

Starting bpf_dbg is trivial and just requires issuing::

    # ./bpf_dbg

In case input and output do not equal stdin/stdout, bpf_dbg takes an
alternative stdin source as a first argument, and an alternative stdout
sink as a second one, e.g. `./bpf_dbg test_in.txt test_out.txt`.

Other than that, a particular libreadline configuration can be set via
file "~/.bpf_dbg_init" and the command history is stored in the file
"~/.bpf_dbg_history".

Interaction in bpf_dbg happens through a shell that also has auto-completion
support (follow-up example commands starting with '>' denote bpf_dbg shell).
The usual workflow would be to ...

* load bpf 6,40 0 0 12,21 0 3 2048,48 0 0 23,21 0 1 1,6 0 0 65535,6 0 0 0

  Loads a BPF filter from standard output of bpf_asm, or transformed via
  e.g. ``tcpdump -iem1 -ddd port 22 | tr '\n' ','``. Note that for JIT
  debugging (next section), this command creates a temporary socket and
  loads the BPF code into the kernel. Thus, this will also be useful for
  JIT developers.

* load pcap foo.pcap

  Loads standard tcpdump pcap file.

* run [<n>]::

        bpf passes:1 fails:9

  Runs through all packets from a pcap to account how many passes and fails
  the filter will generate. A limit of packets to traverse can be given.

* disassemble::

        l0:     ldh [12]
        l1:     jeq #0x800, l2, l5
        l2:     ldb [23]
        l3:     jeq #0x1, l4, l5
        l4:     ret #0xffff
        l5:     ret #0

  Prints out BPF code disassembly.

* dump::

        /* { op, jt, jf, k }, */
        { 0x28,  0,  0, 0x0000000c },
        { 0x15,  0,  3, 0x00000800 },
        { 0x30,  0,  0, 0x00000017 },
        { 0x15,  0,  1, 0x00000001 },
        { 0x06,  0,  0, 0x0000ffff },
        { 0x06,  0,  0, 0000000000 },

  Prints out C-style BPF code dump.

* breakpoint 0::

        breakpoint at: l0:      ldh [12]

* breakpoint 1::

        breakpoint at: l1:      jeq #0x800, l2, l5

  ...

  Sets breakpoints at particular BPF instructions. Issuing a `run` command
  will walk through the pcap file continuing from the current packet and
  break when a breakpoint is being hit (another `run` will continue from
  the currently active breakpoint executing next instructions):

  * run::

        -- register dump --
        pc:       [0]                       <-- program counter
        code:     [40] jt[0] jf[0] k[12]    <-- plain BPF code of current instruction
        curr:     l0:   ldh [12]            <-- disassembly of current instruction
        A:        [00000000][0]             <-- content of A (hex, decimal)
        X:        [00000000][0]             <-- content of X (hex, decimal)
        M[0,15]:  [00000000][0]             <-- folded content of M (hex, decimal)
        -- packet dump --                   <-- Current packet from pcap (hex)
        len: 42
          0: 00 19 cb 55 55 a4 00 14 a4 43 78 69 08 06 00 01
         16: 08 00 06 04 00 01 00 14 a4 43 78 69 0a 3b 01 26
         32: 00 00 00 00 00 00 0a 3b 01 01
        (breakpoint)
        >

  * breakpoint::

        breakpoints: 0 1

    Prints currently set breakpoints.

* step [-<n>, +<n>]

  Performs single stepping through the BPF program from the current pc
  offset. Thus, on each step invocation, above register dump is issued.
  This can go forwards and backwards in time, a plain `step` will break
  on the next BPF instruction, thus +1. (No `run` needs to be issued here.)

* select <n>

  Selects a given packet from the pcap file to continue from. Thus, on
  the next `run` or `step`, the BPF program is being evaluated against
  the user pre-selected packet. Numbering starts just as in Wireshark
  with index 1.

* quit

  Exits bpf_dbg.

JIT compiler
------------

The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC,
PowerPC, ARM, ARM64, MIPS, RISC-V and s390, which can be enabled through
CONFIG_BPF_JIT. The JIT compiler is transparently invoked for each
attached filter from user space or for internal kernel users if it has
been previously enabled by root::

  echo 1 > /proc/sys/net/core/bpf_jit_enable

For JIT developers, or when doing audits etc., each compile run can output
the generated opcode image into the kernel log via::

  echo 2 > /proc/sys/net/core/bpf_jit_enable

Example output from dmesg::

    [ 3389.935842] flen=6 proglen=70 pass=3 image=ffffffffa0069c8f
    [ 3389.935847] JIT code: 00000000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 68
    [ 3389.935849] JIT code: 00000010: 44 2b 4f 6c 4c 8b 87 d8 00 00 00 be 0c 00 00 00
    [ 3389.935850] JIT code: 00000020: e8 1d 94 ff e0 3d 00 08 00 00 75 16 be 17 00 00
    [ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00
    [ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3

When CONFIG_BPF_JIT_ALWAYS_ON is enabled, bpf_jit_enable is permanently set to 1 and
setting any other value than that will result in failure. This is even the case for
setting bpf_jit_enable to 2, since dumping the final JIT image into the kernel log
is discouraged and introspection through bpftool (under tools/bpf/bpftool/) is the
generally recommended approach instead.

In the kernel source tree under tools/bpf/, there's bpf_jit_disasm for
generating disassembly out of the kernel log's hexdump::

        # ./bpf_jit_disasm
        70 bytes emitted from JIT compiler (pass:3, flen:6)
        ffffffffa0069c8f + <x>:
        0:      push   %rbp
        1:      mov    %rsp,%rbp
        4:      sub    $0x60,%rsp
        8:      mov    %rbx,-0x8(%rbp)
        c:      mov    0x68(%rdi),%r9d
        10:     sub    0x6c(%rdi),%r9d
        14:     mov    0xd8(%rdi),%r8
        1b:     mov    $0xc,%esi
        20:     callq  0xffffffffe0ff9442
        25:     cmp    $0x800,%eax
        2a:     jne    0x0000000000000042
        2c:     mov    $0x17,%esi
        31:     callq  0xffffffffe0ff945e
        36:     cmp    $0x1,%eax
        39:     jne    0x0000000000000042
        3b:     mov    $0xffff,%eax
        40:     jmp    0x0000000000000044
        42:     xor    %eax,%eax
        44:     leaveq
        45:     retq

Issuing option `-o` will "annotate" opcodes to resulting assembler
instructions, which can be very useful for JIT developers::

        # ./bpf_jit_disasm -o
        70 bytes emitted from JIT compiler (pass:3, flen:6)
        ffffffffa0069c8f + <x>:
        0:      push   %rbp
                55
        1:      mov    %rsp,%rbp
                48 89 e5
        4:      sub    $0x60,%rsp
                48 83 ec 60
        8:      mov    %rbx,-0x8(%rbp)
                48 89 5d f8
        c:      mov    0x68(%rdi),%r9d
                44 8b 4f 68
        10:     sub    0x6c(%rdi),%r9d
                44 2b 4f 6c
        14:     mov    0xd8(%rdi),%r8
                4c 8b 87 d8 00 00 00
        1b:     mov    $0xc,%esi
                be 0c 00 00 00
        20:     callq  0xffffffffe0ff9442
                e8 1d 94 ff e0
        25:     cmp    $0x800,%eax
                3d 00 08 00 00
        2a:     jne    0x0000000000000042
                75 16
        2c:     mov    $0x17,%esi
                be 17 00 00 00
        31:     callq  0xffffffffe0ff945e
                e8 28 94 ff e0
        36:     cmp    $0x1,%eax
                83 f8 01
        39:     jne    0x0000000000000042
                75 07
        3b:     mov    $0xffff,%eax
                b8 ff ff 00 00
        40:     jmp    0x0000000000000044
                eb 02
        42:     xor    %eax,%eax
                31 c0
        44:     leaveq
                c9
        45:     retq
                c3

For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provide a useful
toolchain for developing and testing the kernel's JIT compiler.

BPF kernel internals
--------------------

Internally, for the kernel interpreter, a different instruction set
format with similar underlying principles from BPF described in previous
paragraphs is being used. However, the instruction set format is modelled
closer to the underlying architecture to mimic native instruction sets, so
that better performance can be achieved. This new ISA is called eBPF.
See the ../bpf/index.rst for details.  (Note: eBPF which
originates from [e]xtended BPF is not the same as BPF extensions! While
eBPF is an ISA, BPF extensions date back to classic BPF's 'overloading'
of BPF_LD | BPF_{B,H,W} | BPF_ABS instruction.)

The new instruction set was originally designed with the possible goal in
mind to write programs in "restricted C" and compile into eBPF with an optional
GCC/LLVM backend, so that it can just-in-time map to modern 64-bit CPUs with
minimal performance overhead over two steps, that is, C -> eBPF -> native code.

Currently, the new format is being used for running user BPF programs, which
includes seccomp BPF, classic socket filters, cls_bpf traffic classifier,
team driver's classifier for its load-balancing mode, netfilter's xt_bpf
extension, PTP dissector/classifier, and much more. They are all internally
converted by the kernel into the new instruction set representation and run
in the eBPF interpreter. For in-kernel handlers, this all works transparently
by using bpf_prog_create() for setting up the filter and
bpf_prog_destroy() for destroying it. The function
bpf_prog_run(filter, ctx) transparently invokes the eBPF interpreter or JITed
code to run the filter. 'filter' is a pointer to struct bpf_prog that we
got from bpf_prog_create(), and 'ctx' is the given context (e.g.
skb pointer). All constraints and restrictions from bpf_check_classic() apply
before a conversion to the new layout is being done behind the scenes!

Currently, the classic BPF format is being used for JITing on most
32-bit architectures, whereas x86-64, aarch64, s390x, powerpc64,
sparc64, arm32, riscv64, riscv32 perform JIT compilation from the eBPF
instruction set.

Testing
-------

In addition to the BPF toolchain, the kernel also ships a test module that
contains various test cases for classic BPF and eBPF that can be executed
against the BPF interpreter and JIT compiler. It can be found in
lib/test_bpf.c and enabled via Kconfig::

  CONFIG_TEST_BPF=m

After the module has been built and installed, the test suite can be executed
via insmod or modprobe against the 'test_bpf' module. Results of the test cases
including timings in nsec can be found in the kernel log (dmesg).

Misc
----

Also trinity, the Linux syscall fuzzer, has built-in support for BPF and
SECCOMP-BPF kernel fuzzing.

Written by
----------

The document was written in the hope that it is found useful and in order
to give potential BPF hackers or security auditors a better overview of
the underlying architecture.

- Jay Schulist <jschlst@samba.org>
- Daniel Borkmann <daniel@iogearbox.net>
- Alexei Starovoitov <ast@kernel.org>