===========================================
Seccomp BPF (SECure COMPuting with filters)
===========================================

Introduction
============

A large number of system calls are exposed to every userland process,
with many of them going unused for the entire lifetime of the process.
As system calls change and mature, bugs are found and eradicated. A
certain subset of userland applications benefit by having a reduced set
of available system calls. The resulting set reduces the total kernel
surface exposed to the application. System call filtering is meant for
use with those applications.

Seccomp filtering provides a means for a process to specify a filter for
incoming system calls. The filter is expressed as a Berkeley Packet
Filter (BPF) program, as with socket filters, except that the data
operated on is related to the system call being made: system call
number and the system call arguments. This allows for expressive
filtering of system calls using a filter program language with a long
history of being exposed to userland and a straightforward data set.

Additionally, BPF makes it impossible for users of seccomp to fall prey
to time-of-check-time-of-use (TOCTOU) attacks that are common in system
call interposition frameworks. BPF programs may not dereference
pointers, which constrains all filters to solely evaluating the system
call arguments directly.

What it isn't
=============

System call filtering isn't a sandbox. It provides a clearly defined
mechanism for minimizing the exposed kernel surface. It is meant to be
a tool for sandbox developers to use. Beyond that, policy for logical
behavior and information flow should be managed with a combination of
other system hardening techniques and, potentially, an LSM of your
choosing. Expressive, dynamic filters provide further options down this
path (avoiding pathological sizes or selecting which of the multiplexed
system calls in socketcall() is allowed, for instance) which could be
construed, incorrectly, as a more complete sandboxing solution.

Usage
=====

An additional seccomp mode is added and is enabled using the same
prctl(2) call as the strict seccomp. If the architecture has
``CONFIG_HAVE_ARCH_SECCOMP_FILTER``, then filters may be added as below:

``PR_SET_SECCOMP``:
    Now takes an additional argument which specifies a new filter
    using a BPF program.
    The BPF program will be executed over struct seccomp_data
    reflecting the system call number, arguments, and other
    metadata. The BPF program must then return one of the
    acceptable values to inform the kernel which action should be
    taken.

    Usage::

        prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog);

    The ``prog`` argument is a pointer to a struct sock_fprog which
    will contain the filter program. If the program is invalid, the
    call will return -1 and set errno to ``EINVAL``.

    If ``fork``/``clone`` and ``execve`` are allowed by ``prog``, any child
    processes will be constrained to the same filters and system
    call ABI as the parent.

    Prior to use, the task must call ``prctl(PR_SET_NO_NEW_PRIVS, 1)`` or
    run with ``CAP_SYS_ADMIN`` privileges in its namespace. If neither is
    true, ``-EACCES`` will be returned. This requirement ensures that filter
    programs cannot be applied to child processes with greater privileges
    than the task that installed them.

    Additionally, if ``prctl(2)`` is allowed by the attached filter,
    additional filters may be layered on which will increase evaluation
    time, but allow for further decreasing the attack surface during
    execution of a process.

The above call returns 0 on success and non-zero on error.
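
The sequence described above can be sketched as follows. This is an
illustrative outline rather than a reference implementation: the names
``allow_all_insns``, ``allow_all_prog`` and ``install_seccomp_filter()`` are
hypothetical, and the one-instruction program simply returns
``SECCOMP_RET_ALLOW`` for every system call.

.. code-block:: c

    #include <linux/filter.h>
    #include <linux/seccomp.h>
    #include <sys/prctl.h>

    /* Hypothetical one-instruction program: allow every system call. */
    struct sock_filter allow_all_insns[] = {
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };

    struct sock_fprog allow_all_prog = {
        .len = sizeof(allow_all_insns) / sizeof(allow_all_insns[0]),
        .filter = allow_all_insns,
    };

    /* Hypothetical helper: returns 0 on success, -1 with errno set on error. */
    int install_seccomp_filter(struct sock_fprog *prog)
    {
        /* Needed unless the caller has CAP_SYS_ADMIN in its namespace. */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
            return -1;

        return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog);
    }

A caller would run ``install_seccomp_filter(&allow_all_prog)`` once, before
the code it wants to confine. A real filter would replace the single
instruction with a program that inspects ``struct seccomp_data``.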

Return values
=============

A seccomp filter may return any of the following values. If multiple
filters exist, the return value for the evaluation of a given system
call will always use the highest-precedence value. (For example,
``SECCOMP_RET_KILL_PROCESS`` will always take precedence.)

In precedence order, they are:

``SECCOMP_RET_KILL_PROCESS``:
    Results in the entire process exiting immediately without executing
    the system call. The exit status of the task (``status & 0x7f``)
    will be ``SIGSYS``, not ``SIGKILL``.

``SECCOMP_RET_KILL_THREAD``:
    Results in the task exiting immediately without executing the
    system call. The exit status of the task (``status & 0x7f``) will
    be ``SIGSYS``, not ``SIGKILL``.

``SECCOMP_RET_TRAP``:
    Results in the kernel sending a ``SIGSYS`` signal to the triggering
    task without executing the system call. ``siginfo->si_call_addr``
    will show the address of the system call instruction, and
    ``siginfo->si_syscall`` and ``siginfo->si_arch`` will indicate which
    syscall was attempted. The program counter will be as though
    the syscall happened (i.e. it will not point to the syscall
    instruction). The return value register will contain an
    arch-dependent value -- if resuming execution, set it to something
    sensible. (The architecture dependency is because replacing
    it with ``-ENOSYS`` could overwrite some useful information.)

    The ``SECCOMP_RET_DATA`` portion of the return value will be passed
    as ``si_errno``.

    ``SIGSYS`` triggered by seccomp will have an ``si_code`` of ``SYS_SECCOMP``.

``SECCOMP_RET_ERRNO``:
    Results in the lower 16 bits of the return value being passed
    to userland as the errno without executing the system call.

``SECCOMP_RET_USER_NOTIF``:
    Results in a ``struct seccomp_notif`` message sent on the userspace
    notification fd, if it is attached, or ``-ENOSYS`` if it is not. See
    below on discussion of how to handle user notifications.

``SECCOMP_RET_TRACE``:
    When returned, this value will cause the kernel to attempt to
    notify a ``ptrace()``-based tracer prior to executing the system
    call. If there is no tracer present, ``-ENOSYS`` is returned to
    userland and the system call is not executed.

    A tracer will be notified if it requests ``PTRACE_O_TRACESECCOMP``
    using ``ptrace(PTRACE_SETOPTIONS)``. The tracer will be notified
    of a ``PTRACE_EVENT_SECCOMP`` and the ``SECCOMP_RET_DATA`` portion of
    the BPF program return value will be available to the tracer
    via ``PTRACE_GETEVENTMSG``.

    The tracer can skip the system call by changing the syscall number
    to -1. Alternatively, the tracer can change the system call
    requested by changing the system call to a valid syscall number. If
    the tracer asks to skip the system call, then the system call will
    appear to return the value that the tracer puts in the return value
    register.

    The seccomp check will not be run again after the tracer is
    notified. (This means that seccomp-based sandboxes MUST NOT
    allow use of ptrace, even of other sandboxed processes, without
    extreme care; ptracers can use this mechanism to escape.)

``SECCOMP_RET_LOG``:
    Results in the system call being executed after it is logged. This
    should be used by application developers to learn which syscalls their
    application needs without having to iterate through multiple test and
    development cycles to build the list.

    This action will only be logged if "log" is present in the
    actions_logged sysctl string.

``SECCOMP_RET_ALLOW``:
    Results in the system call being executed.

If multiple filters exist, the return value for the evaluation of a
given system call will always use the highest-precedence value.

Precedence is only determined using the ``SECCOMP_RET_ACTION`` mask. When
multiple filters return values of the same precedence, only the
``SECCOMP_RET_DATA`` from the most recently installed filter will be
returned.
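
To make the split between action and data concrete, the sketch below takes a
32-bit filter return value apart and builds one for ``SECCOMP_RET_ERRNO``. The
helper names are hypothetical. ``SECCOMP_RET_ACTION_FULL`` is used instead of
``SECCOMP_RET_ACTION`` so that ``SECCOMP_RET_KILL_PROCESS`` stays
distinguishable from ``SECCOMP_RET_KILL_THREAD``; that is a choice made for
this example, not a description of how the kernel compares values.

.. code-block:: c

    #include <errno.h>
    #include <stdint.h>
    #include <linux/seccomp.h>

    /* Action part of a filter return value. */
    uint32_t seccomp_ret_action(uint32_t ret)
    {
        return ret & SECCOMP_RET_ACTION_FULL;
    }

    /* Data part: the errno for SECCOMP_RET_ERRNO, si_errno for SECCOMP_RET_TRAP. */
    uint32_t seccomp_ret_data(uint32_t ret)
    {
        return ret & SECCOMP_RET_DATA;
    }

    /* Hypothetical helper: encode "fail with this errno" as a return value. */
    uint32_t seccomp_ret_errno(int err)
    {
        return SECCOMP_RET_ERRNO | ((uint32_t)err & SECCOMP_RET_DATA);
    }

For instance, ``seccomp_ret_errno(EPERM)`` yields a value whose action part is
``SECCOMP_RET_ERRNO`` and whose data part is ``EPERM``.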

Pitfalls
========

The biggest pitfall to avoid during use is filtering on system call
number without checking the architecture value. Why? On any
architecture that supports multiple system call invocation conventions,
the system call numbers may vary based on the specific invocation. If
the numbers in the different calling conventions overlap, then checks in
the filters may be abused. Always check the arch value!
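
A filter that follows this advice might look like the sketch below. It assumes
a process that is only ever expected to run as x86-64 (``AUDIT_ARCH_X86_64``),
and the policy itself (fail ``getpid`` with ``EPERM``, allow everything else)
is an arbitrary illustration; ``arch_checked_filter`` is a hypothetical name.

.. code-block:: c

    #include <errno.h>
    #include <stddef.h>
    #include <linux/audit.h>
    #include <linux/filter.h>
    #include <linux/seccomp.h>
    #include <sys/syscall.h>

    struct sock_filter arch_checked_filter[] = {
        /* Load the architecture token first. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        /* Expected arch: skip the kill below. Anything else: fall into it. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),

        /* Only now is the system call number meaningful. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* getpid: fall through to the errno return. Otherwise skip to allow. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getpid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };

Because the architecture test comes first, a call made through a different
convention never reaches the comparison against ``__NR_getpid``.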

Example
=======

The ``samples/seccomp/`` directory contains both an x86-specific example
and a more generic example of a higher level macro interface for BPF
program generation.

Userspace Notification
======================

The ``SECCOMP_RET_USER_NOTIF`` return code lets seccomp filters pass a
particular syscall to userspace to be handled. This may be useful for
applications like container managers, which wish to intercept particular
syscalls (``mount()``, ``finit_module()``, etc.) and change their behavior.

To acquire a notification FD, use the ``SECCOMP_FILTER_FLAG_NEW_LISTENER``
argument to the ``seccomp()`` syscall:

.. code-block:: c

    fd = seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);

which (on success) will return a listener fd for the filter, which can then be
passed around via ``SCM_RIGHTS`` or similar. Note that filter fds correspond to
a particular filter, and not a particular task. So if this task then forks,
notifications from both tasks will appear on the same filter fd. Reads and
writes to/from a filter fd are also synchronized, so a filter fd can safely
have many readers.
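
The C library in use may not provide a ``seccomp()`` wrapper, in which case
the call has to go through ``syscall(2)``. The sketch below shows one way to
do that; ``sys_seccomp()`` and ``install_with_listener()`` are hypothetical
names chosen for this example.

.. code-block:: c

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/filter.h>
    #include <linux/seccomp.h>

    /* Thin wrapper over the raw system call. */
    int sys_seccomp(unsigned int op, unsigned int flags, void *args)
    {
        return (int)syscall(__NR_seccomp, op, flags, args);
    }

    /*
     * Install prog and return its listener fd, or -1 with errno set.
     * The caller must already satisfy the no_new_privs or CAP_SYS_ADMIN
     * requirement that applies to installing any filter.
     */
    int install_with_listener(struct sock_fprog *prog)
    {
        return sys_seccomp(SECCOMP_SET_MODE_FILTER,
                           SECCOMP_FILTER_FLAG_NEW_LISTENER, prog);
    }

The returned descriptor is the one a supervisor would later poll and issue
the notification ioctls on.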

The interface for a seccomp notification fd uses the following structures:

.. code-block:: c

    struct seccomp_notif_sizes {
        __u16 seccomp_notif;
        __u16 seccomp_notif_resp;
        __u16 seccomp_data;
    };

    struct seccomp_notif {
        __u64 id;
        __u32 pid;
        __u32 flags;
        struct seccomp_data data;
    };

    struct seccomp_notif_resp {
        __u64 id;
        __s64 val;
        __s32 error;
        __u32 flags;
    };

The ``struct seccomp_notif_sizes`` structure can be used to determine the size
of the various structures used in seccomp notifications. The size of ``struct
seccomp_data`` may change in the future, so code should use:

.. code-block:: c

    struct seccomp_notif_sizes sizes;
    seccomp(SECCOMP_GET_NOTIF_SIZES, 0, &sizes);

to determine the size of the various structures to allocate. See
``samples/seccomp/user-trap.c`` for an example.
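
Once the sizes are known, the buffers can be allocated from them rather than
from ``sizeof()`` alone. The following is a minimal sketch of that step;
``struct notif_bufs`` and ``alloc_notif_bufs()`` are hypothetical, and taking
the larger of the reported and compiled-in sizes is a defensive choice made
for this example.

.. code-block:: c

    #include <stdlib.h>
    #include <linux/seccomp.h>

    /* Hypothetical container for the two buffers a supervisor needs. */
    struct notif_bufs {
        struct seccomp_notif *req;
        struct seccomp_notif_resp *resp;
    };

    /*
     * Allocate zeroed buffers at least as large as both the kernel-reported
     * sizes and the sizes this program was compiled against.
     * Returns 0 on success, -1 on allocation failure.
     */
    int alloc_notif_bufs(const struct seccomp_notif_sizes *sizes,
                         struct notif_bufs *bufs)
    {
        size_t req_len = sizes->seccomp_notif;
        size_t resp_len = sizes->seccomp_notif_resp;

        if (req_len < sizeof(*bufs->req))
            req_len = sizeof(*bufs->req);
        if (resp_len < sizeof(*bufs->resp))
            resp_len = sizeof(*bufs->resp);

        bufs->req = calloc(1, req_len);
        bufs->resp = calloc(1, resp_len);
        if (!bufs->req || !bufs->resp) {
            free(bufs->req);
            free(bufs->resp);
            bufs->req = NULL;
            bufs->resp = NULL;
            return -1;
        }
        return 0;
    }

The ``sizes`` argument is expected to have been filled in by
``SECCOMP_GET_NOTIF_SIZES`` beforehand.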

Users can read via ``ioctl(SECCOMP_IOCTL_NOTIF_RECV)`` (or ``poll()``) on a
seccomp notification fd to receive a ``struct seccomp_notif``, which contains
four members: a unique-per-filter ``id``, the ``pid`` of the task which
triggered this request (which may be 0 if the task is in a pid ns not visible
from the listener's pid namespace), a ``flags`` field, and the ``data`` passed
to seccomp. The structure should be zeroed out prior to calling the ioctl.

Userspace can then make a decision based on this information about what to do,
and ``ioctl(SECCOMP_IOCTL_NOTIF_SEND)`` a response, indicating what should be
returned to userspace. The ``id`` member of ``struct seccomp_notif_resp`` should
be the same ``id`` as in ``struct seccomp_notif``.
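
The receive/respond cycle can be sketched as below. The helper names are
hypothetical, and the convention of storing a negated errno in ``error`` is an
assumption based on how the in-tree sample behaves; treat this as an outline
of the flow rather than a complete supervisor.

.. code-block:: c

    #include <errno.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/seccomp.h>

    /* Fill resp so that the intercepted syscall fails with errno err. */
    void fill_error_resp(const struct seccomp_notif *req,
                         struct seccomp_notif_resp *resp, int err)
    {
        memset(resp, 0, sizeof(*resp));
        resp->id = req->id;   /* must match the notification being answered */
        resp->error = -err;   /* assumption: negated errno */
        resp->val = 0;
    }

    /* Receive one notification and deny it with EPERM; 0 on success, -1 on error. */
    int deny_one_notification(int listener, struct seccomp_notif *req,
                              struct seccomp_notif_resp *resp)
    {
        memset(req, 0, sizeof(*req));   /* the structure must be zeroed first */
        if (ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, req) != 0)
            return -1;

        fill_error_resp(req, resp, EPERM);
        return ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, resp);
    }

A real supervisor would inspect ``req->data`` between the two ioctls and
choose ``val`` or ``error`` accordingly instead of always denying.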

Userspace can also add file descriptors to the notifying process via
``ioctl(SECCOMP_IOCTL_NOTIF_ADDFD)``. The ``id`` member of
``struct seccomp_notif_addfd`` should be the same ``id`` as in
``struct seccomp_notif``. The ``newfd_flags`` member may be used to set flags
like ``O_CLOEXEC`` on the file descriptor in the notifying process. If the
supervisor wants to inject the file descriptor with a specific number, it can
set the ``SECCOMP_ADDFD_FLAG_SETFD`` flag and set the ``newfd`` member to
the specific number to use. If that file descriptor is already open in the
notifying process it will be replaced. The supervisor can also add an FD and
respond atomically by using the ``SECCOMP_ADDFD_FLAG_SEND`` flag; the return
value will then be the injected file descriptor number.
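
Filling in the request for ``SECCOMP_IOCTL_NOTIF_ADDFD`` might look like the
sketch below. ``fill_addfd()`` is a hypothetical helper, and the convention
that a negative ``target_fd`` means "let the kernel pick the number" belongs
to this example only.

.. code-block:: c

    #include <fcntl.h>
    #include <string.h>
    #include <linux/seccomp.h>

    /*
     * Prepare addfd so that srcfd (a descriptor in the supervisor) is
     * installed into the process that triggered req.
     */
    void fill_addfd(struct seccomp_notif_addfd *addfd,
                    const struct seccomp_notif *req,
                    int srcfd, int target_fd)
    {
        memset(addfd, 0, sizeof(*addfd));
        addfd->id = req->id;              /* same id as the notification */
        addfd->srcfd = (__u32)srcfd;
        addfd->newfd_flags = O_CLOEXEC;   /* close-on-exec in the notifier */
        if (target_fd >= 0) {
            /* Ask for a specific number; an open fd there is replaced. */
            addfd->flags = SECCOMP_ADDFD_FLAG_SETFD;
            addfd->newfd = (__u32)target_fd;
        }
    }

The filled structure is then passed to
``ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd)`` before the response is
sent.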

The notifying process can be preempted, resulting in the notification being
aborted. This can be problematic when trying to take actions on behalf of the
notifying process that are long-running and typically retryable (mounting a
filesystem). To avoid this, the ``SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV``
flag can be set at filter installation time. This flag makes it
such that when a user notification is received by the supervisor, the notifying
process will ignore non-fatal signals until the response is sent. Signals that
are sent prior to the notification being received by userspace are handled
normally.

It is worth noting that ``struct seccomp_data`` contains the values of register
arguments to the syscall, but does not contain the memory that those arguments
may point to. The task's memory is accessible to suitably privileged tracers
via ``ptrace()`` or ``/proc/pid/mem``. However, care should be taken to avoid
the TOCTOU mentioned above in this document: all arguments being read from the
tracee's memory should be read into the tracer's memory before any policy
decisions are made. This allows for an atomic decision on syscall arguments.
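
One way to follow this advice is to copy the argument into the supervisor's
own buffer first and only then evaluate it. The sketch below does so through
``/proc/pid/mem``. ``copy_from_notifier()`` is a hypothetical helper, and
``SECCOMP_IOCTL_NOTIF_ID_VALID`` is not discussed in this document; it is used
here on the assumption that it reports whether the notification is still
pending, so that a recycled pid is not mistaken for the original task.

.. code-block:: c

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/types.h>
    #include <linux/seccomp.h>

    /*
     * Copy up to len bytes at addr in task pid into buf, then confirm
     * that notification id is still pending.
     * Returns the number of bytes copied, or -1 on error.
     */
    ssize_t copy_from_notifier(int listener, uint64_t id, pid_t pid,
                               uint64_t addr, void *buf, size_t len)
    {
        char path[64];
        ssize_t n;
        int fd;

        snprintf(path, sizeof(path), "/proc/%d/mem", (int)pid);
        fd = open(path, O_RDONLY | O_CLOEXEC);
        if (fd < 0)
            return -1;

        n = pread(fd, buf, len, (off_t)addr);
        close(fd);
        if (n < 0)
            return -1;

        /* The copy is only trustworthy if the notification is still live. */
        if (ioctl(listener, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) != 0)
            return -1;

        return n;
    }

Policy checks should then run against ``buf`` only, never against a second
read of the notifying task's memory.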

Sysctls
=======

Seccomp's sysctl files can be found in the ``/proc/sys/kernel/seccomp/``
directory. Here's a description of each file in that directory:

``actions_avail``:
    A read-only ordered list of seccomp return values (refer to the
    ``SECCOMP_RET_*`` macros above) in string form. The ordering, from
    left-to-right, is the least permissive return value to the most
    permissive return value.

    The list represents the set of seccomp return values supported
    by the kernel. A userspace program may use this list to
    determine if the actions found in ``seccomp.h`` when the
    program was built differ from the set of actions actually
    supported in the current running kernel.

``actions_logged``:
    A read-write ordered list of seccomp return values (refer to the
    ``SECCOMP_RET_*`` macros above) that are allowed to be logged. Writes
    to the file do not need to be in ordered form but reads from the file
    will be ordered in the same way as the ``actions_avail`` sysctl.

    The ``allow`` string is not accepted in the ``actions_logged`` sysctl
    as it is not possible to log ``SECCOMP_RET_ALLOW`` actions. Attempting
    to write ``allow`` to the sysctl will result in an ``EINVAL`` being
    returned.
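
A program that wants to check ``actions_avail`` at run time could do something
like the following. This is a sketch that assumes the file holds a single line
of whitespace-separated action names; both helper names are hypothetical.

.. code-block:: c

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    /* True if action appears as a whole word in the whitespace-separated list. */
    bool action_listed(const char *list, const char *action)
    {
        size_t want = strlen(action);
        const char *p = list;

        while (*p) {
            size_t len = strcspn(p, " \t\n");

            if (len == want && strncmp(p, action, len) == 0)
                return true;
            p += len;
            p += strspn(p, " \t\n");
        }
        return false;
    }

    /* 1 if the running kernel lists action, 0 if it does not, -1 on error. */
    int kernel_supports_action(const char *action)
    {
        char line[256];
        FILE *f = fopen("/proc/sys/kernel/seccomp/actions_avail", "r");

        if (!f)
            return -1;
        if (!fgets(line, sizeof(line), f)) {
            fclose(f);
            return -1;
        }
        fclose(f);
        return action_listed(line, action) ? 1 : 0;
    }

Matching whole words matters here: a substring search for ``kill`` would
otherwise match both kill actions.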

Adding architecture support
===========================

See ``arch/Kconfig`` for the authoritative requirements. In general, if an
architecture supports both ptrace_event and seccomp, it will be able to
support seccomp filter with minor fixup: ``SIGSYS`` support and seccomp return
value checking. Then it must just add ``CONFIG_HAVE_ARCH_SECCOMP_FILTER``
to its arch-specific Kconfig.

Caveats
=======

The vDSO can cause some system calls to run entirely in userspace,
leading to surprises when you run programs on different machines that
fall back to real syscalls. To minimize these surprises on x86, make
sure you test with
``/sys/devices/system/clocksource/clocksource0/current_clocksource`` set to
something like ``acpi_pm``.

On x86-64, vsyscall emulation is enabled by default. (vsyscalls are
legacy variants on vDSO calls.) Currently, emulated vsyscalls will
honor seccomp, with a few oddities:

- A return value of ``SECCOMP_RET_TRAP`` will set a ``si_call_addr`` pointing to
  the vsyscall entry for the given call and not the address after the
  'syscall' instruction. Any code which wants to restart the call
  should be aware that (a) a ret instruction has been emulated and (b)
  trying to resume the syscall will again trigger the standard vsyscall
  emulation security checks, making resuming the syscall mostly
  pointless.

- A return value of ``SECCOMP_RET_TRACE`` will signal the tracer as usual,
  but the syscall may not be changed to another system call using the
  orig_rax register. It may only be changed to -1 in order to skip the
  currently emulated call. Any other change MAY terminate the process.
  The rip value seen by the tracer will be the syscall entry address;
  this is different from normal behavior. The tracer MUST NOT modify
  rip or rsp. (Do not rely on other changes terminating the process.
  They might work. For example, on some kernels, choosing a syscall
  that only exists in future kernels will be correctly emulated (by
  returning ``-ENOSYS``).)

To detect this quirky behavior, check for ``(addr & ~0x0C00) ==
0xFFFFFFFFFF600000``. (For ``SECCOMP_RET_TRACE``, use rip. For
``SECCOMP_RET_TRAP``, use ``siginfo->si_call_addr``.) Do not check any other
condition: future kernels may improve vsyscall emulation and current
kernels in vsyscall=native mode will behave differently, but the
instructions at ``0xF...F600{0,4,8,C}00`` will not be system calls in these
cases.
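
Written out in C, the check needs explicit parentheses because ``==`` binds
more tightly than ``&``. A minimal sketch, with ``is_vsyscall_entry()`` as a
hypothetical name:

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * True if addr is one of the four vsyscall entry points, i.e.
     * 0xFFFFFFFFFF600000 plus 0x000, 0x400, 0x800 or 0xC00.
     */
    bool is_vsyscall_entry(uint64_t addr)
    {
        return (addr & ~UINT64_C(0x0C00)) == UINT64_C(0xFFFFFFFFFF600000);
    }

The argument would be rip for ``SECCOMP_RET_TRACE`` or
``siginfo->si_call_addr`` for ``SECCOMP_RET_TRAP``.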

Note that modern systems are unlikely to use vsyscalls at all -- they
are a legacy feature and they are considerably slower than standard
syscalls. New code will use the vDSO, and vDSO-issued system calls
are indistinguishable from normal system calls.