
0001 ===========================================
0002 Seccomp BPF (SECure COMPuting with filters)
0003 ===========================================
0004
0005 Introduction
0006 ============
0007
0008 A large number of system calls are exposed to every userland process
0009 with many of them going unused for the entire lifetime of the process.
0010 As system calls change and mature, bugs are found and eradicated.  A
0011 certain subset of userland applications benefit by having a reduced set
0012 of available system calls.  The resulting set reduces the total kernel
0013 surface exposed to the application.  System call filtering is meant for
0014 use with those applications.
0015
0016 Seccomp filtering provides a means for a process to specify a filter for
0017 incoming system calls.  The filter is expressed as a Berkeley Packet
0018 Filter (BPF) program, as with socket filters, except that the data
0019 operated on is related to the system call being made: system call
0020 number and the system call arguments.  This allows for expressive
0021 filtering of system calls using a filter program language with a long
0022 history of being exposed to userland and a straightforward data set.
0023
0024 Additionally, BPF makes it impossible for users of seccomp to fall prey
0025 to time-of-check-time-of-use (TOCTOU) attacks that are common in system
0026 call interposition frameworks.  BPF programs may not dereference
0027 pointers, which constrains all filters to solely evaluating the system
0028 call arguments directly.
0029
0030 What it isn't
0031 =============
0032
0033 System call filtering isn't a sandbox.  It provides a clearly defined
0034 mechanism for minimizing the exposed kernel surface.  It is meant to be
0035 a tool for sandbox developers to use.  Beyond that, policy for logical
0036 behavior and information flow should be managed with a combination of
0037 other system hardening techniques and, potentially, an LSM of your
0038 choosing.  Expressive, dynamic filters provide further options down this
0039 path (avoiding pathological sizes or selecting which of the multiplexed
0040 system calls in socketcall() is allowed, for instance) which could be
0041 construed, incorrectly, as a more complete sandboxing solution.
0042
0043 Usage
0044 =====
0045
0046 An additional seccomp mode is added and is enabled using the same
0047 prctl(2) call as the strict seccomp.  If the architecture has
0048 ``CONFIG_HAVE_ARCH_SECCOMP_FILTER``, then filters may be added as below:
0049
0050 ``PR_SET_SECCOMP``:
0051         Now takes an additional argument which specifies a new filter
0052         using a BPF program.
0053         The BPF program will be executed over struct seccomp_data
0054         reflecting the system call number, arguments, and other
0055         metadata.  The BPF program must then return one of the
0056         acceptable values to inform the kernel which action should be
0057         taken.
0058
0059         Usage::
0060
0061                 prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog);
0062
0063         The 'prog' argument is a pointer to a struct sock_fprog which
0064         will contain the filter program.  If the program is invalid, the
0065         call will return -1 and set errno to ``EINVAL``.
0066
0067         If ``fork``/``clone`` and ``execve`` are allowed by @prog, any child
0068         processes will be constrained to the same filters and system
0069         call ABI as the parent.
0070
0071         Prior to use, the task must call ``prctl(PR_SET_NO_NEW_PRIVS, 1)`` or
0072         run with ``CAP_SYS_ADMIN`` privileges in its namespace.  If these are not
0073         true, ``-EACCES`` will be returned.  This requirement ensures that filter
0074         programs cannot be applied to child processes with greater privileges
0075         than the task that installed them.
0076
0077         Additionally, if ``prctl(2)`` is allowed by the attached filter,
0078         additional filters may be layered on which will increase evaluation
0079         time, but allow for further decreasing the attack surface during
0080         execution of a process.
0081
0082 The above call returns 0 on success and non-zero on error.
0083
0084 Return values
0085 =============
0086
0087 A seccomp filter may return any of the following values. If multiple
0088 filters exist, the return value for the evaluation of a given system
0089 call will always use the highest-precedence value. (For example,
0090 ``SECCOMP_RET_KILL_PROCESS`` will always take precedence.)
0091
0092 In precedence order, they are:
0093
0094 ``SECCOMP_RET_KILL_PROCESS``:
0095         Results in the entire process exiting immediately without executing
0096         the system call.  The exit status of the task (``status & 0x7f``)
0097         will be ``SIGSYS``, not ``SIGKILL``.
0098
0099 ``SECCOMP_RET_KILL_THREAD``:
0100         Results in the task exiting immediately without executing the
0101         system call.  The exit status of the task (``status & 0x7f``) will
0102         be ``SIGSYS``, not ``SIGKILL``.
0103
0104 ``SECCOMP_RET_TRAP``:
0105         Results in the kernel sending a ``SIGSYS`` signal to the triggering
0106         task without executing the system call. ``siginfo->si_call_addr``
0107         will show the address of the system call instruction, and
0108         ``siginfo->si_syscall`` and ``siginfo->si_arch`` will indicate which
0109         syscall was attempted.  The program counter will be as though
0110         the syscall happened (i.e. it will not point to the syscall
0111         instruction).  The return value register will contain an arch-
0112         dependent value -- if resuming execution, set it to something
0113         sensible.  (The architecture dependency is because replacing
0114         it with ``-ENOSYS`` could overwrite some useful information.)
0115
0116         The ``SECCOMP_RET_DATA`` portion of the return value will be passed
0117         as ``si_errno``.
0118
0119         ``SIGSYS`` triggered by seccomp will have a si_code of ``SYS_SECCOMP``.
0120
0121 ``SECCOMP_RET_ERRNO``:
0122         Results in the lower 16 bits of the return value being passed
0123         to userland as the errno without executing the system call.
0124
0125 ``SECCOMP_RET_USER_NOTIF``:
0126         Results in a ``struct seccomp_notif`` message sent on the userspace
0127         notification fd, if it is attached, or ``-ENOSYS`` if it is not. See
0128         below on discussion of how to handle user notifications.
0129
0130 ``SECCOMP_RET_TRACE``:
0131         When returned, this value will cause the kernel to attempt to
0132         notify a ``ptrace()``-based tracer prior to executing the system
0133         call.  If there is no tracer present, ``-ENOSYS`` is returned to
0134         userland and the system call is not executed.
0135
0136         A tracer will be notified if it requests ``PTRACE_O_TRACESECCOMP``
0137         using ``ptrace(PTRACE_SETOPTIONS)``.  The tracer will be notified
0138         of a ``PTRACE_EVENT_SECCOMP`` and the ``SECCOMP_RET_DATA`` portion of
0139         the BPF program return value will be available to the tracer
0140         via ``PTRACE_GETEVENTMSG``.
0141
0142         The tracer can skip the system call by changing the syscall number
0143         to -1.  Alternatively, the tracer can change the system call
0144         requested by changing the system call to a valid syscall number.  If
0145         the tracer asks to skip the system call, then the system call will
0146         appear to return the value that the tracer puts in the return value
0147         register.
0148
0149         The seccomp check will not be run again after the tracer is
0150         notified.  (This means that seccomp-based sandboxes MUST NOT
0151         allow use of ptrace, even of other sandboxed processes, without
0152         extreme care; ptracers can use this mechanism to escape.)
0153
0154 ``SECCOMP_RET_LOG``:
0155         Results in the system call being executed after it is logged. This
0156         should be used by application developers to learn which syscalls their
0157         application needs without having to iterate through multiple test and
0158         development cycles to build the list.
0159
0160         This action will only be logged if "log" is present in the
0161         actions_logged sysctl string.
0162
0163 ``SECCOMP_RET_ALLOW``:
0164         Results in the system call being executed.
0165
0166 If multiple filters exist, the return value for the evaluation of a
0167 given system call will always use the highest-precedence value.
0168
0169 Precedence is only determined using the ``SECCOMP_RET_ACTION`` mask.  When
0170 multiple filters return values of the same precedence, only the
0171 ``SECCOMP_RET_DATA`` from the most recently installed filter will be
0172 returned.
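
The precedence rule can be modelled in userspace as below. This is a sketch under stated assumptions, not the kernel's code: the function name ``combine_filter_results`` is hypothetical, ``results[0]`` is assumed to come from the most recently installed filter, and the action bits are compared as a signed value through ``SECCOMP_RET_ACTION_FULL`` so that ``SECCOMP_RET_KILL_PROCESS`` sorts ahead of everything else.

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/seccomp.h>

/*
 * Hypothetical userspace model of combining per-filter return values.
 * results[0] is assumed to belong to the most recently installed filter.
 */
static uint32_t combine_filter_results(const uint32_t *results, size_t n)
{
    uint32_t winner = SECCOMP_RET_ALLOW;

    for (size_t i = 0; i < n; i++) {
        /* Only the action bits take part in the comparison. */
        int32_t cur = (int32_t)(results[i] & SECCOMP_RET_ACTION_FULL);
        int32_t best = (int32_t)(winner & SECCOMP_RET_ACTION_FULL);

        /*
         * Strictly lower action value means higher precedence.  On a
         * tie the earlier (more recently installed) entry is kept,
         * together with its SECCOMP_RET_DATA bits.
         */
        if (cur < best)
            winner = results[i];
    }
    return winner;
}
```

With two filters that both return ``SECCOMP_RET_ERRNO``, the data bits of the newer filter are the ones reported, matching the rule stated above.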
0173
0174 Pitfalls
0175 ========
0176
0177 The biggest pitfall to avoid during use is filtering on system call
0178 number without checking the architecture value.  Why?  On any
0179 architecture that supports multiple system call invocation conventions,
0180 the system call numbers may vary based on the specific invocation.  If
0181 the numbers in the different calling conventions overlap, then checks in
0182 the filters may be abused.  Always check the arch value!
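
A filter that checks the architecture before looking at the system call number might be laid out as in the sketch below. The helper ``build_deny_filter`` and the macro ``DENY_FILTER_LEN`` are hypothetical names used for illustration. The choice to kill the process on an architecture mismatch and to return ``EPERM`` for the denied call is an assumption, not a recommendation from this document.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#define DENY_FILTER_LEN 7   /* hypothetical: number of instructions below */

/*
 * Hypothetical helper: writes a filter into out[] that
 *   1. kills the process if seccomp_data.arch != arch,
 *   2. fails system call number nr with EPERM,
 *   3. allows everything else.
 * out[] must have room for DENY_FILTER_LEN instructions.
 */
static size_t build_deny_filter(struct sock_filter *out, uint32_t arch,
                                uint32_t nr)
{
    const struct sock_filter insns[DENY_FILTER_LEN] = {
        /* Load the architecture value first. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        /* Expected arch: skip the kill instruction.  Otherwise fall into it. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, arch, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        /* Only now is the system call number meaningful. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* Match: fall through to ERRNO.  No match: skip to ALLOW. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, nr, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };

    memcpy(out, insns, sizeof(insns));
    return DENY_FILTER_LEN;
}
```

A caller would typically pass one of the ``AUDIT_ARCH_*`` constants from ``<linux/audit.h>`` as ``arch`` and a ``__NR_*`` constant as ``nr``, then place the result in a ``struct sock_fprog``.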
0183
0184 Example
0185 =======
0186
0187 The ``samples/seccomp/`` directory contains both an x86-specific example
0188 and a more generic example of a higher level macro interface for BPF
0189 program generation.
0190
0191 Userspace Notification
0192 ======================
0193
0194 The ``SECCOMP_RET_USER_NOTIF`` return code lets seccomp filters pass a
0195 particular syscall to userspace to be handled. This may be useful for
0196 applications like container managers, which wish to intercept particular
0197 syscalls (``mount()``, ``finit_module()``, etc.) and change their behavior.
0198
0199 To acquire a notification FD, use the ``SECCOMP_FILTER_FLAG_NEW_LISTENER``
0200 argument to the ``seccomp()`` syscall:
0201
0202 .. code-block:: c
0203
0204     fd = seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
0205
0206 which (on success) will return a listener fd for the filter, which can then be
0207 passed around via ``SCM_RIGHTS`` or similar. Note that filter fds correspond to
0208 a particular filter, and not a particular task. So if this task then forks,
0209 notifications from both tasks will appear on the same filter fd. Reads and
0210 writes to/from a filter fd are also synchronized, so a filter fd can safely
0211 have many readers.
0212
0213 The interface for a seccomp notification fd consists of two structures:
0214
0215 .. code-block:: c
0216
0217     struct seccomp_notif_sizes {
0218         __u16 seccomp_notif;
0219         __u16 seccomp_notif_resp;
0220         __u16 seccomp_data;
0221     };
0222
0223     struct seccomp_notif {
0224         __u64 id;
0225         __u32 pid;
0226         __u32 flags;
0227         struct seccomp_data data;
0228     };
0229
0230     struct seccomp_notif_resp {
0231         __u64 id;
0232         __s64 val;
0233         __s32 error;
0234         __u32 flags;
0235     };
0236
0237 The ``struct seccomp_notif_sizes`` structure can be used to determine the size
0238 of the various structures used in seccomp notifications. The size of ``struct
0239 seccomp_data`` may change in the future, so code should use:
0240
0241 .. code-block:: c
0242
0243     struct seccomp_notif_sizes sizes;
0244     seccomp(SECCOMP_GET_NOTIF_SIZES, 0, &sizes);
0245
0246 to determine the size of the various structures to allocate. See
0247 samples/seccomp/user-trap.c for an example.
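
As a rough sketch of the allocation step, the helper below queries the sizes and allocates zeroed buffers that are at least as large as both the kernel-reported size and the size known at build time. The name ``alloc_notif_buffers`` is hypothetical, and the raw ``syscall(2)`` invocation is used on the assumption that the C library provides no ``seccomp()`` wrapper.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/seccomp.h>

/*
 * Hypothetical helper: allocates zeroed request and response buffers.
 * Returns 0 on success, -1 if the sizes cannot be queried or the
 * allocation fails.  The caller frees both buffers with free().
 */
static int alloc_notif_buffers(struct seccomp_notif **req,
                               struct seccomp_notif_resp **resp)
{
    struct seccomp_notif_sizes sizes;
    size_t req_len, resp_len;

    if (syscall(SYS_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, &sizes) != 0)
        return -1;

    /* Never allocate less than the structures this code was built with. */
    req_len = sizes.seccomp_notif > sizeof(**req) ?
              sizes.seccomp_notif : sizeof(**req);
    resp_len = sizes.seccomp_notif_resp > sizeof(**resp) ?
               sizes.seccomp_notif_resp : sizeof(**resp);

    *req = calloc(1, req_len);
    *resp = calloc(1, resp_len);
    if (*req == NULL || *resp == NULL) {
        free(*req);
        free(*resp);
        *req = NULL;
        *resp = NULL;
        return -1;
    }
    return 0;
}
```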
0248
0249 Users can read via ``ioctl(SECCOMP_IOCTL_NOTIF_RECV)`` (or ``poll()``) on a
0250 seccomp notification fd to receive a ``struct seccomp_notif``, which contains
0251 four members: a unique-per-filter ``id``, the ``pid`` of the task which
0252 triggered this request (which may be 0 if the task is in a pid ns not visible
0253 from the listener's pid namespace), a ``flags`` field, and the ``data`` passed
0254 to seccomp.
0255 The structure should be zeroed out prior to calling the ioctl.
0256
0257 Userspace can then make a decision based on this information about what to do,
0258 and ``ioctl(SECCOMP_IOCTL_NOTIF_SEND)`` a response, indicating what should be
0259 returned to userspace. The ``id`` member of ``struct seccomp_notif_resp`` should
0260 be the same ``id`` as in ``struct seccomp_notif``.
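
One receive-and-respond round trip could look like the sketch below. The helper ``deny_one_notification`` is a hypothetical name. Two assumptions are made for brevity: the structures live on the stack at their build-time size instead of being sized through ``SECCOMP_GET_NOTIF_SIZES``, and ``error`` is filled in as a negative errno value.

```c
#include <string.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>

/*
 * Hypothetical helper: waits for one notification on listener_fd and
 * makes the intercepted system call fail with the given errno value.
 * Returns 0 on success, -1 if either ioctl fails.
 */
static int deny_one_notification(int listener_fd, int err)
{
    struct seccomp_notif req;
    struct seccomp_notif_resp resp;

    /* The request structure must be zeroed before the RECV ioctl. */
    memset(&req, 0, sizeof(req));
    if (ioctl(listener_fd, SECCOMP_IOCTL_NOTIF_RECV, &req) != 0)
        return -1;

    memset(&resp, 0, sizeof(resp));
    resp.id = req.id;    /* must match the id of the request */
    resp.val = 0;        /* return value, unused when error is set */
    resp.error = -err;   /* assumption: negative errno convention */

    if (ioctl(listener_fd, SECCOMP_IOCTL_NOTIF_SEND, &resp) != 0)
        return -1;
    return 0;
}
```

A supervisor would normally call such a helper in a loop, and would inspect ``req.data`` before deciding how to respond.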
0261
0262 Userspace can also add file descriptors to the notifying process via
0263 ``ioctl(SECCOMP_IOCTL_NOTIF_ADDFD)``. The ``id`` member of
0264 ``struct seccomp_notif_addfd`` should be the same ``id`` as in
0265 ``struct seccomp_notif``. The ``newfd_flags`` flag may be used to set flags
0266 like O_CLOEXEC on the file descriptor in the notifying process. If the supervisor
0267 wants to inject the file descriptor with a specific number, it can set the
0268 ``SECCOMP_ADDFD_FLAG_SETFD`` flag and set the ``newfd`` member to
0269 the specific number to use. If that file descriptor is already open in the
0270 notifying process it will be replaced. The supervisor can also add an FD and
0271 respond atomically by using the ``SECCOMP_ADDFD_FLAG_SEND`` flag; the return
0272 value will then be the injected file descriptor number.
0273
0274 The notifying process can be preempted, resulting in the notification being
0275 aborted. This can be problematic when trying to take actions on behalf of the
0276 notifying process that are long-running and typically retryable (mounting a
0277 filesystem). Alternatively, at filter installation time, the
0278 ``SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV`` flag can be set. This flag makes it
0279 such that when a user notification is received by the supervisor, the notifying
0280 process will ignore non-fatal signals until the response is sent. Signals that
0281 are sent prior to the notification being received by userspace are handled
0282 normally.
0283
0284 It is worth noting that ``struct seccomp_data`` contains the values of register
0285 arguments to the syscall, but does not contain pointers to memory. The task's
0286 memory is accessible to suitably privileged tracers via ``ptrace()`` or
0287 ``/proc/pid/mem``. However, care should be taken to avoid the TOCTOU mentioned
0288 above in this document: all arguments being read from the tracee's memory
0289 should be read into the tracer's memory before any policy decisions are made.
0290 This allows for an atomic decision on syscall arguments.
0291
0292 Sysctls
0293 =======
0294
0295 Seccomp's sysctl files can be found in the ``/proc/sys/kernel/seccomp/``
0296 directory. Here's a description of each file in that directory:
0297
0298 ``actions_avail``:
0299         A read-only ordered list of seccomp return values (refer to the
0300         ``SECCOMP_RET_*`` macros above) in string form. The ordering, from
0301         left-to-right, is the least permissive return value to the most
0302         permissive return value.
0303
0304         The list represents the set of seccomp return values supported
0305         by the kernel. A userspace program may use this list to
0306         determine if the actions found in ``seccomp.h`` when the
0307         program was built differ from the set of actions actually
0308         supported in the current running kernel.
0309
0310 ``actions_logged``:
0311         A read-write ordered list of seccomp return values (refer to the
0312         ``SECCOMP_RET_*`` macros above) that are allowed to be logged. Writes
0313         to the file do not need to be in ordered form but reads from the file
0314         will be ordered in the same way as the actions_avail sysctl.
0315
0316         The ``allow`` string is not accepted in the ``actions_logged`` sysctl
0317         as it is not possible to log ``SECCOMP_RET_ALLOW`` actions. Attempting
0318         to write ``allow`` to the sysctl will result in an EINVAL being
0319         returned.
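
A program that has read one of these files into a buffer could test for a particular action with a helper like the one below. This is a sketch: the name ``action_listed`` is hypothetical, and the list is assumed to be whitespace-separated action names such as ``log`` and ``allow``.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: returns true if name appears as a whole
 * whitespace-separated token in list (for example the contents of
 * actions_avail or actions_logged).
 */
static bool action_listed(const char *list, const char *name)
{
    const char *p = list;
    size_t want = strlen(name);

    while (*p != '\0') {
        size_t len;

        p += strspn(p, " \t\n");     /* skip separators */
        len = strcspn(p, " \t\n");   /* length of the next token */

        /* Compare whole tokens so "kill" does not match "kill_thread". */
        if (len > 0 && len == want && strncmp(p, name, len) == 0)
            return true;
        p += len;
    }
    return false;
}
```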
0320
0321 Adding architecture support
0322 ===========================
0323
0324 See ``arch/Kconfig`` for the authoritative requirements.  In general, if an
0325 architecture supports both ptrace_event and seccomp, it will be able to
0326 support seccomp filter with minor fixup: ``SIGSYS`` support and seccomp return
0327 value checking.  Then it must just add ``CONFIG_HAVE_ARCH_SECCOMP_FILTER``
0328 to its arch-specific Kconfig.
0329
0330
0331
0332 Caveats
0333 =======
0334
0335 The vDSO can cause some system calls to run entirely in userspace,
0336 leading to surprises when you run programs on different machines that
0337 fall back to real syscalls.  To minimize these surprises on x86, make
0338 sure you test with
0339 ``/sys/devices/system/clocksource/clocksource0/current_clocksource`` set to
0340 something like ``acpi_pm``.
0341
0342 On x86-64, vsyscall emulation is enabled by default.  (vsyscalls are
0343 legacy variants on vDSO calls.)  Currently, emulated vsyscalls will
0344 honor seccomp, with a few oddities:
0345
0346 - A return value of ``SECCOMP_RET_TRAP`` will set a ``si_call_addr`` pointing to
0347   the vsyscall entry for the given call and not the address after the
0348   'syscall' instruction.  Any code which wants to restart the call
0349   should be aware that (a) a ret instruction has been emulated and (b)
0350   trying to resume the syscall will again trigger the standard vsyscall
0351   emulation security checks, making resuming the syscall mostly
0352   pointless.
0353
0354 - A return value of ``SECCOMP_RET_TRACE`` will signal the tracer as usual,
0355   but the syscall may not be changed to another system call using the
0356   orig_rax register. It may only be changed to -1 in order to skip the
0357   currently emulated call. Any other change MAY terminate the process.
0358   The rip value seen by the tracer will be the syscall entry address;
0359   this is different from normal behavior.  The tracer MUST NOT modify
0360   rip or rsp.  (Do not rely on other changes terminating the process.
0361   They might work.  For example, on some kernels, choosing a syscall
0362   that only exists in future kernels will be correctly emulated (by
0363   returning ``-ENOSYS``).)
0364
0365 To detect this quirky behavior, check for ``(addr & ~0x0C00) ==
0366 0xFFFFFFFFFF600000``.  (For ``SECCOMP_RET_TRACE``, use rip.  For
0367 ``SECCOMP_RET_TRAP``, use ``siginfo->si_call_addr``.)  Do not check any other
0368 condition: future kernels may improve vsyscall emulation and current
0369 kernels in vsyscall=native mode will behave differently, but the
0370 instructions at ``0xF...F600{0,4,8,C}00`` will not be system calls in these
0371 cases.
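
Written out in C, the address check needs explicit parentheses because ``==`` binds more tightly than ``&``. The function name ``is_vsyscall_addr`` below is a hypothetical choice for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if addr is one of the four legacy vsyscall
 * entry addresses 0xFFFFFFFFFF600{0,4,8,C}00.
 */
static bool is_vsyscall_addr(uint64_t addr)
{
    /* Clear the two bits that select the entry, then compare the base. */
    return (addr & ~UINT64_C(0x0C00)) == UINT64_C(0xFFFFFFFFFF600000);
}
```

For ``SECCOMP_RET_TRACE`` the argument would be the tracee's rip. For ``SECCOMP_RET_TRAP`` it would be ``siginfo->si_call_addr`` converted to an integer.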
0372
0373 Note that modern systems are unlikely to use vsyscalls at all -- they
0374 are a legacy feature and they are considerably slower than standard
0375 syscalls.  New code will use the vDSO, and vDSO-issued system calls
0376 are indistinguishable from normal system calls.