.. SPDX-License-Identifier: GPL-2.0

=====================
Syscall User Dispatch
=====================

Background
----------

Compatibility layers like Wine need a way to efficiently emulate system
calls of only a part of their process - the part that has the
incompatible code - while being able to execute native syscalls without
a high performance penalty on the native part of the process. Seccomp
falls short on this task, since it has limited support for efficiently
filtering syscalls based on memory regions, and it doesn't support
removing filters. Therefore, a new mechanism is necessary.

Syscall User Dispatch brings the filtering of the syscall dispatcher
address back to userspace. The application is in control of a flip
switch, indicating the current personality of the process. A
multiple-personality application can then flip the switch without
invoking the kernel, when crossing the compatibility layer API
boundaries, to enable/disable the syscall redirection and execute
syscalls directly (disabled) or send them to be emulated in userspace
through a SIGSYS.
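
As a minimal sketch of the receiving side, the handler below records which
syscall was redirected. It assumes the kernel reports dispatched syscalls
with si_code set to SYS_USER_DISPATCH; the names sud_sigsys_handler,
sud_install_handler, last_syscall_nr and last_call_addr are hypothetical
and exist only for this illustration. A real compatibility layer would
emulate the call and write the result back through the ucontext argument.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* si_code used for Syscall User Dispatch; fallback for libc headers
 * that do not define it yet. */
#ifndef SYS_USER_DISPATCH
#define SYS_USER_DISPATCH 2
#endif

/* Hypothetical bookkeeping for this sketch: the last intercepted call. */
static volatile sig_atomic_t last_syscall_nr = -1;
static void *volatile last_call_addr;

/* SIGSYS handler: restricted to plain stores, which are safe to perform
 * in signal context. */
static void sud_sigsys_handler(int sig, siginfo_t *info, void *ucontext)
{
    (void)sig;
    (void)ucontext; /* a real emulator reads the syscall arguments here */

    if (info->si_code != SYS_USER_DISPATCH)
        return; /* e.g. a SIGSYS generated by seccomp */

    last_syscall_nr = info->si_syscall;
    last_call_addr = info->si_call_addr;
}

/* Install the handler; returns 0 on success, -1 on error. */
static int sud_install_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = sud_sigsys_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSYS, &sa, NULL);
}
```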

The goal of this design is to provide very quick compatibility layer
boundary crossings, which is achieved by not executing a syscall to
change personality every time the compatibility layer executes. Instead,
a userspace memory region exposed to the kernel indicates the current
personality, and the application simply modifies that variable to
configure the mechanism.
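
The personality switch can be sketched as a pair of plain stores around a
call into non-native code. This is an illustration under assumptions: the
selector byte is presumed to have been registered with the kernel
beforehand (not shown here), and sud_selector, call_foreign and
observe_selector are hypothetical names.

```c
#include <stddef.h>
#include <sys/prctl.h>

/* Fallbacks for older headers; values match <linux/prctl.h>. */
#ifndef SYSCALL_DISPATCH_FILTER_ALLOW
#define SYSCALL_DISPATCH_FILTER_ALLOW 0
#endif
#ifndef SYSCALL_DISPATCH_FILTER_BLOCK
#define SYSCALL_DISPATCH_FILTER_BLOCK 1
#endif

/* One selector byte per thread, since the mechanism is per-thread. */
static __thread volatile char sud_selector = SYSCALL_DISPATCH_FILTER_ALLOW;

typedef int (*foreign_fn)(void *arg);

/* Run non-native code with redirection on, then switch back.
 * Crossing the boundary costs two stores and no syscall. */
static int call_foreign(foreign_fn fn, void *arg)
{
    int ret;

    sud_selector = SYSCALL_DISPATCH_FILTER_BLOCK;
    ret = fn(arg);
    sud_selector = SYSCALL_DISPATCH_FILTER_ALLOW;
    return ret;
}

/* Stand-in for non-native code: reports the selector value it sees. */
static int observe_selector(void *arg)
{
    (void)arg;
    return sud_selector;
}
```

Keeping the selector in thread-local storage is a design choice of this
sketch; any byte that stays mapped for the lifetime of the thread would do.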

There is a relatively high cost associated with handling signals on most
architectures, like x86, but at least for Wine, syscalls issued by
native Windows code are currently not known to be a performance problem,
since they are quite rare, at least for modern gaming applications.

Since this mechanism is designed to capture syscalls issued by
non-native applications, it must function on syscalls whose invocation
ABI is completely unexpected to Linux. Syscall User Dispatch therefore
doesn't rely on any of the syscall ABI to make the filtering decision.
It uses only the syscall dispatcher address and the userspace key.

As the ABI of these intercepted syscalls is unknown to Linux, these
syscalls are not instrumentable via ptrace or the syscall tracepoints.

Interface
---------

A thread can set up this mechanism on supported kernels by executing the
following prctl::

  prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])

<op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable or
disable the mechanism globally for that thread, respectively. When
PR_SYS_DISPATCH_OFF is used, the other fields must be zero.
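
A minimal sketch of the two operations follows. The wrapper names
sud_enable and sud_disable are hypothetical; the fallback definitions
cover build environments with older headers and mirror the values in
<linux/prctl.h>. On kernels or architectures without support, the call is
expected to fail, so callers should check the return value.

```c
#include <stddef.h>
#include <sys/prctl.h>

/* Fallbacks for older headers; values match <linux/prctl.h>. */
#ifndef PR_SET_SYSCALL_USER_DISPATCH
#define PR_SET_SYSCALL_USER_DISPATCH 59
#endif
#ifndef PR_SYS_DISPATCH_OFF
#define PR_SYS_DISPATCH_OFF 0
#endif
#ifndef PR_SYS_DISPATCH_ON
#define PR_SYS_DISPATCH_ON 1
#endif

/* Enable dispatch for the calling thread. Syscalls issued from
 * [offset, offset + length) always run directly; all others consult
 * *selector. Returns 0 on success, -1 with errno set on failure. */
static int sud_enable(unsigned long offset, unsigned long length,
                      volatile char *selector)
{
    return prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON,
                 offset, length, selector);
}

/* Disable dispatch for the calling thread. Every remaining argument
 * must be zero; they are passed as unsigned long because prctl() is
 * variadic. */
static int sud_disable(void)
{
    return prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_OFF,
                 0UL, 0UL, 0UL);
}
```

The selector should hold a valid value before sud_enable() is called,
because the kernel consults it on the next syscall issued outside the
allowed region.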

[<offset>, <offset>+<length>) delimits a memory region from which
syscalls are always executed directly, regardless of the userspace
selector. This provides a fast path for the C library, which includes
the most common syscall dispatchers in native code applications, and
also provides a way for the signal handler to return without triggering
a nested SIGSYS on (rt\_)sigreturn. Users of this interface should make
sure that at least the signal trampoline code is included in this
region. In addition, on architectures that implement the signal
trampoline in the vDSO, that trampoline is never intercepted.
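
The half-open interval can be modelled in userspace as below. This is an
illustrative model of the rule stated above, not the kernel's code, and
sud_in_allowed_region is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when the syscall instruction at ip lies inside the half-open
 * interval [offset, offset + length). Unsigned subtraction wraps around
 * when ip < offset, so one comparison covers both bounds. */
static bool sud_in_allowed_region(uintptr_t ip, uintptr_t offset,
                                  uintptr_t length)
{
    return ip - offset < length;
}
```

For example, with offset 0x1000 and length 0x100, addresses 0x1000 through
0x10ff are inside the region and 0x1100 is the first address outside it. A
length of zero makes the region empty.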

[selector] is a pointer to a char-sized region in the process memory
that provides a quick way to enable or disable syscall redirection
thread-wide, without the need to invoke the kernel directly. The
selector can be set to SYSCALL_DISPATCH_FILTER_ALLOW or
SYSCALL_DISPATCH_FILTER_BLOCK. Any other value should terminate the
program with a SIGSYS.
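
The three possible outcomes for a syscall issued outside the allowed
region can be summarised with a small model. It only restates the rule
above; enum sud_action and sud_classify are hypothetical names, not part
of any kernel or libc interface.

```c
#include <sys/prctl.h>

/* Fallbacks for older headers; values match <linux/prctl.h>. */
#ifndef SYSCALL_DISPATCH_FILTER_ALLOW
#define SYSCALL_DISPATCH_FILTER_ALLOW 0
#endif
#ifndef SYSCALL_DISPATCH_FILTER_BLOCK
#define SYSCALL_DISPATCH_FILTER_BLOCK 1
#endif

enum sud_action {
    SUD_EXECUTE,   /* syscall runs directly */
    SUD_SIGSYS,    /* syscall is redirected to the SIGSYS handler */
    SUD_TERMINATE, /* invalid selector value: the program is terminated */
};

/* Map a selector byte to the documented outcome. */
static enum sud_action sud_classify(unsigned char selector)
{
    switch (selector) {
    case SYSCALL_DISPATCH_FILTER_ALLOW:
        return SUD_EXECUTE;
    case SYSCALL_DISPATCH_FILTER_BLOCK:
        return SUD_SIGSYS;
    default:
        return SUD_TERMINATE;
    }
}
```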

Security Notes
--------------

Syscall User Dispatch provides functionality for compatibility layers to
quickly capture system calls issued by a non-native part of the
application, while not impacting the Linux native regions of the
process. It is not a mechanism for sandboxing system calls, and it
should not be seen as a security mechanism, since it is trivial for a
malicious application to subvert the mechanism by jumping to an allowed
dispatcher region prior to executing the syscall, or to discover the
address and modify the selector value. If the use case requires any
kind of security sandboxing, Seccomp should be used instead.

Any fork or exec of the existing process resets the mechanism to
PR_SYS_DISPATCH_OFF.