.. SPDX-License-Identifier: GPL-2.0

======================
Memory Protection Keys
======================

Memory Protection Keys provide a mechanism for enforcing page-based
protections, but without requiring modification of the page tables when an
application changes protection domains.

Pkeys Userspace (PKU) is a feature which can be found on:

* Intel server CPUs, Skylake and later
* Intel client CPUs, Tiger Lake (11th Gen Core) and later
* Future AMD CPUs
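
Whether a given machine exposes the feature can be probed from userspace.
The following is a minimal sketch, assuming a GCC or Clang toolchain that
ships ``<cpuid.h>``; the function name ``cpu_has_ospke()`` is hypothetical.
It checks CPUID leaf 7, subleaf 0, where ECX bit 3 reports PKU and ECX bit 4
reports that the operating system has enabled it (OSPKE). On other
architectures the sketch simply reports "not available".

```c
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* CPUID.(EAX=7,ECX=0):ECX feature bits */
#define CPUID_7_ECX_PKU   (1u << 3)  /* CPU implements protection keys */
#define CPUID_7_ECX_OSPKE (1u << 4)  /* OS has enabled protection keys */

/* Hypothetical helper: true only if PKU is present and enabled by the OS. */
static bool cpu_has_ospke(void)
{
#if defined(__x86_64__) || defined(__i386__)
        unsigned int eax, ebx, ecx, edx;

        /* Returns 0 if leaf 7 is not supported by this CPU. */
        if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
                return false;
        return (ecx & CPUID_7_ECX_PKU) && (ecx & CPUID_7_ECX_OSPKE);
#else
        return false;
#endif
}
```

A successful pkey_alloc() remains the authoritative check, since the kernel
may have the feature disabled even when the CPU supports it.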

Pkeys work by dedicating 4 previously reserved bits in each page table entry to
a "protection key", giving 16 possible keys.
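
As a rough illustration of the encoding, the sketch below reads and writes
the 4-bit key in a raw 64-bit page table entry value. It assumes the x86-64
layout in which the key occupies bits 62:59; the macro and function names are
hypothetical and are not kernel identifiers.

```c
#include <stdint.h>

/* Assumption: on x86-64 the protection key lives in PTE bits 62:59. */
#define PTE_PKEY_SHIFT 59
#define PTE_PKEY_MASK  0xFULL

/* Hypothetical helper: return the 4-bit key stored in a raw PTE value. */
static inline unsigned int pte_to_pkey(uint64_t pte)
{
        return (unsigned int)((pte >> PTE_PKEY_SHIFT) & PTE_PKEY_MASK);
}

/* Hypothetical helper: return a copy of the PTE tagged with 'pkey'. */
static inline uint64_t pte_with_pkey(uint64_t pte, unsigned int pkey)
{
        pte &= ~(PTE_PKEY_MASK << PTE_PKEY_SHIFT);
        return pte | (((uint64_t)pkey & PTE_PKEY_MASK) << PTE_PKEY_SHIFT);
}
```

Four bits give values 0 through 15, which is where the limit of 16 keys
comes from.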

Protections for each key are defined with a per-CPU user-accessible register
(PKRU). Each of these is a 32-bit register storing two bits (Access Disable
and Write Disable) for each of the 16 keys.
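
The register layout can be sketched as follows. This example assumes the
architectural bit order, with Access Disable at bit ``2 * pkey`` and Write
Disable at bit ``2 * pkey + 1``; the macro and function names are
hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: two adjacent bits per key, AD first, then WD. */
#define PKRU_AD_BIT(pkey) (1u << (2 * (pkey)))      /* Access Disable */
#define PKRU_WD_BIT(pkey) (1u << (2 * (pkey) + 1))  /* Write Disable */

/* Reads are allowed unless Access Disable is set for the key. */
static inline bool pkru_allows_read(uint32_t pkru, int pkey)
{
        return !(pkru & PKRU_AD_BIT(pkey));
}

/* Writes need both Access Disable and Write Disable to be clear. */
static inline bool pkru_allows_write(uint32_t pkru, int pkey)
{
        return !(pkru & (PKRU_AD_BIT(pkey) | PKRU_WD_BIT(pkey)));
}
```

For example, a PKRU value of ``0x8`` sets only the Write Disable bit of key 1,
so pages tagged with key 1 are readable but not writable.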

Being a CPU register, PKRU is inherently thread-local, potentially giving each
thread a different set of protections from every other thread.

There are two instructions (RDPKRU/WRPKRU) for reading and writing to the
register. The feature is only available in 64-bit mode, even though there is
theoretically space in the PAE PTEs. These permissions are enforced on data
access only and have no effect on instruction fetches.

Syscalls
========

There are 3 system calls which directly interact with pkeys::

        int pkey_alloc(unsigned long flags, unsigned long init_access_rights);
        int pkey_free(int pkey);
        int pkey_mprotect(unsigned long start, size_t len,
                          unsigned long prot, int pkey);

Before a pkey can be used, it must first be allocated with
pkey_alloc(). An application calls the WRPKRU instruction
directly in order to change access permissions to memory covered
with a key. In this example WRPKRU is wrapped by a C function
called pkey_set()::

        int real_prot = PROT_READ|PROT_WRITE;
        pkey = pkey_alloc(0, PKEY_DISABLE_WRITE);
        ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
        ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey);
        ... application runs here

Now, if the application needs to update the data at 'ptr', it can
gain access, do the update, then remove its write access::

        pkey_set(pkey, 0); // clear PKEY_DISABLE_WRITE
        *ptr = foo; // assign something
        pkey_set(pkey, PKEY_DISABLE_WRITE); // set PKEY_DISABLE_WRITE again

Now when it frees the memory, it will also free the pkey since it
is no longer in use::

        munmap(ptr, PAGE_SIZE);
        pkey_free(pkey);
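
The allocate, tag and free steps above can be combined into one function with
error handling. This is a minimal sketch rather than a reference
implementation: it assumes a Linux system whose headers define the
``SYS_pkey_*`` syscall numbers, it invokes the system calls through
``syscall()``, and the name ``pkey_roundtrip()`` is hypothetical. It does not
touch PKRU, so it only demonstrates the system call side of the workflow.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef PKEY_DISABLE_WRITE
#define PKEY_DISABLE_WRITE 0x2  /* value from the Linux UAPI headers */
#endif

/*
 * Hypothetical helper: allocate a key, tag one page with it, then release
 * both. Returns 0 on success, 1 if pkeys are unavailable, -1 otherwise.
 */
static int pkey_roundtrip(void)
{
#if defined(SYS_pkey_alloc) && defined(SYS_pkey_mprotect) && \
    defined(SYS_pkey_free)
        size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
        void *ptr;
        long ret;
        int pkey;

        pkey = (int)syscall(SYS_pkey_alloc, 0UL,
                            (unsigned long)PKEY_DISABLE_WRITE);
        if (pkey < 0)
                return 1;       /* no kernel or CPU support, or no free key */

        ptr = mmap(NULL, page_size, PROT_NONE,
                   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
        if (ptr == MAP_FAILED) {
                syscall(SYS_pkey_free, (long)pkey);
                return -1;
        }

        /* Page permissions become read/write; the key still denies writes. */
        ret = syscall(SYS_pkey_mprotect, ptr, page_size,
                      (unsigned long)(PROT_READ | PROT_WRITE), (long)pkey);

        /* Unmap first so that the key is unused when it is freed. */
        munmap(ptr, page_size);
        syscall(SYS_pkey_free, (long)pkey);
        return ret == 0 ? 0 : -1;
#else
        return 1;               /* headers predate the pkey system calls */
#endif
}
```

A caller would treat a return value of 1 as "fall back to plain mprotect()".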

.. note:: pkey_set() is a wrapper for the RDPKRU and WRPKRU instructions.
          An example implementation can be found in
          tools/testing/selftests/x86/protection_keys.c.
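
For orientation, one possible shape of such a wrapper is sketched below. It
is not the selftest's code. The names ``pkru_with_rights()``, ``rdpkru()``,
``wrpkru()`` and ``pkey_set_sketch()`` are hypothetical, the instructions are
emitted as raw opcode bytes on the assumption that the assembler may not know
the mnemonics, and the instruction wrappers are only compiled on x86-64. They
must only be executed on a system where pkeys are enabled, otherwise the
instructions fault.

```c
#include <stdint.h>

#define PKRU_BITS_PER_PKEY 2

/* Pure helper: return 'pkru' with the two bits of 'pkey' set to 'rights'. */
static inline uint32_t pkru_with_rights(uint32_t pkru, int pkey,
                                        uint32_t rights)
{
        int shift = pkey * PKRU_BITS_PER_PKEY;

        pkru &= ~(0x3u << shift);
        return pkru | ((rights & 0x3u) << shift);
}

#if defined(__x86_64__)
/* RDPKRU: requires ECX = 0, returns PKRU in EAX and clears EDX. */
static inline uint32_t rdpkru(void)
{
        uint32_t eax, edx;

        __asm__ volatile(".byte 0x0f,0x01,0xee"
                         : "=a" (eax), "=d" (edx)
                         : "c" (0));
        return eax;
}

/* WRPKRU: requires ECX = 0 and EDX = 0, loads PKRU from EAX. */
static inline void wrpkru(uint32_t pkru)
{
        /* "memory" keeps the compiler from moving accesses across this. */
        __asm__ volatile(".byte 0x0f,0x01,0xef"
                         : /* no outputs */
                         : "a" (pkru), "c" (0), "d" (0)
                         : "memory");
}

/* Read-modify-write of PKRU so that other keys keep their settings. */
static inline void pkey_set_sketch(int pkey, uint32_t rights)
{
        wrpkru(pkru_with_rights(rdpkru(), pkey, rights));
}
#endif
```

The read-modify-write is the important design point: WRPKRU replaces the
whole register, so the bits of all other keys have to be carried over.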

Behavior
========

The kernel attempts to make protection keys consistent with the
behavior of a plain mprotect(). For instance, if you do this::

        mprotect(ptr, size, PROT_NONE);
        something(ptr);

you can expect the same effects with protection keys when doing this::

        pkey = pkey_alloc(0, PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE);
        pkey_mprotect(ptr, size, PROT_READ|PROT_WRITE, pkey);
        something(ptr);

That should be true whether something() is a direct access to 'ptr'
like::

        *ptr = foo;

or when the kernel does the access on the application's behalf like
with a read()::

        read(fd, ptr, 1);

The kernel will send a SIGSEGV in both cases, but si_code will be set
to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when
the plain mprotect() permissions are violated.