unshare system call
===================

This document describes the new system call, unshare(). The document
provides an overview of the feature, why it is needed, how it can
be used, its interface specification, design, implementation and
how it can be tested.

Change Log
----------
version 0.1  Initial document, Janak Desai (janak@us.ibm.com), Jan 11, 2006

Contents
--------
1) Overview
2) Benefits
3) Cost
4) Requirements
5) Functional Specification
6) High Level Design
7) Low Level Design
8) Test Specification
9) Future Work

1) Overview
-----------

Most legacy operating system kernels support an abstraction of threads
as multiple execution contexts within a process. These kernels provide
special resources and mechanisms to maintain these "threads". The Linux
kernel, in a clever and simple manner, makes no distinction
between processes and "threads". The kernel allows processes to share
resources, and thus they can achieve legacy "threads" behavior without
requiring additional data structures and mechanisms in the kernel. The
power of implementing threads in this manner comes not only from
its simplicity but also from allowing application programmers to work
outside the confinement of all-or-nothing shared resources of legacy
threads. On Linux, at the time of thread creation using the clone system
call, applications can selectively choose which resources to share
between threads.

The unshare() system call adds a primitive to the Linux thread model that
allows threads to selectively 'unshare' any resources that were being
shared at the time of their creation. unshare() was conceptualized by
Al Viro in August of 2000, on the Linux-Kernel mailing list, as part
of the discussion on POSIX threads on Linux. unshare() augments the
usefulness of Linux threads for applications that would like to control
shared resources without creating a new process. unshare() is a natural
addition to the set of available primitives on Linux that implement
the concept of process/thread as a virtual machine.

2) Benefits
-----------

unshare() would be useful to large application frameworks such as PAM
where creating a new process to control sharing/unsharing of process
resources is not possible. Since namespaces are shared by default
when creating a new process using fork or clone, unshare() can benefit
even non-threaded applications if they need to disassociate
from the default shared namespace. The following are two use cases
where unshare() can be used.

2.1 Per-security context namespaces
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

unshare() can be used to implement polyinstantiated directories using
the kernel's per-process namespace mechanism. Polyinstantiated directories,
such as per-user and/or per-security context instances of /tmp, /var/tmp or
per-security context instances of a user's home directory, isolate user
processes when working with these directories. Using unshare(), a PAM
module can easily set up a private namespace for a user at login.
Polyinstantiated directories are required for Common Criteria certification
with the Labeled System Protection Profile; however, with the availability
of the shared-tree feature in the Linux kernel, even regular Linux systems
can benefit from setting up private namespaces at login and
polyinstantiating /tmp, /var/tmp and other directories deemed
appropriate by system administrators.
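
As a rough sketch of such a login-time setup, the C fragment below gives
the caller a private namespace and then bind mounts a per-user instance
directory over /tmp. The function name setup_private_tmp() and the choice
of instance directory are hypothetical, and marking the tree private with
MS_PRIVATE is an assumption about how the shared-tree feature would be
used to keep the mount from propagating. A real PAM module would need
considerably more care.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>

/*
 * Hypothetical helper: give the caller a private namespace and mount
 * instance_dir on top of /tmp inside it.
 * Returns 0 on success and -1 on failure.
 */
int setup_private_tmp(const char *instance_dir)
{
	/* Disassociate from the namespace inherited at fork(); needs CAP_SYS_ADMIN. */
	if (unshare(CLONE_NEWNS) == -1) {
		perror("unshare(CLONE_NEWNS)");
		return -1;
	}

	/* Assumption: keep later mounts from propagating to the original tree. */
	if (mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1) {
		perror("mount(MS_PRIVATE)");
		return -1;
	}

	/* Only this process and its future children see the instance as /tmp. */
	if (mount(instance_dir, "/tmp", NULL, MS_BIND, NULL) == -1) {
		perror("mount(MS_BIND)");
		return -1;
	}
	return 0;
}
```

A caller holding CAP_SYS_ADMIN might invoke
setup_private_tmp("/tmp-inst/alice") at login, where /tmp-inst/alice is a
made-up path. Without the capability, unshare() fails with EPERM.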

2.2 Unsharing of virtual memory and/or open files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Consider a client/server application where the server is processing
client requests by creating processes that share resources such as
virtual memory and open files. Without unshare(), the server has to
decide what needs to be shared at the time of creating the process
which services the request. unshare() gives the server the ability to
disassociate parts of the context during the servicing of the
request. For large and complex middleware application frameworks, this
ability to unshare() after the process was created can be very
useful.
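
The following C sketch illustrates the idea on a small scale. A worker is
created with clone(2) sharing the file descriptor table (CLONE_FILES) and
later calls unshare(CLONE_FILES), so that closing a descriptor no longer
affects its creator. The names worker() and fd_survives_worker_close()
and the 64 KiB stack size are invented for this example.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (64 * 1024)

/* Worker: leave the shared descriptor table, then close fd privately. */
static int worker(void *arg)
{
	int fd = *(int *)arg;

	if (unshare(CLONE_FILES) == -1)
		return 2;	/* unshare() refused or unsupported */
	close(fd);		/* affects only the worker's private table */
	return 0;
}

/*
 * Returns 1 if the creator's descriptor is still open after the worker
 * closed its own, 0 if it is not, and -1 if the demo could not be run.
 */
int fd_survives_worker_close(void)
{
	int fd, status = 0, result = -1;
	char *stack;
	pid_t pid;

	fd = open("/dev/null", O_RDONLY);
	if (fd == -1)
		return -1;
	stack = malloc(STACK_SIZE);
	if (!stack) {
		close(fd);
		return -1;
	}

	/* Share the descriptor table, but not the address space. */
	pid = clone(worker, stack + STACK_SIZE, CLONE_FILES | SIGCHLD, &fd);
	if (pid != -1 && waitpid(pid, &status, 0) == pid &&
	    WIFEXITED(status) && WEXITSTATUS(status) == 0)
		result = (fcntl(fd, F_GETFD) != -1);

	free(stack);
	close(fd);
	return result;
}
```

Without the unshare() call in worker(), the close() would take effect in
the creator as well, because both would still be using the same
descriptor table.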

3) Cost
-------

In order to not duplicate code and to handle the fact that unshare()
works on an active task (as opposed to clone/fork working on a newly
allocated inactive task), unshare() had to make minor reorganizational
changes to the copy_* functions utilized by the clone/fork system calls.
There is a cost associated with altering existing, well tested and
stable code to implement a new feature that may not get exercised
extensively in the beginning. However, with proper design and code
review of the changes and creation of an unshare() test for the LTP,
the benefits of this new feature can exceed its cost.

4) Requirements
---------------

unshare() reverses sharing that was done using the clone(2) system call,
so unshare() should have a similar interface to clone(2). That is,
since flags in clone(int flags, void \*stack) specify what should
be shared, similar flags in unshare(int flags) should specify
what should be unshared. Unfortunately, this may appear to invert
the meaning of the flags from the way they are used in clone(2).
However, there was no easy solution that was less confusing and that
allowed incremental context unsharing in the future without an ABI change.

The unshare() interface should accommodate possible future addition of
new context flags without requiring a rebuild of old applications.
If and when new context flags are added, the unshare() design should allow
incremental unsharing of those resources on an as-needed basis.

5) Functional Specification
---------------------------

NAME
        unshare - disassociate parts of the process execution context

SYNOPSIS
        #include <sched.h>

        int unshare(int flags);

DESCRIPTION
        unshare() allows a process to disassociate parts of its execution
        context that are currently being shared with other processes. Part
        of the execution context, such as the namespace, is shared by default
        when a new process is created using fork(2), while other parts,
        such as the virtual memory, open file descriptors, etc, may be
        shared by explicit request to share them when creating a process
        using clone(2).

        The main use of unshare() is to allow a process to control its
        shared execution context without creating a new process.

        The flags argument specifies one, or the bitwise OR of several, of
        the following constants.

        CLONE_FS
                If CLONE_FS is set, file system information of the caller
                is disassociated from the shared file system information.

        CLONE_FILES
                If CLONE_FILES is set, the file descriptor table of the
                caller is disassociated from the shared file descriptor
                table.

        CLONE_NEWNS
                If CLONE_NEWNS is set, the namespace of the caller is
                disassociated from the shared namespace.

        CLONE_VM
                If CLONE_VM is set, the virtual memory of the caller is
                disassociated from the shared virtual memory.

RETURN VALUE
        On success, zero is returned. On failure, -1 is returned and errno is
        set to indicate the error.

ERRORS
        EPERM   CLONE_NEWNS was specified by a non-root process (process
                without CAP_SYS_ADMIN).

        ENOMEM  Cannot allocate sufficient memory to copy parts of caller's
                context that need to be unshared.

        EINVAL  Invalid flag was specified as an argument.

CONFORMING TO
        The unshare() call is Linux-specific and should not be used
        in programs intended to be portable.

SEE ALSO
        clone(2), fork(2)
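
A minimal usage sketch of the interface specified above, assuming a C
library that declares unshare() in <sched.h> when _GNU_SOURCE is defined.
The wrapper name unshare_or_report() and its messages are made up for
illustration; it only distinguishes the three errors listed under ERRORS.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdio.h>

/*
 * Hypothetical wrapper: call unshare() and report the documented errors.
 * Returns 0 on success and -1 on failure, with errno as set by unshare().
 */
int unshare_or_report(int flags)
{
	int err;

	if (unshare(flags) == 0)
		return 0;

	err = errno;	/* save it before stdio can overwrite it */
	switch (err) {
	case EPERM:
		fprintf(stderr, "unshare: CLONE_NEWNS requires CAP_SYS_ADMIN\n");
		break;
	case ENOMEM:
		fprintf(stderr, "unshare: not enough memory to copy context\n");
		break;
	case EINVAL:
		fprintf(stderr, "unshare: invalid flag in 0x%x\n",
			(unsigned int)flags);
		break;
	default:
		fprintf(stderr, "unshare: unexpected errno %d\n", err);
		break;
	}
	errno = err;
	return -1;
}
```

A process that wants a private current working directory could call
unshare_or_report(CLONE_FS) before chdir(2), and fall back to its normal
behavior if the call returns -1.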

6) High Level Design
--------------------

Depending on the flags argument, the unshare() system call allocates
appropriate process context structures, populates them with values from
the current shared version, associates the newly duplicated structures
with the current task structure and releases the corresponding shared
versions. Helper functions of clone (copy_*) could not be used
directly by unshare() for the following two reasons.

  1) clone operates on a newly allocated not-yet-active task
     structure, whereas unshare() operates on the current active
     task. Therefore unshare() has to take the appropriate task_lock()
     before associating newly duplicated context structures.

  2) unshare() has to allocate and duplicate all context structures
     that are being unshared, before associating them with the
     current task and releasing older shared structures. Failure
     to do so will create race conditions and/or an oops when trying
     to back out due to an error. Consider the case of unsharing
     both virtual memory and namespace. After successfully unsharing
     vm, if the system call encounters an error while allocating
     a new namespace structure, the error return path will have to
     reverse the unsharing of vm. As part of the reversal the
     system call will have to go back to the older, shared, vm
     structure, which may not exist anymore.

Therefore code from copy_* functions that allocated and duplicated
the current context structure was moved into new dup_* functions. Now,
copy_* functions call dup_* functions to allocate and duplicate
appropriate context structures and then associate them with the
task structure that is being constructed. The unshare() system call, on
the other hand, performs the following:

  1) Check flags to force missing, but implied, flags.

  2) For each context structure, call the corresponding unshare()
     helper function to allocate and duplicate a new context
     structure, if the appropriate bit is set in the flags argument.

  3) If there is no error in allocation and duplication and there
     are new context structures then lock the current task structure,
     associate new context structures with the current task structure,
     and release the lock on the current task structure.

  4) Appropriately release older, shared, context structures.
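
Steps 2 to 4 above can be modeled in user space. The sketch below is not
kernel code: struct ctx, struct task, the TOY_* flags and all function
names are invented, the contexts are plain reference-counted structures,
and the point where the kernel would hold task_lock() is only marked by a
comment. It is meant to show why duplicating everything before touching
the task makes backing out trivial.

```c
#include <stdlib.h>

#define TOY_FS		0x1	/* invented flags, not the CLONE_* values */
#define TOY_FILES	0x2

/* Toy stand-in for a reference-counted context such as fs or files. */
struct ctx {
	int refs;
	int data;
};

/* Toy stand-in for the task structure. */
struct task {
	struct ctx *fs;
	struct ctx *files;
};

/* Allocate a private copy of a context; NULL on allocation failure. */
static struct ctx *dup_ctx(const struct ctx *old)
{
	struct ctx *copy = malloc(sizeof(*copy));

	if (copy) {
		copy->refs = 1;
		copy->data = old->data;
	}
	return copy;
}

/* Drop one reference and free the context if it was the last one. */
static void put_ctx(struct ctx *c)
{
	if (c && --c->refs == 0)
		free(c);
}

/* Exchange the task's pointer with the copy; *copy then holds the old one. */
static void swap_ctx(struct ctx **slot, struct ctx **copy)
{
	struct ctx *old = *slot;

	*slot = *copy;
	*copy = old;
}

/* Returns 0 on success and -1 on failure; on failure the task is unchanged. */
int toy_unshare(struct task *t, int flags)
{
	struct ctx *new_fs = NULL, *new_files = NULL;
	int err = -1;

	/* Step 2: duplicate everything requested before touching the task. */
	if ((flags & TOY_FS) && !(new_fs = dup_ctx(t->fs)))
		goto out;
	if ((flags & TOY_FILES) && !(new_files = dup_ctx(t->files)))
		goto out;

	/* Step 3: commit; the kernel would hold task_lock() around this. */
	if (new_fs)
		swap_ctx(&t->fs, &new_fs);
	if (new_files)
		swap_ctx(&t->files, &new_files);
	err = 0;
out:
	/*
	 * Step 4 on success: the variables now hold the older, shared
	 * structures. On failure they hold unused private copies.
	 */
	put_ctx(new_fs);
	put_ctx(new_files);
	return err;
}
```

Because the task's pointers are only changed after every duplication has
succeeded, the failure path never has to restore a shared structure that
might already have been released.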

7) Low Level Design
-------------------

Implementation of unshare() can be grouped into the following four
items:

  a) Reorganization of existing copy_* functions

  b) unshare() system call service function

  c) unshare() helper functions for each different process context

  d) Registration of system call number for different architectures

7.1) Reorganization of copy_* functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Each copy function such as copy_mm, copy_namespace, copy_files,
etc, had roughly two components. The first component allocated
and duplicated the appropriate structure and the second component
linked it to the task structure passed in as an argument to the copy
function. The first component was split into its own function.
These dup_* functions allocated and duplicated the appropriate
context structure. The reorganized copy_* functions invoked
their corresponding dup_* functions and then linked the newly
duplicated structures to the task structure with which the
copy function was called.
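
The split can be pictured with a toy user-space model. Everything in it
(struct toy_fs, struct toy_task, TOY_SHARE_FS, dup_toy_fs() and
copy_toy_fs()) is hypothetical and only mirrors the shape of the
reorganization described above, not the actual kernel functions.

```c
#include <stdlib.h>
#include <string.h>

#define TOY_SHARE_FS 0x1	/* invented flag, playing the role of CLONE_FS */

/* Toy context, standing in for something like filesystem information. */
struct toy_fs {
	int users;		/* number of tasks sharing this structure */
	char cwd[64];
};

struct toy_task {
	struct toy_fs *fs;
};

/* First component: allocate and duplicate, without touching any task. */
struct toy_fs *dup_toy_fs(const struct toy_fs *old)
{
	struct toy_fs *copy = malloc(sizeof(*copy));

	if (copy) {
		memcpy(copy->cwd, old->cwd, sizeof(copy->cwd));
		copy->users = 1;
	}
	return copy;
}

/*
 * Reorganized copy function: either share the parent's structure or
 * obtain a duplicate from dup_toy_fs(), then link it to the child.
 * Returns 0 on success and -1 if the duplicate could not be allocated.
 */
int copy_toy_fs(int flags, const struct toy_task *parent,
		struct toy_task *child)
{
	if (flags & TOY_SHARE_FS) {
		parent->fs->users++;	/* one more task shares it */
		child->fs = parent->fs;
		return 0;
	}

	/* Second component: link the fresh duplicate to the new task. */
	child->fs = dup_toy_fs(parent->fs);
	return child->fs ? 0 : -1;
}
```

With the duplication isolated in dup_toy_fs(), a caller that works on an
already running task can reuse it without going through copy_toy_fs().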

7.2) unshare() system call service function
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

   * Check flags
     Force implied flags. If CLONE_THREAD is set, force CLONE_VM.
     If CLONE_VM is set, force CLONE_SIGHAND. If CLONE_SIGHAND is
     set and signals are also being shared, force CLONE_THREAD. If
     CLONE_NEWNS is set, force CLONE_FS.

   * For each context flag, invoke the corresponding unshare_*
     helper routine with flags passed into the system call and a
     reference to a pointer that will point to the new unshared
     structure.

   * If any new structures are created by unshare_* helper
     functions, take the task_lock() on the current task,
     modify appropriate context pointers, and release the
     task lock.

   * For all newly unshared structures, release the corresponding
     older, shared, structures.
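
The flag check in the first bullet can be written down directly. The
function below is a user-space model with an invented name, and the
signals_shared parameter is an assumption standing in for whatever test
the kernel uses to decide that signals are being shared. Only the CLONE_*
constants come from the kernel headers.

```c
#include <linux/sched.h>	/* CLONE_* values from the kernel's UAPI header */
#include <stdbool.h>

/*
 * Hypothetical model of the "check flags" step. Returns the flags with
 * all implied flags forced on, applying the rules in the order listed.
 */
unsigned long force_implied_flags(unsigned long flags, bool signals_shared)
{
	/* Unsharing the thread group implies unsharing virtual memory. */
	if (flags & CLONE_THREAD)
		flags |= CLONE_VM;

	/* Unsharing virtual memory implies unsharing signal handlers. */
	if (flags & CLONE_VM)
		flags |= CLONE_SIGHAND;

	/* Signal handlers cannot be unshared while signals stay shared. */
	if ((flags & CLONE_SIGHAND) && signals_shared)
		flags |= CLONE_THREAD;

	/* A new namespace implies private filesystem information. */
	if (flags & CLONE_NEWNS)
		flags |= CLONE_FS;

	return flags;
}
```

For example, a request for CLONE_NEWNS alone comes back as
CLONE_NEWNS | CLONE_FS, which is the implied-flag case that section 8
asks to be tested.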

7.3) unshare_* helper functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The unshare_* helpers corresponding to CLONE_SYSVSEM, CLONE_SIGHAND,
and CLONE_THREAD return -EINVAL since they are not implemented yet.
The other helpers check the flag value to see if unsharing is
required for that structure. If it is, they invoke the corresponding
dup_* function to allocate and duplicate the structure and return
a pointer to it.
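
A toy model of the two kinds of helper, with invented names and flags. It
assumes that an unimplemented helper returns -EINVAL only when its flag is
actually requested, and that a failed duplication is reported as -ENOMEM,
in line with the ERRORS list in section 5. Both details are assumptions of
this sketch.

```c
#include <errno.h>
#include <stdlib.h>

#define TOY_FS		0x1	/* invented flags for this model only */
#define TOY_SIGHAND	0x2

struct toy_fs {
	int umask;
};

/* Stand-in for a dup_* function: allocate and duplicate. */
static struct toy_fs *dup_toy_fs(const struct toy_fs *old)
{
	struct toy_fs *copy = malloc(sizeof(*copy));

	if (copy)
		*copy = *old;
	return copy;
}

/* Helper for a context whose unsharing is not implemented yet. */
int unshare_toy_sighand(int flags)
{
	return (flags & TOY_SIGHAND) ? -EINVAL : 0;
}

/*
 * Helper for a supported context. On return *new_fsp is NULL if nothing
 * had to be unshared, or points to the private duplicate.
 */
int unshare_toy_fs(int flags, const struct toy_fs *cur,
		   struct toy_fs **new_fsp)
{
	*new_fsp = NULL;
	if (!(flags & TOY_FS))
		return 0;	/* unsharing not requested */

	*new_fsp = dup_toy_fs(cur);
	return *new_fsp ? 0 : -ENOMEM;
}
```

The caller only needs to check the return value of each helper and, if all
succeed, install whichever pointers came back non-NULL.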

7.4) Finally
~~~~~~~~~~~~

Appropriately modify architecture specific code to register the
new system call.

8) Test Specification
---------------------

The test for unshare() should test the following:

  1) Valid flags: Test to check that clone flags for signal and
     signal handlers, for which unsharing is not implemented
     yet, return -EINVAL.

  2) Missing/implied flags: Test to make sure that unsharing the
     namespace without specifying unsharing of the filesystem correctly
     unshares both namespace and filesystem information.

  3) For each of the four (namespace, filesystem, files and vm)
     supported kinds of unsharing, verify that the system call correctly
     unshares the appropriate structure. Verify that unsharing
     them individually as well as in combination with each
     other works as expected.

  4) Concurrent execution: Use shared memory segments and futex on
     an address in the shm segment to synchronize execution of
     about 10 threads. Have a couple of threads execute execve,
     a couple _exit and the rest unshare with different combinations
     of flags. Verify that unsharing is performed as expected and
     that there are no oops or hangs.
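
As an illustration of item 3 for filesystem information only, the
following sketch creates a child with clone(CLONE_FS) and checks whether
the child's chdir() is visible in the parent, first without and then with
unshare(CLONE_FS). The function names, the use of /tmp as a starting
directory and the return-value convention are assumptions of this sketch,
not part of an existing LTP test.

```c
#define _GNU_SOURCE
#include <limits.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (64 * 1024)

/* Child: optionally unshare fs information, then change directory. */
static int child_chdir(void *arg)
{
	int do_unshare = *(int *)arg;

	if (do_unshare && unshare(CLONE_FS) == -1)
		return 2;	/* unshare() refused or unsupported */
	return chdir("/") == 0 ? 0 : 1;
}

/*
 * Run one case. Returns 1 if the parent's working directory changed,
 * 0 if it did not, and -1 if the case could not be run.
 */
static int parent_cwd_changed(int do_unshare)
{
	char before[PATH_MAX], after[PATH_MAX];
	int status = 0, result = -1;
	char *stack;
	pid_t pid;

	if (!getcwd(before, sizeof(before)))
		return -1;
	stack = malloc(STACK_SIZE);
	if (!stack)
		return -1;

	/* The child starts out sharing fs information with the parent. */
	pid = clone(child_chdir, stack + STACK_SIZE, CLONE_FS | SIGCHLD,
		    &do_unshare);
	if (pid != -1 && waitpid(pid, &status, 0) == pid &&
	    WIFEXITED(status) && WEXITSTATUS(status) == 0 &&
	    getcwd(after, sizeof(after)))
		result = (strcmp(before, after) != 0);

	free(stack);
	if (chdir(before) == -1)	/* restore for the next case */
		return -1;
	return result;
}

/* Returns 0 on pass, 1 on failure and -1 if the test could not be run. */
int test_unshare_fs(void)
{
	/* Start somewhere other than "/" so that a change is observable. */
	if (chdir("/tmp") == -1)
		return -1;

	/* Control: while shared, the child's chdir() reaches the parent. */
	if (parent_cwd_changed(0) != 1)
		return -1;

	/* After unshare(CLONE_FS) the parent must be unaffected. */
	return parent_cwd_changed(1);
}
```

The same pattern, a control case followed by the unshared case, could be
repeated for the other supported flags and for combinations of them.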

9) Future Work
--------------

The current implementation of unshare() does not allow unsharing of
signals and signal handlers. Signals are complex to begin with and
to unshare signals and/or signal handlers of a currently running
process is even more complex. If in the future there is a specific
need to allow unsharing of signals and/or signal handlers, it can
be incrementally added to unshare() without affecting legacy
applications using unshare().