unshare system call:
--------------------
This document describes the new system call, unshare. The document
provides an overview of the feature, why it is needed, how it can
be used, its interface specification, design, implementation and
how it can be tested.

Change Log:
-----------
version 0.1  Initial document, Janak Desai (janak@us.ibm.com), Jan 11, 2006

Contents:
---------
        1) Overview
        2) Benefits
        3) Cost
        4) Requirements
        5) Functional Specification
        6) High Level Design
        7) Low Level Design
        8) Test Specification
        9) Future Work

1) Overview
-----------
Most legacy operating system kernels support an abstraction of threads
as multiple execution contexts within a process. These kernels provide
special resources and mechanisms to maintain these "threads". The Linux
kernel, in a clever and simple manner, makes no distinction between
processes and "threads". The kernel allows processes to share resources,
so they can achieve legacy "threads" behavior without requiring
additional data structures and mechanisms in the kernel. The power of
implementing threads in this manner comes not only from its simplicity
but also from allowing application programmers to work outside the
confinement of the all-or-nothing shared resources of legacy threads.
On Linux, at the time of thread creation using the clone system call,
applications can selectively choose which resources to share between
threads.
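
As an illustration of selective sharing at creation time, the following
is a minimal sketch, assuming the glibc clone() wrapper. The names
shared_value, child_fn and value_seen_by_parent are hypothetical and
exist only for this example.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

static volatile int shared_value;   /* written by the child, read by the parent */

static int child_fn(void *arg)
{
    (void)arg;
    shared_value = 42;
    return 0;                       /* becomes the child's exit status */
}

/*
 * Hypothetical helper: create a child with the given sharing flags and
 * return the value the parent sees afterwards, or -1 on error.
 */
int value_seen_by_parent(int share_flags)
{
    const size_t stack_size = 64 * 1024;
    char *stack = malloc(stack_size);
    int status, result = -1;
    pid_t pid;

    if (stack == NULL)
        return -1;

    shared_value = 0;
    /* The stack grows down on most architectures, so pass its top. */
    pid = clone(child_fn, stack + stack_size, share_flags | SIGCHLD, NULL);
    if (pid != -1 && waitpid(pid, &status, 0) == pid)
        result = shared_value;

    free(stack);
    return result;
}
```

With CLONE_VM in share_flags the parent should observe the child's
write; with no sharing flags the child works on its own copy of memory
and the parent's value stays unchanged.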

The unshare system call adds a primitive to the Linux thread model that
allows threads to selectively 'unshare' any resources that were being
shared at the time of their creation. unshare was conceptualized by
Al Viro in August 2000, on the Linux-Kernel mailing list, as part
of the discussion on POSIX threads on Linux. unshare augments the
usefulness of Linux threads for applications that would like to control
shared resources without creating a new process. unshare is a natural
addition to the set of available primitives on Linux that implement
the concept of process/thread as a virtual machine.

2) Benefits
-----------
unshare would be useful to large application frameworks such as PAM
where creating a new process to control sharing/unsharing of process
resources is not possible. Since namespaces are shared by default
when creating a new process using fork or clone, unshare can benefit
even non-threaded applications if they need to disassociate from the
default shared namespace. The following lists two use cases
where unshare can be used.

2.1 Per-security context namespaces
-----------------------------------
unshare can be used to implement polyinstantiated directories using
the kernel's per-process namespace mechanism. Polyinstantiated directories,
such as per-user and/or per-security context instances of /tmp, /var/tmp or
per-security context instances of a user's home directory, isolate user
processes when working with these directories. Using unshare, a PAM
module can easily set up a private namespace for a user at login.
Polyinstantiated directories are required for Common Criteria certification
with the Labeled System Protection Profile. However, with the availability
of the shared-tree feature in the Linux kernel, even regular Linux systems
can benefit from setting up private namespaces at login and
polyinstantiating /tmp, /var/tmp and other directories deemed
appropriate by system administrators.
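
The login-time setup described above could look roughly like the
following sketch. It is not taken from any PAM module: the function
name polyinstantiate_tmp and its instance_dir parameter are
hypothetical, and the step that marks mounts private is an assumption
about systems where mounts are shared by default.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <sys/mount.h>

/*
 * Hypothetical helper: give the calling process a private namespace and
 * bind-mount instance_dir over /tmp. Returns 0 on success or a negative
 * errno value.
 */
int polyinstantiate_tmp(const char *instance_dir)
{
    /* Needs CAP_SYS_ADMIN; fails with EPERM otherwise. */
    if (unshare(CLONE_NEWNS) == -1)
        return -errno;

    /*
     * Assumption: mounts may be marked shared, so stop mount events from
     * propagating back to the original namespace.
     */
    if (mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1)
        return -errno;

    /* From here on, only this process and its children see the new /tmp. */
    if (mount(instance_dir, "/tmp", NULL, MS_BIND, NULL) == -1)
        return -errno;

    return 0;
}
```

A login module would call such a helper once per session, before
dropping privileges, with a per-user or per-context directory of its
own choosing.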

2.2 Unsharing of virtual memory and/or open files
-------------------------------------------------
Consider a client/server application where the server is processing
client requests by creating processes that share resources such as
virtual memory and open files. Without unshare, the server has to
decide what needs to be shared at the time of creating the process
which services the request. unshare gives the server the ability to
disassociate parts of the context during the servicing of the
request. For large and complex middleware application frameworks, this
ability to unshare after the process was created can be very
useful.
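
The open-files case can be sketched as follows, assuming the glibc
clone() and unshare() wrappers. The worker starts out sharing the
server's file descriptor table and may switch to a private copy before
closing a descriptor. All names (worker_args, worker,
server_fd_survives) are hypothetical.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

struct worker_args {
    int fd;
    int do_unshare;
};

static int worker(void *p)
{
    const struct worker_args *args = p;

    /* Optionally switch to a private copy of the descriptor table. */
    if (args->do_unshare && unshare(CLONE_FILES) == -1)
        return 2;

    close(args->fd);     /* affects the server only if the table is shared */
    return 0;
}

/*
 * Returns 1 if the server's descriptor is still open after the worker
 * closed it, 0 if it is gone, and -1 on error.
 */
int server_fd_survives(int do_unshare)
{
    const size_t stack_size = 64 * 1024;
    struct worker_args args;
    char *stack;
    int status, result = -1;
    pid_t pid;

    args.fd = open("/dev/null", O_RDONLY);
    args.do_unshare = do_unshare;
    if (args.fd == -1)
        return -1;
    stack = malloc(stack_size);
    if (stack == NULL) {
        close(args.fd);
        return -1;
    }

    pid = clone(worker, stack + stack_size, CLONE_FILES | SIGCHLD, &args);
    if (pid != -1 && waitpid(pid, &status, 0) == pid &&
        WIFEXITED(status) && WEXITSTATUS(status) == 0)
        result = (fcntl(args.fd, F_GETFD) != -1);

    free(stack);
    if (result != 0)
        close(args.fd);  /* still open unless the worker closed it for us */
    return result;
}
```

Without the unshare call, the worker's close() removes the descriptor
from the table the server is still using; with it, the server's
descriptor is left alone.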

3) Cost
-------
In order to not duplicate code and to handle the fact that unshare
works on an active task (as opposed to clone/fork working on a newly
allocated inactive task), unshare had to make minor reorganizational
changes to the copy_* functions utilized by the clone/fork system calls.
There is a cost associated with altering existing, well-tested and
stable code to implement a new feature that may not get exercised
extensively in the beginning. However, with proper design and code
review of the changes and creation of an unshare test for the LTP,
the benefits of this new feature can exceed its cost.

4) Requirements
---------------
unshare reverses sharing that was done using the clone(2) system call,
so unshare should have an interface similar to clone(2). That is,
since flags in clone(int flags, void *stack) specify what should
be shared, similar flags in unshare(int flags) should specify
what should be unshared. Unfortunately, this may appear to invert
the meaning of the flags from the way they are used in clone(2).
However, there was no easy solution that was less confusing and that
allowed incremental context unsharing in the future without an ABI change.

The unshare interface should accommodate possible future addition of
new context flags without requiring a rebuild of old applications.
If and when new context flags are added, the unshare design should allow
incremental unsharing of those resources on an as-needed basis.

5) Functional Specification
---------------------------
NAME
        unshare - disassociate parts of the process execution context

SYNOPSIS
        #include <sched.h>

        int unshare(int flags);

DESCRIPTION
        unshare allows a process to disassociate parts of its execution
        context that are currently being shared with other processes. Part
        of the execution context, such as the namespace, is shared by
        default when a new process is created using fork(2), while other
        parts, such as the virtual memory, open file descriptors, etc.,
        may be shared by explicit request when creating a process using
        clone(2).

        The main use of unshare is to allow a process to control its
        shared execution context without creating a new process.

        The flags argument specifies one or more of the following
        constants, bitwise-ORed together.

        CLONE_FS
                If CLONE_FS is set, file system information of the caller
                is disassociated from the shared file system information.

        CLONE_FILES
                If CLONE_FILES is set, the file descriptor table of the
                caller is disassociated from the shared file descriptor
                table.

        CLONE_NEWNS
                If CLONE_NEWNS is set, the namespace of the caller is
                disassociated from the shared namespace.

        CLONE_VM
                If CLONE_VM is set, the virtual memory of the caller is
                disassociated from the shared virtual memory.

RETURN VALUE
        On success, zero is returned. On failure, -1 is returned and errno
        is set to indicate the error.

ERRORS
        EPERM   CLONE_NEWNS was specified by a non-root process (process
                without CAP_SYS_ADMIN).

        ENOMEM  Cannot allocate sufficient memory to copy parts of caller's
                context that need to be unshared.

        EINVAL  Invalid flag was specified as an argument.

CONFORMING TO
        The unshare() call is Linux-specific and should not be used
        in programs intended to be portable.

SEE ALSO
        clone(2), fork(2)
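
A caller following the specification above might wrap the call as in
this minimal sketch. It assumes the glibc unshare() wrapper; the name
try_unshare and the wording of the messages are hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical wrapper: call unshare() and report failures on stderr.
 * Returns 0 on success or a negative errno value.
 */
int try_unshare(int flags)
{
    int err;

    if (unshare(flags) == 0)
        return 0;

    err = errno;        /* save before calling anything that may change it */
    switch (err) {
    case EPERM:
        fprintf(stderr, "unshare: insufficient privilege\n");
        break;
    case ENOMEM:
        fprintf(stderr, "unshare: out of memory while copying context\n");
        break;
    case EINVAL:
        fprintf(stderr, "unshare: invalid flag in 0x%x\n", (unsigned)flags);
        break;
    default:
        fprintf(stderr, "unshare: %s\n", strerror(err));
        break;
    }
    return -err;
}
```

For example, try_unshare(CLONE_FS | CLONE_FILES) asks for a private
copy of both the file system information and the file descriptor table
in a single call.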

6) High Level Design
--------------------
Depending on the flags argument, the unshare system call allocates
appropriate process context structures, populates them with values from
the current shared versions, associates the newly duplicated structures
with the current task structure and releases the corresponding shared
versions. Helper functions of clone (copy_*) could not be used
directly by unshare for the following two reasons.
  1) clone operates on a newly allocated, not-yet-active task
     structure, whereas unshare operates on the current active
     task. Therefore unshare has to take the appropriate task_lock()
     before associating newly duplicated context structures.
  2) unshare has to allocate and duplicate all context structures
     that are being unshared before associating them with the
     current task and releasing older shared structures. Failure
     to do so will create race conditions and/or oops when trying
     to back out due to an error. Consider the case of unsharing
     both virtual memory and namespace. After successfully unsharing
     vm, if the system call encounters an error while allocating
     the new namespace structure, the error return path will have to
     reverse the unsharing of vm. As part of the reversal, the
     system call will have to go back to the older, shared vm
     structure, which may not exist anymore.

Therefore the code from the copy_* functions that allocated and
duplicated the current context structure was moved into new dup_*
functions. Now, copy_* functions call dup_* functions to allocate and
duplicate appropriate context structures and then associate them with
the task structure that is being constructed. The unshare system call,
on the other hand, performs the following:
  1) Check flags to force missing, but implied, flags.
  2) For each context structure, call the corresponding unshare
     helper function to allocate and duplicate a new context
     structure, if the appropriate bit is set in the flags argument.
  3) If there is no error in allocation and duplication and there
     are new context structures, then lock the current task structure,
     associate new context structures with the current task structure,
     and release the lock on the current task structure.
  4) Appropriately release older, shared, context structures.
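
The ordering in steps 2 to 4 can be modelled in user space as follows.
This is only a sketch of the allocate-first, commit-later pattern, not
kernel code: struct ctx, struct task, the MODEL_* flags and
model_unshare are hypothetical, a pthread mutex stands in for
task_lock(), and the plain integer reference count is not safe for
real concurrent use.

```c
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical flag bits for this model; not the kernel's CLONE_* values. */
#define MODEL_FS    0x1
#define MODEL_FILES 0x2

struct ctx {                 /* stand-in for a shared context structure */
    int refcount;            /* number of tasks pointing at this context */
    int data;
};

struct task {
    pthread_mutex_t lock;    /* stand-in for task_lock() */
    struct ctx *fs;
    struct ctx *files;
};

/* dup_* analogue: allocate a private copy without touching any task. */
static struct ctx *dup_ctx(const struct ctx *old)
{
    struct ctx *copy = malloc(sizeof(*copy));

    if (copy != NULL) {
        copy->refcount = 1;
        copy->data = old->data;
    }
    return copy;
}

static void put_ctx(struct ctx *c)
{
    if (c != NULL && --c->refcount == 0)
        free(c);
}

int model_unshare(struct task *t, int flags)
{
    struct ctx *fs = NULL, *files = NULL, *tmp;

    /* Step 2: allocate and duplicate first; the task is not touched. */
    if ((flags & MODEL_FS) && t->fs->refcount > 1) {
        fs = dup_ctx(t->fs);
        if (fs == NULL)
            return -ENOMEM;
    }
    if ((flags & MODEL_FILES) && t->files->refcount > 1) {
        files = dup_ctx(t->files);
        if (files == NULL) {
            put_ctx(fs);     /* nothing was committed, so backing out is easy */
            return -ENOMEM;
        }
    }

    /* Step 3: commit under the lock; nothing in here can fail. */
    pthread_mutex_lock(&t->lock);
    if (fs != NULL) { tmp = t->fs; t->fs = fs; fs = tmp; }
    if (files != NULL) { tmp = t->files; t->files = files; files = tmp; }
    pthread_mutex_unlock(&t->lock);

    /* Step 4: drop this task's references to the old shared structures. */
    put_ctx(fs);
    put_ctx(files);
    return 0;
}
```

Because every allocation happens before the first pointer is changed,
the error path never has to restore a shared structure that another
task might already have released.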

7) Low Level Design
-------------------
Implementation of unshare can be grouped into the following four
items:
  a) Reorganization of existing copy_* functions
  b) unshare system call service function
  c) unshare helper functions for each different process context
  d) Registration of system call number for different architectures

  7.1) Reorganization of copy_* functions
       Each copy function, such as copy_mm, copy_namespace, copy_files,
       etc., had roughly two components. The first component allocated
       and duplicated the appropriate structure and the second component
       linked it to the task structure passed in as an argument to the copy
       function. The first component was split into its own function.
       These dup_* functions allocate and duplicate the appropriate
       context structure. The reorganized copy_* functions invoke
       their corresponding dup_* functions and then link the newly
       duplicated structures to the task structure with which the
       copy function was called.
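
The split can be pictured with a small user-space model. The types and
the MODEL_SHARE_FS flag are hypothetical stand-ins, and the real
kernel functions take different arguments; only the division of labour
between the two functions is the point here.

```c
#include <errno.h>
#include <stdlib.h>

#define MODEL_SHARE_FS 0x1   /* hypothetical; plays the role of CLONE_FS */

struct fs_ctx {
    int refcount;
    int mask;
};

struct task {
    struct fs_ctx *fs;
};

/* First component: allocate and duplicate, with no task involved. */
struct fs_ctx *dup_fs(const struct fs_ctx *old)
{
    struct fs_ctx *copy = malloc(sizeof(*copy));

    if (copy != NULL) {
        copy->refcount = 1;
        copy->mask = old->mask;
    }
    return copy;
}

/* Second component: share or duplicate, then link to the new task. */
int copy_fs(int flags, struct task *parent, struct task *child)
{
    if (flags & MODEL_SHARE_FS) {
        parent->fs->refcount++;
        child->fs = parent->fs;
        return 0;
    }
    child->fs = dup_fs(parent->fs);
    return child->fs != NULL ? 0 : -ENOMEM;
}
```

Once the duplication step stands alone, unshare can reuse it on the
current task without going through the linking logic that is specific
to a task under construction.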

  7.2) unshare system call service function
       * Check flags
         Force implied flags. If CLONE_THREAD is set, force CLONE_VM.
         If CLONE_VM is set, force CLONE_SIGHAND. If CLONE_SIGHAND is
         set and signals are also being shared, force CLONE_THREAD. If
         CLONE_NEWNS is set, force CLONE_FS.
       * For each context flag, invoke the corresponding unshare_*
         helper routine with the flags passed into the system call and a
         reference to the pointer that will point to the new unshared
         structure.
       * If any new structures are created by the unshare_* helper
         functions, take the task_lock() on the current task,
         modify the appropriate context pointers, and release the
         task lock.
       * For all newly unshared structures, release the corresponding
         older, shared structures.
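
The flag-forcing rules in the first bullet can be written down
directly. The sketch below is a user-space restatement of those rules,
not the kernel's code: the function name force_implied_flags and the
signals_shared parameter are hypothetical.

```c
#define _GNU_SOURCE
#include <sched.h>

/*
 * Hypothetical helper: return flags with every implied flag added.
 * signals_shared is nonzero when the caller currently shares signals
 * with another task. The order of the checks matters, since each rule
 * may enable the next one.
 */
unsigned long force_implied_flags(unsigned long flags, int signals_shared)
{
    if (flags & CLONE_THREAD)
        flags |= CLONE_VM;
    if (flags & CLONE_VM)
        flags |= CLONE_SIGHAND;
    if ((flags & CLONE_SIGHAND) && signals_shared)
        flags |= CLONE_THREAD;
    if (flags & CLONE_NEWNS)
        flags |= CLONE_FS;
    return flags;
}
```

For instance, a request for CLONE_NEWNS alone comes back as
CLONE_NEWNS | CLONE_FS, which is the implied-flag case the test
specification asks to be checked.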

  7.3) unshare_* helper functions
       For unshare_* helpers corresponding to CLONE_SYSVSEM, CLONE_SIGHAND,
       and CLONE_THREAD, return -EINVAL since they are not implemented yet.
       For others, check the flag value to see if unsharing is
       required for that structure. If it is, invoke the corresponding
       dup_* function to allocate and duplicate the structure and return
       a pointer to it.

  7.4) Appropriately modify architecture-specific code to register the
       new system call.

8) Test Specification
---------------------
The test for unshare should test the following:
  1) Valid flags: Test to check that clone flags for signals and
        signal handlers, for which unsharing is not implemented
        yet, return -EINVAL.
  2) Missing/implied flags: Test to make sure that unsharing the
        namespace without specifying unsharing of the filesystem
        correctly unshares both namespace and filesystem information.
  3) For each of the four supported kinds of unsharing (namespace,
        filesystem, files and vm), verify that the system call
        correctly unshares the appropriate structure. Verify that
        unsharing them individually as well as in combination with
        each other works as expected.
  4) Concurrent execution: Use shared memory segments and a futex on
        an address in the shm segment to synchronize execution of
        about 10 threads. Have a couple of threads execute execve,
        a couple _exit and the rest unshare with different combinations
        of flags. Verify that unsharing is performed as expected and
        that there are no oops or hangs.
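
As one possible shape for the filesystem part of item 3, the sketch
below relies on the umask being part of the file system information
that CLONE_FS shares. It is not an LTP test; the names worker and
parent_umask_after_child are hypothetical.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>

static int worker(void *p)
{
    int do_unshare = *(const int *)p;

    /* Optionally get a private copy of the file system information. */
    if (do_unshare && unshare(CLONE_FS) == -1)
        return 2;

    umask(077);          /* visible to the parent only if still shared */
    return 0;
}

/*
 * Returns the parent's umask after the child changed its own, or -1 on
 * error. The parent starts from 022 and restores its previous umask.
 */
int parent_umask_after_child(int do_unshare)
{
    const size_t stack_size = 64 * 1024;
    char *stack = malloc(stack_size);
    mode_t saved, seen;
    int status, ok;
    pid_t pid;

    if (stack == NULL)
        return -1;

    saved = umask(022);
    pid = clone(worker, stack + stack_size, CLONE_FS | SIGCHLD, &do_unshare);
    ok = pid != -1 && waitpid(pid, &status, 0) == pid &&
         WIFEXITED(status) && WEXITSTATUS(status) == 0;

    seen = umask(saved); /* read the current value and restore the old one */
    free(stack);
    return ok ? (int)seen : -1;
}
```

A test built this way would expect the child's umask to leak into the
parent when do_unshare is 0 and the parent's 022 to survive when it is
1; the same pattern extends to the current working directory.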

9) Future Work
--------------
The current implementation of unshare does not allow unsharing of
signals and signal handlers. Signals are complex to begin with and
to unshare signals and/or signal handlers of a currently running
process is even more complex. If in the future there is a specific
need to allow unsharing of signals and/or signal handlers, it can
be incrementally added to unshare without affecting legacy
applications using unshare.