0001 unshare system call
0002 ===================
0003 
0004 This document describes the new system call, unshare(). The document
0005 provides an overview of the feature, why it is needed, how it can
0006 be used, its interface specification, design, implementation and
0007 how it can be tested.
0008 
0009 Change Log
0010 ----------
0011 version 0.1  Initial document, Janak Desai (janak@us.ibm.com), Jan 11, 2006
0012 
0013 Contents
0014 --------
0015         1) Overview
0016         2) Benefits
0017         3) Cost
0018         4) Requirements
0019         5) Functional Specification
0020         6) High Level Design
0021         7) Low Level Design
0022         8) Test Specification
0023         9) Future Work
0024 
0025 1) Overview
0026 -----------
0027 
0028 Most legacy operating system kernels support an abstraction of threads
0029 as multiple execution contexts within a process. These kernels provide
0030 special resources and mechanisms to maintain these "threads". The Linux
0031 kernel, in a clever and simple manner, makes no distinction
0032 between processes and "threads". The kernel allows processes to share
0033 resources and thus they can achieve legacy "threads" behavior without
0034 requiring additional data structures and mechanisms in the kernel. The
0035 power of implementing threads in this manner comes not only from
0036 its simplicity but also from allowing application programmers to work
0037 outside the confinement of all-or-nothing shared resources of legacy
0038 threads. On Linux, at the time of thread creation using the clone system
0039 call, applications can selectively choose which resources to share
0040 between threads.
0041 
0042 The unshare() system call adds a primitive to the Linux thread model that
0043 allows threads to selectively 'unshare' any resources that were being
0044 shared at the time of their creation. unshare() was conceptualized by
0045 Al Viro in August of 2000, on the Linux-Kernel mailing list, as part
0046 of the discussion on POSIX threads on Linux.  unshare() augments the
0047 usefulness of Linux threads for applications that would like to control
0048 shared resources without creating a new process. unshare() is a natural
0049 addition to the set of available primitives on Linux that implement
0050 the concept of process/thread as a virtual machine.
0051 
0052 2) Benefits
0053 -----------
0054 
0055 unshare() would be useful to large application frameworks such as PAM
0056 where creating a new process to control sharing/unsharing of process
0057 resources is not possible. Since namespaces are shared by default
0058 when creating a new process using fork or clone, unshare() can benefit
0059 even non-threaded applications if they have a need to disassociate
0060 from the default shared namespace. The following are two use cases
0061 where unshare() can be used.
0062 
0063 2.1 Per-security context namespaces
0064 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0065 
0066 unshare() can be used to implement polyinstantiated directories using
0067 the kernel's per-process namespace mechanism. Polyinstantiated directories,
0068 such as per-user and/or per-security context instance of /tmp, /var/tmp or
0069 per-security context instance of a user's home directory, isolate user
0070 processes when working with these directories. Using unshare(), a PAM
0071 module can easily set up a private namespace for a user at login.
0072 Polyinstantiated directories are required for Common Criteria certification
0073 with Labeled System Protection Profile; however, with the availability
0074 of the shared-tree feature in the Linux kernel, even regular Linux systems
0075 can benefit from setting up private namespaces at login and
0076 polyinstantiating /tmp, /var/tmp and other directories deemed
0077 appropriate by system administrators.
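
As a rough illustration of the PAM use case, the sketch below gives the
calling process a private namespace and bind-mounts an instance directory
over /tmp. The helper name setup_private_tmp() and the instance directory
are hypothetical, and remounting "/" as private is an assumption made so
that the bind mount does not propagate back on systems that use shared
mounts. It needs CAP_SYS_ADMIN and is not taken from any actual PAM module.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <sys/mount.h>

/*
 * Hypothetical helper: detach from the inherited namespace and make
 * instance_dir appear as /tmp for this process and its future children.
 * Returns 0 on success or a positive errno value on failure.
 */
int setup_private_tmp(const char *instance_dir)
{
    assert(instance_dir != NULL);

    /* Stop sharing the namespace inherited at fork; needs CAP_SYS_ADMIN. */
    if (unshare(CLONE_NEWNS) == -1)
        return errno;

    /* Assumption: keep later mounts from propagating to other namespaces. */
    if (mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1)
        return errno;

    /* Polyinstantiate /tmp with a bind mount of the instance directory. */
    if (mount(instance_dir, "/tmp", NULL, MS_BIND, NULL) == -1)
        return errno;

    return 0;
}
```

A login-time caller would pass a per-user or per-context directory and
treat any non-zero return as a reason to refuse the session.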
0078 
0079 2.2 unsharing of virtual memory and/or open files
0080 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0081 
0082 Consider a client/server application where the server is processing
0083 client requests by creating processes that share resources such as
0084 virtual memory and open files. Without unshare(), the server has to
0085 decide what needs to be shared at the time of creating the process
0086 which services the request. unshare() gives the server the ability to
0087 disassociate parts of the context during the servicing of the
0088 request. For large and complex middleware application frameworks, this
0089 ability to unshare() after the process was created can be very
0090 useful.
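
A minimal sketch of this scenario, assuming a worker created with clone(2)
and CLONE_FILES. The names worker() and demo_unshare_files() are
illustrative only. The worker starts out with the server's descriptor
table, unshares it partway through, and then closes a descriptor without
affecting the server.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (64 * 1024)
static_assert(STACK_SIZE % 16 == 0, "keep the child stack top aligned");

/* Worker: begins by sharing the server's file descriptor table. */
static int worker(void *arg)
{
    int fd = *(int *)arg;

    /* Partway through the request, switch to a private copy of the table. */
    if (unshare(CLONE_FILES) == -1)
        return 2;               /* unshare unavailable or not permitted */

    /* This close now affects only the worker's private table. */
    close(fd);
    return 0;
}

/*
 * Returns 1 if the server's descriptor survived the worker's close(),
 * 0 if it did not, and -1 if the demonstration could not be run.
 */
int demo_unshare_files(void)
{
    int fd = open("/dev/null", O_RDONLY);
    char *stack = malloc(STACK_SIZE);
    int status, result = -1;
    pid_t pid;

    if (fd == -1 || stack == NULL)
        goto out;

    /* Share the descriptor table with the worker; the stack grows down. */
    pid = clone(worker, stack + STACK_SIZE, CLONE_FILES | SIGCHLD, &fd);
    if (pid == -1 || waitpid(pid, &status, 0) == -1)
        goto out;

    if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        result = (fcntl(fd, F_GETFD) != -1);
out:
    if (fd != -1)
        close(fd);
    free(stack);
    return result;
}
```

Without the unshare() call, the worker's close() would also close the
descriptor in the server, since both would still use the same table.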
0091 
0092 3) Cost
0093 -------
0094 
0095 To avoid duplicating code, and because unshare() works on an active
0096 task (as opposed to clone/fork, which work on a newly allocated
0097 inactive task), unshare() required minor reorganizational changes to
0098 the copy_* functions used by the clone/fork system calls.
0099 There is a cost associated with altering existing, well tested and
0100 stable code to implement a new feature that may not get exercised
0101 extensively in the beginning. However, with proper design and code
0102 review of the changes and creation of an unshare() test for the LTP,
0103 the benefits of this new feature can exceed its cost.
0104 
0105 4) Requirements
0106 ---------------
0107 
0108 unshare() reverses sharing that was done using the clone(2) system call,
0109 so unshare() should have an interface similar to clone(2). That is,
0110 since flags in clone(int flags, void \*stack) specify what should
0111 be shared, similar flags in unshare(int flags) should specify
0112 what should be unshared. Unfortunately, this may appear to invert
0113 the meaning of the flags from the way they are used in clone(2).
0114 However, there was no easy solution that was less confusing and that
0115 allowed incremental context unsharing in the future without an ABI change.
0116 
0117 The unshare() interface should accommodate possible future addition of
0118 new context flags without requiring a rebuild of old applications.
0119 If and when new context flags are added, the unshare() design should allow
0120 incremental unsharing of those resources on an as-needed basis.
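
The apparent inversion of flag meaning can be made concrete with a small
sketch. The helper unshare_flags_for() below is hypothetical. Given the
flags a task was created with, it computes the flags that would ask
unshare() to undo all of the sharing covered by this document.

```c
#include <assert.h>
#include <linux/sched.h>   /* CLONE_* constants; <sched.h> with _GNU_SOURCE also has them */

/* Contexts that this document describes as unsharable. */
#define UNSHARE_SUPPORTED (CLONE_FS | CLONE_FILES | CLONE_VM | CLONE_NEWNS)

/* Hypothetical helper: map clone() flags to the matching unshare() flags. */
unsigned long unshare_flags_for(unsigned long clone_flags)
{
    /* Same bit, opposite request: clone() shares it, unshare() stops sharing it. */
    unsigned long flags = clone_flags & (CLONE_FS | CLONE_FILES | CLONE_VM);

    /* The namespace is shared unless clone() was given CLONE_NEWNS. */
    if (!(clone_flags & CLONE_NEWNS))
        flags |= CLONE_NEWNS;

    assert((flags & ~(unsigned long)UNSHARE_SUPPORTED) == 0);
    return flags;
}
```

For example, a task created with CLONE_FS | CLONE_FILES would pass
CLONE_FS | CLONE_FILES | CLONE_NEWNS to become fully independent in the
sense used here.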
0121 
0122 5) Functional Specification
0123 ---------------------------
0124 
0125 NAME
0126         unshare - disassociate parts of the process execution context
0127 
0128 SYNOPSIS
0129         #include <sched.h>
0130 
0131         int unshare(int flags);
0132 
0133 DESCRIPTION
0134         unshare() allows a process to disassociate parts of its execution
0135         context that are currently being shared with other processes. Part
0136         of the execution context, such as the namespace, is shared by default
0137         when a new process is created using fork(2), while other parts,
0138         such as the virtual memory, open file descriptors, etc., are
0139         shared only by explicit request when creating a process
0140         using clone(2).
0141 
0142         The main use of unshare() is to allow a process to control its
0143         shared execution context without creating a new process.
0144 
0145         The flags argument specifies one of, or the bitwise OR of several of,
0146         the following constants.
0147 
0148         CLONE_FS
0149                 If CLONE_FS is set, file system information of the caller
0150                 is disassociated from the shared file system information.
0151 
0152         CLONE_FILES
0153                 If CLONE_FILES is set, the file descriptor table of the
0154                 caller is disassociated from the shared file descriptor
0155                 table.
0156 
0157         CLONE_NEWNS
0158                 If CLONE_NEWNS is set, the namespace of the caller is
0159                 disassociated from the shared namespace.
0160 
0161         CLONE_VM
0162                 If CLONE_VM is set, the virtual memory of the caller is
0163                 disassociated from the shared virtual memory.
0164 
0165 RETURN VALUE
0166         On success, zero is returned. On failure, -1 is returned and errno is set.
0167 
0168 ERRORS
0169         EPERM   CLONE_NEWNS was specified by a non-root process (process
0170                 without CAP_SYS_ADMIN).
0171 
0172         ENOMEM  Cannot allocate sufficient memory to copy parts of caller's
0173                 context that need to be unshared.
0174 
0175         EINVAL  Invalid flag was specified as an argument.
0176 
0177 CONFORMING TO
0178         The unshare() call is Linux-specific and should not be used
0179         in programs intended to be portable.
0180 
0181 SEE ALSO
0182         clone(2), fork(2)
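
A caller might wrap the interface above as follows. This is only a sketch:
try_unshare() and unshare_error_hint() are hypothetical names, and the
hints simply restate the ERRORS list.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Explain an errno value using the ERRORS list of the specification. */
const char *unshare_error_hint(int err)
{
    switch (err) {
    case 0:      return "success";
    case EPERM:  return "CLONE_NEWNS requires CAP_SYS_ADMIN";
    case ENOMEM: return "not enough memory to copy the context";
    case EINVAL: return "invalid flag in argument";
    default:     return "unexpected error";
    }
}

/*
 * Hypothetical wrapper: call unshare() and describe the outcome in msg.
 * Returns 0 on success or the errno value set by unshare().
 */
int try_unshare(int flags, char *msg, size_t len)
{
    int err;

    assert(msg != NULL && len > 0);
    err = (unshare(flags) == -1) ? errno : 0;
    snprintf(msg, len, "unshare(0x%x): %s", (unsigned int)flags,
             unshare_error_hint(err));
    return err;
}
```

A caller would typically log msg and fall back to a fork-based approach
when the return value is non-zero.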
0183 
0184 6) High Level Design
0185 --------------------
0186 
0187 Depending on the flags argument, the unshare() system call allocates
0188 appropriate process context structures, populates them with values from
0189 the current shared versions, associates the newly duplicated structures
0190 with the current task structure and releases the corresponding shared
0191 versions. Helper functions of clone (copy_*) could not be used
0192 directly by unshare() for the following two reasons.
0193 
0194   1) clone operates on a newly allocated not-yet-active task
0195      structure, whereas unshare() operates on the current active
0196      task. Therefore unshare() has to take the appropriate task_lock()
0197      before associating newly duplicated context structures.
0198 
0199   2) unshare() has to allocate and duplicate all context structures
0200      that are being unshared, before associating them with the
0201      current task and releasing older shared structures. Failure
0202      to do so will create race conditions and/or oops when trying
0203      to back out due to an error. Consider the case of unsharing
0204      both virtual memory and namespace. After successfully unsharing
0205      vm, if the system call encounters an error while allocating
0206      a new namespace structure, the error return path will have to
0207      reverse the unsharing of vm. As part of the reversal the
0208      system call will have to go back to the older, shared, vm
0209      structure, which may not exist anymore.
0210 
0211 Therefore, the code in the copy_* functions that allocated and duplicated
0212 the current context structure was moved into new dup_* functions. Now,
0213 the copy_* functions call the dup_* functions to allocate and duplicate
0214 appropriate context structures and then associate them with the
0215 task structure that is being constructed. The unshare() system call, on
0216 the other hand, performs the following:
0217 
0218   1) Check flags to force missing, but implied, flags
0219 
0220   2) For each context structure, call the corresponding unshare()
0221      helper function to allocate and duplicate a new context
0222      structure, if the appropriate bit is set in the flags argument.
0223 
0224   3) If there is no error in allocation and duplication and there
0225      are new context structures then lock the current task structure,
0226      associate new context structures with the current task structure,
0227      and release the lock on the current task structure.
0228 
0229   4) Appropriately release older, shared, context structures.
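
Steps 2 to 4 above can be modelled in user space as follows. This is a
simplified sketch, not kernel code: struct ctx, struct task, the locked
field and every function name are stand-ins invented for illustration,
and the fail_files parameter exists only to simulate an allocation
failure. Step 1 (forcing implied flags) is left out.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-ins for kernel structures; all names are illustrative. */
struct ctx  { int refs; int data; };
struct task { int locked; struct ctx *fs; struct ctx *files; };

/* Stand-in for a dup_* function; 'fail' simulates an allocation failure. */
static struct ctx *dup_ctx(const struct ctx *old, int fail)
{
    struct ctx *c = fail ? NULL : malloc(sizeof(*c));

    if (c) {
        c->refs = 1;
        c->data = old->data;
    }
    return c;
}

/* Drop one reference; free the structure when the last user is gone. */
static void put_ctx(struct ctx *c)
{
    if (c && --c->refs == 0)
        free(c);
}

/* Install new_ctx in a task slot and hand back the previous structure. */
static struct ctx *swap_ctx(struct ctx **slot, struct ctx *new_ctx)
{
    struct ctx *old = *slot;

    *slot = new_ctx;
    return old;
}

/* Returns 0 on success, or -1 with the task left untouched. */
int model_unshare(struct task *t, int want_fs, int want_files, int fail_files)
{
    struct ctx *new_fs = NULL, *new_files = NULL;

    /* Step 2: duplicate everything first, without touching the task. */
    if (want_fs && !(new_fs = dup_ctx(t->fs, 0)))
        return -1;
    if (want_files && !(new_files = dup_ctx(t->files, fail_files))) {
        put_ctx(new_fs);        /* backing out frees private copies only */
        return -1;
    }

    /* Step 3: switch pointers while holding the (modelled) task lock. */
    assert(!t->locked);
    t->locked = 1;
    if (new_fs)
        new_fs = swap_ctx(&t->fs, new_fs);
    if (new_files)
        new_files = swap_ctx(&t->files, new_files);
    t->locked = 0;

    /* Step 4: new_fs and new_files now refer to the old shared structures. */
    put_ctx(new_fs);
    put_ctx(new_files);
    return 0;
}
```

Because nothing is attached to the task until every duplication has
succeeded, the error path never has to restore a shared structure that
might already have been released.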
0230 
0231 7) Low Level Design
0232 -------------------
0233 
0234 Implementation of unshare() can be grouped into the following four
0235 items:
0236 
0237   a) Reorganization of existing copy_* functions
0238 
0239   b) unshare() system call service function
0240 
0241   c) unshare() helper functions for each different process context
0242 
0243   d) Registration of system call number for different architectures
0244 
0245 7.1) Reorganization of copy_* functions
0246 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0247 
0248 Each copy function such as copy_mm, copy_namespace, copy_files,
0249 etc, had roughly two components. The first component allocated
0250 and duplicated the appropriate structure and the second component
0251 linked it to the task structure passed in as an argument to the copy
0252 function. The first component was split into its own function.
0253 These dup_* functions allocated and duplicated the appropriate
0254 context structure. The reorganized copy_* functions invoked
0255 their corresponding dup_* functions and then linked the newly
0256 duplicated structures to the task structure with which the
0257 copy function was called.
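
The split can be pictured with a user-space model of the file descriptor
table. The structures and the names model_dup_files() and
model_copy_files() are invented for this sketch and do not reproduce the
kernel's own signatures.

```c
#include <assert.h>
#include <linux/sched.h>   /* CLONE_* constants; <sched.h> with _GNU_SOURCE also has them */
#include <stdlib.h>
#include <string.h>

#define MAX_FDS 8

struct files { int refs; int fd[MAX_FDS]; };    /* stand-in for the fd table */
struct task  { struct files *files; };          /* stand-in for the task */

/* First component: allocate and duplicate, with no knowledge of any task. */
struct files *model_dup_files(const struct files *old)
{
    struct files *f;

    assert(old != NULL);
    f = malloc(sizeof(*f));
    if (f) {
        memcpy(f->fd, old->fd, sizeof(f->fd));
        f->refs = 1;
    }
    return f;
}

/* Second component: link the result to the task that is being constructed. */
int model_copy_files(unsigned long clone_flags, struct files *parent,
                     struct task *child)
{
    if (clone_flags & CLONE_FILES) {
        parent->refs++;                 /* share the parent's table */
        child->files = parent;
        return 0;
    }
    child->files = model_dup_files(parent);
    return child->files ? 0 : -1;
}
```

With the duplication isolated in its own function, an unshare-style
caller can reuse it on an active task without going through the copy
path.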
0258 
0259 7.2) unshare() system call service function
0260 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0261 
0262        * Check flags
0263          Force implied flags. If CLONE_THREAD is set, force CLONE_VM.
0264          If CLONE_VM is set, force CLONE_SIGHAND. If CLONE_SIGHAND is
0265          set and signals are also being shared, force CLONE_THREAD. If
0266          CLONE_NEWNS is set, force CLONE_FS.
0267 
0268        * For each context flag, invoke the corresponding unshare_*
0269          helper routine with the flags passed into the system call and a
0270          reference to a pointer to the new unshared structure.
0271 
0272        * If any new structures are created by unshare_* helper
0273          functions, take the task_lock() on the current task,
0274          modify appropriate context pointers, and release the
0275          task lock.
0276 
0277        * For all newly unshared structures, release the corresponding
0278          older, shared, structures.
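
The flag check in the first item translates almost directly into code.
The sketch below applies the rules in the order they are listed.
force_implied_flags() and its signals_shared parameter are illustrative
names that stand in for the kernel's own check.

```c
#include <assert.h>
#include <linux/sched.h>   /* CLONE_* constants; <sched.h> with _GNU_SOURCE also has them */

/* Add the flags implied by those the caller passed, in a single pass. */
unsigned long force_implied_flags(unsigned long flags, int signals_shared)
{
    /* Leaving a thread group also means leaving its address space. */
    if (flags & CLONE_THREAD)
        flags |= CLONE_VM;

    /* A private address space needs private signal handlers. */
    if (flags & CLONE_VM)
        flags |= CLONE_SIGHAND;

    /* Signal handlers cannot be unshared while signals stay shared. */
    if ((flags & CLONE_SIGHAND) && signals_shared)
        flags |= CLONE_THREAD;

    /* A new namespace needs its own filesystem information. */
    if (flags & CLONE_NEWNS)
        flags |= CLONE_FS;

    return flags;
}

/* Usage: asking for a new namespace also unshares filesystem information. */
void implied_flags_example(void)
{
    assert(force_implied_flags(CLONE_NEWNS, 0) == (CLONE_NEWNS | CLONE_FS));
}
```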
0279 
0280 7.3) unshare_* helper functions
0281 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0282 
0283 The unshare_* helpers corresponding to CLONE_SYSVSEM, CLONE_SIGHAND,
0284 and CLONE_THREAD return -EINVAL since they are not implemented yet.
0285 The other helpers check the flag value to see if unsharing is
0286 required for that structure. If it is, they invoke the corresponding
0287 dup_* function to allocate and duplicate the structure and return
0288 a pointer to it.
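
The two kinds of helper might look roughly like this in a user-space
model. The structure, the function names and the extra check that the
structure has more than one user are assumptions made for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <linux/sched.h>   /* CLONE_* constants; <sched.h> with _GNU_SOURCE also has them */
#include <stdlib.h>

struct fs_ctx { int refs; int umask; };     /* stand-in for fs information */

/* Unsupported context: reject the flag outright. */
int model_unshare_sighand(unsigned long flags)
{
    return (flags & CLONE_SIGHAND) ? -EINVAL : 0;
}

/*
 * Supported context: if the flag is set and the structure is really
 * shared, hand back a private duplicate through new_fsp.
 */
int model_unshare_fs(unsigned long flags, const struct fs_ctx *cur,
                     struct fs_ctx **new_fsp)
{
    assert(cur != NULL && new_fsp != NULL);

    *new_fsp = NULL;
    if (!(flags & CLONE_FS) || cur->refs <= 1)
        return 0;                           /* nothing to unshare */

    *new_fsp = malloc(sizeof(**new_fsp));   /* the dup_* step */
    if (*new_fsp == NULL)
        return -ENOMEM;
    (*new_fsp)->refs = 1;
    (*new_fsp)->umask = cur->umask;
    return 0;
}
```

The caller collects the returned pointers and installs only the non-NULL
ones once every helper has succeeded.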
0289 
0290 7.4) Finally
0291 ~~~~~~~~~~~~
0292 
0293 Appropriately modify architecture specific code to register the
0294 new system call.
0295 
0296 8) Test Specification
0297 ---------------------
0298 
0299 The test for unshare() should test the following:
0300 
0301   1) Valid flags: Test to check that clone flags for signal and
0302      signal handlers, for which unsharing is not implemented
0303      yet, return -EINVAL.
0304 
0305   2) Missing/implied flags: Test to make sure that unsharing the
0306      namespace without specifying unsharing of the filesystem correctly
0307      unshares both namespace and filesystem information.
0308 
0309   3) For each of the four supported contexts (namespace, filesystem,
0310      files and vm), verify that the system call correctly
0311      unshares the appropriate structure. Verify that unsharing
0312      them individually as well as in combination with each
0313      other works as expected.
0314 
0315   4) Concurrent execution: Use shared memory segments and futex on
0316      an address in the shm segment to synchronize execution of
0317      about 10 threads. Have a couple of threads execute execve,
0318      a couple call _exit and the rest call unshare with different
0319      combinations of flags. Verify that unsharing is performed as
0320      expected and that there are no oops or hangs.
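
As one possible shape for a test under item 3, the sketch below checks
filesystem information only. The names child_fn() and test_unshare_fs()
are hypothetical and this is not an actual LTP test. A child created with
CLONE_FS unshares, changes directory, and the parent verifies that its
own working directory did not move.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (64 * 1024)
static_assert(STACK_SIZE % 16 == 0, "keep the child stack top aligned");

/* Child: shares cwd with the parent (CLONE_FS) until it calls unshare(). */
static int child_fn(void *arg)
{
    if (unshare(CLONE_FS) == -1)
        return 2;               /* unshare unavailable or not permitted */
    return chdir((const char *)arg) == -1 ? 3 : 0;
}

/* Returns 1 = pass, 0 = fail, -1 = the test could not be run. */
int test_unshare_fs(void)
{
    char before[PATH_MAX], after[PATH_MAX];
    char *stack = malloc(STACK_SIZE);
    const char *target;
    int status, result = -1;
    pid_t pid;

    if (stack == NULL || getcwd(before, sizeof(before)) == NULL)
        goto out;

    /* Pick a directory that differs from the current one. */
    target = strcmp(before, "/") == 0 ? "/tmp" : "/";

    pid = clone(child_fn, stack + STACK_SIZE, CLONE_FS | SIGCHLD,
                (void *)target);
    if (pid == -1 || waitpid(pid, &status, 0) == -1)
        goto out;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        goto out;
    if (getcwd(after, sizeof(after)) == NULL)
        goto out;

    /* The child's chdir() must not have moved the parent. */
    result = (strcmp(before, after) == 0);
out:
    free(stack);
    return result;
}
```

A fuller test would repeat the same pattern for the other supported
contexts and for combinations of flags.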
0321 
0322 9) Future Work
0323 --------------
0324 
0325 The current implementation of unshare() does not allow unsharing of
0326 signals and signal handlers. Signals are complex to begin with and
0327 to unshare signals and/or signal handlers of a currently running
0328 process is even more complex. If in the future there is a specific
0329 need to allow unsharing of signals and/or signal handlers, it can
0330 be incrementally added to unshare() without affecting legacy
0331 applications using unshare().
0332