0001 .. SPDX-License-Identifier: GPL-2.0
0002
0003 ===============
0004 Shared Subtrees
0005 ===============
0006
0007 .. Contents:
0008 1) Overview
0009 2) Features
0010 3) Setting mount states
0011 4) Use cases
0012 5) Detailed semantics
0013 6) Quiz
0014 7) FAQ
0015 8) Implementation
0016
0017
0018 1) Overview
0019 -----------
0020
0021 Consider the following situation:
0022
0023 A process wants to clone its own namespace, but still wants to access the CD
0024 that got mounted recently. Shared subtree semantics provide the necessary
0025 mechanism to accomplish the above.
0026
0027 They also provide the necessary building blocks for features like per-user
0028 namespaces and versioned filesystems.
0029
0030 2) Features
0031 -----------
0032
0033 Shared subtrees provide four different flavors of mounts (struct vfsmount, to
0034 be precise):
0035
0036 a. shared mount
0037 b. slave mount
0038 c. private mount
0039 d. unbindable mount
0040
0041
0042 2a) A shared mount can be replicated to any number of mountpoints, and all the
0043 replicas continue to be exactly the same.
0044
0045 Here is an example:
0046
0047 Let's say /mnt has a mount that is shared::
0048
0049 mount --make-shared /mnt
0050
0051 Note: mount(8) command now supports the --make-shared flag,
0052 so the sample 'smount' program is no longer needed and has been
0053 removed.
0054
0055 ::
0056
0057 # mount --bind /mnt /tmp
0058
0059 The above command replicates the mount at /mnt to the mountpoint /tmp
0060 and the contents of both the mounts remain identical.
0061
0062 ::
0063
0064 #ls /mnt
0065 a b c
0066
0067 #ls /tmp
0068 a b c
0069
0070 Now let's say we mount a device at /tmp/a::
0071
0072 # mount /dev/sd0 /tmp/a
0073
0074 #ls /tmp/a
0075 t1 t2 t3
0076
0077 #ls /mnt/a
0078 t1 t2 t3
0079
0080 Note that the mount has propagated to the mount at /mnt as well.
0081
0082 And the same is true even when /dev/sd0 is mounted on /mnt/a. The
0083 contents will be visible under /tmp/a too.
0084
0085
0086 2b) A slave mount is like a shared mount except that mount and umount events
0087 only propagate towards it.
0088
0089 All slave mounts have a master mount, which is shared.
0090
0091 Here is an example:
0092
0093 Let's say /mnt has a mount which is shared.
0094 # mount --make-shared /mnt
0095
0096 Let's bind mount /mnt to /tmp
0097 # mount --bind /mnt /tmp
0098
0099 the new mount at /tmp becomes a shared mount and it is a replica of
0100 the mount at /mnt.
0101
0102 Now let's make the mount at /tmp a slave of /mnt
0103 # mount --make-slave /tmp
0104
0105 let's mount /dev/sd0 on /mnt/a
0106 # mount /dev/sd0 /mnt/a
0107
0108 #ls /mnt/a
0109 t1 t2 t3
0110
0111 #ls /tmp/a
0112 t1 t2 t3
0113
0114 Note the mount event has propagated to the mount at /tmp
0115
0116 However let's see what happens if we mount something on the mount at /tmp
0117
0118 # mount /dev/sd1 /tmp/b
0119
0120 #ls /tmp/b
0121 s1 s2 s3
0122
0123 #ls /mnt/b
0124
0125 Note how the mount event has not propagated to the mount at
0126 /mnt
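
The one-way propagation shown above can be mimicked with a toy model. The
Python sketch below is not kernel code: the 'Mount' class and its methods are
hypothetical names chosen for illustration, and the model tracks nothing but
mount events.

```python
# Toy model of master/slave propagation (illustrative only, not kernel code).
class Mount:
    def __init__(self, path):
        self.path = path
        self.slaves = []      # mounts that receive events from this one
        self.submounts = {}   # subdirectory -> device mounted there

    def add_slave(self, slave):
        self.slaves.append(slave)

    def mount(self, subdir, device):
        """Mount locally, then push the event down to the slaves only."""
        self.submounts[subdir] = device
        for slave in self.slaves:
            slave.mount(subdir, device)


mnt = Mount("/mnt")
tmp = Mount("/tmp")
mnt.add_slave(tmp)               # stands in for: mount --make-slave /tmp

mnt.mount("a", "/dev/sd0")       # propagates from the master to the slave
tmp.mount("b", "/dev/sd1")       # stays local to the slave
print(sorted(tmp.submounts))     # ['a', 'b']
print(sorted(mnt.submounts))     # ['a']
```

Events travel along the 'slaves' list only, so nothing mounted on the slave
ever reaches the master.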
0127
0128
0129 2c) A private mount does not forward or receive propagation.
0130
0131 This is the mount we are familiar with. It's the default type.
0132
0133
0134 2d) An unbindable mount is a private mount that cannot be bind mounted.
0135
0136 let's say we have a mount at /mnt and we make it unbindable::
0137
0138 # mount --make-unbindable /mnt
0139
0140 Let's try to bind mount this mount somewhere else::
0141
0142 # mount --bind /mnt /tmp
0143 mount: wrong fs type, bad option, bad superblock on /mnt,
0144 or too many mounted file systems
0145
0146 Binding an unbindable mount is an invalid operation.
0147
0148
0149 3) Setting mount states
0150
0151 The mount command (util-linux package) can be used to set mount
0152 states::
0153
0154 mount --make-shared mountpoint
0155 mount --make-slave mountpoint
0156 mount --make-private mountpoint
0157 mount --make-unbindable mountpoint
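
At the system-call level these options correspond to propagation flags passed
to mount(2). The Python sketch below lists the flag values as defined in
<linux/mount.h>; the helper 'propagation_flags' is a hypothetical name, and the
code only computes the flag word, it does not call mount(2). The recursive
'r' variants (such as --make-rshared, used later in this document) add MS_REC.

```python
# Propagation flags as defined in <linux/mount.h>.
MS_REC        = 1 << 14   # apply to the whole subtree (the "r" variants)
MS_UNBINDABLE = 1 << 17
MS_PRIVATE    = 1 << 18
MS_SLAVE      = 1 << 19
MS_SHARED     = 1 << 20

_BASE = {
    "shared": MS_SHARED,
    "slave": MS_SLAVE,
    "private": MS_PRIVATE,
    "unbindable": MS_UNBINDABLE,
}


def propagation_flags(option):
    """Translate e.g. '--make-rslave' into a mount(2) flag word (hypothetical helper)."""
    prefix = "--make-"
    if not option.startswith(prefix):
        raise ValueError(option)
    name = option[len(prefix):]
    if name in _BASE:
        return _BASE[name]
    # A leading 'r' selects the recursive form of the same state change.
    if name.startswith("r") and name[1:] in _BASE:
        return _BASE[name[1:]] | MS_REC
    raise ValueError(option)


print(hex(propagation_flags("--make-shared")))    # 0x100000
print(hex(propagation_flags("--make-rslave")))    # 0x84000
```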
0158
0159
0160 4) Use cases
0161 ------------
0162
0163 A) A process wants to clone its own namespace, but still wants to
0164 access the CD that got mounted recently.
0165
0166 Solution:
0167
0168 The system administrator can make the mount at /cdrom shared::
0169
0170 mount --bind /cdrom /cdrom
0171 mount --make-shared /cdrom
0172
0173 Now any process that clones off a new namespace will have a
0174 mount at /cdrom which is a replica of the same mount in the
0175 parent namespace.
0176
0177 So when a CD is inserted and mounted at /cdrom that mount gets
0178 propagated to the other mount at /cdrom in all the other clone
0179 namespaces.
0180
0181 B) A process wants its mounts invisible to any other process, but
0182 still be able to see the other system mounts.
0183
0184 Solution:
0185
0186 To begin with, the administrator can mark the entire mount tree
0187 as shareable::
0188
0189 mount --make-rshared /
0190
0191 A new process can clone off a new namespace. And mark some part
0192 of its namespace as slave::
0193
0194 mount --make-rslave /myprivatetree
0195
0196 Henceforth any mounts within /myprivatetree done by the
0197 process will not show up in any other namespace. However, mounts
0198 done in the parent namespace under /myprivatetree still show
0199 up in the process's namespace.
0200
0201
0202 Apart from the above semantics this feature provides the
0203 building blocks to solve the following problems:
0204
0205 C) Per-user namespace
0206
0207 The above semantics allow a way to share mounts across
0208 namespaces. But namespaces are associated with processes. If
0209 namespaces are made first class objects with user API to
0210 associate/disassociate a namespace with userid, then each user
0211 could have his/her own namespace and tailor it to his/her
0212 requirements. This needs to be supported in PAM.
0213
0214 D) Versioned files
0215
0216 If the entire mount tree is visible at multiple locations, then
0217 an underlying versioning file system can return different
0218 versions of the file depending on the path used to access that
0219 file.
0220
0221 An example is::
0222
0223 mount --make-shared /
0224 mount --rbind / /view/v1
0225 mount --rbind / /view/v2
0226 mount --rbind / /view/v3
0227 mount --rbind / /view/v4
0228
0229 and if /usr has a versioning filesystem mounted, then that
0230 mount appears at /view/v1/usr, /view/v2/usr, /view/v3/usr and
0231 /view/v4/usr too
0232
0233 A user can request v3 version of the file /usr/fs/namespace.c
0234 by accessing /view/v3/usr/fs/namespace.c . The underlying
0235 versioning filesystem can then decipher that v3 version of the
0236 filesystem is being requested and return the corresponding
0237 inode.
0238
0239 5) Detailed semantics
0240 ---------------------
0241 The section below explains the detailed semantics of
0242 bind, rbind, move, mount, umount and clone-namespace operations.
0243
0244 Note: the word 'vfsmount' and the noun 'mount' have been used
0245 to mean the same thing, throughout this document.
0246
0247 5a) Mount states
0248
0249 A given mount can be in one of the following states:
0250
0251 1) shared
0252 2) slave
0253 3) shared and slave
0254 4) private
0255 5) unbindable
0256
0257 A 'propagation event' is defined as an event generated on a vfsmount
0258 that leads to mount or unmount actions in other vfsmounts.
0259
0260 A 'peer group' is defined as a group of vfsmounts that propagate
0261 events to each other.
0262
0263 (1) Shared mounts
0264
0265 A 'shared mount' is defined as a vfsmount that belongs to a
0266 'peer group'.
0267
0268 For example::
0269
0270 mount --make-shared /mnt
0271 mount --bind /mnt /tmp
0272
0273 The mount at /mnt and that at /tmp are both shared and belong
0274 to the same peer group. Anything mounted or unmounted under
0275 /mnt or /tmp is reflected in all the other mounts of its peer
0276 group.
0277
0278
0279 (2) Slave mounts
0280
0281 A 'slave mount' is defined as a vfsmount that receives
0282 propagation events and does not forward propagation events.
0283
0284 A slave mount as the name implies has a master mount from which
0285 mount/unmount events are received. Events do not propagate from
0286 the slave mount to the master. Only a shared mount can be made
0287 a slave by executing the following command::
0288
0289 mount --make-slave mount
0290
0291 A shared mount that is made a slave is no longer shared unless
0292 it is modified to become shared again.
0293
0294 (3) Shared and Slave
0295
0296 A vfsmount can be both shared as well as slave. This state
0297 indicates that the mount is a slave of some vfsmount, and
0298 has its own peer group too. This vfsmount receives propagation
0299 events from its master vfsmount, and also forwards propagation
0300 events to its 'peer group' and to its slave vfsmounts.
0301
0302 Strictly speaking, the vfsmount is shared having its own
0303 peer group, and this peer-group is a slave of some other
0304 peer group.
0305
0306 Only a slave vfsmount can be made 'shared and slave', either by
0307 executing the following command::
0308
0309 mount --make-shared mount
0310
0311 or by moving the slave vfsmount under a shared vfsmount.
0312
0313 (4) Private mount
0314
0315 A 'private mount' is defined as a vfsmount that does not
0316 receive or forward any propagation events.
0317
0318 (5) Unbindable mount
0319
0320 An 'unbindable mount' is defined as a vfsmount that does not
0321 receive or forward any propagation events and cannot
0322 be bind mounted.
0323
0324
0325 State diagram:
0326
0327 The state diagram below explains the state transition of a mount,
0328 in response to various commands::
0329
0330 --------------------------------------------------------------------------
0331 |             |make-shared | make-slave   | make-private |make-unbindable|
0332 |-------------|------------|--------------|--------------|---------------|
0333 |shared       |shared      |*slave/private| private      | unbindable    |
0334 |             |            |              |              |               |
0335 |-------------|------------|--------------|--------------|---------------|
0336 |slave        |shared      | **slave      | private      | unbindable    |
0337 |             |and slave   |              |              |               |
0338 |-------------|------------|--------------|--------------|---------------|
0339 |shared       |shared      | slave        | private      | unbindable    |
0340 |and slave    |and slave   |              |              |               |
0341 |-------------|------------|--------------|--------------|---------------|
0342 |private      |shared      | **private    | private      | unbindable    |
0343 |-------------|------------|--------------|--------------|---------------|
0344 |unbindable   |shared      | **unbindable | private      | unbindable    |
0345 --------------------------------------------------------------------------
0346
0347 * if the shared mount is the only mount in its peer group, making it
0348 slave, makes it private automatically. Note that there is no master to
0349 which it can be slaved to.
0350
0351 ** slaving a non-shared mount has no effect on the mount.
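
The state diagram can also be read as a lookup function. The Python sketch
below simply encodes the table and its two footnotes; the function name
'transition' and the 'has_peers' parameter are hypothetical, introduced only
for this illustration.

```python
# The state diagram above, encoded as a function (illustrative only).
def transition(state, command, has_peers=True):
    """Return the new state of a mount after a --make-* command.

    state:     'shared', 'slave', 'shared and slave', 'private', 'unbindable'
    command:   'make-shared', 'make-slave', 'make-private', 'make-unbindable'
    has_peers: whether a shared mount has other mounts in its peer group
    """
    if command == "make-private":
        return "private"
    if command == "make-unbindable":
        return "unbindable"
    if command == "make-shared":
        # A slave keeps its master and gains a peer group of its own.
        return "shared and slave" if "slave" in state else "shared"
    if command == "make-slave":
        if state == "shared":
            # (*) with no peer there is no master to be slaved to
            return "slave" if has_peers else "private"
        if state == "shared and slave":
            return "slave"
        return state  # (**) no effect on a non-shared mount
    raise ValueError(command)


print(transition("slave", "make-shared"))                     # shared and slave
print(transition("shared", "make-slave", has_peers=False))    # private
```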
0352
0353 Apart from the commands listed above, the 'move' operation also changes
0354 the state of a mount depending on the type of the destination mount. This
0355 is explained in section 5d.
0356
0357 5b) Bind semantics
0358
0359 Consider the following command::
0360
0361 mount --bind A/a B/b
0362
0363 where 'A' is the source mount, 'a' is the dentry in the mount 'A', 'B'
0364 is the destination mount and 'b' is the dentry in the destination mount.
0365
0366 The outcome depends on the type of mount of 'A' and 'B'. The table
0367 below is a quick reference::
0368
0369 ---------------------------------------------------------------------------
0370 |                          BIND MOUNT OPERATION                           |
0371 |*************************************************************************|
0372 |source(A)->| shared       | private        | slave          | unbindable |
0373 | dest(B)   |              |                |                |            |
0374 |    |      |              |                |                |            |
0375 |    v      |              |                |                |            |
0376 |*************************************************************************|
0377 | shared    | shared       | shared         | shared & slave | invalid    |
0378 |           |              |                |                |            |
0379 | non-shared| shared       | private        | slave          | invalid    |
0380 ***************************************************************************
0381
0382 Details:
0383
0384 1. 'A' is a shared mount and 'B' is a shared mount. A new mount 'C',
0385 which is a clone of 'A', is created. Its root dentry is 'a'. 'C' is
0386 mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2', 'C3' ...
0387 are created and mounted at the dentry 'b' on all mounts where 'B'
0388 propagates to. A new propagation tree containing 'C1',..,'Cn' is
0389 created. This propagation tree is identical to the propagation tree of
0390 'B'. And finally the peer-group of 'C' is merged with the peer group
0391 of 'A'.
0392
0393 2. 'A' is a private mount and 'B' is a shared mount. A new mount 'C',
0394 which is a clone of 'A', is created. Its root dentry is 'a'. 'C' is
0395 mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2', 'C3' ...
0396 are created and mounted at the dentry 'b' on all mounts where 'B'
0397 propagates to. A new propagation tree is set containing all new mounts
0398 'C', 'C1', .., 'Cn' with exactly the same configuration as the
0399 propagation tree for 'B'.
0400
0401 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. A new
0402 mount 'C', which is a clone of 'A', is created. Its root dentry is 'a'.
0403 'C' is mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2',
0404 'C3' ... are created and mounted at the dentry 'b' on all mounts where
0405 'B' propagates to. A new propagation tree containing the new mounts
0406 'C','C1',.. 'Cn' is created. This propagation tree is identical to the
0407 propagation tree for 'B'. And finally the mount 'C' and its peer group
0408 is made the slave of mount 'Z'. In other words, mount 'C' is in the
0409 state 'slave and shared'.
0410
0411 4. 'A' is an unbindable mount and 'B' is a shared mount. This is an
0412 invalid operation.
0413
0414 5. 'A' is a private mount and 'B' is a non-shared (private or slave or
0415 unbindable) mount. A new mount 'C', which is a clone of 'A', is created.
0416 Its root dentry is 'a'. 'C' is mounted on mount 'B' at dentry 'b'.
0417
0418 6. 'A' is a shared mount and 'B' is a non-shared mount. A new mount 'C'
0419 which is a clone of 'A' is created. Its root dentry is 'a'. 'C' is
0420 mounted on mount 'B' at dentry 'b'. 'C' is made a member of the
0421 peer-group of 'A'.
0422
0423 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. A
0424 new mount 'C' which is a clone of 'A' is created. Its root dentry is
0425 'a'. 'C' is mounted on mount 'B' at dentry 'b'. Also 'C' is set as a
0426 slave mount of 'Z'. In other words 'A' and 'C' are both slave mounts of
0427 'Z'. All mount/unmount events on 'Z' propagate to 'A' and 'C'. But
0428 mount/unmount on 'A' do not propagate anywhere else. Similarly
0429 mount/unmount on 'C' do not propagate anywhere else.
0430
0431 8. 'A' is an unbindable mount and 'B' is a non-shared mount. This is an
0432 invalid operation. An unbindable mount cannot be bind mounted.
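
The eight cases above collapse into a small decision function. The Python
sketch below only reports the resulting state of the new mount 'C'; it does
not model the propagated copies 'C1' ... 'Cn'. The name 'bind_result' and its
parameters are hypothetical.

```python
# Resulting state of the new mount for a bind operation (illustrative only).
def bind_result(source, dest_shared):
    """State of the mount 'C' created by `mount --bind A/a B/b`.

    source:      state of 'A' ('shared', 'private', 'slave', 'unbindable')
    dest_shared: True if 'B' is shared, False if it is non-shared
    """
    if source == "unbindable":
        raise ValueError("an unbindable mount cannot be bind mounted")
    if source == "shared":
        return "shared"                       # 'C' joins the peer group of 'A'
    if source == "private":
        return "shared" if dest_shared else "private"
    if source == "slave":
        # 'C' stays a slave of 'Z'; under a shared 'B' it also gets peers.
        return "shared and slave" if dest_shared else "slave"
    raise ValueError(source)


print(bind_result("slave", dest_shared=True))      # shared and slave
print(bind_result("private", dest_shared=False))   # private
```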
0433
0434 5c) Rbind semantics
0435
0436 rbind is the same as bind, except that it is recursive. Bind replicates
0437 the specified mount. Rbind replicates all the mounts in the tree belonging
0438 to the specified mount; it is a bind mount applied to every mount in the tree.
0439
0440 If the source tree that is rbound has some unbindable mounts,
0441 then the subtree under the unbindable mount is pruned in the new
0442 location.
0443
0444 For example:
0445
0446 let's say we have the following mount tree::
0447
0448       A
0449     /   \
0450    B     C
0451   / \   / \
0452  D   E F   G
0453
0454 Let's say all the mounts in the tree except the mount C are
0455 of a type other than unbindable.
0456
0457 If this tree is rbound to, say, Z
0458
0459 We will have the following tree at the new location::
0460
0461       Z
0462       |
0463       A'
0464      /
0465     B'        Note how the tree under C is pruned
0466    / \        in the new location.
0467   D'  E'
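
The pruning rule can be expressed as a short recursive copy. The Python sketch
below represents a mount tree as nested (name, children) tuples, which is an
assumption made for this illustration; 'rbind_copy' is a hypothetical name.

```python
# Recursive copy of a mount tree that prunes unbindable mounts (illustrative).
def rbind_copy(tree, unbindable):
    """Copy a mount tree, dropping unbindable mounts and their subtrees.

    tree:       (name, [children]) tuples
    unbindable: set of mount names that cannot be bind mounted
    Returns the copy with each name suffixed by "'", or None if the
    root itself is unbindable.
    """
    name, children = tree
    if name in unbindable:
        return None
    copies = [rbind_copy(child, unbindable) for child in children]
    return (name + "'", [c for c in copies if c is not None])


tree = ("A", [("B", [("D", []), ("E", [])]),
              ("C", [("F", []), ("G", [])])])
print(rbind_copy(tree, {"C"}))
# ("A'", [("B'", [("D'", []), ("E'", [])])])
```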
0468
0469
0470
0471 5d) Move semantics
0472
0473 Consider the following command::
0474
0475 mount --move A B/b
0476
0477 where 'A' is the source mount, 'B' is the destination mount and 'b' is
0478 the dentry in the destination mount.
0479
0480 The outcome depends on the type of the mount of 'A' and 'B'. The table
0481 below is a quick reference::
0482
0483 ---------------------------------------------------------------------------
0484 |                          MOVE MOUNT OPERATION                           |
0485 |*************************************************************************|
0486 |source(A)->| shared       | private        | slave          | unbindable |
0487 | dest(B)   |              |                |                |            |
0488 |    |      |              |                |                |            |
0489 |    v      |              |                |                |            |
0490 |*************************************************************************|
0491 | shared    | shared       | shared         |shared and slave| invalid    |
0492 |           |              |                |                |            |
0493 | non-shared| shared       | private        | slave          | unbindable |
0494 ***************************************************************************
0495
0496 .. Note:: moving a mount residing under a shared mount is invalid.
0497
0498 Details follow:
0499
0500 1. 'A' is a shared mount and 'B' is a shared mount. The mount 'A' is
0501 mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', 'A2'...'An'
0502 are created and mounted at dentry 'b' on all mounts that receive
0503 propagation from mount 'B'. A new propagation tree is created in the
0504 exact same configuration as that of 'B'. This new propagation tree
0505 contains all the new mounts 'A1', 'A2'... 'An'. And this new
0506 propagation tree is appended to the already existing propagation tree
0507 of 'A'.
0508
0509 2. 'A' is a private mount and 'B' is a shared mount. The mount 'A' is
0510 mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', 'A2'... 'An'
0511 are created and mounted at dentry 'b' on all mounts that receive
0512 propagation from mount 'B'. The mount 'A' becomes a shared mount and a
0513 propagation tree is created which is identical to that of
0514 'B'. This new propagation tree contains all the new mounts 'A1',
0515 'A2'... 'An'.
0516
0517 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. The
0518 mount 'A' is mounted on mount 'B' at dentry 'b'. Also new mounts 'A1',
0519 'A2'... 'An' are created and mounted at dentry 'b' on all mounts that
0520 receive propagation from mount 'B'. A new propagation tree is created
0521 in the exact same configuration as that of 'B'. This new propagation
0522 tree contains all the new mounts 'A1', 'A2'... 'An'. And this new
0523 propagation tree is appended to the already existing propagation tree of
0524 'A'. Mount 'A' continues to be the slave mount of 'Z' but it also
0525 becomes 'shared'.
0526
0527 4. 'A' is an unbindable mount and 'B' is a shared mount. The operation
0528 is invalid, because mounting anything on the shared mount 'B' can
0529 create new mounts that get mounted on the mounts that receive
0530 propagation from 'B'. Since the mount 'A' is unbindable, cloning
0531 it to mount at other mountpoints is not possible.
0532
0533 5. 'A' is a private mount and 'B' is a non-shared (private or slave or
0534 unbindable) mount. The mount 'A' is mounted on mount 'B' at dentry 'b'.
0535
0536 6. 'A' is a shared mount and 'B' is a non-shared mount. The mount 'A'
0537 is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a
0538 shared mount.
0539
0540 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount.
0541 The mount 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A'
0542 continues to be a slave mount of mount 'Z'.
0543
0544 8. 'A' is an unbindable mount and 'B' is a non-shared mount. The mount
0545 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be an
0546 unbindable mount.
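
For comparison with bind, the move table can be written down as data. The
Python sketch below is a plain lookup; 'MOVE_TABLE', 'move_result' and the
'parent_shared' parameter are hypothetical names. Note the one case that
differs from bind: an unbindable mount may be moved under a non-shared mount,
because no clone is needed.

```python
# The move table as a dictionary (illustrative only).
MOVE_TABLE = {
    # (state of 'A', 'B' is shared): resulting state of 'A', None if invalid
    ("shared", True): "shared",
    ("private", True): "shared",
    ("slave", True): "shared and slave",
    ("unbindable", True): None,
    ("shared", False): "shared",
    ("private", False): "private",
    ("slave", False): "slave",
    ("unbindable", False): "unbindable",
}


def move_result(source, dest_shared, parent_shared=False):
    """State of 'A' after `mount --move A B/b`; raises if the move is invalid."""
    if parent_shared:
        # The note above the details: such a mount cannot be moved at all.
        raise ValueError("moving a mount residing under a shared mount is invalid")
    result = MOVE_TABLE[(source, dest_shared)]
    if result is None:
        raise ValueError("cannot move an unbindable mount under a shared mount")
    return result


print(move_result("unbindable", dest_shared=False))   # unbindable
print(move_result("slave", dest_shared=True))         # shared and slave
```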
0547
0548 5e) Mount semantics
0549
0550 Consider the following command::
0551
0552 mount device B/b
0553
0554 'B' is the destination mount and 'b' is the dentry in the destination
0555 mount.
0556
0557 The above operation is the same as the bind operation, with the exception
0558 that the source mount is always a private mount.
0559
0560
0561 5f) Unmount semantics
0562
0563 Consider the following command::
0564
0565 umount A
0566
0567 where 'A' is a mount mounted on mount 'B' at dentry 'b'.
0568
0569 If mount 'B' is shared, then all most-recently-mounted mounts at dentry
0570 'b' on mounts that receive propagation from mount 'B' and do not have
0571 sub-mounts within them are unmounted.
0572
0573 Example: Let's say 'B1', 'B2', 'B3' are shared mounts that propagate to
0574 each other.
0575
0576 let's say 'A1', 'A2', 'A3' are first mounted at dentry 'b' on mount
0577 'B1', 'B2' and 'B3' respectively.
0578
0579 let's say 'C1', 'C2', 'C3' are next mounted at the same dentry 'b' on
0580 mount 'B1', 'B2' and 'B3' respectively.
0581
0582 if 'C1' is unmounted, all the mounts that are most-recently-mounted on
0583 'B1' and on the mounts that 'B1' propagates-to are unmounted.
0584
0585 'B1' propagates to 'B2' and 'B3'. And the most recently mounted mount
0586 on 'B2' at dentry 'b' is 'C2', and that of mount 'B3' is 'C3'.
0587
0588 So all 'C1', 'C2' and 'C3' should be unmounted.
0589
0590 If either 'C2' or 'C3' has some child mounts, then that mount is not
0591 unmounted, but all other mounts are unmounted. However, if 'C1' is told
0592 to be unmounted and 'C1' has some sub-mounts, the umount operation
0593 fails entirely.
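
The example above can be replayed with a toy model. The Python sketch below
assumes that every mount listed in 'stacks' receives propagation from the
mount being operated on, and it keeps only the stack of mounts at dentry 'b';
'umount', 'UmountError', 'stacks' and 'children' are hypothetical names.

```python
# Toy model of umount propagation among peers (illustrative only).
class UmountError(Exception):
    pass


def umount(stacks, children, target_parent):
    """Unmount the top mount at dentry 'b' of target_parent and propagate.

    stacks:   maps each peer mount ('B1', ...) to the mounts stacked at
              dentry 'b', oldest first
    children: set of mounts that have sub-mounts of their own
    """
    top = stacks[target_parent][-1]
    if top in children:
        # The mount named on the command line is busy: fail entirely.
        raise UmountError(top + " has sub-mounts")
    stacks[target_parent].pop()
    for parent, stack in stacks.items():
        if parent == target_parent or not stack:
            continue
        if stack[-1] not in children:
            stack.pop()   # most recently mounted and no sub-mounts
    return stacks


stacks = {"B1": ["A1", "C1"], "B2": ["A2", "C2"], "B3": ["A3", "C3"]}
umount(stacks, children={"C3"}, target_parent="B1")
print(stacks)
# {'B1': ['A1'], 'B2': ['A2'], 'B3': ['A3', 'C3']}
```

'C3' survives because it has sub-mounts, while 'C1' and 'C2' are gone.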
0594
0595 5g) Clone Namespace
0596
0597 A cloned namespace contains the same mounts as the parent
0598 namespace.
0599
0600 Let's say 'A' and 'B' are the corresponding mounts in the parent and the
0601 child namespace.
0602
0603 If 'A' is shared, then 'B' is also shared and 'A' and 'B' propagate to
0604 each other.
0605
0606 If 'A' is a slave mount of 'Z', then 'B' is also the slave mount of
0607 'Z'.
0608
0609 If 'A' is a private mount, then 'B' is a private mount too.
0610
0611 If 'A' is an unbindable mount, then 'B' is an unbindable mount too.
0612
0613
0614 6) Quiz
0615
0616 A. What is the result of the following command sequence?
0617
0618 ::
0619
0620 mount --bind /mnt /mnt
0621 mount --make-shared /mnt
0622 mount --bind /mnt /tmp
0623 mount --move /tmp /mnt/1
0624
0625 What should the contents of /mnt, /mnt/1 and /mnt/1/1 be?
0626 Should they all be identical? Or should only /mnt and /mnt/1 be
0627 identical?
0628
0629
0630 B. What is the result of the following command sequence?
0631
0632 ::
0633
0634 mount --make-rshared /
0635 mkdir -p /v/1
0636 mount --rbind / /v/1
0637
0638 What should the contents of /v/1/v/1 be?
0639
0640
0641 C. What is the result of the following command sequence?
0642
0643 ::
0644
0645 mount --bind /mnt /mnt
0646 mount --make-shared /mnt
0647 mkdir -p /mnt/1/2/3 /mnt/1/test
0648 mount --bind /mnt/1 /tmp
0649 mount --make-slave /mnt
0650 mount --make-shared /mnt
0651 mount --bind /mnt/1/2 /tmp1
0652 mount --make-slave /mnt
0653
0654 At this point we have the first mount at /tmp and
0655 its root dentry is 1. Let's call this mount 'A'
0656 And then we have a second mount at /tmp1 with root
0657 dentry 2. Let's call this mount 'B'
0658 Next we have a third mount at /mnt with root dentry
0659 mnt. Let's call this mount 'C'
0660
0661 'B' is the slave of 'A' and 'C' is a slave of 'B'
0662 A -> B -> C
0663
0664 at this point if we execute the following command
0665
0666 mount --bind /bin /tmp/test
0667
0668 The mount is attempted on 'A'
0669
0670 will the mount propagate to 'B' and 'C' ?
0671
0672 What would the contents of
0673 /mnt/1/test be?
0674
0675 7) FAQ
0676
0677 Q1. Why is bind mount needed? How is it different from symbolic links?
0678 Symbolic links can get stale if the destination mount gets
0679 unmounted or moved. Bind mounts continue to exist even if the
0680 other mount is unmounted or moved.
0681
0682 Q2. Why can't the shared subtree be implemented using exportfs?
0683
0684 exportfs is a heavyweight way of accomplishing part of what
0685 shared subtree can do. I cannot imagine a way to implement the
0686 semantics of slave mount using exportfs.
0687
0688 Q3. Why is unbindable mount needed?
0689
0690 Let's say we want to replicate the mount tree at multiple
0691 locations within the same subtree.
0692
0693 If one rbind mounts a tree within the same subtree 'n' times,
0694 the number of mounts created grows at least exponentially with 'n'.
0695 Having unbindable mount can help prune the unneeded bind
0696 mounts. Here is an example.
0697
0698 step 1:
0699 let's say the root tree has just two directories with
0700 one vfsmount::
0701
0702 root
0703 / \
0704 tmp usr
0705
0706 And we want to replicate the tree at multiple
0707 mountpoints under /root/tmp
0708
0709 step 2:
0710 ::
0711
0712
0713 mount --make-shared /root
0714
0715 mkdir -p /tmp/m1
0716
0717 mount --rbind /root /tmp/m1
0718
0719 the new tree now looks like this::
0720
0721 root
0722 / \
0723 tmp usr
0724 /
0725 m1
0726 / \
0727 tmp usr
0728 /
0729 m1
0730
0731 it has two vfsmounts
0732
0733 step 3:
0734 ::
0735
0736 mkdir -p /tmp/m2
0737 mount --rbind /root /tmp/m2
0738
0739 the new tree now looks like this::
0740
0741 root
0742 / \
0743 tmp usr
0744 / \
0745 m1 m2
0746 / \ / \
0747 tmp usr tmp usr
0748 / \ /
0749 m1 m2 m1
0750 / \ / \
0751 tmp usr tmp usr
0752 / / \
0753 m1 m1 m2
0754 / \
0755 tmp usr
0756 / \
0757 m1 m2
0758
0759 it has 6 vfsmounts
0760
0761 step 4:
0762 ::
0763 mkdir -p /tmp/m3
0764 mount --rbind /root /tmp/m3
0765
0766 I won't draw the tree, but it has 24 vfsmounts.
0767
0768
0769 At step i the number of vfsmounts is V[i] = i*V[i-1], that is, i!.
0770 This grows even faster than an exponential function, and the tree has
0771 far more mounts than what we really needed in the first place.
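
The recurrence can be evaluated directly. The Python sketch below takes
V[1] = 1 (the single vfsmount of step 1) as its starting point and reproduces
the counts quoted for steps 2 to 4; 'vfsmount_counts' is a hypothetical name.

```python
# Evaluate V[i] = i * V[i-1] with V[1] = 1 (illustrative only).
def vfsmount_counts(steps):
    """Return [V[1], ..., V[steps]]."""
    if steps < 1:
        return []
    counts = [1]
    for i in range(2, steps + 1):
        counts.append(i * counts[-1])
    return counts


print(vfsmount_counts(4))   # [1, 2, 6, 24]
```

Since V[i] is the product 1*2*...*i, each additional rbind multiplies the
total by the step number.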
0772
0773 One could use a series of umount at each step to prune
0774 out the unneeded mounts. But there is a better solution.
0775 Unbindable mounts come in handy here.
0776
0777 step 1:
0778 let's say the root tree has just two directories with
0779 one vfsmount::
0780
0781 root
0782 / \
0783 tmp usr
0784
0785 How do we set up the same tree at multiple locations under
0786 /root/tmp
0787
0788 step 2:
0789 ::
0790
0791
0792 mount --bind /root/tmp /root/tmp
0793
0794 mount --make-rshared /root
0795 mount --make-unbindable /root/tmp
0796
0797 mkdir -p /tmp/m1
0798
0799 mount --rbind /root /tmp/m1
0800
0801 the new tree now looks like this::
0802
0803 root
0804 / \
0805 tmp usr
0806 /
0807 m1
0808 / \
0809 tmp usr
0810
0811 step 3:
0812 ::
0813
0814 mkdir -p /tmp/m2
0815 mount --rbind /root /tmp/m2
0816
0817 the new tree now looks like this::
0818
0819 root
0820 / \
0821 tmp usr
0822 / \
0823 m1 m2
0824 / \ / \
0825 tmp usr tmp usr
0826
0827 step 4:
0828 ::
0829
0830 mkdir -p /tmp/m3
0831 mount --rbind /root /tmp/m3
0832
0833 the new tree now looks like this::
0834
0835 root
0836 / \
0837 tmp usr
0838 / \ \
0839 m1 m2 m3
0840 / \ / \ / \
0841 tmp usr tmp usr tmp usr
0842
0843 8) Implementation
0844
0845 8A) Data structure
0846
0847 4 new fields are introduced to struct vfsmount:
0848
0849 * ->mnt_share
0850 * ->mnt_slave_list
0851 * ->mnt_slave
0852 * ->mnt_master
0853
0854 ->mnt_share
0855 links together all the mounts to/from which this vfsmount
0856 sends/receives propagation events.
0857
0858 ->mnt_slave_list
0859 links all the mounts to which this vfsmount
0860 propagates.
0861
0862 ->mnt_slave
0863 links together all the slaves that its master vfsmount
0864 propagates to.
0865
0866 ->mnt_master
0867 points to the master vfsmount from which this vfsmount
0868 receives propagation.
0869
0870 ->mnt_flags
0871 takes two more flags to indicate the propagation status of
0872 the vfsmount. MNT_SHARE indicates that the vfsmount is a shared
0873 vfsmount. MNT_UNCLONABLE indicates that the vfsmount cannot be
0874 replicated.
0875
0876 All the shared vfsmounts in a peer group form a cyclic list through
0877 ->mnt_share.
0878
0879 All vfsmounts with the same ->mnt_master form a cyclic list anchored
0880 in ->mnt_master->mnt_slave_list and going through ->mnt_slave.
0881
0882 ->mnt_master can point to arbitrary (and possibly different) members
0883 of master peer group. To find all immediate slaves of a peer group
0884 you need to go through _all_ ->mnt_slave_list of its members.
0885 Conceptually it's just a single set - distribution among the
0886 individual lists does not affect propagation or the way propagation
0887 tree is modified by operations.
0888
0889 All vfsmounts in a peer group have the same ->mnt_master. If it is
0890 non-NULL, they form a contiguous (ordered) segment of slave list.
0891
0892 An example propagation tree looks as shown in the figure below.
0893 [ NOTE: Though it looks like a forest, if we consider all the shared
0894 mounts as a conceptual entity called 'pnode', it becomes a tree]::
0895
0896
0897         A <--> B <--> C <---> D
0898        /|\            /|      |\
0899       / F G          J K      H I
0900      /
0901     E<-->K
0902         /|\
0903        M L N
0904
0905 In the above figure A, B, C and D are all shared and propagate to each
0906 other. 'A' has 3 slave mounts 'E', 'F' and 'G'; 'C' has 2 slave
0907 mounts 'J' and 'K'; and 'D' has two slave mounts 'H' and 'I'.
0908 'E' is also shared with 'K' and they propagate to each other. And
0909 'K' has 3 slaves 'M', 'L' and 'N'.
0910
0911 A's ->mnt_share links with the ->mnt_share of 'B' 'C' and 'D'
0912
0913 A's ->mnt_slave_list links with ->mnt_slave of 'E', 'K', 'F' and 'G'
0914
0915 E's ->mnt_share links with ->mnt_share of K
0916
0917 'E', 'K', 'F', 'G' have their ->mnt_master point to struct vfsmount of 'A'
0918
0919 'M', 'L', 'N' have their ->mnt_master point to struct vfsmount of 'K'
0920
0921 K's ->mnt_slave_list links with ->mnt_slave of 'M', 'L' and 'N'
0922
0923 C's ->mnt_slave_list links with ->mnt_slave of 'J' and 'K'
0924
0925 J and K's ->mnt_master points to struct vfsmount of C
0926
0927 and finally D's ->mnt_slave_list links with ->mnt_slave of 'H' and 'I'
0928
0929 'H' and 'I' have their ->mnt_master pointing to struct vfsmount of 'D'.
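
The linkage described above can be sketched in Python, with plain object
references standing in for the kernel's list links. This is an illustration
under that assumption, not the kernel's layout: the 'Mount' class and the
helpers 'make_peer', 'make_slave', 'peers' and 'immediate_slaves' are
hypothetical, and only a subset of the figure (A, B, C, D with slaves F, G,
H and I) is built.

```python
# Toy model of ->mnt_share, ->mnt_slave_list and ->mnt_master (illustrative).
class Mount:
    def __init__(self, name):
        self.name = name
        self.mnt_share = self        # cyclic list of peers; alone at first
        self.mnt_slave_list = []     # slaves whose mnt_master is this mount
        self.mnt_master = None


def make_peer(existing, new):
    """Insert `new` into the cyclic peer list right after `existing`."""
    new.mnt_share = existing.mnt_share
    existing.mnt_share = new


def make_slave(master, slave):
    slave.mnt_master = master
    master.mnt_slave_list.append(slave)


def peers(mount):
    """Walk the cyclic ->mnt_share list once, starting at `mount`."""
    result, cur = [mount], mount.mnt_share
    while cur is not mount:
        result.append(cur)
        cur = cur.mnt_share
    return result


def immediate_slaves(mount):
    """Collect the slaves of the whole peer group, one member list at a time."""
    return [s for p in peers(mount) for s in p.mnt_slave_list]


a, b, c, d = (Mount(n) for n in "ABCD")
make_peer(a, b)
make_peer(b, c)
make_peer(c, d)
for master, name in [(a, "F"), (a, "G"), (d, "H"), (d, "I")]:
    make_slave(master, Mount(name))

print([m.name for m in peers(a)])                      # ['A', 'B', 'C', 'D']
print(sorted(m.name for m in immediate_slaves(b)))     # ['F', 'G', 'H', 'I']
```

Starting from 'B', which has no slaves of its own, still finds all four
slaves, because every member's ->mnt_slave_list is visited.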
0930
0931
0932 NOTE: The propagation tree is orthogonal to the mount tree.
0933
0934 8B Locking:
0935
0936 ->mnt_share, ->mnt_slave, ->mnt_slave_list, ->mnt_master are protected
0937 by namespace_sem (exclusive for modifications, shared for reading).
0938
0939 Normally we have ->mnt_flags modifications serialized by vfsmount_lock.
0940 There are two exceptions: do_add_mount() and clone_mnt().
0941 The former modifies a vfsmount that has not been visible in any shared
0942 data structures yet.
0943 The latter holds namespace_sem and the only references to vfsmount
0944 are in lists that can't be traversed without namespace_sem.
0945
0946 8C Algorithm:
0947
0948 The crux of the implementation resides in rbind/move operation.
0949
0950 The overall algorithm breaks the operation into 3 phases (look at
0951 attach_recursive_mnt() and propagate_mnt()):
0952
0953 1. prepare phase.
0954 2. commit phase.
0955 3. abort phase.
0956
0957 Prepare phase:
0958
0959 for each mount in the source tree:
0960
0961 a) Create the necessary number of mount trees to
0962 be attached to each of the mounts that receive
0963 propagation from the destination mount.
0964 b) Do not attach any of the trees to its destination.
0965 However note down its ->mnt_parent and ->mnt_mountpoint
0966 c) Link all the new mounts to form a propagation tree that
0967 is identical to the propagation tree of the destination
0968 mount.
0969
0970 If this phase is successful, there should be 'n' new
0971 propagation trees, where 'n' is the number of mounts in the
0972 source tree. Go to the commit phase.
0973
0974 Also there should be 'm' new mount trees, where 'm' is
0975 the number of mounts to which the destination mount
0976 propagates.
0977
0978 If any memory allocations fail, go to the abort phase.
0979
0980 Commit phase
0981 attach each of the mount trees to their corresponding
0982 destination mounts.
0983
0984 Abort phase
0985 delete all the newly created trees.
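
The prepare/commit/abort pattern can be sketched independently of the kernel.
The Python code below is a simplified illustration, not the logic of
attach_recursive_mnt(): it treats every destination alike and takes 'clone'
and 'attach' as caller-supplied callbacks. All names in it are hypothetical.

```python
# All-or-nothing attach in three phases (illustrative only).
def attach_with_propagation(source_tree, destinations, clone, attach):
    """Prepare one copy per destination, then commit all or abort all.

    destinations: mounts that are to receive a copy of the tree
    clone(tree):  returns a copy of the tree; may raise MemoryError
    attach(tree, dest): attaches a prepared tree to its destination
    """
    prepared = []
    try:
        # Prepare: build every tree, remember its destination, attach nothing.
        for dest in destinations:
            prepared.append((clone(source_tree), dest))
    except MemoryError:
        # Abort: drop all the newly created trees.
        prepared.clear()
        return False
    # Commit: only reached when every allocation succeeded.
    for tree, dest in prepared:
        attach(tree, dest)
    return True


attached = []
ok = attach_with_propagation(
    "T", ["B1", "B2"],
    clone=lambda tree: tree + "'",
    attach=lambda tree, dest: attached.append((tree, dest)),
)
print(ok, attached)   # True [("T'", 'B1'), ("T'", 'B2')]
```

Deferring every attach until all copies exist is what lets a failed
allocation leave the mount tree untouched.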
0986
0987 .. Note::
0988 all the propagation related functionality resides in the file pnode.c
0989
0990
0991 ------------------------------------------------------------------------
0992
0993 version 0.1 (created the initial document, Ram Pai linuxram@us.ibm.com)
0994
0995 version 0.2 (Incorporated comments from Al Viro)