Documentation/filesystems/idmappings.rst

0001 .. SPDX-License-Identifier: GPL-2.0
0002
0003 Idmappings
0004 ==========
0005
0006 Most filesystem developers will have encountered idmappings. They are used when
0007 reading from or writing ownership to disk, reporting ownership to userspace, or
0008 for permission checking. This document is aimed at filesystem developers that
0009 want to know how idmappings work.
0010
0011 Formal notes
0012 ------------
0013
0014 An idmapping is essentially a translation of a range of ids into another or the
0015 same range of ids. The notational convention for idmappings that is widely used
0016 in userspace is::
0017
0018  u:k:r
0019
0020 ``u`` indicates the first element in the upper idmapset ``U`` and ``k``
0021 indicates the first element in the lower idmapset ``K``. The ``r`` parameter
0022 indicates the range of the idmapping, i.e. how many ids are mapped. From now
0023 on, we will always prefix ids with ``u`` or ``k`` to make it clear whether
0024 we're talking about an id in the upper or lower idmapset.
0025
0026 To see what this looks like in practice, let's take the following idmapping::
0027
0028  u22:k10000:r3
0029
0030 and write down the mappings it will generate::
0031
0032  u22 -> k10000
0033  u23 -> k10001
0034  u24 -> k10002
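To make the notation concrete, here is a minimal Python sketch that parses a ``u:k:r`` string and lists the pairs it generates. The helper names ``parse_idmapping`` and ``generated_mappings`` are made up for this illustration and do not exist in the kernel.

```python
def parse_idmapping(spec):
    """Parse the 'u<first>:k<first>:r<count>' notation into three ints."""
    u_part, k_part, r_part = spec.split(":")
    # Each field carries a one-letter prefix (u, k, r) followed by a number.
    if (u_part[:1], k_part[:1], r_part[:1]) != ("u", "k", "r"):
        raise ValueError(f"not in u:k:r form: {spec!r}")
    return int(u_part[1:]), int(k_part[1:]), int(r_part[1:])


def generated_mappings(spec):
    """List every (upper id, lower id) pair the idmapping produces."""
    u, k, r = parse_idmapping(spec)
    return [(u + i, k + i) for i in range(r)]


for upper, lower in generated_mappings("u22:k10000:r3"):
    print(f"u{upper} -> k{lower}")
```

Enumerating the pairs is only practical for small ranges; the later sections work with the arithmetic directly.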
0035
0036 From a mathematical viewpoint ``U`` and ``K`` are well-ordered sets and an
0037 idmapping is an order isomorphism from ``U`` into ``K``. So ``U`` and ``K`` are
0038 order isomorphic. In fact, ``U`` and ``K`` are always well-ordered subsets of
0039 the set of all possible ids useable on a given system.
0040
0041 Looking at this mathematically briefly will help us highlight some properties
0042 that make it easier to understand how we can translate between idmappings. For
0043 example, we know that the inverse idmapping is an order isomorphism as well::
0044
0045  k10000 -> u22
0046  k10001 -> u23
0047  k10002 -> u24
0048
0049 Given that we are dealing with order isomorphisms plus the fact that we're
0050 dealing with subsets we can embed idmappings into each other, i.e. we can
0051 sensibly translate between different idmappings. For example, assume we've been
0052 given the three idmappings::
0053
0054  1. u0:k10000:r10000
0055  2. u0:k20000:r10000
0056  3. u0:k30000:r10000
0057
0058 and id ``k11000`` which has been generated by the first idmapping by mapping
0059 ``u1000`` from the upper idmapset down to ``k11000`` in the lower idmapset.
0060
0061 Because we're dealing with order isomorphic subsets it is meaningful to ask
0062 what id ``k11000`` corresponds to in the second or third idmapping. The
0063 straightforward algorithm to use is to apply the inverse of the first idmapping,
0064 mapping ``k11000`` up to ``u1000``. Afterwards, we can map ``u1000`` down using
0065 either the second or the third idmapping. The second
0066 idmapping would map ``u1000`` down to ``k21000``. The third idmapping would map
0067 ``u1000`` down to ``k31000``.
0068
0069 If we were given the same task for the following three idmappings::
0070
0071  1. u0:k10000:r10000
0072  2. u0:k20000:r200
0073  3. u0:k30000:r300
0074
0075 we would fail to translate as the sets aren't order isomorphic over the full
0076 range of the first idmapping anymore (although they are order isomorphic over
0077 the full range of the second idmapping). Neither the second nor the third idmapping
0078 contains ``u1000`` in the upper idmapset ``U``. This is equivalent to not having
0079 an id mapped. We can simply say that ``u1000`` is unmapped in the second and
0080 third idmapping. The kernel will report unmapped ids as the overflowuid
0081 ``(uid_t)-1`` or overflowgid ``(gid_t)-1`` to userspace.
0082
0083 The algorithm to calculate what a given id maps to is pretty simple. First, we
0084 need to verify that the range can contain our target id. We will skip this step
0085 for simplicity. After that if we want to know what ``id`` maps to we can do
0086 simple calculations:
0087
0088 - If we want to map from left to right::
0089
0090    u:k:r
0091    id - u + k = n
0092
0093 - If we want to map from right to left::
0094
0095    u:k:r
0096    id - k + u = n
0097
0098 Instead of "left to right" we can also say "down" and instead of "right to
0099 left" we can also say "up". Obviously mapping down and up invert each other.
0100
0101 To see whether the simple formulas above work, consider the following two
0102 idmappings::
0103
0104  1. u0:k20000:r10000
0105  2. u500:k30000:r10000
0106
0107 Assume we are given ``k21000`` in the lower idmapset of the first idmapping. We
0108 want to know what id this was mapped from in the upper idmapset of the first
0109 idmapping. So we're mapping up in the first idmapping::
0110
0111  id     - k      + u  = n
0112  k21000 - k20000 + u0 = u1000
0113
0114 Now assume we are given the id ``u1100`` in the upper idmapset of the second
0115 idmapping and we want to know what this id maps down to in the lower idmapset
0116 of the second idmapping. This means we're mapping down in the second
0117 idmapping::
0118
0119  id    - u    + k      = n
0120  u1100 - u500 + k30000 = k30600
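The range check that we skipped above is easy to add. The following Python sketch is a userspace model of the two formulas, not kernel code. The names ``map_down``, ``map_up`` and the ``UNMAPPED`` marker are assumptions made for this illustration, and an idmapping is represented as a plain ``(u, k, r)`` tuple.

```python
UNMAPPED = -1  # illustrative marker for "this id has no mapping"


def map_down(idmapping, upper_id):
    """Left to right: id - u + k, valid only if id lies in [u, u + r)."""
    u, k, r = idmapping
    if not (u <= upper_id < u + r):
        return UNMAPPED
    return upper_id - u + k


def map_up(idmapping, lower_id):
    """Right to left: id - k + u, valid only if id lies in [k, k + r)."""
    u, k, r = idmapping
    if not (k <= lower_id < k + r):
        return UNMAPPED
    return lower_id - k + u


first = (0, 20000, 10000)     # u0:k20000:r10000
second = (500, 30000, 10000)  # u500:k30000:r10000

print(map_up(first, 21000))             # k21000 mapped up in the first idmapping
print(map_down(second, 1100))           # u1100 mapped down in the second idmapping
print(map_down((0, 20000, 200), 1000))  # u1000 lies outside u0:k20000:r200
```

The last call shows the unmapped case from earlier: the range of ``u0:k20000:r200`` ends before ``u1000``, so no formula is applied at all.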
0121
0122 General notes
0123 -------------
0124
0125 In the context of the kernel an idmapping can be interpreted as mapping a range
0126 of userspace ids into a range of kernel ids::
0127
0128  userspace-id:kernel-id:range
0129
0130 A userspace id is always an element in the upper idmapset of an idmapping of
0131 type ``uid_t`` or ``gid_t`` and a kernel id is always an element in the lower
0132 idmapset of an idmapping of type ``kuid_t`` or ``kgid_t``. From now on
0133 "userspace id" will be used to refer to the well known ``uid_t`` and ``gid_t``
0134 types and "kernel id" will be used to refer to ``kuid_t`` and ``kgid_t``.
0135
0136 The kernel is mostly concerned with kernel ids. They are used when performing
0137 permission checks and are stored in an inode's ``i_uid`` and ``i_gid`` fields.
0138 A userspace id on the other hand is an id that is reported to userspace by the
0139 kernel, or is passed by userspace to the kernel, or a raw device id that is
0140 written or read from disk.
0141
0142 Note that we are only concerned with idmappings as the kernel stores them, not
0143 how userspace would specify them.
0144
0145 For the rest of this document we will prefix all userspace ids with ``u`` and
0146 all kernel ids with ``k``. Ranges of idmappings will be prefixed with ``r``. So
0147 an idmapping will be written as ``u0:k10000:r10000``.
0148
0149 For example, the id ``u1000`` is an id in the upper idmapset or "userspace
0150 idmapset" starting with ``u0``. And it is mapped to ``k11000`` which is a
0151 kernel id in the lower idmapset or "kernel idmapset" starting with ``k10000``.
0152
0153 A kernel id is always created by an idmapping. Such idmappings are associated
0154 with user namespaces. Since we mainly care about how idmappings work we're not
0155 going to be concerned with how idmappings are created nor how they are used
0156 outside of the filesystem context. This is best left to an explanation of user
0157 namespaces.
0158
0159 The initial user namespace is special. It always has an idmapping of the
0160 following form::
0161
0162  u0:k0:r4294967295
0163
0164 which is an identity idmapping over the full range of ids available on this
0165 system.
0166
0167 Other user namespaces usually have non-identity idmappings such as::
0168
0169  u0:k10000:r10000
0170
0171 When a process creates or wants to change ownership of a file, or when the
0172 ownership of a file is read from disk by a filesystem, the userspace id is
0173 immediately translated into a kernel id according to the idmapping associated
0174 with the relevant user namespace.
0175
0176 For instance, consider a file that is stored on disk by a filesystem as being
0177 owned by ``u1000``:
0178
0179 - If a filesystem were to be mounted in the initial user namespace (as most
0180   filesystems are) then the initial idmapping will be used. As we saw this is
0181   simply the identity idmapping. This would mean id ``u1000`` read from disk
0182   would be mapped to id ``k1000``. So an inode's ``i_uid`` and ``i_gid`` fields
0183   would contain ``k1000``.
0184
0185 - If a filesystem were to be mounted with an idmapping of ``u0:k10000:r10000``
0186   then ``u1000`` read from disk would be mapped to ``k11000``. So an inode's
0187   ``i_uid`` and ``i_gid`` would contain ``k11000``.
0188
0189 Translation algorithms
0190 ----------------------
0191
0192 We've already seen briefly that it is possible to translate between different
0193 idmappings. We'll now take a closer look at how that works.
0194
0195 Crossmapping
0196 ~~~~~~~~~~~~
0197
0198 This translation algorithm is used by the kernel in quite a few places. For
0199 example, it is used when reporting back the ownership of a file to userspace
0200 via the ``stat()`` system call family.
0201
0202 If we've been given ``k11000`` from one idmapping we can map that id up in
0203 another idmapping. In order for this to work both idmappings need to contain
0204 the same kernel id in their kernel idmapsets. For example, consider the
0205 following idmappings::
0206
0207  1. u0:k10000:r10000
0208  2. u20000:k10000:r10000
0209
0210 and we are mapping ``u1000`` down to ``k11000`` in the first idmapping. We can
0211 then translate ``k11000`` into a userspace id in the second idmapping using the
0212 kernel idmapset of the second idmapping::
0213
0214  /* Map the kernel id up into a userspace id in the second idmapping. */
0215  from_kuid(u20000:k10000:r10000, k11000) = u21000
0216
0217 Note how we can get back to the kernel id in the first idmapping by inverting
0218 the algorithm::
0219
0220  /* Map the userspace id down into a kernel id in the second idmapping. */
0221  make_kuid(u20000:k10000:r10000, u21000) = k11000
0222
0223  /* Map the kernel id up into a userspace id in the first idmapping. */
0224  from_kuid(u0:k10000:r10000, k11000) = u1000
0225
0226 This algorithm allows us to answer the question what userspace id a given
0227 kernel id corresponds to in a given idmapping. In order to be able to answer
0228 this question both idmappings need to contain the same kernel id in their
0229 respective kernel idmapsets.
0230
0231 For example, when the kernel reads a raw userspace id from disk it maps it down
0232 into a kernel id according to the idmapping associated with the filesystem.
0233 Let's assume the filesystem was mounted with an idmapping of
0234 ``u0:k20000:r10000`` and it reads a file owned by ``u1000`` from disk. This
0235 means ``u1000`` will be mapped to ``k21000`` which is what will be stored in
0236 the inode's ``i_uid`` and ``i_gid`` fields.
0237
0238 When someone in userspace calls ``stat()`` or a related function to get
0239 ownership information about the file the kernel can't simply map the id back up
0240 according to the filesystem's idmapping as this would give the wrong owner if
0241 the caller is using a different idmapping.
0242
0243 So the kernel will map the id back up in the idmapping of the caller. Let's
0244 assume the caller has the slightly unconventional idmapping
0245 ``u3000:k20000:r10000`` then ``k21000`` would map back up to ``u4000``.
0246 Consequently the user would see that this file is owned by ``u4000``.
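The ``stat()`` case above can be modelled in a few lines of Python. This is a sketch under stated assumptions: ``make_kuid`` and ``from_kuid`` here are Python stand-ins named after the kernel helpers and take a ``(u, k, r)`` tuple instead of a user namespace, ``crossmap`` is a hypothetical name, and ``UNMAPPED`` mirrors the ``u-1`` notation used in this document.

```python
UNMAPPED = -1  # mirrors the "u-1" notation used in this document


def make_kuid(idmapping, uid):
    """Model of mapping a userspace id down into a kernel id."""
    u, k, r = idmapping
    return uid - u + k if u <= uid < u + r else UNMAPPED


def from_kuid(idmapping, kid):
    """Model of mapping a kernel id up into a userspace id."""
    u, k, r = idmapping
    return kid - k + u if k <= kid < k + r else UNMAPPED


def crossmap(source, target, uid):
    """Map uid down in source, then map the resulting kernel id up in target."""
    kid = make_kuid(source, uid)
    if kid == UNMAPPED:
        return UNMAPPED
    return from_kuid(target, kid)


filesystem = (0, 20000, 10000)  # u0:k20000:r10000
caller = (3000, 20000, 10000)   # u3000:k20000:r10000

# On-disk owner u1000 becomes k21000 and is reported in the caller's idmapping.
print(crossmap(filesystem, caller, 1000))
```

The two idmappings meet in their kernel idmapsets: the intermediate value is always a kernel id, never a userspace id.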
0247
0248 Remapping
0249 ~~~~~~~~~
0250
0251 It is possible to translate a kernel id from one idmapping to another one via
0252 the userspace idmapset of the two idmappings. This is equivalent to remapping
0253 a kernel id.
0254
0255 Let's look at an example. We are given the following two idmappings::
0256
0257  1. u0:k10000:r10000
0258  2. u0:k20000:r10000
0259
0260 and we are given ``k11000`` in the first idmapping. In order to translate this
0261 kernel id in the first idmapping into a kernel id in the second idmapping we
0262 need to perform two steps:
0263
0264 1. Map the kernel id up into a userspace id in the first idmapping::
0265
0266     /* Map the kernel id up into a userspace id in the first idmapping. */
0267     from_kuid(u0:k10000:r10000, k11000) = u1000
0268
0269 2. Map the userspace id down into a kernel id in the second idmapping::
0270
0271     /* Map the userspace id down into a kernel id in the second idmapping. */
0272     make_kuid(u0:k20000:r10000, u1000) = k21000
0273
0274 As you can see we used the userspace idmapset in both idmappings to translate
0275 the kernel id in one idmapping to a kernel id in another idmapping.
0276
0277 This allows us to answer the question what kernel id we would need to use to
0278 get the same userspace id in another idmapping. In order to be able to answer
0279 this question both idmappings need to contain the same userspace id in their
0280 respective userspace idmapsets.
0281
0282 Note how we can easily get back to the kernel id in the first idmapping by
0283 inverting the algorithm:
0284
0285 1. Map the kernel id up into a userspace id in the second idmapping::
0286
0287     /* Map the kernel id up into a userspace id in the second idmapping. */
0288     from_kuid(u0:k20000:r10000, k21000) = u1000
0289
0290 2. Map the userspace id down into a kernel id in the first idmapping::
0291
0292     /* Map the userspace id down into a kernel id in the first idmapping. */
0293     make_kuid(u0:k10000:r10000, u1000) = k11000
0294
0295 Another way to look at this translation is to treat it as inverting one
0296 idmapping and applying another idmapping if both idmappings have the relevant
0297 userspace id mapped. This will come in handy when working with idmapped mounts.
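Remapping is the mirror image of crossmapping, as the following Python sketch tries to show. As before, ``make_kuid`` and ``from_kuid`` are simplified Python models operating on ``(u, k, r)`` tuples, and ``remap`` is a name chosen for this illustration only.

```python
UNMAPPED = -1  # mirrors the "u-1" notation used in this document


def make_kuid(idmapping, uid):
    """Model of mapping a userspace id down into a kernel id."""
    u, k, r = idmapping
    return uid - u + k if u <= uid < u + r else UNMAPPED


def from_kuid(idmapping, kid):
    """Model of mapping a kernel id up into a userspace id."""
    u, k, r = idmapping
    return kid - k + u if k <= kid < k + r else UNMAPPED


def remap(source, target, kid):
    """Map kid up in source, then map the userspace id down in target."""
    uid = from_kuid(source, kid)
    if uid == UNMAPPED:
        return UNMAPPED
    return make_kuid(target, uid)


first = (0, 10000, 10000)   # u0:k10000:r10000
second = (0, 20000, 10000)  # u0:k20000:r10000

forward = remap(first, second, 11000)
print(forward)                        # kernel id in the second idmapping
print(remap(second, first, forward))  # and back again in the first one
```

Here the intermediate value is a userspace id, which is why both idmappings must have that userspace id mapped.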
0298
0299 Invalid translations
0300 ~~~~~~~~~~~~~~~~~~~~
0301
0302 It is never valid to use an id in the kernel idmapset of one idmapping as the
0303 id in the userspace idmapset of another or the same idmapping. While the kernel
0304 idmapset always indicates an idmapset in the kernel id space the userspace
0305 idmapset indicates a userspace id. So the following translations are forbidden::
0306
0307  /* Map the userspace id down into a kernel id in the first idmapping. */
0308  make_kuid(u0:k10000:r10000, u1000) = k11000
0309
0310  /* INVALID: Map the kernel id down into a kernel id in the second idmapping. */
0311  make_kuid(u10000:k20000:r10000, k11000) = k21000
0312                                  ~~~~~~
0313
0314 and equally wrong::
0315
0316  /* Map the kernel id up into a userspace id in the first idmapping. */
0317  from_kuid(u0:k10000:r10000, k11000) = u1000
0318
0319  /* INVALID: Map the userspace id up into a userspace id in the second idmapping. */
0320  from_kuid(u20000:k0:r10000, u1000) = u21000
0321                              ~~~~~
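One way to rule out such mix-ups is to give userspace ids and kernel ids different types, so that passing the wrong kind of id is an error rather than a silently wrong number. The Python sketch below models this at runtime. The classes ``Uid`` and ``Kuid`` and the tuple-based helpers are assumptions made for this illustration; the kernel achieves a comparable effect at compile time because ``kuid_t`` and ``kgid_t`` are distinct wrapper types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Uid:
    """A userspace id (models uid_t)."""
    value: int


@dataclass(frozen=True)
class Kuid:
    """A kernel id (models kuid_t)."""
    value: int


def make_kuid(idmapping, uid):
    """Map a userspace id down; refuse anything that is not a Uid."""
    if not isinstance(uid, Uid):
        raise TypeError("make_kuid() expects a userspace id")
    u, k, r = idmapping
    if not (u <= uid.value < u + r):
        return None  # unmapped
    return Kuid(uid.value - u + k)


def from_kuid(idmapping, kuid):
    """Map a kernel id up; refuse anything that is not a Kuid."""
    if not isinstance(kuid, Kuid):
        raise TypeError("from_kuid() expects a kernel id")
    u, k, r = idmapping
    if not (k <= kuid.value < k + r):
        return None  # unmapped
    return Uid(kuid.value - k + u)


first = (0, 10000, 10000)       # u0:k10000:r10000
second = (10000, 20000, 10000)  # u10000:k20000:r10000

kid = make_kuid(first, Uid(1000))
print(kid)
try:
    make_kuid(second, kid)  # a kernel id where a userspace id belongs
except TypeError as err:
    print("rejected:", err)
```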
0322
0323 Idmappings when creating filesystem objects
0324 -------------------------------------------
0325
0326 The concepts of mapping an id down or mapping an id up are expressed in the two
0327 kernel functions filesystem developers are rather familiar with and which we've
0328 already used in this document::
0329
0330  /* Map the userspace id down into a kernel id. */
0331  make_kuid(idmapping, uid)
0332
0333  /* Map the kernel id up into a userspace id. */
0334  from_kuid(idmapping, kuid)
0335
0336 We will take an abbreviated look into how idmappings figure into creating
0337 filesystem objects. For simplicity we will only look at what happens when the
0338 VFS has already completed path lookup right before it calls into the filesystem
0339 itself. So we're concerned with what happens when e.g. ``vfs_mkdir()`` is
0340 called. We will also assume that the directory we're creating filesystem
0341 objects in is readable and writable for everyone.
0342
0343 When creating a filesystem object the kernel will look at the caller's
0344 filesystem ids. These are just regular ``uid_t`` and ``gid_t`` userspace ids
0345 but they are exclusively used when determining file ownership which is why they
0346 are called "filesystem ids". They are usually identical to the uid and gid of
0347 the caller but can differ. We will just assume they are always identical to not
0348 get lost in too many details.
0349
0350 When the caller enters the kernel two things happen:
0351
0352 1. Map the caller's userspace ids down into kernel ids in the caller's
0353    idmapping.
0354    (To be precise, the kernel will simply look at the kernel ids stashed in the
0355    credentials of the current task but for our education we'll pretend this
0356    translation happens just in time.)
0357 2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
0358    filesystem's idmapping.
0359
0360 The second step is important as regular filesystems will ultimately need to map
0361 the kernel id back up into a userspace id when writing to disk.
0362 So with the second step the kernel guarantees that a valid userspace id can be
0363 written to disk. If it can't the kernel will refuse the creation request to not
0364 even remotely risk filesystem corruption.
0365
0366 The astute reader will have realized that this is simply a variation of the
0367 crossmapping algorithm we introduced in a previous section. First, the
0368 kernel maps the caller's userspace id down into a kernel id according to the
0369 caller's idmapping and then maps that kernel id up according to the
0370 filesystem's idmapping.
0371
0372 Let's see some examples with caller/filesystem idmapping but without mount
0373 idmappings. This will exhibit some problems we can hit. After that we will
0374 revisit/reconsider these examples, this time using mount idmappings, to see how
0375 they can solve the problems we observed before.
0376
0377 Example 1
0378 ~~~~~~~~~
0379
0380 ::
0381
0382  caller id:            u1000
0383  caller idmapping:     u0:k0:r4294967295
0384  filesystem idmapping: u0:k0:r4294967295
0385
0386 Both the caller and the filesystem use the identity idmapping:
0387
0388 1. Map the caller's userspace ids into kernel ids in the caller's idmapping::
0389
0390     make_kuid(u0:k0:r4294967295, u1000) = k1000
0391
0392 2. Verify that the caller's kernel ids can be mapped to userspace ids in the
0393    filesystem's idmapping.
0394
0395    For this second step the kernel will call the function
0396    ``fsuidgid_has_mapping()`` which ultimately boils down to calling
0397    ``from_kuid()``::
0398
0399     from_kuid(u0:k0:r4294967295, k1000) = u1000
0400
0401 In this example both idmappings are the same so there's nothing exciting going
0402 on. Ultimately the userspace id that lands on disk will be ``u1000``.
0403
0404 Example 2
0405 ~~~~~~~~~
0406
0407 ::
0408
0409  caller id:            u1000
0410  caller idmapping:     u0:k10000:r10000
0411  filesystem idmapping: u0:k20000:r10000
0412
0413 1. Map the caller's userspace ids down into kernel ids in the caller's
0414    idmapping::
0415
0416     make_kuid(u0:k10000:r10000, u1000) = k11000
0417
0418 2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
0419    filesystem's idmapping::
0420
0421     from_kuid(u0:k20000:r10000, k11000) = u-1
0422
0423 It's immediately clear that while the caller's userspace id could be
0424 successfully mapped down into kernel ids in the caller's idmapping the kernel
0425 ids could not be mapped up according to the filesystem's idmapping. So the
0426 kernel will deny this creation request.
0427
0428 Note that while this example is less common, because most filesystems can't be
0429 mounted with non-initial idmappings, this is a general problem as we can see in
0430 the next examples.
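The two-step check from Examples 1 and 2 can be sketched in Python as follows. This is only a model: ``can_create`` is a hypothetical helper that stands in for the role the text assigns to ``fsuidgid_has_mapping()``, and ``make_kuid`` and ``from_kuid`` are simplified stand-ins that take ``(u, k, r)`` tuples.

```python
UNMAPPED = -1  # mirrors the "u-1" notation used in this document


def make_kuid(idmapping, uid):
    """Model of mapping a userspace id down into a kernel id."""
    u, k, r = idmapping
    return uid - u + k if u <= uid < u + r else UNMAPPED


def from_kuid(idmapping, kid):
    """Model of mapping a kernel id up into a userspace id."""
    u, k, r = idmapping
    return kid - k + u if k <= kid < k + r else UNMAPPED


def can_create(caller, filesystem, fsuid):
    """Return True if the creation request would pass both steps."""
    # Step 1: map the caller's userspace id down in the caller's idmapping.
    kid = make_kuid(caller, fsuid)
    if kid == UNMAPPED:
        return False
    # Step 2: the kernel id must map up in the filesystem's idmapping.
    return from_kuid(filesystem, kid) != UNMAPPED


initial = (0, 0, 4294967295)  # u0:k0:r4294967295

print(can_create(initial, initial, 1000))                        # Example 1
print(can_create((0, 10000, 10000), (0, 20000, 10000), 1000))    # Example 2
```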
0431
0432 Example 3
0433 ~~~~~~~~~
0434
0435 ::
0436
0437  caller id:            u1000
0438  caller idmapping:     u0:k10000:r10000
0439  filesystem idmapping: u0:k0:r4294967295
0440
0441 1. Map the caller's userspace ids down into kernel ids in the caller's
0442    idmapping::
0443
0444     make_kuid(u0:k10000:r10000, u1000) = k11000
0445
0446 2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
0447    filesystem's idmapping::
0448
0449     from_kuid(u0:k0:r4294967295, k11000) = u11000
0450
0451 We can see that the translation always succeeds. The userspace id that the
0452 filesystem will ultimately put to disk will always be identical to the value of
0453 the kernel id that was created in the caller's idmapping. This has mainly two
0454 consequences.
0455
0456 First, that we can't allow a caller to ultimately write to disk with another
0457 userspace id. We could only do this if we were to mount the whole fileystem
0458 with the caller's or another idmapping. But that solution is limited to a few
0459 filesystems and not very flexible. But this is a use-case that is pretty
0460 important in containerized workloads.
0461
0462 Second, the caller will usually not be able to create any files or access
0463 directories that have stricter permissions because none of the filesystem's
0464 kernel ids map up into valid userspace ids in the caller's idmapping:
0465
0466 1. Map raw userspace ids down to kernel ids in the filesystem's idmapping::
0467
0468     make_kuid(u0:k0:r4294967295, u1000) = k1000
0469
0470 2. Map kernel ids up to userspace ids in the caller's idmapping::
0471
0472     from_kuid(u0:k10000:r10000, k1000) = u-1
0473
0474 Example 4
0475 ~~~~~~~~~
0476
0477 ::
0478
0479  file id:              u1000
0480  caller idmapping:     u0:k10000:r10000
0481  filesystem idmapping: u0:k0:r4294967295
0482
0483 In order to report ownership to userspace the kernel uses the crossmapping
0484 algorithm introduced in a previous section:
0485
0486 1. Map the userspace id on disk down into a kernel id in the filesystem's
0487    idmapping::
0488
0489     make_kuid(u0:k0:r4294967295, u1000) = k1000
0490
0491 2. Map the kernel id up into a userspace id in the caller's idmapping::
0492
0493     from_kuid(u0:k10000:r10000, k1000) = u-1
0494
0495 The crossmapping algorithm fails in this case because the kernel id in the
0496 filesystem idmapping cannot be mapped up to a userspace id in the caller's
0497 idmapping. Thus, the kernel will report the ownership of this file as the
0498 overflowid.
0499
0500 Example 5
0501 ~~~~~~~~~
0502
0503 ::
0504
0505  file id:              u1000
0506  caller idmapping:     u0:k10000:r10000
0507  filesystem idmapping: u0:k20000:r10000
0508
0509 In order to report ownership to userspace the kernel uses the crossmapping
0510 algorithm introduced in a previous section:
0511
0512 1. Map the userspace id on disk down into a kernel id in the filesystem's
0513    idmapping::
0514
0515     make_kuid(u0:k20000:r10000, u1000) = k21000
0516
0517 2. Map the kernel id up into a userspace id in the caller's idmapping::
0518
0519     from_kuid(u0:k10000:r10000, k21000) = u-1
0520
0521 Again, the crossmapping algorithm fails in this case because the kernel id in
0522 the filesystem idmapping cannot be mapped to a userspace id in the caller's
0523 idmapping. Thus, the kernel will report the ownership of this file as the
0524 overflowid.
0525
0526 Note how in the last two examples things would be simple if the caller would be
0527 using the initial idmapping. For a filesystem mounted with the initial
0528 idmapping it would be trivial. So we only consider a filesystem with an
0529 idmapping of ``u0:k20000:r10000``:
0530
0531 1. Map the userspace id on disk down into a kernel id in the filesystem's
0532    idmapping::
0533
0534     make_kuid(u0:k20000:r10000, u1000) = k21000
0535
0536 2. Map the kernel id up into a userspace id in the caller's idmapping::
0537
0538     from_kuid(u0:k0:r4294967295, k21000) = u21000
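Rather than testing one id at a time, we can ask which on-disk ids a caller is able to see at all. The Python sketch below intersects the kernel idmapsets of the filesystem and the caller and translates the result back to on-disk ids. ``visible_disk_ids`` is a helper invented for this illustration; it assumes each idmapping is a single contiguous ``(u, k, r)`` range.

```python
def visible_disk_ids(filesystem, caller):
    """Return the on-disk ids whose kernel ids map up in the caller's idmapping."""
    fs_u, fs_k, fs_r = filesystem
    _, c_k, c_r = caller
    # Overlap of the two kernel idmapsets [k, k + r).
    low = max(fs_k, c_k)
    high = min(fs_k + fs_r, c_k + c_r)
    if low >= high:
        return range(0)  # nothing in common, every owner shows up as unmapped
    # Map the kernel id bounds back up in the filesystem's idmapping.
    return range(low - fs_k + fs_u, high - fs_k + fs_u)


initial = (0, 0, 4294967295)  # u0:k0:r4294967295
caller = (0, 10000, 10000)    # u0:k10000:r10000
shifted = (0, 20000, 10000)   # u0:k20000:r10000

print(visible_disk_ids(initial, caller))   # Example 4: u1000 is not included
print(visible_disk_ids(shifted, caller))   # Example 5: empty
print(visible_disk_ids(shifted, initial))  # caller with the initial idmapping
```

For Example 4 the overlap is not empty, but it only covers on-disk ids ``u10000`` to ``u19999``, which is why the file owned by ``u1000`` is reported as unmapped.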
0539
0540 Idmappings on idmapped mounts
0541 -----------------------------
0542
0543 The examples we've seen in the previous section where the caller's idmapping
0544 and the filesystem's idmapping are incompatible cause various issues for
0545 workloads. For a more complex but common example, consider two containers
0546 started on the host. To completely prevent the two containers from affecting
0547 each other, an administrator may often use different non-overlapping idmappings
0548 for the two containers::
0549
0550  container1 idmapping:  u0:k10000:r10000
0551  container2 idmapping:  u0:k20000:r10000
0552  filesystem idmapping:  u0:k30000:r10000
0553
0554 An administrator wanting to provide easy read-write access to the following set
0555 of files::
0556
0557  dir id:       u0
0558  dir/file1 id: u1000
0559  dir/file2 id: u2000
0560
0561 to both containers currently can't.
0562
0563 Of course the administrator has the option to recursively change ownership via
0564 ``chown()``. For example, they could change ownership so that ``dir`` and all
0565 files below it can be crossmapped from the filesystem's into the container's
0566 idmapping. Let's assume they change ownership so it is compatible with the
0567 first container's idmapping::
0568
0569  dir id:       u10000
0570  dir/file1 id: u11000
0571  dir/file2 id: u12000
0572
0573 This would still leave ``dir`` rather useless to the second container. In fact,
0574 ``dir`` and all files below it would continue to appear owned by the overflowid
0575 for the second container.
0576
0577 Or consider another increasingly popular example. Some service managers such as
0578 systemd implement a concept called "portable home directories". A user may want
0579 to use their home directories on different machines where they are assigned
0580 different login userspace ids. Most users will have ``u1000`` as the login id
0581 on their machine at home and all files in their home directory will usually be
0582 owned by ``u1000``. At uni or at work they may have another login id such as
0583 ``u1125``. This makes it rather difficult to interact with their home directory
0584 on their work machine.
0585
0586 In both cases changing ownership recursively has grave implications. The most
0587 obvious one is that ownership is changed globally and permanently. In the home
0588 directory case this change in ownership would even need to happen every time the
0589 user switches from their home to their work machine. For really large sets of
0590 files this becomes increasingly costly.
0591
0592 If the user is lucky, they are dealing with a filesystem that is mountable
0593 inside user namespaces. But this would also change ownership globally and the
0594 change in ownership is tied to the lifetime of the filesystem mount, i.e. the
0595 superblock. The only way to change ownership is to completely unmount the
0596 filesystem and mount it again in another user namespace. This is usually
0597 impossible because it would mean that all users currently accessing the
0598 filesystem can't anymore. And it means that ``dir`` still can't be shared
0599 between two containers with different idmappings.
0600 But usually the user doesn't even have this option since most filesystems
0601 aren't mountable inside containers. And not having them mountable might be
0602 desirable as it doesn't require the filesystem to deal with malicious
0603 filesystem images.
0604
0605 But the use cases mentioned above and more can be handled by idmapped mounts.
0606 They allow exposing the same set of dentries with different ownership at
0607 different mounts. This is achieved by marking the mounts with a user namespace
0608 through the ``mount_setattr()`` system call. The idmapping associated with it
0609 is then used to translate from the caller's idmapping to the filesystem's
0610 idmapping and vice versa using the remapping algorithm we introduced above.
0611
0612 Idmapped mounts make it possible to change ownership in a temporary and
0613 localized way. The ownership changes are restricted to a specific mount and the
0614 ownership changes are tied to the lifetime of the mount. All other users and
0615 locations where the filesystem is exposed are unaffected.
0616
0617 Filesystems that support idmapped mounts don't have any real reason to support
0618 being mountable inside user namespaces. A filesystem could be exposed
0619 completely under an idmapped mount to get the same effect. This has the
0620 advantage that filesystems can leave the creation of the superblock to
0621 privileged users in the initial user namespace.
0622
0623 However, it is perfectly possible to combine idmapped mounts with filesystems
0624 mountable inside user namespaces. We will touch on this further below.
0625
0626 Remapping helpers
0627 ~~~~~~~~~~~~~~~~~
0628
0629 Idmapping functions were added that translate between idmappings. They make use
0630 of the remapping algorithm we've introduced earlier. We're going to look at
0631 two:
0632
0633 - ``i_uid_into_mnt()`` and ``i_gid_into_mnt()``
0634
0635   The ``i_*id_into_mnt()`` functions translate the filesystem's kernel ids into
0636   kernel ids in the mount's idmapping::
0637
0638    /* Map the filesystem's kernel id up into a userspace id in the filesystem's idmapping. */
0639    from_kuid(filesystem, kid) = uid
0640
0641    /* Map the filesystem's userspace id down into a kernel id in the mount's idmapping. */
0642    make_kuid(mount, uid) = kuid
0643
0644 - ``mapped_fsuid()`` and ``mapped_fsgid()``
0645
0646   The ``mapped_fs*id()`` functions translate the caller's kernel ids into
0647   kernel ids in the filesystem's idmapping. This translation is achieved by
0648   remapping the caller's kernel ids using the mount's idmapping::
0649
0650    /* Map the caller's kernel id up into a userspace id in the mount's idmapping. */
0651    from_kuid(mount, kid) = uid
0652
0653    /* Map the mount's userspace id down into a kernel id in the filesystem's idmapping. */
0654    make_kuid(filesystem, uid) = kuid
0655
0656 Note that these two functions invert each other. Consider the following
0657 idmappings::
0658
0659  caller idmapping:     u0:k10000:r10000
0660  filesystem idmapping: u0:k20000:r10000
0661  mount idmapping:      u0:k10000:r10000
0662
0663 Assume a file owned by ``u1000`` is read from disk. The filesystem maps this id
0664 to ``k21000`` according to its idmapping. This is what is stored in the
0665 inode's ``i_uid`` and ``i_gid`` fields.
0666
0667 When the caller queries the ownership of this file via ``stat()`` the kernel
0668 would usually simply use the crossmapping algorithm and map the filesystem's
0669 kernel id up to a userspace id in the caller's idmapping.
0670
0671 But when the caller is accessing the file on an idmapped mount the kernel will
0672 first call ``i_uid_into_mnt()`` thereby translating the filesystem's kernel id
0673 into a kernel id in the mount's idmapping::
0674
0675  i_uid_into_mnt(k21000):
0676    /* Map the filesystem's kernel id up into a userspace id. */
0677    from_kuid(u0:k20000:r10000, k21000) = u1000
0678
0679    /* Map the filesystem's userspace id down into a kernel id in the mount's idmapping. */
0680    make_kuid(u0:k10000:r10000, u1000) = k11000
0681
0682 Finally, when the kernel reports the owner to the caller it will turn the
0683 kernel id in the mount's idmapping into a userspace id in the caller's
0684 idmapping::
0685
0686   from_kuid(u0:k10000:r10000, k11000) = u1000
0687
0688 We can test whether this algorithm really works by verifying what happens when
0689 we create a new file. Let's say the user is creating a file with ``u1000``.
0690
0691 The kernel maps this to ``k11000`` in the caller's idmapping. Usually the
0692 kernel would now apply the crossmapping, verifying that ``k11000`` can be
0693 mapped to a userspace id in the filesystem's idmapping. Since ``k11000`` can't
0694 be mapped up in the filesystem's idmapping directly this creation request
0695 fails.
0696
0697 But when the caller is accessing the file on an idmapped mount the kernel will
0698 first call ``mapped_fs*id()`` thereby translating the caller's kernel id into
0699 a kernel id according to the mount's idmapping::
0700
0701  mapped_fsuid(k11000):
0702     /* Map the caller's kernel id up into a userspace id in the mount's idmapping. */
0703     from_kuid(u0:k10000:r10000, k11000) = u1000
0704
0705     /* Map the mount's userspace id down into a kernel id in the filesystem's idmapping. */
0706     make_kuid(u0:k20000:r10000, u1000) = k21000
0707
0708 When finally writing to disk the kernel will then map ``k21000`` up into a
0709 userspace id in the filesystem's idmapping::
0710
0711    from_kuid(u0:k20000:r10000, k21000) = u1000
0712
0713 As we can see, we end up with an invertible and therefore information
0714 preserving algorithm. A file created from ``u1000`` on an idmapped mount will
0715 also be reported as being owned by ``u1000`` and vice versa.
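
The invertibility claim can be checked exhaustively for the idmappings used in
this walkthrough. The following sketch is a userspace model under stated
assumptions: ``down()`` and ``up()`` are hypothetical stand-ins for
``make_kuid()`` and ``from_kuid()``, idmappings are plain ``(u, k, r)``
tuples, and ``create()`` and ``report()`` are illustrative names for the
write and read paths described above.

.. code-block:: python

    from typing import Optional, Tuple

    Idmap = Tuple[int, int, int]  # (u, k, r), hypothetical representation


    def down(m: Idmap, uid: Optional[int]) -> Optional[int]:
        """Model of make_kuid(): userspace id -> kernel id."""
        u, k, r = m
        return k + uid - u if uid is not None and u <= uid < u + r else None


    def up(m: Idmap, kuid: Optional[int]) -> Optional[int]:
        """Model of from_kuid(): kernel id -> userspace id."""
        u, k, r = m
        return u + kuid - k if kuid is not None and k <= kuid < k + r else None


    def create(uid: int, caller: Idmap, mnt: Idmap, fs: Idmap) -> Optional[int]:
        """Return the userspace id that would be written to disk."""
        k_fs = down(fs, up(mnt, down(caller, uid)))    # mapped_fsuid() step
        return up(fs, k_fs)


    def report(disk_uid: Optional[int], caller: Idmap, mnt: Idmap,
               fs: Idmap) -> Optional[int]:
        """Return the userspace id that would be reported to the caller."""
        k_mnt = down(mnt, up(fs, down(fs, disk_uid)))  # i_uid_into_mnt() step
        return up(caller, k_mnt)


    caller = mnt = (0, 10000, 10000)
    fs = (0, 20000, 10000)

    # Every id the caller can hold survives the create/report round trip.
    round_trip_ok = all(
        report(create(uid, caller, mnt, fs), caller, mnt, fs) == uid
        for uid in range(10000)
    )
    print(create(1000, caller, mnt, fs), round_trip_ok)  # 1000 True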
0716
0717 Let's now briefly reconsider the failing examples from earlier in the context
0718 of idmapped mounts.
0719
0720 Example 2 reconsidered
0721 ~~~~~~~~~~~~~~~~~~~~~~
0722
0723 ::
0724
0725  caller id:            u1000
0726  caller idmapping:     u0:k10000:r10000
0727  filesystem idmapping: u0:k20000:r10000
0728  mount idmapping:      u0:k10000:r10000
0729
0730 When the caller is using a non-initial idmapping the common case is to attach
0731 the same idmapping to the mount. We now perform three steps:
0732
0733 1. Map the caller's userspace ids into kernel ids in the caller's idmapping::
0734
0735     make_kuid(u0:k10000:r10000, u1000) = k11000
0736
0737 2. Translate the caller's kernel id into a kernel id in the filesystem's
0738    idmapping::
0739
0740     mapped_fsuid(k11000):
0741       /* Map the kernel id up into a userspace id in the mount's idmapping. */
0742       from_kuid(u0:k10000:r10000, k11000) = u1000
0743
0744       /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
0745       make_kuid(u0:k20000:r10000, u1000) = k21000
0746
0747 3. Verify that the caller's kernel ids can be mapped to userspace ids in the
0748    filesystem's idmapping::
0749
0750     from_kuid(u0:k20000:r10000, k21000) = u1000
0751
0752 So the ownership that lands on disk will be ``u1000``.
0753
0754 Example 3 reconsidered
0755 ~~~~~~~~~~~~~~~~~~~~~~
0756
0757 ::
0758
0759  caller id:            u1000
0760  caller idmapping:     u0:k10000:r10000
0761  filesystem idmapping: u0:k0:r4294967295
0762  mount idmapping:      u0:k10000:r10000
0763
0764 The same translation algorithm works with the third example.
0765
0766 1. Map the caller's userspace ids into kernel ids in the caller's idmapping::
0767
0768     make_kuid(u0:k10000:r10000, u1000) = k11000
0769
0770 2. Translate the caller's kernel id into a kernel id in the filesystem's
0771    idmapping::
0772
0773     mapped_fsuid(k11000):
0774        /* Map the kernel id up into a userspace id in the mount's idmapping. */
0775        from_kuid(u0:k10000:r10000, k11000) = u1000
0776
0777        /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
0778        make_kuid(u0:k0:r4294967295, u1000) = k1000
0779
0780 3. Verify that the caller's kernel ids can be mapped to userspace ids in the
0781    filesystem's idmapping::
0782
0783     from_kuid(u0:k0:r4294967295, k1000) = u1000
0784
0785 So the ownership that lands on disk will be ``u1000``.
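
To compare file creation with and without an idmapped mount, the three steps
used in Examples 2 and 3 can be written as one small function. This is a
hedged userspace sketch: ``owner_on_disk()`` and the tuple-based
``make_kuid()`` and ``from_kuid()`` helpers are hypothetical names for
illustration, and ``None`` stands for an id that cannot be mapped in this
model.

.. code-block:: python

    def make_kuid(m, uid):
        """Userspace id -> kernel id for one (u, k, r) extent, else None."""
        u, k, r = m
        if uid is not None and u <= uid < u + r:
            return k + (uid - u)
        return None


    def from_kuid(m, kuid):
        """Kernel id -> userspace id for one (u, k, r) extent, else None."""
        u, k, r = m
        if kuid is not None and k <= kuid < k + r:
            return u + (kuid - k)
        return None


    def owner_on_disk(uid, caller, fs, mnt=None):
        """Userspace id that ends up on disk when `uid` creates a file."""
        kuid = make_kuid(caller, uid)                   # step 1
        if mnt is not None:                             # step 2, idmapped mount
            kuid = make_kuid(fs, from_kuid(mnt, kuid))
        return from_kuid(fs, kuid)                      # step 3


    caller = mnt = (0, 10000, 10000)
    examples = {
        "example 2": (0, 20000, 10000),
        "example 3": (0, 0, 4294967295),
    }

    results = {
        name: (owner_on_disk(1000, caller, fs),         # plain mount
               owner_on_disk(1000, caller, fs, mnt))    # idmapped mount
        for name, fs in examples.items()
    }
    print(results)
    # {'example 2': (None, 1000), 'example 3': (11000, 1000)}

In this model the plain mount either has no valid mapping (Example 2) or
stores ``u11000`` (Example 3), while the idmapped mount stores ``u1000`` in
both cases.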
0786
0787 Example 4 reconsidered
0788 ~~~~~~~~~~~~~~~~~~~~~~
0789
0790 ::
0791
0792  file id:              u1000
0793  caller idmapping:     u0:k10000:r10000
0794  filesystem idmapping: u0:k0:r4294967295
0795  mount idmapping:      u0:k10000:r10000
0796
0797 In order to report ownership to userspace the kernel now does three steps using
0798 the translation algorithm we introduced earlier:
0799
0800 1. Map the userspace id on disk down into a kernel id in the filesystem's
0801    idmapping::
0802
0803     make_kuid(u0:k0:r4294967295, u1000) = k1000
0804
0805 2. Translate the kernel id into a kernel id in the mount's idmapping::
0806
0807     i_uid_into_mnt(k1000):
0808       /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
0809       from_kuid(u0:k0:r4294967295, k1000) = u1000
0810
0811       /* Map the userspace id down into a kernel id in the mount's idmapping. */
0812       make_kuid(u0:k10000:r10000, u1000) = k11000
0813
0814 3. Map the kernel id up into a userspace id in the caller's idmapping::
0815
0816     from_kuid(u0:k10000:r10000, k11000) = u1000
0817
0818 Earlier, the file's kernel id couldn't be crossmapped in the caller's
0819 idmapping. With the idmapped mount in place it can now be crossmapped into the
0820 caller's idmapping via the mount's idmapping. The file is now reported as owned
0821 by ``u1000`` according to the mount's idmapping.
0822
0823 Example 5 reconsidered
0824 ~~~~~~~~~~~~~~~~~~~~~~
0825
0826 ::
0827
0828  file id:              u1000
0829  caller idmapping:     u0:k10000:r10000
0830  filesystem idmapping: u0:k20000:r10000
0831  mount idmapping:      u0:k10000:r10000
0832
0833 Again, in order to report ownership to userspace the kernel now does three
0834 steps using the translation algorithm we introduced earlier:
0835
0836 1. Map the userspace id on disk down into a kernel id in the filesystem's
0837    idmapping::
0838
0839     make_kuid(u0:k20000:r10000, u1000) = k21000
0840
0841 2. Translate the kernel id into a kernel id in the mount's idmapping::
0842
0843     i_uid_into_mnt(k21000):
0844       /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
0845       from_kuid(u0:k20000:r10000, k21000) = u1000
0846
0847       /* Map the userspace id down into a kernel id in the mount's idmapping. */
0848       make_kuid(u0:k10000:r10000, u1000) = k11000
0849
0850 3. Map the kernel id up into a userspace id in the caller's idmapping::
0851
0852     from_kuid(u0:k10000:r10000, k11000) = u1000
0853
0854 Earlier, the file's kernel id couldn't be crossmapped in the caller's
0855 idmapping. With the idmapped mount in place it can now be crossmapped into the
0856 caller's idmapping via the mount's idmapping. The file is now owned by
0857 ``u1000`` according to the mount's idmapping.
0858
0859 Changing ownership on a home directory
0860 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0861
0862 We've seen above how idmapped mounts can be used to translate between
0863 idmappings when either the caller, the filesystem, or both use a non-initial
0864 idmapping. A wide range of use cases exists when the caller is using
0865 a non-initial idmapping. This mostly happens in the context of containerized
0866 workloads. As we have seen, the consequence is that for both filesystems
0867 mounted with the initial idmapping and filesystems mounted with non-initial
0868 idmappings, access to the filesystem doesn't work because the kernel ids can't
0869 be crossmapped between the caller's and the filesystem's idmapping.
0870
0871 As we've seen above idmapped mounts provide a solution to this by remapping the
0872 caller's or filesystem's idmapping according to the mount's idmapping.
0873
0874 Aside from containerized workloads, idmapped mounts have the advantage that
0875 they also work when both the caller and the filesystem use the initial
0876 idmapping which means users on the host can change the ownership of directories
0877 and files on a per-mount basis.
0878
0879 Consider our previous example where a user has their home directory on portable
0880 storage. At home they have id ``u1000`` and all files in their home directory
0881 are owned by ``u1000`` whereas at uni or work they have login id ``u1125``.
0882
0883 Taking their home directory with them becomes problematic. They can't easily
0884 access their files, they might not be able to write to disk without applying
0885 lax permissions or ACLs and even if they can, they will end up with an annoying
0886 mix of files and directories owned by ``u1000`` and ``u1125``.
0887
0888 Idmapped mounts solve this problem. A user can create an idmapped mount for
0889 their home directory on their work computer or their computer at home,
0890 depending on which ownership they would prefer to end up with on the portable
0891 storage itself.
0892
0893 Let's assume they want all files on disk to belong to ``u1000``. When the user
0894 plugs in their portable storage at their workstation they can set up a job that
0895 creates an idmapped mount with the minimal idmapping ``u1000:k1125:r1``. So now
0896 when they create a file the kernel performs the following steps we already know
0897 from above::
0898
0899  caller id:            u1125
0900  caller idmapping:     u0:k0:r4294967295
0901  filesystem idmapping: u0:k0:r4294967295
0902  mount idmapping:      u1000:k1125:r1
0903
0904 1. Map the caller's userspace ids into kernel ids in the caller's idmapping::
0905
0906     make_kuid(u0:k0:r4294967295, u1125) = k1125
0907
0908 2. Translate the caller's kernel id into a kernel id in the filesystem's
0909    idmapping::
0910
0911     mapped_fsuid(k1125):
0912       /* Map the kernel id up into a userspace id in the mount's idmapping. */
0913       from_kuid(u1000:k1125:r1, k1125) = u1000
0914
0915       /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
0916       make_kuid(u0:k0:r4294967295, u1000) = k1000
0917
0918 3. Verify that the caller's kernel ids can be mapped to userspace ids in the
0919    filesystem's idmapping::
0920
0921     from_kuid(u0:k0:r4294967295, k1000) = u1000
0922
0923 So ultimately the file will be created with ``u1000`` on disk.
0924
0925 Now let's briefly look at what ownership the caller with id ``u1125`` will see
0926 on their work computer:
0927
0928 ::
0929
0930  file id:              u1000
0931  caller idmapping:     u0:k0:r4294967295
0932  filesystem idmapping: u0:k0:r4294967295
0933  mount idmapping:      u1000:k1125:r1
0934
0935 1. Map the userspace id on disk down into a kernel id in the filesystem's
0936    idmapping::
0937
0938     make_kuid(u0:k0:r4294967295, u1000) = k1000
0939
0940 2. Translate the kernel id into a kernel id in the mount's idmapping::
0941
0942     i_uid_into_mnt(k1000):
0943       /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
0944       from_kuid(u0:k0:r4294967295, k1000) = u1000
0945
0946       /* Map the userspace id down into a kernel id in the mount's idmapping. */
0947       make_kuid(u1000:k1125:r1, u1000) = k1125
0948
0949 3. Map the kernel id up into a userspace id in the caller's idmapping::
0950
0951     from_kuid(u0:k0:r4294967295, k1125) = u1125
0952
0953 So ultimately the caller will see that the file belongs to ``u1125``,
0954 which is the caller's userspace id on their workstation in our example.
0955
0956 The raw userspace id that is put on disk is ``u1000`` so when the user takes
0957 their home directory back to their home computer where they are assigned
0958 ``u1000`` using the initial idmapping and mount the filesystem with the initial
0959 idmapping they will see all those files owned by ``u1000``.
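
The home directory scenario can be replayed in both directions with a small
userspace model. As before, this is a sketch under stated assumptions rather
than kernel code: ``Idmap``, ``create()`` and ``report()`` are hypothetical
names, each idmapping is a single ``u:k:r`` extent, and ``None`` marks an id
that has no mapping in this model. How the kernel reacts to such an unmapped
id is not modeled here.

.. code-block:: python

    from typing import NamedTuple, Optional


    class Idmap(NamedTuple):
        u: int  # first userspace id
        k: int  # first kernel id
        r: int  # number of ids mapped


    def make_kuid(m: Idmap, uid: Optional[int]) -> Optional[int]:
        """Userspace id -> kernel id, None if unmapped."""
        if uid is not None and m.u <= uid < m.u + m.r:
            return m.k + (uid - m.u)
        return None


    def from_kuid(m: Idmap, kuid: Optional[int]) -> Optional[int]:
        """Kernel id -> userspace id, None if unmapped."""
        if kuid is not None and m.k <= kuid < m.k + m.r:
            return m.u + (kuid - m.k)
        return None


    INITIAL = Idmap(0, 0, 4294967295)
    caller = fs = INITIAL
    mnt = Idmap(1000, 1125, 1)  # the minimal idmapping u1000:k1125:r1


    def create(uid: int) -> Optional[int]:
        """Userspace id written to disk when `uid` creates a file."""
        kuid = make_kuid(caller, uid)
        kuid = make_kuid(fs, from_kuid(mnt, kuid))  # mapped_fsuid() step
        return from_kuid(fs, kuid)


    def report(disk_uid: int) -> Optional[int]:
        """Userspace id shown to the caller for an on-disk owner."""
        kuid = make_kuid(fs, disk_uid)
        kuid = make_kuid(mnt, from_kuid(fs, kuid))  # i_uid_into_mnt() step
        return from_kuid(caller, kuid)


    print(create(1125), report(1000))  # 1000 1125
    print(create(1126), report(0))     # None None

Because the mount's idmapping covers exactly one id, only ``u1125`` is
translated in this model. Every other caller id and every other on-disk owner
falls outside the ``r1`` range and yields ``None``.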