.. SPDX-License-Identifier: GPL-2.0

====================================================
pin_user_pages() and related calls
====================================================

.. contents:: :local:

Overview
========

This document describes the following functions::

 pin_user_pages()
 pin_user_pages_fast()
 pin_user_pages_remote()

Basic description of FOLL_PIN
=============================

FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the get_user_pages*()
("gup") family of functions. FOLL_PIN has significant interactions and
interdependencies with FOLL_LONGTERM, so both are covered here.

FOLL_PIN is internal to gup, meaning that it should not appear at the gup call
sites. This allows the associated wrapper functions (pin_user_pages*() and
others) to set the correct combination of these flags, and to check for problems
as well.

FOLL_LONGTERM, on the other hand, *is* allowed to be set at the gup call sites.
This is in order to avoid creating a large number of wrapper functions to cover
all combinations of get*(), pin*(), FOLL_LONGTERM, and more. Also, the
pin_user_pages*() APIs are clearly distinct from the get_user_pages*() APIs, so
that's a natural dividing line, and a good point to make separate wrapper calls.
In other words, use pin_user_pages*() for DMA-pinned pages, and
get_user_pages*() for other cases. There are five cases described later on in
this document, to further clarify that concept.

FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However,
multiple threads and call sites are free to pin the same struct pages, via both
FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the
other, not the struct page(s).

The FOLL_PIN implementation is nearly the same as FOLL_GET, except that FOLL_PIN
uses a different reference counting technique.

FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying that is,
FOLL_LONGTERM is a specific, more restrictive case of FOLL_PIN.
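
The two rules above (a given call chooses either FOLL_PIN or FOLL_GET, and
FOLL_LONGTERM builds on FOLL_PIN) can be sketched as a small userspace check.
This is an illustration only: the DEMO_FOLL_* bit values and
demo_gup_flags_valid() are hypothetical names made up for this example. They do
not match the kernel's definitions, and the kernel's own checks may be
structured differently.

.. code-block:: c

    #include <stdbool.h>

    /* Hypothetical flag bits for illustration; NOT the kernel's values. */
    #define DEMO_FOLL_GET      (1u << 0)
    #define DEMO_FOLL_PIN      (1u << 1)
    #define DEMO_FOLL_LONGTERM (1u << 2)

    /* Return true if the flags of a single gup call obey the rules above. */
    static bool demo_gup_flags_valid(unsigned int flags)
    {
        /* A given call picks FOLL_PIN or FOLL_GET, never both. */
        if ((flags & DEMO_FOLL_PIN) && (flags & DEMO_FOLL_GET))
            return false;

        /* FOLL_LONGTERM is a more restrictive case of FOLL_PIN. */
        if ((flags & DEMO_FOLL_LONGTERM) && !(flags & DEMO_FOLL_PIN))
            return false;

        return true;
    }

In this model, DEMO_FOLL_PIN | DEMO_FOLL_LONGTERM is accepted, while
DEMO_FOLL_PIN | DEMO_FOLL_GET and a bare DEMO_FOLL_LONGTERM are rejected.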

Which flags are set by each wrapper
===================================

For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
flags the caller provides. The caller is required to pass in a non-null
struct page ** array, and the function then pins pages by incrementing the
refcount of each by a special value: GUP_PIN_COUNTING_BIAS.

For compound pages, the GUP_PIN_COUNTING_BIAS scheme is not used. Instead,
an exact form of pin counting is achieved, by using the 2nd struct page
in the compound page. A new struct page field, compound_pincount, has
been added in order to support this.

This approach for compound pages avoids the counting upper limit problems that
are discussed below. Those limitations would have been aggravated severely by
huge pages, because each tail page adds a refcount to the head page. And in
fact, testing revealed that, without a separate compound_pincount field,
page refcount overflows were seen in some huge page stress tests.

This also means that huge pages and compound pages do not suffer
from the false positives problem that is mentioned below.

The pin_user_pages*() wrappers set FOLL_PIN as follows::

 Function
 --------
 pin_user_pages          FOLL_PIN is always set internally by this function.
 pin_user_pages_fast     FOLL_PIN is always set internally by this function.
 pin_user_pages_remote   FOLL_PIN is always set internally by this function.

For these get_user_pages*() functions, FOLL_GET might not even be specified.
Behavior is a little more complex than above. If FOLL_GET was *not* specified,
but the caller passed in a non-null struct page ** array, then the function
sets FOLL_GET for you, and proceeds to pin pages by incrementing the refcount
of each page by +1. The get_user_pages*() wrappers set FOLL_GET as follows::

 Function
 --------
 get_user_pages          FOLL_GET is sometimes set internally by this function.
 get_user_pages_fast     FOLL_GET is sometimes set internally by this function.
 get_user_pages_remote   FOLL_GET is sometimes set internally by this function.
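
The flag handling summarized in the two tables above can be modelled in
userspace as follows. This is a rough sketch, not the kernel implementation:
struct demo_page, the DEMO_FOLL_* values and both helper functions are
hypothetical, and rejecting a caller-supplied FOLL_GET or FOLL_PIN is an
assumption drawn from the rules stated earlier in this document.

.. code-block:: c

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical flag bits for illustration; NOT the kernel's values. */
    #define DEMO_FOLL_GET (1u << 0)
    #define DEMO_FOLL_PIN (1u << 1)

    struct demo_page;   /* opaque stand-in for struct page */

    /*
     * Model of a pin_user_pages*() wrapper: a pages array is mandatory,
     * FOLL_GET from the caller is refused, and FOLL_PIN is always OR'd in.
     */
    static bool demo_pin_wrapper_flags(unsigned int *flags,
                                       struct demo_page **pages)
    {
        if (pages == NULL || (*flags & DEMO_FOLL_GET))
            return false;

        *flags |= DEMO_FOLL_PIN;
        return true;
    }

    /*
     * Model of a get_user_pages*() wrapper: FOLL_PIN must not come from
     * the call site, and FOLL_GET is set only when a pages array is given.
     */
    static bool demo_get_wrapper_flags(unsigned int *flags,
                                       struct demo_page **pages)
    {
        if (*flags & DEMO_FOLL_PIN)
            return false;

        if (pages != NULL)
            *flags |= DEMO_FOLL_GET;
        return true;
    }

The "sometimes" in the second table corresponds to the pages != NULL test: with
no array to fill in, the model leaves FOLL_GET unset.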

Tracking dma-pinned pages
=========================

Some of the key design constraints, and solutions, for tracking dma-pinned
pages:

* An actual reference count, per struct page, is required. This is because
  multiple processes may pin and unpin a page.

* False positives (reporting that a page is dma-pinned, when in fact it is not)
  are acceptable, but false negatives are not.

* struct page may not be increased in size for this, and all fields are already
  used.

* Given the above, we can overload the page->_refcount field by using, sort of,
  the upper bits in that field for a dma-pinned count. "Sort of", means that,
  rather than dividing page->_refcount into bit fields, we simply add a
  medium-large value (GUP_PIN_COUNTING_BIAS, initially chosen to be 1024:
  10 bits) to page->_refcount. This provides fuzzy behavior: if a page has
  get_page() called on it 1024 times, then it will appear to have a single
  dma-pinned count. And again, that's acceptable.

  This also leads to limitations: there are only 31-10==21 bits available for a
  counter that increments 10 bits at a time.

* Callers must specifically request "dma-pinned tracking of pages". In other
  words, just calling get_user_pages() will not suffice; a new set of functions,
  pin_user_pages() and related, must be used.
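
The biased counting described in the list above can be tried out with a small
userspace model. It is a sketch under stated assumptions: struct demo_page and
the demo_*() helpers are hypothetical stand-ins, and only the value 1024 and
the 31-10 bit arithmetic are taken from the text.

.. code-block:: c

    #include <assert.h>

    /* The text cites 1024 (10 bits) as the initial GUP_PIN_COUNTING_BIAS. */
    #define DEMO_PIN_COUNTING_BIAS 1024

    /* Hypothetical stand-in for struct page; only the refcount is modelled. */
    struct demo_page {
        int refcount;
    };

    /* Ordinary reference, as taken by get_page(): +1. */
    static void demo_get_page(struct demo_page *page)
    {
        page->refcount += 1;
    }

    /* DMA pin: add the whole bias to the same counter in one step. */
    static void demo_pin_page(struct demo_page *page)
    {
        page->refcount += DEMO_PIN_COUNTING_BIAS;
    }

    /* Undo one DMA pin. */
    static void demo_unpin_page(struct demo_page *page)
    {
        assert(page->refcount >= DEMO_PIN_COUNTING_BIAS);
        page->refcount -= DEMO_PIN_COUNTING_BIAS;
    }

    /* Number of pins the refcount appears to hold; this can over-report. */
    static int demo_apparent_pin_count(const struct demo_page *page)
    {
        return page->refcount / DEMO_PIN_COUNTING_BIAS;
    }

    /* With 31 usable bits and a 10-bit bias, 21 bits remain for pins. */
    static long demo_max_pin_count(void)
    {
        return 1L << (31 - 10);
    }

Calling demo_get_page() 1024 times on an unpinned page makes
demo_apparent_pin_count() report one pin, which is the acceptable false
positive from the text. A false negative cannot occur in this model, because
every real pin raises the count by the full bias.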

FOLL_PIN, FOLL_GET, FOLL_LONGTERM: when to use which flags
==========================================================

Thanks to Jan Kara, Vlastimil Babka and several other -mm people, for describing
these categories:

CASE 1: Direct IO (DIO)
-----------------------
There are GUP references to pages that are serving as DIO buffers. These
buffers are needed for a relatively short time (so they are not "long term").
No special synchronization with page_mkclean() or munmap() is provided.
Therefore, flags to set at the call site are: ::

    FOLL_PIN

...but rather than setting FOLL_PIN directly, call sites should use one of
the pin_user_pages*() routines that set FOLL_PIN.

CASE 2: RDMA
------------
There are GUP references to pages that are serving as DMA
buffers. These buffers are needed for a long time ("long term"). No special
synchronization with page_mkclean() or munmap() is provided. Therefore, flags
to set at the call site are: ::

    FOLL_PIN | FOLL_LONGTERM

NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That's
because DAX pages do not have a separate page cache, and so "pinning" implies
locking down file system blocks, which is not (yet) supported in that way.

CASE 3: MMU notifier registration, with or without page faulting hardware
-------------------------------------------------------------------------
Device drivers can pin pages via get_user_pages*(), and register for mmu
notifier callbacks for the memory range. Then, upon receiving a notifier
"invalidate range" callback, stop the device from using the range, and unpin
the pages. There may be other possible schemes, such as, for example, explicitly
synchronizing against pending IO, that accomplish approximately the same thing.

Or, if the hardware supports replayable page faults, then the device driver can
avoid pinning entirely (this is ideal), as follows: register for mmu notifier
callbacks as above, but instead of stopping the device and unpinning in the
callback, simply remove the range from the device's page tables.

Either way, as long as the driver unpins the pages upon mmu notifier callback,
then there is proper synchronization with both filesystem and mm
(page_mkclean(), munmap(), etc). Therefore, neither flag needs to be set.

CASE 4: Pinning for struct page manipulation only
-------------------------------------------------
If only struct page data (as opposed to the actual memory contents that a page
is tracking) is affected, then normal GUP calls are sufficient, and neither flag
needs to be set.

CASE 5: Pinning in order to write to the data within the page
-------------------------------------------------------------
Even though neither DMA nor Direct IO is involved, just a simple case of "pin,
write to a page's data, unpin" can cause a problem. Case 5 may be considered a
superset of Case 1, plus Case 2, plus anything that invokes that pattern. In
other words, if the code is neither Case 1 nor Case 2, it may still require
FOLL_PIN, for patterns like this:

Correct (uses FOLL_PIN calls):
    pin_user_pages()
    write to the data within the pages
    unpin_user_pages()

INCORRECT (uses FOLL_GET calls):
    get_user_pages()
    write to the data within the pages
    put_page()
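
The "Correct" pattern above can be illustrated with a self-contained userspace
model. Everything here is a hypothetical stand-in: struct demo_page, the
demo_*() functions and DEMO_PAGE_SIZE are invented for the example. The demo
helpers take an already-resolved page array purely to keep the sketch small,
whereas the real kernel calls operate on a user address range.

.. code-block:: c

    #include <assert.h>
    #include <string.h>

    #define DEMO_PAGE_SIZE         4096
    #define DEMO_PIN_COUNTING_BIAS 1024

    /* Hypothetical stand-in for a page: a refcount plus the data it tracks. */
    struct demo_page {
        int refcount;
        unsigned char data[DEMO_PAGE_SIZE];
    };

    /* Stand-in for pin_user_pages(): pins every page, returns how many. */
    static long demo_pin_user_pages(struct demo_page **pages, long nr_pages)
    {
        long i;

        for (i = 0; i < nr_pages; i++)
            pages[i]->refcount += DEMO_PIN_COUNTING_BIAS;
        return nr_pages;
    }

    /* Stand-in for unpin_user_pages(): mirrors the pin, unlike put_page(). */
    static void demo_unpin_user_pages(struct demo_page **pages, long nr_pages)
    {
        long i;

        for (i = 0; i < nr_pages; i++) {
            assert(pages[i]->refcount >= DEMO_PIN_COUNTING_BIAS);
            pages[i]->refcount -= DEMO_PIN_COUNTING_BIAS;
        }
    }

    /* Correct pattern from the text: pin, write to the page data, unpin. */
    static long demo_fill_pages(struct demo_page **pages, long nr_pages,
                                unsigned char value)
    {
        long i;
        long pinned = demo_pin_user_pages(pages, nr_pages);

        for (i = 0; i < pinned; i++)
            memset(pages[i]->data, value, DEMO_PAGE_SIZE);

        demo_unpin_user_pages(pages, pinned);
        return pinned;
    }

The point of the model is the pairing: a reference taken with the pin call is
dropped with the matching unpin call, so the page's refcount returns to its
starting value once the write is done.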

page_maybe_dma_pinned(): the whole point of pinning
===================================================

The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able
to query, "is this page DMA-pinned?" That allows code such as page_mkclean()
(and file system writeback code in general) to make informed decisions about
what to do when a page cannot be unmapped due to such pins.

What to do in those cases is the subject of a years-long series of discussions
and debates (see the References at the end of this document). It's a TODO item
here: fill in the details once that's worked out. Meanwhile, it's safe to say
that having this available: ::

        static inline bool page_maybe_dma_pinned(struct page *page)

...is a prerequisite to solving the long-running gup+DMA problem.
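
To make the query concrete, here is a minimal userspace model of the decision
it has to make, based only on the counting schemes described earlier in this
document. It is not the kernel's implementation: struct demo_page and its
is_compound field are hypothetical, and the model ignores details such as
locating the head page of a compound page.

.. code-block:: c

    #include <stdbool.h>

    /* The text cites 1024 as the initial GUP_PIN_COUNTING_BIAS. */
    #define DEMO_PIN_COUNTING_BIAS 1024

    /* Hypothetical stand-in for the head struct page. */
    struct demo_page {
        int  refcount;            /* shared by get and pin references */
        bool is_compound;         /* hypothetical marker for compound pages */
        int  compound_pincount;   /* exact pin count, compound pages only */
    };

    static bool demo_page_maybe_dma_pinned(const struct demo_page *page)
    {
        /* Compound pages carry an exact pin count: no false positives. */
        if (page->is_compound)
            return page->compound_pincount > 0;

        /* Small pages: a refcount at or above the bias may mean a pin. */
        return page->refcount >= DEMO_PIN_COUNTING_BIAS;
    }

The "maybe" in the name reflects the second branch: a small page with 1024
ordinary references is reported as pinned, while a compound page with any
number of ordinary references and no pins is not.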

Another way of thinking about FOLL_GET, FOLL_PIN, and FOLL_LONGTERM
===================================================================

Another way of thinking about these flags is as a progression of restrictions:
FOLL_GET is for struct page manipulation, without affecting the data that the
struct page refers to. FOLL_PIN is a *replacement* for FOLL_GET, and is for
short term pins on pages whose data *will* get accessed. As such, FOLL_PIN is
a "more severe" form of pinning. And finally, FOLL_LONGTERM is an even more
restrictive case that has FOLL_PIN as a prerequisite: this is for pages that
will be pinned longterm, and whose data will be accessed.
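
Read together with the five cases described earlier, this progression amounts
to a small lookup from use case to pinning flags. The sketch below encodes that
reading. The enum, the DEMO_FOLL_* values and demo_flags_for_case() are
hypothetical, and a result of 0 only means that neither FOLL_PIN nor
FOLL_LONGTERM is needed, so the normal get_user_pages*() calls apply.

.. code-block:: c

    /* Hypothetical flag bits for illustration; NOT the kernel's values. */
    #define DEMO_FOLL_PIN      (1u << 1)
    #define DEMO_FOLL_LONGTERM (1u << 2)

    /* The five cases from this document, numbered as in the text. */
    enum demo_gup_case {
        DEMO_CASE_DIRECT_IO = 1,
        DEMO_CASE_RDMA,
        DEMO_CASE_MMU_NOTIFIER,
        DEMO_CASE_STRUCT_PAGE_ONLY,
        DEMO_CASE_WRITE_PAGE_DATA,
    };

    /* Return the pinning flags that the text associates with each case. */
    static unsigned int demo_flags_for_case(enum demo_gup_case gup_case)
    {
        switch (gup_case) {
        case DEMO_CASE_DIRECT_IO:
        case DEMO_CASE_WRITE_PAGE_DATA:
            /* Short-term pin on pages whose data gets accessed. */
            return DEMO_FOLL_PIN;
        case DEMO_CASE_RDMA:
            /* Long-lived pin: FOLL_PIN plus FOLL_LONGTERM. */
            return DEMO_FOLL_PIN | DEMO_FOLL_LONGTERM;
        case DEMO_CASE_MMU_NOTIFIER:
        case DEMO_CASE_STRUCT_PAGE_ONLY:
            /* Neither flag needs to be set. */
            return 0;
        }
        return 0;
    }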

Unit testing
============
This file::

 tools/testing/selftests/vm/gup_test.c

has the following new calls to exercise the new pin*() wrapper functions:

* PIN_FAST_BENCHMARK (./gup_test -a)
* PIN_BASIC_TEST (./gup_test -b)

You can monitor how many total dma-pinned pages have been acquired and released
since the system was booted, via two new entries in /proc/vmstat: ::

    nr_foll_pin_acquired
    nr_foll_pin_released

Under normal conditions, these two values will be equal unless there are any
long-term [R]DMA pins in place, or during pin/unpin transitions.
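
As a rough illustration of how a monitoring tool might compare the two
counters, the sketch below parses "name value" lines as found in /proc/vmstat
and computes the difference. The helper names are hypothetical, the 64 KiB
buffer size is an arbitrary assumption, and the counters are simply reported
as missing on kernels that do not provide them.

.. code-block:: c

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    /*
     * Find counter "name" in vmstat-style text (one "name value" pair per
     * line) and store its value. Returns false if the counter is absent.
     */
    static bool demo_vmstat_lookup(const char *text, const char *name,
                                   unsigned long long *value)
    {
        size_t name_len = strlen(name);
        const char *line = text;

        while (line != NULL && *line != '\0') {
            /* Match the whole name, followed by the separating space. */
            if (strncmp(line, name, name_len) == 0 && line[name_len] == ' ')
                return sscanf(line + name_len, "%llu", value) == 1;

            line = strchr(line, '\n');
            if (line != NULL)
                line++;
        }
        return false;
    }

    /*
     * Read /proc/vmstat and compute acquired minus released. Returns false
     * if the file or either counter is unavailable.
     */
    static bool demo_outstanding_pins(unsigned long long *outstanding)
    {
        static char buf[65536];
        unsigned long long acquired, released;
        size_t len;
        FILE *file = fopen("/proc/vmstat", "r");

        if (file == NULL)
            return false;
        len = fread(buf, 1, sizeof(buf) - 1, file);
        fclose(file);
        buf[len] = '\0';

        if (!demo_vmstat_lookup(buf, "nr_foll_pin_acquired", &acquired) ||
            !demo_vmstat_lookup(buf, "nr_foll_pin_released", &released))
            return false;

        *outstanding = acquired - released;
        return true;
    }

A non-zero result is not necessarily a leak: as noted above, long-term pins
and in-flight pin/unpin transitions also keep the two values apart.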

* nr_foll_pin_acquired: This is the number of logical pins that have been
  acquired since the system was powered on. For huge pages, the head page is
  pinned once for each page (head page and each tail page) within the huge page.
  This follows the same sort of behavior that get_user_pages() uses for huge
  pages: the head page is refcounted once for each tail or head page in the huge
  page, when get_user_pages() is applied to a huge page.

* nr_foll_pin_released: The number of logical pins that have been released since
  the system was powered on. Note that pages are released (unpinned) on a
  PAGE_SIZE granularity, even if the original pin was applied to a huge page.
  Because of the pin count behavior described above in "nr_foll_pin_acquired",
  the accounting balances out, so that after doing this::

    pin_user_pages(huge_page);
    for (each page in huge_page)
        unpin_user_page(page);

  ...the following is expected::

    nr_foll_pin_released == nr_foll_pin_acquired

(...unless it was already out of balance due to a long-term RDMA pin being in
place.)
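
The balancing argument above can be checked with a toy model of the two
counters. The struct and helpers are hypothetical stand-ins that model only
the accounting, not the pinning itself.

.. code-block:: c

    /* Hypothetical model of the two /proc/vmstat counters. */
    struct demo_pin_stats {
        unsigned long acquired;
        unsigned long released;
    };

    /* Pinning a huge page counts one logical pin per constituent page. */
    static void demo_pin_huge_page(struct demo_pin_stats *stats,
                                   unsigned long nr_subpages)
    {
        stats->acquired += nr_subpages;
    }

    /* Unpinning happens at PAGE_SIZE granularity: one release per call. */
    static void demo_unpin_user_page(struct demo_pin_stats *stats)
    {
        stats->released += 1;
    }

For example, assuming a huge page made of 512 base pages, one
demo_pin_huge_page() call followed by 512 demo_unpin_user_page() calls leaves
both counters at 512.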

Other diagnostics
=================

dump_page() has been enhanced slightly, to handle these new counting
fields, and to better report on compound pages in general. Specifically,
for compound pages, the exact (compound_pincount) pincount is reported.

References
==========

* `Some slow progress on get_user_pages() (Apr 2, 2019) <https://lwn.net/Articles/784574/>`_
* `DMA and get_user_pages() (LPC: Dec 12, 2018) <https://lwn.net/Articles/774411/>`_
* `The trouble with get_user_pages() (Apr 30, 2018) <https://lwn.net/Articles/753027/>`_
* `LWN kernel index: get_user_pages() <https://lwn.net/Kernel/Index/#Memory_management-get_user_pages>`_

John Hubbard, October, 2019