.. SPDX-License-Identifier: GPL-2.0

=======
The TLB
=======

When the kernel unmaps or modifies the attributes of a range of
memory, it has two choices:

1. Flush the entire TLB with a two-instruction sequence. This is
   a quick operation, but it causes collateral damage: TLB entries
   from areas other than the one we are trying to flush will be
   destroyed and must be refilled later, at some cost.
2. Use the invlpg instruction to invalidate a single page at a
   time. This could potentially cost many more instructions, but
   it is a much more precise operation, causing no collateral
   damage to other TLB entries.

Which method to use depends on a few things:

1. The size of the flush being performed. A flush of the entire
   address space is obviously better performed by flushing the
   entire TLB than by doing 2^48/PAGE_SIZE individual flushes
   (2^36 of them with 4 KB pages).
2. The contents of the TLB. If the TLB is empty, then a global
   flush causes no collateral damage, and all of the individual
   flushes would have been wasted work.
3. The size of the TLB. The larger the TLB, the more collateral
   damage we do with a full flush. So, the larger the TLB, the
   more attractive an individual flush looks. Data and
   instructions have separate TLBs, as do different page sizes.
4. The microarchitecture. The TLB has become a multi-level
   cache on modern CPUs, and the global flushes have become more
   expensive relative to single-page flushes.

There is obviously no way the kernel can know all these things,
especially the contents of the TLB during a given flush. The
sizes of the flushes will vary greatly depending on the workload
as well. There is essentially no "right" point to choose.

You may be doing too many individual invalidations if you see the
invlpg instruction (or instructions *near* it) show up high in
profiles. If you believe that individual invalidations are being
called too often, you can lower the tunable::

    /sys/kernel/debug/x86/tlb_single_page_flush_ceiling

This will cause us to do the global flush for more cases.
Lowering it to 0 will disable the use of the individual flushes.
Setting it to 1 is a very conservative setting, and it should
never need to be 0 under normal circumstances.
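
The tunable can also be read and written from a script. The
following Python sketch assumes that debugfs is mounted at
/sys/kernel/debug and that the caller has sufficient privileges.
The helper names ``read_ceiling`` and ``write_ceiling`` are
illustrative only and are not part of any kernel interface:

```python
from pathlib import Path

# Path taken from the text above; requires debugfs to be mounted.
CEILING_PATH = "/sys/kernel/debug/x86/tlb_single_page_flush_ceiling"


def read_ceiling(path=CEILING_PATH):
    """Return the current ceiling as an integer (hypothetical helper)."""
    return int(Path(path).read_text().strip())


def write_ceiling(value, path=CEILING_PATH):
    """Write a new ceiling (hypothetical helper).

    Per the text above, 0 disables individual flushes entirely.
    """
    if value < 0:
        raise ValueError("ceiling must be non-negative")
    Path(path).write_text(f"{value}\n")
```

Recording the result of ``read_ceiling()`` before calling
``write_ceiling()`` makes it easy to restore the previous value
after an experiment.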

Despite the fact that a single individual flush on x86 is
guaranteed to flush a full 2MB [1]_, hugetlbfs always uses
full flushes. THP is treated exactly the same as normal memory.
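
The policy described so far can be summarized in a short Python
sketch. It is a model, not the kernel's code: the function name,
the ``is_hugetlb`` flag, and the exact comparison against the
ceiling are assumptions chosen to match the behavior described
above (a ceiling of 0 never uses individual flushes, and
hugetlbfs always gets a full flush):

```python
def choose_flush(nr_pages, ceiling, is_hugetlb=False):
    """Return "full" or "individual" for a flush of nr_pages pages.

    Illustrative model only; the names and the comparison are
    assumptions, not taken from the kernel source.
    """
    if nr_pages < 1:
        raise ValueError("nr_pages must be at least 1")
    if is_hugetlb:
        # hugetlbfs mappings always use the full flush.
        return "full"
    if nr_pages > ceiling:
        # Too many pages to invalidate one at a time.
        return "full"
    return "individual"
```

With a ceiling of 1, only single-page flushes return
``"individual"``. With a ceiling of 0, every call returns
``"full"``.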

You might see invlpg inside of flush_tlb_mm_range() show up in
profiles, or you can use the trace_tlb_flush() tracepoints to
determine how long the flush operations are taking.

Essentially, you are balancing the cycles you spend doing invlpg
against the cycles that you spend refilling the TLB later.
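
That trade-off can be made concrete with a rough cost model. The
sketch below is only an illustration: the per-operation cycle
costs are hypothetical parameters that have to be measured on the
target CPU, and the model assumes that a full flush evicts
``live_entries`` useful translations that each have to be
refilled once:

```python
def individual_flush_cost(nr_pages, invlpg_cycles):
    """Cycles spent issuing one invlpg per page (simplified model)."""
    return nr_pages * invlpg_cycles


def full_flush_cost(flush_cycles, live_entries, refill_cycles):
    """Cycles for the flush itself plus refilling the evicted entries."""
    return flush_cycles + live_entries * refill_cycles


def break_even_pages(invlpg_cycles, flush_cycles, live_entries,
                     refill_cycles):
    """Largest page count for which individual flushes cost no more
    than a full flush under this model."""
    if invlpg_cycles <= 0:
        raise ValueError("invlpg_cycles must be positive")
    total = full_flush_cost(flush_cycles, live_entries, refill_cycles)
    return total // invlpg_cycles
```

In this model an empty TLB (``live_entries`` of 0) makes the
full flush cheap, while a large, well-populated TLB pushes the
break-even point toward more individual flushes, matching the
factors listed earlier.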

You can measure how expensive TLB refills are by using
performance counters and 'perf stat', like this::

    perf stat \
      -e cpu/event=0x8,umask=0x84,name=dtlb_load_misses_walk_duration/ \
      -e cpu/event=0x8,umask=0x82,name=dtlb_load_misses_walk_completed/ \
      -e cpu/event=0x49,umask=0x4,name=dtlb_store_misses_walk_duration/ \
      -e cpu/event=0x49,umask=0x2,name=dtlb_store_misses_walk_completed/ \
      -e cpu/event=0x85,umask=0x4,name=itlb_misses_walk_duration/ \
      -e cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed/

Append the command to be measured, or use -a to count system-wide.

That works on an IvyBridge-era CPU (i5-3320M). Different CPUs
may have differently-named counters, but they should at least
be there in some form. You can use pmu-tools 'ocperf list'
(https://github.com/andikleen/pmu-tools) to find the right
counters for a given CPU.
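
The duration and completed counters can be combined into an
average cost per page walk. The following sketch assumes that the
counts have already been copied out of the 'perf stat' output
into a dictionary keyed by the event names used above, and that
the walk_duration events count cycles, as they do on the
IvyBridge events shown. The function name and the dictionary
layout are illustrative:

```python
def average_walk_cycles(counts, prefix):
    """Average cycles per completed page walk for one counter pair.

    counts maps event names to raw counts; prefix is one of
    "dtlb_load_misses", "dtlb_store_misses" or "itlb_misses".
    Returns None when no walks completed.
    """
    duration = counts[f"{prefix}_walk_duration"]
    completed = counts[f"{prefix}_walk_completed"]
    if completed == 0:
        return None
    return duration / completed
```

A higher average means each TLB miss is more expensive to refill,
which makes the collateral damage of a full flush costlier.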

.. [1] A footnote in Intel's SDM "4.10.4.2 Recommended Invalidation"
   says: "One execution of INVLPG is sufficient even for a page
   with size greater than 4 KBytes."