this_cpu operations
-------------------

this_cpu operations are a way of optimizing access to per cpu
variables associated with the *currently* executing processor. This is
done through the use of segment registers (or a dedicated register
where the cpu permanently stores the beginning of the per cpu area for
a specific processor).

this_cpu operations add a per cpu variable offset to the processor
specific per cpu base and encode that operation in the instruction
operating on the per cpu variable.

This means that there are no atomicity issues between the calculation of
the offset and the operation on the data. Therefore it is not
necessary to disable preemption or interrupts to ensure that the
processor is not changed between the calculation of the address and
the operation on the data.

Read-modify-write operations are of particular interest. Frequently
processors have special lower latency instructions that can operate
without the typical synchronization overhead, but still provide some
sort of relaxed atomicity guarantees. The x86, for example, can execute
RMW (Read Modify Write) instructions like inc/dec/cmpxchg without the
lock prefix and the associated latency penalty.

Access to the variable without the lock prefix is not synchronized but
synchronization is not necessary since we are dealing with per cpu
data specific to the currently executing processor. Only the current
processor should be accessing that variable and therefore there are no
concurrency issues with other processors in the system.

Please note that accesses by remote processors to a per cpu area are
exceptional situations and may impact performance and/or correctness
(remote write operations) of local RMW operations via this_cpu_*.

The main use of the this_cpu operations has been to optimize counter
operations.

The following this_cpu() operations with implied preemption protection
are defined. These operations can be used without worrying about
preemption and interrupts.

        this_cpu_read(pcp)
        this_cpu_write(pcp, val)
        this_cpu_add(pcp, val)
        this_cpu_and(pcp, val)
        this_cpu_or(pcp, val)
        this_cpu_add_return(pcp, val)
        this_cpu_xchg(pcp, nval)
        this_cpu_cmpxchg(pcp, oval, nval)
        this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
        this_cpu_sub(pcp, val)
        this_cpu_inc(pcp)
        this_cpu_dec(pcp)
        this_cpu_sub_return(pcp, val)
        this_cpu_inc_return(pcp)
        this_cpu_dec_return(pcp)


Inner working of this_cpu operations
------------------------------------

On x86 the fs: or the gs: segment registers contain the base of the
per cpu area. It is then possible to simply use the segment override
to relocate a per cpu relative address to the proper per cpu area for
the processor. So the relocation to the per cpu base is encoded in the
instruction via a segment register prefix.

For example:

        DEFINE_PER_CPU(int, x);
        int z;

        z = this_cpu_read(x);

results in a single instruction

        mov ax, gs:[x]

instead of a sequence of calculation of the address and then a fetch
from that address which occurs with the per cpu operations. Before
this_cpu_ops such a sequence also required preempt disable/enable to
prevent the kernel from moving the thread to a different processor
while the calculation is performed.

Consider the following this_cpu operation:

        this_cpu_inc(x)

The above results in the following single instruction (no lock prefix!)

        inc gs:[x]

instead of the following operations required if there is no segment
register:

        int *y;
        int cpu;

        cpu = get_cpu();
        y = per_cpu_ptr(&x, cpu);
        (*y)++;
        put_cpu();

Note that these operations can only be used on per cpu data that is
reserved for a specific processor. Without disabling preemption in the
surrounding code this_cpu_inc() will only guarantee that one of the
per cpu counters is correctly incremented. However, there is no
guarantee that the OS will not move the process directly before or
after the this_cpu instruction is executed. In general this means that
the values of the individual counters for each processor are
meaningless. The sum of all the per cpu counters is the only value
that is of interest.

Per cpu variables are used for performance reasons. Bouncing cache
lines can be avoided if multiple processors concurrently go through
the same code paths. Since each processor has its own per cpu
variables no concurrent cache line updates take place. The price that
has to be paid for this optimization is the need to add up the per cpu
counters when the value of a counter is needed.


Special operations:
-------------------

        y = this_cpu_ptr(&x)

Takes the offset of a per cpu variable (&x !) and returns the address
of the per cpu variable that belongs to the currently executing
processor. this_cpu_ptr avoids multiple steps that the common
get_cpu/put_cpu sequence requires. No processor number is
available. Instead, the offset of the local per cpu area is simply
added to the per cpu offset.

Note that this operation is usually used in a code segment when
preemption has been disabled. The pointer is then used to
access local per cpu data in a critical section. When preemption
is re-enabled this pointer is usually no longer useful since it may
no longer point to per cpu data of the current processor.


Per cpu variables and offsets
-----------------------------

Per cpu variables have *offsets* to the beginning of the per cpu
area. They do not have addresses although they look like that in the
code. Offsets cannot be directly dereferenced. The offset must be
added to a base pointer of a per cpu area of a processor in order to
form a valid address.

Therefore the use of x or &x outside of the context of per cpu
operations is invalid and will generally be treated like a NULL
pointer dereference.

        DEFINE_PER_CPU(int, x);

In the context of per cpu operations the above implies that x is a per
cpu variable. Most this_cpu operations take a per cpu variable.

        int __percpu *p = &x;

&x and hence p is the *offset* of a per cpu variable. this_cpu_ptr()
takes the offset of a per cpu variable which makes this look a bit
strange.


Operations on a field of a per cpu structure
--------------------------------------------

Let's say we have a percpu structure:

        struct s {
                int n,m;
        };

        DEFINE_PER_CPU(struct s, p);


Operations on these fields are straightforward:

        this_cpu_inc(p.m)

        z = this_cpu_cmpxchg(p.m, 0, 1);


If we have an offset to struct s:

        struct s __percpu *ps = &p;

        this_cpu_dec(ps->m);

        z = this_cpu_inc_return(ps->n);


The calculation of the pointer may require the use of this_cpu_ptr()
if we do not make use of this_cpu ops later to manipulate fields:

        struct s *pp;

        pp = this_cpu_ptr(&p);

        pp->m--;

        z = pp->n++;


Variants of this_cpu ops
-------------------------

this_cpu ops are interrupt safe. Some architectures do not support
these per cpu local operations. In that case the operation must be
replaced by code that disables interrupts, then does the operations
that are guaranteed to be atomic and then re-enables interrupts. Doing
so is expensive. If there are other reasons why the scheduler cannot
change the processor we are executing on then there is no reason to
disable interrupts. For that purpose the following __this_cpu operations
are provided.

These operations have no guarantee against concurrent interrupts or
preemption. If a per cpu variable is not used in an interrupt context
and the scheduler cannot preempt, then they are safe. If any interrupts
still occur while an operation is in progress and if the interrupt too
modifies the variable, then RMW actions cannot be guaranteed to be
safe.

        __this_cpu_read(pcp)
        __this_cpu_write(pcp, val)
        __this_cpu_add(pcp, val)
        __this_cpu_and(pcp, val)
        __this_cpu_or(pcp, val)
        __this_cpu_add_return(pcp, val)
        __this_cpu_xchg(pcp, nval)
        __this_cpu_cmpxchg(pcp, oval, nval)
        __this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
        __this_cpu_sub(pcp, val)
        __this_cpu_inc(pcp)
        __this_cpu_dec(pcp)
        __this_cpu_sub_return(pcp, val)
        __this_cpu_inc_return(pcp)
        __this_cpu_dec_return(pcp)


A call like

        __this_cpu_inc(x)

will increment x and will not fall back to code that disables
interrupts on platforms that cannot accomplish atomicity through
address relocation and a Read-Modify-Write operation in the same
instruction.


&this_cpu_ptr(pp)->n vs this_cpu_ptr(&pp->n)
--------------------------------------------

The first operation takes the offset and forms an address and then
adds the offset of the n field. This may result in two add
instructions emitted by the compiler.

The second one first adds the two offsets and then does the
relocation. IMHO the second form looks cleaner and has an easier time
with (). The second form also is consistent with the way
this_cpu_read() and friends are used.


Remote access to per cpu data
------------------------------

Per cpu data structures are designed to be used by one cpu exclusively.
If you use the variables as intended, this_cpu_ops() are guaranteed to
be "atomic" as no other CPU has access to these data structures.

There are special cases where you might need to access per cpu data
structures remotely. It is usually safe to do a remote read access
and that is frequently done to summarize counters. Remote write access
is problematic because this_cpu ops do not have lock semantics. A
remote write may interfere with a this_cpu RMW operation.

Remote write accesses to percpu data structures are highly discouraged
unless absolutely necessary. Please consider using an IPI to wake up
the remote CPU and perform the update to its per cpu area.

To access per cpu data structures remotely, typically the per_cpu_ptr()
function is used:

        DEFINE_PER_CPU(struct data, datap);

        struct data *p = per_cpu_ptr(&datap, cpu);

This makes it explicit that we are getting ready to access a percpu
area remotely.

You can also do the following to convert the datap offset to an address:

        struct data *p = this_cpu_ptr(&datap);

but passing pointers calculated via this_cpu_ptr to other cpus is
unusual and should be avoided.

Remote accesses are typically only for reading the status of another
cpu's per cpu data. Write accesses can cause unique problems due to the
relaxed synchronization requirements for this_cpu operations.

One example that illustrates some concerns with write operations is
the following scenario that occurs because two per cpu variables
share a cache line but the relaxed synchronization is applied to
only one process updating the cache line.

Consider the following example:

        struct test {
                atomic_t a;
                int b;
        };

        DEFINE_PER_CPU(struct test, onecacheline);

There is some concern about what would happen if the field 'a' is updated
remotely from one processor and the local processor would use this_cpu ops
to update field b. Care should be taken that such simultaneous accesses to
data within the same cache line are avoided. Also costly synchronization
may be necessary. IPIs are generally recommended in such scenarios instead
of a remote write to the per cpu area of another processor.

Even in cases where the remote writes are rare, please bear in
mind that a remote write will evict the cache line from the processor
that most likely will access it. If the processor wakes up and finds a
missing local cache line of a per cpu area, its performance and hence
the wake up times will be affected.

Christoph Lameter, August 4th, 2014
Pranith Kumar, Aug 2nd, 2014