0001 .. Copyright 2001 Matthew Wilcox
0002 ..
0003 .. This documentation is free software; you can redistribute
0004 .. it and/or modify it under the terms of the GNU General Public
0005 .. License as published by the Free Software Foundation; either
0006 .. version 2 of the License, or (at your option) any later
0007 .. version.
0008
0009 ===============================
0010 Bus-Independent Device Accesses
0011 ===============================
0012
0013 :Author: Matthew Wilcox
0014 :Author: Alan Cox
0015
0016 Introduction
0017 ============
0018
0019 Linux provides an API which abstracts performing IO across all busses
0020 and devices, allowing device drivers to be written independently of bus
0021 type.
0022
0023 Memory Mapped IO
0024 ================
0025
0026 Getting Access to the Device
0027 ----------------------------
0028
0029 The most widely supported form of IO is memory mapped IO. That is, a
0030 part of the CPU's address space is interpreted not as accesses to
0031 memory, but as accesses to a device. Some architectures define devices
0032 to be at a fixed address, but most have some method of discovering
0033 devices. The PCI bus walk is a good example of such a scheme. This
document does not cover how to obtain such an address, but assumes you
are starting with one. Physical addresses are of type phys_addr_t; those
taken from a struct resource are of type resource_size_t.
0036
0037 This address should not be used directly. Instead, to get an address
0038 suitable for passing to the accessor functions described below, you
0039 should call ioremap(). An address suitable for accessing
0040 the device will be returned to you.
0041
0042 After you've finished using the device (say, in your module's exit
0043 routine), call iounmap() in order to return the address
0044 space to the kernel. Most architectures allocate new address space each
0045 time you call ioremap(), and they can run out unless you
0046 call iounmap().
0047
0048 Accessing the device
0049 --------------------
0050
0051 The part of the interface most used by drivers is reading and writing
0052 memory-mapped registers on the device. Linux provides interfaces to read
0053 and write 8-bit, 16-bit, 32-bit and 64-bit quantities. Due to a
0054 historical accident, these are named byte, word, long and quad accesses.
0055 Both read and write accesses are supported; there is no prefetch support
0056 at this time.
0057
The functions are named readb(), readw(), readl(), readq(), writeb(),
writew(), writel() and writeq(). Each has a ``_relaxed`` variant, such as
readl_relaxed() or writel_relaxed().
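
As a rough illustration of how these accessors are typically combined, the
sketch below performs a read-modify-write on a 32-bit control register. The
``foo_`` names, register offsets and bit definition are hypothetical. So that
the fragment can be compiled outside the kernel, readl() and writel() are
replaced by minimal user-space stand-ins (plain volatile accesses, with none of
the barriers or byte swapping of the real functions); a real driver would
include <linux/io.h> and pass an address obtained from ioremap().

```c
#include <stdint.h>

/* User-space stand-ins, NOT the kernel implementations. */
#define __iomem

static inline uint32_t readl(const volatile void __iomem *addr)
{
	return *(const volatile uint32_t *)addr;
}

static inline void writel(uint32_t value, volatile void __iomem *addr)
{
	*(volatile uint32_t *)addr = value;
}

/* Hypothetical register layout of an imaginary "foo" device. */
#define FOO_REG_CTRL	0x00
#define FOO_REG_STATUS	0x04
#define FOO_CTRL_ENABLE	(1u << 0)

/* Set the enable bit without disturbing the other control bits. */
static void foo_enable(void __iomem *base)
{
	uint8_t __iomem *regs = base;
	uint32_t ctrl;

	ctrl = readl(regs + FOO_REG_CTRL);
	ctrl |= FOO_CTRL_ENABLE;
	writel(ctrl, regs + FOO_REG_CTRL);
}
```

The token is only ever offset and handed to an accessor; it is never
dereferenced directly by the driver code.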
0061
0062 Some devices (such as framebuffers) would like to use larger transfers than
0063 8 bytes at a time. For these devices, the memcpy_toio(),
0064 memcpy_fromio() and memset_io() functions are
0065 provided. Do not use memset or memcpy on IO addresses; they are not
0066 guaranteed to copy data in order.
0067
The read and write functions are defined to be ordered. That is, the
compiler is not permitted to reorder the I/O sequence. Where the ordering
can be left to compiler optimisation, you can use __raw_readb() and friends
to indicate the relaxed ordering. Use these with care.
0072
While the basic functions are defined to be synchronous and ordered with
respect to each other, the busses the devices sit on may themselves be
asynchronous. In particular, many authors are burned by the fact that PCI bus
writes are posted asynchronously. A driver author must issue a read from the
same device to ensure that writes have occurred in the specific cases the
author cares about. This kind of property cannot be hidden from driver writers
in the API. In some cases, the read used to flush the device may be expected to
fail (if the card is resetting, for example). In that case, the read should be
done from config space, which is guaranteed to soft-fail if the card doesn't
respond.
0084
0085 The following is an example of flushing a write to a device when the
0086 driver would like to ensure the write's effects are visible prior to
0087 continuing execution::
0088
    static inline void
    qla1280_disable_intrs(struct scsi_qla_host *ha)
    {
        struct device_reg *reg;

        reg = ha->iobase;
        /* disable risc and host interrupts */
        WRT_REG_WORD(&reg->ictrl, 0);
        /*
         * The following read will ensure that the above write
         * has been received by the device before we return from this
         * function.
         */
        RD_REG_WORD(&reg->ictrl);
        ha->flags.ints_enabled = 0;
    }
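
Why the read-back works can be pictured with a toy model, given here purely as
an illustration and not as kernel code. All ``toy_`` names are hypothetical.
The model assumes a bus that queues (posts) writes and, as PCI ordering
requires, never lets a read pass the posted writes ahead of it, so a read
cannot return until the queue has drained to the device.

```c
#include <stdint.h>

#define TOY_QUEUE_DEPTH 4

/* Toy model: one device register behind a bus that posts writes. */
struct toy_bus {
	uint32_t device_reg;			/* value the device holds */
	uint32_t queue[TOY_QUEUE_DEPTH];	/* writes still in flight */
	unsigned int pending;
};

/* Deliver all queued writes to the device, oldest first. */
static void toy_drain(struct toy_bus *bus)
{
	unsigned int i;

	for (i = 0; i < bus->pending; i++)
		bus->device_reg = bus->queue[i];
	bus->pending = 0;
}

/* A posted write "completes" for the CPU as soon as it is queued. */
static void toy_write(struct toy_bus *bus, uint32_t value)
{
	if (bus->pending == TOY_QUEUE_DEPTH)
		toy_drain(bus);
	bus->queue[bus->pending++] = value;
}

/* A read may not pass posted writes, so it drains the queue first. */
static uint32_t toy_read(struct toy_bus *bus)
{
	toy_drain(bus);
	return bus->device_reg;
}
```

In this model, the device register is unchanged immediately after
toy_write() returns and only takes the new value once toy_read() has been
called, which mirrors the write followed by a dummy read in the driver above.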
0105
0106 PCI ordering rules also guarantee that PIO read responses arrive after any
0107 outstanding DMA writes from that bus, since for some devices the result of
0108 a readb() call may signal to the driver that a DMA transaction is
0109 complete. In many cases, however, the driver may want to indicate that the
0110 next readb() call has no relation to any previous DMA writes
0111 performed by the device. The driver can use readb_relaxed() for
0112 these cases, although only some platforms will honor the relaxed
0113 semantics. Using the relaxed read functions will provide significant
0114 performance benefits on platforms that support it. The qla2xxx driver
0115 provides examples of how to use readX_relaxed(). In many cases, a majority
0116 of the driver's readX() calls can safely be converted to readX_relaxed()
0117 calls, since only a few will indicate or depend on DMA completion.
0118
0119 Port Space Accesses
0120 ===================
0121
0122 Port Space Explained
0123 --------------------
0124
0125 Another form of IO commonly supported is Port Space. This is a range of
0126 addresses separate to the normal memory address space. Access to these
0127 addresses is generally not as fast as accesses to the memory mapped
0128 addresses, and it also has a potentially smaller address space.
0129
0130 Unlike memory mapped IO, no preparation is required to access port
0131 space.
0132
0133 Accessing Port Space
0134 --------------------
0135
0136 Accesses to this space are provided through a set of functions which
0137 allow 8-bit, 16-bit and 32-bit accesses; also known as byte, word and
0138 long. These functions are inb(), inw(),
0139 inl(), outb(), outw() and
0140 outl().
0141
Some variants are provided for these functions. Some devices require
that accesses to their ports are slowed down. This functionality is
provided by appending a ``_p`` to the end of the function name.
There are also equivalents to memcpy. The insb(), insw() and insl()
functions read a sequence of bytes, words or longs from the given port
into a buffer, and outsb(), outsw() and outsl() write a buffer to the
given port.
0148
0149 __iomem pointer tokens
0150 ======================
0151
0152 The data type for an MMIO address is an ``__iomem`` qualified pointer, such as
0153 ``void __iomem *reg``. On most architectures it is a regular pointer that
0154 points to a virtual memory address and can be offset or dereferenced, but in
portable code, it must only be passed from and to functions that explicitly
operate on an ``__iomem`` token, in particular the ioremap() and
0157 readl()/writel() functions. The 'sparse' semantic code checker can be used to
0158 verify that this is done correctly.
0159
0160 While on most architectures, ioremap() creates a page table entry for an
0161 uncached virtual address pointing to the physical MMIO address, some
0162 architectures require special instructions for MMIO, and the ``__iomem`` pointer
0163 just encodes the physical address or an offsettable cookie that is interpreted
0164 by readl()/writel().
0165
0166 Differences between I/O access functions
0167 ========================================
0168
0169 readq(), readl(), readw(), readb(), writeq(), writel(), writew(), writeb()
0170
0171 These are the most generic accessors, providing serialization against other
0172 MMIO accesses and DMA accesses as well as fixed endianness for accessing
0173 little-endian PCI devices and on-chip peripherals. Portable device drivers
0174 should generally use these for any access to ``__iomem`` pointers.
0175
Note that posted writes are not strictly ordered against a spinlock; see
Documentation/driver-api/io_ordering.rst.
0178
0179 readq_relaxed(), readl_relaxed(), readw_relaxed(), readb_relaxed(),
0180 writeq_relaxed(), writel_relaxed(), writew_relaxed(), writeb_relaxed()
0181
0182 On architectures that require an expensive barrier for serializing against
0183 DMA, these "relaxed" versions of the MMIO accessors only serialize against
0184 each other, but contain a less expensive barrier operation. A device driver
0185 might use these in a particularly performance sensitive fast path, with a
0186 comment that explains why the usage in a specific location is safe without
0187 the extra barriers.
0188
0189 See memory-barriers.txt for a more detailed discussion on the precise ordering
0190 guarantees of the non-relaxed and relaxed versions.
0191
0192 ioread64(), ioread32(), ioread16(), ioread8(),
0193 iowrite64(), iowrite32(), iowrite16(), iowrite8()
0194
0195 These are an alternative to the normal readl()/writel() functions, with almost
0196 identical behavior, but they can also operate on ``__iomem`` tokens returned
0197 for mapping PCI I/O space with pci_iomap() or ioport_map(). On architectures
0198 that require special instructions for I/O port access, this adds a small
0199 overhead for an indirect function call implemented in lib/iomap.c, while on
0200 other architectures, these are simply aliases.
0201
0202 ioread64be(), ioread32be(), ioread16be()
0203 iowrite64be(), iowrite32be(), iowrite16be()
0204
0205 These behave in the same way as the ioread32()/iowrite32() family, but with
0206 reversed byte order, for accessing devices with big-endian MMIO registers.
0207 Device drivers that can operate on either big-endian or little-endian
0208 registers may have to implement a custom wrapper function that picks one or
0209 the other depending on which device was found.
0210
0211 Note: On some architectures, the normal readl()/writel() functions
0212 traditionally assume that devices are the same endianness as the CPU, while
0213 using a hardware byte-reverse on the PCI bus when running a big-endian kernel.
0214 Drivers that use readl()/writel() this way are generally not portable, but
0215 tend to be limited to a particular SoC.
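
A wrapper of the kind described above might look like the following sketch.
The ``foo_dev`` structure, its ``big_endian`` flag and foo_read32() are
hypothetical names. To keep the fragment compilable outside the kernel,
ioread32() and ioread32be() are replaced by stand-ins that assemble the value
byte by byte, which makes the little-endian versus big-endian interpretation
explicit and independent of the host CPU.

```c
#include <stdbool.h>
#include <stdint.h>

/* User-space stand-ins, NOT the kernel implementations. */
#define __iomem

/* Interpret four bytes at addr as a little-endian register. */
static unsigned int ioread32(const void __iomem *addr)
{
	const volatile uint8_t *p = addr;

	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Interpret four bytes at addr as a big-endian register. */
static unsigned int ioread32be(const void __iomem *addr)
{
	const volatile uint8_t *p = addr;

	return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
	       (uint32_t)p[2] << 8 | (uint32_t)p[3];
}

/* Hypothetical per-device state; the flag would be set at probe time. */
struct foo_dev {
	void __iomem *regs;
	bool big_endian;
};

/* Pick the accessor that matches the register byte order of this device. */
static unsigned int foo_read32(const struct foo_dev *foo, unsigned int offset)
{
	const uint8_t __iomem *addr = (const uint8_t __iomem *)foo->regs + offset;

	return foo->big_endian ? ioread32be(addr) : ioread32(addr);
}
```

Keeping the decision in one helper means the rest of the driver can call
foo_read32() without knowing which variant of the device was found.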
0216
0217 hi_lo_readq(), lo_hi_readq(), hi_lo_readq_relaxed(), lo_hi_readq_relaxed(),
0218 ioread64_lo_hi(), ioread64_hi_lo(), ioread64be_lo_hi(), ioread64be_hi_lo(),
0219 hi_lo_writeq(), lo_hi_writeq(), hi_lo_writeq_relaxed(), lo_hi_writeq_relaxed(),
0220 iowrite64_lo_hi(), iowrite64_hi_lo(), iowrite64be_lo_hi(), iowrite64be_hi_lo()
0221
0222 Some device drivers have 64-bit registers that cannot be accessed atomically
0223 on 32-bit architectures but allow two consecutive 32-bit accesses instead.
0224 Since it depends on the particular device which of the two halves has to be
0225 accessed first, a helper is provided for each combination of 64-bit accessors
0226 with either low/high or high/low word ordering. A device driver must include
0227 either <linux/io-64-nonatomic-lo-hi.h> or <linux/io-64-nonatomic-hi-lo.h> to
0228 get the function definitions along with helpers that redirect the normal
0229 readq()/writeq() to them on architectures that do not provide 64-bit access
0230 natively.
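
The idea behind these helpers can be sketched as two 32-bit reads combined
into one 64-bit value, differing only in which half is read first. The
``sketch_`` functions below are illustrative and are not the kernel
definitions; readl() is a user-space stand-in (a plain volatile load), and the
low half is assumed to live at the lower address.

```c
#include <stdint.h>

/* User-space stand-in, NOT the kernel implementation. */
#define __iomem

static uint32_t readl(const volatile void __iomem *addr)
{
	return *(const volatile uint32_t *)addr;
}

/* Read the low 32 bits first, then the high 32 bits. */
static uint64_t sketch_lo_hi_readq(const volatile void __iomem *addr)
{
	const volatile uint32_t __iomem *p = addr;
	uint32_t low, high;

	low = readl(p);
	high = readl(p + 1);

	return low + ((uint64_t)high << 32);
}

/* Read the high 32 bits first, then the low 32 bits. */
static uint64_t sketch_hi_lo_readq(const volatile void __iomem *addr)
{
	const volatile uint32_t __iomem *p = addr;
	uint32_t low, high;

	high = readl(p + 1);
	low = readl(p);

	return low + ((uint64_t)high << 32);
}
```

Both functions return the same value for a register that does not change
between the two reads. The order matters for devices that latch or clear
state when one particular half is accessed, which is why the driver has to
choose the header that matches its hardware.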
0231
0232 __raw_readq(), __raw_readl(), __raw_readw(), __raw_readb(),
0233 __raw_writeq(), __raw_writel(), __raw_writew(), __raw_writeb()
0234
0235 These are low-level MMIO accessors without barriers or byteorder changes and
0236 architecture specific behavior. Accesses are usually atomic in the sense that
0237 a four-byte __raw_readl() does not get split into individual byte loads, but
0238 multiple consecutive accesses can be combined on the bus. In portable code, it
0239 is only safe to use these to access memory behind a device bus but not MMIO
0240 registers, as there are no ordering guarantees with regard to other MMIO
0241 accesses or even spinlocks. The byte order is generally the same as for normal
0242 memory, so unlike the other functions, these can be used to copy data between
0243 kernel memory and device memory.
0244
0245 inl(), inw(), inb(), outl(), outw(), outb()
0246
0247 PCI I/O port resources traditionally require separate helpers as they are
0248 implemented using special instructions on the x86 architecture. On most other
0249 architectures, these are mapped to readl()/writel() style accessors
0250 internally, usually pointing to a fixed area in virtual memory. Instead of an
0251 ``__iomem`` pointer, the address is a 32-bit integer token to identify a port
0252 number. PCI requires I/O port access to be non-posted, meaning that an outb()
0253 must complete before the following code executes, while a normal writeb() may
0254 still be in progress. On architectures that correctly implement this, I/O port
0255 access is therefore ordered against spinlocks. Many non-x86 PCI host bridge
0256 implementations and CPU architectures however fail to implement non-posted I/O
0257 space on PCI, so they can end up being posted on such hardware.
0258
0259 In some architectures, the I/O port number space has a 1:1 mapping to
0260 ``__iomem`` pointers, but this is not recommended and device drivers should
0261 not rely on that for portability. Similarly, an I/O port number as described
0262 in a PCI base address register may not correspond to the port number as seen
0263 by a device driver. Portable drivers need to read the port number for the
0264 resource provided by the kernel.
0265
0266 There are no direct 64-bit I/O port accessors, but pci_iomap() in combination
0267 with ioread64/iowrite64 can be used instead.
0268
0269 inl_p(), inw_p(), inb_p(), outl_p(), outw_p(), outb_p()
0270
0271 On ISA devices that require specific timing, the _p versions of the I/O
0272 accessors add a small delay. On architectures that do not have ISA buses,
0273 these are aliases to the normal inb/outb helpers.
0274
0275 readsq, readsl, readsw, readsb
0276 writesq, writesl, writesw, writesb
0277 ioread64_rep, ioread32_rep, ioread16_rep, ioread8_rep
0278 iowrite64_rep, iowrite32_rep, iowrite16_rep, iowrite8_rep
0279 insl, insw, insb, outsl, outsw, outsb
0280
0281 These are helpers that access the same address multiple times, usually to copy
0282 data between kernel memory byte stream and a FIFO buffer. Unlike the normal
0283 MMIO accessors, these do not perform a byteswap on big-endian kernels, so the
0284 first byte in the FIFO register corresponds to the first byte in the memory
0285 buffer regardless of the architecture.
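
The shape of such a helper can be illustrated with a toy FIFO, given here
only as a sketch. The ``toy_`` names are hypothetical and the FIFO is an
ordinary structure rather than a device register; the point is that one
location is read repeatedly and the bytes are stored in arrival order, with no
byte swapping applied.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy FIFO: every read of the data register pops the next byte. */
struct toy_fifo {
	const uint8_t *data;
	size_t len;
	size_t pos;
};

/* Model of a single byte read from the FIFO data register. */
static uint8_t toy_fifo_readb(struct toy_fifo *fifo)
{
	/* An empty FIFO is modelled as returning 0xff. */
	return fifo->pos < fifo->len ? fifo->data[fifo->pos++] : 0xff;
}

/*
 * Same shape as readsb(): read one location count times and store the
 * bytes in the order they arrive.
 */
static void toy_readsb(struct toy_fifo *fifo, void *buffer, unsigned int count)
{
	uint8_t *buf = buffer;

	while (count--)
		*buf++ = toy_fifo_readb(fifo);
}
```

Because the data is treated as a byte stream, the first byte popped from the
FIFO always ends up in the first byte of the buffer.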
0286
0287 Device memory mapping modes
0288 ===========================
0289
0290 Some architectures support multiple modes for mapping device memory.
0291 ioremap_*() variants provide a common abstraction around these
0292 architecture-specific modes, with a shared set of semantics.
0293
0294 ioremap() is the most common mapping type, and is applicable to typical device
0295 memory (e.g. I/O registers). Other modes can offer weaker or stronger
0296 guarantees, if supported by the architecture. From most to least common, they
0297 are as follows:
0298
0299 ioremap()
0300 ---------
0301
0302 The default mode, suitable for most memory-mapped devices, e.g. control
0303 registers. Memory mapped using ioremap() has the following characteristics:
0304
* Uncached - CPU-side caches are bypassed, and all reads and writes are handled
  directly by the device.
* No speculative operations - the CPU may not issue a read or write to this
  memory, unless the instruction that does so has been reached in committed
  program flow.
* No reordering - The CPU may not reorder accesses to this memory mapping with
  respect to each other. On some architectures, this relies on barriers in
  readl_relaxed()/writel_relaxed().
* No repetition - The CPU may not issue multiple reads or writes for a single
  program instruction.
* No write-combining - Each I/O operation results in one discrete read or write
  being issued to the device, and multiple writes are not combined into larger
  writes. This may or may not be enforced when using __raw I/O accessors or
  pointer dereferences.
* Non-executable - The CPU is not allowed to speculate instruction execution
  from this memory (it probably goes without saying, but you're also not
  allowed to jump into device memory).
0322
0323 On many platforms and buses (e.g. PCI), writes issued through ioremap()
0324 mappings are posted, which means that the CPU does not wait for the write to
0325 actually reach the target device before retiring the write instruction.
0326
0327 On many platforms, I/O accesses must be aligned with respect to the access
0328 size; failure to do so will result in an exception or unpredictable results.
0329
0330 ioremap_wc()
0331 ------------
0332
0333 Maps I/O memory as normal memory with write combining. Unlike ioremap(),
0334
* The CPU may speculatively issue reads from the device that the program
  didn't actually execute, and may choose to basically read whatever it wants.
* The CPU may reorder operations as long as the result is consistent from the
  program's point of view.
* The CPU may write to the same location multiple times, even when the program
  issued a single write.
* The CPU may combine several writes into a single larger write.
0342
0343 This mode is typically used for video framebuffers, where it can increase
0344 performance of writes. It can also be used for other blocks of memory in
0345 devices (e.g. buffers or shared memory), but care must be taken as accesses are
0346 not guaranteed to be ordered with respect to normal ioremap() MMIO register
0347 accesses without explicit barriers.
0348
0349 On a PCI bus, it is usually safe to use ioremap_wc() on MMIO areas marked as
0350 ``IORESOURCE_PREFETCH``, but it may not be used on those without the flag.
0351 For on-chip devices, there is no corresponding flag, but a driver can use
0352 ioremap_wc() on a device that is known to be safe.
0353
0354 ioremap_wt()
0355 ------------
0356
0357 Maps I/O memory as normal memory with write-through caching. Like ioremap_wc(),
0358 but also,
0359
* The CPU may cache writes issued to and reads from the device, and serve reads
  from that cache.
0362
0363 This mode is sometimes used for video framebuffers, where drivers still expect
0364 writes to reach the device in a timely manner (and not be stuck in the CPU
0365 cache), but reads may be served from the cache for efficiency. However, it is
0366 rarely useful these days, as framebuffer drivers usually perform writes only,
0367 for which ioremap_wc() is more efficient (as it doesn't needlessly trash the
0368 cache). Most drivers should not use this.
0369
0370 ioremap_np()
0371 ------------
0372
0373 Like ioremap(), but explicitly requests non-posted write semantics. On some
0374 architectures and buses, ioremap() mappings have posted write semantics, which
0375 means that writes can appear to "complete" from the point of view of the
0376 CPU before the written data actually arrives at the target device. Writes are
0377 still ordered with respect to other writes and reads from the same device, but
0378 due to the posted write semantics, this is not the case with respect to other
0379 devices. ioremap_np() explicitly requests non-posted semantics, which means
0380 that the write instruction will not appear to complete until the device has
0381 received (and to some platform-specific extent acknowledged) the written data.
0382
0383 This mapping mode primarily exists to cater for platforms with bus fabrics that
0384 require this particular mapping mode to work correctly. These platforms set the
0385 ``IORESOURCE_MEM_NONPOSTED`` flag for a resource that requires ioremap_np()
0386 semantics and portable drivers should use an abstraction that automatically
0387 selects it where appropriate (see the `Higher-level ioremap abstractions`_
0388 section below).
0389
0390 The bare ioremap_np() is only available on some architectures; on others, it
0391 always returns NULL. Drivers should not normally use it, unless they are
0392 platform-specific or they derive benefit from non-posted writes where
0393 supported, and can fall back to ioremap() otherwise. The normal approach to
0394 ensure posted write completion is to do a dummy read after a write as
0395 explained in `Accessing the device`_, which works with ioremap() on all
0396 platforms.
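
The fallback described above could be structured as in the following sketch.
foo_map_regs() and its ``nonposted`` output parameter are hypothetical, and
both mapping functions are user-space stand-ins so that the fragment compiles
outside the kernel: ioremap_np() here always returns NULL, modelling an
architecture without non-posted mappings, and ioremap() simply hands out a
static buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* User-space stand-ins, NOT the kernel implementations. */
#define __iomem
typedef uint64_t phys_addr_t;

static uint8_t toy_window[64];	/* pretend MMIO window */

/* Models an architecture that has no non-posted mapping mode. */
static void __iomem *ioremap_np(phys_addr_t offset, size_t size)
{
	(void)offset;
	(void)size;
	return NULL;
}

/* Models a mapping that succeeds if the request fits the toy window. */
static void __iomem *ioremap(phys_addr_t offset, size_t size)
{
	(void)offset;
	return size <= sizeof(toy_window) ? toy_window : NULL;
}

/*
 * Prefer a non-posted mapping and fall back to a normal one. *nonposted
 * tells the caller whether writes still need a read-back to be flushed.
 */
static void __iomem *foo_map_regs(phys_addr_t phys, size_t size, int *nonposted)
{
	void __iomem *base = ioremap_np(phys, size);

	*nonposted = base != NULL;
	if (!base)
		base = ioremap(phys, size);

	return base;
}
```

A driver written this way must stay correct in the fallback case, so any
write whose completion matters still needs the dummy read when ``nonposted``
is zero.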
0397
0398 ioremap_np() should never be used for PCI drivers. PCI memory space writes are
0399 always posted, even on architectures that otherwise implement ioremap_np().
0400 Using ioremap_np() for PCI BARs will at best result in posted write semantics,
0401 and at worst result in complete breakage.
0402
0403 Note that non-posted write semantics are orthogonal to CPU-side ordering
0404 guarantees. A CPU may still choose to issue other reads or writes before a
0405 non-posted write instruction retires. See the previous section on MMIO access
0406 functions for details on the CPU side of things.
0407
0408 ioremap_uc()
0409 ------------
0410
0411 ioremap_uc() behaves like ioremap() except that on the x86 architecture without
0412 'PAT' mode, it marks memory as uncached even when the MTRR has designated
0413 it as cacheable, see Documentation/x86/pat.rst.
0414
0415 Portable drivers should avoid the use of ioremap_uc().
0416
0417 ioremap_cache()
0418 ---------------
0419
0420 ioremap_cache() effectively maps I/O memory as normal RAM. CPU write-back
0421 caches can be used, and the CPU is free to treat the device as if it were a
0422 block of RAM. This should never be used for device memory which has side
0423 effects of any kind, or which does not return the data previously written on
0424 read.
0425
0426 It should also not be used for actual RAM, as the returned pointer is an
0427 ``__iomem`` token. memremap() can be used for mapping normal RAM that is outside
0428 of the linear kernel memory area to a regular pointer.
0429
0430 Portable drivers should avoid the use of ioremap_cache().
0431
0432 Architecture example
0433 --------------------
0434
0435 Here is how the above modes map to memory attribute settings on the ARM64
0436 architecture:
0437
+------------------------+--------------------------------------------+
| API                    | Memory region type and cacheability        |
+------------------------+--------------------------------------------+
| ioremap_np()           | Device-nGnRnE                              |
+------------------------+--------------------------------------------+
| ioremap()              | Device-nGnRE                               |
+------------------------+--------------------------------------------+
| ioremap_uc()           | (not implemented)                          |
+------------------------+--------------------------------------------+
| ioremap_wc()           | Normal-Non Cacheable                       |
+------------------------+--------------------------------------------+
| ioremap_wt()           | (not implemented; fallback to ioremap)     |
+------------------------+--------------------------------------------+
| ioremap_cache()        | Normal-Write-Back Cacheable                |
+------------------------+--------------------------------------------+
0453
0454 Higher-level ioremap abstractions
0455 =================================
0456
0457 Instead of using the above raw ioremap() modes, drivers are encouraged to use
0458 higher-level APIs. These APIs may implement platform-specific logic to
0459 automatically choose an appropriate ioremap mode on any given bus, allowing for
0460 a platform-agnostic driver to work on those platforms without any special
0461 cases. At the time of this writing, the following ioremap() wrappers have such
0462 logic:
0463
0464 devm_ioremap_resource()
0465
Can automatically select ioremap_np() over ioremap() according to platform
requirements, if the ``IORESOURCE_MEM_NONPOSTED`` flag is set on the struct
resource. Uses devres to automatically unmap the resource when the driver
probe() function fails or a device is unbound from its driver.
0470
0471 Documented in Documentation/driver-api/driver-model/devres.rst.
0472
0473 of_address_to_resource()
0474
0475 Automatically sets the ``IORESOURCE_MEM_NONPOSTED`` flag for platforms that
0476 require non-posted writes for certain buses (see the nonposted-mmio and
0477 posted-mmio device tree properties).
0478
0479 of_iomap()
0480
0481 Maps the resource described in a ``reg`` property in the device tree, doing
0482 all required translations. Automatically selects ioremap_np() according to
0483 platform requirements, as above.
0484
0485 pci_ioremap_bar(), pci_ioremap_wc_bar()
0486
0487 Maps the resource described in a PCI base address without having to extract
0488 the physical address first.
0489
0490 pci_iomap(), pci_iomap_wc()
0491
Like pci_ioremap_bar()/pci_ioremap_wc_bar(), but also works on I/O space when
used together with ioread32()/iowrite32() and similar accessors.
0494
0495 pcim_iomap()
0496
Like pci_iomap(), but uses devres to automatically unmap the resource when
the driver probe() function fails or a device is unbound from its driver.
0499
0500 Documented in Documentation/driver-api/driver-model/devres.rst.
0501
0502 Not using these wrappers may make drivers unusable on certain platforms with
0503 stricter rules for mapping I/O memory.
0504
0505 Generalizing Access to System and I/O Memory
0506 ============================================
0507
.. kernel-doc:: include/linux/iosys-map.h
   :doc: overview

.. kernel-doc:: include/linux/iosys-map.h
   :internal:
0513
0514 Public Functions Provided
0515 =========================
0516
.. kernel-doc:: arch/x86/include/asm/io.h
   :internal:

.. kernel-doc:: lib/pci_iomap.c
   :export: