Error Detection And Correction (EDAC) Devices
=============================================

Main Concepts used in the EDAC subsystem
----------------------------------------

There are several things to be aware of that aren't at all obvious, like
*sockets*, *socket sets*, *banks*, *rows*, *chip-select rows*, *channels*,
etc.

These are some of the many terms that are thrown about that don't always
mean what people think they mean (Inconceivable!).  In the interest of
creating a common ground for discussion, terms and their definitions
will be established.

* Memory devices

The individual DRAM chips on a memory stick.  These devices commonly
output 4 or 8 bits each (x4, x8). Grouping several of these in parallel
provides the number of bits that the memory controller expects:
typically 72 bits, in order to provide 64 bits + 8 bits of ECC data.

* Memory Stick

A printed circuit board that aggregates multiple memory devices in
parallel.  In general, this is the Field Replaceable Unit (FRU) which
gets replaced in the case of excessive errors. Most often it is also
called a DIMM (Dual Inline Memory Module).

* Memory Socket

A physical connector on the motherboard that accepts a single memory
stick. Also called a "slot" in several datasheets.

* Channel

A memory controller channel, responsible for communicating with a group
of DIMMs. Each channel has its own independent control (command) and data
bus, and can be used independently or grouped with other channels.

* Branch

It is typically the highest level of the hierarchy on a Fully-Buffered
DIMM memory controller. Typically, it contains two channels. Two channels
on the same branch can be used in single mode or in lockstep mode. When
lockstep is enabled, the cacheline is doubled, but it generally brings
some performance penalty. Also, it is generally not possible to point to
just one memory stick when an error occurs, as the error correction code
is calculated using two DIMMs instead of one. For the same reason,
lockstep mode is capable of correcting more errors than single mode.

* Single-channel

The data accessed by the memory controller is contained in one DIMM
only. E.g. if the data is 64 bits wide, the data flows to the CPU using
one 64-bit parallel access. Typically used with SDR, DDR, DDR2 and DDR3
memories. FB-DIMM and RAMBUS use a different concept for channel, so
this concept doesn't apply there.

* Double-channel

The data accessed by the memory controller is interleaved across two
DIMMs, accessed at the same time. E.g. if the DIMM is 64 bits wide (72
bits with ECC), the data flows to the CPU using a 128-bit parallel
access.

* Chip-select row

This is the name of the DRAM signal used to select the DRAM ranks to be
accessed. Common chip-select rows are 64 bits wide for single channel
and 128 bits wide for dual channel. It may not be visible to the memory
controller, as some DIMM types have a memory buffer that can hide direct
access to it from the Memory Controller.

* Single-Ranked stick

A single-ranked stick has 1 chip-select row of memory. Motherboards
commonly drive two chip-select pins to a memory stick. A single-ranked
stick will occupy only one of those rows. The other will be unused.

.. _doubleranked:

* Double-Ranked stick

A double-ranked stick has two chip-select rows which access different
sets of memory devices.  The two rows cannot be accessed concurrently.

* Double-sided stick

**DEPRECATED TERM**, see :ref:`Double-Ranked stick <doubleranked>`.

A double-sided stick has two chip-select rows which access different sets
of memory devices. The two rows cannot be accessed concurrently.
"Double-sided" is irrespective of the memory devices being mounted on
both sides of the memory stick.

* Socket set

All of the memory sticks that are required for a single memory access or
all of the memory sticks spanned by a chip-select row.  A single socket
set has two chip-select rows and, if double-sided sticks are used, these
will occupy those chip-select rows.

* Bank

This term is avoided because it is unclear when needing to distinguish
between chip-select rows and socket sets.
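
The width arithmetic behind the *Memory devices*, *Single-channel* and
*Double-channel* definitions above can be sketched in a few lines of Python.
This is only an illustration of the numbers quoted in those definitions; the
function and constant names are hypothetical and do not come from the EDAC
code.

.. code-block:: python

    DATA_BITS = 64  # data bits per DIMM, as quoted in the definitions above
    ECC_BITS = 8    # ECC bits per DIMM, as quoted in the definitions above


    def devices_in_parallel(bus_bits, device_bits):
        """Number of DRAM chips grouped in parallel to fill the bus."""
        if bus_bits % device_bits:
            raise ValueError("bus width is not a multiple of the device width")
        return bus_bits // device_bits


    def access_width(dimms, with_ecc=False):
        """Bits moved in one parallel access spanning `dimms` DIMMs."""
        per_dimm = DATA_BITS + (ECC_BITS if with_ecc else 0)
        return dimms * per_dimm


    if __name__ == "__main__":
        bus = DATA_BITS + ECC_BITS           # the 72 bits the controller expects
        print(devices_in_parallel(bus, 4))   # x4 devices -> 18 chips
        print(devices_in_parallel(bus, 8))   # x8 devices -> 9 chips
        print(access_width(1))               # single-channel -> 64 bits
        print(access_width(2))               # double-channel -> 128 bits

With ECC included, the same double-channel access spans 2 x 72 = 144 bits.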


Memory Controllers
------------------

Most of the EDAC core is focused on doing Memory Controller error detection.
Memory controllers are allocated via :c:func:`edac_mc_alloc`. Internally, it
uses the struct ``mem_ctl_info`` to describe the memory controllers, which is
an opaque struct for the EDAC drivers. Only the EDAC core is allowed to
touch it.

.. kernel-doc:: include/linux/edac.h

.. kernel-doc:: drivers/edac/edac_mc.h

PCI Controllers
---------------

The EDAC subsystem provides a mechanism to handle PCI controllers by calling
:c:func:`edac_pci_alloc_ctl_info`. It uses the struct
:c:type:`edac_pci_ctl_info` to describe the PCI controllers.

.. kernel-doc:: drivers/edac/edac_pci.h

EDAC Blocks
-----------

The EDAC subsystem also provides a generic mechanism to report errors on
other parts of the hardware via the :c:func:`edac_device_alloc_ctl_info`
function.

The structures :c:type:`edac_dev_sysfs_block_attribute`,
:c:type:`edac_device_block`, :c:type:`edac_device_instance` and
:c:type:`edac_device_ctl_info` provide a generic or abstract 'edac_device'
representation at sysfs.

This set of structures, and the code that implements the APIs for them,
provides for registering EDAC-type devices which are NOT standard memory or
PCI, like:

- CPU caches (L1 and L2)
- DMA engines
- Core CPU switches
- Fabric switch units
- PCIe interface controllers
- other EDAC/ECC type devices that can be monitored for
  errors, etc.

It allows for a two-level hierarchy.

For example, a cache could be composed of L1, L2 and L3 levels of cache.
Each CPU core would have its own L1 cache, while sharing L2 and maybe L3
caches. In that case, those can be represented via the following sysfs
nodes::

        /sys/devices/system/edac/..

        pci/            <existing pci directory (if available)>
        mc/             <existing memory device directory>
        cpu/cpu0/..     <L1 and L2 block directory>
                /L1-cache/ce_count
                         /ue_count
                /L2-cache/ce_count
                         /ue_count
        cpu/cpu1/..     <L1 and L2 block directory>
                /L1-cache/ce_count
                         /ue_count
                /L2-cache/ce_count
                         /ue_count
        ...

        the L1 and L2 directories would be "edac_device_block's"
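
From user space, a tree shaped like the one above could be walked with a
short Python helper. This is a minimal sketch that assumes the
``<device>/<instance>/<block>/{ce_count,ue_count}`` layout of the example;
the actual instance and block names depend on the driver, and
``read_block_counters`` is a hypothetical name, not part of any EDAC tool.
Calling ``read_block_counters()`` with no arguments looks under
``/sys/devices/system/edac/cpu`` and returns an empty dict if that directory
does not exist.

.. code-block:: python

    from pathlib import Path

    # Root of the EDAC sysfs tree, as shown in the listing above.
    EDAC_ROOT = Path("/sys/devices/system/edac")


    def read_block_counters(root=EDAC_ROOT, device="cpu"):
        """Return {(instance, block): {"ce_count": n, "ue_count": m}}.

        Assumes the <device>/<instance>/<block>/ layout of the example tree.
        Directories that carry neither counter file are skipped.
        """
        counters = {}
        dev_dir = Path(root) / device
        if not dev_dir.is_dir():
            return counters
        for instance in sorted(p for p in dev_dir.iterdir() if p.is_dir()):
            for block in sorted(p for p in instance.iterdir() if p.is_dir()):
                values = {}
                for name in ("ce_count", "ue_count"):
                    counter = block / name
                    if counter.is_file():
                        values[name] = int(counter.read_text().strip())
                if values:
                    counters[(instance.name, block.name)] = values
        return counters


    if __name__ == "__main__":
        for (instance, block), values in read_block_counters().items():
            print(instance, block, values)
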

.. kernel-doc:: drivers/edac/edac_device.h