.. SPDX-License-Identifier: GPL-2.0

===========================
The Spidernet Device Driver
===========================

Written by Linas Vepstas <linas@austin.ibm.com>

Version of 7 June 2007

Abstract
========
This document sketches the structure of portions of the spidernet
device driver in the Linux kernel tree. The spidernet is a gigabit
ethernet device built into the Toshiba southbridge commonly used
in the SONY Playstation 3 and the IBM QS20 Cell blade.

The Structure of the RX Ring.
=============================
The receive (RX) ring is a circular linked list of RX descriptors,
together with three pointers into the ring that are used to manage its
contents.

The elements of the ring are called "descriptors" or "descrs"; they
describe the received data. This includes a pointer to a buffer
containing the received data, the buffer size, and various status bits.

There are three primary states that a descriptor can be in: "empty",
"full" and "not-in-use". An "empty" or "ready" descriptor is ready
to receive data from the hardware. A "full" descriptor has data in it,
and is waiting to be emptied and processed by the OS. A "not-in-use"
descriptor is neither empty nor full; it is simply not ready. It may
not even have a data buffer attached, or it may be otherwise unusable.

On device startup, the OS (specifically, the spidernet device driver)
allocates a set of RX descriptors and RX buffers. These are all marked
"empty", ready to receive data. This ring is handed off to the
hardware, which sequentially fills in the buffers, and marks them
"full". The OS follows up, taking the full buffers, processing them,
and eventually re-marking them "empty".
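
As a rough illustration, the ring and its three pointers can be modelled
in plain C as follows. This is a user-space sketch only: the toy_* names
and the ring size of 8 are invented for this example, the buffers are
ordinary arrays rather than DMA-mapped memory, and the driver's own
structures differ.

.. code-block:: c

    #include <stddef.h>

    #define TOY_RING_SIZE 8    /* hypothetical; real rings are larger */

    enum toy_state { TOY_EMPTY, TOY_FULL, TOY_NOT_IN_USE };

    struct toy_descr {
        struct toy_descr *next;     /* circular link to the next descr */
        unsigned char buf[64];      /* stand-in for the RX data buffer */
        size_t buf_size;
        enum toy_state state;
    };

    struct toy_ring {
        struct toy_descr descr[TOY_RING_SIZE];
        struct toy_descr *head;     /* OS: next descr to refill */
        struct toy_descr *tail;     /* OS: next descr to process */
        struct toy_descr *hw_curr;  /* stand-in for GDACTDPA */
    };

    /* Link the descrs into a circle and mark every one "empty". */
    static void toy_ring_init(struct toy_ring *ring)
    {
        size_t i;

        for (i = 0; i < TOY_RING_SIZE; i++) {
            struct toy_descr *d = &ring->descr[i];

            d->next = &ring->descr[(i + 1) % TOY_RING_SIZE];
            d->buf_size = sizeof(d->buf);
            d->state = TOY_EMPTY;
        }
        ring->head = ring->tail = ring->hw_curr = &ring->descr[0];
    }

After toy_ring_init() all three pointers coincide and every descr is
"empty", which is the state in which the ring is handed to the hardware.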

This filling and emptying is managed by three pointers: the "head"
and "tail" pointers, managed by the OS, and a hardware current
descriptor pointer (GDACTDPA). The GDACTDPA points at the descr
currently being filled. When this descr is filled, the hardware
marks it full, and advances the GDACTDPA by one. Thus, when there is
flowing RX traffic, every descr behind it should be marked "full",
and everything in front of it should be "empty". If the hardware
discovers that the current descr is not empty, it will signal an
interrupt, and halt processing.
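
The hardware side of this protocol can be mimicked in a few lines of
user-space C. What follows is a model, not driver code: the ring is
reduced to an array of states, hw_curr stands in for GDACTDPA, and all
toy_* names are hypothetical.

.. code-block:: c

    enum toy_state { TOY_EMPTY, TOY_FULL, TOY_NOT_IN_USE };

    /*
     * Model one receive attempt by the hardware.  Returns 1 if a descr
     * was filled, 0 if the current descr was not empty, in which case
     * the real hardware would signal an interrupt and halt.
     */
    static int toy_hw_receive(enum toy_state *ring, int n, int *hw_curr)
    {
        if (ring[*hw_curr] != TOY_EMPTY)
            return 0;

        ring[*hw_curr] = TOY_FULL;        /* data written, mark it full */
        *hw_curr = (*hw_curr + 1) % n;    /* advance by one, wrapping */
        return 1;
    }

If the OS never empties anything, a ring of n empty descrs accepts
exactly n packets before toy_hw_receive() starts returning 0.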

The tail pointer tails or trails the hardware pointer. When the
hardware is ahead, the tail pointer will be pointing at a "full"
descr. The OS will process this descr, and then mark it "not-in-use",
and advance the tail pointer. Thus, when there is flowing RX traffic,
all of the descrs in front of the tail pointer should be "full", and
all of those behind it should be "not-in-use". When RX traffic is not
flowing, then the tail pointer can catch up to the hardware pointer.
The OS will then note that the current tail is "empty", and halt
processing.
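
The tail-side processing loop might be sketched as follows, on a
simplified model in which the ring is an array of states. The toy_*
names are hypothetical, and handing the packet to the network stack is
omitted.

.. code-block:: c

    enum toy_state { TOY_EMPTY, TOY_FULL, TOY_NOT_IN_USE };

    /*
     * Process full descrs starting at *tail.  Each one is marked
     * "not-in-use" and the tail advances.  Stops at the first descr
     * that is not full, or after one full trip around the ring.
     * Returns the number of descrs processed.
     */
    static int toy_process_tail(enum toy_state *ring, int n, int *tail)
    {
        int done = 0;

        while (done < n && ring[*tail] == TOY_FULL) {
            ring[*tail] = TOY_NOT_IN_USE;
            *tail = (*tail + 1) % n;
            done++;
        }
        return done;
    }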

The head pointer (somewhat mis-named) follows after the tail pointer.
When traffic is flowing, then the head pointer will be pointing at
a "not-in-use" descr. The OS will perform various housekeeping duties
on this descr. This includes allocating a new data buffer and
DMA-mapping it so as to make it visible to the hardware. The OS will
then mark the descr as "empty", ready to receive data. Thus, when there
is flowing RX traffic, everything in front of the head pointer should
be "not-in-use", and everything behind it should be "empty". If no
RX traffic is flowing, then the head pointer can catch up to the tail
pointer, at which point the OS will notice that the head descr is
"empty", and it will halt processing.
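
The head-side refill can be sketched on a simplified model in which the
ring is an array of states. The toy_* names are hypothetical, and buffer
allocation and DMA mapping are reduced to a comment.

.. code-block:: c

    enum toy_state { TOY_EMPTY, TOY_FULL, TOY_NOT_IN_USE };

    /*
     * Refill "not-in-use" descrs starting at *head and mark them
     * "empty".  Stops at the first descr that is in any other state.
     * Returns the number of descrs refilled.
     */
    static int toy_refill_head(enum toy_state *ring, int n, int *head)
    {
        int refilled = 0;

        while (refilled < n && ring[*head] == TOY_NOT_IN_USE) {
            /* a real driver would allocate and DMA-map a buffer here */
            ring[*head] = TOY_EMPTY;
            *head = (*head + 1) % n;
            refilled++;
        }
        return refilled;
    }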

Thus, in an idle system, the GDACTDPA, tail and head pointers will
all be pointing at the same descr, which should be "empty". All of the
other descrs in the ring should be "empty" as well.

The show_rx_chain() routine will print out the locations of the
GDACTDPA, tail and head pointers. It will also summarize the contents
of the ring, starting at the tail pointer, and listing the status
of the descrs that follow.

A typical example of the output, for a nearly idle system, might be::

    net eth1: Total number of descrs=256
    net eth1: Chain tail located at descr=20
    net eth1: Chain head is at 20
    net eth1: HW curr desc (GDACTDPA) is at 21
    net eth1: Have 1 descrs with stat=x40800101
    net eth1: HW next desc (GDACNEXTDA) is at 22
    net eth1: Last 255 descrs with stat=xa0800000

In the above, the hardware has filled in one descr, number 20. Both
head and tail are pointing at 20, because it has not yet been emptied.
Meanwhile, the hardware pointer is at 21, which is free.

The "Have nnn descrs" refers to the descrs starting at the tail: in this
case, nnn=1 descr, starting at descr 20. The "Last nnn descrs" refers
to all of the rest of the descrs, from the last status change. The "nnn"
is a count of how many descrs have exactly the same status.
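
The run counting can be illustrated with a small user-space sketch. It
is not the driver's show_rx_chain(), which also reports the pointer
locations; the toy_* names are hypothetical, and the ring is assumed to
be a plain array of 32-bit status words.

.. code-block:: c

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Length of the run of identical status words that starts at index
     * start, looking at no more than limit descrs of an n-descr ring.
     */
    static int toy_run_length(const uint32_t *stat, int n, int start,
                              int limit)
    {
        int len = 1;

        while (len < limit && stat[(start + len) % n] == stat[start])
            len++;
        return len;
    }

    /* Print one line per run, starting at tail; return the run count. */
    static int toy_show_chain(const uint32_t *stat, int n, int tail)
    {
        int seen = 0, runs = 0;

        while (seen < n) {
            int start = (tail + seen) % n;
            int len = toy_run_length(stat, n, start, n - seen);

            seen += len;
            printf("%s %d descrs with stat=x%08x\n",
                   seen < n ? "Have" : "Last", len,
                   (unsigned int)stat[start]);
            runs++;
        }
        return runs;
    }

For a 256-descr ring in which only descr 20 is full and the tail is at
20, this reports two runs: one of length 1 and one of length 255.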

The status x4... corresponds to "full" and status xa... corresponds
to "empty". The actual value printed is RXCOMST_A.

In the device driver source code, a different set of names is
used for these same concepts, so that::

    "empty"      == SPIDER_NET_DESCR_CARDOWNED  == 0xa
    "full"       == SPIDER_NET_DESCR_FRAME_END  == 0x4
    "not in use" == SPIDER_NET_DESCR_NOT_IN_USE == 0xf
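
Judging from the printed values (x4..., xa...), the state appears to
occupy the top four bits of the 32-bit status word. Under that
assumption, a status word can be classified as follows. The toy_* names
are hypothetical and are not the driver's macros.

.. code-block:: c

    #include <stdint.h>

    enum toy_state { TOY_EMPTY, TOY_FULL, TOY_NOT_IN_USE, TOY_OTHER };

    /* Top-nibble values, taken from the name table in this document. */
    #define TOY_NIBBLE_CARDOWNED   0xau
    #define TOY_NIBBLE_FRAME_END   0x4u
    #define TOY_NIBBLE_NOT_IN_USE  0xfu

    /* Assumes the state lives in bits 31..28 of the status word. */
    static enum toy_state toy_decode_status(uint32_t stat)
    {
        switch (stat >> 28) {
        case TOY_NIBBLE_CARDOWNED:
            return TOY_EMPTY;
        case TOY_NIBBLE_FRAME_END:
            return TOY_FULL;
        case TOY_NIBBLE_NOT_IN_USE:
            return TOY_NOT_IN_USE;
        default:
            return TOY_OTHER;
        }
    }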


The RX RAM full bug/feature
===========================

As long as the OS can empty out the RX buffers at a rate faster than
the hardware can fill them, there is no problem. If, for some reason,
the OS fails to empty the RX ring fast enough, the hardware GDACTDPA
pointer will catch up to the head, notice the not-empty condition,
and stop. However, RX packets may still continue arriving on the wire.
The spidernet chip can save some limited number of these in local RAM.
When this local RAM fills up, the spider chip will issue an interrupt
indicating this (GHIINT0STS will show ERRINT, and the GRMFLLINT bit
will be set in GHIINT1STS). When the RX RAM full condition occurs,
a certain bug/feature is triggered that has to be specially handled.
This section describes the special handling for this condition.

When the OS finally has a chance to run, it will empty out the RX ring.
In particular, it will clear the descriptor on which the hardware had
stopped. However, once the hardware has decided that a certain
descriptor is invalid, it will not restart at that descriptor; instead
it will restart at the next descr. This potentially will lead to a
deadlock condition, as the tail pointer will be pointing at this descr,
which, from the OS point of view, is empty; the OS will be waiting for
this descr to be filled. However, the hardware has skipped this descr,
and is filling the next descrs. Since the OS doesn't see this, there
is a potential deadlock, with the OS waiting for one descr to fill,
while the hardware is waiting for a different set of descrs to become
empty.

A call to show_rx_chain() at this point indicates the nature of the
problem. A typical print when the network is hung shows the following::

    net eth1: Spider RX RAM full, incoming packets might be discarded!
    net eth1: Total number of descrs=256
    net eth1: Chain tail located at descr=255
    net eth1: Chain head is at 255
    net eth1: HW curr desc (GDACTDPA) is at 0
    net eth1: Have 1 descrs with stat=xa0800000
    net eth1: HW next desc (GDACNEXTDA) is at 1
    net eth1: Have 127 descrs with stat=x40800101
    net eth1: Have 1 descrs with stat=x40800001
    net eth1: Have 126 descrs with stat=x40800101
    net eth1: Last 1 descrs with stat=xa0800000

Both the tail and head pointers are pointing at descr 255, which is
marked xa... which is "empty". Thus, from the OS point of view, there
is nothing to be done. In particular, there is the implicit assumption
that everything in front of the "empty" descr must surely also be empty,
as explained in the last section. The OS is waiting for descr 255 to
become non-empty, which, in this case, will never happen.

The HW pointer is at descr 0. This descr is marked x4... or "full".
Since it's already full, the hardware can do nothing more, and thus has
halted processing. Notice that descrs 0 through 253 are all marked
"full" (127 + 1 + 126 = 254 descrs), while descrs 254 and 255 are
empty. (The "Last 1 descrs" is descr 254, since tail was at 255.)
Thus, the system is deadlocked, and there can be no forward progress;
the OS thinks there's nothing to do, and the hardware has nowhere to
put incoming data.

This bug/feature is worked around with the spider_net_resync_head_ptr()
routine. When the driver receives RX interrupts, but an examination
of the RX chain seems to show it is empty, then it is probable that
the hardware has skipped a descr or two (sometimes dozens under heavy
network conditions). The spider_net_resync_head_ptr() subroutine will
search the ring for the next full descr, and the driver will resume
operations there. Since this will leave "holes" in the ring, there
is also a spider_net_resync_tail_ptr() that will skip over such holes.
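
The core of the workaround is a search around the ring for a full
descr. A minimal sketch of such a search, on an array model with
hypothetical toy_* names, is given here; the real
spider_net_resync_head_ptr() works on the driver's descriptor chain and
may differ in detail.

.. code-block:: c

    enum toy_state { TOY_EMPTY, TOY_FULL, TOY_NOT_IN_USE };

    /*
     * Starting at index start, search forward around the ring for the
     * first "full" descr.  Returns its index, or -1 if none is full.
     */
    static int toy_find_next_full(const enum toy_state *ring, int n,
                                  int start)
    {
        int i;

        for (i = 0; i < n; i++) {
            int idx = (start + i) % n;

            if (ring[idx] == TOY_FULL)
                return idx;
        }
        return -1;
    }

Applied to the hung ring shown in this section (tail at 255, descrs 254
and 255 empty, all others full), the search starting at 255 lands on
descr 0, which is where processing would resume.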

As of this writing, the spider_net_resync() strategy seems to work very
well, even under heavy network loads.


The TX ring
===========
The TX ring uses a low-watermark interrupt scheme to make sure that
the TX queue is appropriately serviced for large packet sizes.

For packet sizes greater than about 1 KB, the kernel can fill
the TX ring quicker than the device can drain it. Once the ring
is full, the netdev is stopped. When there is room in the ring,
the netdev needs to be reawakened, so that more TX packets are placed
in the ring. The hardware can empty the ring about four times per jiffy,
so it's not appropriate to wait for the poll routine to refill, since
the poll routine runs only once per jiffy. The low-watermark mechanism
marks a descr about one quarter of the way from the bottom of the queue,
so that an interrupt is generated when the descr is processed. This
interrupt wakes up the netdev, which can then refill the queue.
For large packets, this mechanism generates a relatively small number
of interrupts, about 1K/sec. For smaller packets, this will drop to zero
interrupts, as the hardware can empty the queue faster than the kernel
can fill it.
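
The position of the marked descr can be worked out with simple index
arithmetic. The sketch here assumes that "the bottom of the queue" means
the most recently queued end, so that the interrupt fires when roughly
a quarter of the queued descrs remain to be sent. That reading, and the
toy_low_watermark() name, are assumptions made for this example.

.. code-block:: c

    /*
     * Pick the descr that should raise the low-watermark interrupt.
     * tail is the index of the oldest queued descr and queued is the
     * number of descrs waiting to be sent.  Returns the ring index to
     * mark, or -1 if nothing is queued.
     */
    static int toy_low_watermark(int ring_size, int tail, int queued)
    {
        if (queued <= 0)
            return -1;

        /* three quarters of the way in, i.e. one quarter from the end */
        return (tail + (queued * 3) / 4) % ring_size;
    }

With a completely full 256-descr ring and the tail at 0, the marked
descr is number 192, leaving 64 descrs still queued when the interrupt
arrives.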