=================================
Brief tutorial on CRC computation
=================================

A CRC is a long-division remainder. You add the CRC to the message,
and the whole thing (message+CRC) is a multiple of the given
CRC polynomial. To check the CRC, you can either check that the
CRC matches the recomputed value, *or* you can check that the
remainder computed on the message+CRC is 0. This latter approach
is used by a lot of hardware implementations, and is why so many
protocols put the end-of-frame flag after the CRC.

It's actually the same long division you learned in school, except that:

- We're working in binary, so the digits are only 0 and 1, and
- When dividing polynomials, there are no carries. Rather than add and
  subtract, we just xor. Thus, we tend to get a bit sloppy about
  the difference between adding and subtracting.
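
As a small worked case of this carry-less division, the sketch below
divides a 14-bit dividend by a 5-bit divisor. The function name
gf2_remainder() and the sample values (message 1101011011, divisor 10011)
are illustrative assumptions and do not correspond to any standard CRC.

```c
#include <stdint.h>

/*
 * Remainder of an nbits-wide dividend divided by divisor, using xor in
 * place of subtraction.  degree is the position of the divisor's top
 * bit (degree < 32, nbits <= 32), so the remainder fits in degree bits.
 */
static uint32_t gf2_remainder(uint32_t dividend, int nbits,
			      uint32_t divisor, int degree)
{
	uint32_t remainder = 0;
	int i;

	for (i = nbits - 1; i >= 0; i--) {
		/* Bring down the next digit of the dividend. */
		remainder = (remainder << 1) | ((dividend >> i) & 1u);
		/* "Subtract" the divisor when the top bit is set. */
		if (remainder & ((uint32_t)1 << degree))
			remainder ^= divisor;
	}
	return remainder;
}
```

Dividing 11010110110000 (the message followed by four zero bits, 0x35B0)
by 10011 (0x13) leaves the remainder 1110. Replacing the four zero bits
with that remainder gives 0x35BE, which divides evenly.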

Like all division, the remainder is always smaller than the divisor.
To produce a 32-bit CRC, the divisor is actually a 33-bit CRC polynomial.
Since it's 33 bits long, bit 32 is always going to be set, so usually the
CRC polynomial is written in hex with the most significant bit omitted. (If
you're familiar with the IEEE 754 floating-point format, it's the same idea.)
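
To make the notation concrete, here is a minimal sketch using the CRC-32
polynomial found in Ethernet and zip. The macro and function names are
assumptions chosen for this example.

```c
#include <stdint.h>

/* Full 33-bit divisor, including the x^32 term in bit 32. */
#define CRCPOLY_FULL 0x104C11DB7ULL

/* The usual 32-bit shorthand: same polynomial, bit 32 left implicit. */
#define CRCPOLY_BE 0x04C11DB7u

/* Drop the always-set bit 32 to get the conventional hex form. */
static uint32_t crcpoly_short(uint64_t full)
{
	return (uint32_t)(full & 0xFFFFFFFFu);
}
```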

Note that a CRC is computed over a string of *bits*, so you have
to decide on the endianness of the bits within each byte. To get
the best error-detecting properties, this should correspond to the
order they're actually sent. For example, standard RS-232 serial is
little-endian; the most significant bit (sometimes used for parity)
is sent last. And when appending a CRC word to a message, you should
do it in the right order, matching the endianness.

Just like with ordinary division, you proceed one digit (bit) at a time.
Each step of the division you take one more digit (bit) of the dividend
and append it to the current remainder. Then you figure out the
appropriate multiple of the divisor to subtract to bring the remainder
back into range. In binary, this is easy - it has to be either 0 or 1,
and to make the XOR cancel, it's just a copy of bit 32 of the remainder.

When computing a CRC, we don't care about the quotient, so we can
throw the quotient bit away, but subtract the appropriate multiple of
the polynomial from the remainder and we're back to where we started,
ready to process the next bit.

A big-endian CRC written this way would be coded like::

    for (i = 0; i < input_bits; i++) {
        /* Subtract either 0 or 1 times the polynomial. */
        multiple = (remainder & 0x80000000) ? CRCPOLY : 0;
        remainder = ((remainder << 1) | next_input_bit()) ^ multiple;
    }

Notice how, to get at bit 32 of the shifted remainder, we look
at bit 31 of the remainder *before* shifting it.

But also notice how the next_input_bit() bits we're shifting into
the remainder don't actually affect any decision-making until
32 bits later. Thus, the first 32 cycles of this are pretty boring.
Also, to add the CRC to a message, we need a 32-bit-long hole for it at
the end, so we have to add 32 extra cycles shifting in zeros at the
end of every message.
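
A runnable sketch of this plain division, including the extra cycles of
zero bits, might look as follows. The names crc32_be_divide() and
get_bit_be(), the pad_bits parameter, and the choice of the CRC-32
polynomial are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define CRCPOLY_BE 0x04C11DB7u	/* CRC-32 polynomial, bit 32 omitted */

/* Bit number i of the buffer, most significant bit of each byte first. */
static unsigned int get_bit_be(const uint8_t *buf, size_t i)
{
	return (buf[i / 8] >> (7 - i % 8)) & 1u;
}

/*
 * Divide the message, followed by pad_bits zero bits, by the polynomial
 * and return the remainder.
 */
static uint32_t crc32_be_divide(const void *data, size_t len, size_t pad_bits)
{
	const uint8_t *buf = data;
	size_t input_bits = len * 8;
	uint32_t remainder = 0, multiple;
	size_t i;

	for (i = 0; i < input_bits + pad_bits; i++) {
		uint32_t bit = i < input_bits ? get_bit_be(buf, i) : 0;

		multiple = (remainder & 0x80000000u) ? CRCPOLY_BE : 0;
		remainder = ((remainder << 1) | bit) ^ multiple;
	}
	return remainder;
}
```

Calling it with pad_bits set to 32 yields the CRC. Once that CRC is stored
after the message, most significant byte first, dividing the whole frame
with pad_bits set to 0 should leave a remainder of 0.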

These details lead to a standard trick: delay merging in the
next_input_bit() until the moment it's needed. Then the first 32 cycles
can be precomputed, and merging in the final 32 zero bits to make room
for the CRC can be skipped entirely. This changes the code to::

    for (i = 0; i < input_bits; i++) {
        /* Merge the input bit in at the top, where it is tested. */
        remainder ^= next_input_bit() << 31;
        multiple = (remainder & 0x80000000) ? CRCPOLY : 0;
        remainder = (remainder << 1) ^ multiple;
    }
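
The rearranged loop can be sketched as a complete function like this. The
name crc32_be_bitwise() and its seed argument (the initial remainder, 0
for the plain division described so far) are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define CRCPOLY_BE 0x04C11DB7u	/* CRC-32 polynomial, bit 32 omitted */

/* Big-endian CRC, one bit at a time, input merged in at bit 31. */
static uint32_t crc32_be_bitwise(uint32_t remainder, const void *data,
				 size_t len)
{
	const uint8_t *buf = data;
	size_t input_bits = len * 8;
	uint32_t multiple;
	size_t i;

	for (i = 0; i < input_bits; i++) {
		/* Most significant bit of each byte comes first. */
		uint32_t bit = (buf[i / 8] >> (7 - i % 8)) & 1u;

		remainder ^= bit << 31;
		multiple = (remainder & 0x80000000u) ? CRCPOLY_BE : 0;
		remainder = (remainder << 1) ^ multiple;
	}
	return remainder;
}
```

No trailing zero bits are needed here: running the function over a message
followed by its own CRC (most significant byte first) returns 0. With a
seed of 0xffffffff and the result inverted, the string "123456789" gives
0xfc891918, the published check value of the CRC-32/BZIP2 parameter set.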

With this optimization, the little-endian code is particularly simple::

    for (i = 0; i < input_bits; i++) {
        remainder ^= next_input_bit();
        multiple = (remainder & 1) ? CRCPOLY : 0;
        remainder = (remainder >> 1) ^ multiple;
    }

The most significant coefficient of the remainder polynomial is stored
in the least significant bit of the binary "remainder" variable.
The other details of endianness have been hidden in CRCPOLY (which must
be bit-reversed) and next_input_bit().
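
A minimal sketch of the little-endian variant, together with the bit
reversal of the polynomial, is given here. The helper bitrev32(), the
function name crc32_le_bitwise() and its seed argument are assumptions
introduced for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define CRCPOLY_BE 0x04C11DB7u	/* CRC-32 polynomial, x^31 in bit 31 */
#define CRCPOLY_LE 0xEDB88320u	/* the same polynomial, bit-reversed */

/* Reverse the order of the 32 bits in x. */
static uint32_t bitrev32(uint32_t x)
{
	uint32_t r = 0;
	int i;

	for (i = 0; i < 32; i++) {
		r = (r << 1) | (x & 1u);
		x >>= 1;
	}
	return r;
}

/* Little-endian CRC, one bit at a time, input merged in at bit 0. */
static uint32_t crc32_le_bitwise(uint32_t remainder, const void *data,
				 size_t len)
{
	const uint8_t *buf = data;
	size_t input_bits = len * 8;
	uint32_t multiple;
	size_t i;

	for (i = 0; i < input_bits; i++) {
		/* Least significant bit of each byte comes first. */
		remainder ^= (buf[i / 8] >> (i % 8)) & 1u;
		multiple = (remainder & 1u) ? CRCPOLY_LE : 0;
		remainder = (remainder >> 1) ^ multiple;
	}
	return remainder;
}
```

With a seed of 0xffffffff and the result inverted, the string "123456789"
gives 0xcbf43926, the familiar check value of the CRC-32 used by zip and
Ethernet.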

As long as next_input_bit is returning the bits in a sensible order, we don't
*have* to wait until the last possible moment to merge in additional bits.
We can do it 8 bits at a time rather than 1 bit at a time::

    for (i = 0; i < input_bytes; i++) {
        remainder ^= next_input_byte() << 24;
        for (j = 0; j < 8; j++) {
            multiple = (remainder & 0x80000000) ? CRCPOLY : 0;
            remainder = (remainder << 1) ^ multiple;
        }
    }

Or in little-endian::

    for (i = 0; i < input_bytes; i++) {
        remainder ^= next_input_byte();
        for (j = 0; j < 8; j++) {
            multiple = (remainder & 1) ? CRCPOLY : 0;
            remainder = (remainder >> 1) ^ multiple;
        }
    }

If the input is a multiple of 32 bits, you can even XOR in a 32-bit
word at a time and increase the inner loop count to 32.

You can also mix and match the two loop styles, for example doing the
bulk of a message byte-at-a-time and adding bit-at-a-time processing
for any fractional bytes at the end.
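
Mixing the two loop styles could look like the following sketch, which
takes the message length in bits. The function name crc32_le_bits() and
its parameters are assumptions for illustration; any unused high bits in
the final byte are ignored.

```c
#include <stddef.h>
#include <stdint.h>

#define CRCPOLY_LE 0xEDB88320u	/* CRC-32 polynomial, bit-reversed */

/* Little-endian CRC over nbits bits: whole bytes first, then the rest. */
static uint32_t crc32_le_bits(uint32_t remainder, const void *data,
			      size_t nbits)
{
	const uint8_t *buf = data;
	size_t i;
	unsigned int j;

	/* Bulk of the message: merge in one byte, then do 8 steps. */
	for (i = 0; i < nbits / 8; i++) {
		remainder ^= buf[i];
		for (j = 0; j < 8; j++)
			remainder = (remainder >> 1) ^
				    ((remainder & 1u) ? CRCPOLY_LE : 0);
	}
	/* Fractional byte at the end: one bit at a time, low bit first. */
	for (j = 0; j < nbits % 8; j++) {
		remainder ^= (buf[i] >> j) & 1u;
		remainder = (remainder >> 1) ^
			    ((remainder & 1u) ? CRCPOLY_LE : 0);
	}
	return remainder;
}
```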

To reduce the number of conditional branches, software commonly uses
the byte-at-a-time table method, popularized by Dilip V. Sarwate,
"Computation of Cyclic Redundancy Checks via Table Look-Up", Comm. ACM
v.31 no.8 (August 1988) p. 1008-1013.

Here, rather than just shifting one bit of the remainder to decide
on the correct multiple to subtract, we can shift a byte at a time.
This produces a 40-bit (rather than a 33-bit) intermediate remainder,
and the correct multiple of the polynomial to subtract is found using
a 256-entry lookup table indexed by the high 8 bits.

(The table entries are simply the CRC-32 of the given one-byte messages.)
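
A minimal sketch of the table method for the big-endian case follows.
The table and function names are assumptions for the example, and the
table is built at run time here for brevity, although real code often
uses a precomputed one.

```c
#include <stddef.h>
#include <stdint.h>

#define CRCPOLY_BE 0x04C11DB7u	/* CRC-32 polynomial, bit 32 omitted */

static uint32_t crc32_table_be[256];

/* Entry i is the CRC of the one-byte message i. */
static void crc32_init_table_be(void)
{
	uint32_t i, remainder;
	int j;

	for (i = 0; i < 256; i++) {
		remainder = i << 24;
		for (j = 0; j < 8; j++)
			remainder = (remainder << 1) ^
				    ((remainder & 0x80000000u) ? CRCPOLY_BE : 0);
		crc32_table_be[i] = remainder;
	}
}

/* One table lookup per input byte, indexed by the high 8 bits. */
static uint32_t crc32_be_table(uint32_t remainder, const void *data,
			       size_t len)
{
	const uint8_t *buf = data;
	size_t i;

	for (i = 0; i < len; i++)
		remainder = (remainder << 8) ^
			    crc32_table_be[(remainder >> 24) ^ buf[i]];
	return remainder;
}
```

crc32_init_table_be() has to run once before the first call to
crc32_be_table(). Entry 1 of the table is the polynomial itself.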

When space is more constrained, smaller tables can be used, e.g. two
4-bit shifts, each followed by a lookup in a 16-entry table.
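
The 16-entry variant can be sketched as follows for the big-endian case.
The names used are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define CRCPOLY_BE 0x04C11DB7u	/* CRC-32 polynomial, bit 32 omitted */

static uint32_t crc32_table16_be[16];

/* Entry i is the result of four division steps on the nibble i. */
static void crc32_init_table16_be(void)
{
	uint32_t i, remainder;
	int j;

	for (i = 0; i < 16; i++) {
		remainder = i << 28;
		for (j = 0; j < 4; j++)
			remainder = (remainder << 1) ^
				    ((remainder & 0x80000000u) ? CRCPOLY_BE : 0);
		crc32_table16_be[i] = remainder;
	}
}

/* Two 4-bit shifts and two lookups per input byte. */
static uint32_t crc32_be_nibble(uint32_t remainder, const void *data,
				size_t len)
{
	const uint8_t *buf = data;
	size_t i;

	for (i = 0; i < len; i++) {
		remainder ^= (uint32_t)buf[i] << 24;
		remainder = (remainder << 4) ^ crc32_table16_be[remainder >> 28];
		remainder = (remainder << 4) ^ crc32_table16_be[remainder >> 28];
	}
	return remainder;
}
```

This trades 64 bytes of table for twice as many lookups per byte as the
256-entry version.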

It is not practical to process much more than 8 bits at a time using this
technique, because tables larger than 256 entries use too much memory and,
more importantly, too much of the L1 cache.

To get higher software performance, a "slicing" technique can be used.
See "High Octane CRC Generation with the Intel Slicing-by-8 Algorithm",
ftp://download.intel.com/technology/comms/perfnet/download/slicing-by-8.pdf

This does not change the number of table lookups, but does increase
the parallelism. With the classic Sarwate algorithm, each table lookup
must be completed before the index of the next can be computed.

A "slicing by 2" technique would shift the remainder 16 bits at a time,
producing a 48-bit intermediate remainder. Rather than doing a single
lookup in a 65536-entry table, the two high bytes are looked up in
two different 256-entry tables. Each contains the remainder required
to cancel out the corresponding byte. The tables are different because the
polynomials to cancel are different. One has non-zero coefficients from
x^32 to x^39, while the other goes from x^40 to x^47.

Since modern processors can handle many parallel memory operations, this
takes barely longer than a single table look-up and thus performs almost
twice as fast as the basic Sarwate algorithm.

This can be extended to "slicing by 4" using 4 256-entry tables.
Each step, 32 bits of data is fetched, XORed with the CRC, and the result
broken into bytes and looked up in the tables. Because the 32-bit shift
leaves the low-order bits of the intermediate remainder zero, the
final CRC is simply the XOR of the 4 table look-ups.
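
A "slicing by 4" step for the big-endian case might be sketched like
this. The names are assumptions for the example, and the input word is
assembled from individual bytes so that the sketch does not depend on the
host's byte order or alignment rules.

```c
#include <stddef.h>
#include <stdint.h>

#define CRCPOLY_BE 0x04C11DB7u	/* CRC-32 polynomial, bit 32 omitted */

/* Table k cancels a byte whose coefficients run from x^(32+8k) upward. */
static uint32_t crc32_slice4_be[4][256];

static void crc32_init_slice4_be(void)
{
	uint32_t i, r;
	int j, k;

	for (i = 0; i < 256; i++) {
		r = i << 24;
		for (j = 0; j < 8; j++)
			r = (r << 1) ^ ((r & 0x80000000u) ? CRCPOLY_BE : 0);
		crc32_slice4_be[0][i] = r;
	}
	/* Each further table is the previous one advanced by one byte. */
	for (k = 1; k < 4; k++) {
		for (i = 0; i < 256; i++) {
			r = crc32_slice4_be[k - 1][i];
			crc32_slice4_be[k][i] =
				(r << 8) ^ crc32_slice4_be[0][r >> 24];
		}
	}
}

static uint32_t crc32_be_slice4(uint32_t crc, const void *data, size_t len)
{
	const uint8_t *p = data;
	uint32_t (*t)[256] = crc32_slice4_be;

	while (len >= 4) {
		/* Fetch 32 bits of data and XOR them with the CRC. */
		crc ^= ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
		       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
		/* The new CRC is just the XOR of four independent lookups. */
		crc = t[3][crc >> 24] ^ t[2][(crc >> 16) & 0xff] ^
		      t[1][(crc >> 8) & 0xff] ^ t[0][crc & 0xff];
		p += 4;
		len -= 4;
	}
	/* Leftover bytes use the ordinary byte-at-a-time step. */
	while (len--)
		crc = (crc << 8) ^ t[0][(crc >> 24) ^ *p++];
	return crc;
}
```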

But this still enforces sequential execution: a second group of table
look-ups cannot begin until the previous group's 4 table look-ups have all
been completed. Thus, the processor's load/store unit is sometimes idle.

To make maximum use of the processor, "slicing by 8" performs 8 look-ups
in parallel. Each step, the 32-bit CRC is shifted 64 bits and XORed
with 64 bits of input data. What is important to note is that 4 of
those 8 bytes are simply copies of the input data; they do not depend
on the previous CRC at all. Thus, those 4 table look-ups may commence
immediately, without waiting for the previous loop iteration.

By always having 4 loads in flight, a modern superscalar processor can
be kept busy and make full use of its L1 cache.
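
The same idea extended to 8 tables can be sketched as below for the
big-endian case. This is an illustration under the naming assumptions
shown, not the code from the Intel paper. Only the look-ups on "hi"
depend on the previous CRC.

```c
#include <stddef.h>
#include <stdint.h>

#define CRCPOLY_BE 0x04C11DB7u	/* CRC-32 polynomial, bit 32 omitted */

static uint32_t crc32_slice8_be[8][256];

static void crc32_init_slice8_be(void)
{
	uint32_t i, r;
	int j, k;

	for (i = 0; i < 256; i++) {
		r = i << 24;
		for (j = 0; j < 8; j++)
			r = (r << 1) ^ ((r & 0x80000000u) ? CRCPOLY_BE : 0);
		crc32_slice8_be[0][i] = r;
	}
	for (k = 1; k < 8; k++) {
		for (i = 0; i < 256; i++) {
			r = crc32_slice8_be[k - 1][i];
			crc32_slice8_be[k][i] =
				(r << 8) ^ crc32_slice8_be[0][r >> 24];
		}
	}
}

/* Assemble a 32-bit word from four bytes, most significant first. */
static uint32_t load_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint32_t crc32_be_slice8(uint32_t crc, const void *data, size_t len)
{
	const uint8_t *p = data;
	uint32_t (*t)[256] = crc32_slice8_be;

	while (len >= 8) {
		uint32_t hi = crc ^ load_be32(p);	/* depends on old CRC */
		uint32_t lo = load_be32(p + 4);		/* pure input data */

		crc = t[7][hi >> 24] ^ t[6][(hi >> 16) & 0xff] ^
		      t[5][(hi >> 8) & 0xff] ^ t[4][hi & 0xff] ^
		      t[3][lo >> 24] ^ t[2][(lo >> 16) & 0xff] ^
		      t[1][(lo >> 8) & 0xff] ^ t[0][lo & 0xff];
		p += 8;
		len -= 8;
	}
	/* Leftover bytes use the ordinary byte-at-a-time step. */
	while (len--)
		crc = (crc << 8) ^ t[0][(crc >> 24) ^ *p++];
	return crc;
}
```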

Two more details about CRC implementation in the real world:

Normally, appending zero bits to a message which is already a multiple
of a polynomial produces a larger multiple of that polynomial. Thus,
a basic CRC will not detect appended zero bits (or bytes). To enable
a CRC to detect this condition, it's common to invert the CRC before
appending it. This makes the remainder of the message+CRC come out not
as zero, but some fixed non-zero value. (The CRC of the inversion
pattern, 0xffffffff.)

The same problem applies to zero bits prepended to the message, and a
similar solution is used. Instead of starting the CRC computation with
a remainder of 0, an initial remainder of all ones is used. As long as
you start the same way on decoding, it doesn't make a difference.
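
Both conventions can be seen in the following little-endian sketch. The
function names crc32_le_raw() and crc32_le() are assumptions for the
example; the polynomial and conventions are those of the common CRC-32
used by zip and Ethernet.

```c
#include <stddef.h>
#include <stdint.h>

#define CRCPOLY_LE 0xEDB88320u	/* CRC-32 polynomial, bit-reversed */

/* Plain remainder, starting from the given initial remainder. */
static uint32_t crc32_le_raw(uint32_t remainder, const void *data,
			     size_t len)
{
	const uint8_t *buf = data;
	size_t i;
	int j;

	for (i = 0; i < len; i++) {
		remainder ^= buf[i];
		for (j = 0; j < 8; j++)
			remainder = (remainder >> 1) ^
				    ((remainder & 1u) ? CRCPOLY_LE : 0);
	}
	return remainder;
}

/* Conventional CRC-32: all-ones initial remainder, inverted result. */
static uint32_t crc32_le(const void *data, size_t len)
{
	return ~crc32_le_raw(0xFFFFFFFFu, data, len);
}
```

With a zero initial remainder, crc32_le_raw() returns the same value for
"a" and for "a" preceded by zero bytes, while crc32_le() tells them
apart. After the inverted CRC is appended least significant byte first,
running crc32_le_raw() with an initial remainder of 0xffffffff over the
whole frame gives the fixed value 0xdebb20e3 rather than 0.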