Back to home page

LXR

 
 

    


0001 
0002 LZO stream format as understood by Linux's LZO decompressor
0003 ===========================================================
0004 
0005 Introduction
0006 
0007   This is not a specification. No specification seems to be publicly available
0008   for the LZO stream format. This document describes what input format the LZO
0009   decompressor as implemented in the Linux kernel understands. The file subject
0010   of this analysis is lib/lzo/lzo1x_decompress_safe.c. No analysis was made on
0011   the compressor nor on any other implementations though it seems likely that
0012   the format matches the standard one. The purpose of this document is to
0013   better understand what the code does in order to propose more efficient fixes
0014   for future bug reports.
0015 
0016 Description
0017 
0018   The stream is composed of a series of instructions, operands, and data. The
0019   instructions consist in a few bits representing an opcode, and bits forming
0020   the operands for the instruction, whose size and position depend on the
0021   opcode and on the number of literals copied by previous instruction. The
0022   operands are used to indicate :
0023 
0024     - a distance when copying data from the dictionary (past output buffer)
0025     - a length (number of bytes to copy from dictionary)
0026     - the number of literals to copy, which is retained in variable "state"
0027       as a piece of information for next instructions.
0028 
0029   Optionally depending on the opcode and operands, extra data may follow. These
0030   extra data can be a complement for the operand (eg: a length or a distance
0031   encoded on larger values), or a literal to be copied to the output buffer.
0032 
0033   The first byte of the block follows a different encoding from other bytes, it
0034   seems to be optimized for literal use only, since there is no dictionary yet
0035   prior to that byte.
0036 
0037   Lengths are always encoded on a variable size starting with a small number
0038   of bits in the operand. If the number of bits isn't enough to represent the
0039   length, up to 255 may be added in increments by consuming more bytes with a
0040   rate of at most 255 per extra byte (thus the compression ratio cannot exceed
0041   around 255:1). The variable length encoding using #bits is always the same :
0042 
0043        length = byte & ((1 << #bits) - 1)
0044        if (!length) {
0045                length = ((1 << #bits) - 1)
0046                length += 255*(number of zero bytes)
0047                length += first-non-zero-byte
0048        }
0049        length += constant (generally 2 or 3)
0050 
0051   For references to the dictionary, distances are relative to the output
0052   pointer. Distances are encoded using very few bits belonging to certain
0053   ranges, resulting in multiple copy instructions using different encodings.
0054   Certain encodings involve one extra byte, others involve two extra bytes
0055   forming a little-endian 16-bit quantity (marked LE16 below).
0056 
0057   After any instruction except the large literal copy, 0, 1, 2 or 3 literals
0058   are copied before starting the next instruction. The number of literals that
0059   were copied may change the meaning and behaviour of the next instruction. In
0060   practice, only one instruction needs to know whether 0, less than 4, or more
0061   literals were copied. This is the information stored in the <state> variable
0062   in this implementation. This number of immediate literals to be copied is
0063   generally encoded in the last two bits of the instruction but may also be
0064   taken from the last two bits of an extra operand (eg: distance).
0065 
0066   End of stream is declared when a block copy of distance 0 is seen. Only one
0067   instruction may encode this distance (0001HLLL), it takes one LE16 operand
0068   for the distance, thus requiring 3 bytes.
0069 
0070   IMPORTANT NOTE : in the code some length checks are missing because certain
0071   instructions are called under the assumption that a certain number of bytes
0072   follow because it has already been guaranteed before parsing the instructions.
0073   They just have to "refill" this credit if they consume extra bytes. This is
0074   an implementation design choice independent on the algorithm or encoding.
0075 
0076 Byte sequences
0077 
0078   First byte encoding :
0079 
0080       0..17   : follow regular instruction encoding, see below. It is worth
0081                 noting that codes 16 and 17 will represent a block copy from
0082                 the dictionary which is empty, and that they will always be
0083                 invalid at this place.
0084 
0085       18..21  : copy 0..3 literals
0086                 state = (byte - 17) = 0..3  [ copy <state> literals ]
0087                 skip byte
0088 
0089       22..255 : copy literal string
0090                 length = (byte - 17) = 4..238
0091                 state = 4 [ don't copy extra literals ]
0092                 skip byte
0093 
0094   Instruction encoding :
0095 
0096       0 0 0 0 X X X X  (0..15)
0097         Depends on the number of literals copied by the last instruction.
0098         If last instruction did not copy any literal (state == 0), this
0099         encoding will be a copy of 4 or more literal, and must be interpreted
0100         like this :
0101 
0102            0 0 0 0 L L L L  (0..15)  : copy long literal string
0103            length = 3 + (L ?: 15 + (zero_bytes * 255) + non_zero_byte)
0104            state = 4  (no extra literals are copied)
0105 
0106         If last instruction used to copy between 1 to 3 literals (encoded in
0107         the instruction's opcode or distance), the instruction is a copy of a
0108         2-byte block from the dictionary within a 1kB distance. It is worth
0109         noting that this instruction provides little savings since it uses 2
0110         bytes to encode a copy of 2 other bytes but it encodes the number of
0111         following literals for free. It must be interpreted like this :
0112 
0113            0 0 0 0 D D S S  (0..15)  : copy 2 bytes from <= 1kB distance
0114            length = 2
0115            state = S (copy S literals after this block)
0116          Always followed by exactly one byte : H H H H H H H H
0117            distance = (H << 2) + D + 1
0118 
0119         If last instruction used to copy 4 or more literals (as detected by
0120         state == 4), the instruction becomes a copy of a 3-byte block from the
0121         dictionary from a 2..3kB distance, and must be interpreted like this :
0122 
0123            0 0 0 0 D D S S  (0..15)  : copy 3 bytes from 2..3 kB distance
0124            length = 3
0125            state = S (copy S literals after this block)
0126          Always followed by exactly one byte : H H H H H H H H
0127            distance = (H << 2) + D + 2049
0128 
0129       0 0 0 1 H L L L  (16..31)
0130            Copy of a block within 16..48kB distance (preferably less than 10B)
0131            length = 2 + (L ?: 7 + (zero_bytes * 255) + non_zero_byte)
0132         Always followed by exactly one LE16 :  D D D D D D D D : D D D D D D S S
0133            distance = 16384 + (H << 14) + D
0134            state = S (copy S literals after this block)
0135            End of stream is reached if distance == 16384
0136 
0137       0 0 1 L L L L L  (32..63)
0138            Copy of small block within 16kB distance (preferably less than 34B)
0139            length = 2 + (L ?: 31 + (zero_bytes * 255) + non_zero_byte)
0140         Always followed by exactly one LE16 :  D D D D D D D D : D D D D D D S S
0141            distance = D + 1
0142            state = S (copy S literals after this block)
0143 
0144       0 1 L D D D S S  (64..127)
0145            Copy 3-4 bytes from block within 2kB distance
0146            state = S (copy S literals after this block)
0147            length = 3 + L
0148          Always followed by exactly one byte : H H H H H H H H
0149            distance = (H << 3) + D + 1
0150 
0151       1 L L D D D S S  (128..255)
0152            Copy 5-8 bytes from block within 2kB distance
0153            state = S (copy S literals after this block)
0154            length = 5 + L
0155          Always followed by exactly one byte : H H H H H H H H
0156            distance = (H << 3) + D + 1
0157 
0158 Authors
0159 
0160   This document was written by Willy Tarreau <w@1wt.eu> on 2014/07/19 during an
0161   analysis of the decompression code available in Linux 3.16-rc5. The code is
0162   tricky, it is possible that this document contains mistakes or that a few
0163   corner cases were overlooked. In any case, please report any doubt, fix, or
0164   proposed updates to the author(s) so that the document can be updated.