Compression

The Wikipedia entry "Lossless compression algorithms" is a handy reference for this stuff; it covers many topics that we never had time to even mention.

This handout describes two kinds of compression:

All compression is based on the idea

Data = Model + Surprise
Model = Framework + Parameters

where the Framework is some kind of representation scheme which is known to both the sender and the receiver, the Parameters are some kind of summary of a message as a whole (such as the frequency of occurrence for each character), and the Surprise is the information which the receiver cannot predict given the Model.

For example, the Unicode character set (see the Unicode web site) has room for over a million different characters; the current version, Unicode 4.1.0, has 97,720 defined codes, which would require 17 bits per character, or 16.6 bits per character if we were clever. But by choosing ISO 2022 as our Framework, which has "which character set in such-and-such a registry" as Parameter, we can encode a document like this one in ASCII, using just 7 bits per character, a big saving. Using a fancier compression method, we might save even more space. (gzip compresses UnicodeData.txt by a factor of 6.6.)

Compressing Integer Sequences

A basic information retrieval index is a mapping

index : term -> postings

where a term could be a word or a stem, and the postings for a term is a set of

(document ID, count)
pairs, where a document ID is the (natural number) identifier of a document that the term occurs in and a count is the (positive integer) number of times the term occurs in that document.

These two kinds of numbers have different properties. If there are n documents in the collection, we need ceiling(log2(n)) bits to represent a document number, and if a term occurs in k of those documents, we expect them to be scattered fairly uniformly throughout the range 1..n, so there doesn't seem to be much we can exploit.

Delta encoding

Since the postings is a set of pairs, we can rearrange the order of the pairs any way we find useful. One useful way is to sort them so that the document IDs form a strictly increasing sequence. For example, to represent the set {19497, 19646, 19619, 19599, 19597, 19468, 19495, 19668, 19422, 19446, 19600, 19442} --- a real set of document IDs I happened to have handy --- we might sort them, getting 19422 < 19442 < 19446 < 19468 < 19495 < 19497 < 19597 < 19599 < 19600 < 19619 < 19646 < 19668. Now we can take the differences, and represent the set as 19422, +20, +4, +22, +27, +2, +100, +2, +1, +19, +27, +22. In fact, if we ensure that document IDs are strictly positive, and start with an origin of 0, we can represent all the document IDs as a positive difference from the next smaller number.

Replacing a strictly increasing sequence of integers by the sequence of differences is delta encoding. The differences are all positive integers, and they cannot be any larger than the numbers you started with. The nice thing about this is that when you are dealing with a set of document IDs in the range 1..n, the more elements the set has, the smaller the differences have to get.

Compressing numbers that are commonly small

When we are dealing with collections of positive integers, we often find that small numbers are much commoner than big ones. For example, of the 10313 distinct words in "Mutual Aid", only 42 occur more than 255 times. Clearly, words that don't occur very often in an entire collection cannot occur very often in any one document, so we expect the document frequency numbers to be small much more often than not. In the same way, once an increasing sequence has been delta encoded, we expect small differences more often than big ones.

Variable Byte Encoding

There is a way to deal with this which is simple, fast, and compresses tolerably well, although not perfectly well. It represents each number using a whole number of bytes.

The basic idea is that we devote one bit in each byte to answering the question "is this the last byte for a number?", and seven bits to providing data.

The "end" bit could be at the top of the byte or the bottom; you could use 0 to mean stop or 0 to mean continue. So there are at least four sensible ways to do this. It doesn't matter very much. The scheme I show here puts the end bit at the top of the byte and uses 0 for stop, so that the representation of a value 0..127 requires no change.

Encoding:

    void encode(unsigned long long x) {
	while (x > 127) {
	    putchar((x & 127)|128);
	    x >>= 7;
	}
	putchar(x);
    }

Decoding:

    unsigned long decode(void) {
	unsigned long long x, b;
	int shift;
	
	x = 0, shift = 0;
	while ((b = getchar()) > 127) {
	    x |= (b & 127) << shift;
	    shift += 7;
	}
	return x | (b << shift);
    }

This is independent of the word size and endian-ness of the computer.

A perhaps more obvious approach would be for the first byte to indicate the number of bytes required, and then have the remaining bytes 8 bits per byte. That could handle numbers up to 2**(256*8), which is much larger than we need; the problem is that small numebrs (0..127) would require two bytes instead of one. One can imagine all sorts of hybrid schemes, but variable byte encoding is hard to beat.

UTF-8

The Unicode standard defines a standard way of representing a stream of Unicode characters as a stream of bytes. It is called UTF-8 (Unicode Transmission Format using 8 bit elements). It is a close cousin of variable byte encoding. However, it had to satisfy several requirements:

The scheme goes like this:
BytesUnicode value
0xxxxxxx00000 00000000 0xxxxxxx
110yyyyy 10xxxxxx00000 00000yyy yyxxxxxx
1110zzzz 10yyyyyy 10zzzzzz00000 zzzzyyyy yyxxxxxx
11110www 10zzzzzz 10yyyyyy 10xxxxxxwwwzz zzzzyyyy yyxxxxxx

This encoding is actually redundant, so there is an additional rule: each value will be encoded using the shortest available form. It's worth noting that the so-called "UTF-8" that Java uses is not well-formed UTF-8; where real UTF-8 would use 4 bytes, Java's "modified UTF-8" will use 6 bytes, and in addition, Java represents the NUL character by the two bytes 11000000 10000000 instead of the single byte 00000000.

Unary encoding

Unary encoding is obviously absurd. It represents the natural number n by n ones followed by a zero. If we are encoding positive integers, we can use n-1 ones followed by a zero.

While it's obviously absurd, there is a probability distribution on the natural numbers for which it is optimal: if each number k is twice as common as k+1.

Unary coding is not often used by itself, but it is often used as a building block in other codes. For example, we can see UTF-8 codes as beginning with a unary number: 0 for ASCII, and 2 3 or 4 (110 1110 or 11110) for 2 3 or 4 byte codes.

We could also use 0s followed by a 1, of course.

Bounded binary encoding

If our Model tells us that a number lies in the range 2k .. 2k+1-1, then the Surprise lies in the bottom k bits.

Elias gamma encoding

Elias gamma encoding combines unary encoding and bounded binary encoding. To encode a positive integer n, let k be the natural number such that

2k <= n < 2k+1.

Now write k in unary (as k zeros followed by a one) and then the bottom k bits of n.
ncode
1[1][]
2[01][0]
3[01][1]
4[001][00]
5[001][01]
6[001][10]
7[001][11]
8[0001][000]
9[0001][001]
10[0001][010]
11[0001][011]
12[0001][100]
13[0001][101]
14[0001][110]
15[0001][111]
16[00001][0000]
17[00001][0001]
......
31[00001][1111]
......
255[00000001][1111111]
......

Compare this with variable byte encoding.

Elias delta encoding

As we get to large numbers, we find ourselves needing lots of bits for the unary prefix. So Elias delta encoding applies the same idea one more time. Instead of writing k using unary encoding, write k+1 using Elias gamma encoding.
ncode
1[1][][]
2[01][0][0]
3[01][0][1]
4[01][1][00]
5[01][1][01]
6[01][1][10]
7[01][1][11]
8[001][00][000]
9[001][00][001]
10[001][00][010]
11[001][00][011]
12[001][00][100]
13[001][00][101]
14[001][00][110]
15[001][00][111]
16[001][01][0000]
17[001][01][0001]
......

Other encodings

You can find out about Elias omega encoding (which applies the Elias idea recursively), Golumb encoding (divide n by a tunable parameter b, encoding n/b in unary, and n%b in truncated binary), and others in the Wikipedia. They won't be in the exam.

Compressing text

The more we can build into our Model, the less Surprise there will be.

The first step in modelling text is to choose an appropriate character set and say that text is a sequence of characters.

For that sequence, we can use a sequence of fixed length elements (single bytes if there are no more than 256 characters in the character set, pairs of bytes if there are no more than 65,536 characters in the character set) or a sequence of variable length elements, as in UTF-8 (and other compression schemes for Unicode, like UTF-7, UTF-EBCDIC, and TR-6).

A somewhat richer Framework says that some characters may be commoner than others. For example, I took 83,358 lines of text from Project Gutenberg (about 4.3MB), and counted characters.
countc countc countc countc
   701663SP 4@ 4`
   2419! 7438A 254246a
   12784" 4655B 46583b
   7# 3514C 85093c
   1$ 2705D 133298d
   3% 4882E 410679e
   28& 2746F 79642f
   10756' 3481G 58893g
   2591( 6691H 202018h
30TAB 2597) 12618I 224779i
83358LF 178* 1102J 3587j
   1+ 1340K 22461k
   59534, 2952L 138009l
   10145- 3865M 79073m
   33233. 3406N 230331n
   23/ 3599O 253465o
   9080 2892P 54258p
   20191 171Q 3034q
   11822 3022R 188953r
   8783 6184S 205266s
   6464 11812T 298917t
   6595 929U 91154u
   5176 805V 31179v
   4937 4323W 70257w
   7798 423X 5411x
   4699 1129Y 59800y
   1703: 81Z 1680z
   6191; 917[ 84{
   1< 1\ 92|
   1= 915] 84}
   1> 84^ 1~
   2634? 1534_   

We see, for example, that capital Q is quite rate (171 occurrences) but lower case e is very common (410,697 occurrences). Other than letters, plus only occurs once, while comma occurs 59,534 times. It seems silly to spend the same number of bits on common characters as rare characters, and it is silly. We can do better by using short codes for common characters even though that means using long codes for rare characters.

Huffman encoding

Huffman encoding provides the Framework. The Parameters for any particular document are either the character counts from which the encoding tree is derived, or the shape of the tree itself. We can transmit a tree like this:

tree_bits (Leaf c) = "0" ++ byte_bits c
tree_bits (Fork x y) = "1" ++ tree_bits x ++ tree_bits y

The idea of Huffman encoding is very simple.

  1. Start with a collection of (Leaf character,count) pairs.
  2. While this collection has more than one element,
    1. Pick the two items with the smallest counts, (x,nx) and (y,ny).
    2. Replace them with a new pair (Fork x y,nx+ny). It doesn't matter which one is taken for x and which is taken for y.

When this process is finished, we have a tree. This defines a binary code:

codes tree = visit tree ""
  where visit (Leaf c) code = [(c,code)]
        visit (Fork x y) code = visit x (code ++ "0") ++
                               visit y (code ++ "1")

These codes have the important property that none of them is a prefix of any other, so when we are decoding from left to right we are never in any doubt about which character's code we have seen next.

Huffman code is optimal amongst codes that don't change throughout a message. Roughly speaking, it is likely to halve the size of a text.

Adaptive Huffman coding

Suppose you have a Text which starts in English, switches to French for a couple of pages, and then switches back to English. (This happens in Dorothy L. Sayers' "Clouds of Witnesses", for example.) English and French have many similarities, but we can expect the character frequencies to differ somewhat. (For example, English makes very little use of accented vowels, while French uses them a lot.) It would be better if the coder could "fit" English in the English bits and "fit" French in the French bits. Even (or especially) on a small scale, the assumption of position-independence doesn't work very well: the letter after Q is almost always U, the letter after H is almost never T even though T is very common, and so on.

We could extend Huffman encoding in two ways. One way would be to use 256 different encoding trees, depending on what the last byte was. Another would try to adapt to larger scale patterns.

In adaptive Huffman encoding, the sender and receiver start with the same tree. It doesn't matter in principle what that tree is. As it encounters each character, the sender encodes it using the current tree and only then updates the tree. The receiver updates its tree exactly the same way after decoding each character.

Lempel-Ziv methods

The basic idea of these methods is to keep a "window" of the last N characters transmitted. The sender looks at the upcoming characters and if it can find a (usefully long) prefix of the future characters which occurs in its window onto the past, it sends a code which means "such and such a substring of the past", otherwise it just encodes the next character. These methods can pick up repeating words and even phrases. They give very good compression.

Word coding

Word coding says "wait a minute, text isn't really a sequence of characters, that's just an encoding; it really is a sequence of words." So we make (or maintain) a dictionary mapping words to integers, and instead of encoding a sequence of characters, we encode a sequence of word numbers. If we sort the words in the dictionary by descending order of frequency, we'll be back in the realm of integer sequences where small numbers are much more common than big numbers.

What about spaces and punctuation marks? Spaceless word coding says "assume that each word is followed by a space." What happens when it isn't? Each of the characters in your alphabet is also recorded as a word, and if the next word is one of those, the space is cancelled. Some of these character will also have an implied space, some won't.

This is an example, albeit a short one, of how(NL)
spaceless word coding works.(NL)

is divided up as

(This )(is )(an )(example )(, )(albeit )(a )(short )(one )(, )
(of )(how )(NL)(spaceless )(word )(coding )(works )(. )(NL).

We can then encode the sequence of word numbers using Huffman coding or variable byte coding or anything that takes our fancy.

One nice consequence of using word coding followed by variable byte encoding is that we can search for a whole number of words in a compressed document by doing a byte search for the compressed query;