Compression

The Wikipedia has a number of entries that are worth reading:

These pages point to many topics that we never had time to even mention. But an even better resource is Matt Mahoney's Data Compression Explained.

This handout describes two kinds of compression:

All compression is based on the idea

Data = Model + Surprise
Model = Framework + Parameters

where the Framework is some kind of representation scheme which is known to both the sender and the receiver, the Parameters are some kind of summary of a message as a whole (such as the frequency of occurrence for each character), and the Surprise is the information which the receiver cannot predict given the Model.

For example, the Unicode character set (see the Unicode web site) has room for over a million different characters; one recent version, Unicode 6.0.0, has 111,563 defined codes, which would require 17 bits per character, or 16.8 bits per character if we were clever. But by choosing ISO 2022 as our Framework, which has "which character set in such-and-such a registry" as Parameter, we can encode a document like this one in ASCII, using just 7 bits per character, a big saving. Using a fancier compression method, we might save even more space. (gzip compresses UnicodeData.txt by a factor of 6.6.)

Compressing Integer Sequences

A basic information retrieval index is a mapping

index : term -> postings

where a term could be a word or a stem, and the postings for a term is a set of

(document ID, count)
pairs, where a document ID is the (natural number) identifier of a document that the term occurs in and a count is the (positive integer) number of times the term occurs in that document.

These two kinds of numbers have different properties. If there are n documents in the collection, we need ceiling(log2(n)) bits to represent a document number, and if a term occurs in k of those documents, we expect them to be scattered fairly uniformly throughout the range 1..n, so there doesn't seem to be much we can exploit.

Delta encoding

Since the postings is a set of pairs, we can rearrange the order of the pairs any way we find useful. One useful way is to sort them so that the document IDs form a strictly increasing sequence. For example, to represent the set {19497, 19646, 19619, 19599, 19597, 19468, 19495, 19668, 19422, 19446, 19600, 19442} --- a real set of document IDs I happened to have handy --- we might sort them, getting 19422 < 19442 < 19446 < 19468 < 19495 < 19497 < 19597 < 19599 < 19600 < 19619 < 19646 < 19668. Now we can take the differences, and represent the set as 19422, +20, +4, +22, +27, +2, +100, +2, +1, +19, +27, +22. In fact, if we ensure that document IDs are strictly positive, and start with an origin of 0, we can represent all the document IDs as a positive difference from the next smaller number.

Replacing a strictly increasing sequence of integers by the sequence of differences is delta encoding. The differences are all positive integers, and they cannot be any larger than the numbers you started with. The nice thing about this is that when you are dealing with a set of document IDs in the range 1..n, the more elements the set has, the smaller the differences have to get.
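
In code, the idea is just a running difference. Here is a sketch (the function names and array arguments are mine, purely for illustration):

    /* Sketch: convert a strictly increasing sequence of document IDs
       into positive differences ("gaps"), and back again.  IDs are
       assumed to be >= 1, so starting from an origin of 0 makes every
       gap strictly positive. */

    void gaps_from_ids(const unsigned ids[], unsigned gaps[], int n) {
        unsigned previous = 0;
        int i;

        for (i = 0; i < n; i++) {
            gaps[i] = ids[i] - previous;    /* always >= 1 */
            previous = ids[i];
        }
    }

    void ids_from_gaps(const unsigned gaps[], unsigned ids[], int n) {
        unsigned previous = 0;
        int i;

        for (i = 0; i < n; i++) {
            previous += gaps[i];
            ids[i] = previous;
        }
    }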

Compressing numbers that are commonly small

When we are dealing with collections of positive integers, we often find that small numbers are much commoner than big ones. For example, of the 10313 distinct words in "Mutual Aid", only 42 occur more than 255 times. Clearly, words that don't occur very often in an entire collection cannot occur very often in any one document, so we expect the per-document counts to be small much more often than not. In the same way, once an increasing sequence has been delta encoded, we expect small differences more often than big ones.

Variable Byte Encoding

There is a way to deal with this which is simple, fast, and compresses tolerably well, although not optimally. It represents each number using a whole number of bytes.

The basic idea is that we devote one bit in each byte to answering the question "is this the last byte for a number?", and seven bits to providing data.

The "end" bit could be at the top of the byte or the bottom; you could use 0 to mean stop or 0 to mean continue. So there are at least four sensible ways to do this. It doesn't matter very much. The scheme I show here puts the end bit at the top of the byte and uses 0 for stop, so that the representation of a value 0..127 requires no change.

Encoding:

    void encode(unsigned long long x) {
	while (x > 127) {
	    putchar((x & 127)|128);	/* low 7 bits, top bit set: more to come */
	    x >>= 7;
	}
	putchar(x);			/* final byte: top bit clear */
    }

Decoding:

    unsigned long long decode(void) {
	unsigned long long x, b;
	int shift;

	x = 0, shift = 0;
	while ((b = getchar()) > 127) {	/* continuation byte: top bit set */
	    x |= (b & 127) << shift;	/* 7 data bits, least significant first */
	    shift += 7;
	}
	return x | (b << shift);	/* final byte: top bit clear */
    }

This is independent of the word size and endian-ness of the computer.

Note: you may need to use getchar_unlocked and putchar_unlocked to get acceptable performance, if your C compiler assumes multithreading as the default.

A perhaps more obvious approach would be for the first byte to indicate the number of bytes required, and then have the remaining bytes 8 bits per byte. That could handle numbers up to 2**(256*8), which is much larger than we need; the problem is that small numbers (0..127) would require two bytes instead of one. One can imagine all sorts of hybrid schemes, but variable byte encoding is hard to beat.

I have heard that Google use one of those schemes. What follows is third-hand, so don't take it as Gospel. It's one possible scheme, that's all. The trick is to encode numbers in groups of four, and to use 2-bit length codes, so that you can fit four length codes in one byte. Code 0 = 1 byte, 1 = 2 bytes, 2 = 4 bytes, 3 = 8 bytes. I don't know which way round the numbers are stored; let's say little-endian.

Encoding:

    void encode4(
        unsigned long long x0,
        unsigned long long x1,
        unsigned long long x2,
        unsigned long long x3
    ) {
      # define L(x) ((x >>  8) == 0ull ? 0 : \
		     (x >> 16) == 0ull ? 1 : \
		     (x >> 32) == 0ull ? 2 : 3)
      # define P(l, x) l = 1 << l;\
		       do putchar((unsigned char)(x)),x >>= 8; while (0 != --l)
	unsigned char l0 = L(x0);
	unsigned char l1 = L(x1);
	unsigned char l2 = L(x2);
	unsigned char l3 = L(x3);

	/* Write four lengths as one byte */
	putchar((((((l0<<2)|l1)<<2)|l2)<<2)|l3);
	/* Write each number as 1..8 bytes */
	P(l0, x0);
	P(l1, x1);
	P(l2, x2);
	P(l3, x3);
      # undef P
      # undef L
    }

Decoding:

    static inline unsigned long long get_1_to_8(int n) {
	unsigned long long r;
      # define next_byte (unsigned long long)getchar()

	r = next_byte;		/* placed here to shut gcc up */
	switch (n&3) {
	    case 0:
		break;
	    case 1:
		r |= next_byte << 8;
		break;
	    case 2:
		r |= next_byte <<  8;
		r |= next_byte << 16;
		r |= next_byte << 24;
		break;
	    case 3:
		r |= next_byte <<  8;
		r |= next_byte << 16;
		r |= next_byte << 24;
		r |= next_byte << 32;
		r |= next_byte << 40;
		r |= next_byte << 48;
		r |= next_byte << 56;
		break;
	}
	return r;
      # undef next_byte
    }

    void decode4(
	unsigned long long *x0,
	unsigned long long *x1,
	unsigned long long *x2, 
	unsigned long long *x3
    ) {
	unsigned char c = getchar();
	*x0 = get_1_to_8(c >> 6);
	*x1 = get_1_to_8(c >> 4);
	*x2 = get_1_to_8(c >> 2);
	*x3 = get_1_to_8(c >> 0);
    }

UTF-8

The Unicode standard defines a standard way of representing a stream of Unicode characters as a stream of bytes. It is called UTF-8 (Unicode Transformation Format using 8 bit elements). It is a close cousin of variable byte encoding. However, it had to satisfy several requirements:

The scheme goes like this:
Bytes                                  Unicode value
0xxxxxxx                               00000 00000000 0xxxxxxx
110yyyyy 10xxxxxx                      00000 00000yyy yyxxxxxx
1110zzzz 10yyyyyy 10xxxxxx             00000 zzzzyyyy yyxxxxxx
11110www 10zzzzzz 10yyyyyy 10xxxxxx    wwwzz zzzzyyyy yyxxxxxx

This encoding is actually redundant, so there is an additional rule: each value will be encoded using the shortest available form. It's worth noting that the so-called "UTF-8" that Java uses is not well-formed UTF-8; where real UTF-8 would use 4 bytes, Java's "modified UTF-8" will use 6 bytes, and in addition, Java represents the NUL character by the two bytes 11000000 10000000 instead of the single byte 00000000.
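
Here is a sketch of how the table turns into code. This is my own illustration rather than any library routine, and it ignores checks for surrogates and out-of-range values:

    /* Sketch: write the UTF-8 bytes for one code point in 0..0x10FFFF. */
    void put_utf8(unsigned long c) {
	if (c < 0x80) {				/* 0xxxxxxx */
	    putchar(c);
	} else if (c < 0x800) {			/* 110yyyyy 10xxxxxx */
	    putchar(0xC0 | (c >> 6));
	    putchar(0x80 | (c & 0x3F));
	} else if (c < 0x10000) {		/* 1110zzzz 10yyyyyy 10xxxxxx */
	    putchar(0xE0 | (c >> 12));
	    putchar(0x80 | ((c >> 6) & 0x3F));
	    putchar(0x80 | (c & 0x3F));
	} else {				/* 11110www 10zzzzzz 10yyyyyy 10xxxxxx */
	    putchar(0xF0 | (c >> 18));
	    putchar(0x80 | ((c >> 12) & 0x3F));
	    putchar(0x80 | ((c >> 6) & 0x3F));
	    putchar(0x80 | (c & 0x3F));
	}
    }

Because each value must be encoded with the shortest form that fits, a careful decoder can reject over-long sequences.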

Unary encoding

Unary encoding is obviously absurd. It represents the natural number n by n ones followed by a zero. If we are encoding positive integers, we can use n-1 ones followed by a zero.

While it's obviously absurd, there is a probability distribution on the natural numbers for which it is optimal: if each number k is twice as common as k+1.

Unary coding is not often used by itself, but it is often used as a building block in other codes. For example, we can see UTF-8 codes as beginning with a unary number: 0 for ASCII, and 2, 3, or 4 (110, 1110, or 11110) for 2-, 3-, or 4-byte codes.

We could also use 0s followed by a 1, of course.
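
As a sketch, writing the bits out as '0' and '1' characters rather than packing them:

    /* Sketch: unary code for a natural number n, as n ones followed by a zero. */
    void put_unary(unsigned n) {
	while (n-- > 0) putchar('1');
	putchar('0');
    }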

Bounded binary encoding

If our Model tells us that a number lies in the range 2^k .. 2^(k+1)-1, then the Surprise lies in the bottom k bits.

Elias gamma encoding

Elias gamma encoding combines unary encoding and bounded binary encoding. To encode a positive integer n, let k be the natural number such that

2^k <= n < 2^(k+1).

Now write k in unary (as k zeros followed by a one) and then the bottom k bits of n.
  n    code
  1    [1][]
  2    [01][0]
  3    [01][1]
  4    [001][00]
  5    [001][01]
  6    [001][10]
  7    [001][11]
  8    [0001][000]
  9    [0001][001]
 10    [0001][010]
 11    [0001][011]
 12    [0001][100]
 13    [0001][101]
 14    [0001][110]
 15    [0001][111]
 16    [00001][0000]
 17    [00001][0001]
...    ...
 31    [00001][1111]
...    ...
255    [00000001][1111111]
...    ...

Compare this with variable byte encoding.
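
Here is a sketch of the encoder, again writing '0'/'1' characters instead of packed bits; a real implementation would append to a bit buffer:

    /* Sketch: Elias gamma code for n >= 1.
       Emit k zeros and a one, where 2^k <= n < 2^(k+1),
       then the bottom k bits of n, most significant bit first. */
    void put_gamma(unsigned long long n) {
	int k = 0, i;

	while ((n >> k) > 1) k++;		/* k = floor(log2(n)) */
	for (i = 0; i < k; i++) putchar('0');
	putchar('1');
	for (i = k - 1; i >= 0; i--)
	    putchar((n >> i) & 1 ? '1' : '0');
    }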

Elias delta encoding

As we get to large numbers, we find ourselves needing lots of bits for the unary prefix. So Elias delta encoding applies the same idea one more time. Instead of writing k using unary encoding, write k+1 using Elias gamma encoding.
 n    code
 1    [1][][]
 2    [01][0][0]
 3    [01][0][1]
 4    [01][1][00]
 5    [01][1][01]
 6    [01][1][10]
 7    [01][1][11]
 8    [001][00][000]
 9    [001][00][001]
10    [001][00][010]
11    [001][00][011]
12    [001][00][100]
13    [001][00][101]
14    [001][00][110]
15    [001][00][111]
16    [001][01][0000]
17    [001][01][0001]
...   ...
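
A sketch of the encoder, reusing put_gamma from the sketch above for the length part:

    /* Sketch: Elias delta code for n >= 1.
       Write k+1 in gamma code, where 2^k <= n < 2^(k+1),
       then the bottom k bits of n. */
    void put_elias_delta(unsigned long long n) {
	int k = 0, i;

	while ((n >> k) > 1) k++;		/* k = floor(log2(n)) */
	put_gamma((unsigned long long)(k + 1));
	for (i = k - 1; i >= 0; i--)
	    putchar((n >> i) & 1 ? '1' : '0');
    }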

Other encodings

You can find out about Elias omega encoding (which applies the Elias idea recursively), Golomb encoding (divide n by a tunable parameter b, encoding n/b in unary, and n%b in truncated binary), and others in the Wikipedia. They won't be in the exam.

Compressing text

The more we can build into our Model, the less Surprise there will be.

The first step in modelling text is to choose an appropriate character set and say that text is a sequence of characters.

For that sequence, we can use a sequence of fixed length elements (single bytes if there are no more than 256 characters in the character set, pairs of bytes if there are no more than 65,536 characters in the character set) or a sequence of variable length elements, as in UTF-8 (and other encoding and compression schemes for Unicode, like UTF-7, UTF-EBCDIC, and TR-6).

A somewhat richer Framework says that some characters may be commoner than others. For example, I took 83,358 lines of text from Project Gutenberg (about 4.3MB), and counted characters.
 count chr     count chr     count chr     count chr
              701663 SP          4 @           4 `
                2419 !        7438 A      254246 a
               12784 "        4655 B       46583 b
                   7 #        3514 C       85093 c
                   1 $        2705 D      133298 d
                   3 %        4882 E      410679 e
                  28 &        2746 F       79642 f
               10756 '        3481 G       58893 g
                2591 (        6691 H      202018 h
    30 TAB      2597 )       12618 I      224779 i
 83358 LF        178 *        1102 J        3587 j
                   1 +        1340 K       22461 k
               59534 ,        2952 L      138009 l
               10145 -        3865 M       79073 m
               33233 .        3406 N      230331 n
                  23 /        3599 O      253465 o
                 908 0        2892 P       54258 p
                2019 1         171 Q        3034 q
                1182 2        3022 R      188953 r
                 878 3        6184 S      205266 s
                 646 4       11812 T      298917 t
                 659 5         929 U       91154 u
                 517 6         805 V       31179 v
                 493 7        4323 W       70257 w
                 779 8         423 X        5411 x
                 469 9        1129 Y       59800 y
                1703 :          81 Z        1680 z
                6191 ;         917 [          84 {
                   1 <           1 \          92 |
                   1 =         915 ]          84 }
                   1 >          84 ^           1 ~
                2634 ?        1534 _

We see, for example, that capital Q is quite rare (171 occurrences) but lower case e is very common (410,679 occurrences). Among the non-letters, plus occurs only once, while comma occurs 59,534 times. It seems silly to spend the same number of bits on common characters as on rare ones, and it is silly. We can do better by using short codes for common characters, even though that means using long codes for rare characters.

Huffman encoding

Huffman encoding provides the Framework. The Parameters for any particular document are either the character counts from which the encoding tree is derived, or the shape of the tree itself. We can transmit a tree like this:

tree_bits (Leaf c) = "0" ++ byte_bits c
tree_bits (Fork x y) = "1" ++ tree_bits x ++ tree_bits y

The idea of Huffman encoding is very simple.

  1. Start with a collection of (Leaf character,count) pairs.
  2. While this collection has more than one element,
    1. Pick the two items with the smallest counts, (x,nx) and (y,ny).
    2. Replace them with a new pair (Fork x y,nx+ny). It doesn't matter which one is taken for x and which is taken for y.

When this process is finished, we have a tree. This defines a binary code:

codes tree = visit tree ""
  where visit (Leaf c) code = [(c,code)]
        visit (Fork x y) code = visit x (code ++ "0") ++
                               visit y (code ++ "1")

These codes have the important property that none of them is a prefix of any other, so when we are decoding from left to right we are never in any doubt about which character's code we have seen next.
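
Here is a sketch of the tree-building loop in C. The struct layout and function name are mine, the two smallest counts are found by a simple O(n^2) scan rather than a priority queue, and malloc failures are not checked:

    #include <stdlib.h>

    /* Sketch: build a Huffman tree from byte counts. */
    struct node {
	struct node *left, *right;	/* both NULL for a leaf */
	int c;				/* the character, if a leaf */
	unsigned long count;
    };

    struct node *build_huffman(const unsigned long count[256]) {
	struct node *live[256];
	int n = 0, i;

	for (i = 0; i < 256; i++)
	    if (count[i] > 0) {
		struct node *p = malloc(sizeof *p);
		p->left = p->right = NULL, p->c = i, p->count = count[i];
		live[n++] = p;
	    }
	while (n > 1) {
	    struct node *fork;
	    int a = 0, b = 1, j;

	    /* find the two live trees with the smallest counts */
	    if (live[b]->count < live[a]->count) a = 1, b = 0;
	    for (j = 2; j < n; j++)
		if (live[j]->count < live[a]->count) b = a, a = j;
		else if (live[j]->count < live[b]->count) b = j;
	    /* replace them by a fork whose count is their sum */
	    fork = malloc(sizeof *fork);
	    fork->left = live[a], fork->right = live[b];
	    fork->count = live[a]->count + live[b]->count;
	    live[a] = fork;
	    live[b] = live[--n];
	}
	return n == 1 ? live[0] : NULL;	/* NULL only if every count was zero */
    }

Walking the resulting tree exactly as the codes function above does then yields the codewords.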

A Huffman code is optimal amongst prefix codes that assign each character a fixed codeword throughout the message. Roughly speaking, it is likely to halve the size of a text.

Adaptive Huffman coding

Suppose you have a text which starts in English, switches to French for a couple of pages, and then switches back to English. (This happens in Dorothy L. Sayers' "Clouds of Witness", for example.) English and French have many similarities, but we can expect the character frequencies to differ somewhat. (For example, English makes very little use of accented vowels, while French uses them a lot.) It would be better if the coder could "fit" English in the English bits and "fit" French in the French bits. Even (or especially) on a small scale, the assumption of position-independence doesn't work very well: the letter after Q is almost always U, the letter after H is almost never T even though T is very common, and so on.

We could extend Huffman encoding in two ways. One way would be to use 256 different encoding trees, depending on what the last byte was. Another would try to adapt to larger scale patterns.

In adaptive Huffman encoding, the sender and receiver start with the same tree. It doesn't matter in principle what that tree is. As it encounters each character, the sender encodes it using the current tree and only then updates the tree. The receiver updates its tree exactly the same way after decoding each character.

Lempel-Ziv methods

The basic idea of these methods is to keep a "window" of the last N characters transmitted. The sender looks at the upcoming characters and if it can find a (usefully long) prefix of the future characters which occurs in its window onto the past, it sends a code which means "such and such a substring of the past", otherwise it just encodes the next character. These methods can pick up repeating words and even phrases. They give very good compression.
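
Here is a sketch of the idea in its LZ77 flavour. The window size, the minimum match length, and the printed token format are arbitrary choices of mine; a real coder would pack the tokens into bits or bytes and would not search the window by brute force:

    #include <stdio.h>

    #define WINDOW    4096	/* how far back we look */
    #define MIN_MATCH 3		/* shorter matches are sent as literals */

    /* Sketch: emit (distance, length) tokens for repeated substrings,
       and literal characters otherwise. */
    void lz77_sketch(const unsigned char *text, int n) {
	int i = 0;

	while (i < n) {
	    int best_len = 0, best_dist = 0, j;
	    int start = i > WINDOW ? i - WINDOW : 0;

	    for (j = start; j < i; j++) {	/* try each earlier position */
		int len = 0;
		while (i + len < n && text[j + len] == text[i + len]) len++;
		if (len > best_len) best_len = len, best_dist = i - j;
	    }
	    if (best_len >= MIN_MATCH) {
		printf("(back %d, copy %d)", best_dist, best_len);
		i += best_len;
	    } else {
		printf("(literal '%c')", text[i]);
		i += 1;
	    }
	}
	printf("\n");
    }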

Word coding

Word coding says "wait a minute, text isn't really a sequence of characters, that's just an encoding; it really is a sequence of words." So we make (or maintain) a dictionary mapping words to integers, and instead of encoding a sequence of characters, we encode a sequence of word numbers. If we sort the words in the dictionary by descending order of frequency, we'll be back in the realm of integer sequences where small numbers are much more common than big numbers.

What about spaces and punctuation marks? Spaceless word coding says "assume that each word is followed by a space." What happens when it isn't? Each of the characters in your alphabet is also recorded as a word, and if the next word is one of those single characters, the implied space is cancelled. Some of these characters will also have an implied space of their own, some won't.

This is an example, albeit a short one, of how(NL)
spaceless word coding works.(NL)

is divided up as

(This )(is )(an )(example )(, )(albeit )(a )(short )(one )(, )
(of )(how )(NL)(spaceless )(word )(coding )(works )(. )(NL)

We can then encode the sequence of word numbers using Huffman coding or variable byte coding or anything that takes our fancy.

One nice consequence of using word coding followed by variable byte encoding is that we can search for a whole number of words in a compressed document by doing a byte search for the compressed query.
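
Here is a sketch of the encoding side, built on the variable byte encode function from earlier. The tiny fixed dictionary and the linear search are placeholders for a real dictionary sorted by descending frequency:

    #include <string.h>

    static const char *dictionary[] = { "the", "of", "and", "a", "to", "in" };
    #define DICT_SIZE (sizeof dictionary / sizeof dictionary[0])

    /* Sketch: emit the word number for one word using variable byte coding. */
    void encode_word(const char *word) {
	unsigned long long i;

	for (i = 0; i < DICT_SIZE; i++)
	    if (strcmp(dictionary[i], word) == 0) {
		encode(i + 1);		/* word numbers are 1-based */
		return;
	    }
	/* a real coder would emit an escape code here and spell the word out */
    }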

English and French

As an example I obtained four books from Project Gutenberg:

These four books provide about 60,000 lines of text, 3.2MB. I converted them from UTF-8 to Latin-1 using iconv(1), then stripped out the Project Gutenberg boiler-plate at top and bottom (largely so that the French texts wouldn't have too much English added). I used a C program to count characters, and an AWK script to pull the results together and make the following table.

Chr is the character. Only characters that occur at least once in at least one of the four books are listed. A character that doesn't occur at all in a document is shown as "-", while characters that occur at least once are shown as a percentage rounded to three decimal places. 1001-en and 1001-fr are the English and French versions of the Stevenson book. scott-en and scott-fr are the English and French versions of the Scott book. You will notice that the proportions of question mark characters go with the books, not the languages, while only the English texts contain " quotation marks and only the French texts contain « » quotation marks.
Chr   1001-en  1001-fr scott-en scott-fr
N/L    2.064%   1.913%   1.817%   1.820%
SPC   16.899%  14.855%  15.964%  15.349%
'!'    0.074%   0.068%   0.068%   0.082%
'"'    0.758%        -   0.595%        -
'''    0.096%   1.023%   0.108%   1.103%
'('    0.004%   0.003%   0.011%   0.012%
')'    0.004%   0.003%   0.011%   0.012%
'*'         -   0.002%   0.000%   0.000%
','    1.346%   1.474%   1.749%   1.320%
'-'    0.149%   0.501%   0.334%   0.586%
'.'    0.858%   0.829%   0.509%   0.522%
'0'    0.000%        -   0.002%   0.002%
'1'    0.001%   0.002%   0.008%   0.006%
'2'    0.000%        -   0.002%   0.003%
'3'    0.000%        -   0.003%   0.004%
'4'    0.001%        -   0.004%   0.004%
'5'    0.001%   0.001%   0.002%   0.004%
'6'    0.000%   0.000%   0.002%   0.004%
'7'    0.000%   0.000%   0.002%   0.003%
'8'         -   0.001%   0.002%   0.004%
'9'    0.000%   0.000%   0.001%   0.002%
':'    0.013%   0.034%   0.011%   0.051%
';'    0.295%   0.205%   0.092%   0.175%
'?'    0.092%   0.080%   0.058%   0.057%
'A'    0.149%   0.069%   0.106%   0.065%
'B'    0.074%   0.030%   0.129%   0.055%
'C'    0.072%   0.103%   0.157%   0.124%
'D'    0.059%   0.062%   0.106%   0.077%
'E'    0.061%   0.091%   0.040%   0.065%
'F'    0.066%   0.047%   0.058%   0.034%
'G'    0.061%   0.028%   0.046%   0.026%
'H'    0.147%   0.065%   0.111%   0.034%
'I'    0.484%   0.090%   0.251%   0.092%
'J'    0.007%   0.102%   0.011%   0.054%
'K'    0.004%   0.002%   0.065%   0.002%
'L'    0.062%   0.148%   0.127%   0.142%
'M'    0.114%   0.129%   0.091%   0.086%
'N'    0.067%   0.041%   0.035%   0.046%
'O'    0.048%   0.028%   0.046%   0.039%
'P'    0.077%   0.066%   0.073%   0.067%
'Q'    0.002%   0.032%   0.051%   0.063%
'R'    0.053%   0.052%   0.021%   0.024%
'S'    0.097%   0.100%   0.114%   0.071%
'T'    0.197%   0.036%   0.140%   0.042%
'U'    0.011%   0.048%   0.008%   0.012%
'V'    0.042%   0.085%   0.011%   0.044%
'W'    0.064%   0.002%   0.053%   0.003%
'X'    0.002%   0.002%   0.008%   0.013%
'Y'    0.056%   0.002%   0.026%   0.001%
'Z'    0.003%   0.005%   0.002%   0.001%
'['         -   0.001%   0.023%   0.016%
']'         -   0.001%   0.023%   0.016%
'_'         -   0.035%        -   0.076%
'a'    6.249%   5.913%   5.956%   5.915%
'b'    1.049%   0.671%   1.037%   0.644%
'c'    1.829%   2.318%   2.089%   2.403%
'd'    3.624%   2.876%   3.408%   2.971%
'e'    9.676%  11.449%   9.743%  11.538%
'f'    1.687%   0.767%   1.870%   0.747%
'g'    1.344%   0.661%   1.412%   0.751%
'h'    4.680%   0.598%   5.056%   0.595%
'i'    4.923%   5.697%   5.213%   5.588%
'j'    0.054%   0.464%   0.088%   0.369%
'k'    0.534%   0.034%   0.466%   0.020%
'l'    3.087%   4.085%   2.848%   3.954%
'm'    1.981%   2.289%   1.861%   2.063%
'n'    5.433%   5.525%   5.361%   5.551%
'o'    5.886%   4.134%   5.976%   4.261%
'p'    1.254%   2.191%   1.266%   2.139%
'q'    0.065%   0.873%   0.068%   1.120%
'r'    4.536%   5.383%   4.735%   5.433%
's'    4.606%   5.986%   4.935%   5.983%
't'    6.446%   5.505%   6.779%   5.436%
'u'    2.374%   5.006%   2.556%   5.247%
'v'    0.672%   1.410%   0.717%   1.352%
'w'    1.617%   0.009%   1.652%   0.038%
'x'    0.103%   0.309%   0.122%   0.304%
'y'    1.527%   0.257%   1.477%   0.202%
'z'    0.029%   0.222%   0.025%   0.138%
'«'         -   0.089%        -   0.014%
'°'         -   0.000%        -        -
'»'         -   0.081%        -   0.003%
'À'         -   0.016%        -   0.009%
'Â'         -        -        -   0.000%
'È'         -   0.001%        -   0.000%
'É'         -   0.008%        -   0.026%
'Ê'         -   0.001%        -   0.001%
'Î'         -   0.000%        -        -
'Ô'         -        -        -   0.000%
'à'         -   0.384%        -   0.415%
'â'         -   0.050%        -   0.068%
'æ'         -        -        -   0.000%
'ç'         -   0.062%        -   0.049%
'è'         -   0.243%        -   0.263%
'é'         -   1.550%        -   1.503%
'ê'         -   0.211%        -   0.191%
'ë'         -   0.005%        -   0.000%
'î'         -   0.041%        -   0.055%
'ï'         -   0.003%        -   0.006%
'ô'         -   0.052%        -   0.043%
'ù'         -   0.027%        -   0.023%
'û'         -   0.046%        -   0.049%
'ü'         -   0.000%        -        -

Lossy compression and text

Unicode introduces a number of complications. One of them is that there are several characters that all look the same. Here are just a few of many examples.
U+0041    A   LATIN CAPITAL LETTER A
U+0391    Α   GREEK CAPITAL LETTER ALPHA
U+0410    А   CYRILLIC CAPITAL LETTER A
U+FF21    Ａ  FULLWIDTH LATIN CAPITAL LETTER A
U+10300   𐌀   OLD ITALIC LETTER A
U+1D400   𝐀   MATHEMATICAL BOLD CAPITAL A
U+00B5    µ   MICRO SIGN
U+03BC    μ   GREEK SMALL LETTER MU

Add this to the fact that Unicode contains both floating diacriticals and precomposed accented letters
U+01DE                  Ǟ   LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON
U+00C4 U+0304           Ǟ   LATIN CAPITAL LETTER A WITH DIAERESIS; MACRON
U+0041 U+0308 U+0304    Ǟ   LATIN CAPITAL LETTER A; DIAERESIS; MACRON

and you will appreciate that there may be many Unicode character sequences that give you the same visual impression when rendered, so that looking for a match may be tricky.

One consequence of this is that when we want to search a text for words or phrases, we may prefer a lossy encoding that maps all the A-like characters to a common version. Unicode defines some normalization forms (NFD, NFC, NFKD, and NFKC), based on the ideas of (1) replacing accented characters by base characters + accents, (2) sorting the accents, (3) replacing forms that only exist for backwards compatibility with other standards with sensible forms, and (4) packing things back into precomposed accented characters when possible.

Each of these has its uses. Programming language standards sometimes require that identifiers be in NFC while not caring about comments; XML, for example, says “Characters in names should be expressed using Normalization Form C”.