The Wikipedia entry "Lossless compression algorithms" is a handy reference for this stuff; it covers many topics that we never had time to even mention.
This handout describes two kinds of compression:
All compression is based on the idea
Data = Model + Surprise Model = Framework + Parameters
where the Framework is some kind of representation scheme which is known to both the sender and the receiver, the Parameters are some kind of summary of a message as a whole (such as the frequency of occurrence for each character), and the Surprise is the information which the receiver cannot predict given the Model.
For example, the Unicode character set (see the Unicode web site) has room for over a million different characters; the current version, Unicode 4.1.0, has 97,720 defined codes, which would require 17 bits per character, or 16.6 bits per character if we were clever. But by choosing ISO 2022 as our Framework, which has "which character set in such-and-such a registry" as Parameter, we can encode a document like this one in ASCII, using just 7 bits per character, a big saving. Using a fancier compression method, we might save even more space. (gzip compresses UnicodeData.txt by a factor of 6.6.)
A basic information retrieval index is a mapping
index : term -> postings
where a term could be a word or a stem, and the postings for a term is a set of
(document ID, count)pairs, where a document ID is the (natural number) identifier of a document that the term occurs in and a count is the (positive integer) number of times the term occurs in that document.
These two kinds of numbers have different properties. If there are n documents in the collection, we need ceiling(log2(n)) bits to represent a document number, and if a term occurs in k of those documents, we expect them to be scattered fairly uniformly throughout the range 1..n, so there doesn't seem to be much we can exploit.
Since the postings is a set of pairs, we can rearrange the order of the pairs any way we find useful. One useful way is to sort them so that the document IDs form a strictly increasing sequence. For example, to represent the set {19497, 19646, 19619, 19599, 19597, 19468, 19495, 19668, 19422, 19446, 19600, 19442} --- a real set of document IDs I happened to have handy --- we might sort them, getting 19422 < 19442 < 19446 < 19468 < 19495 < 19497 < 19597 < 19599 < 19600 < 19619 < 19646 < 19668. Now we can take the differences, and represent the set as 19422, +20, +4, +22, +27, +2, +100, +2, +1, +19, +27, +22. In fact, if we ensure that document IDs are strictly positive, and start with an origin of 0, we can represent all the document IDs as a positive difference from the next smaller number.
Replacing a strictly increasing sequence of integers by the sequence of differences is delta encoding. The differences are all positive integers, and they cannot be any larger than the numbers you started with. The nice thing about this is that when you are dealing with a set of document IDs in the range 1..n, the more elements the set has, the smaller the differences have to get.
When we are dealing with collections of positive integers, we often find that small numbers are much commoner than big ones. For example, of the 10313 distinct words in "Mutual Aid", only 42 occur more than 255 times. Clearly, words that don't occur very often in an entire collection cannot occur very often in any one document, so we expect the document frequency numbers to be small much more often than not. In the same way, once an increasing sequence has been delta encoded, we expect small differences more often than big ones.
There is a way to deal with this which is simple, fast, and compresses tolerably well, although not perfectly well. It represents each number using a whole number of bytes.
The basic idea is that we devote one bit in each byte to answering the question "is this the last byte for a number?", and seven bits to providing data.
The "end" bit could be at the top of the byte or the bottom; you could use 0 to mean stop or 0 to mean continue. So there are at least four sensible ways to do this. It doesn't matter very much. The scheme I show here puts the end bit at the top of the byte and uses 0 for stop, so that the representation of a value 0..127 requires no change.
void encode(unsigned long long x) {
while (x > 127) {
putchar((x & 127)|128);
x >>= 7;
}
putchar(x);
}
unsigned long decode(void) {
unsigned long long x, b;
int shift;
x = 0, shift = 0;
while ((b = getchar()) > 127) {
x |= (b & 127) << shift;
shift += 7;
}
return x | (b << shift);
}
This is independent of the word size and endian-ness of the computer.
A perhaps more obvious approach would be for the first byte to indicate the number of bytes required, and then have the remaining bytes 8 bits per byte. That could handle numbers up to 2**(256*8), which is much larger than we need; the problem is that small numebrs (0..127) would require two bytes instead of one. One can imagine all sorts of hybrid schemes, but variable byte encoding is hard to beat.
The Unicode standard defines a standard way of representing a stream of Unicode characters as a stream of bytes. It is called UTF-8 (Unicode Transmission Format using 8 bit elements). It is a close cousin of variable byte encoding. However, it had to satisfy several requirements:
The scheme goes like this:
| Bytes | Unicode value |
|---|---|
| 0xxxxxxx | 00000 00000000 0xxxxxxx |
| 110yyyyy 10xxxxxx | 00000 00000yyy yyxxxxxx |
| 1110zzzz 10yyyyyy 10zzzzzz | 00000 zzzzyyyy yyxxxxxx |
| 11110www 10zzzzzz 10yyyyyy 10xxxxxx | wwwzz zzzzyyyy yyxxxxxx |
This encoding is actually redundant, so there is an additional rule: each value will be encoded using the shortest available form. It's worth noting that the so-called "UTF-8" that Java uses is not well-formed UTF-8; where real UTF-8 would use 4 bytes, Java's "modified UTF-8" will use 6 bytes, and in addition, Java represents the NUL character by the two bytes 11000000 10000000 instead of the single byte 00000000.
Unary encoding is obviously absurd. It represents the natural number n by n ones followed by a zero. If we are encoding positive integers, we can use n-1 ones followed by a zero.
While it's obviously absurd, there is a probability distribution on the natural numbers for which it is optimal: if each number k is twice as common as k+1.
Unary coding is not often used by itself, but it is often used as a building block in other codes. For example, we can see UTF-8 codes as beginning with a unary number: 0 for ASCII, and 2 3 or 4 (110 1110 or 11110) for 2 3 or 4 byte codes.
We could also use 0s followed by a 1, of course.
If our Model tells us that a number lies in the range 2k .. 2k+1-1, then the Surprise lies in the bottom k bits.
Elias gamma encoding combines unary encoding and bounded binary encoding. To encode a positive integer n, let k be the natural number such that
2k <= n < 2k+1.
Now write k in unary (as k zeros followed by a one) and then the bottom k bits of n.
| n | code |
| 1 | [1][] |
| 2 | [01][0] |
| 3 | [01][1] |
| 4 | [001][00] |
| 5 | [001][01] |
| 6 | [001][10] |
| 7 | [001][11] |
| 8 | [0001][000] |
| 9 | [0001][001] |
| 10 | [0001][010] |
| 11 | [0001][011] |
| 12 | [0001][100] |
| 13 | [0001][101] |
| 14 | [0001][110] |
| 15 | [0001][111] |
| 16 | [00001][0000] |
| 17 | [00001][0001] |
| ... | ... |
| 31 | [00001][1111] |
| ... | ... |
| 255 | [00000001][1111111] |
| ... | ... |
Compare this with variable byte encoding.
As we get to large numbers, we find ourselves needing lots of bits for the unary prefix. So Elias delta encoding applies the same idea one more time. Instead of writing k using unary encoding, write k+1 using Elias gamma encoding.
| n | code |
| 1 | [1][][] |
| 2 | [01][0][0] |
| 3 | [01][0][1] |
| 4 | [01][1][00] |
| 5 | [01][1][01] |
| 6 | [01][1][10] |
| 7 | [01][1][11] |
| 8 | [001][00][000] |
| 9 | [001][00][001] |
| 10 | [001][00][010] |
| 11 | [001][00][011] |
| 12 | [001][00][100] |
| 13 | [001][00][101] |
| 14 | [001][00][110] |
| 15 | [001][00][111] |
| 16 | [001][01][0000] |
| 17 | [001][01][0001] |
| ... | ... |
You can find out about Elias omega encoding (which applies the Elias idea recursively), Golumb encoding (divide n by a tunable parameter b, encoding n/b in unary, and n%b in truncated binary), and others in the Wikipedia. They won't be in the exam.
The more we can build into our Model, the less Surprise there will be.
The first step in modelling text is to choose an appropriate character set and say that text is a sequence of characters.
For that sequence, we can use a sequence of fixed length elements (single bytes if there are no more than 256 characters in the character set, pairs of bytes if there are no more than 65,536 characters in the character set) or a sequence of variable length elements, as in UTF-8 (and other compression schemes for Unicode, like UTF-7, UTF-EBCDIC, and TR-6).
A somewhat richer Framework says that some characters may be commoner than others. For example, I took 83,358 lines of text from Project Gutenberg (about 4.3MB), and counted characters.
| count | c | count | c | count | c | count | c |
|---|---|---|---|---|---|---|---|
| 701663 | SP | 4 | @ | 4 | ` | ||
| 2419 | ! | 7438 | A | 254246 | a | ||
| 12784 | " | 4655 | B | 46583 | b | ||
| 7 | # | 3514 | C | 85093 | c | ||
| 1 | $ | 2705 | D | 133298 | d | ||
| 3 | % | 4882 | E | 410679 | e | ||
| 28 | & | 2746 | F | 79642 | f | ||
| 10756 | ' | 3481 | G | 58893 | g | ||
| 2591 | ( | 6691 | H | 202018 | h | ||
| 30 | TAB | 2597 | ) | 12618 | I | 224779 | i |
| 83358 | LF | 178 | * | 1102 | J | 3587 | j |
| 1 | + | 1340 | K | 22461 | k | ||
| 59534 | , | 2952 | L | 138009 | l | ||
| 10145 | - | 3865 | M | 79073 | m | ||
| 33233 | . | 3406 | N | 230331 | n | ||
| 23 | / | 3599 | O | 253465 | o | ||
| 908 | 0 | 2892 | P | 54258 | p | ||
| 2019 | 1 | 171 | Q | 3034 | q | ||
| 1182 | 2 | 3022 | R | 188953 | r | ||
| 878 | 3 | 6184 | S | 205266 | s | ||
| 646 | 4 | 11812 | T | 298917 | t | ||
| 659 | 5 | 929 | U | 91154 | u | ||
| 517 | 6 | 805 | V | 31179 | v | ||
| 493 | 7 | 4323 | W | 70257 | w | ||
| 779 | 8 | 423 | X | 5411 | x | ||
| 469 | 9 | 1129 | Y | 59800 | y | ||
| 1703 | : | 81 | Z | 1680 | z | ||
| 6191 | ; | 917 | [ | 84 | { | ||
| 1 | < | 1 | \ | 92 | | | ||
| 1 | = | 915 | ] | 84 | } | ||
| 1 | > | 84 | ^ | 1 | ~ | ||
| 2634 | ? | 1534 | _ |
We see, for example, that capital Q is quite rate (171 occurrences) but lower case e is very common (410,697 occurrences). Other than letters, plus only occurs once, while comma occurs 59,534 times. It seems silly to spend the same number of bits on common characters as rare characters, and it is silly. We can do better by using short codes for common characters even though that means using long codes for rare characters.
Huffman encoding provides the Framework. The Parameters for any particular document are either the character counts from which the encoding tree is derived, or the shape of the tree itself. We can transmit a tree like this:
tree_bits (Leaf c) = "0" ++ byte_bits c tree_bits (Fork x y) = "1" ++ tree_bits x ++ tree_bits y
The idea of Huffman encoding is very simple.
When this process is finished, we have a tree. This defines a binary code:
codes tree = visit tree ""
where visit (Leaf c) code = [(c,code)]
visit (Fork x y) code = visit x (code ++ "0") ++
visit y (code ++ "1")
These codes have the important property that none of them is a prefix of any other, so when we are decoding from left to right we are never in any doubt about which character's code we have seen next.
Huffman code is optimal amongst codes that don't change throughout a message. Roughly speaking, it is likely to halve the size of a text.
Suppose you have a Text which starts in English, switches to French for a couple of pages, and then switches back to English. (This happens in Dorothy L. Sayers' "Clouds of Witnesses", for example.) English and French have many similarities, but we can expect the character frequencies to differ somewhat. (For example, English makes very little use of accented vowels, while French uses them a lot.) It would be better if the coder could "fit" English in the English bits and "fit" French in the French bits. Even (or especially) on a small scale, the assumption of position-independence doesn't work very well: the letter after Q is almost always U, the letter after H is almost never T even though T is very common, and so on.
We could extend Huffman encoding in two ways. One way would be to use 256 different encoding trees, depending on what the last byte was. Another would try to adapt to larger scale patterns.
In adaptive Huffman encoding, the sender and receiver start with the same tree. It doesn't matter in principle what that tree is. As it encounters each character, the sender encodes it using the current tree and only then updates the tree. The receiver updates its tree exactly the same way after decoding each character.
The basic idea of these methods is to keep a "window" of the last N characters transmitted. The sender looks at the upcoming characters and if it can find a (usefully long) prefix of the future characters which occurs in its window onto the past, it sends a code which means "such and such a substring of the past", otherwise it just encodes the next character. These methods can pick up repeating words and even phrases. They give very good compression.
Word coding says "wait a minute, text isn't really a sequence of characters, that's just an encoding; it really is a sequence of words." So we make (or maintain) a dictionary mapping words to integers, and instead of encoding a sequence of characters, we encode a sequence of word numbers. If we sort the words in the dictionary by descending order of frequency, we'll be back in the realm of integer sequences where small numbers are much more common than big numbers.
What about spaces and punctuation marks? Spaceless word coding says "assume that each word is followed by a space." What happens when it isn't? Each of the characters in your alphabet is also recorded as a word, and if the next word is one of those, the space is cancelled. Some of these character will also have an implied space, some won't.
This is an example, albeit a short one, of how(NL) spaceless word coding works.(NL)
is divided up as
(This )(is )(an )(example )(, )(albeit )(a )(short )(one )(, ) (of )(how )(NL)(spaceless )(word )(coding )(works )(. )(NL).
We can then encode the sequence of word numbers using Huffman coding or variable byte coding or anything that takes our fancy.
One nice consequence of using word coding followed by variable byte encoding is that we can search for a whole number of words in a compressed document by doing a byte search for the compressed query;