Andrew
Trotman
Information Retrieval
6(1):5-19
Research
into inverted file compression has focused on compression ratio – how small the
indexes can be. Compression ratio is
important for fast interactive searching.
It is taken as read, the smaller the index, the faster the search.
The
premise “smaller is better” may not be true.
To truly build faster indexes it is often necessary to forfeit
compression. For inverted lists
consisting of only 128 occurrences compression may only add overhead. Perhaps the inverted list could be stored in
128 bytes in place of 128 words, but it must still be stored on disk. If the minimum disk sector read size is 512
bytes and the word size is 4 bytes, then both the compressed and raw postings
would require one disk seek and one disk sector read. A less efficient compression technique may
increase the file size, but decrease load / decompress time, thereby increasing
throughput.
Examined
here are five compression techniques, Golomb, Elias gamma, Elias delta,
Variable Byte Encoding and Binary Interpolative Encoding. The effect on file size, file seek time, and
file read time are all measured as is decompression time. A quantitative measure of throughput is
developed and the performance of each method is determined.