Andrew Trotman, Vikram Subramanya
Compression of
term frequency lists and very long document-id lists within an inverted file
search engine are examined. Several compression schemes are compared including
Elias g and d
codes, Golomb Encoding, Variable Byte Encoding, and a class of word-based
encoding schemes including Simple-9, Relative-10 and Carryover-12. It is shown
that these compression methods are not well suited to compressing these kinds
of lists of numbers. Of those tested, Carryover-12 is preferred because it is
both effective at compression and fast at decompression.
A novel technique,
Sigma Encoding prior to compression, is proposed and tested. Sigma Encoding
utilizes a parameterized dictionary to reduce the number of bits necessary to
store an integer. This method shows an about 0.3 bit per integer improvement
over Carryover-12 while costing only about 3 extra clock cycles per integer to
decompress.