Brain Dump

Dictionary Text Compression

Tags
text-processing

A form of text compression that relies on replacing word fragments with dictionary entries: store the key for an entry instead of the entry itself to save space.
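
A minimal sketch of the idea in Python, with a made-up dictionary and integer keys (nothing here is a real scheme):

    # Sketch: swap known fragments for short dictionary keys.
    DICTIONARY = {0: "the", 1: "and", 2: "ing", 3: "tion"}  # illustrative entries
    LOOKUP = {word: key for key, word in DICTIONARY.items()}

    def encode(words):
        # Store the key when a fragment is in the dictionary,
        # otherwise store the fragment itself.
        return [LOOKUP.get(w, w) for w in words]

    def decode(tokens):
        return [DICTIONARY[t] if isinstance(t, int) else t for t in tokens]

    text = "the cat and the dog".split()
    assert decode(encode(text)) == text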

This approach has issues because:

  • Words common to many texts are pretty short, so replacing them saves few bytes.
  • A dictionary suits some texts and not others. An English dictionary isn't much use for a French text.

The main concerns with building a dictionary-based solution are:

  • Deciding how to store and transmit the data
  • Deciding which words should be in the dictionary

Adaptive Dictionary Schemes

E.g. Ziv and Lempel's LZ77/LZ78, and [see page 8, gzip].

An approach to dictionary compression that uses the entire decoded output (so far) as the dictionary. Each [see page 5, block] stores a pointer to some point earlier in the output, a length indicating how many characters to repeat from that point, and a new character to append at the end of the current block. Without that explicit character there would be no way to emit the first character, since there is nothing before it to reference.
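
A rough Python sketch of that triple encoding. The greedy linear search and the 255-character window are assumptions for illustration, not how real LZ77 implementations search:

    def lz77_encode(text, window=255):
        # Encode as (offset, length, next_char) triples. offset is how far
        # back the match starts; offset 0 / length 0 emits a bare literal.
        out, i = [], 0
        while i < len(text):
            best_off, best_len = 0, 0
            for j in range(max(0, i - window), i):
                length = 0
                # The match may run past i: that is the overlap case.
                while (i + length < len(text) - 1
                       and text[j + length] == text[i + length]):
                    length += 1
                if length > best_len:
                    best_off, best_len = i - j, length
            # The explicit next character guarantees progress and handles
            # the first character, which has nothing before it to reference.
            out.append((best_off, best_len, text[i + best_len]))
            i += best_len + 1
        return out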

See the [see page 6, example] and take special note of the wraparound recursion when we go back fewer characters than we copy, i.e. the match overlaps the region being written. Note: this is implemented using two pointers, as in the sketch below.
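
A matching decoder sketch, continuing the code above. The two pointers are the read position inside the output and the end of the output; when the length exceeds the back-reference distance, the read pointer walks over characters written moments earlier, which is the wraparound behaviour:

    def lz77_decode(triples):
        out = []
        for offset, length, char in triples:
            # Read pointer starts `offset` characters back; the write
            # pointer is the end of `out`. If length > offset, the read
            # pointer reaches characters appended during this same copy.
            src = len(out) - offset
            for k in range(length):
                out.append(out[src + k])
            out.append(char)
        return "".join(out)

    # "abababab" encodes to two literals plus one overlapping back-reference.
    assert lz77_decode(lz77_encode("abababab")) == "abababab"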

See [see page 7, considerations] for these approaches.

Note: An adaptive word-based model needs some mechanism for previously unseen words. E.g. [see page 10, using an escape character].
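
A small sketch of the escape idea (the ESC sentinel and token layout are made up for illustration): known words are sent as dictionary indices; an unseen word is flagged with the escape, spelt out literally, and added to the dictionary so later occurrences can use its index.

    ESC = -1  # illustrative escape token, not from any real format

    def adaptive_encode(words):
        dictionary, out = {}, []
        for w in words:
            if w in dictionary:
                out.append(dictionary[w])
            else:
                # Unseen word: escape, spell it out, then remember it.
                out.extend([ESC, w])
                dictionary[w] = len(dictionary)
        return out

    def adaptive_decode(tokens):
        dictionary, out, it = [], [], iter(tokens)
        for t in it:
            if t == ESC:
                w = next(it)          # literal word follows the escape
                dictionary.append(w)  # grow the dictionary in lockstep
            else:
                w = dictionary[t]
            out.append(w)
        return out

    words = "to be or not to be".split()
    assert adaptive_decode(adaptive_encode(words)) == words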
