Brain Dump

Text Coding

Tags
text-processing

Text coding determines the output representation (the codeword) of each symbol in a compression scheme, given the probability distribution supplied by a model. The general approach is:

  • Symbols that occur the most frequently should have the shortest code.
  • Symbols that occur the least frequently should have the longest code.
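
One way to make these two rules concrete is the information-theoretic ideal length of -log2(p) bits for a symbol with probability p. The sketch below is only an illustration: the function name ideal_code_lengths and the four-symbol distribution are made up for this note, not taken from any particular model.

```python
import math


def ideal_code_lengths(probabilities):
    """Map each symbol to its ideal code length in bits, -log2(p).

    `probabilities` is a hypothetical dict of symbol -> probability.
    Symbols with zero probability are skipped, since they never occur.
    """
    return {
        symbol: -math.log2(p)
        for symbol, p in probabilities.items()
        if p > 0
    }


# Made-up distribution for illustration only.
model = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = ideal_code_lengths(model)

# The most frequent symbol gets the shortest length, the rarest the longest.
for symbol, bits in sorted(lengths.items(), key=lambda item: item[1]):
    print(f"{symbol}: p={model[symbol]}, ideal length={bits:.2f} bits")
```

Real codewords have whole-number lengths, so an actual code can only approximate these values unless every probability is a power of 1/2, as in this example.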

Given a set of codewords and the symbol probabilities, we can calculate (see page 116) the expected code length, in bits per symbol, of the compressed output.
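
As a minimal sketch of that calculation: the expected length is the sum, over all symbols, of the symbol's probability times the length of its codeword. The function name expected_code_length, the probabilities, and the codewords below are all assumptions chosen for illustration.

```python
def expected_code_length(probabilities, codewords):
    """Return the expected number of bits per symbol.

    `probabilities` maps symbol -> probability and `codewords` maps
    symbol -> bit string (e.g. "110"); both are hypothetical inputs.
    """
    return sum(p * len(codewords[symbol]) for symbol, p in probabilities.items())


# Made-up model and prefix-free code for illustration only.
probabilities = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
codewords = {"a": "0", "b": "10", "c": "110", "d": "111"}

average_bits = expected_code_length(probabilities, codewords)
print(average_bits)  # 0.5*1 + 0.25*2 + 0.125*3 + 0.125*3 = 1.75

# A fixed-length code for four symbols needs 2 bits each, for comparison.
fixed = {"a": "00", "b": "01", "c": "10", "d": "11"}
fixed_bits = expected_code_length(probabilities, fixed)
print(fixed_bits)  # 2.0
```

With this distribution the variable-length code averages 1.75 bits per symbol against 2 bits for the fixed-length code, which is the saving that comes from giving frequent symbols short codewords.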