Brain Dump

Huffman Coding

Tags
text-processing

A variable-length, symbolwise coding algorithm for text compression. See here.

The Huffman coding approach uses a [see page 109, deterministic binary prefix-tree] to represent the symbols for text compression. To decode, we follow a path through the binary tree, taking one branch for each 0/1 bit we encounter in the input stream, and once we reach a leaf node we output the symbol stored at that node.

Note: A Huffman tree is guaranteed to be deterministic: every symbol sits at a leaf, so no code is a prefix of another and a bit-sequence can never lead to more than one leaf node.
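A minimal sketch of that decoding walk; the Node layout and the example tree are assumptions for illustration, not taken from the notes:

# illustrative sketch of prefix-tree decoding
class Node:
    def __init__(self, symbol=None, left=None, right=None):
        self.symbol = symbol  # set only on leaf nodes
        self.left = left      # branch followed on a 0 bit
        self.right = right    # branch followed on a 1 bit

def decode(bits, root):
    out, node = [], root
    for bit in bits:
        node = node.left if bit == "0" else node.right
        if node.symbol is not None:   # reached a leaf: emit and restart at the root
            out.append(node.symbol)
            node = root
    return "".join(out)

# example tree encoding a=0, b=10, c=11
tree = Node(left=Node("a"), right=Node(left=Node("b"), right=Node("c")))
decode("010110", tree)  # -> "abca"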

See pros/cons [see page 147, here] and [see page 159, here].

Note: Huffman coding compresses to within one bit of the entropy per symbol, and by coding blocks of symbols together it can get arbitrarily close to the entropy.

Warn: Huffman coding requires at least one bit per symbol. For example, a two-symbol alphabet with probabilities 0.99 and 0.01 has an entropy of roughly 0.08 bits per symbol, yet Huffman still spends a full bit on every symbol.

Building a Tree

  1. For each symbol in our model we create a leaf node with the symbol's probability as its value.
  2. At each iteration we join the two nodes with the smallest probabilities into a new node whose probability is the sum of the two.
  3. Once there's a single node left, we assign 0s and 1s to the two branches at each node, which fixes the code for every leaf.

Note: this approach produces a tree where the least frequent (lowest probability) symbols end up as far from the root as possible and the most frequent ones sit closest to the root; a short sketch of the construction follows.
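A minimal sketch of this construction, using Python's heapq as the priority queue; the example model and its probabilities are made up for illustration:

import heapq
from itertools import count

def build_codes(probs):
    # build Huffman codes from a {symbol: probability} model (illustrative sketch)
    tie = count()  # tie-breaker so heapq never compares the tree structures themselves
    # each heap entry: (probability, tie, tree) where a tree is a symbol or a (left, right) pair
    heap = [(p, next(tie), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, a = heapq.heappop(heap)  # the two smallest-probability nodes...
        p2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tie), (a, b)))  # ...merged into one node
    codes = {}
    def walk(tree, prefix=""):
        if isinstance(tree, tuple):     # internal node: 0 for left, 1 for right
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0" # single-symbol edge case
    walk(heap[0][2])
    return codes

build_codes({"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.1})
# -> e.g. {'a': '0', 'b': '10', 'c': '111', 'd': '110'} (ties may vary)

The tie-breaking counter only stops heapq from comparing the trees themselves; any consistent tie-break produces a valid (if different) Huffman code.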

Canonical Huffman Codes

A special form of Huffman coding where the codes are generated in a standardised format.

The general [see page 188, approach] is to find the code-length (bit count) for each symbol (this can be done with ordinary Huffman coding) and then to order the symbols by that length. We assign the first symbol a code of all 0s of its length and then keep incrementing to get the remaining codes. Whenever we move on to a longer code-length we shift the incremented code left until it matches that length, which keeps the codes prefix-free and decoding deterministic.
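A minimal sketch of that assignment, starting from already-computed (symbol, code-length) pairs; the symbols and lengths below are made up for illustration:

def canonical_codes(lengths):
    # assign canonical codes from a {symbol: code_length} map (illustrative sketch)
    # symbols are ordered by code length (and alphabetically within a length)
    symbols = sorted(lengths, key=lambda s: (lengths[s], s))
    codes, code, prev_len = {}, 0, lengths[symbols[0]]
    for sym in symbols:
        code <<= lengths[sym] - prev_len   # longer length reached: shift left to match
        codes[sym] = format(code, f"0{lengths[sym]}b")
        code += 1                          # increment for the next code of this length
        prev_len = lengths[sym]
    return codes

canonical_codes({"a": 2, "b": 2, "c": 3, "d": 3, "e": 3})
# -> {'a': '00', 'b': '01', 'c': '100', 'd': '101', 'e': '110'}

Note how the symbols grouped by length here ('a', 'b' at length 2; 'c', 'd', 'e' at length 3) already form the compact representation described next, and how the first code at each length ('00', '100') is the per-block value that the decoding loop at the end of this note compares prefixes against.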

That compact representation can be stored as simply the sequence of symbols (in order) plus the number of symbols at each code length.

Because we keep incrementing the code as we move through the symbol list, the numerical value of the codes is predictable: the first symbol in a block (the set of symbols sharing one code length) has the smallest code value in that block, and that value is larger than those of all the blocks before it.

# the first block whose first code, read as a number, is larger than
# our input's prefix of the same length indicates we've just gone
# past the block (code-length) our input lies in.
#
# See [[cite:COM3110-w05-text-compression][see page 201, [here]]].
input = "1100101010"  # ...

def find_block(bits, blocks):
    # blocks[i] = (code_length, first_code_value) for the i-th block
    for i, (length, first_code) in enumerate(blocks):
        if int(bits[:length], 2) < first_code:
            return i - 1  # our code lies in the previous block
    return len(blocks) - 1  # otherwise it's in the last (longest) block