Greedy Solution to Huffman Coding Problem

Huffman Coding

Huffman coding is a lossless data compression algorithm. The idea is to assign variable-length codes to input characters, lengths of the assigned codes are based on the frequencies of corresponding characters. The most frequent character gets the smallest code and the least frequent character gets the largest code.

Characteristics

  • Prefix code (a.k.a Prefix-free code)

    • No codeword appears as a prefix of any other codeword.

    • No ambiguity when decoding a bitstream.

      For example, if ‘a’ is encoded as ‘01’, ‘b’ cannot be encoded as ‘0’. Otherwise, when decoding ‘01’, it is ambiguous whether it is an ‘a’ (01) or a ‘b’ (0) followed by something (1).

  • Optimal code

    • Minimizes the length of the encoded bitstream.

    • The most frequent character are encoded with the least number of bits.

Applications

As a fundamental algorithm in data compression, Huffman coding is used in various compression schemes such as JPEG, MP3, and DEFLATE (PKZIP). It is also used in other algorithms or applications that require lossless compression and depression like in network transmission software.

Prefix Coding Tree

The coding tree is a binary tree that represents the mapping between characters and their corresponding codewords.

  • Full binary tree

  • Leaf nodes are characters (guaranteed prefix code)

  • Each internal node means a bit in the codeword

Example coding trees

Prefix coding trees

Credit: Introduction to Algorithms, 4th Ed., T. Cormen et al.

  • Left: Fixed length encoding

    • Total space: 3 * (45 + 13 + 12 + 16 + 9 + 5) = 300 bits

  • Right: Huffman coding

    • Total space: 1 * 45 + 3 * (13 + 12 + 16) + 4 * (9 + 5) = 224 bits

  • Contents in nodes:

    • Leaf node: <character>: <frequency>

    • Internal node: <total frequency of the subtree>

Algorithm to build a Huffman coding tree

This is a greedy algorithm that builds the tree bottom-up from the least frequent characters to the most frequent characters.

  • Create a leaf node for each symbol and add it to the min priority priority queue.

  • While there is more than one node in the min priority queue:

    • Dequeue two nodes with the lowest frequencies from the min priority queue

    • Create a new internal node with these two nodes as its children and with frequency equal to the sum of the two nodes’ frequency

    • Add the new node to the min priority queue

  • The remaining node is the root node and the tree is complete

Credit: Pseudo-code adopted from Wikipedia

Steps to build the example tree

Steps to Build the Example Tree

Credit: Introduction to Algorithms, 4th Ed., T. Cormen et al.