Text Compression for Beginners (Huffman Coding)

I'm a very curious person. This weekend I wondered how compression actually works. Whats the best way to understand something in the programming-world? Exactly, to code it. So I did.

Research

To decide which compression algorithm I want to look into I had to do some research of what exists and is commonly used. To give you an overview and better explain why I chose Huffman, here are some common compression algorithms

Lempel-Ziv-Welch (LZW)
- replaces repeating sequences of bytes with codes of fixed length, mapped in a dynamic dictionary
- known for application in GIF and early Version of TIFF
LZ77 & LZ78 (Basis of LZW)
- LZ77: Searches for repeating sequences in a sliding window on the already processed text and replaces them through a reference to their earlier occurence (distance and length)
- LZ78: Builds an explicit dictionary of found phrases
- Basis of many modern compression methods (e.g. ZIP)
Deflate
- Combination of LZ77 and Huffman-Coding
- LZ77 identifies and replaces seqences
- Huffman-Coding gets applied to the result of LZ77
- most known for usage in ZIP
Run Length Encoding (RLE)
- replaces sequences of identical Bytes through a single byte instance + amount of repetitions
- most effective on Data with long sequences of identical bytes e.g. bitmap
Huffman-Coding:
- the algorithm I decided to use

Why Huffman

You may wonder why I chose Huffman to implement. To get into compression, I expected 2 things:

applicable to files I most often use: Code (Plaintext)
relatively simple to understand and implement

Huffman vs RLE

While also being relatively easy to implement, RLE didn't meet my requirement of being applicable to plaintext. RLE is not a good fit, because in Code, JSON, XML,... you rarely have chains of a repeating character like:

AAABBBBBBBCCCDDDDDDDD -> RLE Result: A3B7C3D8

Terminology

Tree
- hierarchical datastructure
- consisting of nodes, connected by edge
- single root node, no cycles
Complete Binary Tree
- each layer is completely filled, except the last one
- last level nodes are as far left as possible
Huffman Tree
- full binary tree
Heap
- complete binary tree
- often implemented with use of arrays
- O(1) for top lookup
- O(log n) for insertion and deletion
- Learn more
min-heap
- heap where each parent node is smaller or equal than its child nodes

Node types in Kontext of a Huffman Tree (Full Binary Tree)

Let me explain what I mean by "node", keep in mind that this definition is only applicable to a Huffman-Tree (full binary tree).

    (R)
    / \
  (L) (I)
      / \
    (L) (L)
     
Root Node		(R)
Leaf Node 		(L)
Internal Node 	(I)

All nodes being at the bottom of a branch with no lower "connections" (children) to other nodes are called Leaf Nodes (L). Each Leaf Node has either an internal or the root node above it (parent).

There is only one Root Node (R), its the one at the top, from where the tree spreads. Its the origin and starting point, so it never has a parent node.

Every Node between a Leaf Node and the Root Node is an internal Node. An internal Node has either the Root Node or an internal Node as parent.

Both Root and internal Nodes must have exactly two children, each either Leaf Node or internal Node.

How does Compression work

Huffman Coding works through a tree. We count how often each character existing in the file appears. From the character to frequency map we create a Leaf-Node each.

Now we initialize a min-heap with all leaf nodes, using the frequency as priority indicator. If two leaf nodes got the same frequency, we prioritize the node with the "smaller" character (character mapping to the lower binary representation, usually handled automatically through the < > comparison).

This tie-breaking rule is essential depending on how you store your tree in the header. It ensures that the compression and decompression processes build the exact same Huffman tree, guaranteeing that the file can be correctly restored. Without a deterministic rule, different implementations might create different trees for the same input, making decompression impossible.

Now we pop the top two nodes from the heap and create an internal node, having the smaller node (automatically the first node we pop, because its a min-heap) on the left and the bigger on the right. The new nodes frequency is the sum of the two child nodes frequencies. Now we store the new node in the heap. This gets repeated until there is only one left. The last remaining node is the root of our Huffman-Tree.

Now we have to create a map. To do that, we iterate recursively over the tree while keeping track of our path. if we encounter a leaf node, we store the path in the dictionary for the character of the leaf node. Then we go back to the last Node where we got a path left we didn't go down yet.

Once we have our completed map, we just iterate over the whole file. We retrieve the path for the given character from our newly created dictionary and write a 0 bit for each left turn and a 1 for each right turn. In the end we add we add padding to the next full Byte.

Next we write a header to the output file, containing the Huffman tree and padding. Then we add the encoded bytes to the file and save it.

Congrats, your file is now compressed and depending on the size and contents of the file, you get about 50% compression.

How does Decompression work?

First we reconstruct the Tree. Then we iterate over body data bit by bit. We start at the root node and keep track of the node we're at.

Again, each 0 means we go to the left node and for a 1 we proceed to the right. Eventually we will always land on a leaf node. Once we're there, we append our decompress string by the character at the leaf node. Then we just reset our current node variable to the root node. Repeat until the amount of bits left equals the padding.

Write the result to a file, and your original contents should be exactly restored without any loss.

Compression Example

By now you're maybe wondering how a compressed string actually looks in binary. Lets say we take "Mississippi" as input. Mississippi contains:

I already ordered it in the overview above, so that would be the arrangement in the heap as well.

M & P are the least frequent characters. That leaves us with 3 as their parent node.

  (Mp 3)
  /   	\
(M 1) 	(p 2)

we insert Mp 3 into the Heap (in the real world you wouldn't name the node anything, thats just to make it easier to track for you):

3 Mp
4 i
4 s

now Mp and i are the least frequent 3 from Mp + 4 from i leaves us with seven:

   	 (Mpi 7)
   	/      \
  (Mp 3)   (i 4)
  /    \
(M 1)  (p 2)

we again insert the result node into heap:

4 s
7 Mpi

we only have two nodes left, so lets combine them into our root node

    (sMpi 11)
    /      \
  (s 4)    (Mpi 7)
   	      /      \
        (Mp 3)   (i 4)
        /    \
      (M 1)  (p 2)

For s we go left(0) from the source node:

s: 0 For i we have to go right(1) right(1):
s: 0
i 11 For M we go right(1) left(0) left(0):
s: 0
i 11
M 100 And last but not least, for p we go right(1) left(0) right(1)
s: 0
i 11
M 100
p 101

now we replace each character with its corresponding bits

Before:

M		 i		  s		   s		i		 s		   s	   i		p		 p		  i
01001101 01101001 01110011 01110011 01101001 01110011 01110011 01101001 01110000 01110000 01101001

1 Byte (8 Bits) for each character. leaves us with 11 Bytes (88 Bits)

After:

M  i ssi  ssi p   p  i
10011001 10011101 10111000

Compressed Mississippi is only 3 Byte big (21 Bits + 3 padding)

Compression (with code)

Lets look at how we would compress a string (for example read from a plaintext file).

First we need to count how often each character appears in our input String.

private func count(_ input: String) -> [Character: Int] {
	// store all amounts with character as key
    var counts = [Character: Int]()
    
    // iterate over the string, counting each character
    input.forEach { char in
    	// treat non-existing keys as having value 0
        counts[char, default: 0] += 1
    }
    return counts
}

Next, we create our Huffman-Tree.

func buildTree(
		_ counts: [Character: Int]
) throws -> HuffmanNode {
	// Convert KV pairs to leaf nodes
    let leafNodes = counts.map { key, value in
 	   	HuffmanNode(character: key, frequency: value)
    }
    
    // initialize the Heap
    let heap = Heap<HuffmanNode>{ lhs, rhs in
        lhs.frequency < rhs.frequency
    }
    heap.add(leafNodes)
    
    // repeat while more than the root node is left in heap
    while heap.count > 1 {
    	// get and remove two smallest nodes
    	// smaller one is always left child
        guard let leftNode = heap.popTop(),
         	  let rightNode = heap.popTop() else {
        	 break
       	}
       	// create new Huffman Node
       	// frequency gets calculated inside initializer
        let node = HuffmanNode(
        	left: leftNode,
        	right: rightNode
        )
        // add new node into heap
        heap.insert(item: node)
    }
    
    // get top returns nil if heap is empty
    // we safely unwrap it
    guard let top = heap.getTop() else {
        //throw Error
    }
        
    return top
}

We need to map characters to a binary sequence:

// pass root node of Huffman Tree
func buildMap(root: HuffmanNode) -> [String: [Bool]] {
	// Bool to represent 0 / 1
    var map = [String: [Bool]]()
    
    traverse(node: root, map: &map, currentPath: [])
    
    return map
    
    // recursively iterate over the tree
    func traverse(
    	// next node
    	node: HuffmanNode,
    	// map variable to mutate
    	map: inout [String: [Bool]],
    	// the path we went to get to the node
    	currentPath: [Bool]
    ) {
    	// if we're at a leaf node, save path for character
        if let char = node.character {
            map["\(char)"] = currentPath
            return
        }
        
        // If it's an internal node, continue traversal
        // Go left: append 'false' (0) to the path
        if let left = node.left {
        	// Create a copy of the path for this branch
            var newPathForLeft = currentPath
            newPathForLeft.append(false) // 0 for left
            traverse(
            	node: left,
            	map: &map,
            	currentPath: newPathForLeft
            )
        }

        // Go right: append 'true' (1) to the path
        if let right = node.right {
        	// Create a copy of the path for this branch
            var newPathForRight = currentPath
            newPathForRight.append(true) // 1 for right
            traverse(
            	node: right,
            	map: &map,
            	currentPath: newPathForRight
            )
        }
        
    }
}

Recursion is often a bit hard to understand. Let me break down how this works:

We go always as far left as possible.

Ok, great, we got our map, so its finally time to compress!

func compress(_ data: String) throws -> Data {
	// how often does each character occur
    let counts = count(data)
    // build Huffman-Tree and store root
    let treeRoot = try buildTree(counts)
    // map each character to a bitsequence
    let map = buildMap(root: treeRoot)
    
    // swift doesn't have included bitwise writing
    // so we're gonna use our own
    var bitOperator = BitWriter()
    
    // iterate over the whole to encode string
    for char in data {
        guard let sequence = map["\(char)"] else {
            // character doesn't exist in map
            // only typesafety artifact, should never happen
            return
        }
        
        // write sequence bit by bit
        for bit in sequence {
            bitOperator.write(bit: bit)
        }
    }
   	
   	// get compressed data + padding
    let (data, padding) = bitOperator.getData()
    
    // encode data including padding, tree and compressed data
    // PropertyListEncoder has smaller output than JSON
    return try PropertyListEncoder().encode(
    	KompressorFile (
    		h: .init(
    			p: padding,
    			t: treeRoot.serializeToCompactData()
    		),
    		b: data
    	)
    )
}

struct KompressorFile: Codable {
    ///header
    let h: Header
    ///body
    let b: Data
    struct Header: Codable {
        ///padding
        let p: Int
        ///huffman root node (tree)
        let t: Data
    }
}

A very important part: the Huffman-Node Datastructure. If you're not interested in those details [Jump to Decompression]

/// Represents a node in the Huffman tree. It can be either a leaf node
/// (containing a character and its frequency) or an internal node
/// (containing only a frequency and references to its children).
class HuffmanNode: Comparable {
    let character: String? // Only set for leaf nodes
    let frequency: Int
    var left: HuffmanNode?
    var right: HuffmanNode?
    
    /// Initializes a leaf node.
    init(character: Character, frequency: Int) {
        self.character = "\(character)"
        self.frequency = frequency
        self.left = nil
        self.right = nil
    }
    
    /// Initializes an internal node.
    init(left: HuffmanNode, right: HuffmanNode) {
        self.character = nil // Internal nodes don't represent a single character
        self.frequency = left.frequency + right.frequency
        self.left = left
        self.right = right
    }
    
    /// Compares two HuffmanNodes based on their frequency, then by character for tie-breaking.
    static func < (lhs: HuffmanNode, rhs: HuffmanNode) -> Bool {
        if lhs.frequency != rhs.frequency {
            return lhs.frequency < rhs.frequency
        }
        // Tie-breaking: If frequencies are equal, prioritize leaf nodes, then by character value.
        // This ensures a deterministic tree.
        if lhs.character != nil && rhs.character == nil {
            return true // Leaf node (lhs) is "smaller" than internal node (rhs) for tie-breaking
        } else if lhs.character == nil && rhs.character != nil {
            return false // Internal node (lhs) is "larger" than leaf node (rhs) for tie-breaking
        } else if let lhsChar = lhs.character, let rhsChar = rhs.character {
            return lhsChar < rhsChar // Compare characters for leaf nodes
        }
        return false
    }
    
    static func == (lhs: HuffmanNode, rhs: HuffmanNode) -> Bool {
        return lhs.frequency == rhs.frequency && lhs.character == rhs.character
    }
}

If you're interested in the Heap implementation, read the article (thanks, Tom) or have a look at the file

We need some additional helpers to keep the size of the tree small when saved in the compressed File. If you're not interested in those details [Jump to Decompression]

extension HuffmanNode {
    /// Serializes the tree into a compact binary Data object using BitWriter.
    func serializeToCompactData() -> Data {
        var writer = BitWriter()
        self.serialize(into: &writer)
        // Use your getData() method to finalize the data buffer
        return writer.getData().data
    }
    
    /// helper for recursive serialization.
    private func serialize(into writer: inout BitWriter) {
        if character != nil {
            // 1. This is a leaf node. Write a '1' bit.
            writer.write(bit: true)
            
            // 2. Then, write the character's byte value.
            guard let char = self.character?.first, let byte = char.utf8.first else {
                // Fallback for an invalid leaf node.
                writer.write(bits: 0, count: 8)
                return
            }
            // Use the multi-bit write method for efficiency and clarity.
            writer.write(bits: UInt(byte), count: 8)
        } else {
            // 1. This is an internal node. Write a '0' bit.
            writer.write(bit: false)
            
            // 2. Recursively serialize the left and right children.
            left?.serialize(into: &writer)
            right?.serialize(into: &writer)
        }
    }
}

To write bitwise, we need our own implementation in swift. If you're not interested in those details [Jump to Decompression]

struct BitWriter {
    /// The underlying buffer where bits are being accumulated.
    private var data: Data

    /// The current byte being filled with bits.
    private var currentByte: UInt8

    /// The number of bits already written into `currentByte`.
    /// Ranges from 0 to 7.
    private var bitsInCurrentByte: Int

    /// Initializes a new `BitWriter`.
    init() {
        self.data = Data()
        self.currentByte = 0
        self.bitsInCurrentByte = 0
    }

    /**
     Appends a single bit to the buffer.

     - Parameter bit: The bit to append (true for 1, false for 0).
     */
    mutating func write(bit: Bool) {
        // Shift the current byte to the left to make space for the new bit
        currentByte <<= 1

        // If the bit is true (1), set the least significant bit
        if bit {
            currentByte |= 1
        }

        bitsInCurrentByte += 1

        // If the current byte is full (8 bits), append it to the data buffer
        // and reset for the next byte.
        if bitsInCurrentByte == 8 {
            data.append(currentByte)
            currentByte = 0
            bitsInCurrentByte = 0
        }
    }

    /**
     Appends a sequence of bits from a given `UInt` value.
     The bits are taken from the least significant side of the value.

     - Parameters:
       - value: The `UInt` value containing the bits.
       - count: The number of bits to take from the `value` (starting from LSB).
                Must be between 0 and `UInt.bitWidth`.
     */
    mutating func write(bits value: UInt, count: Int) {
        // Input validation
        precondition(
            count >= 0 && count <= UInt.bitWidth,
            "Count must be between 0 and \(UInt.bitWidth)."
        )

        // Iterate from the most significant bit of the 'count' bits down to the least significant.
        // This ensures bits are written in the correct order (MSB first).
        for i in (0 ..< count).reversed() {
            // Check if the i-th bit (from right) is set in the value.
            let bit = (value >> i) & 1 == 1
            write(bit: bit)
        }
    }

    /**
     Returns the accumulated `Data` buffer.

     - Returns: The `Data` object containing the packed bits.
                This method also handles any remaining bits in `currentByte`
                by padding with zeros.
     */
    mutating func getData() -> (data: Data, padding: Int) {
        // If there are any remaining bits in currentByte,
        // we need to pad it with zeros to fill the byte
        // and append it to the data buffer.
        if bitsInCurrentByte > 0 {
            // Shift remaining bits to the left to align them to MSB of the byte
            // and effectively pad with zeros from the right.
            currentByte <<= (8 - bitsInCurrentByte)
            data.append(currentByte)
            currentByte = 0
            bitsInCurrentByte = 0
        }
        return (data: data, padding: (8 - bitsInCurrentByte))
    }
}

Decompression (with code)

Decompressing is a lot simpler, as we only have to rebuild the tree and run over it.

func decompress(_ file: KompressorFile) throws -> Data {
    let tree = HuffmanNode(fromCompactData: file.h.t)
    
    var reader = BitReader(data: file.b)
    
    var reconstructed = ""
    
    var currentNode = tree
    
    while reader.remainingBits > file.h.p {
        if let char = currentNode?.character {
            reconstructed.append(char)
            currentNode = tree
        }
        guard let bit = reader.readBit() else {
            break
        }
        guard let nextNode = bit ? currentNode?.right: currentNode?.left else {
            throw E.internalError(description: "Unexpectedly encountered leaf node")
        }
        currentNode = nextNode
    }
    
    guard let reconstructedData = reconstructed.data(using: .utf8) else {
        //error Decoded string isn't convertible to utf8
        return
    }
    return reconstructedData
}

We also need a new initializer to create our tree from compacted data. Jump to Conclusion

extension HuffmanNode {
    /// Deserializes a HuffmanNode from a compact binary Data object.
    convenience init?(fromCompactData data: Data) {
        guard !data.isEmpty else { return nil }
        var reader = BitReader(data: data)
        // The recursive helper will build the entire tree.
        guard let reconstructedNode = HuffmanNode.build(from: &reader) else {
            return nil
        }
        // Use the reconstructed node's properties to initialize self.
        self.init(
            character: reconstructedNode.character?.first,
            frequency: reconstructedNode.frequency,
            left: reconstructedNode.left,
            right: reconstructedNode.right
        )
    }

    /// Private helper for recursive deserialization.
    private static func build(from reader: inout BitReader) -> HuffmanNode? {
        guard let bit = reader.readBit() else { return nil }
        
        if bit == true { // Leaf node
            guard let byte = reader.readByte() else { return nil }
            let char = Character(UnicodeScalar(byte))
            // Frequency is not stored and not relevant anymore
            // so we use a placeholder
            return HuffmanNode(character: char, frequency: 0)
        } else { // Internal node
            guard let leftChild = build(from: &reader),
                  let rightChild = build(from: &reader)
            else {
                return nil
            }
            return HuffmanNode(left: leftChild, right: rightChild)
        }
    }

    // Helper initializer to construct a node from its components.
    private convenience init?(character: Character?, frequency: Int, left: HuffmanNode?, right: HuffmanNode?) {
        if let char = character {
            self.init(character: char, frequency: frequency)
        } else if let l = left, let r = right {
            self.init(left: l, right: r)
        } else {
            return nil
        }
    }
}

And of course we need a possibility to read bit by bit too. Jump to Conclusion

struct BitReader {
    private let data: Data
    private var byteIndex: Int
    private var bitPosition: Int// 0-7, from MSB to LSB
    var remainingBits: Int
    
    init(data: Data) {
        self.data = data
        self.byteIndex = 0
        self.bitPosition = 0
        self.remainingBits = data.count * 8
    }
    
    /// Reads a single bit from the buffer.
    /// Returns nil if no more bits are available.
    mutating func readBit() -> Bool? {
        guard byteIndex < data.count else { return nil }
        
        let currentByte = data[byteIndex]
        // Read bit from left (MSB) to right (LSB)
        let bit = (currentByte >> (7 - bitPosition)) & 1 == 1
        
        bitPosition += 1
        remainingBits -= 1
        if bitPosition == 8 {
            bitPosition = 0
            byteIndex += 1
        }
        
        return bit
    }
    
    /// Reads a full 8-bit byte from the buffer.
    /// inverse of `BitWriter.write(bits:count:8)`.
    mutating func readByte() -> UInt8? {
        var byte: UInt8 = 0
        // We need to read 8 bits to form a byte.
        // The first bit read corresponds to the MSB of the original byte.
        for i in (0..<8).reversed() {
            guard let bit = readBit() else {
                // Not enough bits left to form a full byte
                return nil
            }
            if bit {
                byte |= (1 << i)
            }
        }
        return byte
    }
}

Conclusion

Thank you so much for sticking through. It was a lot of fun to learn how this works. I hope you learnt something too.

Now I'm very curious how LZ77 works, maybe I'll write about it too.

If you're interested to see how all this looks as a cli-tool, hop over to my github.

If you enjoyed the read I would appreciate you sharing it.

Until next time, Mia

Sidenote

The repo and cli are called "kompressor" because I once got a plushie cat. It arrived vacuum packaged and never completely "uncompressed" so it got called Kompressor.