What is image file: Huffman Coding
grade 2: huffman coding
In previous post, the introduction about JPEG, we introduced the compression algorithm “Huffman Coding”, this post is about it.
Huffman coding is widely used in:
- File compression: zip / gzip
- Image compression: jpeg / png
- Text compression: word2vec
A slight note about word2vec, which treats words in corpus as leaf nodes, based on words’ frequency as weights, to build corresponding huffman tree.
Now as you could see, we need to understand huffman coding because it was adopted in NLP and CV domains.
There was said huffman coding was invented by David A. Huffman in 1952 while he was attending exam in MIT.
The description in detail wiki.
- Lossless data compression.
- Optimal prefix code.
- Variable length code for different symbols.
- Greedy approach upon frequency of characters.
- Complexity of nlogn to #n unique characters.
- Length of code of character is inverse to its frequency.
begin count frequency of each character sort into ascend sequence make each char a leaf node (character, frequency, left_son, right_son) put nodes into queue Q while (Q >= 2) do begin pop the first/smallest two nodes (n1, n2) create a new node with sum of chosen nodes insert new node into sorted Q end append 0 to left_son and 1 to right_son from the root make output for each input character end
# Huffman Coding class Node(): def __init__(self,name=None,value=None): self._name = name self._value = value self._left = None self._right = None class HuffmanTree(): # based on leaf node, grow the tree def __init__(self,char_weights): # create node for each char self.a = [Node(part, part) for part in char_weights] while len(self.a) != 1: self.a.sort(key = lambda node : node._value,reverse = True) c = Node( value = (self.a[-1]._value + self.a[-2]._value)) c._left = self.a.pop(-1) c._right = self.a.pop(-1) self.a.append(c) self.root=self.a # save huffman code into b, range 10 is the tree depth self.b = list(range(10)) # create code recursively def pre(self,tree,length): node = tree if (not node): return elif node._name: print(node._name + ' : ', end='') for i in range(length): print(self.b[i], end='') print('\n') return self.b[length] = 0 self.pre(node._left, length+1) self.b[length] = 1 self.pre(node._right,length+1) # create huffman code def get_code(self): self.pre(self.root, 0) if __name__ == '__main__': # take inputs of char & frequency char_weights = [('i',3), ('p',7), ('h',6), ('o',5), ('n',1), ('e',2)] tree = HuffmanTree(char_weights) tree.get_code()
Let’s consider another scenario, which explains the usage of Huffman coding well. Assume we have scores of a great amount of students’ exams, which range from 0 to 100, and we want to level them from A to E.
One common approach is the if / else if / else statement, something like:
if score < 61 else if score < 71 else if score < 81 else if score < 91 else:
In a first looking, this approach would work, but if take the score distribution into consideration, i.e. majority of the scores are in between 70 ~ 80, we will know this approach is not efficient.
This is the place where Huffman coding works, if we know the distribution as following:
|91 ~ 100||81 ~ 90||71 ~ 80||61 ~ 70||0 ~ 60|
How would we rewrite the conditional statement?
if score >=71 and score <81: else if score >=81 and score < 91: else if score score >=61 and score < 71: else if score if score <61: else:
Now, let’s talk about encoding and its reverse operation of decoding. Say a string of “abcc”, what will its huffman coding looks like?
We could easily figure out in this example:
But wait, what will the coding looks like towards “bacc”?
It may still be:
Or a: 01, b: 00, c: 1
Towards 2 characters share same frequency, it doesn’t matter to sort in alphabetical sequence or not, during preparation for huffman tree construction, as long as the code could be generated and decoded properly.
Secret to decode a encoded huffman content without ambiguity lies in the “prefix code”. To understand prefix code, let’s see what is none-prefix code first.
Still, take the “abcc” example, if we have:
a: 00, b: 10, c: 001
This is not a prefix, because given a encoded string like: 0010001001, we would have problem to tell what 00 means, it could be an “a” or the prefix to “c”, which brings ambiguity.
So, a prefix property means, each of prefix code is unique, and, starts looking up from each code’s first position, it won’t be contained in any other codes start from their first positions.
If we have the following encoding string: 000111 and huffman man tree: a: 00, b: 01, c: 1, we may start interpret from its beginning and get out: a b c c.
If we have the same encoding string: 000111 and huffman man tree: b: 00, a: 01, c: 1, we may start interpret from its beginning and get out: b a c c.
[post status: half done]