Today I learned about Trie data structure. Trie (pron. like "try" to distinguish from "tree") is a tree-like data structure used to store prefixes of the words and words. The root represents empty string "". Every node contains a letter and leaf nodes contain words. Going from root to the leaf we can build the word contained in a leaf with the letters present in the nodes on our path.

This data structure is often used to implement autocomplete. The leaf nodes can contain also metadata about the word it represents, for example: number of times this word was queried. Using this data we can then respond with top X most popular queries.

Every time user types a letter, we can do the following steps:

  1. Traverse the trie using typed letters to find the node which contains typed prefix.
  2. Traverse the subtree from found node to find all words of that subtree.
  3. Sort the words based on metadata.

This approach can be improved further, by caching top searched words on each prefix node. So each node with a letter (non-leaf node) will also contain top words searched using the prefix it represents. This changes the memory requirements, but can significantly improve the query speed.

We can improve even more by storing the trie in a Hash Map. This is called Prefix Hash Tree. The idea store every prefix from the trie as a key in the Hash Map, then the list of top words for that prefix will be stored as a value. This way we have the complexity of O(1).

The trie can also be simplified to O(1) by using few optimizations. Let's start from the initial implementation. We have the following steps:

  1. Traverse the tree to found node representing prefix — O(c), where c is the length of the prefix.
  2. Traverse subtree of a found node — O(N), where N is the number of nodes in the Trie (generally it's the number of nodes in the subtree, but worst case scenario is N)
  3. Sort the words by score — O(klog(k)), where k is the number of words (number of leaf nodes)

The summary complexity is then O(c) + O(N) + O(klogk)

I already mentioned first optimization, which is to move the top X words to a non-leaf nodes as a metadata. This gives us O(c) + O(1) (we don't need to get all nodes of a tree, and we don't need to sort them). Then we can agree on a limit for a prefix, for example 50 characters. This gives us O(1) instead of O(c). Of course, we'll still need to get to the node which represents our prefix and in the case of Prefix Hash Tree, we don't need to do that.