Skip to the content.

Auto complete service

Use cases

Design a Google Search like auto completion service, which takes user’s real time input and return a set of top ranked completions from a given prefix.

Requirements

Design

Data model and Algorithms

trie-tree

Using trie tree as the data structure is a get-go solution. We could store the ranking score along with each word node. So that we could sort on it to get the top k records.

Current time complexity:

From above solution, the O(N) could be the bottle neck if we have a large number of sub nodes to be processed. And this could be true since we want to generate the completions based on all users’ input as one of the requirements.


A improvement from above is to store the list of completions as part of each node. This is called pre-computation.

pre-computation-trie-tree

This speeds up the read performance, but brings in some issues as well:

However, the cost on updating ranking scores is trivial comparing to the read performance. We could have a asyc process updating the ranking scores with some intervals, e.g. every 10mins or every 1000 read requests.

If we really want to improve on storing extra data, we could:

Current time complexity:

If we limited the L and K, the time complexity is constant.


Now we need to persist the data model into database, because we do not want to lose the prefix tree we have generated in the case of system rebooting. If using relational database, we need to have ORM(Object-Relation-Mapping) to map the data structure into schemas which might increase the overhead of encoding and decoding and have an impact on performance. With NoSQL databases, things will be easier because a tree structure could be stored as a Document(MongoDB) or a nested JSON. Updating the tree structure would be also easier, and also distribute the persistent data.

We could use Prefix Hash Tree to store our prefixes and completions, because it could map the prefix to a hash key and the completions to the hash value, so it could be very easily implemented in NoSQL database.

prefix-hash-tree

The disadvantage will be the space. It does not share the prefix of each key in the map. But we could accept the tradeoff.


Lets say we have the maping between prefix and completions list. Since we have to return a sorted completions based on sores, so we have two options: a) sort on read, b) sort on write. If we sort on read, it definitely will increase the response time. So sort on write might be a better solution.

And We have the following operations need to consider:

The database processing is time consuming, we could have a background process to do the job instead of doing it synchronously.

How to store tree in data store

how to store tree in data store

Cache

Now another concern comes into the picture, disk IO is expensive for reading/writing the prefixes, scores back and forth between the service and database.

So, we could use an in-memory based database as a cache in between disk based database and server code. Reids is a good candidate. The LRU policy could keep the latest records and if cache misses, data could be read from the disk. Another reason of picking redis is that it is also distributed in-memory database which has a very high scalabilty.

On disk database

We discussed on NoSQL database could be a best fit in our case. And a key-value based database could make the persistence process easier, because it could avoid some overhead to encoding/decoding data between two different models.

Data sharding

If we support average 5 words auto suggestion, each word has average 10 characters, and we have 1 billion unit sentences. So we will end up with 5 * 10 * 8(byte) * 1,000,000,000 = 400GB data. This data cannot fit in to memory and cannot fit into regular K8S node. So we have to shard our data across all nodes in our service cluster.

// TODO: Add design for data sharding

Data replication

// TODO: Add design for data sharding

Client design

Further analysis

References