
Design a text-based search service

Requirements

Design a system that supports a search feature based on arbitrary text input, such as Twitter search by keywords, or YouTube/Netflix search by video title.

Functional requirements

Non-functional requirements

Questions to ask

Assumptions

8 * 10 * 4 bytes * 500 * 60 minutes * 24 hours ≈ 230 MB / day

230 MB * 365 days * 10 years ≈ 840 GB in total for 10 years

This means we have to build an index over 840 GB of data in total, which definitely cannot be held in memory on a single machine. We want to design a distributed service.
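As a sanity check, here is the same estimate as a small Python calculation (assuming my reading of the numbers above: roughly 320 bytes per document arriving at 500 documents per minute):

```python
# Back-of-the-envelope storage estimate.
# Assumed inputs: ~320 bytes per document (8 words x 10 chars x 4 bytes),
# 500 new documents per minute.
bytes_per_doc = 8 * 10 * 4
docs_per_minute = 500

bytes_per_day = bytes_per_doc * docs_per_minute * 60 * 24
bytes_total = bytes_per_day * 365 * 10

print(f"{bytes_per_day / 1e6:.0f} MB/day")  # -> 230 MB/day
print(f"{bytes_total / 1e9:.0f} GB total")  # -> ~840 GB over 10 years
```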

Design overview

High-level architecture

(Figure: high-level architecture)

How does content get indexed

Tokenizing

Real-time tokenizing


documentID | list of (term, offset) pairs
-----------|--------------------------------
0          | [(i, 0), (like, 2), (apple, 7)]
1          | [(liverpool, 0), (wins, 2)]
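A minimal sketch of this step, assuming simple regex word-splitting with character offsets (a real tokenizer would also handle stemming, stop words, language-specific segmentation, etc.):

```python
import re

def tokenize(document_id: int, text: str):
    """Return (documentID, [(term, character offset), ...])."""
    terms = [(m.group().lower(), m.start()) for m in re.finditer(r"\w+", text)]
    return document_id, terms

print(tokenize(0, "I like apple"))
# (0, [('i', 0), ('like', 2), ('apple', 7)])
```
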
Offline tokenizing

(Figure: daily data processing pipeline)

Inverted index building

(Figure: inverted index building)

Why use parallel arrays instead of linked lists? Array elements occupy consecutive memory addresses, while linked-list nodes are allocated at scattered addresses, so arrays are more cache-friendly and faster to traverse.
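A toy sketch of the parallel-array layout (my illustration, not Earlybird's actual format): each term maps to two parallel arrays, one of documentIDs and one of offsets.

```python
from collections import defaultdict

class InvertedIndex:
    """Toy inverted index; each term's postings live in two parallel arrays."""

    def __init__(self):
        # term -> (doc_ids, offsets): parallel arrays, cheap to scan in order.
        self.postings = defaultdict(lambda: ([], []))

    def add(self, doc_id, terms):
        for term, offset in terms:
            doc_ids, offsets = self.postings[term]
            doc_ids.append(doc_id)
            offsets.append(offset)

    def lookup(self, term):
        return self.postings[term][0]  # matching documentIDs

index = InvertedIndex()
index.add(0, [("i", 0), ("like", 2), ("apple", 7)])
index.add(1, [("liverpool", 0), ("wins", 2)])
print(index.lookup("apple"))  # [0]
```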

Offline inverted indexing

inverted-indexing-2

Earlybird shards

We need to distribute the index across all Earlybird servers to balance the load of both the index-building step and the query-serving step. We could perform the index building (the step above) on the Earlybird servers themselves, which means we have to partition and distribute the load up front. At the tokenizing stage, the key is the documentID and the value is a list of terms, so we can simply partition by documentID: each document usually contains relatively random terms, so each server ends up with a relatively even mix of terms.
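A minimal routing sketch, assuming simple hash-mod partitioning by documentID (a production system would more likely use consistent hashing to make resharding cheaper):

```python
NUM_SHARDS = 4  # assumed shard count

def shard_for(doc_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Route a document to an index shard by hashing its documentID."""
    return hash(doc_id) % num_shards

print(shard_for(42))  # e.g. 2
```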

Earlybird roots (Blender)

Merge the search results from the different Earlybird machines and enrich the merged result. This is the same idea as Twitter's Blender.
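A minimal merge sketch, assuming each shard returns (score, docID) hits already sorted by descending score; the root performs a k-way merge and keeps the top k:

```python
import heapq
import itertools

def blend(shard_results, k=10):
    """K-way merge of per-shard (score, docID) lists, each sorted by score desc."""
    merged = heapq.merge(*shard_results, key=lambda hit: hit[0], reverse=True)
    return list(itertools.islice(merged, k))

print(blend([[(0.9, 3), (0.4, 7)], [(0.8, 1)]], k=2))
# [(0.9, 3), (0.8, 1)]
```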

How does a text-based query work

There are two phases:

1. Retrieval: look up the posting lists for the query terms and combine them to get candidate documentIDs.
2. Ranking: score the candidates for relevance and sort the final results.

The posting list has limited room in memory to hold all relevance/ranking-related info, so we could have a plugin mechanism that takes a documentID as input and returns a score to sort by. However, calling out to an external service increases latency. It is really a tradeoff between storage and performance: we either store the score alongside the posting list, or decouple it into an external service for scalability. Alternatively, we could make the ranking service affinitive to (co-located with) the query service to reduce network calls.
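One way to sketch such a plugin mechanism (hypothetical names): the query path depends only on a scoring interface, so the score can come from memory next to the posting list or from an external ranking service.

```python
from typing import Protocol

class Scorer(Protocol):
    def score(self, doc_id: int) -> float: ...

class InMemoryScorer:
    """Scores stored alongside the index: fast lookups, but costs index memory."""
    def __init__(self, scores: dict):
        self.scores = scores
    def score(self, doc_id: int) -> float:
        return self.scores.get(doc_id, 0.0)

def rank(candidate_doc_ids, scorer: Scorer, k: int = 10):
    """Sort candidate documents by plugin-provided score, highest first."""
    return sorted(candidate_doc_ids, key=scorer.score, reverse=True)[:k]

print(rank([0, 1], InMemoryScorer({0: 0.3, 1: 0.9})))  # [1, 0]
```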

How does the index server handle high concurrent reads/writes

How do document updates work

Note: This is my personal thinking

We keep a flag in each posting to indicate that the entry should be dropped during segment merges and skipped during queries. This is similar to how database index segments handle deletions during merges (tombstones).
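A minimal sketch of this tombstone idea (my illustration): queries skip deleted postings at read time, and a merge physically drops them.

```python
from dataclasses import dataclass

@dataclass
class Posting:
    doc_id: int
    offset: int
    deleted: bool = False  # tombstone flag set on document deletion/update

def query(postings):
    """Skip tombstoned entries at read time."""
    return [p.doc_id for p in postings if not p.deleted]

def merge_segments(*segments):
    """Physically drop tombstoned entries while merging segments."""
    return [p for segment in segments for p in segment if not p.deleted]
```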

How does the index server handle failures (high availability)

Keep replicas across different index servers for HA; the replicas can be copied asynchronously. In the worst case, the in-memory inverted index is lost, but the offline inverted-indexing pipeline can rebuild it.

What are the bottlenecks and how to improve them

Reduce the size of the posting list

https://blog.twitter.com/engineering/en_us/topics/infrastructure/2016/omnisearch-index-formats.html

Reduce the indexing latency and enhance the original design

There are two major factors that could drag down performance.

https://blog.twitter.com/engineering/en_us/topics/infrastructure/2020/reducing-search-indexing-latency-to-one-second.html

Sending indexing requests in batches instead of multiple HTTP requests

https://www.elastic.co/blog/found-keeping-elasticsearch-in-sync
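For example, Elasticsearch's `_bulk` endpoint accepts many index operations in a single NDJSON request. A rough sketch, assuming a local cluster and a hypothetical `tweets` index:

```python
import json
import requests

docs = [
    {"id": 0, "text": "I like apple"},
    {"id": 1, "text": "Liverpool wins"},
]

# The bulk body is NDJSON: an action line, then a source line, per document.
lines = []
for doc in docs:
    lines.append(json.dumps({"index": {"_index": "tweets", "_id": doc["id"]}}))
    lines.append(json.dumps({"text": doc["text"]}))
body = "\n".join(lines) + "\n"  # the bulk API requires a trailing newline

resp = requests.post(
    "http://localhost:9200/_bulk",
    data=body,
    headers={"Content-Type": "application/x-ndjson"},
)
print(resp.json()["errors"])  # False when every operation succeeded
```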

Notes on Elasticsearch

TBA

Notes on the Lucene index

TBA

References