Design i18n service

What is i18n

Internationalization (i18n) is the process of adapting software to accommodate different languages, cultures, and regions (i.e., locales), while minimizing the additional engineering changes required for localization.

Requirements

Functional requirements

A shared service that takes a locale and a phrase as input and returns the phrase in that locale.

Non-functional requirements

Support for various types of clients.
Highly available: as a shared service, it cannot tolerate downtime.
Low latency: when a client sends a request, the service needs to return the result with low latency.
Scalable: more and more phrases in different languages may be added, so the service needs to scale.
Read heavy: the shared service is read heavy.

Data models

Phrase

A phrase consists of a word or a sentence. It is the base unit in a translation service.

type Phrase struct {
	ID       string // A globally unique ID which identifies a phrase
	Locale   string // A string used to identify the locale, e.g., en, fr, cn
	Version  string // Used to track different versions of a phrase. A timestamp could also serve as the version.
	Default  string // The default text of the phrase
	Singular string // The text of the phrase in singular form, e.g., he or 他
	Plural   string // The text of the phrase in plural form, e.g., they or 他们
	// ...
}

We could use ID+Locale as the key to identify every phrase, including translated phrases, e.g., 123_en, 123_fr. We could also use ID+Locale+Version as the key to identify every versioned phrase. A timestamp would also work as the version.
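The key scheme described above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the "_" separator is taken from the 123_en example, which assumes that IDs and locales never contain "_" themselves.

```go
package main

import "strings"

// PhraseKey builds the ID+Locale key, e.g. "123_en".
// Assumption: neither id nor locale contains the "_" separator.
func PhraseKey(id, locale string) string {
	return id + "_" + locale
}

// VersionedPhraseKey builds the ID+Locale+Version key, e.g. "123_fr_v2".
// The version could equally be a timestamp string.
func VersionedPhraseKey(id, locale, version string) string {
	return strings.Join([]string{id, locale, version}, "_")
}

// ParsePhraseKey splits an ID+Locale key back into its parts.
// ok is false when the key does not have exactly two non-empty parts.
func ParsePhraseKey(key string) (id, locale string, ok bool) {
	parts := strings.Split(key, "_")
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", "", false
	}
	return parts[0], parts[1], true
}
```

Keeping key construction in one place makes it easier to change the separator or add the version later without touching every caller.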

Data persistence

We need to store the base phrase when it is created, and also need to store the translated phrases.

SQL or NoSQL

The data model is simple and there is no ACID requirement, so it does not have to be SQL.
It is hard to define a one-size-fits-all schema up front for a shared service, so the schema is likely to change in the future. NoSQL is more flexible in that respect.
Not having to enforce a schema could also improve read and write performance.

Which NoSQL to use

A key-value store could be used to handle i18n in programming, e.g., Java i18n messaging, where there is no need to worry about singular or plural forms.

A wide-column store could be used to handle more advanced cases.

	default	singular	plural
123_cn	他	他	他们
123_en	he	he	they
124_cn	你好
124_en	hello

Considering extensibility, a wide-column data store such as DynamoDB, Bigtable, or Cassandra would be a good idea.

Leader based or leaderless

Cassandra has a leaderless distributed architecture, whereas DynamoDB and Bigtable are leader-based.

For handling writes, both are fine. Cassandra is optimized for writes, and a leader-based database such as DynamoDB also fits here, since the shared service is read heavy.
For handling reads, Cassandra might perform poorly if we want strong consistency, because each read needs to reach a quorum. DynamoDB might also be slow if we need strong consistency, because reads need to be handled by the leader. In this particular case we have no strong consistency requirement, so either one should be fine.
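To make the quorum cost concrete, the general overlap rule for leaderless replication can be written as a small check: with N replicas, a read of R replicas is guaranteed to include at least one replica that saw the latest write of W replicas only when R + W > N. The function names below are hypothetical, and the sketch describes the general rule rather than any particular Cassandra configuration.

```go
package main

// QuorumSize returns the majority quorum for n replicas: floor(n/2) + 1.
func QuorumSize(n int) int {
	return n/2 + 1
}

// ReadsSeeLatestWrite reports whether every read set of size r overlaps
// every write set of size w among n replicas, i.e. whether r + w > n.
func ReadsSeeLatestWrite(n, r, w int) bool {
	return r+w > n
}
```

With 3 replicas, quorum reads and quorum writes (R = W = 2) overlap, while R = W = 1 does not. Since this service has no strong consistency requirement, it could read from a single replica and avoid the extra round trips.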

APIs

type I18n interface {
	// Add a phrase, return phrase ID and error status
	AddPhrase(defaultText, singular, plural, locale string) (string, error)
	// Get a phrase, return the phrase and error status
	GetPhrase(id, locale string) (string, error)
	// Translate a phrase from one locale to another
	// externalTranslator is a hook point which could be used to link to external translation service
	Translate(id, sourceLocale, targetLocale string, externalTranslator func(phrase, sourceLocale, targetLocale string) (string, error)) (string, error)
}
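As a rough illustration of how these three methods fit together, here is a minimal in-memory sketch. MemoryI18n, NewMemoryI18n, and ErrNotFound are hypothetical names, IDs are simple counters, and the singular and plural arguments are accepted but ignored to keep the example short. A real implementation would sit in front of the data store and cache discussed in this document.

```go
package main

import (
	"errors"
	"strconv"
	"sync"
)

// ErrNotFound is returned when no phrase exists for the given ID and locale.
var ErrNotFound = errors.New("phrase not found")

// MemoryI18n is a hypothetical in-memory stand-in for the real service.
type MemoryI18n struct {
	mu      sync.RWMutex
	nextID  int
	phrases map[string]string // key: ID + "_" + locale, value: default text
}

// NewMemoryI18n returns an empty store whose first generated ID is "1".
func NewMemoryI18n() *MemoryI18n {
	return &MemoryI18n{nextID: 1, phrases: make(map[string]string)}
}

// AddPhrase stores the default text and returns a newly generated ID.
func (m *MemoryI18n) AddPhrase(defaultText, singular, plural, locale string) (string, error) {
	if defaultText == "" || locale == "" {
		return "", errors.New("defaultText and locale are required")
	}
	m.mu.Lock()
	defer m.mu.Unlock()
	id := strconv.Itoa(m.nextID)
	m.nextID++
	m.phrases[id+"_"+locale] = defaultText
	return id, nil
}

// GetPhrase returns the stored text for id under locale.
func (m *MemoryI18n) GetPhrase(id, locale string) (string, error) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	text, ok := m.phrases[id+"_"+locale]
	if !ok {
		return "", ErrNotFound
	}
	return text, nil
}

// Translate looks up the source text, calls the external translator hook,
// and stores the result under the target locale.
func (m *MemoryI18n) Translate(id, sourceLocale, targetLocale string,
	externalTranslator func(phrase, sourceLocale, targetLocale string) (string, error)) (string, error) {
	source, err := m.GetPhrase(id, sourceLocale)
	if err != nil {
		return "", err
	}
	translated, err := externalTranslator(source, sourceLocale, targetLocale)
	if err != nil {
		return "", err
	}
	m.mu.Lock()
	defer m.mu.Unlock()
	m.phrases[id+"_"+targetLocale] = translated
	return translated, nil
}
```

Passing the translator as a function keeps the vendor out of the service itself: tests can supply a stub, and production code can supply a call to whichever external translation service is configured.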

Architecture

Airbnb i18n platform

My own architecture

[Figure: architecture diagram]

Components

Content service

The service handles the addition of a base phrase under a particular locale.
The service handles the retrieval of a phrase under a particular locale.
The service sends translation requests to an external, pluggable translator in batches.

Translation service

Provide a hook point to talk to an external translator, so that different translation vendors can be plugged in.
Provide a hook point to talk to an external configuration service, e.g., GitHub.
Implement the logic that decides when to translate, where to get the phrases, where to store the results, and which external translator to use.
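One way to picture these hook points is a small vendor interface plus a config that lists target locales per source locale. Everything named here (Translator, Config, Event, TranslateBatch, fakeTranslator) is hypothetical and only illustrates the shape. The event fields follow the id/default/sourceLocale event used in the workflow.

```go
package main

import "fmt"

// Translator is a hypothetical plug-in interface for an external vendor.
type Translator interface {
	Translate(text, sourceLocale, targetLocale string) (string, error)
}

// Config is a hypothetical translation config: the target locales to
// produce for each source locale.
type Config struct {
	Targets map[string][]string
}

// Event describes a newly added phrase that needs translating.
type Event struct {
	ID           string
	Default      string
	SourceLocale string
}

// TranslateBatch translates every buffered event into every configured
// target locale and returns the results keyed by ID + "_" + locale.
// It stops at the first vendor error.
func TranslateBatch(events []Event, cfg Config, tr Translator) (map[string]string, error) {
	out := make(map[string]string)
	for _, ev := range events {
		for _, target := range cfg.Targets[ev.SourceLocale] {
			text, err := tr.Translate(ev.Default, ev.SourceLocale, target)
			if err != nil {
				return nil, fmt.Errorf("translate %s to %s: %w", ev.ID, target, err)
			}
			out[ev.ID+"_"+target] = text
		}
	}
	return out, nil
}

// fakeTranslator is a stand-in vendor used only for illustration.
type fakeTranslator struct{}

// Translate tags the text with the target locale instead of translating it.
func (fakeTranslator) Translate(text, sourceLocale, targetLocale string) (string, error) {
	return "[" + targetLocale + "] " + text, nil
}
```

Swapping vendors then means supplying a different Translator value. The batching and storage logic stays the same.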

Workflow

Users add a phrase either programmatically through the client library or through the REST endpoint.
- AddPhrase("This is an apple.", "", "", "en")
The DNS server resolves the FQDN, and the request is sent to the GLB.
The API gateway performs AuthN/AuthZ, rate limiting, etc.

The validated request reaches the Content Service.

“This is an apple.” goes through a deduplication check (a hash of the text could be used).
The Content Service assigns ID 123 to “This is an apple.”.

The Content Service stores it in the database.

	default	singular	plural
123_en	This is an apple.

The Content Service adds or updates the cache entry (this could happen concurrently with storing the phrase in the database).
- key: 123_en
- value: This is an apple.
The Content Service publishes an event to an event bus; the consumer of this event bus is the Translation Service.
- {id:123, default:"This is an apple.", sourceLocale:"en"}
The Translation Service reads the translation config to learn which target locales the phrase needs to be translated into.
The Translation Service buffers the events and processes them in batches.
The Translation Service translates the phrase into the target locales by invoking the external translator.

The Translation Service stores the translated phrases in the database and updates the cache as well.

	default	singular	plural
123_en	This is an apple.
123_cn	这是一个苹果。

A client that wants to retrieve the phrase reads from the cache, falling back to the database on a cache miss.
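The read path in the last step can be sketched as a cache-aside lookup. Store, MapStore, ReadPhrase, and ErrNotFound are hypothetical names, and repopulating the cache after a miss is an assumption that the workflow above does not spell out.

```go
package main

import "errors"

// ErrNotFound is returned when the phrase is in neither the cache nor the database.
var ErrNotFound = errors.New("phrase not found")

// Store is a hypothetical minimal interface shared by the cache and the database.
type Store interface {
	Get(key string) (string, bool)
	Set(key, value string)
}

// MapStore is an in-memory Store that stands in for both layers.
type MapStore map[string]string

// Get returns the value stored under key, if any.
func (m MapStore) Get(key string) (string, bool) {
	v, ok := m[key]
	return v, ok
}

// Set stores value under key.
func (m MapStore) Set(key, value string) {
	m[key] = value
}

// ReadPhrase tries the cache first, falls back to the database on a miss,
// and repopulates the cache so the next read is served from it.
func ReadPhrase(cache, db Store, id, locale string) (string, error) {
	key := id + "_" + locale
	if text, ok := cache.Get(key); ok {
		return text, nil
	}
	text, ok := db.Get(key)
	if !ok {
		return "", ErrNotFound
	}
	cache.Set(key, text)
	return text, nil
}
```

Because the cache is only an optimization here, losing it degrades latency but not correctness: every miss still resolves against the database.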

Design details

Row structure if using wide-column

	default	singular	plural
123_cn	你好
123_en	hello

	cn_default	cn_singular	cn_plural	en_default	en_singular	en_plural
123	你好			Hello

	default	singular	plural
123	encode(cn:你好,en:Hello)

There are three ways to structure a row (the three layouts above). Each has pros and cons, and a few things need to be considered:

The row key would be the partition key. The first layout is more flexible for partitioning, whereas the other two can only be partitioned by ID, which means all translations of a phrase are stored in the same partition. Keeping all translations together is not a bad idea, because it provides some data locality.
The third layout has a performance cost when querying a particular phrase under a particular locale, because the entire (potentially large) cell must be retrieved and then filtered.
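The difference between the first and third layouts shows up directly in the lookup code. In this sketch, plain maps stand in for the table, and the encode() step is assumed to be JSON, which the original does not specify. The function names are hypothetical.

```go
package main

import "encoding/json"

// LookupPerLocaleRow models the first layout: one row per ID + locale,
// so a single key lookup returns exactly the requested text.
func LookupPerLocaleRow(rows map[string]string, id, locale string) (string, bool) {
	text, ok := rows[id+"_"+locale]
	return text, ok
}

// LookupEncodedCell models the third layout: one row per ID whose cell
// holds every locale in one encoded blob (JSON here, as an assumption).
// The whole cell is decoded before the requested locale can be picked out.
func LookupEncodedCell(rows map[string][]byte, id, locale string) (string, bool, error) {
	cell, ok := rows[id]
	if !ok {
		return "", false, nil
	}
	var byLocale map[string]string
	if err := json.Unmarshal(cell, &byLocale); err != nil {
		return "", false, err
	}
	text, ok := byLocale[locale]
	return text, ok, nil
}
```

The decode cost in the third layout grows with the number of locales stored per phrase, while the first layout always touches a single small row.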

Failure handling

Content Service failure

The Content Service is stateless and has multiple replicas to distribute the load. If any Content Service instance goes down, we can simply restart it. For example, the Content Service could run as a Kubernetes Deployment with 3 replicas.
Clients can retry a failed request if no ACK is received. The Content Service is idempotent and deduplicates phrases, so retried requests are safe.
Batch translation is triggered by an event, which is either sent or not. If a client retries, the event is triggered again.
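A minimal sketch of what makes a retried add safe: hash the locale and text, and return the existing ID when the hash has been seen before. DedupStore and its methods are hypothetical, SHA-256 is one possible choice of hash, and the starting ID of 123 simply mirrors the workflow example.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"strconv"
	"sync"
)

// DedupStore is a hypothetical Content Service store that deduplicates
// phrases by hashing their locale and text.
type DedupStore struct {
	mu     sync.Mutex
	nextID int
	byHash map[string]string // content hash -> phrase ID
}

// NewDedupStore returns an empty store whose first generated ID is "123".
func NewDedupStore() *DedupStore {
	return &DedupStore{nextID: 123, byHash: make(map[string]string)}
}

// contentHash hashes locale and text together. The zero-byte separator
// keeps ("en", "x") distinct from ("e", "nx").
func contentHash(locale, text string) string {
	sum := sha256.Sum256([]byte(locale + "\x00" + text))
	return hex.EncodeToString(sum[:])
}

// AddPhrase returns the existing ID when the same text was already added
// under the same locale, so a retried request has no further effect.
// created reports whether a new phrase was stored.
func (s *DedupStore) AddPhrase(text, locale string) (id string, created bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	h := contentHash(locale, text)
	if existing, ok := s.byHash[h]; ok {
		return existing, false
	}
	id = strconv.Itoa(s.nextID)
	s.nextID++
	s.byHash[h] = id
	return id, true
}
```

A caller could use the created flag to decide whether to publish the translation event again on a retry. Whether to re-publish is a design choice, and re-publishing is harmless as long as the consumer is idempotent too.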

Translation Service failure

Is the Translation Service stateful or stateless? It depends on whether we use a message queue (MQ) or Kafka as the event bus.
- If using Kafka, the Translation Service needs to persist the offset and coordinate between consumer threads (deduplication).
- If using an MQ, the Translation Service does not need to persist what has been processed; it just consumes the next message after a restart.

If the Translation Service needs to persist the offset, there are two options for handling failures:

Have the producer side or an external coordinator (e.g., ZooKeeper) take periodic checkpoints. The Translation Service resumes from the latest checkpoint on restart.
Replicate the offset across multiple instances, like a built-in distributed metadata store.
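The checkpoint-and-resume idea can be illustrated without a real broker. Below, a slice stands in for the event log and memCheckpoint stands in for the durable offset store. All names are hypothetical, and the sketch saves the checkpoint after every event rather than periodically.

```go
package main

// CheckpointStore is a hypothetical durable store for the consumer offset
// (an external coordinator or a replicated metadata store in practice).
type CheckpointStore interface {
	Load() int64       // offset of the next event to process
	Save(offset int64) // persist the offset of the next event to process
}

// memCheckpoint is an in-memory stand-in that outlives a simulated restart.
type memCheckpoint struct{ next int64 }

// Load returns the saved offset.
func (c *memCheckpoint) Load() int64 { return c.next }

// Save records the offset of the next event to process.
func (c *memCheckpoint) Save(offset int64) { c.next = offset }

// ConsumeFrom processes up to limit events starting at the checkpoint and
// saves the checkpoint after each one. It returns the processed events.
// A crash between handle and Save would replay one event on restart, so
// handle should be idempotent (at-least-once delivery).
func ConsumeFrom(events []string, cp CheckpointStore, limit int, handle func(string)) []string {
	var done []string
	for i := cp.Load(); i < int64(len(events)) && len(done) < limit; i++ {
		handle(events[i])
		cp.Save(i + 1)
		done = append(done, events[i])
	}
	return done
}
```

Calling ConsumeFrom a second time with the same checkpoint store models a restart: processing continues from the saved offset instead of from the beginning of the log.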

Database failures

A distributed data store has its own mechanisms for handling failures such as node crashes. The basic idea is to keep replicas and use leader election (coordination and concurrency handling).

Cache layer failures

A distributed cache has its own failure handling.
The worst case is a cache miss, in which case requests are redirected to the data store.

Client local store

A client could be in an air-gapped environment.
A local store reduces latency, because the client does not have to send every request to the backend to retrieve a phrase.
Some code has the phrase bundle packaged within the client library, and that phrase bundle needs to be kept in sync.

This is why Airbnb’s i18n platform has a client library and an i18n agent installed on the client side:

The goal is to keep the client local store in sync with the server side. The server side is the source of truth.

Pull (on the left side): The client pulls the delta, but it is hard to choose the pulling frequency.
Event-driven (on the right side): A backend update triggers a notification to the client, so the client can load the updates.
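The pull variant can be sketched with a version counter: the server stamps every change with an increasing version, and the client asks only for changes newer than the last version it applied. Entry, Server, and LocalStore are hypothetical names, and the versioning scheme is an assumption for illustration. The event-driven variant could reuse the same Pull, triggered by a notification instead of a timer.

```go
package main

// Entry is one phrase change on the server, stamped with a monotonically
// increasing version (a hypothetical scheme for this sketch).
type Entry struct {
	Key     string // ID + "_" + locale
	Text    string
	Version int64
}

// Server holds the change history and is the source of truth.
type Server struct{ changes []Entry }

// Update records a new text for key and returns the version assigned to it.
func (s *Server) Update(key, text string) int64 {
	v := int64(len(s.changes) + 1)
	s.changes = append(s.changes, Entry{Key: key, Text: text, Version: v})
	return v
}

// DeltaSince returns every change with a version greater than since,
// in the order the changes were made.
func (s *Server) DeltaSince(since int64) []Entry {
	var out []Entry
	for _, e := range s.changes {
		if e.Version > since {
			out = append(out, e)
		}
	}
	return out
}

// LocalStore is the client-side copy of the phrases.
type LocalStore struct {
	phrases map[string]string
	synced  int64 // highest version applied so far
}

// NewLocalStore returns an empty local store that has synced nothing yet.
func NewLocalStore() *LocalStore {
	return &LocalStore{phrases: make(map[string]string)}
}

// Pull applies the delta since the last sync and returns the number of
// changes applied.
func (l *LocalStore) Pull(s *Server) int {
	delta := s.DeltaSince(l.synced)
	for _, e := range delta {
		l.phrases[e.Key] = e.Text
		l.synced = e.Version
	}
	return len(delta)
}

// Get reads a phrase from the local copy only, without contacting the server.
func (l *LocalStore) Get(key string) (string, bool) {
	text, ok := l.phrases[key]
	return text, ok
}
```

Because Get never leaves the process, lookups keep working while the client is offline, and the cost of staleness is bounded by how often Pull runs.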
