
Replication

The reasons to replicate data:

Leader and followers replication

Synchronous vs. asynchronous replication

Fully synchronous:

Semi-synchronous: one of the followers is synchronous; the others are asynchronous (see the sketch below).

Fully asynchronous:
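
As a rough illustration of the three options, here is a minimal sketch of the leader's write path, assuming hypothetical follower objects that expose a blocking replicate() method (none of this is a real API):

```python
# Sketch only: contrasts the three replication modes on the leader.
# `followers` are hypothetical objects with a blocking replicate(record) method;
# error handling and timeouts are omitted.
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=8)

def write_fully_sync(leader_log, followers, record):
    leader_log.append(record)
    for f in followers:               # acknowledge only after every follower has the record
        f.replicate(record)
    return "ok"

def write_semi_sync(leader_log, followers, record):
    leader_log.append(record)
    sync_follower, *async_followers = followers
    sync_follower.replicate(record)   # one follower must confirm before we acknowledge...
    for f in async_followers:         # ...the rest replicate in the background
        pool.submit(f.replicate, record)
    return "ok"

def write_fully_async(leader_log, followers, record):
    leader_log.append(record)
    for f in followers:               # acknowledge immediately; all replication is background work
        pool.submit(f.replicate, record)
    return "ok"
```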

Adding new followers

If a new follower is added, it starts with empty data, and the leader needs to replicate its data to that follower. There are two potential issues: 1) If we add multiple followers at the same time, the leader will be under pressure to replicate its data to all of them. 2) The leader keeps accepting write requests, so the data is in flux; a plain file copy does not make sense.

For #1, we could let some of the existing followers take load off the leader: they replicate their data to the new members first, and once a new member has caught up with them, the leader takes over and continues the replication.

For #2, we could take a consistent snapshot of the leader's database, copy the snapshot to the new member, and have the new member request the backlog of data changes since the snapshot. etcd uses a different approach: it replicates the data in chunks, e.g. a first round of 10KB of the current data, a second round of 8KB, and so on, until the chunk size reaches a predefined threshold, at which point etcd treats the follower as "up-to-date".
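
A minimal sketch of the snapshot-plus-backlog approach; the leader/follower methods used here (take_snapshot, changes_since, apply, ...) are assumptions for illustration, not a real API:

```python
# Sketch: bootstrap a brand-new follower from a consistent snapshot,
# then replay the backlog of changes made while the snapshot was copied.
def bootstrap_follower(leader, new_follower):
    snapshot, snapshot_offset = leader.take_snapshot()   # snapshot + its position in the log
    new_follower.load_snapshot(snapshot)

    offset = snapshot_offset
    while True:
        backlog = leader.changes_since(offset)           # writes accepted after `offset`
        if not backlog:
            break                                        # caught up: follower is now up-to-date
        for change in backlog:
            new_follower.apply(change)
        offset = backlog[-1].offset                      # each change carries its log offset
```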

Handling node outages

Follower failure: the data changes are persisted on local disk, so if a follower fails it can recover by reloading the data changes from its local disk and then requesting from the leader the changes it missed while it was down.

Leader failure: if the leader fails, a new leader is promoted; followers learn who the new leader is, and clients redirect all write requests to the new leader. A leader failure might cause data loss if the new leader has not replicated all the data from the old leader.
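
A minimal sketch of follower catch-up recovery, assuming a durable local log and a leader that can serve the changes made after a given offset (hypothetical names):

```python
# Sketch: a failed follower rebuilds its state from its local log,
# then fetches whatever it missed from the leader.
def recover_follower(local_log, leader, state_machine):
    for change in local_log.replay():                    # 1) replay what is already on disk
        state_machine.apply(change)

    for change in leader.changes_since(local_log.last_offset()):
        local_log.append(change)                         # 2) catch up on writes made while down
        state_machine.apply(change)
```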

Implementations of log replication

Problems with replication lag

Data can be inconsistent between the leader and its followers due to replication lag, unless we do synchronous log replication. This is a trade-off between availability and consistency.

Read your own write issue

[figure: read-you-own-write]

If we DO NOT have a cache solution: when a user writes a Facebook comment, they might not see the comment right after the write; when a user updates their Facebook profile, they might not see the update right after the write.

We could redirect such read requests to the leader:

If the user enters something from one device and reads it from a different device, they should still see the info just entered. E.g. after uploading a video from one device, we should be able to see it from another device.
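
A minimal sketch of that routing rule: remember (server-side, so it also covers the multi-device case) when each user last wrote, and send their reads to the leader for a short window afterwards. The names and the 60-second window are assumptions:

```python
# Sketch: read-your-own-writes by routing a user's reads to the leader
# for a short window after that user's last write.
import random
import time

READ_YOUR_WRITES_WINDOW = 60          # seconds; an arbitrary choice for the sketch

last_write_at = {}                    # user_id -> timestamp of that user's last write

def record_write(user_id):
    last_write_at[user_id] = time.time()

def pick_replica_for_read(user_id, leader, followers):
    recently_wrote = time.time() - last_write_at.get(user_id, 0) < READ_YOUR_WRITES_WINDOW
    return leader if recently_wrote else random.choice(followers)
```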

Monotonic read issue

The user sends several read requests (e.g. refreshing a page), and the requests could be sent to different replicas. This can make the data the user sees inconsistent, appearing to move backwards in time. To avoid this, we should route requests from the same user to the same replica.
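
A minimal sketch of that routing rule: hash the user id so the same user always reads from the same replica (names are assumptions):

```python
# Sketch: monotonic reads by pinning each user to one replica.
import hashlib

def replica_for_user(user_id, replicas):
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return replicas[int(digest, 16) % len(replicas)]   # same user -> same replica every time
```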

Consistent prefix reads

This can be a big problem in chat services, e.g. iMessage or Snapchat: in a group chat, an observer might see the messages in an unexpected order.

[figure: consistent-prefix-read]

One possible solution is to make sure that any related writes are written to the same partition.
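
A minimal sketch of that idea for a chat service: partition by conversation id (not by message id), so all messages of one group chat land on the same partition and are read back in write order. Names are assumptions:

```python
# Sketch: consistent prefix reads by keeping causally related writes together.
import hashlib

def partition_for(conversation_id, num_partitions):
    digest = hashlib.sha1(conversation_id.encode()).hexdigest()
    return int(digest, 16) % num_partitions

def write_message(partitions, conversation_id, message):
    p = partition_for(conversation_id, len(partitions))
    partitions[p].append(message)     # every message of this conversation goes to partition p
```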

// TODO: Add more detailed solution

Multi-leader replication

A big downside of leader-based replication is that all writes must go through the leader. Some databases, like etcd, serve reads through the leader as well. So the leader must be able to handle that pressure, and it can easily become a bottleneck or fail under load.

One extension is to allow more than one node to accept writes. This is called multi-leader replication. Usually this happens across multiple datacenters.

[figure: multiple-leaders]

Pros of multi-leader replication:

Cons of multi-leader replication:

Handle write conflicts

How leaders communicate

[figure: multi-leaders-communication]

MySQL only supports a circular topology for now; the most general one is all-to-all.

Leaderless replication

Leader-based replication is built on the idea that all write requests are sent to one node (the leader), which requires the leader node to have the capacity to handle the peak traffic. Another model that reduces this burden is leaderless replication. Amazon's Dynamo uses this model.

How does it work

Handle stale data

It is possible for nodes to be offline; when those nodes come back online they may have stale data. When a client reads from the replicas, that stale data could be read as well. There are several solutions to make sure the client gets the latest data:
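
One of the common techniques is read repair: read from several replicas, pick the newest version, and write it back to any replica that returned a stale copy. A minimal sketch, assuming hypothetical replica objects with get/put and values tagged with a version number:

```python
# Sketch: read repair in a leaderless system.
def read_with_repair(key, replicas):
    responses = [(r, r.get(key)) for r in replicas]       # each value is (version, payload)
    newest_version, newest_value = max((v for _, v in responses), key=lambda t: t[0])
    for replica, (version, _) in responses:
        if version < newest_version:
            replica.put(key, (newest_version, newest_value))   # fix the stale copy on the way out
    return newest_value
```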

Handle network partition

There are m nodes in the cluster, and n out of the m nodes are used for data replication (m >> n). We have w as the write quorum, which is usually n/2 + 1. If there is a network partition between the client and the w nodes, we cannot reach the quorum for write operations. We have two solutions:
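
Whichever solution we pick, the quorum arithmetic itself is simple to express; a minimal sketch (the helper names are assumptions):

```python
# Sketch: quorum sizes and the overlap condition for n replicas.
def write_quorum(n):
    return n // 2 + 1                 # the usual majority write quorum

def quorums_overlap(n, w, r):
    # Every read set intersects every write set, so a quorum read sees the latest quorum write.
    return w + r > n

def can_accept_write(reachable_replicas, n):
    return reachable_replicas >= write_quorum(n)   # partitioned off from too many nodes -> reject
```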

Handle overwrites and concurrent write

Those requests might arrive at different nodes in different orders. We cannot simply overwrite the previous value with the current value; that would leave the data permanently inconsistent. See below for more details on the solutions.

How to tell if requests are concurrent or happens-before

Concurrent Shopping Cart

[figure: concurrent-shopping-cart]

Happens Before Writes

[figure: happens-before-writes]
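
A common way to answer this programmatically is to attach a version vector to each write and compare the vectors; a minimal sketch, representing a vector as a dict of node id to counter (names are assumptions):

```python
# Sketch: deciding happens-before vs. concurrent with version vectors.
def happens_before(a, b):
    """True if the write carrying vector `a` happened before the one carrying `b`."""
    keys = set(a) | set(b)
    return a != b and all(a.get(k, 0) <= b.get(k, 0) for k in keys)

def concurrent(a, b):
    return not happens_before(a, b) and not happens_before(b, a)

# {"n1": 2} happens before {"n1": 3, "n2": 1};
# {"n1": 2} and {"n2": 1} are concurrent.
```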

Last write wins solution

Cassandra uses this solution.
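
A minimal sketch of the idea, assuming every write carries a timestamp (names are assumptions):

```python
# Sketch: last write wins -- keep only the value with the highest timestamp.
def lww_merge(current, incoming):
    """Each value is a (timestamp, payload) tuple; the later timestamp wins."""
    if current is None or incoming[0] > current[0]:
        return incoming
    return current                    # the losing concurrent write is silently dropped
```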

Pros:

Cons:

Shopping cart solution

[figure: concurrent-shopping-cart]

In the end, we might want to return the conflicting values to the user and let the user converge them in the shopping cart, or simply return the union of the values to the user.
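
A minimal sketch of the "return the union" option, representing each sibling cart as a set of item ids (names are assumptions; removals would need tombstones, which this ignores):

```python
# Sketch: converge concurrent shopping-cart siblings by taking their union.
def merge_siblings(siblings):
    merged = set()
    for cart in siblings:             # each cart is a set of item ids kept by a different replica
        merged |= cart
    return merged

# merge_siblings([{"milk", "eggs"}, {"milk", "flour"}]) -> {"milk", "eggs", "flour"}
```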

Algorithm

Pros:

Cons: