
Deep dive into configuration changes in distributed systems

Safety

config-change

In the diagram above, S1 and S2 can form a majority of C-old while S3, S4, and S5 form a majority of C-new. We must prevent two leaders, one from C-old and one from C-new, from being elected in the same term.
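To make the risk concrete, here is a minimal sketch that enumerates every pair of majorities that share no server. It assumes C-old is {S1, S2, S3} and C-new is {S1, S2, S3, S4, S5}; the function names are made up for illustration.

```python
from itertools import combinations


def majorities(servers):
    """Return every subset of `servers` that forms a strict majority."""
    ordered = sorted(servers)
    need = len(ordered) // 2 + 1
    return [
        set(combo)
        for size in range(need, len(ordered) + 1)
        for combo in combinations(ordered, size)
    ]


def disjoint_majorities(c_old, c_new):
    """Return (old_majority, new_majority) pairs that have no server in common."""
    return [
        (q_old, q_new)
        for q_old in majorities(c_old)
        for q_new in majorities(c_new)
        if not (q_old & q_new)
    ]


# Assumed memberships; the text only names the two majorities.
c_old = {"S1", "S2", "S3"}
c_new = {"S1", "S2", "S3", "S4", "S5"}

pairs = disjoint_majorities(c_old, c_new)
for q_old, q_new in pairs:
    print(sorted(q_old), "vs", sorted(q_new))
```

Under these assumptions the search finds three disjoint pairs, including {S1, S2} against {S3, S4, S5}. Each pair is a way for two leaders to win votes in the same term without any server voting twice.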

Safety of adding or removing one server at a time

change-one-member-at-a-time

Adding or removing one server at a time prevents the cluster from splitting into two independent majorities, because any majority of C-old must overlap any majority of C-new. It is therefore impossible to elect two leaders in the same term.
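The overlap can be checked with a simple counting argument: if the two majority sizes add up to more than the total number of distinct servers, the two majorities must share at least one server. The sketch below is only that sufficient check, with hypothetical function names.

```python
def majority(n):
    """Smallest number of servers that is a strict majority of n."""
    return n // 2 + 1


def overlap_guaranteed(old_size, new_size, union_size):
    """True when the pigeonhole principle forces any two majorities to intersect.

    `union_size` is the number of distinct servers across both configurations.
    A False result only means this counting argument gives no guarantee.
    """
    return majority(old_size) + majority(new_size) > union_size


# Adding one server: n -> n + 1, and the union has n + 1 servers.
add_one_ok = all(overlap_guaranteed(n, n + 1, n + 1) for n in range(1, 50))

# Removing one server: n -> n - 1, and the union has n servers.
remove_one_ok = all(overlap_guaranteed(n, n - 1, n) for n in range(2, 50))

# Jumping from 3 to 5 servers: majorities of 2 and 3 fit in 5 without overlap.
three_to_five_ok = overlap_guaranteed(3, 5, 5)

print(add_one_ok, remove_one_ok, three_to_five_ok)
```

The first two checks hold for every cluster size tried. The third is false, which matches the unsafe case where a majority of C-old and a majority of C-new are disjoint.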

Workflow


The leader may crash before C-new is committed. In that case a new leader is elected, and the client can retry the configuration change because it never received a response from the previous leader.
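The retry can be sketched with a toy model. Everything here is hypothetical (`ToyCluster`, `LeaderCrashed`, `change_with_retry`) and does not reflect any real library. The sketch also assumes the request names the full target membership, so that repeating it is harmless.

```python
class LeaderCrashed(Exception):
    """Hypothetical error: the leader died before replying to the client."""


class ToyCluster:
    """Toy stand-in for a cluster whose leader may crash before C-new commits."""

    def __init__(self, members, crashes_before_commit=1):
        self.members = set(members)
        self.crashes_left = crashes_before_commit
        self.term = 1

    def change_membership(self, new_members):
        new_members = set(new_members)
        if new_members == self.members:
            return self.term  # already in effect, so a retry changes nothing
        if self.crashes_left > 0:
            self.crashes_left -= 1
            self.term += 1  # a new leader takes over in a later term
            raise LeaderCrashed("no response from the previous leader")
        self.members = new_members  # C-new is committed
        return self.term


def change_with_retry(cluster, new_members, max_attempts=3):
    """Retry the change until it commits; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        try:
            cluster.change_membership(new_members)
            return attempt
        except LeaderCrashed:
            continue  # no response, so ask the new leader again
    raise RuntimeError("configuration change did not commit")


cluster = ToyCluster({"s1", "s2", "s3"})
attempts = change_with_retry(cluster, {"s1", "s2", "s3", "s4"})
print(attempts, sorted(cluster.members))
```

Stating the desired end membership rather than a delta ("add s4") is one way to keep the retry safe when the client cannot tell whether the first attempt took effect.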

Safety of adding or removing arbitrary servers at a time

The solution, described in Section 4.3 of the paper, uses two phases.

joint-consensus
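Assuming the two phases are the joint consensus approach from the Raft literature, the cluster first enters a transitional configuration in which every decision needs separate majorities of both C-old and C-new, and only afterwards switches to C-new alone. A minimal sketch of that quorum rule, with illustrative names and memberships:

```python
def has_majority(acks, config):
    """True if the acknowledging servers include a strict majority of `config`."""
    return len(acks & config) >= len(config) // 2 + 1


def joint_quorum(acks, c_old, c_new):
    """During the joint phase, a decision needs a majority of C-old AND of C-new."""
    return has_majority(acks, c_old) and has_majority(acks, c_new)


# Assumed memberships for illustration.
c_old = {"S1", "S2", "S3"}
c_new = {"S1", "S2", "S3", "S4", "S5"}

only_old = joint_quorum({"S1", "S2"}, c_old, c_new)        # majority of C-old only
only_new = joint_quorum({"S3", "S4", "S5"}, c_old, c_new)  # majority of C-new only
both = joint_quorum({"S1", "S2", "S4"}, c_old, c_new)      # majority of each

print(only_old, only_new, both)
```

Neither {S1, S2} nor {S3, S4, S5} can decide alone during the joint phase, so the two groups cannot each elect a leader in the same term.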

Availability

Availability of adding or removing one server at a time

Catching up new servers

When a new server is added, it starts with an empty log:

a. The leader must replicate its entire log to the new member, which can overload the leader.

b. As mentioned in the etcd blog, increasing the quorum size immediately when a new member joins can cause several problems.

c. Suppose the cluster has s1, s2, and s3, and we add s4. If a network partition isolates s3 before s4 catches up with the leader's log, only s1 and s2 can acknowledge new entries, which is short of the three-server quorum of a four-member cluster. The cluster cannot commit anything until s4 catches up or the partition heals.
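The arithmetic behind case (c) can be sketched as follows. The helper names are hypothetical, and the model assumes a server can acknowledge a new entry only when it is reachable and its log is already caught up.

```python
def quorum(n):
    """Number of acknowledgements needed to commit in an n-voter cluster."""
    return n // 2 + 1


def can_commit(voters, able_to_ack):
    """True if enough voters are reachable and caught up to reach a quorum."""
    return len(voters & able_to_ack) >= quorum(len(voters))


# Three voters, s3 partitioned away: s1 and s2 still form a quorum of 2.
before_add = can_commit({"s1", "s2", "s3"}, {"s1", "s2"})

# s4 joins as a voter immediately: the quorum rises to 3, but s4 is still
# catching up and s3 is unreachable, so only s1 and s2 can acknowledge.
during_catch_up = can_commit({"s1", "s2", "s3", "s4"}, {"s1", "s2"})

# Once s4 has caught up, s1, s2, and s4 form a quorum again.
after_catch_up = can_commit({"s1", "s2", "s3", "s4"}, {"s1", "s2", "s4"})

print(before_add, during_catch_up, after_catch_up)
```

The middle case is the unavailability window: the same partition that a three-member cluster tolerates stalls the four-member cluster until s4 has the leader's log.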

Solution:

Removing current leader

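The leader must wait until C-new is committed before it steps down. Consider a two-server cluster with s1 (the leader) and s2. If s1 steps down before s2 has received C-new, s2 can never be elected as the new leader: it still operates under C-old, so it needs s1's vote, and s1 will not grant it because s1's log is more up to date.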

Disruptive servers

Availability of adding or removing arbitrary servers at a time

Reading materials