
Raft distributed consensus

Faults it handles

Leader election

(figure: leader / follower / candidate role states)

Term: Election + Normal operation under a single leader

Heartbeats and Timeouts

Election basics when a server starts an election:

Election safety: At most one leader per term

Election liveness: Some candidate must eventually become a leader

Normal operation

Log structure:

(figure: log structure)

The different colors and the numbers inside the boxes represent different terms. Each server has its own copy of the log, persisted on disk.
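
As a rough sketch, each entry in that log can be modeled as a term plus the client command, with the index given by the entry's position; the Go types and field names below are illustrative, not taken from the paper.

```go
package main

import "fmt"

// LogEntry is a minimal sketch of one slot in a Raft log.
// Field names are illustrative, not from any particular implementation.
type LogEntry struct {
	Term    int    // term in which the leader received the command
	Command string // opaque client command to apply to the state machine
}

func main() {
	// Each server keeps its own copy of the log; index 1 is the first entry.
	log := []LogEntry{
		{Term: 1, Command: "x=3"},
		{Term: 1, Command: "y=1"},
		{Term: 2, Command: "x=7"},
	}
	for i, e := range log {
		fmt.Printf("index=%d term=%d cmd=%s\n", i+1, e.Term, e.Command)
	}
}
```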

Workflow:

Log consistency:

Because of the above properties, there is an AppendEntries consistency check:

(figure: AppendEntries consistency check)
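
A minimal sketch of the follower-side check (1-based indices; the helper name is mine): the follower accepts new entries only if its log contains an entry at prevLogIndex whose term matches prevLogTerm.

```go
package main

import "fmt"

type LogEntry struct {
	Term    int
	Command string
}

// consistencyCheck returns true if the follower's log contains an entry
// at prevLogIndex whose term matches prevLogTerm (indices are 1-based;
// prevLogIndex == 0 means the leader is sending from the start of the log).
func consistencyCheck(log []LogEntry, prevLogIndex, prevLogTerm int) bool {
	if prevLogIndex == 0 {
		return true
	}
	if prevLogIndex > len(log) {
		return false // follower's log is too short
	}
	return log[prevLogIndex-1].Term == prevLogTerm
}

func main() {
	followerLog := []LogEntry{{Term: 1, Command: "x=3"}, {Term: 2, Command: "y=1"}}
	fmt.Println(consistencyCheck(followerLog, 2, 2)) // true: terms match at index 2
	fmt.Println(consistencyCheck(followerLog, 3, 2)) // false: follower has no index 3
}
```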

Leader changes

When the leader changes, the logs on different servers might not be identical. The leader's log is the single source of truth, and the leader eventually makes the followers' logs identical to its own.

(figure: log consistency)

Suppose there is a network partition during term 5 (S1–S3 in one group, S4–S5 in the other) and the leader for term 5 is S2. The committed entry from term 5 needs to be present in the logs of all future leaders; otherwise that entry could be lost if, say, S5 becomes the leader later.

Safety requirement:

Pick the best leader:

(figure: pick the best leader)

The entry at (Term 2, Index 5) is committed. But if S3 becomes unavailable, the new leader must be picked from S1 and S2. If the new leader is S2, which does not have (Term 2, Index 5), the committed entry would be lost. So leader election needs to pick the server with the "most complete" log.

Voters deny the vote if: lastTermOfVoter > lastTermOfCandidate || (lastTermOfVoter == lastTermOfCandidate && lastIdxOfVoter > lastIdxOfCandidate)
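
Put differently, the voter grants its vote only if the candidate's log is at least as up to date as its own: a higher last term wins, and on a tie the longer log wins. A minimal sketch of that comparison (function and variable names are illustrative):

```go
package main

import "fmt"

// candidateUpToDate reports whether the candidate's log is at least as
// up to date as the voter's: higher last term wins; on a tie, the longer
// log (higher last index) wins.
func candidateUpToDate(lastTermVoter, lastIdxVoter, lastTermCand, lastIdxCand int) bool {
	if lastTermCand != lastTermVoter {
		return lastTermCand > lastTermVoter
	}
	return lastIdxCand >= lastIdxVoter
}

func main() {
	// Voter's last entry is (term=2, index=5); candidate only reaches (term=2, index=4).
	fmt.Println(candidateUpToDate(2, 5, 2, 4)) // false: vote denied
	fmt.Println(candidateUpToDate(2, 5, 3, 2)) // true: higher last term wins
}
```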

This guarantees that S4 and S5 will NOT be elected as the new leader in the following scenario:

(figure: pick the best leader)

However, the following case can still cause trouble. The term-2 leader replicated its entry only on S1 and S2 before its term ended. S5 was then elected leader for term 3, appended entries to its own log, and crashed. S1 is now the leader and is trying to finish committing the entry from term 2. That entry is now replicated on [S1, S2, S3], but it is still not safely committed: S5 could still be elected leader at term 5 and would overwrite [S1, S2, S3] with its own term-3 entry, and in that case we would lose the term-2 entry that we had treated as committed.

(figure: pick the best leader)

For a leader to decide an entry is committed:

(figure: new commitment rule)

If entry 4 is committed under this rule, then S5 cannot be elected leader at term 5, because the majority of servers holding entry 4 have logs more up to date than S5's and will deny it the vote.
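
A sketch of that decision, assuming the leader tracks a matchIndex per follower (the highest log index known to be replicated on that follower), as typical implementations do: an entry counts as committed only if it is stored on a majority of servers and belongs to the leader's current term; earlier-term entries then become committed indirectly once a current-term entry above them commits.

```go
package main

import "fmt"

type LogEntry struct{ Term int }

// canCommit reports whether the leader may mark `index` (1-based) committed:
// the entry must be stored on a majority of servers AND belong to the
// leader's current term. matchIndex holds, per peer, the highest log index
// known to be replicated on that peer (the leader counts itself).
func canCommit(log []LogEntry, index, currentTerm int, matchIndex map[string]int, clusterSize int) bool {
	if log[index-1].Term != currentTerm {
		return false // never commit earlier-term entries just by counting replicas
	}
	count := 1 // the leader itself
	for _, mi := range matchIndex {
		if mi >= index {
			count++
		}
	}
	return count > clusterSize/2
}

func main() {
	log := []LogEntry{{Term: 2}, {Term: 4}}
	match := map[string]int{"S2": 2, "S3": 2, "S4": 0, "S5": 0}
	// Entry 2 is from the current term (4) and on 3 of 5 servers: committed.
	fmt.Println(canCommit(log, 2, 4, match, 5)) // true
	// Entry 1 is from an old term; it commits only indirectly via entry 2.
	fmt.Println(canCommit(log, 1, 4, match, 5)) // false
}
```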

How to make log entries identical after leader changes

- The leader keeps a nextIdx for each follower, initialized to the leader's last index + 1
- The leader sends the preceding log index and term with the AppendEntries RPC for the consistency check
- If the check fails, the leader decrements nextIdx and tries again
- If it succeeds, the leader increments nextIdx and appends the next entry
- For extraneous entries, the follower overwrites the inconsistent entry and deletes all subsequent (inconsistent) entries
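
A simplified, single-process sketch of this repair loop (the RPC is faked with a local function and names are illustrative): the leader backs nextIdx up until the consistency check passes, then replays its suffix onto the follower.

```go
package main

import "fmt"

type LogEntry struct {
	Term    int
	Command string
}

// appendEntries simulates the follower side: it rejects unless its log has a
// matching entry at prevIdx/prevTerm, otherwise it overwrites any conflicting
// suffix with the leader's entries.
func appendEntries(followerLog []LogEntry, prevIdx, prevTerm int, entries []LogEntry) ([]LogEntry, bool) {
	if prevIdx > 0 && (prevIdx > len(followerLog) || followerLog[prevIdx-1].Term != prevTerm) {
		return followerLog, false
	}
	// Delete extraneous / conflicting entries, then append the leader's.
	return append(followerLog[:prevIdx], entries...), true
}

// repair walks nextIdx backwards until the consistency check passes, then
// makes the follower's log identical to the leader's.
func repair(leaderLog, followerLog []LogEntry) []LogEntry {
	nextIdx := len(leaderLog) + 1 // initialized to leader's last index + 1
	for {
		prevIdx := nextIdx - 1
		prevTerm := 0
		if prevIdx > 0 {
			prevTerm = leaderLog[prevIdx-1].Term
		}
		updated, ok := appendEntries(followerLog, prevIdx, prevTerm, leaderLog[prevIdx:])
		if ok {
			return updated
		}
		nextIdx-- // mismatch: back up one entry and retry
	}
}

func main() {
	leader := []LogEntry{{1, "a"}, {1, "b"}, {2, "c"}, {3, "d"}}
	follower := []LogEntry{{1, "a"}, {1, "b"}, {2, "c"}, {2, "z"}} // extraneous entry at index 4
	fmt.Println(repair(leader, follower))                          // becomes identical to the leader's log
}
```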

When old leader gets reconnected

The old leader holds an old term, so its RPC calls will be rejected by receivers that hold a newer term; the old leader then steps down to become a follower.
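
A tiny sketch of the term rule applied to every incoming RPC or reply (the struct and method are illustrative): a stale term is rejected, and a newer term forces the receiver to adopt it and revert to follower.

```go
package main

import "fmt"

type Role int

const (
	Follower Role = iota
	Candidate
	Leader
)

type Server struct {
	CurrentTerm int
	Role        Role
}

// observeTerm applies Raft's term rule to an incoming request or reply:
// it returns false if the message's term is stale and should be rejected;
// a newer term makes this server adopt it and step down to follower.
func (s *Server) observeTerm(msgTerm int) bool {
	if msgTerm < s.CurrentTerm {
		return false // stale message from an old leader or candidate
	}
	if msgTerm > s.CurrentTerm {
		s.CurrentTerm = msgTerm
		s.Role = Follower
	}
	return true
}

func main() {
	oldLeader := &Server{CurrentTerm: 3, Role: Leader}
	// A message carrying term 5 tells the old leader it is stale: it steps down.
	fmt.Println(oldLeader.observeTerm(5), oldLeader.Role == Follower) // true true
	// Its own outgoing AppendEntries at term 3 would be rejected by peers:
	peer := &Server{CurrentTerm: 5, Role: Follower}
	fmt.Println(peer.observeTerm(3)) // false
}
```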

If client request times out

If the client simply reissues the command, the command could end up being executed twice. So the client embeds a unique id with each command, and the leader uses it to check whether the command has already been logged. If it has, the leader just returns the response from the previously executed command.
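
A sketch of that deduplication, assuming the state machine keeps a small per-client cache of the last request id and response (names and the counter command are illustrative):

```go
package main

import "fmt"

// lastReply caches, per client, the id and response of the most recent command
// so a retried (duplicate) request is answered without re-executing it.
type lastReply struct {
	RequestID int
	Response  int
}

type StateMachine struct {
	counter int
	cache   map[string]lastReply // keyed by client id
}

// Apply executes a command (here: add delta to a counter) exactly once per
// (clientID, requestID); a duplicate just returns the cached response.
func (sm *StateMachine) Apply(clientID string, requestID, delta int) int {
	if prev, ok := sm.cache[clientID]; ok && prev.RequestID == requestID {
		return prev.Response // duplicate: do not execute the command again
	}
	sm.counter += delta
	sm.cache[clientID] = lastReply{RequestID: requestID, Response: sm.counter}
	return sm.counter
}

func main() {
	sm := &StateMachine{cache: map[string]lastReply{}}
	fmt.Println(sm.Apply("client-1", 1, 5)) // 5
	fmt.Println(sm.Apply("client-1", 1, 5)) // still 5: the retry is not re-applied
	fmt.Println(sm.Apply("client-1", 2, 5)) // 10
}
```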

Configuration changes

(figure: configuration change)

Suppose we start with 3 servers and now want to add 2 more at the same time. There are several factors to consider:

The solution, described in section 4.3 of the paper, uses two phases (joint consensus).

(figure: joint consensus)


The above solution works, but Raft now uses a simpler solution, described in section 4.2 of the paper.

See deep-dive-config-change for more details.

Reading materials