Abstract
Raft is a distributed consensus algorithm designed to be easier to understand than existing algorithms like Paxos, while maintaining similar performance. It accomplishes this by decomposing the problem into separate subproblems: leader election, log replication, and safety.
History
Distributed consensus is a hard problem. Traditionally, Paxos has been used to solve it, but Paxos is notoriously difficult to understand. Raft was created to be more understandable than Paxos while remaining just as performant.
Goals
Raft requires:
- Safety under all non-Byzantine conditions (it does not address the Byzantine Generals Problem)
- Management of a replicated log
- Availability as long as a majority of nodes are functioning, without depending on hardware timing for correctness
Implementation
There are two RPCs defined under Raft, which can serve multiple purposes.
- RequestVote: used by a candidate server to request votes during an election.
- AppendEntries: used by the leader to replicate log entries, and also as a heartbeat.
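As a rough sketch, the two RPC payloads might look like the following Python dataclasses. The field names are illustrative, loosely modeled on the figures in the Raft paper; this is not an actual wire format:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class RequestVote:
    term: int            # candidate's current term
    candidate_id: int    # candidate requesting the vote
    last_log_index: int  # index of candidate's last log entry
    last_log_term: int   # term of candidate's last log entry

@dataclass
class AppendEntries:
    term: int            # leader's current term
    leader_id: int
    prev_log_index: int  # index of the entry immediately preceding the new ones
    prev_log_term: int   # term of that entry (used as a consistency check)
    entries: List[Tuple[int, str]] = field(default_factory=list)  # (term, command)
    leader_commit: int = 0  # leader's commit index

    @property
    def is_heartbeat(self) -> bool:
        # An AppendEntries carrying no entries doubles as a heartbeat.
        return not self.entries
```

This illustrates why only two RPCs suffice: an empty AppendEntries is a heartbeat, and a non-empty one replicates log entries.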
Each node is, at any given time, in one of three states: leader, candidate, or follower. Under normal operation there is exactly one leader, and the rest of the nodes are followers. The leader is responsible for managing the replicated log. Time is divided into terms, which may be of arbitrary and unequal length; each term has at most one leader. When the leader fails for any reason, an election is held.
Term number
Term numbers act as a logical clock in Raft. If a request carries a stale term number, the recipient rejects it. If a recipient sees a newer term number, it discards its own term number, adopts the newer one, and (if it was a leader or candidate) reverts to follower.
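The stale-term rule can be sketched as a tiny helper (the function name is hypothetical):

```python
def observe_term(local_term: int, rpc_term: int):
    """Logical-clock rule for Raft term numbers (sketch).

    Returns (new_local_term, accept). Stale requests are rejected;
    a newer term replaces ours (in a full node, seeing a newer term
    also forces a leader or candidate to step down to follower).
    """
    if rpc_term < local_term:
        return local_term, False   # stale term: reject the request
    return rpc_term, True          # adopt the newer (or equal) term
```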
Leader election
Leaders periodically send heartbeats to the followers: an AppendEntries RPC with no entries. If a leader fails, it stops sending these RPCs. If a follower receives no heartbeat within a given period (the election timeout), it assumes there is no viable leader and begins an election.
Election Timeouts
Election timeouts are randomized per node to avoid the situation where many servers become candidates at once and split the vote, which wastes time.
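A minimal sketch of per-node randomized timeouts (the 150-300 ms bounds follow the range suggested in the Raft paper; the function name is made up):

```python
import random

# Bounds suggested in the Raft paper as a reasonable range.
ELECTION_TIMEOUT_MIN_MS = 150
ELECTION_TIMEOUT_MAX_MS = 300

def new_election_timeout(rng: random.Random) -> float:
    """Pick a fresh randomized timeout, so nodes rarely time out
    simultaneously and split votes become unlikely."""
    return rng.uniform(ELECTION_TIMEOUT_MIN_MS, ELECTION_TIMEOUT_MAX_MS)
```

Each node draws a new timeout whenever it resets its timer, so even after a split vote the next election is likely to have a single early candidate.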
Start of an election
When a follower begins an election, it increments its term and promotes itself to the candidate state. It then sends a RequestVote RPC to every other node. A node rejects the request if its own term is newer or its log is more up to date than the candidate's: this prevents electing a bad leader that could overwrite committed data.
Normal Operation
During a term, there is a considerable period of normal operation. This is where most of the time is spent, so it is important for it to be efficient. Once a leader has been elected, it is the leader's job to service client requests and replicate them to the other nodes' state machines.
On request
The leader appends the request to its own log, then issues AppendEntries RPCs to every other node. Each log entry is tagged with a (term, index) pair; by the Log Matching Property, this identifies the entry uniquely across all machines. Once an entry is replicated on a majority of nodes, it is considered "committed": it can now be applied to the state machine, but this does NOT mean the leader is done replicating, which it will continue to do. If a follower's log conflicts, the leader forces the follower to adopt its own log, backtracking until the two logs match and overwriting from there. Property: if an entry is committed, all earlier entries in the log are also committed.
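The commit rule, that an entry is committed once a majority holds it, can be sketched as follows. This is a simplification (real Raft only directly commits entries from the leader's current term); the function and parameter names are made up:

```python
def commit_index(match_index: dict, leader_last: int) -> int:
    """Highest log index replicated on a majority of the cluster,
    counting the leader itself, i.e. the highest committable index.

    match_index maps follower id -> highest index known replicated there;
    leader_last is the index of the leader's last log entry.
    """
    n = len(match_index) + 1  # cluster size, including the leader
    for idx in range(leader_last, 0, -1):
        replicated = 1 + sum(1 for m in match_index.values() if m >= idx)
        if replicated * 2 > n:  # strict majority holds entry idx
            return idx
    return 0
```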
Safety
Leader changes
Because leader changes can lead to entries being overwritten, we want to be careful about who we elect. Therefore, there is a fundamental invariant: any leader must hold all previously committed entries. As previously mentioned, a voter will simply not vote for a lagging candidate (one whose term number or last log entry is behind its own).
Timing and Availability
Raft imposes no hardware timing requirements for safety. It does, however, impose them for availability.
The simple constraint is:
broadcastTime << electionTimeout << MTBF
- broadcastTime: the average time for a server to send RPCs to every other server and receive their responses
- electionTimeout: the randomized interval after which a follower without a leader starts an election
- MTBF: the mean time between failures for a single server
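Reading "<<" as "smaller by at least some factor", the constraint might be checked like this (the factor of 10 is an arbitrary assumption, not from the paper):

```python
def timing_ok(broadcast_ms: float, election_timeout_ms: float,
              mtbf_ms: float, factor: float = 10.0) -> bool:
    """Check broadcastTime << electionTimeout << MTBF, treating '<<'
    as 'smaller by at least `factor`' (an illustrative choice)."""
    return (broadcast_ms * factor <= election_timeout_ms
            and election_timeout_ms * factor <= mtbf_ms)
```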
Cluster membership changes
Adding or removing nodes can create a window in which two disjoint majorities elect different leaders. To ensure this doesn't happen, Raft requires that during a change, a new leader win majority votes from both the previous configuration of nodes and the new one.
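The joint-majority requirement can be sketched as a hypothetical helper:

```python
def wins_joint_election(votes: set, old_config: set, new_config: set) -> bool:
    """During a membership change, a candidate must win separate
    majorities in BOTH the old and the new configuration."""
    def majority(cfg: set) -> bool:
        return len(votes & cfg) * 2 > len(cfg)
    return majority(old_config) and majority(new_config)
```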
Performance
Raft is easier to understand and achieves relatively good performance, primarily because replicating a log entry requires a single round of RPCs, sent to all nodes in parallel, and this is the bulk of the work.