Abstract
Raft is a distributed consensus algorithm designed to be easier to understand than existing algorithms like Paxos, while maintaining similar performance. It accomplishes this by decomposing the problem into separate subproblems: leader election, log replication, and safety.
History
Distributed consensus is a hard problem. Traditionally, Paxos has been used to solve it, but Paxos is notoriously difficult to understand. Raft was created to be more understandable than Paxos while remaining just as performant.
Goals
Raft requires:
- Safety under all non-Byzantine conditions (it does not address the Byzantine Generals Problem)
- Management of a replicated log
- Availability as long as a majority of nodes are functioning, without depending on hardware timing for correctness
Implementation
There are two RPCs defined under Raft, which can serve multiple purposes.
- RequestVote: used by a candidate server to request votes during an election.
- AppendEntries: used by the leader to replicate log entries, and also as a heartbeat.
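As a rough sketch, the two RPC payloads might look like the following Python dataclasses. The field names are illustrative, loosely modeled on the figures in the Raft paper; this is not an actual wire format:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class RequestVote:
    term: int            # candidate's current term
    candidate_id: int    # candidate requesting the vote
    last_log_index: int  # index of candidate's last log entry
    last_log_term: int   # term of candidate's last log entry

@dataclass
class AppendEntries:
    term: int            # leader's current term
    leader_id: int
    prev_log_index: int  # index of the entry immediately preceding the new ones
    prev_log_term: int   # term of that entry (used as a consistency check)
    entries: List[Tuple[int, str]] = field(default_factory=list)  # (term, command)
    leader_commit: int = 0  # leader's commit index

    @property
    def is_heartbeat(self) -> bool:
        # An AppendEntries carrying no entries doubles as a heartbeat.
        return not self.entries
```

This illustrates why only two RPCs suffice: an empty AppendEntries is a heartbeat, and a non-empty one replicates log entries.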
Each node is, at any given time, in one of three states: leader, candidate, or follower. Under normal operation there is exactly one leader, and the rest of the nodes are followers. The leader is responsible for managing the replicated log. Time is divided into terms, which may be of arbitrary and unequal length; each term has at most one leader. When the leader fails for any reason, an election is held.
Term number
Term numbers act as a logical clock in Raft. If a request carries a stale term number, the recipient rejects it. If a recipient sees a newer term number, it discards its own term number, adopts the newer one, and (if it was a leader or candidate) reverts to follower.
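The stale-term rule can be sketched as a tiny helper (the function name is hypothetical):

```python
def observe_term(local_term: int, rpc_term: int):
    """Logical-clock rule for Raft term numbers (sketch).

    Returns (new_local_term, accept). Stale requests are rejected;
    a newer term replaces ours (in a full node, seeing a newer term
    also forces a leader or candidate to step down to follower).
    """
    if rpc_term < local_term:
        return local_term, False   # stale term: reject the request
    return rpc_term, True          # adopt the newer (or equal) term
```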
Leader election
Leaders periodically send heartbeats to the followers: an AppendEntries RPC with no entries. If a leader fails, it stops sending these RPCs. If a follower receives no heartbeat within a given period (the election timeout), it assumes there is no viable leader and begins an election.
Election Timeouts
Election timeouts are randomized per node to avoid the situation where many servers become candidates at once and split the vote, which wastes time.
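A minimal sketch of per-node randomized timeouts (the 150-300 ms bounds follow the range suggested in the Raft paper; the function name is made up):

```python
import random

# Bounds suggested in the Raft paper as a reasonable range.
ELECTION_TIMEOUT_MIN_MS = 150
ELECTION_TIMEOUT_MAX_MS = 300

def new_election_timeout(rng: random.Random) -> float:
    """Pick a fresh randomized timeout, so nodes rarely time out
    simultaneously and split votes become unlikely."""
    return rng.uniform(ELECTION_TIMEOUT_MIN_MS, ELECTION_TIMEOUT_MAX_MS)
```

Each node draws a new timeout whenever it resets its timer, so even after a split vote the next election is likely to have a single early candidate.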
Start of an election
When a follower begins an election, it increments its term and promotes itself to the candidate state. It then sends a RequestVote RPC to every other node. A node rejects the request if its own term is newer or its log is more up to date than the candidate's: this prevents electing a bad leader that could overwrite committed data.
Normal Operation
During a term, there is a considerable period of normal operation. This is where most of the time is spent, so it is important for it to be efficient. Once a leader has been elected, it is the leader's job to service client requests and replicate them to the other nodes' state machines.
On request
The leader appends the request to its own log, then issues AppendEntries RPCs to every other node. Each log entry is tagged with a (term, index) pair; by the Log Matching Property, this identifies the entry uniquely across all machines. Once an entry is replicated on a majority of nodes, it is considered "committed": it can now be applied to the state machine, but this does NOT mean the leader is done replicating, which it will continue to do. If a follower's log conflicts, the leader forces the follower to adopt its own log, backtracking until the two logs match and overwriting from there. Property: if an entry is committed, all earlier entries in the log are also committed.
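The commit rule, that an entry is committed once a majority holds it, can be sketched as follows. This is a simplification (real Raft only directly commits entries from the leader's current term); the function and parameter names are made up:

```python
def commit_index(match_index: dict, leader_last: int) -> int:
    """Highest log index replicated on a majority of the cluster,
    counting the leader itself, i.e. the highest committable index.

    match_index maps follower id -> highest index known replicated there;
    leader_last is the index of the leader's last log entry.
    """
    n = len(match_index) + 1  # cluster size, including the leader
    for idx in range(leader_last, 0, -1):
        replicated = 1 + sum(1 for m in match_index.values() if m >= idx)
        if replicated * 2 > n:  # strict majority holds entry idx
            return idx
    return 0
```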
Safety
Leader changes
Because leader changes can lead to entries being overwritten, we want to be careful about who we elect. Therefore, there is a fundamental invariant: any leader must hold all previously committed entries. As previously mentioned, a voter will simply not vote for a lagging candidate (one whose term number or last log entry is behind its own).
Timing and Availability
Raft imposes no hardware timing requirements for safety. It does, however, impose them for availability.
The simple constraint is:
broadcastTime << electionTimeout << MTBF
- broadcastTime: the average time for a server to send RPCs to every other server and receive their responses
- electionTimeout: the randomized interval after which a follower without a leader starts an election
- MTBF: the mean time between failures for a single server
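Reading "<<" as "smaller by at least some factor", the constraint might be checked like this (the factor of 10 is an arbitrary assumption, not from the paper):

```python
def timing_ok(broadcast_ms: float, election_timeout_ms: float,
              mtbf_ms: float, factor: float = 10.0) -> bool:
    """Check broadcastTime << electionTimeout << MTBF, treating '<<'
    as 'smaller by at least `factor`' (an illustrative choice)."""
    return (broadcast_ms * factor <= election_timeout_ms
            and election_timeout_ms * factor <= mtbf_ms)
```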
Cluster membership changes
Adding or removing nodes can create a window in which two disjoint majorities elect different leaders. To ensure this doesn't happen, Raft requires that during a change, a new leader win majority votes from both the previous configuration of nodes and the new one.
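The joint-majority requirement can be sketched as a hypothetical helper:

```python
def wins_joint_election(votes: set, old_config: set, new_config: set) -> bool:
    """During a membership change, a candidate must win separate
    majorities in BOTH the old and the new configuration."""
    def majority(cfg: set) -> bool:
        return len(votes & cfg) * 2 > len(cfg)
    return majority(old_config) and majority(new_config)
```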
Performance
Raft is easier to understand and achieves relatively good performance, primarily because replicating a log entry requires a single round of RPCs, sent to all nodes in parallel, and this is the bulk of the work.