Paxos is a class of synod-based algorithms for building available consistent distributed systems on top of asynchronous and unreliable network. For example, if you're building a key/value storage, Paxos will help you to keep it working in the presence of network errors (partition-tolerance) and node failures (availability), and will help you produce non-contradictory view to different clients (consistency).
> Doesn't it violate CAP theorem?
No, it doesn't. Availability in the CAP-sense is very strict. For example, a system using two-phase commit (2PC) algorithm and a system using Paxos algorithm are both unavailable in a CAP sense. It makes sense for the 2PC system since the 2PC's coordinator is a single point of failure but it's strange for Paxos because it tolerates up to N fails out of 2N+1 nodes.
So Paxos is an available CP system.
> What can we build with Paxos?
We can build a distributed state machine and implement any algorithm on top of it. But it's very hard to think about an unbounded domain, so in this post we consider Paxos as a foundation for building distributed data storages.
It's a common practice for storages to have write operations to mutate its state and read operations to query it. Paxos is different, it guarantees consistency only for write operations, so to query its state the system makes a read, writes a value back and when the write is confirmed the system returns the value or its projection to a client.
> Wait, don't pay so much attention to the details, what is the topology of the Paxos-based distributed system?
Usually a Paxos-based distributed system consists of clients, proposers and acceptors. Clients mutate and query the state of the system, proposers process the client's commands and acceptors store information.
The Paxos topology is similar to the typical 3-tier application where clients correspond to web-browsers, proposers to web servers and acceptors to databases.
> If proposers are similar to front-end servers does it mean that the proposers are stateless?
No, each of the proposers should be able to generate a sequence of increasing ID (ballot numbers) which doesn't intersect with the sequences of other proposers. So proposers must have state to store the last used ballot number. There're multiple ways how to generate such sequences with and without coordination.
For example each server may have unique ID and use an increasing sequence of natural number n to generate (n,ID) tuples and use tuples as ballot numbers. To compare them we start by comparing the first element from each tuple. If they are equal, we use the second component of the tuple (ID) as a tie breaker.
Let IDs of two servers are 0 and 1 then two sequences they generate are (0,0),(1,0),(2,0),(3,0).. and (0,1),(1,1),(2,1),(3,1).. Obviously they are unique, ordered and for any element in one we always can peak an greater element from another.
> Ok, how do proposers and acceptors communicate to agree on the system state?
Let's take a look on how a Paxos-based distributed system handles a state mutation request. The typical case is:
- Client connects to any proposer and issues the command.
- The proposer commnicates with the acceptors and they agree on the system's state.
- Once the change is accepted all future reads should reflect the change.
On the diagram we see two rounds of proposer-acceptors communications. We also can estimate that the system generates from 4f+6
to 8f+6
messages for every change/read where f
is a number of failures that the system can tolerate.
If something bad happens and client doesn't receive a confirmation then she should query the system to understand if her change was applied or not. For example it may happen when the concurrent requests from the clients collide and abort each other.