How Paxos works

Abstract. One day I understood Paxos, but a couple of months later I realized that I actually didn't. I reread 'Paxos Made Simple' but it was almost as hard as the first time so I wrote this memo to help myself in the future to get Paxos faster.

Overview

Paxos is a class of synod-based algorithms for building available consistent distributed systems on top of asynchronous and unreliable network. For example, if you're building a key/value storage, Paxos will help you to keep it working in the presence of network errors (partition-tolerance) and node failures (availability), and will help you produce non-contradictory view to different clients (consistency).

> Doesn't it violate CAP theorem?

No, it doesn't. Availability in the CAP-sense is very strict. For example, a system using two-phase commit (2PC) algorithm and a system using Paxos algorithm are both unavailable in a CAP sense. It makes sense for the 2PC system since the 2PC's coordinator is a single point of failure but it's strange for Paxos because it tolerates up to N fails out of 2N+1 nodes.

So Paxos is an available CP system.

> What can we build with Paxos?

We can build a distributed state machine and implement any algorithm on top of it. But it's very hard to think about an unbounded domain, so in this post we consider Paxos as a foundation for building distributed data storages.

It's a common practice for storages to have write operations to mutate its state and read operations to query it. Paxos is different, it guarantees consistency only for write operations, so to query its state the system makes a read, writes a value back and when the write is confirmed the system returns the value or its projection to a client.

> Wait, don't pay so much attention to the details, what is the topology of the Paxos-based distributed system?

Usually a Paxos-based distributed system consists of clients, proposers and acceptors. Clients mutate and query the state of the system, proposers process the client's commands and acceptors store information.

The Paxos topology is similar to the typical 3-tier application where clients correspond to web-browsers, proposers to web servers and acceptors to databases.

> If proposers are similar to front-end servers does it mean that the proposers are stateless?

No, each of the proposers should be able to generate a sequence of increasing ID (ballot numbers) which doesn't intersect with the sequences of other proposers. So proposers must have state to store the last used ballot number. There're multiple ways how to generate such sequences with and without coordination.

For example each server may have unique ID and use an increasing sequence of natural number n to generate (n,ID) tuples and use tuples as ballot numbers. To compare them we start by comparing the first element from each tuple. If they are equal, we use the second component of the tuple (ID) as a tie breaker.

Let IDs of two servers are 0 and 1 then two sequences they generate are (0,0),(1,0),(2,0),(3,0).. and (0,1),(1,1),(2,1),(3,1).. Obviously they are unique, ordered and for any element in one we always can peak an greater element from another.

> Ok, how do proposers and acceptors communicate to agree on the system state?

Let's take a look on how a Paxos-based distributed system handles a state mutation request. The typical case is:

Client connects to any proposer and issues the command.
The proposer commnicates with the acceptors and they agree on the system's state.
Once the change is accepted all future reads should reflect the change.

On the diagram we see two rounds of proposer-acceptors communications. We also can estimate that the system generates from 4f+6 to 8f+6 messages for every change/read where f is a number of failures that the system can tolerate.

If something bad happens and client doesn't receive a confirmation then she should query the system to understand if her change was applied or not. For example it may happen when the concurrent requests from the clients collide and abort each other.

Code

Ok. Let's start with the acceptor's code. As you can see from the sequence diagram it should support two phases: preparation and acceptance. They are supported via the corresponding endpoints. The algorithm itself is written in a Python inspired pseudocode. I assume that every hdd.write call is flushed to the disk.

As you remember, clients communicate only with the proposers so it's a good idea to explore its API. I did mention before that Paxos guarantees consistency only for write requests so it shouldn't be a surprise that the proposer's API consists just of one change_query endpoint.

It accepts two pure functions: change and query. The change function validates the old state and maps it into the new state (or throws an exception). The query function makes a projection of the new state and returns it to the client.

Consider that we want to work with a distributed variable, then we may use identical transformation both as change and as query. If we want to perform a CAS-guarded state transition from old to new value then we should use the following change-generator:

Once we digest all the previous information the proposal's source code shouldn't be too scary.

Math

The Paxos algorithm gives us a way to build reliable distributed data structures which keep working in a predictable way even if the whole system experiences network partitioning or node failures. As long as a client can communicate with a proposer and the proposer sees the quorum of the acceptors then the distributed storage behaves like a thread safe data structure. Otherwise the system is unavailable.

Let's prove it. As you can see from the sources the system generates events of different kind (emit_executed, emit_prepared, emit_accepted and emit_promised). We will show that:

there is a relation between emit_accepted events which captures causality between them
the relation is also defined on emit_executed
the reduction of the relation on emit_executed is total order

The 3rd bullet means that any successful change is an effect of the previously successful change.

We should introduce a couple of contractions for the events name to simplify the reasoning about the space of events.

In the code	In the proof
emit_executed(n,...)	$\bar{n}^2$
emit_prepared(n,...)	$\bar{n}^1$
emit_accepted(n,...)	$\ddot{n}^2$
emit_promised(n,...)	$\ddot{n}^1$

$\bar{n}^1$ and $\bar{n}^2$ are single events, but $\ddot{n}^1$ and $\ddot{n}^2$ are sets of events where each event corresponds to an acceptor.

By the definition:

$$ \forall \ddot{b}^2 \in \mathrm{E} \quad \exists \ddot{a}^2 \in \mathrm{E} \,:\, \mathrm{s}(\ddot{b}^2) = \mathrm{change}(\mathrm{s}(\ddot{a}^2))$$

where $\mathrm{E}$ is a space of events generated by the system and $\mathrm{s}(x) := x.\mathrm{state}$. In such cases we say that $\ddot{a}^2$ is an ancestor of $\ddot{b}^2$ and we use the subset sign to express it: $\mathrm{s}(\ddot{a}^2) \subset \mathrm{s}(\ddot{b}^2)$.

Let's extend $\subset$ to be transitive: we say that $\mathrm{s}(\ddot{a}^2) \subset \mathrm{s}(\ddot{c}^2)$ if $\mathrm{s}(\ddot{a}^2) \subset \mathrm{s}(\ddot{b}^2)$ and $\mathrm{s}(\ddot{b}^2) \subset \mathrm{s}(\ddot{c}^2)$. Since causality is also transitive this extension respects it.

We want to extend the $\subset$ relation on $\bar{x}^2$ events. We say that $\mathrm{s}(\bar{x}^2) \subset y$ if $\forall \ddot{x}^2 \; \mathrm{s}(\ddot{x}^2) \subset y$. For example

$$\mathrm{s}(\bar{x}^2) \subset \mathrm{s}(\bar{y}^2)$$

means

$$\forall \ddot{x}^2 \; \forall \ddot{y}^2 \,:\, \mathrm{s}(\ddot{x}^2) \subset \mathrm{s}(\ddot{y}^2)$$

Statement.

We're proving that:

$$\mathrm{s}(\bar{n}^2) \subset \mathrm{s}(\bar{m}^2) \;\vee\; \mathrm{s}(\bar{m}^2) \subset \mathrm{s}(\bar{n}^2)$$

It obviously holds for any track (a set of events generated by the system) if the following expression is true for the same track:

$$n<m \;\Rightarrow\; \mathrm{s}(\bar{n}^2) \subset \mathrm{s}(\bar{m}^2)$$

Which holds if:

$$n<m \;\Rightarrow\; \mathrm{s}(\bar{n}^2) \subset \mathrm{s}(\ddot{m}^2)$$

Let's do it.

Proof.

First we define unwrap function which maps a ballot number of the write to the ballot number of the previous write.

$$\mathrm{unwrap}(\ddot{x}^2) := \bar{x}^1.\mathrm{reads}.\mathrm{max}(x \to x.\mathrm{accepted\_n}).\mathrm{accepted\_n}$$

We may think about unwrap as an inversion of the change, because

$$\mathrm{s}(\ddot{b}^2) = \mathrm{change}(\mathrm{s}(\ddot{a}^2)) \;\Rightarrow\; a=\mathrm{unwrap}(\ddot{b}^2)$$

We also need a couple of lemmas:

$\mathrm{s}(\ddot{b}^2) = \mathrm{change}(\mathrm{s}(\ddot{a}^2)) \;\Rightarrow\; a<b$
$\forall \bar{a}^2 \in \mathrm{E} \quad \forall b>a \,:\, \ddot{b}^2 \in \mathrm{E} \;\Rightarrow\; a \leq \mathrm{unwrap}(\ddot{b}^2)$

We want to prove the following statement:

$$n<m \;\Rightarrow\; \mathrm{s}(\bar{n}^2) \subset \mathrm{s}(\ddot{m}^2)$$

$k := 0$,$z := m$
$z_{k+1} := \mathrm{unwrap}(\ddot{z_k}^2),\;k:=k+1$. Lemma 1 guarantees that for any $y<x$ the $\ddot{z_x}^2$ is an ansestor of $\ddot{z_y}^2$
Lemma 2 states that $n \leq z_k$. So we have two cases:
1. If $n < z_k$ then goto step #2
2. If $n = z_k$ then $\mathrm{s}(\bar{n}^2) \subset \mathrm{s}(\ddot{m}^2)$ because $z_k$ is an ansestor of $z_0$ which is $m$

Q.E.D.

Lemma 1.

$\mathrm{s}(\ddot{b}^2) = \mathrm{change}(\mathrm{s}(\ddot{a}^2)) \;\Rightarrow\; a<b$ is true because we explicitly check it in the proposer's source code, see the monotonicity assert.

Lemma 2.

$\forall \bar{a}^2 \in \mathrm{E} \quad \forall b>a \,:\, \ddot{b}^2 \in \mathrm{E} \;\Rightarrow\; a \leq \mathrm{unwrap}(\ddot{b}^2)$

Proof.

Since we don't write a new state $\ddot{b}^2$ unless we got a confirmation from the majority $\bar{b}^1$ then the following statement holds:

$$\forall \ddot{b}^2 \in \mathrm{E} \;\Rightarrow\; \forall \bar{b}^1 \in \mathrm{E}$$

Proposer should receive promises from a majority of the acceptors before it generates a $\ddot{b}^1$ event. It guarantees truth of the following expression:

$$(\mathrm{N} := \bar{b}^1.\mathrm{reads}.[\mathrm{node\_id}] \cap \bar{a}^2.\mathrm{writes}.[\mathrm{node\_id}]) \neq \emptyset$$

where $\bigtriangleup.[\odot]:=\bigtriangleup.\mathrm{map}(x\to x.\odot)$.

Let $n \in \mathrm{N}$, $\dot{a}^2 \in \ddot{a}^2 \cap \mathrm{E[n]}$ and $\dot{b}^1 \in \ddot{b}^1 \cap \mathrm{E[n]}$ where $\mathrm{E[n]}$ are events that happened on the $n$ node.

$\dot{a}^2.\mathrm{ts} < \dot{b}^1.\mathrm{ts}$ holds because acceptor doesn't accept states with lower ballot number when it promised to accept a state with higher ballot number and $a<b$.

By definition $\dot{a}^2.\mathrm{accepted\_n}$ is the ballot number of the accepted state at moment $\dot{a}.\mathrm{ts}$ on node $n$. The same is also true for $\dot{b}^1$.

Since the ballot numbers of the accepted state is a monotonically increasing function of time then

$$\dot{a}^2.\mathrm{accepted\_n} \;\leq\; \dot{b}^1.\mathrm{accepted\_n}$$

By definition $\dot{b}^1.\mathrm{accepted\_n} \in \bar{b}^1.\mathrm{reads}.[\mathrm{accepted\_n}]$ so

$$\dot{b}^1.\mathrm{accepted\_n} \;\leq\; \bar{b}^1.\mathrm{reads}.\mathrm{max}(x \to x.\mathrm{accepted\_n}).\mathrm{accepted\_n} \;=\; \mathrm{unwrap}(\ddot{b}^2)$$

And it's the final in proving the lemma since:

$$a \;=\; \dot{a}^2.\mathrm{accepted\_n} \;\leq\; \dot{b}^1.\mathrm{accepted\_n} \;\leq\; \mathrm{unwrap}(\ddot{b}^2)$$

Q.E.D.

Conclusion

The family of Paxos protocols provides powerful primitives for building robust distributed systems, distributed data structures like switches, registers (a foundation for a hashtable aka key/value storage), logs, and state machines.

For example we'll define a distributed switch and register as a simple layer on the algorithm described in this post.

Distributed switch

Distributed register

If you read Paxos Made Simple then you might have noticed that the algorithm in this post is a little bit different. Leslie Lamport described a distributed switch (Single Decree Paxos, Synod) and shown how to run multiple instances of it to build a distributed log, a foundation for state machine (Multi-Paxos).

> It seems that the "Paxos Made Simple" version is more complex. Is it really true?

A log is a useful abstraction for building a state machine but if we're building something simpler like a key/value storage then other approaches may be a better fit.