Best of both worlds: Raft's joint consensus + Single Decree Paxos
New post: Best of both worlds: Raft's joint consensus + Single Decree Paxos -- https://t.co/7UtMtiMijs
— Rystsov Denis (@rystsov) January 5, 2016
New post: Best of both worlds: Raft's joint consensus + Single Decree Paxos -- https://t.co/7UtMtiMijs
— Rystsov Denis (@rystsov) January 5, 2016
Paxos is a family of protocols used to build CP-distributed data structures. The Paxos Made Simple paper describes two members of that family: Single Decree Paxos for building write-once distributed registers and Multi Paxos for building distributed append-only logs.
The simplest member, Single Decree Paxos, seems impractical because "Paxos Made Simple" doesn't tell us how to change membership. Until we know it, we can't replace failed nodes, and since failures happen — the system eventually becomes unavailable. In this post, I took Raft's idea of joint consensus to manage membership, adjusted it to Single Decree Paxos and proved its correctness.
One might say that it isn't important because nobody wants to build systems on top of write-once distributed registers when they can use distributed logs. It's true, but a little tweak turns a write-once register into a variable one.
Distributed variables may be more appealing to practitioners than distributed logs because variables are already powerful enough to be used in the real-world systems and they don't have log-related problems like log compaction.
For example, we may use an array of distributed variables as a distributed hashtable. Since the only API of many distributed storages is a key/value API (hashtable) it proves that Single Decree Paxos is applicable to the problems which people currently solve with Raft and Multi-Paxos.
Among the popular distributed key/value solutions are:
Yeah, I blogged about this problem and proposed a solution in the Dynamic Plain Paxos post. The algorithm is correct, does the job, but has reliability issues. If we define reliability as the number of nodes that may fail without affecting the system, then during the transition phase of the the Dynamic Plain Paxos algorithm between $2f+1$ and $2f+2$ nodes reliability goes down from $f$ to $f-1$, and then restores back to $f$.
The Raft's joint consensus approach doesn't have this disadvantage and it's worth it to backport it to Single Decree Paxos.
The algorithm is based on two principles.
Let's review how Paxos works, describe joint consensus, apply it to Single Decree Paxos and prove that the consistency of the system holds during the transition from a $2f+1$ to a $2f+2$ node cluster.
When a proposer receives a request from a client to change the distributed variable in the $2f+1$ node paxos cluster it should:
To read a value, a proposer should execute the same algorithm but keep the value untouched on the 4th step.
Please, see the How Paxos works for details.
Joint consensus is a way of changing set of acceptors in the paxos cluster without affecting the linearizability. Suppose we want to make a transition from old set of acceptors O to the new one N. To make it we:
Now that we got the general idea, we can dive deeper and take a look at the joint consensus inspired algorithm to change membership in paxos.
Let's describe how to enlarge a paxos cluster from $2f+1$ acceptors to $2f+2$ acceptors step by step. The $A_1, A_2 \cdots A_{2f+1}$ are the original acceptors and we want to add a new $A_{2f+2}$ acceptor to the cluster. The steps to achieve this are:
The membership change from $2f$ to $2f+1$ is much simpler. We think about the $2f$ node system as a $2f+1$ node system where all the messages to the new node are filtered out. So we just ask each proposer to switch itself to the new configuration. The filter equivalence principle guarantees correctness.
To exclude a node from the cluster, we should follow the same algorithms but in the reversed order.
In case something goes wrong during the cluster extension, we can always go back to the previous configuration or just pause the extension, fix the problem and resume it later. Since the extension doesn't affect reliability we don't have to finish it as fast as possible.