PigPaxos is a scalable Paxos protocol for large clusters. The central idea in PigPaxos is to decouple the communication from the decision-making at the leader. PigPaxos revises the communication flow to replace the direct communication between the leader and followers in Paxos with a relay based communication flow. PigPaxos chooses relays randomly from follower clusters at each communication round to reduce contention and improve scalability of throughput. Using randomly rotating relays allow PigPaxos to shift a significant load away from the leader and onto the follower nodes.
What’s up with the name?
When we started working on the idea, we were calling it BigPaxos, because our ultimate goal was to scale to hundreds of nodes. One day while shooting the shit about BigPaxos on Slack, I made an accidental typo to call the protocol PigPaxos. Needless to say, I did not realize the typo until it was too late and the name stuck.
One of the more obvious observations was that in protocols relying on a single “strong” leader, that leader is overwhelmed with managing all the communication. The goal of PigPaxos is to give the leader a bit more breathing room to do the job of leader, and not talking as much. To that order, we replaced the direct communication pattern between leader and followers with a two-hop pattern in which a leader talks to a small subset of randomly picked relay nodes, and the relays in turn communicate with the rest of the cluster. PigPaxos also uses relays to aggregate the replies together before returning to the leader. On each communication step, PigPaxos uses a new set of randomly picked relay nodes to both spread the load evenly among the followers and to tolerate failures.
By randomly rotating the relays and enforcing timeouts and including some other optimization on how many nodes to wait at each relay node, we can provide adequate performance even in the event of node crashes or network partitions. The fault tolerance limit of PigPaoxs is similar to Paxos, and up to a minority of nodes may fail with the system still making some (limited if implemented naively) progress.
We implemented PigPaxos in our Paxi framework with almost no changes to the core Paxos code, as we focused only on the message passing layer and relay group orchestration. The entire protocol was implemented in just 1200 lines of code.
Following the footsteps of our SIGMOD work, we have developed a performance model that, hopefully, is accurate enough to show some expected performance on the bigger scale.
For empirical evaluation we deployed the protocols on a cluster of up to 25 AWS EC2 m5a nodes with 2 vCPUs and 8 GB of RAM. The performance relative to Multi-Paxos on a 25-node cluster is astonishing PigPaxos shows nearly 3.5 fold improvement in throughput compared to Paxos. We also ran EPaxos with a workload generating no conflicts to show EPaxos’s best-case performance.
Network uniformity is not a requirement for PigPaxos. In fact it is perfectly ok to have some links slower than the others. However, some arrangement of relay groups may be required to get the best performance when links between nodes have different speed or capacity. The most pronounced real-world example of this non-uniformity is the wide area networks. When we deployed real, not-simulated PigPaxos in such geo-distributed environment, it no longer had the disadvantage of slower latency, as the latency became dominated by much slower geo-links. We took advantage of natural division between fast and slow links, and made all nodes in every region to be part of the same relay group. Another advantage of this setup is the amount of cross-region traffic flowing, as data moves to each region only once regardless of how many replicas are there.
The benefits of PigPaxos do not stop at the large clusters. To our surprise, we observed better throughput from PigPaxos on just a 5 node cluster. Yes, PigPaxos has slightly higher latency due to the additional network hop, but it is still able to edge out some throughput improvements over Paxos. Once again, we show the best-case EPaxos performance here as well for reference. EPaxos actually outperforms PigPaxos in this small cluster by about 7%.
On the fault tolerance front, relay nodes definitely introduce more ways for the protocol to stumble. Crash of a relay node makes the entire relay group unavailable for that communication attempt. Crash of a non-relay node causes timeout which may add to the operation latency. The core principle behind PigPaxos’ fault tolerance is to repeat failed communication in the new configuration of relay nodes. Eventually, the configuration will be favorable enough to make progress, given that the majority of nodes are up. However, this process can be slow when many nodes are crashed, so some orthogonal optimization can help. For example, it is worth remembering nodes temporarily down and not use these nodes for relays or otherwise expect them to reply on time. Another approach is to reduce the wait quorum of the relay group to tolerate strugglers, or even use overlapping groups for communication redundancy. However, even with all these ad-hoc optimizations turned off, PigPaxos can still mask failures originating in the minority of relay groups without much impact on performance. For example, in the experiment below we have one relay group experiencing a failure on every operation for 10 seconds without much detriment to overall performance.