Can 100 Machines Agree?

Rachid Guerraoui; Jad Hamza; Dragos-Adrian Seredinschi; Marko Vukolic

arXiv:1911.07966·cs.DC·November 20, 2019·1 cites

Open Access

TL;DR

This study empirically evaluates several agreement protocols on up to 100 machines, finding that throughput decay is sharp at first but dampens as the deployment grows, and that chain- and ring-based protocols outperform the others.

Contribution

Provides the first extensive empirical analysis of agreement protocols at large scale, demonstrating their graceful performance and identifying advantages of chain- and ring-based methods.

Findings

01. Agreement protocols perform well on 100 machines with dampened throughput decay.

02. Chain- and ring-based protocols achieve 4K to 11K requests/sec, outperforming traditional protocols.

03. Performance decay slows down beyond a few tens of replicas, indicating scalability benefits.

Abstract

Agreement protocols have typically been deployed at small scale, e.g., using three to five machines. This is because these protocols seem to suffer from a sharp performance decay. More specifically, as the size of a deployment (i.e., the degree of replication) increases, the protocol performance greatly decreases. There is not much experimental evidence for this decay in practice, however, notably for larger system sizes, e.g., beyond a handful of machines. In this paper we execute agreement protocols on up to 100 machines and observe their performance decay. We consider well-known agreement protocols that are part of mature systems, such as Apache ZooKeeper, etcd, and BFT-Smart, as well as a chain and a novel ring-based agreement protocol which we implement ourselves. We provide empirical evidence that current agreement protocols execute gracefully on 100 machines. We observe that…

Taxonomy

Topics: Distributed systems and fault tolerance · Software System Performance and Reliability · Cloud Computing and Resource Management