Revisiting consensus protocols through wait-free parallelization

Suyash Gupta; Jelle Hellings; Mohammad Sadoghi

arXiv:1908.01458·cs.DC·August 7, 2019

Revisiting consensus protocols through wait-free parallelization

Suyash Gupta, Jelle Hellings, Mohammad Sadoghi

PDF

TL;DR

This paper introduces a wait-free parallelization method for consensus protocols that enhances performance and robustness by running multiple instances in parallel, reducing load and mitigating malicious influence.

Contribution

It proposes a protocol-agnostic, wait-free parallelization approach for primary-backup consensus protocols, improving performance and security.

Findings

01

Reduces load on individual consensus instances.

02

Mitigates impact of malicious primaries.

03

Enhances overall system robustness.

Abstract

The recent surge of blockchain systems has renewed the interest in traditional Byzantine fault-tolerant consensus protocols. Many such consensus protocols have a primary-backup design in which an assigned replica, the primary, is responsible for coordinating the consensus protocol. Although the primary-backup design leads to relatively simple and high performance consensus protocols, it places an unreasonable burden on a good primary and allows malicious primaries to substantially affect the system performance. In this paper, we propose a protocol-agnostic approach to improve the design of primary backup consensus protocols. At the core of our approach is a novel wait-free approach of running several instances of the underlying consensus protocol in parallel. To yield a high performance parallelized design, we present coordination-free techniques to order operations across parallel…

Equations6

S (D_{ρ}) = {i \mapsto S (CR) ∣ D_{ρ, i} = S (CR), 1 \leq i \leq m}; F (D_{ρ}) = {i \mapsto F ∣ D_{ρ, i} = F, 1 \leq i \leq m},

S (D_{ρ}) = {i \mapsto S (CR) ∣ D_{ρ, i} = S (CR), 1 \leq i \leq m}; F (D_{ρ}) = {i \mapsto F ∣ D_{ρ, i} = F, 1 \leq i \leq m},

transfer (A, B, n, m) := if amount (A) > n then withdraw (A, m); deposit (B, m) .

transfer (A, B, n, m) := if amount (A) > n then withdraw (A, m); deposit (B, m) .

f_{S} (i) = {S f_{S ∖ S [q]} (r) \oplus S [q] if ∣ S ∣ = 1; if ∣ S ∣ > 1,

f_{S} (i) = {S f_{S ∖ S [q]} (r) \oplus S [q] if ∣ S ∣ = 1; if ∣ S ∣ > 1,

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Revisiting consensus protocols through wait-free parallelization111A brief announcement of this work will be presented at the 33rd International Symposium on Distributed Computing (DISC 2019) [21].

Suyash Gupta

Jelle Hellings

Mohammad Sadoghi

(

[TABLE] )

Abstract

The recent surge of blockchain systems has renewed the interest in traditional Byzantine fault-tolerant consensus protocols. Many such consensus protocols have a primary-backup design in which an assigned replica, the primary, is responsible for coordinating the consensus protocol. Although the primary-backup design leads to relatively simple and high-performance consensus protocols, it places an unreasonable burden on a good primary and allows malicious primaries to substantially affect the system performance.

In this paper, we propose a protocol-agnostic approach to improve the design of primary-backup consensus protocols. At the core of our approach is a novel wait-free approach of running several instances of the underlying consensus protocol in parallel. To yield a high-performance parallelized design, we present coordination-free techniques to order operations across parallel instances, deal with instance failures, and assign clients to specific instances. Consequently, the design we present is able to reduce the load on individual instances and primaries, while also reducing the adverse effects of any malicious replicas.

1 Introduction

The introduction of Bitcoin—the first wide-spread application driven by blockchains—has resulted in a surge of interest in blockchain technology. This interest is backed by many use cases in the public and private sectors. For example, in trade [36, 13, 10, 36, 34], identity management [3, 34, 10], food production [18], aid delivery [3, 34], health care [25, 19, 7], fraud prevention [24], and GDPR compliance [15]. At the core of these use cases is the need to manage and replicate data, such as financial transactions, among a group of participants. Consequently, at the core of blockchain technology are consensus protocols that allow replicating data across a group of servers (replicas), some of which can fail or can act maliciously.

Several use cases for blockchains operate in a permissioned environment in which the participants can only join via well-established procedures. These established procedures prevent malicious entities from controlling a majority of the replicas. In this permissioned environment, a blockchain can be maintained using traditional high-performance Byzantine fault-tolerant (bft) consensus protocols [11, 12, 14, 5, 26, 27, 4, 33, 28, 44, 20, 32, 37, 23]. Commonly, these protocols use the primary-backup model pioneered in the Practical Byzantine Fault Tolerant consensus protocol [11]. In these bft protocols, a single replica is designated as the primary and is responsible for coordinating the consensus decisions, while all the other replicas perform the backup role.

The primary-backup model simplifies the development of consensus protocols substantially: when a primary is non-malicious, then even the simplest broadcast replication protocols suffice. The only complication in these consensus protocols is the way in which they deal with malicious primaries: malicious behavior must be either detected (after which the primary can be replaced) or prevented altogether. This simplicity of the primary-backup model negatively affects its performance in three ways [5, 14, 2, 41]:

Primary load. The primary not only has to perform the primary tasks, but also the backup role (as it is itself a replica). Consequently, the primary receives a higher load than other replicas, and this load at the primary can become a bottleneck in the overall system throughput. This is especially the case in fine-tuned high-performance consensus protocols that employ complex cryptographic primitive, for example, to reduce communication overheads or to improve resilience. 2. 2.

Primary replacement. As stated earlier, primary-backup consensus protocols work only when the primary behaves in accordance with the protocol. If the primary acts malicious or is faulty, then it will be replaced. However, detection of such behaviors requires setting timers. Further, replacing a faulty primary usually takes a while. During this time the system is unable to handle requests, which negatively affects its overall throughput. 3. 3.

Malicious behavior. Primary-backup consensus protocols rely on the underlying algorithm to detect malicious behavior of the primary. Usually, these detectors are only capable of detecting catastrophic failures that prevent new consensus decisions altogether, but they fail to detect or deal with primaries that affect the performance of the system in other ways, for example, a malicious primary could reduce or throttle the throughput of the system.

To the best of our knowledge, no approach is yet able to address all these limitations of primary-backup consensus protocols. In this work, we address these limitations in a protocol-agnostic manner by exploiting parallelization. In our paradigm, we run several instances of the underlying consensus protocol in parallel and we balance the system load among these parallel instances. This parallelism helps to reduce the load per primary and mitigates the negative impacts of a single primary on the throughput of the system. Our design is fine-tuned such that the instances coordinated by non-faulty replicas are wait-free: they can continuously make consensus decisions, independent of the behavior of any other instances. Our paradigm is highly flexible: it can be used in combination with any well-behaved primary-backup consensus protocol and it can be fine-tuned towards various application-specific needs.

Organization.

In Section 2, we introduce terminology and notations that we use throughout this paper. In Section 3, we present how we envision parallelization of consensus protocols, and tackle the main design challenges. Next, in Section 4, we refine the heavily step-wise coordinated approach of Section 3 by presenting a wait-free design that adds additional challenges but supports maintaining high throughput. Finally, in Section 5, we discuss related work and in Section 6 we conclude on our findings and discuss avenues for further research.

2 Preliminaries

In this work, we present a protocol-agnostic paradigm to parallelize consensus protocols with the aim of increasing consensus throughput, while reducing the effects of individual malicious replicas. We now introduce the notations and assumptions used throughout this paper.

Service notation.

We represent a replicated service by a triple $\mathfrak{S}=(\mathfrak{C},\mathfrak{R},\mathscr{F})$ , where $\mathfrak{C}$ is the set of clients using the service, $\mathfrak{R}$ is the set of replicas and $\mathscr{F}\subset\mathfrak{R}$ is the set of faulty replicas that exhibit Byzantine behavior. We write $\textbf{n}=\lvert\mathfrak{R}\rvert$ and $\textbf{f}=\lvert\mathscr{F}\rvert$ to denote the number of replicas and faulty replicas, respectively. We assign each replica ${R}\in\mathfrak{R}$ a unique identifier $\mathop{\textsf{id}}({R})$ with $0\leq\mathop{\textsf{id}}({R})<\textbf{n}$ . Similarly, we assign each client ${c}\in\mathfrak{C}$ a unique identifier $\mathop{\textsf{id}}({c})$ with $0\leq\mathop{\textsf{id}}({c})<\lvert\mathfrak{C}\rvert$ . The set of non-faulty replicas, denoted by $\mathop{\textsf{nf}}(\mathfrak{S})$ , is defined as $\mathop{\textsf{nf}}(\mathfrak{S})=\mathfrak{R}\setminus\mathscr{F}$ . We assume that the non-faulty replicas behave in accordance with the protocol and are deterministic: on identical inputs, we expect non-faulty replicas to produce identical outputs.

Consensus protocol.

A consensus protocol helps to replicate a sequence of values among all the non-faulty replicas. A single execution of a correct consensus protocol satisfies the following two requirements [39]:

Termination. Each non-faulty replica accepts a value. 2. 2.

Non-divergence. All non-faulty replicas accept the same value.

In this paper, we consider those consensus protocols that replicate a sequence of client requests (for example, database operations). In this setting, the termination and non-divergence requirements imply data consistency, a safety property. Additionally, termination of one round (or one consensus) assures that all the non-faulty replicas have the same state. This ensures that any preconditions for the next round are met, which implies availability, a liveness property. Most general-purpose consensus protocols that do not expect synchronous communication to guarantee non-divergence require $\textbf{n}>3\textbf{f}$ , which we also assume throughout this paper [11, 12].

The focus of this paper are the consensus protocols that follow the primary-backup model. In such protocols, a single replica is assigned the role of the primary and is responsible for initiating and coordinating each round of the consensus protocols, while all the remaining replicas perform the backup role. As the primary can be malicious, these protocols usually have the means to detect failure of the primary and transfer control to a new primary. A well-known example of a primary-backup consensus protocol is the Practical Byzantine Fault Tolerance protocol (Pbft) [11, 12], which has inspired the design of many modern consensus protocols [14, 26, 27, 4, 33, 5, 28].

Sequence and map notations.

Let $S=[s_{0},\dots,s_{k-1}]$ be a sequence. We write $S[i]$ to denote $s_{i}$ and $\lvert S\rvert$ to denote the length $k$ of $S$ . If $v$ is a value, then $S\oplus v=[s_{0},\dots,s_{k-1},v]$ denotes the concatenation of $S$ and $v$ . If $v$ is a value, then $S\setminus v$ denotes the sequence obtained from $S$ by removing all occurrences of $v$ . If $k$ and $v$ are values, then we write $k\mapsto v$ to denote a key-value mapping that maps $k$ onto $v$ .

Cryptographic primitives.

We assume a collision-resistant hash function that maps an arbitrary value $v$ to a numeric value $\mathop{\texttt{Hash}}(v)$ in a bounded range, called the digest [29]. We assume that it is practically impossible to find another value $v^{\prime}$ , $v\neq v^{\prime}$ , such that $\mathop{\texttt{Hash}}(v)=\mathop{\texttt{Hash}}(v^{\prime})$ .

3 Parallelizing consensus

We now present our paradigm to parallelize a consensus protocol. In our paradigm, each replica ${R}\in\mathfrak{R}$ participates in m, $1\leq\textbf{m}\leq\textbf{n}$ , instances of the underlying consensus protocol. We use $\mathscr{I}=\{\mathcal{I}_{1},\dots,\mathcal{I}_{\textbf{m}}\}$ to denote these instances, where $1,\dots,\textbf{m}$ acts as the instance identifier. We write $\mathcal{I}_{i}({R})$ to denote the $i$ -th instance, $1\leq i\leq\textbf{m}$ , running on a replica ${R}$ and we use $\mathscr{I}({R})=\{\mathcal{I}_{1}({R}),\dots,\mathcal{I}_{\textbf{m}}({R})\}$ to represent all the m instances running, in parallel, on the replica ${R}$ . Further, we represent the primary of an $i$ -th instance as $\mathcal{P}_{i}$ , $1\leq i\leq\textbf{m}$ . Our paradigm enforces that each primary exists on a distinct replicas, that is, for all $1\leq i<j\leq\textbf{m}$ , we have $\mathcal{P}_{i}\neq\mathcal{P}_{j}$ .

Figure 1 presents a set of tasks undertaken by a replica employing our paradigm. Each replica takes as input a bft protocol and runs m instance of that protocol in parallel. Once these instances complete, the replica generates a global order of the requests across all these instances and executes these requests in the global order. For the sake of clarity, we first present a parallelized design in which the instances operate in a coordinated step-wise manner. Furthermore, we assume that each instance operates a general consensus protocol:

Definition 3.1.

We model a consensus protocol as a black-box that operates in well-defined rounds. In each such round, a single consensus decision is made by all the non-faulty replicas. If a round succeeds, then $\textsf{S}(\texttt{CR})$ is the consensus decision observed by all the non-faulty replicas, where CR is the client request accepted by all the non-faulty replicas in that round. If a round fails, then F is the consensus decision observed by all the non-faulty replicas, which indicates a primary failure. After a failure, a replica can be instructed to transfer control to a new primary. If all the non-faulty replicas are instructed to transfer control to the same new primary, then this process will succeed and a new primary is elected.222Several practical bft-style consensus protocols do not strictly adhere to these assumptions. To improve throughput, these protocols provide partial consensus, in which a majority of the non-faulty replicas are guaranteed to make successful consensus decisions. In these partial consensus protocols, consensus among all the non-faulty replicas is guaranteed only eventually through additional checkpoint and recovery steps.

The general model of Definition 3.1 allows us to focus on the core challenges in parallelizing consensus protocols. At the core of our paradigm is the coordination of m instances of a consensus protocol running in parallel. This implies that a single round of our paradigm coordinates multiple parallel consensus rounds, each of which is initiated and managed by a distinct primary $\mathcal{P}_{i}$ for the instance $\mathcal{I}_{i}$ , $1\leq i\leq\textbf{m}$ . Each consensus decision succeeds whenever $\mathcal{P}_{i}$ is non-faulty. This approach to parallelization raises several important challenges:

For optimal throughput, we need to ensure that each instance is making a distinct consensus decision, that is, each instance is processing a distinct client request. 2. 2.

Every non-faulty replica should execute all the accepted client requests in the same order. 3. 3.

When several instances fail in a round and want to transfer control to new primaries, then all non-faulty replicas need to do so in the same manner.

In our design, we address each of these challenges. Figure 2 sketches a high-level overview of a parallelized consensus round at replica ${R}$ .

In each parallelized consensus round, we first allow each of the m instances to independently reach a consensus decision. Next, we collect these decisions. The success decisions—of the form $\textsf{S}(\texttt{CR})$ —are executed in a deterministic fashion. The failure decisions—of the form F—are used to recover the instances involved in these decisions. To recover these instances, we replace their respective primaries, which we explain later in this section.

We use $\mathcal{D}_{\rho,i}$ , $1\leq i\leq\textbf{m}$ , to denote the consensus decision of instance $\mathcal{I}_{i}$ , agreed by all the non-faulty replicas in round $\rho$ . Similarly, we use $\mathcal{P}_{i,\rho}$ to indicate the primary of the $i$ -th instance in round $\rho$ . We write $\mathscr{D}_{\rho}=\{\mathcal{D}_{\rho,1},\dots,\mathcal{D}_{\rho,\textbf{m}}\}$ to represent the set of m consensus decisions agreed upon by all the non-faulty replicas. Finally, we write

[TABLE]

to denote the partitioning of $\mathscr{D}_{\rho}$ into sets of success decisions and failure decisions.

In Section 3.1, we describe how to determine the order of execution of the client requests in $\textsf{S}(\mathscr{D}_{\rho})$ and in Section 3.2, we describe how to deal with primary failure in a coordinated manner in response to the failure decisions in $\textsf{F}(\mathscr{D}_{\rho})$ . We discuss the assignment of clients to instances in the following section as part of the efforts to optimize parallelization benefits by removing the need for round-based step-wise operations.

3.1 Deterministic round execution

The correctness of the underlying consensus protocol, used by instances $\mathcal{I}_{1},\dots,\mathcal{I}_{\textbf{m}}$ , guarantees that each non-faulty replica derives the same set of client requests $\textsf{S}(\mathscr{D}_{\rho})$ in round $\rho$ . Hence, non-faulty replicas only need to determine the order of execution of these client requests.

A simple solution would be to order the client requests based on their instance identifiers: first execute the client request of $\mathcal{I}_{1}$ (if any), then execute the client request of $\mathcal{I}_{2}$ (if any), and so on until all the requests are executed. Although this approach guarantees a unique sequential order among all the executed client requests across all the non-faulty replicas, the approach also gives earlier instances disproportional control over execution. We illustrate this next:

*Example 3.2**.*

Consider a financial service in which client requests are of the form

[TABLE]

Let $\texttt{CR}_{1}=\operatorname{transfer}(\text{Alice},\text{Bob},500,200)$ and $\texttt{CR}_{2}=\operatorname{transfer}(\text{Bob},\text{Eve},400,300)$ be client requests. Execution of $\texttt{CR}_{1}$ influences the outcome of execution of $\texttt{CR}_{2}$ : if $200\leq\operatorname{amount}(\text{Bob})<400$ , then execution of $\texttt{CR}_{1}$ before $\texttt{CR}_{2}$ will result in a transfer of $300$ to Eve. If $\texttt{CR}_{1}$ is executed after $\texttt{CR}_{2}$ , then Eve will not receive anything. Hence, by choosing a predictable order of execution, earlier instances in the ordering can influence the execution of any requests accepted by later instances.

To resolve the illustrated shortcoming, we propose a method to deterministically select a different permutation of the order of execution in every round. Note that for any sequence $S$ of $k=\lvert S\rvert$ values, there exist $k!$ distinct permutations. We write $P(S)$ to denote the set of permutations of $S$ . As $\lvert P(S)\rvert=k!$ , there exists a bijection $f_{S}:\{0,\dots,k!-1\}\rightarrow P(S)$ . Next, we define the function $f_{S}$ recursively. We have the following:

[TABLE]

in which $q=i\operatorname{div}(\lvert S\rvert-1)!$ is the quotient and $r=i\operatorname{mod}(\lvert S\rvert-1)!$ is the remainder of integer division by $(\lvert S\rvert-1)!$ .

Lemma 3.3.

Function $f_{S}$ is a bijection from $\{0,\dots,\lvert S\rvert!-1\}$ to all possible permutations of $S$ .

Proof.

The proof is by induction on the size of $S$ . The base case is $\lvert S\rvert=1$ , in which case $f_{S}=\{0\mapsto S\}$ , a bijection. As the induction hypothesis, we assume that $f_{S^{\prime}}$ is a bijection for all $S^{\prime}$ with $\lvert S^{\prime}\rvert\leq j$ . Next, consider the case $f_{S}$ with $\lvert S\rvert=j+1$ . Observe that, for each $s\in S$ , there exist $j!$ permutations of $S$ that end with $s$ . The computation of $q=i\operatorname{div}(\lvert S\rvert-1)!=i\operatorname{div}j!$ and $r=i\operatorname{mod}(\lvert S\rvert-1)!=i\operatorname{mod}j!$ divides all possible values $i$ into $(j+1)$ ranges of $j!$ values each. Hence, each $s\in S$ is chosen via $j!$ different values of $i$ , and we have $j!$ different possible values for $r$ for each choice of $s$ . The function $f_{S}$ chooses $s=S[q]$ . By the induction hypothesis, the function $f_{S\setminus s}(r)$ is a bijection. Thus, we conclude that the function $f_{S}$ is also a bijection. ∎

The result of $f_{S}(\rho\operatorname{mod}\lvert S\rvert!)$ , on a sequence $S$ of all the consensus decisions of round $\rho$ , can be used by all non-faulty replicas as a deterministic order of execution. This approach ensures that each instance receives equal opportunity to propose the client request to be executed first, across all the replicas. However, this approach is highly predictable. Hence, as a further improvement, we can use the value $h=\mathop{\texttt{Hash}}(R)$ , with $R$ the set of client requests accepted in round $\rho$ , instead of the round number $\rho$ to determine the order of execution. Assuming at least one primary is non-malicious ( $\textbf{m}>\textbf{f}$ ), this value $h$ is only known after completion of the round $\rho$ as the malicious primaries cannot effectively collude to obtain a certain order of execution. Figure 3 presents the pseudo-code for the execution protocol.

Proposition 3.4.

The execution protocol of Figure 3 guarantees that every non-faulty replica executes client request in the same order. If $\textbf{m}>\textbf{f}$ , then malicious replicas cannot control the order of execution. If $\textbf{n}>2\textbf{f}$ , then clients will always receive at least $\textbf{f}+1$ identical results (from at least $\textbf{f}+1$ non-faulty replicas) and can reliably detect successful execution of their request.

3.2 Dealing with primary failure

The easiest way to deal with primary failure is by shutting down the instance coordinated by that primary. This approach would work well for some time, as in most practical settings the set of faulty replicas is relatively stable. However, for high availability, we need to consider a more dynamic setting in which we only know that at most f replicas are faulty in a specific window of time (as faulty replicas can recover and non-faulty replicas can become faulty). Hence, we propose two targeted methods to deal with primary failure.

3.2.1 Discarding primaries and in-place recovery

An easy way to deal with faulty primaries is by permitting a certain delay for the failing instance to recover, after which all the non-faulty replicas are instructed to transfer control back to the previously-failed primary. This requires all the non-faulty replicas to agree on a delay and this delay needs to provide the faulty primary sufficient time to recover. If the delay is too short, then it would result in repeated primary failure, which in turn would lead to multiple failed attempts to transfer control back to the primary, an unnecessary cost for all the replicas. To determine the right delay, the replicas can start with a small value (in number of rounds) and double this value after each failure. This approach does not necessitate any coordination between the replicas or between the instances. Further, this approach even works when $\textbf{m}>\textbf{n}-\textbf{f}$ , in which case some primaries might always be faulty while no non-faulty replicas are available to replace them.

3.2.2 Unified primary replacement

A second way to deal with faulty primaries is by replacement. Indeed, when $\textbf{m}\leq\textbf{n}-\textbf{f}$ our paradigm can aim for a stable set of m non-faulty primaries by replacing a failed primary by another available replica. This approach is akin to the one taken by traditional bft-style consensus protocols. In such protocols, the failure of the primary ${P}$ is detected whenever the behavior of ${P}$ prevents successful consensus decisions. After detecting the failure of ${P}$ , all the non-faulty replicas switch to the next primary. This next primary is deterministically selected by choosing the replica following ${P}$ , that is, by choosing the replica ${R}$ with $\mathop{\textsf{id}}({R})=\mathop{\textsf{id}}({P})+1$ .333Recently, a few consensus protocols proposed choosing primaries uniformly at random using a distributed random coin [1]. Such an approach replaces the malicious primaries with a non-faulty replica with high probability. Our parallelization paradigm can easily be extended to use this approach. However, such a selection strategy may not work with our paradigm as we require all the m instances to have distinct primaries.

*Example 3.5**.*

Consider instances $\mathcal{I}_{1}$ and $\mathcal{I}_{2}$ with primaries ${R}=\mathcal{P}_{1}$ and ${Q}=\mathcal{P}_{2}$ with $\mathop{\textsf{id}}({R})=1$ and $\mathop{\textsf{id}}({Q})=2$ . Consider the following two consensus decisions made during round $\rho$ :

$\mathscr{D}_{\rho}=[\textsf{F},\textsf{S}(\texttt{CR})]$ . Here, instance $\mathcal{I}_{1}$ needs to replace its primary. If $\mathcal{I}_{1}$ chooses the replica following ${R}$ , replica ${Q}$ , then both instances end up with the same primary. 2. 2.

$\mathscr{D}_{\rho}=[\textsf{F},\textsf{F}]$ . Here, both the instances need to replace their primaries. If $\mathcal{I}_{1}$ chooses the replica following ${R}$ , replica ${Q}$ , then it will end up with a known faulty primary. If $\mathcal{I}_{1}$ realizes that ${Q}$ was already in use by $\mathcal{I}_{2}$ , then $\mathcal{I}_{1}$ would choose the replica following ${Q}$ . Unfortunately, in such a case, $\mathcal{I}_{2}$ would do the same and both instances would end up with the same primary.

These cases illustrate that primary replacement would fail without coordination between the instances. Hence, this coordination is an essential task in our parallelization paradigm.

We introduce a unified primary replacement protocol, which facilitates coordinated primary replacement among the instances. The protocol requires each non-faulty replica to maintain an internal state $(\textit{failed\/},\textit{primary\/})$ , where $\textit{failed\/}\subseteq\mathscr{F}$ is the set of known faulty replicas, and $\textit{primary\/}:\{1,\dots,\textbf{m}\}\rightarrow(\mathfrak{R}\setminus\textit{failed\/})$ is an injective function that maps each instance $\mathcal{I}_{i}$ , $1\leq i\leq\textbf{m}$ , onto its primary, that is, $\textit{primary\/}(i)=\mathcal{P}_{i}$ . The function primary never maps to known faulty replicas. Figure 4 presents the pseudo-code for this protocol. Formally, the unified primary replacement protocol maintains the following invariants:

Invariant 3.6.

Let $\mathfrak{S}$ be a service. We write $\textit{failed\/}_{\rho}({R})$ and $\textit{primary\/}_{\rho}({R})$ to denote the value of these variables at non-faulty replica ${R}\in\mathop{\textsf{nf}}(\mathfrak{S})$ at the start of round $\rho$ . Further, we use $\textit{primary\/}_{\rho}({R})[i]$ , $1\leq i\leq\textbf{m}$ , to denote the primary of instance $\mathcal{I}_{i}$ . For every ${R},{Q}\in\mathop{\textsf{nf}}(\mathfrak{S})$ , the following properties hold at the start of every round $\rho$ :

$\textit{failed\/}_{\rho}({R})=\{\mathcal{P}_{i,j}\mid(i\mapsto\textsf{F})\in\textsf{F}(\mathscr{D}_{j}),1\leq j<\rho\}$ ; 2. 2.

$\textit{primary\/}_{\rho}({R})$ is an injective function and $\textit{primary\/}_{\rho}({R})[i]\in(\mathfrak{R}\setminus\textit{failed\/})$ , $1\leq i\leq\textbf{m}$ ; 3. 3.

$\textit{primary\/}_{\rho}({R})=\textit{primary\/}_{\rho}({Q})$ and $\textit{failed\/}_{\rho}({R})=\textit{failed\/}_{\rho}({Q})\subseteq\mathscr{F}$ .

Proposition 3.7.

The unified primary replacement protocol of Figure 4 maintains Invariant 3.6.

Proof.

Let $\mathfrak{S}=(\mathfrak{C},\mathfrak{R},\mathscr{F})$ be a service and let ${R},{Q}\in\mathop{\textsf{nf}}(\mathfrak{S})$ . Initially, due to Line 2, we have $\textit{failed\/}_{0}({R})=\textit{failed\/}_{0}({Q})=\emptyset$ and $\textit{primary\/}_{0}({R})[i]=\textit{primary\/}_{0}({Q})[i]={P}_{i}$ , $1\leq i\leq\textbf{m}$ , in which $\mathop{\textsf{id}}({P}_{i})=i$ . Hence, Invariant 3.6 initially holds.

Next, we assume that Invariant 3.6 holds at the start of round $\rho$ , and we prove that it again holds at the start of round $\rho+1$ . At Line 4, the failed primaries in $\textsf{F}(\mathscr{D}_{\rho})$ are added to failed. As the underlying consensus protocol ensures that every replica decides on the same set $\textsf{F}(\mathscr{D}_{\rho})$ , all the non-faulty replicas make identical changes to failed. Hence, we are assured that Invariant 3.6 1 is maintained. The loop at Line 5 will replace every newly detected faulty primary by a freshly chosen primary that is not yet in use (not in Im) and is not a known faulty replica (not in failed). As all non-faulty replicas agree on $\textsf{F}(\mathscr{D}_{\rho})$ , they process values in $\textsf{F}(\mathscr{D}_{\rho})$ in the same order (Line 5). Further, each non-faulty replica choose new primaries deterministically (Line 7) and makes the same changes to primary. Hence, we are assured that Invariants 3.6 2 and 3.6 3 are maintained. Finally, at Line 9 each instance $\mathcal{I}_{i}$ that decided F in round $\rho$ gets assigned a new primary $\textit{primary\/}[i]$ . Thus, we conclude that Invariant 3.6 holds. ∎

For environments in which the set of faulty replicas is ever changing, the unified primary replacement protocol can easily be tweaked such that faulty primaries are eventually reconsidered (for example, by reintroducing the earliest failed replicas after all the other replicas have failed).

4 Optimizing parallelization to increase performance

In the previous section, we presented a paradigm for parallelizing consensus protocols. To simplify the presentation, we presented a step-wise design whose practical implementation would require substantial coordination between instances (for example, via the use of locks). In practice, this step-wise design incurs a lot of waiting, which we illustrate next:

*Example 4.1**.*

Consider service $\mathfrak{S}$ working on a consensus round $\rho$ and let ${R}\in\mathop{\textsf{nf}}(\mathfrak{S})$ be a non-faulty replica. We describe four cases in which the design presented in Section 3 induces waiting:

The execution of client requests consumes time and forces all the instances to wait until completion. 2. 2.

In a practical setting, there can be a variation in message delivery time, which can lead to instances making consensus decisions at different speeds. For example, a temporary hiccup in the network can cause the instance $\mathcal{I}_{1}$ to take twice the time it takes the other instances to make a consensus decision in round $\rho$ . Hence, all the other instances would have to wait for $\mathcal{I}_{1}$ to complete with a successful consensus decision. 3. 3.

A faulty primary $\mathcal{P}_{i}$ coordinating instance $\mathcal{I}_{i}$ , $1\leq i\leq\textbf{m}$ , can actively throttle the speed at which its instance makes a consensus decision, which delays all the other instances. 4. 4.

A primary $\mathcal{P}_{i}$ coordinating instance $\mathcal{I}_{i}$ , $1\leq i\leq\textbf{m}$ , can crash. Existing consensus protocols [11, 26] detect such a crash through large timeout values. This forces the other instances, whose consensus throughput should only be limited by the network latency, to wait for long idle periods for a primary failure to resolve.

Waiting reduces the attainable performance (given the available resources). Fortunately, all the above described forms of waiting can be eliminated from our paradigm. First, in Section 4.1, we describe how to eliminate waiting. This complicates dealing with client requests, and, hence, in Section 4.2, we describe these complications and provide solutions to resolve them.

4.1 Making parallelization wait-free

To ensure the correctness of our parallelization paradigm, we do not require any instance to wait for the other instances. Consider a service working on consensus round $\rho$ . First, we observe that the execution of client requests in round $\rho$ has no influence on the consensus decisions of future rounds. Second, the instances arriving at successful consensus decisions do not require any coordination. The only required coordination between the instances is the unified primary replacement (Section 3.2.2), which is limited to instances with failed primaries. Hence, instances that arrived at successful consensus decisions in the current round are free to make consensus decisions for the future rounds, while the execution of the client requests of previous rounds occurs in the background, and the other instances are still making consensus decisions for the current round. Thus, dealing with Example 4.1, Case 1 and Case 2, is straightforward.

To arrive at a fully wait-free design, we must also address the malicious behaviors described in Example 4.1, Case 3 and Case 4, and deal with any structural differences in speed of the instances. Note that if these behaviors are left unhandled, then they can cause unbounded delays between the acceptance and execution of a client request, which we illustrate next:

*Example 4.2**.*

Consider a service with $\textbf{m}=2$ instances, where instance $\mathcal{I}_{1}$ makes a consensus decision every $10$ \mathrm{\SIUnitSymbolMicro s} $and $\mathcal{I}_{2}$ makes a consensus decision every $20$\mathrm{\SIUnitSymbolMicro s}$ . As $\mathcal{I}_{2}$ operates slower, it determines the speed by which the system can complete a consensus round. Consequently, over time $\mathcal{I}_{1}$ will make consensus decisions of an ever growing set of client requests that are awaiting execution. Further, the delay between $\mathcal{I}_{1}$ accepting a client request and its execution grows without a bound.

Similarly, consider the case in which both instances make consensus decisions every $10$ \mathrm{\SIUnitSymbolMicro s} $. Assume that in round $\rho$ the primary $\mathcal{P}_{2}$ of instance $\mathcal{I}_{2}$ fails. Further, assume it takes $500$\mathrm{\SIUnitSymbolMicro s}$ to detect such a failure and $100$ \mathrm{\SIUnitSymbolMicro s} $to replace the primary. In such a case, $\mathcal{I}_{1}$ would have made $60$ consensus decisions before $\mathcal{I}_{2}$ resumes normal operation. Hence, after round $\rho$ all the client requests accepted by $\mathcal{I}_{1}$ will see an additional delay of $600$\mathrm{\SIUnitSymbolMicro s}$ .

The situations illustrated in Example 4.2 are among the several cases in which the client requests accepted by well-performing instances would see their execution unnecessary delayed due to interference (delays) from other instances. Next, we show how our parallelization paradigm can address these cases using a simple yet effective principle:

Definition 4.3.

We say that an instance $\mathcal{I}_{i}$ , $1\leq i\leq\textbf{m}$ , suffers a soft failure if it is working on a consensus decision in round $\rho$ , while some other instance is already working on a consensus decision in round $\rho+\sigma$ . We call $\sigma$ the gap size, which is determined by the network latency and the timeout used to detect failures.

When an instance $\mathcal{I}_{i}$ , $1\leq i\leq\textbf{m}$ , suffers any type of failure in round $\rho$ , including soft failures, then $\mathcal{I}_{i}$ is excluded from contributing to the next $\varepsilon$ consensus rounds. We call $\varepsilon$ the skip size, which is determined by the gap size and time to replace the faulty primary. Hence, in case of the failure of instance $\mathcal{I}_{i}$ in round $\rho$ , $\mathcal{I}_{i}$ starts making consensus decisions for consensus round $\rho+\varepsilon$ .

The soft failure principle is based on the assumption that all instances coordinated by the non-faulty primaries should reach successful consensus decisions roughly within the same time (as they all operate in the same environment). Hence, instances that lag by a significant margin $\sigma$ could be led by a faulty primary.444A similar assumption is the basis of Rbft [5], see Section 5. Each replica locally detects the soft failure of its instance $\mathcal{I}_{i}$ , $1\leq i\leq\textbf{m}$ , and uses the fault detection infrastructure of the underlying consensus protocol to work towards ending the ongoing consensus round with a decision F. The concept of soft failures addresses the execution delays due to underperforming instances, whereas the skipping of consensus rounds allows previously-failed instances to catch up with the other instances.

Theorem 4.4.

Instances are wait-free: instances can make successful consensus decisions without outside interference and, at the same time, the delay between an instance accepting a client request and the replicas executing this client request is upper-bounded.

Note that if unified primary replacement (Section 3.2.2) is used to deal with primary failures, then natural fluctuations in the performance of an instance can cause an unjust replacement of its primary. Hence, we allow treatment of soft failures as temporarily failures from which primaries can eventually recover: a replica that fails soft can be considered non-faulty when the unified primary replacement protocol runs out of replicas it considers non-faulty.

4.2 Consistent handling of client requests

Consensus protocols facilitate execution of client requests in a consistent manner across all the non-faulty replicas. Usually, consensus protocols can also aim at executing these client requests in the order they were sent by the clients. Maintaining this consistent ordering becomes harder when parallelizing consensus protocols:

*Example 4.5**.*

Consider a service $\mathfrak{S}$ with two instances $\mathcal{I}_{1}$ and $\mathcal{I}_{2}$ and client ${c}$ . If both $\mathcal{I}_{1}$ and $\mathcal{I}_{2}$ are about to make a consensus decision for round $\rho$ and ${c}$ sends a request CR to both instances, then both instances might propose the same request, which would waste resources. If ${c}$ sends requests $\texttt{CR}_{1}$ to $\mathcal{I}_{1}$ and $\texttt{CR}_{2}$ to $\mathcal{I}_{2}$ , then the order in which these requests are executed is subject to decisions made by the execution protocol (Section 3.1). Moreover, due to the wait-free design, instances $\mathcal{I}_{1}$ and $\mathcal{I}_{2}$ can be in completely different rounds when proposing $\texttt{CR}_{1}$ and $\texttt{CR}_{2}$ , again making the order of execution independent of the order in which ${c}$ requested $\texttt{CR}_{1}$ and $\texttt{CR}_{2}$ .

If consistent ordering of execution of client requests is not necessary, e.g., if client requests operate on conflict-free replicated data types [38], then clients can simply send their transactions to arbitrary instances. If consistent ordering is necessary, then the straightforward way to guarantee ordering is to assign each client to a unique instance.

Let $\mathfrak{S}=(\mathfrak{C},\mathfrak{R},\mathscr{F})$ be a service. If we assume $\lvert\mathfrak{C}\rvert>\textbf{m}$ , then we can assign clients to instances in a round-robin manner by requiring that the instance $\mathcal{I}_{i}$ , $1\leq i\leq\textbf{m}$ , only deals with client requests of clients ${c}\in\mathfrak{C}$ with $i=\mathop{\textsf{id}}({c})\operatorname{mod}\textbf{m}$ . We notice that a client ${c}$ can be assigned to an instance with a faulty primary that might ignore the client request. In existing consensus protocols, this behavior eventually leads to the primary being detected as faulty. Hence, based on how we deal with the faulty primaries there are two ways to guarantee service for ${c}$ .

If faulty primaries are replaced (Section 3.2.2), then unified primary replacement assures that eventually a non-faulty primary will coordinate the instance and propose the requests of ${c}$ . 2. 2.

If faulty primaries are not replaced (Section 3.2.1), then the requests of client ${c}$ could get indefinitely ignored by a faulty primary that never recovers. In such a case, we allow ${c}$ to switch to another instance $\mathcal{I}_{i}$ , $1\leq i\leq\textbf{m}$ . To do so, ${c}$ sends an instance-change request to $\mathcal{I}_{i}$ . If $\mathcal{I}_{i}$ gets this request, then it adds the change-request to reassign ${c}$ to $\mathcal{I}_{i}$ to the consensus decision it is going to make in $\rho$ by proposing to all replicas to reassign the client to $\mathcal{I}_{i}$ in round $\rho+2\sigma$ . After this proposal is requested, instance $\mathcal{I}_{i}$ is able to propose client requests for ${c}$ after round $\rho+2\sigma$ is executed. To assure balanced load among instances, a non-faulty instance only has to accept an instance-change request if it does not yet have $\lceil\lvert\mathfrak{C}\rvert/\lvert\mathop{\textsf{nf}}(\mathfrak{S})\rvert\rceil$ clients assigned.

4.3 Overview of wait-free designs

We have explored several different designs for the parallelization of consensus protocols, each resulting in a valid and highly parallelized consensus protocol. Next, we summarize our findings.

Theorem 4.6.

The parallelization paradigm we propose can turn a general consensus protocol into a high-performance parallelized wait-free consensus protocol in which every client will eventually see its requests executed, the load of non-faulty replicas is evenly distributed, and the impact of faulty replicas is minimized. Additionally, the parallelization paradigm we propose can also turn partial consensus protocols into high-performance parallelized wait-free consensus protocols.

Proof.

We consider the following instances of the parallelization paradigm based on how one deals with primary failure and clients:

failed primaries are not replaced and clients can request instance reassignment; and 2. 2.

primaries are replaced and clients are assigned statically to instances.

In both cases, non-divergence of the parallelized protocol follows from non-divergence of the underlying consensus protocol, which assures that all replicas reach the same consensus decision for each instance in each round, and the deterministic round execution of the execution protocol (Proposition 3.4). Next, we look at termination, which follows directly from termination of the underlying consensus protocol.

The equal distribution of load among all replicas follows from the uniform assignment of clients to each instance (when primaries are replaced) and by the upper-bound on the assigned clients when client can request instance reassignment. The impact of faulty replicas is minimized due to the resulting protocol being wait-free (Theorem 4.4). ∎

5 Related work

There is an abundant literature on consensus protocols and, in specific, primary-backup consensus protocols (e.g., [6, 17, 39, 40, 33, 22, 9]). In this paper, we primarily focus on works that address the limitations of primary-backup protocols, as described in the Introduction. Several different approaches towards resolving some of these limitations have been considered in the literature.

Leader-free protocols.

Several leader-free protocols have been proposed to eliminate any issues arising from the discrepancy in responsibilities between primaries and backups, especially with respect to malicious primary behavior [30, 8]. In these leader-free designs, all replicas have the same responsibilities, the same load, and have the same impact on the system performance. Unfortunately, these leader-free designs come at high communication costs, making their practical usage limited. Recently, HoneyBadgerBFT proposed a leader-free design based on expensive asynchronous broadcast protocols and reduce the amortized per-request communication cost by making the batch size a function of the communication complexity of the protocol. We believe that simpler and more efficient primary-backup designs are more suitable for high-performance applications, and we view fine-tuning their performance (e.g., via parallelization as explored in this paper, or by applying amortized optimizations such as explored in HoneyBadgerBFT) as a more promising avenue for further development.

The Proof-of-Work (PoW) protocol and other similar protocols employed by cryptocurrencies [31, 43, 22, 17] are also leader-less. These protocols can be employed in permissionless environments in which participants can join and leave at any time [35]. Unfortunately, for many practical applications the computational costs of PoW are too high and the throughput too low [16, 42, 34]. As permissioned bft-style protocols outperform PoW by several orders of magnitudes, this rules out their usage in the permissioned setting we study in this paper.

Redistribution of tasks.

Recently, several complex consensus protocols have been proposed that use cryptographic techniques to reduce the communication costs of bft-style consensus [44, 20], which especially adds burden on the primary. In LinBFT, this is partially addressed by deferring some of the primary tasks to other replicas. Unfortunately, this complicates the design significantly, as the protocol not only has to detect and replace faulty primaries, but also detect and compensate for deferred faulty replicas. Moreover such deferral techniques are highly protocol-specific and only address issues related to the primary load, not to the costs of primary replacement or other malicious behavior by the primary. Other designs, such as FastBFT [28], uses tree-based overlay networks and efficient message aggregation to reduce the total communication cost of the protocol. The usage of overlay networks is orthogonal to our approach and can reduce communication of the consensus protocol on which our paradigm relies.

Reducing malicious behavior.

Several works have observed that traditional bft-style consensus protocols only address a narrow set of malicious behavior, namely behavior that prevents any progress [5, 14, 2, 41]. Hence, several designs have been proposed to also address behavior that impedes performance without completely preventing progress. One such design is Rbft, which uses parallelization not to improve performance—as we propose—but only to detect malicious behavior. In practice, the design of Rbft results in poor performance at high costs. Another design is Spinning [41], which proposes to replace the primary every round. This would not incur the costs of Rbft, while still reducing the impact of faulty replicas to severely reduce throughput. In our parallelization paradigm, however, we provide wait-free consensus to instances with non-faulty primaries, which allows those instances to always process client requests at maximum throughput.

6 Conclusions and future work

In this paper, we propose a novel paradigm for parallelizing consensus protocols in a wait-free manner, thereby improving the system throughput by reducing the load on individual replicas and sharply reducing the impact of faulty replicas. Our techniques are protocol-agnostic, adjustable to several settings, and can be combined with many readily available primary-backup consensus protocols. Hence, our paradigm opens the door for the development of new and highly performant permissioned blockchain applications.

Bibliography44

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] Ittai Abraham, Srinivas Devadas, Danny Dolev, Kartik Nayak, and Ling Ren. Synchronous byzantine agreement with expected 𝒪 ( 1 ) 𝒪 1 \mathcal{O}(1) rounds, expected 𝒪 ( n 2 ) 𝒪 superscript 𝑛 2 \mathcal{O}(n^{2}) communication, and optimal resilience, 2018.
2[2] Yair Amir, Brian Coan, Jonathan Kirsch, and John Lane. Prime: Byzantine replication under attack. IEEE Transactions on Dependable and Secure Computing , 8(4):564–577, 2011. doi:10.1109/TDSC.2010.70 . · doi ↗
3[3] GSM Association. Blockchain for development: Emerging opportunities for mobile, identity and aid, 2017. URL: https://www.gsma.com/mobilefordevelopment/wp-content/uploads/2017/12/Blockchain-for-Development.pdf .
4[4] Pierre-Louis Aublin, Rachid Guerraoui, Nikola Knežević, Vivien Quéma, and Marko Vukolić. The next 700 BFT protocols. ACM Transactions on Computer Systems , 32(4):12:1–12:45, 2015. doi:10.1145/2658994 . · doi ↗
5[5] Pierre-Louis Aublin, Sonia Ben Mokhtar, and Vivien Quéma. RBFT: Redundant byzantine fault tolerance. In 2013 IEEE 33rd International Conference on Distributed Computing Systems , pages 297–306. IEEE, 2013. doi:10.1109/ICDCS.2013.53 . · doi ↗
6[6] Christian Berger and Hans P. Reiser. Scaling byzantine consensus: A broad analysis. In Proceedings of the 2Nd Workshop on Scalable and Resilient Infrastructures for Distributed Ledgers , SERIAL’18, pages 13–18. ACM, 2018. doi:10.1145/3284764.3284767 . · doi ↗
7[7] Burkhard Blechschmidt. Blockchain in Europe: Closing the strategy gap. Technical report, Cognizant Consulting, 2018. URL: https://www.cognizant.com/whitepapers/blockchain-in-europe-closing-the-strategy-gap-codex 3320.pdf .
8[8] Fatemeh Borran and André Schiper. A leader-free byzantine consensus algorithm. In Distributed Computing and Networking , pages 67–78. Springer Berlin Heidelberg, 2010. doi:10.1007/978-3-642-11322-2_11 . · doi ↗

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Revisiting consensus protocols through wait-free parallelization111A brief announcement of this work will be presented at the 33rd International Symposium on Distributed Computing (DISC 2019) [21].

Abstract

1 Introduction

Organization.

2 Preliminaries

Service notation.

Consensus protocol.

Sequence and map notations.

Cryptographic primitives.

3 Parallelizing consensus

Definition 3.1**.**

3.1 Deterministic round execution

Example 3.2*.*

Lemma 3.3**.**

Proof.

Proposition 3.4**.**

3.2 Dealing with primary failure

3.2.1 Discarding primaries and in-place recovery

3.2.2 Unified primary replacement

Example 3.5*.*

Invariant 3.6**.**

Proposition 3.7**.**

Proof.

4 Optimizing parallelization to increase performance

Example 4.1*.*

4.1 Making parallelization wait-free

Example 4.2*.*

Definition 4.3**.**

Theorem 4.4**.**

4.2 Consistent handling of client requests

Example 4.5*.*

4.3 Overview of wait-free designs

Theorem 4.6**.**

Proof.

5 Related work

Leader-free protocols.

Redistribution of tasks.

Reducing malicious behavior.

6 Conclusions and future work

Definition 3.1.

*Example 3.2**.*

Lemma 3.3.

Proposition 3.4.

*Example 3.5**.*

Invariant 3.6.

Proposition 3.7.

*Example 4.1**.*

*Example 4.2**.*

Definition 4.3.

Theorem 4.4.

*Example 4.5**.*

Theorem 4.6.