On Mixing Eventual and Strong Consistency: Acute Cloud Types

Maciej Kokociński; Tadeusz Kobus; Paweł T. Wojciechowski

arXiv:1905.11762·cs.DC·January 15, 2021


TL;DR

This paper introduces acute cloud types (ACTs), a formal model for distributed systems that combine eventual and strong consistency, highlighting unique phenomena and impossibility results in mixed-consistency systems.

Contribution

It formalizes ACTs, demonstrates their properties, and proves an impossibility result regarding operation reordering in mixed-consistency systems.

Findings

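01 ACTs can rely on efficient quorum-based protocols, such as Paxos, for their strongly consistent operations.

02 Temporary operation reordering implies interim disagreement between replicas on the relative order in which client requests were executed.

03 Introducing strongly consistent operations to an eventually consistent system can weaken the guarantees on the eventually consistent operations.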


Equations

$\textsc{EV} \stackrel{\text{def}}{=} \forall e \in E : \left| \{ e' \in E : e \overset{\mathsf{rb}}{\rightarrow} e' \land e \overset{\mathsf{vis}}{\nrightarrow} e' \} \right| < \infty$

$\textsc{NCC} \stackrel{\text{def}}{=} \mathrm{acyclic}(\mathsf{hb})$

$\textsc{RVal}(F) \stackrel{\text{def}}{=} \forall e \in E : \mathrm{rval}(e) = F(\mathrm{op}(e), \mathrm{context}(A, e))$

$\textsc{BEC}(F) \stackrel{\text{def}}{=} \textsc{EV} \land \textsc{NCC} \land \textsc{RVal}(F)$

$\textsc{FRVal}(F) \stackrel{\text{def}}{=} \forall e \in E : \mathrm{rval}(e) = F(\mathrm{op}(e), \mathrm{fcontext}(A, e))$

$\textsc{CPar} \stackrel{\text{def}}{=} \forall e \in E : \left| \{ e' \in E_{e} : \mathrm{rank}(\mathsf{vis}^{-1}(e'), \mathsf{par}(e'), e) \neq \mathrm{rank}(\mathsf{vis}^{-1}(e'), \mathsf{ar}, e) \} \right| < \infty$

$\textsc{FEC}(F) \stackrel{\text{def}}{=} \textsc{EV} \land \textsc{NCC} \land \textsc{FRVal}(F) \land \textsc{CPar}$


Full text


Maciej Kokociński, Tadeusz Kobus, Paweł T. Wojciechowski

The authors are with the Institute of Computing Science, Poznan University of Technology, 60-965 Poznań, Poland. E-mail: {Maciej.Kokocinski,Tadeusz.Kobus,Pawel.T.Wojciechowski}@cs.put.edu.pl

This work was supported by the Foundation for Polish Science, within the TEAM programme co-financed by the European Union under the European Regional Development Fund (grant No. POIR.04.04.00-00-5C5B/17-00). Kokociński and Kobus were also supported by the Polish National Science Centre (grant No. DEC-2012/07/B/ST6/01230) and partially by the internal funds of the Faculty of Computing, Poznan University of Technology.

Abstract

In this article we study the properties of distributed systems that mix eventual and strong consistency. We formalize such systems through acute cloud types (ACTs), abstractions similar to conflict-free replicated data types (CRDTs), which by default work in a highly available, eventually consistent fashion, but which also feature strongly consistent operations for tasks which require global agreement. Unlike other mixed-consistency solutions, ACTs can rely on efficient quorum-based protocols, such as Paxos. Hence, ACTs gracefully tolerate machine and network failures also for the strongly consistent operations. We formally study ACTs and demonstrate phenomena which are neither present in purely eventually consistent nor strongly consistent systems. In particular, we identify temporary operation reordering, which implies interim disagreement between replicas on the relative order in which the client requests were executed. When not handled carefully, this phenomenon may lead to undesired anomalies, including circular causality. We prove an impossibility result which states that temporary operation reordering is unavoidable in mixed-consistency systems with sufficiently complex semantics. Our result is startling, because it shows that apparent strengthening of the semantics of a system (by introducing strongly consistent operations to an eventually consistent system) results in the weakening of the guarantees on the eventually consistent operations.

Index Terms:

eventual consistency, mixed consistency, fault-tolerance, acute cloud types, ACT


1 Introduction

The massive scalability and high availability of the complex (geo-replicated) distributed systems that power today’s Internet often hinge on the use of eventually consistent data stores. These systems extensively employ specialized data structures, e.g., last-write-wins registers (LWW-registers), multi-value registers (MVRs), observed-remove sets (OR-sets) or other conflict-free replicated data types (CRDTs) [1] [2] [3]. These data structures are replicated on multiple machines (replicas) and can be read or modified independently on each site without prior synchronization with other replicas. This means that replicas can promptly respond to the clients. The communication between the replicas happens solely using a gossip protocol. By design, replicas are guaranteed to be able to converge to a single state, automatically resolving any inconsistencies between them.

Unfortunately, the semantics of such data structures are very limited. To provide high availability, low-latency responses and eventual state convergence, these data structures require either that all operations commute or that there exist commutative, associative, and idempotent procedures for merging replica states. This is why these mechanisms are not suitable for all use cases. For example, consider a simple non-negative integer counter. The addition operation can be trivially implemented in a conflict-free manner, as the addition operations are commutative. However, implementing the subtraction operation requires global agreement to ensure that the value of the counter never drops below 0. In a similar way, in an auction system concurrent bids can be considered independent operations and thus their execution does not need to be synchronized. However, the operation that closes the auction requires solving distributed consensus to select the single winning bid [4].
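The counter example can be made concrete with a minimal Python sketch. It is an illustration only and is not taken from the article; the function names `apply_adds` and `local_subtract_ok` are hypothetical. It shows that additions yield the same state under every delivery order, whereas two concurrent subtractions that each pass a purely local non-negativity check can jointly drive the counter below 0.

```python
from itertools import permutations


def apply_adds(initial, amounts):
    """Apply additions in the given order; addition commutes, so order is irrelevant."""
    value = initial
    for amount in amounts:
        value += amount
    return value


def local_subtract_ok(observed_value, amount):
    """Purely local check (hypothetical): safe only if nobody subtracts concurrently."""
    return observed_value >= amount


# Every delivery order of the same additions yields the same final state.
results = {apply_adds(0, order) for order in permutations([5, 3, 2])}

# Two replicas both observe the value 10 and each approves subtract(7) locally.
observed = 10
approved = [amount for amount in (7, 7) if local_subtract_ok(observed, amount)]
final_value = observed - sum(approved)

print(results)      # a single converged value
print(final_value)  # negative: the invariant is violated without coordination
```

Here both subtractions are individually safe but not jointly safe, which is why the subtraction operation needs agreement among the replicas.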

Due to the inherent shortcomings of CRDTs and solutions similar to them, recently there have been several attempts, both in academia (e.g., [5] [6] [7] [8] [9] [10] [11]) and in the industry (e.g., [12] [13] [14] [15]), to enrich the semantics of eventually consistent systems by allowing some operations to be performed with stronger consistency guarantees or by introducing (quasi) transactional support. Crucially, none of the mixed-consistency approaches we are aware of is flexible enough to: (a) account for very weak consistency models (weaker than causal consistency, which is known to be costly to achieve in practice [16]), (b) admit strongly consistent operations which do not require all replicas to be operational in order to complete (so as to gracefully tolerate failures), and (c) provide clearly stated semantics that enables easy reasoning about the system-wide guarantees. The latter trait is especially important when the same data can be accessed at the same time both in a strongly and eventually consistent fashion, which is notoriously difficult to implement. For example, in Apache Cassandra, using lightweight transactions on data that are accessed at the same time in the regular, eventually consistent fashion leads to undefined behaviour [17].

In this article we introduce acute cloud types (ACTs), a family of specialized mixed-consistency data structures designed primarily for high availability and low latency, but which also seamlessly integrate on-demand strongly consistent semantics. ACTs feature two kinds of operations:

• weak operations, targeted at unconstrained scalability and low-latency responses (like operations in CRDTs),

• strong operations, used when eventually consistent guarantees are insufficient; they require consensus-based inter-replica synchronization prior to execution.

Weak operations are guaranteed to progress, and are handled in such a way that the replicas eventually converge to the same state within each network partition, even when strongly consistent operations cannot complete due to network and process failures. On the other hand, strong operations can provide guarantees even as strong as linearizability [18] with respect to the already completed strong operations and a precisely defined subset of completed weak operations. Crucially, strong operations are non-blocking: they can leverage efficient, quorum-based synchronization protocols, such as Paxos [19], and thus gracefully tolerate machine and network failures. Both weak and strong operations can be arbitrarily complex, but they must be deterministic.

Our approach is more robust than other mixed-consistency solutions. Most notably, unlike classic cloud types [20] and global sequence protocol (GSP) [21], ACTs are symmetrical in the sense that they do not assume the existence of a server or servers that mediate all communication between remote replicas. This has several advantages: a failure of a replica or a group of replicas cannot impede the ability of other ACT replicas to execute weak operations and propagate the resulting updates. Also, ACTs can better tolerate network splits by allowing the replicas in the minority partitions to execute weak operations and exchange the resulting updates. Furthermore, unlike the RedBlue consistency model [6] and approaches similar to it (e.g., [22] [10] [11]), ACTs support consistency guarantees weaker than causal consistency, and thus account for a wider range of systems. Crucially, ACTs do not require all replicas to be operational in order for the strong operations to complete, contrary to the approaches mentioned above. This latter trait has been fundamental to the design of ACTs.

In order to provide an easy to understand yet flexible consistency model that allows weak operations to be executed in a highly available and scalable manner, we require that in any run of an ACT, logically, there always exists a single global order $S$ of all operations. During execution, strong operations are guaranteed to observe the prefix of $S$ up to their position in $S$ . A weak operation may observe a serialization $S^{\prime}$ of operations that diverges from $S$ , but only by a finite number of elements. Thus weak and strong operations are interconnected in a non-trivial way, which intuitively ensures write stabilization: once a strong operation, during its execution, observes some weak operations $\mathit{op}_{i}$ , $\mathit{op}_{j}$ in that order, all subsequent strong operations, and eventually all weak operations, will also observe $\mathit{op}_{i}$ , $\mathit{op}_{j}$ in that order. It is so even though weak operations never have to directly synchronize with strong operations (e.g., by blocking on the completion of strong operations).

We propose a framework that enables formal reasoning about ACTs and their guarantees. We express the dependencies between operations through the visibility and arbitration relations, similarly to [23], but we allow each operation to observe the arbitration in a temporarily inconsistent (but eventually convergent) form. In order to capture the unique properties of ACTs and write stabilization in particular, we define a novel correctness condition called fluctuating eventual consistency (FEC) that is strictly weaker than Burckhardt’s Basic Eventual Consistency (BEC) [7].

By formally specifying ACTs, we uncovered several interesting phenomena unique to mixed-consistency systems (they are never exhibited by popular NoSQL systems, which only guarantee eventual consistency, nor by strongly consistent solutions). Crucially, some ACTs exhibit a phenomenon that we call temporary operation reordering, which happens when the replicas temporarily disagree on the relative order in which the requests (modelled as operations) submitted to the system were executed. When not handled carefully, temporary operation reordering may lead to all kinds of undesired situations, e.g., circular causality among the responses observed by the clients. As we formally prove, temporary operation reordering is not present in all ACTs but in some cases cannot be avoided. This impossibility result is startling, because it shows that apparent strengthening of the semantics of a system (by introducing strong operations to an eventually-consistent system) results in the weakening of the guarantees on the eventually-consistent operations.

In order to illustrate our concepts and analysis, we present an ACT for a non-negative counter and also revisit Bayou [24], a seminal, always available, eventually consistent data store. Bayou combines timestamp-based eventual consistency [25] and serializability [26] by speculatively executing transactions submitted by clients and having a primary replica periodically stabilize the transactions (establish the final transaction execution order). We show how Bayou can be improved to form a general-purpose ACT.

1.1 Contribution summary

1. We define acute cloud types, a family of specialized mixed-consistency data structures designed primarily for high availability and low latency, but which also seamlessly integrate on-demand strongly consistent semantics achieved through quorum-based consensus protocols. Weak and strong operations in ACTs are interconnected in a non-trivial way, which intuitively ensures write stabilization.

2. We identify a range of traits unique to some ACTs. Most importantly, we define temporary operation reordering, a situation in which there is an interim disagreement between replicas on the relative order in which the client requests were executed.

3. We propose a framework that enables formal reasoning about ACTs and their guarantees. In particular, our framework allows us to formalize temporary operation reordering and propose a correctness condition called fluctuating eventual consistency which adequately captures the guarantees provided by ACTs that exhibit this phenomenon.

4. We use our framework to prove a number of formal results regarding ACTs. Crucially, we show an impossibility result that states that temporary operation reordering is not present in all ACTs, but in some cases cannot be avoided.

5. We revisit the seminal Bayou system, study its consistency guarantees, and show how it can be improved to form a general-purpose ACT.

1.2 Article structure

The article is organized as follows. In Section 2 we explain ACTs through examples: an acute non-negative counter and an adaptation of Bayou that forms a general-purpose ACT. We formally define ACTs in Section 3, and introduce the formal framework for reasoning about their correctness in Section 4. In Section 5 we define FEC, our new correctness criterion, and prove the correctness of our example ACTs. Next, in Section 6, we give our impossibility result. We discuss related work in Section 7, and conclude in Section 8.

A brief announcement of this article appeared in [27].

2 Acute cloud types by examples

2.1 Acute non-negative counter

As we mentioned in Section 1, a non-negative integer counter cannot be implemented as a classic CRDT because the subtraction operation requires global coordination to ensure that the value of the counter never drops below 0. In Algorithm 1 we present an acute non-negative integer counter (ANNC), a simple ACT implementing such a counter. The $\mathrm{add}$ (line 5) and $\mathrm{get}$ (line 32) operations can be weak and thus always ensure low latency responses, whereas $\mathrm{subtract}$ (line 11) must be a strong operation to ensure the semantics of a non-negative counter. The crux of ANNC lies in using two complementary protocols for exchanging updates (a gossip one and one that establishes the ultimate operation serialization), and calculating the state of the counter by liberally counting $\mathrm{add}$ operations and conservatively counting the $\mathrm{subtract}$ operations.

To track the execution of weak and strong operations, each ANNC replica maintains three variables (line 2): one for subtraction operations ( $\mathsf{strongSub}$ ) and two for the addition operations ( $\mathsf{weakAdd}$ and $\mathsf{strongAdd}$ ). The replicas exchange the information about new ADD requests (weak updating operations) using a gossip protocol (modelled using reliable broadcast, RB [28]) as well as a protocol that involves inter-replica synchronization (modelled using total order broadcast, TOB [29], which can be efficiently implemented using quorum-based protocols, such as Paxos [19]; lines 9-10). The $\mathrm{subtract}$ operation, which, unlike the $\mathrm{add}$ operation, does not commute, solely uses TOB. Upon receipt of a $\mathrm{TOB{\text{-}}cast}$ SUBTRACT message, the subtract operation completes successfully only if we are certain that the value of the counter does not drop below 0, i.e., when the aggregated value of all confirmed addition operations ( $\mathsf{strongAdd}$ ) is greater than or equal to the aggregated value of all subtract operations ( $\mathsf{strongSub}$ ) increased by $\mathsf{value}$ (lines 26-28). Only the confirmed additions can be used here, because they are the same on every replica at the point at which the SUBTRACT message is delivered, so all replicas reach the same decision.

We ensure that on any replica and for any ADD request $r$ , the $\mathrm{RB{\text{-}}deliver}(r)$ event always happens before the $\mathrm{TOB{\text{-}}deliver}(r)$ event (lines 17–18 and 22–23). This way $\mathsf{weakAdd}\geq\mathsf{strongAdd}$ . Hence, we solely use $\mathsf{weakAdd}$ as the approximation of the total value added to ANNC when calculating the return value for the $\mathrm{get}$ operations.
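The inequality above is enough to see why the value returned by $\mathrm{get}$ is never negative. The short derivation below is a sketch under two assumptions that are not spelled out in this passage: that $\mathrm{get}$ returns $\mathsf{weakAdd}-\mathsf{strongSub}$, and that a SUBTRACT of $\mathsf{value}$ is accepted only when $\mathsf{strongAdd}\geq\mathsf{strongSub}+\mathsf{value}$ holds at the moment of its TOB delivery.

```latex
\begin{align*}
\mathsf{weakAdd} &\geq \mathsf{strongAdd}
  && \text{(RB-deliver precedes TOB-deliver for every ADD)} \\
\mathsf{strongAdd} &\geq \mathsf{strongSub}
  && \text{(assumed acceptance check, preserved by every accepted SUBTRACT)} \\
\mathrm{get}() = \mathsf{weakAdd} - \mathsf{strongSub}
  &\geq \mathsf{strongAdd} - \mathsf{strongSub} \geq 0
\end{align*}
```

The second line holds initially (both variables are 0), is preserved by additions (which only increase $\mathsf{strongAdd}$), and is preserved by an accepted subtraction because of the assumed check.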

Using a gossip protocol allows us to achieve propagation of weak updating operations within network partitions, when synchronization involving solving distributed consensus is not possible. On the other hand, when solving distributed consensus is possible, replicas can agree on the final order in which operations will be visible. This way weak operations $\mathrm{add}$ and $\mathrm{get}$ are highly available, i.e., they always execute in a constant number of steps and do not depend on waiting on communication with other replicas. Crucially, the return value of the $\mathrm{get}$ operation always reflects all the $\mathrm{add}$ operations performed locally and, eventually, all $\mathrm{add}$ operations performed within the network partition to which the replica belongs, if such a partition exists. On the other hand, the strong $\mathrm{subtract}$ operation is applied only if the replicas agree that it is safe to do so.
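The behaviour described above can be sketched in a few lines of Python. This is an illustrative reconstruction from the prose, not the article's Algorithm 1: the class and method names are hypothetical, message transport is simulated by direct method calls, and the sketch assumes that the subtraction check uses only TOB-confirmed additions and that $\mathrm{get}$ returns the gossiped additions minus the accepted subtractions.

```python
class AnncReplica:
    """One replica of a non-negative counter (illustrative sketch)."""

    def __init__(self):
        self.weak_add = 0     # value of all ADDs delivered by gossip (RB)
        self.strong_add = 0   # value of all ADDs delivered by TOB
        self.strong_sub = 0   # value of all accepted SUBTRACTs (TOB only)
        self._rb_seen = set() # ids of ADD requests already counted in weak_add

    def rb_deliver_add(self, req_id, value):
        # Count each ADD at most once on the weak (gossip) path.
        if req_id not in self._rb_seen:
            self._rb_seen.add(req_id)
            self.weak_add += value

    def tob_deliver_add(self, req_id, value):
        # Make sure the RB-deliver effect happens before the TOB-deliver effect.
        self.rb_deliver_add(req_id, value)
        self.strong_add += value

    def tob_deliver_subtract(self, value):
        # Depends only on TOB-delivered state, so every replica decides alike.
        if self.strong_add >= self.strong_sub + value:
            self.strong_sub += value
            return True
        return False

    def get(self):
        # Liberal view of additions, conservative view of subtractions.
        return self.weak_add - self.strong_sub


def tob_deliver(replica, message):
    """Dispatch one totally ordered message (hypothetical helper)."""
    kind, req_id, value = message
    if kind == "add":
        replica.tob_deliver_add(req_id, value)
        return None
    return replica.tob_deliver_subtract(value)


r1, r2 = AnncReplica(), AnncReplica()
r1.rb_deliver_add("a1", 5)        # add(5) invoked at r1; gossip has not reached r2
early = (r1.get(), r2.get())      # replicas temporarily differ

# TOB delivers the same messages in the same order at both replicas.
tob_order = [("sub", None, 3), ("add", "a1", 5), ("sub", None, 3)]
results_r1 = [tob_deliver(r1, m) for m in tob_order]
results_r2 = [tob_deliver(r2, m) for m in tob_order]
print(early, results_r1, results_r2, r1.get(), r2.get())
```

In this run the first subtraction is rejected on both replicas because the addition has not yet been confirmed, even though `r1` already reflects it in `get`; the second subtraction succeeds on both, and the replicas converge.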

ANNC guarantees a property which is a conjunction of basic eventual consistency (BEC) [7] [23] for weak operations ( $\mathrm{add}$ and $\mathrm{get}$ ) and linearizability (Lin) [18] for strong operations ( $\mathrm{subtract}$ ). We formalize BEC and Lin in Sections 5.2 and 5.5, and prove the correctness of ANNC in Section 5.6.

2.2 Bayou

Bayou was an experimental system, so it was never optimized for performance. However, due to its unique approach to speculative execution of transactions and their later stabilization (establishing the final transaction execution order by a primary replica), examining Bayou allows us to discuss various problematic phenomena that stem from having both weak and strong semantics in a single system. We improve Bayou to form a general-purpose, albeit not performance-optimized, ACT.

2.2.1 Protocol overview

Below we give a high-level description of the Bayou protocol. An interested reader may find a detailed description of Bayou (together with pseudocode) in Appendix A.1.

In order to make our analysis more general, we abstract certain aspects of the original protocol. Crucially, we allow clients to submit to Bayou replicas deterministic, arbitrarily complex (also as complex as, e.g., SQL transactions) operations that can provide the clients with a return value. Each operation is either weak or strong, similarly to operations in ANNC. Any weak operation is non-blocking with respect to network communication, because it is executed locally without any coordination with other replicas, but its ultimate impact on the system’s state might differ from what the client can infer from the return value (if the stabilized execution happened differently than the speculative one). On the other hand, the return value of a strong operation results from a prior inter-replica synchronization and thus can be trusted to never change.

In Bayou, each server speculatively total-orders all received operations using a simple timestamp-based mechanism and without prior agreement with other servers. A unique timestamp is generated by the replica upon receipt of an operation from the client. An operation tagged with the timestamp is then sent to all other replicas using some gossip protocol. When a replica has a new operation $\mathit{op}$ ( $\mathit{op}$ was directly submitted by a client or it has been received from another replica), the replica first determines the suitable execution order for $\mathit{op}$ . If $\mathit{op}$ has the highest timestamp of all operations executed so far by the replica, it simply executes $\mathit{op}$ (and, if $\mathit{op}$ was submitted to the replica by a client and $\mathit{op}$ is a weak operation, the replica provides the client with a return value). Otherwise, the replica rolls back all operations that have higher timestamps than $\mathit{op}$ (starting from the one with the highest timestamp), executes $\mathit{op}$ and reexecutes the rolled back operations according to their timestamps. This way a single total order consistent with operation timestamps is always maintained by all replicas.
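The rollback-and-reexecute step can be sketched as follows. This is a minimal Python illustration under stated assumptions, not Bayou's actual code: the class `SpeculativeReplica` and the helper `append` are hypothetical, operations are modelled as pure functions from state to state, timestamps are assumed unique, and rollback is modelled by discarding saved intermediate states.

```python
import bisect


class SpeculativeReplica:
    """Keeps executed operations in timestamp order, rolling back when needed."""

    def __init__(self, initial_state=()):
        self.timestamps = []           # timestamps of executed operations, ascending
        self.ops = []                  # operations, aligned with self.timestamps
        self.states = [initial_state]  # states[i] is the state before ops[i] ran
        self.reexecuted = 0            # how many operations had to be run again

    def receive(self, timestamp, op):
        pos = bisect.bisect_left(self.timestamps, timestamp)
        # Roll back every operation with a higher timestamp than op.
        redo = self.ops[pos:]
        del self.states[pos + 1:]
        self.timestamps.insert(pos, timestamp)
        self.ops[pos:] = [op] + redo
        # Execute op, then reexecute the rolled-back operations in timestamp order.
        for operation in self.ops[pos:]:
            self.states.append(operation(self.states[-1]))
        self.reexecuted += len(redo)

    @property
    def state(self):
        return self.states[-1]


def append(label):
    """Hypothetical operation: record that `label` was executed."""
    return lambda state: state + (label,)


replica = SpeculativeReplica()
replica.receive((2, "R1"), append("u1"))
replica.receive((3, "R2"), append("u3"))
replica.receive((1, "R2"), append("u2"))  # low timestamp: u1 and u3 are rolled back
print(replica.state, replica.reexecuted)
```

The last call illustrates the cost discussed next: a single late operation with a low timestamp forces every later operation to be executed again.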

The above approach has two major downsides. The first one concerns the performance: every time a replica receives an operation with a relatively low timestamp (compared to the timestamps of the operations executed most recently), to maintain the correct execution order, many operations need to be rolled back and reexecuted. The second downside is related to the provided guarantees: a client that submitted an operation $\mathit{op}$ and already received a response can never be sure that there will be no other operation $\mathit{op}^{\prime}$ with a lower timestamp than $\mathit{op}$ , which will eventually cause $\mathit{op}$ to be reexecuted, thus producing possibly a different return value.

In order to mitigate the above two problems, one of the replicas, called the primary, periodically communicates to the other replicas the final operation execution order, which is a growing prefix of the operations already executed by the primary. Other replicas always honour the decision made by the primary, which may force them to adjust their local operation execution orders by rolling back and reexecuting some operations. When there are no major communication delays between replicas, the final operation execution order established by the primary does not deviate much from the order resulting from operation timestamps. Hence, replicas do not need to perform many rollbacks and reexecutions. Moreover, the return values obtained during speculative execution are mostly correct, i.e., the same as the return values obtained during execution of the operations according to the final operation execution order. Once the final execution order for some operation $\mathit{op}$ (weak or strong) is established, $\mathit{op}$ will never be reexecuted. If $\mathit{op}$ is a strong operation, it is now safe to provide the client with the return value.

Intuitively, the replicas converge to the same state, which is reflected by the prefix of operations established by the primary (called the $\mathsf{committed}$ list of operations) and the sequence of other operations ordered according to their timestamps (the $\mathsf{tentative}$ list of operations). More precisely, when the stream of operations incoming to the system ceases and there are no network partitions (the replicas can communicate with the primary), the $\mathsf{committed}$ lists at all replicas will be the same, whereas the $\mathsf{tentative}$ lists will be empty. On the other hand, when there are partitions, some operations might not be successfully committed by the primary, but will be disseminated within a partition using a gossip protocol. Then all replicas within the same partition will have the same $\mathsf{committed}$ and (non-empty) $\mathsf{tentative}$ lists.
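The relationship between the two lists can be illustrated with a short Python sketch. The functions `split_log` and `execution_order` and the sample timestamps are hypothetical and serve only to show how a replica's current execution order is derived from the committed prefix and the timestamp-ordered remainder.

```python
def split_log(committed, known_timestamps):
    """Split a replica's operations into the committed and tentative lists.

    committed: operation ids in the final order received so far.
    known_timestamps: dict mapping every operation id known to the replica
        to its (unique) timestamp.
    """
    stabilized = set(committed)
    tentative = sorted(
        (op for op in known_timestamps if op not in stabilized),
        key=lambda op: known_timestamps[op],
    )
    return list(committed), tentative


def execution_order(committed, known_timestamps):
    """Order in which the replica currently executes operations."""
    done, tentative = split_log(committed, known_timestamps)
    return done + tentative


timestamps = {"u1": 2, "u2": 1, "u3": 5}

# Nothing stabilized yet: pure timestamp order.
before = execution_order([], timestamps)
# The final order starts with u1, although u2 has a lower timestamp.
partial = execution_order(["u1"], timestamps)
# Everything stabilized: the tentative list is empty.
after = split_log(["u1", "u2", "u3"], timestamps)
print(before, partial, after)
```

In this example the relative order of `u1` and `u2` changes once `u1` is committed, which is exactly the situation in which a replica has to roll back and reexecute operations.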

2.2.2 Anomalies

Consider the example in Figure 1, which shows an execution of a three-replica Bayou system. Initially, replica $R_{1}$ executes updating operations $u_{1}$ and $u_{2}$ in order $u_{2},u_{1}$ , which corresponds to $u_{1}$ ’s and $u_{2}$ ’s timestamps. This operation execution order is observed by the client who issues query $q_{1}$ . On the other hand, $R_{2}$ executes the operations according to the final execution order ( $u_{1},u_{2}$ ), as established by the primary replica $R_{3}$ . Hence, the client who issued query $q_{2}$ observes a different execution order than the client who issued $q_{1}$ . Note that replicas execute the operations with a delay (e.g., due to CPU being busy) and that $R_{1}$ reexecutes the operations once it gets to know the final order.

Clearly, the clients that issued the operations can infer from the return values the order in which Bayou executed the operations. The observed operation execution orders differ between the clients accessing $R_{1}$ and $R_{2}$ . This kind of anomaly, which we call temporary operation reordering, cannot be avoided since Bayou orders operations in two mutually inconsistent ways (the timestamp order and the order established by the primary, which may occasionally differ from the timestamp order). This behaviour is not present in strongly consistent systems, which ensure that a single global ordering of operation execution is always respected (e.g., [30] [31]). The majority of eventually consistent systems which trade consistency for high availability are also free of this anomaly, as they only use one method for ordering concurrent operations (e.g., [32] [7]), or support only commutative operations (as in strong eventual consistency [2], e.g., [3] [33]). There are also protocols that allow some operations to perceive the past events in different (but still legal) orders (e.g., [34] [35] [6]). But, unlike Bayou, they do not require the replicas to eventually agree on a single execution order for all operations. Interestingly, temporary operation reordering is not present in ANNC, because weak updating operations ( $\mathrm{add}$ ) commute and do not provide clients with the return values.

Bayou exhibits another, highly counterintuitive anomaly: circular causality. By analysing the return values of queries $q_{1}$ and $q_{2}$ one may conclude that there is a circular dependency between $u_{1}$ and $u_{2}$ : $u_{1}$ depends on $u_{2}$ as evidenced by $q_{1}$ ’s response, while $u_{2}$ depends on $u_{1}$ as evidenced by $q_{2}$ ’s response (the cycle of causally related operations can contain more operations). Interestingly, as we show later, circular causality does not directly follow from temporary operation reordering but is rather a result of the way Bayou rolls back and reexecutes some operations.

In the original Bayou protocol, application-specific conflict detection and resolution is accomplished through the use of dependency checks and merge procedure mechanisms. Since we allow operations with arbitrarily complex semantics, the dependency checks and the merge procedures can be emulated by the operations themselves, by simply incorporating if-else statements: the dependency check as the if condition, and the merge procedure in the else branch (as suggested in the original paper [24]). Hence, these mechanisms do not alleviate the anomalies outlined above.
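The if-else emulation can be sketched with a room-reservation operation in Python. The scenario, the function name `reserve` and its parameters are illustrative assumptions rather than code from Bayou or from this article; the point is only that the dependency check and the merge procedure fit inside one deterministic operation.

```python
def reserve(schedule, preferred_slot, alternative_slots, who):
    """Deterministic operation on a dict slot -> holder.

    Returns the new schedule and the slot that was reserved (or None).
    """
    new_schedule = dict(schedule)
    if preferred_slot not in new_schedule:
        # Dependency check passed: apply the intended update.
        new_schedule[preferred_slot] = who
        return new_schedule, preferred_slot
    else:
        # Merge procedure: resolve the conflict by trying the alternatives in order.
        for slot in alternative_slots:
            if slot not in new_schedule:
                new_schedule[slot] = who
                return new_schedule, slot
        return new_schedule, None


empty = {}
after_alice, alice_slot = reserve(empty, "10:00", ["11:00"], "alice")
after_bob, bob_slot = reserve(after_alice, "10:00", ["11:00", "12:00"], "bob")
after_carol, carol_slot = reserve(after_bob, "10:00", ["11:00"], "carol")
print(alice_slot, bob_slot, carol_slot)
```

Because the outcome depends on the state on which the operation runs, reexecuting such an operation in a different position of the execution order can change both its effect and its return value, so the anomalies remain.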

2.2.3 Correctness guarantees

Because of the phenomena described above, the guarantees provided by Bayou cannot be formalized using the correctness criteria used for contemporary eventually consistent systems based on CRDTs. E.g., basic eventual consistency (BEC) by Burckhardt et al. [7] [23] (mentioned briefly when discussing ANNC’s guarantees) directly forbids circular causality (see Section 5.2 for the definition of BEC). BEC also requires the relative order of any two operations, as perceived by the client, to be consistent and to never change. Similarly, strong eventual consistency (SEC) by Shapiro et al. [2] requires any two replicas that delivered the same updates to have equivalent states. (BEC can be seen as a refinement of SEC, which abstracts away from CRDT implementation details and ensures that no return value is constructed out of thin air.) Obviously, Bayou satisfies neither BEC nor SEC (as evidenced by Figure 1). On the other hand, informal definitions of eventual consistency which admit temporary reordering, such as [25], involve only liveness guarantees, which is insufficient. Bayou fulfills the operational specification in [36]. However, we are interested in declarative specifications, similar in style to popular consistency criteria, such as sequential consistency [37] or serializability [26], through which we can concisely define the behaviour of a wide class of systems. Hence we introduce a new correctness criterion, fluctuating eventual consistency (FEC), which can be viewed as a generalization of BEC (see Section 5.3 for the definition). FEC relaxes BEC, so different operations can perceive different operation orders. However, we require that the different perceived operation orders converge to one final execution order. Hence, FEC is suitable for systems that feature temporary operation reordering.

Similarly to ANNC, Bayou also ensures linearizability for strong operations (a response of a strong operation $\mathit{op}$ always reflects the serial execution of all stabilized operations up to the point of $\mathit{op}$ ’s commit). In Section 5.6 we formally prove that the Bayou-derived general-purpose ACT satisfies the above correctness criteria.

In Appendix A.2, an interested reader may find a brief analysis of Bayou’s liveness guarantees.

2.2.4 Fault-tolerance

Bayou’s reliance on the primary means that it provides only limited fault-tolerance. Even though the primary may recover, when it is down, operations do not stabilize, and thus no strong operation can complete. Hence, the primary is the single point of failure. Alternatively, the primary could be replaced by a distributed commit protocol. If two-phase commit (2PC) [38] is used, the phenomena illustrated in Figure 1 are not possible. However, in this approach, a failure of any replica blocks the execution of strong operations (in 2PC all the replicas need to be operational in order to reach distributed agreement). On the other hand, if a non-blocking commit protocol is used, e.g., one that utilizes a quorum-based implementation of TOB (as in ANNC), the system may stabilize operations despite (a limited number of) failures. (Sharded 2PC [39] can be considered non-blocking if within each shard at least one process remains operational at all times. In such a scheme not every process needs to be contacted to commit a transaction, thus it falls under the quorum-based category.) As we prove later, ACTs (which do not depend on synchronous communication with all the replicas and thus can operate despite failures of some of them) with general-purpose semantics similar to Bayou are necessarily prone to the phenomena described above.

2.2.5 The improved Bayou protocol

Bayou can be improved to make it more fault-tolerant and free of some of the phenomena described above.

Firstly, we use TOB in place of the primary to establish the final operation execution order. More precisely, each time a replica receives an operation $\mathit{op}$ from a client, it still disseminates $\mathit{op}$ using a gossip protocol (so it can reach at least all replicas within the same network partition), but it also broadcasts the operation using TOB (similarly to how weak updating operations are handled in ANNC). Since TOB guarantees that all replicas deliver the same set of messages in the same order, all replicas will stabilize the same set of operations in the same order. As we argued earlier, TOB can be implemented in a way that avoids a single point of failure [19].

The second modification is aimed at eliminating circular causality in Bayou. To this end, (1) strong operations are broadcast using TOB and never using a gossip protocol, and (2) upon being submitted, a weak operation $\mathit{op}$ is executed immediately on the current state to produce the return value; if $\mathit{op}$’s timestamp is not the highest timestamp of all already executed operations, $\mathit{op}$ is then rolled back and eventually executed in the order consistent with its timestamp. In Appendix A.3 we formally prove that the above changes to the protocol allow us to avoid circular causality.
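Step (2) can be illustrated with a minimal single-replica sketch. It is not the protocol from the appendix: gossip, TOB and stabilization are omitted, the re-execution is performed at once rather than through scheduled internal events, and the names `WeakOp`, `Replica`, `submit_weak` and `append` are hypothetical.

```python
import bisect
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass(order=True)
class WeakOp:
    timestamp: int
    # apply(state) -> (new_state, return_value); excluded from ordering
    apply: Callable[[str], Tuple[str, str]] = field(compare=False)


def append(ch: str) -> Callable[[str], Tuple[str, str]]:
    """Hypothetical append operation that returns the resulting sequence."""
    return lambda state: (state + ch, state + ch)


class Replica:
    def __init__(self) -> None:
        self.state = ""
        self.log: List[WeakOp] = []  # tentatively executed ops, by timestamp
        self.rollbacks = 0

    def submit_weak(self, op: WeakOp) -> str:
        # Execute immediately on the current state to produce the response.
        new_state, rval = op.apply(self.state)
        if not self.log or op.timestamp > self.log[-1].timestamp:
            # Highest timestamp so far: the tentative execution stands.
            self.log.append(op)
            self.state = new_state
        else:
            # Otherwise roll back and re-execute in timestamp order.
            self.rollbacks += 1
            bisect.insort(self.log, op)
            self.state = ""
            for logged in self.log:
                self.state, _ = logged.apply(self.state)
        return rval


replica = Replica()
first = replica.submit_weak(WeakOp(2, append("b")))
second = replica.submit_weak(WeakOp(1, append("a")))
```

In this run the second response reflects the order b, a, whereas the replica state after the rollback reflects the timestamp order a, b, which is an instance of temporary operation reordering.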

The modified variant of Bayou does not ensure that subsequent operations invoked by the same client observe the effects of previous ones, even if they are issued on the same replica (the read-your-writes session guarantee [40]). (Footnote 3: In the original Bayou system, clients were colocated with replicas and the read-your-writes guarantee was naturally provided.) In our approach, such guarantees can be provided on the client side.

With the above modifications the improved Bayou protocol becomes the general-purpose ACT called AcuteBayou.

2.3 ANNC vs AcuteBayou

ANNC and AcuteBayou greatly differ in the offered semantics and complexity. Note that we could trivially implement a non-negative integer counter using AcuteBayou by executing each counter operation as a separate AcuteBayou operation, albeit such an implementation would be suboptimal: in some cases the operations would have to be rolled back and temporary operation reordering would be possible again. Still, we can consider AcuteBayou as a generic ACT, capable of executing any set of weak and strong operations.

Despite the many differences between ANNC and AcuteBayou, they share several design assumptions, which are common to all ACT implementations. Firstly, in order to facilitate high availability and low latency responses (which are essential in geo-replicated environments), frequently invoked operations should be defined as weak operations and replicas should process them similarly to operations in CRDTs (automatically resolve conflicts between concurrent updates; converge to the same state within a network partition). To enforce this behaviour without resorting to distributed agreement, we impose the same assumptions as Attiya et al. for highly available eventually consistent data stores in [33] (see also Section 3.3). Secondly, when weak consistency guarantees are insufficient, strong operations can be used. Strong operations use a global agreement protocol for inter-replica synchronization, e.g., TOB. We require that strong operations do not block the execution of weak operations and that they do not require all replicas to be operational at all times in order to complete (as in 2PC).

ACTs are meant to provide the programmer with a modular abstraction layer that handles all the complexities of replication, while enabling flexibility, high performance and clear mixed-consistency semantics. In the next section we specify ACTs formally.

3 Acute Cloud Types

3.1 Definition

An acute cloud type is an abstract data type, implemented as a replicated data structure, that offers a precisely defined set of operations, divided into two groups: weak and strong. The operations can be either updating or read-only (RO), and all operations are allowed to provide a return value (in Section 4 we show how the semantics of operations can be specified formally). We impose the following implementation restrictions on ACTs: invisible reads, input-driven processing, op-driven messages, highly available weak operations and non-blocking strong operations. The first four are adapted from the definition of write-propagating data stores [33]. They guarantee genuine, low-latency, eventually-consistent processing for weak operations (as in, e.g., CRDTs [2]). The last restriction guarantees that strong operations are implemented using a non-blocking agreement protocol, instead of a non-fault-tolerant approach requiring all the replicas to be operational. In Sections 3.2 and 3.3 we formalize the system model and provide precise definitions of the implementation restrictions.

3.2 System model

3.2.1 Replicas and clients

We consider a system consisting of $n\geq 2$ processes called replicas, which maintain full copies of an ACT and to which external clients submit requests in the form of operations to be executed. (Footnote 4: Partial replication is orthogonal to our work. We assume full replication for simplicity.) Each operation invoked by a client is marked either weak or strong. Replicas communicate with each other through message passing. We assume the availability of a gossip protocol, which is used when ordering constraints are not necessary, and some global agreement protocol, used for tasks that require solving distributed consensus. For simplicity, as in Algorithm 1, we formalize these protocols using reliable broadcast (RB) [28], and TOB, respectively. Replicas can implement point-to-point communication simply by ignoring messages for which they are not the intended recipient. We model replicas as deterministic state machines, which execute atomic steps in reaction to external events (e.g., operation invocation or message delivery), and can execute internal events (e.g., scheduled processing of rollbacks). A specific event is enabled on a replica if its preconditions are met (e.g., an $\mathrm{RB{\text{-}}deliver}(m)$ event is enabled on a replica $R$ if $m$ was previously $\mathrm{RB{\text{-}}cast}$ and $R$ has not yet delivered the message $m$). Replicas have access to a local clock, which advances monotonically, but we make no assumptions on the bound on clock drift between replicas.

We model crashed replicas as if they stopped all computation (or compute infinitely slowly). We say that a replica is faulty if it crashes (in an infinite execution it executes only a finite number of steps). Otherwise, it is correct.

3.2.2 Network properties

In a fully asynchronous system, a crashed replica is indistinguishable to its peers from a very slow one, and it is impossible to solve the distributed consensus problem [41]. Real distributed systems which exhibit some amount of synchrony can usually overcome this limitation. For example, in a quasi-synchronous model [42], the system is considered to be synchronous, but there exists a non-negligible probability that timing assumptions can be broken. We are interested in the behaviour of protocols both in the fully asynchronous environment, when timing assumptions are consistently broken (e.g., because of prevalent network partitions), and in a stable one, when the minimal amount of synchrony is available so that consensus eventually terminates. Thus, we consider two kinds of runs: asynchronous runs and stable runs. Replicas are not aware of which kind of run they are currently executing. In stable runs, we augment the system with the failure detector $\Omega$ (which is an abstraction for the synchronous aspects of the system). We do so implicitly by allowing the replicas to use TOB through the $\mathrm{TOB{\text{-}}cast}$ and $\mathrm{TOB{\text{-}}deliver}$ primitives. Since TOB is known to require a failure detector at least as strong as $\Omega$ to terminate [43], we guarantee it achieves progress only in stable runs.

In both asynchronous and stable runs we guarantee the basic properties of reliable message passing [28], i.e.:

• if a message is $\mathrm{RB{\text{-}}deliver}$ed, or $\mathrm{TOB{\text{-}}deliver}$ed, then it was, respectively, $\mathrm{RB{\text{-}}cast}$, or $\mathrm{TOB{\text{-}}cast}$, by some replica,

• no message is $\mathrm{RB{\text{-}}deliver}$ed, or $\mathrm{TOB{\text{-}}deliver}$ed, more than once by the same replica,

• if a correct replica $\mathrm{RB{\text{-}}cast}$s some message, then eventually it $\mathrm{RB{\text{-}}deliver}$s it,

• if a correct replica $\mathrm{RB{\text{-}}deliver}$s some message, then eventually all correct replicas $\mathrm{RB{\text{-}}deliver}$ it,

• if any (correct or faulty) replica $\mathrm{TOB{\text{-}}deliver}$s some message, then eventually all correct replicas $\mathrm{TOB{\text{-}}deliver}$ it,

• messages are $\mathrm{TOB{\text{-}}deliver}$ed by all replicas in the same total order.

We define $\mathsf{tobNo}(m)$ as the sequence number of the $\mathrm{TOB{\text{-}}deliver}(m)$ event (among other $\mathrm{TOB{\text{-}}deliver}$ events in the execution) on any replica (we leave it undefined, i.e., $\mathsf{tobNo}(m)=\bot$ , if $m$ is never $\mathrm{TOB{\text{-}}deliver}$ ed by any replica).
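As an illustration, the total-order and no-duplication properties can be checked on finite per-replica delivery logs. The sketch below assumes that "the same total order" is gap-free, so that every log is a prefix of the longest one and the sequence number is independent of the replica; the function names, the log representation and the 1-based numbering are assumptions made for this example, and `None` stands for $\bot$.

```python
from typing import Dict, Hashable, List, Optional


def check_tob_logs(logs: Dict[str, List[Hashable]]) -> bool:
    """Check no-duplication and a common gap-free delivery order."""
    longest = max(logs.values(), key=len, default=[])
    for seq in logs.values():
        if len(set(seq)) != len(seq):
            return False  # some message was TOB-delivered twice
        if seq != longest[:len(seq)]:
            return False  # this log is not a prefix of the longest log
    return True


def tob_no(logs: Dict[str, List[Hashable]], m: Hashable) -> Optional[int]:
    """Sequence number of the TOB-deliver(m) event, or None if undelivered."""
    for seq in logs.values():
        if m in seq:
            return seq.index(m) + 1  # 1-based numbering is an assumption
    return None


logs = {"R1": ["m1", "m2", "m3"], "R2": ["m1", "m2"]}
swapped = {"R1": ["m1", "m2"], "R2": ["m2", "m1"]}
```

Here `logs` passes the check, while `swapped` fails it because the two replicas disagree on the relative order of `m1` and `m2`.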

Solely in stable runs, we also guarantee the following:

• if a correct replica $\mathrm{TOB{\text{-}}cast}$s some message, then eventually all correct replicas $\mathrm{TOB{\text{-}}deliver}$ it,

• if a message $m$ was both $\mathrm{RB{\text{-}}cast}$ and $\mathrm{TOB{\text{-}}cast}$ by some (correct or faulty) replica, and $m$ was $\mathrm{RB{\text{-}}deliver}$ed by some correct replica, then eventually all correct replicas $\mathrm{TOB{\text{-}}deliver}$ it.

The last guarantee is non-standard for a total-order broadcast, but could be easily emulated by the application itself. We include it to simplify presentation of certain algorithms, such as ANNC and AcuteBayou.

3.2.3 Fair executions

An execution is fair if each replica has a chance to execute its steps (all replicas execute infinitely many steps of each type of an enabled event, e.g., infinitely many $\mathrm{RB{\text{-}}deliver}$ events for infinitely many messages $\mathrm{RB{\text{-}}cast}$).

We analyze the correctness of a protocol by evaluating a single arbitrary infinite fair execution of the protocol, similarly to [23] and [44]. If the execution satisfies the desired properties, then all the executions of the protocol (including finite ones and the ones featuring crashed replicas) satisfy all the safety aspects verified (nothing bad ever happens [45] [46]). Additionally, all fair executions of the protocol satisfy liveness aspects (something good eventually happens).

3.3 Implementation restrictions

Below we state the five rules that ACTs need to adhere to.

1. Invisible reads. Replicas do not change their state due to an invocation of a weak read-only operation. Formally, for each weak read-only operation $\mathit{op}$ invoked on a replica $R$ in state $\sigma$, the state of $R$ after a response for $\mathit{op}$ is returned is equal to $\sigma$. Note that a consequence of this is that weak read-only operations need to return a response to the client immediately in the invoke event, without executing any other steps. We allow strong read-only operations to change the state of a replica, because sometimes it is necessary to synchronize with other replicas, and the replica needs to note down that a response is pending.

2. Input-driven processing. Replicas execute a series of steps only in response to some external stimulus, e.g., an operation invocation or a received message. A state $\sigma$ of a replica $R$ is passive if none of the internal events on the replica are enabled in $\sigma$ . Initially each replica is in a passive state. An external event may bring a replica to an active state $\sigma^{\prime}$ in which it has some internal events enabled. Then, after executing a finite number of internal events (when no new external events are executed), the replica enters a passive state. More formally, for each replica $R$ , we require that in a given execution, either there is only a finite number of internal events executed on $R$ , or there is an infinite number of external events executed on $R$ . We say that $R$ is passive, if it is in a passive state, otherwise it is active.

3. Op-driven messages. RB or TOB messages are only generated and sent as a result of some not read-only client operation, and not spontaneously or in response to a received message. More formally, a message can be $\mathrm{RB{\text{-}}cast}$ or $\mathrm{TOB{\text{-}}cast}$ by a replica $R$ , if previously some not read-only operation was invoked on $R$ , and since then $R$ did not enter a passive state.

4. Highly available weak operations. Weak operations need to eventually return a response without communicating with other replicas. A weak operation $\mathit{op}$ may remain pending only if the execution is finite, and the executing replica remains active since the invocation of $\mathit{op}$ (in an infinite execution a pending weak operation is never allowed).

5. Non-blocking strong operations. Strong operations need to eventually return a response if a global agreement has been reached. More formally, for a strong operation $\mathit{op}$ invoked on a replica $R$ , let $\mathsf{msgs}$ be the set of all messages $\mathrm{TOB{\text{-}}cast}$ by $R$ since the invocation of $\mathit{op}$ but before $R$ enters a passive state. Then, $\mathit{op}$ may remain pending only if:

• the execution is finite, and $R$ remains active since the invocation of $\mathit{op}$, or $R$ remains active because of the delivery of any message $m\in\mathsf{msgs}$, or

• there exists a message $m\in\mathsf{msgs}$, which has not been $\mathrm{TOB{\text{-}}deliver}$ed by $R$ yet.

It means that in order to execute a strong operation replicas may synchronize by $\mathrm{TOB{\text{-}}cast}$ ing multiple messages, but once TOB completes, the response must be returned in a finite number of steps.

All the above requirements are commonly met by various eventually consistent data stores and CRDTs (when we consider them as ACTs with only weak operations and using our communication model), e.g., [47] [48] [1] [49] [50] [2] [51] [33] [44]. (Footnote 5: In case of geo-replicated systems which are weakly consistent between data centers, but feature state machine replication within a data center to simulate reliable processes, we can consider the whole data center as a single replica.) Restrictions 1–4 are inspired by the ones defined for write-propagating data stores [33], but modified appropriately to accommodate the more complex nature of ACTs. In particular, we allow implementations which do not execute each invoked operation in one atomic step, but divide the execution between many internal steps (e.g., see the pseudocode of Bayou in Appendix A.1). On the other hand, the 5th requirement concerns strong operations, and so is specific to ACTs. As discussed at length in [33] [44], requirements 1–4 preclude implementations that offer stronger consistency guarantees but do not provide real value to the programmer (and still fall short of the guarantees possible to ensure if global agreement can be reached). For example, without invisible reads, it is possible to propose an implementation of a register’s read operation which returns the most up-to-date value written to the register only after it has returned stale values to a similar call a fixed number of times (even though the newest value was already available). On the other hand, with the above restrictions, it is still possible to attain causal consistency and variants of it, such as observable causal consistency [33].

4 Formal framework

Below we provide the formalism that allows us to reason about execution histories and correctness criteria. We extended the framework by Burckhardt et al. [7][23] (also used by several other researchers, e.g., [52] [33] [44] [53]).

4.1 Preliminaries

Relations: A binary relation $\mathsf{rel}$ over set $A$ is a subset $\mathsf{rel}\subseteq A\times A$ . For $a,b\in A$ , we use the notation $a\xrightarrow{\mathsf{\mathsf{rel}}}b$ to denote $(a,b)\in\mathsf{rel}$ , and the notation $\mathsf{rel}(a)$ to denote $\{b\in A:a\xrightarrow{\mathsf{\mathsf{rel}}}b\}$ . We use the notation $\mathsf{rel}^{-1}$ to denote the inverse relation, i.e. $(a\xrightarrow{\mathsf{\mathsf{rel}^{-1}}}b)\Leftrightarrow(b\xrightarrow{\mathsf{\mathsf{rel}}}a)$ . Therefore, $\mathsf{rel}^{-1}(b)=\{a\in A:a\xrightarrow{\mathsf{\mathsf{rel}}}b\}$ . Given two binary relations $\mathsf{rel}$ , $\mathsf{rel}^{\prime}$ over $A$ , we define the composition $\mathsf{rel};\mathsf{rel}^{\prime}=\{(a,c):\exists b\in A:a\xrightarrow{\mathsf{\mathsf{rel}}}b\xrightarrow{\mathsf{\mathsf{rel}^{\prime}}}c\}$ . We let $\mathsf{id}_{A}$ be the identity relation over $A$ , i.e., $(a\xrightarrow{\mathsf{\mathsf{id}_{A}}}b)\Leftrightarrow(a\in A)\wedge(a=b)$ . For $n\in\mathbb{N}_{0}$ , we let $\mathsf{rel}^{n}$ be the n-ary composition $\mathsf{rel};\mathsf{rel}...;\mathsf{rel}$ , with $\mathsf{rel}^{0}=\mathsf{id}_{A}$ . We let $\mathsf{rel}^{+}=\bigcup_{n\geq 1}\mathsf{rel}^{n}$ and $\mathsf{rel}^{*}=\bigcup_{n\geq 0}\mathsf{rel}^{n}$ . For some subset $A^{\prime}\subseteq A$ , we define the restricted relation $\mathsf{rel}|_{A^{\prime}}=\mathsf{rel}\cap(A^{\prime}\times A^{\prime})$ . In Figure 2 we summarize various properties of relations.
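For finite sets, the notation above can be made concrete with relations represented as sets of pairs. The following is only an illustrative sketch; the function names are chosen for this example and are not part of the formal framework.

```python
from typing import Hashable, Set, Tuple

Pair = Tuple[Hashable, Hashable]
Rel = Set[Pair]


def image(rel: Rel, a: Hashable) -> Set[Hashable]:
    """rel(a): all b such that a -rel-> b."""
    return {y for (x, y) in rel if x == a}


def inverse(rel: Rel) -> Rel:
    return {(b, a) for (a, b) in rel}


def compose(r: Rel, s: Rel) -> Rel:
    """r;s: pairs (a, c) with a -r-> b -s-> c for some b."""
    return {(a, c) for (a, b) in r for (b2, c) in s if b == b2}


def identity(A: Set[Hashable]) -> Rel:
    return {(a, a) for a in A}


def power(rel: Rel, n: int, A: Set[Hashable]) -> Rel:
    """rel^n, with rel^0 equal to the identity relation over A."""
    result = identity(A)
    for _ in range(n):
        result = compose(result, rel)
    return result


def transitive_closure(rel: Rel) -> Rel:
    """rel^+: the union of rel^n for n >= 1 (finite relations only)."""
    closure = set(rel)
    while True:
        extra = compose(closure, rel) - closure
        if not extra:
            return closure
        closure |= extra


def restrict(rel: Rel, subset: Set[Hashable]) -> Rel:
    return {(a, b) for (a, b) in rel if a in subset and b in subset}


A = {1, 2, 3}
rel = {(1, 2), (2, 3)}
```

The reflexive-transitive closure $\mathsf{rel}^{*}$ can then be obtained as `transitive_closure(rel) | identity(A)`.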

We define by $\mathsf{words}(A)$ the set of all sequences (words) containing only elements from the set $A$ . When not ambiguous we use $A^{*}$ to denote $\mathsf{words}(A)$ (i.e. when $A$ is not a binary relation).

Let $\mathsf{rank}$ be a function that counts the elements of a set $A$ that are in relation $\mathsf{rel}$ to element $a\in A$: $\mathsf{rank}(A,\mathsf{rel},a)=|\{x\in A:x\xrightarrow{\mathsf{\mathsf{rel}}}a\}|$. Thus, $\mathsf{rank}(A,\mathsf{rel},a)=|\mathsf{rel}^{-1}(a)\cap A|$.

We also define two operators $\mathsf{sort}$ and $\mathsf{foldr}$. $A.\mathsf{sort}(\mathsf{rel})\in A^{*}$ arranges in an ascending order the elements of set $A$ according to the total order $\mathsf{rel}$. $\mathsf{foldr}(a_{0},f,w)\in A$ reduces sequence $w\in B^{*}$ by one element at a time using the function $f:A\times B\to A$ and accumulator $a_{0}\in A$:

$$\mathsf{foldr}(a_{0},f,w)=\begin{cases}a_{0}&\text{if }w=\epsilon\\ f(\mathsf{foldr}(a_{0},f,w^{\prime}),b)&\text{if }w=w^{\prime}b\end{cases}$$
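The three helpers $\mathsf{rank}$, $\mathsf{sort}$ and $\mathsf{foldr}$ can be sketched for finite sets as follows. The name `sort_by` is hypothetical, and sorting by rank is one possible realization that assumes $\mathsf{rel}$ is a total order on $A$.

```python
from typing import Callable, Hashable, Iterable, List, Set, Tuple, TypeVar

Acc = TypeVar("Acc")
Elem = TypeVar("Elem")
Rel = Set[Tuple[Hashable, Hashable]]


def rank(A: Set[Hashable], rel: Rel, a: Hashable) -> int:
    """Number of elements x of A with x -rel-> a."""
    return sum(1 for x in A if (x, a) in rel)


def sort_by(A: Set[Hashable], rel: Rel) -> List[Hashable]:
    """A.sort(rel): elements of A in ascending order of the total order rel."""
    return sorted(A, key=lambda a: rank(A, rel, a))


def foldr(a0: Acc, f: Callable[[Acc, Elem], Acc], w: Iterable[Elem]) -> Acc:
    """foldr(a0, f, w'b) = f(foldr(a0, f, w'), b); foldr(a0, f, empty) = a0."""
    acc = a0
    for b in w:  # the last element of w is applied outermost
        acc = f(acc, b)
    return acc


A = {"a", "b", "c"}
rel = {("a", "b"), ("b", "c"), ("a", "c")}
ordered = sort_by(A, rel)
joined = foldr("", lambda acc, b: acc + b, ordered)
```

Since the recursion peels off the last element of $w$, the accumulator is threaded through the sequence from its first element to its last, which is why a simple loop suffices.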

Event graphs: To reason about executions of a distributed system we encode the information about events that occur in the system and about various dependencies between them in the form of an event graph. An event graph $G$ is a tuple $(E,d_{1},\ldots,d_{n})$, where $E\subseteq\mathsf{Events}$ is a finite or countably infinite set of events drawn from universe $\mathsf{Events}$, $n\geq 1$, and each $d_{i}$ is an attribute or a relation over $E$. Vertices in $G$ represent events that occurred at some point during the execution and are interpreted as opaque identifiers. Attributes label vertices with information pertinent to the corresponding event, e.g., operation performed, or the value returned. All possible operations of all considered data types form the $\mathsf{Operations}$ set. All possible return values of all operations form the $\mathsf{Values}$ set. Relations represent orderings or groupings of events, and thus can be understood as arcs or edges of the graph.

Event graphs are meant to carry information that is independent of the actual elements of $\mathsf{Events}$ chosen to represent the events (the attributes and relations in $G$ encode all relevant information regarding the execution). Let $G=(E,d_{1},\ldots,d_{n})$ and $G^{\prime}=(E^{\prime},d^{\prime}_{1},\ldots,d^{\prime}_{n})$ be two event graphs. $G$ and $G^{\prime}$ are isomorphic, written $G\simeq G^{\prime}$, if (1) for all $i\geq 1$, $d_{i}$ and $d^{\prime}_{i}$ are of the same kind (attribute vs. relation) and (2) there exists a bijection $\phi:E\rightarrow E^{\prime}$ such that for all $d_{i}$, where $d_{i}$ is an attribute, and all $x\in E$, we have $d_{i}(x)=d^{\prime}_{i}(\phi(x))$, and such that for all $d_{i}$ where $d_{i}$ is a relation, and all $x,y\in E$, we have $x\xrightarrow{\mathsf{d_{i}}}y\Leftrightarrow\phi(x)\xrightarrow{\mathsf{d^{\prime}_{i}}}\phi(y)$.
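For small finite event graphs, isomorphism can be tested by brute force over all candidate bijections. The sketch below is only meant to make the definition concrete; the representation (attributes as dictionaries, relations as sets of pairs, passed in separate lists) and the name `isomorphic` are assumptions of this example, and the search is exponential in the number of events.

```python
from itertools import permutations
from typing import Dict, Hashable, Iterable, List, Set, Tuple

Rel = Set[Tuple[Hashable, Hashable]]
Attr = Dict[Hashable, object]


def isomorphic(E1: Iterable[Hashable], attrs1: List[Attr], rels1: List[Rel],
               E2: Iterable[Hashable], attrs2: List[Attr], rels2: List[Rel]) -> bool:
    """Search for a bijection phi preserving every attribute and relation."""
    ev1, ev2 = list(E1), list(E2)
    if (len(ev1) != len(ev2) or len(attrs1) != len(attrs2)
            or len(rels1) != len(rels2)):
        return False
    for perm in permutations(ev2):
        phi = dict(zip(ev1, perm))
        # Attributes: d_i(x) must equal d'_i(phi(x)) for every event x.
        attrs_ok = all(a1[x] == a2[phi[x]]
                       for a1, a2 in zip(attrs1, attrs2) for x in ev1)
        # Relations: the image of d_i under phi must equal d'_i exactly.
        rels_ok = all({(phi[x], phi[y]) for (x, y) in r1} == set(r2)
                      for r1, r2 in zip(rels1, rels2))
        if attrs_ok and rels_ok:
            return True
    return False


op1 = {1: "w", 2: "r"}
vis1 = {(1, 2)}
op2 = {"x": "r", "y": "w"}
same = isomorphic({1, 2}, [op1], [vis1], {"x", "y"}, [op2], [{("y", "x")}])
different = isomorphic({1, 2}, [op1], [vis1], {"x", "y"}, [op2], [{("x", "y")}])
```

In the second call the only attribute-preserving bijection maps event 1 to `"y"`, which reverses the direction of the single edge, so the graphs are not isomorphic.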

4.2 Histories

We represent a high-level view of a system execution as a history. We omit implementation details such as message exchanges or internal steps executed by the replicas. We include only the observable behaviour of the system, as perceived by the clients through received responses. Formally, we define a history as an event graph $H=(E,\mathsf{op},\mathsf{rval},\mathsf{rb},\mathsf{ss},\mathsf{lvl})$ , where:

• $\mathsf{op}:E\rightarrow\mathsf{Operations}$ specifies the operation invoked in a particular event, e.g., $\mathsf{op}(e)=\mathsf{write}(3)$,

• $\mathsf{rval}:E\rightarrow\mathsf{Values}\cup\{\nabla\}$ specifies the value returned by the operation, e.g., $\mathsf{rval}(e)=3$, or $\mathsf{rval}(e^{\prime})=\nabla$ if the operation never returns ($e^{\prime}$ is pending in $H$),

• $\mathsf{rb}$, the returns-before relation, is a natural partial order over $E$, which specifies the ordering of non-overlapping operations (one operation returns before the other starts, in real-time),

• $\mathsf{ss}$, the same session relation, is an equivalence relation which groups events executed within the same session (the same client), and finally

• $\mathsf{lvl}:E\rightarrow\{\mathsf{weak},\mathsf{strong}\}$ specifies the consistency level demanded for the operation invoked in the event.

We consider only well-formed histories, for which the following holds:

• $\forall a,b\in E:(a\xrightarrow{\mathsf{rb}}b\Rightarrow\mathsf{rval}(a)\neq\nabla)$ (a pending operation does not return),

• $\forall a,b,c,d\in E:(a\xrightarrow{\mathsf{rb}}b\wedge c\xrightarrow{\mathsf{rb}}d)\Rightarrow(a\xrightarrow{\mathsf{rb}}d\vee c\xrightarrow{\mathsf{rb}}b)$ ($\mathsf{rb}$ is an interval order [54]),

• for each event $e\in E$ and its session $S=\{e^{\prime}\in E:e\xrightarrow{\mathsf{\mathsf{ss}}}e^{\prime}\}$, the restriction $\mathsf{rb}|_{S}$ is an enumeration (clients issue operations sequentially).
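The three well-formedness conditions can be checked mechanically on a finite history. The sketch below assumes that $\mathsf{rb}$ is given as a strict partial order (a set of pairs), that sessions are identified by a label per event instead of the $\mathsf{ss}$ relation, and that pending operations are marked with a sentinel `PENDING` standing for $\nabla$; all names are hypothetical.

```python
from typing import Dict, Hashable, Set, Tuple

PENDING = object()  # stands for the "never returns" marker
Rel = Set[Tuple[Hashable, Hashable]]


def well_formed(events: Set[Hashable], rval: Dict[Hashable, object],
                rb: Rel, session: Dict[Hashable, Hashable]) -> bool:
    # (1) An operation that returns before another one cannot be pending.
    if any(rval[a] is PENDING for (a, _) in rb):
        return False
    # (2) rb is an interval order.
    for (a, b) in rb:
        for (c, d) in rb:
            if (a, d) not in rb and (c, b) not in rb:
                return False
    # (3) Within a session, any two distinct events are ordered by rb in
    #     exactly one direction (operations are issued sequentially).
    for x in events:
        for y in events:
            if x != y and session[x] == session[y]:
                if ((x, y) in rb) == ((y, x) in rb):
                    return False
    return True


events = {"e1", "e2", "e3"}
session = {"e1": "c1", "e2": "c1", "e3": "c2"}
rval = {"e1": 1, "e2": 2, "e3": PENDING}
ok = well_formed(events, rval, {("e1", "e2")}, session)
unordered_session = well_formed(events, rval, set(), session)
```

The second call fails because `e1` and `e2` belong to the same session but are not ordered by returns-before, i.e., the client would have issued them concurrently.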

4.3 Abstract executions

In order to explain the history, i.e., the observed return values, and reason about the system properties, we need to extend the history with information about the abstract relationships between events. For strongly consistent systems typically we do so by finding a serialization [37] (an enumeration of all events) that satisfies certain criteria. For weaker consistency models, such as eventual consistency or causal consistency, it is more natural to reason about partial ordering of events. Hence, we resort to abstract executions.

An abstract execution is an event graph $A=(E,\mathsf{op},\mathsf{rval},\mathsf{rb},\mathsf{ss},\mathsf{lvl},\mathsf{vis},\mathsf{ar},\mathsf{par})$ , such that:

• $(E,\mathsf{op},\mathsf{rval},\mathsf{rb},\mathsf{ss},\mathsf{lvl})$ is some history $H$,

• $\mathsf{vis}$ is an acyclic and natural relation,

• $\mathsf{ar}$ is a total order relation, and

• $\mathsf{par}:E\rightarrow 2^{E\times E}$ is a function which returns a binary relation in $E$.

For brevity, we often use a shorter notation $A=(H,\mathsf{vis},\mathsf{ar},\mathsf{par})$ and let $\mathcal{H}(A)=H$ . Just as serializations are used to explain and justify operations’ return values reported in a history, so are the visibility ( $\mathsf{vis}$ ) and arbitration ( $\mathsf{ar}$ ) relations. Perceived arbitration ( $\mathsf{par}$ ) is a function which is necessary to formalize temporary operation reordering.

Visibility ($\mathsf{vis}$) describes the relative influence of operation executions in a history on each other’s return values: if $a$ is visible to $b$ (denoted $a\xrightarrow{\mathsf{vis}}b$), then the effect of $a$ is visible to the replica performing $b$ (and thus reflected in $b$’s return value). Visibility often mirrors how updates propagate through the system, but it is not tied to low-level phenomena, such as message delivery. It is an acyclic and natural relation, which may or may not be transitive. Two events are concurrent if they are not ordered by visibility.

Arbitration ( $\mathsf{ar}$ ) is an additional ordering of events which is necessary in case of non-commutative operations. It describes how the effects of these operations should be applied. If $a$ is arbitrated before $b$ (denoted $a\xrightarrow{\mathsf{ar}}b$ ), then $a$ is considered to have been executed earlier than $b$ . Arbitration is essential for resolving conflicts between concurrent events, but it is defined as a total-order over all operation executions in a history. It usually matches whatever conflict resolution scheme is used in an actual system, be it physical time-based timestamps, or logical clocks.

Perceived arbitration ( $\mathsf{par}$ ) describes the relative order of operation executions, as perceived by each operation ( $\mathsf{par}(e)$ defines the total order of all operations, as perceived by event $e$ ). If $\forall e\in E:\mathsf{par}(e)=\mathsf{ar}$ , then there is no temporary operation reordering in $A$ .

4.4 Correctness predicates

A consistency guarantee $\mathcal{P}(A)$ is a set of conditions on an abstract execution $A$, which depend on the particulars of $A$ up to isomorphism. For brevity we usually omit the argument $A$. We write $A\models\mathcal{P}$ if $A$ satisfies $\mathcal{P}$. More precisely: $A\models\mathcal{P}\stackrel{{\scriptstyle\text{def}}}{{\Longleftrightarrow}}\forall A^{\prime}:A^{\prime}\simeq A:\mathcal{P}(A^{\prime})$. A history $H$ is correct according to some consistency guarantee $\mathcal{P}$ (written $H\models\mathcal{P}$) if it can be extended with some $\mathsf{vis}$ and $\mathsf{ar}$ relations and a $\mathsf{par}$ function to an abstract execution $A=(H,\mathsf{vis},\mathsf{ar},\mathsf{par})$ that satisfies $\mathcal{P}$. We say that a system is correct according to some consistency guarantee $\mathcal{P}$ if all of its histories satisfy $\mathcal{P}$.

We say that a consistency guarantee $\mathcal{P}_{i}$ is at least as strong as a consistency guarantee $\mathcal{P}_{j}$ , denoted $\mathcal{P}_{i}\geq\mathcal{P}_{j}$ , if $\forall H:H\models\mathcal{P}_{i}\Rightarrow H\models\mathcal{P}_{j}$ . If $\mathcal{P}_{i}\geq\mathcal{P}_{j}$ and $\mathcal{P}_{j}\not\geq\mathcal{P}_{i}$ then $\mathcal{P}_{i}$ is stronger than $\mathcal{P}_{j}$ , denoted $\mathcal{P}_{i}>\mathcal{P}_{j}$ . If $\mathcal{P}_{i}\not\geq\mathcal{P}_{j}$ and $\mathcal{P}_{j}\not\geq\mathcal{P}_{i}$ , then $\mathcal{P}_{i}$ and $\mathcal{P}_{j}$ are incomparable, denoted $\mathcal{P}_{i}\lessgtr\mathcal{P}_{j}$ .

4.5 Replicated data type

In order to specify semantics of operations invoked by the clients on the replicas, we model the whole system as a single replicated object (as in case of Algorithm 1). Even though we use only a single object, this approach is general, as multiple objects can be viewed as a single instance of a more complicated type, e.g. multiple registers constitute a single key-value store. Defining the semantics of the replicated object through a sequential specification [18] is not sufficient for replicated objects which expose concurrency to the client, e.g. multi-value register (MVR) [2] or observed-remove set (OR-Set) [3]. Hence, we utilize replicated data types specification [48].

In this approach, the state on which an operation $\mathit{op}\in\mathsf{Operations}$ executes, called the operation context, is formalized by the event graph of the prior operations visible to $\mathit{op}$. Formally, for any event $e$ in an abstract execution $A=(E,\mathsf{op},\mathsf{rval},\mathsf{rb},\mathsf{ss},\mathsf{lvl},\mathsf{vis},\mathsf{ar},\mathsf{par})$, the operation context of $e$ in $A$ is the event graph $\mathsf{context}(A,e)\stackrel{{\scriptstyle\text{def}}}{{=}}(\mathsf{vis}^{-1}(e),\mathsf{op},\mathsf{vis},\mathsf{ar})$. Note that an operation context lacks return values, the returns-before relation, and the information about sessions. The set of previously invoked operations and their relative visibility and arbitration unambiguously defines the output of each operation. This brings us to the formal definition of a replicated data type.

A replicated data type $\mathcal{F}$ is a function that, for each operation $\mathit{op}\in\mathsf{ops}(\mathcal{F})$ (where $\mathsf{ops}(\mathcal{F})\subseteq\mathsf{Operations}$) and operation context $C$, defines the expected return value $v=\mathcal{F}(\mathit{op},C)\in\mathsf{Values}$, such that $v$ does not depend on events, i.e., is the same for isomorphic contexts: $C\simeq C^{\prime}\Rightarrow\mathcal{F}(\mathit{op},C)=\mathcal{F}(\mathit{op},C^{\prime})$ for all $\mathit{op}$, $C$, $C^{\prime}$. We say that $\mathit{op}\in\mathsf{ops}(\mathcal{F})$ is a read-only operation (denoted $\mathit{op}\in\mathsf{readonlyops}(\mathcal{F})$), if and only if, for any operation $\mathit{op}^{\prime}$, context $C=(E,\mathsf{op},\mathsf{vis},\mathsf{ar})$ and event $e\in E$, such that $\mathsf{op}(e)=\mathit{op}$, $\mathcal{F}(\mathit{op}^{\prime},C)=\mathcal{F}(\mathit{op}^{\prime},C^{\prime})$, where $C^{\prime}=(E\setminus\{e\},\mathsf{op},\mathsf{vis},\mathsf{ar})$. In other words, read-only operations can be excluded from any context $C$, producing $C^{\prime}$, and the result of any operation $\mathit{op}^{\prime}$ will not change.

In Figure 3 we give the specification of three replicated data types: $\mathcal{F}_{\mathit{MVR}}$ (a multi-value register), $\mathcal{F}_{\mathit{seq}}$ (an append-only sequence), and $\mathcal{F}_{\mathit{NNC}}$ (a non-negative counter). We use $\mathcal{F}_{\mathit{seq}}$ in the subsequent sections to illustrate various consistency models.
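To make the notion of a replicated data type as a function of the operation context more tangible, the sketch below evaluates a hypothetical append-only sequence. It assumes that a read returns the concatenation of the visible appended elements in arbitration order and that an append returns nothing; these semantics, the tuple encoding of operations, and the restriction of $\mathsf{vis}$ and $\mathsf{ar}$ to the visible events are assumptions of this example and are not taken from the specification in Figure 3.

```python
from typing import Dict, Hashable, Optional, Set, Tuple

Event = Hashable
Rel = Set[Tuple[Event, Event]]
Context = Tuple[Set[Event], Dict[Event, tuple], Rel, Rel]


def context(op: Dict[Event, tuple], vis: Rel, ar: Rel, e: Event) -> Context:
    """Events visible to e, with op, vis and ar restricted to them."""
    visible = {a for (a, b) in vis if b == e}

    def inside(rel: Rel) -> Rel:
        return {(a, b) for (a, b) in rel if a in visible and b in visible}

    return visible, {x: op[x] for x in visible}, inside(vis), inside(ar)


def f_seq(operation: tuple, ctx: Context) -> Optional[str]:
    """Assumed semantics: read concatenates visible appends in ar order."""
    events, op, _vis, ar = ctx
    if operation[0] != "read":
        return None  # appends return nothing in this sketch
    ordered = sorted(events, key=lambda x: sum(1 for y in events if (y, x) in ar))
    return "".join(op[x][1] for x in ordered if op[x][0] == "append")


order = ["wa", "wb", "r1", "r2", "r3"]  # arbitration as a total order
ar = {(order[i], order[j]) for i in range(len(order)) for j in range(i + 1, len(order))}
op = {"wa": ("append", "a"), "wb": ("append", "b"),
      "r1": ("read",), "r2": ("read",), "r3": ("read",)}
vis = {("wa", "r1"), ("wb", "r2"), ("wa", "r3"), ("wb", "r3")}
results = {r: f_seq(op[r], context(op, vis, ar, r)) for r in ("r1", "r2", "r3")}
```

Under these assumptions the three reads return `"a"`, `"b"` and `"ab"`, respectively: the return value is determined solely by which appends are visible and how they are arbitrated, not by the identity of the events.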

4.6 ACT specification

To accommodate the mixed-consistency nature of ACTs we extend the replicated data type specification with the information on supported consistency levels for a given operation. Thus, we define an ACT specification as a pair $(\mathcal{F},\mathsf{lvlmap})$, where $\mathcal{F}$ is a replicated data type specification and $\mathsf{lvlmap}:\mathsf{Operations}\rightarrow 2^{\{\mathsf{weak},\mathsf{strong}\}}$ is a function which specifies for each $\mathit{op}\in\mathsf{Operations}$ with which consistency levels it can be executed. We assume that clients follow this contract, and thus, when considering a history $H=(E,\mathsf{op},\mathsf{rval},\mathsf{rb},\mathsf{ss},\mathsf{lvl})$ of an ACT compliant with the specification $(\mathcal{F},\mathsf{lvlmap})$, we assume that for each $e\in E:\mathsf{lvl}(e)\in\mathsf{lvlmap}(\mathsf{op}(e))$.

Then, ANNC’s specification is $(\mathcal{F}_{\mathit{NNC}},\mathsf{lvlmap}_{\mathit{NNC}})$, where $\mathsf{lvlmap}_{\mathit{NNC}}(\mathit{add})=\mathsf{lvlmap}_{\mathit{NNC}}(\mathit{get})=\{\mathsf{weak}\}$ and $\mathsf{lvlmap}_{\mathit{NNC}}(\mathit{subtract})=\{\mathsf{strong}\}$.
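The client contract $\mathsf{lvl}(e)\in\mathsf{lvlmap}(\mathsf{op}(e))$ is straightforward to check on a finite history. In the sketch below a history is reduced to a list of (operation name, level) pairs, which is a simplification made for this example; `LVLMAP_NNC` mirrors the level map of ANNC given above, and `respects_lvlmap` is a hypothetical helper.

```python
from typing import Dict, List, Set, Tuple

# Level map of ANNC: add and get are weak, subtract is strong.
LVLMAP_NNC: Dict[str, Set[str]] = {
    "add": {"weak"},
    "get": {"weak"},
    "subtract": {"strong"},
}


def respects_lvlmap(history: List[Tuple[str, str]],
                    lvlmap: Dict[str, Set[str]]) -> bool:
    """True if every event uses a level allowed for its operation."""
    return all(lvl in lvlmap.get(op_name, set()) for (op_name, lvl) in history)


compliant = [("add", "weak"), ("get", "weak"), ("subtract", "strong")]
violating = [("add", "weak"), ("subtract", "weak")]
```

The second history violates the contract because it invokes `subtract` as a weak operation.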

5 Correctness guarantees

In this section we define various correctness guarantees for ACTs. We define them as conjunctions of several basic predicates. We start with two simple requirements that should naturally be present in any eventually consistent system. For the discussion below we assume some arbitrary abstract execution $A=(E,\mathsf{op},\mathsf{rval},\mathsf{rb},\mathsf{ss},\mathsf{lvl},\mathsf{vis},\mathsf{ar},\mathsf{par})$ .

5.1 Key requirements for eventual consistency

The first requirement is the eventual visibility (EV) of events. EV requires that for any event $e$ in $A$ , there is only a finite number of events in $E$ that do not observe $e$ . Formally:

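$$\textsc{EV}\stackrel{{\scriptstyle\text{def}}}{{=}}\forall e\in E:\left|\{e^{\prime}\in E:e\xrightarrow{\mathsf{rb}}e^{\prime}\wedge e\not\xrightarrow{\mathsf{vis}}e^{\prime}\}\right|<\infty$$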

Intuitively, EV implies progress in the system because replicas must synchronize and exchange knowledge about operations submitted to the system.

The second requirement concerns avoiding circular causality, as discussed in Section 2.2.2. To this end we define two auxiliary relations: session order and happens-before. The session order relation $\mathsf{so}\stackrel{{\scriptstyle\text{def}}}{{=}}\mathsf{rb}\cap\mathsf{ss}$ represents the order of operations in each session. The happens-before relation $\mathsf{hb}\stackrel{{\scriptstyle\text{def}}}{{=}}(\mathsf{so}\cup\mathsf{vis})^{+}$ (a transitive closure of session order and visibility) allows us to express the causal dependency between events. Intuitively, if $e\xrightarrow{\mathsf{\mathsf{hb}}}e^{\prime}$ , then $e^{\prime}$ potentially depends on $e$ . We simply require no circular causality:

$$\textsc{NCC}\stackrel{{\scriptstyle\text{def}}}{{=}}\mathsf{acyclic}(\mathsf{hb})$$
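On a finite set of events, the no-circular-causality requirement can be checked directly from its ingredients. The sketch below is an illustration under assumptions: relations are encoded as Python sets of ordered pairs, $\mathsf{ss}$ is given as the set of pairs of events in the same session, and the function names are hypothetical.

```python
def transitive_closure(rel):
    """Naive transitive closure of a finite relation given as a set of pairs."""
    closure = set(rel)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def ncc(rb, ss, vis):
    """Check acyclicity of hb = (so ∪ vis)^+, where so = rb ∩ ss."""
    so = rb & ss
    hb = transitive_closure(so | vis)
    # hb is transitively closed, so a cycle exists iff some event reaches itself.
    return all(a != b for (a, b) in hb)


# Two events in one session: e1 returns before e2.
rb = {("e1", "e2")}
ss = {("e1", "e2"), ("e2", "e1")}

causal_ok = ncc(rb, ss, vis={("e1", "e2")})      # e2 observes e1: no cycle
causal_cycle = ncc(rb, ss, vis={("e2", "e1")})   # e1 observes the later e2: cycle
```

In the second case $e_1\xrightarrow{\mathsf{so}}e_2$ and $e_2\xrightarrow{\mathsf{vis}}e_1$ form a cycle in $\mathsf{hb}$, so the check fails.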

In the following sections we add requirements on the return values of the operations in $A$. Formalizing the properties of ACTs which, similarly to AcuteBayou, admit temporary operation reordering, requires a new approach. We start, however, with the traditional one.

5.2 Basic Eventual Consistency

Intuitively, basic eventual consistency (BEC) [7] [23], in addition to EV and NCC, requires that the return value of each invoked operation can be explained using the specification of the replicated data type $\mathcal{F}$, which is formalized as follows:

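$$\textsc{RVal}(\mathcal{F})\stackrel{{\scriptstyle\text{def}}}{{=}}\forall e\in E:\mathsf{rval}(e)=\mathcal{F}(\mathsf{op}(e),\mathsf{context}(A,e))$$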

Then:

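$$\textsc{BEC}(\mathcal{F})\stackrel{{\scriptstyle\text{def}}}{{=}}\textsc{EV}\wedge\textsc{NCC}\wedge\textsc{RVal}(\mathcal{F})$$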

An example abstract execution $A_{\textsc{BEC}}$ that satisfies $\textsc{BEC}(\mathcal{F}_{\mathit{seq}})$ is shown in Figure 4. In $A_{\textsc{BEC}}$ , firstly replicas $R_{1}$ and $R_{2}$ concurrently execute two $\mathsf{append}()$ operations, and then each replica executes an infinite number of $\mathsf{read}()$ operations. Consider the $\mathsf{read}()$ operations on $R_{2}$ : the first one observes only $\mathsf{append}(a)$ (which is in the operation context of $\mathsf{read}()$ ), whereas the second observes only $\mathsf{append}(b)$ . BEC admits this kind of execution, because it does not make any requirements in terms of session guarantees [40]. Eventually, both $\mathsf{append}(a)$ and $\mathsf{append}(b)$ become visible to all subsequent $\mathsf{read}()$ operations, thus satisfying EV.

By the definition of the $\mathsf{context}$ function (Section 4.5), when $A$ satisfies $\textsc{RVal}(\mathcal{F})$, the return value of each operation is calculated according to the $\mathsf{ar}$ relation. It is then easy to see that there are executions of AcuteBayou (or other ACTs that admit temporary operation reordering) that do not satisfy $\textsc{RVal}(\mathcal{F})$. This is because weak operations (as shown in Section 2.2.2) might observe past operations in an order that differs from the final operation execution order ($\mathsf{ar}$). Hence AcuteBayou does not satisfy $\textsc{BEC}(\mathcal{F})$ for arbitrary $\mathcal{F}$. It could still, though, satisfy $\textsc{BEC}(\mathcal{F})$ for a sufficiently simple $\mathcal{F}$, such as a conflict-free counter, in which all operations always commute (as opposed to $\mathcal{F}_{\mathit{NNC}}$). This is so because then, even if AcuteBayou reorders some operations internally, the final result never changes and thus the reordering cannot be observed by the clients.
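The difference between an order-sensitive type and a commuting one can be made concrete with a small Python sketch. It assumes, as the $\mathsf{append}$/$\mathsf{read}$ examples suggest, that $\mathcal{F}_{\mathit{seq}}$'s $\mathsf{read}()$ returns the visible appended values concatenated in the given order; `f_seq_read`, `counter_get` and the event identifiers are hypothetical names, not part of the paper's formalism.

```python
def f_seq_read(visible_appends, order):
    """Assumed F_seq read(): concatenate visible appends in the given order.

    visible_appends maps event ids to appended strings (the operation context);
    order lists event ids in the order used to evaluate them (e.g. ar).
    """
    return "".join(visible_appends[e] for e in order if e in visible_appends)


def counter_get(visible_adds, order):
    """A commuting counter: the evaluation order cannot affect the result."""
    return sum(visible_adds[e] for e in order if e in visible_adds)


appends = {"ea": "a", "eb": "b"}

# Evaluating the same context under two different orders gives different
# answers for the sequence ...
seq_final = f_seq_read(appends, ["ea", "eb"])
seq_reordered = f_seq_read(appends, ["eb", "ea"])

# ... but identical answers for the counter.
adds = {"e1": 2, "e2": 3}
cnt_final = counter_get(adds, ["e1", "e2"])
cnt_reordered = counter_get(adds, ["e2", "e1"])
```

For the sequence, an internal reordering is observable by a client (`seq_final` and `seq_reordered` differ), whereas for the counter it is not.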

5.3 Fluctuating Eventual Consistency

In order to admit temporary operation reordering, we give a slightly different definition of the $\mathsf{context}$ function, in which the arbitration order fluctuates, i.e., it changes from one event to another. Let $\mathsf{fcontext}(A,e)\stackrel{{\scriptstyle\text{def}}}{{=}}(\mathsf{vis}^{-1}(e),\mathsf{op},\mathsf{vis},\mathsf{par}(e))$ , which means that now we consider the operation execution order as perceived by $e$ , and not the final one. The definition of the fluctuating variant of RVal is straightforward:

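$$\textsc{FRVal}(\mathcal{F})\stackrel{{\scriptstyle\text{def}}}{{=}}\forall e\in E:\mathsf{rval}(e)=\mathcal{F}(\mathsf{op}(e),\mathsf{fcontext}(A,e))$$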

To define the fluctuating variant of BEC that could be used to formalize the guarantees provided by ACTs, we must additionally ensure that the arbitration order perceived by events is not completely unrestricted, but that it gradually converges to $\mathsf{ar}$ for each subsequent event. It means that each $e\in E$ can be temporarily observed by the subsequent events $e^{\prime}$ according to an order that differs from $\mathsf{ar}$ (but is consistent with $\mathsf{par}(e^{\prime})$). However, from some moment on, every event $e^{\prime}$ will observe $e$ according to $\mathsf{ar}$. To define this requirement, we use the $\mathsf{rank}$ function (defined in Section 4.1). Let $E_{e}=\{e^{\prime}\in E:e\xrightarrow{\mathsf{vis}}e^{\prime}\}$. This intuition is formalized by convergent perceived arbitration:

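$$\textsc{CPar}\stackrel{{\scriptstyle\text{def}}}{{=}}\forall e\in E:\left|\{e^{\prime}\in E_{e}:\mathsf{rank}(\mathsf{vis}^{-1}(e^{\prime}),\mathsf{par}(e^{\prime}),e)\neq\mathsf{rank}(\mathsf{vis}^{-1}(e^{\prime}),\mathsf{ar},e)\}\right|<\infty$$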

If $A$ satisfies CPar, then for each event $e$ , the set of events $e^{\prime}$ , which observe the position of $e$ not according to $\mathsf{ar}$ is finite. Thus, the position of $e$ in $\mathsf{par}(e^{\prime})$ for subsequent events $e^{\prime}$ stabilizes, and $\mathsf{par}(e^{\prime})$ eventually converges to $\mathsf{ar}$ .
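The set whose finiteness CPar demands can be computed explicitly on a finite prefix of an execution. The Python sketch below is only an illustration: it assumes that $\mathsf{rank}(S,r,e)$ counts the elements of $S$ that precede $e$ in the order $r$ (the actual definition is in Section 4.1), and all names and sample data are hypothetical.

```python
def rank(observed, order, e):
    """Assumed reading of rank: how many observed events precede e in `order`."""
    return sum(1 for x in order[:order.index(e)] if x in observed)


def disagreeing_observers(e, observers, vis_inv, par, ar):
    """Observers e' of e that perceive e at a position different from ar.

    vis_inv[o] is the set of events visible to o (vis^-1(o)),
    par[o] is the arbitration order perceived by o, ar is the final order.
    """
    return [o for o in observers
            if e in vis_inv[o]
            and rank(vis_inv[o], par[o], e) != rank(vis_inv[o], ar, e)]


ar = ["ea", "eb"]                     # final arbitration: append(a), append(b)
vis_inv = {"r1": {"ea", "eb"}, "r2": {"ea", "eb"}}
par = {
    "r1": ["eb", "ea"],               # r1 temporarily perceives b before a
    "r2": ["ea", "eb"],               # r2 already perceives the final order
}

fluctuating = disagreeing_observers("ea", ["r1", "r2"], vis_inv, par, ar)
```

In this example only the read `r1` perceives $e_a$ at a non-final position; CPar requires that, for every event, the list of such observers stays finite over the whole (possibly infinite) execution.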

Now we can define our new consistency criterion fluctuating eventual consistency (FEC):

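$$\textsc{FEC}(\mathcal{F})\stackrel{{\scriptstyle\text{def}}}{{=}}\textsc{EV}\wedge\textsc{NCC}\wedge\textsc{FRVal}(\mathcal{F})\wedge\textsc{CPar}$$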

An example abstract execution $A_{\textsc{FEC}}$ that satisfies FEC is shown in Figure 4. In $A_{\textsc{FEC}}$, replica $R_{2}$ temporarily observes the $\mathsf{append}()$ operations in the order $\mathsf{append}(b),\mathsf{append}(a)$, which differs from the eventual operation execution order (as evidenced by the infinite number of $\mathsf{read}()\rightarrow ab$ operations). We call this behaviour fluctuation.

It is easy to see that $\textsc{FEC}(\mathcal{F})<\textsc{BEC}(\mathcal{F})$ , in the sense that: for each $\mathcal{F}$ , $\textsc{FEC}(\mathcal{F})\leq\textsc{BEC}(\mathcal{F})$ , and for some $\mathcal{F}$ , $\textsc{FEC}(\mathcal{F})<\textsc{BEC}(\mathcal{F})$ . It is so, because FEC uses $\mathsf{par}$ instead of $\mathsf{ar}$ to calculate the return values of operation executions, but $\mathsf{par}$ eventually converges to $\mathsf{ar}$ . Hence, $\textsc{BEC}(\mathcal{F})$ is a special case of $\textsc{FEC}(\mathcal{F})$ , when $\forall e\in E:\mathsf{par}(e)=\mathsf{ar}$ . It is easy to see that $A_{\textsc{BEC}}$ from Figure 4 satisfies both BEC and FEC, whereas $A_{\textsc{FEC}}$ satisfies only FEC.

5.4 Operation levels

The above definitions can be used to capture the guarantees provided by a wide variety of eventually consistent systems. However, our framework still lacks the capability to express the semantics of mixed-consistency systems. ACTs offer different guarantees for different classes of operations (e.g., consistency guarantees stronger than BEC or FEC are provided only for strong operations in AcuteBayou or ANNC). Hence, we need to parametrize the consistency criteria with a level attribute (as indicated by the $\mathsf{lvl}$ function for each event). Since the consistency level is specified per operation invocation, we need to ensure that the respective operations’ responses reflect the demanded consistency level.

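Let us revisit BEC first. Let $L=\{e\in E:\mathsf{lvl}(e)=l\}$ for a given $l$. Then:

$$\begin{aligned}
\textsc{EV}(l)&\stackrel{{\scriptstyle\text{def}}}{{=}}\forall e\in E:\left|\{e^{\prime}\in L:e\xrightarrow{\mathsf{rb}}e^{\prime}\wedge e\not\xrightarrow{\mathsf{vis}}e^{\prime}\}\right|<\infty\\
\textsc{NCC}(l)&\stackrel{{\scriptstyle\text{def}}}{{=}}\mathsf{acyclic}(\mathsf{hb}\cap(L\times L))\\
\textsc{RVal}(l,\mathcal{F})&\stackrel{{\scriptstyle\text{def}}}{{=}}\forall e\in L:\mathsf{rval}(e)=\mathcal{F}(\mathsf{op}(e),\mathsf{context}(A,e))\\
\textsc{BEC}(l,\mathcal{F})&\stackrel{{\scriptstyle\text{def}}}{{=}}\textsc{EV}(l)\wedge\textsc{NCC}(l)\wedge\textsc{RVal}(l,\mathcal{F})
\end{aligned}$$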

The above parametrized definition of BEC restricts the RVal predicate only to events issued with the given consistency level $l$ (the events that belong to the set $L$ ). It means that for any such event the response has to conform with the replicated data type specification $\mathcal{F}$ , and with the $\mathsf{vis}$ and $\mathsf{ar}$ relations (as defined by the definition of the $\mathsf{context}$ function). For all other events this requirement does not need to be satisfied, so they can return arbitrary responses (unless restricted by other predicates targeted for these events). Similarly, for EV and NCC, the predicates are restricted to affect only the events from the set $L$ . In case of EV, each event eventually becomes visible to the operations executed with the level $l$ . In case of NCC, there must be no cycles in $\mathsf{hb}$ involving events from the set $L$ .

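The parametrized variant of FEC is formulated analogously. Let $L$ be as defined before, and for any event $e\in E$, let $L_{e}=\{e^{\prime}\in L:e\xrightarrow{\mathsf{vis}}e^{\prime}\}$ be the subset of events from $L$ which observe $e$. Then:

$$\begin{aligned}
\textsc{FRVal}(l,\mathcal{F})&\stackrel{{\scriptstyle\text{def}}}{{=}}\forall e\in L:\mathsf{rval}(e)=\mathcal{F}(\mathsf{op}(e),\mathsf{fcontext}(A,e))\\
\textsc{CPar}(l)&\stackrel{{\scriptstyle\text{def}}}{{=}}\forall e\in E:\left|\{e^{\prime}\in L_{e}:\mathsf{rank}(\mathsf{vis}^{-1}(e^{\prime}),\mathsf{par}(e^{\prime}),e)\neq\mathsf{rank}(\mathsf{vis}^{-1}(e^{\prime}),\mathsf{ar},e)\}\right|<\infty\\
\textsc{FEC}(l,\mathcal{F})&\stackrel{{\scriptstyle\text{def}}}{{=}}\textsc{EV}(l)\wedge\textsc{NCC}(l)\wedge\textsc{FRVal}(l,\mathcal{F})\wedge\textsc{CPar}(l)
\end{aligned}$$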

As before, we restrict the return values only for the events from the set $L$ . Additionally, we restrict the predicate CPar, so that $\mathsf{par}(e)$ converges towards $\mathsf{ar}$ only for events $e\in L$ . Other events can differently perceive the arbitration of events (in principle, the observed arbitration can be completely different from the final one, specified by $\mathsf{ar}$ ).

We can compare the parametrized variants of BEC and FEC as before: $\textsc{FEC}(l,\mathcal{F})<\textsc{BEC}(l,\mathcal{F})$ .

All of the strong consistency criteria that we discuss next are defined directly in the parametrized form, with the given level $l$ in mind, so they can be used for, e.g., strong operations in AcuteBayou and ANNC.

5.5 Strong consistency

A common feature of strong consistency criteria, such as sequential consistency [37] or linearizability [18], is a single global serialization of all operations. It means that a history satisfies these criteria if it is equivalent to some serial execution (serialization) of all the operations. Additionally, depending on the particular criterion, the serialization must, e.g., respect program order or the real-time order of operation executions. Although we provide a serialization of all operations (through the total order relation $\mathsf{ar}$, which is part of every abstract execution), the equivalence of a history to the serialization is not enforced in the correctness criteria we have defined so far. For example, given a sequence of three events $\langle a,b,c\rangle$, such that $a\xrightarrow{\mathsf{ar}}b\xrightarrow{\mathsf{ar}}c$, the response of $c$ according to BEC need not reflect either $a$ or $b$, as they may simply not be visible to $c$ ($a\not\xrightarrow{\mathsf{vis}}c\vee b\not\xrightarrow{\mathsf{vis}}c$). Thus, to guarantee conformance to a single global serialization, we must enforce that for any two events $e_{1},e_{2}\in E$, $e_{1}\xrightarrow{\mathsf{ar}}e_{2}\Leftrightarrow e_{1}\xrightarrow{\mathsf{vis}}e_{2}$ (unless $e_{1}$ is pending, since a pending operation might be arbitrated before a completed one, yet still not be visible). We express this through the following predicate, single order:

$$\begin{aligned}
\textsc{SinOrd}&\stackrel{{\scriptstyle\text{def}}}{{=}}\exists E^{\prime}\subseteq\mathsf{rval}^{-1}(\nabla):\mathsf{vis}=\mathsf{ar}\setminus(E^{\prime}\times E)\\
\textsc{SinOrd}(l)&\stackrel{{\scriptstyle\text{def}}}{{=}}\exists E^{\prime}\subseteq\mathsf{rval}^{-1}(\nabla):\mathsf{vis}_{L}=\mathsf{ar}_{L}\setminus(E^{\prime}\times E)
\end{aligned}$$

where $\mathsf{vis}_{L}=\mathsf{vis}\cap(E\times L)$ and $\mathsf{ar}_{L}=\mathsf{ar}\cap(E\times L)$.

In the parametrized form, the conformance to the serialization is required only for the events from the set $L$ (but the serialization includes all the events).
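For a finite abstract execution, the existential quantifier in $\textsc{SinOrd}(l)$ can be resolved event by event: an event is placed in $E^{\prime}$ only if it is pending and visible to no event in $L$. The following Python sketch illustrates this reading. The set-of-pairs encoding and the names `pairs` and `sin_ord` are assumptions for illustration.

```python
def pairs(order):
    """All ordered pairs (a, b) with a before b in the given total order."""
    return {(a, b) for i, a in enumerate(order) for b in order[i + 1:]}


def sin_ord(events, L, vis, ar, pending):
    """Check vis_L = ar_L minus (E' x E) for some E' that is a subset of `pending`.

    vis and ar are sets of pairs; L is the set of events at the chosen level.
    """
    for e1 in events:
        vis_out = {e2 for e2 in L if (e1, e2) in vis}
        ar_out = {e2 for e2 in L if (e1, e2) in ar}
        if vis_out == ar_out:
            continue                  # e1 can stay outside E'
        if e1 in pending and not vis_out:
            continue                  # e1 is pending and invisible: put it in E'
        return False                  # a completed event is arbitrated but unseen
    return True


E = {"ea", "eb", "ex"}
L = {"ex"}                            # ex is the only strong event
ar = pairs(["ea", "eb", "ex"])

sees_both = sin_ord(E, L, {("ea", "ex"), ("eb", "ex")}, ar, pending=set())
misses_ea = sin_ord(E, L, {("eb", "ex")}, ar, pending=set())
```

With $e_a$ arbitrated before $e_x$ but not visible to it, the check fails unless $e_a$ is pending.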

In order to capture the eventual stabilization of the operation execution order, which happens in AcuteBayou and in ACTs similar to it, we now define two additional correctness criteria that feature SinOrd.

Sequential consistency. Informally, sequential consistency (Seq) [37] guarantees that the system behaves as if all operations were executed sequentially, but in an order that respects the program order, i.e., the order in which operations were executed in each session. Hence, Seq implies $\textsc{RVal}(\mathcal{F})$ together with SinOrd, and additionally, session arbitration (SessArb). SessArb simply requires that for any two events $e,e^{\prime}\in E$, if $e\xrightarrow{\mathsf{so}}e^{\prime}$, then $e\xrightarrow{\mathsf{ar}}e^{\prime}$. In the parametrized form we are interested only in the guarantees for events in the set $L$, and thus we use $\mathsf{so}_{L}=\mathsf{so}\cap(E\times L)$ instead of $\mathsf{so}$ (see Section 5.1). SinOrd together with SessArb imply NCC and EV [23]; however, this does not hold for the parametrized forms of these predicates. Thus, we define Seq by extending BEC (which explicitly includes EV and NCC):

$$\begin{aligned}
\textsc{SessArb}(l)&\stackrel{{\scriptstyle\text{def}}}{{=}}\mathsf{so}_{L}\subseteq\mathsf{ar}\\
\textsc{Seq}(l,\mathcal{F})&\stackrel{{\scriptstyle\text{def}}}{{=}}\textsc{SinOrd}(l)\wedge\textsc{SessArb}(l)\wedge\textsc{BEC}(l,\mathcal{F})
\end{aligned}$$

An example abstract execution $A_{\textsc{Seq}}$ that satisfies Seq is shown in Figure 4. According to Seq, since the $\mathsf{append}()$ operations are arbitrated $\mathsf{append}(a),\mathsf{append}(b)$ (as evidenced by any $\mathsf{read}()$ operation that observes both $\mathsf{append}()$ operations), any $\mathsf{read}()$ operation can either return $ab$ or $a$ , a non-empty prefix of $ab$ .

Linearizability. The correctness condition of linearizability (Lin) [18] is similar to Seq but instead of program order it enforces a stronger requirement called real-time order. Informally, a system that is linearizable guarantees that for any operation $\mathit{op}^{\prime}$ that starts (in real-time) after any operation $\mathit{op}$ ends, $\mathit{op}^{\prime}$ will observe the effects of $\mathit{op}$. We formalize Lin using the real-time order (RT) predicate, that uses the $\mathsf{rb}_{L}=\mathsf{rb}\cap(L\times L)$ relation in its parametrized form:

$$\begin{aligned}
\textsc{RT}(l)&\stackrel{{\scriptstyle\text{def}}}{{=}}\mathsf{rb}_{L}\subseteq\mathsf{ar}\\
\textsc{Lin}(l,\mathcal{F})&\stackrel{{\scriptstyle\text{def}}}{{=}}\textsc{SinOrd}(l)\wedge\textsc{RT}(l)\wedge\textsc{BEC}(l,\mathcal{F})
\end{aligned}$$

Note that Seq and Lin are incomparable in their parametrized forms. While $\textsc{Lin}(l,\mathcal{F})$ requires any two events to be arbitrated according to real-time if they both belong to $L$, $\textsc{Seq}(l,\mathcal{F})$ enforces real-time only within the same session, but only one of the events needs to belong to $L$. By using a stronger definition of $\textsc{RT}^{\prime}(l)$ with $\mathsf{rb}^{\prime}_{L}=\mathsf{rb}\cap(E\times L)$, we would force all operations to synchronize, which is incompatible with high availability of weak operations.

An example abstract execution $A_{\textsc{LIN}}$ that satisfies Lin is shown in Figure 4. According to Lin, since $\mathsf{append}(a)$ ended before $\mathsf{append}(b)$ started, the operations must be arbitrated $\mathsf{append}(a),\mathsf{append}(b)$ (as evidenced by any $\mathsf{read}()$ operation that observes both $\mathsf{append}()$ operations). If some $\mathsf{read}()$ operation started after $\mathsf{append}(a)$ ended but executed concurrently with $\mathsf{append}(b)$ ( $\mathsf{append}(b)$ would start before $\mathsf{read}()$ ended), $\mathsf{read}()$ could return either $a$ or $ab$ .
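The incomparability of the parametrized SessArb and RT predicates comes down to which pairs each one restricts: $\mathsf{so}\cap(E\times L)$ versus $\mathsf{rb}\cap(L\times L)$. The Python sketch below illustrates this with two tiny hand-made executions. The encoding of relations as sets of pairs and all identifiers are hypothetical.

```python
def restrict(rel, left, right):
    """Keep only pairs (a, b) with a in `left` and b in `right`."""
    return {(a, b) for (a, b) in rel if a in left and b in right}


def sess_arb(so, E, L, ar):
    """SessArb(l): so restricted to E x L must be contained in ar."""
    return restrict(so, E, L) <= ar


def rt(rb, L, ar):
    """RT(l): rb restricted to L x L must be contained in ar."""
    return restrict(rb, L, L) <= ar


# Case A: a weak event w precedes a strong event s in the same session,
# but ar places s before w. Only s belongs to L.
E_a, L_a = {"w", "s"}, {"s"}
rb_a = so_a = {("w", "s")}
ar_a = {("s", "w")}
case_a = (sess_arb(so_a, E_a, L_a, ar_a), rt(rb_a, L_a, ar_a))

# Case B: two strong events in different sessions, s1 returns before s2,
# but ar places s2 before s1.
E_b = L_b = {"s1", "s2"}
rb_b, so_b = {("s1", "s2")}, set()
ar_b = {("s2", "s1")}
case_b = (sess_arb(so_b, E_b, L_b, ar_b), rt(rb_b, L_b, ar_b))
```

Case A violates SessArb but not RT, and case B violates RT but not SessArb, so neither predicate implies the other.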

5.6 Correctness of ANNC and AcuteBayou

Having defined BEC, FEC and Lin, we show four formal results: two regarding ANNC and two regarding AcuteBayou. The proofs of all four theorems can be found in Appendix A.5.

As we have discussed in Section 3.2.2, we are interested in the behaviour of systems both in a fully asynchronous environment, when timing assumptions are consistently broken (e.g., because of prevalent network partitions), and in a stable one, when the minimal amount of synchrony is available so that consensus eventually terminates. Thus, we consider two kinds of runs: asynchronous and stable.

Theorem 1.

In stable runs ANNC satisfies $\textsc{BEC}(\mathsf{weak},\mathcal{F}_{\mathit{NNC}})\wedge\textsc{Lin}(\mathsf{strong},\mathcal{F}_{\mathit{NNC}})$ .

Theorem 2.

In asynchronous runs ANNC satisfies $\textsc{BEC}(\mathsf{weak},\mathcal{F}_{\mathit{NNC}})$ and it does not satisfy $\textsc{Lin}(\mathsf{strong},\mathcal{F}_{\mathit{NNC}})$ .

ANNC does not guarantee $\textsc{Lin}(\mathsf{strong},\mathcal{F}_{\mathit{NNC}})$ in asynchronous runs, because strong operations in general (for arbitrary $\mathcal{F}$ ) cannot be implemented without solving global agreement, and since in asynchronous runs TOB completion is not guaranteed, some of the operations may remain pending. It means that for some $e\in E$ , such that $\mathsf{lvl}(e)=\mathsf{strong}$ , $\mathsf{rval}(e)=\nabla$ , even though it is not allowed by $\mathcal{F}$ (recall from Section 3.2.3 that we consider fair executions).

By showing that ANNC satisfies $\textsc{BEC}(\mathsf{weak},\mathcal{F}_{\mathit{NNC}})$, we prove that temporary operation reordering is not possible in ANNC. As we discussed in Section 2.2.2, this is not the case for AcuteBayou. However, we can prove that AcuteBayou satisfies our new correctness criterion $\textsc{FEC}(\mathsf{weak},\mathcal{F})$ (for arbitrary $\mathcal{F}$).

Theorem 3.

In stable runs AcuteBayou satisfies $\textsc{FEC}(\mathsf{weak},\mathcal{F})\wedge\textsc{Lin}(\mathsf{strong},\mathcal{F})$ for any arbitrary ACT specification $(\mathcal{F},\mathsf{lvlmap})$ .

Theorem 4.

In asynchronous runs AcuteBayou satisfies $\textsc{FEC}(\mathsf{weak},\mathcal{F})$ and it does not satisfy $\textsc{Lin}(\mathsf{strong},\mathcal{F})$ for any arbitrary ACT specification $(\mathcal{F},\mathsf{lvlmap})$ .

The observation that some undesired anomalies are not inherent to all ACTs leads to interesting questions that we plan to investigate more closely in the future, e.g., what are the common characteristics of the replicated data types with mixed-consistency semantics that can be implemented as ACTs that are free of temporary operation reordering.

6 Impossibility

Now we proceed to our central contribution: we show that there exists an ACT specification for which it is impossible to propose an ACT implementation that avoids temporary operation reordering.

If a mixed-consistency ACT that implements some replicated data type $\mathcal{F}$ could avoid temporary operation reordering, it would mean that it ensures BEC for weak operations and also provides at least some criterion based on SinOrd for strong operations (to ensure a global serialization of all operations). Hence we state our main theorem:

Theorem 5.

There exists an ACT specification $(\mathcal{F},\mathsf{lvlmap})$ , for which there does not exist an implementation that satisfies $\textsc{Sin\-Ord}(\mathsf{strong})\wedge\textsc{BEC}(\mathsf{strong},\mathcal{F})$ in stable runs, and $\textsc{BEC}(\mathsf{weak},\mathcal{F})$ in both asynchronous and stable runs.

To prove the theorem, we take $\mathcal{F}_{\mathit{seq}}$ (defined in Figure 3) as an example replicated data type specification $\mathcal{F}$ . We consider an ACT specification, which features $\mathsf{append}$ and $\mathsf{read}$ operations in both consistency levels, $\mathsf{weak}$ , and $\mathsf{strong}$ . Thus, $(\mathcal{F},\mathsf{lvlmap})=(\mathcal{F}_{\mathit{seq}},\mathsf{lvlmap}_{\mathit{seq}})$ , where $\mathsf{lvlmap}_{\mathit{seq}}(\mathsf{append})=\mathsf{lvlmap}_{\mathit{seq}}(\mathsf{read})=\{\mathsf{weak},\mathsf{strong}\}$ .

Let us begin with an observation. Whenever any ACT implementation of $(\mathcal{F}_{\mathit{seq}},\mathsf{lvlmap}_{\mathit{seq}})$ that satisfies $\textsc{BEC}(\mathsf{weak},\mathcal{F}_{\mathit{seq}})$ in asynchronous runs executes a weak $\mathsf{append}$ operation, it has to $\mathrm{RB{\text{-}}cast}$ some message $m$. Since the implementation satisfies EV (through $\textsc{BEC}(\mathsf{weak},\mathcal{F}_{\mathit{seq}})$), we know that all replicas have to be informed about the invocation of $\mathsf{append}$. The replica executing the $\mathsf{append}$ operation may not postpone sending the message until some other invocation happens, because all the subsequent operation invocations on the replica may be operations that do not grant the replica the right to send messages (e.g., RO operations, by the invisible reads requirement). Moreover, the replica may not depend on $\mathrm{TOB{\text{-}}cast}$ messages, because in asynchronous runs they are not guaranteed to be delivered to other replicas (a replica may $\mathrm{TOB{\text{-}}cast}$ some messages due to the invocation of a weak $\mathsf{append}$ operation, but its correctness cannot depend on their delivery). Thus, a message must be $\mathrm{RB{\text{-}}cast}$. Since replicas cannot distinguish between asynchronous and stable runs, the same observation also holds for stable runs. We utilize this fact in our proof by considering asynchronous and stable executions and establishing certain invariants which need to hold in both kinds of runs.

We conduct the proof by contradiction using a specially constructed execution, in which a replica that executes a strong operation has to return a value without consulting all replicas. Thus, we consider an ACT implementation of $(\mathcal{F}_{\mathit{seq}},\mathsf{lvlmap}_{\mathit{seq}})$ that satisfies $\textsc{BEC}(\mathsf{weak},\mathcal{F}_{\mathit{seq}})$ in asynchronous runs, and $\textsc{BEC}(\mathsf{weak},\mathcal{F}_{\mathit{seq}})\wedge\textsc{SinOrd}(\mathsf{strong})\wedge\textsc{BEC}(\mathsf{strong},\mathcal{F}_{\mathit{seq}})$ in stable runs.

Proof.

We give a proof for a two-replica system and then show how it can be easily extended to a system with $n>2$ replicas.

We begin with an empty execution represented by a history $H=(E,\mathsf{op},\mathsf{rval},\mathsf{rb},\mathsf{ss},\mathsf{lvl})$ , which we will extend in subsequent steps. Initially replicas $R_{1}$ and $R_{2}$ are separated by a temporary network partition, which means that the messages broadcast by the replicas do not propagate (however, eventually they will be delivered once the partition heals). A weak $\mathsf{append}(a)$ operation is invoked on $R_{1}$ in the event $e_{a}$ and a weak $\mathsf{append}(b)$ operation is invoked on $R_{2}$ in the event $e_{b}$ . By input-driven processing and highly available weak operations both replicas return responses for the operations and become passive afterwards. Let $\mathsf{msgs}_{a}^{\mathit{RB}}$ and $\mathsf{msgs}_{b}^{\mathit{RB}}$ denote the set of messages $\mathrm{RB{\text{-}}cast}$ by, respectively, $R_{1}$ and $R_{2}$ , until this point. Let $\mathsf{msgs}_{a}^{\mathit{TOB}}$ and $\mathsf{msgs}_{b}^{\mathit{TOB}}$ denote the set of messages $\mathrm{TOB{\text{-}}cast}$ by, respectively, $R_{1}$ and $R_{2}$ , until this point. Neither $R_{1}$ $\mathrm{RB{\text{-}}deliver}$ s messages from the set $\mathsf{msgs}_{b}^{\mathit{RB}}$ , nor $R_{2}$ $\mathrm{RB{\text{-}}deliver}$ s messages from the set $\mathsf{msgs}_{a}^{\mathit{RB}}$ (due to the temporary network partition), but the replicas $\mathrm{RB{\text{-}}deliver}$ their own messages, and subsequently become passive (if $\mathsf{msgs}_{a}^{\mathit{TOB}}\neq\emptyset$ or $\mathsf{msgs}_{b}^{\mathit{TOB}}\neq\emptyset$ , then these messages remain pending).

Consider an alternative execution represented by history $H^{\prime}=(E^{\prime},\mathsf{op}^{\prime},\mathsf{rval}^{\prime},\mathsf{rb}^{\prime},\mathsf{ss}^{\prime},\mathsf{lvl}^{\prime})$ in which the network partition heals, and $R_{1}$ $\mathrm{RB{\text{-}}deliver}$ s all messages in the set $\mathsf{msgs}_{b}^{\mathit{RB}}$ , $R_{2}$ $\mathrm{RB{\text{-}}deliver}$ s all messages in the set $\mathsf{msgs}_{a}^{\mathit{RB}}$ , and then a weak $\mathsf{read}$ operation is invoked on $R_{1}$ in the event $e^{\prime}_{c}$ and a weak $\mathsf{read}$ operation is invoked on $R_{2}$ in the event $e^{\prime}_{d}$ . By invisible reads and highly available operations, both replicas remain passive and immediately return a response.

Claim 1.

$\mathsf{rval}^{\prime}(e^{\prime}_{c})=\mathsf{rval}^{\prime}(e^{\prime}_{d})=v$ , and $v=\mathit{ab}$ or $v=\mathit{ba}$ .

Proof.

We extend $H^{\prime}$ with infinitely many weak $\mathsf{read}$ invocations on both $R_{1}$ and $R_{2}$ , in events $e^{\prime}_{k}$ , for $k\geq 1$ . Similarly to $e^{\prime}_{c}$ and $e^{\prime}_{d}$ , the $\mathsf{read}$ operations invoked in each $e^{\prime}_{k}$ return immediately and leave the replicas $R_{1}$ and $R_{2}$ in the unmodified passive state. Since none of the $\mathsf{read}$ operations generate any new messages, $H^{\prime}$ represents a fair infinite execution that satisfies all network properties of an asynchronous run. Then, by our base assumption, there exists an abstract execution $A^{\prime}=(H^{\prime},\mathsf{vis}^{\prime},\mathsf{ar}^{\prime},\mathsf{par}^{\prime})$ , such that $A^{\prime}\models\textsc{BEC}(\mathsf{weak},\mathcal{F}_{\mathit{seq}})$ .

Because each replica remains in the same state since the execution of $e^{\prime}_{c}$ and $e^{\prime}_{d}$ , respectively, each $\mathsf{read}$ operation invoked in $e^{\prime}_{k}$ , returns the same response as $e^{\prime}_{c}$ or $e^{\prime}_{d}$ , depending on which replica the given event was executed. By $\textsc{EV}(\mathsf{weak})$ , the two updating events $e_{a}$ and $e_{b}$ have to be both observed by infinitely many of the $e^{\prime}_{k}$ events. Let $e^{\prime}_{p}$ be one such event executed on $R_{1}$ and $e^{\prime}_{q}$ be one such event executed on $R_{2}$ , then $(e_{a}\xrightarrow{\mathsf{\mathsf{vis}^{\prime}}}e^{\prime}_{p}\wedge e_{a}\xrightarrow{\mathsf{\mathsf{vis}^{\prime}}}e^{\prime}_{q}\wedge e_{b}\xrightarrow{\mathsf{\mathsf{vis}^{\prime}}}e^{\prime}_{p}\wedge e_{b}\xrightarrow{\mathsf{\mathsf{vis}^{\prime}}}e^{\prime}_{q})$ . There is either: $e_{a}\xrightarrow{\mathsf{\mathsf{ar}^{\prime}}}e_{b}$ , or $e_{b}\xrightarrow{\mathsf{\mathsf{ar}^{\prime}}}e_{a}$ . Now, by the definition of read-only operations we can exclude the RO operations from the context of any operation without affecting the return value of all operations. Thus $\mathcal{F}_{\mathit{seq}}(\mathsf{read}(),\mathsf{context}(A^{\prime},e^{\prime}_{p}))=\mathcal{F}_{\mathit{seq}}(\mathsf{read}(),\mathsf{context}(A^{\prime},e^{\prime}_{q}))=v^{\prime}$ for some $v^{\prime}$ . Because of $\textsc{RVal}(\mathsf{weak},\mathcal{F}_{\mathit{seq}})$ , $\mathsf{rval}^{\prime}(e^{\prime}_{p})=v^{\prime}=\mathsf{rval}^{\prime}(e^{\prime}_{q})$ . Therefore, all $\mathsf{read}$ operations in $H^{\prime}$ return the same value $v^{\prime}$ , including the earliest ones $e^{\prime}_{c}$ and $e^{\prime}_{d}$ , which means that $v=v^{\prime}$ . By the definition of $\mathcal{F}_{\mathit{seq}}$ , either $v=\mathit{ab}$ or $v=\mathit{ba}$ (depending on whether $e_{a}\xrightarrow{\mathsf{\mathsf{ar}^{\prime}}}e_{b}$ , or $e_{b}\xrightarrow{\mathsf{\mathsf{ar}^{\prime}}}e_{a}$ ). ∎

Without loss of generality, let us assume that $v$ obtained in the history $H^{\prime}$ equals $\mathit{ab}$. Let us return to our main history $H$. We extend it similarly to the way we extended $H^{\prime}$, but we do not allow the network partition to heal completely. Instead, we just let $\mathsf{msgs}_{b}^{\mathit{RB}}$ reach $R_{1}$, which $\mathrm{RB{\text{-}}deliver}$s them exactly as in $H^{\prime}$. Since replicas are deterministic, the current state of $R_{1}$ must be the same as it was in $H^{\prime}$ during the execution of $e^{\prime}_{c}$. Thus, similarly to $H^{\prime}$, we invoke a weak $\mathsf{read}$ operation on $R_{1}$ in an event $e_{r}$, and $\mathsf{rval}(e_{r})=v=\mathit{ab}$.

Consider yet another execution represented by history $H^{\prime\prime}=(E^{\prime\prime},\mathsf{op}^{\prime\prime},\mathsf{rval}^{\prime\prime},\mathsf{rb}^{\prime\prime},\mathsf{ss}^{\prime\prime},\mathsf{lvl}^{\prime\prime})$ which is obtained from our main execution $H$ by removing any steps executed by $R_{1}$ . The events executed on $R_{2}$ remain unchanged, since the two replicas were all the time separated by a network partition, and no messages from $R_{1}$ reached $R_{2}$ . We let the network partition heal. $R_{1}$ $\mathrm{RB{\text{-}}deliver}$ s messages from the set $\mathsf{msgs}_{b}^{\mathit{RB}}$ , both replicas $\mathrm{TOB{\text{-}}deliver}$ messages from the set $\mathsf{msgs}_{b}^{\mathit{TOB}}$ , and afterward both replicas become passive.

We now extend $H^{\prime\prime}$ by infinitely many times applying the following procedure (for $k$ from $1$ to infinity):

1. invoke a strong $\mathsf{read}$ on $R_{2}$ in the event $e^{\prime\prime}_{2k}$,
2. let $R_{2}$ execute its steps until it becomes passive,
3. on both $R_{1}$ and $R_{2}$, $\mathrm{RB{\text{-}}deliver}$ and $\mathrm{TOB{\text{-}}deliver}$ all messages, respectively, $\mathrm{RB{\text{-}}cast}$ or $\mathrm{TOB{\text{-}}cast}$, by $R_{2}$ in step 2,
4. let both replicas reach a passive state,
5. invoke a weak $\mathsf{read}$ on $R_{1}$ in the event $e^{\prime\prime}_{2k+1}$.

The resulting execution is fair and satisfies all the network properties of a stable run. Note that the strong $\mathsf{read}$ operations executed on $R_{2}$ are not restricted by invisible reads and thus may freely change the state of $R_{2}$ . Moreover, they can cause $R_{2}$ to $\mathrm{RB{\text{-}}cast}$ and $\mathrm{TOB{\text{-}}cast}$ messages. On the other hand, the weak $\mathsf{read}$ operations executed on $R_{1}$ are always executed on a passive state, and leave the replica in the same state. Moreover, $R_{1}$ does not $\mathrm{RB{\text{-}}cast}$ , nor $\mathrm{TOB{\text{-}}cast}$ any messages. By non-blocking strong operations no strong $\mathsf{read}$ operation may be pending in $H^{\prime\prime}$ . This is so, because for each $k$ , by step 4, there is no pending message not yet $\mathrm{TOB{\text{-}}deliver}$ ed on $R_{2}$ , and $R_{2}$ is in a passive state.

Claim 2.

There exists an event $e^{\prime\prime}_{x}\in E^{\prime\prime}$ , with $x=2k$ for some natural $k$ , such that $\mathsf{rval}^{\prime\prime}(e^{\prime\prime}_{x})=b$ .

Proof.

By our base assumption, there exists an abstract execution $A^{\prime\prime}=(H^{\prime\prime},\mathsf{vis}^{\prime\prime},\mathsf{ar}^{\prime\prime},\mathsf{par}^{\prime\prime})$ , such that $A^{\prime\prime}\models\textsc{Sin\-Ord}(\mathsf{strong})\wedge\textsc{BEC}(\mathsf{strong},\mathcal{F}_{\mathit{seq}})$ . Then, for each $k$ , by $\textsc{RVal}(\mathsf{strong},\mathcal{F}_{\mathit{seq}})$ , $\mathsf{rval}^{\prime\prime}(e^{\prime\prime}_{2k})=\mathcal{F}_{\mathit{seq}}(\mathsf{read}(),\mathsf{context}(A^{\prime\prime},e^{\prime\prime}_{2k}))$ . Moreover, because of $\textsc{EV}(\mathsf{strong})$ , $e_{b}$ needs to be observed from some point on by every $e^{\prime\prime}_{2k}$ . Thus, we let $e_{b}\xrightarrow{\mathsf{\mathsf{vis}^{\prime\prime}}}e^{\prime\prime}_{x}$ . Since $e_{b}$ is the only $\mathsf{append}$ operation visible to $e^{\prime\prime}_{x}$ (there are no other $\mathsf{append}$ operations in $A^{\prime\prime}$ ), by definition of $\mathcal{F}_{\mathit{seq}}$ , $\mathsf{rval}^{\prime\prime}(e^{\prime\prime}_{x})=b$ . ∎

Let us return to our main history $H$. Note that, when we restrict $H$ and $H^{\prime\prime}$ only to events on $R_{2}$, $H$ constitutes a prefix of $H^{\prime\prime}$. Moreover, the state of $R_{2}$ at the end of $H$ is the same as in $H^{\prime\prime}$ just before $\mathrm{TOB{\text{-}}deliver}$ing messages from the set $\mathsf{msgs}_{b}^{\mathit{TOB}}$ (if any) and executing the first strong $\mathsf{read}$ operation. We now extend $H$ by $\mathrm{TOB{\text{-}}deliver}$ing the messages from the set $\mathsf{msgs}_{b}^{\mathit{TOB}}$ and then with steps executed on $R_{2}$ generated using the repeated procedure for $H^{\prime\prime}$, for $k$ from $1$ to $\frac{x}{2}$. We can freely omit the steps executed on $R_{1}$, since none of them influenced $R_{2}$ in any way ($R_{2}$ did not deliver any message from $R_{1}$). (With a typical TOB implementation, it might be impossible for $R_{2}$ to $\mathrm{TOB{\text{-}}deliver}$ its own messages without the votes of $R_{1}$ to reach a quorum. However, as we have discussed earlier, we abstract away from the implementation details of the TOB mechanism. Crucially, no information was transferred from $R_{1}$ to $R_{2}$. Moreover, in a three replica system, $R_{2}$ could establish a majority with $R_{3}$ to finalize TOB.) Thus, there exists an event $e_{x}\in E$ executed on $R_{2}$, an equivalent of the $e^{\prime\prime}_{x}$ event from $H^{\prime\prime}$, such that $\mathsf{op}(e_{x})=\mathsf{read}()$, $\mathsf{lvl}(e_{x})=\mathsf{strong}$ and $\mathsf{rval}(e_{x})=b$.

Finally, we allow the network partition to heal. $R_{2}$ $\mathrm{RB{\text{-}}deliver}$ s the messages from the set $\mathsf{msgs}_{a}^{\mathit{RB}}$ , and $R_{1}$ $\mathrm{RB{\text{-}}deliver}$ s and $\mathrm{TOB{\text{-}}deliver}$ s any outstanding messages generated by $R_{2}$ (naturally, $R_{1}$ $\mathrm{TOB{\text{-}}deliver}$ s messages in the same order as $R_{2}$ did). Then, we let the replicas reach a passive state, and in order to make our constructed execution fair, we extend it with infinitely many weak $\mathsf{read}$ operations as we did with $H^{\prime}$ . By our base assumption, there exists an abstract execution $A=(H,\mathsf{vis},\mathsf{ar},\mathsf{par})$ , such that $A\models\textsc{BEC}(\mathsf{weak},\mathcal{F}_{\mathit{seq}})\wedge\textsc{Sin\-Ord}(\mathsf{strong})\wedge\textsc{BEC}(\mathsf{strong},\mathcal{F}_{\mathit{seq}})$ . There are only two $\mathsf{append}$ operations invoked in $A$ in the events $e_{a}$ and $e_{b}$ . Since $\mathsf{rval}(e_{r})=\mathit{ab}$ (which we have established after the Claim 1), by $\textsc{RVal}(\mathsf{weak},\mathcal{F}_{\mathit{seq}})$ and the definition of $\mathcal{F}_{\mathit{seq}}$ , it can be only that $e_{a}\xrightarrow{\mathsf{ar}}e_{b}$ . We also know that $\mathsf{rval}(e_{x})=b$ ( $e_{x}$ is a strong $\mathsf{read}$ operation executed on $R_{2}$ ), which means that $e_{b}\xrightarrow{\mathsf{vis}}e_{x}\wedge e_{a}\not\xrightarrow{\mathsf{vis}}e_{x}$ . By $\textsc{Sin\-Ord}(\mathsf{strong})$ , $e_{b}\xrightarrow{\mathsf{ar}}e_{x}\wedge e_{a}\not\xrightarrow{\mathsf{ar}}e_{x}$ , and thus $e_{x}\xrightarrow{\mathsf{ar}}e_{a}$ . Therefore, a cycle forms in the total order relation $\mathsf{ar}$ : $e_{a}\xrightarrow{\mathsf{ar}}e_{b}\xrightarrow{\mathsf{ar}}e_{x}\xrightarrow{\mathsf{ar}}e_{a}$ , a contradiction. This concludes our proof for a system with two replicas.

We can easily extend our reasoning to account for any number of replicas $n>2$ : any additional replica $R_{i}$ performs an infinite number of read operations, in the same fashion as the replica $R_{1}$ or $R_{2}$ , depending on whether $R_{i}$ originally belonged to the same partition as $R_{1}$ or $R_{2}$ . ∎
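The final step of the proof can be checked mechanically. The sketch below is only an illustration, not part of the formal argument: it encodes the three arbitration constraints derived above as edges of a directed graph (the string labels `e_a`, `e_b`, `e_x` and the helper `has_total_order` are names chosen here) and confirms that no total order is consistent with all of them.

```python
from graphlib import TopologicalSorter, CycleError

# Arbitration constraints derived in the proof; (x, y) means x precedes y in ar.
AR_EDGES = [("e_a", "e_b"), ("e_b", "e_x"), ("e_x", "e_a")]


def has_total_order(edges):
    """Return True if some total order is consistent with every edge."""
    sorter = TopologicalSorter()
    for before, after in edges:
        sorter.add(after, before)  # 'after' must come later than 'before'
    try:
        list(sorter.static_order())
        return True
    except CycleError:
        return False


print(has_total_order(AR_EDGES))      # all three constraints together
print(has_total_order(AR_EDGES[:2]))  # without the constraint e_x -> e_a
```

Dropping any one of the three edges makes the constraints satisfiable, which matches the role each of BEC(weak), SinOrd(strong) and totality of ar plays in the contradiction.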

Since from Theorem 5 we know that there exists an ACT specification $(\mathcal{F},\mathsf{lvlmap})$ for which we cannot propose (even a specialized) implementation that satisfies $\textsc{BEC}(\mathsf{weak},\mathcal{F})$ , we can formulate a more general result about generic ACTs:

Corollary 1.

There does not exist a generic implementation that, for an arbitrary ACT specification $(\mathcal{F},\mathsf{lvlmap})$ , satisfies $\textsc{Sin\-Ord}(\mathsf{strong})\wedge\textsc{BEC}(\mathsf{strong},\mathcal{F})$ in stable runs, and $\textsc{BEC}(\mathsf{weak},\mathcal{F})$ in both asynchronous and stable runs.

Theorem 5 shows that it is impossible to devise a system similar to AcuteBayou (for arbitrary $\mathcal{F}$ ) that never admits temporary operation reordering (so satisfies $\textsc{BEC}(\mathsf{weak},\mathcal{F})$ instead of $\textsc{FEC}(\mathsf{weak},\mathcal{F})$ ). Hence, admitting temporary operation reordering is the inherent cost of mixing eventual and strong consistency when we make no assumptions about the semantics of $\mathcal{F}$ . Naturally, for certain replicated data types, such as $\mathcal{F}_{\mathit{NNC}}$ , achieving both $\textsc{BEC}(\mathsf{weak},\mathcal{F})$ and $\textsc{Lin}(\mathsf{strong},\mathcal{F})$ is possible, as we show with ANNC.

In the next section we discuss several approaches that avoid temporary operation reordering, albeit at the cost of compromising fault-tolerance (e.g., by requiring all replicas to be operational), or sacrificing high availability (e.g., by forcing replicas to synchronize on weak operations).

7 Related work

7.1 Symmetric models with strong operations blocking upon a single crash

We start with symmetric mixed-consistency models, in which all replicas can process both weak and strong operations and communicate directly with each other (thus enabling processing of weak operations within network partitions), but either do not enable fully-fledged strong operations (there is no stabilization of operation execution order) or require all replicas to synchronize in order for a strong operation to complete. In turn, the way these models bind the execution of weak and strong operations can be understood as an asymmetric (1– $n$ ) variant of quorum-based synchronization. Hence, unlike in ACTs, strong operations cannot complete if even a single replica cannot respond (due to a machine or network failure), which is a major limitation.

Lazy Replication [22] features three operation levels: causal, forced (totally ordered with respect to one another) and immediate (totally ordered with respect to all other operations). In this approach, it is possible that two replicas execute a causal operation $\mathit{op}_{c}$ and a forced operation $\mathit{op}_{f}$ in different orders. Since $\mathit{op}_{c}$ is required to commute with $\mathit{op}_{f}$ , replicas will converge to the same state. However, the user is never certain that even after the completion of $\mathit{op}_{f}$ , on some other replica no weaker operation $\mathit{op}^{\prime}_{c}$ will be executed prior to $\mathit{op}_{f}$ . Hence the guarantees provided by forced operations are inadequate for certain use cases, which require write stabilization, e.g., an auction system [4] (see also Section 1). On the other hand, immediate operations offer stronger guarantees, but their implementation is based on three-phase commit [55], and thus requires all replicas to vote in order to proceed.

RedBlue consistency [6] extends Lazy Replication (with blue and red operations corresponding to the causal and forced ones), by allowing operations to be split into (side-effect free) generator and (globally commutative) shadow operations. This greatly increases the number of operations which commute, but red operations still do not guarantee write stabilization. To overcome this limitation, RedBlue consistency was extended with programmer-defined partial order restrictions over operations [11]. The proposed implementation, Olisipo, relies on a counter-based system to synchronize conflicting operations. Synchronization can be either symmetric (all potentially conflicting pairs of operations must synchronize, which means that weak operations are not highly available any more) or asymmetric (all replicas must be operational for strong operations to complete).

The formal framework of [10] can be used to express various consistency guarantees, including those of Lazy Replication and RedBlue consistency, but not as strong as, e.g., linearizability. Conflicts resulting from operations that do not commute are modelled through a set of tokens. On the other hand, in explicit consistency [9], stronger consistency guarantees are modelled through application-level invariants and can be achieved using multi-level locks (similar to readers-writer locks from shared memory).

All the above models assume causal consistency (CC) as the baseline consistency criterion and thus, unlike our framework, do not account for weaker consistency guarantees, such as FEC or BEC. CC is argued to be costly to ensure in practice [16], which makes our approach more general.

Finally, the model in [7] is similar to ours but treats strong operations as fences (barriers). It means that all replicas must vote in order for a strong operation to complete.

Temporary operation reordering is not possible in the models discussed above. It is because they are either state-based (and thus their formalism abstracts away from the operation return values which clients observe and interpret) and feature no write stabilization, or they require all replicas to vote in order to process strong operations.

7.2 Symmetric Bayou-like models

In Section 2 we have already discussed the relationship between the seminal Bayou protocol [24] and ACTs.

In eventually-serializable data service (ESDS) [36], operations are executed speculatively before they are stabilized, similarly to Bayou. However, ESDS additionally allows a programmer to attach to an operation an arbitrary causal context that must be satisfied before the operation is executed. Zeno [56] is similar to Bayou but has been designed to tolerate Byzantine failures.

All three systems (Bayou, ESDS, Zeno) are eventually consistent, but ensure that eventually there exists a single serialization of all operations, and the client may wait for a notification that a certain operation was stabilized. Since these systems enable the execution of arbitrarily complex operations (as ACTs do), they admit temporary operation reordering.

Several researchers attempted a formal analysis of the guarantees provided by Bayou or systems similar to it. E.g., the authors of Zeno [56] describe its behaviour using I/O automata. In [57] the authors analyse Bayou and explain it through a formal framework that is tailored to Bayou. Neither of these approaches is as general as ours, and neither enables comparison of the guarantees provided by other systems. Finally, the framework in [52] enables reasoning about eventually consistent systems that enable speculative executions and rollbacks, and so also about AcuteBayou. However, the framework does not formalize strong consistency models, which means it is not suitable for our purposes.

7.3 Asymmetric models with cloud as a proxy

Contrary to our approach, the work described below assumes an asymmetric model in which external clients maintain local copies of primary objects that reside in a centralized (replicated) system, referred to as the cloud. Clients perform weak operations on local copies and only synchronize with the cloud lazily or to complete strong operations. Since the cloud functions as a communication proxy between the clients, when it is unavailable (e.g., due to failures of a majority of replicas or a partition), clients cannot observe even each other's new weak operations. Hence, this approach is less flexible than ours. However, since the cloud serves the role of a single source of truth, conflicts between concurrent updates can be resolved before they are propagated to the clients, so temporary operation reordering is not possible.

In cloud types [20], clients issue operations on replicated objects stored in the local revision and occasionally synchronize with the main revision stored in the cloud, similarly to version control systems. The synchronization happens either eagerly or lazily, depending on the synchronization mode used. The authors use revision consistency [58] as the target correctness criterion. In a subsequent work [21] a global sequence protocol (GSP) was introduced, which refines the programming model of cloud types, and replaces revision consistency with an abstract data model, as revisions and revision consistency were deemed too complicated for non-expert users. Global sequence consistency (GSC) [59] is a consistency model that generalizes GSP and a few other approaches that assume external clients that either eagerly or lazily push or pull data from the cloud.

7.4 Asymmetric master-slave models

There are systems which relax strong consistency by allowing clients to read stale data, either on demand (the client may forgo recency guarantees by choosing a weak consistency level for an operation), or depending on the replica localization (in a geo-replicated system the client accessing the nearest replica can read stale data that are pertinent to a different region). However, in such systems all updating operations (including the weak ones) must pass through the primary server designated for each particular data item. Thus, similarly to the asymmetric, cloud as a proxy models, in this approach weak operations are not freely disseminated among the replicas. Since all updates (of a concrete data item) are serialized by the primary, temporary operation reordering is not possible.

Examples of systems which follow this design and allow users to select an appropriate consistency level include PNUTS [60], Pileus [8], and also the widely popular contemporary cloud data stores, such as AmazonDB [12] and CosmosDB [13]. Systems that guarantee strong consistency within a single site and causal consistency between sites include Walter [5], COPS [50], Eiger [61] and Occult [62].

7.5 Other approaches

Certain eventually consistent NoSQL data stores enable strongly consistent operations on demand. E.g., Riak allows some data to be kept in strongly consistent buckets [15], which is a namespace completely separate from the one used for data accessed in a regular, eventually-consistent way. Apache Cassandra provides compare-and-set-like operations, called light-weight transactions (LWTs) [14], which can be executed on any data, but the user is forbidden from executing weakly consistent updates on that data at the same time. Concurrent updates and LWTs result in undefined behaviour [17], which means that mixed-consistency semantics of LWTs can be considered broken.

In Lynx [63] and Salt [64] mixed-consistency transactions are translated into a chain of subtransactions, each committed at a different primary site. Thus such transactions can block or raise an error if a specific site is unavailable.

Recently some work has been published on the programming language perspective of mixed-consistency semantics. Since this research is not directly related to our work, we briefly discuss only a few papers. Correctables [65] are abstractions similar to futures, that can be used to obtain multiple, incremental views on the operation return value (e.g., a result of a speculative execution of the operation and then the final return value). Correctables are used as an interface for the modified variants of Apache Cassandra and ZooKeeper [66] (a strongly consistent system). In MixT [67] each data item is marked with a consistency level that will be used upon access. A transaction that accesses data marked with different consistency levels is split into multiple independently executed subtransactions, each corresponding to a concrete consistency level. The compilation-time code-level verification ensures that operations performed on data marked with weaker consistency levels do not influence the operations on data marked with stronger consistency levels. Understandably, the execution of a mixed-level transaction can be blocking. Finally, in [68] the authors advocate the use of the release-acquire semantics (adapted from low-level concurrent programming) and propose Kite, a mixed-consistency key-value store utilizing this consistency model. In Kite weak read operations occasionally require inter-replica synchronization and block on network communication, thus they are not highly available.

8 Conclusions

In this paper we defined acute cloud types, a class of replicated systems that aim at seamless mixing of eventual and strong consistency. ACTs are primarily designed to execute client-submitted operations in a highly available, eventually-consistent fashion, similarly to CRDTs. However, for tasks that cannot be performed in that way, ACTs at the same time support operations that require some form of distributed consensus-based synchronization.

We defined ACTs and the guarantees they provide in our novel framework which is suited for modeling mixed-consistency systems. We also proposed a new consistency criterion called fluctuating eventual consistency, which captures a common trait of many ACTs, namely temporary operation reordering. Interestingly, temporary operation reordering appears neither in systems that are purely eventually consistent (e.g., NoSQL data stores) nor purely strongly consistent (e.g., traditional DBMS). Moreover, it is not necessarily present in all ACTs, but as we formally prove, it cannot be avoided in ACTs that feature arbitrarily complex (but deterministic) semantics (e.g., arbitrary SQL transactions).

Appendix A

In this appendix we present additional material that could not be included in the article due to space considerations. In Section A.1 we give a detailed description of the seminal Bayou protocol, and in Section A.2 we discuss its liveness guarantees. Next, in Section A.3 we supplement the details on how the Bayou protocol can be improved to form the general-purpose ACT AcuteBayou. In Section A.4 we formalize the properties of the $\mathsf{state}$ object, the black box component responsible for the semantics of the implemented data type in the algorithms of Bayou and AcuteBayou. Finally, in Section A.5 we provide the formal proofs of correctness for ANNC and AcuteBayou.

A.1 Bayou – detailed description

The pseudocode in Algorithm 2 specifies the Bayou protocol for replica $R_{i}$ . Replicas are independent and communicate solely by message passing. When a client submits an operation $\mathit{op}$ to a replica, $\mathit{op}$ is broadcast within a $\mathrm{Req}$ message using a gossip protocol. In our pseudocode, we use regular reliable broadcast, RB (line 12; we say that $\mathit{op}$ has been $\mathrm{RB{\text{-}}cast}$ ). Through the code in line 13 we simulate immediate local $\mathrm{RB{\text{-}}deliver}$ y of $\mathit{op}$ .

Each Bayou replica totally-orders all operations it knows about (executed locally or received through RB). In order to keep track of the total order, a replica maintains two lists of operations: $\mathsf{committed}$ and $\mathsf{tentative}$ . The $\mathsf{committed}$ list encompasses the stabilized operations, i.e., operations whose final execution order has been established by the primary. On the other hand, the $\mathsf{tentative}$ list encompasses operations whose final execution order has not yet been determined. The operations on the $\mathsf{tentative}$ list are sorted using the operations’ timestamps (to resolve any ties, the replica identifiers and per replica sequence numbers are used). A timestamp is assigned to an operation as soon as a Bayou replica receives it from a client.

A Bayou replica continually executes operations one by one in the order determined by the concatenation of the two lists: $\mathsf{committed}\cdot\mathsf{tentative}$ (line 55). The replica keeps additional data structures, such as $\mathsf{executed}$ and $\mathsf{toBeExecuted}$ , to keep track of its progress. An operation $\mathit{op}\in\mathsf{committed}$ , once executed, will not be executed again, as its final execution order is determined. On the other hand, an operation in the $\mathsf{tentative}$ list might be executed and rolled back multiple times. This is because a replica adds operations to the $\mathsf{tentative}$ list (rearranging it if necessary; lines 18-16) as they are delivered by a gossip protocol. Hence, a replica might execute some operation $\mathit{op}$ , and then, in order to maintain the proper execution order consistent with the modified $\mathsf{tentative}$ list, the replica might be forced to roll $\mathit{op}$ back (line 51), execute a just received operation $\mathit{op}^{\prime}$ (which has a lower timestamp than $\mathit{op}$ ), and execute $\mathit{op}$ again. We maintain the $\mathsf{toBeRolledBack}$ list of operations scheduled for rollback (operations are kept in the reverse of the order in which they were executed, line 48). An operation execution can proceed only once all the scheduled rollbacks have been performed.
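The rollback-and-re-execute behaviour described above can be illustrated with a small simulation. This is a simplified sketch, not Algorithm 2: the class `ReplicaSketch`, its `log` field and the eager `_adjust` step are assumptions made here for illustration, and no real state object or network is involved.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True, order=True)
class Req:
    # Sort key: timestamp, then replica id, then per-replica sequence number.
    timestamp: int
    replica_id: int
    seq_no: int
    op: str = field(compare=False)


class ReplicaSketch:
    def __init__(self):
        self.committed = []  # stabilized requests, in their final order
        self.tentative = []  # not yet stabilized, kept sorted by Req order
        self.executed = []   # requests currently reflected in the state
        self.log = []        # ("exec" | "rollback", op), for inspection only

    def deliver(self, req):
        """Add a gossiped request and realign execution with committed . tentative."""
        self.tentative.append(req)
        self.tentative.sort()
        self._adjust()

    def _adjust(self):
        target = self.committed + self.tentative
        keep = 0  # length of the common prefix of executed and target
        while (keep < len(self.executed) and keep < len(target)
               and self.executed[keep] == target[keep]):
            keep += 1
        # Roll back the divergent suffix, most recently executed first.
        for req in reversed(self.executed[keep:]):
            self.log.append(("rollback", req.op))
        self.executed = self.executed[:keep]
        # Execute (or re-execute) the rest in the target order.
        for req in target[keep:]:
            self.log.append(("exec", req.op))
            self.executed.append(req)


replica = ReplicaSketch()
replica.deliver(Req(5, 1, 0, "op"))        # executed speculatively
replica.deliver(Req(3, 2, 0, "op_prime"))  # lower timestamp arrives later
print(replica.log)
```

In this run `op` is executed, rolled back when `op_prime` arrives, and then executed again after `op_prime`, which is exactly the scenario the paragraph describes.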

One of the replicas, called the primary, periodically commits operations from its $\mathsf{tentative}$ list by moving them to the end of the $\mathsf{committed}$ list, thus establishing their final execution order (line 38). The primary announces the commit of operations by $\mathrm{RB{\text{-}}cast}$ ing commit messages, so that each replica can also commit the appropriate operations. Note that the primary uses the FIFO variant of RB to ensure that all replicas commit the same set of operations in the same order.

Intuitively, the replicas converge to the same state, which is reflected by the $\mathsf{committed}\cdot\mathsf{tentative}$ list of operations. More precisely, when the stream of operations incoming to the system ceases and there are no network partitions (the replicas can communicate with the primary), the $\mathsf{committed}$ lists at all replicas will be the same, whereas the $\mathsf{tentative}$ lists will be empty. On the other hand, when there are partitions, some operations might not be successfully committed by the primary, but will be disseminated within a partition using RB. Then all replicas within the same partition will have the same $\mathsf{committed}$ and (non-empty) $\mathsf{tentative}$ lists.

Operations are executed on the $\mathsf{state}$ object (line 4), which encapsulates the state of the local database. At any moment, the value of $\mathsf{state}$ corresponds to a sequence $s$ of the operations already executed on a given replica, where $s$ is a prefix of $\mathsf{committed}\cdot\mathsf{tentative}$ . Note that $\mathsf{state}$ allows us to easily roll back a suffix of $s$ (line 51). We discuss the properties of the $\mathsf{state}$ object in more detail in Section A.4.

Algorithm 3 shows pseudocode of a reference implementation of the StateObject for arbitrary operations of any sequential data type (a specialized one can be used to take advantage of a specific data type's characteristics or to enable non-sequential semantics for certain replicated data types which expose concurrency to the client). We assume that each operation can be specified as a composition of read and write operations on registers (objects) together with some local computation. The assumption is sensible, as the operations are executed locally, in a sequential manner, and thus no stronger primitives than registers (such as CAS, fetch-and-add, etc.) are necessary. The StateObject keeps an undo log which allows it to revoke the effects of any operation executed so far (the log can be truncated to include only the operations on the $\mathsf{tentative}$ list).
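A register-based state object with an undo log can be sketched as follows. This is a minimal illustration in the spirit of the description above, not a transcription of Algorithm 3: the class `StateObjectSketch`, the `read`/`write` callbacks and the `append` helper are hypothetical names, and rollback is assumed to proceed strictly in reverse execution order.

```python
class StateObjectSketch:
    def __init__(self):
        self.registers = {}  # register name -> current value
        self.undo_log = []   # per request: (req_id, [(name, existed, old_value), ...])

    def execute(self, req_id, operation):
        """Run operation(read, write), recording overwritten values for undo."""
        undo = []

        def read(name, default=None):
            return self.registers.get(name, default)

        def write(name, value):
            undo.append((name, name in self.registers, self.registers.get(name)))
            self.registers[name] = value

        result = operation(read, write)
        self.undo_log.append((req_id, undo))
        return result

    def rollback(self, req_id):
        """Revoke the most recently executed request, which must be req_id."""
        last_id, undo = self.undo_log.pop()
        assert last_id == req_id, "rollbacks must follow reverse execution order"
        for name, existed, old_value in reversed(undo):
            if existed:
                self.registers[name] = old_value
            else:
                del self.registers[name]


def append(suffix):
    """Hypothetical operation: append to a string register and return it."""
    def operation(read, write):
        write("seq", read("seq", "") + suffix)
        return read("seq")
    return operation


state = StateObjectSketch()
print(state.execute(1, append("a")))
print(state.execute(2, append("b")))
state.rollback(2)
print(state.registers)
```

Truncating `undo_log` to the entries of tentative requests, as the text suggests, would bound its size, since committed requests are never rolled back.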

A.2 Liveness guarantees in Bayou

Eventually consistent systems are aimed at providing high availability. It means that a replica is supposed to respond to a request even in the presence of network partitions in the system. This requirement can be formalized in different ways. In the model considered by Brewer [69], a network partition can last infinitely. Then, high availability can be formalized as wait-freedom [70], which means that each request is eventually processed by the system and the response is returned to the client. In the more commonly assumed model that admits only temporary network partitions (we also adopt this model, similarly to, e.g., [7] [33]), that requirement is not strong enough, since a replica could trivially just wait until the partitions are repaired before executing a request and responding to the client. Therefore, in such a model the requirement of high availability must be formulated differently. It can be done as follows: a system is highly available if it executes each request in a finite number of steps even when no messages are exchanged between the replicas (the replica cannot indefinitely postpone execution of a request or returning the response to the client, see Section 3.3 for a formal definition). In this sense, Bayou is highly available. However, this definition of high availability does not preclude situations in which, e.g., the number of steps the execution of each request takes grows over time and thus is unbounded. Hence, one could formulate a slightly stronger requirement, i.e., bounded wait-freedom [70], which states that there is a possibly unknown but bounded number of protocol steps that the replica takes before a response is returned to the client upon invocation of an operation. Interestingly, unlike many popular NoSQL data stores, such as [1] or [71], Bayou does not guarantee bounded wait-freedom even for weak operations, as we now demonstrate.

Consider a Bayou system with $n$ replicas, one of which, $R_{s}$ , processes requests more slowly than all other replicas. Assume also that every fixed period of time $\Delta t$ there are $n$ new weak requests issued, one on each replica, and the processing capabilities of all replicas are saturated. In every $\Delta t$ , $R_{s}$ should process all $n$ requests (as do other replicas), but it starts to lag behind, with its backlog constantly growing. Intuitively, every new operation invoked on $R_{s}$ will be scheduled for execution after all operations in the backlog, as they were issued with lower timestamps. Hence the response time will increase with every new invocation on $R_{s}$ . One could try to overcome the problem of the increasing latency on $R_{s}$ by artificially slowing the clock on $R_{s}$ , thus giving unfair priority to the operations issued on $R_{s}$ , compared to operations issued on other replicas. But then any operation invoked on $R_{s}$ would appear on other replicas as an operation from a distant past. In turn, any such operation would cause a growing number of rollbacks on the other replicas.
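The growth of the backlog on $R_{s}$ can be made concrete with a back-of-the-envelope simulation. The sketch below is an illustration under simplifying assumptions that are not in the text: time advances in whole periods of $\Delta t$, and the hypothetical parameter `capacity` is the number of requests the slow replica completes per period.

```python
def backlog_after(rounds, n, capacity):
    """Backlog of a replica that receives n requests per period and completes `capacity`."""
    backlog = 0
    history = []
    for _ in range(rounds):
        backlog += n                       # n new weak requests, one per replica
        backlog -= min(backlog, capacity)  # the replica executes what it can
        history.append(backlog)
    return history


print(backlog_after(rounds=4, n=5, capacity=3))  # slow replica: capacity < n
print(backlog_after(rounds=4, n=5, capacity=5))  # replica that keeps up
```

With `capacity < n` the backlog grows by `n - capacity` every period, so a request newly invoked on the slow replica waits behind an ever longer queue, and no fixed bound on the number of steps before its response exists.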

Strong operations cannot be (bounded) wait-free simply because in order for them to complete, the primary must be operational, which cannot be guaranteed in a fault-prone environment.

Interestingly, in AcuteBayou (see Sections 2.2.5 and A.3) the execution of weak operations is trivially bounded wait-free, as they are executed immediately upon their invocations.

A.3 Bayou improved

As we discussed in Section 2.2.5, we can improve the Bayou protocol to make it more fault-tolerant and free of circular causality, and thus obtain AcuteBayou. In Algorithm 4 we present the modifications to Algorithm 2 that give us AcuteBayou. Note that, in accordance with the ACT restrictions (see Section 3.3), we also improve the execution of weak read-only (RO) operations: since an RO operation $\mathit{op}$ does not change the logical state of the $\mathsf{state}$ object, $\mathit{op}$ can be executed only locally. [Footnote 8: We assume that StateObject features an overloaded $\mathrm{execute}$ function which takes a plain operation as an argument, instead of a $\mathrm{Req}$ record, when executing RO operations.]

Firstly, we use TOB in place of the primary to establish the final operation execution order. More precisely, every (weak, updating) operation is broadcast using RB (as before) as well as TOB (lines 15–16). When a replica $\mathrm{TOB{\text{-}}deliver}$ s an operation $\mathit{op}$ (line 23), it stabilizes $\mathit{op}$ . Since TOB guarantees that all replicas $\mathrm{TOB{\text{-}}deliver}$ the same set of messages in the same order, all replicas will stabilize the same set of operations in the same order. As we have argued, TOB can be implemented in a way that avoids a single point of failure [19].

Further changes are aimed at eliminating circular causality in Bayou as well as improving the response time for weak operations. To this end (1) any strong operation is broadcast using TOB only (line 21), and (2) upon being submitted, any weak operation is executed immediately on the current state, and then rolled back (lines 13 and 14). It is easy to see that the modification (2) means the incoming stream of weak operations from other replicas cannot delay the execution of weak operations submitted locally. Below we argue why the two above modifications allow us to avoid circular causality in Bayou.
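The two modifications can be illustrated on a toy data type. The sketch below is not Algorithm 4: the class `AcuteInvokeSketch`, the string-valued state with an append operation, and the `rb_out`/`tob_out` lists standing in for RB-cast and TOB-cast are all assumptions made for illustration.

```python
class AcuteInvokeSketch:
    def __init__(self):
        self.value = ""    # toy state: a sequence built by appends
        self.rb_out = []   # stand-in for messages handed to RB-cast
        self.tob_out = []  # stand-in for messages handed to TOB-cast

    def invoke_append(self, item, strong):
        if strong:
            # Modification (1): strong operations are disseminated by TOB only;
            # the response is produced later, upon TOB delivery.
            self.tob_out.append(item)
            return None
        # Modification (2): execute immediately on the current state ...
        saved = self.value
        self.value = saved + item
        response = self.value
        # ... then roll back, so that the operation is applied again only
        # when it reaches its place in committed . tentative.
        self.value = saved
        self.rb_out.append(item)
        self.tob_out.append(item)
        return response


replica = AcuteInvokeSketch()
replica.value = "a"
print(replica.invoke_append("b", strong=False))
print(replica.value)
```

Because the weak response is computed before any message is sent or received, its latency does not depend on how many remote operations are waiting to be executed locally.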

The change (1) means that for any pair of a strong $s$ and a weak operation $w$ , if the return value of any operation $e$ depends on both $s$ and $w$ ( $e$ observes $s$ and $w$ ), they will be observed in an order consistent with the final operation execution order. We prove it through the following observations:

1. for $e$ to observe $s$ , $s$ must be committed (in the modified algorithm $s$ never appears on the $\mathsf{tentative}$ list),

2. if $e$ is a strong operation, then $w$ must also be committed, because upon execution strong operations do not observe operations on the $\mathsf{tentative}$ list; hence both operations are observed according to their final execution order,

3. otherwise ( $e$ is a weak operation):

(a) $w$ is updating (not RO), because otherwise it would not logically impact the return value of $e$ ,

(b) if $w$ is already committed, it is similar to case 2,

(c) if $w$ is not yet committed, $e$ will observe the operations in the order $s,w$ ; on the other hand, once $w$ is delivered by TOB and committed, it will appear on the $\mathsf{committed}$ list after $s$ , and so $e$ also observes $s$ and $w$ in the same order $s,w$ .

The change (2) is necessary to prevent circular causality between two (or more) weak operations (the case depicted in Figure 1). This is because the modified algorithm executes a weak (updating) operation $\mathit{op}$ without waiting for the $\mathrm{RB{\text{-}}cast}$ / $\mathrm{TOB{\text{-}}cast}$ message to arrive. It means that no concurrent operation $\mathit{op}^{\prime}$ will be executed prior to the first execution of $\mathit{op}$ , whose return value the client observes. Otherwise $\mathit{op}$ could observe $\mathit{op}^{\prime}$ even though the final execution order is $\mathit{op},\mathit{op}^{\prime}$ .

Finally, we redefine the $\mathrm{Req}$ record to include the execution context $\mathsf{ctx}$ , i.e., the identifiers of requests already executed upon the invocation of the current operation and which have influenced the $\mathsf{state}$ object (those on the $\mathsf{executed}$ list and those on the $\mathsf{toBeRolledBack}$ list). Note that in practice such identifiers can be efficiently represented using Dotted Version Vectors [72]. With the augmented $\mathrm{Req}$ record the implementation of StateObject can take advantage of the relative visibility between operations to achieve the non-sequential semantics of such replicated data types as MVRs or ORsets.

A.4 StateObject properties

Although in Algorithm 3 we present a reference implementation of StateObject, in general we treat the $\mathsf{state}$ object as a black box with unknown implementation. The correctness of AcuteBayou depends on the properties of the $\mathsf{state}$ object which we formalize below.

Take the list of requests that were executed on the $\mathsf{state}$ , and remove the requests which were rolled back; we call the resulting sequence $\alpha$ the current trace of the $\mathsf{state}$ . [Footnote 9: We omit weak RO operations executed in Algorithm 4 line 6, which are not associated with any $\mathrm{Req}$ record.] Since the $\mathsf{state}$ encapsulates the state of the system after locally executing and revoking requests, we require that the $\mathsf{state}$ ’s responses are consistent with a deterministic serial execution of $\alpha$ as specified by the type specification $\mathcal{F}$ when taking into account the relative visibility between requests encoded in the $\mathsf{ctx}$ field of the $\mathrm{Req}$ record. In case of any strong operation $\mathit{op}$ (in a request $r$ ), we assume that all requests $r^{\prime}\in\alpha$ prior to $r$ are visible to $r$ (regardless of $\mathsf{ctx}$ ). This is because $\mathit{op}$ is executed only once $r$ is on the $\mathsf{committed}$ list and thus its position relative to all other operations is fixed and corresponds to the TOB order.

More precisely, for any given trace $\alpha$ , the $\mathsf{state}$ object deterministically holds the state $S_{\alpha}$ , and for any operation $\mathit{op}\in\mathsf{ops}(\mathcal{F})$ , the response of the $\mathsf{state}.\mathrm{execute}$ function invoked on the $\mathsf{state}$ object in state $S_{\alpha}$ equals $\mathcal{F}(\mathit{op},C_{\alpha})$ , where $C_{\alpha}=(E_{\alpha},\mathsf{op}_{\alpha},\mathsf{vis}_{\alpha},\mathsf{ar}_{\alpha})$ is a context such that:

• $E_{\alpha}$ consists of all the requests in $\alpha$ ,

• $\mathsf{op}_{\alpha}(r)=r.\mathit{op}$ , for any request $r\in E_{\alpha}$ ,

• $\mathsf{vis}_{\alpha}$ is the visibility relation based on the $\mathsf{ctx}$ fields of the $\mathrm{Req}$ record for the weak operations and on the order in $\alpha$ for strong operations, i.e., for any $r,r^{\prime}\in E_{\alpha}$ such that $r\xrightarrow{\mathsf{vis}_{\alpha}}r^{\prime}$ :

– if $r^{\prime}.\mathsf{strongOp}=\mathsf{false}$ , then $r.\mathsf{id}\in r^{\prime}.\mathsf{ctx}$ ;

– if $r^{\prime}.\mathsf{strongOp}=\mathsf{true}$ , then $r\xrightarrow{\mathsf{ar}_{\alpha}}r^{\prime}$ ;

• $\mathsf{ar}_{\alpha}$ is the enumeration of requests in $E_{\alpha}$ according to their position in $\alpha$ .
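The construction of the context $C_{\alpha}$ from a trace can be sketched in a few lines. The code below is an illustration only: the `Req` fields mirror the names used above, the function `context` is a name chosen here, and it takes the largest relation satisfying the two conditions (every earlier request is visible to a strong one, and exactly the requests listed in $\mathsf{ctx}$ are visible to a weak one), which is an assumption about how the conditions are meant to be read.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Req:
    id: int
    op: str
    strong_op: bool
    ctx: frozenset = frozenset()  # ids of requests observed at invocation


def context(alpha):
    """Build (E, op, vis, ar) for a trace alpha given as a list of Req."""
    position = {r.id: i for i, r in enumerate(alpha)}
    events = set(position)
    op_of = {r.id: r.op for r in alpha}
    ar = [r.id for r in alpha]  # arbitration: position in the trace
    vis = set()
    for a in alpha:
        for b in alpha:
            if a.id == b.id:
                continue
            if b.strong_op:
                # Strong: everything earlier in the trace is visible.
                if position[a.id] < position[b.id]:
                    vis.add((a.id, b.id))
            elif a.id in b.ctx:
                # Weak: visibility is dictated by the recorded context.
                vis.add((a.id, b.id))
    return events, op_of, vis, ar


alpha = [
    Req(1, "append(a)", strong_op=False),
    Req(2, "append(b)", strong_op=False),  # concurrent with request 1
    Req(3, "read()", strong_op=True),
]
print(sorted(context(alpha)[2]))
```

In this example requests 1 and 2 are not visible to each other, yet both are visible to the strong read, which is what allows a state object to implement non-sequential types while keeping strong operations aligned with the arbitration order.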

In AcuteBayou, $\alpha=\mathsf{executed}\cdot\mathrm{reverse}(\mathsf{toBeRolledBack})$ , because:

• requests are executed only if $\mathsf{toBeRolledBack}$ is empty, [Footnote 10: Weak requests are also executed in the invoke block, independently of the $\mathsf{toBeExecuted}$ and $\mathsf{toBeRolledBack}$ lists, but they are immediately afterwards rolled back, so they do not influence the trace.]

• whenever a request is executed it is added to the $\mathsf{executed}$ list, thus it is appended to the end of $\alpha$ ,

• in the $\mathrm{adjustExecution}$ function, some requests move from the $\mathsf{executed}$ list to the end of the $\mathsf{toBeRolledBack}$ list, thus not changing their position in $\alpha$ ,

• whenever a request is rolled back, it is removed from the head of the $\mathsf{toBeRolledBack}$ list, and thus removed from the end of $\alpha$ , consistently with the definition of a trace.
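The four observations can be exercised on a small model of the two lists. This is a sketch under assumptions: the class `TraceSketch` and the `keep` parameter of `adjust_execution` (the length of the prefix of `executed` that stays valid) are names introduced here, and the requests are plain strings.

```python
class TraceSketch:
    def __init__(self):
        self.executed = []
        self.to_be_rolled_back = []  # most recently executed request first

    def trace(self):
        """alpha = executed . reverse(toBeRolledBack)."""
        return self.executed + list(reversed(self.to_be_rolled_back))

    def execute(self, req):
        assert not self.to_be_rolled_back, "pending rollbacks must finish first"
        self.executed.append(req)  # appended to the end of alpha

    def adjust_execution(self, keep):
        """Schedule executed[keep:] for rollback without changing alpha."""
        suffix = self.executed[keep:]
        self.executed = self.executed[:keep]
        # Later-executed requests go first, so reverse(...) restores their order.
        self.to_be_rolled_back.extend(reversed(suffix))

    def rollback_one(self):
        """Roll back the head of toBeRolledBack, i.e. the last element of alpha."""
        return self.to_be_rolled_back.pop(0)


t = TraceSketch()
for r in ["r1", "r2", "r3", "r4"]:
    t.execute(r)
before = t.trace()
t.adjust_execution(keep=1)
print(t.trace() == before)  # moving requests between the lists preserves alpha
t.rollback_one()
print(t.trace())
```

Each rollback shortens the trace from its end, so at every moment the trace is exactly the sequence of requests whose effects the state object still reflects.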

A.5 Proofs of correctness

In this section we provide the formal proofs of correctness for ANNC and AcuteBayou anticipated in Section 5.6. We start with an overview of the structure of the proofs.

In order to prove correctness of either protocol, we take a single arbitrary execution of the protocol, and without making any specific assumptions about it, we show how the visibility and arbitration relations can be defined so that the appropriate correctness guarantees can be proven. Below we briefly outline our approach.

In both ANNC and AcuteBayou, strong operations are disseminated solely by TOB, and weak updating operations are sent using both RB and TOB. On the other hand weak RO operations are executed completely locally and do not involve any network communication (strong RO operations are present only in AcuteBayou and are treated as regular strong operations). Thus, in the proofs, for the purpose of constructing the arbitration relation ( $\mathsf{ar}$ ), we order all updating (strong or weak) operations based on the order of the delivery of their respective messages broadcast using TOB. In the case of updating operations whose messages were not $\mathrm{TOB{\text{-}}deliver}$ ed (which can happen in the asynchronous runs), we order them in $\mathsf{ar}$ after all the operations whose messages were $\mathrm{TOB{\text{-}}deliver}$ ed. Their relative order can be arbitrary in ANNC, and in AcuteBayou it has to conform to the order imposed by the $\mathrm{Req}$ records. Finally, for completeness, $\mathsf{ar}$ needs to include also weak RO operations. We carefully interleave them with updating operations in such a way to guarantee no circular causality as well as equivalence between visibility and arbitration for strong operations.

We construct the visibility relation ( $\mathsf{vis}$ ) by choosing for any two events $e,e^{\prime}$ whether one should be observed by the other. We include an edge $e\xrightarrow{\mathsf{vis}}e^{\prime}$ under two, broad conditions: the edge is essential, i.e., $e$ could have influenced the return value of $e^{\prime}$ , or the edge is non-essential, i.e., $e$ could not have influenced the return value of $e^{\prime}$ (because, e.g., $e$ is an RO operation), but $e$ occurs before $e^{\prime}$ in real-time or arbitration. Non-essential edges are important to guarantee eventual visibility for all events.

Now let us make some observations regarding network properties during synchronous and asynchronous runs. Since we consider infinite fair executions, in both types of runs each message $\mathrm{RB{\text{-}}cast}$ is guaranteed to be $\mathrm{RB{\text{-}}deliver}$ ed by each replica. On the other hand, the same delivery guarantee, but for messages $\mathrm{TOB{\text{-}}cast}$ , holds only in the stable runs, and in the asynchronous runs, some messages can be $\mathrm{TOB{\text{-}}deliver}$ ed while others may remain pending. However, asynchronous runs still obey other guarantees, which means that, crucially, no messages $\mathrm{TOB{\text{-}}cast}$ will be $\mathrm{TOB{\text{-}}deliver}$ ed by any replica out of order. Moreover, if some message was $\mathrm{TOB{\text{-}}deliver}$ ed by one replica, then it will be $\mathrm{TOB{\text{-}}deliver}$ ed by all replicas. Also, if one replica manages to $\mathrm{TOB{\text{-}}cast}$ infinitely many messages which are then $\mathrm{TOB{\text{-}}deliver}$ ed, then each replica can successfully $\mathrm{TOB{\text{-}}cast}$ and $\mathrm{TOB{\text{-}}deliver}$ its messages. Thus, in the asynchronous runs, we expect a finite number of $\mathrm{TOB{\text{-}}cast}$ messages to be $\mathrm{TOB{\text{-}}deliver}$ ed, while all others remain pending.

For each event $e$ let us denote by $\mathsf{msg}_{\mathsf{TOB}}(e)$ and $\mathsf{msg}_{\mathsf{RB}}(e)$ , respectively, the message $\mathrm{TOB{\text{-}}cast}$ in the event $e$ and the message $\mathrm{RB{\text{-}}cast}$ in the event $e$ (both $\mathsf{msg}_{\mathsf{TOB}}(e)$ and $\mathsf{msg}_{\mathsf{RB}}(e)$ can be undefined for a given event $e$ , denoted $\mathsf{msg}_{\mathsf{TOB}}(e)=\bot$ or $\mathsf{msg}_{\mathsf{RB}}(e)=\bot$ ). For any two events $e,e^{\prime}$ , such that $\mathsf{msg}_{\mathsf{TOB}}(e)=m$ , $\mathsf{msg}_{\mathsf{TOB}}(e^{\prime})=m^{\prime}$ and $\mathsf{tobNo}(m)<\mathsf{tobNo}(m^{\prime})$ we introduce the following notation: $e\xrightarrow{\mathsf{\mathsf{tobNo}}}e^{\prime}$ , which defines the $\mathsf{tobNo}$ order (based on the $\mathsf{tobNo}$ function). Additionally, for any two events $e,e^{\prime}$ , such that $\mathsf{msg}_{\mathsf{TOB}}(e)=m$ (or respectively $\mathsf{msg}_{\mathsf{RB}}(e)=m$ ), we write $e\xrightarrow{\mathsf{\mathsf{TOBdel}}}e^{\prime}$ ( $e\xrightarrow{\mathsf{\mathsf{RBdel}}}e^{\prime}$ ), if $e^{\prime}$ executes on a replica that has $\mathrm{TOB{\text{-}}deliver}$ ed ( $\mathrm{RB{\text{-}}deliver}$ ed) $m$ prior to its execution.

Finally, let us observe that we model replicas as deterministic state machines (as discussed in Section 3.2), whose specification we give through pseudocode. The variables declared in the algorithms of ANNC and AcuteBayou represent the state of the replicas, while the code blocks represent atomic steps that transition the replicas from one state to another. It means that each such block executes completely before any of its effects become visible. This allows us to infer the following rule (in both ANNC and AcuteBayou) for weak operations which execute in one atomic transition in some event $e$ , which is either in the $\mathsf{TOBdel}$ or $\mathsf{RBdel}$ relation with any other event $e^{\prime}$ : $\mathsf{lvl}(e)=\mathsf{weak}\wedge(e\xrightarrow{\mathsf{\mathsf{TOBdel}}}e^{\prime}\vee e\xrightarrow{\mathsf{\mathsf{RBdel}}}e^{\prime})\Rightarrow e\xrightarrow{\mathsf{rb}}e^{\prime}$ ( $e$ returns before $e^{\prime}$ ).

A.5.1 ANNC correctness proofs

Let us proceed with the proof of the guarantees offered by ANNC in the stable runs.

See 1

Proof.

For any given arbitrary stable run of ANNC represented by a history $H=(E,\mathsf{op},\mathsf{rval},\mathsf{rb},\mathsf{ss},\mathsf{lvl})$ we have to find suitable $\mathsf{vis}$ , $\mathsf{ar}$ and $\mathsf{par}$ , such that $A=(H,\mathsf{vis},\mathsf{ar},\mathsf{par})$ is such that $A\models\textsc{BEC}(\mathsf{weak},\mathcal{F}_{\mathit{NNC}})\wedge\textsc{Lin}(\mathsf{strong},\mathcal{F}_{\mathit{NNC}})$ .

Additional observations. Note that each $\mathit{subtract}$ operation executed in some event $e$ finishes when the replica $\mathrm{TOB{\text{-}}deliver}$ s the message $m=\mathsf{msg}_{\mathsf{TOB}}(e)$ . It means that for every operation executed in event $e^{\prime}$ , such that $e\xrightarrow{\mathsf{rb}}e^{\prime}$ , if $\mathsf{msg}_{\mathsf{TOB}}(e^{\prime})=m^{\prime}$ ( $m^{\prime}\neq\bot$ ), then $\mathsf{tobNo}(m)<\mathsf{tobNo}(m^{\prime})$ .

Arbitration. We construct the total order relation $\mathsf{ar}$ by sorting all updating events (additions and subtractions) based on the order in which their respective $\mathrm{TOB{\text{-}}cast}$ messages are $\mathrm{TOB{\text{-}}deliver}$ ed, i.e., respecting the $\mathsf{tobNo}$ order.

Next, we interleave the updating events with RO events (gets) in the following way: each RO event $e$ occurs in $\mathsf{ar}$ after the last subtract event $e^{\prime}$ such that $e\not\xrightarrow{\mathsf{rb}}e^{\prime}$ . Thus, for each RO event $e$ and each subtract event $e^{\prime}$ the following holds: $e\xrightarrow{\mathsf{ar}}e^{\prime}\Rightarrow e\xrightarrow{\mathsf{rb}}e^{\prime}$ . The relative order of RO operations is irrelevant.

As ANNC does not feature operation reordering, for each event $e$ we simply let $\mathsf{par}(e)=\mathsf{ar}$ .
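For a finite fragment of a run, the construction of $\mathsf{ar}$ described above can be sketched as follows. This is an illustration under stated assumptions, not part of the proof: the `Event` record and its `invoked`/`returned` timestamps are hypothetical, and $\mathsf{rb}$ is approximated as "returned strictly before the other event was invoked".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    name: str
    kind: str                      # "add", "subtract" or "get"
    invoked: float                 # hypothetical invocation time
    returned: float                # hypothetical return time
    tob_no: Optional[int] = None   # tobNo of the TOB-cast message; None for get


def returns_before(e, f):
    # Assumed finite stand-in for the rb relation.
    return e.returned < f.invoked


def build_ar(events):
    # Step 1: updating events sorted by the tobNo order.
    ar = sorted((e for e in events if e.kind != "get"), key=lambda e: e.tob_no)
    # Step 2: each get goes right after the last subtract s with not (get rb s).
    for g in (e for e in events if e.kind == "get"):
        pos = 0
        for i, x in enumerate(ar):
            if x.kind == "subtract" and not returns_before(g, x):
                pos = i + 1
        ar.insert(pos, g)
    return ar


a1 = Event("a1", "add", 0.0, 1.0, tob_no=1)
s1 = Event("s1", "subtract", 2.0, 5.0, tob_no=2)
s2 = Event("s2", "subtract", 6.0, 8.0, tob_no=3)
g0 = Event("g0", "get", 0.5, 1.5)   # returns before s1 is invoked
g1 = Event("g1", "get", 3.0, 3.5)   # concurrent with s1, returns before s2
ar = build_ar([a1, s1, s2, g0, g1])
order = [e.name for e in ar]
```

By construction, every get that precedes a subtract in the resulting sequence also returns before that subtract, which is the implication $e\xrightarrow{\mathsf{ar}}e^{\prime}\Rightarrow e\xrightarrow{\mathsf{rb}}e^{\prime}$ used later for edge 6.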

Visibility. For any two events $e,e^{\prime}\in E$ , we include an edge $e\xrightarrow{\mathsf{vis}}e^{\prime}$ in our construction of $\mathsf{vis}$ , if:

1. $\mathsf{op}(e)=\mathit{add}(v)$ or $\mathsf{op}(e)=\mathit{subtract}(v)$ , $\mathsf{op}(e^{\prime})=\mathit{subtract}(v^{\prime})$ and $e\xrightarrow{\mathsf{\mathsf{tobNo}}}e^{\prime}$ ,

2. $\mathsf{op}(e)=\mathit{subtract}(v)$ , $\mathsf{op}(e^{\prime})=\mathit{get}$ and $e\xrightarrow{\mathsf{\mathsf{TOBdel}}}e^{\prime}$ ,

3. $\mathsf{op}(e)=\mathit{add}(v)$ , $\mathsf{op}(e^{\prime})=\mathit{get}$ , and $e\xrightarrow{\mathsf{\mathsf{TOBdel}}}e^{\prime}$ ,

4. $\mathsf{op}(e)=\mathit{add}(v)$ , $\mathsf{op}(e^{\prime})=\mathit{get}$ , and $e\xrightarrow{\mathsf{\mathsf{RBdel}}}e^{\prime}$ ,

5. $\mathsf{op}(e)=\mathit{get}$ , $\mathsf{op}(e^{\prime})=\mathit{get}$ and $e\xrightarrow{\mathsf{rb}}e^{\prime}$ ,

6. $\mathsf{op}(e)=\mathit{get}$ , $\mathsf{op}(e^{\prime})=\mathit{subtract}(v^{\prime})$ and $e\xrightarrow{\mathsf{ar}}e^{\prime}$ ,

7. $\mathsf{op}(e^{\prime})=\mathit{add}(v^{\prime})$ and $e\xrightarrow{\mathsf{rb}}e^{\prime}$ ,

(for some $v,v^{\prime}\in\mathbb{N}$ ).

The edges 1-4 are essential, while the edges 5-7 are non-essential. The updates that are visible to a $\mathit{subtract}$ operation depend solely on the $\mathsf{tobNo}$ order, while in the case of a $\mathit{get}$ operation, the $\mathsf{TOBdel}$ and $\mathsf{RBdel}$ relations play a role. It does not matter which updates are visible to an $\mathit{add}$ operation because it always responds with a simple $\mathsf{ok}$ acknowledgment, hence the edge 7 is non-essential.

Note that in case of edges 3-4, $e\xrightarrow{\mathsf{rb}}e^{\prime}$ is implied (see the general observations in Section A.5), and in case of the edge 6, $e\xrightarrow{\mathsf{rb}}e^{\prime}$ follows directly from the construction of $\mathsf{ar}$ . Thus, for all edges 3-7, $e\xrightarrow{\mathsf{rb}}e^{\prime}$ .

Having defined $A$ (through $\mathsf{vis}$ , $\mathsf{ar}$ and $\mathsf{par}$ ), it now remains to show that $A\models\textsc{BEC}(\mathsf{weak},\mathcal{F}_{\mathit{NNC}})\wedge\textsc{Lin}(\mathsf{strong},\mathcal{F}_{\mathit{NNC}})$ , or more specifically $A\models\textsc{EV}(\mathsf{weak})\wedge\textsc{EV}(\mathsf{strong})\wedge\textsc{NCC}(\mathsf{weak})\wedge\textsc{NCC}(\mathsf{strong})\wedge\textsc{RVal}(\mathsf{weak})\wedge\textsc{RVal}(\mathsf{strong})\wedge\textsc{Sin\-Ord}(\mathsf{strong})\wedge\textsc{RT}(\mathsf{strong})$ .

Eventual visibility. We prove now that eventual visibility is satisfied for all events:

•

each $\mathit{add}$ or $\mathit{subtract}$ event $e$ is visible to all subsequent $\mathit{subtract}$ events from some point, because there is only a finite number of updating events $e^{\prime}$ such that $e\not\xrightarrow{\mathsf{\mathsf{tobNo}}}e^{\prime}$ (1),

•

each $\mathit{add}$ or $\mathit{subtract}$ event $e$ is visible to all subsequent $\mathit{get}$ events from some point, because both $\mathsf{msg}_{\mathsf{RB}}(e)$ and $\mathsf{msg}_{\mathsf{TOB}}(e)$ are eventually delivered on all replicas, (2, 3 and 4),

•

each $\mathit{get}$ event $e$ is visible to all subsequent $\mathit{get}$ events from some point (5),

•

each $\mathit{get}$ event $e$ is visible to all subsequent $\mathit{subtract}$ events from some point, because by construction of $\mathsf{ar}$ there is only a finite number of events $e^{\prime}$ such that $e\not\xrightarrow{\mathsf{ar}}e^{\prime}$ (6),

•

each event is visible to all subsequent $\mathit{add}$ events from some point (7).

No circular causality. We need to show that $\mathsf{acyclic}(\mathsf{hb}\cap(W\times W))$ and $\mathsf{acyclic}(\mathsf{hb}\cap(S\times S))$ , where $W\subseteq E,S\subseteq E$ , are, respectively, the sets of all weak, and strong events. We elect to prove a more general case of $\mathsf{acyclic}(\mathsf{hb})$ .

Recall that $\mathsf{hb}=(\mathsf{vis}\cup\mathsf{so})^{+}$ . If $\mathsf{acyclic}(\mathsf{vis}\cup\mathsf{so})$ , then $\mathsf{acyclic}(\mathsf{hb})$ , because transitive edges cannot introduce cycles. Thus, we have eight types of edges to consider: edges 1-7 from $\mathsf{vis}$ and the eighth edge $e\xrightarrow{\mathsf{so}}e^{\prime}$ . We divide them into two groups: the first one consists of edges 1-2, while the second one consists of edges 3-8. Note that for the second group $e\xrightarrow{\mathsf{rb}}e^{\prime}$ always holds.

There can be no cycles when we restrict the edges only to the ones from the first group, as edge 1 is constrained by the $\mathsf{tobNo}$ order, and edge 2 leads to a $\mathit{get}$ event which cannot be followed using only edges from the first group.

Also, there can be no cycles when we restrict the edges only to the ones from the second group, as all the edges are constrained by the $\mathsf{rb}$ relation, which is naturally acyclic.

Thus, a potential cycle could only form when we mix edges from both groups. Let us assume that the cycle contains the following chain of edges: $a\xrightarrow{\mathsf{\mathsf{hb}}}b\xrightarrow{\mathsf{\mathsf{hb}}}...\xrightarrow{\mathsf{\mathsf{hb}}}c\xrightarrow{\mathsf{\mathsf{hb}}}...\xrightarrow{\mathsf{\mathsf{hb}}}d$ , where $a,b,c,d\in E$ , all the edges between $b$ and $c$ belong to the second group, while the other ones belong to the first group. Notice that $b\xrightarrow{\mathsf{rb}}c$ , and that $\mathsf{op}(a),\mathsf{op}(c)\in\{\mathit{add}(v):v\in\mathbb{N}\}\cup\{\mathit{subtract}(v):v\in\mathbb{N}\}$ while $\mathsf{op}(b),\mathsf{op}(d)\in\{\mathit{subtract}(v):v\in\mathbb{N}\}\cup\{\mathit{get}\}$ . Thus, the chain consists of a series of edges from the first group and a series of edges from the second group. The whole cycle can be combined from multiple such chains, but for simplicity, let us assume that it contains only one such chain and that $d=a$ (the same reasoning as below can be applied iteratively for multiple interleavings of edges from the two groups).

If $\mathsf{op}(b)=\mathit{subtract}(v)$ , for some $v\in\mathbb{N}$ , then $a\xrightarrow{\mathsf{\mathsf{tobNo}}}b$ (edge 1), and since $b\xrightarrow{\mathsf{rb}}c$ , also $b\xrightarrow{\mathsf{\mathsf{tobNo}}}c$ (see the additional observations at the beginning of the proof). A contradiction: $a\xrightarrow{\mathsf{\mathsf{tobNo}}}b\xrightarrow{\mathsf{\mathsf{tobNo}}}c\xrightarrow{\mathsf{\mathsf{tobNo}}}a$ .

If $\mathsf{op}(b)=\mathit{get}$ , then $\mathsf{op}(a)=\mathit{subtract}(v)$ , for some $v\in\mathbb{N}$ , and $a\xrightarrow{\mathsf{\mathsf{TOBdel}}}b$ (edge 2). Either $a\xrightarrow{\mathsf{\mathsf{tobNo}}}c$ , or $c\xrightarrow{\mathsf{\mathsf{tobNo}}}a$ . In the former case we end up with a similar contradiction as above: $a\xrightarrow{\mathsf{\mathsf{tobNo}}}c\xrightarrow{\mathsf{\mathsf{tobNo}}}a$ . In the latter case, since $c\xrightarrow{\mathsf{\mathsf{tobNo}}}a$ , also $c\xrightarrow{\mathsf{\mathsf{TOBdel}}}b$ (the message $\mathsf{msg}_{\mathsf{TOB}}(c)$ is $\mathrm{TOB{\text{-}}deliver}$ ed before the message $\mathsf{msg}_{\mathsf{TOB}}(a)$ ). However, $b\xrightarrow{\mathsf{rb}}c$ , which means that the message $\mathsf{msg}_{\mathsf{TOB}}(c)$ was not even $\mathrm{TOB{\text{-}}cast}$ yet when $b$ executed. A contradiction.
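On a finite fragment of an execution, the property established above can also be checked mechanically. The sketch below is only an illustration: it assumes the $\mathsf{vis}$ and $\mathsf{so}$ relations are given as sets of pairs of hypothetical event names, and it uses Python's standard `graphlib` module (Python 3.9 or later) to test $\mathsf{acyclic}(\mathsf{vis}\cup\mathsf{so})$, which suffices for $\mathsf{acyclic}(\mathsf{hb})$ because the transitive closure of an acyclic relation is acyclic.

```python
from graphlib import CycleError, TopologicalSorter


def is_acyclic(edges):
    """Return True iff the directed graph given as (source, target) pairs
    admits a topological order, i.e. contains no cycle."""
    sorter = TopologicalSorter()
    for source, target in edges:
        # TopologicalSorter.add takes a node followed by its predecessors.
        sorter.add(target, source)
    try:
        list(sorter.static_order())  # consuming the iterator triggers the check
        return True
    except CycleError:
        return False


# Hypothetical finite fragment: a1 and s1 are updating events, g1 is a get.
vis = {("a1", "s1"), ("s1", "g1"), ("a1", "g1")}
so = {("a1", "g1")}
ok = is_acyclic(vis | so)

# Adding an edge from the get back to the add closes a cycle.
broken = is_acyclic(vis | so | {("g1", "a1")})
```

Such a check is no substitute for the proof, which covers infinite executions, but it can help validate a candidate $\mathsf{vis}$ relation on recorded traces.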

Return value consistency. We need to show that for each event $e\in E$ : $\mathsf{rval}(e)=\mathcal{F}_{\mathit{NNC}}(\mathsf{op}(e),\mathsf{context}(A,e))$ . We base our reasoning below on essential $\mathsf{vis}$ edges and $\mathsf{ar}$ order.

Trivially, the condition is satisfied for all $\mathit{add}$ events, which always return $\mathsf{ok}$ . For all $\mathit{subtract}$ and $\mathit{get}$ events, we can exclude from $\mathsf{context}(A,e)$ all $\mathit{get}$ events which by the definition of an RO operation are irrelevant for the computation of $\mathcal{F}_{\mathit{NNC}}$ .

In case of a $\mathit{subtract}(v)$ operation, for some $v\in\mathbb{N}$ , executed in some event $e$ , $\mathsf{context}(A,e)$ includes all the $\mathit{add}$ and $\mathit{subtract}$ events that precede $e$ in the $\mathsf{tobNo}$ order. When applying the $\mathsf{foldr}$ function from the definition of $\mathcal{F}_{\mathit{NNC}}$ , these $\mathit{add}$ and $\mathit{subtract}$ operations are processed one by one, in the order of their $\mathrm{TOB{\text{-}}deliver}$ y (by construction of $\mathsf{ar}$ ). Each $\mathit{add}(v)$ operation increases the accumulator by $v$ , and each $\mathit{subtract}(v)$ operation decreases the accumulator by $v$ , but only if it is greater than or equal to $v$ . This matches the pseudocode (lines 24 and 27-28) with the accumulator corresponding to the difference between $\mathsf{strongAdd}$ and $\mathsf{strongSub}$ variables. Thus, the computed value of the $\mathsf{foldr}$ function corresponds to the difference between $\mathsf{strongAdd}$ and $\mathsf{strongSub}$ variables at the time the response to $e$ is computed in line 26. If that value is greater than or equal to $v$ then $\mathsf{true}$ is returned, which matches the pseudocode’s behaviour.

In case of a $\mathit{get}$ operation executed in some event $e$ , $\mathsf{context}(A,e)$ includes all the $\mathit{add}$ and $\mathit{subtract}$ events that were $\mathrm{TOB{\text{-}}deliver}$ ed before the execution of $e$ , as well as (possibly) some $\mathit{add}$ events which were not $\mathrm{TOB{\text{-}}deliver}$ ed, but only $\mathrm{RB{\text{-}}deliver}$ ed before the execution of $e$ . Note that all the latter $\mathit{add}$ events are ordered according to $\mathsf{ar}$ , after all the former $\mathit{add}$ and $\mathit{subtract}$ events (had they been ordered earlier due to a lower $\mathsf{tobNo}$ value of their respective $\mathrm{TOB{\text{-}}cast}$ messages, they would also have been $\mathrm{TOB{\text{-}}deliver}$ ed). When processing the $\mathsf{foldr}$ function up to the last $\mathrm{TOB{\text{-}}deliver}$ ed event, the value of the accumulator corresponds, similarly as in case of $\mathit{subtract}$ events above, to the difference between $\mathsf{strongAdd}$ and $\mathsf{strongSub}$ variables. Then, when processing the remaining $\mathit{add}$ events the final computed value of the $\mathsf{foldr}$ function grows by an amount $V$ , which is equal to the sum of all these $\mathit{add}$ operations’ arguments. Due to the fact that each $\mathrm{TOB{\text{-}}deliver}$ ed message is first $\mathrm{RB{\text{-}}deliver}$ ed or is processed as if it were $\mathrm{RB{\text{-}}deliver}$ ed (lines 22-23), the value of $\mathsf{weakAdd}$ is always greater than or equal to $\mathsf{strongAdd}$ . The difference between $\mathsf{weakAdd}$ and $\mathsf{strongAdd}$ variables corresponds exactly to $V$ , because it includes events which were $\mathrm{RB{\text{-}}deliver}$ ed, but not $\mathrm{TOB{\text{-}}deliver}$ ed. 
Thus, the computed value of $\mathcal{F}_{\mathit{NNC}}(\mathit{get},\mathsf{context}(A,e))$ equals $\mathsf{strongAdd}-\mathsf{strongSub}+V=\mathsf{strongAdd}-\mathsf{strongSub}+\mathsf{weakAdd}-\mathsf{strongAdd}=\mathsf{weakAdd}-\mathsf{strongSub}$ at the time of executing $e$ , which matches $\mathsf{rval}(e)$ .
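The identity $\mathsf{strongAdd}-\mathsf{strongSub}+V=\mathsf{weakAdd}-\mathsf{strongSub}$ can be illustrated on a small example. The sketch below is reconstructed from the prose of this proof only, not from the ANNC pseudocode: `nnc_fold` is a left-to-right loop standing in for the $\mathsf{foldr}$ of $\mathcal{F}_{\mathit{NNC}}$, and the class `AnncCounters` with its methods `rb_deliver`, `tob_deliver` and `get` is a hypothetical model of the three counters.

```python
def nnc_fold(ops):
    """Process updating operations in arbitration order.
    ops is a list of ("add", v) or ("subtract", v) pairs."""
    acc = 0
    for kind, v in ops:
        if kind == "add":
            acc += v
        elif kind == "subtract" and acc >= v:
            acc -= v          # a subtract takes effect only if acc >= v
    return acc


class AnncCounters:
    """Hypothetical model of the weakAdd, strongAdd and strongSub variables."""

    def __init__(self):
        self.weak_add = 0
        self.strong_add = 0
        self.strong_sub = 0
        self._rb_seen = set()

    def rb_deliver(self, op_id, v):
        # Count each add towards weakAdd exactly once.
        if op_id not in self._rb_seen:
            self._rb_seen.add(op_id)
            self.weak_add += v

    def tob_deliver(self, op_id, kind, v):
        if kind == "add":
            self.rb_deliver(op_id, v)   # processed as if RB-delivered
            self.strong_add += v
            return None
        success = self.strong_add - self.strong_sub >= v
        if success:
            self.strong_sub += v
        return success

    def get(self):
        return self.weak_add - self.strong_sub


c = AnncCounters()
c.rb_deliver("a1", 5)
c.tob_deliver("a1", "add", 5)
first = c.tob_deliver("s1", "subtract", 3)    # 5 >= 3, succeeds
second = c.tob_deliver("s2", "subtract", 4)   # 2 < 4, fails
c.rb_deliver("a2", 7)                         # RB-delivered only
V = c.weak_add - c.strong_add
folded = nnc_fold([("add", 5), ("subtract", 3), ("subtract", 4), ("add", 7)])
```

Here the add that was only $\mathrm{RB{\text{-}}deliver}$ ed is placed last in the folded sequence, mirroring the way such events are ordered in $\mathsf{ar}$ after all $\mathrm{TOB{\text{-}}deliver}$ ed ones.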

Single order. Since there are no pending $\mathit{subtract}$ operations (because eventually every message is $\mathrm{TOB{\text{-}}deliver}$ ed and the operations finish), we have to simply prove that $\mathsf{vis}\cap(E\times S)=\mathsf{ar}\cap(E\times S)$ , where $S=\{e:\mathsf{lvl}(e)=\mathsf{strong}\}$ . In other words, for any two events $e\in E,e^{\prime}\in S$ : $e\xrightarrow{\mathsf{vis}}e^{\prime}\Leftrightarrow e\xrightarrow{\mathsf{ar}}e^{\prime}$ .

Let us begin with $e\xrightarrow{\mathsf{vis}}e^{\prime}\Rightarrow e\xrightarrow{\mathsf{ar}}e^{\prime}$ . Either $e\xrightarrow{\mathsf{\mathsf{tobNo}}}e^{\prime}$ (edge 1), or $\mathsf{op}(e)=\mathit{get}$ (edge 6). In both cases $e\xrightarrow{\mathsf{ar}}e^{\prime}$ .

Now let us consider $e\xrightarrow{\mathsf{ar}}e^{\prime}\Rightarrow e\xrightarrow{\mathsf{vis}}e^{\prime}$ . Either $\mathsf{op}(e)\in\{\mathit{add}(v):v\in\mathbb{N}\}\cup\{\mathit{subtract}(v):v\in\mathbb{N}\}$ , or $\mathsf{op}(e)=\mathit{get}$ . In the former case, $e\xrightarrow{\mathsf{\mathsf{tobNo}}}e^{\prime}$ , and thus $e\xrightarrow{\mathsf{vis}}e^{\prime}$ (edge 1). In the latter case, $e\xrightarrow{\mathsf{rb}}e^{\prime}$ (by construction of $\mathsf{ar}$ ), and thus $e\xrightarrow{\mathsf{vis}}e^{\prime}$ (edge 6).
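For a finite history without pending operations, the equality $\mathsf{vis}\cap(E\times S)=\mathsf{ar}\cap(E\times S)$ can be tested directly. The following sketch assumes that $\mathsf{vis}$ is given as a set of pairs and $\mathsf{ar}$ as a sequence; the function names and the event names are hypothetical.

```python
def restrict_to_targets(rel, targets):
    # rel ∩ (E × targets)
    return {(e, f) for (e, f) in rel if f in targets}


def single_order_holds(vis, ar_seq, strong):
    # Expand the sequence into the total order relation it represents.
    ar = {(ar_seq[i], ar_seq[j])
          for i in range(len(ar_seq))
          for j in range(i + 1, len(ar_seq))}
    return restrict_to_targets(vis, strong) == restrict_to_targets(ar, strong)


ar_seq = ["a1", "g1", "s1"]   # one add, one get, one strong subtract
strong = {"s1"}
vis = {("a1", "s1"), ("g1", "s1"), ("a1", "g1")}
holds = single_order_holds(vis, ar_seq, strong)

# Dropping the non-essential edge from the get (edge 6) breaks the equality.
fails = single_order_holds(vis - {("g1", "s1")}, ar_seq, strong)
```

The second call shows why edge 6 is included in $\mathsf{vis}$ even though it cannot influence the return value of the subtract.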

Real-time order. We need to show that arbitration order respects the real-time order of strong operations, i.e., $\mathsf{rb}\cap(S\times S)\subseteq\mathsf{ar}$ . In other words, for any two $e,e^{\prime}\in S$ : $e\xrightarrow{\mathsf{rb}}e^{\prime}\Rightarrow e\xrightarrow{\mathsf{ar}}e^{\prime}$ .

Clearly, if $e\xrightarrow{\mathsf{rb}}e^{\prime}$ , then $e\xrightarrow{\mathsf{\mathsf{tobNo}}}e^{\prime}$ (see the additional observations at the beginning of the proof). Thus, $e\xrightarrow{\mathsf{ar}}e^{\prime}$ (by construction of $\mathsf{ar}$ ). ∎

Now, let us continue with the proof of the guarantees offered by ANNC in the asynchronous runs.

See 2

Proof.

To show the inability of ANNC to satisfy $\textsc{Lin}(\mathsf{strong},\mathcal{F}_{\mathit{NNC}})$ in asynchronous runs, it is sufficient to observe that due to some of the $\mathrm{TOB{\text{-}}cast}$ messages not being $\mathrm{TOB{\text{-}}deliver}$ ed, some of the $\mathit{subtract}$ operations remain pending. A pending operation’s return value equals $\nabla$ , which is irreconcilable with the requirements of the predicate $\textsc{RVal}(\mathcal{F}_{\mathit{NNC}})$ .

The proof regarding the guarantees of the weak operations is similar to the one for the stable runs, thus we rely on it and focus only on differences between stable and asynchronous runs that need to be addressed. Now for any given arbitrary asynchronous run of ANNC represented by a history $H=(E,\mathsf{op},\mathsf{rval},\mathsf{rb},\mathsf{ss},\mathsf{lvl})$ we have to find suitable $\mathsf{vis}$ , $\mathsf{ar}$ and $\mathsf{par}$ , such that $A=(H,\mathsf{vis},\mathsf{ar},\mathsf{par})$ is such that $A\models\textsc{BEC}(\mathsf{weak},\mathcal{F}_{\mathit{NNC}})$ .

Arbitration. We construct the total order relation $\mathsf{ar}$ by sorting all updating events (additions and subtractions) based on the order in which their respective $\mathrm{TOB{\text{-}}cast}$ messages are $\mathrm{TOB{\text{-}}deliver}$ ed, i.e., respecting the $\mathsf{tobNo}$ order. Updating events whose messages are not $\mathrm{TOB{\text{-}}deliver}$ ed are ordered after those whose messages are $\mathrm{TOB{\text{-}}deliver}$ ed.

Next, we interleave the updating events with RO events (gets) in the following way: each RO event $e$ occurs in $\mathsf{ar}$ after the last non-pending subtract event $e^{\prime}$ such that $e\not\xrightarrow{\mathsf{rb}}e^{\prime}$ . Thus, for each RO event $e$ and each non-pending subtract event $e^{\prime}$ the following holds: $e\xrightarrow{\mathsf{ar}}e^{\prime}\Rightarrow e\xrightarrow{\mathsf{rb}}e^{\prime}$ . The relative order of RO operations is irrelevant.

As ANNC does not feature operation reordering, for each event $e$ we simply let $\mathsf{par}(e)=\mathsf{ar}$ .
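The first step of this arbitration construction, placing updating events with undelivered messages after the delivered ones, can be sketched as below. The representation is an assumption made for illustration: each updating event is a hypothetical `(name, tob_no)` pair, with `tob_no` set to `None` when its message is never $\mathrm{TOB{\text{-}}deliver}$ ed. Since the relative order of such events can be arbitrary in ANNC, the sketch simply keeps their input order.

```python
def arbitrate_updates(updates):
    """Order updating events: delivered ones by tobNo, undelivered ones last.
    updates is a list of (name, tob_no) pairs; tob_no is None if undelivered."""
    delivered = sorted((u for u in updates if u[1] is not None),
                       key=lambda u: u[1])
    pending = [u for u in updates if u[1] is None]   # arbitrary order in ANNC
    return [name for name, _ in delivered + pending]


order = arbitrate_updates([("s2", None), ("a1", 2), ("s1", 1), ("a2", None)])
```

For AcuteBayou the order of the undelivered events is not arbitrary (it has to conform to the order imposed by the $\mathrm{Req}$ records), so this sketch applies to ANNC only.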

Visibility. We construct the visibility relation in the same way as in the stable runs case. However, we remove edges to and from pending $\mathit{subtract}$ events. Since pending operations do not provide a return value, no edge to a pending event is essential. Also, as we guarantee only eventual visibility for weak events, edges to $\mathit{subtract}$ events are not necessary to satisfy $\textsc{EV}(\mathsf{weak})$ . Moreover, edges from pending events are not needed either, because by definition a pending event is never followed in $\mathsf{rb}$ by any other event (which is a requirement to fail the test for EV). Again, for all edges 3-7, $e\xrightarrow{\mathsf{rb}}e^{\prime}$ .

Having defined $A$ (through $\mathsf{vis}$ , $\mathsf{ar}$ and $\mathsf{par}$ ), it now remains to show that $A\models\textsc{BEC}(\mathsf{weak},\mathcal{F}_{\mathit{NNC}})$ , or more specifically $A\models\textsc{EV}(\mathsf{weak})\wedge\textsc{NCC}(\mathsf{weak})\wedge\textsc{RVal}(\mathsf{weak})$ .

Eventual visibility. We prove now that eventual visibility is satisfied for all weak events:

•

each $\mathit{add}$ or non-pending $\mathit{subtract}$ event $e$ is visible to all subsequent $\mathit{get}$ events from some point, because $\mathsf{msg}_{\mathsf{RB}}(e)$ or $\mathsf{msg}_{\mathsf{TOB}}(e)$ are eventually delivered on all replicas (2, 3 and 4),

•

each $\mathit{get}$ event $e$ is visible to all subsequent $\mathit{get}$ events from some point (5),

•

each non-pending event is visible to all subsequent $\mathit{add}$ events from some point (7).

No circular causality. We use exactly the same reasoning as in the stable runs case to show that $\mathsf{acyclic}(\mathsf{hb})$ holds true.

Return value consistency. Again, we use exactly the same reasoning as in the stable runs case to show that for each weak event $e\in E$ : $\mathsf{rval}(e)=\mathcal{F}_{\mathit{NNC}}(\mathsf{op}(e),\mathsf{context}(A,e))$ . Although this time we only need to prove return value consistency for $\mathit{add}$ and $\mathit{get}$ operations, it can be shown that it also holds for non-pending subtract events.

∎

A.5.2 AcuteBayou correctness proofs

The proofs for AcuteBayou are analogous to those for ANNC, but are slightly more complex due to operation reordering and the more general nature of AcuteBayou with unconstrained operations’ semantics (in contrast, ANNC features weak updating operations that always return $\mathsf{ok}$ ). Because we strive in this section for self-contained proofs, we do not refer to the proofs for ANNC even when doing so would allow us to omit some repetitions.

We begin with the proof of guarantees offered by AcuteBayou in the stable runs.

See 3

Proof.

For any given arbitrary stable run of AcuteBayou represented by a history $H=(E,\mathsf{op},\mathsf{rval},\mathsf{rb},\mathsf{ss},\mathsf{lvl})$ we have to find suitable $\mathsf{vis}$ , $\mathsf{ar}$ and $\mathsf{par}$ , such that $A=(H,\mathsf{vis},\mathsf{ar},\mathsf{par})$ is such that $A\models\textsc{FEC}(\mathsf{weak},\mathcal{F})\wedge\textsc{Lin}(\mathsf{strong},\mathcal{F})$ .

Additional observations. All events besides weak RO ones have an associated unique $\mathrm{Req}$ record which is disseminated using $\mathrm{RB{\text{-}}cast}$ and $\mathrm{TOB{\text{-}}cast}$ ; let us denote by $\mathsf{req}(e)$ the $\mathrm{Req}$ record of the event $e$ (footnote 11: thus a trace of the $\mathsf{state}$ object, which consists of such records, can be translated into a sequence of events). Since the handling of weak RO events, which are local to a replica, differs significantly from that of other events, which are shared, we divide the set of all events $E$ into two subsets: $\Psi\subseteq E$ , consisting of weak updating, strong updating and strong RO events; and $\Omega\subseteq E$ , consisting of weak RO events. We also further divide $\Psi$ into subsets $\Psi_{w}$ and $\Psi_{s}$ , consisting of, respectively, weak and strong events.

Upon $\mathrm{TOB{\text{-}}deliver}$ y of a $\mathrm{COMMIT}$ message, the received request $r$ is committed (Algorithm 4 line 24), i.e., it is appended at the end of the $\mathsf{committed}$ list, and removed from the $\mathsf{tentative}$ list (if present there). Note that the position of $r$ established on the $\mathsf{committed}$ list never changes as the list is only appended to. Once the request is committed, the operation associated with the request is eventually executed (unless the request was already executed in the order consistent with the commit order) and then the request is never rolled back. This is so, because:

•

the $\mathsf{committed}$ list is included in the $\mathsf{newOrder}$ list as a prefix in the $\mathrm{commit}$ procedure (Algorithm 2 line 33),

•

until the request $r$ executes it has to feature on the list $\mathsf{toBeExecuted}$ (Algorithm 2 line 47) and there can be only a finite number of items preceding it on that list,

•

the $\mathsf{toBeRolledBack}$ list cannot grow indefinitely without executing some of the requests from the $\mathsf{toBeExecuted}$ list, which means that $r$ is eventually executed (Algorithm 2 line 55),

•

and finally a request which is included in both the $\mathsf{committed}$ and $\mathsf{executed}$ lists is never part of the $\mathsf{outOfOrder}$ list (Algorithm 2 line 45), which means it will not be scheduled for rollback.

Weak operations execute atomically in the invoke code block where the response is always returned immediately to the client (footnote 12: if, due to operation reexecutions, multiple responses are returned to the client, we discard the additional ones). For a given weak event $e$ the response is computed on the $\mathsf{state}$ object in some state $S_{\alpha}$ , where $\alpha$ is the current trace of the $\mathsf{state}$ object at the time of the operation’s invocation. We let $\mathsf{trace}(e)$ denote the trace $\alpha$ .

On the other hand, strong operations follow a more complicated route. For a strong event $e$ : firstly the $\mathrm{COMMIT}$ message is $\mathrm{TOB{\text{-}}cast}$ , then upon its $\mathrm{TOB{\text{-}}deliver}$ y the request $r=\mathsf{req}(e)$ is committed. Since $r$ is not disseminated using $\mathrm{RB{\text{-}}cast}$ , it is never included in the $\mathsf{tentative}$ list, and so it executes for the first time after its commit. Thus, each strong operation is executed on each replica exactly once, on a $\mathsf{state}$ object in some state $S_{\alpha}$ , where $\alpha$ is the current trace of the $\mathsf{state}$ object at the time of the execution. Note that the trace $\alpha$ is exactly the same on each replica and it consists exactly of all the requests preceding $\mathsf{req}(e)$ in the $\mathsf{committed}$ list (which due to the properties of $\mathrm{TOB{\text{-}}deliver}$ y has the same value on each replica upon $r$ ’s commit). Again, as in case of weak events, we let $\mathsf{trace}(e)$ denote the trace $\alpha$ .

Note that each strong operation executed in some event $e$ finishes only after the replica $\mathrm{TOB{\text{-}}deliver}$ s the message $m=\mathsf{msg}_{\mathsf{TOB}}(e)$ . It means that for every operation executed in event $e^{\prime}$ , such that $e\xrightarrow{\mathsf{rb}}e^{\prime}$ , if $\mathsf{msg}_{\mathsf{TOB}}(e^{\prime})=m^{\prime}$ ( $m^{\prime}\neq\bot$ ), then $\mathsf{tobNo}(m)<\mathsf{tobNo}(m^{\prime})$ .

Arbitration. We construct the total order relation $\mathsf{ar}$ by sorting all shared events based on the order in which their respective $\mathrm{TOB{\text{-}}cast}$ messages are $\mathrm{TOB{\text{-}}deliver}$ ed, i.e., respecting the $\mathsf{tobNo}$ order.

Next, we interleave the shared events with local events in the following way: each local event $e$ occurs in $\mathsf{ar}$ after the last shared event $e^{\prime}$ such that $e\not\xrightarrow{\mathsf{rb}}e^{\prime}$ . Thus, for each shared event $e^{\prime}$ the following holds $e\xrightarrow{\mathsf{ar}}e^{\prime}\Rightarrow e\xrightarrow{\mathsf{rb}}e^{\prime}$ . The relative order of local operations is irrelevant.

We construct the perceived arbitration order $\mathsf{par}(e)$ , for each event $e$ , using the trace $\alpha=\mathsf{trace}(e)$ . More precisely, we add all the events whose requests appear in $\alpha$ in the order of occurrence; next, we add all the remaining shared events according to their order in $\mathsf{ar}$ . Finally, we interleave the constructed sequence with local events in a similar way as in case of $\mathsf{ar}$ , i.e., for each local event $f$ and each shared event $g$ , the following holds: $f\xrightarrow{\mathsf{par}(e)}g\Rightarrow f\xrightarrow{\mathsf{rb}}g$ .

Note that for a strong event $e$ , $\mathsf{par}(e)=\mathsf{ar}$ . This is because $e$ executes once $\mathsf{req}(e)$ is on the $\mathsf{committed}$ list, and its position on the list is determined by the $\mathsf{tobNo}$ order, which means that the trace $\alpha$ contains exactly all the shared events preceding $e$ in $\mathsf{ar}$ .
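The shared-event part of the $\mathsf{par}(e)$ construction can be sketched as follows. The sketch is an illustration under simplifying assumptions: events are identified by hypothetical names, the trace is given directly as a list of event names rather than $\mathrm{Req}$ records, and the interleaving with local (weak RO) events is omitted.

```python
def perceived_order(trace, ar_shared):
    """Shared events of par(e): first the events of trace(e) in trace order,
    then all remaining shared events in their ar order."""
    in_trace = set(trace)
    return list(trace) + [x for x in ar_shared if x not in in_trace]


ar_shared = ["a", "b", "c", "d"]     # shared events in the tobNo order

# A weak event may observe a trace that temporarily disagrees with ar.
par_weak = perceived_order(["b", "a"], ar_shared)

# A strong event c executes on a trace equal to its prefix of ar.
par_strong = perceived_order(["a", "b"], ar_shared)
```

In the second call the trace is a prefix of `ar_shared`, so the perceived order coincides with $\mathsf{ar}$ on shared events, in line with the observation that $\mathsf{par}(e)=\mathsf{ar}$ for strong events.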

Visibility. For any two events $e,e^{\prime}\in E$ , such that $\mathsf{trace}(e^{\prime})=\alpha$ , we include an edge $e\xrightarrow{\mathsf{vis}}e^{\prime}$ in our construction of $\mathsf{vis}$ , if:

1. $e\in\Psi$ , $e^{\prime}\in\Psi_{s}$ , and $\mathsf{req}(e)\in\alpha$ ,

2. $e\in\Psi_{s}$ , $e^{\prime}\in\Psi_{w}$ , and $\mathsf{req}(e)\in\alpha$ ,

3. $e\in\Psi_{s}$ , $e^{\prime}\in\Omega$ , and $\mathsf{req}(e)\in\alpha$ ,

4. $e\in\Psi_{w}$ , $e^{\prime}\in\Psi_{w}\cup\Omega$ , and $\mathsf{req}(e)\in\alpha$ ,

5. $e,e^{\prime}\in\Omega$ , and $e\xrightarrow{\mathsf{rb}}e^{\prime}$ ,

6. $e\in\Omega$ , $e^{\prime}\in\Psi$ , and $e\xrightarrow{\mathsf{ar}}e^{\prime}$ .

The edges 1-4 are essential, while the edges 5-6 are non-essential.

Note that in case of edge 4, either $e\xrightarrow{\mathsf{\mathsf{TOBdel}}}e^{\prime}$ , or $e\xrightarrow{\mathsf{\mathsf{RBdel}}}e^{\prime}$ , and thus $e\xrightarrow{\mathsf{rb}}e^{\prime}$ is implied (see the general observations in Section A.5). Thus, for all edges 4-6, $e\xrightarrow{\mathsf{rb}}e^{\prime}$ .

Additionally, observe that in case of edge 1, $e\xrightarrow{\mathsf{\mathsf{TOBdel}}}e^{\prime}$ , because $\alpha$ contains only requests on the $\mathsf{committed}$ list (see the additional observations in the beginning of the proof), and thus $e\xrightarrow{\mathsf{\mathsf{tobNo}}}e^{\prime}$ . Similarly, in case of edges 2 and 3, $e\xrightarrow{\mathsf{\mathsf{TOBdel}}}e^{\prime}$ , because $\mathsf{msg}_{\mathsf{RB}}(e)=\bot$ and thus $\mathsf{req}(e)$ can appear in $\alpha$ only if it was $\mathrm{TOB{\text{-}}deliver}$ ed by the replica executing $e^{\prime}$ . Also in case of edge 2, $e\xrightarrow{\mathsf{\mathsf{tobNo}}}e^{\prime}$ .

Having defined $A$ (through $\mathsf{vis}$ , $\mathsf{ar}$ and $\mathsf{par}$ ), it now remains to show that $A\models\textsc{FEC}(\mathsf{weak},\mathcal{F})\wedge\textsc{Lin}(\mathsf{strong},\mathcal{F})$ , or more specifically $A\models\textsc{EV}(\mathsf{weak})\wedge\textsc{EV}(\mathsf{strong})\wedge\textsc{NCC}(\mathsf{weak})\wedge\textsc{NCC}(\mathsf{strong})\wedge\textsc{FRVal}(\mathsf{weak})\wedge\textsc{RVal}(\mathsf{strong})\wedge\textsc{CPar}(\mathsf{weak})\wedge\textsc{Sin\-Ord}(\mathsf{strong})\wedge\textsc{RT}(\mathsf{strong})$ .

Eventual visibility. We prove now that eventual visibility is satisfied for all events:

• each shared event $e$ is visible to all subsequent events from some point, because $\mathsf{msg}_{\mathsf{TOB}}(e)$ is eventually $\mathrm{TOB{\text{-}}deliver}$ed and $r=\mathsf{req}(e)$ is placed on the $\mathsf{committed}$ list on each replica, thus $r$ is eventually executed and never rolled back, and is included in the trace of the $\mathsf{state}$ object from some point (1, 2, 3 and 4),

• each local event $e$ is visible to all subsequent local events from some point (5),

• each local event $e$ is visible to all subsequent shared events from some point, because by construction of $\mathsf{ar}$ there is only a finite number of events $e^{\prime}$ such that $e\not\xrightarrow{\mathsf{ar}}e^{\prime}$ (6).

No circular causality. We need to show that $\mathsf{acyclic}(\mathsf{hb}\cap(W\times W))$ and $\mathsf{acyclic}(\mathsf{hb}\cap(S\times S))$ , where $W\subseteq E,S\subseteq E$ , are, respectively, the sets of all weak, and strong events. We elect to prove a more general case of $\mathsf{acyclic}(\mathsf{hb})$ .

Recall that $\mathsf{hb}=(\mathsf{vis}\cup\mathsf{so})^{+}$. If $\mathsf{acyclic}(\mathsf{vis}\cup\mathsf{so})$, then $\mathsf{acyclic}(\mathsf{hb})$, because transitive edges cannot introduce cycles. Thus, we have seven types of edges to consider: edges 1-6 from $\mathsf{vis}$ and the seventh edge $e\xrightarrow{\mathsf{so}}e^{\prime}$. We divide them into two groups: the first one consists of edges 1-3, while the second one consists of edges 4-7. Note that for the second group $e\xrightarrow{\mathsf{rb}}e^{\prime}$ always holds.

There can be no cycles when we restrict the edges only to the ones from the first group, as the edges 1 and 2 are constrained by the $\mathsf{tobNo}$ order, and edge 3 leads to a local event which cannot be followed using only edges from the first group.

Also, there can be no cycles when we restrict the edges only to the ones from the second group, as all the edges are constrained by the $\mathsf{rb}$ relation, which is naturally acyclic.

Thus, a potential cycle could only form when we mix edges from both groups. Let us assume that the cycle contains the following chain of edges: $a\xrightarrow{\mathsf{\mathsf{hb}}}b\xrightarrow{\mathsf{\mathsf{hb}}}...\xrightarrow{\mathsf{\mathsf{hb}}}c\xrightarrow{\mathsf{\mathsf{hb}}}...\xrightarrow{\mathsf{\mathsf{hb}}}d$ , where $a,b,c,d\in E$ , all the edges between $b$ and $c$ belong to the second group, while the other ones belong to the first group. Notice that $b\xrightarrow{\mathsf{rb}}c$ , and that $a,c\in\Psi$ . Thus, the chain consists of a series of edges from the first group and a series of edges from the second group. The whole cycle can be combined from multiple such chains, but for simplicity, let us assume that it contains only one such chain and that $d=a$ (the same reasoning as below can be applied iteratively for multiple interleavings of edges from the two groups).

If $b\in\Psi$ , then $a\xrightarrow{\mathsf{\mathsf{tobNo}}}b$ (edges 1 and 2), and since $b\xrightarrow{\mathsf{rb}}c$ , also $b\xrightarrow{\mathsf{\mathsf{tobNo}}}c$ (see the additional observations in the beginning of the proof). A contradiction: $a\xrightarrow{\mathsf{\mathsf{tobNo}}}b\xrightarrow{\mathsf{\mathsf{tobNo}}}c\xrightarrow{\mathsf{\mathsf{tobNo}}}a$ .

If $b\in\Omega$ , then $a\in\Psi_{s}$ , and $a\xrightarrow{\mathsf{\mathsf{TOBdel}}}b$ (edge 3). Either $a\xrightarrow{\mathsf{\mathsf{tobNo}}}c$ , or $c\xrightarrow{\mathsf{\mathsf{tobNo}}}a$ . In the former case we end up with a similar contradiction as above: $a\xrightarrow{\mathsf{\mathsf{tobNo}}}c\xrightarrow{\mathsf{\mathsf{tobNo}}}a$ . In the latter case, since $c\xrightarrow{\mathsf{\mathsf{tobNo}}}a$ , also $c\xrightarrow{\mathsf{\mathsf{TOBdel}}}b$ (the message $\mathsf{msg}_{\mathsf{TOB}}(c)$ is $\mathrm{TOB{\text{-}}deliver}$ ed before the message $\mathsf{msg}_{\mathsf{TOB}}(a)$ ). However, $b\xrightarrow{\mathsf{rb}}c$ , which means that the message $\mathsf{msg}_{\mathsf{TOB}}(c)$ was not even $\mathrm{TOB{\text{-}}cast}$ yet when $b$ executed. A contradiction.
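The property established above, $\mathsf{acyclic}(\mathsf{vis}\cup\mathsf{so})$, can also be checked mechanically on any finite fragment of an abstract execution. The following is a minimal sketch using Kahn's topological-sort algorithm; representing $\mathsf{vis}$ and $\mathsf{so}$ as Python sets of pairs is an assumption made for this illustration.

```python
from typing import Hashable, Iterable, Tuple

Edge = Tuple[Hashable, Hashable]


def is_acyclic(edges: Iterable[Edge]) -> bool:
    """Kahn's algorithm: the relation is acyclic iff every node can be
    removed by repeatedly deleting nodes with no incoming edges."""
    edge_set = set(edges)
    nodes = {n for edge in edge_set for n in edge}
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for a, b in edge_set:
        successors[a].append(b)
        indegree[b] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while ready:
        n = ready.pop()
        removed += 1
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                ready.append(m)
    return removed == len(nodes)


def hb_is_acyclic(vis: Iterable[Edge], so: Iterable[Edge]) -> bool:
    """hb = (vis U so)+ is acyclic iff vis U so is acyclic."""
    return is_acyclic(set(vis) | set(so))
```

Checking the union rather than its transitive closure mirrors the observation in the proof that transitive edges cannot introduce new cycles.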

Single order. Since there are no pending strong operations (because eventually every message is $\mathrm{TOB{\text{-}}deliver}$ ed and the operations finish), we have to simply prove that $\mathsf{vis}\cap(E\times S)=\mathsf{ar}\cap(E\times S)$ , where $S=\{e:\mathsf{lvl}(e)=\mathsf{strong}\}$ . In other words, for any two events $e\in E,e^{\prime}\in S$ : $e\xrightarrow{\mathsf{vis}}e^{\prime}\Leftrightarrow e\xrightarrow{\mathsf{ar}}e^{\prime}$ .

Let us begin with $e\xrightarrow{\mathsf{vis}}e^{\prime}\Rightarrow e\xrightarrow{\mathsf{ar}}e^{\prime}$ . Either $e\in\Psi$ , and thus $e\xrightarrow{\mathsf{\mathsf{tobNo}}}e^{\prime}$ (edge 1), or $e\in\Omega$ (edge 6). In both cases $e\xrightarrow{\mathsf{ar}}e^{\prime}$ .

Now let us consider $e\xrightarrow{\mathsf{ar}}e^{\prime}\Rightarrow e\xrightarrow{\mathsf{vis}}e^{\prime}$ . Either $e\in\Psi$ , or $e\in\Omega$ . In the former case, $e\xrightarrow{\mathsf{\mathsf{tobNo}}}e^{\prime}$ , and thus $e$ must be included in $\mathsf{trace}(e^{\prime})$ , which means that $e\xrightarrow{\mathsf{vis}}e^{\prime}$ (edge 1). In the latter case, $e\xrightarrow{\mathsf{rb}}e^{\prime}$ (by construction of $\mathsf{ar}$ ), and thus also $e\xrightarrow{\mathsf{vis}}e^{\prime}$ (edge 6).
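For a finite fragment without pending operations, the equality $\mathsf{vis}\cap(E\times S)=\mathsf{ar}\cap(E\times S)$ can be tested directly. The sketch below assumes that $\mathsf{vis}$ is given as a set of pairs and $\mathsf{ar}$ as a list in arbitration order; both representations and all function names are illustrative only.

```python
from itertools import combinations
from typing import Hashable, Iterable, Sequence, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def order_pairs(sequence: Sequence[Hashable]) -> Set[Edge]:
    """All pairs (a, b) such that a precedes b in the given total order."""
    return set(combinations(sequence, 2))


def single_order_holds(vis: Iterable[Edge],
                       ar_sequence: Sequence[Hashable],
                       strong: Set[Hashable]) -> bool:
    """Simplified check (no pending operations): edges into strong events
    are the same in vis and in ar."""
    vis_to_strong = {(a, b) for (a, b) in vis if b in strong}
    ar_to_strong = {(a, b) for (a, b) in order_pairs(ar_sequence)
                    if b in strong}
    return vis_to_strong == ar_to_strong
```

The two directions of the equivalence proved above correspond to the two set inclusions that the final comparison tests at once.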

Return value consistency. For a strong event $e$, $\mathsf{par}(e)=\mathsf{ar}$, and hence $\mathsf{fcontext}(A,e)=\mathsf{context}(A,e)$. Thus, for each event $e\in E$, it suffices to show that $\mathsf{rval}(e)=\mathcal{F}(\mathsf{op}(e),\mathsf{fcontext}(A,e))$. We base our reasoning below on the essential $\mathsf{vis}$ edges and the $\mathsf{par}(e)$ order.

Firstly, observe that we can exclude from $\mathsf{fcontext}(A,e)$ all local events which by the definition of an RO operation are irrelevant for the computation of $\mathcal{F}$ . Thus, let $C=(E_{C},\mathsf{op},\mathsf{vis},\mathsf{par}(e))$ , where $E_{C}=\{e^{\prime}\in\Psi:e^{\prime}\xrightarrow{\mathsf{vis}}e\}$ .

Then, recall that $\mathsf{rval}(e)$ is obtained by calling $\mathsf{state}.\mathrm{execute}$ on the $\mathsf{state}$ object in state $S_{\alpha}$ , where $\alpha=\mathsf{trace}(e)$ , and that $\mathsf{rval}(e)=\mathcal{F}(\mathit{op},C_{\alpha})$ , where $C_{\alpha}=(E_{\alpha},\mathsf{op}_{\alpha},\mathsf{vis}_{\alpha},\mathsf{ar}_{\alpha})$ is a context constructed from $\alpha$ as defined in Section A.4. It suffices to show that the context $C$ is isomorphic with $C_{\alpha}$ , which we do below.

Clearly, by construction of $\mathsf{vis}$, if $e^{\prime}\xrightarrow{\mathsf{vis}}e$ and $e^{\prime}\in E_{C}$, then $\mathsf{req}(e^{\prime})\in\alpha$. Thus, $E_{\alpha}$ consists of the $\mathrm{Req}$ records of the events in $E_{C}$. By the way the $\mathrm{Req}$ records are constructed (Algorithm 4 line 11), for any given event $e\in E_{C}$, $\mathsf{op}_{\alpha}(\mathsf{req}(e))$ equals $\mathsf{op}(e)$. Also, for any two events $f,g\in E_{C}$, $f\xrightarrow{\mathsf{par}(e)}g\Leftrightarrow\mathsf{req}(f)\xrightarrow{\mathsf{ar}_{\alpha}}\mathsf{req}(g)$, which follows directly from the construction of $\mathsf{par}(e)$. It remains to show that for any two events $f,g\in E_{C}$, $f\xrightarrow{\mathsf{vis}}g\Leftrightarrow\mathsf{req}(f)\xrightarrow{\mathsf{vis}_{\alpha}}\mathsf{req}(g)$.

If $g\in\Psi_{w}$ and $f\xrightarrow{\mathsf{vis}}g$ , then $\mathsf{req}(f)\in\mathsf{trace}(g)$ , and thus $\mathsf{req}(f).\mathsf{id}\in\mathsf{req}(g).\mathsf{ctx}$ , which implies $\mathsf{req}(f)\xrightarrow{\mathsf{\mathsf{vis}_{\alpha}}}\mathsf{req}(g)$ .

If $g\in\Psi_{w}$ and $\mathsf{req}(f)\xrightarrow{\mathsf{\mathsf{vis}_{\alpha}}}\mathsf{req}(g)$ , then $\mathsf{req}(f).\mathsf{id}\in\mathsf{req}(g).\mathsf{ctx}$ , and thus $\mathsf{req}(f)\in\mathsf{trace}(g)$ , which implies $f\xrightarrow{\mathsf{vis}}g$ .

If $g\in\Psi_{s}$ and $f\xrightarrow{\mathsf{vis}}g$ , then $f\xrightarrow{\mathsf{ar}}g$ (by Single Order), and thus $f\xrightarrow{\mathsf{\mathsf{tobNo}}}g$ . Since $\mathsf{req}(g)$ is committed at the time of $e$ ’s execution ( $\mathsf{req}(g)\in\alpha$ and $\mathsf{lvl}(g)=\mathsf{strong}$ ), so is $\mathsf{req}(f)$ but its position on the $\mathsf{committed}$ list is earlier ( $f\xrightarrow{\mathsf{\mathsf{tobNo}}}g$ ). Because the order of requests in the trace is based on the $\mathsf{executed}$ list, whose order is consistent with the order of the $\mathsf{committed}$ list, $\mathsf{req}(f)$ precedes $\mathsf{req}(g)$ in $\alpha$ , which implies $\mathsf{req}(f)\xrightarrow{\mathsf{\mathsf{ar}_{\alpha}}}\mathsf{req}(g)$ . Then, by construction of $C_{\alpha}$ , $\mathsf{req}(f)\xrightarrow{\mathsf{\mathsf{vis}_{\alpha}}}\mathsf{req}(g)$ .

If $g\in\Psi_{s}$ and $\mathsf{req}(f)\xrightarrow{\mathsf{vis}_{\alpha}}\mathsf{req}(g)$, then $\mathsf{req}(f)\xrightarrow{\mathsf{ar}_{\alpha}}\mathsf{req}(g)$, and thus $\mathsf{req}(f)$ precedes $\mathsf{req}(g)$ in $\alpha$. Since $\mathsf{req}(g)$ is committed at the time of $e$'s execution, both $\mathsf{req}(f)$ and $\mathsf{req}(g)$ belong to the $\mathsf{committed}$ list during $e$'s execution, which implies that $f\xrightarrow{\mathsf{tobNo}}g$. Thus, $f\xrightarrow{\mathsf{ar}}g$, and by Single Order, $f\xrightarrow{\mathsf{vis}}g$.

Thus, $C$ is isomorphic with $C_{\alpha}$ .

Convergent perceived arbitration. We now show that for each event $e\in E$ there exists only a finite number of weak events $e^{\prime}$ such that the prefixes of $\mathsf{par}(e^{\prime})$ and $\mathsf{ar}$ up to the event $e$ differ, which is a sufficient condition to prove $\textsc{CPar}(\mathsf{weak})$.

If $e\in\Psi$ , then eventually on each replica $\mathsf{msg}_{\mathsf{TOB}}(e)$ is $\mathrm{TOB{\text{-}}deliver}$ ed, and $\mathsf{req}(e)$ is committed and executed. Thus, from some point, the trace of each subsequent event $e^{\prime}$ contains $\mathsf{req}(e)$ , preceded by requests of events $e^{\prime\prime}$ committed earlier, such that $e^{\prime\prime}\xrightarrow{\mathsf{\mathsf{tobNo}}}e$ . Both $\mathsf{ar}$ and $\mathsf{par}(e^{\prime})$ are constructed by first ordering shared events and then interleaving them with local events using the same procedure. In both $\mathsf{ar}$ and $\mathsf{par}(e^{\prime})$ , $e$ is preceded by the same shared events $e^{\prime\prime}$ , such that $e^{\prime\prime}\xrightarrow{\mathsf{\mathsf{tobNo}}}e$ . Then, it is also preceded by the same local events, which means the prefixes of $\mathsf{par}(e^{\prime})$ and $\mathsf{ar}$ up to $e$ are equal.

If $e\in\Omega$ , then eventually the requests of all shared events $e^{\prime\prime}$ , such that $e^{\prime\prime}\xrightarrow{\mathsf{ar}}e$ , are committed and executed on each replica. Then, from some point, the trace of each subsequent event $e^{\prime}$ contains the requests of events $e^{\prime\prime}$ , ordered by $\mathsf{tobNo}$ . Thus, $e$ is preceded in both $\mathsf{ar}$ and $\mathsf{par}(e^{\prime})$ by the same shared events $e^{\prime\prime}$ . Because both $\mathsf{ar}$ and $\mathsf{par}(e^{\prime})$ are interleaved with local events using the same procedure, $e$ is also preceded in both $\mathsf{ar}$ and $\mathsf{par}(e^{\prime})$ by the same local events, which means the prefixes of $\mathsf{par}(e^{\prime})$ and $\mathsf{ar}$ up to $e$ are equal.
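The sufficient condition used in both cases compares prefixes of two total orders. On a finite fragment it can be sketched as follows; representing $\mathsf{ar}$ as a list and $\mathsf{par}$ as a dictionary from observing events to lists is an assumption made for this illustration, as are the function names.

```python
from typing import Dict, Hashable, Iterable, List, Sequence


def prefix_up_to(order: Sequence[Hashable], e: Hashable) -> List[Hashable]:
    """The prefix of a total order up to and including the event e."""
    return list(order[: order.index(e) + 1])


def diverging_observers(ar: Sequence[Hashable],
                        par: Dict[Hashable, Sequence[Hashable]],
                        e: Hashable,
                        weak_events: Iterable[Hashable]) -> List[Hashable]:
    """Weak events e' for which the prefix of par(e') up to e differs
    from the prefix of ar up to e."""
    target = prefix_up_to(ar, e)
    return [w for w in weak_events if prefix_up_to(par[w], e) != target]
```

$\textsc{CPar}(\mathsf{weak})$ follows when, for every event $e$, the list returned by `diverging_observers` stays finite over the whole (possibly infinite) history. A finite fragment can only illustrate the comparison itself, not the limit.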

Real-time order. We need to show that arbitration order respects real-time order of strong operations, i.e., $\mathsf{rb}\cap(S\times S)\subseteq\mathsf{ar}$ , where $S=\{e:\mathsf{lvl}(e)=\mathsf{strong}\}$ . In other words, for any two $e,e^{\prime}\in S$ : $e\xrightarrow{\mathsf{rb}}e^{\prime}\Rightarrow e\xrightarrow{\mathsf{ar}}e^{\prime}$ .

Clearly, if $e\xrightarrow{\mathsf{rb}}e^{\prime}$ , then $e\xrightarrow{\mathsf{\mathsf{tobNo}}}e^{\prime}$ (see the additional observations in the beginning of the proof). Thus, $e\xrightarrow{\mathsf{ar}}e^{\prime}$ (by construction of $\mathsf{ar}$ ). ∎

Now, let us continue with the proof of the guarantees offered by AcuteBayou in the asynchronous runs.

See Theorem 4.

Proof.

To show that AcuteBayou cannot satisfy $\textsc{Lin}(\mathsf{strong},\mathcal{F})$ in asynchronous runs, it is sufficient to observe that, because some of the $\mathrm{TOB{\text{-}}cast}$ messages are never $\mathrm{TOB{\text{-}}deliver}$ed, some of the strong operations remain pending. A pending operation's return value equals $\nabla$, which is irreconcilable with the requirements of the predicate $\textsc{RVal}(\mathcal{F})$.

The proof regarding the guarantees of the weak operations is similar to the one for the stable runs, so we rely on it and focus only on the differences between stable and asynchronous runs that need to be addressed. For any given asynchronous run of AcuteBayou represented by a history $H=(E,\mathsf{op},\mathsf{rval},\mathsf{rb},\mathsf{ss},\mathsf{lvl})$, we have to find suitable $\mathsf{vis}$, $\mathsf{ar}$ and $\mathsf{par}$, such that $A=(H,\mathsf{vis},\mathsf{ar},\mathsf{par})$ satisfies $A\models\textsc{FEC}(\mathsf{weak},\mathcal{F})$.

Additional observations. The same observations apply as in case of stable runs, with the only distinction that some strong events $e$ remain pending due to the lack of $\mathrm{TOB{\text{-}}deliver}$ y of $\mathsf{msg}_{\mathsf{TOB}}(e)$ . In such cases $\mathsf{trace}(e)$ is undefined.

Now let us make one more observation: the request of a weak updating event $e$ whose $\mathsf{msg}_{\mathsf{TOB}}(e)$ is never $\mathrm{TOB{\text{-}}deliver}$ed, even though it never commits, eventually settles, i.e., it is eventually executed and is never rolled back after that execution. This is because after $r=\mathsf{req}(e)$ is $\mathrm{RB{\text{-}}deliver}$ed by each replica and placed on the $\mathsf{tentative}$ list, only a finite number of other requests can commit (due to the properties of TOB in asynchronous runs), and also only a finite number of other requests can have a lesser $\mathrm{Req}$ record (as defined by the operator $<$ in Algorithm 2) and thus precede $r$ in the $\mathsf{tentative}$ list (due to monotonically increasing clocks on each replica). Thus, once $r$ is placed on the $\mathsf{toBeExecuted}$ list, it eventually executes, and once executed, $r$ can be rolled back at most a finite number of times, either due to a commit of another request, or due to a lesser $\mathrm{Req}$ being inserted into the $\mathsf{tentative}$ list.

Arbitration. We construct the total order relation $\mathsf{ar}$ by sorting all shared events based on the order in which their respective $\mathrm{TOB{\text{-}}cast}$ messages are $\mathrm{TOB{\text{-}}deliver}$ ed, i.e., respecting the $\mathsf{tobNo}$ order. Shared events whose messages are not $\mathrm{TOB{\text{-}}deliver}$ ed are ordered after those whose messages are $\mathrm{TOB{\text{-}}deliver}$ ed, with weak updating events appearing first, ordered relatively based on their $\mathrm{Req}$ records, followed by pending strong events.
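The three-tier ordering of shared events described above can be expressed as a single sort key. The sketch below is illustrative only: the `Shared` record is hypothetical, and its `req_key` field is a stand-in for the order on $\mathrm{Req}$ records, whose actual definition (the operator $<$ in Algorithm 2) is not reproduced in this section. The text does not fix an order among pending strong events, so the tie-break by name in the last tier is an arbitrary choice.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Shared:
    name: str
    strong: bool
    tob_no: Optional[int]               # None: message never TOB-delivered
    req_key: Tuple[int, int] = (0, 0)   # hypothetical stand-in for Req order


def ar_key(e: Shared):
    """Tier 0: TOB-delivered events, by tobNo.
    Tier 1: undelivered weak updating events, by their Req order.
    Tier 2: pending strong events (arbitrary tie-break by name)."""
    if e.tob_no is not None:
        return (0, e.tob_no)
    if not e.strong:
        return (1, e.req_key)
    return (2, e.name)


def order_shared(events: List[Shared]) -> List[Shared]:
    return sorted(events, key=ar_key)
```

Because the first component of the key separates the tiers, the second components of different tiers (integers, tuples, strings) are never compared with each other.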

Next, we interleave the shared events with local events in the following way: each local event $e$ occurs in $\mathsf{ar}$ after the last non-pending shared event $e^{\prime}$ such that $e\not\xrightarrow{\mathsf{rb}}e^{\prime}$ . Thus, for each non-pending shared event $e^{\prime}$ the following holds $e\xrightarrow{\mathsf{ar}}e^{\prime}\Rightarrow e\xrightarrow{\mathsf{rb}}e^{\prime}$ . The relative order of local events is irrelevant.

We construct the perceived arbitration order $\mathsf{par}(e)$ for each event $e$ in the same way as in case of stable runs, i.e., using $\mathsf{trace}(e)$, the remaining shared events from $\mathsf{ar}$, and finally interleaving the constructed sequence with local events as in case of $\mathsf{ar}$ (so that for each local event $e^{\prime}$ and each non-pending shared event $e^{\prime\prime}$, the following holds: $e^{\prime}\xrightarrow{\mathsf{par}(e)}e^{\prime\prime}\Rightarrow e^{\prime}\xrightarrow{\mathsf{rb}}e^{\prime\prime}$).

For a pending strong event $e$ , which was not executed at all, we let $\mathsf{par}(e)=\mathsf{ar}$ .

Note that for a non-pending strong event $e$ , $\mathsf{par}(e)=\mathsf{ar}$ . This is because $e$ executes once $\mathsf{req}(e)$ is on the $\mathsf{committed}$ list, and its position on the list is determined by the $\mathsf{tobNo}$ order, which means that its trace will contain exactly all the shared events preceding $e$ in $\mathsf{ar}$ .

Visibility. We construct the visibility relation in the same way as in the stable runs case. However, we remove edges to and from pending strong events. Since pending operations do not provide a return value, no edge to a pending event is essential. Also, as we guarantee only eventual visibility for weak events, edges to strong events are not necessary to satisfy $\textsc{EV}(\mathsf{weak})$ . Moreover, edges from pending events are not needed either, because by definition a pending event is never followed in $\mathsf{rb}$ by any other event (which is a requirement to fail the test for EV). Again, for all edges 4-6, $e\xrightarrow{\mathsf{rb}}e^{\prime}$ .

Having defined $A$ (through $\mathsf{vis}$, $\mathsf{ar}$ and $\mathsf{par}$), it now remains to show that $A\models\textsc{FEC}(\mathsf{weak},\mathcal{F})$, or more specifically $A\models\textsc{EV}(\mathsf{weak})\wedge\textsc{NCC}(\mathsf{weak})\wedge\textsc{FRVal}(\mathsf{weak})\wedge\textsc{CPar}(\mathsf{weak})$.

Eventual visibility. We prove now that eventual visibility is satisfied for all weak events:

• each non-pending shared event $e$, such that $\mathsf{msg}_{\mathsf{TOB}}(e)$ is eventually $\mathrm{TOB{\text{-}}deliver}$ed on each replica, is visible to all subsequent non-pending events from some point, because $r=\mathsf{req}(e)$ is placed on the $\mathsf{committed}$ list on each replica, thus $r$ is eventually executed and never rolled back, and is included in the trace of the $\mathsf{state}$ object from some point (2, 3 and 4),

• each weak updating event $e$, such that $\mathsf{msg}_{\mathsf{TOB}}(e)$ is not eventually $\mathrm{TOB{\text{-}}deliver}$ed on each replica, is visible to all subsequent non-pending events from some point, because it settles (see the additional observations in the beginning of the proof) and is included in the trace of the $\mathsf{state}$ object on each replica from some point (4),

• each local event $e$ is visible to all subsequent local events from some point (5),

• each local event $e$ is visible to all subsequent non-pending shared events from some point, because by construction of $\mathsf{ar}$ there is only a finite number of events $e^{\prime}$ such that $e\not\xrightarrow{\mathsf{ar}}e^{\prime}$ (6).

No circular causality. We use exactly the same reasoning as in the stable runs case to show that $\mathsf{acyclic}(\mathsf{hb})$ holds true.

Return value consistency. Again, we use exactly the same reasoning as in the stable runs case to show that for each weak event $e\in E$: $\mathsf{rval}(e)=\mathcal{F}(\mathsf{op}(e),\mathsf{fcontext}(A,e))$. Although this time we only need to prove return value consistency for weak operations, it can be shown that it also holds for non-pending strong events.

Convergent perceived arbitration. We now show that for each non-pending event $e\in E$ there exists only a finite number of weak events $e^{\prime}$ such that the prefixes of $\mathsf{par}(e^{\prime})$ and $\mathsf{ar}$ up to the event $e$ differ, which is a sufficient condition to prove $\textsc{CPar}(\mathsf{weak})$. (We can exclude pending events, because according to the construction of $\mathsf{vis}$ they are not visible to any other event, and thus automatically satisfy the requirements of the CPar predicate.)

If $e\in\Psi$ and $\mathsf{msg}_{\mathsf{TOB}}(e)$ is eventually $\mathrm{TOB{\text{-}}deliver}$ ed, then the same logic can be applied as in case of stable runs to show that from some point for each subsequent event $e^{\prime}$ the prefixes of $\mathsf{par}(e^{\prime})$ and $\mathsf{ar}$ up to $e$ are equal.

If $e\in\Psi_{w}$ and $\mathsf{msg}_{\mathsf{TOB}}(e)$ is never $\mathrm{TOB{\text{-}}deliver}$ ed, then it eventually settles (see the additional observations in the beginning of the proof) and thus also the same logic can be applied as in case of stable runs, with the distinction that $e$ is preceded in $\mathsf{ar}$ and $\mathsf{par}(e^{\prime})$ not only by events $e^{\prime\prime}$ whose requests are committed, but also by events $e^{\prime\prime}$ , such that $\mathsf{req}(e^{\prime\prime})<\mathsf{req}(e)$ .

If $e\in\Omega$ , then eventually the requests of all shared events $e^{\prime\prime}$ , such that $e^{\prime\prime}\xrightarrow{\mathsf{ar}}e$ (none of which are pending by the construction of $\mathsf{ar}$ ), are either committed, or settled, and executed on each replica. Then, from some point, the trace of each subsequent event $e^{\prime}$ contains the requests of events $e^{\prime\prime}$ , ordered by both $\mathsf{tobNo}$ , and based on their $\mathrm{Req}$ records. Thus, $e$ is preceded in both $\mathsf{ar}$ and $\mathsf{par}(e^{\prime})$ by the same shared events $e^{\prime\prime}$ . Because both $\mathsf{ar}$ and $\mathsf{par}(e^{\prime})$ are interleaved with local events using the same procedure, $e$ is also preceded in both $\mathsf{ar}$ and $\mathsf{par}(e^{\prime})$ by the same local events, which means the prefixes of $\mathsf{par}(e^{\prime})$ and $\mathsf{ar}$ up to $e$ are equal. ∎

