On Mixing Eventual and Strong Consistency: Acute Cloud Types
Maciej Kokoci\'nski, Tadeusz Kobus, Pawe{\l} T. Wojciechowski

TL;DR
This paper introduces acute cloud types (ACTs), a formal model for distributed systems that combine eventual and strong consistency, highlighting unique phenomena and impossibility results in mixed-consistency systems.
Contribution
It formalizes ACTs, demonstrates their properties, and proves an impossibility result regarding operation reordering in mixed-consistency systems.
Findings
ACTs enable efficient quorum-based protocols like Paxos.
Temporary operation reordering can cause interim disagreements.
Strengthening semantics can weaken guarantees on eventual consistency.
Abstract
In this article we study the properties of distributed systems that mix eventual and strong consistency. We formalize such systems through acute cloud types (ACTs), abstractions similar to conflict-free replicated data types (CRDTs), which by default work in a highly available, eventually consistent fashion, but which also feature strongly consistent operations for tasks which require global agreement. Unlike other mixed-consistency solutions, ACTs can rely on efficient quorum-based protocols, such as Paxos. Hence, ACTs gracefully tolerate machine and network failures also for the strongly consistent operations. We formally study ACTs and demonstrate phenomena which are neither present in purely eventually consistent nor strongly consistent systems. In particular, we identify temporary operation reordering, which implies interim disagreement between replicas on the relative order in…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
On Mixing Eventual and Strong Consistency: Acute Cloud Types
Maciej Kokociński, Tadeusz Kobus, Paweł T. Wojciechowski The authors are with the Institute of Computing Science, Poznan University of Technology, 60-965 Poznań, Poland.
E-mail: {Maciej.Kokocinski,Tadeusz.Kobus,Pawel.T.Wojciechowski} @cs.put.edu.pl This work was supported by the Foundation for Polish Science, within the TEAM programme co-financed by the European Union under the European Regional Development Fund (grant No. POIR.04.04.00-00-5C5B/17-00). Kokociński and Kobus were also supported by the Polish National Science Centre (grant No. DEC-2012/07/B/ST6/01230) and partially by the internal funds of the Faculty of Computing, Poznan University of Technology.
Abstract
In this article we study the properties of distributed systems that mix eventual and strong consistency. We formalize such systems through acute cloud types (ACTs), abstractions similar to conflict-free replicated data types (CRDTs), which by default work in a highly available, eventually consistent fashion, but which also feature strongly consistent operations for tasks which require global agreement. Unlike other mixed-consistency solutions, ACTs can rely on efficient quorum-based protocols, such as Paxos. Hence, ACTs gracefully tolerate machine and network failures also for the strongly consistent operations. We formally study ACTs and demonstrate phenomena which are neither present in purely eventually consistent nor strongly consistent systems. In particular, we identify temporary operation reordering, which implies interim disagreement between replicas on the relative order in which the client requests were executed. When not handled carefully, this phenomenon may lead to undesired anomalies, including circular causality. We prove an impossibility result which states that temporary operation reordering is unavoidable in mixed-consistency systems with sufficiently complex semantics. Our result is startling, because it shows that apparent strengthening of the semantics of a system (by introducing strongly consistent operations to an eventually consistent system) results in the weakening of the guarantees on the eventually consistent operations.
Index Terms:
eventual consistency, mixed consistency, fault-tolerance, acute cloud types, ACT
\NewEnviron
myeq
[TABLE]
1 Introduction
The massive scalability and high availability of the complex (geo-replicated) distributed systems that power today’s Internet often hinges on the use of eventually consistent data stores. These systems extensively employ specialized data structures, e.g., last-write-wins registers (LWW-registers), multi-value registers (MVRs), observed-remove sets (OR-sets) or other conflict-free replicated data types (CRDTs) [1] [2] [3]. These data structures are replicated on multiple machines (replicas) and can be read or modified independently on each site without prior synchronization with other replicas. It means that replicas can promptly respond to the clients. The communication between the replicas happens solely using a gossip protocol. By design replicas are guaranteed to be able to converge to a single state, automatically resolving any inconsistencies between them.
Unfortunately, the semantics of such data structures are very limited. To provide high availability, low-latency responses and eventual state convergence, these data structures require either that all operations commute or that there exist commutative, associative, and idempotent procedures for merging replica states. This is why these mechanisms are not suitable for all use cases. For example, consider a simple non-negative integer counter. The addition operation can be trivially implemented in a conflict-free manner, as the addition operations are commutative. However, implementing the subtraction operation requires global agreement to ensure that the value of the counter never drops below 0. In a similar way, in an auction system concurrent bids can be considered independent operations and thus their execution does not need to be synchronized. However, the operation that closes the auction requires solving distributed consensus to select the single winning bid [4].
Due to the inherent shortcomings of CRDTs and solutions similar to them, recently there have been several attempts, both at academia (e.g., [5] [6] [7] [8] [9] [10] [11]) and in the industry (e.g., [12] [13] [14] [15]) to enrich the semantics of eventually consistent systems by allowing some operations to be performed with stronger consistency guarantees or by introducing (quasi) transactional support. Crucially, none of the mixed-consistency approaches we are aware of is flexible enough to: (a) account for very weak consistency models (weaker than causal consistency, which is known to be costly to achieve in practice [16]), (b) admit strongly consistent operations which do not require all replicas to be operational in order to complete (so to gracefully tolerate failures), and (c) provide clearly stated semantics that enables easy reasoning about the system-wide guarantees. The latter trait is especially important when the same data can be accessed at the same time both in a strongly and eventually consistent fashion, what is notoriously difficult to implement. For example, in Apache Cassandra using the light weight transactions on data that are accessed at the same time in the regular, eventually consistent fashion leads to undefined behaviour [17].
In this article we introduce acute cloud types (ACTs), a family of specialized mixed-consistency data structures designed primarily for high availability and low latency, but which also seamlessly integrate on-demand strongly consistent semantics. ACT feature two kinds of operations:
- •
weak operations–targeted for unconstrained scalability and low latency responses (as operations in CRDTs),
- •
strong operations–used when eventually consistent guarantees are insufficient, require consensus-based inter-replica synchronization prior to execution.
Weak operations are guaranteed to progress, and are handled in such a way that the replicas eventually converge to the same state within each network partition, even when strongly consistent operations cannot complete due to network and process failures. On the other hand, strong operations can provide guarantees even as strong as linearizability [18] wrt. the already completed strong operations and a precisely defined subset of completed weak operations. Crucially, strong operations are non-blocking: they can leverage efficient, quorum-based synchronization protocols, such as Paxos [19], and thus gracefully tolerate machine and network failures. Both weak and strong operations can be arbitrarily complex, but they must be deterministic.
Our approach is more robust than other mixed-consistency solutions. Most notably, unlike classic cloud types [20] and global sequence protocol (GSP) [21], ACTs are symmetrical in the sense that they do not assume the existence of a server or servers that mediate all communication between remote replicas. This has several advantages: a failure of a replica or a group of replicas cannot impede the ability of other ACT replicas to execute weak operations and propagate the resulting updates. Also, ACTs can better tolerate network splits by allowing the replicas in the minority partitions to execute weak operations and exchange resulting updates. Furthermore, unlike the RedBlue consistency model [6] and approaches similar to it (e.g., [22] [10] [11]), ACTs support consistency guarantees weaker than causal consistency, so account for a wider range of systems. Crucially, ACTs do not require all replicas to be operational in order for the strong operations to complete, contrary to the approaches mentioned above. This latter trait has been fundamental to the design of ACTs.
In order to provide an easy to understand yet flexible consistency model that allows weak operations to be executed in a highly available and scalable manner, we require that in any run of an ACT, logically, there always exists a single global order of all operations. During execution, strong operations are guaranteed to observe the prefix of up to their position in . A weak operation may observe a serialization of operations that diverges from , but only by a finite number of elements. Thus weak and strong operations are interconnected in a non-trivial way, which intuitively ensures write stabilization: once a strong operation, during its execution, observes some weak operations , in that order, all subsequent strong operations, and eventually all weak operations, will also observe , in that order. It is so even though weak operations never have to directly synchronize with strong operations (e.g., by blocking on the completion of strong operations).
We propose a framework that enables formal reasoning about ACTs and their guarantees. We express the dependencies between operations through the visibility and arbitration relations, similarly to [23], but we allow each operation to observe the arbitration in a temporarily inconsistent (but eventually convergent) form. In order to capture the unique properties of ACTs and write stabilization in particular, we define a novel correctness condition called fluctuating eventual consistency (FEC) that is strictly weaker than Burckhardt’s Basic Eventual Consistency (BEC) [7].
By formally specifying ACTs, we uncovered several interesting phenomena unique to mixed-consistency systems (they are never exhibited by popular NoSQL systems, which only guarantee eventual consistency, nor by strongly consistent solutions). Crucially, some ACTs exhibit a phenomenon that we call temporary operation reordering, which happens when the replicas temporarily disagree on the relative order in which the requests (modelled as operations) submitted to the system were executed. When not handled carefully, temporary operation reordering may lead to all kinds of undesired situations, e.g., circular causality among the responses observed by the clients. As we formally prove, temporary operation reordering is not present in all ACTs but in some cases cannot be avoided. This impossibility result is startling, because it shows that apparent strengthening of the semantics of a system (by introducing strong operations to an eventually-consistent system) results in the weakening of the guarantees on the eventually-consistent operations.
In order to illustrate our concepts and analysis, we present an ACT for a non-negative counter and also revisit Bayou [24], a seminal, always available, eventually consistent data store. Bayou combines timestamp-based eventual consistency [25] and serializability [26] by speculatively executing transactions submitted by clients and having a primary replica to periodically stabilize the transactions (establish the final transaction execution order). We show how Bayou can be improved to form a general-purpose ACT.
1.1 Contribution summary
We define acute cloud types, a family of specialized mixed-consistency data structures designed primarily for high availability and low latency, but which also seamlessly integrate on-demand strongly consistent semantics achieved through quorum-based consensus protocols. Weak and strong operations in ACTs are interconnected in a non-trivial way, which intuitively ensures write stabilization. 2. 2.
We identify a range of traits unique to some ACTs. Most importantly, we define temporary operation reordering, a situation in which there is an interim disagreement between replicas on the relative order in which the client requests were executed. 3. 3.
We propose a framework that enables formal reasoning about ACTs and their guarantees. In particular, our framework allows us to formalize temporary operation reordering and propose a correctness condition called fluctuating eventual consistency which adequately captures the guarantees provided by ACTs that exhibit this phenomenon. 4. 4.
We use our framework to prove a number of formal results regarding ACTs. Crucially, we show an impossibility result that states that temporary operation reordering is not present in all ACTs, but in some cases cannot be avoided. 5. 5.
We revisit the seminal Bayou system, study its consistency guarantees, and show how it can be improved to form a general-purpose ACT.
1.2 Article structure
The article is organized as follows. In Section 2 we explain ACTs through examples: an acute non-negative counter and adaptation of Bayou that forms a general-purpose ACT. We formally define ACTs in Section 3, and introduce the formal framework for reasoning about their correctness in Section 4. In Section 5 we define FEC, our new correctness criterion and prove the correctness of our example ACTs. Next, in Section 6, we give our impossibility result. We discuss related work in Section 7, and conclude in Section 8.
A brief announcement of this article appeared in [27].
2 Acute cloud types by examples
2.1 Acute non-negative counter
As we mentioned in Section 1, a non-negative integer counter cannot be implemented as a classic CRDT because the subtraction operation requires global coordination to ensure that the value of the counter never drops below 0. In Algorithm 1 we present an acute non-negative integer counter (ANNC), a simple ACT implementing such a counter. The (line 5) and (line 32) operations can be weak and thus always ensure low latency responses, whereas (line 11) must be a strong operation to ensure the semantics of a non-negative counter. The crux of ANNC lies in using two complementary protocols for exchanging updates (a gossip one and one that establishes the ultimate operation serialization), and calculating the state of the counter by liberally counting operations and conservatively counting the operations.
To track the execution of weak and strong operations, each ANNC replica maintains three variables (line 2): one for subtraction operations () and two for the addition operations ( and ). The replicas exchange the information about new ADD requests (weak updating operations) using a gossip protocol (modelled using reliable broadcast, RB [28]) as well as a protocol that involves inter-replica synchronization (modelled using total order broadcast, TOB [29], which can be efficiently implemented using quorum-based protocols, such as Paxos [19]; lines 9-10). The operation, which does not commute unlike the operation, solely uses TOB. Upon receipt of a SUBTRACT message, the subtract operation completes successfully only if we are certain that the value of the counter does not drop below 0, i.e., when the aggregated value of all confirmed addition operations () is greater or equal to the aggregated value of all subtract operations () increased by (lines 26-28).
We ensure that on any replica and for any ADD request , the event always happens before the event (lines 17–18 and 22–23). This way . Hence, we solely use as the approximation of the total value added to ANNC when calculating the return value for the operations.
Using a gossip protocol allows us to achieve propagation of weak updating operations within network partitions, when synchronization involving solving distributed consensus is not possible. On the other hand, when solving distributed consensus is possible, replicas can agree on the final order in which operations will be visible. This way weak operations and are highly available, i.e., they always execute in a constant number of steps and do not depend on waiting on communication with other replicas. Crucially, the return value of the operation always reflects all the operations performed locally and, eventually, all operations performed within the network partition to which the replica belongs, if such a partition exists. On the other hand, the strong operation is applied only if the replicas agree that it is safe to do so.
ANNC guarantees a property which is a conjunction of basic eventual consistency (BEC) [7] [23] for weak operations ( and ) and linearizability (Lin) [18] for strong operations (). We formalize BEC and Lin in Sections 5.2 and 5.5, and prove the correctness of ANNC in Section 5.6.
2.2 Bayou
Bayou was an experimental system, so was never optimized for performance. However, due to its unique approach to speculative execution of transactions and their later stabilization (establishing the final transaction execution order by a primary replica), examining Bayou allows us to discuss various problematic phenomena that stem from having both weak and strong semantics in a single system. We improve Bayou to form a general-purpose, albeit not performance-optimized ACT.
2.2.1 Protocol overview
Below we give a high-level description of the Bayou protocol. An interested reader may find a detailed description of Bayou (together with a pseudocode) in Appendix A.1.
In order to make our analysis more general, we abstract certain aspects of the original protocol. Crucially, we allow clients to submit to Bayou replicas deterministic, arbitrarily complex (also as complex as, e.g., SQL transactions) operations that can provide the clients with a return value. Each operation is either weak or strong, similarly to operations in ANNC. Any weak operation is non-blocking with respect to network communication, because it is executed locally without any coordination with other replicas, but its ultimate impact on the system’s state might differ from what the client can infer from the return value (if the stabilized execution happened differently than the speculative one). On the other hand, the return value of a strong operation results from a prior inter-replica synchronization and thus can be trusted to never change.
In Bayou, each server speculatively total-orders all received operations using a simple timestamp-based mechanism and without prior agreement with other servers. A unique timestamp is generated by the replica upon receipt of an operation from the client. An operation tagged with the timestamp is then sent to all other replicas using some gossip protocol. When a replica has a new operation ( was directly submitted by a client or it has been received from other replica), firstly the replica determines the suitable execution order for . If has the highest timestamp of all operations executed so far by the replica, it simply executes (and, if was submitted to the replica by a client and is a weak operation, the replica provides the client with a return value). Otherwise, the replica rolls back all operations that have higher timestamps than (starting from the one with the highest timestamp), executes and reexecutes the rolled back operations according to their timestamps. This way a single total order consistent with operation timestamps is always maintained by all replicas.
The above approach has two major downsides. The first one concerns the performance: every time a replica receives an operation with a relatively low timestamp (compared to the timestamps of the operations executed most recently), to maintain the correct execution order, many operations need to be rolled back and reexecuted. The second downside is related to the provided guarantees: a client that submitted an operation and already received a response can never be sure that there will be no other operation with a lower timestamp than , which will eventually cause to be reexecuted, thus producing possibly a different return value.
In order to mitigate the above two problems, one of the replicas, called the primary, periodically communicates to the other replicas the final operation execution order, which is a growing prefix of the operations already executed by the primary. Other replicas always honour the decision made by the primary, which may force them to adjust their local operation execution orders by rolling back and reexecuting some operations. When there are no major communication delays between replicas, the final operation execution order established by the primary does not deviate much from the order resulting from operation timestamps. Hence, replicas do not need to perform many rollbacks and reexecutions. Moreover, the operation values obtained during speculative execution are mostly correct, i.e., the same as the return values obtained during execution of the operations according to the final operation execution order. Once the final operation execution for some (weak or strong) is established, it will never be reexecuted again. If is a strong operation, it is now safe to provide the client with the return value.
Intuitively, the replicas converge to the same state, which is reflected by the prefix of operations established by the primary (called the list of operations) and the sequence of other operations ordered according to their timestamps (the list of operations). More precisely, when the stream of operations incoming to the system ceases and there are no network partitions (the replicas can reach with the primary), the lists at all replicas will be the same, whereas the lists will be empty. On the other hand, when there are partitions, some operations might not be successfully committed by the primary, but will be disseminated within a partition using a gossip protocol. Then all replicas within the same partition will have the same and (non-empty) lists.
2.2.2 Anomalies
Consider the example in Figure 1, which shows an execution of a three-replica Bayou system. Initially, replica executes updating operations and in order , which corresponds to ’s and ’s timestamps. This operation execution order is observed by the client who issues query . On the other hand, executes the operations according to the final execution order (), as established by the primary replica . Hence, the client who issued query observes a different execution order than the client who issued . Note that replicas execute the operations with a delay (e.g., due to CPU being busy) and that reexecutes the operations once it gets to know the final order.
Clearly, the clients that issued the operations can infer from the return values the order in which Bayou executed the operations. The observed operation execution orders differ between the clients accessing and . This kind of anomaly, which we call temporary operation reordering, cannot be avoided since Bayou uses two, inconsistent with each other, ways in which operations are ordered (the timestamp order and the order established by the primary, which may occasionally differ from the timestamp order). This behaviour is not present in strongly consistent systems, which ensure that a single global ordering of operation execution is always respected (e.g., [30] [31]). The majority of eventually consistent systems which trade consistency for high availability are also free of this anomaly, as they only use one method for ordering concurrent operations (e.g., [32] [7]), or support only commutative operations (as in strong eventual consistency [2], e.g. [3] [33]). There are also protocols that allow some operations to perceive the past events in different (but still legal) orders (e.g., [34] [35] [6]). But, unlike Bayou, they do not require the replicas to eventually agree on a single execution order for all operations. Interestingly, temporary operation reordering is not present in ANNC, because weak updating operations () commute and do not provide clients with the return values.
Bayou exhibits another anomaly, which comes as very non-intuitive, i.e., circular causality. By analysing the return values of queries and one may conclude that there is a circular dependency between and : depends on as evidenced by ’s response, while depends on as evidenced by ’s response (the cycle of causally related operations can contain more operations). Interestingly, as we show later, circular causality does not directly follow from temporary operation reordering but is rather a result of the way Bayou rolls back and reexecutes some operations.
In the original Bayou protocol, application-specific conflict detection and resolution is accomplished through the use of dependency checks and merge procedure mechanisms. Since we allow operations with arbitrary complex semantics, the dependency checks and the merge procedures can be emulated by the operations themselves, by simply incorporating if-else statements: the dependency check as the if condition, and the merge procedure in the else branch (as suggested in the original paper [24]). Hence, these mechanisms do not alleviate the anomalies outlined above.
2.2.3 Correctness guarantees
Because of the phenomena described above, the guarantees provided by Bayou cannot be formalized using the correctness criteria used for contemporary eventually consistent systems based on CRDTs. E.g., basic eventual consistency (BEC) by Burckhardt et al. [7] [23] (mentioned briefly when discussing ANNC’s guarantees) directly forbids circular causality (see Section 5.2 for definition of BEC). BEC also requires the relative order of any two operations, as perceived by the client, to be consistent and to never change. Similarly, strong eventual consistency (SEC) by Shapiro et al. [2] requires any two replicas that delivered the same updates to have equivalent states.111BEC can be seen as a refinement of SEC, which abstracts away from CRDTs implementation details and ensures that no return value is constructed out of thin air. Obviously, Bayou neither satisfies BEC nor SEC (as evidenced by Figure 1). On the other hand informal definitions of eventual consistency which admit temporal reordering, such as [25], involve only liveness guarantees, which is insufficient. Bayou fulfills the operational specification in [36]. However, we are interested in declarative specifications, similar in the style to popular consistency criteria, such as sequential consistency [37], or serializability [26], through which we can concisely define the behaviour of a wide class of systems. Hence we introduce a new correctness criterion, fluctuating eventual consistency (FEC), which can be viewed as a generalization of BEC (see Section 5.3 for definition). FEC relaxes BEC, so different operations can perceive different operation orders. However, we require that the different perceived operation orders converge to one final execution order. Hence, FEC is suitable for systems that feature temporary operation reordering.
Similarly to ANNC, Bayou also ensures linearizability for strong operations (a response of a strong operation always reflects the serial execution of all stabilized operations up to the point of ’s commit). In Section 5.6 we formally prove that the Bayou-derived general-purpose ACT satisfies the above correctness criteria.
In Appendix A.2, an interested reader may find a brief analysis of Bayou’s liveness guarantees.
2.2.4 Fault-tolerance
Bayou’s reliance on the primary means that it provides only limited fault-tolerance. Even though the primary may recover, when it is down, operations do not stabilize, and thus no strong operation can complete. Hence, the primary is the single point of failure. Alternatively, the primary could be replaced by a distributed commit protocol. If two-phase-commit (2PC) [38] is used, the phenomena illustrated in Figure 1 are not possible. However, in this approach, a failure of any replica blocks the execution of strong operations (in 2PC all the replicas need to be operational in order to reach distributed agreement). On the other hand, if a non-blocking commit protocol, e.g., one that utilizes a quorum-based implementation of TOB is used (as in ANNC), the system may stabilize operations despite (a limited number of) failures.222Sharded 2PC [39] can be considered non-blocking, if within each shard at least one process remains operational at all times. Then, in such a scheme not every process needs to be contacted to commit a transaction, thus it falls under the quorum-based category. As we prove later, ACTs (which do not depend on synchronous communication with all the replicas and thus can operate despite failures of some of them) with general-purpose semantics similar to Bayou, are necessarily prone to the phenomena described above.
2.2.5 The improved Bayou protocol
Bayou can be improved to make it more fault-tolerant and free of some of the phenomena described above.
Firstly, we use TOB in place of the primary to establish the final operation execution order. More precisely, each time a replica receives an operation from a client, it still disseminates using a gossip protocol (so it can reach at least all replicas within the same network partition) but it also broadcasts the operation using TOB (so in a similar way in which weak updating operations are handled in ANNC). Since TOB guarantees that all replicas deliver the same set of messages in the same order, all replicas will stabilize the same set of operations in the same order. As we argued earlier, TOB can be implemented in a way that avoids a single point of failure [19].
The second modification is aimed at eliminating circular causality in Bayou. To this end (1) strong operations are broadcast using TOB and never a gossip protocol, and (2) upon being submitted, a weak operation is executed immediately on the current state to produce the return value; if ’s timestamp is not the highest timestamp of all already executed operations, is then rolled back and eventually executed in the order consistent with its timestamp. In Appendix A.3 we formally prove that above changes to the protocol allow us to avoid circular causality.
The modified variant of Bayou does not ensure that subsequent operations invoked by the same client observe the effects of previous ones, even if they are issued on the same replica (the read-your-writes session guarantee [40]).333In the original Bayou system, clients were colocated with replicas and the read-your-writes guarantee was naturally provided. In our approach, such guarantees can be provided on the client side.
With the above modifications the improved Bayou protocol becomes the general-purpose ACT called AcuteBayou.
2.3 ANNC vs AcuteBayou
ANNC and AcuteBayou greatly differ in the offered semantics and complexity. Note that we could trivially implement a non-negative integer counter using AcuteBayou by executing each counter operation as a separate AcuteBayou operation, albeit such an implementation would be suboptimal: in some cases the operations would have to be rolled back and temporary operation reordering would be possible again. Still, we can consider AcuteBayou as a generic ACT, capable of executing any set of weak and strong operations.
Despite the many differences between ANNC and AcuteBayou, they share several design assumptions, which are common to all ACT implementations. Firstly, in order to facilitate high availability and low latency responses (which are essential in geo-replicated environments), frequently invoked operations should be defined as weak operations and replicas should process them similarly to operations in CRDTs (automatically resolve conflicts between concurrent updates; converge to the same state within a network partition). To enforce this behaviour without resorting to distributed agreement, we impose the same assumptions as Attiya et al. for highly available eventually consistent data stores in [33] (see also Section 3.3). Secondly, when weak consistency guarantees are insufficient, strong operations can be used. Strong operations use a global agreement protocol for inter-replica synchronization, e.g., TOB. We require that strong operations do not block the execution of weak operations and that they do not require all replicas to be operational at all times in order to complete (as in 2PC).
ACTs are meant to provide the programmer with a modular abstraction layer that handles all the complexities of replication, while enabling flexibility, high performance and clear mixed-consistency semantics. In the next section we specify ACTs formally.
3 Acute Cloud Types
3.1 Definition
An acute cloud type is an abstract data type, implemented as a replicated data structure, that offers a precisely defined set of operations, divided into two groups: weak and strong. The operations can be either updating or read-only (RO), and all operations are allowed to provide a return value (in Section 4 we show how the semantics of operations can be specified formally). We impose the following implementation restrictions over ACTs: invisible reads, input-driven processing, op-driven messages, highly available weak operations and non-blocking strong operations. The first four, are adapted from the definition of write-propagating data stores [33]. They guarantee genuine, low-latency, eventually-consistent processing for weak operations (as in, e.g., CRDTs [2]). The last restriction guarantees that strong operations are implemented using a non-blocking agreement protocol, instead of a non-fault-tolerant approach requiring all the replicas to be operational. In Sections 3.2 and 3.3 we formalize the system model and provide precise definitions of the implementation restrictions.
3.2 System model
3.2.1 Replicas and clients
We consider a system consisting of processes called replicas, which maintain full copies of an ACT444Partial replication is orthogonal to our work. We assume full replication for simplicity. and to which external clients submit requests in the form of operations to be executed. Each operation invoked by a client is marked either weak or strong. Replicas communicate with each other through message passing. We assume the availability of a gossip protocol, which is used when ordering constraints are not necessary, and some global agreement protocol, used for tasks that require solving distributed consensus. For simplicity, as in Algorithm 1, we formalize these protocols using reliable broadcast (RB) [28], and TOB, respectively. Replicas can implement point-to-point communication simply by ignoring messages for which they are not the intended recipient. We model replicas as deterministic state machines, which execute atomic steps in reaction to external events (e.g., operation invocation or message delivery), and can execute internal events (e.g., scheduled processing of rollbacks). A specific event is enabled on a replica, if its preconditions are met (e.g., an event is enabled on a replica , if was previously and has not yet delivered the message ). Replicas have access to a local clock, which advances monotonically, but we make no assumptions on the bound on clock drift between replicas.
We model crashed replicas as if they stopped all computation (or compute infinitely slowly). We say that a replica is faulty if it crashes (in an infinite execution it executes only a finite number of steps). Otherwise, it is correct.
3.2.2 Network properties
In a fully asynchronous system, a crashed replica is indistinguishable to its peers from a very slow one, and it is impossible to solve the distributed consensus problem [41]. Real distributed systems which exhibit some amount of synchrony can usually overcome this limitation. For example, in a quasi-synchronous model [42], the system is considered to be synchronous, but there exist a non-negligible probability that timing assumptions can be broken. We are interested in the behaviour of protocols, both in the fully asynchronous environment, when timing assumptions are consistently broken (e.g. because of prevalent network partitions), and in a stable one, when the minimal amount of synchrony is available so that consensus eventually terminates. Thus, we consider two kinds of runs: asynchronous runs and stable runs. Replicas are not aware which kind of a run they are currently executing. In stable runs, we augment the system with the failure detector (which is an abstraction for the synchronous aspects of the system). We do so implicitly by allowing the replicas to use TOB through the and primitives. Since, TOB is known to require a failure detector at least as strong as to terminate [43], we guarantee it achieves progress only in stable runs.
In both asynchronous and stable runs we guarantee the basic properties of reliable message passing [28], i.e.:
- •
if a message is ed, or ed, then it was, respectively, , or , by some replica,
- •
no message is ed, or ed, more than once by the same replica,
- •
if a correct replica s some message, then eventually it s it,
- •
if a correct replica s some message, then eventually all correct replicas it,
- •
if any (correct or faulty) replica s some message, then eventually all correct replicas, it,
- •
messages are ed by all replicas in the same total order.
We define as the sequence number of the event (among other events in the execution) on any replica (we leave it undefined, i.e., , if is never ed by any replica).
Solely in stable runs, we also guarantee the following:
- •
if a correct replica s some message, then eventually all correct replicas it.
- •
if a message was both and by some (correct or faulty) replica, and was ed by some correct replica, then eventually all correct replicas it.
The last guarantee is non-standard for a total-order broadcast, but could be easily emulated by the application itself. We include it to simplify presentation of certain algorithms, such as ANNC and AcuteBayou.
3.2.3 Fair executions
An execution is fair, if each replica, has a chance to execute its steps (all replicas execute infinitely many steps of each type of an enabled event, e.g., infinitely many events for infinitely many messages ).
We analyze the correctness of a protocol by evaluating a single arbitrary infinite fair execution of the protocol, similarly to [23] and [44]. If the execution satisfies the desired properties, then all the executions of the protocol (including finite ones and the ones featuring crashed replicas) satisfy all the safety aspects verified (nothing bad ever happens [45] [46]). Additionally, all fair executions of the protocol satisfy liveness aspects (something good eventually happens).
3.3 Implementation restrictions
Below we state the five rules that ACTs need to adhere to.
1. Invisible reads. Replicas do not change their state due to an invocation of a weak read-only operation. Formally, for each weak read-only operation invoked on a replica in state , the state of after a response for is returned is equal . Note that, the consequence of this is that weak read-only operations need to return a response to the client immediately in the invoke event, without executing any other steps. We allow strong read-only operations to change the state of a replica, because sometimes it is necessary to synchronize with other replicas, and the replica needs to note down that a response is pending.
2. Input-driven processing. Replicas execute a series of steps only in response to some external stimulus, e.g., an operation invocation or a received message. A state of a replica is passive if none of the internal events on the replica are enabled in . Initially each replica is in a passive state. An external event may bring a replica to an active state in which it has some internal events enabled. Then, after executing a finite number of internal events (when no new external events are executed), the replica enters a passive state. More formally, for each replica , we require that in a given execution, either there is only a finite number of internal events executed on , or there is an infinite number of external events executed on . We say that is passive, if it is in a passive state, otherwise it is active.
3. Op-driven messages. RB or TOB messages are only generated and sent as a result of some not read-only client operation, and not spontaneously or in response to a received message. More formally, a message can be or by a replica , if previously some not read-only operation was invoked on , and since then did not enter a passive state.
4. Highly available weak operations. Weak operations need to eventually return a response without communicating with other replicas. A weak operation may remain pending only if the execution is finite, and the executing replica remains active since the invocation of (in an infinite execution a pending weak operation is never allowed).
5. Non-blocking strong operations. Strong operations need to eventually return a response if a global agreement has been reached. More formally, for a strong operation invoked on a replica , let be the set of all messages by since the invocation of but before enters a passive state. Then, may remain pending only if:
- •
the execution is finite, and remains active since the invocation of , or remains active because of the delivery of any message , or
- •
there exists a message , which has not been ed by yet.
It means that in order to execute a strong operation replicas may synchronize by ing multiple messages, but once TOB completes, the response must be returned in a finite number of steps.
All the above requirements are commonly met by various eventually consistent data stores and CRDTs (when we consider them as ACTs with only weak operations and using our communication model555In case of geo-replicated systems which are weakly consistent between data centers, but feature state machine replication within a data center to simulate reliable processes, we can consider the whole data center as a single replica.), e.g., [47] [48] [1] [49] [50] [2] [51] [33] [44]. The restrictions 1–4 are inspired by the ones defined for write-propagating data stores [33], but modified appropriately to accommodate for the more complex nature of ACTs. In particular, we allow implementations which do not execute each invoked operation in one atomic step, but divide the execution between many internal steps (e.g., see the pseudocode of Bayou in Appendix A.1). On the other hand, the 5th requirement concerns strong operations, and so is specific for ACTs. As discussed at length in [33] [44], requirements 1–4 preclude implementations that offer stronger consistency guarantees but do not provide a real value to the programmer (and still fall short of the guarantees possible to ensure if global agreement can be reached). For example, without invisible reads, it is possible to propose an implementation of a register’s read operation, which returns the most up to date value written to the register only after it returned stale values to a similar call for a fixed number of times (even though the newest value was already available). On the other hand, with the above restrictions, it is still possible to attain causal consistency and variants of it, such as observable causal consistency [33].
4 Formal framework
Below we provide the formalism that allows us to reason about execution histories and correctness criteria. We extended the framework by Burckhardt et al. [7][23] (also used by several other researchers, e.g., [52] [33] [44] [53]).
4.1 Preliminaries
Relations: A binary relation over set is a subset . For , we use the notation to denote , and the notation to denote . We use the notation to denote the inverse relation, i.e. . Therefore, . Given two binary relations , over , we define the composition . We let be the identity relation over , i.e., . For , we let be the n-ary composition , with . We let and . For some subset , we define the restricted relation . In Figure 2 we summarize various properties of relations.
We define by the set of all sequences (words) containing only elements from the set . When not ambiguous we use to denote (i.e. when is not a binary relation).
Let be a function that encounts elements of a set that are in relation to element : . Thus, .
We also define two operators and . arranges in an ascending order the elements of set according to the total order . reduces sequence by one element at a time using the function and accumulator : {myeq} foldr(a_0, f, w) &= { a_0if w = ϵf(foldr(a_0, f, w’), b)if w = w’b
Event graphs: To reason about executions of a distributed system we encode the information about events that occur in the system and about various dependencies between them in the form of an event graph. An event graph is a tuple , where is a finite or countably infinite set of events drawn from universe , , and each is an attribute or a relation over . Vertices in represent events that occurred at some point during the execution and are interpreted as opaque identifiers. Attributes label vertices with information pertinent to the corresponding event, e.g., operation performed, or the value returned. All possible operations of all considered data types form the set. All possible return values of all operations form the set. Relations represent orderings or groupings of events, and thus can be understood as arcs or edges of the graph.
Event graphs are meant to carry information that is independent of the actual elements of chosen to represent the events (the attributes and relations in encode all relevant information regarding the execution). Let and be two event graphs. and are isomorphic, written , if (1) for all , and are of the same kind (attribute vs. relation) and (2) there exists a bijection such that for all , where is an attribute, and all , we have , and such that for all where is a relation, and all , we have .
4.2 Histories
We represent a high-level view of a system execution as a history. We omit implementation details such as message exchanges or internal steps executed by the replicas. We include only the observable behaviour of the system, as perceived by the clients through received responses. Formally, we define a history as an event graph , where:
- •
, specifies the operation invoked in a particular event, e.g., ,
- •
, specifies the value returned by the operation, e.g., , or , if the operation never returns ( is pending in ),
- •
, the returns-before relation, is a natural partial order over , which specifies the ordering of non-overlapping operations (one operation returns before the other starts, in real-time),
- •
, the same session relation, is an equivalence relation which groups events executed within the same session (the same client), and finally
- •
, specifies the consistency level demanded for the operation invoked in the event.
We consider only well-formed histories, for which the following holds:
- •
(a pending operation does not return),
- •
( is an interval order [54]),
- •
for each event and its session , the restriction is an enumeration (clients issue operations sequentially).
4.3 Abstract executions
In order to explain the history, i.e., the observed return values, and reason about the system properties, we need to extend the history with information about the abstract relationships between events. For strongly consistent systems typically we do so by finding a serialization [37] (an enumeration of all events) that satisfies certain criteria. For weaker consistency models, such as eventual consistency or causal consistency, it is more natural to reason about partial ordering of events. Hence, we resort to abstract executions.
An abstract execution is an event graph , such that:
- •
is some history ,
- •
is an acyclic and natural relation,
- •
is a total order relation, and
- •
is a function which returns a binary relation in .
For brevity, we often use a shorter notation and let . Just as serializations are used to explain and justify operations’ return values reported in a history, so are the visibility () and arbitration () relations. Perceived arbitration () is a function which is necessary to formalize temporary operation reordering.
Visibility () describes the relative influence of operation executions in a history on each others’ return values: if is visible to (denoted ), then the effect of is visible to the replica performing (and thus reflected in the ’s return value). Visibility often mirrors how updates propagate through the system, but it is not tied to the low-level phenomena, such as message delivery. It is an acyclic and natural relation, which may or may not be transitive. Two events are concurrent if they are not ordered by visibility.
Arbitration () is an additional ordering of events which is necessary in case of non-commutative operations. It describes how the effects of these operations should be applied. If is arbitrated before (denoted ), then is considered to have been executed earlier than . Arbitration is essential for resolving conflicts between concurrent events, but it is defined as a total-order over all operation executions in a history. It usually matches whatever conflict resolution scheme is used in an actual system, be it physical time-based timestamps, or logical clocks.
Perceived arbitration () describes the relative order of operation executions, as perceived by each operation ( defines the total order of all operations, as perceived by event ). If , then there is no temporary operation reordering in .
4.4 Correctness predicates
A consistency guarantee is a set of conditions on an abstract execution , which depend on the particulars of up to isomorphism. For brevity we usually omit the argument . We write if satisfies . More precisely: . A history is correct according to some consistency guarantee (written ) if it can be extended with some and relations to an abstract execution that satisfies . We say that a system is correct according to some consistency guarantee if all of its histories satisfy .
We say that a consistency guarantee is at least as strong as a consistency guarantee , denoted , if . If and then is stronger than , denoted . If and , then and are incomparable, denoted .
4.5 Replicated data type
In order to specify semantics of operations invoked by the clients on the replicas, we model the whole system as a single replicated object (as in case of Algorithm 1). Even though we use only a single object, this approach is general, as multiple objects can be viewed as a single instance of a more complicated type, e.g. multiple registers constitute a single key-value store. Defining the semantics of the replicated object through a sequential specification [18] is not sufficient for replicated objects which expose concurrency to the client, e.g. multi-value register (MVR) [2] or observed-remove set (OR-Set) [3]. Hence, we utilize replicated data types specification [48].
In this approach, the state on which an operation executes, called the operation context, is formalized by the event graph of the prior operations visible to . Formally, for any event in an abstract execution , the operation context of in is the event graph . Note that an operation context lacks return values, the returns-before relation, and the information about sessions. The set of previously invoked operations and their relative visibility and arbitration unambiguously defines the output of each operation. This brings us to the formal definition of a replicated data type.
A replicated data type is a function that, for each operation (where ) and operation context , defines the expected return value , such that does not depend on events, i.e., is the same for isomorphic contexts: for all , , . We say that is a read-only operation (denoted ), if and only if, for any operation , context and event , such that , , where . In other words, read-only operations can be excluded from any context , producing , and the result of any operation will not change.
In Figure 3 we give the specification of three replicated data types: (a multi-value register), (an append-only sequence), and (a non-negative counter). We use in the subsequent sections to illustrate various consistency models.
4.6 ACT specification
To accommodate for the mixed-consistency nature of ACTs we extend replicated data type specification with the information on supported consistency levels for a given operation. Thus, we define ACT specification as a pair , where is a replicated data type specification and is a function which specifies for each with which consistency levels it can be executed. We assume that clients follow this contract, and thus, when considering a history of an ACT compliant with the specification , we assume that for each .
Then, ANNC’s specification is is , where and .
5 Correctness guarantees
In this section we define various correctness guarantees for ACTs. We define them as conjunctions of several basic predicates. We start with two simple requirements that should naturally be present in any eventually consistent system. For the discussion below we assume some arbitrary abstract execution .
5.1 Key requirements for eventual consistency
The first requirement is the eventual visibility (EV) of events. EV requires that for any event in , there is only a finite number of events in that do not observe . Formally:
[TABLE]
Intuitively, EV implies progress in the system because replicas must synchronize and exchange knowledge about operations submitted to the system.
The second requirement concerns avoiding circular causality, as discussed in Section 2.2.2. To this end we define two auxiliary relations: session order and happens-before. The session order relation represents the order of operations in each session. The happens-before relation (a transitive closure of session order and visibility) allows us to express the causal dependency between events. Intuitively, if , then potentially depends on . We simply require no circular causality:
[TABLE]
In the following sections we add requirements on the return values of the operations in . Formalizing the properties of ACTs which, similarly to AcuteBayou, admit temporary operation reordering, requires a new approach. We start, however with the traditional one.
5.2 Basic Eventual Consistency
Intuitively, basic eventual consistency (BEC) [7] [23], in addition to EV and NCC, requires that the return value of each invoked operation can be explained using the specification of the replicated data type , what is formalized as follows:
[TABLE]
Then:
[TABLE]
An example abstract execution that satisfies is shown in Figure 4. In , firstly replicas and concurrently execute two operations, and then each replica executes an infinite number of operations. Consider the operations on : the first one observes only (which is in the operation context of ), whereas the second observes only . BEC admits this kind of execution, because it does not make any requirements in terms of session guarantees [40]. Eventually, both and become visible to all subsequent operations, thus satisfying EV.
By the definition of the function (Section 4.5), when satisfies , the return value of each operation is calculated according to the relation. It is then easy to see that there are executions of AcuteBayou (or other ACTs that admit temporary operation reordering) that do not satisfy . It is because weak operations (as shown in Section 2.2.2), might observe past operations in an order that differs from the final operation execution order (). Hence AcuteBayou does not satisfy for arbitrary . It still, though, could satisfy for a sufficiently simple , such as a conflict-free counter, in which all operations always commute (as opposed to ). It is so, because then, even if AcuteBayou reorders some operations internally, the final result never changes and thus the reordering cannot be observed by the clients.
5.3 Fluctuating Eventual Consistency
In order to admit temporary operation reordering, we give a slightly different definition of the function, in which the arbitration order fluctuates, i.e., it changes from one event to another. Let , which means that now we consider the operation execution order as perceived by , and not the final one. The definition of the fluctuating variant of RVal is straightforward:
[TABLE]
To define the fluctuating variant of BEC, that could be used to formalize the guarantees provided by ACTs we additionally must ensure, that the arbitration order perceived by events is not completely unrestricted, but that it gradually converges to for each subsequent event. It means that each can be temporarily observed by the subsequent events according to an order that differs from (but is consistent with ). However, from some moment on, every event will observe according to . To define this requirement, we use the function (defined in Section 4.1). Let . This intuition is formalized by convergent perceived arbitration:
[TABLE]
If satisfies CPar, then for each event , the set of events , which observe the position of not according to is finite. Thus, the position of in for subsequent events stabilizes, and eventually converges to .
Now we can define our new consistency criterion fluctuating eventual consistency (FEC):
[TABLE]
An example abstract execution that satisfies FEC is shown in Figure 4. In , replica temporarily observes the operations in the order which is different then the eventual operation execution order (as evidenced by the infinite number of operations). We call this behaviour fluctuation.
It is easy to see that , in the sense that: for each , , and for some , . It is so, because FEC uses instead of to calculate the return values of operation executions, but eventually converges to . Hence, is a special case of , when . It is easy to see that from Figure 4 satisfies both BEC and FEC, whereas satisfies only FEC.
5.4 Operation levels
The above definitions can be used to capture the guarantees provided by a wide variety of eventually consistent systems. However, our framework still lacks the capability to express the semantics of mixed-consistency systems. ACTs offer different guarantees for different classes of operations (e.g., consistency guarantees stronger than BEC or FEC are provided only for strong operations in AcuteBayou or ANNC). Hence, we need to parametrize the consistency criteria with a level attribute (as indicated by the function for each event). Since, consistency level is specified per operation invocation, we need to assure that the respective operations’ responses reflect the demanded consistency level.
Let us revisit BEC first. Let for a given . Then: {myeq} EV (l) =def& ∀e ∈E : —{ e’ ∈L : e rb→e’ ∧e /vis→e’ }— ¡ ∞
NCC (l) =def acyclic(hb∩(L ×L))
RVal (l, F) =def ∀e ∈L : rval(e) = F(op(e), context(A,e))
BEC (l, F) =def EV (l) ∧NCC (l) ∧RVal (l, F)
The above parametrized definition of BEC restricts the RVal predicate only to events issued with the given consistency level (the events that belong to the set ). It means that for any such event the response has to conform with the replicated data type specification , and with the and relations (as defined by the definition of the function). For all other events this requirement does not need to be satisfied, so they can return arbitrary responses (unless restricted by other predicates targeted for these events). Similarly, for EV and NCC, the predicates are restricted to affect only the events from the set . In case of EV, each event eventually becomes visible to the operations executed with the level . In case of NCC, there must be no cycles in involving events from the set .
The parametrized variant of FEC is formulated analogously. Let be as defined before, and for any event , let be the subset of events from which observe . Then: {myeq} FRVal (l, F) =def& ∀e ∈L : rval(e) = F(op(e), fcontext(A,e))
CPar (l) =def ∀e ∈E : —{e’ ∈L_e: rank(vis^-1(e’), par(e’), e)
≠rank(vis^-1(e’), ar, e)}— ¡ ∞
FEC (l, F) =def EV (l) ∧NCC (l) ∧FRVal (l, F) ∧CPar (l)
As before, we restrict the return values only for the events from the set . Additionally, we restrict the predicate CPar, so that converges towards only for events . Other events can differently perceive the arbitration of events (in principle, the observed arbitration can be completely different from the final one, specified by ).
We can compare the parametrized variants of BEC and FEC as before: .
All of the strong consistency criteria which we are going to discuss next, we define already in the parametrized form with the given level in mind, so they can be used for, e.g., strong operations in AcuteBayou and ANNC.
5.5 Strong consistency
A common feature of strong consistency criteria, such as sequential consistency [37], or linearizability [18], is a single global serialization of all operations. It means that a history satisfies these criteria, if it is equivalent to some serial execution (serialization) of all the operations. Additionally, depending on the particular criterion, the serialization must, e.g., respect program-order, or real-time order of operation executions. Although we provide a serialization of all operations (through the total order relation , which is part of every abstract execution), the equivalence of a history to the serialization is not enforced in the correctness criteria we have defined so far. For example, given a sequence of three events , such that , the response of according to BEC, does not need to reflect neither , nor , as they may simply be not visible to (). Thus, to guarantee conformance to a single global serialization, we must enforce that for any two events , (unless is pending, since a pending operation might be arbitrated before a completed one, yet still be not visible). We express this through the following predicate, single order: {myeq} SinOrd =def& ∃E’ ⊆rval^-1(∇) : vis= ar∖(E’ ×E)
SinOrd (l) =def ∃E’ ⊆rval^-1(∇) : vis_L = ar_L ∖(E’ ×E)
where vis_L = vis∩(E ×L) and ar_L = ar∩(E ×L)
In the parametrized form, the conformance to the serialization is required only for the events from the set (but the serialization includes all the events).
In order to capture the eventual stabilization of the operation execution order, which happens in AcuteBayou and in ACTs similar to it, we now define two additional correctness criteria that feature SinOrd.
Sequential consistency. Informally, sequential consistency (Seq) [37] guarantees that the system behaves as if all operations were executed sequentially, but in an order that respects the program order, i.e., the order in which operations were executed in each session. Hence, Seq implies together with SinOrd, and additionally, session arbitration (SessArb). SessArb simply requires that for any two events , if , then . In the parametrized form we are interested only in the guarantees for events in the set , and thus we use instead of (see Section 5.1). SinOrd together with SessArb imply NCC and EV [23], however this does not hold for the parametrized forms of these predicates. Thus, we define Seq by extending BEC (which explicitly includes EV and NCC): {myeq} SessArb (l) =def& so_L ⊆ar
Seq (l, F) =def SinOrd (l) ∧SessArb (l) ∧BEC (l, F)
An example abstract execution that satisfies Seq is shown in Figure 4. According to Seq, since the operations are arbitrated (as evidenced by any operation that observes both operations), any operation can either return or , a non-empty prefix of .
Linearizability. The correctness condition of linearizability (Lin) [18] is similar to Seq but instead of program order it enforces a stronger requirement called real-time order. Informally, a system that is linearizable guarantees that for any operation that starts (in real-time) after any operation ends, will observe the effects of . We formalize Lin using the real-time order (RT) predicate, that uses the relation in its parametrized form: {myeq} RT (l) =def& rb_L ⊆ar
Lin (l, F) =def SinOrd (l) ∧RT (l) ∧BEC (l, F)
Note that, Seq and Lin are incomparable in their parametrized forms. While requires any two events to be arbitrated according to real-time if they both belong to , enforces real-time only within the same session, but only one of the events needs to belong to . By using a stronger definition of with , we would force all operations to synchronize, which is incompatible with high availability of weak operations.
An example abstract execution that satisfies Lin is shown in Figure 4. According to Lin, since ended before started, the operations must be arbitrated (as evidenced by any operation that observes both operations). If some operation started after ended but executed concurrently with ( would start before ended), could return either or .
5.6 Correctness of ANNC and AcuteBayou
Having defined BEC, FEC and Lin, we show four formal results: two regarding ANNC and two regarding AcuteBayou. The proofs of all four theorems can be found in Appendix A.5.
As we have discussed in Section 3.2.2, we are interested in the behaviour of systems, both in fully asynchronous environment, when timing assumptions are consistently broken (e.g., because of prevalent network partitions), and in a stable one, when the minimal amount of synchrony is available so that consensus eventually terminates. Thus, we consider two kinds of runs: asynchronous and stable.
Theorem 1**.**
In stable runs ANNC satisfies .
Theorem 2**.**
In asynchronous runs ANNC satisfies and it does not satisfy .
ANNC does not guarantee in asynchronous runs, because strong operations in general (for arbitrary ) cannot be implemented without solving global agreement, and since in asynchronous runs TOB completion is not guaranteed, some of the operations may remain pending. It means that for some , such that , , even though it is not allowed by (recall from Section 3.2.3 that we consider fair executions).
By satisfying , we proved that temporary operation reordering is not possible in ANNC. As we discussed in Section 2.2.2, it is not the case for AcuteBayou. However, we can prove, that AcuteBayou satisfies our new correctness criterion (for arbitrary ).
Theorem 3**.**
In stable runs AcuteBayou satisfies for any arbitrary ACT specification .
Theorem 4**.**
In asynchronous runs AcuteBayou satisfies and it does not satisfy for any arbitrary ACT specification .
The observation that some undesired anomalies are not inherent to all ACTs leads to interesting questions that we plan to investigate more closely in the future, e.g., what are the common characteristics of the replicated data types with mixed-consistency semantics that can be implemented as ACTs that are free of temporary operation reordering.
6 Impossibility
Now we proceed to our central contribution–we show that there exists an ACT specification for which it is impossible to propose an ACT implementation that avoids temporary operation reordering.
If a mixed-consistency ACT that implements some replicated data type could avoid temporary operation reordering, it would mean that it ensures BEC for weak operations and also provides at least some criterion based on SinOrd for strong operations (to ensure a global serialization of all operations). Hence we state our main theorem:
Theorem 5**.**
There exists an ACT specification , for which there does not exist an implementation that satisfies in stable runs, and in both asynchronous and stable runs.
To prove the theorem, we take (defined in Figure 3) as an example replicated data type specification . We consider an ACT specification, which features and operations in both consistency levels, , and . Thus, , where .
Let us begin with an observation. Whenever any ACT implementation of that satisfies in asynchronous runs, executes a weak operation, it has to some message . Since the implementation satisfies EV (through ) we know that all replicas have to be informed about the invocation of . The replica executing the operation may not postpone sending the message until some other invocation happens, because all the subsequent operation invocations on the replica may be operations, which do not grant the replica the right to send messages (e.g., RO operations, by the invisible reads requirement). Moreover, the replica may not depend on messages, because in asynchronous runs they are not guaranteed to be delivered to other replicas.666A replica may some messages due to the invocation of a weak operation, but its correctness cannot depend on their delivery. Thus, a message must be . Since replicas cannot distinguish between asynchronous and stable runs, the same observation also holds for stable runs. We utilize this fact in our proof by considering asynchronous and stable executions and establishing certain invariants which need to hold in both kinds of runs.
We conduct the proof by contradiction using a specially constructed execution, in which a replica that executes a strong operation has to return a value without consulting all replicas. Thus, we consider an ACT implementation of that satisfies in asynchronous runs, and in both the asynchronous and stable runs.
Proof.
We give a proof for a two-replica system and then show how it can be easily to a system with replicas.
We begin with an empty execution represented by a history , which we will extend in subsequent steps. Initially replicas and are separated by a temporary network partition, which means that the messages broadcast by the replicas do not propagate (however, eventually they will be delivered once the partition heals). A weak operation is invoked on in the event and a weak operation is invoked on in the event . By input-driven processing and highly available weak operations both replicas return responses for the operations and become passive afterwards. Let and denote the set of messages by, respectively, and , until this point. Let and denote the set of messages by, respectively, and , until this point. Neither s messages from the set , nor s messages from the set (due to the temporary network partition), but the replicas their own messages, and subsequently become passive (if or , then these messages remain pending).
Consider an alternative execution represented by history in which the network partition heals, and s all messages in the set , s all messages in the set , and then a weak operation is invoked on in the event and a weak operation is invoked on in the event . By invisible reads and highly available operations, both replicas remain passive and immediately return a response.
Claim 1**.**
, and or .
Proof.
We extend with infinitely many weak invocations on both and , in events , for . Similarly to and , the operations invoked in each return immediately and leave the replicas and in the unmodified passive state. Since none of the operations generate any new messages, represents a fair infinite execution that satisfies all network properties of an asynchronous run. Then, by our base assumption, there exists an abstract execution , such that .
Because each replica remains in the same state since the execution of and , respectively, each operation invoked in , returns the same response as or , depending on which replica the given event was executed. By , the two updating events and have to be both observed by infinitely many of the events. Let be one such event executed on and be one such event executed on , then . There is either: , or . Now, by the definition of read-only operations we can exclude the RO operations from the context of any operation without affecting the return value of all operations. Thus for some . Because of , . Therefore, all operations in return the same value , including the earliest ones and , which means that . By the definition of , either or (depending on whether , or ). ∎
Without loss of generality, let us assume that obtained in the history equals . Let us return to our main history . We extend it similarly to the way we extended , but we do not allow the network partition to heal completely. Instead, we just let to reach , which s them exactly as in . Since replicas are deterministic, the current state of must be the same as it was in during the execution of . Thus, similarly to , we invoke a weak operation on in an event , and .
Consider yet another execution represented by history which is obtained from our main execution by removing any steps executed by . The events executed on remain unchanged, since the two replicas were all the time separated by a network partition, and no messages from reached . We let the network partition heal. s messages from the set , both replicas messages from the set , and afterward both replicas become passive.
We now extend by infinitely many times applying the following procedure (for from to infinity):
invoke a strong on in the event , 2. 2.
let execute its steps until it becomes passive, 3. 3.
on both and , and all messages, respectively, or , by in step 2, 4. 4.
let both replicas reach a passive state, 5. 5.
invoke a weak on in the event .
The resulting execution is fair and satisfies all the network properties of a stable run. Note that the strong operations executed on are not restricted by invisible reads and thus may freely change the state of . Moreover, they can cause to and messages. On the other hand, the weak operations executed on are always executed on a passive state, and leave the replica in the same state. Moreover, does not , nor any messages. By non-blocking strong operations no strong operation may be pending in . This is so, because for each , by step 4, there is no pending message not yet ed on , and is in a passive state.
Claim 2**.**
There exists an event , with for some natural , such that .
Proof.
By our base assumption, there exists an abstract execution , such that . Then, for each , by , . Moreover, because of , needs to be observed from some point on by every . Thus, we let . Since is the only operation visible to (there are no other operations in ), by definition of , . ∎
Let us return to our main history . Note that, when we restrict and only to events on , constitutes a prefix of . Moreover, the state of at the end of is the same as in just before ing messages from the set (if any) and executing the first strong operation. We now extend by ing the messages from the set and then with steps executed on generated using the repeated procedure for , for from to . We can freely omit the steps executed on , since none of them influenced in any way ( did not deliver any message from ).777With a typical TOB implementation, it might be impossible for to its own messages without the votes of to reach a quorum. However, as we have discussed earlier, we abstract away from the implementation details of the TOB mechanism. Crucially, no information was transferred from to . Moreover, in a three replica system, could establish a majority with to finalize TOB. Thus, there exists an event executed on , an equivalent of the event from , such that , and .
Finally, we allow the network partition to heal. s the messages from the set , and s and s any outstanding messages generated by (naturally, s messages in the same order as did). Then, we let the replicas reach a passive state, and in order to make our constructed execution fair, we extend it with infinitely many weak operations as we did with . By our base assumption, there exists an abstract execution , such that . There are only two operations invoked in in the events and . Since (which we have established after the Claim 1), by and the definition of , it can be only that . We also know that ( is a strong operation executed on ), which means that . By , , and thus . Therefore, a cycle forms in the total order relation : , a contradiction. This concludes our proof for a system with two replicas.
We could easily extended our reasoning to account for any number of replicas : any additional replica performs an infinite number of read operations, in the same fashion as the replica or , depending on whether originally belonged to the same partition as or . ∎
Since from Theorem 5 we know that there exists an ACT specification for which we cannot propose (even a specialized) implementation that satisfies , we can formulate a more general result about generic ACTs:
Corollary 1**.**
There does not exist a generic implementation that for arbitrary ACT specification satisfies in stable runs, and both in asynchronous, and in stable runs.
Theorem 5 shows that it is impossible to devise a system similar to AcuteBayou (for arbitrary ) that never admits temporary operation reordering (so satisfies instead of ). Hence, admitting temporary operation reordering is the inherent cost of mixing eventual and strong consistency when we make no assumptions about the semantics of . Naturally, for certain replicated data types, such as , achieving both and is possible, as we show with ANNC.
In the next section we discuss several approaches that avoid temporary operation reordering, albeit at the cost of compromising fault-tolerance (e.g., by requiring all replicas to be operational), or sacrificing high availability (e.g., by forcing replicas to synchronize on weak operations).
7 Related work
7.1 *Symmetric
models with strong operations blocking upon a single crash*
We start with symmetric mixed-consistency models, in which all replicas can process both weak and strong operations and communicate directly with each other (thus enabling processing of weak operations within network partitions), but either do not enable fully-fledged strong operations (there is no stabilization of operation execution order) or require all replicas to synchronize in order for a strong operation to complete. In turn, the way these models bind the execution of weak and strong operations can be understood as an asymmetric (1–) variant of quorum-based synchronization. Hence, unlike in ACTs, strong operations cannot complete if even a single replica cannot respond (due to a machine or network failure), which is a major limitation.
Lazy Replication [22] features three operation levels: causal, forced (totally ordered with respect to one another) and immediate (totally ordered with respect to all other operations). In this approach, it is possible that two replicas execute a causal operation and a forced operation in different orders. Since is required to commute with , replicas will converge to the same state. However, the user is never certain that even after the completion of , on some other replica no weaker operation will be executed prior to . Hence the guarantees provided by forced operations are inadequate for certain use cases, which require write stabilization, e.g., an auction system [4] (see also Section 1). On the other hand, immediate operations offer stronger guarantees, but their implementation is based on three-phase commit [55], and thus requires all replicas to vote in order to proceed.
RedBlue consistency [6] extends Lazy Replication (with blue and red operations corresponding to the causal and forced ones), by allowing operations to be split into (side-effect free) generator and (globally commutative) shadow operations. This greatly increases the number of operations which commute, but red operations still do not guarantee write stabilization. To overcome this limitation, RedBlue consistency was extended with programmer-defined partial order restrictions over operations [11]. The proposed implementation, Olisipo, relies on a counter-based system to synchronize conflicting operations. Synchronization can be either symmetric (all potentially conflicting pairs of operations must synchronize, which means that weak operations are not highly available any more) or asymmetric (all replicas must be operational for strong operations to complete).
The formal framework of [10] can be used to express various consistency guarantees, including those of Lazy Replication and RedBlue consistency, but not as strong as, e.g., linearizability. Conflicts resulting from operations that do not commute are modelled through a set of tokens. On the other hand, in explicit consistency [9], stronger consistency guarantees are modelled through application-level invariants and can be achieved using multi-level locks (similar to readers-writer locks from shared memory).
All above models assume causal consistency (CC) as the base-line consistency criterion and thus do not account for weaker consistency guarantees, such as FEC or BEC, as our framework. CC is argued to be costly to ensure in real-life [16], which makes our approach more general.
Finally, the model in [7] is similar to ours but treats strong operations as fences (barriers). It means that all replicas must vote in order for a strong operation to complete.
Temporary operation reordering is not possible in the models discussed above. It is because they are either state-based (and thus their formalism abstracts away from the operation return values which clients observe and interpret) and feature no write stabilization, or they require all replicas to vote in order to process strong operations.
7.2 *Symmetric Bayou-like
models*
In Section 2 we have already discussed the relationship between the seminal Bayou protocol [24] and ACTs.
In eventually-serializable data service (ESDS) [36], operations are executed speculatively before they are stabilized, similarly to Bayou. However, ESDS additionally allows a programmer to attach to an operation an arbitrary causal context that must be satisfied before the operation is executed. Zeno [56] is similar to Bayou but has been designed to tolerate Byzantine failures.
All three systems (Bayou, ESDS, Zeno) are eventually consistent, but ensure that eventually there exists a single serialization of all operations, and the client may wait for a notification that certain operation was stabilized. Since these systems enable an execution of arbitrarily complex operations (as ACTs), they admit temporary operation reordering.
Several researchers attempted a formal analysis of the guarantees provided by Bayou or systems similar to it. E.g., the authors of Zeno [56] describe its behaviour using I/O automata. In [57] the authors analyse Bayou and explain it through a formal framework that is tailored to Bayou. Both of these approaches are not as general as ours and do not enable comparison of the guarantees provided by other systems. Finally, the framework in [52] enables reasoning about eventually consistent systems that enable speculative executions and rollbacks and so also AcuteBayou. However, the framework does not formalize strong consistency models, which means it is not suitable for our purposes.
7.3 *Asymmetric
models with cloud as a proxy*
Contrary to our approach, the work described below assumes an asymmetric model in which external clients maintain local copies of primary objects that reside in a centralized (replicated) system, referred to as the cloud. Clients perform weak operations on local copies and only synchronize with the cloud lazily or to complete strong operations. Since the cloud functions as a communication proxy between the clients, when it is is unavailable (e.g., due to failures of majority of replicas or a partition), clients cannot observe even each others new weak operations. Hence, this approach is less flexible than ours. However, since the cloud serves the role of a single source of truth, conflicts between concurrent updates can be resolved before they are propagated to the clients, so temporary operation reordering is not possible.
In cloud types [20], clients issue operations on replicated objects stored in the local revision and occasionally synchronize with the main revision stored in the cloud, in a way similar as in version control systems. The synchronization happens either eagerly or lazily, depending on the used mode of synchronization. The authors use revision consistency [58] as the target correctness criterion. In a subsequent work [21] a global sequence protocol (GSP) was introduced, which refines the programming model of cloud types, and replaces revision consistency with an abstract data model, as revisions and revision consistency were deemed too complicated for non-expert users. Global sequence consistency (GSC) [59] is a consistency model that generalizes GSP and a few other approaches that assume external clients that either eagerly or lazily push or pull data from the cloud.
7.4 *Asymmetric master-slave
models*
There are systems which relax strong consistency by allowing clients to read stale data, either on demand (the client may forgo recency guarantees by choosing a weak consistency level for an operation), or depending on the replica localization (in a geo-replicated system the client accessing the nearest replica can read stale data that are pertinent to a different region). However, in such systems all updating operations (including the weak ones) must pass through the primary server designated for each particular data item. Thus, similarly to the asymmetric, cloud as a proxy models, in this approach weak operations are not freely disseminated among the replicas. Since all updates (of a concrete data item) are serialized by the primary, temporary operation reordering is not possible.
Examples of systems which follow this design and allow users to select an appropriate consistency level include PNUTS [60], Pileus [8], and also the widely popular contemporary cloud data stores, such as AmazonDB [12] and CosmosDB [13]. Systems that guarantee strong consistency within a single site and causal consistency between sites include Walter [5], COPS [50], Eiger [61] and Occult [62].
7.5 Other approaches
Certain eventually consistent NoSQL data stores enable strongly consistent operations on-demand . E.g., Riak allows some data to be kept in strongly consistent buckets [15], which is a namespace completely separate from the one used for data accessed in a regular, eventually-consistent way. Apache Cassandra provides compare-and-set-like operations, called light-weight transactions (LWTs) [14], which can be executed on any data, but the user is forbidden from executing weakly consistent updates on that data at the same time. Concurrent updates and LWTs result in undefined behaviour [17], which means that mixed-consistency semantics of LWTs can be considered broken.
In Lynx [63] and Salt [64] mixed-consistency transactions are translated into a chain of subtransactions, each committed at a different primary site. Thus such transactions can block or raise an error if a specific site is unavailable.
Recently some work has been published on the programming language perspective of mixed-consistency semantics. Since this research is not directly related to our work, we briefly discuss only a few papers. Correctables [65] are abstractions similar to futures, that can be used to obtain multiple, incremental views on the operation return value (e.g., a result of a speculative execution of the operation and then the final return value). Correctables are used as an interface for the modified variants of Apache Cassandra and ZooKeeper [66] (a strongly consistent system). In MixT [67] each data item is marked with a consistency level that will be used upon access. A transaction that accesses data marked with different consistency levels is split into multiple independently executed subtransactions, each corresponding to a concrete consistency level. The compilation-time code-level verification ensures that operations performed on data marked with weaker consistency levels do not influence the operations on data marked with stronger consistency levels. Understandably, the execution of a mixed-level transaction can be blocking. Finally, in [68] the authors advocate the use of the release-acquire semantics (adapted from low-level concurrent programming) and propose Kite, a mixed-consistency key-value store utilizing this consistency model. In Kite weak read operations occasionally require inter-replica synchronization and block on network communication, thus they are not highly available.
8 Conclusions
In this paper we defined acute cloud types, a class of replicated systems that aim at seamless mixing of eventual and strong consistency. ACTs are primarily designed to execute client-submitted operations in a highly available, eventually-consistent fashion, similarly to CRDTs. However, for tasks that cannot be performed in that way, ACTs at the same time support operations that require some form of distributed consensus-based synchronization.
We defined ACTs and the guarantees they provide in our novel framework which is suited for modeling mixed-consistency systems. We also proposed a new consistency criterion called fluctuating eventual consistency, which captures a common trait of many ACTs, namely temporary operation reordering. Interestingly, temporary operation reordering appears neither in systems that are purely eventually consistent (e.g., NoSQL data stores) nor purely strongly consistent (e.g., traditional DBMS). Moreover, it is not necessarily present in all ACTs, but as we formally prove, it cannot be avoided in ACTs that feature arbitrarily complex (but deterministic) semantics (e.g., arbitrary SQL transactions).
Appendix A
In this appendix we present additional material that could not be included in the article due to space considerations. In Section A.1 we give a detailed description of the seminal Bayou protocol, and in Section A.2 we discuss its liveness guarantees. Next, in Section A.3 we supplement the details on how the Bayou protocol can be improved to form the general-purpose ACT AcuteBayou. In Section A.4 we formalize the properties of the object, the black box component responsible for the semantics of the implemented data type in the algorithms of Bayou and AcuteBayou. Finally, in Section A.5 we provide the formal proofs of correcntess for ANNC and AcuteBayou.
A.1 Bayou–detailed description
The pseudocode in Algorithm 2 specifies the Bayou protocol for replica . Replicas are independent and communicate solely by message passing. When a client submits an operation to a replica, is broadcast within a message using a gossip protocol. In our pseudocode, we use regular reliable broadcast, RB (line 12; we say that has been ). Through the code in line 13 we simulate immediate local y of .
Each Bayou replica totally-orders all operations it knows about (executed locally or received through RB). In order to keep track of the total order, a replica maintains two lists of operations: and . The list encompasses the stabilized operations, i.e., operations whose final execution order has been established by the primary. On the other hand, the list encompasses operations whose final execution order has not yet been determined. The operations on the list are sorted using the operations’ timestamps (to resolve any ties, the replica identifiers and per replica sequence numbers are used). A timestamp is assigned to an operation as soon as a Bayou replica receives it from a client.
A Bayou replica continually executes operations one by one in the order determined by the concatenation of the two lists: (line 55). The replica keeps additional data structures, such as and , to keep track of its progress. An operation , once executed, will not be executed again as its final operation execution order is determined. On the other hand, an operation in the list might be executed and rolled back multiple times. It is because a replica adds operations to the list (rearranging it if necessary; lines 18-16) as they are delivered by a gossip protocol. Hence, a replica might execute some operation , and then, in order to maintain the proper execution order consistent with the modified list, the replica might be forced to roll back (line 51), execute a just received operation (which has lower timestamp than ), and execute again. We maintain the list of operations scheduled for rollback (operations are kept in the order reverse to the one in which they were executed, line 48). An operation execution can proceed only once all the scheduled rollbacks have been performed.
One of the replicas, called the primary, periodically commits operations from its list by moving them to the end of the list, thus establishing their final execution order (line 38). The primary announces the commit of operations by ing commit messages, so that each replica can also commit the appropriate operations. Note that the primary uses the FIFO variant of RB to ensure that all replicas commit the same set of operations in the same order.
Intuitively, the replicas converge to the same state, which is reflected by the list of operations. More precisely, when the stream of operations incoming to the system ceases and there are no network partitions (the replicas can communicate with the primary), the lists at all replicas will be the same, whereas the lists will be empty. On the other hand, when there are partitions, some operations might not be successfully committed by the primary, but will be disseminated within a partition using RB. Then all replicas within the same partition will have the same and (non-empty) lists.
Operations are executed on the object (line 4), which encapsulates the state of the local database. At any moment, the value of corresponds to a sequence of the already executed operations on a replica given, where is a prefix of . Note that allows us to easily rollback a suffix of (line 51). We discuss the properties of the object in more detail in Section A.4.
Algorithm 3 shows a pseudocode of a referential implementation of the StateObject for arbitrary operations of any sequential data type (a specialized one can be used to take advantage of specific data type’s characteristics or to enable non-sequential semantics for certain replicated data types which expose concurrency to the client). We assume that each operation can be specified as a composition of read and write operations on registers (objects) together with some local computation. The assumption is sensible, as the operations are executed locally, in a sequential manner, and thus no stronger primitives than registers (such as CAS, fetch-and-add, etc.) are necessary. The StateObject keeps an undo log which allows it to revoke the effects of any operation executed so far (the log can be truncated to include only the operations on the list).
A.2 Liveness guarantees in Bayou
Eventually consistent systems are aimed at providing high availability. It means that a replica is supposed to respond to a request even in the presence of network partitions in the system. This requirement can be differently formalized. In the model considered by Brewer [69], a network partition can last infinitely. Then, high availability can be formalized as wait-freedom [70], which means that each request is eventually processed by the system and the response is returned to the client. In the more commonly assumed model that admits only temporary network partitions (we also adopt this model, similarly to, e.g., [7] [33]), that requirement is not strong enough, since a replica could trivially just wait until the partitions are repaired before executing a request and responding to the client. Therefore, in such a model the requirement of high availability must be formulated differently. It can be done as follows: a system is highly available if it executes each request in a finite number of steps even when no messages are exchanged between the replicas (the replica cannot indefinitely postpone execution of a request or returning the response to the client, see Section 3.3 for a formal definition). In this sense, Bayou is highly available. However, this definition of high availability does not preclude situations in which, e.g., the number of steps the execution of each request takes grows over time and thus is unbounded. Hence, one could formulate a slightly stronger requirement, i.e., bounded wait-freedom [70], which states that there is a possibly unknown but bounded number of protocol steps that the replica takes before a response is returned to the client upon invocation of an operation. Interestingly, unlike many popular NoSQL data stores, such as [1] or [71], Bayou does not guarantee bounded wait-freedom even for weak operations, as we now demonstrate.
Consider a Bayou system with replicas, one of which, , processes requests slower compared to all other replicas. Assume also that every fixed period of time there are new weak requests issued, one on each replica, and the processing capabilities of all replicas are saturated. In every , should process all requests (as do other replicas), but it starts to lag behind, with its backlog constantly growing. Intuitively, every new operation invoked on will be scheduled for execution after all operations in the backlog, as they were issued with lower timestamps. Hence the response time will increase with every new invocation on . One could try to overcome the problem of the increasing latency on by artificially slowing the clock on , thus giving unfair priority to the operations issued on , compared to operations issued on other replicas. But then any operation invoked on would appear on other replicas as an operation from a distant past. In turn, any such operation would cause a growing number of rollbacks on the other replicas.
Strong operations cannot be (bounded) wait-free simply because in order for them to complete, the primary must be operational, which cannot be guaranteed in a fault-prone environment.
Interestingly, in AcuteBayou (see Sections 2.2.5 and A.3) the execution of weak operations is trivially bounded wait-free, as they are executed immediately upon their invocations.
A.3 Bayou improved
As we discussed in Section 2.2.5, we can improve the Bayou protocol to make it more fault-tolerant and free of cicular causality, and thus obtain AcuteBayou. In Algorithm 4 we present the modifications to the Algorithm 2, which give us AcuteBayou. Note that, in accordance with the ACT restrictions (see Section 3.3) we also improve the execution of weak read-only (RO) operations (since any RO operation does not change the logical state of the , can be executed only locally888We assume that StateObject features an overloaded function which takes a plain operation as an argument, instead of a record, when executing RO operations.).
Firstly, we use TOB in place of the primary to establish the final operation execution order. More precisely, every (weak, updating) operation is broadcast using RB (as before) as well as TOB (lines 15–16). When a replica s an operation (line 23), it stabilizes . Since TOB guarantees that all replicas the same set of messages in the same order, all replicas will stabilize the same set of operations in the same order. As we have argued, TOB can be implemented in a way that avoids a single point of failure [19].
Further changes are aimed at eliminating circular causality in Bayou as well as improving the response time for weak operations. To this end (1) any strong operation is broadcast using TOB only (line 21), and (2) upon being submitted, any weak operation is executed immediately on the current state, and then rolled back (lines 13 and 14). It is easy to see that the modification (2) means the incoming stream of weak operations from other replicas cannot delay the execution of weak operations submitted locally. Below we argue why the two above modifications allow us to avoid circular causality in Bayou.
The change (1) means that for any pair of a strong and a weak operation , if the return value of any operation depends on both and ( observes and ), they will be observed in an order consistent with the final operation execution order. We prove it through the following observations:
for to observe , must be committed (in the modified algorithm never appears on the list), 2. 2.
if is a strong operation, then must also be committed, because upon execution strong operations do not observe operations on the list; hence both operations are observed according to their final execution order, 3. 3.
otherwise ( is a weak operation):
- (a)
is updating (not RO), because otherwise it would not logically impact the return value of , 2. (b)
if is already committed, it is similar to case 2, 3. (c)
if is not yet committed, will observe the operations in the order ; on the other hand, once is delivered by TOB and committed, it will appear on the list after , and so also observes and in the same order .
The change (2) is necessary to prevent circular causality between two (or more) weak operations (the case depicted in Figure 1. It is because the modified algorithm executes a weak (updating) operation without waiting for the / message to arrive. It means that no concurrent operation will be executed prior to the first execution of , whose return value observes the client. Otherwise could observe even though the final execution order is .
Finally, we redefine the record to include the execution context , i.e., the identifiers of requests already executed upon the invocation of the current operation and which have influenced the object (those on the list and those on the list). Note that in practice such identifiers can be efficiently represented using Dotted Version Vectors [72]. With the augmented record the implementation of StateObject can take advantage of the relative visibility between operations to achieve the non-sequential semantics of such replicated data types as MVRs or ORsets.
A.4 StateObject properties
Although in Algorithm 3 we present a referential implementation of StateObject, in general we treat the object as a black box with unknown implementation. The corretness of AcuteBayou depends on the properties of the object which we formalize below.
Take the list of requests that were executed on the , and remove the requests which were rolled back; we call the resulting sequence the current trace of the .999We omit weak RO operations executed in Algorithm 4 line 6, which are not associated with any record. Since the encapsulates the state of the system after locally executing and revoking requests, we require that the ’s responses are consistent with a deterministic serial execution of as specified by the type specification when taking into account the relative visibility between requests encoded in the field of the record. In case of any strong operation (in a request ), we assume that all requests prior to are visible to (regardless of ). This is because is executed only once is on the list and thus its position relative to all other operations is fixed and corresponds to the TOB order.
More precisely, for any given trace , the object deterministically holds the state , and for any operation , the response of the function invoked on the object in state equals , where is a context such that:
- •
consists of all the requests in ,
- •
, for any request ,
- •
is the visibility relation based on the fields of the record for the weak operations and on the order in for strong operations, i.e. for any such that :
- –
if , then ;
- –
if , then ;
- •
is the enumeration of requests in according to their position in .
In AcuteBayou, , because:
- •
requests are executed only if is empty,101010Weak requests are also executed in the invoke block, independently of the and lists, but they are immediately afterwards rolled back, so they do not influence the trace.
- •
whenever a request is executed it is added to the list, thus it is appended to the end of ,
- •
in the function, some requests move from the list to the end of the list, thus not changing their position in ,
- •
whenever a request is rolled back, it is removed from the head of the list, and thus removed from the end of , consistently with the definition of a trace.
A.5 Proofs of correctness
In this section we provide the formal proofs of correcntess for ANNC and AcuteBayou anticipated in Section 5.6. We start with an overview of proofs’ structures.
In order to prove correctness of either protocol, we take a single arbitrary execution of the protocol, and without making any specific assumptions about it, we show how the visibility and arbitration relations can be defined so that the appropriate correctness guarantees can be proven. Below we briefly outline our approach.
In both ANNC and AcuteBayou, strong operations are disseminated solely by TOB, and weak updating operations are sent using both RB and TOB. On the other hand weak RO operations are executed completely locally and do not involve any network communication (strong RO operations are present only in AcuteBayou and are treated as regular strong operations). Thus, in the proofs, for the purpose of constructing the arbitration relation (), we order all updating (strong or weak) operations based on the order of the delivery of their respective messages broadcast using TOB. In the case of updating operations whose messages were not ed (which can happen in the asynchronous runs), we order them in after all the operations whose messages were ed. Their relative order can be arbitrary in ANNC, and in AcuteBayou it has to conform to the order imposed by the records. Finally, for completeness, needs to include also weak RO operations. We carefully interleave them with updating operations in such a way to guarantee no circular causality as well as equivalence between visibility and arbitration for strong operations.
We construct the visibility relation () by choosing for any two events whether one should be observed by the other. We include an edge under two, broad conditions: the edge is essential, i.e., could have influenced the return value of , or the edge is non-essential, i.e., could not have influenced the return value of (because, e.g., is an RO operation), but occurs before in real-time or arbitration. Non-essential edges are important to guarantee eventual visibility for all events.
Now let us make some observations regarding network properties during synchronous and asynchronous runs. Since we consider infinite fair executions, in both types of runs each message is guaranteed to be ed by each replica. On the other hand, the same delivery guarantee, but for messages , holds only in the stable runs, and in the asynchronous runs, some messages can be ed while others may remain pending. However, asynchronous runs still obey other guarantees, which means that, crucially, no messages will be ed by any replica out of order. Moreover, if some message was ed by one replica, then it will be ed by all replicas. Also, if one replica manages to infinitely many messages which are then ed, then each replica can succesfully and its messages. Thus, in the asynchronous runs, we expect a finite number of messages to be ed, while all other to remain pending.
For each event let us denote by and , respectively, the message in the event and the message in the event (both and can be undefined for a given event , denoted or ). For any two events , such that , and we introduce the following notation: , which defines the order (based on the function). Additionally, for any two events , such that (or respectively ), we write (), if executes on a replica that has ed (ed) prior to its execution.
Finally, let us observe that we model replicas as deterministic state machines (as discussed in Section 3.2), whose specification we give through pseudocode. The variables declared in the algorithms of ANNC and AcuteBayou represent the state of the replicas, while the code blocks represent atomic steps that transition the replicas from one state to another. It means that each such block executes completely before any of its effects become visible. This allows us to infere the following rule (in both ANNC and AcuteBayou) for weak operations which execute in one atomic transition in some event , which is either in the or relation with any other event : ( returns before ).
A.5.1 ANNC correcntess proofs
Let us proceed with the proof of the guarantees offered by ANNC in the stable runs.
See 1
Proof.
For any given arbitrary stable run of ANNC represented by a history we have to find suitable , and , such that is such that .
Additional observations. Note that each operation executed in some event finishes when the replica s the message . It means that for every operation executed in event , such that , if (), then .
Arbitration. We construct the total order relation by sorting all updating events (additions and subtractions) based on the order in which their respective messages are ed, i.e., respecting the order.
Next, we interleave the updating events with RO events (gets) in the following way: each such an RO event occurs in after the last subtract event such that . Thus, for each subtract event the following holds . The relative order of RO operations is irrelevant.
As ANNC does not feature operation reordering, for each event we simply let .
Visibility. For any two events , we include an edge in our construction of , if:
or , and , 2. 2.
, and , 3. 3.
, , and , 4. 4.
, , and , 5. 5.
, and , 6. 6.
, and , 7. 7.
and ,
(for some ).
The edges 1-4 are essential, while the edges 5-7 are non-essential. The updates that are visible to a operation depends solely on the order, while in case of a operation, the and relations play a role. It does not matter which updates are visible to an operation because it always responds with a simple acknowledgment, hence the edge 7 is non-essential.
Note that in case of edges 3-4, is implied (see the general observations in Section A.5), and in case of the edge 6, follows directly from the construction of . Thus, for all edges 3-7, .
Having defined (through , and ), it now remains to show that , or more specifically .
Eventual visibility. We prove now that eventual visibility is satisfied for all events:
- •
each or event is visible to all subsequent events from some point, because there is only a finite number of updating events such that (1),
- •
each or event is visible to all subsequent events from some point, because both and are eventually delivered on all replicas, (2, 3 and 4),
- •
each event is visible to all subsequent events from some point (5),
- •
each event is visible to all subsequent events from some point, because by construction of there is only a finite number of events such that (6),
- •
each event is visible to all subsequent events from some point (7).
No circular causality. We need to show that and , where , are, respectively, the sets of all weak, and strong events. We elect to prove a more general case of .
Recall that . If , then , because transitive edges cannot introduce cycles. Thus, we have eight types of edges to consider: edges 1-7 from and the eight edge . We divide them into two groups: the first one consists of edges 1-2, while the second one consists of edges 3-8. Note that for the second group always holds.
There can be no cycles when we restrict the edges only to the ones from the first group, as edge 1 is constrained by the order, and edge 2 leads to a event which cannot be followed using only edges from the first group.
Also, there can be no cycles when we restrict the edges only to the ones from the second group, as all the edges are constrained by the relation, which is naturally acyclic.
Thus, a potential cycle could only form when we mix edges from both groups. Let us assume that the cycle contains the following chain of edges: , where , all the edges between and belong to the second group, while the other ones belong to the first group. Notice that , and that while . Thus, the chain consists of a series of edges from the first group and a series of edges from the second group. The whole cycle can be combined from multiple such chains, but for simplicity, let us assume that it contains only one such chain and that (the same reasoning as below can be applied iteratively for multiple interleavings of edges from the two groups).
If , for some , then (edge 1), and since , also (see the additional observations in the begining of the proof). A contradiction: .
If , then , for some , and (edge 2). Either , or . In the former case we end up with a similar contradiction as above: . In the latter case, since , also (the message is ed before the message ). However, , which means that the message was not even yet when executed. A contradiction.
Return value consistency. We need to show that for each event : . We base our reasoning below on essential edges and order.
Trivially, the condition is satisfied for all events, which always return . For all and events, we can exclude from all events which by the definition of an RO operation are irrelevant for the computation of .
In case of a operation, for some , executed in some event , includes all the and events that precede in the order. When applying the function from the definition of , these and operations are processed one by one, in the order of their y (by construction of ). Each operation increases the accumulator by , and each operation decreases the accumulator by , but only if it is greater or equal . This matches the pseudocode (lines 24 and 27-28) with the accumulator corresponding to the difference between and variables. Thus, the computed value of the function corresponds to the difference between and variables at the time the response to is computed in line 26. If that value is greater or equal then is returned, which matches the pseudocode’s behaviour.
In case of a operation executed in some event , includes all the and events that were ed before the execution of , as well as, (possibly) some events which were not ed, but only ed before the execution of . Note that all the latter events are ordered according to , after all the former and events (have they had been ordered earlier due to lower value of their respective message, they would also be ed). When processing the function up to the last ed event, the value of the accumulator corresponds, similarly as in case of events above, to the difference between and variables. Then, when processing the remaining events the final computed value of the function grows by an amount , which is equal to the sum of all these operations’ arguments. Due to the fact that each ed message is first ed or is processed as if it were ed (lines 22-23), the value of is always greater or equal . The difference between and variables corresponds exactly to , because it includes events which were ed, but not ed. Thus, the computed value of equals at the time of executing , which matches .
Single order. Since there are no pending operations (because eventually every message is ed and the operations finish), we have to simply prove that , where . In other words, for any two events : .
Let us begin with . Either (edge 1), or (edge 6). In both cases .
Now let us consider . Either , or . In the former case, , and thus (edge 1). In the latter case, (by construction of ), and thus (edge 6).
Real-time order. We need to show that arbitration order respects the real-time order of strong operations, i.e., . In other words, for any two : .
Clearly, if , then (see the additional observations in the begining of the proof). Thus, (by construction of ). ∎
Now, let us continue with the proof of the guarantees offered by ANNC in the asynchronous runs.
See 2
Proof.
To show the inability of ANNC to satisfy in asynchronous runs, it is sufficient to observe that due to some of the messages not being ed, some of the operations remain pending. A pending operation’s return value equals which is unreconcilable with the requirements of the predicate .
The proof regarding the guarantees of the weak operations is similar to the one for the stable runs, thus we rely on it and focus only on differences between stable and asynchronous runs that need to be addressed. Now for any given arbitrary asynchronous run of ANNC represented by a history we have to find suitable , and , such that is such that .
Arbitration. We construct the total order relation by sorting all updating events (additions and subtractions) based on the order in which their respective messages are ed, i.e., respecting the order. Updating events whose messages are not ed are ordered after those whose messages are ed.
Next, we interleave the updating events with RO events (gets) in the following way: each such an RO event occurs in after the last non-pending subtract event such that . Thus, for each non-pending subtract event the following holds . The relative order of RO operations is irrelevant.
As ANNC does not feature operation reordering, for each event we simply let .
Visibility. We construct the visibility relation in the same way as in the stable runs case. However, we remove edges to and from pending events. Since pending operations do not provide a return value, no edge to a pending event is essential. Also, as we guarantee only eventual visibility for weak events, edges to events are not necessary to satisfy . Moreover, edges from pending events are not needed either, because by definition a pending event is never followed in by any other event (which is a requirement to fail the test for EV). Again, for all edges 3-7, .
Having defined (through , and ), it now remains to show that , or more specifically .
Eventual visibility. We prove now that eventual visibility is satisfied for all weak events:
- •
each or non-pending event is visible to all subsequent events from some point, because or are eventually delivered on all replicas (2, 3 and 4),
- •
each event is visible to all subsequent events from some point (5),
- •
each non-pending event is visible to all subsequent events from some point (7).
No circular causality. We use exactly the same reasoning as in the stable runs case to show that holds true.
Return value consistency. Again, we use exactly the same reasoning as in the stable runs case to show that for each weak event : . Although this time we only need to prove return value consistency for and operations, it can be shown that it also holds for non-pending subtract events.
∎
A.5.2 AcuteBayou correcntess proofs
The proofs for AcuteBayou are analogous to those for ANNC, but are slighly more complex due to operation reordering and the more general nature of AcuteBayou with unconstrained operations’ semantics (in contrast ANNC features weak updating operations that always return ). Because we strive in this section for self-contained proofs we do not refer to the proofs for ANNC even when doing so would allow us to omit some repetitions.
We begin with the proof of guarantees offered by AcuteBayou in the stable runs.
See 3
Proof.
For any given arbitrary stable run of AcuteBayou represented by a history we have to find suitable , and , such that is such that .
Additional observations. All events besides weak RO ones, have an associated unique record which is disseminated using and ; let us denote by the record of the event .111111Thus a trace of the object, which consists of such records, can be translated into a sequence of events. Since, the handling of weak RO events, which are local to a replica, differ significantly from other events, which are shared, we divide the set of all events into two subsets: , consisting of weak updating, strong updating and strong RO events; and , consisting of weak RO events. We also further divide into subsets and , consisting of, respectively, weak and strong events.
Upon y of a message, the receiced request is committed (Algorithm 4 line 24), i.e., it is appended at the end of the list, and removed from the list (if present there). Note that the position of established on the list never changes as the list is only appended to. Once the request is committed, the operation associated with the request is eventually executed (unless the request was already executed in the order consistent with the commit order) and then the request is never rolled back. This is so, because:
- •
the list is included in the list as a prefix in the procedure (Algorithm 2 line 33),
- •
until the request executes it has to feature on the list (Algorithm 2 line 47) and there can be only a finite number of items preceding it on that list,
- •
the list cannot grow indefinitely without executing some of the requests from the list, which means that is eventually executed (Algorithm 2 line 55),
- •
and finally a request which is included in both the and lists is never part of the list (Algorithm 2 line 45), which means it will not be scheduled for rollback.
Weak operations execute atomically in the invoke code block where the response is always returned immediately to the client.121212If due to operation reexecutions multiple responses are returned to the client we discard the additional ones. For a given weak event the response is computed on the object in some state , where is the current trace of the object at the time of the operation’s invocation. We let denote the trace .
On the other hand, strong operations follow a more complicated route. For a strong event : firstly the message is , then upon its y the request is committed. Since is not disseminated using , it is never included in the list, and so it executes for the first time after its commit. Thus, each strong operation is executed on each replica exactly once, on a object in some state , where is the current trace of the object at the time of the execution. Note that the trace is exactly the same on each replica and it consists exactly of all the requests preceding in the list (which due to the properties of y has the same value on each replica upon ’s commit). Again, as in case of weak events, we let denote the trace .
Note that each strong operation executed in some event finishes only after the replica s the message . It means that for every operation executed in event , such that , if (), then .
Arbitration. We construct the total order relation by sorting all shared events based on the order in which their respective messages are ed, i.e., respecting the order.
Next, we interleave the shared events with local events in the following way: each local event occurs in after the last shared event such that . Thus, for each shared event the following holds . The relative order of local operations is irrelevant.
We construct the perceived arbitration order , for each event , using the trace . More precisely, we add all the events whose requests appear in in the order of occurence, next we add all the remaining shared events according to their order in . Finally, we interleave the constructed sequence with local events in a similar way as in case of , i.e., for each local event and each shared event , the following holds .
Note that for a strong event , . This is because executes once is on the list, and its position on the list is determined by the order, which means that the trace contains exactly all the shared events preceding in .
Visibility. For any two events , such that , we include an edge in our construction of , if:
, , and , 2. 2.
, , and , 3. 3.
, , and , 4. 4.
, , and , 5. 5.
, and , 6. 6.
, , and .
The edges 1-4 are essential, while the edges 5-6 are non-essential.
Note that in case of edge 4, either , or , and thus is implied (see the general observations in Section A.5). Thus, for all edges 4-6, .
Additionally, observe that in case of edge 1, , because contains only requests on the list (see the additional observations in the beginning of the proof), and thus . Similarly, in case of edges 2 and 3, , because and thus can appear in only if it was ed by the replica executing . Also in case of edge 2, .
Having defined (through , and ), it now remains to show that , or more specifically .
Eventual visibility. We prove now that eventual visibility is satisfied for all events:
- •
each shared event is visible to all subsequent events from some point, because is eventually ed and is placed on the list on each replica, thus is eventually executed and never rolled back, and is included in the trace of the object from some point (1, 2, 3 and 4),
- •
each local event is visible to all subsequent local events from some point (5),
- •
each local event is visible to all subsequent shared events from some point, because by construction of there is only a finite number of events such that (6).
No circular causality. We need to show that and , where , are, respectively, the sets of all weak, and strong events. We elect to prove a more general case of .
Recall that . If , then , because transitive edges cannot introduce cycles. Thus, we have six types of edges to consider: edges 1-6 from and the seventh edge . We divide them into two groups: the first one consists of edges 1-3, while the second one consists of edges 4-7. Note that for the second group always holds.
There can be no cycles when we restrict the edges only to the ones from the first group, as the edges 1 and 2 are constrained by the order, and edge 3 leads to a local event which cannot be followed using only edges from the first group.
Also, there can be no cycles when we restrict the edges only to the ones from the second group, as all the edges are constrained by the relation, which is naturally acyclic.
Thus, a potential cycle could only form when we mix edges from both groups. Let us assume that the cycle contains the following chain of edges: , where , all the edges between and belong to the second group, while the other ones belong to the first group. Notice that , and that . Thus, the chain consists of a series of edges from the first group and a series of edges from the second group. The whole cycle can be combined from multiple such chains, but for simplicity, let us assume that it contains only one such chain and that (the same reasoning as below can be applied iteratively for multiple interleavings of edges from the two groups).
If , then (edges 1 and 2), and since , also (see the additional observations in the beginning of the proof). A contradiction: .
If , then , and (edge 3). Either , or . In the former case we end up with a similar contradiction as above: . In the latter case, since , also (the message is ed before the message ). However, , which means that the message was not even yet when executed. A contradiction.
Single order. Since there are no pending strong operations (because eventually every message is ed and the operations finish), we have to simply prove that , where . In other words, for any two events : .
Let us begin with . Either , and thus (edge 1), or (edge 6). In both cases .
Now let us consider . Either , or . In the former case, , and thus must be included in , which means that (edge 1). In the latter case, (by construction of ), and thus also (edge 6).
Return value consistency. Since for a strong event , and . Thus, for each event , we need to show that: . We base our reasoning below on essential edges and order.
Firstly, observe that we can exclude from all local events which by the definition of an RO operation are irrelevant for the computation of . Thus, let , where .
Then, recall that is obtained by calling on the object in state , where , and that , where is a context constructed from as defined in Section A.4. It suffices to show that the context is isomorphic with , which we do below.
Clearly, by construction of , if and , then . Thus, consists of the records of the events in . By the way how records are constructed (Algorithm 4 line 11), for any given event , equals . Also, for any two events , , which follows trivially from the construction of . It remains to show that for any two events , .
If and , then , and thus , which implies .
If and , then , and thus , which implies .
If and , then (by Single Order), and thus . Since is committed at the time of ’s execution ( and ), so is but its position on the list is earlier (). Because the order of requests in the trace is based on the list, whose order is consistent with the order of the list, precedes in , which implies . Then, by construction of , .
If and , then , and thus precedes in . Since is committed at the time of ’s execution, both and belong to the list during the ’s execution, which implies that . Thus, , and by Single Order, .
Thus, is isomorphic with .
Convergent perceived arbitration. We now show, that for each event there exist only a finite number of weak events , such that the prefixes of and up to the event differ, which is a sufficient condition to prove .
If , then eventually on each replica is ed, and is committed and executed. Thus, from some point, the trace of each subsequent event contains , preceded by requests of events committed earlier, such that . Both and are constructed by first ordering shared events and then interleaving them with local events using the same procedure. In both and , is preceded by the same shared events , such that . Then, it is also preceded by the same local events, which means the prefixes of and up to are equal.
If , then eventually the requests of all shared events , such that , are committed and executed on each replica. Then, from some point, the trace of each subsequent event contains the requests of events , ordered by . Thus, is preceded in both and by the same shared events . Because both and are interleaved with local events using the same procedure, is also preceded in both and by the same local events, which means the prefixes of and up to are equal.
Real-time order. We need to show that arbitration order respects real-time order of strong operations, i.e., , where . In other words, for any two : .
Clearly, if , then (see the additional observations in the beginning of the proof). Thus, (by construction of ). ∎
Now, let us continue with the proof of the guarantees offered by AcuteBayou in the asynchronous runs.
See 4
Proof.
To show the inability of AcuteBayou to satisfy in asynchronous runs, it is sufficient to observe that due to some of the messages not being ed, some of the strong operations remain pending. A pending operation’s return value equals which is unreconcilable with the requirements of the predicate .
The proof regarding the guarantees of the weak operations is similar to the one for the stable runs, thus we rely on it and focus only on differences between stable and asynchronous runs that need to be addressed. Now for any given arbitrary asynchronous run of AcuteBayou represented by a history we have to find suitable , and , such that is such that .
Additional observations. The same observations apply as in case of stable runs, with the only distinction that some strong events remain pending due to the lack of y of . In such cases is undefined.
Now let us make one more observation: the request of a weak updating event whose is never ed, even though it never commits, eventually settles, i.e. it is eventually executed and is never rolled back after that execution. It is so, because after is ed by each replica and placed on the list, only a finite number of other requests can commit (due to the properties of TOB in asynchronous runs), and also only a finite number of other requests can have a lesser record (as defined by the operator in Algorithm 2) and thus precede in the list (due to monotonically increasing clocks on each replica). Thus, once is placed on the list, it eventually executes, and when executed can be rolled back at most a finite number of times, due to a commit of other request, or a lesser being inserted into the list.
Arbitration. We construct the total order relation by sorting all shared events based on the order in which their respective messages are ed, i.e., respecting the order. Shared events whose messages are not ed are ordered after those whose messages are ed, with weak updating events appearing first, ordered relatively based on their records, followed by pending strong events.
Next, we interleave the shared events with local events in the following way: each local event occurs in after the last non-pending shared event such that . Thus, for each non-pending shared event the following holds . The relative order of local events is irrelevant.
We construct the perceived arbitration order for each event , in the same way as in case of stable runs, i.e. using , the remaining shared events from , and finally interleaving the constructed sequence with local events as in case of (so that for each local event and each non-pending shared event , the following holds .
For a pending strong event , which was not executed at all, we let .
Note that for a non-pending strong event , . This is because executes once is on the list, and its position on the list is determined by the order, which means that its trace will contain exactly all the shared events preceding in .
Visibility. We construct the visibility relation in the same way as in the stable runs case. However, we remove edges to and from pending strong events. Since pending operations do not provide a return value, no edge to a pending event is essential. Also, as we guarantee only eventual visibility for weak events, edges to strong events are not necessary to satisfy . Moreover, edges from pending events are not needed either, because by definition a pending event is never followed in by any other event (which is a requirement to fail the test for EV). Again, for all edges 4-6, .
Having defined (through , and ), it now remains to show that , or more specifically .
Eventual visibility. We prove now that eventual visibility is satisfied for all weak events:
- •
each non-pending shared event , such that is eventually ed on each replica, is visible to all subsequent non-pending events from some point, because is placed on the list on each replica, thus is eventually executed and never rolled back, and is included in the trace of the object from some point (2, 3 and 4),
- •
each weak updating event , such that is not eventually ed on each replica, is visible to all subsequent non-pending events from some point, because it settles (see the additional observations in the beginning of the proof) and is included in the trace of the object on each replica from some point (4),
- •
each local event is visible to all subsequent local events from some point (5),
- •
each local event is visible to all subsequent non-pending shared events from some point, because by construction of there is only a finite number of events such that (6).
No circular causality. We use exactly the same reasoning as in the stable runs case to show that holds true.
Return value consistency. Again, we use exactly the same reasoning as in the stable runs case to show that for each weak event : . Although this time we only need to prove return value consistency for weak operations, it can be shown that it also holds for non-pending strong events.
Convergent perceived arbitration. We now show, that for each non-pending131313We can exclude pending events, because according to the construction of they are not visible to any other event, and thus automatically satisfy the requirements of the CPar predicate. event there exist only a finite number of weak events , such that the prefixes of and up to the event differ, which is a sufficient condition to prove .
If and is eventually ed, then the same logic can be applied as in case of stable runs to show that from some point for each subsequent event the prefixes of and up to are equal.
If and is never ed, then it eventually settles (see the additional observations in the beginning of the proof) and thus also the same logic can be applied as in case of stable runs, with the distinction that is preceded in and not only by events whose requests are committed, but also by events , such that .
If , then eventually the requests of all shared events , such that (none of which are pending by the construction of ), are either committed, or settled, and executed on each replica. Then, from some point, the trace of each subsequent event contains the requests of events , ordered by both , and based on their records. Thus, is preceded in both and by the same shared events . Because both and are interleaved with local events using the same procedure, is also preceded in both and by the same local events, which means the prefixes of and up to are equal. ∎
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] G. De Candia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, “Dynamo: Amazon’s highly available key-value store,” SIGOPS Operating Systems Review , vol. 41, no. 6, pp. 205–220, Oct. 2007.
- 2[2] M. Shapiro, N. Preguiça, C. Baquero, and M. Zawirski, “Conflict-free replicated data types,” in Proc. of SSS ’11 , May 2011.
- 3[3] M. Shapiro, N. Preguiça, C. Baquero, and M. Zawirski, “A comprehensive study of convergent and commutative replicated data types,” Inria–Centre Paris-Rocquencourt; INRIA, Tech. Rep. 7506, 2011.
- 4[4] N. M. Preguiça, C. Baquero, and M. Shapiro, “Conflict-free replicated data types,” Co RR , vol. abs/1805.06358, 2018.
- 5[5] Y. Sovran, R. Power, M. K. Aguilera, and J. Li, “Transactional storage for geo-replicated systems,” in Proc. of SOSP ’11 , 2011, pp. 385–400.
- 6[6] C. Li, D. Porto, A. Clement, J. Gehrke, N. Preguiça, and R. Rodrigues, “Making geo-replicated systems fast as possible, consistent when necessary,” in Proc. of OSDI ’12 , Oct. 2012.
- 7[7] S. Burckhardt, A. Gotsman, and H. Yang, “Understanding eventual consistency,” Microsoft Research, Tech. Rep. MSR-TR-2013-39, Mar. 2013.
- 8[8] D. B. Terry, V. Prabhakaran, R. Kotla, M. Balakrishnan, M. K. Aguilera, and H. Abu-Libdeh, “Consistency-based service level agreements for cloud storage,” in Proc. of SOSP ’13 , Nov. 2013.
