Knock Out 2PC with Practicality Intact: a High-performance and General   Distributed Transaction Protocol (Technical Report)

Ziliang Lai; Hua Fan; Wenchao Zhou; Zhanfeng Ma; Xiang Peng; Feifei; Li; Eric Lo

arXiv:2302.12517·cs.DB·March 2, 2023

Knock Out 2PC with Practicality Intact: a High-performance and General Distributed Transaction Protocol (Technical Report)

Ziliang Lai, Hua Fan, Wenchao Zhou, Zhanfeng Ma, Xiang Peng, Feifei, Li, Eric Lo

PDF

Open Access

TL;DR

Primo is a high-performance distributed transaction protocol that eliminates the need for two-phase commit by ensuring conflict-free commit phases and using asynchronous group commit, significantly improving throughput.

Contribution

Primo introduces a novel conflict-free concurrency control and asynchronous group commit to support general workloads without 2PC, enhancing performance and simplicity.

Findings

01

Primo achieves 1.42x to 8.25x higher throughput than existing protocols.

02

Primo maintains similar latency to COCO while improving throughput.

03

Empirical results demonstrate Primo's effectiveness on YCSB and TPC-C benchmarks.

Abstract

Two-phase-commit (2PC) has been widely adopted for distributed transaction processing, but it also jeopardizes throughput by introducing two rounds of network communications and two durable log writes to a transaction's critical path. Despite the various proposals that eliminate 2PC such as deterministic database and access localization, 2PC remains the de facto standard since the alternatives often lack generality (e.g., requiring workloads without branches based on query results). In this paper, we present Primo, a distributed transaction protocol that supports a more general set of workloads without 2PC. Primo features write-conflict-free concurrency control that guarantees once a transaction enters the commit phase, no concurrency conflict (e.g., deadlock) would occur when installing the write-set -- hence the prepare phase is no longer needed to account for any potential conflict…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsDistributed systems and fault tolerance · Cloud Computing and Resource Management · Distributed and Parallel Computing Systems

Full text

\externaldocument

ICDE_paper

Response to Reviewers’ Comments

We thank all the reviewers for their insightful reviews and constructive comments. We also thank the meta-reviewer for overseeing our submission. All your comments are addressed in this version and we summarize the major changes below:

•

Explain why Primo acquires exclusive locks for reads. We have improved the overview (Section LABEL:sec:primo) to clarify this problem and show the intuition of Primo.

•

Improve the evaluation of watermark-based group commit. We have added two experiments in Section LABEL:sec:log_exp: (1) evaluating the impact of the epoch/interval size on latency and crash-abort rate; (2) evaluating the impact of watermark lagging and the effectiveness of force-updating the watermark.

•

Add deterministic database baseline. We have added Aria, a state-of-the-art deterministic database as a baseline. We have also added a discussion of QueCC in Section LABEL:sec:related_work.

•

Move less critical experiments to our technical report to save space. To accommodate the new experiments and discussions, we have to move some less critical experiments including the comparison with CLV (the original Fig 11 and a part of Sec LABEL:sec:log_exp) and the comparison with TAPIR (the original Fig 14 and Sec VI-F) to our technical report due to the space limit. Please let us know if you feel necessary to keep them in the paper.

We have highlighted all major changes in \textcolororangeorange in the revision. In the following, we first address the common concerns, and then provide responses to each individual comment.

Response to common concerns

\stitle

Concern 1 (Meta1, R1O1, R2O1, R3O1): *Acquiring exclusive locks for read operations seems unnecessary. The reviewers suggest to acquire shared locks for read operations and upgrading to exclusive locks for write operations. *

Thanks for pointing out this confusion. In this revision, we have clarified that in Section LABEL:sec:primo (Overview). Specifically, yes, given Primo can decouple failure handling from transaction execution, your suggested approach (i.e., acquiring shared locks for read operations and upgrading to exclusive locks for write operations) indeed seemingly removes 2PC. However, for general transactions that we do not have prior read-write set knowledge (i.e., deterministic database like Calvin [calvin] is not applicable here), such an approach is actually no different from the 2PC approach. That is because when performing the writes, the lock-upgrading requests (acquiring exclusive locks) may fail due to potential deadlocks and hence the coordinator also has to spend a round to finalize the commit-vs-abort decision. That is essentially no different from the PREPARE phase of 2PC.

We give an example here to better explain the issue. Figure Response to common concerns shows a simple transaction that transfers money from account $x$ (in partition P1) to account $y$ (in partition P2). The transaction first reads $x$ and $y$ , computes their new values, and then updates them accordingly. If we do not assume the DB engine has prior knowledge on the transaction, then the coordinator can only request shared locks on reading $x$ and $y$ in step \circled1 because the DB engine only sees SELECT commands. Only when the DB engine sees the updates on $x$ and $y$ in step \circled2, it would request upgrades to exclusive locks and make the final commit-vs-abort decision.

\captionof

figureThe lock upgrading approach

Figure Response to common concerns shows the workflow if using 2PC, an industrial-strength implementation (e.g., Google Spanner [spanner]) would indeed bundle the PREPARE message and write message together (cf. Sec LABEL:sec:2pc). Comparing the two approaches (Fig Response to common concerns and Fig Response to common concerns), we can observe that the lock upgrading approach is just a form of 2PC (we decouple logging from 2PC using our distributed group commit), which also has a high contention footprint and more message roundtrips. The protocol shown in Fig Response to common concerns (and equivalently Fig Response to common concerns) is exactly the 2PL baseline in our experiments.

\captionof

figureClassic 2PC-based protocol

Figure Response to common concerns shows the workflow of Primo for the same transaction. Acquiring exclusive locks for all read operations in step \circled1 (regardless of whether $x$ and $y$ would be updated later) would guarantee writes in step \circled2 must succeed (crash-faults are handled independently via distributed group commit). As a result, Primo can pragmatically reduce the contention footprint and message roundtrips at the cost of a potentially higher number of exclusive locks (e.g., the transaction may not make any updates if it finds out $x$ does not have enough money to transfer). Yet, this is exactly Primo’s design choice that is proven to be simple but very effective — the 2PC bottleneck is removed, and the impact of the extra exclusive locks can be mitigated by using TicToc for local transactions (cf. LABEL:sec:complete_waf).

\captionof

figurePrimo

Deterministic databases like Calvin [calvin] can precisely pre-acquire all the locks needed for the transaction before execution (i.e., no extra roundtrips for lock upgrading and no unnecessary exclusive locks). However, achieving that requires prior knowledge of the read-write set which is not always applicable to general transactions [aria, sensitivity].

\stitle

Concern 2 (Meta3, R1O4, R2O4): Lack of experimental comparison with deterministic CC such as Aria/QueCC.

Thanks for your suggestion! In this revision, we have added Aria to our experiment. Aria is more recent and does not require prior knowledge of the read-write sets. We have also added a discussion of QueCC in Section LABEL:sec:related_work.

We have compared both Aria’s open-source implementation [aria-code] and our own implementation of Aria (we implemented it in DBx1000 [dbx1000], the same code framework as Primo). Since our implementation of Aria performs better than Aria’s open-source implementation in all cases (see Fig 3), our revision is based on our own implementation. Our implementation is stronger because we have added two common optimizations: (1) message batching (that can reduce the number of system calls and improve bandwidth utilization [ambrosia]) (2) scheduling new transactions to a thread if it is waiting for an RPC response (Aria’s open-source implementation blocks a thread when waiting for RPCs).

Our experiments show that Primo can outperform Aria (even after we optimized it) by a large margin (see our revised Section LABEL:sec:exp; Aria is included in almost all the figures there). In fact, Aria’s OCC-style determinism, when applied to a partitioned database (not a replicated database setting), still requires a lighter form of 2PC across the partitions. It is lighter because logging can be removed due to Aria’s determinism but 2 phases of network roundtrips are still necessary. We have contacted Aria’s first author Yi Lu about this issue and he has confirmed our finding. With 2PC-like roundtrips still on the critical path, Aria cannot outperform Primo.

As a side note, our experiments also unveil that Aria has no clear advantage over many 2PC-based baselines (e.g., Sundial [sundial]) because logging in those baselines is also not in the critical path (we optimized them with state-of-the-art distributed group commit). Indeed, such strong baselines were not included in Aria’s paper.

\stitle

Concern 3 (R1O5, R2O3): Experiments are needed to verify the claim that latency and crash-abort rate are controllable by tuning the epoch size.

Good suggestion! We have refactored Section LABEL:sec:log_exp (comparisons of logging optimizations) to include such experiments. Fig LABEL:fig:ycsb_epoch in the revised paper confirmed our claim. Specifically, the latency and the crash-abort rate (i.e., abort rate due to failure) are proportional to the epoch size, meaning that if the application favors lower latency and fewer aborts in the case of failure, using a smaller epoch size effectively meets that demand (of course, at some cost in terms of throughput). Since the results under YCSB and TPC-C show the same trend, we only showed the YCSB result as a representative due to the space limit. We are open to including the TPC-C result if the reviewers feel necessary.

Response to meta-reviewer (Meta)

\stitle

Meta1: “acquiring exclusive locks for reads (which is odd, some CC variants acquire read locks for writes, then upgrade to exclusive during certification, but getting exclusive locks for reads seems confusing)”

Thanks for pointing out the confusion! And we have addressed that in this revision. Please refer to our response to the common concerns (Concern 1).

\stitle

Meta2: “impacts of watermark blocking performance”

In this revision, we have incorporated the study of blocking in Section LABEL:sec:log_exp. The results basically confirms our claim:

•

If the watermark message is delayed due to network, our watermark-based distributed group commit (WM) would only experience latency increase but almost no throughput drop. In contrast, our baseline COCO shows significant throughput drops and latency increases. That is because WM is an asynchronous scheme (i.e., the delayed watermark only detains the transaction from returning the results to the client, without blocking the transaction execution).

•

If a partition grows the watermark slowly (e.g., when the partition leader becomes slow due to hardware problems), our force-updating mechanism (i.e., adaptively adding offsets to the watermark of a slow partition) can effectively compensate for that and reduce latency.

\stitle

Meta3: “lack of comparison with deterministic CC approaches such Aria/QueCC (both are cited)”

Good suggestion! The requested experiments are added in this revision. Please refer to our response to the common concerns (Concern 2).

\stitle

Meta4: “improving evaluation analysis and discussion”

All the requested analyses and discussions are included in this revision. Please check our responses to other reviewers’ comments for more details.

Response To Reviewer 1 (R1)

\stitle

R1O1: “It is unclear why a transaction needs to acquire exclusive locks for read operations to avoid 2PC, which seems to be the main idea of this paper. I agree with the authors that if a transaction always commits, 2PC is unnecessary when there are no failures. However, suppose we acquire shared locks for read operations and exclusive locks for write operations during the execution phase. After all the locks are acquired, I think a transaction will not abort due to conflicts as well. Deterministic databases also achieved this without requiring all the locks to be exclusive. Hence, acquiring exclusive locks for read operations is not a good idea.”

Thanks for asking this fundamental question! We have addressed that in this revision. Please refer to our response to the common concerns (Concern 1).

\stitle

R1O2: “The authors design the concurrency control protocol based on the assumption that read operations acquire exclusive locks. Hence, the proposed method using Tictoc to reduce the lock cost seems redundant.”

We assume this concern is gone after addressing the O1. Basically, Primo eliminates 2PC at the cost of potentially more exclusive locks. That’s why we proposed to use TicToc which is less affected by the exclusive locks to mitigate this problem (cf. Sec LABEL:sec:complete_waf).

\stitle

R1O3: “A transaction may block when accessing a partition whose watermark lags behind. Although a force update of the watermark can alleviate this problem, an experimental analysis of how the blocking affects performance and how the watermark updating performs is necessary.”

Thanks for your suggestion! We have included the requested experiments in Section LABEL:sec:log_exp. Specifically, watermark lagging could be caused by (1) delayed watermark messages due to network fluctuations or (2) a slow partition (e.g., due to hardware problems) that processes fewer transactions and thus grows its watermark slowly. We provide results under both situations (Since YCSB and TPC-C results are of identical trend, we only showed the YCSB result in the revision due to the space limit; Please let us know if you feel necessary to include the TPC-C result). The results basically verified the effectiveness of our design:

•

For situation (1), our watermark-based distributed group commit (WM) would only experience a latency increase but almost no throughput drop. In contrast, our baseline COCO shows significant throughput drops and latency increases. That is because WM is an asynchronous scheme (i.e., the delayed watermark only detains the transaction from returning the results to the client, without blocking the transaction execution).

•

For situation (2), our force-updating mechanism (i.e., adaptively adding offsets to the watermark of a slow partition) can effectively compensate for the lagging watermark in the slow partition. As a result, latency is almost unaffected by the slow partition. Of course, the throughput inevitably drops regardless of what protocol is used because we indeed made one partition slow-running.

\stitle

R1O4: “The authors should conduct an experimental comparison with deterministic approaches like Aria [17]. This is because, like the proposed method, Aria does not require prior knowledge of read/write sets as well. Besides, Aria can achieve transaction reordering while the proposed method seems cannot.”

Thanks for pointing that out! Aria has been added as a baseline in this revision. Please refer to our response to the common concerns (Concern 2).

\stitle

R1O5: “The authors argue the drawbacks, including increased latency and extra aborts, can be controlled by the size of the epochs. However, there lacks an experimental analysis by varying the epoch size to verify this statement.”

Thanks for your suggestion! The requested experiments are added to this revision. Please refer to our response to the common concerns (Concern 3).

Response to Reviewer 2 (R2)

\stitle

R2O1: “acquiring exclusive locks for all read operations is not necessary to ensure no abort in the commit phase. I think if all shared locks for read operations and exclusive locks for write operations are acquired during the execution phase, this transaction will not abort due to conflicts. This is the same as deterministic databases. After all the locks are acquired, this transaction will not abort. Hence, I think that acquiring exclusive locks for all read operations is not new and apparently introduces additional locking overheads.”

Thanks for raising this important question! We have clarified the issue in this revision. Please refer to our response to the common concerns (Concern 1).

\stitle

R2O2: “The authors design a concurrency control algorithm by integrating TicToc to reduce locking overhead. However, because exclusive locks seem unnecessary for read operations, I am not clear about the novelty of the proposed algorithm.”

We assume this concern is gone after O1 is addressed. Basically, Primo removes 2PC at the cost of higher locking overhead (i.e., potentially some unnecessary exclusive locks), and we observed that TicToc is less affected by the exclusive locks (cf. Section LABEL:sec:complete_waf). Hence, we proposed this novel approach to reduce Primo’s overhead.

\stitle

R2O3: “As pointed out by the authors, the proposed protocol has drawbacks like increased latency and extra aborts when failures occur. The authors claim controlling the size of the epochs can alleviate these drawbacks. However, I fail to find corresponding experiments to verify whether this statement is true. I’d like to see experimental results of the abort rate and transaction latency by varying the epoch size.”

Thanks for your suggestion! The requested results have been included in the revision. Please refer to our response to the common concerns (Concern 3).

\stitle

R2O4: “Experimental comparisons with deterministic approaches like Aria [17] are missing. These comparisons are necessary because Aria does not require prior knowledge of read/write sets, which is the same as the proposed method.”

Good suggestion! Aria has been added to the evaluation of this revision. Please refer to our response to the common concerns (Concern 2).

Response to Reviewer 3 (R3)

\stitle

R3O1: “It appears that acquiring X-lock for all reads is not necessary to make sure no abort in the commit phase. What would matter is to acquire locks for writes in the execution phase. This won’t introduce extra X-locks but only increase the time a lock is held. Then there is no need to separate local and distributed modes, which adds system complexity and it is hard to reason about the correctness of combining WCF and OOC. If there is no extra X-lock, OOC is not needed so the extra work of handling timestamp to satisfy R1R2 in WM can be removed.”

Thanks for pointing out the confusion! We have clarified that in this revision. Please refer to our response to the common concerns (Concern 1).

\stitle

R3O2: “When comparing with other systems, the author tunes all protocols to have ~10 ms latency. However, Fig 4, 5 (c) shows that Primo has lower latency while the others have almost the same higher latency. Justifying why this is observed would be helpful.”

Thanks for pointing that out! We have added the justification for that in Section LABEL:sec:exp_overall (bottom-left of Page 10). Specifically, although we unified distributed group commit protocols to have around 10ms latency, the latency is also affected by the abort rate. The baselines have higher abort rates (due to the contention footprint caused by 2PC) and thus their transactions spend more time retrying. In fact, having their latency slightly higher than Primo’s actually gives them advantages. That is because, if tuned to the same latency, their epoch sizes must be smaller, which would downgrade their throughput (fewer transactions to amortize the group commit cost).

\stitle

R3O3: “For Fig 4,5 (b), It would be better to show the throughput instead of improvement, so we can compare Primo w/o WM&WCF with 2PL easier. The improvement ratio can be reported in the text. Also, there is a typo ”WAF” on the x-axis.”

Good suggestion! We have changed the y-axis of them (they become Fig LABEL:fig:ycsb_factor and LABEL:fig:tpcc_factor in this revision) to throughput. The improvement ratios are also annotated on top of the bars. The typo is also fixed, thanks!

\stitle

R3O4: “The idea of holding x-locks during the execution for read and write operations suggests that the proposed work may be better aligned with literature that does not separate between read and write locks. For example, work that ensures mutual exclusion or general atomic broadcast. A discussion and comparison with such methods would help understand the trade-offs in the work.”

Thanks for your suggestion! We agree that when we no longer differentiate between read and write locks, work that ensures mutual exclusion or atomic broadcast from the systems community would come into the picture. Here, we use Fig 4 to elaborate.

In general, mutual exclusion [mutex-algorithms, lamport_work, mutex1, mutex2] and atomic broadcast problems [dist-book, ab-survey, ab-zookeeper] in a distributed setting have historically been based on a replicated (e.g., for high availability) or a so-called shared-everything setting (i.e., every distributed component can read/write to every data). They are classical consensus problems. Each specific consensus problem can be solved by a general consensus protocol (e.g., Paxos [consensus-2]) or some problem-specific protocols (e.g., Lamport’s bakery algorithm for mutual exclusion [bakery]). However, they are not directly comparable with Primo who optimizes distributed transaction processing in the shared-nothing (sharding) setting instead. As a shared-nothing database, Primo involves atomic commit and transaction isolation (Primo also offers durability and high availability by replication, but they are not the main focus). We agree that the boundaries among these concepts are sometimes blurry but in general, mutual exclusion, be it in a shared-memory or a distributed setting, often focuses on synchronization on a single item, while transaction isolation focuses on multiple items. Although a general consensus protocol (e.g., [consensus-2]) can replace 2PC for atomic commit, its overhead is similar to or even higher than 2PC [paxoscommit]. In contrast, Primo does not even need 2PC.

In summary, Primo solves multiple problems together (atomic commit + transaction isolation + durability + high availability) to build a shared-nothing database, while mutual exclusion and atomic broadcast are standalone problems in shared-everything or replicated systems. We have added this discussion in Sec LABEL:sec:related_work of the revision.

\stitle

R3O5: “The restrictions that are introduced due to the proposed model need to be clarified and discussed. For example, it appears that clients/coordinator unilateral aborts are not possible. For example, if the messages are sent to the participants and they all respond positively, but the user/server/coordinator unilaterally aborts then fail (which is permissible in 2PC), what are the implications of such a scenario in the proposed algorithm? Specifically, how is recovery performed for this scenario and would this lead to not allowing such unilateral aborts?”

Thanks for pointing that out! Indeed, Primo disallows unilateral aborts from the clients/coordinator in the commit phase to avoid 2PC. Hence, we added a discussion to Sec LABEL:sec:watermark (left of Page 8) to clarify. Primo actually aligns with more recent 2PC-based protocols (e.g., MDCC [mdcc] and TAPIR [TAPIR]) who also disallow unilateral aborts during 2PC. In fact, this limitation is generally acceptable (or not even regarded as a limitation) in commercial databases (e.g., CockroachDB [cockroachdb], OceanBase [oceanbase]).

\stitle

R3O6: “The proposed methods would be efficient in a specific type of workload where there are no long-running read-only (or read-mostly) transactions and where the percentage of reads is not high in comparison to writes.”

Yes, that is exactly our target workload as we described in the first paragraph of our overview (section LABEL:sec:primo). Indeed, short-running read-write transactions are dominating in typical OLTP workloads like TPC-C [tpcc], TATP [tatp], and AuctionMark [auction]. Many previous protocols also target the same type of workload [calvin, bohm, hstore, voltdb, pwv]. In fact, practical systems often optimize long-running read-only transactions by snapshots [silo, tictoc], which can also be integrated into Primo. For long-running read-heavy transactions, Primo can also fall back to 2PC.