Reconfigurable Atomic Transaction Commit (Extended Version)

Manuel Bravo; Alexey Gotsman

arXiv:1906.01365·cs.DC·June 5, 2019

TL;DR

This paper introduces new atomic commit protocols for distributed data stores that require fewer replicas, reconfigure upon failures, and are proven correct in both asynchronous and RDMA models, improving scalability and fault tolerance.

Contribution

It presents novel atomic commit protocols that reduce replica requirements to f+1, incorporate reconfiguration, and are rigorously proven correct in multiple models.

Findings

01. Protocols require only f+1 replicas, reducing overhead.
02. Protocols are proven correct under the TCS specification.
03. Work highlights trade-offs of using RDMA in distributed commit protocols.

Equations

\forall L_{1}, L_{2}, l.\, f(L_{1} \cup L_{2}, l) = f(L_{1}, l) \sqcap f(L_{2}, l),

\forall x, v.\, (x, v) \in R \implies (\forall \langle \_, W^{\prime}, V^{\prime}_{c} \rangle \in L.\, (x, \_) \in W^{\prime} \implies V^{\prime}_{c} \leq v).

\forall t, t^{\prime}.\, {\tt decide}(t, \_) \prec_{h} {\tt certify}(t^{\prime}, \_) \implies {\tt decide}(t, \_) \prec_{\ell} {\tt certify}(t^{\prime}, \_).

\forall t, l, d.\, {\tt certify}(t, l), {\tt decide}(t, d) \in {\sf act}(h) \implies d = f(\{l^{\prime} \mid {\tt certify}(t^{\prime}, l^{\prime}) \in {\sf act}(h) \wedge {\tt decide}(t^{\prime}, \textsc{commit}) \prec_{h} {\tt decide}(t, d)\}, l).

\begin{array}[]{@{}l@{}l@{}l@{}l@{}}\forall x\in{\sf Obj}_{s}.\,\forall v.{}&(x,v)\in R\implies&\\[2.0pt] &(\forall\langle\_,W^{\prime},V^{\prime}_{c}\rangle\in\mathit{L}.\,(x,\_)\in W^{\prime}\implies&V^{\prime}_{c}\leq v),\end{array}

\begin{array}[]{@{}l@{}l@{}l@{}}\forall x\in{\sf Obj}_{s}.\,\forall v.{}&((x,\_)\in R&\implies(\forall\langle\_,W^{\prime},\_\rangle\in\mathit{L}.\,(x,\_)\not\in W^{\prime}))\wedge{}\\[2.0pt] &((x,\_)\in W&\implies(\forall\langle R^{\prime},\_,\_\rangle\in\mathit{L}.\,(x,\_)\not\in R^{\prime})).\end{array}

\forall l \in \mathcal{L}.\, \forall L \subseteq \mathcal{L}.\, f(L, l) = \textsc{commit} \iff \forall s \in \mathcal{S}.\, f_{s}((L \mid s), (l \mid s)) = \textsc{commit}.

\forall l \in \mathcal{L}.\, \forall L \subseteq \mathcal{L}.\, g_{s}(L, l) = \textsc{commit} \implies f_{s}(L, l) = \textsc{commit};

\forall l, l^{\prime} \in \mathcal{L}.\, g_{s}(\{l\}, l^{\prime}) = \textsc{commit} \implies f_{s}(\{l^{\prime}\}, l) = \textsc{commit}.

\forall t.\, d[t] = \bigsqcap \{d_{s}[t] \mid s \in {\sf shards}(t)\}

\forall t_{1}, t_{2}, s.\, t_{1} \neq t_{2} \implies \mathit{pos}_{s}[t_{1}] \neq \mathit{pos}_{s}[t_{2}]

\forall t, l, s.\, {\tt certify}(t, l) \in h \implies (d_{s}[t] = \textsc{commit} \implies \mathit{pload}_{s}[t] = (l \mid s)) \wedge (d_{s}[t] = \textsc{abort} \implies \mathit{pload}_{s}[t] \in \{(l \mid s), \varepsilon\})

\forall t, s.\, d_{s}[t] \sqsubseteq f_{s}(\mathit{pload}_{s}(T_{s}[t]), \mathit{pload}_{s}[t]) \sqcap g_{s}(\mathit{pload}_{s}(P_{s}[t]), \mathit{pload}_{s}[t])

\forall t, s.\, T_{s}[t] = \{t^{\prime} \mid \mathit{pos}_{s}[t^{\prime}] < \mathit{pos}_{s}[t] \wedge d[t^{\prime}] = \textsc{commit}\} \setminus P_{s}[t]

\forall t, s.\, P_{s}[t] \subseteq \{t^{\prime} \mid \mathit{pos}_{s}[t^{\prime}] < \mathit{pos}_{s}[t] \wedge d_{s}[t^{\prime}] = \textsc{commit}\}

\forall t, t^{\prime}, s.\, t^{\prime} \prec_{\mathit{rt}} t \wedge s \in {\sf shards}(t^{\prime}) \cap {\sf shards}(t) \implies \mathit{pos}_{s}[t^{\prime}] < \mathit{pos}_{s}[t]

\prec_{\mathit{rt}} \cup \prec_{\mathit{dec}} \mbox{ is acyclic},

\forall x, y \in \{\textsc{abort}, \textsc{commit}\}.\, x \sqsubseteq y \iff x = y \vee (x = \textsc{abort} \wedge y = \textsc{commit})

\forall t, t^{\prime}.\, t^{\prime} \prec_{\mathit{rt}} t \iff {\tt decide}(t^{\prime}, \_) \prec_{h} {\tt certify}(t, \_)

\forall t, t^{\prime}.\, t^{\prime} \prec_{\mathit{dec}} t \iff \exists s.\, t^{\prime} \in T_{s}[t] \vee (\mathit{pos}_{s}[t^{\prime}] < \mathit{pos}_{s}[t] \wedge d_{s}[t^{\prime}] = \textsc{commit} \wedge d[t^{\prime}] = \textsc{abort} \wedge t^{\prime} \notin P_{s}[t])

{\tt decide}(t, \_) \prec_{h} {\tt certify}(t^{\prime}, \_) \wedge s \in {\sf shards}(t) \cap {\sf shards}(t^{\prime}).

\begin{array}[]{@{}l@{}}\forall k.\,({\sf vote}[k]\mbox{ is defined})\implies\\ \exists T,P.\,{\sf vote}[k]\sqsubseteq f_{s}({\sf topload}(T,{\sf txn},{\sf payload}),{\sf payload}[k])\sqcap{}\\ \hskip 59.75095ptg_{s}({\sf topload}(P,{\sf txn},{\sf payload}),{\sf payload}[k])\wedge{}\\[2.0pt] T=\{{\sf txn}[k^{\prime}]\mid k^{\prime}<k\wedge{\sf vote}[k^{\prime}]={\textsc{commit}}\wedge{}\\ \hskip 19.91684ptd[{\sf txn}[k^{\prime}]]={\textsc{commit}}\}\setminus P\wedge{}\\[2.0pt] P\subseteq\{{\sf txn}[k^{\prime}]\mid k^{\prime}<k\wedge{\sf vote}[k^{\prime}]={\textsc{commit}}\}\wedge{}\\[2.0pt] (\forall k^{\prime}.\,{\sf txn}[k^{\prime}]\in T{\implies}\\ \hskip 48.36958pt({\tt DECISION}({\sf txn}[k^{\prime}],{\textsc{commit}})\mbox{ has been sent}))\wedge{}\\[2.0pt] (\forall k^{\prime}<k.\,{\sf vote}[k^{\prime}]={\textsc{commit}}\wedge{\sf txn}[k^{\prime}]\not\in T\cup P{\implies}\\ \hskip 48.36958pt({\tt DECISION}({\sf txn}[k^{\prime}],{\textsc{abort}})\mbox{ has been sent}));\end{array}

\begin{array}[]{@{}l@{}}{\sf vote}[k]=f_{s}(L_{1},{\sf payload}[k])\sqcap g_{s}(L_{2},{\sf payload}[k]);\\[2.0pt] L_{1}=\{{\sf payload}[k^{\prime}]\mid k^{\prime}<k\wedge{\sf phase}[k^{\prime}]=\textsc{decided}\wedge{}\\ \hskip 22.76228pt{\sf dec}[k^{\prime}]=\textsc{commit}\};\\[2.0pt] L_{2}=\{{\sf payload}[k^{\prime}]\mid k^{\prime}<k\wedge{\sf phase}[k^{\prime}]=\textsc{prepared}\wedge{}\\ \hskip 22.76228pt{\sf vote}[k^{\prime}]=\textsc{commit}\}.\end{array}

\begin{array}[]{@{}l@{}}{\sf vote}[k]\sqsubseteq f_{s}({\sf topload}(T,{\sf txn},{\sf payload}),{\sf payload}[k])\sqcap{}\\ \hskip 36.98866ptg_{s}({\sf topload}(P,{\sf txn},{\sf payload}),{\sf payload}[k])\wedge{}\\[2.0pt] T=\{{\sf txn}[k^{\prime}]\mid k^{\prime}<k\wedge{\sf phase}[k^{\prime}]=\textsc{decided}\wedge{}\\ \hskip 19.91684pt{\sf dec}[k^{\prime}]=\textsc{commit}\}\wedge{}\\[2.0pt] P=\{{\sf txn}[k^{\prime}]\mid k^{\prime}<k\wedge{\sf phase}[k^{\prime}]=\textsc{prepared}\wedge{}\\ \hskip 19.91684pt{\sf vote}[k^{\prime}]=\textsc{commit}\}.\end{array}

\begin{array}[]{@{}l@{}}T=\{{\sf txn}[k^{\prime}]\mid k^{\prime}<k\wedge{\sf vote}[k^{\prime}]=\textsc{commit}\wedge{}\\ \hskip 19.91684ptd[{\sf txn}[k^{\prime}]]=\textsc{commit}\}\setminus P\wedge{}\\[2.0pt] (\forall k^{\prime}.\,{\sf txn}[k^{\prime}]\in T{\implies}\\ \hskip 48.36958pt({\tt DECISION}({\sf txn}[k^{\prime}],\textsc{commit})\mbox{ has been sent}))\wedge{}\\[2.0pt] (\forall k^{\prime}<k.\,{\sf vote}[k^{\prime}]=\textsc{commit}\wedge{\sf txn}[k^{\prime}]\not\in T\cup P{\implies}\\ \hskip 48.36958pt({\tt DECISION}({\sf txn}[k^{\prime}],\textsc{abort})\mbox{ has been sent})),\end{array}

\begin{array}[]{@{}l@{}}d_{s}[t]\sqsubseteq f_{s}({\sf topload}(T,{\sf txn},{\sf payload}),{\sf topload}(t,{\sf txn},{\sf payload}))\sqcap{}\\ \hskip 28.45274ptg_{s}({\sf topload}(P,{\sf txn},{\sf payload}),{\sf topload}(t,{\sf txn},{\sf payload}))\wedge{}\\[2.0pt] T=\{{\sf txn}[k^{\prime}]\mid k^{\prime}<k\wedge{\sf vote}[k^{\prime}]=\textsc{commit}\wedge{}\\ \hskip 19.91684ptd[{\sf txn}[k^{\prime}]]=\textsc{commit}\}\setminus P_{s}[t]\wedge{}\\[2.0pt] P\subseteq\{{\sf txn}[k^{\prime}]\mid k^{\prime}<k\wedge{\sf vote}[k^{\prime}]=\textsc{commit}\}\wedge{}\\[2.0pt] (\forall k^{\prime}.\,{\sf txn}[k^{\prime}]\in T{\implies}\\ \hskip 48.36958pt({\tt DECISION}({\sf txn}[k^{\prime}],\textsc{commit})\mbox{ has been sent}))\wedge{}\\[2.0pt] (\forall k^{\prime}<k.\,{\sf vote}[k^{\prime}]=\textsc{commit}\wedge{\sf txn}[k^{\prime}]\not\in T\cup P{\implies}\\ \hskip 48.36958pt({\tt DECISION}({\sf txn}[k^{\prime}],\textsc{abort})\mbox{ has been sent})).\end{array}

\begin{array}[]{@{}l@{}}T_{s}[t]=T=\{t^{\prime}\mid\mathit{pos}_{s}[t^{\prime}]<\mathit{pos}_{s}[t]\wedge d[t^{\prime}]=\textsc{commit}\}\setminus P_{s}[t]\\[2.0pt] P_{s}[t]=P\setminus\{t\mid t\in P\wedge\mathit{pos}_{s}[t]\mbox{ is not defined}\}\subseteq\\ \hskip 48.36958pt\{t^{\prime}\mid\mathit{pos}_{s}[t^{\prime}]<\mathit{pos}_{s}[t]\wedge d_{s}[t^{\prime}]=\textsc{commit}\}\\[2.0pt] (\forall k^{\prime}.\,{\sf txn}[k^{\prime}]\in T_{s}[t]{\implies}\\ \hskip 48.36958pt({\tt DECISION}({\sf txn}[k^{\prime}],\textsc{commit})\mbox{ has been sent}))\wedge{}\\[2.0pt] (\forall k^{\prime}<k.\,{\sf vote}[k^{\prime}]=\textsc{commit}\wedge{\sf txn}[k^{\prime}]\not\in T_{s}[t]\cup P_{s}[t]{\implies}\\ \hskip 48.36958pt({\tt DECISION}({\sf txn}[k^{\prime}],\textsc{abort})\mbox{ has been sent})),\end{array}

d_{s}[t] \sqsubseteq f_{s}(\mathit{pload}_{s}(T_{s}[t]), \mathit{pload}_{s}[t]) \sqcap g_{s}(\mathit{pload}_{s}(P_{s}[t]), \mathit{pload}_{s}[t]),

\mathit{pos}_{s}[t^{\prime}] < \mathit{pos}_{s}[t] \wedge d_{s}[t^{\prime}] = \textsc{commit} \wedge d[t^{\prime}] = \textsc{abort} \wedge t^{\prime} \notin P_{s}[t].

Full text

Reconfigurable Atomic Transaction Commit (Extended Version)

Manuel Bravo

IMDEA Software Institute

and

Alexey Gotsman

IMDEA Software Institute

(2019)

Abstract.

Modern data stores achieve scalability by partitioning data into shards and fault-tolerance by replicating each shard across several servers. A key component of such systems is a Transaction Certification Service (TCS), which atomically commits a transaction spanning multiple shards. Existing TCS protocols require $2f+1$ crash-stop replicas per shard to tolerate $f$ failures. In this paper we present atomic commit protocols that require only $f+1$ replicas and reconfigure the system upon failures using an external reconfiguration service. We furthermore rigorously prove that these protocols correctly implement a recently proposed TCS specification. We present protocols in two different models—the standard asynchronous message-passing model and a model with Remote Direct Memory Access (RDMA), which allows a machine to access the memory of another machine over the network without involving the latter’s CPU. Our protocols are inspired by a recent FARM system for RDMA-based transaction processing. Our work codifies the core ideas of FARM as distributed TCS protocols, rigorously proves them correct and highlights the trade-offs required by the use of RDMA.

Keywords: Atomic commit, vertical Paxos, RDMA.

Published in: 2019 ACM Symposium on Principles of Distributed Computing (PODC ’19), July 29-August 2, 2019, Toronto, ON, Canada. Copyright: ACM licensed. DOI: 10.1145/3293611.3331590. ISBN: 978-1-4503-6217-7/19/07. CCS concepts: Theory of computation → Distributed algorithms; Theory of computation → Distributed computing models.

1. Introduction

Modern data stores are often required to manage massive amounts of data while providing stringent transactional guarantees to their users. They achieve scalability by partitioning data into independently managed shards (aka partitions) and fault-tolerance by replicating each shard across a set of servers (spanner, ; scatter, ; gdur, ). Such data stores often use optimistic concurrency control (wv, ), where a transaction is first executed speculatively, and the results (e.g., read and write sets) are then certified to determine whether the transaction can commit or must abort because of a conflict with concurrent transactions. The certification is implemented by a Transaction Certification Service (TCS), which accepts a stream of transactions and outputs decisions based on a given certification function, defining the concurrency-control check for the desired isolation level. TCS is the most challenging part of transaction processing in systems with the above architecture, since it requires solving a distributed agreement problem among the replicated shards participating in the transaction. This agreement problem has been recently formalized as the multi-shot commit problem (discpaper, ), generalizing the classical atomic commit problem (dwork-skeen, ) to more faithfully reflect the requirements of modern transaction processing systems (we review the new problem statement in §2).

Most existing solutions to the TCS problem require replicating each shard among $2f+1$ replicas to tolerate $f$ crash-stop failures within each shard (spanner, ; scatter, ; uw-inconsistent, ; mdcc, ), which allows using a replication protocol such as Paxos (paxos, ). This is expensive: if transaction data are written to all replicas of the shard, only $f+1$ replicas are needed for the data to survive failures. However, in this case even a single replica failure blocks transaction processing, so to recover we need to reconfigure the system, i.e., change its membership to replace failed replicas with fresh ones. Unfortunately, processes concurrently deciding to reconfigure the system need to be able to agree on the next configuration; this reduces to solving consensus, which again requires $2f+1$ replicas (lower-bound, ). The way out of this conundrum is to use a separate configuration service with $2f+1$ replicas to perform consensus on the configuration. In this way, we use $2f+1$ replicas only to store the small amount of information about the configuration and $f+1$ replicas to store the actual data. This vertical approach (vertical-paxos, ), which layers replication on top of a configuration service, has been used by a number of practical systems (corfu, ; bigtable, ; farm, ). It is particularly suitable for deployment in local-area networks, where the configuration service can be reached quickly.

In this paper we propose the first rigorously proven protocols for implementing a TCS in a vertical system, with $f+1$ replicas per shard and an external configuration service. We present protocols in two different models—the standard asynchronous message-passing model (§3) and a model with Remote Direct Memory Access (RDMA), which allows a machine to access the memory of another machine over the network without involving the latter’s CPU (§5). Our protocols are parametric in the isolation level provided, and we prove that they correctly implement the TCS specification from the multi-shot commit problem (discpaper, ) (§4).

Our work complements and takes its inspiration from the recent FARM system (farm, ; farm2, ), a transaction processing system that achieves impressive scalability and availability by exploiting RDMA and the vertical approach. FARM currently forms the core of a graph database used to serve some of the search queries in Microsoft Bing. It is a complex system that includes a number of optimizations, both specific to RDMA and not. FARM’s design was presented without a rigorous proof of correctness, and it did not highlight which features are motivated by the use of RDMA and which are inherent to the vertical approach. Our work provides a theoretical complement to FARM: we codify its core ideas as distributed transaction commit protocols and rigorously prove them correct with respect to the TCS specification. By basing our protocols on a principled footing, we are also able to provide better fault-tolerance guarantees than FARM. Finally, by presenting two related protocols using message passing and RDMA, we determine the trade-offs required by the use of RDMA.

In more detail, a straightforward way to implement TCS is using the classical two-phase commit (2PC) protocol (2pc, ). Since 2PC is not fault-tolerant, we can make each shard simulate a reliable process in 2PC using a replication protocol such as Paxos (spanner, ; scatter, ). This vanilla approach requires every 2PC action to be replicated using Paxos, which results in a high latency (7 message delays to learn a decision on a transaction (replicated-commit, )) and a high load on Paxos leaders. To improve on this, our protocol combines 2PC and Vertical Paxos (vertical-paxos, ) into one coherent protocol, thereby minimizing the latency and load on Paxos leaders. Upon a failure inside a shard, we use the reconfiguration service to replace the failed replicas, as in Vertical Paxos. This reconfiguration interacts nontrivially with the 2PC part of the protocol: e.g., reconfiguration may lead to losing undecided transactions that affected 2PC computations of decisions on other transactions—a behavior that we nevertheless show to be correct. Finally, we show that the price of exploiting RDMA to efficiently write transaction data to replicas is that reconfiguration has to be performed globally, instead of per-shard: when reconfiguring a shard, we have to ensure that the whole system is aware of the configuration before activating it.

2. Transaction Certification Service

Service interface and certification functions.

A Transaction Certification Service (TCS) is meant to be used in the context of transactional processing systems with optimistic concurrency control (wv, ), where transactions are first executed speculatively, and the results are submitted for certification to the TCS. We start by reviewing its specification proposed in (discpaper, ). Clients invoke the TCS using requests of the form ${\tt certify}(t,\mathit{l})$ , where $t\in\mathcal{T}$ is a unique transaction identifier and $\mathit{l}\in\mathcal{L}$ is the transaction payload, which carries the results of the optimistic execution of the transaction (e.g., read and write sets). Responses of the service are of the form ${\tt decide}(t,d)$ , where $d\in\mathcal{D}=\{\textsc{abort},\textsc{commit}\}$ . A TCS is specified using a certification function $f:2^{\mathcal{L}}\times\mathcal{L}\to\mathcal{D}$ , which encapsulates the concurrency-control policy for the desired isolation level. The result $f(\mathit{L},\mathit{l})$ is the decision for the transaction with payload $\mathit{l}$ given the set of payloads $\mathit{L}$ of the previously committed transactions. We require $f$ to be distributive in the following sense:

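$$\forall\mathit{L}_{1},\mathit{L}_{2},\mathit{l}.\,f(\mathit{L}_{1}\cup\mathit{L}_{2},\mathit{l})=f(\mathit{L}_{1},\mathit{l})\sqcap f(\mathit{L}_{2},\mathit{l}),\tag{1}$$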

where $\sqcap$ is such that $\textsc{commit}\sqcap\textsc{commit}=\textsc{commit}$ and $d\sqcap\textsc{abort}=\textsc{abort}$ for any $d$ . This requirement is justified by the fact that common definitions of $f(\mathit{L},\mathit{l})$ check $\mathit{l}$ for conflicts against each transaction in $\mathit{L}$ separately.
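The distributivity requirement can be illustrated with a small Python sketch. This is only an illustration under assumptions: the names `Decision`, `meet` and `make_certifier` are hypothetical, and the toy payloads (sets of object names that conflict when they overlap) are not the paper's payload format.

```python
from enum import Enum


class Decision(Enum):
    ABORT = 0
    COMMIT = 1


def meet(d1: Decision, d2: Decision) -> Decision:
    """The meet operator: COMMIT only if both arguments are COMMIT."""
    if d1 is Decision.COMMIT and d2 is Decision.COMMIT:
        return Decision.COMMIT
    return Decision.ABORT


def make_certifier(conflicts):
    """Build f(L, l) that checks l against each payload in L separately.

    `conflicts` is a hypothetical pairwise predicate; any f built this way
    is distributive by construction.
    """
    def f(committed, payload):
        decision = Decision.COMMIT
        for previous in committed:
            verdict = Decision.ABORT if conflicts(previous, payload) else Decision.COMMIT
            decision = meet(decision, verdict)
        return decision
    return f


# Toy payloads (an assumption): sets of object names; overlap means conflict.
f = make_certifier(lambda previous, payload: bool(previous & payload))

L1 = {frozenset({"x"}), frozenset({"y"})}
L2 = {frozenset({"z"})}
l = frozenset({"z", "w"})

lhs = f(L1 | L2, l)               # certify against the union
rhs = meet(f(L1, l), f(L2, l))    # certify against each part, then combine
print(lhs, rhs)
```

Because `f` inspects each previously committed payload independently, splitting the committed set and combining the two results with `meet` gives the same decision as certifying against the union.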

As an example, consider a transactional system managing objects from ${\sf Obj}$ with values from ${\sf Val}$ , where transactions can execute reads and writes on the objects. The objects are associated with a totally ordered set ${\sf Ver}$ of versions. Then the payload of a transaction $t$ is a triple $\langle R,W,V_{c}\rangle$ . Here the read set $R\subseteq{\sf Obj}\times{\sf Ver}$ is the set of objects with their versions that $t$ read, which contains one version per object. The write set $W\subseteq{\sf Obj}\times{\sf Val}$ is the set of objects with their values that $t$ wrote, which contains one value per object. We require that any object written has also been read: $\forall(x,\_)\in W.\,(x,\_)\in R$ . Finally, the commit version $V_{c}\in{\sf Ver}$ is the version to be assigned to the writes of $t$ . We require this version to be higher than any of the versions read: $\forall(\_,v)\in R.\,V_{c}>v$ . Given this domain of transactions, the following certification function encapsulates the classical concurrency-control policy for serializability (wv, ): $f(\mathit{L},\langle R,W,V_{c}\rangle)=\textsc{commit}$ iff none of the versions in $R$ have been overwritten by a transaction in $\mathit{L}$ , i.e.,

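$$\forall x,v.\,(x,v)\in R\implies(\forall\langle\_,W^{\prime},V^{\prime}_{c}\rangle\in\mathit{L}.\,(x,\_)\in W^{\prime}\implies V^{\prime}_{c}\leq v).\tag{2}$$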
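A minimal Python sketch of the serializability check described above may help make it concrete. The `Payload` type, the string constants and the sample transactions are assumptions made for illustration; only the rule itself (abort when a version that was read has since been overwritten by a committed transaction) comes from the text.

```python
from typing import FrozenSet, NamedTuple, Tuple

COMMIT, ABORT = "COMMIT", "ABORT"


class Payload(NamedTuple):
    reads: FrozenSet[Tuple[str, int]]   # R: (object, version read)
    writes: FrozenSet[Tuple[str, str]]  # W: (object, value written)
    commit_version: int                 # V_c


def f(committed, payload):
    """COMMIT iff no version read by `payload` was overwritten in `committed`."""
    for (x, v) in payload.reads:
        for previous in committed:
            wrote_x = any(obj == x for (obj, _) in previous.writes)
            if wrote_x and previous.commit_version > v:
                return ABORT
    return COMMIT


# Hypothetical transactions on a single object "x".
t1 = Payload(frozenset({("x", 0)}), frozenset({("x", "a")}), 1)
t2 = Payload(frozenset({("x", 0)}), frozenset({("x", "b")}), 2)  # stale read
t3 = Payload(frozenset({("x", 1)}), frozenset({("x", "c")}), 2)  # fresh read

print(f(set(), t1), f({t1}, t2), f({t1}, t3))
```

Here `t2` read version 0 of `x`, which `t1` has overwritten with version 1, so it is rejected; `t3` read the version written by `t1` and passes.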

TCS specification.

We represent TCS executions using histories—sequences of ${\tt certify}$ and ${\tt decide}$ actions such that every transaction appears at most once in ${\tt certify}$ , and each ${\tt decide}$ is a response to exactly one preceding ${\tt certify}$ . For a history $h$ we let ${\sf act}(h)$ be the set of actions in $h$ . For actions $a,a^{\prime}\in{\sf act}(h)$ , we write $a\prec_{h}a^{\prime}$ when $a$ occurs before $a^{\prime}$ in $h$ . A history $h$ is complete if every ${\tt certify}$ action in it has a matching ${\tt decide}$ action. A complete history is sequential if it consists of pairs of ${\tt certify}$ and matching ${\tt decide}$ actions. A transaction $t$ commits in a history $h$ if $h$ contains ${\tt decide}(t,\textsc{commit})$ . We denote by ${\sf committed}(h)$ the projection of $h$ to actions corresponding to the transactions that are committed in $h$ . For a complete history $h$ , a linearization $\ell$ of $h$ (linearizability, ) is a sequential history such that $h$ and $\ell$ contain the same actions and

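$$\forall t,t^{\prime}.\,{\tt decide}(t,\_)\prec_{h}{\tt certify}(t^{\prime},\_)\implies{\tt decide}(t,\_)\prec_{\ell}{\tt certify}(t^{\prime},\_).$$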

A complete sequential history $h$ is legal with respect to a certification function $f$ , if its decisions are computed according to $f$ :

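$$\forall t,\mathit{l},d.\,{\tt certify}(t,\mathit{l}),{\tt decide}(t,d)\in{\sf act}(h)\implies d=f(\{\mathit{l}^{\prime}\mid{\tt certify}(t^{\prime},\mathit{l}^{\prime})\in{\sf act}(h)\wedge{\tt decide}(t^{\prime},\textsc{commit})\prec_{h}{\tt decide}(t,d)\},\mathit{l}).$$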

A history $h$ is correct with respect to $f$ if $h\mid{\sf committed}(h)$ has a legal linearization. A TCS implementation is correct with respect to $f$ if so are all its histories.
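The legality condition for a complete sequential history can be sketched as a simple replay. The encoding of actions as `(kind, transaction, argument)` tuples, the function name `is_legal` and the toy certification function are all assumptions for illustration; the sketch does not attempt to search for linearizations of concurrent histories.

```python
COMMIT, ABORT = "COMMIT", "ABORT"


def is_legal(history, f):
    """Replay a complete sequential history and compare every decision with f.

    Each decision must equal f applied to the payloads of the transactions
    that were decided COMMIT earlier in the history.
    """
    payload_of = {}
    committed = set()
    for kind, txn, arg in history:
        if kind == "certify":
            payload_of[txn] = arg
        else:  # kind == "decide"
            if arg != f(frozenset(committed), payload_of[txn]):
                return False
            if arg == COMMIT:
                committed.add(payload_of[txn])
    return True


def toy_f(committed, payload):
    """Toy certification function (an assumption): overlap means conflict."""
    return ABORT if any(p & payload for p in committed) else COMMIT


good = [
    ("certify", "t1", frozenset({"x"})), ("decide", "t1", COMMIT),
    ("certify", "t2", frozenset({"x", "y"})), ("decide", "t2", ABORT),
    ("certify", "t3", frozenset({"y"})), ("decide", "t3", COMMIT),
]
# Same history, except that t2 is wrongly allowed to commit.
bad = good[:3] + [("decide", "t2", COMMIT)] + good[4:]

print(is_legal(good, toy_f), is_legal(bad, toy_f))
```

Aborted transactions are not added to `committed`, which mirrors the fact that the argument of `f` ranges only over payloads of previously committed transactions.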

A TCS implementation satisfying the above specification can be readily used in a transaction processing system. For example, consider the domain of transactions defined earlier. A typical system based on optimistic concurrency control will ensure that transactions submitted for certification only read versions written by previously committed transactions. A history produced by such a system that is correct with respect to certification function (2) is also serializable (discpaper, ). Hence, a TCS correct with respect to this certification function can indeed be used to implement serializability.

Shard-local certification functions.

We are interested in TCS implementations in systems where the data are partitioned into shards from a set $\mathcal{S}$ . In such systems TCS is usually implemented using a variant of the classical two-phase commit protocol (2PC) (2pc, ). In this protocol each shard $s$ receiving a transaction for certification first prepares it, i.e., performs a local concurrency-control check and accordingly votes to commit or abort the transaction. The votes on the transaction by different shards are aggregated, and the final decision is then distributed to all shards: the transaction can commit only if all votes are commit. When a shard $s$ votes on a transaction, it does not have information about all transactions in the system, but only those that concern it. Hence, the votes are computed using not the global certification function $f$ , but shard-local certification functions (discpaper, ), which check for conflicts only on objects managed by the shard and correspondingly take as parameters only the parts of the transaction payloads relevant to the shard: for a payload $\mathit{l}$ we denote this by $\mathit{l}\mid s$ . For example, let ${\sf Obj}_{s}$ be the set of objects managed by a shard $s$ . For a payload $\mathit{l}=\langle R,W,V_{c}\rangle$ of the form given above, we let $\mathit{l}\mid s=\langle R^{s},W^{s},V_{c}\rangle$ , where $R^{s}=\{(x,\_)\in R\mid x\in{\sf Obj}_{s}\}$ and $W^{s}=\{(x,\_)\in W\mid x\in{\sf Obj}_{s}\}$ . There are two shard-local functions, $f_{s}:2^{\mathcal{L}}\times\mathcal{L}\to\mathcal{D}$ and $g_{s}:2^{\mathcal{L}}\times\mathcal{L}\to\mathcal{D}$ . As its first argument $f_{s}$ takes the set of shard-relevant payloads of transactions that previously committed at the shard, and $g_{s}$ the set of such payloads for transactions that have been prepared to commit. As their second argument, the functions take the part of the payload of the transaction being certified relevant to the shard. 
We require that these functions are distributive, similarly to (1).
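The projection of a payload onto a shard and the aggregation of votes can be sketched as follows. The tuple encoding of payloads, the helper names `project` and `aggregate`, and the two-shard layout are assumptions for illustration only.

```python
COMMIT, ABORT = "COMMIT", "ABORT"


def project(payload, shard_objects):
    """Restrict a payload (R, W, V_c) to the objects managed by one shard."""
    reads, writes, commit_version = payload
    return (
        frozenset((x, v) for (x, v) in reads if x in shard_objects),
        frozenset((x, val) for (x, val) in writes if x in shard_objects),
        commit_version,
    )


def aggregate(votes):
    """Final decision: COMMIT only if every involved shard voted COMMIT."""
    return COMMIT if all(v == COMMIT for v in votes) else ABORT


# Hypothetical layout: shard s1 manages x, shard s2 manages y.
objects_of = {"s1": {"x"}, "s2": {"y"}}
payload = (frozenset({("x", 0), ("y", 3)}), frozenset({("y", "new")}), 4)

part_s1 = project(payload, objects_of["s1"])
part_s2 = project(payload, objects_of["s2"])
print(part_s1, part_s2, aggregate([COMMIT, ABORT]))
```

Each shard sees only its own part of the read and write sets, while the commit version is passed through unchanged.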

For example, the shard-local certification functions for serializability are defined as follows: $f_{s}(\mathit{L},\langle R,W,V_{c}\rangle)=\textsc{commit}$ iff

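$$\forall x\in{\sf Obj}_{s}.\,\forall v.\,(x,v)\in R\implies(\forall\langle\_,W^{\prime},V^{\prime}_{c}\rangle\in\mathit{L}.\,(x,\_)\in W^{\prime}\implies V^{\prime}_{c}\leq v),$$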

and $g_{s}(\mathit{L},\langle R,W,V_{c}\rangle)=\textsc{commit}$ iff

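$$\begin{array}{l}\forall x\in{\sf Obj}_{s}.\,\forall v.\,((x,\_)\in R\implies(\forall\langle\_,W^{\prime},\_\rangle\in\mathit{L}.\,(x,\_)\not\in W^{\prime}))\wedge{}\\[2pt] \qquad((x,\_)\in W\implies(\forall\langle R^{\prime},\_,\_\rangle\in\mathit{L}.\,(x,\_)\not\in R^{\prime})).\end{array}$$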

The function $f_{s}$ certifies a transaction $t$ against previously committed transactions similarly to the certification function (2) for serializability, but taking into account only the objects managed by the shard $s$ . The function $g_{s}$ certifies $t$ against transactions prepared to commit, and its check is stricter than that of $f_{s}$ . In our example, the function $g_{s}$ aborts a transaction $t$ if: (i) it read an object written by a transaction $t^{\prime}$ prepared to commit; or (ii) it writes to an object read by a transaction $t^{\prime}$ prepared to commit. This reflects the behaviour of typical implementations, which upon preparing a transaction acquire read locks on its read set and write locks on its write set, and abort the transaction if the locks cannot be acquired.
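The two shard-local checks for serializability can be sketched in Python as below. Payloads are assumed to be already restricted to the shard and encoded as `(reads, writes, commit_version)` tuples; the helper names and the `vote` function, which combines the two checks so that a transaction gets a commit vote only if both pass, are illustrative assumptions.

```python
COMMIT, ABORT = "COMMIT", "ABORT"


def objects(pairs):
    """Object names appearing in a read set or a write set."""
    return {x for (x, _) in pairs}


def f_s(committed, payload):
    """Check against committed transactions: no version read was overwritten."""
    reads, _writes, _vc = payload
    for (x, v) in reads:
        for (_prev_reads, prev_writes, prev_vc) in committed:
            if x in objects(prev_writes) and prev_vc > v:
                return ABORT
    return COMMIT


def g_s(prepared, payload):
    """Stricter check against transactions that are prepared to commit."""
    reads, writes, _vc = payload
    for (prev_reads, prev_writes, _prev_vc) in prepared:
        if objects(reads) & objects(prev_writes):   # (i) reads an object it writes
            return ABORT
        if objects(writes) & objects(prev_reads):   # (ii) writes an object it read
            return ABORT
    return COMMIT


def vote(committed, prepared, payload):
    """Commit vote only if both shard-local checks pass."""
    ok = f_s(committed, payload) == COMMIT and g_s(prepared, payload) == COMMIT
    return COMMIT if ok else ABORT


# Hypothetical payloads on a shard that manages objects x and y.
t1 = (frozenset({("x", 0)}), frozenset({("x", "a")}), 1)
t2 = (frozenset({("x", 0)}), frozenset(), 1)             # reads x
t3 = (frozenset({("y", 0)}), frozenset({("y", "b")}), 1)  # touches only y

print(g_s([t1], t2), g_s([t1], t3), vote([], [t1], t3))
```

The `g_s` check behaves like lock acquisition: `t2` cannot take a read lock on `x` while the prepared transaction `t1` holds a write lock on it, whereas `t3` touches a different object and is unaffected.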

For a sharded TCS implementation to be correct, shard-local functions have to match the global certification function, i.e., perform similar conflict checks. We formalize the required conditions as follows. Assume a function ${\sf shards}:\mathcal{T}\to 2^{\mathcal{S}}$ that determines the shards that need to certify a transaction with a given identifier, which are usually the shards storing the data the transaction accesses. We also assume a distinguished empty payload $\varepsilon\in\mathcal{L}$ such that $\forall s,\mathit{L}.\,f_{s}(L,\varepsilon)=\textsc{commit}$ . For example, for a payload $\mathit{l}=\langle R,W,\_\rangle$ of the form given above, $\mathit{l}=\varepsilon$ is such that $R=\emptyset$ and $W=\emptyset$ . We require that for a transaction $t\in\mathcal{T}$ with payload $\mathit{l}\in\mathcal{L}$ , for each shard $s\not\in{\sf shards}(t)$ , we have $\mathit{l}\mid s=\varepsilon$ . We further lift the $\mid$ operator to sets of payloads: for any $\mathit{L}\subseteq\mathcal{L}$ we let $(\mathit{L}\mid s)=\{(\mathit{l}\mid s)\mid\mathit{l}\in\mathit{L}\}$ . Then we require that global and local certification functions match as follows:

[TABLE]

Finally, for each shard $s\in\mathcal{S}$ , the two functions $f_{s}$ and $g_{s}$ are required to be related to each other as follows (discpaper, ):

[TABLE]

Property (4) requires the conflict check performed by $g_{s}$ to be no weaker than the one performed by $f_{s}$ . Property (5) requires a form of commutativity: if a transaction with payload $l^{\prime}$ is allowed to commit after a still-pending transaction with payload $l$ , then the latter would be allowed to commit after the former.
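Read literally, the prose descriptions of the two properties suggest the following shape. This is a reconstruction from the descriptions above and should be treated as an assumption, not as the exact statement of (4) and (5):

```latex
% Assumed reading of Property (4): g_s is no weaker than f_s.
\forall \mathit{L}, \mathit{l}.\;
  g_{s}(\mathit{L}, \mathit{l}) = \textsc{commit}
  \implies f_{s}(\mathit{L}, \mathit{l}) = \textsc{commit}

% Assumed reading of Property (5): a form of commutativity.
\forall \mathit{l}, \mathit{l}'.\;
  g_{s}(\{\mathit{l}\}, \mathit{l}') = \textsc{commit}
  \implies f_{s}(\{\mathit{l}'\}, \mathit{l}) = \textsc{commit}
```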

3. Atomic Commit Protocol

System model.

We consider an asynchronous message-passing system consisting of a set of processes $\mathcal{P}$ which may fail by crashing, i.e., permanently stopping execution. We assume that processes are connected by reliable FIFO channels: messages are delivered in FIFO order, and messages between non-faulty processes are guaranteed to be eventually delivered. A function ${\sf client}:\mathcal{T}\to\mathcal{P}$ determines the client process that issued a given transaction.

Each shard $s\in\mathcal{S}$ is managed by a group of replica processes, whose membership can change over time. For simplicity, we assume that the groups of replica processes managing different shards are disjoint. Each shard moves through a sequence of configurations, determining its membership. Reconfiguration is the process of changing the configuration of a shard. In our protocols reconfiguration is initiated by a replica when it suspects another replica of failing: for simplicity we do not expose it in the TCS interface. Every member of a shard in a given configuration is either the leader of the shard or a follower. A configuration of a shard $s$ is then a tuple $\langle\mathit{e},\mathit{M},p_{l}\rangle$ where $\mathit{e}$ is the epoch identifying the configuration, $\mathit{M}\in 2^{\mathcal{P}}$ is the set of processes that manage $s$ at $\mathit{e}$ , and $p_{l}\in\mathit{M}$ is the leader of $s$ at $\mathit{e}$ .

Configurations are stored in an external configuration service (CS), which for simplicity we assume to be a reliable process. In practice, this service may be implemented using Paxos-like replication over $2f+1$ processes out of which at most $f$ can fail (as done in systems such as Zookeeper (zookeeper, )). The configuration service stores the configurations of all shards and provides three operations. An operation compare_and_swap $(s,e,\langle e^{\prime},\mathit{M},p_{l}\rangle)$ succeeds if the epoch of the last stored configuration of $s$ is $e$ ; in this case it stores the provided configuration with a higher epoch $e^{\prime}>e$ . Operations get_last $(s)$ and get $(s,e)$ respectively return the last configuration of $s$ and the configuration of $s$ associated with a given epoch $e$ .
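The interface of the configuration service can be sketched as a single reliable in-memory object. The class and field names below are hypothetical, and the convention that epoch 0 denotes a shard with no stored configuration is an assumption made for the sketch.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional


@dataclass(frozen=True)
class Config:
    epoch: int
    members: FrozenSet[str]
    leader: str


class ConfigService:
    """In-memory stand-in for the external configuration service."""

    def __init__(self) -> None:
        self._store: Dict[str, Dict[int, Config]] = {}

    def get_last(self, shard: str) -> Optional[Config]:
        configs = self._store.get(shard)
        return configs[max(configs)] if configs else None

    def get(self, shard: str, epoch: int) -> Optional[Config]:
        return self._store.get(shard, {}).get(epoch)

    def compare_and_swap(self, shard: str, expected_epoch: int, new: Config) -> bool:
        last = self.get_last(shard)
        last_epoch = last.epoch if last else 0  # assumption: 0 means "nothing stored"
        # Succeed only if the last stored epoch is the expected one
        # and the new configuration carries a strictly higher epoch.
        if last_epoch != expected_epoch or new.epoch <= expected_epoch:
            return False
        if new.leader not in new.members:
            return False
        self._store.setdefault(shard, {})[new.epoch] = new
        return True
```

A reconfiguring process whose compare_and_swap fails learns that a concurrent reconfiguration has already advanced the epoch.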

Protocol preliminaries.

We give the pseudocode of our protocol in Figure 1, illustrate its message flow in Figure 2 and summarize the key invariants used in its proof of correctness in Figure 3. The protocol weaves together the two-phase commit protocol across shards (2pc, ) and a Vertical Paxos-based reconfiguration protocol within each shard (vertical-paxos, ). At any given time, a process participates in a single configuration of the shard it belongs to. The process stores the information about this configuration as well as those of other shards in several arrays: configuration epochs are stored in an array ${\sf epoch}\in\mathcal{S}\to\mathbb{N}$ , the current members in ${\sf members}\in\mathcal{S}\to 2^{\mathcal{P}}$ , and the current leader in ${\sf leader}\in\mathcal{S}\to\mathcal{P}$ . The entries for the shard the process belongs to give the configuration the process is in; the other entries maintain information about the configurations of the other shards. A ${\sf status}$ variable at a process records whether it is a leader, a follower or is in a special reconfiguring state used during reconfiguration. Each process keeps track of the status of transactions in an array ${\sf phase}$ , whose entries initially store start. The transaction status changes to prepared when the shard determines its vote and to decided when a final decision on the transaction is reached.

Failure-free case.

A client submits a transaction for certification by calling the certify function at any replica process, which will serve as the coordinator of the transaction (line 1). The function takes as arguments the transaction’s identifier and its payload. The transaction coordinator first sends a ${\tt PREPARE}$ message to the leaders of the relevant shards, which includes the payload part for each shard (line 1). The leader of a shard arranges all transactions it receives into a total certification order, which the leader stores in an array ${\sf txn}\in\mathbb{N}\to\mathcal{T}$ ; a ${\sf next}\in\mathbb{Z}$ variable points to the last filled slot in the array. When the leader receives a ${\tt PREPARE}$ message for a transaction for the first time (line 1), it appends the transaction to the certification order, stores the transaction’s payload in an array ${\sf payload}\in\mathbb{N}\to\mathcal{L}$ , and sets the transaction’s phase to prepared. It then computes a vote on the transaction and stores it in an array ${\sf vote}\in\mathbb{N}\to\{\textsc{commit},\textsc{abort}\}$ (line 1). The vote is computed using the shard-local certification functions $f_{s_{0}}$ and $g_{s_{0}}$ to check for conflicts against transactions that have been previously committed or prepared to commit; the results are combined using the $\sqcap$ operator, so that the transaction can commit only if both functions say so. We defer the description of the cases when the leader has previously received the transaction in the ${\tt PREPARE}$ message (line 1) and when the payload in the message is an undefined value $\bot$ (line 1).
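The leader's handling of a first-time ${\tt PREPARE}$ can be sketched as below. This is a simplification, not the pseudocode of Figure 1: the class `ShardLeader`, its method names and the `decision` map are hypothetical, the certification functions are passed in as parameters, and the length of the list plays the role of ${\sf next}$.

```python
from typing import Callable, Dict, List, Sequence, Tuple

COMMIT, ABORT = "commit", "abort"
CertFn = Callable[[Sequence[frozenset], frozenset], str]


def meet(a: str, b: str) -> str:
    """The meet operator: commit only if both arguments are commit."""
    return COMMIT if a == COMMIT and b == COMMIT else ABORT


class ShardLeader:
    def __init__(self, f_s: CertFn, g_s: CertFn) -> None:
        self.f_s, self.g_s = f_s, g_s
        self.txn: List[str] = []            # certification order
        self.payload: List[frozenset] = []
        self.vote: List[str] = []
        self.decision: Dict[int, str] = {}  # slot -> final decision, once known

    def on_prepare(self, t: str, l: frozenset) -> Tuple[int, str]:
        if t in self.txn:                   # repeated PREPARE: reuse the stored vote
            k = self.txn.index(t)
            return k, self.vote[k]
        committed = [self.payload[k] for k, d in self.decision.items() if d == COMMIT]
        prepared = [self.payload[k] for k in range(len(self.txn))
                    if k not in self.decision and self.vote[k] == COMMIT]
        v = meet(self.f_s(committed, l), self.g_s(prepared, l))
        self.txn.append(t)
        self.payload.append(l)
        self.vote.append(v)
        return len(self.txn) - 1, v

    def on_decision(self, k: int, d: str) -> None:
        self.decision[k] = d
```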

Our protocol next replicates the leader’s decision and the transaction payload at the followers. Instead of having the leader do this directly, the protocol delegates this task to the coordinator of the transaction. This design is used by practical systems, such as Corfu (corfu, ) and FARM (farm, ), since it minimizes the load on the leaders, which are the main potential performance bottleneck. Instead, the network-intensive task of persisting transactions at multiple followers is spread among a number of different transaction coordinators. As we explain in the following, this optimization interacts in a nontrivial way with transaction certification. In more detail, after preparing a transaction the leader sends a ${\tt PREPARE\_ACK}$ message to the coordinator of the transaction, which carries the leader’s epoch, the transaction identifier, its position in the certification order, the payload, and the vote (line 1). Upon receiving the ${\tt PREPARE\_ACK}$ message (line 1), the coordinator forwards the data from the ${\tt PREPARE\_ACK}$ message to the followers in an ${\tt ACCEPT}$ message.

A process handles an ${\tt ACCEPT}$ message only if it participates in the corresponding epoch (line 1). The process stores the transaction identifier, its payload and vote, and advances the transaction’s phase to prepared. It then sends an ${\tt ACCEPT\_ACK}$ message to the coordinator of the transaction, confirming that the process has accepted the transaction and the vote. The certification order at a follower is always a prefix with zero or more holes of the certification order at the leader of the epoch the follower is in, as formalized by Invariant 1 (Figure 3). The holes in the prefix arise from the lack of FIFO ordering in the communication between the leader of a given epoch and its followers, as the ${\tt ACCEPT}$ message for a given transaction is sent to the followers by the coordinator of the transaction and not directly by the leader.

The coordinator of a transaction $t$ acts once it receives ${\tt ACCEPT\_ACK}$ messages for $t$ from every follower of its shards $s\in{\sf shards}(t)$ (line 1); it determines this using the configuration information it stores for every shard. The coordinator computes the final decision on $t$ using the $\sqcap$ operator on the votes of each involved shard: the transaction can commit if all votes are commit. The coordinator then sends the final decision in ${\tt DECISION}$ messages to the client and to each of the relevant shards. When a process receives a decision for a transaction (line 1), it stores the decision and advances the transaction’s phase to decided. In a realistic implementation, at this point the process would also upcall into the transaction processing system running at its server, to inform it about the decision and allow it to apply the transaction’s writes to the database if the decision is to commit.
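The coordinator's rule for computing the final decision can be sketched as a small function. The name `decide` and the dictionary-based arguments are hypothetical; the keys of `votes` are assumed to be exactly the shards in ${\sf shards}(t)$.

```python
from typing import Dict, Optional, Set


def decide(votes: Dict[str, str],
           acks: Dict[str, Set[str]],
           followers: Dict[str, Set[str]]) -> Optional[str]:
    """Return the final decision, or None while some follower has not acknowledged.

    votes:     shard -> vote of its leader
    acks:      shard -> followers from which ACCEPT_ACK has been received
    followers: shard -> followers in the coordinator's locally stored configuration
    """
    for shard in votes:
        if not followers[shard] <= acks.get(shard, set()):
            return None  # keep waiting for the remaining ACCEPT_ACK messages
    # Meet of all votes: commit only if every involved shard voted commit.
    return "commit" if all(v == "commit" for v in votes.values()) else "abort"
```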

In the absence of failures, our protocol allows the client to learn a decision on a transaction in 5 message delays, instead of the 7 required by vanilla protocols that use Paxos as a black box (spanner, ; scatter, ). We can further reduce this to 4 by co-locating the client with the transaction coordinator. The protocol also minimizes the load on Paxos leaders, which are the main potential bottleneck: each involved leader only has to receive one ${\tt PREPARE}$ and one ${\tt DECISION}$ message, and send one ${\tt PREPARE\_ACK}$ message.

Reconfiguration.

When a failure is suspected in a shard $s$ , any process can initiate a reconfiguration of the shard to replace failed replicas. Reconfiguration is done only in the affected shard, without disrupting others. It aims to preserve Invariant 2, which is key in proving the correctness of the protocol. This assumes that all followers in $s$ at an epoch $\mathit{e}$ have received ${\tt ACCEPT}(\mathit{e},k,t,l,d)$ and responded to it with ${\tt ACCEPT\_ACK}$ ; in this case we say that the transaction $t$ has been accepted at shard $s$ . The invariant guarantees that the accepted transaction $t$ will persist in epochs higher than $\mathit{e}$ ; this is used to prove that the protocol computes a unique decision on each transaction. The invariant also guarantees that the entries preceding $t$ in the certification order in epochs higher than $\mathit{e}$ may only contain the votes that the leader of $s$ at epoch $\mathit{e}$ took into account when computing the vote $d$ on $t$ (some of these votes may be missing due to the lack of FIFO order in the communication between the leader and its followers). This property is necessary to guarantee that the protocol computes decisions according to a single global certification order, as required by the TCS specification.

To ensure Invariant 2, a process performing reconfiguration first probes previous configurations to determine which processes are still alive and to find a process whose state contains all transactions previously accepted at the shard, which will serve as the new leader. The new leader then transfers its state to the members of the new configuration, thereby initializing them. A variable ${\sf initialized}\in\{\textsc{true},\textsc{false}\}$ at a process records whether it has ever been initialized. Our protocol guarantees that a shard can become operational, i.e., start accepting transactions, only after all its members have been initialized.

The probing phase is complicated by the fact that there may be a series of failed reconfiguration attempts, where the new leader fails before initializing all its followers. Hence, probing requires traversing epochs from the current one down, skipping epochs that are not operational. Probing selects as the new leader the first initialized process it encounters during this traversal; we can show that this process is guaranteed to know about all transactions accepted at the shard, and thus making it the new leader will preserve Invariant 2 (§4).

In more detail, a process $p_{r}$ initiates a reconfiguration of a shard $s$ by calling ${\tt reconfigure}(s)$ (line 1). The process picks an epoch number ${\sf recon\_epoch}$ higher than the epoch of $s$ stored in the configuration service and then starts the probing phase, as marked by the flag ${\sf probing}$ . The process $p_{r}$ keeps track of the shard being reconfigured in ${\sf recon\_shard}$ , the epoch being probed in ${\sf probed\_epoch}$ and the membership of this epoch in ${\sf probed\_members}$ . The process initializes these variables when it first reads the current configuration from the configuration service (line 1). It then sends a ${\tt PROBE}$ message to the members of the current configuration, asking them to join the new epoch ${\sf recon\_epoch}$ . Upon receiving a ${\tt PROBE}(e)$ message (line 1), a process first checks that the proposed epoch is equal to or higher than the highest epoch it has ever been asked to join, which the process stores in ${\sf new\_epoch}$ (we always have ${\sf epoch}[s]\leq{\sf new\_epoch}$ at a process in $s$ ). In this case, the process sets ${\sf new\_epoch}$ to $e$ and changes its status to reconfiguring, which causes it to stop transaction processing. It then replies to $p_{r}$ with a ${\tt PROBE\_ACK}$ message, which indicates whether it has been previously initialized or not. If $p_{r}$ finds a process that has previously been initialized, and hence can serve as the new leader, $p_{r}$ ends probing (line 1). If $p_{r}$ does not find such a process in the epoch ${\sf probed\_epoch}$ and receives at least one reply ${\tt PROBE\_ACK}$ from a process that has not been initialized (line 1), $p_{r}$ can conclude that the epoch ${\sf probed\_epoch}$ is not operational and will never become such, because it has convinced at least one of its members to join the new epoch; this is formalized by Invariant 3. In this case $p_{r}$ starts probing the preceding epoch. Since no transactions could have been accepted at the epoch ${\sf probed\_epoch}$ , picking a new leader from an earlier epoch will not lose any accepted transactions and thus will not violate Invariant 2.
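The downward traversal performed by probing can be sketched as a synchronous simulation. The function `probe` and its arguments are hypothetical, and several simplifications are assumed: epochs are consecutive integers, all replies are available at once, and ties between initialized processes are broken by name, whereas the protocol simply ends probing on the first suitable ${\tt PROBE\_ACK}$ it receives.

```python
from typing import Callable, Dict, Optional, Set, Tuple


def probe(start_epoch: int,
          members: Dict[int, Set[str]],
          reply: Callable[[str], Optional[bool]]) -> Optional[Tuple[str, int]]:
    """Return (new leader, epoch it was found in), or None if probing is stuck.

    reply(p) is True/False for a PROBE_ACK carrying p's initialized flag,
    and None if p never replies (for instance, because it crashed).
    """
    epoch = start_epoch
    while epoch in members:
        answers = {p: reply(p) for p in members[epoch]}
        initialized = sorted(p for p, r in answers.items() if r is True)
        if initialized:
            return initialized[0], epoch       # an initialized process can lead
        if not any(r is False for r in answers.values()):
            return None                        # no reply at all: must keep waiting
        epoch -= 1                             # epoch can never become operational
    return None
```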

Once the probing finds a new leader $p_{j}$ for the shard $s$ (line 1), the process $p_{r}$ computes the membership of the new configuration using a function compute_membership (line 1). We do not prescribe a particular implementation of this function, except that the new membership must contain the new leader $p_{j}$ and may only contain the processes that replied to probing or fresh processes. The latter can be added to reach the desired level of fault tolerance. Once the new configuration is computed, $p_{r}$ attempts to store it in the configuration service using a compare-and-swap operation. This succeeds only if the current epoch is still the epoch from which $p_{r}$ started probing, which means that no concurrent reconfiguration occurred while $p_{r}$ was probing. In this case, $p_{r}$ sends a ${\tt NEW\_CONFIG}$ message with the new configuration to the new leader of $s$ .

When the new leader of $s$ receives the ${\tt NEW\_CONFIG}$ message (line 1), it sets ${\sf next}$ to the length of its sequence of transactions, ${\sf epoch}[s]$ to the new epoch and ${\sf status}$ to leader, which allows it to start processing new transactions. It then sends a ${\tt NEW\_STATE}$ message to its followers, containing its state. Upon receiving this message (line 1), a process overwrites its state with the one provided, changes its status to follower, and sets ${\sf initialized}$ to true. As part of the state update, the process also updates its epoch ${\sf epoch}[s_{0}]$ to the new one. Hence, the process will not accept transactions from the new leader until it receives the ${\tt NEW\_STATE}$ message.

When a new configuration of a shard $s$ is persisted in the configuration service, the service sends it in a ${\tt CONFIG\_CHANGE}$ message to the members of shards other than $s$ . A process updates the locally stored configuration upon receiving this message (line 1).

Coordinator recovery.

If a process that accepted a transaction $t$ does not receive the final decision on it, this may be because the coordinator of $t$ has failed. In this case the process may decide to become a new coordinator by executing a retry function (line 1). For this, the process just sends a ${\tt PREPARE}(t,\bot)$ message to the leaders of the shards of $t$ , carrying a special undefined value $\bot$ as the payload. If a leader receiving ${\tt PREPARE}(t,\bot)$ has already certified $t$ , it re-sends the corresponding ${\tt PREPARE\_ACK}$ message to the new coordinator, including the transaction payload and vote (line 1). Otherwise, if the leader does not have the payload of $t$ , it prepares the transaction as aborted and with an empty payload $\varepsilon$ (line 1). In either case, the new coordinator will finish processing the transaction as usual. The above case when the transaction is aborted because the leader of a shard does not know its payload may arise when the old coordinator crashed in between sending ${\tt PREPARE}$ messages to different shards. Note that if the old coordinator was suspected spuriously and will try later to submit the transaction to a shard where it was aborted, it will just get a ${\tt PREPARE\_ACK}$ message with an abort vote.
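The leader's treatment of ${\tt PREPARE}(t,\bot)$ can be sketched as follows. The class `Leader`, the callback `certify` and the encodings of $\bot$ and $\varepsilon$ are hypothetical; the returned triple stands for the contents of the ${\tt PREPARE\_ACK}$ message (position, payload, vote).

```python
from typing import Any, Callable, Dict, List, Tuple

BOTTOM = None                          # assumed encoding of the undefined payload
EPSILON = (frozenset(), frozenset())   # assumed encoding of the empty payload


class Leader:
    def __init__(self, certify: Callable[[Any], str]) -> None:
        self.certify = certify                       # computes the vote on a payload
        self.slots: List[Tuple[str, Any, str]] = []  # (transaction, payload, vote)
        self.index: Dict[str, int] = {}

    def on_prepare(self, t: str, payload: Any) -> Tuple[int, Any, str]:
        if t in self.index:
            # Already certified: re-send the stored payload and vote.
            k = self.index[t]
            _, stored_payload, stored_vote = self.slots[k]
            return k, stored_payload, stored_vote
        if payload is BOTTOM:
            # Payload unknown at this shard: prepare as aborted with payload epsilon.
            payload, vote = EPSILON, "abort"
        else:
            vote = self.certify(payload)
        self.index[t] = len(self.slots)
        self.slots.append((t, payload, vote))
        return self.index[t], payload, vote
```

In this sketch, a coordinator that was suspected spuriously and later submits the original payload receives the stored abort vote, matching the behaviour described above.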

Our protocol allows any number of processes to become coordinators of a transaction at the same time. Nevertheless, the protocol ensures that they will all reach the same decision, even in case of reconfigurations. We formalize this by Invariant 4: part (a) ensures an agreement on the decision on the $k$ -th transaction in the certification order at a given shard; part (b) ensures a system-wide agreement on the decision on a given transaction $t$ . The latter part establishes that the protocol computes a unique decision on each transaction. Invariant 4 is proved as a corollary of Invariant 2.

Losing undecided transactions.

Recall that our protocol uses the optimization that delegates persisting transactions at followers to coordinators (corfu, ; farm, ). We now highlight how this optimization interacts with transaction certification. Because of the optimization, transactions prepared by a leader of a shard $s$ can be persisted at followers out of order. For example, $t_{2}$ may follow $t_{1}$ in the certification order at the leader, but may be persisted at followers first. If now the leader of $s$ and the coordinator of $t_{1}$ crash before $t_{1}$ is persisted at followers, $t_{1}$ will be lost forever, something that is allowed by Invariant 2 (due to the use of $\prec$ ). In this case we lose a transaction $t_{1}$ on the basis of which the vote on the transaction $t_{2}$ was computed (e.g., the payload $l_{1}$ of $t_{1}$ was in $L_{2}$ when the vote on $t_{2}$ was computed at line 1). This does not violate correctness, since the vote on $t_{2}$ makes sense also in the context excluding $t_{1}$ : due to distributivity of certification functions (§2), if $t_{2}$ was allowed to commit in the presence of $t_{1}$ ( $f_{s}(\{l_{1}\},l_{2})=\textsc{commit}$ ), it can also commit in its absence ( $f_{s}(\emptyset,l_{2})=\textsc{commit}$ ). Note that in this case a decision on $t_{1}$ could not have been exposed to the client: otherwise $t_{1}$ could not get lost due to Invariant 2. Also note that, since we assume the transaction execution component produces payloads with read-sets containing only values written by committed transactions (§2), in the above case $t_{2}$ could not have read a value written by $t_{1}$ .
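The distributivity step used above can be spelled out as a short derivation. It instantiates the distributivity requirement on $f_{s}$ (stated analogously to (1)) with $L_{1}=\emptyset$ and $L_{2}=\{l_{1}\}$ , and uses the fact that $\sqcap$ yields commit only if both of its arguments are commit:

```latex
f_{s}(\{l_{1}\}, l_{2})
  = f_{s}(\emptyset \cup \{l_{1}\}, l_{2})
  = f_{s}(\emptyset, l_{2}) \sqcap f_{s}(\{l_{1}\}, l_{2})

% Hence:
f_{s}(\{l_{1}\}, l_{2}) = \textsc{commit}
  \implies f_{s}(\emptyset, l_{2}) = \textsc{commit}
```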

4. Correctness

The next theorem states the safety of our protocol, showing that it implements the TCS specification.

Theorem 4.1.

A transaction certification service implemented using the protocol in Figure 1 is correct with respect to a certification function $f$ matching the shard-local certification functions $f_{s}$ and $g_{s}$ .

We defer the proof to §A and only sketch the proof of the key Invariant 2. This relies on auxiliary Invariant 5, which we prove first.

Proof sketch for Invariant 5.

We prove the invariant by induction on $\mathit{e}^{\prime\prime}$ . Assume that the invariant holds for all $\mathit{e}^{\prime\prime}<\mathit{e}^{*}$ . We now show it for $\mathit{e}^{\prime\prime}=\mathit{e}^{*}$ . The members of $s$ at $\mathit{e}^{*}$ are computed at line 1 by a reconfiguring process $p_{r}$ using the compute_membership function, which returns either fresh processes or processes that responded to $p_{r}$ ’s probing. Since $p_{i}$ was a member of $s$ at $e^{\prime}<e^{*}$ , it is not fresh; then by assumptions on compute_membership $p_{i}$ must have received ${\tt PROBE}(\mathit{e}^{*})$ from $p_{r}$ and replied with ${\tt PROBE\_ACK}(\_,\mathit{e}^{*},s)$ . The process $p_{r}$ starts probing at epoch $\mathit{e}^{*}-1$ and ends it upon receiving a ${\tt PROBE\_ACK}(\textsc{true},\mathit{e}^{*},s)$ message. By the induction hypothesis, $p_{i}$ is not a member of $s$ at any epoch from $\mathit{e}^{*}-1$ down to $\mathit{e}+1$ . Hence, if the probing stops before reaching $\mathit{e}$ , then $p_{i}$ will not be a member of $s$ at $e^{*}$ , as required. Assume now that the probing reaches $\mathit{e}$ . By Invariant 3, each follower in $s$ at $\mathit{e}$ must have sent ${\tt ACCEPT\_ACK}(s,\mathit{e},t)$ before receiving ${\tt PROBE}(\mathit{e}^{*})$ . Then any member of $s$ at $\mathit{e}$ receiving ${\tt PROBE}(\mathit{e}^{*})$ will have ${\sf initialized}=\textsc{true}$ . Hence, if any member of $s$ at $\mathit{e}$ replies with ${\tt PROBE\_ACK}(\mathit{initialized},\mathit{e}^{*},s)$ , we have that $\mathit{initialized}=\textsc{true}$ . Since the process $p_{r}$ will not move to the preceding epoch until at least one process replies with ${\tt PROBE\_ACK}$ , this means that the probing can never go beyond $\mathit{e}$ . Since the process $p_{i}$ is not a member of $\mathit{e}$ , it cannot be included as a member of $s$ in $\mathit{e}^{*}$ , as required.∎

Proof sketch for Invariant 2.

We prove the invariant by induction on $\mathit{e}^{\prime}$ . Assume that the invariant holds for all $\mathit{e}^{\prime}<\mathit{e}^{\prime\prime}$ . We now show it for $\mathit{e}^{\prime}=\mathit{e}^{\prime\prime}$ by induction on the length of the protocol execution. We only consider the most interesting transition in line 1, when a process $p_{i}$ becomes a leader of $s$ at an epoch $\mathit{e}^{\prime\prime}$ . We show that after this transition at $p_{i}$ we have ${\sf txn}\mathpunct{\downharpoonleft}_{k}\prec\mathit{txn}$ , ${\sf vote}\mathpunct{\downharpoonleft}_{k}\prec\mathit{vote}$ and ${\sf payload}\mathpunct{\downharpoonleft}_{k}\prec\mathit{payload}$ .

Since $p_{i}$ was chosen as the leader of $s$ at $\mathit{e}^{\prime\prime}$ , this process replied with ${\tt PROBE\_ACK}(\textsc{true},\mathit{e}^{\prime\prime},s)$ to a ${\tt PROBE}(\mathit{e}^{\prime\prime})$ . Therefore, $p_{i}$ was a member of $s$ at an epoch $\mathit{e}^{*}<\mathit{e}^{\prime\prime}$ that was being probed. Probing ends when at least one process sends a ${\tt PROBE\_ACK}(\textsc{true},\mathit{e}^{\prime\prime},s)$ . From Invariant 3 and the assumption that all followers in $\mathit{e}$ replied with ${\tt ACCEPT\_ACK}$ to ${\tt ACCEPT}(\mathit{e},k,t,\mathit{l},d)$ , we can conclude that probing could not have gone further than $\mathit{e}$ . Hence, $\mathit{e}\leq\mathit{e}^{*}<\mathit{e}^{\prime\prime}$ .

Let $\mathit{e}_{0}$ be the value of ${\sf epoch}[s]$ at $p_{i}$ right before the transition at line 1. We have $\mathit{e}_{0}\geq\mathit{e}$ , as otherwise $p_{i}$ would not be a member of $s$ at $\mathit{e}$ and by Invariant 5 could not be picked as the leader of $s$ at $\mathit{e}^{\prime\prime}$ . It is also easy to show that $\mathit{e}_{0}<\mathit{e}^{\prime\prime}$ . Hence, $\mathit{e}\leq\mathit{e}_{0}<\mathit{e}^{\prime\prime}$ .

If $\mathit{e}<\mathit{e}_{0}$ , then by the induction hypothesis, we have ${\sf txn}\mathpunct{\downharpoonleft}_{k}\prec\mathit{txn}$ , ${\sf vote}\mathpunct{\downharpoonleft}_{k}\prec\mathit{vote}$ and ${\sf payload}\mathpunct{\downharpoonleft}_{k}\prec\mathit{payload}$ right after the transition in line 1, as required. Assume now that $\mathit{e}_{0}=\mathit{e}$ . If $p_{i}$ was the leader of $s$ at $\mathit{e}$ , then we trivially have ${\sf txn}\mathpunct{\downharpoonleft}_{k}\prec\mathit{txn}$ , ${\sf vote}\mathpunct{\downharpoonleft}_{k}\prec\mathit{vote}$ and ${\sf payload}\mathpunct{\downharpoonleft}_{k}\prec\mathit{payload}$ right after the transition in line 1, as required. Otherwise, by Invariant 3, $p_{i}$ must have received ${\tt ACCEPT}(\mathit{e},k,t,\mathit{l},d)$ and responded to it with ${\tt ACCEPT\_ACK}(s,\mathit{e},t)$ before the transition in line 1. Then the required follows from Invariant 1.∎

We next state liveness properties of our protocol (we again defer proofs to §A). The reconfiguration procedure in the protocol will get stuck if it cannot find an initialized process, which may happen if enough processes crash, so that all shard data is lost. We now state conditions under which this cannot happen. We associate two events with each configuration $e$ of a shard $s$ : introduction and activation. Introduction indicates that the configuration comes into existence and is triggered when the configuration is successfully persisted in the configuration service (line 1). Activation indicates that the configuration becomes operational and is triggered when all the followers of the configuration have processed the ${\tt NEW\_STATE}$ messages sent by its leader (line 1).

Once a configuration has been activated, we say that it is active. We define its lifetime as the time interval between its introduction and when a succeeding configuration becomes active. Note that not every introduced configuration necessarily becomes active, since its leader may never complete the data transfer to the followers. To ensure our protocol is live we make the following assumption, similar to the ones made by other protocols with changing membership (ken-book, ; spiegelman2017dynamic, ).

Assumption 1.

At least one member in each configuration is non-faulty throughout the lifetime of a configuration.

The following two theorems show that, under this assumption, a single reconfiguration makes progress.

Theorem 4.2.

If a process $p_{r}$ attempts to reconfigure a shard $s$ and no other process attempts to reconfigure $s$ simultaneously, then if $p_{r}$ is non-faulty for long enough, it will eventually introduce a new configuration.

Theorem 4.3.

If a configuration of a shard $s$ is introduced by a process $p_{r}$ , then it will eventually be activated, provided no process attempts to reconfigure $s$ simultaneously, and $p_{r}$ and all the members of the configuration are non-faulty for long enough.

Finally, the following theorem shows that in the absence of failures or reconfigurations, transaction certification makes progress.

Theorem 4.4.

Assume that the current configuration of each shard is active, all processes are aware of the current configuration of each shard, and no reconfiguration is in progress. If a transaction is submitted for certification, then it will eventually be decided, provided no reconfiguration is attempted and all the processes belonging to the current configuration of each shard are non-faulty for long enough.

5. Exploiting RDMA

We now present a variant of our protocol that uses Remote Direct Memory Access (RDMA), which follows the design of the FARM system (farm, ; farm2, ). By comparing this protocol with that of §3 we highlight the trade-offs required by the use of RDMA. Due to space constraints, we defer the pseudocode of our protocol to §C and describe the required changes in the protocol of §3 only informally.

We assume the same system model as in §2, except that processes can communicate using RDMA. This allows a machine to access the memory of another machine over the network without involving the latter’s CPU, thus lowering latency. Like FARM, our protocol uses RDMA to implement a primitive for point-to-point communication between processes with the following interface. The primitive allows a sender process to reliably send a message $m$ to a receiver process $p_{j}$ (send-rdma ( $m,p_{j}$ )) by remotely writing into a specific memory region of $p_{j}$ . The sender then gets an acknowledgement when the message reaches the receiver’s memory (ack-rdma ( $m,p_{j}$ )), sent by the receiver’s network interface card (NIC) without interrupting its CPU. The receiver is notified at a later point that a new message is available (deliver-rdma ( $m,p_{j}$ )). Hence, the guarantee provided by ack-rdma ( $m,p_{j}$ ) is that the receiver will eventually deliver the message $m$ , even if the sender crashes, since the message is already in the receiver’s memory. The operation open ( $p_{i}$ ) grants $p_{i}$ access to a region of the caller’s memory, and close ( $p_{i}$ ) revokes it. Once the latter operation completes, $p_{i}$ cannot send any message to the caller using send-rdma. Finally, we assume that the communication primitive includes another operation: flush. This operation blocks the caller until it has delivered all messages addressed to it that have been acknowledged by its NIC through an ack-rdma.

To implement the above primitive, the receiver usually keeps a circular buffer in memory for each process that may send it a message (farm-first, ; rdma-mpi, ). The operation send-rdma ( $m,p_{j}$ ) issued by a process $p_{i}$ appends a message to the corresponding buffer at the receiver using RDMA writes. Receivers periodically pull messages from the buffers and deliver them to the application via deliver-rdma. If a buffer at a process $p_{j}$ gets full, the associated sender process will not be able to send a message to $p_{j}$ until the latter pulls some messages.
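The receiver side of this primitive can be sketched as an in-memory simulation. No real RDMA library is used: the class `RdmaInbox` and its methods are hypothetical, a bounded queue stands in for the circular buffer, and a return value of True from `send_rdma` plays the role of ack-rdma.

```python
from collections import deque
from typing import Deque, Dict, List, Set, Tuple


class RdmaInbox:
    """Simulated receiver: one bounded buffer per sender."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.buffers: Dict[str, Deque[str]] = {}
        self.writable: Set[str] = set()
        self.delivered: List[Tuple[str, str]] = []

    def open(self, sender: str) -> None:
        self.buffers.setdefault(sender, deque())
        self.writable.add(sender)

    def close(self, sender: str) -> None:
        # Revoke access; messages already in memory remain deliverable.
        self.writable.discard(sender)

    def send_rdma(self, sender: str, message: str) -> bool:
        """Remote write by the sender; True models the NIC's ack-rdma."""
        if sender not in self.writable:
            return False
        buf = self.buffers[sender]
        if len(buf) >= self.capacity:
            return False               # buffer full until the receiver pulls
        buf.append(message)
        return True

    def poll(self) -> None:
        """Periodic pull: deliver at most one message per sender."""
        for sender, buf in self.buffers.items():
            if buf:
                self.delivered.append((sender, buf.popleft()))

    def flush(self) -> None:
        """Deliver every message that has been acknowledged so far."""
        while any(self.buffers.values()):
            self.poll()
```

The sketch keeps acknowledged messages after close, which reflects the guarantee that an acknowledged message is eventually delivered even if the sender can no longer write.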

Following FARM, we use the above RDMA-based communication primitive in our protocol to persist votes and decisions (steps 2 and 3 of Figure 2a). This requires the following changes to the protocol in Figure 1. First, ${\tt ACCEPT}$ and ${\tt DECISION}$ messages are sent using send-rdma instead of send (lines 1 and 1). Second, the followers do not send explicit ${\tt ACCEPT\_ACK}$ messages to transaction coordinators (line 1); instead, the latter act once they receive an RDMA acknowledgement ack-rdma. This makes the checks at lines 1 and 1 redundant, as followers cannot reject ${\tt ACCEPT}$ or ${\tt DECISION}$ messages under any circumstance. The practical rationale for these changes is that persisting a transaction $t$ at followers using RDMA minimizes the time during which the transaction is prepared at leaders, which requires them to vote abort on all transactions conflicting with $t$ (via the certification function $g_{s}$ , §2); this results in lower abort rates (farm, ; binnig, ). Transaction processing at followers (e.g., adding them to the local copy of the certification order, line 1) is done off the critical path of certification.

Unfortunately, the above changes to the failure-free path of the protocol do not preserve correctness without changes to reconfiguration, as illustrated by an example execution in Figure 4a. In this execution, two shards $s_{1}$ and $s_{2}$ are involved in the certification of a transaction $t$ , coordinated by a process $p_{c}$ from a third shard. The transaction is prepared to commit at the leaders $p_{1}$ and $p_{3}$ of both shards (step 1), and the commit vote from the leader of $s_{1}$ ( $p_{1}$ ) is persisted at the follower $p_{2}$ using RDMA (step 2). Before the coordinator $p_{c}$ persists the vote from the leader $p_{3}$ of $s_{2}$ at the follower $p_{4}$ , the leader $p_{3}$ is suspected of failure and a reconfiguration is triggered at shard $s_{2}$ . This promotes the follower $p_{4}$ to a new leader and brings online a fresh follower $p_{5}$ . Next, the leader $p_{1}$ of $s_{1}$ suspects the coordinator $p_{c}$ of failure and triggers a reconfiguration to remove it. Once $p_{c}$ is removed from its shard, $p_{1}$ retries the processing of $t$ (step 3, line 1 in Figure 1). The new leader $p_{4}$ of $s_{2}$ does not know about $t$ , so this results in the transaction being aborted, because its payload at shard $s_{2}$ is thought to be lost (steps 4 and 5). But now the coordinator $p_{c}$ , who did not actually fail and still believes $s_{2}$ is in the old configuration, finishes its processing by persisting the commit vote of the old leader $p_{3}$ of $s_{2}$ at the old follower $p_{4}$ , which is now the new leader of $s_{2}$ (step 6). Since this is done via RDMA, $p_{4}$ cannot reject the vote and, thus, $p_{c}$ commits the transaction (step 7). This violates safety, as two contradictory results have been externalized. The protocol in §3 is not subject to this problem, because in that protocol the new leader $p_{4}$ of the shard $s_{2}$ would reject the ${\tt ACCEPT}$ message due to the failure of the check at line 1.

To make the RDMA-based protocol correct, we need to change the reconfiguration protocol so that the whole system participates in reconfiguration instead of just the affected shard. Figure 4b illustrates the message flow of the redesigned reconfiguration protocol. Processes now maintain a single epoch variable instead of a vector. The data structures maintained by the external configuration service and its interface are adjusted accordingly. As in our previous commit protocol, the process $p_{r}$ performing reconfiguration first probes previous configurations by sending ${\tt PROBE}$ messages. However, $p_{r}$ now probes all shards. A process receiving ${\tt PROBE}$ handles it as before (line 1), but additionally closes all incoming RDMA connections using close, which guarantees that the process accepts no more transactions at its previous epoch. This is needed because, due to communication via RDMA, the protocol can no longer leverage the safety check at line 1. The logic of the reconfiguring process is also changed: after this process computes the new configuration and stores it in the configuration service (line 1), the process sends a new ${\tt CONFIG\_PREPARE}$ message to all processes in the configuration. Upon receiving ${\tt CONFIG\_PREPARE}$ , a process updates its locally stored configuration and replies with a ${\tt CONFIG\_PREPARE\_ACK}$ message. This ensures that the whole system is aware of the new configuration before it is activated. Only after this does the reconfiguring process send a ${\tt NEW\_CONFIG}$ message to the leaders of the new configuration. Upon receiving ${\tt NEW\_CONFIG}$ (line 1), a leader $p_{l}$ first calls flush. This guarantees that all the messages that have been acknowledged as having reached $p_{l}$ ’s memory will be replicated to followers in ${\tt NEW\_STATE}$ messages; this is necessary since transaction coordinators may have already externalized decisions taken based on these acknowledgements. Finally, processes open RDMA connections to all other processes in the configuration using open: a leader after sending ${\tt NEW\_STATE}$ to its followers, and followers upon receiving ${\tt NEW\_STATE}$ (line 1).

The new protocol guarantees that: (*) if a process receives an ${\tt ACCEPT}$ message for a transaction $t$ while at epoch $\mathit{e}$ , then the leader that prepared $t$ was at epoch $\mathit{e}$ when it prepared this transaction. This property is key in proving the correctness of the protocol, as it provides the same guarantees as the removed guard in line 1, which we could not leverage due to the use of RDMA. The property (*) holds because: (i) at any time, a process only maintains RDMA connections to the members of its current epoch; and (ii) before persisting a vote at a follower, the coordinator of a transaction checks that the transaction was prepared in its current epoch (line 1).

We now show how the revised reconfiguration protocol prevents the bug in Figure 4a. In this protocol, when $p_{c}$ attempts to persist the commit vote at $p_{4}$ (step 6), the latter will already be aware that $p_{c}$ has been removed from the system and will have closed the RDMA connection to it. Thus, $p_{c}$ will be unable to persist the vote at $p_{4}$ (this would violate the property (*)) and will never gather enough acknowledgements to decide the transaction. Hence, no contradictory results will be externalized. We state and prove the correctness of the RDMA-based protocol in §C.
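The effect of closing connections during reconfiguration can be sketched with a toy model of step 6 of Figure 4a. The code below is an illustration only, not the protocol itself: the names `Process`, `on_probe`, `on_new_state`, `send_rdma` and `stale_write_scenario`, the epoch numbers, and the membership sets are all assumptions, and message passing is replaced by direct method calls.

```python
class Process:
    """Toy process that tracks which peers may write into its memory."""

    def __init__(self, name):
        self.name = name
        self.epoch = 0
        self.connections = set()   # peers with an open RDMA connection
        self.memory = []           # votes persisted by remote writes

    def on_probe(self):
        # PROBE handler: close all incoming RDMA connections.
        self.connections.clear()

    def on_new_state(self, epoch, members):
        # Enter the new epoch and reopen connections to its members only.
        self.epoch = epoch
        self.connections = set(members) - {self.name}


def send_rdma(sender, receiver, msg):
    """Return True (modelling ack-rdma) iff the receiver grants access."""
    if sender.name not in receiver.connections:
        return False
    receiver.memory.append(msg)
    return True


def stale_write_scenario():
    """Replay step 6 of the example after p_c has been removed."""
    p_c, p_4 = Process("p_c"), Process("p_4")
    # Old epoch (assumed to be 1): p_4 accepts writes from p_c.
    p_4.on_new_state(1, {"p_1", "p_2", "p_3", "p_4", "p_c"})
    # Reconfiguration: p_4 is probed, then joins a new epoch without p_c.
    p_4.on_probe()
    p_4.on_new_state(2, {"p_1", "p_2", "p_4", "p_5"})
    # p_c, still in the old epoch, tries to persist the stale commit vote.
    acked = send_rdma(p_c, p_4, ("ACCEPT", "t", "commit"))
    return acked, p_4.memory
```

In this model `stale_write_scenario()` yields no acknowledgement and leaves the memory of `p_4` empty, which mirrors the argument above: the coordinator never gathers enough acknowledgements to decide the transaction.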

6. Related Work and Discussion

Our protocols are inspired by the recent FARM system for transaction processing, which also uses $f+1$ replicas per shard and deals with failures using reconfiguration (farm, ; farm2, ). FARM was presented as a complete database system with a number of optimizations, including the use of RDMA. In contrast, our work distills the core ideas of FARM into protocols solving the well-defined transaction certification problem, parametric in the isolation level provided and rigorously proven correct. This allows us to simplify some aspects of the FARM design. In particular, FARM has a more complex way of determining the state of the new leader upon a reconfiguration, which merges the states from all surviving replicas of the previous configuration. In contrast, our protocols take the state of any single initialized replica. Our reconfiguration protocols also provide better fault-tolerance guarantees on a par with those of existing ones (ken-book, ; spiegelman2017dynamic, ). This is because, like Vertical Paxos I (vertical-paxos, ), our protocols look through a sequence of configurations to find the new leader, whereas FARM only considers the previous configuration. Hence, FARM reconfiguration can get stuck even when there exists a non-faulty replica with the necessary data. Finally, by presenting two related protocols using message passing and RDMA, we are able to identify the price of exploiting RDMA—having to reconfigure the whole system instead of a single shard.

There have been a number of protocols for solving the atomic commit problem, which requires reaching a decision on a single transaction (2pc, ; Hadzilacos1990, ; nbac, ; dwork-skeen, ). In contrast to these works, our protocol solves the more general problem of implementing a Transaction Certification Service, which requires reaching decisions on a stream of transactions. This problem more faithfully reflects the requirements of modern transaction processing systems (discpaper, ).

Our protocol weaves together two-phase commit (2PC) (2pc, ) and Vertical Paxos (vertical-paxos, ), instead of using Paxos replication as a black box. This is similar to several existing sharded systems for transaction processing, which integrate protocols for distribution and replication (uw-inconsistent, ; mdcc, ; replicated-commit, ; discpaper, ). However, these systems considered a static set of $2f+1$ processes per shard, whereas we assume $f+1$ processes and allow the system to be reconfigured. Achieving this correctly is nontrivial and requires a subtle interplay between the reconfigurable replication mechanism and cross-shard coordination. For example, as we explained in §3, on failures our protocol may lose information about transactions that influenced votes on other transactions, but this does not violate correctness. As is well-known (cheappaxos, ), using $f+1$ instead of $2f+1$ replicas results in somewhat weaker availability guarantees: upon a single failure, our protocols have to stop processing transactions while the system is reconfigured.

Acknowledgments.

We thank Dushyanth Narayanan for discussions about FARM. This research was supported by an ERC grant RACCOON.

Appendix A Correctness of the Protocol

Figure 5 summarizes additional invariants that, together with the invariants listed in Figure 3, are used to prove the correctness of the protocol. We first prove the nontrivial Invariants 1, 3, 11, 12 and 4 that were not proved in §4. We then prove Theorem 4.1.

A.1. Proof of Invariants

Proof of Invariant 1.

Assume that a process $p_{i}$ in $s$ at $\mathit{e}$ receives ${\tt ACCEPT}(\mathit{e},k,t,\mathit{l},d)$ and replies with ${\tt ACCEPT\_ACK}$ . We prove that, after the transition and while ${\sf epoch}[s]=\mathit{e}$ , $p_{i}$ has ${\sf txn}\mathpunct{\downharpoonleft}_{k}\prec\mathit{txn}\mathpunct{\downharpoonleft}_{k}$ , ${\sf vote}\mathpunct{\downharpoonleft}_{k}\prec\mathit{vote}\mathpunct{\downharpoonleft}_{k}$ and ${\sf payload}\mathpunct{\downharpoonleft}_{k}\prec\mathit{payload}\mathpunct{\downharpoonleft}_{k}$ , where $\mathit{txn}$ , $\mathit{vote}$ and $\mathit{payload}$ are the values of the arrays ${\sf txn}$ , ${\sf vote}$ and ${\sf payload}$ at the leader $p_{l}$ of $s$ at $\mathit{e}$ when it sent the corresponding message ${\tt PREPARE\_ACK}(\mathit{e},s,k,t,\mathit{l},d)$ .

If $p_{i}$ processes ${\tt ACCEPT}(\mathit{e},k,t,\mathit{l},d)$ , then $p_{i}$ has ${\sf epoch}=\mathit{e}$ . Thus, $p_{i}$ has processed ${\tt NEW\_STATE}(\mathit{e},\mathit{txn}^{\prime},\mathit{payload}^{\prime},\mathit{vote}^{\prime},\_,\_)$ before. After processing this message, $p_{i}$ has ${\sf txn}=\mathit{txn}^{\prime}$ , ${\sf vote}=\mathit{vote}^{\prime}$ and ${\sf payload}=\mathit{payload}^{\prime}$ where $\mathit{txn}^{\prime}$ , $\mathit{vote}^{\prime}$ and $\mathit{payload}^{\prime}$ are the values of the arrays ${\sf txn}$ , ${\sf vote}$ and ${\sf payload}$ at the leader $p_{l}$ of $s$ at $\mathit{e}$ when it sent the ${\tt NEW\_STATE}$ message. Let $k^{\prime}={\sf length}(\mathit{txn}^{\prime})$ . By lines 1, 1 and 1 we have that $\mathit{txn}\mathpunct{\downharpoonleft}_{k^{\prime}}=\mathit{txn}^{\prime}$ , $\mathit{vote}\mathpunct{\downharpoonleft}_{k^{\prime}}=\mathit{vote}^{\prime}$ and $\mathit{payload}\mathpunct{\downharpoonleft}_{k^{\prime}}=\mathit{payload}^{\prime}$ . Therefore, $p_{i}$ has ${\sf txn}\mathpunct{\downharpoonleft}_{k^{\prime}}\prec\mathit{txn}\mathpunct{\downharpoonleft}_{k^{\prime}}$ , ${\sf vote}\mathpunct{\downharpoonleft}_{k^{\prime}}\prec\mathit{vote}\mathpunct{\downharpoonleft}_{k^{\prime}}$ and ${\sf payload}\mathpunct{\downharpoonleft}_{k^{\prime}}\prec\mathit{payload}\mathpunct{\downharpoonleft}_{k^{\prime}}$ while ${\sf epoch}[s]=\mathit{e}$ . Furthermore, after processing ${\tt ACCEPT}(\mathit{e},k,t,\mathit{l},d)$ , $p_{i}$ has ${\sf txn}[k]=\mathit{txn}[k]$ , ${\sf vote}[k]=\mathit{vote}[k]$ and ${\sf payload}[k]=\mathit{payload}[k]$ . By Invariant 9, $p_{i}$ has ${\sf txn}[k]=\mathit{txn}[k]$ , ${\sf vote}[k]=\mathit{vote}[k]$ and ${\sf payload}[k]=\mathit{payload}[k]$ while ${\sf epoch}[s]=\mathit{e}$ .

We now prove that after processing ${\tt NEW\_STATE}(\mathit{e},\mathit{txn}^{\prime},\mathit{payload}^{\prime},\mathit{vote}^{\prime},\_,\_)$ and while ${\sf epoch}[s]=\mathit{e}$ , $p_{i}$ has ${\sf txn}[k^{\prime\prime}]\in\{\mathit{txn}[k^{\prime\prime}],\bot\}$ , ${\sf vote}[k^{\prime\prime}]\in\{\mathit{vote}[k^{\prime\prime}],\bot\}$ and ${\sf payload}[k^{\prime\prime}]\in\{\mathit{payload}[k^{\prime\prime}],\bot\}$ for any $k^{\prime\prime}$ such that $k^{\prime}<k^{\prime\prime}<k$ . We prove it by induction on the length of the protocol execution from the moment in which $p_{i}$ has processed ${\tt NEW\_STATE}(\mathit{e},\mathit{txn}^{\prime},\mathit{payload}^{\prime},\mathit{vote}^{\prime},\_,\_)$ . The validity of the property can be affected by only the transition at line 1. Let ${\tt ACCEPT}(\mathit{e},k^{*},t^{*},\mathit{l}^{*},d^{*})$ be the message that triggers the transition. Assume that $k^{\prime}<k^{*}<k$ , as otherwise the transition does not affect the validity of the property. By the induction hypothesis, $p_{i}$ has ${\sf txn}[k^{\prime\prime}]\in\{\mathit{txn}[k^{\prime\prime}],\bot\}$ , ${\sf vote}[k^{\prime\prime}]\in\{\mathit{vote}[k^{\prime\prime}],\bot\}$ and ${\sf payload}[k^{\prime\prime}]\in\{\mathit{payload}[k^{\prime\prime}],\bot\}$ for any $k^{\prime\prime}\neq k^{*}$ such that $k^{\prime}<k^{\prime\prime}<k$ after processing ${\tt ACCEPT}(\mathit{e},k^{*},t^{*},\mathit{l}^{*},d^{*})$ . Also, after processing ${\tt ACCEPT}(\mathit{e},k^{*},t^{*},\mathit{l}^{*},d^{*})$ , $p_{i}$ has ${\sf txn}[k^{*}]=t^{*}$ , ${\sf vote}[k^{*}]=d^{*}$ and ${\sf payload}[k^{*}]=\mathit{l}^{*}$ . By lines 1, 1 and 1, $\mathit{txn}[k^{*}]=t^{*}$ , $\mathit{vote}[k^{*}]=d^{*}$ and $\mathit{payload}[k^{*}]=\mathit{l}^{*}$ . 
Then, after processing ${\tt ACCEPT}(\mathit{e},k^{*},t^{*},\mathit{l}^{*},d^{*})$ , $p_{i}$ has ${\sf txn}[k^{*}]=\mathit{txn}[k^{*}]$ , ${\sf vote}[k^{*}]=\mathit{vote}[k^{*}]$ and ${\sf payload}[k^{*}]=\mathit{payload}[k^{*}]$ . This proves that, after processing ${\tt NEW\_STATE}(\mathit{e},\mathit{txn}^{\prime},\mathit{payload}^{\prime},\mathit{vote}^{\prime},\_,\_)$ and while ${\sf epoch}[s]=\mathit{e}$ , $p_{i}$ has ${\sf txn}[k^{\prime\prime}]\in\{\mathit{txn}[k^{\prime\prime}],\bot\}$ , ${\sf vote}[k^{\prime\prime}]\in\{\mathit{vote}[k^{\prime\prime}],\bot\}$ and ${\sf payload}[k^{\prime\prime}]\in\{\mathit{payload}[k^{\prime\prime}],\bot\}$ for any $k^{\prime\prime}$ such that $k^{\prime}<k^{\prime\prime}<k$ . We have already proved that (i) $p_{i}$ has ${\sf txn}\mathpunct{\downharpoonleft}_{k^{\prime}}\prec\mathit{txn}\mathpunct{\downharpoonleft}_{k^{\prime}}$ , ${\sf vote}\mathpunct{\downharpoonleft}_{k^{\prime}}\prec\mathit{vote}\mathpunct{\downharpoonleft}_{k^{\prime}}$ and ${\sf payload}\mathpunct{\downharpoonleft}_{k^{\prime}}\prec\mathit{payload}\mathpunct{\downharpoonleft}_{k^{\prime}}$ after processing ${\tt NEW\_STATE}(\mathit{e},\mathit{txn}^{\prime},\mathit{payload}^{\prime},\mathit{vote}^{\prime},\_,\_)$ and while ${\sf epoch}[s]=\mathit{e}$ ; and that (ii) $p_{i}$ has ${\sf txn}[k]=\mathit{txn}[k]$ , ${\sf vote}[k]=\mathit{vote}[k]$ and ${\sf payload}[k]=\mathit{payload}[k]$ after processing ${\tt ACCEPT}(\mathit{e},k,t,\mathit{l},d)$ and while ${\sf epoch}[s]=\mathit{e}$ . Hence, $p_{i}$ has ${\sf txn}\mathpunct{\downharpoonleft}_{k}\prec\mathit{txn}\mathpunct{\downharpoonleft}_{k}$ , ${\sf vote}\mathpunct{\downharpoonleft}_{k}\prec\mathit{vote}\mathpunct{\downharpoonleft}_{k}$ and ${\sf payload}\mathpunct{\downharpoonleft}_{k}\prec\mathit{payload}\mathpunct{\downharpoonleft}_{k}$ after processing ${\tt ACCEPT}(\mathit{e},k,t,\mathit{l},d)$ and while ${\sf epoch}[s]=\mathit{e}$ , as required. ∎

Proof of Invariant 3.

When $p_{i}$ processed ${\tt PROBE}(\mathit{e})$ , it set ${\sf status}=\textsc{reconfiguring}$ . This prevents $p_{i}$ from processing any ${\tt ACCEPT}$ message until it processes a ${\tt NEW\_CONFIG}(\mathit{e}^{*},\_)$ or a ${\tt NEW\_STATE}(\mathit{e}^{*},\_,\_,\_,\_,\_,\_)$ message. When $p_{i}$ processed ${\tt PROBE}(\mathit{e})$ , it also set ${\sf new\_epoch}=\mathit{e}$ . By the checks in lines 1 and 1 and by the fact that ${\sf new\_epoch}$ can never decrease, this guarantees that $p_{i}$ only handles any of these messages if $\mathit{e}^{*}\geq{\sf new\_epoch}$ . Hence, by the time $p_{i}$ is able to process ${\tt ACCEPT}$ messages again it will have ${\sf epoch}[s]=\mathit{e}^{*}>\mathit{e}^{\prime}$ . By the check in line 1 and the fact that the protocol trivially guarantees that ${\sf epoch}[s]$ never decreases, $p_{i}$ will never send ${\tt ACCEPT\_ACK}(s,\mathit{e}^{\prime},\_,\_,\_)$ after sending a ${\tt PROBE\_ACK}(\_,\mathit{e},s)$ , as required. ∎

Proof of Invariant 11.

(a) Assume that all followers in $s$ at $\mathit{e}_{1}$ have received ${\tt ACCEPT}(\mathit{e}_{1},k,t_{1},\mathit{l}_{1},d_{1})$ and replied with ${\tt ACCEPT\_ACK}(s,\mathit{e}_{1},k,t_{1},d_{1})$ . Assume that all followers in $s$ at $\mathit{e}_{2}$ have received ${\tt ACCEPT}(\mathit{e}_{2},k,t_{2},\mathit{l}_{2},d_{2})$ and replied with ${\tt ACCEPT\_ACK}(s,\mathit{e}_{2},k,t_{2},d_{2})$ . Assume without loss of generality that $\mathit{e}_{1}\leq\mathit{e}_{2}$ . If $\mathit{e}_{1}=\mathit{e}_{2}$ , then by Invariant 6 we must have $t_{1}=t_{2}$ , $\mathit{l}_{1}=\mathit{l}_{2}$ and $d_{1}=d_{2}$ . Assume now that $\mathit{e}_{1}<\mathit{e}_{2}$ . By Invariant 2, when the leader of $s$ at $\mathit{e}_{2}$ sent the ${\tt PREPARE\_ACK}(\mathit{e}_{2},s,k,t_{2},\mathit{l}_{2},d_{2})$ message it has ${\sf txn}[k]=t_{1}$ , ${\sf vote}[k]=d_{1}$ and ${\sf payload}[k]=\mathit{l}_{1}$ . But then due to the check at line 1, we again must have $t_{1}=t_{2}$ , $\mathit{l}_{1}=\mathit{l}_{2}$ and $d_{1}=d_{2}$ .

(b) Assume that all followers in $s$ at $\mathit{e}_{1}$ have received ${\tt ACCEPT}(\mathit{e}_{1},k_{1},t,\mathit{l}_{1},d_{1})$ and replied with ${\tt ACCEPT\_ACK}(s,\mathit{e}_{1},k_{1},t,d_{1})$ . Assume that all followers in $s$ at $\mathit{e}_{2}$ have received ${\tt ACCEPT}(\mathit{e}_{2},k_{2},t,\mathit{l}_{2},d_{2})$ and replied with ${\tt ACCEPT\_ACK}(s,\mathit{e}_{2},k_{2},t,d_{2})$ . Assume without loss of generality that $\mathit{e}_{1}\leq\mathit{e}_{2}$ . We first show that $k_{1}=k_{2}$ . If $\mathit{e}_{1}=\mathit{e}_{2}$ , then we must have $k_{1}=k_{2}$ by Invariant 9. Assume now that $\mathit{e}_{1}<\mathit{e}_{2}$ . By Invariant 2, when the leader of $s$ at $\mathit{e}_{2}$ sent the ${\tt PREPARE\_ACK}(\mathit{e}_{2},s,k_{2},t,\mathit{l}_{2},d_{2})$ message it has ${\sf txn}[k_{1}]=t$ . But then due to the check at line 1 and Invariant 10, we again must have $k_{1}=k_{2}$ . Hence, $k_{1}=k_{2}$ . But then by Invariant 11a we must also have $\mathit{l}_{1}=\mathit{l}_{2}$ and $d_{1}=d_{2}$ . ∎

Proof of Invariant 12.

(a) Assume that a process $p_{i}$ in shard $s$ has ${\sf epoch}[s]=\mathit{e}^{\prime}$ , ${\sf phase}[k]=\textsc{decided}$ and ${\sf dec}[k]=d$ . We show that then a ${\tt DECISION}(\mathit{e},k,d)$ message has been sent to $s$ , where $\mathit{e}\leq\mathit{e}^{\prime}$ . We prove the invariant by induction on the length of the protocol execution. The validity of the property can be affected by only the transitions at lines 1 and 1. First, consider the transition at line 1. By the induction hypothesis, $p_{i}$ satisfies the property before handling the ${\tt DECISION}(\mathit{e},k,d)$ message that causes the transition. Given that the message is only handled if $\mathit{e}\leq\mathit{e}^{\prime}$ , lines 1 and 1 trivially preserve the invariant. Finally, consider the transition at line 1. The transition is triggered when $p_{i}$ receives a ${\tt NEW\_STATE}(\mathit{e}^{\prime},\_,\_,\_,\_,\mathit{dec},\mathit{phase})$ . By the induction hypothesis, the leader of $s$ at $\mathit{e}^{\prime}$ satisfies the required before the transition. The process $p_{i}$ simply substitutes its ${\sf dec}$ and ${\sf phase}$ arrays by the arrays $\mathit{dec}$ and $\mathit{phase}$ . Therefore, $p_{i}$ will also satisfy the required after the transition.

(b) Follows from item (a) and Invariant 2.∎

Proof of Invariant 4.

Follows from Invariant 11, since, if a coordinator has computed the final decision on a transaction, then all followers in each relevant shard at a given epoch have accepted a corresponding vote.∎

A.2. Proof of Theorem 4.1

To facilitate the proof of Theorem 4.1, we first introduce a low-level specification TCS-LL, and prove that it is correctly implemented by the atomic commit protocol (Lemma A.1). We then show that every history satisfying TCS-LL is correct with respect to $f$ (Lemma A.3). The low-level specification TCS-LL is defined as follows.

Consider a history $h$ . Let $T$ denote the set of transactions $t$ such that ${\tt certify}(t,\_)$ is an event in $h$ , and $d[t]$ denote the decision value $d$ of $t\in T$ if ${\tt decide}(t,d)$ is an event in $h$ . The history $h$ satisfies TCS-LL if for some transactions $t\in T$ and shards $s\in{\sf shards}(t)$ there exist $d_{s}[t]\in\mathcal{D}$ , $\mathit{pos}_{s}[t]\in\mathbb{N}$ , $\mathit{pload}_{s}[t]\in\mathcal{L}$ and $T_{s}[t],P_{s}[t]\in 2^{\mathcal{T}}$ such that all the constraints in Figure 6 are satisfied. A protocol is a correct implementation of TCS-LL if each of its finite histories satisfies TCS-LL.

Lemma A.1.

The atomic commit protocol in Figure 1 is a correct implementation of TCS-LL.

Proof.

Fix a finite execution of the atomic commit protocol with a history $h$ . Let $T$ be the set of transactions $t$ such that ${\tt certify}(t,\mathit{l})$ occurs in $h$ . For some transactions $t\in T$ , $\mathit{l}\in\mathcal{L}$ , and shards $s\in{\sf shards}(t)$ , we define the certification order position $\mathit{pos}_{s}[t]$ , the payload $\mathit{pload}_{s}[t]$ and a vote $d_{s}[t]$ computed by the protocol as follows:

Consider $t\in T$ and $s\in{\sf shards}(t)$ . Assume that all followers in $s$ at $\mathit{e}$ received ${\tt ACCEPT}(\mathit{e},k,t,\mathit{l},d)$ and responded to it with ${\tt ACCEPT\_ACK}(s,\mathit{e},k,t,d)$ . Then, we let $\mathit{pos}_{s}[{\sf txn}[k]]=k$ , $\mathit{pload}_{s}[{\sf txn}[k]]=\mathit{l}$ and $d_{s}[{\sf txn}[k]]=d$ .

According to Invariants 2 and 10, this defines $\mathit{pos}_{s}[t]$ , $\mathit{pload}_{s}[t]$ and $d_{s}[t]$ uniquely and (7) in Figure 6 holds. Furthermore, by the structure of the handler at line 1, for each $t$ such that ${\tt decide}(t,d[t])$ occurs in $h$ , $d_{s}[t]$ is defined for all $s\in{\sf shards}(t)$ and (6) holds. By Invariant 7, (8) holds.

We now prove (12). Consider $t,t^{\prime},s$ such that

[TABLE]

Let ${\tt DECISION}(\mathit{e},\mathit{pos}_{s}[t],\_)$ be the message sent to the shard $s$ when the ${\tt decide}(t,\_)$ action was generated. Let $\mathit{e}^{\prime}$ be some epoch at which $\mathit{pos}_{s}[t^{\prime}]$ is defined according to the above definition. Assume first that $\mathit{e}^{\prime}<\mathit{e}$ . Then by Invariant 2 when the leader of $\mathit{e}$ starts operating, it has ${\sf txn}[\mathit{pos}_{s}[t^{\prime}]]=t^{\prime}$ . But then ${\tt certify}(t^{\prime},\_)$ must have occurred before the ${\tt decide}(t,\_)$ . Hence, $\mathit{e}\leq\mathit{e}^{\prime}$ . By Invariant 2 when the leader of $s$ at $\mathit{e}^{\prime}$ receives ${\tt PREPARE}(t^{\prime},\_)$ , it has ${\sf txn}[\mathit{pos}_{s}[t]]=t$ . But then $\mathit{pos}_{s}[t]<\mathit{pos}_{s}[t^{\prime}]$ , which proves (12).

We prove (9)-(11) using the following proposition.

Proposition A.2.

The following always holds at any process in a shard $s$ :

[TABLE]

where the function ${\sf topload}:(\mathcal{T}\times(\mathbb{N}\to\mathcal{T})\times(\mathbb{N}\to\mathcal{L}))\to\mathcal{L}$ determines the payload that a process has stored for a given transaction, i.e., for any transaction $t\in\mathcal{T}$ , and arrays $\mathit{txn}\in\mathbb{N}\to\mathcal{T}$ and $\mathit{payload}\in\mathbb{N}\to\mathcal{L}$ , ${\sf topload}(t,\mathit{txn},\mathit{payload})=\{\mathit{payload}[k]\mid t=\mathit{txn}[k]\}$ . We lift the function to sets of transactions: for any set of transactions $T\subseteq\mathcal{T}$ and arrays $\mathit{txn}\in\mathbb{N}\to\mathcal{T}$ and $\mathit{payload}\in\mathbb{N}\to\mathcal{L}$ , we have ${\sf topload}(T,\mathit{txn},\mathit{payload})=\{{\sf topload}(t,\mathit{txn},\mathit{payload})\mid t\in T\}$ .
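For concreteness, the two set-builder definitions of ${\sf topload}$ can be transcribed into Python as follows. This is only a sketch under stated assumptions: the arrays $\mathit{txn}$ and $\mathit{payload}$ are modelled as dictionaries from positions to values, payloads are assumed to be hashable, $\mathit{payload}[k]$ is assumed to be defined wherever $\mathit{txn}[k]$ is, and the name `topload_set` for the lifted function is hypothetical.

```python
def topload(t, txn, payload):
    """Payloads stored for transaction t: {payload[k] | t == txn[k]}."""
    return {payload[k] for k, tk in txn.items() if tk == t}


def topload_set(T, txn, payload):
    """Lifting to a set of transactions, following the definition literally.

    The result is a set of payload sets; frozenset is used because
    Python sets cannot contain ordinary (mutable) sets.
    """
    return {frozenset(topload(t, txn, payload)) for t in T}
```

For example, with `txn = {1: "t1", 2: "t2"}` and `payload = {1: "l1", 2: "l2"}`, the call `topload("t1", txn, payload)` returns `{"l1"}`, and a transaction that occupies no position yields the empty set.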

Proof.

We prove this by induction on the length of the protocol execution. The validity of the above property can be nontrivially affected only by the transitions at lines 1, 1, 1, and 1.

First consider the transition at line 1, which computes ${\sf vote}[k]$ as follows:

[TABLE]

Then for some $T$ , $P$ we have

[TABLE]

From the last two conjuncts and Invariant 12 we get

[TABLE]

which implies the required.

We next consider the transition at line 1 by a process $p_{i}$ . The induction hypothesis implies that, before the transition at line 1, we have (14) at $p_{i}$ . After processing the ${\tt ACCEPT}(\mathit{e}^{\prime\prime},k^{\prime},t,\mathit{l},d)$ , $p_{i}$ modifies its ${\sf txn}$ , ${\sf vote}$ , ${\sf payload}$ and ${\sf phase}$ arrays by assigning the $k^{\prime}$ position. Fix a $k$ . We distinguish three cases:

(1) $k<k^{\prime}$ . The required trivially follows from the induction hypothesis.

(2) $k=k^{\prime}$ . By Invariant 1, after processing the ${\tt ACCEPT}$ message, $p_{i}$ has ${\sf txn}\mathpunct{\downharpoonleft}_{k^{\prime}}\prec\mathit{txn}\mathpunct{\downharpoonleft}_{k^{\prime}}$ , ${\sf vote}\mathpunct{\downharpoonleft}_{k^{\prime}}\prec\mathit{vote}\mathpunct{\downharpoonleft}_{k^{\prime}}$ and ${\sf payload}\mathpunct{\downharpoonleft}_{k^{\prime}}\prec\mathit{payload}\mathpunct{\downharpoonleft}_{k^{\prime}}$ , where $\mathit{txn}$ , $\mathit{vote}$ and $\mathit{payload}$ are the values of the arrays ${\sf txn}$ , ${\sf vote}$ and ${\sf payload}$ at the leader of $s$ at $\mathit{e}^{\prime\prime}$ when it sent the ${\tt PREPARE\_ACK}(\mathit{e}^{\prime\prime},s,k^{\prime},t,\mathit{l},d)$ . By the induction hypothesis, the leader of $s$ at $\mathit{e}^{\prime\prime}$ satisfies the required before sending the ${\tt PREPARE\_ACK}$ message. Hence, by the fact that $f_{s}$ and $g_{s}$ are distributive, the required is guaranteed at $p_{i}$ for ${\sf vote}[k^{\prime}]$ after the transition.

(3) $k>k^{\prime}$ . We have that before processing the ${\tt ACCEPT}(\mathit{e}^{\prime\prime},k^{\prime},t,\mathit{l},d)$ message, $p_{i}$ has processed ${\tt NEW\_STATE}(\mathit{e}^{\prime\prime},\_,\_,\mathit{txn},\_,\_,\_)$ . After processing ${\tt NEW\_STATE}$ , $p_{i}$ has ${\sf txn}=\mathit{txn}$ , ${\sf vote}=\mathit{vote}$ and ${\sf payload}=\mathit{payload}$ where $\mathit{txn}$ , $\mathit{vote}$ and $\mathit{payload}$ are the arrays ${\sf txn}$ , ${\sf vote}$ and ${\sf payload}$ at the leader of $s$ at $\mathit{e}^{\prime\prime}$ when it sent the ${\tt NEW\_STATE}$ message. Let $m={\sf length}({\sf txn})$ at $p_{i}$ after processing ${\tt ACCEPT}(\mathit{e}^{\prime\prime},k^{\prime},t,\mathit{l},d)$ .

Consider first the case when $m={\sf length}(\mathit{txn})$ . Then $p_{i}$ , after processing the ${\tt NEW\_STATE}$ message and before processing ${\tt ACCEPT}(\mathit{e}^{\prime\prime},k^{\prime},t,\mathit{l},d)$ may have only processed ${\tt ACCEPT}(\mathit{e}^{\prime\prime},k^{*},\_,\_,\_)$ such that $k^{*}\leq m$ . Lines 1 and 1 trivially guarantee that after processing ${\tt ACCEPT}(\mathit{e}^{\prime\prime},k^{\prime},t,\mathit{l},d)$ , $p_{i}$ still has ${\sf txn}=\mathit{txn}$ , ${\sf vote}=\mathit{vote}$ and ${\sf payload}=\mathit{payload}$ . By the induction hypothesis, the leader of $s$ at $\mathit{e}^{\prime\prime}$ satisfies the required before sending the ${\tt NEW\_STATE}$ . Hence, the required is guaranteed at $p_{i}$ after the transition when $m={\sf length}({\sf txn})$ .

Consider now the case when $m>{\sf length}({\sf txn})$ . Therefore, $p_{i}$ must have received an ${\tt ACCEPT}(\mathit{e}^{\prime\prime},m,\_,\_,\_)$ message and responded to it with ${\tt ACCEPT\_ACK}$ before processing ${\tt ACCEPT}(\mathit{e}^{\prime\prime},k^{\prime},t,\mathit{l},d)$ . By Invariant 1, after processing ${\tt ACCEPT}(\mathit{e}^{\prime\prime},m,\_,\_,\_)$ and while ${\sf epoch}[s]=\mathit{e}^{\prime\prime}$ , $p_{i}$ has ${\sf txn}\mathpunct{\downharpoonleft}_{m}\prec\mathit{txn}^{\prime}\mathpunct{\downharpoonleft}_{m}$ , ${\sf vote}\mathpunct{\downharpoonleft}_{m}\prec\mathit{vote}^{\prime}\mathpunct{\downharpoonleft}_{m}$ and ${\sf payload}\mathpunct{\downharpoonleft}_{m}\prec\mathit{payload}^{\prime}\mathpunct{\downharpoonleft}_{m}$ , where $\mathit{txn}^{\prime}$ , $\mathit{vote}^{\prime}$ and $\mathit{payload}^{\prime}$ are the values of the arrays ${\sf txn}$ , ${\sf vote}$ and ${\sf payload}$ at the leader of $s$ at $\mathit{e}^{\prime\prime}$ when it sent the ${\tt PREPARE\_ACK}(\mathit{e}^{\prime\prime},s,m,\_,\_,\_)$ . Thus, after processing ${\tt ACCEPT}(\mathit{e}^{\prime\prime},k^{\prime},\_,\_,\_)$ , $p_{i}$ still has ${\sf txn}\mathpunct{\downharpoonleft}_{m}\prec\mathit{txn}^{\prime}\mathpunct{\downharpoonleft}_{m}$ , ${\sf vote}\mathpunct{\downharpoonleft}_{m}\prec\mathit{vote}^{\prime}\mathpunct{\downharpoonleft}_{m}$ and ${\sf payload}\mathpunct{\downharpoonleft}_{m}\prec\mathit{payload}^{\prime}\mathpunct{\downharpoonleft}_{m}$ . By the induction hypothesis, the leader of $s$ at $\mathit{e}^{\prime\prime}$ satisfies the required before sending ${\tt PREPARE\_ACK}(\mathit{e}^{\prime\prime},s,m,\_,\_,\_)$ . Hence by the fact that $f_{s}$ and $g_{s}$ are distributive, the required is guaranteed at $p_{i}$ after the transition.

Finally, the transitions at lines 1 and 1 are handled easily.∎
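The distributivity property used repeatedly in the proof above, $f(L_{1}\cup L_{2},l)=f(L_{1},l)\sqcap f(L_{2},l)$ , can be checked on a toy certification function. The sketch below is an assumption-laden example rather than a function from the paper: payloads are modelled as hypothetical pairs (read set, write set), the function `f` aborts on a read-write conflict, and $\sqcap$ is taken to be the operator under which abort dominates commit.

```python
COMMIT, ABORT = "commit", "abort"


def meet(d1, d2):
    """Assumed meet on decisions: commit only if both are commit."""
    return COMMIT if d1 == COMMIT and d2 == COMMIT else ABORT


def f(L, l):
    """Toy certification function (hypothetical conflict rule).

    L is a set of payloads, l is the payload being certified; each payload
    is a pair (reads, writes) of frozensets. The vote is abort iff some
    payload in L writes an object that l reads.
    """
    reads, _ = l
    for _, writes in L:
        if reads & writes:
            return ABORT
    return COMMIT


def is_distributive_on(L1, L2, l):
    """Check f(L1 | L2, l) == f(L1, l) meet f(L2, l) for one instance."""
    return f(L1 | L2, l) == meet(f(L1, l), f(L2, l))
```

A function of this shape is distributive because a conflict with the union of two sets of payloads is a conflict with at least one of them; this is what allows a vote computed against a leader's state to be related to the state held at a follower.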

We now prove (9)-(11). Take the earliest point in the execution where $d_{s}[t]$ can be determined as per the definition given earlier. Let $\mathit{e}$ be the epoch used in this definition. Then by Proposition A.2 at this point, at the leader of $s$ at $\mathit{e}$ for some $T,P$ we have

[TABLE]

For the $T,P$ fixed above, and Invariant 2 we get

[TABLE]

which establishes (10) and (11).

By Invariant 7, (8) and by the fact that $f_{s}$ and $g_{s}$ are distributive, we get

[TABLE]

which establishes (9) for the $T_{s}[t],P_{s}[t]$ fixed above.

Finally, we prove (13). To this end, we show that if $t^{\prime}\prec_{\text{rt}}t$ or $t^{\prime}\prec_{\text{dec}}t$ , then a ${\tt DECISION}(t^{\prime},d[t^{\prime}])$ message was sent in the execution, and this had happened before any ${\tt DECISION}(t,\_)$ message was sent. The case of $t^{\prime}\prec_{\text{rt}}t$ is trivial and therefore we only consider the case of $t^{\prime}\prec_{\text{dec}}t$ . Take the earliest point in the execution where we can define $d_{s}[t]$ , and hence, $T_{s}[t]$ and $P_{s}[t]$ (by (15)). Then a ${\tt DECISION}(t,\_)$ message could not have been sent by this point. Assume first that $t^{\prime}\in T_{s}[t]$ . Then by (15) a ${\tt DECISION}(t^{\prime},\textsc{commit})$ message has been sent earlier. Now assume that

[TABLE]

Then at this point ${\sf txn}[\mathit{pos}_{s}[t^{\prime}]]=t^{\prime}$ and ${\sf vote}[\mathit{pos}_{s}[t^{\prime}]]=\textsc{commit}$ , so that by (15) a ${\tt DECISION}(t^{\prime},\textsc{abort})$ message has been sent earlier. We have thus proved (13).∎

Lemma A.3.

If shard-local certification functions $f_{s}$ and $g_{s}$ satisfy (3)-(5), then every history satisfying TCS-LL is correct with respect to $f$ .

Proof.

This follows the proof of Theorem 1 in [5] (Appendix A of https://arxiv.org/pdf/1808.00688) with minimal adjustments.∎

Proof of Theorem 4.1.

Follows from Lemmas A.1 and A.3.∎

Appendix B Proof of Liveness

We prove the nontrivial Theorem 4.2:

If a process $p_{r}$ attempts to reconfigure a shard $s$ and no other process attempts to reconfigure $s$ simultaneously, then if $p_{r}$ is non-faulty for long enough, it will eventually introduce a new configuration.

Proof.

Assume that a process $p_{r}$ attempts to reconfigure a shard $s$ . Take the earliest point in the execution where $p_{r}$ calls the reconfigure function. Let $\mathit{e}$ be the epoch of the last active configuration of $s$ at that point in time. The process $p_{r}$ first queries the configuration service to find the latest introduced configuration of $s$ to start the probing. Let $\mathit{e}^{\prime}$ be the epoch of this configuration.

Assume that the probing eventually ends. After this happens, $p_{r}$ computes the membership of the new configuration $c$ (lines 1). Then $p_{r}$ attempts to write $c$ into the configuration service. Since there is no other process attempting to reconfigure $s$ simultaneously, $p_{r}$ will succeed. This last step introduces $c$ , as required.

We now prove that the probing eventually ends, provided that no other process attempts reconfiguring $s$ simultaneously and $p_{r}$ is non-faulty for long enough. The probing procedure proceeds by iterations in epoch descending order, starting by probing the members of $s$ at $\mathit{e}^{\prime}$ . The process $p_{r}$ only moves to the next iteration after receiving at least one reply from a member of $s$ at the epoch being currently probed while no process replies with ${\tt PROBE\_ACK}(\textsc{true},\mathit{e}+1,s)$ . Consider an arbitrary epoch $\mathit{e}^{\prime\prime}$ such that $\mathit{e}^{\prime\prime}\leq\mathit{e}^{\prime}$ . If $p_{r}$ is probing the members of $s$ at $\mathit{e}^{\prime\prime}$ , then $p_{r}$ has received a ${\tt PROBE\_ACK}(\textsc{false},\mathit{e}+1,s)$ from at least one member of $s$ at each epoch $\mathit{e}^{*}$ such that $\mathit{e}^{\prime\prime}<\mathit{e}^{*}\leq\mathit{e}^{\prime}$ . Furthermore, because of line 1 and the check in line 1, none of these configurations will ever become active. Then by Assumption 1 and the fact that there is no concurrent reconfiguration, $p_{r}$ is guaranteed to receive at least one reply from a member of $s$ at $\mathit{e}^{\prime\prime}$ . Hence, for each epoch $\mathit{e}^{\prime\prime}$ that $p_{r}$ probes, either the whole probing terminates, or $p_{r}$ will eventually move to probe the previous epoch. Assume that $p_{r}$ reaches epoch $\mathit{e}$ . By line 1 and the check in line 1, the configurations of $s$ with epoch $\mathit{e}^{*}$ , such that $\mathit{e}<\mathit{e}^{*}\leq\mathit{e}^{\prime}$ , will never become active. Then by Assumption 1 and the fact that there is no concurrent reconfiguration, $p_{r}$ is guaranteed to receive at least one reply from a member of $s$ at $\mathit{e}$ . That the configuration of $s$ with epoch $\mathit{e}$ became active implies that every member of $s$ at $\mathit{e}$ has ${\sf initialized}=\textsc{true}$ when being probed. 
Hence, the probing procedure is guaranteed to finish.∎
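
The descending-epoch probing loop analysed in this proof can be illustrated with a small Python model. This is only a sketch and not the paper's pseudocode: the function `probe`, the `replies` map and the use of a plain boolean for the first field of ${\tt PROBE\_ACK}$ are assumptions made for the example, and waiting for replies is modelled by simply returning `None`.

```python
from typing import Dict, List, Optional


def probe(start_epoch: int, replies: Dict[int, List[bool]]) -> Optional[int]:
    """Probe epochs in descending order, starting from start_epoch.

    replies[e] is a hypothetical list of the boolean flags carried by the
    PROBE_ACK messages received from members of the shard at epoch e.
    Returns the epoch at which probing ends, or None if probing would
    have to keep waiting for a reply.
    """
    epoch = start_epoch
    while epoch >= 0:
        acks = replies.get(epoch, [])
        if any(acks):
            # Some member replied with `true`: the probing ends here.
            return epoch
        if not acks:
            # No reply at all: the real protocol would keep waiting.
            return None
        # At least one reply and none was `true`: probe the previous epoch.
        epoch -= 1
    return None


# Epochs 5 and 4 were never activated; epoch 3 has an initialized member.
print(probe(5, {5: [False], 4: [False, False], 3: [True, False]}))
```

In this model the liveness argument amounts to showing that the `None` branch is never taken forever: every probed epoch eventually yields a reply, and the last active epoch yields a `true` reply.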

Appendix C RDMA-based Atomic Commit Protocol

We give the pseudocode of the RDMA-based protocol in Figures 7 and 8. The redesigned reconfiguration protocol uses a slightly different set of variables. Instead of the variable ${\sf probing}$ , the protocol uses the variable ${\sf rec\_status}\in\{\textsc{ready},\textsc{probing},\textsc{installing}\}$ to record whether a process is ready to start reconfiguring the system, probing the system or disseminating a new configuration. A variable ${\sf connections}\in 2^{\mathcal{P}}$ records the set of processes to which a process currently maintains an open RDMA connection. Also, the variables ${\sf probed\_epoch}$ and ${\sf probed\_members}$ are now arrays: ${\sf probed\_epoch}\in\mathcal{S}\to\mathbb{N}$ and ${\sf probed\_members}\in\mathcal{S}\to 2^{\mathcal{P}}$ . This change is required because now reconfiguration involves all shards, instead of a single one. Finally, the data structures maintained by the external configuration service and its interface are adjusted as well. Instead of keeping a separate data structure with each shard’s sequence of configurations, the configuration service keeps a single data structure with the system’s sequence of configurations parameterized by shard. Moreover, none of the three operations of the configuration service’s interface take a shard identifier as argument anymore.
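
As a reading aid, the per-process reconfiguration variables described above can be written down as a Python data structure. This is a sketch under assumptions: the class names `RecStatus` and `ReconfigState` are invented for illustration, and process and shard identifiers are represented as strings, which the paper does not prescribe.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Set


class RecStatus(Enum):
    """Values of rec_status: ready, probing or installing."""
    READY = "ready"
    PROBING = "probing"
    INSTALLING = "installing"


@dataclass
class ReconfigState:
    # Whether the process is ready, probing, or disseminating a configuration.
    rec_status: RecStatus = RecStatus.READY
    # Processes to which an RDMA connection is currently open.
    connections: Set[str] = field(default_factory=set)
    # Per-shard arrays, since reconfiguration now involves all shards.
    probed_epoch: Dict[str, int] = field(default_factory=dict)
    probed_members: Dict[str, Set[str]] = field(default_factory=dict)


state = ReconfigState()
state.rec_status = RecStatus.PROBING
state.probed_epoch["s1"] = 3
state.probed_members["s1"] = {"p1", "p2"}
```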

To prove the correctness of the protocol, apart from the set of invariants (Figures 3 and 5) used to prove the correctness of the atomic commit protocol in Figure 1, we require the following invariant, formalizing property (*) from §5:

Assume that the coordinator of a transaction $t$ receives a ${\tt PREPARE\_ACK}(\mathit{e},s,k,t,\mathit{l},d)$ message and sends an ${\tt ACCEPT}(k,t,\mathit{l},d)$ message to a process $p_{i}$ . If $p_{i}$ receives the ${\tt ACCEPT}(k,t,\mathit{l},d)$ message, then it has ${\sf epoch}=\mathit{e}$ right before this.

This invariant trivially holds in the atomic commit protocol in Figure 1 but is nontrivial in the RDMA-based protocol.

We first prove Invariant 13. Then we prove Invariants 3 and 1, whose proofs rely on Invariant 13. We skip the proofs for the rest of the invariants, as these are similar to the proofs of the invariants in Figures 3 and 5 of the protocol in Figure 1, with small adjustments due to differences in the protocols’ pseudocodes. Finally, we prove the following theorem.

Theorem C.1.

A transaction certification service implemented using the protocol in Figures 7 and 8 is correct with respect to a certification function $f$ matching the shard-local certification functions $f_{s}$ and $g_{s}$ .

Proof of Invariant 13.

Assume that the coordinator $p_{c}$ of a transaction $t$ receives a ${\tt PREPARE\_ACK}(\mathit{e},s,k,t,\mathit{l},d)$ message and sends an ${\tt ACCEPT}(k,t,\mathit{l},d)$ message to a process $p_{i}$ . Assume further that $p_{i}$ receives the ${\tt ACCEPT}(k,t,\mathit{l},d)$ message and let ${\sf epoch}=\mathit{e}^{\prime}$ at $p_{i}$ right before this transition. We prove that $\mathit{e}^{\prime}=\mathit{e}$ .

The leader $p_{l}$ of $s$ at $\mathit{e}$ must have received ${\tt PREPARE}(t,\mathit{l})$ and replied with ${\tt PREPARE\_ACK}(\mathit{e},s,k,t,\mathit{l},d)$ , and when it received the message, it had ${\sf epoch}=\mathit{e}$ . Thus, $p_{l}$ must have received ${\tt NEW\_CONFIG}(\mathit{e})$ earlier. Also, by the check in line 7, $p_{i}$ must be a member of $\mathit{e}$ . Then $p_{i}$ had processed ${\tt CONFIG\_PREPARE}(\mathit{e},\_,\_)$ before $p_{c}$ sent the ${\tt ACCEPT}(k,t,\mathit{l},d)$ message to $p_{i}$ . When $p_{i}$ processed ${\tt CONFIG\_PREPARE}(\mathit{e},\_,\_)$ , it had no open connections, either because it was probed (line 8) or because it is a new process. The process $p_{i}$ only opens them, allowing it to receive ${\tt ACCEPT}(k,t,\mathit{l},d)$ , when it receives either a ${\tt NEW\_CONFIG}(\mathit{e}^{*})$ or a ${\tt NEW\_STATE}(\mathit{e}^{*},\_,\_,\_,\_,\_)$ message, so that $e^{\prime}\geq\mathit{e}^{*}$ . By the fact that ${\sf new\_epoch}$ gets updated when processing ${\tt CONFIG\_PREPARE}$ and by the checks in lines 8 and 8, we have that $\mathit{e}^{*}\geq\mathit{e}$ . Then $\mathit{e}^{\prime}\geq\mathit{e}$ .

Assume now that $\mathit{e}^{\prime}>\mathit{e}$ . When $p_{i}$ processes ${\tt ACCEPT}(k,t,\mathit{l},d)$ it has ${\sf epoch}=\mathit{e}^{\prime}$ . Then $p_{i}$ has received ${\tt NEW\_CONFIG}(\mathit{e}^{\prime})$ or ${\tt NEW\_STATE}(\mathit{e}^{\prime},\_,\_,\_,\_,\_)$ before. When processing either of these messages, $p_{i}$ has no open connections. Therefore, $p_{c}$ must have sent ${\tt ACCEPT}(k,t,\mathit{l},d)$ after $p_{i}$ processed ${\tt NEW\_CONFIG}(\mathit{e}^{\prime})$ or ${\tt NEW\_STATE}(\mathit{e}^{\prime},\_,\_,\_,\_,\_)$ . Furthermore, by the checks in lines 8 and 8, $p_{i}$ only establishes connections to the members of $\mathit{e}^{\prime}$ . Thus, $p_{c}$ must be a member of $\mathit{e}^{\prime}$ to successfully send ${\tt ACCEPT}(k,t,\mathit{l},d)$ to $p_{i}$ . Then $p_{c}$ must have received ${\tt CONFIG\_PREPARE}(\mathit{e}^{\prime},\_,\_)$ and replied with ${\tt CONFIG\_PREPARE\_ACK}(\mathit{e}^{\prime})$ before $p_{i}$ processed ${\tt NEW\_CONFIG}(\mathit{e}^{\prime})$ or ${\tt NEW\_STATE}(\mathit{e}^{\prime},\_,\_,\_,\_,\_)$ and therefore before sending ${\tt ACCEPT}(k,t,\mathit{l},d)$ . When processing ${\tt CONFIG\_PREPARE}(\mathit{e}^{\prime},\_,\_)$ , $p_{c}$ has no open connections. Thus, $p_{c}$ can only send ${\tt ACCEPT}(k,t,\mathit{l},d)$ after receiving ${\tt NEW\_CONFIG}(\mathit{e}^{*})$ or ${\tt NEW\_STATE}(\mathit{e}^{*},\_,\_,\_,\_,\_)$ , where $\mathit{e}^{*}\geq\mathit{e}^{\prime}$ . This implies that $p_{c}$ received ${\tt PREPARE\_ACK}(\mathit{e},s,k,t,\mathit{l},d)$ after setting ${\sf epoch}=\mathit{e}^{*}\geq e^{\prime}>e$ . By the check in line 7, $p_{c}$ then would never send ${\tt ACCEPT}(k,t,\mathit{l},d)$ to $p_{i}$ at this point. Hence, we must have $\mathit{e}^{\prime}\leq\mathit{e}$ , which together with $\mathit{e}^{\prime}\geq\mathit{e}$ implies $\mathit{e}^{\prime}=\mathit{e}$ .∎
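
The connection-gating argument behind Invariant 13 can be exercised on a toy Python model. It is a deliberately simplified sketch, not the protocol of Figures 7 and 8: the class `Process`, the helper `try_accept`, the coordinator-side epoch comparison and the rule that an RDMA write needs the connection open on both sides are all assumptions of the example. The driver must also respect the ordering the proof relies on, namely that every member processes ${\tt CONFIG\_PREPARE}$ for an epoch before any member processes ${\tt NEW\_CONFIG}$ for it.

```python
from typing import Set


class Process:
    def __init__(self, name: str) -> None:
        self.name = name
        self.epoch = 0
        self.new_epoch = 0
        self.connections: Set[str] = set()

    def on_config_prepare(self, e: int) -> None:
        # Toy rule: preparing a newer configuration closes all connections.
        if e > self.new_epoch:
            self.new_epoch = e
            self.connections = set()

    def on_new_config(self, e: int, members: Set[str]) -> None:
        # Toy rule: connections reopen only towards members of epoch e.
        if e >= self.new_epoch:
            self.epoch = e
            self.new_epoch = e
            self.connections = members - {self.name}


def try_accept(coord: Process, target: Process, ack_epoch: int) -> bool:
    """True iff an ACCEPT based on a PREPARE_ACK from ack_epoch is delivered."""
    if coord.epoch != ack_epoch:
        return False  # assumed coordinator-side epoch check
    # Assumed: the write succeeds only if both sides hold the connection open.
    return target.name in coord.connections and coord.name in target.connections


members = {"c", "i"}
c, i = Process("c"), Process("i")
for p in (c, i):
    p.on_config_prepare(1)
for p in (c, i):
    p.on_new_config(1, members)
delivered = try_accept(c, i, 1)  # both processes are in epoch 1
```

Whenever `try_accept` returns `True` in runs of this model that follow the two-phase ordering, the target's `epoch` equals `ack_epoch`, which mirrors the statement of the invariant.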

Proof of Invariant 1.

Assume that a process $p_{i}$ in $s$ at $\mathit{e}$ processes ${\tt ACCEPT}(k,t,\mathit{l},d)$ . We prove that, after the transition and while ${\sf epoch}[s]=\mathit{e}$ , $p_{i}$ has ${\sf txn}\mathpunct{\downharpoonleft}_{k}\prec\mathit{txn}\mathpunct{\downharpoonleft}_{k}$ , ${\sf vote}\mathpunct{\downharpoonleft}_{k}\prec\mathit{vote}\mathpunct{\downharpoonleft}_{k}$ and ${\sf payload}\mathpunct{\downharpoonleft}_{k}\prec\mathit{payload}\mathpunct{\downharpoonleft}_{k}$ , where $\mathit{txn}$ , $\mathit{vote}$ and $\mathit{payload}$ are the values of the arrays ${\sf txn}$ , ${\sf vote}$ and ${\sf payload}$ at the leader $p_{l}$ of $s$ at $\mathit{e}$ when it sent the corresponding message ${\tt PREPARE\_ACK}(\mathit{e},s,k,t,\mathit{l},d)$ .

By Invariant 13, if $p_{i}$ processes ${\tt ACCEPT}(k,t,\mathit{l},d)$ , then $p_{i}$ has ${\sf epoch}=\mathit{e}$ . Thus, $p_{i}$ has processed ${\tt NEW\_STATE}(\mathit{e},\mathit{txn}^{\prime},\mathit{payload}^{\prime},\mathit{vote}^{\prime},\_,\_)$ before. After processing this message, $p_{i}$ has ${\sf txn}=\mathit{txn}^{\prime}$ , ${\sf vote}=\mathit{vote}^{\prime}$ and ${\sf payload}=\mathit{payload}^{\prime}$ where $\mathit{txn}^{\prime}$ , $\mathit{vote}^{\prime}$ and $\mathit{payload}^{\prime}$ are the values of the arrays ${\sf txn}$ , ${\sf vote}$ and ${\sf payload}$ at the leader $p_{l}$ of $s$ at $\mathit{e}$ when it sent the ${\tt NEW\_STATE}$ message. Let $k^{\prime}={\sf length}(\mathit{txn}^{\prime})$ . By lines 7, 7 and 8 we have that $\mathit{txn}\mathpunct{\downharpoonleft}_{k^{\prime}}=\mathit{txn}^{\prime}$ , $\mathit{vote}\mathpunct{\downharpoonleft}_{k^{\prime}}=\mathit{vote}^{\prime}$ and $\mathit{payload}\mathpunct{\downharpoonleft}_{k^{\prime}}=\mathit{payload}^{\prime}$ . By Invariant 13, $p_{i}$ has ${\sf txn}\mathpunct{\downharpoonleft}_{k^{\prime}}\prec\mathit{txn}\mathpunct{\downharpoonleft}_{k^{\prime}}$ , ${\sf vote}\mathpunct{\downharpoonleft}_{k^{\prime}}\prec\mathit{vote}\mathpunct{\downharpoonleft}_{k^{\prime}}$ and ${\sf payload}\mathpunct{\downharpoonleft}_{k^{\prime}}\prec\mathit{payload}\mathpunct{\downharpoonleft}_{k^{\prime}}$ while ${\sf epoch}[s]=\mathit{e}$ . Furthermore, after processing ${\tt ACCEPT}(k,t,\mathit{l},d)$ , $p_{i}$ has ${\sf txn}[k]=\mathit{txn}[k]$ , ${\sf vote}[k]=\mathit{vote}[k]$ and ${\sf payload}[k]=\mathit{payload}[k]$ . By Invariants 9 and 13, $p_{i}$ has ${\sf txn}[k]=\mathit{txn}[k]$ , ${\sf vote}[k]=\mathit{vote}[k]$ and ${\sf payload}[k]=\mathit{payload}[k]$ while ${\sf epoch}[s]=\mathit{e}$ .

We now prove that after processing ${\tt NEW\_STATE}(\mathit{e},\mathit{txn}^{\prime},\mathit{payload}^{\prime},\mathit{vote}^{\prime},\_,\_)$ and while ${\sf epoch}[s]=\mathit{e}$ , $p_{i}$ has ${\sf txn}[k^{\prime\prime}]\in\{\mathit{txn}[k^{\prime\prime}],\bot\}$ , ${\sf vote}[k^{\prime\prime}]\in\{\mathit{vote}[k^{\prime\prime}],\bot\}$ and ${\sf payload}[k^{\prime\prime}]\in\{\mathit{payload}[k^{\prime\prime}],\bot\}$ for any $k^{\prime\prime}$ such that $k^{\prime}<k^{\prime\prime}<k$ . We prove it by induction on the length of the protocol execution from the moment in which $p_{i}$ has processed ${\tt NEW\_STATE}(\mathit{e},\mathit{txn}^{\prime},\mathit{payload}^{\prime},\mathit{vote}^{\prime},\_,\_)$ . The validity of the property can be affected only by the transition at line 7. Let ${\tt ACCEPT}(k^{*},t^{*},\mathit{l}^{*},d^{*})$ be the message that triggers the transition. Assume that $k^{\prime}<k^{*}<k$ , as otherwise the transition does not affect the validity of the property. By the induction hypothesis, $p_{i}$ has ${\sf txn}[k^{\prime\prime}]\in\{\mathit{txn}[k^{\prime\prime}],\bot\}$ , ${\sf vote}[k^{\prime\prime}]\in\{\mathit{vote}[k^{\prime\prime}],\bot\}$ and ${\sf payload}[k^{\prime\prime}]\in\{\mathit{payload}[k^{\prime\prime}],\bot\}$ for any $k^{\prime\prime}\neq k^{*}$ such that $k^{\prime}<k^{\prime\prime}<k$ after processing ${\tt ACCEPT}(k^{*},t^{*},\mathit{l}^{*},d^{*})$ . Also, after processing ${\tt ACCEPT}(k^{*},t^{*},\mathit{l}^{*},d^{*})$ , $p_{i}$ has ${\sf txn}[k^{*}]=t^{*}$ , ${\sf vote}[k^{*}]=d^{*}$ and ${\sf payload}[k^{*}]=\mathit{l}^{*}$ . By Invariant 13, $t^{*}$ must have been prepared by the leader $p_{l}$ of $s$ at $\mathit{e}$ . Then, by lines 7, 7 and 8, $\mathit{txn}[k^{*}]=t^{*}$ , $\mathit{vote}[k^{*}]=d^{*}$ and $\mathit{payload}[k^{*}]=\mathit{l}^{*}$ . Then, after processing ${\tt ACCEPT}(k^{*},t^{*},\mathit{l}^{*},d^{*})$ , $p_{i}$ has ${\sf txn}[k^{*}]=\mathit{txn}[k^{*}]$ , ${\sf vote}[k^{*}]=\mathit{vote}[k^{*}]$ and ${\sf payload}[k^{*}]=\mathit{payload}[k^{*}]$ . We have already proved that (i) $p_{i}$ has ${\sf txn}\mathpunct{\downharpoonleft}_{k^{\prime}}\prec\mathit{txn}\mathpunct{\downharpoonleft}_{k^{\prime}}$ , ${\sf vote}\mathpunct{\downharpoonleft}_{k^{\prime}}\prec\mathit{vote}\mathpunct{\downharpoonleft}_{k^{\prime}}$ and ${\sf payload}\mathpunct{\downharpoonleft}_{k^{\prime}}\prec\mathit{payload}\mathpunct{\downharpoonleft}_{k^{\prime}}$ after processing ${\tt NEW\_STATE}(\mathit{e},\mathit{txn}^{\prime},\mathit{payload}^{\prime},\mathit{vote}^{\prime},\_,\_)$ and while ${\sf epoch}[s]=\mathit{e}$ ; and that (ii) $p_{i}$ has ${\sf txn}[k]=\mathit{txn}[k]$ , ${\sf vote}[k]=\mathit{vote}[k]$ and ${\sf payload}[k]=\mathit{payload}[k]$ after processing ${\tt ACCEPT}(k,t,\mathit{l},d)$ and while ${\sf epoch}[s]=\mathit{e}$ . Hence, $p_{i}$ has ${\sf txn}\mathpunct{\downharpoonleft}_{k}\prec\mathit{txn}\mathpunct{\downharpoonleft}_{k}$ , ${\sf vote}\mathpunct{\downharpoonleft}_{k}\prec\mathit{vote}\mathpunct{\downharpoonleft}_{k}$ and ${\sf payload}\mathpunct{\downharpoonleft}_{k}\prec\mathit{payload}\mathpunct{\downharpoonleft}_{k}$ after processing ${\tt ACCEPT}(k,t,\mathit{l},d)$ and while ${\sf epoch}[s]=\mathit{e}$ , as required. ∎
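
The property established by this proof can be checked mechanically on small arrays. The sketch below assumes a reading of the notation that this appendix does not restate: $a\mathpunct{\downharpoonleft}_{k}$ is taken to be the first $k$ entries of $a$, and $a\prec b$ is taken to hold when every entry of $a$ is either $\bot$ or equal to the corresponding entry of $b$, which matches the case analysis used in the induction. The names `BOTTOM`, `restrict` and `precedes` are invented for the example.

```python
from typing import List, Optional, Sequence

BOTTOM = None  # stands for the undefined entry (bottom)


def restrict(arr: Sequence[Optional[str]], k: int) -> List[Optional[str]]:
    """Assumed reading of the restriction: keep the first k entries."""
    return list(arr[:k])


def precedes(a: Sequence[Optional[str]], b: Sequence[Optional[str]]) -> bool:
    """Assumed reading of the ordering: a agrees with b wherever a is defined."""
    if len(a) > len(b):
        return False
    return all(x is BOTTOM or x == y for x, y in zip(a, b))


leader_txn = ["t1", "t2", "t3"]
follower_txn = ["t1", BOTTOM, "t3"]    # the slot for t2 is not yet filled
diverged_txn = ["t1", "tX", "t3"]      # disagrees with the leader at one slot

ok = precedes(restrict(follower_txn, 3), restrict(leader_txn, 3))
bad = precedes(restrict(diverged_txn, 3), restrict(leader_txn, 3))
```

Here `ok` holds because the follower's only difference from the leader is an unfilled slot, while `bad` fails because a filled slot disagrees.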

Proof of Invariant 3.

When $p_{i}$ processed ${\tt PROBE}(\mathit{e})$ , it closed all connections. This prevents $p_{i}$ from acknowledging any ${\tt ACCEPT}$ message until it processes a ${\tt NEW\_CONFIG}(\mathit{e}^{*})$ or a ${\tt NEW\_STATE}(\mathit{e}^{*},\_,\_,\_,\_,\_)$ message. When $p_{i}$ processed ${\tt PROBE}(\mathit{e})$ , it also set ${\sf new\_epoch}=\mathit{e}$ . By the checks in lines 8 and 8 and by the fact that ${\sf new\_epoch}$ can never decrease, this guarantees that $p_{i}$ only handles either of these messages if $\mathit{e}^{*}\geq e$ . Hence, by the time $p_{i}$ is able to process ${\tt ACCEPT}$ messages again it will have ${\sf epoch}=\mathit{e}^{*}>\mathit{e}^{\prime}$ . Since ${\sf epoch}$ never decreases at a process, from this point on, by Invariant 13, $p_{i}$ will not process any ${\tt ACCEPT}$ message prepared in an epoch preceding $\mathit{e}^{*}$ , as required.∎

Lemma C.2.

The atomic commit protocol in Figures 7 and 8 is a correct implementation of TCS-LL.

Proof.

This follows the proof of Lemma A.1 with minimal adjustments.∎

Proof of Theorem C.1.

Follows from Lemmas C.2 and A.3.∎

Bibliography

[1] M. Balakrishnan, D. Malkhi, V. Prabhakaran, T. Wobber, M. Wei, and J. D. Davis. Corfu: A shared log design for flash clusters. In Conference on Networked Systems Design and Implementation (NSDI), 2012.
[2] C. Binnig, A. Crotty, A. Galakatos, T. Kraska, and E. Zamanian. The end of slow networks: It’s time for a redesign. PVLDB, 9(7), 2016.
[3] K. Birman, D. Malkhi, and R. V. Renesse. Virtually synchronous methodology for building dynamic reliable services. In K. Birman, editor, Guide to Reliable Distributed Systems - Building High-Assurance Applications and Cloud-Hosted Services, Texts in Computer Science, chapter 22. Springer, 2012.
[4] F. Chang et al. Bigtable: A distributed storage system for structured data. In Symposium on Operating Systems Design and Implementation (OSDI), 2006.
[5] G. Chockler and A. Gotsman. Multi-shot distributed transaction commit. In Symposium on Distributed Computing (DISC), 2018.
[6] J. C. Corbett et al. Spanner: Google’s globally-distributed database. In Symposium on Operating Systems Design and Implementation (OSDI), 2012.
[7] A. Dragojević, D. Narayanan, O. Hodson, and M. Castro. FaRM: Fast remote memory. In Conference on Networked Systems Design and Implementation (NSDI), 2014.
[8] A. Dragojević, D. Narayanan, E. B. Nightingale, M. Renzelmann, A. Shamis, A. Badam, and M. Castro. No compromises: Distributed transactions with consistency, availability, and performance. In Symposium on Operating Systems Principles (SOSP), 2015.