An algorithm for geo-distributed and redundant storage in Garage

Mendes Oulamara; Alex Auvolat

arXiv:2302.13798·cs.DS·February 28, 2023


TL;DR

This paper introduces an optimal algorithm for assigning data to storage nodes in a geo-distributed system, optimizing storage efficiency and redundancy while analyzing its complexity and user metrics.

Contribution

The paper proposes a novel optimal algorithm specifically designed for data assignment in geo-distributed, redundant storage systems like Garage.

Findings

01 Algorithm achieves optimal data placement

02 Complexity analysis of each algorithm step

03 Metrics for user display and system monitoring

Abstract

This paper presents an optimal algorithm to compute the assignment of data to storage nodes in the Garage geo-distributed storage system. We discuss the complexity of the different steps of the algorithm and metrics that can be displayed to the user.

Equations

$$s^{*}=\min_{n\in\mathbf{N}}\frac{c_{n}}{p_{n}}.$$

$$d(\alpha,\alpha^{\prime}):=\#\{(n,p)\in\mathbf{N}\times\mathbf{P}\mid n\in\alpha_{p}\triangle\alpha^{\prime}_{p}\}$$

$$d(f,f^{\prime})=\frac{1}{2}\Big(\#X+\sum_{e\in X}\big(1_{f(e)\neq f^{\prime}(e)}-1_{f(e)=f^{\prime}(e)}\big)\Big).$$

$$w(\gamma)=\frac{1}{2}\Big(\sum_{e\in X,e\in\gamma}\big(-1_{f(e)\neq f^{\prime}(e)}+1_{f(e)=f^{\prime}(e)}\big)+\sum_{e\in X,e\in\gamma}\big(1_{(f+\gamma)(e)\neq f^{\prime}(e)}-1_{(f+\gamma)(e)=f^{\prime}(e)}\big)\Big).$$

$$d(f,f^{\prime})+w(\gamma)=d(f+\gamma,f^{\prime}).$$

$$\deg v=\sum_{u\in V}\big(f^{*}(v,u)-f(v,u)\big)=\sum_{u\in V}f^{*}(v,u)-\sum_{u\in V}f(v,u).$$


Taxonomy

Topics: Optimization and Search Problems · Advanced Manufacturing and Logistics Optimization · Mobile Agent-Based Network Management

Full text


Mendes Oulamara, Deuxfleurs

Alex Auvolat, Deuxfleurs


1 Introduction

Garage (https://garagehq.deuxfleurs.fr/) is an open-source distributed object storage service tailored for self-hosting. It was designed by the Deuxfleurs association (https://deuxfleurs.fr/) to enable small structures (associations, collectives, small companies) to share storage resources to reliably self-host their data, possibly with old and unreliable machines. To achieve these reliability and availability goals, the data is broken into partitions and every partition is replicated over 3 different machines (that we call nodes). When the data is queried, it is fetched from one of the nodes. A replication factor of 3 ensures good guarantees regarding node failure [1], but this parameter can be set to another (preferably larger and odd) number.

Moreover, if the nodes are spread over different zones (different houses, offices, cities…), we can require the data to be replicated over nodes belonging to different zones. This improves the storage robustness against zone failures (such as power outages). To do so, we define a scattering factor, which is at most the replication factor, and we require that the replicas of any partition are spread over at least this number of zones.

In this work, we propose an assignment algorithm that, given the nodes specifications and the replication and scattering factors, computes an optimal assignment of partitions to nodes. We say that the assignment is optimal in the sense that it maximizes the size of the partitions, and hence the effective storage capacity of the system.

Moreover, when a former assignment exists, which is not optimal anymore due to node or zone changes, our algorithm computes a new optimal assignment that minimizes the amount of data to be transferred during the assignment update (the transfer load).

We call the set of nodes cooperating to store the data a cluster, and a description of the nodes, zones and the assignment of partitions to nodes a cluster layout.

1.1 Notations

Let $k$ be some fixed parameter value, typically 8, that we call the “partition bits”. Every object to be stored in the system is split into data blocks of fixed size. We compute a hash $h(\mathbf{b})$ of every such block $\mathbf{b}$ , and we define the first $k$ bits of this hash to be the partition number $p(\mathbf{b})$ of the block. This label can take $P=2^{k}$ different values, and hence there are $P$ different partitions. We denote by $\mathbf{P}$ the set of partition labels (i.e. $\mathbf{P}=\llbracket 1,P\rrbracket$ ).
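As a minimal sketch of this partition numbering, assuming SHA-256 stands in for the hash $h$ (the text does not name the hash function) and using 0-based labels instead of $\llbracket 1,P\rrbracket$; the function name `partition_number` is hypothetical:

```python
import hashlib

def partition_number(block: bytes, k: int = 8) -> int:
    """Return the first k bits of the block's hash, as an integer in [0, 2**k)."""
    digest = hashlib.sha256(block).digest()  # assumption: SHA-256 plays the role of h
    as_int = int.from_bytes(digest, "big")
    # Shift away everything except the k most significant bits.
    return as_int >> (len(digest) * 8 - k)
```

With the typical value $k=8$, the partition number is simply the first byte of the digest.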

We are given a set $\mathbf{N}$ of $N$ nodes and a set $\mathbf{Z}$ of $Z$ zones. Every node $n$ has a non-negative storage capacity $c_{n}\geq 0$ and belongs to a zone $z_{n}\in\mathbf{Z}$ . We are also given a replication factor $\rho_{\mathbf{N}}$ and a scattering factor $\rho_{\mathbf{Z}}$ such that $1\leq\rho_{\mathbf{Z}}\leq\rho_{\mathbf{N}}$ (typical values would be $\rho_{\mathbf{N}}=\rho_{\mathbf{Z}}=3$ ).

Our goal is to compute an assignment $\alpha=(\alpha_{p}^{1},\ldots,\alpha_{p}^{\rho_{\mathbf{N}}})_{p\in\mathbf{P}}$ such that every partition $p$ is associated to $\rho_{\mathbf{N}}$ distinct nodes $\alpha_{p}^{1},\ldots,\alpha_{p}^{\rho_{\mathbf{N}}}\in\mathbf{N}$ and these nodes belong to at least $\rho_{\mathbf{Z}}$ distinct zones. Among the possible assignments, we choose one that maximizes the effective storage capacity of the cluster. If the layout contained a previous assignment $\alpha^{\prime}$ , we minimize the amount of data to transfer during the layout update by making $\alpha$ as close as possible to $\alpha^{\prime}$ . These maximization and minimization are described more formally in the following section.
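The two constraints on an assignment can be checked directly. The sketch below assumes an assignment is stored as a dictionary from partition label to a list of nodes; this representation and the name `is_valid_assignment` are illustrative choices, not taken from Garage.

```python
def is_valid_assignment(assignment, zone_of, rho_n, rho_z):
    """Check the replication and scattering constraints.

    assignment: dict partition -> list of nodes (hypothetical representation)
    zone_of:    dict node -> zone
    """
    for nodes in assignment.values():
        # Replication: exactly rho_n distinct nodes per partition.
        if len(nodes) != rho_n or len(set(nodes)) != rho_n:
            return False
        # Scattering: the nodes span at least rho_z distinct zones.
        if len({zone_of[n] for n in nodes}) < rho_z:
            return False
    return True
```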

1.2 Optimization objectives

To link the effective storage capacity of the cluster to partition assignment, we make the following assumption:

Assumption: all partitions have the same size $s$ .

This assumption is justified by the dispersion of the hashing function, when the number of partitions is small relative to the number of stored blocks.

Every node $n$ will store some number $p_{n}$ of partitions (it is the number of partitions $p$ such that $n$ appears in the $\alpha_{p}$ ). Hence the partitions stored by $n$ (and hence all partitions by our assumption) have their size bounded by $c_{n}/p_{n}$ . This remark leads us to define the optimal size that we will want to maximize:

$$s^{*}=\min_{n\in\mathbf{N}}\frac{c_{n}}{p_{n}}.\qquad\text{(OPT)}$$
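For a given assignment, the bound $\min_{n}c_{n}/p_{n}$ on the partition size can be evaluated as follows. This is a sketch under the same hypothetical dictionary representation (partition to list of nodes); nodes that store no partition are skipped, since they do not constrain the size.

```python
from collections import Counter

def partition_size_bound(assignment, capacity):
    """Return min over used nodes of c_n / p_n for the given assignment."""
    # p_n: number of partitions in which node n appears.
    load = Counter(n for nodes in assignment.values() for n in nodes)
    return min(capacity[n] / load[n] for n in load)
```

For example, with capacities `{"a": 10, "b": 8, "c": 3}` and the assignment `{0: ["a", "b"], 1: ["a", "c"]}`, node `a` stores two partitions and the bound is `min(10/2, 8/1, 3/1) = 3.0`.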

When the capacities of the nodes are updated (this includes adding or removing a node), we want to update the assignment as well. However, transferring the data between nodes has a cost and we would like to limit the number of changes in the assignment. We make the following assumption:

[TABLE]

This assumption justifies that when we compute the new assignment $\alpha$ , it is worth optimizing the partition size (OPT) first, and then, among the possible optimal solutions, trying to minimize the number of partition transfers. More formally, we minimize the distance between two assignments defined by

$$d(\alpha,\alpha^{\prime}):=\#\{(n,p)\in\mathbf{N}\times\mathbf{P}\mid n\in\alpha_{p}\triangle\alpha^{\prime}_{p}\}$$

where the symmetric difference $\alpha_{p}\triangle\alpha^{\prime}_{p}$ denotes the nodes appearing in one of the assignments but not in both.
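The distance between two assignments counts, partition by partition, the nodes that appear in exactly one of them. A minimal sketch, again assuming assignments are dictionaries from partition to a collection of nodes (a hypothetical representation):

```python
def assignment_distance(new, old):
    """Number of (node, partition) pairs in the symmetric difference of two assignments."""
    partitions = set(new) | set(old)
    return sum(
        len(set(new.get(p, ())) ^ set(old.get(p, ())))  # ^ is the symmetric difference
        for p in partitions
    )
```

Replacing one node of one partition changes the distance by 2: the removed node and the added node each count once.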

2 Computation of an optimal assignment

The algorithm that we propose takes as inputs the cluster layout parameters $\mathbf{N}$ , $\mathbf{Z}$ , $\mathbf{P}$ , $(c_{n})_{n\in\mathbf{N}}$ , $\rho_{\mathbf{N}}$ , $\rho_{\mathbf{Z}}$ , that we defined in the introduction, together with the former assignment $\alpha^{\prime}$ (if any). The computation of the new optimal assignment $\alpha^{*}$ is done in three successive steps that will be detailed in the following sections. The first step computes the largest partition size $s^{*}$ that an assignment can achieve. The second step computes an optimal candidate assignment $\alpha$ that achieves $s^{*}$ , using a heuristic that tends to make it close to $\alpha^{\prime}$ . The third step modifies $\alpha$ iteratively to reduce $d(\alpha,\alpha^{\prime})$ and yields an assignment $\alpha^{*}$ achieving $s^{*}$ , and minimizing $d(\cdot,\alpha^{\prime})$ among such assignments.

We will explain in the next section how to represent an assignment $\alpha$ by a flow $f$ on a weighted graph $G$ to enable the use of flow and graph algorithms. The main function of the algorithm can be written as follows.

Algorithm

1:function Compute Layout( $\mathbf{N}$ , $\mathbf{Z}$ , $\mathbf{P}$ , $(c_{n})_{n\in\mathbf{N}}$ , $\rho_{\mathbf{N}}$ , $\rho_{\mathbf{Z}}$ , $\alpha^{\prime}$ )

2: $s^{*}\leftarrow$ Compute Partition Size( $\mathbf{N}$ , $\mathbf{Z}$ , $\mathbf{P}$ , $(c_{n})_{n\in\mathbf{N}}$ , $\rho_{\mathbf{N}}$ , $\rho_{\mathbf{Z}}$ )

3: $G\leftarrow G(s^{*})$

4: $f\leftarrow$ Compute Candidate Assignment( $G$ , $\alpha^{\prime}$ )

5: $f^{*}\leftarrow$ Minimize transfer load( $G$ , $f$ , $\alpha^{\prime}$ )

6: Build $\alpha^{*}$ from $f^{*}$

7: return $\alpha^{*}$

8:end function

Complexity

As we will see in the next sections, the worst case complexity of this algorithm is $O(P^{2}N^{2})$ . The minimization of transfer load is the most expensive step, and it can run with a timeout since it is only an optimization step. Without this step (or with a smart timeout), the worst case complexity can be $O((PN)^{3/2}\log C)$ where $C$ is the total storage capacity of the cluster.

2.1 Determination of the partition size $s^{*}$

We will represent an assignment $\alpha$ as a flow in a specific graph $G$ . Note that such a flow must have value $\rho_{\mathbf{N}}P$ . We will not compute the optimal partition size $s^{*}$ a priori, but we will determine it by dichotomy (binary search), as the largest size $s$ such that the maximal flow achievable on $G=G(s)$ has value $\rho_{\mathbf{N}}P$ . We will assume that the capacities are given in a small enough unit (e.g. megabytes), and we will determine $s^{*}$ at the precision of the given unit.

Given some candidate size value $s$ , we describe the oriented weighted graph $G=(V,E)$ with vertex set $V$ and arc set $E$ (see Figure 1).

The set of vertices $V$ contains the source $\mathbf{s}$ , the sink $\mathbf{t}$ , vertices $\mathbf{p^{+},p^{-}}$ for every partition $p$ , vertices $\mathbf{x}_{p,z}$ for every partition $p$ and zone $z$ , and vertices $\mathbf{n}$ for every node $n$ .

The set of arcs $E$ contains:

• ( $\mathbf{s}$ , $\mathbf{p}^{+}$ , $\rho_{\mathbf{Z}}$ ) for every partition $p$ ;

• ( $\mathbf{s}$ , $\mathbf{p}^{-}$ , $\rho_{\mathbf{N}}-\rho_{\mathbf{Z}}$ ) for every partition $p$ ;

• ( $\mathbf{p}^{+}$ , $\mathbf{x}_{p,z}$ , 1) for every partition $p$ and zone $z$ ;

• ( $\mathbf{p}^{-}$ , $\mathbf{x}_{p,z}$ , $\rho_{\mathbf{N}}-\rho_{\mathbf{Z}}$ ) for every partition $p$ and zone $z$ ;

• ( $\mathbf{x}_{p,z}$ , $\mathbf{n}$ , 1) for every partition $p$ , zone $z$ and node $n\in z$ ;

• ( $\mathbf{n}$ , $\mathbf{t}$ , $\lfloor c_{n}/s\rfloor$ ) for every node $n$ .

In the following complexity calculations, we will use the number of vertices and edges of $G$ . Remark for now that $\#V=O(PZ)$ and $\#E=O(PN)$ .
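The graph $G(s)$ can be built and tested for feasibility as sketched below. Two caveats: the vertex encoding (tuples such as `("x", p, z)`) and the function names are hypothetical, and the maximal flow is computed with the simpler Edmonds-Karp algorithm rather than the Dinic's algorithm used in the paper, so this illustrates the construction and not the stated complexity.

```python
from collections import defaultdict, deque

def build_graph(s, partitions, zone_of, capacity, rho_n, rho_z):
    """Capacities cap[u][v] of G(s); 's' and 't' are the source and the sink."""
    cap = defaultdict(dict)
    zones = set(zone_of.values())
    for p in partitions:
        cap["s"][("p+", p)] = rho_z
        cap["s"][("p-", p)] = rho_n - rho_z
        for z in zones:
            cap[("p+", p)][("x", p, z)] = 1
            cap[("p-", p)][("x", p, z)] = rho_n - rho_z
        for n, z in zone_of.items():
            cap[("x", p, z)][("n", n)] = 1
    for n, c in capacity.items():
        cap[("n", n)]["t"] = c // s  # floor(c_n / s)
    return cap

def max_flow(cap, source="s", sink="t"):
    """Edmonds-Karp: repeatedly augment along a shortest residual path."""
    res = defaultdict(lambda: defaultdict(int))
    for u, out in cap.items():
        for v, c in out.items():
            res[u][v] += c
            res[v][u] += 0  # make the reversed arc visible in the residual graph
    total = 0
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, r in res[u].items():
                if r > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total  # no augmenting path left: the flow is maximal
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
        total += bottleneck
```

A size $s$ is achievable exactly when `max_flow(build_graph(s, ...))` equals $\rho_{\mathbf{N}}P$. For three nodes of capacity 10 in three distinct zones, 4 partitions and $\rho_{\mathbf{N}}=\rho_{\mathbf{Z}}=3$, the flow value is 12 for $s=2$ (feasible) and only 9 for $s=3$ (infeasible).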

Proposition 1.

An assignment $\alpha$ is realizable with partition size $s$ and replication and scattering factors $(\rho_{\mathbf{N}},\rho_{\mathbf{Z}})$ if and only if there exists a maximal flow function $f$ in $G$ with total flow $\rho_{\mathbf{N}}P$ , such that the arcs ( $\mathbf{x}_{p,z}$ , $\mathbf{n}$ , 1) used are exactly those for which $p$ is associated to $n$ in $\alpha$ .

Proof.

Given such a flow $f$ , we can reconstruct a candidate $\alpha$ . In $f$ , the total flow passing through $\mathbf{p^{+}}$ and $\mathbf{p^{-}}$ is $\rho_{\mathbf{N}}$ , and since every arc outgoing from a vertex $\mathbf{x}_{p,z}$ has capacity 1, every partition is associated to $\rho_{\mathbf{N}}$ distinct nodes. The part of the flow passing through $\mathbf{p^{+}}$ , of value $\rho_{\mathbf{Z}}$ , must be spread over $\rho_{\mathbf{Z}}$ distinct zones, since every arc outgoing from $\mathbf{p^{+}}$ has capacity 1. So the reconstructed $\alpha$ satisfies the replication and scattering constraints. For every node $n$ , the flow between $\mathbf{n}$ and $\mathbf{t}$ corresponds to the number of partitions associated to $n$ . Since $f$ is a flow on $G$ , this does not exceed $\lfloor c_{n}/s\rfloor$ . We assumed that the partition size is $s$ , hence this association does not exceed the storage capacity of the nodes.

In the other direction, given an assignment $\alpha$ , one can similarly check that the replication and scattering constraints, together with the storage capacities of the nodes, are exactly what is needed to construct a maximal flow function $f$ from $\alpha$ . ∎

Implementation remark.

In the flow algorithm, while exploring the graph, we explore the neighbours of every vertex in a random order to heuristically spread the associations between nodes and partitions.

Algorithm

With this result in mind, we can describe the first step of our algorithm. All divisions are assumed to be integer divisions.

1:function Compute Partition Size( $\mathbf{N}$ , $\mathbf{Z}$ , $\mathbf{P}$ , $(c_{n})_{n\in\mathbf{N}}$ , $\rho_{\mathbf{N}}$ , $\rho_{\mathbf{Z}}$ )

2: Build the graph $G=G(s=1)$

3: $f\leftarrow$ Maximal flow( $G$ )

4: if $f.\mathrm{totalflow}<\rho_{\mathbf{N}}P$ then

5: return Error: capacities too small or constraints too strong.

6: end if

7: $s^{-}\leftarrow 1$

8: $s^{+}\leftarrow 1+\frac{1}{\rho_{\mathbf{N}}}\sum_{n\in\mathbf{N}}c_{n}$

9: while $s^{-}+1<s^{+}$ do

10: Build the graph $G=G(s=(s^{-}+s^{+})/2)$

11: $f\leftarrow$ Maximal flow( $G$ )

12: if $f.\mathrm{totalflow}<\rho_{\mathbf{N}}P$ then

13: $s^{+}\leftarrow(s^{-}+s^{+})/2$

14: else

15: $s^{-}\leftarrow(s^{-}+s^{+})/2$

16: end if

17: end while

18: return $s^{-}$

19:end function
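The dichotomy above can be sketched with the feasibility test abstracted as a callable. Here `feasible(s)` is a hypothetical oracle standing for "the maximal flow on $G(s)$ has value $\rho_{\mathbf{N}}P$"; it must be monotone (true up to $s^{*}$, false beyond).

```python
def largest_partition_size(feasible, capacities, rho_n):
    """Binary search for the largest integer s such that feasible(s) holds."""
    if not feasible(1):
        raise ValueError("capacities too small or constraints too strong")
    s_minus = 1                                # invariant: feasible(s_minus) is True
    s_plus = 1 + sum(capacities) // rho_n      # invariant: feasible(s_plus) is False
    while s_minus + 1 < s_plus:
        mid = (s_minus + s_plus) // 2
        if feasible(mid):
            s_minus = mid
        else:
            s_plus = mid
    return s_minus

# Simplified oracle, valid only for rho_N = rho_Z = 1 (every partition on one node):
# size s is achievable iff the node quotas floor(c_n / s) add up to at least P.
caps, P = [10, 20], 4
print(largest_partition_size(lambda s: sum(c // s for c in caps) >= P, caps, rho_n=1))  # prints 6
```

In the general case the oracle would run a maximal flow computation on $G(s)$, which is where the $\log C$ factor in the complexity comes from.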

Complexity

To compute the maximal flow, we use Dinic’s algorithm [2]. Its complexity on general graphs is $O(\#V^{2}\#E)$ , but on graphs with edge capacity bounded by a constant, it turns out to be $O(\#E^{3/2})$ . The graph $G$ does not fall in this case since the capacities of the arcs incoming to $\mathbf{t}$ are far from bounded. However, the proof of this complexity function works readily for graphs where we only ask the edges not incoming to the sink $\mathbf{t}$ to have their capacities bounded by a constant. One can find the proof of this claim in [3, Section 2]. The dichotomy adds a logarithmic factor $\log(C)$ where $C=\sum_{n\in\mathbf{N}}c_{n}$ is the total capacity of the cluster. The total complexity of this first function is hence $O(\#E^{3/2}\log C)=O\big{(}(PN)^{3/2}\log C\big{)}$ .

Metrics

We can display the discrepancy between the computed $s^{*}$ and the best size we could have hoped for given the total capacity, that is $C/(\rho_{\mathbf{N}}P)$ , since the $\rho_{\mathbf{N}}P$ partition replicas must fit within the total capacity $C$ .

2.2 Computation of a candidate assignment

Now that we have the optimal partition size $s^{*}$ , to compute a candidate assignment it would be enough to compute a maximal flow function $f$ on $G(s^{*})$ . This is what we do if there is no former assignment $\alpha^{\prime}$ .

If there is some $\alpha^{\prime}$ , we add a step that will heuristically help to obtain a candidate $\alpha$ closer to $\alpha^{\prime}$ . We first compute a flow function $\tilde{f}$ that uses only the partition-to-node associations appearing in $\alpha^{\prime}$ . Most likely, $\tilde{f}$ will not be a maximal flow of $G(s^{*})$ . In Dinic’s algorithm, we can start from a non-maximal flow function and then discover improving paths. This is what we do by starting from $\tilde{f}$ . The hope is that the final flow function $f$ will tend to keep the associations appearing in $\tilde{f}$ . This is only a hope, because one can find examples where the construction of $f$ from $\tilde{f}$ produces an assignment $\alpha$ that is not as close as possible to $\alpha^{\prime}$ .

More formally, we construct the graph $G_{|\alpha^{\prime}}$ from $G$ by removing all the arcs $(\mathbf{x}_{p,z},\mathbf{n},1)$ where $p$ is not associated to $n$ in $\alpha^{\prime}$ . We compute a maximal flow function $\tilde{f}$ in $G_{|\alpha^{\prime}}$ . The flow $\tilde{f}$ is also a valid (most likely non-maximal) flow function on $G$ . We compute a maximal flow function $f$ on $G$ by starting Dinic’s algorithm with $\tilde{f}$ .
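The restriction to $G_{|\alpha^{\prime}}$ only filters the arcs of type $(\mathbf{x}_{p,z},\mathbf{n})$. A minimal sketch, assuming these arcs are listed as triples `(p, z, n)` and the previous assignment is a dictionary from partition to a collection of nodes (both representations are hypothetical):

```python
def restrict_to_previous(x_arcs, previous):
    """Keep only the arcs (x_{p,z}, n) whose association p -> n appears in the previous assignment."""
    return [(p, z, n) for (p, z, n) in x_arcs if n in previous.get(p, ())]
```

All other arcs of $G$ are kept unchanged, so any flow on the restricted graph is also a flow on $G$.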

Algorithm

1:function Compute Candidate Assignment( $G$ , $\alpha^{\prime}$ )

2: Build the graph $G_{|\alpha^{\prime}}$

3: $\tilde{f}\leftarrow$ Maximal flow( $G_{|\alpha^{\prime}}$ )

4: $f\leftarrow$ Maximal flow from flow( $G$ , $\tilde{f}$ )

5: return $f$

6:end function

Remark

The function “Maximal flow” can be just seen as the function “Maximal flow from flow” called with the zero flow function as starting flow.

Complexity

With the considerations of the last section, we have the complexity of Dinic’s algorithm $O(\#E^{3/2})=O((PN)^{3/2})$ .

Metrics

We can display the flow value of $\tilde{f}$ , which is an upper bound of the distance between $\alpha$ and $\alpha^{\prime}$ , although this information might not be very relevant to end users.

2.3 Minimization of the transfer load

Now that we have a candidate flow function $f$ , we want to modify it to make its corresponding assignment $\alpha$ as close as possible to $\alpha^{\prime}$ . Denote by $f^{\prime}$ the maximal flow corresponding to $\alpha^{\prime}$ , and let $d(f,\alpha^{\prime})=d(f,f^{\prime}):=d(\alpha,\alpha^{\prime})$ (this is the number of arcs of type $(\mathbf{x}_{p,z},\mathbf{n})$ saturated in one flow and not in the other). We want to build a sequence $f=f_{0},f_{1},f_{2}\dots$ of maximal flows such that $d(f_{i},\alpha^{\prime})$ decreases as $i$ increases. The distance being a non-negative integer, this sequence of flow functions must be finite. We now explain how to find some improving $f_{i+1}$ from $f_{i}$ .

For any maximal flow $f$ in $G$ , we define the oriented weighted graph $G_{f}=(V,E_{f})$ as follows. The vertices of $G_{f}$ are the same as the vertices of $G$ . $E_{f}$ contains the arc $(v_{1},v_{2},w)$ between vertices $v_{1},v_{2}\in V$ with weight $w$ if and only if the arc $(v_{1},v_{2})$ is not saturated in $f$ (i.e. $c(v_{1},v_{2})-f(v_{1},v_{2})\geq 1$ , we also consider reversed arcs). The weight $w$ is:

• $-1$ if $(v_{1},v_{2})$ is of type $(\mathbf{x}_{p,z},\mathbf{n})$ or $(\mathbf{n},\mathbf{x}_{p,z})$ and is saturated in only one of the two flows $f,f^{\prime}$ ;

• $+1$ if $(v_{1},v_{2})$ is of type $(\mathbf{x}_{p,z},\mathbf{n})$ or $(\mathbf{n},\mathbf{x}_{p,z})$ and is saturated in either both or none of the two flows $f,f^{\prime}$ ;

• $0$ otherwise.

If $\gamma$ is a simple cycle of arcs in $G_{f}$ , we define its weight $w(\gamma)$ as the sum of the weights of its arcs. We can add $+1$ to the value of $f$ on the arcs of $\gamma$ , and by construction of $G_{f}$ and the fact that $\gamma$ is a cycle, the function that we get is still a valid flow function on $G$ ; it is maximal as it has the same flow value as $f$ . We denote this new function $f+\gamma$ .

Proposition 2.

Given a maximal flow $f$ and a simple cycle $\gamma$ in $G_{f}$ , we have $d(f+\gamma,f^{\prime})-d(f,f^{\prime})=w(\gamma)$ .

Proof.

Let $X$ be the set of arcs of type $(\mathbf{x}_{p,z},\mathbf{n})$ . Then we can express $d(f,f^{\prime})$ as

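$$d(f,f^{\prime})=\frac{1}{2}\Big(\#X+\sum_{e\in X}\big(1_{f(e)\neq f^{\prime}(e)}-1_{f(e)=f^{\prime}(e)}\big)\Big).$$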

We can express the cycle weight as

$$w(\gamma)=\sum_{e\in X,e\in\gamma}\big(-1_{f(e)\neq f^{\prime}(e)}+1_{f(e)=f^{\prime}(e)}\big).$$

Remark that since we passed one unit of flow in $\gamma$ to construct $f+\gamma$ , we have for any $e\in X$ , $f(e)=f^{\prime}(e)$ if and only if $(f+\gamma)(e)\neq f^{\prime}(e)$ . Hence

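$$w(\gamma)=\frac{1}{2}\Big(\sum_{e\in X,e\in\gamma}\big(-1_{f(e)\neq f^{\prime}(e)}+1_{f(e)=f^{\prime}(e)}\big)+\sum_{e\in X,e\in\gamma}\big(1_{(f+\gamma)(e)\neq f^{\prime}(e)}-1_{(f+\gamma)(e)=f^{\prime}(e)}\big)\Big).$$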

Plugging this in the previous equation, we find that

$$d(f,f^{\prime})+w(\gamma)=d(f+\gamma,f^{\prime}).$$

∎
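Proposition 2 can be checked numerically on a small abstraction of the flows. In this sketch a flow is reduced to the set of saturated arcs of type $(\mathbf{x}_{p,z},\mathbf{n})$, written as pairs `(p, n)`, and a cycle is reduced to the list of such arcs it traverses (each at most once, in either direction); pushing one unit along the cycle toggles their saturation. All names are hypothetical.

```python
def distance(sat_f, sat_ref):
    """d(f, f'): arcs of X saturated in exactly one of the two flows."""
    return len(sat_f ^ sat_ref)

def cycle_weight(cycle_x_arcs, sat_f, sat_ref):
    """w(gamma): +1 per arc on which f and f' agree, -1 per arc on which they differ."""
    return sum(+1 if (e in sat_f) == (e in sat_ref) else -1 for e in cycle_x_arcs)

def push_cycle(cycle_x_arcs, sat_f):
    """f + gamma: every X arc on the cycle changes saturation state."""
    return sat_f ^ set(cycle_x_arcs)

f = {(0, "a"), (0, "b")}          # current flow: partition 0 on nodes a and b
f_ref = {(0, "a"), (0, "c")}      # previous assignment: partition 0 on nodes a and c
gamma = [(0, "b"), (0, "c")]      # cycle that moves partition 0 from b to c
assert distance(push_cycle(gamma, f), f_ref) - distance(f, f_ref) == cycle_weight(gamma, f, f_ref)
```

Here the cycle has weight $-2$ and the distance drops from 2 to 0, as the proposition states.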

This result suggests that given some flow $f_{i}$ , we just need to find a negative cycle $\gamma$ in $G_{f_{i}}$ to construct $f_{i+1}$ as $f_{i}+\gamma$ . The following proposition ensures that this greedy strategy reaches an optimal flow.

Proposition 3.

For any maximal flow $f$ , $G_{f}$ contains a negative cycle if and only if there exists a maximal flow $f^{*}$ in $G$ such that $d(f^{*},f^{\prime})<d(f,f^{\prime})$ .

Proof.

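If $G_{f}$ contains a negative cycle, then it contains a simple negative cycle $\gamma$ , and by Proposition 2 the maximal flow $f+\gamma$ satisfies $d(f+\gamma,f^{\prime})<d(f,f^{\prime})$ . Conversely, suppose that there is such a flow $f^{*}$ . Define the oriented multigraph $M_{f,f^{*}}=(V,E_{M})$ with the same vertex set $V$ as in $G$ , and for every $v_{1},v_{2}\in V$ , $E_{M}$ contains $(f^{*}(v_{1},v_{2})-f(v_{1},v_{2}))_{+}$ copies of the arc $(v_{1},v_{2})$ . For every vertex $v$ , its total degree (meaning its outer degree minus its inner degree) is equal to

$$\deg v=\sum_{u\in V}\big(f^{*}(v,u)-f(v,u)\big)=\sum_{u\in V}f^{*}(v,u)-\sum_{u\in V}f(v,u).$$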

The last two sums are zero for any inner vertex since $f,f^{*}$ are flows, and they are equal on the source and sink since the two flows are both maximal and have hence the same value. Thus, $\deg v=0$ for every vertex $v$ .

This implies that the multigraph $M_{f,f^{*}}$ is the union of disjoint simple cycles. $f$ can be transformed into $f^{*}$ by pushing one unit of flow along all these cycles in any order. Since $d(f^{*},f^{\prime})<d(f,f^{\prime})$ , there must exist one of these simple cycles $\gamma$ with $d(f+\gamma,f^{\prime})<d(f,f^{\prime})$ . Finally, since we can push one unit of flow in $f$ along $\gamma$ , it must appear in $G_{f}$ . Hence $\gamma$ is a cycle of $G_{f}$ with negative weight. ∎

In the next section we describe the corresponding algorithm. Instead of discovering only one cycle per iteration, we are allowed to discover a set $\Gamma$ of disjoint negative cycles.

Algorithm

1:function Minimize transfer load( $G$ , $f$ , $\alpha^{\prime}$ )

2: Build the graph $G_{f}$

3: $\Gamma\leftarrow$ Detect Negative Cycles( $G_{f}$ )

4: while $\Gamma\neq\emptyset$ do

5: for all $\gamma\in\Gamma$ do

6: $f\leftarrow f+\gamma$

7: end for

8: Update $G_{f}$

9: $\Gamma\leftarrow$ Detect Negative Cycles( $G_{f}$ )

10: end while

11: return $f$

12:end function

Complexity

The distance $d(f,f^{\prime})$ is bounded by the maximal number of differences in the associated assignments. If these assignments are totally disjoint, this distance is $2\rho_{\mathbf{N}}P$ . At every iteration of the While loop, the distance decreases, so there are at most $O(\rho_{\mathbf{N}}P)=O(P)$ iterations.

The detection of negative cycles is done with the Bellman-Ford algorithm, whose complexity should normally be $O(\#E\#V)$ . In our case, it amounts to $O(P^{2}ZN)$ . Multiplied by the complexity of the outer loop, it amounts to $O(P^{3}ZN)$ which is a lot when the number of partitions and nodes starts to be large. To avoid that, we adapt the Bellman-Ford algorithm.

The Bellman-Ford algorithm runs $\#V$ iterations of an outer loop, and an inner loop over $E$ . The idea is to compute the shortest paths from a source vertex $v$ to all other vertices. After $k$ iterations of the outer loop, the algorithm has computed all shortest paths of length at most $k$ . All simple paths have length at most $\#V-1$ , so if there is an update in the last iteration of the loop, it means that there is a negative cycle in the graph. The observation that will enable us to improve the complexity is the following:

Proposition 4.

In the graph $G_{f}$ (and $G$ ), all simple paths have a length at most $4N$ .

Proof.

Since $f$ is a maximal flow, there is no outgoing edge from $\mathbf{s}$ in $G_{f}$ . One can thus check that any simple path of length 4 must contain at least two nodes of type $\mathbf{n}$ . Hence on a path, at most 4 arcs separate two successive nodes of type $\mathbf{n}$ . ∎

Thus, in the absence of negative cycles, shortest paths in $G_{f}$ have length at most $4N$ . So we can do only $4N+1$ iterations of the outer loop in the Bellman-Ford algorithm. This makes the complexity of the detection of one set of cycles $O(N\#E)=O(N^{2}P)$ .
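A bounded Bellman-Ford detection can be sketched as follows. This is an illustrative version that returns a single negative cycle rather than a set of disjoint ones; the `max_rounds` parameter is hypothetical and plays the role of the $4N+1$ bound, and it is only sound when it exceeds the length of the longest simple path.

```python
def find_negative_cycle(vertices, arcs, max_rounds=None):
    """Return the vertices of a negative cycle in order, or None if none is detected.

    arcs: list of (u, v, weight). Every vertex starts at distance 0, which is
    equivalent to adding a virtual source connected to all vertices.
    """
    dist = {v: 0 for v in vertices}
    pred = {v: None for v in vertices}
    rounds = len(vertices) if max_rounds is None else max_rounds
    last = None
    for _ in range(rounds):
        last = None
        for u, v, w in arcs:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                pred[v] = u
                last = v
        if last is None:
            return None  # a full round without update: no negative cycle
    # An update in the final round: follow predecessors until a vertex repeats.
    seen = set()
    v = last
    while v is not None and v not in seen:
        seen.add(v)
        v = pred[v]
    if v is None:
        return None  # inconclusive, can only happen if max_rounds was too small
    cycle = [v]
    u = pred[v]
    while u != v:
        cycle.append(u)
        u = pred[u]
    cycle.reverse()
    return cycle
```

On the triangle with arcs `("a","b",1)`, `("b","c",-1)`, `("c","a",-1)` the function returns the three vertices in an order that follows the arcs; with all weights equal to 1 it returns `None` after a single round.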

With this improvement, the complexity of the whole algorithm is, in the worst case, $O(N^{2}P^{2})$ . However, since we detect several cycles at once and we start with a flow that might be close to the previous one, the number of iterations of the outer loop might be smaller in practice.

Metrics

We can display the node and zone utilization ratio, obtained by dividing the flow passing through them by their outgoing capacity. In particular, we can pinpoint saturated nodes and zones (i.e. used at their full potential).
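For nodes, this ratio is the number of stored partitions $p_{n}$ divided by the quota $\lfloor c_{n}/s\rfloor$. A minimal sketch with hypothetical names, skipping nodes whose quota is zero:

```python
def node_utilization(partitions_per_node, capacity, s):
    """Ratio p_n / floor(c_n / s) for every node with a non-zero quota."""
    ratios = {}
    for n, c in capacity.items():
        quota = c // s  # capacity of the arc (n, t) in G(s)
        if quota > 0:
            ratios[n] = partitions_per_node.get(n, 0) / quota
    return ratios
```

A node is saturated when its ratio equals 1; for instance `node_utilization({"a": 5, "b": 1}, {"a": 10, "b": 4}, 2)` gives `{"a": 1.0, "b": 0.5}`.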

We can display the distance to the previous assignment, and the number of partition transfers.

3 Related work

In previous versions of Garage, we iterated through many algorithms to build an assignment of partitions to nodes, always with unsatisfactory results. These previous attempts, all based on existing work, are described in this section.

Basic consistent hashing with zone awareness

In this algorithm, we use the simple consistent hashing ring described in Dynamo [4]. We slightly adapt it to support nodes in different zones and the requirement to spread replicas over as many zones as possible: when looking up the nodes associated to a data block, we walk the ring starting from the position corresponding to its hash, but we skip nodes that are in a zone from which we have already selected a node (except if there are no more distinct zones to take nodes from). This method had the disadvantage of giving a very unbalanced distribution of data between nodes. For example, suppose that there are many consecutive nodes on the ring that are in zones 1 and 2, followed by one node in zone 3. Then that node will store a copy of all data blocks whose hashes are in the interval before it that contains only nodes of zone 1 and 2.
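The zone-aware ring walk can be sketched as below. This is a simplified reading of the description above, not Garage's historical code: it assumes one ring position per node, uses 64 bits of SHA-256 as the ring position, and the names `ring_position` and `lookup` are hypothetical.

```python
import bisect
import hashlib

def ring_position(key: bytes) -> int:
    # Assumption: 64 bits of SHA-256 serve as the position on the ring.
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")

def lookup(ring, zone_of, key: bytes, replicas: int = 3):
    """ring: list of (position, node) sorted by position, one entry per node."""
    positions = [pos for pos, _ in ring]
    start = bisect.bisect_left(positions, ring_position(key))
    # Nodes in the order they are met when walking the ring from the key's position.
    walk = [ring[(start + i) % len(ring)][1] for i in range(len(ring))]
    chosen, zones = [], set()
    for node in walk:  # first lap: skip nodes whose zone is already represented
        if len(chosen) < replicas and zone_of[node] not in zones:
            chosen.append(node)
            zones.add(zone_of[node])
    for node in walk:  # second lap: no distinct zone left, accept any remaining node
        if len(chosen) < replicas and node not in chosen:
            chosen.append(node)
    return chosen
```

The imbalance described above is visible in this sketch: a node that is alone in its zone is selected by the first lap for every key, whatever its position on the ring.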

Arbitrary ring positions vs. fixed partition boundaries

As already discussed in the Dynamo paper [4] (see the three different strategies presented in Figure 7), using the hashes of node identifiers as positions on the consistent hashing ring makes the intervals between these positions of wildly varying sizes, worsening the imbalance of storage assigned to the nodes. To resolve this issue, we very rapidly switched to dividing the consistent hashing ring into equally sized parts (what we call partitions), as shown in Dynamo’s strategies 2 and 3. To ensure that all nodes handle a number of partitions strictly proportional to their capacity, we tried using the MagLev algorithm [5] to assign partitions to nodes. However, just doing this does not solve the zone awareness issue; continuing to use the simple ring walking where nodes are skipped still produces a very imbalanced distribution.

Multi-zone aware MagLev

Our next try was to improve the MagLev algorithm to be multi-zone aware. Now, instead of assigning a single node to each ring position (each partition) and walking the ring to find three nodes starting at a given key’s hash, we directly assign a set of three nodes to each partition and completely abandon ring walking. The first node of the three is computed for all partitions by using the standard MagLev algorithm. Then, the next two are computed using a variant of MagLev that skips assigning nodes to partitions when they are in zones of nodes already selected for that partition (unless there are no more distinct zones available), selecting other nodes instead. This way, we ensure that the three nodes assigned to each partition are in as many distinct zones as possible. This method provided a perfectly equitable distribution of data among nodes. However, when layout changes occurred, the entire assignment was recomputed without taking into account the previous one, and thus there was no way to ensure that a minimal amount of data was displaced from one node to another.

Stateful assignment algorithms

In all of the previous iterations, we were limiting ourselves to algorithms that were stateless: the assignment had to be computed in a deterministic way from only the list of node identifiers and their zone and capacity information, using hash functions to provide pseudo-randomness. To be able to minimize the transfer load on layout changes, we had to switch to a stateful method where the entire assignment is computed offline and then propagated to all cluster nodes. It can now be computed using any arbitrary optimization algorithm that can take as an input the previous assignment to minimize transfer load. This method was introduced in Garage version 0.5 with a simple greedy optimization algorithm that was not optimal, which was in use until version 0.8. The final, optimal assignment algorithm is the one we presented in this paper, which will be included in Garage version 0.9 and forward.

Acknowledgements

This project has received funding from the European Union’s Horizon 2020 research and innovation programme within the framework of the NGI-POINTER Project funded under grant agreement N° 871528.

Bibliography

[1] M. Raynal, Building Read/Write Registers Despite Asynchrony and Less than Half of Processes Crash ($t<n/2$), pp. 95–117. Cham: Springer International Publishing, 2018.
[2] Y. Dinitz, “Algorithm for solution of a problem of maximum flow in networks with power estimation,” Soviet Math. Dokl., vol. 11, pp. 1277–1280, 1970.
[3] S. Even and R. E. Tarjan, “Network flow and testing graph connectivity,” SIAM Journal on Computing, vol. 4, no. 4, pp. 507–518, 1975.
[4] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, “Dynamo: Amazon’s highly available key-value store,” ACM SIGOPS Operating Systems Review, vol. 41, no. 6, pp. 205–220, 2007.
[5] D. E. Eisenbud, C. Yi, C. Contavalli, C. Smith, R. Kononov, E. Mann-Hielscher, A. Cilingiroglu, B. Cheyney, W. Shang, and J. D. Hosein, “Maglev: A fast and reliable software network load balancer,” in 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16), pp. 523–535, 2016.
