An algorithm for geo-distributed and redundant storage in Garage
Mendes Oulamara, Alex Auvolat

TL;DR
This paper introduces an optimal algorithm for assigning data to storage nodes in a geo-distributed system, optimizing storage efficiency and redundancy while analyzing its complexity and user metrics.
Contribution
The paper proposes a novel optimal algorithm specifically designed for data assignment in geo-distributed, redundant storage systems like Garage.
Findings
Algorithm achieves optimal data placement
Complexity analysis of each algorithm step
Metrics for user display and system monitoring
Abstract
This paper presents an optimal algorithm to compute the assignment of data to storage nodes in the Garage geo-distributed storage system. We discuss the complexity of the different steps of the algorithm and metrics that can be displayed to the user.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsOptimization and Search Problems · Advanced Manufacturing and Logistics Optimization · Mobile Agent-Based Network Management
An algorithm for geo-distributed and redundant storage in Garage
Mendes Oulamara [email protected] Deuxfleurs
Alex Auvolat [email protected] Deuxfleurs
Abstract
This paper presents an optimal algorithm to compute the assignment of data to storage nodes in the Garage geo-distributed storage system. We discuss the complexity of the different steps of the algorithm and metrics that can be displayed to the user.
1 Introduction
Garage111https://garagehq.deuxfleurs.fr/ is an open-source distributed object storage service tailored for self-hosting. It was designed by the Deuxfleurs association222https://deuxfleurs.fr/ to enable small structures (associations, collectives, small companies) to share storage resources to reliably self-host their data, possibly with old and non-reliable machines. To achieve these reliability and availability goals, the data is broken into partitions and every partition is replicated over 3 different machines (that we call nodes). When the data is queried, it is fetched from one of the nodes. A replication factor of 3 ensures good guarantees regarding node failure[1]. But this parameter can be another (preferably larger and odd) number.
Moreover, if the nodes are spread over different zones (different houses, offices, cities…), we can require the data to be replicated over nodes belonging to different zones. This improves the storage robustness against zone failures (such as power outages). To do so, we define a scattering factor, that is no more than the replication factor, and we require that the replicas of any partition are spread over this number of zones at least.
In this work, we propose an assignment algorithm that, given the nodes specifications and the replication and scattering factors, computes an optimal assignment of partitions to nodes. We say that the assignment is optimal in the sense that it maximizes the size of the partitions, and hence the effective storage capacity of the system.
Moreover, when a former assignment exists, which is not optimal anymore due to node or zone changes, our algorithm computes a new optimal assignment that minimizes the amount of data to be transferred during the assignment update (the transfer load).
We call the set of nodes cooperating to store the data a cluster, and a description of the nodes, zones and the assignment of partitions to nodes a cluster layout
1.1 Notations
Let be some fixed parameter value, typically 8, that we call the “partition bits”. Every object to be stored in the system is split into data blocks of fixed size. We compute a hash of every such block , and we define the first bits of this hash to be the partition number of the block. This label can take different values, and hence there are different partitions. We denote the set of partition labels (i.e. ).
We are given a set of nodes and a set of zones. Every node has a non-negative storage capacity and belongs to a zone . We are also given a replication factor and a scattering factor such that (typical values would be ).
Our goal is to compute an assignment such that every partition is associated to distinct nodes and these nodes belong to at least distinct zones. Among the possible assignments, we choose one that maximizes the effective storage capacity of the cluster. If the layout contained a previous assignment , we minimize the amount of data to transfer during the layout update by making as close as possible to . These maximization and minimization are described more formally in the following section.
1.2 Optimization objectives
To link the effective storage capacity of the cluster to partition assignment, we make the following assumption:
[TABLE]
This assumption is justified by the dispersion of the hashing function, when the number of partitions is small relative to the number of stored blocks.
Every node will store some number of partitions (it is the number of partitions such that appears in the ). Hence the partitions stored by (and hence all partitions by our assumption) have their size bounded by . This remark leads us to define the optimal size that we will want to maximize:
[TABLE]
When the capacities of the nodes are updated (this includes adding or removing a node), we want to update the assignment as well. However, transferring the data between nodes has a cost and we would like to limit the number of changes in the assignment. We make the following assumption:
[TABLE]
This assumption justifies that when we compute the new assignment , it is worth to optimize the partition size (OPT) first, and then, among the possible optimal solutions, to try to minimize the number of partition transfers. More formally, we minimize the distance between two assignments defined by
[TABLE]
where the symmetric difference denotes the nodes appearing in one of the assignments but not in both.
2 Computation of an optimal assignment
The algorithm that we propose takes as inputs the cluster layout parameters , , , , , , that we defined in the introduction, together with the former assignment (if any). The computation of the new optimal assignment is done in three successive steps that will be detailed in the following sections. The first step computes the largest partition size that an assignment can achieve. The second step computes an optimal candidate assignment that achieves and a heuristic is used in the computation to make it hopefully close to . The third steps modifies iteratively to reduces and yields an assignment achieving , and minimizing among such assignments.
We will explain in the next section how to represent an assignment by a flow on a weighted graph to enable the use of flow and graph algorithms. The main function of the algorithm can be written as follows.
Algorithm
1:function Compute Layout(, , , , , , )
2: Compute Partition Size(, , , , , )
3:
4: Compute Candidate Assignment(, )
5: Minimize transfer load(, , )
6: Build from
7: return
8:end function
Complexity
As we will see in the next sections, the worst case complexity of this algorithm is . The minimization of transfer load is the most expensive step, and it can run with a timeout since it is only an optimization step. Without this step (or with a smart timeout), the worst case complexity can be where is the total storage capacity of the cluster.
2.1 Determination of the partition size
We will represent an assignment as a flow in a specific graph . Remark that such flow must have value . We will not compute the optimal partition size a priori, but we will determine it by dichotomy, as the largest size such that the maximal flow achievable on has value . We will assume that the capacities are given in a small enough unit (e.g. megabytes), and we will determine at the precision of the given unit.
Given some candidate size value , we describe the oriented weighted graph with vertex set and arc set (see Figure 1).
The set of vertices contains the source , the sink , vertices for every partition , vertices for every partition and zone , and vertices for every node .
The set of arcs contains:
- •
(,, ) for every partition ;
- •
(,, ) for every partition ;
- •
(,, 1) for every partition and zone ;
- •
(,, ) for every partition and zone ;
- •
(,, 1) for every partition , zone and node ;
- •
(, , ) for every node .
In the following complexity calculations, we will use the number of vertices and edges of . Remark for now that and .
Proposition 1**.**
An assignment is realizable with partition size and replication and scattering factors if and only if there exists a maximal flow function in with total flow , such that the arcs (,, 1) used are exactly those for which is associated to in .
Proof.
Given such flow , we can reconstruct a candidate . In , the flow passing through and is , and since the outgoing capacity of every is 1, every partition is associated to distinct nodes. The fraction of the flow passing through every must be spread over as many distinct zones as every arc outgoing from has capacity 1. So the reconstructed verifies the replication and scattering constraints. For every node , the flow between and corresponds to the number of partitions associated to . By construction of , this does not exceed . We assumed that the partition size is , hence this association does not exceed the storage capacity of the nodes.
In the other direction, given an assignment , one can similarly check that the facts that respects the replication and scattering constraints, and the storage capacities of the nodes, are necessary condition to construct a maximal flow function . ∎
Implementation remark.
In the flow algorithm, while exploring the graph, we explore the neighbours of every vertex in a random order to heuristically spread the associations between nodes and partitions.
Algorithm
With this result mind, we can describe the first step of our algorithm. All divisions are supposed to be integer divisions.
1:function Compute Partition Size(, , , , , )
2: Build the graph
3: Maximal flow()
4: if then
5: return Error: capacities too small or constraints too strong.
6: end if
7:
8:
9: while do
10: Build the graph
11: Maximal flow()
12: if then
13:
14: else
15:
16: end if
17: end while
18: return
19:end function
Complexity
To compute the maximal flow, we use Dinic’s algorithm [2]. Its complexity on general graphs is , but on graphs with edge capacity bounded by a constant, it turns out to be . The graph does not fall in this case since the capacities of the arcs incoming to are far from bounded. However, the proof of this complexity function works readily for graphs where we only ask the edges not incoming to the sink to have their capacities bounded by a constant. One can find the proof of this claim in [3, Section 2]. The dichotomy adds a logarithmic factor where is the total capacity of the cluster. The total complexity of this first function is hence O(\#E^{3/2}\log C)=O\big{(}(PN)^{3/2}\log C\big{)}.
Metrics
We can display the discrepancy between the computed and the best size we could have hoped for the given total capacity, that is .
2.2 Computation of a candidate assignment
Now that we have the optimal partition size , to compute a candidate assignment it would be enough to compute a maximal flow function on . This is what we do if there is no former assignment .
If there is some , we add a step that will heuristically help to obtain a candidate closer to . We fist compute a flow function that uses only the partition-to-node associations appearing in . Most likely, will not be a maximal flow of . In Dinic’s algorithm, we can start from a non maximal flow function and then discover improving paths. This is what we do by starting from . The hope333This is only a hope, because one can find examples where the construction of from produces an assignment that is not as close as possible to . is that the final flow function will tend to keep the associations appearing in .
More formally, we construct the graph from by removing all the arcs where is not associated to in . We compute a maximal flow function in . The flow is also a valid (most likely non maximal) flow function on . We compute a maximal flow function on by starting Dinic’s algorithm with .
Algorithm
1:function Compute Candidate Assignment(, )
2: Build the graph
3: Maximal flow()
4: Maximal flow from flow(, )
5: return
6:end function
Remark
The function “Maximal flow” can be just seen as the function “Maximal flow from flow” called with the zero flow function as starting flow.
Complexity
With the considerations of the last section, we have the complexity of Dinic’s algorithm .
Metrics
We can display the flow value of , which is an upper bound of the distance between and , although this information might not be very relevant to end users.
2.3 Minimization of the transfer load
Now that we have a candidate flow function , we want to modify it to make its corresponding assignment as close as possible to . Denote by the maximal flow corresponding to , and let 444It is the number of arcs of type saturated in one flow and not in the other.. We want to build a sequence of maximal flows such that decreases as increases. The distance being a non-negative integer, this sequence of flow functions must be finite. We now explain how to find some improving from .
For any maximal flow in , we define the oriented weighted graph as follows. The vertices of are the same as the vertices of . contains the arc between vertices with weight if and only if the arc is not saturated in (i.e. , we also consider reversed arcs). The weight is:
- •
if is of type or and is saturated in only one of the two flows ;
- •
if is of type or and is saturated in either both or none of the two flows ;
- •
[math] otherwise.
If is a simple cycle of arcs in , we define its weight as the sum of the weights of its arcs. We can add to the value of on the arcs of , and by construction of and the fact that is a cycle, the function that we get is still a valid flow function on , it is maximal as it has the same flow value as . We denote this new function .
Proposition 2**.**
Given a maximal flow and a simple cycle in , we have .
Proof.
Let be the set of arcs of type . Then we can express as
[TABLE]
We can express the cycle weight as
[TABLE]
Remark that since we passed one unit of flow in to construct , we have for any , if and only if . Hence
[TABLE]
Plugging this in the previous equation, we find that
[TABLE]
∎
This result suggests that given some flow , we just need to find a negative cycle in to construct as . The following proposition ensures that this greedy strategy reaches an optimal flow.
Proposition 3**.**
For any maximal flow , contains a negative cycle if and only if there exists a maximal flow in such that .
Proof.
Suppose that there is such flow . Define the oriented multigraph with the same vertex set as in , and for every , contains copies of the arc . For every vertex , its total degree (meaning its outer degree minus its inner degree) is equal to
[TABLE]
The last two sums are zero for any inner vertex since are flows, and they are equal on the source and sink since the two flows are both maximal and have hence the same value. Thus, for every vertex .
This implies that the multigraph is the union of disjoint simple cycles. can be transformed into by pushing a mass 1 along all these cycles in any order. Since , there must exists one of these simple cycles with . Finally, since we can push a mass in along , it must appear in . Hence is a cycle of with negative weight. ∎
In the next section we describe the corresponding algorithm. Instead of discovering only one cycle per iteration, we are allowed to discover a set of disjoint negative cycles.
Algorithm
1:function Minimize transfer load(, , )
2: Build the graph
3: Detect Negative Cycles()
4: while do
5: for all do
6:
7: end for
8: Update
9: Detect Negative Cycles()
10: end while
11: return
12:end function
Complexity
The distance is bounded by the maximal number of differences in the associated assignment. If these assignment are totally disjoint, this distance is . At every iteration of the While loop, the distance decreases, so there is at most iterations.
The detection of negative cycles is done with the Bellman-Ford algorithm, whose complexity should normally be . In our case, it amounts to . Multiplied by the complexity of the outer loop, it amounts to which is a lot when the number of partitions and nodes starts to be large. To avoid that, we adapt the Bellman-Ford algorithm.
The Bellman-Ford algorithm runs iterations of an outer loop, and an inner loop over . The idea is to compute the shortest paths from a source vertex to all other vertices. After iterations of the outer loop, the algorithm has computed all shortest path of length at most . All simple paths have length at most , so if there is an update in the last iteration of the loop, it means that there is a negative cycle in the graph. The observation that will enable us to improve the complexity is the following:
Proposition 4**.**
In the graph (and ), all simple paths have a length at most .
Proof.
Since is a maximal flow, there is no outgoing edge from in . One can thus check than any simple path of length 4 must contain at least two node of type . Hence on a path, at most 4 arcs separate two successive nodes of type . ∎
Thus, in the absence of negative cycles, shortest paths in have length at most . So we can do only iterations of the outer loop in the Bellman-Ford algorithm. This makes the complexity of the detection of one set of cycle to be .
With this improvement, the complexity of the whole algorithm is, in the worst case, . However, since we detect several cycles at once and we start with a flow that might be close to the previous one, the number of iterations of the outer loop might be smaller in practice.
Metrics
We can display the node and zone utilization ratio, by dividing the flow passing through them divided by their outgoing capacity. In particular, we can pinpoint saturated nodes and zones (i.e. used at their full potential).
We can display the distance to the previous assignment, and the number of partition transfers.
3 Related work
In previous versions of Garage, we iterated through many algorithms to build an assignment of partitions to nodes, always with unsatisfactory results. These previous attempts, all based on existing work, are described in this section.
Basic consistent hashing with zone awareness
In this algorithm, we use the simple consistent hashing ring described in Dynamo [4]. We slightly adapt it to support nodes in different zones and the requirement to spread replicas over as many zones as possible: when looking up the nodes associated to a data block, we walk the ring starting from the position corresponding to its hash, but we skip nodes that are in a zone from which we have already selected a node (except if there are no more distinct zones to take nodes from). This method had the disadvantage of giving a very unbalanced distribution of data between nodes. For example, suppose that there are many consecutive nodes on the ring that are in zones 1 and 2, followed by one node in zone 3. Then that node will store a copy of all data blocks whose hashes are in the interval before it that contains only nodes of zone 1 and 2.
Arbitrary ring positions vs. fixed partition boundaries
As already discussed in the Dynamo paper [4] (see the three different strategies presented in Figure 7), using the hashes of node identifiers as positions on the consistent hashing ring makes the intervals between these positions of wildly varying sizes, worsening the imbalance of storage affected to all nodes. To resolve this issue, we very rapidly switched to dividing the consistent hashing ring into equally sized parts (what we call partitions), as shown in Dynamo’s strategies 2 and 3. To ensure that all nodes handle a number of partitions strictly proportional to their capacity, we tried using the MagLev algorithm [5] to assign partitions to nodes. However, just doing this does not solve the zone awareness issue; continuing to use the simple ring walking where nodes are skipped still produces a very imbalanced distribution.
Multi-zone aware MagLev
Our next try was to improve the MagLev algorithm to be multi-zone aware. Now, instead of assigning a single node to each ring position (each partition) and walking the ring to find three nodes starting at a given key’s hash, we directly assign a set of three nodes to each partition and completely abandon ring walking. The first node of the three is computed for all partitions by using the standard MagLev algorithm. Then, the next two are computed using a variant of MagLev that skips assigning nodes to partitions when they are in zones of nodes already selected for that partition (unless there are no more distinct zones available), selecting other nodes instead. This way, we ensure that the three nodes assigned to each partition are in as many distinct zones as possible. This method provided perfectly equitable distribution of data among nodes, however when layout changes occurred, the entire assignment was recomputed without taking into account the previous one, and thus there was no way to ensure that a minimal amount of data was displaced from one node to another.
Stateful assignment algorithms
In all of the previous iterations, we were limiting ourselves to algorithms that were stateless: the assignment had to be computed in a deterministic way from only the list of node identifiers and their zone and capacity information, using hash functions to provide pseudo-randomness. To be able to minimize the transfer load on layout changes, we had to switch to a stateful method where the entire assignment is computed offline and then propagated to all cluster nodes. It can now be computed using any arbitrary optimization algorithm that can take as an input the previous assignment to minimize transfer load. This method was introduced in Garage version 0.5 with a simple greedy optimization algorithm that was not optimal, which was in use until version 0.8. The final, optimal assignment algorithm is the one we presented in this paper, which will be included in Garage version 0.9 and forward.
Acknowledgements
This project has received funding from the European Union’s Horizon 2021 research and innovation programme within the framework of the NGI-POINTER Project funded under grant agreement N° 871528.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] M. Raynal, Building Read/Write Registers Despite Asynchrony and Less than Half of Processes Crash ( t < n / 2 𝑡 𝑛 2 t<n/2 ) , pp. 95–117. Cham: Springer International Publishing, 2018.
- 2[2] Y. Dinitz, “Algorithm for solution of a problem of maximum flow in networks with power estimation,” Soviet Math. Dokl. , vol. 11, pp. 1277–1280, 01 1970.
- 3[3] S. Even and R. E. Tarjan, “Network flow and testing graph connectivity,” SIAM journal on computing , vol. 4, no. 4, pp. 507–518, 1975.
- 4[4] G. De Candia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, “Dynamo: Amazon’s highly available key-value store,” ACM SIGOPS operating systems review , vol. 41, no. 6, pp. 205–220, 2007.
- 5[5] D. E. Eisenbud, C. Yi, C. Contavalli, C. Smith, R. Kononov, E. Mann-Hielscher, A. Cilingiroglu, B. Cheyney, W. Shang, and J. D. Hosein, “Maglev: A fast and reliable software network load balancer,” in 13th { { \{ USENIX } } \} Symposium on Networked Systems Design and Implementation ( { { \{ NSDI } } \} 16) , pp. 523–535, 2016.
