Efficient Distributed Workload (Re-)Embedding
Monika Henzinger, Stefan Neumann, Stefan Schmid

TL;DR
This paper introduces a distributed online algorithm for dynamic workload embedding in reconfigurable networked systems, balancing the benefits of collocation against re-embedding costs, and nearly matches the theoretical optimal performance.
Contribution
It presents a novel distributed online algorithm for workload embedding that is asymptotically almost optimal, addressing the challenge of dynamic, demand-aware reconfiguration.
Findings
Algorithm is asymptotically almost optimal in competitive ratio.
The approach effectively balances reconfiguration costs and performance benefits.
Application demonstrated in distributed union find problem.
Efficient Distributed Workload (Re-)Embedding
(A version of this paper will appear at SIGMETRICS’19. Authors are ordered alphabetically.)
Monika Henzinger
University of Vienna, Faculty of Computer Science, Vienna, Austria
Stefan Neumann
University of Vienna, Faculty of Computer Science, Vienna, Austria
Stefan Schmid
University of Vienna, Faculty of Computer Science, Vienna, Austria
Abstract
Modern networked systems are increasingly reconfigurable, enabling demand-aware infrastructures whose resources can be adjusted according to the workload they currently serve. Such dynamic adjustments can be exploited to improve network utilization and hence performance, by moving frequently interacting communication partners closer, e.g., collocating them in the same server or datacenter. However, dynamically changing the embedding of workloads is algorithmically challenging: communication patterns are often not known ahead of time, but must be learned. During the learning process, overheads related to unnecessary moves (i.e., re-embeddings) should be minimized. This paper studies a fundamental model which captures the tradeoff between the benefits and costs of dynamically collocating communication partners on servers, in an online manner. Our main contribution is a distributed online algorithm which is asymptotically almost optimal, i.e., almost matches the lower bound (also derived in this paper) on the competitive ratio of any (distributed or centralized) online algorithm. As an application, we show that our algorithm can be used to solve a distributed union find problem in which the sets are stored across multiple servers.
1 Introduction
Along with the trend towards more data centric applications (e.g., online services like web search, social networking, financial services as well as emerging applications such as distributed machine learning [41, 37]), comes a need to scale out such applications, and distribute the workload across multiple servers or even datacenters. However, while such parallel processing can improve performance, it can entail a non-trivial load on the interconnecting network. Indeed, distributed cloud applications, such as batch processing, streaming, or scale-out databases, can generate a significant amount of network traffic [40].
At the same time, emerging networked systems are becoming increasingly flexible and thereby provide novel opportunities to mitigate the overhead that distributed applications impose on the network. In particular, more flexible and dynamic resource allocation (enabled, e.g., by virtualization) introduces a vision of workload-aware infrastructures which optimize themselves to the demand [9]. In such infrastructures, communication partners which interact intensively may be moved closer (e.g., collocated on the same server, rack, or datacenter) in an adaptive manner, depending on the demand. This “re-embedding” of the workload makes it possible to keep communication local and reduce costs. Indeed, empirical studies have shown that communication patterns in distributed applications feature much locality, which highlights the potential of such self-adjusting networked systems [29, 47, 11].
However, leveraging such resource reconfiguration flexibilities to optimize performance poses an algorithmic challenge. First, while collocating communication partners reduces communication cost, it also introduces a reconfiguration cost (e.g., due to virtual machine migration). Thus, an algorithm needs to strike a balance between the benefits and the cost of such reconfigurations. Second, as workloads and communication patterns are usually not known ahead of time, reconfiguration decisions need to be made in an online manner, i.e., without knowing the future. We are hence in the realm of online algorithms and competitive analysis.
This paper studies the fundamental tradeoff underlying the optimization of such workload-aware reconfigurable systems. In particular, we consider the design of an online algorithm which, without prior knowledge of the workload, aims to minimize communication cost by performing a small number of moves (i.e., migrations). In a nutshell (more details will follow below), we consider a communication graph between vertices (e.g., virtual machines) which can be perfectly partitioned among a set of servers (resp. racks or datacenters) of a given capacity. We assume that the communication patterns, which partition the communication graph, consist of vertices and that once the whole communication graph has been revealed, each server must contain exactly one communication pattern.
The communication graph is initially unknown and revealed to the algorithm in an online manner, edge-by-edge, by an adversary who aims to maximize the cost of the given algorithm. The cost here consists of communication cost and moving cost: The algorithm incurs one unit cost if the two endpoints (i.e., communication partners) of the request belong to different servers. After each request, the algorithm can reconfigure the infrastructure and move communication endpoints from one server to another, essentially repartitioning the communication partners; however, each move incurs a cost of .
In other words, this paper considers the problem of learning a partition, i.e., an optimal assignment of communication partners to servers, at low communication and moving cost. Interestingly, while the problem is natural and fundamental, not much is known today about the algorithmic challenges underlying this problem, except for the negative result that no good competitive algorithm can exist if communication partners can change arbitrarily over time [8]. This lower bound motivates us, in this paper, to focus on the online learning variant where the communication partners are unknown but fixed. At the same time, as we will show, the problem features interesting connections to several classic problems. Specifically, the problem can be seen as a distributed version of classic online caching problems [51] or an online version of the -way partitioning problem [48].
1.1 Our Contributions
We initiate the study of a fundamental problem, how to learn and re-embed workload in an online manner, with few moves. We make the following main contributions.
We present a distributed -competitive online algorithm for servers of capacity , where . We allow the servers to have more space than is strictly needed to embed its corresponding communication pattern (which is of size ); we denote this additional space as augmentation. Such augmentation is also needed, as our lower bounds discussed next show.
We show that there are inherent limitations of what online algorithms can achieve in our model: We derive a lower bound of on the competitive ratio of any deterministic online algorithm given servers of capacity at least . This lower bound has several consequences: (1) To obtain -competitive algorithms, the servers must have augmentation. (2) If the servers have augmentation (e.g., each server has 10% more capacity than the size of its communication pattern), our algorithm is optimal up to an factor. Thus, our results are particularly interesting for large servers, e.g., in a wide-area networking context where there is usually only a small number of datacenters where communication partners can be collocated (e.g., ): if each datacenter (“server”) has augmentation , our algorithm is optimal up to constant factors.
The distributed algorithms we present not only provide good competitive ratios but they are also highly efficient w.r.t. the network traffic they cause. In fact, we show that for servers, running the algorithms introduces only little overhead in network traffic and that this overhead is asymptotically negligible (see Section 5.1).
While the previous algorithms require exponential time, we also present polynomial time algorithms at the cost of a slightly worse competitive ratio of in Section 5.2.
As a sample application of our newly introduced model, we present a distributed union find data structure [27, 52] (also known as disjoint-set data structure or merge-find data structure) in Section 7.1: There are items from a universe which are distributed over servers; each server can store at most items and each item belongs to a unique set. The operation union merges two sets. In our setting, we require that items from the same set must be assigned to the same server. To reduce the network traffic, our goal is to minimize the number of item moves during union operations. For example, when two sets which are assigned to different servers are merged, the items of one of the sets must be reassigned to another server. We compare against an optimal offline algorithm which knows the initial assignment of all items and all union operations in advance. We obtain the same competitive ratios as above. We believe that this distributed union find data structure will be useful as a subroutine for several problems such as merging duplicate websites in search engines [17].
We also show that our algorithms solve an online version of the -way partition problem in Section 7.2.
1.2 Organization
We introduce our model formally in Section 2. To ease the readability, we first explore centralized online algorithms that efficiently collocate communication patterns for servers in Section 3, and then study the general case of servers in Section 4. In Section 5 we show how the previously derived centralized algorithms can be made distributed and how the algorithm can be implemented in polynomial time at the cost of a slightly worse competitive ratio. We provide the lower bounds in Section 6. Section 7 provides a distributed union find data structure and a result for online -way partitioning; these problems serve as sample applications of the problem we study. After reviewing related work in Section 8, we conclude our contribution in Section 9.
2 Model
We start by formally introducing the model which we will be studying in this paper. We consider a set of vertices (e.g., a set of virtual machines) which interact according to an initially unknown communication pattern, which can be represented as a communication graph with vertices and edges. The vertices of are partitioned into sets where each , forming a connected communication component (the workload), has size ; the connected components of coincide with the sets . The sets are the communication patterns which need to be recovered by the online algorithm, henceforth called ground truth components.
Footnote 2: Note that in general is not always an integer and we would have to take rounding into account. However, we ignore this technicality for better readability of the paper.
The communicating vertices need to be assigned to servers . Accordingly, we define an assignment (the embedding) which is a function from the vertices to the servers. The load of a server is the number of vertices that are assigned to it. An assignment is valid if each server has load at most and we call the capacity of the servers and the augmentation. If , the total server capacity exactly matches the number of vertices. The available capacity of a server is the difference between the server’s capacity and its load. An assignment is perfectly balanced if each server has load exactly . We assume that when the algorithm starts, we have a perfectly balanced assignment. We will write to denote the set of vertices assigned to server and for the set of vertices initially assigned to server . We say that an assignment is a perfect partitioning if it satisfies , i.e., the vertices on the servers coincide with the connected components of .
The communication graph is revealed by an adversary in an online manner, as a sequence of edges , where denotes the number of communication requests and for each . Note that the adversary can only provide edges which are present in and that each edge can appear multiple times in the sequence of edges. We assume that the sequence of the edges provided by the adversary reveals the ground truth components , i.e., after having seen all edges in the algorithm can compute the connected components of which (by assumption) coincide with the ground truth components . We present an illustration of the model in Figure 1.
Now an online algorithm must iteratively change the assignment such that eventually the assignment is a perfect partitioning.
The reassignment needs to be done while minimizing certain communication and migration cost. If an edge provided by the adversary has both endpoints in the same server at the time of the request, an algorithm incurs no costs. If and are in different servers and , then their communication cost is . Reassigning, i.e., moving, a vertex from a server to a server costs .
When measuring the cost of an online algorithm, we will compare against an optimal offline algorithm denoted by OPT. OPT has a priori knowledge of the communication graph as well as of the sequence of all edges . In other words, OPT can compute the assignment of vertices to servers which provides the minimum migration cost from the initial assignment.
Now let the cost paid by an online algorithm be denoted by ON and let the cost of the optimal offline algorithm be denoted by OPT . We consider the design of an online algorithm which minimizes the (strict) competitive ratio defined as .
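The cost model above can be illustrated with a small replay harness. This is a minimal sketch and not code from the paper: the names `total_cost`, `react`, `never_move`, and `collocate` are hypothetical, and `move_cost` stands for the per-move cost whose symbol is not legible in this copy of the text. A request between vertices on different servers is charged one unit, and every vertex move performed after a request is charged `move_cost`.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Edge = Tuple[int, int]
Assignment = Dict[int, int]  # vertex -> server index


def total_cost(
    requests: Iterable[Edge],
    assignment: Assignment,
    react: Callable[[Edge, Assignment], List[Tuple[int, int]]],
    move_cost: float,
) -> float:
    """Replay a request sequence; return communication cost plus moving cost."""
    cost = 0.0
    for u, v in requests:
        # One unit of communication cost if the endpoints are on different servers.
        if assignment[u] != assignment[v]:
            cost += 1.0
        # After the request, the strategy may move vertices; each move is charged.
        for vertex, target in react((u, v), assignment):
            if assignment[vertex] != target:
                assignment[vertex] = target
                cost += move_cost
    return cost


def never_move(edge: Edge, assignment: Assignment) -> List[Tuple[int, int]]:
    """Hypothetical strategy: pay for every remote request, never migrate."""
    return []


def collocate(edge: Edge, assignment: Assignment) -> List[Tuple[int, int]]:
    """Hypothetical strategy: move the first endpoint to the second one's server."""
    u, v = edge
    return [(u, assignment[v])]
```

The competitive ratio of a strategy on a given input is then the value returned for that strategy divided by the cost of the best offline solution for the same input.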
The Role of Connected Components
We will briefly discuss how connected components are induced by subsequences of and how we will treat connected components in our algorithms. We then give a reduction which helps us to avoid considering communication costs in our proofs.
Recall that the adversary provides a sequence of edges to an algorithm in an online manner. As this happens, an algorithm can keep track of all edges it has seen so far. Let this set of edges be . Using the edges in , the algorithm can compute the connected components which are induced by . Here, denotes the current number of connected components.
To obtain a better understanding of the relationship between the connected components and the ground truth components , we make four observations: (1) When the algorithm starts, all connected components only consist of single vertices (because has not yet revealed any edges). (2) When a previously unknown edge is revealed which has its endpoints in different connected components and , these connected components get merged. (3) Suppose a subsequence of induces connected components (i.e., has not yet revealed the whole graph ). Then for each ground truth component there exists a subset of the connected components such that . (4) When an algorithm terminates (and, hence, revealed all edges in ), there exists a one-to-one correspondence between the connected components and the ground truth components .
By assumption on the input from the adversary, when all of was revealed, reveals the ground truth components . Thus, in total there will be exactly edges connecting vertices from different connected components.
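The bookkeeping described in observations (1) and (2) can be sketched with a standard union-find structure. This is an illustration under our own naming (`Components`, `reveal`), not the paper's implementation: every vertex starts as its own component, and an edge merges two components only if its endpoints were in different ones. The number of merging edges always equals the number of vertices minus the final number of components.

```python
class Components:
    """Tracks connected components induced by the edges revealed so far."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))  # observation (1): all singletons at the start
        self.size = [1] * n
        self.count = n                # current number of connected components

    def find(self, v: int) -> int:
        # Path halving keeps the trees shallow.
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]
            v = self.parent[v]
        return v

    def reveal(self, u: int, v: int) -> bool:
        """Process an edge; return True iff it merged two components."""
        ru, rv = self.find(u), self.find(v)
        if ru == rv:
            return False              # repeated or intra-component edge
        if self.size[ru] < self.size[rv]:
            ru, rv = rv, ru
        self.parent[rv] = ru          # observation (2): the components get merged
        self.size[ru] += self.size[rv]
        self.count -= 1
        return True
```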
All of the algorithms we consider in this paper have the property that they always assign vertices of the same connected component to the same server. This property implies that the communication cost paid by such an algorithm is bounded by its moving cost (we prove this in the following lemma). Hence, in the rest of the paper we only need to bound the moving costs of our algorithms to obtain a bound on their total costs.
Lemma 1.
Suppose an algorithm always assigns all vertices of the same connected component to the same server and pays for moving vertices. Then its communication cost is at most . Furthermore, its total cost is at most .
Proof.
Suppose the adversary provides an edge . We consider two cases. Case 1: and are assigned to the same server. Then does not pay any communication costs. Case 2: and are assigned to connected components and on different servers. Then the algorithm needs to pay communication cost. However, in this case must move or to a different server at the cost of at least . Hence, the moving cost is larger than the communication cost. We conclude that ’s total communication cost is at most . By summing the two quantities, we obtain the second claim of the lemma. ∎
While in Lemma 1 we have shown that algorithms which always collocate connected components immediately are efficient w.r.t. their total cost, in Section 6.1 we show that any efficient algorithm must satisfy a similar (slightly more general) property.
Throughout the rest of the paper, we write to denote the number of vertices in a connected component . For a vertex , we write to denote the connected component which contains .
3 Online Partition for Two Servers
In this section, we consider the problem of learning a communication graph with few moves with two servers. As we will see later, the concepts introduced in this section will be useful when solving the problem with servers. We derive the following result.
Theorem 2.
Consider the setting with two servers of capacity for , i.e., the augmentation is . Then there exists an algorithm with competitive ratio .
The proof is organized as follows. We first characterize the optimal solution by OPT in Section 3.1. We then present an algorithm which is efficient whenever OPT incurs “significant cost”, in Section 3.2. In Section 3.3, we describe an algorithm which is efficient whenever the solution by OPT is “cheap”. We prove Theorem 2 via a combination of the two algorithms in Section 3.4.
3.1 Costs of OPT
The following lemma gives a precise characterization of the cost paid by OPT in the two server case. It introduces a parameter which equals the number of vertices moved by OPT and which we will be using throughout the rest of this section.
Lemma 3.
Suppose and the vertices initially assigned to the servers are given by the sets for . Then the cost of OPT is , where
[TABLE]
It follows immediately that (as ).
Proof.
Recall that our model forces OPT to provide a final assignment satisfying , i.e., OPT must produce a final assignment which coincides with the ground truth components (even if paying for each communication request individually and not relocating any vertices might be cheaper). Thus, we can assume that OPT performs all vertex moves in the beginning, to avoid paying any communication cost. Since the edge sequence provided by the adversary is assumed to reveal the connected components and , OPT can compute and before it performs any moves.
As there are only two servers, one of them must contain at least half of the vertices from in the initial assignment. Now let us first assume that this server is ; this setting is illustrated in Figure 2. In this case, OPT can move the vertices in to and those in to . It is easy to verify that this yields an assignment satisfying and that the moving cost is minimized. Further, the cost for this reassignment is exactly .
The second case where contains more than half of the vertices from in the initial assignment is symmetric. ∎
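The case distinction in the proof can be checked numerically. The following sketch does not use the closed-form expression of Lemma 3 (which is not legible in this copy); instead it simply evaluates both possible final assignments for two servers and returns the smaller number of vertex moves. The function name and argument layout are our own. Multiplying the result by the per-move cost gives the corresponding moving cost.

```python
from typing import Set, Tuple


def opt_moves_two_servers(
    initial: Tuple[Set[int], Set[int]],
    truth: Tuple[Set[int], Set[int]],
) -> int:
    """Minimum number of vertex moves that turns the initial assignment into
    one where each server holds exactly one ground truth component."""
    s0, s1 = initial   # vertices initially on the two servers
    v0, v1 = truth     # the two ground truth components
    # Option A: first component ends on the first server.
    keep = len(s0 - v0) + len(s1 - v1)
    # Option B: the components end on swapped servers.
    swap = len(s0 - v1) + len(s1 - v0)
    return min(keep, swap)
```

For example, with initial sets {0,1,2,3} and {4,5,6,7} and ground truth components {0,1,2,4} and {3,5,6,7}, keeping the orientation requires moving only vertices 3 and 4, whereas swapping would require six moves.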
While in Lemma 3 we have presented the lower bound w.r.t. server , we could also express the lower bound in terms of server . We then obtain the following equality:
[TABLE]
3.2 The Small–Large–Rebalance Algorithm
A natural idea to obtain a small number of vertex moves is to proceed as follows. Whenever two vertices and belonging to different connected components communicate, the algorithm merges their connected components. If the two components were already assigned to the same server, no vertex moves are required. If and are assigned to different servers, we move the smaller connected component to the server of the larger connected component. This algorithm is efficient in that it never performs more than vertex moves (see Lemma 4).
However, the algorithm could require much augmentation, as it does not account for server capacities. Thus, we propose the following extension called the Small–Large–Rebalance Algorithm: Whenever a server exceeds its capacity, the algorithm computes a perfectly balanced assignment of the vertices which respects the previously observed connected components; we call this a rebalancing step. We provide pseudocode in Algorithm 1.
Section 5.2.1 shows how such a rebalancing step can be implemented in time. Later, we show that there can be at most such rebalancing steps, which implies that the total running time of Algorithm 1 is .
Note that Algorithm 1 also works in the setting with servers for . We will analyze this more general algorithm in Section 4.4.
3.2.1 Analysis
To analyze Algorithm 1, we first consider the algorithm from the first paragraph which does not have the rebalancing step. When the algorithm moves a smaller component to the server of a larger component, we call this a small-to-large step.
Lemma 4.
Consider the algorithm which always moves the smaller connected component to the server of the larger connected component when it obtains an edge between vertices from different connected components. The algorithm moves each vertex at most times. Its total number of vertex moves is .
Proof.
Consider any vertex . We use the following accounting: Whenever is in a smaller component that is moved, add a token to . Now observe that whenever gains a token, the size of its component at least doubles. This implies that can be in the smaller component only times. Thus, cannot accumulate more than tokens. Since this holds for each of the vertices, the total number of moves is . ∎
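The token argument can be observed in a small simulation. This is a sketch of the rule analyzed in the lemma (small-to-large steps only, no capacities and no rebalancing), written with hypothetical names; it counts how often each vertex is moved. Because a vertex is only moved while it is in the smaller of two merging components, its component at least doubles with every move, so with n vertices no vertex can be moved more than log2(n) times.

```python
from typing import Dict, List, Tuple


def small_to_large(
    n: int, edges: List[Tuple[int, int]], server_of: Dict[int, int]
) -> List[int]:
    """Return, for each vertex, how many times it was moved."""
    comp = list(range(n))                      # vertex -> component id
    members = {v: [v] for v in range(n)}       # component id -> its vertices
    comp_server = dict(server_of)              # component id -> current server
    moves = [0] * n
    for u, v in edges:
        a, b = comp[u], comp[v]
        if a == b:
            continue                           # edge inside one component
        if len(members[a]) < len(members[b]):
            a, b = b, a                        # a is now the larger component
        if comp_server[a] != comp_server[b]:
            for w in members[b]:               # the smaller component migrates
                moves[w] += 1
        for w in members[b]:
            comp[w] = a
        members[a].extend(members.pop(b))
        del comp_server[b]
    return moves
```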
The following lemma provides the analysis for Algorithm 1 which performs small-to-large steps and rebalancing steps.
Lemma 5.
Suppose both servers have capacity , i.e., the augmentation is for . Then Algorithm 1 performs rebalancing steps and vertex moves.
Proof.
We prove the bound on the number of vertex moves; the claim about the number of rebalancing steps is proved along the way. Note that all vertex moves performed by the algorithm originate from either small-to-large steps or from rebalancing steps. We bound the number of each of these vertex moves separately.
Note that the token-based argument from Lemma 4 still applies to the small-to-large steps of Algorithm 1. This implies that the total number of vertex moves due to small-to-large steps is .
Now consider the vertex moves caused by the rebalancing steps and recall that the initial assignment is perfectly balanced. Whenever a server exceeds its capacity, the small-to-large steps of the algorithm must have moved at least vertices (because the augmentation of one of the servers is exceeded). This can only happen times since the total number of vertex moves due to small-to-large steps is . Hence, the number of rebalancing steps is at most . Since each rebalancing step performs vertex moves, the lemma follows. ∎
3.2.2 More Efficient Rebalancing
We next propose a better rebalancing strategy which makes Algorithm 1 more efficient. So far, we used vertex moves for each rebalancing operation at the cost of . We now bring the rebalancing cost down to .
We adjust Algorithm 1 in the following way: Instead of rebalancing by taking any perfectly balanced assignment respecting the connected components (Line 12), we choose a perfectly balanced assignment respecting the connected components which minimizes the number of vertex moves from the initial solution. We call such an assignment cheap.
To find a cheap assignment, the algorithm could simply do the following: (1) Recall the initial assignment. (2) Exhaustively enumerate all perfectly balanced assignments respecting the connected components. (3) Among all of these assignments find one which is cheap. While such a simple algorithm can in principle be computationally costly, we can here exploit the online model of computation which allows us unlimited computational power. In Section 5.2 we show how less efficient rebalancing strategies can be implemented in polynomial time and we obtain slightly worse competitive ratios.
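Steps (1) to (3) can be sketched as a brute-force search. This is an illustration only, with hypothetical names; like the procedure described above, it takes time exponential in the number of connected components. It enumerates every way of placing whole components on servers, keeps the placements in which each server has load exactly `target_load`, and returns one that moves the fewest vertices relative to the initial assignment.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple


def cheap_assignment(
    components: List[List[int]],
    initial: Dict[int, int],
    num_servers: int,
    target_load: int,
) -> Tuple[Optional[Tuple[int, ...]], Optional[int]]:
    """Return (server per component, number of moves from `initial`),
    or (None, None) if no perfectly balanced placement exists."""
    best, best_moves = None, None
    # Step (2): enumerate all placements that keep components intact.
    for choice in product(range(num_servers), repeat=len(components)):
        loads = [0] * num_servers
        for comp, server in zip(components, choice):
            loads[server] += len(comp)
        if any(load != target_load for load in loads):
            continue  # not perfectly balanced
        # Step (1): compare against the remembered initial assignment.
        moved = sum(
            1
            for comp, server in zip(components, choice)
            for v in comp
            if initial[v] != server
        )
        # Step (3): keep the cheapest balanced placement seen so far.
        if best_moves is None or moved < best_moves:
            best, best_moves = choice, moved
    return best, best_moves
```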
With the improved rebalancing strategy, we obtain Proposition 6.
Proposition 6.
Suppose all servers have capacity , . Then the number of vertex reassignments performed by Algorithm 1 with more efficient rebalancing is , where is the number of vertex moves used by OPT .
Proof.
First, note that the number of vertex moves for moving smaller components to larger components (Line 9) is , by exactly the same arguments used in the proof of Lemma 5.
Second, we bound the number of vertex moves required for the rebalancing operations. Whenever the algorithm needs to rebalance, we can assume (for the sake of the analysis) that the algorithm makes the following three steps: (1) Roll back all changes done by small-to-large moves (Line 9) since the last rebalancing operation. Thus, after rolling back we have the same assignment as after the last rebalancing operation. (2) Roll back to the initial assignment (by undoing the last rebalancing operation). (3) Move to a cheap assignment.
Observe that Steps (1) and (2) of the previous three-step procedure increase the number of vertex moves only by a constant factor compared to when the algorithm does not roll back: In total, Step (1) only adds additional vertex moves because each small-to-large move is undone exactly once. Step (2) only doubles the number of vertex moves for moving to cheap assignments as each rebalancing is only undone once.
Thus, we can complete the proof if we can show that the total number of vertex moves for moving from the initial assignment to the cheap assignments is bounded by .
By Lemma 5, the number of rebalancing steps is bounded by . Now we argue that for moving from the initial solution to each cheap assignment, the rebalancing moves at most vertices: Every time the algorithm computes a cheap rebalancing, the final solution obtained by OPT is a perfectly balanced assignment respecting the connected components. Thus, the number of vertex moves to obtain a cheap rebalancing is bounded by the number of moves performed by OPT which is . This finishes the proof. ∎
3.3 The Majority Voting Algorithm
We now present an algorithm which works well whenever the cost paid by OPT is small, i.e., when OPT only needs to move few vertices. The issue with Algorithm 1 from Section 3.2 is that during its execution, it might deviate much from the initial assignment (and thus move many vertices). The following algorithm has the property that it always stays close to the initial assignment.
For ease of readability, we will often refer to the two servers as the left and right servers, respectively, instead of calling them and .
Our algorithm starts by coloring vertices on the left server yellow and on the right server black. Throughout the execution of the algorithm, the vertices will keep this initially assigned color. The algorithm then follows the idea of always moving the smaller connected component to the server of the larger connected component; we will refer to this as a small-to-large step. To stay close to the initial assignment, whenever the number of vertices in a newly merged connected component surpasses a power of , the algorithm performs a majority vote and moves the component to the server from which more of its vertices originate. More formally, we say that a set of vertices (e.g., a connected component) has a yellow (black) majority if it contains more yellow (black) vertices than black (yellow) vertices. In the majority voting step, the algorithm moves a component with a yellow (black) majority which is currently on the right (left) server to the left (right) server. The pseudocode for this procedure is stated in Algorithm 2 (see Footnote 3).
Footnote 3: Note that in Algorithm 2 the following is possible when a component is merged with a component : is moved from to due to a small-to-large step and immediately after that is moved back to due to a majority-voting step. Thus, it would be more efficient to compute the result of the majority-voting step earlier and to move to immediately (without ever moving to ). This modification would be slightly more efficient but it would affect the competitive ratio of the algorithm only by at most a constant factor. Thus, to simplify our analysis, we ignore this modification.
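The procedure described in prose above can be sketched as follows. This is our own reconstruction from the description, not the paper's Algorithm 2, and it rests on two assumptions that the text here does not confirm: the base of the power is taken to be 2, and a component "surpasses" a power of 2 when the merged size reaches a power of 2 that the larger of the two merged components had not yet reached. Server 0 plays the role of the left (yellow) server and server 1 the right (black) server; ties in the vote leave the component where it is.

```python
from typing import List, Tuple


def majority_voting(
    n: int, initial: List[int], edges: List[Tuple[int, int]]
) -> Tuple[List[int], int]:
    """Return (final server per vertex, total number of vertex moves)."""
    comp = list(range(n))                       # vertex -> component id
    members = {v: [v] for v in range(n)}        # component id -> its vertices
    server = {v: initial[v] for v in range(n)}  # component id -> current server
    moves = 0
    for u, v in edges:
        a, b = comp[u], comp[v]
        if a == b:
            continue
        if len(members[a]) < len(members[b]):
            a, b = b, a                         # a is the larger component
        larger_size = len(members[a])
        # Small-to-large step: the smaller component joins the larger one.
        if server[a] != server[b]:
            moves += len(members[b])
        for w in members[b]:
            comp[w] = a
        members[a].extend(members.pop(b))
        del server[b]
        merged = members[a]
        # Majority vote when the merged size crosses a power of 2 (assumed base).
        if len(merged).bit_length() > larger_size.bit_length():
            black = sum(initial[w] for w in merged)  # color = initial server
            yellow = len(merged) - black
            if yellow != black:
                home = 0 if yellow > black else 1
                if server[a] != home:
                    moves += len(merged)
                    server[a] = home
    return [server[comp[v]] for v in range(n)], moves
```

On the toy input `majority_voting(6, [0, 0, 0, 1, 1, 1], [(3, 0), (1, 2), (0, 1)])`, which is chosen only to trigger a vote, the last merge first moves two vertices to the right server and the vote then moves the merged component of four vertices back to the left server. This is the back-and-forth movement that Footnote 3 points out.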
The reason for introducing the majority voting step is that it keeps the assignments produced by the algorithm during its runtime close to the initial assignment. Due to this property, we can show that the cost of Algorithm 2 is always close to the cost of OPT . The formal guarantees are stated in Proposition 7.
Proposition 7.
Let be the number of vertex moves performed by OPT (see Section 3.1). Then Algorithm 2 is -competitive and the load of both servers is bounded by .
We devote the rest of this subsection to the proof of the proposition. We start by bounding the augmentation. For the proofs recall that and are the ground truth connected components of .
In the following we are interested in what happened to a connected component since its last majority vote. To this end, we decompose it into a sequence of smaller connected components such that first a majority vote is performed and after that, only small-to-large steps are performed. For all of these small-to-large steps, the component will stay on the server that was picked by the majority vote. The following definition makes this notion formal.
Definition 8 (Doubling Decomposition).
Let be a connected component and let be such that . Consider disjoint sets of vertices and let for .
A sequence is a doubling decomposition of if the following properties hold:
1. ,
2. during the execution of the algorithm, first are merged, then are merged, and, more generally, is merged before ,
3. for each , and the algorithm moves to the server of ,
4. and .
Note that when considering a doubling decomposition, there will be exactly one majority-vote for the components — the one after and are merged. Thus, and all , , will be assigned to the server that was picked in the majority vote of .
The following lemma shows that doubling decompositions are indeed well-defined. Its proof provides the construction of a doubling decomposition for a given connected component.
**Lemma 9.**
Let be a connected component. Then there exists a doubling decomposition for .
Proof.
Suppose was the last edge which caused the algorithm to set . W.l.o.g. assume that (in case of ties let be the connected component that is moved by the algorithm). Then set and set . Now repeat this procedure for in place of to obtain and . Continue this procedure until is of appropriate size.
Note that Properties 1 and 2 follow immediately from the above construction. Property 3 follows from the definition of small-to-large steps and the choice of above. Property 4 is guaranteed by the stopping criterion of the above recursion. ∎
Lemma 10 will be crucial for the proofs of many upcoming claims in this section. The lemma asserts that when a connected component is currently assigned to the (say) right server but at the end it will be assigned to the left server, then it must contain relatively many vertices that were initially assigned to the right server.
**Lemma 10.**
Let be a connected component with . Suppose that is currently assigned to server and that will be assigned to server when the algorithm terminates. Then contains at least vertices which were initially assigned to .
Proof.
Assume w.l.o.g. that is currently assigned to the right server and it will be assigned to the left server when the algorithm terminates. We show that at least a -fraction of the vertices in must be black. This implies the lemma.
Let be a doubling decomposition of which exists by Lemma 9. Observe that must be assigned to the same server as after they were merged and after the algorithm processed the majority vote for (by Properties 3 and 4 of doubling decompositions). Thus, had a black majority, i.e., it contains at least black vertices. Since , must contain at least black vertices. ∎
Now we bound the augmentation that is used by Algorithm 2.
**Lemma 11.**
The load of both servers is bounded by . Hence, Algorithm 2 uses at most augmentation.
Proof.
Assume that at some point during the execution of the algorithm the (w.l.o.g.) right server contains more vertices than the left server. We bound the load of the right server.
Recall from Lemma 3 that . We start by considering the case where . In this case, even moving all vertices to the right server only causes augmentation .
Now consider the case where . Since , the initial assignment of must contain more vertices from either or . Thus, exactly one of the ground truth components and must have a black majority (as the algorithm colored all vertices initially assigned to black). We assume w.l.o.g. that has this black majority. This implies that has black vertices and has black vertices. Further, as the algorithm proceeds, the vertices from must be moved to the right server.
The right server contains at each point a (potentially empty) set of vertices from and a (potentially empty) set of vertices from . For the latter set we use the trivial upper bound of , while for the former set we give a bound of . The lemma follows.
Consider a component which is on the right server and a subset of . By Lemma 10, contains at least black vertices.
As there are only black vertices in the ground truth component and each component on the right server has at least a -fraction of black vertices, it follows that all components on the right server which are subsets of can only contain vertices. ∎
Having derived the bound for the augmentation, our next goal is to show that the cost paid by the algorithm is bounded by . We start by bounding the cost paid by the algorithm for each connected component.
The following lemma implies that the algorithm pays nothing for components in which all vertices have the same color.
**Lemma 12.**
Let be a connected component and suppose all vertices in have the same color. Then the algorithm has never moved the vertices in .
Proof.
We prove the claim by induction over .
Let . Then consists of a single vertex. But the algorithm never moves single vertices unless they become part of a larger connected component. Hence, is not moved.
Now let . Consider the last edge which was inserted that forced the algorithm to merge . Since in all vertices have the same color, all vertices in and must have the same color. By induction hypothesis, the vertices in and have never been moved before. Thus, and must be assigned to the same server. This implies that a small-to-large step would not move or . Further, a majority voting step would not move since all vertices vote for the server which they are already assigned to. Thus, no vertices in are moved. ∎
Next, we bound the cost paid for any connected component.
**Lemma 13.**
Let be a connected component. Then the cost (over the entire execution time of the algorithm) paid for the vertices in is at most .
Proof.
Consider a vertex . We perform the following accounting: we assign a token to each time when it is reassigned to a server and we show that the number of tokens for is bounded by . This implies that the total number of reassignments for the vertices in is and the lemma follows.
First, consider the case where is moved because it is in a smaller connected component (Line 8). Whenever this happens the size of the connected component containing at least doubled. This can only happen times.
Second, consider the case when is moved because of a majority vote. A majority vote is performed every time when the size of the component containing doubled. This can only happen times and, hence, this can only add another tokens for .
Thus, the total number of tokens assigned to is . ∎
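The small-to-large half of this token argument can be checked empirically. The sketch below gives a vertex one token each time it sits in the smaller side of a merge, ignoring servers entirely, so the count is an upper bound on its small-to-large moves. The function name `small_to_large_tokens` and the random edge stream are illustrative assumptions, not part of the paper.

```python
import math
import random


def small_to_large_tokens(n, edges):
    """Count, per vertex, how often it lies in the smaller side of a merge."""
    comp = list(range(n))                     # component id of each vertex
    members = {v: [v] for v in range(n)}      # vertices of each component
    tokens = [0] * n
    for u, v in edges:
        a, b = comp[u], comp[v]
        if a == b:
            continue                          # edge inside one component
        if len(members[a]) < len(members[b]):
            a, b = b, a                       # b is the smaller component
        for w in members[b]:
            tokens[w] += 1                    # w's component at least doubles
            comp[w] = a
        members[a].extend(members.pop(b))
    return tokens


random.seed(0)
n = 64
edges = [(random.randrange(n), random.randrange(n)) for _ in range(400)]
tokens = small_to_large_tokens(n, edges)
# Each token at least doubles the component size, so at most log2(n) tokens.
assert max(tokens) <= math.floor(math.log2(n))
```

Because every token corresponds to the component of the vertex at least doubling in size, no vertex can collect more than the base-2 logarithm of the number of vertices, regardless of the edge order.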
Note that Lemma 13 is only useful for components of size at most : if we were to apply the lemma to a component of size then the cost would only be bounded by . However, this can be much worse than our desired bound of when . Thus, we need a more fine-grained argument to obtain our goal of showing that the cost paid by Algorithm 2 never exceeds . To do this, we first prove two technical lemmas.
**Lemma 14.**
Suppose is a component which is moved from to and the vertices in are never reassigned after this move. Then contains at least vertices which were initially assigned to .
Proof.
There are only two possible reasons why is moved: Either due to a small-to-large step (Line 8) or due to a majority voting step (Line 10). We consider both cases separately.
Case 1: is moved due to a small-to-large step. Then by Lemma 10, must contain at least vertices which were initially assigned to .
Case 2: is moved due to a majority voting step.
First, consider the case when contains at most 7 vertices. Then at least one vertex was initially assigned to (if all vertices had been initially assigned to , they would all have the same color and a majority vote would not move due to Lemma 12). Thus, at least a -fraction of the vertices were initially assigned to and the lemma holds.
Second, suppose that contains at least vertices. Consider the last edge that caused the merge . Suppose that the small-to-large step moved to the server of . Note that was assigned to and was moved to . Now apply Lemma 10 to . This implies that must contain at least vertices that were initially assigned to . As , must contain at least vertices that were initially assigned to . ∎
We are now ready to show that the cost incurred by the majority-voting algorithm never exceeds .
**Lemma 15.**
The total cost paid by Algorithm 2 is at most and the final assignment is a perfect partitioning.
Proof.
When the algorithm finishes, the final assignment must be a perfect partitioning because the connected components were completely revealed. We only need to prove that the cost of the algorithm is .
Recall that OPT moves exactly vertices (Lemma 3). We can assume w.l.o.g. that OPT moves vertices from that were initially assigned to to and vertices from that were initially assigned to to . We will argue that the cost paid by the algorithm for moving all vertices from into the will be ; the same will hold for and symmetrically.
Consider time during the execution of the algorithm where the following happens. A connected component is reassigned to the left server and has the following properties: (1) is a subset of and (2) the vertices in never leave the left server after time . Since each vertex of is assigned to the left server when the algorithm terminates, each vertex of is contained in a component with the above properties (when a vertex or component is never moved, we set ). We call a component with the above properties mixed if it contains at least one black vertex. Note that when a mixed component is assigned to the left server, contains a black vertex and, hence, must be moved from the right to the left server.
We now bound the cost for mixed components. Let be the set of all mixed components and let . Since is mixed, Lemma 14 implies that at least vertices of are black. As the black vertices in mixed components form a partition of the black vertices in moved by OPT , we obtain that the number of black vertices in mixed components is . Thus, the total number of vertices in all mixed components is .
By Lemma 13, the total cost paid for each until (including) its final move is . Since (by assumption) the vertices in never move between the servers again, their cost never exceeds until the algorithm finishes. Hence, the cost paid by the algorithm for all mixed components is
[TABLE]
Now consider the vertices of which are not part of mixed components. These vertices must have been part of components in which all vertices are colored yellow. By Lemma 12, these vertices have never been moved. Thus, they do not incur any additional cost for the algorithm. ∎
Proof of Proposition 7.
Lemma 11 gives the bound for the augmentation used by the algorithm. By Lemma 15 and Lemma 3, Algorithm 2 obtains a competitive ratio of
[TABLE]
3.4 Bringing It All Together: Theorem 2
Proof of Theorem 2. Consider the following algorithm: Run the majority-voting algorithm until we have seen all edges or until at some point it tries to exceed the allowed augmentation. In the latter case, compute a perfectly balanced assignment respecting the connected components and start running Algorithm 1 (Section 3.2.2).
To prove the theorem, we distinguish two cases based on .
First, suppose . By Proposition 7, Algorithm 2 uses at most augmentation. Thus, in the current case the augmentation used by Algorithm 2 is bounded by and it is -competitive. This proves the theorem for this case.
Second, suppose . In this case we run Algorithm 2 until it tries to exceed the allowed augmentation; this serves as a certificate that . At this point we switch to Algorithm 1.
When we switch algorithms, Algorithm 2 has paid , by applying Lemma 13 to each connected component, and then summing over these costs. For switching to the perfectly balanced reassignment, we only need to pay once.
By Proposition 6, Algorithm 1 never uses more than vertex moves. Using the bound and the fact that pays (Lemma 3), we obtain the desired competitive ratio:
[TABLE]
4 Generalization to Many Servers
We extend our study to the scenario with servers. As we will see, while several concepts introduced for the two-server case are still useful, the -server case introduces additional challenges. We derive the following main result.
**Theorem 16.**
Given a system with servers, each of capacity (i.e., augmentation ) for , there exists an -competitive algorithm.
Our algorithm will be based on a recursive bipartitioning scheme, described in Section 4.1. We will use this bipartitioning scheme to derive a static approximation algorithm for the optimal solution (Section 4.2). Then we provide a recursive version of the majority voting algorithm which we will compare against the approximation algorithm (Section 4.3). In Section 4.4, we analyze the Small–Large–Rebalance algorithm in the server setting and we conclude by proving Theorem 16 in Section 4.5.
4.1 The Bipartition Tree
We establish a recursive bipartitioning scheme of the servers which we will be using throughout the rest of this section. All algorithms in this section which use the recursive bipartitioning create such a bipartitioning at the start of the algorithm, before the adversary provides any edge. After that the bipartitioning will never be changed.
We obtain the bipartition scheme by growing a balanced binary tree on a set of leaves, where each leaf corresponds to a server . We call this tree the bipartition tree and denote it by .
We denote the internal nodes of by . For an internal node , we write to denote the subtree of which is rooted at and we define to be the set of servers which are leaves in . We further write to denote the set of vertices which are assigned to the servers in . See Figure 3 for an illustration.
Observe that defines a bipartition scheme: let be an internal node of and let , be its children. Then and are disjoint and their union is . Thus, implies a bipartition scheme of the servers and internal nodes correspond to bipartition steps.
[Footnote 4: If is a leaf corresponding to server , we set .]
Note that since is a balanced binary tree, there are internal nodes in total and each server is contained in at most subtrees of . Hence, for each server there are at most internal nodes such that .
In the following we will refer to the internal nodes in as nodes, whereas the vertices from the graph are called vertices.
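One way to build such a tree is to split the list of servers in half recursively. The sketch below is an illustration under assumptions: the function name `build_bipartition_tree`, the representation of a node as the tuple of servers below it, and the halving rule for server counts that are not powers of two are all choices made here, not details taken from the paper.

```python
def build_bipartition_tree(servers):
    """Return (root, children) for a balanced binary tree over the servers.

    A node is represented by the tuple of servers in its subtree, so leaves
    are 1-tuples. `children` maps each internal node to its two child nodes.
    """
    children = {}

    def build(group):
        node = tuple(group)
        if len(group) > 1:
            mid = len(group) // 2             # ASSUMPTION: split in halves
            children[node] = (build(group[:mid]), build(group[mid:]))
        return node

    root = build(list(servers))
    return root, children


root, children = build_bipartition_tree(range(8))
# A binary tree in which every internal node has two children and there are
# 8 leaves has exactly 7 internal nodes.
assert len(children) == 7
# With 8 servers, every server lies below exactly 3 internal nodes.
assert all(sum(s in node for node in children) == 3 for s in range(8))
```

Representing a node by its server tuple makes the server set of a subtree available without a separate lookup, and the two children of any internal node are disjoint with union equal to the parent by construction.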
4.2 Offline Approximation Algorithm
We are not aware of a concise characterization of the optimal solution used by OPT (unlike in the two-server case in Section 3.1). Thus, to get a better understanding of the solution obtained by OPT, we provide an offline approximation algorithm, called APPROX, which exploits the previously defined bipartition scheme and which obtains a 2-approximation of the optimal solution. However, unlike the solution obtained by OPT, we allow the approximation algorithm to use unlimited augmentation in each server; its only goal is to move all vertices from the same ground truth components to the same server using few vertex moves. Later, APPROX will play a role for the design and analysis of our online algorithm.
[Footnote 5: In this setting, a trivial solution assigns all vertices to the same server at cost .]
Intuitively, APPROX traverses the bipartition tree top–down and greedily minimizes the number of vertices “moved over” each server bipartition. We now describe the algorithm in more detail.
APPROX is given the sequence of edges a priori and it also knows the initial assignment of the vertices to the servers. Using the knowledge about the edges, APPROX starts by computing the connected components of and obtaining the ground truth components .
Now, for each ground truth component , APPROX does the following. Let be the root of and let and denote its children. Let , , denote the vertices from which are currently assigned to servers in . Define and assume w.l.o.g. that . The algorithm marks the vertices from as dirty. Now the algorithm recurses on the subtree in place of and marks more vertices of as dirty. The recursion stops when only contains a single server . Then the algorithm moves all dirty vertices of into server .
The pseudocode of APPROX is stated in Algorithm 3.
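The top-down traversal for a single ground truth component can be sketched as follows. This is a hedged reading of the description above rather than Algorithm 3 itself: the tree is encoded as nested pairs with integer server ids at the leaves, the function names are hypothetical, and the sketch assumes that the traversal descends into the child holding more of the component's vertices (ties go left) and marks the vertices on the other side as dirty.

```python
def servers_under(node):
    """List the servers (integer leaves) in the subtree of a nested-pair node."""
    if not isinstance(node, tuple):
        return [node]
    return servers_under(node[0]) + servers_under(node[1])


def approx_for_component(tree, count_on_server):
    """Return (target server, number of dirty vertices) for one component.

    count_on_server maps a server id to the number of the component's
    vertices that are currently assigned to that server.
    """
    node, dirty = tree, 0
    while isinstance(node, tuple):
        weights = [
            sum(count_on_server.get(s, 0) for s in servers_under(child))
            for child in node
        ]
        heavy = 0 if weights[0] >= weights[1] else 1  # ASSUMPTION: ties go left
        dirty += weights[1 - heavy]   # vertices on the lighter side get moved
        node = node[heavy]
    return node, dirty


tree = ((0, 1), (2, 3))
target, cost = approx_for_component(tree, {0: 1, 1: 4, 2: 3, 3: 0})
assert (target, cost) == (1, 4)
```

In the example the component has eight vertices, four of which already sit on server 1, so the four dirty vertices are exactly the ones that have to move.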
By overloading notation, we let APPROX denote the cost paid by APPROX . Further, we let denote the cost paid by APPROX to move all vertices from to the same server .
We now show that APPROX indeed yields a 2-approximate solution of the cost of the optimal offline algorithm.
**Lemma 17.**
.
Proof.
Fix any . Let denote the cost paid by OPT to move the vertices from to the same server. We show that . This claim implies the lemma since
[TABLE]
Observe that while proceeds, it traverses from root to one of the leaves, and at each step, it increases the level of the current internal node by one.
Using the solution of , we can define a similar traversal of : Let be the root of and let and be its children. As must move all vertices from to the same server , moves the vertices from to a server in for . We call the moved vertices dirty. After this move, still needs to process the vertices of which were initially assigned to a server in but not to . We can view this as processing . Thus, traverses until the final server is reached and marks a subset of dirty.
The previous paragraphs define two different traversals of and two different sets of dirty vertices. Let be the smallest level where the two traversals picked different internal nodes in .
Until level , both traversals have marked the same vertices dirty. At levels and below, we obtain the following bounds. Let be the internal node at level that is traversed by both algorithms and let , denote its children at level . Let be defined as in the definition of APPROX . marks at most vertices from as dirty. Since the two traversals split at level and moves vertices (by definition), moves at least vertices.
Recall that for each algorithm, its sets of dirty vertices and its set of moved vertices are identical. Now the following computation proves the claim that :
[TABLE]
4.3 The Recursive Majority Voting Algorithm
We now describe an algorithm which works efficiently in the setting with servers whenever OPT does not perform too many vertex moves. The algorithm can be viewed as a generalization of Algorithm 2 to servers, by exploiting the previously defined bipartitioning scheme.
4.3.1 The Algorithm
The algorithm consists of two parts: A single global algorithm and multiple local algorithms, one per internal node in . The global algorithm maintains a recursive bipartitioning scheme (as defined in Section 4.1) and runs a local algorithm on each bipartition. The local algorithms are used to “reduce” the setting with multiple servers to the case with two servers.
We now describe the two parts in more detail and state the pseudocode in Algorithm 4. We write to denote the server which vertex is assigned to.
Global Algorithm.
The global algorithm starts by computing the bipartition tree . On each internal node of , the global algorithm instantiates a local algorithm which we describe below.
Furthermore, the global algorithm iterates over all vertices and does the following for each . The algorithm finds all internal nodes such that and labels with . This labelling of the vertices only takes into account the initial assignment of the vertices and will never be changed throughout the running time of the algorithm. For example, if the vertices and in Figure 3 are assigned to servers and in the initial assignment, their labels will be and , respectively.
When the adversary provides an edge , the global algorithm does the following. It locates the servers and . If , the algorithm merges the components and continues with the next edge. If , the global algorithm finds the internal node in which is the lowest common ancestor of and . (For example, in Figure 3 the lowest common ancestor for and is .) Then the global algorithm gives the edge to the local algorithm corresponding to .
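The routing decision of the global algorithm can be illustrated with a small lowest-common-ancestor lookup. In this sketch the tree is a nested pair structure with integer server ids at the leaves, and a node is named by the sequence of child indices on the path from the root; both conventions, as well as the function names, are assumptions made for illustration.

```python
def path_to(tree, server, prefix=()):
    """Return the root-to-leaf path (tuple of child indices) of a server."""
    if not isinstance(tree, tuple):
        return prefix if tree == server else None
    for i, child in enumerate(tree):
        found = path_to(child, server, prefix + (i,))
        if found is not None:
            return found
    return None


def lowest_common_ancestor(tree, s, t):
    """Name the lowest common ancestor by the longest common path prefix."""
    common = []
    for a, b in zip(path_to(tree, s), path_to(tree, t)):
        if a != b:
            break
        common.append(a)
    return tuple(common)


def route_edge(tree, server_of, u, v):
    """Return the internal node whose local algorithm should receive (u, v)."""
    s, t = server_of[u], server_of[v]
    if s == t:
        return None                  # same server: merge without any move
    return lowest_common_ancestor(tree, s, t)


tree = ((0, 1), (2, 3))
assert route_edge(tree, {"a": 0, "b": 3}, "a", "b") == ()      # the root
assert route_edge(tree, {"a": 2, "b": 3}, "a", "b") == (1,)    # right child
assert route_edge(tree, {"a": 0, "b": 0}, "a", "b") is None
```

The empty path denotes the root, so an edge between servers on opposite sides of the top-level bipartition is handled by the root's local algorithm.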
Local Algorithms.
A local algorithm is run on an internal node of . Let and denote the children of in . Note that each local algorithm corresponds to a bipartition step where the servers in are partitioned into subsets and .
An instance of the local algorithm only receives edges from the global algorithm when (1) their endpoints are assigned to servers and (2) and are in different sets of the bipartition, i.e., and .
When the global algorithm provides an edge with the above properties, the local algorithm locates , and . Assume w.l.o.g. that . Then is moved to and and are merged. As before, we call this a small-to-large step.
[Footnote 6: When changes its server, all local algorithms corresponding to internal nodes with or must be informed about this move. This can be done by recomputing for each internal node . Note that this is just an internal operation of the data structure and does not incur any cost to the algorithm.]
Finally, the local algorithm checks whether the new component has size or it surpassed a power of , i.e., it checks if or there exists an s.t. , and . If this is the case, the local algorithm triggers a majority voting step for which we explain next.
Majority Voting Step.
When a local algorithm triggers a majority voting step for a connected component , the algorithm does the following. Let be the root of and let and be the two child nodes of . For , let denote the number of vertices in with label . If , the algorithm recurses on in place of ; else, the algorithm recurses on in place of . The recursion continues until a leaf in the bipartitioning tree is reached which corresponds to a server . Then the algorithm moves to .
Note that the above majority voting procedure is very similar to what APPROX does for a single ground truth component .
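The label-based vote can be sketched by naming each tree node with its path from the root and giving every vertex the set of nodes on the path to its initial server. The sketch assumes a complete tree of uniform depth (a power-of-two number of servers), includes the leaf itself among a vertex's labels, and breaks ties towards the first child; these choices and all function names are assumptions, not details from Algorithm 4.

```python
def labels_for(path):
    """All nodes on the root-to-leaf path, each named by a path prefix."""
    return {path[:k] for k in range(len(path) + 1)}


def majority_vote(component, label_sets, depth):
    """Descend from the root, always into the child with more labelled vertices."""
    node = ()
    for _ in range(depth):
        left, right = node + (0,), node + (1,)
        n_left = sum(left in label_sets[v] for v in component)
        n_right = sum(right in label_sets[v] for v in component)
        node = left if n_left >= n_right else right   # ASSUMPTION: ties go left
    return node                                       # path of the chosen server


# Four servers named by their paths; labels depend only on the initial server.
initial_path = {"a": (0, 1), "b": (0, 1), "c": (1, 0), "d": (0, 0)}
label_sets = {v: labels_for(p) for v, p in initial_path.items()}
assert majority_vote(["a", "b", "c", "d"], label_sets, depth=2) == (0, 1)
```

Since the labels are fixed by the initial assignment, the outcome of a vote depends only on which vertices the component contains and not on where the component currently resides.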
Stopping Criterion.
To ensure that the algorithm does not exceed the augmentation of the servers, we add a stopping criterion.
To define the stopping criterion, let be an internal node of with children and . For , we call overloaded if contains at least vertices with label .
Intuitively, the condition states that an internal node is overloaded when its servers obtained “many” vertices which were initially assigned to the other side of the bipartition, .
The stopping criterion is checked before each component move (i.e., before each small-to-large step and before each majority voting step). It is triggered if the component move would create an assignment in which there exists an overloaded internal node . When the stopping criterion is triggered, the global algorithm and all local algorithms stop and Algorithm 1 is started instead (we show in Section 4.4 that Algorithm 1 also works for servers).
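A check of this kind could look as follows. The overload threshold is not preserved in this text, so the sketch takes it as an explicit `threshold` argument; the representation of the bipartitions as pairs of server sets and both function names are likewise assumptions made for illustration.

```python
def foreign_count(side, other_side, current_server, initial_server):
    """Vertices now on a server in `side` that started on a server in `other_side`."""
    return sum(
        1
        for v, s in current_server.items()
        if s in side and initial_server[v] in other_side
    )


def move_triggers_stop(component, target, bipartitions,
                       current_server, initial_server, threshold):
    """Return True if moving `component` to `target` would overload some node.

    bipartitions holds one (left servers, right servers) pair per internal
    node. threshold is a placeholder for the paper's bound (ASSUMPTION).
    """
    trial = dict(current_server)
    for v in component:
        trial[v] = target             # simulate the move without applying it
    for left, right in bipartitions:
        if (foreign_count(left, right, trial, initial_server) >= threshold
                or foreign_count(right, left, trial, initial_server) >= threshold):
            return True
    return False


bipartitions = [({0, 1}, {2, 3}), ({0}, {1}), ({2}, {3})]
initial = {"a": 0, "b": 0, "c": 2, "d": 3}
assert move_triggers_stop(["a", "b"], 2, bipartitions, initial, initial, threshold=2)
assert not move_triggers_stop(["a", "b"], 2, bipartitions, initial, initial, threshold=3)
```

The move is only simulated on a copy of the assignment, which mirrors the requirement that the criterion is checked before the component is actually moved.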
4.3.2 Structural Properties
To obtain a better understanding of the algorithm, we first prove some structural properties about it and defer its cost analysis to Section 4.3.3. We consider the setting where each server has capacity for .
In Subsections 4.3.2 and 4.3.3, we only analyze the cost of Algorithm 4 without the cost of Algorithm 1. We analyze the cost of Algorithm 1 for servers in Sections 4.4 and 4.5.
We begin by showing that as long as the stopping criterion is not triggered, the vertex assignment created by Algorithm 4 is close to the initial assignment.
**Lemma 18.**
Suppose the stopping criterion is not triggered. Then:
1. Each server contains at most vertices that were not initially assigned to it.
2. Each server contains at least vertices that were initially assigned to it.
Proof.
Consider any server . We show that since the stopping criterion is not triggered, obtains at most vertices for each of the subtrees in containing .
As argued in Section 4.1, there are at most internal nodes of such that . Since the stopping criterion is not triggered, no internal node of is overloaded.
To prove Part (1), consider an internal node of with . Let be the children of and suppose . Observe that can obtain at most vertices that were originally assigned to servers in (if it had received more vertices, then would be overloaded). As there are at most nodes with the above property, the number of vertices which were not initially assigned to is bounded by .
Now let us prove Part (2). Consider an internal node of with . Let be the children of and suppose . Now observe that the servers in can have obtained vertices that were originally assigned to (if they had received more vertices, then would be overloaded). As there are at most nodes with the above property, it follows that the number of vertices assigned to servers that were initially assigned to is . Hence, must contain at least vertices that were initially assigned to it. ∎
As a corollary of Lemma 18 we obtain the following lemma.
**Lemma 19.**
(1) As long as the stopping criterion is not triggered, the load of each server is bounded by , i.e., Algorithm 4 uses only augmentation.
(2) When the stopping criterion is triggered, the augmentation still does not exceed .
Proof.
Part (1) of the lemma follows immediately from Part (1) of Lemma 18. Let us prove Part (2): The stopping criterion is checked every time before a component is moved. Hence, at the time when the algorithm checks the stopping criterion, the algorithm did not exceed the augmentation bound due to Part (1). If the algorithm triggers the stopping criterion, then the component was not yet moved and the augmentation is still the same as before. ∎
Define the final assignment to be the assignment which is created by Algorithm 4 once it has seen all edges in . We show that the final assignment of the algorithm provides a perfect partitioning if the stopping criterion is not triggered.
**Lemma 20.**
If Algorithm 4 stops and the stopping criterion is not triggered, then the final assignment is a perfect partitioning.
Proof.
By definition of the algorithm, vertices of the same connected component are always assigned to the same server. When the algorithm finishes, all edges of were revealed and each component has size . By Lemma 18, the augmentation of each server is at most . Since , no server can have more than one component assigned. As each component is placed on a server, each component is placed alone on a server. This proves that the algorithm creates a perfect partitioning. ∎
Indeed, we show that the final assignment of Algorithm 4 is not only a perfect partitioning, but it is the same assignment as the one created by APPROX from Section 4.2.
**Lemma 21.**
If Algorithm 4 stops and the stopping criterion is not triggered, Algorithm 4 and APPROX have the same final assignment.
Proof.
By Part (2) of Lemma 18, Algorithm 4 moves at most vertices out of each server compared to the initial assignment. Hence, in the final assignment each server must still contain at least vertices from its original assignment since . Thus, in the final assignment each server contains more than half of the vertices that were originally assigned to it.
Consider any server and let be the set of vertices initially assigned to . Then there must exist a ground truth component with . We show that APPROX and Algorithm 4 both assign this component to . This proves the lemma since this claim holds for any .
First, consider APPROX . Note that at each step of the traversal of , the majority of the vertices in will vote for the internal node containing server . Hence, APPROX will place on .
Second, consider Algorithm 4. When the algorithm stops, all edges were revealed and the connected components agree with the ground truth components. Now consider the component . When the component grows to size , the algorithm performs a majority voting step (by definition of the algorithm). At this point, more than half of the vertices in were labeled with (because more than half of the vertices from were originally assigned to ). Hence, Algorithm 4 will also place on . ∎
4.3.3 Analysis
The rest of this subsection is devoted to proving the following proposition about Algorithm 4.
**Proposition 22.**
Suppose there are servers and each has capacity for , i.e., the augmentation is . Algorithm 4 has the following properties:
1. If the stopping criterion is not triggered, the algorithm creates a perfect partitioning, its cost is bounded by , and at no point during its execution does it use more than augmentation.
2. If the stopping criterion is triggered, the cost of the algorithm is plus the cost of Algorithm 1 and the cost of OPT is at least .
We prove the proposition at the end of this section. We start by proving a sequence of lemmata and begin by reasoning about the cost paid by Algorithm 4. As shown in Lemma 1 we only need to bound the moving cost paid by Algorithm 4 to bound its total cost.
The following lemma bounds the cost paid for any connected component .
**Lemma 23.**
Let be a connected component. Then the cost (over the entire execution time of the algorithm) paid for moving the vertices in is .
Proof.
We can use the same accounting argument as in the proof of Lemma 13. That is, we assign a token to a vertex whenever it is moved. Now, whenever the component containing is moved due to a small-to-large step, the size of doubles. This can only happen times. Furthermore, there are only majority voting steps involving : Each majority voting step is triggered because or because surpassed a power of ; the first event can happen only once and the second event can happen at most times. Hence, will never accumulate more than tokens. Since the above arguments apply for each , the total cost paid for moving the vertices in is bounded by . ∎
Let be the function which maps each vertex to its server in the final assignment by Algorithm 4. That is, when Algorithm 4 processed all edges, each is assigned to . For a connected component , set for . Note that is well-defined since all vertices of are assigned to the same when the algorithm terminates.
In the following proofs, we will write to denote the number of vertices in a connected component which are labeled with . We further write to denote the number of vertices in which are not labeled with , i.e., .
Lemma 24 shows that whenever a component is assigned to a server which is not its final server, it must contain relatively many vertices which were not initially assigned to its final server .
**Lemma 24.**
Consider any point in the execution of the algorithm at which a connected component is assigned to server . Let be the lowest common ancestor of and in and denote the children of by and .
If for , then:
1. contains at least vertices which do not have label , i.e., .
2. contains at least vertices which were not initially assigned to .
Proof.
To prove Part (1), consider a doubling decomposition of (see Definition 8); the decomposition exists by Lemma 9 which also applies in the server setting. After and were merged, Algorithm 4 performed a majority voting step and placed in a server . Thus, (otherwise, the majority voting step would have chosen a server in ). Since and ,
[TABLE]
For Part (2) note that each vertex which was initially assigned to has label (because by assumption). ∎
In the following, we show that the cost paid by the algorithm is when the stopping criterion is not triggered. We start by showing that when a component is moved for the last time, it contains a large number of vertices which did not originate from the server it is assigned to.
**Lemma 25.**
Let be a component which is moved to server and suppose the vertices of are never reassigned after this move. Then contains at least vertices which were not assigned to in the initial assignment.
[Footnote 7: Note that when a small-to-large step is performed, two components are merged due to the corresponding edge insertion. In this case, the component in the lemma is the component which is being moved (i.e., before merging).]
Proof.
Note that is moved due to one of two reasons: Either because of a small-to-large step or because of a majority voting step. We distinguish between these cases.
In case of a small-to-large step, is assigned to a server before the move. Lemma 24 implies that contains at least vertices which were not originally assigned to .
Now suppose that is moved due to a majority voting step. Let be the last edge which was inserted and which triggered the majority voting step for . Then Algorithm 4 previously merged components and ; suppose w.l.o.g. that was moved to and . Prior to the majority voting step, is assigned to the same server that was assigned to before was inserted. Hence, we can apply Lemma 24 to and obtain that contains at least vertices which were not initially assigned to . Thus, the number of vertices in which do not originate from is at least
[TABLE]
The next lemma considers the cost paid by Algorithm 4 when the stopping criterion is not triggered.
**Lemma 26.**
Suppose there are servers and each has capacity for , i.e., the augmentation is . If the stopping criterion is not triggered and Algorithm 4 stops, then the cost paid by the algorithm is .
Proof.
Fix some . Recall that denotes the cost paid by APPROX to move the vertices from to the server . We show that for , Algorithm 4 pays . The lemma follows from this claim and Lemma 17, since the total cost paid by Algorithm 4 is bounded by
[TABLE]
Consider any ground truth component and let denote the number of vertices moves to server . Note that as moves vertices into , we get .
Consider time of the execution of the algorithm where the following happens. A component is reassigned to and has the following properties: (1) is a subset of and (2) the vertices in never leave server after time . Since each vertex of is assigned to when the algorithm terminates, each vertex of is contained in a component with the above properties (when a vertex or component is never moved, we set ). A component with the above properties is a mixed component if contains at least one vertex which was not initially assigned to . Note that when a mixed component is reassigned to , contains at least one vertex which was not initially assigned to and, hence, must be moved from a server , , to .
We bound the cost for mixed components. Let be the set of all mixed components of . Recall that Algorithm 4 and APPROX create the same final assignment (Lemma 21). Hence, Algorithm 4 moves the same vertices from into as APPROX . Lemma 25 implies that for each at least vertices from are part of the vertices moved by APPROX . Thus, the union of all contains at most vertices.
By Lemma 23, Algorithm 4 pays at most for each over the entire execution. Thus, its total cost is bounded by
[TABLE]
Consider the vertices of which are not in mixed components. These vertices must have been part of components in which all vertices were originally assigned to . By Lemma 12 (which still applies in the server setting), these vertices were never moved. Thus, they do not incur any cost to the algorithm. ∎
Next, we show that when the stopping criterion is triggered, the recursive majority voting algorithm pays and the cost of the solution by OPT is .
Lemma 27.
When the stopping criterion is triggered, (1) the cost paid by Algorithm 4 is and (2) the cost paid by is .
Proof.
Let denote the set of all connected components. Part (1) follows from Lemma 23 since the total cost paid by Algorithm 4 is
[TABLE]
Now we prove Part (2). Let be an internal node of with children and suppose w.l.o.g. that is overloaded. Since the stopping criterion is triggered, contains at least vertices with label .
Let be the set of all connected components with the following properties: is assigned to a server in at the time at which the stopping criterion is triggered and contains at least one vertex which is labeled with .
To show that OPT performs vertex moves, we prove that OPT performs vertex moves for each . Part (2) of the lemma follows since the components in contain at least vertices with label and thus
[TABLE]
We prove that OPT moves at least vertices for each by distinguishing two cases for . We define as the function which maps to the server it is assigned to in the solution of OPT , i.e., OPT assigns to server .
Case 1: , i.e., in the final assignment of OPT , the vertices in are assigned to . Then OPT must perform at least vertex moves because it must move all -labeled vertices of from their initial server in to .
Case 2: , i.e., in the final solution by OPT , the vertices in are assigned to a server . We show that contains at least vertices without label . This implies the claim since OPT must move at least vertices from servers not in to .
Consider a doubling decomposition of (which exists by Lemma 9). After and were merged, the algorithm performed a majority voting step and placed in a server in . Thus, (otherwise, the majority voting step would place in a server in ). Hence, . Since , we get . ∎
Proof of Proposition 22.
The first statement of the proposition is implied by Lemmas 20 (perfect partitioning), 26 (total cost) and 19 (small augmentation). The second statement is proved in Lemma 27 (guarantees when stopping criterion is triggered). ∎
4.4 Small–Large–Rebalance Algorithm for Many Servers
To obtain an efficient algorithm in cases where OPT moves many vertices, we reuse Algorithm 1 from Section 3.2.2. Note that Algorithm 1 also works with servers because it does not use the fact that there are only two servers. In the setting with servers, we obtain the following result.
Proposition 28.
Suppose that all servers have capacity for , i.e., the augmentation is . Then the cost paid by the more efficient version of Algorithm 1 is .
Proof.
The proof of the proposition is almost the same as the proof of Proposition 6. The only difference is that we need to bound the number of rebalance operations differently.
The number of vertex moves performed by the algorithm which always moves the smaller connected component to the server of the larger connected component is and, hence, it incurs cost . Now, whenever a server exceeds its capacity, the algorithm must have moved at least vertices. This can only happen times. By the same arguments as in the proof of Proposition 6, each rebalancing operation costs . Hence, the cost for all rebalancing steps is bounded by . ∎
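The small-to-large rule used in this proof can be illustrated with a short simulation. The following is only a minimal sketch under simplifying assumptions: it charges a move for every merge (even when both components would already share a server) and ignores capacities and rebalancing, so it yields an upper bound on the small-to-large moves rather than the cost of Algorithm 1. The function name `small_to_large_moves` and its parameters are hypothetical and do not appear in the paper.

```python
import math
from typing import Dict, List, Tuple


def small_to_large_moves(n: int, edges: List[Tuple[int, int]]) -> int:
    """Count vertex moves when the smaller component always joins the larger one.

    Assumption: every merge is charged as a move; servers and capacities
    are not modelled.
    """
    comp_of = list(range(n))                      # component label of each vertex
    members: Dict[int, List[int]] = {v: [v] for v in range(n)}
    moves = 0
    for u, v in edges:
        a, b = comp_of[u], comp_of[v]
        if a == b:
            continue                              # edge inside one component
        if len(members[a]) > len(members[b]):
            a, b = b, a                           # make `a` the smaller component
        for x in members[a]:
            comp_of[x] = b                        # "move" each vertex of the smaller side
        moves += len(members[a])
        members[b].extend(members.pop(a))
    return moves


# Eight singletons merged pairwise in three rounds.
edges = [(0, 1), (2, 3), (4, 5), (6, 7), (0, 2), (4, 6), (0, 4)]
total = small_to_large_moves(8, edges)
print(total, "<=", 8 * math.log2(8))
```

Each vertex is moved only when its component at least doubles in size, which is why the total stays within n log2(n) in this simulation.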
We should point out that as in Lemma 5, we could also do the repartitioning step of Algorithm 1 by taking any perfectly balanced assignment respecting the connected components. In the analysis this would incur vertex moves for each such step and, hence, yield an algorithm with vertex moves in total. However, unlike in the two-server case, finding a perfectly balanced assignment respecting the connected components is an NP-hard problem. Nonetheless, the problem can be solved approximately in polynomial time at the cost of a constant factor in the competitive ratio. We discuss this in further detail in Section 5.2.2.
4.5 Bringing It All Together: Theorem 16
Proof of Theorem 16.
Consider the algorithm which first runs Algorithm 4 until the stopping criterion is triggered and then switches to Algorithm 1 from Section 4.4.
If the stopping criterion of Algorithm 4 is not triggered, then by Proposition 22 the cost of the algorithm is . Thus, it is -competitive.
If the stopping criterion is triggered, then Algorithm 4 pays by Proposition 22 and the cost of OPT is . Furthermore, the cost of Algorithm 1 is by Proposition 28. Hence, we obtain the following competitive ratio:
[TABLE]
5 Distributed and Fast Algorithms
In this section we show how the algorithms from Section 4 can be implemented in a distributed setting (Section 5.1) and how they need to be modified to work in polynomial time at the cost of a slightly worse competitive ratio (Section 5.2).
We should point out that even though we discuss the distributed and polynomial time versions of the algorithms separately, they can easily be combined to obtain a distributed algorithm with polynomial computation time.
5.1 Distributed Algorithm
While in Section 4 we presented algorithms in a centralized model of computation, we now show how Algorithms 1 and 4 can be implemented in a distributed model of computation. For realistic parameter settings, the network traffic caused by our distributed algorithms does not increase (asymptotically) compared to the traffic caused by moving around the vertices between the servers.
In our distributed model of computation we assume that all servers have access to: (1) the number of servers , (2) the ID of the root server , (3) a shared clock, and (4) all-to-all communication.
When computing the network traffic, we count the number of messages sent by the algorithms asymptotically and we further assume that each message contains bits. For the sake of simplicity we assume that moving a vertex from one server to another incurs cost . (Footnote 8: This is a realistic assumption since in order to move a vertex, a server must send the ID of the vertex to another server. Sending the ID of the vertex requires bits.)
Because of this simplifying assumption we do not have to distinguish between the number of messages sent by the algorithm and the number of messages used for moving vertices.
In this distributed model of computation, we obtain the following main result for the distributed versions of the algorithms.
Theorem 29.
Consider a system with servers each of capacity (i.e., augmentation ) for . Let be the number of vertex moves performed by OPT .
Then there exists a distributed -competitive algorithm which sends
1. messages if ,
2. messages if .
In particular, if , then the algorithm’s communication cost does not exceed its cost for moving vertices.
We show for Algorithm 4 (Section 5.1.1) and for Algorithm 1 (Section 5.1.2) individually how they can be implemented distributedly. After that we prove Theorem 29 in Section 5.1.3.
5.1.1 Making Algorithm 4 Distributed
We start by considering the distributed implementation of Algorithm 4 and obtain the following result.
Lemma 30.
Algorithm 4 can be implemented in a distributed model of computation such that the guarantees from Proposition 22 still hold. Furthermore, if OPT performs vertex moves, then we additionally have the following two properties:
1. If the stopping criterion is not triggered and the algorithm terminates, then the algorithm sent messages.
2. If the stopping criterion was triggered, the algorithm sent messages.
Proof.
We start by presenting the necessary modifications to the algorithm and analyze the number of sent messages at the end of the proof.
Let us start by observing that each server can maintain a local representation of the bipartition tree : Since the number of servers is known to all servers and does not depend on any other quantity, each server can compute locally. Next, the data structure stores for each vertex its ID (requiring bits) and the ID of the server it was initially assigned to (requiring bits). Thus, the data structure uses bits of storage for each vertex. In other words, it takes messages to move a vertex between different servers.
Next, we provide the modifications for checking the stopping criterion, small-to-large steps and for majority voting steps.
Before the algorithm moves a component from server to server , and need to check whether the move would trigger the stopping criterion. To do so, and do the following.
First, asks for its ID using messages.
Second, distinguishes between two cases: (1) contains at most vertices. Then for each vertex , sends a message to containing the ID of the server was initially assigned to. This requires messages. (2) contains more than vertices. Then locally computes all internal nodes of the bipartition tree which contain as a leaf. For each such node , let be the sibling of in . Now for each , computes the number of vertices in which were initially assigned to a server in . Then sends these values to using messages. Note that in both cases the algorithm does not send more than messages and these messages can be charged to the moving cost of (which requires messages) which happens after the checking of the stopping criterion.
Third, receives the messages from and checks locally whether receiving would trigger the stopping criterion. If the stopping criterion is not triggered, tells to start moving . If the stopping criterion is triggered, sends a message to the root server about this event. Then informs all other servers about switching to Algorithm 1. This requires messages.
Now suppose the algorithm performs a small-to-large step and the stopping criterion was previously checked and not triggered. In this case, no modifications are necessary: The component can just be sent from one server to the other at the cost of messages (since each vertex in can be sent using messages).
Now suppose a server needs to perform a majority voting step for a component . First, observe that can locally decide whether a majority voting step is necessary for since it must only check the size of . Second, when a majority voting step is necessary, can locally compute which server will be the recipient of : For each vertex , knows which server was initially assigned to. Hence, for each , can compute the labels of w.r.t. the bipartitioning scheme from Section 4.1 locally. Since also knows , can compute to which server the component should be moved to. These operations do not require any communication between the servers.
To conclude the proof of the lemma, observe that the distributed algorithm performs exactly as many vertex moves as the centralized algorithm. Hence, the guarantees from Proposition 22 still hold. Next, we analyze the number of messages sent by the algorithm. A small-to-large step moving a component requires messages. Checking the stopping criterion before moving a component requires another messages. Checking whether a majority voting step is necessary requires no communication at all. Hence, the number of messages used by the algorithm is linear in its number of vertex moves. Thus, Proposition 22 implies the two additional properties which are claimed in the statement of the lemma. ∎
5.1.2 Making Algorithm 1 Distributed
For the distributed implementation of Algorithm 1 we obtain the following result.
Lemma 31.
Algorithm 1 can be implemented in a distributed model of computation such that the guarantees from Proposition 28 still hold. Furthermore, if OPT performs vertex moves, then the algorithm sends at most messages.
Proof.
We start by stating which modifications need to be made to make Algorithm 1 distributed.
First, suppose that Algorithm 1 performs a small-to-large step moving a component and that this move does not make any server exceed its capacity. In this case, no modifications are necessary and the number of messages sent is as we have seen in the proof of Lemma 30.
Second, suppose that a small-to-large step wants to move component to server which would cause to exceed its capacity. Then the algorithm performs the following operations:
1. informs the root server that a rebuild is required.
2. asks all servers to send the edges that were inserted and caused the merge of two connected components since the last rebuild. The servers send all of these edges together with the timestamps when they were inserted.
3. locally simulates the whole system from the beginning and obtains knowledge about all connected components and which servers they are assigned to.
4. tells all other servers which components need to be moved and all servers perform the necessary moves.
Since the distributed algorithm performs exactly the same vertex moves as the centralized algorithm, the distributed algorithm is correct and provides the same guarantees as provided in Proposition 28. We only need to analyze how many messages the algorithm sends. To do so, we analyze each step separately.
Every time Step 1 is performed, it requires messages. As there are rebuilds in total, Step 1 sends messages in total.
To bound the number of messages sent in Step 2, recall that in total there are only edges which merge connected components. Hence, sending these edges requires messages. Furthermore, when a server did not obtain an edge merging two connected components between two rebuilds, it can inform about this in messages. As this can be the case for at most servers and since there are rebuilds, at most messages are sent when servers did not receive new edges.
In Step 3, locally simulates the system. This does not incur any network traffic.
Now consider Step 4. During a rebuild, the number of components which the algorithm needs to reassign is trivially bounded by the number of vertex moves performed during the rebuild. Thus, Proposition 28 implies that only messages are required for all invocations of Step 4.
In total, we obtain that the algorithm sends at most messages, where is the number of vertices moved by OPT . ∎
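Steps 2 and 3 of the rebuild can be pictured as the root server collecting timestamped merge edges and replaying them in order. The sketch below is an illustration under assumptions, not the paper's procedure: it reconstructs only the connected components (not the server assignments), uses a plain union-find structure, and models the messages as an in-memory dictionary. The names `replay_merges`, `reports`, and the `(timestamp, u, v)` tuple layout are hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int, int]  # (timestamp, u, v); layout assumed for illustration


def replay_merges(n: int, reports: Dict[int, List[Edge]]) -> List[int]:
    """Replay the merge edges reported by all servers in timestamp order.

    Returns one representative label per vertex, so two vertices share a
    label exactly when they are in the same connected component.
    """
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Merge the per-server reports into one globally ordered sequence.
    all_edges = sorted(e for edges in reports.values() for e in edges)
    for _, u, v in all_edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
    return [find(x) for x in range(n)]


# Three servers report; server 2 saw no merging edge since the last rebuild.
labels = replay_merges(5, {0: [(2, 1, 2)], 1: [(1, 0, 1)], 2: []})
print(labels)
```

An empty report, as for server 2 here, corresponds to the case in the analysis where a server only informs the root that it obtained no merging edge.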
5.1.3 Proof of Theorem 29
To prove the claim about the competitive ratio of the algorithm observe that the distributed algorithm performs exactly the same vertex moves as the centralized algorithm. Hence, the cost paid by both algorithms is the same and the distributed algorithm has the same competitive ratio as the centralized algorithm in Theorem 16.
The claim about the number of messages sent by the algorithm follows from Lemma 30 and Lemma 31 and summing over the number of messages.
To prove the last claim of the theorem, we distinguish two cases. If the stopping criterion was not triggered, then the claim holds by Lemma 30. If the stopping criterion was triggered, then if , we obtain that the total number of messages is
[TABLE]
which is exactly the number of vertices moved by Algorithm 1. ∎
5.2 Fast Algorithms
In this section, we discuss the computational challenges when computing perfectly balanced assignments. These computational problems occur when Algorithm 1 performs rebalancing steps (see Section 3.2 and Section 4.4). So far, we were only concerned with algorithms which try to minimize the vertex moves while using potentially exponential running time. We now consider polynomial time algorithms. The only step where our algorithms might use exponential time is during rebalancing. Thus we show next how to perform the rebalancing operations in polynomial time. In the case of servers, our polynomial time algorithms perform slightly more vertex moves than the exponential time algorithms.
We discuss the two server case which can be solved optimally in polynomial time in Section 5.2.1. In Section 5.2.2, we argue that in the general case with servers this problem is NP-hard. We resolve this issue in Section 5.2.3 by computing approximately balanced assignments in polynomial time.
5.2.1 Computing Perfectly Balanced Assignments for Two Servers
We consider computing a perfectly balanced assignment respecting the connected components for two servers. Specifically, we provide a dynamic program which can find such an assignment in polynomial time.
The dynamic program works as follows. Suppose are the connected components assigned to the two servers. Now let for . We create a set consisting of integers with the following property: Each number corresponds to a set of connected components such that . That is, whenever , there exists a set of connected components which together contain vertices. For each , the algorithm maintains a set of connected components explicitly. We denote the components corresponding to value by .
At the beginning of a rebalancing step, the algorithm sets . The connected component corresponding to value [math] is simply the empty set of vertices, i.e., . For the algorithm does the following. Iterate over all and over all components and add to if and . Whenever a new value is added to , set .
As soon as the value is added to , the dynamic program stops and assigns all vertices in to the left server and all remaining vertices to the right server.
The correctness of the above algorithm is clear by construction. We only need to show that it finishes in polynomial time.
Note that the above dynamic program runs in time . Now observe that is bounded by since there are at most connected components. Furthermore, for each subset , we have that (because the components in cannot contain more than vertices). Thus, because each value corresponds to a subset of components and each value is only added once to . Hence, the algorithm runs in time .
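The dynamic program of this subsection is essentially a subset-sum computation over the component sizes. The following minimal sketch processes one component at a time, which is the standard subset-sum formulation and a slight variant of the iteration order described above. The function name `balanced_bipartition` and the representation of components by their sizes are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple


def balanced_bipartition(sizes: List[int]) -> Optional[Tuple[List[int], List[int]]]:
    """Split components (given by their sizes) into two halves of equal total size.

    Returns index lists (left, right), or None if no perfectly balanced
    split exists.
    """
    total = sum(sizes)
    if total % 2 != 0:
        return None
    target = total // 2

    # reachable[v] = indices of one set of components whose sizes sum to v
    reachable: Dict[int, List[int]] = {0: []}
    for i, s in enumerate(sizes):
        # Iterate over a snapshot so that component i is used at most once.
        for v, comps in list(reachable.items()):
            new_v = v + s
            if new_v <= target and new_v not in reachable:
                reachable[new_v] = comps + [i]
        if target in reachable:
            break  # stop as soon as the target value has been reached

    if target not in reachable:
        return None
    left = reachable[target]
    chosen = set(left)
    right = [i for i in range(len(sizes)) if i not in chosen]
    return left, right


sizes = [3, 1, 1, 2, 2, 1]
left, right = balanced_bipartition(sizes)
print(sum(sizes[i] for i in left), sum(sizes[i] for i in right))
```

Since every value is stored at most once and values never exceed half the number of vertices, the table stays small, which mirrors the running-time argument above.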
5.2.2 Computing Perfectly Balanced Assignments for Many Servers
We consider computing a perfectly balanced assignment respecting the connected components for servers.
Let be the connected components which are assigned to the servers. To find a perfectly balanced assignment respecting the connected components, we need to find a partition of the set into subsets such that for each subset we have that .
Unfortunately, the above problem is known to be NP-complete, see, e.g., the result about multi-processor scheduling in Garey and Johnson [28]. However, since we prove our results in the online model of computation, which allows unlimited computational power, the algorithm can solve this NP-complete problem. We note that this problem has also been studied in practice, see, e.g., Schreiber et al. [48] and references therein.
See Section 5.2.3 for how this problem can be solved approximately at the cost of a constant in the competitive ratio of the algorithm.
5.2.3 Computing Approximately Balanced Assignments for Many Servers
Previously, we have seen that perfectly balanced assignments for servers cannot be computed in polynomial time unless P = NP (Section 5.2.2). Thus, we now consider computing approximately balanced assignments for servers, which is sufficient for our purpose: Let be a constant. An assignment is -approximately balanced if each server has load at most . Using this definition, we obtain the following result.
Proposition 32.
Let be constants and suppose each server has capacity . Then a -approximately balanced assignment for servers can be computed in polynomial time.
Using the proposition (which we prove at the end of the subsection), we obtain a polynomial time algorithm with a slightly worse competitive ratio than that of Theorem 16.
Theorem 33.
Given a system with servers each of capacity , for constant , there exists an -competitive algorithm which runs in polynomial time.
Proof.
First, observe that Algorithm 4 runs in polynomial time. Thus, the result of Proposition 22 also holds for polynomial time algorithms.
Second, consider a modification of Algorithm 1 where at each rebalancing step we compute a -approximately balanced assignment for . Such an approximately balanced assignment can be computed in polynomial time due to Proposition 32. Thus, the modified algorithm runs in polynomial time.
Observe that now all steps of the resulting algorithm can be computed in polynomial time. It is left to bound the competitive ratio of the modified algorithm.
We start by bounding the cost paid by the modified version of Algorithm 1. Note that each approximate rebalancing step incurs cost at most ; recall that denotes the cost for moving a vertex to a different server. Now we bound the number of approximate rebalancing steps. Recall from Lemma 4 that the number of vertex moves due to small-to-large steps is at most . Now whenever a new approximately balanced assignment is computed, the small-to-large steps must have moved at least vertices to exceed the capacity of one of the servers. Thus, the total number of approximate rebalancing operations is bounded by and, hence, the total cost of Algorithm 1 with approximate rebalancing steps is bounded by .
Altogether, we obtain the following competitive ratio by following the steps from the proof of Theorem 16 (Section 4.5):
[TABLE]
where in the last step we used that . ∎
To prove Proposition 32, we consider the makespan minimization problem in which there are jobs with processing times which must be assigned to identical machines. Given an assignment of the jobs to the machines, the maximum running time of any machine is called the makespan. The goal is to find an assignment of the jobs to the machines which minimizes the makespan.
The makespan minimization problem is known to be NP-hard but Hochbaum and Shmoys [31] presented a polynomial time approximation scheme (PTAS).
Lemma 34 (Hochbaum and Shmoys [31]).
Let be a constant. Then there exists an algorithm which computes a -approximate solution for the makespan minimization problem in polynomial time.
Using the result from the lemma we can prove Proposition 32.
Proof of Proposition 32.
Suppose the system currently contains connected components . We consider these connected components as the jobs of the makespan minimization problem with processing times for . The machines correspond to the servers.
Note that the optimal solution for the instance of the makespan minimization problem is : Since we have made the assumption that in the final assignment all servers have load exactly , there must exist a perfectly balanced assignment from the components to the servers . In other words, there exists an assignment of the jobs to the machines such that each machine has running time and, hence, the optimal makespan is .
By running the algorithm from Lemma 34, we obtain a -approximate solution for the makespan minimization problem. Since the optimal solution for this problem is , each machine has load at most in the solution returned by the algorithm from Lemma 34. Assigning the components to the servers in exactly the same way as the corresponding jobs are assigned to the corresponding machines, we obtain a -approximately balanced assignment in polynomial time. ∎
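To make the reduction concrete, the sketch below treats components as jobs and servers as machines. It deliberately does not implement the PTAS of Hochbaum and Shmoys; as a stand-in it uses the simpler longest-processing-time-first (LPT) greedy rule, which is a different algorithm with a weaker, constant-factor guarantee and therefore does not yield the approximation quality required by Proposition 32. The names `lpt_assignment` and `max_load` are hypothetical.

```python
import heapq
from typing import List


def lpt_assignment(sizes: List[int], num_servers: int) -> List[List[int]]:
    """Greedy LPT: place components in decreasing size on the least loaded server.

    Stand-in for the PTAS of Lemma 34; only a constant-factor approximation.
    """
    order = sorted(range(len(sizes)), key=lambda i: sizes[i], reverse=True)
    heap = [(0, s) for s in range(num_servers)]   # (current load, server index)
    heapq.heapify(heap)
    assignment: List[List[int]] = [[] for _ in range(num_servers)]
    for i in order:
        load, s = heapq.heappop(heap)             # least loaded server
        assignment[s].append(i)
        heapq.heappush(heap, (load + sizes[i], s))
    return assignment


def max_load(sizes: List[int], assignment: List[List[int]]) -> int:
    """Makespan of an assignment, i.e., the largest server load."""
    return max(sum(sizes[i] for i in comps) for comps in assignment)


sizes = [5, 4, 3, 3, 3]            # a perfectly balanced split is {5, 4} and {3, 3, 3}
assignment = lpt_assignment(sizes, 2)
print(assignment, max_load(sizes, assignment))
```

On this instance the optimal makespan is 9, while the greedy rule ends with a larger maximum load. This gap is the reason the proof relies on an approximation scheme whose error can be made arbitrarily small.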
6 Lower Bounds
To study the optimality of our algorithms, we derive bounds on the competitive ratios which can be achieved by any deterministic online algorithm.
The following theorem provides a lower bound of . The lower bound has the following two main consequences: (1) If an algorithm is only allowed to use constant augmentation (i.e., servers of capacity ), then the lower bound implies that any algorithm must have a competitive ratio of . (2) The lower bound holds even in the setting in which there are only two servers. Thus, the algorithm from Section 3 for the two server setting is close to optimal (up to a factor) and the generalized algorithm from Section 4 is optimal up to a factor. (Footnote 9: To obtain servers of capacity , we must set .)
Theorem 35.
Suppose there are two servers of capacity for . Then any deterministic online algorithm must have a competitive ratio of .
To prove the theorem, we show in Section 6.1 that there exist input sequences such that either an algorithm always assigns vertices of the same connected component to the same server or it has prohibitively high cost. Using this fact, we prove our concrete lower bounds in Section 6.2.
6.1 Assigning Connected Components to Servers
In this subsection, we give an important reduction which will be useful to derive the lower bounds in the next subsection (Section 6.2). This reduction lets us assume that every competitive algorithm will always assign vertices of the same connected component to the same server.
More concretely, we show that every sequence of edges can be manipulated to a new edge sequence such that: (1) reveals the same edges as and (2) on input , every algorithm either moves the vertices of the same connected components to the same server, or has prohibitively high cost and, hence, cannot be competitive.
We first prove the following technical lemma.
Lemma 36.
Consider a sequence which reveals the edges . Let be the connected components induced by .
Then for each initial assignment there exists an input sequence consisting only of edges in such that either (1) at some point during the input sequence the algorithm assigns all vertices from each to the same or (2) the cost of the algorithm is at least .
Proof.
We will construct an input sequence provided by the adversary such that either Property (1) or Property (2) must hold.
Consider an arbitrary initial assignment and pick the ground truth components such that they do not coincide with the initial assignment of the vertices to the servers, i.e., for all . Let be the edges revealed by the adversary and suppose that contains at least one edge such that and are assigned to different servers in the initial assignment.
Now consider the input sequence with which consists of the edges in concatenated times.
Suppose that while running the algorithm there always exists a such that not all vertices from are assigned to the same server , i.e., Claim (1) does not apply. We show that then Claim (2) must apply.
Consider the state of the algorithm prior to a single subsequence containing the edges . By assumption at least one edge must be between two vertices from different servers. Now the algorithm must either pay for communication along this edge or it must move one of the edge’s endpoints at the cost of to avoid paying for communication along this edge. Thus, the algorithm must pay at least for the subsequence .
As there are such subsequences, the algorithm must pay at least in total. ∎
As we will see, the lemma essentially allows us to assume that every algorithm which obtains an edge between vertices on different servers must move their connected components to the same server. That is, given an input sequence , in our lower bound proof, we can employ Lemma 36 to obtain an input sequence which does not reveal any additional edges and which forces every algorithm to have Property (1) or Property (2).
Now observe that if an algorithm has Property (2), since the cost of OPT is always bounded by (OPT moves each vertex at most once), the algorithm cannot be competitive: the competitive ratio must be at least , much higher than the competitive ratios derived in this paper. Hence, in the following we can assume that every algorithm with a competitive ratio better than must satisfy Property (1) of Lemma 36.
6.2 Lower Bound Proofs
In this subsection, we prove Theorem 35 by proving two different lower bounds: The first lower bound asserts a competitive ratio of and the second lower bound asserts a competitive ratio of .
In the lower bound constructions we heavily exploit that we provide hard instances against deterministic algorithms, i.e., we will rely on the fact that at each point in time the adversary knows exactly which assignment the online algorithm created.
Furthermore, we assume that after each edge which was provided by the adversary, the algorithm creates an assignment such that all vertices of the same connected component are assigned to the same server. This assumption is admissible by the discussion in Section 6.1.
We start by proving the lower bound of .
Lemma 37.
Consider the setting with two servers which both have capacity for .
Then for each deterministic online algorithm ON there exists an input sequence such that the cost of ON is and the cost paid by OPT is . Thus, the competitive ratio of every online algorithm is .
Proof.
Choose an arbitrary initial assignment of vertices to the servers. Let denote the allowed augmentation of the servers. The initial assignment is as follows. In the left server, there are connected components of size . On the right server, we build one connected component of size denoted and one large connected component of size denoted . First, the adversary provides all edges of these connected components at no cost to the algorithm.
Then the adversary inserts an edge from a vertex in to a vertex in . Since has size and the right server currently has vertices, the algorithm cannot move to the right server. For the same reason, the algorithm cannot move to the left server either. Thus, the algorithm’s only option to bring and to the same server is to replace with some at the cost of .
We will refer to the merged connected component of size as . Note that must be on the left server. Now let be the connected component of size on the right server. The adversary adds an edge from a vertex in to a vertex in . By the same reasoning as before, the algorithm must now pick some , , of size from the left server and swap it with . This costs another .
The adversary continues the previous procedure until only a of size is left on the left server and then she connects and . This gives the final partitioning of the vertices.
We observe that each vertex which is on the left server at the very end, has been on the right server exactly once during the execution of the algorithm. Thus, the costs paid by the algorithm must be .
Note that OPT pays exactly because it can determine beforehand which must be moved to the right server and only move that connected component. We have seen above that any deterministic algorithm must pay at least . Thus, the competitive ratio is . ∎
Next, we prove the lower bound for the competitive ratio of deterministic algorithms.
Lemma 38.
Consider the setting with two servers which both have capacity for .
Then for each deterministic online algorithm ON there exists an input sequence such that the cost paid by ON is and the cost paid by OPT is . Thus, the competitive ratio of every deterministic online algorithm is .
Proof.
Choose an arbitrary initial assignment of vertices to the servers. Since we want to prove a lower bound, we can assume that is a power of 2. Thus, suppose that for .
In our hard instance, we create a sequence of edge insertions which proceeds in rounds. When round starts, all connected components induced by the previously provided edges have size , and when round finishes, all connected components have size . We show that ON pays in each round. This implies the claimed cost of for ON. The cost for OPT follows immediately from Lemma 3, which states that OPT never pays more than when there are only two servers.
When ON starts and no edge was provided by the adversary, all connected components have size , i.e., the connected components are isolated vertices.
Now suppose round starts. By induction, all connected components have size . We now define a sequence of edge insertions for round which forces ON to pay and after which all connected components have size .
Let denote the current number of connected components of size . When round starts, there are exactly connected components of size each. Recall that each server has capacity . Thus, at most
[TABLE]
connected components of size can be assigned to each server.
Now suppose there exists an edge such that and are of size and they are assigned to different servers; we call such an edge expensive. When the adversary inserts an expensive edge, ON must pay for moving or to a different server.
The strategy of the adversary is to insert expensive edges as long as they exist. Once no expensive edges exist anymore, the adversary connects all remaining components of size arbitrarily until all components have size .
Note that expensive edges exist as long as (because when this inequality is satisfied, not all connected components of size can be assigned to the same server). Furthermore, observe that when the adversary inserts an expensive edge, decreases by 2.
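The notion of an expensive edge can be made concrete with a small helper. The following Python sketch is illustrative only: the function name `find_expensive_edge`, the representation of components as `(size, server)` pairs, and the parameter `size` (standing for the component size of the current round) are assumptions introduced here and are not taken from the paper.

```python
def find_expensive_edge(components, size):
    """Return indices (i, j) of two components of the given size that are
    assigned to different servers, or None if no such pair exists.

    components: list of (component_size, server) pairs (hypothetical layout).
    size: component size of the current round (an assumed parameter).
    """
    first_on_server = {}  # server -> index of first matching component seen
    for idx, (comp_size, server) in enumerate(components):
        if comp_size != size:
            continue
        for other_server, other_idx in first_on_server.items():
            if other_server != server:
                return (other_idx, idx)
        first_on_server.setdefault(server, idx)
    return None
```

In this sketch the adversary would repeatedly call the helper and connect the returned pair; once it returns `None`, all components of the current size sit on one server and the remaining ones can be connected arbitrarily.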
Now we prove a lower bound on the number of expensive edges . By the previous arguments, must be large enough such that:
[TABLE]
Solving this inequality for , we obtain
[TABLE]
We conclude that the adversary can perform expensive edge insertions. Since for each of these edge insertions, ON must pay , we obtain that the cost paid by ON in round is
[TABLE]
7 Sample Applications: A Distributed Union Find Algorithm and Online -Way Partitioning
In this section we provide two sample applications for our model and our algorithms. First, we show that our results can be used to solve a distributed union find problem and we give an example where a union find data structure is used in practice. Second, we show that our algorithms imply competitive algorithms for an online version of the -way partitioning problem.
7.1 Distributed Union Find
Recall that in the static union find problem, there are elements from a universe and initially there are sets containing one element each. The data structure supports two operations: and . Given two elements , the operation merges the sets containing and . The operation returns the set containing .
In the distributed setting we consider, elements are stored across servers. Each server has enough capacity to store elements and we have the natural constraint that elements from the same set must always be stored on the same server (in order to maintain locality for elements from the same set). We consider a setting in which all sets have size when the algorithm finishes.
Note that if the sets of are stored on different servers when the operation is performed, one of the sets containing or must be moved to a different server. The goal of an algorithm is to minimize the moving cost caused by -operations.
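To illustrate how moving cost arises, the following Python sketch keeps every set on a single server and charges the size of the set that is moved whenever a union joins sets on different servers. It is a simplified illustration and not the algorithm behind Theorem 39: the class name `DistributedUnionFind`, the "move the smaller set" rule, and the `server_of` input are assumptions made here, and server capacities are deliberately ignored. As is common in implementations, `find` returns a representative element rather than the set itself.

```python
class DistributedUnionFind:
    """Union find whose sets are each pinned to one server.

    Capacities are NOT enforced; this only illustrates moving cost.
    """

    def __init__(self, server_of):
        # server_of[x] is the assumed initial server of element x.
        n = len(server_of)
        self.parent = list(range(n))
        self.size = [1] * n
        self.server = list(server_of)  # meaningful at representatives only
        self.moving_cost = 0

    def find(self, x):
        # Path halving keeps the trees shallow.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, u, v):
        ru, rv = self.find(u), self.find(v)
        if ru == rv:
            return
        if self.size[ru] < self.size[rv]:
            ru, rv = rv, ru
        # ru now represents the larger set; the smaller set joins its server.
        if self.server[ru] != self.server[rv]:
            self.moving_cost += self.size[rv]
        self.parent[rv] = ru
        self.size[ru] += self.size[rv]
```

For example, with elements 0 and 1 on server 0 and elements 2 and 3 on server 1, the unions (0, 1) and (2, 3) are free, while a subsequent union (0, 2) moves a set of two elements.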
When analyzing the moving cost, we will compare with an optimal offline algorithm which knows in advance which -operations will be performed. Thus, the optimal algorithm can move from the initial assignment to the final assignment at the minimum possible cost. For our analysis we will compute the competitive ratio between an online algorithm solving the above problem and the optimal offline algorithm (as also detailed in Section 2).
Using the algorithms from Sections 4 and 5.1, we obtain the following result.
Theorem 39.
Consider a system with servers each of capacity for . Then there exists a distributed -competitive algorithm for the distributed union find problem. Moreover, for servers, the algorithm’s communication cost does not exceed its cost for moving vertices.
Proof.
The theorem follows immediately from Theorem 29 by the following reduction from the model in Section 2. We identify vertices in the model from Section 2 with elements from the universe in the union find model. Furthermore, for each operation we insert an edge into the model from Section 2. Since all algorithms we considered always collocate vertices from the same connected component, they satisfy the constraint that elements from the same set must be assigned to the same server. Moreover, in our analysis we were able to focus on the number of vertex moves due to Lemma 1. In our proofs, we showed competitive bounds for the number of vertex moves performed by the algorithm from Theorem 29 compared with an optimal offline algorithm. Thus, the same bounds as derived in Theorem 29 apply. ∎
For servers and the exact number of messages sent by the algorithm, see Theorem 29. The guarantees from Theorem 29 carry over immediately.
One example where distributed union find data structures are used in practice is search engines [17]. A search engine stores many different documents from the Web over multiple servers. Union find data structures are then used to collocate duplicate documents on the same server, i.e., when documents and are identified as duplicates, the operation is used to collocate these documents (and all previously identified duplicates) on the same server. Furthermore, union find data structures are used to find blocks in dense linear systems and in pattern recognition tasks (see Cybenko et al. [19] and references therein).
7.2 Online -Way Partitioning
The model and algorithms we study in this paper can also be used to solve an online variant of the -way partition problem [48]. In the static version of the -way partition problem one is given a (multi-)set of integers and the task is to partition into subsets such that the sums of all subsets are (approximately) equal.
Our model and our algorithms can be used to solve the following online version of this fundamental problem. Initially, contains integers and all integers are . Each integer is assigned to one of bins and each bin has capacity . Now in an online sequence of operations, an adversary picks two integers from and these integers are added. For example, after adding integers , becomes . During this sequence of operations an online algorithm must ensure that the load of all bins is always bounded by . We work under the assumption that after each operation there always exists an assignment from the integers in to the bins such that each bin has load exactly . We further assume that at the end of the sequence of operations there are integers and each integer is .
Note that when two integers from different bins are added, either or must be moved to a different bin. This might cause that bin to exceed its capacity.
We will analyze algorithms which have small moving cost. That is, the cost of an algorithm is the sum of the numbers it has moved. We consider the competitive analysis of online algorithms compared with an optimal offline algorithm which knows the sequence of additions in advance and which can move the numbers at optimal cost.
We then obtain the following result for the -way partitioning problem.
Theorem 40.
Consider a system with bins each of capacity for . Then there exists a -competitive algorithm for the -way partition problem.
Proof.
We can relate the online version of the -way partition problem to the model we study by identifying integers and the sizes of connected components. Initially, we identify each with a single vertex. Note that this can be done since initially and thus and the size of its corresponding connected component are the same. After that, when two integers and are added, we take their corresponding connected components and and insert an edge between them. Note that the resulting integer corresponds to the connected component and their sizes agree, i.e., . Now observe that summing the moving cost for integers is the same as counting the number of vertex reassignments for connected components. Thus, the result of the theorem follows from Theorem 16. ∎
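The correspondence used in this proof can be checked on small inputs. The Python sketch below replays a sequence of additions in both views at once, integers in bins and connected components on servers, and reports the moving cost in each. The function name `replay`, the input format, and the rule "keep the larger integer in place" are assumptions made for illustration; bin capacities are ignored, so this is not the algorithm of Theorem 40.

```python
def replay(n, bin_of, additions):
    """Replay additions in the integer view and the vertex view side by side.

    n: number of initial integers (each equal to 1).
    bin_of: bin_of[i] is the assumed initial bin of integer i.
    additions: pairs (a, b) of ids of integers currently in the multiset;
               the sum keeps id a or b, whichever integer is larger.
    Returns (integer_moving_cost, vertex_moves, remaining_integers).
    """
    value = {i: 1 for i in range(n)}          # integer view
    int_bin = dict(enumerate(bin_of))
    comp = {i: {i} for i in range(n)}         # vertex view
    vertex_server = list(bin_of)
    int_cost = 0
    vertex_moves = 0
    for a, b in additions:
        if value[a] < value[b]:
            a, b = b, a                       # a is the larger integer
        if int_bin[a] != int_bin[b]:
            int_cost += value[b]              # cost = the number moved
            for v in comp[b]:                 # same move, vertex by vertex
                vertex_server[v] = int_bin[a]
                vertex_moves += 1
        value[a] += value.pop(b)
        comp[a] |= comp.pop(b)
        del int_bin[b]
    return int_cost, vertex_moves, value
```

Both cost counters agree on every input because each integer always equals the size of its component, which is the observation the proof relies on.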
8 Related Work
The design of more flexible networked systems that can adapt to their workloads has received much attention over the last years, with applications for traffic engineering [32, 33], load-balancing [43, 20], network slicing [49], server migration [12], switching [16, 25], or even adjusting the network topology [29]. The impact of distributed applications on the communication network is also well-documented in the literature [40, 26, 37, 50, 18]. Several empirical studies exploring the spatial and temporal locality in traffic patterns found evidence that these workloads are often sparse and skewed [4, 29, 47, 34], introducing optimization opportunities. E.g., studies of reconfigurable datacenter networks [30, 29] have shown that for certain workloads, a demand-aware datacenter network can achieve a performance similar to a demand-oblivious datacenter network at 25-40% lower cost [30, 29].
However, much less is known about the algorithmic challenges underlying such workload-adaptive networked systems, the focus of our paper. From an online algorithm perspective, our problem is related to reconfiguration problems such as online page (resp. file) migration [10, 14] as well as server migration [13] problems, -server [24] problems, or online metrical task systems [15]. In contrast to these problems, in our model, requests do not appear somewhere in a graph or metric space but between communication partners. From this perspective, our problem can also be seen as a “distributed” version of online paging problems [23, 39, 51, 54] (and especially their variants with bypassing [2, 21]) where access costs can be avoided by moving items to a cache: in our model, access costs are avoided by collocating communication partners on the same server (a “distributed cache”).
The static version of our problem, how to partition a graph, is a fundamental and well-explored problem in computer science [53], with many applications, e.g., in community detection [1]. The balanced graph partitioning problem is related to minimum bisection problems [22], and known to be hard even to approximate [6]. The best approximation today is due to Krauthgamer [36]. In contrast, in this paper we are interested in a dynamic version of the problem where the edges of the to-be-partitioned graph are revealed over time, in an online manner. Further, the offline problem of embedding workloads in a communication-efficient manner has been studied in the context of the minimum linear arrangement problem [46] and the virtual network embedding problem [55], however, without considering the option of migrations. In this regard, our paper features an interesting connection to the itinerant list update model [42], a kind of “dynamic” minimum linear arrangement problem which allows for reconfigurations and, notably, considers pair-wise requests. However, communication is limited to a line and so far, only non-trivial offline solutions are known.
One of the applications of the problem we study is a distributed union find data structure (see Section 7.1). Union find data structures have been initially proposed in the centralized setting and efficient algorithms were derived [27, 52]. Later, parallel versions of union find data structures were considered in a shared memory setting in which the goal was to derive wait-free algorithms [5]; also external memory algorithms were considered [3]. To the best of our knowledge studies of union find data structures in a distributed memory setting were only conducted experimentally, see (for example) [19, 38, 44, 45].
The second application we presented was online -way partitioning (Section 7.2). The -way partitioning problem is known to be NP-hard as it constitutes a very simple scheduling problem [28]. The problem has also been researched in practice, see, e.g., [35, 48] and references therein. We are not aware of literature studying the online version of the problem which we have considered.
The paper most closely related to ours is by Avin et al. [8, 7] who studied a more general version of the problem considered in our paper. In their model, request patterns can change arbitrarily over time, and in particular, do not have to follow a partition and hence “cannot be learned”. Indeed, as we have shown in this paper, learning algorithms can perform significantly better: in [8], it was shown that for constant any deterministic online algorithm must have a competitive ratio of at least unless it can collocate all nodes on a single server, while we have presented an -competitive online algorithm. Thus, our result is exponentially better than what can possibly be achieved in the model of [8].
9 Conclusion
Motivated by the increasing resource allocation flexibilities available in modern compute infrastructures, we initiated the study of online algorithms for adjusting the embedding of workloads according to the specific communication patterns, to reduce communication and moving costs. In particular, we presented algorithms and derived upper and lower bounds on their competitive ratio.
We believe that our work opens several interesting questions for future research. In particular, it remains to close the gap between the upper and lower bound of the competitive ratios derived in this paper. Furthermore, while in this paper we assumed that there are ground truth components of size , it will be interesting to study more general settings with smaller and larger components.
More generally, it will be interesting to consider algorithms which do not collocate all communication partners. Also, studying collocation in specific networks such as Clos networks, which are frequently encountered in datacenters, would be intriguing.
Acknowledgments
We are grateful to our shepherd Rachit Agarwal as well as the anonymous reviewers whose insightful comments helped us improve the presentation of the paper.
The research leading to these results has received funding from the European Research Council under the European Community’s Seventh Framework Programme (FP7/2007-2013) / ERC grant agreement No. 340506. Stefan Neumann gratefully acknowledges the financial support from the Doctoral Programme “Vienna Graduate School on Computational Optimization” which is funded by the Austrian Science Fund (FWF, project no. W1260-N35).
References
- [1] Emmanuel Abbe. Community detection and stochastic block models: Recent developments. Journal of Machine Learning Research, 18(177):1–86, 2018.
- [2] Anna Adamaszek, Artur Czumaj, Matthias Englert, and Harald Räcke. An O(log k)-competitive algorithm for generalized caching. In Proc. 23rd SODA, pages 1681–1689, 2012.
- [3] Pankaj K. Agarwal, Lars Arge, and Ke Yi. I/O-efficient batched union-find and its applications to terrain analysis. ACM Trans. Algorithms, 7(1):11:1–11:21, 2010.
- [4] Mohammad Alizadeh, Albert Greenberg, David A Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan. Data center TCP (DCTCP). In Proc. ACM SIGCOMM Computer Communication Review (CCR), volume 40, pages 63–74, 2010.
- [5] Richard J. Anderson and Heather Woll. Wait-free parallel algorithms for the union-find problem. In STOC, pages 370–380, 1991.
- [6] Konstantin Andreev and Harald Räcke. Balanced graph partitioning. Theory of Computing Systems, 39(6):929–939, 2006.
- [7] Chen Avin, Marcin Bienkowski, Andreas Loukas, Maciej Pacut, and Stefan Schmid. Dynamic balanced graph partitioning. In SIAM J. Discrete Math (SIDMA), 2019.
- [8] Chen Avin, Andreas Loukas, Maciej Pacut, and Stefan Schmid. Online balanced repartitioning. In Proc. 30th International Symposium on Distributed Computing (DISC), 2016.
