Near Optimal Coflow Scheduling in Networks
Mosharaf Chowdhury, Samir Khuller, Manish Purohit, Sheng Yang, Jie You

TL;DR
This paper introduces a practical, efficient approximation algorithm for coflow scheduling in general networks, extending prior work from bipartite graphs to more complex graph models, with strong theoretical and empirical results.
Contribution
It presents a randomized 2-approximation algorithm for coflow scheduling on general graphs, improving previous bounds and demonstrating practical effectiveness.
Findings
The algorithm achieves a 2-approximation ratio.
Extensive experiments show the algorithm is practical and performs well.
The approach extends coflow scheduling to general network topologies.
Abstract
The coflow scheduling problem has emerged as a popular abstraction in the last few years to study data communication problems within a data center. In this basic framework, each coflow has a set of communication demands and the goal is to schedule many coflows in a manner that minimizes the total weighted completion time. A coflow is said to complete when all its communication needs are met. This problem has been extremely well studied for the case of complete bipartite graphs that model a data center with full bisection bandwidth and several approximation algorithms and effective heuristics have been proposed recently. In this work, we study a slightly different model of coflow scheduling in general graphs (to capture traffic between data centers) and develop practical and efficient approximation algorithms for it. Our main result is a randomized 2 approximation algorithm for the…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Near Optimal Coflow Scheduling in Networks
Mosharaf Chowdhury
University of Michigan, Ann Arbor
,
Samir Khuller
Northwestern University
,
Manish Purohit
Google, Mountain View
,
Sheng Yang
University of Maryland, College Park
and
Jie You
University of Michigan, Ann Arbor
(2019)
Abstract.
The coflow scheduling problem has emerged as a popular abstraction in the last few years to study data communication problems within a data center (Chowdhury and Stoica, 2012). In this basic framework, each coflow has a set of communication demands and the goal is to schedule many coflows in a manner that minimizes the total weighted completion time. A coflow is said to complete when all its communication needs are met. This problem has been extremely well studied for the case of complete bipartite graphs that model a data center with full bisection bandwidth and several approximation algorithms and effective heuristics have been proposed recently (Agarwal et al., 2018; Ahmadi et al., 2017; You and Chowdhury, 2019).
In this work, we study a slightly different model of coflow scheduling in general graphs (to capture traffic between data centers (Jahanjou et al., 2017; You and Chowdhury, 2019)) and develop practical and efficient approximation algorithms for it. Our main result is a randomized 2 approximation algorithm for the single path and free path model, significantly improving prior work. In addition, we demonstrate via extensive experiments that the algorithm is practical, easy to implement and performs well in practice.
coflow, scheduling, LP relaxation, network flow, LP rounding, cloud computing
††journalyear: 2019††copyright: rightsretained††conference: 31st ACM Symposium on Parallelism in Algorithms and Architectures; June 22–24, 2019; Phoenix, AZ, USA††booktitle: 31st ACM Symposium on Parallelism in Algorithms and Architectures (SPAA ’19), June 22–24, 2019, Phoenix, AZ, USA††doi: 10.1145/3323165.3323179††isbn: 978-1-4503-6184-2/19/06††submissionid: spaaf023††ccs: Mathematics of computing Approximation algorithms††ccs: Mathematics of computing Graph algorithms††ccs: Networks Packet scheduling††ccs: Networks Network control algorithms††ccs: Theory of computation Scheduling algorithms††ccs: Theory of computation Routing and network design problems††ccs: Theory of computation Linear programming††ccs: Theory of computation Network flows††ccs: Theory of computation Rounding techniques††ccs: Computer systems organization Cloud computing
1. Introduction
Modern computing applications have rather intensive computational needs. Many machine learning applications require up to tens of thousands of machines and often involve processing units across multiple data centers collaborating on the same application. This collaboration is usually handled by a large-scale distributed computing framework that ideally ensures a close-to-linear speedup compared to a single machine. A crucial part of the collaboration is that large chunks of data require both inter and intra-datacenter transmissions.
For intra-datacenter transmission, a common example would be the MapReduce framework. Map workers write all intermediate results independently to several servers to guard against failure and allow possible re-calculation. These results are shuffled and sent to Reduce workers. The volume of transmission between machines is so large that it has become a major bottleneck in the performance. In addition to this challenge, multiple applications may share the same cluster, and an un-coordinated schedule of their data transmission may cause an unacceptable delay in their completion times.
Chowdhury and Stoica (Chowdhury and Stoica, 2012) first introduced the abstraction of coflow scheduling, which assumes that each application consists of a set of flows, and is finished once all the flows are completed. In their framework the network between machines is modeled as a switch: the input ports of different machines on one side, and output ports on the other side. A machine can send (receive) data to (from) any other machine, but to (from) only one machine at a time (sending and receiving may happen concurrently). The transmission speed between all machines is uniform. This describes a “perfect” datacenter where networking between machines is handled by a high-speed central switch (modeled by a complete bipartite graph) connected directly to all the machines (Chowdhury and Stoica, 2012). However, real world datacenters are far more complicated; direct (virtual) links between machines may exist to avoid latency, duplicate links may exist to tolerate failure, network speeds may vary widely for different machines and links, and complicated network structures may exist for a variety of reasons. To make things worse, some tasks may involve multiple datacenters around the globe, and the switch model simply cannot accurately capture the graph based network that connects all the data centers.
For inter-datacenter transmission, distributed machine learning tasks can generate huge amounts of traffic. Due to legal or cost reasons, some datasets cannot be gathered into a single datacenter for processing. Instead, several geographically distributed datacenters work together to train a single model, and exchange local updates frequently to ensure accuracy and convergence. Though the size of a single transmission may be small considering the network bandwidth, the repeated exchange blows up the volume of transmission and makes network traffic its bottleneck.
In order to solve these problems, a slightly different model of coflow scheduling was proposed by Jahanjou et al. (Jahanjou et al., 2017), which assumes that the underlying connection between machines is an arbitrary graph rather than a complete bipartite graph. Each node can be a machine, a datacenter or an exchange point (switch, router, etc.), and an edge between two nodes represents a physical link between the two Internet infrastructures. When some data needs to be transmitted from one node to another, it needs to be transmitted along edges. Unlike in the switch model where only one packet can be sent at each time slot, data for multiple jobs is allowed to transfer on the same link at the same time, or in other words, shared traffic on links is allowed. The total volume of data transmission on a link however is bounded by the link bandwidth111One major challenge in the switch model is the node-wise I/O speed constraint. In order to capture this in the graph model, we can replace every datacenter with a gadget of two nodes. The first node has exactly the same neighbors and edges that the original node for the datacenter has, plus links from and to the second node. The second node is only connected to the first node, and is the true source and destination for all demands involving this datacenter. By setting capacity on the links between these two nodes, we can enforce I/O limit for the whole datacenter like in the switch model.. Jahanjou et al. (Jahanjou et al., 2017) considered the model in which data has to travel along a single specified path. In addition to this model, we also consider the free path model which allows data to be split or merged at nodes to utilize the whole graph when transmitting the same piece of data as long as the capacity of each link is respected. This seems much more complicated in practice than a single path transmission, but modern distributed computation frameworks (You and Chowdhury, 2019) allow this kind of fine-grained control on network routing and transfer rate, which makes the model realistic. See Figure 1 for a brief illustration of the two models. The formal definitions come in Section 2.
1.1. Related Work
The idea of scheduling coflows was first introduced by Chowdhury and Stoica (Chowdhury and Stoica, 2012). Since then, it has been a hot topic in both the systems (Luo et al., 2016; Li et al., 2016; Chowdhury et al., 2014; Yu et al., 2016) and the theory (Qiu et al., 2015; Khuller and Purohit, 2016; Jahanjou et al., 2017; Ahmadi et al., 2017; Shafiee and Ghaderi, 2018) communities. Most theoretical research has focused on coflow scheduling in the switch model, where the communication graph is a complete, bipartite graph. Since this basic problem generalizes concurrent open shop scheduling and is thus NP-hard, the main results focus on the development of approximation algorithms. Over the last three years, a series of papers (Qiu et al., 2015; Khuller and Purohit, 2016; Ahmadi et al., 2017) have brought down the approximation factor from 67/3 to 5 for coflow scheduling with arbitrary release times and to 4 for the case without release times (Ahmadi et al., 2017; Shafiee and Ghaderi, 2018) 2224 is still the best known bound.. We would like to note that a very simple primal-dual framework is proposed by Ahmadi et al. (Ahmadi et al., 2017), and this yields a very practical combinatorial algorithm for the problem without requiring the need to solve an LP (as in (Shafiee and Ghaderi, 2018)). Furthermore, in recent work, a system called Sincronia (Agarwal et al., 2018) was also developed based on the primal-dual method. It improves upon state-of-the-art methods and gives practical and near-optimal solutions in real testbeds.
One natural extension is to take the graph structure into consideration. Zhao et al. (Zhao et al., 2015) consider coflow scheduling over arbitrary graphs and attempt to jointly optimize routing and scheduling. They give a heuristic based on shortest job first, and use the idle slots to schedule flows from the longest job. Jahanjou et al. (Jahanjou et al., 2017) studied two variants of coflow scheduling over general graphs, namely, when the path for a flow is given or if the path is unspecified. In both cases, the transmission rate may change over time, but each flow can only take a single path, whether given to or chosen by the fractional routing algorithm. In the first case, Jahanjou et al. (Jahanjou et al., 2017) develop the first constant approximation algorithm (approximation ratio ) and in the second case they develop an approximation algorithm ( is the number of nodes in the graph), matching the lower bound given by Chuzhoy et al. (Chuzhoy et al., 2007).
Since preemption often incurs large overheads, some recent work (Yu et al., 2016) has tackled the problem of non-preemptive coflow scheduling. Mao, Aggarwal, and Chiang (Mao et al., 2018) consider the non-preemptive coflow scheduling problem with stochastic sizes and give an algorithm with an approximation factor of , where is an upper bound of squared coefficient of variation of processing times. This simplifies to a approximation for non-stochastic cases.
1.2. Our Contributions
The main result of this paper is a unified, tight -approximation algorithm for the coflow scheduling problem in both the single path model and the free path model when all release times and demands are polynomially sized, and a -approximation when the release times and demands can be super-polynomial. This improves upon the 17.6 approximation given by Jahanjou et al. (Jahanjou et al., 2017) for the single path model, and is the first approximation algorithm for the free path model (introduced by You and Chowdhury (You and Chowdhury, 2019)).
We also evaluated our algorithm using two WAN topologies (Microsoft’s SWAN (Hong et al., 2013) and Google’s G-Scale (Jain et al., 2013)) on four different workloads (BigBench (Intel Hadoop, 2016), TPC-DS (Nambiar and Poess, 2006), TPC-H (Poess and Floyd, 2000), and Facebook (FB) (Facebook, 2014; Chowdhury, 2015)) and compared with state-of-the-art for both models(Jahanjou et al., 2017; You and Chowdhury, 2019). For the single path model, we significantly improved over Jahanjou et al. (Jahanjou et al., 2017). For the free path model, we are close to what Terra (You and Chowdhury, 2019) gets, but have the extra capability of dealing with weights. Across all variants and models, we have shown that taking the LP solution directly is an effective heuristic in practice.
1.3. Paper Organization
In Section 2 we give a formal definition of the two models for coflow scheduling. In Section 3 we give a general linear program that deals with both models. We give the additional flow constraints for the two models in Section 3.1. In Section 4.1 we describe the main algorithm and present the analysis in Section 4.2. We prove both models to be NP-hard in Section 5. In Section 6, we show experimental results by comparing our algorithms to some baseline algorithms. We conclude in Section 7 with some new directions to work on.
2. Model and Problem Definition
We now formally define the models of coflow scheduling that we consider in this paper. Let be a directed graph that represents the data center network and be a function that denotes the capacity (bandwidth) available on each edge of the network. Let denote the set of coflows. A coflow has weight that denotes its priority and consists of individual flows, i.e., where denotes a flow from source node to sink with demand . We assume that time is discrete and data transfer is instantaneous, i.e., it takes negligible time for data to cover multiple hops of edges as network delay is low compared to the time to transmit large chunks of data. A coflow is said to be completed at the earliest time such that for each flow , units of data have been transferred from source to sink . Our goal is to find a schedule that routes all the requisite flows (i.e. at any time, what fraction of a certain flow is transmitted and along which path/paths) subject to the edge bandwidth constraints so that the total weighted completion time of the coflows is minimized. Figure 2 gives an example of an instance of the coflow scheduling problem over a simple network.
We consider two different transmission models, based on whether a flow has restrictions as to how the data is transmitted. In the single path model, each flow specifies a path from source to sink so that the flow can only be routed along that path. This is exactly the “circuit-based coflows with paths given” model studied by Jahanjou et al. (Jahanjou et al., 2017).
In the free path model, we can freely select the routing we desire for any flow . In any time slot, data transmission occurs as a feasible multi-commodity flow so that both flow-conservation and edge bandwidth constraints are satisfied. Thus, we can split any flow along multiple paths from its source to destination. This model was proposed in Terra (You and Chowdhury, 2019). Since the shortest paths of different flows can share edges and cause congestion, the free path model offers the flexibility of rerouting flows along less congested paths. In addition, modern internet infrastructures support using multiple paths together to get a higher overall speed (known as link aggregation), which is captured in the free path model as network flow.
In fact, both models are handled uniformly by the same framework, and the only difference is the set of flow constraints that describe what are considered feasible transmissions. It is also possible to handle other kinds of transmissions, like an intermediate case between single path and free path: several paths are given, and we can use them together and decide at what rate we are transmitting along each path. Figures 3 and 4 show the optimal solutions for the example coflow problem in the single path and free path models respectively.
3. Linear Programming Relaxation
We use a time-indexed linear program to model this problem. Let denote an upper bound on the total time required to schedule all the coflows. Note that might be super-polynomial if the release times or coflow sizes are large. However, there is a standard technique that achieves polynomial size at the cost of a factor on approximation ratio. We will assume to be polynomial in the main paper, and present the fix for super-polynomial in Appendix A.
Let time be slotted and time slot cover the interval of time . For a given flow and a time slot , we introduce the variable to indicate the fraction of flow that is scheduled at time . For each coflow , we introduce variables to indicate if all the flows have been completely scheduled by time . Finally, we introduce a variable that models the completion time of coflow .
To make the linear program compatible with both single path model and free path model, we exclude the flow constraints and edge bandwidth constraints for now and delay them to Section 3.1.
[TABLE]
Constraint (1) certifies that each flow is fully scheduled. Constraint (2) ensures that coflow is considered completed at time only if all flows have been fully scheduled by time . In Proposition 3.1, we show that Constraint (3) enforces a valid lower bound on the completion time of coflow . Finally, Constraint (4) ensures that no flow is scheduled before it has been released. Note this is not a typical LP relaxation, since any fractional solution is valid. The main relaxation is around the completion time, since representing the exact completion time of job is beyond the capability of a linear program.
Proposition 3.1.
The completion time of a coflow can be lower bounded by where denotes the fraction of coflow that has been completed by (the end of) time slot .
Proof.
Conventionally, in time-indexed linear programming relaxations, the completion time of a job is lower bounded by the fractional completion time in the schedule, or . In our setting, this corresponds to the constraint where denotes the fraction of coflow that is scheduled during time slot . The desired constraint in Eq (3) is exactly the same constraint rearranged in a format that is more convenient for analysis.
[TABLE]
∎
3.1. Model-specific Constraints
3.1.1. Single Path Model
In the single path model, a flow can only be routed along a specified path . Thus, we do not need to make any routing decisions in the linear program and only need to ensure that edge bandwidths are respected.
[TABLE]
Constraint (6) enforces that the total flow scheduled through edge at any time slot does not exceed the edge bandwidth. Constraints (1)-(6) thus form the complete linear programming relaxation for coflow scheduling in the single path model.
3.1.2. Free Path Model
In the free path model, the path for flow is not specified. In fact, data can split and merge at vertices to utilize all possible capacity. We use variable to denote the fraction of flow transmitted through edge in time slot . Recall that we use to denote the total fraction of flow that is transmitted in time slot . () represents the set of edges that comes in (out of) vertex . Here are the flow conservation constraints we need.
[TABLE]
Constraints (7) and (8) enforce that the total fraction of flow satisfied at time over all the paths is exactly . Constraints (9) ensure flow conservation at all nodes other than source and sink. Constraints (10) guarantee that all edge bandwidths are satisfied at all time steps. Constraints (1)-(5) and (7)-(10) thus form the complete linear programming relaxation for coflow scheduling in the free path model.
Let denote the completion time of coflow in an optimal solution of the LP relaxation, and let denote the completion time of coflow in the corresponding optimal integral solution. Thus, for both the models, we have
[TABLE]
4. Approximation Algorithms
Let denote the fraction of flow that is scheduled at time step in an optimal solution to the above LP. The LP constraints guarantee that this yields a feasible schedule to the coflow scheduling problem (in both the single path as well as the free path models). However, since the completion time of a coflow is defined as the earliest time such that all flows have been completely scheduled, the true completion time of coflow obtained in this scheduled is given by
[TABLE]
Unfortunately, this completion time can be much greater than the completion time variable in the optimal LP solution , and thus the obtained schedule is not a constant-approximate coflow schedule. For instance, consider a coflow with only one flow () and let the optimal LP solution set its schedule as follows , and . Now, the completion time variable in the optimal LP solution is . However, true completion time of the coflow in such a schedule is .
To overcome the obstacle above, we propose the following algorithm called Stretch (see Section 4.1) that modifies the schedule obtained by the linear program so that the completion time of each coflow in the modified schedule can be compared with the completion time variable of the corresponding coflow in an optimal LP solution. The schedule “stretching” idea (also called ‘slow-motion’) used in our algorithm has been used before successfully in other scheduling contexts (Im et al., 2014; Queyranne and Sviridenko, 2002; Schulz and Skutella, 1997).
4.1. Stretch Algorithm
- (1)
Solve the linear program in Section 3 and obtain a fractional optimal solution. 2. (2)
Let be drawn randomly according to the p.d.f . We can verify that this is indeed a valid probability distribution. 3. (3)
Stretch the LP schedule by . This means that we schedule everything exactly as per the LP solution - but whatever LP schedules in the interval , we will schedule in the interval . 4. (4)
Once units of flow have been scheduled, leave the remaining slots for empty.
Figure 5 illustrates the key ideas of the algorithm. To help understand this algorithm, start with the simple case where we have a fixed , in other words stretch the time axis by a factor of . Intuitively, we move everything at time slot and to both time slots and . What used to be transmitted at time will be transmitted no later than time . Consider any flow and let denote the earliest time by which the LP has scheduled at least fraction of the flow. Then, it is easy to verify that the flow is completely scheduled by time .
Now we consider a general and prove that this algorithm does output a feasible schedule. Due to fractional , it might be the case that some flow of LP variable in integral interval becomes , a fractional interval. In this case, for a time slot , or a interval after stretching, we just add .
The only flows that might be scheduled in time slot are those scheduled in time slot and before stretching, or flows and flows . (The two time slots might be the same. If so, feasibility is automatically met. Otherwise, we have .) For all flows at time before stretching, the factor we multiplied with is . For all flows at time before stretching, the factor we use to multiply with is . Note . In fact, the schedule at time can be viewed as a weighted average of the schedule at time and (if is a integer, then the schedule will be exactly what it used to be at time ), the first with weight and the second with weight . The nature of network flow ensures that the weighted sum of two feasible flows is a feasible flow.
Another fact that needs proof is that every flow is finished. This is guaranteed since schedules are stretched, and we only leave the remaining slots empty for if units of flow have been scheduled, or in other words, all the demand for this flow has been scheduled.
4.2. Analysis
Recall that denotes the completion time of coflow in the optimal LP solution. While we consider that time is slotted in the LP formulation and time slot covers the interval of time , at this stage it is more convenient to work with continuous time rather than discrete time. For any continuous time , define to be the fraction of coflow that has been scheduled in the LP solution by time . We define by assuming that the flow is scheduled at an uniform rate in every time slot. Formally, we have
[TABLE]
The LP constraints (3) guarantee that for any coflow , we have . We can now lower-bound the LP completion time by replacing the above summation by an integral.
Lemma 4.1.
* where is defined as per Eq. (13).*
Proof.
By definition of , we have the following.
[TABLE]
where the last inequality follows from Constraint (3). ∎
For any , define to be the earliest time such that fraction of the coflow has been scheduled in the LP solution, i.e., in other words its the smallest such that . Note that by time , fraction of every flow has been scheduled by the LP.
Proposition 4.2.
**
Proof.
[TABLE]
∎
Finally, we are ready to bound the completion time of coflow in the stretched schedule (denoted as ). For any fixed , since we stretch the schedule by a factor of , we have . Notice the ceiling function in the bound 333All flows were completed by at least fraction by time . So in the stretched schedule, all those flows must be completed by time . The ceiling is necessary since may be fractional (i.e. occur in the middle of a time slot). Since is drawn randomly from a distribution, the following lemma bounds the expected completion time of coflow in the stretched schedule.
Lemma 4.3.
The expected completion time of any coflow in the stretched schedule is bounded by .
Proof.
[TABLE]
∎
Theorem 4.4 thus follows from the linearity of expectation.
Theorem 4.4.
There is a randomized 2-approximation algorithm for coflow scheduling in networks in both the single path and free path models when all release times and coflow sizes are polynomially sized.
For the case where the total time we need to schedule all coflows is super-polynomial, we use the standard trick of geometric series time intervals, and claim the following theorem. Proof comes in Appendix A.
Theorem 4.5.
For any , there is a randomized -approximation algorithm for coflow scheduling in networks in both the single path and the free path models (with possibly super polynomial release times and demands).
5. Hardness of Approximation
We claim the following theorem:
Theorem 5.1.
For the coflow scheduling problem, in both the single path and the free path model, it is NP-hard to obtain a approximation, for any .
Proof.
We prove it by a reduction from concurrent open-shop problem (proved NP-hard to approximate within a factor better than (Bansal and Khot, 2010; Sachdeva and Saket, 2013)). The definition of concurrent open shop problem is as follows: there are machines and jobs, each job need to be processed on machine for time non-preemptively. We would like to minimize the total weighted completion time. Unlike the open shop problem, in the concurrent open shop problem a job can be processed on more than one machine at the same time.
Given a concurrent open-shop problem instance with machines, we construct an instance of the coflow scheduling problem as follows. For every machine , we have two nodes and , and an edge of unit bandwidth from to . Notice the graph has different components, between each pair , there is only one path from to . Thus this construction works for both the single path model and the free path model. We will not distinguish the models in the following proof.
For a certain job with demands in the concurrent open shop instance, we add a coflow with demand of from to . Weights are directly taken from the concurrent open shop problem instance. Suppose we get a solution for this coflow scheduling instance, we can get a solution of no larger cost for the concurrent open shop instance as follows. If we have a flow for job on edge of size at time , then we schedule a fraction of for job on machine at time . Suppose a flow is finished at time in the coflow scheduling problem, the corresponding concurrent open shop problem for job and machine is also finished at time . Similarly, the finishing time of coflow and concurrent open shop job are the same. However, the solution we get is fractional, and might be preemptive (we might pause a job and resume it later).
Now we prove that we can modify this solution to get a non-preemptive integral solution without raising the total weighted completion time. For each machine , consider all completion times . Sort them in non-decreasing order , and we can safely reschedule these demand in the order of , and get new completion times while not raising any completion time. We know all demand of job on machine has been finished by , so , similarly all demands of job and have been finished by , and . We can continue and get . Thus the total weighted completion time for this integral solution would be upper bounded by the cost for the coflow scheduling instance.
[TABLE]
For the other direction, for a certain solution of a concurrent open-shop problem, if task of job is scheduled from time to time , we make the flow take up all bandwidth of edge from time to time . Then flow is finished the same time when task of job is finished. Since every task is finished the same time before and after reduction, completion times and the objective weighted completion time stays the same for the coflow scheduling problem.
In conclusion, for a solution of concurrent open-shop problem with weighted completion time , we can construct a solution for coflow scheduling problem of the same weighted completion time . For a solution of coflow scheduling problem with weighted completion time , we can construct a solution for the original concurrent open-shop problem, with cost at most . Since concurrent open-shop problem is NP-hard to get a approximation, we know it is also NP-hard to approximate coflow scheduling problem to a factor of , for both single path model and free path model. ∎
6. Experiments
We evaluated the Stretch Algorithm on 2 topologies and 4 benchmarks/industrial workloads. Experiments were run on a machine with dual Intel(R) Xeon(R) CPU E5-2430, and 64GB of RAM, and using Gurobi (Gurobi Optimization, LLC, 2018) as the LP solver. We first discuss the experimental set up and then in Section 6.2 discuss what evaluation we performed.
WAN topology: We consider the following graph topologies.
- (1)
Swan (Hong et al., 2013): Microsoft’s inter-datacenter WAN with 5 datacenters and 7 inter-datacenter links. We calculate link bandwidth using the setup described by Hong et al.(Hong et al., 2013). 2. (2)
G-Scale (Jain et al., 2013): Google’s inter-datacenter WAN with 12 datacenters and 19 inter-datacenter links.
Workloads: We use the following mix of jobs from public benchmarks - TPC-DS (Nambiar and Poess, 2006), TPC-H (Poess and Floyd, 2000), and BigBench (Intel Hadoop, 2016) - and from Facebook (FB) production traces (Facebook, 2014; Chowdhury, 2015). We follow (You and Chowdhury, 2019) to set up the benchmarks: for a certain workload, jobs are randomly chosen and since they do not have a release time, we assign a release time similar to that in production traces. Each job lasts from a few minutes to dozens of minutes. Each benchmark experiment has 200 jobs. We randomly assign these jobs to nodes in the datacenter, and the demand will be between the corresponding nodes. Since weights are not available, we assign weights that are uniformly chosen from the interval between and .
6.1. Implementation Details
In this subsection we discuss some details related to the implementation.
Time Index: There is a trade-off in selecting the size of a time slot. If the length of a time slot is shorter, we get more accurate answers, but need to solve a larger LP. On the contrary, if we make each time slot longer, the amount of computational resources need is greatly reduced, but the quality of the solution suffers. In all our experiments, we considered time slots of length seconds as this led to tractable LP relaxations.
Rounding: Algorithm Stretch is meant for easy theoretical analysis, and is not a sophisticated rounding method; we are not trying to schedule later flows in the slots that are idle. This can cause huge overhead in experiments. See Figure 5 for an illustration. In our implementation, we deal with this issue by moving the schedule of every time slot to an earlier idle slot if for all flows scheduled at , its release time is before .
To address the random sampling of , we sample times from the distribution mentioned in Section 4.1 to get the expected weighted completion time for Algorithm Stretch, and denote it with “Average ”. We also measure the best solution obtained over these random choices (denoted by “Best ”).
6.2. Baselines
LP-based Heuristic: In addition to algorithms with theoretically worst case guarantee, we also propose a heuristic that works well in practice. Recall in Section 4.1, we mentioned that the LP solution itself is a valid schedule. We can use this solution as a heuristic, for both the single path and free path models. Note the weighted completion time for this LP solution is NOT the same as the LP objective function, as explained in Section 4.1. This implies that the solution from the heuristic can be arbitrarily bad in the worst case. In practice, however, this proves to be a very effective algorithm that can be quite close to the lower bound we get from LP.
Jahanjou et al. (Single path model): Since path information is not available in the datasets, we randomly generate one for each flow. For a source sink pair , we randomly select one of the shortest paths as the path for flow . For this model, we compare our algorithm with the algorithm presented by Jahanjou et al. (Jahanjou et al., 2017). Here is a brief description of their approach. First write an LP using geometric time intervals, then schedule each job according to the interval its point (the time when fraction of this job is finished) belongs to. A common reason for geometric time intervals is to avoid having a super-polynomial time horizon (a practical reason is to make the LP smaller), and a time series of is chosen where is close to 0. The closer is to 0, the better the approximation ratio can be. However, in Jahanjou et al.’s algorithm, the rounding step has a dependency on . To optimize the approximation ratio, is set to . Our algorithm, on the contrary, is time slot based, and can be turned into a geometric series of time intervals by losing a factor of . In experiments, we include both the case of and the case of for completeness.
Terra (Free path model): For the flow-based model, we are comparing to the offline algorithm in Terra (You and Chowdhury, 2019). This algorithm only works for the unweighted case. It calculates the time for each single coflow to finish individually, and then schedule with SRTF (shortest remaining time first). Instead of one large LP like all other algorithms compared here, this algorithm solves a large number of LPs, twice the number of coflow jobs. Terra can work with very fine grained time, to the order of milliseconds (and does not need time to be slotted). Since there is no previous work on weighted case, we compare the weighted case with the LP solution and our heuristic directly based on time indexed LP.
6.3. Experimental Results
Impact of : See Figure 6 and Figure 7. When is , we take the LP solution directly (this is exactly the LP-based heuristic). Across all experiments, this seems the best choice of . The best sampled and the average case are pretty close, indicating the performance does not change much across different .
Impact of : To study the effect of the size of the time interval, we measure the LP objective and the schedule obtained by the LP-based heuristic as we vary in Figure 8. As increases, the size of the linear program will drop, making it faster to solve. On the other hand, the quality of solution drops, as we will not start a job until the whole current interval is after its release time, and will not consider a job finished until the interval its completion time belongs to ends. Thus a proper selection of may depend on the available computational resources for solving the LP.
Single Path Model: Figures 9 and 10 compare the performance of our algorithms with that of Jahanjou et al. (Jahanjou et al., 2017) on all the benchmarks and topologies. Across all the experiments, we observe that our algorithms perform significantly better.
Free Path Model: See Figure 11 and Figure 12 for comparisons with the algorithm in Terra(You and Chowdhury, 2019). Since Terra only handles uniform coflow weights, we set all weights to be unit for these experiments. Surprisingly, we observe that Terra performs slightly better than even the LP objective itself. This disparity arises as the LP relies on time slots of 50 seconds while Terra deals with time slots of much finer granularity. For the weighted case, we are not aware of previous work, and only compare to LP solution in Figures 6 and 7.
7. Conclusion
In this paper we developed an efficient approximation algorithm for the coflow scheduling problem in general graph topologies. This algorithm is shown to be practical and one that delivers extremely high quality solutions. The new insight was to write a time indexed LP formulation and to convert it using the idea of stretching the schedule.
The next major challenge is developing online methods for coflow shceduling to minimize weighted flow time. Prior work (Khuller et al., 2018) deals with the problem of minimizing weighted completion time by making use of offline approximation algorithms. However, the problem of minimizing weighted flow time is considerably more challenging. The technical difference is that flow time is defined as where is the completion time of a job, and the release time. Optimizing flow time non-preemptively even on a single machine (a different model) is a notoriously difficult problem with some recent progress (Batra et al., 2018; Feige et al., 2019).
Appendix A Sketch of generalization to super-polynomial time span
Geometric series time interval is defined as follows. For an , let . We define the -th interval as . Since is at most the sum of all processing time and all release time, we know the number of intervals is polynomial.
We change the LP as follows. We abuse notation a bit and allow to represent the set when there is no confusion. We replace all accurance of with in Section 3, modify Equation (4) and Equation (3) to accommodate for release time, and get the following linear program.
[TABLE]
For the model specific part of linear program, we only need to change the capacity constraints: replace Equation (6) for single path model to get
[TABLE]
and Equation (10) for free path model to get
[TABLE]
Similar to Proposition 3.1, we prove Constraint (16) is a good lower bound.
Proposition A.1.
The completion time of a coflow can be lower bounded by where denotes the fraction of coflow that has been completed by (the end of) time interval .
Proof.
If a job completes in the interval , then its finishing time is at least .
[TABLE]
∎
After getting a solution, we would schedule coflows into intervals instead of into time slots. Inside each time interval, we just schedule each flow at uniform speed, and break into actual time slots. Similar to Section 4.1, we can prove that this solution is feasible.
A.1. Analysis
Recall that denotes the completion time of the coflow in the optimal LP solution. For any continuous time , define to be the fraction of coflow that has been scheduled in the LP solution by time . Note is for time interval , but is for original time slots. Flows are scheduled at an uniform rate in every time interval. Use to denote the smallest such that , we have
[TABLE]
Similar to Lemma 4.1, we state and prove the following lemma.
Lemma A.2.
* where is defined as per Eq. (24).*
Proof.
From constraints (16), we have that
[TABLE]
∎
For any , define to be the earliest time such that fraction of the coflow has been scheduled in the LP solution, i.e., in other words its the smallest such that . Note that by time , fraction of every flow has been scheduled by the LP.
Proposition A.3.
.
Proof.
[TABLE]
∎
Finally, we are ready to bound the completion time of coflow in the stretched schedule. For any fixed , since we stretch the schedule by a factor of , it is easy to verify444All flows were completed by at least fraction by time . So in the stretched schedule, all those flows must be completed by time . that . Since is drawn randomly from a distribution, the following lemma bounds the expected completion time of coflow in the stretched schedule.
Lemma A.4.
The expected completion time of any coflow in the stretched schedule is bounded by .
Proof.
[TABLE]
∎
Theorem 4.5 thus follows from the linearity of expectation.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1(1)
- 2Agarwal et al . (2018) Saksham Agarwal, Shijin Rajakrishnan, Akshay Narayan, Rachit Agarwal, David Shmoys, and Amin Vahdat. 2018. Sincronia: near-optimal network design for coflows. In SIGCOMM . ACM, 16–29.
- 3Ahmadi et al . (2017) Saba Ahmadi, Samir Khuller, Manish Purohit, and Sheng Yang. 2017. On scheduling coflows. In IPCO . Springer, 13–24.
- 4Bansal and Khot (2010) Nikhil Bansal and Subhash Khot. 2010. Inapproximability of hypergraph vertex cover and applications to scheduling problems. In ICALP . Springer, 250–261.
- 5Batra et al . (2018) Jatin Batra, Amit Kumar, and Naveen Garg. 2018. Constant Factor Approximation Algorithm for Weighted Flow Time on a Single Machine in Pseudo-polynomial time. In FOCS . IEEE.
- 6Chowdhury (2015) Mosharaf Chowdhury. 2015. Coflow Benchmark Based on Facebook Traces. Retrieved April 22, 2019 from https://github.com/coflow/coflow-benchmark
- 7Chowdhury and Stoica (2012) Mosharaf Chowdhury and Ion Stoica. 2012. Coflow: A networking abstraction for cluster applications. In Hot Nets . ACM, 31–36.
- 8Chowdhury et al . (2014) Mosharaf Chowdhury, Yuan Zhong, and Ion Stoica. 2014. Efficient Coflow Scheduling with Varys. In SIGCOMM . ACM, New York, NY, USA, 443–454. https://doi.org/10.1145/2619239.2626315 · doi ↗
