The Communication Complexity of Optimization
Santosh S. Vempala, Ruosong Wang, David P. Woodruff

TL;DR
This paper investigates the communication complexity of distributed optimization problems, providing tight bounds and demonstrating the limitations of sampling and sketching techniques, thus motivating the development of new optimization methods.
Contribution
It offers the first tight bounds for communication complexity in distributed linear systems and optimization, highlighting the limitations of existing techniques and proposing new bounds for various problem settings.
Findings
Communication complexity for linear systems is $ ilde{ heta}(d^2L + sd)$ (deterministic) and $ ilde{ heta}(sd^2L)$ (randomized).
Sampling and sketching are suboptimal for distributed optimization in dependence on $d$ and $ ext{epsilon}$.
New bounds for linear programming communication complexity, especially when coefficients are randomly perturbed.
Abstract
We consider the communication complexity of a number of distributed optimization problems. We start with the problem of solving a linear system. Suppose there is a coordinator together with servers , the -th of which holds a subset of constraints of a linear system in variables, and the coordinator would like to output for which for . We assume each coefficient of each constraint is specified using bits. We first resolve the randomized and deterministic communication complexity in the point-to-point model of communication, showing it is and , respectively. We obtain similar results for the blackboard model. When there is no solution to the linear system, a natural alternative is to find the solution minimizing the…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
The Communication Complexity of Optimization††thanks: Santosh S. Vempala was supported in part by NSF awards CCF-1717349 and DMS-1839323. Ruosong Wang and David P. Woodruff were supported in part by Office of Naval Research (ONR) grant N00014-18-1-2562. Part of this work was done while the authors were visiting the Simons Institute for the Theory of Computing.
Santosh S. Vempala
Georgia Tech
Ruosong Wang
Carnegie Mellon University
David P. Woodruff
Carnegie Mellon University
We consider the communication complexity of a number of distributed optimization problems. We start with the problem of solving a linear system. Suppose there is a coordinator together with servers , the -th of which holds a subset of constraints of a linear system in variables, and the coordinator would like to output an for which for . We assume each coefficient of each constraint is specified using bits. We first resolve the randomized and deterministic communication complexity in the point-to-point model of communication, showing it is and , respectively. We obtain similar results for the blackboard communication model. As a result of independent interest, we show the probability a random matrix with integer entries in is invertible is , whereas previously only was known.
When there is no solution to the linear system, a natural alternative is to find the solution minimizing the loss, which is the regression problem. While this problem has been studied, we give improved upper or lower bounds for every value of . One takeaway message is that sampling and sketching techniques, which are commonly used in earlier work on distributed optimization, are neither optimal in the dependence on nor on the dependence on the approximation , thus motivating new techniques from optimization to solve these problems.
Towards this end, we consider the communication complexity of optimization tasks which generalize linear systems, such as linear, semidefinite, and convex programming. For linear programming, we first resolve the communication complexity when is constant, showing it is in the point-to-point model. For general and in the point-to-point model, we show an upper bound and an lower bound. In fact, we show if one perturbs the coefficients randomly by numbers as small as , then the upper bound is , and so this bound holds for almost all linear programs. Our study motivates understanding the bit complexity of linear programming, which is related to the running time in the unit cost RAM model with words of bits, and we give the fastest known algorithms for linear programming in this model.
1 Introduction
Large-scale optimization problems often cannot fit into a single machine, and so they are distributed across a number of machines. That is, each of servers may hold a subset of constraints that it is given locally as input, and the goal of the servers is to communicate with each other to find a solution satisfying all constraints. Since communication is often a bottleneck in distributed computation, the goal of the servers is to communicate as little as possible.
There are several different standard communication models, including the point-to-point model and the blackboard model. In the point-to-point model, each pair of servers can talk directly with each other. This is often more conveniently modeled by looking at the coordinator model, for which there is an extra server called the coordinator, and all communication must pass through the coordinator. This is easily seen to be equivalent, from a total communication perspective, to the point-to-point model up to a factor of , for forwarding messages from server to server , and a term of per message to indicate which server the message should be forwarded to. Another model of computation is the blackboard model, in which there is a shared broadcast channel among all the servers. When a server sends a message, it is visible to each of the other servers and determines who speaks next, based upon an agreed upon protocol. We mostly consider randomized communication, in which for every input, we require the coordinator to output the solution to the optimization problem with high probability. For linear systems we also consider deterministic communication complexity.
A number of recent works in the theory community have looked at studying specific optimization problems in such communication models, such as principal component analysis [43, 47, 17] and kernel [11] and robust variants [71, 28], computing higher correlations [43], regression [67, 28] and sparse regression [19], estimating the mean of a Gaussian [73, 33, 19], database problems [36, 70], clustering [21], statistical [68], graph problems [54, 68] and many, many more.
There are also a large number of distributed learning and optimization papers, for example [10, 73, 75, 1, 18, 48, 72, 23, 55, 31, 29, 30, 40, 60, 59, 74, 6, 42]. With a few exceptions, these works do not study general communication complexity, but rather consider specific classes of algorithms. Namely, a number of these works only allow gradient and Hessian computations in each round, and do not allow arbitrary communication. Another aspect of these works is that they typically do not count total bit complexity, but rather only count number of rounds, whereas we are interested in total communication. In a number of optimization problems, the bit complexity of storing a single number in an intermediate computation may be as large as storing the entire original optimization problem. It is therefore infeasible to transmit such a number. While one could round this number, the effect of rounding is often unclear, and could destroy the desired approximation guarantee. One exception to the above is the work of [65], which studies the problem in which there are two servers, each holding a convex function, who would like to find a solution so as to minimize the sum of the two functions. The upper bounds are in a different communication model than ours, where the functions are added together, while the lower bounds only apply to a restricted class of protocols.
Noticeably absent from previous work is the communication complexity of solving linear systems, which is a fundamental primitive in many optimization tasks. Formally, suppose there is a coordinator together with servers , the -th of which holds a subset of constraints of a -dimensional linear system, and the coordinator would like to output an for which for . We further assume each coefficient of each constraint is specified using bits. The first question we ask is the following.
Question 1.1**.**
What is the communication complexity of solving a linear system?
When there is no solution to the linear system, a natural alternative is to find the solution minimizing the loss, which is the regression problem , where for an -dimensional vector , is its norm.
In the distributed regression problem, each server has a matrix and a vector , and the coordinator would like to output an so that is approximately minimized, namely, that . Note that here is the matrix obtained by stacking the matrices on top of each other, where . Also, is the vector obtained by stacking the vectors on top of each other. We assume that each entry of and is an -bit integer, and we are interested in the randomized communication complexity of this problem.
While previous work [50, 67] has looked at the distributed regression problem, such work is based on two main ideas: sampling and sketching. Such techniques reduce a large optimization problem to a much smaller one, thereby allowing servers to send succinct synopses of their constraints in order to solve a global optimization problem.
Sampling and sketching are the key techniques of recent work on distributed low rank approximation [67, 43] and regression algorithms. A natural question, which will motivate our study of more complex optimization problems below, is whether other techniques in optimization can be used to obtain more communication-efficient algorithms for these problems.
Question 1.2**.**
Are there tractable optimization problems for which sampling and sketching techniques are suboptimal in terms of total communication?
To answer Question 1.2 it is useful to study optimization problems generalizing both linear systems and regression for certain values of . Towards this end, we consider the communication complexity of linear, semidefinite, and convex programming. Formally, in the linear programming problem, suppose there is a coordinator together with servers , the -th of which holds a subset of constraints of a -dimensional linear system, and the coordinator, who holds a vector , would like to output an for which is maximized subject to for . We further assume each coefficient of each constraint, as well as the objective function , is specified using bits.
Question 1.3**.**
What is the communication complexity of solving a linear program?
One could try to implement known linear programming algorithms as distributed protocols. The main challenge here is that known linear programming algorithms operate in the real RAM model of computation, meaning that basic arithmetic operations on real numbers can be performed in constant time. This is problematic in the distributed setting, since it might mean real numbers need to be communicated among the servers, resulting in protocols that could have infinite communication. Thus, controlling the bit complexity of the underlying algorithm is essential, and this motivates the study of linear programming algorithms in the unit cost RAM model of computation, meaning that a word is bits, and only basic arithmetic operations on words can be performed in constant time. Such a model is arguably more natural than the real RAM model. If one were to analyze the fastest linear programming algorithms in the unit cost RAM model, their time complexity would blow up by poly factors, since the intermediate computations require manipulating numbers that grow exponentially large or small. Surprisingly, we are not aware of any work that has addressed this question:
Question 1.4**.**
What is the best possible running time of an algorithm for linear programming in the unit cost RAM model?
As far as time complexity is concerned, it is not even known if linear programming is inherently more difficult than just solving a linear system. Indeed, a long line of work on interior point methods, with the current most recent work of [25], suggests that solving a linear program may not be substantially harder than solving a linear system. One could ask the same question for communication.
Question 1.5**.**
Is solving a linear program inherently harder than solving a linear system? What about just checking the feasibility of a linear program versus that of a linear system?
Recent Independent Work.
A recent independent work [7] also studies solving linear programs in the distributed setting, although their focus is to study the tradeoff between round complexity and communication complexity in low dimensions, while our focus is to study the communication complexity in arbitrary dimensions. Note, however, that we also provide nearly optimal bounds for constant dimensions for linear programming in both coordinator and blackboard models.
1.1 Our Contributions
We make progress on answering the above questions, with nearly tight bounds in many cases. For a function , we let and similarly define and .
1.1.1 Linear Systems
We begin with linear systems, for which we obtain a complete answer for both randomized and deterministic communication, in both coordinator and blackboard models of communication.
Theorem 1.6**.**
In the coordinator model, the randomized communication complexity of solving a linear system is , while the deterministic communication complexity is . In the blackboard model, both the randomized communication complexity and the deterministic communication complexity are .
Theorem 1.6 shows that randomization provably helps for solving linear systems. The theorem also shows that in the blackboard model the problem becomes substantially easier.
1.1.2 Approximate Linear Systems, i.e., Regression
We next study the regression problem in both the coordinator and blackboard models of communication. Finding a solution to a linear system is a special case of regression; indeed in the case that there is an for which we must return such an to achieve relative error in objective function value for regression. Consequently, our lower bounds for linear systems apply also to regression for any .
We first summarize our results in Table 1 and Table 2 for constant . We state our results primarily for randomized communication. However, in the case of regression, we also discuss deterministic communication complexity.
One of the main takeaway messages from Table 1 is that sampling-based approaches, namely those based upon the so-called Lewis weights [27], would require samples for regression when , and thus communication. Another way for solving regression for is via sketching, as done in [67], but then the communication is . Our method, which is deeply tied to linear programming, discussed more below, solves this problem in communication. Thus, this gives a new method, departing from sampling and sketching techniques, which achieves much better communication. Our method involves embedding into , and then using distributed algorithms for linear programming to solve regression.
As with linear systems, one takeaway message from the results in Table 2 is that the problems have significantly more communication-efficient upper bounds in the blackboard model than in the coordinator model. Indeed, here we obtain tight bounds for and regression, matching those that are known for the easier problem of linear systems.
We next describe our results for non-constant in both the coordinator model and the blackboard model. Here we focus on and , which illustrate several surprises.
One of the most interesting aspects of the results in Table 3 is our dependence on for regression, where for small enough relative to , we achieve a instead of a dependence. We note that all sampling [27] and sketching-based solutions [67] to regression have a dependence. Indeed, this dependence on comes from basic concentration inequalities. In contrast, our approach is based on preconditioned first-order methods described in more detail below.
A takeaway message from Table 4 is that our lower bound shows some dependence on is necessary both for and regression, provided is not too small. This shows that in the blackboard model, one cannot obtain the same upper bound for these problems as for linear systems, thereby separating their complexity from that of solving a linear system.
1.1.3 Linear Programming
One of our main technical ingredients is to recast regression problems as linear programming problems and develop the first communication-efficient solutions for distributed linear programming. Despite this problem being one of the most important problems that we know how to solve in polynomial time, we are not aware of any previous work considering its communication complexity in generality besides a recent independent work [7]
First, when the dimension is constant, we obtain nearly optimal upper and lower bounds.
Theorem 1.7**.**
In constant dimensions, the randomized communication complexity of linear programming is in the coordinator model and in the blackboard model. Our upper bounds allow the coordinator to output the solution vector , while the lower bounds hold already for testing if the linear program is feasible. Here the notation and the notation suppress only factors.
Despite the fact that we do not have tight upper bounds matching the lower bounds in the blackboard model, under the additional assumption that each constraint in the linear program is placed on a random server, we develop an algorithm with a matching communication cost. Partitioning constraints randomly across servers instead is common in distributed computation, see, e.g., [8]. Neverthelss we leave it as an open problem in the blackboard model in constant dimensions, to remove this requirement.
For solving a linear system in constant dimensions, the randomized communication complexity is in both models. Again, the notation suppresses only factors. Thus, in the coordinator model, we separate the communication complexity of these problems. We can also separate the complexities in the blackboard model if we instead look at the feasibility problem. Here instead of requiring the coordinator to output the solution vector, we just want to see if the linear system or linear program is feasible. We have the following theorem for this.
Theorem 1.8**.**
In constant dimensions, the randomized communication complexity of checking whether a system of linear equations is feasible is in either the coordinator or blackboard model of communication.
Combining Theorem 1.7 and Theorem 1.8, we see that for feasibility in the blackboard model, linear programming requires bits, while linear system feasibility takes bits, and thus we separate these problems in the blackboard model as well.
Returning to linear programs, we next consider the complexity in arbitrary dimensions.
Theorem 1.9**.**
In the coordinator model, the randomized communication complexity of exactly solving a linear program with constraints in dimension and all coefficients specified by -bit numbers is . Moreover it is lower bounded by . Here the upper and lower bounds require the coordinator to output the solution vector .
The lower bound in Theorem 1.9 just follows from our lower bound for linear systems. The upper bound is based on an optimized distributed cutting-plane algorithm. We describe the idea below.
While the upper bound is , one can further improve it as follows. We show that if the coefficients of in the input to the linear program are perturbed independently by i.i.d. discrete Gaussians with variance as small as , then we can improve the upper bound for solving this perturbed problem to , where now the success probability of the algorithm is taken over both the randomness of the algorithm and the random input instance, which is formed by a random perturbation of a worst-case instance. Note that this is an improvement for sufficiently large . Our model coincides with the well-studied smooth complexity model of linear programming [61, 14, 62]. However, a major difference is that the variance of the perturbation needs to be at least inverse polynomial in their works, whereas we allow our variance to be as small as .
Theorem 1.10**.**
*In the smoothed complexity model with discrete Gaussians of variance , the communication complexity of exactly solving a linear program with constraints in dimension and all coefficients specified by -bit numbers, with probability at least over the input distribution and randomness of the protocol, is in the coordinator model. *
While our focus in this paper is on communication, our upper bounds also give a new technique for improving the time complexity in the unit cost RAM model of linear programming, where arithmetic operations on words of size can be performed in constant time. For this fundamental problem we obtain the fastest known algorithm even in the non-smoothed setting of linear programming.
Theorem 1.11**.**
The time complexity of solving an linear program with -bit coefficients is in the unit cost RAM model.
We note that this is for solving an LP exactly in the RAM model with words of size bits. The current fastest linear programming algorithms [45, 46, 25] state the bounds in terms of additive error , which incurs a multiplicative factor of at least to solve the problem exactly. Also such algorithms manipulate large numbers at intermediate points in the algorithm, which are at least bits, which could take time to perform a single operation on. It seems that transferring such results to the unit cost RAM model with bit words incurs time at least . This holds true even of the recent work [25], which focuses on the setting and does not improve the leading term. Even such a bit-complexity bound needs careful checking of the number of bits required as recent improvements use sophisticated inverse maintenance methods to save on the number of operations (an exercise that was carried out thoroughly for the Ellipsoid method in [34]).
1.1.4 Implications for Convex Optimization and Semidefinite Programming
Our upper bounds also extend to more general convex optimization problems. For these, we must modify the problem statement to finding an -additive approximation rather than the exact solution. We obtain the following upper bound for a convex program in .
Theorem 1.12**.**
The communication complexity of solving the convex optimization problem for convex sets , one per server, to within an additive error , i.e., finding a point s.t. and is .
If the objective function is not known to all servers, we incur an additional communication. For semidefinite programs with symmetric matrices and linear constraints this gives a bound of . Note that we can simply send all the constraints to one server in communication, so this is always an upper bound.
1.2 Our Techniques
1.2.1 Linear Systems
To solve linear systems in the distributed setting, the coordinator can go through the servers one by one. The coordinator and all servers maintain the same set of linearly independent linear equations. For each server , if there is a linear equation stored by that is linearly independent with linear equations in , then sends that linear equation to all other servers and adds that linear equation into . In the end, will be a maximal set of linearly independent equations, and thus the coordinator can simply solve the linear equations in . This protocol is deterministic and has communication complexity in the coordinator model and in the blackboard model, since at most linear equations will be added into the set .
In fact, the preceding protocol is optimal for deterministic protocols, even just for testing the feasibility of linear systems. To prove lower bounds, we first prove the following new theorem about random matrices which may be of independent interest.
Theorem 1.13** (Informal version of Theorem 3.1).**
Let be a matrix with i.i.d. random integer entries in . The probability that is invertible is .
The previous best known probability bound was only [63, 16]; we stress that the results of [16] are not sufficient 111We have verified this with Philip Matchett Wood, who is an author of [16]. The issue is that in their Corollary 1.2, they have an explicit constraint on the cardinality of the set , i.e., . In their Theorem 2.2, it is assumed that . Thus, as far as we are aware, there are no known results sufficient to prove our singularity probability bound. to prove our stronger bound with the extra factor of in the exponent, which is crucial for our lower bound.
With Theorem 1.13, in Lemma 3.3, we use the probabilistic method to construct a set of matrices with integral entries in , such that for any , , where is the -th standard basis vector.
Now consider any deterministic protocol for testing the feasibility of linear systems. Suppose the linear system on the -th server is for some , then the entire linear system is feasible if and only if . This is equivalent to the problem in which each server receives a binary string of length , and the goal is to test whether all strings are the same or not. In the coordinator model, a deterministic lower bound of for this problem can be proved using the symmetrization technique in [54, 69], which gives an optimal lower bound. An optimal deterministic lower bound can also be proved in the blackboard model. The formal analysis is given in Section 3.3.
For solving linear systems, an lower bound holds even for randomized algorithms in the coordinator model. When there is only a single server which holds a linear system for some , in order for the coordinator to know the solution , standard information-theoretic argument shows that bits of communication is necessary, which gives an lower bound. This idea is formalized in Section 3.4. A natural question is whether the upper bound is optimal for randomized protocols.
We first show that in order to test feasibility, it is possible to achieve a communication complexity of , which can be exponentially better than the bound for deterministic protocols. The idea is to use hashing. With randomness, the servers can first agree on a random prime number , and test the feasibility over the finite field . It suffices to have the prime number randomly generated from the range , and thus the factor in the communicataion complexity of deterministic protocols can be improved to . However, it is still unclear if solving linear systems in the coordinator model will require bits of communication for randomized protocols.
Quite surprisingly, we show that is not the optimal bound for randomized protocols, and the optimal bound is . In the deterministic protocol with communication complexity , most communication is wasted on synchronizing the set , which requires the servers to send linear equations to all other servers. In our new protocol, only the coordinator maintains the set . The issue now, however, is that the servers no longer know which linear equation they own is linearly independent with those equations in . On the other hand, each server can simply generate a random linear combination of all linear equations it owns. We can show that if a server does have a linear equation that is linearly independent with those in , with constant probability, the random linear combination is also linearly independent with those in , and thus the coordinator can add the random linear combination into . Notice that taking random linear combinations to preserve the rank of a matrix is a special case of dimensionality reduction or sketching, which comes up in a number of applications, see, for example compressed sensing [20, 13], data streams [4], and randomized numerical linear algebra [66]. Here though, a crucial difference is that we just need the fact that if a set of vectors is not contained in the span of another set of vectors , then a random linear combination of the vectors in is also not in the span of with high probability. This allows us to adaptively take as few linear combinations as possible to solve the linear system, enabling us to achieve much lower communication than would be possible by just sketching the linear systems at each server and non-adaptively combining them.
If we implement this protocol naïvely, then the communication complexity will be , since at most linear equations will be added into , and there is an communication complexity associated with each of them. Furthermore, even if a server does not have any linear equation that is linearly independent with , it still needs to send random linear combinations to the coordinator, which would require communication. To improve this further to , we can still use the hashing trick mentioned before. If a server generates a random linear combination, it can first test whether the linear combination is linearly independent with over the finite field , for a random prime chosen in . This will reduce the communication complexity to for each test. If the linear equation is indeed linearly independent with , then the server sends the original linear equation (without taking the residual modulo ) to the coordinator. Again the total communication complexity for sending the original linear equations is upper bounded by . Thus, the total communication complexity is upper bounded by . See Section 4.2 for the formal analysis.
By a reduction from the OR of copies of the two-server set-disjointness problem to solving linear systems, we can prove an extra lower bound, which holds even for testing feasibility of linear systems. Here the idea is to interpret vectors in as characteristic vectors of subsets of . One of the servers will fix the solution of the linear system to be a predefined vector . Each server has a single linear equation . By interpreting vectors as sets, implies the set represented by and are intersecting. Thus, the servers are actually solving the OR of copies of the two-server set-disjointness problem, which is known to have communication complexity [54, 68]. This lower bound is formally given in Section 3.4.
1.2.2 Linear Regression
For an regression instance , the optimal solution can be calculated using the normal equations, i.e., the optimal solution satisfies . This already gives a simple yet nearly optimal deterministic protocol for regression in the coordinator model: the coordinator calculates and using only bits of communication by collecting the covariance matrices from each server and summing them up. The communication complexity matches our lower bound for solving linear systems for deterministic protocols in the coordinator model. However, when implemented in the blackboard model, the communication complexity of this protocol is still . To improve this bound, we first show how to efficiently obtain approximations to leverage scores in both models. Our protocol is built upon the algorithm in [26], but implemented in a distributed manner. The resulting algorithm has communication complexity in the coordinator model but only communication complexity in the blackboard model. With approximate leverage scores, the coordinator can then sample rows of the matrix to obtain a subspace embeeding, at which point it will be easy to calculate a -approximate solution to the regression problem. The number of sampled rows can be further improved to using Sárlos’s argument [57] since solving regression does not necessarily require a full subspace embedding, which results in a protocol with communication complexity in the blackboard model. Full details can be found in Section 6.
One may wonder if the dependence on is necessary for solving regression in the blackboard model. In Section 5, we show that some dependence on is actually necessary. We show an lower bound whenever . The hardness follows from the fact that if the matrix satisfies for all , then the optimal solution is just the average of . Thus, if we can get sufficiently good approximation to the regression problem, then we can actually recover the sum of , at which point we can resort to known communication complexity lower bound in the blackboard model [54]. This argument will also give an lower bound for -approximate regression in the blackboard model, whenever . The formal analysis can be found in Section 5.
For regression, we can no longer use the normal equations. However, we can obtain approximations to Lewis weights by using approximations to leverage scores, as shown in [27]. With approximate Lewis weights of the matrix, the coordinator can then obtain a subspace embedding by sampling rows. This will give an upper bound for -approximate regression in the coordinator model, and an upper bound in the blackboard model. It is unclear if the number of sampled rows can be further reduced since there is no known version of Sárlos’s argument. A natural question is whether the dependence is optimal. We show that the dependence on can be further improved to , by using optimization techniques, or more specifically, first-order methods. Despite the fact that the objective function of regression is neither smooth nor strongly-convex, it is known that by using Nesterov’s Accelerated Gradient Descent and smoothing reductions [51], one can solve regression using only full gradient calculations. On the other hand, the complexity of first-order methods usually has dependences on various parameters of the input matrix , which can be unbounded in the worst case. Fortunately, recent developments in regression [32] show how to precondition the matrix by simply doing an Lewis weights sampling, and then rotating the matrix appropriately. By carefully combining this preconditioning procedure with Accelerated Gradient Descent, we obtain an algorithm for -approximate regression with communication complexity in the coordinator model, which shows it is indeed possible to improve the dependence for regression. A formal analysis is given in Section 7.
For general regression, if we still use Lewis weights sampling, then the number of sampled rows and thus the communication complexity will be . Even worse, when , Lewis weights sampling will require an unbounded number of samples. However, regression can be easily formulated as a linear program, which we show how to solve exactly in the distributed setting. Inspired by this approach, we further develop a general reduction from regression to linear programming. Our idea is to use the max-stability of exponential random variables [3] to embed into , write the optimization problem in as a linear program and then solve the problem using linear program solvers. However, such embeddings based on exponential random variables usually produce heavy-tailed random variables and makes the dilation bound hard to analyze. Here, since our goal is just to solve a linear regression problem, we only need the dilation bound for the optimal solution of the regression problem. The formal analysis in Section 8 shows that -approximate regression can be reduced to solving a linear program with variables, which implies a communication protocol for regression without the dependence.
1.2.3 Linear and Convex Programs
We adapt two different algorithms from the literature for efficient communication and implement them in the distributed setting. The first is Clarkson’s algorithm, which works by sampling constraints in each iteration and finds an optimal solution to this subset; the sampling weights are maintained implicitly. In each iteration the total communication is for gathering the constraints and an additional per round to send the solution to this subset of constraints to all servers. This solution is used to update the sampling weights. Clarkson’s algorithm has the nice guarantee that it needs only rounds with high probability. A careful examination of this algorithm shows that the bit complexity of the computation (not the communication) is dominated by checking whether a proposed solution satisfies all constraints, i.e., computing for a given . We show this can be done with time complexity in the unit cost RAM model and this is the leading term of the claimed time bound.
Notice that the term in the communication complexity of Clarkson’s algorithm comes from the fact that the protocol needs to send an optimal solution of a linear program with size for a total of times. However, when each server receives , all will do is to check whether satisfies the constraints stored on or not. Notice that here entries in the constraints have bit complexity , whereas the solution vector has bit complexity for each entry. Intuitively, for most linear programs, we don’t need such a high precision for the solution vector . This leads to the idea of smoothed analysis. We show that if the coefficients of in the input to the linear program are perturbed independently by i.i.d. discrete Gaussians with variance as small as , then we can improve the upper bound for solving this perturbed problem to . The reason here is that with Gaussian noise, we can round each entry of the solution vector to have bit complexity , which would suffice for verifying whether satisfies the constraints or not, for most linear programs. Full details regarding Clarkson’s algorithm and the smoothed analysis model can be found in Section 10.
One minor drawback of Clarkson’s algorithm is it has a dependence on . In constant dimensions, our lower bound in the blackboard model holds only when , in which case the communication complexity of Clarkson’s algorithm will be .
Under the additional assumption that each constraint in the linear program is placed on a random server, we develop an algorithm with communication complexity in the blackboard model. To achieve this goal, we modify Seidel’s classical algorithm and implement it in the distributed setting. Seidel’s algorithm benefits from the additional assumption from two aspects. On the one hand, Seidel’s classical algorithm needs to go through all the constraints in a random order, which can be easily achieved now since all constraints are placed on a random server. On the other hand, Seidel’s classical algorithm needs to make a recursive call each time it finds one of constraints that determines the optimal solution, and will make recursive calls in expectation. To implement Seidel’s algorithm in the distributed setting, each time we find one of the constraints that determines the optimal solution, the current server also needs to broadcast that constraint. Thus, naïvely we need to broadcast constraints during the execution, which would result in communication. Under the additional assumption, with good probability, the first server stores at least constraints. Since the first server does not need to make any recursive calls or broadcasts, the total number of recursive calls (and thus broadcasts) will be . The formal analysis is given in Section 12.
For convex programming, we have to use a more general algorithm. We use a refined version of the classical center-of-gravity method. The basic idea is to round violated constraints that are used as cutting planes to bits. We optimize over the ellipsoid method in the following two ways. First, we round the violated constraint sent in each iteration by locally maintaining an ellipsoid to ensure the rounding error does not affect the algorithm. Roughly speaking, each server maintains a well-rounded current feasible set, and the number of bits needed in each round is thus only . Secondly, we use the center of gravity method to make sure the volume is cut by a constant factor rather than a factor in each iteration, even when constraints are rounded. See Section 11 for the formal analysis.
2 Preliminaries
2.1 Notation
For matrices , we use to denote the matrix in whose first columns are the same as , the next columns are the same as , …, and the last columns are the same as .
For a matrix , we use to denote the subspace spanned by the columns of the matrix . For a set of vectors , we use to denote the subspace spanned by the vectors in . For a set of linear equations , we also to denote all linear combinations of linear equations in . We use to denote the -th column of and to denote the -th row of . We use to denote the Moore-Penrose inverse of . We use to denote the rank of over the real numbers and to denote the rank of over the finite field .
For a vector , we use to denote its norm. For two vectors and , we use to denote their inner product.
For matrices and , we say if and only if
[TABLE]
where refers to the Löwner partial ordering of matrices, i.e., if is positive semi-definite.
2.2 Models of Computation and Problem Settings
We study the distributed linear regression problem in two distributed models: the coordinator model (a.k.a. the message passing model) and the blackboard model. The coordinator model represents distributed computation systems with point-to-point communication, while the blackboard model represents those where messages can be broadcasted to each party.
In the coordinator model, there are servers , and one coordinator. These servers can directly send messages to the coordinator through a two-way private channel. The computation is in terms of rounds: at the beginning of each round, the coordinator sends a message to some of the servers, and then each of those servers that have been contacted by the coordinator sends a message back to the coordinator.
In the alternative blackboard model, the coordinator is simply a blackboard where the servers can share information; in other words, if one server sends a message to the coordinator/blackboard then the other servers can see this information without further communication. The order for the servers to send messages is decided by the contents of the blackboard.
For both models we measure the communication cost which is defined to be the total number of bits sent through the channels.
In the distributed linear system problem, there is a data matrix and a vector of observed values. All entries in and are integers between , where is the bit complexity. The matrix is distributed row-wise among the servers . More specifically, for each server , there is a matrix stored on , which is a subset of rows of . Here we assume is a partition of all rows in . The goal of the feasibility testing problem is to design a protocol, such that upon termination of the protocol, the coordinator reports whether the linear system is feasible or not. The goal of the linear system solving problem is to design a protocol, such that upon termination of the protocol, either the coordinator outputs a vector , such that , or the coordinator reports the linear system is infeasible. It can be seen that the linear system solving problem is strictly harder than the feasibility testing problem.
In the distributed linear regression problem, there is a data matrix and a vector of observed values, which is distributed in the same way as in the distributed linear system problem. The goal of the distributed regression problem is to design a protocol, such that upon termination of the protocol, the coordinator outputs a vector to minimize .
In the distributed linear programming problem, there is a matrix and a vector , which is distributed in the same way as in the distributed linear system problem. The goal of the feasibility testing problem is to design a protocol, such that upon termination of the protocol, the coordinator reports whether the linear program is feasible or not. In the linear programming solving problem, the goal is to design a protocol, such that upon termination of the protocol, the coordinator outputs a vector such that is satisfied. There can also be a vector which is known to all servers, and in this case the goal is to minimize (or maximize) under the constraint that .
2.3 Row Sampling Algorithms
Definition 2.1** ([26]).**
Given a matrix . The leverage score of a row is defined to be
[TABLE]
Given another matrix , the generalized leverage score of a row w.r.t. is defined to be
[TABLE]
Definition 2.2** ([27]).**
Given a matrix . The Lewis weights are the unique weights such that for each we have
[TABLE]
where is the diagonal matrix formed by putting on the diagonal.
Theorem 2.1** ( Matrix Concentration Bound, Lemma 4 in [26]).**
There exists an absolute constant such that for any matrix and any set of sampling values satisfying
[TABLE]
if we generate a matrix with rows, each chosen independently as the -th basis vector, times with probability , then with probability at least , for all vector ,
[TABLE]
Theorem 2.2** ( Matrix Concentration Bound, Theorem 7.1 in [27]).**
There exists an absolute constant such that for any matrix and any set of sampling values satisfying
[TABLE]
if we generate a matrix with rows, each chosen independently as the -th basis vector, times with probability , then with probability at least , for all vectors ,
[TABLE]
Here are the Lewis weights of the matrix .
3 Communication Complexity Lower Bound for Linear Systems
3.1 The Hard Instance
In this section, we construct a family of matrices, which will be used to prove a communication complexity lower bound in the subsequent section.
We first introduce generalized binomial distributions.
Definition 3.1**.**
For any , let be a random variable which takes or with probability , and [math] with probability . Let be a random variable with the same distribution as the sum of i.i.d. copies of . For simplicity we use and to denote and , respectively.
We need the following theorem on the singularity probability of discrete random matrices.
Theorem 3.1**.**
Let be a matrix whose entries are i.i.d. random variables with the same distribution as , for sufficiently large ,
[TABLE]
where is an absolute constant.
The proof of Theorem 3.1 closely follows previous approaches for bounding the singularity probability of random matrices (see, e.g., [41, 63, 64, 16].). For completeness, we include a proof of Theorem 3.1 in Section 13.
Lemma 3.2**.**
For any and sufficiently large , there exists a set of matrices with integral entries in for which and
For any , ; 2. 2.
For any such that , .
Proof.
We use the probabilisitic method to prove existence. We use to denote the set
[TABLE]
where is a vector whose entries are i.i.d. random variables with the same distribution as and is the constant in Theorem 3.1.
Consider a random matrix whose entries are i.i.d. random variables with the same distribution as , we have
[TABLE]
since otherwise, if we use to denote a vector whose entries are i.i.d. random variables with the same distribution as , we have
[TABLE]
which violates Theorem 3.1.
For any fixed , consider a random matrix whose entries are i.i.d. random variables with the same distribution as . We have,
[TABLE]
which follows from the definition of and the independence of columns of .
Now we construct a multiset of matrices chosen with replacement, each of dimension and with i.i.d. entries having the same distribution as . By (1) and linearity of expectation, we have
[TABLE]
We use to denote the even that
[TABLE]
which holds with probability at least by using Markov’s inequality.
We use to denote the event that
[TABLE]
Using a union bound and (2), the probability that holds is at least
[TABLE]
Thus by a union bound, the probability that both and hold is strictly larger than zero, which implies there exists a set such that and hold simultaneously. Now we consider . Since holds, we have . implies that all elements in are distinct, and furthermore for any such that , we have
[TABLE]
∎
Lemma 3.3**.**
For any and sufficiently large , there exists a set of matrices with integral entries in for which and
For any , is non-singular; 2. 2.
For any , , where is the -th standard basis vector.
Proof.
Consider the matrix set constructed in Lemma 3.2. For each , we add into , where is the -th standard basis vector such that . Clearly is non-singular since and .
Now suppose there exists such that , which means there exists some such that and . This implies there exists some such that and . However, must be since , which implies . Thus for any , . ∎
3.2 Deterministic Lower Bound for the Equality Problem
In this section, we prove our deterministic communication complexity lower bound for the Equality problem in the coordinator model, which will be used as an intermediate problem in Section 3.3. In the Equality problem, each server receives a binary string . The goal is to test whether . We will prove an lower bound for deterministic communication protocols.
The case has a well-known lower bound.
Lemma 3.4** (See, e.g., [44, p11]).**
Any deterministic protocol for solving the Equality problem with requires bits of communication.
Our plan is to reduce the case to the case , using the symmetrization technique [54, 69]. Suppose there exists a deterministic communication protocol for the Equality problem with servers, and the communication complexity of is where is the length of the binary strings received by the servers. We show how to solve the case using .
Suppose Alice receives a binary string and Bob receives a binary string . We show that by using the protocol , they can judge whether or not using communication. Thus by Lemma 3.4, we must have
[TABLE]
which implies .
Since is deterministic, by averaging, there exists a fixed server and a fixed set with size , such that for any , when all servers have the same input , the total communication complexity beteen and the coordinator is upper bounded by . Now we fix a bijection . Alice plays the role of server in , and sets the input of to be . Bob plays the role of the coordinator and all servers for , and sets the input of to be for all . To simulate the protocol , Alice and Bob need to communicate if and only if server needs to communicate with the coordinator. If the total amount of communication between Alice and Bob exceeds then they terminate and return . Alice and Bob return if and only if the protocol returns .
Now we analyze the correctness and the efficiency of the reduction. When , we have , and by definition of and , we must have the total communication complexity between and the coordinator, and thus that between Alice and Bob, is upper bounded by . Also the protocol must return due to the correctness of . When , either the total amount of communication between Alice and Bob exceeds , in which case they will return . Otherwise returns due to its correctness.
Formally, we have proved the following theorem.
Theorem 3.5**.**
Any deterministic protocol for solving the Equality problem with servers in the coordinator model requires bits of communication.
3.3 Deterministic Lower Bound for Testing Feasibility of Linear Systems
In this section, we prove our deterministic communication complexity lower bound for testing the feasibility of linear systems, in the coordinator model and the blackboard model.
Theorem 3.6**.**
For any deterministic protocol ,
- •
If can test whether is feasible or not in the coordinator model, then the communication complexity of is ;
- •
If can test whether is feasible or not in the blackboard model, then the communication complexity of is ;
Proof.
Consider the set constructed in Lemma 3.3 with . In the hard instance, each server receives a matrix . The linear system stored on each server is just . Due to Lemma 3.3, the entire linear system is feasible if and only if . Since , we can reduce the Equality problem in Section 3.2 to solving a linear system, with . By Theorem 3.6, this implies an lower bound in the coordinator model.
In the blackboard model, the bound follows from the case when . When , the blackboard model is essentially the same as the coordinator model, up to constants in the communication complexity. The lower bound follows from the fact that each server needs to communicate at least bit. ∎
3.4 Randomized Lower Bound for Solving Linear Systems
In this section, we prove randomized communication complexity lower bounds for solving linear systems. We first prove an lower bound, which already holds for the case . When the coordinator model and the blackboard model are equivalent in terms of communication complexity, and thus we shall not distinguish these two models in the remaining part of this proof.
Consider the set constructed in Lemma 3.3 with . In the hard instance, only server receives a matrix , and the goal is to let the coordinator output the solution to the linear system . For any two and , we must have . Thus, by standard information-theoretic arguments, in order for the coordinator to output the solution to , the communication complexity is at least .
Formally, we have proved the following theorem.
Theorem 3.7**.**
Any randomized protocol that succeeds with probability at least for solving linear systems requires bits of communication in the coordinator model and the blackboard model. The lower bound holds even when .
Now we prove another lower bound of for solving linear systems in the coordinator model. In the hard instance, the last server receives a vector , and the linear equations stored on server are simply , i.e., the solution vector should be exactly . This forces the solution vector to be some predefined binary vector . The remaining servers each receive a vector , and the linear equation stored on is
[TABLE]
Also, it is guaranteed that for each , or .
Here we interpret the vector as the characteristic vector of a set , and interpret each vector also as the characteristic vector a set . Thus, testing the feasibility of the linear system is equivalent to testing whether the set owned by the server is disjoint with the set owned by any other player, which is the OR of copies of the two-player set-disjointness problem. The communication complexity for the latter problem has been studied in [54, 68]. Combining Lemma 2.2 in [54] with Theorem 1 in [68], for any communication protocol that succeeds with probability , the communication complexity is lower bounded by . By standard repetition arguments, this implies for any randomized communication protocol that succeeds with probability at least , the communication complexity is lower bounded by .
Combining this lower bound and the trivial lower bound in the blackboard model with Theorem 3.7, we have the following theorem.
Theorem 3.8**.**
Any randomized protocol that succeeds with probability at least for solving linear systems requires bits of communication in the coordinator model and bits of communication in the blackboard model.
4 Communication Protocols for Linear Systems
4.1 Testing Feasibility of Linear Systems
In this section we present a randomized communication protocol for testing feasibility of linear systems, which has communication complexity in the coordinator model and in the blackboard model. The protocol is described in Figure 1.
We first bound the communication complexity of the protocol in Figure 1. Clearly, Step 1 has communication complexity at most . During the execution of the whole protocol, at most linear equations will be added into . The communication complexity for sending each linear equation is in the coordinator model, and in the blackboard model. Thus, the total communication complexity is in the coordinator model, and in the blackboard model.
To prove the correctness of this protocol, we need the following lemma.
Lemma 4.1**.**
Given a matrix where each entry is an integer in and . Suppose is chose uniformly at random from all primes numbers in .
- (i)
; 2. (ii)
With probability at least , .
Proof.
The point (i) is immediate. For point (ii), there exists a square submatrix of with size which is non-singular over real numbers, which implies the determinant of is non-zero as a real number. Since all entries of are integers in , the determinant of as a real number is an integer in . Thus, the determinant of has at most prime factors. According to the Prime Number Theorem, there are at least distinct prime numbers in the range , for sufficiently large . Thus, by adjusting constants, is not a prime factor of the determinant of with probability at least , in which case . ∎
Notice that the protocol in Figure 1 is basically testing the feasibility of the linear system over the finite field , for a randomly chosen prime number . Before the execution of the -th loop of Step 4, the set is a maximal set of linearly independent equations for all linear equations stored on the first servers . Here the linear independence is defined over the finite field . During the execution of the -th loop of Step 4, server considers each linear equation stored on itself one by one, sends the linear equation to all other servers and adds the linear equation to set if that linear equation is linearly independent with all existing linear equations in . If server finds that the set becomes infeasible after adding linear equations stored on , then terminates the protocol.
Consider a linear system where and and all entries of and are integers in the range . If is feasible over the real numbers, then it will also be feasible over the finite field . If is infeasible, then we have . By Lemma 4.1, , and since , with probability at least , , which implies with probability , is still infeasible over the finite field . Since the protocol in Figure 1 tests the feasibility of the linear system over the finite field , the correctness follows.
Formally, we have proved the following theorem.
Theorem 4.2**.**
The protocol in Figure 1 is a randomized protocol for testing feasibility of linear systems and has communication complexity in the coordinator model and in the blackboard model. The protocol succeeds with probability at least .
4.2 Solving Linear Systems
In this section we present communication protocols for solving linear systems. We start with deterministic protocols, in which case we can get a protocol with communication complexity in the coordinator model and in the blackboard model.
In order to solve linear systems, we can still use the protocol in Figure 1, but we don’t use the prime number any more. In Step 4a of the protocol, we no longer check the feasibility over the finite field. In Step 4b of the protocol, we no longer takes the residual modulo before sending the linear equations. At the end of the protocol, each server can use the set of linear equations , which is a maximal set of linear equations of the original linear system, to solve the linear system. The communication complexity is in the coordinator model and in the blackboard model since at most linear equations will be added into the set , and each linear equation requires bits to describe.
Formally, we have proved the following theorem.
Theorem 4.3**.**
There exists a deterministic protocol for solving linear systems which has communication complexity in the coordinator model and in the blackboard model.
Now we turn to randomized protocols. We describe a protocol for solving linear systems with communication complexity in the coordinator model. The description is given in Figure 2.
Now we prove the correctness of the protocol. We first note a few simple properties of the protocol. For each , after executing the -th loop of Step 3, we have . Furthermore, at Step 3(a)ii, we must have . This means if is infeasible, then the original linear system must be infeasible.
Thus, it suffices to show that for each , if there exists such that is infeasible, then the protocol is terminated, and otherwise after executing the -th loop of Step 3, we have .
Suppose before the execution of the -th loop of Step 3 we have and is feasible, and the protocol is executing the -th loop of Step 3. There are two cases here.
Case 1: .
In this case, will remain unchanged during the -th loop of Step 3.
Case 2: There exists , .
In this case, we claim that with probability at least , the linear equation calculated at Step 3(a)i satisfies . This can be seen since for any linear combination , if we flip the sign of and obtain , then either or , since and .
Thus, if there exists such that , then with probability , at least one of the linear equations calculated at Step 3(a)i satisfies , in which case the protocol terminates if is infeasible, or is added into otherwise. Thus, if , then after the execution of the -th loop of Step 3, with probability at least , either the protocol (correctly) terminates, or we have . The correctness of the protocol just follows by applying a union bound over all such that . Notice that there are at most such we need to apply a union bound over.
Now we analyze the communication complexity of the protocol. Notice that at most linear equations will be added into , and thus the total communication complexity associated with sending when is added into is upper bounded by . Furthermore, if is infeasible at Step 3(a)ii, then the protocol terminates and thus the communication complexity for sending associated such that case is upper bounded by . Furthermore, for each , server will send different linear equations to the coordinator, and if we implement the protocol naïvely, then the total communication complexity is upper bounded by . Thus, the total communication complexity of the whole protocol is upper bounded by .
However, using Lemma 4.1 and the same argument as in Section 4.1, to implement Step 3(a)ii, it suffices to check if is feasible and if is a linear combination of existing linear equations in , over the finite field , for a random prime number . The correctness still follows since this check fails with probability at most . After this modification, the communication complexity is now upper bounded by .
Formally, we have proved the following theorem.
Theorem 4.4**.**
The protocol described in Figure 2 is a randomized protocol for solving linear systems which has communication complexity in the coordinator model. Here the notation hides only factors. The protocol succeeds with probability at least .
5 Communication Complexity Lower Bounds for Linear Regressions in the Blackboard Model
In this section, we prove communication complexity lower bounds for linear regression in the blackboard model.
We first define the -XOR problem and the -MAJ problem. In the blackboard model, each server receives a binary string . In the -XOR problem, at the end of a communication protocol, the coordinator correctly outputs the coordinate-wise XOR of these vectors, for at least coordinates. In the -MAJ problem, at the end of a communication protocol, the coordinator correctly outputs the coordinate-wise majority of these vectors, for at least coordinates.
We need the following lemma for our lower bound proof.
Lemma 5.1**.**
Any randomized communication protocol that solves the -XOR problem or the -MAJ problem and succeeds with probability at least has communication complexity .
Proof.
The lower bound for -XOR directly follows from [54, Theorem 1.1]. Now we prove the lower bound for -MAJ.
First, consider a communication problem with two players. Alice receives a binary string and binary strings . Bob receives a binary string and the same binary strings . These binary strings are generated uniformly at random conditioned on the following constraint: for each coordinate , the -th coordinate of contains either zeros or zeros. For each coordiante , whether the -th coordinate of contains zeros or zeros is also chosen uniformly at random. In this communication problem, the goal of Alice is to output the vector .
Now we prove a lower bound the communication problem defined above. Notice that for each coordinate , if contains exactly zeros and ones at the -th coordinate, then the -th coordinate of will be uniformly at random. By a Chernoff bound, with high probability, there exists a set with size , such that for each , the -th coordinate of contains exactly zeros and ones. By standard information-theoretic arguments, if at the end of the communication protocol, with constant probability, Alice correctly outputs the value of for at least fraction of , then the expected communication complexity is lower bounded by , even with public randomness. See, e.g., Lemma 2.1 in [54] for a formal proof.
Now we reduce the problem mentioned above to -MAJ and prove an lower bound. Given any protocol for the -MAJ problem with expected communication complexity on the distribution mentioned above, Alice and Bob first use public randomness to choose two distinct servers and uniformly at random, and then Alice and Bob simulate the protocol . To simulate , Alice plays the role of server and Bob plays the role of server . They both play the roles of all other players. Alice sets the input of to be , and Bob sets the input of to be . The inputs of the other servers are set to be . To simulate , Alice and Bob need to communicate if and only if server or server needs to communicate with the coordinator since all other communication can be simulated by Alice and Bob themselves.
By symmetry, the expected communication complexity between Alice and Bob is upper bounded by . Furthermore, at the end of the protocol, Alice and Bob have the coordinate-wise majority of , for at least coordinates. Thus, for at least fraction of , Alice knows the majority of the -th coordinates of . However, by definition of , the majority of the -th coordinates of is exactly . Thus, by the lower bound mentioned above, we must have , which implies an lower bound. ∎
Now we give a reduction from -MAJ to -approximate regression in the blackboard model, and prove an lower bound when . In the hard case we assume , and we simply ignore all other servers if . For each server , its matrix is set to be the identity matrix , and . Notice that in such case, we can calculate the regression value separately for each coordinate . The optimal solution can be achieved by taking to be , where is the majority of . Notice that the regression value associated with the -th coordinate in the optimal solution is upper bounded by , and thus the total regression value is upper bounded by in the optimal solution. Furthermore, if , then the regression value associated with the -th coordinate by using will be at least larger than the regression value associated with the -th coordinate in the optimal solution.
Now consider a -approximate solution . We claim that for at least coordinates of , we have . The claim follows since otherwise, the total regression value of would be at least larger than the optimal regression value, which would again be larger than times the optimal regression value, by adjusting the constant in . Thus, from a -approximate solution to the regression problem, we can solve the -MAJ problem with , which implies an lower bound.
Formally, we have proved the following theorem.
Theorem 5.2**.**
When , any randomized protocol that succeeds with probability at least for solving -approximate regression requires bits of communication in the blackboard model.
Now we give a reduction from -XOR to -approximate regression in the blackboard model, and prove an lower bound when . In the hard case we assume , and we simply ignore all other servers if . For each server , its matrix is set to be the identity matrix , and . For regression, the optimal solution can be achieved by taking to be , where is the average of . Notice that the squared regression value associated with the -th coordinate in the optimal solution is upper bounded by , and thus the total squared regression value is upper bounded by . Furthermore, if , then the squared regression value associated with the -th coordinate by using will be at least larger than the squared regression value associated with the -th coordinate in the optimal solution.
Now consider a -approximate solution . We claim that for at least coordinates of , we have . The claim follows since otherwise, the total squared regression value of would be at least larger than the optimal squared regression value, which would again be larger than times the optimal squared regression value, by adjusting the constant in . Notice that for those coordinates with , we can exactly recover from , since is the average of and thus is an integer multiple of . This also implies we can recover the XOR of . Thus, from a -approximate solution to the regression problem, we can solve the -XOR problem with , which implies an lower bound.
Formally, we have proved the following theorem.
Theorem 5.3**.**
When , any randomized protocol that succeeds with probability at least for solving -approximate regression requires bits of communication in the blackboard model.
6 Communication Protocols for Regression
In this section, we design distributed protocols for solving the regression problem.
6.1 A Deterministic Protocol
In this section, we design a simple deterministic protocol for regression in the distributed setting with communication complexity in the coordinator model.
According to the normal equations, the optimal solution to the regression problem can be attained by setting . In Figure 3, we show how to calculate and in the distributed model.
Notice that the bit complexity of entries in and is since the bit complexity of entries in and is , which implies the communication complexity of the protocol in Figure 3 is , in both the coordinator model and the blackboard model.
Theorem 6.1**.**
The protocol in Figure 3 is a deterministic protocol which exactly solves regression, and the communication complexity is , in both the coordinator model and the blackboard model.
6.2 A Protocol in the Blackboard Model
In this section, we design a recursive protocol for obtaining constant approximations to leverage scores in the distributed setting, which is described in Figure 4. We then show how to solve regression by using this protocol.
The protocol described in Figure 4 is basically Algorithm 2 in [26] for approximating leverage scores, implemented in the distributed setting. Using Lemma 8 in [26], the protocol returns an matrix such that
[TABLE]
for all , with constant probability. Each server can then obtain constant approximations to leverage scores of all rows in by calculating .
Now we analyze the communication complexity of the protocol. Notice that this recursive algorithm has levels of recursion. Step 1 has communication complexity in the coordinator model and in the blackboard model, and will be executed at most once during the whole protocol. At Step 4, we can assume each is a power of two between and , since we can discard all rows whose and increase each by a constant factor. In order to implement the sampling process in Theorem 2.1 in the distributed setting, each server sends the summation of for all rows in to the coordinator. After receiving all these summations, the coordinator decides the number of rows to be sampled from each and sends these numbers back to each server. The communication complexity of this step is at most . Each server samples and rescales the rows accordingly, and then sends these sampled rows to the coordinator, and the coordinator sends all sampled rows back to all servers. Notice that the bit complexity of all entries in the sampled rows is at most since is an integer between and . Thus, in the blackboard model, the total communication complexity at each recursive level of the protocol is upper bounded by , which implies the communication complexity of the whole protocol is at most in the blackboard model. A similar analysis shows that the communication complexity of the whole protocol is in the coordinator model.
Lemma 6.2**.**
The protocol described in Figure 4 is a randomized protocol with communication complexity in the coordinator model and in the blackboard model, such that with constant probability, upon termination of the protocol, each server has constant approximations to leverage scores of all rows in .
Our protocol for solving the regression problem in the blackboard model is described in Figure 5. It is guaranteed that the vector calculated at Step 3 is a -approximate solution, with constant probability. A naïve approach for obtaining -approximate solution to the regression problem will be using Theorem 2.1 to obtain and such that with constant probability, for all ,
[TABLE]
By doing so, the number of sampled rows should be according to Theorem 2.1. However, as shown in Theorem 36 of [24], in order to obtain a -approximate solution to the regression problem (instead of obtaining a subspace embedding), it suffices to sample rows from .
Now we analyze the communication complexity of the protocol in Figure 5 in the blackboard model. By Lemma 6.2, the communication complexity of Step 1 is upper bounded by . Similar to Step 4 of the protocol described in Figure 4, the sampling process in Step 2 can be implemented with communication complexity . Thus, the total communication complexity is in the blackboard model.
Theorem 6.3**.**
The protocol described in Figure 5 is a randomized protocol which returns a -approximate solution to regression with constant probability, and the communication complexity is in the blackboard model .
7 Communication Protocols for Regression
In this section, we design distributed protocols for solving the regression problem.
7.1 A Simple Protocol
In this section, we design a simple protocol for obtaining a -approximate solution to the regression problem in the distributed setting. The protocol is described in Figure 6.
To implement Step 1, each server calculates the Lewis weights of and uses Theorem 2.2 to randomly generate a matrix . then checks whether for all
[TABLE]
If not, randomly generates another until (3) is satisfied for all . Since (3) is satisfied with constant probability, the number of independent trials is in expectation. Furthermore, each server can locally check whether (3) holds or not by e.g., verifying on an -net. Notice that the use of randomness is not critical here, since each server can locally enumerate all possible up to a specific precision, instead of using Theorem 2.2 to randomly generate a matrix . Each server will eventually find a matrix which satisfies (3), whose existence is guaranteed by Theorem 2.2.
Now we prove the correctness of the protocol. Notice that it is guaranteed that for any and any ,
[TABLE]
It implies that
[TABLE]
and
[TABLE]
Thus, the vector calculated at Step 2 is a -approximate solution to the regression problem.
Finally, we analyze the communication complexity of the protocol. Similar to the analysis in Section 6.2, we may assume all in the sampling process of Theorem 2.2 are integers between and . Thus, the bit complexity of all entries in and is at most , which implies the communication complexity of Step 1 is in both the coordinator model and the blackboard model.
Theorem 7.1**.**
The protocol described in Figure 6 is a deterministic protocol which returns a -approximate solution to the regression problem, and the communication complexity is in both the coordinator model and the blackboard model.
7.2 A Protocol Based on Lewis Weights Sampling
In this section, we first design a protocol for obtaining constant approximations to Lewis weights in the distributed setting, which is described in Figure 7, and then solves the regression problem based on this protocol.
The protocol described in Figure 7 is basically the algorithm in Section 3 of [27] for approximating Lewis weights, implemented in the distributed setting. Using the same analysis, by setting , we can show are constant approximations to the Lewis weights of . Now we show that we can assume all are integers between and .
Without loss of generality we assume each row of contains at least one non-zero entry. Since our goal here is to calculate constant approximations to the Lewis weights, using the analysis in Section 3 in [27], we only need constant approximations to during the execution of the algorithm. Furthermore, since leverage scores are at most (see, e.g., Section 2.4 in [66]), we can prove by induction that during the execution of the algorithm. Thus, we may assume and are integers.
Now we show that . We prove this claim by induction. At the beginning of the algorithm, for all . Assume by the induction hypothesis, we know all entries in are integers between and . Using Lemma 2 in [26], we know
[TABLE]
By the Cauchy-Schwarz inequality, in order that , we must have
[TABLE]
since otherwise all entries in are less than , which violates the assumption that all entries in are integers and each row of contains at least one non-zero entry. Furthermore, the number of iterations is at most , which implies by induction.
Thus, all entries in have bit complexity during the execution of the algorithm. Using Lemma 6.2, the communication complexity is in the blackboard model, and in the coordinator model.
Lemma 7.2**.**
The protocol described in Figure 7 is a randomized protocol with communication complexity in the blackboard model and in the blackboard model, such that with constant probability, upon termination of the protocol, each server has constant approximations to the Lewis weights of all rows in .
Our protocol for solving the regression problem in the blackboard model is described in Figure 8. By Theorem 2.2, it is guaranteed that the vector calculated at Step 3 is a -approximate solution, with constant probability.
Now we analyze the communication complexity of the protocol in Figure 8. By Lemma 7.2, the communication complexity of Step 1 is upper bounded by in the blackboard model and in the coordinator model. Similar to Step 4 of the protocol described in Figure 4, the sampling process in Step 2 can be implemented with communication complexity in both models. Thus, the total communication complexity is in the blackboard model. In the coordinator model, the total communication complexity is .
Theorem 7.3**.**
The protocol described in Figure 8 is a randomized protocol which returns a -approximate solution to the regression problem with constant probability, and the communication complexity is in the blackboard model and in the coordinator model.
7.3 A Protocol Based on Accelerated Gradient Descent
In this section, we present a protocol for the regression problem in the coordinator model, whose communication complexity is .
We need the following definition in [32].
Definition 7.1** ([32]).**
Suppose . A matrix is approximately isotropic row-bounded if the following hold:
; 2. 2.
For all rows of , .
Before presenting the protocol, we first present a preconditioning procedure in Figure 9, which will later be used in the protocol for regression.
The communication complexity of the this protocol in the coordinator model is by Lemma 6.2 and Lemma 7.2. Similar to the analysis in Section 6.2, we can also assume the bit complexity of all entries in and is . Furthermore, by Theorem 2.2, a -approximate solution to the regression problem
[TABLE]
is a -approximate solution to the original regression problem
[TABLE]
Thus, we will focus on the regression problem in (4) in the remaining part of this section
Now we show is approximately isotropic row-bounded as in Definition 7.1. We only need to show and all rows of satisfy where is the number of rows of .
To show , it is equivalent to show that for all ,
[TABLE]
Notice that is an orthogonal matrix , which implies for all
[TABLE]
Combining (6) and the fact that
[TABLE]
for all , we can prove (5), which implies .
To show , we use Lemma 29 in [32], which states that with constant probability, the leverage scores of satisfy for all . Since leverage scores are invariant under change of basis (see, e.g., Section 2.4 in [66]), we have for all ,
[TABLE]
Since , we have
[TABLE]
Thus, is approximately isotropic row-bounded.
Now we describe our protocol for regression in Figure 10. Our protocol first uses the preconditioning procedure in Figure 9 and then uses Nesterov’s accelerated gradient descent [52] to solve the regression problem
[TABLE]
Furthermore, we invoke a smoothing reduction JointAdaptRegSmooth in [2] to obtain better dependence on .
In order to implement Nesterov’s accelerated gradient descent in the distributed setting, each server maintains the current solution . In each round, servers communicate to calculate the current gradient vector. Once all servers receive the gradient vector, they can update their current solution locally and proceed to the next round. Analysis in [2] (Example C.3) shows that when JointAdaptRegSmooth is applied to Nesterov’s accelerated gradient descent, after full gradient calculations, the algorithm will output a vector such that
[TABLE]
where we assume is -Lipschitz continuous and the initial solution satisfies . Since is approximately isotropic row-bounded and the initial vector is the optimal solution to the regression problem , Lemma 19 in [32] shows that . Furthermore, Lemma 15 in [32] shows that By setting , we can calculate a -approximate solution to the regression problem using full gradient calculations.
Both Nesterov’s accelerated gradient descent [52] and JointAdaptRegSmooth in [2] require an estimation (up to a constant factor) of , which be can be obtained by using the algorithm in Section 7.1 to obtain an -approximate solution and then calculating .
It remains to design an protocol to calculate the gradient vector of the smoothed objective function for , in the distributed setting. We show this can be done with communication complexity . By using JointAdaptRegSmooth in [2], the new objective function will be
[TABLE]
where
[TABLE]
for some and known to each server.
Each server can locally calculate the gradient vector of the term, since and is known to each server. In the remaining part of this section, we focus on designing an algorithm for calculating the gradient vector of the first term in (7).
For the first term in (7), we have
[TABLE]
Notice that we cannot directly let each server calculate the gradient vectors using (8), send the gradient vectors to the coordinator and calculate the summation, since the bit complexity of can be unbounded. Instead, we deal the two cases in (7) by using two different approaches.
When , notice that although the bit complexity of can be unbounded, all entries in the vector have bit complexity at most and is a matrix known to each server. Thus, for each server , it sends
[TABLE]
to the coordinator, for each row which is stored on and satisfies . After receiving from each server, the coordinator calculates
[TABLE]
for all rows that satisfy , and sends it to each server. All servers can then recover the gradient vector. The total communication for this case is at most .
When ,
[TABLE]
Thus, for each server , it sends
[TABLE]
and
[TABLE]
to the coordinator, for each row which is stored on and satisfies . After receiving from each server, the coordinator calculates
[TABLE]
and
[TABLE]
for all rows that satisfy , and sends it to each server. All servers can then recover the gradient vector. The total communication for this case is at most .
Thus, the total communication complexity of the protocol in Figure 10 is .
Theorem 7.4**.**
The protocol described in Figure 10 is a randomized protocol which returns a -approximate solution to the regression problem with constant probability, and the communication complexity is in the coordinator model.
8 Communication Protocols for Regression
In this section, we design distributed protocols for solving the regression problem, including .
8.1 Communication Protocols for Regression
Any regression instance can be formulated as the following linear program,
[TABLE]
which has constraints and variables. Thus, any linear programming protocol implies a protocol for solving the regression problem, with the same communication complexity. Using the linear program solvers in Section 10 and Section 11, we have the following theorem.
Theorem 8.1**.**
* regression can be solved deterministically and exactly with communication complexity in the coordinator model, and randomly and exactly with communication complexity in the blackboard model.*
8.2 Communication Protocols for Regression When
In this section, we introduce an approach that reduces -approximate regression to linear programs with variables. Our main idea is to use the max-stability of exponential random variables [3] to embed into . Such idea was previously used to construct subspace embeddings for the norm [67]. However, since our goal here is to solve linear regression instead of providing an embedding for the whole subspace, we can achieve a much better approximation ratio than previous work [67].
Theorem 8.2**.**
For any matrix and constant , let be random diagonal matrices, whose diagonal entries are i.i.d. random variables with the same distribution as , where is an exponential random variable. If , then with constant probability, the following holds:
[TABLE] 2. 2.
For all ,
[TABLE]
Here is the optimal solution to the regression problem and is a constant which is the expectation of for an exponential random variable .
The proof of Theorem 8.2 can be found in Section 8.3.
Now we prove with constant probability, the optimal solution to the optimization problem
[TABLE]
satisfies
[TABLE]
where is the optimal solution to the regression problem .
Notice that with constant probability,
[TABLE]
Thus, we have reduced -approxiamte regression to
[TABLE]
The optimzation problem in (9) can be written as a linear program with variables. For each , we use to represent the value of as in Section 8.1, and the goal is to minimize Furthermore this reduction can be easily implemented in the distributed setting since each server can independently generate random variables in associated with its own input rows in . We can round each entry in to its nearest integer mutiple of , which is enough for the correctness of Theorem 8.2, but increases the bit complexity of each entry by at most an factor.
Using the linear program solvers in Section 10 and Section 11, we have the following theorem.
Theorem 8.3**.**
-approximate regression can be solved by a randomized protocol with communication complexity in the coordinator model, or by a randomized protocol with communication complexity in the blackboard model.
8.3 Proof of Theorem 8.2
We need the following Bernstein-type lower tail inequality which is due to Maurer [49].
Lemma 8.4** ([49]).**
Suppose are independent positive random variables that satisfy . Let . For any we have
[TABLE]
We use the standard -net construction of a subspace in [15].
Definition 8.1**.**
For any , for a given , let . We say is an -net of if for any , there exists a such that .
Lemma 8.5** ([15]).**
For a given , there exists an -net with size .
Lemma 8.6** (Auerbach basis [9]).**
For any matrix and , there exists a basis matrix of the column space of , such that for all , and for any vector ,
[TABLE]
Now we give the proof of Theorem 8.2. Notice that for any fixed vector ,
[TABLE]
where is an exponential random variable. Moreover, when , both and are bounded by a constant.
We have . By linearity of expectation, we also have
[TABLE]
We use to denote an Auerbach basis of the column space of . We create three events , , . Here is an absolute constant.
- •
: for all .
- •
: for all and .
- •
: for all ,
[TABLE]
where is a -net of . By Lemma 8.5 we have .
According to the cumulative density function of for an exponential random variable , and a union bound over , holds with constant probability. Similarly, also holds with constant probability. For each , using Maurer’s inequality in Lemma 8.4, we have
[TABLE]
Thus for , by using a union bound for all , with constant probability holds.
Conditioned on , using Bernstein’s inequality, we have
[TABLE]
Thus, for and ,
[TABLE]
holds with constant probability, which implies Part 1 of Theorem 8.2.
Now for any with , by definition of the Auerbach basis we have . Conditioned on , we have,
[TABLE]
Consider any with . We claim can be written as
[TABLE]
where for any we have (i) and (ii) .
According to the definition of a -net, there exists a vector for which and . If then we stop. Otherwise we consider the vector . Again we can find a vector such that and . Here we set and continue this process inductively.
Thus, conditioned on and , we have for any with ,
[TABLE]
For any , by homogeneity, we still have
[TABLE]
which implies Part 2 of Theorem 8.2.
9 Communication Complexity Lower Bound for Linear Programming
In this section, we prove a communication complexity lower bound for testing feasibility of linear programs.
We need the following lemma to construct our hard instance.
Lemma 9.1**.**
Let be a sufficiently large integer. We use to denote the vector
[TABLE]
For any , we have
; 2. 2.
For any , .
Proof.
For any , we have
[TABLE]
For any and , we have
[TABLE]
∎
Now we reduce the lopsided set disjiontness problem to testing feasibility of linear programs. In this problem, for a choice of universe size , the last server receives an element , and for each , server receives a set . The goal is to test whether there exists such that . We reduce this problem with to testing the feasibility of linear programs for , where is the bit complexity of the linear program.
For the reduction, server adds a constraint , for the element that receives. I.e., server forces the solution to be . For each , for each , server adds a constraint . Here and are as defined in Lemma 9.1. By Lemma 9.1, this linear program is feasible if and only if .
In the remaining part of this section, we show the lopsided set disjointness problem has an randomized communication complexity lower bound in the coordinator model, which implies an lower bound for testing feasiblity of linear programming, even for . An lower bound also holds in the blackboard model, since when the coordinator model is equivalent to the blackboard model, up to a constant factor in the communication complexity.
We first consider the two-player case, in which Alice receives an element and Bob receives a set . The goal is to test whether or not. Let be the distribution where is chosen uniformly at random from , and is a subset of such that each element is included independently with probability . Let be the conditional distribution of given , and be the conditional distribution of given . In [5, Section 2.2], it has been shown that any communication protocol that succeeds with probability at least on the distribution requires bits of communication in the worst case. By applying Markov’s inequality and stopping the protocol early once the communication complexity is too large, this implies any randomized protocol that succeeds with probability at least on the distribution requires bits of communication in expectation. In fact, this implies a stronger hardness result, that for any protocol that succeeds with probability at least on , its expected communication complexity is on both and .
Consider a new distribution which is with probability and with probability . Suppose a protocol succeeds with probability at least . Then by averaging succeeds with probaility at least on both and , which implies the expected communication complexity of is on both and . Now by linearity of expectation, the expected communication complexity of on is lower bounded by . This, in particular, implies any protocol that succeeds with probability at least on should have expected communication complexity . At this point, Theorem 1.1 in [68] implies that for the -player case, any communication protocol that succeeds with probability at least has worst case communication complexity at least . By standard repitition arguments this implies an lower bound for protocols that succeed with constant probability.
Formally, we have the following theorem.
Theorem 9.2**.**
Any randomized protocol that succeeds with probability at least for testing feasibility of linear programs requires bits of communication in the coordinator model and bits of communication in the blackboard model. The lower bound holds even when .
Notice that by Theorem 4.2, testing feasibility of linear systems for requires only randomized communication complexity. This shows an exponential separation between testing feasibility of linear systems and linear programs, in the communication model.
10 Clarkson’s Algorithm
10.1 The Communication Complexity
In this section, we discuss how to implement Clarkson’s algorithm to solve linear programs in the distributed setting. The protocol is described in Figure 11. During the protocol, each server maintains a multi-set of constraints (i.e., each constraint can appear more than once in ). Initially, is the set of constraints stored on . Furthermore, the coordinator maintains , which is initially set to be the number of constraints stored on each server.
The protocol in Figure 11 is basically Clarkson’s algorithm [22], implemented in the distributed setting. Using the analysis in [22], the expected number of iterations is . The correctness also directly follows from the analysis in [22]. Now we analyze the communication complexity for each iteration.
To implement the sampling process in Step 1, the coordinator first determines the number of constraints to be sampled from each server and sends this number to . The total communication complexity for this step is in both the coordinator model and the blackboard model. Then each server samples accordingly and sends these constraints to the coordinator. The total communication for this step is in both models.
To implement Step 2, we first verify the bit complexity of the optimal solution . One of the optimal solutions is a vertex of the polyhedron . From polyhedral theory we know that there exists a non-singular subsystem of , say , such that is the unique solution of . Thus, by Cramer’s rule, each entry of is a fraction whose numerator and denominator are integers between and , and thus can be represented by using at most bits. This implies the bit complexity of all entries in the vector calculated at Step 2 is upper bounded by . Thus the communication complexity for Step 2 is upper bounded by in the coordinator model and in the blackboard model. The communication complexity of the last two steps of the protocol is upper bounded by in both models. Thus, the expected communication complexity is in the coordinator model and in the blackboard model.
Theorem 10.1**.**
The expected communication complexity of the protocol in Figure 11 is in the coordinator model and in the blackboard model
10.2 Running Time of Clarkson’s Algorithm in Unit Cost RAM
In this section, we show how to implement Clarkson’s algorithm in the unit cost RAM model on words of size so that the running time is upper bounded by , and prove Theorem 1.11.
A description of Clarkson’s algorithm can be found in Figure 11. This algorithm runs in rounds in expectation. In each round, it samples constraints , and calculates an optimal solution that satisfies all constraints in . This optimal solution can be calculated using any polynomial time linear programming algorithm, which always has running time . The bottleneck in the unit cost RAM model is Step 4 of the algorithm in Figure 11, i.e., for each of the constraints, testing whether satisfies the constraint or not. Formally, we just need to output , and then compare each entry with . In the remaining part of this section we show how to caculate in time
Since each entry of has bit complexity , we first calculate a matrix , where each entry of has bit complexity , and the entry consists of the first bits of , the second bits of the last bits of . Now we calculate . Since all entries in and have bit complexity , and caculating the matrix mutilplication of two matrices with bit complexity requires only time [39], can therefore be calculated in time. Given , one can then easily calculate in time. Thus, the total expected running time is upper bounded by .
10.3 Smoothed Analysis of Communication Complexity
In this section we define our model for smoothed analysis of communication complexity of communication protocols for solving linear programming.
For a randomized communication protocol , we use to denote its communication complexity on the linear programming instance
[TABLE]
where , and . The standard definition [61] of smoothed analysis assumes that each entry of is perturbed by i.i.d. Gaussian noise with zero mean and standard deviation . However, since we are measuring the communication complexity in terms of bit complexity, we cannot allow the noise to be arbitrary real numbers. Instead, in our model, we use discrete Gaussian random variables as the noise.
Formally, we use to denote the function that rounds a real number to its nearest integer multiple of . For notational convenience, we define . We say a communication protocol solves the linear program instance (10) with smoothed communication complexity if with probability at least , the protocol correctly solves the instance
[TABLE]
with communication complexity , where all entries of are i.i.d. copies of and is a Gaussian random variable with zero mean and standard deviation. Here the probability is defined over the randomness of the protocol and the noise . Notice that when , is a matrix whose all entries are i.i.d. Gaussian random variables with standard deviation .
10.4 Smoothed Analysis of Clarkson’s Algorithm
In this section, we present our variant of Clarkson’s algorithm for solving smoothed linear programming instances. The protocol is described in Figure 12. The main difference is in Step 2, where the coordinator rounds each entry of the solution before sending it to other servers.
10.4.1 Correctness of the Protocol
We first prove the correctness of the protocol. Our plan is to show if , then our modified Clarkson’s algorithm follows the computation path of the original Clarkson’s algorithm in Figure 11 when executing on the perturbed instance, with high probability, and thus prove the correctness of the protocol.
We need the following bound on the condition number of a matrix.
Lemma 10.2**.**
For a matrix with all entries in , for any integer , and , we have
[TABLE]
and
[TABLE]
Proof.
To prove the first inequality, notice that
[TABLE]
Thus, the first inequality just follows from tail inequalities of the Guassian distribution.
To analyze , we write to denote a matrix whose entries are the Gaussian random variables of before applying the truncation operation. Notice that this implies . We invoke Theorem 3.3 in [56], which states that with probability ,
[TABLE]
which implies with probability ,
[TABLE]
∎
Lemma 10.3**.**
During the execution of the protocol in Figure 12, each time Step 2 is executed, if , with probability at least , satisfies
[TABLE]
Proof.
From polyhedral theory we know that there exists a non-singular subsystem of the sampled constraints , say , such that is the unique solution of .
If , since each entry of was pertubed by a discrete Gaussian noise, and all entries of are integers in the range , by Lemma 10.2 we have
[TABLE]
Furthermore, since ,
[TABLE]
If , then we must have , since is non-singular and thus is the unique solution. ∎
Now we create a family of events . We use to denote the event that, during the -th loop of the execution of the protocol in Figure 12, for each constraint , the constraint can be satisfied by if and only if it can be satisfied by . Notice that for those constraints in , can always satisfy them, by definition of .
Now we show that for each , the probability that holds is at least . By showing this, we have actually shown our algorithm follows the computation path of the original Clarkson’s algorithm in Figure 11 when executing on the perturbed instance, with high probability. Since the original Clarkson’s algorithm in Figure 11 terminates in rounds with probability at least , the correcntess of our algorithm follows by applying a union bound over all events .
To show that for each , the probability that holds is at least , by applying a union bound over all constraints, it suffces to show that for each constaint , can satisfy if and only if can satisfy , with probability .
Lemma 10.4**.**
For each constraint , with probability , can satisfy if and only if can satisfy .
Proof.
If , then , in which case the lemma follows trivially. Thus we assume in the remaining part of this proof.
The constraint can be written as , for some vector and some , and all entries of are i.i.d. copies of and is a Gaussian random variable with zero mean and standard deviation. Notice that since , the vector and the vector are independent. By Lemma 10.3, with probability at least ,
[TABLE]
Furthermore, the probability that can satisfy but cannot satisfy , or cannot satisfy but can satisfy , is at most
[TABLE]
We first analyze the right hand side of the inequality. Notice that , and with probability at least by tail inequalities of the Gaussian distribution and . Moreover, . Thus by Cauchy-Schwarz,
[TABLE]
On the other hand, if we write to denote a vector whose entries are the Gaussian random variables of before applying the truncation operation, then .
Thus, by taking , we have
[TABLE]
By the lower tail inequality of the Gaussian distribution and the fact that , we have with probability at least ,
[TABLE]
Thus, the lemma follows by appropriately adjusting the constant in . ∎
10.4.2 Communication Complexity of the Algorithm
The analysis in the preceding section shows that with high probability, our modified Clarkson’s algorithm follows the computation path of the original Clarkson’s algorithm, and thus also terminates within rounds with probability at least . Furthermore, with high probability, the discrete Gaussian noise of all entries is upper bounded by . Thus, the bit complexity of sending each constraint will be , with high probability.
The sampling process in Step 1 requires bits of communication to sample constraints. To implement Step 2, we need to verify the bit complexity of . Since we round each entry of to its nearest integer multiple of , and by Lemma 10.3, with high probability, , the communication compleixty for sending is upper bounded by . The communication complexity of the last two steps of the protocol is still upper bounded by . Thus, the smoothed communication complexity is in the coordinator model.
Theorem 10.5**.**
For , the protocol in Figure 12 correctly solves smoothed linear programming with probability at least , and the smoothed communication compleixty is
[TABLE]
in the coordinator model.
11 The Center of Gravity Method
In this section, we discuss how to implement the center-of-gravity cutting-plane method [35] in the distributed setting. The description of the protocol can be found in Figure 13.
The servers each maintain a polytope (the same one for all servers), adding a constraint in each iteration. Each server also maintains the center of the polytope and its covariance .
For any vector , its -rounding w.r.t. to is defined as follows: Let . We take the unit vector , round it down to the nearest multiple of in each coordinate. So we have .
If each server were to report the exact violated constraint, the volume of would drop by a constant factor in each iteration. To reduce the communication, we round the constraint and shift it away a bit to make sure that the rounded constraint (1) is still valid for the target LP and (2) it is close enough that the volume still drops by a constant factor.
Lemma 11.1** ([12]).**
Let be the center of gravity of an isotropic convex body in . Then, for any halfspace within distance of , we have
[TABLE]
Lemma 11.2**.**
For , the volume of the polytope maintained by each server drops by a constant factor in each iteration.
Proof.
Assume without loss of generality that is isotropic. If the centroid is not feasible, we get a violated constraint such that the entire feasible region lies in the halfspace with . Now we replace by . As a result,
[TABLE]
Here we used the fact that in isotropic position any convex body is contained in a ball of radius , so for all of and therefore for the feasible region. Thus the constraint imposed by the algorithm is valid.
Next, we note that the distance of the constraint from the origin is at most , so for , it is less than (in isotropic position). By Lemma 11.1, with , the volume of drops by a constant factor. ∎
Theorem 11.3**.**
The protocol in Figure 13 is a deterministic protocol for solving linear programming with communication complexity in both the coordinator model and the blackboard model.
Proof.
The algorithm runs for rounds. To see this we note that the each vertex of the feasible region is the solution of a subset of the linear equalities taken to be equalities. Thus, each coordinate of each vertex is a ratio of two determinants of matrices whose entries are -bit numbers and so the maximum distance of a vertex from the origin is , which upper bounds the volume by . The smallest any coordinate can be is similarly . The minimum volume we need to go to is the volume spanned by a simplex of vertices, which itself is a determinant with entries of this size. Thus, the volume is at least . Since the volume of the polytope maintained drops by a constant factor in each iteration222The ellipsoid method uses the same argument, except that each round reduces the volume by only [34]., the number of rounds is . Each round includes a broadcast of a single vector, with bits. This is because the size of the -net used is . By viewing the objective function as a constraint, we note that the volume bounds used above apply to the optimization version as well. At the end, we use diophantine approximation to get an exact solution [34]. ∎
A similar argument applies to general convex programming.
Theorem 11.4**.**
The communication complexity of the protocol in Figure 13 for solving convex programming is .
Proof.
The initial volume is at most and the algorithm can stop when the volume is . Therefore the number of rounds is . Each round uses bits giving the final bound. ∎
12 Seidel’s Algorithm
We give an alternative constant dimensional linear programming algorithms in the blackboard model, based on Seidel’s classical algorithm [58]. Here we additionally assume that each constraint in the linear program is placed on a random server. This assumption is essential to get rid of the dependence in the communication complexity. Here we also assume that the linear program is bounded.
To implement Seidel’s algorithm in the blackboard model, we go through all servers , and for each server , we go through all constraints stored on in a random order. We maintain the optimal solution to the set of constraints that we have already went through. For a new constraint , the current server first checks whether the constraint is satisfied or not. The current server proceeds to the next constraint if it is indeed satisfied. If it is not satisfied, then the current constraint must be one of the constraints that determines the current optimal solution. In this case, the current server broadcasts the current constraint to all other servers, and makes a recursive call to figure out the optimal solution, by adding an equality constraint to the set of constraints. Notice that if the first server finds a violated constraint, does not need to broadcast the violated constraint, since can simply add the equality constraint to the beginning of all constraints owned by .
One major difference between the classical Seidel’s algorithm and our implementation is that each time we make a recursive call, we do not randomly permute the constraints again in the recursive calls. Instead, the order that we go through the servers is fixed, in different recursive calls. Due to this difference, there will be subtle dependence between the communication complexity of different recursive calls.
We use to denote the event that the number of constraints stored on is at least . By a Chernoff bound, holds with probability at least . For the -th constraint (in the order that we go through all constraints), we let the random variable be if the -th constraint is one the constraints that determines the optimal solution among the first constrains, and [math] otherwise.
Since each constraint in the linear program is placed on a random server, by standard backward analysis, . Furthermore, . However, conditioned on , the first constraints will not be broadcasted and there will be no recursive calls associated with them, since they are stored on the first server . Thus, conditioned on , the expected number of broadcasts (and thus recursive calls) is upper bounded by
[TABLE]
We use to denote the event that there are at most recursive calls made at the top layer of the recursive tree corresponding to Seidel’s algorithm. Conditioned on , by Markov’s inequality, holds with probability at least . Similarly, we use to denote the event that there are at at most recursive calls made at the second layer of the recursive tree corresponding to Seidel’s algorithm. Conditioned on and , again by Markov’s inequality, holds with probability at least . We similarly define . Thus, conditioned on , with probability at least , holds for all . Which implies the total number of broadcasts is upper bounded by .
In each broadcast, the current server needs to broadcast the current constraint, which has bit complexity . Moreover, after the recursive call, the current server broadcasts the current solution vector . By polyhedral theory, can be achieved by setting inequality constraints to be equality constraints. Thus, by Cramer’s rule, each entry of has bit complexity . The total communication complexity is hence upper bounded by , with probability at least . Here the randomness is over the initial random assignment of each constraint and the random coins tossed by the algorithm.
Theorem 12.1**.**
Seidel’s algorithm can be used to solve linear programs in constant dimension with communication in the blackboard model, if each constraint in the linear program is placed on a random server. Here the randomness is over the initial random assignment of each constraint and the random coins tossed by the algorithm.
13 Singularity Probability
The goal of this section is to prove Theorem 3.1. We restate it here for convenience.
Theorem 3.1. (restated) * Let be a matrix whose entries are i.i.d. random variables with the same distribution as , for sufficiently large ,*
[TABLE]
*where is an absolute constant. *
Our proof of Theorem 3.1 follows very closely the proof of Theorem 1.5 in [63]. Throughout this section we use to denote . We use to denote the -th row of .
We need the following lemma on generalized binomial distributions.
Lemma 13.1**.**
We have
[TABLE]
and
[TABLE]
Here and are absolute constants.
Proof.
By Stirling’s approximation we have
[TABLE]
which proves (12). To prove (13), by a Chernoff bound swe have that the number of non-zero terms in the summation of is with probability . Conditioned on this event, we can then prove (13) by using the same estimation as (12). ∎
The following lemma is a direct implication of Lemma 13.1 and Odlyzko’s results in [53]. See also Lemma 2.1 in [63] and Section 3.2 in [41].
Lemma 13.2**.**
Let be an arbitrary subspace and whose entries are i.i.d. random variables with the same distribution as . We have
[TABLE]
and
[TABLE]
By Lemma 5.1 in [63], we have
[TABLE]
which implies we only need to consider the case when span a hyperplane.
We say a hyperplane is non-trivial if is spanned by its intersection with . Notice that a hyperplane has
[TABLE]
only when is non-trivial. Thus, we focus only on non-trivial hyperplanes in the remaining part of the proof.
Definition 13.1**.**
Let whose entries are i.i.d. random variables with the same distribution as . For a hyperplane , define the discrete codimension of to be the unique integer multiple of such that
[TABLE]
According to the definition, it is clear from Lemma 13.2 that .
We first dispose hyperplanes with high discrete codimension using the following lemma, which is a direct corollary of Lemma 1 in [41].
Lemma 13.3**.**
Suppose whose entries are i.i.d. random variables with the same distribution as , then
[TABLE]
Let be a constant to be determined. Using Lemma 13.3, we have
[TABLE]
Thus, in the remaining part of the proof we will focus only on the case when .
We say a hyperplane to be non-degenerate if its normal vector satisfies . Here we use to denote the number of non-zero entries in the normal vector . The following lemma, which is a simple adaption of Lemma 5.3 in [63], provides a crude estimation of the number of degenerate hyperplanes.
Lemma 13.4**.**
The number of degenerate non-trivial hyperplanes is at most .
Combining Lemma 13.2 and Lemma 13.4, we then have
[TABLE]
Thus, we can just focus on non-degenerate hyperplanes.
The following theorem, which first appeared in [41] as Theorem 2 (see also Section 7 in [63]), is based on Fourier-analytic arguments by Halász [38, 37].
Theorem 13.5**.**
Suppose is a non-trivial hyperplane. Let whose entries are i.i.d. random variables with the same distribution as , be a positive number and be a positive integer such that . We have
[TABLE]
where we use to denote the normal vector of and to denote the number of non-zero entries of .
Corollary 13.6**.**
Suppose is a non-degenerate non-trivial hyperplane. Let whose entries are i.i.d. random variables with the same distribution as . For sufficiently large , we have
[TABLE]
Proof.
We note that
[TABLE]
where are i.i.d. random variables with the same distribution as and is the -th coordinate of the normal vector . This enables one to apply Theorem 13.5. Notice that when applying Theorem 13.5 we have , since each non-zero entry of appears times in the summation of (14). Recall that . We set to be an integer which is at least . Since is non-degenerate, we have , which implies
[TABLE]
The correctness of the corollary thus follows from our choice of . ∎
For a non-degenerate non-trivial hyperplane which satisfies , define to be the event that
[TABLE]
where are independent random vectors in whose entries are i.i.d. random variables with the same distribution as and are random vectors in whose entries are i.i.d. random variables with the same distribution as . Here and where is a constant to be determined.
We first prove that
[TABLE]
To prove this, we define to be the event that
[TABLE]
By Corollary 13.6,
[TABLE]
Now we show that
[TABLE]
According to the definition of discrete codimension , we have
[TABLE]
By Corollary 13.6 we have
[TABLE]
On the other hand, by Lemma 13.2, we have
[TABLE]
Thus,
[TABLE]
which implies
[TABLE]
Using the estimation given above, for sufficiently large , we have
[TABLE]
since .
Similarly, when , i.e., , we have
[TABLE]
Again we have
[TABLE]
We define to be the event that span the hyperplane . Since and are independent, we have
[TABLE]
Consider a set
[TABLE]
which satisfies . There exist vectors
[TABLE]
such that
[TABLE]
By using a union bound of size , we can just assume . Here we use to denote the binary entropy function. Thus,
[TABLE]
Thus, by using (15) and Lemma 13.2 we have
[TABLE]
Notice that
[TABLE]
Thus, for any and sufficiently large , we have
[TABLE]
Here the second inequality follows since for sufficiently large . The third inequality is due to the monotonicity of the binary entropy function on and the fact that . The fourth inequality follows from the fact that . The last inequality follows by setting to be the solution of . A numerical calculation shows that . Theorem 3.1 thus follows by using a union bound for all possible , which has at most different valid values and setting .
We remark that the choice of parameters here is mainly for simplicity and not optimized.
14 Discussion
The lens of communication complexity reveals surprising structure about well-known optimization problems. A very interesting open question is to fully resolve the randomized communication complexity of linear programming as a function of and . Another interesting direction is to design more efficient linear programming algorithms in the RAM model with unit cost operations on words of size bits; such algorithms while being inherently useful may also give rise to improved communication protocols. While our regression algorithms illustrated various shortcomings of previous techniques, there are still interesting gaps in our bounds to be resolved.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] Alekh Agarwal, Olivier Chapelle, Miroslav Dudík, and John Langford. A reliable effective terascale linear learning system. Journal of Machine Learning Research , 15(1):1111–1133, 2014.
- 2[2] Zeyuan Allen-Zhu and Elad Hazan. Optimal black-box reductions between optimization objectives. In Advances in Neural Information Processing Systems , pages 1614–1622, 2016.
- 3[3] Alexandr Andoni. High frequency moments via max-stability. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on , pages 6364–6368. IEEE, 2017.
- 4[4] Alexandr Andoni et al. Eigenvalues of a matrix in the streaming model. In Proceedings of the twenty-fourth annual ACM-SIAM symposium on Discrete algorithms , pages 1729–1737. Society for Industrial and Applied Mathematics, 2013.
- 5[5] Alexandr Andoni, Piotr Indyk, and Mihai Patrascu. On the optimality of the dimensionality reduction method. In Foundations of Computer Science, 2006. FOCS’06. 47th Annual IEEE Symposium on , pages 449–458. IEEE, 2006.
- 6[6] Yossi Arjevani and Ohad Shamir. Communication complexity of distributed convex learning and optimization. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada , pages 1756–1764, 2015.
- 7[7] Sepehr Assadi, Nikolai Karpov, and Qin Zhang. Distributed and streaming linear programming in low dimensions. In Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems , pages 236–253. ACM, 2019.
- 8[8] Sepehr Assadi and Sanjeev Khanna. Randomized composable coresets for matching and vertex cover. In Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2017, Washington DC, USA, July 24-26, 2017 , pages 3–12, 2017.
