The Communication Complexity of Optimization

Santosh S. Vempala; Ruosong Wang; David P. Woodruff

arXiv:1906.05832·cs.DS·November 1, 2019

TL;DR

This paper investigates the communication complexity of distributed optimization problems, providing tight bounds and demonstrating the limitations of sampling and sketching techniques, thus motivating the development of new optimization methods.

Contribution

It offers the first tight bounds for communication complexity in distributed linear systems and optimization, highlighting the limitations of existing techniques and proposing new bounds for various problem settings.

Findings

01

Communication complexity for linear systems in the coordinator model is $\tilde{Θ} (d^{2} L + s d)$ (randomized) and $\tilde{Θ} (s d^{2} L)$ (deterministic).

02

Sampling and sketching are suboptimal for distributed optimization in their dependence on both $d$ and $ε$.

03

New bounds for linear programming communication complexity, especially when coefficients are randomly perturbed.

Abstract

We consider the communication complexity of a number of distributed optimization problems. We start with the problem of solving a linear system. Suppose there is a coordinator together with $s$ servers $P_{1}, \dots, P_{s}$, the $i$-th of which holds a subset $A^{(i)} x = b^{(i)}$ of $n_{i}$ constraints of a linear system in $d$ variables, and the coordinator would like to output $x \in \mathbb{R}^{d}$ for which $A^{(i)} x = b^{(i)}$ for $i = 1, \dots, s$. We assume each coefficient of each constraint is specified using $L$ bits. We first resolve the randomized and deterministic communication complexity in the point-to-point model of communication, showing it is $\tilde{Θ} (d^{2} L + s d)$ and $\tilde{Θ} (s d^{2} L)$, respectively. We obtain similar results for the blackboard model. When there is no solution to the linear system, a natural alternative is to find the solution minimizing the…

Tables

Table 1: Summary of our results for $ℓ_{p}$ regression in the coordinator model for constant $ε$.

Error Measure	Upper Bound	Lower Bound	Theorem
$ℓ_{1}$ (randomized)	$\tilde{O} (s d^{2} L)$	$\tilde{Ω} (d^{2} L + s d)$	Theorem 7.1, 3.8
$ℓ_{1}$ (deterministic)	$\tilde{O} (s d^{2} L)$	$\tilde{Ω} (s d^{2} L)$	Theorem 7.1, 3.6
$ℓ_{2}$ (randomized)	$\tilde{O} (s d^{2} L)$	$\tilde{Ω} (d^{2} L + s d)$	Theorem 6.1, 3.8
$ℓ_{2}$ (deterministic)	$\tilde{O} (s d^{2} L)$	$\tilde{Ω} (s d^{2} L)$	Theorem 6.1, 3.6
$ℓ_{p}$ for constant $p > 2$	$\tilde{O} (s d^{3} L)$	$\tilde{Ω} (d^{2} L + s d)$	Theorem 8.3, 3.8
$ℓ_{\infty}$	$\tilde{O} (s d^{3} L)$	$\tilde{Ω} (d^{2} L + s d)$	Theorem 8.1, 3.8

Table 2: Summary of our results for $ℓ_{p}$ regression in the blackboard model for constant $ε$.

Error Measure	Upper Bound	Lower Bound	Theorem
$ℓ_{1}$	$\tilde{O} (s + d^{2} L)$	$\tilde{Ω} (s + d^{2} L)$	Theorem 7.3, 3.8
$ℓ_{2}$	$\tilde{O} (s + d^{2} L)$	$\tilde{Ω} (s + d^{2} L)$	Theorem 6.3, 3.8
$ℓ_{p}$ for constant $p > 2$	$O (\min \{s d + d^{4} L, s d^{3} L\})$	$\tilde{Ω} (s + d^{2} L)$	Theorem 8.3, 3.8
$ℓ_{\infty}$	$O (\min \{s d + d^{4} L, s d^{3} L\})$	$\tilde{Ω} (s + d^{2} L)$	Theorem 8.1, 3.8

Table 3: Summary of our results for $ℓ_{1}$ and $ℓ_{2}$ regression in the coordinator model for general $ε$.

Error Measure	Upper Bound	Lower Bound	Theorem
$ℓ_{1}$	$\tilde{O} (\min (s d^{2} L + \frac{d^{2} L}{ε^{2}}, \frac{s d^{3} L}{ε}))$	$\tilde{Ω} (d^{2} L + s d)$	Theorem 7.3, 7.4, 3.8
$ℓ_{2}$ (randomized)	$\tilde{O} (s d^{2} L)$	$\tilde{Ω} (d^{2} L + s d)$	Theorem 6.1, 3.8
$ℓ_{2}$ (deterministic)	$\tilde{O} (s d^{2} L)$	$\tilde{Ω} (s d^{2} L)$	Theorem 6.1, 3.6

Table 4: Summary of our results for $ℓ_{1}$ and $ℓ_{2}$ regression in the blackboard model for general $ε$.

Error Measure	Upper Bound	Lower Bound	Theorem
$ℓ_{1}$	$\tilde{O} (s + \frac{d^{2} L}{ε^{2}})$	$\tilde{Ω} (s + \frac{d}{ε} + d^{2} L)$ for $s > Ω (1 / ε)$	Theorem 7.3, 3.8, 5.2
$ℓ_{2}$	$\tilde{O} (s + \frac{d^{2} L}{ε})$	$\tilde{Ω} (s + \frac{d}{ε^{1 / 2}} + d^{2} L)$ for $s > Ω (1 / \sqrt{ε})$	Theorem 6.3, 3.8, 5.3

Equations

\frac{1}{κ} B ⪯ A ⪯ κ B,

τ_{i} (A) = A^{i} (A^{T} A)^{†} (A^{i})^{T} .

τ_{i}^{B} (A) = \begin{cases} A^{i} (B^{T} B)^{†} (A^{i})^{T} & \text{if } A^{i} ⊥ \ker (B), \\ \infty & \text{otherwise} . \end{cases}

\overline{w}_{i} = τ_{i} (\overline{W}^{- 1/2} A),

p_{i} \geq C τ_{i} (A) \log d \cdot ε^{- 2},

(1 - ε) ∥ A x ∥_{2} \leq ∥ S A x ∥_{2} \leq (1 + ε) ∥ A x ∥_{2} .

p_{i} \geq C \overline{w}_{i} \log d \cdot ε^{- 2},

(1 - ε) ∥ A x ∥_{1} \leq ∥ S A x ∥_{1} \leq (1 + ε) ∥ A x ∥_{1} .

\Pr [M_{n} \text{ is singular}] \leq t^{- C n},

\mathrm{Bad} = \{ B \in \mathbb{R}^{d \times (d - 1)} ∣ \Pr [X \in \mathrm{span} (B)] \geq t^{- C d /2} \text{ or } \mathrm{rank} (B) < d - 1 \},

\Pr [A \in \mathrm{Bad}] \leq t^{- C d /2},

\Pr [\mathrm{rank} ([A \; X]) < d]

\Pr [\mathrm{span} ([A \; B]) = \mathbb{R}^{d}] \geq 1 - \Pr \left[ \bigcap_{i = 1}^{d - 1} B_{i} \in \mathrm{span} (A) \right] \geq 1 - t^{- C d (d - 1) /2},

E [∣ \mathcal{S} \cap \mathrm{Bad} ∣] \leq t^{C d (d - 1) /6} \cdot t^{- C d /2} .

∣ \mathcal{S} \cap \mathrm{Bad} ∣ \leq 4 E [∣ \mathcal{S} \cap \mathrm{Bad} ∣] \leq 4 t^{C d (d - 1) /6} \cdot t^{- C d /2},

\forall S \in \mathcal{S} ∖ \mathrm{Bad}, \forall T \in \mathcal{S} ∖ \{S\}, \mathrm{span} ([S \; T]) = \mathbb{R}^{d} .

1 - ∣ \mathcal{S} ∣^{2} t^{- C d (d - 1) /2} = 1 - t^{- Ω (d^{2})} .

\mathrm{span} ([S \; T]) = \mathbb{R}^{d} .

O (C_{P} (n) / s) \geq Ω (n),

\sum_{j = 1}^{d} a_{j} x_{j} = 1.

Ω (1) ∥ A x ∥_{2} \leq ∥ A x ∥_{2} \leq O (1) ∥ A x ∥_{2}

\widetilde{\tau}_{i}=\begin{cases}\tau_{i}^{\widetilde{A}}(A)&\text{if row $A^{i}$ is sampled in Step \ref{step:leverage_sample}},\\ \frac{1}{1+\frac{1}{\tau_{i}^{\widetilde{A}}(A)}}&\text{otherwise}.\end{cases}

(1 - ε) ∥ A x - b ∥_{2} \leq ∥ S (A x - b) ∥_{2} \leq (1 + ε) ∥ A x - b ∥_{2} .

(1 - ε) ∥ A^{(i)} x - b^{(i)} ∥_{1} \leq ∥ S^{(i)} A^{(i)} x - S^{(i)} b^{(i)} ∥_{1} \leq (1 + ε) ∥ A^{(i)} x - b^{(i)} ∥_{1},

(1 - ε) ∥ A^{(i)} x - b^{(i)} ∥_{1} \leq ∥ S^{(i)} A^{(i)} x - S^{(i)} b^{(i)} ∥_{1} \leq (1 + ε) ∥ A^{(i)} x - b^{(i)} ∥_{1} .

∥ A x - b ∥_{1} = \sum_{i = 1}^{s} ∥ S^{(i)} A^{(i)} x - S^{(i)} b^{(i)} ∥_{1} \geq \sum_{i = 1}^{s} (1 - ε) ∥ A^{(i)} x - b^{(i)} ∥_{1} = (1 - ε) ∥ A x - b ∥_{1}

∥ A x - b ∥_{1} = \sum_{i = 1}^{s} ∥ S^{(i)} A^{(i)} x - S^{(i)} b^{(i)} ∥_{1} \leq (1 + ε) \sum_{i = 1}^{s} ∥ A^{(i)} x - b^{(i)} ∥_{1} = (1 + ε) ∥ A x - b ∥_{1} .

Full text

The Communication Complexity of Optimization

Santosh S. Vempala was supported in part by NSF awards CCF-1717349 and DMS-1839323. Ruosong Wang and David P. Woodruff were supported in part by Office of Naval Research (ONR) grant N00014-18-1-2562. Part of this work was done while the authors were visiting the Simons Institute for the Theory of Computing.

Santosh S. Vempala

Georgia Tech

Ruosong Wang

Carnegie Mellon University

David P. Woodruff

Carnegie Mellon University

We consider the communication complexity of a number of distributed optimization problems. We start with the problem of solving a linear system. Suppose there is a coordinator together with $s$ servers $P_{1},\ldots,P_{s}$ , the $i$ -th of which holds a subset $A^{(i)}x=b^{(i)}$ of $n_{i}$ constraints of a linear system in $d$ variables, and the coordinator would like to output an $x\in\mathbb{R}^{d}$ for which $A^{(i)}x=b^{(i)}$ for $i=1,\ldots,s$ . We assume each coefficient of each constraint is specified using $L$ bits. We first resolve the randomized and deterministic communication complexity in the point-to-point model of communication, showing it is $\widetilde{\Theta}(d^{2}L+sd)$ and $\widetilde{\Theta}(sd^{2}L)$ , respectively. We obtain similar results for the blackboard communication model. As a result of independent interest, we show the probability a random matrix with integer entries in $\{-2^{L},\ldots,2^{L}\}$ is invertible is $1-2^{-\Theta(dL)}$ , whereas previously only $1-2^{-\Theta(d)}$ was known.

When there is no solution to the linear system, a natural alternative is to find the solution minimizing the $\ell_{p}$ loss, which is the $\ell_{p}$ regression problem. While this problem has been studied, we give improved upper or lower bounds for every value of $p\geq 1$ . One takeaway message is that sampling and sketching techniques, which are commonly used in earlier work on distributed optimization, are neither optimal in the dependence on $d$ nor on the dependence on the approximation $\varepsilon$ , thus motivating new techniques from optimization to solve these problems.

Towards this end, we consider the communication complexity of optimization tasks which generalize linear systems, such as linear, semidefinite, and convex programming. For linear programming, we first resolve the communication complexity when $d$ is constant, showing it is $\widetilde{\Theta}(sL)$ in the point-to-point model. For general $d$ and in the point-to-point model, we show an $\widetilde{O}(sd^{3}L)$ upper bound and an $\widetilde{\Omega}(d^{2}L+sd)$ lower bound. In fact, we show if one perturbs the coefficients randomly by numbers as small as $2^{-\Theta(L)}$ , then the upper bound is $\widetilde{O}(sd^{2}L)+\textrm{poly}(dL)$ , and so this bound holds for almost all linear programs. Our study motivates understanding the bit complexity of linear programming, which is related to the running time in the unit cost RAM model with words of $O(\log(nd))$ bits, and we give the fastest known algorithms for linear programming in this model.

1 Introduction

Large-scale optimization problems often cannot fit into a single machine, and so they are distributed across a number $s$ of machines. That is, each of servers $P_{1},\ldots,P_{s}$ may hold a subset of constraints that it is given locally as input, and the goal of the servers is to communicate with each other to find a solution satisfying all constraints. Since communication is often a bottleneck in distributed computation, the goal of the servers is to communicate as little as possible.

There are several different standard communication models, including the point-to-point model and the blackboard model. In the point-to-point model, each pair of servers can talk directly with each other. This is often more conveniently modeled by looking at the coordinator model, for which there is an extra server called the coordinator, and all communication must pass through the coordinator. This is easily seen to be equivalent, from a total communication perspective, to the point-to-point model up to a factor of $2$, for forwarding messages from server $P_{i}$ to server $P_{j}$, and a term of $\log s$ per message to indicate which server the message should be forwarded to. Another communication model is the blackboard model, in which there is a shared broadcast channel among all the $s$ servers. When a server sends a message, it is visible to each of the other $s-1$ servers and determines who speaks next, based upon an agreed-upon protocol. We mostly consider randomized communication, in which for every input, we require the coordinator to output the solution to the optimization problem with high probability. For linear systems we also consider deterministic communication complexity.
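The overhead of routing a point-to-point transcript through the coordinator can be made concrete with a small accounting sketch. It assumes one particular encoding, in which every message travels two hops (sender to coordinator, coordinator to receiver) and each hop carries the payload plus a server label of $\lceil \log_2 s \rceil$ bits; the function name `coordinator_cost` is hypothetical and introduced only for illustration.

```python
def coordinator_cost(message_bits, s):
    """Upper-bound the bits used when point-to-point messages are
    routed through a coordinator.

    message_bits: lengths (in bits) of the point-to-point messages.
    s: number of servers.
    Assumed encoding: two hops per message, each hop carrying the
    payload plus one server label.
    """
    label = (s - 1).bit_length()  # ceil(log2(s)) bits to name a server
    return sum(2 * (bits + label) for bits in message_bits)


# Two messages of 10 and 20 bits among s = 4 servers.
print(coordinator_cost([10, 20], 4))
```

Under this accounting, a transcript of total length $C$ bits with $m$ messages costs at most $2(C + m\lceil \log_2 s \rceil)$ bits in the coordinator model, which matches the factor of $2$ and the additive $\log s$ per message described above.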

A number of recent works in the theory community have studied specific optimization problems in such communication models, such as principal component analysis [43, 47, 17] and its kernel [11] and robust variants [71, 28], computing higher correlations [43], $\ell_{p}$ regression [67, 28] and sparse regression [19], estimating the mean of a Gaussian [73, 33, 19], database problems [36, 70], clustering [21], statistical problems [68], graph problems [54, 68] and many, many more.

There are also a large number of distributed learning and optimization papers, for example [10, 73, 75, 1, 18, 48, 72, 23, 55, 31, 29, 30, 40, 60, 59, 74, 6, 42]. With a few exceptions, these works do not study general communication complexity, but rather consider specific classes of algorithms. Namely, a number of these works only allow gradient and Hessian computations in each round, and do not allow arbitrary communication. Another aspect of these works is that they typically do not count total bit complexity, but rather only count number of rounds, whereas we are interested in total communication. In a number of optimization problems, the bit complexity of storing a single number in an intermediate computation may be as large as storing the entire original optimization problem. It is therefore infeasible to transmit such a number. While one could round this number, the effect of rounding is often unclear, and could destroy the desired approximation guarantee. One exception to the above is the work of [65], which studies the problem in which there are two servers, each holding a convex function, who would like to find a solution so as to minimize the sum of the two functions. The upper bounds are in a different communication model than ours, where the functions are added together, while the lower bounds only apply to a restricted class of protocols.

Noticeably absent from previous work is the communication complexity of solving linear systems, which is a fundamental primitive in many optimization tasks. Formally, suppose there is a coordinator together with $s$ servers $P_{1},\ldots,P_{s}$ , the $i$ -th of which holds a subset $A^{(i)}x=b^{(i)}$ of $n_{i}$ constraints of a $d$ -dimensional linear system, and the coordinator would like to output an $x\in\mathbb{R}^{d}$ for which $A^{(i)}x=b^{(i)}$ for $i=1,\ldots,s$ . We further assume each coefficient of each constraint is specified using $L$ bits. The first question we ask is the following.

Question 1.1.

What is the communication complexity of solving a linear system?

When there is no solution to the linear system, a natural alternative is to find the solution minimizing the $\ell_{p}$ loss, which is the $\ell_{p}$ regression problem $\min_{x\in\mathbb{R}^{d}}\|Ax-b\|_{p}$ , where for an $n$ -dimensional vector $y$ , $\|y\|_{p}=\left(\sum_{i=1}^{n}|y_{i}|^{p}\right)^{1/p}$ is its $\ell_{p}$ norm.

In the distributed $\ell_{p}$ regression problem, each server has a matrix $A^{(i)}\in\mathbb{R}^{n_{i}\times d}$ and a vector $b^{(i)}\in\mathbb{R}^{n_{i}}$ , and the coordinator would like to output an $x\in\mathbb{R}^{d}$ so that $\|Ax-b\|_{p}$ is approximately minimized, namely, that $\|Ax-b\|_{p}\leq(1+\varepsilon)\min_{x^{\prime}}\|Ax^{\prime}-b\|_{p}$ . Note that here $A\in\mathbb{R}^{n\times d}$ is the matrix obtained by stacking the matrices $A^{(1)},\ldots,A^{(s)}$ on top of each other, where $n=\sum_{i=1}^{s}n_{i}$ . Also, $b\in\mathbb{R}^{n}$ is the vector obtained by stacking the vectors $b^{(1)},\ldots,b^{(s)}$ on top of each other. We assume that each entry of $A$ and $b$ is an $L$ -bit integer, and we are interested in the randomized communication complexity of this problem.
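As a small illustration of the objective just defined, the sketch below evaluates $\|Ax-b\|_{p}$ for a system whose rows are split across servers, and checks the $(1+\varepsilon)$ guarantee against a given optimum value. It is plain Python with no communication modeled; the helper names `lp_norm`, `stacked_residual`, and `is_approx_minimizer` are hypothetical.

```python
def lp_norm(y, p):
    """l_p norm of a vector y for finite p >= 1."""
    return sum(abs(v) ** p for v in y) ** (1.0 / p)


def stacked_residual(blocks, x):
    """blocks: list of (A_i, b_i), one pair per server.

    Returns Ax - b for the system obtained by stacking the blocks.
    """
    residual = []
    for A_i, b_i in blocks:
        for row, rhs in zip(A_i, b_i):
            residual.append(sum(a * xj for a, xj in zip(row, x)) - rhs)
    return residual


def is_approx_minimizer(blocks, x, opt, p, eps):
    """Check ||Ax - b||_p <= (1 + eps) * opt for a known optimum value opt."""
    return lp_norm(stacked_residual(blocks, x), p) <= (1 + eps) * opt


# Two servers, one constraint each, in d = 2 variables.
blocks = [([[1, 0]], [1]), ([[0, 1]], [2])]
print(stacked_residual(blocks, [1, 2]))   # exact solution: zero residual
print(lp_norm(stacked_residual(blocks, [0, 0]), 1))
```

Because the residual is a concatenation of per-server residuals, $\|Ax-b\|_{p}^{p}$ is a sum of $s$ local terms, which is what makes the distributed formulation natural.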

While previous work [50, 67] has looked at the distributed $\ell_{p}$ regression problem, such work is based on two main ideas: sampling and sketching. Such techniques reduce a large optimization problem to a much smaller one, thereby allowing servers to send succinct synopses of their constraints in order to solve a global optimization problem.

Sampling and sketching are the key techniques of recent work on distributed low rank approximation [67, 43] and regression algorithms. A natural question, which will motivate our study of more complex optimization problems below, is whether other techniques in optimization can be used to obtain more communication-efficient algorithms for these problems.

Question 1.2.

Are there tractable optimization problems for which sampling and sketching techniques are suboptimal in terms of total communication?

To answer Question 1.2 it is useful to study optimization problems generalizing both linear systems and $\ell_{p}$ regression for certain values of $p$ . Towards this end, we consider the communication complexity of linear, semidefinite, and convex programming. Formally, in the linear programming problem, suppose there is a coordinator together with $s$ servers $P_{1},\ldots,P_{s}$ , the $i$ -th of which holds a subset $A^{(i)}x\leq b^{(i)}$ of $n_{i}$ constraints of a $d$ -dimensional linear system, and the coordinator, who holds a vector $c\in\mathbb{R}^{d}$ , would like to output an $x\in\mathbb{R}^{d}$ for which $c^{T}x$ is maximized subject to $A^{(i)}x\leq b^{(i)}$ for $i=1,\ldots,s$ . We further assume each coefficient of each constraint, as well as the objective function $c$ , is specified using $L$ bits.

Question 1.3.

What is the communication complexity of solving a linear program?

One could try to implement known linear programming algorithms as distributed protocols. The main challenge here is that known linear programming algorithms operate in the real RAM model of computation, meaning that basic arithmetic operations on real numbers can be performed in constant time. This is problematic in the distributed setting, since it might mean real numbers need to be communicated among the servers, resulting in protocols that could have infinite communication. Thus, controlling the bit complexity of the underlying algorithm is essential, and this motivates the study of linear programming algorithms in the unit cost RAM model of computation, meaning that a word is $O(\log(nd))$ bits, and only basic arithmetic operations on words can be performed in constant time. Such a model is arguably more natural than the real RAM model. If one were to analyze the fastest linear programming algorithms in the unit cost RAM model, their time complexity would blow up by $\operatorname{poly}(dL)$ factors, since the intermediate computations require manipulating numbers that grow exponentially large or small. Surprisingly, we are not aware of any work that has addressed this question:

Question 1.4.

What is the best possible running time of an algorithm for linear programming in the unit cost RAM model?

As far as time complexity is concerned, it is not even known if linear programming is inherently more difficult than just solving a linear system. Indeed, a long line of work on interior point methods, most recently [25], suggests that solving a linear program may not be substantially harder than solving a linear system. One could ask the same question for communication.

Question 1.5.

Is solving a linear program inherently harder than solving a linear system? What about just checking the feasibility of a linear program versus that of a linear system?

Recent Independent Work.

A recent independent work [7] also studies solving linear programs in the distributed setting, although their focus is to study the tradeoff between round complexity and communication complexity in low dimensions, while our focus is to study the communication complexity in arbitrary dimensions. Note, however, that we also provide nearly optimal bounds for constant dimensions for linear programming in both coordinator and blackboard models.

1.1 Our Contributions

We make progress on answering the above questions, with nearly tight bounds in many cases. For a function $f$ , we let $\widetilde{O}(f)=f\operatorname{polylog}(sndL/\varepsilon)$ and similarly define $\widetilde{\Theta}$ and $\widetilde{\Omega}$ .

1.1.1 Linear Systems

We begin with linear systems, for which we obtain a complete answer for both randomized and deterministic communication, in both coordinator and blackboard models of communication.

Theorem 1.6.

In the coordinator model, the randomized communication complexity of solving a linear system is $\widetilde{\Theta}(d^{2}L+sd)$ , while the deterministic communication complexity is $\widetilde{\Theta}(sd^{2}L)$ . In the blackboard model, both the randomized communication complexity and the deterministic communication complexity are $\widetilde{\Theta}(d^{2}L+s)$ .

Theorem 1.6 shows that randomization provably helps for solving linear systems. The theorem also shows that in the blackboard model the problem becomes substantially easier.

1.1.2 Approximate Linear Systems, i.e., $\ell_{p}$ Regression

We next study the $\ell_{p}$ regression problem in both the coordinator and blackboard models of communication. Finding a solution to a linear system is a special case of $\ell_{p}$ regression; indeed in the case that there is an $x$ for which $Ax=b$ we must return such an $x$ to achieve $(1+\varepsilon)$ relative error in objective function value for $\ell_{p}$ regression. Consequently, our lower bounds for linear systems apply also to $\ell_{p}$ regression for any $\varepsilon>0$ .

We first summarize our results in Table 1 and Table 2 for constant $\varepsilon$ . We state our results primarily for randomized communication. However, in the case of $\ell_{2}$ regression, we also discuss deterministic communication complexity.

One of the main takeaway messages from Table 1 is that sampling-based approaches, namely those based upon the so-called Lewis weights [27], would require $\Omega(d^{p/2})$ samples for $\ell_{p}$ regression when $p>2$, and thus at least that much communication. Another way to solve $\ell_{p}$ regression for $p>2$ is via sketching, as done in [67], but then the communication is $\Omega(n^{1-2/p})$. Our method, which is deeply tied to linear programming, discussed more below, solves this problem in $\widetilde{O}(sd^{3}L)$ communication. Thus, this gives a new method, departing from sampling and sketching techniques, which achieves much better communication. Our method involves embedding $\ell_{p}$ into $\ell_{\infty}$, and then using distributed algorithms for linear programming to solve $\ell_{\infty}$ regression.
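The last step, solving $\ell_{\infty}$ regression with a linear program, can be written out using the standard epigraph reformulation. This is a sketch of the textbook construction, not necessarily the exact program used in the paper; the auxiliary variable $t$ and the all-ones vector $\mathbf{1}$ are notation introduced here.

```latex
\min_{x \in \mathbb{R}^{d},\; t \in \mathbb{R}} \; t
\quad \text{subject to} \quad
A^{(i)} x - b^{(i)} \leq t \mathbf{1}, \qquad
-\bigl(A^{(i)} x - b^{(i)}\bigr) \leq t \mathbf{1}, \qquad
i = 1, \ldots, s .
```

At an optimal solution $t=\|Ax-b\|_{\infty}$. The program has $d+1$ variables, and server $P_{i}$ contributes $2n_{i}$ constraints that depend only on its own rows, so the instance has the same distributed form as the linear programming problem defined earlier.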

As with linear systems, one takeaway message from the results in Table 2 is that the problems have significantly more communication-efficient upper bounds in the blackboard model than in the coordinator model. Indeed, here we obtain tight bounds for $\ell_{1}$ and $\ell_{2}$ regression, matching those that are known for the easier problem of linear systems.

We next describe our results for non-constant $\varepsilon$ in both the coordinator model and the blackboard model. Here we focus on $\ell_{1}$ and $\ell_{2}$ , which illustrate several surprises.

One of the most interesting aspects of the results in Table 3 is our dependence on $\varepsilon$ for $\ell_{1}$ regression, where for small enough $\varepsilon$ relative to $sd$ , we achieve a $1/\varepsilon$ instead of a $1/\varepsilon^{2}$ dependence. We note that all sampling [27] and sketching-based solutions [67] to $\ell_{1}$ regression have a $1/\varepsilon^{2}$ dependence. Indeed, this dependence on $\varepsilon$ comes from basic concentration inequalities. In contrast, our approach is based on preconditioned first-order methods described in more detail below.

A takeaway message from Table 4 is that our lower bound shows some dependence on $\varepsilon$ is necessary both for $\ell_{1}$ and $\ell_{2}$ regression, provided $\varepsilon$ is not too small. This shows that in the blackboard model, one cannot obtain the same $\widetilde{O}(d^{2}L+s)$ upper bound for these problems as for linear systems, thereby separating their complexity from that of solving a linear system.

1.1.3 Linear Programming

One of our main technical ingredients is to recast $\ell_{p}$ regression problems as linear programming problems and develop the first communication-efficient solutions for distributed linear programming. Despite this problem being one of the most important problems that we know how to solve in polynomial time, we are not aware of any previous work considering its communication complexity in full generality besides a recent independent work [7].

First, when the dimension $d$ is constant, we obtain nearly optimal upper and lower bounds.

Theorem 1.7.

In constant dimensions, the randomized communication complexity of linear programming is $\widetilde{\Theta}(sL)$ in the coordinator model and $\widetilde{\Omega}(s+L)$ in the blackboard model. Our upper bounds allow the coordinator to output the solution vector $x\in\mathbb{R}^{d}$ , while the lower bounds hold already for testing if the linear program is feasible. Here the $\widetilde{\Theta}(\cdot)$ notation and the $\widetilde{\Omega}(\cdot)$ notation suppress only $\operatorname{polylog}(sL)$ factors.

Despite the fact that we do not have tight upper bounds matching the $\widetilde{\Omega}(s+L)$ lower bound in the blackboard model, under the additional assumption that each constraint in the linear program is placed on a random server, we develop an algorithm with a matching $\widetilde{O}(s+L)$ communication cost. Partitioning constraints randomly across servers is common in distributed computation; see, e.g., [8]. Nevertheless, we leave it as an open problem to remove this requirement in the blackboard model in constant dimensions.

For solving a linear system in constant dimensions, the randomized communication complexity is $\widetilde{\Theta}(s+L)$ in both models. Again, the $\widetilde{\Theta}(\cdot)$ notation suppresses only $\operatorname{polylog}(sL)$ factors. Thus, in the coordinator model, we separate the communication complexity of these problems. We can also separate the complexities in the blackboard model if we instead look at the feasibility problem. Here instead of requiring the coordinator to output the solution vector, we just want to see if the linear system or linear program is feasible. We have the following theorem for this.

Theorem 1.8.

In constant dimensions, the randomized communication complexity of checking whether a system of linear equations is feasible is $O(s\log L)$ in either the coordinator or blackboard model of communication.

Combining Theorem 1.7 and Theorem 1.8, we see that for feasibility in the blackboard model, linear programming requires $\widetilde{\Omega}(s+L)$ bits, while linear system feasibility takes $\widetilde{O}(s)$ bits, and thus we separate these problems in the blackboard model as well.

Returning to linear programs, we next consider the complexity in arbitrary dimensions.

Theorem 1.9.

In the coordinator model, the randomized communication complexity of exactly solving a linear program $\max\{c^{T}x\,:\,Ax\leq b\}$ with $n$ constraints in dimension $d$ and all coefficients specified by $L$ -bit numbers is $\widetilde{O}(sd^{3}L)$ . Moreover it is lower bounded by $\widetilde{\Omega}(d^{2}L+sd)$ . Here the upper and lower bounds require the coordinator to output the solution vector $x\in\mathbb{R}^{d}$ .

The lower bound in Theorem 1.9 just follows from our lower bound for linear systems. The upper bound is based on an optimized distributed cutting-plane algorithm. We describe the idea below.

While the upper bound is $\widetilde{O}(sd^{3}L)$, one can further improve it as follows. We show that if the coefficients of $A$ in the input to the linear program are perturbed independently by i.i.d. discrete Gaussians with variance as small as $2^{-\Theta(L)}$, then we can improve the upper bound for solving this perturbed problem to $\widetilde{O}(sd^{2}L+d^{4}L)$, where now the success probability of the algorithm is taken over both the randomness of the algorithm and the random input instance, which is formed by a random perturbation of a worst-case instance. Note that this is an improvement for sufficiently large $s$. Our model coincides with the well-studied smoothed complexity model of linear programming [61, 14, 62]. However, a major difference is that the variance of the perturbation needs to be at least inverse polynomial in their works, whereas we allow our variance to be as small as $2^{-\Theta(L)}$.

Theorem 1.10.

In the smoothed complexity model with discrete Gaussians of variance $2^{-\Theta(L)}$, the communication complexity of exactly solving a linear program $\max\{c^{T}x\,:\,Ax\leq b\}$ with $n$ constraints in dimension $d$ and all coefficients specified by $L$-bit numbers, with probability at least $9/10$ over the input distribution and randomness of the protocol, is $\widetilde{O}(sd^{2}L+d^{4}L)$ in the coordinator model.

While our focus in this paper is on communication, our upper bounds also give a new technique for improving the time complexity in the unit cost RAM model of linear programming, where arithmetic operations on words of size $O(\log(nd))$ can be performed in constant time. For this fundamental problem we obtain the fastest known algorithm even in the non-smoothed setting of linear programming.

Theorem 1.11.

The time complexity of solving an $n\times d$ linear program with $L$ -bit coefficients is $\widetilde{O}(nd^{\omega}L+\operatorname{poly}(dL))$ in the unit cost RAM model.

We note that this is for solving an LP exactly in the RAM model with words of size $O(\log(nd))$ bits. The current fastest linear programming algorithms [45, 46, 25] state the bounds in terms of additive error $\varepsilon$, which incurs a multiplicative factor of at least $\Omega(dL)$ to solve the problem exactly. Also, such algorithms manipulate large numbers at intermediate points in the algorithm, which are at least $L$ bits, and on which a single operation could take $\Omega(L)$ time. It seems that transferring such results to the unit cost RAM model with $O(\log(nd))$-bit words incurs time at least $\Omega(nd^{2.5}L^{2}+d^{\omega+1.5}L^{2})$. This holds true even of the recent work [25], which focuses on the setting $n=O(d)$ and does not improve the leading $nd^{2.5}L^{2}$ term. Even such a bit-complexity bound needs careful checking of the number of bits required, as recent improvements use sophisticated inverse maintenance methods to save on the number of operations (an exercise that was carried out thoroughly for the Ellipsoid method in [34]).

1.1.4 Implications for Convex Optimization and Semidefinite Programming

Our upper bounds also extend to more general convex optimization problems. For these, we must modify the problem statement to finding an $\varepsilon$ -additive approximation rather than the exact solution. We obtain the following upper bound for a convex program in $\mathbb{R}^{d}$ .

Theorem 1.12.

The communication complexity of solving the convex optimization problem $\min\{c^{T}x\,:\,x\in\bigcap_{i}K_{i}\}$ for convex sets $K_{i}\subseteq RB^{n}$ , one per server, to within an additive error $\varepsilon$ , i.e., finding a point $y$ s.t. $c^{T}y\leq OPT+\varepsilon$ and $y\in\bigcap_{i}K_{i}+\varepsilon B^{n}$ is $O(sd^{2}\log(Rd/\varepsilon)\log d)$ .

If the objective function is not known to all servers, we incur an additional $O(sdL)$ communication. For semidefinite programs with $d\times d$ symmetric matrices and $n$ linear constraints this gives a bound of $\widetilde{O}(sd^{4}\log(1/\varepsilon))$ . Note that we can simply send all the constraints to one server in $O(nd^{2}L)$ communication, so this is always an upper bound.

1.2 Our Techniques

1.2.1 Linear Systems

To solve linear systems in the distributed setting, the coordinator can go through the servers one by one. The coordinator and all servers maintain the same set $C$ of linearly independent linear equations. For each server $P_{i}$ , if there is a linear equation stored by $P_{i}$ that is linearly independent with linear equations in $C$ , then $P_{i}$ sends that linear equation to all other servers and adds that linear equation into $C$ . In the end, $C$ will be a maximal set of linearly independent equations, and thus the coordinator can simply solve the linear equations in $C$ . This protocol is deterministic and has communication complexity $O(sd^{2}L)$ in the coordinator model and $O(s+d^{2}L)$ in the blackboard model, since at most $d$ linear equations will be added into the set $C$ .
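The deterministic protocol above can be sketched in a few lines. This is a minimal illustration, not the paper's formal protocol: the names `reduce_against` and `deterministic_protocol` are hypothetical, equations are represented as integer lists $[a_{1},\ldots,a_{d},b]$ , and independence is tested with exact rational arithmetic against an echelon basis of $C$ .

```python
from fractions import Fraction

def reduce_against(basis, row):
    """Eliminate `row` against an echelon basis of C; return the remainder."""
    row = [Fraction(v) for v in row]
    for pivot, b in basis:
        if row[pivot] != 0:
            factor = row[pivot] / b[pivot]
            row = [r - factor * x for r, x in zip(row, b)]
    return row

def deterministic_protocol(servers):
    """servers[i] lists the equations [a_1, ..., a_d, b] held by server P_i.

    Returns the shared set C and the number of equations that were broadcast.
    """
    basis, C, sent = [], [], 0
    for equations in servers:            # the coordinator visits servers in turn
        for eq in equations:
            rem = reduce_against(basis, eq)
            pivot = next((j for j, v in enumerate(rem) if v != 0), None)
            if pivot is not None:        # eq is linearly independent of C
                basis.append((pivot, rem))
                C.append(list(eq))
                sent += 1                # this equation goes to all other servers
    return C, sent

# Two servers, two unknowns: only two equations ever need to be sent.
servers = [[[1, 0, 1], [2, 0, 2]], [[0, 1, 3], [1, 1, 4]]]
C, sent = deterministic_protocol(servers)
```

For a feasible system in $d$ variables the counter `sent` never exceeds $d$ , which is where the $d^{2}L$ factor in the bounds comes from: each broadcast equation has $d+1$ coefficients of $L$ bits.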

In fact, the preceding protocol is optimal for deterministic protocols, even just for testing the feasibility of linear systems. To prove lower bounds, we first prove the following new theorem about random matrices which may be of independent interest.

Theorem 1.13 (Informal version of Theorem 3.1).

Let $R$ be a $d\times d$ matrix with i.i.d. random integer entries in $\{-2^{L},\ldots,2^{L}\}$ . The probability that $R$ is invertible is $1-2^{-\Theta(dL)}$ .

The previous best known probability bound was only $1-2^{-\Theta(d)}$ [63, 16]; we stress that the results of [16] are not sufficient to prove our stronger bound with the extra factor of $L$ in the exponent, which is crucial for our lower bound. (We have verified this with Philip Matchett Wood, who is an author of [16]. The issue is that in their Corollary 1.2, they have an explicit constraint on the cardinality of the set $S$ , i.e., $|S|=O(1)$ . In their Theorem 2.2, it is assumed that $|S|=n^{o(n)}$ . Thus, as far as we are aware, there are no known results sufficient to prove our singularity probability bound.)

With Theorem 1.13, in Lemma 3.3, we use the probabilistic method to construct a set $\mathcal{H}\subseteq\mathbb{R}^{d\times d}$ of $|\mathcal{H}|=2^{\Omega(d^{2}L)}$ matrices with integral entries in $[-2^{L},2^{L}]$ , such that for any distinct $S,T\in\mathcal{H}$ , $S^{-1}e_{d}\neq T^{-1}e_{d}$ , where $e_{d}$ is the $d$ -th standard basis vector.

Now consider any deterministic protocol for testing the feasibility of linear systems. Suppose the linear system on the $i$ -th server is $H_{i}x=e_{d}$ for some $H_{i}\in\mathcal{H}$ , then the entire linear system is feasible if and only if $H_{1}=H_{2}=\ldots=H_{s}$ . This is equivalent to the problem in which each server receives a binary string of length $\log(|\mathcal{H}|)$ , and the goal is to test whether all strings are the same or not. In the coordinator model, a deterministic lower bound of $\Omega(s\log(|\mathcal{H}|))$ for this problem can be proved using the symmetrization technique in [54, 69], which gives an optimal $\Omega(sd^{2}L)$ lower bound. An optimal $\Omega(s+d^{2}L)$ deterministic lower bound can also be proved in the blackboard model. The formal analysis is given in Section 3.3.

For solving linear systems, an $\Omega(d^{2}L)$ lower bound holds even for randomized algorithms in the coordinator model. When there is only a single server which holds a linear system $Hx=e_{d}$ for some $H\in\mathcal{H}$ , in order for the coordinator to know the solution $x=H^{-1}e_{d}$ , a standard information-theoretic argument shows that $\log(|\mathcal{H}|)$ bits of communication are necessary, which gives an $\Omega(d^{2}L)$ lower bound. This idea is formalized in Section 3.4. A natural question is whether the $O(sd^{2}L)$ upper bound is optimal for randomized protocols.

We first show that in order to test feasibility, it is possible to achieve a communication complexity of $O(sd^{2}\log(dL))$ , which can be exponentially better than the bound for deterministic protocols. The idea is to use hashing. With randomness, the servers can first agree on a random prime number $p$ , and test the feasibility over the finite field $\mathbb{F}_{p}$ . It suffices to have the prime number $p$ randomly generated from the range $[2,\operatorname{poly}(dL)]$ , and thus the $L$ factor in the communication complexity of deterministic protocols can be improved to $\log p=O(\log(dL))$ . However, it is still unclear if solving linear systems in the coordinator model will require $\Omega(sd^{2}L)$ bits of communication for randomized protocols.
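The hashing idea can be illustrated with a small sketch that reduces all coefficients modulo a prime and compares the rank of $A$ with the rank of $[A~b]$ over $\mathbb{F}_{p}$ . The helper names (`random_prime`, `rank_mod_p`, `feasible_mod_p`) and the trial-division primality test are assumptions made for illustration only; the sketch runs on a single machine and does not model the communication itself.

```python
import random

def is_prime(n):
    """Trial division; adequate for the small primes used in this sketch."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def random_prime(upper, rng):
    """Uniformly random prime in [2, upper], by rejection sampling."""
    while True:
        p = rng.randint(2, upper)
        if is_prime(p):
            return p

def rank_mod_p(rows, p):
    """Rank of an integer matrix over F_p via Gaussian elimination."""
    m = [[v % p for v in row] for row in rows]
    rank = 0
    for col in range(len(m[0]) if m else 0):
        pivot = next((r for r in range(rank, len(m)) if m[r][col]), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        inv = pow(m[rank][col], -1, p)        # modular inverse (Python 3.8+)
        m[rank] = [(v * inv) % p for v in m[rank]]
        for r in range(len(m)):
            if r != rank and m[r][col]:
                f = m[r][col]
                m[r] = [(a - f * b) % p for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

def feasible_mod_p(A, b, p):
    """Ax = b is solvable over F_p iff rank(A) == rank([A b]) over F_p."""
    augmented = [row + [bi] for row, bi in zip(A, b)]
    return rank_mod_p(A, p) == rank_mod_p(augmented, p)
```

A fixed prime can give the wrong answer: the system $x_{1}=0,\ x_{1}=7$ is infeasible over the reals but `feasible_mod_p([[1, 0], [1, 0]], [0, 7], 7)` reports it as feasible, because $7\equiv 0 \pmod 7$ . Drawing $p$ at random from a large enough range is what makes such collisions unlikely.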

Quite surprisingly, we show that $O(sd^{2}L)$ is not the optimal bound for randomized protocols, and the optimal bound is $\widetilde{\Theta}(d^{2}L+sd)$ . In the deterministic protocol with communication complexity $O(sd^{2}L)$ , most communication is wasted on synchronizing the set $C$ , which requires the servers to send linear equations to all other servers. In our new protocol, only the coordinator maintains the set $C$ . The issue now, however, is that the servers no longer know which linear equation they own is linearly independent with those equations in $C$ . On the other hand, each server can simply generate a random linear combination of all linear equations it owns. We can show that if a server does have a linear equation that is linearly independent with those in $C$ , with constant probability, the random linear combination is also linearly independent with those in $C$ , and thus the coordinator can add the random linear combination into $C$ . Notice that taking random linear combinations to preserve the rank of a matrix is a special case of dimensionality reduction or sketching, which comes up in a number of applications, see, for example compressed sensing [20, 13], data streams [4], and randomized numerical linear algebra [66]. Here though, a crucial difference is that we just need the fact that if a set of vectors $S$ is not contained in the span of another set of vectors $V$ , then a random linear combination of the vectors in $S$ is also not in the span of $V$ with high probability. This allows us to adaptively take as few linear combinations as possible to solve the linear system, enabling us to achieve much lower communication than would be possible by just sketching the linear systems at each server and non-adaptively combining them.
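A minimal sketch of the random-combination idea follows. Several choices here are assumptions made for illustration and are not taken from the formal protocol in Section 4.2: coefficients are drawn uniformly from $\{0,1\}$ , a server is abandoned after a fixed number of consecutive combinations that fall inside $\mathrm{span}(C)$ , and every server is assumed to hold at least one equation. With $0/1$ coefficients, if some local equation lies outside $\mathrm{span}(C)$ then flipping its coefficient changes whether the combination lies in $\mathrm{span}(C)$ , so each attempt succeeds with probability at least $1/2$ .

```python
import random
from fractions import Fraction

def remainder(basis, row):
    """Reduce `row` against the coordinator's echelon basis of C."""
    row = [Fraction(v) for v in row]
    for pivot, b in basis:
        if row[pivot] != 0:
            f = row[pivot] / b[pivot]
            row = [r - f * x for r, x in zip(row, b)]
    return row

def random_combination(equations, rng):
    """Server side: a random 0/1 combination of the local equations."""
    coeffs = [rng.randint(0, 1) for _ in equations]
    width = len(equations[0])
    return [sum(c * eq[j] for c, eq in zip(coeffs, equations))
            for j in range(width)]

def randomized_protocol(servers, trials=20, seed=0):
    """Only the coordinator keeps C; servers send random combinations.

    A server is abandoned after `trials` consecutive combinations that lie
    in span(C) (a hypothetical stopping rule chosen for this sketch).
    """
    rng = random.Random(seed)
    basis, C = [], []
    for equations in servers:
        misses = 0
        while misses < trials:
            combo = random_combination(equations, rng)
            rem = remainder(basis, combo)
            pivot = next((j for j, v in enumerate(rem) if v != 0), None)
            if pivot is None:
                misses += 1              # combination carried no new information
            else:
                basis.append((pivot, rem))
                C.append(combo)
                misses = 0
    return C

# Five equations of rank 3, all satisfied by x = (1, 2, 5).
servers = [[[1, 0, 0, 1], [0, 1, 0, 2], [1, 1, 0, 3]],
           [[0, 0, 1, 5], [1, 0, 1, 6]]]
C = randomized_protocol(servers)
```

Each equation in `C` is a sum of original equations, so any solution of the original system also satisfies `C`. Once `C` reaches the rank of the full system, solving `C` solves the original system.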

If we implement this protocol naïvely, then the communication complexity will be $\widetilde{O}(d^{2}L+sdL)$ , since at most $d$ linear equations will be added into $C$ , and there is an $\widetilde{O}(dL)$ communication complexity associated with each of them. Furthermore, even if a server does not have any linear equation that is linearly independent with $C$ , it still needs to send random linear combinations to the coordinator, which would require $\widetilde{O}(sdL)$ communication. To improve this further to $\widetilde{O}(sd)$ , we can still use the hashing trick mentioned before. If a server generates a random linear combination, it can first test whether the linear combination is linearly independent with $C$ over the finite field $\mathbb{F}_{p}$ , for a random prime $p$ chosen in $[2,\operatorname{poly}(dL)]$ . This will reduce the communication complexity to $\widetilde{O}(d)$ for each test. If the linear equation is indeed linearly independent with $C$ , then the server sends the original linear equation (without taking the residue modulo $p$ ) to the coordinator. Again the total communication complexity for sending the original linear equations is upper bounded by $O(d^{2}L)$ . Thus, the total communication complexity is upper bounded by $\widetilde{O}(d^{2}L+sd)$ . See Section 4.2 for the formal analysis.

By a reduction from the OR of $s-1$ copies of the two-server set-disjointness problem to solving linear systems, we can prove an extra $\widetilde{\Omega}(sd)$ lower bound, which holds even for testing feasibility of linear systems. Here the idea is to interpret vectors in $\{0,1\}^{d}$ as characteristic vectors of subsets of $[d]$ . One of the servers will fix the solution of the linear system to be a predefined vector $x$ . Each server $P_{i}$ has a single linear equation $a_{i}^{T}x=1$ . By interpreting vectors as sets, $a_{i}^{T}x=1$ implies the set represented by $a_{i}$ and $x$ are intersecting. Thus, the servers are actually solving the OR of $s-1$ copies of the two-server set-disjointness problem, which is known to have $\widetilde{\Omega}(sd)$ communication complexity [54, 68]. This lower bound is formally given in Section 3.4.
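The set interpretation used in this reduction rests on a simple fact: the inner product of two characteristic vectors equals the size of the intersection of the corresponding sets. The short sketch below illustrates this with hypothetical helper names and zero-based element indices.

```python
def characteristic_vector(subset, d):
    """0/1 vector of length d marking the elements of `subset` of {0, ..., d-1}."""
    return [1 if j in subset else 0 for j in range(d)]

def inner(a, x):
    """Inner product of two equal-length vectors."""
    return sum(u * v for u, v in zip(a, x))

d = 6
x = characteristic_vector({1, 3, 4}, d)        # the predefined solution vector
a_hit = characteristic_vector({0, 3}, d)       # shares exactly element 3 with x
a_miss = characteristic_vector({0, 2, 5}, d)   # disjoint from x
```

Here `inner(a_hit, x)` is $1$ while `inner(a_miss, x)` is $0$ , so an equation $a_{i}^{T}x=1$ can only be satisfied when the two sets intersect.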

1.2.2 Linear Regression

For an $\ell_{2}$ regression instance $\min_{x}\|Ax-b\|_{2}$ , the optimal solution can be calculated using the normal equations, i.e., the optimal solution $x$ satisfies $A^{T}Ax=A^{T}b$ . This already gives a simple yet nearly optimal deterministic protocol for $\ell_{2}$ regression in the coordinator model: the coordinator calculates $A^{T}A$ and $A^{T}b$ using only $\widetilde{O}(sd^{2}L)$ bits of communication by collecting the covariance matrices from each server and summing them up. The $\widetilde{O}(sd^{2}L)$ communication complexity matches our lower bound for solving linear systems for deterministic protocols in the coordinator model. However, when implemented in the blackboard model, the communication complexity of this protocol is still $\widetilde{O}(sd^{2}L)$ . To improve this bound, we first show how to efficiently obtain approximations to leverage scores in both models. Our protocol is built upon the algorithm in [26], but implemented in a distributed manner. The resulting algorithm has $\widetilde{O}(sd^{2}L)$ communication complexity in the coordinator model but only $\widetilde{O}(s+d^{2}L)$ communication complexity in the blackboard model. With approximate leverage scores, the coordinator can then sample $\widetilde{O}(d/\varepsilon^{2})$ rows of the matrix $A$ to obtain a subspace embedding, at which point it will be easy to calculate a $(1+\varepsilon)$ -approximate solution to the $\ell_{2}$ regression problem. The number of sampled rows can be further improved to $\widetilde{O}(d/\varepsilon)$ using Sárlos’s argument [57] since solving $\ell_{2}$ regression does not necessarily require a full $(1+\varepsilon)$ subspace embedding, which results in a protocol with communication complexity $\widetilde{O}(s+d^{2}L/\varepsilon)$ in the blackboard model. Full details can be found in Section 6.
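The normal-equations protocol from the start of this paragraph can be sketched as follows. Each server sends only the $d\times d$ matrix $(A^{(i)})^{T}A^{(i)}$ and the vector $(A^{(i)})^{T}b^{(i)}$ , and the coordinator sums them and solves exactly. The function names are hypothetical, and the sketch assumes that $A$ has full column rank so that $A^{T}A$ is invertible.

```python
from fractions import Fraction

def local_summary(A_i, b_i):
    """Server side: the d x d matrix A_i^T A_i and the vector A_i^T b_i."""
    d = len(A_i[0])
    gram = [[sum(r[j] * r[k] for r in A_i) for k in range(d)] for j in range(d)]
    atb = [sum(r[j] * y for r, y in zip(A_i, b_i)) for j in range(d)]
    return gram, atb

def solve(M, v):
    """Solve M x = v exactly by Gauss-Jordan elimination (M assumed invertible)."""
    n = len(M)
    aug = [[Fraction(x) for x in row] + [Fraction(y)] for row, y in zip(M, v)]
    for col in range(n):
        piv = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[piv] = aug[piv], aug[col]
        aug[col] = [x / aug[col][col] for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

def coordinator_l2(summaries):
    """Coordinator side: sum the summaries and solve A^T A x = A^T b."""
    d = len(summaries[0][1])
    G = [[sum(s[0][j][k] for s in summaries) for k in range(d)] for j in range(d)]
    g = [sum(s[1][j] for s in summaries) for j in range(d)]
    return solve(G, g)

# Points on the line y = 1 + 2t, split across two servers (rows are [1, t]).
summaries = [local_summary([[1, 0], [1, 1]], [1, 3]),
             local_summary([[1, 2]], [5])]
x_opt = coordinator_l2(summaries)
```

The message size per server is independent of the number of rows $n_{i}$ it holds, which is why the total cost scales with $sd^{2}$ rather than with $n$ .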

One may wonder if the dependence on $1/\varepsilon$ is necessary for solving $\ell_{2}$ regression in the blackboard model. In Section 5, we show that some dependence on $1/\varepsilon$ is actually necessary. We show an $\Omega(d/\sqrt{\varepsilon})$ lower bound whenever $s>\Omega(1/\sqrt{\varepsilon})$ . The hardness follows from the fact that if the matrix $A$ satisfies $A^{(i)}=I$ for all $i\in[s]$ , then the optimal solution is just the average of $b^{(1)},b^{(2)},\ldots,b^{(s)}$ . Thus, if we can get a sufficiently good approximation to the $\ell_{2}$ regression problem, then we can actually recover the sum of $b^{(1)},b^{(2)},\ldots,b^{(s)}$ , at which point we can resort to a known communication complexity lower bound in the blackboard model [54]. This argument will also give an $\Omega(d/\varepsilon)$ lower bound for $(1+\varepsilon)$ -approximate $\ell_{1}$ regression in the blackboard model, whenever $s>\Omega(1/\varepsilon)$ . The formal analysis can be found in Section 5.
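The claim that the optimum is the average can be checked directly. When $A^{(i)}=I$ for every server, the squared objective $\|Ax-b\|_{2}^{2}$ splits into one term per server, and setting the gradient to zero gives the mean:

```latex
\min_{x}\ \sum_{i=1}^{s}\bigl\|x-b^{(i)}\bigr\|_{2}^{2}
\quad\Longrightarrow\quad
2\sum_{i=1}^{s}\bigl(x^{*}-b^{(i)}\bigr)=0
\quad\Longrightarrow\quad
x^{*}=\frac{1}{s}\sum_{i=1}^{s}b^{(i)} .
```

Since $s$ is known, recovering $x^{*}$ accurately enough is the same as recovering the sum $\sum_{i}b^{(i)}$ .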

For $\ell_{1}$ regression, we can no longer use the normal equations. However, we can obtain approximations to $\ell_{1}$ Lewis weights by using approximations to leverage scores, as shown in [27]. With approximate $\ell_{1}$ Lewis weights of the $A$ matrix, the coordinator can then obtain a $(1+\varepsilon)$ $\ell_{1}$ subspace embedding by sampling $\widetilde{O}(d/\varepsilon^{2})$ rows. This will give an $O(sd^{2}L+d^{2}L/\varepsilon^{2})$ upper bound for $(1+\varepsilon)$ -approximate $\ell_{1}$ regression in the coordinator model, and an $O(s+d^{2}L/\varepsilon^{2})$ upper bound in the blackboard model. It is unclear if the number of sampled rows can be further reduced since there is no known $\ell_{1}$ version of Sárlos’s argument. A natural question is whether the $1/\varepsilon^{2}$ dependence is optimal. We show that the dependence on $\varepsilon$ can be further improved to $1/\varepsilon$ , by using optimization techniques, or more specifically, first-order methods. Despite the fact that the objective function of $\ell_{1}$ regression is neither smooth nor strongly-convex, it is known that by using Nesterov’s Accelerated Gradient Descent and smoothing reductions [51], one can solve $\ell_{1}$ regression using only $O(1/\varepsilon)$ full gradient calculations. On the other hand, the complexity of first-order methods usually has dependences on various parameters of the input matrix $A$ , which can be unbounded in the worst case. Fortunately, recent developments in $\ell_{1}$ regression [32] show how to precondition the matrix $A$ by simply doing an $\ell_{1}$ Lewis weights sampling, and then rotating the matrix appropriately. 
By carefully combining this preconditioning procedure with Accelerated Gradient Descent, we obtain an algorithm for $(1+\varepsilon)$ -approximate $\ell_{1}$ regression with communication complexity $\widetilde{O}(sd^{3}L/\varepsilon)$ in the coordinator model, which shows it is indeed possible to improve the $\varepsilon$ dependence for $\ell_{1}$ regression. A formal analysis is given in Section 7.

For general $\ell_{p}$ regression, if we still use Lewis weights sampling, then the number of sampled rows and thus the communication complexity will be $\Omega(d^{p/2})$ . Even worse, when $p=\infty$ , Lewis weights sampling will require an unbounded number of samples. However, $\ell_{\infty}$ regression can be easily formulated as a linear program, which we show how to solve exactly in the distributed setting. Inspired by this approach, we further develop a general reduction from $\ell_{p}$ regression to linear programming. Our idea is to use the max-stability of exponential random variables [3] to embed $\ell_{p}$ into $\ell_{\infty}$ , write the optimization problem in $\ell_{\infty}$ as a linear program and then solve the problem using linear program solvers. However, such embeddings based on exponential random variables usually produce heavy-tailed random variables and make the dilation bound hard to analyze. Here, since our goal is just to solve a linear regression problem, we only need the dilation bound for the optimal solution of the regression problem. The formal analysis in Section 8 shows that $(1+\varepsilon)$ -approximate $\ell_{p}$ regression can be reduced to solving a linear program with $\widetilde{O}(d/\varepsilon^{2})$ variables, which implies a communication protocol for $\ell_{p}$ regression without the $\Omega(d^{p/2})$ dependence.
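Max-stability can be illustrated numerically. For i.i.d. standard exponentials $E_{1},\ldots,E_{n}$ , the quantity $\max_{i}|y_{i}|^{p}/E_{i}$ has the same distribution as $\|y\|_{p}^{p}/E$ for a single standard exponential $E$ , so an $\ell_{\infty}$ norm of a randomly rescaled vector carries information about $\|y\|_{p}$ . The sketch below is only a sanity check of this property, not the embedding analyzed in Section 8; the function names and the median-based estimator are assumptions made for illustration.

```python
import math
import random
import statistics

def embed_max(y, p, rng):
    """One draw of max_i |y_i| / E_i^(1/p) with E_i i.i.d. standard exponentials."""
    return max(abs(v) / rng.expovariate(1.0) ** (1.0 / p) for v in y)

def estimate_lp_norm(y, p, draws=20001, seed=0):
    """Estimate ||y||_p from the median of repeated draws of embed_max.

    The p-th power of each draw is distributed as ||y||_p^p / E, and the
    median of a standard exponential E is ln 2, so the median of the draws
    is ||y||_p / (ln 2)^(1/p).
    """
    rng = random.Random(seed)
    med = statistics.median(embed_max(y, p, rng) for _ in range(draws))
    return med * math.log(2) ** (1.0 / p)

estimate = estimate_lp_norm([3.0, 4.0], p=2)   # the true l_2 norm is 5
```

A single draw is heavy-tailed, since $1/E$ has no finite mean; this is the difficulty with the dilation bound mentioned above, and it is why the sketch uses a median rather than an average.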

1.2.3 Linear and Convex Programs

We adapt two different algorithms from the literature for efficient communication and implement them in the distributed setting. The first is Clarkson’s algorithm, which works by sampling $O(d^{2})$ constraints in each iteration and finds an optimal solution to this subset; the sampling weights are maintained implicitly. In each iteration the total communication is $O(d^{3}L)$ for gathering the constraints and an additional $\widetilde{O}(sd^{2}L)$ per round to send the solution to this subset of constraints to all servers. This solution is used to update the sampling weights. Clarkson’s algorithm has the nice guarantee that it needs only $O(d\log n)$ rounds with high probability. A careful examination of this algorithm shows that the bit complexity of the computation (not the communication) is dominated by checking whether a proposed solution satisfies all constraints, i.e., computing $Ax$ for a given $x$ . We show this can be done with time complexity $\widetilde{O}(nd^{\omega}L)$ in the unit cost RAM model and this is the leading term of the claimed time bound.

Notice that the $\widetilde{O}(sd^{3}L)$ term in the communication complexity of Clarkson’s algorithm comes from the fact that the protocol needs to send an optimal solution $x^{*}$ of a linear program with size $O(d^{2})\times d$ for a total of $O(d\log n)$ times. However, when each server $P_{i}$ receives $x^{*}$ , all $P_{i}$ will do is to check whether $x^{*}$ satisfies the constraints stored on $P_{i}$ or not. Notice that here entries in the constraints have bit complexity $L$ , whereas the solution vector $x^{*}$ has bit complexity $\widetilde{O}(dL)$ for each entry. Intuitively, for most linear programs, we don’t need such a high precision for the solution vector $x^{*}$ . This leads to the idea of smoothed analysis. We show that if the coefficients of $A$ in the input to the linear program are perturbed independently by i.i.d. discrete Gaussians with variance as small as $2^{-\Theta(L)}$ , then we can improve the upper bound for solving this perturbed problem to $\widetilde{O}(sd^{2}L+d^{4}L)$ . The reason here is that with Gaussian noise, we can round each entry of the solution vector $x^{*}$ to have bit complexity $\widetilde{O}(L)$ , which would suffice for verifying whether $x^{*}$ satisfies the constraints or not, for most linear programs. Full details regarding Clarkson’s algorithm and the smoothed analysis model can be found in Section 10.

One minor drawback of Clarkson’s algorithm is that it has a dependence on $\log n$ . In constant dimensions, our $\widetilde{\Omega}(s+L)$ lower bound in the blackboard model holds only when $n=2^{\Omega(L)}$ , in which case the communication complexity of Clarkson’s algorithm will be $\widetilde{O}(sL+L^{2})$ .

Under the additional assumption that each constraint in the linear program is placed on a random server, we develop an algorithm with communication complexity $\widetilde{O}(s+L)$ in the blackboard model. To achieve this goal, we modify Seidel’s classical algorithm and implement it in the distributed setting. Seidel’s algorithm benefits from the additional assumption from two aspects. On the one hand, Seidel’s classical algorithm needs to go through all the constraints in a random order, which can be easily achieved now since all constraints are placed on a random server. On the other hand, Seidel’s classical algorithm needs to make a recursive call each time it finds one of $d$ constraints that determines the optimal solution, and will make $\sum_{i=1}^{n}d/i=\Theta(d\log n)$ recursive calls in expectation. To implement Seidel’s algorithm in the distributed setting, each time we find one of the $d$ constraints that determines the optimal solution, the current server also needs to broadcast that constraint. Thus, naïvely we need to broadcast $O(d\log n)$ constraints during the execution, which would result in $O(s+L\log n)$ communication. Under the additional assumption, with good probability, the first server $P_{1}$ stores at least $\Omega(n/s)$ constraints. Since the first server $P_{1}$ does not need to make any recursive calls or broadcasts, the total number of recursive calls (and thus broadcasts) will be $\sum_{i=\Omega(n/s)}^{n}d/i=\Theta(d\log s)$ . The formal analysis is given in Section 12.
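The two harmonic sums in this argument are easy to evaluate numerically. The sketch below compares the full sum $\sum_{i=1}^{n}d/i$ with the truncated sum that starts at $i=n/s$ ; the values of `n`, `d` and `s` are arbitrary illustrative choices, and the function name is hypothetical.

```python
import math

def expected_broadcasts(n, d, first=1):
    """Sum of d/i for i = first, ..., n."""
    return sum(d / i for i in range(first, n + 1))

n, d, s = 1_000_000, 3, 100
naive = expected_broadcasts(n, d)              # grows like d * ln(n)
skipped = expected_broadcasts(n, d, n // s)    # grows like d * ln(s)
```

With these values `naive` is close to $d(\ln n+\gamma)$ , where $\gamma\approx 0.5772$ is the Euler-Mascheroni constant, while `skipped` is close to $d\ln s$ . The saving comes entirely from the first $n/s$ constraints, which the first server processes without broadcasting.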

For convex programming, we have to use a more general algorithm. We use a refined version of the classical center-of-gravity method. The basic idea is to round violated constraints that are used as cutting planes to $O(d\log d)$ bits. We optimize over the ellipsoid method in the following two ways. First, we round the violated constraint sent in each iteration by locally maintaining an ellipsoid to ensure the rounding error does not affect the algorithm. Roughly speaking, each server maintains a well-rounded current feasible set, and the number of bits needed in each round is thus only $\widetilde{O}(d)$ . Secondly, we use the center of gravity method to make sure the volume is cut by a constant factor rather than a $(1-1/d)$ factor in each iteration, even when constraints are rounded. See Section 11 for the formal analysis.

2 Preliminaries

2.1 Notation

For $m$ matrices $A^{(1)}\in\mathbb{R}^{d\times n_{1}},A^{(2)}\in\mathbb{R}^{d\times n_{2}},\ldots,A^{(m)}\in\mathbb{R}^{d\times n_{m}}$ , we use $[A^{(1)}~{}A^{(2)}~{}\cdots~{}A^{(m)}]$ to denote the matrix in $\mathbb{R}^{d\times(n_{1}+n_{2}+\cdots+n_{m})}$ whose first $n_{1}$ columns are the same as $A^{(1)}$ , the next $n_{2}$ columns are the same as $A^{(2)}$ , …, and the last $n_{m}$ columns are the same as $A^{(m)}$ .

For a matrix $A\in\mathbb{R}^{n\times d}$ , we use $\mathrm{span}(A)=\{Ax\mid x\in\mathbb{R}^{d}\}$ to denote the subspace spanned by the columns of the matrix $A$ . For a set of vectors $S\subseteq\mathbb{R}^{d}$ , we use $\mathrm{span}(S)$ to denote the subspace spanned by the vectors in $S$ . For a set of linear equations $C$ , we also use $\mathrm{span}(C)$ to denote all linear combinations of linear equations in $C$ . We use $A_{i}$ to denote the $i$ -th column of $A$ and $A^{i}$ to denote the $i$ -th row of $A$ . We use $A^{\dagger}$ to denote the Moore-Penrose inverse of $A$ . We use ${\operatorname{rank}}(A)$ to denote the rank of $A$ over the real numbers and ${\operatorname{rank}}_{p}(A)$ to denote the rank of $A$ over the finite field $\mathbb{F}_{p}$ .

For a vector $x\in\mathbb{R}^{d}$ , we use $\|x\|_{p}=\left(\sum_{i=1}^{d}|x_{i}|^{p}\right)^{1/p}$ to denote its $\ell_{p}$ norm. For two vectors $x$ and $y$ , we use $\langle x,y\rangle$ to denote their inner product.

For matrices $A$ and $B$ , we say $A\approx_{\kappa}B$ if and only if

[TABLE]

where $\preceq$ refers to the Löwner partial ordering of matrices, i.e., $A\preceq B$ if $B-A$ is positive semi-definite.

2.2 Models of Computation and Problem Settings

We study the distributed linear regression problem in two distributed models: the coordinator model (a.k.a. the message passing model) and the blackboard model. The coordinator model represents distributed computation systems with point-to-point communication, while the blackboard model represents those where messages can be broadcast to each party.

In the coordinator model, there are $s\geq 2$ servers $P_{1},P_{2},\ldots,P_{s}$ , and one coordinator. These $s$ servers can directly send messages to the coordinator through a two-way private channel. The computation is in terms of rounds: at the beginning of each round, the coordinator sends a message to some of the $s$ servers, and then each of those servers that have been contacted by the coordinator sends a message back to the coordinator.

In the alternative blackboard model, the coordinator is simply a blackboard where the $s$ servers $P_{1},P_{2},\ldots,P_{s}$ can share information; in other words, if one server sends a message to the coordinator/blackboard then the other $s-1$ servers can see this information without further communication. The order for the servers to send messages is decided by the contents of the blackboard.

For both models we measure the communication cost which is defined to be the total number of bits sent through the channels.

In the distributed linear system problem, there is a data matrix $A\in\mathbb{R}^{n\times d}$ and a vector $b$ of observed values. All entries in $A$ and $b$ are integers in $[-2^{L},2^{L}]$ , where $L$ is the bit complexity. The matrix $[A~{}b]$ is distributed row-wise among the $s$ servers $P_{1},P_{2},\ldots,P_{s}$ . More specifically, for each server $P_{i}$ , there is a matrix $[A^{(i)}~{}b^{(i)}]$ stored on $P_{i}$ , which is a subset of rows of $[A~{}b]$ . Here we assume $\{[A^{(1)}~{}b^{(1)}],[A^{(2)}~{}b^{(2)}],\ldots,[A^{(s)}~{}b^{(s)}]\}$ is a partition of all rows in $[A~{}b]$ . The goal of the feasibility testing problem is to design a protocol, such that upon termination of the protocol, the coordinator reports whether the linear system $Ax=b$ is feasible or not. The goal of the linear system solving problem is to design a protocol, such that upon termination of the protocol, either the coordinator outputs a vector $x^{*}\in\mathbb{R}^{d}$ , such that $Ax^{*}=b$ , or the coordinator reports the linear system $Ax=b$ is infeasible. It can be seen that the linear system solving problem is strictly harder than the feasibility testing problem.

In the distributed linear regression problem, there is a data matrix $A\in\mathbb{R}^{n\times d}$ and a vector $b$ of observed values, which is distributed in the same way as in the distributed linear system problem. The goal of the distributed $\ell_{p}$ regression problem is to design a protocol, such that upon termination of the protocol, the coordinator outputs a vector $x^{*}\in\mathbb{R}^{d}$ to minimize $\|Ax-b\|_{p}$ .

In the distributed linear programming problem, there is a matrix $A\in\mathbb{R}^{n\times d}$ and a vector $b$ , which is distributed in the same way as in the distributed linear system problem. The goal of the feasibility testing problem is to design a protocol, such that upon termination of the protocol, the coordinator reports whether the linear program $Ax\leq b$ is feasible or not. In the linear programming solving problem, the goal is to design a protocol, such that upon termination of the protocol, the coordinator outputs a vector $x^{*}\in\mathbb{R}^{d}$ such that $Ax^{*}\leq b$ is satisfied. There can also be a vector $c\in\mathbb{R}^{d}$ which is known to all servers, and in this case the goal is to minimize (or maximize) $\langle c,x\rangle$ under the constraint that $Ax\leq b$ .

2.3 Row Sampling Algorithms

Definition 2.1 ([26]).

Given a matrix $A\in\mathbb{R}^{n\times d}$ , the leverage score of a row $A^{i}$ is defined to be

[TABLE]

Given another matrix $B\in\mathbb{R}^{n^{\prime}\times d}$ , the generalized leverage score of a row $A^{i}$ w.r.t. $B$ is defined to be

[TABLE]

Definition 2.2 ([27]).

Given a matrix $A\in\mathbb{R}^{n\times d}$ , the $\ell_{1}$ Lewis weights $\{\overline{w}_{i}\}_{i=1}^{n}$ are the unique weights such that for each $i\in[n]$ we have

[TABLE]

where $\overline{W}$ is the diagonal matrix formed by putting $\{\overline{w}_{i}\}_{i=1}^{n}$ on the diagonal.

Theorem 2.1 ( $\ell_{2}$ Matrix Concentration Bound, Lemma 4 in [26]).

There exists an absolute constant $C$ such that for any matrix $A\in\mathbb{R}^{n\times d}$ and any set of sampling values $p_{i}$ satisfying

[TABLE]

if we generate a matrix $S$ with $N=\sum_{i=1}^{n}p_{i}$ rows, each chosen independently as the $i$ -th basis vector, times $p_{i}^{-1/2}$ with probability $p_{i}/N$ , then with probability at least $0.99$ , for all vectors $x\in\mathbb{R}^{d}$ ,

[TABLE]

Theorem 2.2 ( $\ell_{1}$ Matrix Concentration Bound, Theorem 7.1 in [27]).

There exists an absolute constant $C$ such that for any matrix $A\in\mathbb{R}^{n\times d}$ and any set of sampling values $p_{i}$ satisfying

[TABLE]

if we generate a matrix $S$ with $N=\sum_{i=1}^{n}p_{i}$ rows, each chosen independently as the $i$ -th basis vector, times $p_{i}^{-1}$ with probability $p_{i}/N$ , then with probability at least $0.99$ , for all vectors $x\in\mathbb{R}^{d}$ ,

[TABLE]

Here $\{\overline{w}_{i}\}_{i=1}^{n}$ are the $\ell_{1}$ Lewis weights of the matrix $A$ .

3 Communication Complexity Lower Bound for Linear Systems

3.1 The Hard Instance

In this section, we construct a family of matrices, which will be used to prove a communication complexity lower bound in the subsequent section.

We first introduce generalized binomial distributions.

Definition 3.1.

For any $0\leq\mu\leq 1$ , let $\mathcal{B}^{(\mu)}\in\{-1,0,1\}$ be a random variable which takes $+1$ or $-1$ with probability $\mu/2$ each, and $0$ with probability $1-\mu$ . Let $\mathcal{B}_{t}^{(\mu)}$ be a random variable with the same distribution as the sum of $t$ i.i.d. copies of $\mathcal{B}^{(\mu)}$ . For simplicity we use $\mathcal{B}$ and $\mathcal{B}_{t}$ to denote $\mathcal{B}^{(1)}$ and $\mathcal{B}_{t}^{(1)}$ , respectively.
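Definition 3.1 translates directly into a sampler. The sketch below draws $\mathcal{B}^{(\mu)}$ and $\mathcal{B}_{t}^{(\mu)}$ and fills a matrix with i.i.d. copies of $\mathcal{B}_{t}^{(\mu)}$ , as in the random matrices used later in this section; the function names are hypothetical.

```python
import random

def sample_B(mu, rng):
    """One draw of B^(mu): +1 or -1 with probability mu/2 each, otherwise 0."""
    u = rng.random()
    if u < mu / 2:
        return 1
    if u < mu:
        return -1
    return 0

def sample_B_t(t, mu, rng):
    """One draw of B_t^(mu): the sum of t i.i.d. copies of B^(mu)."""
    return sum(sample_B(mu, rng) for _ in range(t))

def random_matrix(rows, cols, t, rng, mu=1.0):
    """Matrix with i.i.d. B_t^(mu) entries; every entry is an integer in [-t, t]."""
    return [[sample_B_t(t, mu, rng) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
M = random_matrix(4, 3, t=10, rng=rng)
```

Every entry is an integer of absolute value at most $t$ , which matches the entry range $[-t,t]$ required of the matrices in Lemma 3.2 and Lemma 3.3.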

We need the following theorem on the singularity probability of discrete random matrices.

Theorem 3.1.

Let $M_{n}\in\mathbb{R}^{n\times n}$ be a matrix whose entries are i.i.d. random variables with the same distribution as $\mathcal{B}_{t}$ . Then, for sufficiently large $t$ ,

[TABLE]

where $C>0$ is an absolute constant.

The proof of Theorem 3.1 closely follows previous approaches for bounding the singularity probability of random $\pm 1$ matrices (see, e.g., [41, 63, 64, 16]). For completeness, we include a proof of Theorem 3.1 in Section 13.

Lemma 3.2.

For any $d>0$ and sufficiently large $t$ , there exists a set of matrices $\mathcal{T}\subseteq\mathbb{R}^{d\times(d-1)}$ with integral entries in $[-t,t]$ for which $|\mathcal{T}|=t^{\Omega(d^{2})}$ and

1. For any $T\in\mathcal{T}$ , ${\operatorname{rank}}(T)=d-1$ ;

2. For any $S,T\in\mathcal{T}$ such that $S\neq T$ , $\mathrm{span}([S~{}T])=\mathbb{R}^{d}$ .

Proof.

We use the probabilistic method to prove existence. We use $\mathsf{Bad}\subset\mathbb{R}^{d\times(d-1)}$ to denote the set

[TABLE]

where $X\in\mathbb{R}^{d}$ is a vector whose entries are i.i.d. random variables with the same distribution as $\mathcal{B}_{t}$ and $C$ is the constant in Theorem 3.1.

Consider a random matrix $A\in\mathbb{R}^{d\times(d-1)}$ whose entries are i.i.d. random variables with the same distribution as $\mathcal{B}_{t}$ , we have

[TABLE]

since otherwise, if we use $X\in\mathbb{R}^{d}$ to denote a vector whose entries are i.i.d. random variables with the same distribution as $\mathcal{B}_{t}$ , we have

[TABLE]

which violates Theorem 3.1.

For any fixed $A\in\mathbb{R}^{d\times(d-1)}\setminus\mathsf{Bad}$ , consider a random matrix $B\in\mathbb{R}^{d\times(d-1)}$ whose entries are i.i.d. random variables with the same distribution as $\mathcal{B}_{t}$ . We have,

[TABLE]

which follows from the definition of $\mathsf{Bad}$ and the independence of columns of $B$ .

Now we construct a multiset $\mathcal{S}$ of $|\mathcal{S}|=t^{Cd(d-1)/6}$ matrices chosen with replacement, each of dimension $d\times(d-1)$ and with i.i.d. entries having the same distribution as $\mathcal{B}_{t}$ . By (1) and linearity of expectation, we have

[TABLE]

We use $\mathcal{E}_{1}$ to denote the event that

[TABLE]

which holds with probability at least $3/4$ by using Markov’s inequality.

We use $\mathcal{E}_{2}$ to denote the event that

[TABLE]

Using a union bound and (2), the probability that $\mathcal{E}_{2}$ holds is at least

[TABLE]

Thus by a union bound, the probability that both $\mathcal{E}_{1}$ and $\mathcal{E}_{2}$ hold is strictly larger than zero, which implies there exists a set $\mathcal{S}$ such that $\mathcal{E}_{1}$ and $\mathcal{E}_{2}$ hold simultaneously. Now we consider $\mathcal{T}=\mathcal{S}\setminus\mathsf{Bad}$ . Since $\mathcal{E}_{1}$ holds, we have $|\mathcal{T}|\geq t^{\Omega(d^{2})}$ . $\mathcal{E}_{2}$ implies that all elements in $\mathcal{T}$ are distinct, and furthermore for any $S,T\in\mathcal{T}$ such that $S\neq T$ , we have

[TABLE]

∎

Lemma 3.3.

For any $d>0$ and sufficiently large $t$ , there exists a set of matrices $\mathcal{H}\subseteq\mathbb{R}^{d\times d}$ with integral entries in $[-t,t]$ for which $|\mathcal{H}|=t^{\Omega(d^{2})}$ and

1. For any $T\in\mathcal{H}$ , $T$ is non-singular;

2. For any $S,T\in\mathcal{H}$ with $S\neq T$ , $S^{-1}e_{d}\neq T^{-1}e_{d}$ , where $e_{d}$ is the $d$ -th standard basis vector.

Proof.

Consider the matrix set $\mathcal{T}$ constructed in Lemma 3.2. For each $T\in\mathcal{T}$ , we add $[T~{}e_{i}]^{T}$ into $\mathcal{H}$ , where $e_{i}$ is the $i$ -th standard basis vector such that $e_{i}\notin\mathrm{span}(T)$ . Clearly $[T~{}e_{i}]^{T}$ is non-singular since ${\operatorname{rank}}(T)=d-1$ and $e_{i}\notin\mathrm{span}(T)$ .

Now suppose there exist distinct $S,T\in\mathcal{H}$ such that $S^{-1}e_{d}=T^{-1}e_{d}$ , which means there exists some $x\in\mathbb{R}^{d}$ such that $Sx=e_{d}$ and $Tx=e_{d}$ . Since the first $d-1$ rows of $S$ and $T$ are $(S^{\prime})^{T}$ and $(T^{\prime})^{T}$ for some distinct $S^{\prime},T^{\prime}\in\mathcal{T}$ , and the first $d-1$ entries of $e_{d}$ are zero, this implies $(S^{\prime})^{T}x=0$ and $(T^{\prime})^{T}x=0$ . However, $x$ must then be $0^{d}$ since $\mathrm{span}([S^{\prime}~{}T^{\prime}])=\mathbb{R}^{d}$ , which implies $Sx=Tx=0\neq e_{d}$ , a contradiction. Thus for any distinct $S,T\in\mathcal{H}$ , $S^{-1}e_{d}\neq T^{-1}e_{d}$ . ∎

3.2 Deterministic Lower Bound for the Equality Problem

In this section, we prove our deterministic communication complexity lower bound for the Equality problem in the coordinator model, which will be used as an intermediate problem in Section 3.3. In the Equality problem, each server $P_{i}$ receives a binary string $t_{i}\in\{0,1\}^{n}$ . The goal is to test whether $t_{1}=t_{2}=\ldots=t_{s}$ . We will prove an $\Omega(sn)$ lower bound for deterministic communication protocols.

The case $s=2$ has a well-known $\Omega(n)$ lower bound.

Lemma 3.4 (See, e.g., [44, p11]).

Any deterministic protocol for solving the Equality problem with $s=2$ requires $\Omega(n)$ bits of communication.

Our plan is to reduce the case $s=2$ to the case $s>2$ , using the symmetrization technique [54, 69]. Suppose there exists a deterministic communication protocol $\mathcal{P}$ for the Equality problem with $s$ servers, and the communication complexity of $\mathcal{P}$ is $C_{\mathcal{P}}(n)$ where $n$ is the length of the binary strings received by the servers. We show how to solve the case $s=2$ using $\mathcal{P}$ .

Suppose Alice receives a binary string $x\in\{0,1\}^{n-1}$ and Bob receives a binary string $y\in\{0,1\}^{n-1}$ . We show that by using the protocol $\mathcal{P}$ , they can judge whether $x=y$ or not using $O(C_{\mathcal{P}}(n)/s)$ communication. Thus by Lemma 3.4, we must have

$$C_{\mathcal{P}}(n)/s=\Omega(n),$$

which implies $C_{\mathcal{P}}(n)=\Omega(sn)$ .

Since $\mathcal{P}$ is deterministic, by averaging, there exists a fixed server $P_{i}$ and a fixed set $S\subseteq\{0,1\}^{n}$ with size $|S|=2^{n-1}$ , such that for any $t\in S$ , when all servers have the same input $t$ , the total communication between $P_{i}$ and the coordinator is upper bounded by $2C_{\mathcal{P}}(n)/s$ . Now we fix a bijection $g:S\to\{0,1\}^{n-1}$ . Alice plays the role of server $P_{i}$ in $\mathcal{P}$ , and sets the input of $P_{i}$ to be $g(x)$ . Bob plays the role of the coordinator and all servers $P_{j}$ for $j\neq i$ , and sets the input of $P_{j}$ to be $g(y)$ for all $j\neq i$ . To simulate the protocol $\mathcal{P}$ , Alice and Bob need to communicate if and only if server $P_{i}$ needs to communicate with the coordinator. If the total amount of communication between Alice and Bob exceeds $2C_{\mathcal{P}}(n)/s$ then they terminate and return $x\neq y$ . Alice and Bob return $x=y$ if and only if the protocol $\mathcal{P}$ returns $g(x)=g(y)$ .

Now we analyze the correctness and the efficiency of the reduction. When $x=y$ , we have $g(x)=g(y)\in S$ , and by definition of $P_{i}$ and $S$ , the total communication between $P_{i}$ and the coordinator, and thus that between Alice and Bob, is upper bounded by $2C_{\mathcal{P}}(n)/s$ . Also the protocol must return $g(x)=g(y)$ due to the correctness of $\mathcal{P}$ . When $x\neq y$ , either the total amount of communication between Alice and Bob exceeds $2C_{\mathcal{P}}(n)/s$ , in which case they return $x\neq y$ , or $\mathcal{P}$ finishes within this budget and returns $g(x)\neq g(y)$ due to its correctness.

Formally, we have proved the following theorem.

Theorem 3.5.

Any deterministic protocol for solving the Equality problem with $s$ servers in the coordinator model requires $\Omega(sn)$ bits of communication.

3.3 Deterministic Lower Bound for Testing Feasibility of Linear Systems

In this section, we prove our deterministic communication complexity lower bound for testing the feasibility of linear systems, in the coordinator model and the blackboard model.

Theorem 3.6.

For any deterministic protocol $\mathcal{P}$ ,

•

If $\mathcal{P}$ can test whether $Ax=b$ is feasible or not in the coordinator model, then the communication complexity of $\mathcal{P}$ is $\Omega(sd^{2}L)$ ;

•

If $\mathcal{P}$ can test whether $Ax=b$ is feasible or not in the blackboard model, then the communication complexity of $\mathcal{P}$ is $\Omega(s+d^{2}L)$ ;

Proof.

Consider the set $\mathcal{H}$ constructed in Lemma 3.3 with $t=2^{L}$ . In the hard instance, each server $P_{i}$ receives a matrix $H_{i}\in\mathcal{H}$ . The linear system stored on each server is just $H_{i}x=e_{d}$ . Due to Lemma 3.3, the entire linear system is feasible if and only if $H_{1}=H_{2}=\ldots=H_{s}$ . Since $|\mathcal{H}|=2^{\Omega(d^{2}L)}$ , we can reduce the Equality problem in Section 3.2 to testing the feasibility of a linear system, with $n=\Theta(d^{2}L)$ . By Theorem 3.5, this implies an $\Omega(sd^{2}L)$ lower bound in the coordinator model.

In the blackboard model, the $\Omega(d^{2}L)$ bound follows from the case when $s=2$ . When $s=2$ , the blackboard model is essentially the same as the coordinator model, up to constants in the communication complexity. The $\Omega(s)$ lower bound follows from the fact that each server needs to communicate at least $1$ bit. ∎
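The hard instance in this proof can be illustrated on a toy example. The following is a minimal sketch, assuming two hand-picked $2\times 2$ matrices `H1` and `H2` (hypothetical, not taken from the set $\mathcal{H}$ of Lemma 3.3) that are non-singular and satisfy $H_{1}^{-1}e_{d}\neq H_{2}^{-1}e_{d}$ . The helper names `rank`, `feasible` and `stacked_instance` are illustrative only. Feasibility is decided exactly by comparing the rank of the coefficient matrix with the rank of the augmented matrix.

```python
from fractions import Fraction


def rank(rows):
    """Exact rank over the rationals via Gaussian elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    if not m:
        return 0
    r = 0
    for c in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            f = m[i][c] / m[r][c]
            m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


def feasible(A, b):
    """Ax = b is feasible iff rank(A) == rank([A | b])."""
    aug = [list(row) + [bi] for row, bi in zip(A, b)]
    return rank(A) == rank(aug)


def stacked_instance(matrices):
    """Stack the per-server systems H_i x = e_d into one system."""
    d = len(matrices[0])
    e_d = [0] * (d - 1) + [1]
    A = [row for H in matrices for row in H]
    b = e_d * len(matrices)
    return A, b


# Hypothetical toy matrices: H1^{-1} e_2 = (0, 1) and H2^{-1} e_2 = (-1, 1).
H1 = [[1, 0], [0, 1]]
H2 = [[1, 1], [0, 1]]

all_equal = feasible(*stacked_instance([H1, H1, H1]))
one_differs = feasible(*stacked_instance([H1, H2, H1]))
print(all_equal, one_differs)
```

In this toy case the stacked system is feasible exactly when every server holds the same matrix, which is the property the reduction from Equality relies on.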

3.4 Randomized Lower Bound for Solving Linear Systems

In this section, we prove randomized communication complexity lower bounds for solving linear systems. We first prove an $\Omega(d^{2}L)$ lower bound, which already holds for the case $s=2$ . When $s=2$ the coordinator model and the blackboard model are equivalent in terms of communication complexity, and thus we shall not distinguish these two models in the remaining part of this proof.

Consider the set $\mathcal{H}$ constructed in Lemma 3.3 with $t=2^{L}$ . In the hard instance, only server $P_{1}$ receives a matrix $H\in\mathcal{H}$ , and the goal is to let the coordinator output the solution to the linear system $Hx=e_{d}$ . For any two $H_{1},H_{2}\in\mathcal{H}$ and $H_{1}\neq H_{2}$ , we must have $H_{1}^{-1}e_{d}\neq H_{2}^{-1}e_{d}$ . Thus, by standard information-theoretic arguments, in order for the coordinator to output the solution to $Hx=e_{d}$ , the communication complexity is at least $\Omega(\log(|\mathcal{H}|))=\Omega(d^{2}L)$ .

Formally, we have proved the following theorem.

Theorem 3.7.

Any randomized protocol that succeeds with probability at least $0.99$ for solving linear systems requires $\Omega(d^{2}L)$ bits of communication in the coordinator model and the blackboard model. The lower bound holds even when $s=2$ .

Now we prove another lower bound of $\widetilde{\Omega}(sd)$ for solving linear systems in the coordinator model. In the hard instance, the last server $P_{s}$ receives a vector $\widehat{x}\in\{0,1\}^{d}$ , and the linear equations stored on server $P_{s}$ are simply $x=\widehat{x}$ , i.e., the solution vector $x$ should be exactly $\widehat{x}$ . This forces the solution vector to be some predefined binary vector $\widehat{x}$ . The remaining $s-1$ servers each receive a vector $a_{i}\in\{0,1\}^{d}$ , and the linear equation stored on $P_{i}$ is

$$\langle a_{i},x\rangle=0.$$

Also, it is guaranteed that for each $i\in[s-1]$ , $\langle a_{i},\widehat{x}\rangle=0$ or $1$ .

Here we interpret the vector $\widehat{x}$ as the characteristic vector of a set $S_{\widehat{x}}\subseteq[d]$ , and interpret each vector $a_{i}$ also as the characteristic vector of a set $S_{a_{i}}$ . Thus, testing the feasibility of the linear system is equivalent to testing whether the set $S_{\widehat{x}}$ owned by the server $P_{s}$ is disjoint from the set owned by every other player, which is the OR of $s-1$ copies of the two-player set-disjointness problem. The communication complexity for the latter problem has been studied in [54, 68]. Combining Lemma 2.2 in [54] with Theorem 1 in [68], for any communication protocol that succeeds with probability $1-1/s^{3}$ , the communication complexity is lower bounded by $\Omega(sd)$ . By standard repetition arguments, this implies for any randomized communication protocol that succeeds with probability at least $0.99$ , the communication complexity is lower bounded by $\Omega(sd/\log s)$ .
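The equivalence between feasibility and set disjointness can be checked exhaustively for small $d$ . The sketch below assumes that the equation held by each of the first $s-1$ servers is $\langle a_{i},x\rangle=0$ , which is what the disjointness interpretation suggests; the function names are illustrative only.

```python
from itertools import product


def system_is_feasible(x_hat, a_vectors):
    """P_s forces x = x_hat, so the system is feasible iff <a_i, x_hat> = 0 for all i.

    Assumption: each remaining server holds the single equation <a_i, x> = 0.
    """
    return all(sum(a * x for a, x in zip(a_i, x_hat)) == 0 for a_i in a_vectors)


def sets_are_disjoint(x_hat, a_vectors):
    """Interpret the 0/1 vectors as characteristic vectors of subsets of [d]."""
    s_x = {j for j, v in enumerate(x_hat) if v}
    return all(s_x.isdisjoint({j for j, v in enumerate(a_i) if v}) for a_i in a_vectors)


def equivalence_holds(d, num_others):
    """Exhaustively compare the two predicates for all 0/1 inputs."""
    vectors = list(product((0, 1), repeat=d))
    for x_hat in vectors:
        for a_vectors in product(vectors, repeat=num_others):
            if system_is_feasible(x_hat, a_vectors) != sets_are_disjoint(x_hat, a_vectors):
                return False
    return True


print(equivalence_holds(d=3, num_others=2))
```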

Combining this lower bound and the trivial $\Omega(s)$ lower bound in the blackboard model with Theorem 3.7, we have the following theorem.

Theorem 3.8.

Any randomized protocol that succeeds with probability at least $0.99$ for solving linear systems requires $\widetilde{\Omega}(sd+d^{2}L)$ bits of communication in the coordinator model and $\Omega(s+d^{2}L)$ bits of communication in the blackboard model.

4 Communication Protocols for Linear Systems

4.1 Testing Feasibility of Linear Systems

In this section we present a randomized communication protocol for testing feasibility of linear systems, which has communication complexity $O(sd^{2}\log(dL))$ in the coordinator model and $O(s+d^{2}\log(dL))$ in the blackboard model. The protocol is described in Figure 1.

We first bound the communication complexity of the protocol in Figure 1. Clearly, Step 1 has communication complexity at most $O(s\log(dL))$ . During the execution of the whole protocol, at most $d$ linear equations will be added into $C$ . Since every coefficient is reduced modulo $p$ before being sent, the communication complexity for sending each linear equation is $O(sd\log p)$ in the coordinator model, and $O(d\log p)$ in the blackboard model, where $\log p=O(\log(dL))$ . Thus, the total communication complexity is $O(sd^{2}\log(dL))$ in the coordinator model, and $O(s+d^{2}\log(dL))$ in the blackboard model.

To prove the correctness of this protocol, we need the following lemma.

Lemma 4.1.

Let $A\in\mathbb{R}^{m\times n}$ be a matrix in which each entry is an integer in $[-2^{L},2^{L}]$ and ${\operatorname{rank}}(A)=r$ . Suppose $p$ is chosen uniformly at random from all prime numbers in $[2,\operatorname{poly}(rL)]$ . Then

(i) ${\operatorname{rank}}_{p}(A)\leq{\operatorname{rank}}(A)$ ;

(ii) With probability at least $0.99$ , ${\operatorname{rank}}_{p}(A)={\operatorname{rank}}(A)$ .

Proof.

The point (i) is immediate. For point (ii), there exists a square submatrix $A^{\prime}$ of $A$ with size $r\times r$ which is non-singular over real numbers, which implies the determinant of $A^{\prime}$ is non-zero as a real number. Since all entries of $A^{\prime}$ are integers in $[-2^{L},2^{L}]$ , the determinant of $A^{\prime}$ as a real number is an integer in $[-r!2^{rL},r!2^{rL}]$ . Thus, the determinant of $A^{\prime}$ has at most $\operatorname{poly}(rL)$ prime factors. According to the Prime Number Theorem, there are at least $n$ distinct prime numbers in the range $[2,n^{2}]$ , for sufficiently large $n$ . Thus, by adjusting constants, $p$ is not a prime factor of the determinant of $A^{\prime}$ with probability at least $0.99$ , in which case ${\operatorname{rank}}(A)={\operatorname{rank}}_{p}(A)$ . ∎
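To make the role of the prime $p$ concrete, the following sketch computes ${\operatorname{rank}}_{p}$ by Gaussian elimination over $\mathbb{F}_{p}$ . The example matrix is a hypothetical one chosen so that its determinant is $6=2\cdot 3$ : the rank drops only for the two primes dividing the determinant. The function name `rank_mod_p` is illustrative, and the three-argument `pow` with exponent `-1` assumes Python 3.8 or later.

```python
def rank_mod_p(rows, p):
    """Rank of an integer matrix over the finite field F_p (p prime)."""
    m = [[v % p for v in row] for row in rows]
    if not m:
        return 0
    r = 0
    for c in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        inv = pow(m[r][c], -1, p)  # modular inverse of the pivot
        m[r] = [(v * inv) % p for v in m[r]]
        for i in range(len(m)):
            if i != r and m[i][c] != 0:
                f = m[i][c]
                m[i] = [(a - f * b) % p for a, b in zip(m[i], m[r])]
        r += 1
    return r


# Rank 2 over the reals; determinant 6, so only p = 2 and p = 3 are "bad" primes.
A = [[2, 0], [0, 3]]
ranks = {p: rank_mod_p(A, p) for p in (2, 3, 5, 7)}
print(ranks)
```

The rank over $\mathbb{F}_{p}$ never exceeds the real rank, and it is strictly smaller only for primes dividing the determinant of the chosen $r\times r$ submatrix, which is why a random prime from a large enough range is good with high probability.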

Notice that the protocol in Figure 1 is basically testing the feasibility of the linear system over the finite field $\mathbb{F}_{p}$ , for a randomly chosen prime number $p$ . Before the execution of the $i$ -th loop of Step 4, the set $C$ is a maximal set of linearly independent equations for all linear equations stored on the first $i-1$ servers $P_{1},P_{2},\ldots,P_{i-1}$ . Here the linear independence is defined over the finite field $\mathbb{F}_{p}$ . During the execution of the $i$ -th loop of Step 4, server $P_{i}$ considers each linear equation stored on itself one by one, sends the linear equation to all other servers and adds the linear equation to set $C$ if that linear equation is linearly independent with all existing linear equations in $C$ . If server $P_{i}$ finds that the set $C$ becomes infeasible after adding linear equations stored on $P_{i}$ , then $P_{i}$ terminates the protocol.

Consider a linear system $Ax=b$ where $A\in\mathbb{R}^{n\times d}$ and $b\in\mathbb{R}^{n}$ and all entries of $A$ and $b$ are integers in the range $[-2^{L},2^{L}]$ . If $Ax=b$ is feasible over the real numbers, then it will also be feasible over the finite field $\mathbb{F}_{p}$ . If $Ax=b$ is infeasible, then we have ${\operatorname{rank}}([A~{}b])>{\operatorname{rank}}(A)$ . By Lemma 4.1, ${\operatorname{rank}}_{p}(A)\leq{\operatorname{rank}}(A)$ , and since ${\operatorname{rank}}([A~{}b])\leq d+1$ , with probability at least $0.99$ , ${\operatorname{rank}}_{p}([A~{}b])={\operatorname{rank}}([A~{}b])$ , which implies with probability $0.99$ , $Ax=b$ is still infeasible over the finite field $\mathbb{F}_{p}$ . Since the protocol in Figure 1 tests the feasibility of the linear system over the finite field $\mathbb{F}_{p}$ , the correctness follows.

Formally, we have proved the following theorem.

Theorem 4.2.

The protocol in Figure 1 is a randomized protocol for testing feasibility of linear systems and has communication complexity $O(sd^{2}\log(dL))$ in the coordinator model and $O(s+d^{2}\log(dL))$ in the blackboard model. The protocol succeeds with probability at least $0.99$ .

4.2 Solving Linear Systems

In this section we present communication protocols for solving linear systems. We start with deterministic protocols, in which case we can get a protocol with communication complexity $O(sd^{2}L)$ in the coordinator model and $O(s+d^{2}L)$ in the blackboard model.

In order to solve linear systems, we can still use the protocol in Figure 1, but we do not use the prime number $p$ any more. In Step 4a of the protocol, we no longer check the feasibility over the finite field. In Step 4b of the protocol, we no longer take the residue modulo $p$ before sending the linear equations. At the end of the protocol, each server can use the set of linear equations $C$ , which is a maximal set of linearly independent equations of the original linear system, to solve the linear system. The communication complexity is $O(sd^{2}L)$ in the coordinator model and $O(s+d^{2}L)$ in the blackboard model since at most $d$ linear equations will be added into the set $C$ , and each linear equation requires $O(dL)$ bits to describe.

Formally, we have proved the following theorem.

Theorem 4.3.

There exists a deterministic protocol for solving linear systems which has communication complexity $O(sd^{2}L)$ in the coordinator model and $O(s+d^{2}L)$ in the blackboard model.
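The deterministic protocol can be sketched as a sequential simulation. This is a minimal sketch reconstructed from the prose description only (Figure 1 is not reproduced here), so the data layout and the names `deterministic_protocol` and `rank` are assumptions: servers are processed in order, an equation is broadcast only if it is linearly independent of the set $C$ collected so far, and the run stops as soon as $C$ becomes infeasible.

```python
from fractions import Fraction


def rank(rows):
    """Exact rank over the rationals via Gaussian elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    if not m:
        return 0
    r = 0
    for c in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            f = m[i][c] / m[r][c]
            m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


def deterministic_protocol(servers):
    """servers: list of per-server equation lists, each equation is (coeffs, rhs).

    Returns (C, feasible), where C holds the broadcast equations as
    augmented rows [a | b].
    """
    C = []
    for equations in servers:
        for coeffs, rhs in equations:
            row = list(coeffs) + [rhs]
            if rank(C + [row]) > len(C):  # independent of C: broadcast it
                C.append(row)
                # C is infeasible iff its coefficient part has smaller rank.
                if rank([r[:-1] for r in C]) < len(C):
                    return C, False
    return C, True


servers = [
    [([1, 0], 1)],
    [([2, 0], 2), ([0, 1], 3)],  # first equation is redundant
    [([1, 1], 4)],               # implied by the two kept equations
]
C, ok = deterministic_protocol(servers)
print(len(C), ok)
```

Only the independent equations are ever broadcast, so in a feasible instance at most $d$ equations of $O(dL)$ bits each are sent, matching the count used in the analysis above.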

Now we turn to randomized protocols. We describe a protocol for solving linear systems with communication complexity $\widetilde{O}(d^{2}L+sd)$ in the coordinator model. The description is given in Figure 2.

Now we prove the correctness of the protocol. We first note a few simple properties of the protocol. For each $i\in[s]$ , after executing the $i$ -th loop of Step 3, we have $C\subseteq\mathrm{span}\left(\bigcup_{j\leq i}S_{j}\right)$ . Furthermore, at Step 3(a)ii, we must have $c\in\mathrm{span}(S_{i})$ . This means if $C\cup\{c\}$ is infeasible, then the original linear system must be infeasible.

Thus, it suffices to show that for each $i\in[s]$ , if there exists $s\in S_{i}$ such that $\left(\bigcup_{j<i}S_{j}\right)\cup\{s\}$ is infeasible, then the protocol is terminated, and otherwise after executing the $i$ -th loop of Step 3, we have $\mathrm{span}(C)=\mathrm{span}\left(\bigcup_{j\leq i}S_{j}\right)$ .

Suppose before the execution of the $i$ -th loop of Step 3 we have $\mathrm{span}(C)=\mathrm{span}\left(\bigcup_{j<i}S_{j}\right)$ and $\mathrm{span}\left(\bigcup_{j<i}S_{j}\right)$ is feasible, and the protocol is executing the $i$ -th loop of Step 3. There are two cases here.

Case 1: $\mathrm{span}\left(\bigcup_{j\leq i}S_{j}\right)=\mathrm{span}\left(\bigcup_{j<i}S_{j}\right)=\mathrm{span}(C)$ .

In this case, $C$ will remain unchanged during the $i$ -th loop of Step 3.

Case 2: There exists $s\in S_{i}$ , $s\notin\mathrm{span}(C)$ .

In this case, we claim that with probability at least $1/2$ , the linear equation $c$ calculated at Step 3(a)i satisfies $c\notin\mathrm{span}(C)$ . This can be seen since for any linear combination $c=\sum_{t\in S_{i}}r_{t}\cdot t$ , if we flip the sign of $r_{s}$ and obtain $\widehat{c}$ , then either $c\notin\mathrm{span}(C)$ or $\widehat{c}\notin\mathrm{span}(C)$ , since $c-\widehat{c}=\pm 2s$ and $s\notin\mathrm{span}(C)$ .
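The sign-flipping argument can be verified by brute force on small examples. The sketch below assumes, as the argument suggests, that the coefficients $r_{t}$ are independent uniform $\pm 1$ signs; it enumerates all sign patterns and reports the fraction of combinations $c$ that fall outside $\mathrm{span}(C)$ . The inputs are hypothetical toy vectors and the helper names are illustrative.

```python
from fractions import Fraction
from itertools import product


def rank(rows):
    """Exact rank over the rationals via Gaussian elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    if not m:
        return 0
    r = 0
    for c in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            f = m[i][c] / m[r][c]
            m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


def fraction_outside_span(C, S_i):
    """Fraction of +-1 sign patterns whose combination of S_i leaves span(C)."""
    base = rank(C)
    width = len(S_i[0])
    patterns = list(product((1, -1), repeat=len(S_i)))
    outside = 0
    for signs in patterns:
        c = [sum(r * t[j] for r, t in zip(signs, S_i)) for j in range(width)]
        if rank(C + [c]) > base:
            outside += 1
    return Fraction(outside, len(patterns))


C = [[1, 0, 0]]
tight = fraction_outside_span(C, [[0, 1, 0], [0, 1, 0]])
easy = fraction_outside_span(C, [[1, 0, 0], [0, 1, 0], [2, 0, 0]])
print(tight, easy)
```

The first example shows that the bound of $1/2$ can be attained, which is why the protocol repeats the random combination several times to boost the success probability.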

Thus, if there exists $s\in S_{i}$ such that $s\notin\mathrm{span}(C)$ , then with probability $1-1/\operatorname{poly}(d)$ , at least one of the linear equations $c$ calculated at Step 3(a)i satisfies $c\notin\mathrm{span}(C)$ , in which case the protocol terminates if $C\cup\{c\}$ is infeasible, or $c$ is added into $C$ otherwise. Thus, if $\mathrm{span}\left(\bigcup_{j<i}S_{j}\right)\neq\mathrm{span}\left(\bigcup_{j\leq i}S_{j}\right)$ , then after the execution of the $i$ -th loop of Step 3, with probability at least $1-1/\operatorname{poly}(d)$ , either the protocol (correctly) terminates, or we have $\mathrm{span}(C)=\mathrm{span}\left(\bigcup_{j\leq i}S_{j}\right)$ . The correctness of the protocol just follows by applying a union bound over all $i$ such that $\mathrm{span}\left(\bigcup_{j<i}S_{j}\right)\neq\mathrm{span}\left(\bigcup_{j\leq i}S_{j}\right)$ . Notice that there are at most $d$ such $i$ we need to apply a union bound over.

Now we analyze the communication complexity of the protocol. Notice that at most $d$ linear equations will be added into $C$ , and thus the total communication complexity associated with sending $c$ when $c$ is added into $C$ is upper bounded by $O(d^{2}L)$ . Furthermore, if $C\cup\{c\}$ is infeasible at Step 3(a)ii, then the protocol terminates and thus the communication complexity for sending $c$ in this case is upper bounded by $O(dL)$ . Finally, for each $i$ , server $P_{i}$ will send $O(\log d)$ different linear equations $c$ to the coordinator, and if we implement the protocol naïvely, then the total communication complexity of this step is upper bounded by $\widetilde{O}(sdL)$ . Thus, the total communication complexity of the whole protocol is upper bounded by $\widetilde{O}(sdL+d^{2}L)$ .

However, using Lemma 4.1 and the same argument as in Section 4.1, to implement Step 3(a)ii, it suffices to check if $C\cup\{c\}$ is feasible and if $c$ is a linear combination of existing linear equations in $C$ , over the finite field $\mathbb{F}_{p}$ , for a random prime number $p\in[2,\operatorname{poly}(dL)]$ . The correctness still follows since this check fails with probability at most $0.01$ . After this modification, the communication complexity is now upper bounded by $\widetilde{O}(sd+d^{2}L)$ .

Formally, we have proved the following theorem.

Theorem 4.4.

The protocol described in Figure 2 is a randomized protocol for solving linear systems which has communication complexity $\widetilde{O}(sd+d^{2}L)$ in the coordinator model. Here the $\widetilde{O}(\cdot)$ notation hides only $\operatorname{polylog}(dL)$ factors. The protocol succeeds with probability at least $0.99$ .

5 Communication Complexity Lower Bounds for Linear Regressions in the Blackboard Model

In this section, we prove communication complexity lower bounds for linear regression in the blackboard model.

We first define the $k$ -XOR problem and the $k$ -MAJ problem. In both problems there are $k$ servers in the blackboard model, and each server $P_{i}$ receives a binary string $x_{i}\in\{0,1\}^{d}$ . In the $k$ -XOR problem, at the end of the communication protocol, the coordinator must correctly output the coordinate-wise XOR of these vectors on at least $0.99d$ coordinates. In the $k$ -MAJ problem, at the end of the communication protocol, the coordinator must correctly output the coordinate-wise majority of these vectors on at least $0.99d$ coordinates.

We need the following lemma for our lower bound proof.

Lemma 5.1.

Any randomized communication protocol that solves the $k$ -XOR problem or the $k$ -MAJ problem and succeeds with probability at least $0.99$ has communication complexity $\Omega(dk)$ .

Proof.

The lower bound for $k$ -XOR directly follows from [54, Theorem 1.1]. Now we prove the lower bound for $k$ -MAJ.

First, consider a communication problem with two players. Alice receives a binary string $x\in\{0,1\}^{d}$ and $2k-1$ binary strings $z_{1},z_{2},\ldots,z_{2k-1}\in\{0,1\}^{d}$ . Bob receives a binary string $y\in\{0,1\}^{d}$ and the same $2k-1$ binary strings $z_{1},z_{2},\ldots,z_{2k-1}\in\{0,1\}^{d}$ . These $2k+1$ binary strings $x,y,z_{1},z_{2},\ldots,z_{2k-1}$ are generated uniformly at random conditioned on the following constraint: for each coordinate $i\in[d]$ , the $i$ -th coordinate of $x,y,z_{1},z_{2},\ldots,z_{2k-1}$ contains either $k$ zeros or $k+1$ zeros. For each coordinate $i\in[d]$ , whether the $i$ -th coordinate of $x,y,z_{1},z_{2},\ldots,z_{2k-1}$ contains $k$ zeros or $k+1$ zeros is also chosen uniformly at random. In this communication problem, the goal of Alice is to output the vector $y$ .

Now we prove a lower bound for the communication problem defined above. Notice that for each coordinate $i\in[d]$ , if $x,z_{1},z_{2},\ldots,z_{2k-1}$ contain exactly $k$ zeros and $k$ ones at the $i$ -th coordinate, then the $i$ -th coordinate of $y$ is uniformly random. By a Chernoff bound, with high probability, there exists a set $S\subseteq[d]$ with size $|S|\geq d/10$ , such that for each $i\in S$ , the $i$ -th coordinate of $x,z_{1},z_{2},\ldots,z_{2k-1}$ contains exactly $k$ zeros and $k$ ones. By standard information-theoretic arguments, if at the end of the communication protocol, with constant probability, Alice correctly outputs the value of $y_{i}$ for at least a $9/10$ fraction of $i\in S$ , then the expected communication complexity is lower bounded by $\Omega(d)$ , even with public randomness. See, e.g., Lemma 2.1 in [54] for a formal proof.

Now we reduce the problem mentioned above to $(2k+1)$ -MAJ and prove an $\Omega(dk)$ lower bound. Given any protocol $\mathcal{P}$ for the $(2k+1)$ -MAJ problem with expected communication complexity $C_{\mathcal{P}}$ on the distribution $x,y,z_{1},z_{2},\ldots,z_{2k-1}$ mentioned above, Alice and Bob first use public randomness to choose two distinct servers $s_{1}$ and $s_{2}$ uniformly at random, and then Alice and Bob simulate the protocol $\mathcal{P}$ . To simulate $\mathcal{P}$ , Alice plays the role of server $s_{1}$ and Bob plays the role of server $s_{2}$ . They both play the roles of all other players. Alice sets the input of $s_{1}$ to be $x$ , and Bob sets the input of $s_{2}$ to be $y$ . The inputs of the other $2k-1$ servers are set to be $z_{1},z_{2},\ldots,z_{2k-1}$ . To simulate $\mathcal{P}$ , Alice and Bob need to communicate if and only if server $s_{1}$ or server $s_{2}$ needs to communicate with the coordinator since all other communication can be simulated by Alice and Bob themselves.

By symmetry, the expected communication complexity between Alice and Bob is upper bounded by $2C_{\mathcal{P}}/k$ . Furthermore, at the end of the protocol, Alice and Bob have the coordinate-wise majority of $x,y,z_{1},z_{2},\ldots,z_{2k-1}$ , for at least $0.99d$ coordinates. Thus, for at least $9/10$ fraction of $i\in S$ , Alice knows the majority of the $i$ -th coordinates of $x,y,z_{1},z_{2},\ldots,z_{2k-1}$ . However, by definition of $S$ , the majority of the $i$ -th coordinates of $x,y,z_{1},z_{2},\ldots,z_{2k-1}$ is exactly $y_{i}$ . Thus, by the $\Omega(d)$ lower bound mentioned above, we must have $2C_{\mathcal{P}}/k=\Omega(d)$ , which implies an $\Omega(dk)$ lower bound. ∎
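The key step of this proof, that on coordinates in $S$ the majority of all $2k+1$ bits equals $y_{i}$ , can be checked exhaustively for small $k$ . In the sketch below, `others` stands for the $2k$ bits of $x,z_{1},\ldots,z_{2k-1}$ at a fixed coordinate; the function names are illustrative.

```python
from itertools import product


def majority(bits):
    """Majority bit of an odd-length 0/1 sequence."""
    return int(2 * sum(bits) > len(bits))


def majority_reveals_y(k):
    """Check: if the 2k bits of x, z_1, ..., z_{2k-1} are balanced,
    then the majority of all 2k+1 bits is exactly y's bit."""
    for others in product((0, 1), repeat=2 * k):
        if sum(others) != k:  # keep only balanced coordinates (those in S)
            continue
        for y in (0, 1):
            if majority(list(others) + [y]) != y:
                return False
    return True


results = [majority_reveals_y(k) for k in range(1, 5)]
print(results)
```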

Now we give a reduction from $k$ -MAJ to $(1+\varepsilon)$ -approximate $\ell_{1}$ regression in the blackboard model, and prove an $\Omega(d/\varepsilon)$ lower bound when $s>\Omega(1/\varepsilon)$ . In the hard case we assume $s=\Theta(1/\varepsilon)$ , and we simply ignore all other servers if $s>\Theta(1/\varepsilon)$ . For each server $P_{i}$ , its matrix $A^{(i)}$ is set to be the identity matrix $I_{d}\in\mathbb{R}^{d\times d}$ , and $b^{(i)}\in\{0,1\}^{d}$ . Notice that in such case, we can calculate the $\ell_{1}$ regression value separately for each coordinate $x_{j}$ . The optimal solution can be achieved by taking $x_{j}$ to be $m_{j}$ , where $m_{j}$ is the majority of $b^{(1)}_{j},b^{(2)}_{j},\ldots,b^{(s)}_{j}$ . Notice that the $\ell_{1}$ regression value associated with the $j$ -th coordinate in the optimal solution is upper bounded by $s=\Theta(1/\varepsilon)$ , and thus the total $\ell_{1}$ regression value is upper bounded by $sd=\Theta(d/\varepsilon)$ in the optimal solution. Furthermore, if $|x_{j}-m_{j}|\geq 0.1$ , then the $\ell_{1}$ regression value associated with the $j$ -th coordinate by using $x_{j}$ will be at least $0.1$ larger than the $\ell_{1}$ regression value associated with the $j$ -th coordinate in the optimal solution.

Now consider a $(1+\varepsilon)$ -approximate solution $x$ . We claim that for at least $0.99d$ coordinates $x_{i}$ of $x$ , we have $|x_{i}-m_{i}|\leq 0.1$ . The claim follows since otherwise, more than $0.01d$ coordinates would each contribute at least $0.1$ of additional cost, so the total $\ell_{1}$ regression value of $x$ would be at least $0.001d$ larger than the optimal $\ell_{1}$ regression value, which would again be larger than $1+\varepsilon$ times the optimal $\ell_{1}$ regression value, by adjusting the constant in $s=\Theta(1/\varepsilon)$ . Thus, from a $(1+\varepsilon)$ -approximate solution $x$ to the $\ell_{1}$ regression problem, we can solve the $k$ -MAJ problem with $k=\Theta(1/\varepsilon)$ , which implies an $\Omega(d/\varepsilon)$ lower bound.
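The per-coordinate gap used in this reduction can be checked numerically. The sketch below assumes an odd number of servers $s$ , so that the majority $m_{j}$ is well defined, and evaluates the per-coordinate cost $\sum_{i}|x_{j}-b^{(i)}_{j}|$ exactly with rationals on a grid of candidate values; the function names and the grid are illustrative choices.

```python
from fractions import Fraction
from itertools import product


def l1_cost(x, bits):
    """Per-coordinate l1 regression value when every A^(i) is the identity."""
    return sum(abs(x - b) for b in bits)


def majority(bits):
    return int(2 * sum(bits) > len(bits))


def min_gap(s):
    """Smallest cost increase over all inputs and grid points with |x - m| >= 1/10.

    Assumption: s is odd, so the majority bit m is unambiguous.
    """
    gaps = []
    for bits in product((0, 1), repeat=s):
        m = majority(bits)
        for i in range(-10, 21):
            x = Fraction(i, 10)
            if abs(x - m) >= Fraction(1, 10):
                gaps.append(l1_cost(x, bits) - l1_cost(m, bits))
    return min(gaps)


gap3, gap5 = min_gap(3), min_gap(5)
print(gap3, gap5)
```

On this grid the smallest increase is exactly $1/10$ , attained when the majority wins by a single vote.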

Formally, we have proved the following theorem.

Theorem 5.2.

When $s>\Omega(1/\varepsilon)$ , any randomized protocol that succeeds with probability at least $0.99$ for solving $(1+\varepsilon)$ -approximate $\ell_{1}$ regression requires $\Omega(d/\varepsilon)$ bits of communication in the blackboard model.

Now we give a reduction from $k$ -XOR to $(1+\varepsilon)$ -approximate $\ell_{2}$ regression in the blackboard model, and prove an $\Omega(d/\sqrt{\varepsilon})$ lower bound when $s>\Omega(1/\sqrt{\varepsilon})$ . In the hard case we assume $s=\Theta(1/\sqrt{\varepsilon})$ , and we simply ignore all other servers if $s>\Theta(1/\sqrt{\varepsilon})$ . For each server $P_{i}$ , its matrix $A^{(i)}$ is set to be the identity matrix $I_{d}\in\mathbb{R}^{d\times d}$ , and $b^{(i)}\in\{0,1\}^{d}$ . For $\ell_{2}$ regression, the optimal solution can be achieved by taking $x_{j}$ to be $a_{j}$ , where $a_{j}$ is the average of $b^{(1)}_{j},b^{(2)}_{j},\ldots,b^{(s)}_{j}$ . Notice that the squared $\ell_{2}$ regression value associated with the $j$ -th coordinate in the optimal solution is upper bounded by $s=\Theta(1/\sqrt{\varepsilon})$ , and thus the total squared $\ell_{2}$ regression value is upper bounded by $sd=\Theta(d/\sqrt{\varepsilon})$ . Furthermore, if $|x_{j}-a_{j}|\geq\Omega(\sqrt{\varepsilon})$ , then the squared $\ell_{2}$ regression value associated with the $j$ -th coordinate by using $x_{j}$ will be at least $\Theta(s\cdot|x_{j}-a_{j}|^{2})=\Theta(\sqrt{\varepsilon})$ larger than the squared $\ell_{2}$ regression value associated with the $j$ -th coordinate in the optimal solution.

Now consider a $(1+\varepsilon)$ -approximate solution $x$ . We claim that for at least $0.99d$ coordinates $x_{i}$ of $x$ , we have $|x_{i}-a_{i}|\leq O(\sqrt{\varepsilon})$ . The claim follows since otherwise, the total squared $\ell_{2}$ regression value of $x$ would be at least $\Omega(d\sqrt{\varepsilon})$ larger than the optimal squared $\ell_{2}$ regression value, which would again be larger than $1+\varepsilon$ times the optimal squared $\ell_{2}$ regression value, by adjusting the constant in $s=\Theta(1/\sqrt{\varepsilon})$ . Notice that for those coordinates with $|x_{i}-a_{i}|\leq O(\sqrt{\varepsilon})$ , we can exactly recover $a_{i}$ from $x_{i}$ , since $a_{i}$ is the average of $b^{(1)}_{i},b^{(2)}_{i},\ldots,b^{(s)}_{i}$ and thus $a_{i}$ is an integer multiple of $1/s=\Theta(\sqrt{\varepsilon})$ . This also implies we can recover the XOR of $b^{(1)}_{i},b^{(2)}_{i},\ldots,b^{(s)}_{i}$ . Thus, from a $(1+\varepsilon)$ -approximate solution $x$ to the $\ell_{2}$ regression problem, we can solve the $k$ -XOR problem with $k=\Theta(1/\sqrt{\varepsilon})$ , which implies an $\Omega(d/\sqrt{\varepsilon})$ lower bound.
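The recovery of the XOR from an approximate average can be sketched as follows. This is a minimal sketch assuming the hidden constants are chosen so that $|x_{i}-a_{i}|<1/(2s)$ ; under that assumption, rounding $s\cdot x_{i}$ to the nearest integer gives the exact number of ones, whose parity is the XOR. The function names are illustrative.

```python
from fractions import Fraction
from itertools import product


def recover_xor(x_j, s):
    """Recover the XOR of s bits from an approximation x_j of their average.

    Assumption: |x_j - a_j| < 1/(2s), where a_j is the true average.
    """
    ones = round(x_j * s)  # exact count of ones under the assumption
    return ones % 2


def recovery_is_exact(s):
    """Exhaustively test every bit pattern with perturbations of size 1/(4s)."""
    for bits in product((0, 1), repeat=s):
        a_j = Fraction(sum(bits), s)
        xor = sum(bits) % 2
        for delta in (Fraction(-1, 4 * s), Fraction(0), Fraction(1, 4 * s)):
            if recover_xor(a_j + delta, s) != xor:
                return False
    return True


print(recovery_is_exact(4), recovery_is_exact(5))
```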

Formally, we have proved the following theorem.

Theorem 5.3.

When $s>\Omega(1/\sqrt{\varepsilon})$ , any randomized protocol that succeeds with probability at least $0.99$ for solving $(1+\varepsilon)$ -approximate $\ell_{2}$ regression requires $\Omega(d/\sqrt{\varepsilon})$ bits of communication in the blackboard model.

6 Communication Protocols for $\ell_{2}$ Regression

In this section, we design distributed protocols for solving the $\ell_{2}$ regression problem.

6.1 A Deterministic Protocol

In this section, we design a simple deterministic protocol for $\ell_{2}$ regression in the distributed setting with communication complexity $\widetilde{O}(sd^{2}L)$ in the coordinator model.

According to the normal equations, the optimal solution to the $\ell_{2}$ regression problem $\min_{x}\|Ax-b\|_{2}$ can be attained by setting $x^{*}=(A^{T}A)^{\dagger}A^{T}b$ . In Figure 3, we show how to calculate $A^{T}A$ and $A^{T}b$ in the distributed model.

Notice that the bit complexity of entries in $A^{T}A$ and $A^{T}b$ is $O(L+\log n)$ since the bit complexity of entries in $A$ and $b$ is $L$ , which implies the communication complexity of the protocol in Figure 3 is $O(sd^{2}(L+\log n))$ , in both the coordinator model and the blackboard model.

Theorem 6.1.

The protocol in Figure 3 is a deterministic protocol which exactly solves $\ell_{2}$ regression, and the communication complexity is $O(sd^{2}(L+\log n))$ , in both the coordinator model and the blackboard model.
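The computation behind this protocol can be sketched in a few lines. Figure 3 is not reproduced here, so the following is an assumption-laden illustration rather than the protocol itself: each server forms its local $A^{(i)T}A^{(i)}$ and $A^{(i)T}b^{(i)}$ , the coordinator sums them and solves the normal equations exactly. For simplicity the sketch assumes $A$ has full column rank, so an ordinary inverse replaces the pseudoinverse; all function names are hypothetical.

```python
from fractions import Fraction


def local_gram(A_i, b_i):
    """Server-side step: compute A_i^T A_i and A_i^T b_i."""
    d = len(A_i[0])
    G = [[sum(r[p] * r[q] for r in A_i) for q in range(d)] for p in range(d)]
    g = [sum(r[p] * bi for r, bi in zip(A_i, b_i)) for p in range(d)]
    return G, g


def solve(G, g):
    """Gauss-Jordan elimination over the rationals; assumes G is invertible."""
    d = len(G)
    M = [[Fraction(v) for v in row] + [Fraction(gi)] for row, gi in zip(G, g)]
    for c in range(d):
        pivot = next(i for i in range(c, d) if M[i][c] != 0)
        M[c], M[pivot] = M[pivot], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for i in range(d):
            if i != c and M[i][c] != 0:
                f = M[i][c]
                M[i] = [a - f * b for a, b in zip(M[i], M[c])]
    return [M[i][d] for i in range(d)]


def distributed_least_squares(servers):
    """Coordinator-side step: sum the local statistics and solve."""
    d = len(servers[0][0][0])
    G = [[0] * d for _ in range(d)]
    g = [0] * d
    for A_i, b_i in servers:
        G_i, g_i = local_gram(A_i, b_i)
        G = [[G[p][q] + G_i[p][q] for q in range(d)] for p in range(d)]
        g = [g[p] + g_i[p] for p in range(d)]
    return solve(G, g)


# Three servers, one row each: A = [[1,0],[0,1],[1,1]], b = [1,2,4].
servers = [([[1, 0]], [1]), ([[0, 1]], [2]), ([[1, 1]], [4])]
x_star = distributed_least_squares(servers)
print(x_star)
```

Each server sends only $d^{2}+d$ numbers regardless of how many rows it holds, which is the source of the $O(sd^{2}(L+\log n))$ bound.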

6.2 A Protocol in the Blackboard Model

In this section, we design a recursive protocol for obtaining constant approximations to leverage scores in the distributed setting, which is described in Figure 4. We then show how to solve $\ell_{2}$ regression by using this protocol.

The protocol described in Figure 4 is basically Algorithm 2 in [26] for approximating leverage scores, implemented in the distributed setting. Using Lemma 8 in [26], the protocol returns an $O(d\log d)\times d$ matrix $\widetilde{A}$ such that

[TABLE]

for all $x\in\mathbb{R}^{d}$ , with constant probability. Each server $P_{i}$ can then obtain constant approximations to leverage scores of all rows in $A^{(i)}$ by calculating $\tau_{i}^{\widetilde{A}}(A)$ .

Now we analyze the communication complexity of the protocol. Notice that this recursive algorithm has $O(\log(n/d))$ levels of recursion. Step 1 has communication complexity $O(sd^{2}L\log d)$ in the coordinator model and $O(d^{2}L\log d)$ in the blackboard model, and will be executed at most once during the whole protocol. At Step 4, we can assume each $p_{i}^{-1/2}$ is a power of two between $1$ and $\operatorname{poly}(n)$ , since we can discard all rows whose $p_{i}<1/\operatorname{poly}(n)$ and increase each $p_{i}$ by a constant factor. In order to implement the sampling process in Theorem 2.1 in the distributed setting, each server $P_{i}$ sends the summation of $p_{i}$ for all rows in $A^{(i)}$ to the coordinator. After receiving all these summations, the coordinator decides the number of rows to be sampled from each $A^{(i)}$ and sends these numbers back to each server. The communication complexity of this step is at most $O(s\log n)$ . Each server $P_{i}$ samples and rescales the rows accordingly, and then sends these sampled rows to the coordinator, and the coordinator sends all sampled rows back to all servers. Notice that the bit complexity of all entries in the sampled rows is at most $O(L+\log n)$ since $p_{i}^{-1/2}$ is an integer between $1$ and $\operatorname{poly}(n)$ . Thus, in the blackboard model, the total communication complexity at each recursive level of the protocol is upper bounded by $O(d^{2}\log d(L+\log n)+s\log n)$ , which implies the communication complexity of the whole protocol is at most $O((d^{2}\log d(L+\log n)+s\log n)\cdot\log(n/d))$ in the blackboard model. A similar analysis shows that the communication complexity of the whole protocol is $\widetilde{O}(sd^{2}L)$ in the coordinator model.

Lemma 6.2.

The protocol described in Figure 4 is a randomized protocol with communication complexity $\widetilde{O}(sd^{2}L)$ in the coordinator model and $\widetilde{O}(s+d^{2}L)$ in the blackboard model, such that with constant probability, upon termination of the protocol, each server $P_{i}$ has constant approximations to leverage scores of all rows in $A^{(i)}$ .

Our protocol for solving the $\ell_{2}$ regression problem in the blackboard model is described in Figure 5. It is guaranteed that the vector $x$ calculated at Step 3 is a $(1+\varepsilon)$ -approximate solution, with constant probability. A naïve approach for obtaining a $(1+\varepsilon)$ -approximate solution to the $\ell_{2}$ regression problem would be to use Theorem 2.1 to obtain $SA$ and $Sb$ such that with constant probability, for all $x\in\mathbb{R}^{d}$ ,

[TABLE]

By doing so, the number of sampled rows should be $O(d\log d/\varepsilon^{2})$ according to Theorem 2.1. However, as shown in Theorem 36 of [24], in order to obtain a $(1+\varepsilon)$ -approximate solution to the $\ell_{2}$ regression problem (instead of obtaining a $(1+\varepsilon)$ subspace embedding), it suffices to sample $O(d\log d+d/\varepsilon)$ rows from $[A~{}b]$ .
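To make the gap between the two row counts concrete, the following Python sketch evaluates both expressions with every hidden constant set to $1$. That choice, and the function names, are assumptions made purely for illustration.

```python
import math

def rows_for_subspace_embedding(d, eps):
    """d log d / eps^2, with the constant in the O() set to 1 (an assumption)."""
    return math.ceil(d * math.log(d) / eps ** 2)

def rows_for_regression(d, eps):
    """d log d + d / eps, with the constant in the O() set to 1 (an assumption)."""
    return math.ceil(d * math.log(d) + d / eps)
```

For small $\varepsilon$ the second count is dominated by $d/\varepsilon$ rather than $d\log d/\varepsilon^{2}$, which is the source of the improved dependence on $\varepsilon$ in the blackboard model.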

Now we analyze the communication complexity of the protocol in Figure 5 in the blackboard model. By Lemma 6.2, the communication complexity of Step 1 is upper bounded by $\widetilde{O}(d^{2}L+s)$ . Similar to Step 4 of the protocol described in Figure 4, the sampling process in Step 2 can be implemented with communication complexity $\widetilde{O}(s+d^{2}L/\varepsilon)$ . Thus, the total communication complexity is $\widetilde{O}(s+d^{2}L/\varepsilon)$ in the blackboard model.

Theorem 6.3.

The protocol described in Figure 5 is a randomized protocol which returns a $(1+\varepsilon)$ -approximate solution to $\ell_{2}$ regression with constant probability, and the communication complexity is $\widetilde{O}(s+d^{2}L/\varepsilon)$ in the blackboard model.

7 Communication Protocols for $\ell_{1}$ Regression

In this section, we design distributed protocols for solving the $\ell_{1}$ regression problem.

7.1 A Simple Protocol

In this section, we design a simple protocol for obtaining a $(1+\varepsilon)$ -approximate solution to the $\ell_{1}$ regression problem in the distributed setting. The protocol is described in Figure 6.

To implement Step 1, each server $P_{i}$ calculates the $\ell_{1}$ Lewis weights of $[A^{(i)}~{}b^{(i)}]$ and uses Theorem 2.2 to randomly generate a matrix $S^{(i)}$ . $P_{i}$ then checks whether for all $x\in\mathbb{R}^{d}$

[TABLE]

If not, $P_{i}$ randomly generates another $S^{(i)}$ until (3) is satisfied for all $x\in\mathbb{R}^{d}$ . Since (3) is satisfied with constant probability, the number of independent trials is $O(1)$ in expectation. Furthermore, each server can locally check whether (3) holds or not by e.g., verifying on an $\varepsilon$ -net. Notice that the use of randomness is not critical here, since each server $P_{i}$ can locally enumerate all possible $S^{(i)}$ up to a specific precision, instead of using Theorem 2.2 to randomly generate a matrix $S^{(i)}$ . Each server $P_{i}$ will eventually find a matrix $S^{(i)}$ which satisfies (3), whose existence is guaranteed by Theorem 2.2.

Now we prove the correctness of the protocol. Notice that it is guaranteed that for any $x\in\mathbb{R}^{d}$ and any $i\in[s]$ ,

[TABLE]

It implies that

[TABLE]

and

[TABLE]

Thus, the vector $x$ calculated at Step 2 is a $(1+\varepsilon)$ -approximate solution to the $\ell_{1}$ regression problem.

Finally, we analyze the communication complexity of the protocol. Similar to the analysis in Section 6.2, we may assume all $1/p_{i}$ in the sampling process of Theorem 2.2 are integers between $1$ and $\operatorname{poly}(n)$ . Thus, the bit complexity of all entries in $S^{(i)}A^{(i)}$ and $S^{(i)}b^{(i)}$ is at most $O(L+\log n)$ , which implies the communication complexity of Step 1 is $O(sd^{2}\log d\cdot(L+\log n)/\varepsilon^{2})$ in both the coordinator model and the blackboard model.

Theorem 7.1.

The protocol described in Figure 6 is a deterministic protocol which returns a $(1+\varepsilon)$ -approximate solution to the $\ell_{1}$ regression problem, and the communication complexity is $\widetilde{O}(sd^{2}L/\varepsilon^{2})$ in both the coordinator model and the blackboard model.

7.2 A Protocol Based on $\ell_{1}$ Lewis Weights Sampling

In this section, we first design a protocol for obtaining constant approximations to $\ell_{1}$ Lewis weights in the distributed setting, which is described in Figure 7, and then solve the $\ell_{1}$ regression problem based on this protocol.

The protocol described in Figure 7 is basically the algorithm in Section 3 of [27] for approximating $\ell_{1}$ Lewis weights, implemented in the distributed setting. Using the same analysis, by setting $T=O(\log\log n)$ , we can show $w_{i}$ are constant approximations to the $\ell_{1}$ Lewis weights of $A$ . Now we show that we can assume all $w_{i}^{-1/2}$ are integers between $1$ and $2^{\widetilde{O}(L)}$ .

Without loss of generality we assume each row of $A$ contains at least one non-zero entry. Since our goal here is to calculate constant approximations to the $\ell_{1}$ Lewis weights, using the analysis in Section 3 in [27], we only need constant approximations to $w_{i}$ during the execution of the algorithm. Furthermore, since leverage scores $\tau_{i}(W^{-1/2}A)$ are at most $1$ (see, e.g., Section 2.4 in [66]), we can prove by induction that $w_{i}\leq 1$ during the execution of the algorithm. Thus, we may assume $w_{i}^{-1/2}\geq 1$ and $w_{i}^{-1/2}$ are integers.

Now we show that $w_{i}\geq 2^{-\widetilde{O}(L)}$ . We prove this claim by induction. At the beginning of the algorithm, $w_{i}=1$ for all $1\leq i\leq n$ . Assuming $w_{i}\geq 2^{-\widetilde{O}(L)}$ by the induction hypothesis, we know all entries in $W^{-1/2}A$ are integers between $1$ and $2^{\widetilde{O}(L)}$ . Using Lemma 2 in [26], we know

[TABLE]

By the Cauchy-Schwarz inequality, in order that $(W^{-1/2}A)^{T}x=(W^{-1/2}A)^{i}$ , we must have

[TABLE]

since otherwise all entries in $(W^{-1/2}A)^{T}x$ are less than $1$ , which violates the assumption that all entries in $W^{-1/2}A$ are integers and each row of $A$ contains at least one non-zero entry. Furthermore, the number of iterations is at most $O(\log\log n)$ , which implies $w_{i}\geq 2^{-\widetilde{O}(L)}$ by induction.

Thus, all entries in $W^{-1/2}A$ have bit complexity $\widetilde{O}(L)$ during the execution of the algorithm. Using Lemma 6.2, the communication complexity is $\widetilde{O}(s+d^{2}L)$ in the blackboard model, and $\widetilde{O}(sd^{2}L)$ in the coordinator model.

Lemma 7.2.

The protocol described in Figure 7 is a randomized protocol with communication complexity $\widetilde{O}(s+d^{2}L)$ in the blackboard model and $\widetilde{O}(sd^{2}L)$ in the coordinator model, such that with constant probability, upon termination of the protocol, each server $P_{i}$ has constant approximations to the $\ell_{1}$ Lewis weights of all rows in $A^{(i)}$ .

Our protocol for solving the $\ell_{1}$ regression problem in the blackboard model is described in Figure 8. By Theorem 2.2, it is guaranteed that the vector $x$ calculated at Step 3 is a $(1+\varepsilon)$ -approximate solution, with constant probability.

Now we analyze the communication complexity of the protocol in Figure 8. By Lemma 7.2, the communication complexity of Step 1 is upper bounded by $\widetilde{O}(d^{2}L+s)$ in the blackboard model and $\widetilde{O}(sd^{2}L)$ in the coordinator model. Similar to Step 4 of the protocol described in Figure 4, the sampling process in Step 2 can be implemented with communication complexity $\widetilde{O}(s+d^{2}L/\varepsilon^{2})$ in both models. Thus, the total communication complexity is $\widetilde{O}(s+d^{2}L/\varepsilon^{2})$ in the blackboard model. In the coordinator model, the total communication complexity is $\widetilde{O}(sd^{2}L+d^{2}L/\varepsilon^{2})$ .

Theorem 7.3.

The protocol described in Figure 8 is a randomized protocol which returns a $(1+\varepsilon)$ -approximate solution to the $\ell_{1}$ regression problem with constant probability, and the communication complexity is $\widetilde{O}(s+d^{2}L/\varepsilon^{2})$ in the blackboard model and $\widetilde{O}(sd^{2}L+d^{2}L/\varepsilon^{2})$ in the coordinator model.

7.3 A Protocol Based on Accelerated Gradient Descent

In this section, we present a protocol for the $\ell_{1}$ regression problem in the coordinator model, whose communication complexity is $\widetilde{O}(sd^{3}L/\varepsilon)$ .

We need the following definition in [32].

Definition 7.1 ([32]).

Suppose $n\geq d$ . A matrix $A\in\mathbb{R}^{n\times d}$ is approximately isotropic row-bounded if the following hold:

1. $A^{T}A\approx_{O(1)}I$ ;

2. For all rows of $A$ , $\|A^{i}\|_{2}^{2}\leq O(d/n)$ .

Before presenting the protocol, we first present a preconditioning procedure in Figure 9, which will later be used in the protocol for $\ell_{1}$ regression.

The communication complexity of this protocol in the coordinator model is $\widetilde{O}(sd^{2}L)$ by Lemma 6.2 and Lemma 7.2. Similar to the analysis in Section 6.2, we can also assume the bit complexity of all entries in $SA$ and $Sb$ is $O(L+\log n)$ . Furthermore, by Theorem 2.2, a $(1+\varepsilon)$ -approximate solution to the $\ell_{1}$ regression problem

[TABLE]

is a $(1+O(\varepsilon))$ -approximate solution to the original $\ell_{1}$ regression problem

[TABLE]

Thus, we will focus on the $\ell_{1}$ regression problem in (4) in the remaining part of this section.

Now we show $SAR^{-1}$ is approximately isotropic row-bounded as in Definition 7.1. We only need to show $(SAR^{-1})^{T}SAR^{-1}\approx_{O(1)}I$ and all rows of $SAR^{-1}$ satisfy $\|(SAR^{-1})^{i}\|_{2}^{2}=O(d/N)$ where $N=O(d\log n/\varepsilon^{2})$ is the number of rows of $SAR^{-1}$ .

To show $(SAR^{-1})^{T}SAR^{-1}\approx_{O(1)}I$ , it is equivalent to show that for all $x\in\mathbb{R}^{d}$ ,

[TABLE]

Notice that $\widetilde{SA}R^{-1}$ is an orthogonal matrix $Q$ , which implies for all $x\in\mathbb{R}^{d}$

[TABLE]

Combining (6) and the fact that

[TABLE]

for all $x\in\mathbb{R}^{d}$ , we can prove (5), which implies $(SAR^{-1})^{T}SAR^{-1}\approx_{O(1)}I$ .

To show $\|(SA)^{i}R^{-1}\|_{2}^{2}=O(d/N)$ , we use Lemma 29 in [32], which states that with constant probability, the leverage scores of $SA$ satisfy $\tau_{i}(SA)=O(d/N)$ for all $i$ . Since leverage scores are invariant under change of basis (see, e.g., Section 2.4 in [66]), we have for all $i$ ,

[TABLE]

Since $(SAR^{-1})^{T}SAR^{-1}\approx_{O(1)}I$ , we have

[TABLE]

Thus, $SAR^{-1}$ is approximately isotropic row-bounded.

Now we describe our protocol for $\ell_{1}$ regression in Figure 10. Our protocol first uses the preconditioning procedure in Figure 9 and then uses Nesterov’s accelerated gradient descent [52] to solve the $\ell_{1}$ regression problem

[TABLE]

Furthermore, we invoke a smoothing reduction JointAdaptRegSmooth in [2] to obtain better dependence on $\varepsilon$ .

In order to implement Nesterov’s accelerated gradient descent in the distributed setting, each server $P_{i}$ maintains the current solution $x$ . In each round, servers communicate to calculate the current gradient vector. Once all servers receive the gradient vector, they can update their current solution $x$ locally and proceed to the next round. Analysis in [2] (Example C.3) shows that when JointAdaptRegSmooth is applied to Nesterov’s accelerated gradient descent, after $O(G\sqrt{\Theta}/\delta)$ full gradient calculations, the algorithm will output a vector $x$ such that

[TABLE]

where we assume $\|SAR^{-1}x-Sb\|_{1}$ is $G$ -Lipschitz continuous and the initial solution $x$ satisfies $\|x-x^{*}\|_{2}^{2}\leq\Theta$ . Since $SAR^{-1}$ is approximately isotropic row-bounded and the initial vector $x$ is the optimal solution to the $\ell_{2}$ regression problem $\min_{x}\|SAR^{-1}x-Sb\|_{2}^{2}$ , Lemma 19 in [32] shows that $\|x-x^{*}\|_{2}\leq\sqrt{d/n}\|Ax^{*}-b\|_{1}$ . Furthermore, Lemma 15 in [32] shows that $G\leq\sqrt{nd}.$ By setting $\delta=\varepsilon\|Ax^{*}-b\|_{1}$ , we can calculate a $(1+\varepsilon)$ -approximate solution to the $\ell_{1}$ regression problem using $O(d/\varepsilon)$ full gradient calculations.

Both Nesterov’s accelerated gradient descent [52] and JointAdaptRegSmooth in [2] require an estimate (up to a constant factor) of $\|Ax^{*}-b\|_{1}$ , which can be obtained by using the algorithm in Section 7.1 to obtain an $O(1)$ -approximate solution $\widehat{x}$ and then calculating $\|A\widehat{x}-b\|_{1}$ .

It remains to design a protocol to calculate the gradient vector of the smoothed objective function for $\|SAR^{-1}x-Sb\|_{1}$ in the distributed setting. We show this can be done with communication complexity $\widetilde{O}(sd^{2}L)$ . By using JointAdaptRegSmooth in [2], the new objective function will be

[TABLE]

where

[TABLE]

for some $\lambda_{t},\sigma_{t}\in\mathbb{R}$ and $x_{0}\in\mathbb{R}^{d}$ known to each server.

Each server can locally calculate the gradient vector of the $\frac{\sigma_{t}}{2}\|x-x_{0}\|_{2}^{2}$ term, since $\sigma_{t}$ and $x_{0}$ are known to each server. In the remaining part of this section, we focus on designing an algorithm for calculating the gradient vector of the first term in (7).

For the first term in (7), we have

[TABLE]

Notice that we cannot directly let each server $P_{i}$ calculate the gradient vectors using (8), send the gradient vectors to the coordinator and calculate the summation, since the bit complexity of $R^{-1}$ can be unbounded. Instead, we deal with the two cases in (7) by using two different approaches.

When $|\langle(SA)^{i}R^{-1},x\rangle-(Sb)^{i}|>\lambda_{t}$ , notice that although the bit complexity of $\mathrm{sign}(\langle(SA)^{i}R^{-1},x\rangle-(Sb)^{i})\cdot(SA)^{i}R^{-1}$ can be unbounded, all entries in the vector $\mathrm{sign}(\langle(SA)^{i}R^{-1},x\rangle-(Sb)^{i})\cdot(SA)^{i}$ have bit complexity at most $\widetilde{O}(L)$ and $R^{-1}$ is a matrix known to each server. Thus, each server $P$ sends

[TABLE]

to the coordinator, for each row $(SA)^{i}$ which is stored on $P$ and satisfies $|\langle(SA)^{i}R^{-1},x\rangle-(Sb)^{i}|>\lambda_{t}$ . After receiving from each server, the coordinator calculates

[TABLE]

for all rows $(SA)^{i}$ that satisfy $|\langle(SA)^{i}R^{-1},x\rangle-(Sb)^{i}|>\lambda_{t}$ , and sends it to each server. All servers can then recover the gradient vector. The total communication for this case is at most $\widetilde{O}(sdL)$ .

When $|\langle(SA)^{i}R^{-1},x\rangle-(Sb)^{i}|\leq\lambda_{t}$ ,

[TABLE]

Thus, each server $P$ sends

[TABLE]

and

[TABLE]

to the coordinator, for each row $(SA)^{i}$ which is stored on $P$ and satisfies $|\langle(SA)^{i}R^{-1},x\rangle-(Sb)^{i}|\leq\lambda_{t}$ . After receiving from each server, the coordinator calculates

[TABLE]

and

[TABLE]

for all rows $(SA)^{i}$ that satisfy $|\langle(SA)^{i}R^{-1},x\rangle-(Sb)^{i}|\leq\lambda_{t}$ , and sends it to each server. All servers can then recover the gradient vector. The total communication for this case is at most $\widetilde{O}(sd^{2}L)$ .
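The two cases above can be sketched in Python. The displayed formulas are not reproduced here, so the sketch assumes a Huber-type smoothing of the absolute value: the derivative is $\mathrm{sign}(r)$ when $|r|>\lambda_{t}$ and $r/\lambda_{t}$ otherwise, where $r$ is the residual of a row. Under that assumption each server sends one $d$-vector for its large-residual rows, and one $d\times d$ matrix plus one $d$-vector for its small-residual rows, none of which involve $R^{-1}$. All function names are hypothetical, and exact rationals stand in for bounded-precision numbers.

```python
from fractions import Fraction

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def residual(a, beta, Rinv, x):
    # <a R^{-1}, x> - beta equals <a, R^{-1} x> - beta.
    return sum(ai * yi for ai, yi in zip(a, matvec(Rinv, x))) - beta

def server_message(rows, rhs, Rinv, x, lam):
    """Return (u, M, v): quantities built only from the rows of SA and Sb."""
    d = len(x)
    u = [0] * d                        # sum of sign(r) * a over large residuals
    M = [[0] * d for _ in range(d)]    # sum of a^T a over small residuals
    v = [0] * d                        # sum of beta * a over small residuals
    for a, beta in zip(rows, rhs):
        r = residual(a, beta, Rinv, x)
        if abs(r) > lam:
            s = 1 if r > 0 else -1
            for j in range(d):
                u[j] += s * a[j]
        else:
            for j in range(d):
                v[j] += beta * a[j]
                for k in range(d):
                    M[j][k] += a[j] * a[k]
    return u, M, v

def combine(messages, Rinv, x, lam):
    """Coordinator side: sum the messages, then apply R^{-1} locally."""
    d = len(x)
    lam = Fraction(lam)
    u = [sum(m[0][j] for m in messages) for j in range(d)]
    M = [[sum(m[1][j][k] for m in messages) for k in range(d)] for j in range(d)]
    v = [sum(m[2][j] for m in messages) for j in range(d)]
    My = matvec(M, matvec(Rinv, x))
    inner = [u[j] + (My[j] - v[j]) / lam for j in range(d)]
    return matvec(transpose(Rinv), inner)   # R^{-T} (u + (M R^{-1} x - v) / lam)
```

The message sizes match the two bounds in the text: the large-residual case costs $d$ numbers per server, while the small-residual case costs $d^{2}+d$ numbers per server.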

Thus, the total communication complexity of the protocol in Figure 10 is $\widetilde{O}(sd^{3}L/\varepsilon)$ .

Theorem 7.4.

The protocol described in Figure 10 is a randomized protocol which returns a $(1+\varepsilon)$ -approximate solution to the $\ell_{1}$ regression problem with constant probability, and the communication complexity is $\widetilde{O}(sd^{3}L/\varepsilon)$ in the coordinator model.

8 Communication Protocols for $\ell_{p}$ Regression

In this section, we design distributed protocols for solving the $\ell_{p}$ regression problem, including $p=\infty$ .

8.1 Communication Protocols for $\ell_{\infty}$ Regression

Any $\ell_{\infty}$ regression instance $\min_{x}\|Ax-b\|_{\infty}$ can be formulated as the following linear program,

[TABLE]

which has $2n$ constraints and $d+1$ variables. Thus, any linear programming protocol implies a protocol for solving the $\ell_{\infty}$ regression problem, with the same communication complexity. Using the linear program solvers in Section 10 and Section 11, we have the following theorem.

Theorem 8.1.

$\ell_{\infty}$ regression can be solved deterministically and exactly with communication complexity $\widetilde{O}(sd^{3}L)$ in the coordinator model, and randomly and exactly with communication complexity $\widetilde{O}(\min\{sd+d^{4}L,sd^{3}L\})$ in the blackboard model.
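The reduction from $\ell_{\infty}$ regression to linear programming at the start of this section can be sketched as follows. The sketch assumes the standard formulation with an auxiliary variable $t$ that upper bounds every $|\langle A^{i},x\rangle-b_{i}|$; the function names and the $(G,h,c)$ convention are illustrative choices, not notation from the paper.

```python
def linf_regression_lp(A, b):
    """Build the LP  min c.z  subject to  G z <= h,  where z = (x_1..x_d, t)."""
    d = len(A[0])
    G, h = [], []
    for row, bi in zip(A, b):
        G.append(list(row) + [-1])           #  <a_i, x> - t <=  b_i
        h.append(bi)
        G.append([-a for a in row] + [-1])   # -<a_i, x> - t <= -b_i
        h.append(-bi)
    c = [0] * d + [1]                        # objective: minimize t
    return G, h, c

def satisfies(G, h, z):
    """Check whether z satisfies every constraint of G z <= h."""
    return all(sum(g * v for g, v in zip(row, z)) <= hi
               for row, hi in zip(G, h))
```

Each input row contributes two constraints, which a server can generate from its own rows without communication, giving $2n$ constraints over $d+1$ variables.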

8.2 Communication Protocols for $\ell_{p}$ Regression When $p>2$

In this section, we introduce an approach that reduces $(1+\varepsilon)$ -approximate $\ell_{p}$ regression to linear programs with $\widetilde{O}(d/\varepsilon^{2})$ variables. Our main idea is to use the max-stability of exponential random variables [3] to embed $\ell_{p}$ into $\ell_{\infty}$ . Such an idea was previously used to construct subspace embeddings for the $\ell_{p}$ norm [67]. However, since our goal here is to solve linear regression instead of providing an embedding for the whole subspace, we can achieve a much better approximation ratio than previous work [67].

Theorem 8.2.

For any matrix $A\in\mathbb{R}^{n\times d}$ and constant $p>2$ , let $D^{(1)},D^{(2)},\ldots,D^{(R)}$ be $n\times n$ random diagonal matrices, whose diagonal entries are i.i.d. random variables with the same distribution as $E^{-1/p}$ , where $E$ is an exponential random variable. If $R=O(d\log(d/\varepsilon)/\varepsilon^{2})$ , then with constant probability, the following holds:

1. [TABLE]

2. For all $x\in\mathbb{R}^{d}$ ,

[TABLE]

Here $x^{*}\in\mathbb{R}^{d}$ is the optimal solution to the $\ell_{p}$ regression problem $\min_{x}\|Ax-b\|_{p}$ and $C_{p}$ is a constant which is the expectation of $E^{-1/p}$ for an exponential random variable $E$ .

The proof of Theorem 8.2 can be found in Section 8.3.

Now we prove with constant probability, the optimal solution to the optimization problem

[TABLE]

satisfies

[TABLE]

where $x^{*}\in\mathbb{R}^{d}$ is the optimal solution to the $\ell_{p}$ regression problem $\min_{x}\|Ax-b\|_{p}$ .

Notice that with constant probability,

[TABLE]

Thus, we have reduced $(1+\varepsilon)$ -approximate $\ell_{p}$ regression to

[TABLE]

The optimization problem in (9) can be written as a linear program with $R+d=\widetilde{O}(d/\varepsilon^{2})$ variables. For each $i\in[R]$ , we use $v_{i}$ to represent the value of $\|D^{(i)}(Ax-b)\|_{\infty}$ as in Section 8.1, and the goal is to minimize $\sum_{i=1}^{R}v_{i}.$ Furthermore, this reduction can be easily implemented in the distributed setting since each server can independently generate random variables in $D^{(i)}$ associated with its own input rows in $[A~{}b]$ . We can round each entry in $D^{(i)}$ to its nearest integer multiple of $\operatorname{poly}(\varepsilon/d)$ , which is enough for the correctness of Theorem 8.2, but increases the bit complexity of each entry by at most an $O(\log(d/\varepsilon))$ factor.

Using the linear program solvers in Section 10 and Section 11, we have the following theorem.

Theorem 8.3.

$(1+\varepsilon)$ -approximate $\ell_{p}$ regression can be solved by a randomized protocol with communication complexity $\widetilde{O}(sd^{3}L/\varepsilon^{6})$ in the coordinator model, or by a randomized protocol with communication complexity $\widetilde{O}(\min\{sd^{3}L/\varepsilon^{6},sd/\varepsilon^{2}+d^{4}L/\varepsilon^{8}\})$ in the blackboard model.

8.3 Proof of Theorem 8.2

We need the following Bernstein-type lower tail inequality which is due to Maurer [49].

Lemma 8.4 ([49]).

Suppose $X_{1},X_{2},\ldots,X_{n}$ are independent positive random variables that satisfy $\mathrm{E}[X_{i}^{2}]<\infty$ . Let $X=\sum_{i=1}^{n}X_{i}$ . For any $t>0$ we have

[TABLE]

We use the standard $\varepsilon$ -net construction of a subspace in [15].

Definition 8.1.

For any $p\geq 1$ , for a given $A\in\mathbb{R}^{n\times d}$ , let $B=\{Ax\mid x\in\mathbb{R}^{d},\|Ax\|_{p}=1\}$ . We say $\mathcal{N}\subseteq B$ is an $\varepsilon$ -net of $B$ if for any $y\in B$ , there exists a $\widehat{y}\in\mathcal{N}$ such that $\|y-\widehat{y}\|_{p}\leq\varepsilon$ .

Lemma 8.5 ([15]).

For a given $A\in\mathbb{R}^{n\times d}$ , there exists an $\varepsilon$ -net $\mathcal{N}\subseteq B=\{Ax\mid x\in\mathbb{R}^{d},\|Ax\|_{p}=1\}$ with size $|\mathcal{N}|\leq(3/\varepsilon)^{d}$ .

Lemma 8.6 (Auerbach basis [9]).

For any matrix $A\in\mathbb{R}^{n\times d}$ and $p\geq 1$ , there exists a basis matrix $U$ of the column space of $A$ , such that $\|U_{i}\|_{p}=1$ for all $i\in[d]$ , and for any vector $x\in\mathbb{R}^{d}$ ,

[TABLE]

Now we give the proof of Theorem 8.2. Notice that for any fixed vector $y\in\mathbb{R}^{n}$ ,

[TABLE]

where $E$ is an exponential random variable. Moreover, when $p>2$ , both $\mathrm{E}[(E^{-1/p})^{2}]$ and $\mathrm{Var}[E^{-1/p}]$ are bounded by a constant.
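The max-stability property can be checked numerically with a short Python sketch. It uses the standard fact that $\mathrm{E}[E^{-1/p}]=\Gamma(1-1/p)$ for a rate-one exponential $E$, so averaging $\|D^{(i)}y\|_{\infty}$ over independent draws and dividing by this constant should approach $\|y\|_{p}$. The function names and the repetition count are illustrative assumptions.

```python
import math
import random

def inverse_exponential_power(p, rng):
    """Draw E^{-1/p} for E ~ Exp(1), redrawing in the measure-zero case E == 0."""
    e = rng.expovariate(1.0)
    while e == 0.0:
        e = rng.expovariate(1.0)
    return e ** (-1.0 / p)

def scaled_linf_norm(y, p, rng):
    """One draw of ||D y||_inf with D diagonal and i.i.d. entries E^{-1/p}."""
    return max(abs(v) * inverse_exponential_power(p, rng) for v in y)

def estimate_lp_norm(y, p, repetitions, rng):
    """Average of ||D y||_inf over independent draws, divided by C_p."""
    c_p = math.gamma(1.0 - 1.0 / p)   # C_p = E[E^{-1/p}]
    total = sum(scaled_linf_norm(y, p, rng) for _ in range(repetitions))
    return total / (repetitions * c_p)
```

The estimate concentrates only because the second moment $\Gamma(1-2/p)$ is finite, which is exactly where the condition $p>2$ enters.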

We have $\mathrm{E}[\|D^{(i)}y\|_{\infty}]=C_{p}\|y\|_{p}$ . By linearity of expectation, we also have

[TABLE]

We use $U\in\mathbb{R}^{n\times(d+1)}$ to denote an Auerbach basis of the column space of $\widetilde{A}=[A~{}b]$ . We create three events $\mathcal{E}_{1}$ , $\mathcal{E}_{2}$ , $\mathcal{E}_{3}$ . Here $C$ is an absolute constant.

• $\mathcal{E}_{1}$ : $\|D^{(i)}(Ax^{*}-b)\|_{\infty}\leq C\cdot R^{1/p}\cdot\|Ax^{*}-b\|_{p}$ for all $i\in[R]$ .

• $\mathcal{E}_{2}$ : $\|D^{(i)}U_{j}\|_{\infty}\leq C(R\cdot d)^{1/p}$ for all $i\in[R]$ and $j\in[d+1]$ .

• $\mathcal{E}_{3}$ : for all $y\in\mathcal{N}$ ,

[TABLE]

where $\mathcal{N}$ is a $(\operatorname{poly}(\varepsilon/d))$ -net of $\{\widetilde{A}x\mid x\in\mathbb{R}^{d+1},\|\widetilde{A}x\|_{p}=1\}$ . By Lemma 8.5 we have $|\mathcal{N}|\leq(d/\varepsilon)^{O(d)}$ .

According to the cumulative distribution function of $E^{-1/p}$ for an exponential random variable $E$ , and a union bound over $i\in[R]$ , $\mathcal{E}_{1}$ holds with constant probability. Similarly, $\mathcal{E}_{2}$ also holds with constant probability. For each $y\in\mathcal{N}$ , using Maurer’s inequality in Lemma 8.4, we have

[TABLE]

Thus for $R=O(d\log(d/\varepsilon)/\varepsilon^{2})$ , by using a union bound for all $y\in\mathcal{N}$ , with constant probability $\mathcal{E}_{3}$ holds.

Conditioned on $\mathcal{E}_{1}$ , using Bernstein’s inequality, we have

[TABLE]

Thus, for $R=O(d\log(d/\varepsilon)/\varepsilon^{2})$ and $p>2$ ,

[TABLE]

holds with constant probability, which implies Part 1 of Theorem 8.2.

Now for any $y=Ux$ with $\|y\|_{p}=1$ , by definition of the Auerbach basis we have $\|x\|_{\infty}\leq\|y\|_{p}\leq 1$ . Conditioned on $\mathcal{E}_{2}$ , we have,

[TABLE]

Consider any $y=\widetilde{A}x$ with $\|y\|_{p}=1$ . We claim $y$ can be written as

[TABLE]

where for any $j\geq 0$ we have (i) $\frac{y^{j}}{\|y^{j}\|_{p}}\in\mathcal{N}$ and (ii) $\|y^{j}\|_{p}\leq\operatorname{poly}(\varepsilon/d)^{j}$ .

According to the definition of a $(\operatorname{poly}(\varepsilon/d))$ -net, there exists a vector $y^{0}\in\mathcal{N}$ for which $\|y-y^{0}\|_{p}\leq\operatorname{poly}(\varepsilon/d)$ and $\|y^{0}\|_{p}=1$ . If $y=y^{0}$ then we stop. Otherwise we consider the vector $\frac{y-y^{0}}{\|y-y^{0}\|_{p}}$ . Again we can find a vector $\widehat{y}^{1}\in\mathcal{N}$ such that $\left\|\frac{y-y^{0}}{\|y-y^{0}\|_{p}}-\widehat{y}^{1}\right\|_{p}\leq\operatorname{poly}(\varepsilon/d)$ and $\|\widehat{y}^{1}\|_{p}=1$ . Here we set $y^{1}=\|y-y^{0}\|_{p}\cdot\widehat{y}^{1}$ and continue this process inductively.

Thus, conditioned on $\mathcal{E}_{2}$ and $\mathcal{E}_{3}$ , we have for any $y=\widetilde{A}x$ with $\|y\|_{p}=1$ ,

[TABLE]

For any $y=Ax-b$ , by homogeneity, we still have

[TABLE]

which implies Part 2 of Theorem 8.2.

9 Communication Complexity Lower Bound for Linear Programming

In this section, we prove a communication complexity lower bound for testing feasibility of linear programs.

We need the following lemma to construct our hard instance.

Lemma 9.1.

Let $L$ be a sufficiently large integer. We use $m_{i}\in\mathbb{R}^{2}$ to denote the vector

[TABLE]

For any $1\leq i,j\leq 2^{L/100}$ , we have

1. $\|m_{i}\|_{2}^{2}\geq 1+\frac{1}{2^{4L+2}}$ ;

2. For any $i\neq j$ , $\langle m_{i},m_{j}\rangle\leq 1$ .

Proof.

For any $1\leq i\leq 2^{L/100}$ , we have

[TABLE]

For any $1\leq i,j\leq 2^{L/100}$ and $i\neq j$ , we have

[TABLE]

∎

Now we reduce the lopsided set disjointness problem to testing feasibility of linear programs. In this problem, for a choice of universe size $U$ , the last server $P_{s}$ receives an element $u\in[U]$ , and for each $i<s$ , server $P_{i}$ receives a set $S_{i}\subseteq[U]$ . The goal is to test whether there exists $i$ such that $u\in S_{i}$ . We reduce this problem with $U=2^{L/100}$ to testing the feasibility of linear programs for $d=2$ , where $L$ is the bit complexity of the linear program.

For the reduction, server $P_{s}$ adds a constraint $x=m_{u}$ , for the element $u\in[U]$ that $P_{s}$ receives. I.e., server $P_{s}$ forces the solution $x$ to be $m_{u}$ . For each $i<s$ , for each $v\in S_{i}$ , server $P_{i}$ adds a constraint $\langle m_{v},x\rangle\leq 1$ . Here $m_{u}$ and $m_{v}$ are as defined in Lemma 9.1. By Lemma 9.1, this linear program is feasible if and only if $u\notin\bigcup_{i<s}S_{i}$ .
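The reduction can be illustrated with a small Python sketch. The vectors of Lemma 9.1 are not reproduced here, so the sketch substitutes hypothetical stand-ins: points on a circle of squared radius $1+\delta$ at evenly spaced angles, which satisfy the same two properties ($\|m_{i}\|_{2}^{2}>1$ and $\langle m_{i},m_{j}\rangle\leq 1$ for $i\neq j$) in floating point, without the bit-complexity guarantees of the lemma. Indices are 0-based and all names are illustrative.

```python
import math

def stand_in_vectors(U, delta=0.05):
    """Hypothetical substitutes for m_1..m_U (not the vectors of Lemma 9.1)."""
    step = 2.0 * math.pi / U
    # Distinct points must have inner product (1 + delta) * cos(angle) <= 1.
    if U < 2 or (1.0 + delta) * math.cos(step) > 1.0:
        raise ValueError("U and delta are incompatible for this construction")
    r = math.sqrt(1.0 + delta)
    return [(r * math.cos(k * step), r * math.sin(k * step)) for k in range(U)]

def lp_is_feasible(u, server_sets, m):
    """P_s forces x = m[u]; every other server adds <m[v], x> <= 1 for v in its set."""
    x = m[u]
    return all(m[v][0] * x[0] + m[v][1] * x[1] <= 1.0
               for S in server_sets for v in S)
```

Because the constraint of $P_{s}$ pins $x$ to a single point, feasibility reduces to checking that point against every half-plane, and only the half-plane indexed by $u$ itself can be violated.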

In the remaining part of this section, we show the lopsided set disjointness problem has an $\Omega(s\log U/\log s)$ randomized communication complexity lower bound in the coordinator model, which, since $\log U=L/100$ , implies an $\Omega(sL/\log s)$ lower bound for testing feasibility of linear programming, even for $d=2$ . An $\Omega(s+L)$ lower bound also holds in the blackboard model, since when $s=2$ the coordinator model is equivalent to the blackboard model, up to a constant factor in the communication complexity.

We first consider the two-player case, in which Alice receives an element $u\in[U]$ and Bob receives a set $S\subseteq[U]$ . The goal is to test whether $u\in S$ or not. Let $\mu$ be the distribution where $u$ is chosen uniformly at random from $[U]$ , and $S$ is a subset of $[U]$ such that each element of $[U]$ is included independently with probability $1/2$ . Let $\mu_{y}$ be the conditional distribution of $\mu$ given $u\in S$ , and $\mu_{n}$ be the conditional distribution of $\mu$ given $u\notin S$ . In [5, Section 2.2], it has been shown that any communication protocol that succeeds with probability at least $2/3$ on the distribution $\mu$ requires $\Omega(\log U)$ bits of communication in the worst case. By applying Markov’s inequality and stopping the protocol early once the communication complexity is too large, this implies any randomized protocol that succeeds with probability at least $3/4$ on the distribution $\mu$ requires $\Omega(\log U)$ bits of communication in expectation. In fact, this implies a stronger hardness result, that for any protocol that succeeds with probability at least $3/4$ on $\mu$ , its expected communication complexity is $\Omega(\log U)$ on both $\mu_{y}$ and $\mu_{n}$ .

Consider a new distribution $\mu^{\prime}$ which is $\mu_{y}$ with probability $1/s^{2}$ and $\mu_{n}$ with probability $1-1/s^{2}$ . Suppose a protocol $\mathcal{P}$ succeeds with probability at least $1-1/(100s^{2})$ . Then by averaging, $\mathcal{P}$ succeeds with probability at least $4/5$ on both $\mu_{n}$ and $\mu_{y}$ , which implies the expected communication complexity of $\mathcal{P}$ is $\Omega(\log U)$ on both $\mu_{y}$ and $\mu_{n}$ . Now by linearity of expectation, the expected communication complexity of $\mathcal{P}$ on $\mu^{\prime}$ is lower bounded by $\Omega(\log U)$ . This, in particular, implies any protocol that succeeds with probability at least $1-1/(100s^{2})$ on $\mu^{\prime}$ should have expected communication complexity $\Omega(\log U)$ . At this point, Theorem 1.1 in [68] implies that for the $s$ -player case, any communication protocol that succeeds with probability at least $1-1/s^{3}$ has worst case communication complexity at least $\Omega(s\log U)$ . By standard repetition arguments this implies an $\Omega(s\log U/\log s)$ lower bound for protocols that succeed with constant probability.

Formally, we have the following theorem.

Theorem 9.2.

Any randomized protocol that succeeds with probability at least $0.99$ for testing feasibility of linear programs requires $\Omega(sL/\log s)$ bits of communication in the coordinator model and $\Omega(s+L)$ bits of communication in the blackboard model. The lower bound holds even when $d=2$ .

Notice that by Theorem 4.2, testing feasibility of linear systems for $d=2$ requires only $O(s\log L)$ randomized communication complexity. This shows an exponential separation between testing feasibility of linear systems and linear programs, in the communication model.

10 Clarkson’s Algorithm

10.1 The Communication Complexity

In this section, we discuss how to implement Clarkson’s algorithm to solve linear programs in the distributed setting. The protocol is described in Figure 11. During the protocol, each server $P_{i}$ maintains a multi-set $H_{i}$ of constraints (i.e., each constraint can appear more than once in $H_{i}$ ). Initially, $H_{i}$ is the set of constraints stored on $P_{i}$ . Furthermore, the coordinator maintains $|H_{i}|$ , which is initially set to be the number of constraints stored on each server.

The protocol in Figure 11 is basically Clarkson’s algorithm [22], implemented in the distributed setting. Using the analysis in [22], the expected number of iterations is $O(d\log n)$ . The correctness also directly follows from the analysis in [22]. Now we analyze the communication complexity for each iteration.

To implement the sampling process in Step 1, the coordinator first determines the number of constraints to be sampled from each server $P_{i}$ and sends this number to $P_{i}$ . The total communication complexity for this step is $O(s\log n)$ in both the coordinator model and the blackboard model. Then each server $P_{i}$ samples accordingly and sends these constraints to the coordinator. The total communication for this step is $O(d^{3}L)$ in both models.
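The coordinator's allocation step can be sketched in Python. The sketch assumes that Step 1 draws $r$ constraints uniformly without replacement from the union of the multisets $H_{i}$ , which the coordinator can simulate knowing only the sizes $|H_{i}|$ ; the function name and this sampling convention are assumptions rather than details taken from Figure 11.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def allocate_sample_counts(sizes, r, rng):
    """Split r uniform draws from the union of H_1..H_s into per-server counts."""
    total = sum(sizes)
    if r > total:
        raise ValueError("cannot sample more constraints than exist")
    prefix = list(accumulate(sizes))   # prefix[i] = |H_1| + ... + |H_{i+1}|
    counts = [0] * len(sizes)
    # Draw global positions, then map each position to the server that owns it.
    for idx in rng.sample(range(total), r):
        counts[bisect_right(prefix, idx)] += 1
    return counts
```

Each count is at most $r$ , so sending one count to every server stays within the $O(s\log n)$ bound stated above.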

To implement Step 2, we first verify the bit complexity of the optimal solution $x_{R}$ . One of the optimal solutions $x_{R}$ is a vertex of the polyhedron $Ax\leq b$ . From polyhedral theory we know that there exists a non-singular subsystem of $Ax\leq b$ , say $Bx\leq c$ , such that $x_{R}$ is the unique solution of $Bx=c$ . Thus, by Cramer’s rule, each entry of $x$ is a fraction whose numerator and denominator are integers between $-d!2^{dL}$ and $d!2^{dL}$ , and thus can be represented by using at most $O(dL+d\log d)$ bits. This implies the bit complexity of all entries in the vector $x$ calculated at Step 2 is upper bounded by $\widetilde{O}(d^{2}L)$ . Thus the communication complexity for Step 2 is upper bounded by $\widetilde{O}(sd^{2}L)$ in the coordinator model and $\widetilde{O}(d^{2}L)$ in the blackboard model. The communication complexity of the last two steps of the protocol is upper bounded by $O(s\log n)$ in both models. Thus, the expected communication complexity is $\widetilde{O}(sd^{3}L+d^{4}L)$ in the coordinator model and $\widetilde{O}(sd+d^{4}L)$ in the blackboard model.
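The Cramer's rule bound can be verified on a small instance with exact integer arithmetic. The sketch below assumes integer coefficients with absolute value less than $2^{L}$ , and uses Laplace expansion only because $d$ is tiny in the example; the helper names are illustrative.

```python
from math import factorial

def det(M):
    """Exact integer determinant by Laplace expansion along the first row."""
    if len(M) == 1:
        return M[0][0]
    total = 0
    for j, a in enumerate(M[0]):
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * a * det(minor)
    return total

def cramer_solve(B, c):
    """Solve B x = c; x_j is returned as an unreduced (numerator, denominator)."""
    D = det(B)
    if D == 0:
        raise ValueError("B must be non-singular")
    solution = []
    for j in range(len(B)):
        # Replace column j of B by the right-hand side c.
        Bj = [row[:j] + [cj] + row[j + 1:] for row, cj in zip(B, c)]
        solution.append((det(Bj), D))
    return solution

def entry_bound(d, L):
    """The bound d! * 2^(dL) on every determinant of L-bit d x d matrices."""
    return factorial(d) * 2 ** (d * L)
```

A determinant is a sum of $d!$ products of $d$ entries, each below $2^{L}$ in absolute value, which gives the bound directly; taking logarithms yields the $O(dL+d\log d)$ bits per entry used in the analysis.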

Theorem 10.1.

The expected communication complexity of the protocol in Figure 11 is $\widetilde{O}(sd^{3}L+d^{4}L)$ in the coordinator model and $\widetilde{O}(sd+d^{4}L)$ in the blackboard model.

10.2 Running Time of Clarkson’s Algorithm in Unit Cost RAM

In this section, we show how to implement Clarkson’s algorithm in the unit cost RAM model on words of size $O(\log(nd))$ so that the running time is upper bounded by $\widetilde{O}(nd^{\omega}L+\operatorname{poly}(dL))$ , and prove Theorem 1.11.

A description of Clarkson’s algorithm can be found in Figure 11. This algorithm runs in $O(d\log n)$ rounds in expectation. In each round, it samples $O(d^{2})$ constraints $R$ , and calculates an optimal solution $x_{R}$ that satisfies all constraints in $R$ . This optimal solution $x_{R}$ can be calculated using any polynomial time linear programming algorithm, which always has running time $\operatorname{poly}(dL)$ . The bottleneck in the unit cost RAM model is Step 4 of the algorithm in Figure 11, i.e., for each of the $n$ constraints, testing whether $x_{R}$ satisfies the constraint or not. Formally, we just need to output $Ax_{R}$ , and then compare each entry with $b$ . In the remaining part of this section we show how to calculate $Ax_{R}$ in $\widetilde{O}(nd^{\omega}L)$ time.

Since each entry of $x_{R}$ has bit complexity $\widetilde{O}(dL)$ , we first calculate a $d\times d$ matrix $X$ , where each entry of $X$ has bit complexity $\widetilde{O}(L)$ , and the entries $X_{i,1},X_{i,2},\ldots,X_{i,d}$ consist of the first $\widetilde{O}(L)$ bits of $(x_{R})_{i}$ , the second $\widetilde{O}(L)$ bits of $(x_{R})_{i},\ldots,$ the last $\widetilde{O}(L)$ bits of $(x_{R})_{i}$ . Now we calculate $A\cdot X$ . Since all entries in $A$ and $X$ have bit complexity $\widetilde{O}(L)$ , and calculating the product of two $d\times d$ matrices with bit complexity $\widetilde{O}(L)$ requires only $\widetilde{O}(d^{\omega}L)$ time [39], $A\cdot X$ can be calculated in $\widetilde{O}((n/d)\cdot d^{\omega}\cdot L)=\widetilde{O}(nd^{\omega-1}L)$ time by splitting $A$ into $n/d$ blocks of $d$ rows each. Given $AX$ , one can then easily calculate $Ax_{R}$ in $\widetilde{O}(ndL)$ time. Thus, the total expected running time is upper bounded by $O(d\log n(\widetilde{O}(nd^{\omega-1}L)+\widetilde{O}(ndL)+\operatorname{poly}(dL)))=\widetilde{O}(nd^{\omega}L+\operatorname{poly}(dL))$ .
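The chunking step can be illustrated with plain integers. This is a minimal sketch, assuming the entries of $x_{R}$ have been scaled to non-negative integers of at most $d\cdot w$ bits and that chunks are stored least significant first; the chunk width `w` and the function names are hypothetical, and the schoolbook `matmul` stands in for the fast matrix multiplication of [39].

```python
def split_into_chunks(x, d, w):
    """Return the matrix X with X[i][j] = j-th w-bit chunk of x[i]."""
    mask = (1 << w) - 1
    return [[(xi >> (j * w)) & mask for j in range(d)] for xi in x]


def matmul(A, X):
    """Schoolbook product of an n x d matrix A and a d x d matrix X."""
    cols = list(zip(*X))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def recombine(AX, w):
    """Recover A x from A X via (A x)_r = sum_j 2^(j*w) * (A X)[r][j]."""
    return [sum(v * (1 << (j * w)) for j, v in enumerate(row)) for row in AX]


# Illustrative instance: d = 2 variables, chunks of w = 2 bits.
A = [[1, -2], [3, 4], [0, 5]]
x = [11, 6]
X = split_into_chunks(x, d=2, w=2)
print(recombine(matmul(A, X), w=2))
```

Every entry of `X` fits in `w` bits, so the expensive product only involves short numbers; the long numbers reappear only in the cheap recombination step.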

10.3 Smoothed Analysis of Communication Complexity

In this section we define our model for smoothed analysis of communication complexity of communication protocols for solving linear programming.

For a randomized communication protocol $\mathcal{P}$ , we use $C_{\mathcal{P}}(A,b,c)$ to denote its communication complexity on the linear programming instance

[TABLE]

where $A\in\mathbb{R}^{n\times d}$ , $b\in\mathbb{R}^{n}$ and $c\in\mathbb{R}^{d}$ . The standard definition [61] of smoothed analysis assumes that each entry of $A$ is perturbed by i.i.d. Gaussian noise with zero mean and standard deviation $\sigma$ . However, since we are measuring the communication complexity in terms of bit complexity, we cannot allow the noise to be arbitrary real numbers. Instead, in our model, we use discrete Gaussian random variables as the noise.

Formally, we use $\mathsf{trunc}_{t}:\mathbb{R}\to\mathbb{R}$ to denote the function that rounds a real number to its nearest integer multiple of $2^{-t}$ . For notational convenience, we define $\mathsf{trunc}_{\infty}(x)=x$ . We say a communication protocol solves the linear program instance (10) with smoothed communication complexity $SC_{\mathcal{P},\sigma,t}(A,b,c)$ if with probability at least $0.99$ , the protocol correctly solves the instance

[TABLE]

with communication complexity $C_{\mathcal{P}}(A+G_{t,\sigma},b,c)\leq SC_{\mathcal{P},\sigma,t}(A,b,c)$ , where all entries of $G_{t,\sigma}$ are i.i.d. copies of $\mathsf{trunc}_{t}(g)$ and $g$ is a Gaussian random variable with zero mean and standard deviation $\sigma\leq 1$ . Here the probability is taken over the randomness of the protocol and the noise $G_{t,\sigma}$ . Notice that when $t=\infty$ , $G_{\infty,\sigma}$ is a matrix whose entries are all i.i.d. Gaussian random variables with standard deviation $\sigma$ .
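The noise model can be sketched as follows. This is an illustration only: the function names are hypothetical, ties are rounded upward (the text does not specify tie-breaking), and floating-point Gaussians from the standard library stand in for exact real-valued samples.

```python
import math
import random


def trunc(x, t):
    """Round x to the nearest integer multiple of 2^(-t); trunc(x, inf) = x."""
    if t == math.inf:
        return x
    scale = 2 ** t
    return math.floor(x * scale + 0.5) / scale


def discrete_gaussian_matrix(n, d, sigma, t, rng):
    """An n x d matrix of i.i.d. copies of trunc_t(g), with g ~ N(0, sigma^2)."""
    return [[trunc(rng.gauss(0.0, sigma), t) for _ in range(d)] for _ in range(n)]


# Illustrative use: a 3 x 2 noise matrix with sigma = 0.5 and t = 10.
G = discrete_gaussian_matrix(3, 2, sigma=0.5, t=10, rng=random.Random(0))
print(G)
```

Each entry of `G` is a multiple of $2^{-t}$ and differs from the underlying Gaussian sample by at most $2^{-(t+1)}$, which is the property the later proofs rely on when comparing $G_{t,\sigma}$ with its untruncated counterpart.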

10.4 Smoothed Analysis of Clarkson’s Algorithm

In this section, we present our variant of Clarkson’s algorithm for solving smoothed linear programming instances. The protocol is described in Figure 12. The main difference is in Step 2, where the coordinator rounds each entry of the solution $x_{R}$ before sending it to other servers.

10.4.1 Correctness of the Protocol

We first prove the correctness of the protocol. Our plan is to show if $t=\Omega(\log(nd/\sigma)+L)$ , then our modified Clarkson’s algorithm follows the computation path of the original Clarkson’s algorithm in Figure 11 when executing on the perturbed instance, with high probability, and thus prove the correctness of the protocol.

We need the following bound on the condition number of a matrix.

Lemma 10.2.

For a matrix $B\in\mathbb{R}^{d\times d}$ with all entries in $[0,2^{L}-1]$ , for any integer $n>0$ , $\sigma\leq 1$ and $t\geq\Omega(\log(nd/\sigma)+L)$ , we have

[TABLE]

and

[TABLE]

Proof.

To prove the first inequality, notice that

[TABLE]

Thus, the first inequality follows from tail inequalities of the Gaussian distribution.

To analyze $\|(B+G_{t,\sigma})^{-1}\|_{2}$ , we write $G_{\infty,\sigma}$ to denote a matrix whose entries are the Gaussian random variables of $G_{t,\sigma}$ before applying the truncation operation. Notice that this implies $\|G_{t,\sigma}-G_{\infty,\sigma}\|_{2}\leq\operatorname{poly}(d)\cdot 2^{-t}\leq 1/\operatorname{poly}(nd/\sigma)$ . We invoke Theorem 3.3 in [56], which states that with probability $1-1/\operatorname{poly}(nd)$ ,

[TABLE]

which implies with probability $1-1/\operatorname{poly}(nd)$ ,

[TABLE]

∎

Lemma 10.3.

During the execution of the protocol in Figure 12, each time Step 2 is executed, if $x_{R}\neq 0$ , with probability at least $1-1/\operatorname{poly}(nd)$ , $x_{R}$ satisfies

[TABLE]

Proof.

From polyhedral theory we know that there exists a non-singular subsystem of the sampled $9d^{2}$ constraints $R$ , say $Bx\leq c$ , such that $x_{R}$ is the unique solution of $Bx=c$ .

If $c\neq 0$ , since each entry of $B$ was perturbed by discrete Gaussian noise, and all entries of $c$ are integers in the range $[0,2^{L}-1]$ , by Lemma 10.2 we have

[TABLE]

Furthermore, since $\|B\|_{2}\leq\operatorname{poly}(nd\cdot 2^{L}/\sigma)$ ,

[TABLE]

If $c=0$ , then we must have $x_{R}=0$ , since $B$ is non-singular and thus $x_{R}=0$ is the unique solution of $Bx=c$ . ∎

Now we create a family of events $\{\mathcal{E}_{i}\}_{i=1}^{\infty}$ . We use $\mathcal{E}_{i}$ to denote the event that, during the $i$ -th loop of the execution of the protocol in Figure 12, for each constraint $h\notin R$ , the constraint $h$ can be satisfied by $x_{R}$ if and only if it can be satisfied by $\widehat{x}_{R}$ . Notice that for those constraints in $R$ , $x_{R}$ can always satisfy them, by definition of $x_{R}$ .

Now we show that for each $\mathcal{E}_{i}$ , the probability that $\mathcal{E}_{i}$ holds is at least $1-1/\operatorname{poly}(nd)$ . This shows that, with high probability, our algorithm follows the computation path of the original Clarkson’s algorithm in Figure 11 when executing on the perturbed instance. Since the original Clarkson’s algorithm in Figure 11 terminates in $O(d\log n)$ rounds with probability at least $0.999$ , the correctness of our algorithm follows by applying a union bound over all events $\{\mathcal{E}_{i}\}_{i=1}^{O(d\log n)}$ .

To show that for each $\mathcal{E}_{i}$ , the probability that $\mathcal{E}_{i}$ holds is at least $1-1/\operatorname{poly}(nd)$ , by applying a union bound over all constraints, it suffices to show that for each constraint $h\notin R$ , $x_{R}$ can satisfy $h$ if and only if $\widehat{x}_{R}$ can satisfy $h$ , with probability $1-1/\operatorname{poly}(nd)$ .

Lemma 10.4.

For each constraint $h\notin R$ , with probability $1-1/\operatorname{poly}(nd)$ , $x_{R}$ can satisfy $h$ if and only if $\widehat{x}_{R}$ can satisfy $h$ .

Proof.

If $x_{R}=0$ , then $\widehat{x}_{R}=x_{R}=0$ , in which case the lemma follows trivially. Thus we assume $x_{R}\neq 0$ in the remaining part of this proof.

The constraint $h$ can be written as $(a_{h}+g_{\sigma,t})x\leq b_{h}$ , for some vector $a_{h}\in\mathbb{R}^{d}$ and some $b_{h}\in\mathbb{R}$ , and all entries of $g_{\sigma,t}\in\mathbb{R}^{d}$ are i.i.d. copies of $\mathsf{trunc}_{t}(g)$ and $g$ is a Gaussian random variable with zero mean and $\sigma$ standard deviation. Notice that since $h\notin R$ , the vector $g_{\sigma,t}$ and the vector $x_{R}$ are independent. By Lemma 10.3, with probability at least $1-1/\operatorname{poly}(nd)$ ,

[TABLE]

Furthermore, the probability that $x_{R}$ can satisfy $h$ but $\widehat{x}_{R}$ cannot satisfy $h$ , or $x_{R}$ cannot satisfy $h$ but $\widehat{x}_{R}$ can satisfy $h$ , is at most

[TABLE]

We first analyze the right hand side of the inequality. Notice that $\|a_{h}\|_{2}\leq\operatorname{poly}(d\cdot 2^{L})$ , and $\|g_{\sigma,t}\|_{2}\leq\operatorname{poly}(nd)$ with probability at least $1-1/\operatorname{poly}(nd)$ by tail inequalities of the Gaussian distribution and $\sigma\leq 1$ . Moreover, $\|x_{R}-\widehat{x}_{R}\|_{2}\leq\operatorname{poly}(d)\cdot\delta\leq 1/\operatorname{poly}(dn\cdot 2^{L}/\sigma)$ . Thus by Cauchy-Schwarz,

[TABLE]

On the other hand, if we write $g_{\sigma,\infty}$ to denote a vector whose entries are the Gaussian random variables of $g_{\sigma,t}$ before applying the truncation operation, then $\|g_{\sigma,\infty}-g_{\sigma,t}\|_{2}\leq\operatorname{poly}(d)2^{-t}$ .

Thus, by taking $t=\Omega(\log(nd/\sigma)+L)$ , we have

[TABLE]

By the lower tail inequality of the Gaussian distribution and the fact that $\|x_{R}\|_{2}\geq 1/\operatorname{poly}(nd\cdot 2^{L}/\sigma)$ , we have with probability at least $1-1/\operatorname{poly}(nd)$ ,

[TABLE]

Thus, the lemma follows by appropriately adjusting the constant in $O(1/\operatorname{poly}(nd\cdot 2^{L}/\sigma))=\delta$ . ∎
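The Cauchy-Schwarz step of this proof can be illustrated numerically. The sketch below only covers the deterministic part of the argument: rounding each coordinate to the nearest multiple of $\delta$ moves $\langle a,x\rangle$ by at most $\|a\|_{2}\cdot\sqrt{d}\cdot\delta/2$ , so a constraint whose slack exceeds that amount is decided identically by $x_{R}$ and $\widehat{x}_{R}$ . The function names and the sample values are hypothetical; the probabilistic part (that the slack is large with high probability) is not simulated.

```python
import math


def round_to_grid(x, delta):
    """Round each coordinate of x to its nearest integer multiple of delta."""
    return [delta * math.floor(v / delta + 0.5) for v in x]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm2(u):
    return math.sqrt(dot(u, u))


def rounding_cannot_flip(a, b, x, delta):
    """Sufficient condition for x and its rounding to agree on <a, x> <= b."""
    max_shift = norm2(a) * math.sqrt(len(x)) * delta / 2
    return abs(dot(a, x) - b) > max_shift


# Illustrative constraint 3*x1 + 4*x2 <= 1 and a candidate point x.
a, b = [3.0, 4.0], 1.0
x = [0.123, 0.456]
x_hat = round_to_grid(x, delta=0.01)
print(rounding_cannot_flip(a, b, x, 0.01), dot(a, x) <= b, dot(a, x_hat) <= b)
```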

10.4.2 Communication Complexity of the Algorithm

The analysis in the preceding section shows that with high probability, our modified Clarkson’s algorithm follows the computation path of the original Clarkson’s algorithm, and thus also terminates within $O(d\log n)$ rounds with probability at least $0.999$ . Furthermore, with high probability, the discrete Gaussian noise of all entries is upper bounded by $O(nd)$ . Thus, the bit complexity of sending each constraint will be $\widetilde{O}(d(L+t))$ , with high probability.

The sampling process in Step 1 requires $\widetilde{O}(d^{3}(L+t)+s)$ bits of communication to sample $O(d^{2})$ constraints. To implement Step 2, we need to verify the bit complexity of $\widehat{x}_{R}$ . Since we round each entry of $x_{R}$ to its nearest integer multiple of $\delta$ , and by Lemma 10.3, with high probability, $\|x_{R}\|_{2}\leq\operatorname{poly}(nd\cdot 2^{L}/\sigma)$ , the communication complexity for sending $\widehat{x}_{R}$ is upper bounded by $\widetilde{O}(sd(L+\log(1/\sigma)))$ . The communication complexity of the last two steps of the protocol is still upper bounded by $O(s\log n)$ . Thus, the smoothed communication complexity is $\widetilde{O}(sd^{2}(L+\log(1/\sigma))+d^{4}(L+t))$ in the coordinator model.

Theorem 10.5.

For $t=\Omega(\log(nd/\sigma)+L)$ , the protocol in Figure 12 correctly solves smoothed linear programming with probability at least $0.99$ , and the smoothed communication complexity is

[TABLE]

in the coordinator model.

11 The Center of Gravity Method

In this section, we discuss how to implement the center-of-gravity cutting-plane method [35] in the distributed setting. The description of the protocol can be found in Figure 13.

The servers each maintain a polytope $P$ (the same one for all servers), adding a constraint in each iteration. Each server also maintains the center of the polytope $z$ and its covariance $C$ .

For any vector $a\in\mathbb{R}^{d}$ , its $\varepsilon$ -rounding $\widetilde{a}$ w.r.t. $C$ is defined as follows: Let $B=C^{1/2}$ . We take the unit vector $B^{T}a/\|B^{T}a\|_{2}$ and round it down to the nearest multiple of $\varepsilon$ in each coordinate. Since each coordinate changes by less than $\varepsilon$ , we have $\|\widetilde{a}-B^{T}a/\|B^{T}a\|_{2}\|_{2}\leq\varepsilon\sqrt{d}$ .
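The $\varepsilon$ -rounding can be sketched in a few lines. This is an illustration under the assumption that the matrix $B=C^{1/2}$ has already been computed and is passed in as a list of rows; the function name `eps_rounding` is hypothetical.

```python
import math


def eps_rounding(a, B, eps):
    """Return (a_tilde, u): u = B^T a / ||B^T a||_2 and its coordinate-wise
    rounding down to the nearest multiple of eps."""
    d = len(a)
    # (B^T a)_j = sum_i B[i][j] * a[i]
    v = [sum(B[i][j] * a[i] for i in range(d)) for j in range(d)]
    nrm = math.sqrt(sum(c * c for c in v))
    if nrm == 0:
        raise ValueError("B^T a is the zero vector")
    u = [c / nrm for c in v]
    a_tilde = [eps * math.floor(c / eps) for c in u]
    return a_tilde, u


# Illustrative use with B equal to the identity matrix.
a_tilde, u = eps_rounding([3.0, 4.0], [[1.0, 0.0], [0.0, 1.0]], eps=0.25)
err = math.sqrt(sum((p - q) ** 2 for p, q in zip(a_tilde, u)))
print(a_tilde, err <= 0.25 * math.sqrt(2))
```

Because every coordinate of `a_tilde` is an integer multiple of `eps` with absolute value at most roughly $1$ , it can be described by about $\log(1/\varepsilon)$ bits per coordinate.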

If each server were to report the exact violated constraint, the volume of $P$ would drop by a constant factor in each iteration. To reduce the communication, we round the constraint and shift it away a bit to make sure that the rounded constraint (1) is still valid for the target LP and (2) it is close enough that the volume still drops by a constant factor.

Lemma 11.1 ([12]).

Let $z$ be the center of gravity of an isotropic convex body $K$ in $\mathbb{R}^{d}$ . Then, for any halfspace $H$ within distance $t$ of $z$ , we have

[TABLE]

Lemma 11.2.

For $\varepsilon<0.1/d^{3/2}$ , the volume of the polytope $P$ maintained by each server drops by a constant factor in each iteration.

Proof.

Assume without loss of generality that $P$ is isotropic. If the centroid $z=0$ is not feasible, we get a violated constraint such that the entire feasible region lies in the halfspace $a\cdot x\leq 0$ with $\|a\|_{2}=1$ . Now we replace $a$ by $\widetilde{a}$ . As a result,

[TABLE]

Here we used the fact that in isotropic position any convex body is contained in a ball of radius $d$ , so $\|x\|_{2}\leq d$ for all of $P$ and therefore for the feasible region. Thus the constraint imposed by the algorithm is valid.

Next, we note that the distance of the constraint from the origin is at most $\varepsilon d^{3/2}$ , so for $\varepsilon<0.1/d^{3/2}$ , it is less than $0.1$ (in isotropic position). By Lemma 11.1, with $t=0.1$ , the volume of $P$ drops by a constant factor. ∎

Theorem 11.3.

The protocol in Figure 13 is a deterministic protocol for solving linear programming with communication complexity $O(sd^{3}L\log^{2}d)$ in both the coordinator model and the blackboard model.

Proof.

The algorithm runs for $T=O(d^{2}L\log d)$ rounds. To see this, we note that each vertex of the feasible region is the solution of a subset of the linear inequalities taken to be equalities. Thus, each coordinate of each vertex is a ratio of two determinants of matrices whose entries are $L$ -bit numbers, and so the maximum distance of a vertex from the origin is $R=d^{O(dL)}$ , which upper bounds the volume by $d^{O(d^{2}L)}$ . The smallest any coordinate can be is similarly $d^{-O(dL)}$ . The minimum volume we need to go to is the volume spanned by a simplex of vertices, which itself is a determinant with entries of this size. Thus, the volume is at least $d^{-O(d^{2}L)}$ . Since the volume of the polytope maintained drops by a constant factor in each iteration (the ellipsoid method uses the same argument, except that each round reduces the volume by only a factor of $(1-1/d)$ [34]), the number of rounds is $O(d^{2}L\log d)$ . Each round includes a broadcast of a single vector with $O(d\log d)$ bits, because the size of the $\varepsilon$ -net used is $d^{O(d)}$ . By viewing the objective function as a constraint, we note that the volume bounds used above apply to the optimization version as well. At the end, we use diophantine approximation to get an exact solution [34]. ∎
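The counting in this proof can be written out as a short worked calculation. It assumes, as in the proof of Theorem 11.4, that broadcasting one $O(d\log d)$ -bit vector to $s$ servers costs $O(sd\log d)$ bits per round:

$$T=O\!\left(\log\frac{d^{O(d^{2}L)}}{d^{-O(d^{2}L)}}\right)=O(d^{2}L\log d),\qquad T\cdot O(sd\log d)=O(sd^{3}L\log^{2}d).$$

The first equality uses the constant-factor volume drop per round, so the number of rounds is the logarithm of the ratio between the initial and final volumes.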

A similar argument applies to general convex programming.

Theorem 11.4.

The communication complexity of the protocol in Figure 13 for solving convex programming is $O(sd^{2}\log d\log(Rd/\varepsilon))$ .

Proof.

The initial volume is at most $R^{d}$ and the algorithm can stop when the volume is $(d/\varepsilon)^{-O(d)}$ . Therefore the number of rounds is $O(d\log(Rd/\varepsilon))$ . Each round uses $O(sd\log d)$ bits giving the final bound. ∎

12 Seidel’s Algorithm

We give an alternative constant-dimensional linear programming algorithm in the blackboard model, based on Seidel’s classical algorithm [58]. Here we additionally assume that each constraint in the linear program is placed on a random server. This assumption is essential to get rid of the $\log n$ dependence in the communication complexity. We also assume that the linear program is bounded.

To implement Seidel’s algorithm in the blackboard model, we go through all servers $P_{1},P_{2},\ldots,P_{s}$ , and for each server $P_{i}$ , we go through all constraints stored on $P_{i}$ in a random order. We maintain the optimal solution $x^{*}$ to the set of constraints that we have already gone through. For a new constraint $\langle a,x\rangle\leq b$ , the current server first checks whether the constraint is satisfied by $x^{*}$ or not. The current server proceeds to the next constraint if it is indeed satisfied. If it is not satisfied, then the current constraint $\langle a,x\rangle\leq b$ must be one of the $d$ constraints that determine the new optimal solution. In this case, the current server broadcasts the current constraint $\langle a,x\rangle\leq b$ to all other servers, and makes a recursive call to figure out the optimal solution, by adding an equality constraint $\langle a,x\rangle=b$ to the set of constraints. Notice that if the first server $P_{1}$ finds a violated constraint, $P_{1}$ does not need to broadcast the violated constraint, since $P_{1}$ can simply add the equality constraint to the beginning of all constraints owned by $P_{1}$ .

One major difference between the classical Seidel’s algorithm and our implementation is that each time we make a recursive call, we do not randomly permute the constraints again. Instead, the order in which we go through the servers is fixed across all recursive calls. Due to this difference, there will be subtle dependence between the communication complexity of different recursive calls.

We use $\mathcal{E}$ to denote the event that the number of constraints stored on $P_{1}$ is at least $\Omega(n/s)$ . By a Chernoff bound, $\mathcal{E}$ holds with probability at least $0.99$ . For the $i$ -th constraint (in the order that we go through all constraints), we let the random variable $V_{i}$ be $1$ if the $i$ -th constraint is one of the $d$ constraints that determine the optimal solution among the first $i$ constraints, and $0$ otherwise.

Since each constraint in the linear program is placed on a random server, by standard backward analysis, $\mathrm{E}[V_{i}]=d/i$ . Furthermore, since $V_{i}\geq 0$ , $\mathrm{E}[V_{i}\mid\mathcal{E}]\leq\mathrm{E}[V_{i}]/\mathrm{Pr}[\mathcal{E}]=O(d/i)$ . However, conditioned on $\mathcal{E}$ , the first $\Omega(n/s)$ constraints will not be broadcast and there will be no recursive calls associated with them, since they are stored on the first server $P_{1}$ . Thus, conditioned on $\mathcal{E}$ , the expected number of broadcasts (and thus recursive calls) is upper bounded by

[TABLE]
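The effect of skipping the constraints held by $P_{1}$ can be seen from the tail of the harmonic sum. The sketch below is an illustration under simplifying assumptions: exactly $\lfloor n/s\rfloor$ constraints sit on $P_{1}$ , each later constraint triggers a broadcast with probability exactly $d/i$ , and the function name is hypothetical.

```python
from fractions import Fraction
import math


def expected_broadcasts(n, s, d):
    """Exact value of the sum of d/i for i from floor(n/s)+1 to n."""
    start = n // s + 1
    return d * sum(Fraction(1, i) for i in range(start, n + 1))


# Illustrative parameters: n = 1000 constraints, s = 10 servers, d = 3.
value = float(expected_broadcasts(1000, 10, 3))
print(value, 3 * math.log(10))
```

Since $\sum_{i=m+1}^{n}1/i\leq\ln(n/m)$ , the sum is at most $d\ln s$ when $s$ divides $n$ , which is how the $\log n$ factor of the full harmonic sum is replaced by $\log s$ .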

We use $\mathcal{F}_{d}$ to denote the event that there are at most $O(d^{2}\log s)$ recursive calls made at the top layer of the recursive tree corresponding to Seidel’s algorithm. Conditioned on $\mathcal{E}$ , by Markov’s inequality, $\mathcal{F}_{d}$ holds with probability at least $1-1/(100d)$ . Similarly, we use $\mathcal{F}_{d-1}$ to denote the event that there are at most $O((d^{2}\log s)^{2})$ recursive calls made at the second layer of the recursive tree corresponding to Seidel’s algorithm. Conditioned on $\mathcal{F}_{d}$ and $\mathcal{E}$ , again by Markov’s inequality, $\mathcal{F}_{d-1}$ holds with probability at least $1-1/(100d)$ . We similarly define $\mathcal{F}_{d-2},\mathcal{F}_{d-3},\ldots,\mathcal{F}_{1}$ . Thus, conditioned on $\mathcal{E}$ , with probability at least $0.99$ , $\mathcal{F}_{i}$ holds for all $i\in[d]$ . Since $d$ is a constant, this implies that the total number of broadcasts is upper bounded by $O(\log^{d}s)$ .

In each broadcast, the current server needs to broadcast the current constraint, which has bit complexity $O(dL)$ . Moreover, after the recursive call, the current server broadcasts the current solution vector $x^{*}$ . By polyhedral theory, $x^{*}$ can be achieved by setting $d$ inequality constraints to be equality constraints. Thus, by Cramer’s rule, each entry of $x^{*}$ has bit complexity $O(d\log d+dL)$ . The total communication complexity is hence upper bounded by $O(s+L\cdot\log^{d}s)$ , with probability at least $0.9$ . Here the randomness is over the initial random assignment of each constraint and the random coins tossed by the algorithm.

Theorem 12.1.

Seidel’s algorithm can be used to solve linear programs in constant dimension with $O(s+L\log^{d}s)$ communication in the blackboard model, if each constraint in the linear program is placed on a random server. Here the randomness is over the initial random assignment of each constraint and the random coins tossed by the algorithm.

13 Singularity Probability

The goal of this section is to prove Theorem 3.1. We restate it here for convenience.

Theorem 3.1 (restated). Let $M_{n}$ be a matrix whose entries are i.i.d. random variables with the same distribution as $\mathcal{B}_{t}$ . For sufficiently large $t$ ,

[TABLE]

where $C>0$ is an absolute constant.

Our proof of Theorem 3.1 follows very closely the proof of Theorem 1.5 in [63]. Throughout this section we use $\lambda$ to denote $t^{-1/2}$ . We use $X_{i}$ to denote the $i$ -th row of $M_{n}$ .

We need the following lemma on generalized binomial distributions.

Lemma 13.1.

We have

[TABLE]

and

[TABLE]

Here $c_{1}$ and $c_{2}$ are absolute constants.

Proof.

By Stirling’s approximation we have

[TABLE]

which proves (12). To prove (13), by a Chernoff bound we have that the number of non-zero terms in the summation of $\mathcal{B}_{t}^{(\lambda e^{-\lambda})}$ is $\Theta(t\lambda e^{-\lambda})=\Theta(t^{1/2})$ with probability $1-\exp(-\Omega(t^{1/2}))$ . Conditioned on this event, we can then prove (13) by using the same estimation as (12). ∎

The following lemma is a direct implication of Lemma 13.1 and Odlyzko’s results in [53]. See also Lemma 2.1 in [63] and Section 3.2 in [41].

Lemma 13.2.

Let $W\subseteq\mathbb{R}^{n}$ be an arbitrary subspace and let $X^{(\mu)}\in\mathbb{R}^{n}$ be a random vector whose entries are i.i.d. random variables with the same distribution as $\mathcal{B}_{t}^{(\mu)}$ . We have

[TABLE]

and

[TABLE]

By Lemma 5.1 in [63], we have

[TABLE]

which implies we only need to consider the case when $X_{1},X_{2},\ldots,X_{n}$ span a hyperplane.

We say a hyperplane $V$ is non-trivial if $V$ is spanned by its intersection with $\{-t,-(t-1),\ldots,-1,0,1,\ldots,t-1,t\}^{n}$ . Notice that a hyperplane $V$ has

[TABLE]

only when $V$ is non-trivial. Thus, we focus only on non-trivial hyperplanes in the remaining part of the proof.

Definition 13.1.

Let $X\in\mathbb{R}^{n}$ be a random vector whose entries are i.i.d. random variables with the same distribution as $\mathcal{B}_{t}$ . For a hyperplane $V\subseteq\mathbb{R}^{n}$ , define the discrete codimension $d(V)$ of $V$ to be the unique integer multiple of $1/nt$ such that

[TABLE]

According to the definition, it is clear from Lemma 13.2 that $1\leq d(V)\leq O(n)$ .

We first dispose of hyperplanes with high discrete codimension using the following lemma, which is a direct corollary of Lemma 1 in [41].

Lemma 13.3.

Suppose $X\in\mathbb{R}^{n}$ is a random vector whose entries are i.i.d. random variables with the same distribution as $\mathcal{B}_{t}$ . Then

[TABLE]

Let $1/2\geq\varepsilon>0$ be a constant to be determined. Using Lemma 13.3, we have

[TABLE]

Thus, in the remaining part of the proof we will focus only on the case when $d(V)\leq(\varepsilon-o(1))n$ .

We say a hyperplane $V$ is non-degenerate if its normal vector $n(V)$ satisfies $\|n(V)\|_{0}\geq\left\lceil\log\log n/\log t\right\rceil$ . Here we use $\|n(V)\|_{0}$ to denote the number of non-zero entries in the normal vector $n(V)$ . The following lemma, which is a simple adaptation of Lemma 5.3 in [63], provides a crude estimate of the number of degenerate hyperplanes.

Lemma 13.4.

The number of degenerate non-trivial hyperplanes is at most $t^{o(n)}$ .

Combining Lemma 13.2 and Lemma 13.4, we then have

[TABLE]

Thus, we can just focus on non-degenerate hyperplanes.

The following theorem, which first appeared in [41] as Theorem 2 (see also Section 7 in [63]), is based on Fourier-analytic arguments by Halász [38, 37].

Theorem 13.5.

Suppose $V\subseteq\mathbb{R}^{n}$ is a non-trivial hyperplane. Let $Y^{(\mu)}\in\mathbb{R}^{n}$ be a random vector whose entries are i.i.d. random variables with the same distribution as $\mathcal{B}^{(\mu)}$ , let $\lambda<1$ be a positive number, and let $k$ be a positive integer such that $4\lambda k^{2}<1$ . We have

[TABLE]

where we use $n(V)$ to denote the normal vector of $V$ and $\|n(V)\|_{0}$ to denote the number of non-zero entries of $n(V)$ .

Corollary 13.6.

Suppose $W\subseteq\mathbb{R}^{n}$ is a non-degenerate non-trivial hyperplane. Let $X^{(\mu)}\in\mathbb{R}^{n}$ be a random vector whose entries are i.i.d. random variables with the same distribution as $\mathcal{B}_{t}^{(\mu)}$ . For sufficiently large $t$ , we have

[TABLE]

Proof.

We note that

[TABLE]

where $Y^{(\mu)}_{i,j}$ are i.i.d. random variables with the same distribution as $\mathcal{B}^{(\mu)}$ and $n_{i}(W)$ is the $i$ -th coordinate of the normal vector $n(W)$ . This enables one to apply Theorem 13.5. Notice that when applying Theorem 13.5 we have $\|n(V)\|_{0}=t\|n(W)\|_{0}$ , since each non-zero entry of $n(W)$ appears $t$ times in the summation of (14). Recall that $\lambda=t^{-1/2}$ . We set $k$ to be an integer which is at least $\Omega(t^{1/4})$ . Since $W$ is non-degenerate, we have $\|n(V)\|_{0}=t\|n(W)\|_{0}\geq t\cdot\left\lceil\log\log n/\log t\right\rceil$ , which implies

[TABLE]

The correctness of the corollary thus follows from our choice of $k$ . ∎

For a non-degenerate non-trivial hyperplane $V$ which satisfies $1\leq d(V)\leq(\varepsilon-o(1))n$ , define $A_{V}$ to be the event that

[TABLE]

where $X_{i}^{(\lambda e^{-\lambda})}$ are independent random vectors in $\mathbb{R}^{n}$ whose entries are i.i.d. random variables with the same distribution as $\mathcal{B}_{t}^{(\lambda e^{-\lambda})}$ and $X_{i}^{\prime}$ are random vectors in $\mathbb{R}^{n}$ whose entries are i.i.d. random variables with the same distribution as $\mathcal{B}_{t}$ . Here $\eta=3d(V)/n$ and $\varepsilon^{\prime}=\min\{\eta,\varepsilon\}$ where $1/2\geq\varepsilon>0$ is a constant to be determined.

We first prove that

[TABLE]

To prove this, we define $A_{V}^{\prime}$ to be the event that

[TABLE]

By Corollary 13.6,

[TABLE]

Now we show that

[TABLE]

According to the definition of discrete codimension $d(V)$ , we have

[TABLE]

By Corollary 13.6 we have

[TABLE]

On the other hand, by Lemma 13.2, we have

[TABLE]

Thus,

[TABLE]

which implies

[TABLE]

Using the estimation given above, for sufficiently large $t$ , we have

[TABLE]

since $(1-\eta)n=n-3d(V)$ .

Similarly, when $\varepsilon^{\prime}<\eta$ , i.e., $\varepsilon^{\prime}=\varepsilon$ , we have

[TABLE]

Again we have

[TABLE]

We define $B_{V}$ to be the event that $X_{1},X_{2},\ldots,X_{n}$ span the hyperplane $V$ . Since $A_{V}$ and $B_{V}$ are independent, we have

[TABLE]

Consider a set

[TABLE]

which satisfies $A_{V}\land B_{V}$ . There exist $\varepsilon^{\prime}n-1$ vectors

[TABLE]

such that

[TABLE]

By using a union bound of size $\binom{n}{\varepsilon^{\prime}n-1}=2^{nh(\varepsilon^{\prime})+o(n)}$ , we can just assume $j_{i}=i$ . Here we use $h(\varepsilon^{\prime})$ to denote the binary entropy function. Thus,

[TABLE]

Thus, by using (15) and Lemma 13.2 we have

[TABLE]

Notice that

[TABLE]

Thus, for any $1\leq d_{0}\leq(\varepsilon-o(1))n$ and sufficiently large $t$ , we have

[TABLE]

Here the second inequality follows since $2^{nh(\varepsilon^{\prime})}\leq t^{nh(\varepsilon^{\prime})/5}$ for sufficiently large $t$ . The third inequality is due to the monotonicity of the binary entropy function $h(\cdot)$ on $[0,1/2]$ and the fact that $0<\varepsilon^{\prime}\leq\varepsilon\leq 1/2$ . The fourth inequality follows from the fact that $d_{0}/n\leq\varepsilon$ . The last inequality follows by setting $\varepsilon$ to be the solution of $h(\varepsilon)+3\varepsilon=1$ . A numerical calculation shows that $\varepsilon>0.177$ . Theorem 3.1 thus follows by using a union bound for all possible $d_{0}$ , which has at most $O(n^{2}t)=t^{o(n)}$ different valid values and setting $C=\varepsilon/5$ .

We remark that the choice of parameters here is mainly for simplicity and not optimized.

14 Discussion

The lens of communication complexity reveals surprising structure about well-known optimization problems. A very interesting open question is to fully resolve the randomized communication complexity of linear programming as a function of $s,d,$ and $L$ . Another interesting direction is to design more efficient linear programming algorithms in the RAM model with unit cost operations on words of size $O(\log(nd))$ bits; such algorithms while being inherently useful may also give rise to improved communication protocols. While our regression algorithms illustrated various shortcomings of previous techniques, there are still interesting gaps in our bounds to be resolved.

Bibliography

[1] Alekh Agarwal, Olivier Chapelle, Miroslav Dudík, and John Langford. A reliable effective terascale linear learning system. Journal of Machine Learning Research, 15(1):1111–1133, 2014.
[2] Zeyuan Allen-Zhu and Elad Hazan. Optimal black-box reductions between optimization objectives. In Advances in Neural Information Processing Systems, pages 1614–1622, 2016.
[3] Alexandr Andoni. High frequency moments via max-stability. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pages 6364–6368. IEEE, 2017.
[4] Alexandr Andoni et al. Eigenvalues of a matrix in the streaming model. In Proceedings of the twenty-fourth annual ACM-SIAM symposium on Discrete algorithms, pages 1729–1737. Society for Industrial and Applied Mathematics, 2013.
[5] Alexandr Andoni, Piotr Indyk, and Mihai Patrascu. On the optimality of the dimensionality reduction method. In Foundations of Computer Science, 2006. FOCS’06. 47th Annual IEEE Symposium on, pages 449–458. IEEE, 2006.
[6] Yossi Arjevani and Ohad Shamir. Communication complexity of distributed convex learning and optimization. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 1756–1764, 2015.
[7] Sepehr Assadi, Nikolai Karpov, and Qin Zhang. Distributed and streaming linear programming in low dimensions. In Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 236–253. ACM, 2019.
[8] Sepehr Assadi and Sanjeev Khanna. Randomized composable coresets for matching and vertex cover. In Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2017, Washington DC, USA, July 24-26, 2017, pages 3–12, 2017.
