Iterative Refinement for $\ell_p$-norm Regression

Deeksha Adil; Rasmus Kyng; Richard Peng; Sushant Sachdeva

arXiv:1901.06764·cs.DS·December 20, 2024

TL;DR

This paper presents improved iterative algorithms for solving $\ell_p$-regression problems for all $p \in (1,2) \cup (2,\infty)$, reaching high-accuracy solutions in fewer iterations than prior methods, with the clearest gains on sparse graphs and on matrices with similar dimensions.

Contribution

The authors develop iterative refinement algorithms for $\ell_p$-regression built on quadratically-smoothed $\ell_p$-norms, reducing the iteration count to $\tilde{O}_p(m^{1/3})$ and improving the running time over previous methods.

Findings

01

Achieve $\tilde{O}_p(m^{1/3})$ iterations for high-accuracy solutions.

02

Solve $\ell_p$-regression in $\tilde{O}_p(m^{\max\{\omega, 7/3\}})$ time, matching $\ell_2$ regression for constant $p$.

03

Improve on previous algorithms for sparse graphs and matrices with similar dimensions.

Abstract

We give improved algorithms for the $\ell_{p}$-regression problem, $\min_{x} \|x\|_{p}$ such that $A x = b$, for all $p \in (1, 2) \cup (2, \infty)$. Our algorithms obtain a high accuracy solution in $\tilde{O}_{p} (m^{\frac{|p - 2|}{2 p + |p - 2|}}) \leq \tilde{O}_{p} (m^{\frac{1}{3}})$ iterations, where each iteration requires solving an $m \times m$ linear system, $m$ being the dimension of the ambient space. By maintaining an approximate inverse of the linear systems that we solve in each iteration, we give algorithms for solving $\ell_{p}$-regression to $1/\mathrm{poly}(n)$ accuracy that run in time $\tilde{O}_{p} (m^{\max\{\omega, 7/3\}})$, where $\omega$ is the matrix multiplication constant. For the current best value of $\omega > 2.37$, we can thus solve $\ell_{p}$ regression as fast as $\ell_{2}$ regression, for all constant $p$ bounded away from $1$. Our algorithms can be combined with…

Equations

$$\min_{x \in \mathbb{R}^{m} \,:\, A x = b} \|x\|_{p}^{p},$$

$$\frac{\left|\frac{1}{2} - \frac{1}{p}\right|}{1 + \left|\frac{1}{2} - \frac{1}{p}\right|},$$

$$\min_{x}$$

$$A x = b$$

$$|x + \Delta|^{p} = |x|^{p} + \Delta \frac{d}{dx} |x|^{p} + O_{p}(1)\, \gamma_{p}(|x|, \Delta).$$

$$\max_{A \Delta = 0} \; g^{\top} \Delta - \gamma_{p}(x, \Delta),$$

$$\min_{A x = 0,\; g^{\top} x = c} \gamma_{p}(t, x).$$

$$\gamma_{p}(t, x) = \begin{cases} \frac{p}{2} t^{p - 2} x^{2} & \text{if } |x| \leq t, \\ |x|^{p} + \left(\frac{p}{2} - 1\right) t^{p} & \text{otherwise.} \end{cases}$$

$$\min\{2, p\} \leq \frac{x\, \gamma_{p}'(t, x)}{\gamma_{p}(t, x)} \leq \max\{2, p\}.$$

$$\min\{\lambda^{2}, \lambda^{p}\}\, \gamma_{p}(t, \Delta) \leq \gamma_{p}(t, \lambda \Delta) \leq \max\{\lambda^{2}, \lambda^{p}\}\, \gamma_{p}(t, \Delta).$$

$$\gamma_{p}(t, x + \Delta) \leq \gamma_{p}(t, x) + \gamma_{p}'(t, x) \Delta + p^{2} 2^{p - 3} \max\{t, |x|, |\Delta|\}^{p - 2} \Delta^{2}.$$

$$\|x\|_{p}^{p} \leq (1 + \varepsilon) \|x^{\star}\|_{p}^{p},$$

$$\alpha(\Delta) \stackrel{\mathrm{def}}{=} \langle g, \Delta \rangle - \frac{p - 1}{p 2^{p}} \gamma_{p}(|x|, \Delta),$$

$$\max_{A \Delta = 0} \alpha(\Delta).$$

$$|x|^{p} + g \Delta + \frac{p - 1}{p 2^{p}} \gamma_{p}(|x|, \Delta) \leq |x + \Delta|^{p} \leq |x|^{p} + g \Delta + 2^{p} \gamma_{p}(|x|, \Delta),$$

$$\|x\|_{p}^{p} - \alpha(\lambda \Delta) \leq \|x - \lambda \Delta\|_{p}^{p} \leq \|x\|_{p}^{p} - \lambda\, \alpha(\Delta).$$

$$\|x\|_{p}^{p} - \langle g, \Delta \rangle + \frac{p - 1}{p 2^{p}} \gamma_{p}(|x|, \Delta) \leq \|x - \Delta\|_{p}^{p} \leq \|x\|_{p}^{p} - \langle g, \Delta \rangle + 2^{p} \gamma_{p}(|x|, \Delta).$$

$$\|x - \lambda \Delta\|_{p}^{p} \leq \|x\|_{p}^{p} - \langle g, \lambda \Delta \rangle + 2^{p} \gamma_{p}(|x|, \lambda \Delta) \leq \|x\|_{p}^{p} - \lambda \left( \langle g, \Delta \rangle - \lambda^{\min\{1, p - 1\}} 2^{p} \gamma_{p}(|x|, \Delta) \right).$$

$$\|x - \lambda \Delta\|_{p}^{p} \leq \|x\|_{p}^{p} - \lambda \left( \langle g, \Delta \rangle - \frac{p - 1}{p 2^{p}} \gamma_{p}(|x|, \Delta) \right) = \|x\|_{p}^{p} - \lambda\, \alpha(\Delta),$$

$$\alpha(\Delta) \geq \frac{1}{\kappa} \cdot \alpha(\Delta^{\star}) \geq \frac{1}{\kappa} \alpha(x - x^{\star}) \geq \frac{1}{\kappa} \left( \|x\|_{p}^{p} - \|x^{\star}\|_{p}^{p} \right) = \frac{1}{\kappa} \left( \|x\|_{p}^{p} - \mathrm{OPT} \right).$$

$$\|x - \lambda \Delta\|_{p}^{p} \leq \|x\|_{p}^{p} - \lambda\, \alpha(\Delta).$$

$$\begin{aligned} \|x - \lambda \Delta\|_{p}^{p} - \mathrm{OPT} &\leq - \frac{\lambda}{\kappa} \left( \|x\|_{p}^{p} - \mathrm{OPT} \right) + \left( \|x\|_{p}^{p} - \mathrm{OPT} \right) \\ &\leq \left(1 - \frac{\lambda}{\kappa}\right) \left( \|x\|_{p}^{p} - \mathrm{OPT} \right). \end{aligned}$$

$$\|x^{(t)}\|_{p}^{p} - \mathrm{OPT} \leq \left(1 - \frac{\lambda}{\kappa}\right)^{t} \left( \|x^{(0)}\|_{p}^{p} - \mathrm{OPT} \right) \leq \left(1 - \frac{\lambda}{\kappa}\right)^{t} \left( m^{\frac{p - 2}{2}} - 1 \right) \mathrm{OPT}.$$

$$O_{p}\left( (m + n)^{\omega + \frac{p - 2}{3 p - 2}} \log^{2}(\nicefrac{1}{\varepsilon}) \right).$$

$$\max_{A \Delta = 0} \; g^{\top} \Delta - \frac{p - 1}{p 2^{p}} \gamma_{p}(t, \Delta),$$

$$\min_{\Delta} \gamma_{p}(t, \Delta) \quad \text{s.t.} \quad A \Delta = 0,\; g^{\top} \Delta = c,$$

$$i \in \left[ \log\left( \frac{\varepsilon \|x^{(0)}\|_{p}^{p}}{m^{\nicefrac{|p - 2|}{2}}} \right), \log\left( \frac{\|x^{(0)}\|_{p}^{p}}{\lambda} \right) \right],$$

$$\begin{aligned} \gamma_{p}(t, \Delta) &\leq \frac{p}{p - 1} 2^{i + p}, \\ g^{\top} \Delta &= 2^{i - 1}, \\ A \Delta &= 0. \end{aligned}$$

$$\hat{t}_{j} = \begin{cases} m^{-\nicefrac{1}{p}} & \text{if } \left(\frac{p - 1}{p}\right)^{\nicefrac{1}{p}} 2^{-\nicefrac{i}{p} - 1} t_{j} \leq m^{-\nicefrac{1}{p}}, \\ 1 & \text{if } \left(\frac{p - 1}{p}\right)^{\nicefrac{1}{p}} 2^{-\nicefrac{i}{p} - 1} t_{j} \geq 1, \\ \left(\frac{p - 1}{p}\right)^{\nicefrac{1}{p}} 2^{-\nicefrac{i}{p} - 1} t_{j} & \text{otherwise.} \end{cases}$$

$$c = \left(\frac{2}{p}\right)^{1/2} \left(\frac{p - 1}{p}\right)^{1/p} 2^{i \left(1 - \frac{1}{p}\right) - 2},$$

Taxonomy

Topics: Sparse and Compressive Sensing Techniques · Stochastic Gradient Optimization Techniques · Machine Learning and Algorithms

Full text

Iterative Refinement for $\ell_{p}$ -norm Regression

This paper has been published at SODA 2019 [Adi+], and was initially submitted to SODA on July 12, 2018.

Deeksha Adil, University of Toronto. Supported by an Ontario Graduate Scholarship, and by a Connaught New Researcher award to Sushant Sachdeva.

Rasmus Kyng, Harvard. Supported by ONR grant N00014-18-1-2562.

Richard Peng, Georgia Tech. Supported in part by the National Science Foundation under Grant No. 1718533.

Sushant Sachdeva, University of Toronto. Research supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC), and a Connaught New Researcher award.

Abstract

We give improved algorithms for the $\ell_{p}$ -regression problem, $\min_{\bm{\mathit{x}}}\|\bm{\mathit{x}}\|_{p}$ such that $\bm{\mathit{A}}\bm{\mathit{x}}=\bm{\mathit{b}},$ for all $p\in(1,2)\cup(2,\infty).$ Our algorithms obtain a high accuracy solution in $\widetilde{O}_{p}(m^{\frac{|p-2|}{2p+|p-2|}})\leq\widetilde{O}_{p}(m^{\nicefrac{{1}}{{3}}})$ iterations, where each iteration requires solving an $m\times m$ linear system, with $m$ being the dimension of the ambient space.

Incorporating a procedure for maintaining an approximate inverse of the linear systems that we need to solve at each iteration, we give algorithms for solving $\ell_{p}$-regression to $1/{\textrm{poly}}(n)$ accuracy that run in time $\widetilde{O}_{p}(m^{\max\{\omega,7/3\}}),$ where $\omega$ is the matrix multiplication constant. For the current best value of $\omega>2.37$, this means that we can solve $\ell_{p}$ regression as fast as $\ell_{2}$ regression, for all constant $p$ bounded away from $1.$

Our algorithms can be combined with nearly-linear time solvers for linear systems in graph Laplacians to give minimum $\ell_{p}$ -norm flow / voltage solutions to $1/{\textrm{poly}}(n)$ accuracy on an undirected graph with $m$ edges in $\widetilde{O}_{p}(m^{1+\frac{|p-2|}{2p+|p-2|}})\leq\widetilde{O}_{p}(m^{\nicefrac{{4}}{{3}}})$ time.

For sparse graphs and for matrices with similar dimensions, our iteration counts and running times improve upon the $p$-norm regression algorithm by [Bubeck-Cohen-Lee-Li STOC'18], as well as general purpose convex optimization algorithms. At the core of our algorithms is an iterative refinement scheme for $\ell_{p}$-norms, using the quadratically-smoothed $\ell_{p}$-norms introduced in the work of Bubeck et al. Formally, given an initial solution, we construct a problem that seeks to minimize a quadratically-smoothed $\ell_{p}$ norm over a subspace, such that a crude solution to this problem allows us to improve the initial solution by a constant factor, leading to algorithms with fast convergence.

1 Introduction
1.1 Contributions
1.2 Comparison to Previous Works
2 Technical Overview
3 Preliminaries
4 Main Iterative Algorithm
5 Solving the Residual Problem
5.1 Equivalent Problems
5.2 Oracle
5.3 The Algorithm
5.4 Proof of Theorem 5.1
6 Speedups for General Matrices via Inverse Maintenance
7 Other Regression Formulations
7.1 Affine transformations within the norm
7.2 $1<p<2$
8 $p$ -Norm Optimization on Graphs
8.1 $p$ -Norm Flows
8.2 Lipschitz Learning and Graph Labelling
A Missing Proofs
A.1 Proofs from Section 3
A.2 Proofs from Section 4
A.3 Proofs from Section 5
A.4 Proofs from Section 6
B Controlling $\Phi$
C Solving L2 problems
D General $\ell_{2}$ Resistance Monotonicity

1 Introduction

Iterative methods that converge rapidly to a solution are of fundamental importance to numerical analysis, optimization, and more recently, graph algorithms. In the study of iterative methods, there are significant discrepancies between iterative methods geared towards linear problems, and ones that can handle more general convex objectives. For systems of linear equations, which correspond to minimizing $\ell_{2}$-norm objectives over a subspace, most iterative methods obtain $\epsilon$-approximate solutions in iteration counts that scale as $\log(1/\epsilon)$. More generally, for appropriately defined notions of accuracy, a constant-accuracy linear system solver can be iterated to give a much higher accuracy solver using a few calls to the crude solver. Such phenomena are not limited to linear systems either: an algorithm that produces approximate maximum flows on directed graphs can be iterated on the residual graph to quickly obtain high-accuracy answers.

On the other hand, for the much wider space of non-linear optimization problems arising from optimization and machine learning, it’s significantly more expensive to obtain high accuracy solutions. Many widely used methods, such as (accelerated) gradient descent, obtain $\epsilon$-approximate answers using iteration counts that scale as ${\textrm{poly}}(1/\epsilon).$ Such discrepancies also occur in the overall asymptotic running times. An important and canonical problem in this space is $\ell_{p}$-norm regression:

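$$\min_{\bm{\mathit{x}}\in\mathbb{R}^{m}\,:\,\bm{\mathit{A}}\bm{\mathit{x}}=\bm{\mathit{b}}}\left\|\bm{\mathit{x}}\right\|_{p}^{p}, \tag{$*$}$$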

for some $\bm{\mathit{A}}\in\mathbb{R}^{n\times m}$ $(m\geq n),$ and $\bm{\mathit{b}}\in\mathbb{R}^{n}.$ For $p=2,$ this corresponds exactly to solving a linear system, and hence is solvable by a matrix inversion in $O(m^{\omega})$ time, where $\omega$ is the matrix multiplication exponent (currently we know $\omega\leq 2.3728639$ [Wil12, Le14]). For $p=1$ and $p=\infty,$ this problem is inter-reducible to linear programming [Til13, Til15, Bub+18].

Interior point methods also allow us to solve $\ell_{p}$-norm regression problems in $\sqrt{\mathrm{rank}}$ iterations [NN94, LS14], where each iteration requires solving an $m\times m$ linear system for any $p\in[1,\infty]$. Bubeck et al. [Bub+18] show that this iteration count is tight for the interior point method framework, and instead propose a different method which requires only $\widetilde{O}_{p}(m^{\left|\frac{1}{2}-\frac{1}{p}\right|})$ iterations for $p\neq 1,\infty,$ which for large constant $p$ still tends to about $m^{1/2}$. (Here $O_{p}(\cdot)$ notation hides constant factors that depend on $p$ and its dual norm $\frac{p}{p-1}$; $\widetilde{O}_{p}(\cdot)$ notation additionally hides ${\textrm{poly}}(\log\frac{mn}{\varepsilon})$ factors.) On the other hand, $\epsilon$-approximate solutions can be computed in about $m^{1/3}{\textrm{poly}}(1/\epsilon)$ iterations [Chi+13]. (This result only addressed the $p=\infty$ case, but its techniques generalize to all other $p$.)

Furthermore, this discrepancy also carries over to the graph theoretic case. If the matrix $\bm{\mathit{A}}$ is the vertex-edge incidence matrix of a graph, then this problem captures graph problems such as $p$ -norm Lipschitz learning and finding $\ell_{p}$ -norm minimizing flows meeting demands given by $\bm{\mathit{b}}.$ Here low accuracy approximate solutions can be obtained in nearly-linear time when $p=\infty$ [Pen16, She17], and almost-linear time for all other values of $p$ [She17a, Sid17]. However, the current best high accuracy solutions take at least $m^{\min\{10/7,1+\left|1/2-1/p\right|\}}$ time [Mad13, Bub+18].

1.1 Contributions

Iterative Refinement for $\ell_{p}$ -norms.

In this paper, we propose a new iterative method for $\ell_{p}$-norm regression problems (*) that achieves geometric convergence to the optimal solution. Our method only requires solving $O_{p}(\log\nicefrac{{1}}{{\varepsilon}})$ residual problems to find an $\varepsilon$-approximate solution, or $O_{p}(\kappa\log\nicefrac{{1}}{{\varepsilon}})$ residual problems, each solved to a $\kappa$-approximation factor. Such an iterative method was previously known only for $p=2$ and $\infty.$ Curiously, our residual problems look very similar to the original problem (*), with the $\ell_{p}$ norms replaced by their quadratically-smoothed versions introduced by Bubeck et al. [Bub+18]. This result, Theorem 4.1, can be stated informally as:

Theorem 1.1.

There exists a class of residual problems for $p$ -norm regression (which we will define in Definition 4.3) such that any $p$ -norm regression problem can be solved to $\epsilon$ -relative accuracy by solving to relative error $\kappa$ a sequence of $O_{p}(\kappa\log(\frac{m}{\varepsilon}))$ residual problems.

Improved Iteration Count for $\ell_{p}$ -Regression.

We then give an algorithm for quickly solving the residual problem motivated by the approximate maximum flow by electrical flows algorithm by Christiano et al. [Chr+11] and its generalizations to regression problems [Chi+13]. This is given as Theorem 5.1, and can be stated informally as:

Theorem 1.2.

For any $p>2$, an instance of a residual problem for $p$-norm regression as defined in Definition 4.3 can be solved in $\widetilde{O}_{p}(m^{\frac{p-2}{3p-2}})$ iterations, each of which consists of solving a system of linear equations plus updates that take linear time.

This improves on the work of Bubeck et al [Bub+18] for all $p>2,$ with the number of iterations equaling $\widetilde{O}_{p}(1)$ for $p=2$ (essentially the same as Bubeck et al) and tending to $\widetilde{O}(m^{\nicefrac{{1}}{{3}}})$ as $p$ goes to $\infty$ (compared to $\widetilde{O}(m^{\nicefrac{{1}}{{2}}})$ for Bubeck et al). However, our results don’t give anything for $p=\infty$ due to the dependency in $p$ in the $\widetilde{O}_{p}(\cdot)$ term. It’s worth noting that even in the constant error regime, this improves by a factor of about $\min\{m^{\frac{(p-2)^{2}}{2p(3p-2)}},m^{\frac{4}{3(3p-2)}}\}$ over the current state of the art, which for small $p$ is due to Bubeck et al. [Bub+18], and for large $p$ is based on unpublished modifications to Christiano et al. [Chr+11, Mad11].
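The exponents quoted above can be checked with exact rational arithmetic. The following is only an illustrative sketch: the function names are hypothetical and not taken from the paper. It evaluates the exponent $\frac{p-2}{3p-2}$ from Theorem 1.2 against the exponent $\left|\frac{1}{2}-\frac{1}{p}\right|$ that the text attributes to [Bub+18], and confirms that the two exponents inside the stated improvement factor are exactly the gaps to that bound and to $\nicefrac{1}{3}$.

```python
from fractions import Fraction


def refinement_exponent(p):
    """Exponent (p-2)/(3p-2) on m from Theorem 1.2 (intended for p >= 2)."""
    p = Fraction(p)
    return (p - 2) / (3 * p - 2)


def bcll_exponent(p):
    """Exponent |1/2 - 1/p| on m that the text attributes to [Bub+18]."""
    p = Fraction(p)
    return abs(Fraction(1, 2) - 1 / p)


def gap_to_bcll(p):
    """First exponent in the improvement factor: (p-2)^2 / (2p(3p-2))."""
    p = Fraction(p)
    return (p - 2) ** 2 / (2 * p * (3 * p - 2))


def gap_to_one_third(p):
    """Second exponent in the improvement factor: 4 / (3(3p-2))."""
    p = Fraction(p)
    return Fraction(4) / (3 * (3 * p - 2))


for p in (2, 3, 4, 8, 100):
    # Both gaps are exact identities, so equality of Fractions is safe here.
    assert bcll_exponent(p) - refinement_exponent(p) == gap_to_bcll(p)
    assert Fraction(1, 3) - refinement_exponent(p) == gap_to_one_third(p)
    print(p, refinement_exponent(p), bcll_exponent(p))
```

For $p=4$ the loop reports the exponent $\nicefrac{1}{5}$ for the refinement scheme against $\nicefrac{1}{4}$ for the earlier bound.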

A Duality Based Approach to $\ell_{p}$ -Regression.

For the remaining case of $1<p<2$, we instead solve the dual problem, which is a $\frac{p}{p-1}$-norm regression problem, and utilize its solution to solve our original $\ell_{p}$-regression. This leads to iteration counts of the form $\widetilde{O}_{p}(m^{\frac{2-p}{p+2}}\log(1/\epsilon))$ for solving such problems. Note that this result also does not give anything when $p=1$, as the constants related to its dual norm, $\frac{p}{p-1}$, become prohibitive. For all $p\in(1,\infty),$ our iteration count achieves the following exponent on $m,$

$$\frac{\left|\frac{1}{2}-\frac{1}{p}\right|}{1+\left|\frac{1}{2}-\frac{1}{p}\right|},$$

while the exponent from the previous result [Bub+18] is $\left|\frac{1}{2}-\frac{1}{p}\right|$: our algorithm has better dependence on $m$ for all constant $p$ (albeit with larger constants depending on $p$).
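A short calculation, offered here as a sketch rather than as part of the original argument, shows how the two iteration exponents fit the single formula $\frac{|p-2|}{2p+|p-2|}$ from the abstract, and how the $1<p<2$ exponent arises from applying the $p>2$ bound to the dual exponent. The symbol $q=\frac{p}{p-1}$ is introduced only for this illustration; note that $q>2$ exactly when $1<p<2$.

```latex
\begin{align*}
p > 2:&\quad \frac{|p-2|}{2p+|p-2|} = \frac{p-2}{3p-2}, \\
1 < p < 2:&\quad \frac{|p-2|}{2p+|p-2|} = \frac{2-p}{p+2}, \\
q = \frac{p}{p-1}:&\quad \frac{q-2}{3q-2}
   = \frac{\frac{2-p}{p-1}}{\frac{p+2}{p-1}} = \frac{2-p}{p+2}, \\
\text{any } p \in (1,\infty):&\quad
   \frac{\left|\frac{1}{2}-\frac{1}{p}\right|}{1+\left|\frac{1}{2}-\frac{1}{p}\right|}
   = \frac{\frac{|p-2|}{2p}}{\frac{2p+|p-2|}{2p}}
   = \frac{|p-2|}{2p+|p-2|}.
\end{align*}
```

Since $\frac{u}{1+u}<u$ for every $u>0$, the exponent is strictly smaller than $\left|\frac{1}{2}-\frac{1}{p}\right|$ whenever $p\neq 2$.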

For the case of $p=4$ , a manuscript by Bullins [Bul18] from December 2018 (after our paper was accepted to SODA 2019, but independently developed), gives the same iteration count as our algorithm of $n^{1/5}\log(1/\epsilon)$ up to polylogs. Bullins’ approach requires a linear system solve per iteration, similar to our approach when implemented without inverse maintenance. Bullins’ algorithm is based on higher-order acceleration, and the agreement between running times suggests there may be a strong connection between our “accelerated” multiplicative weight method and his accelerated gradient-based method.

Faster $\ell_{p}$ -Regression.

Our improved iteration counts can be readily combined with methods for speeding up optimization algorithms that utilize linear system solvers, including inverse maintenance [Vai89, LS15]. This results in an $\widetilde{O}_{p}(m^{\max\{\omega,\nicefrac{{7}}{{3}}\}})$ time algorithm for solving $\ell_{p}$ regression problems for all $p\in(1,\infty)$ , which we formalize in Theorem 6.1.

This bound for $p$ -norm regression with general matrices brings us to the somewhat surprising conclusion that for the current value of $\omega>7/3$ , $p$ -norm regression problems (with constant $p$ that’s also constant-bounded away from $1$ ) on square matrices can be solved as fast as solving the underlying linear systems, or equivalently, $\ell_{2}$ regression problems.

This is based on maintaining an approximate inverse to the linear systems we need to solve in each step of the iterative method as pioneered by Vaidya [Vai89]. However, our modification interacts directly with the potential functions we use to control iteration counts in the inner loop of our iterative method. A similar approach for maintaining an approximate inverse was used by Cohen et al. [CLS18] to give an $\tilde{O}(m^{\omega})$ algorithm for Linear Programming, after our initial submission to SODA, but before our paper was publicly available. Both works build on ideas developed by Cohen, see [Lee17].

Faster $p$ -Norm Flows.

When solving $p$-norm flow problems, our algorithm can be made faster by using Laplacian solvers for graph problems [Vai90, Ten10] to solve the linear equations that arise during our iterations. This gives algorithms for finding $p$-norm flows on undirected graphs to accuracy $\epsilon$ with running time $\widetilde{O}_{p}\left(m^{1+\frac{\left|p-2\right|}{2p+\left|p-2\right|}}\log(1/\epsilon)\right)$ for $p\in(1,\infty)$ via direct invocations of fast Laplacian solvers [ST14].

Our results thus give the first evidence that wide classes of graph optimization problems can be solved in time $m^{4/3}$ or faster. While such a bound (via fast Laplacian solvers) is by now well-known in the approximate setting [Chr+11], the $m^{10/7}$ iteration bounds due to Madry [Mad13, Mad16] represent the only results to date in this direction for high accuracy answers on sparse graphs.

Generalizations and Extensions.

While we focus on Problem (*), under mild assumptions about polynomially bounded objectives, we can solve the following more general problem:

[TABLE]

The reduction is discussed in Section 7. The combination of an affine constraint on $\bm{\mathit{x}}$ with an affine transformation in the $p$ -norm objective means we can solve most variants of $p$ -norm optimization problems.

Similar ideas can be used to solve $p$ -norm Lipschitz learning problems [Kyn+15] on graphs quickly.

1.2 Comparison to Previous Works

Numerical Methods and Preconditioning.

Iterative methods and preconditioning are the most fundamental tools in numerical algorithms [Axe94, Saa03]. As studies of such methods often focus on linear problems, many existing analyses of iterative methods are restricted to linear systems. Generalizing such methods, as well as numerical methods, to broader settings is a major topic of study [Hen03, NW06, Kel99, KK04].

The study of more efficient algorithms for combinatorial flow problems has benefited enormously from ideas from linear and non-linear preconditioning. Recent advances in approximate maximum flow and transshipment algorithms [LRS13, She13, Kel+14, RST14, Gha+15, Pen16, She17a, She17] build upon such ideas. However, these methods rely on the preconditioner being a linear operator, and give ${\textrm{poly}}(1/\epsilon)$ dependence.

Optimization Algorithms.

Our techniques for solving the residual problems are directly motivated by approximating maximum flow using electrical flows [Chr+11]. While this algorithm has been extended to multicommodity flows and regression problems [KMP12, Chi+13], all these results have ${\textrm{poly}}(1/\epsilon)$ dependencies.

Several recent results for obtaining $\log(1/\epsilon)$ dependencies are all motivated by convex optimization techniques. In particular, the state of the art running times are by interior point methods. These include directly modifying the interior point method (IPM) [LS14, LS15, Kyn+15], combining techniques from the electrical flow algorithms with IPM update steps [Mad13, Kyn+15, KRS15, Mad16, Coh+17], and increasing the ‘confidence interval’, and in turn step lengths, of the IPM update steps [All+17, Coh+17a, Bub+18]. Our result based on creating intermediate problems has the most in common with the last of these. However, our method differs in that our guarantees for this intermediate problem hold over the entire space.

Inverse Maintenance

Our final running time of $\widetilde{O}_{p}(m^{\max\{\omega,\nicefrac{{7}}{{3}}\}})$ for $\ell_{p}$ -regression incorporates inverse maintenance. This is a method introduced by Vaidya [Vai89] for speeding up optimization algorithms for solving minimum cost and multicommodity flows. It takes advantage of the controllable rate at which optimization algorithms modify the solution variables to reuse inverses of matrices constructed from such variables.

Previous studies of inverse maintenance [Vai89, LS14, LS15] have been geared towards the interior point method. Here the norm per update step can be controlled, and we believe this also holds for their applications in faster cutting plane methods [LSW15]. While such methods also give gains in the case of our algorithm, for the final bound of about $m^{\omega}$ , we instead bound the progress of the steps against a global potential function motivated by the electrical flow max-flow algorithm [Chr+11].

Speedups for Matrices with Uneven Dimensions

Our algorithm does not take into account the sparsity of the input matrix, or possibly uneven dimensions (e.g. $m\approx n^{2}$). In these settings, the methods based on accelerated stochastic gradient descent from [Bub+18] obtain better performance. Nevertheless, we believe our methods have the potential of extending to such settings by combining the intermediate problems with $\ell_{p}$ row sampling [CP15]. However, analysis of such row sampling routines for our residual problems containing mixed $\ell_{2}$ and $\ell_{p}$ norm functions is outside the scope of this paper.

2 Technical Overview

Iterative Refinement for $\ell_{p}$ -norms.

To design their algorithm for $\ell_{p}$-norm regression, Bubeck et al. [Bub+18] construct a function $\gamma_{p}(t,x)$ which is $C^{1}$ (that is, continuous and differentiable with a continuous derivative), quadratic in the range $\left|x\right|\leq t,$ and behaves as $|x|^{p}$ asymptotically (see Def. 3.1). Our key lemma states that one can locally approximate $\left\|x+\Delta\right\|_{p}^{p}$ as a linear function plus a $\gamma_{p}(|x|,\Delta)$ “error” term. It is useful to compare the $\gamma_{p}$ term to the second-order Hessian term in a Taylor expansion (Lemma 4.5):

$$\left|x+\Delta\right|^{p}=\left|x\right|^{p}+\Delta\frac{d}{dx}\left|x\right|^{p}+O_{p}(1)\,\gamma_{p}(\left|x\right|,\Delta).$$

Surprisingly, this approximation only has an $O_{p}(1)$ “condition number”. Proceeding just as for gradient descent, or Newton’s method, means that if at each step we solve the following local approximation problem to a factor $\kappa,$

$$\max_{\bm{\mathit{A}}\Delta=0}\ \bm{\mathit{g}}^{\top}\Delta-\gamma_{p}(\bm{\mathit{x}},\Delta),$$

where $\bm{\mathit{g}}$ is the gradient of our loss function, we can converge to an $\varepsilon$ -approximate solution in roughly $\kappa\log\nicefrac{{1}}{{\varepsilon}}$ iterations (Theorem 4.1).

Improved Algorithms for $\ell_{p}$ -regression for $p\geq 2$ .

The key advantage afforded by our iterative algorithm is that we now only need to design an algorithm for the residual problem that achieves a crude approximation factor (we achieve $O_{p}(1)$). As a first step, by a binary search and some rescaling, we show (Lemma 5.5) that it suffices to achieve a constant factor approximation to $O_{p}(\log\nicefrac{{m}}{{\varepsilon}})$ problems of the following form,

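$$\min_{\bm{\mathit{A}}\bm{\mathit{x}}=0,\ \bm{\mathit{g}}^{\top}\bm{\mathit{x}}=c}\gamma_{p}(\bm{\mathit{t}},\bm{\mathit{x}}).$$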

The technical heart of our proof is to give an algorithm (Gamma-Solver, Algorithm 4) inspired by the multiplicative weight update (MWU) method (see [AHK12] for a survey), combined with the width-reduction inspired by the faster flow algorithm of Christiano et al. [Chr+11], and its matrix version by Chin et al. [Chi+13]. At each iteration, we solve a weighted $\ell_{2}$ minimization problem to find the next update step. If this update step has small $\ell_{p}$ norm, we add it to our current solution, and update the weights. Otherwise, we identify a set of coordinates that have small current weights, and yet are contributing most of the $\ell_{p}$ norm, and we penalize them by increasing their weights (and do not add our update step to the current solution). Setting the parameters carefully, we show that after $\widetilde{O}_{p}(m^{\frac{p-2}{3p-2}})$ iterations, the average of the update steps achieves an $O_{p}(1)$-approximation to our modified residual problem (Theorem 5.8). Combining this with our iterative refinement algorithm, we obtain our algorithms for $\ell_{p}$-norm regression that require only $\widetilde{O}_{p}(m^{\frac{p-2}{3p-2}})$ iterations (or linear system solves).
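The weighted $\ell_{2}$ minimization step mentioned above has a closed form, which the following sketch illustrates in plain Python. It is not the paper's Gamma-Solver: the weights `r` are treated as given inputs (how they are chosen and updated is not specified in this section), the helper names are hypothetical, and a small dense Gaussian elimination stands in for whatever linear system solver an implementation would actually use. Stacking the constraints $\bm{\mathit{A}}\Delta=0$ and $\bm{\mathit{g}}^{\top}\Delta=c$ into $B\Delta=d$, the minimizer of $\sum_{i}r_{i}\Delta_{i}^{2}$ is $\Delta=R^{-1}B^{\top}y$ with $(BR^{-1}B^{\top})y=d$, where $R$ is the diagonal matrix of weights.

```python
def solve_linear(matrix, rhs):
    """Solve a small dense square system by Gaussian elimination with partial pivoting."""
    n = len(matrix)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    sol = [0.0] * n
    for i in reversed(range(n)):
        partial = sum(aug[i][j] * sol[j] for j in range(i + 1, n))
        sol[i] = (aug[i][n] - partial) / aug[i][i]
    return sol


def weighted_l2_step(B, d, r):
    """Minimise sum_i r[i] * delta[i]**2 subject to B @ delta = d.

    Assumes every weight r[i] is positive and the rows of B are independent.
    """
    k, m = len(B), len(r)
    # Gram matrix B R^{-1} B^T of the stacked constraints.
    gram = [[sum(B[a][i] * B[b][i] / r[i] for i in range(m)) for b in range(k)]
            for a in range(k)]
    y = solve_linear(gram, d)
    # delta = R^{-1} B^T y
    return [sum(B[a][i] * y[a] for a in range(k)) / r[i] for i in range(m)]


# Toy instance: A = [1, 1, 1], g = [1, 2, 3], c = 1, unit weights.
step = weighted_l2_step([[1.0, 1.0, 1.0], [1.0, 2.0, 3.0]], [0.0, 1.0], [1.0, 1.0, 1.0])
print(step)
```

The only linear system solved has as many rows as there are constraints, which is why each iteration of the scheme described above costs one linear system solve plus linear-time updates.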

Maintaining Inverses for Improved Algorithm.

Our inverse maintenance procedure utilizes the same combination of low-rank updates and matrix multiplications as in previous results [Vai89, LS14, LS15]. However, the rate of convergence of our algorithm, and in turn the rate at which we adjust the weights from the MWU procedure, are governed by growths in the $\ell_{2}$ minimization problem. This leads to the difficulty of uneven progress across the iterations.

We solve this issue by a simple yet subtle scheme motivated by lazy updates in data structures [GP13, Abr+16]. We bucket changes to the values of entries based on their magnitudes, and update entries that received too many updates of a certain magnitude separately. This differs from previous methods, which update weights exceeding approximation thresholds as they happen, and enables a closer interaction with the overall potential function based convergence analysis.

3 Preliminaries

We use the following family of functions, $\gamma_{p}(t,x)$ defined in [Bub+18].

Definition 3.1 ( $\gamma_{p}$ function).

For $t\geq 0$ and $p\geq 1$ , define

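$$\gamma_{p}(t,x)=\begin{cases}\frac{p}{2}t^{p-2}x^{2} & \text{if }\left|x\right|\leq t,\\ \left|x\right|^{p}+\left(\frac{p}{2}-1\right)t^{p} & \text{otherwise.}\end{cases}$$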

These functions can be thought of as a quadratic approximation of $\left|x\right|^{p}$ in a small range around zero. The following properties follow directly from the definition.

1. $\gamma_{p}(0,x)=\left|x\right|^{p}$.

2. $\gamma_{p}(t,x)$ is quadratic in the range $-t\leq x\leq t$.

3. $\gamma_{p}$ is $C^{1}$ in both $x,t.$
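As a concrete illustration of Definition 3.1, the function can be written in a few lines of Python. This is a minimal sketch: the function name and the explicit handling of $t=0$ are choices made here (for $p<2$ the factor $t^{p-2}$ is undefined at $t=0$, where the quadratic branch only applies to $x=0$ and the value is $0$).

```python
def gamma_p(t, x, p):
    """Quadratically smoothed |x|^p: (p/2) t^(p-2) x^2 if |x| <= t, else |x|^p + (p/2 - 1) t^p.

    Requires t >= 0 and p >= 1.
    """
    ax = abs(x)
    if ax <= t:
        # Quadratic branch; when t == 0 this forces x == 0, where the value is 0.
        return 0.0 if t == 0 else 0.5 * p * t ** (p - 2) * x * x
    # Power branch, shifted so that both branches agree at |x| == t.
    return ax ** p + (0.5 * p - 1.0) * t ** p


print(gamma_p(0, -2, 3))    # property 1: equals |x|^p when t = 0
print(gamma_p(2, 1, 4))     # property 2: quadratic inside [-t, t]
print(gamma_p(2, 0.5, 4))   # halving x divides the value by 4
```

At $\left|x\right|=t$ both branches evaluate to $\frac{p}{2}t^{p}$, which is the continuity that the third property relies on.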

We show several other important properties of $\gamma_{p}$ in the following lemmas. Their proofs are straightforward and deferred to Appendix A.1.

Lemma 3.2.

Function $\gamma_{p}$ is as defined above.

1. For any $p\geq 2,$ $t\geq 0,$ and $x\in\mathbb{R},$ we have $\gamma_{p}(t,x)\geq|x|^{p},$ and $\gamma_{p}(t,x)\geq\frac{p}{2}t^{p-2}x^{2}.$

2. It is homogeneous under rescaling of both $t$ and $x,$ i.e., for any $t,\lambda\geq 0,p\geq 1,$ and any $x$ we have $\gamma_{p}\left(\lambda t,\lambda x\right)=\lambda^{p}\gamma_{p}\left(t,x\right).$

3. For any $t>0,p\geq 1$ and any $x,$ we have $\gamma_{p}^{\prime}(t,x)=p\max\{t,\left|x\right|\}^{p-2}x.$
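The three parts of Lemma 3.2 are easy to spot-check numerically. The sketch below is an illustration only, not a proof: the sample points, tolerances, and function names are arbitrary choices made here, and the derivative is compared against a central finite difference.

```python
def gamma_p(t, x, p):
    """gamma_p from Definition 3.1 (t >= 0, p >= 1)."""
    ax = abs(x)
    if ax <= t:
        return 0.0 if t == 0 else 0.5 * p * t ** (p - 2) * x * x
    return ax ** p + (0.5 * p - 1.0) * t ** p


def gamma_p_prime(t, x, p):
    """Closed-form derivative in x from part 3 of Lemma 3.2 (stated for t > 0)."""
    return p * max(t, abs(x)) ** (p - 2) * x


def check_lemma_3_2(t=1.0, lam=2.5, h=1e-6):
    """Return True if all three parts hold at a handful of sample points."""
    for p in (1.5, 2.0, 3.0):
        for x in (-2.0, -1.0, -0.3, 0.7, 1.0, 2.5):
            value = gamma_p(t, x, p)
            # Part 1 (only claimed for p >= 2): two lower bounds.
            if p >= 2:
                if value < abs(x) ** p - 1e-12:
                    return False
                if value < 0.5 * p * t ** (p - 2) * x * x - 1e-12:
                    return False
            # Part 2: homogeneity of degree p under joint rescaling.
            scaled = gamma_p(lam * t, lam * x, p)
            if abs(scaled - lam ** p * value) > 1e-9 * (1.0 + abs(scaled)):
                return False
            # Part 3: derivative formula versus a central difference.
            slope = (gamma_p(t, x + h, p) - gamma_p(t, x - h, p)) / (2.0 * h)
            if abs(slope - gamma_p_prime(t, x, p)) > 1e-4:
                return False
    return True


print(check_lemma_3_2())
```

The sample points include $x=\pm t$, where the two branches of the definition meet, so the finite-difference comparison also exercises the claim that the derivative is continuous there.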

The next lemma shows a bound on the value of $\gamma_{p}$ when $x$ is scaled up or down.

Lemma 3.3.

For any $p>1,\Delta\in\mathbb{R}$ and $\lambda\geq 0$ , we have,

$$\min\{2,p\}\leq\frac{x\,\gamma_{p}^{\prime}(t,x)}{\gamma_{p}(t,x)}\leq\max\{2,p\}.$$

This implies,

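$$\min\{\lambda^{2},\lambda^{p}\}\gamma_{p}(t,\Delta)\leq\gamma_{p}(t,\lambda\Delta)\leq\max\{\lambda^{2},\lambda^{p}\}\gamma_{p}(t,\Delta).$$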
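One way to see why the ratio bound $\min\{2,p\}\leq x\gamma_{p}^{\prime}(t,x)/\gamma_{p}(t,x)\leq\max\{2,p\}$ implies the scaling bound $\min\{\lambda^{2},\lambda^{p}\}\gamma_{p}(t,\Delta)\leq\gamma_{p}(t,\lambda\Delta)\leq\max\{\lambda^{2},\lambda^{p}\}\gamma_{p}(t,\Delta)$ is to integrate a logarithmic derivative. The following is a sketch under the assumptions $\Delta\neq 0$ and $\lambda>0$ (the cases $\Delta=0$ or $\lambda=0$ are immediate, since both sides vanish); the auxiliary function $\varphi$ is introduced only for this illustration, and the paper's own proof is in Appendix A.1.

```latex
\begin{align*}
\varphi(s) &:= \log \gamma_p(t, s\Delta), \qquad s > 0, \\
\varphi'(s) &= \frac{\Delta\,\gamma_p'(t, s\Delta)}{\gamma_p(t, s\Delta)}
  = \frac{1}{s}\cdot\frac{(s\Delta)\,\gamma_p'(t, s\Delta)}{\gamma_p(t, s\Delta)}
  \in \left[\frac{\min\{2,p\}}{s},\ \frac{\max\{2,p\}}{s}\right], \\
\lambda \geq 1:&\quad
  \min\{2,p\}\log\lambda \leq \varphi(\lambda)-\varphi(1) \leq \max\{2,p\}\log\lambda
  \ \Longrightarrow\
  \lambda^{\min\{2,p\}} \leq \frac{\gamma_p(t,\lambda\Delta)}{\gamma_p(t,\Delta)} \leq \lambda^{\max\{2,p\}}, \\
0 < \lambda \leq 1:&\quad
  \min\{2,p\}\log\tfrac{1}{\lambda} \leq \varphi(1)-\varphi(\lambda) \leq \max\{2,p\}\log\tfrac{1}{\lambda}
  \ \Longrightarrow\
  \lambda^{\max\{2,p\}} \leq \frac{\gamma_p(t,\lambda\Delta)}{\gamma_p(t,\Delta)} \leq \lambda^{\min\{2,p\}}.
\end{align*}
```

In both cases the outer bounds are exactly $\min\{\lambda^{2},\lambda^{p}\}$ and $\max\{\lambda^{2},\lambda^{p}\}$, because the larger exponent gives the larger power when $\lambda\geq 1$ and the smaller power when $\lambda\leq 1$.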

The following lemma allows us to bound the second order change in $\gamma_{p}(\bm{\mathit{x}})$ as $\bm{\mathit{x}}$ changes to $\bm{\mathit{x}}+\Delta.$

Lemma 3.4.

For any $p\geq 2,t\geq 0$ and any $x,\Delta,$ we have

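$$\gamma_{p}(t,x+\Delta)\leq\gamma_{p}(t,x)+\gamma_{p}^{\prime}(t,x)\Delta+p^{2}2^{p-3}\max\{t,\left|x\right|,\left|\Delta\right|\}^{p-2}\Delta^{2}.$$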

Notation.

For a vector $\bm{\mathit{x}},$ let $\left|\bm{\mathit{x}}\right|$ denote the vector with its $i^{\text{th}}$ coordinate as $\left|\bm{\mathit{x}}_{i}\right|.$ For any two vectors $\bm{\mathit{t}}$ and $\bm{\mathit{x}}$ , $\gamma_{p}(\bm{\mathit{t}},\bm{\mathit{x}})$ denotes the sum $\sum_{i}\gamma_{p}(\bm{\mathit{t}}_{i},\bm{\mathit{x}}_{i})$ .

4 Main Iterative Algorithm

In this section we analyze procedure p-Norm, i.e., Algorithm 1. Our main result for this section is,

Theorem 4.1 ( $\ell_{p}$ -norm Iterative Refinement).

Let $p\in(1,\infty)$ and $\kappa\geq 1.$ Given an initial feasible solution $\bm{\mathit{x}}^{(0)}$ (Definition 4.7) to our optimization problem (Equation (*)), Algorithm 1 finds an $\epsilon$-approximate solution to (*) in $O_{p}\left(\kappa\log\left(\frac{m}{\varepsilon}\right)\right)$ calls to a $\kappa$-approximate solver for the residual problem (Equation (1)).

The theorem says that it is sufficient to solve an instance of the residual problem (1) crudely, and only a logarithmic number of times. Before we prove the theorem, we define the terms used in the statement and prove some results that would be needed for the proof. We begin by defining an $\varepsilon$ -approximate solution to our main optimization problem.
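To get a feel for the constants hidden in $O_{p}(\cdot)$, the sketch below estimates the number of residual-solver calls. It rests on assumptions that should be read as such: that each call shrinks the optimality gap by a factor $\left(1-\frac{\lambda}{\kappa}\right)$, with $\lambda$ the largest step size allowed by the condition $\lambda^{\min\{1,p-1\}}\leq\frac{p-1}{p4^{p}}$ of Lemma 4.6, and that the starting gap is at most $m^{\nicefrac{|p-2|}{2}}$ times the optimum. The function names are hypothetical and the result is a back-of-the-envelope count, not the paper's exact bound.

```python
import math


def step_size(p):
    """Largest lambda with lambda^{min(1, p-1)} <= (p-1) / (p * 4^p)."""
    bound = (p - 1) / (p * 4 ** p)
    return bound ** (1.0 / min(1.0, p - 1.0))


def refinement_calls(m, p, kappa, eps):
    """Smallest t with (1 - lambda/kappa)^t * m^{|p-2|/2} <= eps (assumed gap model)."""
    shrink = 1.0 - step_size(p) / kappa
    initial_gap = m ** (abs(p - 2) / 2.0)
    return math.ceil(math.log(initial_gap / eps) / -math.log(shrink))


for p in (1.5, 3.0, 4.0):
    print(p, step_size(p), refinement_calls(1000, p, 1.0, 1e-6))
```

The count grows linearly in $\kappa$ and only logarithmically in $m$ and $\nicefrac{1}{\varepsilon}$, matching the shape $\kappa\log\left(\frac{m}{\varepsilon}\right)$ of the theorem, while the $4^{p}$ in the step size shows why the hidden constant depends strongly on $p$.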

Definition 4.2 ( $\varepsilon$ -approximate solution).

We say our solution $\bm{\mathit{x}}$ is an $\varepsilon$-approximate solution to (*) if $\bm{\mathit{A}}\bm{\mathit{x}}=\bm{\mathit{b}}$ and

$$\left\|\bm{\mathit{x}}\right\|_{p}^{p}\leq(1+\varepsilon)\left\|\bm{\mathit{x}}^{\star}\right\|_{p}^{p},$$

where $\bm{\mathit{x}}^{\star}$ is an optimal solution of (*).

We next define what we use as our residual problem and what we mean by a $\kappa$ -approximate solution.

Definition 4.3 (Residual Problem).

For any given $\bm{\mathit{x}}$ and $p>1$ , let

[TABLE]

where $\bm{\mathit{g}}$ is the gradient, $\bm{\mathit{g}}=p\left|\bm{\mathit{x}}\right|^{p-2}\bm{\mathit{x}}$ . We call the following problem the residual problem of (* ‣ 1) at $\bm{\mathit{x}}$ .

[TABLE]

Definition 4.4 ( $\kappa$ -approximate solution).

Let $\kappa\geq 1$ . A $\kappa$ -approximate solution for the residual problem is $\tilde{\Delta}$ such that $\bm{\mathit{A}}\tilde{\Delta}=0$ and $\mathit{\alpha}(\tilde{\Delta})\geq\frac{1}{\kappa}\mathit{\alpha}(\Delta^{\star}).$ Here $\Delta^{\star}=\arg\max_{\bm{\mathit{A}}\Delta=0}\mathit{\alpha}(\Delta)$ .

In order to see why we choose this problem as our residual problem we show that the objective of the residual problem bounds the change in $p$ -norm of a vector $\bm{\mathit{x}}$ when perturbed by $\Delta$ (Lemma 4.6).

Lemma 4.5.

Let $p\in(1,\infty)$ . Then for any $\bm{\mathit{x}}$ and any $\Delta$ ,

[TABLE]

where $g=p\left|x\right|^{p-2}x$ is the derivative of the function $\left|x\right|^{p}$ .

The proof can be found in Appendix A.2.
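A scalar two-sided bound of this kind can be checked on random samples. The sketch below assumes the lemma has the form $\left|x\right|^{p}+g\Delta+\frac{p-1}{p2^{p}}\gamma_{p}(\left|x\right|,\Delta)\leq\left|x+\Delta\right|^{p}\leq\left|x\right|^{p}+g\Delta+2^{p}\gamma_{p}(\left|x\right|,\Delta)$. These constants are an assumption of the sketch, chosen to be consistent with the threshold $\frac{p-1}{p4^{p}}$ in Lemma 4.6; the omitted display remains the authoritative statement. The closed form of $\gamma_{p}$ and all function names are likewise assumptions.

```python
import random


def gamma_p(p, t, x):
    """Assumed closed form of the smoothed p-th power (see lead-in)."""
    if x == 0.0:
        return 0.0
    if abs(x) <= t:
        return 0.5 * p * t ** (p - 2) * x * x
    return abs(x) ** p + (0.5 * p - 1.0) * t ** p


def lemma_4_5_sides(p, x, delta):
    """Return (lower, value, upper) for the assumed two-sided bound."""
    sign = (x > 0) - (x < 0)
    g = p * abs(x) ** (p - 1) * sign          # derivative of |x|^p at x
    base = abs(x) ** p + g * delta
    smooth = gamma_p(p, abs(x), delta)
    lower = base + (p - 1.0) / (p * 2.0 ** p) * smooth
    upper = base + 2.0 ** p * smooth
    return lower, abs(x + delta) ** p, upper


def check_random_samples(trials, seed=0):
    """True if the assumed bound holds on all sampled (p, x, delta)."""
    rng = random.Random(seed)
    for _ in range(trials):
        p = rng.uniform(1.1, 6.0)
        x = rng.uniform(-3.0, 3.0)
        delta = rng.uniform(-3.0, 3.0)
        lower, value, upper = lemma_4_5_sides(p, x, delta)
        tol = 1e-9 * (1.0 + abs(value) + abs(upper))
        if lower > value + tol or value > upper + tol:
            return False
    return True
```

A random check of this kind is not a proof; it only guards against transcription errors in the constants.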

Lemma 4.6.

Let $p\in(1,\infty)$ and $\lambda$ be such that $\lambda^{\min\{1,p-1\}}\leq\frac{p-1}{p4^{p}}$ . Then for any $\Delta$ we have,

[TABLE]

Proof.

Applying Lemma 4.5 to all the coordinates, we obtain,

[TABLE]

Using Definition 4.3, equation (4) directly implies $\left\|\bm{\mathit{x}}-\Delta\right\|_{p}^{p}\geq\left\|\bm{\mathit{x}}\right\|_{p}^{p}-\mathit{\alpha}(\Delta)$ for all $\Delta.$ To prove the other direction, note that for any $\lambda\in[0,1]$ and any $\Delta,$ we have from Lemma 4.5 and the scaling bound of Lemma 3.3

[TABLE]

Picking $\lambda$ such that $\lambda^{\min\{1,p-1\}}\leq\frac{p-1}{p4^{p}},$ we obtain that for any $\Delta,$

[TABLE]

thus concluding the proof of the lemma. ∎

For any iterative algorithm we need a starting feasible solution. We could potentially start with any feasible solution but we define the following starting solution which we claim is a good starting point. Lemma 4.8 shows us that our chosen starting point is only polynomially away from the optimum solution, and is thus a good choice. The proof of the lemma can be found in Appendix A.2.

Definition 4.7 (Initial Solution).

We define our initial feasible solution to be $\bm{\mathit{x}}^{(0)}=\arg\min_{\bm{\mathit{A}}\bm{\mathit{x}}=\bm{\mathit{b}}}\left\|\bm{\mathit{x}}\right\|^{2}_{2}.$

Lemma 4.8.

For $\bm{\mathit{x}}^{(0)}$ as defined in Definition 4.7, $\|\bm{\mathit{x}}^{(0)}\|_{p}^{p}\leq m^{\nicefrac{{(p-2)}}{{2}}}\textsc{OPT}$ .
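The initial point is the minimum $\ell_{2}$-norm solution of $\bm{\mathit{A}}\bm{\mathit{x}}=\bm{\mathit{b}}$, which can be computed with a standard least-squares routine. The sketch below assumes $p\geq 2$, a dense constraint matrix with full row rank, and uses NumPy; the function names are hypothetical. For $p\geq 2$ the factor $m^{(p-2)/2}$ is consistent with the standard chain $\|\bm{\mathit{x}}^{(0)}\|_{p}\leq\|\bm{\mathit{x}}^{(0)}\|_{2}\leq\|\bm{\mathit{x}}\|_{2}\leq m^{1/2-1/p}\|\bm{\mathit{x}}\|_{p}$ for any feasible $\bm{\mathit{x}}$.

```python
import numpy as np


def initial_solution(A, b):
    """Minimum 2-norm solution of A x = b for an underdetermined system.

    np.linalg.lstsq returns the minimum-norm least-squares solution, which
    is feasible whenever the system is consistent.
    """
    x0, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x0


def p_norm_pow(x, p):
    """Return ||x||_p^p."""
    return float(np.sum(np.abs(x) ** p))
```

Comparing `p_norm_pow(x0, p)` with `m ** ((p - 2) / 2) * p_norm_pow(x, p)` for any other feasible `x` gives a quick empirical check of the bound.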

We are now ready to prove Theorem 4.1.

Proof.

Let $\widetilde{\Delta}$ denote the solution returned by the $\kappa$ -approximate solver. We know that $\mathit{\alpha}(\widetilde{\Delta})\geq\frac{1}{\kappa}\cdot\mathit{\alpha}(\Delta^{\star})$ . We have,

[TABLE]

From Lemma 4.6, for $\lambda=\left(\frac{p-1}{p4^{p}}\right)^{\frac{1}{\min\{1,p-1\}}}=\Omega_{p}(1),$ we get,

[TABLE]

Combining the above two equations and subtracting OPT from both sides gives us

[TABLE]

Using Lemma 4.8 we get,

[TABLE]

Setting $t=O_{p}\left(\kappa\log\left(\frac{m}{\varepsilon}\right)\right)$ gives us an $\varepsilon$ -approximate solution. ∎
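To see where the $\kappa\log\left(\frac{m}{\varepsilon}\right)$ count comes from, the following sketch assumes (hypothetically, since the recurrence is in an omitted display) that each solver call shrinks the optimality gap by a factor $1-\lambda/\kappa$, and that the initial gap is at most $m^{(p-2)/2}\,\textsc{OPT}$ as suggested by Lemma 4.8. It then computes the smallest number of calls that brings the relative gap below $\varepsilon$. The function names are illustrative.

```python
import math


def step_size(p):
    """lambda from the proof: ((p-1) / (p * 4^p))^(1 / min(1, p-1))."""
    return ((p - 1.0) / (p * 4.0 ** p)) ** (1.0 / min(1.0, p - 1.0))


def iterations_needed(p, kappa, m, eps):
    """Smallest t with (1 - lambda/kappa)^t * m^((p-2)/2) <= eps.

    Assumes a geometric contraction of the gap per call (an assumption of
    this sketch, not a restatement of the omitted recurrence).
    """
    lam = step_size(p)
    start = m ** ((p - 2.0) / 2.0)
    if start <= eps:
        return 0
    # log1p keeps precision when lambda/kappa is tiny.
    return math.ceil(math.log(start / eps) / -math.log1p(-lam / kappa))
```

Since $-\log(1-\lambda/\kappa)\approx\lambda/\kappa$, the count scales like $\frac{\kappa}{\lambda}\log\left(\frac{m^{(p-2)/2}}{\varepsilon}\right)$, with the $p$-dependence absorbed in $O_{p}(\cdot)$.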

This concludes the discussion on the analysis of Algorithm 1. In the following sections we move on to analyzing how to solve the residual problem approximately.

5 Solving the Residual Problem

In this section, we give an algorithm that solves the residual problem to a constant approximation. Combined with the iterative refinement scheme from Theorem 4.1, we obtain the following result.

Theorem 5.1.

For $p\geq 2$ , we can find an $\varepsilon$ -approximate solution to (* ‣ 1) in time

[TABLE]

Here $\omega$ is the matrix multiplication constant.

Recall that the residual problem

[TABLE]

has a linear term followed by the $\gamma_{p}$ function. Instead of directly optimizing this function, we guess an approximate value of the linear term, and for each such guess, we minimize the $\gamma_{p}$ function under this additional constraint. We can scale the problem so that the optimum is at most $1.$ Finally, we can perturb $\bm{\mathit{t}}$ so that each $\bm{\mathit{t}}_{i}$ lies in a polynomially bounded range without adding significant error. Our final problem looks as follows,

[TABLE]

with $m^{-1/p}\leq\bm{\mathit{t}}_{i}\leq 1,\forall i$ .

To summarize, $\kappa$ -Approx (Algorithm 2) formalizes this process and shows that we only need to solve a logarithmic number of instances of the above program (2), and that solving each to a $\kappa$ -approximation gives an $\Omega_{p}\left(\kappa^{1/(\min\{2,p\}-1)}\right)$ -approximate solution to (1). Gamma-Solver (Algorithm 4) solves problem (2) to an $O_{p}(1)$ approximation. Therefore, using Gamma-Solver as a subroutine for $\kappa$ -Approx, we get an $O_{p}(1)$ approximate solution to (1). Section 5.1 gives an analysis of $\kappa$ -Approx. In Section 5.2, we give an oracle that is used in Gamma-Solver. We give an analysis of Gamma-Solver in Section 5.3. Finally, in Section 5.4, we prove Theorem 5.1.

5.1 Equivalent Problems

In this section we prove the following theorem.

Theorem 5.2.

Procedure $\kappa$ -Approx (Algorithm 2) returns an $\Omega_{p}\left(\kappa^{\nicefrac{{1}}{{(\min\{2,p\}-1)}}}\right)$ -approximate solution to the residual problem given by (1), by solving $O_{p}\left(\log\left(\frac{m}{\varepsilon}\right)\right)$ instances of program (2) to a $\kappa$ -approximation.

The following lemmas will lead to the proof of the above theorem. The first lemma gives an upper and lower bound on the objective of (1).

Lemma 5.3.

Let $p\in(1,\infty)$ and assume that our current solution $\bm{\mathit{x}}$ is not an $\varepsilon$ -approximate solution. Let $\lambda$ be such that $\lambda^{\min\{1,p-1\}}=\frac{p-1}{p4^{p}}$ . For some

[TABLE]

$\mathit{\alpha}(\Delta^{\star})\in[2^{i-1},2^{i})$ where $\Delta^{\star}$ is the solution of (1).

We defer the proof to Appendix A.3. Lemma 5.3 suggests that we can divide the range of the objective of our residual problem, $\mathit{\alpha}$ into a logarithmic number of bins and solve a decision problem that asks if the optimum belongs to the bin. The lemma guarantees that at least one of the decision problems will be feasible. The following lemma defines the required decision problems and shows that solving these to a constant approximation is sufficient to get a constant approximate solution to (1).
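The binning idea can be sketched as follows: every positive objective value falls into exactly one dyadic bin $[2^{i-1},2^{i})$, so a range of values spanning a polynomial factor is covered by logarithmically many indices. The bounds `lo` and `hi` below are hypothetical placeholders; the actual index range is the one given in Lemma 5.3.

```python
import math


def bin_index(value):
    """Return the integer i with 2^(i-1) <= value < 2^i, for value > 0."""
    if value <= 0:
        raise ValueError("value must be positive")
    i = math.floor(math.log2(value)) + 1
    # Guard against floating-point rounding near exact powers of two.
    while 2.0 ** (i - 1) > value:
        i -= 1
    while 2.0 ** i <= value:
        i += 1
    return i


def candidate_indices(lo, hi):
    """All bin indices needed to cover the interval [lo, hi]."""
    return list(range(bin_index(lo), bin_index(hi) + 1))
```

The number of candidates is about $\log_{2}(\texttt{hi}/\texttt{lo})$, which is what makes trying every bin affordable.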

Lemma 5.4.

Let $p\in(1,\infty)$ . Suppose $\mathit{\alpha}(\Delta^{\star})\in[2^{i-1},2^{i})$ for some $i$ where $\Delta^{\star}$ is the solution of (1). The following program is feasible:

[TABLE]

If $\Delta(i)$ is a $\beta$ -approximate solution to program (3) for this choice of $i,$ then, we can pick $\mu\leq 1$ such that the vector $\mu\Delta(i)$ is an $\Omega_{p}\left(\beta^{\frac{1}{\min\{p,2\}-1}}\right)$ -approximate solution to (1).

The proof can be found in Appendix A.3. We now scale down the objective of (3) so that it is at most $1$ . The next lemma shows what the scaled down problem looks like and how an approximate solution to the scaled down problem gives an approximate solution to (3). Again the proof of the lemma can be found in Appendix A.3.

Lemma 5.5.

Let $p\in(1,\infty)$ . Let $i$ be such that (3) is feasible. Let

[TABLE]

Note that $m^{-1/p}\leq\hat{\bm{\mathit{t}}}_{j}\leq 1$ . Then program (2) with $\bm{\mathit{t}}=\hat{\bm{\mathit{t}}}$ , and

[TABLE]

has $\textsc{OPT}\leq 1$ . Let $\Delta^{\star}$ be a $\kappa$ -approximate solution to (2). Then, $\Delta=\left(\frac{p}{2}\right)^{1/2}\left(\frac{p}{p-1}\right)^{1/p}2^{1+\nicefrac{{i}}{{p}}}\Delta^{\star}$ is an $\Omega_{p}(\kappa)$ -approximate solution to (3).

We now prove Theorem 5.2.

Proof.

Lemma 5.3 suggests that there exists an index

[TABLE]

such that $OPT=\mathit{\alpha}(\Delta^{\star})\in[2^{j-1},2^{j})$ . Lemma 5.4 implies that (3) is feasible for index $j$ . Suppose $\Delta^{(j)}$ is a $\kappa$ -approximate solution to the scaled down problem (2) for index $j$ . Lemma 5.5 implies that $\tilde{\Delta}^{(j)}=\left(\frac{p}{2}\right)^{1/2}\left(\frac{p}{p-1}\right)^{1/p}2^{1+\nicefrac{{i}}{{p}}}\Delta^{(j)}$ is an $\Omega_{p}(\kappa)$ approximate solution to (3) for index $j$ . Lemma 5.4 now implies that $\tilde{\Delta}^{(j)}=\mu\Delta^{(j)}$ is a $\Omega_{p}\left(\kappa^{\frac{1}{\min\{p,2\}-1}}\right)$ -approximation to our residual problem (1). Now, the algorithm solves the scaled down problem for every $i$ and returns the $\tilde{\Delta}^{(i)}$ that when added to our current solution gives the minimum $\ell_{p}$ -norm. It either chooses $\tilde{\Delta}^{(j)}$ or some other solution $\tilde{\Delta}^{(i)}$ . In case it returns $\tilde{\Delta}^{(i)}$ ,

[TABLE]

From Lemma 4.6 we know,

[TABLE]

We thus have $\mathit{\alpha}(\lambda\tilde{\Delta}^{(i)})\geq\Omega_{p}\left(\kappa^{-\frac{1}{\min\{p,2\}-1}}\right)\textsc{OPT}$ , implying $\lambda\cdot\tilde{\Delta}^{(i)}$ is also a $\Omega_{p}\left(\kappa^{\frac{1}{\min\{p,2\}-1}}\right)$ approximate solution as required. ∎

It remains to solve problems of the form (2) up to a $\kappa$ -approximation. Recall that these problems look like,

[TABLE]

and satisfy $OPT\leq 1$ , and $m^{-1/p}\leq\bm{\mathit{t}}_{j}\leq 1,\forall j$ .

5.2 Oracle

Our approach follows the format of the approximate max-flow algorithm by Christiano et al. [Chr+11]. Specifically, we use a variant of multiplicative weights update to converge to a solution with small $\gamma_{p}(t,\Delta)$ . The multiplicative weights update scheme repeatedly updates a set of weights $\bm{\mathit{w}}$ using partial, local solutions computed based on these weights. The Christiano et al. algorithm can be viewed as picking these weights from the gradients of the soft-max function on flows. We will adapt this routine by showing that $\bm{\mathit{w}}$ ’s chosen from the gradient of $\gamma_{p}(t,\Delta)$ also suffice for approximately solving the problem stated in (2).

The subroutine that this algorithm passes the $\bm{\mathit{w}}$ onto is commonly referred to as an oracle. An oracle needs to compute a solution with both small dot-product against $\bm{\mathit{w}}$ , and small width, which is defined as the maximum value of an entry. In such an oracle, the dot product condition is the hard constraint, in that the final approximation factor of the solution is directly related to the value of these dot products. The width, on the other hand, only affects the overall iteration count/ running time, and can even be manipulated/improved algorithmically. Therefore we first need to define and show a good upper bound on the objective of the optimization problem solved within the oracle.

Formally, our oracle subroutine Algorithm 3 takes as input some affine constraints and vector of weights $\bm{\mathit{w}}$ . It first computes a vector of non-negative weights $\bm{\mathit{r}}$ , and then returns a minimizer to the following optimization problem

[TABLE]

Appendix C contains an algorithm that solves such problems efficiently.

Let us now look at some properties of the solution returned by the oracle. Note that the objective of our problem (2) is at most $1$ . This implies that there exists $\Delta^{*}$ such that

• $\sum_{e}(\Delta^{*}_{e})^{2}\bm{\mathit{t}}_{e}^{p-2}\leq 1$ ,

• $\sum_{e}\left|\Delta^{*}_{e}\right|^{p}\leq 1$ , or equivalently $\left\|\Delta^{*}\right\|_{p}\leq 1$ .

We next look at some relations on the weights and resistances. The following lemma is a simple application of Hölder’s inequality. Its proof is given in Appendix A.3.

Lemma 5.6.

Let $p\geq 2$ . For any set of weights $\bm{\mathit{w}}$ on the edges, $\sum_{e}\bm{\mathit{w}}_{e}^{p-2}(\Delta^{*}_{e})^{2}\leq\left\|\bm{\mathit{w}}\right\|_{p}^{p-2}.$

Lemma 5.7.

Let $p\geq 2$ . For any $\bm{\mathit{w}}$ , let $\Delta$ be the electrical flow computed with respect to resistances

[TABLE]

and demand vector $\bm{\mathit{d}}.$

Then the following hold,

1. $\sum_{e}\Delta_{e}^{2}\leq\sum_{e}\bm{\mathit{r}}_{e}\Delta_{e}^{2}\leq m^{\frac{p-2}{p}}+\left\|\bm{\mathit{w}}\right\|_{p}^{p-2},$

2. $\textstyle\sum_{e}\left|\Delta_{e}\right|\left|\gamma^{\prime}(m^{\nicefrac{{1}}{{p}}}\bm{\mathit{t}}_{e},\bm{\mathit{w}}_{e})\right|\leq p\left({\sum_{e}\gamma_{p}(m^{\nicefrac{{1}}{{p}}}\bm{\mathit{t}}_{e},\bm{\mathit{w}}_{e})}\right)^{\frac{p-1}{p}}+pm^{\frac{p-2}{2p}}\left({\sum_{e}\gamma_{p}(m^{\nicefrac{{1}}{{p}}}\bm{\mathit{t}}_{e},\bm{\mathit{w}}_{e})}\right)^{\frac{1}{2}}.$

Proof.

Since $\Delta$ is the electrical flow,

[TABLE]

We have,

[TABLE]

Finally, using $\bm{\mathit{r}}_{e}\geq\left(m^{\nicefrac{{1}}{{p}}}\bm{\mathit{t}}_{e}\right)^{p-2}\geq 1,$ we have $\sum_{e}\Delta_{e}^{2}\leq\sum_{e}\bm{\mathit{r}}_{e}\Delta_{e}^{2},$ completing part 1.

Now we know that,

[TABLE]

Using the Cauchy-Schwarz inequality,

[TABLE]

Combining the two cases we have,

[TABLE]

where the last line uses $\left\|\bm{\mathit{w}}\right\|_{p}^{p}\leq\gamma_{p}(m^{\nicefrac{{1}}{{p}}}\bm{\mathit{t}},\bm{\mathit{w}})$ for any $\bm{\mathit{t}}.$ ∎

5.3 The Algorithm

Next, we integrate this oracle into the overall algorithm that repeatedly adjusts the weights. As with the use of electrical flow oracles for approximate max-flow [Chr+11], the convergence of such a scheme depends on the maximum values in the $\Delta$ returned by the oracle. However, because the overall objective is now a $p$ -norm, the exact term of importance is actually the $p$ -norm of $\Delta$ . Up to this discrepancy, we follow the algorithmic template from [Chr+11] by making an update when $\left\|\Delta\right\|_{p}^{p}$ is small, and make progress via another potential function otherwise.

In the cases where we do not take the step due to entries with large values, we show significant increases in an additional potential function, namely the objective of the quadratic minimization problem inside the oracle (Algorithm 3). However, the less gradual update schemes related to $p$ -norms make it no longer sufficient to update only the weight corresponding to the entry with maximum value. Furthermore, there may be entries with large values whose corresponding resistances are too large for us to afford increasing. We address this by a scheme where we update an entry only if its value is larger than some threshold $\rho$ and its resistance is at most another threshold $\beta$ . Specifically, we show that for an appropriate choice of $\beta$ and $\rho$ , such updates do not change the primary potential function (related to $\gamma_{p}(t,\bm{\mathit{x}})$ ) by too much (Lemma 5.10), and increase the secondary potential function (the objective of the quadratic minimization problem) significantly whenever $\left\|\Delta\right\|_{p}^{p}$ is large (Lemma 5.13). Pseudocode of this scheme is in Algorithm 4.

Theorem 5.8.

Let $p\geq 2$ . Given a matrix $\bm{\widehat{\mathit{A}}}$ and vectors $\bm{\mathit{x}}$ and $\bm{\mathit{t}}$ such that $\forall e,m^{-1/p}\leq\bm{\mathit{t}}_{e}\leq 1$ , Algorithm 4 uses $O_{p}\left(m^{\frac{p-2}{(3p-2)}}\left(\log\left(\frac{m\left\|\bm{\mathit{d}}\right\|^{2}_{2}}{\|\bm{\widehat{\mathit{A}}}\|^{2}}\right)\right)^{\frac{p}{3p-2}}\right)$ calls to the oracle and returns a vector $\bm{\mathit{x}}$ such that $\bm{\widehat{\mathit{A}}}\bm{\mathit{x}}=\bm{\mathit{d}},$ and $\gamma_{p}(\bm{\mathit{t}},\bm{\mathit{x}})=O_{p}(1)$ .

Analysis of Potentials.

We define the following potential function for the analysis of our algorithm.

Definition 5.9.

Let $\Phi$ be the potential function defined as

[TABLE]

Initially, since we start with $\bm{\mathit{w}}^{(0)}=0$ , we have $\Phi(\bm{\mathit{w}}^{(0)})=0.$ Observe that in the algorithm, we update the potentials in both the flow step and the width reduction step whereas we update the solution only in the flow step. It is easy to see that we always have $\bm{\mathit{w}}^{(i,k)}\geq\left|\bm{\mathit{x}}^{(i,k)}\right|.$

We next bound the potential. In addition, we track the energy of the electrical flow in the network with resistances $\bm{\mathit{r}}.$ Let $\Psi\left(\bm{\mathit{r}}\right)$ denote the minimum energy of routing $\bm{\mathit{d}}$ with resistances $\bm{\mathit{r}}$ :

[TABLE]

Note that this energy is equal to the energy calculated using the $\Delta$ obtained in the solution of (4).
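As an illustration of the energy $\Psi\left(\bm{\mathit{r}}\right)$, the sketch below assumes the oracle's problem is the weighted least-squares program $\min\sum_{e}\bm{\mathit{r}}_{e}\Delta_{e}^{2}$ subject to $\bm{\widehat{\mathit{A}}}\Delta=\bm{\mathit{d}}$ (the exact program is in an omitted display), with positive resistances and a dense constraint matrix of full row rank. It uses the closed form obtained from Lagrange multipliers; the function name is hypothetical.

```python
import numpy as np


def electrical_flow(A, d, r):
    """Minimize sum_e r_e * Delta_e^2 subject to A @ Delta = d.

    Closed form: Delta = R^{-1} A^T (A R^{-1} A^T)^{-1} d with R = diag(r).
    Returns the minimizer and its energy.
    """
    r_inv = 1.0 / r
    gram = (A * r_inv) @ A.T              # A R^{-1} A^T (columns scaled by 1/r)
    potentials = np.linalg.solve(gram, d)
    delta = r_inv * (A.T @ potentials)
    energy = float(np.sum(r * delta ** 2))
    return delta, energy
```

Because the feasible set does not depend on $\bm{\mathit{r}}$, increasing any resistance can only increase the returned energy; this monotonicity is the qualitative fact that the width-reduction analysis quantifies.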

Notation.

We overload notation for $\Psi\left(i,k\right)$ to denote $\Psi\left(\bm{\mathit{r}}^{(i,k)}\right).$

Our proof of Theorem 5.8 will be based on two main parts:

1. Provided the total number of width reduction steps, $K$ , is not too big, $\Phi(T,K)$ is small. This in turn upper bounds the cost of the approximate solution $m^{-1/p}\bm{\mathit{x}}$ .

2. Showing that $K$ cannot be too big, because each width reduction step causes large growth in $\Psi\left(\cdot\right)$ , while we can bound the total growth in $\Psi\left(\cdot\right)$ by relating it to $\Phi(\cdot)$ .

We start by observing that when we increase the weight $\bm{\mathit{w}}_{e}$ of an edge during a width reduction step, this has the effect of at least doubling the resistance $\bm{\mathit{r}}_{e}$ . Recall,

[TABLE]

Now,

[TABLE]

Meanwhile, the resistance does not grow by a factor larger than 4:

[TABLE]

We next show through the following lemma that the $\Phi$ potential does not increase too rapidly. The proof is through induction and can be found in Appendix B.

Lemma 5.10.

After $i$ flow steps, and $k$ width-reduction steps, provided

1. $\alpha^{p}\tau\leq\alpha m^{\frac{p-1}{p}}$ (controls $\Phi$ growth in flow-steps),

2. $k\leq\rho^{2}m^{2/p}\beta^{-\frac{2}{p-2}}$ (acceptable number of width-reduction steps),

the potential $\Phi$ is bounded as follows:

[TABLE]

We next wish to prove that in each width-reduction step, the electrical energy $\Psi\left(\cdot\right)$ goes up significantly. For this, we will use the following Lemma which is proven in Appendix D. It generalizes Lemma 2.6 of [Chr+11] to arbitrary weighted $\ell_{2}$ regression problems, and directly measures the change in terms of the electrical energy of the entries modified.

Lemma 5.11.

Assuming the program (4) is feasible, let $\Delta$ be a solution to the optimization problem (4) with weights $\bm{\mathit{r}}$ . Suppose we increase the resistance on each entry to get $\bm{\mathit{r}}^{\prime}.$ Then,

[TABLE]

This statement also implies the form of the lemma that concerns increasing the resistances on a set of entries uniformly [Chr+11, Lemma 2.6].

The next lemma gives a lower bound on the energy in iteration $0$ , i.e., when we start, and an upper bound on the energy at each step.

Lemma 5.12.

Initially, we have,

[TABLE]

where $\left\|\bm{\mathit{A}}\right\|$ is the operator norm, or maximum singular value of $\bm{\mathit{A}}$ . Let us call this ratio $L$ . Moreover, at any step $i,$ we have,

[TABLE]

Proof.

For the lower bound in the initial state, recall that we scale the problem such that $\textsc{OPT}=1,$ and $\bm{\mathit{t}}_{e}\geq m^{-\nicefrac{{1}}{{p}}}.$ Initially we have, $\bm{\mathit{r}}^{(0,0)}_{e}=(m^{\nicefrac{{1}}{{p}}}\bm{\mathit{t}}_{e})^{p-2}\geq 1.$ This means for any solution $\Delta$ , we have

[TABLE]

On the other hand, because

[TABLE]

we get

[TABLE]

upon which squaring gives the lower bound on $\Psi(\bm{\mathit{r}}^{(0,0)})$ .

For the upper bound, Lemma 5.7 implies that

[TABLE]

∎

The next Lemma says that the three assumptions (stated in the statement of the Lemma) can be used to ensure that the potential $\Psi\left(\cdot\right)$ grows quickly with each width reduction step, and that flow steps do not cause the potential to shrink.

Lemma 5.13.

Suppose at step $(i,k)$ we have $\left\|\Delta\right\|_{p}>\tau$ so that we perform a width reduction step (line 19). If

1. $\Phi(i,k)\leq O_{p}(1)m$ ,

2. $\tau^{2/p}\geq 2\Omega_{p}(1)\frac{m^{\frac{p-2}{p}}}{\beta}$ , and

3. $\frac{\tau}{10}\geq\rho^{p-2}m^{\frac{p-2}{p}}$ .

Then

[TABLE]

Furthermore, if at $(i,k)$ we have $\left\|\Delta\right\|_{p}\leq\tau$ so that we perform a flow step, then

[TABLE]

Proof.

It will be helpful for our analysis to split the index set into three disjoint parts:

• $S=\left\{e:\left|\Delta_{e}\right|\leq\rho\right\}$

• $H=\left\{e:\left|\Delta_{e}\right|>\rho\text{ and }\bm{\mathit{r}}_{e}\leq\beta\right\}$

• $B=\left\{e:\left|\Delta_{e}\right|>\rho\text{ and }\bm{\mathit{r}}_{e}>\beta\right\}$ .
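The three index sets above are determined by two thresholds and can be computed in a single pass. The following sketch (with a hypothetical function name and plain Python lists as inputs) makes the case split explicit.

```python
def split_indices(delta, r, rho, beta):
    """Partition indices into S (small value), H (large value, small
    resistance) and B (large value, large resistance)."""
    S, H, B = [], [], []
    for e, (delta_e, r_e) in enumerate(zip(delta, r)):
        if abs(delta_e) <= rho:
            S.append(e)          # |Delta_e| <= rho
        elif r_e <= beta:
            H.append(e)          # |Delta_e| > rho and r_e <= beta
        else:
            B.append(e)          # |Delta_e| > rho and r_e > beta
    return S, H, B
```

Only the entries in $H$ are candidates for a weight increase in a width reduction step, which is why the proof needs $H$ to carry a large share of the energy.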

Firstly, we note

[TABLE]

hence, using Assumption 3

[TABLE]

This means,

[TABLE]

Secondly we note that, using Assumption (1) and Lemma 5.7, we have

[TABLE]

So then, using Assumption 2,

[TABLE]

As $\bm{\mathit{r}}_{e}\geq 1$ , this implies $\sum_{e\in H}\bm{\mathit{r}}_{e}\Delta_{e}^{2}\geq\Omega_{p}(1)\tau^{2/p}$ .

From Lemma 5.12 and Assumption 1 we have

[TABLE]

So then, combining our last two observations, and applying Lemma 5.11, we get

[TABLE]

Finally, for the “flow step” case, by applying Lemma 5.11 with $H$ as the whole set of indices, $\delta=1$ and $\gamma=1$ , we get that as the resistances only increase,

[TABLE]

∎

We are now ready to prove Theorem 5.8.

Proof of Theorem 5.8

Proof.

We first observe that our parameter choices in Algorithm 4 satisfy Assumption 1 of Lemma 5.10; namely, we can choose the parameters $\alpha$ and $\tau$ s.t.

• $\alpha\leftarrow\Theta_{p}\left(m^{-\frac{p^{2}-5p+2}{p(3p-2)}}\left(\log\left(\frac{m}{L}\right)\right)^{\frac{-p}{3p-2}}\right)$ ,

• $\tau\leftarrow\Theta_{p}\left(m^{\frac{(p-1)(p-2)}{(3p-2)}}\left(\log\left(\frac{m}{L}\right)\right)^{\frac{p(p-1)}{3p-2}}\right)$ ,

while ensuring $\alpha^{p}\tau\leq\alpha m^{\frac{p-1}{p}}$ . This means, by Lemma 5.10, that if the algorithm completes after taking $T=\alpha^{-1}m^{1/p}$ flow steps and $K\leq\Omega_{p}(1)\rho^{2}m^{2/p}\beta^{-\frac{2}{p-2}}$ width reduction steps, then when it returns we have

[TABLE]

This means that the algorithm returns $m^{-\frac{1}{p}}\bm{\mathit{x}}$ with

[TABLE]

Note the only alternative is that the algorithm takes more than $\Omega_{p}(1)\rho^{2}m^{2/p}\beta^{-\frac{2}{p-2}}$ width reduction steps (and possibly infinitely many such steps, hence never terminating).

We will now show this cannot happen, by deriving a contradiction from the assumption that the algorithm takes a width reduction step starting from step $(i,k)$ where $i<T$ and $k=\rho^{2}m^{2/p}\beta^{-\frac{2}{p-2}}$ .

Since the conditions for Lemma 5.10 hold for all preceding steps, we must have $\Phi(i,k)\leq O_{p}(1)m$ .

Additionally, we note that our parameter choice of $\beta=\Theta_{p}\left(m^{\frac{p-2}{3p-2}}\left(\log\left(\frac{m}{L}\right)\right)^{-\frac{2(p-1)}{3p-2}}\right)$ and $\rho=\Theta_{p}\left(m^{\frac{(p^{2}-4p+2)}{p(3p-2)}}\left(\log\left(\frac{m}{L}\right)\right)^{\frac{p(p-1)}{(p-2)(3p-2)}}\right)$ along with our choice of $\tau$ (see above), ensures that

[TABLE]

This means that at every step $(j,l)$ preceding the current step, the conditions of Lemma 5.13 are satisfied, so we can prove by a simple induction that

[TABLE]

Since our parameter choices ensure $\Omega_{p}(1)\frac{\tau^{2/p}}{m^{\frac{p-2}{p}}}k>\Theta_{p}\left(\frac{m}{L}\right)$ this means

[TABLE]

But this contradicts Lemma 5.12, since this Lemma, combined with $\Phi(i,k)\leq O_{p}(1)m$ gives

[TABLE]

From this contradiction, we conclude that we never have more than $K=\Omega_{p}(1)\rho^{2}m^{2/p}\beta^{-\frac{2}{p-2}}$ width reduction steps.

Now we observe that the total number of oracle calls in the algorithm is bounded by

[TABLE]

∎

This concludes the analysis of our algorithm.
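The parameter choices in the proof of Theorem 5.8 balance several conditions at once, which is easy to verify at the level of exponents of $m$. The sketch below ignores logarithmic factors and $\Theta_{p}(1)$ constants (a simplification made only for this check) and uses exact rational arithmetic to confirm that, for $p>2$, the stated exponents of $\alpha,\tau,\beta,\rho$ make condition 1 of Lemma 5.10 and conditions 2 and 3 of Lemma 5.13 tight, and give $T=\alpha^{-1}m^{1/p}=m^{\frac{p-2}{3p-2}}$. The function names are hypothetical.

```python
from fractions import Fraction


def exponents(p):
    """Exponents of m in alpha, tau, beta, rho (log factors dropped)."""
    p = Fraction(p)
    return {
        "alpha": -(p * p - 5 * p + 2) / (p * (3 * p - 2)),
        "tau": (p - 1) * (p - 2) / (3 * p - 2),
        "beta": (p - 2) / (3 * p - 2),
        "rho": (p * p - 4 * p + 2) / (p * (3 * p - 2)),
    }


def check(p):
    """Compare exponents on both sides of each balancing condition."""
    e = exponents(p)
    p = Fraction(p)
    return {
        # alpha^p * tau  versus  alpha * m^((p-1)/p)
        "flow_condition": p * e["alpha"] + e["tau"] == e["alpha"] + (p - 1) / p,
        # T = alpha^(-1) * m^(1/p)  versus  m^((p-2)/(3p-2))
        "flow_steps": -e["alpha"] + 1 / p == (p - 2) / (3 * p - 2),
        # tau^(2/p)  versus  m^((p-2)/p) / beta
        "assumption_2": 2 * e["tau"] / p == (p - 2) / p - e["beta"],
        # tau  versus  rho^(p-2) * m^((p-2)/p)
        "assumption_3": e["tau"] == (p - 2) * e["rho"] + (p - 2) / p,
    }
```

For example, `check(4)` reports every condition as balanced: with $p=4$ the exponents are $\alpha\sim m^{1/20}$, $\tau\sim m^{3/5}$, $\beta\sim m^{1/5}$, $\rho\sim m^{1/20}$, and $T\sim m^{1/5}$.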

5.4 Proof of Theorem 5.1

Proof.

Theorem 5.8 implies that we can solve Program (2) using Algorithm 4 to get an $O_{p}(1)$ -approximate solution in $\widetilde{O}_{p}\left(m^{\frac{p-2}{3p-2}}\right)$ calls to the Oracle. Implementing the Oracle requires solving a linear system, and hence can be done in $O((m+n)^{\omega})$ time, where $\omega$ is the matrix multiplication constant (see the Appendix for a proof). Thus, we can find an $O_{p}(1)$ -approximate solution to (2) in total time

[TABLE]

Now, Theorem 5.2 implies that we can find an $O_{p}(1)$ -approximate solution to the residual problem (1) in total time,

[TABLE]

Finally, using Theorem 4.1, we can conclude that we have an $\varepsilon$ -approximate solution to (* ‣ 1) in $\widetilde{O}_{p}\left(\log\frac{1}{\varepsilon}\right)$ calls to an $O_{p}(1)$ -approximate solver to the residual problem (1). This gives us a total running time of,

[TABLE]

∎

We now have a complete algorithm for the $p$ -norm regression problem that gives an $\varepsilon$ -approximate solution.

6 Speedups for General Matrices via Inverse Maintenance

If $\bm{\mathit{A}}$ is an explicitly given $m\times n$ matrix, we need to solve the quadratic minimization problem at each step. This can be done by solving a linear system in the matrix

[TABLE]

which takes $O((m+n)^{\omega})$ time, where $\omega$ is the matrix multiplication constant. This directly gives a total running time of $\widetilde{O}_{p}(m^{\frac{p-2}{(3p-2)}}(m+n)^{\omega}\log(1/\varepsilon))$ , whose exponent exceeds $2.70$ for large values of $p$ under the assumption $\omega>2.37$ .

This is more than the running time of about $O(mn^{1.5})$ of algorithms based on inverse maintenance [Vai90, LS14, LS15]. In this section we show that the MWU routine from Section 5 can also benefit from fast inverse maintenance. Our main result is:

Theorem 6.1.

If $\bm{\mathit{A}}$ is an explicitly given, $m$ -by- $n$ matrix with polynomially bounded condition numbers, and $p\geq 2$ , then Algorithm 4 as given in Section 5.3 can be implemented to run in total time

[TABLE]

A few remarks about this running time: the term that dominates depends on the comparison between $2/3$ and $10-4\omega$ , or after manipulation, the comparison between $\omega$ and $7/3$ :

1. For the current best value of $\omega>7/3$ , the exponent of the second term is at most $\omega$ , so the total running time is about $(m+n)^{\omega}$ .

2. If $\omega=2$ , then this running time is simply $(m+n)^{2+\frac{p-2}{3p-2}}$ : the same as re-solving the linear system at each step.

3. If $\omega\leq 7/3$ , then the overhead in the exponent on the second term is at most

[TABLE]

and this value approaches $\frac{p-2}{3p-2}$ as $\omega\rightarrow 2$ .

Our algorithm is based on gradually updating the $\bm{\mathit{r}}$ vector. First, note that the $\bm{\mathit{w}}_{e}^{(i)}$ ’s, and thus the $\bm{\mathit{r}}_{e}^{(i)}$ ’s, are monotonically increasing. Secondly, for the entries of $\bm{\mathit{r}}^{(i)}$ that do not double, we can keep the original values and still obtain a factor $2$ preconditioner. Thus, we only need to update the entries of $\bm{\mathit{r}}^{(i)}$ that have significant increases. This update can be encapsulated by the following result on computing low rank perturbations to a matrix, which is a direct consequence of rectangular matrix multiplication and the Woodbury matrix identity.

Lemma 6.2.

Given an $m$ -by- $n$ matrix $\bm{\mathit{A}}$ , along with vectors $\bm{\widehat{r}}$ and $\bm{\widetilde{r}}$ that differ in $k$ entries, as well as the matrix $\bm{\widehat{\mathit{Z}}}=(\bm{\mathit{A}}^{\top}{\bf Diag}\left({\bm{\widehat{r}}}\right)^{-1}\bm{\mathit{A}})^{-1}$ , we can construct $(\bm{\mathit{A}}^{\top}{\bf Diag}\left({\bm{\widetilde{r}}}\right)^{-1}\bm{\mathit{A}})^{-1}$ in $O(k^{\omega-2}(m+n)^{2})$ time.

Proof.

Let $S$ denote the entries that differ in $\bm{\widehat{r}}$ and $\bm{\widetilde{r}}$ . Then we have

[TABLE]

This is a low rank perturbation, so by Woodbury matrix identity we get:

[TABLE]

where we use $\bm{\widehat{\mathit{Z}}}^{\top}=\bm{\widehat{\mathit{Z}}}$ because $\bm{\mathit{A}}^{\top}{\bf Diag}\left({\bm{\widehat{r}}}\right)^{-1}\bm{\mathit{A}}$ is a symmetric matrix. To explicitly compute this matrix, we need to:

1. compute the matrix $\bm{\mathit{A}}_{S,:}\bm{\widehat{\mathit{Z}}}$ ,

2. compute $\bm{\mathit{A}}_{S,:}\bm{\widehat{\mathit{Z}}}\bm{\mathit{A}}_{S,:}^{\top}$ ,

3. invert the middle term.

This cost is dominated by the first term, which can be viewed as multiplying $\lceil n/k\rceil$ pairs of $k\times n$ and $n\times k$ matrices. Each such multiplication takes time $k^{\omega-1}n$ , for a total cost of $O(k^{\omega-2}n^{2})$ . The other terms all involve matrices with dimension at most $k\times n$ , and are thus lower order terms. ∎
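The low-rank update in this proof can be sketched with dense NumPy arrays as follows. This is a minimal illustration under stated assumptions: $\bm{\mathit{A}}$ has full column rank, all resistances are positive, and plain dense products stand in for the fast rectangular matrix multiplication that gives the stated running time. The function name is hypothetical.

```python
import numpy as np


def woodbury_update(A, Z_hat, r_hat, r_tilde):
    """Return (A^T diag(r_tilde)^{-1} A)^{-1} given
    Z_hat = (A^T diag(r_hat)^{-1} A)^{-1}, touching only changed rows."""
    S = np.flatnonzero(r_hat != r_tilde)       # entries whose resistance changed
    if S.size == 0:
        return Z_hat.copy()
    A_S = A[S, :]                               # k x n block of changed rows
    c = 1.0 / r_tilde[S] - 1.0 / r_hat[S]       # diagonal of the rank-k correction
    AZ = A_S @ Z_hat                            # step 1: k x n
    middle = np.diag(1.0 / c) + AZ @ A_S.T      # step 2: k x k middle term
    # Step 3: solve with the middle term; Z_hat is symmetric, so AZ.T = Z_hat A_S^T.
    return Z_hat - AZ.T @ np.linalg.solve(middle, AZ)
```

The only products involving an $n\times n$ matrix are in step 1, matching the observation that this step dominates the cost.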

Note that the running time of Lemma 6.2 favours ‘batching’ a large number of modified edges to insert. To this end, we show that it suffices to have an inverse that only approximates some entries of $\bm{\mathit{r}}^{(i)}$ . To do so, we first need to introduce our notions of approximations:

Definition 6.3.

We use $a\approx_{c}b$ for positive numbers $a$ and $b$ iff $c^{-1}a\leq b\leq c\cdot a$ , and for vectors $\boldsymbol{\mathit{a}}$ and $\bm{\mathit{b}}$ we use $\boldsymbol{\mathit{a}}\approx_{c}\bm{\mathit{b}}$ to denote $\boldsymbol{\mathit{a}}_{i}\approx_{c}\bm{\mathit{b}}_{i}$ entry-wise.
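Read as $c^{-1}a\leq b\leq c\cdot a$, this notion of approximation is a one-line predicate. The sketch below (hypothetical function names, plain Python sequences) spells out the scalar and entry-wise versions.

```python
def approx(a, b, c):
    """True iff a and b are within a factor c of each other (a, b > 0, c >= 1)."""
    return a / c <= b <= c * a


def approx_vec(a, b, c):
    """Entry-wise version for two sequences of positive numbers."""
    return len(a) == len(b) and all(approx(ai, bi, c) for ai, bi in zip(a, b))
```

The relation is symmetric: `approx(a, b, c)` holds exactly when `approx(b, a, c)` does, so it does not matter which of the two resistance vectors is treated as the reference.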

Since we are only updating $k$ resistances that have a constant factor increase and using a constant factor preconditioning for the others, we need the following result on preconditioned iterative methods for solving systems of linear equations.

Lemma 6.4.

If $\bm{\mathit{r}}$ and $\bm{\widehat{r}}$ are vectors such that $\bm{\mathit{r}}\approx_{\widetilde{O}(1)}\bm{\widehat{r}}$ , and we’re given the matrix $\bm{\widehat{\mathit{Z}}}=(\bm{\mathit{A}}^{\top}{\bf Diag}\left({\bm{\widehat{r}}}\right)^{-1}\bm{\mathit{A}})^{-1}$ explicitly, then we can solve a system of linear equations involving $\bm{\mathit{A}}^{\top}{\bf Diag}\left({\bm{\mathit{r}}}\right)^{-1}\bm{\mathit{A}}$ to $1/{\textrm{poly}}(n)$ accuracy in $\widetilde{O}(n^{2})$ time.
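One standard way to realize a solver of this kind is preconditioned conjugate gradient, using the stale inverse $\bm{\widehat{\mathit{Z}}}$ as the preconditioner. The sketch below uses that technique as an illustration; it is not claimed to be the specific method behind the lemma, and the function names, tolerance and iteration cap are assumptions. When $\bm{\mathit{r}}\approx_{c}\bm{\widehat{r}}$, the preconditioned system has condition number at most $c^{2}$, so the iteration count depends only on $c$ and the target accuracy.

```python
import numpy as np


def preconditioned_cg(matvec, precond, b, tol=1e-10, max_iter=200):
    """Solve M x = b for symmetric positive definite M given as a matvec,
    with a symmetric positive definite preconditioner."""
    x = np.zeros_like(b)
    r = b - matvec(x)
    z = precond(r)
    p = z.copy()
    rz = r @ z
    b_norm = np.linalg.norm(b)
    for _ in range(max_iter):
        if np.linalg.norm(r) <= tol * b_norm:
            break
        Mp = matvec(p)
        step = rz / (p @ Mp)
        x = x + step * p
        r = r - step * Mp
        z = precond(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x


def solve_with_stale_inverse(A, r, Z_hat, b):
    """Solve (A^T diag(r)^{-1} A) x = b using Z_hat as the preconditioner."""
    r_inv = 1.0 / r
    matvec = lambda v: A.T @ (r_inv * (A @ v))   # never forms the matrix
    precond = lambda v: Z_hat @ v                # one n x n matrix-vector product
    return preconditioned_cg(matvec, precond, b)
```

Each iteration needs only matrix-vector products with $\bm{\mathit{A}}$, $\bm{\mathit{A}}^{\top}$ and $\bm{\widehat{\mathit{Z}}}$, so no new inverse is computed.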

As the resistances we provide to Oracle are in the range $[1,O_{p}(m)]$ , we get that each $\bm{\widehat{r}}_{e}$ only needs to be updated $O(\log{m})$ times, instead of after each iteration. However, it’s insufficient to use this bound in the worst-case manner: if there are $m^{1/3}$ iterations, each of which doubles the resistances on $m^{2/3}$ edges, then the total cost as given by Lemma 6.2 becomes

[TABLE]

which is about $(m+n)^{2.58}$ .
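The exponents quoted in this section follow from short arithmetic. The sketch below treats $m+n$ as $\Theta(m)$ and takes $\omega=2.37$ purely for illustration (both are assumptions of the sketch). It compares re-solving from scratch in every iteration with the worst-case batched updates of Lemma 6.2; the function names are hypothetical.

```python
def naive_exponent(omega, p=None):
    """Exponent of re-solving each iteration: (p-2)/(3p-2) + omega.
    With p=None, use the limiting overhead 1/3 for large p."""
    overhead = 1.0 / 3.0 if p is None else (p - 2.0) / (3.0 * p - 2.0)
    return overhead + omega


def batched_exponent(omega, iters_exp=1.0 / 3.0, batch_exp=2.0 / 3.0):
    """Exponent of m^iters_exp updates of rank k = m^batch_exp, each costing
    k^(omega-2) * m^2 by Lemma 6.2."""
    return iters_exp + batch_exp * (omega - 2.0) + 2.0
```

With $\omega=2.37$, `naive_exponent` gives about $2.70$ and `batched_exponent` gives about $2.58$, matching the two figures in the text.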

We get an even better bound by using our analysis from Section 5 to show that for iteration/edge combinations $i$ and $e$ , the (relative) update to $\bm{\mathit{r}}_{e}^{(i)}$ is small. Such small changes also imply that we can wait on such updates. For simplicity, suppose we only increment the resistances by factors of $\frac{1}{L},$ then it takes $\Theta(L)$ such increments until the edge’s resistance has deviated by a constant factor. Furthermore, we can wait for another $\Theta(L)$ iterations before having to reflect this change in $\bm{\widehat{r}}$ : the total relative increases in these iterations is also at most $O(1)$ . Formalizing this process leads to a lazy-update routine that tracks the increments of different sizes separately. Its Pseudocode is in Algorithms 5 and 6.

We will call the initialization routine InverseInit at the first iteration, and subsequently call UpdateInverse upon generating a new set of resistances in the call to Algorithm 3, Oracle. This is in turn called from Line 11 of Algorithm 4. As a result, we will assume access to all variables of these routines. Furthermore, our routines keep the following global variables:

1. $\bm{\widehat{r}}$ : resistances from the last time we updated each entry.

2. $counter(\eta)_{e}$ : for each entry, track the number of times that it changed (relative to $\bm{\widehat{r}}$ ) by a factor of about $2^{-\eta}$ since the previous update.

3. $\bm{\widehat{\mathit{Z}}}$ : an inverse of the matrix given by $\bm{\mathit{A}}^{\top}{\bf Diag}\left({\bm{\widehat{r}}}\right)^{-1}\bm{\mathit{A}}$ .

We first verify that the maintained inverse is always a good preconditioner to the actual matrix, $\bm{\mathit{A}}^{\top}{\bf Diag}\left({\bm{\mathit{r}}^{(i)}}\right)^{-1}\bm{\mathit{A}}$ .

Lemma 6.5.

After each call to UpdateInverse, the vector $\bm{\widehat{r}}$ satisfies

[TABLE]

Proof.

First, observe that any change in resistance exceeding $1$ is reflected immediately. Otherwise, every time we update $counter(j)_{e}$ , $\bm{\mathit{r}}_{e}$ can only increase additively by at most

[TABLE]

Once $counter(j)_{e}$ exceeds $2^{j}$ , $e$ will be added to $E_{changed}$ after at most $2^{j}$ steps. So when we start from $\bm{\widehat{r}}_{e}$ , $e$ is added to $E_{changed}$ after $counter(j)_{e}\leq 2^{j}+2^{j}=2^{j+1}$ iterations. The maximum possible increase in resistance due to the bucket $j$ is,

[TABLE]

Since there are only at most $m^{1/3}$ iterations, the contributions of buckets with $j>\log{m}$ are negligible. Now the change in resistance is influenced by all buckets $j$ , each contributing at most $4\bm{\widehat{r}}_{e}$ increase. The total change is at most $4\bm{\widehat{r}}_{e}\log m$ since there are at most $\log m$ buckets. We therefore have

[TABLE]

for every $i$ . ∎

It remains to bound the number and sizes of calls made to Lemma 6.2. For this we define variables

[TABLE]

to denote the number of edges added to $E_{changed}$ at iteration $i$ due to the value of $counter(\eta)_{e}$ . Note that $k(\eta)^{(i)}$ is non-zero only if $i\equiv 0\pmod{2^{\eta}}$ , and

[TABLE]

The following lemma gives a lower bound on the relative change of energy across one update of resistances.

Lemma 6.6.

Assuming the program (4) is feasible, let $\Delta$ be a solution to the optimization problem (4) with weights $\bm{\mathit{r}}$ . Suppose we increase the resistance on each entry to $\bm{\mathit{r}}^{\prime}$ . Then,

[TABLE]

Proof.

Lemma 5.11 gives us that,

[TABLE]

Since $\bm{\mathit{r}}_{e}\geq 1$ and $\Psi\left(\bm{\mathit{r}}\right)\leq O_{p}(m^{\frac{p-2}{p}})$ ,

[TABLE]

which gives our result. ∎

We divide our analysis into 2 cases, when the relative change in resistance is at least $1$ and when the relative change in resistance is at most $1$ . To begin with, let us first look at the following lemma that relates the change in weights to the relative change in resistance. The proof is in the Appendix.

Lemma 6.7.

Consider a flow step from Line 13 of Algorithm 4. We have

[TABLE]

where $\Delta$ is the $\ell_{2}$ minimizer solution produced by the oracle.

Let us now see what happens when the relative change in resistance is at least $1$ .

Lemma 6.8.

Throughout the course of a run of Algorithm 4, the number of edges added to $E_{changed}$ due to relative resistance increase of at least $1$ ,

[TABLE]

Proof.

From Lemma 6.6, we know that the relative change in energy over one iteration is at least,

[TABLE]

Over all iterations, the relative change in energy is at least,

[TABLE]

which is upper bounded by $O(\log m)$ . When iteration $i$ is a width reduction step, the relative resistance change is always at least $1$ . In this case $\left|\Delta_{e}\right|\geq\rho$ . When we have a flow step, Lemma 6.7 implies that when the relative change in resistance is at least $1$ then,

[TABLE]

This gives, $\left|\Delta_{e}\right|\geq\Omega_{p}(\alpha^{-1})$ . Using this bound on $\left|\Delta_{e}\right|$ is sufficient since $\rho>\Omega_{p}(\alpha^{-1})$ and both kinds of iterations are accounted for. The total relative change in energy can now be bounded.

[TABLE]

The Lemma follows by substituting $\alpha=\tilde{\Theta}_{p}\left(m^{-\frac{p^{2}-5p+2}{p(3p-2)}}\right)$ in the above equation. ∎

Lemma 6.9.

Throughout the course of a run of Algorithm 4, the number of edges added to $E_{changed}$ due to relative resistance increase between $2^{-\eta}$ and $2^{-\eta+1}$ ,

[TABLE]

Proof.

From Lemma 6.6, the total relative change in energy is at least,

[TABLE]

We know that $\frac{\bm{\mathit{r}}^{(i+1)}_{e}-\bm{\mathit{r}}^{(i)}_{e}}{\bm{\mathit{r}}^{(i)}_{e}}\geq 2^{-\eta}$ . Using Lemma 6.7, we have,

[TABLE]

We can bound $(1+\alpha\left|\Delta_{e}\right|)^{p-2}-1$ as,

[TABLE]

Now, in the second case, when $\alpha\left|\Delta_{e}\right|\geq 1$ and $p-2>1$ ,

[TABLE]

For both cases we get,

[TABLE]

Using the above bound and the fact that the total relative change in energy is at most $\widetilde{O}_{p}(1)$ , gives,

[TABLE]

The Lemma follows by substituting $\alpha=\tilde{\Theta}_{p}\left(m^{-\frac{p^{2}-5p+2}{p(3p-2)}}\right)$ in the above equation. ∎

We can now use the concavity of $f(z)=z^{\omega-2}$ to upper bound the contribution of these terms.

Corollary 6.10.

Let $k(\eta)^{(i)}$ be as defined. Over all iterations we have,

[TABLE]

and for every $\eta$ ,

[TABLE]

Proof.

Due to the concavity of the $\omega-2\approx 0.3727<1$ power, this total is maximized when it is equally distributed over all iterations. In the first sum, the number of terms is equal to the number of iterations, i.e., $\widetilde{O}_{p}(m^{\frac{p-2}{3p-2}})$ . In the second sum the number of terms is $\widetilde{O}_{p}(m^{\frac{p-2}{3p-2}})2^{-\eta}$ . Distributing the sum equally over the above numbers gives,

[TABLE]

and

[TABLE]

∎
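The concavity step in the proof above is an instance of Jensen's inequality: for a fixed total, a sum of concave powers is largest when the total is split evenly. A minimal numerical sketch follows; the function names are hypothetical, and the value of $\omega$ is taken from the approximation $\omega-2\approx 0.3727$ quoted in the proof.

```python
import random

OMEGA = 2.3727  # assumption: value implied by "omega - 2 ~ 0.3727" in the text


def power_sum(ks, omega=OMEGA):
    """Sum of k^(omega - 2) over a list of non-negative counts."""
    return sum(k ** (omega - 2.0) for k in ks)


def equal_split_bound(total, num_terms, omega=OMEGA):
    """Value of the same sum when `total` is spread evenly over `num_terms` terms."""
    return num_terms * (total / num_terms) ** (omega - 2.0)


def random_split(total, num_terms, rng):
    """A random non-negative split of `total` into `num_terms` parts."""
    weights = [rng.random() for _ in range(num_terms)]
    scale = total / sum(weights)
    return [w * scale for w in weights]
```

For any split `ks` of a fixed total, `power_sum(ks)` should not exceed `equal_split_bound(sum(ks), len(ks))`, which is the bound used for both sums in Corollary 6.10.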

Proof.

(of Theorem 6.1) By Lemma 6.5, the $\bm{\widehat{r}}$ to which the maintained inverse corresponds always satisfies $\bm{\widehat{r}}\approx_{\widetilde{O}(1)}\bm{\mathit{r}}^{(i)}$ . So by the iterative linear systems solver method outlined in Lemma 6.4, we can implement each call to Oracle (Section 5.2) in time $O((n+m)^{2})$ in addition to the cost of performing inverse maintenance. This leads to a total cost of

[TABLE]

across the $T=\Theta_{p}(m^{\frac{p-2}{3p-2}})$ iterations.

The cost of inverse maintenance is dominated by the calls to the low-rank update procedure outlined in Lemma 6.2. Its total cost is bounded by

[TABLE]

Because there are only $O(\log{m})$ values of $\eta$ , and each $k(\eta)^{(i)}$ is non-negative, we can bound the total cost by:

[TABLE]

where the inequality follows from substituting in the result of Corollary 6.10. Depending on the sign of $3\omega-7$ , this sum is dominated either at $\eta=0$ or $\eta=\log{T}$ . Including both terms then gives

[TABLE]

with the exponent on the trailing term simplifying to $\omega-2$ to give,

[TABLE]

∎

7 Other Regression Formulations

In this section we discuss how various variants of $\ell_{p}$ -norm regression can be translated into our setting of

[TABLE]

As we will address a multitude of problems, we will make generic numerical assumptions to simplify the derivations.

1. $m\leq{\textrm{poly}}(n)$ .

2. All entries in $\bm{\mathit{A}}$ and $\bm{\mathit{b}}$ are at most ${\textrm{poly}}(n)$ . Note that this implies that the maximum singular value of $\bm{\mathit{A}}$ , $\sigma_{\max}(\bm{\mathit{A}})$ and $\left\|\bm{\mathit{b}}\right\|_{2}$ are at most ${\textrm{poly}}(n)$ .

3. $\left\|\bm{\mathit{b}}\right\|_{2}\geq 1$ .

4. The minimum non-zero singular value of $\bm{\mathit{A}}$ , $\sigma_{\min}(\bm{\mathit{A}})$ is at least $1/{\textrm{poly}}(n)$ .

Note that these conditions also imply bounds on the optimum value and the optimum solution $\bm{\mathit{x}}^{*}$ : Specifically,

[TABLE]

and

[TABLE]

7.1 Affine transformations within the norm

Let $\bm{\mathit{C}}$ be a matrix with the same assumptions as $\bm{\mathit{A}}$ and $\bm{\mathit{d}}$ have assumptions similar to $\bm{\mathit{b}}$ . Suppose we are minimizing $\left\|\bm{\mathit{C}}\bm{\mathit{x}}-\bm{\mathit{d}}\right\|_{p}$ instead of $\left\|\bm{\mathit{x}}\right\|_{p}$ , i.e.,

[TABLE]

Note that this can be reduced to the following unconstrained problem,

[TABLE]

To see this, first find the null space of $\bm{\mathit{A}}$ , as well as a particular solution $\bm{\mathit{x}}_{0}$ that satisfies $\bm{\mathit{A}}\bm{\mathit{x}}_{0}=\bm{\mathit{b}}$ . Let the null space of $\bm{\mathit{A}}$ be generated by the matrix $\bm{\mathit{V}}$ . Then the space of solutions can be parameterized as

[TABLE]

for some vector $\bm{\mathit{y}}$ . Now our objective becomes,

[TABLE]

which can be written as,

[TABLE]

where $\bm{\widehat{\mathit{C}}}=\bm{\mathit{C}}\bm{\mathit{V}}$ and $\bm{\widehat{d}}=\bm{\mathit{C}}\bm{\mathit{x}}_{0}-\bm{\mathit{d}}$ . Observe that $\bm{\widehat{\mathit{C}}}\bm{\mathit{y}}$ spans the column space of $\bm{\widehat{\mathit{C}}}$ . Decomposing $\bm{\widehat{d}}$ into a linear combination of an orthonormal basis, we could combine the part which is in the span of $\bm{\widehat{\mathit{C}}}$ with $\bm{\widehat{\mathit{C}}}\bm{\mathit{y}}$ . We can thus replace $\bm{\widehat{\mathit{C}}}$ in the objective with an orthonormal basis $\bm{\mathit{U}}$ of its column space and replace $\bm{\widehat{d}}$ by $\bm{\mathit{g}}$ , a vector orthogonal to all columns of $\bm{\mathit{U}}$ . Then any vector

[TABLE]

can equivalently be described by the conditions

[TABLE]

For the last condition, it suffices to generate an orthonormal basis of the null space of $\bm{\mathit{U}}$ . So the problem can be written as a linear constraint on $\bm{\mathit{z}}$ instead.
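The first step of this reduction, parameterizing the feasible set as $\bm{\mathit{x}}_{0}+\bm{\mathit{V}}\bm{\mathit{y}}$ and rewriting the objective in terms of $\bm{\widehat{\mathit{C}}}$ and $\bm{\widehat{d}}$ , can be sketched with NumPy. The function name `reparameterize` and the rank tolerance are assumptions of this sketch; it uses dense SVD and least squares purely for illustration and makes no claim about how the paper's running-time bounds are achieved.

```python
import numpy as np


def reparameterize(A, b, C, d, tol=1e-10):
    """Return (x0, V, C_hat, d_hat) with A @ x0 = b, A @ V = 0,
    C_hat = C @ V and d_hat = C @ x0 - d, so that C x - d = C_hat y + d_hat."""
    # Particular solution of A x = b (minimum-norm least-squares solution).
    x0, *_ = np.linalg.lstsq(A, b, rcond=None)
    # Orthonormal basis of the null space of A from the right singular vectors.
    _, s, Vt = np.linalg.svd(A)
    rank = int(np.sum(s > tol * s.max()))
    V = Vt[rank:].T
    return x0, V, C @ V, C @ x0 - d
```

Every feasible $\bm{\mathit{x}}$ then corresponds to some $\bm{\mathit{y}}$ , and the residual $\bm{\mathit{C}}\bm{\mathit{x}}-\bm{\mathit{d}}$ equals $\bm{\widehat{\mathit{C}}}\bm{\mathit{y}}+\bm{\widehat{d}}$ , which is the unconstrained form used above.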

7.2 $1<p<2$

In case $1<p<2,$ we instead solve the dual problem:

[TABLE]

for $q=\frac{p}{p-1}>2$ . We can rescale the above problem to the equivalent $q$ -norm ball-constrained projection problem,

[TABLE]

where the goal is to check whether the optimum is less than $1$ . This problem is covered by the problem introduced in Section 7.1 and can thus be solved to high accuracy in the desired time.
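As a sanity check on the dualization, the following sketch verifies weak duality numerically. It assumes the dual has the standard form $\max_{\bm{\mathit{y}}}\bm{\mathit{b}}^{\top}\bm{\mathit{y}}$ subject to $\left\|\bm{\mathit{A}}^{\top}\bm{\mathit{y}}\right\|_{q}\leq 1$ ; all function names are hypothetical. By Hölder's inequality, $\bm{\mathit{b}}^{\top}\bm{\mathit{y}}=\bm{\mathit{x}}^{\top}\bm{\mathit{A}}^{\top}\bm{\mathit{y}}\leq\left\|\bm{\mathit{x}}\right\|_{p}\left\|\bm{\mathit{A}}^{\top}\bm{\mathit{y}}\right\|_{q}$ for any $\bm{\mathit{x}}$ with $\bm{\mathit{A}}\bm{\mathit{x}}=\bm{\mathit{b}}$ .

```python
def conjugate_exponent(p):
    """Dual exponent q with 1/p + 1/q = 1."""
    return p / (p - 1.0)


def p_norm(v, p):
    return sum(abs(t) ** p for t in v) ** (1.0 / p)


def matvec(A, x):
    """A @ x for a list-of-rows matrix A."""
    return [sum(a * t for a, t in zip(row, x)) for row in A]


def rmatvec(A, y):
    """A^T @ y for a list-of-rows matrix A."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]


def weak_duality_gap(A, x, y, p):
    """||x||_p minus the dual value of y rescaled so that ||A^T y||_q = 1.

    Here b is defined as A @ x, so x is primal feasible by construction."""
    q = conjugate_exponent(p)
    b = matvec(A, x)
    scale = p_norm(rmatvec(A, y), q)
    y_feasible = [t / scale for t in y]
    dual_value = sum(bi * yi for bi, yi in zip(b, y_feasible))
    return p_norm(x, p) - dual_value
```

The gap returned by `weak_duality_gap` should be non-negative for every choice of `x` and `y`, and for $1<p<2$ the conjugate exponent exceeds $2$ , which is what lets the $q>2$ machinery apply.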

It remains to transform a nearly-optimal solution $\bm{\mathit{y}}$ of this $q$ -norm ball-constrained projection problem to a nearly-optimal solution $\bm{\mathit{x}}$ of the original subspace $p$ -norm minimization problem. Since both of these problems’ solutions are invariant under scalings to $\bm{\mathit{A}}$ or $\bm{\mathit{b}}$ , we may also assume that the optimum is at most $1$ .

Lemma 7.1.

If the optimum of

[TABLE]

is at most $1$ , and we have some $\bm{\mathit{y}}$ such that

[TABLE]

then the gradient of $\left\|\bm{\mathit{A}}^{\top}\bm{\mathit{y}}\right\|_{q}$ ,

[TABLE]

satisfies

[TABLE]

Proof.

Let $\Delta=\nabla-\bm{\mathit{b}}$ and $p(n)$ be a polynomial such that $\gamma_{q}\left(\left|\bm{\mathit{A}}^{\top}\bm{\mathit{y}}\right|,\bm{\mathit{A}}^{\top}(\nabla-\bm{\mathit{b}})\right)\leq p(n)$ . By the assumption of $\bm{\mathit{A}}$ and $\bm{\mathit{b}}$ being ${\textrm{poly}}(n)$ bounded, the above $\gamma$ function is polynomially bounded. Let $q(n)$ be a polynomial in $n$ such that $q(n)\geq\sqrt{\frac{4p(n)}{\delta}}$ . Suppose,

[TABLE]

This gives us,

[TABLE]

Now consider the solution

[TABLE]

for step size $\theta=\varepsilon/2p(n)$ . Lemma 4.5 and Lemma 3.3 give

[TABLE]

We can scale the solution $\bm{\widehat{\mathit{y}}}$ up by a factor of $1/(1-\theta\Delta^{\top}\bm{\mathit{b}}-\theta\varepsilon/2)$ to get a solution with objective value

[TABLE]

But by the assumption of $1$ being the optimum, this cannot exceed $1$ , so we get

[TABLE]

or

[TABLE]

which combined with the choice of $\theta$ gives $\varepsilon<\delta q(n)$ which is a contradiction. So we must have $\left\|\nabla-\bm{\mathit{b}}\right\|_{2}\leq\delta q(n)\leq\delta{\textrm{poly}}(n)$ . ∎

This means once $\delta\leq 1/{\textrm{poly}}(n)$ , the solution created from the gradient

[TABLE]

satisfies

[TABLE]

Also, because $\sigma_{\min}(\bm{\mathit{A}})\geq 1/{\textrm{poly}}(n)$ , we can create a solution $\bm{\widetilde{\mathit{x}}}$ from $\bm{\widehat{x}}$ by doing a least squares projection on this difference. This gives:

[TABLE]

and

[TABLE]

Furthermore, note that because

[TABLE]

we have

[TABLE]

so

[TABLE]

Thus, for sufficiently small $\delta$ , we can get a high-accuracy answer to the $p$ -norm problem as well.

8 $p$ -Norm Optimization on Graphs

In this section we discuss the performance of our algorithms on graphs. Here instead of invoking general linear algebraic routines, we instead invoke Laplacian solvers, which provide $1/{\textrm{poly}}(n)$ accuracy solutions to Laplacian linear equations in nearly-linear ( $\widetilde{O}(m)$ ) time [ST14, KMP14, KMP11, Kel+13, Coh+14, PS14, Kyn+16, KS16], and the current best running time is $O(m\log^{\nicefrac{{1}}{{2}}}n\log\nicefrac{{1}}{{\varepsilon}})$ (up to ${{\textrm{polyloglog}}}\ n$ factors) [Coh+14].

Such matrices can be succinctly described as

[TABLE]

where $\bm{\mathit{r}}$ is the vector of resistances just as provided in the Oracle from Algorithm 3, but $\bm{\mathit{A}}$ is the edge-vertex incidence matrix: with each row corresponding to an edge, each column corresponding to a vertex, and entries given by:

[TABLE]

Throughout this entire section, we will use $\bm{\mathit{A}}$ to refer to the edge-vertex incidence matrix of a graph.
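The matrices described above can be built explicitly for a small graph. The sketch below assumes one common sign convention for the edge-vertex incidence matrix (the row of edge $(u,v)$ has $+1$ at $u$ and $-1$ at $v$ ); the opposite orientation yields the same Laplacian. The function names are hypothetical, and a real implementation would use sparse storage and a Laplacian solver rather than dense lists.

```python
def incidence_matrix(num_vertices, edges):
    """Edge-vertex incidence matrix: one row per edge, one column per vertex.

    Assumed convention: +1 at the first endpoint, -1 at the second."""
    A = [[0.0] * num_vertices for _ in edges]
    for row, (u, v) in enumerate(edges):
        A[row][u] = 1.0
        A[row][v] = -1.0
    return A


def laplacian(num_vertices, edges, r):
    """Dense Laplacian L = A^T Diag(r)^{-1} A for resistances r (one per edge)."""
    A = incidence_matrix(num_vertices, edges)
    L = [[0.0] * num_vertices for _ in range(num_vertices)]
    for e in range(len(edges)):
        for i in range(num_vertices):
            for j in range(num_vertices):
                L[i][j] += A[e][i] * A[e][j] / r[e]
    return L
```

For example, on the path with edges `(0, 1)` and `(1, 2)` and resistances `[1.0, 2.0]`, the result is symmetric, every row sums to zero, and the diagonal entry of the middle vertex is the sum of the conductances $1/1+1/2$ .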

The main difficulty of reducing to Laplacian solvers is that we can no longer manipulate general matrices. Specifically, instead of directly working with the normal matrices as in Section 7.1, we need to implicitly track the subspaces, and optimize quadratics on them. As a result, we need to tailor such reductions towards the specific problems.

8.1 $p$ -Norm Flows

This is closest to the general regression problem that we study:

[TABLE]

except with $\bm{\mathit{A}}$ as an edge-vertex incidence matrix.

When $p\geq 2$ , the residual problem then has an extra condition of

[TABLE]

which means we need to solve the problem of

[TABLE]

which becomes a solve in the system of linear equations

[TABLE]

This matrix is a rank $3$ perturbation to the graph Laplacian $\bm{\mathit{A}}^{\top}{\bf Diag}\left({\bm{\mathit{r}}}\right)^{-1}\bm{\mathit{A}}$ , and can thus be solved in $\widetilde{O}(m)$ time. A more detailed analysis of a generalization of this case can be found in Appendix B of [DS08].

When $1<p<2$ , we invoke the dualization from Section 7.2 to obtain

[TABLE]

and if we retain the form of $\bm{\mathit{A}}^{\top}\bm{\mathit{y}}$ , but transfer the gradient over to $\bm{\mathit{y}}$ , the problem that we get is:

[TABLE]

The two additional linear constraints can be removed by writing a variable of $\bm{\mathit{y}}$ as a linear combination of the rest (as well as $\alpha$ ). This then gives an unconstrained minimization problem on a subset of entries $S$ ,

[TABLE]

where $\bm{\mathit{L}}_{[S,S]}$ is a minor of the Laplacian above, and this solution is obtained by solving for

[TABLE]

8.2 Lipschitz Learning and Graph Labelling

This problem asks to label the vertices of a graph, with a set $T$ fixed to the vector $\bm{\mathit{s}}$ , while minimizing the $p$ -norm difference between neighbours. It can be written as

[TABLE]

where $\bm{\mathit{A}}$ is the edge-vertex incidence matrix.

In the case of $p\geq 2$ , the residue problem becomes

[TABLE]

Here the gradient condition can be handled in the same way as with the voltage problem above: by fixing one additional entry of $V\setminus T$ , and then solving an unconstrained quadratic minimization problem on the rest of the variables.

In the case of $1<p<2$ , we first write down the problem as an unconstrained minimization problem on $V\setminus T$ :

[TABLE]

Letting $\bm{\mathit{b}}=\bm{\mathit{A}}_{\mathrel{\mathop{\mathchar 58\relax}},T}\bm{\mathit{s}}$ and taking the dual gives:

[TABLE]

That is, solving for a small $q$ -norm flow that maximizes the cost against $\bm{\mathit{b}}$ , while also having $0$ residues at the vertices not in $T$ .

As $q>2$ , we can now invoke our main algorithm on $\bm{\mathit{y}}$ . Upon binary search, and taking residual problems, we get $\ell_{2}$ problems of the form

[TABLE]

which is solved by another low rank perturbation on a minor of the graph Laplacian.

Appendix A Missing Proofs

A.1 Proofs from Section 3

See 3.2

Proof.

1. We have $p\geq 2$ . When $\left|x\right|\leq t$ ,

[TABLE]

Otherwise,

[TABLE]

Let $s(x)=\frac{p}{2}t^{p-2}x^{2}$ . At $|x|=t$ we have $\gamma_{p}(t,t)=s(t)$ . Now, $\gamma_{p}^{\prime}(t,x)=p\left|x\right|^{p-2}x\geq pt^{p-2}x=s^{\prime}(x)$ . This means that for $x$ negative, $\gamma_{p}$ decreases faster than $s$ and for $x$ positive, $\gamma_{p}$ increases faster than $s$ . The two functions are equal in the range $-t\leq x\leq t$ . Therefore, $\gamma_{p}(t,x)\geq s(x)$ for all $x$ .

2. [TABLE]

3. Taking the derivative of $\gamma_{p}(t,x)$ with respect to $x$ gives,

[TABLE]

The statement clearly follows.

∎
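The comparison between $\gamma_{p}$ and the quadratic $s(x)=\frac{p}{2}t^{p-2}x^{2}$ in the proof above can be checked numerically. The sketch below assumes the piecewise form of $\gamma_{p}$ that the appendix proofs use: the quadratic $\frac{p}{2}t^{p-2}x^{2}$ for $|x|\leq t$ , and $|x|^{p}+\left(\frac{p}{2}-1\right)t^{p}$ otherwise. The function names are hypothetical.

```python
def gamma_p(p, t, x):
    """Smoothed p-norm term: quadratic for |x| <= t, shifted p-th power beyond."""
    if abs(x) <= t:
        return 0.5 * p * t ** (p - 2) * x * x
    return abs(x) ** p + (0.5 * p - 1.0) * t ** p


def quadratic_part(p, t, x):
    """The function s(x) = (p/2) t^(p-2) x^2 from the proof."""
    return 0.5 * p * t ** (p - 2) * x * x


def min_gap(p, t, xs):
    """Smallest value of gamma_p(t, x) - s(x) over the sample points xs."""
    return min(gamma_p(p, t, x) - quadratic_part(p, t, x) for x in xs)
```

For $p\geq 2$ the gap returned by `min_gap` should be non-negative, and the two branches of `gamma_p` agree at $|x|=t$ since $\frac{p}{2}t^{p}=t^{p}+\left(\frac{p}{2}-1\right)t^{p}$ .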

See 3.3

Proof.

[TABLE]

Now, when $\left|x\right|\geq t$ , we have the following. When $p\leq 2$ ,

[TABLE]

and when $p\geq 2$ ,

[TABLE]

The above computations imply that,

[TABLE]

Let $\lambda\geq 1$ and $x\geq 0$ . Integrating both sides of the right inequality gives,

[TABLE]

Integrating both sides of the left inequality from $x$ to $\lambda x$ gives the required left inequality. Now, let $\lambda\leq 1$ . Integrating both sides of the left inequality gives,

[TABLE]

Similar to the previous case, integrating both sides of the right inequality from $\lambda x$ to $x$ gives the required left inequality. When $x\leq 0$ , the direction of the inequality changes but it gets reversed again after putting limits, since we integrate from $\lambda x$ to $x$ when $\lambda\geq 1$ and $x$ to $\lambda x$ when $\lambda\leq 1$ . We thus have,

[TABLE]

∎

See 3.4

Proof.

Since $\gamma_{p}(t,x)=\gamma_{p}(t,\left|x\right|),$ and $\gamma_{p}(t,x)$ is increasing in $x,$ it suffices to prove the claim for $x,\Delta\geq 0.$ We have,

[TABLE]

Integrating over $z\in[0,\Delta],$ we get,

[TABLE]

∎

A.2 Proofs from Section 4

See 4.5

Proof.

We first show the following two lemmas.

Lemma A.1.

For $|\alpha|\leq 1$ and $p\geq 1$ ,

[TABLE]

Proof.

Let us first show the left inequality, i.e. $1+\alpha p+\frac{p-1}{4}\alpha^{2}\leq(1+\alpha)^{p}$ . Define the following function,

[TABLE]

When $\alpha=1,-1$ , $h(\alpha)\geq 0$ . The derivative of $h$ with respect to $\alpha$ is, $h^{\prime}(\alpha)=p(1+\alpha)^{p-1}-p-\frac{(p-1)}{2}\alpha$ .

When $p\geq 2$ and $-1<\alpha<1$ ,

[TABLE]

For the last inequality, note that when the product is positive, either both terms are positive or both terms are negative. When both terms are positive, subtracting $(p-1)/2$ instead of $p$ gives a larger positive quantity. When both terms are negative then subtracting $(p-1)/2$ instead of $p$ gives only a smaller quantity, so the inequality holds. This shows that $h^{\prime}(\alpha)sign(\alpha)\geq 0$ , which means the minimum of $h$ is at $h(0)=0$ . Next let us see what happens when $p\leq 2$ and $\left|\alpha\right|<1$ .

[TABLE]

This implies that $h^{\prime}(\alpha)$ is an increasing function of $\alpha$ and $\alpha_{0}$ for which $h^{\prime}(\alpha_{0})=0$ is where $h$ attains its minimum value. The only point where $h^{\prime}$ is 0 is $\alpha_{0}=0$ . This implies $h(\alpha)\geq h(0)=0$ . This concludes the proof of the left inequality. For the right inequality, define:

[TABLE]

Note that $s(0)=0$ and $s(1),s(-1)\geq 0$ . We have,

[TABLE]

Using the mean value theorem for $p\geq 2$ and $\alpha<0$ ,

[TABLE]

This implies that $s^{\prime}(\alpha)\leq 0$ for negative alpha. When $1>\alpha>0$ , using the convexity of $f(x)=(1+x)^{p-1}$ for $p>2$ , we get,

[TABLE]

which gives us

[TABLE]

This implies $s^{\prime}(\alpha)\geq 0$ for positive $\alpha$ . The function $s$ is thus increasing for positive $\alpha$ and decreasing for negative $\alpha$ , so it attains the minimum at $\alpha=0$ , which is $s(0)=0$ , giving us $s(\alpha)\geq 0$ . We now look at the case $p\leq 2$ . We have

[TABLE]

Using this, we get $s^{\prime}(\alpha)sign(\alpha)\geq p|\alpha|(2^{p}-1)\geq 0$ , which says $s^{\prime}(\alpha)$ is positive for $\alpha$ positive and negative for $\alpha$ negative. Thus the minimum of $s$ is at $0$ , which is $s(0)=0$ . So $s(\alpha)\geq 0$ in this range too.

∎

Lemma A.2.

For $\beta\geq 1$ and $p\geq 1$ , $(\beta-1)^{p-1}+1\geq\frac{1}{2^{p}}\beta^{p-1}$ .

Proof.

$(\beta-1)\geq\frac{\beta}{2}$ for $\beta\geq 2$ . So the claim clearly holds for $\beta\geq 2$ since $(\beta-1)^{p-1}\geq\left(\frac{\beta}{2}\right)^{p-1}$ . When $1\leq\beta\leq 2$ , $1\geq\frac{\beta}{2}$ , so the claim holds since, $1\geq\left(\frac{\beta}{2}\right)^{p-1}$ ∎
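Both scalar inequalities are easy to sanity-check on a grid. The sketch below evaluates the left inequality of Lemma A.1 as stated in its proof, $1+\alpha p+\frac{p-1}{4}\alpha^{2}\leq(1+\alpha)^{p}$ for $|\alpha|\leq 1$ , and the inequality of Lemma A.2. The function names, the grid, and the sampled values of $p$ are arbitrary choices made here; a grid check is of course no substitute for the proofs.

```python
def lemma_a1_left_gap(p, alpha):
    """(1 + alpha)^p minus the quadratic lower bound; expected to be >= 0."""
    return (1.0 + alpha) ** p - (1.0 + alpha * p + 0.25 * (p - 1.0) * alpha * alpha)


def lemma_a2_gap(p, beta):
    """(beta - 1)^(p-1) + 1 minus beta^(p-1) / 2^p; expected to be >= 0."""
    return (beta - 1.0) ** (p - 1.0) + 1.0 - beta ** (p - 1.0) / 2.0 ** p


def min_gaps(ps=(1.0, 1.2, 1.5, 2.0, 3.0, 5.0), steps=200):
    """Smallest gaps over alpha in [-1, 1] and beta in [1, 10] for the sampled p."""
    alphas = [-1.0 + 2.0 * k / steps for k in range(steps + 1)]
    betas = [1.0 + 9.0 * k / steps for k in range(steps + 1)]
    gap1 = min(lemma_a1_left_gap(p, a) for p in ps for a in alphas)
    gap2 = min(lemma_a2_gap(p, b) for p in ps for b in betas)
    return gap1, gap2
```

The first gap is tight at $\alpha=0$ (and identically zero when $p=1$ ), so only a small floating-point tolerance separates it from zero, whereas the second gap stays strictly positive on the whole grid.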

We now prove the theorem.

Let $\Delta=\alpha x$ . The term $g\Delta=p|x|^{p-1}sign(x)\cdot\alpha x=\alpha p|x|^{p-1}|x|=\alpha p|x|^{p}$ . Let us first look at the case when $|\alpha|\leq 1$ . We want to show,

[TABLE]

This follows from Lemma A.1 and the facts $\frac{cp}{2}\leq\frac{p-1}{4}$ and $\frac{Cp}{2}\geq p2^{p-1}$ . We next look at the case when $|\alpha|\geq 1$ . Now, $\gamma_{|f|}^{p}(\Delta)=|\Delta|^{p}+(\frac{p}{2}-1)|f|^{p}$ . We need to show

[TABLE]

When $|x|=0$ it is trivially true. When $|x|\neq 0$ , let

[TABLE]

Now, taking the derivative with respect to $\alpha$ we get,

[TABLE]

When $\alpha\geq 1$ and $p\geq 2$ ,

[TABLE]

So we have $h^{\prime}(\alpha)\geq h^{\prime}(1)\geq 0$ . When $p<2$ , we use the mean value theorem to get,

[TABLE]

which implies $h^{\prime}(\alpha)\geq 0$ in this range as well. When $\alpha\leq-1$ it follows from Lemma A.2 that $h^{\prime}(\alpha)\leq 0$ . So the function $h$ is increasing for $\alpha\geq 1$ and decreasing for $\alpha\leq-1$ . The minimum value of $h$ is $\min\{h(1),h(-1)\}\geq 0$ . It follows that $h(\alpha)\geq 0$ which gives us the left inequality. The other side requires proving,

[TABLE]

Define:

[TABLE]

The derivative $s^{\prime}(\alpha)=p+\left(p2^{p}|\alpha|^{p-1}-p|1+\alpha|^{p-1}\right)sign(\alpha)$ is non negative for $\alpha\geq 1$ and non positive for $\alpha\leq-1$ . The minimum value taken by $s$ is $\min\{s(1),s(-1)\}$ which is non negative. This gives us the right inequality.

∎

See 4.8

Proof.

Let $\bm{\mathit{x}}^{\star}$ give the OPT. We know that, for any $\bm{\mathit{x}}$ ,

[TABLE]

This along with the fact $\|\bm{\mathit{x}}^{(0)}\|_{2}\leq\|\bm{\mathit{x}}^{\star}\|_{2}$ gives us,

[TABLE]

∎

A.3 Proofs from Section 5

See 5.3

Proof.

Let $\bm{\mathit{x}}^{\star}$ denote the optimum solution of (*) and $\bm{\mathit{x}}^{(0)}$ be as defined in Definition 4.7. We know that for any $\bm{\mathit{x}}$ ,

[TABLE]

This along with the fact $\|\bm{\mathit{x}}^{(0)}\|_{2}\leq\|\bm{\mathit{x}}^{\star}\|_{2}$ gives us,

[TABLE]

Now from Lemma 4.6 we have,

[TABLE]

Let us assume $\mathit{\alpha}(\Delta)\geq\varepsilon\|\bm{\mathit{x}}^{\star}\|_{p}^{p}\geq\varepsilon m^{\nicefrac{{-(p-2)}}{{2}}}\|\bm{\mathit{x}}^{(0)}\|_{p}^{p}$ . If this is not true we already have an $\varepsilon$ approximate solution to our problem. We thus have the following bound on $\mathit{\alpha}$ ,

[TABLE]

This gives us that,

[TABLE]

When $p\leq 2$ , following a similar proof and using,

[TABLE]

we get,

[TABLE]

thus concluding the proof of the lemma. ∎

See 5.4

Proof.

Assume that the optimum solution to (1), $\Delta^{\star}$ satisfies

[TABLE]

in addition to $\bm{\mathit{A}}\Delta^{\star}=0.$ Note that we know that the objective is strictly positive (as 0 is a feasible solution). Since $\gamma_{p}\geq 0,$ we must have,

[TABLE]

Consider scaling $\Delta^{\star}$ by a factor $\lambda>0.$ Since $\Delta^{\star}$ is optimal, we must have

[TABLE]

Now, from Lemma 3.3, we know that

[TABLE]

Thus, we get,

[TABLE]

Thus, $\gamma_{p}(\bm{\mathit{t}},\Delta^{\star})\leq\frac{1}{\frac{p-1}{p2^{p}}\left(\min\{2,p\}-1\right)}2^{i},$ and hence $\bm{\mathit{g}}^{\top}\Delta^{\star}\leq\frac{\max\{2,p\}}{\min\{2,p\}-1}2^{i}=\max\left\{p,\nicefrac{{2}}{{p-1}}\right\}2^{i}$ .

Now consider the vector $\Delta=\lambda\Delta^{\star},$ where $\lambda=\frac{2^{i-1}}{\bm{\mathit{g}}^{\top}\Delta^{\star}}.$ Note that $\lambda\in\left[\min\left\{\nicefrac{{1}}{{2p}},\nicefrac{{(p-1)}}{{4}}\right\},1\right].$ We have

[TABLE]

Thus, $\Delta$ is a feasible solution to Program (3). A $\beta$ -approximate solution $\Delta(i)$ must be such that,

[TABLE]

Now, we consider $\Delta=\mu\Delta(i)$ for some $\mu\leq 1.$ We have, $\bm{\mathit{A}}\Delta=0,$ and,

[TABLE]

We can pick,

[TABLE]

In either case, we get,

[TABLE]

Since we assumed that the optimum of Program (1) is at most $2^{i},$ this implies that $\mu\Delta(i)$ achieves an objective value for Program (1) that is within an $\Omega_{p}\left(\beta^{-\frac{1}{\min\{p,2\}-1}}\right)$ fraction of the optimal. ∎

See 5.5

Proof.

We choose $i$ such that (3) is feasible, i.e., there exists $\Delta$ such that,

[TABLE]

Scaling both $\bm{\mathit{t}}$ and $\Delta$ to $\tilde{\bm{\mathit{t}}}=\left(\frac{p-1}{p}\right)^{1/p}2^{-1-i/p}\bm{\mathit{t}}$ and $\tilde{\Delta}=\left(\frac{p-1}{p}\right)^{1/p}2^{-1-i/p}\Delta$ gives us the following.

[TABLE]

Now, let $\bm{\mathit{t}}^{\prime}_{j}=\max\{m^{-1/p},\tilde{\bm{\mathit{t}}}_{j}\}$ for each $j$ . We claim that when $p\geq 2$ , $\gamma_{p}\left(\bm{\mathit{t}}^{\prime},\tilde{\Delta}\right)-\gamma_{p}\left(\tilde{\bm{\mathit{t}}},\tilde{\Delta}\right)\leq\frac{p}{2}-1$ . To see this, for a single $j$ , let us look at the difference $\gamma_{p}\left(\bm{\mathit{t}}^{\prime}_{j},\tilde{\Delta}_{j}\right)-\gamma_{p}\left(\tilde{\bm{\mathit{t}}}_{j},\tilde{\Delta}_{j}\right)$ . If $\tilde{\bm{\mathit{t}}}_{j}\geq m^{-1/p}$ , then $\bm{\mathit{t}}^{\prime}_{j}=\tilde{\bm{\mathit{t}}}_{j}$ and the difference is $0$ . Otherwise, from the proof of Lemma 5 of [Bub+18],

[TABLE]

When $p\leq 2$ , we claim that $\gamma_{p}\left(\tilde{\bm{\mathit{t}}},\tilde{\Delta}\right)-\gamma_{p}\left(\bm{\mathit{t}}^{\prime},\tilde{\Delta}\right)\leq 1-\frac{p}{2}$ . Again if $\tilde{\bm{\mathit{t}}}_{j}\geq m^{-1/p}$ the difference is $0$ . Otherwise,

[TABLE]

To see the last inequality, when $\left|\Delta_{j}\right|\leq\bm{\mathit{t}}^{\prime}_{j}$ , we require, $\left|\Delta_{j}\right|^{p}-\frac{p}{2}\bm{\mathit{t}}^{p-2}_{j}\Delta_{j}^{2}\leq\left(1-\frac{p}{2}\right)\bm{\mathit{t}}_{j}^{p}$ which is true. When $\left|\Delta_{j}\right|\geq\bm{\mathit{t}}_{j}$ , it directly follows. Summing over all $j$ gives us our claims. We know that $\gamma_{p}\left(\tilde{\bm{\mathit{t}}},\tilde{\Delta}\right)\leq 1$ . Thus, $\gamma_{p}\left(\bm{\mathit{t}}^{\prime},\tilde{\Delta}\right)\leq\frac{p}{2}$ . Next we set $\hat{\Delta}=\left(\frac{2}{p}\right)^{1/2}\tilde{\Delta}$ . Note that $\max\{\left(\frac{2}{p}\right)^{2},\left(\frac{2}{p}\right)^{p}\}=\left(\frac{2}{p}\right)^{2}$ for all $p$ . Lemma 3.3 thus implies,

[TABLE]

Define $\hat{t}_{j}=\min\{1,\bm{\mathit{t}}^{\prime}_{j}\}$ . Note that $\gamma_{p}\left(\hat{\bm{\mathit{t}}},\hat{\Delta}\right)=\gamma_{p}\left(\bm{\mathit{t}}^{\prime},\hat{\Delta}\right)$ since $\gamma_{p}\left(\bm{\mathit{t}}^{\prime},\hat{\Delta}\right)\leq 1$ and as a result we have $\gamma_{p}\left(\hat{\bm{\mathit{t}}},\hat{\Delta}\right)\leq 1$ . Observe that $\hat{\Delta}$ is a feasible solution of (2), which shows that for problem (2) $\textsc{OPT}\leq 1$ . Let $\Delta^{\star}$ be a $\kappa$ -approximate solution to (2), i.e.,

[TABLE]

When $p\geq 2$ , $\gamma_{p}$ is an increasing function of $\bm{\mathit{t}}$ giving us,

[TABLE]

When $p\leq 2$ ,

[TABLE]

This gives,

[TABLE]

and Lemma 3.3 then implies,

[TABLE]

Finally, $\Delta=\left(\frac{p}{2}\right)^{1/2}\left(\frac{p}{p-1}\right)^{1/p}2^{1+i/p}\Delta^{\star}$ satisfies the constraints of (3) and is an $\Omega_{p}(\kappa)$ approximate solution. ∎

See 5.6

Proof.

Using Hölder’s inequality, we have,

[TABLE]

∎

A.4 Proofs from Section 6

See 6.7

Proof.

Recall from the setting of resistances from Line 2 of Oracle (Algorithm 3) that

[TABLE]

By Line 13 of Algorithm 4, we have

[TABLE]

Substituting this in gives

[TABLE]

There are two cases to consider:

1. $\bm{\mathit{w}}_{e}^{(i)}\geq m^{1/p}\bm{\mathit{t}}_{e}$ .

[TABLE]

where the last inequality utilizes $\bm{\mathit{w}}_{e}^{(i)}\geq 1$ , which is due to the assumption and $m^{1/p}\bm{\mathit{t}}_{e}\geq 1$ .

2. $\bm{\mathit{w}}_{e}^{(i)}\leq m^{1/p}\bm{\mathit{t}}_{e}$ , then replacing the denominator with the $(m^{1/p}\bm{\mathit{t}}_{e})^{p-2}$ term and simplifying gives

[TABLE]

As the function $(z+\theta)^{p-2}-z^{p-2}$ is monotonically increasing when $\theta,p-2\geq 0$ , we may replace the $\frac{\bm{\mathit{w}}_{e}^{\left(i\right)}}{m^{1/p}\bm{\mathit{t}}_{e}}$ by its upper bound of $1$ (given by the assumption) to get

[TABLE]

where the last inequality follows from $m^{1/p}\bm{\mathit{t}}_{e}\geq 1$ .

∎

Appendix B Controlling $\Phi$

See 5.10

Proof.

We prove this claim by induction. Initially, $i=k=0,$ and $\Phi(0,0)=0,$ and thus, the claim holds trivially. Assume that the claim holds for some $i,k\geq 0.$ We will use $\Phi$ as an abbreviated notation for $\Phi(i,k)$ below.

Flow Step.

For brevity, we let $\gamma_{p}(\bm{\mathit{w}})$ denote $\gamma_{p}(m^{\nicefrac{{1}}{{p}}}\bm{\mathit{t}},\bm{\mathit{w}}),$ and use $\bm{\mathit{w}}$ to denote $\bm{\mathit{w}}^{(i,k)}$ .

If the next step is a flow step,

[TABLE]

From the inductive assumption, we have

[TABLE]

Thus,

[TABLE]

proving the inductive claim.

Width Reduction Step.

To analyze a width-reduction step, we first combine Lemma 5.7 with the induction hypothesis, which ensures $\left\|\bm{\mathit{w}}^{(i,k)}\right\|_{p}^{p}\leq\Phi\leq O_{p}(1)m$ . Hence $\sum_{e}\bm{\mathit{r}}_{e}f_{e}^{2}\leq O_{p}(1)m^{(p-2)/p}$ , so we have

[TABLE]

Thus, when the next step is a width-reduction step, we have,

[TABLE]

Thus,

[TABLE]

proving the inductive claim.

∎

Appendix C Solving L2 problems

Lemma C.1.

Given an algorithm Solver for solving $\bm{\mathit{B}}^{\top}\bm{\mathit{R}}^{-1}\bm{\mathit{B}}\bm{\mathit{x}}=\bm{\mathit{d}},$ for a fixed $m\times n$ matrix $\bm{\mathit{B}},$ a fixed positive diagonal matrix $\bm{\mathit{R}}>0$ and an arbitrary vector $\bm{\mathit{d}},$ there is an algorithm EnhancedSolver that can solve

[TABLE]

with one call to Solver, two multiplications of $\bm{\mathit{B}}$ with a vector, and an additional $O(m+n)$ time, if we assume

[TABLE]

Proof.

Introducing the Lagrangian multipliers $\bm{\mathit{v}},a$ respectively for the constraint $\bm{\mathit{B}}^{\top}\bm{\mathit{f}}=0,$ and $\bm{\mathit{g}}^{\top}\bm{\mathit{f}}=z,$ we can write the Lagrangian as

[TABLE]

Now, optimizing the Lagrangian with respect to an unconstrained $\bm{\mathit{f}}$ , allows us to write

[TABLE]

Plugging this back, we can simplify our Lagrangian as

[TABLE]

Optimizing with respect to $a,$ gives us,

[TABLE]

Plugging this back, gives the Lagrangian as

[TABLE]

We let $\widetilde{\bm{\mathit{g}}}$ denote the vector $\bm{\mathit{B}}^{\top}\bm{\mathit{R}}^{-1}\bm{\mathit{g}}$ and $\bm{\mathit{M}}$ denote the matrix $\bm{\mathit{B}}^{\top}\bm{\mathit{R}}^{-1}\bm{\mathit{B}}.$ Thus, the Lagrangian can be written as,

[TABLE]

This implies that the optimal $\bm{\mathit{v}}$ is given by the equation

[TABLE]

From the condition assumed on $\bm{\mathit{g}},$ we have $\widetilde{\bm{\mathit{g}}}^{\top}\bm{\mathit{M}}^{-1}\widetilde{\bm{\mathit{g}}}<\bm{\mathit{g}}^{\top}\bm{\mathit{R}}^{-1}\bm{\mathit{g}}.$ Thus, we can solve this system using the Sherman-Morrison formula as follows,

[TABLE]

The algorithm EnhancedSolver computes $\widetilde{\bm{\mathit{g}}},$ and then invokes Solver to compute $\bm{\mathit{M}}^{-1}\widetilde{\bm{\mathit{g}}}.$ This allows us to compute $\bm{\mathit{v}}$ in an additional $O(m+n)$ time. Finally, we can compute $\bm{\mathit{f}}=\bm{\mathit{R}}^{-1}(\bm{\mathit{B}}\bm{\mathit{v}}+a\bm{\mathit{g}})$ using another multiplication with $\bm{\mathit{B}}$ and an additional $O(m+n)$ time. ∎
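The reason a single call to Solver suffices can be illustrated with a generic Sherman-Morrison computation. The sketch below assumes a system of the form $\left(\bm{\mathit{M}}-\frac{1}{c}\widetilde{\bm{\mathit{g}}}\widetilde{\bm{\mathit{g}}}^{\top}\right)\bm{\mathit{v}}=\beta\widetilde{\bm{\mathit{g}}}$ with $c=\bm{\mathit{g}}^{\top}\bm{\mathit{R}}^{-1}\bm{\mathit{g}}$ ; the scalar $\beta$ , the function name, and the callable `solve_M` are hypothetical stand-ins, and the exact constants of the lemma are not reproduced. Because the right-hand side is parallel to $\widetilde{\bm{\mathit{g}}}$ , only $\bm{\mathit{M}}^{-1}\widetilde{\bm{\mathit{g}}}$ is needed.

```python
import numpy as np


def rank_one_downdate_solve(solve_M, g_tilde, c, beta):
    """Solve (M - g_tilde g_tilde^T / c) v = beta * g_tilde with one solve in M.

    solve_M(rhs) must return M^{-1} rhs; requires g_tilde^T M^{-1} g_tilde < c."""
    w = solve_M(g_tilde)                     # the single call to the base solver
    denom = 1.0 - float(g_tilde @ w) / c     # Sherman-Morrison denominator
    if denom <= 0.0:
        raise ValueError("need g_tilde^T M^{-1} g_tilde < c")
    return (beta / denom) * w
```

Substituting $\bm{\mathit{v}}=s\bm{\mathit{M}}^{-1}\widetilde{\bm{\mathit{g}}}$ into the system gives $s\left(1-\widetilde{\bm{\mathit{g}}}^{\top}\bm{\mathit{M}}^{-1}\widetilde{\bm{\mathit{g}}}/c\right)=\beta$ , which is where the strict inequality assumed on $\bm{\mathit{g}}$ is used.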

Appendix D General $\ell_{2}$ Resistance Monotonicity

See 5.11

Proof.

Recall

[TABLE]

Letting $\bm{\mathit{R}}$ denote the diagonal matrix with $\bm{\mathit{r}}$ on its diagonal, we can write the above as

[TABLE]

Using Lagrangian duality, and noting that strong duality holds, we can write this as

[TABLE]

The minimizing $\Delta$ can be found by setting the gradient w.r.t. this variable to zero. This gives $2\bm{\mathit{R}}\Delta-2(\bm{\mathit{A}}^{\prime})^{\top}\bm{\mathit{y}}=0$ , so that $\Delta=\bm{\mathit{R}}^{-1}(\bm{\mathit{A}}^{\prime})^{\top}\bm{\mathit{y}}$ . Plugging in this choice of $\Delta$ , we arrive at the dual program

[TABLE]
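For reference, the elimination of $\Delta$ can be written out as follows. This is a reconstruction from the surrounding text, assuming that the primal program (9) is $\min_{\bm{\mathit{A}}^{\prime}\Delta=\bm{\mathit{c}}}\Delta^{\top}\bm{\mathit{R}}\Delta$ and that the multiplier $\bm{\mathit{y}}$ enters with a factor of $2$; the normalization in the original displays may differ.

$$
L(\Delta,\bm{\mathit{y}})=\Delta^{\top}\bm{\mathit{R}}\Delta+2\bm{\mathit{y}}^{\top}(\bm{\mathit{c}}-\bm{\mathit{A}}^{\prime}\Delta),
\qquad
\nabla_{\Delta}L=2\bm{\mathit{R}}\Delta-2(\bm{\mathit{A}}^{\prime})^{\top}\bm{\mathit{y}}=0
\;\Longrightarrow\;
\Delta=\bm{\mathit{R}}^{-1}(\bm{\mathit{A}}^{\prime})^{\top}\bm{\mathit{y}},
$$

$$
\min_{\Delta}L(\Delta,\bm{\mathit{y}})
=2\bm{\mathit{c}}^{\top}\bm{\mathit{y}}-\bm{\mathit{y}}^{\top}\bm{\mathit{A}}^{\prime}\bm{\mathit{R}}^{-1}(\bm{\mathit{A}}^{\prime})^{\top}\bm{\mathit{y}}.
$$

Under these assumptions the dual program maximizes the last expression over $\bm{\mathit{y}}$.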

Crucially, strong duality also implies that if $\Delta^{*}$ is an optimal solution of the primal program (9), and $\bm{\mathit{y}}^{*}$ is an optimal solution to the dual then

[TABLE]

is optimized at $\Delta=\Delta^{*}$ . This in turn implies that the gradient with respect to $\Delta$ at $\Delta=\Delta^{*}$ is zero, so that $\Delta^{*}=\bm{\mathit{R}}^{-1}(\bm{\mathit{A}}^{\prime})^{\top}\bm{\mathit{y}}^{*}$ . Let $\boldsymbol{\mathit{a}}_{e}$ be the $e$-th column of $\bm{\mathit{A}}^{\prime}$ (equivalently, the $e$-th row of $(\bm{\mathit{A}}^{\prime})^{\top}$). Then the previous equation tells us that $\Delta^{*}_{e}=\frac{1}{\bm{\mathit{r}}_{e}}\boldsymbol{\mathit{a}}_{e}^{\top}\bm{\mathit{y}}^{*}$ . This implies that

[TABLE]
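The relation between the primal and dual optima can be checked numerically on a small instance. The sketch below assumes the primal program is $\min\sum_{e}\bm{\mathit{r}}_{e}\Delta_{e}^{2}$ subject to $\bm{\mathit{A}}^{\prime}\Delta=\bm{\mathit{c}}$ with exactly two constraints, so that the dual system is $2\times 2$; the function name `energy_solution` and the instance are hypothetical.

```python
def energy_solution(A, c, r):
    """Return (Delta*, y*) for min sum_e r_e Delta_e^2 subject to A Delta = c.

    A has exactly two rows (two constraints), so the dual system
    (A R^{-1} A^T) y = c is 2x2 and is solved by Cramer's rule.
    """
    m = len(r)
    # Gram matrix G = A R^{-1} A^T.
    G = [[sum(A[i][e] * A[j][e] / r[e] for e in range(m)) for j in range(2)]
         for i in range(2)]
    det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
    y = [(c[0] * G[1][1] - G[0][1] * c[1]) / det,
         (G[0][0] * c[1] - G[1][0] * c[0]) / det]
    # Delta*_e = (1 / r_e) * a_e^T y*, where a_e is the e-th column of A.
    delta = [(A[0][e] * y[0] + A[1][e] * y[1]) / r[e] for e in range(m)]
    return delta, y


A = [[1.0, 1.0, 0.0],
     [0.0, 1.0, 1.0]]
c = [1.0, 0.0]
r = [1.0, 2.0, 4.0]

delta, y = energy_solution(A, c, r)

# Three quantities that should coincide at the optimum.
primal_energy = sum(r[e] * delta[e] ** 2 for e in range(3))
dual_energy = sum((A[0][e] * y[0] + A[1][e] * y[1]) ** 2 / r[e]
                  for e in range(3))
c_dot_y = c[0] * y[0] + c[1] * y[1]

# Feasibility of Delta*: A Delta* - c should vanish.
constraint_gap = [sum(A[i][e] * delta[e] for e in range(3)) - c[i]
                  for i in range(2)]
```

On this instance `primal_energy`, `dual_energy` and `c_dot_y` agree up to rounding, which matches the identities used in the remainder of the proof.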

Consider another program, essentially the same as (10), but with an additional scalar-valued variable $\theta\in\mathbb{R}$ introduced.

[TABLE]

The two programs (10) and (12) have the same value, since for any $\bm{\mathit{y}}$ , the assignment $(\bm{\mathit{z}},\theta)=(\bm{\mathit{y}},1)$ ensures both objectives take the same value, and conversely for any $(\bm{\mathit{z}},\theta)$ , the assignment $\bm{\mathit{y}}=\theta\bm{\mathit{z}}$ ensures both programs take the same value.

We see that $\bm{\mathit{z}}=\bm{\mathit{y}}^{*},$ and $\theta=1$ is an optimal solution to (12). Hence

[TABLE]

Consequently, $\bm{\mathit{c}}^{\top}\bm{\mathit{y}}^{*}=(\bm{\mathit{y}}^{*})^{\top}\bm{\mathit{A}}^{\prime}\bm{\mathit{R}}^{-1}(\bm{\mathit{A}}^{\prime})^{\top}\bm{\mathit{y}}^{*}$ . Hence $\bm{\mathit{c}}^{\top}\bm{\mathit{y}}^{*}=\Psi\left(\bm{\mathit{r}}\right)$ . Again, by a scaling argument, this implies that

[TABLE]

So that

[TABLE]
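The scaling argument can be spelled out as follows. This is a sketch reconstructed from the surrounding text rather than the paper's own display. It writes $\bm{\mathit{G}}=\bm{\mathit{A}}^{\prime}\bm{\mathit{R}}^{-1}(\bm{\mathit{A}}^{\prime})^{\top}$ as shorthand (a symbol introduced here), and it assumes that program (12) is $\max_{\bm{\mathit{z}},\theta}2\theta\bm{\mathit{c}}^{\top}\bm{\mathit{z}}-\theta^{2}\bm{\mathit{z}}^{\top}\bm{\mathit{G}}\bm{\mathit{z}}$ and that $\Psi(\bm{\mathit{r}})$ is finite and positive. The objective depends on $(\bm{\mathit{z}},\theta)$ only through the product $\theta\bm{\mathit{z}}$, and pairs with $\bm{\mathit{c}}^{\top}\bm{\mathit{z}}=0$ cannot attain a positive value, so we may rescale $\bm{\mathit{z}}$ to satisfy $\bm{\mathit{c}}^{\top}\bm{\mathit{z}}=1$ and then optimize over $\theta$ alone:

$$
\Psi(\bm{\mathit{r}})
=\max_{\bm{\mathit{c}}^{\top}\bm{\mathit{z}}=1}\;\max_{\theta}\left(2\theta-\theta^{2}\bm{\mathit{z}}^{\top}\bm{\mathit{G}}\bm{\mathit{z}}\right)
=\max_{\bm{\mathit{c}}^{\top}\bm{\mathit{z}}=1}\frac{1}{\bm{\mathit{z}}^{\top}\bm{\mathit{G}}\bm{\mathit{z}}},
\qquad\text{so}\qquad
\frac{1}{\Psi(\bm{\mathit{r}})}=\min_{\bm{\mathit{c}}^{\top}\bm{\mathit{z}}=1}\bm{\mathit{z}}^{\top}\bm{\mathit{G}}\bm{\mathit{z}}.
$$

The inner maximum is attained at $\theta=1/(\bm{\mathit{z}}^{\top}\bm{\mathit{G}}\bm{\mathit{z}})$.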

Note that one optimal assignment for the program (13) is $\bm{\widetilde{\mathit{y}}}=\frac{\bm{\mathit{y}}^{*}}{\Psi\left(\bm{\mathit{r}}\right)}$ . Also observe that if we consider the program (13) with $\bm{\mathit{r}}^{\prime}$ instead of $\bm{\mathit{r}}$ as the resistances, then $\bm{\widetilde{\mathit{y}}}=\frac{\bm{\mathit{y}}^{*}}{\Psi\left(\bm{\mathit{r}}\right)}$ is still a feasible solution. Hence, using the observation (11), we get

[TABLE]

Factoring out the $\Psi(\bm{\mathit{r}})$ term gives

[TABLE]
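The two steps above can be written out explicitly. This is a hedged reconstruction, assuming that program (13) is $\min_{\bm{\mathit{c}}^{\top}\bm{\mathit{y}}=1}\bm{\mathit{y}}^{\top}\bm{\mathit{A}}^{\prime}\bm{\mathit{R}}^{-1}(\bm{\mathit{A}}^{\prime})^{\top}\bm{\mathit{y}}$ with value $1/\Psi(\bm{\mathit{r}})$ and that observation (11) states $\sum_{e}\bm{\mathit{r}}_{e}(\Delta^{*}_{e})^{2}=\sum_{e}\frac{1}{\bm{\mathit{r}}_{e}}(\boldsymbol{\mathit{a}}_{e}^{\top}\bm{\mathit{y}}^{*})^{2}=\Psi(\bm{\mathit{r}})$; the original displays may be arranged differently.

$$
\frac{1}{\Psi(\bm{\mathit{r}}^{\prime})}
\leq\frac{1}{\Psi(\bm{\mathit{r}})^{2}}\sum_{e}\frac{1}{\bm{\mathit{r}}^{\prime}_{e}}\left(\boldsymbol{\mathit{a}}_{e}^{\top}\bm{\mathit{y}}^{*}\right)^{2}
=\frac{1}{\Psi(\bm{\mathit{r}})^{2}}\sum_{e}\frac{\bm{\mathit{r}}_{e}}{\bm{\mathit{r}}^{\prime}_{e}}\,\bm{\mathit{r}}_{e}(\Delta^{*}_{e})^{2}
=\frac{1}{\Psi(\bm{\mathit{r}})}\left(1-\frac{1}{\Psi(\bm{\mathit{r}})}\sum_{e}\left(1-\frac{\bm{\mathit{r}}_{e}}{\bm{\mathit{r}}^{\prime}_{e}}\right)\bm{\mathit{r}}_{e}(\Delta^{*}_{e})^{2}\right).
$$

The first inequality uses the feasibility of $\bm{\widetilde{\mathit{y}}}$ for the program with resistances $\bm{\mathit{r}}^{\prime}$, the middle equality substitutes $\boldsymbol{\mathit{a}}_{e}^{\top}\bm{\mathit{y}}^{*}=\bm{\mathit{r}}_{e}\Delta^{*}_{e}$, and the last equality uses $\sum_{e}\bm{\mathit{r}}_{e}(\Delta^{*}_{e})^{2}=\Psi(\bm{\mathit{r}})$.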

Now consider the term $1-\frac{\bm{\mathit{r}}_{e}}{\bm{\mathit{r}}^{\prime}_{e}}$ : if $\bm{\mathit{r}}^{\prime}_{e}\geq 2\bm{\mathit{r}}_{e}$ , then it is at least $1/2$ . Otherwise, it can be rearranged to

[TABLE]

So in either case, we have

[TABLE]

which upon rearranging gives the desired result. ∎
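The case analysis on $1-\frac{\bm{\mathit{r}}_{e}}{\bm{\mathit{r}}^{\prime}_{e}}$ can be sanity-checked numerically. The sketch below assumes $\bm{\mathit{r}}^{\prime}_{e}\geq\bm{\mathit{r}}_{e}$ and that the combined bound is $\min\{1/2,(\bm{\mathit{r}}^{\prime}_{e}-\bm{\mathit{r}}_{e})/(2\bm{\mathit{r}}_{e})\}$, which is what the two cases suggest; the function names and the sampled values are hypothetical.

```python
def exact_term(r, r_new):
    """The quantity 1 - r_e / r'_e from the proof."""
    return 1.0 - r / r_new


def lower_bound(r, r_new):
    """min{1/2, (r'_e - r_e) / (2 r_e)}, a bound covering both cases."""
    return min(0.5, (r_new - r) / (2.0 * r))


# Sweep the ratio r'_e / r_e over [1, 4] for several base resistances r_e
# and record any point where the bound fails (beyond rounding error).
violations = []
for r in (0.1, 1.0, 7.5):
    for k in range(0, 301):
        r_new = r * (1.0 + k / 100.0)
        if exact_term(r, r_new) < lower_bound(r, r_new) - 1e-12:
            violations.append((r, r_new))
```

The bound is tight at $\bm{\mathit{r}}^{\prime}_{e}=2\bm{\mathit{r}}_{e}$, where both sides equal $1/2$.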

