Iterative Refinement for $\ell_p$-norm Regression
Deeksha Adil, Rasmus Kyng, Richard Peng, Sushant Sachdeva

TL;DR
This paper presents improved iterative algorithms for solving $\ell_p$-regression problems for all $p$ in $(1,2)$ and $(2,\infty)$, achieving faster convergence and accuracy, especially for large-scale and sparse problems.
Contribution
The authors develop novel iterative refinement algorithms for $\ell_p$-regression that leverage smoothed $\ell_p$-norms, enabling faster solutions with sublinear iteration counts and improved runtime over previous methods.
Findings
Achieve $\tilde{O}_p(m^{1/3})$ iterations for high-accuracy solutions.
Solve $\ell_p$-regression in $\tilde{O}_p(m^{\max\{\omega, 7/3\}})$ time, matching $\ell_2$ regression for constant $p$.
Improve on previous algorithms for sparse graphs and matrices with similar dimensions.
This paper has been published at SODA 2019 [Adi+], and was initially submitted to SODA on July 12, 2018.
Deeksha Adil University of Toronto. [email protected]. Supported by an Ontario Graduate Scholarship, and by a Connaught New Researcher award to Sushant Sachdeva.
Rasmus Kyng Harvard. [email protected]. Supported by ONR grant N00014-18-1-2562.
Richard Peng Georgia Tech. [email protected]. Supported in part by the National Science Foundation under Grant No. 1718533.
Sushant Sachdeva University of Toronto. [email protected]. Research supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC), and a Connaught New Researcher award.
Abstract
We give improved algorithms for the $\ell_p$-regression problem, $\min_x \|x\|_p$ such that $Ax = b$, for all $p \in (1,2) \cup (2,\infty)$. Our algorithms obtain a high accuracy solution in $\tilde{O}_p(m^{\frac{|p-2|}{2p+|p-2|}}) \le \tilde{O}_p(m^{1/3})$ iterations, where each iteration requires solving an $m \times m$ linear system, with $m$ being the dimension of the ambient space.
Incorporating a procedure for maintaining an approximate inverse of the linear systems that we need to solve at each iteration, we give algorithms for solving $\ell_p$-regression to $1/\text{poly}(n)$ accuracy that run in time $\tilde{O}_p(m^{\max\{\omega, 7/3\}})$, where $\omega$ is the matrix multiplication constant. For the current best value of $\omega > 2.37$, this means that we can solve $\ell_p$ regression as fast as $\ell_2$ regression, for all constant $p$ bounded away from $1$.
Our algorithms can be combined with nearly-linear time solvers for linear systems in graph Laplacians to give minimum $\ell_p$-norm flow / voltage solutions to $1/\text{poly}(n)$ accuracy on an undirected graph with $m$ edges in $\tilde{O}_p(m^{1+\frac{|p-2|}{2p+|p-2|}}) \le \tilde{O}_p(m^{4/3})$ time.
For sparse graphs and for matrices with similar dimensions, our iteration counts and running times improve upon the $p$-norm regression algorithm by [Bubeck-Cohen-Lee-Li STOC'18], as well as general purpose convex optimization algorithms. At the core of our algorithms is an iterative refinement scheme for $\ell_p$-norms, using the quadratically-smoothed $\ell_p$-norms introduced in the work of Bubeck et al. Formally, given an initial solution, we construct a problem that seeks to minimize a quadratically-smoothed $\ell_p$ norm over a subspace, such that a crude solution to this problem allows us to improve the initial solution by a constant factor, leading to algorithms with fast convergence.
1 Introduction
Iterative methods that converge rapidly to a solution are of fundamental importance to numerical analysis, optimization, and more recently, graph algorithms. In the study of iterative methods, there are significant discrepancies between iterative methods geared towards linear problems, and ones that can handle more general convex objectives. For systems of linear equations, which correspond to minimizing $\ell_2$-norm objectives over a subspace, most iterative methods obtain $\epsilon$-approximate solutions in iteration counts that scale as $\log(1/\epsilon)$. More generally, for appropriately defined notions of accuracy, a constant-accuracy linear system solver can be iterated to give a much higher accuracy solver using a few calls to the crude solver. Such phenomena are not limited to linear systems either: an algorithm that produces approximate maximum flows on directed graphs can be iterated on the residual graph to quickly obtain high-accuracy answers.
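The claim that a crude linear solver can be iterated into a high-accuracy one can be illustrated with classical iterative refinement. The following is a minimal sketch, not the paper's algorithm: the crude solver here is a single Jacobi step (an assumption chosen for simplicity), and the names `crude_solve` and `refine` are hypothetical.

```python
import numpy as np

def crude_solve(A, r):
    """Hypothetical crude solver: solve with the diagonal of A only (one Jacobi step)."""
    return r / np.diag(A)

def refine(A, b, iters=200):
    """Iterative refinement: repeatedly apply the crude solver to the current residual."""
    x = np.zeros_like(b)
    for _ in range(iters):
        r = b - A @ x              # residual of the current iterate
        x = x + crude_solve(A, r)  # correct x using the crude solution of A d = r
    return x

rng = np.random.default_rng(0)
n = 50
# Strictly diagonally dominant matrix, so each refinement step contracts the error.
A = rng.uniform(-1.0, 1.0, size=(n, n))
A[np.arange(n), np.arange(n)] = 2.0 * n
b = rng.standard_normal(n)

x = refine(A, b)
rel_residual = np.linalg.norm(A @ x - b) / np.linalg.norm(b)
```

Because each step shrinks the error by a constant factor, the number of crude solves needed for accuracy $\epsilon$ grows only logarithmically in $1/\epsilon$.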
On the other hand, for the much wider space of non-linear optimization problems arising from optimization and machine learning, it’s significantly more expensive to obtain high accuracy solutions. Many widely used methods such as (accelerated) gradient descent, obtain $\epsilon$-approximate answers using iteration counts that scale as $\text{poly}(1/\epsilon)$. Such discrepancies also occur in the overall asymptotic running times. An important and canonical problem in this space is $\ell_p$-norm regression:
$$\min_{x : Ax = b} \|x\|_p \qquad (*)$$
for some $A \in \mathbb{R}^{n \times m}$ with $m \ge n$, and $b \in \mathbb{R}^n$. For $p = 2$, this corresponds exactly to solving a linear system, and hence is solvable by a matrix inversion in $O(m^{\omega})$ time (here $\omega$ is the matrix multiplication exponent; currently we know $\omega \le 2.3728639$ [Wil12, Le 14]). For $p = 1$ and $p = \infty$, this problem is inter-reducible to linear programming [Til13, Til15, Bub+18].
Interior point methods also allow us to solve $\ell_p$-norm regression problems in $\tilde{O}(\sqrt{m})$ iterations [NN94, LS14], where each iteration requires solving an $m \times m$ linear system for any $p$. Bubeck et al [Bub+18] show that this iteration count is tight for the interior point method framework, and instead propose a different method which requires only $\tilde{O}_p(m^{|1/2 - 1/p|})$ iterations (the $O_p(\cdot)$ notation hides constant factors that depend on $p$ and its dual norm; the $\tilde{O}_p(\cdot)$ notation also hides polylogarithmic factors in addition), which for large constant $p$ still tends to about $m^{1/2}$. On the other hand, $(1+\epsilon)$-approximate solutions can be computed in about $m^{1/3}\,\text{poly}(1/\epsilon)$ iterations [Chi+13] (this result only addressed the case, but its techniques generalize to all other $p$).
Furthermore, this discrepancy also carries over to the graph theoretic case. If the matrix is the vertex-edge incidence matrix of a graph, then this problem captures graph problems such as -norm Lipschitz learning and finding -norm minimizing flows meeting demands given by Here low accuracy approximate solutions can be obtained in nearly-linear time when [Pen16, She17], and almost-linear time for all other values of [She17a, Sid17]. However, the current best high accuracy solutions take at least time [Mad13, Bub+18].
1.1 Contributions
Iterative Refinement for $\ell_p$-norms.
In this paper, we propose a new iterative method for -norm regression problems (* ‣ 1) that achieves geometric convergence to the optimal solution. Our method only requires solving residual problems to find an -approximate solution, or residual problems, each solved to a -approximation factor. Such an iterative method was previously known only for and Curiously, our residual problems look very similar to the original problem (* ‣ 1), with the norms replaced by their quadratically-smoothed versions introduced by Bubeck et al [Bub+18]. This result, Theorem 4.1, can be stated informally as:
Theorem 1.1.
There exists a class of residual problems for -norm regression (which we will define in Definition 4.3) such that any -norm regression problem can be solved to -relative accuracy by solving to relative error a sequence of residual problems.
Improved Iteration Count for $\ell_p$-Regression.
We then give an algorithm for quickly solving the residual problem motivated by the approximate maximum flow by electrical flows algorithm by Christiano et al. [Chr+11] and its generalizations to regression problems [Chi+13]. This is given as Theorem 5.1, and can be stated informally as:
Theorem 1.2.
For any , an instance of a residual problem for -norm regression as defined in Definition 4.3 can be solved in iterations, each of which consist of solving a system of linear equations plus updates that take linear time.
This improves on the work of Bubeck et al [Bub+18] for all with the number of iterations equaling for (essentially the same as Bubeck et al) and tending to as goes to (compared to for Bubeck et al). However, our results don’t give anything for due to the dependency in in the term. It’s worth noting that even in the constant error regime, this improves by a factor of about over the current state of the art, which for small is due to Bubeck et al. [Bub+18], and for large is based on unpublished modifications to Christiano et al. [Chr+11, Mad11].
A Duality Based Approach to $\ell_p$-Regression.
For the remaining case of , we instead solve the dual problem, which is a -norm regression problem, and utilize its solution to solve our original -Regression. This leads to iteration counts of the form for solving such problems. Note that this result also does not give anything when , as the constants related to its dual norm, become prohibitive. For all our iteration count achieves the following exponent on
[TABLE]
while the exponent from the previous result [Bub+18] is : our algorithm has better dependence on on all constant (albeit with larger constants depending on ).
For the case of , a manuscript by Bullins [Bul18] from December 2018 (after our paper was accepted to SODA 2019, but independently developed), gives the same iteration count as our algorithm of up to polylogs. Bullins’ approach requires a linear system solve per iteration, similar to our approach when implemented without inverse maintenance. Bullins’ algorithm is based on higher-order acceleration, and the agreement between running times suggests there may be a strong connection between our “accelerated” multiplicative weight method and his accelerated gradient-based method.
Faster $\ell_p$-Regression.
Our improved iteration counts can be readily combined with methods for speeding up optimization algorithms that utilize linear system solvers, including inverse maintenance [Vai89, LS15]. This results in an time algorithm for solving regression problems for all , which we formalize in Theorem 6.1.
This bound for -norm regression with general matrices brings us to the somewhat surprising conclusion that for the current value of , -norm regression problems (with constant that’s also constant-bounded away from ) on square matrices can be solved as fast as solving the underlying linear systems, or equivalently, regression problems.
This is based on maintaining an approximate inverse to the linear systems we need to solve in each step of the iterative method as pioneered by Vaidya [Vai89]. However, our modification interacts directly with the potential functions we use to control iteration counts in the inner loop of our iterative method. A similar approach for maintaining an approximate inverse was used by Cohen et al. [CLS18] to give an algorithm for Linear Programming, after our initial submission to SODA, but before our paper was publicly available. Both works build on ideas developed by Cohen, see [Lee17].
Faster $\ell_p$-Norm Flows.
When solving $\ell_p$-norm flow problems, our algorithm can be made faster by using Laplacian solvers for graph problems [Vai90, Ten10] to solve the linear equations that arise during our iterations. This gives algorithms for finding $\ell_p$-norm flows on undirected graphs to accuracy with running time for via direct invocations of fast Laplacian solvers [ST14].
Our results thus give the first evidence that wide classes of graph optimization problems can be solved in time or faster. While such a bound (via fast Laplacian solvers) is by now well-known in the approximate setting [Chr+11], the iteration bounds due to Madry [Mad13, Mad16] represent the only results to date in this direction for high accuracy answers on sparse graphs.
Generalizations and Extensions.
While we focus on Problem (* ‣ 1), under mild assumptions about polynomially bounded objectives, we can solve the following more general problem:
[TABLE]
The reduction is discussed in Section 7. The combination of an affine constraint on with an affine transformation in the -norm objective means we can solve most variants of -norm optimization problems.
Similar ideas can be used to solve -norm Lipschitz learning problems [Kyn+15] on graphs quickly.
1.2 Comparison to Previous Works
Numerical Methods and Preconditioning.
Iterative methods and preconditioning are the most fundamental tools in numerical algorithms [Axe94, Saa03]. As studies of such methods often focus on linear problems, many existing analyses of iterative methods are restricted to linear systems. Generalizing such methods, as well as numerical methods, to broader settings is a major topic of study [Hen03, NW06, Kel99, KK04].
The study of more efficient algorithms for combinatorial flow problems has benefited enormously from ideas from linear and non-linear preconditioning. Recent advances in approximate maximum flow and transshipment algorithms [LRS13, She13, Kel+14, RST14, Gha+15, Pen16, She17a, She17] build upon such ideas. However, these methods rely on the preconditioner being a linear operator, and give dependence.
Optimization Algorithms.
Our techniques for solving the residual problems are directly motivated by approximating maximum flow using electrical flows [Chr+11]. While this algorithm has been extended to multicommodity flows and regression problems [KMP12, Chi+13], all these results have dependencies.
Several recent results for obtaining dependencies are all motivated by convex optimization techniques. In particular, the state of the art running times are by interior point methods. These include directly modifying the interior point method (IPM) [LS14, LS15, Kyn+15], combining techniques from the electrical flow algorithms with IPM update steps [Mad13, Kyn+15, KRS15, Mad16, Coh+17], and increasing the ‘confidence interval’, and in turn step lengths, of the IPM update steps [All+17, Coh+17a, Bub+18]. Our result based on creating intermediate problems has the most in common with the last of these. However, our method differs in that our guarantees for this intermediate problem hold over the entire space.
Inverse Maintenance
Our final running time of for -regression incorporates inverse maintenance. This is a method introduced by Vaidya [Vai89] for speeding up optimization algorithms for solving minimum cost and multicommodity flows. It takes advantage of the controllable rate at which optimization algorithms modify the solution variables to reuse inverses of matrices constructed from such variables.
Previous studies of inverse maintenance [Vai89, LS14, LS15] have been geared towards the interior point method. Here the norm per update step can be controlled, and we believe this also holds for their applications in faster cutting plane methods [LSW15]. While such methods also give gains in the case of our algorithm, for the final bound of about , we instead bound the progress of the steps against a global potential function motivated by the electrical flow max-flow algorithm [Chr+11].
Speedups for Matrices with Uneven Dimensions
Our algorithm on the other hand does not take into account sparsity of the input matrix, or possibly uneven dimensions (e.g. ). In these settings, the methods based on accelerated stochastic gradient descent from [Bub+18] obtain better performances. On the other hand, we believe our methods have the potential of extending to such settings by combining the intermediate problems with row sampling [CP15]. However, analysis of such row sampling routines for our residual problems containing mixed and norm functions is outside the scope of this paper.
2 Technical Overview
Iterative Refinement for $\ell_p$-norms.
To design their algorithm for $\ell_p$-norm regression, Bubeck et al [Bub+18] construct a function $\gamma_p(t, x)$ which is $C^1$ (a function is said to be $C^1$ if it’s continuous, differentiable, and has a continuous derivative), quadratic in the range $|x| \le t$, and behaves as $|x|^p$ asymptotically (see Def. 3.1). Our key lemma states one can locally approximate $\|x + \Delta\|_p^p$ as a linear function plus a $\gamma_p$ “error” term (Lemma 4.5; it is useful to compare the $\gamma_p$ term to the second-order Hessian term in the Taylor expansion):
[TABLE]
Surprisingly, this approximation only has an “condition number”. Proceeding just as for gradient descent, or Newton’s method, means that if at each step we solve the following local approximation problem to a factor
[TABLE]
where is the gradient of our loss function, we can converge to an -approximate solution in roughly iterations (Theorem 4.1).
Improved Algorithms for -regression for .
The key advantage afforded by our iterative algorithm is that we now only need to design an algorithm for the residual problem that achieves a crude approximation factor (we achieve ). As a first step, by a binary search and some rescaling, we show (Lemma 5.5) that it suffices to achieve a constant factor approximation to problems of the following form,
[TABLE]
The technical heart of our proof is to give an algorithm (Gamma-Solver, Algorithm 4) inspired by the multiplicative weight update (MWU) method (see [AHK12] for a survey), combined with the width-reduction inspired by the faster flow algorithm of Christiano et al. [Chr+11], and its matrix version by Chin et al.[Chi+13]. At each iteration, we solve a weighted minimization problem to find the next update step. If this update step has small norm, we add this to our current solution, and update the weights. Otherwise, we identify a set of coordinates that have small current weights, and yet are contributing most of the norm, and we penalize them by increasing their weights (and do not add our update step to the current solution). Setting the parameters carefully, we show that after iterations, the average of the update steps achieves an -approximation to our modified residual problem (Theorem 5.8). Combining this with our iterative refinement algorithm, we obtain our algorithms for -norm regression that require only iterations (or linear system solves).
Maintaining Inverses for Improved Algorithm.
Our inverse maintenance procedure utilizes the same combination of low-rank updates and matrix multiplications as in previous results [Vai89, LS14, LS15]. However, the rate of convergence of our algorithm, and in turn the rate at which we adjust the weights from the MWU procedure, are governed by growths in the minimization problem. This leads to the difficulty of uneven progress across the iterations.
We solve this issue by a simple yet subtle scheme motivated by lazy updates in data structures [GP13, Abr+16]. We bucket changes to the values of entries based on their magnitudes, and update entries that received too many updates of a certain magnitude separately. This differs with previous methods that update weights exceeding approximation thresholds as they happen, and enables a closer interaction with the overall potential function based convergence analysis.
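The low-rank update at the heart of inverse maintenance can be illustrated with the Sherman-Morrison-Woodbury identity. This is a minimal sketch, not the paper's bucketing scheme: it assumes the maintained matrix has the form $A R^{-1} A^\top$ with $R$ a diagonal matrix of positive weights, and the helper name `woodbury_update` is hypothetical.

```python
import numpy as np

def woodbury_update(M_inv, U, delta):
    """Inverse of M + U diag(delta) U^T, given M_inv = M^{-1} (M symmetric, delta nonzero)."""
    MU = M_inv @ U                               # n x k
    # k x k "capacitance" matrix: diag(1/delta) + U^T M^{-1} U
    cap = np.diag(1.0 / delta) + U.T @ MU
    return M_inv - MU @ np.linalg.solve(cap, MU.T)

rng = np.random.default_rng(0)
n, m = 5, 20
A = rng.standard_normal((n, m))
r = rng.uniform(1.0, 2.0, size=m)                # positive weights
M_inv = np.linalg.inv((A / r) @ A.T)             # inverse of A R^{-1} A^T

# Change only k = 3 of the m weights; the update costs a k x k solve
# instead of a fresh n x n inversion.
idx = np.array([3, 7, 11])
r_new = r.copy()
r_new[idx] *= 2.0
delta = 1.0 / r_new[idx] - 1.0 / r[idx]
M_inv_new = woodbury_update(M_inv, A[:, idx], delta)
```

Batching weight changes by magnitude, as the text describes, decides when such low-rank corrections are applied; the identity itself is unchanged.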
3 Preliminaries
We use the following family of functions, defined in [Bub+18].
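Definition 3.1 ($\gamma_p$ function).
For $t \ge 0$ and $p \ge 1$, define
$$\gamma_p(t, x) = \begin{cases} \frac{p}{2}\, t^{p-2} x^2 & \text{if } |x| \le t, \\ |x|^p + \left(\frac{p}{2} - 1\right) t^p & \text{otherwise.} \end{cases}$$
These functions can be thought of as a quadratic approximation of $|x|^p$ in a small range around zero. The following properties follow directly from the definition.
1. $\gamma_p(0, x) = |x|^p$.
2. $\gamma_p(t, \cdot)$ is quadratic in the range $-t \le x \le t$.
3. $\gamma_p(t, x)$ is $C^1$ in both $x$ and $t$.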
We show several other important properties of $\gamma_p$ in the following lemmas. Their proofs are straightforward and deferred to Appendix A.1.
Lemma 3.2.
Function $\gamma_p$ is as defined above.
1. For any and we have and
2. It is homogeneous under rescaling of both $t$ and $x$, i.e., for any $t, x$ and any $\lambda \ge 0$ we have $\gamma_p(\lambda t, \lambda x) = \lambda^p \gamma_p(t, x)$.
3. For any and any we have
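The smoothed function and its homogeneity property can be checked numerically. The sketch below assumes the piecewise form of $\gamma_p$ used by Bubeck et al. (quadratic $\frac{p}{2} t^{p-2} x^2$ for $|x| \le t$, and $|x|^p + (\frac{p}{2} - 1) t^p$ otherwise); the function name `gamma_p` and the sample values are illustrative.

```python
import numpy as np

def gamma_p(t, x, p):
    """Quadratically smoothed |x|^p (assumed form): quadratic for |x| <= t.

    Assumes p >= 2 and t >= 0 so that t ** (p - 2) is finite.
    """
    t = np.asarray(t, dtype=float)
    x = np.asarray(x, dtype=float)
    quad = 0.5 * p * t ** (p - 2) * x ** 2            # branch for |x| <= t
    tail = np.abs(x) ** p + (0.5 * p - 1.0) * t ** p  # branch for |x| > t
    return np.where(np.abs(x) <= t, quad, tail)

p = 4.0
t = np.array([1.0, 2.0, 0.5])
x = np.array([0.3, -5.0, 0.5])
lam = 3.0

# Homogeneity: rescaling both arguments by lam scales the value by lam ** p.
scaled = gamma_p(lam * t, lam * x, p)
```

The two branches agree at $|x| = t$ (both equal $\frac{p}{2} t^p$), which is what makes the function continuous across the threshold.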
The next lemma shows a bound on the value of when is scaled up or down.
Lemma 3.3**.**
For any and , we have,
[TABLE]
This implies,
[TABLE]
The following lemma allows us to bound the second order change in as changes to
Lemma 3.4**.**
For any and any we have
[TABLE]
Notation.
For a vector let denote the vector with its coordinate as For any two vectors and , denotes the sum .
4 Main Iterative Algorithm
In this section we analyze procedure p-Norm, i.e., Algorithm 1. Our main result for this section is,
Theorem 4.1 ($\ell_p$-norm Iterative Refinement).
For any and Given an initial feasible solution (Definition 4.7) to our optimization problem (Equation (* ‣ 1)), Algorithm 1 finds an -approximate solution to (* ‣ 1) in calls to a -approximate solver for the residual problem (Equation (1)).
The theorem says that it is sufficient to solve an instance of the residual problem (1) crudely, and only a logarithmic number of times. Before we prove the theorem, we define the terms used in the statement and prove some results that would be needed for the proof. We begin by defining an -approximate solution to our main optimization problem.
Definition 4.2 ($\epsilon$-approximate solution).
We say our solution is an -approximate solution to (* ‣ 1) if and
[TABLE]
where is the OPT of (* ‣ 1).
We next define what we use as our residual problem and what we mean by a -approximate solution.
Definition 4.3 (Residual Problem).
For any given and , let
[TABLE]
where is the gradient, . We call the following problem the residual problem of (*) at .
[TABLE]
Definition 4.4 ($\kappa$-approximate solution).
Let . A -approximate solution for the residual problem is such that and, Here .
In order to see why we choose this problem as our residual problem we show that the objective of the residual problem bounds the change in -norm of a vector when perturbed by (Lemma 4.6).
Lemma 4.5.
Let . Then for any and any ,
[TABLE]
where is the derivative of the function .
The proof can be found in Appendix A.2.
Lemma 4.6.
Let and be such that . Then for any we have,
[TABLE]
Proof.
Applying Lemma 4.5 to all the coordinates, we obtain,
[TABLE]
Using Definition 4.3, equation (4) directly implies, for all Now to prove the other side, note that for any and any we have from Lemma 4.5 and Lemma 3.2
[TABLE]
Picking such that we obtain that for any
[TABLE]
thus concluding the proof of the lemma. ∎
For any iterative algorithm we need a starting feasible solution. We could potentially start with any feasible solution but we define the following starting solution which we claim is a good starting point. Lemma 4.8 shows us that our chosen starting point is only polynomially away from the optimum solution, and is thus a good choice. The proof of the lemma can be found in Appendix A.2.
Definition 4.7 (Initial Solution).
We define $x^{(0)} = \arg\min_{Ax = b} \|x\|_2^2$ to be our initial feasible solution.
Lemma 4.8.
For as defined in Definition 4.7, .
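To see why a starting point can be only polynomially far from optimal, the sketch below assumes the initial solution is the minimum $\ell_2$-norm solution of $Ax = b$ (an assumption; the definition's formula is not legible here). For $p \ge 2$, standard norm comparisons give $\|x^{(0)}\|_p^p \le m^{(p-2)/2} \|x\|_p^p$ for every feasible $x$, which the code checks on a random instance. The name `min_l2_solution` is hypothetical.

```python
import numpy as np

def min_l2_solution(A, b):
    """Minimum l2-norm solution of A x = b: x = A^T (A A^T)^{-1} b (A full row rank)."""
    return A.T @ np.linalg.solve(A @ A.T, b)

rng = np.random.default_rng(1)
n, m, p = 4, 30, 4.0
A = rng.standard_normal((n, m))
b = rng.standard_normal(n)
x0 = min_l2_solution(A, b)

# Another feasible point: x0 plus a vector in the null space of A.
z = rng.standard_normal(m)
z_null = z - A.T @ np.linalg.solve(A @ A.T, A @ z)
x_other = x0 + z_null

# ||x0||_p <= ||x0||_2 <= ||x_other||_2 <= m^(1/2 - 1/p) ||x_other||_p for p >= 2.
lhs = np.sum(np.abs(x0) ** p)
rhs = m ** ((p - 2) / 2) * np.sum(np.abs(x_other) ** p)
```

Since the inequality holds for every feasible point, it holds in particular for the optimum, so the starting objective is within a polynomial factor of the optimal value.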
We are now ready to prove Theorem 4.1.
Proof.
Let denote the solution returned by the -approximate solver. We know that . We have,
[TABLE]
From Lemma 4.6, for we get,
[TABLE]
Combining the above two equations and subtracting OPT from both sides gives us
[TABLE]
Using Lemma 4.8 we get,
[TABLE]
Setting gives us an -approximate solution. ∎
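The arithmetic behind the logarithmic number of solver calls can be spelled out. The following is a sketch under stated assumptions: $f$ denotes the objective value, and the per-call contraction constant $c_p$ and the polynomial exponent $c$ are placeholder symbols, not values taken from the paper.

```latex
% Assumption: one call to a kappa-approximate residual solver shrinks the
% optimality gap by a factor (1 - 1/(c_p \kappa)), where c_p depends only on p.
% Assumption: the starting point satisfies f(x^{(0)}) \le m^{c} \, \mathrm{OPT}.
\begin{align*}
f(x^{(T)}) - \mathrm{OPT}
  &\le \Bigl(1 - \tfrac{1}{c_p \kappa}\Bigr)^{T} \bigl(f(x^{(0)}) - \mathrm{OPT}\bigr) \\
  &\le e^{-T/(c_p \kappa)} \, m^{c} \, \mathrm{OPT} \\
  &\le \epsilon \, \mathrm{OPT}
  \qquad \text{once } T \ge c_p \, \kappa \, \ln\bigl(m^{c}/\epsilon\bigr).
\end{align*}
```

Under these assumptions, $T = O_p(\kappa \log(m/\epsilon))$ calls suffice, which is the shape of the bound in Theorem 4.1.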
This concludes the discussion on the analysis of Algorithm 1. In the following sections we move on to analyzing how to solve the residual problem approximately.
5 Solving the Residual Problem
In this section, we give an algorithm that solves the residual problem to a constant approximation. Combined with the iterative refinement scheme from Theorem 4.1, we obtain the following result.
Theorem 5.1.
For , we can find an -approximate solution to (* ‣ 1) in time
[TABLE]
Here is the matrix multiplication constant.
Recall that the residual problem
[TABLE]
has a linear term followed by the function. Instead of directly optimizing this function, we guess an approximate value of the linear term, and for each such guess, we minimize the function under this additional constraint. We can scale the problem so that the optimum is at most Finally, we can perturb so that each lies in a polynomially bounded range without adding significant error. Our final problem looks as follows,
[TABLE]
with .
To summarize, -Approx (Algorithm 2) formalizes this process and shows that we only need to solve a logarithmic number of instances of the above program (2), and solving each to a -approximation gives a -approximate solution to (1). Gamma-Solver (Algorithm 4) solves problem (2) to an approximation. Therefore, using Gamma-Solver as a subroutine for -Approx we get an approximate solution to (1). Section 5.1 gives an analysis for -Approx. In Section 5.2, we give an oracle that is used in Gamma-Solver. We give an analysis of Gamma-Solver in Section 5.3. Finally in Section 5.4, we give a proof for Theorem 5.1.
5.1 Equivalent Problems
In this section we prove the following theorem.
Theorem 5.2.
Procedure -Approx (Algorithm 2) returns an -approximate solution to the residual problem given by (1), by solving instances of program (2) to a -approximation.
The following lemmas will lead to the proof of the above theorem. The first lemma gives an upper and lower bound on the objective of (1).
Lemma 5.3.
Let and assume that our current solution is not an -approximate solution. Let be such that . For some
[TABLE]
where is the solution of (1).
We defer the proof to Appendix A.3. Lemma 5.3 suggests that we can divide the range of the objective of our residual problem, into a logarithmic number of bins and solve a decision problem that asks if the optimum belongs to the bin. The lemma guarantees that at least one of the decision problems will be feasible. The following lemma defines the required decision problems and shows that solving these to a constant approximation is sufficient to get a constant approximate solution to (1).
Lemma 5.4.
Let . Suppose for some where is the solution of (1). The following program is feasible:
[TABLE]
If is a -approximate solution to program (3) for this choice of then, we can pick such that the vector is an -approximate solution to (1).
The proof can be found in Appendix A.3. We now scale down the objective of (3) so that it is at most . The next lemma shows what the scaled down problem looks like and how an approximate solution to the scaled down problem gives an approximate solution to (3). Again the proof of the lemma can be found in Appendix A.3.
Lemma 5.5.
Let . Let be such that (3) is feasible. Let
[TABLE]
Note that . Then program (2) with , and
[TABLE]
has . Let be a -approximate solution to (2). Then, is a - approximate solution to (3).
We now prove Theorem 5.2.
Proof.
Lemma 5.3 suggests that there exists an index
[TABLE]
such that . Lemma 5.4 implies that (3) is feasible for index . Suppose is a -approximate solution to the scaled down problem (2) for index . Lemma 5.5 implies that is an approximate solution to (3) for index . Lemma 5.4 now implies that is a -approximation to our residual problem (1). Now, the algorithm solves the scaled down problem for every and returns the that when added to our current solution gives the minimum -norm. It either chooses or some other solution . In case it returns ,
[TABLE]
From Lemma 4.6 we know,
[TABLE]
We thus have , implying is also a approximate solution as required. ∎
It remains to solve problems of the form (2) up to a -approximation. Recall that these problems look like,
[TABLE]
and satisfy , and .
5.2 Oracle
Our approach follows the format of the approximate max-flow algorithm by Christiano et al. [Chr+11]. Specifically, we use a variant of multiplicative weights update to converge to a solution with small . The multiplicative weights update scheme repeatedly updates a set of weights using partial, local solutions computed based on these weights. The Christiano et al. algorithm can be viewed as picking these weights from the gradients of the soft-max function on flows. We will adapt this routine by showing that ’s chosen from the gradient of also suffice for approximately minimizing the problem stated in (2).
The subroutine that this algorithm passes the onto is commonly referred to as an oracle. An oracle needs to compute a solution with both small dot-product against , and small width, which is defined as the maximum value of an entry. In such an oracle, the dot product condition is the hard constraint, in that the final approximation factor of the solution is directly related to the value of these dot products. The width, on the other hand, only affects the overall iteration count/ running time, and can even be manipulated/improved algorithmically. Therefore we first need to define and show a good upper bound on the objective of the optimization problem solved within the oracle.
Formally, our oracle subroutine Algorithm 3 takes as input some affine constraints and vector of weights . It first computes a vector of non-negative weights , and then returns a minimizer to the following optimization problem
[TABLE]
Appendix C contains an algorithm that solves such problems efficiently.
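A weighted quadratic minimization under affine constraints has a closed-form solution through a single linear system, which is the kind of computation the oracle performs. This is a minimal sketch, not the algorithm of Appendix C: it assumes the problem has the form $\min \sum_e r_e \Delta_e^2$ subject to $C\Delta = d$ with positive weights $r$, and the names `weighted_min_energy`, `C`, and `d` are hypothetical.

```python
import numpy as np

def weighted_min_energy(C, d, r):
    """Minimize sum_e r_e * delta_e^2 subject to C @ delta = d (C full row rank, r > 0)."""
    Rinv_Ct = C.T / r[:, None]              # R^{-1} C^T, shape m x k
    phi = np.linalg.solve(C @ Rinv_Ct, d)   # Lagrange multipliers ("potentials")
    delta = Rinv_Ct @ phi                   # optimal solution R^{-1} C^T phi
    energy = float(d @ phi)                 # equals sum_e r_e * delta_e^2
    return delta, energy

rng = np.random.default_rng(2)
k, m = 3, 12
C = rng.standard_normal((k, m))
d = rng.standard_normal(k)
r = rng.uniform(0.5, 2.0, size=m)
delta, energy = weighted_min_energy(C, d, r)

# Any other feasible point (delta plus a null-space vector) has at least as much energy.
z = rng.standard_normal(m)
other = delta + (z - C.T @ np.linalg.solve(C @ C.T, C @ z))
energy_other = float(np.sum(r * other ** 2))
```

When the constraint matrix comes from a graph, the $k \times k$ system is a Laplacian-type system, which is where fast Laplacian solvers enter.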
Let us now look at some properties of the solution returned by the oracle. Note that the objective of our problem (2) is at most . This implies that we have such that
- •
,
- •
, or .
We next look at some relations on the weights and resistances. The following lemma is a simple application of Hölder’s inequality. Its proof is given in Appendix A.3.
Lemma 5.6.
Let . For any set of weights on the edges,
Lemma 5.7.
Let . For any , let be the electrical flow computed with respect to resistances
[TABLE]
and demand vector
Then the following hold,
** 2. 2.
**
Proof.
Since is the electrical flow,
[TABLE]
We have,
[TABLE]
Finally, using we have completing part 1.
Now we know that,
[TABLE]
Using the Cauchy–Schwarz inequality,
[TABLE]
Combining the two cases we have,
[TABLE]
where the last line uses for any ∎
5.3 The Algorithm
Next, we integrate this oracle into the overall algorithm that repeatedly adjusts the weights. As with the use of electrical flow oracles for approximate max-flow [Chr+11], the convergence of such a scheme depends on the maximum values in the returned by the oracle. However, because the overall objective is now a -norm, the exact term of importance is actually the -norm of . Up to this discrepancy, we follow the algorithmic template from [Chr+11] by making an update when is small, and make progress via another potential function otherwise.
In the cases where we do not take the step due to entries with large values, we show significant increases in an additional potential function, namely the objective of the quadratic minimization problem inside the oracle (Algorithm 3). However, the less gradual update schemes related to -norms make it no longer sufficient to update only the weight corresponding to the entry with maximum value. Furthermore, there may be entries with large values, whose corresponding resistances are too large for us to afford increasing. We address this by a scheme where we update an entry only if its value is larger than some threshold , and its resistance is at most another threshold . Specifically, we show that for an appropriate choice of and , such updates both do not change the primary potential function (related to ) by too much (Lemma 5.10), and increase the secondary potential function (the objective of the quadratic minimization problem) significantly whenever is large (Lemma 5.13). Pseudocode of this scheme is in Algorithm 4.
Theorem 5.8.
Let . Given a matrix and vectors and such that , Algorithm 4 uses calls to the oracle and returns a vector such that and .
Analysis of Potentials.
We define the following potential function for the analysis of our algorithm.
Definition 5.9.
Let be the potential function defined as
[TABLE]
Initially, since we start with , we have Observe that in the algorithm, we update the potentials in both the flow step and the width reduction step whereas we update the solution only in the flow step. It is easy to see that we always have
We next bound the potential. In addition, we track the energy of the electrical flow in the network with resistances Let denote the minimum of routing with resistances :
[TABLE]
Note that this energy is equal to the energy calculated using the obtained in the solution of (4).
Notation.
We overload notation for to denote
Our proof of Theorem 5.8 will be based on two main parts:
1. Provided the total number of width reduction steps, , is not too big, then is small. This in turn upper bounds the cost of the approximate solution .
2. Showing that cannot be too big, because each width reduction step causes large growth in , while we can bound the total growth in by relating it to .
We start by observing that when we increase the weight of an edge during a width reduction step, this has the effect of at least doubling the resistance . Recall,
[TABLE]
Now,
[TABLE]
Meanwhile, the resistance does not grow by a factor larger than 4:
[TABLE]
We next show through the following lemma that the potential does not increase too rapidly. The proof is through induction and can be found in Appendix B .
Lemma 5.10.
After flow steps, and width-reduction steps, provided
1. , (controls growth in flow-steps)
2. , (acceptable number of width-reduction steps)
the potential is bounded as follows:
[TABLE]
We next wish to prove that in each width-reduction step, the electrical energy goes up significantly. For this, we will use the following Lemma which is proven in Appendix D. It generalizes Lemma 2.6 of [Chr+11] to arbitrary weighted regression problems, and directly measures the change in terms of the electrical energy of the entries modified.
Lemma 5.11.
Assuming the program (4) is feasible, let be a solution to the optimization problem (4) with weights . Suppose we increase the resistance on each entry to get Then,
[TABLE]
This statement also implies the form of the lemma that concerns increasing the resistances on a set of entries uniformly [Chr+11, Lemma 2.6].
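As a quick sanity check of this kind of resistance monotonicity, the following minimal sketch looks at a heavily simplified special case. It assumes the quadratic program has the form of minimizing $\sum_e r_e \Delta_e^2$ subject to a single linear constraint $g^\top \Delta = z$ (any further subspace constraint is dropped), for which the minimizer has a closed form. The function names and the toy numbers are hypothetical and are not taken from the paper.

```python
def minimizer(r, g, z):
    """Closed-form minimizer of sum_e r[e] * d[e]**2 subject to sum_e g[e] * d[e] == z."""
    denom = sum(ge * ge / re for ge, re in zip(g, r))
    return [z * ge / (re * denom) for ge, re in zip(g, r)]


def energy(r, d):
    """Weighted quadratic energy sum_e r[e] * d[e]**2."""
    return sum(re * de * de for re, de in zip(r, d))


# Toy instance (hypothetical numbers).
g = [1.0, 1.0, 1.0]
z = 1.0
r_old = [1.0, 2.0, 4.0]
r_new = [1.0, 4.0, 4.0]  # the resistance of the second entry is doubled

d_old = minimizer(r_old, g, z)
d_new = minimizer(r_new, g, z)
e_old = energy(r_old, d_old)
e_new = energy(r_new, d_new)

# Raising resistances can only raise the optimal energy.
assert e_new >= e_old
# The old minimizer stays feasible, so it upper-bounds the new optimum.
assert e_new <= energy(r_new, d_old)
```

The sketch only illustrates the direction of the change in energy; the quantitative lower bound of Lemma 5.11 is not reproduced here.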
The next lemma gives a lower bound on the energy in iteration 0, i.e., when we start, and an upper bound on the energy at each step.
Lemma 5.12.
Initially, we have,
[TABLE]
where is the operator norm, or maximum singular value of . Let us call this ratio . Moreover, at any step we have,
[TABLE]
Proof.
For the lower bound in the initial state, recall that we scale the problem such that and Initially we have, This means for any solution , we have
[TABLE]
On the other hand, because
[TABLE]
we get
[TABLE]
upon which squaring gives the lower bound on .
For the upper bound, Lemma 5.7 implies that
[TABLE]
∎
The next Lemma says that the three assumptions (stated in the statement of the Lemma) can be used to ensure that the potential grows quickly with each width reduction step, and that flow steps do not cause the potential to shrink.
Lemma 5.13.
Suppose at step we have so that we perform a width reduction step (line 19). If
1. ,
2. , and
3. .
Then
[TABLE]
Furthermore, if at we have so that we perform a flow step, then
[TABLE]
Proof.
It will be helpful for our analysis to split the index set into three disjoint parts:
- •
- •
- •
.
Firstly, we note
[TABLE]
hence, using Assumption 3
[TABLE]
This means,
[TABLE]
Secondly we note that, using Assumption (1) and Lemma 5.7, we have
[TABLE]
So then, using Assumption 2,
[TABLE]
As , this implies .
From Lemma 5.12 and Assumption 1 we have
[TABLE]
So then, combining our last two observations, and applying Lemma 5.11, we get
[TABLE]
Finally, for the “flow step” case, by applying Lemma 5.11 with as the whole set of indices, and , we get that as the resistances only increase,
[TABLE]
∎
We are now ready to prove Theorem 5.8.
Proof of Theorem 5.8
Proof.
We first observe that our parameter choices in the Algorithm 4 satisfy Assumption 1 of Lemma 5.10, namely, we can choose the parameters and s.t.
- •
,
- •
,
while ensuring . This means by Lemma 5.10, that if the Algorithm completes after taking flow steps and , when it returns, we have
[TABLE]
This means that the algorithm returns with
[TABLE]
Note the only alternative is that the algorithm takes more than width reduction steps (and possibly infinitely many such steps, hence never terminating).
We will now show this cannot happen, by deriving a contradiction from the assumption that the algorithm takes a width reduction step starting from step where and .
Since the conditions for Lemma 5.10 hold for all preceding steps, we must have .
Additionally, we note that our parameter choice of and along with our choice of (see above), ensures that
[TABLE]
This means that at every step preceding the current step, the conditions of Lemma 5.13 are satisfied, so we can prove by a simple induction that
[TABLE]
Since our parameter choices ensure this means
[TABLE]
But this contradicts Lemma 5.12, since this Lemma, combined with gives
[TABLE]
From this contradiction, we conclude that we never have more than width reduction steps.
Now we observe that the total number of oracle calls in the algorithm is bounded by
[TABLE]
∎
This concludes the analysis of our algorithm.
5.4 Proof of Theorem 5.1
Proof.
Theorem 5.8 implies that we can solve Program (2) using Algorithm 4 to get an -approximate solution in calls to the Oracle. Implementing the Oracle requires solving a linear system, and hence can be implemented in time where is the matrix multiplication constant (see the Appendix for a proof). Thus, we can find an -approximate solution to (2) in total time
[TABLE]
Now, Theorem 5.2 implies that we can find an -approximate solution to the residual problem (1) in total time,
[TABLE]
Finally using Theorem 4.1 we can conclude that we have an -approximate solution to (* ‣ 1) in calls to a -approximate solver to the residual problem (1). This gives us a total running time of,
[TABLE]
∎
We now have a complete algorithm for the -norm regression problem that gives an -approximate solution.
6 Speedups for General Matrices via Inverse Maintenance
If is an explicitly given, , matrix, we need to solve the quadratic minimization problem at each step. This can be solved via a linear systems solve in the matrix
[TABLE]
which takes , where is the matrix multiplication constant. This directly gives a total running time cost of , which for large values of , along with the assumption of , exceeds .
This is more than the running time of about of algorithms based on inverse maintenance [Vai90, LS14, LS15]. In this section we show that the MWU routine from Section 5 can also benefit from fast inverse maintenance. Our main result is:
Theorem 6.1.
If is an explicitly given, -by- matrix with polynomially bounded condition numbers, and Algorithm 4 as given in Section 5.3 can be implemented to run in total time
[TABLE]
A few remarks about this running time: the term that dominates depends on the comparison between and , or after manipulation, the comparison between and :
1. For the current best value of , the second term is at most , so the total running time is about .
2. If , then this running time is simply : the same as re-solving the linear system at each step.
3. If , then the overhead in the exponent on the second term is at most
[TABLE]
and this value approaches as .
Our algorithm is based on gradually updating the vector. First, note that ’s, and thus ’s are monotonically increasing. Secondly, for the that do not double, we can replace with the original version while forming a factor preconditioner. Thus, we only need to update the entries that have significant increases. This update can be encapsulated by the following result on computing low rank perturbations to a matrix, which is a direct consequence of rectangular matrix multiplication and Woodbury matrix formula.
Lemma 6.2.
Given an -by- matrix , along with vectors and that differ in entries, as well as the matrix , we can construct in time.
Proof.
Let denote the entries that differ in and . Then we have
[TABLE]
This is a low rank perturbation, so by Woodbury matrix identity we get:
[TABLE]
where we use because is a symmetric matrix. To explicitly compute this matrix, we need to:
1. compute the matrix ,
2. compute ,
3. invert the middle term.
This cost is dominated by the first term, which can be viewed as multiplying pairs of and matrices. Each such multiplication takes time , for a total cost of . The other terms all involve matrices with dimension at most , and are thus lower order terms. ∎
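The three steps in the proof above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the maintained matrix has the form $A^\top \mathrm{Diag}(r) A$ for an $m$-by-$n$ matrix $A$ (the exact form is not recoverable from the text here), uses exact rational arithmetic on a tiny dense example, and ignores fast rectangular matrix multiplication entirely. All function names are hypothetical.

```python
from fractions import Fraction


def matmul(X, Y):
    """Plain dense matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def inverse(M):
    """Gauss-Jordan inverse over exact rationals (only sensible for tiny examples)."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(M)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def gram(A, r):
    """Form A^T Diag(r) A (assumed shape of the maintained matrix)."""
    scaled = [[r_i * a for a in row] for r_i, row in zip(r, A)]
    return matmul(transpose(A), scaled)


def woodbury_update(A, r_old, r_new, Z_old):
    """Return (A^T Diag(r_new) A)^{-1} given Z_old = (A^T Diag(r_old) A)^{-1}."""
    S = [i for i in range(len(r_old)) if r_old[i] != r_new[i]]
    if not S:
        return Z_old
    A_S = [A[i] for i in S]                    # the k rows whose weights changed
    AZ = matmul(A_S, Z_old)                    # step 1: k-by-n product
    core = matmul(AZ, transpose(A_S))          # step 2: k-by-k product
    for j, i in enumerate(S):                  # add the inverse of the weight change
        core[j][j] += 1 / Fraction(r_new[i] - r_old[i])
    middle = inverse(core)                     # step 3: invert the k-by-k middle term
    # Z_old is symmetric, so Z_old A_S^T equals the transpose of A_S Z_old.
    correction = matmul(matmul(transpose(AZ), middle), AZ)
    return [[z - c for z, c in zip(z_row, c_row)]
            for z_row, c_row in zip(Z_old, correction)]
```

For example, with `A = [[1, 0], [0, 1], [1, 1]]`, `r_old = [1, 2, 3]` and `r_new = [1, 4, 3]`, the updated inverse returned by `woodbury_update` agrees with inverting `gram(A, r_new)` from scratch, while only a 1-by-1 middle term is inverted.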
Note that the running time of Lemma 6.2 favours ‘batching’ a large number of modified edges to insert. To this end, we show that it suffices to have an inverse that only approximates some entries of . To do so, we first need to introduce our notions of approximations:
Definition 6.3.
We use for positive numbers and iff , and for vectors and we use to denote entry-wise.
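The inequality in this definition did not survive in the text above. The following minimal sketch assumes the common convention that two positive numbers $a$ and $b$ are $c$-approximations of each other when $c^{-1} a \le b \le c \cdot a$ for some $c \ge 1$, applied entry-wise to vectors. The function names are hypothetical.

```python
def approx_scalar(a: float, b: float, c: float) -> bool:
    """True iff a / c <= b <= c * a, for positive a, b and c >= 1 (assumed convention)."""
    if a <= 0 or b <= 0 or c < 1:
        raise ValueError("expected positive a, b and c >= 1")
    return a / c <= b <= c * a


def approx_vector(x, y, c: float) -> bool:
    """Entry-wise version: every pair of coordinates must be within a factor c."""
    return len(x) == len(y) and all(approx_scalar(a, b, c) for a, b in zip(x, y))
```

Under this convention the relation is symmetric: `approx_scalar(a, b, c)` holds exactly when `approx_scalar(b, a, c)` does.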
Since we are only updating resistances that have a constant factor increase and using a constant factor preconditioning for the others, we need the following result on preconditioned iterative methods for solving systems of linear equations.
Lemma 6.4.
If and are vectors such that , and we’re given the matrix explicitly, then we can solve a system of linear equations involving to accuracy in time.
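The lemma does not name a specific iterative method here. One standard choice is preconditioned Richardson iteration, sketched below under the assumption that the eigenvalues of the preconditioned matrix lie in $[1/c, c]$, which is what a constant-factor approximation of the weights would give. The callables `apply_M` and `apply_P_inv` and the parameter names are hypothetical; in the setting of this section `apply_P_inv` would multiply by the explicitly stored inverse.

```python
def preconditioned_richardson(apply_M, apply_P_inv, b, c, iterations):
    """Approximately solve M x = b, assuming eigenvalues of P^{-1} M lie in [1/c, c].

    apply_M(x) returns M x and apply_P_inv(y) returns P^{-1} y.
    """
    # This step size makes the error contract by (c - 1/c) / (c + 1/c) per iteration.
    step = 2.0 / (c + 1.0 / c)
    x = [0.0] * len(b)
    for _ in range(iterations):
        residual = [bi - mi for bi, mi in zip(b, apply_M(x))]
        correction = apply_P_inv(residual)
        x = [xi + step * ci for xi, ci in zip(x, correction)]
    return x


# Toy diagonal instance (hypothetical numbers): r_hat is within a factor 2 of r.
r = [1.0, 3.0, 8.0]
r_hat = [2.0, 2.0, 4.0]
b = [1.0, 1.0, 1.0]
x = preconditioned_richardson(
    lambda v: [ri * vi for ri, vi in zip(r, v)],
    lambda v: [vi / hi for hi, vi in zip(r_hat, v)],
    b, c=2.0, iterations=60,
)
```

Since the contraction factor depends only on $c$, the number of iterations needed grows with the logarithm of the target accuracy, and each iteration costs one multiplication by the matrix and one by the stored inverse.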
As the resistances we provide to Oracle are in the range , we get that each only needs to be updated times, instead of after each iteration. However, it’s insufficient to use this bound in the worst-case manner: if there are iterations, each of which doubles the resistances on edges, then the total cost as given by Lemma 6.2 becomes
[TABLE]
which is about .
We get an even better bound by using our analysis from Section 5 to show that for iteration/edge combinations and , the (relative) update to is small. Such small changes also imply that we can wait on such updates. For simplicity, suppose we only increment the resistances by factors of , then it takes such increments until the edge’s resistance has deviated by a constant factor. Furthermore, we can wait for another iterations before having to reflect this change in : the total relative increases in these iterations is also at most . Formalizing this process leads to a lazy-update routine that tracks the increments of different sizes separately. Its pseudocode is in Algorithms 5 and 6.
We will call the initialization routine InverseInit at the first iteration, and subsequently call UpdateInverse upon generating a new set of resistances in the call to Algorithm 3, Oracle. This is in turn called from Line 11 of Algorithm 4. As a result, we will assume access to all variables of these routines. Furthermore, our routines keep the following global variables:
1. : resistances from the last time we updated each entry.
2. : for each entry, track the number of times that it changed (relative to ) by a factor of about since the previous update.
3. , an inverse of the matrix given by .
We first verify that the maintained inverse is always a good preconditioner to the actual matrix, .
Lemma 6.5.
After each call to UpdateInverse, the vector satisfies
[TABLE]
Proof.
First, observe that any change in resistance exceeding is reflected immediately. Otherwise, every time we update , can only increase additively by at most
[TABLE]
Once exceeds , will be added to after at most steps. So when we start from , is added to after iterations. The maximum possible increase in resistance due to the bucket is,
[TABLE]
Since there are only at most iterations, the contributions of buckets with are negligible. Now the change in resistance is influenced by all buckets , each contributing at most increase. The total change is at most since there are at most buckets. We therefore have
[TABLE]
for every . ∎
It remains to bound the number and sizes of calls made to Lemma 6.2. For this we define variables
[TABLE]
to denote the number of edges added to at iteration due to the value of . Note that is non-zero only if , and
[TABLE]
The following lemma gives a lower bound on the relative change of energy across one update of resistances.
Lemma 6.6.
Assuming the program (4) is feasible, let be a solution to the optimization problem (4) with weights . Suppose we increase the resistance on each entry to Then,
[TABLE]
Proof.
Lemma 5.11 gives us that,
[TABLE]
Since and ,
[TABLE]
which gives our result. ∎
We divide our analysis into 2 cases, when the relative change in resistance is at least and when the relative change in resistance is at most . To begin with, let us first look at the following lemma that relates the change in weights to the relative change in resistance. The proof is in the Appendix.
Lemma 6.7.
Consider a flow step from Line 13 of Algorithm 4. We have
[TABLE]
where is the minimizer solution produced by the oracle.
Let us now see what happens when the relative change in resistance is at least .
Lemma 6.8.
Throughout the course of a run of Algorithm 4, the number of edges added to due to relative resistance increase of at least ,
[TABLE]
Proof.
From Lemma 6.6, we know that the relative change in energy over one iteration is at least,
[TABLE]
Over all iterations, the relative change in energy is at least,
[TABLE]
which is upper bounded by . When iteration is a width reduction step, the relative resistance change is always at least . In this case . When we have a flow step, Lemma 6.7 implies that when the relative change in resistance is at least then,
[TABLE]
This gives, . Using this bound on is sufficient since and both kinds of iterations are accounted for. The total relative change in energy can now be bounded.
[TABLE]
The Lemma follows by substituting in the above equation. ∎
Lemma 6.9.
Throughout the course of a run of Algorithm 4, the number of edges added to due to relative resistance increase between and ,
[TABLE]
Proof.
From Lemma 6.6, the total relative change in energy is at least,
[TABLE]
We know that . Using Lemma 6.7, we have,
[TABLE]
We can bound as,
[TABLE]
Now, in the second case, when and ,
[TABLE]
For both cases we get,
[TABLE]
Using the above bound and the fact that the total relative change in energy is at most , gives,
[TABLE]
The Lemma follows by substituting in the above equation. ∎
We can now use the concavity of to upper bound the contribution of these terms.
Corollary 6.10.
Let be as defined. Over all iterations we have,
[TABLE]
and for every ,
[TABLE]
Proof.
Due to the concavity of the power, this total is maximized when it’s equally distributed over all iterations. In the first sum, the number of terms is equal to the number of iterations, i.e., . In the second sum the number of terms is . Distributing the sum equally over the above numbers gives,
[TABLE]
and
[TABLE]
∎
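The concavity argument in this proof can be checked numerically with a small sketch: for an exponent between 0 and 1, a sum of powers with a fixed total is largest when the total is split evenly. The exponent 0.373 below is only an assumed stand-in for $\omega - 2$, and the batch sizes are hypothetical.

```python
def batch_cost(sizes, exponent):
    """Sum of k ** exponent over the batch sizes k."""
    return sum(k ** exponent for k in sizes)


def equal_split_bound(total, terms, exponent):
    """Value of the sum when `total` is spread evenly over `terms` batches."""
    return terms * (total / terms) ** exponent


exponent = 0.373             # assumed stand-in for omega - 2
even = [30, 30, 30, 30]      # total of 120 spread evenly over 4 iterations
skewed = [117, 1, 1, 1]      # same total, concentrated in one iteration

# The even split attains the bound; the skewed split stays below it.
assert abs(batch_cost(even, exponent) - equal_split_bound(120, 4, exponent)) < 1e-9
assert batch_cost(skewed, exponent) < batch_cost(even, exponent)
```

This is the Jensen-type inequality behind the corollary: any schedule of update sizes with the same total costs no more than the evenly distributed one.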
Proof.
(of Theorem 6.1) By Lemma 6.5, the that the inverse being maintained corresponds to always satisfies . So by the iterative linear systems solver method outlined in Lemma 6.4, we can implement each call to Oracle (Section 5.2) in time in addition to the cost of performing inverse maintenance. This leads to a total cost of
[TABLE]
across the iterations.
The cost of inverse maintenance is dominated by the calls to the low-rank update procedure outlined in Lemma 6.2. Its total cost is bounded by
[TABLE]
Because there are only values of , and each is non-negative, we can bound the total cost by:
[TABLE]
where the inequality follows from substituting in the result of Corollary 6.10. Depending on the sign of , this sum is dominated either at or . Including both terms then gives
[TABLE]
with the exponent on the trailing term simplifying to to give,
[TABLE]
∎
7 Other Regression Formulations
In this section we discuss how various variants of -norm regression can be translated into our setting of
[TABLE]
As we will address a multitude of problems, we will make generic numerical assumptions to simplify the derivations.
1. .
2. All entries in and are at most . Note that this implies that the maximum singular value of , and are at most .
3. .
4. The minimum non-zero singular value of , is at least .
Note that these conditions also imply bounds on the optimum value and the optimum solution : Specifically,
[TABLE]
and
[TABLE]
7.1 Affine transformations within the norm
Let be a matrix with the same assumptions as and have assumptions similar to . Suppose we are minimizing instead of , i.e.,
[TABLE]
Note that this can be reduced to the following unconstrained problem,
[TABLE]
To see this, first find the null space of , as well as a particular solution that satisfies . Let the null space of be generated by the matrix . Then the space of solutions can be parameterized as
[TABLE]
for some vector . Now our objective becomes,
[TABLE]
which can be written as,
[TABLE]
where and . Observe that spans the column space of . Decomposing into a linear combination of an orthonormal basis we could combine the part which is in the span of with . We can thus replace in the objective with an orthonormal basis of its column space and replace by , a vector orthogonal to all columns of . Then any vector
[TABLE]
can equivalently be described by the conditions
[TABLE]
For the last condition, it suffices to generate an orthonormal basis of the null space of . So the problem can be written as a linear constraint on instead.
7.2
In case we instead solve the dual problem:
[TABLE]
for . We can rescale the above problem to the equivalent -norm ball-constrained projection problem,
[TABLE]
where the goal is to check whether the optimum is less than . This problem is covered by the problem introduced in Section 7.1 and can thus be solved to high accuracy in the desired time.
It remains to transform a nearly-optimal solution of this -norm ball-constrained projection problem to a nearly-optimal solution of the original subspace -norm minimization problem. Since both of these problems’ solutions are invariant under scalings to or , we may also assume that the optimum is at most .
Lemma 7.1.
If the optimum of
[TABLE]
is at most , and we have some such that
[TABLE]
then the gradient of ,
[TABLE]
satisfies
[TABLE]
Proof.
Let and be a polynomial such that . By the assumption of and being bounded, the above function is polynomially bounded. Let be a polynomial in such that, . Suppose,
[TABLE]
This gives us,
[TABLE]
Now consider the solution
[TABLE]
for step size . Lemma 4.5 and Lemma 3.3 give
[TABLE]
We can scale the solution up by a factor of to get a solution with objective value
[TABLE]
But by the assumption of being the optimum, this cannot exceed , so we get
[TABLE]
or
[TABLE]
which combined with the choice of gives which is a contradiction. So we must have, ∎
This means once , the solution created from the gradient
[TABLE]
satisfies
[TABLE]
Also, because , we can create a solution from by doing a least squares projection on this difference. This gives:
[TABLE]
and
[TABLE]
Furthermore, note that because
[TABLE]
we have
[TABLE]
so
[TABLE]
Thus, for sufficiently small , we can get high accuracy answer to the -norm problem as well.
8 -Norm Optimization on Graphs
In this section we discuss the performance of our algorithms on graphs. Here instead of invoking general linear algebraic routines, we instead invoke Laplacian solvers, which provide accuracy solutions to Laplacian linear equations in nearly-linear () time [ST14, KMP14, KMP11, Kel+13, Coh+14, PS14, Kyn+16, KS16], and the current best running time is (up to factors) [Coh+14].
Such matrices can be succinctly described as
[TABLE]
where is the vector of resistances just as provided in the Oracle from Algorithm 3, but is the edge-vertex incidence matrix: with each row corresponding to an edge, each column corresponding to a vertex, and entries given by:
[TABLE]
Throughout this entire section, we will use to refer to the edge-vertex incidence matrix of a graph.
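For concreteness, the following minimal sketch builds an edge-vertex incidence matrix and the corresponding weighted Laplacian for a small graph. The sign convention (+1 at the first endpoint of an edge, -1 at the second) is an assumption, since the defining formula is not recoverable here. The function takes generic edge weights; in the electrical setting these would typically be the reciprocals of the resistances, but that choice is likewise left to the caller. All names are hypothetical.

```python
def incidence_matrix(num_vertices, edges):
    """One row per edge (u, v): +1 in column u, -1 in column v (assumed sign convention)."""
    B = []
    for u, v in edges:
        row = [0] * num_vertices
        row[u] = 1
        row[v] = -1
        B.append(row)
    return B


def weighted_laplacian(B, weights):
    """Form B^T Diag(weights) B as a dense list of rows."""
    n = len(B[0])
    L = [[0.0] * n for _ in range(n)]
    for row, w in zip(B, weights):
        for i in range(n):
            if row[i] == 0:
                continue
            for j in range(n):
                L[i][j] += w * row[i] * row[j]
    return L


# Path graph 0 - 1 - 2 with hypothetical edge weights.
B = incidence_matrix(3, [(0, 1), (1, 2)])
L = weighted_laplacian(B, [2.0, 0.5])
```

Every row of the resulting matrix sums to zero, which is the structural property that Laplacian solvers exploit.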
The main difficulty of reducing to Laplacian solvers is that we can no longer manipulate general matrices. Specifically, instead of directly working with the normal matrices as in Section 7.1, we need to implicitly track the subspaces, and optimize quadratics on them. As a result, we need to tailor such reductions towards the specific problems.
8.1 -Norm Flows
This is closest to the general regression problem that we study:
[TABLE]
except with as an edge vertex incidence matrix.
When , the residual problem then has an extra condition of
[TABLE]
which means we need to solve the problem of
[TABLE]
which becomes a solve in the system of linear equations
[TABLE]
This matrix is a rank perturbation to the graph Laplacian , and can thus be solved in time. A more detailed analysis of a generalization of this case can be found in Appendix B of [DS08].
When , we invoke the dualization from Section 7.2 to obtain
[TABLE]
and if we retain the form of , but transfer the gradient over to , the problem that we get is:
[TABLE]
The two additional linear constraints can be removed by writing a variable of as a linear combination of the rest (as well as ). This then gives an unconstrained minimization problem on a subset of entries ,
[TABLE]
where is a minor of the Laplacian above, and this solution is obtained by solving for
[TABLE]
8.2 Lipschitz Learning and Graph Labelling
This problem asks to label the vertices of a graph, with a set fixed to the vector , while minimizing the -norm difference between neighbours. It can be written as
[TABLE]
where is the edge-vertex incidence matrix.
In the case of , the residue problem becomes
[TABLE]
Here the gradient condition can be handled in the same way as with the voltage problem above: by fixing one additional entry of , and then solving an unconstrained quadratic minimization problem on the rest of the variables.
In the case of , we first write down the problem as an unconstrained minimization problem on :
[TABLE]
Let and taking the dual gives:
[TABLE]
That is, solving for a small -norm flow that maximizes the cost against , while also having 0 residues at the vertices not in .
As , we can now invoke our main algorithm on . Upon binary search, and taking residual problems, we get problems of the form
[TABLE]
which is solved by another low rank perturbation on a minor of the graph Laplacian.
Appendix A Missing Proofs
A.1 Proofs from Section 3
See 3.2
Proof.
1. We have . When ,
[TABLE]
Otherwise,
[TABLE]
Let . At we have . Now, . This means that for negative, decreases faster than and for positive, increases faster than . The two functions are equal in the range . Therefore, for all .
2. [TABLE]
3. Taking the derivative of with respect to gives,
[TABLE]
The statement clearly follows.
∎
See 3.3
Proof.
[TABLE]
Now, when , we have the following. When ,
[TABLE]
and when ,
[TABLE]
The above computations imply that,
[TABLE]
Let and . Integrating both sides of the right inequality gives,
[TABLE]
Integrating both sides of the left inequality from to gives the required left inequality. Now, let . Integrating both sides of the left inequality gives,
[TABLE]
Similar to the previous case, integrating both sides of the right inequality from to gives the required left inequality. When , the direction of the inequality changes but it gets reversed again after putting limits, since we integrate from to when and to when . We thus have,
[TABLE]
∎
See 3.4
Proof.
Since and is increasing in it suffices to prove the claim for We have,
[TABLE]
Integrating over we get,
[TABLE]
∎
A.2 Proofs from Section 4
See 4.5
Proof.
We first show the following two lemmas.
Lemma A.1.
For and ,
[TABLE]
Proof.
Let us first show the left inequality, i.e. . Define the following function,
[TABLE]
When , . The derivative of with respect to is, .
When and ,
[TABLE]
For the last inequality, note that when the product is positive, either both terms are positive or both terms are negative. When both terms are positive, subtracting instead of gives a larger positive quantity. When both terms are negative then subtracting instead of gives only a smaller quantity, so the inequality holds. This shows that , which means minimum of is at . Next let us see what happens when and .
[TABLE]
This implies that is an increasing function of and for which is where attains its minimum value. The only point where is 0 is . This implies . This concludes the proof of the left inequality. For the right inequality, define:
[TABLE]
Note that and . We have,
[TABLE]
Using the mean value theorem for and ,
[TABLE]
This implies that for negative alpha. When , using the convexity of for , we get,
[TABLE]
which gives us
[TABLE]
This implies, for positive . The function is thus increasing for positive and decreasing for negative , so it attains the minimum at [math] which is giving us . We now look at the case . We have
[TABLE]
Using this, we get, which says is positive for positive and negative for negative. Thus the minima of is at 0 which is [math]. So in this range too.
∎
Lemma A.2.
For and , .
Proof.
for . So the claim clearly holds for since . When , , so the claim holds since, ∎
We now prove the theorem.
Let . The term . Let us first look at the case when . We want to show,
[TABLE]
This follows from Lemma A.1 and the facts and . We next look at the case when . Now, . We need to show
[TABLE]
When it is trivially true. When , let
[TABLE]
Now, taking the derivative with respect to we get,
[TABLE]
When and ,
[TABLE]
So we have . When , we use the mean value theorem to get,
[TABLE]
which implies in this range as well. When it follows from Lemma A.2 that . So the function is increasing for and decreasing for . The minimum value of is . It follows that which gives us the left inequality. The other side requires proving,
[TABLE]
Define:
[TABLE]
The derivative is non negative for and non positive for . The minimum value taken by is which is non negative. This gives us the right inequality.
∎
See 4.8
Proof.
Let give the OPT. We know that, for any ,
[TABLE]
This along with the fact gives us,
[TABLE]
∎
A.3 Proofs from Section 5
See 5.3
Proof.
Let denote the optimum solution of (* ‣ 1) and be as defined in Definition 4.7. We know that for any ,
[TABLE]
This along with the fact gives us,
[TABLE]
Now from Lemma 4.6 we have,
[TABLE]
Let us assume . If this is not true we already have an approximate solution to our problem. We thus have the following bound on ,
[TABLE]
This gives us that,
[TABLE]
When , following a similar proof and using,
[TABLE]
we get,
[TABLE]
thus concluding the proof of the lemma. ∎
See 5.4
Proof.
Assume that the optimum solution to (1), satisfies
[TABLE]
in addition to Note that we know that the objective is strictly positive (as 0 is a feasible solution). Since we must have,
[TABLE]
Consider scaling by a factor Since is optimal, we must have
[TABLE]
Now, from Lemma 3.3, we know that
[TABLE]
Thus, we get,
[TABLE]
[TABLE]
Thus, and hence .
Now consider the vector where Note that We have
[TABLE]
Thus, is a feasible solution to Program (3). A -approximate solution must be such that,
[TABLE]
Now, we consider for some We have, and,
[TABLE]
We can pick,
[TABLE]
In either case, we get,
[TABLE]
Since we assumed that the optimum of Program (1) is at most this implies that achieves an objective value for Program (1) that is within an fraction of the optimal. ∎
See 5.5
Proof.
We choose such that (3) is feasible, i.e., there exists such that,
[TABLE]
Scaling both and to and gives us the following.
[TABLE]
Now, let . We claim that when , . To see this, for a single , let us look at the difference . If the difference is [math]. Otherwise from the proof of Lemma 5 of [Bub+18],
[TABLE]
When , we claim that . Again if the difference is [math]. Otherwise,
[TABLE]
To see the last inequality, when , we require, which is true. When , it directly follows. Summing over all gives us our claims. We know that . Thus, . Next we set . Note that for all . Lemma 3.3 thus implies,
[TABLE]
Define . Note that since and as a result we have . Observe that is a feasible solution of (2) thus suggesting that for problem (2) . Let be a - approximate solution to (2), i.e.,
[TABLE]
When , is an increasing function of giving us,
[TABLE]
When ,
[TABLE]
This gives,
[TABLE]
and Lemma 3.3 then implies,
[TABLE]
Finally, satisfies the constraints of (3) and is a approximate solution. ∎
See 5.6
Proof.
Using Hölder’s inequality, we have,
[TABLE]
∎
A.4 Proofs from Section 6
See 6.7
Proof.
Recall from the setting of resistances from Line 2 of Oracle (Algorithm 3) that
[TABLE]
By Line 13 of Algorithm 4, we have
[TABLE]
Substituting this in gives
[TABLE]
There are two cases to consider:
1. .
[TABLE]
where the last inequality utilizes , which is due to the assumption and .
2. , then replacing the denominator with the term and simplifying gives
[TABLE]
As the function is monotonically increasing when , we may replace the by its upper bound of (given by the assumption) to get
[TABLE]
where the last inequality follows from .
∎
Appendix B Controlling
See 5.10
Proof.
We prove this claim by induction. Initially, and and thus, the claim holds trivially. Assume that the claim holds for some We will use as an abbreviated notation for below.
Flow Step.
For brevity, we let denote and use to denote .
If the next step is a flow step,
[TABLE]
From the inductive assumption, we have
[TABLE]
Thus,
[TABLE]
proving the inductive claim.
Width Reduction Step.
To analyze a width-reduction step, we first observe that, by Lemma 5.7 and the induction hypothesis, which ensures , and hence so we have
[TABLE]
Thus, when the next step is a width-reduction step, we have,
[TABLE]
Thus,
[TABLE]
proving the inductive claim.
∎
Appendix C Solving L2 problems
Lemma C.1.
Given an algorithm Solver for solving for a -fixed matrix a fixed positive diagonal matrix and an arbitrary vector there is an algorithm EnhancedSolver that can solve
[TABLE]
with one call to Solver, two multiplications of with a vector, and an additional time, if we assume
[TABLE]
Proof.
Introducing the Lagrangian multipliers respectively for the constraint and we can write the Lagrangian as
[TABLE]
Now, optimizing the Lagrangian with respect to an unconstrained , allows us to write
[TABLE]
Plugging this back, we can simplify our Lagrangian as
[TABLE]
Optimizing with respect to gives us,
[TABLE]
Plugging this back, gives the Lagrangian as
[TABLE]
We let denote the vector and denote the matrix Thus, the Lagrangian can be written as,
[TABLE]
This implies that the optimal is given by the equation
[TABLE]
From the condition assumed on we have Thus, we can solve this system using the Sherman-Morrison formula as follows,
[TABLE]
The algorithm EnhancedSolver computes and then invokes Solver to compute This allows us to compute in an additional time. Finally, we can compute using another multiplication with and an additional time. ∎
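The Sherman-Morrison step used in this proof can be illustrated generically. The sketch below is not the paper's EnhancedSolver: it solves a rank-one perturbed system $(M + u v^\top) x = b$ given only a black-box solver for $M$, and for simplicity it calls that solver twice. The names `solve_M`, `u`, `v`, and `b` are hypothetical.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def sherman_morrison_solve(solve_M, u, v, b):
    """Solve (M + u v^T) x = b using only solves with M.

    Uses x = y - z * (v . y) / (1 + v . z), where y = M^{-1} b and z = M^{-1} u.
    """
    y = solve_M(b)
    z = solve_M(u)
    denom = 1.0 + dot(v, z)
    if denom == 0.0:
        raise ZeroDivisionError("M + u v^T is singular")
    scale = dot(v, y) / denom
    return [yi - scale * zi for yi, zi in zip(y, z)]


# Toy instance (hypothetical numbers): M = Diag(2, 4), u = v = (1, 1).
diag = [2.0, 4.0]
x = sherman_morrison_solve(
    lambda rhs: [ri / di for ri, di in zip(rhs, diag)],
    u=[1.0, 1.0], v=[1.0, 1.0], b=[3.0, 5.0],
)
```

Beyond the solver calls, the extra work consists only of inner products and vector updates, which is why a rank-one correction adds time linear in the dimension.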
Appendix D General Resistance Monotonicity
See 5.11
Proof.
Recall
[TABLE]
Letting denote the diagonal matrix with on its diagonal, we can write the above as
[TABLE]
Using Lagrangian duality, and noting that strong duality holds, we can write this as
[TABLE]
The minimizing can be found by setting the gradient w.r.t. this variable to zero. This gives , so that . Plugging in this choice of , we arrive at the dual program
[TABLE]
Crucially, strong duality also implies that if is an optimal solution of the primal program (9), and is an optimal solution to the dual then
[TABLE]
is optimized at . This in turn implies the gradient w.r.t. at is zero, so that . Let be the th row of . Then the previous equation tells us that . This implies that
[TABLE]
Consider another program, essentially the same as (10), but with additional scalar valued variable introduced.
[TABLE]
The two programs (10) and (12) have the same value, since for any , the assignment ensures both objectives take the same value, and conversely for any , the assignment ensures both programs take the same value.
We see that and is an optimal solution to (12). Hence
[TABLE]
Consequently, . Hence . Again, by a scaling argument, this implies that
[TABLE]
So that
[TABLE]
Note that one optimal assignment for the program (13) is . Also observe that if we consider the program (13) with instead of as the resistances, then is still a feasible solution. Hence, using the observation (11), we get
[TABLE]
Factoring out the term gives
[TABLE]
Now consider the term : if , then it is at least . Otherwise, it can be rearranged to
[TABLE]
So in either case, we have
[TABLE]
which upon rearranging gives the desired result. ∎
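The duality argument above can be sanity-checked numerically on a toy instance. The sketch below is an assumption-laden simplification, not the paper's program: it takes the energy to be $\min_{\Delta} \sum_e r_e \Delta_e^2$ subject to a single linear constraint $a^\top \Delta = c$, with positive resistances $r$. Under that assumption, setting the gradient of the Lagrangian $\sum_e r_e \Delta_e^2 + 2y(c - a^\top \Delta)$ to zero gives $\Delta_e = y\, a_e / r_e$, and the dual objective becomes $2yc - y^2 \sum_e a_e^2 / r_e$. The function names are hypothetical.

```python
def min_energy(r, a, c):
    """Minimize sum_e r[e] * x[e]**2 subject to sum_e a[e] * x[e] == c.

    Returns (optimal value, minimizer x, optimal dual multiplier y).
    Assumes every r[e] > 0 and a is not the zero vector.
    """
    s = sum(ae * ae / re for ae, re in zip(a, r))
    y = c / s                                   # maximizer of the dual objective
    x = [y * ae / re for ae, re in zip(a, r)]   # stationarity: r_e x_e = y a_e
    value = sum(re * xe * xe for re, xe in zip(r, x))
    return value, x, y


def dual_objective(r, a, c, y):
    """Dual objective 2*y*c - y**2 * sum_e a[e]**2 / r[e] for this toy program."""
    s = sum(ae * ae / re for ae, re in zip(a, r))
    return 2.0 * y * c - y * y * s


a = [1.0, 1.0, 1.0]
c = 1.0
r = [1.0, 2.0, 4.0]
r_larger = [2.0, 2.0, 4.0]   # entrywise at least as large as r

value, x, y = min_energy(r, a, c)
value_larger, _, _ = min_energy(r_larger, a, c)

# Strong duality: primal and dual optimal values coincide.
duality_gap = abs(value - dual_objective(r, a, c, y))
# Monotonicity in this toy setting: increasing resistances cannot lower the energy.
energy_increase = value_larger - value
```

In this simplified setting the optimal value has the closed form $c^2 / \sum_e (a_e^2 / r_e)$, so both strong duality and monotonicity in $r$ can be read off directly. The lemma in the paper gives a quantitative version of this monotonicity.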
References
- [Abr+16] Ittai Abraham et al. “On Fully Dynamic Graph Sparsifiers” Available at: http://arxiv.org/abs/1604.02094 In Symposium on Foundations of Computer Science (FOCS), 2016, pp. 335–344
- [Adi+] Deeksha Adil, Rasmus Kyng, Richard Peng and Sushant Sachdeva “Iterative Refinement for $\ell_p$-norm Regression” In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1405–1424 DOI: 10.1137/1.9781611975482.86
- [AHK12] Sanjeev Arora, Elad Hazan and Satyen Kale “The Multiplicative Weights Update Method: a Meta-Algorithm and Applications.” In Theory of Computing 8.1, 2012, pp. 121–164
- [All+17] Zeyuan Allen-Zhu, Yuanzhi Li, Rafael Mendes Oliveira and Avi Wigderson “Much Faster Algorithms for Matrix Scaling” Available at: https://arxiv.org/abs/1704.02315 In 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, Berkeley, CA, USA, October 15-17, 2017, pp. 890–901
- [Axe94] Owe Axelsson “Iterative Solution Methods” New York, NY: Cambridge University Press, 1994
- [Bub+18] Sébastien Bubeck, Michael B. Cohen, Yin Tat Lee and Yuanzhi Li “An Homotopy Method for Lp Regression Provably Beyond Self-concordance and in Input-sparsity Time” In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018 Los Angeles, CA, USA: ACM, 2018, pp. 1130–1137 DOI: 10.1145/3188745.3188776
- [Bul18] Brian Bullins “Fast minimization of structured convex quartics” https://arxiv.org/abs/1812.10349 In CoRR abs/1812.10349, 2018
- [Chi+13] Hui Han Chin, Aleksander Madry, Gary L. Miller and Richard Peng “Runtime guarantees for regression problems” Available at: http://arxiv.org/abs/1110.1358 In Proceedings of the 4th Conference on Innovations in Theoretical Computer Science, ITCS ’13 Berkeley, California, USA: ACM, 2013, pp. 269–282
