Improved Convergence for $\ell_\infty$ and $\ell_1$ Regression via Iteratively Reweighted Least Squares
Alina Ene, Adrian Vladu

TL;DR
This paper introduces a simple IRLS algorithm for $\, ext{l}_ ext{infinity}$ and $ ext{l}_1$ regression that converges efficiently with a running time independent of input conditioning, outperforming previous methods.
Contribution
A new natural IRLS algorithm with provable convergence for $\, ext{l}_ ext{infinity}$ and $ ext{l}_1$ regression, featuring a runtime independent of input conditioning and sublinear in $1/\, ext{ extepsilon}$.
Findings
Converges to a $(1+ ext{ extepsilon})$-approximate solution in specified iterations.
Runtime is independent of input matrix conditioning.
Outperforms previous algorithms by at least a factor of $1/ ext{ extepsilon}^2$.
Abstract
The iteratively reweighted least squares method (IRLS) is a popular technique used in practice for solving regression problems. Various versions of this method have been proposed, but their theoretical analyses failed to capture the good practical performance. In this paper we propose a simple and natural version of IRLS for solving and regression, which provably converges to a -approximate solution in iterations, where is the number of rows of the input matrix. Interestingly, this running time is independent of the conditioning of the input, and the dominant term of the running time depends sublinearly in , which is atypical for the optimization of non-smooth functions. This improves upon the more complex algorithms of Chin et al. (ITCS '12), and Christiano et al.…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSlime Mold and Myxomycetes Research · Topological and Geometric Data Analysis · Cell Image Analysis Techniques
Improved Convergence for and Regression via Iteratively Reweighted Least Squares
Alina Ene Department of Computer Science, Boston University, [email protected].
Adrian Vladu Department of Computer Science, Boston University, [email protected].
Abstract
The iteratively reweighted least squares method (IRLS) is a popular technique used in practice for solving regression problems. Various versions of this method have been proposed, but their theoretical analyses failed to capture the good practical performance.
In this paper we propose a simple and natural version of IRLS for solving and regression, which provably converges to a -approximate solution in iterations, where is the number of rows of the input matrix. Interestingly, this running time is independent of the conditioning of the input, and the dominant term of the running time depends sublinearly in , which is atypical for the optimization of non-smooth functions.
This improves upon the more complex algorithms of Chin et al. (ITCS ’12), and Christiano et al. (STOC ’11) by a factor of at least , and yields a truly efficient natural algorithm for the slime mold dynamics (Straszak-Vishnoi, SODA ’16, ITCS ’16, ITCS ’17).
1 Introduction
Regression problems are fundamental primitives in scientific computing. Among these, - and -regression are their hardest instantiations, since through standard reductions they can be shown to be equivalent to linear programming.
While the series of works on these topics is truly extensive and diverse, the simpler methods have pervaded into the realm of practical applications. Among these, an extremely popular scheme known for its simplicity is the iteratively re-weighted least squares (IRLS) method. The idea behind it is to reduce optimization problems to iteratively solving a series of weighted -minimization problems, where the weights are adaptively chosen in such a way that the resulting solutions from the sequence of least-squares problems converge to the sought optimal point. In particular, due to its relevance in signal processing, regression is a very important application of IRLS [C*+*06, CY08].
Despite the fact that various versions of this method have been studied ever since the 60’s [Law61, Osb85] theoretical understanding of their convergence has lacked. Recent works have attempted to fill this gap, and offer provable guarantees [DDFG10, SV16c, SV16b, SV16a], some of them inspired from the interpretation of this method as a dynamical system. In particular, we note the Physarum dynamics, which have been studied in a completely different context [IJNT11, JZ12, TKN07, BMV12, BBD*+*13] in order to justify an experiment which revealed that a unicellular organism, the slime mold, could solve the shortest path problem in a maze [NYT00]. The fact that these dynamics are essentially just a version of the IRLS method was observed in [SV16c].
Returning to the more classical world of algorithm design and analysis, it is worth observing that existing analyses of IRLS methods fall into one of the following two categories: (i) they show convergence only when the problem is properly initialized, or (ii) the guaranteed running time is prohibitive in the sense that it is highly dependent on how the input is conditioned, or it has a high polynomial dependency on the desired solution accuracy.
In this paper, we focus on analyzing simple versions of IRLS which overcome both aforementioned obstacles. In particular, our methods always converge to multiplicative approximation for the objectives , , in iterations111We use notation to suppress polylogarithmic factors in . of solving a weighted least squares problem, where is the dimension of the sought vector .
Inspiration for our methods is heavily drawn from the work of [CKM*+*11], which offered a ground-breaking result by showing that in undirected graphs, a -approximate maximum flow can be found in iterations (subsequently the dependence was improved to , see [CMMP13]) of solving a weighted least squares problem – which in conjunction with efficient Laplacian system solvers, broke a longstanding barrier for fast graph algorithms. While these algorithms generalize to arbitrary and regression problems, they are somewhat involved, in particular due to the fact that they are the product of combining the multiplicative weights update method with a regularization technique, and a second potential function.222To be more specific, Christiano et al. solve the approximate maximum flow problem, which is a specific instance of regression. Chin et al. build on this work to solve regression with block structure; the block structure is relevant for their specific applications, but is a direct extension of the method, so solving vanilla regression is still the main problem tackled there.
Instead, our method attempts to directly solve the non-smooth objective while tracking a single potential function. The number of iterations looks surprising, since the dominant term is , whenever , while classical techniques for optimizing non-smooth functions require a number of iterations that depends on the product between the function’s parameters (such as Lipschitz constant of the gradient or radius of the domain), and in the best case, when accelerated methods are used; see [Nes05] for more details.
333We emphasize that using off-the-shelf methods, without further assumptions on the input, the number of iterations of any standard optimization method would be even for the very special instances where the affine constraint corresponds to a flow satisfying a given demand in unweighted graphs, and in general will depend on how the input matrix is conditioned, since this conditioning determines the magnitude of the subgradients or the diameter of the domain we are optimizing over. The breakthrough of Christiano et al. was the first work that managed to reduce this dependence for maximum flow, which is a specific instance of the regression problem.
Interestingly, a line of work that yielded results very similar in spirit to ours is that of approximately solving positive linear and semidefinite programs [You01, AZO15, AZLO16], where the goal was to produce a first order optimization method that can be implemented in a number of iterations independent of the conditioning of the input. Improving the dependence to is an important open problem in this subfield.
We believe that understanding the connection between these results can pave the way for designing new efficient optimization primitives.
1.1 Main Theorem
We state the main theorem of this paper. It follows from the correctness proofs described in Sections 3.1 and 3.2, and the convergence proofs from Lemmas 4.6 and 4.8.
Theorem 1.1**.**
There exist algorithms -Minimization and -Minimization such that, on input , where is a matrix, is a vector which lies in the span of ’s columns, is an accuracy parameter, and is a target value:
-Minimization* returns a solution such that , and , or certifies that .* 2. 2.
-Minimization* returns a solution such that , and , or certifies that .*
Furthermore both algorithms finish in
[TABLE]
iterations, each of which can be implemented in the time required to solve a linear system of the form , where is an arbitrary nonnegative diagonal matrix.
While our theorem statements are concerned with approximately solving a decision problem which requires a guess on the value of the objective, it follows from standard techniques that this can be used to find a good approximation to the optimal solution without paying more than a overhead in the number of iterations. For completeness, we provide the details in Section 6.
1.2 Relation to Previous IRLS Methods and Slime-Mold Dynamics
A popular method for solving minimization is the iteratively re-weighted least squares method (IRLS). This is essentially based on the observation that whenever , one also has that this is the minimizer of the least squares problem .444Throughout the paper we use the convention that . Hence one approach that has been employed ever since the 60’s [Law61, Osb85, DDFG10] is to iteratively adjust the weighting of the coordinates and re-solve the least squares problem, until converges to a stationary point. This is rigorously described by the iteration
[TABLE]
We abuse notation by applying scalar operations to vectors, with the meaning that they are applied element-wise.
Subsequent works attempted to rigorously analyze this iteration and prove convergence bounds. Oftentimes this relied on specific structure, such as being sparse [DDFG10]. A recent series of works drew inspiration from convergence proofs for the slime-mold dynamics – a method which essentially solves minimization, based on a model used to describe the evolution of a slime mold (Physarum polycephalum) as it spreads through its environment in order to optimize its access to food sources [NYT00, TKN07]. Based on the intuition that these dynamics yield a method for solving the transportation problem, Straszak and Vishnoi proved in a series of works [SV16c, SV16b, SV16a] that this is as a matter of fact equivalent to the IRLS method, and provided a rigorous convergence analysis for a damped version of it:
[TABLE]
Unfortunately their convergence proof shows that this method is highly inefficient, and the time to convergence has a high polynomial dependence in the desired accuracy, and the structure of the linear constraint.
By comparison, what we describe in this work is an IRLS method where the weights are updated according to a thresholding rule. Given a guess for the optimal value, we perform an iteration equivalent to:
[TABLE]
where is a thresholding operator i.e. , if , and otherwise. Intuitively, this increases the weights only for the elements where the corresponding component of the quadratic objective contributes significantly, therefore we want to favor increasing it even more in the future by decreasing the weight we place on this coordinate.555Another way to think of this is that, ignoring the thresholding operator, the update would simply be , where is some normalization factor. What thresholding achieves here is to decide whether the contribution of a particular coordinate to the energy of the system is sufficiently large compared to the contributions of the entire vector .
2 Preliminaries
2.1 Basic Notation
Sets.
We let be the set of real numbers. For any natural number , we write . We denote by the -dimensional simplex i.e. \Delta_{m}=\{\boldsymbol{\mathit{p}}\in\mathbb{{R}}^{m}:\sum_{i=1}^{m}\boldsymbol{\mathit{p}}_{i}=1,\boldsymbol{\mathit{p}}_{i}\geq 0\textnormal{ for all i}\}.
Vectors.
We let denote the all zeros and all ones vectors, respectively. When it is clear from the context, we apply scalar operations to vectors with the interpretation that they are applied coordinate-wise.
Matrices.
We write matrices in bold. We use to denote the identity matrix. Given a vector we let be the diagonal matrix whose entries are given by . For a symmetric matrix , we let be its Moore-Penrose pseudoinverse, i.e. . The pseudoinverse can be thought of as replacing all the nonzero eigenvalues of with their reciprocals.
Inner products.
When it is convenient, we use notation to denote inner products. Given two vectors of equal dimensions, we let .
Norms.
Given a vector , we denote the norm of by . When the subscript is dropped, we refer to the norm. From this definition, we can also see that .
2.2 Proof Technique
Let us first understand the idea behind our minimization algorithm. The problem we aim to solve is . Letting be the -dimensional unit simplex, we can write our objective equivalently as
[TABLE]
where the second identity follows from Sion’s theorem [Sio58], which allows us to interchange min and max. The quantity between the parentheses has a very natural interpretation, in the case of electrical networks: it is precisely the electrical energy required to route a demand through an electrical network encoded in . Furthermore, we have an easy way to lower bound how this energy increases whenever resistances are increased, which is a finer quantitative version of Rayleigh’s monotonicity principle. More precisely, we can easily certify a lower bound on the increase in energy determined by increasing a single coordinate of . Using this observation, which we make more precise in Section 4.2, we can identify a set of coordinates of to increase, guaranteeing that if is the new vector with perturbed resistances, we have
[TABLE]
for a fixed parameter . In the case when no coordinates of can be increased, while preserving this property, this yields a certificate that is as a matter of fact (close to) optimal, and thus we are done (Lemma 4.5). Hence our goal becomes that of guaranteeing that increases very fast. Indeed, since the ”electrical energy” increases at the right rate relative to , after the latter has increased sufficiently, we can safely guarantee that , since the increase in cancels out most of the initial error introduced by starting with a potentially poor solution.
The minimization algorithm relies on squaring the objective, and then writing it equivalently as
[TABLE]
For the first identity we used the fact that , achieved at ; see [Owe07, SZ12] for further use of this trick.666Interestingly, this can also be thought of as achieving tightness for reverse Hölder’s inequality whenever we are considering the dual ‘norms’ and . The second identity follows from joint convexity w.r.t. and , which can be verified by computing the Hessian of the function in . So completely oppositely from the previous case, the objective of our problem becomes minimizing electrical energy with respect to a set of inverse resistances, which we will call conductances. Note that in this case the quantity that is invariant under scaling by a constant is . Therefore, equivalently, our goal will be to find the set of conductances for which . Similarly to the case, in this case we make progress by iteratively increasing conductances from to in such a way that
[TABLE]
Just as before, we can prove that unless the value of the objective can not be made smaller than , then can be increased while enforcing this invariant (Lemma 4.7). Hence we can prove fast convergence by arguing that increases very fast.
2.3 Approximate Solutions and Infeasibility Certificates
minimization
We consider the formulation
[TABLE]
for which we seek an approximate solution in the following sense. Given a target value , we aim to find one of the following:
an approximate solution in the sense that and , 2. 2.
an approximate infeasibility certificate in the sense that and .
We prove in Lemma 2.1 that the latter is indeed an infeasibility certificate.
Lemma 2.1**.**
Let be the solution to the problem defined in Equation 2.3, and let . Then .
Proof.
Using Lemma 4.2 we can write
[TABLE]
which gives us what we needed. ∎
minimization
We consider the formulation
[TABLE]
for which seek an approximate solution in the following sense. Given a target value , we seek one of the following:
an approximate infeasibility certificate in the sense that , 2. 2.
an approximate feasibility certificate in the sense that and , which yields an approximately feasible solution in the sense that and .
The fact that the former is an approximate infeasibility certificate follows from convex duality. Indeed, one can see that the dual of the minimization problem is , so exhibiting a solution as above implies that the value of this objective is at least . A proof for the fact that the latter is indeed an approximate feasibility certificate, and that it yields an approximately feasible solution can be found in Lemma 2.2.
Lemma 2.2**.**
Given , the vector satisfies , and .
Proof.
The fact that follows directly by substitution, and using the fact that . Using Lemma 4.2 and the definition in (4.1) we write
[TABLE]
We can use this identity inside the following upper bound, which we obtain by applying Cauchy-Schwarz:
[TABLE]
This yields our claim. ∎
3 The Algorithms
Having introduced the necessary notation, we can describe our simple IRLS routine. We prove convergence in Section 4.
3.1 The Minimization Algorithm
We first present the algorithm for the version of the problem, since it is the most intuitive. The method attempts to find a weighting of the columns of i.e. a vector for which the corresponding least squares solution has a small norm; more precisely for some chosen target value .
Then the weighting is updated via the following simple thresholding rule. Elements for which the corresponding coordinate of the least squares solution is below the desired target value are left unchanged. The others are scaled exactly by the amount by which the square of the corresponding coordinate violates the desired threshold i.e. .
Note that the iteration defined here simply attempts to construct an infeasibility certificate for the problem defined in Equation 2.3. Building the feasible solution involves maintains a solution obtained by uniformly averaging a subset of the iterates witnessed so far. These are used to return the approximately feasible solution in case the algorithm fails to quickly produce an (approximate) infeasibility certificate. The details referring to how and why we perform this specific set of updates are explained in the convergence proof. The steps involved in building this feasible solution are written in blue. They can be ignored if the goal is simply that of returning a yes/no answer.
Correctness.
We notice that Algorithm 1 has two possible outcomes. Either it returns a primal approximately feasible vector (lines 11and 15), or returns a dual certificate (line 20). In the former case, it is clear from the description of the algorithm that the returned vector is indeed approximately feasible: line 11 returns a uniform average of vectors satisfying the linear constraint with small norm; line 20 returns the computed within the corresponding iteration, whenever , i.e. .
Also, note that in case none of these stopping conditions is triggered, the algorithm returns a dual certificate on line 20 after a finite number of iterations. Indeed, note that every iteration where , at least one element of gets increased by a factor of at least , due to way is defined. Since the algorithm stops when , no element of can be scaled more than times, hence the total number of iterations is very roughly upper bounded by . We will see in Section 4 that we can prove a much finer upper bound.
Finally, we need to argue that whenever the algorithm returns on line 20, it returns an infeasibility certificate as per Lemma 2.1. We defer the proof to Lemma 4.5 in Section 4.
3.2 The Minimization Algorithm
The version is very similar. As a matter of fact, it can be re-derived simply by attempting to solve the convex dual of the problem from (2.3), which is an minimization problem, by using the routine from Figure 1. However, since the reduction requires several, and previous works attempted to solve this directly using various versions of IRLS, we provide a natural iteration which does not involve any reductions.
Correctness.
We notice that Algorithm 2 has two possible outcomes. Either it returns an approximate infeasibility certificate (lines 11 and 15), or returns an approximately feasible solution (line 20).
Let us verify that in the former case the returned vector is indeed an approximate infeasibility certificate. Line 11 returns , where we know that is a set for which
[TABLE]
Since , we see that returned vector is an approximate infeasibility certificate, as defined in Section 2.3. If the algorithm returns on line 15, we get that , hence is an approximate infeasibility certificate.
Also, note that in case none of these stopping conditions is triggered, the algorithm returns a solution on line 20 after a finite number of iterations. Indeed, just as in the case, in every iteration some conductance gets increased by a factor of at least , hence the algorithm must stop in finite time. We provide a rigorous analysis of the time required for convergence in Section 4.
Finally, we need to argue that whenever the algorithm returns a solution on line 20, it is indeed an approximately feasible solution. We defer the proof to Lemma 4.7 in Section 4.
4 The Algorithm Analyses
4.1 The Flow/Potential Interpretation
While we study a very general problem, it is very useful to develop intuition based on the case where is the vertex-edge incidence matrix of a graph. In this case we will always think of the sought solution as a flow on the graph’s edges. The corresponding dual object is a set of potentials defined on the graph’s vertices.
To be more precise, we consider the following setting. Let be an undirected graph. For each edge we choose an arbitrary orientation, and define be the set of arcs leaving vertex , and the set of arcs entering vertex , for all .
Letting , , we consider the matrix where
[TABLE]
One can easily verify that given a vector defined on the arcs of the graph (which we will think of as a flow), after applying the operator we obtain the demand routed by this flow , which lives in the space of potentials defined on the graph’s vertices.
Therefore the minimization problem from 2.3 can be interpreted as finding the flow with minimum congestion which routes the demand , while the minimization problem from 2.4 corresponds to finding the minimum cost flow routing the demand .
With this interpretation in mind, we proceed to define some objects that in the case of electrical networks correspond to energy and electrical flows.
We use weightings of ’s columns which we refer to as conductances. We equivalently refer to the reciprocals , with , which we call resistances. Our analysis is exclusively based on tracking a potential function which corresponds to the electrical energy of a flow.
Definition 4.1** (Energy of a flow).**
Given a flow , along with a vector of resistances , we let the energy of be
[TABLE]
Overloading this notation, given a vector , let the electrical energy be
[TABLE]
in other words this is the minimum energy over all flows satisfying . We drop the argument whenever is clear from the context.
4.2 Preliminaries on Electrical Energy
Throughout the paper, our analyses will rely on a potential function, which in the case of resistor networks corresponds to the electrical energy. In this section we provide a few useful facts.
Lemma 4.2** (Characterization of Electrical Energy).**
Given a vector of resistances , we have the following equivalent characterizations for the electrical energy.
[TABLE]
Furthermore, if is the minimizing flow for the expression in (4.1), and is the maximizing set of potentials for the expression in (4.3), then for all :
[TABLE]
Since the proof is standard, we defer it to Section A.1.
As a corollary, we can derive a lower bound on the increase in energy after increasing resistances.
Lemma 4.3**.**
Let , and let . Then, one has that
[TABLE]
Proof.
We use the characterization from Equation 4.3 for characterizing electrical energy. Let be the argument that maximizes for resistances . We certify a lower bound on using as follows:
[TABLE]
Finally substituting the relation between flows and potentials from Lemma 4.2, Equation (4.5), we obtain the desired claim. ∎
We can derive a similar lower bound on the inverse energy, after increasing conductances.
Lemma 4.4**.**
Let . Then one has that
[TABLE]
Proof.
We use the following basic inequality: for one has , which follows from . Also, from the definition of energy in (4.1), we obtain an upper bound on the new energy, after perturbing conductances. Let , i.e. the electrical flow corresponding to conductances . We therefore have:
[TABLE]
Using the fact that by optimality, (per Lemma 4.2), and combining with the previous inequality we obtain
[TABLE]
which is what we wanted. ∎
4.3 Convergence Proof for Minimization
Having put together all these tools, we are ready to analyze the algorithms presented in Section 3. We first prove that -Minimization returns a correct infeasibility certificate, whenever it returns on line 20. This lemma is key to understanding the intuition behind the algorithm.
Lemma 4.5**.**
Whenever -Minimization returns on line 20, is a correct approximate infeasibility certificate in the sense that
[TABLE]
Proof.
First notice that by Lemma 2.1, the lower bound on energy is indeed an approximate infeasibility certificate. Now we proceed to prove that throughout the iterations of the algorithm, energy increases at the right rate.
We show that every iteration satisfies the invariant
[TABLE]
This is easy to verify using Lemma 4.3, which lower bounds the increase in energy after perturbing resistances. We see that using the perturbation rule defined on line 13 of the algorithm, energy increases as follows
[TABLE]
For every coordinate of that has changed we see that the ratio between the contribution to above lower bound of that specific coordinate, and the increase in resistance is
[TABLE]
Therefore, summing up over all coordinates we obtain the desired inequality. Finally, we notice that initially , and . So once , one has that, using (4.6),
[TABLE]
and thus
[TABLE]
and equivalently:
[TABLE]
which implies what we needed. ∎
Knowing that the algorithm is correct, we can now proceed and prove that it converges fast (convergence rate can be slightly improved by using a more careful schedule for and ; we defer this improvement to Section 5).
Lemma 4.6**.**
The algorithm -Minimization returns a solution after iterations.
Proof.
We show that unless the algorithm returns an approximately feasible solution on lines 11 or 15, then there exists a coordinate for which increases very fast.
Suppose the algorithm has run for iterations without returning an approximately feasible solution. Consider the partial sum of iterates obtained so far for some . Since the algorithm did not return on line 11, we know that . Therefore there exists a coordinate for which . In other words, letting be the set of iterates that have contributed to , one definitely has that
[TABLE]
and thus
[TABLE]
where we used the fact that for each iteration one has that due to the perturbation rule defined on line 13. This implies that restricting ourselves only to iterations where increased the corresponding resistance , we have that
[TABLE]
By the condition on line 7 we see that for all iterations , one has
[TABLE]
Also since we only consider the iterations with , the rule from line 13 also enforces that for all these iterations
[TABLE]
Equations (4.7), (4.8) and (4.9) suggest that the product increases very fast: intuitively the worst case should occur either when all the factors contribute equally, either all of them are as small as possible (i.e. , or as large as possible, i.e. ). We formalize this intuition in Lemma A.1, which implies that
[TABLE]
Hence setting
[TABLE]
suffices to lower bound this product by . Since each iteration a resistance gets multiplied by the corresponding , and all resistances are initially , this lower bound implies that . But this means that the algorithm will finish execution after the current iteration, according to the condition on line 5.
Finally, we need to upper bound the number of iterations not in ; these correspond to those iterations where , so there exists some index for which . Therefore some resistance gets multiplied by . Since all resistances are initially , in the worst case, each such iteration increases one resistance from to . Therefore this can happen at most times, before the sum of resistances becomes at least , and the algorithm finishes.
Combining these two cases, we obtain our bound. ∎
We can prove the convergence bound for -Minimization similarly. The main difference is that this time we maintain conductances, and the potential function that enables us to prove convergence is .
4.4 Convergence Proof for -Minimization
Lemma 4.7**.**
Whenever -Minimization returns on line 20, is a correct approximate feasibility certificate in the sense that
[TABLE]
Proof.
By Lemma 2.2, this also yields a solution such that and .
In order to prove that at the end of the execution the norm of this solution is small enough, this time we track as potential function the inverse energy . More precisely, we show that every iteration satisfies the invariant
[TABLE]
This is easy to verify using Lemma 4.4, which lower bounds the increase in inverse energy after perturbing conductances. We see that using the perturbation rule defined on line 13 of the algorithm, inverse energy increases as follows
[TABLE]
For every coordinate of that has changed we see that the ratio between the contribution to above lower bound of that specific coordinate, and the increase in conductance is
[TABLE]
where we used the fact that (Lemma 4.2).
Therefore, summing up over all coordinates we obtain the desired inequality. Since and , we know that once , one has that, using (4.10),
[TABLE]
and thus,
[TABLE]
and equivalently:
[TABLE]
which is what we needed. ∎
Next we prove that the algorithm converges fast. Convergence rate can be slightly improved by using a more careful schedule for and , which we defer to Section 5.
Lemma 4.8**.**
The algorithm -Minimization returns a solution after iterations.
Proof.
The proof follows the lines of the proof we used for Lemma 4.6: unless the algorithm returns an approximate infeasibility certificate on lines 11 or 15, then there exists a coordinate for which increases very fast.
Suppose the algorithm has run for iterations without returning an approximate infeasibility certificate. Consider the partial sum of iterates obtained so far for some . Since the algorithm did not return on line 11, we know that , therefore there exists a coordinate for which . In other words, letting be the set of iterates that contributed to , one has that
[TABLE]
and thus, since by definition ,
[TABLE]
Therefore, restricting ourselves only to iterations where increased the corresponding conductance , we have that
[TABLE]
By the condition on line 7 we see that for all iterations , one has
[TABLE]
So considering only the iterations with , the rule from line 13 also enforces that for all these iterations
[TABLE]
Combining Equations (4.11), (4.12), and (4.13), and applying Lemma A.1, exactly the same way we did in the proof of Lemma 4.6 implies that
[TABLE]
So if
[TABLE]
once again we have that this product is lower bounded by , therefore the corresponding conductance , since its initial value was . Since we can only control the total number of iterations , we can lower bound by showing that the number of iterations not in can not be too large. Just as before, we lower bound the number of iterations where . Note that whenever this happens, there exists an index for which . Therefore some conductance gets multiplied by . Again, using an identical argument to the one from the proof of Lemma 4.6, we see that this can not happen more than times. Combining this with the sufficient number of iterations required by the other case, we obtain our bound. ∎
5 Using Phases to Improve the Iteration Count
In this section, we show that via minor modifications to our algorithms, we can improve the number of iterations to thus obtaining the bound promised by Theorem 1.1. This relies on the observation that the entire difficulty of the problem is concentrated on improving the quality of a solution from to . For conciseness, let us focus on the case, and consider the convergence argument described in Sections 4.3. Our goal there is to increase the sum of resistances to , since our argument assumes that the initial energy could be arbitrarily small.
However, if we assume that we warm start the method with a set of resistances , , for which the corresponding energy is already large enough, , we only need to iterate until we obtain a set of resistances such that (rather than ) in order to certify that the current energy/resistance ratio is as large as desired, i.e. . This in turn improves the number of iterations the algorithm needs before it returns. We expand these ideas in what follows.
Now, suppose we have a set of resistances , such that and . Let us analyze the number of iterations of the method described in Section 4.3 that we require before we can return such that or a solution such that .
First, we claim that if each update satisfies the invariant from Equation (4.6), then we can stop iteration once . Indeed, in this case, one has that
[TABLE]
whenever .
The remaining analysis is carried over almost identically, except that the threshold set on line 7 is changed to , and our goal is to get .
For the iterations that satisfy this threshold, by applying Lemma A.1 we see that it is sufficient to witness a small number of such iterations such that
[TABLE]
so suffices.
For the iterations that do not satisfy the threshold, in the worst case, each of them increases one resistance from to so this can happen at most times.
Setting we get that the total number of iterations is at most .
All of this holds assuming we have a good warm start for resistances. We obtain it by recursively invoking the same method for target value , and accuracy. In case of failure, this returns a vector which certainly satisfies , so this concludes the entire run on the algorithm; otherwise, it returns a certificate consisting of resistances for which the ratio between the corresponding energy and norm is at least , so they can be used as a warm start.
Recursion ends once . We note that since the desired accuracy gets increased by a constant factor after each level of recursion, the total number of iterations is dominated by those performed at the top level (i.e. for the lowest ). Hence our result.
Note that this method can also be implemented slightly more naturally by running Algorithm 1 with a varying schedule for and .
Improving the number of iterations for minimization is done analogously.
6 From Approximate Decision to Approximate Optimization
Our algorithms are designed to solve an approximate decision problem, given a guess for the value of the optimal solution. While this follows from a standard reduction, for the sake of completeness we prove here that this is sufficient to optimize the problem approximately without paying more than an additional constant overhead in running time.
To be more specific, let us first focus on minimization. Theorem 1.1 states that given a guess and accuracy , the algorithm either returns an approximately feasible solution with value , or an infeasibility certificate certifying that . Hence this restricts the search interval for the true value either within the interval or .
We initialize our search interval to where is the initial iterate obtained with uniform resistances. Using Lemma 2.1 we easily verify that is indeed a lower bound on , since energy lower bounds the squared optimal value.
Given a search interval , we let , . We invoke Theorem 1.1 for target value and accuracy . Depending on the outcome we update the search interval to or .
When we stop the search, call the algorithm for target value and accuracy , then output the approximately feasible iterate returned by the algorithm. The fact that this call indeed returns an approximately feasible iterate follows from the fact that is certainly feasible, since this is an invariant maintained by our search, and that if the algorithm were to return an infeasibility certificate it must have needed that , which is false. Thus we know that the returned solution has value at most , so it satisfies the desired approximation guarantee.
Finally, we analyze the cost of the search. Note that each iteration of the search reduces be a constant factor, and it stops whenever it becomes at most . For as long as , the algorithm is invoked with accuracy , and gets reduced by a constant factor, so this happens at most times. Once becomes small enough, i.e. , we use accuracy . Note that from Theorem 1.1 we know that the number of iterations of the algorithm for a single invocation depends on , where is a fixed constant; due to our schedule for choosing , the total cost of this sequence of invocations is dominated by the cost of the final one, where .
So letting be the number of iterations required by the algorithm from Theorem 1.1 to solve the approximate decision problem to accuracy , we have that solving the approximate optimization problem requires iterations.
The minimization problem is treated similarly, so we omit its description.
7 Experiments
We test both our resistance/conductance update schemes in order to verify that the resulting algorithms converge fast in practice. We slightly modify the schemes such that they always update their target value depending on the value of the objective they have achieved so far. We stop when given the history of witnessed iterates, we can certify a sufficiently small duality gap. For solving linear systems, we used the conjugate gradient implementation from the -MAGIC optimization suite [CR].
We test both algorithms while varying , and varying . We consider both the update scheme given by our algorithms from Section 3, and one where we attempt to double the length of the step for as long as the invariants from (2.1) and (2.2), respectively, are maintained. We notice that in general, using this long step strategy, we improve both the number of iterations and the running time.
The plots corresponding to the standard update scheme (short-steps) are drawn in red, those corresponding to the long-step version are drawn in blue.
The experiments where we varied are reported in figures 1(a), 1(b), 1(e), and 1(f). For all these experiments, the input consists a random matrix with orthogonal rows, and a vector obtained from applying to a -vector of sparsity . We plot the number of iterations/running time of the algorithm for , where .
We notice that for these experiments, the number of iterations for the short-step version does indeed scale linearly with ; the long-step version makes significant gains in the case.
The experiments where we varied are reported in figures 1(c), 1(d), 1(g), and 1(h). For all these experiments, the input consists of a random matrix with orthogonal vectors, and a vector obtained from applying to a -vector of sparsity , and a fixed accuracy . We plot the number of iterations required by the algorithm for .
We notice that for these experiments, both the number of iterations and the running time scale significantly better than by , which suggests that this polynomial dependence in depends on the input structure, and can be avoided in practice.
Acknowledgements
AE was partially supported by NSF CAREER grant CCF-1750333 and NSF grant CCF-1718342. AV was partially supported by NSF grant CCF-1718342.
Appendix A Deferred Proofs
A.1 Proof of Lemma 4.2
Proof.
We can write the formulation from (4.1) as an unconstrained optimization problem using Lagrange multipliers:
[TABLE]
By making the gradient with respect to equal to [math], we see that the inner minimization problem is optimized at for all , and equivalently . Plugging this back into the maximization objective w.r.t. we obtain
[TABLE]
where for the last equality we used that by optimality conditions one must have .
Finally, we prove (4.4) by using the fact that for any symmetric matrix and vector one has that
[TABLE]
which can be seen by observing that both expressions are optimized at , then applying (4.3).
∎
A.2 Lower Bound Lemma
Lemma A.1**.**
Let a set of nonnegative reals such that for all , and . Then, for any , one has that
[TABLE]
Proof.
Consider a fixed , and let us attempt to minimize the product of ’s subject to the constraints. Equivalently we want to minimize , which is a concave function. Therefore its minimizer is attained on the boundary of the feasible domain. This means that for some , there are elements equal to , equal to , and one which is exactly equal to the remaining budget, i.e. , which yields the product . This can be relaxed by allowing and to be non-integral. Hence we aim to minimize the product , subject to . Finally, we observe that we can always obtain a better solution by placing all the available mass on a single one of the factors, i.e. we lower bound either by , or , whichever is lowest. ∎
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[AZLO 16] Zeyuan Allen-Zhu, Yin Tat Lee, and Lorenzo Orecchia, Using optimization to obtain a width-independent, parallel, simpler, and faster positive sdp solver , Proceedings of the twenty-seventh annual ACM-SIAM symposium on Discrete algorithms, Society for Industrial and Applied Mathematics, 2016, pp. 1824–1831.
- 2[AZO 15] Zeyuan Allen-Zhu and Lorenzo Orecchia, Using optimization to break the epsilon barrier: A faster and simpler width-independent algorithm for solving positive linear programs in parallel , Proceedings of the twenty-sixth annual ACM-SIAM symposium on Discrete algorithms, Society for Industrial and Applied Mathematics, 2015, pp. 1439–1456.
- 3[BBD + 13] Luca Becchetti, Vincenzo Bonifaci, Michael Dirnberger, Andreas Karrenbauer, and Kurt Mehlhorn, Physarum can compute shortest paths: Convergence proofs and complexity bounds , International Colloquium on Automata, Languages, and Programming, Springer, 2013, pp. 472–483.
- 4[BMV 12] Vincenzo Bonifaci, Kurt Mehlhorn, and Girish Varma, Physarum can compute shortest paths , Journal of theoretical biology 309 (2012), 121–133.
- 5[C + 06] Emmanuel J Candès et al., Compressive sampling , Proceedings of the international congress of mathematicians, vol. 3, Madrid, Spain, 2006, pp. 1433–1452.
- 6[CKM + 11] Paul Christiano, Jonathan A. Kelner, Aleksander Madry, Daniel A. Spielman, and Shang-Hua Teng, Electrical flows, laplacian systems, and faster approximation of maximum flow in undirected graphs , Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC 2011, San Jose, CA, USA, 6-8 June 2011 (Lance Fortnow and Salil P. Vadhan, eds.), ACM, 2011, pp. 273–282.
- 7[CMMP 13] Hui Han Chin, Aleksander Madry, Gary L. Miller, and Richard Peng, Runtime guarantees for regression problems , Innovations in Theoretical Computer Science, ITCS ’13, Berkeley, CA, USA, January 9-12, 2013 (Robert D. Kleinberg, ed.), ACM, 2013, pp. 269–282.
- 8[CR] Emmanuel Candès and Justin Romberg, ℓ 1 subscript ℓ 1 \ell_{1} - magic : Recovery of sparse signals via convex programming .
