Matrix Scaling and Balancing via Box Constrained Newton's Method and Interior Point Methods
Michael B. Cohen, Aleksander Madry, Dimitris Tsipras, Adrian Vladu

TL;DR
This paper introduces nearly-linear and near-quadratic time algorithms for matrix scaling and balancing, utilizing a new second-order optimization framework and interior point methods, advancing computational efficiency in scientific computing.
Contribution
The paper develops a unified second-order optimization framework and algorithms that improve the efficiency of matrix scaling and balancing, especially for matrices with quasi-polynomial condition ratios.
Findings
Algorithms run in nearly-linear time for quasi-polynomial condition ratios.
An interior point method achieves near-quadratic time complexity.
The framework generalizes Laplacian system solving for broader function classes.
Abstract
In this paper, we study matrix scaling and balancing, which are fundamental problems in scientific computing, with a long line of work on them that dates back to the 1960s. We provide algorithms for both these problems that, ignoring logarithmic factors involving the dimension of the input matrix and the size of its entries, both run in time where is the amount of error we are willing to tolerate. Here, represents the ratio between the largest and the smallest entries of the optimal scalings. This implies that our algorithms run in nearly-linear time whenever is quasi-polynomial, which includes, in particular, the case of strictly positive matrices. We complement our results by providing a separate algorithm that uses an interior-point method and runs in time $\widetilde{O}(m^{3/2} \log…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Matrix Scaling and Balancing via Box Constrained Newton’s Method and Interior Point Methods
Michael B. Cohen
MIT
[email protected] This material is based upon work supported by the National Science Foundation under Grant No. 1111109 and Grant No. 1553428, and by the National Defense Science and Engineering Graduate Fellowship.
Aleksander Mądry
MIT
[email protected] This material is based upon work supported by the National Science Foundation under Grant No. 1553428.
Dimitris Tsipras22footnotemark: 2
MIT
Adrian Vladu
MIT
[email protected] This material is based upon work supported by the National Science Foundation under Grant No. 1111109 and Grant No. 1553428
Abstract
In this paper, we study matrix scaling and balancing, which are fundamental problems in scientific computing, with a long line of work on them that dates back to the 1960s. We provide algorithms for both these problems that, ignoring logarithmic factors involving the dimension of the input matrix and the size of its entries, both run in time where is the amount of error we are willing to tolerate. Here, represents the ratio between the largest and the smallest entries of the optimal scalings. This implies that our algorithms run in nearly-linear time whenever is quasi-polynomial, which includes, in particular, the case of strictly positive matrices. We complement our results by providing a separate algorithm that uses an interior-point method and runs in time .
In order to establish these results, we develop a new second-order optimization framework that enables us to treat both problems in a unified and principled manner. This framework identifies a certain generalization of linear system solving that we can use to efficiently minimize a broad class of functions, which we call second-order robust. We then show that in the context of the specific functions capturing matrix scaling and balancing, we can leverage and generalize the work on Laplacian system solving to make the algorithms obtained via this framework very efficient.
1 Introduction
Matrix balancing and scaling are problems of fundamental importance in scientific computing, as well as in statistics, operations research, image reconstruction, and engineering. The literature on these problems [39, 41, 18, 20, 12, 23, 44, 48, 47, 53, 42, 4, 15, 21] is truly extensive and dates back to 1960s. They both are key primitives in most mainstream numerical software packages (MATLAB, R, LAPACK, EISPACK) [36, 35, 43, 17, 3]. Also, both these problems can be seen as task in which we are aiming to find diagonal scalings of a given matrix so that the rescaled matrix gains some favorable structure.
More specifically, in the matrix scaling problem, we are given a nonnegative matrix , and our goal is to find diagonal matrices such that the matrix has prescribed row and column sums. The most common instance of this problem is the one where we want to scale the matrix so to make it doubly stochastic – in other words, we want to make all row and column sums be equal to one. This procedure has been repeatedly used since as early as 1937 in a number of diverse areas, such as telecommunication [27], engineering [4], statistics [14, 47], machine learning [9], and even computational complexity [33, 19]. A standard application for scaling is preconditioning linear system solving. Given a linear system , one can produce a solution by computing , since applying the inverse of is more numerically stable procedure than directly applying the inverse of [53]. Another example application, which commonly occurs in statistics, is iterative proportional fitting. This primitive is often used for standardizing cross-tabulations and has been studied since 1912 [55]. Even more interestingly, matrix scaling turned out to have surprising connections to fundamental problems in the theory of computation. Notably, in [33], it is observed that scaling can be used to approximate the permanent of any nonnegative matrix within a multiplicative factor of . Furthermore, deciding whether the permanent of a bipartite graph’s adjacency matrix is [math] or at least is equivalent to deciding whether that graph contains a perfect matching. Such scaling–based method can, as a matter of fact, be used to compute maximum matchings in bipartite graphs, which is a classic and intensely studied problem in graph algorithms [13, 16, 34]. For more history and information on this problem, we refer the reader to Idel’s extensive survey [21], or [45] for a list of applications.
Now, in the matrix balancing problem, we are, again, given a nonnegative matrix , and our goal here is to find a diagonal matrix such that the matrix is balanced, that is the sum of each row is equal to the sum of the corresponding column. This procedure has been introduced first by Osborne [39], who was using it to precondition matrices in order to increase the accuracy of the eigenvalue computation. (Note that the balancing operation does not change the eigenvalues of the matrix.) The initially proposed algorithm for it was based on a simple iterative approach, and was then followed by a long series of improvements and extensions. The initial work on this problem focused on a variant in which one aims to balance -norms of rows and columns. It turns out, however, that the -norm–based version we study here is equivalent. In fact, balancing problems with respect to norms, with constant , are all reducible to each other.
1.1 Previous Work
The early methods used for solving these problems – Osborne’s iteration for balancing, and the RAS method for scaling – are simple iterative algorithms. However, merely the task of analyzing their convergence turned out to be a major challenge. Significant effort has gone into understanding their convergence [5, 46, 40, 33], and providing better analyses or better iterative methods resulted in a long line of work in this context.
The major shortcoming of the methods obtained so far for exactly solving the problem (depending only logarithmically on ) is their very large running time. In the following discussion we ommit runtime factors that depend (logarithmically) on the size of the input entries. For matrix scaling, Kalantari and Kachiyan [22] obtained an algorithm that finds an -approximate solution and runs in time , where denotes the dimension of the matrix (we can assume the matrix is square w.l.o.g.) and is the desired accuracy parameter111The precise definition of varies across papers. However, in the regime of logarithmic running time dependence on we are interested in here, all these definitions are essentially equivalent.. This algorithm was based on the ellipsoid method. These authors also proposed – but not formally analyzed – an algorithm based on interior point method, which they expected to run in time , where denotes the number of non-zero entries of the input matrix. Then, Nemirovsky and Rothblum [37] analyzed an interior point method–based algorithm which run in time . Finally, Linial, Samorodnitsky, and Wigderson [33] gave an time algorithm that is also strongly polynomial, in the sense that it does not depend at all on the size of input entries.
For the case of matrix balancing, Parlett and Reinsch [41] provided an iterative method based on Osborne’s iteration, without proving convergence. Then, Grad [18] proved that Osborne’s iteration converges in the limit. The first polynomial time bound was obtained by Kalantari, Khachiyan, and Shokoufandeh [23], who gave an algorithm with running time .
Alternatively, if one is interested in the regime where the running time is allowed to depend polynomially – instead of logarithmically – on the (inverse of the) desired accuracy of the solution, there are algorithms that have an even better dependence on the other parameters. Specifically, the current state-of-the-art is given by Linial, Samorodnitsky, and Wigderson [33], who obtain running time for the scaling problem. In the case of the balancing problem, recently, Ostrovsky, Rabani, and Yousefi [40] made a significant progress by obtaining running times of and .
Finally, another important line of work in this domain was focused on the related variant of the balancing problem, where the maximum entry of each row is required to be equal to the maximum entry of the corresponding column. Schneider and Schneider [44] gave a non-iterative algorithm running in time , improved to by Young, Tarjan, and Orlin [54]. More recently, Schulman and Sinclair [46] provided an analysis of the classical Osborne-Parlett-Reinsch obtaining a running time of , and gave a version of it with running time .
1.2 Our Contributions
We provide algorithms for both matrix scaling and balancing problems.
For the matrix scaling problem, we establish an algorithm that runs in time
[TABLE]
where and are the optimal scaling matrices, is the maximum ratio between the diagonal entries of its argument, is the sum of the entries in the input matrix, and is the measure of the target error of the scaling, formally defined in Definition 4.2.
For the matrix balancing problem, we establish a running time of
[TABLE]
where is the ratio of the sum of the entries to the minimum nonzero entry, is the optimal balancing matrix, has the same meaning as above, and is the measure of the balancing error, as formally defined in Definition 4.16.
Notably, our running times depend logarithmically on both the target accuracy and the magnitude of the entries in the optimal balancing or scaling. This implies that if the optimal solution has quasi-polynomially bounded entries, our algorithms run in nearly linear time (ignoring logarithmic factors involving the entries of the input matrix). This includes, for instance, the case when input matrix has all its entries positive or, in case of matrix balancing, if there just exists a single row/column pair with all positive entries.
However, there are matrices for which can be exponentially large (in ). For the case of such matrices we develop algorithms with negligible dependence on . These algorithms are based on interior point methods, with appropriately chosen barriers, commonly used in exponential programming [2]. We show that the linear system solves required by the interior point method every iteration can be reduced via Schur complementing to approximately solving a Laplacian system, which can be done in nearly linear time using any standard Laplacian solver [51, 25, 26, 24, 8, 28, 29]. This yields a running time of
[TABLE]
where is the ratio between the largest and smallest nonzero entry of .
1.3 Our Approach
We approach the scaling and balancing problems by developing a continuous optimization based perspective on them. More precisely, we solve both matrix scaling and balancing problems by casting them as tasks of minimizing certain corresponding convex functions. In fact, in the case of the balancing problem, that function is directly inspired by the one used in [23]; for the scaling problem, it is function derived from the one used in [22].
Since our goal is to obtain logarithmic – instead of polynomial – dependence on the (inverse of the) desired accuracy , it would be tempting to use well-known tools for convex programing, such as ellipsoid method or interior point method. However, these methods are, a priori, computationally expensive. This motivates us to look for different, more direct approaches.
To this end, we develop a technique for minimizing a broader class of functions that we call second-order robust (with respect to ). Intuitively, this class corresponds to functions whose Hessians do not change too much within any unit -ball. And the consequence of that property that will be crucial for us is that local quadratic approximation of such functions at any given point is relatively accurate within the unit neighborhood of that point. As a result, iteratively optimizing the local approximation around the current point, while staying within that neighborhood, will be guaranteed to make progress towards minimizing the function. This iterative procedure can be viewed as a “box-constrained” variant of the Newton’s method.
A priori, performing a single step of such a box-constrained Newton’s method, i.e., minimizing a quadratic function subject to box constraints might be a computationally costly task. We show, however, that it suffices to implement a weaker primitive, which we call a -oracle. That primitive corresponds to (approximately) minimizing a quadratic function within a region that is within a factor of larger than the target -ball. Once such a -oracle is implemented efficiently, we can compute the global optimum of our second-order robust function using a small number of calls to it. More precisely, we show that one can minimize a convex function that is second-order robust with respect to to within additive error from optimum in
[TABLE]
iterations, where each iteration consists of one call to the -oracle, is the starting point, is the minimizer of , and is the radius of the level set of .
In the light of the above, the main technical difficulty remaining is obtaining an efficient implementation of a -oracle. We show that for functions whose Hessian is symmetrically diagonally dominant, with nonzero off-diagonal entries, or SDD for short222Such matrices can essentially be viewed as a Laplacian matrix plus a nonnegative diagonal., we can implement a -oracle, with , in time that is nearly linear in the sparsity of the Hessian. We build here on the strategy underlying the Laplacian solver of Lee, Peng and Spielman [30]. Specifically, we carefully lift the solutions corresponding to coarser (and smaller) approximations of the underlying matrix to the desired solutions corresponding to the initial matrix in a way that does not allow these lifted solutions to exceed the boundaries of a -radius -ball.
Once the above optimization framework is developed, applying it to the scaling and balancing problems is fairly straightforward. It boils down to verifying that the functions that capture the respective problems are indeed second-order robust and have an SDD Hessian, and then bounding all the relevant quantities that (1.1) involves.
Independent Work
Finally, we note that Allen-Zhu, Li, Oliveira, and Wigderson [1] obtained independently very similar results for the exact version of the problem. The running time of the algorithms they develop have a bit worse dependence on , but they were able to establish better absolute bounds on (in terms of the problem parameters and the magnitude of the input entries) for the general, non-doubly stochastic variant of the matrix scaling problem.
1.4 Roadmap
The rest of the paper is organized as follows. First, we introduce relevant notation and concepts in Section 2. Then, in Section 3 we formally introduce the class of convex functions we call second-order robust with respect to . For these, we develop a specific optimization primitive called box-constrained Newton method.
We describe how we can apply the primitive from Section 3 to matrix balancing and scaling in Section 4, by reducing these problem to a convex function minimization with favorable structure. In order to complete our algorithm, in Section 5, we show how to efficiently implement an iteration of the box-constrained Newton in the special case where the Hessian of the function is SDD. In Section 6 we provide a different approach for balancing and scaling based on interior point methods. Supplementary proofs and technical details are presented in the Appendix.
2 Preliminaries
2.1 Notations
Vectors
We let denote the all zeros and all ones vectors, respectively. When it is clear from the context, we apply scalar operations to vectors with the interpretation that they are applied coordinate-wise.
Matrices
We write matrices in bold. We use to denote the identity matrix, and to denote the zero matrix. Given a matrix , we denote its number of nonzero entries by . When it is clear from the context, we use to denote the the number of nonzeros; similarly, we use to denote the dimension of the ambient space.
We denote by the sum of entries of , by the minimum nonzero entry of , and by the ratio between these quantities. We use to denote the set of pairs of indices corresponding to the nonzero entries of . Given a matrix , we define to be the vector consisting of row sums, and to be the vector consisting of column sums. For a positive diagonal matrix we denote the maximum ratio between its diagonal elements by .
Positive Semidefinite Ordering and Approximation
For symmetric matrices we use to represent the fact that that , for all . A symmetric matrix is positive semidefinite (PSD) if . We use in a similar fashion. For vectors , we define the norm . Given two PSD matrices and , and a parameter , we use to denote the fact that .
Laplacian and SDD matrices
A family of matrices that will play an important role in this paper are symmetric diagonally dominant (SDD) matrices. These are matrices , that symmetric and, moreover, have each diagonal entry be larger than the sum of absolute values of the corresponding row entries. That is, for every
[TABLE]
A special case of SDD matrices are Laplacian matrices, which have negative off-diagonal entries and the sum of each row is required to be zero. The crucial fact about these matrices is that one can exploit their structure to solve linear systems in them in time that is only nearly linear [51, 25, 26, 24, 8, 28, 29].
Diagonal Matrices
For we denote by the diagonal matrix where . Given a nonnegative diagonal matrix , we use to denote the ratio between its largest and smallest entry. We will overload notation and, for any matrix , use to denote the main diagonal of , that is and for .
Gradients and Hessians
Given a function we denote by its gradient at , and by its Hessian at . When the function is clear from the context, we also use to denote its Hessian at .
Block Matrices
As part of our algorithms, we will consider partitioning the coordinates of vectors into sets of indices and . When we compute the quadratic form of a matrix with these vectors, we need to be able to reason about how values in each component interact with the rest of the vector. For that reason it is convenient to denote the block form notation for a matrix as:
[TABLE]
Schur Complements
For a matrix and a partition of its indices , the Schur complement of F in is defined as
[TABLE]
The exact use of Schur complements will become clear in Sections 5,6. These are objects that naturally arise during Gaussian elimination for the solution of linear systems. By pivoting out variables the remaining system to solve for variables of is exactly the Schur complement of in .
3 Box-Constrained Newton Method for Second-Order Robust Functions
The central element of our approach is developing an efficient second-order method based minimization framework for a broad class of functions that we will call second-order robust with respect to . To motivate the choice of this class, recall that second-order methods for function minimization are iterative in nature, and they boil down to repeated minimizing the local quadratic approximation of the function around the current point. Consequently, in order to obtain meaningful guarantees about the progress made by such methods, one needs to ensure that this local quadratic approximation constitutes a good approximation of the function not only at the current point but also in a reasonably large neighborhood of that point. The most natural way to obtain such a guarantee is to ensure that the Hessian of the function (which is the basis of our local quadratic approximations) does not change by more than a constant factor in that neighborhood. As a result, the functions we are interested in optimizing in this paper are the ones that satisfy that property in an -ball around the current point. This is formalized in the following definition.
Definition 3.1** (Second-Order Robust w.r.t. ).**
We say that a convex function is second-order robust (SOR) with respect to if, for any such that ,
[TABLE]
Note that the size of the -ball, as well as the exact factor by which the Hessian is allowed to change, are chosen somewhat arbitrarily – all choices of the constants can be made equivalent via an appropriate rescaling. Moreover, even if these quantities are not constant, they would only appear in the running time as a small polynomial factor.
Now, the above definition suggests a natural framework for optimizing such functions. Namely, in every iteration, we optimize a local quadratic approximation of the function within a unit -ball around the current point. As we will see shortly, this approach can be rigorously analyzed. In particular, our key technical result is that if we apply this approach to an SOR function whose Hessians has additionally a special structure, i.e., those for which the Hessian is, essentially, a symmetric diagonally dominant (SDD) matrix, we can implement every iteration in time nearly linear in the number of nonzero entries of the Hessian. This leads to running time bounds captured by the following theorem.
Theorem 3.2** (Minimizing Second-Order Robust Functions w.r.t ).**
Let be a second-order robust (SOR) function with respect to , such that its Hessian is symmetric diagonally dominant (SDD) with nonpositive off-diagonals, and has nonzero entries. Given a starting point we can compute a point , such that , in time
[TABLE]
where is a minimizer of , is the diameter of the corresponding level set of , and is the time required to compute the Hessian.
Note that the bounds provided by the above theorem are, in a sense, the best possible for any kind of approach that relies on repeated minimization of a local approximation of a function in an -ball neighborhood. In particular, as each step can make a progress of at most in -norm towards the optimal solution, one would expect the total number of steps to be .
It turns out that the above theorem is all we need to establish our results for scaling and balancing problems (except the ones relying on the interior point method). That is, these results can be obtained by direct application of the above theorem to an appropriate SOR function. We provide all the details in Section 4.
Now, the first step to proving the above Theorem 3.2 is to view each iteration of our iterative minimization procedure as a call to a certain oracle problem.
Definition 3.3**.**
We say that a procedure is a -oracle for a class of matrices , if on input , where , and , returns a vector satisfying
- (1)
, and 2. (2)
.
Note that the minimum of the left-hand side of Condition (2) above is always non-positive. This is desired, since this expression is supposed to measure our function minimization progress.
Observe that minimizing the function without any constraints on corresponds to solving a linear system . So, one can view the -oracle problem as a certain generalization of linear system solving. Specifically, it is a task in which we aim to find a point in the -ball of diameter around the origin that is closest (in a certain sense) to the solution to that linear system. If is sufficiently small, the -oracle problem corresponds directly to solving that system.
One can view the parameter as the measure of the “quality" of our -oracle. The smaller it is, the faster convergence the overall procedure will have. Importantly, however, the value of impacts only the convergence and not the quality of the final solution. The following theorem makes this relationship precise.
Theorem 3.4**.**
Let be a function that is second-order robust with respect to . Let be a -oracle for , along with an initial point and an accuracy parameter . Let , where is a minimizer of . Then one can produce a solution satisfying using
[TABLE]
calls to .
We present the proof of this theorem in Section A.1 of the Appendix. In Section 5 we design an efficient -oracle, with , for the family of SDD matrices. Combining Theorem 5.11 with Theorem 3.4 immediately gives the proof of Theorem 3.2. We remark that while Theorems 3.2, 3.4 are stated and proved for functions defined over , they can be extended in a straightforward way to hold when is defined over an arbitrary closed, convex set.
4 Matrix Scaling and Balancing
Having developed our main optimization primitives, we can develop efficient algorithms for matrix scaling and matrix balancing. Our approach is essentially the same for both problems, and differs only in technical details.
At the high level, we will construct convex functions with optima corresponding to exact scaling/balancing of the input matrix. Moreover, the gradient of these functions at a specific scaling/balancing of the matrix will be directly related to the quality of this particular scaling/balancing. This will allow us to prove that approximately optimal points correspond to -approximate scaling/balancing. The fact that that these functions are second-order robust with respect to makes it sufficient to apply the optimization method from Section 3. To complete the algorithm and its running time analysis, we need then to address two issues.
Firstly, proving running time bounds for this method requires an upper bound on the radius of the level set of the initial point, i.e. the parameter defined in Theorem 3.4. Depending on the structure of the matrix, there are several different bounds that one can prove, depending only on parameters of the original problem. However, the most interesting case is when we are promised that the exact scaling/balancing of the matrix is “small” (in the sense that the ratio between factors is, say, polynomial). In that case, we can regularize the function to turn this promise into a guarantee for the size of the level set without sacrificing too much accuracy. Moreover, by using a simple doubling approach, we can make the algorithm not require explicit knowledge of the value of this parameter, and it will only appear as a factor in the final running time of the algorithm.
Secondly, we need to ensure that we can efficiently implement -oracles for the Hessians of these functions. In our case, this boils down to proving that these Hessians are SDD matrices with sparsity equal to that of the input matrix, and then build on the existing Laplacian solving work. For the remainder of this section, we define the convex functions that we need optimize, show how to regularize them, and prove bounds on the corresponding parameters. We will describe and analyze the implementation of a -oracle in Section 5.
4.1 Matrix Scaling
We now formally define the scaling problem, along with the notion of -scaling.
Definition 4.1** (Matrix Scaling).**
Let be a nonnegative matrix and be vectors such that , and 333In literature we also encounter this problem for non-square matrices; however solving squares is sufficient, since given , we can reduce to this instance by scaling the square matrix . The upper bound on and is harmless, since for larger values we can always shrink all of , , and by the same factor in order to enforce this constraint.. We say that two nonnegative diagonal matrices and -scale if the matrix satisfies and , i.e. row sums to and column sums to for every .
Definition 4.2** (- scaling).**
Given nonnegative and positive diagonal matrices , we say that is an - scaling (or -scaling, when and are clear from the context) for matrix if the matrix satisfies
[TABLE]
Definition 4.3** (Scalable and Almost-Scalable Matrices).**
A nonnegative matrix , is called -scalable, if there exist and that -scale . It is called almost -scalable if for every , there exist and that - scale .
There are well-known necessary and sufficient conditions about the scalability of stated in the following lemma.
Lemma 4.4** ([33]).**
A nonnegative matrix is exactly -scalable iff for every zero minor of ,
- (1)
. 2. (2)
Equality in (1) holds iff is also a zero minor.
A nonnegative matrix is almost -scalable iff Condition (1) above holds.
We will cast matrix scaling as a convex optimization problem and show that applying the method from section 3 yields a good approximate scaling.
Theorem 4.5**.**
Let be a matrix, that has an scaling . Then, we can compute an - scaling of in time
[TABLE]
This implies that if and are, say, quasi-polynomially bounded, we can find an approximate scaling in nearly linear time. If fact, we can generalize this statement to obtain a similar result for the case of approximate scalings. This is made precise in Theorem 4.6.
4.1.1 Matrix Scaling via Convex Optimization
Recall that we want to encode the matrix scaling problem as a an instance of minimizing of a certain convex function. Given the input matrix , the function we want to consider is:
[TABLE]
We want to argue now that computing an (approximate) scaling of the matrix can indeed be recovered from an (approximate) minimum of the above function. Specifically, we want to establish the following theorem.
Theorem 4.6**.**
Suppose that there exist a point for which and . Then we can compute an - scaling of in time
[TABLE]
The proof is straightforward given the lemmas below and is presented in Section A.4 of the Appendix. First, we will prove that approximate optimality of implies an approximate scaling of the matrix.
Lemma 4.7**.**
Let be an -scalable matrix. Let . Then, a pair of vectors satisfying , for , yields an - scaling of :
[TABLE]
Note that we compare the value of to its infimum, as for the case of almost scalable matrices it is possible that this value is attained only in the limit.
To prove the above lemma, we first look at the first and second order derivatives of .
Lemma 4.8**.**
Let be the matrix obtained by scaling with vectors , i.e. . The gradient and Hessian of satisfy the identities:
[TABLE]
We can observe that any for which is equal to [math] yields diagonal matrices that exactly scale . Moreover, this statement also holds in an approximate sense. One can prove that a large gradient in norm implies that the current point is far from optimal in function value. Making this statement precise, allows us to prove Lemma 4.7. The technical details are presented in Section A.2 of the appendix.
4.1.2 Regularization for Solving via Box-Constrained Newton Method
It is straightforwards to verify that the function we are minimizing (defined in Equation 4.1), satisfies the requirements necessary for us to be able to apply the tools from Section 3.
Lemma 4.9**.**
The function defined in (4.1) is convex, second-order robust with respect to , and its Hessian is SDD.
Proof.
The Hessian of the function (cf. Lemma 4.8) is clearly a Laplacian matrix. Therefore, it is positive semi-definite and thus is convex. To prove that it is second-order robust, we notice that adding some with to the current scaling corresponds to multiplying each row and column by some factor between and . By writing down the quadratic form of the Hessian, , we observe that each will only be multiplied by some factor between and , proving that
[TABLE]
concluding the proof. ∎
One should observe, however, that Theorem 3.4 requires bounding the radius of the entire level set containing our initial point and not merely the distance to some (approximate) minimizer of our function . This means that the existence of an (approximate) minimizer that is close to our initial point is not sufficient to apply Theorem 3.4. To circumvent that problem, we regularize the function by adding to it a term that, on one hand, has a relatively small impact on the additive error we can achieve, but, on the other hand, ensures that the entire relevant level set is contained in some sufficiently small -ball around our initial point. The following lemma makes these statements precise. Its proof appears in Section A.3 of the Appendix.
Lemma 4.10**.**
Let be a point for which and . Then, the regularization of defined as
[TABLE]
satisfies the following properties
- (1)
* is second-order robust with respect to and its Hessian is SDD,* 2. (2)
, and there is a point such that , 3. (3)
for all such that , .
Theorems 4.5 and 4.6 follow from applying Theorem 3.2 to the regularized function defined in (4.2), and then combining it with the guarantees of Lemmas 4.7 and 4.10. The complete proof is presented in Section A.4 of the Appendix. We note that we don’t need an explicit knowledge of an a priori bound on . We can simply run our algorithm repeatedly, doubling our guess at the value of each time. This will not increasing the overall running time by more than a factor of two.
4.1.3 Bounding the Magnitude of the Optimal and Approximately Optimal Scalings for
Doubly Stochastic Scaling
In order to provide bounds for the magnitude of the scaling factors that only depend on the parameters of the initial problem, we refer to the following lemmas from [22] for the case of double stochastic (i.e. (1,1)) scaling.
Lemma 4.11** (Lemma 1 of [22]).**
If is strictly positive, then it can be scaled to doubly stochastic by diagonal matrices , with .
Lemma 4.12** (Corollary 1 of [22]).**
If is scalable, then it can be scaled to doubly stochastic by diagonal matrices , with .
For almost scalable matrices, there can be arbitrarily good solutions, using arbitrarily large scaling factors. To prove bounds on the runtime of finding an approximate doubly-stochastic matrix, we will have to explicitly demonstrate an vector that approximately minimizes function while having small norm.
Lemma 4.13**.**
If is almost-doubly-stochastic scalable, then there exist points such that , such that .
The proof of the lemma is presented in Section A.5.
For the general case of -scaling we refer to the recent lemmas from the parallel work of [1]. The assumption that the scaling targets are integral is mild, since one can approximate real numbers by rational ones which can then by scaled to be integral (the dependence on this scaling is logarithmic).
Lemma 4.14** (Lemma 3.3 of [1]).**
If is almost -scalable with , being integral, then it can be -scaled by diagonal matrices , with .
4.2 Matrix Balancing
Our approach for the balancing problem is completely analogous to the one we used for the scaling problem. There are only minor technical differences. To state them, we first formally define the problem and the notion of approximation we are considering for it.
Definition 4.15** (Matrix Balancing).**
Let be a square nonnegative matrix. We say that is balanced if the sum of each row is equal to the sum of the corresponding column, i.e. . We say that a nonnegative diagonal matrix balances if the matrix is balanced.
Definition 4.16** (-Balanced Matrix [23]).**
We say that a nonnegative matrix is -balanced if
[TABLE]
Observe that this definition is invariant to a global scaling of all the entries of the matrix by some factor. There is a very simple condition that characterizes the set of matrices that can be balanced
Lemma 4.17** ([23]).**
A nonnegative matrix can be balanced if and only if the graph with adjacency matrix is strongly connected.
In the case when the graph is not strongly connected, the matrix can have its rows and columns rearranged so as to be written as a lower triangular block matrix with strongly connected diagonal blocks. The reason no exact balancing exists is that off diagonal block elements will always create imbalances. This, however, is not an obstacle for approximately balancing the matrix. Once we balance the diagonal blocks, we can set all of the off-diagonal block entries to a very small value, say , so that they don’t cause significant imbalances. This corresponds to implicitly scaling the block rows and collumns by a very large amount, making the off-diagonal entries arbitrarily close to zero. Therefore, since the case of matrices that cannot be exactly balanced is easy to detect, and can be easily reduced to the exactly balanceable case, from now on we consider only matrices that can be balanced, and therefore represent strongly connected graphs.
We can now state our main theorem for this section, which follows our initial discussion.
Theorem 4.18**.**
Let be a matrix that can be balanced by the diagonal matrix . Then, we can compute an -approximate balancing of in time
[TABLE]
This immediately implies that if is, say, quasi-polynomially conditioned, we can find an approximate balancing in nearly linear time.
Again, we can generalize this result to hold for approximate balancings. We make this statement precise in Theorem 4.19.
4.2.1 Reducing Matrix Balancing to Convex Optimization
Similarly to the case of the scaling problem, we encode this problem as a minimization of an appropriately constructed convex function. The function we consider here is
[TABLE]
and this function was already defined in [23]. Similarly to the case of matrix scaling, we will show that (approximately) minimizing this function corresponds to (approximately) balancing the matrix . For the rest of this section, we will define to be the infimum value of in its domain, that is . The main theorem of this section is the following.
Theorem 4.19**.**
Suppose that there exists a point such that , and . Then, we can compute an -approximate balancing of in time
[TABLE]
Similarly to the matrix scaling case, the proof of this theorem follows directly from the key lemmas presented below. The proof is presented in Section A.7 of the Appendix. First, we prove that small additive error in function optimization implies an approximate balancing for .
Lemma 4.20**.**
Consider a matrix and the corresponding function . Any vector satisfying yields an -approximate balancing of :
[TABLE]
Proving the lemma requires computing the first and second order derivatives of .
Lemma 4.21**.**
Let be the matrix obtained by balancing with the vector , which corresponds to . The gradient and Hessian of satisfy the identities:
[TABLE]
Intuitively, since the gradient is [math] precisely when the corresponding point produces an exact balancing, a small gradient should imply a good approximate balancing. This guides the proof of Lemma 4.20. We will prove that a large gradient corresponds to being able to significantly decrease the function value, thus contradicting the approximate optimality of the point. The technical details are presented in Section A.6.
4.2.2 Regularization for Solving via Box-Constrained Newton Method
We observe that the function defined in (4.3) satisfies all the conditions required to efficiently minimize it using the method we described in Section 3.
Lemma 4.22**.**
The function is convex, second-order robust with respect to , and its Hessian is SDD.
The method we described in Section 3 depends on a promise concerning the point we initialize it with. Recall that in order to apply Theorem 3.2 we require an upper bound on the size of the -ball containing the level set of the initial point. In order to provide good bounds, we regularize . The description and effect of this regularization in captured in the following lemma.
Lemma 4.23**.**
Suppose that there exists a point such that , and . Then, the regularization of is defined as
[TABLE]
and satisfies the following properties:
* is second-order robust with respect to and has a SDD Hessian,* 2. 2.
, and if is the minimizer of , then , 3. 3.
for all such that , .
The details of the lemma are identical to Lemma 4.10, and we therefore omit the proof. In particular, this lemma implies that approximately optimizing the regularized function will still produce an approximately balanced matrix.
Theorem 4.18 then follows by applying Theorem 3.2 to the regularized function defined in Lemma 4.23, and combining it with the error guarantee of Lemma 4.20. The details are presented in Section A.7 of the Appendix. Similarly to the case of the scaling problem, we don’t need to know any a priori bound on . Just trying increasingly larger value of (i.e., doubling our guess at each iteration) is sufficient.
4.2.3 Bounding the Condition Number of the Optimal Balancing
As we saw above, the running time given by Theorem 4.18 depends logarithmically on , where is the matrix that achieves the optimal balancing. While, in general can be exponentially large (and therefore we might be better off running the interior point method described in Section 6), tighter connectivity of the graph implies better bounds:
Lemma 4.24**.**
Let be a nonnegative matrix. Suppose that the graph with adjacency matrix is strongly connected, and every vertex can reach every other vertex within at most hops. Then the matrix that perfectly balances has .
The proof of the lemma is in Section A.8 of the Appendix and it yields the following upper bound on the value of .
Corollary 4.25**.**
If is a balanceable matrix, and perfectly balances it, then . If is strictly positive, then .
4.3 Discussion of Numerical Precision Aspects
The exposition of the analysis so far is under the assumption of exact arithmetic. However, our algorithms do in fact tolerate finite fixed-point precision on the scale of the natural parameters of the problem (, , and ). It is therefore sufficient to use a number of bits that is logarithmic in the input parameters of the problem.
Between iterations, we store a fixed-point representation of the variables . These are, by construction, bounded by the parameter . It is important (at least if using fixed-point rather than floating-point) that we are storing the rather than the actual scalings .
When iterating, we first determine the post-scaling elements of the matrix. These can also simply be stored in fixed-point–i.e. up to additive error. Note that this rounding could completely eliminate very small entries of the matrix. This representation then gives us the gradient and Hessian of the problem up to additive error. To make the additive error polynomially small only a logarithmic number of bits are needed, because the entries of the scaled matrix can never be more than polynomial (in the natural parameters mentioned). This follows from the fact that the objective function, which includes the sum of all the entries, cannot increase.
Finally, there is an polynomially small absolute lower bound on the eigenvalues of the Hessian, simply from the regularizer itself. This ensures that additive error to the gradient and Hessian can only affect the function value improvement by a polynomially larger amount, and ensures the stability of the -oracle algorithm. Thus polynomially small error is sufficient, requiring only logarithmically many bits.
4.4 Matrix Scaling and Balancing as Nonlinear Flow Problems
An intriguing property of the matrix scaling and matrix balancing problems is that they both can be phrased as an instance of a more general problem. This problem can be seen as generalization of the electrical flow problem. That is, the problem of finding a potential-induced flow that routes a fixed demand in the case when Ohm’s Law, i.e., the relationship between the potential difference on a given edge and the flow flowing through it is exponential instead of being linear. (See [11] for a comprehensive treatment of such nonlinear networks.) To see this, given a weighted directed graph let us define the edge-vertex incidence matrix being an matrix with rows indexed by edges and columns indexed by vertices such that
[TABLE]
Using this matrix we define the nonlinear operator as follows.
Definition 4.26**.**
Let be a directed graph with vertex-edge incidence matrix , and let . We define the operator associated with as
[TABLE]
This can be seen as a nonlinear generalization of the Laplacian operator, which is a linear operator defined as . There is extensive literature on solving Laplacian linear systems [51, 25, 26, 24, 8, 28, 29]. We argue that our framework can be used to solve systems of the form
[TABLE]
This can be seen as finding vertex potentials which induce a flow vector :
[TABLE]
such that routes a given demand . This should be contrasted with the case of electrical flows where the flow is induced as . As it turns out, the solution to the system is the minimizer of a function similar to those defined in Equations 4.1 and 4.3. More precisely:
Lemma 4.27**.**
Let be a directed graph with nonnegative weights, let be its adjacency matrix, and let be the operator associated with , as defined as in Equation 4.5. Consider the function defined as
[TABLE]
Then has a minimizer if and only if it is the solution to the system .
The proof follows directly from writing optimality conditions for , noting that the condition that is equivalent to . Similarly to Theorem 4.6 and Theorem 4.19, we can provide conditions on function value error to bound the error . Also, in order to obtain a good running time, we require regularizing in a manner similar to the regularization applied in Lemmas 4.10 and 4.23. Note that in the case of the scaling and balancing problems, since we require problem specific error guarantees, the regularization needs to be customized accordingly.
Finally, we state without proof that balancing and scaling are instances of solving .
Observation 4.28**.**
Let be a balanceable nonnegative matrix. Let be the nonlinear operator associated with the graph with adjacency matrix . Then the solution to yields a balancing .
Observation 4.29**.**
Let be a -scalable nonnegative matrix. Let be the nonlinear operator associated with the graph with adjacency matrix . Then the solution , with to yields a -scaling .
5 Implementing an -Oracle in Nearly Linear Time
In Section 4 we reduced the balancing and scaling problems to the approximate minimization of second-order robust functions with respect to the norm. All that is left to have a complete algorithm, we need a fast procedure to implement a -oracle as in Definition 3.3. Namely, show how to construct an -oracle for the problem,
[TABLE]
where is an SDD matrix. For this section, whenever we say that a matrix is SDD we will also imply that the off-diagonal entries are nonpositive.
One possible approach, is to use standard convex optimization reductions to turn this problem into the minimization of the maximum of an norm and an norm subject to linear constraints. This problem can be solved in time using the multiplicative weights framework as applied in [7, 6]. The resulting algorithm for implementing the -oracle would take time , by taking advantage of spectral sparsification algorithms [49, 32, 31]. Instead, we will come up with a faster algorithm.
Our approach, based on the Lee-Peng-Spielman solver [30], is to identify large sets of vertices where the problem is “easy” to solve and then deal with the rest of the graph (reduced in size) recursively. The particular notion of “easy” we are going to use, is that of strong diagonal dominance.
Definition 5.1**.**
A matrix is -strongly diagonally dominant (-SDD), if for all
[TABLE]
The reason that such matrices enable us to solve the corresponding problems fast is that they can be well-approximated by a diagonal matrix.
Lemma 5.2**.**
Every -SDD matrix , with diagonal , satisfies
[TABLE]
Proof.
This follows from the fact that
[TABLE]
Applying this to each off-diagonal entry, the off-diagonal part of the matrix will be bounded between diagonal matrices; by the -SDD property these can be bounded by . ∎
In our context, problems in the form of Equation 5.1, where is an -SDD matrix for some , can be turned into well conditioned quadratic minimization problems for which we can apply standard linearly convergent algorithms. For a more detailed description and analysis of such algorithms can be found in [38].
Lemma 5.3**.**
There is an algorithm FastSolve, that given an -SDD matrix , and , returns a point , such that , and
[TABLE]
in time , where is the number of nonzero entries of .
Proof.
By Lemma 5.2, there is some diagonal matrix such that
[TABLE]
By applying the transformation , the problem becomes
[TABLE]
We will apply proximal gradient descent, defined by the sequence and
[TABLE]
Computing from corresponds to computing
[TABLE]
and projecting it to the space , by simply trunctating any coordinates exceeding the bounds. We can clearly implement each iteration in linear time in the number of nonzeros of . Since the condition number of the function is at most , such a step will imply that
[TABLE]
and thus inductively,
[TABLE]
Therefore, after steps we will have a point with The fact that concludes the proof. ∎
An even simpler case is when the matrix is of size 1, in which case the problem can be exactly solved in constant time:
Lemma 5.4**.**
There is an algorithm TrivialSolve, that given a 1 by 1 matrix returns an optimizing over the interval .
Proof.
By convexity, there must exist an optimal that is either one of the two endpoints of the interval, or the unique global optimum of the function over the line. One may simply check all candidates and return the best value. ∎
A key insight of [30] is that one can find -SDD submatrices of of size . We denote such a subset by and by . To ensure that solving the problem for will not interfere with our solution we map a solution supported only on coordinates of to a solution through a linear mapping . If were the energy minimizing extension of voltages on to voltages on ,
[TABLE]
we would have that and are -orthogonal, since . Then, optimizing over involves the quadratic which is exactly equal to . Applying this proccess recursively leads to the notion of vertex sparsifier chains that we will heavily rely on.
Definition 5.5** (Definition 5.7 of [30]).**
For any matrix , a vertex sparsifier chain of with parameters and , is a sequence of matrices and subsets such that
, 2. 2.
, 3. 3.
is -strongly diagonally dominant, and 4. 4.
has size 1.
Note that this last requirement is slightly different from [30]: we require the chain to end with size 1, rather than just being constant. However, the chain construction from [30] immediately extends to this requirement (they presumably proposed stopping early because it is a simple optimization that would likely be valuable in any implementation).
To be able to reason about the approximation guarantees of the chain as a whole we will use an error-quantifying definition.
Definition 5.6** (Definition 5.9 of [30]).**
An -vertex sparsifier chain of an SDD matrix of work , is a vertex sparsifier chain of with parameters and that satisfies
, 2. 2.
, where is the number of nonzeros in .
Finally, the construction of such chains, as well as their error guarantees have been already analyzed in [30] and can be used in a black-box manner.
Theorem 5.7** (Theorem 5.10 of [30]).**
Every SDD matrix of dimension has a -vertex sparsifier chain of work and , for any constant . Such a chain can be constructed in time, .
We note that Theorem 5.7 was stated for , but it is straightforward to modify to proof without changing the work or the length of the chain by more than a constant factor.
Since we cannot exactly compute the energy minizing mapping , we will define an approximate mapping that suffices for our purposes.
Definition 5.8**.**
A linear mapping is an -approximate voltage extension from to according to if for any ,
, 2. 2.
is the identity on coordinates in 3. 3.
the coordinates of are convex combination of the coordinates of and 0.
where is the energy minimizing extension.
We will construct such a mapping through a simple averaging scheme. First we set the voltage of every vertex in to be the weighted average of its neighbors in . Then at every step we replace its voltage by the weighted average of all its neighbors. (Here, excess diagonal is treated as an edge to a vertex with voltage 0.) We do so for iterations. We formally describe the procedure in Figure 5.1.
It is easy to see that all steps of the algorithm are linear maps, and we can therefore also implement its transpose.
Lemma 5.9**.**
For any SDD matrix , given an -SDD subset and some one can apply an -approximate voltage extension mapping in time .
Proof.
The linearity of the mapping, properties 2 and 3, as well as the runtime claimed hold inductively by the construction of the mapping. We will now prove property 1, namely, if is the true energy minimizing mapping and is our approximate mapping, for any , .
First, we will bound the error of . We define , which is 0 outside by construction. We will use the notation –i.e. the edge weight between and , and . Here, accounts for “excess diagonal” of , which we treat as an edge to a “virtual vertex” . We will also define . Now, we have
[TABLE]
Rearranging gives .
Next, we note that . This follows from the fact that
[TABLE]
since is 0 on . Applying Lemma 5.2, we get
[TABLE]
This implies that
[TABLE]
By induction, we have
[TABLE]
Finally, using Lemma 5.2 again, we get
[TABLE]
∎
Given such a mapping, we can uniquely express any voltage vector as , where (i.e. it is in the image of ) and is supported on . By the convex combination property of , we have ; since , by the triangle inequality we have . The domain is therefore contained in . Moreover, any point for which , , corresponds to a point with .
Having expressed all of the components of our approach, stating the algorithm is simple. Given the decomposition of the problem the vertex sparsifier chain provides, we will solve the smallest problem and then iteratively combine it with the solution of the submatrices along the chain. The algorithm is formally described in Figure 5.2, and the main claim in Theorem 5.11.
To facilitate the analysis we first state the following decoupling lemma.
Lemma 5.10**.**
Consider an SDD matrix , a partition of its columns and some . Let be an -approximate voltage extension from to as define in Definition 5.8. Then
[TABLE]
Proof.
Since is the true energy minimizing extension, we know that , will be zero on the coordinates of . Since is supported on (property 2 of Definition 5.8), we can expand
[TABLE]
The first property of in Definition 5.8, implies that
[TABLE]
We can upper bound the contribution of the cross-term as
[TABLE]
Similarly, we can lower bound the contribution of the cross-term as
[TABLE]
Combining these inequalities and rearranging terms concludes the proof. ∎
We can now use this lemma to prove that by decoupling the problem and using approximate Schur complements does not reduce the quality of the solution by more than a constant.
Theorem 5.11**.**
Algorithm OptimizeChain implements a -oracle, and runs in time .
Proof.
By construction and the triangle inequality we have that
[TABLE]
We will define the following functions to reason about our approximation guarantees:
[TABLE]
We can now derive the following facts about these functions:
Directly applying Lemma 5.10, for any , ,
[TABLE] 2. 2.
Using the fact that any with can be written as for and , and Lemma 5.10,
[TABLE] 3. 3.
By the definition of the vertex sparsifier chain, , implying for any
[TABLE] 4. 4.
Again, from the fact that ,
[TABLE] 5. 5.
By the definition of ,
[TABLE]
We are going to combine these facts to show that for any ,
[TABLE]
We procceed by induction from to . Recall that .
For the case of it trivially holds by the guarantee of TrivialSolve:
[TABLE]
Assuming that it holds for any , by the guarantees of FastSolve:
[TABLE]
Finally, we can similarly argue about : it is at most while . By the guarantees of the vertex sparsifier chain, we know that for some constant of our choice. By choosing the right constant we can ensure that our multiplicative error is less than .
In order to bound the runtime we notice that for every we need to compute ApproxMapping and FastSolve which both take time . By the bound on the work of the chain we get that applying the chain takes time , while constructing it takes time . ∎
6 Matrix Scaling and Balancing with Exponential Cone Programming
The algorithm developed in the previous sections is essentially optimal in the regime where the ratio between the scaling factors is relatively small (say polynomial in ). Since there are matrices for which this ratio is exponential, we develop a complementary algorithm with negligible runtime dependence on this ratio, at the cost of a mild increase in the dependence on . The algorithm is based on interior point methods.
Although interior point methods would seem like a natural option for the problems of matrix scaling and balancing, standard formulations require solving linear systems involving various rescalings of the input matrix. A priori, it is not clear whether these can be solved faster than matrix multiplication time. However, it turns out that a somewhat nonstandard formulation requires solving linear systems for more structured matrices. Particularly, we will see that these matrices admit a decomposition involving only matrices that are easy to invert (triangular matrices, solvable by back substitution, and SDD matrices which can be tackled via a standard Laplacian solver). Notably, a similar observation was made by Daitch and Spielman [10], in the case of interior point methods applied to flow problems on graphs. [22] also consider a formulation similar to ours for the matrix scaling problem, however they don’t prove exact convergence bounds or state the algorithm rigorously. Moreover, since nearly-linear SDD solvers where not known at the time, this algorithm provides no benefit compared to other approaches.
The main result of this section is the following.
Theorem 6.1**.**
Given a nonnegative matrix , one can:
compute an -balancing in time
[TABLE] 2. 2.
if the matrix is almost -scalable, compute a --scaling in time
[TABLE]
This is as a matter of fact a consequence of the fact that a specific class of functions, which capture both balancing and scaling, can be minimized efficiently. We capture this result in the following Theorem.
Theorem 6.2**.**
Let be a nonnegative matrix with nonzero entries, let be the function
[TABLE]
and let be a positive real number. There exists an algorithm which, for any , finds a vector such that (where is the optimum of over the region ) in time
[TABLE]
Using this result, one can then conclude the proof of Theorem 6.1.
Proof of Theorem 6.1.
For the balancing objective, we first decompose the nonzero entries of into strongly connected components. For each component, we will call 6.2 with . From Corollary 4.25 we have that . Plugging this in, along with Lemma 4.20, we obtain a total running time of .
For the --scaling objective, we set , and run the interior point method on the matrix . Lemma 4.14 ensures that the entries of the scalings exist within a polynomially bounded -ball. Using Lemma 4.7, this yields the conclusion. ∎
We prove Theorem 6.2 by showing that an interior point method defined and analyzed by Nesterov [38] can be efficiently implemented. In order to do so, we require two components. The former involves providing a formulation for minimizing the function in F1 for which the interior point method can produce an iterate that is close in value to optimum within a small number of iterations. The latter involves showing how to efficiently implement these iterations. Generally they involve solving a linear system; in our case, we show that such iterations can be executed by solving an SDD linear system to constant accuracy.
6.1 Setting Up the Interior Point Method, and Bounding the Number of Steps
In order to apply an interior point method, we first reformulate the problem in F1 in an equivalent form.
Lemma 6.3** (Equivalent Formulation).**
Let be a promise on the magnitude of the entries in the optimal solution of F1:
[TABLE]
Also, let
[TABLE]
Then the objective
[TABLE]
has an identical value and solution to F1.
Proof.
Given the promise, the bounds on are redundant. This is also the case with the upper bounds on , since setting and yields a solution of value . Therefore, setting , the value of must be at most . ∎
The second step is to replace the hard constraints with appropriate barrier functions, whose value blows up when approaching the boundary of the feasible set (see [2, 38] for more details). More precisely, we consider the barrier functions
[TABLE]
for all , and
[TABLE]
The former blow up when approaches from above, and are standard in exponential cone programming [2]. The barrier handles all the other inequality constraints. Very importantly, all these barrier functions are well behaved, in the sense that they satisfy a required property called self-concordance. Since this property defines the number of iterations the method needs to execute, we highlight it below.
Fact 6.4**.**
The function is an -self-concordant barrier for the set defined in F2.
With the barrier function set up, the method has to solve a sequence of subproblems of the form
[TABLE]
while increasing until it becomes sufficiently large that the solution we produce is close to the optimum of the initial constrained problem.
What is essential here is the number of iterations of the method, which depends mostly on the quality of the barrier function, and little on in initialization and accuracy to which we want to solve. More precisely, we apply the following theorem which follows from [38], Theorems 4.2.9 and 4.2.11.
Theorem 6.5**.**
Given an initial point in the strict interior of with a -self-concordant barrier , the problem can be solved to within additive error in
[TABLE]
iterations, where is the minimizer over of .
In what follows we bound the quantities involved in the above statement. In order to have a bound on the number of iterations required for our cone program, we require lower bounding . In order to do so, we lower bound the Hessian everywhere.
Lemma 6.6**.**
The Hessian is lower bounded everywhere by the diagonal matrix with on variables and on on variables.
Proof.
Since , and by calculation we see that is diagonal, and
[TABLE]
Therefore we have that this gives a lower bound on , and thus on . ∎
We also show how to pick the initial point for our particular problem, which turns out to be a trivial task, since the only requirement is that it lies in the strict interior of . The more challenging part is upper bounding the at that point in Hessian inverse norm.
Lemma 6.7**.**
The point , where , , belongs to the strict interior of . Furthermore, .
Proof.
First we verify that the point belongs to the strict interior. As we set , no constraint on is tight. .
For the second part, we may simply bound the contribution from each term of the barrier to each entry of the gradient. The entries end up bounded by at most , while the entries end up bounded by , providing the claimed bound. ∎
6.2 Implementing an Iteration of the Interior Point Method
The steps mentioned in the statement of Theorem 6.2 consist only of standard Newton steps, i.e. minimizing a second order local approximation of the function . These steps are generally expensive, since they involve applying the inverse of to a vector. In our case, fortunately, we are able to exploit the structure of in order to do this in nearly linear time in the sparsity of .
Below we give a precise statement concerning our ability to solve linear systems involving the Hessian matrix.
Theorem 6.8**.**
For any , and any , where is a point in the strict interior of the feasible region (see F2), and any vector , one can, with high probability, compute in time a vector such that , where is the solution to .
In order to achieve this result, we leverage the power of Laplacian solvers. From the algorithmic point of view, the crucial property of the Laplacian is that it is symmetric and diagonally dominant. This enables us to use fast approximate solvers for symmetric and diagonally dominant linear systems. Namely, there is a long line of work [51, 25, 26, 24, 8, 28, 29] that builds on an earlier work of Vaidya [52] and Spielman and Teng [50], that designed an SDD linear system solver. We employ as a black box the following theorem, which follows from [26], and constructs an operator that approximates .
Theorem 6.9**.**
For any , and any SDD matrix with nonzero entries, and any vector in the image of , one can, with high probability, compute in time a vector such that , where is the solution of . Furthermore, a given choice of random bits produces a correct result for all simultaneously, and makes linear in .
The result follows using this tool, and the following structural lemma, whose proof can be found in Appendix A.10.
Lemma 6.10**.**
The Hessian has a factorization
[TABLE]
where is SDD, is lower triangular, and each of then has nonzero entries. Furthermore, this factorization can be computed in time.
With this in hand, proving Theorem 6.8 is immediate. First note that, given , we can actually choose a satisfying the needed properties: a linear-operator based graph Laplacian solver, such as [26]. Having access to the linear operator , we consider the error produced by applying the operator :
[TABLE]
6.3 Error Tolerance
While the classical analysis of Newton’s method used for iterations of interior point methods assumes exact computations, in our case the Laplacian solver we employ adds some error. We quickly show that this error does not hurt us, and as a matter of fact it is sufficient to solve these systems to constant accuracy.
First, we require understanding the guarantees of Newton’s method, and its requirements.
Fact 6.11** (Progress via Newton Steps).**
Let be a point in the interior of the feasible region such that
[TABLE]
Then, applying one step of the interior point method consists of producing a new iterate
[TABLE]
In order to make progress it is sufficient that .
We can easily show that applying the inverse matrix with the solver guarantees we give in Theorem 6.8 is sufficient in order to make progress.
Lemma 6.12**.**
Let be a point in the interior of the feasible region such that
[TABLE]
Letting such that
[TABLE]
for , we get that
[TABLE]
Since the proof is rather standard, we defer it to Appendix A.9.
6.4 Putting Everything Together
We can combine the results from this section in order to provide a proof for Theorem 6.2.
Proof of Theorem 6.2.
By combining Theorem 6.5, along with Fact 6.4, Lemma 6.6 and Lemma 6.7, we see that we can approximately minimize the function defined in Equation F1 by performing
[TABLE]
iterations of the interior point method referenced in Theorem 6.5.
From Theorem 6.8 and the iteration accuracy required by Fact 6.11 and Lemma 6.12 we see that each iteration of the interior point method referenced in Theorem 6.2 can be implemented in time . This yields the conclusion. ∎
Appendix A Deferred Proofs
A.1 Proof of Theorem 3.4
Proof of Theorem 3.4.
The iteration we are going to implement is
[TABLE]
Since is second-order robust with respect to by definition (see Definition 3.1), we know that within an -ball centered at the function is lower and upper bounded by and , respectively, where:
[TABLE]
Also, define and to be the minimizers of and , respectively, over the -ball of radius centered at , i.e.
[TABLE]
Next, we see how much decreases when we move from to , where is obtained via the oracle call . We know from Definition 3.3 that
[TABLE]
as the function the oracle is approximately minimizing is precisely that. Expanding we have
[TABLE]
Since , we see that
[TABLE]
Also, we have that
[TABLE]
Combining this with Equation A.3 gives
[TABLE]
Finally, we show that this amount of progress is comparable to that achievable by making a large step towards . More precisely, we have from the condition that . Thus, letting , we have that . Therefore, , since was a minimizer of over the -ball of radius around . Also, since lower bounds over this -ball,
[TABLE]
Combining Equations A.4 and A.5, we see that
[TABLE]
where the last inequality follows from convexity. This implies that at every iteration is decreased by a factor of , implying that after
[TABLE]
iterations, we have that . ∎
A.2 Proof of Lemma 4.7
Proof of Lemma 4.7.
Suppose that, without loss of generality, the row of has a very large violation of the scaling constraint: letting we have .
In order to improve the solution, we can make an update to the corresponding coordinate of which makes the largest possible improvement in function value. More precisely by setting , and whenever , we have that
[TABLE]
Optimizing for the largest possible decrease, we set which shows that we can decrease by
[TABLE]
Since we have , we can lower bound the improvement by
[TABLE]
whenever , and by
[TABLE]
whenever .
Since by assumption , this change improves function value by at least , which contradicts the fact that . Therefore all rows and columns are within away from being correctly scaled. Hence this is a - scaling.
∎
A.3 Proof of Lemma 4.10
Proof.
The proof is similar to the one for Lemma 4.23. The first point holds by the same argument.
For the third part, all for which must satisfy:
[TABLE]
and similarly for . Therefore
[TABLE]
and similarly for .
The second part follows by the nonnegativity of the regularizer and the observation that . By the third property we know that the level set in bounded and thus attains its minimum, and that minimum can only be better that , which concludes the proof. ∎
A.4 Proof of Theorems 4.5 and 4.6
Proof of Theorem 4.6.
By Lemma 4.7 and Lemma 4.10 we get that in order to obtain a -approximate scaling, it is sufficient to minimize up to additive error. Furthermore, from Lemma 4.10 we get that the bound required for Theorem 3.2 is . Finally, since and , we see that, initializing at , the total running time of the method is upper bounded by
[TABLE]
∎
Proof of Theorem 4.5.
We can directly prove this theorem by applying Theorem 4.6 to the optimal solution promised. Since exactly -scales , we know that it must be a minimizer of and thus . Moreover, by definition we have the bound, , which concludes that proof. ∎
A.5 Proof of Lemma 4.13
Proof.
By Lemma 4.4, we know that any almost scalable matrix can be written as a block lower triangular matrix, whose diagonal blocks are exactly scalable. By Lemma 4.12, every such block can be scaled to doubly stochastic, using factors with a ratio at most , where is the number of vertices in block .
The infimum of the function value is exactly the sum of the function values for the diagonal block problems, since the contribution of the entries below the diagonal can be made arbitrarily close to [math]. We observe that it suffices to ensure that the contribution of each such edge is at most , since then the total contribution will be at most which is the additive error we can tolerate. Scaling the off-diagonal entries can be done in a very simple way. For any block, we can scale all the columns down by a fixed amount and all the columns up by the same amount. This will not affect the contribution of the block’s entries to the function and will only decrease the contribution of all the off-diagonal blocks in the same columns. By choosing the ratio between any two consecutive blocks to be , we can ensure that the entries contained in the interesection of the rows and columns of these blocks contribute less than each. That ratio between any two factors of this new scaling is at most
[TABLE]
∎
A.6 Proof of Lemma 4.20
Proof of Lemma 4.20.
First we observe that since the Hessian of is SDD, it is spectrally upper bounded by two times its diagonal and therefore by the identity matrix multiplied by twice the trace, that is . Since, by construction, , we have that
[TABLE]
Therefore for any with , we have that for some
[TABLE]
It is straightforward to reason that
[TABLE]
and thus,
[TABLE]
Finally, we lower bound . Since the matrix can be balanced, its corresponding graph is strongly connected. Therefore it contains a cycle, and thus some edge satisfies . Hence . Plugging in this lower bound, we get that
[TABLE]
Hence
[TABLE]
which is equivalent to the fact that yields an -balancing for . We note that a similar bound also follows from [40], using a different argument. ∎
A.7 Proof of Theorems 4.18 and
Proof of Theorem 4.19.
By Lemma 4.20 and Lemma 4.23 we get that optimizing up to an additive error of , suffices to get an -balancing of the matrix.
Furthermore, from Lemma 4.23 we see that the bound required for Theorem 3.2 is . Finally, using the fact that , we see that, initializing at , the total running time of the method is
[TABLE]
∎
Proof of Theorem 4.18.
Having proved Theorem 4.19, this theorem is a simple corollary. Consider to be the vector such that . That implies that and therefore (by the convexity of ) is a minimizer of implying that . Moreover, , which concludes that proof by applying Theorem 4.19. ∎
A.8 Proof of Lemma 4.24
Proof.
Consider the optimal solution to the optimization problem described in Equation 4.3, for which we know that via Lemma 4.20. Since this is a minimizer, we know that
[TABLE]
Therefore, for any , one has that
[TABLE]
Since there is a directed path of length at most from any vertex to any other, we get that
[TABLE]
∎
A.9 Proof of Lemma 6.12
Proof.
The first part is a standard property of Newton’s method applied to self-concordant functions. We refer the reader to [2] for details.
What we want to prove is that Newton’s is robust to errors in the solution to the linear system involving the Hessian. Indeed, first we see that the Hessian at approximates the one at . To simplify notation, we write . Since is self-concordant, we have that
[TABLE]
and similarly
[TABLE]
the latter of which can be written equivalently as
[TABLE]
so
[TABLE]
The error guarantee on equivalently gives us that
[TABLE]
so combining with A.8 we obtain that
[TABLE]
Also, since for any
[TABLE]
we get, by applying triangle inequality and A.7, that
[TABLE]
Therefore, using A.6 and A.11 where we substitute for :
[TABLE]
∎
A.10 Proof of Lemma 6.10
Proof.
First we note that the nonzero submatrix of is (where rows/columns correspond to , , , in this order):
[TABLE]
such that
[TABLE]
Furthermore, this submatrix can be factored, by Schur complementing the last row and column, as:
[TABLE]
and thus one can easily notice that the Schur complement is SDD.
Furthermore, since is a standard logarithmic barrier, its Hessian is a diagonal matrix with nonnegative entries. Therefore, we can split the contribution of the diagonal matrix into pieces which contains nonzeroes only at , and . In other words, we can write . Since the Schur complement of of the matrix is SDD, we also have that the Schur complement of of the matrix is SDD, so the matrix can also be factored similarly to the factoring above, and all of these factorizations can be computed in overall time.
Finally, since each of these factorizations is computed by Schur complementing a unique , which is nonzero in a single matrix, we see that the Schur complement of the block of the matrix is equal to the sum of Schur complements of the block of the matrices . This holds similarly, for the corresponding lower and upper diagonal matrices, which yields the desired factorization simply by summing up. ∎
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] Zeyuan Allen-Zhu, Yuanzhi Li, Rafael Oliveira, and Avi Wigderson. Much faster algorithms for matrix scaling. ar Xiv preprint ar Xiv:1704.02315 , 2017.
- 2[2] Aharon Ben-Tal and Arkadi Nemirovski. Lectures on modern convex optimization: analysis, algorithms, and engineering applications . SIAM, 2001.
- 3[3] Susan Blackford. Balancing and conditioning. http://www.netlib.org/lapack/lug/node 94.html .
- 4[4] David T Brown. A note on approximations to discrete probability distributions. Information and Control , 2(4):386–392, 1959.
- 5[5] Tzu-Yi Chen and James W Demmel. Balancing sparse matrices for computing eigenvalues. Linear algebra and its applications , 309(1-3):261–287, 2000.
- 6[6] Hui Han Chin, Aleksander Madry, Gary L. Miller, and Richard Peng. Runtime guarantees for regression problems. In Innovations in Theoretical Computer Science, ITCS ’13, Berkeley, CA, USA, January 9-12, 2013 , pages 269–282, 2013.
- 7[7] Paul Christiano, Jonathan A. Kelner, Aleksander Madry, Daniel A. Spielman, and Shang-Hua Teng. Electrical flows, Laplacian systems, and faster approximation of maximum flow in undirected graphs. In STOC’11: Proceedings of the 43rd ACM Symposium on Theory of Computing , pages 273–282, 2011.
- 8[8] Michael B. Cohen, Rasmus Kyng, Gary L. Miller, Jakub W. Pachocki, Richard Peng, Anup B. Rao, and Shen Chen Xu. Solving sdd linear systems in nearly m log 1 / 2 n superscript 1 2 𝑛 \log^{1/2}n time. In STOC’14: Proceedings of the 46th Annual ACM Symposium on Theory of Computing , pages 343–352, 2014.
