Matrix Scaling and Balancing via Box Constrained Newton's Method and Interior Point Methods

Michael B. Cohen; Aleksander Madry; Dimitris Tsipras; Adrian Vladu

arXiv:1704.02310·cs.DS·August 22, 2017

TL;DR

This paper introduces nearly-linear and near-quadratic time algorithms for matrix scaling and balancing, utilizing a new second-order optimization framework and interior point methods, advancing computational efficiency in scientific computing.

Contribution

The paper develops a unified second-order optimization framework and algorithms that improve the efficiency of matrix scaling and balancing, especially for matrices with quasi-polynomial condition ratios.

Findings

01

Algorithms run in nearly-linear time for quasi-polynomial condition ratios.

02

An interior point method achieves near-quadratic time complexity.

03

The framework generalizes Laplacian system solving for broader function classes.

Equations

$$\widetilde{O}\left(m\log\left(\kappa(\mathbf{U}^{*})+\kappa(\mathbf{V}^{*})\right)\log^{2}\frac{s_{\mathbf{A}}}{\varepsilon}\right),$$

$$\widetilde{O}\left(m\log\kappa(\mathbf{D}^{*})\log^{2}\frac{w_{\mathbf{A}}}{\varepsilon}\right),$$

$$\widetilde{O}\left(m^{3/2}\log\frac{w_{\mathbf{A}}}{\varepsilon}\right),$$

$$O\left((kR_{\infty}+1)\log\left(\frac{f(x_{0})-f(x^{*})}{\varepsilon}\right)\right)$$

$$\mathbf{A}_{ii}\geq\sum_{j\neq i}\left|\mathbf{A}_{ij}\right|.$$

$$\mathbf{A}=\begin{bmatrix}\mathbf{A}_{[F,F]}&\mathbf{A}_{[F,C]}\\ \mathbf{A}_{[C,F]}&\mathbf{A}_{[C,C]}\end{bmatrix}.$$

$$\mathrm{Sc}(\mathbf{A},F)\stackrel{\mathrm{def}}{=}\mathbf{A}_{[C,C]}-\mathbf{A}_{[C,F]}\mathbf{A}_{[F,F]}^{-1}\mathbf{A}_{[F,C]}.$$

$$\nabla^{2}f(x)\approx_{2}\nabla^{2}f(y),\quad\text{that is,}\quad\frac{1}{e^{2}}\nabla^{2}f(x)\preccurlyeq\nabla^{2}f(y)\preccurlyeq e^{2}\nabla^{2}f(x).$$

$$\widetilde{O}\left((m+T)R_{\infty}\log\left(\frac{f(x_{0})-f(x^{*})}{\varepsilon}\right)\right),$$

$$\|r_{\mathbf{M}}-r\|_{2}^{2}+\|c_{\mathbf{M}}-c\|_{2}^{2}\leq\varepsilon.$$

$$\widetilde{O}\left(m\log\left(\kappa(\mathbf{U}^{*})+\kappa(\mathbf{V}^{*})\right)\log^{2}(s_{\mathbf{A}}/\varepsilon)\right).$$

$$f(x,y)=\sum_{1\leq i,j\leq n}\mathbf{A}_{ij}e^{x_{i}-y_{j}}-\left(\sum_{1\leq i\leq n}r_{i}x_{i}-\sum_{1\leq j\leq n}c_{j}y_{j}\right).$$

$$\widetilde{O}\left(mB\log^{2}(s_{\mathbf{A}}/\varepsilon)\right).$$

$$\mathbf{M}=\mathbb{D}(\exp(x))\cdot\mathbf{A}\cdot\mathbb{D}(\exp(y)).$$

$$\nabla f(x,y)$$

$$\nabla^{2}f(x,y)$$

$$\frac{1}{e^{2}}\nabla^{2}f(x)\preccurlyeq\nabla^{2}f(x+z)\preccurlyeq e^{2}\nabla^{2}f(x),$$

$$\widetilde{f}(x,y)=f(x,y)+\frac{\varepsilon^{2}}{36n^{2}e^{B}}\left(\sum_{i}\left(e^{x_{i}}+e^{-x_{i}}\right)+\sum_{j}\left(e^{y_{j}}-e^{-y_{j}}\right)\right)$$

$$\frac{\|r_{\mathbf{M}}-c_{\mathbf{M}}\|_{2}}{\sum_{1\leq i,j\leq n}\mathbf{M}_{ij}}=\frac{\sqrt{\sum_{i=1}^{n}\left((r_{\mathbf{M}})_{i}-(c_{\mathbf{M}})_{i}\right)^{2}}}{\sum_{1\leq i,j\leq n}\mathbf{M}_{ij}}\leq\varepsilon.$$

$$\widetilde{O}\left(m\log\kappa(\mathbf{D}^{*})\log^{2}(w_{\mathbf{A}}/\varepsilon)\right).$$

$$f(x)=\sum_{1\leq i,j\leq n}\mathbf{A}_{ij}e^{x_{i}-x_{j}},$$

$$\widetilde{O}\left(mB\log^{2}(w_{\mathbf{A}}/\varepsilon)\right).$$

$$\mathbf{M}=\mathbb{D}(\exp(x))\cdot\mathbf{A}\cdot\mathbb{D}(\exp(-x)).$$

$$\nabla f(x)$$

$$\nabla^{2}f(x)$$

$$\widetilde{f}(x)=f(x)+\frac{\varepsilon^{2}\ell_{\mathbf{A}}}{48ne^{B}}\sum_{i=1}^{n}\left(e^{x_{i}}+e^{-x_{i}}\right)$$

$$\mathbf{B}_{v,e}=\begin{cases}1&\text{if }e=(v,u)\in E,\\ -1&\text{if }e=(u,v)\in E,\\ 0&\text{otherwise.}\end{cases}$$

$$L(x)=\mathbf{B}^{\top}\mathbf{W}\exp(\mathbf{B}x).$$

$$L(x)=d.$$

$$f_{uv}=w_{uv}\cdot e^{x_{u}-x_{v}},\quad\text{for all }(u,v)\in E,$$

Full text

Michael B. Cohen

MIT

This material is based upon work supported by the National Science Foundation under Grant No. 1111109 and Grant No. 1553428, and by the National Defense Science and Engineering Graduate Fellowship.

Aleksander Mądry

MIT

This material is based upon work supported by the National Science Foundation under Grant No. 1553428.

Dimitris Tsipras

MIT

This material is based upon work supported by the National Science Foundation under Grant No. 1553428.

Adrian Vladu

MIT

This material is based upon work supported by the National Science Foundation under Grant No. 1111109 and Grant No. 1553428.

Abstract

In this paper, we study matrix scaling and balancing, which are fundamental problems in scientific computing, with a long line of work on them that dates back to the 1960s. We provide algorithms for both these problems that, ignoring logarithmic factors involving the dimension of the input matrix and the size of its entries, both run in time $\widetilde{O}\left(m\log\kappa\log^{2}(1/\varepsilon)\right)$ where $\varepsilon$ is the amount of error we are willing to tolerate. Here, $\kappa$ represents the ratio between the largest and the smallest entries of the optimal scalings. This implies that our algorithms run in nearly-linear time whenever $\kappa$ is quasi-polynomial, which includes, in particular, the case of strictly positive matrices. We complement our results by providing a separate algorithm that uses an interior-point method and runs in time $\widetilde{O}(m^{3/2}\log(1/\varepsilon))$ .

In order to establish these results, we develop a new second-order optimization framework that enables us to treat both problems in a unified and principled manner. This framework identifies a certain generalization of linear system solving that we can use to efficiently minimize a broad class of functions, which we call second-order robust. We then show that in the context of the specific functions capturing matrix scaling and balancing, we can leverage and generalize the work on Laplacian system solving to make the algorithms obtained via this framework very efficient.

1 Introduction

Matrix balancing and scaling are problems of fundamental importance in scientific computing, as well as in statistics, operations research, image reconstruction, and engineering. The literature on these problems [39, 41, 18, 20, 12, 23, 44, 48, 47, 53, 42, 4, 15, 21] is truly extensive and dates back to the 1960s. Both are key primitives in most mainstream numerical software packages (MATLAB, R, LAPACK, EISPACK) [36, 35, 43, 17, 3]. Moreover, both problems can be seen as tasks in which we aim to find diagonal scalings of a given matrix so that the rescaled matrix gains some favorable structure.

More specifically, in the matrix scaling problem, we are given a nonnegative matrix $\mathbf{A}$, and our goal is to find diagonal matrices $\mathbf{X},\mathbf{Y}$ such that the matrix $\mathbf{X}\mathbf{A}\mathbf{Y}$ has prescribed row and column sums. The most common instance of this problem is the one where we want to scale the matrix so as to make it doubly stochastic – in other words, we want all row and column sums to be equal to one. This procedure has been repeatedly used since as early as 1937 in a number of diverse areas, such as telecommunication [27], engineering [4], statistics [14, 47], machine learning [9], and even computational complexity [33, 19]. A standard application of scaling is preconditioning for linear system solving. Given a linear system $\mathbf{A}x=b$, one can produce a solution by computing $\mathbf{Y}(\mathbf{X}\mathbf{A}\mathbf{Y})^{-1}\mathbf{X}b$, since applying the inverse of $\mathbf{X}\mathbf{A}\mathbf{Y}$ is a more numerically stable procedure than directly applying the inverse of $\mathbf{A}$ [53]. Another example application, which commonly occurs in statistics, is iterative proportional fitting. This primitive is often used for standardizing cross-tabulations and has been studied since 1912 [55]. Even more interestingly, matrix scaling turns out to have surprising connections to fundamental problems in the theory of computation. Notably, in [33], it is observed that scaling can be used to approximate the permanent of any nonnegative matrix within a multiplicative factor of $e^{n}$. Furthermore, deciding whether the permanent of a bipartite graph’s adjacency matrix is $0$ or at least $1$ is equivalent to deciding whether that graph contains a perfect matching. Such a scaling-based method can, as a matter of fact, be used to compute maximum matchings in bipartite graphs, which is a classic and intensely studied problem in graph algorithms [13, 16, 34]. For more history and information on this problem, we refer the reader to Idel’s extensive survey [21], or [45] for a list of applications.
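To make the problem statement concrete, the following is a minimal sketch of the classical alternating row/column normalization (the RAS or Sinkhorn-style iteration) for the doubly stochastic case. It is the simple iterative baseline, not the second-order method developed in this paper; the function names `ras_scale` and `scaled`, the fixed iteration count, and the example matrix are assumptions made for illustration, and the sketch assumes a strictly positive input.

```python
def ras_scale(A, r, c, iters=500):
    """Alternately rescale rows and columns (classical RAS / Sinkhorn-style iteration).

    Returns positive vectors x, y such that M = diag(x) A diag(y) has row sums
    close to r and column sums close to c. Illustration only; assumes A > 0.
    """
    n = len(A)
    x = [1.0] * n
    y = [1.0] * n
    for _ in range(iters):
        # Fix y and choose x so that every row sum matches r exactly.
        for i in range(n):
            row = sum(A[i][j] * y[j] for j in range(n))
            x[i] = r[i] / row
        # Fix x and choose y so that every column sum matches c exactly.
        for j in range(n):
            col = sum(x[i] * A[i][j] for i in range(n))
            y[j] = c[j] / col
    return x, y


def scaled(A, x, y):
    """Form M = diag(x) A diag(y) entry by entry."""
    n = len(A)
    return [[x[i] * A[i][j] * y[j] for j in range(n)] for i in range(n)]


A = [[1.0, 2.0], [3.0, 4.0]]  # hypothetical strictly positive input
x, y = ras_scale(A, r=[1.0, 1.0], c=[1.0, 1.0])
M = scaled(A, x, y)
row_sums = [sum(row) for row in M]
col_sums = [sum(M[i][j] for i in range(2)) for j in range(2)]
```

Here `x` and `y` hold the diagonals of $\mathbf{X}$ and $\mathbf{Y}$, and both `row_sums` and `col_sums` should be numerically close to one.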

Now, in the matrix balancing problem, we are, again, given a nonnegative matrix $\mathbf{A}$, and our goal here is to find a diagonal matrix $\mathbf{D}$ such that the matrix $\mathbf{D}\mathbf{A}\mathbf{D}^{-1}$ is balanced, that is, the sum of each row is equal to the sum of the corresponding column. This procedure was first introduced by Osborne [39], who used it to precondition matrices in order to increase the accuracy of eigenvalue computations. (Note that the balancing operation does not change the eigenvalues of the matrix.) The initially proposed algorithm was based on a simple iterative approach, and was then followed by a long series of improvements and extensions. The initial work on this problem focused on a variant in which one aims to balance the $\ell_{2}$-norms of rows and columns. It turns out, however, that the $\ell_{1}$-norm–based version we study here is equivalent. In fact, balancing problems with respect to $\ell_{p}$ norms, with constant $p\geq 1$, are all reducible to each other.
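As an illustration of what balancing asks for, here is a minimal sketch of an Osborne-style coordinate iteration for the $\ell_{1}$ version: each step rescales one index so that its row sum and column sum in $\mathbf{D}\mathbf{A}\mathbf{D}^{-1}$ agree. This is the simple iterative approach mentioned above, not the algorithm of this paper; the name `osborne_balance`, the sweep count, and the example matrix are assumptions chosen for the example, and positive off-diagonal entries are assumed.

```python
import math


def osborne_balance(A, sweeps=200):
    """Osborne-style iteration for l1 balancing; returns d with D = diag(d).

    Illustration only; assumes every off-diagonal entry of A is positive.
    """
    n = len(A)
    d = [1.0] * n
    for _ in range(sweeps):
        for i in range(n):
            # Off-diagonal row and column sums of M = D A D^{-1} at index i.
            row = sum(d[i] * A[i][j] / d[j] for j in range(n) if j != i)
            col = sum(d[j] * A[j][i] / d[i] for j in range(n) if j != i)
            # Row i scales with d[i] and column i with 1/d[i], so this
            # factor makes the two sums equal.
            d[i] *= math.sqrt(col / row)
    return d


A = [[1.0, 1.0, 2.0], [3.0, 1.0, 4.0], [5.0, 6.0, 1.0]]  # hypothetical input
d = osborne_balance(A)
M = [[d[i] * A[i][j] / d[j] for j in range(3)] for i in range(3)]
row_sums = [sum(M[i]) for i in range(3)]
col_sums = [sum(M[i][j] for i in range(3)) for j in range(3)]
```

The diagonal entries of `M` coincide with those of `A` up to rounding, since the factors `d[i]` and `1/d[i]` cancel there.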

1.1 Previous Work

The early methods used for solving these problems – Osborne’s iteration for balancing, and the RAS method for scaling – are simple iterative algorithms. However, merely the task of analyzing their convergence turned out to be a major challenge. Significant effort has gone into understanding their convergence [5, 46, 40, 33], and providing better analyses or better iterative methods resulted in a long line of work in this context.

The major shortcoming of the methods obtained so far for exactly solving the problem (that is, with running time depending only logarithmically on $1/\varepsilon$) is their very large running time. In the following discussion we omit runtime factors that depend (logarithmically) on the size of the input entries. For matrix scaling, Kalantari and Khachiyan [22] obtained an algorithm that finds an $\varepsilon$-approximate solution and runs in time $\widetilde{O}(n^{4}\log(1/\varepsilon))$, where $n$ denotes the dimension of the matrix (we can assume the matrix is square w.l.o.g.) and $\varepsilon$ is the desired accuracy parameter. (The precise definition of $\varepsilon$ varies across papers. However, in the regime of logarithmic running time dependence on $1/\varepsilon$ we are interested in here, all these definitions are essentially equivalent.) This algorithm was based on the ellipsoid method. These authors also proposed – but did not formally analyze – an algorithm based on an interior point method, which they expected to run in time $\widetilde{O}(m^{3.5}\log(1/\varepsilon))$, where $m$ denotes the number of nonzero entries of the input matrix. Then, Nemirovsky and Rothblum [37] analyzed an interior point method–based algorithm which runs in time $\widetilde{O}(m^{4}\log(1/\varepsilon))$. Finally, Linial, Samorodnitsky, and Wigderson [33] gave an $\widetilde{O}(n^{7}\log(1/\varepsilon))$ time algorithm that is also strongly polynomial, in the sense that it does not depend at all on the size of input entries.

For the case of matrix balancing, Parlett and Reinsch [41] provided an iterative method based on Osborne’s iteration, without proving convergence. Then, Grad [18] proved that Osborne’s iteration converges in the limit. The first polynomial time bound was obtained by Kalantari, Khachiyan, and Shokoufandeh [23], who gave an algorithm with running time $\widetilde{O}(n^{4}\log(1/\varepsilon))$ .

Alternatively, if one is interested in the regime where the running time is allowed to depend polynomially – instead of logarithmically – on the (inverse of the) desired accuracy of the solution, there are algorithms that have an even better dependence on the other parameters. Specifically, the current state of the art is given by Linial, Samorodnitsky, and Wigderson [33], who obtain $O(n^{3}\varepsilon^{-2})$ running time for the scaling problem. In the case of the balancing problem, Ostrovsky, Rabani, and Yousefi [40] recently made significant progress by obtaining running times of $\widetilde{O}(m+n\varepsilon^{-2})$ and $\widetilde{O}(n^{3.5}\varepsilon^{-1})$.

Finally, another important line of work in this domain focused on the related $\ell_{\infty}$ variant of the balancing problem, where the maximum entry of each row is required to be equal to the maximum entry of the corresponding column. Schneider and Schneider [44] gave a non-iterative algorithm running in time $O(n^{4})$, improved to $\widetilde{O}(mn+n^{2})$ by Young, Tarjan, and Orlin [54]. More recently, Schulman and Sinclair [46] provided an analysis of the classical Osborne-Parlett-Reinsch algorithm, obtaining a running time of $\widetilde{O}(n^{2}m)$, and gave a version of it with running time $\widetilde{O}(n^{3}\log(1/\varepsilon))$.

1.2 Our Contributions

We provide algorithms for both matrix scaling and balancing problems.

For the matrix scaling problem, we establish an algorithm that runs in time

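$$\widetilde{O}\left(m\log\left(\kappa(\mathbf{U}^{*})+\kappa(\mathbf{V}^{*})\right)\log^{2}\frac{s_{\mathbf{A}}}{\varepsilon}\right),$$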

where $\mathbf{U}^{*}$ and $\mathbf{V}^{*}$ are the optimal scaling matrices, $\kappa(\cdot)$ is the maximum ratio between the diagonal entries of its argument, $s_{\mathbf{A}}$ is the sum of the entries in the input matrix, and $\varepsilon$ is the measure of the target error of the scaling, formally defined in Definition 4.2.

For the matrix balancing problem, we establish a running time of

$$\widetilde{O}\left(m\log\kappa(\mathbf{D}^{*})\log^{2}\frac{w_{\mathbf{A}}}{\varepsilon}\right),$$

where $w_{\mathbf{A}}$ is the ratio of the sum of the entries to the minimum nonzero entry, $\mathbf{D}^{*}$ is the optimal balancing matrix, $\kappa(\cdot)$ has the same meaning as above, and $\varepsilon$ is the measure of the balancing error, as formally defined in Definition 4.16.

Notably, our running times depend logarithmically on both the target accuracy and the magnitude of the entries in the optimal balancing or scaling. This implies that if the optimal solution has quasi-polynomially bounded entries, our algorithms run in nearly linear time $\widetilde{O}(m\log(1/\varepsilon))$ (ignoring logarithmic factors involving the entries of the input matrix). This includes, for instance, the case when the input matrix has all its entries positive or, in the case of matrix balancing, when there exists just a single row/column pair with all positive entries.

However, there are matrices for which $\kappa$ can be exponentially large (in $n$). For such matrices we develop algorithms with negligible dependence on $\kappa$. These algorithms are based on interior point methods, with appropriately chosen barriers, commonly used in exponential programming [2]. We show that the linear system solves required by the interior point method in every iteration can be reduced, via Schur complementing, to approximately solving a Laplacian system, which can be done in nearly linear time using any standard Laplacian solver [51, 25, 26, 24, 8, 28, 29]. This yields a running time of

$$\widetilde{O}\left(m^{3/2}\log\frac{w_{\mathbf{A}}}{\varepsilon}\right),$$

where $w_{\mathbf{A}}$ is the ratio between the largest and smallest nonzero entry of $\mathbf{A}$ .

1.3 Our Approach

We approach the scaling and balancing problems by developing a continuous optimization based perspective on them. More precisely, we solve both matrix scaling and balancing problems by casting them as tasks of minimizing certain corresponding convex functions. In fact, in the case of the balancing problem, that function is directly inspired by the one used in [23]; for the scaling problem, it is a function derived from the one used in [22].

Since our goal is to obtain logarithmic – instead of polynomial – dependence on the (inverse of the) desired accuracy $\varepsilon$, it would be tempting to use well-known tools for convex programming, such as the ellipsoid method or interior point methods. However, these methods are, a priori, computationally expensive. This motivates us to look for different, more direct approaches.

To this end, we develop a technique for minimizing a broader class of functions that we call second-order robust (with respect to $\ell_{\infty}$). Intuitively, this class corresponds to functions whose Hessians do not change too much within any unit $\ell_{\infty}$-ball. The consequence of this property that is crucial for us is that the local quadratic approximation of such a function at any given point is relatively accurate within the unit $\ell_{\infty}$ neighborhood of that point. As a result, iteratively optimizing the local approximation around the current point, while staying within that $\ell_{\infty}$ neighborhood, is guaranteed to make progress towards minimizing the function. This iterative procedure can be viewed as a “box-constrained” variant of Newton’s method.

A priori, performing a single step of such a box-constrained Newton’s method, i.e., minimizing a quadratic function subject to box constraints might be a computationally costly task. We show, however, that it suffices to implement a weaker primitive, which we call a $k$ -oracle. That primitive corresponds to (approximately) minimizing a quadratic function within a region that is within a factor of $k$ larger than the target $\ell_{\infty}$ -ball. Once such a $k$ -oracle is implemented efficiently, we can compute the global optimum of our second-order robust function using a small number of calls to it. More precisely, we show that one can minimize a convex function $f$ that is second-order robust with respect to $\ell_{\infty}$ to within $\varepsilon$ additive error from optimum in

$$O\left((kR_{\infty}+1)\log\left(\frac{f(x_{0})-f(x^{*})}{\varepsilon}\right)\right)\tag{1.1}$$

iterations, where each iteration consists of one call to the $k$ -oracle, $x_{0}$ is the starting point, $x^{*}$ is the minimizer of $f$ , and $R_{\infty}$ is the $\ell_{\infty}$ radius of the level set of $x_{0}$ .

In light of the above, the main remaining technical difficulty is obtaining an efficient implementation of a $k$-oracle. We show that for functions whose Hessian is symmetric diagonally dominant with nonpositive off-diagonal entries, or SDD for short (such matrices can essentially be viewed as a Laplacian matrix plus a nonnegative diagonal), we can implement a $k$-oracle, with $k=O(\log n)$, in time that is nearly linear in the sparsity of the Hessian. We build here on the strategy underlying the Laplacian solver of Lee, Peng and Spielman [30]. Specifically, we carefully lift the solutions corresponding to coarser (and smaller) approximations of the underlying matrix to the desired solutions corresponding to the initial matrix in a way that does not allow these lifted solutions to exceed the boundaries of an $O(\log n)$-radius $\ell_{\infty}$-ball.

Once the above optimization framework is developed, applying it to the scaling and balancing problems is fairly straightforward. It boils down to verifying that the functions that capture the respective problems are indeed second-order robust and have an SDD Hessian, and then bounding all the relevant quantities that (1.1) involves.

Independent Work

Finally, we note that Allen-Zhu, Li, Oliveira, and Wigderson [1] independently obtained very similar results for the exact version of the problem. The running times of the algorithms they develop have a slightly worse dependence on $m$, but they were able to establish better absolute bounds on $\kappa$ (in terms of the problem parameters and the magnitude of the input entries) for the general, non-doubly stochastic variant of the matrix scaling problem.

1.4 Roadmap

The rest of the paper is organized as follows. First, we introduce relevant notation and concepts in Section 2. Then, in Section 3 we formally introduce the class of convex functions we call second-order robust with respect to $\ell_{\infty}$. For these, we develop a specific optimization primitive called the box-constrained Newton method.

In Section 4 we describe how to apply the primitive from Section 3 to matrix balancing and scaling, by reducing these problems to minimizing a convex function with favorable structure. In order to complete our algorithm, in Section 5, we show how to efficiently implement an iteration of the box-constrained Newton method in the special case where the Hessian of the function is SDD. In Section 6 we provide a different approach for balancing and scaling based on interior point methods. Supplementary proofs and technical details are presented in the Appendix.

2 Preliminaries

2.1 Notations

Vectors

We let $\vec{0},\vec{1}\in\mathbb{{R}}^{n}$ denote the all zeros and all ones vectors, respectively. When it is clear from the context, we apply scalar operations to vectors with the interpretation that they are applied coordinate-wise.

Matrices

We write matrices in bold. We use $\mathbf{I}$ to denote the identity matrix, and $\mathbf{0}$ to denote the zero matrix. Given a matrix $\mathbf{A}$, we denote its number of nonzero entries by $\textnormal{nnz}(\mathbf{A})$. When it is clear from the context, we use $m$ to denote the number of nonzeros; similarly, we use $n$ to denote the dimension of the ambient space.

We denote by $s_{\mathbf{A}}$ the sum of entries of $\mathbf{A}$ , by $\ell_{\mathbf{A}}$ the minimum nonzero entry of $\mathbf{A}$ , and by $w_{\mathbf{A}}$ the ratio between these quantities. We use $\mathrm{supp}(\mathbf{A})$ to denote the set of pairs of indices $(i,j)$ corresponding to the nonzero entries of $\mathbf{A}$ . Given a matrix $\mathbf{A}$ , we define $r_{\mathbf{A}}=\mathbf{A}\vec{1}$ to be the vector consisting of row sums, and $c_{\mathbf{A}}=\mathbf{A}^{\top}\vec{1}$ to be the vector consisting of column sums. For a positive diagonal matrix $\mathbf{A}$ we denote the maximum ratio between its diagonal elements by $\kappa(\mathbf{A})$ .
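The quantities defined above are straightforward to compute; the following small sketch evaluates them on a toy matrix. The helper names `matrix_stats` and `kappa`, the dictionary keys, and the example values are hypothetical and chosen only for this illustration.

```python
def matrix_stats(A):
    """Compute nnz(A), r_A, c_A, s_A, ell_A and w_A for a nonnegative matrix A."""
    n_rows, n_cols = len(A), len(A[0])
    r = [sum(row) for row in A]  # r_A = A 1 (row sums)
    c = [sum(A[i][j] for i in range(n_rows)) for j in range(n_cols)]  # c_A = A^T 1
    nonzero = [a for row in A for a in row if a != 0]
    s = sum(r)          # s_A: sum of all entries
    ell = min(nonzero)  # ell_A: minimum nonzero entry
    return {"nnz": len(nonzero), "r": r, "c": c, "s": s, "ell": ell, "w": s / ell}


def kappa(diag):
    """Maximum ratio between the entries of a positive diagonal, given as a list."""
    return max(diag) / min(diag)


stats = matrix_stats([[0, 2], [1, 5]])  # hypothetical 2 x 2 example
ratio = kappa([0.5, 4.0])
```

For this example, `stats["r"]` is `[2, 6]`, `stats["c"]` is `[1, 7]`, and $w_{\mathbf{A}}=s_{\mathbf{A}}/\ell_{\mathbf{A}}=8$.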

Positive Semidefinite Ordering and Approximation

For symmetric matrices $\mathbf{A},\mathbf{B}\in\mathbb{{R}}^{n\times n}$ we use $\mathbf{A}\preccurlyeq\mathbf{B}$ to represent the fact that $x^{\top}\mathbf{A}x\leq x^{\top}\mathbf{B}x$ for all $x$. A symmetric matrix $\mathbf{A}\in\mathbb{{R}}^{n\times n}$ is positive semidefinite (PSD) if $\mathbf{A}\succcurlyeq 0$. We use $\succcurlyeq,\succ,\prec$ in a similar fashion. For vectors $x$, we define the norm $\|x\|_{\mathbf{A}}=\sqrt{x^{\top}\mathbf{A}x}$. Given two PSD matrices $\mathbf{A}$ and $\mathbf{B}$, and a parameter $\alpha>0$, we use $\mathbf{A}\approx_{\alpha}\mathbf{B}$ to denote the fact that $e^{-\alpha}\cdot\mathbf{B}\preccurlyeq\mathbf{A}\preccurlyeq e^{\alpha}\cdot\mathbf{B}$.

Laplacian and SDD matrices

A family of matrices that will play an important role in this paper is that of symmetric diagonally dominant (SDD) matrices. These are matrices $\mathbf{A}$ that are symmetric and, moreover, have each diagonal entry at least as large as the sum of absolute values of the off-diagonal entries in the corresponding row. That is, for every $i$,

$$\mathbf{A}_{ii}\geq\sum_{j\neq i}\left|\mathbf{A}_{ij}\right|.$$

A special case of SDD matrices are Laplacian matrices, which have nonpositive off-diagonal entries and every row sum equal to zero. The crucial fact about these matrices is that one can exploit their structure to solve linear systems in them in time that is only nearly linear [51, 25, 26, 24, 8, 28, 29].
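The two definitions can be checked directly on small dense matrices. The sketch below is only an illustration with exact comparisons; the function names `is_sdd` and `is_laplacian` and the three test matrices are assumptions made for this example.

```python
def is_sdd(A):
    """Symmetric, and each diagonal entry is at least the sum of the
    absolute values of the off-diagonal entries in its row."""
    n = len(A)
    symmetric = all(A[i][j] == A[j][i] for i in range(n) for j in range(n))
    dominant = all(
        A[i][i] >= sum(abs(A[i][j]) for j in range(n) if j != i)
        for i in range(n)
    )
    return symmetric and dominant


def is_laplacian(A):
    """SDD with nonpositive off-diagonal entries and all row sums equal to zero."""
    n = len(A)
    nonpositive = all(A[i][j] <= 0 for i in range(n) for j in range(n) if i != j)
    zero_rows = all(sum(A[i]) == 0 for i in range(n))
    return is_sdd(A) and nonpositive and zero_rows


# Laplacian of the unit-weight path graph on three vertices.
L = [[1, -1, 0], [-1, 2, -1], [0, -1, 1]]
# The same matrix plus a nonnegative diagonal: SDD but no longer a Laplacian.
S = [[2, -1, 0], [-1, 2, -1], [0, -1, 1]]
# Symmetric, but the diagonal does not dominate.
N = [[1, -2], [-2, 1]]
```

The matrix `S` illustrates the "Laplacian plus nonnegative diagonal" shape that SDD matrices with nonpositive off-diagonal entries take.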

Diagonal Matrices

For $x\in\mathbb{{R}}^{n}$ we denote by $\mathbb{D}(x)\in\mathbb{{R}}^{n\times n}$ the diagonal matrix where $\mathbb{D}(x)_{ii}=x_{i}$ . Given a nonnegative diagonal matrix $\mathbf{D}$ , we use $\kappa(\mathbf{D})$ to denote the ratio between its largest and smallest entry. We will overload notation and, for any matrix $\mathbf{A}\in\mathbb{{R}}^{n\times n}$ , use $\mathbb{D}(\mathbf{A})$ to denote the main diagonal of $\mathbf{A}$ , that is $(\mathbb{D}(\mathbf{A}))_{ii}=\mathbf{A}_{ii}$ and $(\mathbb{D}(\mathbf{A}))_{ij}=0$ for $i\neq j$ .

Gradients and Hessians

Given a function $f$ we denote by $\nabla f(x)$ its gradient at $x$ , and by $\nabla^{2}f(x)$ its Hessian at $x$ . When the function is clear from the context, we also use $\mathbf{H}_{x}$ to denote its Hessian at $x$ .

Block Matrices

As part of our algorithms, we will consider partitioning the coordinates of vectors into sets of indices $F$ and $C$. When we compute the quadratic form of a matrix with these vectors, we need to be able to reason about how values in each component interact with the rest of the vector. For that reason, it is convenient to write a matrix $\mathbf{A}$ in block form as:

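$$\mathbf{A}=\begin{bmatrix}\mathbf{A}_{[F,F]}&\mathbf{A}_{[F,C]}\\ \mathbf{A}_{[C,F]}&\mathbf{A}_{[C,C]}\end{bmatrix}.$$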

Schur Complements

For a matrix $\mathbf{A}\in\mathbb{{R}}^{n\times n}$ and a partition of its indices $(F,C)$, the Schur complement of $F$ in $\mathbf{A}$ is defined as

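$$\mathrm{Sc}(\mathbf{A},F)\stackrel{\mathrm{def}}{=}\mathbf{A}_{[C,C]}-\mathbf{A}_{[C,F]}\mathbf{A}_{[F,F]}^{-1}\mathbf{A}_{[F,C]}.$$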

The exact use of Schur complements will become clear in Sections 5 and 6. These are objects that naturally arise during Gaussian elimination for the solution of linear systems. By pivoting out the variables in $F$, the remaining system to solve for the variables in $C$ is exactly the Schur complement of $F$ in $\mathbf{A}$.
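The pivoting view can be illustrated with a minimal sketch that eliminates a single index, that is, takes $F=\{v\}$ so that $\mathbf{A}_{[F,F]}^{-1}$ is just a scalar reciprocal. The function name `eliminate` and the path-graph example are assumptions for illustration; larger sets $F$ can be handled by pivoting out one index at a time, provided each pivot is nonzero.

```python
def eliminate(A, v):
    """Schur complement of F = {v} in A, i.e. one step of Gaussian elimination.

    Returns A[C,C] - A[C,v] * A[v,v]^{-1} * A[v,C], where C is every index except v.
    """
    keep = [i for i in range(len(A)) if i != v]
    return [[A[i][j] - A[i][v] * A[v][j] / A[v][v] for j in keep] for i in keep]


# Laplacian of the unit-weight path 0 - 1 - 2 (hypothetical example).
L = [[1, -1, 0], [-1, 2, -1], [0, -1, 1]]
S = eliminate(L, 1)  # pivot out the middle vertex
```

In this example `S` is again a Laplacian, now of a single edge of weight $1/2$ between the two remaining vertices.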

3 Box-Constrained Newton Method for Second-Order Robust Functions

The central element of our approach is an efficient minimization framework, based on second-order methods, for a broad class of functions that we call second-order robust with respect to $\ell_{\infty}$. To motivate the choice of this class, recall that second-order methods for function minimization are iterative in nature, and they boil down to repeatedly minimizing the local quadratic approximation of the function around the current point. Consequently, in order to obtain meaningful guarantees about the progress made by such methods, one needs to ensure that this local quadratic approximation constitutes a good approximation of the function not only at the current point but also in a reasonably large neighborhood of that point. The most natural way to obtain such a guarantee is to ensure that the Hessian of the function (which is the basis of our local quadratic approximations) does not change by more than a constant factor in that neighborhood. As a result, the functions we are interested in optimizing in this paper are the ones that satisfy that property in an $\ell_{\infty}$-ball around the current point. This is formalized in the following definition.

Definition 3.1 (Second-Order Robust w.r.t. $\ell_{\infty}$ ).

We say that a convex function $f:\mathbb{{R}}^{n}\rightarrow\mathbb{{R}}$ is second-order robust (SOR) with respect to $\ell_{\infty}$ if, for any $x,y\in\mathbb{{R}}^{n}$ such that $\|x-y\|_{\infty}\leq 1$ ,

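$$\nabla^{2}f(x)\approx_{2}\nabla^{2}f(y),\quad\text{that is,}\quad\frac{1}{e^{2}}\nabla^{2}f(x)\preccurlyeq\nabla^{2}f(y)\preccurlyeq e^{2}\nabla^{2}f(x).$$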

Note that the size of the $\ell_{\infty}$ -ball, as well as the exact factor by which the Hessian is allowed to change, are chosen somewhat arbitrarily – all choices of the constants can be made equivalent via an appropriate rescaling. Moreover, even if these quantities are not constant, they would only appear in the running time as a small polynomial factor.
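As a worked illustration of where the constant $e^{2}$ can come from, consider a sum of exponentials $f(x)=\sum_{i,j}\mathbf{A}_{ij}e^{x_{i}-x_{j}}$ with $\mathbf{A}_{ij}\geq 0$. This particular $f$ and the symbol $\chi_{i}$ for the $i$-th standard basis vector are choices made here for the example. Its Hessian is a nonnegative combination of rank-one PSD matrices,

$$\nabla^{2}f(x)=\sum_{i,j}\mathbf{A}_{ij}e^{x_{i}-x_{j}}(\chi_{i}-\chi_{j})(\chi_{i}-\chi_{j})^{\top}.$$

For any $z$ with $\|z\|_{\infty}\leq 1$ we have $|z_{i}-z_{j}|\leq 2$, so every coefficient changes by a factor between $e^{-2}$ and $e^{2}$:

$$e^{(x_{i}+z_{i})-(x_{j}+z_{j})}=e^{x_{i}-x_{j}}\cdot e^{z_{i}-z_{j}},\qquad e^{-2}\leq e^{z_{i}-z_{j}}\leq e^{2}.$$

Summing the rank-one terms gives $\frac{1}{e^{2}}\nabla^{2}f(x)\preccurlyeq\nabla^{2}f(x+z)\preccurlyeq e^{2}\nabla^{2}f(x)$, which is exactly the condition of Definition 3.1.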

Now, the above definition suggests a natural framework for optimizing such functions. Namely, in every iteration, we optimize a local quadratic approximation of the function within a unit $\ell_{\infty}$-ball around the current point. As we will see shortly, this approach can be rigorously analyzed. In particular, our key technical result is that if we apply this approach to an SOR function whose Hessian additionally has a special structure, namely, it is essentially a symmetric diagonally dominant (SDD) matrix, we can implement every iteration in time nearly linear in the number of nonzero entries of the Hessian. This leads to running time bounds captured by the following theorem.

Theorem 3.2 (Minimizing Second-Order Robust Functions w.r.t $\ell_{\infty}$ ).

Let $f:\mathbb{{R}}^{n}\rightarrow\mathbb{{R}}$ be a second-order robust (SOR) function with respect to $\ell_{\infty}$ , such that its Hessian is symmetric diagonally dominant (SDD) with nonpositive off-diagonals, and has $m$ nonzero entries. Given a starting point $x_{0}\in\mathbb{{R}}^{n}$ we can compute a point $x$ , such that $f(x)-f(x^{*})\leq\varepsilon$ , in time

$$\widetilde{O}\left((m+T)R_{\infty}\log\left(\frac{f(x_{0})-f(x^{*})}{\varepsilon}\right)\right),$$

where $x^{*}$ is a minimizer of $f$ , $R_{\infty}=\sup_{x:f(x)\leq f(x_{0})}\|x-x^{*}\|_{\infty}$ is the $\ell_{\infty}$ diameter of the corresponding level set of $f$ , and $T$ is the time required to compute the Hessian.

Note that the bounds provided by the above theorem are, in a sense, the best possible for any kind of approach that relies on repeated minimization of a local approximation of a function in an $\ell_{\infty}$-ball neighborhood. In particular, as each step can make progress of at most $1$ in $\ell_{\infty}$-norm towards the optimal solution, one would expect the total number of steps to be $\Omega(R_{\infty})$.
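The $\Omega(R_{\infty})$ behavior can be observed on a toy instance. The sketch below runs box-constrained Newton steps on the separable function $f(x)=\sum_{i}(a_{i}e^{x_{i}}-r_{i}x_{i})$, whose diagonal Hessian changes by at most a factor of $e$ inside a unit $\ell_{\infty}$-ball. Several simplifying assumptions are made that are not taken from the paper: the function and the name `box_newton` are hypothetical, the box subproblem is solved exactly by coordinate-wise clipping (valid only because the Hessian is diagonal), and the step is applied undamped rather than through a $k$-oracle.

```python
import math


def box_newton(a, r, x0, tol=1e-12, max_iters=1000):
    """Minimize f(x) = sum_i (a_i * exp(x_i) - r_i * x_i) with unit-box Newton steps.

    Toy sketch: the Hessian is diagonal, so the box-constrained quadratic model
    is separable and is minimized exactly by clipping each coordinate.
    """
    x = list(x0)
    for it in range(max_iters):
        grad = [a[i] * math.exp(x[i]) - r[i] for i in range(len(x))]
        hess = [a[i] * math.exp(x[i]) for i in range(len(x))]  # diagonal entries
        if max(abs(g) for g in grad) <= tol:
            return x, it
        # Clip the unconstrained Newton step to the unit l_infinity ball.
        step = [max(-1.0, min(1.0, -grad[i] / hess[i])) for i in range(len(x))]
        x = [x[i] + step[i] for i in range(len(x))]
    return x, max_iters


# Minimizer is x_i = log(r_i / a_i); the start is at l_infinity distance 10 from it.
x, iters = box_newton(a=[1.0, 2.0], r=[1.0, 1.0], x0=[-10.0, 5.0])
```

Without the box, the first Newton step in the first coordinate would have length about $e^{10}$; with it, that coordinate moves by exactly one unit per iteration, so at least ten iterations are needed before the gradient can vanish.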

It turns out that the above theorem is all we need to establish our results for scaling and balancing problems (except the ones relying on the interior point method). That is, these results can be obtained by direct application of the above theorem to an appropriate SOR function. We provide all the details in Section 4.

Now, the first step to proving the above Theorem 3.2 is to view each iteration of our iterative minimization procedure as a call to a certain oracle problem.

Definition 3.3.

We say that a procedure $\mathcal{O}$ is a $k$ -oracle for a class of matrices $\mathcal{M}$ , if on input $(\mathbf{A},b)$ , where $\mathbf{A}\in\mathcal{M}\subseteq\mathbb{{R}}^{n\times n}$ , and $b\in\mathbb{{R}}^{n}$ , returns a vector $\tilde{z}$ satisfying

(1) $\|\tilde{z}\|_{\infty}\leq k$ , and

(2) $\frac{1}{2}\tilde{z}^{\top}\mathbf{A}\tilde{z}+b^{\top}\tilde{z}\leq\frac{1}{2}\cdot\min_{\|z\|_{\infty}\leq 1}\left(\frac{1}{2}z^{\top}\mathbf{A}z+b^{\top}z\right)$ .

Note that the minimum on the right-hand side of Condition (2) above is always non-positive, since $z=0$ is feasible and attains the value $0$. This is desired, since this expression is supposed to measure our function minimization progress.

Observe that minimizing the function $\frac{1}{2}z^{\top}\mathbf{A}z+b^{\top}z$ without any constraints on $z$ corresponds to solving a linear system $\mathbf{A}z=-b$ . So, one can view the $k$ -oracle problem as a certain generalization of linear system solving. Specifically, it is a task in which we aim to find a point in the $\ell_{\infty}$ -ball of radius $k$ around the origin that is closest (in a certain sense) to the solution of that linear system. If $b$ is sufficiently small, the $k$ -oracle problem corresponds directly to solving that system.
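The step behind this observation can be written out as follows, under the added assumption (made here only for illustration) that $\mathbf{A}$ is positive definite, so that the system has a unique solution:

```latex
\nabla_z\Big(\tfrac{1}{2} z^{\top}\mathbf{A}z + b^{\top}z\Big) = \mathbf{A}z + b = 0
\;\Longleftrightarrow\; z^{\star} = -\mathbf{A}^{-1}b,
\qquad
\|\mathbf{A}^{-1}b\|_{\infty}\leq 1
\;\Longrightarrow\;
z^{\star}\in\arg\min_{\|z\|_{\infty}\leq 1}\Big(\tfrac{1}{2} z^{\top}\mathbf{A}z + b^{\top}z\Big).
```

In other words, when the unconstrained solution already lies in the unit box, the box constraint is inactive and the constrained and unconstrained problems coincide.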

One can view the parameter $k$ as the measure of the “quality” of our $k$ -oracle. The smaller it is, the faster the overall procedure converges. Importantly, however, the value of $k$ impacts only the convergence rate and not the quality of the final solution. The following theorem makes this relationship precise.

Theorem 3.4.

Let $f:\mathbb{{R}}^{n}\rightarrow\mathbb{{R}}$ be a function that is second-order robust with respect to $\ell_{\infty}$ . Suppose we are given a $k$ -oracle $\mathcal{O}$ for $\{\nabla^{2}f(x):x\in\mathbb{{R}}^{n}\}$ , along with an initial point $x_{0}\in\mathbb{{R}}^{n}$ and an accuracy parameter $\varepsilon$ . Let $R_{\infty}=\sup_{x:f(x)\leq f(x_{0})}\|x-x^{*}\|_{\infty}$ , where $x^{*}$ is a minimizer of $f$ . Then one can produce a solution $x_{T}$ satisfying $f(x_{T})-f(x^{*})\leq\varepsilon$ using

[TABLE]

calls to $\mathcal{O}$ .

We present the proof of this theorem in Section A.1 of the Appendix. In Section 5 we design an efficient $k$ -oracle, with $k=O(\log n)$ , for the family of SDD matrices. Combining Theorem 5.11 with Theorem 3.4 immediately gives the proof of Theorem 3.2. We remark that while Theorems 3.2 and 3.4 are stated and proved for functions defined over $\mathbb{{R}}^{n}$ , they can be extended in a straightforward way to hold when $f$ is defined over an arbitrary closed, convex set.

4 Matrix Scaling and Balancing

Having developed our main optimization primitives, we can develop efficient algorithms for matrix scaling and matrix balancing. Our approach is essentially the same for both problems, and differs only in technical details.

At a high level, we will construct convex functions with optima corresponding to exact scaling/balancing of the input matrix. Moreover, the gradient of these functions at a specific scaling/balancing of the matrix will be directly related to the quality of this particular scaling/balancing. This will allow us to prove that approximately optimal points correspond to $\varepsilon$ -approximate scaling/balancing. The fact that these functions are second-order robust with respect to $\ell_{\infty}$ makes it sufficient to apply the optimization method from Section 3. To complete the algorithm and its running time analysis, we then need to address two issues.

Firstly, proving running time bounds for this method requires an upper bound on the $\ell_{\infty}$ radius of the level set of the initial point, i.e. the $R_{\infty}$ parameter defined in Theorem 3.4. Depending on the structure of the matrix, there are several different bounds that one can prove, depending only on parameters of the original problem. However, the most interesting case is when we are promised that the exact scaling/balancing of the matrix is “small” (in the sense that the ratio between factors is, say, polynomial). In that case, we can regularize the function to turn this promise into a guarantee for the size of the level set without sacrificing too much accuracy. Moreover, by using a simple doubling approach, we can make the algorithm not require explicit knowledge of the value of this parameter, and it will only appear as a factor in the final running time of the algorithm.

Secondly, we need to ensure that we can efficiently implement $k$ -oracles for the Hessians of these functions. In our case, this boils down to proving that these Hessians are SDD matrices with sparsity equal to that of the input matrix, and then building on the existing Laplacian solving work. For the remainder of this section, we define the convex functions that we need to optimize, show how to regularize them, and prove bounds on the corresponding $R_{\infty}$ parameters. We will describe and analyze the implementation of an $O(\log n)$ -oracle in Section 5.

4.1 Matrix Scaling

We now formally define the scaling problem, along with the notion of $\varepsilon$ -scaling.

Definition 4.1 (Matrix Scaling).

Let $\mathbf{A}\in\mathbb{{R}}^{n\times n}$ be a nonnegative matrix and $r,c\in\mathbb{{R}}^{n}$ be vectors such that $\sum_{i=1}^{n}r_{i}=\sum_{j=1}^{n}c_{j}$ , and $\|r\|_{\infty},\|c\|_{\infty}\leq 1$ . (Footnote 3: In the literature we also encounter this problem for non-square matrices; however, solving the square case is sufficient, since given $\mathbf{A}\in\mathbb{{R}}^{r\times c}$ , we can reduce to this instance by scaling the square matrix $\begin{bmatrix}\mathbf{0}_{c,c}&\mathbf{A}^{\top}\\ \mathbf{A}&\mathbf{0}_{r,r}\end{bmatrix}$ . The upper bound on $r$ and $c$ is harmless, since for larger values we can always shrink all of $\mathbf{A}$ , $r$ , $c$ and $\varepsilon$ by the same factor in order to enforce this constraint.) We say that two nonnegative diagonal matrices $\mathbf{X}$ and $\mathbf{Y}$ $(r,c)$ -scale $\mathbf{A}$ if the matrix $\mathbf{M}=\mathbf{X}\mathbf{A}\mathbf{Y}$ satisfies $\mathbf{M}\vec{1}=r$ and $\mathbf{M}^{\top}\vec{1}=c$ , i.e. row $i$ sums to $r_{i}$ and column $j$ sums to $c_{j}$ for every $i,j$ .

Definition 4.2 ( $\varepsilon$ - $(r,c)$ scaling).

Given nonnegative $\mathbf{A}$ and positive diagonal matrices $\mathbf{X},\mathbf{Y}$ , we say that $(\mathbf{X},\mathbf{Y})$ is an $\varepsilon$ - $(r,c)$ scaling (or $\varepsilon$ -scaling, when $r$ and $c$ are clear from the context) for matrix $\mathbf{A}$ if the matrix $\mathbf{M}=\mathbf{X}\mathbf{A}\mathbf{Y}$ satisfies

[TABLE]

Definition 4.3 (Scalable and Almost-Scalable Matrices).

A nonnegative matrix $\mathbf{A}$ is called $(r,c)$ -scalable if there exist $\mathbf{X}$ and $\mathbf{Y}$ that $(r,c)$ -scale $\mathbf{A}$ . It is called almost $(r,c)$ -scalable if for every $\varepsilon>0$ , there exist $\mathbf{X}_{\varepsilon}$ and $\mathbf{Y}_{\varepsilon}$ that $\varepsilon$ - $(r,c)$ scale $\mathbf{A}$ .
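As a small illustration of Definition 4.1, the sketch below computes a scaling with the classical Sinkhorn iteration (alternately fixing row sums and column sums) and measures how far $\mathbf{X}\mathbf{A}\mathbf{Y}$ is from the targets. Sinkhorn is used here only as a simple stand-in; it is not the algorithm developed in this paper. The helper names and the max-norm error measure are illustrative assumptions and do not reproduce the error measure of Definition 4.2.

```python
def scale(x, A, y):
    """Return M = diag(x) * A * diag(y) for a square list-of-lists matrix A."""
    n = len(A)
    return [[x[i] * A[i][j] * y[j] for j in range(n)] for i in range(n)]

def scaling_error(M, r, c):
    """Largest deviation of the row sums from r and the column sums from c."""
    n = len(M)
    row_err = max(abs(sum(M[i]) - r[i]) for i in range(n))
    col_err = max(abs(sum(M[i][j] for i in range(n)) - c[j]) for j in range(n))
    return max(row_err, col_err)

def sinkhorn(A, r, c, iters=200):
    """Classical Sinkhorn iteration; assumes A is strictly positive."""
    n = len(A)
    x, y = [1.0] * n, [1.0] * n
    for _ in range(iters):
        for i in range(n):      # make row i sum to r[i]
            x[i] = r[i] / sum(A[i][j] * y[j] for j in range(n))
        for j in range(n):      # make column j sum to c[j]
            y[j] = c[j] / sum(x[i] * A[i][j] for i in range(n))
    return x, y

A = [[1.0, 2.0], [3.0, 4.0]]
r = [1.0, 1.0]
c = [1.0, 1.0]
x, y = sinkhorn(A, r, c)
err = scaling_error(scale(x, A, y), r, c)
```

Here `x` and `y` hold the diagonals of $\mathbf{X}$ and $\mathbf{Y}$, and `err` should be close to zero for this strictly positive input.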

There are well-known necessary and sufficient conditions about the scalability of $\mathbf{A}$ stated in the following lemma.

Lemma 4.4 ([33]).

A nonnegative matrix $\mathbf{A}$ is exactly $(r,c)$ -scalable iff for every zero minor $Z\times L$ of $\mathbf{A}$ ,

(1) $\sum_{i\in Z^{c}}r_{i}\geq\sum_{j\in L}c_{j}$ .

(2) Equality in (1) holds iff $Z^{c}\times L^{c}$ is also a zero minor.

A nonnegative matrix $\mathbf{A}$ is almost $(r,c)$ -scalable iff Condition (1) above holds.

We will cast matrix scaling as a convex optimization problem and show that applying the method from Section 3 yields a good approximate scaling.

Theorem 4.5.

Let $\mathbf{A}$ be a matrix that has an $(r,c)$ scaling $(\mathbf{U}^{*},\mathbf{V}^{*})$ . Then, we can compute an $\varepsilon$ - $(r,c)$ scaling of $\mathbf{A}$ in time

[TABLE]

This implies that if $\mathbf{U}^{*}$ and $\mathbf{V}^{*}$ are, say, quasi-polynomially bounded, we can find an approximate scaling in nearly linear time. In fact, we can generalize this statement to obtain a similar result for the case of approximate scalings. This is made precise in Theorem 4.6.

4.1.1 Matrix Scaling via Convex Optimization

Recall that we want to encode the matrix scaling problem as an instance of minimizing a certain convex function. Given the input matrix $\mathbf{A}$ , the function we want to consider is:

[TABLE]

We now want to argue that an (approximate) scaling of the matrix $\mathbf{A}$ can indeed be recovered from an (approximate) minimum of the above function. Specifically, we want to establish the following theorem.

Theorem 4.6.

Suppose that there exists a point $z^{*}_{\varepsilon}=(x^{*}_{\varepsilon},y^{*}_{\varepsilon})$ for which $f(z^{*}_{\varepsilon})-f^{*}\leq\varepsilon^{2}/(3n)$ and $\|z^{*}_{\varepsilon}\|_{\infty}\leq B$ . Then we can compute an $\varepsilon$ - $(r,c)$ scaling of $\mathbf{A}$ in time

[TABLE]

The proof is straightforward given the lemmas below and is presented in Section A.4 of the Appendix. First, we will prove that approximate optimality of $f$ implies an approximate scaling of the matrix.

Lemma 4.7.

Let $\mathbf{A}$ be an $\varepsilon$ -scalable matrix. Let $f^{*}=\inf_{(x,y)}f(x,y)$ . Then, a pair of vectors $(x,y)$ satisfying $f(x,y)-f^{*}\leq\varepsilon^{2}/(3n)$ , for $0<\varepsilon\leq 1$ , yields an $\varepsilon$ - $(r,c)$ scaling of $\mathbf{A}$ :

[TABLE]

Note that we compare the value of $f(x,y)$ to its infimum, as for the case of almost scalable matrices it is possible that this value is attained only in the limit.

To prove the above lemma, we first look at the first and second order derivatives of $f$ .

Lemma 4.8.

Let $\mathbf{M}$ be the matrix obtained by scaling $\mathbf{A}$ with vectors $(x,y)$ , i.e. $\mathbf{M}=\mathbb{D}(\exp(x))\cdot\mathbf{A}\cdot\mathbb{D}(\exp(y))$ . The gradient and Hessian of $f$ satisfy the identities:

[TABLE]

We can observe that any $(x,y)$ for which $\nabla f(x,y)$ is equal to $\vec{0}$ yields diagonal matrices that exactly scale $\mathbf{A}$ . Moreover, this statement also holds in an approximate sense. One can prove that a large gradient in $\ell_{2}$ norm implies that the current point is far from optimal in function value. Making this statement precise allows us to prove Lemma 4.7. The technical details are presented in Section A.2 of the Appendix.

4.1.2 Regularization for Solving via Box-Constrained Newton Method

It is straightforward to verify that the function we are minimizing (defined in Equation 4.1) satisfies the requirements necessary for us to be able to apply the tools from Section 3.

Lemma 4.9.

The function $f$ defined in (4.1) is convex, second-order robust with respect to $\ell_{\infty}$ , and its Hessian is SDD.

Proof.

The Hessian of the function $f$ (cf. Lemma 4.8) is clearly a Laplacian matrix. Therefore, it is positive semi-definite and thus $f$ is convex. To prove that it is second-order robust, we notice that adding some $z$ with $\|z\|_{\infty}\leq 1$ to the current scaling corresponds to multiplying each row and column by some factor between $1/e$ and $e$ . By writing down the quadratic form of the Hessian, $v^{\top}\nabla^{2}f(x)v=\sum_{i,j}\mathbf{M}_{ij}(v_{i}-v_{j})^{2}$ , we observe that each $\mathbf{M}_{ij}$ will only be multiplied by some factor between $1/e^{2}$ and $e^{2}$ , proving that

[TABLE]

concluding the proof. ∎
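The argument in this proof can be checked numerically. The sketch below evaluates the quadratic form $\sum_{i,j}\mathbf{M}_{ij}(v_{i}-w_{j})^{2}$ before and after a shift of $\ell_{\infty}$ norm at most $1$ and confirms that it changes by a factor between $1/e^{2}$ and $e^{2}$. The convention $\mathbf{M}_{ij}=\mathbf{A}_{ij}e^{x_{i}-y_{j}}$ and the split of the test vector into a row part `v` and a column part `w` are assumptions made for this illustration; the check relies only on each $\mathbf{M}_{ij}$ changing by a factor in $[e^{-2},e^{2}]$.

```python
import math
import random

def hessian_form(A, x, y, v, w):
    """Quadratic form sum_ij M_ij (v_i - w_j)^2 with M_ij = A_ij * exp(x_i - y_j)."""
    n = len(A)
    return sum(A[i][j] * math.exp(x[i] - y[j]) * (v[i] - w[j]) ** 2
               for i in range(n) for j in range(n))

random.seed(0)
n = 4
A = [[random.random() for _ in range(n)] for _ in range(n)]
x = [random.uniform(-3, 3) for _ in range(n)]
y = [random.uniform(-3, 3) for _ in range(n)]

robust = True
for _ in range(1000):
    # Random shift with infinity norm at most 1, and a random test direction.
    dx = [random.uniform(-1, 1) for _ in range(n)]
    dy = [random.uniform(-1, 1) for _ in range(n)]
    v = [random.uniform(-1, 1) for _ in range(n)]
    w = [random.uniform(-1, 1) for _ in range(n)]
    q0 = hessian_form(A, x, y, v, w)
    q1 = hessian_form(A, [a + d for a, d in zip(x, dx)],
                      [a + d for a, d in zip(y, dy)], v, w)
    slack = 1.0 + 1e-12   # tolerance for floating-point rounding
    robust = robust and q0 / math.e ** 2 <= q1 * slack
    robust = robust and q1 <= math.e ** 2 * q0 * slack
```

A random test of this kind does not replace the proof, but it is a quick sanity check of the constants $1/e^{2}$ and $e^{2}$.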

One should observe, however, that Theorem 3.4 requires bounding the radius of the entire level set containing our initial point and not merely the distance to some (approximate) minimizer of our function $f$ . This means that the existence of an (approximate) minimizer that is close to our initial point is not sufficient to apply Theorem 3.4. To circumvent that problem, we regularize the function $f$ by adding to it a term that, on one hand, has a relatively small impact on the additive error we can achieve, but, on the other hand, ensures that the entire relevant level set is contained in some sufficiently small $\ell_{\infty}$ -ball around our initial point. The following lemma makes these statements precise. Its proof appears in Section A.3 of the Appendix.

Lemma 4.10.

Let $z^{*}_{\varepsilon}=(x^{*}_{\varepsilon},y^{*}_{\varepsilon})$ be a point for which $f(z^{*}_{\varepsilon})-f^{*}\leq\varepsilon^{2}/(3n)$ and $\|z^{*}_{\varepsilon}\|_{\infty}\leq B$ . Then, the regularization of $f$ defined as

[TABLE]

satisfies the following properties

(1) $\widetilde{f}$ is second-order robust with respect to $\ell_{\infty}$ and its Hessian is SDD,

(2) $f(z)\leq\widetilde{f}(z)$ , and there is a point $\widetilde{z}^{*}$ such that $\widetilde{f}(\widetilde{z}^{*})\leq f^{*}+\frac{\varepsilon^{2}}{9n}$ ,

(3) for all $z^{\prime}$ such that $\widetilde{f}(z^{\prime})\leq\widetilde{f}(0)$ , $\|z^{\prime}\|_{\infty}=O(B\log(ns_{\mathbf{A}}/\varepsilon))$ .

Theorems 4.5 and 4.6 follow from applying Theorem 3.2 to the regularized function defined in (4.2), and then combining it with the guarantees of Lemmas 4.7 and 4.10. The complete proof is presented in Section A.4 of the Appendix. We note that we don’t need explicit knowledge of an a priori bound on $B$ . We can simply run our algorithm repeatedly, doubling our guess at the value of $B$ each time. This will not increase the overall running time by more than a factor of two.
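The doubling strategy can be sketched as a small wrapper. The callback `try_solve` is hypothetical: it stands for one run of the algorithm with a guessed bound, returning a solution or `None` when the guess was too small to reach the target accuracy. How such a failure is detected is not specified here and is an assumption of the sketch.

```python
def solve_with_doubling(try_solve, b0=1.0, max_rounds=60):
    """Call try_solve(B) with B = b0, 2*b0, 4*b0, ... until it succeeds.

    Returns the solution together with the list of guesses that were tried.
    """
    B = b0
    guesses = []
    for _ in range(max_rounds):
        guesses.append(B)
        result = try_solve(B)
        if result is not None:
            return result, guesses
        B *= 2.0
    raise RuntimeError("no guess succeeded within max_rounds")

# Toy stand-in for the real solver: it succeeds once the guess reaches 5.
true_B = 5.0
result, guesses = solve_with_doubling(lambda B: "ok" if B >= true_B else None)
```

If the cost of one run grows linearly with the guess, the guesses form a geometric series, so the total cost is at most twice the cost of the final, successful run.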

4.1.3 Bounding the Magnitude of the Optimal and Approximately Optimal Scalings for Doubly Stochastic Scaling

In order to provide bounds for the magnitude of the scaling factors that only depend on the parameters of the initial problem, we refer to the following lemmas from [22] for the case of doubly stochastic (i.e. (1,1)) scaling.

Lemma 4.11 (Lemma 1 of [22]).

If $\mathbf{A}$ is strictly positive, then it can be scaled to doubly stochastic by diagonal matrices $\mathbf{U}$ , $\mathbf{V}$ with $\log(\kappa(\mathbf{U})+\kappa(\mathbf{V}))\leq O(\log(w_{\mathbf{A}}))$ .

Lemma 4.12 (Corollary 1 of [22]).

If $\mathbf{A}$ is scalable, then it can be scaled to doubly stochastic by diagonal matrices $\mathbf{U}$ , $\mathbf{V}$ with $\log(\kappa(\mathbf{U})+\kappa(\mathbf{V}))\leq O(n\log(w_{\mathbf{A}}))$ .

For almost scalable matrices, there can be arbitrarily good solutions, using arbitrarily large scaling factors. To prove bounds on the runtime of finding an approximate doubly-stochastic matrix, we will have to explicitly demonstrate a vector that approximately minimizes the function $f$ while having small $\ell_{\infty}$ norm.

Lemma 4.13.

If $\mathbf{A}$ is almost-doubly-stochastic scalable, then there exist points $(x,y)$ such that $f(x,y)-f^{*}\leq\varepsilon^{2}/(3n)$ and $\|(x,y)\|_{\infty}\leq O(n\log(nw_{\mathbf{A}}/\varepsilon))$ .

The proof of the lemma is presented in Section A.5.

For the general case of $(r,c)$ -scaling we refer to the recent lemmas from the parallel work of [1]. The assumption that the scaling targets are integral is mild, since one can approximate real numbers by rational ones which can then be scaled to be integral (the dependence on this scaling is logarithmic).

Lemma 4.14 (Lemma 3.3 of [1]).

If $\mathbf{A}$ is almost $(r,c)$ -scalable with $r$ , $c$ being integral, then it can be $\varepsilon$ -scaled by diagonal matrices $\mathbf{U}$ , $\mathbf{V}$ with $\log(\kappa(\mathbf{U})+\kappa(\mathbf{V}))\leq O(n\log(nw_{\mathbf{A}}\|r\|_{1}/\varepsilon))$ .

4.2 Matrix Balancing

Our approach for the balancing problem is completely analogous to the one we used for the scaling problem. There are only minor technical differences. To state them, we first formally define the problem and the notion of approximation we are considering for it.

Definition 4.15 (Matrix Balancing).

Let $\mathbf{A}$ be a square nonnegative matrix. We say that $\mathbf{A}$ is balanced if the sum of each row is equal to the sum of the corresponding column, i.e. $r_{\mathbf{A}}=c_{\mathbf{A}}$ . We say that a nonnegative diagonal matrix $\mathbf{D}$ balances $\mathbf{A}$ if the matrix $\mathbf{M}=\mathbf{D}\mathbf{A}\mathbf{D}^{-1}$ is balanced.
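A minimal sketch of Definition 4.15 on a $2\times 2$ example. The helper names and the max-gap imbalance measure are illustrative choices made here; they are not the error measure of Definition 4.16.

```python
def conjugate(d, A):
    """Return M = D A D^{-1} for D = diag(d), i.e. M_ij = d_i * A_ij / d_j."""
    n = len(A)
    return [[d[i] * A[i][j] / d[j] for j in range(n)] for i in range(n)]

def imbalance(M):
    """Largest gap between a row sum and the corresponding column sum."""
    n = len(M)
    return max(abs(sum(M[i]) - sum(M[k][i] for k in range(n))) for i in range(n))

A = [[0.0, 4.0],
     [1.0, 0.0]]
before = imbalance(A)                       # rows sum to (4, 1), columns to (1, 4)
after = imbalance(conjugate([1.0, 2.0], A)) # D = diag(1, 2) gives [[0, 2], [2, 0]]
```

Note that the diagonal entries of $\mathbf{A}$ are unchanged by the map $\mathbf{A}\mapsto\mathbf{D}\mathbf{A}\mathbf{D}^{-1}$, so only the off-diagonal entries matter for balancing.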

Definition 4.16 ( $\varepsilon$ -Balanced Matrix [23]).

We say that a nonnegative matrix $\mathbf{M}\in\mathbb{{R}}^{n\times n}$ is $\varepsilon$ -balanced if

[TABLE]

Observe that this definition is invariant to a global scaling of all the entries of the matrix by some factor. There is a very simple condition that characterizes the set of matrices that can be balanced.

Lemma 4.17 ([23]).

A nonnegative matrix $\mathbf{A}\in\mathbb{{R}}^{n\times n}$ can be balanced if and only if the graph with adjacency matrix $\mathbf{A}$ is strongly connected.
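The condition of Lemma 4.17 is easy to test. The sketch below is a standard check, not taken from the paper: a directed graph is strongly connected exactly when every vertex is reachable from vertex $0$ both in the graph and in its reverse. It assumes $\mathbf{A}$ is given as a dense list of lists with at least one row.

```python
from collections import deque

def reachable(adj, start):
    """Set of vertices reachable from start by breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def is_strongly_connected(A):
    """True if the graph with an edge i -> j whenever A[i][j] > 0 is strongly connected."""
    n = len(A)
    forward = [[j for j in range(n) if A[i][j] > 0] for i in range(n)]
    backward = [[j for j in range(n) if A[j][i] > 0] for i in range(n)]
    return len(reachable(forward, 0)) == n and len(reachable(backward, 0)) == n

cycle = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]   # directed 3-cycle: balanceable
chain = [[0, 1], [0, 0]]                     # single edge 0 -> 1: not balanceable
```

For a sparse input one would store adjacency lists directly, so that the check takes time linear in the number of nonzero entries.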

In the case when the graph is not strongly connected, the matrix can have its rows and columns rearranged so as to be written as a lower triangular block matrix with strongly connected diagonal blocks. The reason no exact balancing exists is that off-diagonal block elements will always create imbalances. This, however, is not an obstacle for approximately balancing the matrix. Once we balance the diagonal blocks, we can set all of the off-diagonal block entries to a very small value, say $\varepsilon/n$ , so that they don’t cause significant imbalances. This corresponds to implicitly scaling the block rows and columns by a very large amount, making the off-diagonal entries arbitrarily close to zero. Therefore, since the case of matrices that cannot be exactly balanced is easy to detect, and can be easily reduced to the exactly balanceable case, from now on we consider only matrices that can be balanced, and therefore represent strongly connected graphs.

We can now state our main theorem for this section, which follows our initial discussion.

Theorem 4.18.

Let $\mathbf{A}$ be a matrix that can be balanced by the diagonal matrix $\mathbf{D}^{*}$ . Then, we can compute an $\varepsilon$ -approximate balancing of $\mathbf{A}$ in time

[TABLE]

This immediately implies that if $\mathbf{D}^{*}$ is, say, quasi-polynomially conditioned, we can find an approximate balancing in nearly linear time.

Again, we can generalize this result to hold for approximate balancings. We make this statement precise in Theorem 4.19.

4.2.1 Reducing Matrix Balancing to Convex Optimization

Similarly to the case of the scaling problem, we encode this problem as a minimization of an appropriately constructed convex function. The function we consider here is

[TABLE]

and this function was already defined in [23]. Similarly to the case of matrix scaling, we will show that (approximately) minimizing this function corresponds to (approximately) balancing the matrix $\mathbf{A}$ . For the rest of this section, we will define $f_{*}$ to be the infimum value of $f$ in its domain, that is $f_{*}=\inf_{x}f(x)$ . The main theorem of this section is the following.

Theorem 4.19.

Suppose that there exists a point $x$ such that $f(x)\leq f_{*}+\varepsilon^{2}\ell_{\mathbf{A}}/24$ , and $\|x\|_{\infty}\leq B$ . Then, we can compute an $\varepsilon$ -approximate balancing of $\mathbf{A}$ in time

[TABLE]

Similarly to the matrix scaling case, the proof of this theorem follows directly from the key lemmas presented below. The proof is presented in Section A.7 of the Appendix. First, we prove that small additive error in function optimization implies an approximate balancing for $\mathbf{A}$ .

Lemma 4.20.

Consider a matrix $\mathbf{A}$ and the corresponding function $f$ . Any vector $x$ satisfying $f(x)-f_{*}\leq\varepsilon^{2}\ell_{\mathbf{A}}/8$ yields an $\varepsilon$ -approximate balancing of $\mathbf{A}$ :

[TABLE]

Proving the lemma requires computing the first and second order derivatives of $f$ .

Lemma 4.21.

Let $\mathbf{M}$ be the matrix obtained by balancing $\mathbf{A}$ with the vector $x$ , which corresponds to $\mathbf{M}=\mathbb{D}(\exp(x))\cdot\mathbf{A}\cdot\mathbb{D}(\exp(-x))$ . The gradient and Hessian of $f$ satisfy the identities:

[TABLE]

Intuitively, since the gradient is $\vec{0}$ precisely when the corresponding point produces an exact balancing, a small gradient should imply a good approximate balancing. This guides the proof of Lemma 4.20. We will prove that a large gradient corresponds to being able to significantly decrease the function value, thus contradicting the approximate optimality of the point. The technical details are presented in Section A.6.

4.2.2 Regularization for Solving via Box-Constrained Newton Method

We observe that the function $f$ defined in (4.3) satisfies all the conditions required to efficiently minimize it using the method we described in Section 3.

Lemma 4.22.

The function $f$ is convex, second-order robust with respect to $\ell_{\infty}$ , and its Hessian is SDD.

The method we described in Section 3 depends on a promise concerning the point we initialize it with. Recall that in order to apply Theorem 3.2 we require an upper bound on the size of the $\ell_{\infty}$ -ball containing the level set of the initial point. In order to provide good bounds, we regularize $f$ . The description and effect of this regularization is captured in the following lemma.

Lemma 4.23.

Suppose that there exists a point $x$ such that $f(x)\leq f_{*}+\varepsilon^{2}\ell_{\mathbf{A}}/24$ , and $\|x\|_{\infty}\leq B$ . Then, the regularization of $f$ is defined as

[TABLE]

and satisfies the following properties:

(1) $\widetilde{f}$ is second-order robust with respect to $\ell_{\infty}$ and has an SDD Hessian,

(2) $f(x)\leq\widetilde{f}(x)$ , and if $\widetilde{x}^{*}$ is the minimizer of $\widetilde{f}$ , then $\widetilde{f}(\widetilde{x}^{*})\leq f(x^{*})+\varepsilon^{2}\ell_{\mathbf{A}}/24$ ,

(3) for all $y$ such that $\widetilde{f}(y)\leq\widetilde{f}(0)$ , $\|y-x^{*}\|_{\infty}=O(B\log(nw_{\mathbf{A}}/\varepsilon))$ .

The details of the lemma are identical to Lemma 4.10, and we therefore omit the proof. In particular, this lemma implies that approximately optimizing the regularized function will still produce an approximately balanced matrix.

Theorem 4.18 then follows by applying Theorem 3.2 to the regularized function defined in Lemma 4.23, and combining it with the error guarantee of Lemma 4.20. The details are presented in Section A.7 of the Appendix. Similarly to the case of the scaling problem, we don’t need to know any a priori bound on $B$ . Just trying increasingly larger values of $B$ (i.e., doubling our guess at each iteration) is sufficient.

4.2.3 Bounding the Condition Number of the Optimal Balancing

As we saw above, the running time given by Theorem 4.18 depends logarithmically on $\kappa(\mathbf{D}^{*})$ , where $\mathbf{D}^{*}$ is the matrix that achieves the optimal balancing. While in general $\kappa(\mathbf{D}^{*})$ can be exponentially large (and therefore we might be better off running the interior point method described in Section 6), tighter connectivity of the graph implies better bounds:

Lemma 4.24.

Let $\mathbf{A}\in\mathbb{{R}}^{n\times n}$ be a nonnegative matrix. Suppose that the graph with adjacency matrix $\mathbf{A}$ is strongly connected, and every vertex can reach every other vertex within at most $k$ hops. Then the matrix $\mathbf{D}^{*}$ that perfectly balances $\mathbf{A}$ has $\log\kappa(\mathbf{D}^{*})=O(k\log w_{\mathbf{A}})$ .

The proof of the lemma is in Section A.8 of the Appendix and it yields the following upper bound on the value of $\kappa(\mathbf{D}^{*})$ .

Corollary 4.25.

If $\mathbf{A}$ is a balanceable matrix, and $\mathbf{D}^{*}$ perfectly balances it, then $\log\kappa(\mathbf{D}^{*})=O(n\log w_{\mathbf{A}})$ . If $\mathbf{A}$ is strictly positive, then $\log\kappa(\mathbf{D}^{*})=O(\log w_{\mathbf{A}})$ .

4.3 Discussion of Numerical Precision Aspects

The exposition of the analysis so far is under the assumption of exact arithmetic. However, our algorithms do in fact tolerate finite fixed-point precision on the scale of the natural parameters of the problem ( $n$ , $\varepsilon$ , $s_{\mathbf{A}}$ and $w_{\mathbf{A}}$ ). It is therefore sufficient to use a number of bits that is logarithmic in the input parameters of the problem.

Between iterations, we store a fixed-point representation of the variables $x_{i}$ . These are, by construction, bounded by the parameter $R_{\infty}$ . It is important (at least if using fixed-point rather than floating-point) that we are storing the $x_{i}$ rather than the actual scalings $e^{x_{i}}$ .

When iterating, we first determine the post-scaling elements of the matrix. These can also simply be stored in fixed-point, i.e., up to additive error. Note that this rounding could completely eliminate very small entries of the matrix. This representation then gives us the gradient and Hessian of the problem up to additive error. To make the additive error polynomially small only a logarithmic number of bits are needed, because the entries of the scaled matrix can never be more than polynomial (in the natural parameters mentioned). This follows from the fact that the objective function, which includes the sum of all the entries, cannot increase.

Finally, there is a polynomially small absolute lower bound on the eigenvalues of the Hessian, simply from the regularizer itself. This ensures that additive error to the gradient and Hessian can only affect the function value improvement by a polynomially larger amount, and ensures the stability of the $k$ -oracle algorithm. Thus polynomially small error is sufficient, requiring only logarithmically many bits.

4.4 Matrix Scaling and Balancing as Nonlinear Flow Problems

An intriguing property of the matrix scaling and matrix balancing problems is that they both can be phrased as instances of a more general problem. This problem can be seen as a generalization of the electrical flow problem: the problem of finding a potential-induced flow that routes a fixed demand, in the case when Ohm’s Law (i.e., the relationship between the potential difference on a given edge and the flow through it) is exponential instead of linear. (See [11] for a comprehensive treatment of such nonlinear networks.) To see this, given a weighted directed graph $G=(V,E,w)$ , let us define the edge-vertex incidence matrix $\mathbf{B}$ as the $m\times n$ matrix with rows indexed by edges and columns indexed by vertices such that

[TABLE]

Using this matrix we define the nonlinear operator $\mathcal{L}$ as follows.

Definition 4.26.

Let $G=(V,E,w)$ be a directed graph with edge-vertex incidence matrix $\mathbf{B}$ , and let $\mathbf{W}=\mathbb{D}(w)$ . We define the operator $\mathcal{L}$ associated with $G$ as

[TABLE]

This can be seen as a nonlinear generalization of the Laplacian operator, which is a linear operator defined as $\mathbf{L}=\mathbf{B}^{\top}\mathbf{W}\mathbf{B}$ . There is extensive literature on solving Laplacian linear systems [51, 25, 26, 24, 8, 28, 29]. We argue that our framework can be used to solve systems of the form

[TABLE]

This can be seen as finding vertex potentials $x$ which induce a flow vector $f$ :

[TABLE]

such that $f$ routes a given demand $d$ . This should be contrasted with the case of electrical flows where the flow is induced as $f_{uv}=w_{uv}(x_{u}-x_{v})$ . As it turns out, the solution to the system $\mathcal{L}(x)=d$ is the minimizer of a function similar to those defined in Equations 4.1 and 4.3. More precisely:

Lemma 4.27.

Let $G=(V,E,w)$ be a directed graph with nonnegative weights, let $\mathbf{A}$ be its adjacency matrix, and let $\mathcal{L}$ be the operator associated with $G$ , as defined in Equation 4.5. Consider the function $f$ defined as

[TABLE]

Then $x$ is a minimizer of $f$ if and only if it is a solution to the system $\mathcal{L}(x)=d$ .

The proof follows directly from writing optimality conditions for $f$ , noting that the condition that $\nabla f(x)=0$ is equivalent to $\mathcal{L}(x)=d$ . Similarly to Theorem 4.6 and Theorem 4.19, we can provide conditions on function value error to bound the error $\|d-\mathcal{L}(x)\|_{2}$ . Also, in order to obtain a good running time, we require regularizing $f$ in a manner similar to the regularization applied in Lemmas 4.10 and 4.23. Note that in the case of the scaling and balancing problems, since we require problem specific error guarantees, the regularization needs to be customized accordingly.

Finally, we state without proof that balancing and scaling are instances of solving $\mathcal{L}(x)=d$ .

Observation 4.28.

Let $\mathbf{A}\in\mathbb{{R}}^{n\times n}$ be a balanceable nonnegative matrix. Let $\mathcal{L}$ be the nonlinear operator associated with the graph with adjacency matrix $\mathbf{A}$ . Then the solution $x$ to $\mathcal{L}(x)=0$ yields a balancing $\mathbb{D}(\exp(x))$ .

Observation 4.29.

Let $\mathbf{A}\in\mathbb{{R}}^{n\times n}$ be an $(r,c)$ -scalable nonnegative matrix. Let $\mathcal{L}$ be the nonlinear operator associated with the graph with adjacency matrix $\begin{bmatrix}\mathbf{0}&\mathbf{A}\\ \mathbf{0}&\mathbf{0}\end{bmatrix}$ . Then the solution $z=(x,y)^{\top}$ , with $x,y\in\mathbb{{R}}^{n}$ , to $\mathcal{L}(z)=(r,-c)^{\top}$ yields an $(r,c)$ -scaling $(\mathbb{D}(\exp(x)),\mathbb{D}(\exp(y)))$ .

5 Implementing an $O(\log n)$ -Oracle in Nearly Linear Time

In Section 4 we reduced the balancing and scaling problems to the approximate minimization of second-order robust functions with respect to the $\ell_{\infty}$ norm. All that is left to obtain a complete algorithm is a fast procedure implementing a $k$ -oracle as in Definition 3.3. Namely, we show how to construct an $O(\log n)$ -oracle for the problem,

[TABLE]

where $\mathbf{M}$ is an SDD matrix. For this section, whenever we say that a matrix is SDD we will also imply that the off-diagonal entries are nonpositive.

One possible approach is to use standard convex optimization reductions to turn this problem into the minimization of the maximum of an $\ell_{\infty}$ norm and an $\ell_{2}$ norm subject to linear constraints. This problem can be solved in time $\widetilde{O}(mn^{1/3})$ using the multiplicative weights framework as applied in [7, 6]. The resulting algorithm for implementing the $k$ -oracle would take time $\widetilde{O}(m+n^{4/3})$ , by taking advantage of spectral sparsification algorithms [49, 32, 31]. Instead, we will come up with a faster algorithm.

Our approach, based on the Lee-Peng-Spielman solver [30], is to identify large sets of vertices where the problem is “easy” to solve and then deal with the rest of the graph (reduced in size) recursively. The particular notion of “easy” we are going to use, is that of strong diagonal dominance.

Definition 5.1.

A matrix $\mathbf{M}$ is $\alpha$ -strongly diagonally dominant ( $\alpha$ -SDD), if for all $i$

[TABLE]

The reason that such matrices enable us to solve the corresponding problems fast is that they can be well-approximated by a diagonal matrix.

Lemma 5.2.

Every $\alpha$ -SDD matrix $\mathbf{M}$ , with diagonal $\mathbb{D}(\mathbf{M})$ , satisfies

[TABLE]

Proof.

This follows from the fact that

[TABLE]

Applying this to each off-diagonal entry, the off-diagonal part of the matrix will be bounded between diagonal matrices; by the $\alpha$ -SDD property these can be bounded by $\pm\frac{1}{1+\alpha}\mathbb{D}(\mathbf{M})$ . ∎

In our context, problems in the form of Equation 5.1, where $\mathbf{M}$ is an $\alpha$ -SDD matrix for some $\alpha\geq\Omega(1)$ , can be turned into well conditioned quadratic minimization problems for which we can apply standard linearly convergent algorithms. A more detailed description and analysis of such algorithms can be found in [38].

Lemma 5.3.

There is an algorithm FastSolve, that given an $\Omega(1)$ -SDD matrix $\mathbf{M}$ , and $\varepsilon>0$ , returns a point $\widetilde{x}$ , such that $\|\widetilde{x}\|_{\infty}\leq 2$ , and

[TABLE]

in time $O(m\log(1/\varepsilon))$ , where $m$ is the number of nonzero entries of $\mathbf{M}$ .

Proof.

By Lemma 5.2, there is some diagonal matrix $\mathbf{D}$ such that

[TABLE]

By applying the transformation $x=\mathbf{D}^{-1/2}z$ , the problem becomes

[TABLE]

We will apply proximal gradient descent, defined by the sequence $z^{(0)}=0$ and

[TABLE]

Computing $z^{(t+1)}$ from $z^{(t)}$ corresponds to computing

[TABLE]

and projecting it to the space $\|\mathbf{D}^{-1/2}z\|_{\infty}\leq 2$ , by simply truncating any coordinates exceeding the bounds. We can clearly implement each iteration in linear time in the number of nonzeros of $\mathbf{M}$ . Since the condition number of the function is at most $(1+2/\alpha)=O(1)$ , such a step will imply that

[TABLE]

and thus inductively,

[TABLE]

Therefore, after $O(\log(1/\varepsilon))$ steps we will have a point with $h(z^{(t)})-h(z^{*})\leq\varepsilon(h(z^{(0)})-h(z^{*})).$ The fact that $h(0)=0$ concludes the proof. ∎
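The iteration in this proof can be sketched as follows. This is a minimal illustration rather than the paper's FastSolve: it assumes the objective $x^{\top}\mathbf{M}x+\langle b,x\rangle$ over $\|x\|_{\infty}\leq 2$ (the exact form of Equation 5.1 is not reproduced here), works directly in the $x$ variables with a diagonally preconditioned step instead of the substitution $x=\mathbf{D}^{-1/2}z$ , and uses a fixed number of steps. The name `fast_solve_sketch` is hypothetical.

```python
import numpy as np

def fast_solve_sketch(M, b, radius=2.0, alpha=1.0, steps=100):
    """Diagonally preconditioned projected gradient descent for
    min x^T M x + <b, x> subject to ||x||_inf <= radius (assumed objective)."""
    d = np.diag(M)
    # For an alpha-SDD matrix, M <= (1 + 1/(1+alpha)) D, so the Hessian 2M
    # is at most L * D with the constant below.
    L = 2.0 * (1.0 + 1.0 / (1.0 + alpha))
    x = np.zeros_like(b)
    values = []
    for _ in range(steps):
        values.append(x @ M @ x + b @ x)
        grad = 2.0 * M @ x + b
        # The proximal step in the D-norm separates over coordinates, so
        # projecting onto the box is plain truncation.
        x = np.clip(x - grad / (L * d), -radius, radius)
    values.append(x @ M @ x + b @ x)
    return x, values

M = np.array([[10.0, -1.0, -1.0],
              [-1.0, 10.0, -1.0],
              [-1.0, -1.0, 10.0]])
b = np.array([-100.0, 3.0, 50.0])
x, values = fast_solve_sketch(M, b, alpha=4.0)
```

Because the quadratic upper bound used in each step is valid whenever $\mathbf{M}$ is $\alpha$ -SDD, the recorded objective values are nonincreasing, and truncated coordinates stay at the boundary of the box.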

An even simpler case is when the matrix is of size 1, in which case the problem can be exactly solved in constant time:

Lemma 5.4.

There is an algorithm TrivialSolve, that given a 1 by 1 matrix $\mathbf{M}$ returns an $x$ optimizing $x^{\top}\mathbf{M}x+\langle b,x\rangle$ over the interval $[-1,1]$ .

Proof.

By convexity, there must exist an optimal $x$ that is either one of the two endpoints of the interval, or the unique global optimum of the function over the line. One may simply check all candidates and return the best value. ∎
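The candidate check described in the proof is short enough to write out. The sketch below treats the $1\times 1$ matrix as a nonnegative scalar `m`; the function name is hypothetical.

```python
def trivial_solve(m, b):
    """Minimize m*x*x + b*x over [-1, 1] for a scalar m >= 0."""
    candidates = [-1.0, 1.0]
    if m > 0:
        stationary = -b / (2.0 * m)  # unconstrained optimum on the line
        if -1.0 <= stationary <= 1.0:
            candidates.append(stationary)
    # Return the candidate with the smallest objective value.
    return min(candidates, key=lambda x: m * x * x + b * x)
```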

A key insight of [30] is that one can find $\Omega(1)$ -SDD submatrices of $\mathbf{M}$ of size $\Omega(n)$ . We denote such a subset by $F$ and $V\setminus F$ by $C$ . To ensure that solving the problem for $x_{F}$ will not interfere with our solution $x_{C}$ we map a solution $\hat{x}_{C}$ supported only on coordinates of $C$ to a solution $x_{C}$ through a linear mapping $\mathbf{P}$ . If $\mathbf{P}$ were the energy minimizing extension of voltages on $C$ to voltages on $V$ ,

[TABLE]

we would have that $x_{F}$ and $x_{C}$ are $\mathbf{M}$ -orthogonal, since $x_{F}^{\top}\mathbf{M}\mathbf{P}\hat{x}_{C}=0$ . Then, optimizing over $\hat{x}_{C}$ involves the quadratic $\mathbf{P}^{\top}\mathbf{M}\mathbf{P}$ which is exactly equal to $\mathbf{M}_{[C,C]}-\mathbf{M}_{[C,F]}\mathbf{M}_{[F,F]}^{-1}\mathbf{M}_{[F,C]}=\operatorname{Sc}(\mathbf{M},F)$ . Applying this process recursively leads to the notion of vertex sparsifier chains that we will heavily rely on.
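The identity $\mathbf{P}^{\top}\mathbf{M}\mathbf{P}=\operatorname{Sc}(\mathbf{M},F)$ can be checked numerically on a small example. The sketch below assumes the standard form of the energy minimizing extension, which is the identity on $C$ and $-\mathbf{M}_{[F,F]}^{-1}\mathbf{M}_{[F,C]}$ on $F$ ; the function names and the test matrix are illustrative only.

```python
import numpy as np

def schur_complement(M, F, C):
    """Sc(M, F) = M_CC - M_CF M_FF^{-1} M_FC."""
    M_FF, M_FC = M[np.ix_(F, F)], M[np.ix_(F, C)]
    M_CF, M_CC = M[np.ix_(C, F)], M[np.ix_(C, C)]
    return M_CC - M_CF @ np.linalg.solve(M_FF, M_FC)

def energy_minimizing_extension(M, F, C):
    """Matrix P mapping voltages on C to voltages on all of V (assumed form)."""
    P = np.zeros((M.shape[0], len(C)))
    P[C, :] = np.eye(len(C))
    P[F, :] = -np.linalg.solve(M[np.ix_(F, F)], M[np.ix_(F, C)])
    return P

M = np.array([[ 3.0, -1.0, -1.0,  0.0],
              [-1.0,  3.0,  0.0, -1.0],
              [-1.0,  0.0,  2.0, -1.0],
              [ 0.0, -1.0, -1.0,  3.0]])
F, C = [0, 1], [2, 3]
P = energy_minimizing_extension(M, F, C)
```

In this example $\mathbf{M}\mathbf{P}$ vanishes on the rows indexed by $F$ , which is the $\mathbf{M}$ -orthogonality property used in the text.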

Definition 5.5 (Definition 5.7 of [30]).

For any $SDD$ matrix $\mathbf{M}^{(0)}$ , a vertex sparsifier chain of $\mathbf{M}^{(0)}$ with parameters $\alpha_{i}\geq 4$ and $1/2\geq\varepsilon_{i}>0$ , is a sequence of matrices and subsets $(\mathbf{M}^{(1)},\ldots,\mathbf{M}^{(d)};F_{1},\ldots,F_{d-1})$ such that

1. $\mathbf{M}^{(1)}\approx_{\varepsilon_{0}}\mathbf{M}^{(0)}$ ,

2. $\mathbf{M}^{(i+1)}\approx_{\varepsilon_{i}}\operatorname{Sc}(\mathbf{M}^{(i)},F_{i})$ ,

3. $\mathbf{M}_{[F_{i},F_{i}]}^{(i)}$ is $\alpha_{i}$ -strongly diagonally dominant, and

4. $\mathbf{M}^{(d)}$ has size 1.

Note that this last requirement is slightly different from [30]: we require the chain to end with size 1, rather than just being constant. However, the chain construction from [30] immediately extends to this requirement (they presumably proposed stopping early because it is a simple optimization that would likely be valuable in any implementation).

To be able to reason about the approximation guarantees of the chain as a whole we will use an error-quantifying definition.

Definition 5.6 (Definition 5.9 of [30]).

An $\varepsilon$ -vertex sparsifier chain of an SDD matrix $\mathbf{M}^{(0)}$ of work $W$ , is a vertex sparsifier chain of $\mathbf{M}^{(0)}$ with parameters $\alpha_{i}\geq 4$ and $1/2\geq\varepsilon_{i}>0$ that satisfies

1. $2\sum_{i=0}^{d-1}\varepsilon_{i}\leq\varepsilon$ ,

2. $\sum_{i=0}^{d-1}m_{i}\log_{\alpha_{i}}\varepsilon_{i}^{-1}\leq W$ , where $m_{i}$ is the number of nonzeros in $\mathbf{M}^{(i)}$ .

Finally, the construction of such chains, as well as their error guarantees have been already analyzed in [30] and can be used in a black-box manner.

Theorem 5.7 (Theorem 5.10 of [30]).

Every SDD matrix $\mathbf{M}$ of dimension $n$ has a $\delta$ -vertex sparsifier chain of work $O(n)$ and $d\leq O(\log n)$ , for any constant $0<\delta\leq 1$ . Such a chain can be constructed in time $\widetilde{O}(m)$ .

We note that Theorem 5.7 was stated for $\delta=1$ , but it is straightforward to modify the proof without changing the work or the length of the chain by more than a constant factor.

Since we cannot exactly compute the energy minimizing mapping $\mathbf{P}$ , we will define an approximate mapping that suffices for our purposes.

Definition 5.8.

A linear mapping $\mathbf{\widetilde{P}}$ is an $\varepsilon$ -approximate voltage extension from $C$ to $V$ according to $\mathbf{M}$ if for any $\hat{x}_{C}\in\mathbb{{R}}^{|C|}$ ,

1. $\|(\mathbf{\widetilde{P}}-\mathbf{P})\hat{x}_{C}\|_{\mathbf{M}}\leq\varepsilon\|\mathbf{P}\hat{x}_{C}\|_{\mathbf{M}}$ ,

2. $\mathbf{\widetilde{P}}$ is the identity on coordinates in $C$ , and

3. the coordinates of $\mathbf{\widetilde{P}}\hat{x}_{C}$ are convex combinations of the coordinates of $\hat{x}_{C}$ and 0,

where $\mathbf{P}$ is the energy minimizing extension.

We will construct such a mapping through a simple averaging scheme. First we set the voltage of every vertex in $F$ to be the weighted average of its neighbors in $C$ . Then at every step we replace its voltage by the weighted average of all its neighbors. (Here, excess diagonal is treated as an edge to a vertex with voltage 0.) We do so for $O(\log(1/\varepsilon))$ iterations. We formally describe the procedure in Figure 5.1.

It is easy to see that all steps of the algorithm are linear maps, and we can therefore also implement its transpose.
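The averaging scheme can be sketched as a Jacobi-style iteration on the $F$ block. This is an illustrative reading of the description above, not a transcription of Figure 5.1: it assumes nonpositive off-diagonal entries, and it realizes the initial step by starting from voltage 0 on $F$ , so that the first pass only sees neighbors in $C$ . Dividing by the full diagonal is what treats excess diagonal as an edge to a vertex with voltage 0. The function name is hypothetical.

```python
import numpy as np

def approx_voltage_extension(M, F, C, x_hat_C, iters):
    """Keep voltages on C fixed and repeatedly replace each voltage on F
    by the weighted average of its neighbors (Jacobi iteration)."""
    d_F = np.diag(M)[F]
    W_FF = -(M[np.ix_(F, F)] - np.diag(d_F))  # edge weights inside F
    W_FC = -M[np.ix_(F, C)]                   # edge weights from F to C
    x = np.zeros(M.shape[0])
    x[C] = x_hat_C
    # Starting from 0 on F, the first pass averages over neighbors in C only.
    for _ in range(iters + 1):
        x[F] = (W_FC @ x[C] + W_FF @ x[F]) / d_F
    return x

M = np.array([[ 3.0, -1.0, -1.0,  0.0],
              [-1.0,  3.0,  0.0, -1.0],
              [-1.0,  0.0,  2.0, -1.0],
              [ 0.0, -1.0, -1.0,  3.0]])
F, C = [0, 1], [2, 3]
x_hat_C = np.array([1.0, -2.0])
x = approx_voltage_extension(M, F, C, x_hat_C, iters=40)
# Exact energy minimizing voltages on F, for comparison.
exact_F = -np.linalg.solve(M[np.ix_(F, F)], M[np.ix_(F, C)] @ x_hat_C)
```

Every update is a nonnegative combination of current voltages with total weight at most 1, which is why the output coordinates remain convex combinations of the coordinates of $\hat{x}_{C}$ and 0.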

Lemma 5.9.

For any SDD matrix $\mathbf{M}$ , given an $\Omega(1)$ -SDD subset $F$ and some $\varepsilon>0$ one can apply an $\varepsilon$ -approximate voltage extension mapping in time $O(m\log(1/\varepsilon))$ .

Proof.

The linearity of the mapping, properties 2 and 3, as well as the runtime claimed hold inductively by the construction of the mapping. We will now prove property 1, namely, if $P$ is the true energy minimizing mapping and $\widetilde{P}$ is our approximate mapping, for any $\hat{x}\in\mathbb{{R}}^{|C|}$ , $\|(\widetilde{P}-P)\hat{x}_{C}\|_{\mathbf{M}}\leq\varepsilon\|P\hat{x}_{C}\|_{\mathbf{M}}$ .

First, we will bound the error of $x^{(0)}$ . We define $v^{(t)}=x^{(t)}-P\hat{x}_{C}$ , which is 0 outside $F$ by construction. We will use the notation $w_{ij}=-{\mathbf{M}}_{ij}$ –i.e. the edge weight between $i$ and $j$ , and $w_{i\varnothing}=\mathbf{M}_{ii}-\sum_{j\neq i}|\mathbf{M}_{ij}|$ . Here, $w_{i\varnothing}$ accounts for “excess diagonal” of $\mathbf{M}$ , which we treat as an edge to a “virtual vertex” $\varnothing$ . We will also define $(\hat{x}_{C})_{\varnothing}=0$ . Now, we have

[TABLE]

Rearranging gives $\|v^{(0)}\|_{\mathbb{D}(\mathbf{M})}\leq\sqrt{1+\frac{1}{\alpha}}\|P\hat{x}_{C}\|_{\mathbf{M}}$ .

Next, we note that $v^{(t)}=\mathbb{D}(\mathbf{M})^{-1}(\mathbf{I}-\mathbf{M})_{[F,F]}v^{(t-1)}$ . This follows from the fact that

[TABLE]

since $\mathbf{M}(P\hat{x}_{C})$ is 0 on $F$ . Applying Lemma 5.2, we get

[TABLE]

This implies that

[TABLE]

By induction, we have

[TABLE]

Finally, using Lemma 5.2 again, we get

[TABLE]

∎

Given such a mapping, we can uniquely express any voltage vector as $x=x_{C}+x_{F}$ , where $x_{C}=\mathbf{\widetilde{P}}\hat{x}_{C}$ (i.e. it is in the image of $\widetilde{P}$ ) and $x_{F}$ is supported on $F$ . By the convex combination property of $\mathbf{\widetilde{P}}$ , we have $\|x_{C}\|_{\infty}\leq\|\hat{x}_{C}\|_{\infty}\leq\|x\|_{\infty}$ ; since $x_{F}=x-x_{C}$ , by the triangle inequality we have $\|x_{F}\|_{\infty}\leq 2\|x\|_{\infty}$ . The domain $\|x\|_{\infty}\leq 1$ is therefore contained in $\|\hat{x}_{C}\|_{\infty}\leq 1,\|x_{F}\|_{\infty}\leq 2$ . Moreover, any point for which $\|\hat{x}_{C}\|_{\infty}\leq k$ , $\|x_{F}\|_{\infty}\leq 2$ , corresponds to a point $x=\mathbf{\widetilde{P}}\hat{x}_{C}+x_{F}$ with $\|x\|_{\infty}\leq k+2$ .

Having expressed all of the components of our approach, stating the algorithm is simple. Given the decomposition of the problem the vertex sparsifier chain provides, we will solve the smallest problem and then iteratively combine it with the solution of the submatrices along the chain. The algorithm is formally described in Figure 5.2, and the main claim in Theorem 5.11.

To facilitate the analysis we first state the following decoupling lemma.

Lemma 5.10.

Consider an SDD matrix $\mathbf{M}$ , a partition of its columns $(F,C)$ and some $0<\varepsilon\leq 1/2$ . Let $\mathbf{\widetilde{P}}$ be an $\varepsilon$ -approximate voltage extension from $C$ to $(F,C)$ as defined in Definition 5.8. Then

[TABLE]

Proof.

Since $\mathbf{P}$ is the true energy minimizing extension, we know that $\mathbf{M}(\mathbf{P}\hat{x}_{C})$ , will be zero on the coordinates of $F$ . Since $(\mathbf{\widetilde{P}}-\mathbf{P})\hat{x}_{C}+x_{F}$ is supported on $F$ (property 2 of Definition 5.8), we can expand

[TABLE]

The first property of $\mathbf{\widetilde{P}}$ in Definition 5.8, implies that

[TABLE]

We can upper bound the contribution of the cross-term as

[TABLE]

Similarly, we can lower bound the contribution of the cross-term as

[TABLE]

Combining these inequalities and rearranging terms concludes the proof. ∎

We can now use this lemma to prove that decoupling the problem and using approximate Schur complements does not reduce the quality of the solution by more than a constant factor.

Theorem 5.11.

Algorithm OptimizeChain implements an $O(\log n)$ -oracle, and runs in time $\widetilde{O}(m)$ .

Proof.

By construction and the triangle inequality we have that

[TABLE]

We will define the following functions to reason about our approximation guarantees:

[TABLE]

We can now derive the following facts about these functions:

1. Directly applying Lemma 5.10, for any $\hat{x}_{C}$ , $x_{F}$ ,

[TABLE]

2. Using the fact that any $x$ with $\|x\|_{\infty}\leq 1$ can be written as $\mathbf{\widetilde{P}}\hat{x}_{C}+x_{F}$ for $\|\hat{x}_{C}\|_{\infty}\leq 1$ and $\|x_{F}\|_{\infty}\leq 2$ , and Lemma 5.10,

[TABLE]

3. By the definition of the vertex sparsifier chain, $\mathbf{M}^{(i+1)}\approx_{\varepsilon_{i}}\operatorname{Sc}(\mathbf{M}^{(i)},F_{i})$ , implying for any $x$

[TABLE]

4. Again, from the fact that $\mathbf{M}^{(i+1)}\approx_{\varepsilon_{i}}\operatorname{Sc}(\mathbf{M}^{(i)},F_{i})$ ,

[TABLE]

5. By the definition of $b^{(i)}$ ,

[TABLE]

We are going to combine these facts to show that for any $i\in[1,d]$ ,

[TABLE]

We proceed by induction from $d$ to $1$ . Recall that $\varepsilon_{i}\leq 1/2$ .

For the case of $i=d$ it trivially holds by the guarantee of TrivialSolve:

[TABLE]

Assuming that it holds for any $j>i$ , by the guarantees of FastSolve:

[TABLE]

Finally, we can similarly argue about $H_{0}(x^{(1)})$ : it is at most $e^{\varepsilon_{0}}H_{1}(x^{(1)})$ while $\min_{\|x\|_{\infty}\leq 1}H_{1}(x)\leq e^{-3\varepsilon_{0}}\min_{\|x\|_{\infty}\leq 1}H_{0}(x)$ . By the guarantees of the vertex sparsifier chain, we know that $\sum_{i=0}^{d-1}\varepsilon_{i}\leq\delta$ for some constant $\delta$ of our choice. By choosing the right constant we can ensure that our multiplicative error is less than $1/2$ .

In order to bound the runtime we notice that for every $i$ we need to compute ApproxMapping and FastSolve which both take time $O(m_{i}\log(1/\varepsilon_{i}))$ . By the bound on the work $W$ of the chain we get that applying the chain takes time $O(n)$ , while constructing it takes time $\widetilde{O}(m)$ . ∎

6 Matrix Scaling and Balancing with Exponential Cone Programming

The algorithm developed in the previous sections is essentially optimal in the regime where the ratio between the scaling factors is relatively small (say polynomial in $n$ ). Since there are matrices for which this ratio is exponential, we develop a complementary algorithm with negligible runtime dependence on this ratio, at the cost of a mild increase in the dependence on $m$ . The algorithm is based on interior point methods.

Although interior point methods would seem like a natural option for the problems of matrix scaling and balancing, standard formulations require solving linear systems involving various rescalings of the input matrix. A priori, it is not clear whether these can be solved faster than matrix multiplication time. However, it turns out that a somewhat nonstandard formulation requires solving linear systems for more structured matrices. In particular, we will see that these matrices admit a decomposition involving only matrices that are easy to invert (triangular matrices, solvable by back substitution, and SDD matrices, which can be tackled via a standard Laplacian solver). Notably, a similar observation was made by Daitch and Spielman [10], in the case of interior point methods applied to flow problems on graphs. The authors of [22] also consider a formulation similar to ours for the matrix scaling problem; however, they do not prove exact convergence bounds or state the algorithm rigorously. Moreover, since nearly-linear SDD solvers were not known at the time, their algorithm provided no benefit compared to other approaches.

The main result of this section is the following.

Theorem 6.1.

Given a nonnegative matrix $\mathbf{A}\in\mathbb{{R}}^{n\times n}$ , one can:

1. compute an $\varepsilon$ -balancing in time

[TABLE]

2. if the matrix is almost $(r,c)$ -scalable, compute an $\varepsilon$ - $(r,c)$ -scaling in time

[TABLE]

This follows from the fact that a specific class of functions, which captures both balancing and scaling, can be minimized efficiently. We state this result in the following theorem.

Theorem 6.2.

Let $\mathbf{A}\in\mathbb{{R}}^{n\times n}$ be a nonnegative matrix with $m$ nonzero entries, let $f$ be the function

[TABLE]

and let $B_{x}$ be a positive real number. There exists an algorithm which, for any $\varepsilon>0$ , finds a vector $x$ such that $f(x)-f(x^{*(B_{x})})\leq\varepsilon$ (where $x^{*(B_{x})}$ is the optimum of $f$ over the region $\|x\|_{\infty}\leq B_{x}$ ) in time

[TABLE]

Using this result, one can then conclude the proof of Theorem 6.1.

Proof of Theorem 6.1.

For the balancing objective, we first decompose the nonzero entries of $\mathbf{A}$ into strongly connected components. For each component, we invoke Theorem 6.2 with $d=0$ . From Corollary 4.25 we have that $\|x^{*}\|_{\infty}=O(n\log w_{\mathbf{A}})$ . Plugging this in, along with Lemma 4.20, we obtain a total running time of $\widetilde{O}(m^{3/2}\log(w_{\mathbf{A}}\varepsilon^{-1}))$ .

For the $\varepsilon$ - $(r,c)$ -scaling objective, we set $d=(r,-c)^{\top}$ , and run the interior point method on the matrix $\begin{bmatrix}\mathbf{0}&\mathbf{A}\\ \mathbf{0}&\mathbf{0}\end{bmatrix}$ . Lemma 4.14 ensures that the entries of the $(r,c)$ scalings exist within a polynomially bounded $\ell_{\infty}$ -ball. Using Lemma 4.7, this yields the conclusion. ∎

We prove Theorem 6.2 by showing that an interior point method defined and analyzed by Nesterov [38] can be efficiently implemented. In order to do so, we require two components. The former involves providing a formulation for minimizing the function in F1 for which the interior point method can produce an iterate that is close in value to optimum within a small number of iterations. The latter involves showing how to efficiently implement these iterations. Generally they involve solving a linear system; in our case, we show that such iterations can be executed by solving an SDD linear system to constant accuracy.

6.1 Setting Up the Interior Point Method, and Bounding the Number of Steps

In order to apply an interior point method, we first reformulate the problem in F1 in an equivalent form.

Lemma 6.3 (Equivalent Formulation).

Let $B_{x}$ be a promise on the magnitude of the entries in the optimal solution of F1:

[TABLE]

Also, let

[TABLE]

Then the objective

[TABLE]

has an identical value and solution to F1.

Proof.

Given the promise, the bounds on $x$ are redundant. This is also the case with the upper bounds on $t_{ij}$ , since setting $x=\vec{0}$ and $t_{ij}=\mathbf{A}_{ij}$ yields a solution of value $s_{\mathbf{A}}$ . Therefore, setting $t^{*}_{ij}=\mathbf{A}_{ij}e^{x^{*}_{i}-x^{*}_{j}}$ , the value of $\langle\vec{1},t^{*}_{ij}\rangle$ must be at most $s_{\mathbf{A}}+\langle d,x^{*}\rangle\leq s_{\mathbf{A}}+\|d\|_{1}B_{x}<U$ . ∎

The second step is to replace the hard constraints with appropriate barrier functions, whose value blows up when approaching the boundary of the feasible set $S$ (see [2, 38] for more details). More precisely, we consider the barrier functions

[TABLE]

for all $(i,j)\in\mathrm{supp}(\mathbf{A})$ , and

[TABLE]

The former blow up when $t_{ij}$ approaches $\exp(\log\mathbf{A}_{ij}+x_{i}-x_{j})$ from above, and are standard in exponential cone programming [2]. The barrier $\psi$ handles all the other inequality constraints. Importantly, all these barrier functions are well behaved, in the sense that they satisfy a required property called self-concordance. Since this property determines the number of iterations the method needs to execute, we highlight it below.

Fact 6.4.

The function $\xi(t,x)=\psi(t,x)+\sum_{(i,j)\in\mathrm{supp}(\mathbf{A})}\phi_{ij}(t,x)$ is an $O(m)$ -self-concordant barrier for the set $S$ defined in F2.

With the barrier function set up, the method has to solve a sequence of subproblems of the form

[TABLE]

while increasing $\mu$ until it becomes sufficiently large that the solution we produce is close to the optimum of the initial constrained problem.

What is essential here is the number of iterations of the method, which depends mostly on the quality of the barrier function, and little on the initialization and the accuracy to which we want to solve. More precisely, we apply the following theorem, which follows from [38], Theorems 4.2.9 and 4.2.11.

Theorem 6.5.

Given an initial point $v$ in the strict interior of $\mathcal{D}$ with a $\nu$ -self-concordant barrier $\xi$ , the problem $\min_{v\in\mathcal{D}}c^{\top}v$ can be solved to within $\varepsilon$ additive error in

[TABLE]

iterations, where $v_{0}$ is the minimizer over $\mathcal{D}$ of $\xi(v)$ .

In what follows we bound the quantities involved in the above statement. In order to have a bound on the number of iterations required for our cone program, we require lower bounding $\nabla^{2}\xi(v_{0})$ . In order to do so, we lower bound the Hessian everywhere.

Lemma 6.6.

The Hessian $\nabla^{2}\xi$ is lower bounded everywhere by the diagonal matrix with $\frac{1}{9U^{2}}$ on $t$ variables and $\frac{2}{B_{x}^{2}}$ on $x$ variables.

Proof.

We have $\nabla^{2}\xi=\nabla^{2}\phi+\nabla^{2}\psi\succcurlyeq\nabla^{2}\psi$ , so it suffices to lower bound $\nabla^{2}\psi$ . By calculation we see that $\nabla^{2}\psi$ is diagonal, and

[TABLE]

This gives a lower bound on $\nabla^{2}\psi$ , and thus on $\mathbf{H}=\nabla^{2}\xi$ . ∎

We also show how to pick the initial point for our particular problem, which turns out to be a trivial task, since the only requirement is that it lies in the strict interior of $\mathcal{D}$ . The more challenging part is upper bounding $\nabla\xi$ at that point in the Hessian inverse norm.

Lemma 6.7.

The point $v=(t,x)$ , where $t_{ij}=2U$ , $x=\vec{0}$ , belongs to the strict interior of $S$ . Furthermore, $\log\|\nabla\xi(v)\|_{\nabla^{2}\xi(v)^{-1}}=O(\log(2+m+B_{x}))$ .

Proof.

First we verify that the point belongs to the strict interior. As we set $x=\vec{0}$ , no constraint on $x$ is tight. Similarly, $3U>t_{ij}=2U>0$ .

For the second part, we may simply bound the contribution from each term of the barrier to each entry of the gradient. The $t$ entries end up bounded by at most $\frac{O(m)}{U}$ , while the $x$ entries end up bounded by $O(m)+\frac{O(m)}{B_{x}}$ , providing the claimed bound. ∎

6.2 Implementing an Iteration of the Interior Point Method

The steps mentioned in the statement of Theorem 6.2 consist only of standard Newton steps, i.e. minimizing a second order local approximation of the function $f_{\mu}(x)$ . These steps are generally expensive, since they involve applying the inverse of $\nabla^{2}f_{\mu}$ to a vector. In our case, fortunately, we are able to exploit the structure of $f$ in order to do this in nearly linear time in the sparsity of $\nabla^{2}f_{\mu}$ .

Below we give a precise statement concerning our ability to solve linear systems involving the Hessian matrix.

Theorem 6.8.

For any $\varepsilon>0$ , and any $\mathbf{H}=\nabla^{2}f_{\mu}(v)=\nabla^{2}\xi(v)$ , where $v=(t,x)$ is a point in the strict interior of the feasible region $S$ (see F2), and any vector $b\in\mathbb{{R}}^{m+n}$ , one can, with high probability, compute in $\tilde{O}(m\log\varepsilon^{-1})$ time a vector $y$ such that $\|y-y^{*}\|_{\mathbf{H}}\leq\varepsilon\|y^{*}\|_{\mathbf{H}}$ , where $y^{*}$ is the solution to $\mathbf{H}y^{*}=b$ .

In order to achieve this result, we leverage the power of Laplacian solvers. From the algorithmic point of view, the crucial property of the Laplacian is that it is symmetric and diagonally dominant. This enables us to use fast approximate solvers for symmetric and diagonally dominant linear systems. Namely, there is a long line of work [51, 25, 26, 24, 8, 28, 29] that builds on an earlier work of Vaidya [52] and Spielman and Teng [50], that designed an SDD linear system solver. We employ as a black box the following theorem, which follows from [26], and constructs an operator that approximates $\mathbf{M}^{+}$ .

Theorem 6.9.

For any $\varepsilon>0$ , and any SDD matrix $\mathbf{M}\in\mathbb{{R}}^{n\times n}$ with $m$ nonzero entries, and any vector $b$ in the image of $\mathbf{M}$ , one can, with high probability, compute in $\widetilde{O}(m\log\varepsilon^{-1})$ time a vector $x$ such that $\|x-x^{*}\|_{\mathbf{M}}\leq\varepsilon\|x^{*}\|_{\mathbf{M}}$ , where $x^{*}$ is the solution of $\mathbf{M}x^{*}=b$ . Furthermore, a given choice of random bits produces a correct result for all $b$ simultaneously, and makes $x$ linear in $b$ .

The result follows using this tool, and the following structural lemma, whose proof can be found in Appendix A.10.

Lemma 6.10.

The Hessian $\nabla^{2}\xi$ has a factorization

[TABLE]

where $\mathbf{S}$ is SDD, $\mathbf{U}$ is lower triangular, and each of them has $O(m)$ nonzero entries. Furthermore, this factorization can be computed in $O(m)$ time.

With this in hand, proving Theorem 6.8 is immediate. First note that, given $\widetilde{\mathbf{S}}$ , we can actually choose a $\widetilde{\mathbf{S}}^{-1}$ satisfying the needed properties: a linear-operator based graph Laplacian solver, such as [26]. Having access to the linear operator $\widetilde{\mathbf{S}}^{-1}$ , we consider the error produced by applying the operator $\mathbf{U}^{-1}\widetilde{\mathbf{S}}(\mathbf{U}^{\top})^{-1}$ :

[TABLE]
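To make the role of the factorization concrete, here is a minimal dense sketch of how a system in $\mathbf{H}$ reduces to two triangular solves and one SDD solve. It assumes a factorization of the form $\mathbf{H}=\mathbf{U}^{\top}\mathbf{S}\mathbf{U}$ , which is what the operator above suggests but is not confirmed by the text reproduced here. It also uses exact dense solves from NumPy in place of back substitution and a nearly-linear time Laplacian solver; the function name and the example matrices are hypothetical.

```python
import numpy as np

def solve_factorized(U, S, b):
    """Solve (U^T S U) y = b by three easier solves (assumed factorization)."""
    w = np.linalg.solve(U.T, b)   # triangular system: back substitution in practice
    z = np.linalg.solve(S, w)     # SDD system: approximate Laplacian solver in practice
    return np.linalg.solve(U, z)  # triangular system: forward substitution in practice

U = np.array([[1.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, -2.0, 1.0]])
S = np.array([[ 3.0, -1.0, -1.0],
              [-1.0,  2.0, -1.0],
              [-1.0, -1.0,  4.0]])
b = np.array([1.0, 2.0, 3.0])
y = solve_factorized(U, S, b)
```

With an approximate solver for $\mathbf{S}$ , only the middle step introduces error, and that error is measured in the $\mathbf{S}$ -norm of the transformed variable, which coincides with the $\mathbf{H}$ -norm of $y$ under this assumed form.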

6.3 Error Tolerance

While the classical analysis of Newton’s method used for iterations of interior point methods assumes exact computations, in our case the Laplacian solver we employ adds some error. We quickly show that this error does not hurt us, and as a matter of fact it is sufficient to solve these systems to constant accuracy.

First, we require understanding the guarantees of Newton’s method, and its requirements.

Fact 6.11 (Progress via Newton Steps).

Let $v$ be a point in the interior of the feasible region such that

[TABLE]

Then, applying one step of the interior point method consists of producing a new iterate

[TABLE]

In order to make progress it is sufficient that $\|\nabla f_{\mu}(v^{\prime})\|_{\mathbf{H}_{v^{\prime}}^{-1}}\leq\frac{1}{6}$ .

We can easily show that applying the inverse matrix with the solver guarantees we give in Theorem 6.8 is sufficient in order to make progress.

Lemma 6.12.

Let $v$ be a point in the interior of the feasible region such that

[TABLE]

Letting $v^{\prime\prime}=v-\Delta$ such that

[TABLE]

for $\varepsilon\leq 0.1$ , we get that

[TABLE]

Since the proof is rather standard, we defer it to Appendix A.9.
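A small numerical experiment illustrates why constant accuracy is plausible. The sketch below uses a toy self-concordant function, a linear term plus the logarithmic barrier of the box $[-1,1]^{n}$ , rather than the barrier $\xi$ of this section, and it perturbs the exact Newton step by a relative error of $0.1$ in the Hessian norm. The function names and the constants are illustrative assumptions.

```python
import numpy as np

def grad_hess(x, c):
    """Gradient and diagonal Hessian of f(x) = <c, x> - sum_i log(1 - x_i^2)."""
    g = c + 1.0 / (1.0 - x) - 1.0 / (1.0 + x)
    h = 1.0 / (1.0 - x) ** 2 + 1.0 / (1.0 + x) ** 2
    return g, h

def newton_decrement(x, c):
    """Norm of the gradient measured in the inverse Hessian norm."""
    g, h = grad_hess(x, c)
    return float(np.sqrt(np.sum(g * g / h)))

c = np.array([0.2, -0.1, 0.05])
x = np.zeros(3)
g, h = grad_hess(x, c)
exact_step = g / h
inexact_step = 1.1 * exact_step  # relative error 0.1 in the Hessian norm

before = newton_decrement(x, c)
after_exact = newton_decrement(x - exact_step, c)
after_inexact = newton_decrement(x - inexact_step, c)
```

In this example the inexact step still shrinks the Newton decrement substantially, although less than the exact step does.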

6.4 Putting Everything Together

We can combine the results from this section in order to provide a proof for Theorem 6.2.

Proof of Theorem 6.2.

By combining Theorem 6.5, along with Fact 6.4, Lemma 6.6 and Lemma 6.7, we see that we can approximately minimize the function defined in Equation F1 by performing

[TABLE]

iterations of the interior point method referenced in Theorem 6.5.

From Theorem 6.8 and the iteration accuracy required by Fact 6.11 and Lemma 6.12 we see that each iteration of the interior point method referenced in Theorem 6.2 can be implemented in time $\widetilde{O}(m)$ . This yields the conclusion. ∎

Appendix A Deferred Proofs

A.1 Proof of Theorem 3.4

Proof of Theorem 3.4.

The iteration we are going to implement is

[TABLE]

Since $f$ is second-order robust with respect to $\ell_{\infty}$ by definition (see Definition 3.1), we know that within an $\ell_{\infty}$ -ball centered at $x$ the function $f$ is lower and upper bounded by $f_{L}$ and $f_{U}$ , respectively, where:

[TABLE]

Also, define $x_{L}$ and $x_{U}$ to be the minimizers of $f_{L}$ and $f_{U}$ , respectively, over the $\ell_{\infty}$ -ball of radius $\frac{1}{k}$ centered at $x$ , i.e.

[TABLE]

Next, we see how much $f$ decreases when we move from $x$ to $x^{\prime}=x+\frac{1}{k}\Delta$ , where $\Delta$ is obtained via the oracle call $\mathcal{O}\left(\frac{e^{2}}{k^{2}}\nabla^{2}f(x_{i}),\frac{1}{k}\nabla f(x)\right)$ . We know from Definition 3.3 that

[TABLE]

as the function the oracle is approximately minimizing is precisely that. Expanding $f_{U}\left(x+\frac{1}{k}\Delta\right)$ we have

[TABLE]

Since $f(x)=f_{U}(x)$ , we see that

[TABLE]

Also, we have that

[TABLE]

Combining this with Equation A.3 gives

[TABLE]

Finally, we show that this amount of progress is comparable to that achievable by making a large step towards $x^{*}$ . More precisely, we have from the $R_{\infty}$ condition that $\|x-x^{*}\|_{\infty}\leq R_{\infty}$ . Thus, letting $\hat{x}=x+\frac{1}{\max(kR_{\infty},1)}(x^{*}-x)$ , we have that $\|\hat{x}-x\|_{\infty}\leq\frac{1}{k}$ . Therefore, $f_{L}(\hat{x})\geq f_{L}(x_{L})$ , since $x_{L}$ is a minimizer of $f_{L}$ over the $\ell_{\infty}$ -ball of radius $\frac{1}{k}$ around $x$ . Also, since $f_{L}$ lower bounds $f$ over this $\ell_{\infty}$ -ball,

[TABLE]

Combining Equations A.4 and A.5, we see that

[TABLE]

where the last inequality follows from convexity. This implies that at every iteration $f(x)-f(x^{*})$ is decreased by a factor of $(1-\Omega(1/(kR_{\infty}+1)))$ , implying that after

[TABLE]

iterations, we have that $f(x_{T})-f(x^{*})\leq\varepsilon$ . ∎
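The iteration analyzed above can be illustrated on a one-dimensional toy problem, where the box-constrained quadratic subproblem is solved exactly by truncation. The sketch assumes the oracle minimizes $\langle\frac{1}{k}\nabla f(x),\Delta\rangle+\frac{1}{2}\Delta^{\top}\frac{e^{2}}{k^{2}}\nabla^{2}f(x)\Delta$ over $\|\Delta\|_{\infty}\leq 1$ (the factor $\frac{1}{2}$ is an assumption about the convention), and it uses $f(x)=ae^{x}+be^{-x}$ as a stand-in for a second-order robust function. The function name is hypothetical.

```python
import math

def box_newton_1d(a, b, k=1.0, iters=300):
    """Minimize f(x) = a*e^x + b*e^(-x) with damped, box-constrained steps."""
    f = lambda x: a * math.exp(x) + b * math.exp(-x)
    x, values = 0.0, []
    for _ in range(iters):
        values.append(f(x))
        g = a * math.exp(x) - b * math.exp(-x)  # f'(x)
        h = f(x)                                # f''(x) equals f(x) for this f
        # Exact oracle in 1D: minimize (g/k)*D + 0.5*(e^2/k^2)*h*D^2 over |D| <= 1.
        delta = max(-1.0, min(1.0, -(g / k) / ((math.e ** 2 / k ** 2) * h)))
        x += delta / k
    values.append(f(x))
    return x, values

# The unconstrained minimizer is 0.5 * log(b / a) = 5 for these values.
x_final, values = box_newton_1d(1.0, math.exp(10.0))
```

Since the damped quadratic model upper bounds $f$ inside the ball of radius $\frac{1}{k}$ , the objective never increases, and once the iterate is close to the optimum the error contracts by a constant factor per step.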

A.2 Proof of Lemma 4.7

Proof of Lemma 4.7.

Suppose that, without loss of generality, the $i^{th}$ row of $\mathbf{M}$ has a very large violation of the scaling constraint: letting $\gamma:=(r_{\mathbf{M}})_{i}-r_{i}$ we have $\left|\gamma\right|\geq\varepsilon/\sqrt{2n}$ .

In order to improve the solution, we can make an update to the corresponding coordinate of $x$ which makes the largest possible improvement in function value. More precisely by setting $x_{i}^{\prime}=x_{i}+\delta$ , and $x_{j}^{\prime}=x_{j}$ whenever $j\neq i$ , we have that

[TABLE]

Optimizing for the largest possible decrease, we set $\delta=\log(r_{i}/(r_{\mathbf{M}})_{i})$ which shows that we can decrease $f$ by

[TABLE]

Since we have $\gamma/r_{i}\geq-1$ , we can lower bound the improvement by

[TABLE]

whenever $\gamma/r_{i}\leq 1.62$ , and by

[TABLE]

whenever $\gamma/r_{i}>1.62$ .

Since by assumption $\|r\|_{\infty}\leq 1$ , this change improves function value by at least $\min\{\varepsilon^{2}/(2n),\varepsilon/(3\sqrt{2n})\}$ , which contradicts the fact that $f(x)-f^{*}\leq\varepsilon^{2}/3n$ . Therefore all rows and columns are within $\varepsilon/\sqrt{2n}$ of being correctly scaled. Hence this is an $\varepsilon$ - $(r,c)$ scaling.

∎
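The single-coordinate update used in this proof can be verified numerically. The sketch below assumes the sign convention $f(x,y)=\sum_{ij}\mathbf{A}_{ij}e^{x_{i}-y_{j}}-\langle r,x\rangle+\langle c,y\rangle$ , which matches the choice $d=(r,-c)^{\top}$ in Section 6 but may differ from the exact form used in Section 4; the function names are hypothetical.

```python
import numpy as np

def scaling_objective(A, x, y, r, c):
    """f(x, y) = sum_ij A_ij exp(x_i - y_j) - <r, x> + <c, y> (assumed convention)."""
    return np.sum(A * np.exp(x[:, None] - y[None, :])) - r @ x + c @ y

def fix_row(A, x, y, r, i):
    """Shift x_i by log(r_i / (r_M)_i) so row i of the scaled matrix sums to r_i."""
    row_sum = np.sum(A[i] * np.exp(x[i] - y))
    x_new = x.copy()
    x_new[i] += np.log(r[i] / row_sum)
    return x_new, row_sum

A = np.array([[1.0, 2.0], [3.0, 4.0]])
x, y = np.zeros(2), np.zeros(2)
r, c = np.ones(2), np.ones(2)
x_new, row_sum = fix_row(A, x, y, r, 0)
gamma = row_sum - r[0]  # violation of the row constraint before the update
decrease = scaling_objective(A, x, y, r, c) - scaling_objective(A, x_new, y, r, c)
```

Under this convention the decrease equals $\gamma-r_{i}\log(1+\gamma/r_{i})$ , which is nonnegative for every $\gamma/r_{i}>-1$ .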

A.3 Proof of Lemma 4.10

Proof.

The proof is similar to the one for Lemma 4.23. The first point holds by the same argument.

For the third part, all $(x,y)$ for which $\widetilde{f}(x,y)\leq\widetilde{f}(0,0)$ must satisfy:

[TABLE]

and similarly for $y$ . Therefore

[TABLE]

and similarly for $y_{i}$ .

The second part follows by the nonnegativity of the regularizer and the observation that $\widetilde{f}(z^{*}_{\varepsilon})\leq f(z^{*}_{\varepsilon})+\varepsilon^{2}/(36n^{2}e^{B})\cdot n\cdot 4e^{B}\leq f(z^{*}_{\varepsilon})+\varepsilon^{2}/(9n)$ . By the third property we know that the level set is bounded and thus $\widetilde{f}$ attains its minimum, and that minimum can only be better than $x^{*}$ , which concludes the proof. ∎

A.4 Proof of Theorems 4.5 and 4.6

Proof of Theorem 4.6.

By Lemma 4.7 and Lemma 4.10 we get that in order to obtain a $2\varepsilon$ -approximate scaling, it is sufficient to minimize $\widetilde{f}$ up to $\varepsilon^{2}/(2n)$ additive error. Furthermore, from Lemma 4.10 we get that the $R_{\infty}$ bound required for Theorem 3.2 is $R_{\infty}=O(B\log(ns_{\mathbf{A}}\varepsilon^{-1}))$ . Finally, since $f(0)=s_{\mathbf{A}}$ and $f(z^{*})\geq O(n+B)$ , we see that, initializing at $(x_{0},y_{0})=(0,0)$ , the total running time of the method is upper bounded by

[TABLE]

∎

Proof of Theorem 4.5.

We can directly prove this theorem by applying Theorem 4.6 to the optimal solution promised. Since $z^{*}$ exactly $(r,c)$ -scales $\mathbf{A}$ , we know that it must be a minimizer of $f$ and thus $f(z^{*})=f^{*}$ . Moreover, by definition we have the bound $\|z^{*}\|=B\leq\log(\kappa(\mathbf{U}^{*}_{\varepsilon})+\kappa(\mathbf{V}^{*}_{\varepsilon}))$ , which concludes the proof. ∎

A.5 Proof of Lemma 4.13

Proof.

By Lemma 4.4, we know that any almost scalable matrix can be written as a block lower triangular matrix, whose diagonal blocks are exactly scalable. By Lemma 4.12, every such block can be scaled to doubly stochastic, using factors with a ratio at most $O(n_{i}\log(1/\ell_{\mathbf{A}}))$ , where $n_{i}$ is the number of vertices in block $i$ .

The infimum of the function value is exactly the sum of the function values for the diagonal block problems, since the contribution of the entries below the diagonal can be made arbitrarily close to [math]. We observe that it suffices to ensure that the contribution of each such edge is at most $\varepsilon^{2}/3n^{3}$ , since then the total contribution will be at most $\varepsilon^{2}/3n$ which is the additive error we can tolerate. Scaling the off-diagonal entries can be done in a very simple way. For any block, we can scale all the rows down by a fixed amount and all the columns up by the same amount. This will not affect the contribution of the block’s entries to the function and will only decrease the contribution of all the off-diagonal blocks in the same columns. By choosing the ratio between any two consecutive blocks to be $\log(n^{3}s_{\mathbf{A}}/\varepsilon^{2})$ , we can ensure that the entries contained in the intersection of the rows and columns of these blocks contribute less than $\varepsilon^{2}/3n^{3}$ each. The ratio between any two factors of this new scaling is at most

[TABLE]

∎

A.6 Proof of Lemma 4.20

Proof of Lemma 4.20.

First we observe that since the Hessian of $f$ is SDD, it is spectrally upper bounded by two times its diagonal and therefore by the identity matrix multiplied by twice the trace, that is $\nabla^{2}f(x)\preccurlyeq 2\cdot\mathrm{tr}(\nabla^{2}f(x))\cdot\mathbf{I}$ . Since, by construction, $\mathrm{tr}(\nabla^{2}f(x))=\sum_{i}(r_{\mathbf{M}}+c_{\mathbf{M}})_{i}=2f(x)$ , we have that

[TABLE]

Therefore for any $y$ with $f(y)\leq f(x)$ , we have that for some $t\in[0,1]:$

[TABLE]

It is straightforward to reason that

[TABLE]

and thus,

[TABLE]

Finally, we lower bound $f(x)$ . Since the matrix can be balanced, its corresponding graph is strongly connected. Therefore it contains a cycle, and thus some edge $(i,j)$ satisfies $e^{x_{i}-x_{j}}\geq 1$ . Hence $f(x)\geq\mathbf{A}_{ij}\geq\ell_{\mathbf{A}}$ . Plugging in this lower bound, we get that

[TABLE]

Hence

[TABLE]

which is equivalent to the fact that $\mathbb{D}(\exp(x))$ yields an $\varepsilon$ -balancing for $\mathbf{A}$ . We note that a similar bound also follows from [40], using a different argument. ∎
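The two structural facts used at the start of this proof, namely that the trace of the Hessian equals $2f(x)$ and that the Hessian is bounded by twice its trace times the identity, can be checked numerically. The sketch below assumes the balancing objective $f(x)=\sum_{ij}\mathbf{A}_{ij}e^{x_{i}-x_{j}}$ for a matrix with zero diagonal; the function names and the example matrix are illustrative.

```python
import numpy as np

def balancing_objective(A, x):
    """f(x) = sum_ij A_ij exp(x_i - x_j)."""
    return np.sum(A * np.exp(x[:, None] - x[None, :]))

def balancing_gradient_hessian(A, x):
    """Gradient r_M - c_M and Hessian diag(r_M + c_M) - M - M^T."""
    M = A * np.exp(x[:, None] - x[None, :])   # scaled matrix
    r_M, c_M = M.sum(axis=1), M.sum(axis=0)   # row and column sums
    grad = r_M - c_M
    hess = np.diag(r_M + c_M) - M - M.T
    return grad, hess

A = np.array([[0.0, 2.0, 0.0],
              [0.0, 0.0, 1.0],
              [4.0, 0.5, 0.0]])
x = np.array([0.3, -0.2, 0.1])
grad, hess = balancing_gradient_hessian(A, x)
f_val = balancing_objective(A, x)
```

The gradient vanishes exactly when the row sums and column sums of the scaled matrix agree, which is the balancing condition.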

A.7 Proof of Theorems 4.18 and 4.19

Proof of Theorem 4.19.

By Lemma 4.20 and Lemma 4.23 we get that optimizing $\widetilde{f}$ up to an additive error of $\varepsilon^{2}\mathbf{A}_{\min}/24$ , suffices to get an $\varepsilon$ -balancing of the matrix.

Furthermore, from Lemma 4.23 we see that the $R_{\infty}$ bound required for Theorem 3.2 is $R_{\infty}=O(B\log(nw_{\mathbf{A}}\varepsilon^{-1}))$ . Finally, using the fact that $f(0)=s_{\mathbf{A}}$ , we see that, initializing at $x_{0}=0$ , the total running time of the method is

[TABLE]

∎

Proof of Theorem 4.18.

Having proved Theorem 4.19, this theorem is a simple corollary. Let $x^{*}$ be the vector such that $\mathbf{D}^{*}=\mathbb{D}(\exp(x^{*}))$. Then $\nabla f(x^{*})=0$ and therefore, by the convexity of $f$, $x^{*}$ is a minimizer of $f$, so $f(x^{*})=f^{*}$. Moreover, $B=\max_{i}\left|\log\mathbf{D}^{*}_{ii}\right|=O(\log\kappa(\mathbf{D}^{*}))$, which concludes the proof by applying Theorem 4.19. ∎
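The two facts used in this corollary, $\nabla f(x^{*})=0$ at a balancing and $B=O(\log\kappa(\mathbf{D}^{*}))$, can be illustrated on a small instance. The sketch below does not use the box-constrained Newton method of this paper. It swaps in a plain Osborne-style coordinate iteration to find $x^{*}$, and it assumes the objective $f(x)=\sum_{i,j}\mathbf{A}_{ij}e^{x_{i}-x_{j}}$, whose gradient is the vector of row sums minus column sums of $\mathbf{M}=\mathbf{D}\mathbf{A}\mathbf{D}^{-1}$. The matrix and function names are hypothetical.

```python
import math

def scaled(A, x):
    """Return M = D A D^{-1} with D = diag(exp(x))."""
    n = len(A)
    return [[A[i][j] * math.exp(x[i] - x[j]) for j in range(n)] for i in range(n)]

def gradient(A, x):
    """Gradient of the assumed objective: row sums minus column sums of M."""
    n = len(A)
    M = scaled(A, x)
    return [sum(M[i]) - sum(M[j][i] for j in range(n)) for i in range(n)]

def osborne_balance(A, sweeps=200):
    """Osborne-style iteration (a stand-in, not the paper's algorithm)."""
    n = len(A)
    x = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            M = scaled(A, x)
            r = sum(M[i][j] for j in range(n) if j != i)
            c = sum(M[j][i] for j in range(n) if j != i)
            # Shifting x_i by delta scales row i by e^delta and column i
            # by e^-delta; this choice equalizes the two sums.
            x[i] += 0.5 * math.log(c / r)
    return x

# Hypothetical input with strictly positive off-diagonal entries.
A = [[0.0, 1.0, 4.0],
     [2.0, 0.0, 1.0],
     [1.0, 8.0, 0.0]]

x_star = osborne_balance(A)
# f is invariant under adding a constant to x, so normalize min(x) to 0.
shift = min(x_star)
x_star = [xi - shift for xi in x_star]

grad_norm = max(abs(g) for g in gradient(A, x_star))
d_star = [math.exp(xi) for xi in x_star]
B = max(abs(math.log(d)) for d in d_star)
log_kappa = math.log(max(d_star) / min(d_star))
```

With the normalization $\min_{i}x^{*}_{i}=0$ used here, $B$ coincides with $\log\kappa(\mathbf{D}^{*})$, which is one way to see the $O(\log\kappa(\mathbf{D}^{*}))$ bound.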

A.8 Proof of Lemma 4.24

Proof.

Consider the optimal solution $x$ to the optimization problem described in Equation 4.3, for which we know that $\mathbf{D}^{*}=\mathbb{D}(\exp(x))$ via Lemma 4.20. Since this is a minimizer, we know that

[TABLE]

Therefore, for any $(i,j)\in\mathrm{supp}(\mathbf{A})$ , one has that

[TABLE]

Since there is a directed path of length at most $k$ from any vertex to any other, we get that

[TABLE]

∎

A.9 Proof of Lemma 6.12

Proof.

The first part is a standard property of Newton’s method applied to self-concordant functions. We refer the reader to [2] for details.

What we want to prove is that Newton’s method is robust to errors in the solution to the linear system involving the Hessian. Indeed, first we see that the Hessian at $v^{\prime\prime}$ approximates the one at $v^{\prime}$. To simplify notation, we write $\mathbf{H}_{v}=\nabla^{2}f(v)$. Since $f_{\mu}$ is self-concordant, we have that

[TABLE]

and similarly

[TABLE]

the latter of which can be written equivalently as

[TABLE]

so

[TABLE]

The error guarantee on $v^{\prime\prime}$ equivalently gives us that

[TABLE]

so combining with A.8 we obtain that

[TABLE]

Also, since for any $z$

[TABLE]

we get, by applying triangle inequality and A.7, that

[TABLE]

Therefore, using A.6 and A.11 where we substitute $v^{\prime}$ for $z$ :

[TABLE]

∎
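The robustness claimed in this lemma can be seen on a toy example. The sketch below is not the function $f_{\mu}$ of the paper. It uses the separable self-concordant function $g(v)=c^{\top}v-\sum_{i}\log v_{i}$, whose Hessian is $\mathbb{D}(1/v_{i}^{2})$, and it assumes the linear-system error is measured in the Hessian norm relative to the exact Newton step. Each coordinate of the exact step is perturbed by a factor $1\pm 0.1$, so the relative error in the Hessian norm is exactly $0.1$. All names and numbers are hypothetical.

```python
import math

def newton_decrement(c, v):
    """sqrt(grad^T H^{-1} grad) for g(v) = c^T v - sum(log v_i)."""
    # grad_i = c_i - 1/v_i and H = diag(1/v_i^2), so H^{-1/2} grad = grad_i * v_i.
    return math.sqrt(sum(((ci - 1.0 / vi) * vi) ** 2 for ci, vi in zip(c, v)))

def inexact_newton_step(c, v, rel_err):
    """Take a Newton step whose coordinates are off by a factor (1 +/- rel_err)."""
    new_v = []
    for k, (ci, vi) in enumerate(zip(c, v)):
        exact = -(ci - 1.0 / vi) * vi * vi      # exact Newton step, coordinate k
        sign = 1.0 if k % 2 == 0 else -1.0      # deterministic perturbation pattern
        new_v.append(vi + (1.0 + sign * rel_err) * exact)
    return new_v

# Hypothetical instance; the minimizer is v_i = 1 / c_i.
c = [1.0, 2.0, 4.0]
v = [0.9, 0.55, 0.22]

decrements = [newton_decrement(c, v)]
for _ in range(8):
    v = inexact_newton_step(c, v, rel_err=0.1)
    decrements.append(newton_decrement(c, v))

final_error = max(abs(vi - 1.0 / ci) for ci, vi in zip(c, v))
```

In this example the Newton decrement still shrinks at every step, but only at a linear rate governed by the solver error rather than quadratically.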

A.10 Proof of Lemma 6.10

Proof.

First we note that the nonzero submatrix of $\nabla^{2}\phi_{ij}$ is (where rows/columns correspond to $x_{i}$ , $x_{j}$ , $t_{ij}$ , in this order):

[TABLE]

such that

[TABLE]

Furthermore, this submatrix can be factored, by Schur complementing the last row and column, as:

[TABLE]

and thus one can easily notice that the Schur complement is SDD.

Furthermore, since $\psi$ is a standard logarithmic barrier, its Hessian is a diagonal matrix with nonnegative entries. Therefore, we can split the contribution of the diagonal matrix $\nabla^{2}\psi(x,t)$ into pieces $\mathbf{D}_{ij}$, each of which contains nonzeros only at $x_{i}$, $x_{j}$ and $t_{ij}$. In other words, we can write ${\mathbf{H}}=\sum_{(i,j)\in\mathrm{supp}(\mathbf{A})}(\nabla^{2}\phi_{ij}(x,t)+\mathbf{D}_{ij})$. Since the Schur complement of $t_{ij}$ of the matrix $\nabla^{2}\phi_{ij}(x,t)$ is SDD, we also have that the Schur complement of $t_{ij}$ of the matrix $\nabla^{2}\phi_{ij}(x,t)+\mathbf{D}_{ij}$ is SDD, so the matrix can also be factored similarly to the factoring above, and all of these factorizations can be computed in overall $O(m+n)=O(m)$ time.

Finally, since each of these factorizations is computed by Schur complementing a unique $t_{ij}$, which is nonzero in a single matrix, we see that the Schur complement of the block ${\mathbf{H}}(t)$ of the matrix ${\mathbf{H}}$ is equal to the sum of Schur complements of the block $t_{ij}$ of the matrices $\nabla^{2}\phi_{ij}(x,t)+\mathbf{D}_{ij}$. This holds similarly for the corresponding lower and upper diagonal matrices, which yields the desired factorization simply by summing up. ∎
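The additivity step at the end of this proof can be checked on a small example: when each variable $t_{ij}$ appears in exactly one summand, eliminating all $t$ variables from the sum gives the same matrix as summing the per-edge Schur complements. The sketch below does not use the actual entries of $\nabla^{2}\phi_{ij}$. The per-edge $3\times 3$ blocks, the nonnegative diagonal pieces, and the function names are hypothetical, chosen only so that each local Schur complement is SDD.

```python
def schur_eliminate(H, pivots):
    """Schur complement of H after eliminating the indices in `pivots` in order."""
    H = [row[:] for row in H]
    remaining = list(range(len(H)))
    for p in pivots:
        remaining.remove(p)
        for r in remaining:
            factor = H[r][p] / H[p][p]
            for c in remaining:
                H[r][c] -= factor * H[p][c]
    return [[H[r][c] for c in remaining] for r in remaining]

def local_block(a, p, d, extra):
    """Hypothetical block on (x_i, x_j, t_ij) plus a nonnegative diagonal piece."""
    block = [[a, -a, p], [-a, a, -p], [p, -p, d]]
    for k in range(3):
        block[k][k] += extra[k]
    return block

def is_sdd(S, tol=1e-12):
    m = len(S)
    return all(S[r][r] + tol >= sum(abs(S[r][c]) for c in range(m) if c != r)
               for r in range(m))

n = 3  # number of x variables
# Each edge: (i, j, a, p, d, diagonal piece for x_i, x_j, t_ij).
edges = [(0, 1, 2.0, 1.0, 1.0, (0.5, 0.25, 1.0)),
         (1, 2, 3.0, -1.5, 2.0, (0.1, 0.2, 0.0)),
         (2, 0, 1.0, 0.5, 4.0, (0.0, 0.3, 0.5))]
m = len(edges)

H_full = [[0.0] * (n + m) for _ in range(n + m)]
S_sum = [[0.0] * n for _ in range(n)]
all_local_sdd = True
for e, (i, j, a, p, d, extra) in enumerate(edges):
    block = local_block(a, p, d, extra)
    idx = [i, j, n + e]                     # t_ij gets its own coordinate n + e
    for r in range(3):
        for c in range(3):
            H_full[idx[r]][idx[c]] += block[r][c]
    S_local = schur_eliminate(block, [2])   # eliminate t_ij locally
    all_local_sdd = all_local_sdd and is_sdd(S_local)
    for r in range(2):
        for c in range(2):
            S_sum[idx[r]][idx[c]] += S_local[r][c]

# Eliminate every t variable from the assembled matrix.
S_full = schur_eliminate(H_full, list(range(n, n + m)))
max_diff = max(abs(S_full[r][c] - S_sum[r][c]) for r in range(n) for c in range(n))
```

Since a sum of SDD matrices is SDD, the assembled Schur complement inherits the property from the local ones.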

Bibliography


[1] Zeyuan Allen-Zhu, Yuanzhi Li, Rafael Oliveira, and Avi Wigderson. Much faster algorithms for matrix scaling. arXiv preprint arXiv:1704.02315, 2017.
[2] Aharon Ben-Tal and Arkadi Nemirovski. Lectures on modern convex optimization: analysis, algorithms, and engineering applications. SIAM, 2001.
[3] Susan Blackford. Balancing and conditioning. http://www.netlib.org/lapack/lug/node94.html.
[4] David T Brown. A note on approximations to discrete probability distributions. Information and Control, 2(4):386–392, 1959.
[5] Tzu-Yi Chen and James W Demmel. Balancing sparse matrices for computing eigenvalues. Linear Algebra and its Applications, 309(1-3):261–287, 2000.
[6] Hui Han Chin, Aleksander Madry, Gary L. Miller, and Richard Peng. Runtime guarantees for regression problems. In Innovations in Theoretical Computer Science, ITCS ’13, Berkeley, CA, USA, January 9-12, 2013, pages 269–282, 2013.
[7] Paul Christiano, Jonathan A. Kelner, Aleksander Madry, Daniel A. Spielman, and Shang-Hua Teng. Electrical flows, Laplacian systems, and faster approximation of maximum flow in undirected graphs. In STOC’11: Proceedings of the 43rd ACM Symposium on Theory of Computing, pages 273–282, 2011.
[8] Michael B. Cohen, Rasmus Kyng, Gary L. Miller, Jakub W. Pachocki, Richard Peng, Anup B. Rao, and Shen Chen Xu. Solving SDD linear systems in nearly $m\log^{1/2}n$ time. In STOC’14: Proceedings of the 46th Annual ACM Symposium on Theory of Computing, pages 343–352, 2014.

