Spectral analysis of matrix scaling and operator scaling
Tsz Chiu Kwok, Lap Chi Lau, Akshay Ramachandran

TL;DR
This paper provides a spectral analysis of matrix and operator scaling, showing linear convergence of gradient flows under spectral gap conditions, with implications for various applications in mathematics and quantum information.
Contribution
It introduces a spectral gap condition that guarantees linear convergence of gradient methods for matrix and operator scaling, and derives bounds relevant for multiple applications.
Findings
Gradient flow converges linearly with spectral gap
Bounds on condition number and capacity derived
Applications include expander graphs and quantum information
Spectral Analysis of Matrix Scaling and Operator Scaling
Tsz Chiu Kwok (Institute for Theoretical Computer Science, Shanghai University of Finance and Economics; part of the work was done at University of Waterloo as a postdoctoral researcher; partially supported by NSERC Discovery Grant 2950-120715 and NSERC Accelerator Supplement 2950-120719), Lap Chi Lau (School of Computer Science, University of Waterloo; supported by NSERC Discovery Grant 2950-120715 and NSERC Accelerator Supplement 2950-120719), Akshay Ramachandran (School of Computer Science, University of Waterloo; supported by NSERC Discovery Grant 2950-120715 and NSERC Accelerator Supplement 2950-120719)
We present a spectral analysis for matrix scaling and operator scaling. We prove that if the input matrix or operator has a spectral gap, then a natural gradient flow has linear convergence. This implies that a simple gradient descent algorithm also has linear convergence under the same assumption. The spectral gap condition for operator scaling is closely related to the notion of quantum expander studied in quantum information theory.
The spectral analysis also provides bounds on some important quantities of the scaling problems, such as the condition number of the scaling solution and the capacity of the matrix and operator. These bounds can be used in various applications of scaling problems, including matrix scaling on expander graphs, permanent lower bounds on random matrices, the Paulsen problem on random frames, and Brascamp-Lieb constants on random operators. In some applications, the inputs of interest satisfy the spectral condition and we prove significantly stronger bounds than the worst case bounds.
1 Introduction
In the matrix scaling problem, we are given a non-negative matrix , and the goal is to find a left diagonal scaling matrix and a right diagonal scaling matrix such that is doubly stochastic (every row sum and every column sum is one), or report that such scaling matrices do not exist. This problem has been extensively studied in different communities; see [39] for a detailed survey.
The operator scaling problem is a significant generalization of the matrix scaling problem. Given a tuple of real matrices where for , a linear operator is defined as
[TABLE]
where denotes the conjugate transpose of which is just the transpose when is real. We will simply refer to as an operator. The size of an operator is defined as where denotes the Frobenius norm of a matrix. An operator is called -nearly doubly balanced if
[TABLE]
and is called doubly balanced when . The operator scaling problem is defined by Gurvits [29]. The objective is to scale the input operator so that it becomes doubly balanced with size one.
Definition 1.1 (Operator Scaling Problem).
- Input: An operator where for .
- Output: A left scaling matrix and a right scaling matrix such that
[TABLE]
or report that such scaling matrices do not exist.
There is a simple reduction from the matrix scaling problem to the operator scaling problem, by having one matrix for each entry with the -entry of being and all other entries zero; see Section 4.1 for details.
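The reduction described above can be sketched in a few lines. This is a minimal illustration, not the construction of Section 4.1 verbatim: it assumes the single non-zero entry of each matrix is the square root of the corresponding entry of the input, and the names `matrix_to_operator`, `B`, and `ops` are hypothetical. Under that assumption the two sums of products recover the row sums and column sums of the input on their diagonals.

```python
import numpy as np

def matrix_to_operator(B):
    """Return one m x n matrix per entry of the non-negative matrix B.

    Assumption (not confirmed by the text above): the matrix for entry
    (i, j) holds sqrt(B[i, j]) at position (i, j) and zeros elsewhere.
    """
    B = np.asarray(B, dtype=float)
    m, n = B.shape
    ops = []
    for i in range(m):
        for j in range(n):
            A = np.zeros((m, n))
            A[i, j] = np.sqrt(B[i, j])
            ops.append(A)
    return ops

B = np.array([[1.0, 2.0], [3.0, 4.0]])
ops = matrix_to_operator(B)
left = sum(A @ A.T for A in ops)   # diagonal matrix holding the row sums of B
right = sum(A.T @ A for A in ops)  # diagonal matrix holding the column sums of B
```

With this encoding, balancing the two sums for the operator amounts to equalizing the row sums and the column sums of the matrix.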
The operator scaling problem generalizes matrix scaling and frame scaling and has many applications; see Section 1.4 and Section 4. Much work has been done in analyzing algorithms for these scaling problems and in understanding the scaling solutions and related quantities.
1.1 Previous Algorithms
For matrix scaling, the most well-known algorithm is Sinkhorn’s algorithm [54], a simple iterative algorithm that alternately rescales the rows and the columns. This algorithm was analyzed in [18], which shows that the alternating algorithm finds an -nearly doubly stochastic scaling in time polynomial in and .
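The alternating row and column rescaling can be sketched as follows. This is a minimal illustration assuming a square matrix with strictly positive entries; the function name `sinkhorn`, the tolerance `eps`, and the stopping rule (largest deviation of a row sum from one) are choices made here for illustration and are not taken from [54] or [18].

```python
import numpy as np

def sinkhorn(B, eps=1e-9, max_iters=10_000):
    """Alternately rescale rows and columns of a positive square matrix.

    Returns (x, y, S) with S = diag(x) @ B @ diag(y).
    """
    B = np.asarray(B, dtype=float)
    x = np.ones(B.shape[0])
    y = np.ones(B.shape[1])
    S = B.copy()
    for _ in range(max_iters):
        x = 1.0 / (B @ y)        # row step: every row sum of S becomes 1
        y = 1.0 / (B.T @ x)      # column step: every column sum of S becomes 1
        S = x[:, None] * B * y[None, :]
        # Column sums are exact after the column step, so only rows are checked.
        if np.abs(S.sum(axis=1) - 1.0).max() <= eps:
            break
    return x, y, S

x, y, S = sinkhorn(np.array([[1.0, 2.0], [3.0, 4.0]]))
```

The vectors `x` and `y` hold the diagonals of the left and right scaling matrices, so their entry ratios can be inspected directly after the run.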
The alternating scaling algorithm is generalized in [29] for the operator scaling problem. In this algorithm, we alternately find a left scaling matrix and set so that the first condition of doubly balanced is satisfied, and a right scaling matrix and set so that the second condition of doubly balanced is satisfied, and repeat. This alternating algorithm is partially analyzed in [29] and is fully analyzed in [20, 19].
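A sketch of the alternating procedure for operators is given below. It is an illustration under stated assumptions rather than the algorithm of [29] verbatim: the matrices are real and square, the sums of products stay positive definite, and each step normalizes the relevant sum to the identity (a different normalization from the size-one convention of Definition 1.1, differing only by a scalar). The names `inv_sqrt` and `alternating_operator_scaling` are hypothetical.

```python
import numpy as np

def inv_sqrt(P):
    """Inverse square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(P)
    return (V / np.sqrt(w)) @ V.T

def alternating_operator_scaling(ops, iters=500):
    """Alternate left and right normalization of a tuple of square matrices."""
    ops = [np.asarray(A, dtype=float) for A in ops]
    for _ in range(iters):
        L = inv_sqrt(sum(A @ A.T for A in ops))
        ops = [L @ A for A in ops]      # now sum_i A_i A_i^T equals the identity
        R = inv_sqrt(sum(A.T @ A for A in ops))
        ops = [A @ R for A in ops]      # now sum_i A_i^T A_i equals the identity
    return ops

rng = np.random.default_rng(0)
ops = [rng.standard_normal((4, 4)) for _ in range(4)]
scaled = alternating_operator_scaling(ops)
```

Each step fixes one of the two conditions exactly and may disturb the other, which is why the procedure has to be repeated.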
Theorem 1.2 ([54, 18, 20, 19]).
The alternating scaling algorithm returns an -nearly doubly balanced scaling in iterations if such a scaling exists.
This theorem is used in [20, 19] to give the first polynomial time algorithm for computing the non-commutative rank of a symbolic matrix, as it is sufficient to set to be inverse polynomial in to solve that problem exactly. For some applications, however, faster convergence of is required.
For matrix scaling, there are several algorithms with dependency on being , including the ellipsoid method in [40], the interior point method in [51], and a strongly polynomial time combinatorial algorithm in [47]. The dependency on in these algorithms is at least even for sparse matrices. Recently, two independent groups [13, 2] developed a fast second order method for matrix scaling, and this method is extended to geodesic convex optimization in [1] for the operator scaling problem.
Theorem 1.3 ([13, 2, 1]).
There is a second order method to return an -nearly doubly balanced scaling in time for operator scaling, and in time for matrix scaling where denotes the number of nonzero entries in and denotes the condition number of the scaling solution.
For matrix scaling, this theorem can be used to obtain a fast deterministic approximation algorithm for the permanent of a matrix [47]. For operator scaling, this theorem is used to obtain a polynomial time algorithm for an orbit intersection problem in invariant theory [1].
1.2 Gradient Flow
An important quantity in [29, 20, 1] to measure the progress of the algorithms is the -error of the current solution. Given an operator where for , define
[TABLE]
Note that if and only if is doubly balanced. In the matrix scaling problem for general matrix where the objective is to scale the input matrix such that every row sum is the same and every column sum is the same, this definition simplifies to
[TABLE]
where and are the -th row sum and the -th column sum of the matrix , and is the size of the matrix .
A continuous version of the alternating algorithm for operator scaling is studied in [45], where both operations are done simultaneously and continuously. The following differential equation describes how changes over time:
[TABLE]
In the matrix case, this continuous scaling algorithm simplifies to
[TABLE]
The continuous operator scaling algorithm is developed to bound the “total movement” of the operator in order to solve the Paulsen problem in [45]. Its convergence rate is shown to be similar to that of the alternating scaling algorithm, with dependency on being .
The continuous operator scaling algorithm can be understood as a natural first order method for the operator scaling problem. As we will show in Lemma A.1 in Appendix A, the dynamical system in continuous operator scaling is equivalent to the gradient flow (or continuous gradient descent) that always moves in the direction of minimizing at each time. This shows a close connection between gradient descent and the alternating algorithm.
This gradient flow was studied in much greater generality in symplectic geometry and algebraic geometry (see [41, 27]). After a long line of work [3, 25, 26, 43, 42], Kirwan proved that the image of the moment map of a Hamiltonian group action on a symplectic manifold is a convex polytope. To prove this, Kirwan uses the norm-square of the moment map (which in our setting is exactly ), and studies critical points of this function in order to understand the image of the moment map (where a point is critical for exactly when it is a fixed point of the gradient flow). The current result as well as the result in [45] can be seen as quantitative convergence analyses in the neighborhoods of fixed points of this natural gradient flow in the operator scaling setting. It is an interesting direction to extend our result to the above general setting.
1.3 Contributions
In this paper, we analyze this gradient flow for the operator scaling problem. We identify a natural spectral condition under which the gradient flow converges in time (corresponding to the number of iterations in the alternating algorithm) where is the output accuracy. The spectral condition is closely related to the notion of “quantum expander” and is satisfied in many random instances. A key feature of our approach is that it also provides bounds on some important mathematical quantities such as the condition number of the scaling solution and the capacity of the matrix and operator. These bounds can be used in various applications of the operator scaling problem to show significantly stronger results for inputs that satisfy the spectral condition such as random matrices and random frames. We remark that the new results in various applications cannot be obtained through previous work (e.g. the fast algorithm for operator scaling in [1]), as the analyses of previous algorithms do not provide mathematical bounds for the condition number of the scaling solution and the operator capacity.
Spectral Condition
We first state the spectral condition in the general operator setting.
Definition 1.4 (Spectral Gap Condition).
Given an operator where for , define the matrix
[TABLE]
where denotes the tensor product. The operator is said to have a -spectral gap if
[TABLE]
where is the second largest singular value of .
Note that the spectral condition can be checked in polynomial time through standard eigenvalue computation.
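The check mentioned above can be sketched numerically. This is a minimal illustration assuming real matrices stored as NumPy arrays, and assuming the matrix of Definition 1.4 is the sum of Kronecker products of each matrix with itself (consistent with the form referenced in Fact 2.5). The function name is hypothetical, and the comparison of the second singular value against the threshold of Definition 1.4 is left out because it depends on the normalization stated there.

```python
import numpy as np

def top_two_singular_values(ops):
    """Return the two largest singular values of sum_i kron(A_i, A_i).

    Assumption: the matrices are real, so no complex conjugation is needed.
    """
    M = sum(np.kron(A, A) for A in ops)
    sv = np.linalg.svd(M, compute_uv=False)  # sorted in descending order
    return sv[0], sv[1]

# Example: the four 2 x 2 matrices with a single entry equal to one,
# i.e. the operator obtained from the all-ones 2 x 2 matrix.
units = []
for i in range(2):
    for j in range(2):
        A = np.zeros((2, 2))
        A[i, j] = 1.0
        units.append(A)

s1, s2 = top_two_singular_values(units)
```

In this example the summed matrix has rank one, so the second singular value vanishes and the gap is as large as possible.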
The matrix associated with is studied in the quantum information theory literature (e.g. [61]), as the natural matrix representation of the completely positive map defined by . It can be shown that the largest singular value of satisfies
[TABLE]
when is -nearly doubly balanced (Lemma 3.6). The spectral gap condition is also studied under the name of “quantum expander” in [7, 35]. We will discuss more about this spectral gap condition in Section 2.1 after some background on quantum information theory is reviewed.
For matrix scaling, given the input matrix , the spectral gap condition is simply
[TABLE]
If we interpret the input matrix as a weighted undirected bipartite graph, then the spectral gap condition is closely related to the expansion/conductance of the graph. We will explain more about these in Section 1.4.1 and in Section 4.1.
Linear Convergence
We prove that the gradient flow has linear convergence when the input satisfies the spectral gap condition.
Theorem 1.5 (Linear Convergence).
Given an operator where each with , if is -nearly doubly balanced and satisfies the -spectral gap condition in Definition 1.4 with for a sufficiently large constant , then in the gradient flow,
[TABLE]
In particular, the gradient flow converges to a -nearly doubly balanced scaling in time , and such a scaling always exists under our assumptions.
By discretizing the gradient flow with step size , it follows that a natural gradient descent algorithm returns an -nearly doubly stochastic scaling in polynomial time in the input size and logarithmic in , when the input satisfies the spectral gap condition.
Corollary 1.6 (Gradient Descent).
Under the assumptions in Theorem 1.5, there is a gradient descent algorithm to return an -nearly doubly balanced scaling in iterations.
It is an interesting open question whether the alternating algorithm also has the same convergence rate as the gradient flow under the same assumptions. We believe that the answer is positive, but we have not been able to prove it.
Condition Number
The condition numbers of the scaling solutions are defined as where and denote the largest and smallest singular values of respectively. For matrix scaling, is simply the ratio between the largest entry and the smallest entry in the diagonal matrix .
In general, the condition numbers could be exponential in the input size. It is of interest to identify instances with small condition numbers as these are closely related to the performance of matrix/operator scaling algorithms (e.g. Theorem 1.3), but not much is known even in the simpler matrix scaling setting. Kalantari and Khachiyan [40] proved a bound for strictly positive matrices in terms of the ratio of the sum of the entries and the minimum entry. We show that the condition numbers are bounded by a small constant when the input satisfies the spectral gap condition (not necessarily strictly positive).
Theorem 1.7 (Condition Number).
Under the assumptions in Theorem 1.5, the condition numbers of the scaling solutions and satisfy
[TABLE]
The condition number of the scaling solutions is used in bounding the time complexity of the scaling algorithms using the second order method [1, 13], in analyzing an approximation algorithm for permanent [53], and in bounding the optimal transport cost [14, 52]. We will discuss the implications of Theorem 1.7 to these applications in Section 4.
Operator Capacity
The capacity of an operator is defined by Gurvits [29] as
[TABLE]
The capacity of a matrix has a simpler form (Section 4.1.6) where
[TABLE]
Optimization problems of this form are also studied in functional analysis [5] and in approximation algorithms [50].
In general, when is -nearly doubly balanced [29, 20, 45], it is proved that
[TABLE]
Using a connection between the convergence rate of the gradient flow and the operator capacity developed in [45], we show a much stronger bound for operators that also satisfy the spectral gap condition.
Theorem 1.8 (Capacity).
Under the assumptions in Theorem 1.5,
[TABLE]
The capacity of an operator is used in bounding the permanent of a matrix [47], the Brascamp-Lieb constant of an operator [21], and the total movement to a nearby doubly balanced operator [45]. We will discuss the implications of Theorem 1.8 to these applications in Section 1.4.
1.4 Applications of Matrix Scaling and Operator Scaling
The matrix scaling and operator scaling problems have many applications, and we discuss some implications of our results in this section.
1.4.1 Matrix Scaling
In the matrix scaling problem, we are given a non-negative matrix , and the goal is to find a left diagonal scaling matrix and a right diagonal scaling matrix such that is doubly balanced (i.e. every row sum is the same and every column sum is the same; see Section 4.1 for definition), or report that such scaling matrices do not exist.
The matrix scaling problem is a special case of the operator scaling problem (Section 4.1.1) and so the spectral analysis also applies. In the case of matrix scaling, the spectral condition in Definition 1.4 is simply (Section 4.1.2). Using Cheeger’s inequality, we show that this spectral gap condition is closely related to the conductance of the weighted bipartite graph associated to (Section 4.1.3). These imply that many random matrices will satisfy the condition in Theorem 1.5 (Section 4.1.4).
Our results have implications for the matrix scaling problem, e.g. to obtain stronger results for random matrices. For bipartite matching, we show that the gradient flow converges quickly to a fractional perfect matching in an almost regular bipartite expander graph (Section 4.1.5).
Corollary 1.9.
Suppose is a bipartite graph with where each vertex satisfies for some . If the graph conductance satisfies for some sufficiently large constant , then the gradient flow converges to an -nearly perfect fractional matching in time .
For the permanent, Van der Waerden’s conjecture states that the permanent of a doubly stochastic matrix is at least , which was proven in [15, 16, 28]. The capacity lower bound in Theorem 1.8 can be used to prove a Van der Waerden-type lower bound on the permanent of matrices satisfying the spectral gap condition (not necessarily doubly stochastic).
Corollary 1.10.
If a non-negative matrix is -nearly doubly balanced with , and with for some sufficiently large constant , then
[TABLE]
For example, consider a random matrix with each entry an independent random variable where is sampled from the Gaussian distribution . The corollary implies that with high probability. This implies a sub-exponential approximation of the permanent for this class of matrices [6]. See Section 4.1.6 for details.
For optimal transportation distance, we can use the condition number result in Theorem 1.7 to bound the Sinkhorn distance [14, 52], which is receiving increasing attention in computer vision and machine learning (Section 4.1.7).
The condition number result in Theorem 1.7 can also be used to show that the second-order method for matrix scaling [13, 2] as stated in Theorem 1.3 runs in near-linear time on instances satisfying the spectral gap assumption.
1.4.2 Frame Scaling
In the frame scaling problem, we are given vectors , and the goal is to find a matrix (a linear transformation) such that if we set then . This problem was studied in communication complexity [17], machine learning [33], and in frame theory [45, 32].
The frame scaling problem is a special case of the operator scaling problem (Section 4.2.1) and so the spectral analysis also applies. In the case of frame scaling, the spectral condition in Definition 1.4 has a nice form (Section 4.2.2): Let be the squared Gram matrix where . Then the spectral condition is equivalent to where is the second largest eigenvalue of and is the size of the frame defined as . We will prove in Section 5 that this condition is satisfied for random frames with high probability.
Theorem 1.11.
If we generate random unit vectors with , then the resulting frame is -nearly doubly balanced for and satisfies the spectral gap condition with constant with probability at least .
For intuition, suppose each is a random unit vector, then the expected value of for is and so the expected matrix is where is the -by- all-one matrix. The matrix has the largest spectral gap, and we expect that a random frame will have its squared Gram matrix close to and thus a large spectral gap. The proof is by a low moment analysis of the trace method commonly used in random matrix theory (Section 5).
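The intuition above can be checked numerically. The sketch below assumes the random unit vectors are drawn uniformly from the sphere (here by normalizing Gaussian vectors) and uses illustrative values `n = 400` and `d = 20` that do not come from the text. It relies on the standard fact that the squared inner product of two independent uniform unit vectors in dimension `d` has expectation `1/d`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 400, 20                                  # illustrative sizes (assumption)

U = rng.standard_normal((n, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)   # rows are random unit vectors

G = (U @ U.T) ** 2                              # squared Gram matrix, entrywise square
off_diag_mean = (G.sum() - np.trace(G)) / (n * (n - 1))

# Eigenvalues in descending order; G is symmetric so eigvalsh applies.
eig = np.sort(np.linalg.eigvalsh(G))[::-1]
```

Here `off_diag_mean` should be close to `1/d`, and the largest eigenvalue `eig[0]` should be well separated from `eig[1]`, matching the picture of a matrix close to a combination of the identity and the all-one matrix.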
One significant implication of our result is the Paulsen problem on random frames. Given a frame where each satisfying
[TABLE]
the Paulsen problem asks whether there always exists a frame where each satisfying , for , and small. It was an open problem whether can be bounded by a function independent of the number of vectors . Recently, this question was answered positively in [45], showing that . This bound is improved to by Hamilton and Moitra [32] with a much simpler proof. There are examples showing that , so the upper bound and the lower bound almost match in the worst case.
The Paulsen problem was asked [36] because it is difficult to generate that satisfies the conditions exactly but easier to generate that almost satisfies the conditions. But actually not many ways are known to generate that almost satisfies the conditions with small , and almost all known constructions are random frames [36, 59]. Even for the few constructions that are deterministic (such as equiangular lines), it is likely that they satisfy the spectral gap assumption. So, for the Paulsen problem, the inputs of interest satisfy the spectral gap assumption, and we can prove a much stronger bound that goes beyond the worst case lower bound.
Theorem 1.12.
Let be a random frame with , where each is an independent random vector with . Suppose . Then, with probability at least , there exists a frame with , for , and .
We also demonstrate how the results in spectral analysis can be used to construct with the additional property that is small for , which is an original motivation for the Paulsen problem (Section 4.2.4).
Theorem 1.13.
For , there exists a doubly balanced frame where each with and
[TABLE]
1.4.3 Operator Scaling
The operator scaling problem was used to compute the Brascamp-Lieb constant [21]. A Brascamp-Lieb datum is specified by an -tuple of linear transformations and an -tuple of exponents . The Brascamp-Lieb constant of this datum is defined as the smallest such that for every -tuple of non-negative functions which are integrable, we have
[TABLE]
This is a common generalization of many useful inequalities; see [8, 21]. It turns out that the functions for which the inequality is tight are density functions of Gaussians [46], and this implies the Brascamp-Lieb constant can be written in a form very similar to the capacity of an operator (see Section 4.3.1). This is used in [21] to compute the Brascamp-Lieb constant through operator scaling.
Using this connection, we can derive upper bounds on the Brascamp-Lieb constant using the capacity lower bound in Theorem 1.8.
Corollary 1.14.
Given a datum with for and , if is -nearly geometric and satisfies the -spectral gap condition with for some sufficiently large constant and , then
[TABLE]
An interesting special case of the Brascamp-Lieb inequality is the rank one case where and and for which was studied in [5]. In this case, the capacity of the operator from the reduction (Section 4.3.1) is
[TABLE]
which is a form that is also studied in approximation algorithms [50]. Using the results in Section 5 and the above corollary, we can show that if each is an independent random unit vector and , then and ; see Example 4.27. Note that this is independent of the number of vectors.
The operator scaling algorithm is used in [20, 19] to compute the non-commutative rank of a symbolic matrix. We show in Section 4.3.2 that an operator satisfying the spectral gap condition has full non-commutative rank.
In solving the orbit intersection problem [1], the result of a generalization of the Paulsen problem to the operator setting in [45] was used. As in Theorem 1.12, we prove a much stronger bound in Section 4.3.3 on the squared distance when the operator satisfies the spectral gap condition.
1.5 Techniques
We are not aware of previous work on spectral analysis of matrix scaling and operator scaling. To our knowledge, the results are new even in the well-studied special case of matrix scaling. The closest work in this direction that we are aware of is a recent work by Rudelson, Samorodnitsky and Zeitouni [53], who analyze the condition number of the matrix scaling solution when the matrix satisfies some strong (vertex) expansion property using a combinatorial argument.
In the following, we discuss the previous techniques used in analyzing the continuous operator scaling algorithm, and then discuss the techniques used in this paper.
1.5.1 Comparisons with Previous Techniques
The operator capacity defined by Gurvits [29] was used crucially as a potential function to analyze the discrete operator scaling algorithms in [29, 20] as well as the continuous operator scaling algorithm in [45].
A smoothed analysis of matrix scaling was presented in [45] for solving the Paulsen problem. It was shown that if most of the entries of an matrix with are at least for a large enough , then the continuous matrix scaling algorithm has linear convergence with rate at least . This combinatorial assumption is restrictive and only applies in the matrix scaling setting. Note that the combinatorial assumption implies the spectral gap assumption in Definition 1.4 with but not vice versa. Through a reduction from operator capacity to matrix capacity, the smoothed analysis can be extended to the frame setting but the proof was complicated, and it was not known whether the smoothed analysis could be extended to the general operator setting. The main difficulty is that there is no analogous combinatorial condition in the frame setting and in the operator setting to guarantee the linear convergence. This is an illustration of the difference between the matrix case and the noncommutative operator case, in which there is no natural basis to consider. In this paper, we have found a natural spectral condition to guarantee linear convergence directly in the general operator setting. As a consequence, we do not need to go through the operator capacity to analyze the convergence rate of the operator scaling algorithm, which is different from previous analyses. Nonetheless, we can use the linear convergence to prove a lower bound on the operator capacity as was done in [45].
1.5.2 Outline of Spectral Analysis
We illustrate the main ideas of the spectral analysis in the simpler matrix scaling setting and mention how these ideas can be generalized to the operator setting. For gradient descent, a common approach to prove linear convergence is to show that the Hessian matrix has small condition number. Instead, our approach is to directly analyze the change of . In the matrix scaling setting, it follows from Lemma 4.2.9 in [45] that
[TABLE]
where is the current non-negative matrix, and are the size, the -th row sum and the -th column sum of respectively. We call the first two terms in the right hand side the quadratic terms and the last term the cross term. Our goal is to lower bound their sum by . To do so, we will prove a lower bound on the sum of the quadratic terms, and an upper bound on the absolute value of the cross term.
First, we prove a structural result that the maximum violation of a row and a column will not increase much throughout the continuous matrix scaling algorithm, and then we use this to show that the sum of the quadratic terms is at least for an -nearly doubly balanced matrix . Then, we write the cross term as a quadratic form of the matrix as , where is the vector with the -th entry being and is the vector with the -th entry being . The observation is that and while are close to the first singular vectors of , so the cross term would be small if there is a spectral gap of the matrix . By a spectral argument, we can show that the absolute value of the cross term is at most . Combining these two bounds, we can lower bound the convergence rate to be at least initially.
To prove that the convergence rate is at least for all time, we need to prove that the spectral gap condition is maintained throughout the continuous matrix scaling algorithm. To do so, we argue through the condition number of the scaling solutions. We use the structural result and the linear convergence to show that the condition number of the scaling solution is small, and then we show that the singular values of the matrix would not change much if we scale the matrix by diagonal matrices of small condition numbers. Finally, we use an inductive argument to prove that the linear convergence is maintained for all time. The results for condition numbers and capacity follow from the arguments developed and the linear convergence.
The proof for the general operator setting has the same structure, with more involved technical details in some steps. To prove the structural result that the operator norm of the error matrices would not increase much throughout the continuous operator scaling algorithm, we need to use the envelope theorem to bound the maximum eigenvalue and the minimum eigenvalue. To bound the condition number of the scaling solutions, we need to use results from the theory of product integration to analyze the scaling solutions. For readers who are more interested in matrix scaling and/or who would like to understand the spectral analysis in a simpler setting first, we include a self-contained proof for the matrix scaling case in Appendix B even though the matrix scaling result is completely generalized by the operator scaling result.
1.6 Organization
We first review some background about completely positive linear operators and the continuous operator scaling algorithm in Section 2. We then prove the main technical results in Section 3 and show various applications in Section 4. We provide a proof in Section 5 that a random frame satisfies the spectral condition with high probability. In Appendix B, we provide a self-contained proof of Theorem 1.5 in the special case of matrix scaling.
2 Preliminaries
We first review in Section 2.1 some background in quantum information theory about completely positive maps and discuss the spectral gap condition stated in Definition 1.4. Then, we review the known results about the continuous operator scaling algorithm in Section 2.2.
2.1 Positive Linear Maps, Matrix Representations, Quantum Expanders
First, we define completely positive linear maps and their natural matrix representation in Section 2.1.1. Then, in Section 2.1.2, we present the spectral gap condition in Definition 1.4 using this language, and compare to the notion of quantum expanders studied in the literature. Finally, we introduce the Choi matrix in Section 2.1.3 and state some facts about tensors and completely positive maps that we will use in our proof.
2.1.1 Completely Positive Linear Map
Given where for , it can be used to define a linear map as
[TABLE]
where is the adjoint map so that for any and , where is the Hilbert-Schmidt inner product.
Definition 2.1 (Completely Positive Map).
A linear map is positive if for every , where denotes that is a positive semidefinite matrix. A linear map is completely positive if is positive for every natural number (see [61] for more details).
Theorem 2.2 (Choi [12]).
A linear map is completely positive if and only if it can be written in the form described in (2.1).
The matrices are called the Kraus operators of . Note that the Kraus operators are not uniquely defined for a linear map .
Definition 2.3 (Doubly Balanced Map).
A linear map is called unital if . A linear map is called trace preserving if (which implies that for any ). A linear map is called doubly balanced if there exists such that is unital and is trace preserving.
Using this terminology, the operator scaling problem can be rephrased as follows: given the Kraus operators of a completely positive map, find a left scaling matrix and a right scaling matrix so that the completely positive map defined by the Kraus operators is non-zero and doubly balanced.
For each completely positive linear map , we can associate a matrix representation describing the same linear transformation.
Definition 2.4** (Natural Matrix Representation of Linear Map).**
Given a linear map , we can interpret it as a matrix by vectorizing the input and output matrices such that
[TABLE]
where is the linear map satisfying for all , where is the matrix with one in the -th entry and zero otherwise and is the vector with one in the -th entry and zero otherwise.
There is a one-to-one correspondence between the matrix representations and the linear maps. Given a matrix , we can also interpret it as a map by matrixizing the input and output vectors such that
[TABLE]
where is the linear map satisfying .
The matrix representation of a completely positive map has a nice form in terms of its Kraus operators.
Fact 2.5** (Proposition 2.20 in [61]).**
Given a completely positive map with Kraus operators , the matrix representation can be written in the form described in Definition 1.4 such that
[TABLE]
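The relation between a map and its matrix representation can be checked numerically. The sketch below assumes row-major vectorization and the form M = sum of A_i tensor conj(A_i); the vectorization convention in the paper may differ (for instance by a transpose), so this is only one consistent choice. The function name `natural_representation` is hypothetical.

```python
import numpy as np

def natural_representation(kraus):
    # With row-major vec, vec(A X A^*) = (A kron conj(A)) vec(X),
    # so the map X -> sum_i A_i X A_i^* is represented by sum_i A_i kron conj(A_i).
    return sum(np.kron(A, A.conj()) for A in kraus)

rng = np.random.default_rng(1)
m, n, k = 3, 2, 4
kraus = [rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n)) for _ in range(k)]
X = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

M = natural_representation(kraus)                  # shape (m*m, n*n)
via_matrix = (M @ X.reshape(-1)).reshape(m, m)     # matrixize(M vec(X))
via_map = sum(A @ X @ A.conj().T for A in kraus)   # direct evaluation of the map
```

The two results `via_matrix` and `via_map` coincide, which is the content of the correspondence under the stated convention.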
2.1.2 Spectral Gap Condition and Quantum Expanders
Given the correspondence between the completely positive linear map and the natural matrix representation , the spectral gap condition in Definition 1.4 can be presented as follows.
Definition 2.6** (Spectral Gap Condition of ).**
Given an operator where for , let
[TABLE]
and as maximizers to the optimization problems with . Let
[TABLE]
The spectral gap condition in Definition 1.4 is equivalent to .
The concept of quantum expander was studied by Hastings [35] and Ben-Aroya, Schwartz, and Ta-Shma [7], which was stated using the above language with .
Definition 2.7** (Quantum Expander [35, 7]).**
An operator where each is called a -quantum expander if
1. The largest singular value is and the identity matrix is the largest left and right singular vector, i.e.
[TABLE]
2. For any orthogonal to , it holds that
[TABLE]
In [7, 35], the map is defined as , where is a unitary matrix. Then, the size of this operator is equal to , and the largest singular value is achieved at the identity matrix.
When the operator is -nearly doubly balanced, we will show in Lemma 3.6 that and is an approximate optimizer. Therefore, in the case , the spectral gap condition in Definition 1.4 is a more relaxed version of the quantum expander definition in [7], where we do not require to be the optimizer (but only an approximate optimizer).
From random matrix theory [58], almost all random non-negative matrices (from reasonable distributions) have a constant spectral gap, i.e. is a constant. For random operators, Hastings [35] proved that the operator has an almost Ramanujan spectral gap with if each is a random unitary matrix. This result has been extended recently by González-Guillén, Junge and Nechita to more general distributions [24]. It is reasonable to expect that most random operators have a constant spectral gap. There are also deterministic constructions of quantum expanders [7]. See [7, 35] for some applications of quantum expanders.
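To make the spectral gap concrete, the following sketch builds an operator from a few random real orthogonal matrices (a real analogue of the random unitary setting) and computes the two largest singular values of its matrix representation. It assumes that Definition 1.4 asks for the second singular value to be at most (1 - lambda) s / sqrt(mn), where s is the size of the operator; the variable names and this normalization are assumptions made for illustration.

```python
import numpy as np

def random_orthogonal(n, rng):
    # Orthogonal factor of a QR decomposition of a Gaussian matrix.
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return Q

rng = np.random.default_rng(2)
n, k = 4, 6
kraus = [random_orthogonal(n, rng) for _ in range(k)]

s = sum(np.linalg.norm(A, "fro") ** 2 for A in kraus)   # size of the operator, here k * n
M = sum(np.kron(A, A) for A in kraus)                   # real case: A_i kron A_i
sigma = np.linalg.svd(M, compute_uv=False)              # singular values, decreasing

scale = s / n                      # s / sqrt(m n) with m = n
lam = 1.0 - sigma[1] / scale       # largest lambda allowed by the assumed condition

# The normalized identity is a top singular vector: M vec(I) = k vec(I).
identity_vec = np.eye(n).reshape(-1) / np.sqrt(n)
residual = np.linalg.norm(M @ identity_vec - scale * identity_vec)
```

Since each Kraus operator is orthogonal, the operator is exactly doubly balanced, so the top singular value equals `scale` and the identity attains it; `lam` then measures how far the second singular value is from the first.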
2.1.3 Choi Matrix and Useful Facts
There is another matrix representation that is useful in studying completely positive linear maps.
Definition 2.8** (Choi Matrix).**
Given a completely positive linear map , the Choi matrix is defined as
[TABLE]
Using the Choi matrix, we can rephrase the operator scaling problem as finding left scaling matrix and right scaling matrix so that the scaled Choi matrix satisfies
[TABLE]
where the partial trace operations and are linear functions that satisfy and for and . This phrasing of the operator scaling problem is in line with the more general quantum marginal problem [11].
The following facts will be useful in our proofs. All but (4) are relatively straightforward.
Fact 2.9**.**
In the following, is the completely positive map with Kraus operators where each .
1. For any matrices and ,
[TABLE]
2. for any .
3. For any and ,
[TABLE]
4. Let and and define the scaled operator . Then,
[TABLE]
2.2 Continuous Operator Scaling
The continuous operator scaling algorithm was studied in [45]. We collect the definitions and the results that we will use in this subsection. We start with some definitions about operator scaling that we have already stated in the introduction.
2.2.1 Operator Scaling
Definition 2.10** (Operator).**
An operator is defined by a tuple of matrices where for .
Definition 2.11** (Size of an Operator).**
The size of an operator is defined as
[TABLE]
Definition 2.12** (-nearly Doubly Balanced Operator).**
An operator is called -nearly doubly balanced if
[TABLE]
The operator is called doubly balanced when .
Definition 2.13** (-error).**
Given an operator , define
[TABLE]
Definition 2.14** (Error Matrices).**
We define the error matrices as
[TABLE]
Note that , as
[TABLE]
where the last equality is by Definition 2.11. Also, we write
[TABLE]
so that .
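The quantities of this subsection are straightforward to compute. The sketch below assumes the normalizations s = sum of squared Frobenius norms, E = m * sum A_i A_i^* - s I_m, and F = n * sum A_i^* A_i - s I_n, which are consistent with the trace identity noted above; under these assumptions the smallest balancing parameter equals max(||E||_op, ||F||_op) / s. The function names are hypothetical.

```python
import numpy as np

def size(kraus):
    """s(A) = sum_i ||A_i||_F^2 (assumed form of Definition 2.11)."""
    return sum(np.linalg.norm(A, "fro") ** 2 for A in kraus)

def error_matrices(kraus):
    """E = m * sum A_i A_i^* - s I_m and F = n * sum A_i^* A_i - s I_n (assumed normalization)."""
    m, n = kraus[0].shape
    s = size(kraus)
    E = m * sum(A @ A.conj().T for A in kraus) - s * np.eye(m)
    F = n * sum(A.conj().T @ A for A in kraus) - s * np.eye(n)
    return E, F

def balance_eps(kraus):
    """Smallest eps with (1-eps)(s/m) I <= sum A_i A_i^* <= (1+eps)(s/m) I, and likewise for n."""
    E, F = error_matrices(kraus)
    return max(np.linalg.norm(E, 2), np.linalg.norm(F, 2)) / size(kraus)

rng = np.random.default_rng(3)
m, n, k = 3, 4, 5
kraus = [rng.standard_normal((m, n)) for _ in range(k)]

s = size(kraus)
E, F = error_matrices(kraus)
eps = balance_eps(kraus)
row_spectrum = np.linalg.eigvalsh(sum(A @ A.T for A in kraus))
```

Both error matrices are traceless by construction, and the eigenvalues in `row_spectrum` lie between (1 - eps) s / m and (1 + eps) s / m.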
The -error is bounded for an -nearly doubly balanced operator.
Lemma 2.15** (Lemma 3.6.1 in [45]).**
For an -nearly doubly balanced operator ,
[TABLE]
2.2.2 Dynamical System
Definition 2.16** (Dynamical System).**
The following dynamical system describes how changes over time in the continuous operator scaling algorithm:
[TABLE]
We show in Lemma A.1 in Appendix A that the dynamical system is equivalent to the gradient flow with potential function .
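A forward-Euler simulation gives a feel for the dynamical system. The sketch assumes that the flow is dA_i/dt = -(E A_i + A_i F) with the error matrices normalized as E = m * sum A_i A_i^T - s I and F = n * sum A_i^T A_i - s I, and it uses the potential Delta = ||E||_F^2 / m^3 + ||F||_F^2 / n^3; these normalizations are assumptions, and the example is restricted to square real operators, where this direction is a positive multiple of the negative gradient of Delta. The step size 0.02 is chosen only for this toy example.

```python
import numpy as np

def size(A):
    # A has shape (k, m, n); the size is the sum of squared entries.
    return float(np.sum(A * A))

def error_matrices(A):
    _, m, n = A.shape
    s = size(A)
    E = m * np.einsum("kij,klj->il", A, A) - s * np.eye(m)   # m * sum A_i A_i^T - s I
    F = n * np.einsum("kji,kjl->il", A, A) - s * np.eye(n)   # n * sum A_i^T A_i - s I
    return E, F

def delta(A):
    _, m, n = A.shape
    E, F = error_matrices(A)
    return float(np.sum(E * E) / m**3 + np.sum(F * F) / n**3)

def flow_direction(A):
    E, F = error_matrices(A)
    return -(E @ A + A @ F)        # matmul broadcasts over the k operators

rng = np.random.default_rng(4)
A = rng.standard_normal((8, 4, 4))
A /= np.sqrt(size(A))              # normalize the size to one
step = 0.02

sizes, deltas = [size(A)], [delta(A)]
for _ in range(300):
    A = A + step * flow_direction(A)
    sizes.append(size(A))
    deltas.append(delta(A))
```

In this run the size is non-increasing from step to step and the final value of the potential is below the initial one, in line with the qualitative behaviour described for the continuous system.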
It is shown in [45] that the dynamical system will converge to a solution with . The following lemmas describe how the different quantities evolve in the dynamical system. We use the superscript (t) to represent the quantity of interest at time in the dynamical system, and omit it when the time is clear from context.
Lemma 2.17** (Lemma 3.4.2 in [45]).**
The change of the size of the operator at time is
[TABLE]
The following lemma was proved directly in [45]. It can also be seen as a consequence that the dynamical system is the gradient flow on .
Lemma 2.18** (Lemma 3.4.3 in [45]).**
The change of at time is
[TABLE]
The following result was used in [45] for the smoothed analysis when the dynamical system has linear convergence.
Lemma 2.19** (Proposition 4.3.1 in [45]).**
Suppose there exists such that for all ,
[TABLE]
Then
[TABLE]
2.2.3 Operator Capacity
Definition 2.20** (Capacity).**
The capacity of an operator is defined as
[TABLE]
It was shown in [45] that the convergence rate of can be used to derive a lower bound on operator capacity.
Proposition 2.21** (Proposition 4.3.1 in [45]).**
Suppose there exists such that for all , it holds that
[TABLE]
Then, it follows that
[TABLE]
3 Spectral Analysis of Operator Scaling
We prove the main technical results in this section.
3.1 Overview
The main goal is to show that the dynamical system in Definition 2.16 has linear convergence. Let be an -nearly doubly balanced operator with -spectral gap. Assuming for a sufficiently large constant , we will prove that for all time ,
[TABLE]
We start by looking more closely at the expression for the change of .
Lemma 3.1**.**
The change of is
[TABLE]
Proof.
By Lemma 2.18 and Definition 2.16,
[TABLE]
and the lemma follows from Fact 2.9(3) that . ∎
We call the terms and the quadratic terms as they are always non-negative, and we call the term the cross term. The proof outline is the following:
1. In Section 3.2, we prove a structural result that bounds the operator norms of and throughout the dynamical system using the envelope theorem. This implies a bound on the operator norm of and , which is used to show that the sum of the quadratic terms is at least .
2. In Section 3.3, we bound the largest singular value of the matrix and show that is an approximate largest singular vector, and then we use a spectral argument to upper bound the absolute value of the cross term to be at most .
3. These two parts combine to show that when the spectral gap condition holds. To prove the linear convergence for all time , we need to prove that the spectral gap condition is maintained throughout the dynamical system. To do this, we bound the condition number of the scaling solutions in Section 3.5, and use it to conclude that the spectral gap condition and the linear convergence hold throughout in Section 3.6.
In Section 3.7 and Section 3.8, we use the results to prove Theorem 1.7 and Theorem 1.8 about condition number and operator capacity respectively.
Finally, in Section 3.9, we explain how to discretize the gradient flow to obtain a discrete algorithm with linear convergence under the spectral assumption.
3.2 Lower Bounding the Quadratic Terms
First, we prove a structural result bounding the operator norm of the error matrices and for all in Proposition 3.2, which will also be useful in bounding the condition number of the scaling solution in Section 3.5. Then we will use this proposition to lower bound the quadratic terms.
Proposition 3.2**.**
If is -nearly doubly balanced, then for any ,
[TABLE]
Proof.
The main idea is to show that the change of the quadratic form in the direction achieving is at most , and then to use it to conclude that to complete the proof using Lemma 2.17. Note that the direction achieving varies over time . To turn this idea into a formal proof, we use the generalized envelope theorem proven by Milgrom and Segal [49].
Theorem 3.3** (Corollary 4 in Milgrom and Segal [49]).**
Suppose that is a nonempty compact space, is continuous in and is continuous in . Then the function is differentiable almost everywhere and satisfies
[TABLE]
where is any optimizer at time satisfying .
To apply the theorem, we define the space to be , which is clearly nonempty and compact. The first coordinate indicates whether we are considering the error matrix or . The second coordinate indicates whether we are considering the largest or smallest eigenvalue of the error matrix. The third and fourth coordinates indicate the unit test vectors we are applying to and . The function is defined as follows:
[TABLE]
It is clear that is continuous in and is continuous in . Hence, by Theorem 3.3, the function satisfies
[TABLE]
Since and are Hermitian matrices,
[TABLE]
and so by the assumption that is -nearly doubly balanced. To compute the partial derivative, we consider the four cases of the optimizer at time one by one.
1. As and are Hermitian matrices, the optimizer of is a maximum eigenvector of satisfying , and as in this case. Then, by the definition of in Definition 2.16 and from Lemma 2.17, it follows that
[TABLE]
where the inequality follows from by Fact 2.9(2).
2. In this case, , and by calculations similar to those of the first case, we have
[TABLE]
3. By symmetry of and , we get the same bound as the first case:
[TABLE]
4. By symmetry of and , we get the same bound as the second case:
[TABLE]
Therefore, in any case we have , and we conclude that
[TABLE]
where the first equality is by Lemma 2.17 that . ∎
We have the following corollary by rewriting the conclusions of Proposition 3.2 using the definitions that and .
Proposition 3.4**.**
If is -nearly doubly balanced, then for any ,
[TABLE]
and
[TABLE]
We can use Proposition 3.4 to lower bound the quadratic terms in Lemma 3.1.
Lemma 3.5**.**
If is -nearly doubly balanced, then for any ,
[TABLE]
Proof.
By Proposition 3.4 and the fact that for positive semidefinite matrices ,
[TABLE]
∎
3.3 Upper Bounding the Cross Term
We will first bound the largest singular value of the matrix for any -nearly doubly balanced operator . Then, we will use a spectral argument to upper bound the absolute value of the cross term in Lemma 3.1.
Given a non-negative matrix, it is known that the square of the largest singular value is bounded by the product of the maximum row sum and the maximum column sum (see [38]). The proof of this bound is generalized to prove the following lemma.
John Watrous provided a different proof of Lemma 3.6 by generalizing the proof of Theorem 4.27 in his book [61]. We include his proof in Lemma A.2 in Appendix A.
Lemma 3.6**.**
If is an -nearly doubly balanced operator, then the largest singular value of its matrix representation in Definition 1.4 is
[TABLE]
Proof.
Given a vector norm , we can define an induced matrix norm . To prove the lemma, we define the vector norm for vectors in for any and its induced matrix norm for matrices in for any as
[TABLE]
where is the matrixizing operation in Definition 2.4 and is the standard operator norm of a matrix.
For a positive semidefinite matrix for some , we can bound its largest eigenvalue by this matrix norm, i.e. . To see this, let be an eigenvector with , then
[TABLE]
We apply this inequality to bound the largest singular value of , by considering the square matrix and its largest eigenvalue:
[TABLE]
As is the natural matrix representation of the completely positive map defined by the operator ,
[TABLE]
where the second equality is from Definition 2.4 and the last equality is by the theorem [9] that
[TABLE]
By a similar argument, . Therefore,
[TABLE]
where the last inequality follows from the assumption that is -nearly doubly balanced in Definition 2.12. Taking the square root on both sides gives the lemma. ∎
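Both the classical matrix bound and its operator generalization can be sanity-checked numerically. The sketch below first verifies, for a random non-negative matrix, that the squared largest singular value is at most the maximum row sum times the maximum column sum. It then verifies the operator analogue in the form sigma_1(M)^2 <= ||sum A_i A_i^T||_op * ||sum A_i^T A_i||_op for real Kraus operators, with M = sum A_i kron A_i; that this is the intermediate inequality in the proof above is an assumption about the elided displays.

```python
import numpy as np

rng = np.random.default_rng(5)

# Matrix case: sigma_1(B)^2 <= (max row sum) * (max column sum) for non-negative B.
B = rng.random((5, 7))
sigma1_matrix = np.linalg.norm(B, 2)
matrix_bound = np.sqrt(B.sum(axis=1).max() * B.sum(axis=0).max())

# Operator case: the row and column sums are replaced by operator norms of
# sum A_i A_i^T and sum A_i^T A_i.
m, n, k = 3, 4, 5
kraus = [rng.standard_normal((m, n)) for _ in range(k)]
M = sum(np.kron(A, A) for A in kraus)
sigma1_operator = np.linalg.norm(M, 2)
left_norm = np.linalg.norm(sum(A @ A.T for A in kraus), 2)
right_norm = np.linalg.norm(sum(A.T @ A for A in kraus), 2)
operator_bound = np.sqrt(left_norm * right_norm)
```

For an operator that is nearly doubly balanced, `left_norm` and `right_norm` are close to s/m and s/n respectively, which is how a bound of this type turns into the statement of the lemma.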
Lemma 3.6 implies that is an “approximate” first singular vector of . By the spectral gap condition in Definition 1.4, it will follow that any vector perpendicular to has a “small” quadratic form of , and this can be used to bound the cross term in Lemma 3.1. The following lemma summarizes the spectral argument, which will be used to bound the cross term in the next lemma.
Lemma 3.7**.**
Let . Let and be unit vectors. Suppose the following assumptions hold:
[TABLE]
Then, for any unit vectors and , it holds that
Proof.
First, we show that and are highly correlated with the first singular vectors of . Let be its singular value decomposition with and and are orthonormal bases. Write and as linear combinations of singular vectors as and . We will show that and are large. Observe that, since ,
[TABLE]
and similarly . So we have
[TABLE]
where the last inequality is because and for . Using our assumptions about and , it follows that
[TABLE]
which implies that
[TABLE]
By the same calculation, we have .
Next, we show that and are not highly correlated with the first singular vectors. Write and with and . We will show that and are small. Since by our assumption,
[TABLE]
By the same calculation, we have
Finally, we bound the absolute value of the quadratic form
[TABLE]
Using our assumptions on and and Cauchy-Schwarz inequality,
[TABLE]
Putting in the upper bounds on and derived above, we conclude
[TABLE]
∎
We use Lemma 3.7 to bound the cross term in Lemma 3.1.
Lemma 3.8**.**
If satisfies the spectral gap condition in Definition 1.4 with the additional assumption that for , then
[TABLE]
Proof.
Note that the cross term
[TABLE]
where the first equality is by Fact 2.9(3) and the second equality is by the definition of matrix representation in Definition 2.4.
To prove the lemma, we apply Lemma 3.7 with
[TABLE]
and
[TABLE]
Clearly, are unit vectors, and are also unit vectors as by Definition 2.14 and similarly . Note that as and from Definition 2.14, and similarly .
We check the assumptions of Lemma 3.7. By the additional assumption,
[TABLE]
and so we can set . By the spectral gap condition in Definition 1.4,
[TABLE]
and so we can set . Also, we check that
[TABLE]
where the second equality is from Definition 2.4 and the last equality is from Definition 2.11.
Therefore, we can conclude from Lemma 3.7 that
[TABLE]
Finally, we complete the proof using the inequality , and by our assumption, and by definition. ∎
3.4 Lower Bounding the Convergence Rate
Putting the bounds in Lemma 3.5 and Lemma 3.8 into Lemma 3.1, we obtain the following lower bound on the convergence rate of at any time .
Proposition 3.9**.**
If is -nearly doubly balanced and the matrix representation of satisfies the spectral conditions that
[TABLE]
then
[TABLE]
Note that Proposition 3.9 implies that the dynamical system has linear convergence at time . To see this, note that by Lemma 3.6, and from Definition 1.4, and therefore
[TABLE]
Under our assumption that , the dynamical system has linear convergence at time with rate .
To prove that the dynamical system has linear convergence with rate for all time , we will prove that the quantities in Proposition 3.9 do not change much when we move from to , i.e. , , and .
To bound the change of the singular values of , we will bound the condition number of the scaling solutions in the dynamical system in Section 3.5, and then use these bounds to argue about the change of the singular values and establish Theorem 1.5 in Section 3.6.
3.5 Scaling Solutions and Condition Numbers
We first present the results in product integration in Slavik’s book [55] in Section 3.5.1, and then use these results to bound the condition number of the scaling solutions in Section 3.5.2.
3.5.1 Scaling Solutions
The dynamical system in Definition 2.16 describes the change of by a differential equation. The solution to the differential equation can be analyzed using the theory of product integration in [55].
Definition 3.10**.**
Let be a matrix valued function. A partition of the interval is a sequence of numbers . Let for and . When the limits over all partitions with exist, the left product integral is defined as
[TABLE]
and the right product integral is defined as
[TABLE]
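A product integral can be approximated directly from this definition. The sketch below uses a uniform partition with left endpoints as tags and multiplies each new factor on the left; whether this ordering matches the convention called "left" in [55] is an assumption. For a constant symmetric matrix the product integral over [0, 1] should approach the matrix exponential, which is computed here through an eigendecomposition.

```python
import numpy as np

def left_product_integral(A, a, b, steps):
    """Approximate the product of (I + A(t_i) * dt_i) with later factors on the left."""
    ts = np.linspace(a, b, steps + 1)
    d = A(a).shape[0]
    P = np.eye(d)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        P = (np.eye(d) + A(t0) * (t1 - t0)) @ P
    return P

rng = np.random.default_rng(6)
G = rng.standard_normal((3, 3))
S = (G + G.T) / 2
S /= np.linalg.norm(S, 2)                  # symmetric with operator norm one

# Reference value: exp(S) via the eigendecomposition of the symmetric matrix S.
w, V = np.linalg.eigh(S)
exact = V @ np.diag(np.exp(w)) @ V.T

approx = left_product_integral(lambda t: S, 0.0, 1.0, 5000)
error = np.linalg.norm(approx - exact, 2)
```

With 5000 subintervals the discrepancy `error` is of order 1e-4, and it shrinks roughly in proportion to the mesh size.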
Theorem 3.11** (Theorem 2.5.1 in [55]).**
If are continuous matrix functions, then the product integrals
[TABLE]
exist and satisfy the equations
[TABLE]
for every .
Applying Theorem 3.11 with , , and , we can explicitly describe the scaling matrices of the dynamical system.
Corollary 3.12**.**
The solution to the dynamical system in Definition 2.16 is where
[TABLE]
We are interested in bounding the condition number of and .
Definition 3.13** (Condition Number).**
The condition number of a matrix is defined as
[TABLE]
where and are the maximum singular value and the minimum singular value of respectively.
The following theorem in Slavik [55] will be used to bound and .
Theorem 3.14** (Corollary 3.4.3 in [55]).**
If are Riemann integrable functions, then
[TABLE]
Applying Theorem 3.14 with and , we have the following bound of the maximum and minimum eigenvalues of .
Corollary 3.15**.**
For any ,
[TABLE]
This corollary will be used to bound the condition number of in Lemma 3.16, which will then be used to bound the condition number of in Lemma 3.18.
3.5.2 Bounding the Condition Number
To bound the condition number, we use Corollary 3.15 and bound the integral in the exponent. To bound the integral, we divide the time into two phases. In the first phase, we use Proposition 3.2 to argue that . In the second phase, we use that is converging linearly to argue that is converging linearly. In the following lemma, we should think of as the spectral gap parameter in Definition 1.4.
Lemma 3.16**.**
Suppose there exists such that for all , it holds that
[TABLE]
If is -nearly doubly balanced for , then
[TABLE]
Proof.
To bound the condition number, we use Corollary 3.15 and bound the integral
[TABLE]
We split the integral into two terms. For the first term, we use Proposition 3.2 to bound
[TABLE]
where the second inequality is by the fact that is non-increasing from Lemma 2.17. Applying Lemma 2.19 with our assumption that , it follows that
[TABLE]
where the second inequality is by Lemma 2.15, and the last inequality is by our assumption that .
For the second term,
[TABLE]
where the second inequality is from the inequality that from Definition 2.14, and the third inequality follows from Lemma 2.19 using the assumption that is converging linearly with .
We choose
[TABLE]
This implies that
[TABLE]
and so the second term is at most . The first term is at most , and so Corollary 3.15 implies that
[TABLE]
∎
Remark 3.17**.**
We have some examples indicating that the term in the condition number is necessary, but we do not have a formal proof for this lower bound at the time of writing.
We cannot use the same argument to bound , as it will only give us a bound with dependency on (where we assumed ). Instead, we use the bound on to derive a similar bound on .
Lemma 3.18**.**
Suppose there exists such that for all , it holds that
[TABLE]
If is -nearly doubly balanced for , and also , then
[TABLE]
Proof.
We would like to bound
[TABLE]
First, we bound . Let be a maximizer such that and .
Consider . On one hand, we use Proposition 3.4 to upper bound
[TABLE]
On the other hand, by Fact 2.9(4),
[TABLE]
Since , all singular values of are at least , and thus all eigenvalues of are at least , i.e. . It follows from Fact 2.9(2) that ${\Phi^{(0)}}^{*}\left({L^{(T)}}^{*}L^{(T)}\right)\succeq{\Phi^{(0)}}^{*}\big((1-\ell)^{2}I_{m}\big)$, and using it in the above equation gives
[TABLE]
where the second inequality uses that is -nearly doubly balanced. Combining the upper bound and lower bound gives
[TABLE]
where we use the assumptions that .
Next, we bound using a similar argument. Let be a minimizer such that and . Consider . On one hand, we use Proposition 3.4 to lower bound
[TABLE]
where the second inequality uses the assumption that is converging linearly for to apply Lemma 2.19 with to obtain
[TABLE]
where the second inequality is by Lemma 2.15 and the last inequality is from the assumption that .
On the other hand, by a similar calculation as above with , we obtain
[TABLE]
Combining the upper bound and lower bound gives
[TABLE]
where we used the assumptions that and are sufficiently small. Therefore, we conclude that
[TABLE]
∎
3.6 Invariance of Linear Convergence
We will first use Lemma 3.16 and Lemma 3.18 to bound the change of the singular values of . Then, we will combine the previous results to prove Theorem 1.5 that is converging linearly for all .
To bound the change of the singular values, we use the following inequality.
Lemma 3.19** (Theorem 3.3.16 in [37]).**
Let and be two matrices. For any ,
[TABLE]
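The inequality is used below as a perturbation bound for singular values. Assuming it takes the Weyl-type form |sigma_i(A) - sigma_i(B)| <= ||A - B||_op for every index i, which is how it is applied in the next lemma, a quick numerical check looks as follows; the matrices and variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((6, 4))
B = A + 0.1 * rng.standard_normal((6, 4))     # a small perturbation of A

sv_A = np.linalg.svd(A, compute_uv=False)     # both sorted in decreasing order
sv_B = np.linalg.svd(B, compute_uv=False)

max_shift = np.abs(sv_A - sv_B).max()         # largest change of any singular value
perturbation = np.linalg.norm(A - B, 2)       # operator norm of the difference
```

In every such experiment `max_shift` does not exceed `perturbation`, which is what makes an operator-norm bound on the change of the matrix representation sufficient for controlling each singular value.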
The following lemma bounds the change of the singular values after scaling the operator.
Lemma 3.20**.**
For any , suppose and for some , then
[TABLE]
Proof.
The operator at time is . By Fact 2.5, the matrix representation of the operator at time is
[TABLE]
where the second equality is by Fact 2.9(1). By Lemma 3.19,
[TABLE]
To bound the right hand side, we expand as and expand similarly. Then can be written as the sum of fifteen terms, with cancelled with . To bound the operator norm, we use the triangle inequality and bound the sum of the fifteen operator norms. For each term, we use the facts that and to bound its norm. For example,
[TABLE]
Since we assumed that and for some , each of these terms is at most and thus we conclude that . ∎
We are ready to put together the results to prove the following theorem which implies Theorem 1.5.
Theorem 3.21**.**
If is -nearly doubly balanced and satisfies the -spectral gap condition in Definition 1.4 with for a sufficiently large constant , then for all it holds that
[TABLE]
Proof.
Recall from Proposition 3.9 the definitions of and , and by Lemma 3.6 and from Definition 1.4. Let be the supremum such that and . Our goal is to prove that is converging linearly for and is unbounded.
First, we show that is converging linearly for . By Proposition 3.9,
[TABLE]
where in the second inequality we used that and for . Note that our assumption implies that for a sufficiently large constant as . Since from Lemma 3.6, it follows that for any ,
[TABLE]
Next, we argue that the size condition and the spectral gap condition will still be maintained beyond time . For the size change, by Lemma 2.19 with ,
[TABLE]
where the second inequality is by Lemma 2.15 and the last inequality is by for a sufficiently large constant .
For the change of the second largest singular value, by definition,
[TABLE]
On the other hand, we can upper bound using condition numbers. Using Lemma 3.16 with , . Note that our assumption implies that
[TABLE]
where the implication is by the inequality for close to zero. Then, by Lemma 3.18, we also have . Putting these bounds into of Lemma 3.20, we obtain
[TABLE]
Combining the upper bound and lower bound and using from Lemma 3.6, it follows that
[TABLE]
where the last inequality is by the assumption that .
For the change of the largest singular value, by Proposition 3.4,
[TABLE]
where the first and last inequalities use that . The same holds for and these imply that is -nearly doubly balanced. By Lemma 3.6, this implies that . Therefore,
[TABLE]
where the second last inequality uses that is a sufficiently large constant.
Since our dynamical system is continuous, we still have both conditions satisfied at time for some , which contradicts the assumption that is the supremum time for which both conditions are satisfied. Therefore, is unbounded and the linear convergence of is maintained throughout the execution of the dynamical system. ∎
3.7 Condition Number
With the invariance of the linear convergence, we can apply Lemma 3.16 and Lemma 3.18 to bound the condition number of the scaling solutions and prove Theorem 1.7.
Theorem 3.22**.**
If is -nearly doubly balanced and satisfies the -spectral gap condition in Definition 1.4 with for a sufficiently large constant , then for any ,
[TABLE]
In particular, these bounds hold for the final scaling solutions and .
Proof.
By Theorem 3.21, is linearly converging for all time with rate at least . By Lemma 3.16, this implies that
[TABLE]
where we used the assumption that and for close to zero. By Lemma 3.18, this implies the same bound on
[TABLE]
Therefore, and , and hence
[TABLE]
where we used that . The same argument applies to give the same bound for . ∎
3.8 Operator Capacity
Theorem 1.8 follows easily from Theorem 3.21.
Theorem 3.23**.**
If is -nearly doubly balanced and satisfies the -spectral gap condition in Definition 1.4 with for a sufficiently large constant , then
[TABLE]
Proof.
By Theorem 3.21, is linearly converging for all time with rate . Apply Proposition 2.21 with ,
[TABLE]
where the second inequality is by Lemma 2.15. ∎
3.9 Discrete Gradient Flow
The gradient flow can be discretized to give a polynomial time algorithm with linear convergence when the input has a spectral gap. The analysis follows closely the continuous case, so we will just provide a sketch.
Recall that the gradient flow is defined as
[TABLE]
where and are the error matrices (Definition 2.14) of the current operator .
In the discrete case, a natural update step is
[TABLE]
for some small step size , but the problem with this update step is that may not be a scaling of . We therefore modify the discrete algorithm slightly as follows. In each step, we update
[TABLE]
where is the step size. This update is to maintain that the current operator is a scaling of the original operator.
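The point about staying within scalings of the original operator can be illustrated with a multiplicative update. The sketch below uses A_i <- (I - step * E) A_i (I - step * F), which is one plausible update with this property and is an assumption rather than the paper's exact formula; the normalizations of E, F, and the potential Delta, and the step size 0.02, are likewise assumptions for a toy square real example. The accumulated left and right factors are tracked so that the scaling property can be verified.

```python
import numpy as np

def error_matrices(A):
    _, m, n = A.shape
    s = float(np.sum(A * A))
    E = m * np.einsum("kij,klj->il", A, A) - s * np.eye(m)   # m * sum A_i A_i^T - s I
    F = n * np.einsum("kji,kjl->il", A, A) - s * np.eye(n)   # n * sum A_i^T A_i - s I
    return E, F

def delta(A):
    _, m, n = A.shape
    E, F = error_matrices(A)
    return float(np.sum(E * E) / m**3 + np.sum(F * F) / n**3)

rng = np.random.default_rng(8)
A0 = rng.standard_normal((8, 4, 4))
A0 /= np.sqrt(np.sum(A0 * A0))

A, L, R = A0.copy(), np.eye(4), np.eye(4)
step = 0.02
for _ in range(300):
    E, F = error_matrices(A)
    left = np.eye(4) - step * E
    right = np.eye(4) - step * F
    A = left @ A @ right       # hypothetical multiplicative update of every A_i
    L = left @ L               # accumulated left scaling
    R = R @ right              # accumulated right scaling
```

After the loop, the current operator equals `L @ A0 @ R` up to rounding, so every iterate is a scaling of the input, and the potential has decreased relative to its initial value. An additive update of the form A_i - step * (E A_i + A_i F) agrees with this one to first order in the step size but does not factor in this way.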
We assume that and initially. We will set the step size to be for the same analysis in the continuous case to go through. With this choice of the step size, we can show that
[TABLE]
by expanding the change of the size and using the small step size to argue that the higher order terms are negligible. By a similar but more tedious calculation (since the degree is higher), we can also show that
[TABLE]
where is the change of in the continuous case. This is also the step that we need to hold. Since we know , this implies that
[TABLE]
that is decreasing geometrically with rate , when the current operator satisfies the spectral condition.
As in the continuous case, we use an inductive argument to prove that the spectral gap condition is maintained to establish that the convergence rate is maintained throughout the algorithm. Again, we go through the condition number of the error matrices, and use the arguments in Lemma 3.20 to show that the change of the singular value is
[TABLE]
and it follows that the -spectral gap condition holds throughout as
[TABLE]
which is negligible when the spectral assumption holds initially.
In the discrete algorithm, we will set the step size to be . If the continuous algorithm converges to an -approximate solution in time , the discrete algorithm will converge to an -approximate solution in number of iterations, and the dependency on is by Theorem 1.5.
Remark 3.24**.**
The step size is chosen for the same analysis as in the continuous case to hold. It is an interesting open question whether the analysis can be extended to constant step size, in particular whether Sinkhorn’s alternating algorithm has the same convergence rate as in the gradient flow.
4 Applications of Matrix Scaling and Operator Scaling
In this section, we show some implications of our results in various applications of the operator scaling problem.
4.1 Matrix Scaling
Given a non-negative matrix , let be the size of the matrix, be the -th row sum of , and be the -th column sum of . A non-negative matrix is called -nearly doubly balanced if for every and for every ,
[TABLE]
and is called doubly balanced when . A common setting is when is an matrix whose average row sum is equal to one, in which case ; such a matrix is called “doubly stochastic” when every row sum and every column sum is equal to one.
Definition 4.1** (Matrix Scaling Problem).**
We are given a non-negative matrix , and the goal is to find a left diagonal scaling matrix and a right diagonal scaling matrix such that is doubly balanced, or report that such scaling matrices do not exist.
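For orientation, the sketch below measures how far a non-negative matrix is from being doubly balanced and then scales it with Sinkhorn's classical alternating normalization. This is the standard baseline mentioned in Remark 3.24 and not the gradient flow analyzed in this paper. The balance measure assumes that the condition above bounds the relative deviation of each row sum from s/m and of each column sum from s/n; the names `balance_eps` and `sinkhorn` are hypothetical.

```python
import numpy as np

def balance_eps(B):
    """Smallest eps such that all row sums are within (1 +/- eps) s/m and column sums within (1 +/- eps) s/n."""
    m, n = B.shape
    s = B.sum()
    row_dev = np.abs(B.sum(axis=1) * m / s - 1).max()
    col_dev = np.abs(B.sum(axis=0) * n / s - 1).max()
    return max(row_dev, col_dev)

def sinkhorn(B, iterations):
    """Alternately rescale rows and columns of a square positive matrix to have unit sums."""
    m, n = B.shape
    x, y = np.ones(m), np.ones(n)
    for _ in range(iterations):
        x = 1.0 / (B @ y)       # makes the row sums of diag(x) B diag(y) equal to one
        y = 1.0 / (B.T @ x)     # makes the column sums of diag(x) B diag(y) equal to one
    return x, y

rng = np.random.default_rng(9)
B = rng.uniform(0.5, 1.5, size=(5, 5))     # strictly positive, so a scaling exists
x, y = sinkhorn(B, 200)
scaled = x[:, None] * B * y[None, :]       # diag(x) B diag(y)

eps_before = balance_eps(B)
eps_after = balance_eps(scaled)
```

For this strictly positive input the scaled matrix is doubly stochastic up to rounding, so `eps_after` is essentially zero while `eps_before` is not.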
Outline: In the following, we will show that the matrix scaling problem can be reduced to the operator scaling problem in Section 4.1.1. Then, we will see that the spectral condition has a simple form in Section 4.1.2, and there is a natural combinatorial condition that implies the spectral condition in Section 4.1.3. We then argue that many random matrices will satisfy our condition in Section 4.1.4. Finally, we see the implications of our results in several applications of matrix scaling, including bipartite matching in Section 4.1.5, permanent lower bound in Section 4.1.6, and optimal transportation in Section 4.1.7.
4.1.1 Reduction to Operator Scaling
The matrix scaling problem is a special case of the operator scaling problem.
Lemma 4.2**.**
Given a non-negative matrix , let be the operator where each for and is the matrix with the -th entry equal to and all other entries equal to zero. Then, is -nearly doubly balanced if and only if is -nearly doubly balanced. Furthermore, there is a solution to the matrix scaling problem for if and only if there is a solution to the operator scaling problem for .
Proof.
By construction, is the matrix with in the -th entry and zero otherwise, and is the matrix with in the -th entry and zero otherwise. So, is the diagonal matrix where the -th diagonal entry is the -th row sum of , and is the diagonal matrix where the -th diagonal entry is the -th column sum of . Therefore, is -nearly doubly balanced if and only if is -nearly doubly balanced. It should be clear that the square root of a scaling solution to is also a (diagonal) scaling solution to .
Because of the special structure that each has only one non-zero entry, there is always a scaling solution with being diagonal matrices if a scaling solution exists. To see this, let be a scaling solution to with and . Define . We claim that is also a scaling solution to and is a diagonal matrix. First, . Next, it follows from that , and this implies that is a diagonal matrix as is a diagonal matrix because each has only one non-zero entry. Finally, we check that
. By the same argument, we can define so that is also a scaling solution to and both and are diagonal matrices. Therefore, we conclude that the matrix scaling problem can be reduced to the operator scaling problem. ∎
4.1.2 Spectral Condition
The spectral condition for operator scaling has a simple form for matrix scaling.
Lemma 4.3**.**
Using the reduction from Lemma 4.2, the spectral condition for operator scaling in Definition 1.4 becomes
[TABLE]
Proof.
Note that each has only one non-zero entry , and in Definition 1.4 has only an submatrix with nonzero entries and this submatrix is exactly . So, the condition that becomes . ∎
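The reduction and the resulting form of the spectral condition can be verified on a small example. The sketch below builds one Kraus operator per entry of a non-negative matrix B, with the square root of the entry in the corresponding position, and checks that the two sums reproduce the row and column sums and that the non-zero singular values of the matrix representation are those of B. The function name `matrix_to_operator` and the use of M = sum A kron A for real operators are assumptions for illustration.

```python
import numpy as np

def matrix_to_operator(B):
    """One Kraus operator per entry: sqrt(B[i, j]) in position (i, j), zero elsewhere."""
    m, n = B.shape
    kraus = []
    for i in range(m):
        for j in range(n):
            A = np.zeros((m, n))
            A[i, j] = np.sqrt(B[i, j])
            kraus.append(A)
    return kraus

rng = np.random.default_rng(10)
B = rng.random((3, 4))
kraus = matrix_to_operator(B)

row_part = sum(A @ A.T for A in kraus)     # diagonal matrix of row sums of B
col_part = sum(A.T @ A for A in kraus)     # diagonal matrix of column sums of B

M = sum(np.kron(A, A) for A in kraus)      # 9 x 16, supported on a 3 x 4 submatrix
sv_M = np.linalg.svd(M, compute_uv=False)
sv_B = np.linalg.svd(B, compute_uv=False)
```

The first three entries of `sv_M` equal the singular values of B and the remaining ones vanish, so a condition on the second singular value of the representation is a condition on the second singular value of B.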
4.1.3 Combinatorial Condition
To better understand the spectral gap condition in the matrix case, we present a natural combinatorial condition that implies the spectral condition.
Definition 4.4 (Edge-Weighted Bipartite Graph and Conductance).
Given a non-negative matrix , we define its edge-weighted bipartite graph as follows. In , there is one vertex for each row , one vertex for each column , and an edge with weight between and .
The conductance of an edge-weighted graph with is defined as
[TABLE]
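For small instances, the conductance can be computed by brute force. The sketch below assumes the standard definition (the minimum, over vertex sets containing at most half of the total degree, of the cut weight divided by the volume of the set) and assumes that the edge between row i and column j carries the weight of the (i, j)-th entry. The function names are hypothetical, and the enumeration is exponential in the number of vertices.

```python
from itertools import combinations

def bipartite_weights(B):
    """Symmetric weight matrix of the bipartite graph with rows first, columns second."""
    m, n = len(B), len(B[0])
    W = [[0.0] * (m + n) for _ in range(m + n)]
    for i in range(m):
        for j in range(n):
            W[i][m + j] = W[m + j][i] = B[i][j]
    return W

def conductance(W):
    """Brute-force conductance of a weighted graph (assumed standard definition)."""
    n = len(W)
    deg = [sum(row) for row in W]
    total = sum(deg)
    best = float("inf")
    for size in range(1, n):
        for S in combinations(range(n), size):
            vol = sum(deg[v] for v in S)
            if vol == 0 or 2 * vol > total:
                continue  # only sets with at most half of the total volume
            inside = set(S)
            cut = sum(W[u][v] for u in S for v in range(n) if v not in inside)
            best = min(best, cut / vol)
    return best

# The all-ones 2-by-2 matrix corresponds to a 4-cycle.
print(conductance(bipartite_weights([[1.0, 1.0], [1.0, 1.0]])))
```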
Using Cheeger’s inequality from spectral graph theory, we can show that satisfies the spectral gap condition if its edge-weighted bipartite graph has large conductance.
Lemma 4.5.
If is -nearly doubly balanced for , then
[TABLE]
where is the edge-weighted bipartite graph of .
Proof.
The adjacency matrix of the edge-weighted bipartite graph is . Note that if is the singular value decomposition of , then has eigenvalues and eigenvectors . Therefore, where is the second largest eigenvalue of .
To relate to the conductance , we will consider the normalized adjacency matrix of and apply Cheeger’s inequality. The normalized adjacency matrix of a matrix is defined as where is the diagonal degree matrix with . For , note that , where is the diagonal matrix with the -th entry being the -th row sum of and is the diagonal matrix with the -th entry being the -th column sum of . Then,
[TABLE]
Let . Note that by the argument in the first paragraph. Each entry of is
[TABLE]
where we used the assumptions that is -nearly doubly balanced and . Hence, we can write , where is the “error” matrix with for all . By Lemma 3.19, . By the fact that the square of the largest singular value is at most the maximum row sum times the maximum column sum,
[TABLE]
where the last inequality uses that for and for . Finally, Cheeger’s inequality states that . Therefore, we conclude that
[TABLE]
∎
4.1.4 Random Matrices
Random matrices are one source of matrices satisfying the spectral condition. If we generate as a random bipartite graph (e.g. each entry is one with probability independently), then the resulting graph has with high probability by a standard probabilistic argument. Also, is -nearly doubly balanced for small by standard concentration inequalities (e.g. in the above example). So, by Lemma 4.5, the in Lemma 4.3 is , which implies that the assumption in Theorem 1.5 is satisfied with high probability. We can then apply our results to conclude that for those matrices:
1. The continuous operator scaling algorithm converges to a -nearly doubly balanced solution in time .
2. The condition number of the scaling solution is from Theorem 1.7.
3. The capacity of the matrix is close to from Theorem 1.8.
Indeed, the assumption in Theorem 1.5 should hold for a large class of random non-negative matrices where each entry is an independent random variable with a reasonable distribution such as the chi-squared distribution [58], and even for some random matrices with limited dependence such as -wise independent random graphs. One can verify the assumption either by using the combinatorial condition in Lemma 4.5, or by bounding the second largest singular value directly using the trace method as in Section 5.
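The nearly doubly balanced condition for a random 0/1 matrix can be inspected empirically. The sketch below assumes the definition in which every row sum is within a factor (1 ± ε) of s/m and every column sum is within a factor (1 ± ε) of s/n, where s is the sum of all entries; the function names, the matrix sizes, and the probability 1/2 are illustrative choices, not taken from the text.

```python
import random

def balance_error(B):
    """Smallest eps such that all row sums lie within (1 +/- eps) * s/m and
    all column sums within (1 +/- eps) * s/n (assumed definition)."""
    m, n = len(B), len(B[0])
    s = sum(map(sum, B))
    row_dev = max(abs(sum(row) * m / s - 1.0) for row in B)
    col_dev = max(abs(sum(col) * n / s - 1.0) for col in zip(*B))
    return max(row_dev, col_dev)

def random_01_matrix(n, p, seed=0):
    """n-by-n matrix whose entries are one independently with probability p."""
    rng = random.Random(seed)
    return [[1.0 if rng.random() < p else 0.0 for _ in range(n)] for _ in range(n)]

# One would expect the error to shrink as the dimension grows.
for n in (50, 400):
    print(n, balance_error(random_01_matrix(n, 0.5)))
```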
4.1.5 Bipartite Matching
It is known that a matrix can be scaled arbitrarily close to a doubly stochastic matrix if and only if the underlying bipartite graph has a perfect matching [47], and so the decision version of the bipartite perfect matching problem can be reduced to the matrix scaling problem. Moreover, the doubly stochastic scaling solution provides a fractional solution to the perfect matching problem, which can be converted to an integral solution very efficiently using the random walks technique in [23] (see also [48]).
Our results imply that the continuous operator scaling algorithm can be used to find a fractional perfect matching in an almost regular bipartite expander graph.
Corollary 4.6.
Suppose is a bipartite graph with where each vertex satisfies for some . If for some sufficiently large constant , then the gradient flow converges to an -nearly doubly balanced scaling (i.e. -nearly perfect fractional matching) in time .
We remark that our results also imply that the second-order methods for matrix scaling in [13, 2] are near linear time algorithms for the instances in Corollary 4.6. This is because the condition number of the scaling solution for those instances is a constant by Theorem 1.7 and the algorithms in [13, 2] have time complexity . We note that classical combinatorial algorithms can also achieve a similar running time on the instances in Corollary 4.6.
4.1.6 Permanent Lower Bound
Given a matrix , the permanent is defined as
[TABLE]
where is the set of all permutations of elements. Linial, Samorodnitsky, and Wigderson [47] used the matrix scaling algorithm to design a deterministic -approximation algorithm for computing the permanent of a non-negative matrix. The algorithm works by scaling the input matrix to a doubly stochastic matrix and keeping track of the change of the permanent, and then uses the results on Van der Waerden’s conjecture, which show that any doubly stochastic matrix has permanent at least and at most one, to conclude the -approximation.
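As a small illustration of the definition, the permanent can be evaluated directly as a sum over permutations. The sketch below is only practical for very small matrices. It also evaluates the classical Van der Waerden bound in its exact form n!/n^n (which is at least e^{-n}) on the doubly stochastic matrix with all entries equal to 1/n, where the bound is attained.

```python
from itertools import permutations
from math import factorial, prod

def permanent(A):
    """Permanent of a square matrix by direct summation over all permutations."""
    n = len(A)
    return sum(prod(A[i][sigma[i]] for i in range(n))
               for sigma in permutations(range(n)))

n = 4
J = [[1.0 / n] * n for _ in range(n)]                       # doubly stochastic
I = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

print(permanent(J), factorial(n) / n ** n)  # the two values agree up to rounding
print(permanent(I))                         # the upper bound of one is attained
```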
For matrices satisfying the spectral gap condition in Lemma 4.3 (e.g. random matrices in Section 4.1.4), we can use the capacity lower bound in Theorem 1.8 to argue that the continuous operator scaling algorithm does not change the matrix much, and thus establish a permanent lower bound for those matrices similar to that of Van der Waerden’s.
To see the proof, we first define the capacity of a matrix.
Definition 4.7 (Matrix Capacity).
Given a matrix , define
[TABLE]
The following lemma is probably known but it was not stated in the literature.
Lemma 4.8.
Following the reduction in Lemma 4.2 from matrix scaling of to operator scaling of , we have that in Definition 4.7 is equivalent to in Definition 2.20.
Proof.
Recall that the capacity of an operator is defined as
[TABLE]
Using the reduction from Lemma 4.2, given a non-negative matrix , we define where each is the matrix with the -th entry equal to and all other entries zero. Then, is the diagonal matrix with the -th entry equal to . If we let be the vector of the diagonal entries of , then the -th entry of is simply . Then, the determinant of is simply . Finally, by Hadamard’s inequality, for any positive definite matrix , and so we can assume the optimizer to is a diagonal matrix, and thus simplifies to in Definition 4.7. ∎
We are ready to prove the main result in this subsubsection.
Corollary 4.9.
If a non-negative matrix is -nearly doubly balanced with and it satisfies the -spectral gap condition in Definition B.1 with for some sufficiently large constant , then
[TABLE]
Proof.
Let be the input non-negative matrix with . Find the scaling solution such that is doubly stochastic (i.e. every row sum and every column sum equal to one), which is guaranteed to exist under our assumptions. Gurvits [31, 29] defined the (unnormalized) capacity of as
[TABLE]
Note that and also . Using the fact that for a doubly stochastic matrix [29, 20],
[TABLE]
Note that , and so the results on Van der Waerden’s conjecture imply that
[TABLE]
If is -nearly doubly balanced with and satisfies the spectral gap condition in Definition B.1, then Theorem 1.8 and Lemma 4.8 imply that
[TABLE]
where is the operator in the reduction from Lemma 4.2. Therefore, we conclude that
[TABLE]
∎
Example 4.10.
If is a random matrix where each entry is an independent random variable , where is sampled from the normal distribution , then and with high probability. Hence, the conditions in Corollary 4.9 are satisfied and it follows that
[TABLE]
So, the permanent of a random matrix from this distribution has a Van der Waerden’s type lower bound even though it is not doubly stochastic.
Barvinok and Samorodnitsky [6] proved an upper bound on the permanent of these matrices, and this implies a subexponential approximation of the permanent for these matrices.
4.1.7 Optimal Transport Distance
Given two probability distributions and a cost function , the optimal transport distance is the earth mover distance to move from one distribution to another distribution under the cost function. When the two probability distributions are discrete, the cost function can be represented as a cost matrix , and the problem of computing the optimal transport distance can be formulated as the assignment problem (i.e. a generalization of the minimum cost perfect matching). So the problem can be solved in polynomial time and there is a linear programming formulation for the problem. In large scale data analysis, however, the polynomial time algorithms are not fast enough.
Using the maximum entropy principle, Cuturi [14] proposed to add an entropic regularizer to the linear program, and showed that the optimal solution is the matrix scaling solution to a matrix associated to (more precisely where is a parameter in the regularizer). Cuturi showed that Sinkhorn’s algorithm for matrix scaling is very efficient in computing the optimal solution to the regularized linear program, and he even mentioned that Sinkhorn’s algorithm exhibits linear convergence in practice [14]. Since then, the “Sinkhorn distance” has become a popular alternative/approximation to the earth mover distance and is used in computer vision and machine learning research; see the book [52] and the references therein. Theorem 1.5 provides a condition that establishes the observed linear convergence, and this condition is satisfied by many random matrices as discussed in Section 4.1.4.
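To make the connection to matrix scaling concrete, the following is a minimal sketch of Sinkhorn's alternating scaling (not the gradient flow analyzed in this paper). It assumes that the matrix to be scaled is the entrywise exponential of the negated cost divided by a regularization parameter; the exact form of this matrix, the parameter name `eta`, the cost matrix, and the iteration count are illustrative assumptions.

```python
import math

def sinkhorn(K, r, c, iters=500):
    """Alternately rescale the rows and columns of a positive matrix K so that
    the scaled matrix has row sums r and column sums c."""
    m, n = len(K), len(K[0])
    x = [1.0] * m
    y = [1.0] * n
    for _ in range(iters):
        x = [r[i] / sum(K[i][j] * y[j] for j in range(n)) for i in range(m)]
        y = [c[j] / sum(K[i][j] * x[i] for i in range(m)) for j in range(n)]
    return [[x[i] * K[i][j] * y[j] for j in range(n)] for i in range(m)]

C = [[0.0, 1.0, 2.0],
     [1.0, 0.0, 1.0],
     [2.0, 1.0, 0.0]]          # hypothetical cost matrix
eta = 0.5                      # hypothetical regularization parameter
K = [[math.exp(-cost / eta) for cost in row] for row in C]
r = c = [1.0 / 3.0] * 3        # uniform marginals

P = sinkhorn(K, r, c)
transport_cost = sum(P[i][j] * C[i][j] for i in range(3) for j in range(3))
print(transport_cost)
```

The returned matrix is a diagonal scaling of `K` whose marginals match the two distributions, so it is a feasible transport plan for the regularized problem under the stated assumptions.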
Also, it is of interest to bound the Sinkhorn distance, which is shown in [14, 52] to be at most
[TABLE]
where and are the scaling solutions to and is the regularizer parameter. This result states that the distance is small if the condition number of the scaling solution is small. Theorem 1.7 provides a condition under which the condition number, and hence the Sinkhorn distance, can be bounded.
4.2 Frame Scaling
A frame is a collection of vectors where each for . The size of a frame is defined as . A frame is called -nearly doubly balanced if
[TABLE]
and is called doubly balanced when .
Definition 4.11 (Frame Scaling Problem).
Given a frame where each , the goal is to find a matrix such that satisfies .
Outline: In the following, we will show that the frame scaling problem can be reduced to the operator scaling problem in Section 4.2.1. Then, we will see that the spectral condition has a nice form in Section 4.2.2, and explain that random frames will satisfy our condition in Section 4.2.3. Finally, we show a significant implication of our results for the Paulsen problem in Section 4.2.4 and a construction of a doubly stochastic frame with small inner products in Section 4.2.5.
4.2.1 Reduction to Operator Scaling
The frame scaling problem is a special case of the operator scaling problem.
Lemma 4.12.
Given a frame where each , let where each for is the matrix with the -th column being and all other columns equal to zero. Then, is -nearly doubly stochastic if and only if is -nearly doubly stochastic. Furthermore, there is a solution to the frame scaling problem for if and only if there is a solution to the operator scaling problem for .
Proof.
By construction, and , and so is -nearly doubly stochastic if and only if is -nearly doubly stochastic. If is a solution to the frame scaling problem for , then we can set and and see that it is a solution to the operator scaling problem for .
If and is a solution to the operator scaling problem for , then we can use a similar argument as in Lemma 4.2 to show that and is also a solution and is a diagonal matrix as has the special structure that each has only one non-zero column. This is also proved in Lemma 3.7.4 in [45] so we omit the details. Since is diagonal, the -th entry must necessarily be for the doubly stochastic conditions to be satisfied, and so is a solution to the frame scaling problem for . ∎
4.2.2 Spectral Condition
The spectral condition for operator scaling is related to the following Hermitian matrix.
Definition 4.13 (Entrywise Squared Gram Matrix).
Given a frame where each , the squared Gram matrix is defined as for .
Note that is a positive semidefinite matrix. To see this, let be the matrix with the -th column being . Then, we can write where denotes the Hadamard (or entrywise) product of two matrices. As is a positive semidefinite matrix, is a positive semidefinite matrix by the Schur product theorem. The spectral condition in Definition 1.4 translates to the following spectral condition for the squared Gram matrix in the frame scaling case.
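The positive semidefiniteness of the squared Gram matrix can be sanity-checked numerically. The sketch below works with real vectors, for which the squared absolute value of an inner product is simply its square, and evaluates the quadratic form at random test vectors; the frame and the function names are hypothetical, and a numerical check of this kind is of course not a proof.

```python
import random

def squared_gram(vectors):
    """Entrywise squared Gram matrix: entry (i, j) is <u_i, u_j>^2 for real vectors."""
    return [[sum(a * b for a, b in zip(u, v)) ** 2 for v in vectors]
            for u in vectors]

def quadratic_form(G, x):
    """Evaluate x^T G x."""
    n = len(x)
    return sum(x[i] * G[i][j] * x[j] for i in range(n) for j in range(n))

rng = random.Random(1)
frame = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(5)]  # 5 vectors in R^3
G = squared_gram(frame)

# x^T G x equals the squared Frobenius norm of sum_i x_i u_i u_i^T, so it is non-negative.
min_form = min(quadratic_form(G, [rng.gauss(0.0, 1.0) for _ in range(5)])
               for _ in range(200))
print(min_form)
```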
Lemma 4.14.
Using the reduction from Lemma 4.12, the spectral condition for operator scaling for in Definition 1.4 becomes
[TABLE]
where is the second largest eigenvalue of .
Proof.
Since each has only one non-zero column, each has only one non-zero column which is . The matrix has only non-zero columns . Hence, has only a non-zero submatrix, where the -th entry is . So, the non-zero submatrix of is exactly . Therefore, and the spectral condition is equivalent to as and in the reduction from Lemma 4.12. ∎
4.2.3 Random Frames
In Section 5, we will prove that if we generate random unit vectors, then the resulting frame is -nearly doubly balanced for and the in Lemma 4.14 satisfies with high probability. Hence, a random frame generated in this way will satisfy the condition and our results apply to these random frames. The proof is by a trace method. We believe that the trace method can be improved to prove that generating random unit vectors will satisfy our condition.
4.2.4 The Paulsen Problem in Random Frames
Given an -nearly doubly balanced frame with size where each , the Paulsen problem asks to find a doubly balanced frame that is “close” to . Given two frames , the squared distance between them is defined as . It was an open question whether for every -nearly doubly balanced frame with , there is always a doubly balanced frame with bounded by a function only dependent on and but independent of . Recently, this question was answered affirmatively in [45], showing that for any -nearly doubly balanced frame with , there is always a doubly balanced frame with . Very recently, Hamilton and Moitra [32] proved a stronger bound with a much simpler proof. On the other hand, there are examples showing that the best bound is at least , so the upper bound and the lower bound are within a factor of .
The Paulsen problem was asked because it is difficult to generate doubly balanced frames and easier to generate nearly doubly balanced frames, but actually not many ways are known to even generate -nearly doubly balanced frames for small . Most nearly doubly balanced frames that we know are random frames (e.g. random Gaussian vectors, random unit vectors), which can be shown to be -nearly doubly balanced for small by matrix concentration inequalities (see Section 5.1). So, for the Paulsen problem, the inputs of interest are random frames.
We will prove that for a random frame with that is -nearly doubly balanced, there is a doubly balanced frame with with high probability, which is much smaller than the worst case bound. We will also show how this result can be used to generate a frame in which every pair of vectors has small inner product in the next subsubsection.
The proof has two steps. The first step is to show that if we generate random unit vectors, then the resulting frame is -nearly doubly balanced for and also satisfies the spectral gap condition in Lemma 4.14 with . Therefore, the assumption in Theorem 1.5 is satisfied and the continuous operator scaling algorithm has linear convergence. The second step is to show that if the continuous operator scaling algorithm has linear convergence, then the “total movement” to a doubly balanced frame is .
The first step will be proved in Section 5. We will prove the second step here. The following lemma states the result in [45] that we will use.
Lemma 4.15 (Theorem 3.3.5, Lemma 3.3.1, Lemma 3.4.3 in [45]).
The dynamical system in Definition 2.16 will move the input operator to a doubly balanced operator . For any time ,
[TABLE]
The second step actually holds in the more general operator setting, not just in the frame setting.
Lemma 4.16.
Given an operator where with for , if is -nearly doubly balanced and satisfies the -spectral gap condition in Definition 1.4 with for a sufficiently large constant , then
[TABLE]
Proof.
Given the assumptions, Theorem 3.21 implies that
[TABLE]
By Lemma 4.15 and the above inequality, for any ,
[TABLE]
where the last inequality is by Lemma 2.15. ∎
Combining the two steps gives the following theorem.
Theorem 4.17.
Let be a random frame with , where each is an independent random vector with . Then, with probability at least , there is a doubly balanced frame with if is -nearly doubly balanced.
Proof.
By Theorem 5.1, the random frame satisfies the spectral gap condition in Lemma 4.14 with constant and with probability at least . Note that Theorem 5.1 is stated when each but it is easy to see that the nearly doubly balanced condition and the spectral gap condition are unchanged upon scaling the vectors to for . By the reduction in Lemma 4.12 and the spectral gap condition in Lemma 4.14, this implies that the condition for operator scaling is satisfied and also . Therefore, by Lemma 4.16, the continuous operator scaling algorithm will move to a doubly balanced frame with . ∎
4.2.5 Constructing Frames with Small Inner Products
The original motivation for the Paulsen problem was to construct doubly balanced frames with some additional structure.
Definition 4.18.
A frame is equiangular if is the same for all .
For , finding a doubly balanced frame that is also equiangular will have implications for certain informationally complete quantum measurement operators. It is a major open problem in frame theory for which pairs such frames exist [57]. The known examples are sporadic and based on group/number-theoretic constructions. We consider a related but more relaxed problem.
Definition 4.19.
A doubly balanced frame is Grassmannian if its angle
[TABLE]
is minimized over all possible doubly balanced frames.
Doubly balanced frames with small angle are useful in constructing erasure codes [36, 56]. The original motivation of the Paulsen problem was to begin with some -nearly doubly balanced frame that has small , and see if it could be “rounded” to a nearby doubly balanced frame still having small . Bounding is one way to achieve this goal.
In this section, we use the results in the spectral analysis to construct a doubly balanced frame with small angle. The idea is to start with a random frame which is -nearly doubly balanced for small and has small with high probability, and then use the results in spectral analysis to show that we can scale to a doubly balanced frame with .
Theorem 4.20.
For any , there exists a doubly balanced frame where each with and
[TABLE]
Proof.
First, we generate a random frame where each is an independent random unit vector with . By Lemma 5.3 and Theorem 5.1, is -nearly doubly balanced for and satisfies the -spectral gap condition with with probability at least . Next, we bound using the following fact.
Fact 4.21 ([34]).
Let be a fixed unit vector. For a random unit vector ,
[TABLE]
Choosing a large enough upper bound and applying the union bound, it follows from the above fact and rotational invariance that
[TABLE]
By Theorem 3.21 and the reduction in Lemma 4.12, there is a left scaling matrix and a right diagonal scaling matrix such that if we set , then the frame is doubly balanced. By Theorem 3.22, the scaling solutions satisfy
[TABLE]
Using the arguments as in Lemma 3.20 (or Lemma B.17), we have
[TABLE]
Therefore, we conclude that
[TABLE]
∎
For example, when the above theorem gives , and when the above theorem gives .
4.3 Operator Scaling
The operator scaling problem has been used to compute the Brascamp-Lieb constant [21] and the non-commutative rank of a symbolic matrix [20]. It is also used in [1] to solve the orbit intersection problem for the left-right group action.
4.3.1 Brascamp-Lieb Constants
A Brascamp-Lieb datum is specified by an -tuple of linear transformations and an -tuple of exponents . The Brascamp-Lieb constant of this datum is defined as the smallest such that for every -tuple of non-negative integrable functions, we have
[TABLE]
For this inequality to be scale invariant in , we must have . This is a common generalization of many useful inequalities; see [8, 21].
The important point we need is that the optimizers of any non-degenerate Brascamp-Lieb datum (i.e. the functions for which the inequality is tight) are density functions of appropriately centered Gaussians [46], and this implies that the Brascamp-Lieb constant can be written as the following optimization problem:
[TABLE]
which is closely related to the capacity of an operator.
A BL-datum is called geometric if we have:
[TABLE]
It is proved in [4, 5] that the BL-constant is one when the BL-datum is geometric. We will show that the BL-constant is small when the BL-datum is nearly geometric and satisfies a spectral condition, using the reduction in [21] from BL-constant to operator capacity and our capacity lower bound in Theorem 1.8.
Reduction: We describe the reduction in [21] from computing the BL-constant of a datum to computing the capacity of an operator. Let be rational numbers where and are integers. Given a BL-datum , a completely positive map is constructed as follows. For intuition, think of the “intended” input matrix to as a block diagonal matrix, with blocks of for , so that is a square matrix with dimension . For each in , we create matrices in , where each has a copy of that acts only on the -th principal block of (i.e. the -th copy of in ) and all other entries of are zero. The operator is defined by the Kraus operators , with the completely positive map
[TABLE]
where is the -th principal block of as described above, and the notation denotes the direct sum of the matrices (i.e. putting each matrix in a diagonal block).
Theorem 4.22 ([21]).
It follows from the reduction that
[TABLE]
Using this connection, it is shown in [21] that the Brascamp-Lieb constant can be computed by an operator scaling algorithm for .
Bounding BL-constants: Using Theorem 4.22, we would like to derive upper bounds on BL-constants using the capacity lower bound in Theorem 1.8, and show that for some random instances the BL-constant is small. To apply Theorem 1.8, we translate the definitions of -nearly doubly balanced operator and the -spectral gap conditions to the Brascamp-Lieb setting. Following the reduction from to , we have the following definitions from the corresponding definitions of the operator .
Definition 4.23 (Size of a Datum).
The size of a BL-datum is
[TABLE]
The datum is -nearly geometric if and only if the corresponding operator is -nearly doubly balanced.
Definition 4.24 (Nearly Geometric Datum).
A datum is -nearly geometric if
[TABLE]
The datum satisfies the -spectral gap condition if and only if the corresponding operator satisfies the -spectral gap condition.
Definition 4.25 (Spectral Gap of Datum).
Let and be the matrix
[TABLE]
Let be with all but the -th block zeroed out, i.e. . The natural matrix representation of the datum is defined as
[TABLE]
The datum is said to have a -spectral gap if
[TABLE]
With these definitions, we can state the Brascamp-Lieb constant upper bound that follows from the capacity lower bound in Theorem 1.8.
Corollary 4.26.
Given a datum with for and , if is -nearly geometric and satisfies the -spectral gap condition with for some sufficiently large constant , then
[TABLE]
Let’s consider a concrete example to demonstrate the corollary.
Example 4.27.
An interesting special case of the Brascamp-Lieb inequality is the rank one case where and and for which was studied in [5]. Consider a random rank-one datum where each is an independent random unit vector of . Following the reduction,
[TABLE]
which is a form that is also studied in approximation algorithms [50]. Note that this is exactly the capacity of a frame through the reduction in Lemma 4.12. By Theorem 5.1, if , then is -nearly doubly balanced for and satisfies the -spectral gap condition with with high probability. Therefore, we can apply Theorem 1.8 to conclude that
[TABLE]
and from Corollary 4.26 the BL-constant for this datum is
[TABLE]
This is independent of the number of vectors and is much smaller than the worst case bound.
As another example, Hastings’ result [35] implies that a random operator where each is a random unitary has small Brascamp-Lieb constant with high probability.
4.3.2 Rank Non-Decreasing Operator
In [20, 19, 29], a polynomial time algorithm for computing the non-commutative rank of a symbolic matrix is designed using operator scaling. Given where each , let be the symbolic matrix defined by over non-commutative variables . The non-commutative rank - of is defined as the smallest such that where is of dimension and is of dimension with entries in the “free skew field” of (see [20, 19] for definitions). The algorithm in [20, 19, 29] is based on the following equivalent characterizations.
Theorem 4.28 ([20, 19, 29]).
Given where each , the following conditions are equivalent.
1. The symbolic matrix is singular, i.e. -.
2. has a shrunk subspace, i.e. there exist subspaces with such that for all .
3. The completely positive linear map is rank decreasing, i.e. there exists and .
The alternating scaling algorithm for operator scaling is used to check whether is rank non-decreasing. It is shown in [20, 19, 29] that is rank non-decreasing if and only if can be scaled to -nearly balanced for , and so a polynomial time algorithm for operator scaling can be used to compute the non-commutative rank of a symbolic matrix over the reals.
The shrunk subspace condition is closely related to the concept of a Hall-blocker in matching theory. In the matrix case, it is shown in Lemma 4.5 that a matrix satisfying the spectral condition is an almost regular bipartite expander graph, so there is no Hall-blocker and it always has a perfect matching as shown in Corollary 4.6. In the operator case, intuitively, the spectral condition is closely related to the notion of quantum expander (Section 2.1), and so there should be no Hall-blocker as well. Theorem 1.5 implies that this is the case.
Corollary 4.29.
Given an operator satisfying the conditions of Theorem 1.5, is rank non-decreasing and the corresponding symbolic matrix is non-singular over the reals.
This is a new sufficient condition for an operator to be rank non-decreasing. We remark that the assumption can be weakened to to get the same conclusion, but we omit the proof here.
4.3.3 The Operator Paulsen Problem
Given an -nearly doubly stochastic operator where each , the operator Paulsen problem asks to find a doubly stochastic operator where each with . In [45], it was proved that , and this result was used in [1] for the orbit intersection problem. For an operator that satisfies the spectral gap condition with constant , Lemma 4.16 implies a much stronger bound that .
5 Spectral Gap of Random Frames
In this section, we prove that a random frame is -nearly doubly stochastic for and satisfies the spectral gap condition for constant with high probability.
Theorem 5.1.
If we generate random unit vectors in with , then the resulting frame is -nearly doubly stochastic for and satisfies the spectral gap condition in Lemma 4.14 with constant with probability at least .
To generate a random unit vector , we set each coordinate of to be an independent random Gaussian variable for , and then we scale the vector to have norm one. The size of the frame is . By construction, the frame satisfies the equal norm condition.
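The sampling procedure described above can be sketched as follows. The dimensions, the seed, and the function names are illustrative assumptions; the variance of the Gaussian coordinates does not matter after normalization, so standard Gaussians are used. The sketch also forms the sum of the outer products of the vectors, which one would expect to be close to n/d times the identity for large n.

```python
import math
import random

def random_unit_vector(d, rng):
    """Sample a Gaussian vector in R^d and scale it to have norm one."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]

def frame_operator(frame):
    """Sum of the outer products u u^T over all vectors u in the frame."""
    d = len(frame[0])
    return [[sum(u[a] * u[b] for u in frame) for b in range(d)] for a in range(d)]

rng = random.Random(0)
n, d = 2000, 4
frame = [random_unit_vector(d, rng) for _ in range(n)]
S = frame_operator(frame)

# Largest entrywise deviation of S from (n / d) * identity.
max_dev = max(abs(S[a][b] - (n / d if a == b else 0.0))
              for a in range(d) for b in range(d))
print(max_dev, n / d)
```

The trace of the sum of outer products equals n exactly (up to rounding), since every vector has norm one; this is the equal norm condition together with the size of the frame.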
In Section 5.1, we will prove that is -nearly doubly stochastic with high probability by using a standard matrix concentration bound. Then, in Section 5.2, we will prove that the squared Gram matrix in Definition 4.13 satisfies the spectral gap condition in Lemma 4.14 with high probability by using the trace method.
5.1 Nearly Doubly Balanced Condition by Matrix Concentration
By construction, each vector has and . So, for the nearly doubly stochastic condition, it remains to prove that is -nearly Parseval for with high probability when , i.e.
[TABLE]
We establish this by using the following matrix Bernstein bound.
Theorem 5.2 (Matrix Bernstein [60]).
Let be independent random matrices in . Assume that, for ,
[TABLE]
and
[TABLE]
Then, for all ,
[TABLE]
Lemma 5.3.
If we generate random unit vectors in with , then
[TABLE]
with probability at least .
Proof.
To apply the matrix Bernstein bound, we consider the random matrix for . We check the assumptions in Theorem 5.2. First, as the covariance matrix of a Gaussian vector is a scaled identity matrix and we scale it so that , we have
[TABLE]
Second, as each is of rank one, the operator norm of is achieved at and
[TABLE]
Finally, as each is Hermitian,
[TABLE]
and thus
[TABLE]
Therefore, we can bound the probability that the -Parseval condition is not satisfied by Theorem 5.2 with and , which gives
[TABLE]
Therefore, for , by setting , this failure probability is at most inverse polynomial in . ∎
For our condition to be satisfied, it is sufficient for that we will show and , and Lemma 5.3 gives the following bound for the latter condition.
Corollary 5.4.
If we generate random unit vectors in with , then
[TABLE]
for with probability at least .
5.2 Spectral Gap Condition by Trace Method
Our goal is to prove that
[TABLE]
when we generate independent random unit vectors .
5.2.1 Trace Method
As in most results from random matrix theory, we use the trace method to bound .
Lemma 5.5.
For any natural number ,
[TABLE]
Proof.
Recall that is positive semidefinite from Section 4.2.2. Since all the eigenvalues of are non-negative, for any natural number , and thus . We bound the failure probability by applying Markov’s inequality to the -th moment of so that
[TABLE]
We lower bound the term by using the test vector so that
[TABLE]
where the second inequality is by Jensen’s inequality on the convex function for integer . Note that
[TABLE]
where the last equality follows from the independence of and for so that
[TABLE]
Substituting the value of gives , and thus
[TABLE]
∎
5.2.2 Expanding the Trace
To use the bound in Lemma 5.5, we need to compute . We expand the trace of as
[TABLE]
where the sum runs over all possible length words with letters in with . We interpret each term in the summation as a length closed walk in the complete graph of vertices, where are the vertices in the closed walk.
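The closed-walk interpretation of the trace of a matrix power can be verified on a small example. The sketch below compares the trace computed by repeated matrix multiplication with the sum over all closed walks (words whose last letter is followed by the first one again). The test matrix and the function names are hypothetical, and the identity holds for any square matrix.

```python
from itertools import product

def mat_power_trace(G, k):
    """Trace of G^k computed by repeated matrix multiplication."""
    n = len(G)
    P = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        P = [[sum(P[i][t] * G[t][j] for t in range(n)) for j in range(n)]
             for i in range(n)]
    return sum(P[i][i] for i in range(n))

def closed_walk_sum(G, k):
    """Sum over all length-k closed walks of the product of the entries along the walk."""
    n = len(G)
    total = 0.0
    for walk in product(range(n), repeat=k):
        term = 1.0
        for t in range(k):
            term *= G[walk[t]][walk[(t + 1) % k]]  # the walk returns to its start
        total += term
    return total

G = [[1.0, 0.2, 0.5],
     [0.2, 1.0, 0.1],
     [0.5, 0.1, 1.0]]
print(mat_power_trace(G, 4), closed_walk_sum(G, 4))  # the two values agree up to rounding
```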
Let be an arbitrary orthonormal basis of . To analyze the trace, we write as a linear combination of the basis vectors, and
[TABLE]
Expanding each term in the product this way and then expanding the product further, we can write
[TABLE]
We interpret each and as a color on the edge for . So, in this interpretation, the trace sums over all possible closed walks on the complete graph of vertices, and all pairs of edge -colorings of the edges in the closed walk.
To calculate the expected value of the product terms in (5.2), we group the terms based on the vertices involved and use the following basic building block. The proof of the following lemma uses the normalization technique in the proof that in Ball’s survey [4], where denotes the unit sphere in .
Lemma 5.6.
Let with . Then
[TABLE]
where for an odd number .
Proof.
Let be a random Gaussian vector where each coordinate is an independent Gaussian variable . We will compute in two ways to prove the lemma. On one hand,
[TABLE]
where the second equality follows from the formula for the even moments of a standard Gaussian variable (e.g. from Wikipedia). On the other hand, we can compute the same quantity by a change of variables to polar coordinates. Using that the density function of is ,
[TABLE]
where the factor appears in the second equality because the sphere of radius has area times that of , and the last equality follows by a change of variable and so that
[TABLE]
where the last equality follows from the definition of the Gamma function that . By combining the two equalities for and using the fact that (e.g. from Wikipedia), we have
[TABLE]
Using the fact that and thus , it implies that
[TABLE]
and the lemma follows. ∎
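The two-way computation in this proof can be mirrored in exact arithmetic. The sketch below assumes the standard facts that a Gaussian vector factors as an independent norm times a uniform unit vector, that the even moments of a standard Gaussian are double factorials, and that the moments of a chi-square variable with `d` degrees of freedom are `d (d+2) ... (d+2a-2)`. The function names and the argument `half_exponents` are hypothetical; the formula is the classical one and is not copied from the lemma statement, which is not reproduced here.

```python
from fractions import Fraction


def double_factorial_odd(m):
    """Return m!! = m (m-2) ... 3 * 1 for odd m, with (-1)!! = 1."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result


def sphere_moment(d, half_exponents):
    """E[prod_i u_i^(2 a_i)] for u uniform on the unit sphere in R^d.

    half_exponents lists the a_i for distinct coordinates (at most d of them).
    Derived as E[prod g_i^(2 a_i)] / E[|g|^(2a)] with g standard Gaussian.
    """
    a = sum(half_exponents)
    numerator = 1
    for a_i in half_exponents:
        numerator *= double_factorial_odd(2 * a_i - 1)  # Gaussian even moment
    denominator = 1
    for t in range(a):
        denominator *= d + 2 * t  # chi-square moment d (d+2) ... (d+2a-2)
    return Fraction(numerator, denominator)


d = 5
print(sphere_moment(d, [1]), sphere_moment(d, [2]), sphere_moment(d, [1, 1]))
```

A quick sanity check is that the fourth moments must be consistent with the vector having unit length: `d * sphere_moment(d, [2]) + d * (d - 1) * sphere_moment(d, [1, 1])` equals one.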
By taking the expectation of (5.1),
[TABLE]
For each closed -walk , we need to compute the expectation of the product term. For some specific closed walks, it is easier to compute the expectation of the product term. In the next two subsubsections, we show how to compute the product terms when the closed -walk forms a tree or a cycle.
5.2.3 Tree Walk
The first simplification is that if there is any self-loop (i.e. ), then we can just remove the term from the product as by our construction.
The next simplification is that if the closed -walk looks like a tree, i.e. the edges form a tree when self-loops are removed and parallel edges are identified to a single edge, then the terms corresponding to each edge in the tree can be computed independently using Lemma 5.6. This is because all non-neighbors in the tree are conditionally independent, and so we can iteratively fix all non-leaf vertices and compute the leaves independently.
Lemma 5.7.
Let be the graph formed by the edges in a closed -walk. Suppose is a tree when self-loops are removed and parallel edges are identified to a single edge. For each edge , let be the number of parallel edges of in . Then,
[TABLE]
where denotes in Lemma 5.6.
Proof.
We prove this by induction on . When , the statement follows from the rotational invariance of the distribution, so that where is the first vector in the orthonormal basis .
For the inductive step, let be the set of the leaves of the tree and be the set of leaf edges in . By conditional expectation and independence of ,
[TABLE]
Since , we can apply the induction hypothesis to obtain that the second term is equal to . For the first term, note that each non-leaf vertex is fixed in the conditional expectation, and so by rotational invariance of the distribution and independence of ,
[TABLE]
where is the vector with the first entry one and other entries zero. The lemma follows by combining the two terms. ∎
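The rotational-invariance step used in this proof can be illustrated by simulation. The following Monte Carlo sketch assumes the random unit vectors are uniform on the sphere (generated here by normalizing Gaussian vectors, one standard way to do so). It compares a moment of the inner product of two independent random unit vectors with the same moment of the first coordinate of a single random unit vector. The function names, sample sizes, and seeds are illustrative choices.

```python
import math
import random


def random_unit_vector(d, rng):
    """Sample a uniform unit vector in R^d by normalizing a Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]


def estimate_inner_product_moment(d, power, samples, seed=0):
    """Monte Carlo estimate of E[<u, v>^power] for independent unit vectors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        u = random_unit_vector(d, rng)
        v = random_unit_vector(d, rng)
        total += sum(x * y for x, y in zip(u, v)) ** power
    return total / samples


def estimate_first_coordinate_moment(d, power, samples, seed=1):
    """Monte Carlo estimate of E[u_1^power] for a single random unit vector."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        total += random_unit_vector(d, rng)[0] ** power
    return total / samples


# For power = 2 both quantities equal 1/d exactly; the estimates should be close.
print(estimate_inner_product_moment(4, 2, 20000),
      estimate_first_coordinate_moment(4, 2, 20000))
```

Both estimates concentrate around `1/d`, consistent with replacing one of the two vectors by a fixed basis vector.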
5.2.4 Cycle Walk
We can also compute the expectation of a product term in (5.3) when the closed -walk is a simple cycle, i.e. the edges form a cycle and the vertices are distinct.
Lemma 5.8.
Suppose the edges form a simple cycle. Then
[TABLE]
Proof.
We use the expansion in (5.2) that
[TABLE]
where is an orthonormal basis in .
Since is symmetric across and half space, if any term appears in a product term on the right hand side with odd degree, then that product term is equal to zero. So, we only focus on those product terms where each has even degree. Since the edges form a simple cycle, each vertex is involved in exactly four terms . We consider two cases of the -edge-colorings and .
The first case is when . Then, for and to have even degree, we must have . The same argument applies to every vertex, and thus we must have for , i.e. the same two colors appear in every edge in the simple cycle. There are two possibilities for each edge, either or . So, for each two colors, there are exactly such product terms. For each such product term, there are two colors that appear twice on each vertex, and so each such product term is exactly , where is the vector with the first two entries one and other entries zero. Therefore, the total contribution of these product terms is
[TABLE]
The second case is when . Then, for the terms in to have even degree, we must have , which could be the same color as or a different color. The same argument applies to every vertex, and thus we must have for , and so we can think of every edge in the cycle as receiving one color from colors. For each coloring, let be the number of vertices with two different colors of degree two (and so is the number of vertices with one color of degree four), then its contribution to the sum is
[TABLE]
To count the number of such colorings, we use the following fact.
Fact 5.9.
The number of proper -colorings of an -cycle is , where adjacent vertices receive different colors in a proper coloring. Since the line graph of an -cycle is also an -cycle, the number of proper -edge-colorings of an -cycle is also .
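As a sanity check on this counting fact, the sketch below compares a brute-force count of proper colorings of a cycle with the classical chromatic-polynomial value `(q-1)^n + (-1)^n (q-1)`. The formula is the standard one for cycles and is assumed here to be the one intended in the fact; the names `n` and `q` are my own notation.

```python
from itertools import product


def count_proper_cycle_colorings(n, q):
    """Brute-force count of colorings of an n-cycle with adjacent vertices differing."""
    count = 0
    for coloring in product(range(q), repeat=n):
        if all(coloring[i] != coloring[(i + 1) % n] for i in range(n)):
            count += 1
    return count


def cycle_coloring_formula(n, q):
    """Classical closed form for the number of proper q-colorings of an n-cycle."""
    return (q - 1) ** n + (-1) ** n * (q - 1)


for n in range(3, 7):
    for q in range(2, 5):
        print(n, q, count_proper_cycle_colorings(n, q), cycle_coloring_formula(n, q))
```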
We would like to count the -edge-colorings of a -cycle with vertices with different colors on its two edges and vertices with the same color on its two edges. Notice that once we fix the location of the vertices with different colors, then the edges between any two such vertices must have the same color, and so we can think of the -cycle as an -cycle and each such coloring corresponds to a proper -edge-coloring of an -cycle. By enumerating the location of the vertices and using Fact 5.9, the number of such -edge-colorings is . Therefore, the total contribution of the second case is equal to
[TABLE]
where the last equality is by the binomial theorem. Combining the two cases,
[TABLE]
∎
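The counting argument in the second case of the proof can be verified by enumeration. The sketch below groups all edge-colorings of a cycle by the number of vertices whose two incident edges have different colors, and compares each group size with a binomial coefficient times the classical count of proper colorings of a shorter cycle. The symbols `k` (cycle length), `q` (number of colors), and `j` (number of such vertices) are hypothetical names chosen for this illustration.

```python
from itertools import product
from math import comb


def count_by_change_vertices(k, q):
    """counts[j] = number of q-edge-colorings of a k-cycle with exactly j
    vertices whose two incident edges receive different colors."""
    counts = [0] * (k + 1)
    for colors in product(range(q), repeat=k):
        # Vertex i sits between edge i-1 and edge i (indices taken cyclically).
        changes = sum(colors[i] != colors[i - 1] for i in range(k))
        counts[changes] += 1
    return counts


def predicted_count(k, q, j):
    """Choose the j change vertices, then properly color the resulting j-cycle."""
    return comb(k, j) * ((q - 1) ** j + (-1) ** j * (q - 1))


k, q = 5, 3
print(count_by_change_vertices(k, q))
print([predicted_count(k, q, j) for j in range(k + 1)])
```

Summing the predicted counts over `j` gives `q ** k` by the binomial theorem, matching the total number of edge-colorings.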
5.2.5 Fourth Moment Analysis
We can use Lemma 5.7 and Lemma 5.8 to compute .
Lemma 5.10.
[TABLE]
Proof.
To compute , we only need to consider closed -walks . We do a case analysis on the possible configurations of closed -walks.
1. There are four self loops, i.e. , in which case the contribution is simply one as the vectors are of length one by construction. There are total possibilities for the location of the self-loops, and so the total contribution in this case is .
2. There are two self loops and a single edge traversed two times. By Lemma 5.7, this graph contributes . There are places to add two self-loops to a single edge and possibilities for the two vertices of the edge, so the total contribution in this case is
[TABLE]
3. The only other case with two distinct vertices is that an edge is traversed four times, and its contribution is by Lemma 5.7. There are for the location of the two vertices, and the total contribution in this case is
[TABLE]
4. There is one self loop and a -cycle. This graph contributes the same as a -cycle which is given by Lemma 5.8. There are places to add the self-loop and possibilities for the three vertices of the triangle, so the total contribution in this case is
[TABLE]
5. The only other case with three distinct vertices is two different edges sharing a single common vertex. By Lemma 5.7, this graph contributes . Note that there are two ways to combine, as the two edges could share the starting vertex or the middle vertex. There are for the locations of the three vertices, and so the total contribution is
[TABLE]
6. Finally, the only case with four distinct vertices is a -cycle. There are possibilities for the locations of the four vertices, and by Lemma 5.8 the total contribution is
[TABLE]
Combining all the cases,
[TABLE]
Taking the factor out proves the lemma. ∎
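The six configurations in this case analysis can be recovered by enumeration. The sketch below lists all closed walks of length four on a complete graph with self-loops and groups them by the number of distinct vertices and the number of self-loops. The vertex count `n` is my own notation, and the counts are stated for ordered choices of vertices, which may be normalized differently from the counts in the proof.

```python
from collections import Counter
from itertools import product


def classify_closed_4_walks(n):
    """Count closed 4-walks (i1, i2, i3, i4) by (distinct vertices, self-loops)."""
    shapes = Counter()
    for walk in product(range(n), repeat=4):
        distinct = len(set(walk))
        loops = sum(walk[i] == walk[(i + 1) % 4] for i in range(4))
        shapes[(distinct, loops)] += 1
    return shapes


def expected_shapes(n):
    """Counts derived by hand from the six cases, for ordered vertex choices."""
    two = n * (n - 1)
    three = two * (n - 2)
    four = three * (n - 3)
    return {
        (1, 4): n,          # four self-loops
        (2, 2): 6 * two,    # two self-loops plus an edge traversed twice
        (2, 0): two,        # a single edge traversed four times
        (3, 1): 4 * three,  # one self-loop plus a triangle
        (3, 0): 2 * three,  # two edges sharing a common vertex
        (4, 0): four,       # a 4-cycle on distinct vertices
    }


print(dict(classify_closed_4_walks(5)))
print(expected_shapes(5))
```

No other combination of distinct vertices and self-loops occurs, so the six cases are exhaustive and their counts sum to `n ** 4`.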
5.2.6 Proof of Theorem 5.1
We wrap up the fourth moment analysis to prove Theorem 5.1. Using Lemma 5.10 in Lemma 5.5, we have
[TABLE]
where we used .
For any constant , by generating random unit vectors, the probability that is at most where the dominating term is .
Also, by Corollary 5.4, by generating random unit vectors, the resulting frame is -nearly doubly stochastic with failure probability at most inverse polynomial in .
Therefore, by generating random unit vectors, with probability at least , the resulting frame is -nearly doubly stochastic for and for any constant . This proves Theorem 5.1.
Remark 5.11.
We believe that the trace method can be improved to prove the same conclusion with only random unit vectors.
Acknowledgement
We thank John Watrous for providing a proof of Lemma 3.6, and Nick Harvey for providing useful comments that improved the presentation of the paper.
Appendix A Operator Scaling
The following is a proof that the continuous operator scaling algorithm is equivalent to the gradient flow that always moves in the direction of minimizing .
Lemma A.1.
Given an operator where for , the direction defined by
[TABLE]
minimizes the function
[TABLE]
Proof.
As in Definition 2.14, we write
[TABLE]
Then
[TABLE]
Consider the directional derivative of at the direction of where each . For ease of notation, we write , and in the following, with the understanding that these are dependent on and we are moving in the direction .
[TABLE]
where the third inequality uses the fact that and as stated in Definition 2.14. It follows that the direction minimizes . ∎
The following is an alternative proof of Lemma 3.6 provided by John Watrous.
Lemma A.2 (Watrous, personal communication).
If is an -nearly doubly balanced operator, then the largest singular value of its matrix representation in Definition 1.4 is
[TABLE]
Proof.
The proof is a generalization of the proof of Theorem 4.27 in [61]. As stated in Definition 2.6,
[TABLE]
where is as defined in (2.1).
First, we bound the maximum for Hermitian matrix . Let be an eigenvalue decomposition of . Let
[TABLE]
Then, by Cauchy-Schwarz inequality and Hölder’s inequality for Schatten norms for matrices,
[TABLE]
Since is a positive map, by Fact 2.9(2). It follows that the trace norm of is simply the trace of , and so
[TABLE]
where the third equality is by Fact 2.9(3) and the last inequality follows from the assumption that is -nearly doubly balanced. Therefore,
[TABLE]
where the second inequality is from the assumption that is -nearly doubly balanced.
For the non-Hermitian case, we use a standard reduction and write where and are Hermitian matrices. Note that . As is necessarily Hermitian-preserving, we also have . Therefore, as and are Hermitian,
[TABLE]
∎
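The standard reduction used in the non-Hermitian case can be made concrete. The sketch below splits a square complex matrix into a Hermitian part and an anti-Hermitian part divided by the imaginary unit, so that the matrix equals the first part plus the imaginary unit times the second, with both parts Hermitian. The names `x`, `y`, `z` are hypothetical and need not match the symbols in the proof.

```python
def conjugate_transpose(x):
    """Return the conjugate transpose of a matrix given as a list of rows."""
    rows, cols = len(x), len(x[0])
    return [[x[i][j].conjugate() for i in range(rows)] for j in range(cols)]


def hermitian_parts(x):
    """Return (y, z) with x = y + 1j * z and both y and z Hermitian."""
    xs = conjugate_transpose(x)
    n = len(x)
    y = [[(x[i][j] + xs[i][j]) / 2 for j in range(n)] for i in range(n)]
    z = [[(x[i][j] - xs[i][j]) / 2j for j in range(n)] for i in range(n)]
    return y, z


def is_hermitian(a, tol=1e-12):
    """Check that a matrix equals its conjugate transpose up to tol."""
    n = len(a)
    return all(abs(a[i][j] - a[j][i].conjugate()) <= tol
               for i in range(n) for j in range(n))


# Illustrative non-Hermitian input.
X = [[1 + 2j, 3 - 1j], [0.5j, -2 + 0j]]
Y, Z = hermitian_parts(X)
print(is_hermitian(Y), is_hermitian(Z))
```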
Appendix B Matrix Scaling
The aim of this section is to provide a self-contained proof of the linear convergence result in the simpler setting of matrix scaling. It can be read as an exposition of the main ideas in Section 3.
In the matrix scaling problem, we are given a non-negative matrix , and the goal is to find a left diagonal scaling matrix and a right diagonal scaling matrix such that is doubly balanced, or report that such scaling matrices do not exist.
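To make the problem statement concrete, the sketch below uses the classical Sinkhorn alternating normalization. This is a different and simpler procedure than the gradient flow analyzed in this appendix, and it is shown only to illustrate what the left and right diagonal scalings look like. It assumes a square matrix with strictly positive entries and targets row and column sums equal to one; the function names and the iteration count are illustrative choices.

```python
def sinkhorn_scale(a, iterations=500):
    """Alternately normalize rows and columns; return the diagonal scalings.

    Swapped-in classical technique (Sinkhorn iteration), not the paper's
    continuous algorithm. Assumes a is square with positive entries.
    """
    m, n = len(a), len(a[0])
    left = [1.0] * m
    right = [1.0] * n
    for _ in range(iterations):
        for i in range(m):
            left[i] = 1.0 / sum(a[i][j] * right[j] for j in range(n))
        for j in range(n):
            right[j] = 1.0 / sum(left[i] * a[i][j] for i in range(m))
    return left, right


def apply_scaling(a, left, right):
    """Return the matrix with entries left[i] * a[i][j] * right[j]."""
    return [[left[i] * a[i][j] * right[j] for j in range(len(a[0]))]
            for i in range(len(a))]


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 10.0]]
L_diag, R_diag = sinkhorn_scale(A)
S = apply_scaling(A, L_diag, R_diag)
row_sums = [sum(row) for row in S]
col_sums = [sum(S[i][j] for i in range(3)) for j in range(3)]
print(row_sums, col_sums)
```

For a matrix with zero entries a scaling may fail to exist, which is the "report that such scaling matrices do not exist" branch of the problem; the sketch does not handle that case.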
B.1 Definitions
In the following, we state the important definitions for the matrix scaling problem. Given a matrix , we define
[TABLE]
as the size, the -th row sum, and the -th column sum of the matrix .
A matrix is -nearly doubly balanced if
[TABLE]
for and , and is doubly balanced when .
The -error of is defined as
[TABLE]
The spectral condition is the same as defined in Lemma 4.3.
Definition B.1 (Spectral Gap Condition for Matrix).
A matrix satisfies the -spectral gap condition if
[TABLE]
B.2 Continuous Matrix Scaling
The matrix scaling problem is a special case of the operator scaling problem. Following the reduction in Section 4.1, given a non-negative matrix , we consider the matrix where the -th entry of is
[TABLE]
The continuous matrix scaling algorithm works on and is defined by the following differential equation:
[TABLE]
Many quantities change over time in the dynamical system. We use the superscript (t) to denote the quantity of interest at time . Given a non-negative matrix as the input of the matrix scaling problem, the matrix in (B.4) is the input of the continuous operator scaling algorithm at time , i.e. and . Then changes over time following (B.5) and is defined as the matrix with . The dynamical system stops when is doubly balanced. It is proved in [45] that .
We state some known results about the continuous matrix scaling algorithm for the analysis. First, the matrix at any time is a scaling of the original matrix in the following form.
Lemma B.2 (Lemma 4.2.10 in [45]).
At time , define and as
[TABLE]
Then .
In particular, if , then is doubly balanced, and and is a solution to the matrix scaling problem. This is how the continuous operator scaling algorithm finds a scaling solution.
From now on, the matrix of interest is and it evolves over time as changes in the dynamical system. For ease of notation, we will omit the matrix and sometimes also the superscript (t) on other quantities when they are clear from the context.
Lemma B.3 (Lemma 3.6.1 in [45]).
For an -nearly doubly balanced matrix ,
[TABLE]
Lemma B.4 (Lemma 4.2.8 in [45]).
For any time ,
[TABLE]
Lemma B.5 (Lemma 4.2.9 in [45]).
For any time ,
[TABLE]
Lemma B.6 (Proposition 4.3.1 in [45]).
Suppose there exists such that for all ,
[TABLE]
Then
[TABLE]
B.3 Overview
The proof overview is stated in Section 1.5.2 in the matrix scaling setting, so we do not repeat it here. It is easy to see from Lemma B.5 that
[TABLE]
The structure is the same as in Section 3 for the general operator setting. Our goal is to prove the following theorem.
Theorem B.7 (Linear Convergence).
Given a non-negative matrix with , if is -nearly doubly balanced and satisfies the -spectral gap condition in Definition B.1 with for a sufficiently large constant , then in the gradient flow,
[TABLE]
In particular, the gradient flow converges to a -nearly doubly balanced scaling in time , and such a scaling always exists under our assumptions.
B.4 Lower Bounding the Quadratic Terms
First, we prove a structural result bounding the maximum error of the rows and columns, which will also be useful in bounding the condition number of the scaling solution later. Then, we will use this structural result to lower bound the quadratic terms of .
Proposition B.8.
If is -nearly doubly balanced, then for any ,
[TABLE]
for and .
Proof.
We present a slightly informal proof, which can be made formal by using the envelope theorem stated in Theorem 3.3 as done in Proposition 3.2.
Let
[TABLE]
be the maximum violation of a row and a column at time . Note that as is -nearly doubly balanced. We would like to show that for almost every time ,
[TABLE]
This would imply the proposition as
[TABLE]
where the second last equality is by Lemma B.4.
To bound , we consider different cases of how the maximum of is achieved. Suppose the maximum of is achieved by column and is negative such that . The change of the -th column sum is
[TABLE]
where the last equality is by the definition of the dynamical system in (B.5), and the inequality is by our assumption that the maximum of is achieved by column so that and for all . It follows that
[TABLE]
where the first equality is by Lemma B.4.
Similarly, suppose the maximum of is achieved by column and is positive, we can show that
[TABLE]
By symmetry of rows and columns, we can prove the same bounds for the change of the violation of the -th row sum. Therefore, in all four cases, the change of the maximum violation is at most . Note that can be written as the maximum of functions, one for each row and one for each column. We can then use the envelope theorem in Theorem 3.3 as done in Proposition 3.2 to prove formally that to complete the proof.
(It is possible to prove the proposition for the matrix case without using the envelope theorem as is only the maximum of a finite number of functions, but in the operator case is the maximum quadratic form of infinitely many unit vectors and we don’t know of a proof without using the envelope theorem.) ∎
We have the following corollary about the row sums and the column sums by rewriting the conclusions of Proposition B.8.
Proposition B.9.
If is -nearly doubly balanced, then for any , for and ,
[TABLE]
We can use Proposition B.9 to lower bound the quadratic terms in (B.6).
Lemma B.10.
If is -nearly doubly balanced, then for any ,
[TABLE]
Proof.
Using Proposition B.9, the first term in (B.6) is
[TABLE]
Similarly, the second term in (B.6) is
[TABLE]
The lemma follows from in (B.3). ∎
B.5 Upper Bounding the Cross Term
We will first bound the largest singular value of the matrix for any -nearly doubly balanced matrix . Then, we will use a spectral argument to upper bound the absolute value of the cross term in (B.6).
Lemma B.11.
If is -nearly doubly balanced, then
[TABLE]
Proof.
We use the fact that the square of the largest singular value of a non-negative matrix is at most the maximum column sum times the maximum row sum (see e.g. page 223 of [38]). So,
[TABLE]
where the second inequality follows from the assumption that is -nearly doubly balanced. ∎
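The norm bound used in this proof can be checked on a small example. The sketch below estimates the square of the largest singular value of a non-negative matrix by power iteration on the Gram matrix (my choice of method, written in plain Python to stay self-contained) and compares it with the product of the maximum row sum and the maximum column sum. The example matrix is arbitrary.

```python
import math


def max_row_and_column_sums(a):
    """Return (maximum row sum, maximum column sum) of a matrix."""
    m, n = len(a), len(a[0])
    max_row = max(sum(row) for row in a)
    max_col = max(sum(a[i][j] for i in range(m)) for j in range(n))
    return max_row, max_col


def top_singular_value_squared(a, iterations=200):
    """Estimate sigma_1(a)^2 by power iteration on a^T a.

    The returned Rayleigh quotient never exceeds the true value.
    Assumes a is non-negative and not identically zero.
    """
    m, n = len(a), len(a[0])
    v = [1.0] * n
    for _ in range(iterations):
        av = [sum(a[i][j] * v[j] for j in range(n)) for i in range(m)]
        w = [sum(a[i][j] * av[i] for i in range(m)) for j in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    av = [sum(a[i][j] * v[j] for j in range(n)) for i in range(m)]
    return sum(x * x for x in av)  # |a v|^2 for a unit vector v


A = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]]
max_row, max_col = max_row_and_column_sums(A)
sigma_sq = top_singular_value_squared(A)
print(sigma_sq, max_row * max_col)
```

For this example the estimate is strictly below the product of the two maxima, as the cited fact predicts.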
Lemma B.11 implies that is an “approximate” first singular vector of . By the spectral gap condition in Definition B.1, it will follow that any vector perpendicular to has a “small” quadratic form, and this can be used to bound the cross term in (B.6). The following lemma summarizes the spectral argument, which is the same as Lemma 3.7. Since Lemma 3.7 has no operators involved, we refer to the proof in Section 3.2 and just restate the statement here for ease of reference.
Lemma B.12.
Let . Let and be unit vectors. Suppose the following assumptions hold:
[TABLE]
Then, for any unit vectors and , it holds that
We can use Lemma B.12 to bound the cross term in (B.6).
Lemma B.13.
If satisfies the spectral condition in Definition B.1 with the additional assumption that for , then
[TABLE]
Proof.
We apply Lemma B.12 with , and where
[TABLE]
Clearly, , , , are unit vectors, and and . We check the assumptions of Lemma B.12. By the additional assumption,
[TABLE]
and so we can set . Similarly, by the spectral gap condition in Definition B.1,
[TABLE]
and so we can set . Also, we check that
[TABLE]
Therefore, we can conclude from Lemma B.12 that
[TABLE]
which implies that
[TABLE]
where the last inequality follows from and and . ∎
B.6 Lower Bounding the Convergence Rate
Putting the bounds in Lemma B.10 and Lemma B.13 into (B.6), we obtain the following lower bound on the convergence rate of at any time .
Proposition B.14.
If is -nearly doubly balanced and satisfies the spectral conditions that
[TABLE]
for , then
[TABLE]
Note that Proposition B.14 implies that the dynamical system has linear convergence at time . To see this, note that by Lemma B.11, and from Definition B.1, and therefore
[TABLE]
Under our assumption that , the dynamical system has linear convergence at time with rate at least .
To prove that the dynamical system has linear convergence with rate for all time , we will prove that the quantities in Proposition B.14 do not change much when we move from to , i.e. , , and .
To bound the change of the singular values of , we will bound the condition number of the scaling solutions in the dynamical system in the next subsection, and then use these bounds to argue about the change of the singular values and establish Theorem B.7.
B.7 Condition Number
Recall from Lemma B.2 that where
[TABLE]
To bound the condition number of and , we bound the integrals in the exponent. To bound the integral, we divide the time into two phases. In the first phase, we use Proposition B.8 to argue that . In the second phase, we use that is converging linearly to argue that is converging linearly. In the following lemma, we should think of as the spectral gap parameter in Definition 1.4. The proof of the following lemma is almost identical to that in Lemma 3.16.
Lemma B.15.
Suppose there exists such that for all , it holds that
[TABLE]
If is -nearly doubly balanced for , then
[TABLE]
Proof.
To bound the condition number, we just need to bound for each as is a diagonal matrix. Using the form of described in Lemma B.2, we bound the absolute value of the integral
[TABLE]
We split the integral into two terms. For the first term, we use Proposition B.8 to bound
[TABLE]
where the second inequality is by the fact that is non-increasing from Lemma B.4. Applying Lemma B.6 with our assumption that , it follows that
[TABLE]
where the second inequality is by Lemma B.3, and the last inequality is by our assumption that .
For the second term,
[TABLE]
where the second inequality is from the inequality that from (B.3), and the third inequality follows from the assumption that is converging linearly with ; see Lemma B.6.
We choose
[TABLE]
This implies that
[TABLE]
and so the second term is at most . The first term is at most . Therefore, we conclude that
[TABLE]
∎
We cannot use the same argument to bound , as it will only give us a bound with dependency on (where we assumed ). Instead, we use the bound on to derive a similar bound on . The proof of the following lemma is simpler than that of Lemma 3.18 in the operator case.
Lemma B.16.
Suppose there exists such that for all , it holds that
[TABLE]
If is -nearly doubly balanced for , then and implies that
[TABLE]
Proof.
By Lemma B.15,
[TABLE]
To upper bound , we consider the column sum by summing the above inequality over to get
[TABLE]
This implies that
[TABLE]
where the second inequality is by Proposition B.9 and that is -nearly doubly balanced.
Similarly, we can lower bound
[TABLE]
where the last inequality uses the assumption that is converging linearly to apply Lemma B.6 with to obtain
[TABLE]
where we used Lemma B.3 and the assumption that . ∎
B.8 Invariance of Linear Convergence
We will first use Lemma B.15 and Lemma B.16 to bound the change of the singular values of . Then, we will combine the previous results to prove Theorem B.7 that is converging linearly for all .
Lemma B.17.
For any , suppose the diagonal matrices and satisfy and for some , then
[TABLE]
Proof.
We use Lemma 3.19 to bound the singular value change by the operator norm of the matrix change:
[TABLE]
We write and and , so that and by our assumptions. Then,
[TABLE]
where we used the triangle inequality and bound the sum of the eight operator norms, and used the fact that for each term, and used the assumption that so that each term is at most . ∎
We are ready to put together the results to prove the following theorem which implies Theorem B.7. The proof is almost the same as that of Theorem 3.21.
Theorem B.18.
If is -nearly doubly balanced and satisfies the -spectral gap condition in Definition B.1 with for a sufficiently large constant , then for all it holds that
[TABLE]
Proof.
Recall from Proposition B.14 the definitions of and , and by Lemma B.11 and from Definition B.1. Let be the supremum such that and . Our goal is to prove that is converging linearly for and is unbounded.
First, we show that is converging linearly for . By Proposition B.14,
[TABLE]
where in the second inequality we used that and for . Note that our assumption implies that for a sufficiently large constant as . Since from Lemma B.11, it follows that for any ,
[TABLE]
Next, we argue that the size condition and the spectral gap condition will still be maintained beyond time . For the size change, by Lemma B.6 with ,
[TABLE]
where the second inequality is by Lemma B.3 and the last inequality is by for a sufficiently large constant .
For the change of the second largest singular value, by definition,
[TABLE]
On the other hand, we can upper bound using condition numbers. Using Lemma B.15 with , and . Note that our assumption implies that
[TABLE]
where the implication is by the inequality for close to zero. Then, by Lemma B.16, we also have . Putting these bounds into of Lemma B.17, we obtain
[TABLE]
Combining the upper bound and lower bound and using from Lemma B.11, it follows that
[TABLE]
where the last inequality is by the assumption that .
For the change of the largest singular value, by Proposition B.9,
[TABLE]
where the first and last inequalities use that . The same holds for and these imply that is -nearly doubly balanced. By Lemma B.11, this implies that . Therefore,
[TABLE]
where the second last inequality uses that is a sufficiently large constant.
Since our dynamical system is continuous, we still have both conditions satisfied at time for some , which contradicts that is the supremum that both conditions are satisfied. Therefore, is unbounded and the linear convergence of is maintained throughout the execution of the dynamical system. ∎
References
- [1] Z. Allen-Zhu, A. Garg, Y. Li, R. Oliveira, A. Wigderson. Operator scaling via geodesically convex optimization, invariant theory and polynomial identity testing. In Proceedings of the 50th Annual ACM Symposium on Theory of Computing (STOC), 172–181, 2018.
- [2] Z. Allen-Zhu, Y. Li, R. Oliveira, A. Wigderson. Much faster algorithms for matrix scaling. In Proceedings of the 58th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2017.
- [3] M. F. Atiyah. Convexity and Commuting Hamiltonians. Bulletin of the London Mathematical Society, Vol 14, Issue 1, Jan 1982.
- [4] K. Ball. Volumes of sections of cubes and related problems. Geometric Aspects of Functional Analysis, 251–260, 1989.
- [5] F. Barthe. On a reverse form of the Brascamp-Lieb inequality. Inventiones mathematicae 134(2), 335–361, 1998.
- [6] A. Barvinok, A. Samorodnitsky. Computing the partition function for perfect matchings in a hypergraph. Combinatorics, Probability, and Computing, 20(6), 2011.
- [7] A. Ben-Aroya, O. Schwartz, A. Ta-Shma. Quantum expanders: motivation and construction. Theory of Computing 6, 47–79, 2010.
- [8] J. Bennett, A. Carbery, M. Christ, T. Tao. The Brascamp-Lieb inequalities: finiteness, structure, and extremals. GAFA Geom. funct. anal. (2008) 17: 1343.
